# <center>Unit 16</center>
# <center>File Formats and File I/O</center>
# <center>Asg 16.3 (Coding)</center>

In [None]:
# set up notebook to display multiple output in one cell

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

<div class="alert alert-block alert-warning">
    <b><font size="4">Files needed for this assignment:</font></b>
</div>

[cars.txt](https://drive.google.com/file/d/1mUl-lJNpim-Q9A0bZ6GZBlcQ69g9_mPD/view?usp=share_link)<br>
[cars.csv](https://drive.google.com/file/d/1bdQH9AqUwkMUtSrSHoV9Bx-5tR8PLJHA/view?usp=share_link)<br>
[cars.xlsx](https://docs.google.com/spreadsheets/d/1Mt0mTC7i6FjbVyxuQ85VYbot09pepDj8/edit?usp=share_link&ouid=117745432621363033141&rtpof=true&sd=true)<br>
[cars.json](https://drive.google.com/file/d/1Eri0QnR_zkMg19kAMZVYOB6AcheV6Nhz/view?usp=share_link)

<div class="alert alert-block alert-warning"><b>In this assignment you will read through the notebook and complete the exercises. Once you are satisfied with the results, submit the following to Google Classroom.<br>
    
1. .ipynb Jupyter Notebook file
2. .html file or PDF
3. cars_totals.xlsx file  
    
Your files should include all output, i.e. run each cell and save your file before submitting.</b></div>

<div class="alert alert-block alert-info">One aspect of data science is working with data from files. In this assignment we will learn to read in data from four different file types:
    
1. text file (using a for loop and .readlines)
2. csv file (using pandas)
3. excel file (using pandas)
4. JSON file (using json)
    
In the process we will be creating and manipulating Python lists. We will also see how data can be written to a new excel file. Later in the course we'll learn more about how to display this information neatly and manipulate the data more efficiently, but for now we start by learning the basics of reading and writing files.</div>

### Reading Text Files

You are given a file `cars.txt` that contains a sample of vehicle types with information about their performance. Each row in the text file holds six values (`Type`, `Year`, `MPG`, `CO2 Emissions`, `Weight`, `HP`) separated by tabs. Note that the first column, `Type`, contains both the car type and the number of cylinders.

`Sedan/Wagon/6	2018	28.2	315.6	4098	313
    CarSUV/6	2018	26.9	330	    4321	285
        Van/6	2018	26.9	329	    4635	282
    TruckSUV/6	2018	23.9	371.5	4760	297
    Pickup/6	2018	23.1	384.7	4791	305
Sedan/Wagon/8	2018	24.2	367.8	4345	477
    CarSUV/4	2018	30.2	295.1	3736	181
        Van/8	2018	15.4	577.1	6647	324
    TruckSUV/8	2018	20.8	428.2	6031	387
    Pickup/8	2018	20.9	424.6	5642	375`


In Python, the built-in `open` function takes the name of a text file in the current directory (or, more generally, a path to a text file in any directory on your computer) and returns what is known as a `file object`. This file object can be used to read from an existing text file, create and write to a new file, or append text to a pre-existing file. See the following documentation for more information:

__[Opening Files in Python](https://docs.python.org/3/library/functions.html#open)__

For example, 
```python
fileName = open('my_file.txt', 'r')
```

would open the file `my_file.txt` for reading (i.e. `mode='r'`) and return a corresponding file object, which is assigned to the variable `fileName`.

If the file cannot be opened for some reason (e.g. if the file doesn't exist in the current directory), then an error is generated. More specifically, an `Exception` object (a `FileNotFoundError` when the file is missing) is created and said to be "raised".
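
As a minimal sketch of what that looks like in practice, the cell below tries to open a file that should not exist and catches the resulting exception. The filename `no_such_file.txt` is hypothetical and chosen only for illustration.

```python
# Hypothetical filename, chosen only because it should not exist.
missing_name = 'no_such_file.txt'

try:
    file_object = open(missing_name, 'r')
except FileNotFoundError as error:
    # open() raises FileNotFoundError when the file cannot be found.
    file_object = None
    print(f'Could not open {missing_name}: {error}')
else:
    # This branch runs only when open() succeeded.
    file_object.close()
```

Catching the exception lets the program print a helpful message instead of stopping with a traceback.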

Run the following cell to read in `cars.txt` using the `open` function.

In [4]:
# Open text file
fileName = open('cars.txt', 'r')
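
The three uses of a file object mentioned earlier (reading, writing, appending) correspond to the modes `'r'`, `'w'` and `'a'`. Here is a small sketch of all three, using a `with` block so that each file is closed automatically. The scratch file name and its temporary folder are assumptions made for this example, so nothing in your working directory is touched.

```python
import os
import tempfile

# Hypothetical scratch file inside a temporary folder.
scratch_path = os.path.join(tempfile.mkdtemp(), 'scratch.txt')

# 'w' creates the file (or overwrites an existing one).
with open(scratch_path, 'w') as scratch:
    scratch.write('first line\n')

# 'a' appends to the end of an existing file.
with open(scratch_path, 'a') as scratch:
    scratch.write('second line\n')

# 'r' opens the file for reading; the with block closes it afterwards.
with open(scratch_path, 'r') as scratch:
    contents = scratch.read()

print(contents)
print(scratch.closed)   # the file object is closed once the with block ends
```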

### Displaying File Contents

If `filename` is a file object corresponding to a text file, you can iterate over the lines of text in the file as follows:
```python
for line in filename:
   # Do something with each line...for example we can print the line
   # print(line)
```

In this next example we will use a `for` loop to iterate over each line (one at a time) in our file object `fileName`. The variable `line` will take on the value of each line, which is then printed. After a line is printed, the loop is executed again and the value of `line` is overwritten with the next line in the file object. The loop continues until there are no more lines to read.

In [5]:
for line in fileName:
    print(line)

Sedan/Wagon/6	2018	28.2	315.6	4098	313

CarSUV/6	2018	26.9	330	4321	285

Van/6	2018	26.9	329	4635	282

TruckSUV/6	2018	23.9	371.5	4760	297

Pickup/6	2018	23.1	384.7	4791	305

Sedan/Wagon/8	2018	24.2	367.8	4345	477

CarSUV/4	2018	30.2	295.1	3736	181

Van/8	2018	15.4	577.1	6647	324

TruckSUV/8	2018	20.8	428.2	6031	387

Pickup/8	2018	20.9	424.6	5642	375


Now let's look at the variable `line`. Notice that it holds only the last line from the file, and that this variable is a string. We can also see that the columns in the file are separated by `\t`, or tabs.

In [8]:
print(line)
type(line)

Pickup/8	2018	20.9	424.6	5642	375


str
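
Tabs and newlines are invisible when a string is printed normally. The sketch below uses `repr()` to make them visible. The sample string is one row of `cars.txt` typed out by hand, with the trailing newline that every line except the last one carries.

```python
# One row of cars.txt written out by hand, including the trailing newline.
sample_line = 'Pickup/6\t2018\t23.1\t384.7\t4791\t305\n'

print(repr(sample_line))               # shows \t and \n explicitly

clean_line = sample_line.rstrip('\n')  # drop the trailing newline
print(repr(clean_line))
```

The trailing `\n` also explains the blank lines in the loop output above: each line already ends with a newline, and `print` adds a second one.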

## Python Collection Data Types

Python has four collection types: Lists, Tuples, Dictionaries and Sets.  This week we will discuss Lists and Tuples. <br>

1. **Lists** are ordered sequences of elements, with that order being specified by the order that the elements are in when the list is created or as elements are added to the list.  

    1. Lists are created using the `[]` syntax.
    
    2. Lists are <font color ='green'>**mutable**</font>. You can add, remove, and replace values using methods such as `append()`, `extend()`, `insert()`, `pop()`, and `remove()`, or the `del` statement. 
    
    3. Lists can be created by string methods such as `split()` and `splitlines()`.
    
2. **Tuples** are similar to lists except for the very important fact that they are <font color = 'green'>**immutable**</font>.

    1. Tuples are created using the `()` syntax.
    
    2. Since Tuples are immutable, they have no built-in methods that modify their elements.
    
    3. When to use a Tuple?  When you have data that will never change, like the days of the week:
       `days_of_the_week = ("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")`
    
3. **Both Lists and Tuples**

    1. Can include mixed data types.
    
    2. Are accessed by index.
    
    3. Contain a sequence of individual elements.
    
    4. Are stored in the order in which they were added.
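
The difference between mutable and immutable can be sketched in a few lines. The variable names below are made up for this example.

```python
colors = ['red', 'yellow']        # a list: mutable
colors.append('green')            # add an element
colors[0] = 'crimson'             # replace an element
print(colors)

weekend = ('Saturday', 'Sunday')  # a tuple: immutable
try:
    weekend[0] = 'Friday'         # not allowed for tuples
except TypeError as error:
    tuple_error = str(error)
    print(f'Tuples cannot be changed: {tuple_error}')
```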

### Separate each column of line
We will use the `split()` method (defined in the String class) to break up each line of the file (which we showed above is a String object) into a list of its six string values: `Type`, `Year`, `MPG` (miles per gallon), `CO2 Emissions`, `Weight`, and `HP` (horsepower). We will study the String class and more of its methods in detail in a later module. 

To use the split method we first need a String object. We already have one: the variable `line`, which still holds the last line of the file. We call the `split()` method on this object in this way:
```python
line.split()
```

Run the cell below to see what you get...

In [9]:
# Use the split method to create a list of values for the line
lst = line.split()
print(lst)

type(lst)

['Pickup/8', '2018', '20.9', '424.6', '5642', '375']


list

Run the following three cells for some examples showing how to access elements of the list.

In [10]:
print(f'The first element of the list is {lst[0]}')

The first element of the list is Pickup/8


In [11]:
print(f'The fifth element of the list is {lst[4]}')

The fifth element of the list is 5642


In [12]:
print(f'The last element of the list is {lst[-1]}')

The last element of the list is 375
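
Every element that `split()` produces is a string, even when it looks like a number. The sketch below shows one way to convert such values with `int()` and `float()`, and why it matters when comparing them. The list is typed out by hand to match the output above, and the variable names are made up for this example.

```python
# The same values that split() produced above, written out by hand.
values = ['Pickup/8', '2018', '20.9', '424.6', '5642', '375']

year_value = int(values[1])      # '2018' -> 2018
mpg_value = float(values[2])     # '20.9' -> 20.9
weight_value = int(values[4])    # '5642' -> 5642
print(year_value, mpg_value, weight_value)

# Strings compare character by character, numbers compare by size.
print('9' > '10')   # True, because '9' comes after '1'
print(9 > 10)       # False
```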


<div class="alert alert-block alert-success"><b>Problem 1</b>: Iterate over lines in the file 'cars.txt' as demonstrated above and print the following for each line:</div>

`"<Type> has an MPG of <MPG> with <HP> Horse power"`<br>

<div class="alert alert-block alert-info">For example, the first line printed should look like this: </div>

`Sedan/Wagon/6 has an MPG of 28.2 with 313 Horse power`

## Answer to Problem 1

In [3]:
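with open("cars.txt", "r") as file:
    for line in file:
        array = line.split()
        # array holds [Type, Year, MPG, CO2 Emissions, Weight, HP]
        print(f'{array[0]} has an MPG of {array[2]} with {array[5]} Horse power')

Sedan/Wagon/6 has an MPG of 28.2 with 313 Horse power
CarSUV/6 has an MPG of 26.9 with 285 Horse power
Van/6 has an MPG of 26.9 with 282 Horse power
TruckSUV/6 has an MPG of 23.9 with 297 Horse power
Pickup/6 has an MPG of 23.1 with 305 Horse power
Sedan/Wagon/8 has an MPG of 24.2 with 477 Horse power
CarSUV/4 has an MPG of 30.2 with 181 Horse power
Van/8 has an MPG of 15.4 with 324 Horse power
TruckSUV/8 has an MPG of 20.8 with 387 Horse power
Pickup/8 has an MPG of 20.9 with 375 Horse power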


### Creating Lists

Our next objective is to create three lists from the data: (1) a list of the vehicle types (2) the corresponding list of MPG and (3) the corresponding list of Weight. 

Since we closed the file in Problem #1, we need to reopen the `cars.txt` file for reading. This time we read all the lines at once using the file object's `readlines` method. What do you get when you run the following cell?

In [13]:
cars = open("cars.txt", "r")
lines = cars.readlines()
print(lines)
print(40*'=')

# Check the data type for 'lines'
type(lines)

['Sedan/Wagon/6\t2018\t28.2\t315.6\t4098\t313\n', 'CarSUV/6\t2018\t26.9\t330\t4321\t285\n', 'Van/6\t2018\t26.9\t329\t4635\t282\n', 'TruckSUV/6\t2018\t23.9\t371.5\t4760\t297\n', 'Pickup/6\t2018\t23.1\t384.7\t4791\t305\n', 'Sedan/Wagon/8\t2018\t24.2\t367.8\t4345\t477\n', 'CarSUV/4\t2018\t30.2\t295.1\t3736\t181\n', 'Van/8\t2018\t15.4\t577.1\t6647\t324\n', 'TruckSUV/8\t2018\t20.8\t428.2\t6031\t387\n', 'Pickup/8\t2018\t20.9\t424.6\t5642\t375']


list
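
A file object keeps track of how far it has been read, which is one reason a file often has to be reopened before reading it again. The sketch below illustrates this with `io.StringIO`, which behaves like an open text file so that no real file is needed. The two rows of data are a shortened, made-up sample.

```python
import io

# io.StringIO acts like an open text file held in memory.
fake_file = io.StringIO('Van/6\t2018\t26.9\nVan/8\t2018\t15.4\n')

first_read = fake_file.readlines()
second_read = fake_file.readlines()   # nothing is left to read

print(first_read)
print(second_read)

fake_file.seek(0)                     # move back to the start of the file
third_read = fake_file.readlines()
print(third_read == first_read)
```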

We iterate over `lines` in much the same way we iterated over (the file object) `cars`. But first let us give some examples of how the list method `append` can be used to "grow" a list from scratch. As usual, make sure you run each of the cells in the notebook one at a time.

In [14]:
my_list = [] # Start with an empty list

name = "Guido van Rossum" # Create a variable 'name'

my_list.append(name)  # Append 'name' to 'my_list' and print the list
print('my_list with name ', my_list)

age = 25  # Create a new variable 'age'

my_list.append(age) # Append 'age' to 'my_list' and print the list

print('my_list with name and age ', my_list)

my_list with name  ['Guido van Rossum']
my_list with name and age  ['Guido van Rossum', 25]
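
`append` is one of several list methods for changing a list. As a quick sketch, here it is next to `extend`, `insert`, `pop` and `remove`, using a made-up list of numbers.

```python
numbers = [1, 2]

numbers.append(3)        # add one element at the end  -> [1, 2, 3]
numbers.extend([4, 5])   # add several elements        -> [1, 2, 3, 4, 5]
numbers.insert(0, 0)     # add at a given index        -> [0, 1, 2, 3, 4, 5]
last = numbers.pop()     # remove and return the last  -> 5
numbers.remove(3)        # remove the first 3          -> [0, 1, 2, 4]

print(numbers, last)
```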


<div class="alert alert-block alert-success"><b>Problem 2 </b>: Remember that the <b><i>lines</i></b> list still contains the content from the <b><i>cars.txt</i></b> file. Complete the TODO in the cell below. The cell starts with three empty lists: <b><i>types,</i></b> <b><i>MPG</i></b> and <b><i>weight</i></b>. The loop should iterate over the <b><i>lines</i></b> list, split each line in turn, and append the car type, MPG, and weight from that line to the corresponding list.</div>

In [None]:
types = []
MPG = []
weight = []

for line in lines:
    
    # TODO: Append the name of each type, MPG and Weight to the appropriate list.
    
   

Run the following three cells to check that `types`, `MPG` and `weight` lists were constructed properly.

In [None]:
# Show the 'types' list
print(types)

In [None]:
# Show the 'MPG' list
print(MPG)

In [None]:
# Show the 'weight' list
print(weight)

### Working with Methods

Next we will introduce a built-in function and a list method and ask you to use them together in a program. First we have the built-in `max` function, which returns the maximum value in a list.

In [None]:
my_list = [1,2,3,10,4,5,6]
max(my_list)

Second, we can get the "position" of any value in the list using the `index` method. Note that the first position has `index` **zero** and not **one**. So it would be more accurate to think of the `index` as the `offset` as opposed to the `position`. 

In [None]:
my_list.index(10)

Run the following cell to double check that the value in position (offset) 3 really is 10...

In [None]:
my_list[3]
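
Two details of `index` are worth knowing: it returns the position of the first match only, and it raises a `ValueError` when the value is not in the list. A short sketch with a made-up list:

```python
scores = [4, 9, 9, 2]

print(scores.index(9))    # prints 1, the first occurrence only

try:
    scores.index(7)       # 7 is not in the list
except ValueError as error:
    index_error = str(error)
    print(f'index() failed: {index_error}')
```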

It is important to understand the two parts of a list and how to access them: (1) the actual value in the list and (2) the index that gives the position of that value.

In [None]:
# Create a new list of colors named 'my_list2'
my_list2 = ['red','yellow','green','blue','purple']

# Display the elements of 'my_list2'
my_list2

# Confirm it is a list
type(my_list2)

# How long is our list?
len(my_list2)

Below we loop through `my_list2` showing how both the value of the list is referenced and how the index is referenced.

In [None]:
length = len(my_list2)

for num in range(0,length):
    print(f'The value is {my_list2[num]} and the index is {num}.')
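
An alternative way to get both the index and the value is the built-in `enumerate` function, which hands you the two together on each pass through the loop. This is a sketch with a made-up list, not a replacement for the `range` approach above.

```python
shapes = ['circle', 'square', 'triangle']

pairs = []
for position, shape in enumerate(shapes):
    # position counts from 0, exactly like a list index
    print(f'The value is {shape} and the index is {position}.')
    pairs.append((position, shape))
```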
    

<div class="alert alert-block alert-success"><b>Problem 3</b>: Complete the program in the cell below. We are defining a function <b><i>largest_value</i></b> that takes three arguments: two lists, <b><i>type_list</i></b> and <b><i>value_list</i></b>, and a variable called <b><i>label</i></b> that names the kind of value passed to the function.

You will use the three lists you created in Problem 2 to test out the function you write: `types`, `MPG` and `weight`.<br>

The function should find the car type with the largest value and print the type of car together with the highest value and the label for that value. </div>

<div class="alert alert-block alert-info">For example, <br>

`largest_value(['Car1','Car2','Car3'], [30,40,25],'MPG')` should print: 

**Car2 has the most MPG of 40.**<br>
</div>

<div class="alert alert-block alert-success"><b>Problem 3</b>: To complete the program, complete the TODOs in the 3 code cells below. </div>

## Functions
Remember that utilizing a function has two steps: (1) define the function and (2) call the function.  

In [None]:
def largest_value(type_list,value_list, label):
    # Find the car with the highest of the value passed
    # Print out the results as shown above in the blue section
    
    #TODO: Find the maximum value and save to a variable
    
    
       
    #TODO: Print out the output as shown above
    
   
 

In [None]:
#TODO: Run this cell to test the function


In [None]:
# TODO: Run this cell to test the function


### Reading files using Pandas

__[Pandas Overview](https://pandas.pydata.org/pandas-docs/stable/getting_started/overview.html)__

**Pandas** is a large library which is used extensively in Data Science to wrangle data.  As you can see in the cell below, the library must be loaded with the command:

<font color = 'green'>**import**</font> pandas <font color = 'green'>**as**</font> pd

**Pandas** gives you access to two additional data structures: (1) a one dimensional <i><u>Series</u></i> and (2) a two dimensional <i><u>DataFrame</u></i>. We will look at some basics of the DataFrame which has the following attributes:
1. spread-sheet like structure
2. has an ordered collection of columns
3. each column can be of different value types such as numeric, boolean, string, etc.
4. has both a row and column index
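
These attributes can be seen on a small DataFrame built by hand. The sketch below assumes pandas is installed and uses made-up sample data, not the assignment files.

```python
import pandas as pd

# Made-up sample data for illustration only.
sample = pd.DataFrame({
    'Type': ['Van/6', 'Van/8'],
    'MPG': [26.9, 15.4],
    'HP': [282, 324],
})

print(sample.columns.tolist())   # the column index
print(sample.index.tolist())     # the row index
print(sample.dtypes)             # one value type per column
print(sample.shape)              # (number of rows, number of columns)
```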


We begin by importing pandas and reading in `cars.csv`.

In [None]:
import pandas as pd

cars_csv = pd.read_csv('cars.csv')
cars_csv

<div class="alert alert-block alert-success"><b>Problem 4</b>: As you can see from the result of reading in the csv file, the first record is appearing as the header.

Read the file in again so that there is no header and all records from the file show correctly. Use the documentation as needed: __[pandas.read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)__
<p>&nbsp;</p>
To complete this problem, complete the TODOs in the code cell below.
</div>

In [None]:
#TODO: Read in the 'cars.csv' file without a header


 
#TODO: Display the file contents to confirm all records are accurate


 

### Reading Excel files with Pandas

Next we read in an Excel file using pandas and look at the information about the file using the `info` method. Notice that `info()` tells you how many rows there are, how many non-null values each column has, and the data type of each column. After this we look at the first 5 rows of the data using the `head` method.

In [None]:
# Read in the file named 'cars.xlsx'
cars_excel = pd.read_excel('cars.xlsx')

# Inspect the data using the info method
cars_excel.info()

In [None]:
# Look at the first 5 rows
cars_excel.head()

<div class="alert alert-block alert-success"><b>Problem 5</b>: As you can see from the results of reading in the Excel file, there is something wrong with the header: the actual header is appearing as the first row. If you check the Excel file, you will see that there are two comment lines.

There is a way to handle unneeded rows when reading in an excel file. Read the file in again so that the header and all records from the file show correctly. Reference: __[pandas.read_excel](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)__
![image.png](attachment:image.png)
    
<p>&nbsp;</p>
To complete this problem, complete the TODOs in the code cell below.
</div>

In [None]:
# TODO: Read in 'cars.xlsx' so that the header is correct
 


# TODO: Show the first five rows of data


 

In [None]:
# Look at the last 3 rows of data
cars_excel.tail(3)

In [None]:
# You can refer to one column of data 
cars_excel['MPG']

# OR ===== but this style will only work if there is no space in the column name
cars_excel.MPG

### Read a json file

__[pandas.read_json](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html)__

![a1.jpg](attachment:a1.jpg)

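Python has its own built-in `json` library, which we import below alongside pandas. Note that `pd.read_json` works without this import; the `json` module is useful when you want to parse JSON text yourself.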

In [None]:
import json

cars_json = pd.read_json('cars.json')

# Determine the data type of 'cars_json'
type(cars_json)

# Look at first 3 records
cars_json.head(3)

# Display the file info
cars_json.info()
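
To see what the `json` module does on its own, here is a minimal sketch that parses JSON text into ordinary Python lists and dictionaries. The JSON text is made up, and its layout (a list of records) is an assumption; the real `cars.json` may be organized differently.

```python
import json

# Made-up JSON text; the layout of the real cars.json may differ.
json_text = '[{"Type": "Van/6", "MPG": 26.9}, {"Type": "Van/8", "MPG": 15.4}]'

records = json.loads(json_text)    # JSON text -> Python list of dictionaries
print(type(records))
print(records[0]['Type'])

round_trip = json.dumps(records)   # Python objects -> JSON text
print(round_trip)
```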

## Adding a Column to the DataFrame

We want to create a new field for Cylinders. We know that the last character of **Type** is the number of cylinders, so let's isolate it using <i>slice()</i>.

In [None]:
# Test out the use of slice
test = cars_json['Type'].str.slice(start=-1)
test
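
The same idea works on a single plain Python string. The sketch below also shows an alternative, splitting on `/`, which would still work if a cylinder count had two digits. The value `'Coupe/12'` is hypothetical and does not appear in the data files.

```python
car_type = 'Sedan/Wagon/8'

last_char = car_type[-1]               # slicing: the last character only
last_field = car_type.split('/')[-1]   # everything after the final '/'
print(last_char, last_field)

# Hypothetical type with a two-digit cylinder count.
big_engine = 'Coupe/12'
print(big_engine[-1], big_engine.split('/')[-1])
```

For the files in this assignment every cylinder count is a single digit, so taking the last character is enough.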

In [None]:
# Add a new column to the dataframe
cars_json['Cylinders'] = cars_json['Type'].str.slice(start=-1)

# Inspect the dataframe to see the new column
cars_json.info()
cars_json.head()

In [None]:
# Cylinders needs to be an integer
cars_json['Cylinders'] = cars_json['Cylinders'].astype(int)

# Inspect the data again to confirm 'Cylinders' is an int type
cars_json.info()

### Writing to a File

When you create a new column and add it to your dataframe, it is a good idea to write out the new file to save it for future reference.


<div class="alert alert-block alert-success"><b>Problem 6</b>: Create a new column called <mark>Cylinder_efficiency</mark> and add it to the cars_json DataFrame. 
    
The calculation for Cylinder_efficiency will be HP divided by the Cylinders; round the results to two decimal places.

Next write your new file out to an **Excel** file called <mark>cars_totals.xlsx</mark>.  Upload your new file into Google Classroom along with the .ipynb and HTML files. Reference: __[Pandas DataFrame to Excel File](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html)__
    
<p>&nbsp;</p>
To complete this problem, complete the TODOs in the code cell below.
</div>

In [None]:
#TODO: Create a new column called Cylinder_efficiency and add it to the cars_json DataFrame



#TODO: Show the first five rows of your file
 


#TODO: Write your file to an Excel file called 'cars_totals.xlsx'


 

### Viewing File Contents

On a side note, though you can easily view the contents of any text file using your favorite text editor, there are ways of doing this using Python. Here we are going to use the shell command appropriate to your operating system. To access the shell commands from within Jupyter notebook we need to prefix them with the `!` character.

In [None]:
import platform

if platform.system() == 'Windows':
    !type cars.txt
else:        
    !cat cars.txt
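
If you prefer to stay in plain Python, one option that works the same way on every operating system is `pathlib`. This sketch writes a small scratch file into a temporary folder and reads it back; the file name and contents are made up, and in the notebook you could point `Path` at `cars.txt` instead.

```python
import tempfile
from pathlib import Path

# Hypothetical scratch file standing in for cars.txt.
demo_path = Path(tempfile.mkdtemp()) / 'demo.txt'
demo_path.write_text('Van/6\t2018\t26.9\n')

# read_text() opens the file, reads everything, and closes it again.
text = demo_path.read_text()
print(text)
```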