# File Input and Output

Up to now we haven't done much with data beyond what we type into the notebooks (short strings, numbers, lists, etc.). In many cases our data comes from files. Learning how to open, read, and save files is important for processing larger amounts of data.

In this module we will cover the following topics:
* how to read and open files 
* how to write files 
* how to close files 

In [5]:
# use the shell command cat to display the contents of the file assets/test.txt
!cat assets/test.txt

I am a text file.
I am not very big.
Only three lines of text.


## Opening files

We use the `open()` function to establish a *connection* to a file on disk. Once a file has been opened, we can perform operations on it, such as reading it into memory. Be sure to call the `close()` method when you are done with the file (for example, once you have read its contents into memory).

**Syntax**
When we open a file, we assign the result to a variable, which in this case is `file_handler`. `file_handler` is a connection to the file, not the file contents itself. The path to the file (its directory and name) goes inside the parentheses of the `open()` function. Be sure to put quotation marks, double or single, around the path. The path is followed by the mode; here `'r'` opens the file in read mode.

We then call `file_handler.read()` and assign the result to a variable called `file_contents`. The `read()` method reads the entire file into memory at once. Don't do this with large files! We will use other techniques to read their contents. Displaying `file_contents` then shows the contents of the text file. 


In [97]:
file_handler = open("assets/test.txt", 'r') # the 'r' tells Python you are Reading the file

In [98]:
# read the file into a variable
file_contents = file_handler.read()
file_contents

'I am a text file.\nI am not very big.\nOnly three lines of text.'

In [95]:
## You can also read the contents without creating a new variable
file_handler.read()

'I am a text file.\nI am not very big.\nOnly three lines of text.'

### Remember to close the file! 

Notice that when we run the same code again, it comes up empty; the contents of the file that were displayed above don't appear again. That's because `read()` leaves the read position at the end of the file, so a second `read()` has nothing left to return. To read the contents again, we close the file and re-open it. For good measure, you should always close a file when you're done with it. 

In [13]:
# read the file into a variable
file_contents = file_handler.read()
file_contents

''

In [None]:
# close the previous open
file_handler.close()

In [17]:
# re-open the file and read it in
file_handler = open("assets/test.txt", 'r') 
file_contents = file_handler.read()
print(file_contents)
file_handler.close() # close it for good measure, you should always close a file when you are done

I am a text file.
I am not very big.
Only three lines of text.
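
Closing and re-opening is not the only way to read a file a second time. As a minimal sketch, assuming a small throwaway file (the name `demo.txt` and its temporary folder are made up for illustration), the handler's `seek(0)` method moves the read position back to the start:

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
# The file name demo.txt is hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
file_handler = open(demo_path, "w")
file_handler.write("I am a text file.\nI am not very big.")
file_handler.close()

file_handler = open(demo_path, "r")
first_read = file_handler.read()   # reads everything; the position is now at the end
second_read = file_handler.read()  # nothing left to read, so this is an empty string
file_handler.seek(0)               # move the read position back to the start
third_read = file_handler.read()   # the full contents again
file_handler.close()

print(repr(second_read))
print(third_read == first_read)
```

Re-opening the file is still a perfectly good habit; `seek(0)` is simply handy when the file is already open.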


###  The `\n` Character  
One thing to note: `\n` is shown as a newline by the `print()` function, whereas the raw output from Jupyter shows the characters `\n` themselves. When working with files, it is really important to understand the *newline* character. A newline is represented in a string by `\n`. This is useful for processing a text file line by line.

Run the following code, once without the `\n` character and once with it. 

```python
print("Hello World!")

print("Hello\nWorld!")
```

We can use the repr() function to see the raw string from the file. 

```Python 
print(repr("Hello\nWorld!"))
```

In [27]:
print("Hello World!")

Hello World!


In [26]:
# A string with a newline in it
print("Hello\nWorld!")

Hello
World!


In [28]:
# leveraging the repr() function:

print(repr("Hello\nWorld!"))

'Hello\nWorld!'
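
Because every line break is just a `\n` inside the string, we can split a string into lines ourselves. The sketch below reuses the text of `test.txt` as a plain string (no file is opened) and uses the built-in string methods `split()` and `splitlines()`:

```python
text = "I am a text file.\nI am not very big.\nOnly three lines of text."

lines = text.split("\n")  # break the string apart at every newline
print(lines)
print(len(lines))

# splitlines() does the same job and does not leave an empty
# item behind when the string ends with a newline
hello_lines = "Hello\nWorld!\n".splitlines()
print(hello_lines)
```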


### Working with files line by line

When you have files with data on each line, especially large files, you can loop over them. Just like iterating over lists, you can iterate over files. Python reads the contents of the file until it reaches a `\n` and then puts that line, including the `\n`, in the loop variable. This is useful for working with *extremely large* files because you only store one line in memory at a time.

In [29]:
file_handler = open("assets/test.txt", 'r')
for line in file_handler:
    print(line)
file_handler.close()

I am a text file.

I am not very big.

Only three lines of text.
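
The blank lines in the output above appear because each `line` still ends with its own `\n`, and `print()` adds a second one. A minimal sketch of one way to avoid this is to strip the trailing newline with `rstrip("\n")`. Here `io.StringIO` is used as a stand-in for an open file so the example runs without `assets/test.txt`; that substitution is an assumption made for illustration only.

```python
import io

# io.StringIO behaves like an open text file, so no file on disk is needed.
file_handler = io.StringIO(
    "I am a text file.\nI am not very big.\nOnly three lines of text."
)

cleaned = []
for line in file_handler:
    stripped = line.rstrip("\n")  # drop the trailing newline, if there is one
    cleaned.append(stripped)
    print(stripped)
file_handler.close()
```

Another option is `print(line, end="")`, which tells `print()` not to add its own newline.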


### Reading Data Files

A file handler is not the file; it is a pointer to the file. This is how Python can work with very large files. We can process large files line by line (assuming there are multiple lines). Each line gets treated as a separate string.

The following code counts the number of lines in the document. 

```python
# count the number of lines in the text file
file_handler = open('assets/diabetes.csv', 'r')
count = 0
for line in file_handler:
    count = count + 1
    #count+=1
file_handler.close()
print(count)
```

In [None]:
# use the unix command head to see the first 25 lines of the file
!head -n 25 assets/diabetes.csv


* Let's count the lines of the file

In [31]:
# count the number of lines in the text file
file_handler = open('assets/diabetes.csv', 'r')
count = 0
for line in file_handler:
    count = count + 1
    #count+=1
file_handler.close()
print(count)

404
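
The counting loop can also be written in one line. This is only a sketch on a tiny stand-in file (the name `sample.csv` and its contents are made up, not taken from the diabetes dataset); it uses the built-in `sum()` with a generator expression that yields a 1 for every line:

```python
import os
import tempfile

# Build a small stand-in file; sample.csv and its contents are made up.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
file_handler = open(sample_path, "w")
file_handler.write("id,chol\n1,203\n2,165\n3,228\n")
file_handler.close()

file_handler = open(sample_path, "r")
count = sum(1 for line in file_handler)  # adds 1 for every line the loop yields
file_handler.close()
print(count)
```

Like the explicit loop, this only ever holds one line in memory at a time.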


## Reading in all the data

Why don't we read every line of the file into memory as a list? To start, we need to create an empty list. We then append each line to it, so each line becomes one item in a list called *data*. 

In [None]:
# create an empty list to store each line
data = [] 

# open the file and loop over it line by line
file_handler = open('assets/diabetes.csv', 'r')
for line in file_handler:
    # use the append method to add each line to the list
    data.append(line)
file_handler.close() # close the file handler now that we are done.

print("Length:", len(data))
print("First 10 lines:", data[0:10])
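
File objects also have a `readlines()` method that builds the same kind of list in a single call. The sketch below uses `io.StringIO` with a few made-up rows as a stand-in for the real file, purely so it can run on its own:

```python
import io

# Stand-in for an open file; the rows are made up for illustration.
file_handler = io.StringIO("id,chol\n1,203\n2,165\n")

data = file_handler.readlines()  # every line becomes one item in the list
file_handler.close()

print("Length:", len(data))
print("First 2 lines:", data[0:2])
```

Note that each item keeps its trailing `\n`, exactly as it does when we append the lines ourselves.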


## Write Mode: writing files 

Above we looked at read mode (`'r'`) of the `open()` function. There is also a write mode (`'w'`). Instead of reading the contents of a file into an object, write mode lets us write content into a file with the `write()` method. If the file already exists, opening it in write mode overwrites the content already in the file; if it doesn't exist, a new file is created. 

Let's look at an example. If we open and read this text file, you'll notice its contents are numbers. We'll use write mode to replace the existing content, substituting words for the numbers. 


In [79]:
num = open("assets/numbers_test.txt", 'r') 
numbers = num.read()
print(numbers )
num.close()


1 2 3 4 5


To overwrite the existing text file, we create a new variable that contains the new content. We then re-open the file in write mode, since we closed it in the previous cell. We pass the variable that contains the new text to the `write()` method, and then close the file so the new content is saved to disk.


```Python
new_text = ('one two three four five')
num = open("assets/numbers_test.txt", 'w') 
num.write(new_text)
# Closing file
num.close()
```

To test whether we successfully overwrote the file, let's read the file back in. 

```Python
num = open("assets/numbers_test.txt", 'r') 
numbers = num.read()
print(numbers )
num.close()
```

In [80]:
new_text = ('one two three four five')
num = open("assets/numbers_test.txt", 'w') 
num.write(new_text)
# Closing file
num.close()

In [81]:
num = open("assets/numbers_test.txt", 'r') 
numbers = num.read()
print(numbers )
num.close()

one two three four five
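
One detail worth knowing is that `write()` does not add a newline for you. As a sketch, assuming a throwaway file (the name `numbers_demo.txt` is hypothetical), writing one word per line means adding `\n` yourself:

```python
import os
import tempfile

# Throwaway file so nothing in assets/ is touched; the name is hypothetical.
out_path = os.path.join(tempfile.mkdtemp(), "numbers_demo.txt")

words = ["one", "two", "three"]
out = open(out_path, "w")
for word in words:
    out.write(word + "\n")  # write() adds nothing, so we append the newline
out.close()

# Read the file back in to check what was written
check = open(out_path, "r")
written = check.read()
check.close()
print(repr(written))
```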


##  Append Mode: append files 

If you don't want to overwrite an existing file but just want to add content to it, use append mode (`'a'`). It adds data to the end of the file. You'll notice we put `\n` before the `new_text` variable. This inserts a new line before the contents of `new_text` are appended to the file. 
```Python
num = open("assets/numbers_test.txt", 'a') 
num.write("\n" + new_text)
# Closing file
num.close()
``` 

In [91]:
num = open("assets/numbers_test.txt", 'a') 
num.write("\n" + new_text)
# Closing file
num.close()
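
To make the difference between the two modes concrete, here is a small sketch on a throwaway file (the name `modes_demo.txt` is made up). Opening with `'w'` twice keeps only the second write, while `'a'` adds to what is already there:

```python
import os
import tempfile

# Throwaway file; the name modes_demo.txt is hypothetical.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

f = open(path, "w")
f.write("first")
f.close()

f = open(path, "w")   # 'w' empties the file before writing
f.write("second")
f.close()

f = open(path, "a")   # 'a' keeps the existing content and adds to the end
f.write("\nthird")
f.close()

f = open(path, "r")
result = f.read()
f.close()
print(result)
```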

## With Open 

In the code blocks above, it was strongly advised to `close()` the file once you're done reading or writing. The `with` statement closes the file for you automatically when its block ends, so you don't have to call `close()` yourself. Let's go back to the `test.txt` file.

**Syntax** 

Begin with `with open(...)`, passing the path to the file as the argument. Make sure to put single or double quotation marks around the path. This is followed by the `as` keyword and the name of the variable that will hold the file handler. You can give it any name; here, we've reused `file_handler`. Inside the indented block, we call `read()` and assign the result to a new variable, which in the example below is *test_doc*. Since no mode is given, the file is opened in read mode, the default. You can also pass `'w'` or `'a'` to `open()` to write or append within a `with` block. 

```Python
with open('assets/test.txt') as file_handler:
    test_doc = file_handler.read()
    print(test_doc)

``` 

In [108]:
with open('assets/test.txt') as file_handler:
    test_doc = file_handler.read()
    print(test_doc)

I am a text file.
I am not very big.
Only three lines of text.
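
We can check that the file really is closed by looking at the handler's `closed` attribute. The sketch below assumes a throwaway file (the name `demo.txt` is hypothetical) and also shows that the file gets closed when an error interrupts the block; the `try`/`except` is only there so the example keeps running after the deliberate error.

```python
import os
import tempfile

# Create a small throwaway file; the name demo.txt is hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(demo_path, "w") as file_handler:
    file_handler.write("I am a text file.")

with open(demo_path) as file_handler:
    test_doc = file_handler.read()
    open_inside = file_handler.closed  # still open inside the block
closed_after = file_handler.closed     # closed once the block has ended

# The file is closed even when an error interrupts the block.
try:
    with open(demo_path) as file_handler:
        int("not a number")            # raises ValueError on purpose
except ValueError:
    closed_after_error = file_handler.closed

print(open_inside, closed_after, closed_after_error)
```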


## Let's give it a try! 

Using the `with` statement, write or append text to the document called `tryit.txt`. You'll find the text file in the `assets` folder. 

In [None]:
## Write code here that uses a with statement to write or append text to the tryit.txt file. 


## Working With CSV Files

CSV files are used to store tabular data. They are like greatly simplified spreadsheets (think Excel), except that the content is stored in plain text. Python has a CSV parser as part of the standard library: the `csv` module. It provides a number of built-in functions that make it easier to parse and iterate through CSV files.

In [None]:
#  load the CSV module 
import csv

# open the diabetes file
diabetes_file = open("assets/diabetes.csv")

* Now we need to tell Python that the file stored in the `diabetes_file` variable should be read and interpreted as a CSV file. 
* We do that by calling the `reader()` function of the csv module

In [None]:
# Create a CSV reader 
diabetes_data = csv.reader(diabetes_file)

* At this point, the CSV file is treated as a table, a collection of rows and columns
* We can iterate (loop) through this table and get access to each individual row, just like the line-by-line loop above
* But the csv reader automatically splits each row into separate values!

In [None]:
# loop over the file and print the row contents
for row in diabetes_data:
    print(row)
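
To see what the reader produces without needing the diabetes file, here is a sketch on a few made-up rows. The column names and values are invented for illustration (only the idea that `chol` is the second column comes from this lesson), and `io.StringIO` stands in for the open file:

```python
import csv
import io

# Made-up sample data standing in for an open CSV file.
sample_file = io.StringIO("id,chol,age\n1000,203,46\n1001,165,29\n")

rows = []
for row in csv.reader(sample_file):
    rows.append(row)
    print(row)  # each row is a list of strings, one per column
```

Notice that every value, including the numbers, comes back as a string. That is why the answers further down convert values with `int()` before doing any math.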
    

* You probably noticed that the row variable is just a list - it is a list of values contained in each column.
* You can access individual columns exactly the same way you would access values in a list.
* For example, the value of cholesterol is in a column called 'chol', which is the second column and therefore has the index 1

In [None]:
# Since we already iterated through the CSV file once, we need to tell Python to start at the beginning again
# This action is called 'resetting the read position of the file object'
# It basically is like re-opening the file
diabetes_file.seek(0) 

for row in diabetes_data:
    print(row[1]) # print only the values for the chol column

* You probably also noticed that the first row does not contain data - it's just the column headers
* In order for us to do any mathematical or statistical operations on the data, we need to EXCLUDE the header
* We have to skip the header row. We can do this with the `next()` function, which pulls the header row off before the loop starts

In [None]:
# One way to do this is with the next() function

diabetes_file.seek(0) # Reset the read position of the file object

# use next on the CSV reader to pull off the header row as a list of column names
headers = next(diabetes_data)
print(headers)

# now we can iterate through just the data values
for row in diabetes_data:
    print(row[1]) # print only the values for the chol column
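
The csv module also provides `csv.DictReader`, which uses the header row as keys so that columns can be looked up by name instead of by index. This is an alternative to the `next()` approach above, sketched here on made-up sample data (the column names other than `chol` and all of the values are assumptions):

```python
import csv
import io

# Made-up sample data standing in for an open CSV file.
sample_file = io.StringIO("id,chol,age\n1000,203,46\n1001,165,29\n")

reader = csv.DictReader(sample_file)  # the first row becomes the column names
chol_values = []
for row in reader:
    chol_values.append(int(row["chol"]))  # look the column up by its name

print(reader.fieldnames)
print(chol_values)
```

With `DictReader` there is no header row to skip, because it is consumed automatically.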


## Practice Questions



### CSV files - Challenge 1

Calculate the average and the highest (max) cholesterol value based on the data available in the dataset.

In [None]:
# Step 1: Import csv module
import csv

In [None]:
# Step 2: Read the csv file
diabetes_file = open("assets/diabetes.csv")
diabetes_data = csv.reader(diabetes_file)

In [None]:
# Step 3: Iterate through csv data
cnt = 0 # Initialize a temporary counter
diabetes_file.seek(0) # Reset the read position of the file object

# Hint: you'll need to declare variables to store average and maximum cholesterol here (outside of the loop)
for row in diabetes_data:
    if cnt > 0:
        ################################################################################################
        # This is where you need to complete the logic for calculating average and maximum cholesterol
        ################################################################################################
        
        print(row[1]) # print only the values for the chol column
    cnt = cnt + 1 # Increment the counter by one

### Answers

In [None]:
# Step 1: Import csv module
import csv

In [None]:
# Step 2: Read the csv file
diabetes_file = open("assets/diabetes.csv")
diabetes_data = csv.reader(diabetes_file)

In [None]:
# Step 3: Calculate average cholesterol

cnt = 0 # Counts the data rows that have a cholesterol value
total = 0 # This variable will hold the sum of all cholesterol values
diabetes_file.seek(0) # Reset the read position of the file object
next(diabetes_data) # Skip the header row so it is not counted

for row in diabetes_data:
    if row[1] != "":
        total = total + int(row[1])
        cnt = cnt + 1 # Increment the counter by one
        
print("Total: " , total)
print("Count: " , cnt)

avg_chol = total / cnt

print("Average: ", avg_chol)

In [None]:
# Step 4: Calculate maximum cholesterol

cnt = 0 # Initialize a temporary counter
diabetes_file.seek(0) # Reset the read position of the file object
max_chol = 0 # This variable will hold the largest cholesterol value seen so far

for row in diabetes_data:
    if row[1] != "":
        if cnt > 0:
            # Every time through the loop (for every row that contains a value)
            # we compare the value from the data with the value stored in 
            # max_chol variable.  
            # If the value from the data is larger, we set max_chol to that larger value
            # After the loop finishes running, the largest value will be stored in max_chol
            if max_chol < int(row[1]):
                max_chol = int(row[1])
        cnt = cnt + 1 # Increment the counter by one
        

print("Maximum cholesterol: ", max_chol)
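
Once the values are collected in a list, Python's built-in `sum()`, `len()`, and `max()` can do the same calculations in a few lines. The sketch below is one possible compact version, run on made-up sample rows (including one row with a missing value) rather than the real dataset:

```python
import csv
import io

# Made-up sample data; the second data row has no cholesterol value.
sample_file = io.StringIO("id,chol\n1000,203\n1001,\n1002,165\n1003,228\n")

reader = csv.reader(sample_file)
next(reader)  # skip the header row

# Keep only the rows that have a value, converted to integers
chol_values = [int(row[1]) for row in reader if row[1] != ""]

avg_chol = sum(chol_values) / len(chol_values)
max_chol = max(chol_values)

print("Average: ", avg_chol)
print("Maximum cholesterol: ", max_chol)
```

This version stores every value in memory, which is fine for a small file; the loop-based answers above avoid that by keeping only a running total and a running maximum.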