# Files

* Up to now we haven't really been doing much with data, only what we type into the notebooks (short strings and numbers)
* In the real world we don't type our data into notebooks; we store it in files!
* Opening files is where Python becomes useful for processing larger amounts of data
* Let's start with a small text file that has only a few lines

In [None]:
# use the shell command cat to display the contents of the file test.txt
!cat test.txt

## Opening files

* Use the `open(<filepath>, <mode>)` function to establish a *connection* to a file on disk
* When you connect to a file it can have different modes. Indicate a mode using a short string
    * `r` - Read only
    * `w` - Write (overwriting existing contents)
    * `a` - Append to a file
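    * `x` - Write a new file (fails if the file already exists)
    * `b` - Binary mode, for opening non-text files (combined with another mode, e.g. `rb`)
* Python reads files as text by default. You can also specify the encoding with the `encoding` argument.
    * The default encoding depends on your platform and Python version, so pass `encoding="utf-8"` explicitly when you know the file is UTF-8.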
* Once a file has been opened we can do operations on it like reading it into memory
* Python has a special syntax for safely opening and working with files

The `with open` syntax for safely opening files:

```
with open(<filepath>, '<mode>', <optional encoding>) as <variable>:
    # do something
    # the file is open inside this block

# the file is closed outside this block
```
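As a rough illustration of how the modes listed above behave, here is a minimal sketch. The scratch file name `modes_demo.txt` and the temporary directory are made up for this example, so nothing real gets overwritten.

```python
import os
import tempfile

# Hypothetical scratch file inside a fresh temporary directory
scratch_dir = tempfile.mkdtemp()
scratch_path = os.path.join(scratch_dir, "modes_demo.txt")

# 'x' creates a new file and fails if the file already exists
with open(scratch_path, "x", encoding="utf-8") as fh:
    fh.write("first line\n")

# 'a' adds to the end of the existing contents
with open(scratch_path, "a", encoding="utf-8") as fh:
    fh.write("second line\n")

# 'r' reads the file back
with open(scratch_path, "r", encoding="utf-8") as fh:
    after_append = fh.read()

# 'w' throws away the old contents before writing
with open(scratch_path, "w", encoding="utf-8") as fh:
    fh.write("replaced\n")

with open(scratch_path, "r", encoding="utf-8") as fh:
    after_write = fh.read()

# A second 'x' on the same path raises FileExistsError
try:
    with open(scratch_path, "x", encoding="utf-8") as fh:
        fh.write("never written\n")
    x_failed = False
except FileExistsError:
    x_failed = True

print(after_append)
print(after_write)
print("Second 'x' failed:", x_failed)
```

Because `w` silently replaces existing contents, `x` is the safer choice when you want to be sure you are not overwriting anything.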

In [None]:
with open("test.txt", 'r') as file_handler: # the 'r' tells Python you are Reading the file
    # do something with file_handler
    pass # placeholder so the block is valid Python

* The `file_handler` is a connection to the file, but it isn't the file contents itself
* We use the `read()` method of the file handler to read the entire file into memory at once
    * Don't do this with large files! We will use other techniques to read their contents

In [None]:
with open("test.txt", 'r') as file_handler: # the 'r' tells Python you are Reading the file
    # read the file content into a variable 
    file_contents = file_handler.read()

file_contents

* Note that `print()` renders "\n" as an actual line break, while the raw output from Jupyter shows it as the characters `\n`
* When working with files it is really important to understand the *newline* character
* A newline is represented in a string by `\n`
* This is useful for processing a text file line-by-line

In [None]:
# A string with a newline in it
print("Hello\nWorld!")

In [None]:
# display the contents of file_contents using print() instead of Jupyter Output
print(file_contents)

* It is useful to know that there are some minor differences in the display of output when you use the `print()` function vs. putting something in the last line of a cell in Jupyter
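The difference comes down to which form of the string is shown: `print()` shows the string itself, while Jupyter shows the `repr()` of the last expression in a cell. A small sketch, using a made-up string rather than the contents of `test.txt`:

```python
text = "Hello\nWorld!"

# What print() shows: the newline becomes a real line break
shown_by_print = str(text)

# What Jupyter shows for the last line of a cell: quotes and a visible \n
shown_by_jupyter = repr(text)

print(shown_by_print)
print(shown_by_jupyter)

# The newline counts as a single character
print(len(text))
```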

## Working with files line by line

* When you have files with data on each line, especially large files, you can loop over them 
* Just like iterating over lists, you can iterate over files
* Python reads the contents of the file until it hits "\n" and then it puts that in the loop variable
* Useful for working with *extremely large* files because you only store one line in memory at a time

In [None]:
# open the file
with open("test.txt", 'r') as file_handler:
    for line in file_handler:
        print(line)
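Each line read from a file still ends with its "\n", and `print()` adds another one, which is why the output above is double-spaced. One way to handle this, sketched here with an `io.StringIO` object standing in for an open file (the three lines of text are made up), is to strip the trailing newline:

```python
import io

# io.StringIO stands in for an open text file so this runs without test.txt
fake_file = io.StringIO("first line\nsecond line\nthird line\n")

cleaned = []
for line in fake_file:
    # remove the trailing newline before storing or printing the line
    cleaned.append(line.rstrip("\n"))

for line in cleaned:
    print(line)
```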
        

### Reading Data Files

* A file handler is not the file, it is a pointer to the file
* This is how Python can work with HUGE files
* We can process large files line by line (assuming there are multiple lines)
* Each line gets treated as a separate string

In [None]:
# use the unix command head to see the first 25 lines of the file
!head -n 25 diabetes.csv

* Let's count the lines of the file

In [None]:
with open('diabetes.csv', 'r') as file_handler:
    count = 0
    for line in file_handler:
        #count = count +1
        count += 1

print(count)
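The same count can be written more compactly. This is only a sketch of an alternative, not a replacement for the loop above; it uses `io.StringIO` with made-up contents in place of `diabetes.csv`:

```python
import io

# Stand-in for an open file; the contents are invented for illustration
sample_file = io.StringIO("a,b\n1,2\n3,4\n")

# Add 1 for every line the file yields, without storing the lines
line_count = sum(1 for _ in sample_file)

print(line_count)
```

Like the explicit loop, this only holds one line in memory at a time.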

## Reading in all the data

* Why don't we read every line of the file into memory as a list?

In [None]:
# create an empty list to store each line
data = [] 

# read each line of the text file into the list
with open('diabetes.csv', 'r') as file_handler:    
    for line in file_handler:
        # use the append function to add each line
        data.append(line)

print("Length:", len(data))
print("First 10 lines:", data[0:10])


* How is the data structured in the `data` variable?
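To make the structure concrete, here is a small sketch with a hand-written list in the same shape as `data`. The column names and values are invented for illustration and are not taken from `diabetes.csv`:

```python
# Hypothetical lines in the shape that reading a CSV file line by line produces
data = ["id,chol,age\n", "1000,203,46\n", "1001,165,29\n"]

# Each element is one whole line, stored as a single string
first_row = data[1]
print(type(data), type(first_row))

# To get at individual values we would have to strip the newline and split on commas
fields = first_row.rstrip("\n").split(",")
print(fields)
```

So `data` is a list of strings, one per line, and the values inside each line are not yet separated.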

## Working with Modules

* Python's [standard library](https://docs.python.org/3/library/) is very comprehensive 
    * Interact with your operating system with `os`
    * Work with emails using `email`
    * Run a web server with `http.server`
* The same mechanism is used to load 3rd-party libraries
* To load a module use the `import` statement; this makes the module available in your code
    * Use the syntax `import <module name> as <arbitrary name>` to use a different name

In [None]:
import calendar as cal

# is 2019 a leap year?
cal.isleap(2019)
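Once a module is imported under a short name, every function in it is reached through that name. A small sketch that reuses `cal.isleap()` in a loop (the range of years is an arbitrary choice for illustration):

```python
import calendar as cal

leap_years = []
for year in range(2016, 2025):
    # isleap() returns True or False for the given year
    if cal.isleap(year):
        leap_years.append(year)

print(leap_years)
```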

## Working With CSV Files

* CSV (comma-separated values) files store tabular data. Think of them as heavily simplified spreadsheets (like Excel) whose content is stored in plain text.
* Python has a CSV parser as part of the standard library
* To parse CSV files, we use the `csv` module.
* The csv module provides a number of built-in functions to make it easier to parse and iterate through CSV files.
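One reason to use the `csv` module rather than splitting each line on commas yourself is that values may contain commas inside quotes. A minimal sketch with made-up data, using `io.StringIO` in place of a real file:

```python
import csv
import io

# Invented two-line CSV where both values in the second line contain a comma
raw = 'name,note\n"Smith, Jane","tall, left-handed"\n'

# Naive approach: split the second line on every comma
naive = raw.splitlines()[1].split(",")

# csv.reader understands the quotes and keeps each value together
parsed = list(csv.reader(io.StringIO(raw)))[1]

print(len(naive), naive)
print(len(parsed), parsed)
```

The naive split produces four broken pieces, while the reader returns the two intended values.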
 

In [None]:
#  load the CSV module 
import csv

# open the diabetes file
with open("diabetes.csv", 'r') as diabetes_file:
    # do nothing for now
    pass

* Now we need to tell Python that the file referenced by the `diabetes_file` variable should be interpreted as a CSV file.
* We do that by calling the `reader()` function of the `csv` module

In [None]:
# open the diabetes file
with open("diabetes.csv", 'r') as diabetes_file:
    # Create a CSV reader 
    diabetes_data = csv.reader(diabetes_file)

* At this point, the entire CSV file is treated as a table - a collection of rows and columns
* We can iterate (loop) through this table and get access to each individual row, just like the line-by-line above
* But CSV automatically splits it all into different values!

In [None]:
# open the diabetes file
with open("diabetes.csv", 'r') as diabetes_file:
    # Create a CSV reader 
    diabetes_data = csv.reader(diabetes_file)
    # loop over the file and print the row contents 
    for row in diabetes_data:
        print(row)
    

* You probably noticed that the `row` variable is just a list: it holds the values of that row, one per column.
* You can access individual columns exactly the same way you would access values in a list.
* For example, the value of cholesterol is in a column called 'chol', which is the second column and therefore has the index 1

In [None]:
# open the diabetes file
with open("diabetes.csv", 'r') as diabetes_file:
    # Create a CSV reader 
    diabetes_data = csv.reader(diabetes_file)
    # loop over the file and print the row contents 
    for row in diabetes_data:
        print(row[1]) # print only the values for the chol column

* You probably also noticed that the first row does not contain data; it's just the column headers
* In order for us to do any mathematical or statistical operations on the data, we need to EXCLUDE the header
* We can do this with the `next()` function, which returns the header row and moves the reader on to the first data row

In [None]:
# open the diabetes file
with open("diabetes.csv", 'r') as diabetes_file:
    # Create a CSV reader 
    diabetes_data = csv.reader(diabetes_file)

    # use next() on the reader to skip the header row; headers is a list of column names
    headers = next(diabetes_data)
    print(headers)

    # loop over the remaining lines of the file
    for row in diabetes_data:
        print(row[1]) # print only the values for the chol column
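Since calling `next()` on the reader gives the header row as a list, the position of a column can be looked up by name instead of being hard-coded. This is a sketch only: the sample data below is invented, and the column names are assumptions apart from 'chol' being the second column.

```python
import csv
import io

# Invented stand-in for diabetes.csv
sample_csv = io.StringIO("id,chol,age\n1000,203,46\n1001,165,29\n")
reader = csv.reader(sample_csv)

# The header row comes back as a list of column names
headers = next(reader)

# Find the position of the 'chol' column by name
chol_index = headers.index("chol")

chol_values = []
for row in reader:
    chol_values.append(row[chol_index])

print(headers)
print(chol_index, chol_values)
```

Note that the values are still strings at this point.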


## CSV files - Challenge 1

Calculate the _average_ and the _highest (max)_ cholesterol value based on the data available in the dataset.
This challenge will require you to do several things:
1. Open the file
2. Initialize the CSV reader
3. Skip the header row
4. Create variables for use in your calculation 
    * Hint: You'll need to store the max, sum, and number of values
5. Loop over every line and run calculations
    * Hint: Keep an eye on data types
    * Hint: Don't forget to check for missing values 

In [None]:
# Step 1: Import csv module
import csv

# Step 2: Read the csv file
with open("diabetes.csv", 'r') as diabetes_file:
    diabetes_data = csv.reader(diabetes_file)
    
    # Step 3: Separate the headers
    headers = next(diabetes_data)
    
    # Step 4: Create some variables
    
    
    
    # Step 5: Loop over the data and calculate the average and highest cholesterol value
    # Hint: you'll need to declare variables to store average and maximum cholesterol here (outside of the loop)
    # Hint: You might want to check and see if the data actually exists!
    for row in diabetes_data:
    
        # replace this code with your average and max calculation code
        print(row[1]) # print only the values for the chol column

## CSV files - Challenge 1 Solution

Calculate the _average_ and the _highest (max)_ cholesterol value based on the data available in the dataset.
This challenge will require you to do several things:
1. Open the file
2. Initialize the CSV reader
3. Skip the header row
4. Create variables for use in your calculation 
    * Hint: You'll need to store the max, sum, and number of values
5. Loop over every line and run calculations
    * Hint: Keep an eye on data types
    * Hint: Don't forget to check for missing values 

In [None]:
# Step 1: Import csv module
import csv

# Step 2: Read the csv file
with open("diabetes.csv", 'r') as diabetes_file:
    diabetes_data = csv.reader(diabetes_file)
    
    # Step 3: Separate the headers
    headers = next(diabetes_data)
    
    # Step 4: Create some variables
    count = 0 # Initialize a temporary counter
    total_chol = 0
    max_chol = 0
    
    
    # Step 5: Loop over the data and calculate the average and highest cholesterol value
    # Hint: you'll need to declare variables to store average and maximum cholesterol here (outside of the loop)
    # Hint: You might want to check and see if the data actually exists!
    for row in diabetes_data:
    
        # check to see if there is a data value
        if row[1]: # empty string resolves to False
            
            # convert to integer
            if row[1].isnumeric():
                chol = int(row[1])
                # tabulate the total chol
                total_chol = total_chol + chol
                count = count + 1
            
                # check if the current value is the max
                if  chol > max_chol:
                    # set the new max 
                    max_chol = chol

print("Total: " , total_chol)
print("Count: " , count)

avg_chol = total_chol / count

print("Average: ", avg_chol)
print("Max:", max_chol)
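An alternative way to reach the same result is to collect the valid values in a list first and then use the built-in `sum()`, `len()`, and `max()` functions. This is a sketch under stated assumptions, not the only correct solution: the data below is invented (including one missing value) and stands in for `diabetes.csv`.

```python
import csv
import io

# Invented stand-in for diabetes.csv; the second data row has no chol value
sample_csv = io.StringIO(
    "id,chol,age\n1000,203,46\n1001,,29\n1002,165,33\n1003,228,58\n"
)
reader = csv.reader(sample_csv)
next(reader)  # skip the header row

chol_values = []
for row in reader:
    # isnumeric() is False for empty strings, so missing values are skipped
    if row[1].isnumeric():
        chol_values.append(int(row[1]))

# Guard against dividing by zero when no valid values were found
if chol_values:
    avg_chol = sum(chol_values) / len(chol_values)
    max_chol = max(chol_values)
    print("Average:", avg_chol)
    print("Max:", max_chol)
```

The trade-off is memory: this version keeps every value in a list, while the running-total version above only keeps three numbers.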
            
        