# Catching up on Python



## Working with files line by line

* When a file has one record per line, especially a large file, you can loop over it line by line
* Just like iterating over lists, you can iterate over files
* Python reads up to and including the next "\n" and puts that line in the loop variable
* Useful for working with *extremely large* files because only one line is held in memory at a time

In [None]:
file_handler = open("test.txt", 'r')
for line in file_handler:
    print(line)
file_handler.close()
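
A common alternative to calling `close()` by hand is a `with` block, which closes the file when the block ends. The sketch below assumes a small throwaway file; its name and contents are made up so that the cell runs on its own.

```python
import os
import tempfile

# Write a tiny throwaway file so the example is self-contained
# (the file name and contents are made up for illustration).
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w") as out:
    out.write("first line\nsecond line\nthird line\n")

lines_seen = []
# The with block closes the file automatically, even if an error occurs.
with open(path, "r") as file_handler:
    for line in file_handler:
        # Each line still ends with "\n"; rstrip removes it before printing.
        clean = line.rstrip("\n")
        lines_seen.append(clean)
        print(clean)

# After the block ends, the file is already closed.
print(file_handler.closed)
```

Stripping the trailing newline avoids the blank line that `print()` would otherwise add after every line.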

### Reading Data Files

* A file handler is not the file itself; it is a pointer to the file
* This is how Python can work with HUGE files without loading them fully into memory
* We can process large files line by line (assuming there are multiple lines)
* Each line gets treated as a separate string
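
One way to see the "pointer" idea is to read a line, move the read position back, and read again. This is only a sketch: the file below is a hypothetical three-line file created for the demonstration.

```python
import os
import tempfile

# Hypothetical three-line file, created only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "pointer_demo.txt")
with open(path, "w") as out:
    out.write("line one\nline two\nline three\n")

with open(path, "r") as file_handler:
    first = file_handler.readline()   # reads the first line and moves forward
    second = file_handler.readline()  # continues from where the last read stopped
    file_handler.seek(0)              # move the read position back to the start
    again = file_handler.readline()   # the first line is returned a second time

# repr() makes the trailing "\n" on each string visible.
print(repr(first), repr(second), repr(again))
```

The handler only remembers where it is in the file, which is why each read returns the next piece rather than the whole contents.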

In [None]:
# use the unix command head to see the first 25 lines of the file
!head -n 25 diabetes.csv

* Let's count the lines of the file

In [None]:
# count the number of lines in the text file
file_handler = open('diabetes.csv', 'r')
count = 0
for line in file_handler:
    count = count + 1
    # shorthand for the same thing: count += 1
file_handler.close()
print(count)
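
The same count can be written more compactly with `sum()` and a generator expression. A minimal sketch, assuming a small made-up CSV file rather than the real `diabetes.csv`:

```python
import os
import tempfile

# Made-up stand-in for a data file: one header line plus three data lines.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as out:
    out.write("id,chol\n1,203\n2,165\n3,228\n")

with open(path, "r") as file_handler:
    # Add 1 for every line; still only one line is in memory at a time.
    count = sum(1 for line in file_handler)

print(count)
```

Note that the header line is counted too, so the number of data rows is one less than the printed value.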

## Reading in all the data

* Why don't we read every line of the file into memory as a list?

In [None]:
# create an empty list to store each line
data = [] 

# open the file and loop over it line by line
file_handler = open('diabetes.csv', 'r')
for line in file_handler:
    # use the append function to add each line
    data.append(line)
file_handler.close() # close the file handler now that we are done.

print("Length:", len(data))
print("First 10 lines:", data[0:10])
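
The loop-and-append pattern can also be written as a list comprehension. The sketch below assumes a small made-up file and strips the trailing newline from each line, which the loop above does not do.

```python
import os
import tempfile

# Made-up stand-in for a data file.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(path, "w") as out:
    out.write("id,chol\n1,203\n2,165\n3,228\n")

with open(path, "r") as file_handler:
    # Build the whole list in one expression, dropping each trailing "\n".
    data = [line.rstrip("\n") for line in file_handler]

print("Length:", len(data))
print("First 2 lines:", data[0:2])
```

Keep in mind that this holds every line in memory at once, so it gives up the one-line-at-a-time advantage for very large files.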


## Working With CSV Files

* CSV (comma-separated values) files store tabular data as plain text. Think of them as very simple spreadsheets, like Excel sheets with only the cell contents saved.
* Python has a CSV parser in its standard library: the `csv` module.
* The `csv` module provides functions and classes that make it easier to parse and iterate through CSV files.

In [None]:
#  load the CSV module 
import csv

# open the diabetes file
diabetes_file = open("diabetes.csv")

* Now we need to tell Python that the file stored in the `diabetes_file` variable should be read and interpreted as a CSV file.
* We do that by calling the `reader()` function of the `csv` module

In [None]:
# Create a CSV reader 
diabetes_data = csv.reader(diabetes_file)

* At this point, the CSV file can be treated as a table: a collection of rows and columns
* We can iterate (loop) through this table and access each row, just like the line-by-line loop above
* But the csv reader automatically splits each row into separate values!

In [None]:
# loop over the file and print the row contents
for row in diabetes_data:
    print(row)
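
To see why the csv reader is preferable to splitting each line on commas by hand, consider a value that itself contains a comma. This is a sketch using in-memory sample text; the column names and values are invented for illustration.

```python
import csv
import io

# Invented sample data: the name field contains a comma, so it is quoted.
text = 'id,name,chol\n1,"Smith, Jane",203\n'

# csv.reader understands the quotes and keeps the name as one value.
rows = list(csv.reader(io.StringIO(text)))
print(rows[1])

# A plain split on "," breaks the quoted name into two pieces.
naive = text.splitlines()[1].split(",")
print(naive)
```

The reader returns three values for the data row, while the naive split returns four and leaves the quote characters in place.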
    

* You probably noticed that the `row` variable is just a list: it holds the value from each column of that row.
* You can access individual columns exactly the same way you would access values in a list.
* For example, the cholesterol value is in a column called 'chol', which is the second column and therefore has an index of 1

In [None]:
# Since we already iterated through the CSV file once, we need to tell Python to start at the beginning again
# This action is called 'resetting the read position of the file object'
# It basically is like re-opening the file
diabetes_file.seek(0) 

for row in diabetes_data:
    print(row[1]) # print only the values for the chol column
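
If remembering column positions is inconvenient, the `csv` module also offers `csv.DictReader`, a different technique from the index-based access above: it uses the header row as keys so that columns can be looked up by name. A minimal sketch on invented in-memory data (only the `chol` column name comes from the text above):

```python
import csv
import io

# Invented sample data; only the 'chol' column name is taken from the lesson.
sample = io.StringIO("id,chol,glucose\n1,203,82\n2,165,97\n")

reader = csv.DictReader(sample)  # the first row becomes the dictionary keys
chol_values = []
for row in reader:
    # Look the value up by column name instead of by index.
    chol_values.append(row["chol"])
    print(row["chol"])

print(reader.fieldnames)
```

With `DictReader` the header row is consumed automatically, so the loop only sees data rows.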

* You probably also noticed that the first row does not contain data; it holds the column headers
* In order to do any mathematical or statistical operations on the data, we need to EXCLUDE the header
* We can skip the header row with the `next()` function, which reads one row and moves the read position past it

In [None]:
# One way to do this is with the next() function

diabetes_file.seek(0) # Reset the read position of the file object

# use next on the CSV reader to read (and so skip) the header row
headers = next(diabetes_data)
print(headers)

# now we can iterate through just the data values
for row in diabetes_data:
    print(row[1]) # print only the values for the chol column
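
When `next()` is called on the csv reader, the header comes back as a list of column names, which can then be used to find a column's index instead of hard-coding it. A sketch on invented in-memory data:

```python
import csv
import io

# Invented sample data; only the 'chol' column name is taken from the lesson.
sample = io.StringIO("id,chol,glucose\n1,203,82\n2,165,97\n")
reader = csv.reader(sample)

headers = next(reader)               # first row, already split into a list
chol_index = headers.index("chol")   # position of the 'chol' column

# The reader continues from the second row, so only data rows remain.
values = [row[chol_index] for row in reader]

print(headers)
print(chol_index)
print(values)
```

Looking the index up by name keeps the code working if the columns are ever reordered.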


## CSV files - Challenge 1

Calculate the _average_ and the _highest (max)_ cholesterol value based on the data available in the dataset.


In [None]:
# Step 1: Import csv module
import csv

In [None]:
# Step 2: Read the csv file
diabetes_file = open("diabetes.csv")
diabetes_data = csv.reader(diabetes_file)

In [None]:
# Step 3: Iterate through csv data

diabetes_file.seek(0) # Reset the read position of the file object
headers = next(diabetes_data) # read (and so skip) the header row

# Hint: you'll need to declare variables to store average and maximum cholesterol here (outside of the loop)
# Hint: You might want to check and see if the data actually exists!
for row in diabetes_data:
    
    # replace this code with your average and max calculation code
    print(row[1]) # print only the values for the chol column

## CSV files - Challenge 1 Solution

Calculate the _average_ and the _highest (max)_ cholesterol value based on the data available in the dataset.


In [None]:
# Step 1: Import csv module
import csv

In [None]:
# Step 2: Read the csv file
diabetes_file = open("diabetes.csv")
diabetes_data = csv.reader(diabetes_file)

In [None]:
# Step 3: Iterate through csv data

diabetes_file.seek(0) # Reset the read position of the file object
headers = next(diabetes_data) # read (and so skip) the header row

count = 0 # number of rows that have a chol value
total_chol = 0 # running sum of the chol values
max_chol = 0 # largest chol value seen so far (assumes values are not negative)

# The running totals are declared above, outside of the loop,
# and the loop checks that each value actually exists before using it
for row in diabetes_data:
    
    # make sure there is a data value in the row
    if row[1] != "":            
        # convert the value to a number
        chol = int(row[1])
            
        # tabulate the total chol
        total_chol = total_chol + chol
        count = count + 1
            
        # check if the current value is the max
        if chol > max_chol:
            # set the new max 
            max_chol = chol

print("Total:", total_chol)
print("Count:", count)

# guard against dividing by zero if no row had a chol value
if count > 0:
    avg_chol = total_chol / count
    print("Average:", avg_chol)
    print("Max:", max_chol)
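
An alternative to keeping running totals is to collect the values in a list and let the built-in `sum()` and `max()` functions do the work. This is a sketch on invented in-memory data, including one row with a missing value; it is not the real dataset.

```python
import csv
import io

# Invented sample data; row 3 has no chol value on purpose.
sample = io.StringIO("id,chol\n1,203\n2,165\n3,\n4,228\n")
reader = csv.reader(sample)
next(reader)  # skip the header row

# Keep only rows that have a chol value and convert each to a number.
chol_values = [int(row[1]) for row in reader if row[1] != ""]

total_chol = sum(chol_values)
max_chol = max(chol_values)   # raises ValueError if the list is empty
avg_chol = total_chol / len(chol_values)

print("Total:", total_chol)
print("Count:", len(chol_values))
print("Average:", avg_chol)
print("Max:", max_chol)
```

This version is shorter, but it stores every value in memory, whereas the running-total loop only keeps three numbers.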