# Reading Text Files

### Working With CSV Files

CSV files store tabular data as plain text. Think of them as very simple spreadsheets (like Excel) in which each line is a row and only the content is stored, with no formatting or formulas.

The csv module is part of Python's standard library, and it allows Python to parse these types of files.

In [None]:
# To parse CSV files, we use the csv module. CSV stands for comma-separated values,
# where the comma is what is known as a "delimiter." The csv module provides a number of built-in
# functions and classes to make it easier to parse and iterate through CSV files.
import csv
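
As a small illustration of what the delimiter does, the sketch below feeds two made-up lines of text to `csv.reader` instead of a real file. The sample values are hypothetical. Note that a comma inside double quotes is treated as part of the value, not as a delimiter.

```python
import csv

# Two hypothetical lines of CSV text (not taken from diabetes.csv)
sample_lines = ['id,name,city', '1,"Smith, Jane",Boston']

# csv.reader accepts any iterable of strings, not only file objects
parsed_rows = list(csv.reader(sample_lines))

for row in parsed_rows:
    print(row)
```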

In [None]:
# Open the diabetes file. Note that when Python opens a data file and stores it in a variable,
# the variable DOES NOT actually contain the file's text. In the example below, the diabetes_file
# variable stores a file object, which Python uses to read the file's contents
diabetes_file = open("diabetes.csv")
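
The cell above leaves the file open until it is closed explicitly. A common alternative is a `with` block, which closes the file automatically. The sketch below writes a tiny temporary file with made-up values so that it runs on its own; the file name `sample.csv` and its contents are assumptions for illustration. The csv module documentation recommends passing `newline=""` when opening CSV files.

```python
import csv
import os
import tempfile

# Create a small throwaway CSV file (hypothetical data, not diabetes.csv)
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "w", newline="") as out_file:
    out_file.write("id,chol\n1,200\n2,180\n")

# The with block closes the file as soon as the indented code finishes
with open(sample_path, newline="") as sample_file:
    sample_rows = list(csv.reader(sample_file))

print(sample_rows)
print(sample_file.closed)  # True once the with block has ended
```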

In [None]:
# See what happens when we try to print the variable where the data file is stored
# Essentially, the file is treated as an OBJECT - we'll learn about objects next week
print(diabetes_file)

In [None]:
# Now we need to tell Python that the file stored in the diabetes_file variable should be read
# and interpreted as a CSV file.  We do that by calling the reader() function of the csv module
diabetes_data = csv.reader(diabetes_file)

In [None]:
# At this point, the entire CSV file is treated as a table - a collection of rows and columns
# We can iterate (loop) through this table and get access to each individual row
for row in diabetes_data:
    print(row)
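
One detail worth checking at this point: every value that `csv.reader` returns is a string, even when it looks like a number. The sketch below uses `io.StringIO` to stand in for an open file, with made-up values, and converts one value with `int()` before doing arithmetic.

```python
import csv
import io

# io.StringIO behaves like an open text file; the data is hypothetical
fake_file = io.StringIO("id,chol\n1000,203\n1001,165\n")
rows = list(csv.reader(fake_file))

first_data_row = rows[1]             # rows[0] is the header
print(first_data_row)                # a list of strings
print(type(first_data_row[1]))       # the value is a str, not an int

chol_value = int(first_data_row[1])  # convert before doing math
print(chol_value + 1)
```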
    

In [None]:
# You probably noticed that the row variable is just a list - it is a list of the values contained in each column.
# You can access individual columns exactly the same way you would access values in a list.
# For example, the value of cholesterol is in a column called 'chol', which is the second column and
# therefore has the index of 1

# Since we already iterated through the CSV file once, we need to tell Python to start at the beginning again
# This action is called 'resetting the read position of the file object'
diabetes_file.seek(0) 

for row in diabetes_data:
    print(row[1]) # print only the values for the chol column
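
To see why the reset is needed, the sketch below loops over the same reader three times. It uses `io.StringIO` with made-up values in place of diabetes.csv. The second pass returns nothing, and the reader yields rows again only after `seek(0)` moves the read position back to the start.

```python
import csv
import io

fake_file = io.StringIO("id,chol\n1000,203\n1001,165\n")  # hypothetical data
reader = csv.reader(fake_file)

first_pass = [row[1] for row in reader]
second_pass = [row[1] for row in reader]  # the file is already at the end

fake_file.seek(0)                         # reset the read position
third_pass = [row[1] for row in reader]

print(first_pass)
print(second_pass)
print(third_pass)
```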

In [None]:
# You probably also noticed that the first row does not contain data - it's just the column headers
# In order for us to do any mathematical or statistical operations on the data, we need to EXCLUDE the header
# One way to do this is with a counter variable

cnt = 0 # Initialize a temporary counter
diabetes_file.seek(0) # Reset the read position of the file object

for row in diabetes_data:
    # We will only get the values when the counter is greater than zero.
    # Because we initialized the counter to zero above, the first row will be 
    # excluded.  In order for this to work, it is critical to increment 
    # the counter by one outside of the if statement but inside of the loop
    if cnt > 0:
        print(row[1]) # print only the values for the chol column
    cnt = cnt + 1 # Increment the counter by one
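
The counter works, but there are other ways to skip the header. One option, sketched below with made-up values in `io.StringIO`, is to call the built-in `next()` on the reader once before the loop, which returns the header row and moves past it.

```python
import csv
import io

fake_file = io.StringIO("id,chol\n1000,203\n1001,165\n")  # hypothetical data
reader = csv.reader(fake_file)

header = next(reader)  # read the first row and move past it
print(header)

chol_values = []
for row in reader:     # the loop now starts at the first data row
    chol_values.append(row[1])

print(chol_values)
```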

**CSV files - Challenge 1**

Calculate the _average_ and the _highest (max)_ cholesterol value based on the data available in the dataset.


In [None]:
# Step 1: Import csv module
import csv

In [None]:
# Step 2: Read the csv file
diabetes_file = open("diabetes.csv")
diabetes_data = csv.reader(diabetes_file)

In [None]:
# Step 3: Iterate through csv data
cnt = 0 # Initialize a temporary counter
diabetes_file.seek(0) # Reset the read position of the file object

# Hint: you'll need to declare variables to store average and maximum cholesterol here (outside of the loop)
for row in diabetes_data:
    if cnt > 0:
        ################################################################################################
        # This is where you need to complete the logic for calculating average and maximum cholesterol
        ################################################################################################
        
        print(row[1]) # print only the values for the chol column
    cnt = cnt + 1 # Increment the counter by one

**CSV files - Challenge 1 Solution**

In [None]:
# Step 1: Import csv module
import csv

In [None]:
# Step 2: Read the csv file
diabetes_file = open("diabetes.csv")
diabetes_data = csv.reader(diabetes_file)

In [None]:
# Step 3: Calculate average cholesterol

cnt = 0 # Initialize a temporary counter
diabetes_file.seek(0) # Reset the read position of the file object
total = 0 # This variable will hold the sum of all cholesterol values

for row in diabetes_data:
    if row[1] != "":
        if cnt > 0:
            total = total + int(row[1])
        cnt = cnt + 1 # Increment the counter by one
        
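# cnt also counted the header row, so subtract one to get the number of data rows
num_values = cnt - 1

print("Total:", total)
print("Count:", num_values)

avg_chol = total / num_values

print("Average:", avg_chol)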
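
The same kind of average can also be computed by first collecting the numeric values in a list and then using the built-in `sum()` and `len()` functions. This is a sketch with made-up values, including one empty cell, rather than the real diabetes data.

```python
import csv
import io

# Hypothetical data; the third data row has an empty chol value
fake_file = io.StringIO("id,chol\n1000,203\n1001,165\n1002,\n1003,228\n")
reader = csv.reader(fake_file)
next(reader)  # skip the header row

# Keep only non-empty values and convert them to integers
chol_values = [int(row[1]) for row in reader if row[1] != ""]

avg_chol = sum(chol_values) / len(chol_values)
print("Average:", avg_chol)
```

Because the header is skipped before the list is built, `len(chol_values)` counts only rows that contributed a value.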

In [None]:
# Step 4: Calculate maximum cholesterol

cnt = 0 # Initialize a temporary counter
diabetes_file.seek(0) # Reset the read position of the file object
max_chol = 0 # This variable will hold the largest cholesterol value seen so far

for row in diabetes_data:
    if row[1] != "":
        if cnt > 0:
            # Every time through the loop (for every row that contains a value)
            # we compare the value from the data with the value stored in 
            # max_chol variable.  
            # If the value from the data is larger, we set max_chol to that larger value
            # After the loop finishes running, the largest value will be stored in max_chol
            if max_chol < int(row[1]):
                max_chol = int(row[1])
        cnt = cnt + 1 # Increment the counter by one
        

print("Maximum cholesterol: ", max_chol)
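
As an alternative sketch, `csv.DictReader` uses the header row as dictionary keys, so a column can be looked up by name instead of by index, and the built-in `max()` replaces the manual comparison. The data below is made up; only the column name `chol` comes from the text above.

```python
import csv
import io

# Hypothetical data; the third data row has an empty chol value
fake_file = io.StringIO("id,chol\n1000,203\n1001,165\n1002,\n1003,228\n")
dict_reader = csv.DictReader(fake_file)  # header row becomes the dictionary keys

# Look up the column by name, skip empty cells, and convert to integers
chol_values = [int(row["chol"]) for row in dict_reader if row["chol"] != ""]

max_chol = max(chol_values)
print("Maximum cholesterol:", max_chol)
```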