# CSV notebook

This notebook shows you how to work with CSV files in Python. A CSV (comma-separated values) file stores a spreadsheet-style table as plain text: each line is one row, and the values within a row are separated by commas.

The first step is to import the `csv` module in Python:

In [1]:
import csv

# Reading CSVs

To read a CSV file, you first have to open it (using the `open` function). Next you create a `csv.reader` from the file you opened. Looping over the `csv.reader` object gives you the rows of your table one at a time.

In [2]:
# open the file (path may vary)
with open('star-wars.csv','r') as csvfile:
    # create the reader object
    reader = csv.reader(csvfile)
    # loop through the rows in the csv file
    for row in reader:
        print(row)
# Question: What data type are the rows?

# Working with the rows

If you answered 'list' to the previous question, you're right! Each row is a list, so we can access its elements with `row[0]`, `row[1]` etc. The elements are all strings (notice the quotes `''`) but we can convert the second column to a number: `float(row[1])`.
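
To try this without the data file, here is a minimal sketch that reads a small in-memory table. The header and character names are hypothetical stand-ins, since the actual contents of `star-wars.csv` are not shown here; only the pattern of indexing and converting matters.

```python
import csv
import io

# Hypothetical stand-in for star-wars.csv (header and names are made up)
sample_text = "name,forcepower\nLuke,100\nYoda,9001\n"

# io.StringIO lets csv.reader treat a string like an open file
rows = list(csv.reader(io.StringIO(sample_text)))

data_row = rows[1]                # first row after the header
print(type(data_row))             # each row is a list
print(data_row[0], data_row[1])   # both elements are strings
power = float(data_row[1])        # convert the second column to a number
print(power + 1)                  # arithmetic works after the conversion
```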

However, there is a problem: the first row is a header, so its second column contains text instead of a number, and `float` cannot convert it. We can skip the first row.

The following example reads the rows, skips the first row, converts the second column to numbers, and prints each value.

In [3]:
with open('star-wars.csv','r') as csvfile:
    reader = csv.reader(csvfile)
    # skip the first (header) row; the None default avoids an error on an empty file
    next(reader, None)
    # loop through the remaining rows
    for row in reader:
        forcepower = float(row[1])
        print(forcepower)
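
Adding the values up takes one extra variable. The following is a sketch that uses a hypothetical in-memory table in place of `star-wars.csv`: the numbers are the ones that appear in the mean calculation in this notebook, but the header and the single-letter names are invented for illustration.

```python
import csv
import io

# Hypothetical table: only the numbers come from this notebook
sample_text = """name,forcepower
A,100
B,100
C,200
D,50
E,50
F,500
G,1000
H,9001
"""

total = 0.0
reader = csv.reader(io.StringIO(sample_text))
next(reader)                 # skip the header row
for row in reader:
    total += float(row[1])   # convert, then accumulate
print(total)
```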

# Mean and standard deviation

Two commonly used statistics are the mean and standard deviation. The mean is simply the average. It is equal to the sum of all the values divided by the number of values:

**Mean** = $\bar{x} = \frac{1}{N} \sum_{k=0}^{N-1} x_k = \frac{1}{8} \left(100 + 100 + 200 + 50 + 50 + 500 + 1000 + 9001 \right) = \frac{11001}{8} \approx 1375$
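
As a quick check of the arithmetic, the same numbers can be averaged directly in Python. This sketch types the values in by hand rather than reading them from the file.

```python
values = [100, 100, 200, 50, 50, 500, 1000, 9001]

count = len(values)        # number of values
total = sum(values)        # sum of all the values
average = total / count    # true division gives a float in Python 3
print(count, total, average)   # 8 11001 1375.125
```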

The standard deviation tells you how much the values fluctuate about the mean. To calculate it, subtract the mean from each value, square each difference, add up the squares, divide the sum by $N-1$, and take the square root of the result:

**Std. Dev.** = $\sqrt{\frac{\sum_{k=0}^{N-1} (x_k - \bar{x})^2}{N-1}}$

As a first step, we will calculate the mean, since we need it in order to calculate the standard deviation (actually, we don't, but getting around it requires some mathematical trickery).
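
As an aside on that trickery: one common form is the identity $\sum_k (x_k - \bar{x})^2 = \sum_k x_k^2 - N\bar{x}^2$, which lets you accumulate the sum and the sum of squares in a single pass. The sketch below assumes this is the kind of shortcut meant (the text does not say which one) and uses hand-typed values instead of the file.

```python
from math import sqrt

values = [100, 100, 200, 50, 50, 500, 1000, 9001]

count = 0
total = 0.0        # running sum of the values
total_sq = 0.0     # running sum of the squared values
for x in values:   # a single pass over the data
    count += 1
    total += x
    total_sq += x ** 2

one_pass_mean = total / count
# sum of squared differences, rewritten so the mean is not needed inside the loop
sum_sq_diff = total_sq - count * one_pass_mean ** 2
one_pass_stddev = sqrt(sum_sq_diff / (count - 1))
print(one_pass_mean, one_pass_stddev)
```

This shortcut can lose precision when the mean is large compared to the spread of the values, so the two-pass approach is the safer default. The notebook continues with the two-pass version, starting with the mean.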

In [4]:
mean = 0. # the period makes this a float (a decimal number)
N = 0 # no period: the counter is an integer
with open('star-wars.csv','r') as csvfile:
    reader = csv.reader(csvfile)
    # skip the first (header) row; the None default avoids an error on an empty file
    next(reader, None)
    for row in reader:
        forcepower = float(row[1])
        mean = mean + forcepower # running sum of the values (not yet the mean)
        N = N + 1 # increment the number of values we've seen
mean = mean / N # last step: divide by number of values
print('The mean is:')
print(mean)
# Question: Do you know why we use a period for the values but not the counter?

# Calculating the standard deviation

In [5]:
stddev = 0. # running sum of squared differences (the period makes it a float)
with open('star-wars.csv','r') as csvfile:
    reader = csv.reader(csvfile)
    # skip the first (header) row; the None default avoids an error on an empty file
    next(reader, None)
    for row in reader:
        forcepower = float(row[1])
        stddev += (forcepower - mean)**2
from math import sqrt
stddev = sqrt(stddev / (N-1))
print('The stddev is:')
print(stddev)
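
As a cross-check, Python's standard `statistics` module provides `mean` and `stdev`, where `stdev` is the sample standard deviation with the same $N-1$ divisor used in this notebook. This sketch types the values in by hand (an assumption, so that it runs without the file); in the notebook you would collect each `float(row[1])` into a list instead.

```python
import statistics

values = [100.0, 100.0, 200.0, 50.0, 50.0, 500.0, 1000.0, 9001.0]

lib_mean = statistics.mean(values)      # arithmetic mean
lib_stddev = statistics.stdev(values)   # sample standard deviation (divides by N-1)
print(lib_mean)
print(lib_stddev)
```

If the hand-written loops and the library disagree by more than a tiny rounding error, the usual suspects are a header row that was not skipped or a divisor of $N$ instead of $N-1$.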