# Using CSVs to get, use, and display data

## Using the csv module in the simplest way

Using the `csv` module in Python starts with a simple `import csv`, but there is a lot to learn about using it correctly. Programmers with some experience may get a lot out of a reference like [the official Python docs on the csv module](https://docs.python.org/3/library/csv.html), but for new programmers, I've included a fair number of examples below. The grand goal of this Notebook is to comb a ~5,500-line spreadsheet in less than a second and count how many spam messages the file contains.

But first, here is `csv` used in a simple way, on a List like the ones we saw on Friday:

In [1]:
import csv

practice_data = ["1, 2, 3", "2, 4, 6", "3, 6, 9"]

data_reader = csv.reader(practice_data)

for row in data_reader:
    print(', '.join(row))

1,  2,  3
2,  4,  6
3,  6,  9


Expected output for the code above is:

```
1,  2,  3
2,  4,  6
3,  6,  9
```

Making `csv.reader` read a List instead of a file is a bit against its purpose, but it shows more clearly where the data is coming from.
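The reason a List works at all is that `csv.reader` accepts any iterable that yields one string per row, not only open files. Here is a minimal sketch of that idea; the variable names and the use of `io.StringIO` as a file-like stand-in are my own choices for illustration:

```python
import csv
import io

# A plain List of strings, one string per row.
lines = ["1, 2, 3", "2, 4, 6"]
rows_from_list = list(csv.reader(lines))

# A file-like object holding the same text (stand-in for a real file).
file_like = io.StringIO("1, 2, 3\n2, 4, 6\n")
rows_from_file_like = list(csv.reader(file_like))

print(rows_from_list)                         # [['1', ' 2', ' 3'], ['2', ' 4', ' 6']]
print(rows_from_list == rows_from_file_like)  # True
```

Note that every field comes back as a string, and that the space after each comma stays attached to the field.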

## Using csv.reader on a file instead of a list

Copying and pasting data from a CSV file into a Python script is possible but not very efficient (especially when you have to rewrite it in List syntax!). Next I will demonstrate using `csv.reader` directly on a file, which is much closer to what it was created for.

In [3]:
import csv

# `with` opens the file and closes it automatically when the indented block ends
with open('test.csv', newline='') as csvfile:
    data_reader = csv.reader(csvfile)
    
    # each `row` is a List of strings; `.join` glues them together with ', ' between
    for row in data_reader:
        print(', '.join(row))

a,  b,  c
d,  e,  f
g,  h,  i
j,  k,  l


Expected output for the code above is:

```
a,  b,  c
d,  e,  f
g,  h,  i
j,  k,  l
```

If you view [test.csv](https://github.com/syreal17/pgss2020-corecs-jupyter-w2d1/blob/master/test.csv) on GitHub, you will see roughly the same thing just with better graphics.
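For anyone wondering about the `with` line in the cell above: `with open(...) as csvfile:` opens the file, gives it the name `csvfile` inside the indented block, and closes it automatically when the block ends. The sketch below shows this on a small throwaway file. The file name `sketch.csv` and its contents are made up so the example does not depend on `test.csv`:

```python
import csv
import os
import tempfile

# Create a small throwaway CSV file (hypothetical stand-in for test.csv).
path = os.path.join(tempfile.mkdtemp(), "sketch.csv")
with open(path, "w", newline="") as f:
    f.write("a, b, c\nd, e, f\n")

# Read it back; the file is only open inside this block.
with open(path, newline="") as csvfile:
    rows = list(csv.reader(csvfile))
    print(csvfile.closed)  # False: still inside the block

print(csvfile.closed)      # True: closed automatically after the block
print(rows)                # [['a', ' b', ' c'], ['d', ' e', ' f']]
```

Because the file is closed once the block ends, any reading has to happen inside the indented part, which is why the `for` loop in the cell above is indented under `with`.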

## Accessing the fields in CSV files

So now that we can read files, we can speed up a lot of work for ourselves! One important question remains: how can we access individual items in each row? Part of the answer lies in the `print(', '.join(row))` statement. I don't think `join` is the most intuitive function, but it just takes a List of items and puts them together with `', '` between them. Printing with commas between fields is not that important, but it is important to know that `row` in the code above is a Python List of field values. In code, the first row would be written like this: `row_1 = ["a", "b", "c"]`. Accessing an element in this List is easy: `row_1[0]` is `"a"`, `row_1[1]` is `"b"`, and so on.
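To make `join` and indexing concrete, here is a tiny sketch that uses the `row_1` List from the paragraph above, with no file involved. The name `joined` is just for illustration:

```python
row_1 = ["a", "b", "c"]

# join: put the items together with ', ' between them
joined = ', '.join(row_1)
print(joined)     # a, b, c

# indexing: counting starts at 0
print(row_1[0])   # a
print(row_1[1])   # b
print(row_1[-1])  # c  (a negative index counts from the end)
```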

Below I demonstrate printing only the last field of each row in the `test.csv` file:

In [4]:
import csv

with open('test.csv', newline='') as csvfile:
    data_reader = csv.reader(csvfile)
    
    for row in data_reader:
        print(row[2])

 c
 f
 i
 l
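Notice the space in front of each letter in the output above. The space that follows each comma in the file stays attached to the field. One way to drop it, sketched here on a List that stands in for the lines of `test.csv`, is the `skipinitialspace=True` option of `csv.reader`:

```python
import csv

# Stand-in for the first two lines of test.csv.
lines = ["a, b, c", "d, e, f"]

# Default behavior: the space after each comma is kept.
default_last = [row[2] for row in csv.reader(lines)]

# skipinitialspace=True ignores whitespace right after a delimiter.
trimmed_last = [row[2] for row in csv.reader(lines, skipinitialspace=True)]

print(default_last)  # [' c', ' f']
print(trimmed_last)  # ['c', 'f']
```

Calling `.strip()` on an individual field, as in `row[2].strip()`, is another option when only some fields need cleaning.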


## Challenge 1 (Auto-Hypotenuse)

Take a look at \[starter.csv\]. This file contains 10 rows of data with 2 fields each. The fields are the lengths of the legs of right triangles, with the first field being the length of side A and the second field being the length of side B. Please calculate the hypotenuse of all 10 triangles automatically by loading the CSV file and using the `a^2 + b^2 = c^2` formula for each triangle.

For reference: [Hypotenuse Wiki](https://en.wikipedia.org/wiki/Hypotenuse)

If you implement your program correctly, you should get the following values for the lengths of the hypotenuses:

```
5.0
18.439088914585774
46.69047011971501
98.73196037757987
656.4365925205572
11105.935079947118
1227.0309694543166
31365.50873172632
2.23606797749979
5.830951894845301
```

You will need to write `import math` at the top of your code, as well as `import csv`. Please Google any unfamiliar functions from the `math` module!

In [3]:
import csv
import math

with open('starter.csv', newline='') as csvfile:
    data_reader = csv.reader(csvfile)
    
    for row in data_reader:
        a = int(row[0])
        b = int(row[1])
        c = math.sqrt( math.pow(a, 2) + math.pow(b, 2) )
        print(c)

5.0
18.439088914585774
46.69047011971501
98.73196037757987
656.4365925205572
11105.935079947118
1227.0309694543166
31365.50873172632
2.23606797749979
5.830951894845301
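As a side note, the `math` module also has a `hypot` function that does the square-and-root step in one call. This sketch compares it with the formula on one made-up pair of legs (3 and 4), not on the data from the file:

```python
import math

a, b = 3, 4  # made-up leg lengths, not taken from starter.csv

# The formula written out by hand.
c_formula = math.sqrt(a ** 2 + b ** 2)

# The same calculation using math.hypot.
c_hypot = math.hypot(a, b)

print(c_formula, c_hypot)  # 5.0 5.0
```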


## Challenge 2 (Spam vs. Ham)

In the world of spam filtering, "ham" messages are good messages (actual messages between people) and "spam" messages are unwanted messages (generally computer generated messages targeting people for unsavory purposes).

In the data science world, data sets like these can be used to train automatic classifiers. That is probably how most big web email providers (like Gmail) filter spam so well, though they probably have a fair bit of proprietary magic as well.

The data set I acquired is called [spam.csv](https://github.com/syreal17/pgss2020-corecs-jupyter-w2d1/blob/master/spam.csv). Please view that file to work out how to compute the ratio of ham to spam in that data. Please note that spam messages can have unsavory content in them. Don't click any links, and please pardon any foul language.
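The counting itself can be sketched without the real file. The example below assumes each row starts with a label field that is either `ham` or `spam`, followed by the message text. That layout and the sample messages are assumptions made up for illustration, so check them against the actual `spam.csv` (which may also have a header row to skip):

```python
import csv
from collections import Counter

# Made-up stand-in rows; the real spam.csv may be laid out differently.
sample_lines = [
    "ham,See you at lunch?",
    "spam,WINNER!! Claim your prize now",
    "ham,Running late",
    'ham,"Ok, sounds good"',  # quotes keep the comma inside one field
]

# Count how many times each label appears in the first field.
label_counts = Counter()
for row in csv.reader(sample_lines):
    label_counts[row[0]] += 1

ham_to_spam = label_counts["ham"] / label_counts["spam"]
print(label_counts["ham"], label_counts["spam"], ham_to_spam)  # 3 1 3.0
```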

In [5]:
import csv
import math

with open('spam.csv', newline='') as csvfile:
    data_reader = csv.reader(csvfile)
    
    for row in data_reader:
        print(row[0])

UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 606-607: invalid continuation byte
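A `UnicodeDecodeError` like this means the file contains bytes that are not valid UTF-8, so Python cannot turn them into text with that encoding. The sketch below reproduces the problem on a tiny made-up file and reads it by passing a different `encoding` to `open`. Whether `spam.csv` is actually Latin-1 encoded is an assumption here, not something confirmed by the file:

```python
import csv
import os
import tempfile

# Hypothetical stand-in file: the byte 0xE9 is an accented "e" in Latin-1,
# but it is not valid UTF-8 when followed by a space.
path = os.path.join(tempfile.mkdtemp(), "sketch_latin1.csv")
with open(path, "wb") as f:
    f.write(b"ham,caf\xe9 at noon?\n")

# Reading as UTF-8 fails in the same way as the cell above.
utf8_failed = False
try:
    with open(path, newline="", encoding="utf-8") as csvfile:
        list(csv.reader(csvfile))
except UnicodeDecodeError:
    utf8_failed = True

# Reading as Latin-1 succeeds (assumed encoding, see the note above).
with open(path, newline="", encoding="latin-1") as csvfile:
    rows = list(csv.reader(csvfile))

print(utf8_failed)  # True
print(rows[0][0])   # ham
```

One caution: Latin-1 can decode any sequence of bytes without raising an error, so if the real encoding is something else, the read will succeed but some characters may come out wrong.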