# Week 1 - Python Basics and Files

**Book Suggestion:** Data Wrangling with Python: Tips and Tools to Make Your Life Easier
<img align="right" style="padding-right:10px;" src="figures/Data_Wrangling_Book.jpg" width=200><br>
Authors: Jacqueline Kazil and Katharine Jarmul<br>
Publisher: O'Reilly<br>
ISBN-13: 978-1491948811<br>
ISBN-10: 1491948817<br>

Sadly, this book uses Python 2 and hasn't been updated to Python 3. With the imminent demise of Python 2, we cannot in good conscience require this book as the "official" textbook. Even so, there is a lot of good information here and we will be loosely following the book's organization and examples, translated to Python 3. <br>

**Optional Reading:** DWP Chapters 1 - 3 (pages 1 - 73)<br>

**Code & Data Repository:** https://github.com/jackiekazil/data-wrangling

**Outline:**<br>

* Documenting your work
    1. Markdown
    2. Markdown vs. Code Comments
    3. Know your audience
    4. FTEs as examples
* Python
    1. Python 3
    2. Installing (Anaconda)
    3. Jupyter notebooks / labs
    4. Data Structures
        * List
        * Dictionary
        * List of dictionaries
    5. Working With Files in Python
        * Location, location, location
        * Context managers and text files
        * CSVs the DIY way
* File formats
    1. CSV
        * Manually
        * Using Pandas
    2. JSON


# Documenting Your Work

We have heard from many employers and past students that the ability to document one's work is a vital skill. As such, we will expect you to document your deliverables in an intelligent and thoughtful manner. To that end, we (humbly) suggest using these From the Experts pages as examples of good documentation. 

## Markdown

Markdown is a lightweight, semi-standard syntax for formatting text that is used in Jupyter notebook pages, GitHub README files, and many other places. Jupyter Lab has a Markdown cheatsheet and GitHub has a nice introductory page at https://guides.github.com/features/mastering-markdown/

## Markdown vs. Code Comments

Students frequently ask when they should use Markdown and when to use code comments. The rule of thumb used on these pages is: 

* Markdown is used for exposition, that is, explaining "bigger picture" concepts such as 
    * background motivating the project 
    * algorithms 
    * results and graphs <---- VERY IMPORTANT
    * etc. Basically, anything that isn't code
* Comments are used as fine-grained detail about the code itself:
    * Purpose of a variable
    * Explanation of a calculation
    * Describing function parameters or return values
    * etc.
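
As a small illustration of the comment side of that rule, here is a sketch of a commented function. The function name `average_weight` and the sample data are made up for this example; the point is only to show where comments describe a variable, a calculation, and a function's parameters and return value.

```python
def average_weight(weights):
    """Return the mean of a list of weights.

    Parameters:
        weights -- list of numbers
    Returns:
        float -- the arithmetic mean, or 0.0 for an empty list
    """
    # Guard against dividing by zero when the list is empty
    if not weights:
        return 0.0
    # Mean = sum of the values divided by how many values there are
    return sum(weights) / len(weights)


# Hypothetical sample data: weights of three bags of apples
bag_weights = [1, 4, 2]
mean_weight = average_weight(bag_weights)
print(mean_weight)
```

The "why" behind the calculation (for example, what the average is used for in the project) would still belong in a Markdown cell.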
    
## Know Your Audience

Just as when writing a paper or giving a presentation, it is important to direct project documentation to the appropriate level for an audience. For example, a C-level executive (CEO, CIO, etc.) will probably not have the specialized knowledge necessary to understand the code, computations, and graphs you hand in as deliverables without a generous amount of explanation as to **why** the results are (or are not) significant. If you were to just throw a printout of code and graphs on a CTO's desk, at best you would be told to redo the report...

For purposes of this class, assume your audience is a C-level executive who DOES NOT read code. You will be expected to document your project well enough that this fictitious executive can understand the reason you took the steps you did and the results you got. **Failure to adhere to this standard will be grounds for rejecting the assignment or severe point penalties.**

## FTEs as Examples

Notice that this document, which was written as a Jupyter notebook page and is available for download, has sections with headers for organization. Also notice (farther down) that there are hundreds of words of explanatory text before the first line of Python is written. 

# Python
## Python 3

Python 2 can now measure its (supported) life span in months. January 1, 2020 is its end of life. While we understand there is a lot of legacy Python 2 code in use, the way forward is Python 3. If you end up needing the techniques taught in this course in a Python 2 environment, the book listed above is a good resource and it really is fairly easy to translate backwards from Python 3 to 2 (or, better yet, learn how to write code that works for both). 

## Installing (Anaconda)

<img align="right" style="padding-right:10px;" src="figures/No_Python_2.png" width="200">

Even though [Python.org](https://www.python.org/) maintains the "official" Python distribution (many times called CPython because the interpreter is written in C), [Anaconda's](https://www.anaconda.com/) packaging of Python is heavily customized for scientific and data science applications. We strongly recommend using this version.


## Jupyter notebooks / labs

Jupyter notebook pages have become the de facto standard for the data wrangling and exploratory data analysis portions of data science. While we won't go so far as to say that knowing at least the basics is a *mandatory* skill, we feel it is very important. To that end, **all lab deliverables will be Jupyter notebook pages (.ipynb files).** 

**Jupyter notebook** is a standalone "server" for notebook pages. Using the notebook server, you can open individual notebook pages in browser tabs and work on them. 

**Jupyter Lab** is an Integrated Development Environment (IDE) for Jupyter notebook that also has an integrated editor, terminal window, the ability to open and arrange multiple notebook pages, and more. The course's author used Jupyter Lab for all phases of development of this course and highly recommends it. 

Anaconda Python has Jupyter notebook and Jupyter Lab pre-installed. 

<hr>

## Data Structures

There are two main data structures (data containers) we use in Python. 

* List
* Dictionary

Both of these can hold any type of data (integers, floating point numbers, strings, objects, etc.). The only difference noticeable to the casual observer is how we access the data. 

### List

To paraphrase William Shakespeare, 

> "An array by any other name, would still smell as sweet." 

OK, maybe that didn't work too well. The point is, lists in Python are essentially the same thing as arrays in other languages. Later on this page we will build a `lines` list that holds a bunch of strings, but a list could just as easily hold integers, objects, or a mix of them all. Let's do a quick review:

First, a list containing the first 10 integers in our number system:

In [1]:
# range(x, y) generates sequential integers starting at x and stopping just before y (y itself is not included)
ints = list(range(0,10))
ints

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

A list element can be accessed with the name of the list (ints, in this case) and an index number. For example, the first element is always at 0:

In [2]:
ints[0]

0

In this case, it is coincidental that the data and index number are the same. 

Let's ask the list how many elements it has:

In [3]:
len(ints)

10

The `len()` function will tell us the size of many types of data or data structure. It is a very important and useful function. Keep it in mind.
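
To back up that claim, here is a quick sketch of `len()` applied to a few other kinds of data. The variable names and values are made up for this illustration.

```python
word = 'apples'
pairs = {'fruit': 'apples', 'weight': 5}
point = (3, 4, 5)
empty = []

print(len(word))    # number of characters in a string
print(len(pairs))   # number of key/value pairs in a dictionary
print(len(point))   # number of items in a tuple
print(len(empty))   # an empty list has length 0
```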

To prove that the index and data are not coupled in any way, let's reverse the list and then iterate through it, printing index and data at every step:

In [4]:
# Remember, a list slice's parameters are [start index : stop index : step size]
# Blanks in the first two places mean use the defaults, which cover the whole list.
# -1 for step size means start at the back and move toward the front, element by element.
# Finally, we store the reversed output in a new variable.
rev_ints = ints[::-1]
rev_ints

[9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
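
A few more slices may make the `[start : stop : step]` pattern concrete. This is a small sketch that rebuilds `ints` so it can run on its own; the other variable names are just for illustration. Note that the element at the stop index is never included.

```python
ints = list(range(0, 10))

first_three = ints[0:3]    # start at index 0, stop before index 3
middle = ints[3:7]         # indexes 3, 4, 5 and 6
evens = ints[::2]          # every second element, from the start
last_two = ints[-2:]       # negative indexes count from the end
backwards = ints[::-1]     # step of -1 walks the list from back to front

print(first_three)  # [0, 1, 2]
print(middle)       # [3, 4, 5, 6]
print(evens)        # [0, 2, 4, 6, 8]
print(last_two)     # [8, 9]
print(backwards)    # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
```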

Remember, Python's `for` loops work like "for each" loops in other languages. Think of it like this:

```
for each ITEM in CONTAINER:
    do some action on ITEM
```

In this case, we will use `enumerate()` to count the elements as we iterate through the list. Conceptually, 

```
for each INDEX and ITEM in REV_INTS:
    print(INDEX, ITEM)
```


In [5]:
print('index\titem')
print('----\t----')
for index, item in enumerate(rev_ints):
    print(f'{index}\t{item}')

index	item
----	----
0	9
1	8
2	7
3	6
4	5
5	4
6	3
7	2
8	1
9	0


### Dictionary

Conceptually, dictionaries are similar to lists in that they have unique indices (called *keys*) that reference elements (called *values*). For example:

In [6]:
ex_d = {'fruit':'apples', 'weight':5, 'color':'red'}
ex_d

{'fruit': 'apples', 'weight': 5, 'color': 'red'}

Keys are used like list indices:

In [7]:
ex_d['fruit']

'apples'

In [8]:
ex_d['weight']

5

Where life starts to get interesting is when the **value** is another data structure, like a list:

In [9]:
ex_d['weight'] = [1,4,2]

In [10]:
ex_d['weight']

[1, 4, 2]

In [11]:
ex_d

{'fruit': 'apples', 'weight': [1, 4, 2], 'color': 'red'}

**Note:** Keys don't have to be strings but the fact that they *can be strings* is a major motivation to use dictionaries. 

It is also worth noting that the JSON data format we look at lower down on the page is, in effect, a dictionary. This is *very* convenient.
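
Here is a brief sketch of a few other everyday dictionary operations, using a fresh copy of the example dictionary: checking for a key, reading with a default, adding a key, and looping over key/value pairs. The `'price'` key and the `grid` dictionary are invented for this illustration.

```python
ex_d = {'fruit': 'apples', 'weight': 5, 'color': 'red'}

# Test whether a key exists before using it
has_color = 'color' in ex_d

# .get() returns a default instead of raising KeyError for a missing key
price = ex_d.get('price', 'unknown')

# Assigning to a new key adds it to the dictionary
ex_d['price'] = 1.25

# Keys do not have to be strings; numbers and tuples work too
grid = {(0, 0): 'origin', 1: 'one'}

# .items() yields each key together with its value
for key, value in ex_d.items():
    print(f'{key}\t{value}')
```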

### List of dictionaries

Let's imagine for a moment that we have a file in which every line has the same categories of data: for example, a bunch of scientists, where for each scientist we know the name, date born, date died, age, and field of study. Each scientist could be contained in a discrete dictionary, and the dictionaries could be stored in a list. 

In [12]:
sc1 = {'name':'Rosaline Franklin', 'Born':'1920-07-25','Died':'1958-04-16', 'Age':'37', 'Field':'Chemist'}
sc1

{'name': 'Rosaline Franklin',
 'Born': '1920-07-25',
 'Died': '1958-04-16',
 'Age': '37',
 'Field': 'Chemist'}

In [13]:
sc2 = {'name':'William Gosset', 'Born':'1876-06-13','Died':'1937-10-16', 'Age':'61', 'Field':'Statistician'}
sc2

{'name': 'William Gosset',
 'Born': '1876-06-13',
 'Died': '1937-10-16',
 'Age': '61',
 'Field': 'Statistician'}

In [14]:
scientists = [sc1, sc2]
scientists

[{'name': 'Rosaline Franklin',
  'Born': '1920-07-25',
  'Died': '1958-04-16',
  'Age': '37',
  'Field': 'Chemist'},
 {'name': 'William Gosset',
  'Born': '1876-06-13',
  'Died': '1937-10-16',
  'Age': '61',
  'Field': 'Statistician'}]

Looking at `scientists`, we can tell it is a list because square brackets enclose the data. We could also use the `type()` function:

In [15]:
type(scientists)

list

If we ask `scientists` how many elements it has, we should get 2.

In [16]:
len(scientists)

2

And, let's look at the first element:

In [17]:
scientists[0]

{'name': 'Rosaline Franklin',
 'Born': '1920-07-25',
 'Died': '1958-04-16',
 'Age': '37',
 'Field': 'Chemist'}

Looks like a dictionary. Let's verify that:

In [18]:
type(scientists[0])

dict

Now, the tricky part. What if we wanted to get to the data held in the `name` key of the first element?

Well, we know that when we use an index with the list, we get the thing at the index, which happens to be a dictionary. So let's just tack on the key and see what happens:

In [19]:
scientists[0]['name']

'Rosaline Franklin'

In [20]:
scientists[1]['name']

'William Gosset'

Not the most intuitive code but **extremely** useful, as we will see. 

As a side note, this list-of-dictionaries structure is used often enough that there are several libraries one can install to help make using them easier.
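
To hint at why this structure is so useful, here is a short sketch that loops over a list of dictionaries to pull out one "column" and to filter "rows." It rebuilds the two-element list with the same keys used above so that it runs on its own; the names `names` and `over_50` are just for illustration.

```python
scientists = [
    {'name': 'Rosaline Franklin', 'Born': '1920-07-25', 'Died': '1958-04-16',
     'Age': '37', 'Field': 'Chemist'},
    {'name': 'William Gosset', 'Born': '1876-06-13', 'Died': '1937-10-16',
     'Age': '61', 'Field': 'Statistician'},
]

# A "column": the same key taken from every dictionary in the list
names = [sc['name'] for sc in scientists]

# A "row" filter: keep only the dictionaries that pass a test.
# Age is stored as a string, so convert it to int before comparing.
over_50 = [sc for sc in scientists if int(sc['Age']) > 50]

print(names)
print(over_50[0]['name'])
```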

<hr>

## Working With Files in Python

### Location, location, location
As in the real estate business, when working with files in any programming language, the location of the file is one of the most important aspects. 

**It is vitally important that you understand the file system of the computer you are working on!**

When working with files in a programming language, you must be able to specify the *path* to the file. The examples in these From the Experts pages all use data files stored in a subdirectory that is unsurprisingly called "data." You do not have to organize your work that way; just be aware that if you download the actual notebook pages to run the examples (HIGHLY ENCOURAGED), you *may* have to adjust the paths.
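
If you are unsure where Python is looking for files, a sketch like the following can help. It uses the standard library's `pathlib` module to show the current working directory and to build the relative path used on this page. Whether the file actually exists depends on your own setup, so the sketch checks before assuming anything.

```python
from pathlib import Path

# The directory that relative paths are resolved against
cwd = Path.cwd()
print(f'Working directory: {cwd}')

# Build the path piece by piece; pathlib picks the right separator
# for the operating system
data_file = Path('data') / 'shakespeare.txt'
print(f'Relative path: {data_file}')
print(f'Absolute path: {data_file.resolve()}')

# Check before opening so a wrong path gives a clear message
if data_file.exists():
    print('Found the file.')
else:
    print('File not found. Check the path or your working directory.')
```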

As a warmup, we will read a simple text file -- the Shakespeare compendium from Peter Norvig's natural language processing page that we've used in several other classes. The URL is https://norvig.com/ngrams/shakespeare.txt



<hr>

### Context managers and text files
Python provides **context managers**, which are used through the `with` statement. A `with` line opens a context, and everything indented below that line runs inside it. When the block ends, even because of an error, cleanup happens automatically: for example, an open file is closed. Variables assigned inside the block remain available afterward; only the managed resource (here, the file) is released. <br>
<br>
Context managers are particularly useful for reading and writing files because you don't have to worry about closing the file when you are done. In the example below, we are going to treat the file as a container full of lines (an iterable) and use a for loop to read line by line. Since there are a lot of lines, we'll stuff them into a list (rather than print them out).

In [21]:
# with open starts a context and opens a file. 
# the variable after the word as is our handle to the file.
lines = [] # List to hold the lines. 
with open ('data/shakespeare.txt', 'r') as infile:    
    for line in infile:
        lines.append(line)

# out of context. the file is closed.
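
To see what the `with` block does for us, the sketch below writes a small temporary file, reads it back inside a `with` block, and then checks the file handle's `closed` attribute. The temporary file and its contents are made up for this demonstration. Note that names assigned inside the block, such as `demo_lines`, are still usable after the block ends.

```python
import os
import tempfile

# Create a small throwaway file so the example runs anywhere
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, 'demo.txt')
with open(tmp_path, 'w') as outfile:
    outfile.write('first line\nsecond line\n')

# Read it back; the file is open only inside the with block
with open(tmp_path, 'r') as infile:
    demo_lines = [line for line in infile]
    print(infile.closed)   # False: still inside the block

print(infile.closed)       # True: closed automatically on exit
print(demo_lines)          # the list is still available after the block

# Clean up the throwaway file and directory
os.remove(tmp_path)
os.rmdir(tmp_dir)
```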

OK, that seemed to go well. Let's check the size of our list to see how many lines of text were read in.

In [22]:
print(f'lines has {len(lines)} lines in it.')

lines has 129107 lines in it.


129,107 lines of text are too many to display. Let's use the `enumerate()` function along with a `for` loop to look at a *few* lines.

In [23]:
# enumerate() yields a (count, item) pair each time through the loop, so we unpack it into two variables.
# In this case, I'm using the variable "c" (short for "count") to hold the number.
for c,line in enumerate(lines):
    if c < 10:      # 10 is an arbitrary number
        print(line)
    else:
        break
    

A MIDSUMMER-NIGHT'S DREAM



Now , fair Hippolyta , our nuptial hour 

Draws on apace : four happy days bring in 

Another moon ; but O ! methinks how slow 

This old moon wanes ; she lingers my desires ,

Like to a step dame , or a dowager 

Long withering out a young man's revenue .



Four days will quickly steep themselves in night ;



Or, we could look at a slice of the list, like so:

In [24]:
lines[:10]   # output 10 list elements starting at the default (first element)

["A MIDSUMMER-NIGHT'S DREAM\n",
 '\n',
 'Now , fair Hippolyta , our nuptial hour \n',
 'Draws on apace : four happy days bring in \n',
 'Another moon ; but O ! methinks how slow \n',
 'This old moon wanes ; she lingers my desires ,\n',
 'Like to a step dame , or a dowager \n',
 "Long withering out a young man's revenue .\n",
 '\n',
 'Four days will quickly steep themselves in night ;\n']

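Notice how the output is different? Each line read from the file still ends with the end-of-line character `\n`. The first way used `print()`, which renders that `\n` as a line break and then adds a newline of its own, which is why there is a blank line after every line of text. Slicing the list displays the raw content of each element, so the `\n` shows up as a visible escape sequence. 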
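
The difference can be reproduced with a single string. In this sketch, `repr()` shows the raw content of the string (which is what the notebook displays for list elements), while `rstrip()` removes the trailing newline so that `print()` no longer produces a blank line. The sample string is copied from the output above.

```python
line = 'Now , fair Hippolyta , our nuptial hour \n'

print(line)          # the '\n' is rendered, then print() adds its own newline
print(repr(line))    # shows the escape sequence instead of rendering it

# Remove only the trailing newline, leaving the other characters alone
stripped = line.rstrip('\n')
print(stripped)      # no blank line follows this time
```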

### CSVs the DIY way

Let's pretend for a moment that we didn't have Pandas or the CSV library and we have that list of scientists in a file. **The first step should _always_ be to look at the data in a programmer's text editor (Atom, VS Code, SublimeText, vim, etc.).**

In this case, assume we've looked at the file and seen that it is a normal CSV file with a header row and no surprises. Let's read it in like a text file first.

In [25]:
csv_text = []
with open('data/scientists.csv', 'r') as infile:
    for line in infile:
        csv_text.append(line)
        
csv_text

['Name,Born,Died,Age,Occupation\n',
 'Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist\n',
 'William Gosset,1876-06-13,1937-10-16,61,Statistician\n',
 'Florence Nightingale,1820-05-12,1910-08-13,90,Nurse\n',
 'Marie Curie,1867-11-07,1934-07-04,66,Chemist\n',
 'Rachel Carson,1907-05-27,1964-04-14,56,Biologist\n',
 'John Snow,1813-03-15,1858-06-16,45,Physician\n',
 'Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist\n',
 'Johann Gauss,1777-04-30,1855-02-23,77,Mathematician\n']

OK, no surprises. Let's slice and dice the data into our favorite data structure.

First, let's pull out the header and split it up:

In [26]:
header = csv_text[0].split(",")

In [27]:
header

['Name', 'Born', 'Died', 'Age', 'Occupation\n']

Notice how the end of line is still on the last element? Since the list is short, we can deal with it easily. 

Since Python is 0 based, the last index in a list is always one **smaller** than the length. Observe:

In [28]:
len(header)

5

In [29]:
header[4]

'Occupation\n'

So, to correct it programmatically, use `len(header) - 1` as the index of the last element:

In [30]:
last = len(header) - 1
header[last] = header[last].strip()
header

['Name', 'Born', 'Died', 'Age', 'Occupation']
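
Python also accepts negative indexes, which count from the end of the list, so the same fix can be written without computing the length. A minimal sketch, starting again from the unstripped header line (the name `header_alt` is just for illustration):

```python
header = 'Name,Born,Died,Age,Occupation\n'.split(',')

# header[-1] is the last element, whatever the length of the list
header[-1] = header[-1].strip()
print(header)

# Alternatively, strip the whole line first and then split it
header_alt = 'Name,Born,Died,Age,Occupation\n'.strip().split(',')
print(header_alt)
```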

There are several ways to convert each line to a dictionary with the strings in `header` as the keys. The most direct way is to use the function called `zip()`, which takes two lists and pairs up their items position by position. If we then pass that result to `dict()`, it will be converted to a dictionary for us. Here is a simple example to start with:

In [31]:
list1 = ['apple','orange','pear']
list2 = ['red','orange','yellow']
dict(zip(list1, list2))

{'apple': 'red', 'orange': 'orange', 'pear': 'yellow'}
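
Two details of `zip()` are worth knowing before relying on it. It pairs items by position, and it stops at the end of the shorter input without any warning. A small sketch with made-up lists, where one value is deliberately missing:

```python
keys = ['Name', 'Born', 'Died']
values = ['John Snow', '1813-03-15']   # one value short on purpose

# zip() produces pairs (tuples) matched by position
pairs = list(zip(keys, values))
print(pairs)

# The unmatched key 'Died' is silently dropped
record = dict(zip(keys, values))
print(record)
print('Died' in record)
```

This matters for CSV work: a data line with a missing field would produce a dictionary with a missing key rather than an error.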

**BUT** we have a slight problem: the data lines of csv_text (index 1 to the end) are not broken up. Each one is still a string:

In [32]:
csv_text[1:]

['Rosaline Franklin,1920-07-25,1958-04-16,37,Chemist\n',
 'William Gosset,1876-06-13,1937-10-16,61,Statistician\n',
 'Florence Nightingale,1820-05-12,1910-08-13,90,Nurse\n',
 'Marie Curie,1867-11-07,1934-07-04,66,Chemist\n',
 'Rachel Carson,1907-05-27,1964-04-14,56,Biologist\n',
 'John Snow,1813-03-15,1858-06-16,45,Physician\n',
 'Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist\n',
 'Johann Gauss,1777-04-30,1855-02-23,77,Mathematician\n']

But, we know we can use the `split()` function on strings to break it into a list:

In [33]:
csv_text[1].split(',')

['Rosaline Franklin', '1920-07-25', '1958-04-16', '37', 'Chemist\n']

Actually, that's the last thing we need to know. Let's put this in pseudocode:

```
scientists = new list container
for each item in csv_text[1:]:  # so, each item is one of our scientist strings
    data_line = split item into a list at ','
    data_dictionary = zip header and data_line then convert to dictionary
    append data_dictionary to end of scientists
```

That should do it. Let's try it out:
    

In [34]:
scientists = []
for item in csv_text[1:]:
    data_line = item.split(',')
    data_dictionary = dict(zip(header, data_line))
    scientists.append(data_dictionary)
    
scientists

[{'Name': 'Rosaline Franklin',
  'Born': '1920-07-25',
  'Died': '1958-04-16',
  'Age': '37',
  'Occupation': 'Chemist\n'},
 {'Name': 'William Gosset',
  'Born': '1876-06-13',
  'Died': '1937-10-16',
  'Age': '61',
  'Occupation': 'Statistician\n'},
 {'Name': 'Florence Nightingale',
  'Born': '1820-05-12',
  'Died': '1910-08-13',
  'Age': '90',
  'Occupation': 'Nurse\n'},
 {'Name': 'Marie Curie',
  'Born': '1867-11-07',
  'Died': '1934-07-04',
  'Age': '66',
  'Occupation': 'Chemist\n'},
 {'Name': 'Rachel Carson',
  'Born': '1907-05-27',
  'Died': '1964-04-14',
  'Age': '56',
  'Occupation': 'Biologist\n'},
 {'Name': 'John Snow',
  'Born': '1813-03-15',
  'Died': '1858-06-16',
  'Age': '45',
  'Occupation': 'Physician\n'},
 {'Name': 'Alan Turing',
  'Born': '1912-06-23',
  'Died': '1954-06-07',
  'Age': '41',
  'Occupation': 'Computer Scientist\n'},
 {'Name': 'Johann Gauss',
  'Born': '1777-04-30',
  'Died': '1855-02-23',
  'Age': '77',
  'Occupation': 'Mathematician\n'}]

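Not bad, although notice that the `'\n'` is still attached to every `Occupation` value because we split each line without stripping it first (calling `item.strip()` before `split(',')` would fix that). Let's print out each scientist's name: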

In [35]:
for scientist in scientists:
    print(scientist['Name'])

Rosaline Franklin
William Gosset
Florence Nightingale
Marie Curie
Rachel Carson
John Snow
Alan Turing
Johann Gauss


Hopefully it is apparent that we just loaded a CSV file into a data structure that makes it reasonably simple to manipulate "rows" or "columns."

Of course if we were going to use this method, we would make it into a function and do the read into list-of-dictionary in one set of steps to make it easy to use the same code on several files.
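
One possible shape for such a function is sketched below. The name `read_csv_as_dicts` is our own invention, and the sketch assumes a simple file: a header row, no quoted fields, and no commas inside values. It strips each line before splitting, which removes the trailing `'\n'` seen in the output above. To stay self-contained, the example first writes a tiny CSV to a temporary file.

```python
import os
import tempfile


def read_csv_as_dicts(path):
    """Read a simple CSV file into a list of dictionaries.

    Assumes the first line is a header and that no field contains a comma.
    """
    with open(path, 'r') as infile:
        # strip() removes the trailing newline before we split;
        # blank lines are skipped
        rows = [line.strip().split(',') for line in infile if line.strip()]

    header, data_rows = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data_rows]


# Demonstration with a two-row file written to a temporary location
sample = (
    'Name,Born,Died,Age,Occupation\n'
    'John Snow,1813-03-15,1858-06-16,45,Physician\n'
    'Alan Turing,1912-06-23,1954-06-07,41,Computer Scientist\n'
)
tmp_dir = tempfile.mkdtemp()
tmp_path = os.path.join(tmp_dir, 'sample.csv')
with open(tmp_path, 'w') as outfile:
    outfile.write(sample)

people = read_csv_as_dicts(tmp_path)
print(people[1]['Occupation'])

# Clean up the temporary file and directory
os.remove(tmp_path)
os.rmdir(tmp_dir)
```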

Fortunately, there is a better way.

<hr>

# File Formats

*Libraries* are bundles of code that add functionality to Python. Some ship with Python as part of its standard library (such as `csv`, `json`, and `pprint`) and only need to be imported. Others are third-party packages (such as Pandas) that must be installed first and then imported. 

The first such library we will look at is for CSV files.

In [36]:
import csv

with open('data/scientists.csv', 'r') as infile:
    reader = csv.reader(infile)
    for line in reader:
        print(line)

['Name', 'Born', 'Died', 'Age', 'Occupation']
['Rosaline Franklin', '1920-07-25', '1958-04-16', '37', 'Chemist']
['William Gosset', '1876-06-13', '1937-10-16', '61', 'Statistician']
['Florence Nightingale', '1820-05-12', '1910-08-13', '90', 'Nurse']
['Marie Curie', '1867-11-07', '1934-07-04', '66', 'Chemist']
['Rachel Carson', '1907-05-27', '1964-04-14', '56', 'Biologist']
['John Snow', '1813-03-15', '1858-06-16', '45', 'Physician']
['Alan Turing', '1912-06-23', '1954-06-07', '41', 'Computer Scientist']
['Johann Gauss', '1777-04-30', '1855-02-23', '77', 'Mathematician']


You can see immediately that the CSV library understands what we are reading and automatically splits each line. It even handles the ending '\n' character.
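
The CSV library also copes with cases where a DIY `split(',')` approach breaks, such as a quoted field that contains a comma. The sketch below uses a made-up line (the `"Snow, John"` formatting is not in our data file) and `io.StringIO` to stand in for a file.

```python
import csv
import io

# A made-up CSV line whose first field contains a comma inside quotes
text = '"Snow, John",1813-03-15,1858-06-16,45,Physician\n'

# Naive splitting cuts the quoted name in two: 6 pieces instead of 5
naive = text.strip().split(',')
print(len(naive), naive)

# csv.reader understands the quotes and keeps the field together
parsed = next(csv.reader(io.StringIO(text)))
print(len(parsed), parsed)
```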

That is useful, but we can go a step further and read each line into a dictionary!

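**Note:** In Python 3.6 and 3.7, each line is *actually* read into an `OrderedDict()` container, which is what the output below shows. In code you can treat it just like a regular dictionary, but it prints a little funny. Starting with Python 3.8, `DictReader` returns regular dictionaries, so your output may look different.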

In [37]:
# You only have to import a library once

with open('data/scientists.csv', 'r') as infile:
    reader = csv.DictReader(infile)  # Notice the change to a DictReader here
    for line in reader:
        print(line)

OrderedDict([('Name', 'Rosaline Franklin'), ('Born', '1920-07-25'), ('Died', '1958-04-16'), ('Age', '37'), ('Occupation', 'Chemist')])
OrderedDict([('Name', 'William Gosset'), ('Born', '1876-06-13'), ('Died', '1937-10-16'), ('Age', '61'), ('Occupation', 'Statistician')])
OrderedDict([('Name', 'Florence Nightingale'), ('Born', '1820-05-12'), ('Died', '1910-08-13'), ('Age', '90'), ('Occupation', 'Nurse')])
OrderedDict([('Name', 'Marie Curie'), ('Born', '1867-11-07'), ('Died', '1934-07-04'), ('Age', '66'), ('Occupation', 'Chemist')])
OrderedDict([('Name', 'Rachel Carson'), ('Born', '1907-05-27'), ('Died', '1964-04-14'), ('Age', '56'), ('Occupation', 'Biologist')])
OrderedDict([('Name', 'John Snow'), ('Born', '1813-03-15'), ('Died', '1858-06-16'), ('Age', '45'), ('Occupation', 'Physician')])
OrderedDict([('Name', 'Alan Turing'), ('Born', '1912-06-23'), ('Died', '1954-06-07'), ('Age', '41'), ('Occupation', 'Computer Scientist')])
OrderedDict([('Name', 'Johann Gauss'), ('Born', '1777-04-30'), ('Died', '1855-02-23'), ('Age', '77'), ('Occupation', 'Mathematician')])

We can use the "Pretty Print" library to make it look nicer.

In [38]:
from pprint import pprint

with open('data/scientists.csv', 'r') as infile:
    reader = csv.DictReader(infile)  # Notice the change to a DictReader here
    for line in reader:
        pprint(line)

OrderedDict([('Name', 'Rosaline Franklin'),
             ('Born', '1920-07-25'),
             ('Died', '1958-04-16'),
             ('Age', '37'),
             ('Occupation', 'Chemist')])
OrderedDict([('Name', 'William Gosset'),
             ('Born', '1876-06-13'),
             ('Died', '1937-10-16'),
             ('Age', '61'),
             ('Occupation', 'Statistician')])
OrderedDict([('Name', 'Florence Nightingale'),
             ('Born', '1820-05-12'),
             ('Died', '1910-08-13'),
             ('Age', '90'),
             ('Occupation', 'Nurse')])
OrderedDict([('Name', 'Marie Curie'),
             ('Born', '1867-11-07'),
             ('Died', '1934-07-04'),
             ('Age', '66'),
             ('Occupation', 'Chemist')])
OrderedDict([('Name', 'Rachel Carson'),
             ('Born', '1907-05-27'),
             ('Died', '1964-04-14'),
             ('Age', '56'),
             ('Occupation', 'Biologist')])
OrderedDict([('Name', 'John Snow'),
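             ('Born', '1813-03-15'),
             ('Died', '1858-06-16'),
             ('Age', '45'),
             ('Occupation', 'Physician')])
OrderedDict([('Name', 'Alan Turing'),
             ('Born', '1912-06-23'),
             ('Died', '1954-06-07'),
             ('Age', '41'),
             ('Occupation', 'Computer Scientist')])
OrderedDict([('Name', 'Johann Gauss'),
             ('Born', '1777-04-30'),
             ('Died', '1855-02-23'),
             ('Age', '77'),
             ('Occupation', 'Mathematician')])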

Or, you can convert the OrderedDict to a regular dictionary, then pretty print it:

In [39]:
with open('data/scientists.csv', 'r') as infile:
    reader = csv.DictReader(infile)  # Notice the change to a DictReader here
    for line in reader:
        pprint(dict(line))

{'Age': '37',
 'Born': '1920-07-25',
 'Died': '1958-04-16',
 'Name': 'Rosaline Franklin',
 'Occupation': 'Chemist'}
{'Age': '61',
 'Born': '1876-06-13',
 'Died': '1937-10-16',
 'Name': 'William Gosset',
 'Occupation': 'Statistician'}
{'Age': '90',
 'Born': '1820-05-12',
 'Died': '1910-08-13',
 'Name': 'Florence Nightingale',
 'Occupation': 'Nurse'}
{'Age': '66',
 'Born': '1867-11-07',
 'Died': '1934-07-04',
 'Name': 'Marie Curie',
 'Occupation': 'Chemist'}
{'Age': '56',
 'Born': '1907-05-27',
 'Died': '1964-04-14',
 'Name': 'Rachel Carson',
 'Occupation': 'Biologist'}
{'Age': '45',
 'Born': '1813-03-15',
 'Died': '1858-06-16',
 'Name': 'John Snow',
 'Occupation': 'Physician'}
{'Age': '41',
 'Born': '1912-06-23',
 'Died': '1954-06-07',
 'Name': 'Alan Turing',
 'Occupation': 'Computer Scientist'}
{'Age': '77',
 'Born': '1777-04-30',
 'Died': '1855-02-23',
 'Name': 'Johann Gauss',
 'Occupation': 'Mathematician'}


The `reader` object keeps track of the headers:

In [40]:
with open('data/scientists.csv', 'r') as infile:
    reader = csv.DictReader(infile) 
    print(f'Headers: {reader.fieldnames}')


Headers: ['Name', 'Born', 'Died', 'Age', 'Occupation']


It should be readily apparent that the CSV library saves us some steps. However, the best advice is: **If you are working with CSV files, use the Pandas library.** 

Pandas is a data science-focused library designed for reading and manipulating tabular data, such as CSV files, in table-like structures called DataFrames.

In [41]:
import pandas as pd

In [42]:
df = pd.read_csv('data/scientists.csv')

In [43]:
df

| | Name | Born | Died | Age | Occupation |
|---|---|---|---|---|---|
| 0 | Rosaline Franklin | 1920-07-25 | 1958-04-16 | 37 | Chemist |
| 1 | William Gosset | 1876-06-13 | 1937-10-16 | 61 | Statistician |
| 2 | Florence Nightingale | 1820-05-12 | 1910-08-13 | 90 | Nurse |
| 3 | Marie Curie | 1867-11-07 | 1934-07-04 | 66 | Chemist |
| 4 | Rachel Carson | 1907-05-27 | 1964-04-14 | 56 | Biologist |
| 5 | John Snow | 1813-03-15 | 1858-06-16 | 45 | Physician |
| 6 | Alan Turing | 1912-06-23 | 1954-06-07 | 41 | Computer Scientist |
| 7 | Johann Gauss | 1777-04-30 | 1855-02-23 | 77 | Mathematician |


## JSON

JSON stands for JavaScript Object Notation and is an extremely popular and useful format for moving data between applications, particularly web applications and document databases like MongoDB (see Week 4 FTE for more detail on NoSQL and document databases). 

JSON stores data in **objects** where each object is a collection of unique keys that reference values. Sounds a lot like a Python dictionary, doesn't it? As a matter of fact, JSON and dictionaries are almost a perfect match and dictionaries are frequently used with JSON. 
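
A minimal sketch of that match, using the standard `json` module: `json.dumps()` turns a dictionary into a JSON string and `json.loads()` turns the string back into a dictionary. The sample record is a shortened version of one row from our scientists data. The last lines show one reason the match is only "almost" perfect: JSON keys are always strings.

```python
import json

scientist = {'Name': 'Marie Curie', 'Age': 66, 'Occupation': 'Chemist'}

# Dictionary -> JSON text (indent=2 makes the string easier to read)
as_json = json.dumps(scientist, indent=2)
print(as_json)
print(type(as_json))

# JSON text -> dictionary
round_trip = json.loads(as_json)
print(round_trip == scientist)

# One mismatch: an integer key comes back as a string key
int_key_trip = json.loads(json.dumps({1: 'one'}))
print(int_key_trip)
```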

Let's look at an example using our Scientists data.

Pandas is again our friend here. We'll use it to convert the dataframe to JSON for easy viewing. Later I'll show you how to convert the CSV file directly to JSON without Pandas.

In [44]:
import json

# By default, Pandas writes JSON in a column-oriented layout.
# orient='records' asks for a list of row objects instead, one per DataFrame row.
j_sci = df.to_json(orient='records')

# Pandas gives us a string. That is fine to send to a file, but not so great to print on the screen,
# so we'll use the json module to parse it into Python lists and dictionaries.
j_sci = json.loads(j_sci)

We can use the **'Pretty Print'** module (pprint) to space out our fields so it is easy to read. Keep that phrase 'pretty print' in mind. It is a general computer science term for formatting output and will be useful in Google searches.

In [45]:
import pprint
pprint.pprint(j_sci, indent=2)

[ { 'Age': 37,
    'Born': '1920-07-25',
    'Died': '1958-04-16',
    'Name': 'Rosaline Franklin',
    'Occupation': 'Chemist'},
  { 'Age': 61,
    'Born': '1876-06-13',
    'Died': '1937-10-16',
    'Name': 'William Gosset',
    'Occupation': 'Statistician'},
  { 'Age': 90,
    'Born': '1820-05-12',
    'Died': '1910-08-13',
    'Name': 'Florence Nightingale',
    'Occupation': 'Nurse'},
  { 'Age': 66,
    'Born': '1867-11-07',
    'Died': '1934-07-04',
    'Name': 'Marie Curie',
    'Occupation': 'Chemist'},
  { 'Age': 56,
    'Born': '1907-05-27',
    'Died': '1964-04-14',
    'Name': 'Rachel Carson',
    'Occupation': 'Biologist'},
  { 'Age': 45,
    'Born': '1813-03-15',
    'Died': '1858-06-16',
    'Name': 'John Snow',
    'Occupation': 'Physician'},
  { 'Age': 41,
    'Born': '1912-06-23',
    'Died': '1954-06-07',
    'Name': 'Alan Turing',
    'Occupation': 'Computer Scientist'},
  { 'Age': 77,
    'Born': '1777-04-30',
    'Died': '1855-02-23',
    'Name': 'Johann Gauss',
    'Occupation': 'Mathematician'}]

Here is a special function to print nested data structures like JSON that may do a better job of making it readable:

In [46]:
def printplus(obj):
    """
    Pretty-prints the object passed in.

    Dictionaries print one 'key: value' pair per line, sorted by key.
    Lists and tuples print one element per line.
    Anything else is printed as-is.
    """
    # Dict
    if isinstance(obj, dict):
        for k, v in sorted(obj.items()):
            print(f'{k}: {v}')

    # List or tuple
    elif isinstance(obj, (list, tuple)):
        for x in obj:
            print(x)

    # Other
    else:
        print(obj)


In [47]:
printplus(j_sci)

{'Name': 'Rosaline Franklin', 'Born': '1920-07-25', 'Died': '1958-04-16', 'Age': 37, 'Occupation': 'Chemist'}
{'Name': 'William Gosset', 'Born': '1876-06-13', 'Died': '1937-10-16', 'Age': 61, 'Occupation': 'Statistician'}
{'Name': 'Florence Nightingale', 'Born': '1820-05-12', 'Died': '1910-08-13', 'Age': 90, 'Occupation': 'Nurse'}
{'Name': 'Marie Curie', 'Born': '1867-11-07', 'Died': '1934-07-04', 'Age': 66, 'Occupation': 'Chemist'}
{'Name': 'Rachel Carson', 'Born': '1907-05-27', 'Died': '1964-04-14', 'Age': 56, 'Occupation': 'Biologist'}
{'Name': 'John Snow', 'Born': '1813-03-15', 'Died': '1858-06-16', 'Age': 45, 'Occupation': 'Physician'}
{'Name': 'Alan Turing', 'Born': '1912-06-23', 'Died': '1954-06-07', 'Age': 41, 'Occupation': 'Computer Scientist'}
{'Name': 'Johann Gauss', 'Born': '1777-04-30', 'Died': '1855-02-23', 'Age': 77, 'Occupation': 'Mathematician'}


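As promised, here is how to convert the CSV file directly to JSON using only the `csv` and `json` libraries, without Pandas:

In [48]: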
import json
import csv

# Read CSV file. This wouldn't work well for very large files
with open('data/scientists.csv') as f:
    reader = csv.DictReader(f)
    rows = list(reader)
    
# Write JSON file to disk
with open('data/scientists.json', 'w') as f:
    json.dump(rows, f)

In [49]:
# Proof that it works -- Read JSON file in to new variable and print

with open('data/scientists.json') as f:
    data = json.load(f)
printplus(data)

{'Name': 'Rosaline Franklin', 'Born': '1920-07-25', 'Died': '1958-04-16', 'Age': '37', 'Occupation': 'Chemist'}
{'Name': 'William Gosset', 'Born': '1876-06-13', 'Died': '1937-10-16', 'Age': '61', 'Occupation': 'Statistician'}
{'Name': 'Florence Nightingale', 'Born': '1820-05-12', 'Died': '1910-08-13', 'Age': '90', 'Occupation': 'Nurse'}
{'Name': 'Marie Curie', 'Born': '1867-11-07', 'Died': '1934-07-04', 'Age': '66', 'Occupation': 'Chemist'}
{'Name': 'Rachel Carson', 'Born': '1907-05-27', 'Died': '1964-04-14', 'Age': '56', 'Occupation': 'Biologist'}
{'Name': 'John Snow', 'Born': '1813-03-15', 'Died': '1858-06-16', 'Age': '45', 'Occupation': 'Physician'}
{'Name': 'Alan Turing', 'Born': '1912-06-23', 'Died': '1954-06-07', 'Age': '41', 'Occupation': 'Computer Scientist'}
{'Name': 'Johann Gauss', 'Born': '1777-04-30', 'Died': '1855-02-23', 'Age': '77', 'Occupation': 'Mathematician'}
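
Notice that `Age` came back as a string here (`'37'`), while the Pandas route earlier produced a number (`37`). The CSV library reads every field as text, so any type conversion is up to us. Here is a small sketch of one way to do that, using two rows copied from the output above rather than the file itself:

```python
import json

rows = [
    {'Name': 'Rosaline Franklin', 'Born': '1920-07-25', 'Died': '1958-04-16',
     'Age': '37', 'Occupation': 'Chemist'},
    {'Name': 'William Gosset', 'Born': '1876-06-13', 'Died': '1937-10-16',
     'Age': '61', 'Occupation': 'Statistician'},
]

# Convert the Age field from text to an integer in every row
for row in rows:
    row['Age'] = int(row['Age'])

# json.dumps now writes Age as a JSON number (no quotes around it)
first_as_json = json.dumps(rows[0])
print(first_as_json)
```

Doing the conversion before calling `json.dump()` means that whoever reads the JSON file later gets numbers rather than strings.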
