Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# DataTables, Indexes, Pandas, and Seaborn

## Some useful (free) resources

Introductory:

* [Getting started with Python for research](https://github.com/TiesdeKok/LearnPythonforResearch), a gentle introduction to Python in data-intensive research.

* [A Whirlwind Tour of Python](https://jakevdp.github.io/WhirlwindTourOfPython/index.html), by Jake VanderPlas, another quick Python intro (with notebooks).

Core Pandas/Data Science books:

* [The Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), by Jake VanderPlas.

* [Python for Data Analysis, 2nd Edition](http://proquest.safaribooksonline.com/book/programming/python/9781491957653), by Wes McKinney, creator of Pandas. [Companion Notebooks](https://github.com/wesm/pydata-book)

* [Effective Pandas](https://github.com/TomAugspurger/effective-pandas), a book by Tom Augspurger, core Pandas developer.


Complementary resources:

* [An introduction to "Data Science"](https://github.com/stefanv/ds_intro), a collection of Notebooks by BIDS' [Stéfan Van der Walt](https://bids.berkeley.edu/people/st%C3%A9fan-van-der-walt).

* [Effective Computation in Physics](http://proquest.safaribooksonline.com/book/physics/9781491901564), by Kathryn D. Huff and Anthony Scopatz. [Notebooks to accompany the book](https://github.com/physics-codes/seminar). Don't be fooled by the title; it's a great book on modern computational practices with very little that's physics-specific.


OK, let's load and configure some of our core libraries (as an aside, you can find a nice visual gallery of available matplotlib styles [here](https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html)).

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

## Getting the Data

https://www.ssa.gov/OACT/babynames/index.html

https://www.ssa.gov/data

As we saw before, we can download data from the internet with Python, and do so only if needed:

In [None]:
import requests
from pathlib import Path

namesbystate_path = Path('namesbystate.zip')
data_url = 'https://www.ssa.gov/oact/babynames/state/namesbystate.zip'

if not namesbystate_path.exists():
    print('Downloading...', end=' ')
    resp = requests.get(data_url)
    resp.raise_for_status()  # stop here on an HTTP error instead of saving an error page as the zip
    with namesbystate_path.open('wb') as f:
        f.write(resp.content)
    print('Done!')

## Question 2: What is the most popular name in each state, for each year and sex?

### Put all DFs together

Again, we'll work off our in-memory, compressed zip archive and pull the data out of it into Pandas DataFrames without ever putting it all on disk. We can see how large the compressed and uncompressed data is:

In [None]:
import zipfile
zf = zipfile.ZipFile(namesbystate_path, 'r')
sum(f.file_size for f in zf.filelist)/1_000_000

In [None]:
sum(f.compress_size for f in zf.filelist)/1_000_000

In [None]:
__/_  # compression ratio: the second-to-last output (uncompressed MB) divided by the last one (compressed MB)
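
The same size comparison can be tried on a tiny archive built in memory. This is only a sketch: the file names and repeated rows below are made up for illustration and are not the real SSA data.

```python
import io
import zipfile

# Build a small zip archive in memory (hypothetical contents, for illustration only).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("AK.TXT", "AK,F,1910,Mary,14\n" * 1000)
    archive.writestr("AL.TXT", "AL,F,1910,Mary,875\n" * 1000)

# Each entry records both its original size and its size inside the archive.
with zipfile.ZipFile(buffer, "r") as archive:
    uncompressed = sum(info.file_size for info in archive.infolist())
    compressed = sum(info.compress_size for info in archive.infolist())

ratio = uncompressed / compressed
print(f"{uncompressed} bytes uncompressed, {compressed} bytes compressed, ratio {ratio:.1f}x")
```

Storing the two sums in named variables avoids relying on `_` and `__`, which depend on the order in which cells were last executed.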

We want a single huge dataframe containing every state's data. Let's start by reading in the dataframe for each state into a Python list of dataframes.

In [None]:
%%time
data_frames_for_all_states = []

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
for f in zf.filelist:
    # Only the .TXT entries hold data; skip anything else in the archive.
    if not f.filename.endswith('.TXT'):
        continue
    with zf.open(f) as fh:
        data_frames_for_all_states.append(pd.read_csv(fh, header=None, names=field_names))

Now, we create a single DataFrame by concatenating these into one:

In [None]:
baby_names = pd.concat(data_frames_for_all_states).reset_index(drop=True)
baby_names.tail()

In [None]:
baby_names.shape
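
To see the read-then-concatenate pattern end to end without the full download, here is a minimal sketch on a miniature archive. The archive contents and the name `toy_names` are hypothetical stand-ins; only the column names come from the cells above.

```python
import io
import zipfile

import pandas as pd

field_names = ["State", "Sex", "Year", "Name", "Count"]

# Hypothetical miniature archive standing in for namesbystate.zip.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("AK.TXT", "AK,F,1910,Mary,14\nAK,F,1910,Annie,12\n")
    archive.writestr("CA.TXT", "CA,F,1910,Mary,295\n")
    archive.writestr("ReadMe.pdf", "not a data file")

frames = []
with zipfile.ZipFile(buffer, "r") as archive:
    for info in archive.infolist():
        if not info.filename.endswith(".TXT"):
            continue  # skip the non-data entry
        with archive.open(info) as fh:
            frames.append(pd.read_csv(fh, header=None, names=field_names))

# ignore_index=True renumbers the rows 0..n-1, much like reset_index(drop=True).
toy_names = pd.concat(frames, ignore_index=True)
print(toy_names)
```

Without renumbering, each state's rows would keep their own 0-based labels and the combined index would contain duplicates.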

### Group by state and year

In [None]:
baby_names[
    (baby_names['State'] == 'CA')
    & (baby_names['Year'] == 1995)
    & (baby_names['Sex'] == 'M')
].head()

# The lame way to build our DataFrame would be to manually write down
# the answers for all combinations of State, Year, and Sex.

In [None]:
%%time
baby_names.groupby('State').size().head()

In [None]:
state_counts = baby_names.loc[:, ('State', 'Count')]
state_counts.head()

In [None]:
sg = state_counts.groupby('State')
sg

In [None]:
state_counts.groupby('State').sum().head()

For Data 8 veterans, this is equivalent to the following Data 8 code:

    state_and_groups.group('State', np.sum)
    
In pandas, we could also use `agg` here, yielding:

    state_counts.groupby('State').agg(np.sum)
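
A small sketch of that equivalence on a made-up table (the rows are hypothetical). It passes the string alias `'sum'` to `agg`, which is another way of requesting the same reduction:

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "State": ["AK", "AK", "CA", "CA", "CA"],
    "Count": [14, 12, 295, 239, 220],
})

by_sum = toy.groupby("State").sum()
by_agg = toy.groupby("State").agg("sum")  # string alias for the same reduction

print(by_sum)
print(by_agg)
```

Both results are DataFrames indexed by `State` with one summed `Count` per group.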

### Grouping by multiple columns

In [None]:
baby_names.groupby(['State', 'Year']).size().head(3)

In [None]:
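# numeric_only=True sums just the Count column; otherwise recent pandas
# versions also "sum" (concatenate) the string columns.
baby_names.groupby(['State', 'Year']).sum(numeric_only=True).head(3)

In [None]:
baby_names.groupby(['State', 'Year', 'Sex']).sum(numeric_only=True).head()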

In [None]:
#%%time
def first(series):
    '''Returns the first value in the series.'''
    return series.iloc[0]

most_popular_names = baby_names.groupby(['State', 'Year', 'Sex']).agg(first)

most_popular_names.head()
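
Taking the first row of each group gives the most popular name only if the rows within each state, year, and sex are already ordered from the highest `Count` to the lowest. The sketch below uses made-up rows that are deliberately out of order to show the difference, along with one possible way to remove the dependence on input order (sorting before grouping). It is an illustration under those assumptions, not a change to the cell above.

```python
import pandas as pd

# Hypothetical rows, deliberately NOT sorted by Count within each group.
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "CA"],
    "Sex": ["M", "M", "F", "F"],
    "Year": [1997, 1997, 1997, 1997],
    "Name": ["Jose", "Daniel", "Emily", "Jessica"],
    "Count": [3000, 3500, 2700, 2600],
})

keys = ["State", "Year", "Sex"]

# First row per group: only correct when the input is pre-sorted by Count.
first_row = toy.groupby(keys).first()

# Sorting by Count (descending) first makes the result independent of input order,
# because groupby keeps the rows of each group in the order it receives them.
most_popular = toy.sort_values("Count", ascending=False).groupby(keys).first()

print(first_row)
print(most_popular)
```

For the male group the two results differ: the first row is not the row with the largest count.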

As we'd expect, we get a MultiIndexed DataFrame, which we can index using `[]` just like our single-indexed DataFrames.

In [None]:
most_popular_names[most_popular_names['Name'] == 'Samuel']

`.loc` is a bit more complicated:

In [None]:
most_popular_names.loc['CA', 2017, :, :]

In [None]:
most_popular_names.loc['CA', 1997, 'M', :]

In [None]:
most_popular_names.loc['CA', 1997, 'M']
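
To make the behavior of `.loc` on a MultiIndex easier to follow, here is a sketch on a small hand-built frame with the same three index levels. The rows and the variable names are hypothetical. It writes the row key as a tuple and the column selection as `:`, so that rows and columns are stated explicitly.

```python
import pandas as pd

# Hypothetical frame indexed by (State, Year, Sex), mirroring most_popular_names.
toy = pd.DataFrame(
    {"Name": ["Emily", "Daniel", "Emily", "Jacob"], "Count": [2700, 3500, 2900, 3100]},
    index=pd.MultiIndex.from_tuples(
        [("CA", 1997, "F"), ("CA", 1997, "M"), ("CA", 1998, "F"), ("TX", 1997, "M")],
        names=["State", "Year", "Sex"],
    ),
).sort_index()

# A full key (one value per level) selects a single row, returned as a Series.
one_row = toy.loc[("CA", 1997, "M"), :]

# A partial key selects every row under that prefix; the matched levels are dropped.
ca_1997 = toy.loc[("CA", 1997), :]

# xs selects on a level that is not a prefix of the index.
all_males = toy.xs("M", level="Sex")

print(one_row)
print(ca_1997)
print(all_males)
```

Sorting the index first is a common precaution, since label-based slicing on a MultiIndex generally expects a sorted index.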

## Question 3: Can I deduce birth sex from the last letter of a person’s name?

### Compute last letter of each name

In [None]:
baby_names.head()

In [None]:
baby_names['Name'].apply(len).head()

In [None]:
baby_names['Name'].str.len().head()

In [None]:
baby_names['Name'].str[-1].head()

To add a column to the DataFrame:

In [None]:
baby_names['Last letter'] = baby_names['Name'].str[-1]
baby_names.head()
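
The three string operations used above can be compared side by side on a short Series. The names below are made up for illustration; the sketch also shows one practical difference, namely that the `.str` accessor passes missing values through while `apply(len)` would fail on them.

```python
import pandas as pd

names = pd.Series(["Mary", "Annie", "Noah", "Liam"], name="Name")

lengths_apply = names.apply(len)  # calls len once per element
lengths_str = names.str.len()     # string accessor, same result here
last_letters = names.str[-1]      # indexes into every string

# With a missing value, .str.len() yields NaN instead of raising an error.
with_missing = pd.Series(["Mary", None])
lengths_missing = with_missing.str.len()

print(lengths_str.tolist())
print(last_letters.tolist())
print(lengths_missing.tolist())
```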

### Group by last letter and sex

In [None]:
letter_counts = (baby_names
                 .loc[:, ('Sex', 'Count', 'Last letter')]
                 .groupby(['Last letter', 'Sex'])
                 .sum())
letter_counts.head()

### Visualize our result

Use .plot to get some basic plotting functionality:

In [None]:
# Why is this not good?
letter_counts.plot.barh(figsize=(15, 15));

Reading the docs shows me that pandas will make one set of bars for each column in my table. How do I move each sex into its own column? I have to use pivot:

In [None]:
# For comparison, the group above:
# letter_counts = (baby_names
#                  .loc[:, ('Sex', 'Count', 'Last letter')]
#                  .groupby(['Last letter', 'Sex'])
#                  .sum())

last_letter_pivot = baby_names.pivot_table(
    index='Last letter', # the rows (turned into index)
    columns='Sex', # the column values
    values='Count', # the field(s) to be processed in each group
    aggfunc=sum, # group operation
)
last_letter_pivot.head()
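
One way to connect the pivot table back to the earlier group-by is to note that grouping on both keys and then moving `Sex` from the index into the columns produces the same table. A minimal sketch on hypothetical rows (the counts are made up, and `aggfunc='sum'` is used as the string form of the same aggregation):

```python
import pandas as pd

# Hypothetical rows, for illustration only.
toy = pd.DataFrame({
    "Sex": ["F", "M", "F", "M", "F"],
    "Last letter": ["a", "a", "n", "n", "a"],
    "Count": [10, 1, 2, 8, 5],
})

# Pivot: one row per last letter, one column per sex.
via_pivot = toy.pivot_table(index="Last letter", columns="Sex", values="Count", aggfunc="sum")

# Group-by on both keys, then unstack the Sex level into columns.
via_groupby = toy.groupby(["Last letter", "Sex"])["Count"].sum().unstack("Sex")

print(via_pivot)
print(via_groupby)
```

If some letter and sex combination never occurs, both approaches leave a missing value in that cell.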

---

### Slides: GroupBy/Pivot comparison slides and Quiz

At this point, I highly recommend [this very nice tutorial on Pivot Tables](http://pbpython.com/pandas-pivot-table-explained.html).

In [None]:
last_letter_pivot.plot.barh(figsize=(10, 10));

Why is this still not ideal?

- Plotting raw counts
- Not sorted in any meaningful order

In [None]:
totals = last_letter_pivot['F'] + last_letter_pivot['M']

last_letter_props = pd.DataFrame({
    'F': last_letter_pivot['F'] / totals,
    'M': last_letter_pivot['M'] / totals,
}).sort_values('M')
last_letter_props.head()
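
The same normalization can also be written as a single row-wise division, which extends naturally if there are more than two columns. This is a sketch on a small hypothetical pivot table (`toy_pivot` and its counts are made up), not a replacement for the cell above.

```python
import pandas as pd

# Hypothetical pivot table of counts by last letter and sex.
toy_pivot = pd.DataFrame(
    {"F": [15, 2], "M": [1, 8]},
    index=pd.Index(["a", "n"], name="Last letter"),
)

# Divide every column by the row totals; axis=0 aligns the totals with the row index.
row_totals = toy_pivot.sum(axis=1)
toy_props = toy_pivot.div(row_totals, axis=0).sort_values("M")

print(toy_props)
```

Each row of the result sums to 1, so the bars for different letters become directly comparable.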

In [None]:
last_letter_props.plot.barh(figsize=(10, 10));

What do you notice?

## Submission

You're done!

Before submitting this assignment, make sure to:

1. Restart the Kernel (in the menubar, select Kernel->Restart & Run All)
2. Validate the notebook by clicking the "Validate" button

Finally, make sure to **submit** the assignment via the Assignments tab in Datahub.