# CME538 - Introduction to Data Science
## Lecture 2.2 - Pandas II
### New Concepts
* Operations on String series, e.g. `babynames['Name'].str.startswith()`
* Creating and dropping columns.
    * Creating temporary columns is often convenient for sorting.
* Passing an index as an argument to `.loc`.
    * Useful as an alternate way to sort a dataframe.
* Groupby: Output of `.groupby('Name')` is a `DataFrameGroupBy` object. Condense back into a DataFrame or Series with:
    * groupby.agg
    * groupby.size
    * groupby.filter
    * and more...
* Pivot tables: An alternate way to group by exactly two columns. 


### Lecture Structure
In this lecture, we'll introduce additional syntax by trying to solve various practical problems on the baby names dataset.
* [Goal 1](#section1): Find the most popular name in New York in 2018.
* [Goal 2](#section2): Find all names that start with J.
* [Goal 3](#section3): Sort names by length.
* [Goal 4](#section4): Find the name whose popularity has changed the most.
* [groupby Puzzles](#groupby_puzzles1): Some groupby.agg puzzles. 
* [Goal 5](#section5): Count the number of female and male babies born in each year.
* [groupby Puzzles](#groupby_puzzles2): Another groupby.agg puzzle. 

## Setup Notebook
At the start of a notebook, we need to import the Python packages we plan to use.
* [Time](https://docs.python.org/3/library/time.html) - This module provides various time-related functions. 
* [NumPy](https://numpy.org/) - A library for the Python programming language, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays. NumPy was introduced in Lecture 3 and we will learn more about its functionality in this lecture. It is customary to `import numpy as np`.
* [Pandas](https://pandas.pydata.org/) - pandas is a fast, powerful, flexible and easy to use open source data analysis and manipulation tool, built on top of the Python programming language. Lecture 4, 5, and 6 will do a deep dive into the core functionality of Pandas. It is customary to `import pandas as pd`. 
* [Seaborn](https://seaborn.pydata.org/) - Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. We will use Seaborn throughout CIV1498 for data visualization. It is customary to `import seaborn as sns`.  
* [Matplotlib](https://matplotlib.org//) - Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations in Python. We will use Matplotlib throughout CIV1498 for data visualization. It is customary to `import matplotlib.pyplot as plt`. 

Next, we want to configure the Jupyter Notebook.
* `%matplotlib inline` - This code configures the notebook to display all plots, from Seaborn or Matplotlib, in the Notebook as opposed to in a separate pop-up window.
* `plt.style.use('fivethirtyeight')` - This code configures the plots with the "fivethirtyeight" styling, which tries to replicate the styles from the website [FiveThirtyEight](https://fivethirtyeight.com/).
* `sns.set_context("notebook")` - This sets the plotting context parameters to be optimized for a Notebook. This affects things like the size of the labels, lines, and other elements of the plot, but not the overall style.

In [None]:
# Import standard library and 3rd party libraries
import time
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Configure Notebook
%matplotlib inline
plt.style.use('fivethirtyeight')
sns.set_context("notebook")

## Baby Names Dataset
Let's start by loading the New York baby names again.

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # if the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ny_name = 'NY.TXT'  # the New York file inside the zip archive
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ny_name) as fh:
    baby_names = pd.read_csv(fh, header=None, names=field_names)

baby_names.sample(5)

<a id='section1'></a>
## Goal 1
#### Find the most popular baby name in New York in 2018.

Let's start by filtering the dataset to 2018 data points and then sorting by the `Count` column.

In [None]:
baby_names[baby_names['Year'] == 2018].sort_values(by='Count', ascending=False).head()

<a id='section2'></a>
## Goal 2
#### Find all names that start with J.
Goal 2 will focus on introducing operations on String Series.
#### Approach 1: Combine Lecture 4 Pandas syntax with Lecture 3 / APS106 ideas.
Let's take a look at the first 10 baby names.

In [None]:
baby_names['Name'].head(10)

We can first use a Python list comprehension, which was reviewed in Lecture 3 and covered in APS106, to create a Boolean list. The value is True when the name starts with **J** and False when it does not. 

In [None]:
starts_with_j = [x[0] == 'J' for x in baby_names['Name']]
starts_with_j[0:10]

Next, we can use the Boolean list to filter our DataFrame.

In [None]:
baby_names[starts_with_j].head()

To make our code more compact, we can combine the operations from the previous two cells.

In [None]:
baby_names[[x[0] == 'J' for x in baby_names['Name']]].head()

#### Approach 2: Use the Series.str methods.
`Series.str` provides vectorized string methods for Series and Index objects.

In [None]:
baby_names['Name'].str.startswith('J').head()

This produces a Boolean Series which can then be used to filter our DataFrame.

In [None]:
baby_names[baby_names['Name'].str.startswith('J')].head()

Although both approaches are perfectly valid, we would say that **Approach 1** is not idiomatic, meaning that people from the broader pandas community won’t like reading your code. Additionally, **Approach 2** is easier to understand, which is always important when writing code.
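
As a quick check that the two approaches select the same rows, here is a minimal sketch on a small made-up Series (the names below are illustrative and are not taken from the dataset).

```python
import pandas as pd

# Made-up names, used only for illustration
names = pd.Series(['James', 'Mary', 'John', 'Alice', 'Jack'])

# Approach 1: Boolean list built with a list comprehension
mask_list = [x[0] == 'J' for x in names]

# Approach 2: Boolean Series built with the vectorized .str accessor
mask_series = names.str.startswith('J')

# Both masks keep exactly the names that start with J
print(names[mask_list].tolist())
print(names[mask_series].tolist())
```

Both `print` calls show the same three names, so the choice between the approaches comes down to readability.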

`Series.str` has many other useful methods:

In [None]:
baby_names[baby_names['Name'].str.contains('ice')].head()

In [None]:
baby_names['Name'].str.split('a').head()

**Challenge:** Try to write a line of code that creates a list (or Series or array) of all names that end with “ert”.

<details>
    <summary>Solution</summary>
<code>
baby_names[baby_names['Name'].str.endswith('ert')]['Name'].unique()
</code>
</details>

<a id='section3'></a>
## Goal 3
#### Sort names by length.
Suppose we want to sort all rows by the length of the name.

As before, there are ways to do this using only Lecture 3 and 4 content as well as concepts covered in APS106. Check out the code below.

In [None]:
baby_names.iloc[
    [i for i, m in sorted(enumerate(baby_names['Name']), 
                          key=lambda x: -len(x[1]))]
].head(5)

#### Approach 1: Create a temporary column.
Create a new series of only the lengths. Then add that series to the dataframe as a column. Then sort by that column. Then drop that column.

Create a new series of only the lengths.

In [None]:
baby_name_lengths = baby_names['Name'].str.len()

Then, add that series to the dataframe as a column.

In [None]:
baby_names['name_lengths'] = baby_name_lengths
baby_names.head()

Next, sort by the temporary column.

In [None]:
baby_names = baby_names.sort_values(by='name_lengths', ascending=False)
baby_names.head()

And finally, drop the temporary column.

In [None]:
baby_names = baby_names.drop('name_lengths', axis = 1)
baby_names.head()
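
As an aside, the create/sort/drop pattern can sometimes be avoided. The sketch below assumes pandas 1.1 or later, where `sort_values` accepts a `key` argument; the `toy` DataFrame is made up for illustration.

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    'Name': ['Al', 'Christopher', 'Maria'],
    'Count': [5, 7, 9],
})

# `key` receives the 'Name' column as a Series and must return a Series
# of the same length; rows are then ordered by those returned values.
by_length = toy.sort_values(by='Name', key=lambda s: s.str.len(), ascending=False)
print(by_length)
```

No temporary column is ever added to `toy`, so there is nothing to drop afterwards.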

We can also use the pandas `Series.map` method if we want to apply an arbitrary user-defined function. Suppose we want to sort by the number of occurrences of `'dr'` plus the number of occurrences of `'ea'`.

In [None]:
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

# Create the temporary column
baby_names['dr_ea_count'] = baby_names['Name'].map(dr_ea_count)

# Sort by the temporary column
baby_names = baby_names.sort_values(by='dr_ea_count', ascending=False)

# Drop the temporary column
baby_names = baby_names.drop('dr_ea_count', axis=1)
baby_names.head(5)

#### Approach 2: Generate an index sorted in the desired order.
Let's start over by first scrambling the order of baby_names.

In [None]:
baby_names = baby_names.sample(frac=1)
baby_names.head()

Another approach is to take advantage of the fact that `.loc` can accept an index.
- `df.loc[idx]` returns df with its rows in the same order as the given index.
- This only works if every label in `idx` also appears in the DataFrame's index.
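
The reordering behaviour of `.loc` described above can be sketched on a tiny made-up DataFrame (the labels 10, 20 and 30 are arbitrary and chosen only for illustration).

```python
import pandas as pd

# Made-up data with an explicit, non-default index
toy = pd.DataFrame(
    {'Name': ['Al', 'Christopher', 'Maria']},
    index=[10, 20, 30],
)

# Passing a list of labels returns the rows in exactly that order
reordered = toy.loc[[30, 10, 20]]
print(reordered)
```

The rows come back in the order of the labels we passed, not in their original order.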

The first step is to create a Series of the lengths of the names.

In [None]:
name_lengths = baby_names['Name'].str.len()
name_lengths.head()

The next step is to sort the new series we just created.

In [None]:
name_lengths_sorted_by_length = name_lengths.sort_values(ascending=False)
name_lengths_sorted_by_length.head()

Next, we pass the index of the sorted series to the `.loc` method of the original DataFrame.

In [None]:
index_sorted_by_length = name_lengths_sorted_by_length.index
index_sorted_by_length[0:5]

In [None]:
baby_names.loc[index_sorted_by_length].head()

This is a lot of code, so let's try combining it all in one line of code.

In [None]:
baby_names.loc[baby_names['Name'].str.len().sort_values(ascending=False).index].head()

<a id='section4'></a>
## Goal 4
#### Name whose popularity has changed the most. 
First we need to define change in popularity. 

For the purposes of lecture, let’s stay simple and use the absolute max/min difference (AMMD): max(count) - min(count). 

To make sure we understand this quantity, let's consider the name Jennifer.

In [None]:
jennifer_counts = baby_names[baby_names['Name']=='Jennifer']
jennifer_counts.head()

Let's calculate the AMMD for Jennifer.

In [None]:
def ammd(series):
    return max(series) - min(series)

In [None]:
ammd(jennifer_counts['Count'])

In [None]:
ammd(baby_names[baby_names['Name']=='Jennifer']['Count'])
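
To make the arithmetic concrete, here is a minimal worked example of AMMD on a made-up Series of counts (these numbers are illustrative, not Jennifer's actual counts).

```python
import pandas as pd

def ammd(series):
    # Absolute max/min difference: max(count) - min(count)
    return max(series) - min(series)

# Made-up yearly counts for a single hypothetical name
counts = pd.Series([120, 340, 95, 210])

print(ammd(counts))  # 340 - 95 = 245
```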

#### Approach 1: Hack something using our existing Python knowledge.

In [None]:
start_time = time.time()
baby_name_count_ammd = dict()
for name in sorted(baby_names['Name'].unique()):
    counts_of_current_name = baby_names[baby_names['Name'] == name]['Count']
    baby_name_count_ammd[name] = ammd(counts_of_current_name)  # key by the name itself, not the string 'name'
print('Compute time: {} seconds.'.format(time.time() - start_time))

You can run this code if you like, but it takes quite some time: about 5.5 minutes. The most expensive operation is filtering the DataFrame once for every unique name.

There must be a better way, right?

Introducing `.groupby().agg()`!

But first, what exactly does `.groupby()` do?

In [None]:
for idx, (name, group) in enumerate(baby_names.groupby('Name')):
    
    # Print the name and the group DataFrame
    print(name)
    print(group)
    print('')
    
    if idx >= 5:
        break

`.agg()` aggregates each group using one or more operations over the specified axis. These operations can be built-in (`min`, `max`) or user-defined (`ammd`).

<br>
<img src="images/pandas_groupby_agg_biggest_change.png" alt="drawing" width="700"/>
<br> 
<center>Courtesy of Josh Hug</center>
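
To connect this back to Approach 1, the sketch below computes AMMD on a tiny made-up DataFrame in both ways: once with the explicit loop and once with `.groupby().agg()`. Selecting the `Count` column before aggregating is a choice made here for clarity.

```python
import pandas as pd

def ammd(series):
    return max(series) - min(series)

# Made-up data, used only for illustration
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Ann', 'Bob', 'Ann'],
    'Count': [10, 4, 25, 9, 18],
})

# Approach 1: filter the DataFrame once per unique name
loop_result = {}
for name in sorted(toy['Name'].unique()):
    loop_result[name] = ammd(toy[toy['Name'] == name]['Count'])

# groupby/agg: split into groups by name, then apply ammd to each group
grouped_result = toy.groupby('Name')['Count'].agg(ammd)

print(loop_result)
print(grouped_result)
```

Both versions give the same AMMD per name; the groupby version simply avoids re-filtering the whole DataFrame for every name.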

Let's take a look at the DataFrame again.

In [None]:
baby_names.head()

If I run `baby_names.groupby('Name').agg(ammd).head()`, what would you expect the output to be?

In [None]:
baby_names.groupby('Name').agg(ammd).head()

What do you think the Year column represents?
- The number of years a name appeared.
- The difference between the earliest and latest year a name appeared.
- It has no meaning because our code was only designed to work with counts.
- Not sure.

<details>
    <summary>Solution</summary>
<code>
The difference between the earliest and latest year a name appeared.
</code>
</details>

Why don't we see columns for Sex or State? We did not tell GroupBy which columns we wanted it to apply the aggregation function to, so it applied it to all the relevant columns and returned the output. Sex and State have string (object) datatypes and therefore `ammd` could not be computed for them (we cannot subtract strings). For comparison, see the code below, which gets the first row of each group and works on every column.

In [None]:
baby_names.groupby('Name').first().head()
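
Depending on the pandas version, columns for which the aggregation function fails may be dropped silently or may raise an error instead. A way to sidestep this, sketched below on made-up data, is to select the columns of interest explicitly before calling `.agg()`.

```python
import pandas as pd

def ammd(series):
    return max(series) - min(series)

# Made-up data, used only for illustration
toy = pd.DataFrame({
    'Name': ['Ann', 'Bob', 'Ann', 'Bob'],
    'Sex': ['F', 'M', 'F', 'M'],
    'Year': [2000, 2001, 2010, 2005],
    'Count': [10, 4, 25, 9],
})

# Only the numeric columns are handed to ammd, so the string column
# 'Sex' never reaches the subtraction.
result = toy.groupby('Name')[['Year', 'Count']].agg(ammd)
print(result)
```

Here the `Year` column holds the difference between the latest and earliest year for each name, and `Count` holds the AMMD.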

<a id='groupby_puzzles1'></a>
## `.groupby()` Puzzles
#### Puzzle 1
To test your understanding, try to interpret the result of the code below.

In [None]:
baby_names.head(5)

In [None]:
baby_names.groupby('Year').agg(ammd).plot();

For reference, the first 5 values from the plot above are below.

In [None]:
baby_names.groupby('Year').agg(ammd).head()

#### Puzzle 2
Let's import another dataset, this time on elections.

In [None]:
elections = pd.read_csv('elections.csv')
elections.head()

Let's groupby political party and then take the maximum value of each column in a group.

In [None]:
elections.groupby('Party').agg(max).head(10)

We have to be careful when using aggregation functions. For example, the output above might be misinterpreted to say that Woodrow Wilson ran for election in 2016. Why is this happening?

Every column is calculated independently! Among Democrats:
- Last year they ran: 2016
- Alphabetically latest candidate name: Woodrow Wilson
- Highest number of votes: 69498516
- Alphabetically latest Result ['loss', 'win']: win
- Highest % of vote: 61.34
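
The same effect can be reproduced on a tiny made-up DataFrame (the parties and candidates below are hypothetical). The sketch uses `.max()` directly, which performs the same per-column maximum.

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    'Party': ['X', 'X', 'Y'],
    'Candidate': ['Adams', 'Zed', 'Lee'],
    'Year': [2016, 1912, 2000],
    '%': [48.0, 42.0, 51.0],
})

# Each column's maximum is computed independently within each party
per_column_max = toy.groupby('Party').max()
print(per_column_max)
```

For party X the result pairs the candidate `'Zed'` with the year 2016, even though no such row exists in `toy`: `'Zed'` is simply the alphabetically latest name and 2016 the latest year.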

#### Puzzle 3
Inspired by above, try to predict the results of the groupby operation shown. 
<br>
<img src="images/groupby_puzzle_3.png" alt="drawing" width="700"/>
<br> 
<center>Courtesy of Josh Hug</center>

<details>
    <summary>Solution</summary>
<code>
[A, 3, hi]
[B, 6, tx]
[C, 9, sd]
</code>
</details>

#### Puzzle 4
Next we'll write code that properly returns the best result by each party. That is, each row should show the Year, Candidate, Popular Vote, Result, and % for the election in which that party saw its best results (rather than mixing them as in Puzzle 2).

First sort the DataFrame so that rows are in descending order of %.

In [None]:
elections_sorted_by_percent = elections.sort_values('%', ascending=False)
elections_sorted_by_percent.head()

Then group by Party and take the 0th member of each series, which will be the best outcome (highest % of the vote).

In [None]:
elections_sorted_by_percent.groupby('Party').agg(lambda x : x.iloc[0])
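
An alternative to sorting first, sketched here on made-up data, is to ask each group for the row label of its maximum % with `idxmax` and then pass those labels to `.loc`. This is a different technique from the one above, shown only for comparison.

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    'Party': ['X', 'X', 'Y', 'Y'],
    'Candidate': ['Adams', 'Zed', 'Lee', 'Kim'],
    'Year': [2016, 1912, 2000, 2004],
    '%': [48.0, 42.0, 51.0, 55.0],
})

# For each party, find the index label of the row with the highest %
best_labels = toy.groupby('Party')['%'].idxmax()

# Use those labels to pull out the complete, unmixed rows
best_rows = toy.loc[best_labels]
print(best_rows)
```

Because whole rows are selected, the Year, Candidate and % values in each row all come from the same election.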

#### Other `.groupby()` features.
**.size()**
<br>
Size returns a Series giving the size of each group.

![](groupby_size.png)
<center>Courtesy of UCBerkeley</center>

In [None]:
elections.groupby('Party').size().head()

**.filter()**
<br>
Filter gives a copy of the original DataFrame where row r is included if its group obeys the given condition.

Note: Filtering is done per GROUP, not per ROW.

![](images/groupby_filter.png)
<center>Courtesy of Josh Hug</center>

In [None]:
elections.groupby('Year').filter(lambda df: df['%'].max() < 45)
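
To see that `.size()` and `.filter()` operate per group, here is a minimal sketch on made-up data (the candidates and percentages are hypothetical).

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    'Year': [1992, 1992, 1996, 1996],
    'Candidate': ['A', 'B', 'C', 'D'],
    '%': [43.0, 37.4, 49.2, 40.7],
})

# Number of rows in each group
group_sizes = toy.groupby('Year').size()

# Keep every row of a year only if that year's best % is below 45
close_years = toy.groupby('Year').filter(lambda df: df['%'].max() < 45)

print(group_sizes)
print(close_years)
```

Candidate D has 40.7%, which is below 45, yet D's row is dropped because the 1996 group as a whole fails the condition.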

**`.sum()`, `.mean()`, etc.**
<br>
As an alternative to `groupby.agg(sum)`, we can also simply do `groupby.sum()`.
<br>
<img src="images/groupby_sum.png" alt="drawing" width="700"/>
<br> 
<center>Courtesy of Josh Hug</center>

In [None]:
elections.groupby('Year').agg(sum).head()

In [None]:
elections.groupby('Year').sum().head()

Try other common operations.

`elections.groupby('Year').min()`<br>
`elections.groupby('Year').max()`<br>
`elections.groupby('Year').mean()`<br>
`elections.groupby('Year').median()`

**.groupby() multiple features.**
<br>
It is possible to group a DataFrame by multiple features. For example, if we group by Name and Sex, we get back a DataFrame with the total number of babies of each sex given each name.

The DataFrame resulting from the aggregation operation is now multi-indexed. That is, its index has more than one level.

In [None]:
baby_names.groupby(['Name', 'Sex']).sum().head()

In [None]:
baby_names.groupby(['Name', 'Sex']).sum().loc[['Mike', 'Blake', 'Avery'], :]
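
A short sketch of how the multi-level index can be inspected and queried, using a made-up DataFrame:

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    'Name': ['Avery', 'Avery', 'Blake', 'Avery'],
    'Sex': ['F', 'M', 'M', 'F'],
    'Count': [30, 12, 20, 5],
})

totals = toy.groupby(['Name', 'Sex'])[['Count']].sum()

# The index now has two levels: Name and Sex
print(totals.index.names)

# Selecting on the first level returns all rows for that name
avery = totals.loc['Avery']

# A tuple selects on both levels at once
avery_f = totals.loc[('Avery', 'F'), 'Count']
print(avery_f)  # 30 + 5 = 35
```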

<a id='section5'></a>
## Goal 5
#### Finding the number of babies born in each year of each sex.
Suppose we want to build a table showing the total number of babies born of each sex in each year. One way is to `.groupby()` using both columns of interest.

In [None]:
baby_names.groupby(['Year', 'Sex']).sum().head()

A more natural approach is to use a **Pivot Table**.

In [None]:
baby_names.head()

The basic idea behind **Pivot Tables** is shown in the image below.
<br>
<img src="images/pivot_table.png" alt="drawing" width="700"/>
<br> 
<center>Courtesy of Josh Hug</center>

In [None]:
baby_names_pivot = baby_names.pivot_table(
    index='Year',    # the rows (turned into index)
    columns='Sex',   # the column values
    values='Count',  # the field(s) to be processed in each group
    aggfunc=sum,     # group operation
)
baby_names_pivot.head()

In [None]:
baby_names_pivot_2 = baby_names.pivot_table(
    index='Year',              # the rows (turned into index)
    columns='Sex',             # the column values
    values=['Count', 'Name'],  # the field(s) to be processed in each group
    aggfunc=np.max,            # group operation
)
baby_names_pivot_2.head()
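
To show how a pivot table relates to grouping by two columns, the sketch below builds the same table both ways on made-up data. Passing `aggfunc='sum'` as a string and calling `.unstack()` are choices made for this illustration.

```python
import pandas as pd

# Made-up data, used only for illustration
toy = pd.DataFrame({
    'Year': [2000, 2000, 2000, 2001, 2001],
    'Sex': ['F', 'M', 'F', 'F', 'M'],
    'Count': [10, 7, 5, 8, 6],
})

# Pivot table: Year as rows, Sex as columns, summed Count as values
pivoted = toy.pivot_table(index='Year', columns='Sex', values='Count', aggfunc='sum')

# Same numbers via groupby on both columns, then moving Sex into the columns
unstacked = toy.groupby(['Year', 'Sex'])['Count'].sum().unstack()

print(pivoted)
print(unstacked)
```

Both tables have one row per year and one column per sex; the pivot table just expresses the reshaping in a single call.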

<a id='groupby_puzzles2'></a>
## Another `.groupby()` Puzzle
#### More careful look at the most popular 2018 name in New York.
In Goal 1, we didn't take into account the unlikely possibility that the most popular name was actually spread across both birth sexes. For example, what if in the table below it turns out that there were 300 female Noahs born in NY in 2018? In that case, Noah would actually be the most popular.

Since our queries are getting pretty long, I've stuck them inside parentheses which allows us to spread them over many lines.

In [None]:
(
    baby_names[baby_names['Year'] == 2018]
        .sort_values(by='Count', ascending=False)
        .head()
)

Try to add a single line to the operation above so that each row represents the sum of male and female babies born in 2018 with that name. 

In [None]:
(
    baby_names[baby_names['Year'] == 2018]
        .groupby('Name')[['Count']].sum()  # total Count per name across both sexes
        .sort_values(by='Count', ascending=False)
        .head()
)
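
The hypothetical Noah scenario can be played out on a tiny made-up table (all counts below are invented for illustration and are not taken from the dataset).

```python
import pandas as pd

# Made-up 2018 rows, used only for illustration
toy = pd.DataFrame({
    'Sex': ['F', 'M', 'M', 'F'],
    'Name': ['Emma', 'Liam', 'Noah', 'Noah'],
    'Count': [1000, 1100, 900, 300],
})

# Most popular name when each (Name, Sex) row is considered separately
per_row_top = toy.sort_values(by='Count', ascending=False).iloc[0]['Name']

# Most popular name after summing counts across both sexes
combined = toy.groupby('Name')[['Count']].sum().sort_values(by='Count', ascending=False)
combined_top = combined.index[0]

print(per_row_top, combined_top)
```

With these invented counts, Liam leads row by row, but Noah comes out on top once the male and female counts are added together.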