# Lecture 3 – Data 100, Summer 2024


[Acknowledgments Page](https://ds100.org/su24/acks/)

A demonstration of advanced `pandas` syntax to accompany Lecture 3.

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

In [None]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird

### <font color='red'>STOP!</font> Extraction Exercise

Try figuring out what the following cells evaluate to **without** running them:

In [None]:
weird[1]

In [None]:
weird["1"]

In [None]:
weird[1:]

In [None]:
weird[["1", 1]]

In [None]:
# Results in a KeyError

weird.loc[:,0]

In [None]:
weird.loc[1]
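
Once you have made your predictions, the sketch below is one way to check them. It rebuilds `weird` and stores the result of three of the accesses; the variable names on the left are illustrative and not part of the lecture code.

```python
import pandas as pd

weird = pd.DataFrame({
    1: ["topdog", "botdog"],
    "1": ["topcat", "botcat"],
})

# A single label inside [] selects a column.
# The integer 1 and the string "1" are two different column labels.
int_col = weird[1]
str_col = weird["1"]

# A slice inside [] selects rows by position instead of columns
tail_rows = weird[1:]

# .loc with a single label selects the row whose index label is 1
row_one = weird.loc[1]

print(int_col.tolist(), str_col.tolist())
print(tail_rows.shape)
print(row_one.tolist())
```

The same brackets mean different things depending on what goes inside them, which is why `.loc` and `.iloc` are often the clearer choice.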

## Dataset: California baby names

In today's lecture, we'll work with the `babynames` dataset, which contains information about the names of infants born in California.

The cell below downloads baby name data from the Social Security Administration's website and then loads it into a usable form. The code shown here is outside the scope of Data 100, but you're encouraged to dig into it if you are interested!

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "data/babynamesbystate.zip"
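os.makedirs(os.path.dirname(local_filename), exist_ok=True) # Make sure the data/ folder exists
if not os.path.exists(local_filename): # Skip the download if the file already exists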
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()

In [None]:
babynames

## Conditional Selection

In [None]:
# Ask yourself: Why is :9 the correct slice to select the first 10 rows?
babynames_first_10_rows = babynames.loc[:9, :]

babynames_first_10_rows
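
A minimal sketch of the idea behind that question, using a hypothetical 12-row stand-in (`toy`) for `babynames` with the default integer index. `.loc` slices by label and includes the endpoint, while `.iloc` slices by position and excludes it.

```python
import pandas as pd

# Hypothetical stand-in for babynames with the default 0, 1, 2, ... index
toy = pd.DataFrame({"Name": list("abcdefghijkl")})

by_label = toy.loc[:9, :]        # label-based: the endpoint 9 is included
by_position = toy.iloc[:10, :]   # position-based: the endpoint 10 is excluded

print(len(by_label), len(by_position))
print(by_label.equals(by_position))
```

The two slices agree here only because the index labels happen to match the row positions.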

By passing in a sequence (list, array, or `Series`) of boolean values, we can extract a subset of the rows in a `DataFrame`. We will keep *only* the rows that correspond to a boolean value of `True`.

In [None]:
# Notice how we have exactly 10 elements in our boolean array argument.
babynames_first_10_rows[[True, False, True, False, True, False, True, False, True, False]]

In [None]:
# Or using .loc to filter a DataFrame by a Boolean array argument.
babynames_first_10_rows.loc[[True, False, True, False, True, False, True, False, True, False], :]


Oftentimes, we'll use boolean selection to check for entries in a `DataFrame` that meet a particular condition.

In [None]:
# First, use a logical condition to generate a boolean Series
logical_operator = (babynames["Sex"] == "F")
logical_operator

In [None]:
# Then, use this boolean Series to filter the DataFrame
babynames[logical_operator]

Boolean selection also works with `loc`!

In [None]:
# Notice that we did not have to specify columns to select 
# If no columns are referenced, pandas will automatically select all columns
babynames.loc[babynames["Sex"] == "F"]

To filter on multiple conditions, we combine boolean `Series` using **bitwise operators**.

Symbol | Usage      | Meaning 
------ | ---------- | -------------------------------------
~    | ~p       | Returns negation of p
&#124; | p &#124; q | p OR q
&    | p & q    | p AND q
^  | p ^ q | p XOR q (exclusive or)
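
As a quick illustration of the table, here is a sketch on a hypothetical four-row miniature of `babynames` (the name `toy` and its values are made up). It also shows why each condition is wrapped in parentheses when written inline.

```python
import pandas as pd

# Hypothetical miniature of babynames
toy = pd.DataFrame({
    "Sex": ["F", "F", "M", "M"],
    "Year": [1990, 2010, 1990, 2010],
})

is_f = toy["Sex"] == "F"
before_2000 = toy["Year"] < 2000

not_f = ~is_f                      # negation
either = is_f | before_2000        # OR
both = is_f & before_2000          # AND
exactly_one = is_f ^ before_2000   # XOR

# Parentheses matter when the comparisons are written inline:
# & binds more tightly than == and <, so leaving them out changes the meaning.
inline = toy[(toy["Sex"] == "F") & (toy["Year"] < 2000)]

print(both.tolist())
print(len(inline))
```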

In [None]:
babynames[(babynames["Sex"] == "F") & (babynames["Year"] < 2000)]

In [None]:
babynames[(babynames["Sex"] == "F") | (babynames["Year"] < 2000)]

### <font color='red'>STOP!</font> Slido Exercise

Try answering the Slido poll/following question **without** running the next cell: Which of the following pandas statements returns a DataFrame of the first 3 baby names with `Count > 300`?

<img src="images/slido_1.png" width="200"/>

In [None]:
babynames.iloc[[0, 233, 485], [3, 4]]

In [None]:
babynames.loc[[0, 233, 485]]

In [None]:
babynames.loc[babynames["Count"] > 300, ["Name", "Count"]].head(3)

In [None]:
babynames.loc[babynames["Count"] > 300, ["Name", "Count"]].iloc[0:2, :]

In [None]:
# Note: The parentheses surrounding the code make it possible to break the code into multiple lines for readability

(
    babynames[(babynames["Name"] == "Angela") | 
              (babynames["Name"] == "Jacob") |
              (babynames["Name"] == "Zekai") |
              (babynames["Name"] == "Maya")]
)


In [None]:
# A more concise method to achieve the above: .isin
names = ["Angela", "Jacob", "Zekai", "Maya"]
display(babynames["Name"].isin(names))
display(babynames[babynames["Name"].isin(names)])
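
To see that `.isin` matches the chained `|` comparisons, here is a small sketch on a hypothetical four-name `DataFrame` (`toy` is an assumption, not the real dataset).

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Angela", "Maya", "Noah", "Jacob"]})  # hypothetical data
names = ["Angela", "Jacob", "Zekai", "Maya"]

# One comparison per name, combined with OR
chained = (
    (toy["Name"] == "Angela") |
    (toy["Name"] == "Jacob") |
    (toy["Name"] == "Zekai") |
    (toy["Name"] == "Maya")
)

# The same boolean Series in a single call
via_isin = toy["Name"].isin(names)

print(chained.equals(via_isin))
```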

In [None]:
# What if we only want names that start with "M"?
display(babynames["Name"].str.startswith("M"))
display(babynames[babynames["Name"].str.startswith("M")])

### <font color='red'>STOP!</font> Conditional Selection Exercise

If possible, try answering the following questions without peeking:
* What is the count for `Alex` for `Sex` `F` in `2000`?
* How do I get all the rows where `Name` starts with `Kev` or `Count > 600`?

In [None]:
babynames[(babynames['Name'] == 'Alex') & (babynames['Sex'] == 'F') & (babynames['Year'] == 2000)].iloc[0, 4]

In [None]:
babynames[(babynames["Name"].str.startswith("Kev")) | (babynames['Count'] > 600)]

In [None]:
babynames

## Adding, Removing, and Modifying Columns

To add a column, use `[]` to reference the desired new column, then assign it a `Series` or array of the appropriate length.

In [None]:
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths

babynames

To modify a column, use `[]` to access the desired column, then reassign it a new array or `Series`.

In [None]:
# Modify the "name_lengths" column to be one less than its original value
babynames["name_lengths"] = babynames["name_lengths"] - 1
babynames

Rename a column using the `.rename()` method.

In [None]:
# Rename "name_lengths" to "Length"
babynames = babynames.rename(columns={"name_lengths":"Length"})
babynames

Remove a column using `.drop()`.

In [None]:
# Remove our new "Length" column
babynames = babynames.drop("Length", axis="columns")
babynames
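
One detail worth noticing in the cells above is that the `.rename()` and `.drop()` results were assigned back to `babynames`. The sketch below, on a hypothetical two-row `DataFrame`, suggests why: assigning with `[]` changes the `DataFrame` itself, while `.rename()` and `.drop()` return new objects by default.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Ana", "Benjamin"]})  # hypothetical data

# Adding and modifying columns with [] changes toy itself
toy["name_lengths"] = toy["Name"].str.len()
toy["name_lengths"] = toy["name_lengths"] - 1

# .rename() and .drop() return new DataFrames; toy stays as it was
renamed = toy.rename(columns={"name_lengths": "Length"})
dropped = toy.drop("name_lengths", axis="columns")

print(list(toy.columns))
print(list(renamed.columns))
print(list(dropped.columns))
```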

## Useful Utility Functions

#### `NumPy`

The `NumPy` functions you encountered in [Data 8](https://www.data8.org/su23/reference/#array-functions-and-methods) are compatible with objects in `pandas`. 

In [None]:
yash_counts = babynames[babynames["Name"] == "Yash"]["Count"]
yash_counts

In [None]:
# Average number of babies named Yash each year

np.mean(yash_counts)

In [None]:
# Max number of babies named Yash born in any single year

max(yash_counts)

#### Built-In `pandas` Methods

There are many, *many* utility functions built into `pandas`, far more than we can possibly cover in lecture. You are encouraged to explore all the functionality outlined in the `pandas` [documentation](https://pandas.pydata.org/docs/reference/index.html).

In [None]:
# Returns the shape of the object in the format (num_rows, num_columns)
babynames.shape

In [None]:
# Returns the total number of entries in the object, equal to num_rows * num_columns
babynames.size

In [None]:
# What summary statistics can we describe?
babynames.describe()

In [None]:
# "Sex" holds strings, so describe() reports count, unique, top, and freq instead of numeric statistics
babynames["Sex"].describe()

In [None]:
# Randomly sample row(s) from the DataFrame
babynames.sample()

In [None]:
# Rerun this cell a few times – you'll get different results!
babynames.sample(5).iloc[:, 2:]

In [None]:
# Sampling with replacement
babynames[babynames["Year"] == 2000].sample(4, replace = True).iloc[:,2:]

In [None]:
# Count the number of times each unique value occurs in a Series
babynames["Name"].value_counts()

In [None]:
# Return an array of all unique values in the Series
babynames["Name"].unique()

In [None]:
# Sort a Series
babynames["Name"].sort_values()

In [None]:
# Sort a DataFrame – there are lots of Michaels in California
babynames.sort_values(by="Count", ascending=False)

## Custom sorting

### Approach 1: Create a temporary column

In [None]:
# Create a Series of the length of each name
babyname_lengths = babynames["Name"].str.len()

# Add a column named "name_lengths" that includes the length of each name
babynames["name_lengths"] = babyname_lengths
babynames.head(5)

In [None]:
# Sort by the temporary column
babynames = babynames.sort_values(by="name_lengths", ascending=False)
babynames.head(5)

In [None]:
# Drop the "name_lengths" column
babynames = babynames.drop("name_lengths", axis="columns")
babynames.head(5)

### Approach 2: Sorting using the `key` argument

In [None]:
babynames.sort_values("Name", key=lambda x:x.str.len(), ascending=False).head()
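
A small sketch of how `key` behaves, on hypothetical data. The helper `name_length` is an illustrative name; the point is that the `key` function receives the whole column as a `Series` and must return a `Series` of sort values.

```python
import pandas as pd

toy = pd.DataFrame({"Name": ["Al", "Christopher", "Maya"]})  # hypothetical data

def name_length(names):
    # `names` is the entire "Name" column (a Series), not a single string
    return names.str.len()

longest_first = toy.sort_values("Name", key=name_length, ascending=False)
print(longest_first["Name"].tolist())
```

Unlike Approach 1, no temporary column has to be added and dropped afterward.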

### Approach 3: Sorting Using the `map` Function

We can also use the `.map` method of a `Series` if we want to apply an arbitrarily defined function to each value. Suppose we want to sort by the number of occurrences of "dr" plus the number of occurrences of "ea".

In [None]:
# First, define a function to count the number of times "dr" or "ea" appear in each name
def dr_ea_count(string):
    return string.count('dr') + string.count('ea')

# Then, use `map` to apply `dr_ea_count` to each name in the "Name" column
babynames["dr_ea_count"] = babynames["Name"].map(dr_ea_count)

# Sort the DataFrame by the new "dr_ea_count" column so we can see our handiwork
babynames = babynames.sort_values(by="dr_ea_count", ascending=False)
babynames.head()

In [None]:
# Drop the `dr_ea_count` column
babynames = babynames.drop("dr_ea_count", axis="columns")
babynames.head(5)

## Grouping

Group rows that share a common feature, then aggregate data across the group.

In this example, we count the total number of babies born in each year (considering only a small subset of the data, for simplicity).

<img src="images/groupby.png" width="800"/>

In [None]:
# The code below uses the full babynames dataset, which is why some numbers are different relative to the diagram
babynames[["Year", "Count"]].groupby("Year").agg(sum)
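
To make the split-then-aggregate idea concrete, here is a sketch on a hypothetical five-row miniature of `babynames`. The values are made up, and the string `"sum"` is used in place of the built-in `sum` because some `pandas` versions emit a warning when given the built-in.

```python
import pandas as pd

# Hypothetical miniature of babynames
toy = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001, 2001],
    "Name": ["Ana", "Ben", "Ana", "Ben", "Cy"],
    "Count": [10, 20, 30, 40, 50],
})

# Split the rows by "Year", then add up "Count" within each group
totals = toy[["Year", "Count"]].groupby("Year").agg("sum")
print(totals)
```

The result has one row per distinct year, and `Year` becomes the index.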

There are many different aggregation functions we can use, all of which are useful in different applications.

In [None]:
# What is the earliest year in which each name appeared?
babynames.groupby("Name")[["Year"]].agg(min)

In [None]:
# What is the largest single-year count of each name?
babynames.groupby("Name")[["Count"]].agg(max)

Let's revisit the per-year totals step by step. Calling `.groupby` on its own returns a `DataFrameGroupBy` object rather than a `DataFrame`; we still need to apply an aggregation function to it.

In [None]:
babynames.groupby("Year")

In [None]:
# Selecting only numerical columns to perform grouping on and then grouping by "Year"
babies_by_year = babynames[["Year", "Count"]].groupby("Year").agg(sum)
babies_by_year

What happens if we don't select columns `Year` and `Count` before calling `groupby` and our aggregation function? The results are messy! 

In [None]:
babynames.groupby("Year").agg(sum)

Alternatively, we could select the relevant columns after calling `groupby` from the "sub-`DataFrames`":

In [None]:
babynames.groupby("Year")[["Year", "Count"]].agg(sum)

Another option is `.sum(numeric_only=True)`. The result is slightly different: the `Year` column is not aggregated, even though it is numeric, because we are grouping by it.

In [None]:
babynames.groupby("Year").sum(numeric_only=True)

In [None]:
# Plotting baby counts per year
fig = px.line(babies_by_year, y = "Count")
fig.update_layout(font_size = 18, 
                  autosize=False, 
                  width=700, 
                  height=400)

### <font color='red'>STOP!</font> Slido Exercise

Try answering the Slido poll/following question **without** looking at the next image. Try to predict the results of the `groupby` operation shown. 

The answer is below the image.

<img src="images/slido_groupby.png" alt="Image" width="600">

The top ?? will be "hi", the second ?? will be "tx", and the third ?? will be "sd". 

In [None]:
ds = pd.DataFrame(dict(x=[3, 1, 4, 1, 5, 9, 2, 5, 6], 
                      y=['ak', 'tx', 'fl', 'hi', 'mi', 'ak', 'ca', 'sd', 'nc']), 
                      index=list('ABCABCACB') )
ds

In [None]:
# Group by the index labels and aggregate each column with max
ds.groupby(ds.index).agg(max)

In [None]:
('hi' > 'ak') & ('hi' > 'ca')
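
The sketch below takes a closer look at group `A` to show what is going on. Each column is aggregated on its own, so a row of the result need not match any single row of `ds`. The names `group_a`, `top_x_row`, and `result` are illustrative.

```python
import pandas as pd

ds = pd.DataFrame(dict(x=[3, 1, 4, 1, 5, 9, 2, 5, 6],
                       y=['ak', 'tx', 'fl', 'hi', 'mi', 'ak', 'ca', 'sd', 'nc']),
                  index=list('ABCABCACB'))

group_a = ds.loc["A"]   # the three rows labeled "A"

# Each column is reduced separately; strings are compared alphabetically
max_x = group_a["x"].max()
max_y = group_a["y"].max()

# The row holding the largest x has a different y value
top_x_row = group_a.iloc[group_a["x"].to_numpy().argmax()]

result = ds.groupby(ds.index).agg("max")
print(max_x, max_y, top_x_row["y"])
print(result)
```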

***
If we have extra time... Otherwise, this will be covered in the next lecture!

### Case Study: Name "Popularity"

In this exercise, let's find the name with sex "F" that has dropped most in popularity since its peak usage in California. We'll start by filtering `babynames` to only include names corresponding to sex "F".

In [None]:
f_babynames = babynames[babynames["Sex"] == "F"]
f_babynames

In [None]:
# We sort the data by "Year"
f_babynames = f_babynames.sort_values("Year")
f_babynames

To build our intuition on how to answer our research question, let's visualize the prevalence of the name "Jennifer" over time.

In [None]:
# We'll talk about how to generate plots in a later lecture
fig = px.line(f_babynames[f_babynames["Name"] == "Jennifer"],
              x="Year", y="Count")

fig.update_layout(font_size = 18, 
                  autosize=False, 
                  width=1000, 
                  height=400)

We'll need a mathematical definition for the change in popularity of a name in California.

Define the metric "Ratio to Peak" (RTP). We'll calculate this as the count of the name in 2022 (the most recent year for which we have data) divided by the largest count of this name in *any* year. 

A demo calculation for Jennifer:

In [None]:
# In the year with the highest Jennifer count, 6065 Jennifers were born
max_jenn = np.max(f_babynames[f_babynames["Name"] == "Jennifer"]["Count"])
max_jenn

In [None]:
# Remember that we sorted f_babynames by "Year". 
# This means that grabbing the final entry gives us the most recent count of Jennifers: 114
# In 2022, the most recent year for which we have data, 114 Jennifers were born
curr_jenn = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"].iloc[-1]
curr_jenn

In [None]:
# Compute the RTP
curr_jenn / max_jenn

We can also write a function that produces the `ratio_to_peak` for a given `Series`. This will allow us to use `.groupby` to speed up our computation for all names in the dataset.

In [None]:
def ratio_to_peak(series):
    """
    Compute the RTP for a Series containing the counts per year for a single name (sorted by year in ascending order).
    """
    return series.iloc[-1] / np.max(series)

In [None]:
# Construct a Series containing our Jennifer count data
jenn_counts_ser = f_babynames[f_babynames["Name"] == "Jennifer"]["Count"]

# Then, find the RTP
ratio_to_peak(jenn_counts_ser)
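
As a sanity check on the definition, here is a sketch with made-up counts that are easy to verify by hand. The function is repeated so the block runs on its own; `falling` and `at_peak` are hypothetical `Series`.

```python
import numpy as np
import pandas as pd

def ratio_to_peak(series):
    # Same definition as above, repeated so this sketch is self-contained
    return series.iloc[-1] / np.max(series)

# Hypothetical counts for one name, ordered from earliest to latest year
falling = pd.Series([10, 40, 20])   # most recent count is half the peak
at_peak = pd.Series([10, 20, 30])   # most recent count is the peak

print(ratio_to_peak(falling))   # 20 / 40
print(ratio_to_peak(at_peak))   # 30 / 30
```

An RTP close to 0 means the name has fallen far from its peak, while an RTP of 1 means the most recent year is the peak.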

Now, let's use `.groupby` to compute the RTPs for *all* names in the dataset.

Depending on your `pandas` version, the first cell below produces a warning or a `TypeError`. As discussed in the lecture, our aggregation function can't be applied to non-numeric data (it doesn't make sense to divide "CA" by a number). Instead, we can select the numerical columns of interest directly.

In [None]:
# Results in a TypeError
rtp_table = f_babynames.groupby("Name").agg(ratio_to_peak)
rtp_table

In [None]:
rtp_table = f_babynames.groupby("Name")[["Year", "Count"]].agg(ratio_to_peak)
rtp_table

This is the `pandas` equivalent of `.group` from [Data 8](http://data8.org/datascience/_autosummary/datascience.tables.Table.group.html). If we wanted to achieve this same result using the `datascience` library, we would write:

`f_babynames.group("Name", ratio_to_peak)`

### <font color='red'>STOP!</font> Slido Exercise

Try answering the Slido poll/following question **without** running the next cell: Is there a row where `Year` is not equal to 1?

In [None]:
# Unique values in the Year column
rtp_table["Year"].unique()
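
To see why the `Year` column behaves this way, here is a sketch on a hypothetical four-row table that is already sorted by `Year`. Within each name, the last year is also the largest year, so its ratio to the peak is 1.

```python
import numpy as np
import pandas as pd

def ratio_to_peak(series):
    # Same definition as above, repeated so this sketch is self-contained
    return series.iloc[-1] / np.max(series)

# Hypothetical data, already sorted by "Year"
toy = pd.DataFrame({
    "Name": ["Ana", "Bea", "Ana", "Bea"],
    "Year": [2000, 2000, 2001, 2001],
    "Count": [40, 10, 20, 30],
})

# ratio_to_peak is applied to "Year" and "Count" separately within each name
toy_rtp = toy.groupby("Name")[["Year", "Count"]].agg(ratio_to_peak)
print(toy_rtp)
```

Only the `Count` column carries information about popularity, which is why the `Year` column is dropped next.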

In [None]:
# Dropping the "Year" column
rtp_table.drop("Year", axis="columns", inplace=True)
rtp_table

In [None]:
# Rename "Count" to "Count RTP" for clarity
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})
rtp_table

In [None]:
# What name has fallen the most in popularity?
rtp_table.sort_values("Count RTP")

We can visualize the decrease in the popularity of the name "Debra":

In [None]:
def plot_name(*names):
    fig = px.line(f_babynames[f_babynames["Name"].isin(names)], 
                  x = "Year", y = "Count", color="Name",
                  title=f"Popularity for: {names}")
    fig.update_layout(font_size = 18, 
                  autosize=False, 
                  width=1000, 
                  height=400)
    return fig

plot_name("Debra")

In [None]:
# Find the 10 names that have decreased the most in popularity
top10 = rtp_table.sort_values("Count RTP").head(10).index
top10

In [None]:
plot_name(*top10)