# Lecture 4 – Data 100, Fall 2025

[Acknowledgments Page](https://ds100.org/fa25/acks/)

A demonstration of advanced `pandas` syntax to accompany Lecture 4.

In [None]:
import numpy as np
import pandas as pd
import plotly.express as px

### Loading `babynames` Dataset for California

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "data/babynamesbystate.zip"
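# Create the data directory if it is missing, and skip the download if the file already exists
os.makedirs(os.path.dirname(local_filename), exist_ok=True)
if not os.path.exists(local_filename):
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())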

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.tail(10)

## Grouping

Group rows that share a common feature, then aggregate data across the group.

In this example, we count the total number of babies born each year in California.
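
As a minimal sketch of the split-then-aggregate idea, the cell below uses a small made-up table (the names and counts in `toy` are hypothetical, not taken from `babynames`) and compares the `groupby` result with the same totals computed one group at a time.

```python
import pandas as pd

# Hypothetical toy data standing in for babynames
toy = pd.DataFrame({
    "Year": [2000, 2000, 2001, 2001, 2001],
    "Name": ["Ava", "Ben", "Ava", "Ben", "Cal"],
    "Count": [10, 20, 30, 40, 50],
})

# Split the rows by Year, then sum Count within each group
toy_by_year = toy.groupby("Year")[["Count"]].agg("sum")
print(toy_by_year)

# The same totals computed by hand, one year at a time
manual = {int(year): int(toy.loc[toy["Year"] == year, "Count"].sum())
          for year in toy["Year"].unique()}
print(manual)
```

Both approaches give one total per year; `groupby` simply does the splitting and combining for us.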

In [None]:
babynames.groupby("Year")

In [None]:
# Grouping by "Year" and aggregating the "Count" column
# to get the total number of babies born each year.
babies_by_year = babynames.groupby("Year")[["Count"]].agg("sum")
babies_by_year

In [None]:
# Plotting baby counts per year
# Don't worry about the syntax here. We will cover visualization later in the course.
fig = px.line(babies_by_year, y="Count")
fig.update_layout(font_size=18, 
                  autosize=False, 
                  width=700, 
                  height=400)

### Slido Exercise

Try to predict the results of the `groupby` operation shown. The answer is below the image.

<img src="images/groupby_mystery.png" alt="Image" width="600">

In [None]:
df = pd.DataFrame({
  'col1' : ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'C', 'B'],
  'col2' : [3, 1, 4, 1, 5, 9, 2, 5, 6], 
  'col3' : ['ak', 'tx', 'fl', 'hi', 'mi', 'ak', 'ca', 'sd', 'nc']
})
df

In [None]:
# When we don't specify the columns, pandas will try to apply the aggregation to all columns. See next cell for proof!
df.groupby('col1').agg('max')

In [None]:
df.groupby('col1')[['col2', 'col3']].agg('max')

### Case Study: Name "Popularity"

In this exercise, let's find the name with sex "F" that has dropped the most in popularity since its peak usage in California. We'll start by filtering `babynames` to only include names corresponding to sex "F".

In [None]:
f_babynames = babynames[babynames["Sex"]=="F"]
f_babynames

To build our intuition on how to answer our research question, let's visualize the prevalence of the name "Jennifer" over time.

In [None]:
jenn_entries = f_babynames[f_babynames["Name"]=="Jennifer"]
jenn_entries

In [None]:
# We'll talk about how to generate plots in a later lecture
fig = px.line(jenn_entries, x="Year", y="Count")

fig.update_layout(font_size = 18, 
                  autosize=False, 
                  width=1000, 
                  height=400)

We'll need a mathematical definition for the change in popularity of a name in California.

Define the metric "Ratio to Peak" (RTP). We'll calculate this as the count of the name in 2022 (the most recent year for which we have data) divided by the largest count of this name in *any* year. 

A demo calculation for Jennifer:

In [None]:
# Construct a Series containing our Jennifer count data
jenn_counts_ser = jenn_entries["Count"]

In [None]:
# In the year with the highest Jennifer count, 6065 Jennifers were born
max_jenn = np.max(jenn_counts_ser)
max_jenn

In [None]:
# The rows of f_babynames are already ordered by "Year", so the final entry
# is the count for 2022, the most recent year for which we have data: 114 Jennifers
latest_jenn = jenn_counts_ser.iloc[-1]
latest_jenn

In [None]:
# Compute the RTP
latest_jenn / max_jenn

We can also write a function that produces the `ratio_to_peak` for a given `Series`. This will allow us to use `.groupby` to speed up our computation for all names in the dataset.

In [None]:
def ratio_to_peak(series):
    """
    Compute the RTP for a Series containing the counts per year for a single name.
    Assumes the entries are ordered by year, ascending, so the last entry is the most recent.
    """
    return series.iloc[-1] / np.max(series)

In [None]:
# Then, find the RTP
ratio_to_peak(jenn_counts_ser)

Now, let's use `.groupby` to compute the RTPs for *all* names in the dataset.

As discussed in the lecture, `pandas` can't apply this aggregation function to non-numeric data (it doesn't make sense to divide "CA" by a number), so aggregating every column may produce a warning or an error, depending on your `pandas` version. To avoid this, we select the numerical columns of interest directly.

In [None]:
rtp_table = f_babynames.groupby("Name")[["Year", "Count"]].agg(ratio_to_peak)
rtp_table

This is the `pandas` equivalent of `.group` from [Data 8](http://data8.org/datascience/_autosummary/datascience.tables.Table.group.html). If we wanted to achieve this same result using the `datascience` library, we would write:

`f_babynames.group("Name", ratio_to_peak)`

### Slido Exercise

Is there a row where `Year` is not equal to 1? Recall that `babynames` is sorted ascending by year.

In [None]:
f_babynames

In [None]:
# Unique values in the Year column
rtp_table["Year"].unique()

A hint: If we randomly shuffle the dataset, we see values of `Year` other than 1.
- The maximum year for each name is no longer guaranteed to be the last-appearing year for each name! So, the ratio is no longer 1.
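
To see why ordering matters, here is a small sketch that applies the same last-over-max calculation to a made-up list of years (the values are hypothetical), once in ascending order and once shuffled. The function mirrors `ratio_to_peak` from above.

```python
import pandas as pd

def ratio_to_peak(series):
    # Last entry divided by the largest entry
    return series.iloc[-1] / series.max()

years_sorted = pd.Series([1990, 1991, 1992])
years_shuffled = pd.Series([1992, 1990, 1991])

# Sorted: the last entry is also the maximum, so the ratio is exactly 1
sorted_ratio = ratio_to_peak(years_sorted)

# Shuffled: the last entry (1991) is smaller than the maximum (1992)
shuffled_ratio = ratio_to_peak(years_shuffled)

print(sorted_ratio, shuffled_ratio)
```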

In [None]:
f_babynames.sample(frac=1, replace=False).groupby("Name")[["Year", "Count"]].agg(ratio_to_peak)

In [None]:
# Dropping the "Year" column
rtp_table = rtp_table.drop("Year", axis="columns")
rtp_table

In [None]:
# Rename "Count" to "Count RTP" for clarity
rtp_table = rtp_table.rename(columns={"Count":"Count RTP"})
rtp_table

In [None]:
# What name has fallen the most in popularity?
rtp_table.sort_values("Count RTP", ascending=True)

We can visualize the decrease in the popularity of the name "Debra":

In [None]:
# Don't worry about the * in the function definition! 
# Focus on the plot.
def plot_name(*names):
    fig = px.line(f_babynames[f_babynames["Name"].isin(names)], 
                  x="Year", y="Count", color="Name",
                  title=f"Popularity for: {names}")
    fig.update_layout(font_size=18, 
                  autosize=False, 
                  width=1000, 
                  height=400)
    return fig

plot_name("Debra")

In [None]:
# Find the 10 names that have decreased the most in popularity
top10 = rtp_table.sort_values("Count RTP").head(10).index
top10

In [None]:
plot_name(*top10)

For fun, try plotting your name or your friends' names.

### `groupby.size()` and `groupby.count()`

In [None]:
df = pd.DataFrame({'letter':['A', 'A', 'B', 'C', 'C', 'C'], 
                   'num':[1, 2, 3, 4, np.nan, 4], 
                   'state':[np.nan, 'tx', 'fl', 'hi', np.nan, 'ak']})
df

`groupby.size()` returns a `Series`, indexed by the `letter`s that we grouped by, with values denoting the number of rows in each group/sub-DataFrame. It does not care about missing (`NaN`) values.

In [None]:
df.groupby("letter").size()

You might recall the `value_counts()` function we talked about last week. What's the difference?

In [None]:
df["letter"].value_counts()

It turns out that `value_counts()` does something similar to `groupby.size()`, except that it also sorts the values of the resulting `Series` in descending order.
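
As a quick check of this claim, the sketch below builds a small stand-alone table (a hypothetical `letters` DataFrame with the same letters as `df`) and compares the two results. Only the ordering differs.

```python
import pandas as pd

letters = pd.DataFrame({"letter": ["A", "A", "B", "C", "C", "C"]})

# Ordered by group key: A, B, C
sizes = letters.groupby("letter").size()

# Ordered by frequency, largest first: C, A, B
counts = letters["letter"].value_counts()

# The numbers agree once both results are put in the same order
same_values = sizes.sort_index().tolist() == counts.sort_index().tolist()
print(same_values)
```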

`groupby.count()` returns a `DataFrame`, indexed by the `letter`s that we grouped by. Each entry is the number of non-missing values in that column for that `letter`.

In [None]:
df.groupby("letter").count()

## Filtering by Group

Another common use for groups is to filter data:

Usage: `groupby(___).filter(func)`

`.filter()` applies `func` to each group's sub-DataFrame (`sf`).
- `func` must return a scalar `True` or `False` for each `sf`.
- If `func` returns `True` for a `sf`, then all rows belonging to the group are preserved.
- If `func` returns `False` for a `sf`, then all rows belonging to that group are filtered out.

For example, we can filter to the subframes of `df` with at least 2 rows:

In [None]:
df.groupby("letter").filter(lambda sf: len(sf) >= 2)

### Slido Exercise

Which of the following returns all rows of `babynames` with names that appeared for the first time after 2010?

In [None]:
# babynames.groupby("Name").filter(lambda sf: sf["Year"].min() > 2010)

In [None]:
# babynames.groupby("Name").filter(lambda sf: sf["Year"].max() > 2010)

In [None]:
# babynames.groupby("Name").filter(lambda sf: sf["Year"] > 2010)

In [None]:
# babynames.groupby(["Name", "Year"]).filter(lambda sf: sf["Year"] > 2010)
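
One way to reason about these options is to try the first two on a tiny made-up table (the names "Old" and "New" are hypothetical). In the last two options the lambda returns a `Series` of booleans rather than a single `True` or `False`, so they do not satisfy the scalar requirement described above.

```python
import pandas as pd

# Hypothetical toy data: "Old" first appears in 2005, "New" first appears in 2015
toy = pd.DataFrame({
    "Name": ["Old", "Old", "New", "New"],
    "Year": [2005, 2020, 2015, 2021],
})

# Earliest year is after 2010: the name appeared for the first time after 2010
first_after_2010 = toy.groupby("Name").filter(lambda sf: sf["Year"].min() > 2010)

# Latest year is after 2010: the name merely appeared at some point after 2010
any_after_2010 = toy.groupby("Name").filter(lambda sf: sf["Year"].max() > 2010)

print(first_after_2010)
print(any_after_2010)
```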

In [None]:
# Let's read the elections dataset
elections = pd.read_csv("data/elections.csv")
elections.sample(5)

Let's keep only the election years where the maximum vote share `%` is less than 45%.

In [None]:
elections.groupby("Year").filter(lambda sf: sf["%"].max() < 45).head(10)

In [None]:
# Why did we get a DataFrame instead of a Series?
# Notice that "%" is in its own sublist!
elections_max_percentage = elections.groupby("Year")[["%"]].agg("max")
elections_max_percentage

In [None]:
elections_max_percentage.sort_values(by="%").head()

### `groupby` Puzzle

Assume that we want to know the best election result for each party.

#### Attempt #1

We have to be careful when using aggregation functions. For example, the code below might be misinterpreted to say that Woodrow Wilson successfully ran for election in 2020. Why is this happening?

In [None]:
elections.groupby("Party").agg("max").head(10)

It's generally a good idea to be explicit about which columns to aggregate! 
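
The sketch below reproduces the effect on a made-up two-row table (the party "X" and candidates "Zed" and "Amy" are hypothetical). Because `max` is computed for each column independently, the output row combines values that never appeared together in the data.

```python
import pandas as pd

# Hypothetical toy data, not the real elections table
toy = pd.DataFrame({
    "Party": ["X", "X"],
    "Year": [1912, 2020],
    "Candidate": ["Zed", "Amy"],
    "%": [41.0, 51.0],
})

# Each column is maximized separately: latest year, alphabetically last name, largest %
mixed = toy.groupby("Party").agg("max")
print(mixed)

# No original row has Year == 2020 together with Candidate == "Zed"
row_exists = ((toy["Year"] == 2020) & (toy["Candidate"] == "Zed")).any()
print(row_exists)
```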

#### Attempt #2

Next, we'll write code that properly returns _the best result by each party_. That is, each row should show the Year, Candidate, Popular Vote, Result, and % for the election in which that party saw its best results (rather than mixing them as in the example above). Here's what the first few rows of the correct output should look like:

<img src="images/parties.png" alt="Image" width="600">

In [None]:
elections_sorted_by_percent = elections.sort_values("%", ascending=False)
elections_sorted_by_percent.head(8)

In [None]:
elections_sorted_by_percent.groupby("Party").head(1)

#### Alternative Solutions

You'll soon discover that with `pandas`' rich tool set, there's typically more than one way to get to the same answer. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, and more. It will take some experience to develop a sense of which approach is better for each problem. In general, though, try to envision at least one alternative solution to a given problem, especially if your current solution is particularly convoluted or hard to read.

Here are a couple of other ways of obtaining the same result (in each case, we only show the top part with `head()`). The first approach uses `groupby` but finds the location of the maximum value via the `idxmax()` method (look up its documentation!). We then index and sort by `Party` to match the requested formatting:

In [None]:
elections.groupby("Party")["%"].idxmax()

In [None]:
# This is the computational part
best_per_party = elections.loc[elections.groupby("Party")["%"].idxmax()]

# This indexes by Party to match the formatting above
best_per_party.set_index('Party').sort_index().head() 

And this one doesn't even use `groupby`! This approach instead sorts by "%" in ascending order and then uses the `drop_duplicates` method to keep only the last occurrence of each party, which is that party's best performance. Again, the second line is purely formatting:

In [None]:
best_per_party2 = elections.sort_values("%").drop_duplicates(["Party"], keep="last")
best_per_party2.set_index("Party").sort_index().head()  # Formatting

*Challenge:* See if you can find yet another approach that still gives the same answer.

### `DataFrameGroupBy` Objects

The result of `groupby` is not a `DataFrame` or a list of `DataFrame`s. It is instead a special type called a `DataFrameGroupBy`.

In [None]:
grouped_by_party = elections.groupby("Party")
type(grouped_by_party)

`GroupBy` objects are structured like dictionaries. We can see the underlying dictionary with the following code:

In [None]:
grouped_by_party.groups

The `key`s of the dictionary are the groups (in this case, `Party`), and the `value`s are the **indices** of rows belonging to that group. We can access a particular sub-`DataFrame` using `get_group`:

In [None]:
grouped_by_party.get_group("Socialist")
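
To make the dictionary analogy concrete, here is a small sketch on a made-up table (the parties "X" and "Y" and their percentages are hypothetical). It reads the `.groups` mapping and also iterates over the `GroupBy` object, which yields one `(key, sub-DataFrame)` pair per group.

```python
import pandas as pd

toy = pd.DataFrame({
    "Party": ["X", "Y", "X"],
    "%": [40.0, 55.0, 60.0],
})
grouped = toy.groupby("Party")

# Keys are the group labels; values are the row labels belonging to each group
index_labels = {key: list(idx) for key, idx in grouped.groups.items()}
print(index_labels)

# Iterating over a GroupBy object yields (key, sub-DataFrame) pairs
sub_frames = {key: sub for key, sub in grouped}
print(sub_frames["X"])
```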

---

## Pivot Tables

### `groupby` with Multiple Columns

We want to build a table showing the total number of babies born of each sex in each year. One way is to `groupby` using both columns of interest:

In [None]:
babynames.groupby(["Year", "Sex"])[["Count"]].sum()

### `pivot_table`

In [None]:
babynames.head()

In [None]:
babynames.pivot_table(
    index="Year", 
    columns="Sex", 
    values="Count", 
    aggfunc="sum")
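
The pivot table holds the same totals as the two-column `groupby` above, just arranged with one column per sex. As a sketch on a made-up table (the counts in `toy` are hypothetical), grouping by both columns and then moving `Sex` from the index into the columns with `unstack` gives matching values.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": [2000, 2000, 2000, 2001, 2001],
    "Sex": ["F", "M", "F", "F", "M"],
    "Count": [5, 7, 3, 4, 6],
})

pivoted = toy.pivot_table(index="Year", columns="Sex", values="Count", aggfunc="sum")

# Group by both columns, then move the inner index level ("Sex") into the columns
unstacked = toy.groupby(["Year", "Sex"])["Count"].sum().unstack("Sex")

print(pivoted)
print(unstacked)
```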

### `pivot_table` with Multiple Values

In [None]:
babynames.pivot_table(
    index="Year", 
    columns="Sex", 
    values=["Count", "Name"], 
    aggfunc="max")

---

## Join Tables

What if we want to know the popularity of presidential candidates' first names in California in 2022? What can we do?

In [None]:
elections.head(10)

In [None]:
babynames_2022 = babynames[babynames["Year"]==2022]
babynames_2022.head(10)

In [None]:
elections["First Name"] = elections["Candidate"].str.split(" ").str[0]
elections

Unlike in Data 8, the join function is called `merge` in `pandas`. The `join` method in `pandas` does something slightly different; we won't talk about it in this class.

In [None]:
display(elections.head())
display(babynames_2022.head())

merged = pd.merge(left=elections, right=babynames_2022, 
                  left_on="First Name", right_on="Name")
merged

In [None]:
merged.sort_values("Count", ascending=False)
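
Note that `merge` performs an inner join by default, so candidates whose first name has no match in `babynames_2022` do not appear in `merged`. The sketch below illustrates this on two made-up tables (all names and counts are hypothetical) and contrasts it with `how="left"`, an option of `pd.merge` that goes slightly beyond what the cells above use.

```python
import pandas as pd

candidates = pd.DataFrame({"First Name": ["Ann", "Bob", "Zyx"]})
names = pd.DataFrame({"Name": ["Ann", "Bob", "Cat"], "Count": [12, 30, 8]})

# Default how="inner": keep only rows whose keys match in both tables
inner = pd.merge(left=candidates, right=names,
                 left_on="First Name", right_on="Name")

# how="left": keep every candidate, filling unmatched columns with NaN
left = pd.merge(left=candidates, right=names,
                left_on="First Name", right_on="Name", how="left")

print(inner)
print(left)
```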