# Lecture 5

Fall 2023

A demonstration of advanced `pandas` syntax to accompany Lecture 5.

In [1]:
import numpy as np
import pandas as pd
import plotly.express as px

## More on `Groupby`

### Slido Exercise

Try to predict the results of the `groupby` operation shown. The answer is below the image.

<img src="/content/drive/MyDrive/groupby.png" alt="Image" width="600">

The top ?? will be "hi", the second ?? will be "tx", and the third ?? will be "sd".

In [2]:
# Build a DataFrame from a dictionary, with repeated index labels
ds = pd.DataFrame(dict(x=[3, 1, 4, 1, 5, 9, 2, 5, 6],
                      y=['ak', 'tx', 'fl', 'hi', 'mi', 'ak', 'ca', 'sd', 'nc']),
                      index=list('ABCABCACB') )
ds

Unnamed: 0,x,y
A,3,ak
B,1,tx
C,4,fl
A,1,hi
B,5,mi
C,9,ak
A,2,ca
C,5,sd
B,6,nc


In [3]:
#Use groupby on index and get max of each group
ds.groupby(ds.index).max()

Unnamed: 0,x,y
A,3,hi
B,6,tx
C,9,sd
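
One way to see why the answer mixes values from different rows: `.max()` is applied to each column independently within a group, and strings are compared lexicographically. The sketch below rebuilds the same frame and checks group `A` by hand; the helper variable names are ours.

```python
import pandas as pd

ds = pd.DataFrame(
    dict(x=[3, 1, 4, 1, 5, 9, 2, 5, 6],
         y=["ak", "tx", "fl", "hi", "mi", "ak", "ca", "sd", "nc"]),
    index=list("ABCABCACB"),
)

# Rows labelled "A": x values are 3, 1, 2 and y values are "ak", "hi", "ca"
group_a = ds.loc["A"]

# The max is taken column by column, so x and y may come from different rows
max_x = group_a["x"].max()   # 3, from the row where y == "ak"
max_y = group_a["y"].max()   # "hi", from the row where x == 1

result = ds.groupby(ds.index).max()
print(result)
```

The row `A, 3, hi` in the result therefore never existed in the original data.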


### Loading `babynames` Dataset

In [None]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "/content/drive/MyDrive/data/babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'STATE.CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.tail(10)

In [4]:

field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
# Raw string so the backslash in the Windows path is not treated as an escape
data = pd.read_csv(r'babynamesbystate\STATE.CA.TXT', header=None, names=field_names)

### Case Study: Name "Popularity"

In this exercise, let's find the name with sex "F" that has dropped most in popularity since its peak usage in California. We'll start by filtering the `babynames` data (loaded above as `data`) to only include rows corresponding to sex "F".

In [8]:
# Select only the rows for female babies

female_data = data[data["Sex"]=="F"]

In [9]:
# We sort the data by year
female_data = female_data.sort_values(["Year"])

To build our intuition on how to answer our research question, let's visualize the prevalence of the name "Jennifer" over time.

In [10]:
# We'll talk about how to generate plots in a later lecture
jen = female_data[female_data["Name"] == "Jennifer"]
px.line(jen, x="Year", y="Count")

We'll need a mathematical definition for the change in popularity of a name in California.

Define the metric "Ratio to Peak" (RTP). We'll calculate this as the count of the name in 2022 (the most recent year for which we have data) divided by the largest count of this name in *any* year.

A demo calculation for Jennifer:

In [15]:
# In the year with the highest Jennifer count, 6065 Jennifers were born
max_jenn = max(female_data[female_data["Name"] == "Jennifer"]["Count"])
curr_jenn = female_data[female_data["Name"] == "Jennifer"]["Count"].iloc[-1]
curr_jenn
# female_data is sorted by year, so the final entry is the most recent count of Jennifers
# In 2022, the most recent year for which we have data, 114 Jennifers were born

114

In [16]:
# Compute the RTP
rtp = curr_jenn / max_jenn
rtp




0.018796372629843364

We can also write a function that produces the `ratio_to_peak` for a given `Series`. This will allow us to use `.groupby` to speed up our computation for all names in the dataset.

In [18]:
# Construct a Series containing our Jennifer count data
# Then, find the RTP
def ratio_to_peak(series):
    return series.iloc[-1] / max(series)
jenn_counts_ser = female_data[female_data["Name"] == "Jennifer"]["Count"]
ratio_to_peak(jenn_counts_ser)



0.018796372629843364
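
A small sanity check of `ratio_to_peak` on made-up numbers may help. The counts below are hypothetical, not taken from the dataset. Note that `series.iloc[-1]` is simply the last entry of the `Series`, so it equals the 2022 count only if the data is sorted by year and the name actually appears in 2022.

```python
import pandas as pd

def ratio_to_peak(series):
    # Last entry divided by the largest entry
    return series.iloc[-1] / max(series)

# Hypothetical yearly counts for one name, already in chronological order
toy_counts = pd.Series([10, 40, 20, 5], index=[2019, 2020, 2021, 2022])

toy_rtp = ratio_to_peak(toy_counts)
print(toy_rtp)  # 5 / 40 = 0.125
```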

Now, let's use `.groupby` to compute the RTPs for *all* names in the dataset.

Applying the function to every column can produce a warning or a `TypeError`, depending on your `pandas` version. As discussed in the lecture, `pandas` can't apply this aggregation function to non-numeric data (it doesn't make sense to divide "CA" by a number). We can select the numerical columns of interest directly.

In [19]:
rtp_table = female_data.groupby("Name")[["Year","Count"]].agg(ratio_to_peak)

In [None]:
# Results in a TypeError; uncomment the two lines below to see it
# rtp_table = female_data.groupby("Name").agg(ratio_to_peak)
# rtp_table

### Slido Exercise

Is there a row where `Year` is not equal to 1?

In [21]:
# Recompute the RTP table and inspect the values in its Year column

rtp_table = (female_data.groupby("Name")[["Year","Count"]].agg(ratio_to_peak))
rtp_table


Name,Year,Count
Aadhini,1.0,1.000000
Aadhira,1.0,0.500000
Aadhya,1.0,0.660000
Aadya,1.0,0.586207
Aahana,1.0,0.269231
...,...,...
Zyanya,1.0,0.466667
Zyla,1.0,1.000000
Zylah,1.0,1.000000
Zyra,1.0,1.000000
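
Why is `Year` always 1.0 here? Because `female_data` was sorted by year, the last year within each group is also the largest, so `ratio_to_peak` divides the maximum by itself. A minimal sketch on a hypothetical toy table (the names and counts are invented):

```python
import pandas as pd

def ratio_to_peak(series):
    return series.iloc[-1] / max(series)

# Hypothetical data, sorted by Year as in the notebook
toy = pd.DataFrame({
    "Name":  ["Ann", "Bea", "Ann", "Bea", "Ann"],
    "Year":  [2000, 2000, 2010, 2010, 2020],
    "Count": [50, 10, 100, 30, 25],
}).sort_values("Year")

toy_rtp = toy.groupby("Name")[["Year", "Count"]].agg(ratio_to_peak)

# Every Year ratio is last_year / max_year, which is 1 for data sorted by year
print(toy_rtp["Year"].unique())
print(toy_rtp)
```

If the data were not sorted by year, the `Year` column could contain values below 1, which is a quick way to detect that the `Count` ratios are not measuring the most recent year.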


In [38]:
# Rename "Count" to "Count RTP" for clarity
rtp_table = female_data.groupby("Name")[["Count"]].agg(ratio_to_peak)
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})


In [44]:
# Note: this renames the raw count column of `data`; the cells below refer to it as "Count RTP"
data = data.rename(columns = {"Count": "Count RTP"})


In [26]:
# What name has fallen the most in popularity?

rtp_table.sort_values("Count RTP")

Name,Count RTP
Debra,0.001260
Debbie,0.002815
Carol,0.003180
Tammy,0.003249
Susan,0.003305
...,...
Fidelia,1.000000
Naveyah,1.000000
Finlee,1.000000
Roseline,1.000000


In [31]:
rtp_table.reset_index(inplace=True)

We can visualize the decrease in the popularity of the name "Debra":

In [27]:
def plot_name(*names):
    fig = px.line(female_data[female_data["Name"].isin(names)],
                  x = "Year", y = "Count", color="Name",
                  title=f"Popularity for: {names}")
    fig.update_layout(font_size = 18,
                  width=1000,
                  height=400)
    return fig

plot_name("Debra")

In [28]:
# Find the 10 names that have decreased the most in popularity
# (the index was reset above, so the names live in the "Name" column)
top10 = rtp_table.sort_values("Count RTP").head(10)["Name"]

In [29]:
plot_name(*top10)

For fun, try plotting your name or your friends' names.

### Slido Exercise

Using the `babynames` data (loaded above as `data`), write code to compute the total number of babies with each name in California, both with and without `agg`.

In [45]:
# code here

# With agg
with_agg = data[data['State'] == 'CA'].groupby('Name').agg({'Count RTP': 'sum'})
print("Total number of babies with each name in California using agg:",with_agg)


# Without agg
no_agg = data[data['State'] == 'CA'].groupby('Name')['Count RTP'].sum()
print("\nTotal number of babies with each name in California without agg:",no_agg)



Total number of babies with each name in California using agg:          Count RTP
Name              
Aadan           18
Aadarsh          6
Aaden          647
Aadhav          27
Aadhini          6
...            ...
Zymir            5
Zyon           133
Zyra           103
Zyrah           21
Zyrus            5

[20437 rows x 1 columns]

Total number of babies with each name in California without agg: Name
Aadan       18
Aadarsh      6
Aaden      647
Aadhav      27
Aadhini      6
          ... 
Zymir        5
Zyon       133
Zyra       103
Zyrah       21
Zyrus        5
Name: Count RTP, Length: 20437, dtype: int64
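
The two spellings compute the same numbers; they differ in the shape of the result. Passing a dict to `.agg` returns a `DataFrame`, while selecting one column and calling `.sum()` returns a `Series`. A sketch on a hypothetical toy table (the column is simply called `Count` here):

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "Name":  ["Ann", "Bea", "Ann", "Bea"],
    "Count": [5, 7, 11, 13],
})

with_agg = toy.groupby("Name").agg({"Count": "sum"})   # DataFrame with one column
no_agg = toy.groupby("Name")["Count"].sum()            # Series

print(type(with_agg).__name__, type(no_agg).__name__)
# The single column of the DataFrame matches the Series
print(with_agg["Count"].equals(no_agg))
```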


### Slido Exercise

Write code to compute the total number of babies born each year in California.

In [48]:
# code here
# data only contains California rows, so there is no need to filter by state

each_year = data.groupby("Year")["Count RTP"].sum()


In [50]:
# Pass x and y by keyword; the first positional argument of px.line is a data frame
px.line(x=each_year.index, y=each_year.values)

### `groupby.size()` and `groupby.count()`

In [51]:
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = pd.DataFrame({'letter':['A', 'A', 'B', 'C', 'C', 'C'],
                   'num':[1, 2, 3, 4, np.nan, 4],
                   'state':[np.nan, 'tx', 'fl', 'hi', np.nan, 'ak']})
df

Unnamed: 0,letter,num,state
0,A,1.0,
1,A,2.0,tx
2,B,3.0,fl
3,C,4.0,hi
4,C,,
5,C,4.0,ak


`groupby.size()` returns a `Series`, indexed by the `letter`s that we grouped by, with values denoting the number of rows in each group/sub-DataFrame. It does not care about missing (`NaN`) values.

In [52]:
# Use groupby with size()
df.groupby("letter")["num"].size()


letter
A    2
B    1
C    3
Name: num, dtype: int64

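`groupby.count()` counts the non-missing values in each group. Called on the whole grouped object, it returns a `DataFrame` indexed by the `letter`s that we grouped by, with one column of counts per original column. Below we select only `num`, so the result is a `Series` that `reset_index()` turns back into a `DataFrame`.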

In [53]:
# Use groupby with count()
df.groupby("letter")["num"].count().reset_index()


Unnamed: 0,letter,num
0,A,2
1,B,1
2,C,2
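
To see the two behaviours side by side, the sketch below calls `size()` and `count()` on the whole grouped object rather than on the `num` column alone. It rebuilds the same `df` so that it runs on its own.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"letter": ["A", "A", "B", "C", "C", "C"],
                   "num": [1, 2, 3, 4, np.nan, 4],
                   "state": [np.nan, "tx", "fl", "hi", np.nan, "ak"]})

sizes = df.groupby("letter").size()     # Series: rows per group, NaN rows included
counts = df.groupby("letter").count()   # DataFrame: non-missing values per column

print(sizes)
print(counts)
```

Group `C` has three rows, so its size is 3, but only two of them have a non-missing `num` and two have a non-missing `state`.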


You might recall the `value_counts()` function we talked about last week. What's the difference?

In [55]:
# Use value_counts() on the DataFrame defined above
df.value_counts()
# Counts how often each unique row occurs and sorts the result in descending order

letter  num  state
A       2.0  tx       1
B       3.0  fl       1
C       4.0  ak       1
             hi       1
Name: count, dtype: int64

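Called on a single column (for example `df["letter"].value_counts()`), `value_counts()` does something similar to `groupby.size()`, except that it also sorts the resulting `Series` in descending order. Called on the whole `DataFrame` as above, it counts each unique combination of row values and, by default, drops rows that contain `NaN`, which is why only four rows appear.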
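
The comparison with `groupby.size()` is easiest to see on a single column. A sketch using the same `df` as above, rebuilt here so the block is self-contained:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"letter": ["A", "A", "B", "C", "C", "C"],
                   "num": [1, 2, 3, 4, np.nan, 4],
                   "state": [np.nan, "tx", "fl", "hi", np.nan, "ak"]})

by_size = df.groupby("letter").size()      # ordered by group label: A, B, C
by_counts = df["letter"].value_counts()    # ordered by frequency, largest first

print(by_size)
print(by_counts)
```

Both results hold the same numbers; only the ordering differs.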

## Filtering by Group

In [56]:
# Let's read the elections dataset
# Raw string so the backslashes in the Windows path are not treated as escapes
edata = pd.read_csv(r"H:\Machine Learning\Excel files\elections 1.csv")
edata.head()

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
0,1824,Andrew Jackson,Democratic-Republican,151271,loss,57.210122
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
2,1828,Andrew Jackson,Democratic,642806,win,56.203927
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
4,1832,Andrew Jackson,Democratic,702735,win,54.574789


Let's keep only the elections years where the maximum vote share `%` is less than 45%.

In [57]:
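# Row-level boolean mask: keeps every row whose vote share is below 45%
# (this overwrites edata, so the cells below work on this subset)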
edata = edata[edata["%"]<45]
edata.head()


Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
5,1832,Henry Clay,National Republican,484205,loss,37.603628
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
7,1836,Hugh Lawson White,Whig,146109,loss,10.005985
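
Note that the boolean mask in the cell above drops individual rows. To keep or drop whole election years based on a property of the group, `groupby(...).filter(...)` can be used instead. A minimal sketch on a hypothetical table of three elections (the numbers are invented):

```python
import pandas as pd

# Hypothetical data for illustration
toy = pd.DataFrame({
    "Year": [2000, 2000, 2004, 2004, 2004, 2008, 2008],
    "Candidate": ["a", "b", "c", "d", "e", "f", "g"],
    "%": [52.0, 48.0, 43.0, 41.0, 16.0, 60.0, 40.0],
})

# Row-level mask: keeps any row below 45, whatever happened in the rest of its year
row_mask = toy[toy["%"] < 45]

# Group-level filter: keeps all rows of a year only if that year's maximum is below 45
close_years = toy.groupby("Year").filter(lambda sf: sf["%"].max() < 45)

print(sorted(row_mask["Year"].unique()))
print(sorted(close_years["Year"].unique()))
```

Here the mask keeps rows from 2004 and 2008, while the group filter keeps only 2004, the one year in which no candidate reached 45%.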


### `groupby` Puzzle

Assume that we want to know the best election by each party.

#### Attempt #1

We have to be careful when using aggregation functions. For example, the output below might be misinterpreted to say that Woodrow Wilson successfully ran for election in 1992. Why is this happening?

In [58]:
# Use agg(max)
edata.groupby("Party").max().head(10)

Party,Year,Candidate,Popular vote,Result,%
American,1976,Thomas J. Anderson,873053,loss,21.554001
American Independent,1976,Lester Maddox,9901118,loss,13.571218
Anti-Masonic,1832,William Wirt,100715,loss,7.821583
Anti-Monopoly,1884,Benjamin Butler,134294,loss,1.335838
Citizens,1980,Barry Commoner,233052,loss,0.270182
Communist,1932,William Z. Foster,103307,loss,0.261069
Constitution,2016,Michael Peroutka,203091,loss,0.152398
Constitutional Union,1860,John Bell,590901,loss,12.639283
Democratic,1992,Woodrow Wilson,44909806,win,44.446312
Democratic-Republican,1824,John Quincy Adams,113142,win,42.789878


#### Attempt #2

Next, we'll write code that properly returns _the best result by each party_. That is, each row should show the Year, Candidate, Popular Vote, Result, and % for the election in which that party saw its best results (rather than mixing them as in the example above). Here's what the first rows of the correct output should look like:

![parties.png](attachment:ab21f8de-ad29-46c2-bea7-e9aea9c40e3e.png)

In [59]:
elections_sorted_by_percent = edata.sort_values("%", ascending=False)
elections_sorted_by_percent.groupby("Party").first()

Party,Year,Candidate,Popular vote,Result,%
American,1856,Millard Fillmore,873053,loss,21.554001
American Independent,1968,George Wallace,9901118,loss,13.571218
Anti-Masonic,1832,William Wirt,100715,loss,7.821583
Anti-Monopoly,1884,Benjamin Butler,134294,loss,1.335838
Citizens,1980,Barry Commoner,233052,loss,0.270182
Communist,1932,William Z. Foster,103307,loss,0.261069
Constitution,2008,Chuck Baldwin,199750,loss,0.152398
Constitutional Union,1860,John Bell,590901,loss,12.639283
Democratic,1952,Adlai Stevenson,27375090,loss,44.446312
Democratic-Republican,1824,John Quincy Adams,113142,win,42.789878


#### Alternative Solutions

You'll soon discover that with `pandas`' rich tool set, there's typically more than one way to get to the same answer. Each approach has different tradeoffs in terms of readability, performance, memory consumption, complexity, and more. It will take some experience to develop a sense of which approach is better for each problem. In general, try to envision at least one different solution to a given problem, especially if your current solution is particularly convoluted or hard to read.

Here are a couple of other ways of obtaining the same result. The first approach uses `groupby` but finds the row label of each party's maximum via the `idxmax()` method (look up its documentation!). We then pass those labels to `.loc` to pull out the full rows:

In [60]:
# Use idxmax function
best_per_party = edata.loc[edata.groupby("Party")["%"].idxmax()]
best_per_party

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
22,1856,Millard Fillmore,American,873053,loss,21.554001
115,1968,George Wallace,American Independent,9901118,loss,13.571218
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
38,1884,Benjamin Butler,Anti-Monopoly,134294,loss,1.335838
127,1980,Barry Commoner,Citizens,233052,loss,0.270182
89,1932,William Z. Foster,Communist,103307,loss,0.261069
164,2008,Chuck Baldwin,Constitution,199750,loss,0.152398
24,1860,John Bell,Constitutional Union,590901,loss,12.639283
105,1952,Adlai Stevenson,Democratic,27375090,loss,44.446312
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878


Another approach is listed below. And this one doesn't even use `groupby`!

This approach instead sorts by `%` and uses the `drop_duplicates` method to keep only the last occurrence of each party, which is that party's best performance. Note that the rows come out ordered by `%` rather than by party.

In [62]:
# code here

best_per_party2 = edata.sort_values("%").drop_duplicates(["Party"], keep="last")
best_per_party2 

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
148,1996,John Hagelin,Natural Law,113670,loss,0.118219
164,2008,Chuck Baldwin,Constitution,199750,loss,0.152398
110,1956,T. Coleman Andrews,States' Rights,107929,loss,0.174883
147,1996,Howard Phillips,Taxpayers,184656,loss,0.192045
136,1988,Lenora Fulani,New Alliance,217221,loss,0.237804
89,1932,William Z. Foster,Communist,103307,loss,0.261069
127,1980,Barry Commoner,Citizens,233052,loss,0.270182
50,1896,John M. Palmer,National Democratic,134645,loss,0.969566
78,1920,Parley P. Christensen,Farmer–Labor,265398,loss,0.995804
42,1888,Alson Streeter,Union Labor,146602,loss,1.288861


*Challenge:* See if you can find yet another approach that still gives the same answer.

### `DataFrameGroupBy` Objects

The result of `groupby` is not a `DataFrame` or a list of `DataFrame`s. It is instead a special type called a `DataFrameGroupBy`.

In [63]:
grouped_by_party = edata.groupby("Party")
type(grouped_by_party)
edata

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
1,1824,John Quincy Adams,Democratic-Republican,113142,win,42.789878
3,1828,John Quincy Adams,National Republican,500897,loss,43.796073
5,1832,Henry Clay,National Republican,484205,loss,37.603628
6,1832,William Wirt,Anti-Masonic,100715,loss,7.821583
7,1836,Hugh Lawson White,Whig,146109,loss,10.005985
...,...,...,...,...,...,...
174,2016,Evan McMullin,Independent,732273,loss,0.539546
175,2016,Gary Johnson,Libertarian,4489235,loss,3.307714
177,2016,Jill Stein,Green,1457226,loss,1.073699
180,2020,Jo Jorgensen,Libertarian,1865724,loss,1.177979


`GroupBy` objects are structured like dictionaries. We can inspect the underlying mapping through the `.groups` attribute:

In [66]:
# visualize groups
grouped_by_party.groups

{'American': [22, 126], 'American Independent': [115, 119, 124], 'Anti-Masonic': [6], 'Anti-Monopoly': [38], 'Citizens': [127], 'Communist': [89], 'Constitution': [160, 164, 172], 'Constitutional Union': [24], 'Democratic': [14, 57, 64, 70, 77, 81, 83, 105, 108, 116, 118, 129, 134, 140], 'Democratic-Republican': [1], 'Dixiecrat': [103], 'Farmer–Labor': [78], 'Free Soil': [15, 18], 'Green': [149, 155, 156, 165, 170, 177, 181], 'Greenback': [35], 'Independent': [121, 130, 143, 161, 167, 174], 'Liberal Republican': [31], 'Libertarian': [125, 128, 132, 138, 139, 146, 153, 159, 163, 169, 175, 180], 'National Democratic': [50], 'National Republican': [3, 5], 'Natural Law': [148], 'New Alliance': [136], 'Northern Democratic': [26], 'Populist': [48, 61, 141], 'Progressive': [68, 82, 101, 107], 'Prohibition': [41, 44, 49, 51, 54, 59, 63, 67, 73, 75, 99], 'Reform': [150, 154], 'Republican': [21, 23, 46, 69, 87, 90, 96, 113, 117, 142, 145], 'Socialist': [58, 62, 66, 71, 76, 85, 88, 92, 95, 102], 

The `key`s of the dictionary are the groups (in this case, `Party`), and the `value`s are the **indices** of rows belonging to that group. We can access a particular sub-`DataFrame` using `get_group`:

In [67]:
# code here

grouped_by_party.get_group("American")

Unnamed: 0,Year,Candidate,Party,Popular vote,Result,%
22,1856,Millard Fillmore,American,873053,loss,21.554001
126,1976,Thomas J. Anderson,American,158271,loss,0.194862


---