# Lecture 4 – Fall 2023

A demonstration of advanced `pandas` syntax to accompany Lecture 4.

In [71]:
import numpy as np
import pandas as pd
import plotly.express as px

In [118]:
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']

babynames = pd.read_csv('H:/Machine Learning/babynamesbystate/STATE.CA.TXT', header=None, names=field_names)

data = babynames.copy()

data.head(2)

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
1,CA,F,1910,Helen,239


## Dataset: California baby names

In today's lecture, we'll work with the `babynames` dataset, which contains information about the names of infants born in California.

The cell above reads the California records from a local text file (`STATE.CA.TXT`) into a `DataFrame`. The file has no header row, so the column names are supplied explicitly through `names`. The loading code is outside of the scope of Data 100, but you're encouraged to dig into it if you are interested!

### Exercises
We want to obtain the first three baby names with `Count > 250`.

1. Code this using `.loc` and `.head()`.

2. Code this using `.loc` and `.iloc`.

3. Code this using `[]` and `.head()`.


In [73]:
# Answer Here

#1- using loc and head()
data.loc[data["Count"]>250].head(3)


Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
233,CA,F,1911,Mary,390
484,CA,F,1912,Mary,534


In [74]:
# Answer Here
#2 using loc and iloc()
data.loc[data["Count"]>250].iloc[:3]

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
233,CA,F,1911,Mary,390
484,CA,F,1912,Mary,534


In [75]:
# Answer Here
#3 using [] and head

data[data["Count"]>250].head(3)

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
233,CA,F,1911,Mary,390
484,CA,F,1912,Mary,534
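
The three answers above all combine a Boolean mask with a cut to the first three rows. As a minimal sketch, the made-up five-row `toy` table below (illustrative values, not taken from the real dataset) suggests that the three spellings return the same rows:

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "Name": ["Mary", "Helen", "Dorothy", "Ruth", "Alice"],
    "Count": [295, 239, 420, 310, 260],
})

mask = toy["Count"] > 250               # Boolean Series, one entry per row

via_loc_head = toy.loc[mask].head(3)    # filter with .loc, keep the first 3 rows
via_loc_iloc = toy.loc[mask].iloc[:3]   # filter with .loc, then slice by position
via_brackets = toy[mask].head(3)        # [] with a Boolean Series also filters rows

print(via_loc_head)
```

Note that the original index labels survive the filtering, which is why the outputs above show labels such as 233 and 484 rather than 1 and 2.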


### `.isin` for Selection based on a list, array, or `Series`

In [76]:
# Note: The parentheses surrounding the code make it possible to break the code into multiple lines for readability

( data[(data["Name"] == "Bella") |
              (data["Name"] == "Alex") |
              (data["Name"] == "Narges") |
              (data["Name"] == "Lisa")])


Unnamed: 0,State,Sex,Year,Name,Count
6289,CA,F,1923,Bella,5
7512,CA,F,1925,Bella,8
12368,CA,F,1932,Lisa,5
14741,CA,F,1936,Lisa,8
17084,CA,F,1939,Lisa,5
...,...,...,...,...,...
393248,CA,M,2018,Alex,495
396111,CA,M,2019,Alex,438
398983,CA,M,2020,Alex,379
401788,CA,M,2021,Alex,333


In [77]:
# A more concise method to achieve the above: .isin
#Answer Here
names = ["Bella","Alex","Narges","Lisa"] 

data[data["Name"].isin(names)]

Unnamed: 0,State,Sex,Year,Name,Count
6289,CA,F,1923,Bella,5
7512,CA,F,1925,Bella,8
12368,CA,F,1932,Lisa,5
14741,CA,F,1936,Lisa,8
17084,CA,F,1939,Lisa,5
...,...,...,...,...,...
393248,CA,M,2018,Alex,495
396111,CA,M,2019,Alex,438
398983,CA,M,2020,Alex,379
401788,CA,M,2021,Alex,333
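
To check that `.isin` really is a drop-in replacement for the chained `|` comparisons, here is a small sketch on a made-up `toy` table (the rows are illustrative, not from the dataset). It also shows `~` to select the rows whose name is *not* in the list:

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "Name": ["Bella", "Mary", "Alex", "Lisa", "John"],
    "Count": [5, 295, 495, 8, 120],
})
names = ["Bella", "Alex", "Narges", "Lisa"]

# One comparison per name, combined with element-wise OR
chained = toy[(toy["Name"] == "Bella") |
              (toy["Name"] == "Alex") |
              (toy["Name"] == "Narges") |
              (toy["Name"] == "Lisa")]

# The same selection with a single membership test
with_isin = toy[toy["Name"].isin(names)]

# Negate the mask with ~ to keep everything else
not_in = toy[~toy["Name"].isin(names)]

print(with_isin)
```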


### `.str` Functions for Defining a Condition

In [78]:
# What if we only want names that start with "J"?
#Answer Here

data[data["Name"].str.startswith("J")]

Unnamed: 0,State,Sex,Year,Name,Count
16,CA,F,1910,Josephine,66
44,CA,F,1910,Jean,35
46,CA,F,1910,Jessie,32
59,CA,F,1910,Julia,28
66,CA,F,1910,Juanita,25
...,...,...,...,...,...
407245,CA,M,2022,Jibreel,5
407246,CA,M,2022,Joseangel,5
407247,CA,M,2022,Josejulian,5
407248,CA,M,2022,Juelz,5
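
One detail worth keeping in mind is that `.str.startswith` is case-sensitive. The sketch below uses a made-up `Series` of names (including a deliberately lowercase entry, which the real dataset may not contain) to show the difference, and one possible way to ignore case by lowercasing first:

```python
import pandas as pd

# Hypothetical names, made up for illustration only
names = pd.Series(["Josephine", "Mary", "Jean", "Ajay", "julia"])

# Case-sensitive: only names beginning with an uppercase "J"
starts_with_J = names[names.str.startswith("J")]

# Lowercase every name first to match regardless of case
starts_with_j_any_case = names[names.str.lower().str.startswith("j")]

print(starts_with_J.tolist())
print(starts_with_j_any_case.tolist())
```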


## Adding, Removing, and Modifying Columns

### Add a Column
To add a column, use `[]` to reference the desired new column, then assign it to a `Series` or array of appropriate length.

In [79]:
# Create a Series of the length of each name
n_len = data["Name"].str.len()
# Add a column named "name_lengths" that includes the length of each name
data["name_lengths"] = n_len

In [80]:
data.head(2)

Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
0,CA,F,1910,Mary,295,4
1,CA,F,1910,Helen,239,5


### Modify a Column
To modify a column, use `[]` to access the desired column, then re-assign it to a new array or Series.

In [81]:
# Modify the "name_lengths" column to be one less than its original value
data["name_lengths"] = data["name_lengths"]-1


In [82]:
data.head(3)

Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
0,CA,F,1910,Mary,295,3
1,CA,F,1910,Helen,239,4
2,CA,F,1910,Dorothy,220,6


### Rename a Column Name
Rename a column using the `.rename()` method.

In [83]:
# Rename "name_lengths" to "length"
data.rename(columns={"name_lengths":"length"},inplace=True)


In [84]:
data.head()

Unnamed: 0,State,Sex,Year,Name,Count,length
0,CA,F,1910,Mary,295,3
1,CA,F,1910,Helen,239,4
2,CA,F,1910,Dorothy,220,6
3,CA,F,1910,Margaret,163,7
4,CA,F,1910,Frances,134,6


### Delete a Column
Remove a column using `.drop()`.

In [85]:
# Remove our new "length" column (this returns a new DataFrame; `data` itself is unchanged)
data.drop(columns="length")

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
1,CA,F,1910,Helen,239
2,CA,F,1910,Dorothy,220
3,CA,F,1910,Margaret,163
4,CA,F,1910,Frances,134
...,...,...,...,...,...
407423,CA,M,2022,Zayvier,5
407424,CA,M,2022,Zia,5
407425,CA,M,2022,Zora,5
407426,CA,M,2022,Zuriel,5
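
Because `.drop()` returns a new `DataFrame` by default, the column is still present in `data` after the cell above. A minimal sketch of this behaviour on a made-up two-row table (the name `toy` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({"Name": ["Mary", "Helen"], "length": [4, 5]})

# .drop returns a modified copy; the original is left alone
dropped = toy.drop(columns="length")
still_has_length = "length" in toy.columns

# To make the change stick, reassign the result (or pass inplace=True)
toy = toy.drop(columns="length")

print(still_has_length, list(toy.columns))
```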


## Custom sorting

In [86]:
# Sort a Series Containing Names

data["Name"].sort_values()

366001      Aadan
384005      Aadan
369120      Aadan
398211    Aadarsh
370306      Aaden
           ...   
220691      Zyrah
197529      Zyrah
217429      Zyrah
232167      Zyrah
404544      Zyrus
Name: Name, Length: 407428, dtype: object

In [87]:
# Sort a DataFrame by name (ascending by default, so alphabetical order)
data.sort_values(by="Name")

Unnamed: 0,State,Sex,Year,Name,Count,length
366001,CA,M,2008,Aadan,7,4
384005,CA,M,2014,Aadan,5,4
369120,CA,M,2009,Aadan,6,4
398211,CA,M,2019,Aadarsh,6,6
370306,CA,M,2010,Aaden,62,4
...,...,...,...,...,...,...
220691,CA,F,2017,Zyrah,6,4
197529,CA,F,2011,Zyrah,5,4
217429,CA,F,2016,Zyrah,5,4
232167,CA,F,2020,Zyrah,5,4


In [88]:
data.drop(columns='length',inplace=True)

### Approach 1: Create a temporary column

In [89]:
# Create a Series of the length of each name
n_len = data["Name"].str.len()
# Add a column named "name_lengths" that includes the length of each name
data["name_lengths"] = n_len
# Sort by the temporary column
data.sort_values(by="name_lengths")






Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
326570,CA,M,1993,An,8,2
292150,CA,M,1976,Al,13,2
252556,CA,M,1937,Al,21,2
401470,CA,M,2020,Jr,5,2
260022,CA,M,1948,Ed,43,2
...,...,...,...,...,...,...
339472,CA,M,1998,Franciscojavier,6,15
327358,CA,M,1993,Johnchristopher,5,15
337477,CA,M,1997,Ryanchristopher,5,15
312543,CA,M,1987,Franciscojavier,5,15


In [90]:
# Drop the "name_lengths" column (not in place, so `data` keeps the column for the next cell)
data.drop(columns="name_lengths")

Unnamed: 0,State,Sex,Year,Name,Count
0,CA,F,1910,Mary,295
1,CA,F,1910,Helen,239
2,CA,F,1910,Dorothy,220
3,CA,F,1910,Margaret,163
4,CA,F,1910,Frances,134
...,...,...,...,...,...
407423,CA,M,2022,Zayvier,5
407424,CA,M,2022,Zia,5
407425,CA,M,2022,Zora,5
407426,CA,M,2022,Zuriel,5


### Approach 2: Sorting using the `key` argument

In [91]:
# Answer Here
data.sort_values(by='name_lengths', key=lambda x: -x)

# Negating the key sorts in descending order of the "name_lengths" column

Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
337477,CA,M,1997,Ryanchristopher,5,15
334166,CA,M,1996,Franciscojavier,8,15
313977,CA,M,1988,Franciscojavier,10,15
327358,CA,M,1993,Johnchristopher,5,15
102505,CA,F,1986,Mariadelosangel,5,15
...,...,...,...,...,...,...
400876,CA,M,2020,Cj,7,2
282211,CA,M,1969,Ty,51,2
354082,CA,M,2004,Cy,11,2
343636,CA,M,2000,Bo,12,2
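
The `key` argument can also remove the need for a temporary column altogether. `sort_values` passes the whole column to `key` as a `Series` and expects a `Series` of the same shape back, so a vectorized expression such as `.str.len()` fits. A sketch on a made-up `toy` table (illustrative values only):

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "Name": ["Helen", "Al", "Dorothy", "Mary"],
    "Count": [239, 21, 220, 295],
})

# Sort by the length of each name without adding a helper column
by_length = toy.sort_values(by="Name", key=lambda s: s.str.len())

# The same key, longest names first
longest_first = toy.sort_values(by="Name", key=lambda s: s.str.len(),
                                ascending=False)

print(by_length)
```

Using `ascending=False` is an alternative to negating the key, and unlike negation it also works when the key values are not numeric.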


### Approach 3: Sorting Using the `map` Function

We can also use the `Series.map` method if we want to apply an arbitrarily defined function to every value in a column. Suppose we want to sort by the number of occurrences of "sa" plus the number of occurrences of "me".

In [92]:
# First, define a function to count the number of times "sa" or "me" appear in each name
def count_sa_me(name):
    return name.count('sa') + name.count('me')

# Then, use `map` to apply `count_sa_me` to each name in the "Name" column
data["sa_me_count"] = data["Name"].map(count_sa_me)

# Sort the DataFrame by the new "sa_me_count" column so we can see our handiwork
data.sort_values(by='sa_me_count',inplace = True)

data


Unnamed: 0,State,Sex,Year,Name,Count,name_lengths,sa_me_count
0,CA,F,1910,Mary,295,4,0
269384,CA,M,1958,Ernesto,66,7,0
269383,CA,M,1958,Roderick,67,8,0
269382,CA,M,1958,Rod,67,3,0
269381,CA,M,1958,Lynn,68,4,0
...,...,...,...,...,...,...,...
80793,CA,F,1979,Summer,301,6,1
80797,CA,F,1979,Theresa,297,7,1
80803,CA,F,1979,Carmen,272,6,1
323991,CA,M,1992,Romel,9,5,1


In [93]:
# Drop the `sa_me_count` column
data.drop(columns='sa_me_count',inplace=True)

In [94]:
data.head(2)

Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
0,CA,F,1910,Mary,295,4
269384,CA,M,1958,Ernesto,66,7
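
For this particular function, the same counts can probably be obtained without `map` by using the vectorized `.str.count` method; `map` remains the more general tool when no built-in string method fits. The sketch below compares the two on a made-up list of names ("Samesa" is invented so that one name scores 2). Both approaches are case-sensitive, so the capital "S" in "Samesa" is not counted.

```python
import pandas as pd

# Hypothetical names, made up for illustration only
toy = pd.DataFrame({"Name": ["Samesa", "Mary", "Carmen", "Theresa"]})

def count_sa_me(name):
    # Count lowercase "sa" and "me" substrings in a single name
    return name.count("sa") + name.count("me")

# Element by element, using an arbitrary Python function
via_map = toy["Name"].map(count_sa_me)

# Vectorized alternative using pandas string methods
via_str = toy["Name"].str.count("sa") + toy["Name"].str.count("me")

print(via_map.tolist())
```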


## Grouping

Group rows that share a common feature, then aggregate data across the group.

In this example, we count the total number of babies born in each year (considering only a small subset of the data, for simplicity).

<img src="images/groupby.png" width="800"/>

In [95]:
# DataFrame with baby girl names only
girl_data = data[data["Sex"] == 'F']

# Group rows by year, then sum the counts and the name lengths within each year
baby_each_year = girl_data.groupby("Year").agg({

    'Count' : "sum",
    "name_lengths" : "sum"
})


# Sort by Count
cont = girl_data.sort_values(by='Count')


In [96]:
# print first 10 entries
cont.head(10)


Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
12957,CA,F,1933,Bonita,5,6
62934,CA,F,1970,Kierstin,5,8
156571,CA,F,2001,Thao,5,4
85669,CA,F,1980,Loryn,5,5
85668,CA,F,1980,Lorrine,5,7
85667,CA,F,1980,Lorinda,5,7
85666,CA,F,1980,Lorene,5,6
85665,CA,F,1980,Lorelei,5,7
85664,CA,F,1980,Lloana,5,6
85663,CA,F,1980,Livier,5,6


In [97]:
#the total baby count in each year
# Answer Here

baby_each_year




Unnamed: 0_level_0,Count,name_lengths
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
1910,5950,1362
1911,6602,1450
1912,9804,1767
1913,11860,1950
1914,13815,2153
...,...,...
2018,189208,22112
2019,184228,21884
2020,173763,21522
2021,173913,21746
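
To see what the split-aggregate-combine steps do on a scale small enough to check by hand, here is a sketch with a made-up five-row table (the counts are illustrative). It also shows a shorter spelling for the common case of aggregating a single column:

```python
import pandas as pd

# Hypothetical toy data, made up for illustration only
toy = pd.DataFrame({
    "Year": [1910, 1910, 1911, 1911, 1911],
    "Name": ["Mary", "Helen", "Mary", "Helen", "Ruth"],
    "Count": [295, 239, 390, 250, 100],
})

# Dictionary form: choose an aggregation per column, result is a DataFrame
totals = toy.groupby("Year").agg({"Count": "sum"})

# Select one column first, then aggregate; result is a Series
same_totals = toy.groupby("Year")["Count"].sum()

print(totals)
```

In both cases the grouping column becomes the index of the result, which is why `Year` appears as the index in the output above.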


There are many different aggregation functions we can use, all of which are useful in different applications.

In [98]:
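# What is the earliest year in which each name appeared?
# Note: .first() returns the first row of each group in the current row order,
# so it gives the earliest year only if the rows are sorted by Year; .min() does not depend on row order
# Answer 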

earliest_appear = girl_data.groupby("Name")["Year"].first()

earliest_appear

Name
Aadhini    2022
Aadhira    2018
Aadhya     2017
Aadya      2017
Aahana     2017
           ... 
Zyanya     2017
Zyla       2017
Zylah      2017
Zyra       2016
Zyrah      2016
Name: Year, Length: 13782, dtype: int64
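
Since `data` was re-sorted in place earlier in the notebook, it is worth seeing how `.first()` depends on row order. The sketch below uses a made-up, deliberately unsorted `toy` table to contrast `.first()` with `.min()`:

```python
import pandas as pd

# Hypothetical toy data with rows that are NOT in chronological order
toy = pd.DataFrame({
    "Name": ["Zyra", "Mary", "Zyra", "Mary"],
    "Year": [2018, 1950, 2016, 1910],
})

# First row encountered in each group: depends on the current row order
first_seen = toy.groupby("Name")["Year"].first()

# Smallest year in each group: independent of row order
earliest = toy.groupby("Name")["Year"].min()

# Sorting by Year beforehand makes .first() agree with .min()
sorted_first = toy.sort_values("Year").groupby("Name")["Year"].first()

print(first_seen)
print(earliest)
```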

In [99]:
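# What is the largest single-year count of each name?
# Note: grouping by both "Name" and "Year" gives one maximum per (name, year) pair;
# grouping by "Name" alone would give one value per name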
a= data.groupby(["Name","Year"])["Count"]

a.max()

Name     Year
Aadan    2008     7
         2009     6
         2014     5
Aadarsh  2019     6
Aaden    2007    20
                 ..
Zyrah    2011     5
         2016     5
         2017     6
         2020     5
Zyrus    2021     5
Name: Count, Length: 385681, dtype: int64

In [100]:
# Can you find the most popular baby name in California (CA) for each year? Use the idxmax function.
# Provide a list of years along with the corresponding most popular names.
result = data.groupby("Year")['Count'].idxmax()
a = data.loc[result,["State","Name","Year"]]
a

Unnamed: 0,State,Name,Year
0,CA,Mary,1910
233,CA,Mary,1911
484,CA,Mary,1912
240064,CA,John,1913
1120,CA,Mary,1914
...,...,...,...
221194,CA,Emma,2018
396004,CA,Noah,2019
398869,CA,Noah,2020
401665,CA,Noah,2021


## Case Study: Name "Popularity"

In this exercise, let's find the name with sex "F" that has dropped most in popularity since its peak usage. We'll start by filtering `babynames` to only include names corresponding to sex "F".

In [120]:
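#Answer Here

female_data = babynames[babynames["Sex"]=="F"]

# Peak (largest) count of each name in any year
maximum = female_data.groupby("Name")["Count"].max()

# Count in the last row recorded for each name
# (the most recent year it appears, assuming the rows are in chronological order)
latest = female_data.groupby("Name")["Count"].last()

# Percentage drop from the peak to the most recent count
val = (maximum - latest) / maximum * 100

name_found = val.idxmax()

print("The name dropped most in popularity since its peak is :: ",name_found)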


The name dropped most in popularity since its peak is ::  Debra


In [102]:
female_data.tail()



Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
80781,CA,F,1979,Lindsay,327,7
80793,CA,F,1979,Summer,301,6
80797,CA,F,1979,Theresa,297,7
80803,CA,F,1979,Carmen,272,6
203713,CA,F,2013,Jaime,11,5


In [103]:
female_data.head()

Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
0,CA,F,1910,Mary,295,4
217697,CA,F,2017,Irene,142,5
217696,CA,F,2017,Amara,143,5
217695,CA,F,2017,Alejandra,143,9
217694,CA,F,2017,Brooke,144,6


In [104]:
# Preview the data; the rows are still in the order left by the earlier in-place sort on "sa_me_count"

data.head(3)

Unnamed: 0,State,Sex,Year,Name,Count,name_lengths
0,CA,F,1910,Mary,295,4
269384,CA,M,1958,Ernesto,66,7
269383,CA,M,1958,Roderick,67,8


To build our intuition on how to answer our research question, let's visualize the prevalence of the name "Jennifer" over time.

In [105]:
# We'll talk about how to generate plots in a later lecture
fig = px.line(female_data[female_data["Name"] == "Jennifer"],
              x = "Year", y = "Count")
fig.update_layout(font_size = 18,
                  autosize=False,
                  width=1000,
                  height=400)

We'll need a mathematical definition for the change in popularity of a name.

Define the metric "ratio to peak" (RTP). We'll calculate this as the count of the name in 2022 (the most recent year for which we have data) divided by the largest count of this name in *any* year.

A demo calculation for Jennifer:

In [141]:
# Compute Jennifer's RTP: the most recent count divided by the highest count

def ratio_to_peak(series):
    return series.iloc[-1] / max(series)
count_jenn = female_data[female_data["Name"] == "Jennifer"]["Count"]
ratio_to_peak(count_jenn)


0.018796372629843364

We can also write a function that produces the `ratio_to_peak` for a given `Series`. This will allow us to use `.groupby` to compute the RTP for all names in the dataset at once.

In [123]:
# define the function for RTP
def ratio_to_peak(series):
    """
    Compute the RTP for a Series containing the counts per year for a single name
    """
    return series.iloc[-1] / max(series)


In [124]:
# Construct a Series containing our Jennifer count data
# Then, find the RTP using the function defined above

count_jenn = female_data[female_data["Name"] == "Jennifer"]["Count"]
ratio_to_peak(count_jenn)


0.018796372629843364
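
To make the arithmetic behind this number concrete, here is a sketch of the same calculation on a made-up four-year series, small enough to verify by hand. The counts are illustrative, and `series.max()` is used in place of the built-in `max`, which gives the same result for a numeric `Series`:

```python
import pandas as pd

def ratio_to_peak(series):
    # Last value in the Series divided by its largest value
    return series.iloc[-1] / series.max()

# Hypothetical yearly counts for one name, in chronological order
counts = pd.Series([100, 400, 200, 50])

# Most recent count is 50 and the peak is 400, so the RTP is 50 / 400
rtp = ratio_to_peak(counts)
print(rtp)  # 0.125
```

An RTP close to 0 means the name has fallen far from its peak, while an RTP of exactly 1 means its most recent year is also its peak. The function relies on the last row being the most recent year, so the input must be in chronological order.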

Now, let's use `.groupby` to compute the RTPs for *all* names in the dataset.

You may see a warning or an error when running the cell below. As discussed in lecture, `pandas` can't apply this aggregation function to non-numeric data (it doesn't make sense to divide "CA" by a number). Older versions of `pandas` dropped any columns that could not be aggregated and issued a warning; recent versions (2.0 and later) raise a `TypeError` instead.

In [127]:
# Results in a TypeError
#rtp_table = female_data.groupby("Name").agg(ratio_to_peak)
#rtp_table

In [131]:
# Find the RTP for all names at once using groupby, as described in the lecture slides
# (this step only builds the GroupBy object; nothing is aggregated until .agg is called)
rtp_table = female_data.groupby("Name")


rtp_table

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000002719DFA5690>

To avoid the warning (or `TypeError`) described above, we explicitly extract only the columns relevant to our analysis before using `.agg`.

In [132]:
# Recompute the RTPs, but only performing the calculation on the "Count" column

rtp_table = female_data.groupby("Name")[["Count"]].agg(ratio_to_peak)



In [133]:
# Rename "Count" to "Count RTP" for clarity
rtp_table = rtp_table.rename(columns = {"Count": "Count RTP"})


In [134]:
# What name has fallen the most in popularity?
rtp_table.sort_values("Count RTP")


Unnamed: 0_level_0,Count RTP
Name,Unnamed: 1_level_1
Debra,0.001260
Debbie,0.002815
Carol,0.003180
Tammy,0.003249
Susan,0.003305
...,...
Fidelia,1.000000
Naveyah,1.000000
Finlee,1.000000
Roseline,1.000000


We can visualize the decrease in the popularity of the name "Debra":

In [136]:
def plot_name(*names):
    fig = px.line(female_data[female_data["Name"].isin(names)],
                  x = "Year", y = "Count", color="Name",
                  title=f"Popularity for: {names}")
    fig.update_layout(font_size = 18,
                  autosize=False,
                  width=1000,
                  height=400)
    return fig
# pass the name into plot_name
plot_name("Debra")

In [137]:
# Find the 10 names that have decreased the most in popularity
# Answer Here

top10 = rtp_table.sort_values("Count RTP").head(10).index

In [140]:
px.line(female_data[female_data["Name"].isin(top10)], x = "Year", y = "Count", color = "Name")

For fun, try plotting your name or your friends' names.