# Lecture 5 – Fall 2024

A demonstration of advanced `pandas` syntax to accompany Lecture 4.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px

## Dataset: California baby names

In today's lecture, we'll work with the `babynames` dataset, which contains information about the names of infants born in California.

The cell below downloads state-level baby name data from the Social Security Administration (SSA) website and then loads the California file into a `DataFrame`. The code shown here is outside of the scope of Data 100, but you're encouraged to dig into it if you are interested!

In [4]:
import urllib.request
import os.path
import zipfile

data_url = "https://www.ssa.gov/oact/babynames/state/namesbystate.zip"
local_filename = "babynamesbystate.zip"
if not os.path.exists(local_filename): # If the data exists don't download again
    with urllib.request.urlopen(data_url) as resp, open(local_filename, 'wb') as f:
        f.write(resp.read())

zf = zipfile.ZipFile(local_filename, 'r')

ca_name = 'CA.TXT'
field_names = ['State', 'Sex', 'Year', 'Name', 'Count']
with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

babynames.head()

  State Sex  Year      Name  Count
0    CA   F  1910      Mary    295
1    CA   F  1910     Helen    239
2    CA   F  1910   Dorothy    220
3    CA   F  1910  Margaret    163
4    CA   F  1910   Frances    134


## Case Study: Name "Popularity"

### Case Study Question

**Title**: Identifying the Most Consistently Popular Female Baby Name Over Time

**Objective**: In this exercise, we will analyze the dataset to find the female baby name that has shown the most consistent popularity over the years. This involves filtering the data, measuring how much each name's yearly count varies, and determining the most stable name.



### Instructions

**Data Preparation:** Filter the dataset to only include entries where the sex is "F" (female).

**Calculate Consistency:** For each name, calculate the standard deviation of the counts over the years. A lower standard deviation indicates more consistent popularity.

**Identify Most Consistent Name:** Determine the name with the lowest standard deviation in counts, signifying the most consistent popularity.
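
The three steps above can be sketched on a small, made-up table before running them on the real data. The `toy` DataFrame and its values are hypothetical and only mirror the column layout of `babynames`; this is a minimal illustration, not the notebook's own solution.

```python
import pandas as pd

# Hypothetical toy data with the same columns as `babynames` (values are made up).
toy = pd.DataFrame({
    "State": ["CA"] * 8,
    "Sex":   ["F", "F", "F", "F", "F", "F", "M", "M"],
    "Year":  [2000, 2001, 2002, 2000, 2001, 2002, 2000, 2001],
    "Name":  ["Ava", "Ava", "Ava", "Mia", "Mia", "Mia", "Leo", "Leo"],
    "Count": [100, 102, 98, 50, 150, 100, 10, 500],
})

# Step 1: keep only the female rows.
toy_female = toy[toy["Sex"] == "F"]

# Step 2: sample standard deviation of the yearly counts for each name.
toy_std = toy_female.groupby("Name")["Count"].std()

# Step 3: the name whose counts vary the least.
toy_most_consistent = toy_std.idxmin()
print(toy_std)
print(toy_most_consistent)
```

In this toy table, Ava's counts stay close to 100 every year while Mia's swing between 50 and 150, so Ava has the smaller standard deviation and is selected.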


In [2]:

with zf.open(ca_name) as fh:
    babynames = pd.read_csv(fh, header=None, names=field_names)

print(babynames.head())



  State Sex  Year      Name  Count
0    CA   F  1910      Mary    295
1    CA   F  1910     Helen    239
2    CA   F  1910   Dorothy    220
3    CA   F  1910  Margaret    163
4    CA   F  1910   Frances    134


In [24]:
babynames.head(100)

   State Sex  Year      Name  Count
0     CA   F  1910      Mary    295
1     CA   F  1910     Helen    239
2     CA   F  1910   Dorothy    220
3     CA   F  1910  Margaret    163
4     CA   F  1910   Frances    134
..   ...  ..   ...       ...    ...
95    CA   F  1910  Lorraine     17
96    CA   F  1910  Madeline     17
97    CA   F  1910    Maxine     17
98    CA   F  1910    Bessie     16


In [23]:
babynames.tail(100)

       State Sex  Year    Name  Count
413794    CA   M  2023  Newton      5
413795    CA   M  2023   Nilan      5
413796    CA   M  2023   Niles      5
413797    CA   M  2023    Nilo      5
413798    CA   M  2023     Nio      5
...      ...  ..   ...     ...    ...
413889    CA   M  2023    Ziah      5
413890    CA   M  2023  Ziaire      5
413891    CA   M  2023  Zidane      5
413892    CA   M  2023    Zyan      5


In [21]:
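# First attempt: standard deviation of the yearly counts for every name.
# Sex is not filtered here, so male and female rows are grouped together.
name_std_dev = babynames.groupby('Name')['Count'].std().reset_index()
name_std_dev.columns = ['Name', 'StdDev']

In [22]:
# idxmin() returns the row label of the smallest StdDev; .loc retrieves that row
most_consistent_name = name_std_dev.loc[name_std_dev['StdDev'].idxmin()]
print(most_consistent_name)

Name      Aaleah
StdDev       0.0
Name: 26, dtype: object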


In [25]:

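# Step 1: keep only the female rows
female_babynames = babynames[babynames['Sex'] == 'F']

# Step 2: standard deviation of the yearly counts for each female name
name_std_dev = female_babynames.groupby('Name')['Count'].std()

# Step 3: the name with the smallest standard deviation, and that value
most_consistent_name = name_std_dev.idxmin()
most_consistent_std_dev = name_std_dev.min()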

print(f"The most consistently popular female baby name is {most_consistent_name} with a standard deviation of {most_consistent_std_dev:.2f}")


The most consistently popular female baby name is Aaleah with a standard deviation of 0.00
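
A standard deviation of 0.00 deserves a second look. It means every recorded yearly count for that name is identical, which can happen when a name appears in only a handful of years at the same small count, not only when it is popular for decades. By default `std()` computes the sample standard deviation (`ddof=1`), so a name with a single row gets `NaN`, and `idxmin()` skips `NaN` values. One possible refinement, which is not part of the exercise as stated, is to require a minimum number of recorded years before comparing names. The helper `most_consistent`, its `min_years` parameter, and the toy data below are hypothetical, and the sketch assumes one row per name per year within a sex.

```python
import pandas as pd

def most_consistent(df, sex="F", min_years=3):
    """Hypothetical helper: lowest-std name among names with at least `min_years` rows."""
    subset = df[df["Sex"] == sex]
    # Per-name standard deviation and number of recorded rows (one row per year assumed).
    stats = subset.groupby("Name")["Count"].agg(["std", "count"])
    eligible = stats[stats["count"] >= min_years]
    return eligible["std"].idxmin()

# Hypothetical toy data (values are made up).
toy = pd.DataFrame({
    "Sex":   ["F"] * 7,
    "Year":  [2000, 2001, 2000, 2001, 2002, 2003, 2003],
    "Name":  ["Zia", "Zia", "Ava", "Ava", "Ava", "Ava", "Uma"],
    "Count": [5, 5, 100, 104, 96, 100, 7],
})

# Without a minimum-years filter, the rare name with identical counts wins.
raw_std = toy.groupby("Name")["Count"].std()
print(raw_std)
print(raw_std.idxmin())

# With the filter, only names recorded in at least 3 years are compared.
print(most_consistent(toy, min_years=3))
```

Here Zia appears twice with the same count (standard deviation 0), Uma appears once (`NaN`), and Ava appears in four years with small variation. The unfiltered minimum picks Zia, while the filtered version picks Ava. The choice of threshold is a judgment call and would need to be justified for the real dataset.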
