# **Data Analysis Using Pandas: Baby Names**

This Jupyter Notebook uses a dataset of baby names to demonstrate Python's data-handling libraries, focusing on data import, manipulation, and visualization.

***

## **1. Library Imports and Data Acquisition**

Import the required libraries and download the dataset from the Social Security Administration website if it is not already stored locally.

In [1]:
# Standard library: file paths, downloading, and ZIP handling
import os
import urllib.request
import zipfile

# Data manipulation and numerical computing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.offline as py
import plotly.graph_objs as go
import plotly.figure_factory as ff
import plotly.express as px
%matplotlib inline

# Download the baby names dataset if it's not already present locally
data_url = "https://www.ssa.gov/oact/babynames/names.zip"
local_filename = "babynames.zip"
data_dir = "../data"
local_path = os.path.join(data_dir, local_filename)
os.makedirs(data_dir, exist_ok=True)  # create the data directory if it is missing
if not os.path.exists(local_path):
    with urllib.request.urlopen(data_url) as response, open(local_path, "wb") as out_file:
        out_file.write(response.read())

# Load each yearly file (e.g. yob1880.txt) directly from the ZIP without extracting
frames = []
with zipfile.ZipFile(local_path, "r") as zf:
    for f in zf.infolist():
        if f.filename.endswith(".txt"):
            year = int(f.filename[3:7])  # the four digits after the "yob" prefix
            df = pd.read_csv(zf.open(f.filename), names=["Name", "Sex", "Count"])
            df["Year"] = year
            frames.append(df)

# The index restarts at 0 for every year because ignore_index is not set
babynames = pd.concat(frames)
display(babynames.head(10))

,Name,Sex,Count,Year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880
5,Margaret,F,1578,1880
6,Ida,F,1472,1880
7,Alice,F,1414,1880
8,Bertha,F,1320,1880
9,Sarah,F,1288,1880
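
The loading step above can be tried without any download. The following is a minimal sketch, assuming a tiny in-memory ZIP whose file names follow the same `yobYYYY.txt` pattern; the rows, the counts, and the names `buffer`, `frames`, and `sample` are made up for illustration. It also passes `ignore_index=True` to `pd.concat`, which the notebook does not do, to show how to get one continuous index.

```python
import io
import zipfile

import pandas as pd

# Build a tiny in-memory ZIP with two yearly files (hypothetical data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("yob1880.txt", "Mary,F,10\nJohn,M,12\n")
    zf.writestr("yob1881.txt", "Mary,F,8\nJohn,M,11\n")
    zf.writestr("README.pdf", "not a data file")

# Read every .txt member without extracting, tagging each row with its year.
frames = []
with zipfile.ZipFile(buffer, "r") as zf:
    for member in zf.namelist():
        if member.endswith(".txt"):
            year = int(member[3:7])  # "yob1880.txt"[3:7] is "1880"
            with zf.open(member) as fh:
                frame = pd.read_csv(fh, names=["Name", "Sex", "Count"])
            frame["Year"] = year
            frames.append(frame)

# ignore_index=True yields a single 0..n-1 index instead of one that
# restarts at 0 for each year.
sample = pd.concat(frames, ignore_index=True)
print(sample)
```

Whether to reset the index is a design choice: keeping the per-file index preserves each name's rank position within its year, while a continuous index makes label-based lookups unambiguous.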


***

## **2. Data Exploration and Manipulation**

Query the data to answer specific questions, adding and removing helper columns along the way.

In [2]:
# Query to find the most popular male name in 2020
most_popular_male_2020 = babynames.query("Year == 2020 and Sex == 'M'").sort_values(by='Count', ascending=False).iloc[0]["Name"]
print(f"The most popular male name in 2020 was: {most_popular_male_2020}")

The most popular male name in 2020 was: Liam
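
Sorting the whole subset only to read its first row works, but pandas also offers more direct ways to pick the maximum. Here is a short sketch on a hypothetical miniature table (`toy`, with made-up counts) that uses `nlargest` and `idxmax` instead of `sort_values`.

```python
import pandas as pd

# Hypothetical miniature of the babynames table (counts are made up).
toy = pd.DataFrame({
    "Name": ["Liam", "Noah", "Olivia", "Liam"],
    "Sex": ["M", "M", "F", "M"],
    "Count": [30, 25, 40, 20],
    "Year": [2020, 2020, 2020, 2019],
})

# Boolean masks are an alternative to the string-based query() call.
males_2020 = toy[(toy["Year"] == 2020) & (toy["Sex"] == "M")]

# nlargest(1, "Count") returns only the row with the highest count.
top_name = males_2020.nlargest(1, "Count")["Name"].iloc[0]

# idxmax returns the index label of the maximum; this lookup relies on
# the filtered frame having unique index labels.
top_name_alt = males_2020.loc[males_2020["Count"].idxmax(), "Name"]

print(top_name, top_name_alt)
```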


In [3]:
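# Add a temporary column with each name's length, then find the longest male name in 2020
# (if several names share the maximum length, the sort returns just one of them)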
babynames['name_length'] = babynames['Name'].str.len()
longest_male_name_2020 = babynames.query("Year == 2020 and Sex == 'M'").sort_values(by='name_length', ascending=False).iloc[0]["Name"]
display(babynames.head(10))

# Print longest male name found in 2020
print(f"The longest male name in 2020 was: {longest_male_name_2020}")

,Name,Sex,Count,Year,name_length
0,Mary,F,7065,1880,4
1,Anna,F,2604,1880,4
2,Emma,F,2003,1880,4
3,Elizabeth,F,1939,1880,9
4,Minnie,F,1746,1880,6
5,Margaret,F,1578,1880,8
6,Ida,F,1472,1880,3
7,Alice,F,1414,1880,5
8,Bertha,F,1320,1880,6
9,Sarah,F,1288,1880,5


The longest male name in 2020 was: Alexanderjames
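
Several names can share the maximum length, and the sort-based approach reports only one of them. The sketch below, on a hypothetical table with made-up counts, keeps the lengths in a separate Series so that no temporary column is added, and returns every name tied for longest.

```python
import pandas as pd

# Hypothetical 2020 male names (counts are made up).
toy = pd.DataFrame({
    "Name": ["Christopher", "Maximiliano", "Liam", "Noah"],
    "Sex": ["M", "M", "M", "M"],
    "Count": [5, 6, 30, 25],
    "Year": [2020, 2020, 2020, 2020],
})

# Compute lengths as a standalone Series; the frame itself is not modified.
lengths = toy["Name"].str.len()

# Keep every row whose length equals the maximum, so ties are all reported.
longest = toy.loc[lengths == lengths.max(), "Name"].tolist()
print(longest)  # ['Christopher', 'Maximiliano']
```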


In [4]:
# Remove the temporary 'name_length' column
babynames.drop(columns='name_length', inplace=True)
display(babynames.head(10))

,Name,Sex,Count,Year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880
5,Margaret,F,1578,1880
6,Ida,F,1472,1880
7,Alice,F,1414,1880
8,Bertha,F,1320,1880
9,Sarah,F,1288,1880


***

## **3. Custom Column Values and Sorting**

Compute custom values from a column and sort the data by the result.

In [5]:
# Define a function to count occurrences of the lowercase substrings 'dr' and 'ea' in a name
# (str.count is case-sensitive, so a leading 'Dr' or 'Ea' is not counted)
def count_dr_ea(name):
    return name.count('dr') + name.count('ea')

# Apply the function and create a new column 'dr_ea'
babynames['dr_ea'] = babynames['Name'].apply(count_dr_ea)
babynames_sorted = babynames.sort_values(by='dr_ea', ascending=False)
display(babynames_sorted.head(10))

,Name,Sex,Count,Year,dr_ea
3729,Deandrea,F,35,2001,3
7987,Deandrea,F,15,2010,3
6972,Leandrea,F,8,1976,3
25383,Leandrea,M,5,1993,3
1460,Deandrea,F,94,1988,3
11488,Keandrea,F,5,1985,3
3388,Leandrea,F,25,1981,3
14008,Leandrea,F,7,2013,3
3204,Deandrea,F,16,1963,3
26458,Deandrea,M,8,2003,3
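
The same count can be computed without a Python-level function. The following sketch, on a few hypothetical names, compares the `apply` approach with the vectorized `Series.str.count`, which takes a regular expression. Because the patterns `dr` and `ea` cannot overlap each other, counting matches of `dr|ea` gives the same totals. The case-insensitive variant is an addition for illustration, not something the notebook does.

```python
import pandas as pd

names = pd.Series(["Deandrea", "Andrea", "Drew", "Mary"], name="Name")

def count_dr_ea(name):
    return name.count("dr") + name.count("ea")

# Row-by-row Python function, as in the cell above.
via_apply = names.apply(count_dr_ea)

# Vectorized equivalent: count non-overlapping matches of the regex "dr|ea".
via_str = names.str.count("dr|ea")

# Case-insensitive variant: lowercase first so a leading "Dr" also counts.
via_str_ci = names.str.lower().str.count("dr|ea")

print(via_apply.tolist())   # [3, 2, 0, 0]
print(via_str.tolist())     # [3, 2, 0, 0]
print(via_str_ci.tolist())  # [3, 2, 1, 0]
```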


In [6]:
# Clean up by removing the 'dr_ea' column after use
babynames.drop(columns='dr_ea', inplace=True)
display(babynames.head(10))

,Name,Sex,Count,Year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880
5,Margaret,F,1578,1880
6,Ida,F,1472,1880
7,Alice,F,1414,1880
8,Bertha,F,1320,1880
9,Sarah,F,1288,1880


***

## **4. Column Renaming and Grouped Aggregations**

Rename columns and use `groupby` to aggregate data.

In [7]:
# Rename the 'Name' column to 'baby_name'
babynames.rename(columns={"Name": "baby_name"}, inplace=True)
display(babynames.head(10))

,baby_name,Sex,Count,Year
0,Mary,F,7065,1880
1,Anna,F,2604,1880
2,Emma,F,2003,1880
3,Elizabeth,F,1939,1880
4,Minnie,F,1746,1880
5,Margaret,F,1578,1880
6,Ida,F,1472,1880
7,Alice,F,1414,1880
8,Bertha,F,1320,1880
9,Sarah,F,1288,1880


In [8]:
# Sum counts across all years and both sexes to get the total for each baby name
total_counts = babynames.groupby("baby_name")["Count"].sum()
print(total_counts.head(10))

baby_name
Aaban        127
Aabha         62
Aabid         16
Aabidah        5
Aabir         19
Aabriella     51
Aada          13
Aadam        343
Aadan        136
Aadarsh      246
Name: Count, dtype: int64
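
Grouping by name alone merges the counts of both sexes, and the result is ordered alphabetically rather than by size. As a rough sketch on a hypothetical table (made-up counts), the snippet below ranks the totals with `nlargest` and shows how grouping by two keys keeps the sexes separate.

```python
import pandas as pd

# Hypothetical miniature of the renamed babynames table (counts are made up).
toy = pd.DataFrame({
    "baby_name": ["Mary", "Mary", "Leslie", "Leslie", "John"],
    "Sex": ["F", "F", "F", "M", "M"],
    "Count": [10, 8, 3, 4, 9],
    "Year": [1880, 1881, 1880, 1880, 1880],
})

# Total per name, combining all years and both sexes.
totals = toy.groupby("baby_name")["Count"].sum()

# The two names with the largest totals, biggest first.
top = totals.nlargest(2)

# Grouping by name and sex yields a Series with a two-level index.
by_name_sex = toy.groupby(["baby_name", "Sex"])["Count"].sum()

print(top)
print(by_name_sex)
```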


In [9]:
# Define a function to calculate the Ratio to Peak (RTP): the last count in a series divided by its maximum.
# This assumes each name's rows are in chronological order, so iloc[-1] is the count for the last year the name appears.
def ratio_to_peak(series):
    return series.iloc[-1] / series.max()

# Use groupby to compute the RTP for each female baby name, then sort once to find the name whose popularity has fallen the most
rtp_series = babynames.query('Sex == "F"').groupby("baby_name")["Count"].agg(ratio_to_peak).rename("RTP")
rtp_sorted = rtp_series.sort_values()
least_popular_female_name = rtp_sorted.index[0]
least_popular_female_name_rtp = rtp_sorted.iloc[0]

print(f"The female baby name whose popularity has fallen the most is: {least_popular_female_name}, with an RTP of {least_popular_female_name_rtp}")

The female baby name whose popularity has fallen the most is: Debra, with an RTP of 0.0005933779026069069
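
To make the RTP calculation concrete, here is a small worked example on hypothetical data (the counts and the deliberately shuffled row order are made up). Because `ratio_to_peak` reads the last row of each group, the sketch sorts by `Year` before grouping; that sort is an added safeguard, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical counts; the rows are intentionally out of chronological order.
toy = pd.DataFrame({
    "baby_name": ["Debra", "Debra", "Debra", "Emma", "Emma", "Emma"],
    "Year": [2020, 1955, 1990, 1955, 1990, 2020],
    "Count": [5, 100, 20, 40, 30, 60],
})

def ratio_to_peak(series):
    # Last value divided by the maximum value of the series.
    return series.iloc[-1] / series.max()

# Sort by year first so that iloc[-1] really is the most recent count;
# groupby keeps the row order within each group.
rtp = (
    toy.sort_values("Year")
    .groupby("baby_name")["Count"]
    .agg(ratio_to_peak)
    .rename("RTP")
)

print(rtp)
# Debra: 5 / 100 = 0.05, Emma: 60 / 60 = 1.0
print(rtp.idxmin())  # Debra
```

An RTP of 1.0 means the name is currently at its peak, while a value near 0 means it has fallen far from it.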
