# Popular Baby Names Dataset

In this project, I load the Popular Baby Names dataset from the City of New York, prepare it for analysis, and do some initial exploratory data analysis (EDA) in Python.

The dataset I am using: https://catalog.data.gov/dataset/popular-baby-names

These first few lines of code import the pandas and numpy packages and load the dataset from a CSV file. I'll also take a general glance at the data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
baby_data = pd.read_csv("Popular_Baby_Names.csv")

In [3]:
baby_data.head()

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,FEMALE,HISPANIC,GERALDINE,13,75
1,2011,FEMALE,HISPANIC,GIA,21,67
2,2011,FEMALE,HISPANIC,GIANNA,49,42
3,2011,FEMALE,HISPANIC,GISELLE,38,51
4,2011,FEMALE,HISPANIC,GRACE,36,53


In [4]:
baby_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57582 entries, 0 to 57581
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Year of Birth       57582 non-null  int64 
 1   Gender              57582 non-null  object
 2   Ethnicity           57582 non-null  object
 3   Child's First Name  57582 non-null  object
 4   Count               57582 non-null  int64 
 5   Rank                57582 non-null  int64 
dtypes: int64(3), object(3)
memory usage: 2.0+ MB


Some things to notice right away: each row records a year, gender, ethnicity, first name, count (acting like a raw score), and rank. This gives a lot of variables for examining the data and allows the data to be viewed in different ways.

I took a look into the values for some of the variables:

## Ethnicity

I asked myself a few questions about the attributes of this data:

- How many names are in this dataset for each ethnicity? For each year?
- How popular are these names?
- What are the most popular names for each ethnicity?

  
The next cell attempts to answer the first question.

In [5]:
baby_data.groupby("Ethnicity").size()

Ethnicity
ASIAN AND PACI                 2125
ASIAN AND PACIFIC ISLANDER     7830
BLACK NON HISP                 2093
BLACK NON HISPANIC             8335
HISPANIC                      16930
WHITE NON HISP                 4142
WHITE NON HISPANIC            16127
dtype: int64

It looks like some of the ethnicity labels were truncated in the data (for example, "WHITE NON HISP" alongside "WHITE NON HISPANIC"). To fix this, I used the replace() method to make the ethnicity names consistent. I also lowercased the gender values and title-cased the names.

In [6]:
# replacing text for ethnicity

ethnicities = ["WHITE NON HISP", "WHITE NON HISPANIC",
               "ASIAN AND PACI", "ASIAN AND PACIFIC ISLANDER",
               "BLACK NON HISP", "BLACK NON HISPANIC",
               "HISPANIC"]

ethnicities_new = ["White", "White",
                   "Asian/PI", "Asian/PI",
                   "Black", "Black",
                   "Hispanic"]

baby_data = baby_data.replace(ethnicities, ethnicities_new)

# lowercasing the gender values and title-casing the names, respectively:

baby_data["Gender"] = baby_data["Gender"].str.lower()
baby_data["Child's First Name"] = baby_data["Child's First Name"].str.title()

baby_data.head()

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
0,2011,female,Hispanic,Geraldine,13,75
1,2011,female,Hispanic,Gia,21,67
2,2011,female,Hispanic,Gianna,49,42
3,2011,female,Hispanic,Giselle,38,51
4,2011,female,Hispanic,Grace,36,53


In [7]:
baby_data.groupby("Ethnicity").size()

Ethnicity
Asian/PI     9955
Black       10428
Hispanic    16930
White       20269
dtype: int64
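
The replace() call above passes two parallel lists to the whole DataFrame, so the old and new labels are paired by position and the strings are searched for in every column. As a minimal sketch of an alternative, assuming a small made-up frame (`toy` and `ethnicity_map` are illustrative names, not part of the notebook), a dictionary applied only to the `Ethnicity` column makes each old-to-new pairing explicit:

```python
import pandas as pd

# Hypothetical toy frame standing in for baby_data
toy = pd.DataFrame({
    "Ethnicity": ["WHITE NON HISP", "WHITE NON HISPANIC", "ASIAN AND PACI",
                  "ASIAN AND PACIFIC ISLANDER", "BLACK NON HISP",
                  "BLACK NON HISPANIC", "HISPANIC"],
    "Count": [10, 12, 11, 15, 13, 14, 20],
})

# Explicit old -> new mapping; the keys are the labels seen in the groupby output
ethnicity_map = {
    "WHITE NON HISP": "White",
    "WHITE NON HISPANIC": "White",
    "ASIAN AND PACI": "Asian/PI",
    "ASIAN AND PACIFIC ISLANDER": "Asian/PI",
    "BLACK NON HISP": "Black",
    "BLACK NON HISPANIC": "Black",
    "HISPANIC": "Hispanic",
}

# Series.replace with a dict matches whole values, and only in this column
toy["Ethnicity"] = toy["Ethnicity"].replace(ethnicity_map)

label_counts = toy.groupby("Ethnicity").size()
print(label_counts)
```

Scoping the replacement to one column avoids accidentally rewriting a matching string elsewhere, for example a first name that happened to equal one of the labels.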

I also wanted to find out the most popular names for each ethnicity.

In [8]:
# sort by name count descending, then keep the top 2 rows for each ethnicity
baby_data.sort_values(["Count", "Ethnicity"], ascending=False).groupby("Ethnicity").head(2)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
3447,2011,male,Hispanic,Jayden,426,1
17403,2011,male,Hispanic,Jayden,426,1
9612,2013,male,White,David,304,1
11328,2013,male,White,David,304,1
9472,2013,male,Asian/PI,Jayden,220,1
20005,2013,male,Asian/PI,Jayden,220,1
795,2011,male,Black,Jayden,184,1
2246,2011,male,Black,Jayden,184,1


It looks like there are duplicate rows. I will remove them and do this again.
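
Before dropping anything, it can help to count the duplicates and look at them. The following is a minimal sketch on a small made-up frame (`toy`, `n_duplicates`, and `all_copies` are illustrative names), not a cell from the original notebook:

```python
import pandas as pd

# Hypothetical toy frame in which the first row appears twice
toy = pd.DataFrame({
    "Year of Birth": [2011, 2011, 2013],
    "Gender": ["male", "male", "male"],
    "Ethnicity": ["Hispanic", "Hispanic", "White"],
    "Child's First Name": ["Jayden", "Jayden", "David"],
    "Count": [426, 426, 304],
    "Rank": [1, 1, 1],
})

# duplicated() flags each row that repeats an earlier, identical row
n_duplicates = int(toy.duplicated().sum())

# keep=False flags every copy, which is handy for inspecting them side by side
all_copies = toy[toy.duplicated(keep=False)]

print(n_duplicates)
print(all_copies)
```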

In [9]:
# remove duplicates
baby_data_distinct = baby_data.drop_duplicates()

baby_data_distinct.sort_values(["Count", "Ethnicity"], ascending=False).groupby("Ethnicity").head(2)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
3447,2011,male,Hispanic,Jayden,426,1
37557,2019,male,Hispanic,Liam,423,1
9612,2013,male,White,David,304,1
5791,2012,male,White,Joseph,300,1
9472,2013,male,Asian/PI,Jayden,220,1
5129,2012,male,Asian/PI,Ryan,197,1
795,2011,male,Black,Jayden,184,1
1218,2011,female,Black,Madison,176,1


According to this data, Jayden is a popular baby name across many ethnicities.
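
The "top N per group" pattern used in the cells above works because groupby(...).head(n) keeps the first n rows of each group in the frame's current order, so sorting by count first puts the largest counts at the front of every group. A minimal sketch on a small illustrative frame (`toy` and `top2` are hypothetical names):

```python
import pandas as pd

# Small illustrative frame, not the full dataset
toy = pd.DataFrame({
    "Ethnicity": ["Hispanic", "Hispanic", "Hispanic", "White", "White", "White"],
    "Child's First Name": ["Jayden", "Liam", "Grace", "David", "Joseph", "Esther"],
    "Count": [426, 423, 36, 304, 300, 224],
})

# Sort once by Count (largest first), then keep the first 2 rows per ethnicity
top2 = (
    toy.sort_values("Count", ascending=False)
       .groupby("Ethnicity")
       .head(2)
)
print(top2)
```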

## Data Across Years

Next, I will look at the data for different years. I would like to know how many rows there are for each year.

In [10]:
baby_data.groupby("Year of Birth").size()

Year of Birth
2011    11752
2012    11871
2013    11889
2014    12090
2015     2045
2016     2063
2017     1973
2018     1964
2019     1935
dtype: int64

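The row counts for 2011-2014 are much higher than those for 2015-2019. Note that this cell uses baby_data, which still contains the duplicate rows found earlier, so part of the gap may come from duplication rather than from more names being recorded.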
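
One way to check how much duplicated rows contribute to per-year row counts is to compare the counts before and after drop_duplicates(). This is a minimal sketch on a made-up frame (`toy`, `rows_raw`, `rows_distinct`, and `comparison` are illustrative names), not a result from the real data:

```python
import pandas as pd

# Hypothetical toy frame: the 2011 row appears three times, the 2019 row once
toy = pd.DataFrame({
    "Year of Birth": [2011, 2011, 2011, 2019],
    "Child's First Name": ["Jayden", "Jayden", "Jayden", "Liam"],
    "Count": [426, 426, 426, 423],
})

rows_raw = toy.groupby("Year of Birth").size()
rows_distinct = toy.drop_duplicates().groupby("Year of Birth").size()

# Put both counts side by side, one row per year
comparison = pd.DataFrame({"raw": rows_raw, "distinct": rows_distinct})
print(comparison)
```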

I would also like to know the top boy and girl names for every year:

In [11]:
baby_data_distinct.sort_values(["Year of Birth", "Count"], ascending=[True, False]).groupby(["Year of Birth", "Gender"]).head(1)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
3447,2011,male,Hispanic,Jayden,426,1
22,2011,female,Hispanic,Isabella,331,1
5489,2012,male,Hispanic,Jayden,364,1
4434,2012,female,Hispanic,Isabella,327,1
6142,2013,male,Hispanic,Jayden,352,1
7607,2013,female,Hispanic,Isabella,326,1
13454,2014,female,Hispanic,Isabella,331,1
7574,2014,male,Hispanic,Liam,312,1
31934,2015,male,Hispanic,Liam,356,1
30844,2015,female,Hispanic,Isabella,307,1


Jayden was the most popular boy name from 2011-2013, and Isabella was the most popular girl name for every year except 2017!
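
An alternative to sorting and taking head(1) is to ask each group for the index label of its largest count with idxmax() and then select those rows. A minimal sketch, assuming a small illustrative frame with a unique index (`toy`, `top_idx`, and `top_names` are hypothetical names):

```python
import pandas as pd

# Small illustrative frame, not the full dataset
toy = pd.DataFrame({
    "Year of Birth": [2011, 2011, 2011, 2011, 2012, 2012],
    "Gender": ["male", "male", "female", "female", "male", "female"],
    "Child's First Name": ["Jayden", "Michael", "Isabella", "Esther",
                           "Jayden", "Isabella"],
    "Count": [426, 292, 331, 224, 364, 327],
})

# For each (year, gender) group, idxmax() gives the index label of the
# row with the largest Count
top_idx = toy.groupby(["Year of Birth", "Gender"])["Count"].idxmax()

# Select those rows; this relies on the index labels being unique
top_names = toy.loc[top_idx, ["Year of Birth", "Gender",
                              "Child's First Name", "Count"]]
print(top_names)
```

Unlike head(1) after a sort, this returns exactly one row per group even when the frame has not been sorted.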

## Year, Gender, and Ethnicity

I also want to see the most popular names for each year, for different ethnicities:

In [12]:
baby_data_distinct.sort_values(["Year of Birth", "Count"], ascending=[True, False]).groupby(["Year of Birth", "Ethnicity", "Gender"]).head(1)

Unnamed: 0,Year of Birth,Gender,Ethnicity,Child's First Name,Count,Rank
3447,2011,male,Hispanic,Jayden,426,1
22,2011,female,Hispanic,Isabella,331,1
1849,2011,male,White,Michael,292,1
322,2011,female,White,Esther,224,1
795,2011,male,Black,Jayden,184,1
...,...,...,...,...,...,...
38769,2019,female,White,Chaya,209,1
38075,2019,male,Asian/PI,Ethan,154,1
39152,2019,male,Black,Noah,135,1
38250,2019,female,Asian/PI,Chloe,131,1


While this data is what I'm looking for, it is not presented in the best way. To fix this, I made a pivot table which puts the years in the rows, and the gender and ethnicity in the columns.

To do this, I used mask() to replace every row whose rank is not 1 with NaN, and then dropped the NaN rows. I then dropped duplicates of year, ethnicity, and gender, because pivot() requires each combination of those columns to be unique. Next, I made a pivot table with the year as the index (rows), gender and ethnicity as the column levels, and the names as the values. Finally, I sorted the columns by gender and ethnicity.

In [13]:
# flag every row whose rank is NOT 1; mask() replaces the flagged rows with NaN
rank_1 = baby_data_distinct["Rank"] != 1
baby_data_long = baby_data_distinct.mask(rank_1)

# drop the NaN rows, then drop duplicate (year, ethnicity, gender) combinations
baby_data_long = baby_data_long.dropna()
baby_data_long = baby_data_long.drop_duplicates(subset=["Year of Birth", "Ethnicity", "Gender"])

# mask() introduced NaN, which turned the integer columns into floats,
# so convert the years back to integers
baby_data_long["Year of Birth"] = baby_data_long["Year of Birth"].astype(int)

# pivot the data so that each (gender, ethnicity) pair becomes a column
baby_data_wide = baby_data_long.pivot(index="Year of Birth", columns=["Gender", "Ethnicity"], values="Child's First Name")

# change the order of the columns
baby_data_wide = baby_data_wide.sort_values(["Gender", "Ethnicity"], axis=1)

baby_data_wide

Gender,female,female,female,female,male,male,male,male
Ethnicity,Asian/PI,Black,Hispanic,White,Asian/PI,Black,Hispanic,White
Year of Birth,,,,,,,,
2011,Sophia,Madison,Isabella,Esther,Ethan,Jayden,Jayden,Michael
2012,Chloe,Madison,Isabella,Emma,Ryan,Jayden,Jayden,Joseph
2013,Sophia,Madison,Isabella,Olivia,Jayden,Ethan,Jayden,David
2014,Olivia,Madison,Isabella,Olivia,Jayden,Ethan,Liam,Joseph
2015,Olivia,Madison,Isabella,Emma,Jayden,Noah,Liam,David
2016,Olivia,Ava,Isabella,Olivia,Ethan,Noah,Liam,Joseph
2017,Olivia,Ava,Mia,Esther,Muhammad,Noah,Liam,David
2018,Chloe,Ava,Isabella,Esther,Muhammad,Noah,Liam,David
2019,Chloe,Ava,Isabella,Chaya,Ethan,Noah,Liam,David


The data is now organized in a much more viewable format.
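
The same reshaping can also be sketched with plain boolean indexing instead of mask(). Because no NaN values are introduced, the integer columns stay integers and no float-to-int conversion is needed. This is a minimal sketch on a made-up frame (`toy`, `rank_1_rows`, and `wide` are illustrative names, and the rank-2 rows are invented for the example), not the notebook's own approach:

```python
import pandas as pd

# Hypothetical toy frame with rank-1 and rank-2 rows
toy = pd.DataFrame({
    "Year of Birth": [2011, 2011, 2011, 2012, 2012, 2012],
    "Gender": ["female", "male", "male", "female", "male", "male"],
    "Ethnicity": ["Hispanic"] * 6,
    "Child's First Name": ["Isabella", "Jayden", "Name A",
                           "Isabella", "Jayden", "Name B"],
    "Count": [331, 426, 110, 327, 364, 100],
    "Rank": [1, 1, 2, 1, 1, 2],
})

# Boolean indexing keeps only the rank-1 rows and introduces no NaN
rank_1_rows = toy[toy["Rank"] == 1]

# pivot() needs each (index, columns) combination to be unique
rank_1_rows = rank_1_rows.drop_duplicates(
    subset=["Year of Birth", "Ethnicity", "Gender"]
)

wide = rank_1_rows.pivot(
    index="Year of Birth",
    columns=["Gender", "Ethnicity"],
    values="Child's First Name",
)

# sort_index(axis=1) orders the column levels by Gender, then Ethnicity
wide = wide.sort_index(axis=1)
print(wide)
```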

## Name Distribution

Something else that makes this dataset interesting is that each name has its own count value. This can allow for analysis of the distribution of names.

In [14]:
baby_data_distinct["Count"].describe()

count    18053.000000
mean        33.573589
std         38.672649
min         10.000000
25%         13.000000
50%         20.000000
75%         35.000000
max        426.000000
Name: Count, dtype: float64

Because the 75th percentile is a count of only 35, compared with the maximum of 426, most entries in this dataset have a low count. In other words, only a small fraction of the names are really popular.
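
Comparing the mean with the median is another quick check for this kind of long right tail: in the describe() output the mean (about 33.6) sits well above the median (20). A minimal sketch with made-up counts (`counts` is a hypothetical Series chosen only to mimic that shape):

```python
import pandas as pd

# Hypothetical counts with one very large value, mimicking a long right tail
counts = pd.Series([10, 11, 13, 15, 20, 22, 35, 60, 426], name="Count")

quartiles = counts.quantile([0.25, 0.5, 0.75])
mean_count = counts.mean()
median_count = counts.median()

# A mean well above the median is a common sign of right skew
print(quartiles)
print(mean_count, median_count)
```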

Let's look at how many names are above a certain count:

In [15]:
count_values = [0, 50, 100, 150, 200, 250, 300]
distribution = [len(baby_data_distinct[baby_data_distinct["Count"]>=x]) for x in count_values]
distribution_dict = {'Count (popularity)': count_values, 'Names at least': distribution}
distribution_df = pd.DataFrame(data=distribution_dict)
distribution_df

Unnamed: 0,Count (popularity),Names at least
0,0,18053
1,50,2918
2,100,1075
3,150,487
4,200,212
5,250,89
6,300,24


The vast majority of entries in this dataset have a count below 50: only 2,918 of the 18,053 distinct rows (about 16%) have a count of 50 or more.
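
To put the table above in relative terms, each row count can be divided by the total. The sketch below rebuilds the table from the numbers shown in the output and adds a percentage column (`total_rows` and the `Share (%)` column are names introduced here for illustration):

```python
import pandas as pd

# Thresholds and row counts copied from the output table above
distribution_df = pd.DataFrame({
    "Count (popularity)": [0, 50, 100, 150, 200, 250, 300],
    "Names at least": [18053, 2918, 1075, 487, 212, 89, 24],
})

# Every row has a count of at least 0, so the first entry is the total
total_rows = distribution_df["Names at least"].iloc[0]

# Share of rows at or above each threshold, as a percentage
distribution_df["Share (%)"] = (
    distribution_df["Names at least"] / total_rows * 100
).round(1)
print(distribution_df)
```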

## Conclusion

This dataset has a lot of interesting aspects that can be analyzed even further. The variables add several dimensions to the content of the data, and can deliver insights from several different angles. The tasks done in this project are just a few of the ways that this dataset can be manipulated and looked at.