# Calculating the uniqueness and average age of my children
This is just a playful analysis of my kids' names using data from the Social Security Administration (SSA). In particular, I was curious about the fact that before my daughter, I only ever knew of two women named Maya: Angelou and Rudolph. We named our daughter after the former, believing it to be a beautiful and seemingly unique name; Maya Angelou was a great poet and activist who had died shortly before our daughter's birth. (Side note: it also allowed me to tie in my love for astronomy: the "oldest sister" in the Pleiades star cluster is named Maia.) However, when we moved to California a few years ago, we started meeting several girls between the ages of 2 and 12 named Maya. Is the name more popular than we realized? Or is this a case of frequency illusion? As for our son, Henry, we assumed that the average age of Henrys had to be around 80, but again, we seem to be meeting more and more kids named Henry. So what is the average age of Henrys? Let's see what we can find out.

## 1.0 Load Libraries
We will use a few libraries for reading data (io, zipfile, urllib), data manipulation (pandas, numpy), and visualization (matplotlib, plotly).

In [1]:
# Load Modules
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from io import BytesIO
from zipfile import ZipFile
from urllib.request import urlopen
import plotly.plotly as py
import plotly.graph_objs as go
from IPython.display import IFrame

plt.style.use('ggplot')

## 1.1 Load the Data
We will be pulling in data from the [SSA's baby names dataset](https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data). This is a zip file containing data on names, birth counts, and sex broken down into individual files for each year. We will extract the files and append them to a dataframe for easy manipulation.

In [2]:
def build_babynames_dataframe(file_url='https://www.ssa.gov/oact/babynames/names.zip'):
    '''
    This is a function to create a Pandas dataframe from the Social Security Administration's
    zip file containing data back to 1880.
    '''
    resp = urlopen(file_url)
    zipfile = ZipFile(BytesIO(resp.read()))
    
    babynames = []
    
    filenames = [f for f in zipfile.namelist() if f.startswith('yob')]
    
    for each in filenames:
        # open each txt file
        file = zipfile.open(each)
        
        # define column names
        columns = ['Name', 'Sex', 'Count']
        
        # read each csv file into a dataframe
        df = pd.read_csv(file, sep=',', header=0, names=columns)
        
        # extract the year from the filename
        year = int(''.join([d for d in each if d.isdigit()]))
        
        # insert a column for the year
        df.insert(0, 'Year', year)
    
        # append each year's dataframe to the babynames list
        babynames.append(df)

    # convert the babynames list into a dataframe
    babynames = pd.concat(babynames, axis=0)
    
    return babynames

babynames = build_babynames_dataframe()
babynames.sample(10)

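|       | Year | Name    | Sex | Count |
|-------|------|---------|-----|-------|
| 4627  | 1935 | Madell  | F   | 5     |
| 9682  | 2002 | Dhamar  | F   | 10    |
| 3920  | 1971 | Starlet | F   | 16    |
| 1053  | 1885 | Arlena  | F   | 5     |
| 14155 | 2004 | Zharick | F   | 7     |
| 3021  | 1920 | Macil   | F   | 11    |
| 6260  | 1953 | Min     | F   | 5     |
| 8378  | 1939 | Ronell  | M   | 6     |
| 27398 | 2001 | Cashawn | M   | 6     |
| 22416 | 2015 | Syler   | M   | 34    |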


## 2.0 The Popularity of Maya
Let's start by plotting the number of babies named Maya or Henry dating back to 1880.

In [4]:
def get_info(name, sex):
    '''
    This function takes a name and sex and generates a dataframe 
    with that specific information from the babynames dataframe
    '''
    name_data = babynames[(babynames['Name'] == name) & (babynames['Sex'] == sex)]
    return name_data

def plot_trends(names, sex, scale='linear', years=[1880,2017], count=[0,12],
                title=None, cmap='Set2'):
    '''
    This function takes a list of names and a matching list of sexes and
    creates a plot showing the number of births for each name over the
    requested range of years (1880 to 2017 by default)
    '''
    # build one line trace per (name, sex) pair
    data = []
    for name, name_sex in zip(names, sex):
        info = get_info(name, name_sex)
        
        trace = go.Scatter(
            x = info['Year'],
            y = info['Count']/1000,
            mode = 'lines',
            name = name)
        
        data.append(trace)
    
    # the layout and figure only need to be built once, after the loop
    layout = go.Layout(
        title=title,
        xaxis=dict(
            range=years,
            dtick=10,
            title='Year',
            titlefont=dict(
                family='Courier New, monospace',
                size=18,
                color='#7f7f7f'
            )
        ),
        yaxis=dict(
            range=count,
            dtick=2,
            title='Count (thousands)',
            titlefont=dict(
                family='Courier New, monospace',
                size=18,
                color='#7f7f7f'
            )
        )
    )
    
    fig = go.Figure(data=data, layout=layout)
    
    return py.iplot(fig, filename=title)

# Plot trends for Maya and Henry
kids = ['Maya','Henry']
plot_trends(kids, ['F','M'], title='The Popularity of Henry & Maya')

Regarding the name Maya, the first thing that stands out is that it doesn't appear in the SSA data until around 1940. The name is found in lots of cultures, so I'm curious if the introduction of the name into the US was the result of post-World War II migration. Unfortunately, that will remain speculation with this data set, but it may be something to look into in the future. There has definitely been a steady increase in girls named Maya since the mid-1980s, with a peak just before my daughter was born. This peak, however, is nothing compared to the numbers we are getting from the name Henry. It seems like Henry is making a very strong comeback and approaching the peak it reached in the early 20th century. Looks like we are going to have a lot of Henrys over the age of 75 and under the age of 10.

So far it seems like Maya is a relatively unique name. To get a better idea of how unique, let's plot Maya against the top 5 most popular girls' names for 2014. I will pull the birth counts over time for each of the top 5 names and plot them alongside Maya. Interestingly enough, if you plot all of the birth counts of these names over the years, you will see that they were all relatively unpopular until the mid-'80s and '90s, when they all began to climb rapidly. For this reason, I started the data off at 1980 so that we can get a better look.

In [5]:
def get_popular_names(df, sex, year, topn):
    '''
    A function to return the most popular names for a given sex and year.
    '''
    topn_df = df[(df.Sex == sex) & (df.Year == year)]
    # sort by count explicitly instead of relying on the row order of the source files
    topn_array = topn_df.sort_values('Count', ascending=False)['Name'].head(topn).tolist()
    
    return topn_array

# Top 5 female names in 2014
top5_2014 = get_popular_names(df   = babynames,
                              sex  = 'F',
                              year = 2014,
                              topn = 5
                             )

# add Maya to list of names
top5_2014.append('Maya')

sex_list = ['F' for i in range(len(top5_2014))]
plot_trends(top5_2014, sex=sex_list, years=[1980,2017], count=[0,22], 
                title='Maya vs Top 5 Girls Names in 2014')

We can see from the plot that the name Maya was only about 1/5th as popular as the name Olivia. However, that does not give us a very robust idea of the name's popularity. We could really dive in and start calculating the number of Mayas projected to be living between the ages of 2 and 14, and from that determine the percentage of 2- to 14-year-olds named Maya, but let's just look at one last calculation to get a better idea of the popularity of the names in 2014.

In [7]:
# compute total number of births by year
# (select 'Count' so the text columns 'Name' and 'Sex' are not summed)
total_by_year = babynames.groupby(['Year'])[['Count']].sum()

# compute total number of births by year and sex
total_by_sex_and_year = babynames.groupby(['Year','Sex'])[['Count']].sum()

# reset indices for merging
total_by_year = total_by_year.reset_index()
total_by_sex_and_year = total_by_sex_and_year.reset_index()

# merge total by year column
babynames = babynames.merge(total_by_year, on='Year', suffixes=('','_yr'))

# merge total by sex and year columns
babynames = babynames.merge(total_by_sex_and_year, on=['Year','Sex'], suffixes=('','_sex'))

# rename columns
babynames.columns = ['Year', 'Name', 'Sex', 'Count', 'Total_by_Year','Total_by_Sex_and_Year']

# Add columns to dataframe: 'Percent of Total' and 'Percent by Sex'
babynames['Pct_of_Total'] = (babynames['Count'] / babynames['Total_by_Year']) * 100
babynames['Pct_by_Sex'] = (babynames['Count'] / babynames['Total_by_Sex_and_Year']) * 100

# get the ranking of Maya in 2014
rank = babynames[(babynames.Year == 2014) & (babynames.Sex == 'F')].sort_values(by='Count', ascending=False).reset_index( drop=True)
print('Maya was ranked {} in 2014 birth names.'.format(rank[rank.Name == 'Maya'].index[0]+1))

Maya was ranked 73 in 2014 birth names.


In [17]:
# get top 5 female names from 2014
top5_2014_df = rank[(rank.Year == 2014) & (rank.Sex == 'F')].sort_values('Count', ascending=False)[:5]
top5_2014 = top5_2014_df.Name.values.tolist()
top5_2014.append('Maya')

# subset the top 5 female names from 2014 with Maya
top5_df = babynames[(babynames.Name.isin(top5_2014)) & (babynames.Year == 2014) & (babynames.Sex == 'F')].set_index('Name', drop=True)
top5_df[['Pct_of_Total','Pct_by_Sex']].style.background_gradient(cmap='cividis')

| Name     | Pct_of_Total | Pct_by_Sex |
|----------|--------------|------------|
| Olivia   | 0.538474     | 1.12439    |
| Sophia   | 0.506015     | 1.05662    |
| Isabella | 0.464386     | 0.969691   |
| Ava      | 0.426839     | 0.891289   |
| Mia      | 0.367472     | 0.767322   |
| Maya     | 0.10739      | 0.224243   |


Although none of the top five accounted for more than 1.12% of female births in 2014, Maya comes in as the 73rd most popular name and accounts for only 0.22% of females born that year (about one-fifth the share of the most popular name). It can be argued that Maya is a unique name, but so are all of the other names. I will have to assume that my hearing the name Maya more frequently is simply the result of the frequency illusion bias. A better question might be: is it more unique to give your daughter a name that doesn't end in 'a'?
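As a sanity check on the arithmetic, the share and rank calculations can be sketched in plain Python on a toy table. The names and counts below are invented for illustration and are not SSA figures, and `share_and_rank` is a hypothetical helper rather than part of the notebook above.

```python
# Toy (name, sex, count) rows for a single year; counts are made up, not SSA data.
toy_births = [
    ("Olivia", "F", 500),
    ("Sophia", "F", 300),
    ("Maya", "F", 100),
    ("Ava", "F", 100),
    ("Noah", "M", 600),
    ("Liam", "M", 400),
]


def share_and_rank(rows, name, sex):
    """Return (pct_of_total, pct_by_sex, rank_within_sex) for one name."""
    total_all = sum(count for _, _, count in rows)
    same_sex = [(n, c) for n, s, c in rows if s == sex]
    total_sex = sum(c for _, c in same_sex)
    count = dict(same_sex)[name]
    # rank = 1 + number of same-sex names with a strictly larger count
    rank = 1 + sum(1 for _, c in same_sex if c > count)
    return 100 * count / total_all, 100 * count / total_sex, rank


pct_total, pct_sex, rank = share_and_rank(toy_births, "Maya", "F")
print(f"{pct_total:.1f}% of all births, {pct_sex:.1f}% of girls, rank {rank}")
```

One design difference worth noting: this sketch gives tied names the same rank, whereas sorting and reading off the row position (as in the cell above) breaks ties by whatever order the sort happens to produce.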

## 2.1 The Age of Henry
Now to figure out the average age of Henrys. First, we need to pull in data from the [SSA's Actuarial Life Tables](https://www.ssa.gov/oact/STATS/table4c6.html). You can see the head of the table and definitions of features below.

In [19]:
# load the actuarial life tables
life_table_males = pd.read_csv('https://www.ssa.gov/oact/HistEst/PerLifeTables/2018/PerLifeTables_M_Hist_TR2018.csv', skiprows=4)
life_table_males.head(3)

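|   | Year | x | q(x)     | l(x)   | d(x)  | L(x)  | T(x)    | e(x)  | D(x)   | M(x)  | A(x)   | N(x)    | a(x)    | 12a(x) |
|---|------|---|----------|--------|-------|-------|---------|-------|--------|-------|--------|---------|---------|--------|
| 0 | 1900 | 0 | 0.145957 | 100000 | 14596 | 90026 | 4640595 | 46.41 | 100000 | 39810 | 0.3981 | 2289435 | 22.8943 | 269.23 |
| 1 | 1900 | 1 | 0.03814  | 85404  | 3257  | 83776 | 4550569 | 53.28 | 83159  | 25598 | 0.3078 | 2189435 | 26.3283 | 310.44 |
| 2 | 1900 | 2 | 0.019577 | 82147  | 1608  | 81343 | 4466793 | 54.38 | 77884  | 22510 | 0.289  | 2106276 | 27.0436 | 319.02 |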


### Defining Actuarial Table Features

| Column        | Description                                                         |
|---------------|---------------------------------------------------------------------|
| x             | The age of the person.                                              |
| q<sub>x</sub> | The probability that a person exact age x will die within one year. |
| l<sub>x</sub> | The number of persons surviving to exact age x (out of 100,000).    |
| d<sub>x</sub> | The number of deaths between exact ages x and x+1.                  |
| L<sub>x</sub> | The number of person-years lived between exact ages x and x+1.      |
| T<sub>x</sub> | The number of person-years lived after exact age x.                 |
| e<sub>x</sub> | The average number of years of life remaining at exact age x.       |

The rest of the table contains actuarial calculations that you can get more information about in the SSA's [Definitions of Life Tables Functions](https://www.ssa.gov/oact/HistEst/PerLifeTables/LifeTableDefinitions.pdf) document, but we won't be using those for now.
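To make `l(x)` concrete: the table starts from 100,000 births at age 0, so `l(x) / 100000` can be read as the fraction of a birth cohort still alive at age x. Here is a minimal sketch of that scaling with invented numbers; the name `estimate_living` is hypothetical and only used for illustration.

```python
RADIX = 100_000  # the life table starts from a cohort of 100,000 births


def estimate_living(births, l_x):
    """Scale a birth count by the fraction of the cohort surviving to age x."""
    return births * l_x / RADIX


# Invented example: 8,000 births and l(x) = 75,000 survivors per 100,000.
print(estimate_living(8000, 75000))  # 6000.0
```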

Next, I am going to merge the life table data with Henry's dataframe to calculate the estimated number of living Henrys for each birth year. Below is a plot of the estimated number of living Henrys by age.

In [21]:
# get dataframe of Henry
henry = get_info('Henry', 'M')

# extract the rows where birth year + age equals 2017, i.e. the Henrys who could be alive in 2017
life_table_males_2015 = life_table_males.loc[life_table_males['Year'] + life_table_males['x'] == 2017]

# merge with life tables
living_henrys = henry.merge(life_table_males_2015, on='Year')

# compute estimated number of living Henrys
living_henrys['n_alive'] = living_henrys['l(x)'] * living_henrys['Count'] / (10**5)

# plot the estimated number of living Henrys by age
trace = go.Scatter(
            x = living_henrys['x'],
            y = living_henrys['n_alive'],
            mode = 'lines'
            )
        
layout = go.Layout(
            title='Estimated Number of Living Henrys by Age',
            xaxis=dict(
                dtick=10,
                title='Age',
                titlefont=dict(
                    family='Courier New, monospace',
                    size=18,
                    color='#7f7f7f'
                )
            ),
            yaxis=dict(
                title='No. of Living Henrys',
                titlefont=dict(
                    family='Courier New, monospace',
                    size=18,
                    color='#7f7f7f'
                    )
                )
            )
        
fig = go.Figure(data=[trace], layout=layout)
py.iplot(fig, filename='number-of-living-henrys')

We can see that although there is a large grouping of Henrys aged 50-70, they are dwarfed by the number of Henrys that have been born in the past 10 years.  This should have a noticeable effect by dragging the average under the age of 50.  Let's calculate the average age and see.

In [22]:
# Create a column of the product of age and number alive
living_henrys['rel_age'] = living_henrys.x * living_henrys.n_alive

# From weighted age, calculate the average age
avg_age = living_henrys.rel_age.sum() / living_henrys.n_alive.sum()

print('The average age of Henry is %.1f' % avg_age)

The average age of Henry is 37.1
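The calculation above is a weighted mean: each age is weighted by the estimated number of Henrys alive at that age. The sketch below redoes it in plain Python on an invented two-humped population (not the real Henry numbers) and, as an extra that the notebook does not compute, adds a weighted median for comparison. Both function names are hypothetical.

```python
def weighted_mean_age(ages, weights):
    """Average age, weighting each age by the estimated number alive."""
    return sum(a * w for a, w in zip(ages, weights)) / sum(weights)


def weighted_median_age(ages, weights):
    """Smallest age at which at least half of the total weight is reached."""
    half = sum(weights) / 2
    running = 0
    for age, weight in sorted(zip(ages, weights)):
        running += weight
        if running >= half:
            return age


# Invented two-humped population: many toddlers, fewer retirees.
ages = [2, 5, 40, 65, 70]
n_alive = [500, 400, 100, 300, 200]

print(weighted_mean_age(ages, n_alive))    # 27.0
print(weighted_median_age(ages, n_alive))  # 5
```

In this toy case the mean lands at an age where almost nobody sits, while the median falls inside the younger hump, which is the kind of gap a two-humped age distribution can produce.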


I guess Henrys could say they are getting younger every day (statistically speaking, of course). This is an important reminder that an average is not always a good statistic. After all, we have over twice as many 62-year-old Henrys and five times as many 2-year-old Henrys as we have Henrys at the average age of 37. We can get a better sense of how death plays a role in this average by comparing the births by year to the estimated number of living Henrys.

In [23]:
# plot the number of Henrys born compared to estimated living
trace0 = go.Scatter(
            x = living_henrys['Year'],
            y = living_henrys['n_alive'],
            mode = 'lines',
            name = 'No. Living',
            fill='tonexty'
            )

trace1 = go.Scatter(
            x = living_henrys['Year'],
            y = living_henrys['Count'],
            mode = 'lines',
            name = 'No. of Births'
            )
        
layout = go.Layout(
            title='Estimated Number of Living Henrys Compared to Number of Births',
            xaxis=dict(
                dtick=10,
                title='Year',
                titlefont=dict(
                    family='Courier New, monospace',
                    size=18,
                    color='#7f7f7f'
                )
            ),
            yaxis=dict(
                title='No. of Henrys',
                titlefont=dict(
                    family='Courier New, monospace',
                    size=18,
                    color='#7f7f7f'
                    )
                )
            )
        
fig = go.Figure(data=[trace0, trace1], layout=layout)
py.iplot(fig, filename='birth_count_vs_living')

## 3.0 Conclusion
It appears that I was wrong on both counts: there is not an inordinate number of Mayas being born, and the average age of Henry is not even close to 80. The name Maya only accounted for 0.22% of females born in 2014. However, the name has undeniably gained popularity in the last 30 years. The average age of Henry is only 37, but the guess of 80 was not completely off-base. The peak year for Henrys was 1921, which means there are plenty of Henrys born 80 or more years ago.

## 4.0 Bonus Material
I fell into a bit of a rabbit hole while looking at the babynames and actuarial data, so here are some additional insights that I found. First, let's look at the peak years for Henry and Maya.

In [28]:
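# fetch data for Maya (the henry dataframe was already created in section 2.1)
maya = get_info('Maya', 'F')

# plot max Henrys and max Mayas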
trace0 = go.Scatter(
            x = maya['Year'],
            y = maya['Count'],
            mode = 'lines',
            name = 'Maya'
            )

trace1 = go.Scatter(
            x = henry['Year'],
            y = henry['Count'],
            mode = 'lines',
            name = 'Henry'
            )
        
layout = go.Layout(
            title='Comparing Henry to Maya in Birth Count',
            xaxis=dict(
                dtick=10,
                title='Year',
                titlefont=dict(
                    family='Courier New, monospace',
                    size=18,
                    color='#7f7f7f'
                )
            ),
            yaxis=dict(
                title='No. of Births',
                titlefont=dict(
                          family='Courier New, monospace',
                          size=18,
                          color='#7f7f7f'
                          )
                      ),
            
            annotations=[
                 dict(
                      x=henry[henry.Count == henry.Count.max()].Year.values[0],
                      y=henry.Count.max(),
                      xref='x',
                      yref='y',
                      text='Max Henry Births',
                      showarrow=True,
                      arrowhead=7,
                      ax=20,
                      ay=-40
                     ),
                 dict(
                      x=maya[maya.Count == maya.Count.max()].Year.values[0],
                      y=maya.Count.max(),
                      xref='x',
                      yref='y',
                      text='Max Maya Births',
                      showarrow=True,
                      arrowhead=7,
                      ax=-20,
                      ay=-40
                     )
             ]
            )
        
fig = go.Figure(data=[trace0, trace1], layout=layout)
py.iplot(fig, filename='max_maya_and_henry')

It will be interesting to see if Henry can reach its 1920s peak or if it will continue to level off and drop. It looked like Maya was going to climb above Henry in popularity, but the name has been dropping off since its peak in 2006.

Next, let's add data for my wife (Raina) and me.

In [29]:
# fetch data for family names
maya = get_info('Maya', 'F')
raina = get_info('Raina', 'F')
chris = get_info('Christopher', 'M')

# plot birth counts for all family names
trace0 = go.Scatter(
            x = maya['Year'],
            y = maya['Count'],
            mode = 'lines',
            name = 'Maya'
            )

trace1 = go.Scatter(
            x = henry['Year'],
            y = henry['Count'],
            mode = 'lines',
            name = 'Henry'
            )
trace2 = go.Scatter(
            x = raina['Year'],
            y = raina['Count'],
            mode = 'lines',
            name = 'Raina'
            )

trace3 = go.Scatter(
            x = chris['Year'],
            y = chris['Count'],
            mode = 'lines',
            name = 'Chris'
            )
        
layout = go.Layout(
            title='Comparing Our Family\'s Names by Birth Count',
            xaxis=dict(
                dtick=10,
                title='Year',
                titlefont=dict(
                    family='Courier New, monospace',
                    size=18,
                    color='#7f7f7f'
                )
            ),
            yaxis=dict(
                title='No. of Births',
                titlefont=dict(
                          family='Courier New, monospace',
                          size=18,
                          color='#7f7f7f'
                          )
                      )
            )
  
data = [trace0, trace1, trace2, trace3]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='family_name_trends')

Wow. I often joke about there being so many Chrises that we need to number off, but look at Mt. Chris up there! It towers over the other names so much that we lose all visual information about Raina and most of it for Maya. Let's compare the names on a logarithmic scale to get a better sense of how their popularity has changed over time.

In [30]:
# plot family name birth counts on logarithmic scale
trace0 = go.Scatter(
            x = maya['Year'],
            y = np.log(1 + maya['Count']),
            mode = 'lines',
            name = 'Maya'
            )

trace1 = go.Scatter(
            x = henry['Year'],
            y = np.log(1 + henry['Count']),
            mode = 'lines',
            name = 'Henry'
            )
trace2 = go.Scatter(
            x = raina['Year'],
            y = np.log(1 + raina['Count']),
            mode = 'lines',
            name = 'Raina'
            )

trace3 = go.Scatter(
            x = chris['Year'],
            y = np.log(1 + chris['Count']),
            mode = 'lines',
            name = 'Chris'
            )
        
layout = go.Layout(
            title='Comparing Our Family\'s Names by Birth Count (Log)',
            xaxis=dict(
                dtick=10,
                title='Year',
                titlefont=dict(
                    family='Courier New, monospace',
                    size=18,
                    color='#7f7f7f'
                )
            ),
            yaxis=dict(
                title='log(1 + No. of Births)',
                titlefont=dict(
                          family='Courier New, monospace',
                          size=18,
                          color='#7f7f7f'
                          )
                      )
            )
  
data = [trace0, trace1, trace2, trace3]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig, filename='family_name_trends_log')

After converting to a logarithmic scale we can better see the change in name popularity over time. Although Raina remains a relatively obscure name, Maya appears much closer to Henry and Chris. It is fascinating to me that both Raina ***and*** Maya don't seem to appear in the US data until World War II. Regardless, one thing is clear: my wife takes the crown for the most unique name in our house.
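For anyone curious why the log transform pulls the lines together, here is a small sketch using only the standard library. The counts are invented, chosen so that `1 + count` is a power of ten; `math.log1p(x)` computes `log(1 + x)`, the same transform as `np.log(1 + x)` in the cell above.

```python
import math

# Invented birth counts spanning several orders of magnitude.
toy_counts = {"rare name": 9, "common name": 9_999, "very common name": 99_999}

for name, count in toy_counts.items():
    # math.log1p(x) computes log(1 + x); adding 1 keeps a count of 0 finite
    print(f"{name}: count={count}, log1p={math.log1p(count):.2f}")
```

Each tenfold jump in `1 + count` adds the same fixed amount (ln 10, about 2.30) on the log axis, so a name that is ten thousand times more common ends up only a few steps higher instead of flattening everything else against the x-axis.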

While looking at the data, I noticed some babies named Unknown showed up. Just out of curiosity, let's see if there are any trends in the number of people who were too tired to give their baby a name. Anybody who has gone through labor knows what I'm talking about. I was exhausted and all I had to do was support the real work that my wife was doing. Honestly, I'm surprised we don't see any "Who Cares" on the list.

In [31]:
# fetch data on babies named Unknown
unknown = babynames[babynames.Name == 'Unknown']

# plot birth count of babies named Unknown
trace = go.Scatter(
            x = unknown['Year'],
            y = unknown['Count'],
            mode = 'lines'
            )
        
layout = go.Layout(
            title='Too Tired To Worry About Names',
            xaxis=dict(
                dtick=10,
                title='Year',
                titlefont=dict(
                    family='Courier New, monospace',
                    size=18,
                    color='#7f7f7f'
                )
            ),
            yaxis=dict(
                title='No. of Births',
                titlefont=dict(
                    family='Courier New, monospace',
                    size=18,
                    color='#7f7f7f'
                    )
                )
            )
        
fig = go.Figure(data=[trace], layout=layout)
py.iplot(fig, filename='number-of-known-unknowns')

It looks like there was a bit of an issue with naming babies in the 1950s. Did parents get tired of having to name all of their baby boomer kids? If anyone knows an Unknown, let me know. I really want to know how things turned out.