# Baby Names

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

from ssas import ssa_data
from plots import history_plot
from generate_names import generate_names

## Fetch baby name data from US SSA

The first call to `ssa_data()` downloads the baby name data from the ssa.gov website and saves it to a directory called "data" in the baby_names repo. Later calls load the data from that directory instead of downloading it again. The `ssa_data` object holds the baby name information as pandas dataframes. Let's go ahead and download/load that data.

In [None]:
name_data = ssa_data()
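The download-once, load-afterwards behaviour described above can be sketched in a few lines. This is only an illustration of the caching pattern, not the actual `ssa_data` implementation: `load_or_fetch` and `fake_download` are hypothetical names, and the file contents are made up.

```python
from pathlib import Path
import tempfile


def load_or_fetch(cache_file: Path, fetch) -> str:
    """Return the cached text if it exists; otherwise fetch it and cache it."""
    if cache_file.exists():
        return cache_file.read_text()
    text = fetch()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(text)
    return text


calls = []


def fake_download() -> str:
    # Hypothetical stand-in for the real download; records every call.
    calls.append(1)
    return "NameA,F,100\nNameB,F,50\n"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "data" / "names.txt"
    first = load_or_fetch(cache, fake_download)   # fetches and writes the cache
    second = load_or_fetch(cache, fake_download)  # reads the cache, no fetch
```

The second call never touches `fake_download`, which is the same reason repeated executions of `ssa_data()` do not need to contact ssa.gov again.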

## Examining the data

Now that we have the data in the `ssa_data` object, we can access either the national or the state-level data. Let's look at the national data for years after 2000.

In [None]:
df = name_data.national
df[df['year']>'2000']
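The filter above is ordinary pandas boolean indexing. Here is a small sketch on a toy dataframe to show what happens step by step. The rows are made up, and storing `year` as strings is an inference from the string comparison in the cell above, not something confirmed here.

```python
import pandas as pd

# Toy stand-in for name_data.national; all values are made up.
toy = pd.DataFrame({
    'name': ['Emma', 'Emma', 'Olivia'],
    'year': ['1999', '2001', '2005'],  # assumption: years stored as strings
    'n': [10, 20, 30],
})

mask = toy['year'] > '2000'  # element-wise comparison gives a boolean Series
after_2000 = toy[mask]       # keep only the rows where the mask is True
```

Comparing four-digit year strings works because they sort lexicographically in the same order as the numbers they represent.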

As another example, using the state-level data, we can look at everyone named John in Wyoming (you'll need to know your state abbreviations).

In [None]:
df = name_data.state
df[(df['name']=='John') & (df['state']=='WY')]

## Plots

Now we can generate interesting plots with the data. Let's look at the trends in the top 10 female names since the year 2000. We will use the `history_plot()` function to make nice-looking plots. You need to give the function an ax to plot on, the names you are interested in plotting, the genders of those names, the dataframe to plot from, and which column of the dataframe to plot (keyword argument: `plot_type`, defaults to frequency). We can also choose to highlight certain names in rank plots if the plot seems too busy (uncomment the `highlight` keyword).

In [None]:
top10_F = name_data.national.loc[(name_data.national['rank']<=10) &
                                 (name_data.national['gender']=='F') &
                                 (name_data.national['year']>='2000')]
names = pd.unique(top10_F['name'])
genders = ['F']*len(names)

fig, ax = plt.subplots()
history_plot(ax, names, genders, top10_F, plot_type='rank', legend=True,
#              highlight=['Emma', 'Mia'],
             log_scale=False, ms=10, lw=2)
ax.set_title('Top 10 Female Names since 2000')
fig.savefig('Female_top10_2000.png', format='png', transparent=False,
            facecolor=fig.get_facecolor(), edgecolor='none', bbox_inches='tight')

Another plot shows the name fraction for each name as a function of time. A lower number means the name is less popular.

In [None]:
names = ['Bob', 'Tom', 'Dick', 'Matt']
genders = ['M']*len(names)

fig, ax = plt.subplots()
history_plot(ax, names, genders, name_data.national, plot_type='f')
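To make the name fraction concrete, here is a sketch of one way such a column could be computed. It assumes the fraction is a name's count divided by the total count for the same year and gender, and that the columns are called `n` and `f` to match the `plot_type` values; neither assumption is confirmed by the code shown here, and the numbers are made up.

```python
import pandas as pd

# Toy data; all counts are made up.
toy = pd.DataFrame({
    'name':   ['Bob', 'Tom', 'Bob', 'Tom'],
    'gender': ['M', 'M', 'M', 'M'],
    'year':   ['2000', '2000', '2001', '2001'],
    'n':      [30, 70, 10, 90],
})

# Total count per (year, gender), aligned row by row with the original frame.
totals = toy.groupby(['year', 'gender'])['n'].transform('sum')

# Assumed definition: fraction = count / total for that year and gender.
toy['f'] = toy['n'] / totals
```

Under this definition the fractions within each year and gender sum to 1, so names can be compared across years even when the total number of births changes.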

It is also possible to plot the total occurrences of each name by using `plot_type='n'`.

In [None]:
names = ['Bob', 'Tom', 'Dick', 'Matt']
genders = ['M']*len(names)

fig, ax = plt.subplots()
history_plot(ax, names, genders, name_data.national, plot_type='n')

## Generate list of random names

Trying to decide on a baby name but don't know where to start? Why not start by drawing from the database at random? Use the `generate_names()` function. Let's randomly draw 50 female names.

In [None]:
random_names = generate_names(name_data.national, n=50, pout=True, gender='F')

We can slice the name database to narrow the selection down to names that satisfy our criteria. When `n=None` (the default), the generator returns all results that match. We can slice by gender, first letter of the name, rank, and year.

Note that when considering data over multiple years (`year_start != year_end`), slicing by rank returns a name if it satisfied the rank conditions in any single year. Therefore, if you are looking for a name that isn't too popular, this is not guaranteed to filter out popular names: a name that fits your criteria in 1950 but is too popular in 2015 will still be returned, because the 1950 result is valid even though the 2015 one isn't. If you want to reject a name that fails your criteria in any single year, use `strict_rank_criteria=True` (default: `False`).

In [None]:
desirable_names = generate_names(name_data.national, pout=True, gender='F',
                                 first_letter=['M', 'O', 'T'],
                                 rank_lower_bound=500, rank_upper_bound=100,
                                 year_start=1990, year_end=2000,
                                 strict_rank_criteria=True)
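The difference between the default and the strict rank criteria comes down to "any year" versus "every year". The sketch below illustrates that with plain Python. It is not the `generate_names()` implementation: the names and ranks are made up, `in_range` is a hypothetical helper, and the bounds are assumed to be inclusive, with 100 as the most popular rank allowed and 500 as the least popular.

```python
# Hypothetical ranks per year; all values are made up.
ranks = {
    'Maeve':  {1990: 450, 1995: 300, 2000: 150},
    'Olivia': {1990: 200, 1995: 60, 2000: 20},
    'Tessa':  {1990: 700, 1995: 480, 2000: 520},
}


def in_range(rank, most_popular=100, least_popular=500):
    # Assumption: both bounds are inclusive.
    return most_popular <= rank <= least_popular


# Default behaviour: keep a name if ANY year satisfies the rank range.
lenient = [name for name, by_year in ranks.items()
           if any(in_range(r) for r in by_year.values())]

# Strict behaviour: keep a name only if EVERY year satisfies the rank range.
strict = [name for name, by_year in ranks.items()
          if all(in_range(r) for r in by_year.values())]
```

Here `lenient` keeps all three names because each one falls inside the range in at least one year, while `strict` keeps only 'Maeve', the one name that stays inside the range in every year.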