# Wrangle Firstnames

The purpose of this script is to wrangle data on firstnames.

### Datasource #1
Downloaded from the Social Security Administration (SSA) baby names files, [webpage linked here](https://www.ssa.gov/oact/babynames/limits.html).

### Datasource #2
Captured from a published piece of literature that gives approximate race breakdowns of 4,250 first names based on mortgage datasets.
See [link here](https://www.nature.com/articles/sdata201825) for information on the article/study.
See [link here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TYJKEZ) for information on where the data was downloaded from.



----------------------

<p>Author: PJ Gibson</p>
<p>Date: 2023-01-10</p>
<p>Updated Date: 2023-07-26</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Other Contact: pjgibson25@gmail.com</p>

## 0. Import libraries, functions, settings

In [None]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import re
from pandas.errors import SettingWithCopyWarning
import warnings
from tqdm import tqdm

import requests
import zipfile
import shutil

warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

In [None]:
state_abbreviation = 'WA'
state_name = 'Washington'
state_fips = '53'

In [None]:
state_abbreviations = {
    'Alabama': 'AL',
    'Alaska': 'AK',
    'Arizona': 'AZ',
    'Arkansas': 'AR',
    'California': 'CA',
    'Colorado': 'CO',
    'Connecticut': 'CT',
    'Delaware': 'DE',
    'Florida': 'FL',
    'Georgia': 'GA',
    'Hawaii': 'HI',
    'Idaho': 'ID',
    'Illinois': 'IL',
    'Indiana': 'IN',
    'Iowa': 'IA',
    'Kansas': 'KS',
    'Kentucky': 'KY',
    'Louisiana': 'LA',
    'Maine': 'ME',
    'Maryland': 'MD',
    'Massachusetts': 'MA',
    'Michigan': 'MI',
    'Minnesota': 'MN',
    'Mississippi': 'MS',
    'Missouri': 'MO',
    'Montana': 'MT',
    'Nebraska': 'NE',
    'Nevada': 'NV',
    'New Hampshire': 'NH',
    'New Jersey': 'NJ',
    'New Mexico': 'NM',
    'New York': 'NY',
    'North Carolina': 'NC',
    'North Dakota': 'ND',
    'Ohio': 'OH',
    'Oklahoma': 'OK',
    'Oregon': 'OR',
    'Pennsylvania': 'PA',
    'Rhode Island': 'RI',
    'South Carolina': 'SC',
    'South Dakota': 'SD',
    'Tennessee': 'TN',
    'Texas': 'TX',
    'Utah': 'UT',
    'Vermont': 'VT',
    'Virginia': 'VA',
    'Washington': 'WA',
    'West Virginia': 'WV',
    'Wisconsin': 'WI',
    'Wyoming': 'WY',
    'District of Columbia': 'DC',
    'Puerto Rico': 'PR',
    'Guam': 'GU',
    'American Samoa': 'AS',
    'U.S. Virgin Islands': 'VI',
    'Northern Mariana Islands': 'MP'
}

## 1. API pulls

In [None]:
# Define the URLs of the zip files
urls = [
    "https://www.ssa.gov/oact/babynames/names.zip",
    "https://www.ssa.gov/oact/babynames/state/namesbystate.zip",
    "https://www.ssa.gov/oact/babynames/territory/namesbyterritory.zip",
]

# For each file download...
for url in urls:

    # Get the file name by splitting the url and picking up the last string
    file_name = url.split("/")[-1]
    
    # Create a directory name based on the file name
    directory_name = file_name.replace('.zip', '')

    # Send a GET request to the URL and stop early if the download failed
    r = requests.get(url, stream=True)
    r.raise_for_status()

    # Save the response content as a zip file
    with open(file_name, 'wb') as fd:
        for chunk in r.iter_content(chunk_size=8192):
            fd.write(chunk)

    # Extract the zip file
    with zipfile.ZipFile(file_name, 'r') as zip_ref:
        # Create a new directory with the same name as the zip file (without the extension)
        os.makedirs(f'../../SupportingDocs/Names/01_Raw/{directory_name}', exist_ok=True)
        # Extract all the contents of the zip file into that directory
        zip_ref.extractall(f'../../SupportingDocs/Names/01_Raw/{directory_name}')
    
    # Delete the zip file
    os.remove(file_name)

## 2. Read Data

### 2.1 Vital Stats 

In [None]:
df_vital_stats = pd.read_csv('../../SupportingDocs/Births/03_Complete/VitalStats_byYear_byState.csv').query(f'State == "{state_name}"')[['Year','Births']]

### 2.2 National name data by year

In [None]:
output = []

for curyear in range(1880,2022):

    df = pd.read_csv(f'../../SupportingDocs/Names/01_Raw/names/yob{curyear}.txt', header=None)
    df.columns = ['Name','Sex','Count']
    df['Year'] = curyear
    df['Name'] = df['Name'].str.upper()
    df['id_1'] = df['Sex'] + ' | ' + df['Name'] + ' | ' + df['Year'].astype(str)
    df['id_2'] = df['Sex'] + ' | ' + df['Name']
    output.append(df)
    
df_national = pd.concat(output)

### 2.3 State name data by year

In [None]:
column_names_state = ['State','Sex','Year','Name','Count']
df_state_names = pd.read_csv(f'../../SupportingDocs/Names/01_Raw/namesbystate/STATE.{state_abbreviation}.TXT', names=column_names_state)

df_state_names['Name'] = df_state_names['Name'].str.upper()
df_state_names['id_1'] = df_state_names['Sex'] + ' | ' + df_state_names['Name'] + ' | ' + df_state_names['Year'].astype(str)
df_state_names['id_2'] = df_state_names['Sex'] + ' | ' + df_state_names['Name']

## 3. Data Wrangling

We have a count for each name in each state for each year from the SSA data alone, which is great.
Unfortunately, the SSA masks any name with fewer than 5 occurrences.
We want to include these rarer names, which in my experience often come from under-represented populations.
At the very least, their names are under-represented...

Here are some of the knowns:
* We have the number of estimated births for a given state each year (Vital Stats)
* We have counts of newborn names for a given state each year, *excluding* names that occur fewer than 5 times in that state & year.
* We have counts of newborn names at the national level for each year, *excluding* names that occur fewer than 5 times nationally in a given year.

So when we perform the math:

$$ NumBirths_{year} - \sum{StateNameCounts_{year}} \approx NumMaskedNames $$
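
As a rough illustration of this subtraction, the sketch below uses made-up numbers. The frames `toy_births` and `toy_state_names` are hypothetical stand-ins, not the real Vital Stats or SSA files.

```python
import pandas as pd

# Hypothetical toy inputs (not real SSA or Vital Stats values)
toy_births = pd.DataFrame({'Year': [1990, 1991], 'Births': [100, 120]})
toy_state_names = pd.DataFrame({
    'Year':  [1990, 1990, 1991],
    'Name':  ['ANNA', 'JOHN', 'ANNA'],
    'Count': [40, 50, 90],
})

# Sum the published (unmasked) name counts per year
named = toy_state_names.groupby('Year', as_index=False)['Count'].sum()

# Births minus published name counts approximates the masked births
toy_masked = toy_births.merge(named, on='Year', how='left')
toy_masked['NumMasked'] = toy_masked['Births'] - toy_masked['Count']
print(toy_masked)
```

In this toy case 10 births in 1990 and 30 births in 1991 have no published name.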

We want to generate data for the "Masked Names".
To do this, we'll pull from the national data.
See example below:

> Consider the year 1990, and suppose there were 3 males in Washington state with the name "Xenon".  The SSA file `namesbystate/STATE.WA.TXT` would exclude that name (for the year 1990) because there are fewer than 5 occurrences.  Suppose, however, that on a national scale there are 30 other males named "Xenon" in other states.  Hence there are a total of 30 + 3 = 33 males named "Xenon" in the United States.  Because 33 occurrences exceeds the masking threshold, the name "Xenon" would appear in the SSA file `names/yob1990.txt`.  Thus, we can use national data to infer small counts that are missing from the state-level data.

We'll use this sort of logic to attempt to fill in the number of masked names.
Below is an example of a generated "Masked Name" record:

| State | Sex | Year | Name | Count |
| --- | --- | --- | --- | --- |
| WA | F | 1990 | Yolanda | 4 |


### Pool #1

Any name/sex combination that is not found in our state file but is found in the national file for the same year belongs to **Pool #1**, the pool we draw from when generating "Masked Names".
The count of any single "Masked Name" for a given state in a given year must fall between 1 and 4 (below the masking threshold of 5).
We draw from **Pool #1** using a custom function.
This function draws each record with probability proportional to its national count relative to every other record in the pool.
However, a record can be chosen a maximum of 4 times.
If all records have been chosen 4 times and there are still records to mask...(see below)
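
A minimal sketch of the Pool #1 idea on toy data follows. The frames `toy_national` and `toy_state` are hypothetical. The draw uses NumPy's `Generator.choice` with `replace=False`, which is assumed here to be a reasonable stand-in for the loop-based draw in the function defined in section 3.1, not a copy of it.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for a single year
toy_national = pd.DataFrame({
    'id_1':  ['M | XENON | 1990', 'F | YOLANDA | 1990', 'M | JOHN | 1990'],
    'Count': [33, 60, 5000],
})
toy_state = pd.DataFrame({'id_1': ['M | JOHN | 1990'], 'Count': [80]})

# Anti-join: national rows whose id is absent from the state file
toy_pool1 = toy_national[~toy_national['id_1'].isin(toy_state['id_1'])]

# Give each pooled name 4 "slots", weighted by its national count
slots = np.repeat(toy_pool1['id_1'].to_numpy(), 4)
weights = np.repeat(toy_pool1['Count'].to_numpy(dtype=float), 4)
weights /= weights.sum()

# Draw slots without replacement, so no name can exceed 4 draws
rng = np.random.default_rng(0)
n_masked = min(6, len(slots))
drawn = rng.choice(slots, size=n_masked, replace=False, p=weights)
toy_generated = pd.Series(drawn).value_counts()
print(toy_generated)
```

Because each name contributes exactly 4 slots and slots are never reused, the cap of 4 holds by construction.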


### Pool #2

So we've used all of the reserves in Pool #1 four times and still haven't generated the proper number of masked records.
Now, we use national name data from all years, keeping only name/sex combinations not yet seen for the given year.
This group of exceedingly rare records comprises **Pool #2**.
Using similar logic, we'll draw from this pool and assign a count of masked records to each name within it.

If we still haven't reached $NumMaskedNames$ by the end of this section, we stop generating masked records, knowing we have done what we can to account for less common names.
The script will run perfectly fine even if some masked records remain ungenerated.
The primary purpose is to capture rarer names and so better represent underrepresented individuals.
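
One possible reading of the Pool #2 construction is sketched below on toy data. The frame `toy_national_all` and the sets `seen_in_state` and `seen_in_pool1` are hypothetical. Summing counts across years is an assumption made here so that each name/sex pair appears once in the pool.

```python
import pandas as pd

# Hypothetical national rows across several years
toy_national_all = pd.DataFrame({
    'Year':  [1988, 1989, 1990, 1990],
    'id_2':  ['F | ZELPHA', 'F | ZELPHA', 'M | XENON', 'M | JOHN'],
    'Count': [6, 7, 33, 5000],
})
seen_in_state = {'M | JOHN'}    # ids already published for the state and year
seen_in_pool1 = {'M | XENON'}   # ids already generated from Pool #1

# Keep only ids unseen so far, then collapse years to one row per sex/name
already_seen = seen_in_state | seen_in_pool1
toy_pool2 = (toy_national_all[~toy_national_all['id_2'].isin(already_seen)]
             .groupby('id_2', as_index=False)['Count'].sum())
print(toy_pool2)
```

The summed count then acts as the sampling weight, so a name that was rare in every year still gets a small but non-zero chance of being drawn.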

---

Side note: there's a chance that the names in Pool #1 or Pool #2 skew toward males or females.
Since we use raw counts to determine the likelihood of a name being added as a "masked record", we could inadvertently create more "masked records" for one sex than the other.
While more coding logic could be put in place to combat this, it is more of a problem with the SSA data collection.
And if you still have a quarrel with it, [click here](https://www.youtube.com/watch?v=Vim4ZKuNm6k).

### 3.1 Define function for generating names

This function serves as the way that we pull names from pool #1 and pool #2.  

In [None]:
def generate_names(df, n, id_column, year, state):

    # In the instance where the number of desired records for masking is > 4 times the length of df...
    if (n > len(df)*4):
        
        # Our second output will return false
        output2 = False

        # We redefine n as to not break the function
        n = len(df)*4        
    
    else:
        output2 = True

    # Prepare data
    ids = df[id_column].values
    probabilities = df['Count'].values

    # Normalize probabilities
    probabilities = probabilities / probabilities.sum()

    # Create the array of all possible ids with a count of 4
    all_ids = np.repeat(ids, 4)

    # Create the array of probabilities, boosted by the number of times each name is still available
    probabilities = np.repeat(probabilities, 4)

    result = []
    for _ in range(n):
        # Choose a name
        chosen_index = np.random.choice(len(all_ids), p=probabilities/probabilities.sum())
        chosen_name = all_ids[chosen_index]

        # Append to result
        result.append(chosen_name)

        # Remove this name from the selection process
        all_ids = np.delete(all_ids, chosen_index)
        probabilities = np.delete(probabilities, chosen_index)


    # Formatting output1: one row per chosen id with the number of times it was drawn
    pd_result = pd.Series(result).value_counts().rename_axis(id_column).reset_index(name='Count')
    pd_result['Year'] = year
    pd_result['State'] = state

    # Split the id back into Sex and Name. id_1 also carries the year, but we keep the integer `year` set above.
    id_parts = pd_result[id_column].str.split(' | ', regex=False, expand=True)
    pd_result['Sex'] = id_parts[0]
    pd_result['Name'] = id_parts[1]

    # Rebuild both id columns so the output looks the same whether id_1 or id_2 was used
    pd_result['id_1'] = pd_result['Sex'] + ' | ' + pd_result['Name'] + ' | ' + pd_result['Year'].astype(str)
    pd_result['id_2'] = pd_result['Sex'] + ' | ' + pd_result['Name']
    
    # Format output
    output1 = pd_result[['State', 'Sex', 'Year', 'Name', 'Count', 'id_1', 'id_2']]

    return output1, output2

### 3.2 Generate masked names

Here we generate the masked names and save them to the folder:
'{root}/SupportingDocs/Names/02_Wrangled/StateFirstNames'

In [None]:
# Make directory that we'll drop files into
dir_wrangled = '../../SupportingDocs/Names/02_Wrangled/StateFirstNames'
os.makedirs(dir_wrangled, exist_ok=True)


# Define bounds of our for-loop.  Only calculating for years within our vital stats known birth years
year_start = df_vital_stats['Year'].min()
year_end = df_vital_stats['Year'].max()

# Loop through years of vital stats, checking the expected births vs. what we have name data for
# Note: np.arange excludes year_end itself, so the final vital-stats year is not processed
for cur_year in tqdm(np.arange(year_start,year_end)):
    
    # Using vital stats, find year and number of births
    count_births = df_vital_stats.query(f'Year == {cur_year}')['Births'].iloc[0]

    # Find number of births accounted for in state SSA naming data
    df_state = df_state_names.query(f'Year == {cur_year}')
    sum_state_names = df_state['Count'].sum()

    # Filter down the national set to the current year first...
    df_sub_national = df_national.query(f'Year == {cur_year}') 

    # Calculate number of masked names as described in documentation above
    num_masked_names = count_births - sum_state_names

    # If there's more than 0 masked names (there always should be...)
    if num_masked_names > 0:

        # Generate pool #1
        pool1 = df_sub_national[~df_sub_national['id_1'].isin(df_state['id_1'])]

        # Using our custom function, generate "Masked Names"
        generated_masked_records, bool_ok = generate_names(pool1, num_masked_names, 'id_1', cur_year, state_abbreviation)

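        # Masked births still unaccounted for after Pool #1 (sum of draws, not the number of distinct names)
        new_num_masked_names = num_masked_names - generated_masked_records['Count'].sum()

        # If the second output was False, Pool #1 was exhausted and we need to turn to another pool...
        if not bool_ok:
            
            # Need to perform 2 anti joins -> historic national records (all years) not in the state dataset AND not in the recently generated masked dataset
            pool2_unfiltered = df_national[~df_national['id_2'].isin(df_state['id_2'])]
            pool2 = pool2_unfiltered[~pool2_unfiltered['id_2'].isin(generated_masked_records['id_2'])]

            # Collapse years to one row per sex/name so each name can be drawn at most 4 times
            pool2 = pool2.groupby('id_2', as_index=False)['Count'].sum()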

            # Using our custom function, generate "Masked Names"
            generated_masked_records2, _ = generate_names(pool2, new_num_masked_names, 'id_2', cur_year, state_abbreviation)

            # Redefine our output
            generated_masked_records = pd.concat([generated_masked_records,generated_masked_records2])

        # Concat our original state data (df_state) with our new masked records.  Drop columns unused 
        df = pd.concat([df_state,generated_masked_records]).drop(['id_1','id_2'],axis='columns')

        # Write out to csv in save location
        df.to_csv(f'{dir_wrangled}/year{cur_year}.csv', index=False)

    # Nothing should come down here...
    else:
        pass


### 3.3 Reread data

In [None]:
# Get list of files we just wrote out
list_files = os.listdir(dir_wrangled)

output = []
# For each file, load and append to list we'll eventually concat together
for file in list_files:
    df = pd.read_csv(f'{dir_wrangled}/{file}')
    output.append(df)

# Concat all dataframes into one
df_names_with_masks = pd.concat(output)

# Drop state column, useful in files for reference that we're working state-level, but unimportant for computation
df_names_with_masks = df_names_with_masks.drop('State', axis='columns')

### 3.4 Append Early dates

We only have vital birth stats (count of births for each state) from 1914 onward.
However, the distribution of first names is still important for prior years.

We'll just use national-level breakdowns for the years before 1914 (1880-1913).

In [None]:
# Query national data for years prior to our compiled data above
df_national_early_data = df_national.query(f'Year < {df_names_with_masks["Year"].min()}')[['Sex', 'Year', 'Name', 'Count']]

# Rename our final formatted data as df_names
df_names = pd.concat([df_names_with_masks,df_national_early_data])

### 3.5 Combine with race data

#### 3.5.1 Read race data

Gives a breakdown of first names by race in the United States.
For names not assessed, the dataset provides a general breakdown (its last row).
* Data source [linked here](https://www.nature.com/articles/sdata201825) - description
* Data source [linked here](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TYJKEZ) - download location

In [None]:
# Read in the data
df_race = pd.read_excel(f'../../SupportingDocs/Names/01_Raw/firstnames.xlsx',
                   sheet_name='Data')

# Last row indicates probabilities for all other names
row_allothernames = df_race.iloc[-1:]

# Remove that last row from the data
df_race = df_race.iloc[:-1]

#### 3.5.2 Join Data

In [None]:
# Join the data and drop columns we don't want to use
df = pd.merge(df_names, df_race, left_on = 'Name', right_on = 'firstname', how='left')\
       .drop(['firstname','obs'] , axis=1)
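
Names absent from the race table come out of the left join with missing percentages. The notebook stores the "all other names" row in `row_allothernames` but does not appear to use it afterwards. One possible way to apply such a fallback is sketched below on toy data. The frame `toy_joined`, the Series `toy_fallback`, and all of their values are hypothetical.

```python
import pandas as pd

pct_cols = ['pcthispanic', 'pctwhite', 'pctblack', 'pctapi', 'pctaian']

# Hypothetical stand-in for the joined data: ZELPHA had no match in the race table
toy_joined = pd.DataFrame({
    'Name': ['BEN', 'ZELPHA'],
    'pcthispanic': [5.0, None], 'pctwhite': [70.0, None],
    'pctblack': [4.0, None], 'pctapi': [20.0, None], 'pctaian': [1.0, None],
})

# Hypothetical stand-in for the "all other names" row
toy_fallback = pd.Series({'pcthispanic': 10.0, 'pctwhite': 60.0,
                          'pctblack': 20.0, 'pctapi': 8.0, 'pctaian': 2.0})

# Fill missing percentages column by column from the fallback row
toy_filled = toy_joined.fillna(toy_fallback)
print(toy_filled)
```

Whether to apply a fallback like this is a modelling choice: without it, unmatched names carry no race percentages into the later steps.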

## 4. Aggregate Calculations

We eventually want to see how likely any name is for any specific year/race/sex combination.
To do this, we'll need to:
1. (Numerator) Approximate how many people of a given race/sex have a specific name within a given year.
2. (Denominator) Approximate how many people fit within a race/sex category within a given year for our dataset.
3. Calculate the likelihood by taking numerator / denominator.

### 4.1 Melt our data

We want things to be in a more easily accessible long format to perform some of our calculations.

Note that when we melt, normalize, and start our calculations, we drop the information in the pct2race field.
That means the percentages for the 5 remaining race fields will NOT sum to 100% for a given name.
We do not redistribute the pct2race share across the other fields.
In effect, we reduce our dataset to people who identify with only one race.
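
To make the wide-to-long reshaping concrete, here is a small sketch on a single made-up row. The frame `toy_wide` and its values are hypothetical, and only two of the five percentage columns are shown.

```python
import pandas as pd

# Hypothetical wide row: one name with two of the race-percentage columns
toy_wide = pd.DataFrame({
    'Name': ['BEN'], 'Sex': ['M'], 'Year': [2010], 'Count': [100],
    'pctwhite': [80.0], 'pctapi': [20.0],
})

# Wide -> long: one row per (name, race-percentage column)
toy_long = pd.melt(toy_wide, id_vars=['Name', 'Sex', 'Year', 'Count'],
                   value_vars=['pctwhite', 'pctapi'])
print(toy_long)
```

Each original row becomes one row per percentage column, with the column name in `variable` and the percentage in `value`.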

In [None]:
# Define value variables we'll melt on
perc_race_cols = ['pcthispanic','pctwhite','pctblack','pctapi','pctaian']

# Melt our data so we only see one percent_race column per row.
#### New data only has columns ['Name','Sex','Year','Count','variable','value']
df_melted = pd.melt(df.reset_index(), id_vars=['Name','Sex','Year','Count'], value_vars=perc_race_cols)

# Normalize to 1.0 instead of 100 for value variables
df_melted['value'] = df_melted['value'] * 0.01

# We only care about where the value (percent chance of having a race given a name) is more than 0
df_melted = df_melted.query('value > 0')

# Map the percentage column name with the race category it represents. NOT REPLACING VALUE HERE
variable_mapping = {
    'pcthispanic':'Hispanic',
    'pctwhite':'White',
    'pctblack':'Black or African American',
    'pctapi':'Asian or Pacific Islander',
    'pctaian':'American Indian or Alaska Native'
}

# Use our mapping to replace the variable name
df_melted['variable'] = df_melted['variable'].replace(variable_mapping)

# Rename our variable as race for readability/understandability moving forward
df_melted.rename(columns={'variable':'Race'}, inplace=True)

### 4.2 Calculate our "Numerator"

Take the given example:
* There are 100 instances of a Male named "BEN" in our dataset for the year 2010
* Roughly 20% of all people named "BEN" are Asian 

Based on these 2 statements, we can estimate that there are 20 Asians named BEN (in our dataset) for the year 2010.
We'll use this logic to calculate our numerator in finding the likelihood of a name by race, sex, and year.

In [None]:
df_melted['NameCount_Race'] = df_melted['Count'] * df_melted['value']

### 4.3 Calculate our "Denominator"

Now take the example:

* There are only 3 possible names for Asian males in 2010.
   * 20 are named Ben
   * 5 are named John
   * 70 are named Jack

How would you go about finding the likelihood of a given name for Asian males in 2010?
You'd find out how many total Asian males existed in our data for 2010: `20 + 5 + 70 = 95`.
Then you'd take:
   * `20 / 95 = 0.2105...` chance of being named Ben
   * `5 / 95 = 0.05263...` chance of being named John
   * `70 / 95 = 0.7368...` chance of being named Jack
   
In this section, we find the denominator, or the sum of all represented race/sex/year combos in our dataset.
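
The Ben/John/Jack example can be reproduced with a short sketch. The frame `toy` is hypothetical, and `groupby(...).transform('sum')` is used here as a compact alternative to a groupby followed by a merge.

```python
import pandas as pd

# Toy numerators from the example above (Asian males, 2010)
toy = pd.DataFrame({
    'Name': ['BEN', 'JOHN', 'JACK'],
    'Race': ['Asian or Pacific Islander'] * 3,
    'Sex': ['M'] * 3,
    'Year': [2010] * 3,
    'NameCount_Race': [20.0, 5.0, 70.0],
})

# Denominator: total per race/year/sex group, broadcast back to each row
toy['Sum_RaceCount'] = toy.groupby(['Race', 'Year', 'Sex'])['NameCount_Race'].transform('sum')

# Probability of each name within its group
toy['Probability'] = toy['NameCount_Race'] / toy['Sum_RaceCount']
print(toy[['Name', 'Sum_RaceCount', 'Probability']])
```

Every row in the group shares the denominator 95, matching the hand calculation.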

In [None]:
# Calculate that denominator using a groupby statement
SumRaceCounts = df_melted.groupby(['Race','Year','Sex'])['NameCount_Race'].sum().reset_index()\
                         .rename(columns={'NameCount_Race':'Sum_RaceCount'})

# Join back to our data
df_melted = pd.merge(df_melted, SumRaceCounts, on=['Race','Year','Sex'], how='inner')

### 4.4 Calculate Name Probabilities

We have the numerator and the denominator.
Now it's time to find the probability of having a name given the year and the subject's race and sex.

In [None]:
df_melted['Probability'] = df_melted['NameCount_Race'] / df_melted['Sum_RaceCount']
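
A quick sanity check one might run at this point is that the probabilities sum to roughly 1 within every race/year/sex group. The helper `check_group_probabilities` and the frame `toy_probs` below are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def check_group_probabilities(frame, group_cols=('Race', 'Year', 'Sex')):
    """Return True if 'Probability' sums to roughly 1 within every group."""
    totals = frame.groupby(list(group_cols))['Probability'].sum()
    return bool(np.allclose(totals.to_numpy(), 1.0))

# Hypothetical toy frame mirroring the output columns
toy_probs = pd.DataFrame({
    'Name': ['BEN', 'JOHN', 'JACK'],
    'Race': ['Asian or Pacific Islander'] * 3,
    'Year': [2010] * 3,
    'Sex': ['M'] * 3,
    'Probability': [20 / 95, 5 / 95, 70 / 95],
})
print(check_group_probabilities(toy_probs))
```

Under the same assumptions, the check could be applied to `df_melted` before saving.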

## 5. Save

In [None]:
# Define columns of interest
output_columns = ['Name','Sex','Year','Race','Probability']

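# Save our columns of interest (create the output folder first in case it does not exist)
os.makedirs('../../SupportingDocs/Names/03_Complete', exist_ok=True)
df_melted[output_columns].to_csv('../../SupportingDocs/Names/03_Complete/firstname_probabilities.csv',
                                header = True, index=False )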