# Wrangle Lastnames

The purpose of this script is to wrangle data on last names.

Data sources:
* Census Surname data 2000 - https://api.census.gov/data/2000/surname (retrieved via the Census API)
* Census Surname data 2010 - https://api.census.gov/data/2010/surname (retrieved via the Census API)
* Census Surname data 1990 - https://www2.census.gov/topics/genealogy/1990surnames/dist.all.last

In this script, we compile the likelihood that a person of a given race has a specific last name.
This information is applied only at the beginning of the simulation, when we initialize our population.
From that point forward, last names are passed down.

----------------------

<p>Author: PJ Gibson</p>
<p>Date: 2023-01-10</p>
<p>Updated Date: 2023-07-26</p>
<p>Contact: peter.gibson@doh.wa.gov</p>
<p>Other Contact: pjgibson25@gmail.com</p>

## 0. Import libraries, functions, settings

In [1]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import re
from pandas.errors import SettingWithCopyWarning
import warnings
import requests
import json
import pickle

warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

## 1. Load data

### 1.1 Census Data 2000, 2010 via API

#### 1.1.1 Define your API key

You can request an API key at [this link](https://api.census.gov/data/key_signup.html).
One will be sent to you via email.

In [2]:
# Load the key from the pickle file
with open('secrets.pkl', 'rb') as f:
    census_key = pickle.load(f)
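
The notebook does not show how `secrets.pkl` was created. A minimal sketch, assuming the file holds nothing but the key string, might look like the following. The helper names `save_key` and `load_key`, the temporary path, and the placeholder key are hypothetical and only for illustration.

```python
import os
import pickle
import tempfile

def save_key(key, path):
    """Pickle a single API key string to `path` (hypothetical helper)."""
    with open(path, 'wb') as f:
        pickle.dump(key, f)

def load_key(path):
    """Read the pickled API key back, mirroring the cell above."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# Round trip in a temporary directory with a placeholder value, not a real key
demo_path = os.path.join(tempfile.mkdtemp(), 'secrets.pkl')
save_key('YOUR_CENSUS_API_KEY', demo_path)
loaded_key = load_key(demo_path)
```

Keeping the key in a separate file that is excluded from version control avoids committing it alongside the notebook.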

#### 1.1.2 Create our function

In [3]:
def get_census_surnames( year, key ):
    '''
    A query to collect census surname data for a specific decennial year (either 2000 or 2010)
    
    Args:
        year - int, year to perform query for
        key - str, census API key
        
    Returns:
        pandas dataframe object reflecting data we'll use    
    '''
    # Define columns to query
    census_requested_cols = ['NAME',
                             'COUNT',
                             'PROP100K',
                             'PCTAPI',
                             'PCTBLACK',
                             'PCTAIAN',
                             'PCTWHITE',
                             'PCTHISPANIC',
                             'PCT2PRACE']
    
    # Define base URL
    census_url = f'https://api.census.gov/data/{year}/surname'

    # Define api location
    api_url = f'{census_url}?get={",".join(census_requested_cols)}&RANK=0:1000000&key={key}'
    
    # Perform API request, raising an error on a failed (non-2xx) response
    response_census = requests.get(api_url)
    response_census.raise_for_status()

    # Parse the API text output into pandas dataframe, rename the columns
    df_census = pd.DataFrame(json.loads(response_census.text), dtype = str)
    df_census.columns = census_requested_cols + ['RANK']
    
    # Remove the first row, which holds only the column names (the "ALL OTHER NAMES" row, RANK 0, is kept)
    df_census = df_census[1:]
    
    # Replace (S) with null values
    df_census.replace('(S)', np.nan, inplace=True)
    
    # Define float columns, int cols
    float_cols = ['PROP100K','PCTAPI','PCTBLACK','PCTAIAN','PCTWHITE','PCTHISPANIC','PCT2PRACE']
    int_cols = ['COUNT','RANK']
    
    # change type to floats
    for column in float_cols:
        df_census[column] = df_census[column].astype(float)
    
    # change type to ints
    for column in int_cols:
        df_census[column] = df_census[column].astype(int)
    
    # Return outputs
    return df_census

#### 1.1.3 Apply our function

In [4]:
census2000 = get_census_surnames( 2000, census_key )
census2010 = get_census_surnames( 2010, census_key )
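
To make the parsing steps inside `get_census_surnames` easier to follow without an API key, here is a small offline sketch. It assumes the response body is a JSON array of arrays whose first row holds the column names, which is what the function above relies on. The payload is made up for illustration (names and numbers are placeholders, not real census values) and uses only a few of the columns.

```python
import json

import numpy as np
import pandas as pd

# Illustrative payload only: placeholder names and values, not census data
sample_text = json.dumps([
    ["NAME", "COUNT", "PCTWHITE", "RANK"],
    ["EXAMPLEA", "1000", "70.5", "1"],
    ["EXAMPLEB", "500", "(S)", "2"],
    ["ALL OTHER NAMES", "20000", "60.0", "0"],
])

rows = json.loads(sample_text)

# First row is the header; the remaining rows are data (all strings)
demo = pd.DataFrame(rows[1:], columns=rows[0])

# "(S)" marks suppressed values; treat them as missing before converting
demo['PCTWHITE'] = pd.to_numeric(demo['PCTWHITE'].replace('(S)', np.nan))
demo['COUNT'] = demo['COUNT'].astype(int)
demo['RANK'] = demo['RANK'].astype(int)
```

Note that the "ALL OTHER NAMES" row survives this step; it is only filtered out later, after it has been used to fill in the 1990 names.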

### 1.2 Census data 1990

The 1990 data is also fetched over HTTP, but instead of querying specific variables through the API, we download a plain-text file and parse it ourselves (closer to web scraping).

In [5]:
# Define the URL we'll scrape from
census_url_1990 = 'https://www2.census.gov/topics/genealogy/1990surnames/dist.all.last'

# Download the raw text file
apiresponse_1990 = requests.get(census_url_1990)

# Split the raw document string via newlines and convert to series
precensus1990 = pd.Series(apiresponse_1990.text.split('\n'))

# Split each line on runs of one or more whitespace characters (raw string avoids an invalid-escape warning)
census1990 = precensus1990.str.split(r'\s+',expand=True)

# Rename columns and dropna
census1990.columns = ['NAME','FREQUENCY','CUM_FREQUENCY','RANK']
census1990 = census1990.dropna()
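
The same split-and-drop logic can be tried on a couple of lines without downloading anything. The sketch below assumes each line holds four whitespace-separated fields (name, frequency, cumulative frequency, rank), as the column names above imply; the two sample lines are illustrative. The numeric conversion at the end is an optional extra that the cell above does not perform.

```python
import pandas as pd

# Two illustrative lines in the assumed layout, plus the trailing newline
sample_text = (
    "SMITH          1.006  1.006      1\n"
    "JOHNSON        0.810  1.816      2\n"
)

# Splitting on '\n' leaves an empty final element
lines = pd.Series(sample_text.split('\n'))

# Split on runs of whitespace; short rows are padded with missing values
parts = lines.str.split(r'\s+', expand=True)
parts.columns = ['NAME', 'FREQUENCY', 'CUM_FREQUENCY', 'RANK']

# The empty final line has missing fields, so dropna removes it
parts = parts.dropna()

# Optional: convert the numeric fields from strings
parts['FREQUENCY'] = parts['FREQUENCY'].astype(float)
parts['RANK'] = parts['RANK'].astype(int)
```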

## 2. Identify extra names

Several lastnames appear in the 2000 and 1990 datasets but not in the 2010 dataset.  
The 2010 census data only includes lastnames that occur 100 or more times, so we'll assume that every name found in 1990/2000 still exists in 2010 but falls under that threshold.
We'll assume that each of these names occurs 99 times.

We already know the breakdowns by percent race for the 2000 data, but the 1990 data has no race columns, so we have to infer its breakdowns from the 2010 row where `NAME == "ALL OTHER NAMES"`.

### 2.1 Extranames 2000 data

In [6]:
# Find out what was in the 2000 data but not the 2010 data
extranames2000 = pd.merge(census2000, census2010.NAME, on='NAME', how='left', indicator=True)\
                   .query('_merge == "left_only"')\
                   .drop('_merge', axis=1)

# Overwrite the "COUNT" field to read 99, just 1 below the threshold for making the dataset
extranames2000['COUNT'] = 99
extranames2000['PROP100K'] = np.nan
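
The merge above is an anti-join: it keeps only the rows of the left table with no match in the right table. A toy sketch of the same pattern, using made-up names and counts, shows what the `indicator` column is for.

```python
import pandas as pd

# Made-up tables standing in for an older and a newer dataset
older = pd.DataFrame({'NAME': ['A', 'B', 'C'], 'COUNT': [500, 150, 120]})
newer = pd.DataFrame({'NAME': ['A', 'C', 'D'], 'COUNT': [520, 130, 110]})

# indicator=True adds a "_merge" column: "both" or "left_only" for a left join
only_older = (
    pd.merge(older, newer[['NAME']], on='NAME', how='left', indicator=True)
      .query('_merge == "left_only"')
      .drop(columns='_merge')
)

# Apply the same assumption as above: just under the 100-occurrence threshold
only_older['COUNT'] = 99
```

Here only name `B` is left, because `A` and `C` also appear in the newer table.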


### 2.2 Extranames 1990 data

In [7]:
# Find out what was in the 1990 data but not the 2010 data
extranames1990 = pd.merge(census1990, census2010.NAME, on='NAME', how='left', indicator=True)\
                   .query('_merge == "left_only"')\
                   .drop('_merge', axis=1)

# Find out what was in the 1990 data not already in the "extranames2000" dataset
extranames1990 = pd.merge(extranames1990, extranames2000.NAME, on='NAME', how='left', indicator=True)\
                   .query('_merge == "left_only"')\
                   .drop('_merge', axis=1)

# Set RANK to 0 so every row joins to the "ALL OTHER NAMES" row (RANK 0) of the 2010 data
extranames1990['RANK'] = 0

# Join with our data of "ALL OTHER NAMES"
extranames1990 = pd.merge(extranames1990, census2010.drop('NAME',axis=1), on='RANK', how='inner')

# Make count = 99 and drop unused columns
extranames1990['COUNT'] = 99
extranames1990['PROP100K'] = np.nan
extranames1990.drop(['FREQUENCY','CUM_FREQUENCY'],axis=1,inplace=True)
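
Joining on a constant key is a way to copy one summary row onto every row of another table. The toy sketch below (made-up names and percentages) mirrors the `RANK == 0` join above on a single race column.

```python
import pandas as pd

# Names that need a race breakdown filled in (made-up)
names = pd.DataFrame({'NAME': ['X', 'Y']})
names['RANK'] = 0  # constant key matching the summary row

# Stand-in for the 2010 table: RANK 0 is the "ALL OTHER NAMES" row
summary = pd.DataFrame({
    'NAME': ['ALL OTHER NAMES', 'Z'],
    'RANK': [0, 1],
    'PCTWHITE': [60.0, 80.0],
})

# Drop NAME from the right side so it does not clash with the left NAME
joined = pd.merge(names, summary.drop('NAME', axis=1), on='RANK', how='inner')
```

Every row in `joined` carries the percentage of the RANK 0 row, which is the assumption the notebook makes for names that only appear in the 1990 data.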

## 3. Combine and wrangle data


We eventually want to see how likely any last name is for any specific race.
To do this, we'll need to:

1. (Numerator) Approximate how many people of a given race have each specific lastname.
2. (Denominator) Approximate how many total people of each race are in our dataset.
3. Calculate the probability by taking numerator / denominator.

### 3.1 Melt our data

We want things to be in a more easily accessible long format to perform some of our calculations.

In [9]:
# Concatenate our datasets (columns are aligned by name)
df = pd.concat([census2010, extranames2000, extranames1990])

# Remove the all other names field
df = df.query('NAME != "ALL OTHER NAMES"')

# Melt the data, dropna
df_melted = pd.melt(df, id_vars=['NAME','COUNT'], value_vars=['PCTAPI','PCTBLACK','PCTAIAN','PCTWHITE','PCTHISPANIC'])\
              .dropna(subset=['value'])

# Ensure we're only working with non-zero likelihoods
df_melted = df_melted.query('value > 0')
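
A toy version of the melt step (made-up names and percentages, two race columns only) shows how the wide table becomes one row per name and race, and how missing and zero percentages drop out.

```python
import numpy as np
import pandas as pd

# Made-up wide table: one row per name, one column per race percentage
wide = pd.DataFrame({
    'NAME': ['EXAMPLEA', 'EXAMPLEB'],
    'COUNT': [100, 99],
    'PCTWHITE': [80.0, np.nan],
    'PCTBLACK': [20.0, 0.0],
})

# Long format: columns NAME, COUNT, variable (race column), value (percent)
long_df = pd.melt(wide, id_vars=['NAME', 'COUNT'],
                  value_vars=['PCTWHITE', 'PCTBLACK']).dropna(subset=['value'])

# Keep only non-zero percentages
long_df = long_df.query('value > 0')
```

Of the four name and race combinations, only the two for `EXAMPLEA` remain: one `EXAMPLEB` value is missing and the other is zero.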

#### 3.1.1 Clean melted data

In [10]:
# Convert percentages (0-100) to proportions (0-1)
df_melted['value'] = df_melted['value'] / 100


# Map the percentage column name with the race category it represents. NOT REPLACING VALUE HERE
variable_mapping = {
    'PCTHISPANIC':'Hispanic',
    'PCTWHITE':'White',
    'PCTBLACK':'Black or African American',
    'PCTAPI':'Asian or Pacific Islander',
    'PCTAIAN':'American Indian or Alaska Native'
}

# Use our mapping to replace the variable name
df_melted['variable'] = df_melted['variable'].replace(variable_mapping)
df_melted.rename(columns={'variable':'Race',
                          'NAME':'Name'}, inplace=True)

### 3.2 Calculate our "Numerator"

Take the given example:
* There are 100 instances of people with lastname "SMITH".
* Roughly 20% of all people with lastname "SMITH" are Asian.

Based on these 2 statements, we can assume that there are 20 Asian people with a lastname of "SMITH" (in our dataset).
We'll use this logic to calculate our numerator in finding the likelihood of a name by race.

In [11]:
# Add new column
df_melted['Race_NameCount'] = df_melted['COUNT'] * df_melted['value']

### 3.3 Calculate our "Denominator"

Now take the example:

* There are only 3 possible lastnames for Asian people (I know...super simplified)
   * 20 with lastname "SMITH"
   * 5 with lastname "JACKSON"
   * 70 with lastname "LEE"

How would you go about finding the probability that any Asian person has a specific lastname?
You'd first find out how many total Asian people exist in our data: `20 + 5 + 70 = 95`.
Then you'd take:
   * `20 / 95 = 0.2105...` chance of having a lastname "SMITH"
   * `5 / 95 = 0.05263...` chance of having a lastname "JACKSON"
   * `70 / 95 = 0.7368...` chance of having a lastname "LEE"
   
In this section, we find the denominator: the estimated total number of people of each race in our dataset.

In [12]:
# Calculate that denominator using a groupby statement
SumRaceCounts = df_melted.groupby('Race')['Race_NameCount'].sum().reset_index()\
                         .rename(columns={'Race_NameCount':'Sum_RaceCount'})

# Join back to our data
df_melted = pd.merge(df_melted, SumRaceCounts, on='Race', how='inner')

### 3.4 Calculate Name Probabilities

We now have the numerator and the denominator.
It's time to compute the probability of having a name given a subject's race.

In [13]:
df_melted['Probability'] = df_melted['Race_NameCount'] / df_melted['Sum_RaceCount']
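
As a sanity check, the simplified SMITH / JACKSON / LEE example from section 3.3 can be reproduced in a few lines. This sketch uses `groupby(...).transform('sum')` as a compact alternative to the groupby-then-merge in the notebook; the result should be the same, and the probabilities within each race should sum to 1.

```python
import pandas as pd

# The simplified example from section 3.3
toy = pd.DataFrame({
    'Name': ['SMITH', 'JACKSON', 'LEE'],
    'Race': ['Asian or Pacific Islander'] * 3,
    'Race_NameCount': [20.0, 5.0, 70.0],
})

# Denominator: total count per race, broadcast back onto every row
toy['Sum_RaceCount'] = toy.groupby('Race')['Race_NameCount'].transform('sum')

# Probability of each name given the race
toy['Probability'] = toy['Race_NameCount'] / toy['Sum_RaceCount']

# Check: probabilities within each race should sum to 1
totals = toy.groupby('Race')['Probability'].sum()
```

The same `totals` check could be run on `df_melted` before saving, to confirm that each race's probabilities sum to 1 up to floating-point error.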

## 4. Save

In [15]:
# Define output columns
output_columns = ['Name','Race','Probability']

# Save
df_melted[output_columns].to_csv('../../../SupportingDocs/Names/03_Complete/lastname_probabilities.csv',
                                header = True, index=False)
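
The notebook does not show how the simulation consumes this file. One possible way to draw lastnames from a table with the saved `Name`, `Race`, and `Probability` columns is sketched below. The function `sample_lastnames` and the toy table are hypothetical and are not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def sample_lastnames(prob_df, race, n, seed=None):
    """Draw n lastnames for one race, weighted by Probability (hypothetical helper)."""
    subset = prob_df[prob_df['Race'] == race]
    p = subset['Probability'].to_numpy(dtype=float)
    p = p / p.sum()  # guard against small floating-point drift
    rng = np.random.default_rng(seed)
    return rng.choice(subset['Name'].to_numpy(), size=n, p=p)

# Toy table in the same shape as the saved CSV (values from the 3.3 example)
toy_probs = pd.DataFrame({
    'Name': ['SMITH', 'JACKSON', 'LEE'],
    'Race': ['Asian or Pacific Islander'] * 3,
    'Probability': [20 / 95, 5 / 95, 70 / 95],
})

draws = sample_lastnames(toy_probs, 'Asian or Pacific Islander', n=10, seed=0)
```

Passing a seed makes the draws repeatable, which can help when comparing simulation runs.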