<center><img style="float: center;" src="images/CI_horizontal.png" width="600"></center>
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>

<center> Julia Lane, Benjamin Feder, Tian Lou, Lina Osorio-Copete </center>

# Outcome measurement and imputation

## Introduction

What should you do when you encounter missing values in your data? Unfortunately, there is usually no *right* answer. However, you can try to impute these missing values, providing your best guess for each missing point's true value. Here, you will learn how to implement common imputation methods you can use in approaching missing values in your own work.

### Learning Objectives

* Gain understanding of the concept of measurement error in the context of a cohort's earnings

* Explore options for imputing missing values

* Visualize estimate changes following imputation

In this notebook, you will focus on 2012-13 Ohio community college graduates' earnings during their first year after graduation, particularly in their first and fourth quarters after graduation. Recall that in the [Data Exploration](01_2_Dataset_Exploration.ipynb) notebook, you examined the earnings distribution for all members of this cohort who had positive earnings in this time period in Ohio. To evaluate the earnings outcomes of all 2012-13 Ohio community college graduates, you need to decide what to do when you cannot find their earnings in the Ohio Unemployment Insurance (UI) wage records. A person may not appear in Ohio's UI wage records for several reasons:
- The person is unemployed. 
- The person is out of the labor force, e.g., in school or providing childcare.
- The person was employed outside of Ohio.
- The person's job is not covered by UI wage records, e.g., the self-employed, independent contractors, federal government workers, etc. <a href='https://www.nap.edu/read/10206/chapter/11#294'>(Hotz and Scholz, 2002)</a>

You will explore the resulting earnings outcomes after applying different earnings imputation methods. The methods covered in this notebook include:
- Dropping all "missing" values
- Filling in zero for people who do not have records in Ohio UI wage records data 
- Substituting missing values with the average earnings of people who are in the same degree fields and have the same gender
- Regression imputation
- Adding in Indiana, Missouri, and Illinois UI wage records for the cohort in question

## Python Setup and Database Connection

Before you begin, you need to run the code cells below to import the libraries and connect to our PostgreSQL database. You should already be familiar with the `matplotlib`, `pandas`, and `numpy` libraries from previous notebooks.

In [None]:
# pandas-related imports
import pandas as pd

# Numpy
import numpy as np

# database interaction imports
import sqlalchemy

#Matplotlib
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

# regression modeling
from sklearn.linear_model import LinearRegression

# Jupyter-specific "magic command" to plot images directly in the notebook.
%matplotlib inline

In [None]:
# to create a connection to the database, 
# we need to pass the name of the database and host of the database

host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = sqlalchemy.create_engine(connection_string)

Let's also create some temporary tables to recall our cohort.

In [None]:
# We need to use degree year (degcert_yr_earned) and degree term (degcert_term_earned) to identify 
# 2012-13 academic year.
# We use campus type to identify community college students 
# TC=technical college, SC=state college, CC=community college
# Drop transfer students (Those with degcert_subject = 'TRAMOD')

qry = '''
create temp table cc_grads as
select *
from data_ohio_olda_2018.oh_hei_long a
left join data_ohio_olda_2018.oh_hei_campus_county_lkp lkp
on a.degcert_campus = lkp.campus_num
where ((a.degcert_yr_earned = '2012' and (a.degcert_term_earned = '4' or a.degcert_term_earned = '1')) or 
    (a.degcert_yr_earned = '2013' and (a.degcert_term_earned = '2' or a.degcert_term_earned = '3'))) and 
    lkp.campus_type_code in ('TC', 'SC', 'CC') and
    a.degcert_subject != 'TRAMOD'
'''
conn.execute(qry)

In [None]:
# Find the most recent graduation within the span of 2012-13 academic year
# We convert degree term and degree year to dates and sort each person's records in descending order by date.
# Then we only keep each graduate's first record.

qry = '''
create temp table cc_grads_recent as
select distinct on (ssn_hash) *
from (
SELECT *, 
    CASE WHEN degcert_term_earned = '4' THEN
        format('%%s-%%s-01', degcert_yr_earned, 7)::date 
    WHEN degcert_term_earned = '1' THEN
        format('%%s-%%s-01', degcert_yr_earned, 10)::date 
    WHEN degcert_term_earned = '2' THEN
        format('%%s-%%s-01', degcert_yr_earned, 1)::date 
    WHEN degcert_term_earned = '3' THEN
        format('%%s-%%s-01', degcert_yr_earned, 4)::date 
    END AS deg_date
    from cc_grads
) q
order by ssn_hash, deg_date DESC
'''
conn.execute(qry)
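
The same "most recent record per person" logic can be sketched in `pandas`. This is only an illustration on a hypothetical toy table (the `toy` values are made up, not drawn from the Ohio data), assuming the same term-to-month mapping as the `CASE` statement above.

```python
import pandas as pd

# Hypothetical toy records: one row per degree earned (not real data)
toy = pd.DataFrame({
    'ssn_hash': ['a', 'a', 'b', 'c'],
    'degcert_yr_earned': [2012, 2013, 2012, 2013],
    'degcert_term_earned': [4, 2, 1, 3],
})

# Same term-to-month mapping as the CASE statement in the query
term_to_month = {4: 7, 1: 10, 2: 1, 3: 4}
toy['deg_date'] = pd.to_datetime(pd.DataFrame({
    'year': toy['degcert_yr_earned'],
    'month': toy['degcert_term_earned'].map(term_to_month),
    'day': 1,
}))

# Rough equivalent of DISTINCT ON (ssn_hash) ... ORDER BY ssn_hash, deg_date DESC
toy_recent = (toy.sort_values(['ssn_hash', 'deg_date'], ascending=[True, False])
                 .drop_duplicates(subset='ssn_hash', keep='first')
                 .reset_index(drop=True))
print(toy_recent)
```

Sorting before `drop_duplicates(keep='first')` is what guarantees that the retained row is the latest graduation for each person.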

## Join Cohort to Ohio UI wage records

Recall that we've already joined our cohort of 2012-13 academic year Ohio community college graduates to the Ohio UI wage records data in the table `cohort_oh_jobs` in the `ada_20_osu` schema. For the purposes of this notebook, we slightly adapted `cohort_oh_jobs` to include the graduate's degree (`degcert_subject`) so we can use degree for one of our imputation techniques. The following SQL code was used to recreate `cohort_oh_jobs`, which has already been done for you.

    create table ada_20_osu.cohort_oh_jobs as
    select a.ssn_hash, a.deg_date, a.degcert_subject, b.job_date, b.sumwages, (b.job_date - a.deg_date) time_after_grad
    from cc_grads_recent a
    join ada_20_osu.small_ohio_ui b
    on a.ssn_hash = b.ssn_hash
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date

In [None]:
# Take a look at cohort_oh_jobs with subject
qry = '''
select *
from ada_20_osu.cohort_oh_jobs
limit 5;
'''
pd.read_sql(qry, conn)

## Brief Exploration: Earnings during first quarter after graduation

Before we start performing imputation, we need to do some quick data manipulation to isolate earnings from the first quarter after each individual's graduation. To do so, we can create a new column, `qrt_after_grad`, in Python by dividing `time_after_grad` by 90 and rounding to the nearest whole number. From there, we can see how many members of our cohort had positive earnings in Ohio in their first quarter after graduation.

In [None]:
# read in ada_20_osu.cohort_oh_jobs in the dataframe `df`
qry = '''
select *
from ada_20_osu.cohort_oh_jobs
'''
df = pd.read_sql(qry, conn)

In [None]:
# Convert days after graduation into quarters
df['qrt_after_grad'] = (df['time_after_grad'] / 90).round()

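> Note: Because `time_after_grad` is the number of days between the first day of the graduation quarter and the first day of the quarter in which wages were earned, dividing it by 90 and rounding assigns each record to the correct quarter after graduation. A value of 1 therefore captures all records from the first quarter after graduation.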
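
To see why rounding works, here is a small worked example. The dates below are hypothetical graduation dates, one for each term of the 2012-13 academic year, assuming (as in the note above) that both dates always fall on the first day of a quarter.

```python
import pandas as pd

# Hypothetical graduation dates, one per term in the 2012-13 academic year
deg_dates = pd.to_datetime(['2012-07-01', '2012-10-01', '2013-01-01', '2013-04-01'])

# First day of the quarter that follows each graduation quarter
next_quarter = deg_dates + pd.DateOffset(months=3)

# Day counts between the two dates, then the conversion used for qrt_after_grad
days = pd.Series((next_quarter - deg_dates).days)
quarters = (days / 90).round()

print(days.tolist())      # → [92, 92, 90, 91]
print(quarters.tolist())  # → [1.0, 1.0, 1.0, 1.0]
```

A quarter spans between 90 and 92 days depending on the term, so an exact match on a single day count would miss some graduates, while rounding does not.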

In [None]:
# Filter quarter 1 after graduation
df_q1 = df[df['qrt_after_grad']==1]
df_q1 = df_q1.dropna()

In [None]:
df_q1['ssn_hash'].nunique()

In [None]:
print('Total graduates with positive earnings during first quarter after graduation: {:,.0f}'\
.format(df_q1['ssn_hash'].nunique()))
print('That is {:.1f}% of the study cohort'\
.format((df_q1['ssn_hash'].nunique()/df['ssn_hash'].nunique())*100))

<h3 style="color:red">Checkpoint 1: Identifying Earnings in the Fourth Quarter after Graduation</h3>

Given the code above, create a data subset `df_q4` that contains all earnings for our cohort in their fourth quarter after graduation. How many members of our cohort had positive earnings in this quarter? Do you expect this number to be higher or lower than the number in the first quarter?

In [None]:
# hint: you can refer to the code above for filtering `df` when 'qrt_after_grad' is 4



## Add graduates without positive earnings for Q1

If we were to subset `cohort_oh_jobs` to their first quarter after graduation (`df_q1`), this table would only contain individuals with positive earnings in their first quarter after graduation in Ohio. Let's add in members of our cohort who did not appear in Ohio's wage records during this time period into the group in `cohort_oh_jobs`. This will let us easily analyze different earnings distributions in our cohort's first quarter after graduation moving forward.

> Before we do this, we will grab some demographic information on our cohort for future imputation methods.

In [None]:
# Find demographics information for all graduates in our cohort
qry = '''
select distinct a.ssn_hash, a.deg_date, a.degcert_subject, b.birthdate_y, b.race_ethnic_code, b.gender
from cc_grads_recent a
LEFT JOIN data_ohio_olda_2018.oh_hei_demo b 
on a.ssn_hash = b.ssn_hash
'''
grads = pd.read_sql(qry, conn)

In [None]:
# see grads
grads.head()

In [None]:
# see total number of members in our cohort
grads['ssn_hash'].count()

In [None]:
# see number of unique ssn_hash values in our cohort
grads['ssn_hash'].nunique()

Because we have differing values when running `.count()` and `.nunique()`, it's clear that we have a duplicate `ssn_hash` in `grads`. Let's see what's going on.

In [None]:
# showing the duplicate
grads[grads['ssn_hash'].duplicated(keep=False)]

The two rows differ in `birthdate_y`, most likely because of a clerical error during data collection. Given that the other variables match, we can assume that both rows correspond to the same person. We will replace the conflicting value with `null` and then drop the duplicate row.

In [None]:
# Replacing with NULL
grads.loc[(grads['ssn_hash'].duplicated(keep=False)), 'birthdate_y'] = np.nan

In [None]:
# Dropping duplicates
grads = grads.drop_duplicates()

In [None]:
# confirm that number of ssn_hash values is same as nunique()
grads['ssn_hash'].count()
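
The de-duplication steps above can be traced on a small example. The `toy` DataFrame below is hypothetical and only mimics the situation in `grads`, where one person appears twice with conflicting birth years.

```python
import numpy as np
import pandas as pd

# Hypothetical cohort with one person listed under two birth years
toy = pd.DataFrame({
    'ssn_hash': ['a', 'b', 'b', 'c'],
    'gender': ['F', 'M', 'M', 'F'],
    'birthdate_y': [1990.0, 1988.0, 1989.0, 1991.0],
})

# Flag every row whose ssn_hash appears more than once
is_dup = toy['ssn_hash'].duplicated(keep=False)

# Blank out the conflicting field, then collapse the now-identical rows
toy.loc[is_dup, 'birthdate_y'] = np.nan
toy = toy.drop_duplicates()

print(toy)
```

Note that `drop_duplicates()` treats two missing values as equal, which is why the two rows collapse into one once the conflicting birth year is set to `NaN`.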

Let's also check to see if we have any missing values for our demographic variables. If so, let's fill these in as `unknown` so they won't be dropped in future analyses.

In [None]:
# see null counts for demographic variables
grads.isnull().sum()

> Theoretically, you could apply these imputation methods to these missing demographic values. However, for the purposes of this notebook, we will focus our imputation techniques on missing earnings values.

In [None]:
# replace null values on grads for 'unknown'
grads.fillna('unknown', inplace = True)

Now that we have confirmed that `grads` looks as intended, we can merge it with earnings outcomes for our cohort's first quarter after graduation. In the following left join, if a member of our cohort did not appear in the Ohio UI wage records, they will have `NULL` earnings. Below, `merge` works very similarly to SQL's `JOIN` in that you designate the two tables to merge and then describe how you will merge the tables.

In [None]:
# left join earnings information for Q1 after graduation on grads
cohort_oh_jobs_q1 = pd.merge(grads, df_q1[['ssn_hash', 'sumwages']], 
                             how = 'left', on='ssn_hash')

In [None]:
# see cohort_oh_jobs_q1
cohort_oh_jobs_q1.head()

In [None]:
# See how many have missing earnings values
cohort_oh_jobs_q1['sumwages'].isnull().sum()

As a sanity check, you could make sure that the number of unique `ssn_hash` values from `grads` is equal to the sum of the number of unique `ssn_hash` values from `df_q1` and the number of rows with null `sumwages` in `cohort_oh_jobs_q1`.

In [None]:
# sanity check
cohort_oh_jobs_q1['sumwages'].isnull().sum() + df_q1['ssn_hash'].nunique()  == grads['ssn_hash'].nunique()
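
If you want to see explicitly which rows found a match in a left join, `merge` accepts an `indicator` argument. The sketch below uses hypothetical stand-ins (`toy_grads`, `toy_q1`) for `grads` and `df_q1`.

```python
import pandas as pd

# Hypothetical stand-ins for `grads` and `df_q1`
toy_grads = pd.DataFrame({'ssn_hash': ['a', 'b', 'c', 'd']})
toy_q1 = pd.DataFrame({'ssn_hash': ['a', 'c'], 'sumwages': [4200.0, 6100.0]})

# indicator=True adds a `_merge` column recording where each row came from
toy_merged = pd.merge(toy_grads, toy_q1, how='left', on='ssn_hash', indicator=True)

print(toy_merged)
print(toy_merged['_merge'].value_counts())
```

Rows marked `left_only` are the cohort members without a wage record, and they are the same rows that have a null `sumwages`.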

<h3 style="color:red">Checkpoint 2: Replicate for Q4</h3>

Create a DataFrame `cohort_oh_jobs_q4` that mirrors `cohort_oh_jobs_q1` except for Q4. Feel free to add in as many code cells as you deem necessary.

## Impute Wage Values

Now that we have confirmed that our `cohort_oh_jobs_q1` DataFrame is ready to use for testing our imputation methods, we can get started. To recall, here are the five methods we will be trying out in this notebook:
- Dropping all "missing" values
- Filling in zero for people who do not have records in Ohio UI data
- Filling in missing values with the average earnings of people who are in the same degree fields and have the same gender
- Regression
- Filling in missing values by adding in Indiana, Missouri, and Illinois UI records for the cohort in question.

### 1. Drop All Missing Values

First, let's look at the earnings outcomes during the first quarter after graduation when we drop all missing earnings values. By ignoring the missing values, we are implicitly assuming that they follow the same distribution as the observed ones. Although this approach is fairly common, you should **never, ever, ever** use it in practice.

In [None]:
# drop missing values
wages_no_missing = cohort_oh_jobs_q1[["ssn_hash","sumwages"]].dropna()

In [None]:
# see earnings distribution
wages_no_missing.describe()

<h4 style="color:red">Checkpoint 3: Replicate for Q4</h4>

What does the earnings distribution look like for Q4 when you drop missing values?

### 2. Fill in Missing Values with Zero

Next, let's see how the earnings distribution shifts when we encode all missing earnings outcomes as 0. Here, we are assuming that all missing earnings are due to unemployment.

In [None]:
# fill all null sumwages with 0
wages_zero = cohort_oh_jobs_q1[['ssn_hash','sumwages']].fillna(0)

In [None]:
# Take a look at the distribution. How does it vary from the distribution you get in method 1?
wages_zero.describe()

In [None]:
# drop missing values distribution for comparison
wages_no_missing.describe()

In [None]:
print('Average earnings if missing are dropped is ${:,.2f}'.format(wages_no_missing['sumwages'].mean()))
print('Average earnings if missing are imputed as 0 is ${:,.2f}'.format(wages_zero['sumwages'].mean()))
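
To build intuition for how far apart these two averages can be, consider a hypothetical group of five people, two of whom have no wage record. The numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly earnings: two of five people have no wage record
toy_wages = pd.Series([3000.0, 5000.0, np.nan, 7000.0, np.nan])

mean_dropped = toy_wages.dropna().mean()   # average over the 3 observed values
mean_zero = toy_wages.fillna(0).mean()     # average over all 5 people

print(mean_dropped)  # → 5000.0
print(mean_zero)     # → 3000.0
```

Imputing zeros scales the mean by the share of people with observed earnings (here 3/5), so the two methods bracket a wide range of possible answers.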

<h4 style="color:red">Checkpoint 4: Replicate for Q4</h4>

What does the earnings distribution look like for Q4 when you fill missing values with zero?

### 3. Fill in Missing Values with Major/Gender Mean Earnings

Now, instead of either ignoring missing values or assuming the earnings are 0, we will try imputing missing earnings for each individual as the average first-quarter earnings of the other individuals in our cohort of the same gender who graduated with a degree in the same two-digit subject field.

Here, our strategy is as follows:
- Find mean earnings for each subject by gender
- Merge these mean earnings onto each cohort member's record by subject and gender
- If the earnings are null, replace them with the mean earnings for that subject and gender

In [None]:
# Function that adds a leading 0 to subject codes that have only 5 digits
def add_0(x):
    index = 6
    if len(x) < index:
        return '0' + x
    # codes that already have 6 (or more) characters are returned unchanged
    return x

> We have recently added in a new column into the `oh_hei_long` table that contains subject codes with leading zeros so you won't have to deal with subject codes of length 5. The column is `degcert_subject_upd`. The same applies for `oh_otc`, where the updated column is `hei_subject_code_upd`. These changes are also reflected in the data dictionary located on the class website.

In [None]:
# Apply the function to restore the leading 0 on subject codes that were stored as numbers and so ended up with only 5 digits
cohort_oh_jobs_q1['degcert_subject2'] = cohort_oh_jobs_q1['degcert_subject'].apply(add_0)

First, let's grab the first two digits of every `degcert_subject` to get two-digit subject codes. After that, we can follow our strategy stated above.

In [None]:
# New column with 2 digit subject
cohort_oh_jobs_q1['subj_2dig'] = cohort_oh_jobs_q1['degcert_subject2'].str[:2]

In [None]:
# see new column
cohort_oh_jobs_q1['subj_2dig'].head()
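
As an aside, `pandas` string methods offer a compact alternative to a custom padding function. The sketch below uses hypothetical subject codes and assumes, as above, that valid codes are 6 characters long.

```python
import pandas as pd

# Hypothetical subject codes, some of which lost their leading zero
codes = pd.Series(['510801', '10101', '520201', '90101'])

padded = codes.str.zfill(6)   # left-pad with zeros up to 6 characters
subj_2dig = padded.str[:2]    # first two characters

print(subj_2dig.tolist())  # → ['51', '01', '52', '09']
```

Unlike adding a single `'0'`, `zfill` pads to a fixed width, so it would also handle a code that lost more than one leading zero.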

In [None]:
# ignore all missing earnings
subset = cohort_oh_jobs_q1[['subj_2dig', 'gender', 'sumwages']].dropna()

In [None]:
# see subset
subset.head()

In [None]:
# find mean earnings by gender, subject combination
sub_gend_w = subset.groupby(['subj_2dig', 'gender'])['sumwages'].agg('mean').reset_index()

In [None]:
# see sub_gend_w
sub_gend_w.head()

Now, we will merge the two DataFrames, `sub_gend_w` and `cohort_oh_jobs_q1` using `merge()`.
> Note: We will rename the `sumwages` column in `sub_gend_w` so we don't get confused between the mean earnings by degree and earnings for the individual after the merge.

In [None]:
# rename columns for merge
sub_gend_w.columns = ('subj_2dig', 'gender', 'mean_w')

In [None]:
# see renamed columns
sub_gend_w.head()

In [None]:
# Add column of mean earnings by major
wages_missing_as_mean = pd.merge(cohort_oh_jobs_q1, sub_gend_w, how = 'left', on=['gender', 'subj_2dig'])

In [None]:
# see wages_missing_as_mean
wages_missing_as_mean.head()

Now, we can add a new column to `wages_missing_as_mean` to include the mean degree wage if the individual did not appear in the Ohio UI wage records data. To do so, we will use the `pandas` `mask()` method, which replaces values where a specific condition is met.
> Note: Here, that specific condition is when `sumwages` is `NULL`.

In [None]:
# Replacing missing sumwages by mean of earning by major when possible
wages_missing_as_mean['imputed_wages'] = wages_missing_as_mean['sumwages'].mask(
    wages_missing_as_mean['sumwages'].isnull(), wages_missing_as_mean['mean_w'])

In [None]:
# see updated wages_missing_as_mean
wages_missing_as_mean.head()

In using this method, there is a chance we could not impute missing values for all individuals in our cohort. If `imputed_wages` is still `NULL`, we can assume there were no individuals in the cohort with non-missing earnings with the same degree/gender combination.

In [None]:
# see if any still don't have imputed earnings
sum(wages_missing_as_mean['imputed_wages'].isnull())

Unfortunately, it seems as though we do not have available earnings for every combination of gender and 2-digit subject code. For the sake of the exercise, we will ignore the individuals whose earnings we could not impute using this method.

In [None]:
wages_missing_as_mean['imputed_wages'].describe()
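
The merge-then-mask steps in this section can also be written more compactly with `groupby(...).transform('mean')`. This is an alternative sketch on a hypothetical toy table, not the approach used above, though it should produce the same imputed values.

```python
import numpy as np
import pandas as pd

# Hypothetical cohort: the ('11', 'F') group has no observed earnings at all
toy = pd.DataFrame({
    'subj_2dig': ['51', '51', '51', '52', '52', '11'],
    'gender':    ['F',  'F',  'F',  'M',  'M',  'F'],
    'sumwages':  [4000.0, 6000.0, np.nan, 8000.0, np.nan, np.nan],
})

# Mean of observed earnings within each subject/gender group, aligned to rows
group_mean = toy.groupby(['subj_2dig', 'gender'])['sumwages'].transform('mean')

# Fill only the missing earnings with the matching group mean
toy['imputed_wages'] = toy['sumwages'].fillna(group_mean)
print(toy)
```

The last row stays missing because its group has no observed earnings to average, which mirrors the leftover nulls in `imputed_wages` above.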

<h4 style="color:red">Checkpoint 5: Replicate for Q4</h4>
Impute missing earnings values as the mean earnings of individuals in the cohort with the same gender and subject degree field. What does the earnings distribution look like? For how many individuals could you not impute values using this method?

### 4. Regression imputation

We can also use regression to try to get more accurate earnings values. We build a regression equation from the observations for which we know the earnings, then use the equation to predict the missing earnings values. This is, in effect, an extension of the mean imputation by subgroup. Here, we will use graduates' demographic and degree information: birth year, ethnicity, degree quarter, and major.

In [None]:
# Select the variables that we need for the regression
cohort_q1 = cohort_oh_jobs_q1.loc[:,('ssn_hash','sumwages', 'birthdate_y', 'race_ethnic_code', 'deg_date', 'subj_2dig')]

In [None]:
# Drop unknown birthdate_y values
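# .copy() makes an independent DataFrame so later column assignments do not trigger a SettingWithCopyWarning
cohort_q1 = cohort_q1[cohort_q1['birthdate_y']!='unknown'].copy()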

In [None]:
#replace missing values with -1 because this value doesn't otherwise exist in the earnings table
cohort_q1['sumwages'] = cohort_q1['sumwages'].fillna(-1)

We need to use the `get_dummies` function in order to properly treat the categorical variables in our DataFrame. The function will convert all categorical variables to dummy variables.

In [None]:
# make categorical variables dummies
df_dummied = pd.get_dummies(cohort_q1[['race_ethnic_code', 'deg_date', 'subj_2dig']].astype(
    'category', copy=False), drop_first = True)

In [None]:
# add in non-categorical variables
df_dummied['ssn_hash'] = cohort_q1['ssn_hash']
df_dummied['sumwages'] = cohort_q1['sumwages']
df_dummied['birthdate_y'] = cohort_q1['birthdate_y']

In [None]:
df_dummied.columns

In [None]:
# Keep the rows with observed earnings (drop the -1 placeholders)
df_nona = df_dummied[df_dummied['sumwages'] != -1]
# Drop ssn_hash
df_reg = df_nona.drop(['ssn_hash'], axis=1)

In [None]:
# Saved missing values for imputation later
df_miss = df_dummied[df_dummied['sumwages'] == -1]
# drop ssn_hash and empty column sumwages
df_pred = df_miss.drop(['ssn_hash', 'sumwages'], axis=1)

In [None]:
# see size of df_nona
df_nona.shape

In [None]:
# see size of df_miss
df_miss.shape

In [None]:
# see df_pred
df_pred.head()

The model creation process for a linear regression can be done using `scikit-learn`. The process is as follows: We will create the model object, then give it the data, and then use the model object to generate our predictions. The model object essentially contains all of the instructions on how to fit the model, and when we give it the data, it fits the model to that data.

In [None]:
# Create model object
ols = LinearRegression()

# Predictors and Outcome
predictors = df_reg.drop(['sumwages'], axis = 1)
outcome = df_reg.sumwages

# Fit the model
ols.fit(X = predictors, y = outcome)

Now that we've fit our model, we can find the predicted values for earnings.

In [None]:
# add in the ssn_hash and predicted values
missing_wages = pd.DataFrame({'ssn_hash':df_miss['ssn_hash'], 'sumwages':ols.predict(df_pred)})

In [None]:
# only looking at imputed values
missing_wages.describe()

In [None]:
# after imputation
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
wages_missing_regress = pd.concat([df_nona.loc[:, ['ssn_hash', 'sumwages']], missing_wages])
wages_missing_regress.describe()
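
Here is the same fit-then-predict workflow on a tiny, self-contained example. The data are hypothetical and constructed so that earnings are exactly `100 * age + 500 * has_cert`, which makes the imputed values easy to verify by hand. The columns `age` and `has_cert` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data in which earnings are exactly 100 * age + 500 * has_cert
toy = pd.DataFrame({
    'age':      [22, 25, 30, 35, 28, 40],
    'has_cert': [0, 1, 0, 1, 1, 0],
    'sumwages': [2200.0, 3000.0, 3000.0, 4000.0, np.nan, np.nan],
})

observed = toy[toy['sumwages'].notnull()]
missing = toy[toy['sumwages'].isnull()]
features = ['age', 'has_cert']

# Fit on rows with known earnings, then predict for rows without
model = LinearRegression().fit(observed[features], observed['sumwages'])
toy.loc[missing.index, 'sumwages'] = model.predict(missing[features])

print(toy)
```

Every imputed value lies exactly on the fitted line, so regression imputation tends to understate the spread of the true earnings distribution.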

<h4 style="color:red">Checkpoint 6: Include gender as a categorical variable and re-run the regression</h4>

When you include gender as a categorical variable in the regression, how does the earnings distribution compare to the one using the previous linear regression to impute values?

### 5. Add in Indiana, Missouri, and Illinois UI data

Finally, let's see how the earnings distribution changes when we add in some nearby states' UI wage records. You will see how we joined Indiana, Missouri, and Illinois' UI wage records to our `cc_grads_recent` table. Afterwards, we will combine these tables to analyze the overall earnings distribution.

By adding in nearby states' wage records, we should be able to capture more of our cohort's earnings outside of Ohio.

Recall that in the data exploration notebook, we created the permanent table `cohort_in_jobs` by joining `cc_grads_recent` to `small_indiana_ui`, which was a subset of the entire UI wage records for Indiana. The following SQL queries created `cohort_in_jobs`, `cohort_il_jobs`, and `cohort_mo_jobs`.

> Note: `cohort_in_jobs`, `cohort_mo_jobs`, and `cohort_il_jobs` do not contain `degcert_subject` columns. Feel free to add them in yourself using `cc_grads_recent` and the corresponding UI wage record table.

    create table ada_20_osu.cohort_in_jobs as
    select a.ssn_hash, a.deg_date, b.job_date, b.sumwages, (b.job_date - a.deg_date) time_after_grad
    from cc_grads_recent a
    join ada_20_osu.small_indiana_ui b
    on a.ssn_hash = b.ssn_hash
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date

    create table ada_20_osu.cohort_mo_jobs as
    select a.ssn_hash, a.deg_date, b.job_date, sum(b.wage) as wage, (b.job_date - a.deg_date) time_after_grad
    from cc_grads_recent a
    join mo_small b
    on a.ssn_hash = b.ssn
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date
    group by a.ssn_hash, a.deg_date, b.job_date

    create table ada_20_osu.cohort_il_jobs as
    select a.ssn_hash, a.deg_date, b.job_date, sum(b.wage) as wage, (b.job_date - a.deg_date) time_after_grad
    from cc_grads_recent a
    join il_small b
    on a.ssn_hash = b.ssn
    where b.job_date > a.deg_date AND (a.deg_date + '1 year'::interval) >= b.job_date
    group by a.ssn_hash, a.deg_date, b.job_date

Let's briefly explore these tables to see how many `ssn_hash` values they captured.

In [None]:
# Read in indiana table 
qry = '''
select * 
from ada_20_osu.cohort_in_jobs
'''
in_df=pd.read_sql(qry, conn) 

In [None]:
# see indiana table
in_df.head()

In [None]:
# number of ssn_hash values found in Indiana
in_df['ssn_hash'].nunique()

Missouri and Illinois only have employer-level UI wage records, but this will not affect your analysis. You can still repeat the previous process, i.e., create a small UI table for each state, join them to `cc_grads_recent`, and then aggregate the data by `ssn_hash`, as shown in the table creation code.
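
The aggregation step can be illustrated in `pandas`. The `toy_ui` table below is hypothetical, with the column names `ssn`, `employer`, `job_date`, and `wage` chosen for illustration. It shows how employer-level rows collapse to one row per person and quarter.

```python
import pandas as pd

# Hypothetical employer-level records: one row per person, employer, and quarter
toy_ui = pd.DataFrame({
    'ssn': ['a', 'a', 'b', 'b'],
    'employer': ['e1', 'e2', 'e1', 'e1'],
    'job_date': pd.to_datetime(['2013-10-01', '2013-10-01', '2013-10-01', '2014-01-01']),
    'wage': [1500.0, 700.0, 3000.0, 3200.0],
})

# Collapse to one row per person and quarter by summing across employers
toy_person = toy_ui.groupby(['ssn', 'job_date'], as_index=False)['wage'].sum()
print(toy_person)
```

Person `a` held two jobs in the same quarter, so their two rows are combined into a single quarterly total of 2,200.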

In [None]:
# Now you have the Missouri UI records for 2012-13 Ohio community college graduates.
qry = '''
select * 
from ada_20_osu.cohort_mo_jobs
'''
mo_df = pd.read_sql(qry, conn)

In [None]:
# see mo_df
mo_df.head()

In [None]:
# number of ssn_hash values found in Missouri
mo_df['ssn_hash'].nunique()

In [None]:
# Now you have the Illinois UI records for 2012-13 Ohio community college graduates.
qry = '''
select * 
from ada_20_osu.cohort_il_jobs
'''
il_df = pd.read_sql(qry, conn)

In [None]:
il_df.head()

In [None]:
# number of ssn_hash values found in Illinois
il_df['ssn_hash'].nunique()

We have access to four tables that have Ohio community college graduates' UI records from the four states. You can append the four tables by using `union` in SQL. You can also create a `state` column to track the state the individual found employment in during first quarter after graduation.

In [None]:
# Combine graduates' employment outcomes across all four states during the first quarter after graduation,
# labeling each record with its state.
# One quarter spans 90 to 92 days depending on the graduation term, so we keep that whole range.
qry = '''
drop table if exists all_wages; 
create temp table all_wages as
select distinct ssn_hash, deg_date, job_date, wage as sumwages, time_after_grad, 'il' as state
from ada_20_osu.cohort_il_jobs
where time_after_grad between 90 and 92
union all
select distinct ssn_hash, deg_date, job_date, sumwages, time_after_grad, 'oh' as state
from ada_20_osu.cohort_oh_jobs
where time_after_grad between 90 and 92
union all
select distinct ssn_hash, deg_date, job_date, sumwages, time_after_grad, 'in' as state
from ada_20_osu.cohort_in_jobs
where time_after_grad between 90 and 92
union all
select distinct ssn_hash, deg_date, job_date, wage as sumwages, time_after_grad, 'mo' as state
from ada_20_osu.cohort_mo_jobs
where time_after_grad between 90 and 92
'''
conn.execute(qry)

In [None]:
# Take a look at the table.
qry = '''
select * from all_wages limit 5
'''
pd.read_sql(qry, conn)
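
If the per-state tables were already loaded as DataFrames, the same stacking could be done with `pd.concat`. The sketch below uses two hypothetical state tables with made-up values.

```python
import pandas as pd

# Hypothetical per-state wage tables keyed by ssn_hash
state_tables = {
    'oh': pd.DataFrame({'ssn_hash': ['a', 'b'], 'sumwages': [4000.0, 5000.0]}),
    'in': pd.DataFrame({'ssn_hash': ['b', 'c'], 'sumwages': [1000.0, 6000.0]}),
}

# Rough equivalent of UNION ALL with a literal `state` column in each branch
toy_all = pd.concat(
    [tbl.assign(state=state) for state, tbl in state_tables.items()],
    ignore_index=True,
)
print(toy_all)
```

Like `union all`, `pd.concat` keeps every row, so person `b` appears once for each state in which they had earnings.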

As we saw when initially forming `df_q1`, `all_wages` does not include people who have no earnings in any of the four states. You need to add these people back by appending the `ssn_hash` values from `cc_grads_recent` that were not already captured in `all_wages`.

In [None]:
qry = '''
drop table if exists all_jobs_w_missing;
create temp table all_jobs_w_missing as
select ssn_hash, deg_date, NULL as job_date, NULL as sumwages, NULL as time_after_grad, 'No Record' as state
from cc_grads_recent a
where ssn_hash NOT IN (SELECT distinct ssn_hash from all_wages)
UNION ALL
select * from all_wages;
'''
conn.execute(qry)
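
The `NOT IN` plus `UNION ALL` pattern has a direct `pandas` counterpart. This sketch uses hypothetical stand-ins (`toy_cohort`, `toy_wages`) for `cc_grads_recent` and `all_wages`.

```python
import pandas as pd

# Hypothetical stand-ins for cc_grads_recent and all_wages
toy_cohort = pd.DataFrame({'ssn_hash': ['a', 'b', 'c', 'd']})
toy_wages = pd.DataFrame({
    'ssn_hash': ['a', 'b', 'b'],
    'sumwages': [4000.0, 5000.0, 1000.0],
    'state': ['oh', 'oh', 'in'],
})

# Cohort members that never appear in the wage table (the NOT IN subquery)
no_record = toy_cohort[~toy_cohort['ssn_hash'].isin(toy_wages['ssn_hash'])]
no_record = no_record.assign(sumwages=float('nan'), state='No Record')

# Stack the placeholder rows on top of the observed wage rows
toy_with_missing = pd.concat([no_record, toy_wages], ignore_index=True)
print(toy_with_missing)
```

After this step every cohort member appears at least once, either with observed wages or with a `No Record` placeholder.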

In [None]:
#Let's check the table
qry = '''
select * 
from all_jobs_w_missing
'''
cross_state_df = pd.read_sql(qry, conn)

In [None]:
cross_state_df.head()

In [None]:
# Let's check how many people have earnings in each state.
cross_state_df['state'].value_counts()

Let's see how many people worked in one, two, three and four states!

In [None]:
# First, drop duplicate ssn_hash/state pairs and store the result as `no_dup`
no_dup = cross_state_df.loc[cross_state_df['state'].isin([
    'il', 'mo', 'in', 'oh'])][['ssn_hash', 'state']].drop_duplicates()

In [None]:
# Count number of jobs in different states by ssn_hash
num_jobs = no_dup.groupby(['ssn_hash'])['state'].agg('count').reset_index(name='num_states')
num_jobs.head()

In [None]:
# get aggregate count by num_states
num_jobs.groupby('num_states').agg('count')
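
The same tally can be obtained without de-duplicating first, by counting distinct states per person with `nunique()`. The sketch below runs on a hypothetical toy table.

```python
import pandas as pd

# Hypothetical wage records: person 'a' has two Ohio rows and one Indiana row
toy = pd.DataFrame({
    'ssn_hash': ['a', 'a', 'a', 'b', 'c', 'c'],
    'state':    ['oh', 'oh', 'in', 'oh', 'il', 'mo'],
})

# Distinct states per person, then how many people fall in each count
states_per_person = toy.groupby('ssn_hash')['state'].nunique()
people_by_num_states = states_per_person.value_counts().sort_index()
print(people_by_num_states)
```

Here one person worked in a single state and two people worked in two states, regardless of how many wage rows each of them has.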

Let's check how many missing values we have filled in by adding additional states' UI records. Note that some people have worked in more than one state.

In [None]:
#Let's check how many more people have positive earnings now
added_recs = cross_state_df[cross_state_df['sumwages']>0]['ssn_hash'].nunique() - df_q1['ssn_hash'].nunique()

print('''
By adding in UI wage records from a handful of nearby states, 
we have managed to find wage records for {} more people, 
as well as augmented earnings for some others. 
Let's see how this change affected the earnings distribution.
'''.format(added_recs))

In [None]:
# Let's see the earnings distribution after we add UI records from other states
cross_state_df.groupby(['ssn_hash'])['sumwages'].agg('sum').describe()
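
One subtlety when summing within groups: by default, `pandas` returns 0 for a group whose values are all missing. The sketch below shows the difference on a hypothetical table, assuming the earnings column is stored as floats. Whether the same applies to `cross_state_df` depends on how the `sumwages` column was read in.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'ssn_hash': ['a', 'a', 'b'],
    'sumwages': [4000.0, 1000.0, np.nan],   # 'b' has no wage record
})

grouped = toy.groupby('ssn_hash')['sumwages']
default_sum = grouped.sum()             # all-NaN group sums to 0.0
strict_sum = grouped.sum(min_count=1)   # all-NaN group stays NaN

print(default_sum)
print(strict_sum)
```

With the default behavior, people without any record are effectively treated as having zero earnings. Passing `min_count=1` keeps them missing instead.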

## Visualizing Earnings Distributions

We can use visualization to quickly determine whether any of the imputation methods has significantly altered the pre-imputation wage distribution. Side-by-side boxplots can be an effective choice.

In [None]:
# Creating a dataframe with all the imputation methods outcomes
df1 = cohort_oh_jobs_q1[['ssn_hash', 'sumwages']]
df2 = wages_no_missing.rename(columns=({'sumwages':'earnings_no_imp'}))
df3 = wages_zero.rename(columns=({'sumwages':'earnings_imp_zero'}))
df4 = wages_missing_as_mean[['ssn_hash', 'imputed_wages']].rename(columns=({'imputed_wages':'earnings_imp_mean'}))
df5 = wages_missing_regress.rename(columns=({'sumwages':'earnings_imp_regress'}))
frames = [df1, df2, df3, df4, df5]
result = pd.concat(frames, axis=1, sort=False)

In [None]:
result.describe()

In [None]:
# see all distributions for one quarter after graduation side-by-side
fig,ax = plt.subplots(figsize = (15, 8))
result[['earnings_no_imp', 'earnings_imp_zero', 
    'earnings_imp_mean', 'earnings_imp_regress']].\
boxplot(grid = False, vert = False)
ax.set(title = 'distribution of earnings one quarter after graduation',
       yticklabels = ['no imputation', 'imputed zero', 
                      'imputed mean by gender and major', 'regression'],
       xlim = (-500,11000),
       xticks = (np.arange(0, 11000, 1000)))
plt.annotate('Sources: OH HEI data and UI wage records', 
             xy=(0.75,-0.1), xycoords="axes fraction");

<h3 style="color:red">Checkpoint 7: Visualizing cross state earnings</h3>
Add the cross state earnings distribution to the above visualization.

## Multiple histograms

We can also look at the differences in the earnings distribution by looking at side-by-side histograms. To do so, we need to convert `result` from a wide format (many columns, fewer rows) to a long format (fewer columns, many rows).

Before we append each of the individual DataFrames, we need to make sure they all have the same columns. Let's also rename the `ssn_hash` column to `index` in each DataFrame.

In [None]:
# make sure columns are index, earnings, method
df2['method'] = 'no imputation'
df2.columns = ('index', 'earnings', 'method')
df3['method'] = 'imputed zero'
df3.columns = ('index', 'earnings', 'method')
df4['method'] = 'imputed mean'
df4.columns = ('index', 'earnings', 'method')
df5['method'] = 'regression'
df5.columns = ('index', 'earnings', 'method')
df6 = cross_state_df[['ssn_hash', 'sumwages']].copy()
df6['method'] = 'cross state'
df6.columns = ('index', 'earnings', 'method')

In [None]:
# pd.concat replaces the chained DataFrame.append calls, which were removed in pandas 2.0
result2 = pd.concat([df2, df3, df4, df5, df6], ignore_index=True)
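
When the wide table has one row per person, `melt` performs the wide-to-long reshaping in a single call. This is an alternative sketch on a hypothetical two-person table, not the approach used above.

```python
import pandas as pd

# Hypothetical wide table: one column per imputation method
toy_wide = pd.DataFrame({
    'ssn_hash': ['a', 'b'],
    'earnings_imp_zero': [4000.0, 0.0],
    'earnings_imp_mean': [4000.0, 5200.0],
})

# One row per person and method, with the method name stored in its own column
toy_long = toy_wide.melt(id_vars='ssn_hash', var_name='method', value_name='earnings')
print(toy_long)
```

The long table has one row for every person-method pair, which is the shape `FacetGrid` expects when splitting plots by `method`.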

In [None]:
# Prepare our grid, which will share axes across multiple plots (wrapping after 5 columns)
g = sns.FacetGrid(result2, col='method',col_wrap=5)

# Create a histogram for each cell of the grid
g = g.map(plt.hist, "earnings", color="lightcoral")

# Simplify the titles inside each cell
g.set_titles("{col_name}")

plt.annotate(
        'Sources: OH HEI data and UI wage records',
        fontsize='x-small',
        xycoords="figure fraction", # specify x and y positions as % of the overall figure
        xy=(1, 0.01), # 100% to the right (x) and 1% to the top (y) means bottom right
        horizontalalignment='right')

# Remove the spine (vertical line) along the y axis
sns.despine(left=True)

### (Optional) Advanced: Using machine learning to impute values

To impute values, we can also use machine learning algorithms such as `K-nearest Neighbors` and `Decision Trees`. The principle behind `K-nearest Neighbors` is quite simple: the missing values can be imputed from the values of the "closest neighbors", where closeness is measured using other, known, features.

For example, if we had cases where the data on earnings of some graduates was completely missing, we could approximate their earnings by referring to other characteristics which could be shared by major group (their 'closest neighbors' in terms of characteristics).

The algorithm calculates the distance between the observation with the missing value and every other observation, using the features that are known for both, and then fills in the missing value based on the observations that are closest. Imputing missing data using machine learning has become an active research area, and there are plenty of papers covering the various algorithms if you are curious.
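
As a minimal sketch of this idea, `scikit-learn` provides `KNNImputer`. The example below uses a hypothetical two-column table in which `age` is the only known feature. The choice of `n_neighbors=2` is arbitrary and made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data: the last person's earnings are missing
toy = pd.DataFrame({
    'age':      [22.0, 23.0, 40.0, 41.0, 22.5],
    'sumwages': [3000.0, 3200.0, 8000.0, 8400.0, np.nan],
})

# Fill each missing value with the average of its 2 nearest neighbors
imputer = KNNImputer(n_neighbors=2)
imputed = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(imputed)
```

The person aged 22.5 is closest to the people aged 22 and 23, so their earnings are imputed as the average of 3,000 and 3,200. With more than one feature, you would usually want to put the features on a comparable scale first so that no single one dominates the distance.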