<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc" style="margin-top: 1em;"><ul class="toc-item"><li><span><a href="#Measurement-error:-impute-missing-wage" data-toc-modified-id="Measurement-error:-impute-missing-wage-1">Measurement error: impute missing wage</a></span><ul class="toc-item"><li><span><a href="#Python-Setup" data-toc-modified-id="Python-Setup-1.1">Python Setup</a></span></li><li><span><a href="#Define-the-study-cohort" data-toc-modified-id="Define-the-study-cohort-1.2">Define the study cohort</a></span></li><li><span><a href="#Locate-in-all-states'-wage-data-full-term-employment-with-the-same-employer-within-one-year-after-graduation" data-toc-modified-id="Locate-in-all-states'-wage-data-full-term-employment-with-the-same-employer-within-one-year-after-graduation-1.3">Locate in all states' wage data full term employment with the same employer within one year after graduation</a></span></li><li><span><a href="#Isolate-cases-where,-for-the-same-grad/employer-pair,-we-have-a-wage-for-t-1,-t+1,-but-not-t" data-toc-modified-id="Isolate-cases-where,-for-the-same-grad/employer-pair,-we-have-a-wage-for-t-1,-t+1,-but-not-t-1.4">Isolate cases where, for the same grad/employer pair, we have a wage for t-1, t+1, but not t</a></span></li><li><span><a href="#impute-wage-values-and-explore-resulting-wage-estimate-distributions" data-toc-modified-id="impute-wage-values-and-explore-resulting-wage-estimate-distributions-1.5">impute wage values and explore resulting wage estimate distributions</a></span></li></ul></li></ul></div>

<img style="float: center;" src="./images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>


# Measurement error: impute missing wage

Suppose our research question is: what was the quarterly wage of 2009 Missouri college and university graduates who held full-term employment in the quarter falling one year after graduation (quarter t)? Our quarterly wage estimates will be biased when the data contain wage values for quarter t-1 and quarter t+1 but no value for quarter t. In this notebook, we explore how imputing those missing values affects the resulting wage estimates.

## Python Setup

Before we begin, run the code cell below to initialize the libraries we'll be using in this assignment. We're already familiar with `matplotlib`, `pandas`, and `psycopg2` from previous tutorials.

In [None]:
%matplotlib inline
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
import numpy as np
import matplotlib.pyplot as plt
import time

In [None]:
# and set our database connection parameters
db_name = "appliedda"
hostname = "10.10.2.10"

In [None]:
# set database connections - use psycopg2 to more easily execute queries without returning data 
# (eg for series of CREATE queries)
conn = psycopg2.connect(database=db_name, host=hostname)
cursor = conn.cursor()

## Define the study cohort
2009 graduates of Missouri public colleges/universities

In [None]:
# quick glance at the data

sql = '''
select *
from mo_dhe.completions
limit 5;
'''
df = pd.read_sql(sql, conn)
df.head()

In [None]:
# create temp table of all unique 2009 graduates
start_time = time.time()
sql = '''
drop table if exists cohort_2009;

create temp table cohort_2009 AS
select distinct on (deident_id) deident_id, calyear,
    case when acterm = '31' then 1 when acterm = '41' then 2
        when acterm = '11' then 3 when acterm = '21' then 3 else null end as quarter
from mo_dhe.completions
where calyear = 2009;

commit;
'''

cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
start_time = time.time()
sql = '''
alter table cohort_2009
    add column yr_q text;
commit;

update cohort_2009 
    set yr_q = format('%s-%s-1', calyear, quarter*3-2)::date;
commit;
'''

cursor.execute(sql)
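
The `quarter*3-2` expression in the update above maps a quarter number to the first month of that quarter (1 to January, 2 to April, 3 to July, 4 to October). A minimal Python sketch of the same mapping follows; `quarter_start` is a hypothetical helper name used only for illustration and is not part of the notebook.

```python
from datetime import date

def quarter_start(year, quarter):
    """Return the first day of a calendar quarter (hypothetical helper)."""
    if quarter not in (1, 2, 3, 4):
        raise ValueError("quarter must be 1-4")
    # same arithmetic as the SQL: quarter*3-2 is the quarter's first month
    return date(year, quarter * 3 - 2, 1)

print(quarter_start(2009, 2))  # 2009-04-01
```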

In [None]:
# a quick look at the grad data
sql = '''
select *
from cohort_2009
'''
df = pd.read_sql(sql, conn)
df.head()

In [None]:
print('there are {:,.0f} graduates in our selected study period'.format(df.shape[0]))

In [None]:
df['deident_id'].nunique() # confirm unique individual records

## Locate in all states' wage data full term employment with the same employer within one year after graduation 

Above we defined our population.

Now we'll say a given individual has achieved full-term employment in quarter t if they had the same employer for all of quarter t. In quarterly wage data, this means they must also have been with that employer for some or all of quarter t-1 and some or all of quarter t+1.
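
To make the definition concrete, here is a minimal sketch that checks it for one person. It assumes quarters are encoded as consecutive integers and that a wage record exists for every quarter in which the employer paid any wages. The function name `full_term_employers` and the integer encoding are illustrative assumptions, not part of the notebook's data model.

```python
def full_term_employers(jobs, t):
    """Return employers with wage records in quarters t-1, t and t+1.

    jobs: iterable of (employer, quarter_index) pairs for one person, where
    quarter_index is an integer that increases by 1 each quarter (an
    illustrative encoding, not the notebook's date format).
    """
    quarters_by_employer = {}
    for employer, quarter in jobs:
        quarters_by_employer.setdefault(employer, set()).add(quarter)
    needed = {t - 1, t, t + 1}
    # an employer qualifies only if all three quarters are present
    return sorted(e for e, qs in quarters_by_employer.items() if needed <= qs)

jobs = [("A", 3), ("A", 4), ("A", 5), ("B", 4), ("B", 5)]
print(full_term_employers(jobs, t=4))  # ['A']
```

Employer B does not qualify for t=4 because there is no record for quarter 3, so the person may have started partway through quarter 4.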

In [None]:
# first up: Missouri workers
start_time = time.time()

sql = '''
drop table if exists cohort_2009_mo_jobs_1yr;

create temp table cohort_2009_mo_jobs_1yr as
select *
from kcmo_lehd.mo_wage
where (year = 2010 or (year = 2009 and quarter = 4) 
        or (year = 2011 and quarter = 1))
    and ssn in (select distinct on (deident_id) deident_id from cohort_2009);

commit;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# next up: workers in Illinois
start_time = time.time()

sql = '''
drop table if exists cohort_2009_il_jobs_1yr;

create temp table cohort_2009_il_jobs_1yr AS
select *
from il_des_kcmo.il_wage
where (year = 2010 or (year = 2009 and quarter = 4) 
        or (year = 2011 and quarter = 1))
    and ssn in (select distinct on (deident_id) deident_id from cohort_2009);

commit;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# next up: workers in Ohio

start_time = time.time()

sql = '''
drop table if exists cohort_2009_oh_jobs_1yr;

create temp table cohort_2009_oh_jobs_1yr as
select a.*, b.ssn_hash as ssn
from data_ohio_olda_2018.oh_ui_wage_by_employer a
join data_ohio_olda_2018.oh_person b
on a.key_id = b.key_id
where (year = 2010 or (year = 2009 and quarter = 4) 
        or (year = 2011 and quarter = 1))
    and b.ssn_hash in (select distinct on (deident_id) deident_id from cohort_2009);

commit;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# last up: workers in Indiana

start_time = time.time()

sql = '''
drop table if exists cohort_2009_in_jobs_1yr;

create temp table cohort_2009_in_jobs_1yr as
select *
from in_data_2019.wages_by_employer
where (year = 2010 or (year = 2009 and quarter = 4) 
        or (year = 2011 and quarter = 1))
    and ssn in (select distinct on (deident_id) deident_id from cohort_2009);

commit;
'''
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# compile cohort jobs from all states into single table
start_time = time.time()  # reset the timer so the elapsed time below refers to this query
sql = """ 
drop table if exists cohort_2009_jobs_1yr;

create temp table cohort_2009_jobs_1yr as
select ssn, ein, state, format('%s-%s-1', year, quarter*3-2)::date j_yr_q, wage
from cohort_2009_mo_jobs_1yr
union all
select ssn, ein, state, format('%s-%s-1', year, quarter*3-2)::date j_yr_q, wage
FROM cohort_2009_il_jobs_1yr
union all
select ssn, employer::text as ein, '39' as state, format('%s-%s-1', year, quarter*3-2)::date j_yr_q, wages as wage
FROM cohort_2009_oh_jobs_1yr
union all
select ssn, fein as ein, '18' as state, format('%s-%s-1', year, quarter*3-2)::date j_yr_q, wages as wage
from cohort_2009_in_jobs_1yr;

commit;
"""
cursor.execute(sql)

print('query complete in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# quick look at our combined wage data
sql = '''
select *
from cohort_2009_jobs_1yr
limit 5
'''
df = pd.read_sql(sql, conn)
df.head()

## Isolate cases where, for the same grad/employer pair, we have a wage for t-1, t+1, but not t
* We create a single table of t-1, t, and t+1 wages, keeping rows where the grad/employer pair in t-1 matches the grad/employer pair in t+1.
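
The idea can be sketched on a toy table before running it against the database. The example below uses made-up records and hypothetical column names (`person`, `employer`, `period`); it pivots long-format wage records to one row per person/employer pair, keeps pairs observed in both t-1 and t+1, and flags those with no wage in quarter t.

```python
import pandas as pd

# toy long-format wage records; all values are made up for illustration
wages = pd.DataFrame({
    "person":   [1, 1, 1, 2, 2, 3],
    "employer": ["A", "A", "A", "B", "B", "C"],
    "period":   ["t_minus_1", "t", "t_plus_1", "t_minus_1", "t_plus_1", "t"],
    "wage":     [5000.0, 5200.0, 5400.0, 4000.0, 4400.0, 3000.0],
})

# one row per person/employer pair, one column per period
wide = wages.pivot_table(index=["person", "employer"], columns="period", values="wage")
wide = wide.reindex(columns=["t_minus_1", "t", "t_plus_1"])

# keep pairs observed in both t-1 and t+1, then flag those missing quarter t
bracketed = wide.dropna(subset=["t_minus_1", "t_plus_1"])
gaps = bracketed[bracketed["t"].isna()]
print(gaps)
```

Here only person 2 with employer B is flagged: wages exist on both sides of quarter t but not in quarter t itself.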

In [None]:
# create a table of wages earned 9 months (t-1), 12 months (t), and 15 months (t+1) after the graduation quarter
SQL_COHORT_2009_Q = """
drop table if exists cohort_2009_link_full;

create temp table cohort_2009_link_full as

with t_minus as (select a.deident_id, a.yr_q, d.ssn, d.wage, d.j_yr_q, d.ein
    from cohort_2009 a
    join cohort_2009_jobs_1yr d
        on a.deident_id = d.ssn
        and a.yr_q::date = (d.j_yr_q::date -'9 month'::interval)::date),
    
    t as (select a.deident_id, a.yr_q, c.ssn, c.wage, c.j_yr_q, c.ein
    from cohort_2009 a
    join cohort_2009_jobs_1yr c
        on a.deident_id = c.ssn
        and a.yr_q::date = (c.j_yr_q::date -'12 month'::interval)::date),

    t_plus as (select a.deident_id, a.yr_q, b.ssn, b.wage, b.j_yr_q, b.ein
        from cohort_2009 a
        join cohort_2009_jobs_1yr b
            on a.deident_id = b.ssn
            and a.yr_q::date = (b.j_yr_q::date - '15 month'::interval)::date)
        
select a.deident_id, t_minus.j_yr_q as t_minus_1, t.j_yr_q as quarter_t, t_plus.j_yr_q as t_plus_1,
    t_minus.ein as employer, 
    t_minus.wage as wage_t_minus_1,  t.wage as wage_t, t_plus.wage as wage_t_plus_1  
from cohort_2009 as a
    left join t_minus on a.deident_id = t_minus.ssn
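    -- match quarter t on the same employer so wage_t is not taken from a different job
    left join t on a.deident_id = t.ssn and t.ein = t_minus.ein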
    left join t_plus on a.deident_id = t_plus.ssn
where concat(a.deident_id, t_minus.ein) = concat(a.deident_id, t_plus.ein)
and (t_minus.ein <> 'None' or t.ein <> 'None' or t_plus.ein <> 'None')
order by a.deident_id, t_minus.ein;

commit;

"""
cursor.execute(SQL_COHORT_2009_Q)
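
The interval arithmetic in the three CTEs above lines up each wage quarter with the graduation quarter: a job row matches when its quarter start date minus 9, 12, or 15 months equals the graduate's `yr_q`. A small sketch of the equivalent forward calculation in pandas, using an example graduation quarter chosen only for illustration:

```python
import pandas as pd

grad_quarter = pd.Timestamp("2009-04-01")  # example graduation quarter start

# quarter start dates 9, 12 and 15 months after the graduation quarter
targets = {
    "t_minus_1": grad_quarter + pd.DateOffset(months=9),
    "t":         grad_quarter + pd.DateOffset(months=12),
    "t_plus_1":  grad_quarter + pd.DateOffset(months=15),
}
for label, ts in targets.items():
    print(label, ts.date())
```

For a second-quarter 2009 graduate this gives 2010-01-01, 2010-04-01, and 2010-07-01, all of which fall inside the 2009 Q4 to 2011 Q1 window pulled from each state's wage table.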

In [None]:
# load the data into a pandas dataframe and get a quick look
q = """
select * from cohort_2009_link_full
"""
df = pd.read_sql(q, conn)
df.head(10)

In [None]:
df.shape

In [None]:
# how many missing wage values are there?
df['wage_t'].isna().sum()

## impute wage values and explore resulting wage estimate distributions

In [None]:
# let's look at the distribution of wages for quarter t before imputation
df['wage_t'].describe()

In [None]:
fig,ax = plt.subplots(figsize = (10, 5))
df[['wage_t']].boxplot(ax = ax, grid = False, vert = False)
ax.set(title = 'distribution of wage values',
       xlim = (-500,30000),
       xticks = (np.arange(0, 30000, 2500)));

In [None]:
# impute missing quarter t wages row-wise as the mean of the t-1 and t+1 wages
df['wage_t_imp_mean'] = df.wage_t.fillna(df[['wage_t_minus_1', 'wage_t_plus_1']].mean(axis = 1))
df['wage_t_imp_mean'].describe()

In [None]:
# impute t wage as zero
df['wage_t_imp_zero'] = df.wage_t.fillna(0)
df['wage_t_imp_zero'].describe()

In [None]:
# see all three distributions side-by-side
fig,ax = plt.subplots(figsize = (10, 10))
df[['wage_t', 'wage_t_imp_mean', 'wage_t_imp_zero']].boxplot(ax = ax, grid = False, vert = False)
ax.set(title = 'distribution of wage values',
       xlim = (-500,30000),
       xticks = (np.arange(0, 30000, 2500)));
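
To see why the choice of imputation matters, it can help to work through a tiny example by hand. The sketch below uses four made-up grad/employer pairs (not drawn from the cohort data) and compares the mean quarter t wage when missing values are ignored, imputed with the mean of the neighboring quarters, or imputed as zero.

```python
import numpy as np
import pandas as pd

# made-up wages for four grad/employer pairs; two are missing quarter t
toy = pd.DataFrame({
    "wage_t_minus_1": [4000.0, 6000.0, 5000.0, 8000.0],
    "wage_t":         [4200.0, np.nan, 5100.0, np.nan],
    "wage_t_plus_1":  [4400.0, 7000.0, 5200.0, 9000.0],
})

# row-wise mean of the neighboring quarters, used only where wage_t is missing
neighbors = toy[["wage_t_minus_1", "wage_t_plus_1"]].mean(axis=1)
toy["imp_mean"] = toy["wage_t"].fillna(neighbors)
toy["imp_zero"] = toy["wage_t"].fillna(0)

# mean of quarter t wages under each treatment of the missing values
summary = toy[["wage_t", "imp_mean", "imp_zero"]].mean()
print(summary)
```

In this toy case the mean is 4650 when missing values are ignored, 6075 with neighbor-mean imputation, and 2325 with zero imputation. The direction and size of the shift in real data depend on which workers have missing values, which is what the boxplots above let us inspect.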