<img style="float: center;" src="./images/CI_horizontal.png" width="600">
<center>
    <span style="font-size: 1.5em;">
        <a href='https://www.coleridgeinitiative.org'>Website</a>
    </span>
</center>


Ghani, Rayid, Frauke Kreuter, Julia Lane, Adrianne Bradford, Alex Engler, Nicolas Guetta Jeanrenaud, Graham Henke, Daniela Hochfellner, Clayton Hunter, Brian Kim, Avishek Kumar, Jonathan Morgan, Ursula Kaczmarek.

# Outcome measurement and imputation
### Learning Objectives

* Gain understanding of the concept of measurement error in the context of a cohort's earnings

* Explore options for imputing missing values

* Visualize estimate changes following imputation


To determine the outcome of employment earnings for members of our 2014 Q4 TANF cohort, we need to decide what to do when earnings data is missing. Earnings may be missing for any number of reasons. The cohort member may have found work outside Indiana or Illinois, the QCEW may not report the member's earnings, or the member may not be receiving any earnings in the given time period. 

In this notebook, we explore the resulting earnings outcomes at three points in time after leaving TANF: one quarter later (2015 Q1), two quarters later (2015 Q2), and one year later (2015 Q4). For each, we compare earnings distributions when (a) dropping missing values, (b) setting missing values to zero, (c) imputing missing values as the mean for the overall cohort, and (d) imputing missing values as the mean by gender.
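To make the comparison concrete before touching the real data, here is a minimal sketch on a made-up table. The `toy` DataFrame and its values are hypothetical and only illustrate how each way of handling missing earnings shifts the estimated mean.

```python
import numpy as np
import pandas as pd

# hypothetical quarterly earnings for six people; NaN marks a missing wage record
toy = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M'],
    'earnings': [1000.0, 2000.0, np.nan, 6000.0, np.nan, np.nan],
})

# fill each person's missing value with the observed mean of their gender group
by_gender = toy.groupby('gender')['earnings'].transform(lambda x: x.fillna(x.mean()))

estimates = {
    # drop missing values: pandas skips NaN by default when averaging
    'dropped': toy['earnings'].mean(),
    # treat missing as zero earnings
    'zero': toy['earnings'].fillna(0).mean(),
    # fill with the overall mean of the observed values
    'overall_mean': toy['earnings'].fillna(toy['earnings'].mean()).mean(),
    # fill with the group mean computed above
    'gender_mean': by_gender.mean(),
}

for name, value in estimates.items():
    print('{:<13} {:,.0f}'.format(name, value))
```

In this toy case, filling with zero pulls the mean down, filling with the overall mean leaves it unchanged, and filling by group moves it toward the group with more missing values.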

## Table of Contents

- [Python Setup and Database Connection](#Python-Setup-and-Database-Connection)

- [Pull the Cohort Data](#Pull-the-Cohort-Data)

- [Pull Earnings Data](#Pull-earnings-data)

- [Explore Earnings Estimates Before Imputation](#Explore-Earnings-Estimates-Before-Imputation)

- [Impute Wage Values](#Impute-Wage-Values)

- [Compare Distributions Through Visualization](#Compare-Distributions-Through-Visualization)

## Python Setup and Database Connection
- Back to [Table of Contents](#Table-of-Contents)

Before we begin, run the code cell below to initialize the libraries. We're already familiar with `matplotlib`, `pandas`, and `psycopg2` from previous notebooks.

In [None]:
%matplotlib inline
import numpy as np
import pandas as pd
import psycopg2
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
import time

In [None]:
# set up sqlalchemy engine
host = 'stuffed.adrf.info'
DB = 'appliedda'

connection_string = "postgresql://{}/{}".format(host, DB)
conn = create_engine(connection_string)

## Pull the Cohort Data
- Back to [Table of Contents](#Table-of-Contents)


Our cohort consists of Indiana and Illinois TANF recipients whose TANF spell ended in Q4 of 2014.

In [None]:
# create temp table of all cohort members
start_time = time.time()

query = """
DROP TABLE IF EXISTS cohort_2014;

CREATE TEMP TABLE cohort_2014 AS
-- Illinois cohort members
SELECT DISTINCT ON (m.ssn_hash) ssn_hash AS member_ssn, 17 AS state, 
    '2014-10-1'::date as end_yr_q, m.sex::text AS gender
FROM il_dhs.indcase_spells i, il_dhs.member_relation r, il_dhs.member m
WHERE i.recptno = r.recptno AND i.ch_dpa_caseid = r.ch_dpa_caseid 
AND i.recptno = m.recptno AND i.ch_dpa_caseid = m.ch_dpa_caseid
AND i.end_date BETWEEN '2014-10-01'::DATE AND '2014-12-31'::DATE 
AND i.benefit_type = 'tanf46'
AND r.reltogte = 82

-- Indiana cohort members
UNION ALL 
SELECT DISTINCT ON (ssn) ssn AS member_ssn, 18 AS state, 
    '2014-10-1'::date as end_yr_q, gender
FROM in_fssa.person_month 
WHERE tanf_end_date::DATE BETWEEN '2014-10-01'::DATE AND '2014-12-31'::DATE
AND tanf = 1
AND affil = '1';

COMMIT;
"""
conn.execute(query)

# report how long creating this table took
print('query ran in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# how many members comprise our cohort
cohort = pd.read_sql('SELECT * FROM cohort_2014', conn)
cohort['member_ssn'].nunique()

## Pull earnings data
- Back to [Table of Contents](#Table-of-Contents)

First, let's pull the earnings data for the next quarter (2015q1), and see what we get.

In [None]:
# select all earnings data for 2015Q1
start_time = time.time()
query = """
DROP TABLE IF EXISTS cohort_earnings_q1;

CREATE TEMP TABLE cohort_earnings_q1 AS
-- Illinois earnings
SELECT ssn, 17 as state, ein, wage AS earnings
FROM il_des_kcmo.il_wage
WHERE year = 2015 AND quarter = 1 
    AND ssn IN (SELECT member_ssn FROM cohort_2014)

UNION ALL
-- Indiana earnings
SELECT ssn, 18 as state, fein AS ein, wages AS earnings
FROM in_dwd.wage_by_employer
WHERE year = 2015 AND quarter = 1 
    AND ssn IN (SELECT member_ssn FROM cohort_2014);
    
COMMIT;
"""
conn.execute(query)

# report how long creating this table took
print('query ran in {:.2f} seconds'.format(time.time()-start_time))

In [None]:
# load the data into a pandas dataframe and get a quick look
query = """
select * from cohort_earnings_q1
"""
df = pd.read_sql(query, conn)
df.head()

## Explore Earnings Estimates Before Imputation
* Back to [Table of Contents](#Table-of-Contents)

Viewing summary statistics on the number of missing wage values is a useful start.

In [None]:
print('total number of earnings records: {:,.0f}'.format(df.shape[0]))
print('number of individuals with any earnings: {:,.0f}'.format(df['ssn'].nunique()))
print('number of individuals missing values: {:,.0f}'\
.format(cohort['member_ssn'].nunique()-df['ssn'].nunique()))
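The subtraction above gives a count, but not which members are missing. As a sketch under assumed toy data (`toy_cohort` and `toy_wages` are hypothetical stand-ins for `cohort` and `df`), a left merge with `indicator=True` can list them:

```python
import pandas as pd

# hypothetical cohort of four people and wage records for two of them
toy_cohort = pd.DataFrame({'member_ssn': ['a', 'b', 'c', 'd']})
toy_wages = pd.DataFrame({
    'ssn': ['a', 'a', 'c'],
    'earnings': [1200.0, 300.0, 2500.0],
})

# one row per person with any wage record
has_wages = toy_wages[['ssn']].drop_duplicates()

# a left join keeps every cohort member; the _merge column flags unmatched rows
flagged = toy_cohort.merge(has_wages, how='left', left_on='member_ssn',
                           right_on='ssn', indicator=True)
missing = flagged.loc[flagged['_merge'] == 'left_only', 'member_ssn'].tolist()

print('members without earnings:', missing)
```

The length of `missing` should match the difference in unique counts computed above.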

Some cohort members have earnings records from more than one job in the quarter. How should we handle them?

In [None]:
# check the overall difference between sum and average of earnings
# (a list, rather than a set, keeps the column order predictable)
df.groupby('ssn')['earnings'].agg(['sum', 'mean']).describe()
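To see why the choice matters, here is a small sketch with hypothetical wage records in which one person holds two jobs. The sum reflects total quarterly earnings, while the mean reflects earnings per job.

```python
import pandas as pd

# hypothetical wage records: person 'a' has two employers in the quarter
toy_wages = pd.DataFrame({
    'ssn': ['a', 'a', 'c'],
    'ein': ['e1', 'e2', 'e3'],
    'earnings': [1200.0, 300.0, 2500.0],
})

# one row per person, with total earnings, earnings per job, and number of jobs
per_person = toy_wages.groupby('ssn')['earnings'].agg(['sum', 'mean', 'count'])
print(per_person)
```

For anyone with a single job the two measures coincide, so the difference in the summary above comes entirely from multiple-job holders.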

For this example, let's calculate the **sum** of earnings for each of our three outcome time horizons and add the results directly to our cohort table.

In [None]:
# list of [year, quarter] values we want to calculate
year_quarter = [[2015,1] ,[2015,2] , [2015,4]]

for yr, qtr in year_quarter:
    print(yr, qtr)

    query = '''
    ALTER TABLE cohort_2014 DROP COLUMN IF EXISTS earnings{year}q{quarter};
    commit;
    ALTER TABLE cohort_2014 ADD COLUMN earnings{year}q{quarter} numeric;
    commit;
    
    DROP TABLE IF EXISTS cohort_earnings_{year}q{quarter};
    commit;

    CREATE TEMP TABLE cohort_earnings_{year}q{quarter} AS
    select ssn, sum(earnings) earnings
    FROM (
        
        -- Illinois earnings
        SELECT ssn, 17 as state, ein, wage AS earnings
        FROM il_des_kcmo.il_wage
        WHERE year = {year} AND quarter = {quarter} 
            AND ssn IN (SELECT member_ssn FROM cohort_2014)
        UNION ALL
        -- Indiana earnings
        SELECT ssn, 18 as state, fein AS ein, wages AS earnings
        FROM in_dwd.wage_by_employer
        WHERE year = {year} AND quarter = {quarter} 
            AND ssn IN (SELECT member_ssn FROM cohort_2014)
    ) q
    GROUP BY 1;

    COMMIT;
    
    UPDATE cohort_2014 a SET earnings{year}q{quarter} = b.earnings
    FROM cohort_earnings_{year}q{quarter} b
    WHERE a.member_ssn = b.ssn;
    
    commit;
    
    '''.format(year=yr, quarter=qtr)
    conn.execute(query)
    print('completed {year}q{quarter}'.format(year=yr, quarter=qtr))

In [None]:
# pull in our cohort as "df"
df = pd.read_sql('select * from cohort_2014', conn)
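The SQL loop above sums earnings per person and writes them back only where a match exists, so members without wage records keep a missing value. A rough pandas equivalent on hypothetical data (`toy_cohort` and `toy_wages` are made up, and this is not the notebook's actual method) looks like this:

```python
import pandas as pd

# hypothetical cohort and wage records spanning two quarters
toy_cohort = pd.DataFrame({'member_ssn': ['a', 'b', 'c']})
toy_wages = pd.DataFrame({
    'ssn': ['a', 'a', 'c', 'a'],
    'year': [2015, 2015, 2015, 2015],
    'quarter': [1, 1, 1, 2],
    'earnings': [1200.0, 300.0, 2500.0, 900.0],
})

for yr, qtr in [[2015, 1], [2015, 2]]:
    # keep only this quarter's records and total them per person
    in_quarter = toy_wages[(toy_wages['year'] == yr) & (toy_wages['quarter'] == qtr)]
    totals = in_quarter.groupby('ssn')['earnings'].sum()
    # map totals onto the cohort; members with no record stay NaN
    toy_cohort['earnings{}q{}'.format(yr, qtr)] = toy_cohort['member_ssn'].map(totals)

print(toy_cohort)
```

The `NaN` entries are exactly the cases the rest of the notebook has to decide how to treat.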

<h3 style="color:red">Checkpoint: Generating Summary Statistics</h3>
Let's look at the distribution of earnings for our initial outcome results, without any imputation.

In [None]:
# hint: you can refer to the code above for the Pandas function for simple summary stats



## Impute Wage Values
- Back to [Table of Contents](#Table-of-Contents)

We will compare four ways of handling missing earnings: (a) dropping missing values (essentially what we summarized above), (b) setting missing values to zero, (c) imputing missing values as the mean for the overall cohort, and (d) imputing missing values as the mean by gender.


In [None]:
df['earnings2015q2'].mean()

In [None]:
# impute mean of the quarter's values we do have

for yr,q in year_quarter:
    col = 'earnings{}q{}'.format(yr,q)
    # calculate the mean of the non-missing values in this column
    value = df[col].mean()
    # fill missing values with the mean and store the result in a new column
    df[col + '_imp_mean'] = df[col].fillna(value)

# view results
df.describe()
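One thing to look for in the `describe()` output is that mean imputation leaves the mean unchanged but narrows the spread, because every filled value sits exactly at the center. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

# hypothetical earnings with half of the values missing
observed = pd.Series([1000.0, 2000.0, 6000.0, np.nan, np.nan, np.nan])
imputed = observed.fillna(observed.mean())

# the mean is preserved, but the standard deviation shrinks
print('mean before: {:,.0f}  after: {:,.0f}'.format(observed.mean(), imputed.mean()))
print('std before:  {:,.0f}  after: {:,.0f}'.format(observed.std(), imputed.std()))
```

The more values are missing, the more the imputed column understates the variability of earnings.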

<h3 style="color:red">Checkpoint: Imputing Values as zero</h3>

Now let's impute "missing" as simply zero to compare the outcome measures.


In [None]:
# impute earnings as zero
# hint: what could you change in the code above to fill missing values with 0?

for yr,q in year_quarter:
    

# view results
df.describe()

In [None]:
# and finally, let's impute missing values with the mean by gender
# we can do this by combining Pandas' "groupby" and "transform", like this:
df.groupby('gender')['earnings2015q1'].transform(lambda x: x.fillna(x.mean()))

In [None]:

for yr,q in year_quarter:
    old_col = 'earnings{}q{}'.format(yr,q)
    new_col = 'earnings{}q{}_mean_gender'.format(yr,q)
    df[new_col] = df.groupby('gender')[old_col].transform(lambda x: x.fillna(x.mean()))
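A caveat worth checking with group-wise imputation: if a group has no observed earnings at all, its mean is itself missing, so nothing gets filled. The sketch below uses a hypothetical table with a made-up `'U'` category to show this.

```python
import numpy as np
import pandas as pd

# hypothetical data: the 'U' group has no observed earnings
toy = pd.DataFrame({
    'gender': ['F', 'F', 'M', 'M', 'U'],
    'earnings2015q1': [1000.0, np.nan, 4000.0, np.nan, np.nan],
})

# same groupby + transform pattern as above
toy['earnings2015q1_mean_gender'] = (
    toy.groupby('gender')['earnings2015q1']
       .transform(lambda x: x.fillna(x.mean()))
)

print(toy)
print('still missing:', toy['earnings2015q1_mean_gender'].isna().sum())
```

Counting the remaining missing values after imputation is a quick way to spot such groups.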

In [None]:
df.describe().T

## Compare Distributions Through Visualization
- Back to [Table of Contents](#Table-of-Contents)

Visualization lets us quickly see whether any of the imputation methods has substantially altered the pre-imputation wage distribution. Side-by-side boxplots are an effective choice.

In [None]:
# see all distributions for 2015q1 side-by-side
fig,ax = plt.subplots(figsize = (15, 8))
df[['earnings2015q1', 'earnings2015q1_imp_zero', 
    'earnings2015q1_imp_mean', 'earnings2015q1_mean_gender']].\
boxplot(grid = False, vert = False)
ax.set(title = 'distribution of earnings in 2015-Q1',
       yticklabels = ['no imputation', 'imputed zero', 
                      'imputed mean', 'imputed mean by gender'],
       xlim = (-500,11000),
       xticks = (np.arange(0, 11000, 1000)))
plt.annotate('Sources: IL DES, IN DWD, IL DHS, IN FSSA', 
             xy=(0.75,-0.1), xycoords="axes fraction");
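A boxplot summarizes each column by its quartiles, so it can help to read the same numbers directly. The sketch below uses hypothetical values rather than the cohort data and computes the quartiles for three variants.

```python
import numpy as np
import pandas as pd

# hypothetical earnings with half of the values missing
raw = pd.Series([1000.0, 2000.0, 6000.0, np.nan, np.nan, np.nan])

variants = pd.DataFrame({
    'no imputation': raw,
    'imputed zero': raw.fillna(0),
    'imputed mean': raw.fillna(raw.mean()),
})

# 25th, 50th and 75th percentiles per column; NaN values are skipped
quartiles = variants.quantile([0.25, 0.5, 0.75])
print(quartiles)
```

In this toy case, zero imputation drags the median down, while mean imputation pulls the quartiles toward the mean. These are the same kinds of shift to look for in the plot.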

<h3 style="color:red">Checkpoint: Visualizing Other Quarters</h3>
Let's replicate the graph above for the other quarters.

In [None]:
# see all distributions for 2015q2 side-by-side
fig,ax = plt.subplots(figsize = (15, 8))
df[['earnings2015q2', 'earnings2015q2_imp_zero', 
    'earnings2015q2_imp_mean', 'earnings2015q2_mean_gender']].\
boxplot(grid = False, vert = False)
ax.set(title = 'distribution of earnings in 2015-Q2',
       yticklabels = ['no imputation', 'imputed zero', 
                      'imputed mean', 'imputed mean by gender'],
       xlim = (-500,11000),
       xticks = (np.arange(0, 11000, 1000)))
plt.annotate('Sources: IL DES, IN DWD, IL DHS, IN FSSA', 
             xy=(0.75,-0.1), xycoords="axes fraction");

In [None]:
## replicate for the Q4 values
