Alex Kappes <br>
3 Sep 2018 <br>
EconS 512 - **Problem Set 2**

**Question 2**. <br>
This assignment is completed using Jupyter Notebook in Python.

**(1)**, **(2)**, and **(3)**

*Data management*: Importing data and dropping observations with missing values. 

The total count of observation dropped due to null values is 202. The null values are not replaced with '0' because of the biased effects of introducing measurement error. The proportion of dropped observations in the data set is small enough that their exclusion will not distort estimation results. 

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import plotly.plotly as plt
import plotly.figure_factory as ff

data = pd.read_csv('/home/akappes/WSU/512_MetricsII/bwght2.csv')

# Drop nan values
null_list = data.index[data.isnull().any(axis=1)]
sub_data = data.drop(null_list)
sub_data = sub_data.reset_index()

print('Dropped observations:', len(null_list))
print('Remaining observations:', len(sub_data.index))
print('Proportion of dropped observations:', round(np.divide(len(null_list), len(data.index)), 4))

# Create treatment binary variable for mother smoking
sub_data['treatment'] = sub_data['cigs'].apply(lambda x: 1 if x > 0 else 0)

# Difference in ln birth weight, smoker v non smoker
smoker_list = sub_data.index[sub_data['treatment'] == 1]
nonsmoker_list = sub_data.index[sub_data['treatment'] == 0]
mean_diff = sub_data.loc[smoker_list, 'lbwght'].mean() - sub_data.loc[nonsmoker_list, 'lbwght'].mean()

print('The difference in log birthweight for mothers who smoke and those who do not is', round(mean_diff, 4))

Dropped observations: 220
Remaining observations: 1612
Proportion of dropped observations: 0.1201
The difference in log birthweight for mothers who smoke and those who do not is -0.063


The mean difference value provides that on average, mothers who smoke give birth to babies that weight $e^{-0.063} = 0.939$ less units. (pounds or ounces? Weight units is not specified.)

**(4)**. The estimation summary results for the OLS specified model $Y = \alpha T_i + \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$ for $i = \{0, 1\}$ are presented below.

In [2]:
# OLS Regression
x_vars = ['treatment', 'mage', 'meduc', 'monpre', 'npvis', 'fage', 'feduc', 'fblck', 'magesq', 'npvissq', 'mblck']
X = sm.add_constant(sub_data.loc[:, x_vars])
y = sub_data.loc[:, 'lbwght']
model = sm.OLS(y, X).fit()

model.summary()

0,1,2,3
Dep. Variable:,lbwght,R-squared:,0.03
Model:,OLS,Adj. R-squared:,0.023
Method:,Least Squares,F-statistic:,4.459
Date:,"Mon, 24 Sep 2018",Prob (F-statistic):,1.19e-06
Time:,11:19:33,Log-Likelihood:,445.31
No. Observations:,1612,AIC:,-866.6
Df Residuals:,1600,BIC:,-802.0
Df Model:,11,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,7.5602,0.140,53.910,0.000,7.285,7.835
treatment,-0.0525,0.017,-3.116,0.002,-0.086,-0.019
mage,0.0249,0.009,2.721,0.007,0.007,0.043
meduc,-0.0015,0.003,-0.523,0.601,-0.007,0.004
monpre,0.0125,0.004,2.913,0.004,0.004,0.021
npvis,0.0121,0.004,3.179,0.002,0.005,0.020
fage,0.0018,0.001,1.548,0.122,-0.000,0.004
feduc,0.0023,0.003,0.911,0.362,-0.003,0.007
fblck,0.0665,0.050,1.327,0.185,-0.032,0.165

0,1,2,3
Omnibus:,795.574,Durbin-Watson:,1.89
Prob(Omnibus):,0.0,Jarque-Bera (JB):,10828.201
Skew:,-1.962,Prob(JB):,0.0
Kurtosis:,15.075,Cond. No.,29000.0


**(5)**. The propensity score matching routine is conducted by using predicted logistic regression values and then matching treated scores to their untreated nearest neighbor. Matching is accomplished by taking a treated individual's propensity score and computing the distance between all other untreated individuals' propensity scores. The minimum distance between propensity scores provides a nearest neighbor for the treated individual, resulting in a match on other characteristic values. 

In [3]:
# Logit model for propensity scores
logistic_X = X.loc[:, X.columns != 'treatment']
logistic_y = sub_data.loc[:, 'treatment']
logistic_mod = sm.Logit(logistic_y, logistic_X).fit(disp=0)
prop_score = pd.DataFrame(logistic_mod.predict())
match_data = pd.concat([sub_data.loc[:, x_vars].reset_index(), prop_score], axis=1)
match_data = match_data.loc[:, match_data.columns != 'index']
match_data = match_data.rename(columns={0: 'score'})

In [4]:
# Matching
treat_data = match_data.loc[match_data['treatment'] == 1, :].reset_index()
nontreat_data = match_data.loc[match_data['treatment'] != 1, :].reset_index()

### Treated matching
dist = pd.DataFrame(np.empty([len(nontreat_data.index), len(treat_data.index)]))

for i in range(len(treat_data.index)):
    dist[i] = abs(treat_data.loc[i, 'score'] - nontreat_data['score'])

nearest_neighbor = list(dist.columns)

for i in range(len(dist.columns)):
    nearest_neighbor[i] = dist[i].idxmin()

treat_y_list = list(treat_data['index'])
treat_y = pd.DataFrame(sub_data.loc[treat_y_list, 'lbwght']).reset_index()
nearest_neighbor_y = pd.DataFrame(sub_data.loc[nearest_neighbor, 'lbwght']).reset_index()

### Untreated matching
control_dist = dist.T
control_nn = list(control_dist.columns)

for i in range(len(control_dist.columns)):
    control_nn[i] = control_dist[i].idxmin()

control_match = pd.DataFrame(0, index=control_nn, columns=['cm_lbwght'])

for i in control_nn:
    control_match.loc[i, 'cm_lbwght'] = treat_y.loc[i, 'lbwght']

t_match = pd.DataFrame(treat_y['lbwght'] - nearest_neighbor_y['lbwght']).sum()
u_match = pd.DataFrame(control_match['cm_lbwght'] - sub_data.loc[sub_data['treatment'] == 0, 'lbwght']).sum()

The matched average treatment effect is presented below.

In [5]:
# Estimated ATE and TT from logit propensity score
est_ate = (t_match[0] + u_match[0])/len(sub_data.index)        
est_tt = pd.DataFrame(treat_y['lbwght'] - nearest_neighbor_y['lbwght']).mean()

print('Matched ATE:', est_ate)

Matched ATE: -0.03995062577221544


**(6)** The matched treatment on the treated effect is presented below.

In [6]:
print('Matched TT:', est_tt[0])

Matched TT: -0.024255089137864352


The difference between matched ATE and TT provides information about average treatment effects across the whole sample of treated and untreated groups, versus treatment effects based only on the treated group and their closest match.

**(7)**. The densities for smoker and non-smoker propensity scores are shown below. 

In [7]:
# Density plots
group_labels = ['Treated', 'Non Treated']
plot_data = ff.create_distplot([treat_data['score'], nontreat_data.loc[list(nearest_neighbor_y['index']), 'score']], group_labels, bin_size=.001)
plot_data['layout'].update(title='Matched Propensity Score Densities')
plt.iplot(plot_data, filename='Treated_Nontreated_density_Matched')

In [8]:
plot_data = ff.create_distplot([treat_data['score'], nontreat_data['score']], group_labels, bin_size=.01)
plot_data['layout'].update(title='Unmatched Propensity Score Densities')
plt.iplot(plot_data, filename='Treated_Nontreated_density_Unmatched')

The propensity score densities for matched individuals are very close, which shows that the matching routine worked well. The densities for unmatched individual propensity scores reveal a wider distribution in the conditional probability of being placed in the "smoker" group given the characteristics of that individual. 