# Replication of LaLonde (1986)

This notebook replicates results from

LaLonde, Robert. “Evaluating the Econometric Evaluations of Training Programs with Experimental Data.” The American Economic Review, pp. 604-620, 1986.

First, we import all necessary packages.

In [332]:
from pandas import read_stata
from astropy.table import Table, Column
from numpy import mean
from numpy import std
from numpy import sqrt
import pandas as pd
from sklearn import linear_model
import math
from scipy import stats
import csv

Next we load the NSW, PSID and CPS data. The data were downloaded from 'http://users.nber.org/~rdehejia/data/nswdata2.html', but are no longer available there.
They were provided by Dehejia and Wahba and are the data from their paper

Dehejia, Rajeev, and Sadek Wahba. "Causal Effects in Non-Experimental Studies: Re-Evaluating the Evaluation of Training Programs." Journal of the American Statistical Association, Volume 94, Number 448 (December 1999), pp. 1053–1062.

The files contain the data of the NSW male participants that were included in the analysis, i.e. those with complete pre- and post-program earnings that were neither in supported work in 1978 nor had entered the program before 1976, as well as the PSID-1, PSID-3, CPS-1, CPS-2 and CPS-3 data that were used in LaLonde's paper. The PSID-2 data were not available, and the CPS-2 and CPS-3 data differ slightly from LaLonde's data. For the exact definition of the comparison groups, see footnote a of Table 3 in LaLonde's paper.

We read each Stata file into a pandas DataFrame.

In [333]:
nsw = read_stata('nsw.dta')
psid1 = read_stata('psid_controls.dta')
psid3 = read_stata('psid_controls3.dta')
cps1 = read_stata('cps_controls.dta')
cps2 = read_stata('cps_controls2.dta')
cps3 = read_stata('cps_controls3.dta')

#### Table 1:

We will now recreate the part of Table 1 we have the data for. That is to say, the average age, years of schooling, proportion of High School dropouts, proportion of married participants and race, for the NSW male participants.

As LaLonde uses the data of all NSW participants in Table 1, while Dehejia and Wahba only provided the data of those participants that were later included in the analysis, the results in this replication differ from LaLonde's results.

We start by dividing the data into the treatment and the control group. The first 297 observations are in the treatment group, the other 425 in the control group.

We call the data of the treatment group 'nswt' and the control group data 'nswc'.

In [334]:
nswt = nsw[:297]
nswc = nsw[297:]
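
Slicing by position only works while the file stays sorted with the treated rows first. A minimal sketch of an alternative, assuming the `treat` column (the same indicator used in the regressions later in this notebook) is 1 for the treatment group and 0 for the control group. The toy frame and the names `toy`, `toy_t` and `toy_c` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for the NSW frame; the real data come from 'nsw.dta'.
toy = pd.DataFrame({
    "treat": [1, 1, 0, 0, 0],
    "age": [23, 31, 19, 27, 40],
})

# Select rows by the value of the indicator instead of by position.
toy_t = toy[toy["treat"] == 1]
toy_c = toy[toy["treat"] == 0]

print(len(toy_t), len(toy_c))  # 2 3
```

On the real data, the two group sizes can then be compared with the 297 and 425 observations stated above.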

Then we calculate the means and standard deviations of the different characteristics presented in Table 1 for the treatment and control group.

#### Mean:

Treatment group:

In [335]:
nswtage = mean(nswt.age) #average age
nswtedu = mean(nswt.education) #Years of schooling
nswtnod = mean(nswt.nodegree) #Proportion of High School dropouts
nswtbla = mean(nswt.black) #Proportion of black participants
nswthis = mean(nswt.hispanic) #Proportion of hispanic participants
nswtmar = mean(nswt.married) #Proportion of married participants

Control group:

In [336]:
nswcage = mean(nswc.age)
nswcedu = mean(nswc.education)
nswcnod = mean(nswc.nodegree)
nswcbla = mean(nswc.black)
nswchis = mean(nswc.hispanic)
nswcmar = mean(nswc.married)

#### Standard deviation:

Treatment group:

In [337]:
nswtagest = std(nswt.age)
nswtedust = std(nswt.education)
nswtnodst = std(nswt.nodegree)
nswtblast = std(nswt.black)
nswthisst = std(nswt.hispanic)
nswtmarst = std(nswt.married)

Control group:

In [338]:
nswcagest = std(nswc.age)
nswcedust = std(nswc.education)
nswcnodst = std(nswc.nodegree)
nswcblast = std(nswc.black)
nswchisst = std(nswc.hispanic)
nswcmarst = std(nswc.married)
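
The per-variable cells above could also be written as one grouped computation. The following is only a sketch on made-up data (the frame `toy` and its values are assumptions, not the NSW sample). It also makes the degrees-of-freedom choice explicit: `numpy.std` uses `ddof=0` by default, whereas pandas' own `.std()` defaults to `ddof=1`.

```python
import numpy as np
import pandas as pd

# Made-up data with the same kind of columns as the NSW frame.
toy = pd.DataFrame({
    "treat": [1, 1, 1, 0, 0, 0],
    "age": [20, 24, 28, 22, 22, 25],
    "married": [0, 1, 0, 0, 0, 1],
})

# Mean of every column, separately for treat == 0 and treat == 1.
means = toy.groupby("treat").mean()

# Pass ddof=0 so the result matches numpy's default population formula.
stds = toy.groupby("treat").std(ddof=0)

print(means.round(2))
print(stds.round(2))
```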

With these values we can now recreate part of Table 1.

We call our table 'Table1' and round our values to two decimal places.

In [339]:
Table1 = Table()
Table1['Variable'] = ['Age', '', 'Years of School','', 'Proportion High school dropouts', '', 'Proportion married','', 'Proportion Black', '', 'Proportion Hispanic', '']
Table1['Treatment'] = [round(nswtage,2), [round(nswtagest,2)], round(nswtedu,2), [round(nswtedust,2)], round(nswtnod,2), [round(nswtnodst,2)], round(nswtmar,2), [round(nswtmarst,2)], round(nswtbla,2), [round(nswtblast,2)], round(nswthis,2), [round(nswthisst,2)]]
Table1['Control'] = [round(nswcage,2), [round(nswcagest,2)], round(nswcedu,2), [round(nswcedust,2)], round(nswcnod,2), [round(nswcnodst,2)], round(nswcmar,2), [round(nswcmarst,2)], round(nswcbla,2), [round(nswcblast,2)], round(nswchis,2), [round(nswchisst,2)]]

In [340]:
Table1

Variable,Treatment,Control
str31,object,object
Age,24.63,24.45
,[6.68],[6.58]
Years of School,10.38,10.19
,[1.81],[1.62]
Proportion High school dropouts,0.73,0.81
,[0.44],[0.39]
Proportion married,0.17,0.16
,[0.37],[0.36]
Proportion Black,0.8,0.8
,[0.4],[0.4]


We can see that our results are relatively close to LaLonde's.

#### Table 3:

Now we recreate parts of Table 3. This table shows the annual earnings of the treatment, control and comparison groups.

We do not have data on the earnings in 1976 and 1977, so we will not replicate this part of the table.

We proceed in the same way as we did for Table 1, by calculating the average earnings for each group and year.

In [341]:
nswtre75 = mean(nswt.re75)
nswtre78 = mean(nswt.re78)

nswcre75 = mean(nswc.re75)
nswcre78 = mean(nswc.re78)

psid1re75 = mean(psid1.re75)
psid1re78 = mean(psid1.re78)

psid3re75 = mean(psid3.re75)
psid3re78 = mean(psid3.re78)

cps1re75 = mean(cps1.re75)
cps1re78 = mean(cps1.re78)

cps2re75 = mean(cps2.re75)
cps2re78 = mean(cps2.re78)

cps3re75 = mean(cps3.re75)
cps3re78 = mean(cps3.re78)

Then we calculate the standard errors of these values.

In [342]:
nswtre75st = stats.sem(nswt.re75, axis=None, ddof=0)
nswtre78st = stats.sem(nswt.re78, axis=None, ddof=0)

nswcre75st = stats.sem(nswc.re75, axis=None, ddof=0)
nswcre78st = stats.sem(nswc.re78, axis=None, ddof=0)

psid1re75st = stats.sem(psid1.re75, axis=None, ddof=0)
psid1re78st = stats.sem(psid1.re78, axis=None, ddof=0)

psid3re75st = stats.sem(psid3.re75, axis=None, ddof=0)
psid3re78st = stats.sem(psid3.re78, axis=None, ddof=0)

cps1re75st = stats.sem(cps1.re75, axis=None, ddof=0)
cps1re78st = stats.sem(cps1.re78, axis=None, ddof=0)

cps2re75st = stats.sem(cps2.re75, axis=None, ddof=0)
cps2re78st = stats.sem(cps2.re78, axis=None, ddof=0)

cps3re75st = stats.sem(cps3.re75, axis=None, ddof=0)
cps3re78st = stats.sem(cps3.re78, axis=None, ddof=0)
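
For reference, the standard error of a mean is the standard deviation divided by the square root of the number of observations. The sketch below, on made-up earnings values, checks that `stats.sem(..., ddof=0)` as used above matches that formula. Note that scipy's default is `ddof=1`, so the explicit `ddof=0` matters.

```python
import numpy as np
from scipy import stats

# Made-up earnings, for illustration only.
x = np.array([0.0, 1000.0, 2500.0, 4000.0, 7500.0])

# Standard error by hand: population standard deviation over sqrt(n).
se_manual = np.std(x, ddof=0) / np.sqrt(len(x))

# The same quantity from scipy, with the ddof used in this notebook.
se_scipy = stats.sem(x, ddof=0)

# scipy's default (ddof=1) gives a slightly larger value.
se_default = stats.sem(x)

print(se_manual, se_scipy, se_default)
```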

We now round all these values and put them in a table that we call 'Table3'. We also add the number of observations for each group.

In [343]:
Table3 = Table()
Table3['Year'] = [1975, '', 1978, '', 'Number of Observations']
Table3['Treatments'] = [round(nswtre75), [round(nswtre75st)], round(nswtre78), [round(nswtre78st)], nswt.shape[0]]
Table3['Controls'] = [round(nswcre75), [round(nswcre75st)], round(nswcre78), [round(nswcre78st)], nswc.shape[0]]
Table3['PSID-1'] = [round(psid1re75), [round(psid1re75st)], round(psid1re78), [round(psid1re78st)], psid1.shape[0]]
Table3['PSID-3'] = [round(psid3re75), [round(psid3re75st)], round(psid3re78), [round(psid3re78st)], psid3.shape[0]]
Table3['CPS-SSA-1'] = [round(cps1re75), [round(cps1re75st)], round(cps1re78), [round(cps1re78st)], cps1.shape[0]]
Table3['CPS-SSA-2'] = [round(cps2re75), [round(cps2re75st)], round(cps2re78), [round(cps2re78st)], cps2.shape[0]]
Table3['CPS-SSA-3'] = [round(cps3re75), [round(cps3re75st)], round(cps3re78), [round(cps3re78st)], cps3.shape[0]]

In [344]:
Table3

Year,Treatments,Controls,PSID-1,PSID-3,CPS-SSA-1,CPS-SSA-2,CPS-SSA-3
str22,object,object,object,object,object,object,object
1975,3066,3027,19063,2611,13651,7397,2466
,[282.0],[252.0],[272.0],[491.0],[73.0],[167.0],[159.0]
1978,5976,5090,21554,5279,14847,10171,6984
,[401.0],[277.0],[312.0],[683.0],[76.0],[182.0],[352.0]
Number of Observations,297,425,2490,128,15992,2369,429
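
The same summary could be produced in one pass over a dictionary of dataframes. This is only a sketch: the helper `earnings_summary` and the toy groups "A" and "B" are hypothetical and not part of the original notebook, and it returns a pandas DataFrame rather than an astropy Table.

```python
import numpy as np
import pandas as pd

def earnings_summary(groups, columns=("re75", "re78")):
    """Mean, standard error (ddof=0) and row count for each named group."""
    rows = {}
    for name, frame in groups.items():
        n = len(frame)
        row = {}
        for col in columns:
            values = frame[col].to_numpy(dtype=float)
            row[f"{col} mean"] = values.mean()
            row[f"{col} se"] = values.std(ddof=0) / np.sqrt(n)
        row["n"] = n
        rows[name] = row
    # Outer keys become columns, so each group is one column as in Table 3.
    return pd.DataFrame(rows)

# Toy groups with made-up earnings.
toy_groups = {
    "A": pd.DataFrame({"re75": [0.0, 2000.0, 4000.0], "re78": [1000.0, 3000.0, 8000.0]}),
    "B": pd.DataFrame({"re75": [500.0, 1500.0], "re78": [2500.0, 3500.0]}),
}
summary = earnings_summary(toy_groups)
print(summary.round(1))
```

With the real data one would pass something like `{'Treatments': nswt, 'Controls': nswc, 'PSID-1': psid1, ...}`.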


The CPS-2 and CPS-3 data differ from the data in LaLonde's paper, which explains why the results differ slightly from LaLonde's results.

#### Table 5:

Now we can recreate Table 5. This table shows the earnings comparisons and estimated treatment effects of the NSW male participants.

###### Earnings Growth

The first column shows the earnings growth between 1975 and 1978 of each comparison group.

In order to replicate this column, we simply subtract the average earnings in 1975 from the average earnings in 1978 for each comparison group. We already calculated these values for Table 3.

We call this first part of Table 5 'Table 51'.

In [345]:
Table51 = Table()
Table51['Name of Comparison Group'] = ['Controls', 'PSID-1', 'PSID-3', 'CPS-SSA-1', 'CPS-SSA-2', 'CPS-SSA-3']
Table51['Earnings Growth 75-78'] = [round(nswcre78 - nswcre75), round(psid1re78 - psid1re75), round(psid3re78 - psid3re75), round(cps1re78 - cps1re75), round(cps2re78 - cps2re75), round(cps3re78 - cps3re75)]

In [346]:
Table51

Name of Comparison Group,Earnings Growth 75-78
str9,int32
Controls,2063
PSID-1,2491
PSID-3,2669
CPS-SSA-1,1196
CPS-SSA-2,2774
CPS-SSA-3,4518


###### Treatment earnings less comparison group earnings, unadjusted

Column 2 shows the unadjusted treatment earnings less the comparison group's earnings in 1975 and column 4 the same in 1978.

In order to replicate those two columns we have to subtract those values from each other. Again we already calculated the values for Table 3 and only have to subtract them now.

We recreate those two columns in a table that we call 'Table52'.

In [347]:
Table52 = Table()
Table52['Name of Comparison Group'] = ['Controls', 'PSID-1', 'PSID-3', 'CPS-SSA-1', 'CPS-SSA-2', 'CPS-SSA-3']
Table52['Treatment Earnings 1975 Unadjusted'] = [round(nswtre75 - nswcre75), round(nswtre75 - psid1re75), round(nswtre75 - psid3re75), round(nswtre75 - cps1re75), round(nswtre75 - cps2re75), round(nswtre75 - cps3re75)]
Table52['Treatment Earnings 1978 Unadjusted'] = [round(nswtre78 - nswcre78), round(nswtre78 - psid1re78), round(nswtre78 - psid3re78), round(nswtre78 - cps1re78), round(nswtre78 - cps2re78), round(nswtre78 - cps3re78)]

In [348]:
Table52

Name of Comparison Group,Treatment Earnings 1975 Unadjusted,Treatment Earnings 1978 Unadjusted
str9,int32,int32
Controls,39,886
PSID-1,-15997,-15578
PSID-3,455,697
CPS-SSA-1,-10585,-8871
CPS-SSA-2,-4331,-4195
CPS-SSA-3,600,-1008


###### Treatment earnings less comparison group earnings, adjusted 1975

Columns 3 and 5 adjust the results for age, education, race and high school dropout status.

We start with the NSW control group data.

As we do not have the column age squared in our dataframe, we need to add it first.

In [349]:
# Attribute assignment (nsw.agesq = nsw.age**2) does not create a new column,
# so we use item assignment, as for the comparison groups below.
nsw['agesq'] = nsw['age']**2

The next step is to run the regression. We take treatment (whether a participant was in the treatment group or not), age, education, race, high school dropout status and age squared as independent variables and earnings in 1975 as the dependent variable.

In [350]:
reg75nsw = linear_model.LinearRegression()
reg75nsw.fit(nsw[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nsw.re75)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Now we predict the earnings of participants and of non-participants at the sample means of the other covariates, by setting the variable 'treat' to 1 and 0.

In [351]:
reg75nsw.predict([[1, mean(nsw.age), mean(nsw.education), mean(nsw.black), mean(nsw.hispanic), mean(nsw.nodegree), mean(nsw.agesq)]])

array([3030.57868592])

In [352]:
reg75nsw.predict([[0, mean(nsw.age), mean(nsw.education), mean(nsw.black), mean(nsw.hispanic), mean(nsw.nodegree), mean(nsw.agesq)]])

array([3051.50462218])

The predicted earnings for participants are about 3031 and the predicted earnings for non-participants about 3052 (both rounded to whole dollars). We now subtract the non-participant earnings from the participant earnings. The result is the adjusted value, which we call re75nswadj.

In [353]:
re75nswadj = 3031-3052
re75nswadj

-21

The adjusted value is -21 dollars.
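
Because the fitted model is linear, the gap between the two predictions does not depend on where the other covariates are evaluated: it equals the coefficient on the treatment indicator. A minimal sketch on synthetic data (all numbers and variable names below are made up for illustration and are not the NSW sample):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a treatment indicator plus age and age squared.
rng = np.random.default_rng(0)
n = 200
treat = rng.integers(0, 2, size=n)
age = rng.uniform(18, 50, size=n)
X = np.column_stack([treat, age, age ** 2])
y = 500.0 * treat + 120.0 * age - 1.5 * age ** 2 + rng.normal(0, 300, size=n)

model = LinearRegression().fit(X, y)

# Predict at the covariate means with the indicator switched on and off.
means = X.mean(axis=0)
x1, x0 = means.copy(), means.copy()
x1[0], x0[0] = 1.0, 0.0
diff = model.predict([x1])[0] - model.predict([x0])[0]

# The difference of the two predictions is the coefficient on the indicator.
print(diff, model.coef_[0])
```

Reading `coef_` directly avoids rounding the two predictions by hand before subtracting them.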

To run the regression for the other comparison groups, we have to combine the NSW treatment group dataframe with the dataframe of each comparison group.

We redefine nswt, as we have now added age squared to the nsw dataframe and also need it for the other regressions.

In [354]:
nswt = nsw[:297]

We also need to add age squared to the comparison groups and drop the earnings in 1974, so that both dataframes have the same columns.

In [355]:
psid1['agesq'] = psid1['age']**2
psid1 = psid1.drop('re74',axis=1)

Now we can combine the NSW treatment group and the PSID-1 data in one dataframe.

In [356]:
nswpsid1 = pd.concat([nswt, psid1], ignore_index=True, sort=False)

From here on we do exactly the same as for the adjusted value of the control group: we run a regression and subtract the predicted earnings of the comparison group from the predicted earnings of the treatment group.

In [357]:
reg75psid1 = linear_model.LinearRegression()
reg75psid1.fit(nswpsid1[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswpsid1.re75)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [358]:
reg75psid1.predict([[1, mean(nswpsid1.age), mean(nswpsid1.education), mean(nswpsid1.black), mean(nswpsid1.hispanic), mean(nswpsid1.nodegree), mean(nswpsid1.agesq)]])

array([10547.34896355])

In [359]:
reg75psid1.predict([[0, mean(nswpsid1.age), mean(nswpsid1.education), mean(nswpsid1.black), mean(nswpsid1.hispanic), mean(nswpsid1.nodegree), mean(nswpsid1.agesq)]])

array([18170.99430488])

In [360]:
re75psid1adj = 10547 - 18171
re75psid1adj

-7624
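
Since the same three steps (combine, fit, subtract) are repeated for every comparison group, they could be wrapped in a small helper. The sketch below is one possible way to do this and not the notebook's own code: the names `adjusted_gap`, `toy_group` and `COVARIATES` are hypothetical, the data are synthetic, and the helper returns the coefficient on `treat` instead of subtracting two rounded predictions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

COVARIATES = ["treat", "age", "education", "black", "hispanic", "nodegree", "agesq"]

def adjusted_gap(treated, comparison, outcome, covariates=COVARIATES):
    """Stack both groups, regress `outcome` on `covariates`, return the treat coefficient."""
    stacked = pd.concat([treated, comparison], ignore_index=True, sort=False)
    model = LinearRegression().fit(stacked[covariates], stacked[outcome])
    return model.coef_[covariates.index("treat")]

def toy_group(n, treat, rng):
    """Synthetic group whose 1975 earnings are an exact linear function of the covariates."""
    frame = pd.DataFrame({
        "treat": treat,
        "age": rng.integers(18, 55, size=n),
        "education": rng.integers(6, 16, size=n),
        "black": rng.integers(0, 2, size=n),
        "hispanic": rng.integers(0, 2, size=n),
        "nodegree": rng.integers(0, 2, size=n),
    })
    frame["agesq"] = frame["age"] ** 2
    frame["re75"] = 2000.0 - 700.0 * frame["treat"] + 50.0 * frame["age"] + 100.0 * frame["education"]
    return frame

rng = np.random.default_rng(1)
gap = adjusted_gap(toy_group(60, 1, rng), toy_group(80, 0, rng), "re75")
print(round(gap, 2))
```

By construction the toy earnings differ by 700 between the groups once covariates are held fixed, so the helper should recover a gap of roughly -700. On the real data the call would look like `adjusted_gap(nswt, psid1, 're75')`.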

We repeat these steps for PSID-3 and for CPS-SSA-1, CPS-SSA-2 and CPS-SSA-3.

##### PSID-3:

In [361]:
psid3['agesq'] = psid3['age']**2
psid3 = psid3.drop('re74',axis=1)

In [362]:
nswpsid3 = pd.concat([nswt, psid3], ignore_index=True, sort=False)

In [363]:
reg75psid3 = linear_model.LinearRegression()
reg75psid3.fit(nswpsid3[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswpsid3.re75)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [364]:
reg75psid3.predict([[1, mean(nswpsid3.age), mean(nswpsid3.education), mean(nswpsid3.black), mean(nswpsid3.hispanic), mean(nswpsid3.nodegree), mean(nswpsid3.agesq)]])

array([3065.86858985])

In [365]:
reg75psid3.predict([[0, mean(nswpsid3.age), mean(nswpsid3.education), mean(nswpsid3.black), mean(nswpsid3.hispanic), mean(nswpsid3.nodegree), mean(nswpsid3.agesq)]])

array([2611.22809241])

In [366]:
re75psid3adj = 3066 - 2611
re75psid3adj

455

##### CPS-SSA-1:

In [367]:
cps1['agesq'] = cps1['age']**2
cps1 = cps1.drop('re74',axis=1)

In [368]:
nswcps1 = pd.concat([nswt, cps1], ignore_index=True, sort=False)

In [369]:
reg75cps1 = linear_model.LinearRegression()
reg75cps1.fit(nswcps1[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswcps1.re75)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [370]:
reg75cps1.predict([[1, mean(nswcps1.age), mean(nswcps1.education), mean(nswcps1.black), mean(nswcps1.hispanic), mean(nswcps1.nodegree), mean(nswcps1.agesq)]])

array([8888.80244224])

In [371]:
reg75cps1.predict([[0, mean(nswcps1.age), mean(nswcps1.education), mean(nswcps1.black), mean(nswcps1.hispanic), mean(nswcps1.nodegree), mean(nswcps1.agesq)]])

array([13542.66520565])

In [372]:
re75cps1adj = 8889 - 13543
re75cps1adj

-4654

##### CPS-SSA-2:

In [373]:
cps2['agesq'] = cps2['age']**2
cps2 = cps2.drop('re74',axis=1)

In [374]:
nswcps2 = pd.concat([nswt, cps2], ignore_index=True, sort=False)

In [375]:
reg75cps2 = linear_model.LinearRegression()
reg75cps2.fit(nswcps2[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswcps2.re75)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [376]:
reg75cps2.predict([[1, mean(nswcps2.age), mean(nswcps2.education), mean(nswcps2.black), mean(nswcps2.hispanic), mean(nswcps2.nodegree), mean(nswcps2.agesq)]])

array([5031.38131491])

In [377]:
reg75cps2.predict([[0, mean(nswcps2.age), mean(nswcps2.education), mean(nswcps2.black), mean(nswcps2.hispanic), mean(nswcps2.nodegree), mean(nswcps2.agesq)]])

array([7150.84865518])

In [378]:
re75cps2adj = 5031 - 7150
re75cps2adj

-2119

##### CPS-SSA-3:

In [379]:
cps3['agesq'] = cps3['age']**2
cps3 = cps3.drop('re74',axis=1)

In [380]:
nswcps3 = pd.concat([nswt, cps3], ignore_index=True, sort=False)

In [381]:
reg75cps3 = linear_model.LinearRegression()
reg75cps3.fit(nswcps3[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswcps3.re75)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [382]:
reg75cps3.predict([[1, mean(nswcps3.age), mean(nswcps3.education), mean(nswcps3.black), mean(nswcps3.hispanic), mean(nswcps3.nodegree), mean(nswcps3.agesq)]])

array([3285.59128467])

In [383]:
reg75cps3.predict([[0, mean(nswcps3.age), mean(nswcps3.education), mean(nswcps3.black), mean(nswcps3.hispanic), mean(nswcps3.nodegree), mean(nswcps3.agesq)]])

array([2314.52775315])

In [384]:
re75cps3adj = 3286 - 2315
re75cps3adj

971

###### Treatment earnings less comparison group earnings, adjusted 1978

Now we adjust the earnings in the post-training year 1978. The only difference from the adjustment we just did is that our dependent variable is now earnings in 1978.

##### NSW control group:

In [385]:
reg78nsw = linear_model.LinearRegression()
reg78nsw.fit(nsw[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nsw.re78)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [386]:
reg78nsw.predict([[1, mean(nsw.age), mean(nsw.education), mean(nsw.black), mean(nsw.hispanic), mean(nsw.nodegree), mean(nsw.agesq)]])

array([5924.57939016])

In [387]:
reg78nsw.predict([[0, mean(nsw.age), mean(nsw.education), mean(nsw.black), mean(nsw.hispanic), mean(nsw.nodegree), mean(nsw.agesq)]])

array([5126.22823567])

In [388]:
re78nswadj = 5925 - 5126
re78nswadj

799

##### PSID-1:

In [389]:
reg78psid1 = linear_model.LinearRegression()
reg78psid1.fit(nswpsid1[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswpsid1.re78)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [390]:
reg78psid1.predict([[1, mean(nswpsid1.age), mean(nswpsid1.education), mean(nswpsid1.black), mean(nswpsid1.hispanic), mean(nswpsid1.nodegree), mean(nswpsid1.agesq)]])

array([12686.25922394])

In [391]:
reg78psid1.predict([[0, mean(nswpsid1.age), mean(nswpsid1.education), mean(nswpsid1.black), mean(nswpsid1.hispanic), mean(nswpsid1.nodegree), mean(nswpsid1.agesq)]])

array([20753.5813371])

In [392]:
re78psid1adj = 12686 - 20754
re78psid1adj

-8068

##### PSID-3:

In [393]:
reg78psid3 = linear_model.LinearRegression()
reg78psid3.fit(nswpsid3[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswpsid3.re78)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [394]:
reg78psid3.predict([[1, mean(nswpsid3.age), mean(nswpsid3.education), mean(nswpsid3.black), mean(nswpsid3.hispanic), mean(nswpsid3.nodegree), mean(nswpsid3.agesq)]])

array([5613.05005719])

In [395]:
reg78psid3.predict([[0, mean(nswpsid3.age), mean(nswpsid3.education), mean(nswpsid3.black), mean(nswpsid3.hispanic), mean(nswpsid3.nodegree), mean(nswpsid3.agesq)]])

array([6122.26561517])

In [396]:
re78psid3adj = 5613 - 6122
re78psid3adj

-509

##### CPS-SSA-1:

In [397]:
reg78cps1 = linear_model.LinearRegression()
reg78cps1.fit(nswcps1[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswcps1.re78)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [398]:
reg78cps1.predict([[1, mean(nswcps1.age), mean(nswcps1.education), mean(nswcps1.black), mean(nswcps1.hispanic), mean(nswcps1.nodegree), mean(nswcps1.agesq)]])

array([10349.23562426])

In [399]:
reg78cps1.predict([[0, mean(nswcps1.age), mean(nswcps1.education), mean(nswcps1.black), mean(nswcps1.hispanic), mean(nswcps1.nodegree), mean(nswcps1.agesq)]])

array([14765.44721875])

In [400]:
re78cps1adj = 10349 - 14765
re78cps1adj

-4416

##### CPS-SSA-2:

In [401]:
reg78cps2 = linear_model.LinearRegression()
reg78cps2.fit(nswcps2[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswcps2.re78)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [402]:
reg78cps2.predict([[1, mean(nswcps2.age), mean(nswcps2.education), mean(nswcps2.black), mean(nswcps2.hispanic), mean(nswcps2.nodegree), mean(nswcps2.agesq)]])

array([7623.76263814])

In [403]:
reg78cps2.predict([[0, mean(nswcps2.age), mean(nswcps2.education), mean(nswcps2.black), mean(nswcps2.hispanic), mean(nswcps2.nodegree), mean(nswcps2.agesq)]])

array([9964.57647097])

In [404]:
re78cps2adj = 7624 - 9965
re78cps2adj

-2341

##### CPS-SSA-3:

In [405]:
reg78cps3 = linear_model.LinearRegression()
reg78cps3.fit(nswcps3[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq']],nswcps3.re78)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [406]:
reg78cps3.predict([[1, mean(nswcps3.age), mean(nswcps3.education), mean(nswcps3.black), mean(nswcps3.hispanic), mean(nswcps3.nodegree), mean(nswcps3.agesq)]])

array([6571.02588767])

In [407]:
reg78cps3.predict([[0, mean(nswcps3.age), mean(nswcps3.education), mean(nswcps3.black), mean(nswcps3.hispanic), mean(nswcps3.nodegree), mean(nswcps3.agesq)]])

array([6572.47224973])

In [408]:
re78cps3adj = 6571 - 6572
re78cps3adj

-1

We can now recreate the first half of Table 5. We call this 'Table53'. We take Table51 and add the values from Table52 and the adjusted values.

In [409]:
# Copy Table51 so that adding columns here does not modify Table51 itself.
Table53 = Table51.copy()
Table53['Treatment Earnings 1975 Unadjusted'] = [round(nswtre75 - nswcre75), round(nswtre75 - psid1re75), round(nswtre75 - psid3re75), round(nswtre75 - cps1re75), round(nswtre75 - cps2re75), round(nswtre75 - cps3re75)]
Table53['Treatment Earnings 1975 Adjusted'] = [re75nswadj, re75psid1adj, re75psid3adj, re75cps1adj, re75cps2adj, re75cps3adj]
Table53['Treatment Earnings 1978 Unadjusted'] = [round(nswtre78 - nswcre78), round(nswtre78 - psid1re78), round(nswtre78 - psid3re78), round(nswtre78 - cps1re78), round(nswtre78 - cps2re78), round(nswtre78 - cps3re78)]
Table53['Treatment Earnings 1978 Adjusted'] = [re78nswadj, re78psid1adj, re78psid3adj, re78cps1adj, re78cps2adj, re78cps3adj]

In [410]:
Table53

Name of Comparison Group,Earnings Growth 75-78,Treatment Earnings 1975 Unadjusted,Treatment Earnings 1975 Adjusted,Treatment Earnings 1978 Unadjusted,Treatment Earnings 1978 Adjusted
str9,int32,int32,int32,int32,int32
Controls,2063,39,-21,886,799
PSID-1,2491,-15997,-7624,-15578,-8068
PSID-3,2669,455,455,697,-509
CPS-SSA-1,1196,-10585,-4654,-8871,-4416
CPS-SSA-2,2774,-4331,-2119,-4195,-2341
CPS-SSA-3,4518,600,971,-1008,-1


###### Difference in Differences Without age

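The next step is to calculate the difference in differences for the earnings growth from 1975 to 1978. For the unadjusted estimator we simply have to calculate (Earnings 1978 treatment - Earnings 1978 comparison) - (Earnings 1975 treatment - Earnings 1975 comparison). The cell below computes the algebraically identical form (Earnings 1978 treatment - Earnings 1975 treatment) - (Earnings 1978 comparison - Earnings 1975 comparison).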

In [411]:
didnsw = (nswtre78 - nswtre75) - (nswcre78 - nswcre75)
didpsid1 = (nswtre78 - nswtre75) - (psid1re78 - psid1re75)
didpsid3 = (nswtre78 - nswtre75) - (psid3re78 - psid3re75)
didcps1 = (nswtre78 - nswtre75) - (cps1re78 - cps1re75)
didcps2 = (nswtre78 - nswtre75) - (cps2re78 - cps2re75)
didcps3 = (nswtre78 - nswtre75) - (cps3re78 - cps3re75)
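
As a quick check of the formula, the estimator for the experimental controls can be worked out by hand from the rounded means printed in the Table 3 output. Because the inputs are rounded, the result may differ by about a dollar from `didnsw`, which uses the unrounded means.

```python
# Rounded group means from the Table 3 output (treatment group and NSW controls).
treat_75, treat_78 = 3066, 5976
ctrl_75, ctrl_78 = 3027, 5090

# Post-period gap minus pre-period gap ...
did_levels = (treat_78 - ctrl_78) - (treat_75 - ctrl_75)

# ... equals the treatment group's growth minus the control group's growth.
did_growth = (treat_78 - treat_75) - (ctrl_78 - ctrl_75)

print(did_levels, did_growth)  # 847 847
```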

###### Difference in Differences With age

Now we calculate the difference in differences adjusted for age.

To do so we have to create the column 'Earnings in 1978' minus 'Earnings in 1975' in our dataframe. We call this column 'dif'.

In [412]:
nsw['dif'] = nsw['re78'] - nsw['re75']

We can now run a regression with dif as the dependent variable and treatment, age and age squared as independent variables. Then we can predict the difference in earnings for treatment and control participants, controlling for age.

In [413]:
regdifnsw = linear_model.LinearRegression()
regdifnsw.fit(nsw[['treat','age', 'agesq']],nsw.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [414]:
regdifnsw.predict([[1, mean(nsw.age), mean(nsw.agesq)]])

array([2916.15359374])

In [415]:
regdifnsw.predict([[0, mean(nsw.age), mean(nsw.agesq)]])

array([2059.24264464])

As the last step we subtract the predicted earnings growth of the control group from that of the treatment group.

In [416]:
didadjnsw = 2916 - 2059
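
Without the age terms, this regression collapses to the unadjusted estimator from the previous section: regressing dif on the treatment indicator alone returns the difference in mean earnings growth between the two groups. A sketch on synthetic data, using plain least squares from numpy instead of `LinearRegression` (both fit the same OLS model); all values are made up.

```python
import numpy as np

# Synthetic earnings for 300 people, for illustration only.
rng = np.random.default_rng(42)
n = 300
treat = rng.integers(0, 2, size=n).astype(float)
re75 = rng.uniform(0, 5000, size=n)
re78 = re75 + 1500 + 800 * treat + rng.normal(0, 1000, size=n)
dif = re78 - re75

# OLS of dif on a constant and the treatment indicator only.
X = np.column_stack([np.ones(n), treat])
coef, *_ = np.linalg.lstsq(X, dif, rcond=None)

# Difference in mean earnings growth between the two groups.
gap = dif[treat == 1].mean() - dif[treat == 0].mean()

print(coef[1], gap)
```

The two printed numbers agree up to floating-point error, which is why adding age and age squared to the regression can be read as an adjustment of the same estimator.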

We repeat these steps for the PSID and CPS data.

##### PSID-1:

In [417]:
nswpsid1['dif'] = nswpsid1['re78'] - nswpsid1['re75']

In [418]:
regdifpsid1 = linear_model.LinearRegression()
regdifpsid1.fit(nswpsid1[['treat','age', 'agesq']],nswpsid1.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [419]:
regdifpsid1.predict([[1, mean(nswpsid1.age), mean(nswpsid1.agesq)]])

array([1865.93886751])

In [420]:
regdifpsid1.predict([[0, mean(nswpsid1.age), mean(nswpsid1.agesq)]])

array([2615.14608603])

In [421]:
didadjpsid1 = 1866 - 2615

##### PSID-3:

In [422]:
nswpsid3['dif'] = nswpsid3['re78'] - nswpsid3['re75']

In [423]:
regdifpsid3 = linear_model.LinearRegression()
regdifpsid3.fit(nswpsid3[['treat','age', 'agesq']],nswpsid3.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [424]:
regdifpsid3.predict([[1, mean(nswpsid3.age), mean(nswpsid3.agesq)]])

array([2438.54529157])

In [425]:
regdifpsid3.predict([[0, mean(nswpsid3.age), mean(nswpsid3.agesq)]])

array([3763.10719288])

In [426]:
didadjpsid3 = 2439 - 3763
didadjpsid3

-1324

##### CPS-SSA-1:

In [427]:
nswcps1['dif'] = nswcps1['re78'] - nswcps1['re75']

In [428]:
regdifcps1 = linear_model.LinearRegression()
regdifcps1.fit(nswcps1[['treat','age', 'agesq']],nswcps1.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [429]:
regdifcps1.predict([[1, mean(nswcps1.age), mean(nswcps1.agesq)]])

array([1418.731062])

In [430]:
regdifcps1.predict([[0, mean(nswcps1.age), mean(nswcps1.agesq)]])

array([1223.55646855])

In [431]:
didadjcps1 = 1419 - 1224
didadjcps1

195

##### CPS-SSA-2:

In [432]:
nswcps2['dif'] = nswcps2['re78'] - nswcps2['re75']

In [433]:
regdifcps2 = linear_model.LinearRegression()
regdifcps2.fit(nswcps2[['treat','age', 'agesq']],nswcps2.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [434]:
regdifcps2.predict([[1, mean(nswcps2.age), mean(nswcps2.agesq)]])

array([2367.19655441])

In [435]:
regdifcps2.predict([[0, mean(nswcps2.age), mean(nswcps2.agesq)]])

array([2841.95919543])

In [436]:
didadjcps2 = 2367 - 2842
didadjcps2

-475

##### CPS-SSA-3:

In [437]:
nswcps3['dif'] = nswcps3['re78'] - nswcps3['re75']

In [438]:
regdifcps3 = linear_model.LinearRegression()
regdifcps3.fit(nswcps3[['treat','age', 'agesq']],nswcps3.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [439]:
regdifcps3.predict([[1, mean(nswcps3.age), mean(nswcps3.agesq)]])

array([2975.52584962])

In [440]:
regdifcps3.predict([[0, mean(nswcps3.age), mean(nswcps3.agesq)]])

array([4472.49699132])

In [441]:
didadjcps3 = 2976 - 4472
didadjcps3

-1496

###### Unrestricted Difference in Differences, Unadjusted

To calculate the unrestricted difference-in-differences estimator, we compare the earnings growth of the treatment and comparison groups while holding earnings in 1975 constant, i.e. treatment and earnings in 1975 are our independent variables and dif remains the dependent variable.

In [442]:
regudifnsw = linear_model.LinearRegression()
regudifnsw.fit(nsw[['treat', 're75']],nsw.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [443]:
regudifnsw.predict([[1, mean(nsw.re75)]])

array([2929.02711368])

In [444]:
regudifnsw.predict([[0, mean(nsw.re75)]])

array([2050.246413])

In [445]:
udidnsw = 2929 - 2050
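
In equation form, the regression fitted above can be written as follows. The symbols $\alpha$, $\delta$, $\beta$ and $\varepsilon_i$ are notation introduced here for illustration and do not appear in the notebook.

$$
\text{dif}_i = \text{re78}_i - \text{re75}_i = \alpha + \delta\,\text{treat}_i + \beta\,\text{re75}_i + \varepsilon_i
$$

Moving $\text{re75}_i$ to the right-hand side gives the equivalent regression in levels:

$$
\text{re78}_i = \alpha + \delta\,\text{treat}_i + (1 + \beta)\,\text{re75}_i + \varepsilon_i
$$

The simple difference in differences imposes $\beta = 0$, i.e. a coefficient of exactly one on 1975 earnings, while the unrestricted version estimates that coefficient freely. In both cases the difference between the two predictions at the mean of re75 equals $\delta$.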

##### PSID-1:

In [446]:
regudifpsid1 = linear_model.LinearRegression()
regudifpsid1.fit(nswpsid1[['treat', 're75']],nswpsid1.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [447]:
regudifpsid1.predict([[1, mean(nswpsid1.re75)]])

array([408.86357957])

In [448]:
regudifpsid1.predict([[0, mean(nswpsid1.re75)]])

array([2788.94512254])

In [449]:
udidpsid1 = 409 - 2789

##### PSID-3:

In [450]:
regudifpsid3 = linear_model.LinearRegression()
regudifpsid3.fit(nswpsid3[['treat', 're75']],nswpsid3.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [451]:
regudifpsid3.predict([[1, mean(nswpsid3.re75)]])

array([3026.88518407])

In [452]:
regudifpsid3.predict([[0, mean(nswpsid3.re75)]])

array([2397.97826268])

In [453]:
udidpsid3 = 3027 - 2398
udidpsid3

629

##### CPS-SSA-1:

In [454]:
regudifcps1 = linear_model.LinearRegression()
regudifcps1.fit(nswcps1[['treat', 're75']],nswcps1.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [455]:
regudifcps1.predict([[1, mean(nswcps1.re75)]])

array([-288.29321213])

In [456]:
regudifcps1.predict([[0, mean(nswcps1.re75)]])

array([1255.21826248])

In [457]:
udidcps1 = -288 - 1255
udidcps1

-1543

##### CPS-SSA-2:

In [458]:
regudifcps2 = linear_model.LinearRegression()
regudifcps2.fit(nswcps2[['treat', 're75']],nswcps2.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [459]:
regudifcps2.predict([[1, mean(nswcps2.re75)]])

array([1323.82167511])

In [460]:
regudifcps2.predict([[0, mean(nswcps2.re75)]])

array([2972.76686554])

In [461]:
udidcps2 = 1324 - 2973
udidcps2

-1649

##### CPS-SSA-3:

In [462]:
regudifcps3 = linear_model.LinearRegression()
regudifcps3.fit(nswcps3[['treat', 're75']],nswcps3.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [463]:
regudifcps3.predict([[1, mean(nswcps3.re75)]])

array([3148.37921699])

In [464]:
regudifcps3.predict([[0, mean(nswcps3.re75)]])

array([4352.82965644])

In [465]:
udidcps3 = 3148 - 4353
udidcps3

-1205

###### Unrestricted Difference in Differences, Adjusted

For the adjusted unrestricted difference-in-differences estimator we add age, education, race, high school dropout status and age squared to the independent variables.

In [466]:
regudifnswa = linear_model.LinearRegression()
regudifnswa.fit(nsw[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq', 're75']],nsw.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [467]:
regudifnswa.predict([[1, mean(nsw.age), mean(nsw.education), mean(nsw.black), mean(nsw.hispanic), mean(nsw.nodegree), mean(nsw.agesq), mean(nsw.re75)]])

array([2883.85953327])

In [468]:
regudifnswa.predict([[0, mean(nsw.age), mean(nsw.education), mean(nsw.black), mean(nsw.hispanic), mean(nsw.nodegree), mean(nsw.agesq), mean(nsw.re75)]])

array([2081.81097573])

In [469]:
udidnswa = 2884 - 2082
udidnswa

802
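
The estimators in this notebook differ only in the dependent variable and the list of regressors, so they could be organised as a table of specifications and evaluated in a loop. The following is a hedged sketch, not the notebook's method: `treat_coefficient`, `BASE` and `SPECS` are hypothetical names, the data are synthetic, and the fit uses numpy's least squares instead of `LinearRegression`.

```python
import numpy as np
import pandas as pd

def treat_coefficient(frame, outcome, regressors):
    """OLS of `outcome` on a constant plus `regressors`; returns the coefficient on 'treat'."""
    X = np.column_stack([np.ones(len(frame)), frame[regressors].to_numpy(dtype=float)])
    y = frame[outcome].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1 + regressors.index("treat")]  # position 0 is the constant

BASE = ["treat", "age", "education", "black", "hispanic", "nodegree", "agesq"]
SPECS = {
    "adjusted 1975": ("re75", BASE),
    "adjusted 1978": ("re78", BASE),
    "DiD with age": ("dif", ["treat", "age", "agesq"]),
    "unrestricted DiD": ("dif", ["treat", "re75"]),
    "unrestricted DiD, adjusted": ("dif", BASE + ["re75"]),
}

# Synthetic data in which earnings growth is exactly 1000 + 900 * treat.
rng = np.random.default_rng(7)
n = 200
toy = pd.DataFrame({
    "treat": rng.integers(0, 2, size=n),
    "age": rng.integers(18, 55, size=n),
    "education": rng.integers(6, 16, size=n),
    "black": rng.integers(0, 2, size=n),
    "hispanic": rng.integers(0, 2, size=n),
    "nodegree": rng.integers(0, 2, size=n),
    "re75": rng.uniform(0, 10000, size=n),
})
toy["agesq"] = toy["age"] ** 2
toy["re78"] = toy["re75"] + 1000.0 + 900.0 * toy["treat"]
toy["dif"] = toy["re78"] - toy["re75"]

results = {name: treat_coefficient(toy, outcome, cols)
           for name, (outcome, cols) in SPECS.items()}
for name, value in results.items():
    print(f"{name}: {value:.1f}")
```

In this toy example all three difference-in-differences specifications recover the built-in effect of about 900. With the real data, `toy` would be replaced by `nsw` or one of the combined frames such as `nswpsid1`.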

##### PSID-1:

In [470]:
regudifpsid1a = linear_model.LinearRegression()
regudifpsid1a.fit(nswpsid1[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq', 're75']],nswpsid1.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [471]:
regudifpsid1a.predict([[1, mean(nswpsid1.age), mean(nswpsid1.education), mean(nswpsid1.black), mean(nswpsid1.hispanic), mean(nswpsid1.nodegree), mean(nswpsid1.agesq), mean(nswpsid1.re75)]])

array([642.48343767])

In [472]:
regudifpsid1a.predict([[0, mean(nswpsid1.age), mean(nswpsid1.education), mean(nswpsid1.black), mean(nswpsid1.hispanic), mean(nswpsid1.nodegree), mean(nswpsid1.agesq), mean(nswpsid1.re75)]])

array([2761.08012124])

In [473]:
udidpsid1a = 642 - 2761
udidpsid1a

-2119

##### PSID-3:

In [474]:
regudifpsid3a = linear_model.LinearRegression()
regudifpsid3a.fit(nswpsid3[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq', 're75']],nswpsid3.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [475]:
regudifpsid3a.predict([[1, mean(nswpsid3.age), mean(nswpsid3.education), mean(nswpsid3.black), mean(nswpsid3.hispanic), mean(nswpsid3.nodegree), mean(nswpsid3.agesq), mean(nswpsid3.re75)]])

array([2671.208132])

In [476]:
regudifpsid3a.predict([[0, mean(nswpsid3.age), mean(nswpsid3.education), mean(nswpsid3.black), mean(nswpsid3.hispanic), mean(nswpsid3.nodegree), mean(nswpsid3.agesq), mean(nswpsid3.re75)]])

array([3223.25956288])

In [477]:
udidpsid3a = 2671 - 3223
udidpsid3a

-552

##### CPS-SSA-1:

In [478]:
regudifcps1a = linear_model.LinearRegression()
regudifcps1a.fit(nswcps1[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq', 're75']],nswcps1.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [479]:
regudifcps1a.predict([[1, mean(nswcps1.age), mean(nswcps1.education), mean(nswcps1.black), mean(nswcps1.hispanic), mean(nswcps1.nodegree), mean(nswcps1.agesq), mean(nswcps1.re75)]])

array([144.8538476])

In [480]:
regudifcps1a.predict([[0, mean(nswcps1.age), mean(nswcps1.education), mean(nswcps1.black), mean(nswcps1.hispanic), mean(nswcps1.nodegree), mean(nswcps1.agesq), mean(nswcps1.re75)]])

array([1247.17629803])

In [481]:
udidcps1a = 145 - 1247
udidcps1a

-1102

##### CPS-SSA-2:

In [482]:
regudifcps2a = linear_model.LinearRegression()
regudifcps2a.fit(nswcps2[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq', 're75']],nswcps2.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [483]:
regudifcps2a.predict([[1, mean(nswcps2.age), mean(nswcps2.education), mean(nswcps2.black), mean(nswcps2.hispanic), mean(nswcps2.nodegree), mean(nswcps2.agesq), mean(nswcps2.re75)]])

array([1786.26207465])

In [484]:
regudifcps2a.predict([[0, mean(nswcps2.age), mean(nswcps2.education), mean(nswcps2.black), mean(nswcps2.hispanic), mean(nswcps2.nodegree), mean(nswcps2.agesq), mean(nswcps2.re75)]])

array([2914.79096506])

In [485]:
udidcps2a = 1786 - 2915
udidcps2a

-1129

##### CPS-SSA-3:

In [486]:
regudifcps3a = linear_model.LinearRegression()
regudifcps3a.fit(nswcps3[['treat','age', 'education','black', 'hispanic', 'nodegree', 'agesq', 're75']],nswcps3.dif)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [487]:
regudifcps3a.predict([[1, mean(nswcps3.age), mean(nswcps3.education), mean(nswcps3.black), mean(nswcps3.hispanic), mean(nswcps3.nodegree), mean(nswcps3.agesq), mean(nswcps3.re75)]])

array([3704.73514774])

In [488]:
regudifcps3a.predict([[0, mean(nswcps3.age), mean(nswcps3.education), mean(nswcps3.black), mean(nswcps3.hispanic), mean(nswcps3.nodegree), mean(nswcps3.agesq), mean(nswcps3.re75)]])

array([3967.65951141])

In [489]:
udidcps3a = 3705 - 3968
udidcps3a

-263
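The same three steps (fit, predict at the covariate means with `treat` set to 1 and to 0, subtract) are repeated for each comparison group. They could be wrapped in one helper, as in the hedged sketch below. The function name `unrestricted_did`, the synthetic `demo` frame and its numbers are assumptions made for illustration only. The helper uses NumPy least squares with an intercept, which should give the same fit as `linear_model.LinearRegression` with its default settings.

```python
import numpy as np
import pandas as pd


def unrestricted_did(df, covariates, outcome="dif", treat="treat"):
    """Gap between predicted outcomes at treat=1 and treat=0,
    with all other covariates held at their sample means."""
    cols = [treat] + list(covariates)
    # Design matrix with an explicit intercept column
    X = np.column_stack([np.ones(len(df)), df[cols].to_numpy(dtype=float)])
    y = df[outcome].to_numpy(dtype=float)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)

    # Evaluation points: intercept, treat indicator, covariate means
    means = df[cols].mean().to_numpy(dtype=float)
    x1 = np.concatenate([[1.0], means])
    x1[1] = 1.0          # treated
    x0 = x1.copy()
    x0[1] = 0.0          # untreated
    return float(x1 @ beta - x0 @ beta)


# Synthetic stand-in for a combined NSW/comparison-group frame (assumption)
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    "treat": rng.integers(0, 2, size=n),
    "age": rng.integers(18, 55, size=n),
    "re75": rng.normal(3000.0, 1000.0, size=n),
})
# Assumed true treatment effect of -500
demo["dif"] = (1000.0 - 500.0 * demo["treat"] + 20.0 * demo["age"]
               - 0.2 * demo["re75"] + rng.normal(0.0, 50.0, size=n))

# One estimate per comparison group; here only the synthetic frame
groups = {"demo": demo}
estimates = {name: unrestricted_did(frame, ["age", "re75"])
             for name, frame in groups.items()}
print(estimates)
```

In the notebook, `groups` would map the group names to the frames `nsw`, `nswpsid1`, `nswpsid3`, `nswcps1`, `nswcps2` and `nswcps3`, and the covariate list would be the one passed to `fit` above. Because the helper does not round the predictions before subtracting, its results can differ by a dollar from the hand-computed values.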

We can now recreate Table 5 in its entirety, except for column 10, which relies on data that we do not have.

In [490]:
# Start from the columns already assembled in Table53; the copy leaves Table53 unchanged
Table5 = Table53.copy()
Table5['Diff in Diff Without age'] = [round(didnsw), round(didpsid1), round(didpsid3), round(didcps1), round(didcps2), round(didcps3)]
Table5['Diff in Diff With Age'] = [didadjnsw, didadjpsid1, didadjpsid3, didadjcps1, didadjcps2, didadjcps3]
Table5['Unrestricted Diff in Diff Unadjusted'] = [udidnsw, udidpsid1, udidpsid3, udidcps1, udidcps2, udidcps3]
Table5['Unrestricted Diff in Diff Adjusted'] = [udidnswa, udidpsid1a, udidpsid3a, udidcps1a, udidcps2a, udidcps3a]
Table5

| Name of Comparison Group | Earnings Growth 75-78 | Treatment Earnings 1975 Unadjusted | Treatment Earnings 1975 Adjusted | Treatment Earnings 1978 Unadjusted | Treatment Earnings 1978 Adjusted | Diff in Diff Without age | Diff in Diff With Age | Unrestricted Diff in Diff Unadjusted | Unrestricted Diff in Diff Adjusted |
|---|---|---|---|---|---|---|---|---|---|
| str9 | int32 | int32 | int32 | int32 | int32 | int32 | int32 | int32 | int32 |
| Controls | 2063 | 39 | -21 | 886 | 799 | 847 | 857 | 879 | 802 |
| PSID-1 | 2491 | -15997 | -7624 | -15578 | -8068 | 420 | -749 | -2380 | -2119 |
| PSID-3 | 2669 | 455 | 455 | 697 | -509 | 242 | -1324 | 629 | -552 |
| CPS-SSA-1 | 1196 | -10585 | -4654 | -8871 | -4416 | 1714 | 195 | -1543 | -1102 |
| CPS-SSA-2 | 2774 | -4331 | -2119 | -4195 | -2341 | 136 | -475 | -1649 | -1129 |
| CPS-SSA-3 | 4518 | 600 | 971 | -1008 | -1 | -1607 | -1496 | -1205 | -263 |
