In [1]:
import pandas as pd
from statsmodels.regression.linear_model import OLS
from linearmodels.iv import IV2SLS
import numpy as np

# 14.32 Problem Set 4

## Problem 4

We load the data from the whitespace-separated Angrist and Krueger data file.

In [2]:
# Raw string so that '\s' is passed to the regex engine rather than parsed as a string escape
df = pd.read_csv('asciiqob.txt', sep=r'\s+', names=['lwklywge', 'educ',
                                                    'yob', 'qob', 'pob'])

We add a constant column so that the OLS regressions include an intercept.

In [3]:
df['_cons'] = 1

### Part A

First, we naively regress log weekly wages (`lwklywge`) on years of education. This estimate is likely biased, but it gives us a baseline against which to compare the instrumental variables coefficient. Note that we use heteroskedasticity-robust (HC1) standard errors.

In [4]:
naive_model = OLS(df['lwklywge'], df[['_cons', 'educ']])
naive_res = naive_model.fit(cov_type='HC1', use_t=True)
print(naive_res.summary())

                            OLS Regression Results                            
Dep. Variable:               lwklywge   R-squared:                       0.117
Model:                            OLS   Adj. R-squared:                  0.117
Method:                 Least Squares   F-statistic:                 3.458e+04
Date:                Thu, 11 Apr 2019   Prob (F-statistic):               0.00
Time:                        20:29:00   Log-Likelihood:            -3.1935e+05
No. Observations:              329509   AIC:                         6.387e+05
Df Residuals:                  329507   BIC:                         6.387e+05
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          4.9952      0.005    984.491      0.0

The coefficient of the effect of education on wages is,

In [5]:
naive_res.params['educ']

0.07085103867008755

The heteroskedasticity-robust standard error of the coefficient is,

In [6]:
naive_res.HC1_se['educ']

0.0003810233853165221
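
As an illustration of what the `HC1` option computes, here is a minimal sketch of the sandwich formula with the $n/(n-k)$ scaling. The helper name `ols_hc1` and the synthetic data are hypothetical; they are not part of the problem set and do not use the Angrist and Krueger sample.

```python
import numpy as np

def ols_hc1(X, y):
    """Return OLS coefficients and HC1 standard errors for design matrix X."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    # "Meat" of the sandwich: sum over i of e_i^2 * x_i x_i'
    meat = (X * resid[:, None] ** 2).T @ X
    # HC1 applies the n / (n - k) degrees-of-freedom scaling
    cov = XtX_inv @ meat @ XtX_inv * n / (n - k)
    return beta, np.sqrt(np.diag(cov))

# Synthetic data (assumption: made up for illustration only)
rng = np.random.default_rng(0)
n = 5000
x = rng.uniform(8, 20, n)
# The error standard deviation grows with x, so the errors are heteroskedastic
y = 5.0 + 0.07 * x + rng.normal(0, 0.1 * x, n)
X = np.column_stack([np.ones(n), x])

beta, se = ols_hc1(X, y)
print(beta, se)
```

Applied to the notebook's data, the same function could be called with `df[['_cons', 'educ']].to_numpy()` and `df['lwklywge'].to_numpy()` to compare against the `HC1_se` value above.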

### Part B

Refer to the attached writeup.

### Part C

Refer to the attached writeup.

### Part D

First, we form the instrumental variable discussed in Part C. $z$ is an indicator variable for whether the person was born in the second half of the year (third or fourth quarter).

In [7]:
df['z'] = (df['qob'] > 2).astype(np.int64)

We can check whether the relevance condition is satisfied by regressing years of schooling on $z$. Note that we are still using heteroskedasticity-robust standard errors.

In [8]:
first_stage_model = OLS(df['educ'], df[['_cons', 'z']])
first_stage_res = first_stage_model.fit(cov_type='HC1', use_t=True)
print(first_stage_res.summary())

                            OLS Regression Results                            
Dep. Variable:                   educ   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     85.39
Date:                Thu, 11 Apr 2019   Prob (F-statistic):           2.46e-20
Time:                        20:29:00   Log-Likelihood:            -8.5904e+05
No. Observations:              329509   AIC:                         1.718e+06
Df Residuals:                  329507   BIC:                         1.718e+06
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons         12.7161      0.008   1540.925      0.0

The coefficient of the effect of $z$ on the years of education is,

In [9]:
first_stage_res.params['z']

0.10569066631425825

The heteroskedasticity-robust standard error for the coefficient is,

In [10]:
first_stage_res.HC1_se['z']

0.011437457674739538

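The $t$-statistic on $z$ is about $0.1057 / 0.0114 \approx 9.24$, which corresponds to the first-stage $F$-statistic of 85.39 reported above. This is well above the conventional rule-of-thumb threshold of 10, so we can confirm that the instrumental variable $z$ is relevant.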
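
Because $z$ is binary, the first-stage slope is simply the difference in mean schooling between the two groups. The sketch below checks this on synthetic data; the variable values are assumptions chosen for illustration, not the actual sample.

```python
import numpy as np

# Synthetic data (assumption: made up for illustration only)
rng = np.random.default_rng(1)
n = 10_000
z = rng.integers(0, 2, n)                      # hypothetical binary instrument
educ = 12.7 + 0.1 * z + rng.normal(0, 3, n)    # hypothetical years of schooling

# OLS slope of educ on z, with a constant, via least squares
X = np.column_stack([np.ones(n), z])
slope = np.linalg.lstsq(X, educ, rcond=None)[0][1]

# Difference in group means between z = 1 and z = 0
diff = educ[z == 1].mean() - educ[z == 0].mean()
print(slope, diff)
```

On the notebook's data, `df.groupby('z')['educ'].mean()` would give the two group means directly.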

### Part E

We next run the reduced-form regression of weekly wages on $z$, again with heteroskedasticity-robust standard errors.

In [11]:
reduced_model = OLS(df['lwklywge'], df[['_cons', 'z']])
reduced_res = reduced_model.fit(cov_type='HC1', use_t=True)
print(reduced_res.summary())

                            OLS Regression Results                            
Dep. Variable:               lwklywge   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     25.66
Date:                Thu, 11 Apr 2019   Prob (F-statistic):           4.08e-07
Time:                        20:29:00   Log-Likelihood:            -3.3989e+05
No. Observations:              329509   AIC:                         6.798e+05
Df Residuals:                  329507   BIC:                         6.798e+05
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          5.8938      0.002   3470.536      0.0

The coefficient of the effect of $z$ on weekly wages is,

In [12]:
reduced_res.params['z']

0.011984652902653358

The heteroskedasticity-robust standard error for the coefficient is,

In [13]:
reduced_res.HC1_se['z']

0.002365942739518425

### Part F

Using the reduced form coefficient and the first stage coefficient, we can form the indirect least squares estimate for the effect of education on weekly wages.

In [14]:
reduced_res.params['z'] / first_stage_res.params['z']

0.1133936734490675
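
Written out, with $\hat{\pi}_{\text{RF}}$ and $\hat{\pi}_{\text{FS}}$ as shorthand (notation introduced here, not in the problem set) for the reduced-form and first-stage coefficients on $z$, the calculation is

$$\hat{\beta}_{\text{ILS}} = \frac{\hat{\pi}_{\text{RF}}}{\hat{\pi}_{\text{FS}}} = \frac{0.011985}{0.105691} \approx 0.1134.$$

Since $z$ is binary, each coefficient is a difference in group means, so this ratio is also the Wald estimator: the difference in mean wages between the two halves of the year divided by the difference in mean schooling.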

### Part G

We now do manual two-stage least squares by regressing weekly wage on the predicted years of schooling from the first stage regression.

First we add the first-stage predictions to the data frame.

In [15]:
df['fst'] = first_stage_res.predict(df[['_cons', 'z']])

Now we run the second stage regression.

In [16]:
second_stage_model = OLS(df['lwklywge'], df[['_cons', 'fst']])
second_stage_res = second_stage_model.fit(cov_type='HC1', use_t=True)
print(second_stage_res.summary())

                            OLS Regression Results                            
Dep. Variable:               lwklywge   R-squared:                       0.000
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     25.66
Date:                Thu, 11 Apr 2019   Prob (F-statistic):           4.08e-07
Time:                        20:29:01   Log-Likelihood:            -3.3989e+05
No. Observations:              329509   AIC:                         6.798e+05
Df Residuals:                  329507   BIC:                         6.798e+05
Df Model:                           1                                         
Covariance Type:                  HC1                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
_cons          4.4519      0.286     15.573      0.0

The manual two-stage regression estimate of the effect of education on weekly wage is,

In [17]:
second_stage_res.params['fst']

0.11339367344969792

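This matches the indirect least squares estimate from Part F up to floating-point error, as expected: with one instrument and one endogenous regressor, the two estimators are algebraically identical. The standard errors reported by the second-stage OLS are not valid 2SLS standard errors, however, because they are computed from residuals that use the fitted rather than the actual years of schooling.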
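
The equivalence between indirect least squares and manual two-stage least squares in the just-identified case can be checked with a small sketch. The data-generating process, the `ability` confounder, and the helper `ols` are all hypothetical and serve only to illustrate the algebra.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Synthetic data (assumption: made up for illustration only)
rng = np.random.default_rng(2)
n = 10_000
z = rng.integers(0, 2, n).astype(float)
ability = rng.normal(0, 1, n)                  # unobserved confounder
educ = 12 + 0.5 * z + ability + rng.normal(0, 1, n)
lwage = 4.5 + 0.1 * educ + 0.3 * ability + rng.normal(0, 0.5, n)

Z = np.column_stack([np.ones(n), z])
first = ols(Z, educ)        # first stage: educ on z
reduced = ols(Z, lwage)     # reduced form: wage on z
ils = reduced[1] / first[1]

# Manual 2SLS: regress the outcome on the first-stage fitted values
educ_hat = Z @ first
tsls = ols(np.column_stack([np.ones(n), educ_hat]), lwage)[1]
print(ils, tsls)
```

The two printed numbers should agree up to floating-point error, mirroring the comparison between Parts F and G.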

### Part H

We can also run two-stage least squares with the `IV2SLS` estimator from `linearmodels` to compare the results.

In [18]:
auto_model = IV2SLS(df['lwklywge'], df['_cons'], df['educ'], df['z'])
auto_res = auto_model.fit(cov_type='robust')
print(auto_res.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:               lwklywge   R-squared:                      0.0750
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0750
No. Observations:              329509   F-statistic:                    27.737
Date:                Thu, Apr 11 2019   P-value (F-stat)                0.0000
Time:                        20:29:02   Distribution:                  chi2(1)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
_cons          4.4519     0.2750     16.191     0.0000      3.9130      4.9908
educ           0.1134     0.0215     5.2666     0.00

The automatic two-stage regression estimate of the effect of education on weekly wage is,

In [19]:
auto_res.params['educ']

0.11339367345135543

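Once again, this matches the estimates from Parts F and G up to floating-point error. Unlike the manual second stage, the standard error reported here (0.0215) is computed from residuals that use the actual years of schooling.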
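
To illustrate where a valid standard error comes from, the following is a minimal sketch of the just-identified IV estimator with a heteroskedasticity-robust covariance matrix. The helper `iv_just_identified` and the synthetic data are hypothetical, and no small-sample correction is applied, so the numbers need not match any particular library's defaults.

```python
import numpy as np

def iv_just_identified(X, Z, y):
    """IV coefficients and robust SEs when X and Z have the same number of columns."""
    ZtX_inv = np.linalg.inv(Z.T @ X)
    beta = ZtX_inv @ Z.T @ y
    # Residuals use the actual regressors, not first-stage fitted values
    resid = y - X @ beta
    meat = (Z * resid[:, None] ** 2).T @ Z
    cov = ZtX_inv @ meat @ ZtX_inv.T
    return beta, np.sqrt(np.diag(cov))

# Synthetic data (assumption: made up for illustration only)
rng = np.random.default_rng(3)
n = 10_000
z = rng.integers(0, 2, n).astype(float)
ability = rng.normal(0, 1, n)                  # unobserved confounder
educ = 12 + 0.5 * z + ability + rng.normal(0, 1, n)
lwage = 4.5 + 0.1 * educ + 0.3 * ability + rng.normal(0, 0.5, n)

X = np.column_stack([np.ones(n), educ])        # constant plus endogenous regressor
Z = np.column_stack([np.ones(n), z])           # constant plus instrument
beta, se = iv_just_identified(X, Z, lwage)

# The slope equals the ratio of sample covariances cov(z, y) / cov(z, x)
ratio = np.cov(z, lwage)[0, 1] / np.cov(z, educ)[0, 1]
print(beta, se, ratio)
```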