# Exercise 13: Solutions 

Estimating

\begin{equation*}
log(wage)=\beta_0+\beta_1educ+u
\end{equation*}  

does not give you the causal effect of education on wages. The estimate suffers from omitted variable bias because factors that are correlated with education and also affect wages, such as ability or effort, are missing from the model. Thus, the zero conditional mean assumption $E(u|educ)=0$ does not hold. In this case, we have the so-called problem of __endogeneity__. Another source of endogeneity, besides omitted variable bias, is simultaneity: the outcome variable is not only affected by the regressors but also affects them in turn. In the wage example this should be less of a concern, however.
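The direction and size of the omitted variable bias can be illustrated with a small simulation. The sketch below uses an entirely hypothetical data-generating process (the variable `ability` and all coefficients are assumptions chosen for illustration, not estimates from the exercise data) and compares the education coefficient with and without the omitted variable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical data-generating process (all numbers are made up for illustration)
ability = rng.normal(size=n)                     # unobserved by the researcher
educ = 12 + 2 * ability + rng.normal(size=n)     # education depends on ability
u = rng.normal(scale=0.5, size=n)
lwage = 0.5 + 0.08 * educ + 0.10 * ability + u   # true return to education: 0.08


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones_like(y), *regressors])
    return np.linalg.lstsq(X, y, rcond=None)[0]


b_short = ols(lwage, educ)            # ability omitted
b_long = ols(lwage, educ, ability)    # ability included (infeasible in practice)

print(f"educ coefficient, ability omitted:  {b_short[1]:.4f}")
print(f"educ coefficient, ability included: {b_long[1]:.4f}")
```

In this setup the short regression converges to $0.08 + 0.10 \cdot Cov(educ, ability)/Var(educ) = 0.08 + 0.10 \cdot 2/5 = 0.12$, so the return to education is overstated by about four percentage points.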

There are several ways to address an omitted variable bias problem. The most straightforward is to include the omitted variables. For instance, we can add experience, which is likely to be correlated with education and to affect wages:

\begin{equation*}
log(wage)=\beta_0+\beta_1educ+\beta_2exper+u
\end{equation*}  


However, we observe only some of the variables that cause the omitted variable bias, not all of them, so this does not solve the problem.

If we had panel data, we could also use a fixed effects model, which corrects for _unobserved heterogeneity_. However, a fixed effects model only removes the bias from omitted variables that do not vary over time. Apart from the fact that we don't have panel data here, some of the omitted variables are likely to change over time; ability, for instance, may increase with education and experience. In that case a fixed effects model would not fully eliminate the omitted variable bias.
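How the fixed effects (within) transformation removes a time-constant omitted variable can be sketched with simulated panel data. Everything here is hypothetical: the panel dimensions, the unobserved effect `a`, and the true slope of 0.5 are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ind, n_per = 2_000, 5

# Hypothetical panel: a is a time-constant unobserved effect (e.g. ability)
a = rng.normal(size=(n_ind, 1))
x = a + rng.normal(size=(n_ind, n_per))                    # regressor correlated with a
y = 1.0 + 0.5 * x + a + rng.normal(size=(n_ind, n_per))    # true slope: 0.5


def slope(y, x):
    """Slope from a simple regression of y on x (with intercept)."""
    xc = x - x.mean()
    return float((xc * (y - y.mean())).sum() / (xc ** 2).sum())


# Pooled OLS ignores a and is biased
b_pooled = slope(y.ravel(), x.ravel())

# Within transformation: subtracting individual means removes a
x_within = x - x.mean(axis=1, keepdims=True)
y_within = y - y.mean(axis=1, keepdims=True)
b_fe = slope(y_within.ravel(), x_within.ravel())

print(f"pooled OLS slope:    {b_pooled:.3f}")
print(f"fixed effects slope: {b_fe:.3f}")
```

The within estimator recovers a slope close to 0.5 only because `a` is constant over time in this simulation. If the omitted variable changed across periods, demeaning would not remove it.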

Another option is to use instrumental variables. If we find an instrument that also varies over time, this could be a good solution to the endogeneity problem that arises from time-varying omitted variables. An instrumental variable approach has two requirements. First, the instrument has to be relevant, i.e. it must have a sufficiently strong correlation with the endogenous variable. This is testable with a first-stage (partial) F-test. Second, the instrument has to be valid, which means it must not be correlated with the error term. Unfortunately, this is not testable, so the validity of the instrument rests on theory and argument.
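Why both requirements matter can be seen from the probability limit of the simple IV estimator. As a simplifying assumption, the sketch below ignores experience and considers a single instrument $z$ for $educ$:

\begin{equation*}
plim\,\hat{\beta}_1^{IV}=\frac{Cov(z, log(wage))}{Cov(z, educ)}=\beta_1+\frac{Cov(z,u)}{Cov(z,educ)}
\end{equation*}

Relevance guarantees that the denominator $Cov(z,educ)$ is not zero, and validity sets $Cov(z,u)=0$, so that the second term vanishes. If the instrument is weak, the denominator is close to zero and even a small correlation between $z$ and $u$ can produce a large bias.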

A potential candidate instrument from the dataset is the father's education. It is plausible to assume that the father's education is positively correlated with the education of his children but not directly with his children's wages (only indirectly through its effect on his children's education). So the instrument is probably valid. It is also strong, as we can see from the first-stage statistics: the partial F-statistic is almost 90, well above 10, the rule-of-thumb critical value.
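With a single instrument, the partial F-statistic can be checked by hand: under the default (non-robust) standard errors it equals the squared t-statistic of the instrument in the first-stage regression. Using the t-statistic of `fatheduc` reported in `table_first_stage` in the output of this exercise:

\begin{equation*}
F = t_{fatheduc}^2 = 9.45^2 \approx 89.3
\end{equation*}

The value reported by the software may differ slightly from this figure, depending on the degrees-of-freedom convention it uses.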

It is important to note that estimating an IV model by hand yields incorrect inference. The reason is that the second-stage OLS routine does not know that we use $\hat{x}$ instead of $x$. Because $\hat{x}$ is itself an estimate of $x$, there is additional uncertainty that the second-stage standard errors do not account for. The built-in IV estimator `iv.IV2SLS` corrects the standard errors automatically and provides correct inference.
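One way to see where the manual standard errors go wrong is to compute them both ways on simulated data. The sketch below assumes homoskedastic errors and a hypothetical data-generating process (the confounder `c` and all coefficients are made up). The point estimates are identical; only the residuals used for the error variance differ.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000

# Hypothetical DGP: x is endogenous because of the unobserved confounder c
z = rng.normal(size=n)                        # instrument
c = rng.normal(size=n)                        # unobserved confounder
x = 1.0 * z + c + rng.normal(size=n)
y = 1.0 + 0.5 * x + c + rng.normal(size=n)    # true slope: 0.5


def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)                         # first-stage fitted values

X_hat = np.column_stack([const, x_hat])
X = np.column_stack([const, x])
beta = ols(X_hat, y)                          # 2SLS point estimates

# Manual second stage: residuals are (wrongly) computed with x_hat
resid_naive = y - X_hat @ beta
# Correct 2SLS residuals use the original regressor x
resid_correct = y - X @ beta

dof = n - X.shape[1]
XtX_inv_diag = np.diag(np.linalg.inv(X_hat.T @ X_hat))
se_naive = np.sqrt(resid_naive @ resid_naive / dof * XtX_inv_diag)
se_correct = np.sqrt(resid_correct @ resid_correct / dof * XtX_inv_diag)

print(f"slope: {beta[1]:.4f}")
print(f"naive se: {se_naive[1]:.4f}, corrected se: {se_correct[1]:.4f}")
```

In this simulation the naive standard error is larger than the corrected one, the same pattern as in the exercise output (0.0361 for the manual estimate versus 0.0342 from `iv.IV2SLS`). In general the naive standard error can be too large or too small.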

The instrumental variable approach only works with models that are linear in parameters. In models that are non-linear in parameters, such as probit, logit, count data models or semi-parametric models, one has to use a control function approach instead. The control function approach is identical to 2SLS in the first stage. However, the first-stage residuals are stored instead of the fitted values. Instead of replacing the endogenous variable with its first-stage fitted values, we add the first-stage residuals to the second-stage regression and keep the endogenous variable in it. In the linear case with a single endogenous variable and one instrument, you get the same coefficient estimates as with IV.
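The equivalence of the two approaches in the linear case can be verified numerically. The following sketch uses a hypothetical simulated dataset (all names and coefficients are assumptions) and compares manual 2SLS with the control function regression.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5_000

# Hypothetical DGP with one endogenous regressor x and one instrument z
z = rng.normal(size=n)
c = rng.normal(size=n)                        # unobserved confounder
x = 1.0 * z + c + rng.normal(size=n)
y = 1.0 + 0.5 * x + c + rng.normal(size=n)    # true slope: 0.5


def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


const = np.ones(n)
Z = np.column_stack([const, z])
x_hat = Z @ ols(Z, x)      # first-stage fitted values
v_hat = x - x_hat          # first-stage residuals (the control function)

# 2SLS: replace x by its fitted values
b_2sls = ols(np.column_stack([const, x_hat]), y)
# Control function: keep x and add the first-stage residuals
b_cf = ols(np.column_stack([const, x, v_hat]), y)

print(f"2SLS slope:             {b_2sls[1]:.6f}")
print(f"control function slope: {b_cf[1]:.6f}")
print(f"coefficient on v_hat:   {b_cf[2]:.4f}")
```

The intercept and slope coincide up to floating-point error. The coefficient on `v_hat` is an extra output of the control function regression: a coefficient significantly different from zero is commonly read as evidence that the regressor is endogenous.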

In [15]:
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf
import statsmodels.api as sm

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# OLS:
reg_ols = smf.ols(formula='np.log(wage) ~ educ', data=mroz)
results_ols = reg_ols.fit()

# print regression table:
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')


# OLS after adding experience to the model:
reg_ols_exp = smf.ols(formula='np.log(wage) ~ exper + educ', data=mroz)
results_ols_exp = reg_ols_exp.fit()

# print regression table:
table_ols_exp = pd.DataFrame({'b': round(results_ols_exp.params, 4),
                          'se': round(results_ols_exp.bse, 4),
                          't': round(results_ols_exp.tvalues, 4),
                          'pval': round(results_ols_exp.pvalues, 4)})
print(f'table_ols_exp: \n{table_ols_exp}\n')


# IV manually
# First-stage
reg_fs = smf.ols(formula='educ ~ fatheduc + exper', data=mroz)
results_fs = reg_fs.fit()
mroz['educ_hat'] = results_fs.fittedvalues
# print regression table:
table_fs = pd.DataFrame({'b': round(results_fs.params, 4),
                          'se': round(results_fs.bse, 4),
                          't': round(results_fs.tvalues, 4),
                          'pval': round(results_fs.pvalues, 4)})
print(f'table_first_stage: \n{table_fs}\n')
# Second stage
reg_ss = smf.ols(formula='np.log(wage) ~ exper + educ_hat', data=mroz)
results_ss = reg_ss.fit()

# print regression table:
table_ss = pd.DataFrame({'b': round(results_ss.params, 4),
                          'se': round(results_ss.bse, 4),
                          't': round(results_ss.tvalues, 4),
                          'pval': round(results_ss.pvalues, 4)})
print(f'table_iv_manually: \n{table_ss}\n')

# IV automatically:
reg_iv_auto = iv.IV2SLS.from_formula(formula='np.log(wage) ~ 1 + exper + [educ ~ fatheduc]',
                                     data=mroz)
# fit the model object defined above (the original referenced an undefined name, reg_iv)
results_iv_auto = reg_iv_auto.fit(cov_type='unadjusted', debiased=True)

# print regression table:
table_iv_auto = pd.DataFrame({'b': round(results_iv_auto.params, 4),
                         'se': round(results_iv_auto.std_errors, 4),
                         't': round(results_iv_auto.tstats, 4),
                         'pval': round(results_iv_auto.pvalues, 4)})
print(f'table_iv_automatically: \n{table_iv_auto}\n')

# obtain first-stage statistics
print(results_iv_auto.first_stage)


# Control function
# First-stage
reg_fs = smf.ols(formula='educ ~ fatheduc + exper', data=mroz)
results_fs = reg_fs.fit()
mroz['control_function'] = results_fs.resid
# Second stage
reg_cf = smf.ols(formula='np.log(wage) ~ exper + educ + control_function', data=mroz)
results_cf = reg_cf.fit()

# print regression table:
table_cf = pd.DataFrame({'b': round(results_cf.params, 4),
                          'se': round(results_cf.bse, 4),
                          't': round(results_cf.tvalues, 4),
                          'pval': round(results_cf.pvalues, 4)})
print(f'table_control_function: \n{table_cf}\n')


pd.options.display.max_columns=None
mroz.describe()

table_ols: 
                b      se       t   pval
Intercept -0.1852  0.1852 -0.9998  0.318
educ       0.1086  0.0144  7.5451  0.000

table_ols_exp: 
                b      se       t    pval
Intercept -0.4002  0.1904 -2.1021  0.0361
exper      0.0157  0.0040  3.8998  0.0001
educ       0.1095  0.0142  7.7283  0.0000

table_first_stage: 
                 b      se        t    pval
Intercept  10.0788  0.3385  29.7788  0.0000
fatheduc    0.2723  0.0288   9.4500  0.0000
exper       0.0102  0.0126   0.8084  0.4193

table_iv_manually: 
                b      se       t    pval
Intercept  0.0356  0.4640  0.0768  0.9389
exper      0.0155  0.0043  3.6336  0.0003
educ_hat   0.0752  0.0361  2.0821  0.0379

table_iv_automatically: 
                b      se       t    pval
Intercept  0.0356  0.4397  0.0810  0.9355
exper      0.0155  0.0040  3.8346  0.0001
educ       0.0752  0.0342  2.1972  0.0285

    First Stage Estimation Results   
                                 educ
-----------------------

statistic,inlf,hours,kidslt6,kidsge6,age,educ,wage,repwage,hushrs,husage,huseduc,huswage,faminc,mtr,motheduc,fatheduc,unem,city,exper,nwifeinc,lwage,expersq,educ_hat,control_function
count,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0,428.0
mean,1.0,1302.929907,0.140187,1.350467,41.971963,12.658879,4.177682,3.185864,2233.464953,44.609813,12.61215,7.226226,24130.422897,0.668333,9.516355,8.988318,8.545561,0.640187,13.037383,18.937483,1.190173,234.719626,12.658879,6.731894e-15
std,0.0,776.274385,0.391923,1.315935,7.721084,2.285376,3.310282,2.43964,582.908769,7.950055,3.035163,3.571217,11671.255986,0.076936,3.3081,3.523405,3.033328,0.480507,8.055923,10.591354,0.723198,270.043358,0.95284,2.077267
min,1.0,12.0,0.0,0.0,30.0,5.0,0.1282,0.0,175.0,30.0,4.0,0.5128,2400.0,0.4415,0.0,0.0,3.0,0.0,0.0,-0.029057,-2.054164,0.0,10.190806,-8.437765
25%,1.0,609.5,0.0,0.0,35.0,12.0,2.2626,1.42,1920.0,38.0,11.0,4.82175,16286.25,0.6215,7.0,7.0,7.5,0.0,7.0,12.365249,0.816509,49.0,12.076376,-1.145137
50%,1.0,1365.5,0.0,1.0,42.0,12.0,3.4819,3.195,2106.5,45.0,12.0,6.6831,21961.0,0.6915,10.0,7.0,7.5,1.0,12.0,17.079998,1.247574,144.0,12.229177,-0.1069364
75%,1.0,1910.5,0.0,2.0,47.25,14.0,4.97075,4.55,2504.0,51.0,16.0,8.837775,29793.0,0.7215,12.0,12.0,11.0,1.0,18.0,23.514996,1.603571,324.0,13.430125,1.05017
max,1.0,4950.0,2.0,8.0,60.0,17.0,25.0,9.98,5010.0,60.0,17.0,26.577999,91044.0,0.9415,17.0,17.0,14.0,1.0,38.0,91.0,3.218876,1444.0,15.013075,5.931241
