# Exercise 13

Exercise: Use the dataset 'mroz' from 'wooldridge' and estimate the following model, where education ('educ') is considered endogenous.  


\begin{equation*}
\log(wage)=\beta_0+\beta_1 educ+\beta_2 exper+u
\end{equation*}  

  
  
- Do you think OLS gives us the causal effect of education on wage (the return to education)? If not, why not?
- What is this problem called, and which OLS assumption is violated?
- What are potential solutions to address this problem? Which requirements must hold for them to estimate a causal effect?
- We covered two potential sources of endogeneity in class. What are they called, and how do they cause endogeneity?
- If this were panel data, how could we correct for endogeneity? Which type of endogeneity could we address by taking advantage of the panel structure?
- What are the two requirements an instrument must fulfill?
- Which one can be tested and which one can't?
- Think about potential instruments that could be used to address the endogeneity in this case.
- One candidate is the father's education. But does it fulfill the exogeneity requirement if we estimate the model as it currently stands?
- Estimate the model by OLS, by IV by hand, and by IV using an implemented estimator, and report the results.
- Is the instrument relevant (strong enough)? Test the instrument's relevance.
- What can you say about inference when estimating IV by hand?
- Can IV also be used in non-linear models? If not, what would be an alternative?
- Estimate the model with the control function approach.
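
For the "IV by hand" task, the estimators can be sketched for the simple case without $exper$. The shorthand $y=\log(wage)$, $x=educ$ and $z$ for the instrument is an assumption made here for brevity, not notation from the exercise. Starting from $y=\beta_0+\beta_1x+u$:

\begin{equation*}
\frac{Cov(x,y)}{Var(x)}=\beta_1+\frac{Cov(x,u)}{Var(x)}
\end{equation*}

\begin{equation*}
Cov(z,y)=\beta_1Cov(z,x)+Cov(z,u)
\end{equation*}

\begin{equation*}
\beta_1=\frac{Cov(z,y)}{Cov(z,x)} \quad \text{if } Cov(z,u)=0 \text{ and } Cov(z,x)\neq 0
\end{equation*}

The first line shows that the OLS slope is off by $Cov(x,u)/Var(x)$ whenever $educ$ is correlated with the error. The last line is the IV slope, and it only follows if the instrument is exogenous, $Cov(z,u)=0$, and relevant, $Cov(z,x)\neq 0$.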


In [2]:
import wooldridge as woo
import numpy as np
import pandas as pd
import linearmodels.iv as iv
import statsmodels.formula.api as smf

mroz = woo.dataWoo('mroz')

# restrict to non-missing wage observations:
mroz = mroz.dropna(subset=['lwage'])

# sample covariances and variance needed for the manual slope estimates:
cov_yz = np.cov(mroz['lwage'], mroz['fatheduc'])[1, 0]
cov_xy = np.cov(mroz['educ'], mroz['lwage'])[1, 0]
cov_xz = np.cov(mroz['educ'], mroz['fatheduc'])[1, 0]
var_x = np.var(mroz['educ'], ddof=1)
# sample means (only needed for the intercepts, not used below):
x_bar = np.mean(mroz['educ'])
y_bar = np.mean(mroz['lwage'])

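# OLS slope parameter manually (simple regression of lwage on educ, without exper):
b_ols_man = cov_xy / var_x
print(f'b_ols_man: {b_ols_man}\n')

# IV slope parameter manually (simple IV with fatheduc as instrument for educ):
b_iv_man = cov_yz / cov_xz
print(f'b_iv_man: {b_iv_man}\n')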

# OLS with statsmodels (includes exper, so the educ coefficient differs slightly from b_ols_man):
reg_ols = smf.ols(formula='np.log(wage) ~ educ + exper', data=mroz)
results_ols = reg_ols.fit()

# print regression table:
table_ols = pd.DataFrame({'b': round(results_ols.params, 4),
                          'se': round(results_ols.bse, 4),
                          't': round(results_ols.tvalues, 4),
                          'pval': round(results_ols.pvalues, 4)})
print(f'table_ols: \n{table_ols}\n')


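# IV with linearmodels. Note: this formula omits exper, so it reproduces b_iv_man.
# For the model stated in the exercise, add exper as an exogenous regressor:
# 'np.log(wage) ~ 1 + exper + [educ ~ fatheduc]'
reg_iv = iv.IV2SLS.from_formula(formula='np.log(wage) ~ 1 + [educ ~ fatheduc]',
                                data=mroz)
results_iv = reg_iv.fit(cov_type='unadjusted', debiased=True)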

# print regression table:
table_iv = pd.DataFrame({'b': round(results_iv.params, 4),
                         'se': round(results_iv.std_errors, 4),
                         't': round(results_iv.tstats, 4),
                         'pval': round(results_iv.pvalues, 4)})
print(f'table_iv: \n{table_iv}\n')


# show all columns in the summary statistics:
pd.options.display.max_columns = None
mroz.describe()

b_ols_man: 0.10864865517467513

b_iv_man: 0.05917347999936595

table_ols: 
                b      se       t    pval
Intercept -0.4002  0.1904 -2.1021  0.0361
educ       0.1095  0.0142  7.7283  0.0000
exper      0.0157  0.0040  3.8998  0.0001

table_iv: 
                b      se       t    pval
Intercept  0.4411  0.4461  0.9888  0.3233
educ       0.0592  0.0351  1.6839  0.0929



| | inlf | hours | kidslt6 | kidsge6 | age | educ | wage | repwage | hushrs | husage | huseduc | huswage | faminc | mtr | motheduc | fatheduc | unem | city | exper | nwifeinc | lwage | expersq |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 | 428.0 |
| mean | 1.0 | 1302.929907 | 0.140187 | 1.350467 | 41.971963 | 12.658879 | 4.177682 | 3.185864 | 2233.464953 | 44.609813 | 12.61215 | 7.226226 | 24130.422897 | 0.668333 | 9.516355 | 8.988318 | 8.545561 | 0.640187 | 13.037383 | 18.937483 | 1.190173 | 234.719626 |
| std | 0.0 | 776.274385 | 0.391923 | 1.315935 | 7.721084 | 2.285376 | 3.310282 | 2.43964 | 582.908769 | 7.950055 | 3.035163 | 3.571217 | 11671.255986 | 0.076936 | 3.3081 | 3.523405 | 3.033328 | 0.480507 | 8.055923 | 10.591354 | 0.723198 | 270.043358 |
| min | 1.0 | 12.0 | 0.0 | 0.0 | 30.0 | 5.0 | 0.1282 | 0.0 | 175.0 | 30.0 | 4.0 | 0.5128 | 2400.0 | 0.4415 | 0.0 | 0.0 | 3.0 | 0.0 | 0.0 | -0.029057 | -2.054164 | 0.0 |
| 25% | 1.0 | 609.5 | 0.0 | 0.0 | 35.0 | 12.0 | 2.2626 | 1.42 | 1920.0 | 38.0 | 11.0 | 4.82175 | 16286.25 | 0.6215 | 7.0 | 7.0 | 7.5 | 0.0 | 7.0 | 12.365249 | 0.816509 | 49.0 |
| 50% | 1.0 | 1365.5 | 0.0 | 1.0 | 42.0 | 12.0 | 3.4819 | 3.195 | 2106.5 | 45.0 | 12.0 | 6.6831 | 21961.0 | 0.6915 | 10.0 | 7.0 | 7.5 | 1.0 | 12.0 | 17.079998 | 1.247574 | 144.0 |
| 75% | 1.0 | 1910.5 | 0.0 | 2.0 | 47.25 | 14.0 | 4.97075 | 4.55 | 2504.0 | 51.0 | 16.0 | 8.837775 | 29793.0 | 0.7215 | 12.0 | 12.0 | 11.0 | 1.0 | 18.0 | 23.514996 | 1.603571 | 324.0 |
| max | 1.0 | 4950.0 | 2.0 | 8.0 | 60.0 | 17.0 | 25.0 | 9.98 | 5010.0 | 60.0 | 17.0 | 26.577999 | 91044.0 | 0.9415 | 17.0 | 17.0 | 14.0 | 1.0 | 38.0 | 91.0 | 3.218876 | 1444.0 |
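
The remaining tasks (instrument relevance, inference for IV by hand, and the control function approach) can be sketched as follows. This is a minimal illustration on simulated data, not on the mroz sample: the data-generating process, its coefficients, the unobserved `ability` variable and the helper `ols` are all assumptions made for the example. Only numpy is used so that each step is visible.

```python
import numpy as np

def ols(X, y):
    """Hypothetical helper: OLS coefficients of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated data (assumed, NOT mroz): 'ability' is unobserved, raises both
# educ and lwage, and therefore makes educ endogenous. True return: 0.1.
rng = np.random.default_rng(0)
n = 5000
ability = rng.normal(size=n)
fatheduc = rng.normal(size=n)   # instrument
exper = rng.normal(size=n)      # exogenous control
educ = 0.5 * fatheduc + 0.3 * exper + ability + rng.normal(size=n)
lwage = 1.0 + 0.1 * educ + 0.02 * exper + 0.5 * ability + rng.normal(size=n)

const = np.ones(n)
Z = np.column_stack([const, exper, fatheduc])  # first-stage regressors
X = np.column_stack([const, educ, exper])      # structural regressors

# Naive OLS: biased upward here because ability is omitted.
b_ols = ols(X, lwage)

# First stage: educ on all exogenous variables, including the instrument.
pi = ols(Z, educ)
educ_hat = Z @ pi
v_hat = educ - educ_hat

# Relevance: with one excluded instrument, the first-stage F statistic
# is the squared t statistic of the instrument (homoskedastic errors assumed).
sigma2_v = v_hat @ v_hat / (n - Z.shape[1])
se_pi = np.sqrt(sigma2_v * np.linalg.inv(Z.T @ Z).diagonal())
f_first_stage = (pi[2] / se_pi[2]) ** 2

# Second stage by hand: replace educ with its first-stage fitted values.
X_hat = np.column_stack([const, educ_hat, exper])
b_2sls = ols(X_hat, lwage)

# Inference: the standard errors printed by a second-stage OLS are not valid,
# because its residuals are built from educ_hat. The 2SLS residuals use the
# actual educ instead (homoskedastic variance formula).
u_hat = lwage - X @ b_2sls
sigma2_u = u_hat @ u_hat / (n - X.shape[1])
se_2sls = np.sqrt(sigma2_u * np.linalg.inv(X_hat.T @ X_hat).diagonal())

# Control function: keep educ and add the first-stage residual as a regressor.
b_cf = ols(np.column_stack([const, educ, exper, v_hat]), lwage)

print(f'OLS educ:              {b_ols[1]:.4f}')
print(f'2SLS educ (by hand):   {b_2sls[1]:.4f} (se {se_2sls[1]:.4f})')
print(f'Control function educ: {b_cf[1]:.4f}')
print(f'First-stage F:         {f_first_stage:.1f}')
```

In a linear model the control function estimate of the `educ` coefficient coincides with the 2SLS estimate, which is why the two printed values agree. The coefficient on `v_hat` in the control function regression additionally gives a simple test of endogeneity. The control function idea carries over to non-linear models, where plugging in fitted values as in 2SLS is generally not valid.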
