# Final Exam

The coding portion of the exam is open book, open note, and open Google. However, you may not receive help from another person; all work must be yours alone. Turn in this coding portion by downloading your completed Colab notebook as a .ipynb file and submitting it via Learning Suite. To get full credit, the completed notebook should run from top to bottom and produce the results asked for in the prompt below.

## The Question

An important question in microeconomics is the causal return to additional schooling. A simple regression of later-life earnings on years of education may give a biased estimate of the causal effect of schooling on earnings, because individuals with higher non-schooling determinants of earnings (like intelligence, drive, or family background) might also obtain more schooling. As a result, it is important to account for covariates in the estimation, and possibly to use instrumental variables. One instrumental variables strategy, proposed by Angrist and Krueger (1991), is to use the quarter of the year in which an individual was born as an instrument. The logic is that the time of year a person is born is somewhat random, but individuals born later in the year will on average obtain more schooling. The reason is that compulsory schooling laws require children to stay in school until they turn 16. Since children born later in the year turn 16 later, they are "forced" to stay in school a little longer than children who turn 16 earlier in the school year. Compulsory schooling laws vary somewhat by state, so the instruments are generated by interacting quarter of birth with state of birth.

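However, interacting quarter of birth with state of birth generates around 150 instruments. With that many instruments, most of them only weakly related to schooling, two-stage least squares can overfit the first stage, which tends to pull the IV estimate back toward the biased OLS estimate.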
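The ability-bias argument above, and the way an instrument addresses it, can be illustrated with a small simulation. This is only a sketch on synthetic data: the true effect of 0.10, the binary instrument `z`, and every coefficient below are assumptions chosen for illustration and are not taken from ak91.csv.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
true_effect = 0.10                      # assumed causal return to one year of schooling

ability = rng.normal(size=n)            # unobserved; raises both schooling and wages
z = rng.integers(0, 2, size=n)          # hypothetical instrument (1 = born late in the year)
educ = 12 + 0.5 * z + ability + rng.normal(size=n)
log_wage = 5 + true_effect * educ + 0.3 * ability + rng.normal(scale=0.5, size=n)

# OLS slope = cov(y, d) / var(d): it also picks up the ability term
ols_slope = np.cov(log_wage, educ)[0, 1] / np.var(educ, ddof=1)

# IV slope = cov(y, z) / cov(d, z): it uses only the instrument-driven variation
iv_slope = np.cov(log_wage, z)[0, 1] / np.cov(educ, z)[0, 1]

print(f"OLS: {ols_slope:.3f}   IV: {iv_slope:.3f}   truth: {true_effect}")
```

In this setup the OLS slope should land well above the true effect, while the IV slope should sit close to it, at the cost of a noisier estimate.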

In this final exam you will use machine learning to estimate the causal effect of years of education on the natural log of an individual's weekly wage. In the shared Econ 484 Google Drive "datasets" folder you will find a dataset called "ak91.csv" and the associated codebook "ak91codebook.txt" that gives some information about each of the variables. The data set is that used by Angrist and Krueger (1991). The outcome variable is the natural log of the weekly wage. The "treatment" variable is years of education. Instruments are indicators for quarter of birth and their interactions with state of birth. Additional covariates are age, marital status, region of residence, race, urban residence status, and state of birth indicators.

## The Task

Estimate the causal effect of years of schooling on the natural log of weekly wages in two ways:

1) Via OLS regression where you use machine learning to control for the additional covariates.

2) Via instrumental variables regression using quarter of birth and interactions between quarter of birth and state of birth as instruments for years of schooling, where you use machine learning to solve problems that arise with many instruments.

## Hints and Requirements

*   Thoroughly document your code with comments explaining what each part of your code is doing.

*   Be sure to "print" all of the relevant results after estimating/calculating them.

*   Use best practices that we have learned this semester, including pre-processing variables as necessary and choosing tuning parameters.

*   Choose the machine learning method(s) you use based on what yields the best out-of-sample accuracy among at least two different machine learning methods (e.g., lasso and ridge), where out-of-sample accuracy is assessed using a held out test set.
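The last requirement can be sketched as follows. This is a minimal illustration on synthetic data from `make_regression`, not on ak91.csv; the ridge penalty grid and the use of test-set mean squared error as the accuracy measure are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real features and label
X_demo, y_demo = make_regression(n_samples=500, n_features=30, n_informative=5,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Each candidate standardizes the features and tunes its own penalty by
# cross-validation inside the training set
candidates = {
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5)),
    "ridge": make_pipeline(StandardScaler(), RidgeCV(alphas=[0.1, 1.0, 10.0])),
}

# The held-out test set is used only to compare the tuned models
test_mse = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    test_mse[name] = mean_squared_error(y_te, model.predict(X_te))
    print(name, "test MSE:", test_mse[name])

best = min(test_mse, key=test_mse.get)
print("Method with the lowest test MSE:", best)
```

Putting the scaler inside the pipeline keeps the test set out of the scaling step as well as out of the penalty choice.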

In [None]:
# Here I import the packages that I will use (or am thinking about using).
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.linear_model import RidgeCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
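# Alias the statsmodels module so it does not overwrite sklearn's linear_model imported above.
from statsmodels.regression import linear_model as sm_linear_model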
from sklearn.impute import SimpleImputer



In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [None]:
# Load the Angrist and Krueger (1991) data from the shared Drive folder and look at the first rows.
Angrist=pd.read_csv('/content/gdrive/MyDrive/Econ 484/ak91.csv')
Angrist.head()




Unnamed: 0,AGE,EDUC,ENOCENT,ESOCENT,LWKLYWGE,MARRIED,MIDATL,MT,NEWENG,RACE,SMSA,SOATL,WNOCENT,WSOCENT,YOB,_IQOB_2,_IQOB_3,_IQOB_4,_ISTATE_2,_ISTATE_4,_ISTATE_5,_ISTATE_6,_ISTATE_8,_ISTATE_9,_ISTATE_10,_ISTATE_11,_ISTATE_12,_ISTATE_13,_ISTATE_15,_ISTATE_16,_ISTATE_17,_ISTATE_18,_ISTATE_19,_ISTATE_20,_ISTATE_21,_ISTATE_22,_ISTATE_23,_ISTATE_24,_ISTATE_25,_ISTATE_26,...,_IQOBXSTA_4_15,_IQOBXSTA_4_16,_IQOBXSTA_4_17,_IQOBXSTA_4_18,_IQOBXSTA_4_19,_IQOBXSTA_4_20,_IQOBXSTA_4_21,_IQOBXSTA_4_22,_IQOBXSTA_4_23,_IQOBXSTA_4_24,_IQOBXSTA_4_25,_IQOBXSTA_4_26,_IQOBXSTA_4_27,_IQOBXSTA_4_28,_IQOBXSTA_4_29,_IQOBXSTA_4_30,_IQOBXSTA_4_31,_IQOBXSTA_4_32,_IQOBXSTA_4_33,_IQOBXSTA_4_34,_IQOBXSTA_4_35,_IQOBXSTA_4_36,_IQOBXSTA_4_37,_IQOBXSTA_4_38,_IQOBXSTA_4_39,_IQOBXSTA_4_40,_IQOBXSTA_4_41,_IQOBXSTA_4_42,_IQOBXSTA_4_44,_IQOBXSTA_4_45,_IQOBXSTA_4_46,_IQOBXSTA_4_47,_IQOBXSTA_4_48,_IQOBXSTA_4_49,_IQOBXSTA_4_50,_IQOBXSTA_4_51,_IQOBXSTA_4_53,_IQOBXSTA_4_54,_IQOBXSTA_4_55,_IQOBXSTA_4_56
0,47,12,0,0,6.245846,1,0,0,0,1,0,0,0,0,33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,46,12,0,0,5.847161,1,0,0,0,1,0,0,0,0,33,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
2,50,12,0,0,6.645516,1,0,0,0,1,0,0,0,0,30,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,47,16,0,0,6.706133,1,0,0,0,1,0,0,0,0,33,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
4,42,14,0,0,6.357876,1,0,0,0,1,0,0,0,0,37,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0


In [None]:
# Let's explore the data with summary statistics for each variable.
Angrist.describe()

Unnamed: 0,AGE,EDUC,ENOCENT,ESOCENT,LWKLYWGE,MARRIED,MIDATL,MT,NEWENG,RACE,SMSA,SOATL,WNOCENT,WSOCENT,YOB,_IQOB_2,_IQOB_3,_IQOB_4,_ISTATE_2,_ISTATE_4,_ISTATE_5,_ISTATE_6,_ISTATE_8,_ISTATE_9,_ISTATE_10,_ISTATE_11,_ISTATE_12,_ISTATE_13,_ISTATE_15,_ISTATE_16,_ISTATE_17,_ISTATE_18,_ISTATE_19,_ISTATE_20,_ISTATE_21,_ISTATE_22,_ISTATE_23,_ISTATE_24,_ISTATE_25,_ISTATE_26,...,_IQOBXSTA_4_15,_IQOBXSTA_4_16,_IQOBXSTA_4_17,_IQOBXSTA_4_18,_IQOBXSTA_4_19,_IQOBXSTA_4_20,_IQOBXSTA_4_21,_IQOBXSTA_4_22,_IQOBXSTA_4_23,_IQOBXSTA_4_24,_IQOBXSTA_4_25,_IQOBXSTA_4_26,_IQOBXSTA_4_27,_IQOBXSTA_4_28,_IQOBXSTA_4_29,_IQOBXSTA_4_30,_IQOBXSTA_4_31,_IQOBXSTA_4_32,_IQOBXSTA_4_33,_IQOBXSTA_4_34,_IQOBXSTA_4_35,_IQOBXSTA_4_36,_IQOBXSTA_4_37,_IQOBXSTA_4_38,_IQOBXSTA_4_39,_IQOBXSTA_4_40,_IQOBXSTA_4_41,_IQOBXSTA_4_42,_IQOBXSTA_4_44,_IQOBXSTA_4_45,_IQOBXSTA_4_46,_IQOBXSTA_4_47,_IQOBXSTA_4_48,_IQOBXSTA_4_49,_IQOBXSTA_4_50,_IQOBXSTA_4_51,_IQOBXSTA_4_53,_IQOBXSTA_4_54,_IQOBXSTA_4_55,_IQOBXSTA_4_56
count,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,...,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0,329509.0
mean,44.645084,12.769912,0.201466,0.065452,5.899944,0.862559,0.161738,0.049419,0.056165,0.081676,0.186332,0.168071,0.077992,0.096926,34.602773,0.243204,0.263592,0.245347,0.000237,0.003235,0.017584,0.03362,0.008552,0.011666,0.001815,0.003754,0.011875,0.025526,0.000747,0.004853,0.055765,0.027065,0.02033,0.014588,0.02711,0.018133,0.007356,0.012561,0.030212,0.042721,...,0.000231,0.001159,0.013736,0.006534,0.004901,0.003715,0.006774,0.004865,0.00176,0.003147,0.007053,0.010124,0.005338,0.00427,0.007056,0.001035,0.002616,0.00024,0.00085,0.00641,0.001102,0.021046,0.008145,0.001478,0.012485,0.005402,0.001545,0.019383,0.001238,0.004115,0.001296,0.006133,0.012849,0.001539,0.000722,0.005302,0.002622,0.004762,0.006258,0.000486
std,2.939745,3.281244,0.401096,0.247322,0.678824,0.344313,0.36821,0.216742,0.230241,0.273871,0.389375,0.37393,0.268159,0.295857,2.904956,0.429018,0.440581,0.430293,0.015384,0.056786,0.131433,0.180249,0.092082,0.107377,0.042562,0.061155,0.108325,0.157716,0.027313,0.069492,0.229467,0.162272,0.141128,0.119898,0.162404,0.133433,0.085454,0.11137,0.171169,0.202228,...,0.015185,0.034029,0.116391,0.080569,0.069837,0.060834,0.082023,0.069578,0.041918,0.056011,0.083685,0.100108,0.072868,0.065206,0.083703,0.032153,0.05108,0.015482,0.029138,0.079803,0.033173,0.143539,0.089884,0.038416,0.111038,0.073299,0.039273,0.137869,0.035166,0.064018,0.035975,0.078075,0.112625,0.039196,0.026866,0.07262,0.051139,0.06884,0.078858,0.02203
min,40.0,0.0,0.0,0.0,-2.341806,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,30.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,42.0,12.0,0.0,0.0,5.636505,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,32.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,45.0,12.0,0.0,0.0,5.952494,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,35.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,47.0,15.0,0.0,0.0,6.257376,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,37.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,50.0,20.0,1.0,1.0,10.532096,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,39.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
# Our labeled dataset is Angrist. As a basic precaution, I drop any rows with missing values.

Angrist = Angrist.dropna()
Angrist.shape


(329509, 218)

In [None]:
# Here we assign our label, our treatment, our controls, and our instruments.

y=Angrist.loc[:,'LWKLYWGE'] # y is our label (the outcome): log weekly wage
d=Angrist.loc[:,['EDUC']] # d is our treatment feature: years of education

# Chaining column names with `and` evaluates to the last string only, so the
# names have to be collected in real lists.
# Controls: age, marital status, race, urban residence (SMSA), the region of
# residence indicators, and the state-of-birth indicators. EDUC is the
# treatment, so it is left out of X.
controls=['AGE','MARRIED','RACE','SMSA','ENOCENT','ESOCENT','MIDATL','MT',
          'NEWENG','SOATL','WNOCENT','WSOCENT']
state_cols=[c for c in Angrist.columns if c.startswith('_ISTATE_')]
X=Angrist.loc[:,controls+state_cols]

# Instruments: the quarter-of-birth indicators (_IQOB_*) and their
# interactions with state of birth (_IQOBXSTA_*).
Z1=Angrist.filter(regex=r'^_IQOB',axis=1)

# .head() needs the parentheses; without them print shows the bound method.
print('our y stuff',y.head())
print('our D stuff',d.head())
print('our z stuff',Z1.shape)
print('x stuff',X.shape)

In [None]:
print(type('_IQOB_2'))
print(type('EDUC'))
print(d)
Z1.shape
d.shape


<class 'str'>
<class 'str'>
        EDUC
0         12
1         12
2         12
3         16
4         14
...      ...
329504    10
329505    12
329506    12
329507    12
329508    13

[329509 rows x 1 columns]


(329509, 1)
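A quick check of how Python treats column names joined with `and` shows why the names have to be collected in a list before selecting columns. The column names below are a small hypothetical subset in the style of ak91.csv, held in a plain list so the example does not need the data file.

```python
# Hypothetical subset of column names in the style of ak91.csv
columns = ['AGE', 'EDUC', 'MARRIED', '_IQOB_2', '_ISTATE_2', '_IQOBXSTA_2_2']

# `and` between non-empty strings returns the LAST operand, so a selector
# built this way refers to a single column
chained = 'AGE' and 'MARRIED' and '_ISTATE_2'
print(chained)

# Selecting by prefix gives the intended groups of columns
state_cols = [c for c in columns if c.startswith('_ISTATE_')]
instrument_cols = [c for c in columns if c.startswith('_IQOB')]
print(state_cols)
print(instrument_cols)
```

The same prefix logic carries over to a DataFrame, for example through `DataFrame.filter(regex=...)`.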

In [None]:
X_train, X_test, y_train, y_test=train_test_split(X, y, random_state=42)
X_train2, X_test2, d_train, d_test=train_test_split(X, d, random_state=42)
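The two calls above return matching rows only because they share the same `random_state`. As a sketch of an alternative, `train_test_split` accepts several arrays at once and splits them on the same rows; the example also fits the scaler on the training rows only. The arrays here are synthetic placeholders, not the exam data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_demo = rng.normal(loc=3.0, scale=2.0, size=(1000, 4))  # placeholder controls
y_demo = rng.normal(size=1000)                           # placeholder outcome
d_demo = rng.normal(size=1000)                           # placeholder treatment

# One call splits every array on the same row indices
X_tr, X_te, y_tr, y_te, d_tr, d_te = train_test_split(
    X_demo, y_demo, d_demo, random_state=42)

# Learn the means and variances on the training rows only, then apply the
# same transformation to the test rows
scaler_demo = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_demo.transform(X_tr)
X_te_scaled = scaler_demo.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(3))  # zero by construction
print(X_te_scaled.mean(axis=0).round(3))  # close to zero, but not exactly zero
```

Calling `fit_transform` on the test set instead would rescale it with its own means and variances, so the model would be evaluated on features that are measured differently from the ones it was trained on.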

In [None]:
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


# Here is the NAIVE OLS approach. 
ols = linear_model.LinearRegression()
ols_reg = ols.fit(d.values.reshape(-1,1),y)
print('OLS coefficient: ',ols_reg.coef_) # Naive estimation coefficient of our our treatment

OLS coefficient:  [0.07085104]
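Because the outcome is the natural log of the weekly wage, the coefficient is measured in log points. A short calculation converts the naive OLS coefficient printed above into an approximate percentage change in the weekly wage per additional year of education. This is only an interpretation aid for the naive estimate, not a causal claim.

```python
import math

beta = 0.07085104                       # naive OLS coefficient printed above
pct_change = (math.exp(beta) - 1) * 100  # exact conversion from log points to percent
print(f"{pct_change:.2f}% higher weekly wage per additional year of education")
```

For coefficients this small, the log-point value times 100 (about 7.1) is already a close approximation to the exact figure.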


In [None]:
# Lasso of the outcome y on the controls X, with the penalty chosen by cross-validation.
scaler=StandardScaler()
lassofitrain=scaler.fit_transform(X_train) # learn the means and variances on the training set only
lassofitest=scaler.transform(X_test) # apply the training-set scaling to the test set
lassoY=LassoCV().fit(lassofitrain, y_train)
# Note: .score() returns the out-of-sample R^2, not the MSE.
print("Lasso (y on X) test R^2: " + str(lassoY.score(lassofitest, y_test)))
print("Lasso (y on X) alpha: " + str(lassoY.alpha_))

# Lasso of the treatment d on the same controls. X_train2 and X_test2 hold the
# same rows as X_train and X_test because both splits use random_state=42.
# ravel() turns the one-column d into a 1-d array and avoids the column_or_1d warning.
lassoD=LassoCV().fit(scaler.transform(X_train2), d_train.values.ravel())
print("Lasso (d on X) test R^2: " + str(lassoD.score(scaler.transform(X_test2), d_test.values.ravel())))
print("Lasso (d on X) alpha: " + str(lassoD.alpha_))

# Here we use the fitted treatment model to predict years of education in the test set.
d_pred = lassoD.predict(scaler.transform(X_test2))
print(d_pred)


In [None]:
# Double/debiased ML without sample splitting: use ridge to partial the
# controls X out of both the outcome and the treatment, then regress the
# outcome residuals on the treatment residuals.
ridgeY = Ridge().fit(X,y)
yresid = y - ridgeY.predict(X) # outcome residuals

ridgeD = Ridge().fit(X,d)
dresid = d - ridgeD.predict(X) # treatment residuals (d is a one-column DataFrame)

ddmlreg=linear_model.LinearRegression().fit(dresid,yresid)
print("DDML regression: effect of years of education on log weekly wage: {:.3f}".format(ddmlreg.coef_[0]))
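The residual-on-residual step rests on the Frisch-Waugh-Lovell result: when the partialling-out is done by OLS, the slope from regressing outcome residuals on treatment residuals equals the treatment coefficient from the full regression. The notebook swaps ridge in for OLS in the partialling-out step, so the equality there is only approximate. The check below is a sketch on synthetic data; all names and coefficients are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
X_demo = rng.normal(size=(n, 3))                                   # synthetic controls
d_demo = X_demo @ np.array([0.5, -0.2, 0.1]) + rng.normal(size=n)  # synthetic treatment
y_demo = 0.07 * d_demo + X_demo @ np.array([1.0, 0.3, -0.4]) + rng.normal(size=n)

def ols_coefs(features, target):
    """OLS coefficients, with the intercept in position 0."""
    design = np.column_stack([np.ones(len(features)), features])
    return np.linalg.lstsq(design, target, rcond=None)[0]

def residualize(features, target):
    """Residuals from an OLS regression of target on features plus an intercept."""
    design = np.column_stack([np.ones(len(features)), features])
    return target - design @ ols_coefs(features, target)

# Coefficient on d from the full regression of y on d and the controls
beta_full = ols_coefs(np.column_stack([d_demo, X_demo]), y_demo)[1]

# Same coefficient from the residual-on-residual regression
y_res = residualize(X_demo, y_demo)
d_res = residualize(X_demo, d_demo)
beta_fwl = (d_res @ y_res) / (d_res @ d_res)

print(beta_full, beta_fwl)
```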


In [None]:
# Double debiased method with sample splitting
from sklearn.model_selection import KFold
# create our sample splitting "object": five folds, with the observations shuffled once before splitting
kf = KFold(n_splits=5,shuffle=True,random_state=42)

# initialize array to hold each fold's regression coefficient
coeffs=np.zeros(5)

# Now loop through each fold. The fold variables get their own names so that
# the train/test split made earlier is not overwritten.
for ii, (train_index, test_index) in enumerate(kf.split(X)):
  X_tr, X_te = X.iloc[train_index,:], X.iloc[test_index,:] # train_index holds the observations in the training folds
  y_tr, y_te = y.iloc[train_index], y.iloc[test_index]
  d_tr, d_te = d.iloc[train_index,:], d.iloc[test_index,:]

  # Ridge y on training folds, but get residuals in the held-out fold
  ridgeY.fit(X_tr, y_tr)
  yresid=y_te-ridgeY.predict(X_te)

  # Ridge d on training folds, but get residuals in the held-out fold
  ridgeD.fit(X_tr, d_tr)
  dresid=d_te-ridgeD.predict(X_te)

  # regress resids on resids (no sample weights are used here)
  ddmlreg=linear_model.LinearRegression().fit(dresid,yresid)

  # save coefficient in a vector
  coeffs[ii]=ddmlreg.coef_[0]

# Take average across the five folds
print("Double-Debiased Machine Learning effect of years of education: {:.4f}".format(np.mean(coeffs)))
coeffs

In [None]:
# Post-lasso 2SLS, following the process as done in class.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(Z1)
Z_scaled = scaler.transform(Z1)

# Here we choose our penalty parameter by cross-validation instead of fixing it by hand.
# A penalty that is too large shrinks every first-stage coefficient to zero and leaves no instruments.
lasso=LassoCV()

# predict d using Z_scaled:
lasso.fit(Z_scaled,d.values.ravel())

# keep only the instruments whose lasso coefficient is nonzero
Z_selected=Z_scaled[:,lasso.coef_!=0]
print('Number of instruments selected by lasso: ',Z_selected.shape[1])

# do the first stage regression via OLS using the selected Zs and get the fitted values:
postlasso_fs = linear_model.LinearRegression().fit(Z_selected,d.values.reshape(-1,1))

dhat_postlasso = postlasso_fs.predict(Z_selected)

# 2nd stage regression using the post-lasso fitted values:

tsls_postlasso = linear_model.LinearRegression().fit(dhat_postlasso,y)
print('Post-Lasso 2SLS coefficient: ',tsls_postlasso.coef_) # estimated causal coefficient for EDUC on our y variable
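The post-lasso 2SLS steps can be checked on synthetic data where the truth is known. In this sketch there are 50 candidate instruments, of which only the first three matter, and the true effect is set to 0.10. The penalty `alpha=0.05` is hand-picked for this simulation and is an assumption, not a recommendation; on real data it would be chosen by cross-validation or a plug-in rule.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, n_instruments = 20_000, 50
Z_demo = rng.normal(size=(n, n_instruments))   # candidate instruments
u = rng.normal(size=n)                         # unobserved confounder
d_demo = Z_demo[:, :3] @ np.array([0.5, 0.5, 0.5]) + u + rng.normal(size=n)
y_demo = 0.10 * d_demo + u + rng.normal(size=n)

# Lasso first stage: keep the instruments with nonzero coefficients
Z_scaled_demo = StandardScaler().fit_transform(Z_demo)
first_stage_lasso = Lasso(alpha=0.05).fit(Z_scaled_demo, d_demo)
selected = first_stage_lasso.coef_ != 0
print("instruments selected:", int(selected.sum()))

# Post-lasso first stage by OLS on the selected instruments, then fitted values
Z_sel = Z_scaled_demo[:, selected]
d_hat = LinearRegression().fit(Z_sel, d_demo).predict(Z_sel)

# Second stage: regress the outcome on the fitted treatment
beta_2sls = LinearRegression().fit(d_hat.reshape(-1, 1), y_demo).coef_[0]
beta_ols = LinearRegression().fit(d_demo.reshape(-1, 1), y_demo).coef_[0]
print(f"naive OLS: {beta_ols:.3f}   post-lasso 2SLS: {beta_2sls:.3f}   truth: 0.10")
```

Running the second stage by hand gives the right coefficient, but the standard errors it reports are not valid 2SLS standard errors, so only the point estimate is used here.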




In [None]:
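# Ridge with a cross-validated penalty, as a second method to compare with lasso on the same held-out test set.
ridge=RidgeCV()
ridgereg=ridge.fit(X_train, y_train)
# Note: .score() returns the out-of-sample R^2, not the MSE.
print("Ridge test R^2 (unscaled X): " + str(ridgereg.score(X_test, y_test)))
# Fit a separate estimator on the standardized features so the first fit is not overwritten.
ridgereg2=RidgeCV().fit(lassofitrain, y_train)
print("Ridge test R^2 (scaled X): " + str(ridgereg2.score(lassofitest, y_test)))
# Candidate tuning grid (number of trees, tree depth) for a random forest, for use with GridSearchCV.
param_grids={'n_estimators' : [20, 30, 50, 70, 100, 120],'max_depth' : [5, 6, 7, 8, 9, 10]}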
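A grid of `n_estimators` and `max_depth` values like the one above is the kind of input `GridSearchCV` expects for a random forest. The sketch below shows one way it could be used. It runs on a small synthetic dataset with a reduced grid so that it finishes quickly; the data, the grid values, and the three-fold setting are assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Small synthetic stand-in for the real features and label
X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

# Reduced grid; every combination is scored by cross-validation on the training set
small_grid = {'n_estimators': [20, 50], 'max_depth': [5, 10]}
search = GridSearchCV(RandomForestRegressor(random_state=0), small_grid, cv=3)
search.fit(X_tr, y_tr)

print("best tuning parameters:", search.best_params_)
# After the search, the best model is refit on the full training set;
# score() then reports its R^2 on the held-out test set
print("test R^2:", search.score(X_te, y_te))
```

The resulting test score can be compared directly with the lasso and ridge test scores to pick the method with the best out-of-sample accuracy.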