# Prediction

Outcome to be predicted: $Y_i$
> *example:* a worker's log wage

Characteristics (aka **features**): $X_i=\left(X_{1i},\ldots,X_{pi}\right)'$
> *example:* education, age, state of birth, parents' education, cognitive ability, family background


In [None]:
%matplotlib inline
# give notebook access to google drive:
from google.colab import drive
drive.mount('/content/gdrive')
%cd '/content/gdrive/My Drive/mixtape ML+causal'

# import some useful packages
import pandas as pd
import numpy as np
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split


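# 'seaborn-whitegrid' was renamed to 'seaborn-v0_8-whitegrid' in matplotlib 3.6; use whichever is available
style='seaborn-v0_8-whitegrid' if 'seaborn-v0_8-whitegrid' in plt.style.available else 'seaborn-whitegrid'
plt.style.use(style)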

# read in data
nlsy=pd.read_csv('data/nlsy97.csv')
nlsy

## Least squares benchmark

In [None]:
# generate dictionary of transformations of education
powerlist=[nlsy['educ']**j for j in np.arange(1,10)]
X=pd.concat(powerlist,axis=1)
X.columns = ['educ'+str(j) for j in np.arange(1,10)]
# standardize our X matrix (doesn't matter for OLS, but will matter for lasso below)
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# run least squares regression
reg=linear_model.LinearRegression().fit(X_scaled,nlsy['lnw_2016'])
yhat=reg.predict(X_scaled)


In [None]:
# plot predicted values
lnwbar=nlsy.groupby('educ')['lnw_2016'].mean()
Xbar=pd.DataFrame({'educ':lnwbar.index.values})
powerlist=[Xbar['educ']**j for j in np.arange(1,10)]
Xbar=pd.concat(powerlist,axis=1)
Xbar.columns = X.columns
Xbar_scaled = scaler.transform(Xbar)
ybarhat=reg.predict(Xbar_scaled)
fig = plt.figure()
ax = plt.axes()
ax.plot(Xbar['educ1'],lnwbar,'bo',Xbar['educ1'],ybarhat,'g-');
plt.title("ln Wages by Education in the NLSY")
plt.xlabel("years of schooling")
plt.ylabel("ln wages");

As we can see, least squares linear regression with enough polynomial terms can approximate any continuous function and can certainly be used for prediction. Include a rich enough set of transformations, and OLS predictions will be approximately unbiased estimates of the ideal predictor, the conditional expectation function. But these estimates will be quite noisy. Penalized regression can greatly reduce the variance, at the expense of some bias. If the variance reduction is large enough to outweigh the squared bias, the predictions will have lower MSE. Back to the whiteboard!
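
The bias-variance tradeoff can be seen in a setting much simpler than lasso. The sketch below is a toy analogue, not the NLSY regression: it estimates a mean from a small sample and compares the plain sample mean with versions shrunk toward zero. All names and parameter values (`mu`, `sigma`, `n`, `shrink`) are illustrative assumptions.

```python
import random

def analytic_mse(shrink, mu=0.5, sigma=2.0, n=10):
    """MSE of shrink * (sample mean) as an estimator of mu: variance + bias^2."""
    variance = shrink ** 2 * sigma ** 2 / n
    bias_sq = ((shrink - 1.0) * mu) ** 2
    return variance + bias_sq

def simulated_mse(shrink, mu=0.5, sigma=2.0, n=10, reps=20000, seed=0):
    """Monte Carlo estimate of the same MSE."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        total += (shrink * xbar - mu) ** 2
    return total / reps

# shrink = 1.0 is the unbiased sample mean; smaller values trade bias for variance
for shrink in (1.0, 0.75, 0.5):
    print("shrink = {:.2f}: analytic MSE = {:.4f}, simulated MSE = {:.4f}".format(
        shrink, analytic_mse(shrink), simulated_mse(shrink)))
```

With these assumed values the shrunken estimator is biased but has lower MSE than the unbiased one, which is the same logic that motivates penalized regression.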


## Lasso in action

Welcome back! Let's see lasso in action:

In [None]:
# fit lasso with a couple of different alphas and plot results
lasso1 = linear_model.Lasso(alpha=.001,max_iter=1000).fit(X_scaled,nlsy['lnw_2016'])
ybarhat1=lasso1.predict(Xbar_scaled)
lasso2 = linear_model.Lasso(alpha=.01,max_iter=1000).fit(X_scaled,nlsy['lnw_2016'])
ybarhat2=lasso2.predict(Xbar_scaled)


Plot results

In [None]:
fig1,(ax11,ax12,ax13) = plt.subplots(1,3,figsize=(12, 4))
ax11.barh(Xbar.columns,reg.coef_,align='center');
ax11.set_title("OLS coefficients")
ax11.set_xlabel("coefficient")
ax12.barh(Xbar.columns,lasso1.coef_,align='center');
ax12.set_title("Lasso coefficients (alpha = {:.3f})".format(lasso1.get_params()['alpha']))
ax12.set_xlabel("coefficient")
ax13.barh(Xbar.columns,lasso2.coef_,align='center');
ax13.set_title("Lasso coefficients (alpha = {:.2f})".format(lasso2.get_params()['alpha']))
ax13.set_xlabel("coefficient")
fig2,(ax21,ax22,ax23) = plt.subplots(1,3,figsize=(12,4))
ax21.plot(Xbar['educ1'],lnwbar,'bo',Xbar['educ1'],ybarhat,'g-');
ax21.set_title("OLS fit")
ax21.set_xlabel("years of schooling")
ax21.set_ylabel("ln wages");
ax22.plot(Xbar['educ1'],lnwbar,'bo',Xbar['educ1'],ybarhat1,'g-');
ax22.set_title("Lasso fit (alpha = {:.3f})".format(lasso1.get_params()['alpha']))
ax22.set_xlabel("years of schooling")
ax22.set_ylabel("ln wages");
ax23.plot(Xbar['educ1'],lnwbar,'bo',Xbar['educ1'],ybarhat2,'g-');
ax23.set_title("Lasso fit (alpha = {:.2f})".format(lasso2.get_params()['alpha']))
ax23.set_xlabel("years of schooling")
ax23.set_ylabel("ln wages");

Play around with different values for alpha to see how the fit changes!
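
One way to build intuition before experimenting: in the special case where the standardized regressors are mutually uncorrelated, the lasso solution has a closed form. Each OLS coefficient is pulled toward zero by alpha and set exactly to zero once its magnitude falls below alpha (soft thresholding). The powers of education used above are highly correlated, so this is only a stylized sketch, and the coefficient values are made up for illustration.

```python
def soft_threshold(b, alpha):
    """Shrink b toward zero by alpha; return 0 if |b| <= alpha."""
    if b > alpha:
        return b - alpha
    if b < -alpha:
        return b + alpha
    return 0.0

# hypothetical OLS coefficients on uncorrelated, standardized regressors
ols_coefs = [0.30, -0.12, 0.04, -0.008, 0.002]

for alpha in (0.001, 0.01, 0.1):
    lasso_coefs = [soft_threshold(b, alpha) for b in ols_coefs]
    n_nonzero = sum(c != 0 for c in lasso_coefs)
    print("alpha = {:<5}: {} nonzero ->".format(alpha, n_nonzero),
          [round(c, 3) for c in lasso_coefs])
```

Larger alphas both shrink the surviving coefficients and drop more of them, which is the pattern to look for in the bar charts above.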

### Data-driven tuning parameters: Cross-validation

In [None]:
# define grid for alpha
alpha_grid = {'alpha': [.0001,.001,.002, .004, .006, .008, .01, .012, .014, .016 ,.018, .02 ],'max_iter': [100000]}
grid_search = GridSearchCV(linear_model.Lasso(),alpha_grid,cv=5,return_train_score=True).fit(X_scaled,nlsy['lnw_2016'])
print("Best alpha: ",grid_search.best_estimator_.get_params()['alpha'])
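
What `GridSearchCV` does under the hood can be sketched by hand: split the sample into folds, fit on all but one fold for each candidate alpha, score on the held-out fold, and pick the alpha with the best average score. The sketch below does this in plain Python for a one-regressor lasso without an intercept on simulated data. The data-generating process, the helper names, and the use of MSE as the score (rather than the R-squared that `GridSearchCV` uses by default for regressors) are assumptions made for illustration.

```python
import random

def fit_lasso_1d(x, y, alpha):
    """One-regressor lasso, no intercept: minimizes (1/2n)*sum((y-b*x)^2) + alpha*|b|."""
    n = len(x)
    sxy = sum(xi * yi for xi, yi in zip(x, y)) / n
    sxx = sum(xi * xi for xi in x) / n
    if sxy > alpha:
        return (sxy - alpha) / sxx
    if sxy < -alpha:
        return (sxy + alpha) / sxx
    return 0.0

def cv_mse(x, y, alpha, k=5):
    """Average held-out MSE across k contiguous folds."""
    n = len(x)
    fold_mses = []
    for fold in range(k):
        test_idx = set(range(fold * n // k, (fold + 1) * n // k))
        x_tr = [x[i] for i in range(n) if i not in test_idx]
        y_tr = [y[i] for i in range(n) if i not in test_idx]
        b = fit_lasso_1d(x_tr, y_tr, alpha)
        errs = [(y[i] - b * x[i]) ** 2 for i in test_idx]
        fold_mses.append(sum(errs) / len(errs))
    return sum(fold_mses) / k

# simulated data with a true slope of 1.0 (an assumption for this sketch)
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(200)]
y = [1.0 * xi + rng.gauss(0, 1) for xi in x]

alpha_grid = [0.0001, 0.001, 0.01, 0.1, 0.5, 2.0]
scores = {a: cv_mse(x, y, a) for a in alpha_grid}
best_alpha = min(scores, key=scores.get)
print("Best alpha: ", best_alpha)
```

The largest alpha in this grid shrinks the slope all the way to zero, so its held-out error is much worse than that of the small alphas.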

### Lasso-guided variable selection
For illustrative purposes we've been using lasso to determine the functional form for a single underlying regressor: education. But lasso's real power comes in selecting among a large number of regressors.

In [None]:
# Define "menu" of regressors:
X=nlsy.drop(columns=['lnw_2016','exp'])

# Divide into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, nlsy['lnw_2016'],random_state=42)
# Scale regressors
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Do cross-validated Lasso (the easy way!)
lassocv=linear_model.LassoCV(random_state=42).fit(X_train_scaled,y_train)
print("Number of regressors in the menu: ",len(X.columns))
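print("Number of regressors selected by lasso: ",sum(lassocv.coef_!=0))
# check out-of-sample fit on the held-out test set (R-squared)
print("Test-set R-squared: ",lassocv.score(X_test_scaled,y_test))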
# look at the coefficients
results = pd.DataFrame({'feature': X.columns[lassocv.coef_!=0],'coefficient': lassocv.coef_[lassocv.coef_!=0]})
results

To try on your own: load the Oregon HIE data from earlier and try lassoing the OLS regression we did there. What do you notice?


In [None]:
# Load Oregon HIE data . . .

## Connection to ML

Where does machine learning fit into this? It might be tempting to treat
this regression as a prediction exercise where we are predicting $Y_{i}$
given $D_{i}$ and $X_{i}$. Don't give in to this temptation. We are not
after a prediction for $Y_{i}$, we are after a coefficient on $D_{i}$.
Modern machine learning algorithms are finely tuned for producing
predictions, but along the way they compromise coefficients. So how can we
deploy machine learning in the service of estimating the causal coefficient $\delta $?

To see where ML fits in, first remember that an equivalent way to estimate $\delta$ is the following three-step procedure:


1. Regress $Y_{i}$ on $X_{i}$ and compute the residuals, $\tilde{Y}_{i}=Y_{i}-\hat{Y}_{i}^{OLS}$, where $\hat{Y}_{i}^{OLS}=X_{i}^{\prime}\left( X^{\prime}X\right)^{-1}X^{\prime}Y$.
2. Regress $D_{i}$ on $X_{i}$ and compute the residuals, $\tilde{D}_{i}=D_{i}-\hat{D}_{i}^{OLS}$, where $\hat{D}_{i}^{OLS}=X_{i}^{\prime}\left( X^{\prime}X\right)^{-1}X^{\prime}D$.
3. Regress $\tilde{Y}_{i}$ on $\tilde{D}_{i}$.

Let's try it!

In [None]:
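# x, y, d, w and lm are assumed to carry over from the earlier Oregon HIE exercise
# (covariates, outcome, treatment, sample weights, and a LinearRegression object)
# Regress outcome on covariates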
yreg=linear_model.LinearRegression().fit(x,y,w)
# Calculate residuals
ytilde = y - yreg.predict(x)

# regress treatment on covariates
dreg = linear_model.LinearRegression().fit(x,d,w)
# Calculate residuals
dtilde = d - dreg.predict(x)

# regress ytilde on dtilde
lm.fit(dtilde,ytilde,w)
print("Estimated effect of Medicaid eligibility on \n number of doctor visits" +
      " (partialled out): {:.3f}".format(lm.coef_[0]))
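
The equivalence behind the three-step procedure (the Frisch-Waugh-Lovell theorem) can also be checked on simulated data, without the Oregon HIE files. The sketch below uses a single covariate and no weights. The data-generating process, the true effect of 2.0, and the helper functions are assumptions made for illustration.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Sum of centered cross-products (unnormalized covariance)."""
    ub, vb = mean(u), mean(v)
    return sum((ui - ub) * (vi - vb) for ui, vi in zip(u, v))

def ols_fit(x, y):
    """Simple regression of y on x with an intercept; returns (intercept, slope)."""
    slope = cov(x, y) / cov(x, x)
    return mean(y) - slope * mean(x), slope

def residuals(x, y):
    a, b = ols_fit(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

rng = random.Random(0)
n = 500
x = [rng.gauss(0, 1) for _ in range(n)]
d = [0.8 * xi + rng.gauss(0, 1) for xi in x]   # treatment correlated with x
y = [2.0 * di + 1.5 * xi + rng.gauss(0, 1) for di, xi in zip(d, x)]

# Steps 1-3: partial x out of y and d, then regress residual on residual
ytilde = residuals(x, y)
dtilde = residuals(x, d)
_, delta_fwl = ols_fit(dtilde, ytilde)

# Coefficient on d from the full regression of y on (1, d, x), two-regressor closed form
delta_full = ((cov(d, y) * cov(x, x) - cov(x, y) * cov(d, x))
              / (cov(d, d) * cov(x, x) - cov(d, x) ** 2))

# For contrast: the short regression of y on d that ignores x
_, delta_naive = ols_fit(d, y)

print("partialled out: {:.3f}, full regression: {:.3f}, ignoring x: {:.3f}".format(
    delta_fwl, delta_full, delta_naive))
```

The first two numbers agree up to floating-point error, while the regression that omits the covariate is pushed away from the assumed true effect.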

ML enters the picture by providing an alternate way to generate $\hat{Y}_i$ and $\hat{D}_i$ when OLS is not the best tool for the job. The first two steps are really just prediction exercises, and in principle any supervised machine learning algorithm can step in here. Back to the whiteboard!
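
As a rough illustration of that idea, the sketch below swaps lasso predictions in for the OLS fitted values in the first two steps. It runs on simulated data rather than the Oregon HIE data, the data-generating process and the true effect of 1.0 are assumptions, and for brevity it fits and residualizes on the same sample.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# simulated data: 20 candidate covariates, only a few of which matter (assumed setup)
rng = np.random.default_rng(0)
n, p = 2000, 20
X = rng.normal(size=(n, p))
d = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
y = 1.0 * d + 2.0 * X[:, 0] - X[:, 2] + rng.normal(size=n)

# Steps 1 and 2: predict y and d from X with cross-validated lasso, keep the residuals
ytilde = y - LassoCV(cv=5, random_state=0).fit(X, y).predict(X)
dtilde = d - LassoCV(cv=5, random_state=0).fit(X, d).predict(X)

# Step 3: regress residual on residual (both have mean zero because lasso fits an intercept)
delta_hat = float(dtilde @ ytilde / (dtilde @ dtilde))
print("Estimated effect (lasso partialling out): {:.3f}".format(delta_hat))
```

Any other supervised learner with `fit` and `predict` methods could stand in for `LassoCV` in the first two steps.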