# Prediction 

## Introduction

In labor economics, an important question is what determines workers' wages. This is a causal question, but we can begin to investigate it from a predictive perspective.

In the following wage example, $Y$ is the hourly wage of a worker and $X$ is a vector of the worker's characteristics, e.g., education, experience, and gender. The two main questions here are:

* How can we use job-relevant characteristics, such as education and experience, to best predict wages?

* What is the difference in predicted wages between men and women with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

## Data

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-Hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with an hourly wage below $3$ dollars.

The variable of interest $Y$ is the hourly wage rate, constructed as the ratio of annual earnings to the total number of hours worked; total hours are in turn the product of the number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size $n=5150$.
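The wage construction described above can be sketched as follows. The column names (`annual_earnings`, `weeks_worked`, `usual_hours_per_week`) and the two example rows are hypothetical and only illustrate the arithmetic; the actual CPS extract may label these variables differently.

```python
import pandas as pd

# Hypothetical column names and toy values, for illustration only.
cps = pd.DataFrame({
    "annual_earnings": [52000.0, 31200.0],
    "weeks_worked": [52, 50],
    "usual_hours_per_week": [40, 36],
})

# Total hours worked = weeks worked x usual hours per week.
cps["total_hours"] = cps["weeks_worked"] * cps["usual_hours_per_week"]

# Hourly wage = annual earnings / total hours worked.
cps["wage"] = cps["annual_earnings"] / cps["total_hours"]

print(cps[["total_hours", "wage"]])
```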

---

In [28]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import sklearn.linear_model as lm
import sklearn.metrics as metrics
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score, train_test_split

## Data Analysis
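Set `file_dir` below to the location where you saved the data. The data is available at https://github.com/CausalAIBook/MetricsMLNotebooks/blob/main/data/wage2015_subsample_inference.rdata; note that the code below expects a CSV file named `wage2015_subsample_inference.csv`.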

In [19]:
# path to the local CSV copy of the wage data
file_dir = 'wage2015_subsample_inference.csv'
df = pd.read_csv(file_dir)

In [25]:
df.describe()

| | wage | lwage | sex | shs | hsg | scl | clg | ad | mw | so | we | ne | exp1 | exp2 | exp3 | exp4 | occ | occ2 | ind | ind2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 |
| mean | 23.41041 | 2.970787 | 0.444466 | 0.023301 | 0.243883 | 0.278058 | 0.31767 | 0.137087 | 0.259612 | 0.296505 | 0.216117 | 0.227767 | 13.760583 | 3.018925 | 8.235867 | 25.118038 | 5310.737476 | 11.670874 | 6629.154951 | 13.316893 |
| std | 21.003016 | 0.570385 | 0.496955 | 0.150872 | 0.429465 | 0.448086 | 0.465616 | 0.343973 | 0.438464 | 0.456761 | 0.411635 | 0.419432 | 10.609465 | 4.000904 | 14.488962 | 53.530225 | 11874.35608 | 6.966684 | 5333.443992 | 5.701019 |
| min | 3.021978 | 1.105912 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 1.0 | 370.0 | 2.0 |
| 25% | 13.461538 | 2.599837 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.25 | 0.125 | 0.0625 | 1740.0 | 5.0 | 4880.0 | 9.0 |
| 50% | 19.230769 | 2.956512 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 1.0 | 1.0 | 1.0 | 4040.0 | 13.0 | 7370.0 | 14.0 |
| 75% | 27.777778 | 3.324236 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 21.0 | 4.41 | 9.261 | 19.4481 | 5610.0 | 17.0 | 8190.0 | 18.0 |
| max | 528.845673 | 6.270697 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 47.0 | 22.09 | 103.823 | 487.9681 | 100000.0 | 22.0 | 100000.0 | 22.0 |


### Construct variables

We construct the outcome variable $Y$ (the log hourly wage) and the matrix $Z$, which includes the worker characteristics available in the data.

In [21]:
Y = np.log(df['wage'])
Z = df.drop(['wage', 'lwage'], axis=1)
Z.shape

(5150, 18)

For the outcome variable wage and a subset of the raw regressors, we calculate the empirical mean and other empirical measures to get familiar with the data.



In [22]:
Z.describe()

| | sex | shs | hsg | scl | clg | ad | mw | so | we | ne | exp1 | exp2 | exp3 | exp4 | occ | occ2 | ind | ind2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 | 5150.0 |
| mean | 0.444466 | 0.023301 | 0.243883 | 0.278058 | 0.31767 | 0.137087 | 0.259612 | 0.296505 | 0.216117 | 0.227767 | 13.760583 | 3.018925 | 8.235867 | 25.118038 | 5310.737476 | 11.670874 | 6629.154951 | 13.316893 |
| std | 0.496955 | 0.150872 | 0.429465 | 0.448086 | 0.465616 | 0.343973 | 0.438464 | 0.456761 | 0.411635 | 0.419432 | 10.609465 | 4.000904 | 14.488962 | 53.530225 | 11874.35608 | 6.966684 | 5333.443992 | 5.701019 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 1.0 | 370.0 | 2.0 |
| 25% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.25 | 0.125 | 0.0625 | 1740.0 | 5.0 | 4880.0 | 9.0 |
| 50% | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 10.0 | 1.0 | 1.0 | 1.0 | 4040.0 | 13.0 | 7370.0 | 14.0 |
| 75% | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 21.0 | 4.41 | 9.261 | 19.4481 | 5610.0 | 17.0 | 8190.0 | 18.0 |
| max | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 47.0 | 22.09 | 103.823 | 487.9681 | 100000.0 | 22.0 | 100000.0 | 22.0 |


For example, the share of female workers in our sample is about 44% ($sex = 1$ if female).

In [23]:
# if you want to print this table to latex
print(Z.describe().style.to_latex())

\begin{tabular}{lrrrrrrrrrrrrrrrrrr}
 & sex & shs & hsg & scl & clg & ad & mw & so & we & ne & exp1 & exp2 & exp3 & exp4 & occ & occ2 & ind & ind2 \\
count & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 \\
mean & 0.444466 & 0.023301 & 0.243883 & 0.278058 & 0.317670 & 0.137087 & 0.259612 & 0.296505 & 0.216117 & 0.227767 & 13.760583 & 3.018925 & 8.235867 & 25.118038 & 5310.737476 & 11.670874 & 6629.154951 & 13.316893 \\
std & 0.496955 & 0.150872 & 0.429465 & 0.448086 & 0.465616 & 0.343973 & 0.438464 & 0.456761 & 0.411635 & 0.419432 & 10.609465 & 4.000904 & 14.488962 & 53.530225 & 11874.356080 & 6.966684 & 5333.443992 & 5.701019 \\
min & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 

## Prediction Question

Now, we will construct a prediction rule for hourly wage  $Y$, which depends linearly on job-relevant characteristics  $X$:

$$
Y = \beta' X + \epsilon.
$$

Our goals are:

* Predict wages using various characteristics of workers.

* Assess the predictive performance using the (adjusted) sample MSE, the (adjusted) sample $R^2$  and the out-of-sample MSE and  $R^2$ .
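For reference, the performance measures listed above can be written out as follows. This is a sketch under the assumption that $p$ counts all estimated coefficients including the intercept, $\hat\beta$ denotes the coefficients estimated on the relevant sample, and the test sample has $m$ observations; the symbols $p$, $m$, $\hat\beta$, $\hat\beta_{train}$ and $\bar Y$ are introduced here for illustration.

$$
MSE_{sample} = \frac{1}{n}\sum_{i=1}^{n} (Y_i - \hat\beta' X_i)^2, \qquad
MSE_{adjusted} = \frac{n}{n-p}\, MSE_{sample},
$$

$$
R^2_{sample} = 1 - \frac{MSE_{sample}}{\frac{1}{n}\sum_{i=1}^{n} (Y_i - \bar Y)^2}, \qquad
R^2_{adjusted} = 1 - \frac{n-1}{n-p}\,\bigl(1 - R^2_{sample}\bigr),
$$

$$
MSE_{test} = \frac{1}{m}\sum_{i \in test} (Y_i - \hat\beta_{train}' X_i)^2, \qquad
R^2_{test} = 1 - \frac{MSE_{test}}{\frac{1}{m}\sum_{i \in test} (Y_i - \bar Y_{test})^2}.
$$

The adjusted versions penalize the number of regressors, so they can fall when regressors with little explanatory power are added, whereas the unadjusted $R^2_{sample}$ can only increase.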

We employ two different specifications for prediction:

1. **Basic Model**: $X$ consists of a set of raw regressors (e.g., gender, experience, education indicators, occupation and industry indicators, regional indicators).
2. **Flexible Model**: $X$ consists of all raw regressors from the basic model plus transformations of experience (e.g., $exp^2$ and $exp^3$) and two-way interactions of the polynomial in experience with the other regressors. An example of a regressor created through a two-way interaction is experience times the indicator of having a college degree.

Using the **Flexible Model** enables us to approximate the true relationship with a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver good prediction accuracy but yield models that are harder to interpret.
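As a small illustration of the two-way interaction mentioned above (experience times the college indicator), the sketch below builds that regressor by hand on a toy data frame. The values are made up; the column names `exp1` and `clg` follow the data set. In a formula, `exp1*clg` expands to `exp1 + clg + exp1:clg`, where `exp1:clg` is the product term constructed here.

```python
import pandas as pd

# Toy data with made-up values; exp1 = experience, clg = college indicator.
toy = pd.DataFrame({
    "exp1": [2.0, 10.0, 5.0, 20.0],
    "clg": [0, 1, 0, 1],
})

# Two-way interaction: experience times the college indicator.
# It equals exp1 for college graduates and 0 otherwise, so the wage-experience
# slope is allowed to differ between the two groups.
toy["exp1:clg"] = toy["exp1"] * toy["clg"]

print(toy)
```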

Now, let us fit both models to our data by running ordinary least squares (OLS):

In [24]:
# 1. Basic Model
model1 = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)'
results1 = smf.ols(model1, data=df).fit()
print(results1.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.310
Model:                            OLS   Adj. R-squared:                  0.303
Method:                 Least Squares   F-statistic:                     45.83
Date:                Tue, 17 Jan 2023   Prob (F-statistic):               0.00
Time:                        02:13:33   Log-Likelihood:                -3459.9
No. Observations:                5150   AIC:                             7022.
Df Residuals:                    5099   BIC:                             7356.
Df Model:                          50                                         
Covariance Type:            nonrobust                                         
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         3.7222      0.080     46.330

In [15]:
# 2. Flexible Model
model2 = 'lwage ~ sex + shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)'
results2 = smf.ols(model2, data=df).fit()
print(results2.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.351
Model:                            OLS   Adj. R-squared:                  0.319
Method:                 Least Squares   F-statistic:                     10.83
Date:                Tue, 17 Jan 2023   Prob (F-statistic):          2.69e-305
Time:                        02:10:37   Log-Likelihood:                -3301.9
No. Observations:                5150   AIC:                             7096.
Df Residuals:                    4904   BIC:                             8706.
Df Model:                         245                                         
Covariance Type:            nonrobust                                         
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              3.8603      0

#### Next we try the Lasso

We use the statsmodels formula API to build the design matrix of the flexible model and scikit-learn's `LassoCV`, which tunes the regularization hyperparameter by cross-validation.
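To see what cross-validated tuning produces, the following sketch fits `LassoCV` on simulated data (not the wage data) in which only the first three of thirty regressors matter. The data-generating process, the seed and the names `X_sim`, `y_sim`, `lasso_sim` are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Simulated data: 30 regressors, only the first three have non-zero effects.
rng = np.random.default_rng(0)
n_obs, n_feat = 200, 30
X_sim = rng.normal(size=(n_obs, n_feat))
beta_sim = np.zeros(n_feat)
beta_sim[:3] = [1.0, -0.5, 0.25]
y_sim = X_sim @ beta_sim + rng.normal(scale=0.5, size=n_obs)

# Standardize regressors so the penalty treats them symmetrically.
X_std = StandardScaler().fit_transform(X_sim)

# LassoCV picks the penalty level alpha_ by 5-fold cross-validation.
lasso_sim = LassoCV(cv=5, random_state=0).fit(X_std, y_sim)

n_selected = int(np.sum(lasso_sim.coef_ != 0))
print("chosen penalty:", lasso_sim.alpha_)
print("non-zero coefficients:", n_selected, "of", n_feat)
```

The fitted `alpha_` attribute holds the selected penalty, and coefficients shrunk exactly to zero correspond to regressors dropped from the prediction rule.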

In [42]:
# Lasso with cross-validation
model3 = smf.ols(model2, data=df)
X = model3.data.exog[:, 1:] # exclude the intercept; we don't want to penalize the intercept
y = model3.data.endog

# train model using CV 
X = StandardScaler().fit_transform(X)
results3 = lm.LassoCV().fit(X, y)


In [43]:
lasso_df = pd.DataFrame()
lasso_df['Var'] = model3.exog_names
lasso_df['Coef.'] = np.concatenate(([results3.intercept_], results3.coef_))
lasso_df

| | Var | Coef. |
|---|---|---|
| 0 | Intercept | 2.970787 |
| 1 | C(occ2)[T.2] | 0.002958 |
| 2 | C(occ2)[T.3] | 0.005807 |
| 3 | C(occ2)[T.4] | 0.000000 |
| 4 | C(occ2)[T.5] | -0.006160 |
| ... | ... | ... |
| 241 | exp4:scl | -0.000676 |
| 242 | exp4:clg | -0.014633 |
| 243 | exp4:mw | -0.017830 |
| 244 | exp4:so | -0.008131 |


#### Result evaluation

Now, we can evaluate the performance of the models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$.

In [44]:
# Print the R^2

# basic model
r2_basic = results1.rsquared
print(f"R-squared for the basic model: {r2_basic}")

adj_r2_basic = results1.rsquared_adj
print(f"Adjusted R-squared for the basic model: {adj_r2_basic}")

# flexible model
r2_flexible = results2.rsquared
print(f"R-squared for the flexible model: {r2_flexible}")

adj_r2_flexible = results2.rsquared_adj
print(f"Adjusted R-squared for the flexible model: {adj_r2_flexible}")

# Lasso model
lasso_preds = results3.predict(X)
mse_lasso = sum((y-lasso_preds)**2)/y.shape[0]
r2_lasso = 1. - mse_lasso/np.var(y)

print(f"R-squared for the lasso model: {r2_lasso}")

n = X.shape[0]
p = X.shape[1]
adj_r2_lasso = 1. - (1.-r2_lasso)*((n-1.)/(n-p-1.))
print(f"Adjusted R-squared for the lasso model: {adj_r2_lasso}")

R-squared for the basic model: 0.3100465069221948
Adjusted R-squared for the basic model: 0.303280930406429
R-squared for the flexible model: 0.3511098950617233
Adjusted R-squared for the flexible model: 0.31869185352218865
R-squared for the lasso model: 0.3177193384363113
Adjusted R-squared for the lasso model: 0.28363313083372077


In [45]:
# Print the MSE

# basic model
mse_basic = np.mean(results1.resid**2)
print(f"MSE for the basic model: {mse_basic}")

adj_mse_basic = results1.mse_resid
print(f"Adjusted MSE for the basic model: {adj_mse_basic}")

# flexible model
mse_flexible = np.mean(results2.resid**2)
print(f"MSE for the flexible model: {mse_flexible}")

adj_mse_flexible = results2.mse_resid
print(f"Adjusted MSE for the flexible model: {adj_mse_flexible}")

# Lasso model
#mse_lasso = metrics.mean_squared_error(lasso_preds, y)
print(f"MSE for the lasso model: {mse_lasso}")

adj_mse_lasso = mse_lasso*n/(n-p)
print(f"Adjusted MSE for the lasso model: {adj_mse_lasso}")

MSE for the basic model: 0.22442505581164465
Adjusted MSE for the basic model: 0.22666974650519053
MSE for the flexible model: 0.21106813644318276
Adjusted MSE for the flexible model: 0.2216559752614983
MSE for the lasso model: 0.22192927072168467
Adjusted MSE for the lasso model: 0.23301442287801755


In [46]:
# store the results in a table
res_df = pd.DataFrame()

res_df['Model'] = ['Basic reg', 'Flexible reg', 'Flexible Lasso']

res_df['p'] = [results1.params.shape[0],
           results2.params.shape[0],
           results2.params.shape[0]]

res_df['R2'] = [r2_basic, r2_flexible, r2_lasso]
res_df['MSE'] = [mse_basic, mse_flexible, mse_lasso]

res_df['adj_R2'] = [adj_r2_basic, adj_r2_flexible, adj_r2_lasso]
res_df['adj_MSE'] = [adj_mse_basic, adj_mse_flexible, adj_mse_lasso]

# Show results
res_df.head()

| | Model | p | R2 | MSE | adj_R2 | adj_MSE |
|---|---|---|---|---|---|---|
| 0 | Basic reg | 51 | 0.310047 | 0.224425 | 0.303281 | 0.22667 |
| 1 | Flexible reg | 246 | 0.35111 | 0.211068 | 0.318692 | 0.221656 |
| 2 | Flexible Lasso | 246 | 0.317719 | 0.221929 | 0.283633 | 0.233014 |


In [47]:
# print to Latex
print(res_df.style.to_latex())

\begin{tabular}{llrrrrr}
 & Model & p & R2 & MSE & adj_R2 & adj_MSE \\
0 & Basic reg & 51 & 0.310047 & 0.224425 & 0.303281 & 0.226670 \\
1 & Flexible reg & 246 & 0.351110 & 0.211068 & 0.318692 & 0.221656 \\
2 & Flexible Lasso & 246 & 0.317719 & 0.221929 & 0.283633 & 0.233014 \\
\end{tabular}



## Data Splitting

Measure the prediction quality of the two models via data splitting:

* Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we could consider).
* Use the training sample for estimating the parameters of the Basic Model and the Flexible Model.
* Use the testing sample for evaluation. Predict the `wage` of every observation in the testing sample based on the estimated parameters in the training sample.
* Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models.
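The steps above can be sketched compactly with scikit-learn's `train_test_split`. The example uses simulated data rather than the wage data, and the names `X_sim`, `y_sim` and the coefficient values are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulated regression data (illustrative only).
rng = np.random.default_rng(1)
X_sim = rng.normal(size=(500, 5))
y_sim = X_sim @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(scale=0.5, size=500)

# Step 1: random 80/20 split into training and testing samples.
X_tr, X_te, y_tr, y_te = train_test_split(X_sim, y_sim, test_size=0.2, random_state=0)

# Step 2: estimate the parameters on the training sample only.
ols = LinearRegression().fit(X_tr, y_tr)

# Step 3: predict the outcome for the testing sample.
pred_te = ols.predict(X_te)

# Step 4: out-of-sample MSE and R^2.
mse_test = np.mean((y_te - pred_te) ** 2)
r2_test = 1.0 - mse_test / np.var(y_te)
print("test MSE:", mse_test, "test R2:", r2_test)
```

Fixing `random_state` makes the split reproducible; without it, each run draws a different split and the test metrics change accordingly.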

In [48]:
# split the data into training and testing sets

# get the indices
n = df.shape[0]
sh = 4./5.
train_idx = np.random.choice(n, int(np.floor(sh*n)), replace=False)
test_idx = np.setdiff1d(np.arange(n), train_idx)

In [49]:
# Basic model

# build the design matrix and outcome vector from the formula (full sample)
mod1 = smf.ols(model1, data=df)
X1 = mod1.data.exog
y1 = mod1.data.endog

# separate training and testing sets
X1_train = X1[train_idx,:]
X1_test = X1[test_idx,:]
y1_train = y1[train_idx]
y1_test = y1[test_idx]

# estimating the parameters in the training sample
regbasic = sm.OLS(y1_train, X1_train).fit()

# predict out of sample
trainregbasic = regbasic.predict(X1_test)

# calculating out-of-sample MSE
MSE_test1 = sum((y1_test-trainregbasic)**2)/y1_test.shape[0]
R2_test1 = 1. - MSE_test1/np.var(y1_test)

print("Test MSE for the basic model: "+ str(MSE_test1))
print("Test R2 for the basic model: "+ str(R2_test1))

Test MSE for the basic model: 0.22723673540905004
Test R2 for the basic model: 0.28717496580701496


In the basic model, the $MSE_{test}$ is quite close to the $MSE_{sample}$.

In [52]:
# Flexible model

# build the design matrix and outcome vector from the formula (full sample)
mod2 = smf.ols(model2, data=df)
X2 = mod2.data.exog
y2 = mod2.data.endog

# separate training and testing sets
X2_train = X2[train_idx,:]
X2_test = X2[test_idx,:]
y2_train = y2[train_idx]
y2_test = y2[test_idx]

# estimating the parameters in the training sample
regflex = sm.OLS(y2_train, X2_train).fit()

# predict out of sample
trainregflex = regflex.predict(X2_test)

# calculating out-of-sample MSE
MSE_test2 = sum((y2_test-trainregflex)**2)/y2_test.shape[0]
R2_test2 = 1. - MSE_test2/np.var(y2_test)

print("Test MSE for the flexible model: "+ str(MSE_test2))
print("Test R2 for the flexible model: "+ str(R2_test2))

Test MSE for the flexible model: 0.2353247281368928
Test R2 for the flexible model: 0.26180352363064796


In the flexible model, the discrepancy between the $MSE_{test}$ and the $MSE_{sample}$ is not large.

It is worth noting that the $MSE_{test}$ varies across data splits. Hence, it is a good idea to average the out-of-sample MSE over different data splits to obtain more reliable results.
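Averaging over splits can be sketched as follows, again on simulated data. `ShuffleSplit` draws repeated random train/test partitions; the number of splits (20), the seed and the data-generating process are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit

# Simulated regression data (illustrative only).
rng = np.random.default_rng(2)
X_sim = rng.normal(size=(500, 5))
y_sim = X_sim @ np.array([0.5, -0.3, 0.2, 0.0, 0.1]) + rng.normal(scale=0.5, size=500)

# 20 independent random 80/20 splits.
splitter = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)

mse_by_split = []
for tr, te in splitter.split(X_sim):
    fit = LinearRegression().fit(X_sim[tr], y_sim[tr])
    resid = y_sim[te] - fit.predict(X_sim[te])
    mse_by_split.append(np.mean(resid ** 2))
mse_by_split = np.array(mse_by_split)

# The spread across splits shows how much a single split can mislead.
print("mean test MSE:", mse_by_split.mean())
print("std of test MSE across splits:", mse_by_split.std())
```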

Nevertheless, we observe that, based on the out-of-sample $MSE$, the basic model using OLS regression performs about as well as (or slightly better than) the flexible model.

Next, let us use lasso regression in the flexible model instead of OLS regression. Lasso (least absolute shrinkage and selection operator) is a penalized regression method that can be used to reduce the complexity of a regression model when the number of regressors $p$ is large relative to $n$.
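In the scaling convention used by scikit-learn's `Lasso`, the estimator solves the following penalized least-squares problem, where $\alpha \ge 0$ is the regularization hyperparameter, $b_0$ is the unpenalized intercept, and $X_{ij}$ denotes the $j$-th regressor of observation $i$ (notation introduced here for illustration):

$$
(\hat b_0, \hat b) = \arg\min_{b_0,\, b} \; \frac{1}{2n}\sum_{i=1}^{n} \Bigl(Y_i - b_0 - \sum_{j=1}^{p} b_j X_{ij}\Bigr)^2 + \alpha \sum_{j=1}^{p} |b_j|.
$$

With $\alpha = 0$ this reduces to OLS; larger values of $\alpha$ shrink coefficients toward zero and set some of them exactly to zero, which is why the regressors are typically standardized before fitting.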

Note that the out-of-sample $MSE$ on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to OLS regression.

In [54]:

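# predict out of sample
# NOTE: results3 was fitted on the standardized regressors of the full sample,
# whereas X2_test is passed here without standardization
trainreglasso = results3.predict(X2_test[:, 1:])

# calculating out-of-sample MSE
MSE_test3 = sum((y2_test-trainreglasso)**2)/y2_test.shape[0]
R2_test3 = 1. - MSE_test3/np.var(y2_test)

print("Test MSE for the lasso model: "+ str(MSE_test3))
print("Test R2 for the lasso model: "+ str(R2_test3))

Test MSE for the lasso model: 2.302599375525625
Test R2 for the lasso model: -6.223085984038611

These lasso figures should be read with caution: `results3` was fitted on standardized regressors from the full sample, while `X2_test` is passed unstandardized, which explains the very large $MSE_{test}$ and the negative $R^2_{test}$. They are therefore not a valid out-of-sample comparison with the OLS models.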
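A held-out evaluation of a penalized model usually learns both the scaling and the penalty from the training rows only, and then applies the same transformation to the test rows. The sketch below shows one way to do this with a scikit-learn pipeline on simulated data; the data, seeds and names (`X_sim`, `lasso_pipe`) are assumptions for illustration, and the resulting numbers say nothing about the wage data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data with non-zero means and scales, so standardization matters.
rng = np.random.default_rng(3)
X_sim = rng.normal(loc=5.0, scale=3.0, size=(400, 20))
beta_sim = np.zeros(20)
beta_sim[:3] = [1.0, -0.5, 0.25]
y_sim = 1.0 + X_sim @ beta_sim + rng.normal(scale=0.5, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_sim, y_sim, test_size=0.2, random_state=0)

# The pipeline fits the scaler and the cross-validated lasso on training rows only.
lasso_pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
lasso_pipe.fit(X_tr, y_tr)

# predict() applies the training-sample scaling to the test rows before predicting.
pred_te = lasso_pipe.predict(X_te)
mse_test = np.mean((y_te - pred_te) ** 2)
r2_test = 1.0 - mse_test / np.var(y_te)
print("test MSE:", mse_test, "test R2:", r2_test)
```

Bundling the scaler and the estimator in one pipeline keeps the test rows out of every fitting step and guarantees that training and test regressors are on the same scale at prediction time.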


In [55]:
# store the results in a table
res_df2 = pd.DataFrame()

res_df2['Model'] = ['Basic reg', 'Flexible reg', 'Flexible Lasso']

res_df2['$MSE_{test}$'] = [MSE_test1, MSE_test2, MSE_test3]
res_df2['$R^2_{test}$'] = [R2_test1, R2_test2, R2_test3]

# Show results
res_df2.head()

| | Model | $MSE_{test}$ | $R^2_{test}$ |
|---|---|---|---|
| 0 | Basic reg | 0.227237 | 0.287175 |
| 1 | Flexible reg | 0.235325 | 0.261804 |
| 2 | Flexible Lasso | 2.302599 | -6.223086 |


In [56]:
# print to Latex
print(res_df2.style.to_latex())

\begin{tabular}{llrr}
 & Model & $MSE_{test}$ & $R^2_{test}$ \\
0 & Basic reg & 0.227237 & 0.287175 \\
1 & Flexible reg & 0.235325 & 0.261804 \\
2 & Flexible Lasso & 2.302599 & -6.223086 \\
\end{tabular}

