Authors: Andreas Haupt, Jannis Kück, Alexander Quispe, Anzony Quispe

# Machine Learning Estimators for Wage Prediction

We illustrate how to predict an outcome variable $Y$ in a high-dimensional setting, where the number of covariates $p$ is large in relation to the sample size $n$. So far we have used linear prediction rules, e.g. Lasso regression, for estimation.
Now, we also consider nonlinear prediction rules including tree-based methods.

## Data

Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015.
The preprocessed sample consists of $5150$ never-married individuals.

If you prefer to work from a local copy, download https://github.com/CausalAIBook/MetricsMLNotebooks/blob/main/PM1/wage2015_subsample_inference.rdata and pass its path to `pyreadr.read_r`; otherwise the code below fetches the file directly.

In [None]:
# Import relevant packages
!pip install pyreadr
!pip install wget
import pyreadr
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import statsmodels.formula.api as smf
import wget
np.random.seed(1234)

In [None]:
rdata_read = pyreadr.read_r(wget.download("https://github.com/CausalAIBook/MetricsMLNotebooks/blob/main/data/wage2015_subsample_inference.rdata?raw=true"))
data = rdata_read['data']
print(type(data))
print(data.shape)
data.head()

In [None]:
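# Outcome: log hourly wage. The Series keeps the name 'wage', so the formulas below refer to log wage as 'wage'
Y = np.log(data['wage'])
# Covariates: every column except the wage variables
Z = data[data.columns.difference(['wage', 'lwage'])]
Z.columns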

The following figure shows the hourly wage distribution from the US survey data.

In [None]:
plt.hist(data.wage , bins = np.arange(0, 350, 20) )
plt.xlabel('hourly wage')
plt.ylabel('Frequency')
plt.title( 'Empirical wage distribution from the US survey data' )
plt.ylim((0, 3000))

Wages show a high degree of skewness. Hence, almost all studies transform wages by taking
the logarithm.
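
The effect of the log transformation on skewness can be checked numerically. The following is a minimal sketch on simulated data, not on the CPS sample: `wage_sim` and `sample_skewness` are hypothetical names introduced here, and the log-normal parameters are arbitrary assumptions.

```python
import numpy as np

def sample_skewness(x):
    """Third standardized moment; close to 0 for a symmetric distribution."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic "wages": log-normal draws (an assumption, not the CPS data)
rng = np.random.default_rng(0)
wage_sim = rng.lognormal(mean=3.0, sigma=0.6, size=5000)

skew_raw = sample_skewness(wage_sim)
skew_log = sample_skewness(np.log(wage_sim))
print(f"skewness of simulated wages:     {skew_raw:.2f}")
print(f"skewness of simulated log wages: {skew_log:.2f}")
```

In this simulation the raw values have a long right tail, while their logarithm is roughly symmetric.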

## Analysis

Because the wage distribution is skewed, we model log wages, which leads to the following regression model:

$$\log(\operatorname{wage}) = g(Z) + \epsilon.$$

We will estimate two sets of prediction rules: linear and nonlinear models.
In linear models, we estimate a prediction rule of the form

$$\hat g(Z) = \hat \beta'X.$$

Again, we generate $X$ in two ways:
 
1. Basic Model: $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, regional indicators, and occupation and industry indicators).

2. Flexible Model: $X$ consists of all raw regressors from the basic model plus transformations of experience (e.g., $\operatorname{exp}^2$ and $\operatorname{exp}^3$) and additional two-way interactions of the experience terms with the other regressors.
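
To make the difference between the two designs concrete, here is a small sketch that builds a basic and a flexible regressor matrix by hand. Everything in it is an assumption for illustration: the data are simulated, only three indicators are used, and the variable names merely mimic those in the data set.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Hypothetical raw regressors (simulated, for illustration only)
indicators = {
    "sex": rng.integers(0, 2, size=n).astype(float),
    "clg": rng.integers(0, 2, size=n).astype(float),  # college indicator
    "so": rng.integers(0, 2, size=n).astype(float),   # region indicator
}
exp1 = rng.uniform(0, 40, size=n)                      # years of experience

# Basic design: intercept, indicators, and experience
X_basic = np.column_stack([np.ones(n), *indicators.values(), exp1])

# Flexible design: add powers of experience and their interactions with the indicators
exp_terms = {"exp1": exp1, "exp2": exp1 ** 2 / 100, "exp3": exp1 ** 3 / 1000}
interactions = [
    exp_terms[e] * indicators[d]
    for e, d in itertools.product(exp_terms, indicators)
]
X_flex = np.column_stack(
    [np.ones(n), *indicators.values(), *exp_terms.values(), *interactions]
)

print(X_basic.shape, X_flex.shape)
```

Even with three indicators the flexible design has more than three times as many columns as the basic one; with occupation and industry indicators the number of interaction columns grows much faster.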

To evaluate the out-of-sample performance, we split the data first.

In [None]:
# Hold out 25% of the observations as a test sample
X_train, X_test, y_train, y_test = train_test_split(Z, Y, test_size=0.25, random_state=123)

In [None]:
data_train = pd.concat([y_train, X_train], axis=1)
print(data_train.shape)
data_train

We start by running a simple OLS regression. We fit the basic and flexible models to the training data and compute the mean squared error on the test sample.

In [None]:
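# 'wage' in data_train is the log wage (see the definition of Y above)
model1 = 'wage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)'
# HC3 only changes the reported standard errors, not the coefficients or predictions
results1 = smf.ols(model1, data=data_train).fit(cov_type = "HC3")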

In [None]:
yhat_lm_basic = results1.predict(X_test)
print(f"The mean squared error (MSE) using the basic model is equal to {np.mean((y_test-yhat_lm_basic)**2)}") # MSE OLS (basic model)

We can also compute the out-of-sample MSE and its standard error in one step:

In [None]:
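# squared prediction errors on the test sample
resid_basic = (y_test-yhat_lm_basic)**2

# regressing them on a constant returns their mean (the MSE) and its standard error
MSE_lm_basic = sm.OLS(resid_basic , np.ones(resid_basic.shape[0])).fit().summary2().tables[1].iloc[0, 0:2]
MSE_lm_basic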
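
Regressing the squared errors on a constant is a compact way to get a sample mean together with its classical standard error. The sketch below checks this equivalence with plain NumPy on simulated predictions; `y_sim` and `yhat_sim` are hypothetical stand-ins for the test outcomes and predictions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical test outcomes and predictions (simulated)
y_sim = rng.normal(size=500)
yhat_sim = y_sim + rng.normal(scale=0.5, size=500)

sq_err = (y_sim - yhat_sim) ** 2
n_test = sq_err.shape[0]

# Direct formulas: sample mean and s / sqrt(n)
mse = sq_err.mean()
mse_se = sq_err.std(ddof=1) / np.sqrt(n_test)

# Same quantities from a least-squares fit on a column of ones
ones = np.ones((n_test, 1))
coef, rss, _, _ = np.linalg.lstsq(ones, sq_err, rcond=None)
sigma2 = rss[0] / (n_test - 1)        # residual variance with one estimated parameter
se_ls = np.sqrt(sigma2 / n_test)

print(f"MSE = {mse:.4f}, standard error = {mse_se:.4f}")
```

The standard error treats the squared errors as independent draws, which is reasonable for a randomly held-out test sample.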

We also compute the out-of-sample $R^2$:

In [None]:
R2_lm_basic = 1 - ( MSE_lm_basic.iloc[0]/y_test.var() )
print( f"The R^2 using the basic model is equal to {R2_lm_basic}" ) # R^2 OLS (basic model)
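
The out-of-sample $R^2$ compares the test MSE with the variance of the test outcomes, so it can be read as the improvement over predicting a constant. A minimal sketch on simulated data follows; `oos_r2` is a hypothetical helper, and it uses `np.var` (which divides by $n$), whereas pandas `.var()` divides by $n-1$ by default, so the two differ slightly in finite samples.

```python
import numpy as np

def oos_r2(y_true, y_pred):
    """Out-of-sample R^2: 1 - MSE / variance of the test outcomes."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = np.mean((y_true - y_pred) ** 2)
    return float(1.0 - mse / np.var(y_true))  # np.var divides by n on arrays

rng = np.random.default_rng(3)
y_sim = rng.normal(loc=3.0, scale=0.6, size=400)  # simulated "log wages"

r2_mean = oos_r2(y_sim, np.full_like(y_sim, y_sim.mean()))  # constant prediction
r2_perfect = oos_r2(y_sim, y_sim)                            # perfect prediction
r2_bad = oos_r2(y_sim, np.zeros_like(y_sim))                 # badly centered prediction

print(r2_mean, r2_perfect, r2_bad)
```

Unlike the in-sample $R^2$ of an OLS fit with an intercept, the out-of-sample version can be negative when predictions are worse than the test-sample mean.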

We repeat the same procedure for the flexible model.

In [None]:
model2 = 'wage ~ sex + shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)'
results2 = smf.ols(model2, data = data_train).fit(cov_type = "HC3")

In [None]:
yhat_lm_flex = results2.predict(X_test)
print(f"The mean squared error (MSE) using the flexible model is equal to {np.mean((y_test-yhat_lm_flex)**2)}") # MSE OLS (flex model)

In [None]:
resid_flex = (y_test-yhat_lm_flex)**2

MSE_lm_flex = sm.OLS(resid_flex , np.ones(resid_flex.shape[0])).fit().summary2().tables[1].iloc[0, 0:2]
MSE_lm_flex

In [None]:
R2_lm_flex = 1 - ( MSE_lm_flex.iloc[0]/y_test.var() )
print( f"The R^2 using the flexible model is equal to {R2_lm_flex}" ) # R^2 OLS (flex model)

We observe that OLS regression works better for the basic model, which has the smaller $p/n$ ratio. We proceed by running lasso regression and its variants.
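
The role of the $p/n$ ratio can be illustrated with a small simulation. This is a sketch under assumed settings (Gaussian regressors, three relevant coefficients, unit noise variance), not a reproduction of the wage results.

```python
import numpy as np

rng = np.random.default_rng(4)
n_train, n_test, p_max = 100, 2000, 80

# Simulated data: only the first 3 of p_max regressors matter
X_all = rng.normal(size=(n_train + n_test, p_max))
beta = np.zeros(p_max)
beta[:3] = 1.0
y_all = X_all @ beta + rng.normal(size=n_train + n_test)

X_tr, X_te = X_all[:n_train], X_all[n_train:]
y_tr, y_te = y_all[:n_train], y_all[n_train:]

def ols_test_mse(p):
    """Fit OLS on the first p regressors and return the test MSE."""
    coef, *_ = np.linalg.lstsq(X_tr[:, :p], y_tr, rcond=None)
    return float(np.mean((y_te - X_te[:, :p] @ coef) ** 2))

for p in (3, 20, 80):
    print(f"p/n = {p / n_train:.2f}: test MSE = {ols_test_mse(p):.3f}")
```

In this setup the extra regressors carry no signal, so adding them only adds estimation noise, and the test MSE deteriorates as $p$ approaches $n$. Penalized estimators are designed to limit this deterioration.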

In [None]:
import sklearn.linear_model as lm

# Lasso with cross-validation
flex_model_train = smf.ols(model2, data = data_train)
X_train_flex = flex_model_train.data.exog # model matrix; it already contains an intercept column

# train model using 10-fold CV; no separate intercept because the model matrix has one
lassocv_reg = lm.LassoCV(cv=10, fit_intercept= False)
lassomod = lassocv_reg.fit(X_train_flex, y_train)

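# predict out of sample
# (the test model matrix is built separately, so this assumes every occ2 and ind2
# category appears in both the training and the test sample)
data_test = pd.concat([y_test, X_test], axis=1)
flex_model_test = smf.ols(model2, data = data_test)
X_test_matrix = flex_model_test.data.exog
y2 = flex_model_test.data.endog
trainreglasso = lassomod.predict(X_test_matrix)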

# calculating out-of-sample MSE
MSE_lasso = np.mean((y_test-trainreglasso)**2)
R2_lasso = 1. - MSE_lasso/np.var(y_test)

print("Test MSE for the flexible model using lasso: "+ str(MSE_lasso))
print("Test R2 for the flexible model using lasso: "+ str(R2_lasso))
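
To see why lasso sets some coefficients exactly to zero, it helps to look at a special case with a closed-form solution. The sketch below assumes an orthonormal design ($Q'Q = I$) and the objective $\tfrac12\|y - Qb\|^2 + \lambda\|b\|_1$, for which the lasso solution is the soft-thresholded OLS estimate. The data, `lam`, and `soft_threshold` are hypothetical, and the scaling of the penalty differs across software.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each entry toward zero by lam; entries in [-lam, lam] become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

rng = np.random.default_rng(5)
n, p = 200, 10

# Orthonormal design from the QR decomposition of a random matrix
Q, _ = np.linalg.qr(rng.normal(size=(n, p)))
beta_true = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = Q @ beta_true + 0.1 * rng.normal(size=n)

beta_ols = Q.T @ y                 # OLS coefficients when Q'Q = I
lam = 0.5
beta_lasso = soft_threshold(beta_ols, lam)

print("OLS:  ", np.round(beta_ols, 2))
print("lasso:", np.round(beta_lasso, 2))
```

The small, noise-driven OLS coefficients are set exactly to zero, while the large ones are shrunk by `lam`. For a general design there is no closed form, and the solution is computed iteratively.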


In [None]:
# Ridge with cross-validation

# train model using CV 
ridgecv_reg = lm.RidgeCV(cv=5, fit_intercept= False)
ridgemod = ridgecv_reg.fit(X_train_flex, y_train)

# predict out of sample
trainregridge = ridgemod.predict(X_test_matrix)

# calculating out-of-sample MSE
MSE_ridge = np.mean((np.array(y_test)-trainregridge)**2)
R2_ridge = 1. - MSE_ridge/np.var(y_test)

print("Test MSE for the flexible model using ridge: "+ str(MSE_ridge))
print("Test R2 for the flexible model using ridge: "+ str(R2_ridge))
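
Ridge regression, unlike lasso, has a closed-form solution for any design. The following sketch implements it with NumPy on simulated data and picks the penalty on a single validation split, which is a simplified stand-in for the cross-validation used above. `ridge_fit`, the penalty grid, and the data are assumptions for illustration, and the penalty here applies to every column.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients: (X'X + alpha * I)^{-1} X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(6)
n, p = 120, 60
X = rng.normal(size=(n, p))
y = X @ rng.normal(scale=0.3, size=p) + rng.normal(size=n)

# Hold out the last 40 observations as a validation set
X_fit, X_val, y_fit, y_val = X[:80], X[80:], y[:80], y[80:]

alphas = [0.1, 1.0, 10.0, 100.0]
val_mse = {
    a: float(np.mean((y_val - X_val @ ridge_fit(X_fit, y_fit, a)) ** 2))
    for a in alphas
}
best_alpha = min(val_mse, key=val_mse.get)

for a in alphas:
    print(f"alpha = {a:6.1f}: validation MSE = {val_mse[a]:.3f}")
print("selected alpha:", best_alpha)
```

Larger penalties shrink all coefficients toward zero without setting any of them exactly to zero, which is the main qualitative difference from lasso.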


In [None]:
trainregridge


