# Machine Learning Estimators for Wage Prediction

We illustrate how to predict an outcome variable $Y$ in a high-dimensional setting, where the number of covariates $p$ is large in relation to the sample size $n$. So far we have used linear prediction rules, e.g. Lasso regression, for estimation.
Now, we also consider nonlinear prediction rules including tree-based methods.

## Data

Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015.
The preprocessed sample consists of $5150$ never-married individuals.

Set `file` below to the location of the data: either the URL itself or the path of a local copy downloaded from https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv

In [None]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
import patsy
import warnings
from sklearn.base import BaseEstimator, clone
import statsmodels.api as sm
warnings.simplefilter('ignore')
np.random.seed(1234)

In [None]:
file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv"
data = pd.read_csv(file)

In [None]:
data.describe()

In [None]:
y = np.log(data['wage']).values
Z = data.drop(['wage', 'lwage'], axis=1)
Z.columns

The following figure shows the hourly wage distribution from the US survey data.

In [None]:
plt.hist(data.wage, bins=np.arange(0, 350, 20))
plt.xlabel('hourly wage')
plt.ylabel('Frequency')
plt.title('Empirical wage distribution from the US survey data')
plt.ylim((0, 3000))

Wages show a high degree of skewness. Hence, wages are transformed in almost all studies by
the logarithm.
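
As a rough illustration of why the log transform helps, the sketch below draws synthetic log-normal "wages" (simulated data, not the CPS sample; the distribution parameters are arbitrary assumptions) and compares the sample skewness before and after taking logs.

```python
import numpy as np


def sample_skewness(x):
    """Third standardized moment of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.std(x)**3


rng = np.random.default_rng(0)
# Synthetic stand-in for wages: log-normal draws (NOT the CPS data)
toy_wage = rng.lognormal(mean=3.0, sigma=0.6, size=5000)

skew_raw = sample_skewness(toy_wage)
skew_log = sample_skewness(np.log(toy_wage))
print(f"skewness of wage: {skew_raw:.2f}, skewness of log(wage): {skew_log:.2f}")
```

On data of this kind the raw variable has a long right tail, while its logarithm is roughly symmetric.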

## Analysis

Because of this skewness, we model log wages, which leads to the following regression model:

$$\log(\operatorname{wage}) = g(Z) + \epsilon.$$

We estimate two sets of prediction rules: linear and nonlinear models.
In linear models, we estimate a prediction rule of the form

$$\hat g(Z) = \hat \beta'X.$$

Again, we generate $X$ in two ways:

1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, regional indicators).


2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g., $\operatorname{exp}^2$ and $\operatorname{exp}^3$) and additional two-way interactions.
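
To make the difference between the two specifications concrete, the sketch below builds by hand the columns that a formula of the form `(a1 + a2) * (b1 + b2)` expands to: the main effects plus every pairwise product between the two groups. The variable names and the toy data are hypothetical; the notebook itself relies on `patsy` for this expansion.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
# Hypothetical toy regressors: two experience terms and two indicators
exper = rng.uniform(0, 4, size=n)
group_a = np.column_stack([exper, exper**2])               # e.g. exp, exp^2
group_b = rng.integers(0, 2, size=(n, 2)).astype(float)    # e.g. two dummies

# Basic model: raw regressors only
X_basic = np.column_stack([group_a, group_b])

# Flexible model: raw regressors plus all two-way products a_j * b_k
interactions = np.column_stack([group_a[:, j] * group_b[:, k]
                                for j in range(group_a.shape[1])
                                for k in range(group_b.shape[1])])
X_flexible = np.column_stack([X_basic, interactions])

print(X_basic.shape, X_flexible.shape)
```

With many occupation and industry indicators, the number of interaction columns grows quickly, which is what makes $p$ large relative to $n$ in the flexible model.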

To evaluate out-of-sample performance, we first split the data into a training and a test sample, and then define a helper function that computes the evaluation metrics.
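
As a minimal sketch of the quantities such a helper reports, the toy example below computes the test MSE, the standard error of the MSE (the standard deviation of the squared errors divided by $\sqrt{n}$), and the test $R^2$. The outcome and prediction vectors are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical test outcomes and predictions (illustrative values only)
y_toy = np.array([1.0, 2.0, 3.0, 4.0])
pred_toy = np.array([1.5, 2.0, 2.5, 4.0])

sq_err = (y_toy - pred_toy) ** 2               # per-observation squared errors
mse_toy = sq_err.mean()                        # test MSE
se_toy = sq_err.std() / np.sqrt(len(y_toy))    # standard error of the MSE
r2_toy = 1 - mse_toy / np.var(y_toy)           # share of variance explained

print(mse_toy, se_toy, r2_toy)
```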

In [None]:
train_idx, test_idx = train_test_split(np.arange(len(y)), test_size=0.25, random_state=123)
y_train, y_test = y[train_idx], y[test_idx]

In [None]:
Zbase = patsy.dmatrix('0 + sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)',
                      Z, return_type='dataframe').values
X_train, X_test = Zbase[train_idx], Zbase[test_idx]

In [None]:
Zflex = patsy.dmatrix('0 + sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)',
                      Z, return_type='dataframe').values
Xflex_train, Xflex_test = Zflex[train_idx], Zflex[test_idx]

In [None]:
def metrics(X_test, y_test, estimator):
    # Squared prediction errors on the test sample
    sq_err = (y_test - estimator.predict(X_test))**2
    mse = np.mean(sq_err)
    # Standard error of the test MSE
    semse = np.std(sq_err) / np.sqrt(len(y_test))
    # Test R-squared
    r2 = 1 - mse / np.var(y_test)
    print(f'{mse:.4f}, {semse:.4f}, {r2:.4f}')
    return mse, semse, r2

results = {} # dictionary that will store all the metric results from each estimator

We start with a simple OLS regression. We fit the basic and the flexible model on the training data and compute the MSE and $R^2$ on the test sample.

### Low dimensional specification

In [None]:
lr_base = LinearRegression().fit(X_train, y_train)
ypred_ols = lr_base.predict(X_test)
results['ols'] = metrics(X_test, y_test, lr_base)

### High-dimensional specification

We repeat the same procedure for the flexible model.

In [None]:
lr_flex = LinearRegression().fit(Xflex_train, y_train)
ypred_ols_flex = lr_flex.predict(Xflex_test)
results['ols_flex'] = metrics(Xflex_test, y_test, lr_flex)

### Penalized Regressions

We observe that OLS works better for the basic model, which has the smaller $p/n$ ratio. We proceed with penalized regressions.
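
Before tuning the penalties by cross-validation, it may help to see how the two penalty types differ. The sketch below uses synthetic data with only three relevant regressors and fixed penalty levels (arbitrary, untuned assumptions): the `l1` penalty sets many coefficients exactly to zero, whereas the `l2` penalty only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X_toy = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]          # only three relevant regressors
y_toy = X_toy @ beta_true + rng.normal(scale=0.5, size=n)

# Penalty levels are arbitrary choices for illustration (not tuned)
lasso_toy = Lasso(alpha=0.2).fit(X_toy, y_toy)
ridge_toy = Ridge(alpha=10.0).fit(X_toy, y_toy)

n_zero_lasso = int(np.sum(lasso_toy.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_toy.coef_ == 0))
print(f"exact zeros: lasso={n_zero_lasso}, ridge={n_zero_ridge}")
```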

First, we try a pure `l1` penalty, tuned using cross-validation:

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)

In [None]:
lcv = make_pipeline(StandardScaler(), LassoCV(cv=cv, random_state=123)).fit(X_train, y_train)
ypred_lcv = lcv.predict(X_test)
results['lcv'] = metrics(X_test, y_test, lcv)

In [None]:
lcv_flex = make_pipeline(StandardScaler(), LassoCV(cv=cv, random_state=123)).fit(Xflex_train, y_train)
ypred_lcv_flex = lcv_flex.predict(Xflex_test)
results['lcv_flex'] = metrics(Xflex_test, y_test, lcv_flex)

Then we try a pure `l2` penalty, tuned using cross-validation:

In [None]:
rcv = make_pipeline(StandardScaler(), RidgeCV(cv=cv)).fit(X_train, y_train)
ypred_rcv = rcv.predict(X_test)
results['rcv'] = metrics(X_test, y_test, rcv)

In [None]:
rcv_flex = make_pipeline(StandardScaler(), RidgeCV(cv=cv)).fit(Xflex_train, y_train)
ypred_rcv_flex = rcv_flex.predict(Xflex_test)
results['rcv_flex'] = metrics(Xflex_test, y_test, rcv_flex)

Finally, we try an equal combination of the two penalties, with the overall weight tuned using cross-validation:
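
For reference, the objective minimized by scikit-learn's elastic net can be written as follows, where $\alpha$ is the overall penalty weight and $\rho$ is the mixing parameter (`l1_ratio`, whose default value of $0.5$ gives the equal combination used here). The symbols $w$, $\alpha$ and $\rho$ are our notation for this note.

$$\min_{w}\ \frac{1}{2n}\lVert y - Xw\rVert_2^2 + \alpha\rho\,\lVert w\rVert_1 + \frac{\alpha(1-\rho)}{2}\,\lVert w\rVert_2^2$$

Setting $\rho = 1$ leaves only the `l1` term, and $\rho = 0$ leaves only the `l2` term.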

In [None]:
ecv = make_pipeline(StandardScaler(), ElasticNetCV(cv=cv, random_state=123)).fit(X_train, y_train)
ypred_ecv = ecv.predict(X_test)
results['ecv'] = metrics(X_test, y_test, ecv)

In [None]:
ecv_flex = make_pipeline(StandardScaler(), ElasticNetCV(cv=cv, random_state=123)).fit(Xflex_train, y_train)
ypred_ecv_flex = ecv_flex.predict(Xflex_test)
results['ecv_flex'] = metrics(Xflex_test, y_test, ecv_flex)

We can also try a variant of the `l1` penalty, where the penalty level is chosen based on theoretical derivations. We use [hdmpy](https://github.com/maxhuppertz/hdmpy), a Python implementation by [Max Huppertz](https://maxhuppertz.github.io/code/) that aims to replicate the main functions of the `hdm` R package. The cell below clones the repository into the working directory and installs the required `multiprocess` package. Alternatively, download the repository and copy the `hdmpy` folder into your site-packages folder (for example, ***C:\Python\Python38\Lib\site-packages***), then run ***pip install multiprocess***.
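
To give a sense of what a theoretically chosen penalty level looks like, the sketch below evaluates a plug-in rule of the form $\lambda = 2c\sqrt{n}\,\Phi^{-1}(1 - \gamma/(2p))$. The function name `plugin_lambda`, the constants $c = 1.1$ and $\gamma = 0.1/\log(n)$, and the sample sizes are assumptions based on our reading of the `hdm` documentation; they are not read from `hdmpy` itself, which, as we understand it, also applies data-dependent penalty loadings that are omitted here.

```python
from math import log, sqrt
from statistics import NormalDist


def plugin_lambda(n, p, c=1.1, gamma=None):
    """Plug-in penalty level 2*c*sqrt(n)*Phi^{-1}(1 - gamma/(2p)).

    The defaults c=1.1 and gamma=0.1/log(n) are assumptions made for
    this illustration, not values taken from hdmpy.
    """
    if gamma is None:
        gamma = 0.1 / log(n)
    return 2 * c * sqrt(n) * NormalDist().inv_cdf(1 - gamma / (2 * p))


# Hypothetical sample size and numbers of regressors
lam_small = plugin_lambda(n=3862, p=50)
lam_large = plugin_lambda(n=3862, p=1000)
print(lam_small, lam_large)
```

The penalty grows only slowly with the number of regressors $p$, and no cross-validation is needed to compute it.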

In [None]:
!git clone https://github.com/maxhuppertz/hdmpy.git
!pip install multiprocess

In [None]:
# We wrap the package so that it has the familiar sklearn API
import hdmpy
from sklearn.base import RegressorMixin

class RLasso(BaseEstimator, RegressorMixin):

    def __init__(self, *, post=True):
        self.post = post

    def fit(self, X, y):
        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)
        return self

    @property
    def coef_(self):
        return np.array(self.rlasso_.est['beta']).flatten()

    @property
    def intercept_(self):
        return np.array(self.rlasso_.est['intercept'])

    def predict(self, X):
        return X @ self.coef_ + self.intercept_

In [None]:
lasso = make_pipeline(StandardScaler(), RLasso(post=False)).fit(X_train, y_train)
ypred_lasso = lasso.predict(X_test)
results['lasso'] = metrics(X_test, y_test, lasso)

In [None]:
lasso_flex = make_pipeline(StandardScaler(), RLasso(post=False)).fit(Xflex_train, y_train)
ypred_lasso_flex = lasso_flex.predict(Xflex_test)
results['lasso_flex'] = metrics(Xflex_test, y_test, lasso_flex)

In [None]:
postlasso = make_pipeline(StandardScaler(), RLasso(post=True)).fit(X_train, y_train)
ypred_postlasso = postlasso.predict(X_test)
results['postlasso'] = metrics(X_test, y_test, postlasso)

In [None]:
postlasso_flex = make_pipeline(StandardScaler(), RLasso(post=True)).fit(Xflex_train, y_train)
ypred_postlasso_flex = postlasso_flex.predict(Xflex_test)
results['postlasso_flex'] = metrics(Xflex_test, y_test, postlasso_flex)

# Non-Linear Models

Besides linear regression models, we consider nonlinear regression models to build a predictive model. We apply regression trees, random forests, boosted trees and neural nets to estimate the regression function $g(Z)$.

## Regression Trees

We fit a regression tree to the training data using the basic model. The parameter `ccp_alpha` controls the complexity of the regression tree through cost-complexity pruning: larger values prune the tree more aggressively and yield a smaller tree.
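
As a small illustration of how the pruning parameter affects tree size, the sketch below fits trees on synthetic data (not the wage data) for an arbitrary grid of `ccp_alpha` values and reports the number of leaves of each fitted tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.uniform(-2, 2, size=(500, 3))
y_toy = np.sin(X_toy[:, 0]) + 0.5 * X_toy[:, 1] + rng.normal(scale=0.3, size=500)

# Grid of pruning penalties chosen arbitrarily for illustration
alphas = [0.0, 0.001, 0.01, 0.1]
leaves = [DecisionTreeRegressor(ccp_alpha=a, min_samples_leaf=5, random_state=0)
          .fit(X_toy, y_toy).get_n_leaves() for a in alphas]
print(dict(zip(alphas, leaves)))
```

In practice, `ccp_alpha` would be tuned, for example by cross-validation, rather than fixed in advance.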

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
dtr = DecisionTreeRegressor(ccp_alpha=0.001, min_samples_leaf=5, random_state=123).fit(X_train, y_train)
ypred_dtr = dtr.predict(X_test)
results['dtr'] = metrics(X_test, y_test, dtr)

## Random Forests

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(n_estimators=2000, min_samples_leaf=5, random_state=123)
rf.fit(X_train, y_train)
ypred_rf = rf.predict(X_test)
results['rf'] = metrics(X_test, y_test, rf)

## Gradient Boosted Forests

In [None]:
from sklearn.ensemble import GradientBoostingRegressor

In [None]:
gbf = GradientBoostingRegressor(n_estimators=1000, learning_rate=.01,
                                subsample=.5, max_depth=2, random_state=123)
gbf.fit(X_train, y_train)
ypred_gbf = gbf.predict(X_test)
results['gbf'] = metrics(X_test, y_test, gbf)

## NNets

In [None]:
from sklearn.neural_network import MLPRegressor

In [None]:
nnet = MLPRegressor((200, 20,), 'relu',
                    learning_rate_init=0.01,
                    batch_size=10, max_iter=10,
                    random_state=123)
nnet.fit(X_train, y_train)
ypred_nnet = nnet.predict(X_test)
results['nnet'] = metrics(X_test, y_test, nnet)

### Using the PyTorch Neural Network Library and its Sklearn API Skorch

We first need to install skorch.

In [None]:
!pip install skorch

In [None]:
import skorch
from skorch import NeuralNetRegressor
import torch.nn as nn
import torch

In [None]:
arch = nn.Sequential(nn.Linear(X_train.shape[1], 200), nn.ReLU(),
                     nn.Linear(200, 20), nn.ReLU(),
                     nn.Linear(20, 1))
nnet_early = NeuralNetRegressor(arch, lr=0.01, batch_size=10,
                                max_epochs=100,
                                optimizer=torch.optim.Adam,
                                callbacks=[skorch.callbacks.EarlyStopping()])
nnet_early.fit(X_train.astype(np.float32), y_train.reshape(-1, 1).astype(np.float32))
ypred_nnet_early = nnet_early.predict(X_test.astype(np.float32)).flatten()
results['nnet_early'] = metrics(X_test.astype(np.float32),
                                y_test.reshape(-1, 1).astype(np.float32), nnet_early)

In [None]:
df = pd.DataFrame(results).T
df.columns = ['MSE', 'S.E. MSE', '$R^2$']
df

Above, we displayed the results for a single split of the data into a training and a test sample. The table shows the test MSE in column 1, its standard error in column 2, and the test $R^2$ in column 3. We see that the prediction rule produced by cross-validated Lasso using the flexible model performs best here, giving the lowest test MSE. Cross-validated Ridge performs nearly as well. For the majority of the considered methods, the test MSEs are within one standard error of each other. Remarkably, OLS with just the basic variables performs extremely well. However, OLS on the flexible model with many regressors performs very poorly, giving the highest test MSE. Note that, because this is a simple illustration meant to run relatively quickly, the nonlinear methods are not tuned. There is therefore potential to improve the performance of the nonlinear methods used in this analysis.

# Combining Predictions with Stacking

In the final step, we can build a prediction model by combining the strengths of the models we considered so far. We consider stacking, which forms its prediction rule as
$$ f(x) = \sum_{k=1}^K \alpha_k f_k(x) $$
where the $f_k$'s denote our prediction rules from the table above and the $\alpha_k$'s are the corresponding weights. We choose to estimate the weights here without penalization.
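
The weighting step can be sketched on synthetic data as follows. The two prediction rules `f1` and `f2` are hypothetical noisy predictors of the same signal, and the weights are obtained by an unpenalized least-squares regression of the outcome on the predictions, with an intercept, as `LinearRegression` includes by default.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
y_toy = signal + rng.normal(scale=0.5, size=n)

# Two hypothetical prediction rules f_1, f_2 with independent errors
f1 = signal + rng.normal(scale=0.6, size=n)
f2 = signal + rng.normal(scale=0.6, size=n)
F = np.column_stack([f1, f2])

# Unpenalized weights: least squares of y on an intercept and the predictions
design = np.column_stack([np.ones(n), F])
coef, *_ = np.linalg.lstsq(design, y_toy, rcond=None)
intercept, alpha = coef[0], coef[1:]

stacked = design @ coef
mse_single = [np.mean((y_toy - F[:, k]) ** 2) for k in range(F.shape[1])]
mse_stacked = np.mean((y_toy - stacked) ** 2)
print(alpha, mse_single, mse_stacked)
```

On the sample used to estimate the weights, the stacked rule can never do worse than any single rule, because each single rule is a special case of the weighted combination. This improvement is partly mechanical, so it says little about performance on new data.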

In [None]:
method_name = ['OLS', 'OLS (flexible)', 'CV Lasso', 'CV Lasso (flexible)',
               'CV Ridge', 'CV Ridge (flexible)', 'CV ElasticNet', 'CV ElasticNet (flexible)',
               'Lasso', 'Lasso (flexible)', 'Post-Lasso OLS', 'Post-Lasso OLS (flexible)',
               'Decision Tree', 'Random Forest', 'Boosted Forest', 'Neural Net', 'Neural Net (early stopping)']
ypreds = np.stack((ypred_ols, ypred_ols_flex, ypred_lcv, ypred_lcv_flex,
                   ypred_rcv, ypred_rcv_flex, ypred_ecv, ypred_ecv_flex,
                   ypred_lasso, ypred_lasso_flex, ypred_postlasso, ypred_postlasso_flex,
                   ypred_dtr, ypred_rf, ypred_gbf, ypred_nnet, ypred_nnet_early), axis=-1)

In [None]:
stack_ols = LinearRegression().fit(ypreds, y_test)

In [None]:
pd.DataFrame({'weight': stack_ols.coef_}, index=method_name)

We can calculate the test-sample MSE. For a less biased performance evaluation, however, we should have held out a third sample, not used for estimating the stacking weights, to validate the stacked model.
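
A three-way split of this kind could look like the sketch below. It uses synthetic data, two deliberately simple base learners, and an arbitrary 50/25/25 split; all of these are illustrative assumptions rather than the setup used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(600, 5))
y_toy = X_toy[:, 0] - 0.5 * X_toy[:, 1] + rng.normal(scale=0.5, size=600)

# Split indices into train / stack / eval samples (50% / 25% / 25%)
idx = np.arange(len(y_toy))
train_i, rest_i = train_test_split(idx, test_size=0.5, random_state=0)
stack_i, eval_i = train_test_split(rest_i, test_size=0.5, random_state=0)

# Two simple base learners, each using a different subset of columns
base_a = LinearRegression().fit(X_toy[train_i][:, :1], y_toy[train_i])
base_b = LinearRegression().fit(X_toy[train_i][:, 1:], y_toy[train_i])


def base_preds(rows):
    """Stack the base learners' predictions for the given rows column-wise."""
    return np.column_stack([base_a.predict(X_toy[rows][:, :1]),
                            base_b.predict(X_toy[rows][:, 1:])])


# Stacking weights are estimated on the stacking sample only ...
stacker = LinearRegression().fit(base_preds(stack_i), y_toy[stack_i])
# ... and the stacked rule is scored on data it has never seen
mse_eval = np.mean((y_toy[eval_i] - stacker.predict(base_preds(eval_i))) ** 2)
print(f"held-out MSE of stacked rule: {mse_eval:.4f}")
```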

In [None]:
mse = np.mean((y_test - stack_ols.predict(ypreds))**2)
r2 = 1 - mse / np.var(y_test)

In [None]:
mse, r2

Alternatively, we can determine the weights via lasso regression.

In [None]:
stack_lasso = RLasso(post=False).fit(ypreds, y_test)

In [None]:
pd.DataFrame({'weight': stack_lasso.coef_}, index=method_name)

Again, we calculate the test-sample MSE, with the same caveat: the weights were estimated on the test sample, so a third sample would be needed for a less biased evaluation of the stacked model.

In [None]:
mse = np.mean((y_test - stack_lasso.predict(ypreds))**2)
r2 = 1 - mse / np.var(y_test)

In [None]:
mse, r2

# Redoing it in a more scikit-learn way

We can restructure the analysis in a more idiomatic scikit-learn way by defining a formula transformer and corresponding pipelines.

In [None]:
from sklearn.base import TransformerMixin, BaseEstimator

class FormulaTransformer(TransformerMixin, BaseEstimator):

    def __init__(self, formula):
        self.formula = formula

    def fit(self, X, y=None):
        mat = patsy.dmatrix(self.formula, X, return_type='matrix')
        self.design_info = mat.design_info
        return self

    def transform(self, X, y=None):
        return patsy.build_design_matrices([self.design_info], X)[0]

In [None]:
base = FormulaTransformer('0 + sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)')
flex = FormulaTransformer('0 + sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)')

In [None]:
methods = [('ols', make_pipeline(base, LinearRegression())),
           ('ols_flex', make_pipeline(flex, LinearRegression())),
           ('lasso', make_pipeline(base, StandardScaler(), RLasso(post=False))),
           ('lasso_flex', make_pipeline(flex, StandardScaler(), RLasso(post=False))),
           ('postlasso', make_pipeline(base, StandardScaler(), RLasso(post=True))),
           ('postlasso_flex', make_pipeline(flex, StandardScaler(), RLasso(post=True))),
           ('lcv', make_pipeline(base, StandardScaler(), LassoCV())),
           ('lcv_flex', make_pipeline(flex, StandardScaler(), LassoCV())),
           ('rcv', make_pipeline(base, StandardScaler(), RidgeCV())),
           ('rcv_flex', make_pipeline(flex, StandardScaler(), RidgeCV())),
           ('ecv', make_pipeline(base, StandardScaler(), ElasticNetCV())),
           ('ecv_flex', make_pipeline(flex, StandardScaler(), ElasticNetCV())),
           ('dtr', make_pipeline(base, DecisionTreeRegressor(ccp_alpha=0.001, min_samples_leaf=5,
                                                             random_state=123))),
           ('rf', make_pipeline(base, RandomForestRegressor(n_estimators=2000, min_samples_leaf=5,
                                                            random_state=123))),
           ('gbf', make_pipeline(base, GradientBoostingRegressor(n_estimators=1000, learning_rate=.01,
                                                                 subsample=.5, max_depth=2,
                                                                 random_state=123))),
           ('nnet', make_pipeline(base, MLPRegressor((200, 20,), 'relu',
                                                     learning_rate_init=0.01,
                                                     batch_size=10, max_iter=10,
                                                     random_state=123)))]

In [None]:
train_idx, test_idx = train_test_split(np.arange(len(y)), test_size=0.25, random_state=123)

results = {}
ypreds = np.zeros((len(test_idx), len(methods))) # test predictions used for stacking

for it, (name, estimator) in enumerate(methods):
    estimator.fit(Z.iloc[train_idx], y[train_idx])
    results[name] = metrics(Z.iloc[test_idx], y[test_idx], estimator)
    ypreds[:, it] = estimator.predict(Z.iloc[test_idx])

In [None]:
df = pd.DataFrame(results).T
df.columns = ['MSE', 'S.E. MSE', '$R^2$']
df

In [None]:
stack_lasso = RLasso(post=False).fit(ypreds, y[test_idx])

In [None]:
pd.DataFrame({'weight': stack_lasso.coef_}, index=[name for name, _ in methods])

For a less biased performance evaluation, we should have held out a further evaluation sample that was not used to estimate the stacking weights.

In [None]:
mse = np.mean((y[test_idx] - stack_lasso.predict(ypreds))**2)
r2 = 1 - mse / np.var(y[test_idx])

In [None]:
mse, r2

### Sklearn also provides a Stacking API

The sklearn Stacking API wraps the stacking process. Here, we also use k-fold cross-validation instead of just sample splitting.

In [None]:
from sklearn.ensemble import StackingRegressor

stack = StackingRegressor(methods,
                          final_estimator=RLasso(),
                          cv=3,
                          verbose=3)

We construct a stacked ensemble using only the training data, so that the test sample gives an unbiased performance evaluation. The stacking regressor partitions the data into k folds, based on the `cv` parameter. For each fold, it trains each of the estimators in the `methods` parameter on all the data outside of the fold and then predicts on the data in the fold. Then, using the predictions on all the data from each method, it trains a `final_estimator` that predicts the true outcome using the out-of-fold predictions of each method as features. This defines how the estimators are aggregated. In the end, all the base estimators are re-fitted on all the data, and the final predictor first predicts with each fitted base estimator and then aggregates these predictions with the fitted `final_estimator`.
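
The three steps described above can be sketched by hand with `cross_val_predict`. This is an illustration under stated assumptions, not the internals of `StackingRegressor`: the data are synthetic, the two base learners are hypothetical stand-ins for the `methods` list, the folds are shuffled, and a plain `LinearRegression` plays the role of the final estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4))
y_toy = X_toy[:, 0] + np.abs(X_toy[:, 1]) + rng.normal(scale=0.3, size=300)

# Hypothetical base learners standing in for the `methods` list
base_learners = [Ridge(alpha=1.0),
                 DecisionTreeRegressor(min_samples_leaf=10, random_state=0)]
folds = KFold(n_splits=3, shuffle=True, random_state=0)

# Step 1: out-of-fold predictions, one column per base learner
oof = np.column_stack([cross_val_predict(est, X_toy, y_toy, cv=folds)
                       for est in base_learners])
# Step 2: the final estimator learns how to aggregate those columns
final = LinearRegression().fit(oof, y_toy)
# Step 3: refit every base learner on all of the training data
fitted = [est.fit(X_toy, y_toy) for est in base_learners]


def stacked_predict(X_new):
    """Predict with each refitted base learner, then aggregate."""
    preds = np.column_stack([est.predict(X_new) for est in fitted])
    return final.predict(preds)


print(final.coef_, stacked_predict(X_toy[:3]))
```

Because the final estimator only ever sees out-of-fold predictions, its weights are not rewarded for base learners that merely overfit their own training data.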

In [None]:
stack.fit(Z.iloc[train_idx], y[train_idx])

We can see the weights placed on each estimator by inspecting the fitted final estimator.

In [None]:
pd.DataFrame({'weight': stack.final_estimator_.coef_}, index=[name for name, _ in methods])

We calculate the out-of-sample performance metrics.

In [None]:
mse, semse, r2 = metrics(Z.iloc[test_idx], y[test_idx], stack)

We find that this stacked estimator achieved the best out-of-sample performance.

# FLAML AutoML Framework

In [None]:
!pip install flaml

In [None]:
from flaml import AutoML

automl = make_pipeline(base, AutoML(task='regression', time_budget=60, early_stop=True,
                                    eval_method='cv', n_splits=3, metric='r2',
                                    verbose=3,))

In [None]:
train_idx, test_idx = train_test_split(np.arange(len(y)), test_size=0.25, random_state=123)

In [None]:
automl.fit(Z.iloc[train_idx], y[train_idx])

In [None]:
mse, semse, r2 = metrics(Z.iloc[test_idx], y[test_idx], automl)

We see that the best model chosen matches the performance of the stacked estimator we strived to achieve on our own without AutoML. Moreover, we can also do stacking within the AutoML framework.

In [None]:
automl = make_pipeline(base, AutoML(task='regression', time_budget=60, early_stop=True,
                                    eval_method='cv', n_splits=3, metric='r2',
                                    verbose=3,
                                    ensemble={'passthrough': False, # whether stacker will use raw X's or just predictions
                                              'final_estimator': RLasso()}))

In [None]:
train_idx, test_idx = train_test_split(np.arange(len(y)), test_size=0.25, random_state=123)

In [None]:
automl.fit(Z.iloc[train_idx], y[train_idx])

In [None]:
mse, semse, r2 = metrics(Z.iloc[test_idx], y[test_idx], automl)

In [None]:
automl = make_pipeline(base, AutoML(task='regression', time_budget=60, early_stop=True,
                                    eval_method='cv', n_splits=3, metric='r2',
                                    verbose=3,
                                    ensemble={'passthrough': True, # whether stacker will use raw X's or just predictions
                                              'final_estimator': RLasso()}))

In [None]:
train_idx, test_idx = train_test_split(np.arange(len(y)), test_size=0.25, random_state=123)

In [None]:
automl.fit(Z.iloc[train_idx], y[train_idx])

In [None]:
mse, semse, r2 = metrics(Z.iloc[test_idx], y[test_idx], automl)