Authors: Andreas Haupt, Jannis Kück, Alexander Quispe, Anzony Quispe, Vasilis Syrgkanis

# Machine Learning Estimators for Wage Prediction

We illustrate how to predict an outcome variable $Y$ in a high-dimensional setting, where the number of covariates $p$ is large in relation to the sample size $n$. So far we have used linear prediction rules, e.g. Lasso regression, for estimation.
Now, we also consider nonlinear prediction rules including tree-based methods.

## Data

Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015.
The preprocessed sample consists of $5150$ never-married individuals.

The code below reads the data directly from https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv. If you downloaded the file instead, set `file` to its local path.

In [None]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, KFold, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression, Ridge, Lasso
import patsy
import warnings
from sklearn.base import BaseEstimator, clone
import statsmodels.api as sm
warnings.simplefilter('ignore')
np.random.seed(1234)

In [None]:
file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv"
data = pd.read_csv(file)

In [None]:
data.describe()

In [None]:
y = np.log(data['wage']).values
Z = data.drop(['wage', 'lwage'], axis=1)
Z.columns

The following figure shows the hourly wage distribution from the US survey data.

In [None]:
plt.hist(data.wage, bins=np.arange(0, 350, 20))
plt.xlabel('hourly wage')
plt.ylabel('Frequency')
plt.title('Empirical wage distribution from the US survey data')
plt.ylim((0, 3000))

Wages show a high degree of skewness. Hence, almost all studies transform wages by taking the logarithm.
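
To see why the log transformation helps, the sketch below compares the sample skewness of a simulated skewed variable with that of its logarithm. The data are synthetic (a log-normal draw, chosen here purely for illustration), not the CPS sample, and `skewness` is a hypothetical helper.

```python
import numpy as np


def skewness(x):
    """Sample skewness: third standardized central moment."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return np.mean(centered**3) / np.mean(centered**2) ** 1.5


# Synthetic stand-in for hourly wages (assumption: log-normal, not the CPS data)
rng = np.random.default_rng(0)
wage_sim = np.exp(rng.normal(loc=3.0, scale=0.6, size=5000))

skew_raw = skewness(wage_sim)
skew_log = skewness(np.log(wage_sim))
print(f"skewness of simulated wage:     {skew_raw:.2f}")
print(f"skewness of simulated log wage: {skew_log:.2f}")
```

The same helper could be applied to `data.wage` and `y` once the data are loaded.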

## Analysis

Because wages are skewed, we work with log wages, which leads to the following regression model:

$$\log(\operatorname{wage}) = g(Z) + \epsilon.$$

We will estimate two sets of prediction rules: linear and nonlinear models.
In linear models, we estimate a prediction rule of the form

$$\hat g(Z) = \hat \beta'X.$$
Again, we generate $X$ in two ways:

1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, regional indicators, and occupation and industry indicators).


2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus transformations of experience (e.g., $\operatorname{exp}^2$ and $\operatorname{exp}^3$) and two-way interactions of the experience polynomial with the education, occupation, industry and regional indicators.
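
As a small illustration of how the flexible specification grows, the sketch below expands a formula of the same shape on a toy data frame. The values are made up and only the column names mimic the notebook's variables. In a `patsy` formula, `(a + b) * (c + d)` produces the main effects plus every pairwise interaction.

```python
import pandas as pd
import patsy

# Toy data with hypothetical values; column names mimic the notebook's variables
toy = pd.DataFrame({
    'sex':  [0, 1, 0, 1],
    'exp1': [1.0, 5.0, 10.0, 20.0],
    'exp2': [0.01, 0.25, 1.0, 4.0],
    'shs':  [1, 0, 0, 0],
    'clg':  [0, 1, 0, 1],
})

# Main effects: sex, exp1, exp2, shs, clg
# Interactions: exp1:shs, exp1:clg, exp2:shs, exp2:clg
X_toy = patsy.dmatrix('0 + sex + (exp1 + exp2) * (shs + clg)',
                      toy, return_type='dataframe')
print(list(X_toy.columns))
print(X_toy.shape)
```

With categorical terms such as `C(occ2)` and `C(ind2)`, each interaction contributes one column per category level, which is why the flexible design matrix becomes wide quickly.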

To evaluate out-of-sample performance, we first split the data into a training and a test sample.

We start with a simple OLS regression: we fit the basic and the flexible model on the training data and compute the R-squared on the test sample.

### Low dimensional specification

In [None]:
Zbase = patsy.dmatrix('0 + sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)',
                      Z, return_type='dataframe').values

In [None]:
X_train, X_test, y_train, y_test = train_test_split(Zbase, y, test_size=0.25, random_state=123)

In [None]:
lr_base = LinearRegression().fit(X_train, y_train)

Let's calculate R-squared on the test set

In [None]:
r2_base = 1 - np.mean((y_test - lr_base.predict(X_test))**2) / np.var(y_test)
print(f'{r2_base:.4f}')

In fact, `sklearn` provides an implementation of this metric, `r2_score`:

In [None]:
print(f'{r2_score(y_test, lr_base.predict(X_test)):.4f}')

Since out-of-sample performance can vary across train-test splits, it is more stable to look at the average performance across multiple splits, using K-fold cross-validation.

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(LinearRegression(), Zbase, y, scoring='r2', cv=cv)
print(f'{np.mean(rsquares):.4f}')
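
For intuition about what K-fold cross-validation computes, here is a minimal hand-rolled sketch using only `numpy` on synthetic data. It is not how `cross_val_score` is implemented internally; the data-generating process and all variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 500, 5, 5
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -0.5, 0.3, 0.0, 0.2]) + 0.5 * rng.normal(size=n)

# Shuffle the indices once and cut them into k roughly equal folds
folds = np.array_split(rng.permutation(n), k)

fold_r2 = []
for test_idx in folds:
    train_idx = np.setdiff1d(np.arange(n), test_idx)
    # OLS with intercept, fit on the training folds only
    A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
    coef, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
    # Out-of-sample R-squared on the held-out fold
    A_test = np.column_stack([np.ones(len(test_idx)), X[test_idx]])
    resid = y[test_idx] - A_test @ coef
    fold_r2.append(1 - np.mean(resid**2) / np.var(y[test_idx]))

print(np.round(fold_r2, 3), f"mean = {np.mean(fold_r2):.3f}")
```

Every observation is used for testing exactly once, so the average is less sensitive to one lucky or unlucky split.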

### High-dimensional specification

We repeat the same procedure for the flexible model.

In [None]:
Zflex = patsy.dmatrix('0 + sex + (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)',
                      Z, return_type='dataframe').values

In [None]:
Zflex.shape

In [None]:
X_train, X_test, y_train, y_test = train_test_split(Zflex, y, test_size=0.25, random_state=123)

In [None]:
lr_flex = LinearRegression().fit(X_train, y_train)

In [None]:
print(f'{r2_score(y_test, lr_flex.predict(X_test)):.4f}')

However, OLS can be quite unstable in such high-dimensional problems, and it matters a great deal which solution is returned among the many solutions to the least squares objective (the minimizer is not unique in high-dimensional settings). For instance, we see that the `sklearn` implementation returns a numerically unstable solution whose error blows up in some cases.

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(LinearRegression(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')
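
The non-uniqueness of the least squares solution can be illustrated on a tiny synthetic example with an exactly collinear column. This is a sketch under that assumption, not a diagnosis of what happens in the wage data: two coefficient vectors give identical in-sample fits, but only one of them has minimum norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Third column is an exact linear combination of the first two
X = np.column_stack([x1, x2, x1 + x2])
y = 1.0 * x1 + 2.0 * x2 + 0.1 * rng.normal(size=n)

# Minimum-norm least squares solution via the pseudo-inverse
beta_min = np.linalg.pinv(X) @ y

# X @ null_dir == 0, so adding any multiple of it leaves the fit unchanged
null_dir = np.array([1.0, 1.0, -1.0])
beta_alt = beta_min + 5.0 * null_dir

fit_min = X @ beta_min
fit_alt = X @ beta_alt
print("same in-sample fit:", np.allclose(fit_min, fit_alt))
print("coefficient norms: ", np.linalg.norm(beta_min), np.linalg.norm(beta_alt))
```

If the collinearity holds only approximately, or fails on new data, solutions with large coefficients can produce very different out-of-sample predictions, which is one plausible reason for solvers to disagree.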

`sklearn`'s implementation uses the least squares solver from `scipy.linalg.lstsq`. If, for instance, we instead use a pseudo-inverse based implementation, we get a different result.

In [None]:
class MyOLS(BaseEstimator):

    def fit(self, X, y):
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        CXX = (X.T @ X) / X.shape[0]
        CXy = (X.T @ y) / X.shape[0]
        self.coef_ = np.linalg.pinv(CXX) @ CXy
        return self

    def predict(self, X):
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        return X @ self.coef_

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(MyOLS(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')

This also recovers the solution provided by `statsmodels.api.OLS`

In [None]:
class StatsModelsOLS(BaseEstimator):

    def fit(self, X, y):
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        self.ols_ = sm.OLS(y, X).fit()
        return self

    def predict(self, X):
        X = np.hstack([np.ones((X.shape[0], 1)), X])
        return self.ols_.predict(X)

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(StatsModelsOLS(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')

We can also choose different solvers by using `sklearn.linear_model.Ridge`, which allows for no penalty and offers a multitude of solvers. We see that the `lsqr` solver is more stable than solvers based on singular value decompositions of the covariance matrix $E_n[X X']$.

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(Ridge(alpha=0.0, solver='lsqr'), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(Ridge(alpha=0.0, solver='cholesky'), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(Ridge(alpha=0.0, solver='svd'), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')

### Penalized Regressions

We observe that OLS regression works better for the basic model, which has a smaller $p/n$ ratio. We proceed by running penalized regressions.

First we try a pure `l1` penalty, tuned using cross-validation

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(LassoCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')

Oops! For penalized regressions it is important that our features have the same standard deviation, so that they are penalized symmetrically.

In [None]:
from sklearn.preprocessing import StandardScaler
Zflex = StandardScaler().fit_transform(Zflex)
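
An alternative, sketched below on synthetic data, is to put the scaler inside a `Pipeline` so that the means and standard deviations are learned only from the training part of each fold. This is a design choice rather than a correction of the cell above, and the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 200, 20
# Synthetic features on very different scales (illustrative, not the wage data)
scales = np.logspace(-2, 2, p)
Z_sim = rng.normal(size=(n, p))
X_sim = Z_sim * scales
beta = np.zeros(p)
beta[:3] = [1.0, -1.0, 0.5]
y_sim = Z_sim @ beta + 0.5 * rng.normal(size=n)

# The scaler is re-fit on the training part of every outer fold
model = make_pipeline(StandardScaler(), LassoCV(cv=5))
cv_sim = KFold(n_splits=5, shuffle=True, random_state=123)
scores = cross_val_score(model, X_sim, y_sim, scoring='r2', cv=cv_sim)
print(f'{np.mean(scores):.4f}')
```

With a sample as large as the one used here, the difference from scaling once up front is likely small, but the pipeline keeps the held-out fold entirely out of the preprocessing step.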

Let's try again!

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(LassoCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')

Then we try a pure `l2` penalty, tuned using cross-validation

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(RidgeCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')

Finally, we try an equal combination of the two penalties (the `ElasticNetCV` default, `l1_ratio=0.5`), with the overall weight tuned using cross-validation

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(ElasticNetCV(cv=cv), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')
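
For reference, the objective minimized by `sklearn`'s elastic net can be written as follows, where $\alpha$ is the overall penalty weight and $\rho$ stands for the `l1_ratio` parameter (the intercept is omitted for brevity):

$$\min_{w} \; \frac{1}{2n} \|y - Xw\|_2^2 + \alpha \rho \|w\|_1 + \frac{\alpha (1-\rho)}{2} \|w\|_2^2.$$

With $\rho = 0.5$ the mixing weights $\rho$ and $1-\rho$ are equal, and only $\alpha$ is tuned by cross-validation. Setting $\rho = 1$ gives the Lasso objective.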

We can also try a variant of the `l1` penalty, where the weight is chosen based on theoretical derivations. We use `hdmpy`, a Python implementation by [Max Huppertz](https://maxhuppertz.github.io/code/) that tries to replicate the main function of the `hdm` R package. The library is available in this [repository](https://github.com/maxhuppertz/hdmpy). If you are not running on Colab, download the repository and copy its folder to your site-packages folder. In my case it is located here: ***C:\Python\Python38\Lib\site-packages***. It requires the `multiprocess` package: ***pip install multiprocess***.

Specifically, we use "plug-in" tuning with a theoretically valid choice of penalty $\lambda = 2 \cdot c \hat{\sigma} \sqrt{n} \Phi^{-1}(1-\alpha/(2p))$, where $c>1$, $1-\alpha$ is a confidence level, $\Phi^{-1}$ denotes the quantile function of the standard normal distribution, and $\hat{\sigma}$ is estimated in an iterative manner (see the corresponding notes in the book). Under homoskedasticity, this choice ensures that the Lasso predictor is well behaved, delivering good predictive performance under approximate sparsity. In practice, this formula works well even in the absence of homoskedasticity, especially when the tails of the random variables $\epsilon$ and $X$ in the regression equation decay quickly.
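
To get a feel for the magnitude of this penalty, the sketch below evaluates the formula with the standard library only. The helper `plugin_lambda` and the values of `c`, `alpha`, `sigma_hat` and `p` are illustrative assumptions, not the defaults or internals of `hdmpy`.

```python
from math import sqrt
from statistics import NormalDist


def plugin_lambda(n, p, sigma_hat, c=1.1, alpha=0.05):
    """Evaluate lambda = 2 * c * sigma_hat * sqrt(n) * Phi^{-1}(1 - alpha / (2p))."""
    quantile = NormalDist().inv_cdf(1 - alpha / (2 * p))
    return 2 * c * sigma_hat * sqrt(n) * quantile


# Hypothetical inputs: n matches the sample size, p and sigma_hat are made up
lam = plugin_lambda(n=5150, p=1000, sigma_hat=0.5)
lam_more_features = plugin_lambda(n=5150, p=2000, sigma_hat=0.5)
print(f"lambda with p=1000: {lam:.1f}")
print(f"lambda with p=2000: {lam_more_features:.1f}")
```

The penalty grows only slowly with $p$, because the normal quantile increases roughly like $\sqrt{2 \log p}$.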

In practice, many people choose to use cross-validation, which is perfectly fine for predictive tasks. However, when conducting inference, to make our analysis valid we will require cross-fitting in addition to cross-validation. As we have not yet discussed cross-fitting, we rely on this theoretically-driven penalty in order to allow for accurate inference in the upcoming notebooks.

In [None]:
!git clone https://github.com/maxhuppertz/hdmpy.git
!pip install multiprocess

In [None]:
# We wrap the package so that it has the familiar sklearn API
import hdmpy

class RLasso(BaseEstimator):

    def __init__(self, *, post=True):
        self.post = post

    def fit(self, X, y):
        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)
        return self

    def predict(self, X):
        return X @ np.array(self.rlasso_.est['beta']).flatten() + self.rlasso_.est['intercept'].values

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(RLasso(), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')

Finally, we try combining a sparse and a dense coefficient vector using the LAVA method

In [None]:
# We construct an sklearn API estimator that implements the LAVA method

class Lava(BaseEstimator):

    def __init__(self, *, alpha2=1, iterations=3):
        self.alpha2 = alpha2
        self.iterations = iterations

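    def fit(self, X, y):
        # Initialize: sparse part via Lasso, then dense part via Ridge on the Lasso residuals
        lasso = RLasso(post=False).fit(X, y)
        ridge = Ridge(alpha=self.alpha2).fit(X, y - lasso.predict(X).flatten())

        # Alternate: refit each component on the residuals of the other
        for _ in range(self.iterations - 1):
            lasso = lasso.fit(X, y - ridge.predict(X))
            ridge = ridge.fit(X, y - lasso.predict(X).flatten())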

        self.lasso_ = lasso
        self.ridge_ = ridge
        return self

    def predict(self, X):
        return self.lasso_.predict(X) + self.ridge_.predict(X)

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
rsquares = cross_val_score(Lava(alpha2=20), Zflex, y, scoring='r2', cv=cv, n_jobs=-1)
print(f'{np.mean(rsquares):.4f}')
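
The alternating scheme used by the `Lava` class can also be tried without `hdmpy`. The sketch below swaps in `sklearn`'s `Lasso` with a fixed penalty in place of `RLasso`, so it is a simplified stand-in rather than the estimator defined above. The data are synthetic, with a coefficient vector that is the sum of a sparse and a dense part, and the penalty levels are arbitrary.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 300, 40
X = rng.normal(size=(n, p))
beta_sparse = np.zeros(p)
beta_sparse[:3] = [2.0, -1.5, 1.0]          # a few large coefficients
beta_dense = rng.normal(scale=0.1, size=p)  # many small coefficients
y = X @ (beta_sparse + beta_dense) + 0.5 * rng.normal(size=n)

# Initialize: Lasso on y, then Ridge on the Lasso residuals
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=20.0).fit(X, y - lasso.predict(X))

# Alternate: each component is refit on the residuals of the other
for _ in range(2):
    lasso.fit(X, y - ridge.predict(X))
    ridge.fit(X, y - lasso.predict(X))

y_hat = lasso.predict(X) + ridge.predict(X)
r2_in = 1 - np.mean((y - y_hat) ** 2) / np.var(y)
print(f"in-sample R-squared: {r2_in:.3f}")
print("nonzero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
```

The Lasso component is meant to pick up the few large coefficients while the Ridge component absorbs the many small ones. The in-sample fit shown here is only a sanity check, not a measure of predictive performance.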

We find that, for this dataset, the low-dimensional OLS specification performed best among all specifications. The high-dimensional approaches did not manage to increase the explanatory power for the outcome.