## Introduction

In labor economics an important question is what determines the wage of workers. This is a causal question, but we could begin to investigate from a predictive perspective.

In the following wage example,  𝑌  is the (log) hourly wage of a worker and  𝑋  is a vector of worker's characteristics, e.g., education, experience, sex. Two main questions here are:

* How to use job-relevant characteristics, such as education and experience, to best predict wages?

* What is the difference in predicted wages between male and female workers with the same job-relevant characteristics?

In this lab, we focus on the prediction question first.

## Data

The data set we consider is from the March Supplement of the U.S. Current Population Survey, year 2015. We select white non-hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week during at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below  3 .

The variable of interest  𝑌  is the (log) hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size  $n=5150$ .

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import sklearn.linear_model as lm
import statsmodels.formula.api as smf
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
import warnings
# ignore potential convergence warnings; for some small penalty levels,
# tried out, optimization might not converge
warnings.simplefilter('ignore')

## Data Analysis

In [2]:
file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv"
df = pd.read_csv(file)

In [3]:
df.describe()

Unnamed: 0,wage,lwage,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
count,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0
mean,23.41041,2.970787,0.444466,0.023301,0.243883,0.278058,0.31767,0.137087,0.259612,0.296505,0.216117,0.227767,13.760583,3.018925,8.235867,25.118038,5310.737476,11.670874,6629.154951,13.316893
std,21.003016,0.570385,0.496955,0.150872,0.429465,0.448086,0.465616,0.343973,0.438464,0.456761,0.411635,0.419432,10.609465,4.000904,14.488962,53.530225,11874.35608,6.966684,5333.443992,5.701019
min,3.021978,1.105912,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,370.0,2.0
25%,13.461538,2.599837,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.25,0.125,0.0625,1740.0,5.0,4880.0,9.0
50%,19.230769,2.956512,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,1.0,1.0,4040.0,13.0,7370.0,14.0
75%,27.777778,3.324236,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,21.0,4.41,9.261,19.4481,5610.0,17.0,8190.0,18.0
max,528.845673,6.270697,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,47.0,22.09,103.823,487.9681,100000.0,22.0,100000.0,22.0


### Construct variables

We are constructing the output variable  $Y$  and the matrix  $Z$  which includes the characteristics of workers that are given in the data.

In [4]:
Y = np.log(df['wage'])
Z = df.drop(['wage', 'lwage'], axis=1)
Z.shape

(5150, 18)

For the outcome variable (log) wage and a subset of the raw regressors, we calculate the empirical mean and other empirical measures to get familiar with the data.



In [5]:
Z.describe()

Unnamed: 0,sex,shs,hsg,scl,clg,ad,mw,so,we,ne,exp1,exp2,exp3,exp4,occ,occ2,ind,ind2
count,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0,5150.0
mean,0.444466,0.023301,0.243883,0.278058,0.31767,0.137087,0.259612,0.296505,0.216117,0.227767,13.760583,3.018925,8.235867,25.118038,5310.737476,11.670874,6629.154951,13.316893
std,0.496955,0.150872,0.429465,0.448086,0.465616,0.343973,0.438464,0.456761,0.411635,0.419432,10.609465,4.000904,14.488962,53.530225,11874.35608,6.966684,5333.443992,5.701019
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,370.0,2.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.25,0.125,0.0625,1740.0,5.0,4880.0,9.0
50%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,10.0,1.0,1.0,1.0,4040.0,13.0,7370.0,14.0
75%,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,21.0,4.41,9.261,19.4481,5610.0,17.0,8190.0,18.0
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,47.0,22.09,103.823,487.9681,100000.0,22.0,100000.0,22.0


E.g., the share of female workers in our sample is ~44% ( 𝑠𝑒𝑥=1  if female).

In [6]:
# if you want to print this table to latex
print(Z.describe().style.to_latex())

\begin{tabular}{lrrrrrrrrrrrrrrrrrr}
 & sex & shs & hsg & scl & clg & ad & mw & so & we & ne & exp1 & exp2 & exp3 & exp4 & occ & occ2 & ind & ind2 \\
count & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 & 5150.000000 \\
mean & 0.444466 & 0.023301 & 0.243883 & 0.278058 & 0.317670 & 0.137087 & 0.259612 & 0.296505 & 0.216117 & 0.227767 & 13.760583 & 3.018925 & 8.235867 & 25.118038 & 5310.737476 & 11.670874 & 6629.154951 & 13.316893 \\
std & 0.496955 & 0.150872 & 0.429465 & 0.448086 & 0.465616 & 0.343973 & 0.438464 & 0.456761 & 0.411635 & 0.419432 & 10.609465 & 4.000904 & 14.488962 & 53.530225 & 11874.356080 & 6.966684 & 5333.443992 & 5.701019 \\
min & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 & 0.000000 

## Prediction Question

Now, we will construct a prediction rule for hourly (log) wage  $Y$, which depends linearly on job-relevant characteristics  $X$:

$$
𝑌=\beta′𝑋+𝜖.
$$


Our goals are

* Predict wages using various characteristics of workers.

* Assess the predictive performance of a given model using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$.


Toward answering the latter, we measure the prediction quality of the two models via data splitting:

- Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we might consider).
- Use the training sample to estimate the parameters of the Basic Model and the Flexible Model.
- Before using the testing sample, we evaluate in-sample fit.


In [7]:
train, test = train_test_split(df, test_size=0.20, random_state=123)


We employ two different specifications for prediction:


1. Basic Model:   $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators,  occupation and industry indicators and regional indicators).


2. Flexible Model:  $X$ consists of all raw regressors from the basic model plus a dictionary of transformations (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions of a polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of having a *college degree*.

Using the **Flexible Model** enables us to approximate the real relationship by a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver higher prediction accuracy but are harder to interpret.

## Data-Splitting: In-sample performance

Let us fit both models to our data by running ordinary least squares (ols):

In [8]:
# 1. Basic Model
model_base = 'lwage ~ sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2) + C(ind2)'
base = smf.ols(model_base, data=train)
results_base = base.fit()

In [9]:
rsquared_base = results_base.rsquared
rsquared_adj_base = results_base.rsquared_adj
mse_base = np.mean(results_base.resid**2)
mse_adj_base = results_base.mse_resid
print(f'Rsquared={rsquared_base:.4f}')
print(f'Rsquared_adjusted={rsquared_adj_base:.4f}')
print(f'MSE={mse_base:.4f}')
print(f'MSE_adjusted={mse_adj_base:.4f}')

Rsquared=0.3176
Rsquared_adjusted=0.3092
MSE=0.2202
MSE_adjusted=0.2229


In [10]:
# verify the formulas
X, y = base.data.exog, base.data.endog
n, p = X.shape
mse = np.mean((y - results_base.predict(X, transform=False))**2)
mse_adj = mse * n / (n - p)
rsquared = 1 - mse / np.var(y)
rsquared_adj = 1 - mse_adj / np.var(y)
print(f'Rsquared={rsquared:.4f}')
print(f'Rsquared_adjusted={rsquared_adj:.4f}')
print(f'MSE={mse:.4f}')
print(f'MSE_adjusted={mse_adj:.4f}')

Rsquared=0.3176
Rsquared_adjusted=0.3091
MSE=0.2202
MSE_adjusted=0.2229


In [11]:
# 2. Flexible Model
model_flex = ('lwage ~ sex + shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we '
              '+ (exp1+exp2+exp3+exp4)*(shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)')
flex = smf.ols(model_flex, data=train)
results_flex = flex.fit()

In [12]:
rsquared_flex = results_flex.rsquared
rsquared_adj_flex = results_flex.rsquared_adj
mse_flex = np.mean(results_flex.resid**2)
mse_adj_flex = results_flex.mse_resid
print(f'Rsquared={rsquared_flex:.4f}')
print(f'Rsquared_adjusted={rsquared_adj_flex:.4f}')
print(f'MSE={mse_flex:.4f}')
print(f'MSE_adjusted={mse_adj_flex:.4f}')

Rsquared=0.3643
Rsquared_adjusted=0.3241
MSE=0.2051
MSE_adjusted=0.2181


#### Re-estimating the flexible model using Lasso
We re-estimate the flexible model using Lasso (the least absolute shrinkage and selection operator) rather than ols. Lasso is a penalized regression method that can be used to reduce the complexity of a regression model when the ratio $p/n$ is not small. We will introduce this approach formally later in the course, but for now, we try it out here as a black-box method.  


We use the statsmodels package with the formula api for uniformity in feature construction and the sklearn Lasso with cross-validation to tune the regularization hyperparameter.

In [13]:
# Lasso with cross-validation
X = flex.data.exog[:, 1:]  # exclude the intercept; we don't want the lasso to penalize the intercept
y = flex.data.endog

# train model using Lasso with cross validation and variable normalization
lasso = Pipeline([('scale', StandardScaler()),  # standardize the variables
                  ('lasso', lm.LassoCV())])
lasso.fit(X, y)

In [14]:
# verify the formulas
n, p = X.shape
p += 1
mse_lasso = np.mean((y - lasso.predict(X))**2)
mse_adj_lasso = mse_lasso * n / (n - p)
rsquared_lasso = 1 - mse_lasso / np.var(y)
rsquared_adj_lasso = 1 - mse_adj_lasso / np.var(y)
print(f'Rsquared={rsquared_lasso:.4f}')
print(f'Rsquared_adjusted={rsquared_adj_lasso:.4f}')
print(f'MSE={mse_lasso:.4f}')
print(f'MSE_adjusted={mse_adj_lasso:.4f}')

Rsquared=0.3309
Rsquared_adjusted=0.2885
MSE=0.2159
MSE_adjusted=0.2296


In [15]:
# store the results in a table
res_df = pd.DataFrame()

res_df['Model'] = ['Basic reg', 'Flexible reg', 'Flexible Lasso']

res_df['p'] = [results_base.params.shape[0],
               results_flex.params.shape[0],
               results_flex.params.shape[0]]

res_df['R2'] = [rsquared_base, rsquared_flex, rsquared_lasso]
res_df['MSE'] = [mse_base, mse_flex, mse_lasso]

res_df['adj_R2'] = [rsquared_adj_base, rsquared_adj_flex, rsquared_adj_lasso]
res_df['adj_MSE'] = [mse_adj_base, mse_adj_flex, mse_adj_lasso]

# Show results
res_df.head()

Unnamed: 0,Model,p,R2,MSE,adj_R2,adj_MSE
0,Basic reg,51,0.317622,0.220187,0.309237,0.222947
1,Flexible reg,246,0.364346,0.20511,0.324146,0.218135
2,Flexible Lasso,246,0.330946,0.215888,0.288461,0.229597


In [16]:
# print to Latex
print(res_df.style.to_latex())

\begin{tabular}{llrrrrr}
 & Model & p & R2 & MSE & adj_R2 & adj_MSE \\
0 & Basic reg & 51 & 0.317622 & 0.220187 & 0.309237 & 0.222947 \\
1 & Flexible reg & 246 & 0.364346 & 0.205110 & 0.324146 & 0.218135 \\
2 & Flexible Lasso & 246 & 0.330946 & 0.215888 & 0.288461 & 0.229597 \\
\end{tabular}



Considering the measures above, the flexible model performs slightly better than the basic model.

As $p/n$ is not large, the discrepancy between the adjusted and unadjusted measures is not large. However, if it were, we might still like to apply **data splitting** as a more general procedure to deal with potential overfitting if $p/n$. We illustrate the approach in the following.

## Data Splitting: Out-of-sample performance

Now that we have seen in-sample fit, we evaluate our models on the out-of-sample performance:
- Use the testing sample for evaluation. Predict the $\mathtt{wage}$  of every observation in the testing sample based on the estimated parameters in the training sample.
- Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models.


In [17]:
# We will use smf.ols just to get the full data frame and sm.OLS to
# test out of sample for convenience
# This is because predict is a bit tricky with smf.ols out of sample.
tmp = smf.ols(model_base, data=df)  # just to extract df, not actually using this model
X_full = tmp.data.exog
y_full = tmp.data.endog
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, shuffle=True)

In [18]:
# estimating the parameters in the training sample
regbasic = sm.OLS(y_train, X_train).fit()

# predict out of sample
yhat_reg_base = regbasic.predict(X_test)

# calculating out-of-sample MSE
MSE_test1 = sum((y_test - yhat_reg_base)**2) / y_test.shape[0]
R2_test1 = 1. - MSE_test1 / np.var(y_test)

print("Test MSE for the basic model: " + str(MSE_test1))
print("Test R2 for the basic model: " + str(R2_test1))

Test MSE for the basic model: 0.24144789903247718
Test R2 for the basic model: 0.27769500013958415


In the basic model, the  $𝑀𝑆𝐸_{𝑡𝑒𝑠𝑡}$  is quite closed to the  $𝑀𝑆𝐸_{𝑠𝑎𝑚𝑝𝑙𝑒}$.

In [19]:
# We will use smf.ols just to get the full data frame and sm.OLS to test out of sample for convenience
# This is because predict is a bit tricky with smf.ols out of sample.
tmp = smf.ols(model_flex, data=df)  # just to extract df, not actually using this model
X_full = tmp.data.exog
y_full = tmp.data.endog
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, shuffle=True)

# estimating the parameters in the training sample
regflex = sm.OLS(y_train, X_train).fit()

# predict out of sample
yhat_reg_flex = regflex.predict(X_test)

# calculating out-of-sample MSE
MSE_test2 = np.mean((y_test - yhat_reg_flex)**2)
R2_test2 = 1. - MSE_test2 / np.var(y_test)

print("Test MSE for the flexible model: " + str(MSE_test2))
print("Test R2 for the flexible model: " + str(R2_test2))

Test MSE for the flexible model: 0.2580214248313413
Test R2 for the flexible model: 0.18728670499465283


In the flexible model, the discrepancy between the  $𝑀𝑆𝐸_{𝑡𝑒𝑠𝑡}$  and the  $𝑀𝑆𝐸_{𝑠𝑎𝑚𝑝𝑙𝑒}$  is not large.

It is worth to notice that the  $𝑀𝑆𝐸_{𝑡𝑒𝑠𝑡}$  vary across different data splits. Hence, it is a good idea average the out-of-sample MSE over different data splits to get valid results.

Nevertheless, we observe that, based on the out-of-sample  $𝑀𝑆𝐸$ , the basic model using ols regression performs is about as well (or slightly better) than the flexible model.

Next, let us use lasso regression in the flexible model instead of ols regression. Note that the out-of-sample  𝑀𝑆𝐸  on the test sample can be computed for any other black-box prediction method as well. Thus, let us finally compare the performance of lasso regression in the flexible model to ols regression.

In [20]:
# train model using Lasso with cross validation and variable normalization
lasso = Pipeline([('scale', StandardScaler()),  # standardize the variables
                  ('lasso', lm.LassoCV())])
lasso.fit(X_train[:, 1:], y_train)

# predict out of sample
yhat_test_lasso = lasso.predict(X_test[:, 1:])

# calculating out-of-sample MSE
MSE_test3 = np.mean((y_test - yhat_test_lasso)**2)
R2_test3 = 1. - MSE_test3 / np.var(y_test)

print("Test MSE for the basic model: " + str(MSE_test3))
print("Test R2 for the basic model: " + str(R2_test3))

Test MSE for the basic model: 0.2401688387151312
Test R2 for the basic model: 0.2435186016146943


In [21]:
# store the results in a table
res_df2 = pd.DataFrame()

res_df2['Model'] = ['Basic reg', 'Flexible reg', 'Flexible Lasso']

res_df2['$MSE_{test}$'] = [MSE_test1, MSE_test2, MSE_test3]
res_df2['$R^2_{test}$'] = [R2_test1, R2_test2, R2_test3]

# Show results
res_df2.head()

Unnamed: 0,Model,$MSE_{test}$,$R^2_{test}$
0,Basic reg,0.241448,0.277695
1,Flexible reg,0.258021,0.187287
2,Flexible Lasso,0.240169,0.243519


In [22]:
# print to Latex
print(res_df2.style.to_latex())

\begin{tabular}{llrr}
 & Model & $MSE_{test}$ & $R^2_{test}$ \\
0 & Basic reg & 0.241448 & 0.277695 \\
1 & Flexible reg & 0.258021 & 0.187287 \\
2 & Flexible Lasso & 0.240169 & 0.243519 \\
\end{tabular}



## Extra flexible model and Overfitting
Given the results above, it is not immediately clear why one would choose to use Lasso as results are fairly similar. To motivate, we consider an extra flexible model to show how OLS can overfit significantly to the in-sample train data and perform poorly on the out-of-sample testing data.



In [23]:
# Extra Flexible Model
model_extra = ('lwage ~ sex + (exp1+exp2+exp3+exp4+shs+hsg+scl+clg+C(occ2)+C(ind2)+mw+so+we)**2')
tmp = smf.ols(model_extra, data=df)  # just to extract df, not actually using this model

# In-sample fit
insamplefit = tmp.fit()
rsquared_ex = insamplefit.rsquared
rsquared_adj_ex = insamplefit.rsquared_adj
mse_ex = np.mean(insamplefit.resid**2)
mse_adj_ex = insamplefit.mse_resid
print(f'(In-sample) Rsquared={rsquared_ex :.4f}')
print(f'(In-sample) Rsquared_adjusted={rsquared_adj_ex :.4f}')
print(f'(In-sample) MSE={mse_ex :.4f}')
print(f'(In-sample) MSE_adjusted={mse_adj_ex:.4f}')

# Train test Split
X_full = tmp.data.exog
y_full = tmp.data.endog
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, shuffle=True)

# estimating the parameters in the training sample
regextra = sm.OLS(y_train, X_train).fit()

# predict out of sample
yhat_reg_extra = regextra.predict(X_test)

# calculating out-of-sample MSE
MSE_test4 = np.mean((y_test - yhat_reg_extra)**2)
R2_test4 = 1. - MSE_test4 / np.var(y_test)

print("Test MSE for the flexible model: " + str(MSE_test4))
print("Test R2 for the flexible model: " + str(R2_test4))

(In-sample) Rsquared=0.4489
(In-sample) Rsquared_adjusted=0.3506
(In-sample) MSE=0.1793
(In-sample) MSE_adjusted=0.2113
Test MSE for the flexible model: 0.31058193523814376
Test R2 for the flexible model: 0.11967617818015497


As we can see, a simple OLS overfits when the dimensionality of covariates is high, as the out-of-sample performance suffers dramatically in comparison to the in-sample performance.

Contrast this with Lasso:

In [24]:
np.sum(lasso.named_steps['lasso'].coef_ != 0)

99

In [25]:
# train model using Lasso with cross validation and variable normalization
lasso = Pipeline([('scale', StandardScaler()),  # standardize the variables
                  ('lasso', lm.LassoCV())])
lasso.fit(X_train[:, 1:], y_train)

# predict in sample
yhat_train_lasso = lasso.predict(X_train[:, 1:])

# Calculate R-squared
R2_L = 1 - np.sum((yhat_train_lasso - y_train)**2) / np.sum((y_train - np.mean(y_train))**2)

# Calculate adjusted R-squared
pL = np.sum(lasso.named_steps['lasso'].coef_ != 0)
ntrain = len(X_train)
baseline = np.sum((y_train - np.mean(y_train))**2) / (ntrain - 1)
R2_adjL = 1 - (np.sum((yhat_train_lasso - y_train)**2) / (ntrain - pL - 1)) / baseline

# Calculate Mean Squared Error (MSE)
lasso_res = y_train - yhat_train_lasso
MSEL = np.mean(lasso_res**2)

# Calculate adjusted MSE
MSE_adjL = (ntrain / (ntrain - pL - 1)) * MSEL

# Print the results
print("R-squared for the lasso with the extra flexible model (in-sample):", R2_L)
print("Adjusted R-squared for the extra flexible model (in-sample):", R2_adjL)
print("MSE for the lasso with the extra flexible model (in-sample):", MSEL)
print("Adjusted MSE for the lasso with the extra flexible model (in-sample):", MSE_adjL)

# predict out of sample
yhat_test_lasso = lasso.predict(X_test[:, 1:])

# calculating out-of-sample MSE
MSE_test5 = np.mean((y_test - yhat_test_lasso)**2)
R2_test5 = 1. - MSE_test5 / np.var(y_test)

print("Test MSE for the basic model: " + str(MSE_test5))
print("Test R2 for the basic model: " + str(R2_test5))

R-squared for the lasso with the extra flexible model (in-sample): 0.38587383112046936
Adjusted R-squared for the extra flexible model (in-sample): 0.34313537013378703
MSE for the lasso with the extra flexible model (in-sample): 0.19551477442333862
Adjusted MSE for the lasso with the extra flexible model (in-sample): 0.20917186980632435
Test MSE for the basic model: 0.2540434959281549
Test R2 for the basic model: 0.27993062097295673


As shown above, the overfitting effect is mitigated with the penalized regression model.

# **Adjustments to try to improve predictions**
Looking to find the "sweet-spot" between the basic model and the extra flexible model to reduce overfitting but still maintain a better R-squared and MSE

In [45]:
# Reduced Flexible Model
# Removed overall second-order interactions from all covariates and reduced interaction terms while keeping some key interactions between experience, region, industry, occupation, etc.
model_reduced = ('lwage ~ sex + exp1 + exp2 + exp3 + exp4 + shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we + exp1:mw + exp1:so + exp2:hsg + exp2:clg + exp3:mw + exp3:so + exp4:C(occ2) + exp1:C(ind2) + shs:hsg + C(occ2):C(ind2) + mw:we + so:we')

tmp_reduced = smf.ols(model_reduced, data=df)  # just to extract df, not actually using this model

# In-sample fit
insamplefit_reduced = tmp_reduced.fit()
rsquared_reduced = insamplefit_reduced.rsquared
rsquared_adj_reduced = insamplefit_reduced.rsquared_adj
mse_reduced = np.mean(insamplefit_reduced.resid**2)
mse_adj_reduced = insamplefit_reduced.mse_resid
print(f'(In-sample) Rsquared={rsquared_reduced :.4f}')
print(f'(In-sample) Rsquared_adjusted={rsquared_adj_reduced :.4f}')
print(f'(In-sample) MSE={mse_reduced :.4f}')
print(f'(In-sample) MSE_adjusted={mse_adj_reduced:.4f}')

# Train test Split
X_full = tmp.data.exog
y_full = tmp.data.endog
X_train, X_test, y_train, y_test = train_test_split(X_full, y_full, test_size=.2, shuffle=True)

# estimating the parameters in the training sample
regreduced = sm.OLS(y_train, X_train).fit()

# predict out of sample
yhat_reg_reduced = regreduced.predict(X_test)

# calculating out-of-sample MSE
MSE_test6 = np.mean((y_test - yhat_reg_reduced)**2)
R2_test6 = 1. - MSE_test6 / np.var(y_test)

print("Test MSE for the reduced flexibility model: " + str(MSE_test6))
print("Test R2 for the reduced flexibility model: " + str(R2_test6))

(In-sample) Rsquared=0.3726
(In-sample) Rsquared_adjusted=0.3263
(In-sample) MSE=0.2041
(In-sample) MSE_adjusted=0.2192
Test MSE for the reduced flexibility model: 0.22664521972559085
Test R2 for the reduced flexibility model: 0.3324008717364906


**Comparison to Extra Flexible model using OLS:**

**In-Sample Performance:**
The Extra flexible model has a higher r-squared and lower MSE than the reduced flexibility model

**Out Of Sample Performance:**
The reduced flexibility model has a higher r-squared and lower MSE than the extra flexible model


This indicates that while reducing flexibility from the extra flexible model but still keeping interaction terms that could help explain nuance effects of the wage difference across regions/experience/industry/etc., we maintain a level of complexity while reducing potential overfitting. Our in-sample performance and out-of-sample performance are quite similar, so we can assume our model generalizes well to new data.

In [47]:
# train model using Lasso with cross validation and variable normalization
lasso_reduced = Pipeline([('scale', StandardScaler()),  # standardize the variables
                  ('lasso', lm.LassoCV())])
lasso_reduced.fit(X_train[:, 1:], y_train)

# predict in sample
yhat_train_lasso_reduced = lasso_reduced.predict(X_train[:, 1:])

# Calculate R-squared
R2_L_reduced = 1 - np.sum((yhat_train_lasso_reduced - y_train)**2) / np.sum((y_train - np.mean(y_train))**2)

# Calculate adjusted R-squared
pL_reduced = np.sum(lasso_reduced.named_steps['lasso'].coef_ != 0)
ntrain_reduced = len(X_train)
baseline_reduced = np.sum((y_train - np.mean(y_train))**2) / (ntrain_reduced - 1)
R2_adjL_reduced = 1 - (np.sum((yhat_train_lasso_reduced - y_train)**2) / (ntrain_reduced - pL_reduced - 1)) / baseline_reduced

# Calculate Mean Squared Error (MSE)
lasso_res_reduced = y_train - yhat_train_lasso_reduced
MSEL_reduced = np.mean(lasso_res_reduced**2)

# Calculate adjusted MSE
MSE_adjL_reduced = (ntrain_reduced / (ntrain_reduced - pL_reduced - 1)) * MSEL_reduced

# Print the results
print("R-squared for the lasso with the reduced flexible model (in-sample):", R2_L_reduced)
print("Adjusted R-squared for the reduced flexible model (in-sample):", R2_adjL_reduced)
print("MSE for the lasso with the reduced flexible model (in-sample):", MSEL_reduced)
print("Adjusted MSE for the lasso with the reduced flexible model (in-sample):", MSE_adjL_reduced)

# predict out of sample
yhat_test_lasso_reduced = lasso_reduced.predict(X_test[:, 1:])

# calculating out-of-sample MSE
MSE_test7 = np.mean((y_test - yhat_test_lasso_reduced)**2)
R2_test7 = 1. - MSE_test7 / np.var(y_test)

print("Test MSE for the basic model: " + str(MSE_test7))
print("Test R2 for the basic model: " + str(R2_test7))

R-squared for the lasso with the reduced flexible model (in-sample): 0.32912844759676074
Adjusted R-squared for the reduced flexible model (in-sample): 0.3030718980204432
MSE for the lasso with the reduced flexible model (in-sample): 0.21571436247680437
Adjusted MSE for the lasso with the reduced flexible model (in-sample): 0.22414708030376645
Test MSE for the basic model: 0.2210911434432156
Test R2 for the basic model: 0.3487607865359823


**Comparison to Extra Flexible model using Lasso:**

**In-Sample Performance:**
The Extra flexible model has a higher r-squared and lower MSE than the reduced flexibility model

**Out Of Sample Performance:**
The reduced flexibility model has a higher r-squared and lower MSE than the extra flexible model


As stated above, this indicates that while reducing flexibility from the extra flexible model but still keeping interaction terms that could help explain nuance effects of the wage difference across regions/experience/industry/etc., we maintain a level of complexity while reducing potential overfitting. Our in-sample performance and out-of-sample performance are quite similar, so we can assume our model generalizes well to new data.

##**Conclusion on adjusted models**
To briefly compare the reduced flexibility OLS model to the Lasso model, the Lasso model has both a higher r-squared and lower MSE for the out-of-sample performance, so the Lasso reduced flexibility model would ultimately be the model to use. These adjustments highlight the importance of testing multiple models to find the one that strikes the right balance between performance and generalization. Achieving the best possible model requires not only optimizing in-sample performance but also ensuring that the model generalizes well to new, unseen data.