# Econometrics of Big Data Final Problem
*by Christian Stolborg*

*15-07-2022*

## Part B

## Data preparation

In [1]:
import warnings

import numpy as np
import pandas as pd
from scipy import stats
import statsmodels.api as sm
import doubleml as dml

from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV, LogisticRegressionCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

from xgboost import XGBClassifier, XGBRegressor

import matplotlib.pyplot as plt
import seaborn as sns



warnings.filterwarnings('ignore')

In [2]:
df = pd.read_csv("./data/401ksubs.csv")

# From DoubleML docs - need to convert to float due to sklearn issue https://github.com/scikit-learn/scikit-learn/issues/21997
dtypes = df.dtypes
dtypes['net_tfa'] = 'float64'
dtypes['inc'] = 'float64'
df = df.astype(dtypes)

X_cols = "age,inc,fsize,educ,db,marr,twoearn,pira,hown".split(",")
df.head(2)

Unnamed: 0,net_tfa,age,inc,fsize,educ,db,marr,twoearn,e401,p401,pira,hown
0,0.0,47,6765.0,2,8,0,0,0,0,0,0,1
1,1015.0,36,28452.0,1,16,0,0,0,0,0,0,1


### (i)

Estimating the policy effect of 401(k) eligibility ($D$) on net financial assets ($Y$) with only 9 control variables ($X$) becomes a high-dimensional problem when either $Y$ or $D$ is related to $X$ in a way that is not linear in $X$, for example through higher-order terms, interactions or additional unobservable variables, since approximating such a relationship requires many transformed terms. In this model, enrollment in a 401(k) plan is not random, and it is highly likely that net financial assets are affected by households' heterogeneity in saving preferences. Hence, we should expect both $Y$ and $D$ to be partly determined by other factors, such as those present in $X$. However, the relationship between $Y$, $D$ and $X$ could very well be non-linear. Whether this is the case can be explored in a data-driven way with double machine learning methods.
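
For reference, the partially linear regression (PLR) model behind the `DoubleMLPLR` estimates in this part can be written as below. Here $\alpha_0$ is the policy effect and $g_0$ the nuisance function; the symbols $m_0$, $U$ and $V$ are notation assumed here for illustration and are not taken from the assignment text.

$$
\begin{aligned}
Y &= \alpha_0 D + g_0(X) + U, & \mathbb{E}[U \mid X, D] &= 0, \\
D &= m_0(X) + V, & \mathbb{E}[V \mid X] &= 0.
\end{aligned}
$$

The linear regression is the special case $g_0(X) = \beta'X$.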

### (ii)



In [3]:
X = df[["e401"]+X_cols]
X = sm.add_constant(X)
y = df["net_tfa"]

model = sm.OLS(y, X).fit()
model.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
const,-3.291e+04,4276.223,-7.695,0.000,-4.13e+04,-2.45e+04
e401,5896.1984,1250.014,4.717,0.000,3445.917,8346.480
age,624.1455,59.521,10.486,0.000,507.472,740.819
inc,0.9357,0.030,30.982,0.000,0.876,0.995
fsize,-1018.7979,449.859,-2.265,0.024,-1900.614,-136.982
educ,-639.7538,228.499,-2.800,0.005,-1087.659,-191.848
db,-4904.5684,1359.098,-3.609,0.000,-7568.677,-2240.460
marr,743.3445,1795.556,0.414,0.679,-2776.310,4262.999
twoearn,-1.923e+04,1576.431,-12.196,0.000,-2.23e+04,-1.61e+04


Estimating $\text{net\_tfa} = \alpha_0 \, \text{e401} + \beta'X + \epsilon$ by OLS, I find $\hat{\alpha}_0 = 5896$ with a standard error of $1250$. Thus, in the linear model, 401(k) eligibility is associated with an increase in net financial assets of almost \$6,000.

### (iii)

I now repeat the analysis using double machine learning, allowing for non-linear nuisance functions. I use Lasso because it was the best predictor in Part A, and I also include a random forest and XGBoost. For all models I use 5-fold cross-fitting, and I use 5-fold cross-validation to find hyperparameters.
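
To illustrate what cross-fitting does, here is a minimal sketch of the partialling-out estimator on synthetic data. It is not DoubleML's implementation: the function `plr_cross_fit`, the simulated data and the choice of learner are assumptions made for illustration. The sketch also uses a regressor for the treatment equation, whereas the cells below pass a classifier because `e401` is binary.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict


def plr_cross_fit(y, d, X, n_folds=5, seed=42):
    """Cross-fitted partialling-out estimate of alpha_0 in y = alpha_0*d + g_0(X) + u."""
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    learner = RandomForestRegressor(n_estimators=100, min_samples_leaf=5,
                                    random_state=seed)
    # Out-of-fold predictions: each observation is predicted by a model
    # trained on the other folds only.
    l_hat = cross_val_predict(learner, X, y, cv=folds)  # estimate of E[y | X]
    m_hat = cross_val_predict(learner, X, d, cv=folds)  # estimate of E[d | X]
    y_res, d_res = y - l_hat, d - m_hat
    # Regress the residualised outcome on the residualised treatment (no intercept)
    alpha = np.sum(d_res * y_res) / np.sum(d_res ** 2)
    # Standard error based on the orthogonal score
    psi = (y_res - alpha * d_res) * d_res
    se = np.sqrt(np.mean(psi ** 2) / np.mean(d_res ** 2) ** 2 / len(y))
    return alpha, se


# Synthetic data (hypothetical, not the 401(k) sample); the true effect is 1.0
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
d = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(size=n)
y = 1.0 * d + np.cos(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(size=n)

alpha_hat, se_hat = plr_cross_fit(y, d, X)
print(f"alpha_hat = {alpha_hat:.3f} (se = {se_hat:.3f})")
```

Because both nuisance predictions are made out of fold, overfitting in the learners does not feed directly into the residuals used for the final regression.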

In [4]:
dml_data_base = dml.DoubleMLData(df,
                        y_col='net_tfa',
                        d_cols='e401',
                        x_cols=X_cols)

In [5]:
def nl_transform(data: pd.DataFrame, features: list) -> pd.DataFrame:
    """Keep `features`, add cubic polynomials of age, inc, educ and fsize, and interact all resulting columns."""
    features = data.copy()[features]

    # Add polynomials
    poly_dict = {'age': 3,
                'inc': 3,
                'educ': 3,
                'fsize': 3}

    for key, degree in poly_dict.items():
        poly = PolynomialFeatures(degree, include_bias=False)
        data_transf = poly.fit_transform(data[[key]])
        x_cols = poly.get_feature_names_out([key])
        data_transf = pd.DataFrame(data_transf, columns=x_cols)

        features = pd.concat((features, data_transf),
                            axis=1, sort=False)

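    # Add interaction terms. strip("^23") maps e.g. 'age^2' to 'age', so powers
    # of the same variable are never interacted with each other.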
    cols = features.columns
    for col1 in cols:
        for col2 in cols:
            c1 = col1.strip("^23")
            c2 = col2.strip("^23")
            if c1 != c2:
                features[col1+"_"+col2] = features[col1] * features[col2]

    model_data = pd.concat((data.copy()[['net_tfa', 'e401']], features.copy()),
                            axis=1, sort=False)

    return model_data

model_data = nl_transform(df, features=['marr', 'twoearn', 'db', 'pira', 'hown'])
print(f"Shape of flexible data model: {model_data.shape}")


# Initialize DoubleMLData (data-backend of DoubleML)
dml_data_flex = dml.DoubleMLData(model_data, y_col='net_tfa', d_cols='e401')

Shape of flexible data model: (9915, 267)
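
The column count of 267 can be reconstructed by hand, which also clarifies what the interaction loop produces. The sketch below rebuilds only the column names, assuming `PolynomialFeatures` names the powers `age`, `age^2`, `age^3`; the helper `base` is a hypothetical name. Reading the loop, it iterates over ordered pairs, so `a_b` and `b_a` are both created and hold the same product.

```python
from itertools import product

binary = ['marr', 'twoearn', 'db', 'pira', 'hown']
poly_bases = ['age', 'inc', 'educ', 'fsize']
poly_cols = [b if p == 1 else f"{b}^{p}" for b in poly_bases for p in (1, 2, 3)]
cols = binary + poly_cols  # 17 main-effect columns


def base(name):
    """Drop the power suffix, mirroring col.strip("^23") in nl_transform."""
    return name.strip("^23")


# The loop visits every ordered pair of columns with different base variables
ordered_pairs = [(a, b) for a, b in product(cols, cols) if base(a) != base(b)]
unordered_pairs = {frozenset(pair) for pair in ordered_pairs}

# net_tfa and e401, plus main effects, plus interaction columns
n_total = 2 + len(cols) + len(ordered_pairs)
print(len(cols), len(ordered_pairs), len(unordered_pairs), n_total)
```

Under these assumptions, 248 interaction columns are created, of which 124 are distinct products. This is an observation about the design matrix, not a claim that it changes the estimates.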


In [6]:
# Initialize learners
Cs = 0.0001*np.logspace(0, 4, 10)
lasso = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
lasso_class = make_pipeline(StandardScaler(),
                            LogisticRegressionCV(cv=5, penalty='l1', solver='liblinear',
                                                 Cs = Cs, max_iter=1000))

np.random.seed(42)
# Initialize DoubleMLPLR model
dml_plr_lasso = dml.DoubleMLPLR(dml_data_base,
                                ml_l = lasso,
                                ml_m = lasso_class,
                                n_folds = 5)

dml_plr_lasso.fit(store_predictions=True)
lasso_base = dml_plr_lasso.summary
lasso_base

Unnamed: 0,coef,std err,t,P>|t|,2.5 %,97.5 %
e401,5760.268502,1371.244866,4.200758,2.7e-05,3072.677951,8447.859053


In [7]:
# Estimate the ATE in the flexible model with lasso
np.random.seed(42)
dml_plr_lasso = dml.DoubleMLPLR(dml_data_flex,
                                ml_l = lasso,
                                ml_m = lasso_class,
                                n_folds = 5)

dml_plr_lasso.fit(store_predictions=True)
lasso_summary = dml_plr_lasso.summary

lasso_summary

Unnamed: 0,coef,std err,t,P>|t|,2.5 %,97.5 %
e401,9400.019296,1333.841572,7.047328,1.823858e-12,6785.737854,12014.300738


In [8]:
# Flexible Lasso - no splitting (n_folds=1)
dml_plr_lasso = dml.DoubleMLPLR(dml_data_flex,
                                ml_l = lasso,
                                ml_m = lasso_class,
                                n_folds = 1)

dml_plr_lasso.fit(store_predictions=True)
lasso_summary_nospl = dml_plr_lasso.summary

Repeating the procedure with a random forest and gradient boosting:

In [9]:
# Random Forest
randomForest = RandomForestRegressor(
    n_estimators=500, max_depth=7, max_features=3, min_samples_leaf=3)
randomForest_class = RandomForestClassifier(
    n_estimators=500, max_depth=5, max_features=4, min_samples_leaf=7)

np.random.seed(42)
dml_plr_forest = dml.DoubleMLPLR(dml_data_base,
                                 ml_l = randomForest,
                                 ml_m = randomForest_class,
                                 n_folds = 5)
dml_plr_forest.fit(store_predictions=True)
forest_summary = dml_plr_forest.summary

forest_summary

Unnamed: 0,coef,std err,t,P>|t|,2.5 %,97.5 %
e401,8965.966775,1309.656687,6.846044,7.592035e-12,6399.086837,11532.846713


In [10]:
# Random Forest - no splitting

np.random.seed(42)
dml_plr_forest = dml.DoubleMLPLR(dml_data_base,
                                 ml_l = randomForest,
                                 ml_m = randomForest_class,
                                 n_folds = 1)
dml_plr_forest.fit(store_predictions=True)
forest_summary_nospl = dml_plr_forest.summary

In [11]:
# Boosted Trees
boost = XGBRegressor(n_jobs=1, objective = "reg:squarederror",
                     eta=0.1, n_estimators=35)
boost_class = XGBClassifier(use_label_encoder=False, n_jobs=1,
                            objective = "binary:logistic", eval_metric = "logloss",
                            eta=0.1, n_estimators=34)

np.random.seed(42)
dml_plr_boost = dml.DoubleMLPLR(dml_data_base,
                                ml_l = boost,
                                ml_m = boost_class,
                                n_folds = 5)
dml_plr_boost.fit(store_predictions=True)
boost_summary = dml_plr_boost.summary

boost_summary

Unnamed: 0,coef,std err,t,P>|t|,2.5 %,97.5 %
e401,8598.748862,1334.712041,6.4424,1.175988e-10,5982.761332,11214.736392


In [12]:
# Boosted Trees - no splitting
np.random.seed(42)
dml_plr_boost = dml.DoubleMLPLR(dml_data_base,
                                ml_l = boost,
                                ml_m = boost_class,
                                n_folds = 1)
dml_plr_boost.fit(store_predictions=True)
boost_summary_nospl = dml_plr_boost.summary

In [13]:
plr_summary = pd.concat((lasso_base, lasso_summary, lasso_summary_nospl,
                         forest_summary, forest_summary_nospl,
                         boost_summary, boost_summary_nospl))
plr_summary.index = ['Base Lasso', 'Flexible Lasso', 'Flexible Lasso - no split',
                     'Forest', 'Forest - no split', 'XGB', 'XGB - no split']
plr_summary[['coef', '2.5 %', '97.5 %']].round(1)

Unnamed: 0,coef,2.5 %,97.5 %
Base Lasso,5760.3,3072.7,8447.9
Flexible Lasso,9400.0,6785.7,12014.3
Flexible Lasso - no split,9385.3,6881.6,11889.0
Forest,8966.0,6399.1,11532.8
Forest - no split,8751.8,6394.1,11109.6
XGB,8598.7,5982.8,11214.7
XGB - no split,8550.9,6508.1,10593.6


Compared with the base Lasso model and the linear regression, the DML models that allow for non-linearities estimate a much larger effect of 401(k) eligibility on net financial assets. Assuming that 401(k) eligibility is in fact exogenous after controlling for $X$ in the above partially linear models, we can conclude that the effect is around \$9,000 rather than the linear estimate of just below \$6,000.

Oddly enough, removing the sample splitting does not materially alter the conclusions. Yet the large difference in the estimated policy effect $\alpha_0$ between the linear model and the ML models suggests that the nuisance function $g_0(X)$ is in fact non-linear. It may simply be that the number of parameters is low enough, relative to the number of observations, for the models to obtain a (relatively) unbiased estimate of the treatment effect without cross-fitting. Note, however, that all three point estimates move slightly downwards, towards the linear estimate, when the cross-fitting is removed.
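
To make the comparison concrete, the following sketch recomputes the shift in the point estimate and the change in confidence-interval width when cross-fitting is removed. The numbers are the rounded values copied from the summary table above; the variable and column names are illustrative.

```python
import pandas as pd

# Rounded values copied from the summary table above
table = pd.DataFrame(
    {
        "coef": [9400.0, 9385.3, 8966.0, 8751.8, 8598.7, 8550.9],
        "lower": [6785.7, 6881.6, 6399.1, 6394.1, 5982.8, 6508.1],
        "upper": [12014.3, 11889.0, 11532.8, 11109.6, 11214.7, 10593.6],
    },
    index=pd.MultiIndex.from_product(
        [["Flexible Lasso", "Forest", "XGB"], ["split", "no split"]],
        names=["model", "fitting"],
    ),
)
table["width"] = table["upper"] - table["lower"]

split = table.xs("split", level="fitting")
no_split = table.xs("no split", level="fitting")

comparison = pd.DataFrame(
    {
        # Negative values: the estimate falls when cross-fitting is removed
        "coef_shift": no_split["coef"] - split["coef"],
        # Values below 1: the interval is narrower without cross-fitting
        "width_ratio": no_split["width"] / split["width"],
    }
).round(3)
print(comparison)
```

In all three cases the interval is narrower without sample splitting, most visibly for XGB. This would be consistent with in-sample nuisance predictions overfitting, although the table alone cannot establish the cause.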