# Inference on Predictive and Causal Effects in High-Dimensional Nonlinear Models

## Impact of 401(k) on Financial Wealth

As a practical illustration of the methods developed in this lecture, we consider estimation of the effect of 401(k) eligibility and participation 
on accumulated assets. 401(k) plans are pension accounts sponsored by employers. The key problem in determining the effect of participation in 401(k) plans on accumulated assets is saver heterogeneity coupled with the fact that the decision to enroll in a 401(k) is non-random. It is generally recognized that some people have a higher preference for saving than others. It also seems likely that those individuals with high unobserved preference for saving would be most likely to choose to participate in tax-advantaged retirement savings plans and would tend to have otherwise high amounts of accumulated assets. The presence of unobserved savings preferences with these properties then implies that conventional estimates that do not account for saver heterogeneity and endogeneity of participation will be biased upward, tending to overstate the savings effects of 401(k) participation.
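
To make the direction of this bias concrete, here is a small simulation sketch. Everything in it is made up for illustration and is unrelated to the SIPP data: the latent variable `pref`, the hypothetical true effect of 5000, and all scales are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 5000.0  # hypothetical effect of participation on assets

# Latent preference for saving, unobserved by the analyst
pref = rng.normal(size=n)
# Households with a higher preference are more likely to participate
participate = (pref + rng.normal(size=n) > 0).astype(float)
# Assets depend on participation and, separately, on the latent preference
assets = true_effect * participate + 10_000 * pref + rng.normal(scale=5000, size=n)

# Naive comparison of participants and non-participants
naive = assets[participate == 1].mean() - assets[participate == 0].mean()
print(f"true effect: {true_effect:.0f}, naive difference in means: {naive:.0f}")
```

Because `pref` raises both the chance of participating and the level of assets, the naive difference in means lands well above the true effect in this toy setup.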

One can argue that eligibility for enrolling in a 401(k) plan in this data can be taken as exogenous after conditioning on a few observables, of which the most important is income. The basic idea is that, at least around the time 401(k) plans initially became available, people were unlikely to base their employment decisions on whether an employer offered a 401(k) and would instead focus on income and other aspects of the job.

### Data

The data set can be downloaded from the GitHub repository:


In [None]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression, Ridge, Lasso, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
import patsy
import warnings
from sklearn.base import BaseEstimator, clone
import statsmodels.api as sm
from IPython.display import Markdown
import wget
import os
import seaborn as sns
warnings.simplefilter('ignore')
np.random.seed(1234)

In [None]:
file = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.csv"
data = pd.read_csv(file)

In [None]:
data.describe()

In [None]:
data.head()

In [None]:
readme = "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.md"
filename = wget.download(readme)
display(Markdown(open(filename, 'r').read()))

The data consist of 9,915 observations at the household level drawn from the 1991 Survey of Income and Program Participation (SIPP). All variables refer to 1990. We use net financial assets (*net\_tfa*) as the outcome variable, $Y$, in our analysis. Net financial assets are computed as the sum of IRA balances, 401(k) balances, checking accounts, savings bonds, other interest-earning accounts, other interest-earning assets, stocks, and mutual funds, less non-mortgage debt.

Among the $9915$ individuals, $3682$ are eligible to participate in the program. The variable *e401* indicates eligibility and *p401* indicates participation.

In [None]:
sns.countplot(x=data['e401'])  # pass the column by keyword; recent seaborn versions expect this
plt.show()

Eligibility is highly associated with financial wealth:

In [None]:
sns.displot(data=data, x='net_tfa', kind='kde', col='e401', hue='e401', fill=True)
plt.show()

The unconditional average predictive effect (APE) of e401 is about $19559$:

In [None]:
e1 = data[data['e401'] == 1]['net_tfa']
e0 = data[data['e401'] == 0]['net_tfa']
print(f'{np.mean(e1) - np.mean(e0):.0f}')

Among the $3682$ individuals that are eligible, $2594$ decided to participate in the program. The unconditional APE of p401 is about $27372$:

In [None]:
p1 = data[data['p401'] == 1]['net_tfa']  # participants
p0 = data[data['p401'] == 0]['net_tfa']  # non-participants
print(f'{np.mean(p1) - np.mean(p0):.0f}')

As discussed, these estimates are biased since they do not account for saver heterogeneity and endogeneity of participation.

In [None]:
y = data['net_tfa'].values
Z = data['e401'].values
D = data['p401'].values
X = data.drop(['e401', 'p401', 'a401', 'tw', 'tfa', 'net_tfa', 'tfa_he',
               'hval', 'hmort', 'hequity',
               'nifa', 'net_nifa', 'net_n401', 'ira',
               'dum91', 'icat', 'ecat', 'zhat',
               'i1', 'i2', 'i3', 'i4', 'i5', 'i6', 'i7',
               'a1', 'a2', 'a3', 'a4', 'a5'], axis=1)
X.columns

### We define a transformer that constructs the engineered features for controls

In [None]:
!pip install formulaic

In [None]:
from sklearn.base import TransformerMixin, BaseEstimator
from formulaic import Formula

class FormulaTransformer(TransformerMixin, BaseEstimator):
    
    def __init__(self, formula, array=False):
        self.formula = formula
        self.array = array
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        df = Formula(self.formula).get_model_matrix(X)
        if self.array:
            return df.values
        return df

In [None]:
transformer = FormulaTransformer("0 + poly(age, degree=6, raw=True) + poly(inc, degree=8, raw=True) "
                                 "+ poly(educ, degree=4, raw=True) + poly(fsize, degree=2, raw=True) "
                                 "+ male + marr + twoearn + db + pira + hown")

In [None]:
transformer.fit_transform(X).describe()
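
As a rough guide to what the formula produces, the sketch below builds raw polynomial columns by hand with NumPy. It assumes that `poly(v, degree=k, raw=True)` expands to the plain powers $v, v^2, \dots, v^k$ (no orthogonalization); the helper `raw_poly` and the example values are hypothetical and not part of formulaic.

```python
import numpy as np

def raw_poly(x, degree):
    """Stack the raw powers x, x**2, ..., x**degree as columns (illustrative helper)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([x**k for k in range(1, degree + 1)])

age_example = np.array([25.0, 40.0, 64.0])   # made-up ages
P = raw_poly(age_example, 3)
print(P)

# High-degree raw powers of income-sized numbers become enormous
inc_power = raw_poly(np.array([50_000.0]), 8)[0, -1]
print(f"{inc_power:.3e}")
```

Raw powers of this magnitude are one reason the pipelines standardize the features before fitting penalized linear models.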

In [None]:
transformer = FormulaTransformer("0 + poly(age, degree=6, raw=True) + poly(inc, degree=8, raw=True) "
                                 "+ poly(educ, degree=4, raw=True) + poly(fsize, degree=2, raw=True) "
                                 "+ male + marr + twoearn + db + pira + hown", array=True)

# Effect of Eligibility on Financial Assets

In [None]:
modely = make_pipeline(transformer, StandardScaler(), LassoCV())
modelz = make_pipeline(transformer, StandardScaler(), LassoCV())

In [None]:
resy = y - modely.fit(X, y).predict(X)
resZ = Z - modelz.fit(X, Z).predict(X)

In [None]:
np.mean(resy * resZ) / np.mean(resZ**2)

# Instrumental Variables: Effect of 401k Participation on Financial Assets

# Double ML IV under Partial Linearity

Now, we consider estimation of average treatment effects of participation in 401k, i.e. `p401`, with the binary instrument being eligibility in 401k, i.e. `e401`. As before, $Y$ denotes the outcome `net_tfa`, and $X$ is the vector of covariates. We consider a partially linear structural equation model:
\begin{eqnarray*}
Y & := & g_Y(\epsilon_Y) D + f_Y(A, X, \epsilon_Y),  \\
D & := & f_D(Z, X, A, \epsilon_D), \\
Z & := & f_Z(X, \epsilon_Z),\\
A & := & f_A(X, \epsilon_A), \\
X & := &  \epsilon_X,
\end{eqnarray*}
where $A$ is a vector of unobserved confounders.

Under this structural equation model, the average treatment effect:
\begin{align}
\alpha = E[Y(1) - Y(0)]
\end{align}
can be identified by the moment restriction:
\begin{align}
E[(\tilde{Y} - \alpha \tilde{D}) \tilde{Z}] = 0
\end{align}
where for any variable $V$, we denote with $\tilde{V} = V - E[V|X]$.
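
Because the moment restriction is linear in $\alpha$, it can be solved in closed form, provided the residualized instrument is correlated with the residualized treatment. The sample analog, written here with $\hat{\tilde{V}}_i$ denoting an estimated residual (notation introduced only for this remark), is the ratio of two averages of residual products:
\begin{align}
\alpha = \frac{E[\tilde{Y} \tilde{Z}]}{E[\tilde{D} \tilde{Z}]}, \qquad E[\tilde{D} \tilde{Z}] \neq 0,
\qquad
\hat{\alpha} = \frac{\frac{1}{n} \sum_{i=1}^n \hat{\tilde{Y}}_i \hat{\tilde{Z}}_i}{\frac{1}{n} \sum_{i=1}^n \hat{\tilde{D}}_i \hat{\tilde{Z}}_i}
\end{align}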

In [None]:
modely = make_pipeline(transformer, StandardScaler(), LassoCV())
modeld = make_pipeline(transformer, StandardScaler(), LassoCV())
modelz = make_pipeline(transformer, StandardScaler(), LassoCV())

In [None]:
resy = y - modely.fit(X, y).predict(X)
resZ = Z - modelz.fit(X, Z).predict(X) # instrument is e401 (eligibility)
resD = D - modeld.fit(X, D).predict(X) # treatment is p401 (participation)

In [None]:
np.mean(resy * resZ) / np.mean(resD * resZ)

### DML with Non-Linear ML Models and Cross-fitting

In [None]:
def dml(X, Z, D, y, modely, modeld, modelz, *, nfolds, classifier=False):
    '''
    DML for the Partially Linear IV Model setting with cross-fitting
    
    Input
    -----
    X: the controls
    Z: the instrument
    D: the treatment
    y: the outcome
    modely: the ML model for predicting the outcome y
    modeld: the ML model for predicting the treatment D
    modelz: the ML model for predicting the instrument Z
    nfolds: the number of folds in cross-fitting
    classifier: bool, whether modeld and modelz are classifiers (True) or regressors (False)
    
    Output
    ------
    point: the point estimate of the treatment effect of D on y
    stderr: the standard error of the treatment effect
    yhat: the cross-fitted predictions for the outcome y
    Dhat: the cross-fitted predictions for the treatment D
    Zhat: the cross-fitted predictions for the instrument Z
    resy: the outcome residuals
    resD: the treatment residuals
    resZ: the instrument residuals
    epsilon: the residuals of the final-stage IV regression, resy - point * resD
    '''
    cv = KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds
    yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y
    # out-of-fold predictions for D
    # use predict or predict_proba dependent on classifier or regressor for D
    if classifier: 
        Dhat = cross_val_predict(modeld, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
        Zhat = cross_val_predict(modelz, X, Z, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    else:
        Dhat = cross_val_predict(modeld, X, D, cv=cv, n_jobs=-1)
        Zhat = cross_val_predict(modelz, X, Z, cv=cv, n_jobs=-1)
    # calculate outcome, treatment and instrument residuals
    resy = y - yhat
    resD = D - Dhat
    resZ = Z - Zhat
    # final stage IV point estimate and standard error
    point = np.mean(resy * resZ) / np.mean(resD*resZ)
    epsilon = resy - point * resD
    var = np.mean(epsilon**2 * resZ**2) / np.mean(resD*resZ)**2
    stderr = np.sqrt(var / X.shape[0])
    return point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, epsilon
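
The residual-based IV formula and its standard error can be sanity-checked on simulated data where the true coefficient is known. The sketch below is NumPy-only and uses plain least squares as the nuisance learner instead of the ML pipelines; the data-generating process, the helper `crossfit_ols_predict`, and all `_sim` names are assumptions made for illustration.

```python
import numpy as np

def crossfit_ols_predict(X, target, folds):
    """Out-of-fold least-squares predictions of `target` from X (with intercept)."""
    n = X.shape[0]
    Xc = np.column_stack([np.ones(n), X])
    pred = np.empty(n)
    for test in folds:
        train = np.setdiff1d(np.arange(n), test)  # indices outside the held-out fold
        coef, *_ = np.linalg.lstsq(Xc[train], target[train], rcond=None)
        pred[test] = Xc[test] @ coef
    return pred

rng = np.random.default_rng(0)
n_sim, true_alpha = 20_000, 2.0
X_sim = rng.normal(size=(n_sim, 3))
A_sim = rng.normal(size=n_sim)                                  # unobserved confounder
Z_sim = X_sim[:, 0] + rng.normal(size=n_sim)                    # instrument depends on X only
D_sim = Z_sim + X_sim[:, 1] + A_sim + rng.normal(size=n_sim)    # treatment, confounded by A
y_sim = true_alpha * D_sim + X_sim[:, 2] + 2 * A_sim + rng.normal(size=n_sim)

# One shuffled 3-fold split shared by all three nuisance fits
folds = np.array_split(rng.permutation(n_sim), 3)
resy_sim = y_sim - crossfit_ols_predict(X_sim, y_sim, folds)
resD_sim = D_sim - crossfit_ols_predict(X_sim, D_sim, folds)
resZ_sim = Z_sim - crossfit_ols_predict(X_sim, Z_sim, folds)

# IV estimate and standard error, same formulas as in dml()
point_sim = np.mean(resy_sim * resZ_sim) / np.mean(resD_sim * resZ_sim)
eps_sim = resy_sim - point_sim * resD_sim
stderr_sim = np.sqrt(np.mean(eps_sim**2 * resZ_sim**2) / np.mean(resD_sim * resZ_sim)**2 / n_sim)

# Residual-on-residual OLS ignores the confounder and is biased here
ols_sim = np.mean(resy_sim * resD_sim) / np.mean(resD_sim**2)
print(f"IV: {point_sim:.3f} (se {stderr_sim:.3f}), OLS: {ols_sim:.3f}, truth: {true_alpha}")
```

In this toy design the IV estimate should sit close to the true coefficient, while the OLS ratio is pulled upward by the unobserved confounder.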

In [None]:
def summary(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, epsilon, X, Z, D, y, *, name):
    '''
    Convenience summary function that takes the results of the DML function
    and summarizes several estimation quantities and performance metrics.
    '''
    return pd.DataFrame({'estimate': point, # point estimate
                         'stderr': stderr, # standard error
                         'lower': point - 1.96*stderr, # lower end of 95% confidence interval
                         'upper': point + 1.96*stderr, # upper end of 95% confidence interval
                         'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y
                         'rmse D': np.sqrt(np.mean(resD**2)), # RMSE of model that predicts treatment D
                         'rmse Z': np.sqrt(np.mean(resZ**2)), # RMSE of model that predicts instrument Z
                         'accuracy D': np.mean(np.abs(resD) < .5), # binary classification accuracy of model for D
                         'accuracy Z': np.mean(np.abs(resZ) < .5), # binary classification accuracy of model for Z
                         }, index=[name])

#### Double Lasso with Cross-Fitting

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
lassoy = make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv))
lassod = make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv))
lassoz = make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv))
result = dml(X, Z, D, y, lassoy, lassod, lassoz, nfolds=3)

In [None]:
table = summary(*result, X, Z, D, y, name='double lasso')
table

#### Using a Penalized Logistic Regression for D

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
lassoy = make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv))
lgrd = make_pipeline(transformer, StandardScaler(), LogisticRegressionCV(cv=cv))
lgrz = make_pipeline(transformer, StandardScaler(), LogisticRegressionCV(cv=cv))
result = dml(X, Z, D, y, lassoy, lgrd, lgrz, nfolds=3, classifier=True)

In [None]:
table = pd.concat([table, summary(*result, X, Z, D, y, name='lasso/logistic')])
table

### Random Forests

In [None]:
rfy = make_pipeline(transformer, RandomForestRegressor(n_estimators=100, min_samples_leaf=10, ccp_alpha=.001))
rfd = make_pipeline(transformer, RandomForestClassifier(n_estimators=100, min_samples_leaf=10, ccp_alpha=.001))
rfz = make_pipeline(transformer, RandomForestClassifier(n_estimators=100, min_samples_leaf=10, ccp_alpha=.001))
result = dml(X, Z, D, y, rfy, rfd, rfz, nfolds=3, classifier=True)

In [None]:
table = pd.concat([table, summary(*result, X, Z, D, y, name='random forest')])
table

### Decision Trees

In [None]:
dtry = make_pipeline(transformer, DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=.001))
dtrd = make_pipeline(transformer, DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=.001))
dtrz = make_pipeline(transformer, DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=.001))
result = dml(X, Z, D, y, dtry, dtrd, dtrz, nfolds=3, classifier=True)

In [None]:
table = pd.concat([table, summary(*result, X, Z, D, y, name='decision tree')])
table

### Boosted Trees

In [None]:
gbfy = make_pipeline(transformer, GradientBoostingRegressor(max_depth=2, n_iter_no_change=5))
gbfd = make_pipeline(transformer, GradientBoostingClassifier(max_depth=2, n_iter_no_change=5))
gbfz = make_pipeline(transformer, GradientBoostingClassifier(max_depth=2, n_iter_no_change=5))
result = dml(X, Z, D, y, gbfy, gbfd, gbfz, nfolds=3, classifier=True)

In [None]:
table = pd.concat([table, summary(*result, X, Z, D, y, name='boosted forest')])
table

# Semi-Crossfitting and AutoML

In [None]:
from flaml import AutoML

flamly = make_pipeline(transformer, AutoML(time_budget=100, task='regression', early_stop=True,
                                    eval_method='cv', n_splits=3, metric='r2', verbose=0))
flamld = make_pipeline(transformer, AutoML(time_budget=100, task='classification', early_stop=True,
                                           eval_method='cv', n_splits=3, metric='r2', verbose=0))
flamlz = make_pipeline(transformer, AutoML(time_budget=100, task='classification', early_stop=True,
                                           eval_method='cv', n_splits=3, metric='r2', verbose=0))

In [None]:
flamly.fit(X, y)
besty = make_pipeline(transformer, clone(flamly[-1].best_model_for_estimator(flamly[-1].best_estimator)))

In [None]:
flamld.fit(X, D)
bestd = make_pipeline(transformer, clone(flamld[-1].best_model_for_estimator(flamld[-1].best_estimator)))

In [None]:
flamlz.fit(X, Z)
bestz = make_pipeline(transformer, clone(flamlz[-1].best_model_for_estimator(flamlz[-1].best_estimator)))

In [None]:
result = dml(X, Z, D, y, besty, bestd, bestz, nfolds=3, classifier=True)

In [None]:
table = pd.concat([table, summary(*result, X, Z, D, y, name='automl (semi-cfit)')])
table

# Inference Robust to Weak Identification

In [None]:
import scipy.stats

def robust_inference(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, epsilon, X, Z, D, y, *, grid, alpha=0.05):
    '''
    Inference in the partially linear IV model that is robust to weak identification.
    grid: grid of theta values to search over when trying to identify the confidence region
    alpha: significance level (the region has confidence level 1 - alpha)
    '''
    n = X.shape[0]
    thr = scipy.stats.chi2.ppf(1 - alpha, df=1)
    accept = []
    for theta in grid:
        moment = (resy - theta * resD) * resZ
        test = n * np.mean(moment)**2 / np.var(moment)
        if test <= thr:
            accept.append(theta)
    return accept
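
The procedure above inverts a test, in the spirit of the Anderson-Rubin approach: a candidate $\theta$ is kept whenever the sample moment $(\tilde{Y} - \theta \tilde{D})\tilde{Z}$ is statistically indistinguishable from zero. The NumPy-only sketch below repeats the idea on simulated residuals with a vectorized grid and compares the accepted set with the usual normal-based interval. The simulated residuals, the grid, and all variable names are illustrative assumptions, and $1.96^2$ stands in for `scipy.stats.chi2.ppf(0.95, df=1)`.

```python
import numpy as np

rng = np.random.default_rng(1)
n_obs, true_theta = 4_000, 3.0
rz = rng.normal(size=n_obs)                           # instrument residual
conf = rng.normal(size=n_obs)                         # unobserved confounder
rd = 0.8 * rz + conf + rng.normal(size=n_obs)         # treatment residual (strong first stage)
ry = true_theta * rd + conf + rng.normal(size=n_obs)  # outcome residual

theta_grid = np.linspace(2.0, 4.0, 801)
# moment[i, j] = (ry_i - theta_j * rd_i) * rz_i
moment = (ry[:, None] - theta_grid[None, :] * rd[:, None]) * rz[:, None]
stat = n_obs * moment.mean(axis=0)**2 / moment.var(axis=0)
accepted = theta_grid[stat <= 1.96**2]
robust_ci = (accepted.min(), accepted.max())

# Normal-based interval around the IV point estimate
point_est = np.mean(ry * rz) / np.mean(rd * rz)
eps_est = ry - point_est * rd
se_est = np.sqrt(np.mean(eps_est**2 * rz**2) / np.mean(rd * rz)**2 / n_obs)
wald_ci = (point_est - 1.96 * se_est, point_est + 1.96 * se_est)
print(robust_ci, wald_ci)
```

With a strong first stage, as simulated here, the two intervals should nearly coincide; they can differ markedly when the instrument is weak.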

In [None]:
region = robust_inference(*result, X, Z, D, y, grid=np.linspace(0, 20000, 10000))

In [None]:
np.min(region), np.max(region)

We find that the confidence region from the robust procedure is almost identical to the normal-based confidence interval, so we are most probably in the strong instrument regime. To verify this, we can check the $t$-statistic for the effect of the instrument on the treatment:

In [None]:
resD, resZ = result[6], result[7]  # cross-fitted residuals from the most recent dml(...) call
beta = np.mean(resZ * resD) / np.mean(resZ**2)
var_beta = np.mean((resD - beta * resZ)**2 * resZ**2) / np.mean(resZ**2)**2
se_beta = np.sqrt(var_beta / resD.shape[0])
print(np.abs(beta) / se_beta)

Since the $t$-statistic is very large (much larger than the rule of thumb of $4$), the normal-based approximation and confidence intervals should be fine. We can also obtain this $t$-statistic using the statsmodels package:

In [None]:
from statsmodels.api import OLS
OLS(endog=resD, exog=resZ, hasconst=False).fit(cov_type='HC0').summary()

# Interactive IV Model and LATE

Now, we consider estimation of local average treatment effects (LATE) of participation `p401`, with the binary instrument `e401`. As before, $Y$ denotes the outcome `net_tfa`, and $X$ is the vector of covariates.  Here the structural equation model is:
\begin{eqnarray}
Y &:=&  f_Y (D, X, A, \epsilon_Y) \\
D &:= & f_D(Z, X, A, \epsilon_D) \in \{0,1\},  \\ 
Z  &:= & f_Z(X,\epsilon_Z) \in \{0,1\},  \\
X &:=&  \epsilon_X, \quad A = \epsilon_A,
\end{eqnarray}
where the $\epsilon$'s are all exogenous and independent, $A$ is a vector of unobserved confounders, and
$$
z \mapsto f_D(z, X, A, \epsilon_D) \text{ is weakly increasing (weakly monotone)}.
$$
Note that in our setting monotonicity is satisfied, since participation is only feasible for eligible individuals, so $D=0$ whenever $Z=0$. Hence $f_D(1, X, A, \epsilon_D) \geq 0 = f_D(0, X, A, \epsilon_D)$.

In this case, we can estimate the local average treatment effect (LATE):
$$
\alpha = E[Y(1) - Y(0) | D(1) > D(0)]
$$
This can be identified using the Neyman orthogonal moment equation:
\begin{align}
E\left[g(1, X) - g(0, X) + H(Z) (Y - g(Z, X)) - \alpha \cdot \left(m(1, X) - m(0, X) + H(Z) (D - m(Z, X))\right)\right] = 0
\end{align}
where 
\begin{align}
g(Z,X) =~& E[Y|Z,X],\\
m(Z,X) =~& E[D|Z,X],\\
H(Z) =~& \frac{Z}{Pr(Z=1|X)} - \frac{1 - Z}{1 - Pr(Z=1|X)}
\end{align}
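
Solving the orthogonal moment for $\alpha$ gives a ratio of two averaged doubly robust scores, one for the effect of $Z$ on $Y$ and one for the effect of $Z$ on $D$. As a sanity check, the sketch below uses simulated data without covariates, where $g$, $m$ and the propensity are simple group means; in that special case the ratio coincides with the classical Wald estimator. The data-generating process (one-sided noncompliance, a complier effect of 4) and all `_s` names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n_s = 50_000
Z_s = rng.binomial(1, 0.4, size=n_s)         # eligibility (instrument)
complier = rng.binomial(1, 0.7, size=n_s)    # would participate if eligible
D_s = Z_s * complier                         # one-sided noncompliance: D=0 whenever Z=0
# Compliers have a treatment effect of 4 and a higher baseline outcome
y_s = 4.0 * D_s + 2.0 * complier + rng.normal(size=n_s)

# Nuisance functions without covariates are just group means
p_hat = Z_s.mean()
g1, g0 = y_s[Z_s == 1].mean(), y_s[Z_s == 0].mean()
m1, m0 = D_s[Z_s == 1].mean(), D_s[Z_s == 0].mean()

# Doubly robust scores, mirroring the moment equation
H = Z_s / p_hat - (1 - Z_s) / (1 - p_hat)
drZ_s = g1 - g0 + (y_s - np.where(Z_s == 1, g1, g0)) * H
drD_s = m1 - m0 + (D_s - np.where(Z_s == 1, m1, m0)) * H
late_dr = drZ_s.mean() / drD_s.mean()

wald_est = (g1 - g0) / (m1 - m0)                                  # classical Wald estimator
naive_diff = y_s[D_s == 1].mean() - y_s[D_s == 0].mean()          # ignores selection into D
print(f"DR LATE: {late_dr:.3f}, Wald: {wald_est:.3f}, naive: {naive_diff:.3f}")
```

The naive comparison by treatment status is inflated because compliers differ from never-takers at baseline, whereas the instrument-based ratio targets the effect among compliers.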

In [None]:
def iiv(X, Z, D, y, modely0, modely1, modeld0, modeld1, modelz, *, trimming=0.01, nfolds):
    '''
    DML for the Interactive IV Model setting with cross-fitting
    
    Input
    -----
    X: the controls
    Z: the instrument
    D: the treatment
    y: the outcome
    modely0: the ML model for predicting the outcome y in the Z=0 population
    modely1: the ML model for predicting the outcome y in the Z=1 population
    modeld0: the ML model for predicting the treatment D in the Z=0 population
    modeld1: the ML model for predicting the treatment D in the Z=1 population
    modelz: the ML model for predicting the instrument Z
    trimming: propensity estimates are clipped to [trimming, 1 - trimming]
    nfolds: the number of folds in cross-fitting
    
    Output
    ------
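    point: the point estimate of the local average treatment effect of D on y
    stderr: the standard error of the point estimate
    yhat: the cross-fitted predictions for the outcome y
    Dhat: the cross-fitted predictions for the treatment D
    Zhat: the cross-fitted (trimmed) propensities for the instrument Z
    resy: the outcome residuals
    resD: the treatment residuals
    resZ: the instrument residuals
    drZ: the doubly robust score for the effect of Z on y, for each sample
    drD: the doubly robust score for the effect of Z on D, for each sample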
    '''
    cv = KFold(n_splits=nfolds, shuffle=True, random_state=123)
    yhat0, yhat1 = np.zeros(y.shape), np.zeros(y.shape)
    Dhat0, Dhat1 = np.zeros(D.shape), np.zeros(D.shape)
    # we fit models for E[Y | Z, X] and E[D | Z, X] by fitting a separate model
    # on the Z==0 subsample and a separate model on the Z==1 subsample.
    for train, test in cv.split(X, y):
        # train an outcome model on training data that received Z=0 and predict outcome on all data in test set
        yhat0[test] = clone(modely0).fit(X.iloc[train][Z[train]==0], y[train][Z[train]==0]).predict(X.iloc[test])
        # train an outcome model on training data that received Z=1 and predict outcome on all data in test set
        yhat1[test] = clone(modely1).fit(X.iloc[train][Z[train]==1], y[train][Z[train]==1]).predict(X.iloc[test])
        # train a treatment model on training data that received Z=0 and predict treatment on all data in test set
        if np.mean(D[train][Z[train]==0]) > 0: # it could be that D=0, whenever Z=0 deterministically
            modeld0_ = clone(modeld0).fit(X.iloc[train][Z[train]==0], D[train][Z[train]==0])
            Dhat0[test] = modeld0_.predict_proba(X.iloc[test])[:, 1]
        # train a treatment model on training data that received Z=1 and predict treatment on all data in test set
        if np.mean(D[train][Z[train]==1]) < 1: # it could be that D=1, whenever Z=1 deterministically
            modeld1_ = clone(modeld1).fit(X.iloc[train][Z[train]==1], D[train][Z[train]==1])
            Dhat1[test] = modeld1_.predict_proba(X.iloc[test])[:, 1]
        else:
            Dhat1[test] = 1
    # prediction of treatment and outcome for observed instrument
    yhat = yhat0 * (1 - Z) + yhat1 * Z
    Dhat = Dhat0 * (1 - Z) + Dhat1 * Z
    # propensity scores
    Zhat = cross_val_predict(modelz, X, Z, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]
    Zhat = np.clip(Zhat, trimming, 1 - trimming)
    # doubly robust quantity for every sample
    HZ = Z/Zhat - (1 - Z)/(1 - Zhat)
    drZ = yhat1 - yhat0 + (y - yhat) * HZ
    drD = Dhat1 - Dhat0 + (D - Dhat) * HZ
    point = np.mean(drZ) / np.mean(drD)
    psi = drZ - point * drD
    Jhat = np.mean(drD)
    var = np.mean(psi**2) / Jhat**2
    stderr = np.sqrt(var / X.shape[0])
    return point, stderr, yhat, Dhat, Zhat, y - yhat, D - Dhat, Z - Zhat, drZ, drD

In [None]:
def summary(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, drZ, drD, X, Z, D, y, *, name):
    '''
    Convenience summary function that takes the results of the DML function
    and summarizes several estimation quantities and performance metrics.
    '''
    return pd.DataFrame({'estimate': point, # point estimate
                         'stderr': stderr, # standard error
                         'lower': point - 1.96*stderr, # lower end of 95% confidence interval
                         'upper': point + 1.96*stderr, # upper end of 95% confidence interval
                         'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y
                         'rmse D': np.sqrt(np.mean(resD**2)), # RMSE of model that predicts treatment D
                         'rmse Z': np.sqrt(np.mean(resZ**2)), # RMSE of model that predicts instrument Z
                         'accuracy D': np.mean(np.abs(resD) < .5), # binary classification accuracy of model for D
                         'accuracy Z': np.mean(np.abs(resZ) < .5), # binary classification accuracy of model for Z
                         }, index=[name])

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
lassoy = make_pipeline(transformer, StandardScaler(), LassoCV(cv=cv))
lgrd = make_pipeline(transformer, StandardScaler(), LogisticRegressionCV(cv=cv))
lgrz = make_pipeline(transformer, StandardScaler(), LogisticRegressionCV(cv=cv))
result = iiv(X, Z, D, y, lassoy, lassoy, lgrd, lgrd, lgrz, nfolds=3)

In [None]:
tableiiv = summary(*result, X, Z, D, y, name='lasso/logistic')
tableiiv

In [None]:
rfy = make_pipeline(transformer, RandomForestRegressor(n_estimators=100, min_samples_leaf=10, ccp_alpha=.001))
rfd = make_pipeline(transformer, RandomForestClassifier(n_estimators=100, min_samples_leaf=10, ccp_alpha=.001))
rfz = make_pipeline(transformer, RandomForestClassifier(n_estimators=100, min_samples_leaf=10, ccp_alpha=.001))
result = iiv(X, Z, D, y, rfy, rfy, rfd, rfd, rfz, nfolds=3)

In [None]:
tableiiv = pd.concat([tableiiv, summary(*result, X, Z, D, y, name='random forest')])
tableiiv

In [None]:
dtry = make_pipeline(transformer, DecisionTreeRegressor(min_samples_leaf=10, ccp_alpha=.001))
dtrd = make_pipeline(transformer, DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=.001))
dtrz = make_pipeline(transformer, DecisionTreeClassifier(min_samples_leaf=10, ccp_alpha=.001))
result = iiv(X, Z, D, y, dtry, dtry, dtrd, dtrd, dtrz, nfolds=3)

In [None]:
tableiiv = pd.concat([tableiiv, summary(*result, X, Z, D, y, name='decision tree')])
tableiiv

In [None]:
gbfy = make_pipeline(transformer, GradientBoostingRegressor(max_depth=2, n_iter_no_change=5))
gbfd = make_pipeline(transformer, GradientBoostingClassifier(max_depth=2, n_iter_no_change=5))
gbfz = make_pipeline(transformer, GradientBoostingClassifier(max_depth=2, n_iter_no_change=5))
result = iiv(X, Z, D, y, gbfy, gbfy, gbfd, gbfd, gbfz, nfolds=3)

In [None]:
tableiiv = pd.concat([tableiiv, summary(*result, X, Z, D, y, name='boosted forest')])
tableiiv

## Semi-Crossfitting

In [None]:
from flaml import AutoML
flamly0 = make_pipeline(transformer, AutoML(time_budget=60, task='regression', early_stop=True,
                                     eval_method='cv', n_splits=3, metric='r2', verbose=0))
flamly1 = make_pipeline(transformer, AutoML(time_budget=60, task='regression', early_stop=True,
                                     eval_method='cv', n_splits=3, metric='r2', verbose=0))
flamld1 = make_pipeline(transformer, AutoML(time_budget=60, task='classification', early_stop=True,
                                           eval_method='cv', n_splits=3, metric='r2', verbose=0))
flamlz = make_pipeline(transformer, AutoML(time_budget=60, task='classification', early_stop=True,
                                           eval_method='cv', n_splits=3, metric='r2', verbose=0))

In [None]:
flamly0.fit(X[Z==0], y[Z==0])
besty0 = make_pipeline(transformer, clone(flamly0[-1].best_model_for_estimator(flamly0[-1].best_estimator)))

In [None]:
flamly1.fit(X[Z==1], y[Z==1])
besty1 = make_pipeline(transformer, clone(flamly1[-1].best_model_for_estimator(flamly1[-1].best_estimator)))

In [None]:
from sklearn.dummy import DummyClassifier
bestd0 = DummyClassifier() # since D=0 whenever Z=0

In [None]:
flamld1.fit(X[Z==1], D[Z==1])
bestd1 = make_pipeline(transformer, clone(flamld1[-1].best_model_for_estimator(flamld1[-1].best_estimator)))

In [None]:
flamlz.fit(X, Z)
bestz = make_pipeline(transformer, clone(flamlz[-1].best_model_for_estimator(flamlz[-1].best_estimator)))

In [None]:
result = iiv(X, Z, D, y, besty0, besty1, bestd0, bestd1, bestz, nfolds=3)

In [None]:
tableiiv = pd.concat([tableiiv, summary(*result, X, Z, D, y, name='automl (semi-cfit)')])
tableiiv

Comparing with the PLR model:

In [None]:
table

We find that the PLR model yields estimates that are higher by around 1k, though the two sets of results have overlapping confidence intervals.

In [None]:
import scipy.stats

def iivm_robust_inference(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, drZ, drD, X, Z, D, y, *, grid, alpha=0.05):
    '''
    Inference in the interactive IV model that is robust to weak identification.
    grid: grid of theta values to search over when trying to identify the confidence region
    alpha: significance level (the region has confidence level 1 - alpha)
    '''
    n = X.shape[0]
    thr = scipy.stats.chi2.ppf(1 - alpha, df=1)
    accept = []
    for theta in grid:
        moment = drZ - theta * drD
        test = n * np.mean(moment)**2 / np.var(moment)
        if test <= thr:
            accept.append(theta)
    return accept

In [None]:
region = iivm_robust_inference(*result, X, Z, D, y, grid=np.linspace(0, 20000, 10000))

In [None]:
np.min(region), np.max(region)

We find again that the robust confidence region is almost identical to the normal-based confidence interval, which is consistent with the strong instrument regime indicated by the large first-stage $t$-statistic computed earlier.

# Using EconML

In [None]:
!pip install econml

In [None]:
W = StandardScaler().fit_transform(transformer.fit_transform(X))

In [None]:
from econml.iv.dml import OrthoIV

cv = KFold(n_splits=5, shuffle=True, random_state=123)
plriv = OrthoIV(model_y_xw=LassoCV(cv=cv),
                model_t_xw=LogisticRegressionCV(cv=cv),
                model_z_xw=LogisticRegressionCV(cv=cv),
                cv=3, discrete_treatment=True, discrete_instrument=True, random_state=123)

In [None]:
plriv.fit(y, D, Z=Z, W=W)

In [None]:
plriv.summary()

EconML does not yet support LATE estimation under the fully interactive IV model. 

It does, however, support a more general IV model in which the variables `X` are allowed to alter the effect heterogeneity and the compliance heterogeneity, but the unobserved confounder is not allowed to jointly alter both. In other words, it assumes a structural equation model of the form:
\begin{eqnarray*}
Y & := & g_Y(X, \epsilon_Y) D + f_Y(A, X, \epsilon_Y),  \\
D & := & f_D(Z, X, A, \epsilon_D), \\
Z & := & f_Z(X, \epsilon_Z),\\
A & := & f_A(X, \epsilon_A), \\
X & := &  \epsilon_X,
\end{eqnarray*}
where $A$ is a vector of unobserved confounders. Under these assumptions the average treatment effect is identifiable (not just the local average treatment effect). In particular, the average treatment effect can be identified as:
\begin{align}
\alpha := E[Y(1) - Y(0)] = E\left[ \frac{Cov(Y, Z\mid X)}{Cov(D, Z\mid X)} \right] = E\left[ \frac{E[\tilde{Y} \tilde{Z}\mid X]}{E[\tilde{Z} \tilde{D}\mid X]} \right]
\end{align}
where for any variable $V$ we have $\tilde{V}=V-E[V|X]$. 

However, the variance of this method can be considerably larger than that of the LATE method, since it uses a local compliance measure, i.e. $E[\tilde{Z} \tilde{D}\mid X]$, which can be small for some regions of $X$. This extra variance stems from the fact that we are going after a more challenging causal quantity, the average treatment effect rather than the local average treatment effect, and hence we need to re-weight the data based on compliance levels, conditional on the observable covariates $X$.

In [None]:
from econml.iv.dr import LinearDRIV

driv = LinearDRIV(model_y_xw=LassoCV(cv=cv),
                  model_t_xw=LogisticRegressionCV(cv=cv),
                  model_t_xwz=LogisticRegressionCV(cv=cv),
                  model_tz_xw=LassoCV(cv=cv),
                  flexible_model_effect=LassoCV(cv=cv),
                  projection=True,
                  discrete_instrument=True, discrete_treatment=True, cv=3, cov_clip=0.01, random_state=123)

In [None]:
driv.fit(y, D, Z=Z, W=W)

In [None]:
driv.summary()

# Using the DoubleML Package

In [None]:
!pip install doubleml

In [None]:
from doubleml import DoubleMLData
dml_data = DoubleMLData.from_arrays(W, y, D, z=Z)
print(dml_data)

In [None]:
import doubleml as dml  # note: this rebinds the name `dml`, shadowing the dml() helper defined earlier

class RegWrapper(BaseEstimator):

    def __init__(self, clf):
        self.clf = clf

    def fit(self, X, y):
        self.clf_ = clone(self.clf).fit(X, y)
        return self

    def predict(self, X):
        return self.clf_.predict_proba(X)[:, 1]

dml_plr_obj = dml.DoubleMLPLIV(dml_data,
                               LassoCV(cv=cv),
                               RegWrapper(LogisticRegressionCV(cv=cv)),
                               RegWrapper(LogisticRegressionCV(cv=cv)),
                               n_folds=3)
print(dml_plr_obj.fit())

In [None]:
import doubleml as dml

class Wrapper(BaseEstimator):

    def __init__(self, clf):
        self.clf = clf

    def fit(self, X, y):
        if np.mean(y) == 0 or np.mean(y) == 1:
            self.clf_ = np.mean(y)
        else:
            self.clf_ = clone(self.clf).fit(X, y)
        self.classes_ = np.array([0, 1])
        return self
    
    def predict_proba(self, X):
        probs = np.zeros((X.shape[0], 2))
        if self.clf_ == 0:
            probs[:, 0] = 1
            return probs
        if self.clf_ == 1:
            probs[:, 1] = 1
            return probs
        return self.clf_.predict_proba(X)

    def predict(self, X):
        return self.predict_proba(X)[:, 1] >= .5
    
dml_plr_obj = dml.DoubleMLIIVM(dml_data,
                               LassoCV(cv=cv),
                               Wrapper(LogisticRegressionCV(cv=cv)),
                               Wrapper(LogisticRegressionCV(cv=cv)),
                               n_folds=3)
print(dml_plr_obj.fit())