# Negative (Proxy) Controls for Unobserved Confounding 


Consider the following SEM, where $Y$ is the outcome, $D$ is the treatment, $A$ is an unobserved confounder, and $Q$, $X$, $S$ are observed covariates. In particular, $Q$ is a proxy control treatment, as it a priori has no effect on the actual outcome $Y$, and $S$ is a proxy control outcome, as it a priori is not affected by the actual treatment $D$. See also [An Introduction to Proximal Causal Learning](https://arxiv.org/pdf/2009.10982.pdf) for more information on this setting.

![proxy_dag.png](https://raw.githubusercontent.com/stanford-msande228/winter23/main/proxy_dag.png)

Under linearity assumptions, the average treatment effect can be estimated by solving the vector of moment equations:
\begin{align}
E\left[(\tilde{Y} - \alpha \tilde{D} - \delta \tilde{S}) \left(\begin{aligned}\tilde{D}\\ \tilde{Q}\end{aligned}\right) \right] = 0
\end{align}
where for every variable $V$ we write $\tilde{V} = V - E[V|X]$ for its residual after partialling out $X$.
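
When $Q$ and $S$ have the same dimension, the solution can be written in closed form. As a sketch, stack the unknowns as $\theta = (\alpha, \delta)'$ and let $\tilde{W} = (\tilde{D}, \tilde{S})'$ and $\tilde{V} = (\tilde{D}, \tilde{Q})'$ denote the stacked residual vectors (notation introduced here for illustration). Assuming the matrix $E[\tilde{V}\tilde{W}']$ is invertible, the moment equations give:
\begin{align}
E\left[\tilde{V}\tilde{Y}\right] = E\left[\tilde{V}\tilde{W}'\right]\theta \quad\Longrightarrow\quad \theta = E\left[\tilde{V}\tilde{W}'\right]^{-1} E\left[\tilde{V}\tilde{Y}\right]
\end{align}
and $\alpha$ is the first entry of $\theta$.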

When the dimension of the proxy treatment variables $Q$ is larger than the dimension of the proxy outcome variables $S$, the above system of equations is over-identified. In these settings, we first project the "technical treatment" variables $\tilde{W}=(\tilde{D}, \tilde{S})$ onto the space spanned by the "technical instrument" variables $\tilde{V}=(\tilde{D}, \tilde{Q})$ and use the projection as a new "technical instrument". In particular, we run an OLS regression of $\tilde{W}$ on $\tilde{V}$, and define $\tilde{Z} = E[\tilde{W}\mid \tilde{V}] = B \tilde{V}$, where the $t$-th row $\beta_t$ of the matrix $B$ is the OLS coefficient in the regression of $\tilde{W}_t$ on $\tilde{V}$. These new variables $\tilde{Z}$ can also be viewed as engineered technical instrument variables. Then we have the exactly identified system of equations:
\begin{align}
E\left[(\tilde{Y} - \alpha \tilde{D} - \delta \tilde{S}) \tilde{Z} \right] := E\left[(\tilde{Y} - \alpha \tilde{D} - \delta \tilde{S}) B \left(\begin{aligned}\tilde{D}\\ \tilde{Q}\end{aligned}\right) \right] = 0
\end{align}

In fact, the solution to this system of equations is numerically equivalent to the following two-stage algorithm:
- Run OLS of $\tilde{W}=(\tilde{D}, \tilde{S})$ on $\tilde{V}=(\tilde{D}, \tilde{Q})$.
- Define $\tilde{Z}$ as the predictions of this OLS model.
- Run OLS of $\tilde{Y}$ on $\tilde{Z}$; the first coefficient is the estimate of $\alpha$.

This is the well-known Two-Stage-Least-Squares (2SLS) algorithm for instrumental variable regression.
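
The numerical equivalence can be checked on a small synthetic example. The sketch below is only an illustration: the arrays `V`, `W` and `y` are made-up stand-ins for the residualized variables (three technical instruments, two technical treatments), not data from this notebook, and the example only demonstrates that the two computations agree.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Hypothetical residualized data: 3 technical instruments, 2 technical treatments
V = rng.normal(size=(n, 3))
W = V @ rng.normal(size=(3, 2)) + rng.normal(size=(n, 2))
y = W @ np.array([2.0, -1.0]) + rng.normal(size=n)

# Stage 1: OLS of W on V (no intercept); Z holds the fitted values
coef, *_ = np.linalg.lstsq(V, W, rcond=None)
Z = V @ coef

# Moment-equation solution: solve the sample analogue of E[(y - theta'W) Z] = 0
theta_moment = np.linalg.solve(Z.T @ W, Z.T @ y)

# Stage 2: OLS of y on Z (no intercept)
theta_2sls, *_ = np.linalg.lstsq(Z, y, rcond=None)

print(theta_moment, theta_2sls)
```

The two vectors agree up to floating-point error because $\tilde{Z}$ is the projection of $\tilde{W}$ onto $\tilde{V}$, so that $\tilde{Z}'\tilde{W} = \tilde{Z}'\tilde{Z}$ in sample.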

In [None]:
# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression, Ridge, Lasso, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
import patsy
import warnings
from sklearn.base import BaseEstimator, clone
import statsmodels.api as sm
from IPython.display import Markdown
import wget
import os
import seaborn as sns
from sklearn.multioutput import MultiOutputRegressor
warnings.simplefilter('ignore')
np.random.seed(1234)

# Analyzing Simulated Data

First, let's evaluate the methods on simulated data generated from a linear SEM characterized by the above DAG. For this simulation, we'll set the ATE to 2.

In [None]:
# generate data from the SCM
import numpy as np

def gen_data(n, ate):
    X = np.random.normal(0, 1, size=(n, 10))
    A = 2 * X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    Q = 10 * A + 2 * X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    S = 5 * A + X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    D = Q - A + 2 * X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    Y = ate * D + 5 * A + 2 * S + 0.5 * X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    return [X, A, Q, S, D, Y.flatten()]

In [None]:
X, A, Q, S, D, y = gen_data(5000, 2)
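
As a point of comparison, it can help to see what happens when the unobserved confounder $A$ is simply ignored. The sketch below repeats the `gen_data` definition so that it runs on its own and fits a naive OLS of $y$ on $(D, X)$; the variable names `features` and `naive` are introduced here for illustration. A back-of-the-envelope calculation from the SEM suggests the naive coefficient on $D$ should be roughly $2 + 135/83 \approx 3.6$ rather than the true value of 2.

```python
import numpy as np

def gen_data(n, ate):
    # Same linear SEM as in the cell above
    X = np.random.normal(0, 1, size=(n, 10))
    A = 2 * X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    Q = 10 * A + 2 * X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    S = 5 * A + X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    D = Q - A + 2 * X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    Y = ate * D + 5 * A + 2 * S + 0.5 * X[:, [0]] + np.random.normal(0, 1, size=(n, 1))
    return [X, A, Q, S, D, Y.flatten()]

np.random.seed(1234)
X, A, Q, S, D, y = gen_data(5000, 2)

# Naive OLS of y on an intercept, D and X; the confounder A is left out
features = np.hstack([np.ones((len(y), 1)), D, X])
coef, *_ = np.linalg.lstsq(features, y, rcond=None)
naive = coef[1]  # coefficient on D
print(naive)
```

The gap between this estimate and the true effect is the confounding bias that the proxy variables $Q$ and $S$ are meant to remove.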

We define the technical instruments $V=(D, Q)$ and technical treatments $W=(D, S)$. The problem then reduces to an instrumental variable regression with instruments $V$ and treatments $W$, where the quantity of interest is the first coefficient, the one associated with $D$.

In [None]:
V = np.hstack([D, Q]) # technical instruments
W = np.hstack([D, S]) # technical treatments

### Partialling-Out X

In [None]:
modely = make_pipeline(StandardScaler(), LassoCV())
modelw = make_pipeline(StandardScaler(), LassoCV())
modelv = make_pipeline(StandardScaler(), LassoCV())

In [None]:
resy = y - modely.fit(X, y).predict(X)
resV = V - MultiOutputRegressor(modelv).fit(X, V).predict(X) # residual instrument
resW = W - MultiOutputRegressor(modelw).fit(X, W).predict(X) # residual treatment

### Approach 1: Solving the Moment Equation

In this case since $V$ and $W$ have the same dimension, we can just solve the moment equation:
\begin{align}
E\left[(\tilde{Y} - \theta'\tilde{W}) \tilde{V} \right] = 0
\end{align}

In [None]:
n = resV.shape[0]
J = (resV.T @ resW) / n
alpha = (resV.T @ resy) / n
point = np.linalg.inv(J) @ alpha
point[0]

### Approach 2: Projecting the Instrument on the Treatment

Alternatively, we could have constructed a technical instrument by regressing $W$ on $V$ with OLS and using $Z=E[W|V]$ as the new instrument.

In [None]:
resZ = LinearRegression(fit_intercept=False).fit(resV, resW).predict(resV)

In [None]:
J = (resZ.T @ resW) / n
alpha = (resZ.T @ resy) / n
point = np.linalg.inv(J) @ alpha
point[0]

In this case we see that because we started with an "exactly" identified system, this projection step doesn't change the result. This is provably always the case. 
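
A short argument for this, assuming the first-stage coefficient matrix $B$ is invertible (it is square here because $V$ and $W$ have the same dimension): since $B$ is a constant matrix,
\begin{align}
E\left[(\tilde{Y} - \theta'\tilde{W}) B\tilde{V}\right] = B\, E\left[(\tilde{Y} - \theta'\tilde{W}) \tilde{V}\right]
\end{align}
and the left-hand side is zero if and only if $E[(\tilde{Y} - \theta'\tilde{W}) \tilde{V}] = 0$, so both systems have the same solution $\theta$.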

### Approach 3:  2SLS

We can go one step further and use the two-stage least squares approach, where we run OLS of $\tilde{y}$ on $\tilde{Z}$.

In [None]:
LinearRegression(fit_intercept=False).fit(resZ, resy).coef_[0]

We see that again this doesn't change the result, since 2SLS is equivalent to solving the moment condition in Approach 2.

# With Cross-Fitting

In [None]:
def proxydml(X, Q, S, D, y, modely, modelw, modelv, *, nfolds):
    '''
    DML with negative (proxy) controls in the partially linear setting, with cross-fitting
    
    Input
    -----
    X: the controls
    Q: the treatment proxy
    S: the outcome proxy
    D: the treatment
    y: the outcome
    modely: the ML model for predicting the outcome y
    modelw: the ML model for predicting the technical treatments W=(D, S) from X
    modelv: the ML model for predicting the technical instruments V=(D, Q) from X
    nfolds: the number of folds in cross-fitting
    
    Output
    ------
    point: the point estimate of the treatment effect of D on y
    stderr: the standard error of the point estimate (None until implemented)
    yhat: the cross-fitted predictions for the outcome y
    What: the cross-fitted predictions for the technical treatments W
    Vhat: the cross-fitted predictions for the technical instruments V
    resy: the outcome residuals
    resW: the treatment residuals
    resV: the instrument residuals
    '''
    W = np.hstack([D, S]) # technical treatments
    V = np.hstack([D, Q]) # technical instruments

    cv = KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds
    yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y
    What = cross_val_predict(MultiOutputRegressor(modelw), X, W, cv=cv, n_jobs=-1)
    Vhat = cross_val_predict(MultiOutputRegressor(modelv), X, V, cv=cv, n_jobs=-1)

    # calculate outcome and treatment residuals
    resy = y - yhat
    resW = W - What
    resV = V - Vhat

    # project the residual treatments on the residual instruments (first stage of 2SLS)
    resZ = LinearRegression(fit_intercept=False).fit(resV, resW).predict(resV)

    # final stage ols based point estimate and standard error
    n = resW.shape[0]
    J = (resZ.T @ resW) / n
    Jinv = np.linalg.inv(J)
    alpha = (resZ.T @ resy) / n
    params = Jinv @ alpha
    point = params[0]
    stderr = None # implement this as an exercise!!

    return point, stderr, yhat, What, Vhat, resy, resW, resV
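
One possible way to approach the standard error left as an exercise above is the usual heteroskedasticity-robust sandwich formula for IV estimators. The sketch below is an assumption about what such an implementation could look like, not the notebook's intended solution: the helper `iv_point_and_stderr` is hypothetical, and the synthetic data (with no controls $X$, so the raw variables play the role of residuals) is made up purely to exercise it.

```python
import numpy as np

def iv_point_and_stderr(resy, resW, resZ):
    """Hypothetical helper: IV point estimates with sandwich standard errors."""
    n = resW.shape[0]
    J = (resZ.T @ resW) / n
    Jinv = np.linalg.inv(J)
    params = Jinv @ ((resZ.T @ resy) / n)
    eps = resy - resW @ params            # structural residuals
    psi = resZ * eps[:, None]             # per-sample moment contributions
    Sigma = (psi.T @ psi) / n             # estimate of E[eps^2 Z Z']
    cov = (Jinv @ Sigma @ Jinv.T) / n     # sandwich covariance of the estimates
    return params, np.sqrt(np.diag(cov))

# Made-up example: a is the unobserved confounder, q and s are its proxies
rng = np.random.default_rng(0)
n = 4000
a = rng.normal(size=(n, 1))
q = 3 * a + rng.normal(size=(n, 1))
s = a + rng.normal(size=(n, 1))
d = a + rng.normal(size=(n, 1))
y = (2 * d + a + rng.normal(size=(n, 1))).ravel()  # true effect of d is 2

resW = np.hstack([d, s])  # technical treatments
resV = np.hstack([d, q])  # technical instruments
params, se = iv_point_and_stderr(y, resW, resV)
print(params[0], se[0])
```

Inside `proxydml`, the same computation could be applied to `resy`, `resW` and `resZ`, taking the first entry of the returned standard errors as `stderr`.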

In [None]:
def summary(point, stderr, yhat, What, Vhat, resy, resW, resV, X, Q, S, D, y, *, name):
    '''
    Convenience summary function that takes the results of the DML function
    and summarizes several estimation quantities and performance metrics.
    '''
    lower = point - 1.96 * stderr if stderr is not None else None
    upper = point + 1.96 * stderr if stderr is not None else None
    W = np.hstack([D, S])
    V = np.hstack([D, Q])
    return pd.DataFrame({'estimate': point, # point estimate
                         'stderr': stderr, # standard error
                         'lower': lower, # lower end of 95% confidence interval
                         'upper': upper, # upper end of 95% confidence interval
                         'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y
                         'r2 y': 1 - np.mean(resy**2) / np.var(y),
                         'rmse W': np.sqrt(np.mean(resW**2)), # RMSE of model that predicts treatments W
                         'avg. r2 W': np.mean(1 - np.mean(resW**2, axis=0) / np.var(W, axis=0)),
                         'rmse V': np.sqrt(np.mean(resV**2)), # RMSE of model that predicts instruments V
                         'avg. r2 V': np.mean(1 - np.mean(resV**2, axis=0) / np.var(V, axis=0)),
                         }, index=[name])

In [None]:
X, A, Q, S, D, y = gen_data(5000, 2)

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
modely = make_pipeline(StandardScaler(), LassoCV(cv=cv))
modelw = make_pipeline(StandardScaler(), LassoCV(cv=cv))
modelv = make_pipeline(StandardScaler(), LassoCV(cv=cv))
res = proxydml(X, Q, S, D, y, modely, modelw, modelv, nfolds=3)

In [None]:
summary(*res, X, Q, S, D, y, name='lassocv')

In [None]:
from joblib import Parallel, delayed

def exp(it):
    np.random.seed(it)
    X, A, Q, S, D, y = gen_data(5000, 2)
    cv = KFold(n_splits=5, shuffle=True, random_state=123)
    lassoy = make_pipeline(StandardScaler(), LassoCV(cv=cv))
    lassow = make_pipeline(StandardScaler(), LassoCV(cv=cv))
    lassov = make_pipeline(StandardScaler(), LassoCV(cv=cv))
    res = proxydml(X, Q, S, D, y, lassoy, lassow, lassov, nfolds=3)
    point = res[0]
    stderr = 0 if res[1] is None else res[1] # this will be fixed once stderr is implemented!
    return point, point - 1.96 * stderr, point + 1.96 * stderr

results = Parallel(n_jobs=-1, verbose=3)(delayed(exp)(it) for it in range(100))

In [None]:
points, lowers, uppers = zip(*results)

In [None]:
coverage = np.mean((np.array(lowers) <= 2) & (2 <= np.array(uppers)))
coverage

In [None]:
np.std(points)

In [None]:
np.mean(points)

## Real Data - Effects of Smoking on Birth Weight

In this study, we examine the effect of smoking on baby weight. Based on domain knowledge, we consider the following setup:

Outcome ($Y$): baby weight

Treatment ($D$): smoking

Unobserved confounding ($A$): family income

The observed covariates are put into three groups:


*   Proxy treatment control ($Q$): mother's education
*   Proxy outcome control ($S$): parity (total number of previous pregnancies)
*   Other observed covariates ($X$): mother's race and age and infant sex


Education serves as a proxy treatment control $Q$ because it reflects unobserved confounding due to household income $A$ but has no direct medical effect on birth weight $Y$. Parity serves as a proxy outcome control $S$ because family size reflects household income $A$ but is not directly caused by smoking $D$ or education $Q$.

A description of the data used can be found [here](https://www.stat.berkeley.edu/users/statlabs/data/babies.readme).

In [None]:
import pandas as pd
data = pd.read_csv('https://www.stat.berkeley.edu/users/statlabs/data/babies23.data', sep=r'\s+')
data

In [None]:
# Filter the data to exclude entries where baby weight, parity, number of
# cigarettes smoked, or income was not asked or is not known
data = data[data.wt!=999]
data = data[data.parity!=99]
data = data[data.parity!=9]
data = data[np.logical_and(data.number!=98, data.number!=99)]
data = data[np.logical_and(data.inc!=98, data.inc!=99)]
data.shape

In [None]:
import patsy

X = np.array(patsy.dmatrix('0+C(race)+age+C(sex)', data))
D = np.array(patsy.dmatrix('0+number', data))
Q = np.array(patsy.dmatrix('0+C(ed)', data))
S = np.array(patsy.dmatrix('0+parity', data))
A = np.array(patsy.dmatrix('0+inc', data))
y = np.array(patsy.dmatrix('0+wt', data)).flatten()

In [None]:
cv = KFold(n_splits=5, shuffle=True, random_state=123)
modely = make_pipeline(StandardScaler(), LassoCV(cv=cv))
modelw = make_pipeline(StandardScaler(), LassoCV(cv=cv))
modelv = make_pipeline(StandardScaler(), LassoCV(cv=cv))
res = proxydml(X, Q, S, D, y, modely, modelw, modelv, nfolds=3)

In [None]:
summary(*res, X, Q, S, D, y, name='lassocv')