# Analyzing RCT with Precision by Adjusting for Baseline Covariates

In [None]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf

# Jonathan Roth's DGP

Here we set up a DGP with heterogeneous effects. In this example, which is due to Jonathan Roth, we have
$$
E [Y(0) | Z] = - Z, \quad E [Y(1) |Z] = Z, \quad Z \sim N(0,1).
$$
The CATE is
$$
E [Y(1) - Y(0) | Z ] = 2 Z,
$$
and the ATE is
$$
E [Y(1) - Y(0)] = 2 E [Z] = 0.
$$

We would like to estimate the ATE as precisely as possible.

An economic motivation for this example is as follows. Let $D$ be the treatment of going to college, and let $Z$ be academic skills. Suppose that academic skills lower earnings $Y(0)$ in jobs that don't require a college degree and raise earnings $Y(1)$ in jobs that do. The DGP above reflects this type of scenario.
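
The CATE and ATE formulas above can be checked numerically. The following is a minimal sketch, assuming independent standard normal noise on each potential outcome (as in the data-generating cell below); the seed, the sample size `m`, and the variable names are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)   # illustrative seed
m = 1_000_000                    # large draw so sample moments sit close to population ones

Z = rng.normal(size=m)
Y0 = -Z + rng.normal(size=m)     # E[Y(0) | Z] = -Z
Y1 = Z + rng.normal(size=m)      # E[Y(1) | Z] = Z
tau = Y1 - Y0                    # individual treatment effects

ate_hat = tau.mean()                     # should be close to 0
cate_slope = np.polyfit(Z, tau, 1)[0]    # slope of the effect in Z, should be close to 2

print(f"average of Y(1) - Y(0): {ate_hat:.3f}")
print(f"slope of Y(1) - Y(0) on Z: {cate_slope:.3f}")
```

Both potential outcomes are visible here only because the data are simulated; in an actual RCT only one of them is observed per unit.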



In [None]:
np.random.seed(123)
n = 1000             # sample size
Z = np.random.normal(size=n)         # generate Z
Y0 = -Z + np.random.normal(size=n)   # conditional average baseline response is -Z
Y1 = Z + np.random.normal(size=n)    # conditional average treated response is +Z
D = np.random.binomial(1, .2, size=n)    # treatment indicator; only 20% get treated
Y = Y1 * D + Y0 * (1 - D)  # observed Y
Z = Z - Z.mean()       # demean Z
data = pd.DataFrame({"Y": Y, "D": D, "Z": Z})

# Analyze the RCT data with Precision Adjustment

Consider three approaches:

*  classical 2-sample approach, no adjustment (CL)
*  classical linear regression adjustment (CRA)
*  interactive regression adjustment (IRA)

Carry out robust inference using the Eicker-Huber-White sandwich formulas.

Observe that CRA delivers estimates that are less efficient than CL (as pointed out by Freedman), whereas IRA delivers estimates that are more efficient (as pointed out by Lin). In order for CRA to be more efficient than CL, we need the linear model to be a correct model of the conditional expectation function of $Y$ given $D$ and $Z$, which is not the case here.

In [None]:
CL = smf.ols("Y ~ D", data=data).fit()
CRA = smf.ols("Y ~ D + Z", data=data).fit()        # classical regression adjustment
IRA = smf.ols("Y ~ D + Z + Z*D", data=data).fit()  # interactive regression adjustment
# We are interested in the coefficient on the variable "D".
print(CL.get_robustcov_results(cov_type="HC1").summary())
print(CRA.get_robustcov_results(cov_type="HC1").summary())
print(IRA.get_robustcov_results(cov_type="HC1").summary())
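
To make the sandwich formula behind `cov_type="HC1"` concrete, the sketch below rebuilds it by hand for the interactive regression and compares it with what statsmodels reports. It regenerates data from the same DGP so that it runs on its own; the seed and the names `bread`, `meat`, and `V_hc1` are illustrative choices, not part of the original analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(123)   # illustrative seed
n = 1000
Z = rng.normal(size=n)
Y0 = -Z + rng.normal(size=n)
Y1 = Z + rng.normal(size=n)
D = rng.binomial(1, 0.2, size=n)
df = pd.DataFrame({"Y": Y1 * D + Y0 * (1 - D), "D": D, "Z": Z - Z.mean()})

fit = smf.ols("Y ~ D + Z + Z*D", data=df).fit()

X = fit.model.exog               # design matrix, shape (n, k)
e = np.asarray(fit.resid)        # OLS residuals
n_obs, k = X.shape

bread = np.linalg.inv(X.T @ X)               # (X'X)^{-1}
meat = X.T @ (X * (e ** 2)[:, None])         # X' diag(e_i^2) X
V_hc1 = n_obs / (n_obs - k) * bread @ meat @ bread   # HC1 degrees-of-freedom scaling

se_manual = np.sqrt(np.diag(V_hc1))
se_statsmodels = np.asarray(fit.get_robustcov_results(cov_type="HC1").bse)

for name, s_man, s_sm in zip(fit.model.exog_names, se_manual, se_statsmodels):
    print(f"{name:>10s}  manual={s_man:.4f}  statsmodels={s_sm:.4f}")
```

The two columns should agree up to floating-point error, since HC1 is the HC0 sandwich multiplied by n / (n - k).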

# Using Classical (Non-Robust) Standard Errors Is Misleading Here

We don't teach non-robust standard errors in econometrics courses, but calling `fit()` on an `smf.ols()` model in Python still reports them by default, relying on 100-year-old concepts, perhaps in part due to historical legacy.

Here the non-robust standard errors suggest that there is not much difference between the different approaches, contrary to the conclusions reached using the robust standard errors.


In [None]:
print(smf.ols("Y ~ D", data).fit().summary())
print(smf.ols("Y ~ D + Z", data).fit().summary())
print(smf.ols("Y ~ D + Z + Z*D", data).fit().summary())
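
Reading the coefficient on `D` out of six full summaries is tedious, so a compact side-by-side table may help. This is a sketch that regenerates the data with the same seed and DGP as the data-generating cell so that it runs on its own; the column names `coef_D`, `se_nonrobust`, and `se_HC1` are made up for this illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

np.random.seed(123)
n = 1000
Z = np.random.normal(size=n)
Y0 = -Z + np.random.normal(size=n)
Y1 = Z + np.random.normal(size=n)
D = np.random.binomial(1, .2, size=n)
data = pd.DataFrame({"Y": Y1 * D + Y0 * (1 - D), "D": D, "Z": Z - Z.mean()})

formulas = {"CL": "Y ~ D", "CRA": "Y ~ D + Z", "IRA": "Y ~ D + Z + Z*D"}
rows = {}
for name, formula in formulas.items():
    model = smf.ols(formula, data=data)
    plain = model.fit()                   # default (homoskedastic) covariance
    robust = model.fit(cov_type="HC1")    # Eicker-Huber-White covariance
    rows[name] = {
        "coef_D": plain.params["D"],      # point estimate is the same under both fits
        "se_nonrobust": plain.bse["D"],
        "se_HC1": robust.bse["D"],
    }

table = pd.DataFrame.from_dict(rows, orient="index")
print(table.round(4))
```

Passing `cov_type="HC1"` to `fit()` changes only the covariance estimate, so the point estimates are identical across the two standard-error columns.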

# Verify Asymptotic Approximations Hold in Finite-Sample Simulation Experiment

In [None]:
from joblib import Parallel, delayed

np.random.seed(123)

def exp(it, n):
    """Run one replication: simulate the DGP and return the CL, CRA, and IRA estimates."""
    np.random.seed(it)  # seed each replication by its index for reproducibility
    Z = np.random.normal(size=n)
    Y0 = -Z + np.random.normal(size=n)
    Y1 = Z + np.random.normal(size=n)
    D = np.random.binomial(1, .2, size=n)
    Y = Y1 * D + Y0 * (1 - D)

    Z = Z - Z.mean()  # demean Z
    data = pd.DataFrame({"Z": Z, "D": D, "Y": Y})
    # keep only the coefficient on "D" from each regression
    CL = smf.ols("Y ~ D", data).fit().params["D"]
    CRA = smf.ols("Y ~ D + Z", data).fit().params["D"]
    IRA = smf.ols("Y ~ D + Z + Z*D", data).fit().params["D"]
    return CL, CRA, IRA

n = 1000
B = 1000
res = Parallel(n_jobs=-1, verbose=3)(delayed(exp)(it, n) for it in range(B))

In [None]:
res = np.array(res)
CLs, CRAs, IRAs = res[:, 0], res[:, 1], res[:, 2]
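# The true ATE is 0, so the root mean square of the estimates measures their spread around the truth
print("Standard deviations of the estimators (around the true ATE of 0)")
print("CL: ", np.sqrt((CLs**2).mean()))
print("CRA:", np.sqrt((CRAs**2).mean()))
print("IRA:", np.sqrt((IRAs**2).mean()))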
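
The simulated spreads printed above can be compared with a back-of-the-envelope calculation. The sketch below is not part of the original analysis: it writes down first-order asymptotic variances for this specific DGP, assuming Var(Z) = 1, unit-variance noise, treatment probability 0.2 assigned independently of Z, and Z centered at its sample mean. The variable names are illustrative.

```python
import numpy as np

n, p = 1000, 0.2             # sample size and treatment probability used in the simulation
var_z, var_eps = 1.0, 1.0    # Var(Z) and noise variance in the DGP

# CL: no adjustment, so the full outcome variances Var(Y(1)) = Var(Y(0)) = 2 enter.
var_cl = (var_z + var_eps) / p + (var_z + var_eps) / (1 - p)

# CRA: a single slope on Z is fitted for both arms. Its population value is the
# treatment-share-weighted average of the arm-specific slopes (+1 treated, -1 control).
b = p * 1.0 + (1 - p) * (-1.0)
resid_treated = (1.0 - b) ** 2 * var_z + var_eps    # Var(Y(1) - b * Z)
resid_control = (-1.0 - b) ** 2 * var_z + var_eps   # Var(Y(0) - b * Z)
var_cra = resid_treated / p + resid_control / (1 - p)

# IRA: arm-specific slopes remove Z from the residuals; centering Z at its sample
# mean contributes Var(CATE) = Var(2Z) = 4 * Var(Z).
var_ira = var_eps / p + var_eps / (1 - p) + 4.0 * var_z

sd_cl, sd_cra, sd_ira = (np.sqrt(v / n) for v in (var_cl, var_cra, var_ira))
print(f"CL:  {sd_cl:.4f}")
print(f"CRA: {sd_cra:.4f}")
print(f"IRA: {sd_ira:.4f}")
```

Under these assumptions the values come out to roughly 0.112 (CL), 0.139 (CRA), and 0.101 (IRA), the same ordering described earlier: CRA is less precise than CL, and IRA is more precise.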