In an experiment estimating the effect of __online learning (treatment)__ on __exam score (outcome)__, we can estimate the treatment effect with the following regression model.

$$exam_i = \beta_0 + \beta_1 Online_i + \mu_i$$

where $\beta_0$ is the baseline (the expected exam score without the treatment), $\beta_1$ is the treatment effect, $Online_i = 1$ when student $i$ is taught online and 0 otherwise, and $\mu_i$ captures all other factors that the $Online_i$ treatment does not explain.
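To make the roles of $\beta_0$ and $\beta_1$ concrete, here is a minimal sketch on synthetic data. The sample size, the "true" effect of -5, and the noise level are all made-up assumptions, and plain NumPy least squares stands in for the regression. With a single binary regressor, the fitted slope should match the difference in group means and the intercept should match the untreated group's mean.

```python
import numpy as np

# Synthetic data: every number here is a hypothetical choice for illustration.
rng = np.random.default_rng(0)
n = 200
online = rng.integers(0, 2, size=n)                      # 1 = online, 0 = otherwise
exam = 78.0 - 5.0 * online + rng.normal(0, 10, size=n)   # assumed data-generating process

# OLS of exam on [1, online] via least squares.
X = np.column_stack([np.ones(n), online])
beta0, beta1 = np.linalg.lstsq(X, exam, rcond=None)[0]

# Simple group means for comparison.
control_mean = exam[online == 0].mean()
diff_in_means = exam[online == 1].mean() - control_mean

print(beta0, control_mean)    # intercept vs. mean of the untreated group
print(beta1, diff_in_means)   # slope vs. difference in group means
```

Each printed pair should agree up to floating-point error, which is why the regression coefficient can be read as the average difference between the two groups.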

In [3]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import statsmodels.formula.api as smf
import graphviz as gr

In [4]:
url = "https://raw.githubusercontent.com/matheusfacure/python-causality-handbook/master/causal-inference-for-the-brave-and-true/data/online_classroom.csv"
data = pd.read_csv(url).query("format_blended==0")
data.head()

Unnamed: 0,gender,asian,black,hawaiian,hispanic,unknown,white,format_ol,format_blended,falsexam
0,0,0.0,0.0,0.0,0.0,0.0,1.0,0,0.0,63.29997
1,1,0.0,0.0,0.0,0.0,0.0,1.0,0,0.0,79.96
4,1,0.0,0.0,0.0,0.0,0.0,1.0,1,0.0,83.3
5,0,1.0,0.0,0.0,0.0,0.0,0.0,1,0.0,88.34996
7,1,1.0,0.0,0.0,0.0,0.0,0.0,0,0.0,90.0


In [5]:
result = smf.ols('falsexam ~ format_ol', data=data).fit()
result.summary().tables[1]

,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,78.5475,1.113,70.563,0.000,76.353,80.742
format_ol,-4.9122,1.680,-2.925,0.004,-8.223,-1.601


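This output gives us the estimated treatment effect $\beta_1 = -4.9122$ and the baseline $\beta_0 = 78.5475$: on average, online students score about 4.9 points lower than students in the baseline (non-online) group. The regression also reports a p-value (0.004) and a 95% confidence interval ($[-8.223, -1.601]$) at no extra cost.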

In fact, for this single-variable model the treatment effect can be computed directly from the regression formula

$$\beta_1 = \frac{Cov(Y_i, T_i)}{Var(T_i)}$$
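One way to see where this formula comes from is to take the covariance of both sides of the model with $T_i$. This is a sketch that assumes the residual $\mu_i$ is uncorrelated with the regressor, which holds by construction for a least-squares fit with an intercept. Here $Y_i$ and $T_i$ stand for $exam_i$ and $Online_i$:

$$Cov(Y_i, T_i) = Cov(\beta_0 + \beta_1 T_i + \mu_i, T_i) = \beta_1 Var(T_i) + Cov(\mu_i, T_i) = \beta_1 Var(T_i)$$

Dividing both sides by $Var(T_i)$ gives the expression for $\beta_1$.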

In [10]:
data['falsexam'].cov(data['format_ol'])/data['format_ol'].var()

-4.912221498226949

If we extend the model to a multivariate regression with additional variables, a similar formula still applies.

$$exam_i = \beta_0 + \beta_1 Online_i + \beta_2 X_{0i} + \dots + \mu_i$$

And $\beta_1$ can be derived from

$$\beta_1 = \frac{Cov(Y_i, \tilde{T_i})}{Var(\tilde{T_i})}$$

where $\tilde{T_i}$ is the residual from regressing $T$ on $X_0, X_1, ...$. The intuition is that if $X$ can predict $T$, the treatment is not fully random; the residual $\tilde{T}$ is the part of the treatment that $X$ cannot explain. This residualized treatment $\tilde{T_i}$ is therefore uncorrelated with every factor in $X$.

In [18]:
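# Step 1: regress the treatment on the covariates
reg1 = smf.ols("format_ol ~ gender+asian+black+hawaiian+hispanic+unknown+white", data=data).fit()
reg1.summary().tables[1]

# Step 2: keep the residual, i.e. the part of the treatment the covariates do not explain
data['resid'] = reg1.resid
# Step 3: apply Cov(Y, T~) / Var(T~) using the residualized treatment
data['falsexam'].cov(data['resid'])/data['resid'].var()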

-4.241452673704799
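As a sanity check on the residual formula, the sketch below compares it with the treatment coefficient from a full multivariate fit. It uses synthetic data rather than the classroom dataset: the covariates, coefficients, and the helper `ols` are hypothetical, and NumPy least squares stands in for `smf.ols`. The treatment is deliberately made to depend on a covariate so that residualizing matters.

```python
import numpy as np

# Synthetic data: all names and numbers below are illustrative assumptions.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=(n, 2))                              # two hypothetical covariates
t = (x[:, 0] + rng.normal(size=n) > 0).astype(float)     # treatment depends on x, so it is not random
y = 2.0 * t + x @ np.array([1.5, -0.5]) + rng.normal(size=n)

def ols(design, target):
    """Return least-squares coefficients for target ~ design (hypothetical helper)."""
    return np.linalg.lstsq(design, target, rcond=None)[0]

ones = np.ones((n, 1))

# Full multivariate regression: y ~ 1 + t + x. Index 1 is the treatment coefficient.
beta_t_full = ols(np.column_stack([ones, t, x]), y)[1]

# Step 1: regress t on the covariates (with intercept) and keep the residual.
Z = np.column_stack([ones, x])
t_resid = t - Z @ ols(Z, t)

# Step 2: apply Cov(Y, T~) / Var(T~) with matching degrees of freedom.
beta_t_resid = np.cov(y, t_resid)[0, 1] / np.var(t_resid, ddof=1)

print(beta_t_full, beta_t_resid)
```

The two printed values should agree up to floating-point error, since the residual carries only the variation in the treatment that the covariates leave unexplained.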