In [52]:
%matplotlib inline

import statsmodels.formula.api as smf
import pandas as pd
import numpy as np

from auxiliary import *

np.random.seed(123)



# Regression estimators of causal effects

Regression can serve two distinct purposes:

* as a descriptive tool
* as an estimator of causal effects

## Regression as a descriptive tool

Goldberger (1991), for example, motivates least squares regression as a technique to estimate a best-fitting linear approximation to a conditional expectation function that may be nonlinear in the population.
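
The idea of a best-fitting linear approximation can be illustrated with a minimal sketch. The simulated population below is an assumption made purely for illustration (it is not the notebook's sample): `S` takes the values 1, 2, 3 and the conditional expectation of `Y` given `S` is `S ** 2`, which is nonlinear. Least squares still returns the line that is closest to those conditional means on average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: S is uniform on {1, 2, 3} and the conditional
# expectation E[Y | S] = S ** 2 is nonlinear in S.
s = rng.integers(1, 4, size=100_000).astype(float)
y = s ** 2 + rng.normal(size=s.size)

# Least squares fit of Y on a constant and S.
X = np.column_stack([np.ones_like(s), s])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Compare the sample conditional means with the fitted line at each S.
values = (1.0, 2.0, 3.0)
cell_means = {v: y[s == v].mean() for v in values}
fitted = {v: beta[0] + beta[1] * v for v in values}

for v in values:
    print(f"S={v:.0f}  mean of Y: {cell_means[v]:.3f}  linear fit: {fitted[v]:.3f}")
```

The fitted line does not pass through the conditional means, because no straight line can match a nonlinear function at three points. It is nevertheless the linear function of `S` with the smallest mean squared distance to them.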

<img src="material/regression_demonstration_one.png" height=300 width=300 />

In [56]:
df = get_sample_demonstration_1(num_agents=10000)
df.head()

,Y,D,S,Y_1,Y_0
0,2.147654,0.0,1.0,4.975437,2.147654
1,3.997684,0.0,1.0,3.487191,3.997684
2,6.323708,0.0,2.0,8.331052,6.323708
3,14.622041,1.0,3.0,14.622041,9.943213
4,8.969212,1.0,2.0,8.969212,6.101375


In [57]:
df.groupby(['D', 'S'])['Y'].mean()

D    S  
0.0  1.0     1.969594
     2.0     5.998332
     3.0    10.013828
1.0  1.0     4.007653
     2.0     8.001473
     3.0    14.032471
Name: Y, dtype: float64

In [58]:
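# Fit the additive specification and store its fitted values.
df['predict'] = smf.ols(formula='Y ~ D + S', data=df).fit().predict()
# Select several columns with a list; a bare tuple of names is not supported by recent pandas.
df.groupby(['D', 'S'])[['Y', 'predict']].mean()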

D,S,Y,predict
0.0,1.0,1.969594,1.704295
0.0,2.0,5.998332,6.173057
0.0,3.0,10.013828,10.64182
1.0,1.0,4.007653,4.401916
1.0,2.0,8.001473,8.870678
1.0,3.0,14.032471,13.339441


# TODO: get the code to create the dummies for the variable values to estimate the fully saturated model.
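
The additive specification `Y ~ D + S` does not reproduce the conditional means of `Y` within each cell of `D` and `S`. A fully saturated model, with one dummy per combination of `D` and `S`, does. The following is only a sketch under stated assumptions: the data are simulated here rather than drawn with `get_sample_demonstration_1`, and the fit uses plain NumPy least squares. With the statsmodels formula interface, a specification such as `Y ~ C(D) * C(S)` should yield the same fitted values.

```python
import numpy as np

rng = np.random.default_rng(123)
n = 10_000

# Hypothetical data (not the notebook's sample): D is binary, S is in
# {1, 2, 3}, and the effect of D on Y grows with S.
d = rng.integers(0, 2, size=n)
s = rng.integers(1, 4, size=n)
y = 2.0 * s + d * s + rng.normal(size=n)

# Saturated model: one dummy per (D, S) cell and no separate intercept.
cells = [(dv, sv) for dv in (0, 1) for sv in (1, 2, 3)]
X_sat = np.column_stack(
    [((d == dv) & (s == sv)).astype(float) for dv, sv in cells]
)
beta_sat, *_ = np.linalg.lstsq(X_sat, y, rcond=None)

# Additive model for comparison: constant, D, and S entered linearly.
X_add = np.column_stack([np.ones(n), d, s])
beta_add, *_ = np.linalg.lstsq(X_add, y, rcond=None)
pred_add = X_add @ beta_add

# Sample mean of Y and mean additive prediction within each cell.
cell_means = np.array([y[(d == dv) & (s == sv)].mean() for dv, sv in cells])
add_means = np.array(
    [pred_add[(d == dv) & (s == sv)].mean() for dv, sv in cells]
)

for (dv, sv), m, b, a in zip(cells, cell_means, beta_sat, add_means):
    print(f"D={dv} S={sv}  mean: {m:.3f}  saturated: {b:.3f}  additive: {a:.3f}")
```

Each coefficient of the saturated model equals the sample mean of `Y` in its cell, whereas the additive predictions miss the cell means whenever the effect of `D` varies with `S`.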


## Regression models and omitted-variable bias

<img src="material/omitted-variable-bias.png" height=300 width=300 />
