In [1]:
import numpy as np
import pandas as pd

### Generate Data

In [3]:
N = 1000000
n = 1

p_treated = 0.5
treatment = np.random.binomial(n, p_treated, size=N)

p_recovery = (0.5 + treatment) / 2. # = 0.25 + 0.5 * treatment 
# If treatment_i = 1, p_recovery_i = .75 and if treatment_i = 0, p_recovery_i = .25
recovery = np.random.binomial(n, p_recovery)

X = pd.DataFrame({'treatment': treatment, 
                  'recovery': recovery})[['treatment', 'recovery']]
X.head()

,treatment,recovery
0,1,0
1,1,0
2,1,0
3,0,0
4,0,1


Treatment causes recovery directly, so the effect of treatment on recovery is just the coefficient on treatment in the data generating process, 0.5. That is the true average treatment effect (ATE). Let's see how the naive estimator does. Treatment is assigned at random, so the naive estimator should be unbiased for the ATE!

We can estimate $E[Y|D=1]$ and $E[Y|D=0]$ using a quick trick. The groupby operation is handy on discrete data:

In [4]:
X.groupby('treatment').mean()

treatment,recovery
0,0.250535
1,0.749271
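
To make explicit what the groupby is doing, here is a small sketch on a hand-made toy frame (the `toy` data is invented for illustration and is not drawn from the simulation above). Each group mean is just the sample average of `recovery` over the rows with that treatment value, i.e. the empirical estimate of $E[Y|D=d]$.

```python
import pandas as pd

# Hypothetical toy data, chosen so the means are easy to check by hand.
toy = pd.DataFrame({'treatment': [0, 0, 0, 0, 1, 1, 1, 1],
                    'recovery':  [0, 0, 0, 1, 1, 1, 1, 0]})

# groupby + mean gives one conditional mean per treatment level.
group_means = toy.groupby('treatment')['recovery'].mean()

# The same quantities computed with explicit boolean masks.
mean_control = toy.loc[toy['treatment'] == 0, 'recovery'].mean()  # estimate of E[Y|D=0]
mean_treated = toy.loc[toy['treatment'] == 1, 'recovery'].mean()  # estimate of E[Y|D=1]

print(group_means)
print(mean_control, mean_treated)
```

The two approaches agree row for row; the groupby version simply scales better when the treatment takes more than two values.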


### Estimate the ATE

#### Since we're using random assignment, we can assume $Y^1$ and $Y^0$ are independent of $D$, so conditioning on $D$ does not change their expectations. Then 
#### $E[\delta] = E[Y^1] - E[Y^0] = E[Y^1|D=1] - E[Y^0|D=0] = E[Y|D=1] - E[Y|D=0]$
#### and the naive estimator is unbiased.

#### First, we can estimate the conditionals,

In [5]:
X.groupby(('treatment')).mean()[['recovery']]

treatment,recovery
0,0.250535
1,0.749271


#### and then take the difference at each level of the treatment:

In [6]:
X.groupby(('treatment')).mean()[['recovery']].values[1] - X.groupby(('treatment')).mean()[['recovery']].values[0]

array([0.49873609])

#### So we get
#### $E[Y^1|D=1] \approx 0.75$
#### $E[Y^0|D=0] \approx 0.25$
#### and the difference is approximately 0.5, the result we can read off from the data generating process!
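
Since the same difference in means is computed several times in this notebook, it may be convenient to wrap it in a function. The sketch below is one possible way to do that; the name `naive_ate`, its arguments, and the fixed seed are assumptions introduced here, not part of the original cells. It re-simulates the randomized design with NumPy's `default_rng` so that the result is reproducible.

```python
import numpy as np
import pandas as pd

def naive_ate(df, treatment='treatment', outcome='recovery'):
    """Difference in mean outcome between treated (1) and control (0) rows.

    Hypothetical helper: assumes the treatment column is coded 0/1.
    """
    means = df.groupby(treatment)[outcome].mean()
    return means.loc[1] - means.loc[0]

# Re-simulate the randomized design; the seed is an assumption added
# here for reproducibility.
rng = np.random.default_rng(0)
N = 1_000_000
treatment = rng.binomial(1, 0.5, size=N)
recovery = rng.binomial(1, 0.25 + 0.5 * treatment)
df = pd.DataFrame({'treatment': treatment, 'recovery': recovery})

estimate = naive_ate(df)
print(estimate)  # should land close to the true ATE of 0.5
```

Computing the grouped means once and indexing them by treatment level avoids running the groupby twice.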

### How can we break it? That is, how can we make the naive estimator biased for the ATE?

#### First, let's try adding extra noise to the recovery (compare this with the process above):

In [7]:
n = 1

p_contaminated = 0.5
contaminated = np.random.binomial(n, p_contaminated, size=N)

p_treated = 0.5
treatment = np.random.binomial(n, p_treated, size=N)

p_recovery = (0.5 + treatment - 0.5 * contaminated) / 2. # = 0.25 + 0.5 * treatment - 0.25 * contaminated
# If treatment = 1 and contaminated = 1, then p_recovery = 0.5
# If treatment = 1 and contaminated = 0, then p_recovery = 0.75
# If treatment = 0 and contaminated = 1, then p_recovery = 0
# If treatment = 0 and contaminated = 0, then p_recovery = 0.25
recovery = np.random.binomial(n, p_recovery)

X = pd.DataFrame({'treatment': treatment, 
                  'recovery': recovery,
                  'contaminated': contaminated})[['treatment', 'recovery', 'contaminated']]
X.head()

,treatment,recovery,contaminated
0,1,1,0
1,0,1,0
2,0,0,1
3,1,1,1
4,0,0,0


#### Then, the conditionals change!

In [9]:
X.groupby(('treatment')).mean()[['recovery']]

treatment,recovery
0,0.123863
1,0.623976


### But the treatment effect stays the same!!!

In [10]:
X.groupby(('treatment')).mean()[['recovery']].values[1] - X.groupby(('treatment')).mean()[['recovery']].values[0]

array([0.50011238])

#### The chance that someone in the treatment group is contaminated is the same as for someone in the control group, so, from
#### p_recovery = (0.5 + treatment - 0.5 * contaminated) / 2 # = 0.25 + 0.5 * treatment - 0.25 * contaminated

#### we can see

#### $E[Y^1|D=1] = E[Y^1|D=0]$
#### and
#### $E[Y^0|D=1] = E[Y^0|D=0]$
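
The balance claim can be checked empirically. The sketch below re-simulates `treatment` and `contaminated` as independent draws, as in the cell above, and compares the contamination rate across treatment groups; the seed and the variable name `contamination_by_group` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed is an assumption, for reproducibility
N = 1_000_000

contaminated = rng.binomial(1, 0.5, size=N)
treatment = rng.binomial(1, 0.5, size=N)  # drawn independently of contaminated
df = pd.DataFrame({'treatment': treatment, 'contaminated': contaminated})

# Share of contaminated units within each treatment group.
contamination_by_group = df.groupby('treatment')['contaminated'].mean()
print(contamination_by_group)  # both rates should be close to 0.5
```

Because the two groups are contaminated at the same rate, the `- 0.25 * contaminated` term shifts both conditional means down by the same amount and cancels in the difference.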

#### We need to make treatment and contamination associated with each other to create bias! In other words, treatment and outcome have to share a common cause! (What would this causal graph look like?)

In [16]:
n = 1

p_contaminated = 0.7
contaminated = np.random.binomial(n, p_contaminated, size=N)

p_treated = (0.5 + 0.5 * contaminated) / 2.
# If contaminated = 1, p_treated = 0.5
# If contaminated = 0, p_treated = 0.25
treatment = np.random.binomial(n, p_treated)

p_recovery = (0.5 + treatment - 0.5 * contaminated) / 2. # = 0.25 + 0.5 * treatment - 0.25 * contaminated
# Same conditionals as in the last example
recovery = np.random.binomial(n, p_recovery)

X = pd.DataFrame({'treatment': treatment, 
                  'recovery': recovery,
                  'contaminated': contaminated})[['treatment', 'recovery', 'contaminated']]
X.head()

,treatment,recovery,contaminated
0,1,0,1
1,1,0,1
2,1,0,1
3,1,1,1
4,1,1,1


#### We can estimate the conditionals again,

In [17]:
X.groupby(('treatment')).mean()[['recovery']]

treatment,recovery
0,0.097661
1,0.544997


#### and the naive estimator, which is now biased!

In [18]:
X.groupby(('treatment')).mean()[['recovery']].values[1] - X.groupby(('treatment')).mean()[['recovery']].values[0]

array([0.44733593])

#### The naive estimate (about 0.447) now understates the true ATE of 0.5 by roughly 10%!
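
The size of this bias can also be worked out directly from the data generating process, without simulation. The sketch below uses only the probabilities from the cell above (0.7, 0.5, 0.25, and the `p_recovery` formula); the variable names and the helper `p_recovery(t, c)` are introduced here for illustration. It applies Bayes' rule to get the contamination rate within each treatment group, then averages the recovery probability over contamination status.

```python
# Parameters copied from the data generating process above.
p_c = 0.7                          # P(contaminated = 1)
p_t_given_c = {1: 0.5, 0: 0.25}    # P(treatment = 1 | contaminated = c)

def p_recovery(t, c):
    """Recovery probability for treatment t and contamination c."""
    return 0.25 + 0.5 * t - 0.25 * c

# Marginal probability of treatment.
p_t1 = p_c * p_t_given_c[1] + (1 - p_c) * p_t_given_c[0]

# Bayes' rule: contamination rate within each treatment group.
p_c_given_t1 = p_c * p_t_given_c[1] / p_t1
p_c_given_t0 = p_c * (1 - p_t_given_c[1]) / (1 - p_t1)

# E[Y | D = t], averaging over contamination status within each group.
e_y_t1 = p_c_given_t1 * p_recovery(1, 1) + (1 - p_c_given_t1) * p_recovery(1, 0)
e_y_t0 = p_c_given_t0 * p_recovery(0, 1) + (1 - p_c_given_t0) * p_recovery(0, 0)

naive = e_y_t1 - e_y_t0   # what the naive estimator converges to
bias = naive - 0.5        # relative to the true ATE of 0.5
print(e_y_t1, e_y_t0, naive, bias)
```

The treated group is contaminated more often than the control group (roughly 82% versus 61%), so its mean recovery is pulled down further, and the difference in means falls short of 0.5 by about 0.05.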