# Post-Treatment Bias

In [1]:
%matplotlib inline
import pymc3 as pm
import numpy as np
import pandas as pd
from scipy import stats
# R-like formula interface; the standard statsmodels API is imported on the next line
import statsmodels.formula.api as smf 
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

%config InlineBackend.figure_format = 'retina'
plt.style.use(['seaborn-colorblind', 'seaborn-darkgrid'])

It is routine to worry about mistaken inferences that arise from omitting predictor variables. Such mistakes are often called OMITTED VARIABLE BIAS, and the milk example helps to illustrate it. It is much less routine to worry about mistaken inferences that arise from including variables that are consequences of other variables. We'll call this POST-TREATMENT BIAS. Being aware of post-treatment bias is important in all types of studies.


The term 'post-treatment' comes from thinking about experimental designs. Suppose, for example, that you are growing some plants in a greenhouse. You want to know the difference in growth under different anti-fungal soil treatments, because fungus on the plants tends to reduce their growth. Plants are initially seeded and then sprout. Their heights are measured. Then different soil treatments are applied. The final measures are the height of the plant and the presence of fungus. There are four variables of interest here: initial height, final height, treatment, and presence of fungus. Final height is the outcome of interest. But which of the other variables should be in the model? If your goal is to make a causal inference about the treatment, you shouldn’t include the presence of fungus, because it is a post-treatment effect.

In [22]:
# Let's simulate some data to make the example more transparent and to see
# exactly what goes wrong when we include a post-treatment variable

# number of plants
N = 100

# simulate initial heights
h0 = stats.norm.rvs(size = N, loc = 10, scale = 2)

# assign treatments and simulate fungus and growth
treatment = np.repeat([0, 1], N // 2)  # integer division: np.repeat rejects float counts
fungus = np.random.binomial(n=1, p=(0.5-treatment * 0.4), size=N)
h1 = h0 + stats.norm.rvs(size= N, loc= 5- 3*fungus, scale=1)

# compose a clean data frame
d = pd.DataFrame({'h0': h0,
                  'h1': h1,
                  'Treatment':treatment,
                  'Fungus': fungus})
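
Before fitting anything, it can help to confirm that simulated data of this kind carry the structure we intended. The block below is a self-contained sketch, not part of the original analysis: `simulate_plants` is a hypothetical helper that mirrors the simulation settings above, using NumPy's `default_rng` with an arbitrarily chosen seed, and then compares fungus rates and mean growth between the two treatment groups.

```python
import numpy as np


def simulate_plants(n_plants=100, seed=0):
    """Hypothetical helper mirroring the notebook's simulation settings."""
    rng = np.random.default_rng(seed)  # seed is an arbitrary choice
    # first half of the plants untreated (0), second half treated (1)
    treatment = np.repeat([0, 1], n_plants // 2)
    n = treatment.size
    h0 = rng.normal(loc=10, scale=2, size=n)
    # treatment lowers the probability of fungus from 0.5 to 0.1
    fungus = rng.binomial(n=1, p=0.5 - treatment * 0.4, size=n)
    # fungus lowers the expected growth from 5 to 2
    h1 = h0 + rng.normal(loc=5 - 3 * fungus, scale=1, size=n)
    return {"h0": h0, "h1": h1, "treatment": treatment, "fungus": fungus}


sim = simulate_plants()
growth = sim["h1"] - sim["h0"]
for t in (0, 1):
    mask = sim["treatment"] == t
    print(f"treatment={t}: "
          f"fungus rate={sim['fungus'][mask].mean():.2f}, "
          f"mean growth={growth[mask].mean():.2f}")
```

Treated plants should show a lower fungus rate and a higher mean growth, which is the effect the models are meant to recover.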

## A prior is born
When designing the model, it helps to pretend you don't have the data-generating process just above. In real research, you will not know the real data-generating process. But you will have a lot of scientific information to guide model construction. So let's spend some time taking this mock analysis seriously.

We know that the plants at time t=1 should be taller than at time t=0, whatever scale they are measured on. So if we put the parameters on a scale of 'proportion' of height at time t=0, rather than on the absolute scale of the data, we can set our priors more easily. To make this simpler, let's focus for now only on the height variables, ignoring the predictor variables. We may have a linear model like:

$$
\begin{aligned}
h_{1,i} &\sim \text{Normal}(\mu_i, \sigma) \\
\mu_i &= h_{0,i} \times p
\end{aligned}
$$
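
Working on the proportion scale makes a candidate prior for $p$ easy to check by simulation. The sketch below assumes, purely for illustration, a Log-Normal(0, 0.25) prior on $p$. That prior is an assumption and is not used anywhere in this notebook, where the fitted models put wide Normal priors on the coefficients instead. A log-normal keeps $p$ positive, which matches the fact that heights cannot be negative.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

# Assumed prior, for illustration only: p ~ LogNormal(0, 0.25)
p_prior = rng.lognormal(mean=0.0, sigma=0.25, size=10_000)

# 89% central interval of the prior draws
lower, upper = np.percentile(p_prior, [5.5, 94.5])
print(f"prior median of p: {np.median(p_prior):.2f}")
print(f"89% prior interval for p: [{lower:.2f}, {upper:.2f}]")

# p = 1 means no growth, p < 1 means shrinkage, p > 1 means growth
print(f"prior probability of shrinkage: {(p_prior < 1).mean():.2f}")
```

A prior like this allows for shrinking plants as well as growth, while keeping implausibly large growth factors rare.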

In [24]:
d

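| | h0 | h1 | Treatment | Fungus |
|---|---|---|---|---|
| 0 | 10.044921 | 11.596383 | 0 | 1 |
| 1 | 9.175687 | 11.352764 | 0 | 1 |
| 2 | 9.820916 | 12.563685 | 0 | 1 |
| 3 | 10.344470 | 12.808757 | 0 | 1 |
| 4 | 9.218136 | 13.993919 | 0 | 0 |
| ... | ... | ... | ... | ... |
| 95 | 8.159119 | 11.024459 | 1 | 1 |
| 96 | 7.495589 | 11.016186 | 1 | 1 |
| 97 | 13.299164 | 17.973997 | 1 | 0 |
| 98 | 11.129605 | 14.344529 | 1 | 0 |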


In [29]:
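with pm.Model() as m5_13:
    # a: expected h1/h0 ratio for an untreated, fungus-free plant
    a = pm.Normal('a', mu=0, sd=100)
    # bt, bf: change in that ratio due to treatment and to fungus
    bt = pm.Normal('bt', mu=0, sd=10)
    bf = pm.Normal('bf', mu=0, sd=10)
    # expected final height = initial height times a growth proportion
    mu = pm.Deterministic('mu', h0 * (a + bt * treatment + bf * fungus))
    sigma = pm.Uniform('sigma', lower=0, upper=10)
    # named h1_obs so the simulated h1 array is not overwritten
    h1_obs = pm.Normal('h1', mu=mu, sd=sigma, observed=d['h1'])
    trace_5_13 = pm.sample(1000, tune=1000)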

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, bf, bt, a]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [00:05<00:00, 1354.56draws/s]


In [31]:
varnames = ['a', 'bt', 'bf', 'sigma']
pm.summary(trace_5_13, varnames).round(3)

| | mean | sd | hpd_3% | hpd_97% | mcse_mean | mcse_sd | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| a | 1.462 | 0.023 | 1.419 | 1.507 | 0.001 | 0.0 | 1748.0 | 1747.0 | 1746.0 | 1931.0 | 1.0 |
| bt | 0.012 | 0.027 | -0.039 | 0.06 | 0.001 | 0.0 | 2039.0 | 1933.0 | 2040.0 | 2248.0 | 1.0 |
| bf | -0.273 | 0.028 | -0.328 | -0.222 | 0.001 | 0.0 | 1988.0 | 1967.0 | 1998.0 | 2215.0 | 1.0 |
| sigma | 1.255 | 0.09 | 1.094 | 1.431 | 0.002 | 0.001 | 2515.0 | 2510.0 | 2526.0 | 2348.0 | 1.0 |


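Looking at the results, we can see that fungus has a clearly negative effect on growth (bf ≈ -0.27), but the treatment coefficient bt is close to zero, with an interval that spans zero. That's odd, because we set up the simulation so that the treatment does improve growth.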

### Blocked by consequence
The problem is that fungus is mostly a consequence of treatment. That is to say, fungus is a post-treatment variable. So when we control for fungus, the model is implicitly answering the question: once we already know whether or not a plant developed fungus, does soil treatment matter? The answer is no, because soil treatment has its effect on growth through reducing fungus. To measure the effect of treatment properly, we should omit the post-treatment variable fungus.
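
The masking can also be seen without any sampling. The sketch below is not the PyMC3 model from this notebook: it swaps in ordinary least squares on absolute growth, fitted with `np.linalg.lstsq`, and uses a larger simulated sample so the estimates are stable. `treatment_coefficients` is a hypothetical helper, and the seed and sample size are arbitrary choices.

```python
import numpy as np


def treatment_coefficients(n_plants=10_000, seed=2):
    """Hypothetical helper: OLS treatment coefficient with and without fungus."""
    rng = np.random.default_rng(seed)
    treatment = np.repeat([0, 1], n_plants // 2)
    # same generating process as the notebook's simulation
    fungus = rng.binomial(n=1, p=0.5 - 0.4 * treatment)
    growth = rng.normal(loc=5 - 3 * fungus, scale=1.0)

    ones = np.ones(treatment.size)
    X_with = np.column_stack([ones, treatment, fungus])
    X_without = np.column_stack([ones, treatment])

    beta_with, *_ = np.linalg.lstsq(X_with, growth, rcond=None)
    beta_without, *_ = np.linalg.lstsq(X_without, growth, rcond=None)
    # index 1 is the treatment coefficient in both design matrices
    return beta_with[1], beta_without[1]


bt_with_fungus, bt_without_fungus = treatment_coefficients()
print(f"treatment coefficient, conditioning on fungus: {bt_with_fungus:.2f}")
print(f"treatment coefficient, fungus omitted:         {bt_without_fungus:.2f}")
```

Conditioning on fungus should push the treatment coefficient toward zero, while omitting fungus should leave a clearly positive coefficient.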

Let's fit the model again, this time without fungus.

In [32]:
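with pm.Model() as m5_13_b:
    # a: expected h1/h0 ratio for an untreated plant
    a = pm.Normal('a', mu=0, sd=100)
    # bt: change in that ratio due to treatment (fungus is omitted here)
    bt = pm.Normal('bt', mu=0, sd=10)
    mu = pm.Deterministic('mu', h0 * (a + bt * treatment))
    sigma = pm.Uniform('sigma', lower=0, upper=10)
    # named h1_obs so the simulated h1 array is not overwritten
    h1_obs = pm.Normal('h1', mu=mu, sd=sigma, observed=d['h1'])
    trace_5_13_b = pm.sample(1000, tune=1000)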

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, bt, a]
Sampling 4 chains, 0 divergences: 100%|██████████| 8000/8000 [00:04<00:00, 1619.70draws/s]


In [33]:
varnames = ['a', 'bt', 'sigma']
pm.summary(trace_5_13_b, varnames).round(3)

| | mean | sd | hpd_3% | hpd_97% | mcse_mean | mcse_sd | ess_mean | ess_sd | ess_bulk | ess_tail | r_hat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| a | 1.299 | 0.024 | 1.251 | 1.34 | 0.001 | 0.0 | 1810.0 | 1810.0 | 1812.0 | 2243.0 | 1.0 |
| bt | 0.137 | 0.035 | 0.072 | 0.203 | 0.001 | 0.001 | 1676.0 | 1676.0 | 1677.0 | 2336.0 | 1.0 |
| sigma | 1.771 | 0.133 | 1.528 | 2.014 | 0.003 | 0.002 | 2436.0 | 2426.0 | 2431.0 | 2263.0 | 1.0 |


Now the impact of treatment is clearly positive (bt ≈ 0.14, with an interval that excludes zero), as it should be. It makes sense to control for pre-treatment differences, like the initial height h0, that might mask the causal influence of treatment. But including post-treatment variables can actually mask the treatment itself. This doesn’t mean the model that includes both treatment and fungus is useless. The fact that including fungus zeroes the coefficient for treatment suggests that the treatment works for exactly the anticipated reasons. It tells us about mechanism. But a correct inference about the treatment still depends upon omitting the post-treatment variable.
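
As a rough check on the size of that effect, the treatment effect implied by the simulation settings can be worked out by hand. The variable names below are illustrative, and dividing by the mean initial height of 10 is only an approximation to the proportional scale of the model, since it ignores variation in h0.

```python
# Simulation settings taken from the data-generating cell
p_fungus_control = 0.5          # fungus probability without treatment
p_fungus_treated = 0.5 - 0.4    # fungus probability with treatment
base_growth = 5.0               # expected growth of a fungus-free plant
fungus_penalty = 3.0            # expected growth lost to fungus
mean_h0 = 10.0                  # mean initial height

# Expected growth in each arm, averaging over fungus status
growth_control = base_growth - fungus_penalty * p_fungus_control
growth_treated = base_growth - fungus_penalty * p_fungus_treated

# Total effect of treatment, in height units and as a rough proportion of h0
effect_abs = growth_treated - growth_control
effect_prop = effect_abs / mean_h0

print(f"expected growth, control: {growth_control:.2f}")
print(f"expected growth, treated: {growth_treated:.2f}")
print(f"treatment effect: {effect_abs:.2f} height units, "
      f"about {effect_prop:.2f} as a proportion of mean h0")
```

An implied effect of roughly 0.12 on the proportional scale is in line with the posterior mean of bt (0.137, sd 0.035) from the model that omits fungus.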