# Regression

A regression coefficient describes an association in observational data.
Consider the model y = a + bx + cz. For a fixed value of z, a unit increase in x is associated with an increase of b in y.
But we cannot say, in general, that changing x causes y to increase by b. That is the difference between observation ("seeing") and intervention ("doing"); only the latter is a causal relationship. How do we know whether a regression coefficient is a causal effect? It depends on what z is. If z is a confounder (and there are no other unobserved confounders), then b estimates the causal effect of x on y. If z is a collider, b is not a causal effect.
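The gap between "seeing" and "doing" can be made concrete with a small simulation. The sketch below is illustrative only: the true effect of 1.5, the seed, the sample size, and the helper `slopes` are assumptions introduced here, and plain NumPy least squares stands in for `ols` so that the block is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 100_000


def slopes(y, *regressors):
    """Least-squares coefficients of y on the regressors (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]


# Seeing: z drives both x and y, and x has an assumed true effect of 1.5 on y.
z = rng.normal(size=n)
x_obs = 2 * z + rng.normal(size=n)
y_obs = 1.5 * x_obs + 3 * z + rng.normal(size=n)

# Doing: we set x ourselves, so it no longer depends on z.
x_do = rng.normal(size=n)
y_do = 1.5 * x_do + 3 * z + rng.normal(size=n)

b_seeing = slopes(y_obs, x_obs)[0]       # biased: mixes the effect of x with that of z
b_adjusted = slopes(y_obs, x_obs, z)[0]  # close to 1.5 once z is held fixed
b_doing = slopes(y_do, x_do)[0]          # close to 1.5 under intervention

print(b_seeing, b_adjusted, b_doing)
```

Under these assumptions the unadjusted observational slope is well above 1.5, while both the z-adjusted slope and the interventional slope recover it.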


In [2]:
import numpy as np
import matplotlib.pyplot as plt
import pandas

from statsmodels.formula.api import ols

import warnings
warnings.filterwarnings('ignore')

## Confounder
Let us simulate a data-generating process. The first case is a common cause z of the two variables x and y. If we control for the confounding variable z, then x and y become independent.

Note that if z is a mediator between x and y (with no direct path from x to y) and we control for z, then x and y also become independent. Statistics alone cannot tell us which one is the right model.
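The remark about the mediator can be checked with a pure chain x->z->y. This is a minimal sketch with assumed coefficients (2 and 3), a fixed seed, and a hypothetical helper `slopes` built on NumPy least squares rather than `ols`. After adjusting for z, the coefficient of x vanishes, which is the same regression signature a fork produces.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed so the sketch is reproducible
n = 100_000


def slopes(y, *regressors):
    """Least-squares coefficients of y on the regressors (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]


# Chain (pure mediation): x -> z -> y, with no direct path from x to y.
x = rng.normal(size=n)
z = 2 * x + rng.normal(size=n)
y = 3 * z + rng.normal(size=n)

chain_unadjusted = slopes(y, x)[0]   # roughly 2 * 3 = 6, all transmitted through z
chain_adjusted = slopes(y, x, z)[0]  # roughly 0 once z is held fixed

print(chain_unadjusted, chain_adjusted)
```

In the chain, the unadjusted coefficient is the causal effect and the adjusted one is not; in a fork it is the other way round. The regressions alone do not say which reading applies.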

In [3]:
# Fork (Confounder): x<-z->y
z=np.random.normal(size=10000)
x=2*z + np.random.normal(size=10000)
y=3*z + np.random.normal(size=10000)
data = pandas.DataFrame({'x': x, 'y': y, 'z': z})

mod7 = ols("y~x", data).fit()
print(mod7.params)
mod8 = ols("y~x+z", data).fit()
print(mod8.params)

Intercept    0.006435
x            1.196264
dtype: float64
Intercept    0.009075
x            0.023401
z            2.938442
dtype: float64
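
The unadjusted coefficient of about 1.2 is what the simulated equations imply. As a short derivation, assuming z and the two noise terms are independent with unit variance (as `np.random.normal` draws them) and writing ε_x for the noise added to x (notation introduced here):

```latex
\hat{b}_{y \sim x} \;\approx\; \frac{\operatorname{Cov}(x, y)}{\operatorname{Var}(x)}
= \frac{2 \cdot 3 \cdot \operatorname{Var}(z)}{2^{2}\,\operatorname{Var}(z) + \operatorname{Var}(\varepsilon_x)}
= \frac{6}{4 + 1} = 1.2
```

This agrees with the fitted value 1.196 up to sampling noise, even though x has no causal effect on y in this model.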


## Mediator
The second case is a mediator z between x and y, to which we also add a direct path from x to y.
If we control for z, we can measure the direct causal effect of x on y. Otherwise we measure the total effect (direct + indirect) of x on y.

In [6]:
#Mediator (direct and indirect causal impact): x->z->y and x->y
x=np.random.normal(size=10000)
z = 2*x + np.random.normal(size=10000)
y = 3*z + 5*x
data = pandas.DataFrame({'x': x, 'y': y, 'z': z})
mod5 = ols("y~x", data).fit()
print(mod5.params)
mod6 = ols("y~x+z", data).fit()
print(mod6.params)

Intercept    -0.012600
x            10.982178
dtype: float64
Intercept    9.298118e-16
x            5.000000e+00
z            3.000000e+00
dtype: float64
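
The total effect of about 11 decomposes into the direct effect plus the product of the coefficients along the indirect path. The sketch below repeats the same structural equations with a fixed seed; the helper `slopes` and the variable names are assumptions introduced here, and NumPy least squares stands in for `ols`.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed so the sketch is reproducible
n = 100_000


def slopes(y, *regressors):
    """Least-squares coefficients of y on the regressors (intercept dropped)."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1:]


# Same structure as the mediator cell: x -> z -> y and x -> y.
x = rng.normal(size=n)
z = 2 * x + rng.normal(size=n)
y = 3 * z + 5 * x

path_xz = slopes(z, x)[0]          # x -> z, roughly 2
direct, path_zy = slopes(y, x, z)  # direct x -> y (5) and z -> y (3)
total = slopes(y, x)[0]            # roughly 5 + 2 * 3 = 11

indirect = path_xz * path_zy
print(total, direct + indirect)    # the two agree up to floating-point error
```

For least-squares fits on the same sample, the identity total = direct + indirect holds exactly, not just on average.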


## Collider

The third case is a common effect z of the two variables x and y. The variables x and y are independent. If we control for the variable z, then x and y become dependent.
This shows that controlling for more variables is not always better: adjusting for a collider introduces bias.

In [8]:
#Collider:  x->z<-y. Controlling for Z introduces bias
x=np.random.normal(size=10000)
y=np.random.normal(size=10000)
z=x+y
data = pandas.DataFrame({'x': x, 'y': y, 'z': z})
mod1 = ols("y~x", data).fit()
print(mod1.params)
mod2 = ols("y~x+z", data).fit()
print(mod2.params)
mod3 = ols("z~x+y", data).fit() # The regression coefficient of z with respect to x barely changes if we add/remove y,
                                # since x and y are independent
print(mod3.params)
mod4 = ols("z~x", data).fit() # x coefficient matches mod3 up to sampling noise
print(mod4.params)

Intercept   -0.001684
x            0.020741
dtype: float64
Intercept   -2.571728e-16
x           -1.000000e+00
z            1.000000e+00
dtype: float64
Intercept    8.413409e-17
x            1.000000e+00
y            1.000000e+00
dtype: float64
Intercept   -0.001684
x            1.020741
dtype: float64
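
The coefficients of exactly -1 and 1 in the second fit follow from z = x + y having no noise, so y = z - x holds exactly. Conditioning on a collider does not require a regression: simply selecting rows by z has the same effect. The sketch below assumes an arbitrary cutoff of z > 1 and a fixed seed, both chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed so the sketch is reproducible
n = 100_000

# Collider: x -> z <- y, with x and y independent by construction.
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y

corr_all = np.corrcoef(x, y)[0, 1]  # near 0 in the full sample

# Selecting on the collider (assumed cutoff z > 1) is a form of conditioning on z.
selected = z > 1.0
corr_selected = np.corrcoef(x[selected], y[selected])[0, 1]  # clearly negative

print(corr_all, corr_selected)
```

Within the selected rows, a large x makes a large y less necessary to clear the cutoff, which produces the negative correlation.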


