# Understanding Conditional Independence and d-separation

If we have the right causal graph for a system, we should be able to read off the conditional independence relationships. How can we check them? Regression!

If we fit a regression for $Y$ using $X$ as a regressor by minimizing mean-squared error loss, and the regression model is the right model for the underlying conditional expectation, then the regression prediction is an estimator for E[Y|X]. Let's look at an example.

In [1]:
import numpy as np
import pandas as pd

N = 10000

d = np.random.binomial(1, p=0.5, size=N)
y = np.random.normal(d)

df = pd.DataFrame({'D': d, 'Y': y})

Questions:

1. What is E[Y|D=1]? E[Y|D=0]?
2. How do we get the same estimates from a regression?

In [4]:
# First question, we can groupby D, and average Y.

df.groupby('D').mean()

D,Y
0,0.017549
1,0.997722


In [6]:
from statsmodels.api import OLS

model = OLS(df['Y'], df['D'])
result = model.fit()
result.summary()

Dep. Variable:,Y,R-squared:,0.331
Model:,OLS,Adj. R-squared:,0.331
Method:,Least Squares,F-statistic:,4950.0
Date:,"Thu, 20 Sep 2018",Prob (F-statistic):,0.0
Time:,13:29:39,Log-Likelihood:,-14233.0
No. Observations:,10000,AIC:,28470.0
Df Residuals:,9999,BIC:,28480.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
D,0.9977,0.014,70.357,0.000,0.970,1.026

Omnibus:,2.114,Durbin-Watson:,1.995
Prob(Omnibus):,0.348,Jarque-Bera (JB):,2.123
Skew:,0.023,Prob(JB):,0.346
Kurtosis:,2.946,Cond. No.,1.0


In [13]:
# This should be E[Y|D=1]

result.predict(1.)

array([0.99772241])

In [14]:
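# This should be E[Y|D=0]. The model above has no intercept, so this
# prediction is exactly 0 by construction (the group mean was 0.017549).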

result.predict(0.)

array([0.])
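
The fit above has no intercept term, so its prediction at D=0 cannot match the group mean exactly. A minimal sketch of the same comparison with an intercept column follows. It uses NumPy's `np.linalg.lstsq` rather than statsmodels, and the seed and variable names are illustrative assumptions rather than part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n = 10_000

d = rng.binomial(1, 0.5, size=n)
y = rng.normal(loc=d)  # Y ~ Normal(D, 1)

# Design matrix with a column of ones, so the model is Y = a + b*D.
X = np.column_stack([np.ones(n), d])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)

pred_d0 = a      # regression estimate of E[Y|D=0]
pred_d1 = a + b  # regression estimate of E[Y|D=1]

mean_d0 = y[d == 0].mean()  # group-by estimate of E[Y|D=0]
mean_d1 = y[d == 1].mean()  # group-by estimate of E[Y|D=1]

print("D=0:", pred_d0, mean_d0)
print("D=1:", pred_d1, mean_d1)
```

With a binary regressor and an intercept, the least-squares fit reproduces both group means. In statsmodels, `statsmodels.api.add_constant` adds the same column of ones.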

## What happens when we add a variable Z, with different relationships to Y and D?

The three variables D, Y, and Z can form a chain, a fork, or a collider. It'll be interesting to see which of these give dependence and which don't.

Question:

How can you tell if the regression's prediction depends on one of the regressors?

### First, let's make a chain with $Z$ in the middle.

In [18]:
d = np.random.normal(size=N)
z = d + np.random.normal(size=N)
y = z + np.random.normal(size=N)


df = pd.DataFrame({'D': d, 'Z': z, 'Y': y})


Everything is statistically dependent here:

In [19]:
df.corr()

,D,Z,Y
D,1.0,0.711558,0.585688
Z,0.711558,1.0,0.819372
Y,0.585688,0.819372,1.0


You can see the dependence between D and Y from the regression coefficient, too (it's non-zero):

In [20]:
model = OLS(df['Y'], df['D'])
result = model.fit()
result.summary()

Dep. Variable:,Y,R-squared:,0.343
Model:,OLS,Adj. R-squared:,0.343
Method:,Least Squares,F-statistic:,5221.0
Date:,"Thu, 20 Sep 2018",Prob (F-statistic):,0.0
Time:,14:10:41,Log-Likelihood:,-17547.0
No. Observations:,10000,AIC:,35100.0
Df Residuals:,9999,BIC:,35100.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
D,1.0196,0.014,72.257,0.000,0.992,1.047

Omnibus:,0.898,Durbin-Watson:,1.976
Prob(Omnibus):,0.638,Jarque-Bera (JB):,0.922
Skew:,-0.003,Prob(JB):,0.631
Kurtosis:,2.953,Cond. No.,1.0


and likewise with Y and Z:

In [21]:
model = OLS(df['Y'], df['Z'])
result = model.fit()
result.summary()

Dep. Variable:,Y,R-squared:,0.671
Model:,OLS,Adj. R-squared:,0.671
Method:,Least Squares,F-statistic:,20430.0
Date:,"Thu, 20 Sep 2018",Prob (F-statistic):,0.0
Time:,14:11:13,Log-Likelihood:,-14083.0
No. Observations:,10000,AIC:,28170.0
Df Residuals:,9999,BIC:,28170.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Z,1.0064,0.007,142.932,0.000,0.993,1.020

Omnibus:,0.468,Durbin-Watson:,1.995
Prob(Omnibus):,0.791,Jarque-Bera (JB):,0.432
Skew:,0.003,Prob(JB):,0.806
Kurtosis:,3.032,Cond. No.,1.0


But watch what happens when we condition on Z and D together, to estimate E[Y|D, Z]:

In [23]:
model = OLS(df['Y'], df[['Z', 'D']])
result = model.fit()
result.summary()

Dep. Variable:,Y,R-squared:,0.671
Model:,OLS,Adj. R-squared:,0.671
Method:,Least Squares,F-statistic:,10210.0
Date:,"Thu, 20 Sep 2018",Prob (F-statistic):,0.0
Time:,14:12:11,Log-Likelihood:,-14083.0
No. Observations:,10000,AIC:,28170.0
Df Residuals:,9998,BIC:,28180.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
Z,1.0017,0.010,99.956,0.000,0.982,1.021
D,0.0093,0.014,0.658,0.511,-0.019,0.037

Omnibus:,0.47,Durbin-Watson:,1.995
Prob(Omnibus):,0.79,Jarque-Bera (JB):,0.434
Skew:,0.003,Prob(JB):,0.805
Kurtosis:,3.032,Cond. No.,2.64


The coefficient on D went to zero (0.0093, well within its standard error of 0.014)! How could we have known? Since D --> Z --> Y, Z d-separates D and Y. That implies that D and Y are conditionally independent given Z. If they're conditionally independent, then E[Y|Z, D] = E[Y|Z], and so the coefficient on D should go to zero.

Conditional independence is different from independence. From the previous regressions, it's clear that the coefficient on D when we regress Y on D alone is non-zero, and they're statistically dependent.
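
The contrast between the marginal and the conditional coefficient can be reproduced in a few lines. This is a sketch under stated assumptions: it uses NumPy least squares with an intercept instead of statsmodels, and the helper name `ols_coefs` and the seed are hypothetical choices made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)  # illustrative seed
n = 10_000

# Chain: D --> Z --> Y
d = rng.normal(size=n)
z = d + rng.normal(size=n)
y = z + rng.normal(size=n)

def ols_coefs(target, *regressors):
    """Least-squares slopes for target on the regressors plus an intercept."""
    X = np.column_stack([np.ones(len(target)), *regressors])
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coefs[1:]  # drop the intercept

(coef_d_alone,) = ols_coefs(y, d)            # estimates the slope in E[Y|D]
coef_z, coef_d_given_z = ols_coefs(y, z, d)  # estimates the slopes in E[Y|Z, D]

print("Y ~ D     : D coefficient =", round(coef_d_alone, 3))
print("Y ~ Z + D : Z coefficient =", round(coef_z, 3),
      ", D coefficient =", round(coef_d_given_z, 3))
```

The coefficient on D should sit near 1 when D is the only regressor and near 0 once Z is included.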

Questions:

1. If we want to estimate the effect of D on Y, should we add Z to the regression or not?

2. What does this say about machine learning, where we might take the approach where we regress on everything that precedes Y in time?


### Now, let's look at the fork, where Z is in the middle.

We can generate some data where D <-- Z --> Y


In [24]:
z = np.random.normal(size=N)
d = z + np.random.normal(size=N)
y = z + np.random.normal(size=N)

df = pd.DataFrame({'D': d, 'Z': z, 'Y': y})

Notice they're all statistically dependent again.

In [25]:
df.corr()

,D,Z,Y
D,1.0,0.707687,0.499738
Z,0.707687,1.0,0.702741
Y,0.499738,0.702741,1.0


If we regress Y on D, we can see the coefficient is non-zero, so D and Y are statistically dependent.

In [26]:
model = OLS(df['Y'], df[['D']])
result = model.fit()
result.summary()

Dep. Variable:,Y,R-squared:,0.25
Model:,OLS,Adj. R-squared:,0.25
Method:,Least Squares,F-statistic:,3328.0
Date:,"Thu, 20 Sep 2018",Prob (F-statistic):,0.0
Time:,14:23:27,Log-Likelihood:,-16166.0
No. Observations:,10000,AIC:,32330.0
Df Residuals:,9999,BIC:,32340.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
D,0.4989,0.009,57.690,0.000,0.482,0.516

Omnibus:,0.311,Durbin-Watson:,2.009
Prob(Omnibus):,0.856,Jarque-Bera (JB):,0.295
Skew:,-0.012,Prob(JB):,0.863
Kurtosis:,3.009,Cond. No.,1.0


Now, if we add Z to the regression, we get:

In [31]:
model = OLS(df['Y'], df[['D', 'Z']])
result = model.fit()
result.summary()

Dep. Variable:,Y,R-squared:,0.502
Model:,OLS,Adj. R-squared:,0.502
Method:,Least Squares,F-statistic:,5032.0
Date:,"Thu, 20 Sep 2018",Prob (F-statistic):,0.0
Time:,14:29:38,Log-Likelihood:,-10765.0
No. Observations:,10000,AIC:,21530.0
Df Residuals:,9998,BIC:,21550.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
D,-0.5101,0.009,-59.625,0.000,-0.527,-0.493
Z,0.5022,0.005,100.295,0.000,0.492,0.512

Omnibus:,7.031,Durbin-Watson:,2.022
Prob(Omnibus):,0.03,Jarque-Bera (JB):,6.361
Skew:,0.008,Prob(JB):,0.0416
Kurtosis:,2.878,Cond. No.,2.39


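With the fork, conditioning on Z should send the coefficient on D to zero again, since Z d-separates D and Y. (The table shown above does not show this: its coefficients of -0.5101 on D and 0.5022 on Z are what the collider data in the next section produces, so that output appears to come from a different run. The same regression on fresh fork data in the estimation section at the end gives -0.0053 on D.) In this case, D doesn't cause Y. If we set D by intervention and used this regression to predict Y, we'd get the correct answer for the expected value of Y: there's no effect! In the chain example, the same regression gives the wrong answer, because conditioning on Z blocks the effect of D on Y.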
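
Another way to look at conditional independence in the fork is through a partial correlation: remove the part of D and of Y that Z explains, then correlate what is left. The sketch below assumes a linear fit with an intercept; the helper name `residualize` and the seed are hypothetical choices for this example.

```python
import numpy as np

rng = np.random.default_rng(2)  # illustrative seed
n = 10_000

# Fork: D <-- Z --> Y
z = rng.normal(size=n)
d = z + rng.normal(size=n)
y = z + rng.normal(size=n)

def residualize(target, control):
    """Residual of target after a least-squares fit on control plus an intercept."""
    X = np.column_stack([np.ones(len(target)), control])
    coefs, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ coefs

# Marginal dependence between D and Y (spurious, induced by Z).
marginal_corr = np.corrcoef(d, y)[0, 1]

# Dependence between D and Y after removing what Z explains.
partial_corr = np.corrcoef(residualize(d, z), residualize(y, z))[0, 1]

print("corr(D, Y)     =", round(marginal_corr, 3))
print("corr(D, Y | Z) =", round(partial_corr, 3))
```

The marginal correlation should come out near 0.5 and the partial correlation near 0.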

### Let's look at the collider now.

We'll generate data for D --> Z <-- Y

In [32]:
d = np.random.normal(size=N)
y = np.random.normal(size=N)
z = d + y + np.random.normal(size=N)

df = pd.DataFrame({'D': d, 'Z': z, 'Y': y})

Now, we'll see that D and Y are actually independent of each other:

In [33]:
df.corr()

,D,Z,Y
D,1.0,0.576053,-0.002938
Z,0.576053,1.0,0.57261
Y,-0.002938,0.57261,1.0


We can see that by doing regression, too:

In [35]:
model = OLS(df['Y'], df[['D']])
result = model.fit()
result.summary()

Dep. Variable:,Y,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,0.08246
Date:,"Thu, 20 Sep 2018",Prob (F-statistic):,0.774
Time:,14:29:58,Log-Likelihood:,-14156.0
No. Observations:,10000,AIC:,28310.0
Df Residuals:,9999,BIC:,28320.0
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
D,-0.0029,0.010,-0.287,0.774,-0.022,0.017

Omnibus:,0.021,Durbin-Watson:,2.005
Prob(Omnibus):,0.99,Jarque-Bera (JB):,0.019
Skew:,-0.003,Prob(JB):,0.99
Kurtosis:,3.0,Cond. No.,1.0


Now, if we add Z to the regression, we'll get a non-zero coefficient on D!

In [36]:
model = OLS(df['Y'], df[['D', 'Z']])
result = model.fit()
result.summary()

Dep. Variable:,Y,R-squared:,0.494
Model:,OLS,Adj. R-squared:,0.494
Method:,Least Squares,F-statistic:,4874.0
Date:,"Thu, 20 Sep 2018",Prob (F-statistic):,0.0
Time:,14:30:50,Log-Likelihood:,-10754.0
No. Observations:,10000,AIC:,21510.0
Df Residuals:,9998,BIC:,21530.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
D,-0.4971,0.009,-57.206,0.000,-0.514,-0.480
Z,0.4977,0.005,98.732,0.000,0.488,0.508

Omnibus:,1.366,Durbin-Watson:,2.015
Prob(Omnibus):,0.505,Jarque-Bera (JB):,1.351
Skew:,0.004,Prob(JB):,0.509
Kurtosis:,3.056,Cond. No.,2.4


If we just put this data into a machine-learning model, we'd guess D has a negative effect on Y! The correct answer is that it has no effect.
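
The values near $\pm\frac{1}{2}$ follow from how the data was simulated. Here is a short derivation, assuming the independent unit-variance Gaussian terms used in the cell above and writing $\epsilon$ for the noise added to $Z$ (a symbol introduced here for convenience):

$$ W \equiv Z - D = Y + \epsilon $$
$$ E[Y|D, Z] = E[Y|D, W] = E[Y|W] = \frac{Cov(Y, W)}{Var(W)} W = \frac{1}{2}(Z - D) $$

The first equality holds because knowing $(D, Z)$ is the same as knowing $(D, W)$, the second because $(Y, W)$ is independent of $D$, and the third because $Y$ and $W$ are jointly Gaussian with mean zero, with $Cov(Y, W) = 1$ and $Var(W) = 2$. So the population coefficients are $+\frac{1}{2}$ on $Z$ and $-\frac{1}{2}$ on $D$, consistent with the fitted 0.4977 and -0.4971.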

## Summary

We can summarize these results.

#### Chain Graph, D --> Z --> Y.

Conditioning on Z blocks the effect of D on Y. Not conditioning on Z gives a correct causal prediction for the effect of D on Y.

#### Fork, D <-- Z --> Y.

Conditioning on Z again blocks the path between D and Y. Not conditioning leaves them with spurious statistical dependence. There's no directed path from D to Y, so we shouldn't find a causal effect of D on Y, and conditioning on Z is what removes the spurious dependence.

#### Collider, D --> Z <-- Y.

D and Y are statistically independent, and there is no directed path from D to Y. If we condition on Z, we find D and Y have a negative relationship, even though D has no causal effect on Y!
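
Conditioning on a collider doesn't have to happen inside a regression. Selecting rows by the value of Z has a similar effect. The sketch below illustrates this; the threshold of 1.0 and the seed are arbitrary assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(3)  # illustrative seed
n = 10_000

# Collider: D --> Z <-- Y
d = rng.normal(size=n)
y = rng.normal(size=n)
z = d + y + rng.normal(size=n)

# Over the whole sample, D and Y are (nearly) uncorrelated.
corr_all = np.corrcoef(d, y)[0, 1]

# Keep only rows with large Z: a crude way of conditioning on the collider.
selected = z > 1.0  # arbitrary illustrative threshold
corr_selected = np.corrcoef(d[selected], y[selected])[0, 1]

print("corr(D, Y), all rows     :", round(corr_all, 3))
print("corr(D, Y), rows with Z>1:", round(corr_selected, 3))
```

Within the selected rows, D and Y should show a clearly negative correlation even though they were generated independently.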

Question:

1. Can we combine these all into a criterion that tells us when to condition and when not to, to avoid spurious correlations and incorrect blocking?

Yes! The back-door criterion.

#### Definition: The Back-Door Criterion
    
A set of variables $Z$ satisfies the _back-door criterion_ relative to an ordered pair of variables $(X_i, X_j)$ in a DAG $G$ if:

1. no node in $Z$ is a descendant of $X_i$, and 
2. $Z$ blocks every path between $X_i$ and $X_j$ that contains an arrow into $X_i$.

Similarly, if $X$ and $Y$ are two disjoint subsets of nodes in $G$, then $Z$ is said to satisfy the back-door criterion relative to $(X, Y)$ if it satisfies the criterion relative to any pair $(X_i, X_j)$ such that $X_i \in X$ and $X_j \in Y$.

This criterion is the general answer to the question "What do I need to control for in order to estimate the effect of $X$ on $Y$?" Usually, we'll take $X$ and $Y$ to only contain one variable each.
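
For small graphs, the two conditions can be checked by brute force. The following is a hand-rolled sketch, not a library API: every function name is hypothetical, the graph is assumed to be a DAG given as a list of `(parent, child)` edges, and enumerating all paths is only practical for a handful of nodes.

```python
def descendants(edges, node):
    """All nodes reachable from `node` along directed edges."""
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)
    seen, stack = set(), [node]
    while stack:
        for nxt in children.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def undirected_paths(edges, start, end):
    """All simple paths from start to end, ignoring edge direction."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    def walk(path):
        if path[-1] == end:
            yield path
            return
        for nxt in neighbours.get(path[-1], ()):
            if nxt not in path:
                yield from walk(path + [nxt])

    return list(walk([start]))

def is_blocked(edges, path, cond):
    """True if the conditioning set `cond` blocks `path` (d-separation rules)."""
    edge_set = set(edges)
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        is_collider = (prev, mid) in edge_set and (nxt, mid) in edge_set
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is in cond.
            if mid not in cond and not (descendants(edges, mid) & cond):
                return True
        elif mid in cond:
            # A chain or fork node blocks when it is conditioned on.
            return True
    return False

def satisfies_backdoor(edges, x, y, cond):
    """Check both back-door conditions for the set `cond` relative to (x, y)."""
    cond = set(cond)
    if cond & descendants(edges, x):  # condition 1: no descendants of x
        return False
    edge_set = set(edges)
    for path in undirected_paths(edges, x, y):
        starts_with_arrow_into_x = (path[1], x) in edge_set
        if starts_with_arrow_into_x and not is_blocked(edges, path, cond):
            return False  # condition 2: an open back-door path remains
    return True

chain = [("D", "Z"), ("Z", "Y")]
fork = [("Z", "D"), ("Z", "Y")]
collider = [("D", "Z"), ("Y", "Z")]

for name, edges in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name,
          "| empty set:", satisfies_backdoor(edges, "D", "Y", set()),
          "| {Z}:", satisfies_backdoor(edges, "D", "Y", {"Z"}))
```

For the three graphs in this notebook, the check reports that the empty set is a valid adjustment set for the chain and the collider, and that {Z} is valid only for the fork.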

The question we haven't answered yet is "How do we use the set $Z$ to estimate $P(Y|do(X=x))$?"

#### Definition: Back-Door Adjustment

If a set of variables $Z$ satisfies the back-door criterion relative to $(X, Y)$, then the causal effect of $X$ on $Y$ is identifiable and given by the formula 

$$ P(Y|do(X=x)) = \sum_z P(Y|X=x, Z=z)P(Z=z)$$
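
For discrete variables the formula can be applied directly by counting. The sketch below uses a made-up binary system in which Z confounds X and Y; the probabilities, the seed, and the function name `adjusted` are all assumptions chosen so that the true interventional answer is known.

```python
import numpy as np

rng = np.random.default_rng(4)  # illustrative seed
n = 100_000

# Hypothetical binary system: Z --> X, Z --> Y, and X --> Y.
z = rng.binomial(1, 0.5, size=n)
x = rng.binomial(1, 0.2 + 0.6 * z)
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)

def adjusted(x_val):
    """Back-door estimate of P(Y=1|do(X=x_val)), adjusting for Z."""
    total = 0.0
    for z_val in (0, 1):
        p_z = np.mean(z == z_val)                           # P(Z=z)
        p_y = y[(x == x_val) & (z == z_val)].mean()         # P(Y=1|X=x, Z=z)
        total += p_y * p_z
    return total

naive_x1 = y[x == 1].mean()  # P(Y=1|X=1), no adjustment
adjusted_x1 = adjusted(1)
adjusted_x0 = adjusted(0)

# By construction, P(Y=1|do(X=1)) = 0.1 + 0.3 + 0.4 * 0.5 = 0.6
# and P(Y=1|do(X=0)) = 0.1 + 0.4 * 0.5 = 0.3.
print("naive P(Y=1|X=1)        :", round(naive_x1, 3))
print("adjusted P(Y=1|do(X=1)) :", round(adjusted_x1, 3))
print("adjusted P(Y=1|do(X=0)) :", round(adjusted_x0, 3))
```

The naive conditional probability overstates the effect because X=1 is more common when Z=1, while the adjusted estimates should land close to the constructed values of 0.6 and 0.3.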

Question:

1. Up to this point, we've been looking at E[Y|X,Z]. How do we use regression with the BDC?

We can take expectations of both sides by multiplying by $Y$ and summing over $Y$ on each side. $P(Z=z)$ doesn't depend on $Y$, so it factors out of the sum over $Y$ on the RHS, and we get:

$$ E[Y|do(X=x)]= \sum_z E[Y|X=x, Z=z]P(Z=z)$$

So it looks like we're taking an average of regressions over Z.

There's a cool trick we can do, using the estimator $\frac{N(Z=z)}{N}$ for $P(Z=z)$, where $N(Z=z) = \sum_{i=1}^N \delta(z=z_i)$ is the number of data points with $Z=z$ and $\delta$ is 1 when its argument holds and 0 otherwise.

$$ \hat{E}_N[Y|do(X=x)]= \sum_z \hat{E}_N[Y|X=x, Z=z]\frac{N(Z=z)}{N}$$
$$ \hat{E}_N[Y|do(X=x)]= \sum_z \sum_{i=1}^N \hat{E}_N[Y|X=x, Z=z]\frac{\delta(z=z_i)}{N}$$
$$ \hat{E}_N[Y|do(X=x)]= \frac{1}{N}\sum_{i=1}^N \hat{E}_N[Y|X=x, Z=z_i]$$

This is just the regression evaluated at each data point, and then averaged!
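
The identity behind this trick is easy to check numerically for a discrete Z. In the sketch below, `g` is a hypothetical stand-in for the fitted regression at a fixed value of X. Its functional form, the seed, and the range of Z are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)  # illustrative seed
n = 1_000
z = rng.integers(0, 4, size=n)  # a discrete Z taking values 0, 1, 2, 3

def g(z_val):
    # Hypothetical stand-in for the fitted regression E[Y|X=x, Z=z] at a fixed x.
    return 2.0 + 0.5 * z_val ** 2

# Sum over the distinct values of Z, weighted by N(Z=z)/N.
values, counts = np.unique(z, return_counts=True)
weighted_sum = np.sum(g(values) * counts / n)

# Evaluate at every data point and average.
pointwise_mean = np.mean(g(z))

print(weighted_sum, pointwise_mean)
```

The two numbers agree up to floating-point error, which is why we never need to estimate $P(Z)$ explicitly.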

## Estimating $E[Y|do(X=x)]$

Let's use the confounded example (the fork) as our test case.

#### Step 1: Establish the graph, and determine the back-door set

We know the graph, since we're simulating data. It's D <-- Z --> Y. A back-door set here is just the variable (suggestively named) Z.

#### Step 2: Collect Data (you have to measure $Z$, $D$, and $Y$!)

In [42]:
z = np.random.normal(size=N)
d = z + np.random.normal(size=N)
y = z + np.random.normal(size=N)

df = pd.DataFrame({'D': d, 'Z': z, 'Y': y})

#### Step 3: Estimate the regression conditional on the causal state and the back-door set

In [43]:
model = OLS(df['Y'], df[['D', 'Z']])
result = model.fit()
result.summary()

Dep. Variable:,Y,R-squared:,0.496
Model:,OLS,Adj. R-squared:,0.496
Method:,Least Squares,F-statistic:,4917.0
Date:,"Thu, 20 Sep 2018",Prob (F-statistic):,0.0
Time:,15:00:41,Log-Likelihood:,-14233.0
No. Observations:,10000,AIC:,28470.0
Df Residuals:,9998,BIC:,28480.0
Df Model:,2,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
D,-0.0053,0.010,-0.527,0.598,-0.025,0.014
Z,0.9981,0.014,70.161,0.000,0.970,1.026

Omnibus:,3.262,Durbin-Watson:,1.984
Prob(Omnibus):,0.196,Jarque-Bera (JB):,3.249
Skew:,-0.044,Prob(JB):,0.197
Kurtosis:,3.007,Cond. No.,2.63


#### Step 4: Fix $D$ in a copy of the data, and evaluate the predicted value of $Y$.

In [46]:
#Let's start with E[Y|do(D=0)]

df_intervention = df.copy()

df_intervention['D'] = 0
df_intervention['Y(Z, D=0)'] = result.predict(df_intervention[['D', 'Z']])

# and then do  E[Y|do(D=1)]
df_intervention['D'] = 1
df_intervention['Y(Z, D=1)'] = result.predict(df_intervention[['D', 'Z']])


#### Step 5: Do the average over the data points

In [45]:
df_intervention.mean()

D            1.000000
Z           -0.011293
Y            0.014920
Y(Z, D=0)   -0.011271
Y(Z, D=1)   -0.016605
dtype: float64

We've averaged Y(Z, D=d) over Z, so these means are estimates of E[Y|do(D=d)]. They differ by about -0.0053, which is just the fitted coefficient on D, and that is well within its standard error of 0.010. So we see no dependence of Y on D under intervention, as expected for the fork.

#### E[Y|do(D=0)] = -0.011271

#### E[Y|do(D=1)] = -0.016605


You can do this same procedure with much more interesting models and data-generating processes!
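
As one illustration, here are the five steps wrapped into a single function and applied to a confounded system where D does affect Y. This is a sketch under stated assumptions: the function name `backdoor_mean`, the effect size of 2, and the seed are hypothetical, and the regression is a linear fit with an intercept done with NumPy rather than statsmodels.

```python
import numpy as np

def backdoor_mean(d, z, y, d_value):
    """Estimate E[Y|do(D=d_value)] from a linear regression of Y on D and Z."""
    X = np.column_stack([np.ones(len(y)), d, z])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)  # Step 3: fit E[Y|D, Z]
    X_do = X.copy()
    X_do[:, 1] = d_value                           # Step 4: fix D in a copy
    return float(np.mean(X_do @ coefs))            # Step 5: average over the data

rng = np.random.default_rng(6)  # illustrative seed
n = 10_000

# Confounded system: Z --> D, Z --> Y, and D --> Y with a true effect of 2.
z = rng.normal(size=n)
d = z + rng.normal(size=n)
y = 2.0 * d + z + rng.normal(size=n)

effect = backdoor_mean(d, z, y, 1.0) - backdoor_mean(d, z, y, 0.0)
naive_slope = np.polyfit(d, y, 1)[0]  # slope of Y on D alone, ignoring Z

print("back-door estimate of the effect:", round(effect, 3))
print("naive slope of Y on D           :", round(naive_slope, 3))
```

The back-door estimate should recover a value near 2, while the unadjusted slope is pulled up toward 2.5 by the confounding through Z.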