# Understanding Conditional Independence and d-separation

If we have the right causal graph for a system, we should be able to read off the conditional independence relationships. How can we check them? Regression!

If we fit a regression for $Y$ using $X$ as a regressor by minimizing mean-squared error loss, and the regression model is the right model for the underlying conditional expectation, then the regression prediction is an estimator for E[Y|X]. Let's look at an example.

In [3]:
import numpy as np
import pandas as pd

N = 10000

d = np.random.binomial(1, p=0.5, size=N)
y = np.random.normal(d)

df = pd.DataFrame({'D': d, 'Y': y})

Questions:

1. What is E[Y|D=1]? E[Y|D=0]?
2. How do we get the same estimates from a regression?

In [4]:
# First question, we can groupby D, and average Y.

df.groupby('D').mean()

D,Y
0,-0.02493
1,1.00805


In [5]:
from statsmodels.api import OLS

model = OLS(df['Y'], df['D'])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.339
Model:,OLS,Adj. R-squared:,0.338
Method:,Least Squares,F-statistic:,5117.0
Date:,"Tue, 16 Oct 2018",Prob (F-statistic):,0.0
Time:,13:06:04,Log-Likelihood:,-14089.0
No. Observations:,10000,AIC:,28180.0
Df Residuals:,9999,BIC:,28190.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D,1.0080,0.014,71.537,0.000,0.980,1.036

0,1,2,3
Omnibus:,0.558,Durbin-Watson:,1.997
Prob(Omnibus):,0.757,Jarque-Bera (JB):,0.521
Skew:,0.003,Prob(JB):,0.771
Kurtosis:,3.035,Cond. No.,1.0


In [6]:
# This should be E[Y|D=1]

result.predict(1.)

array([1.00804969])

In [7]:
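# This should be E[Y|D=0]. The model has no intercept, so this prediction
# is exactly 0 by construction, not the sample mean for D=0 from the groupby.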

result.predict(0.)

array([0.])
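To match both group means, the regression needs an intercept. The following is a minimal sketch of that idea, not the notebook's own code: it assumes plain NumPy least squares (`np.linalg.lstsq`) in place of statsmodels, and a fresh simulated sample with an arbitrary seed.

```python
import numpy as np

# Assumption: a fresh sample drawn the same way as above, with an arbitrary seed.
rng = np.random.default_rng(0)
n = 10_000
d = rng.binomial(1, 0.5, size=n)
y = rng.normal(loc=d)

# Design matrix with a column of ones (the intercept) next to D.
X = np.column_stack([np.ones(n), d])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Group means computed directly, and the same quantities read off the regression.
group_means = np.array([y[d == 0].mean(), y[d == 1].mean()])
reg_means = np.array([beta[0], beta[0] + beta[1]])

print("group means:     ", group_means)
print("regression means:", reg_means)
```

With a binary regressor and an intercept the model is saturated, so the intercept is the estimate of E[Y|D=0] and the intercept plus the slope is the estimate of E[Y|D=1].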

## What happens when we add a variable Z, with different relationships to Y and D?

We could have a chain, a fork, or a collider formed by D, Y, and Z. It'll be interesting to see when we get dependence and when we don't.

Question:

How can you tell if the regression's prediction depends on one of the regressors?

### First, let's make a chain with $Z$ in the middle.

In [9]:
d = np.random.normal(size=N)
z = d + np.random.normal(size=N)
y = z + np.random.normal(size=N)


df = pd.DataFrame({'D': d, 'Z': z, 'Y': y})


Everything is statistically dependent here:

In [10]:
df.corr()

,D,Z,Y
D,1.0,0.706097,0.579601
Z,0.706097,1.0,0.817522
Y,0.579601,0.817522,1.0


You can see the dependence between D and Y from the regression coefficient, too (it's non-zero):

In [11]:
model = OLS(df['Y'], df['D'])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.336
Model:,OLS,Adj. R-squared:,0.336
Method:,Least Squares,F-statistic:,5058.0
Date:,"Tue, 16 Oct 2018",Prob (F-statistic):,0.0
Time:,13:10:33,Log-Likelihood:,-17637.0
No. Observations:,10000,AIC:,35280.0
Df Residuals:,9999,BIC:,35280.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D,1.0109,0.014,71.117,0.000,0.983,1.039

0,1,2,3
Omnibus:,3.993,Durbin-Watson:,1.991
Prob(Omnibus):,0.136,Jarque-Bera (JB):,4.018
Skew:,-0.044,Prob(JB):,0.134
Kurtosis:,2.955,Cond. No.,1.0


and likewise with Y and Z:

In [15]:
model = OLS(df['Y'], df['Z'])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.668
Model:,OLS,Adj. R-squared:,0.668
Method:,Least Squares,F-statistic:,20150.0
Date:,"Tue, 16 Oct 2018",Prob (F-statistic):,0.0
Time:,14:57:08,Log-Likelihood:,-14166.0
No. Observations:,10000,AIC:,28330.0
Df Residuals:,9999,BIC:,28340.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Z,1.0065,0.007,141.941,0.000,0.993,1.020

0,1,2,3
Omnibus:,0.475,Durbin-Watson:,2.016
Prob(Omnibus):,0.788,Jarque-Bera (JB):,0.475
Skew:,-0.017,Prob(JB):,0.789
Kurtosis:,2.999,Cond. No.,1.0


But watch what happens when we condition on Z and D together, to estimate E[Y|D, Z]:

In [16]:
model = OLS(df['Y'], df[['Z', 'D']])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.668
Model:,OLS,Adj. R-squared:,0.668
Method:,Least Squares,F-statistic:,10070.0
Date:,"Tue, 16 Oct 2018",Prob (F-statistic):,0.0
Time:,14:57:17,Log-Likelihood:,-14166.0
No. Observations:,10000,AIC:,28340.0
Df Residuals:,9998,BIC:,28350.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Z,1.0025,0.010,100.102,0.000,0.983,1.022
D,0.0081,0.014,0.568,0.570,-0.020,0.036

0,1,2,3
Omnibus:,0.483,Durbin-Watson:,2.016
Prob(Omnibus):,0.785,Jarque-Bera (JB):,0.482
Skew:,-0.017,Prob(JB):,0.786
Kurtosis:,2.999,Cond. No.,2.62


The coefficient on D went to zero! How could we know? Since D --> Z --> Y, Z d-separates D and Y. That implies that D and Y are conditionally independent given Z. If they're conditionally independent, then E[Y|Z, D] = E[Y|Z], and so the coefficient on D should go to zero.
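Another way to look at the same conditional independence is a partial correlation: remove the linear effect of Z from both D and Y, then correlate what is left. This is a rough sketch that assumes the linear-Gaussian chain simulated above, an arbitrary seed, and a hypothetical helper named `residualize`.

```python
import numpy as np

def residualize(target, control):
    """Return the part of `target` not explained by a linear fit on `control`."""
    X = np.column_stack([np.ones(len(control)), control])
    beta, *_ = np.linalg.lstsq(X, target, rcond=None)
    return target - X @ beta

# Assumption: the same chain D --> Z --> Y as above, with an arbitrary seed.
rng = np.random.default_rng(1)
n = 10_000
d = rng.normal(size=n)
z = d + rng.normal(size=n)
y = z + rng.normal(size=n)

# Marginal dependence between D and Y.
marginal_corr = np.corrcoef(d, y)[0, 1]

# Dependence left over once Z has been regressed out of both.
partial_corr = np.corrcoef(residualize(d, z), residualize(y, z))[0, 1]

print("corr(D, Y)     =", marginal_corr)
print("corr(D, Y | Z) =", partial_corr)
```

The marginal correlation should be clearly non-zero while the partial correlation should be close to zero. A vanishing partial correlation only implies conditional independence when the variables are jointly Gaussian, as they are here.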

Conditional independence is different from independence. From the previous regressions, it's clear that the coefficient on D is non-zero when we regress Y on D alone, so D and Y are statistically dependent even though they are conditionally independent given Z.

Questions:

1. If we want to estimate the effect of D on Y, should we add Z to the regression or not?

2. What does this say about machine learning, where we might take the approach where we regress on everything that precedes Y in time?


### Now, let's look at the fork, where Z is in the middle.

We can generate some data where D <-- Z --> Y


In [17]:
z = np.random.normal(size=N)
d = z + np.random.normal(size=N)
y = z + np.random.normal(size=N)

df = pd.DataFrame({'D': d, 'Z': z, 'Y': y})

Notice they're all statistically dependent again.

In [18]:
df.corr()

,D,Z,Y
D,1.0,0.713714,0.506361
Z,0.713714,1.0,0.703382
Y,0.506361,0.703382,1.0


If we regress Y on D, we can see the coefficient is non-zero, so D and Y are statistically dependent.

In [19]:
model = OLS(df['Y'], df[['D']])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.256
Model:,OLS,Adj. R-squared:,0.256
Method:,Least Squares,F-statistic:,3447.0
Date:,"Tue, 16 Oct 2018",Prob (F-statistic):,0.0
Time:,15:12:16,Log-Likelihood:,-16147.0
No. Observations:,10000,AIC:,32300.0
Df Residuals:,9999,BIC:,32300.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D,0.4985,0.008,58.709,0.000,0.482,0.515

0,1,2,3
Omnibus:,2.932,Durbin-Watson:,1.985
Prob(Omnibus):,0.231,Jarque-Bera (JB):,3.002
Skew:,0.016,Prob(JB):,0.223
Kurtosis:,3.079,Cond. No.,1.0


Now, if we add Z to the regression, we get

In [20]:
model = OLS(df['Y'], df[['D', 'Z']])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.495
Model:,OLS,Adj. R-squared:,0.495
Method:,Least Squares,F-statistic:,4895.0
Date:,"Tue, 16 Oct 2018",Prob (F-statistic):,0.0
Time:,15:12:31,Log-Likelihood:,-14215.0
No. Observations:,10000,AIC:,28430.0
Df Residuals:,9998,BIC:,28450.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D,0.0086,0.010,0.864,0.388,-0.011,0.028
Z,0.9738,0.014,68.689,0.000,0.946,1.002

0,1,2,3
Omnibus:,0.558,Durbin-Watson:,1.974
Prob(Omnibus):,0.756,Jarque-Bera (JB):,0.526
Skew:,0.013,Prob(JB):,0.769
Kurtosis:,3.024,Cond. No.,2.66


The coefficient for D goes to zero again! In this case, D doesn't cause Y, so zero is the correct answer: if we used this regression to predict Y after changing D by intervention, it would correctly predict no change in the expected value of Y. In the previous example (the chain graph), the same approach gives the wrong answer, because conditioning on Z blocks the effect of D on Y.
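Since we control the simulation, we can check these claims by intervening directly: set D by hand, regenerate everything downstream of it, and compare the mean of Y. This is a sketch under the same structural equations as above; the function names and the seed are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

def mean_y_fork(do_d):
    """Fork D <-- Z --> Y: the intervention cuts the arrow Z --> D."""
    z = rng.normal(size=n)
    d = np.full(n, do_d)  # D is set by hand and is not used to generate Y
    y = z + rng.normal(size=n)
    return y.mean()

def mean_y_chain(do_d):
    """Chain D --> Z --> Y: the intervention propagates through Z."""
    d = np.full(n, do_d)
    z = d + rng.normal(size=n)
    y = z + rng.normal(size=n)
    return y.mean()

fork_effect = mean_y_fork(1.0) - mean_y_fork(0.0)
chain_effect = mean_y_chain(1.0) - mean_y_chain(0.0)

print("fork:  E[Y|do(D=1)] - E[Y|do(D=0)] =", fork_effect)
print("chain: E[Y|do(D=1)] - E[Y|do(D=0)] =", chain_effect)
```

The fork should show an effect near zero, which agrees with the regression that includes Z. The chain should show an effect near one, which agrees with the regression that leaves Z out.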

### Let's look at the collider now.

We'll generate data for D --> Z <-- Y

In [21]:
d = np.random.normal(size=N)
y = np.random.normal(size=N)
z = d + y + np.random.normal(size=N)

df = pd.DataFrame({'D': d, 'Z': z, 'Y': y})

Now, we'll see that D and Y are actually independent of each other:

In [22]:
df.corr()

,D,Z,Y
D,1.0,0.569587,0.000883
Z,0.569587,1.0,0.585257
Y,0.000883,0.585257,1.0


We can see that by doing regression, too:

In [23]:
model = OLS(df['Y'], df[['D']])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.0
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,0.006645
Date:,"Tue, 16 Oct 2018",Prob (F-statistic):,0.935
Time:,15:13:23,Log-Likelihood:,-14173.0
No. Observations:,10000,AIC:,28350.0
Df Residuals:,9999,BIC:,28350.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D,0.0008,0.010,0.082,0.935,-0.019,0.021

0,1,2,3
Omnibus:,0.244,Durbin-Watson:,1.994
Prob(Omnibus):,0.885,Jarque-Bera (JB):,0.217
Skew:,-0.006,Prob(JB):,0.897
Kurtosis:,3.02,Cond. No.,1.0


Now, if we add Z to the regression, we'll get a non-zero coefficient!

In [24]:
model = OLS(df['Y'], df[['D', 'Z']])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.506
Model:,OLS,Adj. R-squared:,0.506
Method:,Least Squares,F-statistic:,5125.0
Date:,"Tue, 16 Oct 2018",Prob (F-statistic):,0.0
Time:,15:13:29,Log-Likelihood:,-10644.0
No. Observations:,10000,AIC:,21290.0
Df Residuals:,9998,BIC:,21310.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D,-0.4959,0.009,-57.566,0.000,-0.513,-0.479
Z,0.5044,0.005,101.245,0.000,0.495,0.514

0,1,2,3
Omnibus:,5.591,Durbin-Watson:,1.997
Prob(Omnibus):,0.061,Jarque-Bera (JB):,5.618
Skew:,-0.057,Prob(JB):,0.0603
Kurtosis:,2.979,Cond. No.,2.39


If we just put this data into a machine-learning model, we'd guess D has a negative effect on Y! The correct answer is that it has no effect.
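One way to see where coefficients near $-0.5$ and $0.5$ come from, assuming the simulated model above with $D$, $Y$ and the noise $\varepsilon$ independent standard normals and $Z = D + Y + \varepsilon$: once $D$ is known, observing $Z$ is the same as observing $Z - D = Y + \varepsilon$, and $D$ itself carries no further information about $Y$. For jointly Gaussian variables the conditional expectation is linear, so

$$ E[Y|D, Z] = E[Y|Y + \varepsilon = Z - D] = \frac{Var(Y)}{Var(Y) + Var(\varepsilon)}(Z - D) = \frac{1}{2}Z - \frac{1}{2}D $$

The negative coefficient on $D$ is therefore a property of conditioning on the collider, not a causal effect of $D$ on $Y$.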

## Summary

We can summarize these results.

#### Chain Graph, D --> Z --> Y.

Conditioning on Z blocks the effect of D on Y. Not conditioning on Z gives a correct causal prediction for the effect of D on Y.

#### Fork, D <-- Z --> Y.

Conditioning on Z again blocks D and Y. Not conditioning leaves them with spurious statistical dependence. There's no directed path from D to Y, and we shouldn't find a causal effect of D on Y. Conditioning on Z removes the spurious dependence.

#### Collider, D --> Z <-- Y.

D and Y are statistically independent, and there is no directed path from D to Y. If we condition on Z, we find D and Y have a negative relationship, even though D has no causal effect on Y!

Question:

1. Can we combine these all into a criterion that tells us when to condition and when not to, to avoid spurious correlations and incorrect blocking?

Yes! The back-door criterion.

#### Definition: The Back-Door Criterion
    
A set of variables $Z$ satisfies the _back-door criterion_ relative to an ordered pair of variables $(X_i, X_j)$ in a DAG (Directed Acyclic Graph) $G$ if:

1. no node in $Z$ is a descendant of $X_i$, and 
2. $Z$ blocks every path between $X_i$ and $X_j$ that contains an arrow into $X_i$.

Similarly, if $X$ and $Y$ are two disjoint subsets of nodes in $G$, then $Z$ is said to satisfy the back-door criterion relative to $(X, Y)$ if it satisfies the criterion relative to any pair $(X_i, X_j)$ such that $X_i \in X$ and $X_j \in Y$.

This criterion is the general answer to the question "What do I need to control for in order to estimate the effect of $X$ on $Y$". Usually, we'll take $X$ and $Y$ to only contain one variable each.
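For small graphs like the three above, the criterion can be checked mechanically. The sketch below is a brute-force illustration, not a library API: all function names are hypothetical, a graph is assumed to be a set of `(parent, child)` edges, and every path is enumerated, which is only practical for tiny DAGs.

```python
def descendants(edges, node):
    """All nodes reachable from `node` by following arrows."""
    found, stack = set(), [node]
    while stack:
        current = stack.pop()
        for parent, child in edges:
            if parent == current and child not in found:
                found.add(child)
                stack.append(child)
    return found

def simple_paths(edges, start, end, path=None):
    """Yield every path from start to end, ignoring arrow direction."""
    path = [start] if path is None else path
    if start == end:
        yield path
        return
    for a, b in edges:
        for here, there in ((a, b), (b, a)):
            if here == start and there not in path:
                yield from simple_paths(edges, there, end, path + [there])

def is_blocked(edges, path, z):
    """d-separation rule applied to one path, given the conditioning set z."""
    for prev, mid, nxt in zip(path, path[1:], path[2:]):
        collider = (prev, mid) in edges and (nxt, mid) in edges
        if collider:
            # A collider blocks unless it, or one of its descendants, is in z.
            if mid not in z and not (descendants(edges, mid) & z):
                return True
        elif mid in z:
            # A chain or fork node blocks when we condition on it.
            return True
    return False

def satisfies_backdoor(edges, x, y, z):
    z = set(z)
    if z & descendants(edges, x):  # condition 1: no descendants of x in z
        return False
    # Condition 2: every path that starts with an arrow into x is blocked.
    backdoor_paths = [p for p in simple_paths(edges, x, y) if (p[1], x) in edges]
    return all(is_blocked(edges, p, z) for p in backdoor_paths)

chain = {("D", "Z"), ("Z", "Y")}
fork = {("Z", "D"), ("Z", "Y")}
collider = {("D", "Z"), ("Y", "Z")}

for name, graph in [("chain", chain), ("fork", fork), ("collider", collider)]:
    print(name,
          "| empty set:", satisfies_backdoor(graph, "D", "Y", set()),
          "| {Z}:", satisfies_backdoor(graph, "D", "Y", {"Z"}))
```

The results line up with the summary: for the chain and the collider the empty set works and Z does not, while for the fork Z is required.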

The question we haven't answered yet is "How do we use the set $Z$ to estimate $P(Y|do(X=x))$?"

#### Definition: Back-Door Adjustment

If a set of variables $Z$ satisfies the back-door criterion relative to $(X, Y)$, then the causal effect of $X$ on $Y$ is identifiable and given by the formula 

$$ P(Y|do(X=x)) = \sum_z P(Y|X=x, Z=z)P(Z=z)$$
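With discrete variables the adjustment formula can be applied by direct stratification. The following sketch assumes a hypothetical binary system that is not one of the graphs simulated earlier: Z --> X, Z --> Y and X --> Y, with a true effect of X on P(Y=1) of 0.2. The probabilities and the seed are made up for illustration.

```python
import numpy as np

# Assumption: a made-up binary system where Z confounds X and Y.
rng = np.random.default_rng(3)
n = 200_000
z = rng.binomial(1, 0.5, size=n)
x = rng.binomial(1, 0.2 + 0.6 * z)
y = rng.binomial(1, 0.1 + 0.2 * x + 0.5 * z)

def p_y_do_x(x_value):
    """Back-door adjustment: sum over z of P(Y=1|X=x, Z=z) * P(Z=z)."""
    total = 0.0
    for z_value in (0, 1):
        stratum = (x == x_value) & (z == z_value)
        total += y[stratum].mean() * (z == z_value).mean()
    return total

# Naive contrast conditions on X only, so it mixes in the effect of Z.
naive = y[x == 1].mean() - y[x == 0].mean()
adjusted = p_y_do_x(1) - p_y_do_x(0)

print("naive difference:   ", naive)
print("adjusted difference:", adjusted)
```

The naive difference should come out well above 0.2 because X=1 is more common when Z=1, while the adjusted difference should be close to the effect built into the simulation.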

Question:

1. Up to this point, we've been looking at E[Y|X,Z]. How do we use regression with the back-door criterion (BDC)?

We can take expectations of both sides by multiplying by $Y$ and summing over $Y$ on each side. $P(Z)$ factors out of the sum on the RHS, and we get:  

$$ E[Y|do(X=x)]= \sum_z E[Y|X=x, Z=z]P(Z=z)$$

So it looks like we're taking an average of regressions over Z.

There's a cool trick we can do, using the estimator $\frac{N(Z=z)}{N}$ for $P(Z=z)$, where $N(Z=z)$ is the number of data points with $Z=z$.

$$ \hat{E}_N[Y|do(X=x)]= \sum_z \hat{E}_N[Y|X=x, Z=z]\frac{N(Z=z)}{N}$$
$$ \hat{E}_N[Y|do(X=x)]= \sum_z \sum_{i=1}^N \hat{E}_N[Y|X=x, Z=z]\frac{\delta(z=z_i)}{N}$$
$$ \hat{E}_N[Y|do(X=x)]= \frac{1}{N}\sum_{i=1}^N \hat{E}_N[Y|X=x, Z=z_i]$$

This is just the regression evaluated at each data point, and then averaged!
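The equivalence of the two forms is easy to confirm numerically. In this sketch `fitted_regression` is a hypothetical stand-in for an already-fitted $\hat{E}_N[Y|X=x, Z=z]$, and Z is assumed to take only a few discrete values so that the counts $N(Z=z)$ are meaningful.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000
z = rng.integers(0, 3, size=n)  # discrete Z with values 0, 1, 2

def fitted_regression(x_value, z_value):
    # Hypothetical stand-in for a fitted E[Y|X=x, Z=z]; any function works here.
    return 2.0 * x_value + 0.5 * z_value ** 2

x_value = 1.0

# First form: sum over distinct z values, weighted by N(Z=z) / N.
values, counts = np.unique(z, return_counts=True)
weighted_sum = sum(
    fitted_regression(x_value, v) * c / n for v, c in zip(values, counts)
)

# Last form: evaluate at every data point's z_i and take the plain average.
pointwise_average = fitted_regression(x_value, z).mean()

print(weighted_sum, pointwise_average)
```

The second form is the convenient one in practice, since it never needs the distinct values of Z or their counts and so carries over to continuous Z.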

## Estimating $E[Y|do(X=x)]$

Let's use the confounded example (the fork) as our test case.

#### Step 1: Establish the graph, and determine the back-door set

We know the graph, since we're simulating data. It's D <-- Z --> Y. A back-door set here is just the variable (suggestively named) Z.

#### Step 2: Collect Data (you have to measure $Z$, $D$, and $Y$!)

In [27]:
z = np.random.normal(size=N)
d = z + np.random.normal(size=N)
y = z + np.random.normal(size=N)

df = pd.DataFrame({'D': d, 'Z': z, 'Y': y})

#### Step 3: Estimate the regression conditional on the causal state and the back-door set

In [30]:
model = OLS(df['Y'], df[['D', 'Z']])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.501
Model:,OLS,Adj. R-squared:,0.501
Method:,Least Squares,F-statistic:,5016.0
Date:,"Tue, 16 Oct 2018",Prob (F-statistic):,0.0
Time:,15:53:03,Log-Likelihood:,-14131.0
No. Observations:,10000,AIC:,28270.0
Df Residuals:,9998,BIC:,28280.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D,-0.0250,0.010,-2.505,0.012,-0.045,-0.005
Z,1.0277,0.014,72.773,0.000,1.000,1.055

0,1,2,3
Omnibus:,1.415,Durbin-Watson:,2.004
Prob(Omnibus):,0.493,Jarque-Bera (JB):,1.401
Skew:,0.006,Prob(JB):,0.496
Kurtosis:,3.057,Cond. No.,2.61


#### Step 4: Fix $D$ in a copy of the data, and evaluate the predicted value of $Y$.

In [31]:
# Let's start with E[Y|do(D=0)]

df_intervention = df.copy()

df_intervention['D'] = 0
df_intervention['Y(Z, D=0)'] = result.predict(df_intervention[['D', 'Z']])

# and then do  E[Y|do(D=1)]
df_intervention['D'] = 1
df_intervention['Y(Z, D=1)'] = result.predict(df_intervention[['D', 'Z']])


#### Step 5: Do the average over the data points

In [34]:
df_intervention.mean()

D            1.000000
Z            0.024257
Y            0.026620
Y(Z, D=0)    0.024928
Y(Z, D=1)   -0.000098
dtype: float64

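We've averaged Y(Z, D=d) over Z, so these means estimate E[Y|do(D=d)]. The two estimates differ by about 0.025, which is just the fitted coefficient on D from Step 3. That is tiny next to the coefficient on Z (about 1.03). In this particular run the 95% interval for the D coefficient narrowly excludes zero (p = 0.012), but that is sampling noise: the simulation generates Y from Z alone, so the true effect of D on Y is zero.

#### E[Y|do(D=0)] = 0.024928

#### E[Y|do(D=1)] = -0.000098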


You can do this same procedure with much more interesting models and data-generating processes!
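As one illustration, here is a sketch where the averaging step really matters because the effect of D depends on Z. Everything in it is an assumption made for the example: the graph Z --> D, Z --> Y, D --> Y, the equation Y = D*Z + Z + noise, the seed, and the use of `np.linalg.lstsq` instead of statsmodels.

```python
import numpy as np

# Assumption: a made-up system with an interaction between D and Z.
rng = np.random.default_rng(5)
n = 50_000
z = rng.normal(loc=1.0, size=n)
d = z + rng.normal(size=n)
y = d * z + z + rng.normal(size=n)

def features(d_values, z_values):
    """Intercept, D, Z and the D*Z interaction."""
    return np.column_stack(
        [np.ones(len(z_values)), d_values, z_values, d_values * z_values]
    )

# Step 3: fit the regression on the causal state D and the back-door set Z.
beta, *_ = np.linalg.lstsq(features(d, z), y, rcond=None)

def expected_y_do(d_value):
    # Steps 4 and 5: fix D for every row, predict, then average over the observed Z.
    d_fixed = np.full(n, d_value)
    return (features(d_fixed, z) @ beta).mean()

effect = expected_y_do(1.0) - expected_y_do(0.0)
print("E[Y|do(D=1)] - E[Y|do(D=0)] =", effect)
```

Under these assumptions E[Y|do(D=d)] = d*E[Z] + E[Z] = d + 1, so the estimated effect should be close to 1. No single coefficient equals that number; it only appears after averaging the predictions over Z.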