In [23]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

%matplotlib inline

Assume that we have a variable $Y$ such that $f_Y(D, e) = \alpha + \delta D + e$.  
Now introduce a possible instrumental variable $Z$.

In [25]:
from causalgraphicalmodels import CausalGraphicalModel

iv = CausalGraphicalModel(
    nodes=["Z", "D", "Y", "e"],
    edges=[
        ("Z", "D"), 
        ("D", "Y"),
        ('e', 'Y')
    ],
    latent_edges=[
        ('e', 'D')
    ]
)

iv.draw()

ModuleNotFoundError: No module named 'causalgraphicalmodels'

In [26]:
from causalgraphicalmodels import CausalGraphicalModel

iv_p = CausalGraphicalModel(
    nodes=["Z", "D", "Y", "e"],
    edges=[
        ("Z", "D"), 
        ("D", "Y"),
        ('e', 'Y')
    ],
    latent_edges=[
        ('e', 'D'),
        ('e', 'Z')
    ]
)

iv_p.draw()

ModuleNotFoundError: No module named 'causalgraphicalmodels'

In graphs $iv$ and $iv_p$ we are prevented from taking a least squares regression of $Y$ on $D$ by the backdoor path $D \leftrightarrow e \rightarrow Y$. Doing so would yield an inconsistent and biased estimate of the effect of $D$ on $Y$.  
Since $e$ is an unobserved error term, there are no variables satisfying the Back Door Criterion.

Potential instrumental variable $Z$ has a relationship with $Y$, through the path $Z \rightarrow D \rightarrow Y$.  
$Z$ does not generate statistical dependence with $Y$ through the path $Z \rightarrow D \leftrightarrow e \rightarrow Y$, because of collider $D$.  
In $iv_p$, the path $Z \leftrightarrow e \rightarrow Y$ does create statistical dependence. So $Z$ cannot be used in the scenario represented by that graph.

With the assumption that the effect of $D$ on $Y$ is a constant $\delta$, we can obtain a consistent estimator by isolating the covariation between $D$ and $Y$ that is causal. We can then ignore the noncausal covariation that results from the common causes of $D$ and $e$.  
In $iv$, $Z$ serves as an isolated source of variation for $D$, but not in $iv_p$.

The Wald estimator, $\delta_{WALD} \equiv \frac{\mathbb{E}[Y|Z = 1] - \mathbb{E}[Y|Z = 0]}{\mathbb{E}[D|Z = 1] - \mathbb{E}[D|Z = 0]}$, provides a consistent estimate of the causal effect of $D$ on $Y$ (for a binary instrument $Z$; here $D$ is binary as well).

In the case of $iv_p$, because of the nonzero association between $Z$ and $e$, the Wald estimator is not consistent for $\delta$. It converges to $\delta$ plus a bias term that reflects the net association between $Z$ and $e$.
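The bias term can be made concrete. With $Y = \alpha + \delta D + e$, the sample ratio $Cov(Z, Y) / Cov(Z, D)$ equals $\delta + Cov(Z, e) / Cov(Z, D)$, and for a binary $Z$ that ratio is the Wald estimator. The sketch below checks this on simulated data; the helper name `wald_estimate`, the seed, and the size of the dependence between $Z$ and $e$ are illustrative assumptions rather than part of the demonstrations that follow.

```python
import numpy as np

def wald_estimate(y, d, z):
    """Wald estimator for a binary instrument z (values 0 and 1)."""
    y, d, z = np.asarray(y), np.asarray(d), np.asarray(z)
    numerator = y[z == 1].mean() - y[z == 0].mean()
    denominator = d[z == 1].mean() - d[z == 0].mean()
    return numerator / denominator

rng = np.random.default_rng(0)  # arbitrary seed
n = 10_000
delta = 10.0

z = rng.binomial(1, 0.5, size=n)
# Assumed violation, as in iv_p: the error term depends on the instrument
e = rng.normal(0.0, 1.0, size=n) + 0.5 * z
d = rng.binomial(1, 0.2 + 0.4 * z)
y = 50.0 + delta * d + e

estimate = wald_estimate(y, d, z)
bias = np.cov(z, e)[0, 1] / np.cov(z, d)[0, 1]

print(f"Wald estimate:                 {estimate:.3f}")
print(f"delta + Cov(Z, e) / Cov(Z, D): {delta + bias:.3f}")
```

The two printed numbers agree, so the distance between the Wald estimate and $\delta$ is exactly the sample version of the bias term.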

### IV Demonstration 1

Consider a school voucher program in a metropolitan area.  
Randomly select 10 000 ninth graders and give them a standardized test. Collect the results as $\{y_i, d_i\}^{10 000}_{i = 1}$, where $Y$ is the test score and $D = 1$ if a student attended a private high school and $0$ otherwise.

10% of students win a voucher redeemable at a private high school. $z_i = 1$ for winners and $0$ otherwise.

In [27]:
units = 10000

p_z = .1 #probability of winning a voucher
Z = np.random.binomial(1, p_z, size=units)

e = np.random.normal(0, 5, size=units)

#Probability of attending private school is greater for lottery winners
p_d = .1 + Z / 10. + e / (abs(e).max() * 10)
D = np.random.binomial(1, p_d)

#Scores Y are of the form y = alpha + delta * D + e
Y = 50 + 10 * D + e

df = pd.DataFrame({'Z': Z, 'e': e, 'D': D, 'Y': Y})

In [28]:
df.head()

Unnamed: 0,Z,e,D,Y
0,0,-0.780557,0,49.219443
1,0,5.049583,0,55.049583
2,0,4.272334,0,54.272334
3,1,-6.591397,1,53.408603
4,1,3.719931,0,53.719931


**Note** The correlation between $Z$ and $D$ computed next (about 0.1) is on the low end of interdependence between an instrumental variable and a causal state of interest.

In [29]:
df[['Z', 'D']].corr()

Unnamed: 0,Z,D
Z,1.0,0.100637
D,0.100637,1.0


This data generating process corresponds to $iv$. Since $D$ is a collider on the only path between $Z$ and $e$, there is no statistical dependence between them.  
We can estimate the causal effect of $D$ on $Y$ with the Wald estimator.

In [30]:
(df[df['Z'] == 1].Y.mean() - df[df['Z'] == 0].Y.mean()) / (df[df['Z'] == 1].D.mean() - df[df['Z'] == 0].D.mean())

7.844617732516113

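That's in the neighborhood of the true effect, which from our data generating process was $\delta = 10$. With an instrument this weak (a correlation of about 0.1 with $D$), the estimate is noisy.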
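To get a feel for how much a single Wald estimate can move around, the data generating process above can be redrawn many times. This is a rough sketch: the function name `simulate_wald`, the seed, and the 200 replications are arbitrary choices, and `np.clip` is added only to guard against floating-point values of `p_d` falling a hair below zero.

```python
import numpy as np

def simulate_wald(rng, units=10_000, delta=10.0):
    """Draw one data set from the voucher process and return the Wald estimate."""
    z = rng.binomial(1, 0.1, size=units)   # voucher lottery
    e = rng.normal(0, 5, size=units)       # unobserved error
    p_d = 0.1 + z / 10.0 + e / (np.abs(e).max() * 10)
    d = rng.binomial(1, np.clip(p_d, 0, 1))
    y = 50 + delta * d + e
    numerator = y[z == 1].mean() - y[z == 0].mean()
    denominator = d[z == 1].mean() - d[z == 0].mean()
    return numerator / denominator

rng = np.random.default_rng(42)
estimates = np.array([simulate_wald(rng) for _ in range(200)])

print("median estimate:", np.median(estimates))
print("5th and 95th percentiles:", np.percentile(estimates, [5, 95]))
```

The median should sit near 10 while individual draws scatter by a couple of points in either direction, which is consistent with one draw landing at 7.84.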

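Now let's reproduce this with a manual two-stage procedure to recover $f_Y$: a first-stage model of $D$ on $Z$, followed by an OLS of $Y$ on the predicted values of $D$.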

In [31]:
from statsmodels.api import OLS, Logit

In [32]:
df['intercept'] = 1
instrument_model = Logit(df['D'], df[['Z', 'intercept']])
instrument_result = instrument_model.fit()

df['D_expected'] = instrument_result.predict(df[['Z', 'intercept']])
causal_model = OLS(df['Y'], df[['D_expected', 'intercept']])
result = causal_model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.335719
         Iterations 6


0,1,2,3
Dep. Variable:,Y,R-squared:,0.002
Model:,OLS,Adj. R-squared:,0.001
Method:,Least Squares,F-statistic:,15.85
Date:,"Sat, 15 Dec 2018",Prob (F-statistic):,6.9e-05
Time:,20:25:25,Log-Likelihood:,-32315.0
No. Observations:,10000,AIC:,64630.0
Df Residuals:,9998,BIC:,64650.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D_expected,7.8446,1.970,3.981,0.000,3.982,11.707
intercept,50.2425,0.219,229.046,0.000,49.812,50.672

0,1,2,3
Omnibus:,374.474,Durbin-Watson:,2.017
Prob(Omnibus):,0.0,Jarque-Bera (JB):,429.635
Skew:,0.46,Prob(JB):,5.08e-94
Kurtosis:,3.431,Cond. No.,32.5


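The coefficient for D_expected (7.8446) reproduces the Wald estimate, and the intercept is close to the true value of 50. The standard errors reported by this manual second stage are not valid 2SLS standard errors, because they treat D_expected as observed data rather than as an estimate.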
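The match with the Wald estimate is not a coincidence. With a single binary instrument and an intercept, the first stage is saturated, so its fitted values are just the group means of $D$ whether the first stage is a logit or a linear model. A minimal sketch with a linear first stage, using only NumPy (the seed and variable names are illustrative, and `np.clip` only guards against rounding in `p_d`):

```python
import numpy as np

rng = np.random.default_rng(1)
units = 10_000
z = rng.binomial(1, 0.1, size=units)
e = rng.normal(0, 5, size=units)
p_d = np.clip(0.1 + z / 10.0 + e / (np.abs(e).max() * 10), 0, 1)
d = rng.binomial(1, p_d)
y = 50 + 10 * d + e

ones = np.ones(units)

# First stage: least squares of D on Z and an intercept
first_stage_X = np.column_stack([z, ones])
first_stage_coef, *_ = np.linalg.lstsq(first_stage_X, d, rcond=None)
d_expected = first_stage_X @ first_stage_coef

# Second stage: least squares of Y on the fitted D and an intercept
second_stage_X = np.column_stack([d_expected, ones])
second_stage_coef, *_ = np.linalg.lstsq(second_stage_X, y, rcond=None)

wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

print("second-stage coefficient on fitted D:", second_stage_coef[0])
print("Wald estimate:                       ", wald)
```

With a non-binary instrument the logit and linear first stages no longer coincide, and a nonlinear first stage is then generally not equivalent to 2SLS.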

Using statsmodels' built-in 2SLS:

In [33]:
from statsmodels.sandbox.regression.gmm import IV2SLS

In [34]:
model = IV2SLS(df['Y'], df[['D', 'intercept']], instrument=df[['Z', 'intercept']])
result = model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.292
Model:,IV2SLS,Adj. R-squared:,0.292
Method:,Two Stage,F-statistic:,22.34
,Least Squares,Prob (F-statistic):,2.31e-06
Date:,"Sat, 15 Dec 2018",,
Time:,20:25:27,,
No. Observations:,10000,,
Df Residuals:,9998,,
Df Model:,1,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D,7.8446,1.660,4.727,0.000,4.592,11.098
intercept,50.2425,0.185,271.932,0.000,49.880,50.605

0,1,2,3
Omnibus:,1.428,Durbin-Watson:,2.012
Prob(Omnibus):,0.49,Jarque-Bera (JB):,1.457
Skew:,0.026,Prob(JB):,0.483
Kurtosis:,2.972,Cond. No.,3.28


Now let's introduce statistical dependence between $Z$ and $e$ through a confounder $A$.

In [35]:
units = 10000

A = np.random.normal(size=units) #Confounder

p_z = .1 + A / (10 * abs(A).max()) #probability of winning a voucher
Z = np.random.binomial(1, p_z, size=units)

e = np.random.normal(5 * A, 5, size=units)

#Probability of attending private school is greater for lottery winners
p_d = .1 + Z / 10. + e / (abs(e).max() * 10)
D = np.random.binomial(1, p_d)

#Scores Y are of the form y = alpha + delta * D + e
Y = 50 + 10 * D + e

df = pd.DataFrame({'A': A, 'Z': Z, 'e': e, 'D': D, 'Y': Y})

In [36]:
df.corr()

Unnamed: 0,A,Z,e,D,Y
A,1.0,0.070782,0.709385,0.061012,0.653767
Z,0.070782,1.0,0.049154,0.10634,0.085545
e,0.709385,0.049154,1.0,0.080969,0.919611
D,0.061012,0.10634,0.080969,1.0,0.466
Y,0.653767,0.085545,0.919611,0.466,1.0


Now we have a small positive correlation between $Z$ and $e$ (about 0.05).

In [37]:
df['intercept'] = 1
instrument_model = Logit(df['D'], df[['Z', 'intercept']])
instrument_result = instrument_model.fit()

df['D_expected'] = instrument_result.predict(df[['Z', 'intercept']])
causal_model = OLS(df['Y'], df[['D_expected', 'intercept']])
result = causal_model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.345306
         Iterations 6


0,1,2,3
Dep. Variable:,Y,R-squared:,0.007
Model:,OLS,Adj. R-squared:,0.007
Method:,Least Squares,F-statistic:,73.7
Date:,"Sat, 15 Dec 2018",Prob (F-statistic):,1.04e-17
Time:,20:25:30,Log-Likelihood:,-34937.0
No. Observations:,10000,AIC:,69880.0
Df Residuals:,9998,BIC:,69890.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D_expected,20.4110,2.377,8.585,0.000,15.751,25.071
intercept,48.8454,0.277,176.179,0.000,48.302,49.389

0,1,2,3
Omnibus:,100.517,Durbin-Watson:,2.003
Prob(Omnibus):,0.0,Jarque-Bera (JB):,105.372
Skew:,0.229,Prob(JB):,1.3099999999999999e-23
Kurtosis:,3.206,Cond. No.,30.2


The estimated coefficient of $D$ is very biased now, since we are picking up noncausal association through the backdoor path $Z \leftarrow A \rightarrow e \rightarrow Y$ created by $A$.  
We can alleviate this by conditioning on $A$ in our OLS.
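One way to sketch the conditioning step without statsmodels is a small least squares helper that carries the control through both stages. Note this is a variation on the cell that follows: here $A$ enters the first stage as well as the second, which is the usual 2SLS convention. The helper name `two_stage_least_squares` and the seed are assumptions made for illustration.

```python
import numpy as np

def two_stage_least_squares(y, d, z, controls=()):
    """Coefficient on d from 2SLS with instrument z and optional exogenous controls."""
    exog = [np.ones(len(y)), *controls]
    # First stage: d on the instrument plus all exogenous regressors
    first_stage_X = np.column_stack([z, *exog])
    gamma, *_ = np.linalg.lstsq(first_stage_X, d, rcond=None)
    d_hat = first_stage_X @ gamma
    # Second stage: y on fitted d plus the same exogenous regressors
    second_stage_X = np.column_stack([d_hat, *exog])
    beta, *_ = np.linalg.lstsq(second_stage_X, y, rcond=None)
    return beta[0]

rng = np.random.default_rng(7)
units = 10_000
a = rng.normal(size=units)                          # confounder
p_z = 0.1 + a / (10 * np.abs(a).max())
z = rng.binomial(1, np.clip(p_z, 0, 1))
e = rng.normal(5 * a, 5)                            # error depends on the confounder
p_d = 0.1 + z / 10.0 + e / (np.abs(e).max() * 10)
d = rng.binomial(1, np.clip(p_d, 0, 1))
y = 50 + 10 * d + e

without_a = two_stage_least_squares(y, d, z)
with_a = two_stage_least_squares(y, d, z, controls=[a])

print("ignoring A:       ", without_a)
print("conditioning on A:", with_a)
```

The estimate that ignores $A$ should land well above 10, while the conditioned estimate should fall back toward it.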

In [38]:
df['D_expected'] = instrument_result.predict(df[['Z', 'intercept']])
causal_model = OLS(df['Y'], df[['D_expected', 'A', 'intercept']]) #Include A in the regression
result = causal_model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.429
Model:,OLS,Adj. R-squared:,0.429
Method:,Least Squares,F-statistic:,3755.0
Date:,"Sat, 15 Dec 2018",Prob (F-statistic):,0.0
Time:,20:25:31,Log-Likelihood:,-32173.0
No. Observations:,10000,AIC:,64350.0
Df Residuals:,9997,BIC:,64370.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D_expected,9.4170,1.808,5.209,0.000,5.873,12.961
A,5.1615,0.060,85.916,0.000,5.044,5.279
intercept,50.1151,0.211,237.727,0.000,49.702,50.528

0,1,2,3
Omnibus:,331.644,Durbin-Watson:,1.995
Prob(Omnibus):,0.0,Jarque-Bera (JB):,379.104
Skew:,0.425,Prob(JB):,4.77e-83
Kurtosis:,3.434,Cond. No.,30.5


That's a lot better! Understanding the causal graph allowed us to recover the data generating process, even with the existence of a confounder.

In [39]:
iv_surrogate = CausalGraphicalModel(
    nodes=["Z", "V", "D", "Y", "e"],
    edges=[
        ("V", "Z"),
        ("V", "D"), 
        ("D", "Y"),
        ('e', 'Y')
    ],
    latent_edges=[
        ('e', 'D')
    ]
)

iv_surrogate.draw()

NameError: name 'CausalGraphicalModel' is not defined

In [40]:
iv_conditional = CausalGraphicalModel(
    nodes=["Z", "W", "D", "Y", "e"],
    edges=[
        ("Z", "D"), 
        ("D", "Y"),
        ('e', 'Y'),
        ("W", "Y")
    ],
    latent_edges=[
        ('e', 'D'),
        ('W', 'Z')
    ]
)

iv_conditional.draw()

NameError: name 'CausalGraphicalModel' is not defined

In both $iv_{surrogate}$ and $iv_{conditional}$, $Z$ is a valid instrumental variable. In the former, $V$ is unobserved, but $Z$ is a surrogate which has an association with $D$ even if it is not a direct cause of $D$. A possible complication in this case is a weak association between $Z$ and $D$. In the latter, $Z$ has an association with $Y$ through a path other than that through $D$, $Z \leftrightarrow W \rightarrow Y$. But this backdoor path can be blocked by conditioning on $W$ (see below) during estimation.

In [41]:
iv_c_do = iv_conditional.do('W')
iv_c_do.draw()

NameError: name 'iv_conditional' is not defined

### Examples

Causal effect of interest: years of schooling on subsequent earnings  
IVs: proximity to college, regional and temporal variation in school construction, tuition at local colleges, temporal variation in minimum school-leaving age, quarter of birth.  

Surrogate instrumental variable

In [56]:
units = 10000
alpha = 50
delta = 10

V = np.random.binomial(1, .5, size=units) #Parental graduation (not realistic because of causal effects on children's earnings)

Z = np.random.exponential(1.5 - V, size=units) #Distance from some college
#College-educated parents are expected to live closer to colleges

p_d = .15 - V / 10
D = np.random.binomial(1, p_d) #1 if years of schooling exceed some integer k, 0 otherwise
#Children of college-educated parents are expected to attend school for longer
#(as written, the minus sign makes D = 1 less likely when V = 1; the instrument logic is unaffected)

e = np.random.normal(5, size=units) #Individual effects

Y = alpha + delta * D + e #Earnings

df = pd.DataFrame({'V': V, 'Z': Z, 'D': D, 'e': e, 'Y': Y})

In [59]:
df[['Z', 'D']].corr()

Unnamed: 0,Z,D
Z,1.0,0.056094
D,0.056094,1.0


Even though there is no causal relationship between $Z$ and $D$, since both are caused by $V$ we get a statistical association.

First we require that $Cov(e, Z) = 0$. If that holds then we should be able to recover $\delta$.

In [60]:
df[['e', 'Z']].corr()

Unnamed: 0,e,Z
e,1.0,-0.003027
Z,-0.003027,1.0


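The OLS estimate works here, because $e$ is generated independently of $D$ in this simulation. (Strictly, the OLS slope is $Cov(Y, D) / Var(D)$; the correlation-based expression lands near it only because $sd(Y) \cdot sd(D)$ happens to be close to 1 for this data.)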

In [62]:
df[['Y', 'D']].corr().Y.D / df['D'].var()

10.340875913596614

In [63]:
df[['Y', 'Z']].corr().Y.Z / df[['D', 'Z']].corr().D.Z

0.9372428174145447

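**Note** The missing piece is scale. The IV estimator is a ratio of covariances, $Cov(Y, Z) / Cov(D, Z)$, not of correlations. The ratio of correlations equals the IV estimate multiplied by $sd(D) / sd(Y)$, which is roughly $0.3 / 3.2 \approx 0.095$ for this data, so $0.937$ corresponds to an IV estimate close to 10.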
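The relationship between the two ratios can be checked directly. The sketch below redraws the surrogate-instrument data with NumPy's `Generator` API (the seed is arbitrary, so the numbers will differ from the cells above) and compares the covariance ratio with the rescaled correlation ratio.

```python
import numpy as np

rng = np.random.default_rng(3)
units = 10_000
alpha, delta = 50, 10

v = rng.binomial(1, 0.5, size=units)   # parental graduation
z = rng.exponential(1.5 - v)           # distance from a college
d = rng.binomial(1, 0.15 - v / 10)     # schooling indicator
e = rng.normal(5, size=units)          # individual effects
y = alpha + delta * d + e              # earnings

# IV estimate: ratio of covariances
cov_ratio = np.cov(y, z)[0, 1] / np.cov(d, z)[0, 1]

# Ratio of correlations, as computed in the notebook
corr_ratio = np.corrcoef(y, z)[0, 1] / np.corrcoef(d, z)[0, 1]

# Rescaling by sd(Y) / sd(D) turns one into the other
rescaled = corr_ratio * y.std() / d.std()

print("covariance ratio: ", cov_ratio)
print("correlation ratio:", corr_ratio)
print("rescaled:         ", rescaled)
```

Because the surrogate instrument is only weakly associated with $D$, the covariance ratio still varies noticeably from draw to draw.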

Conditional instrumental variable

Draft lottery

In [72]:
units = 10000
alpha = 50
delta = 10

Z = np.random.randint(1, 13, size=units) #Month of birth (randint's upper bound is exclusive)

e = np.random.uniform(0, .25, size=units) #Individual effects

p_d = np.where(Z <= 6, .25, .5) #Chances of being drafted are greater for later months
D = np.random.binomial(1, p_d + e) #1 for drafted, 0 otherwise

W = np.random.normal(6. - Z / 2.) #Success in school, assumed to be affected by date of birth

Y = alpha + delta * D + e #Earnings

df = pd.DataFrame({'Z': Z, 'e': e, 'D': D, 'W': W, 'Y': Y})

In [74]:
df[['Z', 'D']].corr()

Unnamed: 0,Z,D
Z,1.0,0.215237
D,0.215237,1.0


In [75]:
df[['e', 'Z']].corr()

Unnamed: 0,e,Z
e,1.0,0.001312
Z,0.001312,1.0


We have the necessary correlation between the IV $Z$ and the causal state of interest $D$.  
We have no correlation between $e$ and $Z$, as required.  
By conditioning on $W$ we should be able to block the backdoor path $Z \leftrightarrow W \rightarrow Y$ and recover the data generating process.

In [76]:
df['intercept'] = 1
instrument_model = Logit(df['D'], df[['Z', 'intercept']])
instrument_result = instrument_model.fit()

df['D_expected'] = instrument_result.predict(df[['Z', 'intercept']])
causal_model = OLS(df['Y'], df[['D_expected', 'W', 'intercept']])
result = causal_model.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.669296
         Iterations 4


0,1,2,3
Dep. Variable:,Y,R-squared:,0.047
Model:,OLS,Adj. R-squared:,0.046
Method:,Least Squares,F-statistic:,244.4
Date:,"Sat, 15 Dec 2018",Prob (F-statistic):,2.2899999999999997e-104
Time:,20:56:57,Log-Likelihood:,-30065.0
No. Observations:,10000,AIC:,60140.0
Df Residuals:,9997,BIC:,60160.0
Df Model:,2,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D_expected,9.8609,0.849,11.611,0.000,8.196,11.526
W,-0.0132,0.049,-0.269,0.788,-0.110,0.083
intercept,50.2331,0.547,91.806,0.000,49.161,51.306

0,1,2,3
Omnibus:,5.336,Durbin-Watson:,2.013
Prob(Omnibus):,0.069,Jarque-Bera (JB):,1358.844
Skew:,0.057,Prob(JB):,8.53e-296
Kurtosis:,1.198,Cond. No.,75.7


Just about!  
But not including $W$ works well, too?

In [79]:
df['D_expected'] = instrument_result.predict(df[['Z', 'intercept']])
causal_model = OLS(df['Y'], df[['D_expected', 'intercept']])
result = causal_model.fit()
result.summary()

0,1,2,3
Dep. Variable:,Y,R-squared:,0.047
Model:,OLS,Adj. R-squared:,0.047
Method:,Least Squares,F-statistic:,488.8
Date:,"Sat, 15 Dec 2018",Prob (F-statistic):,8.539999999999999e-106
Time:,20:59:49,Log-Likelihood:,-30065.0
No. Observations:,10000,AIC:,60130.0
Df Residuals:,9998,BIC:,60150.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
D_expected,10.0540,0.455,22.110,0.000,9.163,10.945
intercept,50.0991,0.227,221.073,0.000,49.655,50.543

0,1,2,3
Omnibus:,5.339,Durbin-Watson:,2.013
Prob(Omnibus):,0.069,Jarque-Bera (JB):,1358.894
Skew:,0.057,Prob(JB):,8.31e-296
Kurtosis:,1.198,Cond. No.,11.5


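That might even be better! In this data generating process $W$ does not enter the equation for $Y$, so there is no backdoor path through $W$ to block. Conditioning on it is unnecessary and, because $W$ is correlated with $Z$, it inflates the standard error on D_expected (0.849 with $W$ versus 0.455 without).  
**Note** Try altering the data generating process to see if the correlation strengths are a factor.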
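One possible alteration, sketched here under stated assumptions, is to let $W$ actually affect $Y$ through a hypothetical coefficient `gamma`. Since $W$ is generated from $Z$ in this simulation, the extra path is $Z \rightarrow W \rightarrow Y$ rather than $Z \leftrightarrow W \rightarrow Y$, but conditioning on $W$ blocks it either way. Conditioning is done by residualizing $Y$, $D$, and $Z$ on $W$ before taking the IV ratio; the helper names and the seed are illustrative.

```python
import numpy as np

def residualize(x, w):
    """Residuals from a least squares regression of x on w and an intercept."""
    design = np.column_stack([w, np.ones(len(w))])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ coef

def iv_ratio(y, d, z):
    """Simple IV estimate: Cov(Z, Y) / Cov(Z, D)."""
    return np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

rng = np.random.default_rng(11)
units = 10_000
alpha, delta = 50, 10
gamma = 2  # hypothetical effect of W on Y, not present in the notebook's process

z = rng.integers(1, 13, size=units)            # month of birth, 1 to 12
e = rng.uniform(0, 0.25, size=units)           # individual effects
p_d = np.where(z <= 6, 0.25, 0.5) + e
d = rng.binomial(1, p_d)                       # drafted or not
w = rng.normal(6.0 - z / 2.0)                  # success in school
y = alpha + delta * d + gamma * w + e          # earnings now depend on W

unconditional = iv_ratio(y, d, z)
conditional = iv_ratio(residualize(y, w), residualize(d, w), residualize(z, w))

print("ignoring W:       ", unconditional)
print("conditioning on W:", conditional)
```

Under these assumptions the unconditional estimate is far from 10, and the conditional one recovers it, which is the behavior the $iv_{conditional}$ graph predicts.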

Catholic school attendance  
IV: Catholic share of local population

In [82]:
units = 10000
delta = 10

Z = np.random.uniform(0, 100, size=units) #Catholic share of population
