# 11 Instrumental Variables

## 1 Why do we need instrumental variables

We showed in Chapter 7 that when a "relevant" variable is omitted, the estimators we obtain are biased (and also inconsistent).

$$E(\hat{\beta}) = \beta + \alpha\gamma$$

where $\alpha$ is the effect of the omitted variable on y, and $\gamma$ is the slope from regressing the omitted variable on x (so it has the same sign as the correlation between x and the omitted variable).

> Omitted variable bias is a special case of **endogeneity**, so people sometimes refer to it as the endogeneity problem and to x as the endogenous variable.

To address this problem, we can add the omitted variable to the regression. But this approach is not always available: consider the case where we regress wage on education and the omitted variable is one's intelligence level, which we typically do not observe.

>![gear](./images/gear.png)

The intuition behind omitted variable bias is that when you tune x, the omitted variable (call it o) changes as well, and both o and x affect y. So with data on y and x but without o, we are not able to identify the sole effect of x on y.
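
The bias formula can be checked numerically. The following is a minimal sketch on simulated data; the coefficients 2, 3 and 0.8, the seed, and the helper `ols` are assumptions made for illustration and do not come from the chapter's datasets. It relies on the in-sample identity behind the formula: the slope from the short regression (y on x) equals the long-regression slope on x plus the long-regression coefficient on o times the slope from regressing o on x.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
n = 10_000

# Simulated data (all numbers here are illustrative assumptions)
o = rng.normal(size=n)                            # omitted variable
x = 0.8 * o + rng.normal(size=n)                  # x is correlated with o
y = 1.0 + 2.0 * x + 3.0 * o + rng.normal(size=n)  # both x and o affect y


def ols(y, *regressors):
    """OLS coefficients of y on a constant and the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_short = ols(y, x)[1]             # y on x only: o is omitted
_, b_long, a_long = ols(y, x, o)   # y on x and o
gamma = ols(o, x)[1]               # slope from regressing o on x

# The two printed numbers agree up to floating point error
print(b_short, b_long + a_long * gamma)
```

In this setup the short-regression slope lands well above the true effect of 2, because o pushes x and y in the same direction.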

## 2 Instrumental variable

To isolate the effect of x on y, we can picture x as broken into 2 parts: one ($x_1$) correlates with the omitted variable o, and the other ($x_2$) does not. If we find a variable z that correlates with x but not with o, then we can identify the impact of x on y.

As an example, we use the years of education of the individual's father as an instrument for educ. We can claim that the parent's education level does not necessarily correlate with the omitted intelligence, while it is correlated with the respondent's education level.

Alternatively, we can use college proximity as an IV for education. It is arguably a valid IV, because it is both **inclusive** (relevant), i.e. $corr(x,z)\neq 0$, and **exclusive** (exogenous), i.e. $corr(o,z) = 0$.
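
To make the two conditions concrete, here is a small simulation sketch. Everything in it is assumed for illustration: the data are artificial, the true effect of x on y is set to 2, and z is drawn independently of o so that the exclusion condition holds by construction. It compares the OLS slope with the ratio $Cov(z,y)/Cov(z,x)$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

o = rng.normal(size=n)                 # omitted variable (e.g. ability)
z = rng.normal(size=n)                 # instrument, independent of o
x = z + o + rng.normal(size=n)         # endogenous regressor
y = 1.0 + 2.0 * x + 3.0 * o + rng.normal(size=n)  # true effect of x is 2

corr_xz = np.corrcoef(x, z)[0, 1]      # inclusive: clearly non-zero
corr_oz = np.corrcoef(o, z)[0, 1]      # exclusive: close to zero

b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)    # simple OLS slope
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]    # covariance ratio using z

print(f"corr(x, z) = {corr_xz:.3f}, corr(o, z) = {corr_oz:.3f}")
print(f"OLS slope = {b_ols:.3f}, IV slope = {b_iv:.3f}")
```

With real data the exclusion condition cannot be verified this way, because o is not observed; it has to be argued.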

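> Correlation is not transitive: z can be correlated with x, and x with o, while z is uncorrelated with o. \
If x, y are demeaned, $corr(x,y) = \frac{x \cdot y}{\|x\|\|y\|} = \cos\theta$, where $\theta$ is the angle between the two vectors.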
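
The geometric view can be illustrated with three small demeaned vectors. This is only a sketch: the vectors and the helper `cos_angle` are made up for the illustration. Here x lies "between" z and o, so it is correlated with both, while z and o are orthogonal.

```python
import numpy as np


def cos_angle(a, b):
    """Cosine of the angle between a and b (equals corr(a, b) for demeaned data)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Each vector sums to zero, so it is already demeaned.
z = np.array([1.0, -1.0, 0.0])
o = np.array([1.0, 1.0, -2.0])
x = z + o

print("corr(z, x):", cos_angle(z, x))   # positive
print("corr(x, o):", cos_angle(x, o))   # positive
print("corr(z, o):", cos_angle(z, o))   # zero: z and o are orthogonal
```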

## 3 What to do after you find an IV? 

There are two simple estimators using IV.

The first one is named after IV, the **IV** estimator.

The second one is called a **two stage least squares (2SLS)** estimator.

## 4 IV estimator

$$\hat{\beta}^{IV} =\frac{ \hat{Cov}(z,y)}{\hat{Cov}(z,x)}$$
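
A short derivation shows where this formula comes from. It is a sketch under the simple model $y = \beta_0 + \beta x + u$, where the error term $u$ (notation assumed here) absorbs the omitted variable, and z satisfies $Cov(z,u) = 0$ and $Cov(z,x) \neq 0$:

$$\begin{aligned}
Cov(z,y) &= Cov(z,\ \beta_0 + \beta x + u) \\
&= \beta\, Cov(z,x) + Cov(z,u) \\
&= \beta\, Cov(z,x)
\end{aligned}
\quad\Longrightarrow\quad
\beta = \frac{Cov(z,y)}{Cov(z,x)}$$

Replacing the population covariances with sample covariances gives $\hat{\beta}^{IV}$.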

### Example

We use data from MROZ and only analyze women with non-missing wage (use the method `dropna` to extract them). We want to estimate the return to education (**educ**) for these women. As an instrumental variable for education, we use the education of her father (**fatheduc**).

1. Import the dataset, print all column names using info().

In [1]:
import wooldridge as woo
df = woo.data("MROZ")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 753 entries, 0 to 752
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   inlf      753 non-null    int64  
 1   hours     753 non-null    int64  
 2   kidslt6   753 non-null    int64  
 3   kidsge6   753 non-null    int64  
 4   age       753 non-null    int64  
 5   educ      753 non-null    int64  
 6   wage      428 non-null    float64
 7   repwage   753 non-null    float64
 8   hushrs    753 non-null    int64  
 9   husage    753 non-null    int64  
 10  huseduc   753 non-null    int64  
 11  huswage   753 non-null    float64
 12  faminc    753 non-null    float64
 13  mtr       753 non-null    float64
 14  motheduc  753 non-null    int64  
 15  fatheduc  753 non-null    int64  
 16  unem      753 non-null    float64
 17  city      753 non-null    int64  
 18  exper     753 non-null    int64  
 19  nwifeinc  753 non-null    float64
 20  lwage     428 non-null    float6

In [2]:
df = df.dropna(subset=["wage"])

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 428 entries, 0 to 427
Data columns (total 22 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   inlf      428 non-null    int64  
 1   hours     428 non-null    int64  
 2   kidslt6   428 non-null    int64  
 3   kidsge6   428 non-null    int64  
 4   age       428 non-null    int64  
 5   educ      428 non-null    int64  
 6   wage      428 non-null    float64
 7   repwage   428 non-null    float64
 8   hushrs    428 non-null    int64  
 9   husage    428 non-null    int64  
 10  huseduc   428 non-null    int64  
 11  huswage   428 non-null    float64
 12  faminc    428 non-null    float64
 13  mtr       428 non-null    float64
 14  motheduc  428 non-null    int64  
 15  fatheduc  428 non-null    int64  
 16  unem      428 non-null    float64
 17  city      428 non-null    int64  
 18  exper     428 non-null    int64  
 19  nwifeinc  428 non-null    float64
 20  lwage     428 non-null    float6

2. Drop observations with missing wage values.

In [3]:
df = df.dropna(subset=["wage"])

3. Compute covariances using numpy.

In [4]:
import numpy as np
cov_yz = np.cov(df["wage"],df["fatheduc"])[0,1]
cov_xz = np.cov(df["educ"],df["fatheduc"])[0,1]
b_iv = cov_yz/cov_xz
b_iv

0.37566402287175477

In [5]:
import numpy as np
x = df["educ"]
y = df["lwage"]
z = df["fatheduc"]
num = np.cov(y,z)[0,1]
denom = np.cov(x,z)[0,1]

4. Compute b_iv and compare it with the OLS estimator.

In [5]:
import statsmodels.formula.api as smf
res = smf.ols("wage ~ educ", data=df).fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                   wage   R-squared:                       0.117
Model:                            OLS   Adj. R-squared:                  0.115
Method:                 Least Squares   F-statistic:                     56.41
Date:                Tue, 12 Apr 2022   Prob (F-statistic):           3.49e-13
Time:                        12:56:27   Log-Likelihood:                -1092.5
No. Observations:                 428   AIC:                             2189.
Df Residuals:                     426   BIC:                             2197.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -2.0924      0.848     -2.467      0.0

In [7]:
b_iv = num/denom
b_iv

0.05917347999936593

In [8]:
import statsmodels.formula.api as smf
reg = smf.ols("lwage~educ", data=df)
res = reg.fit()
print(res.params["educ"])

0.10864865517467533


## 5 2SLS

Two stage least squares (2SLS) is a general approach for IV estimation when we have one or more endogenous regressors and at least as many additional instrumental variables. Consider the regression model:

$$y = \beta_0 + \beta_1x_1 + \beta_2 x_2 + \beta_3 x_3 +u $$

where both $x_1$ and $x_2$ are endogenous. Suppose we've found 3 instruments - $z_1$ for $x_1$, and $z_2$ and $z_3$ for $x_2$. Then we can obtain consistent estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ using the following steps.

1. Separately regress $x_1$ and $x_2$ on $z_1$ through $z_3$ and $x_3$. Obtain fitted values $\hat{x}_1$ and $\hat{x}_2$.
2. Regress $y$ on $\hat{x}_1$, $\hat{x}_2$, and $x_3$.
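
The two steps can be sketched with plain numpy on simulated data. All coefficients, the seed, and the variable names below are assumptions chosen for illustration; the confounder `o` makes both $x_1$ and $x_2$ endogenous, and the instruments are drawn independently of it.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# Simulated data; every coefficient below is an illustrative assumption.
z1, z2, z3, x3 = rng.normal(size=(4, n))   # instruments and exogenous regressor
o = rng.normal(size=n)                     # unobserved confounder
x1 = z1 + 0.5 * x3 + o + rng.normal(size=n)
x2 = z2 + z3 + o + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + 0.5 * x3 + 2.0 * o + rng.normal(size=n)

const = np.ones(n)

# Step 1: regress x1 and x2 on all instruments and the exogenous x3.
Z = np.column_stack([const, z1, z2, z3, x3])
pi, *_ = np.linalg.lstsq(Z, np.column_stack([x1, x2]), rcond=None)
x1_hat, x2_hat = (Z @ pi).T

# Step 2: regress y on the fitted values and x3.
X_hat = np.column_stack([const, x1_hat, x2_hat, x3])
beta_2sls, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

# Plain OLS for comparison (ignores the endogeneity).
X = np.column_stack([const, x1, x2, x3])
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

print("true :", [1.0, 2.0, -1.0, 0.5])
print("2SLS :", beta_2sls.round(3))
print("OLS  :", beta_ols.round(3))
```

Note that $x_3$ enters the first stage as well as the second; leaving it out of the first stage would in general break the consistency of the second-stage estimates.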

In [10]:
df["x_h"] = smf.ols("educ~fatheduc+motheduc+exper",data=df).fit().fittedvalues

In [11]:
reg = smf.ols("wage~x_h+exper",data=df)
res = reg.fit()
res.summary()

Dep. Variable:                   wage   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                  0.009
Method:                 Least Squares   F-statistic:                     3.031
Date:                Tue, 12 Apr 2022   Prob (F-statistic):             0.0493
Time:                        13:09:53   Log-Likelihood:                -1116.1
No. Observations:                 428   AIC:                            2238.0
Df Residuals:                     425   BIC:                            2250.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -0.3399      1.960     -0.173      0.862      -4.192       3.512
x_h            0.3321      0.152      2.181      0.030       0.033       0.631
exper          0.0240      0.020      1.213      0.226      -0.015       0.063

Omnibus:                      324.758   Durbin-Watson:                   2.007
Prob(Omnibus):                    0.0   Jarque-Bera (JB):             4693.932
Skew:                           3.201   Prob(JB):                          0.0
Kurtosis:                      17.907   Cond. No.                        237.0


### Example

Continuing the previous example, let's now consider the model `lwage ~ educ + exper + expersq`. Assume educ is endogenous and can be instrumented by **motheduc** and **fatheduc**.

1. First stage - regress educ on all instrumental and exogenous variables

In [9]:
res1 = smf.ols("educ ~ motheduc + fatheduc + exper + expersq", data=df).fit()

2. Add the fitted values to df, call it **educ_h**

In [10]:
df = df.reset_index(drop=True)
df["educ_h"] = res1.fittedvalues

3. Second stage - regress lwage on educ_h and other exogenous variables

In [11]:
res2 = smf.ols("lwage ~ educ_h + exper + expersq", data=df).fit()
print(res2.params)

Intercept    0.048100
educ_h       0.061397
exper        0.044170
expersq     -0.000899
dtype: float64
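
The manual second stage gives the 2SLS coefficients, but the standard errors that OLS reports for it are not the 2SLS standard errors: they are built from residuals that use the fitted regressor rather than the actual one. The sketch below illustrates the difference on simulated data. The data-generating process and all names are assumptions for the illustration, and it uses the simple homoskedastic variance formula (not the robust covariance that linearmodels reports by default).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Simulated data (illustrative assumptions): the structural error is 2*o + u,
# which has variance 4 + 1 = 5.
z = rng.normal(size=n)
o = rng.normal(size=n)
x = z + o + rng.normal(size=n)
y = 1.0 + 2.0 * x + 2.0 * o + rng.normal(size=n)

const = np.ones(n)
Z = np.column_stack([const, z])
X = np.column_stack([const, x])

# Stage 1 fitted values, then stage 2 coefficients.
pi, *_ = np.linalg.lstsq(Z, x, rcond=None)
X_hat = np.column_stack([const, Z @ pi])
beta, *_ = np.linalg.lstsq(X_hat, y, rcond=None)

# Naive residuals use the fitted regressor; 2SLS residuals use the actual x.
resid_naive = y - X_hat @ beta
resid_2sls = y - X @ beta

k = X.shape[1]
sigma2_naive = resid_naive @ resid_naive / (n - k)
sigma2_2sls = resid_2sls @ resid_2sls / (n - k)

# Homoskedastic covariance matrix: sigma^2 * (X_hat' X_hat)^{-1}
XtX_inv = np.linalg.inv(X_hat.T @ X_hat)
se_naive = np.sqrt(sigma2_naive * np.diag(XtX_inv))
se_2sls = np.sqrt(sigma2_2sls * np.diag(XtX_inv))

print("sigma^2  naive:", sigma2_naive, " 2SLS:", sigma2_2sls)
print("se(slope) naive:", se_naive[1], " 2SLS:", se_2sls[1])
```

In practice a dedicated IV routine computes the appropriate standard errors, which is one reason to prefer it over the manual two-step procedure.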


## 6 IV using linearmodels

To implement IV regression in Python, the module **linearmodels** offers the command **iv.IV2SLS** ([here](https://bashtage.github.io/linearmodels/iv/iv/linearmodels.iv.model.IV2SLS.html#linearmodels.iv.model.IV2SLS)), including the convenient formula syntax we know from **statsmodels**. When working with IV regression in **linearmodels**, our first line of code is always:

### IV

In [13]:
import linearmodels.iv as iv

> Don't forget to `pip install linearmodels` first.

In [14]:
reg_iv = iv.IV2SLS.from_formula("wage~ 1+[educ~fatheduc]", data=df)
res_iv = reg_iv.fit()
res_iv.summary

Dep. Variable:                   wage   R-squared:                      0.1101
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1080
No. Observations:                 428   F-statistic:                    4.5640
Date:                Tue, Apr 12 2022   P-value (F-stat)                0.0327
Time:                        13:13:36   Distribution:                  chi2(1)
Cov. Estimator:                robust                                         

            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept     -0.5778     2.2191    -0.2604     0.7946     -4.9272      3.7716
educ           0.3757     0.1758     2.1363     0.0327      0.0310      0.7203


> Remember that constants in **linearmodels** must be explicitly included by adding **1** to the formula.

In [14]:
print(res_iv.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                  lwage   R-squared:                      0.0934
Estimator:                    IV-2SLS   Adj. R-squared:                 0.0913
No. Observations:                 428   F-statistic:                    2.5656
Date:                Mon, Apr 11 2022   P-value (F-stat)                0.1092
Time:                        11:30:57   Distribution:                  chi2(1)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      0.4411     0.4643     0.9501     0.3421     -0.4689      1.3511
educ           0.0592     0.0369     1.6017     0.10

> Note that **summary** in linearmodels is an attribute, not a method, so it is accessed without parentheses.

### 2SLS

When there is more than one instrument, we use the same module but with modified syntax.

In [15]:
reg_2sls = iv.IV2SLS.from_formula("lwage~1+exper + expersq + [educ~motheduc+fatheduc]", data=df)
res_2sls = reg_2sls.fit()
print(res_2sls)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                  lwage   R-squared:                      0.1357
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1296
No. Observations:                 428   F-statistic:                    18.611
Date:                Mon, Apr 11 2022   P-value (F-stat)                0.0003
Time:                        11:30:57   Distribution:                  chi2(3)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      0.0481     0.4278     0.1124     0.9105     -0.7903      0.8865
exper          0.0442     0.0155     2.8546     0.00

### Exercise
Use *CARD* to estimate the return to education (lwage on educ). Education is allowed to be endogenous and instrumented with the dummy variable **nearc4** which indicates whether the individual grew up close to a college. In addition, we control for experience, race, and regional information (smsa, south, and reg662-reg669). These variables are assumed to be exogenous and act as their own instruments.


In [16]:
df = woo.data("CARD")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3010 entries, 0 to 3009
Data columns (total 34 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        3010 non-null   int64  
 1   nearc2    3010 non-null   int64  
 2   nearc4    3010 non-null   int64  
 3   educ      3010 non-null   int64  
 4   age       3010 non-null   int64  
 5   fatheduc  2320 non-null   float64
 6   motheduc  2657 non-null   float64
 7   weight    3010 non-null   float64
 8   momdad14  3010 non-null   int64  
 9   sinmom14  3010 non-null   int64  
 10  step14    3010 non-null   int64  
 11  reg661    3010 non-null   int64  
 12  reg662    3010 non-null   int64  
 13  reg663    3010 non-null   int64  
 14  reg664    3010 non-null   int64  
 15  reg665    3010 non-null   int64  
 16  reg666    3010 non-null   int64  
 17  reg667    3010 non-null   int64  
 18  reg668    3010 non-null   int64  
 19  reg669    3010 non-null   int64  
 20  south66   3010 non-null   int6

In [17]:
formula = "lwage ~ 1 + [educ~nearc4] + exper + black + smsa + south +"
region = "+".join([c for c in df.columns if "reg" in c])
formula += region
formula += "-reg661"
formula

'lwage ~ 1 + [educ~nearc4] + exper + black + smsa + south +reg661+reg662+reg663+reg664+reg665+reg666+reg667+reg668+reg669-reg661'

In [18]:
reg = iv.IV2SLS.from_formula(formula, data=df)
print(reg.fit().summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                  lwage   R-squared:                      0.1922
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1887
No. Observations:                3010   F-statistic:                    735.97
Date:                Tue, Apr 12 2022   P-value (F-stat)                0.0000
Time:                        13:24:00   Distribution:                 chi2(13)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      3.6277     0.8755     4.1436     0.0000      1.9118      5.3437
exper          0.0677     0.0206     3.2843     0.00

In [19]:
import re
formula = "lwage~1 + [educ~nearc4]+ black + exper+ smsa+south+"
pattern = "^reg66[2-9]"
exo = "+".join([c for c in df.columns if re.search(pattern,c)])
formula += exo
formula

'lwage~1 + [educ~nearc4]+ black + exper+ smsa+south+reg662+reg663+reg664+reg665+reg666+reg667+reg668+reg669'
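
Building the formula string by hand is easy to get wrong, so it can help to wrap the logic in a small function. The helper `build_iv_formula` below is hypothetical (it is not part of linearmodels), and the column list is a shortened, made-up stand-in for the CARD columns. It only assembles a string in the bracket syntax used above.

```python
def build_iv_formula(dep, endog, instruments, exog):
    """Build an IV formula string in the bracket syntax used above.

    dep is the dependent variable name; endog, instruments and exog are
    lists of column names. The constant is added explicitly as "1".
    """
    iv_part = f"[{' + '.join(endog)} ~ {' + '.join(instruments)}]"
    return f"{dep} ~ " + " + ".join(["1", *exog, iv_part])


# Shortened, illustrative column list (the real dataset has more columns).
columns = ["educ", "nearc4", "exper", "black", "smsa", "south",
           "reg661", "reg662", "reg663", "lwage"]

# Keep the regional dummies except the base category reg661.
regions = [c for c in columns if c.startswith("reg66") and c != "reg661"]

formula = build_iv_formula("lwage", ["educ"], ["nearc4"],
                           ["exper", "black", "smsa", "south", *regions])
print(formula)
```

Dropping `reg661` when the list is built avoids the `-reg661` trick and makes the base category explicit.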

In [20]:
reg_iv = iv.IV2SLS.from_formula(formula, data=df)
res_iv = reg_iv.fit()
print(res_iv.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                  lwage   R-squared:                      0.1922
Estimator:                    IV-2SLS   Adj. R-squared:                 0.1887
No. Observations:                3010   F-statistic:                    735.97
Date:                Mon, Apr 11 2022   P-value (F-stat)                0.0000
Time:                        11:30:57   Distribution:                 chi2(13)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      3.6277     0.8755     4.1436     0.0000      1.9118      5.3437
black         -0.1331     0.0513    -2.5954     0.00

### Exercise 2

We now use WAGE2 to estimate the return to education for men.

Task1: Regress lwage on educ, interpret the coefficient

In [21]:
import wooldridge as woo
df = woo.data("wage2")
df = df.dropna()
df.head()

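|   | wage | hours | IQ | KWW | educ | exper | tenure | age | married | black | south | urban | sibs | brthord | meduc | feduc | lwage |
|---|------|-------|----|-----|------|-------|--------|-----|---------|-------|-------|-------|------|---------|-------|-------|-------|
| 0 | 769 | 40 | 93 | 35 | 12 | 11 | 2 | 31 | 1 | 0 | 0 | 1 | 1 | 2.0 | 8.0 | 8.0 | 6.645091 |
| 2 | 825 | 40 | 108 | 46 | 14 | 11 | 9 | 33 | 1 | 0 | 0 | 1 | 1 | 2.0 | 14.0 | 14.0 | 6.715384 |
| 3 | 650 | 40 | 96 | 32 | 12 | 13 | 7 | 32 | 1 | 0 | 0 | 1 | 4 | 3.0 | 12.0 | 12.0 | 6.476973 |
| 4 | 562 | 40 | 74 | 27 | 11 | 14 | 5 | 34 | 1 | 0 | 0 | 1 | 10 | 6.0 | 6.0 | 11.0 | 6.331502 |
| 6 | 600 | 40 | 91 | 24 | 10 | 13 | 0 | 30 | 0 | 0 | 0 | 1 | 1 | 2.0 | 8.0 | 8.0 | 6.39693 |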


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 663 entries, 0 to 931
Data columns (total 17 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   wage     663 non-null    int64  
 1   hours    663 non-null    int64  
 2   IQ       663 non-null    int64  
 3   KWW      663 non-null    int64  
 4   educ     663 non-null    int64  
 5   exper    663 non-null    int64  
 6   tenure   663 non-null    int64  
 7   age      663 non-null    int64  
 8   married  663 non-null    int64  
 9   black    663 non-null    int64  
 10  south    663 non-null    int64  
 11  urban    663 non-null    int64  
 12  sibs     663 non-null    int64  
 13  brthord  663 non-null    float64
 14  meduc    663 non-null    float64
 15  feduc    663 non-null    float64
 16  lwage    663 non-null    float64
dtypes: float64(4), int64(13)
memory usage: 93.2 KB


In [23]:
import statsmodels.formula.api as smf
reg = smf.ols("lwage ~ educ", data=df)
res = reg.fit()
print(res.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.104
Model:                            OLS   Adj. R-squared:                  0.103
Method:                 Least Squares   F-statistic:                     76.69
Date:                Mon, 11 Apr 2022   Prob (F-statistic):           1.67e-17
Time:                        11:30:57   Log-Likelihood:                -316.30
No. Observations:                 663   AIC:                             636.6
Df Residuals:                     661   BIC:                             645.6
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.9995      0.094     63.640      0.0

Task2: Explain why educ might be **endogenous** - i.e. there is some omitted variable o that is correlated with educ. \
Does $\hat{\beta}_{educ}$ over- or underestimate the effect of education on wage?

Task3: What variable(s) can be used as instruments for educ?

wage: monthly earnings \
hours: average weekly hours \
IQ: IQ score \
KWW: knowledge of world work score \
educ: years of education \
exper: years of work experience \
tenure: years with current employer \
age: age in years \
married: =1 if married \
black: =1 if black \
south: =1 if live in south \
urban: =1 if live in SMSA \
sibs: number of siblings \
brthord: birth order \
meduc: mother's education \
feduc: father's education \
lwage: natural log of wage

Task4: Compute the IV estimator manually using sibs

In [24]:
import numpy as np
cov_yz = np.cov(df["lwage"],df["sibs"])[0,1]
cov_xz = np.cov(df["educ"], df["sibs"])[0,1]
b_iv = cov_yz/cov_xz
b_iv

0.12558991602151698

Task5: Compute the 2SLS estimator manually

In [25]:
res_s1 = smf.ols("educ ~ sibs",data=df).fit()
df["educ_h"] = res_s1.predict()
res_s2 = smf.ols("lwage ~ educ_h",data=df).fit()
print(res_s2.summary())

                            OLS Regression Results                            
Dep. Variable:                  lwage   R-squared:                       0.018
Model:                            OLS   Adj. R-squared:                  0.016
Method:                 Least Squares   F-statistic:                     12.09
Date:                Mon, 11 Apr 2022   Prob (F-statistic):           0.000539
Time:                        11:30:57   Log-Likelihood:                -346.68
No. Observations:                 663   AIC:                             697.4
Df Residuals:                     661   BIC:                             706.3
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      5.0962      0.494     10.309      0.0

Task6: Compute the IV2SLS estimator using linearmodels.iv

In [26]:
import linearmodels.iv as iv
res_iv2sls = iv.IV2SLS.from_formula("lwage~1+[educ~sibs]",data=df).fit()
print(res_iv2sls.summary)

                          IV-2SLS Estimation Summary                          
Dep. Variable:                  lwage   R-squared:                     -0.0238
Estimator:                    IV-2SLS   Adj. R-squared:                -0.0253
No. Observations:                 663   F-statistic:                    12.405
Date:                Mon, Apr 11 2022   P-value (F-stat)                0.0004
Time:                        11:30:57   Distribution:                  chi2(1)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
Intercept      5.0962     0.4878     10.448     0.0000      4.1402      6.0522
educ           0.1256     0.0357     3.5220     0.00
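
The slope from Task 4 (0.1256) matches the linearmodels estimate in Task 6. This is not a coincidence: with a single instrument and no other regressors, the manual 2SLS slope and the covariance-ratio IV estimator are algebraically the same. A minimal numerical check on simulated data follows; the data-generating process and names are assumptions for the illustration, not WAGE2.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1_000

# Simulated data (illustrative assumptions, not WAGE2)
z = rng.normal(size=n)
o = rng.normal(size=n)
x = -0.5 * z + o + rng.normal(size=n)
y = 1.0 + 0.1 * x + 0.3 * o + rng.normal(size=n)

# IV estimator: ratio of sample covariances
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Manual 2SLS: np.polyfit(a, b, 1) returns [slope, intercept] of b on a
pi1, pi0 = np.polyfit(z, x, 1)       # first stage: x on z
x_hat = pi0 + pi1 * z                # fitted values
b_2sls, _ = np.polyfit(x_hat, y, 1)  # second stage: y on x_hat

# The two estimates agree up to floating point error
print(b_iv, b_2sls)
```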