# 0. Abstract and Outline

This document briefly introduces an important identification strategy for causal inference with observational data: the instrumental variable (IV). We first discuss a causal inference context where confounders make causation hard to establish, and introduce the idea of an instrumental variable. We then present a few examples of smart applications of instrumental variables. Next, we study an important case in causal inference, the causal effect of education on earning, and operationalize the instrumental variable to identify the average treatment effect of school years on the log of annual salary. Finally, we discuss the general difficulty of applying this strategy.

Below is the outline of this document:

- In [Section 1](#section_1), we discuss the basic idea of IV.

- In [Section 2](#section_2), we introduce a few examples of using IV to estimate the causal effect.

- In [Section 3](#section_3), we formalize the IV identification strategy under the potential outcome model framework, and apply this strategy to the case of education and earning.

- In [Section 4](#section_4), we conclude with some discussions on the IV approach.

<a id='section_1'></a>
# 1. Identification with Instrumental Variables

We are interested in quantifying the causal effect of the independent variable $W\in\mathbb R$ on the outcome variable $Y$, where we do not assume $W$ is binary. We define $X\in\mathbb R^p$ as other observable features. We denote the data sample as:

<font color="red">$$\mathcal D=\{(Y_i(W_i,X_i),W_i,X_i):Y_i(W_i,X_i)\in\mathbb R,W_i\in\mathbb R,X_i\in\mathbb R^p,1\le i\le n\},$$</font>

where $Y_i(W_i,X_i)$ is the potential outcome of subject $i$ with $W_i$ and $X_i$. In the non-experimental setting where $W_i$ is not fully randomized, a fundamental challenge in identifying the causal effect of $W$ on $Y$ is that there may exist some unobserved variables $X'\in\mathbb R^q$ which are correlated with both $W$ and $Y$. In this case, $X'$ are called **<font color="red">[confounders/confounding factors](https://en.wikipedia.org/wiki/Confounding)</font>** and this phenomenon is called **confoundedness** or **[endogeneity](https://en.wikipedia.org/wiki/Endogeneity_(econometrics))**. Under such circumstances, directly regressing $Y$ on $W$, whether or not we control for the observable features $X$, will result in biased estimation. Let us discuss an example of evaluating the causal effect of education on earning:


-----------------

<font color="red">

- $Y=$(log) of annual salary; $W=$years of education; $X=$gender, family size, etc.; $X'=$(unobservable) individual ability.
- $X'$ is positively correlated with $W$ (smarter and more resilient people are more likely to earn a higher degree); $X'$ is also positively correlated with $Y$ (smarter and more resilient people are more likely to earn more as well). Therefore, individual ability is a confounding factor.
- Directly regressing $Y$ on $W$ will yield an estimate of the **sum** of both the effect of education and the effect of individual ability, thus **overestimating** the causal effect of education on earning.

</font>    
    
--------------
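
The overestimation described above can be illustrated with a small simulation. This is only a sketch under assumed numbers: the true effect of `0.10`, the linear data-generating process, and all variable names are hypothetical and are not taken from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 0.10  # assumed causal effect of one year of schooling on log earning

ability = rng.normal(size=n)                       # unobserved confounder X'
schooling = 12 + 2 * ability + rng.normal(size=n)  # W increases with ability
log_earning = 1 + true_effect * schooling + 0.5 * ability + rng.normal(size=n)

# The OLS slope of Y on W equals Cov(W, Y) / Var(W).
ols_slope = np.cov(schooling, log_earning)[0, 1] / np.var(schooling, ddof=1)

print(f"true effect: {true_effect:.2f}, OLS slope: {ols_slope:.2f}")
```

In this setup the OLS slope converges to $0.10 + 0.5\cdot 2/5 = 0.30$: the extra $0.20$ is the effect of ability that OLS wrongly attributes to schooling.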

We now provide the high-level idea of using an **[instrumental variable](https://en.wikipedia.org/wiki/Instrumental_variables_estimation)**, denoted as $Z$, to identify the causal effect of $W$ on $Y$. Suppose $Z$ is **correlated** with the length of education $W$, but (conditionally) **uncorrelated** with the individual ability $X'$ (we will talk about what $Z$ exactly is below). Therefore, there is no confounder for identifying the causal effect of $Z$ on $Y$. Moreover, the correlation between $Z$ and $W$ helps translate the causal effect of $Z$ on $Y$ back to the causal effect of $W$ on $Y$. Formally, we need that:

- **<font color="red">Strong first stage:</font>** $Z$ and $W$ are correlated, i.e., $Cov(Z,W)\ne 0$. This condition can be verified with data via the [weak instrument test](https://en.wikipedia.org/wiki/Instrumental_variables_estimation#Weak_instruments_problem). We will discuss the details below.
- **<font color="red">Exclusion restriction:</font>** Conditioned on $W$, $Z$ is uncorrelated with $X'$ and $Y$, i.e., 
$$(Z\perp X',Y(\cdot,\cdot))|W$$
This condition is difficult, if not impossible, to verify. The ideal case is that $Z$ is completely induced by a randomized experiment. Then, $Z$ is independent of everything (especially $X'$), so the exclusion restriction holds naturally. As an alternative, $Z$ could be induced by some haphazard natural event, which is "almost" independent of $X'$.

<a id='section_2'></a>
# 2. Examples of Instrumental Variables

In this section, we introduce three typical examples of causal inference where instrumental variables enable us to identify the causal effect:

-------

<font color="red">

- **Education and earning:** What is the effect of schooling years on earnings?
- **Price sensitivity of demand:** How will demand change if the price changes?
- **Measuring user satisfaction:** What is the user satisfaction for a new feature of the App?

</font>

------

The above three questions are fairly challenging due to various selection biases, but cleverly selected IVs could help us address them.

## 2.1. Education and Earning

As discussed above, identifying the causal effect of schooling on earning is challenging because we cannot directly observe individual ability, the most common confounding factor in this context, which is positively correlated with both schooling and earning. To find a valid instrument $Z$, we need a feature that is correlated with the schooling time of a student, so that the **strong first stage** is satisfied. Furthermore, the IV $Z$ should be uncorrelated with individual ability, conditioned on the schooling years, so that the **exclusion restriction** holds.

Before introducing a very smartly designed IV (first proposed in [this paper](https://www.jstor.org/stable/2937954?seq=1#metadata_info_tab_contents)), we first review an important policy for compulsory education in the 1930s. Back then, children were required to start school in September of the year they turned 6, because schools started in September. Furthermore, students had to stay in school until their 16th birthday. Because the majority of students attended school for exactly the minimum time prescribed by the compulsory policy, the policy created a natural, and somewhat random, variation in the total length of education one received. The minimum schooling time of a student is summarized in the following table:

<table style="width:70%">
  <tr>
    <th>Birth Month </th>
    <th>January</th> 
    <th>June</th>
    <th>December</th>
  </tr>
  <tr>
    <td>Age to Start School</td>
    <td>6 Years and 8 Months</td>
    <td>6 Years and 3 Months</td>
    <td>5 Years and 9 Months</td>
  </tr>
   <tr>
    <td>Minimum Education Length</td>
    <td>9 Years and 4 Months</td>
    <td>9 Years and 9 Months</td>
    <td>10 Years and 3 Months</td>
  </tr>
</table>
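
The entries of the table follow from simple date arithmetic. The sketch below reproduces them under the policy as described above (school entry in September of the year a child turns 6, compulsory attendance until the 16th birthday); the function names are hypothetical helpers introduced only for illustration.

```python
def start_age_months(birth_month: int) -> int:
    """Age in months at school entry (September of the year the child turns 6)."""
    return 6 * 12 + (9 - birth_month)


def min_schooling_months(birth_month: int) -> int:
    """Months between school entry and the 16th birthday."""
    return 16 * 12 - start_age_months(birth_month)


def fmt(months: int) -> str:
    """Format a number of months as 'Y years and M months'."""
    years, rem = divmod(months, 12)
    return f"{years} years and {rem} months"


for name, month in [("January", 1), ("June", 6), ("December", 12)]:
    print(f"{name}: starts at {fmt(start_age_months(month))}, "
          f"minimum schooling {fmt(min_schooling_months(month))}")
```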

As you can see, a child who was born in December could have been at school for almost one year longer than one who was born in January. This motivates us to use the **quarter of birthday** as the IV to identify the causal effect of education on earning. More specifically, the quarter of birthday should be independent of the individual ability of the subject, thus satisfying the **exclusion restriction**. In addition, because of the compulsory education policy, the quarter of birthday was correlated with the education length of an individual, thus satisfying the **strong first stage**. The following figure (extracted from the book *Mostly Harmless Econometrics*) illustrates that the **quarter of birthday** is indeed correlated with the length of education:

<img src="Length.png" width=750>

To check that the quarter of birth could really affect the income of an individual, we also plot the relationship between the log of weekly wages and the year and quarter of birth:

<img src="Earning.png" width=750>

It is evident from the two figures above that the **quarter of birth** is likely to be a valid IV.



## 2.2. Price Sensitivity of Demand

We are interested in measuring the price sensitivity of demand for a product. Specifically, we seek to estimate the impact of price on demand:
$$Demand\approx \hat f(Price)$$
A key challenge in this context is the **endogeneity of price**: Higher demand will encourage the seller to increase the price of the product (recall the case of Uber's surge pricing strategy). We consider the fish market and seek to estimate how fish demand will change in response to price. To estimate the price sensitivity of demand, we introduce **Weather** as the IV (see [this paper](https://academic.oup.com/restud/article/67/3/499/1547484) for a detailed reference):

- **Strong first stage:** Stormy weather makes it hard to fish, thus raising the prices.
- **Exclusion restriction:** Presumably, stormy weather should not directly affect the demand.

Therefore, the weather condition should be a valid instrument.
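
To see how such an instrument separates the price effect from the demand shock, here is a minimal simulated sketch. Everything in it is an assumption made for illustration (a linear demand curve with slope `-1.5`, a binary `stormy` indicator, and a demand shock that also pushes up the price); it is not the model of the referenced paper. With a binary instrument, the IV estimate reduces to the Wald ratio of two differences in means.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_slope = -1.5  # assumed price sensitivity of demand

stormy = rng.integers(0, 2, size=n)  # instrument Z: 1 if the weather is stormy
demand_shock = rng.normal(size=n)    # unobserved; moves both price and demand
price = 5 + 1.0 * stormy + 0.8 * demand_shock + rng.normal(scale=0.5, size=n)
demand = 20 + true_slope * price + demand_shock + rng.normal(size=n)

# Naive OLS slope of demand on price (biased by the demand shock).
ols_slope = np.cov(price, demand)[0, 1] / np.var(price, ddof=1)

# Wald estimator: effect of Z on demand divided by effect of Z on price.
demand_gap = demand[stormy == 1].mean() - demand[stormy == 0].mean()
price_gap = price[stormy == 1].mean() - price[stormy == 0].mean()
iv_slope = demand_gap / price_gap

print(f"OLS: {ols_slope:.2f}, IV: {iv_slope:.2f}, truth: {true_slope:.2f}")
```

In this simulation OLS understates the magnitude of the price sensitivity, because high-demand days also have high prices, while the weather-based ratio stays close to the assumed slope.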

## 2.3. User Satisfaction

Suppose an App launched a new feature and would like to measure user satisfaction with this feature. In this context, directly running a randomized experiment is difficult, because you cannot force a user to use the feature. Furthermore, a user's satisfaction with the feature will in turn affect the usage/adoption of the feature.

To address this challenge, we could devise an IV using a randomized encouragement for the product feature. Specifically, the App could randomly send private messages to half of its users to promote this new feature. 

- **Strong first stage:** Randomized encouragement will be correlated with the usage/adoption of the new feature.
- **Exclusion restriction:** Randomized encouragement should not directly affect the user satisfaction of the new feature. 

Therefore, randomized user encouragement serves as a valid instrument to identify the causal effect of a new product feature on user satisfaction.

<a id='section_3'></a>
# 3. Identification of Causal Effect with Instrumental Variables

Let us begin with an intuitive argument for identification with IV, where the treatment variable $W$ (in the context of IVs, this is also called the **endogenous variable**) and the instrumental variable $Z$ are one-dimensional. For ease of discussion, we consider the case of estimating the effect of schooling on earning, where $Y$ is the log-of-earning (outcome), $W$ is the length of education (treatment), $Z$ is the quarter of birthday (encoded as 0, 1, 2, 3; the IV), and $X$ are other features, such as family wealth, number of siblings, etc.

### A Single Instrumental Variable Case

The analysis begins with estimating that a unit change in $Z$ is associated with $\hat\delta$ units of change in $W$:


-------

<font color="red">


$$W\approx \hat\delta_0+\hat\delta Z +\hat\beta X^T$$

</font>

--------
Furthermore, we obtain that a unit change in $Z$ is associated with $\hat\gamma$ units of change in $Y$:

-------------

<font color="red">

$$Y\approx \hat \gamma_0 + \hat\gamma Z +\hat\beta' X^T$$

</font> 

------------

Since $Z$ affects $Y$ only through $W$ (the exclusion restriction), its effect on $Y$ results entirely from the causal effect of $W$ on $Y$. In particular, the effect of $Z$ on $Y$ should equal the effect of $Z$ on $W$ multiplied by the effect of $W$ on $Y$. Therefore, the **causal effect of $W$ on $Y$ should be estimated as <font color="red">$\hat\gamma/\hat\delta$</font>**, provided that $\hat\delta\ne 0$ (the strong first stage).
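
The ratio argument can be made explicit with a short derivation. As a sketch, assume (this is an added assumption, not part of the setup above) a linear structural model with a constant effect $\tau$ of $W$ on $Y$, where $\alpha_0$, $\alpha$ are hypothetical coefficients and the error terms $\varepsilon$, $\eta$ are uncorrelated with $Z$:

$$Y=\alpha_0+\tau W+\alpha X^T+\varepsilon,\qquad W=\delta_0+\delta Z+\beta X^T+\eta.$$

Substituting the second equation into the first gives

$$Y=(\alpha_0+\tau\delta_0)+\tau\delta\, Z+(\tau\beta+\alpha)X^T+(\tau\eta+\varepsilon),$$

so the coefficients of the regression of $Y$ on $Z$ and $X$ are $\gamma_0=\alpha_0+\tau\delta_0$, $\gamma=\tau\delta$, and $\beta'=\tau\beta+\alpha$. Whenever $\delta\ne 0$, this yields $\tau=\gamma/\delta$, which motivates the estimator $\hat\gamma/\hat\delta$.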

### Two-Stage Least Squares (2SLS)

Next, we describe the general **2-stage least squares (2SLS)** framework for conducting causal inference with instrumental variables. More generally, we allow multiple treatment/endogenous variables $W\in\mathbb R^q$ and multiple IVs $Z\in\mathbb R^r$. For the causal effects to be identified, we need the so-called order condition: the number of IVs is at least the number of endogenous variables, i.e., $r\ge q$. The 2SLS procedure is given as follows:


-----------------

<font color="red">

- **Step 1: First-stage least squares.** The first stage analysis is to regress each endogenous variable $W_j$ ($j=1,2,...,q$) on the instrumental variables $Z\in\mathbb R^r$ and the exogenous features $X\in\mathbb R^p$:
$$W_j\approx \hat\delta_{j0}+\sum_{k=1}^r\hat\delta_{jk}Z_{k}+\hat\beta_j X^T\mbox{ for all }j=1,2,..,q$$

- **Step 2: Compute the fitted endogenous variables.** Then, we should obtain the fitted endogenous variables:
$$\hat{W}_{ij}=\hat\delta_{j0}+\sum_{k=1}^r\hat\delta_{jk}Z_{ik}+\hat\beta_j X^T_i\mbox{ for all }j=1,2,..,q$$

- **Step 3: Second-stage least squares.** Finally, we should regress the outcome on the fitted endogenous variables $\hat W$ and the features $X$:
$$Y\approx\hat\gamma_{0}+\sum_{k=1}^q\hat\tau_{k}\hat W_{k}+\hat\beta X^T$$

</font>

-----------------

Then, $\hat\tau_k$ ($k=1,2,...,q$) is a **<font color="red">consistent estimate of the causal effect</font>** of the endogenous variable $W_k$. Next, we apply the 2SLS framework to the case of estimating the effect of education on earning.
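
Before moving to the data, the three steps above can be sketched directly with NumPy least squares. This is a minimal illustration on simulated data rather than a production estimator: the function name `two_stage_least_squares`, the data-generating process, and the assumed true effect of `0.5` are all hypothetical.

```python
import numpy as np


def two_stage_least_squares(y, W, Z, X):
    """Return 2SLS coefficients ordered as [intercept, tau_1..tau_q, beta_1..beta_p]."""
    n = len(y)
    ones = np.ones((n, 1))
    first_stage_design = np.hstack([ones, Z, X])
    # Steps 1 and 2: regress every column of W on (1, Z, X) and keep the fitted values.
    delta, *_ = np.linalg.lstsq(first_stage_design, W, rcond=None)
    W_hat = first_stage_design @ delta
    # Step 3: regress y on (1, W_hat, X).
    second_stage_design = np.hstack([ones, W_hat, X])
    coef, *_ = np.linalg.lstsq(second_stage_design, y, rcond=None)
    return coef


# Simulated example with q = 1 endogenous variable, r = 2 instruments, p = 1 feature.
rng = np.random.default_rng(1)
n = 50_000
X = rng.normal(size=(n, 1))
Z = rng.normal(size=(n, 2))
confounder = rng.normal(size=n)  # unobserved; affects both W and y
w = 1.0 + Z @ np.array([1.0, -0.5]) + 0.3 * X[:, 0] + confounder + rng.normal(size=n)
W = w.reshape(-1, 1)
y = 2.0 + 0.5 * w + 0.2 * X[:, 0] + confounder + rng.normal(size=n)

coef = two_stage_least_squares(y, W, Z, X)
tau_hat = coef[1]
print(f"2SLS estimate of the effect of W: {tau_hat:.3f} (assumed truth: 0.5)")
```

Note that this sketch only returns point estimates; the standard errors of a manual second-stage regression need a correction, which is why a dedicated 2SLS routine is preferable in practice.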

In [1]:
# Import necessary packages
import sys 
import numpy as np
import pandas as pd
import statsmodels.api as sm
import sklearn
import scipy as sp
%matplotlib inline 
import matplotlib.pyplot as plt

In [2]:
df_Education  = pd.read_csv("Education.csv")
df_Education.describe()

Unnamed: 0,LogEarning,Schooling,QoB
count,10000.0,10000.0,10000.0
mean,14.906823,14.83707,2.4868
std,3.188847,3.115285,1.117386
min,0.958698,1.5,1.0
25%,12.771827,13.2,1.0
50%,14.919952,13.7,2.0
75%,17.095972,16.5,3.0
max,27.932792,28.5,4.0


The data set has 3 variables:

- **LogEarning $\in\mathbb R$:** Log of income;
- **Schooling $\in\mathbb R^+$:** Length of schooling, in years;
- **QoB $\in\{1,2,3,4\}$:** Quarter of birthday.

We first directly regress **LogEarning** on **Schooling**:

In [3]:
# Direct OLS regression.

OLS = sm.OLS(endog = df_Education['LogEarning'], exog = sm.add_constant(df_Education['Schooling']))
result_OLS = OLS.fit()
print(result_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:             LogEarning   R-squared:                       0.280
Model:                            OLS   Adj. R-squared:                  0.280
Method:                 Least Squares   F-statistic:                     3881.
Date:                Wed, 23 Feb 2022   Prob (F-statistic):               0.00
Time:                        15:05:00   Log-Likelihood:                -24145.
No. Observations:               10000   AIC:                         4.829e+04
Df Residuals:                    9998   BIC:                         4.831e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.8755      0.132     52.196      0.0



As discussed above, both **Schooling** and **LogEarning** are positively correlated with individual ability, so the simple OLS estimate is an overestimate of the true causal effect of **Schooling** on **LogEarning**. To obtain a consistent estimate, we adopt 2SLS. We install the ``linearmodels`` package using ``pip install linearmodels``.

In [4]:
# 2SLS

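# First, we change QoB into dummy variables.
# dtype=int keeps the dummies as 0/1 integers; recent pandas versions default to booleans.

df_edu = pd.get_dummies(df_Education,columns=['QoB'],drop_first=True,dtype=int)
df_edu['const'] = 1 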

df_edu.head()

Unnamed: 0,LogEarning,Schooling,QoB_2,QoB_3,QoB_4,const
0,10.896972,7.2,1,0,0,1
1,9.826995,13.2,1,0,0,1
2,14.418743,16.0,0,0,0,1
3,13.25013,19.0,0,0,0,1
4,14.373162,13.2,1,0,0,1


In [5]:
# Fit the first-stage OLS and run the weak instrument test.

IVs = ['QoB_2','QoB_3','QoB_4']


First_Stage = sm.OLS(endog = df_edu['Schooling'], exog = sm.add_constant(df_edu[IVs]))
res_first_stage = First_Stage.fit()
print(res_first_stage.summary())


                            OLS Regression Results                            
Dep. Variable:              Schooling   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     27.93
Date:                Wed, 23 Feb 2022   Prob (F-statistic):           5.61e-18
Time:                        15:05:00   Log-Likelihood:                -25510.
No. Observations:               10000   AIC:                         5.103e+04
Df Residuals:                    9996   BIC:                         5.106e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         14.4964      0.062    234.909      0.0



### Testing for the Strong First-Stage

Based on the [weak instrument test](https://scholar.harvard.edu/files/stock/files/testing_for_weak_instruments_in_linear_iv_regression.pdf), the [F-statistic](https://en.wikipedia.org/wiki/F-test) of the first-stage regression is 27.93, which exceeds the conventional rule-of-thumb threshold of 10, so the IVs satisfy the strong first-stage assumption. Next, we regress **LogEarning** on the **Schooling** values fitted by the first-stage OLS.
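
The first-stage F-statistic used in this test compares the first-stage regression with an intercept-only regression. The sketch below computes it by hand on simulated quarter-of-birth dummies, once for an informative instrument and once for an uninformative one. The helper `first_stage_f` and the simulated schooling values are hypothetical and unrelated to `Education.csv`.

```python
import numpy as np


def first_stage_f(w, Z):
    """F-statistic for H0: all instrument coefficients are zero in w ~ 1 + Z."""
    n, r = Z.shape
    design = np.hstack([np.ones((n, 1)), Z])
    coef, *_ = np.linalg.lstsq(design, w, rcond=None)
    rss_unrestricted = np.sum((w - design @ coef) ** 2)
    rss_restricted = np.sum((w - w.mean()) ** 2)  # intercept-only model
    return ((rss_restricted - rss_unrestricted) / r) / (rss_unrestricted / (n - r - 1))


rng = np.random.default_rng(2)
n = 10_000
qob = rng.integers(1, 5, size=n)  # quarter of birth in {1, 2, 3, 4}
Z = np.column_stack([(qob == k).astype(float) for k in (2, 3, 4)])

w_strong = 14.5 + 0.5 * (qob - 1) + rng.normal(scale=3.0, size=n)  # Z shifts schooling
w_weak = 14.5 + rng.normal(scale=3.0, size=n)                      # Z is irrelevant

f_strong = first_stage_f(w_strong, Z)
f_weak = first_stage_f(w_weak, Z)
print(f"F (informative instrument): {f_strong:.1f}, F (irrelevant instrument): {f_weak:.1f}")
```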

In [6]:
# Compute the predicted Schooling
df_edu['Schooling_p'] = res_first_stage.predict(sm.add_constant(df_edu[IVs]))
df_edu.head(10)



Unnamed: 0,LogEarning,Schooling,QoB_2,QoB_3,QoB_4,const,Schooling_p
0,10.896972,7.2,1,0,0,1,14.638017
1,9.826995,13.2,1,0,0,1,14.638017
2,14.418743,16.0,0,0,0,1,14.49644
3,13.25013,19.0,0,0,0,1,14.49644
4,14.373162,13.2,1,0,0,1,14.638017
5,14.512404,10.5,0,1,0,1,15.03163
6,20.825347,19.2,1,0,0,1,14.638017
7,21.117172,19.5,0,1,0,1,15.03163
8,15.785904,16.2,1,0,0,1,14.638017
9,12.554377,13.7,0,0,1,1,15.196957


In [7]:
# Fit the second-stage OLS

Second_Stage = sm.OLS(endog = df_edu['LogEarning'], exog = sm.add_constant(df_edu['Schooling_p']))
res_second_stage = Second_Stage.fit()
print(res_second_stage.summary())


                            OLS Regression Results                            
Dep. Variable:             LogEarning   R-squared:                       0.001
Model:                            OLS   Adj. R-squared:                  0.000
Method:                 Least Squares   F-statistic:                     5.884
Date:                Wed, 23 Feb 2022   Prob (F-statistic):             0.0153
Time:                        15:05:00   Log-Likelihood:                -25783.
No. Observations:               10000   AIC:                         5.157e+04
Df Residuals:                    9998   BIC:                         5.158e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const          10.8668      1.666      6.524      



The coefficient 0.2723 is a consistent estimate of the causal effect of **Schooling** on **LogEarning**. However, the standard error 0.112 reported by the second-stage regression is **NOT valid**, because its residuals are computed from the fitted rather than the observed **Schooling**. We need to estimate the standard error directly using the joint 2SLS.
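
Before turning to the library routine, the source of the discrepancy can be sketched on simulated data. The manual second stage builds its residuals from the fitted treatment, whereas the 2SLS standard error (shown here in its simple homoskedastic form, an assumption of this sketch) uses residuals built from the observed treatment. The data-generating process and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000
z = rng.normal(size=n)        # instrument
ability = rng.normal(size=n)  # unobserved confounder
w = z + ability + rng.normal(size=n)
y = 0.5 * w + ability + rng.normal(size=n)

ones = np.ones(n)
first_design = np.column_stack([ones, z])
w_hat = first_design @ np.linalg.lstsq(first_design, w, rcond=None)[0]

second_design = np.column_stack([ones, w_hat])
coef = np.linalg.lstsq(second_design, y, rcond=None)[0]
xtx_inv = np.linalg.inv(second_design.T @ second_design)

naive_resid = y - second_design @ coef                 # built from the fitted w_hat
correct_resid = y - np.column_stack([ones, w]) @ coef  # built from the observed w


def slope_se(resid):
    """Homoskedastic standard error of the slope for a given residual vector."""
    sigma2 = resid @ resid / (n - 2)
    return np.sqrt(sigma2 * xtx_inv[1, 1])


naive_se, correct_se = slope_se(naive_resid), slope_se(correct_resid)
print(f"slope: {coef[1]:.3f}, naive SE: {naive_se:.4f}, 2SLS SE: {correct_se:.4f}")
```

The point estimate is identical in both cases; only the residual variance, and hence the standard error, differs.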

In [8]:
# Fit a 2sls model.

from linearmodels.iv import IV2SLS


IV_model = IV2SLS(dependent = df_edu['LogEarning'],endog = df_edu['Schooling'],\
                  exog = df_edu['const'],instruments=df_edu[IVs])
res_2sls = IV_model.fit()
print(res_2sls)

                          IV-2SLS Estimation Summary                          
Dep. Variable:             LogEarning   R-squared:                      0.2106
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2105
No. Observations:               10000   F-statistic:                    7.4597
Date:                Wed, Feb 23 2022   P-value (F-stat)                0.0063
Time:                        15:05:00   Distribution:                  chi2(1)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          10.867     1.4796     7.3447     0.0000      7.9670      13.767
Schooling      0.2723     0.0997     2.7312     0.00

As the 2SLS regression results show, one additional year of schooling increases the log earning of an individual by 0.2723 on average, i.e., it raises earnings by roughly **27.23%** under the usual log-point approximation.
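
Because the outcome is in logs, a coefficient of this size can be read either through the log-point approximation or through the exact conversion $e^{\hat\tau}-1$. The short check below compares the two for the estimate reported above; the variable names are only illustrative.

```python
import math

tau_hat = 0.2723  # 2SLS coefficient on Schooling reported above

approx_pct = 100 * tau_hat             # log-point approximation
exact_pct = 100 * math.expm1(tau_hat)  # exact change: exp(tau_hat) - 1

print(f"approximation: {approx_pct:.2f}%, exact: {exact_pct:.2f}%")
```

For a coefficient this large the two readings differ by a few percentage points, so it is worth stating which one is being reported.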

<a id='section_4'></a>
# 4. Concluding Thoughts about Instrumental Variables

This section summarizes a few concluding thoughts about IVs. You may refer to [this paper](https://www.jstor.org/stable/2951620?seq=1#metadata_info_tab_contents) and [this paper](https://www.jstor.org/stable/2291629?seq=1#metadata_info_tab_contents) for further discussions.

- In general, IVs are typically found where there is **<font color="red">exogenous variation</font>** that leads to changes in the endogenous variable of interest, like those in our examples described above. 
- **Exclusion restriction** is generally **<font color="red">very hard to verify</font>**, especially when one only has observational data (i.e., not experiments). The key challenge is that the researcher is unable to tell if there are some **<font color="red">(unobserved) confounders</font>** that are correlated with the instrument $Z$ and the outcome $Y$ simultaneously.
- **<font color="red">Variations induced by randomized experiments</font>** can often be used as an IV. Because experiments are relatively cheap for a large-scale online platform, such IV analysis is very common in the **<font color="red">high-tech industry</font>**.
  - Example: To measure the satisfaction from using a product, use the randomized encouragement of adopting this product as an IV.
- **<font color="red">Monotonicity Assumption</font>**: $W_i(Z_i,X_i)$ is (weakly) monotonic in the same direction with respect to $Z_i$.
  - For example, in the case of education and future earning, for each individual $i$, being born in the 4th quarter never implies a shorter schooling time than being born in the first quarter.
  - Sometimes, we call the estimation results using IV the **<font color="red">Local Average Treatment Effect (LATE)</font>**, because IV only identifies the average effect of treatment on the **compliers**, i.e., those whose treatment status is affected by the instrument in the "right" direction. The effect of the treatment on other subjects is not identified: **Always-takers** (they always take the treatment regardless of the instrument), **Never-takers** (they never take the treatment regardless of the instrument), and **Defiers** (their treatment status is affected by the instrument in the "wrong" direction). Examples of these 4 types of subjects in the randomized encouragement case:
    - **<font color="red">Compliers:</font>** Those who would adopt the new feature if encouraged, but would not adopt the new feature otherwise.
    - **<font color="red">Always-takers:</font>** Those who would adopt the new feature regardless of whether being encouraged.
    - **<font color="red">Never-takers:</font>** Those who would not adopt the new feature regardless of whether being encouraged.
    - **<font color="red">Defiers:</font>** Those who would not adopt the new feature if encouraged, but would adopt the new feature otherwise.

The **monotonicity assumption** essentially means that there are **<font color="red">no defiers</font>** in the subjects.
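
The local nature of the IV estimate can be illustrated with a simulated randomized encouragement that contains no defiers. In this sketch the population shares, the type-specific effects (`2.0` for compliers, `5.0` for always-takers, `0.5` for never-takers), and all names are hypothetical assumptions chosen only to make the contrast visible.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

# Subject types; there are no defiers, so monotonicity holds.
types = rng.choice(["complier", "always", "never"], size=n, p=[0.4, 0.3, 0.3])
z = rng.integers(0, 2, size=n)  # randomized encouragement (instrument)

# Adoption: always-takers adopt, never-takers do not, compliers follow z.
w = np.where(types == "always", 1, np.where(types == "never", 0, z))

# Heterogeneous effect of adoption on satisfaction, by type.
effect = np.where(types == "complier", 2.0, np.where(types == "always", 5.0, 0.5))
baseline = np.where(types == "always", 1.0, 0.0)
y = baseline + effect * w + rng.normal(size=n)

# Wald / IV estimator with a binary instrument.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (w[z == 1].mean() - w[z == 0].mean())
average_effect = effect.mean()  # effect averaged over all subjects

print(f"IV estimate: {wald:.2f}, complier effect: 2.00, overall average: {average_effect:.2f}")
```

The IV estimate tracks the complier effect rather than the average over all subjects, which is why it is called local.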