In [1]:
# Import necessary packages
import sys 
import numpy as np
import pandas as pd
import statsmodels.api as sm
import sklearn
import scipy as sp
%matplotlib inline 
import matplotlib.pyplot as plt




In [2]:
df_Education  = pd.read_csv("Education.csv")
df_Education.describe()

Unnamed: 0,LogEarning,Schooling,QoB
count,10000.0,10000.0,10000.0
mean,14.906823,14.83707,2.4868
std,3.188847,3.115285,1.117386
min,0.958698,1.5,1.0
25%,12.771827,13.2,1.0
50%,14.919952,13.7,2.0
75%,17.095972,16.5,3.0
max,27.932792,28.5,4.0


The data set has 3 variables:

- **LogEarning $\in\mathbb R$:** Log of income;
- **Schooling $\in\mathbb R^+$:** Length of schooling, in years;
- **QoB $\in\{1,2,3,4\}$:** Quarter of birth.

We first directly regress **LogEarning** on **Schooling**:

In [3]:
# Direct OLS regression.

OLS = sm.OLS(endog = df_Education['LogEarning'], exog = sm.add_constant(df_Education['Schooling']))
result_OLS = OLS.fit()
print(result_OLS.summary())

                            OLS Regression Results                            
Dep. Variable:             LogEarning   R-squared:                       0.280
Model:                            OLS   Adj. R-squared:                  0.280
Method:                 Least Squares   F-statistic:                     3881.
Date:                Sat, 17 Apr 2021   Prob (F-statistic):               0.00
Time:                        14:51:23   Log-Likelihood:                -24145.
No. Observations:               10000   AIC:                         4.829e+04
Df Residuals:                    9998   BIC:                         4.831e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.8755      0.132     52.196      0.0



As discussed above, both **Schooling** and **LogEarning** are positively correlated with individual ability, so the simple OLS estimate overstates the true causal effect of **Schooling** on **LogEarning**. To obtain a consistent estimate, we use two-stage least squares (2SLS) with the quarter-of-birth dummies as instruments. The ``linearmodels`` package used below can be installed with ``pip install linearmodels``.
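The bias argument above can be illustrated with a small simulation. This is a minimal sketch on synthetic data, not the notebook's data set: the variable names, the true effect of 0.10, the confounding strengths, and the binary instrument `z` are all assumptions chosen for illustration. It runs naive OLS and then 2SLS "by hand" (regress schooling on the instrument, then regress earnings on the fitted schooling).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 0.10  # hypothetical causal effect of schooling on log earnings

# Unobserved ability raises both schooling and earnings (the confounder).
ability = rng.normal(size=n)
# Hypothetical binary instrument: shifts schooling, independent of ability.
z = rng.integers(0, 2, size=n)
schooling = 12 + 1.0 * z + 2.0 * ability + rng.normal(size=n)
log_earning = 1.0 + true_effect * schooling + 0.5 * ability + rng.normal(size=n)


def ols(y, x):
    """Return [intercept, slope] from regressing y on a constant and x."""
    X = np.column_stack([np.ones(len(x)), x])
    return np.linalg.lstsq(X, y, rcond=None)[0]


# Naive OLS: picks up the ability channel as well as the causal effect.
beta_ols = ols(log_earning, schooling)[1]

# 2SLS by hand: stage 1 predicts schooling from the instrument,
# stage 2 regresses the outcome on the predicted schooling.
a0, a1 = ols(schooling, z)
schooling_hat = a0 + a1 * z
beta_2sls = ols(log_earning, schooling_hat)[1]

print(f"true effect: {true_effect:.3f}")
print(f"OLS slope:   {beta_ols:.3f}")
print(f"2SLS slope:  {beta_2sls:.3f}")
```

The two-step version reproduces the 2SLS point estimate only; the standard errors from the second-stage regression are not valid, which is one reason to use a dedicated estimator such as ``IV2SLS`` in practice.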

In [7]:
# 2SLS

# First, we change QoB into dummy variables.

df_edu = pd.get_dummies(df_Education,columns=['QoB'],drop_first=True)
df_edu['const'] = 1 

df_edu.head()

Unnamed: 0,LogEarning,Schooling,QoB_2,QoB_3,QoB_4,const
0,10.896972,7.2,1,0,0,1
1,9.826995,13.2,1,0,0,1
2,14.418743,16.0,0,0,0,1
3,13.25013,19.0,0,0,0,1
4,14.373162,13.2,1,0,0,1


In [19]:
# Fit the first-stage OLS and run the weak-instrument test.

IVs = ['QoB_2','QoB_3','QoB_4']


First_Stage = sm.OLS(endog = df_edu['Schooling'], exog = sm.add_constant(df_edu[IVs]))
res_first_stage = First_Stage.fit()
print(res_first_stage.summary())


                            OLS Regression Results                            
Dep. Variable:              Schooling   R-squared:                       0.008
Model:                            OLS   Adj. R-squared:                  0.008
Method:                 Least Squares   F-statistic:                     27.93
Date:                Sat, 17 Apr 2021   Prob (F-statistic):           5.61e-18
Time:                        15:52:41   Log-Likelihood:                -25510.
No. Observations:               10000   AIC:                         5.103e+04
Df Residuals:                    9996   BIC:                         5.106e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         14.4964      0.062    234.909      0.0

Based on the [weak instrument test](https://scholar.harvard.edu/files/stock/files/testing_for_weak_instruments_in_linear_iv_regression.pdf), the [F-statistic](https://en.wikipedia.org/wiki/F-test) of the first-stage regression is 27.93, which exceeds the conventional rule-of-thumb threshold of 10, so the IVs satisfy the strong first-stage assumption.
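To make the first-stage F-statistic concrete, the sketch below computes it from residual sums of squares. The helper name `first_stage_f` and the synthetic data are illustrative assumptions, not part of the original notebook. It covers only the case used here, where the first stage contains a constant and the instruments and no other exogenous controls.

```python
import numpy as np


def first_stage_f(endog, instruments):
    """F-statistic for H0: all instrument coefficients are zero.

    Assumes the first stage has only a constant plus the instruments.
    """
    w = np.asarray(endog, dtype=float)
    Z = np.asarray(instruments, dtype=float)
    if Z.ndim == 1:
        Z = Z[:, None]
    n, q = Z.shape
    X = np.column_stack([np.ones(n), Z])
    coef = np.linalg.lstsq(X, w, rcond=None)[0]
    rss = np.sum((w - X @ coef) ** 2)   # unrestricted residual sum of squares
    tss = np.sum((w - w.mean()) ** 2)   # restricted model: constant only
    return ((tss - rss) / q) / (rss / (n - q - 1))


# Synthetic check: relevant instruments versus irrelevant ones.
rng = np.random.default_rng(1)
n = 5_000
z = rng.normal(size=(n, 3))
w_strong = z @ np.array([0.3, 0.2, 0.1]) + rng.normal(size=n)
w_weak = rng.normal(size=n)  # unrelated to the instruments

f_strong = first_stage_f(w_strong, z)
f_weak = first_stage_f(w_weak, z)
print(f"F with relevant instruments:   {f_strong:.1f}")
print(f"F with irrelevant instruments: {f_weak:.1f}")
```

On the notebook's data, `first_stage_f(df_edu['Schooling'], df_edu[IVs])` should agree with the F-statistic in the statsmodels summary, since that regression has no other regressors. If exogenous controls are added to the model, the relevant quantity is instead the partial F-statistic on the excluded instruments.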

In [20]:
# Fit a 2sls model.

from linearmodels.iv import IV2SLS


IV_model = IV2SLS(dependent=df_edu['LogEarning'], endog=df_edu['Schooling'],
                  exog=df_edu['const'], instruments=df_edu[IVs])
res_2sls = IV_model.fit()
print(res_2sls)

                          IV-2SLS Estimation Summary                          
Dep. Variable:             LogEarning   R-squared:                      0.2106
Estimator:                    IV-2SLS   Adj. R-squared:                 0.2105
No. Observations:               10000   F-statistic:                    7.4597
Date:                Sat, Apr 17 2021   P-value (F-stat)                0.0063
Time:                        15:56:55   Distribution:                  chi2(1)
Cov. Estimator:                robust                                         
                                                                              
                             Parameter Estimates                              
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
const          10.867     1.4796     7.3447     0.0000      7.9670      13.767
Schooling      0.2723     0.0997     2.7312     0.00

According to the 2SLS results, one additional year of schooling raises log earnings by 0.2723, i.e., it increases an individual's earnings by roughly **27.23%** under the usual log-point approximation.
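Because the dependent variable is in logs, the coefficient is a change in log points, and reading it directly as a percentage is an approximation that becomes less accurate as the coefficient grows. A short sketch of the exact conversion, using the 2SLS coefficient reported above (the variable names are illustrative):

```python
import math

beta_schooling = 0.2723  # 2SLS coefficient on Schooling from the summary above

# Reading the coefficient directly as a percentage (log-point approximation).
approx_pct = 100 * beta_schooling
# Exact percentage change implied by a log-linear model: exp(beta) - 1.
exact_pct = 100 * (math.exp(beta_schooling) - 1)

print(f"log-point approximation: {approx_pct:.2f}%")
print(f"exact percentage change: {exact_pct:.2f}%")
```

For a coefficient of this size the two figures differ by several percentage points, so it is worth stating which one is being reported.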

<a id='section_4'></a>
# 4. Concluding Thoughts about Instrumental Variables

This section summarizes a few concluding thoughts about IVs.

- IVs are typically found where there is exogenous variation that shifts the endogenous variable of interest, as in the examples described above.
- The exclusion restriction is generally very hard to verify, especially when one only has observational data (i.e., no experiments). The key challenge is that the researcher cannot rule out unobserved confounders that are correlated with the instrument $Z$ and the outcome $Y$ simultaneously, or channels through which $Z$ affects $Y$ other than the endogenous variable $W$.
- Variations induced by randomized experiments can often be used as an IV. Because experiments are relatively cheap for a large-scale online platform, such IV analysis is very common in the high-tech industry.
  - Example: To measure how using a product affects user satisfaction, use a randomized encouragement to adopt the product as an IV for adoption.
- **Monotonicity Assumption**: $W_i(Z_i,X_i)$ is (weakly) monotonic in the same direction with respect to $Z_i$.
  - For example, in the case of the effect of education on future earnings, for each individual $i$, being born in the 4th quarter implies at least as much schooling as being born in the 1st quarter.
  - IV estimates are often called the **Local Average Treatment Effect (LATE)**, because they identify only the average treatment effect for the **compliers**, i.e., those whose treatment status is affected by the instrument in the "right" direction. The effect of the treatment on the other subjects is not identified: **Always-takers** (they always take the treatment regardless of the instrument), **Never-takers** (they never take the treatment regardless of the instrument), and **Defiers** (their treatment status is affected by the instrument in the "wrong" direction; the monotonicity assumption rules them out). Examples of these 4 types of subjects in the randomized encouragement case:
    - **Compliers:** Those who would adopt the new feature if encouraged, but would not adopt the new feature otherwise.
    - **Always-takers:** Those who would adopt the new feature whether or not they are encouraged.
    - **Never-takers:** Those who would not adopt the new feature whether or not they are encouraged.
    - **Defiers:** Those who would not adopt the new feature if encouraged, but would adopt the new feature otherwise.
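The LATE interpretation can be checked on simulated data. The sketch below is an illustration under stated assumptions: the population shares, the type-specific treatment effects, and all variable names are hypothetical, and there are no defiers, so monotonicity holds by construction. It applies the Wald (IV) estimator to a randomized encouragement and compares it with the complier effect and with the population average effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Hypothetical population shares; no defiers, so monotonicity holds.
types = rng.choice(["complier", "always", "never"], size=n, p=[0.5, 0.2, 0.3])
z = rng.integers(0, 2, size=n)  # randomized encouragement (the instrument)

# Adoption W: always-takers adopt, never-takers do not, compliers follow Z.
w = np.where(types == "always", 1, np.where(types == "never", 0, z))

# Hypothetical heterogeneous treatment effects by type.
effect = np.select(
    [types == "complier", types == "always", types == "never"],
    [2.0, 5.0, 0.5],
)
y = 1.0 + effect * w + rng.normal(size=n)

# Wald estimator: effect of Z on Y divided by effect of Z on W.
itt_y = y[z == 1].mean() - y[z == 0].mean()
itt_w = w[z == 1].mean() - w[z == 0].mean()
late_hat = itt_y / itt_w

print(f"share of compliers (first stage): {itt_w:.3f}")
print(f"Wald / IV estimate:               {late_hat:.3f}")
print(f"true complier effect:             {2.0:.3f}")
print(f"population average effect:        {effect.mean():.3f}")
```

In this setup the IV estimate tracks the complier effect rather than the population average, because always-takers and never-takers behave identically in both encouragement arms and therefore cancel out of both differences.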