In [None]:
# === Environment Setup ===
import os, sys, math, time, random, json, textwrap, warnings
import numpy as np, pandas as pd, matplotlib.pyplot as plt
import seaborn as sns  # needed by the KDE plots in the weak-instrument simulation
import statsmodels.api as sm
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS
import ipywidgets as widgets
from IPython.display import display, Markdown

# --- Configuration ---
plt.style.use('seaborn-v0_8-whitegrid')
plt.rcParams.update({'font.size': 14, 'figure.figsize': (12, 8), 'figure.dpi': 150})
np.set_printoptions(suppress=True, linewidth=120, precision=4)

# --- Utility Functions ---
def note(msg, **kwargs):
    display(Markdown(f"<div class='alert alert-info'>📝 {textwrap.fill(msg, width=100)}</div>"))
def sec(title):
    print(f"\n{100*'='}\n| {title.upper()} |\n{100*'='}")

note("Environment initialized for Advanced Instrumental Variables.")

# Part 6: Econometrics
## Chapter 6.05: Instrumental Variables: Theory and Practice

### Introduction: Isolating Exogenous Variation

When a regressor $D$ is correlated with the error term $u$, we say it is **endogenous**, and OLS no longer yields a consistent estimate of its causal effect. Endogeneity, whether it arises from omitted variables, simultaneity, or measurement error, is a central problem in empirical economics. The method of **Instrumental Variables (IV)** is a cornerstone of modern econometrics for estimating causal effects in its presence. The core idea is to find a third variable, the **instrument ($Z$)**, that isolates a source of exogenous variation in the endogenous variable. The instrument must satisfy two core conditions:

1.  **Instrument Relevance:** The instrument must be correlated with the treatment, $\text{Cov}(Z, D) \neq 0$, typically because it is a cause of the treatment ($Z \rightarrow D$).
2.  **The Exclusion Restriction:** The instrument must affect the outcome *only* through its effect on the treatment. It cannot have a direct effect on $Y$, and it must be uncorrelated with any unobserved confounders, so that $\text{Cov}(Z, u) = 0$.
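
To see why these two conditions are enough, a short derivation in the simplest case may help. It assumes a linear structural equation $Y = \beta_0 + \beta_1 D + u$ with no controls and a constant effect $\beta_1$, which is a simplification. Taking the covariance of both sides with $Z$ gives

$$ \text{Cov}(Z, Y) = \beta_1 \, \text{Cov}(Z, D) + \text{Cov}(Z, u) $$

The exclusion restriction sets $\text{Cov}(Z, u) = 0$, and relevance guarantees $\text{Cov}(Z, D) \neq 0$, so we can divide:

$$ \beta_1 = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, D)} $$

If relevance fails, the ratio is undefined. If exclusion fails, the ratio converges to $\beta_1 + \text{Cov}(Z, u) / \text{Cov}(Z, D)$ rather than to $\beta_1$.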

This chapter provides a PhD-level treatment of IV, covering its derivation, the LATE framework, and methods for dealing with common pitfalls like weak instruments.

### 1. The IV Estimator: Two-Stage Least Squares (2SLS)
The most common IV estimator is **Two-Stage Least Squares (2SLS)**. It can be understood intuitively as a two-step procedure:

1.  **First Stage:** We purge the endogenous variable $D$ of its correlation with the error term. We do this by regressing $D$ on the instrument $Z$ and any exogenous controls $X$. This gives us the predicted values, $\hat{D}$. These predicted values represent the part of the variation in $D$ that is explained *only* by the exogenous variables.
    $$ D = \pi_0 + \pi_1 Z + \pi_2 X + v $$
2.  **Second Stage:** We run the original regression, but replace the endogenous variable $D$ with its predicted value from the first stage, $\hat{D}$.
    $$ Y = \beta_0 + \beta_1 \hat{D} + \beta_2 X + u $$
Because $\hat{D}$ is, by construction, a linear combination of the exogenous variables, it is uncorrelated with the error term $u$, and this second-stage regression yields a consistent estimate of the causal effect $\beta_1$.

**Important Note:** While this two-stage procedure is intuitive, one should **never run it manually**. The point estimates are correct, but the standard errors from the second-stage OLS are not: they are built from residuals computed with $\hat{D}$ rather than the actual $D$, so they estimate the wrong error variance. Always use specialized software (like `linearmodels` or `Stata`) that computes the correct 2SLS variance-covariance matrix.
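
The warning above can be illustrated with a minimal NumPy sketch. The data-generating process, the seed, and all variable names are assumptions made for illustration, and the standard errors use the simple homoskedastic formula. The sketch runs both stages by hand, then compares the naive second-stage standard error with one based on residuals that use the actual $D$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
z = rng.normal(size=n)                        # instrument
a = rng.normal(size=n)                        # unobserved confounder
d = 0.5 * z + a + rng.normal(size=n)          # endogenous regressor
y = 1.0 + 0.8 * d + a + rng.normal(size=n)    # true beta_1 = 0.8

def ols(X, target):
    """Return OLS coefficients from regressing target on X."""
    return np.linalg.lstsq(X, target, rcond=None)[0]

Z = np.column_stack([np.ones(n), z])          # constant plus instrument
X = np.column_stack([np.ones(n), d])          # constant plus actual D

# First stage: fitted values of D
d_hat = Z @ ols(Z, d)
X_hat = np.column_stack([np.ones(n), d_hat])

# Second stage: these coefficients are the 2SLS point estimates
beta = ols(X_hat, y)

# Naive residuals use d_hat; 2SLS residuals use the actual d
resid_naive = y - X_hat @ beta
resid_2sls = y - X @ beta

bread = np.linalg.inv(X_hat.T @ X_hat)[1, 1]
se_naive = np.sqrt(resid_naive @ resid_naive / (n - 2) * bread)
se_2sls = np.sqrt(resid_2sls @ resid_2sls / (n - 2) * bread)

print(f"beta_1 = {beta[1]:.3f}, naive SE = {se_naive:.4f}, 2SLS SE = {se_2sls:.4f}")
```

The two standard errors share the same point estimate and differ only in which residuals feed the error-variance estimate. In this particular design the naive version is too large, but the direction of the error depends on the data-generating process.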

### 2. Heterogeneous Effects and the LATE Framework
A crucial insight from Imbens and Angrist (1994) is that when the treatment effect is heterogeneous, IV does not recover the Average Treatment Effect (ATE). Instead, it recovers the **Local Average Treatment Effect (LATE)**.

We can divide the population into four groups based on their potential response to a binary instrument $Z$:
1.  **Compliers:** People who take the treatment if encouraged ($D(1)=1$) but not if unencouraged ($D(0)=0$). These are the people whose behavior is changed by the instrument.
2.  **Always-Takers:** People who always take the treatment, regardless of the instrument.
3.  **Never-Takers:** People who never take the treatment, regardless of the instrument.
4.  **Defiers:** People who do the opposite of what the instrument encourages. A key assumption for the LATE interpretation is that there are no defiers (**monotonicity**).

Under instrument independence, the exclusion restriction, relevance, and monotonicity, the IV estimator identifies the average treatment effect *only for the group of compliers*:
$$ \hat{\beta}_{IV} \xrightarrow{p} E[Y(1) - Y(0) \mid D(1) = 1, D(0) = 0] = \text{LATE} $$
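
A small simulation can make the LATE result concrete. The sketch below is illustrative only: the type shares, the effect sizes, and the variable names are assumptions, and defiers are excluded by construction. It computes the Wald estimator (the IV estimator with a binary instrument and no controls) and compares it with the complier effect and the population ATE.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Compliance types; no defiers, so monotonicity holds by construction
types = rng.choice(["complier", "always", "never"], size=n, p=[0.4, 0.3, 0.3])
z = rng.integers(0, 2, size=n)                # randomly assigned binary instrument

# Treatment actually taken: always-takers 1, never-takers 0, compliers follow z
d = np.where(types == "always", 1, np.where(types == "never", 0, z))

# Heterogeneous treatment effects by type (illustrative values)
effect = np.select(
    [types == "complier", types == "always", types == "never"],
    [1.0, 3.0, 0.5],
)
y = rng.normal(size=n) + effect * d           # observed outcome

# Wald estimator: reduced form divided by first stage
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
wald = reduced_form / first_stage

late = effect[types == "complier"].mean()     # average effect among compliers
ate = effect.mean()                           # average effect in the population

print(f"Wald = {wald:.3f}, LATE = {late:.3f}, ATE = {ate:.3f}")
```

The Wald estimate should track the complier effect rather than the ATE, because always-takers and never-takers do not change their treatment status with $Z$ and therefore drop out of both the numerator and the denominator.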

In [None]:
sec("Case Study: Angrist and Krueger (1991) Returns to Education")
try:
    ak91_df = sm.datasets.get_rdataset("ak91", "ivmodel").data
    ak91_df['log_wage'] = np.log(ak91_df['wage'])
    ak91_df['qob_is_4'] = (ak91_df['qob'] == 4).astype(int)
    note("Loaded Angrist and Krueger (1991) dataset.")
except Exception as e:
    ak91_df = None; note(f"Could not load dataset. Skipping case study. Error: {e}")

if ak91_df is not None:
    ols_model = smf.ols('log_wage ~ school', data=ak91_df).fit()
    iv_model = IV2SLS.from_formula('log_wage ~ 1 + [school ~ qob_is_4]', data=ak91_df).fit()
    print("--- OLS Results ---"); print(ols_model.summary().tables[1])
    print("\n--- IV (2SLS) Results ---"); print(iv_model)
    note(f"The OLS estimate suggests a return of {ols_model.params['school']*100:.1f}%. The IV estimate is {iv_model.params['school']*100:.1f}%. The LATE interpretation suggests this is the return to schooling for the 'compliers' - those whose schooling was affected by their birth quarter.")

### 3. Weak Instruments
A critical problem in applied IV is the presence of **weak instruments**. If the instrument is only weakly correlated with the endogenous variable, the IV estimator has poor finite-sample properties:
1.  **Bias:** The 2SLS estimator is biased towards the OLS estimator.
2.  **Non-Normal Distribution:** The sampling distribution is not well-approximated by a normal distribution, making standard t-tests unreliable.

**Detection:** The standard diagnostic is the **first-stage F-statistic**. A common rule of thumb (Staiger & Stock, 1997) is that an F-statistic **below 10** signals a potential weak instrument problem.
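
With a single instrument and no controls, the first-stage F-statistic is the squared t-statistic on the instrument. The sketch below computes it by hand under homoskedasticity. The helper `first_stage_f`, the coefficients, and the seed are hypothetical choices for illustration, and robust variants of the statistic would differ.

```python
import numpy as np

def first_stage_f(d, z):
    """Homoskedastic first-stage F for one instrument z, with an intercept."""
    n = len(d)
    Z = np.column_stack([np.ones(n), z])
    pi_hat = np.linalg.lstsq(Z, d, rcond=None)[0]
    resid = d - Z @ pi_hat
    sigma2 = resid @ resid / (n - 2)                  # error variance estimate
    var_pi1 = sigma2 * np.linalg.inv(Z.T @ Z)[1, 1]   # variance of the slope
    return pi_hat[1] ** 2 / var_pi1                   # F = t^2 with one instrument

rng = np.random.default_rng(2)
n = 200
z = rng.normal(size=n)
ability = rng.normal(size=n)
noise = rng.normal(size=n)

# Same confounder and noise draws, different instrument strength
f_weak = first_stage_f(0.05 * z + 1.2 * ability + noise, z)
f_strong = first_stage_f(0.8 * z + 1.2 * ability + noise, z)

print(f"F (weak) = {f_weak:.2f}, F (strong) = {f_strong:.2f}")
```

With a first-stage coefficient of 0.05 the statistic will typically fall well below the threshold of 10, while a coefficient of 0.8 should clear it comfortably at this sample size.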

In [None]:
sec("Interactive: The Weak Instrument Problem")
def run_weak_iv_sim(instrument_strength=0.1, n_sims=1000):
    true_beta = 0.8; ols_estimates, iv_estimates, f_stats = [], [], []
    for _ in range(n_sims):
        n = 200; ability = np.random.normal(0, 1, n); instrument = np.random.normal(0, 1, n)
        education = instrument_strength * instrument + 1.2 * ability + np.random.normal(0, 1, n)
        log_wage = true_beta * education + 1.0 * ability + np.random.normal(0, 1, n)
        df = pd.DataFrame({'log_wage':log_wage, 'educ':education, 'instr':instrument})
        ols = smf.ols('log_wage ~ educ', data=df).fit()
        iv = IV2SLS.from_formula('log_wage ~ 1 + [educ ~ instr]', df).fit()
        ols_estimates.append(ols.params['educ']); iv_estimates.append(iv.params['educ'])
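        # first_stage.diagnostics is a DataFrame with one row per endogenous variable
        f_stats.append(float(iv.first_stage.diagnostics['f.stat'].iloc[0]))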
    
    plt.figure(figsize=(12, 6))
    sns.kdeplot(ols_estimates, label=f'OLS Estimates (Mean={np.mean(ols_estimates):.2f})', fill=True)
    sns.kdeplot(iv_estimates, label=f'IV Estimates (Mean={np.mean(iv_estimates):.2f})', fill=True)
    plt.axvline(true_beta, color='k', ls='--', label=f'True Beta = {true_beta}')
    plt.title('Distribution of OLS vs. IV Estimates'); plt.legend()
    plt.show()
    note(f"With instrument strength = {instrument_strength}, the average First-Stage F-statistic is: {np.mean(f_stats):.2f}. When the instrument is weak, the IV estimator's distribution is wide and biased towards the OLS estimate.")

widgets.interact(run_weak_iv_sim, instrument_strength=widgets.FloatSlider(min=0.0, max=0.5, step=0.02, value=0.1));

### 4. The Control Function Approach
An alternative to 2SLS is the **control function** approach. Instead of purging the endogeneity from $D$, this method attempts to model the source of the endogeneity directly and include it in the regression as a control.

**Procedure:**
1.  Assume the endogeneity arises from the first-stage relationship $D = \pi_0 + \pi_1 Z + v$, where $v$ is correlated with the structural error $u$. Assume $u = \rho v + \epsilon$, where $\epsilon$ is uncorrelated with both $Z$ and $v$.
2.  **First Stage:** Run the regression of $D$ on $Z$ and obtain the residuals, $\hat{v}$.
3.  **Second Stage:** Run the original regression of $Y$ on $D$, but now include the first-stage residuals $\hat{v}$ as an additional regressor:
    $$ Y = \beta_0 + \beta_1 D + \delta \hat{v} + \epsilon $$ 

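In this regression, the coefficient $\beta_1$ is a consistent estimate of the causal effect. The coefficient $\delta$ on the residual is an estimate of $\rho$, and a t-test on it is a **test for endogeneity**. The usual t-test on $\delta$ is valid under the null hypothesis of exogeneity ($\rho = 0$), but the OLS standard error reported for $\beta_1$ ignores the fact that $\hat{v}$ is itself estimated, so it should be corrected, for example by bootstrapping both stages together.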
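
The procedure can be sketched in a few lines of NumPy. The simulated data and all variable names are assumptions for illustration. In this linear, just-identified setting the control function coefficient on $D$ coincides numerically with the 2SLS estimate, which the sketch checks against the ratio-of-covariances form of the IV estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
z = rng.normal(size=n)                    # instrument
a = rng.normal(size=n)                    # unobserved confounder
d = 0.5 * z + a + rng.normal(size=n)      # endogenous regressor, v = a + noise
y = 0.8 * d + a + rng.normal(size=n)      # true beta_1 = 0.8, u = a + noise

ones = np.ones(n)
Z = np.column_stack([ones, z])

# First stage: regress d on z and keep the residuals
pi_hat = np.linalg.lstsq(Z, d, rcond=None)[0]
v_hat = d - Z @ pi_hat

# Second stage: regress y on d and the first-stage residuals
W = np.column_stack([ones, d, v_hat])
b0, beta_cf, delta = np.linalg.lstsq(W, y, rcond=None)[0]

# 2SLS point estimate for comparison (one instrument, no controls)
beta_2sls = np.cov(z, y)[0, 1] / np.cov(z, d)[0, 1]

print(f"control function beta_1 = {beta_cf:.4f}, 2SLS beta_1 = {beta_2sls:.4f}")
print(f"delta (estimate of rho) = {delta:.3f}")
```

In this design $\text{Cov}(u, v) = 1$ and $\text{Var}(v) = 2$, so $\rho = 0.5$ and the estimated $\delta$ should land near that value. A $\delta$ clearly different from zero is what the endogeneity test picks up.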

In [None]:
sec("Control Function Example and Endogeneity Test")
if ak91_df is not None:
    # 1. First Stage
    first_stage = smf.ols('school ~ qob_is_4', data=ak91_df).fit()
    ak91_df['resid'] = first_stage.resid
    
    # 2. Second Stage
    control_fn_model = smf.ols('log_wage ~ school + resid', data=ak91_df).fit()
    
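    print(control_fn_model.summary().tables[1])
    # Read the endogeneity test off the fitted model instead of assuming its outcome
    p_resid = control_fn_model.pvalues['resid']
    verdict = "reject" if p_resid < 0.05 else "cannot reject"
    note(f"The coefficient on 'school' is the control function estimate of the causal effect. The p-value on 'resid' is {p_resid:.3f}, so at the 5% level we {verdict} the null hypothesis that schooling is exogenous in this specification.")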
else:
    note("Dataset not available.")

### 5. Exercises

1.  **IV Assumptions:** For the Angrist and Krueger (1991) study, explain in detail what the relevance and exclusion restriction assumptions imply. Why might the exclusion restriction be violated?

2.  **LATE Interpretation:** In the Angrist and Krueger study, who are the 'compliers'? Who are the 'never-takers' and 'always-takers'? Why might the LATE be different from the ATE for the returns to schooling?

3.  **Testing for Weak Instruments:** Using the `linearmodels` library on the Angrist and Krueger data, access the first-stage regression results from the `iv_model` object. What is the F-statistic for the relevance of the quarter-of-birth instrument? Based on the rule of thumb, is this instrument considered weak?

4.  **Control Function for Endogeneity Testing:** Using the synthetic data from the weak instrument simulation, implement the control function approach. Run the first stage, get the residuals, and include them in the second stage. Perform a t-test on the coefficient of the residual. Does the test correctly detect endogeneity? How does the estimated treatment effect compare to the OLS and 2SLS estimates?

5.  **Invalid Instrument:** Suppose you use a bad instrument that violates the exclusion restriction. Specifically, assume the instrument $Z$ has a direct effect on the outcome $Y$. Modify the weak instrument simulation code to include such a direct effect. How does this affect the bias of the IV estimator?