# 03 Instrumental Variables (2SLS)

Endogeneity, instruments, and two-stage least squares (2SLS).


## Table of Contents
- [Simulate endogeneity](#simulate-endogeneity)
- [OLS vs 2SLS](#ols-vs-2sls)
- [First-stage + weak IV checks](#first-stage-weak-iv-checks)
- [Interpretation + limitations](#interpretation-limitations)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)


## Why This Notebook Matters
Causal notebooks focus on **identification**: what would have to be true for a coefficient to represent a causal effect.
You will practice:
- building a county-year panel,
- fixed effects (TWFE),
- clustered standard errors,
- DiD + event studies,
- IV/2SLS.


## Prerequisites (Quick Self-Check)
- Completed Part 02 (regression + robust SE).
- Basic familiarity with panels (same unit over time) and the idea of identification assumptions.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can explain what you built and why each step exists.
- You can run your work end-to-end without undefined variables.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Treating regression output as causal without stating identification assumptions.
- Using non-clustered SE when shocks are correlated within groups (e.g., states).

## Quick Fixes (When You Get Stuck)
- If you see `ModuleNotFoundError`, re-run the bootstrap cell and restart the kernel; make sure `PROJECT_ROOT` is the repo root.
- If a `data/processed/*` file is missing, either run the matching build script (see guide) or use the notebook’s `data/sample/*` fallback.
- If results look “too good,” suspect leakage; re-check shifts, rolling windows, and time splits.
- If a model errors, check dtypes (`astype(float)`) and missingness (`dropna()` on required columns).

## Matching Guide
- `docs/guides/06_causal/03_instrumental_variables_2sls.md`



## How To Use This Notebook
- Work section-by-section; don’t skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2–4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/06_causal/03_instrumental_variables_2sls.md`) for the math, assumptions, and deeper context.



<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.



In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT



## Goal
Practice IV/2SLS by simulating a classic endogeneity problem.

We use synthetic data because the true coefficient is known: you can see the OLS bias directly and check that IV recovers the truth when its assumptions hold by construction.



## Primer: Panel + IV regression with `linearmodels` (FE, clustered SE, 2SLS)

This repo uses:
- `statsmodels` for classic OLS inference patterns, and
- `linearmodels` for **panel fixed effects** and **instrumental variables** (IV/2SLS).

The goal of this primer is to make you productive quickly (with the *minimum* theory needed to use the tools correctly). Deep math lives in the guides.

### Why `linearmodels`?

`linearmodels` provides clean APIs for:
- `PanelOLS`: fixed effects / TWFE
- `IV2SLS`: two-stage least squares

and it handles some panel-specific details (like absorbing FE) more naturally than `statsmodels`.

### Panel data shape (the #1 requirement)

Most panel estimators expect a **MultiIndex**:
- level 0: entity (e.g., county `fips`)
- level 1: time (e.g., `year`)

```python
# df has columns: fips, year, y, x1, x2, state, ...
df = df.copy()
df["fips"] = df["fips"].astype(str)
df["year"] = df["year"].astype(int)
df = df.set_index(["fips", "year"]).sort_index()
```

**Expected output / sanity check**
- `df.index.nlevels == 2`
- `df.index.is_monotonic_increasing` is `True`
- no duplicate index pairs: `df.index.duplicated().any()` is `False`
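
The three checks above can be bundled into a small helper. This is a minimal sketch on a made-up toy panel; `toy` and `check_panel_index` are hypothetical names for illustration, not part of the repo.

```python
import pandas as pd

# Hypothetical toy panel: 2 counties x 3 years (values are made up).
toy = pd.DataFrame({
    "fips": ["01001", "01001", "01001", "01003", "01003", "01003"],
    "year": [2021, 2020, 2022, 2020, 2022, 2021],
    "y": [1.0, 0.9, 1.2, 2.0, 2.3, 2.1],
})
toy = toy.set_index(["fips", "year"]).sort_index()


def check_panel_index(frame: pd.DataFrame) -> None:
    """Raise AssertionError if the frame is not a clean (entity, time) panel."""
    assert frame.index.nlevels == 2, "expected a two-level MultiIndex"
    assert frame.index.is_monotonic_increasing, "index is not sorted"
    assert not frame.index.duplicated().any(), "duplicate (entity, time) pairs"


check_panel_index(toy)

# Rows per entity; equal counts suggest a balanced panel.
obs_per_entity = toy.groupby(level="fips").size()
print(obs_per_entity)
```

Running the helper right after `set_index(...).sort_index()` catches duplicate rows from merges before they reach the estimator.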

### TWFE model (PanelOLS)

Econometric form:

$$
Y_{it} = X_{it}'\beta + \alpha_i + \gamma_t + \varepsilon_{it}
$$

In code:

```python
from linearmodels.panel import PanelOLS
import statsmodels.api as sm

y = df["y"].astype(float)
X = df[["x1", "x2"]].astype(float)
X = sm.add_constant(X, has_constant="add")

res = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(cov_type="robust")
print(res.summary)
```
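
To see what the entity and time effects do, here is a minimal sketch that reproduces the TWFE slope by hand on a balanced synthetic panel. It assumes a balanced panel (two-way demeaning is only exact in that case); all names and coefficients below are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_entities, n_periods, beta = 50, 8, 2.0

# Balanced synthetic panel: every entity is observed in every period.
idx = pd.MultiIndex.from_product(
    [range(n_entities), range(n_periods)], names=["entity", "time"]
)
entity_ids = idx.get_level_values("entity").to_numpy()
time_ids = idx.get_level_values("time").to_numpy()

alpha = rng.normal(size=n_entities)[entity_ids]  # entity effect alpha_i
gamma = rng.normal(size=n_periods)[time_ids]     # time effect gamma_t

# x is correlated with both effects, so pooled OLS is biased.
x = alpha + gamma + rng.normal(size=len(idx))
y = beta * x + 3.0 * alpha + gamma + 0.5 * rng.normal(size=len(idx))
panel = pd.DataFrame({"x": x, "y": y}, index=idx)


def two_way_demean(s: pd.Series) -> pd.Series:
    """Subtract entity and time means, add back the grand mean."""
    entity_mean = s.groupby(level="entity").transform("mean")
    time_mean = s.groupby(level="time").transform("mean")
    return s - entity_mean - time_mean + s.mean()


x_w = two_way_demean(panel["x"])
y_w = two_way_demean(panel["y"])

beta_within = float((x_w * y_w).sum() / (x_w ** 2).sum())
beta_pooled = float(np.polyfit(panel["x"].to_numpy(), panel["y"].to_numpy(), 1)[0])
print("pooled OLS slope:", beta_pooled)
print("two-way within slope:", beta_within)
```

The within slope uses only variation left after removing $\alpha_i$ and $\gamma_t$, which is the variation `PanelOLS` uses when both effects are switched on.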

### Clustered SE (common in applied panel/DiD work)

If errors are correlated within clusters (e.g., state-level shocks), use clustered SE:

```python
clusters = df["state"]  # must align row-for-row with y/X index

res_cl = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(
  cov_type="clustered",
  clusters=clusters,
)
```

**Expected output / sanity check**
- clustered SE are often larger than robust SE (not guaranteed, but common)
- always report the number of clusters: `clusters.nunique()`
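
The sketch below shows why clustering matters by computing a cluster-robust sandwich by hand and comparing it with a plain heteroskedasticity-robust one. It is an illustration only: it omits the finite-sample corrections that libraries apply, so the numbers will not match `linearmodels` output exactly, and the data-generating process is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, per_cluster = 40, 25
groups = np.repeat(np.arange(n_clusters), per_cluster)

# Regressor and error both share a cluster-level component.
x = rng.normal(size=n_clusters)[groups] + 0.5 * rng.normal(size=groups.size)
err = rng.normal(size=n_clusters)[groups] + rng.normal(size=groups.size)
y = 1.0 + 0.5 * x + err

X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
bread = np.linalg.inv(X.T @ X)

# Per-observation scores x_i * resid_i.
scores = X * resid[:, None]

# HC0 "meat": sum of outer products of the individual scores.
meat_hc0 = scores.T @ scores

# Cluster "meat": sum scores within each cluster first, then take outer products.
cluster_scores = np.zeros((n_clusters, X.shape[1]))
np.add.at(cluster_scores, groups, scores)
meat_cl = cluster_scores.T @ cluster_scores

se_hc0 = np.sqrt(np.diag(bread @ meat_hc0 @ bread))
se_cl = np.sqrt(np.diag(bread @ meat_cl @ bread))
print("robust SE (slope):   ", se_hc0[1])
print("clustered SE (slope):", se_cl[1])
print("number of clusters:  ", n_clusters)
```

Summing scores within clusters before squaring keeps the within-cluster cross terms that the observation-level formula drops.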

### IV / 2SLS (IV2SLS)

Structural equation (endogeneity motivation):
$$
Y = \beta X + W'\delta + u, \quad \mathrm{Cov}(X,u)\neq 0
$$

In code (one endogenous regressor):

```python
from linearmodels.iv import IV2SLS
import statsmodels.api as sm

y = df["y"].astype(float)
endog = df[["x_endog"]].astype(float)
exog = sm.add_constant(df[["x_exog1", "x_exog2"]].astype(float), has_constant="add")
instr = df[["z1", "z2"]].astype(float)

res_iv = IV2SLS(y, exog, endog, instr).fit(cov_type="robust")
print(res_iv.summary)
```

**Expected output / sanity check**
- `res_iv.params` contains coefficients for exog + endogenous variables
- `res_iv.first_stage` (if printed) shows instrument relevance diagnostics

### Common pitfalls (and quick fixes)

- **MultiIndex mismatch:** if `clusters` is not aligned to the same index as `y/X`, you’ll get errors or wrong results.
  - Fix: construct clusters from the same `df` after indexing/sorting.
- **Non-numeric dtypes:** string or object columns in `X` can raise errors or be handled in ways you did not intend.
  - Fix: `astype(float)` on model columns.
- **Missing data:** panels often have missing rows after merges/transforms.
  - Fix: build a modeling table with `.dropna()` for required columns.
- **Too few clusters:** cluster-robust inference is fragile with very small cluster counts.
  - Fix: treat p-values as fragile; report cluster count; consider alternative designs.
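
A minimal sketch of the dtype, missing-data, and alignment fixes above, using a made-up frame. `raw`, `model_cols`, and `model_df` are hypothetical names; adapt them to your own table.

```python
import numpy as np
import pandas as pd

# Made-up raw data: numbers stored as strings, plus missing values.
raw = pd.DataFrame({
    "y": ["1.0", "2.5", None, "4.0"],
    "x1": [0.1, np.nan, 0.3, 0.4],
    "state": ["AL", "AL", "GA", "GA"],
})

model_cols = ["y", "x1"]

# Drop rows missing any required column, then force numeric dtypes.
model_df = raw.dropna(subset=model_cols).copy()
model_df = model_df.astype({col: float for col in model_cols})

# Build clusters from the SAME filtered frame so the rows stay aligned.
clusters = model_df["state"]

print(model_df.dtypes)
print("rows kept:", len(model_df), "of", len(raw))
print("clusters:", clusters.nunique())
```

Deriving `y`, `X`, and `clusters` from one filtered table is the simplest way to avoid index mismatches.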


<a id="simulate-endogeneity"></a>
## Simulate endogeneity

### Background
Endogeneity means your regressor $x$ is correlated with the error term $\varepsilon$.
That violates the core OLS condition $E[\varepsilon\mid x]=0$ and typically biases OLS.

We simulate endogeneity by constructing a hidden confounder $u$ that affects both $x$ and $y$.
We then construct an instrument $z$ that shifts $x$ but (by design) does not directly shift $y$.

### What you should see
- `x` is correlated with the confounder-driven error component.
- `z` is correlated with `x` (relevance).
- `z` is not directly in the structural equation for `y` (exclusion in this synthetic setup).
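
The three bullets above can be checked numerically. This sketch uses the same coefficients as the simulation cell in this section. Note that the check on `eps` is only possible here because the data are simulated; in real data the error term is unobserved, so exclusion cannot be verified this way.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

z = rng.normal(size=n)                        # instrument
u = rng.normal(size=n)                        # hidden confounder
x = 0.8 * z + 0.8 * u + rng.normal(size=n)    # endogenous regressor
eps = 0.8 * u + rng.normal(size=n)            # structural error, shares u with x

corr_x_eps = np.corrcoef(x, eps)[0, 1]  # endogeneity: should be clearly non-zero
corr_z_x = np.corrcoef(z, x)[0, 1]      # relevance: should be clearly non-zero
corr_z_eps = np.corrcoef(z, eps)[0, 1]  # exogeneity: should be near zero

print("corr(x, eps):", corr_x_eps)
print("corr(z, x):  ", corr_z_x)
print("corr(z, eps):", corr_z_eps)
```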

### Interpretation prompts
- In one sentence, explain why OLS is biased here.
- Write the relevance and exclusion conditions in words for this simulation.

### Goal
Create data where:
- x is correlated with the error term (endogenous)
- z shifts x but not y directly (instrument)



### Your Turn: Simulate (y, x, z)


In [None]:
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000

# Instrument
z = rng.normal(size=n)

# Hidden confounder
u = rng.normal(size=n)

# Endogenous regressor: depends on z and u
x = 0.8*z + 0.8*u + rng.normal(size=n)

# Error term correlated with u
eps = 0.8*u + rng.normal(size=n)

beta_true = 1.5
y = beta_true * x + eps

df = pd.DataFrame({'y': y, 'x': x, 'z': z})
df.head()



<a id="ols-vs-2sls"></a>
## OLS vs 2SLS

### Background
OLS uses all variation in $x$, including the endogenous part correlated with the error.
2SLS replaces $x$ with the part predicted by $z$ (instrumented variation).
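
The two stages can be written out by hand with plain least squares. This is a sketch for intuition only, not the repo's `fit_iv_2sls` implementation: the point estimate matches 2SLS, but standard errors taken from a naive second-stage OLS would be wrong, so use a dedicated IV routine for inference. The helper `ols_coef` is a hypothetical name.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + 0.8 * u + rng.normal(size=n)
eps = 0.8 * u + rng.normal(size=n)
beta_true = 1.5
y = beta_true * x + eps


def ols_coef(X: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Least-squares coefficients of target on the columns of X."""
    return np.linalg.lstsq(X, target, rcond=None)[0]


const = np.ones(n)

# Stage 1: project x on the instrument (plus a constant).
Z = np.column_stack([const, z])
x_hat = Z @ ols_coef(Z, x)

# Stage 2: regress y on the fitted values from stage 1.
beta_2sls = ols_coef(np.column_stack([const, x_hat]), y)[1]

# Naive OLS for comparison.
beta_ols = ols_coef(np.column_stack([const, x]), y)[1]

# With one instrument, 2SLS equals the ratio cov(z, y) / cov(z, x).
beta_ratio = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print("OLS  :", beta_ols)
print("2SLS :", beta_2sls)
print("ratio:", beta_ratio)
```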

### What you should see
- OLS estimate differs from `beta_true` (bias).
- IV/2SLS estimate is closer to `beta_true` (in this synthetic world).

### Interpretation prompts
- Which direction is the OLS bias and why (link it to how you constructed the confounder)?
- Why does IV move the estimate toward the truth in this setup?

### Goal
Compare naive OLS (biased) to IV/2SLS.



### Your Turn: Fit OLS and 2SLS


In [None]:
import statsmodels.api as sm
from src.causal import fit_iv_2sls

# OLS
ols = sm.OLS(df['y'], sm.add_constant(df[['x']], has_constant='add')).fit()
print('OLS beta:', float(ols.params['x']))

# 2SLS
iv = fit_iv_2sls(df, y_col='y', x_endog='x', x_exog=[], z_cols=['z'])
print('IV beta :', float(iv.params['x']))

iv.summary



<a id="first-stage-weak-iv-checks"></a>
## First-stage + weak IV checks

### Background
A valid instrument must be relevant.
If $z$ barely predicts $x$, 2SLS can be unstable and misleading (weak instruments).

### What you should see
- a first-stage relationship where `z` helps explain `x`.
- a discussion of instrument strength (even informally).
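
One informal way to discuss strength is the first-stage F-statistic on the instrument. The sketch below computes it by hand for a single instrument under homoskedastic errors (where it equals the squared t-statistic) and contrasts a strong and a weak first stage. `first_stage_f` is a hypothetical helper, and the weak coefficient of 0.02 is an arbitrary choice for illustration. A commonly cited rule of thumb treats F below about 10 as a warning sign.

```python
import numpy as np


def first_stage_f(x: np.ndarray, z: np.ndarray) -> float:
    """F-statistic for one instrument in x ~ const + z (homoskedastic errors)."""
    n = x.size
    Z = np.column_stack([np.ones(n), z])
    coef, *_ = np.linalg.lstsq(Z, x, rcond=None)
    resid = x - Z @ coef
    sigma2 = resid @ resid / (n - Z.shape[1])
    cov = sigma2 * np.linalg.inv(Z.T @ Z)
    # With a single restriction, F is the squared t-statistic.
    return float(coef[1] ** 2 / cov[1, 1])


rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
noise = rng.normal(size=n)

x_strong = 0.8 * z + 0.8 * u + noise   # same first stage as the simulation
x_weak = 0.02 * z + 0.8 * u + noise    # instrument barely moves x

f_strong = first_stage_f(x_strong, z)
f_weak = first_stage_f(x_weak, z)
print("first-stage F (strong):", f_strong)
print("first-stage F (weak):  ", f_weak)
```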

### Interpretation prompts
- What would happen to 2SLS if `z` were only weakly related to `x`?
- Which parts of IV validity are testable from the data, and which are not?

### Goal
Inspect the first stage and discuss instrument strength.



### Your Turn: Inspect first stage


In [None]:
# TODO: Explore first-stage outputs.
# Hint: `iv.first_stage` is usually informative.
iv.first_stage



<a id="interpretation-limitations"></a>
## Interpretation + limitations

Write 5-8 sentences on:
- relevance and exclusion in this synthetic setup
- what would break IV in real data
- why IV identifies a local effect when effects are heterogeneous (LATE intuition)
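
For the LATE bullet, a small simulation can help build intuition. The sketch below is a stylized setup that is not part of this notebook's main simulation: a binary instrument, a binary treatment, three hypothetical compliance types, no defiers (monotonicity), and made-up effect sizes. Under those assumptions the IV (Wald) ratio tracks the average effect among compliers rather than the population average.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

z = rng.integers(0, 2, size=n)  # binary instrument, randomly assigned

# Compliance types: 0 = never-taker, 1 = complier, 2 = always-taker.
types = rng.choice(3, size=n, p=[0.3, 0.5, 0.2])

# Treatment: always-takers take it, compliers follow z, never-takers never do.
d = np.where(types == 2, 1, np.where(types == 1, z, 0))

# Heterogeneous treatment effects by type (made-up values).
effect = np.where(types == 1, 1.0, np.where(types == 2, 4.0, 3.0))
y = effect * d + rng.normal(size=n)

# Wald / IV estimate: reduced form divided by first stage.
reduced_form = y[z == 1].mean() - y[z == 0].mean()
first_stage = d[z == 1].mean() - d[z == 0].mean()
wald = reduced_form / first_stage

complier_effect = effect[types == 1].mean()
average_effect = effect.mean()

print("Wald (IV) estimate:    ", wald)
print("complier average effect:", complier_effect)
print("population average:     ", average_effect)
```

Only compliers change treatment when `z` changes, so they are the only group contributing to both the reduced form and the first stage.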



<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2-3 sentences summarizing what you verified.



In [None]:
import pandas as pd

# Expected output: (see notebook front matter)
# TODO: If you created a panel DataFrame, verify the indexing + core columns.
# Example (adjust variable names):
# assert isinstance(panel.index, pd.MultiIndex)
# assert panel.index.names[:2] == ['fips', 'year']
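# After set_index(['fips', 'year']) these values live in the index, not in columns:
# years = panel.index.get_level_values('year').astype(int)
# assert ((years >= 1900) & (years <= 2100)).all()
# assert (panel.index.get_level_values('fips').astype(str).str.len() == 5).all()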
#
# TODO: Write 2-3 sentences:
# - What is the identification assumption for your causal estimate?
# - What diagnostic/falsification did you run?
...



## Extensions (Optional)
- Try one additional variant beyond the main path (different features, different split, different model).
- Write down what improved, what got worse, and your hypothesis for why.



## Reflection
- What did you assume implicitly (about timing, availability, stationarity, or costs)?
- If you had to ship this model, what would you monitor?



<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Simulate endogeneity</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 03_instrumental_variables_2sls — Simulate endogeneity
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)          # instrument
u = rng.normal(size=n)          # unobserved confounder

x = 0.8*z + 0.8*u + rng.normal(size=n)  # endogenous regressor
eps = 0.8*u + rng.normal(size=n)        # error correlated with x

beta_true = 1.5
y = beta_true * x + eps

df = pd.DataFrame({'y': y, 'x': x, 'z': z})
df.head()
```

</details>

<details><summary>Solution: OLS vs 2SLS</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 03_instrumental_variables_2sls — OLS vs 2SLS
import statsmodels.api as sm
from src.causal import fit_iv_2sls

ols = sm.OLS(df['y'], sm.add_constant(df[['x']], has_constant='add')).fit()
print('OLS beta:', float(ols.params['x']))

iv = fit_iv_2sls(df, y_col='y', x_endog='x', x_exog=[], z_cols=['z'])
print('IV beta :', float(iv.params['x']))
```

</details>

<details><summary>Solution: First-stage + weak IV checks</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 03_instrumental_variables_2sls — First-stage + weak IV checks
# Inspect first-stage output (instrument strength).
iv.first_stage
```

</details>

<details><summary>Solution: Interpretation + limitations</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 03_instrumental_variables_2sls — Interpretation + limitations
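# Write 5-8 sentences on:
# - relevance + exclusion in your simulated setup
# - why IV can fix endogeneity here
# - what would break IV in real data
# - why IV identifies a local effect when effects are heterogeneous (LATE intuition)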
```

</details>

