# 02a Endogeneity: Sources and Consequences

Why OLS can fail and how to recognize it.

## Table of Contents
- [What is endogeneity?](#what-is-endogeneity)
- [Source 1: Omitted variable bias (OVB)](#source-1-omitted-variable-bias-ovb)
- [Source 2: Measurement error](#source-2-measurement-error)
- [Source 3: Simultaneity / reverse causality](#source-3-simultaneity-reverse-causality)
- [What can we do about endogeneity?](#what-can-we-do-about-endogeneity)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)

## Why This Notebook Matters
Causal notebooks focus on **identification**: what would have to be true for a coefficient to represent a causal effect.
Endogeneity is the single most important threat to causal inference in observational data.
If you cannot recognize it, every regression you run risks producing misleading conclusions.

You will practice:
- defining endogeneity formally and connecting it to OLS bias,
- simulating each of the three classical sources (OVB, measurement error, simultaneity),
- observing *how* and *why* OLS fails in each case,
- summarizing the remedies that lead into the IV/2SLS notebook next.


## Prerequisites (Quick Self-Check)
- Completed notebooks `01_panel_fixed_effects_clustered_se` and `02_difference_in_differences_event_study`.
- Understanding of the OLS assumption $E[\varepsilon \mid X] = 0$ and why it matters.
- Basic familiarity with `numpy` random number generation and `statsmodels` OLS.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can explain what you built and why each step exists.
- You can run your work end-to-end without undefined variables.
- You can identify which source of endogeneity is present in a given scenario.
- You can predict the direction of bias from OVB and classical measurement error.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Treating regression output as causal without stating identification assumptions.
- Confusing "bias" (systematic deviation) with "noise" (random variation).
- Thinking that a large sample fixes endogeneity (it does not; endogeneity causes *inconsistency*).

## Quick Fixes (When You Get Stuck)
- If you see `ModuleNotFoundError`, re-run the bootstrap cell and restart the kernel; make sure `PROJECT_ROOT` is the repo root.
- If a `data/processed/*` file is missing, either run the matching build script (see guide) or use the notebook's `data/sample/*` fallback.
- If results look "too good," suspect leakage; re-check shifts, rolling windows, and time splits.
- If a model errors, check dtypes (`astype(float)`) and missingness (`dropna()` on required columns).

## Matching Guide
- `docs/guides/07_causal/02a_endogeneity_sources.md`

## How To Use This Notebook
- Work section-by-section; don't skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2--4 sentences answering the interpretation prompts (what changed, why it matters).
- This notebook uses only synthetic/simulated data -- no external datasets needed.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/07_causal/02a_endogeneity_sources.md`) for the math, assumptions, and deeper context.

<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.

In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT

## Goal
Understand the three classical sources of endogeneity through simulation.

By building the data-generating process (DGP) yourself, you control the true parameters.
This lets you compare OLS estimates to the truth and *see* the bias directly.

All data in this notebook is synthetic. No external datasets are required.

In [None]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

<a id="what-is-endogeneity"></a>
## What is endogeneity?

### Background
The OLS estimator is the workhorse of empirical economics. It works beautifully -- **when its assumptions hold**.

The critical assumption for causal interpretation is **strict exogeneity**:

$$
E[\varepsilon \mid X] = 0
$$

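This says the error term is mean-independent of the regressors: knowing $X$ tells you nothing about the expected value of $\varepsilon$. When this holds (together with the usual regularity conditions, such as no perfect collinearity), OLS is: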
- **Unbiased**: $E[\hat{\beta}] = \beta$
- **Consistent**: $\hat{\beta} \xrightarrow{p} \beta$ as $n \to \infty$

**Endogeneity** is what we call it when this assumption fails: $E[\varepsilon \mid X] \neq 0$.

When endogeneity is present:
- OLS is **biased** in finite samples.
- OLS is **inconsistent** -- more data does not fix the problem.
- The coefficient $\hat{\beta}$ **does not have a causal interpretation**.
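
To make the inconsistency concrete, here is a minimal sketch (not part of the exercises) in which a common shock `u` enters both the regressor and the error. The DGP, the loading `0.8`, the seed, and the sample size are illustrative assumptions. In a simple regression the OLS slope converges to $\beta + \mathrm{Cov}(X, \varepsilon) / \mathrm{Var}(X)$, so the gap does not close as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n = 200_000
beta = 2.0                      # true slope (hypothetical value)

# A common shock u enters both X and the error, so Cov(X, eps) != 0.
u = rng.normal(size=n)
x = rng.normal(size=n) + u           # Var(X) = 2
eps = rng.normal(size=n) + 0.8 * u   # Cov(X, eps) = 0.8 (assumed loading)
y = 1.0 + beta * x + eps

# Simple-regression OLS slope = sample Cov(x, y) / sample Var(x).
slope_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Population limit: beta + Cov(X, eps) / Var(X) = 2.0 + 0.8 / 2.0
plim = beta + 0.8 / 2.0

print(f"OLS slope:        {slope_ols:.3f}")
print(f"Population limit: {plim:.3f}")
print(f"True beta:        {beta:.3f}")
```

Even with 200,000 observations, the estimate settles near the population limit rather than near the true slope.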

### The three classical sources

| Source | Mechanism | Direction of bias |
|--------|-----------|-------------------|
| **Omitted variable bias** | A confounding variable affects both $X$ and $Y$ | Depends on the signs of the confounder's links to $X$ and to $Y$ |
| **Measurement error** | $X$ is measured with noise ($X = X^* + \nu$) | Attenuation: coefficient shrinks toward zero |
| **Simultaneity** | $Y$ affects $X$ and $X$ affects $Y$ | Depends on the structural slopes and the relative variances of the shocks |

### Interpretation prompts
- In your own words, why does more data not fix endogeneity?
- Can you think of a real-world example where $E[\varepsilon \mid X] \neq 0$?

### Warm-up: OLS works when exogeneity holds

Before breaking OLS, let's confirm it works when the assumption holds.
We simulate data where $X$ and $\varepsilon$ are truly independent.

In [None]:
rng = np.random.default_rng(42)
n = 5000

beta_true = 2.0
alpha_true = 1.0

# Exogenous X and independent error
X_clean = rng.normal(loc=5, scale=2, size=n)
eps_clean = rng.normal(loc=0, scale=1, size=n)
Y_clean = alpha_true + beta_true * X_clean + eps_clean

# TODO: Fit OLS and print the estimated coefficient on X.
# Hint: sm.OLS(Y_clean, sm.add_constant(X_clean)).fit()
res_clean = ...

print(f'True beta:      {beta_true}')
print(f'Estimated beta: {float(res_clean.params[1]):.4f}')
print(f'Bias:           {float(res_clean.params[1]) - beta_true:.4f}')

<a id="source-1-omitted-variable-bias-ovb"></a>
## Source 1: Omitted variable bias (OVB)

### Background
OVB is the most common source of endogeneity in applied work.

**Setup**: The true model is:
$$
Y_i = \beta_0 + \beta_1 X_i + \gamma W_i + \varepsilon_i
$$

where $W$ is an omitted variable (a confounder). If you run the **short regression** without $W$:
$$
Y_i = \tilde{\beta}_0 + \tilde{\beta}_1 X_i + \tilde{\varepsilon}_i
$$

then the OLS estimator $\tilde{\beta}_1$ converges to:
$$
\tilde{\beta}_1 \xrightarrow{p} \beta_1 + \gamma \cdot \delta
$$

where $\delta$ is the coefficient from regressing $W$ on $X$:
$$
W_i = \delta_0 + \delta X_i + \nu_i
$$
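
One way to see where the $\gamma \cdot \delta$ term comes from is to substitute the auxiliary regression into the true model. This is a sketch, assuming $\nu_i$ is the linear projection error (uncorrelated with $X_i$ by construction) and that $\varepsilon_i$ is uncorrelated with $X_i$:

$$
\begin{aligned}
Y_i &= \beta_0 + \beta_1 X_i + \gamma (\delta_0 + \delta X_i + \nu_i) + \varepsilon_i \\
    &= (\beta_0 + \gamma \delta_0) + (\beta_1 + \gamma \delta) X_i + (\gamma \nu_i + \varepsilon_i)
\end{aligned}
$$

The composite error $\gamma \nu_i + \varepsilon_i$ is uncorrelated with $X_i$, so the short regression consistently estimates the slope $\beta_1 + \gamma \delta$, not $\beta_1$.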

### The OVB formula

$$
\text{Bias} = \gamma \times \delta
$$

- $\gamma$: the effect of the omitted variable $W$ on $Y$ (holding $X$ constant).
- $\delta$: the slope from regressing $W$ on $X$, i.e. how strongly $W$ moves with $X$.

**Signing the bias** (the most useful skill in applied econometrics):

| $\gamma$ (effect of $W$ on $Y$) | $\delta$ (slope of $W$ on $X$) | Bias direction |
|---|---|---|
| + | + | Positive (upward) |
| + | - | Negative (downward) |
| - | + | Negative (downward) |
| - | - | Positive (upward) |
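
The four rows of the table can be checked numerically. The sketch below is an illustration only: the helper name `short_regression_bias`, the parameter values, and the seed are assumptions, and the DGP is a simplified version of the one used in the exercise (standard-normal $X$).

```python
import numpy as np


def short_regression_bias(gamma, delta, n=50_000, beta1=2.0, seed=0):
    """Simulate Y = 1 + beta1*X + gamma*W + eps with W = delta*X + noise,
    then return (slope of Y on X alone) - beta1."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    w = delta * x + rng.normal(size=n)
    y = 1.0 + beta1 * x + gamma * w + rng.normal(size=n)
    # Simple-regression OLS slope of Y on X (W omitted).
    slope_short = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
    return slope_short - beta1


results = {}
for gamma, delta in [(3.0, 0.5), (3.0, -0.5), (-3.0, 0.5), (-3.0, -0.5)]:
    bias = short_regression_bias(gamma, delta)
    results[(gamma, delta)] = bias
    print(f"gamma={gamma:+.1f}, delta={delta:+.1f} -> "
          f"simulated bias={bias:+.3f}, gamma*delta={gamma * delta:+.3f}")
```

In each case the sign of the simulated bias should agree with the sign of $\gamma \times \delta$, and the magnitude should be close to it up to sampling error.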

### Classic example
**Returns to education**: $Y$ = wage, $X$ = years of education, $W$ = ability (unobserved).
- $\gamma > 0$: ability raises wages.
- $\delta > 0$: more able people get more education.
- Bias is **positive**: OLS overstates the return to education.

### What you should see
- The short regression (omitting $W$) gives a biased estimate of $\beta_1$.
- The long regression (including $W$) recovers the true $\beta_1$.
- The bias matches $\gamma \times \delta$.

### Interpretation prompts
- Can you sign the OVB direction before running the simulation?
- Why does adding more observations not fix this problem?

### Goal
Simulate OVB and verify the bias formula.

### Your Turn: Simulate OVB

In [None]:
rng = np.random.default_rng(42)
n = 5000

# --- True DGP ---
# Y = beta0 + beta1 * X + gamma * W + eps
# W is correlated with X (confounding)

beta0 = 1.0
beta1 = 2.0   # true causal effect of X on Y
gamma = 3.0   # effect of omitted variable W on Y

# Generate X and the confounder W with correlation
X = rng.normal(loc=5, scale=2, size=n)
W = 0.5 * X + rng.normal(loc=0, scale=1, size=n)  # W depends on X (delta ~ 0.5)
eps = rng.normal(loc=0, scale=1, size=n)

Y = beta0 + beta1 * X + gamma * W + eps

df_ovb = pd.DataFrame({'Y': Y, 'X': X, 'W': W})
df_ovb.head()

In [None]:
# --- Short regression (omitting W) ---
# TODO: Fit OLS of Y on X only (with a constant).
res_short = ...

# --- Long regression (including W) ---
# TODO: Fit OLS of Y on X and W (with a constant).
res_long = ...

print('=== Short regression (Y ~ X, omitting W) ===')
print(f'  beta1_hat (short): {float(res_short.params["X"]):.4f}')
print(f'  True beta1:        {beta1}')
print(f'  Bias:              {float(res_short.params["X"]) - beta1:.4f}')
print()
print('=== Long regression (Y ~ X + W) ===')
print(f'  beta1_hat (long):  {float(res_long.params["X"]):.4f}')
print(f'  True beta1:        {beta1}')
print(f'  Bias:              {float(res_long.params["X"]) - beta1:.4f}')

### Your Turn: Verify the OVB formula

In [None]:
# --- Step 1: Estimate delta by regressing W on X ---
# TODO: Fit OLS of W on X (with a constant) and extract delta.
res_aux = ...
delta_hat = ...  # Hint: float(res_aux.params['X'])

# --- Step 2: Compute predicted bias ---
# OVB formula: bias = gamma * delta
# TODO: Compute predicted bias and compare to actual bias.
predicted_bias = ...  # Hint: gamma * delta_hat
actual_bias = float(res_short.params['X']) - beta1

print(f'delta_hat (W ~ X):  {delta_hat:.4f}')
print(f'gamma (true):       {gamma}')
print(f'Predicted bias:     {predicted_bias:.4f}')
print(f'Actual bias:        {actual_bias:.4f}')
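# Tolerance matches the checkpoint; sampling error here is about 0.007 at n = 5000,
# so a 0.01 threshold can fail by chance.
print(f'Match (tol 0.05):   {abs(predicted_bias - actual_bias) < 0.05}')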

### Your Turn: Visualize OVB across sample sizes

Show that the bias persists even as $n$ grows (inconsistency).

In [None]:
# TODO: Loop over different sample sizes, fit the short regression each time,
# and record the estimated beta1. Plot bias vs n.

sample_sizes = [100, 500, 1000, 5000, 10000, 50000]
biases = []

for n_i in sample_sizes:
    rng_i = np.random.default_rng(42)
    X_i = rng_i.normal(loc=5, scale=2, size=n_i)
    W_i = 0.5 * X_i + rng_i.normal(loc=0, scale=1, size=n_i)
    eps_i = rng_i.normal(loc=0, scale=1, size=n_i)
    Y_i = beta0 + beta1 * X_i + gamma * W_i + eps_i

    # TODO: Fit the short regression (Y ~ X) and record the bias.
    beta1_hat_i = ...  # Hint: fit OLS and extract the coefficient
    biases.append(beta1_hat_i - beta1)

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(sample_sizes, biases, 'o-', markersize=8)
ax.axhline(gamma * 0.5, color='red', linestyle='--', label=f'Predicted bias = {gamma * 0.5:.2f}')
ax.set_xlabel('Sample size (n)')
ax.set_ylabel('Bias (beta1_hat - beta1_true)')
ax.set_title('OVB does not vanish with more data (inconsistency)')
ax.set_xscale('log')
ax.legend()
plt.tight_layout()
plt.show()

### Interpretation prompt

Write 2--4 sentences:
- Does the bias shrink as the sample grows? Why or why not?
- In the education-wage example, what would $W$ be, and which direction would OVB push the estimated return to education?

<a id="source-2-measurement-error"></a>
## Source 2: Measurement error

### Background
In practice, we often cannot measure the true variable $X^*$ perfectly. Instead, we observe:
$$
X_i = X_i^* + \nu_i, \quad \nu_i \sim (0, \sigma_\nu^2), \quad \nu_i \perp (X_i^*, \varepsilon_i)
$$

This is called **classical errors-in-variables** (CEV).

The true model is:
$$
Y_i = \beta_0 + \beta_1 X_i^* + \varepsilon_i
$$

But we estimate:
$$
Y_i = \beta_0 + \beta_1 X_i + \tilde{\varepsilon}_i
$$

Since $X_i = X_i^* + \nu_i$, the new error $\tilde{\varepsilon}_i = \varepsilon_i - \beta_1 \nu_i$ is correlated with $X_i$ (through $\nu_i$). This is endogeneity.

### Attenuation bias

The OLS estimator converges to:
$$
\hat{\beta}_1 \xrightarrow{p} \beta_1 \cdot \frac{\sigma_{X^*}^2}{\sigma_{X^*}^2 + \sigma_\nu^2}
$$

The fraction $\frac{\sigma_{X^*}^2}{\sigma_{X^*}^2 + \sigma_\nu^2}$ (often called the **reliability ratio**) lies strictly between 0 and 1 whenever $\sigma_\nu^2 > 0$, so:
- The coefficient is **attenuated** (shrunk toward zero).
- More noise ($\sigma_\nu^2$ larger) means more attenuation.
- This is called **attenuation bias**.
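
The attenuation factor follows from the simple-regression slope $\mathrm{Cov}(X, Y) / \mathrm{Var}(X)$. A sketch of the derivation, assuming in addition that $X_i^*$ is uncorrelated with $\varepsilon_i$ (so OLS on the true regressor would be consistent):

$$
\begin{aligned}
\mathrm{Cov}(X_i, Y_i) &= \mathrm{Cov}(X_i^* + \nu_i,\ \beta_0 + \beta_1 X_i^* + \varepsilon_i) = \beta_1 \sigma_{X^*}^2 \\
\mathrm{Var}(X_i) &= \sigma_{X^*}^2 + \sigma_\nu^2 \\
\hat{\beta}_1 \xrightarrow{p} \frac{\mathrm{Cov}(X_i, Y_i)}{\mathrm{Var}(X_i)} &= \beta_1 \cdot \frac{\sigma_{X^*}^2}{\sigma_{X^*}^2 + \sigma_\nu^2}
\end{aligned}
$$

The noise inflates the variance of the observed regressor (the denominator) but adds nothing to its covariance with $Y$ (the numerator), which is why the estimate shrinks toward zero instead of moving in a random direction.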

### What you should see
- As measurement noise increases, the OLS coefficient on $X$ shrinks toward zero.
- The attenuation matches the formula above.

### Interpretation prompts
- Why does measurement error in $X$ bias the coefficient *toward zero* rather than in a random direction?
- If you have noisy data and find a "small" effect, should you conclude the true effect is small?

### Goal
Simulate classical measurement error and show attenuation bias.

### Your Turn: Simulate measurement error

In [None]:
rng = np.random.default_rng(42)
n = 5000

beta0_me = 1.0
beta1_me = 3.0  # true effect

# True X* (unobserved in practice)
X_star = rng.normal(loc=5, scale=2, size=n)
eps_me = rng.normal(loc=0, scale=1, size=n)

# True relationship
Y_me = beta0_me + beta1_me * X_star + eps_me

# Measurement noise of different magnitudes
noise_levels = [0.0, 0.5, 1.0, 2.0, 4.0]

print(f'{"Noise SD":>10s}  {"beta1_hat":>10s}  {"Attenuation":>12s}  {"Predicted":>10s}')
print('-' * 50)

for sigma_nu in noise_levels:
    # Measured X = X* + noise
    nu = rng.normal(loc=0, scale=sigma_nu, size=n) if sigma_nu > 0 else np.zeros(n)
    X_measured = X_star + nu

    # TODO: Fit OLS of Y_me on X_measured (with constant) and extract beta1_hat.
    beta1_hat = ...  # Hint: fit OLS and extract coefficient

    # Predicted attenuation factor: var(X*) / (var(X*) + sigma_nu^2)
    var_x_star = np.var(X_star)
    # TODO: Compute predicted coefficient.
    predicted = ...  # Hint: beta1_me * var_x_star / (var_x_star + sigma_nu**2)

    print(f'{sigma_nu:10.1f}  {beta1_hat:10.4f}  {beta1_hat / beta1_me:12.4f}  {predicted:10.4f}')

### Your Turn: Visualize attenuation bias

In [None]:
# TODO: Plot the estimated coefficient vs noise level.
# Include the theoretical attenuation curve for comparison.

sigma_grid = np.linspace(0, 5, 50)
var_x_star = np.var(X_star)

# Theoretical curve
theoretical = ...  # Hint: beta1_me * var_x_star / (var_x_star + sigma_grid**2)

# Simulated points
sim_betas = []
for sigma_nu in sigma_grid:
    nu_i = rng.normal(loc=0, scale=sigma_nu, size=n) if sigma_nu > 0 else np.zeros(n)
    X_m_i = X_star + nu_i
    res_i = sm.OLS(Y_me, sm.add_constant(X_m_i)).fit()
    sim_betas.append(float(res_i.params[1]))

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(sigma_grid, theoretical, 'r-', linewidth=2, label='Theoretical attenuation')
ax.plot(sigma_grid, sim_betas, 'b.', alpha=0.5, label='Simulated OLS estimates')
ax.axhline(beta1_me, color='green', linestyle='--', label=f'True beta = {beta1_me}')
ax.set_xlabel('Measurement noise SD (sigma_nu)')
ax.set_ylabel('Estimated beta1')
ax.set_title('Classical Measurement Error: Attenuation Bias')
ax.legend()
plt.tight_layout()
plt.show()

### Interpretation prompt

Write 2--4 sentences:
- Does the attenuation formula match the simulation?
- In what applied settings is measurement error likely to be a serious concern?
- Why is attenuation bias especially dangerous when the true effect is modest?

<a id="source-3-simultaneity-reverse-causality"></a>
## Source 3: Simultaneity / reverse causality

### Background
Simultaneity arises when $Y$ affects $X$ and $X$ affects $Y$ at the same time. This creates a system of simultaneous equations:

**Supply and demand example**:
- Demand: $Q = \alpha_d - \beta_d P + \varepsilon_d$  (higher price, less demand)
- Supply: $Q = \alpha_s + \beta_s P + \varepsilon_s$  (higher price, more supply)

We observe the equilibrium $(P^*, Q^*)$ where supply equals demand. Regressing $Q$ on $P$ does not recover either the demand or supply curve -- it gives a confounded mix of both.

**Other examples**:
- Crime and police: more crime leads to more police, but more police may deter crime.
- Exports and GDP: exports increase GDP, but higher-GDP countries also produce more to export.
- Advertising and sales: ads boost sales, but firms with high sales spend more on ads.

### Why OLS fails
In the supply/demand system, the equilibrium price $P^*$ depends on *both* $\varepsilon_d$ and $\varepsilon_s$.
So $P$ is correlated with the demand error and the supply error simultaneously.
OLS on either equation alone is inconsistent.
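
The mixture can be written out explicitly. The sketch below assumes, for illustration, that $\varepsilon_d$ and $\varepsilon_s$ are uncorrelated with variances $\sigma_d^2$ and $\sigma_s^2$ (symbols introduced here). Setting demand equal to supply gives the equilibrium price:

$$
P^* = \frac{\alpha_d - \alpha_s + \varepsilon_d - \varepsilon_s}{\beta_d + \beta_s}
$$

Using the supply equation $Q^* = \alpha_s + \beta_s P^* + \varepsilon_s$ to compute the covariance:

$$
\begin{aligned}
\mathrm{Var}(P^*) &= \frac{\sigma_d^2 + \sigma_s^2}{(\beta_d + \beta_s)^2} \\
\mathrm{Cov}(P^*, Q^*) &= \beta_s \mathrm{Var}(P^*) + \mathrm{Cov}(P^*, \varepsilon_s) = \beta_s \mathrm{Var}(P^*) - \frac{\sigma_s^2}{\beta_d + \beta_s} \\
\frac{\mathrm{Cov}(P^*, Q^*)}{\mathrm{Var}(P^*)} &= \frac{\beta_s \sigma_d^2 - \beta_d \sigma_s^2}{\sigma_d^2 + \sigma_s^2}
\end{aligned}
$$

So the large-sample OLS slope is a variance-weighted average of the supply slope $\beta_s$ and the demand slope $-\beta_d$. It equals either structural slope only in the limiting case where one of the two shocks has no variance.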

### What you should see
- A naive OLS regression of $Q$ on $P$ gives a coefficient that is neither the demand slope nor the supply slope.
- The estimate is a confounded mixture of both structural parameters.

### Interpretation prompts
- Why can't you just regress $Q$ on $P$ to get the demand curve?
- What would you need (conceptually) to identify the demand curve separately from the supply curve?

### Goal
Simulate a simple simultaneous system and show OLS gives wrong answers.

### Your Turn: Simulate a supply-demand system

In [None]:
rng = np.random.default_rng(42)
n = 5000

# --- Structural parameters ---
# Demand: Q = alpha_d - beta_d * P + eps_d
# Supply: Q = alpha_s + beta_s * P + eps_s
alpha_d = 10.0
beta_d = 2.0    # magnitude of the demand slope (positive; enters the equation as -beta_d)
alpha_s = 2.0
beta_s = 1.0    # supply slope (positive)

eps_d = rng.normal(loc=0, scale=1, size=n)  # demand shocks
eps_s = rng.normal(loc=0, scale=1, size=n)  # supply shocks

# --- Solve for equilibrium ---
# At equilibrium: alpha_d - beta_d * P + eps_d = alpha_s + beta_s * P + eps_s
# => P* = (alpha_d - alpha_s + eps_d - eps_s) / (beta_d + beta_s)
# => Q* = substitute P* into either equation

# TODO: Solve for equilibrium price and quantity.
P_star = ...  # Hint: (alpha_d - alpha_s + eps_d - eps_s) / (beta_d + beta_s)
Q_star = ...  # Hint: substitute P_star into the demand equation

df_sim = pd.DataFrame({'P': P_star, 'Q': Q_star})
print(f'Mean equilibrium price:    {P_star.mean():.2f}')
print(f'Mean equilibrium quantity: {Q_star.mean():.2f}')
df_sim.head()

In [None]:
# --- Naive OLS: regress Q on P ---
# TODO: Fit OLS of Q on P (with constant) and extract the coefficient on P.
res_naive = ...

print('=== Naive OLS: Q ~ P ===')
print(f'  Estimated slope:      {float(res_naive.params["P"]):.4f}')
print(f'  True demand slope:   {-beta_d:.4f} (negative: higher P -> lower Q demanded)')
print(f'  True supply slope:   {+beta_s:.4f} (positive: higher P -> higher Q supplied)')
print()
print('The OLS estimate is neither the demand nor supply slope.')
print('It is a confounded mixture of both structural relationships.')

### Your Turn: Visualize the identification problem

In [None]:
# TODO: Create a scatter plot of Q vs P with the OLS line,
# the true demand curve (evaluated at mean shocks), and
# the true supply curve (evaluated at mean shocks).

fig, ax = plt.subplots(figsize=(8, 6))

# Scatter of equilibrium observations
ax.scatter(P_star, Q_star, alpha=0.1, s=5, label='Observed equilibria')

# OLS fitted line
p_grid = np.linspace(P_star.min(), P_star.max(), 100)
q_ols = ...  # TODO: Hint: res_naive.params['const'] + res_naive.params['P'] * p_grid

# True demand curve (at eps_d = 0)
q_demand = ...  # TODO: Hint: alpha_d - beta_d * p_grid

# True supply curve (at eps_s = 0)
q_supply = ...  # TODO: Hint: alpha_s + beta_s * p_grid

ax.plot(p_grid, q_ols, 'r-', linewidth=2, label='OLS fit (confounded)')
ax.plot(p_grid, q_demand, 'b--', linewidth=2, label=f'True demand (slope = {-beta_d})')
ax.plot(p_grid, q_supply, 'g--', linewidth=2, label=f'True supply (slope = +{beta_s})')

ax.set_xlabel('Price (P)')
ax.set_ylabel('Quantity (Q)')
ax.set_title('Simultaneity Bias: OLS Cannot Recover Structural Curves')
ax.legend()
plt.tight_layout()
plt.show()

### Your Turn: Show the OLS slope depends on variance of shocks

When demand shocks dominate ($\sigma_d \gg \sigma_s$), the observed variation traces out
the supply curve. When supply shocks dominate, it traces out the demand curve.
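
As a cross-check for the simulation, the large-sample OLS slope has a closed form when the two shocks are uncorrelated: $(\beta_s \sigma_d^2 - \beta_d \sigma_s^2) / (\sigma_d^2 + \sigma_s^2)$, a variance-weighted average of the supply slope and the demand slope. The helper below (the name `ols_slope_limit` is hypothetical) evaluates it for a few ratios, using the same structural slopes as this notebook's simulation. It is a sketch under that no-correlation assumption, not a substitute for the exercise.

```python
import numpy as np


def ols_slope_limit(beta_d, beta_s, sigma_d, sigma_s):
    """Large-sample OLS slope of Q on P, assuming uncorrelated demand and supply shocks."""
    return (beta_s * sigma_d**2 - beta_d * sigma_s**2) / (sigma_d**2 + sigma_s**2)


beta_d, beta_s = 2.0, 1.0  # demand enters as -beta_d, supply as +beta_s

limits = {}
for ratio in [0.01, 1.0, 100.0]:  # sigma_d / sigma_s with sigma_s fixed at 1
    limits[ratio] = ols_slope_limit(beta_d, beta_s, sigma_d=ratio, sigma_s=1.0)
    print(f"sigma_d / sigma_s = {ratio:>6}: limiting OLS slope = {limits[ratio]:+.4f}")
```

Your simulated slopes should land close to these values: near $-\beta_d$ when supply shocks dominate, near $+\beta_s$ when demand shocks dominate, and in between otherwise.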

In [None]:
# TODO: Vary the ratio of demand-shock variance to supply-shock variance.
# Show how the OLS coefficient shifts between the supply and demand slopes.

ratios = [0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 100.0]  # sigma_d / sigma_s
ols_slopes = []

for ratio in ratios:
    sigma_d = ratio
    sigma_s = 1.0
    ed = rng.normal(0, sigma_d, n)
    es = rng.normal(0, sigma_s, n)
    P_eq = (alpha_d - alpha_s + ed - es) / (beta_d + beta_s)
    Q_eq = alpha_d - beta_d * P_eq + ed

    # TODO: Fit OLS of Q_eq on P_eq and record the slope.
    slope_i = ...  # Hint: fit OLS and extract coefficient
    ols_slopes.append(slope_i)

fig, ax = plt.subplots(figsize=(8, 5))
ax.semilogx(ratios, ols_slopes, 'o-', markersize=8, label='OLS slope')
ax.axhline(-beta_d, color='blue', linestyle='--', label=f'True demand slope ({-beta_d})')
ax.axhline(beta_s, color='green', linestyle='--', label=f'True supply slope (+{beta_s})')
ax.set_xlabel('Ratio sigma_d / sigma_s')
ax.set_ylabel('OLS coefficient (Q ~ P)')
ax.set_title('OLS slope varies with relative shock variance')
ax.legend()
plt.tight_layout()
plt.show()

### Interpretation prompt

Write 2--4 sentences:
- When demand shocks are large, which structural curve does OLS approximate? Why?
- Why is IV/2SLS the standard solution for simultaneity? (Preview for the next notebook.)
- Name a real-world setting with simultaneity that you have encountered.

<a id="what-can-we-do-about-endogeneity"></a>
## What can we do about endogeneity?

### Summary

You have now seen three sources of endogeneity. Each has a different mechanism, but they all violate $E[\varepsilon \mid X] = 0$ and render OLS biased and inconsistent.

The good news: econometrics has developed targeted remedies for each source.

| Source | Mechanism | Remedies |
|--------|-----------|----------|
| **Omitted variable bias** | Confounding variable correlated with both $X$ and $Y$ | (1) Add controls if the omitted variable is observable; (2) Panel fixed effects (absorb time-invariant confounders); (3) Instrumental variables |
| **Measurement error** | Noisy measurement of $X$ attenuates the coefficient | (1) Instrumental variables (use an instrument correlated with $X^*$ but uncorrelated with the measurement noise $\nu$ and the error $\varepsilon$, e.g., a second, independently measured proxy); (2) Better measurement (reduce $\sigma_\nu^2$) |
| **Simultaneity** | $Y$ affects $X$ and $X$ affects $Y$ | (1) IV/2SLS (find a variable that shifts one equation but not the other); (2) Natural experiments; (3) Structural models |

### The common thread: Instrumental Variables

Notice that IV appears as a remedy for **all three** sources. This is why IV/2SLS is one of the most important tools in the causal inference toolkit.

An instrument $Z$ must satisfy:
1. **Relevance**: $\mathrm{Cov}(Z, X) \neq 0$ (the instrument affects the endogenous regressor).
2. **Exclusion**: $\mathrm{Cov}(Z, \varepsilon) = 0$ (the instrument affects $Y$ *only* through $X$).
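
As a small preview of how these two conditions are used, the sketch below applies the simplest IV estimator for one regressor and one instrument, $\hat{\beta}_{IV} = \mathrm{Cov}(Z, Y) / \mathrm{Cov}(Z, X)$, to a simulated confounded DGP. The DGP, coefficients, and seed are illustrative assumptions; the full 2SLS workflow, including instrument-strength checks, is left to the next notebook.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
n = 200_000
beta = 2.0                      # true effect (hypothetical value)

z = rng.normal(size=n)               # instrument: shifts X, unrelated to the error
u = rng.normal(size=n)               # unobserved confounder
x = 0.8 * z + u + rng.normal(size=n) # relevance: Cov(Z, X) = 0.8
eps = 1.5 * u + rng.normal(size=n)   # error contains u, so Cov(X, eps) != 0
y = 1.0 + beta * x + eps             # Z affects Y only through X

# OLS slope: Cov(x, y) / Var(x). Biased because x and eps share u.
slope_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Simple IV estimator (single instrument, single regressor): Cov(z, y) / Cov(z, x).
slope_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"True beta: {beta:.3f}")
print(f"OLS slope: {slope_ols:.3f}")
print(f"IV slope:  {slope_iv:.3f}")
```

If relevance were weak (a loading much smaller than the assumed `0.8`), the denominator would be close to zero and the IV estimate would become very noisy.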

**This leads directly to the next notebook**: `03_instrumental_variables_2sls`, where you will implement IV/2SLS, check instrument strength, and compare IV estimates to the biased OLS estimates you have seen here.

### Decision tree (practical)

```
Is your regressor X plausibly exogenous?
  |
  +-- YES: OLS is fine. Report robust SE. Done.
  |
  +-- NO: What is the source?
       |
       +-- OVB: Can you observe the omitted variable?
       |     +-- YES: Add it as a control.
       |     +-- NO: Use panel FE (if time-invariant) or IV.
       |
       +-- Measurement error: Can you get better data?
       |     +-- YES: Use it.
       |     +-- NO: Use IV.
       |
       +-- Simultaneity: Do you have a valid instrument?
             +-- YES: Use IV/2SLS.
             +-- NO: Consider natural experiments, RDD, DiD.
```

### Your Turn: Classify endogeneity scenarios

For each scenario below, identify the most likely source of endogeneity and the recommended remedy.

In [None]:
# TODO: For each scenario, fill in the source and remedy.
# Replace the '...' with your answers.

scenarios = pd.DataFrame({
    'scenario': [
        'Regressing wages on education (ability is unobserved)',
        'Regressing health outcomes on self-reported exercise (noisy measure)',
        'Regressing city crime rates on police spending',
        'Regressing firm profits on R&D spending (innovative firms do both)',
        'Regressing test scores on class size (using survey-reported class size)',
    ],
    'source': [
        ...,  # TODO: 'OVB', 'measurement error', or 'simultaneity'
        ...,
        ...,
        ...,
        ...,
    ],
    'remedy': [
        ...,  # TODO: brief description of the recommended approach
        ...,
        ...,
        ...,
        ...,
    ],
})

scenarios

<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2--3 sentences summarizing what you verified.

In [None]:
# --- Self-check asserts ---

# OVB: short regression should overestimate beta1
# (because gamma > 0 and delta > 0, so bias is positive)
assert float(res_short.params['X']) > beta1, \
    'Short regression should overestimate beta1 (positive OVB)'

# OVB: long regression should be close to true beta1
assert abs(float(res_long.params['X']) - beta1) < 0.2, \
    'Long regression should recover beta1 (close to 2.0)'

# OVB formula should match
assert abs(predicted_bias - actual_bias) < 0.05, \
    'OVB formula (gamma * delta) should match actual bias'

# Simultaneity: OLS slope should not match either structural slope
ols_sim_slope = float(res_naive.params['P'])
assert abs(ols_sim_slope - (-beta_d)) > 0.3, \
    'OLS should not recover the demand slope'
assert abs(ols_sim_slope - beta_s) > 0.3, \
    'OLS should not recover the supply slope'

print('All checkpoint asserts passed.')

# TODO: Write 2-3 sentences:
# - Which source of endogeneity produces the most predictable bias direction?
# - What is the common remedy across all three sources?
...

## Extensions (Optional)
- Try one additional variant beyond the main path (different DGP parameters, different noise structure).
- Write down what improved, what got worse, and your hypothesis for why.

Suggestions:
- **OVB with multiple omitted variables**: Add a second confounder $W_2$ that is negatively correlated with $X$. Does the total bias increase or decrease?
- **Non-classical measurement error**: Try multiplicative noise ($X = X^* \cdot \nu$ where $\nu > 0$). Does attenuation bias still hold?
- **Simultaneity with an instrument**: Add an instrument to the supply/demand system (e.g., a weather shock that shifts supply). Show that IV recovers the demand slope. (This is a preview of notebook 03.)

## Reflection
- What did you assume implicitly (about timing, availability, stationarity, or costs)?
- If you had to ship this model, what would you monitor?
- In your own applied work, which source of endogeneity do you think is most likely to arise?
- How does the OVB formula help you reason about the *direction* of bias even when you cannot observe the confounder?

<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: OLS when exogeneity holds</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a -- OLS when exogeneity holds
res_clean = sm.OLS(Y_clean, sm.add_constant(X_clean)).fit()

print(f'True beta:      {beta_true}')
print(f'Estimated beta: {float(res_clean.params[1]):.4f}')
print(f'Bias:           {float(res_clean.params[1]) - beta_true:.4f}')
```

</details>

<details><summary>Solution: Simulate OVB (short and long regressions)</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a -- OVB short and long regressions
import statsmodels.api as sm

# Short regression (omitting W)
res_short = sm.OLS(
    df_ovb['Y'],
    sm.add_constant(df_ovb[['X']], has_constant='add'),
).fit()

# Long regression (including W)
res_long = sm.OLS(
    df_ovb['Y'],
    sm.add_constant(df_ovb[['X', 'W']], has_constant='add'),
).fit()

print('Short regression beta1:', float(res_short.params['X']))
print('Long regression beta1: ', float(res_long.params['X']))
print('True beta1:            ', beta1)
```

</details>

<details><summary>Solution: Verify the OVB formula</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a -- OVB formula verification
# Step 1: Estimate delta by regressing W on X
res_aux = sm.OLS(
    df_ovb['W'],
    sm.add_constant(df_ovb[['X']], has_constant='add'),
).fit()
delta_hat = float(res_aux.params['X'])

# Step 2: Compute predicted bias
predicted_bias = gamma * delta_hat
actual_bias = float(res_short.params['X']) - beta1

print(f'delta_hat:       {delta_hat:.4f}')
print(f'Predicted bias:  {predicted_bias:.4f}')
print(f'Actual bias:     {actual_bias:.4f}')
```

</details>

<details><summary>Solution: OVB across sample sizes</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a -- OVB across sample sizes
sample_sizes = [100, 500, 1000, 5000, 10000, 50000]
biases = []

for n_i in sample_sizes:
    rng_i = np.random.default_rng(42)
    X_i = rng_i.normal(loc=5, scale=2, size=n_i)
    W_i = 0.5 * X_i + rng_i.normal(loc=0, scale=1, size=n_i)
    eps_i = rng_i.normal(loc=0, scale=1, size=n_i)
    Y_i = beta0 + beta1 * X_i + gamma * W_i + eps_i

    res_i = sm.OLS(Y_i, sm.add_constant(X_i)).fit()
    beta1_hat_i = float(res_i.params[1])
    biases.append(beta1_hat_i - beta1)
```

</details>

<details><summary>Solution: Simulate measurement error</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a -- Measurement error
for sigma_nu in noise_levels:
    nu = rng.normal(loc=0, scale=sigma_nu, size=n) if sigma_nu > 0 else np.zeros(n)
    X_measured = X_star + nu

    res_me_i = sm.OLS(Y_me, sm.add_constant(X_measured)).fit()
    beta1_hat = float(res_me_i.params[1])

    var_x_star = np.var(X_star)
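    predicted = beta1_me * var_x_star / (var_x_star + sigma_nu**2)

    # Same row format as the notebook cell, so the table prints for each noise level.
    print(f'{sigma_nu:10.1f}  {beta1_hat:10.4f}  {beta1_hat / beta1_me:12.4f}  {predicted:10.4f}')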
```

</details>

<details><summary>Solution: Visualize attenuation bias</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a -- Attenuation bias visualization
sigma_grid = np.linspace(0, 5, 50)
var_x_star = np.var(X_star)

theoretical = beta1_me * var_x_star / (var_x_star + sigma_grid**2)

sim_betas = []
for sigma_nu in sigma_grid:
    nu_i = rng.normal(loc=0, scale=sigma_nu, size=n) if sigma_nu > 0 else np.zeros(n)
    X_m_i = X_star + nu_i
    res_i = sm.OLS(Y_me, sm.add_constant(X_m_i)).fit()
    sim_betas.append(float(res_i.params[1]))
```

</details>

<details><summary>Solution: Simulate supply-demand system</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a -- Supply-demand simultaneity
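P_star = (alpha_d - alpha_s + eps_d - eps_s) / (beta_d + beta_s)
Q_star = alpha_d - beta_d * P_star + eps_d  # substitute P* into the demand equation

# Built in the notebook cell; repeated here so the OLS call below has df_sim available.
df_sim = pd.DataFrame({'P': P_star, 'Q': Q_star})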

# Naive OLS
res_naive = sm.OLS(
    df_sim['Q'],
    sm.add_constant(df_sim[['P']], has_constant='add'),
).fit()
```

</details>

<details><summary>Solution: Visualize simultaneity</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a -- Simultaneity visualization
p_grid = np.linspace(P_star.min(), P_star.max(), 100)

q_ols = float(res_naive.params['const']) + float(res_naive.params['P']) * p_grid
q_demand = alpha_d - beta_d * p_grid
q_supply = alpha_s + beta_s * p_grid

fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(P_star, Q_star, alpha=0.1, s=5, label='Observed equilibria')
ax.plot(p_grid, q_ols, 'r-', linewidth=2, label='OLS fit (confounded)')
ax.plot(p_grid, q_demand, 'b--', linewidth=2, label=f'True demand (slope = {-beta_d})')
ax.plot(p_grid, q_supply, 'g--', linewidth=2, label=f'True supply (slope = +{beta_s})')
ax.set_xlabel('Price (P)')
ax.set_ylabel('Quantity (Q)')
ax.set_title('Simultaneity Bias: OLS Cannot Recover Structural Curves')
ax.legend()
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: OLS slope depends on variance of shocks</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 02a -- Simultaneity shock variance
ratios = [0.01, 0.1, 0.5, 1.0, 2.0, 10.0, 100.0]
ols_slopes = []

for ratio in ratios:
    sigma_d = ratio
    sigma_s = 1.0
    ed = rng.normal(0, sigma_d, n)
    es = rng.normal(0, sigma_s, n)
    P_eq = (alpha_d - alpha_s + ed - es) / (beta_d + beta_s)
    Q_eq = alpha_d - beta_d * P_eq + ed

    res_i = sm.OLS(Q_eq, sm.add_constant(P_eq)).fit()
    slope_i = float(res_i.params[1])
    ols_slopes.append(slope_i)
```

</details>

<details><summary>Solution: Classify endogeneity scenarios</summary>

_One possible approach._

```python
# Reference solution for 02a -- Classify scenarios
scenarios = pd.DataFrame({
    'scenario': [
        'Regressing wages on education (ability is unobserved)',
        'Regressing health outcomes on self-reported exercise (noisy measure)',
        'Regressing city crime rates on police spending',
        'Regressing firm profits on R&D spending (innovative firms do both)',
        'Regressing test scores on class size (using survey-reported class size)',
    ],
    'source': [
        'OVB',
        'measurement error',
        'simultaneity',
        'OVB (or simultaneity)',
        'measurement error (+ possible OVB)',
    ],
    'remedy': [
        'IV (e.g., proximity to college, compulsory schooling laws)',
        'IV or use objective measurement (e.g., accelerometer data)',
        'IV (e.g., use a policy shock that changes police but not crime directly)',
        'Panel FE (control for firm-level unobservables) or IV',
        'Administrative records for class size; IV for remaining endogeneity',
    ],
})
scenarios
```

</details>