# Panel Instrumental Variables (IV)

**Level**: Advanced  
**Estimated Duration**: 75-90 minutes  
**Prerequisites**: Fixed Effects (Notebook 02), Basic IV/2SLS concepts

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Identify** sources of endogeneity in panel data (simultaneity, measurement error, omitted variables)
2. **Understand** instrumental variable (IV) requirements (relevance, exogeneity, exclusion)
3. **Estimate** 2SLS models for panels using PanelBox
4. **Combine** IV with panel transformations (IV-Pooled, IV-FE, IV-RE)
5. **Interpret** first-stage F-statistics and diagnose weak instruments
6. **Distinguish** between internal and external instruments
7. **Apply** IV to real-world endogeneity problems

---

## Setup

Import required packages and configure visualization settings.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import panelbox as pb
from panelbox import PanelIV
from panelbox.models.static import PooledOLS, FixedEffectsOLS

# Visualization settings
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 11

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("PanelBox version:", pb.__version__)
print("Setup complete!")

---

# Section 1: Endogeneity in Panel Data

## 1.1 Sources of Endogeneity

Endogeneity occurs when $\text{Cov}(X_{it}, \varepsilon_{it}) \neq 0$, violating the key OLS assumption. In panel data, endogeneity can arise from:

### 1. **Simultaneity** (Reverse Causality)
- $X_{it}$ and $y_{it}$ are jointly determined
- **Example**: Price and quantity in supply-demand systems
- **Example**: Firm investment and productivity

### 2. **Measurement Error**
- Observe $X^*_{it} = X_{it} + u_{it}$ instead of true $X_{it}$
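- **Problem**: Classical measurement error ($u_{it}$ uncorrelated with $X_{it}$ and $\varepsilon_{it}$) causes attenuation bias toward zero
- **Worse in FE**: Demeaning removes the persistent part of the true signal, so measurement error makes up a larger share of the remaining variation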

### 3. **Time-Varying Omitted Variables**
- Unobserved $\omega_{it}$ correlated with $X_{it}$
- **FE only eliminates** time-invariant $\alpha_i$
- **Cannot eliminate** time-varying confounders
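
To make the measurement-error point concrete, here is a minimal NumPy sketch (not PanelBox code). It assumes a true regressor with a persistent entity component, classical measurement error, and a true slope of 1; all names and parameter values are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 2000, 5, 1.0

# True regressor: persistent entity component plus a smaller transitory part
x_true = rng.normal(0, 1, (N, 1)) + rng.normal(0, 0.5, (N, T))
y = beta * x_true + rng.normal(0, 0.5, (N, T))

# Observed regressor with classical measurement error (variance 0.25)
x_obs = x_true + rng.normal(0, 0.5, (N, T))

def slope(x, y):
    """OLS slope of y on x (arrays are flattened; intercept implied)."""
    x, y = x.ravel(), y.ravel()
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

def demean(a):
    """Within transformation: subtract each entity's time mean."""
    return a - a.mean(axis=1, keepdims=True)

b_pooled = slope(x_obs, y)
b_within = slope(demean(x_obs), demean(y))
print(f"Pooled slope: {b_pooled:.3f}")
print(f"Within slope: {b_within:.3f}")
```

Under these assumptions both slopes fall below the true value of 1, and the within slope falls further, because demeaning discards the entity-level signal but keeps the measurement error.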

---

## 1.2 Why OLS/FE Fail

Let's simulate a panel with endogeneity and see how OLS produces biased estimates.

In [None]:
# Simulate endogenous panel data
np.random.seed(42)
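N, T = 100, 5
rho = 0.7  # persistence of x_endo; makes its own lag a relevant instrument (Section 3.3)

# Unobserved time-varying shock (confounds relationship)
omega = np.random.normal(0, 1, (N, T))

data_endo = []
for i in range(N):
    x_prev = 10 + np.random.normal(0, 1)  # pre-sample value of the regressor
    for t in range(T):
        # Endogenous regressor: persistent (AR(1) around 10) and correlated
        # with the error through omega
        x_endo = 10 + rho * (x_prev - 10) + omega[i, t] + np.random.normal(0, 1)
        x_prev = x_endo
        
        # True model: y = 2*x + error
        # But error includes omega (same omega affecting x!)
        y = 2 * x_endo + omega[i, t] + np.random.normal(0, 0.5)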
        
        data_endo.append({
            'entity': i, 
            'time': t, 
            'y': y, 
            'x_endo': x_endo
        })

df_endo = pd.DataFrame(data_endo)

print("Simulated endogenous panel data:")
print(df_endo.head(10))
print(f"\nShape: {df_endo.shape}")

In [None]:
# Estimate with naive OLS (will be biased!)
ols_endo = PooledOLS(
    formula="y ~ x_endo",
    data=df_endo,
    entity_col='entity',
    time_col='time'
).fit(cov_type='clustered')

print("=" * 60)
print("NAIVE OLS ESTIMATION (BIASED)")
print("=" * 60)
print(f"True causal effect:    β = 2.000")
print(f"OLS estimate:          β̂ = {ols_endo.params['x_endo']:.3f}")
print(f"Bias:                  {ols_endo.params['x_endo'] - 2:.3f}")
print(f"Bias percentage:       {100 * (ols_endo.params['x_endo'] - 2) / 2:.1f}%")
print("\n⚠️  OLS severely overestimates the effect due to endogeneity!")

### Key Insight

When $X$ is endogenous:
- OLS attributes **both** the direct effect of $X$ **and** the confounding effect of $\omega$ to $X$
- This creates **positive bias** when $\text{Cov}(X, \omega) > 0$ and $\omega$ raises $y$, as in the simulation above
- Even with thousands of observations, bias persists (consistency fails)
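
The size of the bias follows from a short derivation. As a sketch, assume the model is $y_{it} = \beta X_{it} + \varepsilon_{it}$ with structural error $\varepsilon_{it} = \gamma\,\omega_{it} + e_{it}$, where $e_{it}$ is uncorrelated with $X_{it}$ (the symbols $\gamma$ and $e_{it}$ are notation introduced here for illustration):

$$\text{plim}\;\hat{\beta}_{OLS} = \frac{\text{Cov}(X_{it}, y_{it})}{\text{Var}(X_{it})} = \beta + \gamma\,\frac{\text{Cov}(X_{it}, \omega_{it})}{\text{Var}(X_{it})}$$

For instance, with $\gamma = 1$, $\text{Cov}(X, \omega) = 1$ and $\text{Var}(X) = 2$, the OLS estimate converges to $\beta + 0.5$ however large the sample is.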

**Solution**: We need an **instrumental variable** to isolate the causal effect!

---

# Section 2: Instrumental Variables - Foundations

## 2.1 The Three IV Requirements

An instrumental variable $Z$ must satisfy:

### 1. **Relevance** (Testable)
$$\text{Cov}(Z_{it}, X_{\text{endo},it}) \neq 0$$
- Instrument must be correlated with endogenous variable
- **Test**: First-stage F-statistic > 10 (rule of thumb)
- **Violation**: Weak instruments → bias toward OLS

### 2. **Exogeneity** (Not directly testable)
$$\mathbb{E}[Z_{it} \cdot \varepsilon_{it}] = 0$$
- Instrument uncorrelated with structural error
- **Cannot test** in just-identified models; over-identified models allow only a partial check (overidentification test)
- **Requires** economic reasoning and institutional knowledge

### 3. **Exclusion Restriction** (Not directly testable)
- $Z$ affects $y$ **only through** $X_{\text{endo}}$
- No direct effect: $Z$ not in structural equation
- **Example**: Distance to college affects wages only through education
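
The three requirements can be seen at work in a small simulation. This is a sketch in plain NumPy under assumed parameter values: the instrument `z` shifts `x` (relevance), is drawn independently of the confounder `omega` (exogeneity), and enters `y` only through `x` (exclusion). With a single instrument, the IV estimator reduces to a ratio of covariances.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta = 20_000, 2.0

z = rng.normal(0, 1, n)                        # instrument, independent of omega
omega = rng.normal(0, 1, n)                    # unobserved confounder
x = 0.8 * z + omega + rng.normal(0, 1, n)      # relevance: x depends on z
y = beta * x + omega + rng.normal(0, 0.5, n)   # exclusion: z enters only via x

b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]  # single-instrument IV estimator

print(f"OLS: {b_ols:.3f}")
print(f"IV:  {b_iv:.3f}")
```

In this setup OLS is pulled upward by `omega`, while the IV ratio stays close to the true value of 2 because `z` carries no information about `omega`.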

---

## 2.2 Internal vs External Instruments

### **Internal Instruments** (Lags)
- Use past values: $X_{i,t-1}, X_{i,t-2}, \ldots$
- **Valid if**: $\varepsilon_{it}$ not serially correlated
- **Common in**: Dynamic panels (Arellano-Bond GMM)
- **Advantage**: Always available
- **Disadvantage**: May be weak; serial correlation invalidates
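
Whether a lag is a usable instrument depends on how persistent the regressor is. The sketch below is plain NumPy; `lag_correlation` is a hypothetical helper and the AR(1) process is an assumption for illustration. It compares the correlation between $X_{it}$ and $X_{i,t-1}$ for a regressor with no persistence and one with substantial persistence.

```python
import numpy as np

def lag_correlation(rho, N=2000, T=6, seed=0):
    """Correlation between x_t and x_{t-1} when x follows an AR(1) with parameter rho."""
    rng = np.random.default_rng(seed)
    x = np.zeros((N, T))
    x[:, 0] = rng.normal(0, 1, N)
    for t in range(1, T):
        x[:, t] = rho * x[:, t - 1] + rng.normal(0, 1, N)
    # Pair each observation with its own lag, pooled over entities and periods
    return np.corrcoef(x[:, :-1].ravel(), x[:, 1:].ravel())[0, 1]

corr_none = lag_correlation(0.0)
corr_persistent = lag_correlation(0.7)
print(f"rho = 0.0: corr(x_t, x_t-1) = {corr_none:.3f}")
print(f"rho = 0.7: corr(x_t, x_t-1) = {corr_persistent:.3f}")
```

When the regressor is not persistent, its lag carries essentially no information about the current value and fails the relevance requirement, however plausible its exogeneity.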

### **External Instruments** (Policy, Geography)
- Variables from outside the model
- **Examples**: 
  - Policy changes (draft lottery for military service)
  - Geographic variation (distance to college)
  - Weather shocks (rainfall for agricultural productivity)
- **Advantage**: Often more credible
- **Disadvantage**: Harder to find; requires creativity

---

## 2.3 Identification Conditions

### Order Condition
$$\#\,\text{excluded instruments} \geq \#\,\text{endogenous variables}$$

- **Just-identified**: $\#Z = \#X_{\text{endo}}$ (exact identification)
- **Over-identified**: $\#Z > \#X_{\text{endo}}$ (can test overidentifying restrictions)

### Rank Condition
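- Instruments must provide **independent** variation in each endogenous variable
- With one endogenous variable, the first-stage F-statistic checks this; with several, each needs its own source of variation, so separate first-stage F-statistics are not sufficient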
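
A rough way to see both conditions in code is sketched below. `check_identification` is a hypothetical helper written for this illustration, not a PanelBox function. It counts columns for the order condition and checks the rank of the sample cross-covariance between instruments and endogenous regressors. A sample rank check only catches exact failures such as collinear instruments; instruments that are merely weak still pass it.

```python
import numpy as np

def check_identification(X_endo, Z_excl, tol=1e-8):
    """Return (order_ok, rank_ok) for endogenous regressors and excluded instruments."""
    n_endo, n_inst = X_endo.shape[1], Z_excl.shape[1]
    order_ok = n_inst >= n_endo
    # Sample cross-covariance between instruments and endogenous regressors
    Zc = Z_excl - Z_excl.mean(axis=0)
    Xc = X_endo - X_endo.mean(axis=0)
    cross = Zc.T @ Xc / len(Zc)
    rank_ok = bool(np.linalg.matrix_rank(cross, tol=tol) == n_endo)
    return order_ok, rank_ok

rng = np.random.default_rng(2)
n = 500
Z_good = rng.normal(size=(n, 2))
X = Z_good @ np.array([[1.0, 0.3], [0.2, 1.0]]) + rng.normal(size=(n, 2))

z1 = Z_good[:, [0]]
Z_collinear = np.hstack([z1, 2 * z1])   # second instrument adds no new variation

print("Two distinct instruments: ", check_identification(X, Z_good))
print("Two collinear instruments:", check_identification(X, Z_collinear))
print("One instrument only:      ", check_identification(X, z1))
```

The collinear case satisfies the order condition but not the rank condition, which is why counting instruments alone is not enough.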

---

# Section 3: Two-Stage Least Squares (2SLS)

## 3.1 The 2SLS Procedure

Consider the model:
$$y_{it} = \beta_0 + \beta_1 X_{\text{endo},it} + \beta_2 X_{\text{exog},it} + \varepsilon_{it}$$

With instrument $Z_{it}$:

### **Stage 1**: Predict endogenous variable
$$X_{\text{endo},it} = \pi_0 + \pi_1 Z_{it} + \pi_2 X_{\text{exog},it} + v_{it}$$
- Obtain fitted values: $\hat{X}_{\text{endo},it}$
- Test instrument strength: F-test on $\pi_1$

### **Stage 2**: Use predicted values
$$y_{it} = \beta_0 + \beta_1 \hat{X}_{\text{endo},it} + \beta_2 X_{\text{exog},it} + u_{it}$$
- Obtain $\hat{\beta}_1^{IV}$ (consistent under IV assumptions)
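- Standard errors must be computed from residuals that use the actual $X_{\text{endo},it}$, not $\hat{X}_{\text{endo},it}$; running the second stage by hand with OLS gives incorrect standard errors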
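
The two stages can be sketched by hand as follows. This is an illustrative NumPy implementation with homoskedastic standard errors, not the PanelBox internals; `two_sls` and the simulated data are assumptions made for the example. The residuals are built from the original regressors rather than the first-stage fitted values.

```python
import numpy as np

def two_sls(y, X, Z):
    """2SLS of y on X with instrument matrix Z.

    X and Z both include the constant and any exogenous regressors.
    Returns coefficients and homoskedastic standard errors.
    """
    # Stage 1: project every column of X on the instruments
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Stage 2: regress y on the fitted values
    beta = np.linalg.lstsq(X_hat, y, rcond=None)[0]
    # Residuals use the ORIGINAL X, not X_hat
    resid = y - X @ beta
    n, k = X.shape
    sigma2 = resid @ resid / (n - k)
    cov = sigma2 * np.linalg.inv(X_hat.T @ X_hat)
    return beta, np.sqrt(np.diag(cov))

rng = np.random.default_rng(3)
n = 5_000
z = rng.normal(0, 1, n)
omega = rng.normal(0, 1, n)
x = 0.8 * z + omega + rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + omega + rng.normal(0, 0.5, n)

const = np.ones(n)
X = np.column_stack([const, x])   # structural regressors
Z = np.column_stack([const, z])   # instruments (the constant instruments itself)
beta_hat, se = two_sls(y, X, Z)
print(f"beta_1 = {beta_hat[1]:.3f} (SE {se[1]:.3f})")
```

For panel data, clustered standard errors (as used in the rest of this notebook) are usually preferable; the homoskedastic formula is kept here only to keep the mechanics visible.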

---

## 3.2 PanelBox IV Syntax

The `PanelIV` class uses a special formula syntax:

```python
formula = "y ~ exog_vars + endog_vars | exog_vars + instruments"
```

- **Left of `|`**: Structural equation (including endogenous vars)
- **Right of `|`**: Instruments (plus exogenous vars)
- **Endogenous vars**: Appear left of `|` but NOT right
- **Instruments**: Appear right of `|` but NOT left
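
The classification rules above can be spelled out in a few lines of plain Python. `split_iv_formula` is a hypothetical helper for illustration only; it is not part of PanelBox and handles only simple `+`-separated terms.

```python
def split_iv_formula(formula):
    """Classify the variables in a 'y ~ regressors | instruments' formula."""
    lhs, rhs = formula.split("~")
    structural, instrument_side = rhs.split("|")
    regressors = [v.strip() for v in structural.split("+")]
    first_stage = [v.strip() for v in instrument_side.split("+")]
    return {
        "dependent": lhs.strip(),
        # Left of | but not right: endogenous
        "endogenous": [v for v in regressors if v not in first_stage],
        # On both sides: exogenous regressors
        "exogenous": [v for v in regressors if v in first_stage],
        # Right of | but not left: excluded instruments
        "instruments": [v for v in first_stage if v not in regressors],
    }

parts = split_iv_formula(
    "log_wage ~ experience + education | experience + college_opening"
)
for role, names in parts.items():
    print(f"{role:12s} {names}")
```

Reading a formula this way before estimating is a quick check that every endogenous variable has at least one excluded instrument.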

---

## 3.3 IV Estimation with PanelBox

In [None]:
# Create an instrument: lagged value of x_endo
# (Relevant only if x_endo is persistent; valid only if the error is not serially correlated)
df_endo = df_endo.sort_values(['entity', 'time'])
df_endo['z_lag'] = df_endo.groupby('entity')['x_endo'].shift(1)

# Drop missing values from lag
df_iv = df_endo.dropna().copy()

print(f"Data with instrument: {df_iv.shape}")
print(df_iv.head(10))

In [None]:
# Estimate IV-Pooled (2SLS)
iv_pooled = PanelIV(
    formula="y ~ x_endo | z_lag",  # z_lag instruments for x_endo
    data=df_iv,
    entity_col='entity',
    time_col='time',
    model_type='pooled'
)

iv_results = iv_pooled.fit(cov_type='clustered')

print("=" * 70)
print("IV-POOLED ESTIMATION (2SLS)")
print("=" * 70)
print(iv_results.summary())

In [None]:
# Compare OLS vs IV
print("=" * 70)
print("COMPARISON: OLS vs IV")
print("=" * 70)
print(f"True causal effect:        β = 2.000")
print(f"OLS estimate (biased):     β̂ = {ols_endo.params['x_endo']:.3f}")
print(f"IV estimate (consistent):  β̂ = {iv_results.params['x_endo']:.3f}")
print(f"\nOLS bias:                  {ols_endo.params['x_endo'] - 2:.3f}")
print(f"IV bias:                   {iv_results.params['x_endo'] - 2:.3f}")
print("\n✓ IV corrects endogeneity bias!")

---

## 3.4 First-Stage Diagnostics

The **first-stage F-statistic** tests instrument relevance:

$$H_0: \pi_1 = 0 \quad \text{(instrument irrelevant)}$$

### Rules of Thumb
- **F > 10**: Conventionally treated as strong (Staiger-Stock rule of thumb)
- **F < 10**: Weak instrument
  - Finite-sample bias toward OLS
  - Misleading inference (tests over-reject; confidence intervals are too narrow)
  - Consider alternative instruments or LIML

### Critical Values (Stock-Yogo, 2005)
For 10% maximal IV size:
- 1 endogenous, 1 instrument: F > 16.38
- 1 endogenous, 2 instruments: F > 19.93

The **F > 10** rule is a rough screen for "not too weak" and is less demanding than these critical values.
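
The first-stage F-statistic itself is an ordinary F-test comparing the first stage with and without the excluded instruments. The sketch below computes the homoskedastic version in NumPy; `first_stage_f` is a hypothetical helper and the data are simulated for illustration. PanelBox reports its own statistic, which may differ when robust or clustered covariance is used.

```python
import numpy as np

def first_stage_f(x_endo, Z_excl, W_exog):
    """F-statistic for H0: all excluded instruments have zero first-stage coefficients."""
    def rss(A):
        resid = x_endo - A @ np.linalg.lstsq(A, x_endo, rcond=None)[0]
        return resid @ resid

    full = np.hstack([W_exog, Z_excl])
    q = Z_excl.shape[1]                       # number of excluded instruments
    df_resid = len(x_endo) - full.shape[1]
    rss_restricted, rss_unrestricted = rss(W_exog), rss(full)
    return ((rss_restricted - rss_unrestricted) / q) / (rss_unrestricted / df_resid)

rng = np.random.default_rng(5)
n = 1_000
const = np.ones((n, 1))
z = rng.normal(0, 1, (n, 1))
x_strong = 0.5 * z[:, 0] + rng.normal(0, 1, n)   # depends on z
x_irrelevant = rng.normal(0, 1, n)               # unrelated to z

f_strong = first_stage_f(x_strong, z, const)
f_irrelevant = first_stage_f(x_irrelevant, z, const)
print(f"F (relevant instrument):   {f_strong:.2f}")
print(f"F (irrelevant instrument): {f_irrelevant:.2f}")
```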

In [None]:
# Extract first-stage statistics
print("=" * 70)
print("FIRST-STAGE DIAGNOSTICS")
print("=" * 70)

first_stage = iv_results.first_stage_results

for endog_var, stats in first_stage.items():
    print(f"\nEndogenous variable: '{endog_var}'")
    print(f"  Instrument(s):      z_lag")
    print(f"  F-statistic:        {stats['f_statistic']:.2f}")
    print(f"  P-value:            {stats['f_pvalue']:.4f}")
    print(f"  Partial R²:         {stats.get('partial_r2', 0):.4f}")
    
    # Assess strength
    if stats['f_statistic'] < 10:
        print("  ⚠️  WARNING: Weak instrument (F < 10)")
        print("      → Consider alternative instruments")
    else:
        print("  ✓ Instrument passes the F > 10 rule of thumb")
        print("      → Relevance looks adequate; exogeneity remains untested")

### Visualize First-Stage Relationship

In [None]:
# Plot instrument vs endogenous variable
fig, ax = plt.subplots(figsize=(10, 6))

ax.scatter(df_iv['z_lag'], df_iv['x_endo'], alpha=0.3, s=20)
ax.set_xlabel('Instrument (z_lag)', fontsize=12)
ax.set_ylabel('Endogenous Variable (x_endo)', fontsize=12)
ax.set_title('First-Stage Relationship: Instrument Relevance', fontsize=14, fontweight='bold')

# Add regression line
z = df_iv['z_lag'].values
x = df_iv['x_endo'].values
coef = np.polyfit(z, x, 1)
poly1d = np.poly1d(coef)
ax.plot(z, poly1d(z), 'r-', linewidth=2, label=f'Slope = {coef[0]:.3f}')

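# Add correlation and the first-stage F-statistic (read from the fitted results
# rather than from a loop variable left over from an earlier cell)
corr = np.corrcoef(z, x)[0, 1]
f_stat = next(iter(iv_results.first_stage_results.values()))['f_statistic']
ax.text(0.05, 0.95, f'Correlation: {corr:.3f}\nF-stat: {f_stat:.2f}',
        transform=ax.transAxes, fontsize=11, verticalalignment='top',
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))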

ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("A clearly non-zero slope, together with a large first-stage F, indicates instrument relevance.")

---

# Section 4: IV-FE - Combining IV with Fixed Effects

## 4.1 The IV-FE Model

When facing **both**:
1. Time-invariant unobserved heterogeneity ($\alpha_i$)
2. Endogenous regressors ($X_{it}$)

We need **IV-FE**:

### Procedure
1. **Demean** all variables (within transformation):
   $$\tilde{y}_{it} = y_{it} - \bar{y}_i$$
   $$\tilde{X}_{it} = X_{it} - \bar{X}_i$$
   $$\tilde{Z}_{it} = Z_{it} - \bar{Z}_i$$

2. **Apply 2SLS** to demeaned data

### Result
- Eliminates $\alpha_i$ (like standard FE)
- Instruments for endogenous $X_{it}$ (like IV)
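
The procedure can be sketched by hand for a balanced panel stored as `(N, T)` arrays. This is an illustrative NumPy version with assumed parameter values, not the PanelBox implementation. The instrument is constructed to be correlated with $\alpha_i$ but not with the time-varying confounder, so it is valid only after demeaning. With one instrument, 2SLS equals the covariance-ratio estimator used here.

```python
import numpy as np

rng = np.random.default_rng(4)
N, T, beta = 1000, 6, 1.5

alpha = rng.normal(0, 2, (N, 1))              # entity fixed effect
omega = rng.normal(0, 1, (N, T))              # time-varying confounder
z = 0.5 * alpha + rng.normal(0, 1, (N, T))    # instrument: related to alpha, not omega
x = 0.5 * alpha + 0.6 * z + omega + rng.normal(0, 0.8, (N, T))
y = alpha + beta * x + omega + rng.normal(0, 0.5, (N, T))

def demean(a):
    """Within transformation: subtract each entity's time mean."""
    return a - a.mean(axis=1, keepdims=True)

def iv_slope(z, x, y):
    """Single-instrument IV estimator: Cov(z, y) / Cov(z, x)."""
    z, x, y = z.ravel(), x.ravel(), y.ravel()
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

b_iv_pooled = iv_slope(z, x, y)                           # alpha_i left in the error
b_iv_fe = iv_slope(demean(z), demean(x), demean(y))       # alpha_i removed first

print(f"True effect: {beta:.3f}")
print(f"IV-Pooled:   {b_iv_pooled:.3f}")
print(f"IV-FE:       {b_iv_fe:.3f}")
```

In this design the pooled IV estimate is biased because the instrument is correlated with $\alpha_i$, while the within-transformed version recovers a value close to the true effect.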

---

## 4.2 Simulate IV-FE Data

In [None]:
# Simulate panel with fixed effects AND endogeneity
np.random.seed(123)
N, T = 100, 6

# Fixed effects (time-invariant heterogeneity)
alpha = np.random.normal(0, 2, N)

# Time-varying shock (creates endogeneity)
omega = np.random.normal(0, 1, (N, T))

data_ivfe = []
for i in range(N):
    for t in range(T):
        # Instrument: independent of omega (exogenous within entity),
        # but correlated with alpha_i, so pooled IV remains biased
        z = 3 + 0.5 * alpha[i] + np.random.normal(0, 1)
        
        # Endogenous variable (driven by alpha, the instrument, and omega)
        x_endo = 5 + 0.5 * alpha[i] + 0.6 * z + omega[i, t] + np.random.normal(0, 0.8)
        
        # Outcome (true effect: beta = 1.5)
        y = 10 + alpha[i] + 1.5 * x_endo + omega[i, t] + np.random.normal(0, 0.5)
        
        data_ivfe.append({
            'entity': i,
            'time': t,
            'y': y,
            'x_endo': x_endo,
            'z': z
        })

df_ivfe = pd.DataFrame(data_ivfe)

print(f"Simulated IV-FE data: {df_ivfe.shape}")
print(df_ivfe.head(10))

## 4.3 Compare Estimators

In [None]:
# 1. Naive OLS (biased: ignores alpha_i AND endogeneity)
ols_naive = PooledOLS(
    formula="y ~ x_endo",
    data=df_ivfe,
    entity_col='entity',
    time_col='time'
).fit(cov_type='clustered')

# 2. Fixed Effects (eliminates alpha_i, but x still endogenous)
fe_naive = FixedEffectsOLS(
    formula="y ~ x_endo",
    data=df_ivfe,
    entity_col='entity',
    time_col='time'
).fit(cov_type='clustered')

# 3. IV-Pooled (addresses endogeneity, ignores alpha_i)
iv_pooled_naive = PanelIV(
    formula="y ~ x_endo | z",
    data=df_ivfe,
    entity_col='entity',
    time_col='time',
    model_type='pooled'
).fit(cov_type='clustered')

# 4. IV-FE (addresses BOTH alpha_i AND endogeneity)
iv_fe = PanelIV(
    formula="y ~ x_endo | z",
    data=df_ivfe,
    entity_col='entity',
    time_col='time',
    model_type='fe'
).fit(cov_type='clustered')

print("=" * 70)
print("ESTIMATOR COMPARISON")
print("=" * 70)
print(f"True effect:           β = 1.500\n")
print(f"OLS (Pooled):          β̂ = {ols_naive.params['x_endo']:.3f}  [Bias: {ols_naive.params['x_endo'] - 1.5:.3f}]")
print(f"  → Biased: Ignores α_i AND endogeneity\n")

print(f"FE (no IV):            β̂ = {fe_naive.params['x_endo']:.3f}  [Bias: {fe_naive.params['x_endo'] - 1.5:.3f}]")
print(f"  → Eliminates α_i, but x_endo still endogenous\n")

print(f"IV-Pooled:             β̂ = {iv_pooled_naive.params['x_endo']:.3f}  [Bias: {iv_pooled_naive.params['x_endo'] - 1.5:.3f}]")
print(f"  → Addresses endogeneity, but ignores α_i\n")

print(f"IV-FE (CORRECT):       β̂ = {iv_fe.params['x_endo']:.3f}  [Bias: {iv_fe.params['x_endo'] - 1.5:.3f}]")
print(f"  → ✓ Addresses BOTH problems\n")

In [None]:
# Visualize comparison
estimates = [
    ('True', 1.500, 0),
    ('OLS', ols_naive.params['x_endo'], ols_naive.std_errors['x_endo']),
    ('FE', fe_naive.params['x_endo'], fe_naive.std_errors['x_endo']),
    ('IV-Pooled', iv_pooled_naive.params['x_endo'], iv_pooled_naive.std_errors['x_endo']),
    ('IV-FE', iv_fe.params['x_endo'], iv_fe.std_errors['x_endo'])
]

fig, ax = plt.subplots(figsize=(10, 6))

names = [e[0] for e in estimates]
coefs = [e[1] for e in estimates]
errors = [e[2] for e in estimates]

colors = ['green', 'red', 'orange', 'blue', 'darkgreen']
y_pos = np.arange(len(names))

ax.barh(y_pos, coefs, xerr=errors, color=colors, alpha=0.7, capsize=5)
ax.axvline(1.5, color='black', linestyle='--', linewidth=2, label='True Value')
ax.set_yticks(y_pos)
ax.set_yticklabels(names)
ax.set_xlabel('Coefficient Estimate', fontsize=12)
ax.set_title('Estimator Comparison: IV-FE vs Alternatives', fontsize=14, fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3, axis='x')
plt.tight_layout()
plt.show()

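# Report which estimator landed closest to the true value in this sample
closest = min(estimates[1:], key=lambda e: abs(e[1] - 1.5))[0]
print(f"Closest to the true value (1.5): {closest}")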

## 4.4 IV-FE Full Results

In [None]:
# Display complete IV-FE results
print("=" * 70)
print("COMPLETE IV-FE RESULTS")
print("=" * 70)
print(iv_fe.summary())

# First-stage diagnostics
print("\n" + "=" * 70)
print("FIRST-STAGE DIAGNOSTICS (IV-FE)")
print("=" * 70)

for endog_var, stats in iv_fe.first_stage_results.items():
    print(f"\nEndogenous: '{endog_var}'")
    print(f"  F-statistic: {stats['f_statistic']:.2f}")
    print(f"  P-value:     {stats['f_pvalue']:.4f}")
    
    if stats['f_statistic'] > 10:
        print(f"  ✓ Strong instrument (F > 10)")
    else:
        print(f"  ⚠️  Weak instrument (F < 10)")

---

## 4.5 Real-World Example: Returns to Education

### The Problem
Estimating returns to schooling:
$$\log(\text{wage}_{it}) = \beta_0 + \beta_1 \text{education}_{it} + \beta_2 \text{experience}_{it} + \alpha_i + \varepsilon_{it}$$

### Endogeneity Issues
1. **Ability bias**: High-ability individuals get more education AND earn more
   - $\alpha_i$ captures time-invariant ability
   - **FE eliminates this**

2. **Time-varying shocks**: Health, family circumstances affect both education and wages
   - $\varepsilon_{it}$ still correlated with education
   - **Need IV in addition to FE**

### Instrument: College Openings
- New colleges in local area → easier access to higher education
- **Relevance**: Predicts education levels
- **Exogeneity**: College location determined by policy, not individual wages
- **Exclusion**: Affects wages only through education

### Approach
**IV-FE**:
- Fixed effects control for ability ($\alpha_i$)
- College openings instrument for education ($X_{it}$)

---

In [None]:
# Simulate wage panel (stylized example)
np.random.seed(999)
N, T = 200, 8

# Individual ability (time-invariant)
ability = np.random.normal(0, 1, N)

# College openings (external variation)
college_opening = np.random.binomial(1, 0.3, (N, T))

data_wage = []
for i in range(N):
    base_education = 12 + ability[i] + np.random.normal(0, 1)
    for t in range(T):
        # Experience
        experience = t + np.random.uniform(0, 2)
        
        # Education (affected by college opening and time-varying shocks)
        shock = np.random.normal(0, 0.5)
        education = base_education + 2 * college_opening[i, t] + shock
        
        # Wage (true return to education: 0.08)
        log_wage = 2.0 + 0.08 * education + 0.02 * experience + ability[i] + shock + np.random.normal(0, 0.1)
        
        data_wage.append({
            'person_id': i,
            'year': 2000 + t,
            'log_wage': log_wage,
            'education': education,
            'experience': experience,
            'college_opening': college_opening[i, t]
        })

df_wage = pd.DataFrame(data_wage)

print(f"Simulated wage panel: {df_wage.shape}")
print(df_wage.head(10))

In [None]:
# Compare estimators on wage data

# 1. OLS (biased by ability)
ols_wage = PooledOLS(
    formula="log_wage ~ education + experience",
    data=df_wage,
    entity_col='person_id',
    time_col='year'
).fit(cov_type='clustered')

# 2. FE (controls ability, but education still endogenous)
fe_wage = FixedEffectsOLS(
    formula="log_wage ~ education + experience",
    data=df_wage,
    entity_col='person_id',
    time_col='year'
).fit(cov_type='clustered')

# 3. IV-FE (controls ability AND instruments education)
iv_fe_wage = PanelIV(
    formula="log_wage ~ experience + education | experience + college_opening",
    data=df_wage,
    entity_col='person_id',
    time_col='year',
    model_type='fe'
).fit(cov_type='clustered')

print("=" * 70)
print("RETURNS TO EDUCATION: ESTIMATOR COMPARISON")
print("=" * 70)
print(f"\nTrue return to education: 8.0%\n")
print(f"OLS:      {100*ols_wage.params['education']:.2f}%  (biased by ability)")
print(f"FE:       {100*fe_wage.params['education']:.2f}%  (controls ability, but education endogenous)")
print(f"IV-FE:    {100*iv_fe_wage.params['education']:.2f}%  (✓ correct specification)")

print("\n" + "=" * 70)
print("FIRST-STAGE: COLLEGE OPENING → EDUCATION")
print("=" * 70)
for endog_var, stats in iv_fe_wage.first_stage_results.items():
    print(f"F-statistic: {stats['f_statistic']:.2f}")
    print(f"P-value:     {stats['f_pvalue']:.4f}")
    if stats['f_statistic'] > 10:
        print("✓ Strong instrument")
    else:
        print("⚠️  Weak instrument")

---

# Section 5: Practical Exercises

## Exercise 5.1: IV vs OLS Comparison

**Task**: Load endogenous data, estimate both OLS and IV, compare results.

**Steps**:
1. Use the `df_iv` data from Section 3
2. Estimate OLS (biased baseline)
3. Estimate IV-Pooled with `z_lag` as instrument
4. Compare coefficients
5. Check first-stage F-statistic
6. Interpret: Why does IV differ from OLS?

In [None]:
# EXERCISE 5.1: YOUR CODE HERE

# Step 1: Estimate OLS
# TODO: Use PooledOLS on df_iv

# Step 2: Estimate IV-Pooled
# TODO: Use PanelIV with z_lag as instrument

# Step 3: Compare coefficients
# TODO: Print both estimates side-by-side

# Step 4: Check first-stage F-statistic
# TODO: Extract and interpret F-statistic

pass  # Remove this line when you add your code

### Solution 5.1

In [None]:
# SOLUTION 5.1

# Step 1: OLS
ols_ex = PooledOLS(
    formula="y ~ x_endo",
    data=df_iv,
    entity_col='entity',
    time_col='time'
).fit(cov_type='clustered')

# Step 2: IV
iv_ex = PanelIV(
    formula="y ~ x_endo | z_lag",
    data=df_iv,
    entity_col='entity',
    time_col='time',
    model_type='pooled'
).fit(cov_type='clustered')

# Step 3: Compare
print("=" * 60)
print("EXERCISE 5.1 RESULTS")
print("=" * 60)
print(f"True effect:  2.000")
print(f"OLS estimate: {ols_ex.params['x_endo']:.3f}")
print(f"IV estimate:  {iv_ex.params['x_endo']:.3f}")
print(f"\nOLS-IV difference: {ols_ex.params['x_endo'] - iv_ex.params['x_endo']:.3f}")
print("→ OLS biased upward due to endogeneity")

# Step 4: First-stage
print("\n" + "=" * 60)
print("FIRST-STAGE DIAGNOSTICS")
print("=" * 60)
for var, stats in iv_ex.first_stage_results.items():
    print(f"F-statistic: {stats['f_statistic']:.2f}")
    print(f"Strength: {'Strong (F>10)' if stats['f_statistic'] > 10 else 'Weak (F<10)'}")

---

## Exercise 5.2: IV-FE Application

**Task**: Apply IV-FE to the wage data and interpret results.

**Steps**:
1. Use the `df_wage` data from Section 4.5
2. Estimate FE without IV
3. Estimate IV-FE with `college_opening` as instrument
4. Compare results: Does IV-FE correct additional bias?
5. Interpret the return to education
6. Check instrument strength

In [None]:
# EXERCISE 5.2: YOUR CODE HERE

# Step 1: FE without IV
# TODO: Estimate FixedEffectsOLS

# Step 2: IV-FE
# TODO: Estimate PanelIV with model_type='fe'

# Step 3: Compare
# TODO: Print both coefficients

# Step 4: Interpret
# TODO: What is the % return to one year of education?

# Step 5: Check instrument
# TODO: Extract first-stage F-statistic

pass  # Remove this line when you add your code

### Solution 5.2

In [None]:
# SOLUTION 5.2

# Step 1: FE
fe_ex = FixedEffectsOLS(
    formula="log_wage ~ education + experience",
    data=df_wage,
    entity_col='person_id',
    time_col='year'
).fit(cov_type='clustered')

# Step 2: IV-FE
iv_fe_ex = PanelIV(
    formula="log_wage ~ experience + education | experience + college_opening",
    data=df_wage,
    entity_col='person_id',
    time_col='year',
    model_type='fe'
).fit(cov_type='clustered')

# Step 3: Compare
print("=" * 60)
print("EXERCISE 5.2 RESULTS")
print("=" * 60)
print(f"True return to education:  8.00%\n")
print(f"FE estimate:    {100*fe_ex.params['education']:.2f}%")
print(f"IV-FE estimate: {100*iv_fe_ex.params['education']:.2f}%")
print(f"\nDifference: {100*(fe_ex.params['education'] - iv_fe_ex.params['education']):.2f} percentage points")
print("→ IV-FE corrects time-varying endogeneity bias")

# Step 4: Interpret
print("\nINTERPRETATION:")
print(f"One additional year of education increases wages by about {100*iv_fe_ex.params['education']:.2f}%")

# Step 5: First-stage
print("\n" + "=" * 60)
print("INSTRUMENT STRENGTH")
print("=" * 60)
for var, stats in iv_fe_ex.first_stage_results.items():
    print(f"F-statistic: {stats['f_statistic']:.2f}")
    print(f"Assessment: {'✓ Strong' if stats['f_statistic'] > 10 else '⚠️  Weak'}")

---

# Section 6: Summary and Key Takeaways

## 6.1 Main Concepts

### Sources of Endogeneity
1. **Simultaneity**: $X_{it}$ and $y_{it}$ jointly determined
2. **Measurement error**: Noise in $X_{it}$ (worse in FE)
3. **Time-varying omitted variables**: $\omega_{it}$ correlated with $X_{it}$

### IV Requirements
1. **Relevance**: $\text{Cov}(Z, X) \neq 0$ (testable via F-stat)
2. **Exogeneity**: $\mathbb{E}[Z \cdot \varepsilon] = 0$ (not testable)
3. **Exclusion**: $Z$ affects $y$ only through $X$ (not testable)

### 2SLS Procedure
1. **First stage**: Regress $X_{\text{endo}}$ on $Z$ → obtain $\hat{X}$
2. **Second stage**: Regress $y$ on $\hat{X}$ → obtain $\hat{\beta}_{IV}$
3. **Diagnostics**: Check first-stage F > 10

### IV-FE
- **Combines** fixed effects (eliminates $\alpha_i$) with IV (addresses endogeneity)
- **Procedure**: Demean all variables, then apply 2SLS
- **Use when**: Both time-invariant heterogeneity AND endogeneity present

---

## 6.2 Practical Guidelines

### Instrument Selection
- **Look for**: Policy changes, geographic variation, randomized assignment
- **Avoid**: Variables that could directly affect outcome
- **Test**: Always report first-stage F-statistic

### Weak Instruments (F < 10)
- **Problem**: Finite-sample bias toward OLS
- **Solutions**: 
  - Find stronger instruments
  - Use LIML instead of 2SLS (more robust to weak instruments)
  - Report weak-instrument-robust confidence intervals
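
One weak-instrument-robust interval is the Anderson-Rubin confidence set: the values $\beta_0$ for which the instrument has no significant relationship with $y - \beta_0 X$. The sketch below covers one endogenous regressor and one instrument, using a grid search and the chi-square(1) 5% critical value as a large-sample approximation. `anderson_rubin_set` is a hypothetical helper and the data are simulated for illustration.

```python
import numpy as np

def anderson_rubin_set(y, x, z, grid, crit=3.841):
    """Grid values b0 not rejected at the 5% level by the Anderson-Rubin test."""
    zc = z - z.mean()
    n = len(y)
    kept = []
    for b0 in grid:
        u = y - b0 * x
        uc = u - u.mean()
        gamma = (zc @ uc) / (zc @ zc)            # slope of (y - b0*x) on z
        resid = uc - gamma * zc
        se = np.sqrt((resid @ resid) / (n - 2) / (zc @ zc))
        if (gamma / se) ** 2 <= crit:            # cannot reject gamma = 0
            kept.append(b0)
    return np.array(kept)

rng = np.random.default_rng(6)
n = 2_000
z = rng.normal(0, 1, n)
omega = rng.normal(0, 1, n)
x = 0.2 * z + omega + rng.normal(0, 1, n)        # modest first stage
y = 2.0 * x + omega + rng.normal(0, 0.5, n)

b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
grid = np.linspace(0.0, 4.0, 401)
ar_set = anderson_rubin_set(y, x, z, grid)
print(f"IV estimate: {b_iv:.3f}")
print(f"AR 95% set on the grid: from {ar_set.min():.2f} to {ar_set.max():.2f}")
```

With very weak instruments the set can be wide, unbounded, or made up of disjoint pieces, so reporting only its minimum and maximum is a simplification that suits this example.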

### Model Selection
| Estimator | Controls $\alpha_i$ | Addresses Endogeneity | Use When |
|-----------|---------------------|----------------------|----------|
| OLS       | No                  | No                   | Baseline |
| FE        | Yes                 | No                   | Only $\alpha_i$ problem |
| IV-Pooled | No                  | Yes                  | Only endogeneity |
| IV-FE     | Yes                 | Yes                  | Both problems |

---

## 6.3 Common Pitfalls

1. **Using weak instruments**: Always check F > 10
2. **Ignoring fixed effects**: When panel structure available, consider FE
3. **Overstating exogeneity**: Just because F > 10 doesn't mean exogeneity holds
4. **Measurement error in instruments**: Weakens first stage
5. **Serial correlation**: Invalidates lagged instruments

---

## 6.4 Next Steps

### Related Topics
- **Dynamic panels**: Arellano-Bond GMM (Notebook 08)
- **Heterogeneous effects**: Quantile IV
- **Multiple instruments**: Overidentification tests (J-test)
- **Weak instrument robust inference**: Anderson-Rubin test

### Further Reading
- Stock & Yogo (2005): "Testing for Weak Instruments in Linear IV Regression"
- Angrist & Pischke (2009): *Mostly Harmless Econometrics*, Chapter 4
- Wooldridge (2010): *Econometric Analysis of Cross Section and Panel Data*, Chapter 8

---

## Congratulations!

You have completed the Panel Instrumental Variables tutorial. You now understand:

- Sources of endogeneity in panel data
- How to identify and validate instruments
- 2SLS estimation mechanics
- Combining IV with fixed effects (IV-FE)
- Diagnosing weak instruments
- Applying IV to real-world problems

**Next**: Advanced topics in dynamic panels and GMM estimation.