# Advanced IV Diagnostics: Weak Instruments and Specification Testing

**Level**: Advanced-Expert  
**Estimated Duration**: 75-90 minutes  
**Date**: 2026-02-16

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Diagnose** weak instruments using first-stage F-statistics and Stock-Yogo critical values
2. **Understand** the consequences of weak instruments (finite-sample bias, unreliable inference)
3. **Conduct** overidentification tests (Sargan/Hansen J-test)
4. **Perform** endogeneity tests (Durbin-Wu-Hausman)
5. **Interpret** advanced IV diagnostics in panel data contexts
6. **Recognize** when IV estimation is unreliable
7. **Apply** weak-instrument-robust inference methods

---

## Prerequisites

**Conceptual**:
- Panel IV estimation (Notebook 05)
- Advanced IV theory
- Asymptotic theory basics

**Technical**:
- Hypothesis testing (Chi-squared, F-distribution)
- Matrix algebra
- Understanding of bias vs consistency

---

## Setup

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings('ignore')

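# Import PanelBox
# NOTE: the path below points to a local development checkout; adjust or
# remove it if PanelBox is already installed in your environment.
import sys
sys.path.insert(0, '/home/guhaase/projetos/panelbox')
import panelbox as pb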

# Set random seed for reproducibility
np.random.seed(42)

# Configure visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("PanelBox version:", pb.__version__)
print("Setup complete!")

---

## Section 1: The Problem of Weak Instruments

### 1.1 What Are Weak Instruments?

**Definition**: Instruments that have **low correlation** with the endogenous variable are called "weak instruments."

**Why does this matter?**

Valid IV estimation requires two conditions:
1. **Relevance**: Cov(Z, X) ≠ 0 (instrument correlated with endogenous variable)
2. **Exogeneity**: Cov(Z, u) = 0 (instrument uncorrelated with error)

When instruments are **weak** (relevance holds only barely, so Cov(Z, X) is close to zero):

- **Finite-sample bias** toward OLS (even if Z is valid!)
- **Standard errors underestimated** → tests over-reject
- **Confidence intervals too narrow** → misleading inference
- **Standard asymptotic approximations are poor**, even in fairly large samples

**Rule of Thumb**: First-stage F-statistic < 10 indicates weak instruments.
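
The bias toward OLS can be checked with a small Monte Carlo. The sketch below is an illustration under assumptions that are not part of this notebook's own simulations: the helpers `simulate_once` and `median_estimates` are hypothetical, the first-stage error shares a shock with the outcome error (so X is genuinely endogenous), and the first-stage coefficients 0.05 (weak) and 1.0 (strong) are arbitrary choices.

```python
import numpy as np

def simulate_once(rng, n, pi, beta=2.0, rho=0.8):
    """Draw one sample with an endogenous X; return (OLS slope, IV slope)."""
    z = rng.normal(size=n)
    u = rng.normal(size=n)  # structural error
    # The first-stage error shares variation with u, which makes X endogenous
    v = rho * u + np.sqrt(1 - rho**2) * rng.normal(size=n)
    x = pi * z + v
    y = beta * x + u
    zc, xc, yc = z - z.mean(), x - x.mean(), y - y.mean()
    b_ols = (xc @ yc) / (xc @ xc)
    b_iv = (zc @ yc) / (zc @ xc)  # just-identified IV estimator
    return b_ols, b_iv

def median_estimates(pi, n=200, reps=2000, seed=0):
    """Median OLS and IV estimates across Monte Carlo replications."""
    rng = np.random.default_rng(seed)
    draws = np.array([simulate_once(rng, n, pi) for _ in range(reps)])
    return np.median(draws, axis=0)

ols_weak, iv_weak = median_estimates(pi=0.05)
ols_strong, iv_strong = median_estimates(pi=1.0)

print(f"Weak   (pi=0.05): median OLS = {ols_weak:.2f}, median IV = {iv_weak:.2f}")
print(f"Strong (pi=1.00): median OLS = {ols_strong:.2f}, median IV = {iv_strong:.2f}")
```

The median is reported rather than the mean because the just-identified IV estimator has no finite mean. With the strong instrument the median IV estimate should sit close to the true value of 2.0, while with the weak instrument it is pulled toward the (biased) OLS estimate.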

---

### 1.2 Simulating Weak vs Strong Instruments

Let's simulate data to see the difference.

In [None]:
# Simulate strong vs weak instruments
np.random.seed(42)
N = 500

# Strong instrument (high correlation with X)
z_strong = np.random.normal(0, 1, N)
x_endo_strong = 0.7 * z_strong + np.random.normal(0, 1)  # Corr ≈ 0.57

# Weak instrument (low correlation with X)
z_weak = np.random.normal(0, 1)
x_endo_weak = 0.1 * z_weak + np.random.normal(0, 1)  # Corr ≈ 0.10

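# True data-generating process
# Note: the outcome noise is drawn independently of X, so X is exogenous here;
# this simulation isolates instrument strength, not endogeneity bias.
beta_true = 2.0
y_strong = beta_true * x_endo_strong + np.random.normal(0, 1, N)
y_weak = beta_true * x_endo_weak + np.random.normal(0, 1, N)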

# Calculate correlations
corr_strong = np.corrcoef(z_strong, x_endo_strong)[0, 1]
corr_weak = np.corrcoef(z_weak, x_endo_weak)[0, 1]

print("="*60)
print("INSTRUMENT STRENGTH COMPARISON")
print("="*60)
print(f"\n{'Instrument Type':<20} {'Correlation(Z, X)':>20}")
print("-"*60)
print(f"{'Strong Instrument':<20} {corr_strong:>20.3f}")
print(f"{'Weak Instrument':<20} {corr_weak:>20.3f}")
print("="*60)

print(f"\nTrue β: {beta_true}")
print("\n⚠ The weak instrument has correlation < 0.15 with X")
print("   This will likely result in:")
print("   - Biased estimates")
print("   - Unreliable standard errors")
print("   - Invalid hypothesis tests")

### 1.3 Visualization: Weak vs Strong Instruments

In [None]:
# Visualize instrument strength
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Strong instrument
axes[0].scatter(z_strong, x_endo_strong, alpha=0.5, s=20)
axes[0].set_xlabel('Instrument (Z)', fontsize=11)
axes[0].set_ylabel('Endogenous Variable (X)', fontsize=11)
axes[0].set_title(f'Strong Instrument\nCorr(Z, X) = {corr_strong:.3f}', 
                  fontsize=12, fontweight='bold')
axes[0].grid(alpha=0.3)

# Add regression line
z_sorted_strong = np.sort(z_strong)
fit_strong = np.polyfit(z_strong, x_endo_strong, 1)
axes[0].plot(z_sorted_strong, fit_strong[0] * z_sorted_strong + fit_strong[1], 
             'r-', linewidth=2, label='First-stage fit')
axes[0].legend()

# Weak instrument
axes[1].scatter(z_weak, x_endo_weak, alpha=0.5, s=20, color='coral')
axes[1].set_xlabel('Instrument (Z)', fontsize=11)
axes[1].set_ylabel('Endogenous Variable (X)', fontsize=11)
axes[1].set_title(f'Weak Instrument\nCorr(Z, X) = {corr_weak:.3f}', 
                  fontsize=12, fontweight='bold')
axes[1].grid(alpha=0.3)

# Add regression line
z_sorted_weak = np.sort(z_weak)
fit_weak = np.polyfit(z_weak, x_endo_weak, 1)
axes[1].plot(z_sorted_weak, fit_weak[0] * z_sorted_weak + fit_weak[1], 
             'r-', linewidth=2, label='First-stage fit')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n📊 Notice:")
print("   - Strong instrument: Clear positive relationship")
print("   - Weak instrument: Nearly no visible relationship (almost flat line)")

---

## Section 2: First-Stage F-Statistic

### 2.1 Understanding the First-Stage F-Statistic

The **first-stage regression** is:

$$
X_{\text{endo}} = \pi_0 + \pi_1 Z + v
$$

The **F-statistic** tests:
- **H₀**: π₁ = 0 (instrument is irrelevant)
- **H₁**: π₁ ≠ 0 (instrument is relevant)

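**Interpretation**:
- **F > 10**: Acceptable (rule of thumb)
- **F > 16.38**: Strong (Stock-Yogo, 1 instrument: a nominal 5% Wald test has actual size of at most 10%)
- **F > 19.93**: Very strong (exceeds the Stock-Yogo 10% maximal-size critical value for 2 instruments)

**Stock-Yogo Critical Values** (for 1 instrument, 1 endogenous variable). These values are based on the maximal size of a nominal 5% Wald test. The Stock-Yogo relative-bias critical values are not defined for this case, because they require at least three instruments for one endogenous variable.

| Max Wald Test Size | Critical F |
|--------------------|------------|
| 10%                | 16.38      |
| 15%                | 8.96       |
| 20%                | 6.66       |
| 25%                | 5.53       |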
---

### 2.2 Panel IV with First-Stage Diagnostics

In [None]:
# Create panel data with strong instrument
np.random.seed(123)
data_panel = []

for i in range(100):  # 100 entities
    for t in range(5):  # 5 time periods
        z = np.random.normal(0, 1)
        x_endo = 0.6 * z + np.random.normal(0, 1)  # Strong instrument
        y = 2 * x_endo + np.random.normal(0, 0.5)
        data_panel.append({
            'entity': i, 
            'time': t, 
            'y': y, 
            'x_endo': x_endo, 
            'z': z
        })

df_panel = pd.DataFrame(data_panel)

print("Panel Data Shape:", df_panel.shape)
print("\nFirst 5 rows:")
print(df_panel.head())

In [None]:
# Helper function to compute first-stage F-statistic
def compute_first_stage_f(df, endog_col, instrument_col):
    """
    Compute first-stage F-statistic.
    
    Parameters
    ----------
    df : DataFrame
    endog_col : str
        Endogenous variable column
    instrument_col : str or list
        Instrument column(s)
    
    Returns
    -------
    dict : F-statistic and p-value
    """
    X_endo = df[[endog_col]].values
    
    if isinstance(instrument_col, str):
        Z = df[[instrument_col]].values
    else:
        Z = df[instrument_col].values
    
    # Add constant
    Z_const = np.column_stack([np.ones(len(Z)), Z])
    
    # OLS
    pi_hat = np.linalg.lstsq(Z_const, X_endo, rcond=None)[0]
    X_pred = Z_const @ pi_hat
    resid_fs = X_endo - X_pred
    
    # F-statistic
    SSR_r = np.sum((X_endo - X_endo.mean())**2)
    SSR_u = np.sum(resid_fs**2)
    
    num_instruments = Z.shape[1]
    N = len(X_endo)
    
    f_stat = ((SSR_r - SSR_u) / num_instruments) / (SSR_u / (N - num_instruments - 1))
    f_pval = 1 - stats.f.cdf(f_stat, num_instruments, N - num_instruments - 1)
    
    return {'f_statistic': f_stat, 'f_pvalue': f_pval}

# Compute first-stage F
fs_result = compute_first_stage_f(df_panel, 'x_endo', 'z')

print("="*60)
print("FIRST-STAGE DIAGNOSTICS")
print("="*60)
print(f"\nFirst-Stage F-statistic: {fs_result['f_statistic']:.2f}")
print(f"P-value:                 {fs_result['f_pvalue']:.6f}")

# Stock-Yogo evaluation
f_stat = fs_result['f_statistic']
print("\nStock-Yogo Assessment:")
if f_stat > 19.93:
    print("  ✓ Very Strong Instrument (F > 19.93)")
elif f_stat > 16.38:
    print("  ✓ Strong Instrument (F > 16.38, Stock-Yogo 10% maximal size)")
elif f_stat > 10:
    print("  ⚡ Acceptable (F > 10, rule of thumb)")
else:
    print("  ⚠ WARNING: Weak Instrument (F < 10)")
    print("     → IV estimates may be severely biased")
    print("     → Standard errors unreliable")

### 2.3 Comparing Strong vs Weak Instruments in Panel Context

In [None]:
# Simulate weak instrument panel data
np.random.seed(456)
data_panel_weak = []

for i in range(100):
    for t in range(5):
        z = np.random.normal(0, 1)
        x_endo = 0.08 * z + np.random.normal(0, 1)  # WEAK instrument
        y = 2 * x_endo + np.random.normal(0, 0.5)
        data_panel_weak.append({
            'entity': i,
            'time': t,
            'y': y,
            'x_endo': x_endo,
            'z': z
        })

df_panel_weak = pd.DataFrame(data_panel_weak)

# Compare strong vs weak
fs_strong = compute_first_stage_f(df_panel, 'x_endo', 'z')
fs_weak = compute_first_stage_f(df_panel_weak, 'x_endo', 'z')

print("="*60)
print("FIRST-STAGE F-STATISTIC COMPARISON")
print("="*60)
print(f"\n{'Instrument Type':<20} {'F-Statistic':>15} {'Assessment':>25}")
print("-"*60)

f_strong = fs_strong['f_statistic']
if f_strong > 19.93:
    assess_strong = "✓ Very Strong"
elif f_strong > 16.38:
    assess_strong = "✓ Strong"
elif f_strong > 10:
    assess_strong = "⚡ Acceptable"
else:
    assess_strong = "⚠ Weak"

print(f"{'Strong Instrument':<20} {f_strong:>15.2f} {assess_strong:>25}")

f_weak = fs_weak['f_statistic']
if f_weak > 19.93:
    assess_weak = "✓ Very Strong"
elif f_weak > 16.38:
    assess_weak = "✓ Strong"
elif f_weak > 10:
    assess_weak = "⚡ Acceptable"
else:
    assess_weak = "⚠ Weak"

print(f"{'Weak Instrument':<20} {f_weak:>15.2f} {assess_weak:>25}")
print("="*60)

print("\n📌 Key Insight:")
if f_weak < 10:
    print("   The weak instrument has F < 10, indicating serious problems!")
    print("   IV estimates will be biased and unreliable.")

---

## Section 3: Overidentification Test (J-Test)

### 3.1 The Sargan-Hansen J-Test

**When applicable**: Number of instruments > Number of endogenous variables ("overidentified")

**Purpose**: Test if **all** instruments are valid (uncorrelated with error term)

**Hypotheses**:
- **H₀**: All instruments are valid (E[Z_i · u_i] = 0 for all instruments)
- **H₁**: At least one instrument is invalid

**Test Statistic** (Sargan's form, which assumes homoskedastic errors):
$$
J = N \times R^2_{\text{residuals on instruments}}
$$

Here the R² comes from regressing the 2SLS residuals, computed with the actual X rather than the first-stage fitted values, on all instruments.

**Distribution under H₀**:
$$
J \sim \chi^2(q), \quad q = (\text{number of instruments}) - (\text{number of endogenous variables})
$$

**Interpretation**:
- **Reject H₀** (p < 0.05): At least one instrument is invalid
- **Fail to reject**: Instruments appear valid (cannot reject orthogonality)

**Limitations**:
- Cannot detect if **all** instruments are invalid
- Low power in some cases
- Does NOT test relevance (only exogeneity)
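
The overidentification requirement is not a technicality: in an exactly identified model the IV residuals are orthogonal to the instruments by construction, so J is zero no matter how invalid the instrument is. The sketch below illustrates this under assumed, hypothetical settings (a single instrument `z` that enters the outcome equation directly with coefficient 0.8, violating the exclusion restriction).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.5 * z + 0.5 * u + rng.normal(size=n)
y = 2.0 * x + u + 0.8 * z  # z affects y directly: the instrument is INVALID

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])

# Just-identified IV estimator: solves Z'(y - X b) = 0 exactly
beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)
resid = y - X @ beta_iv

# Sample moment conditions and the N * R^2 statistic
moments = Z.T @ resid / n
gamma, *_ = np.linalg.lstsq(Z, resid, rcond=None)
fitted = Z @ gamma
r2 = 1.0 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
j_stat = n * r2

print(f"IV slope estimate: {beta_iv[1]:.3f} (true value 2.0)")
print(f"Sample moments:    {moments}")
print(f"J-statistic:       {j_stat:.2e}")
```

The slope estimate is far from 2.0, yet the sample moments and J are zero up to floating-point error. The test only has something to check when there are more instruments than endogenous variables, and even then it compares instruments against each other rather than against the truth.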

---

### 3.2 Implementing the J-Test

In [None]:
# Simulate overidentified model: 2 instruments, 1 endogenous variable
np.random.seed(789)
data_overid = []

for i in range(100):
    for t in range(5):
        # Two valid instruments
        z1 = np.random.normal(0, 1)
        z2 = np.random.normal(0, 1)
        
        # Endogenous variable depends on both
        x_endo = 0.5 * z1 + 0.4 * z2 + np.random.normal(0, 1)
        
        # Outcome (error uncorrelated with instruments)
        u = np.random.normal(0, 0.5)
        y = 2 * x_endo + u
        
        data_overid.append({
            'entity': i,
            'time': t,
            'y': y,
            'x_endo': x_endo,
            'z1': z1,
            'z2': z2
        })

df_overid = pd.DataFrame(data_overid)

print("Overidentified Model Data:")
print(f"  Observations: {len(df_overid)}")
print(f"  # Instruments: 2")
print(f"  # Endogenous: 1")
print(f"  Degrees of overidentification: 2 - 1 = 1")
print("\nFirst 5 rows:")
print(df_overid.head())

In [None]:
# J-test computation function
def compute_j_test(y, X_endo, Z):
    """
    Compute Sargan-Hansen J-test for overidentification.
    
    Parameters
    ----------
    y : array
        Dependent variable
    X_endo : array
        Endogenous variables
    Z : array
        Instruments
    
    Returns
    -------
    dict : J-statistic, df, p-value
    """
    N = len(y)
    
    # Add constants
    X_endo_const = np.column_stack([np.ones(N), X_endo])
    Z_const = np.column_stack([np.ones(N), Z])
    
    # 2SLS estimation
    # First stage
    pi_hat = np.linalg.lstsq(Z_const, X_endo_const[:, 1:], rcond=None)[0]
    X_endo_pred = Z_const @ pi_hat
    X_endo_pred_const = np.column_stack([np.ones(N), X_endo_pred])
    
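    # Second stage: estimate beta from the fitted values, but compute the
    # 2SLS residuals with the ACTUAL regressors (not the fitted ones)
    beta_2sls = np.linalg.lstsq(X_endo_pred_const, y, rcond=None)[0]
    resid_2sls = y - X_endo_const @ beta_2sls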
    
    # Regress residuals on instruments
    gamma_hat = np.linalg.lstsq(Z_const, resid_2sls, rcond=None)[0]
    resid_on_z = Z_const @ gamma_hat
    
    # R-squared from residuals on instruments
    ss_total = np.sum((resid_2sls - resid_2sls.mean())**2)
    ss_resid = np.sum((resid_2sls - resid_on_z)**2)
    r_squared = 1 - (ss_resid / ss_total)
    
    # J-statistic
    j_stat = N * r_squared
    
    # Degrees of freedom
    num_instruments = Z.shape[1]
    num_endog = X_endo.shape[1] if X_endo.ndim > 1 else 1
    df_j = num_instruments - num_endog
    
    # P-value
    p_value = 1 - stats.chi2.cdf(j_stat, df_j)
    
    return {
        'j_statistic': j_stat,
        'df': df_j,
        'p_value': p_value
    }

# Apply to data
y = df_overid['y'].values
X_endo = df_overid[['x_endo']].values
Z = df_overid[['z1', 'z2']].values

j_test_result = compute_j_test(y, X_endo, Z)

print("="*60)
print("OVERIDENTIFICATION TEST (J-TEST)")
print("="*60)
print(f"  J-statistic: {j_test_result['j_statistic']:.4f}")
print(f"  df:          {j_test_result['df']}")
print(f"  p-value:     {j_test_result['p_value']:.4f}")
print("-"*60)

if j_test_result['p_value'] > 0.05:
    print("  ✓ Fail to reject H₀: Instruments appear valid")
    print("    (Cannot reject orthogonality conditions)")
else:
    print("  ✗ Reject H₀: At least one instrument is invalid")
    print("    (Evidence against orthogonality conditions)")

print("="*60)

### 3.3 J-Test with Invalid Instrument

In [None]:
# Simulate with one INVALID instrument
np.random.seed(999)
data_invalid = []

for i in range(100):
    for t in range(5):
        u = np.random.normal(0, 0.5)  # Error term
        
        z1 = np.random.normal(0, 1)  # Valid instrument
        z2 = np.random.normal(0, 1) + 0.5 * u  # INVALID: correlated with error!
        
        x_endo = 0.5 * z1 + 0.4 * z2 + np.random.normal(0, 1)
        y = 2 * x_endo + u
        
        data_invalid.append({
            'entity': i,
            'time': t,
            'y': y,
            'x_endo': x_endo,
            'z1': z1,
            'z2': z2
        })

df_invalid = pd.DataFrame(data_invalid)

# Compute J-test
y_inv = df_invalid['y'].values
X_endo_inv = df_invalid[['x_endo']].values
Z_inv = df_invalid[['z1', 'z2']].values

j_test_invalid = compute_j_test(y_inv, X_endo_inv, Z_inv)

print("="*60)
print("J-TEST WITH INVALID INSTRUMENT")
print("="*60)
print("\nData Setup:")
print("  z1: Valid instrument (uncorrelated with error)")
print("  z2: INVALID instrument (correlated with error)")
print("\nJ-Test Results:")
print("-"*60)
print(f"  J-statistic: {j_test_invalid['j_statistic']:.4f}")
print(f"  df:          {j_test_invalid['df']}")
print(f"  p-value:     {j_test_invalid['p_value']:.4f}")
print("-"*60)

if j_test_invalid['p_value'] < 0.05:
    print("  ✓ Test CORRECTLY detects invalid instrument!")
    print("    (Reject H₀: evidence of instrument invalidity)")
else:
    print("  ⚠ Test failed to detect invalid instrument")
    print("    (Low power or insufficient violation)")

print("="*60)

print("\n📌 Key Lesson:")
print("   J-test can detect invalid instruments when they are")
print("   sufficiently correlated with the error term.")
print("   However, it has limitations (low power, cannot detect")
print("   if ALL instruments are invalid).")

---

## Section 4: Endogeneity Test (Durbin-Wu-Hausman)

### 4.1 Testing for Endogeneity

**Question**: Is IV estimation even necessary? Or is X actually exogenous?

**Durbin-Wu-Hausman (DWH) Test**:
- **H₀**: X is exogenous (OLS is consistent and efficient)
- **H₁**: X is endogenous (need IV)

**Procedure**:
1. Estimate **first stage**: X = Z π + v, obtain residuals v̂
2. Estimate **augmented regression**: y = X β + v̂ δ + u
3. Test **H₀: δ = 0**
   - If reject: X is endogenous → use IV
   - If fail to reject: no evidence against exogeneity → OLS is preferred (more efficient)

**Intuition**: If v̂ (first-stage residuals) significantly predicts y once X is controlled for, then the part of X not explained by the instruments is correlated with the error, so X is endogenous.
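
The procedure can be sketched on data where X really is endogenous. The example below is a minimal illustration under assumed settings (X loads on the outcome error `u` with coefficient 0.7, homoskedastic errors, hypothetical variable names). It also checks a useful property of the augmented regression: the coefficient on X equals the 2SLS estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 1000
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.6 * z + 0.7 * u + rng.normal(size=n)  # endogenous: x depends on u
y = 2.0 * x + u

const = np.ones(n)
Z = np.column_stack([const, z])

# Step 1: first stage, keep the residuals
pi_hat, *_ = np.linalg.lstsq(Z, x, rcond=None)
v_hat = x - Z @ pi_hat

# Step 2: augmented regression of y on a constant, x and v_hat
W = np.column_stack([const, x, v_hat])
coef, *_ = np.linalg.lstsq(W, y, rcond=None)
resid = y - W @ coef
dof = n - W.shape[1]
cov = (resid @ resid / dof) * np.linalg.inv(W.T @ W)

# Step 3: t-test of delta = 0
t_delta = coef[2] / np.sqrt(cov[2, 2])
p_value = 2 * stats.t.sf(abs(t_delta), dof)

# 2SLS slope for comparison (just-identified case)
X = np.column_stack([const, x])
beta_2sls = np.linalg.solve(Z.T @ X, Z.T @ y)

print(f"delta_hat = {coef[2]:.3f}, t = {t_delta:.2f}, p = {p_value:.4g}")
print(f"Slope on x in augmented regression: {coef[1]:.4f}")
print(f"2SLS slope:                         {beta_2sls[1]:.4f}")
```

Because the augmented regression reproduces the 2SLS slope, this form of the test is sometimes described as a control-function approach: v̂ absorbs the endogenous part of X, and δ measures how much of it feeds into y.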

---

### 4.2 Manual Implementation

In [None]:
# Endogeneity test implementation
def durbin_wu_hausman_test(df, y_col, endog_col, instrument_cols):
    """
    Perform Durbin-Wu-Hausman test for endogeneity.
    
    Parameters
    ----------
    df : DataFrame
    y_col : str
        Dependent variable
    endog_col : str
        Suspected endogenous variable
    instrument_cols : list
        List of instrument column names
    
    Returns
    -------
    dict : Test results
    """
    y = df[y_col].values
    X_endo = df[[endog_col]].values
    Z = df[instrument_cols].values
    
    N = len(y)
    
    # Add constants
    Z_const = np.column_stack([np.ones(N), Z])
    
    # Step 1: First stage - regress X on Z
    pi_hat = np.linalg.lstsq(Z_const, X_endo, rcond=None)[0]
    X_pred = Z_const @ pi_hat
    v_hat = X_endo - X_pred  # First-stage residuals
    
    # Step 2: Augmented regression - y on X and v̂
    X_augmented = np.column_stack([np.ones(N), X_endo, v_hat])
    
    beta_augmented = np.linalg.lstsq(X_augmented, y, rcond=None)[0]
    delta_hat = beta_augmented[2]  # Coefficient on v̂
    
    # Compute standard error of delta_hat
    y_pred_aug = X_augmented @ beta_augmented
    resid_aug = y - y_pred_aug
    sigma2_aug = np.sum(resid_aug**2) / (N - X_augmented.shape[1])
    
    var_beta = sigma2_aug * np.linalg.inv(X_augmented.T @ X_augmented)
    se_delta = np.sqrt(var_beta[2, 2])
    
    # t-statistic
    t_stat = delta_hat / se_delta
    p_value = 2 * (1 - stats.t.cdf(np.abs(t_stat), N - X_augmented.shape[1]))
    
    return {
        'delta_hat': delta_hat,
        'se_delta': se_delta,
        't_statistic': t_stat,
        'p_value': p_value
    }

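# Test on original panel data. Note: in df_panel the outcome error is drawn
# independently of x_endo, so X is exogenous by construction and the test
# should usually fail to reject H₀.
dwh_result = durbin_wu_hausman_test(df_panel, 'y', 'x_endo', ['z'])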

print("="*60)
print("DURBIN-WU-HAUSMAN ENDOGENEITY TEST")
print("="*60)
print("\nNull Hypothesis (H₀): X is exogenous")
print("Alternative (H₁):     X is endogenous")
print("-"*60)
print(f"  δ̂ (coefficient on v̂):  {dwh_result['delta_hat']:.4f}")
print(f"  Standard error:        {dwh_result['se_delta']:.4f}")
print(f"  t-statistic:           {dwh_result['t_statistic']:.4f}")
print(f"  p-value:               {dwh_result['p_value']:.4f}")
print("-"*60)

if dwh_result['p_value'] < 0.05:
    print("  ✗ Reject H₀: X is ENDOGENOUS")
    print("    → Use IV/2SLS estimation")
    print("    → OLS would be biased and inconsistent")
else:
    print("  ✓ Fail to reject H₀: X is EXOGENOUS")
    print("    → Use OLS (more efficient than IV)")
    print("    → IV unnecessary")

print("="*60)

### 4.3 Testing with Truly Exogenous Data

In [None]:
# Simulate data where X is truly EXOGENOUS
np.random.seed(2024)
data_exog = []

for i in range(100):
    for t in range(5):
        z = np.random.normal(0, 1)  # Relevant instrument (needed for a meaningful first stage)
        x = 0.6 * z + np.random.normal(0, 1)  # Exogenous X: unrelated to the outcome error
        y = 2 * x + np.random.normal(0, 0.5)  # No endogeneity!
        
        data_exog.append({
            'entity': i,
            'time': t,
            'y': y,
            'x': x,
            'z': z
        })

df_exog = pd.DataFrame(data_exog)

# Run DWH test
dwh_exog = durbin_wu_hausman_test(df_exog, 'y', 'x', ['z'])

print("="*60)
print("DWH TEST WITH TRULY EXOGENOUS VARIABLE")
print("="*60)
print("\nData Setup: X is exogenous (no correlation with error)")
print("-"*60)
print(f"  δ̂:          {dwh_exog['delta_hat']:.4f}")
print(f"  t-statistic: {dwh_exog['t_statistic']:.4f}")
print(f"  p-value:     {dwh_exog['p_value']:.4f}")
print("-"*60)

if dwh_exog['p_value'] >= 0.05:
    print("  ✓ Test CORRECTLY fails to reject exogeneity")
    print("    → OLS is appropriate")
else:
    print("  ⚠ Type I error: falsely rejected exogeneity")

print("="*60)

print("\n📌 Interpretation:")
print("   When X is truly exogenous, DWH test should fail to reject H₀.")
print("   This tells us OLS is more efficient than IV.")

---

## Section 5: Weak-Instrument-Robust Inference

### 5.1 The Anderson-Rubin Test

**Problem**: When instruments are weak (F < 10), standard 2SLS inference is unreliable.

**Solution**: The **Anderson-Rubin (AR) Test** provides valid inference even with weak instruments.

**Key Features**:
- Tests hypotheses about β (e.g., H₀: β = β₀)
- **Robust to weak instruments** (does not rely on first-stage strength)
- Provides **confidence sets** instead of point estimates

**Limitation**: Does not produce a point estimate of β; it only tests specific values. With weak instruments the resulting confidence set can be very wide or even unbounded.

**Procedure** (simplified):
1. For a given β₀, compute: y - β₀ · X
2. Regress this on instruments Z
3. Test if coefficients on Z are jointly zero
4. If fail to reject → β₀ is in confidence set
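
The four steps above can be turned into a confidence set by repeating the test over a grid of candidate values. The sketch below is a minimal illustration, assuming homoskedastic errors, a hypothetical helper `ar_pvalue`, an arbitrary grid from 0 to 4, and a simulated DGP with an endogenous X; it is not a production implementation.

```python
import numpy as np
from scipy import stats

def ar_pvalue(y, x, Z, beta0):
    """AR test p-value for H0: beta = beta0 (constant included, homoskedastic errors)."""
    n, k = Z.shape
    Zc = np.column_stack([np.ones(n), Z])
    e0 = y - beta0 * x                              # step 1
    gamma, *_ = np.linalg.lstsq(Zc, e0, rcond=None)  # step 2
    ssr_u = np.sum((e0 - Zc @ gamma) ** 2)
    ssr_r = np.sum((e0 - e0.mean()) ** 2)
    dof = n - k - 1
    f_stat = ((ssr_r - ssr_u) / k) / (ssr_u / dof)   # step 3
    return stats.f.sf(f_stat, k, dof)

rng = np.random.default_rng(3)
n = 500
z = rng.normal(size=(n, 1))
u = rng.normal(size=n)
x = 0.5 * z[:, 0] + 0.6 * u + rng.normal(size=n)  # endogenous regressor
y = 2.0 * x + u

# Step 4: keep every candidate value that the test does not reject
grid = np.linspace(0.0, 4.0, 401)
pvals = np.array([ar_pvalue(y, x, z, b) for b in grid])
accepted = grid[pvals > 0.05]

print(f"Grid points accepted at the 5% level: {accepted.size}")
print(f"Range of accepted values: [{accepted.min():.2f}, {accepted.max():.2f}]")
```

Only the range of accepted grid points is printed here. With weak instruments the accepted set need not be a single interval, so in practice it is worth plotting the p-values against the grid rather than relying on the minimum and maximum alone.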

---

### 5.2 When to Use Weak-Instrument-Robust Methods

**Use when**:
- First-stage F < 10 (weak instruments)
- You need to test specific hypotheses about β
- Standard 2SLS confidence intervals seem unreliable

**Available methods**:
- **Anderson-Rubin (AR) test**: Most common
- **Conditional Likelihood Ratio (CLR) test**: More powerful
- **Limited Information Maximum Likelihood (LIML)**: Alternative estimator

**References**:
- Stock & Yogo (2005): Testing for weak instruments
- Andrews, Moreira, & Stock (2006): Optimal weak-instrument-robust tests

---

### 5.3 Practical Advice

**If F < 10**:
1. **Do NOT trust** standard 2SLS inference
2. **Report** first-stage F prominently
3. **Consider**:
   - Finding stronger instruments
   - Using weak-instrument-robust methods (AR test)
   - LIML estimator (less biased than 2SLS with weak instruments)
4. **Be transparent** about instrument weakness in reporting

**Note**: Anderson-Rubin and related tests are not typically implemented in standard packages. Advanced users can implement manually or use specialized software (e.g., Stata's `weakiv` package, R's `ivmodel`).
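
The LIML estimator listed above can be written as a k-class estimator, where κ = 1 reproduces 2SLS and κ = 0 reproduces OLS. The sketch below is a hedged illustration of that textbook formula for one endogenous regressor and a constant; the helper names (`residualize`, `k_class`, `liml_kappa`) and the simulated data are assumptions for illustration, not PanelBox functionality.

```python
import numpy as np
from scipy import linalg

def residualize(A, B):
    """Residuals from regressing B (vector or matrix) on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, B, rcond=None)
    return B - A @ coef

def k_class(y, x, Z2, kappa):
    """k-class estimate of (intercept, slope) for y = a + b*x + u."""
    n = len(y)
    const = np.ones((n, 1))
    X = np.column_stack([const, x])
    Z = np.column_stack([const, Z2])
    MX = residualize(Z, X)  # M_Z X
    My = residualize(Z, y)  # M_Z y
    lhs = X.T @ X - kappa * (MX.T @ MX)
    rhs = X.T @ y - kappa * (MX.T @ My)
    return np.linalg.solve(lhs, rhs)

def liml_kappa(y, x, Z2):
    """Smallest root of det(Y'M_W Y - kappa * Y'M_Z Y) = 0, with Y = [y, x]."""
    n = len(y)
    const = np.ones((n, 1))
    Y = np.column_stack([y, x])
    Z = np.column_stack([const, Z2])
    YW = residualize(const, Y)  # partial out the included exogenous regressors
    YZ = residualize(Z, Y)      # partial out all instruments
    return linalg.eigh(YW.T @ YW, YZ.T @ YZ, eigvals_only=True)[0]

rng = np.random.default_rng(4)
n = 1000
Z2 = rng.normal(size=(n, 3))  # three excluded instruments
u = rng.normal(size=n)
x = Z2 @ np.array([0.3, 0.2, 0.2]) + 0.7 * u + rng.normal(size=n)
y = 2.0 * x + u

kappa = liml_kappa(y, x, Z2)
beta_2sls = k_class(y, x, Z2, 1.0)
beta_liml = k_class(y, x, Z2, kappa)

print(f"LIML kappa: {kappa:.4f}")
print(f"2SLS slope: {beta_2sls[1]:.4f}")
print(f"LIML slope: {beta_liml[1]:.4f}")
```

In an exactly identified model κ equals 1 and LIML coincides with 2SLS, so the two estimators only differ when there are overidentifying restrictions.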

In [None]:
# Conceptual demonstration: AR test logic (simplified)
def anderson_rubin_test_concept(y, X_endo, Z, beta_0):
    """
    Simplified Anderson-Rubin test for H₀: β = β₀.
    
    This is a CONCEPTUAL demonstration. It assumes homoskedastic,
    uncorrelated errors; robust (e.g. clustered) versions are beyond scope.
    
    Parameters
    ----------
    y : array
    X_endo : array
    Z : array
    beta_0 : float
        Null hypothesis value
    
    Returns
    -------
    dict : Test statistic and interpretation
    """
    N = len(y)
    
    # Compute y - β₀ · X
    y_transformed = y - beta_0 * X_endo.flatten()
    
    # Add constant to Z
    Z_const = np.column_stack([np.ones(N), Z])
    
    # Regress transformed y on Z
    gamma_hat = np.linalg.lstsq(Z_const, y_transformed, rcond=None)[0]
    y_pred = Z_const @ gamma_hat
    resid = y_transformed - y_pred
    
    # F-statistic for H₀: γ = 0 (all coefficients on Z are zero)
    SSR_r = np.sum((y_transformed - y_transformed.mean())**2)
    SSR_u = np.sum(resid**2)
    
    num_instruments = Z.shape[1]
    
    f_stat = ((SSR_r - SSR_u) / num_instruments) / (SSR_u / (N - num_instruments - 1))
    p_value = 1 - stats.f.cdf(f_stat, num_instruments, N - num_instruments - 1)
    
    return {
        'f_statistic': f_stat,
        'p_value': p_value,
        'beta_0': beta_0
    }

print("="*60)
print("ANDERSON-RUBIN TEST (CONCEPTUAL)")
print("="*60)
print("\nThis demonstrates the AR test logic for weak instruments.")
print("True β = 2.0 in our simulated data.")
print("-"*60)

# Test different values of β
beta_values = [1.5, 2.0, 2.5, 3.0]

y_test = df_panel['y'].values
X_test = df_panel[['x_endo']].values
Z_test = df_panel[['z']].values

print(f"\n{'β₀':<10} {'F-statistic':<15} {'p-value':<15} {'In 95% CS?':>15}")
print("-"*60)

for beta_0 in beta_values:
    ar_result = anderson_rubin_test_concept(y_test, X_test, Z_test, beta_0)
    in_cs = "Yes" if ar_result['p_value'] > 0.05 else "No"
    print(f"{beta_0:<10.1f} {ar_result['f_statistic']:<15.2f} {ar_result['p_value']:<15.4f} {in_cs:>15}")

print("="*60)
print("\n📌 Interpretation:")
print("   Values of β₀ with p-value > 0.05 are in the 95% confidence set.")
print("   This method is valid even with weak instruments!")
print("\n⚠ Note: This is a simplified demonstration. Production use")
print("   requires proper implementation with finite-sample corrections.")

---

## Section 6: Complete Diagnostic Workflow

### 6.1 Comprehensive IV Diagnostic Function

In [None]:
def comprehensive_iv_diagnostics(df, y_col, endog_col, instrument_cols, entity_col='entity', time_col='time'):
    """
    Run complete IV diagnostic workflow.
    
    Parameters
    ----------
    df : DataFrame
    y_col : str
    endog_col : str
    instrument_cols : list
    entity_col : str
    time_col : str
    
    Returns
    -------
    dict : All diagnostic results
    """
    diagnostics = {}
    
    print("="*70)
    print("COMPREHENSIVE IV DIAGNOSTICS WORKFLOW")
    print("="*70)
    
    # ===== STEP 1: First-Stage F-Statistic =====
    print("\n" + "="*70)
    print("STEP 1: FIRST-STAGE F-STATISTIC (Instrument Relevance)")
    print("="*70)
    
    fs_result = compute_first_stage_f(df, endog_col, instrument_cols)
    diagnostics['first_stage_f'] = fs_result
    
    f_stat = fs_result['f_statistic']
    print(f"\n  F-statistic: {f_stat:.2f}")
    print(f"  p-value:     {fs_result['f_pvalue']:.6f}")
    
    print("\n  Reference thresholds:")
    print("    F > 19.93: Stock-Yogo 10% maximal size (2 instruments)")
    print("    F > 16.38: Stock-Yogo 10% maximal size (1 instrument)")
    print("    F > 10:    Rule of thumb")
    
    if f_stat > 19.93:
        print("\n  ✓ VERDICT: Very strong instrument (F > 19.93)")
        diagnostics['instrument_strength'] = 'very_strong'
    elif f_stat > 16.38:
        print("\n  ✓ VERDICT: Strong instrument (F > 16.38)")
        diagnostics['instrument_strength'] = 'strong'
    elif f_stat > 10:
        print("\n  ⚡ VERDICT: Acceptable instrument (10 < F < 16.38)")
        diagnostics['instrument_strength'] = 'acceptable'
    else:
        print("\n  ⚠ WARNING: Weak instrument (F < 10)")
        print("     → IV estimates will be biased")
        print("     → Standard errors unreliable")
        print("     → Consider weak-instrument-robust methods")
        diagnostics['instrument_strength'] = 'weak'
    
    # ===== STEP 2: Overidentification Test =====
    num_instruments = len(instrument_cols) if isinstance(instrument_cols, list) else 1
    num_endog = 1
    
    if num_instruments > num_endog:
        print("\n" + "="*70)
        print("STEP 2: OVERIDENTIFICATION TEST (J-Test)")
        print("="*70)
        
        y = df[y_col].values
        X_endo = df[[endog_col]].values
        Z = df[instrument_cols].values if isinstance(instrument_cols, list) else df[[instrument_cols]].values
        
        j_result = compute_j_test(y, X_endo, Z)
        diagnostics['j_test'] = j_result
        
        print(f"\n  J-statistic: {j_result['j_statistic']:.4f}")
        print(f"  df:          {j_result['df']}")
        print(f"  p-value:     {j_result['p_value']:.4f}")
        
        if j_result['p_value'] > 0.05:
            print("\n  ✓ VERDICT: Cannot reject instrument validity")
            print("     (Orthogonality conditions appear satisfied)")
            diagnostics['instruments_valid'] = True
        else:
            print("\n  ✗ VERDICT: Reject instrument validity")
            print("     (At least one instrument appears invalid)")
            diagnostics['instruments_valid'] = False
    else:
        print("\n" + "="*70)
        print("STEP 2: OVERIDENTIFICATION TEST (J-Test) - SKIPPED")
        print("="*70)
        print("\n  Model is exactly identified (# instruments = # endogenous)")
        print("  J-test requires overidentification.")
        diagnostics['j_test'] = None
    
    # ===== STEP 3: Endogeneity Test =====
    print("\n" + "="*70)
    print("STEP 3: ENDOGENEITY TEST (Durbin-Wu-Hausman)")
    print("="*70)
    
    dwh_result = durbin_wu_hausman_test(df, y_col, endog_col, instrument_cols if isinstance(instrument_cols, list) else [instrument_cols])
    diagnostics['dwh_test'] = dwh_result
    
    print(f"\n  δ̂ (coefficient on v̂): {dwh_result['delta_hat']:.4f}")
    print(f"  Standard error:        {dwh_result['se_delta']:.4f}")
    print(f"  t-statistic:           {dwh_result['t_statistic']:.4f}")
    print(f"  p-value:               {dwh_result['p_value']:.4f}")
    
    if dwh_result['p_value'] < 0.05:
        print("\n  ✗ VERDICT: Reject exogeneity (X is ENDOGENOUS)")
        print("     → IV estimation is necessary")
        print("     → OLS would be biased and inconsistent")
        diagnostics['is_endogenous'] = True
    else:
        print("\n  ✓ VERDICT: Cannot reject exogeneity (X may be EXOGENOUS)")
        print("     → OLS is more efficient than IV")
        print("     → IV may not be necessary")
        diagnostics['is_endogenous'] = False
    
    # ===== FINAL SUMMARY =====
    print("\n" + "="*70)
    print("DIAGNOSTIC SUMMARY AND RECOMMENDATIONS")
    print("="*70)
    
    print("\n📊 Results:")
    print(f"   - Instrument Strength:  {diagnostics['instrument_strength']}")
    print(f"   - Endogeneity Detected: {diagnostics['is_endogenous']}")
    if diagnostics['j_test'] is not None:
        print(f"   - Instruments Valid:    {diagnostics['instruments_valid']}")
    
    print("\n🎯 Recommendations:")
    
    if diagnostics['instrument_strength'] == 'weak':
        print("   ⚠ CRITICAL: Weak instruments detected!")
        print("      → Do NOT trust standard 2SLS inference")
        print("      → Consider: finding stronger instruments, LIML, or AR test")
    elif not diagnostics['is_endogenous']:
        print("   ✓ No evidence of endogeneity - consider using OLS instead")
    elif diagnostics['j_test'] is not None and not diagnostics['instruments_valid']:
        print("   ⚠ Invalid instruments detected!")
        print("      → Re-examine instrument validity")
        print("      → Results may be unreliable")
    else:
        print("   ✓ IV estimation appears appropriate and reliable")
        print("      → Instruments are strong and valid")
        print("      → Endogeneity is present")
    
    print("="*70)
    
    return diagnostics

### 6.2 Running Complete Diagnostics

In [None]:
# Run comprehensive diagnostics on panel data
diagnostics_panel = comprehensive_iv_diagnostics(
    df=df_panel,
    y_col='y',
    endog_col='x_endo',
    instrument_cols='z',
    entity_col='entity',
    time_col='time'
)

In [None]:
# Run diagnostics on overidentified model
diagnostics_overid = comprehensive_iv_diagnostics(
    df=df_overid,
    y_col='y',
    endog_col='x_endo',
    instrument_cols=['z1', 'z2'],
    entity_col='entity',
    time_col='time'
)

---

## Section 7: Exercises

### Exercise 7.1: Weak Instrument Simulation

**Task**: Investigate the consequences of weak instruments.

**Instructions**:
1. Simulate panel data with a **weak instrument** (Corr(Z, X) ≈ 0.05)
2. Estimate IV model
3. Compute first-stage F-statistic
4. Compare IV estimate to true β - observe bias

In [None]:
# TODO: Exercise 7.1
# Hint: Modify the weak instrument simulation from Section 1
# True β = 2.0, use correlation ≈ 0.05

# Your code here:


### Exercise 7.2: Overidentification Test with Invalid Instrument

**Task**: Test if the J-test can detect invalid instruments.

**Instructions**:
1. Simulate with 3 instruments, 1 endogenous variable
2. Make **one instrument invalid** (correlated with error term)
3. Conduct J-test
4. Interpret: Does the test detect the invalid instrument?

In [None]:
# TODO: Exercise 7.2
# Hint: Create z1, z2 (valid), z3 (invalid: correlated with u)

# Your code here:


### Exercise 7.3: Complete Diagnostic Workflow

**Task**: Apply the comprehensive diagnostic workflow to new data.

**Instructions**:
1. Simulate panel data with your own specifications
2. Run `comprehensive_iv_diagnostics()`
3. Interpret all test results
4. Make a recommendation: Is IV appropriate?

In [None]:
# TODO: Exercise 7.3
# Design your own simulation and test it!

# Your code here:


---

## Solutions

In [None]:
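# Solution 7.1: Weak Instrument Simulation
np.random.seed(1111)
data_weak_ex = []

for i in range(100):
    for t in range(5):
        z = np.random.normal(0, 1)
        u = np.random.normal(0, 0.5)  # structural error
        # x_endo loads on u, so it is endogenous; the instrument is very weak
        x_endo = 0.05 * z + 0.8 * u + np.random.normal(0, 1)
        y = 2 * x_endo + u
        data_weak_ex.append({
            'entity': i, 'time': t, 'y': y, 'x_endo': x_endo, 'z': z
        })

df_weak_ex = pd.DataFrame(data_weak_ex)

# First-stage F
fs_weak_ex = compute_first_stage_f(df_weak_ex, 'x_endo', 'z')

# Pooled OLS and just-identified IV estimates (no fixed effects in this DGP)
z_v, x_v, y_v = (df_weak_ex[c].values for c in ['z', 'x_endo', 'y'])
beta_ols = np.cov(x_v, y_v)[0, 1] / np.var(x_v, ddof=1)
beta_iv = np.cov(z_v, y_v)[0, 1] / np.cov(z_v, x_v)[0, 1]

print("Solution 7.1: Weak Instrument Consequences")
print("="*60)
print(f"First-stage F: {fs_weak_ex['f_statistic']:.2f}")
print(f"OLS estimate:  {beta_ols:.3f}")
print(f"IV estimate:   {beta_iv:.3f}")
print(f"True β:        2.0")
print("\nExpected outcome: F << 10; the IV estimate is unstable and can be far from the true β")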

In [None]:
# Solution 7.2: J-Test with Invalid Instrument
np.random.seed(2222)
data_3inst = []

for i in range(100):
    for t in range(5):
        u = np.random.normal(0, 0.5)
        
        z1 = np.random.normal(0, 1)  # Valid
        z2 = np.random.normal(0, 1)  # Valid
        z3 = np.random.normal(0, 1) + 0.6 * u  # INVALID
        
        x_endo = 0.4 * z1 + 0.3 * z2 + 0.3 * z3 + np.random.normal(0, 1)
        y = 2 * x_endo + u
        
        data_3inst.append({
            'entity': i, 'time': t, 'y': y, 'x_endo': x_endo,
            'z1': z1, 'z2': z2, 'z3': z3
        })

df_3inst = pd.DataFrame(data_3inst)

# J-test
y_3 = df_3inst['y'].values
X_3 = df_3inst[['x_endo']].values
Z_3 = df_3inst[['z1', 'z2', 'z3']].values

j_3inst = compute_j_test(y_3, X_3, Z_3)

print("Solution 7.2: J-Test with 3 Instruments (1 invalid)")
print("="*60)
print(f"J-statistic: {j_3inst['j_statistic']:.4f}")
print(f"df:          {j_3inst['df']}")
print(f"p-value:     {j_3inst['p_value']:.4f}")

if j_3inst['p_value'] < 0.05:
    print("\n✓ J-test successfully detected invalid instrument!")
else:
    print("\n⚠ J-test did not detect invalidity (low power)")
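
Exercise 7.3 is open-ended, so there is no single solution. One possible starting point is a small data generator in which instrument strength and the degree of endogeneity are explicit parameters. The function name `simulate_iv_panel` and its parameters `pi` and `rho` are illustrative assumptions, not part of PanelBox; the column names follow those used in the cells above.

```python
import numpy as np
import pandas as pd


def simulate_iv_panel(n_entities=100, n_periods=5, beta=2.0,
                      pi=0.5, rho=0.5, seed=3333):
    """Simulate a balanced panel with one endogenous regressor and one instrument.

    pi  : first-stage coefficient on z (instrument strength)
    rho : loading of the structural error u on x_endo (degree of endogeneity)
    """
    rng = np.random.default_rng(seed)
    n = n_entities * n_periods

    entity = np.repeat(np.arange(n_entities), n_periods)
    time = np.tile(np.arange(n_periods), n_entities)

    z = rng.normal(0, 1, n)        # instrument, drawn independently of u
    u = rng.normal(0, 0.5, n)      # structural error
    v = rng.normal(0, 1, n)        # first-stage noise
    x_endo = pi * z + rho * u + v  # endogenous whenever rho != 0
    y = beta * x_endo + u

    return pd.DataFrame({'entity': entity, 'time': time,
                         'y': y, 'x_endo': x_endo, 'z': z})


df_ex3 = simulate_iv_panel(pi=0.5, rho=0.8)
print(df_ex3.head())
print(f"Corr(z, x_endo): {df_ex3['z'].corr(df_ex3['x_endo']):.3f}")
```

The resulting `df_ex3` can be passed to `comprehensive_iv_diagnostics()` with `instrument_cols='z'`, as in Section 6.2. Moving `pi` toward 0 or setting `rho=0` should change the recommendation the workflow prints.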

---

## Section 8: Summary

### Key Takeaways

1. **Weak Instruments (F < 10)**: Serious problem!
   - Bias toward OLS (even with valid instruments)
   - Unreliable standard errors
   - Invalid hypothesis tests
   - Solution: Find stronger instruments or use weak-instrument-robust methods

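2. **First-Stage F-Statistic**: Primary diagnostic for instrument relevance
   - **F > 19.93**: Stock-Yogo critical value for at most 10% maximal size of a nominal 5% Wald test (one endogenous regressor, two instruments)
   - **F > 16.38**: Same 10% maximal-size criterion with one instrument
   - **F > 10**: Acceptable (Staiger-Stock rule of thumb)
   - **F < 10**: Weak (unreliable IV)
   - Stock-Yogo relative-bias critical values are only defined with three or more instruments (e.g., 13.91 for 5% bias relative to OLS with three instruments)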

3. **Overidentification Test (J-test)**: Tests instrument validity
   - Only for overidentified models (# instruments > # endogenous)
   - Tests if instruments are orthogonal to errors
   - Cannot detect if ALL instruments are invalid
   - Low power in some settings

4. **Endogeneity Test (DWH)**: Tests if IV is necessary
   - H₀: X is exogenous (use OLS)
   - H₁: X is endogenous (use IV)
   - Helps choose between OLS and IV

5. **Complete Diagnostics**: Always run ALL tests before trusting IV results
   - First-stage F (relevance)
   - J-test (validity, if overidentified)
   - DWH test (necessity)
   - Report diagnostics transparently

6. **Weak-Instrument-Robust Methods**: Advanced alternatives
   - Anderson-Rubin test
   - LIML estimator
   - Conditional likelihood ratio test
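
As an illustration of the first weak-instrument-robust method in the list, the Anderson-Rubin test of H₀: β = β₀ can be sketched in a few lines: under H₀, the residual y − β₀x should be unrelated to the instruments, whatever their strength. The function name `anderson_rubin_test` and the simulated data are assumptions made for this sketch, which also assumes homoskedastic errors, one endogenous regressor, and no exogenous controls beyond a constant.

```python
import numpy as np
from scipy import stats


def anderson_rubin_test(y, x, Z, beta0):
    """Anderson-Rubin test of H0: beta = beta0 for one endogenous regressor.

    Regresses y - beta0 * x on a constant and the instruments Z, then
    F-tests that all instrument coefficients are zero.
    """
    y = np.asarray(y, dtype=float).ravel()
    x = np.asarray(x, dtype=float).ravel()
    Z = np.asarray(Z, dtype=float).reshape(len(y), -1)
    n, k = Z.shape

    e0 = y - beta0 * x                    # residual implied by H0
    W = np.column_stack([np.ones(n), Z])  # constant + instruments

    # Restricted model: constant only
    ssr_r = np.sum((e0 - e0.mean()) ** 2)
    # Unrestricted model: constant + instruments
    coef, *_ = np.linalg.lstsq(W, e0, rcond=None)
    ssr_u = np.sum((e0 - W @ coef) ** 2)

    f_stat = ((ssr_r - ssr_u) / k) / (ssr_u / (n - k - 1))
    p_value = stats.f.sf(f_stat, k, n - k - 1)
    return {'ar_statistic': f_stat, 'df': (k, n - k - 1), 'p_value': p_value}


# Simulated data with a weak instrument (true beta = 2.0)
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=n)
u = rng.normal(scale=0.5, size=n)
x = 0.05 * z + 0.8 * u + rng.normal(size=n)
y = 2.0 * x + u

for b0 in (2.0, 0.0):
    res = anderson_rubin_test(y, x, z, beta0=b0)
    print(f"H0: beta = {b0}: AR = {res['ar_statistic']:.3f}, p = {res['p_value']:.4f}")
```

Collecting every β₀ that the test does not reject gives a confidence set that remains valid with weak instruments, although it can be very wide or even unbounded.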

---

### Diagnostic Checklist

Before using IV estimation results:

- [ ] **First-stage F > 10** (preferably > 16.38)
- [ ] **J-test p-value > 0.05** (if overidentified)
- [ ] **DWH test rejects exogeneity** (p < 0.05)
- [ ] **Economic reasoning** supports instrument validity
- [ ] **Diagnostics reported** in results (transparency)

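If the first-stage F, the J-test, or the economic reasoning fails → re-examine your instruments! If only the DWH test fails to reject exogeneity, OLS may be the better choice.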
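
The quantitative items of the checklist can be collected in a small helper so that they are reported together. This is a minimal sketch: the function name `iv_checklist`, its arguments, and the example numbers are hypothetical, and the default thresholds simply mirror the checklist above.

```python
def iv_checklist(first_stage_f, dwh_pvalue, j_pvalue=None,
                 f_threshold=10.0, alpha=0.05):
    """Evaluate the quantitative items of the diagnostic checklist.

    j_pvalue is None for just-identified models, where the J-test is unavailable.
    Returns a dict mapping each check to True (pass), False (fail), or None (n/a).
    """
    return {
        'first_stage_f_ok': first_stage_f > f_threshold,
        'j_test_ok': None if j_pvalue is None else j_pvalue > alpha,
        'dwh_rejects_exogeneity': dwh_pvalue < alpha,
    }


# Hypothetical numbers, for illustration only
checks = iv_checklist(first_stage_f=24.7, dwh_pvalue=0.003, j_pvalue=0.41)
for name, passed in checks.items():
    status = 'n/a' if passed is None else ('pass' if passed else 'FAIL')
    print(f"{name:<26} {status}")
```

The economic-reasoning and reporting items cannot be automated and still need to be argued in the write-up.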

---

### What's Next?

**Congratulations!** You've completed the advanced IV diagnostics tutorial!

After completing all 7 notebooks in the static panel models series (01-07), you now have:

✅ Comprehensive understanding of static panel estimators  
✅ Ability to choose appropriate models based on data/context  
✅ Skills to conduct rigorous specification tests  
✅ Expertise in diagnosing and addressing IV problems  
✅ Practical experience with real-world panel data analysis  

**Next Steps**:
- Explore **dynamic panel models** (GMM series)
- Study **advanced topics**: quantile regression, nonlinear models
- Apply to **research projects** with real data
- Read **econometric literature** on weak instruments (Stock & Yogo, Andrews et al.)

---

### Further Reading

**Essential Papers**:
1. Stock, J.H. & Yogo, M. (2005). "Testing for Weak Instruments in Linear IV Regression." In *Identification and Inference for Econometric Models*.
2. Andrews, D.W.K., Moreira, M.J., & Stock, J.H. (2006). "Optimal Two-Sided Invariant Similar Tests for Instrumental Variables Regression." *Econometrica*, 74(3).
3. Staiger, D. & Stock, J.H. (1997). "Instrumental Variables Regression with Weak Instruments." *Econometrica*, 65(3).
4. Hansen, L.P. (1982). "Large Sample Properties of Generalized Method of Moments Estimators." *Econometrica*, 50(4).

**Textbooks**:
- Wooldridge, J.M. (2010). *Econometric Analysis of Cross Section and Panel Data*, 2nd ed. MIT Press.
- Cameron, A.C. & Trivedi, P.K. (2005). *Microeconometrics: Methods and Applications*. Cambridge University Press.

---