# GMM Diagnostics Tutorial: Testing Model Specification and Instrument Validity

This tutorial provides a comprehensive guide to diagnostic testing for GMM estimation, including tests for overidentification, instrument validity, and weak instruments.

## Contents
1. Hansen J-Test for Overidentification
2. C-Statistic (Difference-in-Sargan)
3. Weak Instruments Diagnostics
4. Interpretation Guidelines
5. When to Worry About Specification Problems
6. Remedial Actions

## References
- Hansen, L.P. (1982). "Large Sample Properties of Generalized Method of Moments Estimators." *Econometrica*, 50(4), 1029-1054.
- Sargan, J.D. (1958). "The Estimation of Economic Relationships using Instrumental Variables." *Econometrica*, 26(3), 393-415.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from panelbox.gmm import ContinuousUpdatedGMM
from panelbox.gmm.diagnostics import (
    hansen_j_test,
    c_statistic,
    weak_instruments_test
)

# Set random seed for reproducibility
np.random.seed(999)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 8)

## Setup: Generate Test Data

We'll generate three datasets to illustrate different diagnostic scenarios:
1. **Valid instruments** - All moment conditions satisfied
2. **Invalid instruments** - Some instruments correlated with errors
3. **Weak instruments** - Instruments weakly correlated with endogenous variables

In [None]:
def generate_gmm_data(n=500, scenario='valid', seed=None):
    """
    Generate data for GMM diagnostics.
    
    Parameters
    ----------
    n : int
        Sample size
    scenario : str
        'valid', 'invalid', or 'weak'
    seed : int, optional
        Random seed
    
    Returns
    -------
    dict
        Data dictionary with y, X, Z
    """
    if seed is not None:
        np.random.seed(seed)
    
    beta_true = np.array([1.0, 0.5])
    
    # Errors
    epsilon = np.random.randn(n)
    v = np.random.randn(n)
    
    if scenario == 'valid':
        # Valid instruments: uncorrelated with epsilon
        Z = np.random.randn(n, 4)
        Z[:, 0] = 1
        
        # Endogenous X (correlated with epsilon)
        X = np.zeros((n, 2))
        X[:, 0] = 1
        X[:, 1] = Z @ np.array([0.5, 0.3, 0.4, 0.2]) + 0.6 * epsilon + 0.8 * v
        
    elif scenario == 'invalid':
        # Invalid instruments: Z[2] and Z[3] correlated with epsilon
        Z = np.random.randn(n, 4)
        Z[:, 0] = 1
        Z[:, 2] = Z[:, 2] + 0.5 * epsilon  # Contaminated instrument!
        Z[:, 3] = Z[:, 3] + 0.3 * epsilon  # Contaminated instrument!
        
        X = np.zeros((n, 2))
        X[:, 0] = 1
        X[:, 1] = Z @ np.array([0.5, 0.3, 0.4, 0.2]) + 0.6 * epsilon + 0.8 * v
        
    elif scenario == 'weak':
        # Weak instruments: low correlation with X
        Z = np.random.randn(n, 4)
        Z[:, 0] = 1
        
        # Very weak first stage
        X = np.zeros((n, 2))
        X[:, 0] = 1
        X[:, 1] = Z @ np.array([0.5, 0.05, 0.03, 0.02]) + 0.6 * epsilon + 2.0 * v
    
    # Generate outcome
    y = X @ beta_true + epsilon
    
    return {
        'y': y,
        'X': X,
        'Z': Z,
        'beta_true': beta_true,
        'scenario': scenario
    }

# Generate all three scenarios
data_valid = generate_gmm_data(n=500, scenario='valid', seed=100)
data_invalid = generate_gmm_data(n=500, scenario='invalid', seed=101)
data_weak = generate_gmm_data(n=500, scenario='weak', seed=102)

print("Generated three datasets:")
print("  1. Valid instruments")
print("  2. Invalid instruments (Z[2], Z[3] contaminated)")
print("  3. Weak instruments")

## 1. Hansen J-Test for Overidentification

The Hansen J-test tests the null hypothesis that all moment conditions are valid:

$$H_0: E[Z_i' \varepsilon_i] = 0 \text{ (all moment conditions valid)}$$

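**Test statistic:**
$$J = n \times Q(\hat{\beta}) \xrightarrow{d} \chi^2(l - k) \quad \text{under } H_0$$

where:
- $Q(\beta) = \bar{g}(\beta)' W \bar{g}(\beta)$ is the GMM objective, with $\bar{g}(\beta) = \frac{1}{n} \sum_i Z_i' \varepsilon_i(\beta)$ and $W$ the efficient weighting matrix
- $l$ = number of instruments
- $k$ = number of parameters
- $l - k$ = degree of overidentification (the test requires $l > k$)

**Interpretation:**
- **Low p-value** (< 0.05): Reject H0 → at least one moment condition is violated or the model is misspecified
- **High p-value** (≥ 0.05): Do not reject H0 → no evidence against the overidentifying restrictions (this does not prove the instruments are valid)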
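
To make the formula concrete, the following is a minimal NumPy sketch of the J statistic for a linear model. It is not PanelBox's implementation: the helper name `linear_gmm_j`, the two-step (rather than CUE) estimator, and the heteroskedasticity-robust weighting matrix are all assumptions made for illustration.

```python
import numpy as np
from scipy import stats


def linear_gmm_j(y, X, Z):
    """Two-step linear GMM; returns (beta_hat, J, df, p_value). Illustrative helper."""
    n, l = Z.shape
    k = X.shape[1]
    ZX, Zy = Z.T @ X, Z.T @ y

    # Step 1: 2SLS weighting matrix W = (Z'Z / n)^{-1}
    W1 = np.linalg.inv(Z.T @ Z / n)
    beta1 = np.linalg.solve(ZX.T @ W1 @ ZX, ZX.T @ W1 @ Zy)

    # Step 2: efficient weighting matrix built from first-step residuals
    u1 = y - X @ beta1
    Zu = Z * u1[:, None]
    W2 = np.linalg.inv(Zu.T @ Zu / n)
    beta2 = np.linalg.solve(ZX.T @ W2 @ ZX, ZX.T @ W2 @ Zy)

    # J = n * g_bar' W g_bar, with g_bar the sample moments at beta2
    g_bar = Z.T @ (y - X @ beta2) / n
    J = float(n * g_bar @ W2 @ g_bar)
    df = l - k
    p_value = float(stats.chi2.sf(J, df)) if df > 0 else float("nan")
    return beta2, J, df, p_value


# Synthetic check: 4 instruments (including a constant), 2 parameters
rng = np.random.default_rng(0)
n = 2000
eps, v = rng.standard_normal(n), rng.standard_normal(n)
Z = np.column_stack([np.ones(n), rng.standard_normal((n, 3))])
x = Z @ np.array([0.5, 0.3, 0.4, 0.2]) + 0.6 * eps + 0.8 * v
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 0.5]) + eps

beta_hat, J, df, p = linear_gmm_j(y, X, Z)
print(f"beta_hat = {beta_hat}, J = {J:.3f}, df = {df}, p = {p:.3f}")
```

With valid instruments, as in this synthetic design, J behaves approximately like a draw from $\chi^2(2)$, so the p-value varies from sample to sample.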

In [None]:
def estimate_and_test(data, name):
    """
    Estimate GMM and run J-test.
    """
    print("\n" + "="*80)
    print(f"SCENARIO: {name}")
    print("="*80)
    
    # Estimate CUE-GMM
    gmm = ContinuousUpdatedGMM(
        endog=data['y'],
        exog=data['X'],
        instruments=data['Z']
    )
    
    result = gmm.fit()
    
    print(f"\nEstimated parameters:")
    print(f"  β̂ = {result.params}")
    print(f"  True β = {data['beta_true']}")
    print(f"  Bias = {result.params - data['beta_true']}")
    
    # J-test
    j_test = result.j_test()
    
    print(f"\nHansen J-Test:")
    print(f"  Statistic: {j_test['statistic']:.4f}")
    print(f"  P-value: {j_test['pvalue']:.4f}")
    print(f"  Degrees of freedom: {j_test['df']}")
    print(f"  Critical value (5%): {j_test['critical_value']:.4f}")
    
    if j_test['pvalue'] < 0.05:
        print(f"  ⚠️  REJECT H0: Overidentification restrictions violated (p={j_test['pvalue']:.4f})")
        print(f"      → Some instruments may be invalid or model misspecified")
    else:
        print(f"  ✓ Do not reject H0: Instruments appear valid (p={j_test['pvalue']:.4f})")
    
    return result

# Test all scenarios
result_valid = estimate_and_test(data_valid, "VALID INSTRUMENTS")
result_invalid = estimate_and_test(data_invalid, "INVALID INSTRUMENTS")
result_weak = estimate_and_test(data_weak, "WEAK INSTRUMENTS")

## 2. C-Statistic (Difference-in-Sargan)

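The C-statistic tests the validity of a *subset* of instruments by comparing the J-statistic from the model that uses all instruments with the J-statistic from the model that drops the suspect instruments:

$$C = J_{\text{all instruments}} - J_{\text{without suspect instruments}} \sim \chi^2(\#\text{suspect instruments})$$

The code below labels the all-instruments fit "unrestricted" and the reduced fit "restricted", so $C = J_{unrestricted} - J_{restricted}$. The reduced model must still have at least as many instruments as parameters.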

**Use case:** Test if specific instruments are exogenous

**Example:** We suspect Z[2] and Z[3] are invalid in the "invalid" scenario.

In [None]:
# Test validity of Z[2] and Z[3] using C-statistic
print("\n" + "="*80)
print("C-STATISTIC: Testing Validity of Z[2] and Z[3]")
print("="*80)

# Unrestricted model (all instruments)
gmm_unrestricted = ContinuousUpdatedGMM(
    endog=data_invalid['y'],
    exog=data_invalid['X'],
    instruments=data_invalid['Z']
)
result_unrestricted = gmm_unrestricted.fit()
j_unrestricted = result_unrestricted.j_test()['statistic']

# Restricted model (only Z[0] and Z[1])
Z_restricted = data_invalid['Z'][:, [0, 1]]
gmm_restricted = ContinuousUpdatedGMM(
    endog=data_invalid['y'],
    exog=data_invalid['X'],
    instruments=Z_restricted
)
result_restricted = gmm_restricted.fit()
j_restricted = result_restricted.j_test()['statistic']

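# C-statistic: J with all instruments minus J without the suspect instruments.
# Note: the reduced model here is exactly identified (2 instruments, 2 parameters),
# so its J is zero up to numerical error and C is essentially the full-model J.
# The two fits use different weighting matrices, so C is not guaranteed to be
# non-negative in finite samples; it is floored at zero for the p-value.
c_stat = j_unrestricted - j_restricted
df_c = 2  # Number of suspect instruments being tested
p_value_c = stats.chi2.sf(max(c_stat, 0.0), df_c)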

print(f"\nRestricted model (Z[0], Z[1] only):")
print(f"  J-statistic: {j_restricted:.4f}")

print(f"\nUnrestricted model (all Z):")
print(f"  J-statistic: {j_unrestricted:.4f}")

print(f"\nC-Statistic Test:")
print(f"  C = J_unr - J_res = {c_stat:.4f}")
print(f"  Degrees of freedom: {df_c}")
print(f"  P-value: {p_value_c:.4f}")

if p_value_c < 0.05:
    print(f"  ⚠️  REJECT H0: Z[2] and Z[3] appear INVALID (p={p_value_c:.4f})")
    print(f"      → These instruments are likely correlated with errors")
else:
    print(f"  ✓ Do not reject H0: Z[2] and Z[3] appear valid (p={p_value_c:.4f})")
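
In the cell above, the reduced model has two instruments for two parameters, so it is exactly identified and carries no overidentifying information of its own. The sketch below shows a variant in which the reduced model stays overidentified and both J statistics reuse one moment-covariance estimate, which keeps C non-negative. The helpers `gmm_j_given_S` and `c_statistic_sketch`, the extra fifth instrument, and the use of two-step GMM instead of CUE are assumptions for illustration, not PanelBox's API.

```python
import numpy as np
from scipy import stats


def gmm_j_given_S(y, X, Z, S):
    """Efficient linear GMM J statistic for a supplied moment covariance S."""
    n = len(y)
    W = np.linalg.inv(S)
    ZX, Zy = Z.T @ X, Z.T @ y
    beta = np.linalg.solve(ZX.T @ W @ ZX, ZX.T @ W @ Zy)
    g_bar = Z.T @ (y - X @ beta) / n
    return float(n * g_bar @ W @ g_bar)


def c_statistic_sketch(y, X, Z, suspect):
    """Difference-in-J test for the instrument columns listed in `suspect`."""
    n, l = Z.shape
    keep = [j for j in range(l) if j not in suspect]
    if len(keep) < X.shape[1]:
        raise ValueError("reduced model is under-identified")

    # Preliminary 2SLS residuals on the full instrument set
    W1 = np.linalg.inv(Z.T @ Z / n)
    ZX, Zy = Z.T @ X, Z.T @ y
    b1 = np.linalg.solve(ZX.T @ W1 @ ZX, ZX.T @ W1 @ Zy)
    u = y - X @ b1
    Zu = Z * u[:, None]
    S_full = Zu.T @ Zu / n

    # Reuse the matching block of S_full for the reduced model
    J_full = gmm_j_given_S(y, X, Z, S_full)
    J_sub = gmm_j_given_S(y, X, Z[:, keep], S_full[np.ix_(keep, keep)])

    C = J_full - J_sub
    df = len(suspect)
    return C, df, float(stats.chi2.sf(max(C, 0.0), df))


# Synthetic data: constant + 4 instruments, the last two contaminated
rng = np.random.default_rng(1)
n = 2000
eps, v = rng.standard_normal(n), rng.standard_normal(n)
Z = np.column_stack([np.ones(n), rng.standard_normal((n, 4))])
Z[:, 3] += 0.5 * eps
Z[:, 4] += 0.3 * eps
x = Z @ np.array([0.5, 0.3, 0.4, 0.2, 0.2]) + 0.6 * eps + 0.8 * v
X = np.column_stack([np.ones(n), x])
y = X @ np.array([1.0, 0.5]) + eps

C, df_c, p_c = c_statistic_sketch(y, X, Z, suspect=[3, 4])
print(f"C = {C:.2f}, df = {df_c}, p = {p_c:.4f}")
```

Reusing the same covariance block for both fits is one common way to compute the statistic; differencing two separately weighted J statistics, as in the cell above, is asymptotically equivalent but can turn negative in finite samples.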

## 3. Weak Instruments Diagnostics

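Weak instruments occur when the instruments are only weakly correlated with the endogenous variables, after controlling for the included exogenous regressors.

**Consequences:**
- Estimates biased toward OLS; under weak-instrument asymptotics the bias does not vanish as the sample grows
- Poor coverage of conventional confidence intervals
- Unreliable inference from standard t- and Wald tests

**Diagnostics:**
1. **First-stage F-statistic**: F-test on the excluded instruments; should be > 10 (rule of thumb due to Staiger and Stock, 1997)
2. **Cragg-Donald statistic**: Minimum-eigenvalue generalization of the first-stage F for several endogenous regressors
3. **Concentration parameter**: Population measure of instrument strength that the first-stage F estimates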
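
The relationship between these diagnostics can be sketched for the single-endogenous-regressor case, where the homoskedastic Cragg-Donald statistic coincides with the first-stage F on the excluded instruments. The helper `first_stage_strength` is a hypothetical name, and the concentration-parameter estimate relies on the approximation $E[F] \approx 1 + \mu^2 / q$ with $q$ excluded instruments, so treat it as a rough guide rather than a formal estimator.

```python
import numpy as np


def first_stage_strength(x_endog, Z_excl, W_incl):
    """First-stage F on excluded instruments and a rough mu^2 estimate."""
    n = len(x_endog)
    q = Z_excl.shape[1]   # number of excluded instruments
    p = W_incl.shape[1]   # number of included exogenous regressors

    def resid(a, B):
        coef, *_ = np.linalg.lstsq(B, a, rcond=None)
        return a - B @ coef

    # Restricted: x on included exogenous only; unrestricted: x on everything
    rss_r = np.sum(resid(x_endog, W_incl) ** 2)
    rss_u = np.sum(resid(x_endog, np.column_stack([W_incl, Z_excl])) ** 2)

    f_stat = ((rss_r - rss_u) / q) / (rss_u / (n - p - q))
    mu2_hat = max(q * (f_stat - 1.0), 0.0)  # from E[F] ~ 1 + mu^2 / q
    return float(f_stat), float(mu2_hat)


# Strong and weak first stages, loosely mirroring the tutorial's designs
rng = np.random.default_rng(5)
n = 500
W_incl = np.ones((n, 1))
Z_excl = rng.standard_normal((n, 3))
eps, v = rng.standard_normal(n), rng.standard_normal(n)
x_strong = 0.5 + Z_excl @ np.array([0.3, 0.4, 0.2]) + 0.6 * eps + 0.8 * v
x_weak = 0.5 + Z_excl @ np.array([0.05, 0.03, 0.02]) + 0.6 * eps + 2.0 * v

f_strong, mu2_strong = first_stage_strength(x_strong, Z_excl, W_incl)
f_weak, mu2_weak = first_stage_strength(x_weak, Z_excl, W_incl)
print(f"strong: F = {f_strong:.1f}, mu^2 ~ {mu2_strong:.1f}")
print(f"weak:   F = {f_weak:.1f}, mu^2 ~ {mu2_weak:.1f}")
```

Partialling out the included exogenous regressors matters once the model has controls beyond a constant; an overall regression F would then mix instrument strength with the explanatory power of the controls.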

In [None]:
def first_stage_diagnostics(data, name):
    """
    Compute first-stage F-statistic for weak instruments test.
    """
    print("\n" + "="*80)
    print(f"WEAK INSTRUMENTS TEST: {name}")
    print("="*80)
    
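    # First stage: regress the endogenous regressor X[:, 1] on the instruments Z.
    # Z[:, 0] is a constant, so the overall F computed below tests the three
    # non-constant (excluded) instruments jointly.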
    
    from sklearn.linear_model import LinearRegression
    from scipy import stats as sp_stats
    
    X_endo = data['X'][:, 1].reshape(-1, 1)
    Z = data['Z']
    
    # First-stage regression
    model_fs = LinearRegression()
    model_fs.fit(Z, X_endo)
    
    # Predictions and residuals
    X_fitted = model_fs.predict(Z)
    residuals = X_endo.flatten() - X_fitted.flatten()
    
    # R-squared
    ss_tot = np.sum((X_endo.flatten() - np.mean(X_endo))**2)
    ss_res = np.sum(residuals**2)
    r2 = 1 - ss_res / ss_tot
    
    # F-statistic for joint significance of the k - 1 non-constant instruments
    n = len(X_endo)
    k = Z.shape[1]
    f_stat = (r2 / (k - 1)) / ((1 - r2) / (n - k))
    p_value = sp_stats.f.sf(f_stat, k - 1, n - k)  # sf avoids rounding to 0 in 1 - cdf
    
    print(f"\nFirst-Stage Regression (X ~ Z):")
    print(f"  R²: {r2:.4f}")
    print(f"  F-statistic: {f_stat:.2f}")
    print(f"  P-value: {p_value:.4e}")
    
    # Rule-of-thumb thresholds (F > 10 follows Staiger and Stock, 1997); these are not Stock-Yogo critical values
    if f_stat > 10:
        print(f"  ✓ Instruments appear STRONG (F={f_stat:.2f} > 10)")
    elif f_stat > 5:
        print(f"  ⚠️  Instruments are MODERATE (5 < F={f_stat:.2f} < 10)")
    else:
        print(f"  ⚠️  Instruments are WEAK (F={f_stat:.2f} < 5)")
        print(f"      → Inference may be unreliable!")
    
    return {'r2': r2, 'f_stat': f_stat, 'p_value': p_value}

# Test all scenarios
fs_valid = first_stage_diagnostics(data_valid, "VALID INSTRUMENTS")
fs_invalid = first_stage_diagnostics(data_invalid, "INVALID INSTRUMENTS")
fs_weak = first_stage_diagnostics(data_weak, "WEAK INSTRUMENTS")

In [None]:
# Visualize first-stage strength
fig, ax = plt.subplots(figsize=(10, 6))

scenarios = ['Valid', 'Invalid', 'Weak']
f_stats = [fs_valid['f_stat'], fs_invalid['f_stat'], fs_weak['f_stat']]
colors = ['green', 'orange', 'red']

bars = ax.bar(scenarios, f_stats, color=colors, alpha=0.7, edgecolor='black', linewidth=2)

# Reference lines
ax.axhline(y=10, color='green', linestyle='--', linewidth=2, label='Strong threshold (F=10)')
ax.axhline(y=5, color='orange', linestyle='--', linewidth=2, label='Weak threshold (F=5)')

ax.set_ylabel('First-Stage F-Statistic', fontsize=12)
ax.set_title('Weak Instruments Diagnostic: First-Stage F-Statistics', fontsize=14, fontweight='bold')
ax.legend(fontsize=10)
ax.grid(True, alpha=0.3, axis='y')

# Add value labels on bars
for i, (bar, f_stat) in enumerate(zip(bars, f_stats)):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 1,
            f'{f_stat:.1f}', ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

## 4. Interpretation Guidelines

### Hansen J-Test

| P-value | Interpretation | Action |
|---------|----------------|--------|
| p > 0.10 | No evidence against the overidentifying restrictions | Proceed, but validity is not proven |
| 0.05 < p < 0.10 | Marginal evidence against validity | Check robustness |
| 0.01 < p < 0.05 | Reject at 5% | Investigate instrument validity |
| p < 0.01 | Strong rejection | Serious specification problem |

**Important caveats:**
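- The J-test detects only disagreement among instruments: it has power when some instruments are valid and others are not
- It cannot detect the case where all instruments are invalid in the same direction, because the moment conditions then remain mutually consistent at a biased estimate
- Large samples: the J-test may reject for economically minor violations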
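
The second caveat can be illustrated with a small simulation. In the hypothetical design below, every instrument and the structural error load on the same latent variable `h`, so each instrument is invalid, yet the violations are proportional to the first-stage covariances. The helper `two_step_j` is an illustrative two-step estimator written for this example, not a PanelBox function.

```python
import numpy as np
from scipy import stats


def two_step_j(y, X, Z):
    """Two-step linear GMM: returns (beta_hat, J, p_value). Illustrative helper."""
    n, l = Z.shape
    k = X.shape[1]
    ZX, Zy = Z.T @ X, Z.T @ y
    W1 = np.linalg.inv(Z.T @ Z / n)
    b1 = np.linalg.solve(ZX.T @ W1 @ ZX, ZX.T @ W1 @ Zy)
    Zu = Z * (y - X @ b1)[:, None]
    W2 = np.linalg.inv(Zu.T @ Zu / n)
    b2 = np.linalg.solve(ZX.T @ W2 @ ZX, ZX.T @ W2 @ Zy)
    g_bar = Z.T @ (y - X @ b2) / n
    J = float(n * g_bar @ W2 @ g_bar)
    return b2, J, float(stats.chi2.sf(J, l - k))


rng = np.random.default_rng(7)
n = 5000
h = rng.standard_normal(n)                         # latent confounder
a = np.array([1.0, 0.8, 0.6])
Z = h[:, None] * a + rng.standard_normal((n, 3))   # every instrument loads on h
x = h + rng.standard_normal(n)
eps = 0.5 * h + rng.standard_normal(n)             # the error also loads on h
y = 1.0 * x + eps                                  # true slope is 1.0

beta_hat, J, p = two_step_j(y, x[:, None], Z)
print(f"beta_hat = {beta_hat[0]:.3f} (true 1.0), J = {J:.2f}, p = {p:.3f}")
```

In this design the estimator converges to 1.5 rather than 1.0, and at that value all three moment conditions hold, so J is asymptotically $\chi^2(2)$ and rejects only at its nominal rate despite every instrument being invalid.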

### C-Statistic

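- Tests the validity of a subset of instruments, assuming the remaining instruments are valid
- Useful for identifying which instruments are problematic
- Can test multiple subsets sequentially, although repeated testing inflates the overall false-rejection rate
- Requires that the model stays identified after the suspect instruments are dropped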

### Weak Instruments (First-Stage F)

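| F-statistic | Classification | Reliability |
|-------------|----------------|-------------|
| F > 10 | Strong | Conventional inference usually acceptable |
| 5 < F < 10 | Moderate | Caution advised |
| F < 5 | Weak | Unreliable inference |

**Stock-Yogo critical values:**
- Provide a formal test based on the first-stage F (or the Cragg-Donald statistic with several endogenous regressors)
- Depend on the number of instruments, the number of endogenous variables, and the tolerated bias or size distortion
- F > 10 is a common rule of thumb, not a Stock-Yogo critical value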

## 5. When to Worry About Specification Problems

### Red Flags:

1. **J-test strongly rejects** (p < 0.01)
   - Some instruments likely invalid
   - Model may be misspecified

2. **Weak first stage** (F < 10)
   - Instruments weakly correlated with endogenous variables
   - Estimates can be badly biased toward OLS, and conventional standard errors understate the uncertainty

3. **Implausible coefficient magnitudes**
   - Even if J-test passes, check economic plausibility

4. **Huge changes when adding/removing instruments**
   - Suggests weak identification or invalid instruments

### What to Check:

1. **Economic theory**: Are instruments truly exogenous?
2. **Institutional knowledge**: What is the identification strategy?
3. **First-stage strength**: Do instruments predict endogenous variables?
4. **Overidentification**: Are moment conditions consistent?
5. **Robustness**: Do results hold with different instrument sets?

## 6. Remedial Actions

### If J-test Rejects:

1. **Use C-statistic** to identify problematic instruments
2. **Drop suspect instruments** and re-estimate
3. **Reconsider model specification**
   - Are there omitted variables?
   - Is functional form correct?
4. **Try alternative estimation**
   - LIML (less sensitive to weak instruments, but does not fix invalid ones)
   - Bias-corrected GMM

### If Weak Instruments:

1. **Find stronger instruments**
   - Look for better identification strategy
   - Use institutional knowledge
2. **Use weak-instrument robust inference**
   - Anderson-Rubin test
   - Conditional likelihood ratio test
3. **Consider alternative methods**
   - Reduced-form estimation
   - Sensitivity analysis
   - Partial identification
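
As an illustration of weak-instrument robust inference, the sketch below implements the Anderson-Rubin test for a single endogenous regressor and inverts it over a grid to form a confidence set. It assumes homoskedastic errors and uses an F reference distribution; `anderson_rubin_pvalue` and the simulated weak design are assumptions for this example, not part of PanelBox.

```python
import numpy as np
from scipy import stats


def anderson_rubin_pvalue(y, x_endog, Z_excl, W_incl, beta0):
    """P-value of the AR test of H0: beta = beta0 (homoskedastic F version)."""
    n = len(y)
    q, p = Z_excl.shape[1], W_incl.shape[1]
    e0 = y - beta0 * x_endog  # residual under the null, before partialling out

    def rss(a, B):
        coef, *_ = np.linalg.lstsq(B, a, rcond=None)
        return np.sum((a - B @ coef) ** 2)

    # Under H0 the excluded instruments should not explain e0
    rss_r = rss(e0, W_incl)
    rss_u = rss(e0, np.column_stack([W_incl, Z_excl]))
    ar = ((rss_r - rss_u) / q) / (rss_u / (n - p - q))
    return float(stats.f.sf(ar, q, n - p - q))


# Weak first stage, loosely mirroring the tutorial's "weak" scenario
rng = np.random.default_rng(3)
n = 500
eps, v = rng.standard_normal(n), rng.standard_normal(n)
Z_excl = rng.standard_normal((n, 3))
W_incl = np.ones((n, 1))
x = 0.5 + Z_excl @ np.array([0.05, 0.03, 0.02]) + 0.6 * eps + 2.0 * v
y = 1.0 + 0.5 * x + eps

# Invert the test: keep every beta0 that is not rejected at the 5% level
grid = np.linspace(-2.0, 3.0, 101)
pvals = np.array([anderson_rubin_pvalue(y, x, Z_excl, W_incl, b) for b in grid])
accepted = grid[pvals >= 0.05]
print(f"{accepted.size} of {grid.size} grid values are not rejected at 5%")
```

The test's size does not depend on instrument strength, so with weak instruments the resulting set is typically wide or even unbounded, which is an honest reflection of how little the data identify.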

### Best Practices:

1. **Always report diagnostic tests**
2. **Test multiple instrument sets**
3. **Use economic theory to guide instrument choice**
4. **Be transparent about identification assumptions**
5. **Report first-stage results**
6. **Check robustness to instrument choice**

## Summary and Key Takeaways

### Diagnostic Testing Workflow:

1. **Estimate GMM model**
2. **Check first-stage strength** (F-statistic > 10?)
3. **Run Hansen J-test** (p-value > 0.05?)
4. **If J rejects, use C-statistic** to identify bad instruments
5. **Report all diagnostics** in results

### Decision Tree:

```
Is F-statistic > 10?
  NO → Weak instruments → Find stronger instruments or use robust methods
  YES → Continue
  
Does J-test reject (p < 0.05)?
  YES → Possible invalid instruments or misspecification → Use C-statistic to identify, drop bad instruments
  NO → No evidence against instrument validity → Proceed, keeping the J-test's limited power in mind
```
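
The decision tree can be written as a small helper for use in scripts. The function name `diagnose` and its default thresholds are hypothetical; they simply encode the rules of thumb used in this tutorial.

```python
def diagnose(f_stat, j_pvalue, f_threshold=10.0, alpha=0.05):
    """Return a short verdict following the decision tree above."""
    if f_stat <= f_threshold:
        return "weak instruments: find stronger instruments or use robust methods"
    if j_pvalue < alpha:
        return "J-test rejects: check instruments with the C-statistic"
    return "no red flags from F or J"


for f, p in [(3.2, 0.40), (45.0, 0.01), (45.0, 0.40)]:
    print(f"F={f:5.1f}, J p-value={p:.2f} -> {diagnose(f, p)}")
```

Instrument strength is checked first because a weak first stage also makes the J-test itself unreliable.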

### Remember:

- **No test is perfect**: All rely on assumptions
- **Economic theory matters**: Statistical tests complement, don't replace, economic reasoning
- **Robustness is key**: Try multiple specifications
- **Transparency**: Report all diagnostics, even if unfavorable

### References:

1. Hansen, L.P. (1982). "Large Sample Properties of Generalized Method of Moments Estimators." *Econometrica*, 50(4), 1029-1054.

2. Stock, J.H., & Yogo, M. (2005). "Testing for Weak Instruments in Linear IV Regression." In *Identification and Inference for Econometric Models: Essays in Honor of Thomas Rothenberg*, Cambridge University Press.

3. Baum, C.F., Schaffer, M.E., & Stillman, S. (2003). "Instrumental Variables and GMM: Estimation and Testing." *Stata Journal*, 3(1), 1-31.

4. Hall, A.R. (2005). *Generalized Method of Moments*. Oxford University Press.

---

**PanelBox** - Advanced Panel Data Econometrics in Python  
https://github.com/bernardodionisi/panelbox