# 03 — Instrument Specification

**Duration:** ~110 minutes  
**Level:** Intermediate-Advanced  
**Prerequisites:** Notebooks 01-02

## Learning Objectives

1. Distinguish **GMM-style** vs **IV-style** instruments
2. Understand **instrument proliferation** and its consequences
3. Apply the **collapse** option to reduce instrument count
4. Choose appropriate **min/max lag** settings
5. Classify variables as exogenous, predetermined, or endogenous

## Outline

1. [GMM-style vs IV-style Instruments](#1-instrument-types)
2. [Instrument Proliferation](#2-proliferation)
3. [The Collapse Option](#3-collapse)
4. [Lag Selection Strategy](#4-lag-selection)
5. [Variable Classification](#5-classification)
6. [Applied Example](#6-applied)
7. [Exercises](#7-exercises)

In [None]:
# Setup
import sys
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

project_root = Path("../../..").resolve()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

from panelbox.gmm import DifferenceGMM, SystemGMM

sys.path.insert(0, str(Path("..").resolve()))
from utils.visualization import apply_tutorial_style, plot_instrument_count_effects

apply_tutorial_style()
warnings.filterwarnings('ignore', category=UserWarning)
print("Setup complete.")

## 1. GMM-style vs IV-style Instruments

There are two fundamental ways to use instruments in GMM:

### GMM-style ("Arellano-Bond" instruments)
- Creates a **separate column for each time period** and each available lag
- The instrument set grows with T (one block per period)
- More moment conditions = potentially more efficient, but risk of proliferation
- Used for: lagged dependent variable, predetermined and endogenous variables
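
As a rough illustration of this column layout, the sketch below builds the GMM-style instrument block for a single unit with plain NumPy. It assumes instruments start at lag 2 of the variable and that the differenced equations run from $t=3$ to $T$. The function name `gmm_style_block` is hypothetical and is not part of panelbox; the library's internal construction may differ.

```python
import numpy as np


def gmm_style_block(y, collapse=False):
    """Instrument block for one unit (hypothetical helper, not the panelbox API).

    Rows are the differenced equations t = 3..T; instruments are y_{t-2}, ..., y_1.
    """
    y = np.asarray(y, dtype=float)
    T = len(y)
    # Available lags per equation, nearest first: [y_{t-2}, y_{t-3}, ..., y_1]
    lag_sets = [y[: t - 2][::-1] for t in range(3, T + 1)]

    if collapse:
        # One column per lag distance; unavailable lags are filled with zeros
        Z = np.zeros((T - 2, T - 2))
        for row, lags in enumerate(lag_sets):
            Z[row, : len(lags)] = lags
        return Z

    # One column per (period, lag) pair: each equation gets its own block
    Z = np.zeros((T - 2, sum(len(lags) for lags in lag_sets)))
    col = 0
    for row, lags in enumerate(lag_sets):
        Z[row, col : col + len(lags)] = lags
        col += len(lags)
    return Z


y = np.arange(1, 6)  # toy series y_1..y_5, so T = 5
Z_full = gmm_style_block(y)
Z_collapsed = gmm_style_block(y, collapse=True)
print(Z_full.shape, Z_collapsed.shape)
print(Z_collapsed)
```

With $T = 5$ the uncollapsed block has $1 + 2 + 3 = 6$ columns, while the collapsed block has only 3, one per lag distance.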

### IV-style ("Standard" instruments)
- Creates **one column per lag** (same across all time periods)
- Fixed number of instruments regardless of T
- Used for: strictly exogenous variables

### Instrument Count Comparison

For a variable instrumented with lags 2 through $T-1$ in a panel with $T$ time periods:
- **GMM-style**: $\sum_{t=3}^{T}(t-2) = \frac{(T-1)(T-2)}{2} = O(T^2)$ instruments
- **GMM-style collapsed**: $T-2 = O(T)$ instruments
- **IV-style**: One instrument per lag used, independent of $T$

In [None]:
# Demonstrate instrument count growth
T_values = range(5, 25)
gmm_counts = []
gmm_collapsed_counts = []
iv_counts = []

for T in T_values:
    # GMM-style: sum from t=3 to T of (t-2) = (T-1)(T-2)/2
    gmm_count = sum(t - 2 for t in range(3, T + 1))
    gmm_counts.append(gmm_count)
    
    # GMM-style collapsed: T - 2
    gmm_collapsed_counts.append(T - 2)
    
    # IV-style: fixed (e.g., 3 lags)
    iv_counts.append(3)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(list(T_values), gmm_counts, 'o-', label='GMM-style (uncollapsed)', color='red')
ax.plot(list(T_values), gmm_collapsed_counts, 's-', label='GMM-style (collapsed)', color='blue')
ax.plot(list(T_values), iv_counts, '^-', label='IV-style (3 lags)', color='green')
ax.set_xlabel('T (time periods)')
ax.set_ylabel('Number of instruments')
ax.set_title('Instrument Count Growth by Type')
ax.legend()
fig.tight_layout()
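fig_dir = Path('../outputs/figures')
fig_dir.mkdir(parents=True, exist_ok=True)  # create the output folder if it is missing
fig.savefig(fig_dir / '03_instrument_count_growth.png', dpi=150, bbox_inches='tight')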
plt.show()

## 2. Instrument Proliferation

**Too many instruments** cause serious problems:

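1. **Overfitting**: The Hansen J-test loses power, and its p-value is pushed toward 1.0
2. **Finite-sample bias**: Estimates drift toward those of non-instrumenting estimators such as OLS
3. **Numerical instability**: The estimated moment covariance matrix becomes near-singular, so the optimal weight matrix is poorly estimated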

### Roodman's (2009) Warning

> "A large instrument collection overfits endogenous variables... can fail to expunge their endogenous components and bias coefficient estimates towards those from non-instrumenting estimators."

### Rule of Thumb
Number of instruments $\leq$ number of cross-sectional groups
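
The finite-sample bias listed above can be illustrated with a deliberately simplified analogue: cross-sectional 2SLS with one relevant instrument and a growing number of irrelevant ones. This is not panel GMM and does not use panelbox; the data-generating process, the function names `tsls` and `mean_estimates`, and all parameter values are assumptions chosen for the sketch. As the instrument count approaches the sample size, the first stage fits the endogenous regressor almost perfectly, and the 2SLS estimate tends toward the biased OLS estimate.

```python
import numpy as np


def tsls(y, x, Z):
    """2SLS slope for a single regressor without intercept."""
    x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]  # first-stage fitted values
    return float(x_hat @ y / (x_hat @ x))


def mean_estimates(n_instruments, n=200, reps=50, beta=1.0, seed=0):
    """Average 2SLS and OLS estimates over simulated samples (true slope = beta)."""
    rng = np.random.default_rng(seed)
    iv, ols = [], []
    for _ in range(reps):
        z = rng.normal(size=n)                      # one relevant instrument
        u = rng.normal(size=n)                      # structural error
        x = z + 0.8 * u + 0.6 * rng.normal(size=n)  # x is endogenous through u
        y = beta * x + u
        junk = rng.normal(size=(n, n_instruments - 1))  # irrelevant instruments
        Z = np.column_stack([z, junk])
        iv.append(tsls(y, x, Z))
        ols.append(float(x @ y / (x @ x)))
    return float(np.mean(iv)), float(np.mean(ols))


for k in (1, 20, 150):
    iv_mean, ols_mean = mean_estimates(k)
    print(f"instruments={k:3d}  mean 2SLS={iv_mean:.3f}  mean OLS={ols_mean:.3f}")
```

The true slope is 1.0, so the gap between each mean estimate and 1.0 is the simulated bias under these assumptions.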

In [None]:
# Demonstrate proliferation effects
abdata = pd.read_csv("../data/abdata.csv")

# Estimate with collapse=False (uncollapsed)
model_no_collapse = DifferenceGMM(
    data=abdata, dep_var='n', lags=1, id_var='firm', time_var='year',
    exog_vars=['w', 'k'], time_dummies=True,
    collapse=False,  # Not collapsed!
    two_step=True, robust=True
)
results_no_collapse = model_no_collapse.fit()

# Estimate with collapse=True
model_collapse = DifferenceGMM(
    data=abdata, dep_var='n', lags=1, id_var='firm', time_var='year',
    exog_vars=['w', 'k'], time_dummies=True,
    collapse=True,
    two_step=True, robust=True
)
results_collapse = model_collapse.fit()

print("Instrument Proliferation: Collapsed vs Uncollapsed")
print("=" * 60)
print(f"{'':25s} {'Uncollapsed':>15s} {'Collapsed':>15s}")
print("-" * 60)
print(f"{'Instruments':25s} {results_no_collapse.n_instruments:>15d} {results_collapse.n_instruments:>15d}")
print(f"{'Groups':25s} {results_no_collapse.n_groups:>15d} {results_collapse.n_groups:>15d}")
print(f"{'Inst/Group ratio':25s} {results_no_collapse.instrument_ratio:>15.3f} {results_collapse.instrument_ratio:>15.3f}")
print(f"{'Hansen J p-value':25s} {results_no_collapse.hansen_j.pvalue:>15.4f} {results_collapse.hansen_j.pvalue:>15.4f}")
print(f"{'rho':25s} {results_no_collapse.params.iloc[0]:>15.4f} {results_collapse.params.iloc[0]:>15.4f}")
print(f"{'SE(rho)':25s} {results_no_collapse.std_errors.iloc[0]:>15.4f} {results_collapse.std_errors.iloc[0]:>15.4f}")

## 3. The Collapse Option

The **collapse** option (Roodman 2009) reduces the instrument count from $O(T^2)$ to $O(T)$ by pooling moment conditions across time periods.

The uncollapsed instrument set imposes a separate moment condition for each time period and lag:
$$E[y_{i,t-s} \cdot \Delta \varepsilon_{it}] = 0 \quad \forall s \geq 2, \forall t$$

Collapsing sums these conditions over $t$, leaving one condition per lag:
$$E\left[\sum_t y_{i,t-s} \cdot \Delta \varepsilon_{it}\right] = 0 \quad \forall s \geq 2$$
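
The relationship between the two sets of moment conditions can be checked numerically. The sketch below, which uses random numbers as stand-ins for one unit's $y_{it}$ and $\Delta \varepsilon_{it}$ (an assumption purely for illustration), builds both instrument blocks and confirms that each collapsed sample moment is the sum over $t$ of the uncollapsed moments that share the same lag $s$.

```python
import numpy as np

T = 6
rng = np.random.default_rng(0)
y = rng.normal(size=T)          # stand-in for y_1..y_T of one unit
d_eps = rng.normal(size=T - 2)  # stand-in for the differenced errors, t = 3..T

# All valid (period, lag) pairs: lag s = 2..t-1 so that y_{t-s} exists
pairs = [(t, s) for t in range(3, T + 1) for s in range(2, t)]

Z_full = np.zeros((T - 2, len(pairs)))  # one column per (t, s) pair
Z_coll = np.zeros((T - 2, T - 2))       # one column per lag s
for j, (t, s) in enumerate(pairs):
    Z_full[t - 3, j] = y[t - s - 1]      # y_{t-s}, converted to a 0-based index
    Z_coll[t - 3, s - 2] = y[t - s - 1]

full_moments = Z_full.T @ d_eps
coll_moments = Z_coll.T @ d_eps

# Sum the uncollapsed moments over t within each lag
summed = np.array([
    sum(m for m, (t, s) in zip(full_moments, pairs) if s == lag)
    for lag in range(2, T)
])
print(Z_full.shape, Z_coll.shape)
print(np.allclose(summed, coll_moments))  # should print True
```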

### Benefits of Collapse
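- Reduces the instrument count from quadratic to linear in T
- Typically more numerically stable
- Often better finite-sample behavior, at some cost in asymptotic efficiency
- **Recommended as the default** in this tutorial; Roodman (2009) advises at least testing robustness to a reduced instrument count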

## 4. Lag Selection Strategy

The choice of **minimum and maximum lags** for instruments is crucial:

### Minimum Lag
- **Lagged dependent variable**: min_lag = 2 (use $y_{t-2}$ and earlier)
- **Predetermined variables**: min_lag = 2
- **Endogenous variables**: min_lag = 3
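- **Strictly exogenous**: min_lag = 0 (can use current period)

The settings for predetermined and endogenous variables are conservative. In the convention of Roodman (2009), lag 1 of a predetermined variable and lag 2 of an endogenous variable are already valid instruments for the differenced equation, so starting one lag deeper gives up some instrument strength in exchange for protection against misclassification.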

### Maximum Lag
- More lags = more instruments = potential proliferation
- With collapse: additional lags are safer, because each lag adds only one instrument
- Without collapse: limit to 3-4 lags maximum
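
To see how the lag limits translate into instrument counts, the helper below counts GMM-style instruments for one variable. It assumes the differenced equations run from $t=3$ to $T$ and that a lag is available only if it does not reach before period 1. The name `count_gmm_instruments` and its arguments are hypothetical; they do not correspond to panelbox parameters.

```python
def count_gmm_instruments(T, min_lag=2, max_lag=None, collapse=False):
    """Count GMM-style instruments for one variable (illustrative helper only)."""
    if max_lag is None:
        max_lag = T - 1  # deepest lag that can ever be observed

    if collapse:
        # One instrument per lag distance
        return max(0, min(max_lag, T - 1) - min_lag + 1)

    total = 0
    for t in range(3, T + 1):          # differenced equations
        deepest = min(max_lag, t - 1)  # cannot reach before period 1
        total += max(0, deepest - min_lag + 1)
    return total


T = 9
for max_lag in (None, 4):
    for collapse in (False, True):
        n = count_gmm_instruments(T, max_lag=max_lag, collapse=collapse)
        print(f"max_lag={max_lag}, collapse={collapse}: {n} instruments")
```

For $T = 9$ this gives 28 instruments with all lags uncollapsed, 18 with `max_lag=4` uncollapsed, and 7 and 3 respectively once collapsed.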

### Strategy
1. Start with conservative specification (few lags, collapse=True)
2. Check diagnostics (Hansen J, AR(2), instrument ratio)
3. Gradually add lags if more power is needed
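4. If the Hansen J p-value is implausibly high (e.g. > 0.99), reduce the lag depth: a p-value near 1 signals a weakened test, not valid instruments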
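
The diagnostic step of this strategy can be turned into a small screening function. The sketch below works on plain numbers, so it does not depend on any particular results object. The function name `check_spec` is hypothetical, and the 0.05 significance level is a conventional choice assumed here, not a value prescribed by this tutorial; the other two thresholds follow the rules of thumb stated in this notebook.

```python
def check_spec(n_instruments, n_groups, hansen_p, ar2_p, alpha=0.05):
    """Return a list of warnings for a GMM specification (illustrative only)."""
    issues = []
    if n_instruments > n_groups:
        issues.append("more instruments than groups")
    if hansen_p > 0.99:
        issues.append("Hansen J p-value implausibly high (possible proliferation)")
    elif hansen_p < alpha:
        issues.append("Hansen J rejects the overidentifying restrictions")
    if ar2_p < alpha:
        issues.append("AR(2) test rejects: second-order serial correlation")
    return issues


# Made-up numbers, for illustration only
print(check_spec(n_instruments=30, n_groups=140, hansen_p=0.35, ar2_p=0.40))
print(check_spec(n_instruments=200, n_groups=140, hansen_p=0.999, ar2_p=0.40))
```

An empty list means none of the screens fired; it does not by itself establish that the instruments are valid.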

## 5. Variable Classification

How you classify variables determines the instrument strategy:

| Classification | Definition | Instruments (Diff eq.) | Example |
|---------------|------------|----------------------|----------|
| Strictly exogenous | $E[x_{it} \varepsilon_{is}] = 0 \ \forall s,t$ | IV-style, all lags | Policy dummies, demographics |
| Predetermined | $E[x_{it} \varepsilon_{is}] = 0$ for $s \geq t$ | GMM-style, lag 2+ | Capital stock |
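| Endogenous | $E[x_{it} \varepsilon_{it}] \neq 0$, but $E[x_{it} \varepsilon_{is}] = 0$ for $s > t$ | GMM-style, lag 3+ | Labor input |
| Lagged dependent | $\Delta y_{i,t-1}$ is correlated with $\Delta \varepsilon_{it}$ | GMM-style, lag 2+ of $y$ | $y_{t-1}$ |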
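
The definitions in the table can be checked with a small simulation. The sketch below generates three stylized regressors (the data-generating processes and the 0.5 coefficients are assumptions made for illustration) and estimates $E[x_{i,t-\ell} \cdot \Delta \varepsilon_{it}]$ for several lags $\ell$. Under these processes the shallowest lag that is orthogonal to the differenced error is 0 for the strictly exogenous variable, 1 for the predetermined one, and 2 for the endogenous one, so the deeper lags listed in the table also satisfy the moment conditions.

```python
import numpy as np

rng = np.random.default_rng(42)
N, T = 20000, 6
eps = rng.normal(size=(N, T))    # idiosyncratic errors
noise = rng.normal(size=(N, T))  # independent variation in x

x_exog = noise                         # unrelated to eps at all dates
x_pred = noise.copy()
x_pred[:, 1:] += 0.5 * eps[:, :-1]     # reacts to last period's shock
x_endo = noise + 0.5 * eps             # reacts to the current shock

t = T - 1                              # last period (0-based index)
d_eps = eps[:, t] - eps[:, t - 1]      # differenced error at t


def moment(x, lag):
    """Sample analogue of E[x_{t-lag} * d_eps_t]."""
    return float(np.mean(x[:, t - lag] * d_eps))


for name, x in [("exogenous", x_exog), ("predetermined", x_pred), ("endogenous", x_endo)]:
    values = "  ".join(f"lag {lag}: {moment(x, lag):+.3f}" for lag in range(4))
    print(f"{name:14s} {values}")
```

Moments close to zero indicate lags that are usable as instruments; the population values of the non-zero moments in this design are $\pm 0.5$.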

## 6. Applied Example

Let's apply different instrument specifications to the firm investment data.

In [None]:
# Load firm investment data
firm_data = pd.read_csv("../data/firm_investment.csv")
print(f"Shape: {firm_data.shape}")
print(f"Firms: {firm_data['firm'].nunique()}, Years: {sorted(firm_data['year'].unique())}")
firm_data.describe().round(4)

In [None]:
# Specification 1: All variables treated as exogenous
model_exog = DifferenceGMM(
    data=firm_data, dep_var='ik', lags=1, id_var='firm', time_var='year',
    exog_vars=['q', 'cashflow', 'sales', 'debt'],
    time_dummies=False, collapse=True, two_step=True, robust=True
)
results_exog = model_exog.fit()

# Specification 2: q and cashflow as predetermined
model_pred = DifferenceGMM(
    data=firm_data, dep_var='ik', lags=1, id_var='firm', time_var='year',
    exog_vars=['sales', 'debt'],
    predetermined_vars=['q', 'cashflow'],
    time_dummies=False, collapse=True, two_step=True, robust=True
)
results_pred = model_pred.fit()

print("Impact of Variable Classification:")
print("=" * 65)
print(f"{'':25s} {'All Exogenous':>18s} {'q,cf Predetermined':>18s}")
print("-" * 65)
print(f"{'rho (L1.ik)':25s} {results_exog.params.iloc[0]:>18.4f} {results_pred.params.iloc[0]:>18.4f}")
print(f"{'Instruments':25s} {results_exog.n_instruments:>18d} {results_pred.n_instruments:>18d}")
print(f"{'Inst/Group':25s} {results_exog.instrument_ratio:>18.3f} {results_pred.instrument_ratio:>18.3f}")
print(f"{'Hansen J p':25s} {results_exog.hansen_j.pvalue:>18.4f} {results_pred.hansen_j.pvalue:>18.4f}")
print(f"{'AR(2) p':25s} {results_exog.ar2_test.pvalue:>18.4f} {results_pred.ar2_test.pvalue:>18.4f}")

## 7. Exercises

### Exercise 1: Instrument Count Experiment
Estimate the employment model with different max_lag settings (2, 4, 6, 99) and collapse=True. Plot how Hansen J p-value and coefficient estimates change.

### Exercise 2: Endogenous Variables
Re-estimate the firm investment model treating `q` as endogenous (not predetermined). How do the results differ? What does this imply about the timing assumption?

### Exercise 3: System GMM Instruments
Estimate a System GMM model with `level_instruments={'max_lags': k}` for k=1,2,3. How does the Difference-in-Hansen test behave?

In [None]:
# Space for exercises
# YOUR CODE HERE


## Summary

1. **GMM-style** instruments create time-specific moment conditions; **IV-style** create lag-specific columns
2. **Instrument proliferation** weakens diagnostics and biases estimates, so use **collapse=True** as the default
3. **Variable classification** (exogenous/predetermined/endogenous) determines the instrument strategy
4. Start conservative and add instruments only if diagnostics support it
5. The instrument/group ratio should not exceed 1.0 (instruments $\leq$ groups)

### Next Notebook
In **Notebook 04**, we'll cover **GMM tests and diagnostics** in depth.

---
**References:**
- Roodman, D. (2009). How to do xtabond2. *Stata Journal*, 9(1), 86-136.
- Roodman, D. (2009). A note on the theme of too many instruments. *Oxford Bulletin of Economics and Statistics*, 71(1), 135-158.