# 01 — Difference GMM Fundamentals

**Duration:** ~80 minutes  
**Level:** Intermediate  
**Prerequisites:** Linear regression, panel data basics, instrumental variables

## Learning Objectives

By the end of this notebook, you will be able to:

1. Understand **why OLS and Fixed Effects fail** for dynamic panel models (Nickell bias)
2. Explain the **Arellano-Bond Difference GMM** estimation strategy
3. Estimate a Difference GMM model using PanelBox
4. Compare OLS, Fixed Effects, and GMM estimates
5. Interpret key diagnostic tests (Hansen J, AR(1), AR(2))

## Outline

1. [The Dynamic Panel Problem](#1-the-dynamic-panel-problem)
2. [Nickell Bias: Why Fixed Effects Fails](#2-nickell-bias-why-fixed-effects-fails)
3. [The Arellano-Bond Solution](#3-the-arellano-bond-solution)
4. [Estimation with PanelBox](#4-estimation-with-panelbox)
5. [OLS vs FE vs GMM Comparison](#5-ols-vs-fe-vs-gmm-comparison)
6. [Diagnostic Tests](#6-diagnostic-tests)
7. [Exercises](#7-exercises)

In [None]:
# Setup
import sys
import warnings
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Add project root to path
project_root = Path("../../..").resolve()
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# PanelBox imports
from panelbox.gmm import DifferenceGMM, SystemGMM

# Tutorial utilities
sys.path.insert(0, str(Path("..").resolve()))
from utils.visualization import apply_tutorial_style, plot_coefficient_comparison, plot_nickell_bias
from utils.data_generation import generate_nickell_bias_data

# Plot style
apply_tutorial_style()
warnings.filterwarnings('ignore', category=UserWarning)

print("Setup complete.")

## 1. The Dynamic Panel Problem

Many economic models involve **persistence** — the current value of a variable depends on its past values. For example:

- Employment adjusts slowly due to hiring/firing costs
- GDP growth is autocorrelated across countries
- Investment decisions depend on past capital levels

The standard **dynamic panel model** is:

$$y_{it} = \rho \, y_{i,t-1} + \mathbf{x}'_{it} \boldsymbol{\beta} + \mu_i + \varepsilon_{it}$$

where:
- $y_{it}$: dependent variable for entity $i$ at time $t$
- $y_{i,t-1}$: **lagged** dependent variable
- $\mathbf{x}_{it}$: vector of covariates
- $\mu_i$: entity-specific **fixed effect**
- $\varepsilon_{it}$: idiosyncratic error

### The Core Problem

The lagged dependent variable $y_{i,t-1}$ is **correlated with the fixed effect** $\mu_i$ by construction:

$$\text{Corr}(y_{i,t-1}, \mu_i) \neq 0$$

This means:
- **OLS** is biased upward (positive correlation between $y_{i,t-1}$ and $\mu_i + \varepsilon_{it}$)
- **Fixed Effects** (within estimator) is biased downward — this is **Nickell bias**
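A small simulation can make both biases concrete. The sketch below is illustrative only: the data-generating process, sample sizes, seed, and variable names are assumptions chosen for this example and are not part of the tutorial's datasets. It simulates the model without covariates, measures the correlation between the lag and $\mu_i$, and compares the pooled OLS slope with the within (FE) slope.

```python
import numpy as np

# Assumed DGP for illustration: y_it = rho * y_{i,t-1} + mu_i + eps_it (no covariates)
rng = np.random.default_rng(42)
n_entities, n_periods, rho = 2000, 6, 0.5

mu = rng.normal(size=n_entities)
y = np.empty((n_entities, n_periods + 1))
# Draw y_{i,0} from the stationary distribution implied by the model
y[:, 0] = mu / (1 - rho) + rng.normal(size=n_entities) / np.sqrt(1 - rho**2)
for t in range(1, n_periods + 1):
    y[:, t] = rho * y[:, t - 1] + mu + rng.normal(size=n_entities)

y_cur, y_lag = y[:, 1:], y[:, :-1]

# Correlation between the lagged dependent variable and the fixed effect
corr_lag_mu = np.corrcoef(y_lag[:, -1], mu)[0, 1]

# Pooled OLS slope (with intercept): cov(y_lag, y) / var(y_lag)
x, d = y_lag.ravel(), y_cur.ravel()
rho_ols = np.cov(x, d)[0, 1] / np.var(x, ddof=1)

# Within (FE) slope: demean each entity's series, then regress
x_dm = (y_lag - y_lag.mean(axis=1, keepdims=True)).ravel()
d_dm = (y_cur - y_cur.mean(axis=1, keepdims=True)).ravel()
rho_fe = (x_dm @ d_dm) / (x_dm @ x_dm)

print(f"Corr(y_lag, mu): {corr_lag_mu:.3f}")
print(f"True rho: {rho:.3f} | OLS: {rho_ols:.3f} | FE: {rho_fe:.3f}")
```

With a large cross-section, the OLS slope should land above the true $\rho$ and the FE slope below it, which is the pattern the rest of this notebook explains.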

## 2. Nickell Bias: Why Fixed Effects Fails

Nickell (1981) showed that the FE estimator of $\rho$ has a **negative bias** of order $O(1/T)$:

$$\text{plim}_{N \to \infty} \hat{\rho}_{FE} = \rho - \frac{1 + \rho}{T - 1} + O\left(\frac{1}{T^2}\right)$$

Taking only the leading term $-(1 + \rho)/(T - 1)$ as a first-order approximation:
- For $T = 5$ and $\rho = 0.5$: bias $\approx -0.375$ (75% of the true value)
- For $T = 10$ and $\rho = 0.5$: bias $\approx -0.167$ (33% of the true value)
- The bias shrinks **only as $T$ grows**; increasing $N$ does not help
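The leading term of the formula is easy to tabulate. The helper below, `nickell_bias_leading_term`, is a hypothetical name introduced for this sketch; it evaluates only $-(1 + \rho)/(T - 1)$ and ignores the $O(1/T^2)$ remainder, so it is a rough guide for small $T$ rather than an exact bias.

```python
def nickell_bias_leading_term(rho, n_periods):
    """Leading term of the large-N FE bias: -(1 + rho) / (T - 1)."""
    if n_periods < 2:
        raise ValueError("need at least two time periods")
    return -(1.0 + rho) / (n_periods - 1)


rho = 0.5
for n_periods in (5, 10, 20, 50):
    bias = nickell_bias_leading_term(rho, n_periods)
    share = abs(bias) / rho
    print(f"T = {n_periods:>2}: bias ~ {bias:+.3f} ({share:.0%} of rho)")
```

The loop makes the $1/T$ decay visible: doubling the panel length roughly halves the approximate bias, while the number of entities never enters the expression.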

Let's demonstrate this with simulated data.

In [None]:
# Load Nickell bias demonstration data
nickell_data = pd.read_csv("../data/dgp_nickell_bias.csv")
print(f"Dataset shape: {nickell_data.shape}")
print(f"Unique (rho, T) combinations: {nickell_data.groupby(['rho', 'T']).ngroups}")
nickell_data.head(10)

In [None]:
# Demonstrate Nickell bias: compare FE and OLS estimates across (rho, T)

bias_results = []

for (rho_true, T_true), group in nickell_data.groupby(['rho', 'T']):
    # Create lagged y within this subset
    df = group.sort_values(['entity', 'time']).copy()
    df['y_lag'] = df.groupby('entity')['y'].shift(1)
    df = df.dropna(subset=['y_lag'])
    
    # --- Pooled OLS ---
    y = df['y'].values
    X_ols = np.column_stack([np.ones(len(df)), df['y_lag'].values])
    beta_ols = np.linalg.lstsq(X_ols, y, rcond=None)[0]
    rho_ols = beta_ols[1]
    
    # --- Fixed Effects (within estimator) ---
    # Demean within entity
    entity_means_y = df.groupby('entity')['y'].transform('mean')
    entity_means_ylag = df.groupby('entity')['y_lag'].transform('mean')
    y_dm = (df['y'] - entity_means_y).values
    X_fe = (df['y_lag'] - entity_means_ylag).values.reshape(-1, 1)
    rho_fe = float(np.linalg.lstsq(X_fe, y_dm, rcond=None)[0][0])
    
    # --- Difference GMM ---
    try:
        model_gmm = DifferenceGMM(
            data=df[['entity', 'time', 'y']],
            dep_var='y',
            lags=1,
            id_var='entity',
            time_var='time',
            time_dummies=False,
            collapse=True,
            two_step=True,
            robust=True
        )
        result_gmm = model_gmm.fit()
        rho_gmm = float(result_gmm.params.iloc[0])
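    except Exception as exc:
        # Report the failure so that NaNs in the results table can be traced
        print(f"GMM failed for rho={rho_true}, T={T_true}: {exc}")
        rho_gmm = np.nan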
    
    bias_results.append({
        'rho': rho_true,
        'T': int(T_true),
        'ols_estimate': rho_ols,
        'fe_estimate': rho_fe,
        'gmm_estimate': rho_gmm,
        'ols_bias': rho_ols - rho_true,
        'fe_bias': rho_fe - rho_true,
        'gmm_bias': rho_gmm - rho_true if not np.isnan(rho_gmm) else np.nan,
    })

bias_df = pd.DataFrame(bias_results)
print("\nEstimation Results Across (rho, T):")
print("=" * 80)
print(bias_df.to_string(index=False, float_format='{:.4f}'.format))

In [None]:
# Visualize Nickell bias
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

estimators = [
    ('ols_estimate', 'Pooled OLS', 'tab:red'),
    ('fe_estimate', 'Fixed Effects', 'tab:orange'),
    ('gmm_estimate', 'Difference GMM', 'tab:blue'),
]

for ax, (col, name, color) in zip(axes, estimators):
    for rho in [0.3, 0.5, 0.8]:
        sub = bias_df[bias_df['rho'] == rho]
        ax.plot(sub['T'], sub[col], 'o-', label=f'rho = {rho}')
        ax.axhline(rho, color='gray', linestyle=':', alpha=0.5)
    ax.set_xlabel('T (panel length)')
    ax.set_ylabel('Estimate of rho')
    ax.set_title(name)
    ax.legend()

fig.suptitle('Nickell Bias Demonstration: OLS vs FE vs GMM', fontsize=14, y=1.02)
fig.tight_layout()
fig.savefig('../outputs/figures/01_nickell_bias_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

### Key Takeaways from the Simulation

1. **OLS** consistently **overestimates** $\rho$ (upward bias from fixed effects)
2. **Fixed Effects** consistently **underestimates** $\rho$ (Nickell bias)
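3. **Difference GMM** is **consistent** as $N \to \infty$ with $T$ fixed, so its estimates should sit close to the true $\rho$; some finite-sample bias can remain, especially when $\rho$ is close to 1 and lagged levels become weak instruments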
4. The FE bias is more severe for **small T** and **high $\rho$**

> **Rule of thumb**: In dynamic panels, the true $\rho$ should lie **between** the FE estimate (lower bound) and the OLS estimate (upper bound). If your GMM estimate falls outside this range, suspect misspecification.
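The bracketing rule can be turned into a quick sanity check. The function below is a hypothetical helper written for this sketch, and the numbers passed to it are made-up illustrations, not estimates from any dataset. The optional `tolerance` argument is an assumption added to allow for sampling noise around the two bounds.

```python
def check_bracketing(rho_fe, rho_gmm, rho_ols, tolerance=0.0):
    """Report whether a GMM estimate lies within [FE, OLS], widened by `tolerance`."""
    if rho_fe > rho_ols:
        return "unexpected: FE estimate exceeds OLS estimate"
    if rho_gmm < rho_fe - tolerance:
        return "below FE bound: suspect misspecification or weak instruments"
    if rho_gmm > rho_ols + tolerance:
        return "above OLS bound: suspect misspecification"
    return "inside bounds"


# Illustrative (invented) estimates
print(check_bracketing(rho_fe=0.35, rho_gmm=0.52, rho_ols=0.88))
print(check_bracketing(rho_fe=0.35, rho_gmm=0.20, rho_ols=0.88))
```

Because all three estimates carry sampling error, a GMM estimate slightly outside the interval is a prompt to investigate rather than proof of a problem.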

## 3. The Arellano-Bond Solution

### Step 1: Eliminate Fixed Effects by First-Differencing

$$\Delta y_{it} = \rho \, \Delta y_{i,t-1} + \Delta \mathbf{x}'_{it} \boldsymbol{\beta} + \Delta \varepsilon_{it}$$

This removes $\mu_i$, but creates a new problem: $\Delta y_{i,t-1}$ is correlated with $\Delta \varepsilon_{it}$ because both contain $\varepsilon_{i,t-1}$.
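The size of this correlation can be worked out directly. As a sketch, assume homoskedastic errors with variance $\sigma_\varepsilon^2$ (a symbol introduced here for illustration), no serial correlation in $\varepsilon_{it}$, and covariates that are uncorrelated with $\varepsilon_{i,t-1}$. Since $y_{i,t-1}$ and $y_{i,t-2}$ depend only on current and past errors, the only surviving term is the one pairing $y_{i,t-1}$ with $\varepsilon_{i,t-1}$:

$$\begin{aligned}
\text{Cov}(\Delta y_{i,t-1}, \Delta \varepsilon_{it})
&= \text{Cov}(y_{i,t-1} - y_{i,t-2},\; \varepsilon_{it} - \varepsilon_{i,t-1}) \\
&= -\,\text{Cov}(y_{i,t-1}, \varepsilon_{i,t-1}) \\
&= -\,\sigma_\varepsilon^2 \neq 0
\end{aligned}$$

Under these assumptions the covariance is negative, so OLS on the differenced equation would tend to underestimate $\rho$. This is why an instrument for $\Delta y_{i,t-1}$ is needed.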

### Step 2: Use Lagged Levels as Instruments

Under the assumptions:
- $E[\varepsilon_{it} \varepsilon_{is}] = 0$ for $t \neq s$ (no serial correlation)
- $E[y_{i,1} \varepsilon_{it}] = 0$ for $t \geq 2$ (predetermined initial conditions)

We can use $y_{i,t-2}, y_{i,t-3}, \ldots$ as **instruments** for $\Delta y_{i,t-1}$:

| Time | Endogenous | Valid Instruments |
|------|------------|-------------------|
| $t=3$ | $\Delta y_{i,2}$ | $y_{i,1}$ |
| $t=4$ | $\Delta y_{i,3}$ | $y_{i,1}, y_{i,2}$ |
| $t=5$ | $\Delta y_{i,4}$ | $y_{i,1}, y_{i,2}, y_{i,3}$ |
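The table can be reproduced mechanically. The sketch below builds the instrument rows for one entity of a balanced panel, first in the full layout (one column per time and lag pair) and then in the collapsed layout (one column per lag distance). `ab_instruments` is a hypothetical helper, and the column ordering is an assumption that may differ from how PanelBox arranges its instrument matrix internally.

```python
import numpy as np


def ab_instruments(y_levels, collapse=False):
    """Instrument rows for the differenced equations of ONE entity.

    y_levels holds y_{i,1}, ..., y_{i,T}, so y_levels[0] is y_{i,1}.
    Row r corresponds to the differenced equation at time t = r + 3.
    """
    y_levels = np.asarray(y_levels, dtype=float)
    n_periods = len(y_levels)
    n_equations = n_periods - 2
    times = range(3, n_periods + 1)

    if collapse:
        # One column per lag distance s = 2, ..., T-1
        Z = np.zeros((n_equations, n_equations))
        for r, t in enumerate(times):
            for s in range(2, t):
                Z[r, s - 2] = y_levels[t - s - 1]  # y_{i,t-s}
        return Z

    # One block of columns per equation: y_{i,1}, ..., y_{i,t-2}
    n_cols = n_equations * (n_equations + 1) // 2
    Z = np.zeros((n_equations, n_cols))
    col = 0
    for r, t in enumerate(times):
        n_avail = t - 2
        Z[r, col:col + n_avail] = y_levels[:n_avail]
        col += n_avail
    return Z


y_example = [1, 2, 3, 4, 5]  # stands in for y_{i,1}, ..., y_{i,5}
print("Full layout:\n", ab_instruments(y_example))
print("Collapsed layout:\n", ab_instruments(y_example, collapse=True))
```

In the full layout the column count grows quadratically with $T$, whereas the collapsed layout grows linearly. That difference is what the `collapse` option trades off against efficiency.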

### Step 3: GMM Estimation

The moment conditions are:

$$E[y_{i,t-s} \cdot \Delta \varepsilon_{it}] = 0 \quad \text{for } s \geq 2$$

GMM chooses the parameters that bring the sample analogues of these moment conditions as close to zero as possible, where closeness is measured by a weighted quadratic form in the sample moments.
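To see the three steps working together, here is a minimal one-step estimator for the model without covariates, run on simulated data. It is a sketch under stated assumptions, not PanelBox's implementation: the function names, the simulated DGP, and the seed are all hypothetical, instruments are collapsed, and the weight matrix is the simple 2SLS choice $(Z'Z)^{-1}$ rather than the Arellano-Bond one-step weighting or a two-step update.

```python
import numpy as np


def simulate_panel(n_entities, n_periods, rho, seed=0, burn_in=50):
    """Simulate y_it = rho * y_{i,t-1} + mu_i + eps_it on a balanced panel."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(size=n_entities)
    y = mu / (1.0 - rho)  # start each entity at its long-run mean
    out = np.empty((n_entities, n_periods))
    for t in range(burn_in + n_periods):
        y = rho * y + mu + rng.normal(size=n_entities)
        if t >= burn_in:
            out[:, t - burn_in] = y
    return out


def difference_gmm_one_step(y):
    """One-step difference GMM for rho with collapsed lagged-level instruments.

    y is an (N, T) array of levels with 0-based periods 0, ..., T-1.
    """
    n_entities, n_periods = y.shape
    dy = np.diff(y, axis=1)   # dy[:, j] is the difference at period j + 1
    dep = dy[:, 1:]           # differences at periods t = 2, ..., T-1
    reg = dy[:, :-1]          # differences at periods t - 1
    n_eq = n_periods - 2

    # Collapsed instruments: column s-2 holds y_{i,t-s}, zero if unavailable
    Z = np.zeros((n_entities, n_eq, n_eq))
    for e in range(n_eq):
        t = e + 2
        for s in range(2, t + 1):
            Z[:, e, s - 2] = y[:, t - s]

    Z = Z.reshape(n_entities * n_eq, n_eq)
    x, d = reg.reshape(-1), dep.reshape(-1)

    W = np.linalg.inv(Z.T @ Z)      # 2SLS weight matrix (a simplification)
    zx, zd = Z.T @ x, Z.T @ d
    # Closed-form minimizer of (Z'(d - rho*x))' W (Z'(d - rho*x))
    return float((zx @ W @ zd) / (zx @ W @ zx))


y_sim = simulate_panel(n_entities=5000, n_periods=8, rho=0.5)
rho_hat = difference_gmm_one_step(y_sim)
print(f"True rho = 0.5, one-step difference GMM estimate = {rho_hat:.3f}")
```

Any positive definite weight matrix gives a consistent estimator here; the choice of weights affects efficiency, which is what the two-step procedure aims to improve.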

## 4. Estimation with PanelBox

Now let's apply Difference GMM to real-world-style data: the Arellano-Bond employment dataset.

In [None]:
# Load employment data
abdata = pd.read_csv("../data/abdata.csv")
print(f"Shape: {abdata.shape}")
print(f"Firms: {abdata['firm'].nunique()}")
print(f"Years: {sorted(abdata['year'].unique())}")
print(f"\nDescriptive Statistics:")
abdata.describe().round(3)

In [None]:
# Basic Difference GMM estimation
# Model: n_{it} = rho * n_{i,t-1} + beta1 * w_{it} + beta2 * k_{it} + mu_i + eps_{it}

model_diff = DifferenceGMM(
    data=abdata,
    dep_var='n',              # Log employment
    lags=1,                    # Include n_{t-1}
    id_var='firm',             # Firm identifier
    time_var='year',           # Time variable
    exog_vars=['w', 'k'],      # Strictly exogenous regressors
    time_dummies=True,         # Include year dummies
    collapse=True,             # Collapse instruments (recommended)
    two_step=True,             # Two-step estimation
    robust=True                # Windmeijer-corrected standard errors
)

results_diff = model_diff.fit()
print(results_diff.summary())

In [None]:
# Key results interpretation
print("Key Coefficient Estimates:")
print("=" * 50)
for var in ['L1.n', 'w', 'k']:
    if var in results_diff.params.index:
        coef = results_diff.params[var]
        se = results_diff.std_errors[var]
        p = results_diff.pvalues[var]
        sig = '***' if p < 0.001 else '**' if p < 0.01 else '*' if p < 0.05 else ''
        print(f"  {var:<10}: {coef:>8.4f}  (SE: {se:.4f})  p={p:.4f} {sig}")

print(f"\nDiagnostic Tests:")
print(f"  Hansen J:    p = {results_diff.hansen_j.pvalue:.4f}  [{results_diff.hansen_j.conclusion}]")
print(f"  AR(1):       p = {results_diff.ar1_test.pvalue:.4f}  [{results_diff.ar1_test.conclusion}]")
print(f"  AR(2):       p = {results_diff.ar2_test.pvalue:.4f}  [{results_diff.ar2_test.conclusion}]")
print(f"  Instruments: {results_diff.n_instruments}")
print(f"  Groups:      {results_diff.n_groups}")
print(f"  Inst. ratio: {results_diff.instrument_ratio:.3f}")

### Interpreting the Results

**Coefficients:**
- `L1.n` ($\hat{\rho}$): persistence of employment
- `w`: elasticity of employment with respect to wages
- `k`: elasticity of employment with respect to capital

**Diagnostic Tests:**
- **Hansen J-test**: If $p > 0.10$, the test does not reject instrument validity (this is not proof that the instruments are valid). If $p > 0.99$, suspect too many instruments.
- **AR(1)**: Expected to be significant, because $\Delta \varepsilon_{it}$ and $\Delta \varepsilon_{i,t-1}$ share $\varepsilon_{i,t-1}$ by construction.
- **AR(2)**: Should NOT be significant ($p > 0.10$); otherwise $y_{i,t-2}$ is not a valid instrument.
- **Instrument ratio**: Should be $\leq 1.0$ (no more instruments than groups).

## 5. OLS vs FE vs GMM Comparison

Let's compare the three estimators on the employment data.

In [None]:
# Prepare data with lagged variable
df = abdata.sort_values(['firm', 'year']).copy()
df['n_lag'] = df.groupby('firm')['n'].shift(1)
df = df.dropna(subset=['n_lag'])

# --- Pooled OLS ---
y = df['n'].values
X_ols = np.column_stack([np.ones(len(df)), df['n_lag'].values, df['w'].values, df['k'].values])
beta_ols = np.linalg.lstsq(X_ols, y, rcond=None)[0]
resid_ols = y - X_ols @ beta_ols
se_ols = np.sqrt(np.diag(np.sum(resid_ols**2) / (len(y) - 4) * np.linalg.inv(X_ols.T @ X_ols)))

# --- Fixed Effects ---
for col in ['n', 'n_lag', 'w', 'k']:
    df[f'{col}_dm'] = df[col] - df.groupby('firm')[col].transform('mean')

y_dm = df['n_dm'].values
X_fe = np.column_stack([df['n_lag_dm'].values, df['w_dm'].values, df['k_dm'].values])
beta_fe = np.linalg.lstsq(X_fe, y_dm, rcond=None)[0]
resid_fe = y_dm - X_fe @ beta_fe
N_firms = df['firm'].nunique()
se_fe = np.sqrt(np.diag(np.sum(resid_fe**2) / (len(y_dm) - N_firms - 3) * np.linalg.inv(X_fe.T @ X_fe)))

# --- Collect GMM results ---
rho_gmm = results_diff.params.get('L1.n', np.nan)
se_gmm = results_diff.std_errors.get('L1.n', np.nan)

# Comparison table
comparison = pd.DataFrame({
    'OLS': {'rho': beta_ols[1], 'w': beta_ols[2], 'k': beta_ols[3], 'SE(rho)': se_ols[1]},
    'Fixed Effects': {'rho': beta_fe[0], 'w': beta_fe[1], 'k': beta_fe[2], 'SE(rho)': se_fe[0]},
    'Difference GMM': {
        'rho': rho_gmm,
        'w': results_diff.params.get('w', np.nan),
        'k': results_diff.params.get('k', np.nan),
        'SE(rho)': se_gmm,
    },
})

print("Estimator Comparison:")
print("=" * 60)
print(comparison.round(4).to_string())
print(f"\nExpected ordering: FE(rho) < GMM(rho) < OLS(rho)")
print(f"Actual: FE={beta_fe[0]:.4f}, GMM={rho_gmm:.4f}, OLS={beta_ols[1]:.4f}")

In [None]:
# Visual comparison
estimates = {
    'Pooled OLS': (beta_ols[1], se_ols[1]),
    'Fixed Effects': (beta_fe[0], se_fe[0]),
    'Difference GMM': (rho_gmm, se_gmm),
}

fig, ax = plt.subplots(figsize=(8, 4))
plot_coefficient_comparison(
    estimates,
    param_name='Persistence parameter (rho)',
    title='Employment Persistence: OLS vs FE vs Difference GMM',
    ax=ax
)
fig.tight_layout()
fig.savefig('../outputs/figures/01_ols_fe_gmm_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

## 6. Diagnostic Tests

GMM estimation is only as good as its specification. The key diagnostic tests are:

### Hansen J-Test (Overidentification)

- **H0**: All instruments are valid (moment conditions hold)
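- **Reject if** $p < 0.10$: at least some instruments may be invalid
- **Suspicious if** $p > 0.99$: an implausibly high p-value often signals too many instruments, which weakens the test
- **Range treated as comfortable in this tutorial** (a heuristic, not a formal rule): $0.10 < p < 0.25$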

### AR(1) and AR(2) Tests

- **AR(1)**: Expected to be significant (by construction in first differences)
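- **AR(2)**: Should NOT be significant ($p > 0.10$). Rejection implies serial correlation in the level errors $\varepsilon_{it}$, so that $y_{i,t-2}$ is correlated with $\Delta \varepsilon_{it}$ and is no longer a valid instrument.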
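The logic behind the two tests can be checked numerically. The sketch below is not the Arellano-Bond test statistic itself (which is computed from estimated residuals); it only illustrates the autocorrelation pattern of differenced errors under two assumed error processes. The helper name, sample sizes, seed, and the AR(1) coefficient `phi` are hypothetical choices.

```python
import numpy as np


def diff_autocorr(eps, lag):
    """Pooled sample correlation between differenced errors `lag` periods apart."""
    d = np.diff(eps, axis=1)
    return float(np.corrcoef(d[:, lag:].ravel(), d[:, :-lag].ravel())[0, 1])


rng = np.random.default_rng(0)
n_entities, n_periods, phi = 5000, 12, 0.5
shocks = rng.normal(size=(n_entities, n_periods))

# Case 1: serially uncorrelated level errors (the maintained assumption)
eps_iid = shocks

# Case 2: AR(1) level errors, which violate the assumption
eps_ar1 = np.empty_like(shocks)
eps_ar1[:, 0] = shocks[:, 0] / np.sqrt(1 - phi**2)
for t in range(1, n_periods):
    eps_ar1[:, t] = phi * eps_ar1[:, t - 1] + shocks[:, t]

for label, eps in [("iid errors", eps_iid), ("AR(1) errors", eps_ar1)]:
    r1, r2 = diff_autocorr(eps, 1), diff_autocorr(eps, 2)
    print(f"{label:<13} first-order: {r1:+.3f}  second-order: {r2:+.3f}")
```

With uncorrelated level errors, the differences show strong first-order correlation and essentially none at the second order. Serially correlated level errors leave a second-order footprint, which is what the AR(2) test is designed to detect.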

### Instrument Count

Rule of thumb: number of instruments $\leq$ number of groups (cross-sectional units).
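To get a feel for how quickly the count grows, the sketch below counts the GMM-style instruments generated by the lagged dependent variable alone in a balanced panel, using all available lags. `n_instruments_lagged_y` is a hypothetical helper, and `n_groups = 100` is an invented value; covariates and time dummies would add further instruments that are not counted here.

```python
def n_instruments_lagged_y(n_periods, collapse):
    """Count lagged-level instruments for the lagged dependent variable only."""
    n_equations = n_periods - 2  # differenced equations at t = 3, ..., T
    if n_equations < 1:
        raise ValueError("need at least three time periods")
    if collapse:
        return n_equations                          # one per lag distance
    return n_equations * (n_equations + 1) // 2     # one per (time, lag) pair


n_groups = 100  # hypothetical number of cross-sectional units
for n_periods in (5, 10, 20):
    full = n_instruments_lagged_y(n_periods, collapse=False)
    collapsed = n_instruments_lagged_y(n_periods, collapse=True)
    print(
        f"T = {n_periods:>2}: full = {full:>3} (ratio {full / n_groups:.2f}), "
        f"collapsed = {collapsed:>2} (ratio {collapsed / n_groups:.2f})"
    )
```

The full count is quadratic in $T$, so a moderately long panel can exceed the number of groups even before other instruments are added; collapsing keeps the count linear in $T$.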

In [None]:
# Detailed diagnostic analysis
print("="*70)
print("DIAGNOSTIC REPORT: Difference GMM")
print("="*70)

# 1. Hansen J-test
hansen = results_diff.hansen_j
print(f"\n1. Hansen J-Test (Overidentification)")
print(f"   Statistic: {hansen.statistic:.4f}")
print(f"   P-value:   {hansen.pvalue:.4f}")
if hansen.df:
    print(f"   DF:        {hansen.df}")
if hansen.pvalue > 0.10:
    print(f"   Result:    PASS - Cannot reject instrument validity")
else:
    print(f"   Result:    FAIL - Instruments may be invalid")

# 2. AR tests
ar1 = results_diff.ar1_test
ar2 = results_diff.ar2_test
print(f"\n2. Arellano-Bond AR Tests")
print(f"   AR(1): z = {ar1.statistic:.4f}, p = {ar1.pvalue:.4f}")
print(f"          {'Expected: significant' if ar1.pvalue < 0.10 else 'Unusual: not significant'}")
print(f"   AR(2): z = {ar2.statistic:.4f}, p = {ar2.pvalue:.4f}")
print(f"          {'PASS: No second-order autocorrelation' if ar2.pvalue > 0.10 else 'FAIL: Serial correlation detected'}")

# 3. Instrument count
print(f"\n3. Instrument Count")
print(f"   Instruments: {results_diff.n_instruments}")
print(f"   Groups:      {results_diff.n_groups}")
ratio = results_diff.instrument_ratio
print(f"   Ratio:       {ratio:.3f} {'(OK)' if ratio <= 1.0 else '(TOO MANY)'}")

# Overall assessment
print(f"\n" + "="*70)
valid_hansen = hansen.pvalue > 0.10
valid_ar2 = ar2.pvalue > 0.10
valid_instruments = ratio <= 1.0

if valid_hansen and valid_ar2 and valid_instruments:
    print("OVERALL: Specification appears VALID")
else:
    issues = []
    if not valid_hansen: issues.append("Hansen J rejects")
    if not valid_ar2: issues.append("AR(2) significant")
    if not valid_instruments: issues.append("Too many instruments")
    print(f"OVERALL: Issues — {', '.join(issues)}")
print("="*70)

## 7. Exercises

### Exercise 1: Sensitivity to Panel Length

Using the Nickell bias data, estimate the FE and GMM coefficients for $\rho = 0.8$ and $T \in \{5, 10, 20\}$. How does the FE bias change with $T$? Does the GMM estimate stay close to the true value?

### Exercise 2: Adding More Covariates

Re-estimate the employment model adding `ys` (industry output) as an additional exogenous variable. How do the coefficient estimates change? Do the diagnostics still pass?

### Exercise 3: Collapse vs. Non-Collapse

Estimate the employment model with `collapse=False`. Compare the instrument count, Hansen J p-value, and coefficient estimates with the collapsed version. What happens?

In [None]:
# Space for Exercise 1
# YOUR CODE HERE


In [None]:
# Space for Exercise 2
# YOUR CODE HERE


In [None]:
# Space for Exercise 3
# YOUR CODE HERE


## Summary

In this notebook, we learned:

1. **Dynamic panel models** include lagged dependent variables, creating endogeneity
2. **OLS overestimates** and **FE underestimates** the persistence parameter; the downward FE bias is the **Nickell bias**
3. **Difference GMM** (Arellano-Bond) solves this by:
   - First-differencing to eliminate fixed effects
   - Using lagged levels as instruments
4. **Key diagnostics**: Hansen J-test, AR(2) test, instrument count ratio
5. **Best practices**: Use `collapse=True` to limit instrument proliferation, and check that the instrument ratio is at most 1.0

### Next Notebook

In **Notebook 02**, we'll explore **System GMM (Blundell-Bond)**, which adds level equations for greater efficiency when series are persistent.

---

**References:**
- Arellano, M., & Bond, S. (1991). Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. *Review of Economic Studies*, 58(2), 277-297.
- Nickell, S. (1981). Biases in dynamic models with fixed effects. *Econometrica*, 49(6), 1417-1426.
- Roodman, D. (2009). How to do xtabond2: An introduction to difference and system GMM in Stata. *Stata Journal*, 9(1), 86-136.