# Tutorial 02: Clustered Standard Errors for Panel Data

**Author**: PanelBox Development Team  
**Date**: 2026-02-16  
**Estimated Duration**: 60-75 minutes  
**Prerequisites**: Tutorial 01 (Robust Fundamentals), Panel data concepts

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. **Understand** intra-cluster correlation and why it invalidates standard robust SEs
2. **Diagnose** within-cluster correlation using ACF plots and other diagnostics
3. **Implement** one-way clustering (by entity or time) using PanelBox
4. **Apply** two-way clustering when appropriate
5. **Interpret** cluster diagnostics (number of clusters, cluster sizes)
6. **Choose** the correct clustering dimension for different research contexts
7. **Avoid** common clustering pitfalls (too few clusters, wrong dimension)

---

## Table of Contents

1. [Setup and Data Loading](#setup)
2. [The Problem with Independence](#problem)
3. [Diagnosing Within-Cluster Correlation](#diagnosis)
4. [One-Way Clustering: By Entity](#entity)
5. [One-Way Clustering: By Time](#time)
6. [Two-Way Clustering](#twoway)
7. [Cluster Diagnostics](#diagnostics)
8. [Case Studies by Discipline](#cases)
9. [Common Pitfalls](#pitfalls)
10. [Exercises](#exercises)
11. [Summary and Key Takeaways](#summary)
12. [References](#references)

---

<a id='setup'></a>
## 1. Setup and Data Loading

We'll work with three panel datasets:
1. **Financial panel**: Stock returns (50 firms, 120 months)
2. **Policy reform**: Country-level outcomes (30 countries, 15 years)
3. **Wage panel**: Individual wages (2000 persons, 5 years)

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Statistical tools
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from scipy import stats

# PanelBox imports
import panelbox as pb
from panelbox.models.static import PooledOLS, FixedEffects

# Configuration
np.random.seed(42)
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100
pd.set_option('display.precision', 4)

# Define paths
DATA_PATH = '../data/'
FIG_PATH = '../outputs/figures/02_clustering/'

# Create output directory if it doesn't exist
import os
os.makedirs(FIG_PATH, exist_ok=True)

print("✓ Setup complete!")

### Load All Datasets

In [None]:
# Load financial panel data
financial = pd.read_csv(DATA_PATH + 'financial_panel.csv')
print("Financial Panel Data:")
print(f"  Shape: {financial.shape}")
print(f"  Firms: {financial['firm_id'].nunique()}")
print(f"  Months: {financial['month'].nunique()}")
print(f"  Columns: {list(financial.columns)}")
print()

# Load policy reform data
policy = pd.read_csv(DATA_PATH + 'policy_reform.csv')
print("Policy Reform Data:")
print(f"  Shape: {policy.shape}")
print(f"  Countries: {policy['country_id'].nunique()}")
print(f"  Years: {policy['year'].nunique()}")
print(f"  Columns: {list(policy.columns)}")
print()

# Load wage panel data
wage = pd.read_csv(DATA_PATH + 'wage_panel.csv')
print("Wage Panel Data:")
print(f"  Shape: {wage.shape}")
print(f"  Persons: {wage['person_id'].nunique()}")
print(f"  Years: {wage['year'].nunique()}")
print(f"  Columns: {list(wage.columns)}")
print()

# Display sample
print("Sample from Financial Data:")
financial.head()

---

<a id='problem'></a>
## 2. The Problem with Independence

### 2.1 Review: What Robust SEs Handle

**Quick Recap from Tutorial 01**:
- Robust SEs (HC0-HC3) handle **heteroskedasticity**
- Assumption maintained: **Independence** across observations
- Valid for: Cross-sectional data where observations are truly independent

**The New Problem**:

> In panel data, observations within entities (firms, individuals, countries) are almost never independent over time. Similarly, observations at the same time point may be correlated due to common shocks (market crashes, policy changes, global events).

### 2.2 Visual Demonstration: Autocorrelation in Panels

**Example: Stock Returns**

In [None]:
# Select one firm and plot returns over time
firm_1_data = financial[financial['firm_id'] == 1].sort_values('month')

plt.figure(figsize=(14, 5))
plt.plot(firm_1_data['month'], firm_1_data['returns'], marker='o', linewidth=1.5, markersize=4)
plt.axhline(0, color='red', linestyle='--', alpha=0.5, linewidth=2)
plt.xlabel('Month', fontsize=12, fontweight='bold')
plt.ylabel('Returns (%)', fontsize=12, fontweight='bold')
plt.title('Stock Returns Over Time (Firm 1) - Visual Evidence of Autocorrelation', 
          fontsize=13, fontweight='bold')
plt.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(FIG_PATH + 'returns_timeseries_firm1.png', dpi=300, bbox_inches='tight')
plt.show()

print("Observation: Returns show persistence (clustering of positive/negative values)")
print("             This suggests autocorrelation within firm over time")

**Statistical Consequence**:
- Robust SEs assume Cov(εᵢₜ, εᵢₛ) = 0 for t ≠ s
- Reality: Cov(εᵢₜ, εᵢₛ) ≠ 0 (correlation within firm i)
- Result: When this within-firm correlation is positive, robust SEs are still **biased downward** (they underestimate uncertainty)
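
To get a feel for the size of the problem, consider the simplest stylized case: the mean of `m` observations from one firm whose errors all share the same pairwise correlation ρ. The standard error of that mean is understated by the textbook factor sqrt(1 + (m - 1)ρ) when independence is wrongly assumed. The helper below only evaluates that formula; its name is illustrative and not part of PanelBox, and real panels rarely have exactly equicorrelated errors.

```python
import math

def se_inflation(cluster_size: int, rho: float) -> float:
    """Factor by which the SE of a cluster mean grows under equicorrelation rho."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    variance_factor = 1.0 + (cluster_size - 1) * rho
    if variance_factor < 0:
        raise ValueError("rho is too negative for this cluster size")
    return math.sqrt(variance_factor)

# 120 monthly observations per firm, as in the financial panel
for rho in (0.0, 0.05, 0.2):
    print(f"rho={rho:.2f}: SE understated by a factor of {se_inflation(120, rho):.2f}")
```

With 120 observations per firm, even a small correlation such as ρ = 0.05 implies a factor of roughly 2.6 in this stylized setting.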

### 2.3 Motivating Example: Significance Mirage

Let's demonstrate how ignoring clustering leads to false precision.

In [None]:
# Estimate model with robust SEs (WRONG for panels with temporal correlation)
fe = FixedEffects("returns ~ market_ret + size", financial, "firm_id", "month")

res_robust = fe.fit(cov_type='hc1')  # Ignores temporal correlation
res_cluster = fe.fit(cov_type='clustered', cluster_entity=True)  # Correct

# Compare
var = 'market_ret'
print("=" * 70)
print("COMPARISON: ROBUST vs CLUSTERED STANDARD ERRORS")
print("=" * 70)
print(f"\nCoefficient (β): {res_robust.params[var]:.4f}")
print()
print(f"Robust SE (HC1):     {res_robust.std_errors[var]:.4f}  (WRONG - ignores clustering)")
print(f"  t-statistic:       {res_robust.tvalues[var]:.4f}")
print(f"  p-value:           {res_robust.pvalues[var]:.4f}")
print()
print(f"Clustered SE (firm): {res_cluster.std_errors[var]:.4f}  (CORRECT)")
print(f"  t-statistic:       {res_cluster.tvalues[var]:.4f}")
print(f"  p-value:           {res_cluster.pvalues[var]:.4f}")
print()
ratio = res_cluster.std_errors[var] / res_robust.std_errors[var]
print(f"Ratio (Clustered/Robust): {ratio:.2f}x")
print("=" * 70)

print("\n💡 KEY INSIGHT:")
print("   Clustered SE is substantially larger → more honest about uncertainty")
print("   Inference conclusions may change when using correct SEs!")

**Implication**: Robust SEs may show p < 0.001, while clustered SEs show p = 0.02. 
Your conclusion about significance can completely change!

---

<a id='diagnosis'></a>
## 3. Diagnosing Within-Cluster Correlation

Before applying clustered SEs, it's important to diagnose whether within-cluster correlation exists.

### 3.1 Autocorrelation Function (ACF) Plots

**Objective**: Visualize temporal correlation within entities

In [None]:
# Estimate model and extract residuals
fe = FixedEffects("returns ~ market_ret + size", financial, "firm_id", "month")
result = fe.fit(cov_type='hc1')

# Add residuals to data
financial_with_resid = financial.copy()
financial_with_resid['resid'] = result.resid

# Extract residuals for one firm
firm_1_resid = financial_with_resid[financial_with_resid['firm_id'] == 1].sort_values('month')['resid']

# Plot ACF and PACF
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

plot_acf(firm_1_resid, lags=20, ax=axes[0], title='ACF of Residuals (Firm 1)')
axes[0].set_xlabel('Lag', fontsize=11, fontweight='bold')
axes[0].set_ylabel('Autocorrelation', fontsize=11, fontweight='bold')

plot_pacf(firm_1_resid, lags=20, ax=axes[1], title='PACF of Residuals (Firm 1)')
axes[1].set_xlabel('Lag', fontsize=11, fontweight='bold')
axes[1].set_ylabel('Partial Autocorrelation', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig(FIG_PATH + 'acf_pacf_firm1.png', dpi=300, bbox_inches='tight')
plt.show()

print("📊 INTERPRETATION:")
print("   - ACF bars outside confidence bands: Evidence of autocorrelation")
print("   - Lag 1 significant: First-order autocorrelation (common in financial data)")
print("   - Multiple lags significant: Persistent correlation")
print("\n   → If significant autocorrelation detected, use clustered SEs!")

### 3.2 Average Within-Entity Correlation

Calculate the average correlation of residuals within each entity.

In [None]:
def calculate_within_entity_corr(data, residuals, entity_col):
    """
    Calculate average within-entity correlation of residuals.
    
    Parameters:
    -----------
    data : DataFrame
        Panel data
    residuals : array-like
        Residuals from regression
    entity_col : str
        Name of entity column
        
    Returns:
    --------
    dict with 'mean' and 'all_corrs'
    """
    data_with_resid = data.copy()
    data_with_resid['resid'] = residuals
    
    within_corrs = []
    for entity in data[entity_col].unique():
        entity_resid = data_with_resid[data_with_resid[entity_col] == entity]['resid']
        if len(entity_resid) > 1:
            # Lag-1 autocorrelation of the residuals; assumes rows are already
            # sorted by time within each entity
            corr = entity_resid.autocorr(lag=1)
            if not np.isnan(corr):
                within_corrs.append(corr)
    
    return {'mean': np.mean(within_corrs), 'all_corrs': within_corrs}

# Calculate
corr_result = calculate_within_entity_corr(financial, result.resid, 'firm_id')
avg_corr = corr_result['mean']
all_corrs = corr_result['all_corrs']

print(f"Average within-firm correlation (lag 1): {avg_corr:.3f}")
print()

# Distribution of within-entity correlations
plt.figure(figsize=(10, 5))
plt.hist(all_corrs, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
plt.axvline(0, color='red', linestyle='--', linewidth=2, label='Independence (ρ=0)')
plt.axvline(avg_corr, color='green', linestyle='--', linewidth=2.5, 
            label=f'Mean: {avg_corr:.3f}')
plt.xlabel('Within-Entity Correlation (Lag 1)', fontsize=12, fontweight='bold')
plt.ylabel('Frequency', fontsize=12, fontweight='bold')
plt.title('Distribution of Within-Firm Correlations', fontsize=13, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig(FIG_PATH + 'within_entity_correlation_dist.png', dpi=300, bbox_inches='tight')
plt.show()

print("📊 INTERPRETATION:")
if abs(avg_corr) < 0.1:
    print("   Mean ≈ 0: Little temporal correlation (robust SEs may be acceptable)")
elif abs(avg_corr) < 0.3:
    print("   Mean > 0.1: Moderate correlation (clustered SEs recommended)")
else:
    print("   Mean > 0.3: Substantial correlation (clustered SEs NECESSARY)")
    if abs(avg_corr) > 0.5:
        print("   Mean > 0.5: Strong persistence (consider dynamic model)")

---

<a id='entity'></a>
## 4. One-Way Clustering: By Entity

### 4.1 When to Cluster by Entity

**Use Case**: Panel data with repeated observations within entities

**Examples**:
- Firms over time → cluster by firm_id
- Individuals over years → cluster by person_id
- Countries over decades → cluster by country_id

**What it allows**: Arbitrary correlation within each entity over time

### 4.2 Implementation in PanelBox

In [None]:
# Financial panel: cluster by firm
fe = FixedEffects("returns ~ market_ret + size + book_to_market",
                   financial, "firm_id", "month")

# Cluster by firm (entity)
res_cluster_firm = fe.fit(cov_type='clustered', cluster_entity=True)

# Compare with robust (wrong)
res_robust_hc1 = fe.fit(cov_type='hc1')

print("=" * 80)
print("FIXED EFFECTS MODEL WITH ENTITY CLUSTERING")
print("=" * 80)
print(res_cluster_firm.summary())

In [None]:
# Create comparison table
comparison_data = []
for var in ['market_ret', 'size', 'book_to_market']:
    if var in res_robust_hc1.params.index:
        comparison_data.append({
            'Variable': var,
            'Coefficient': res_robust_hc1.params[var],
            'SE_Robust': res_robust_hc1.std_errors[var],
            'SE_Clustered': res_cluster_firm.std_errors[var],
            'Ratio': res_cluster_firm.std_errors[var] / res_robust_hc1.std_errors[var],
            't_Robust': res_robust_hc1.tvalues[var],
            't_Clustered': res_cluster_firm.tvalues[var]
        })

comp_df = pd.DataFrame(comparison_data)
print("\n" + "=" * 80)
print("COMPARISON: ROBUST (HC1) vs CLUSTERED (BY FIRM)")
print("=" * 80)
print(comp_df.to_string(index=False))
print()
print("Expected Pattern:")
print("  - Clustered SEs > Robust SEs (if temporal correlation present)")
print("  - Ratio (Clustered/Robust) typically 1.5 - 3.0")
print("  - Larger ratios indicate stronger intra-firm correlation")

### 4.3 Visualizing SE Comparison

In [None]:
# Plot SE comparison
fig, ax = plt.subplots(figsize=(10, 6))

x = np.arange(len(comp_df))
width = 0.35

bars1 = ax.bar(x - width/2, comp_df['SE_Robust'], width, label='Robust (HC1)', 
                alpha=0.8, color='steelblue', edgecolor='black')
bars2 = ax.bar(x + width/2, comp_df['SE_Clustered'], width, label='Clustered (Firm)', 
                alpha=0.8, color='darkorange', edgecolor='black')

ax.set_xlabel('Variable', fontsize=12, fontweight='bold')
ax.set_ylabel('Standard Error', fontsize=12, fontweight='bold')
ax.set_title('Standard Error Comparison: Robust vs Clustered (Entity)', 
             fontsize=13, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(comp_df['Variable'], rotation=45, ha='right')
ax.legend(fontsize=11)
ax.grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig(FIG_PATH + 'se_comparison_entity_clustering.png', dpi=300, bbox_inches='tight')
plt.show()

### 4.4 The Math Behind Clustering

**Intuition**: Instead of treating each observation independently, cluster-robust SEs aggregate residuals within clusters before computing variance.

**Cluster-Robust Variance Formula**:
$$
V_{\text{cluster}} = (X'X)^{-1} \left[ \sum_{g=1}^{G} u_g u_g' \right] (X'X)^{-1}
$$

Where:
- $G$ = number of clusters (e.g., firms)
- $u_g = \sum_{t \in g} x_{gt} \epsilon_{gt}$ = sum of scores within cluster $g$
- Allows **arbitrary correlation** within cluster

**Comparison**:
- **Robust (HC)**: $\sum_{i=1}^{N \times T} \epsilon_i^2 x_i x_i'$ (observation-level)
- **Clustered**: $\sum_{g=1}^{G} u_g u_g'$ (cluster-level aggregation)

**Key Insight**:

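> With clustering, the amount of independent information is governed by the $G$ clusters rather than the $N \times T$ observations. This is why clustered SEs are usually larger when errors are positively correlated within clusters.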

**Degrees of Freedom Correction**:
$$
\text{Adjustment} = \frac{G}{G-1} \times \frac{N-1}{N-K}
$$

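Here $N$ is the total number of observations (written $N \times T$ above) and $K$ is the number of estimated parameters.

**Purpose**: Improve small-sample performance (applied by default in PanelBox)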
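
The formulas above can be sketched directly in NumPy. This is a minimal illustration on simulated data, not PanelBox's internal implementation: the helper names (`cluster_meat`, `sandwich`) and the simulated panel are assumptions made for the example. It builds the cluster-level sum of scores, applies the small-sample adjustment, and compares the result with the observation-level (HC0) version.

```python
import numpy as np

def cluster_meat(X, resid, groups):
    """Return sum_g u_g u_g', where u_g adds up x_it * e_it inside cluster g."""
    scores = X * resid[:, None]                       # one score row per observation
    _, codes = np.unique(groups, return_inverse=True)
    codes = codes.ravel()
    U = np.zeros((codes.max() + 1, X.shape[1]))
    np.add.at(U, codes, scores)                       # aggregate scores within each cluster
    return U.T @ U

def sandwich(X, meat, scale=1.0):
    """(X'X)^-1 * meat * (X'X)^-1, optionally rescaled."""
    bread = np.linalg.inv(X.T @ X)
    return scale * bread @ meat @ bread

# Hypothetical panel: 50 firms x 20 periods, with a firm-level
# component in both the regressor and the error term
rng = np.random.default_rng(42)
n_firms, n_periods = 50, 20
firm = np.repeat(np.arange(n_firms), n_periods)
x = rng.normal(size=n_firms)[firm] + rng.normal(size=firm.size)
err = rng.normal(size=n_firms)[firm] + rng.normal(size=firm.size)
y = 1.0 + 0.5 * x + err

X = np.column_stack([np.ones_like(x), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

n, k = X.shape
G = n_firms
hc0_meat = (X * resid[:, None] ** 2).T @ X            # sum of e_i^2 x_i x_i'
dof = (G / (G - 1)) * ((n - 1) / (n - k))             # adjustment from the text

se_hc0 = np.sqrt(np.diag(sandwich(X, hc0_meat)))
se_cluster = np.sqrt(np.diag(sandwich(X, cluster_meat(X, resid, firm), dof)))
print("HC0 SE:      ", se_hc0.round(4))
print("Clustered SE:", se_cluster.round(4))
```

A useful sanity check: if every observation is placed in its own cluster, the clustered "meat" collapses to the HC0 one, which shows that the two estimators differ only in how scores are aggregated.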

---

<a id='time'></a>
## 5. One-Way Clustering: By Time

### 5.1 When to Cluster by Time

**Use Case**: Common shocks affect all entities at the same time point

**Examples**:
- Policy changes (all countries affected in year $t$)
- Market crashes (all stocks affected in month $t$)
- Natural disasters (all regions affected simultaneously)
- Election years, regulatory changes

**Correlation Structure**:
- Between different firms at same time: Cov(εᵢₜ, εⱼₜ) ≠ 0 for i ≠ j
- Within same firm over time: Cov(εᵢₜ, εᵢₛ) = 0 (assumed)
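
This correlation structure is easy to reproduce in a small simulation. The sketch below assumes an error of the form e_it = f_t + v_it, where f_t is a shock shared by all entities in period t and v_it is independent noise; the variable names and dimensions are illustrative. With equal variances for both components, the errors of two different entities in the same period should correlate at about 0.5, while errors within an entity stay uncorrelated over time.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_periods = 10, 2000

common_shock = rng.normal(size=n_periods)                  # f_t, shared by every entity
idiosyncratic = rng.normal(size=(n_entities, n_periods))   # v_it, independent everywhere
errors = common_shock + idiosyncratic                      # e_it = f_t + v_it

# Average correlation between different entities (off-diagonal entries)
corr = np.corrcoef(errors)
avg_cross_corr = corr[~np.eye(n_entities, dtype=bool)].mean()

# Average lag-1 autocorrelation within each entity
avg_lag1 = float(np.mean([np.corrcoef(e[:-1], e[1:])[0, 1] for e in errors]))

print(f"Average cross-entity correlation: {avg_cross_corr:.3f}")
print(f"Average within-entity lag-1 autocorrelation: {avg_lag1:.3f}")
```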

### 5.2 Application: Policy Reform Example

In [None]:
# Visualize: Common time shocks
fig, ax = plt.subplots(figsize=(12, 6))

# Plot first 5 countries
for country in policy['country_id'].unique()[:5]:
    country_data = policy[policy['country_id'] == country].sort_values('year')
    ax.plot(country_data['year'], country_data['outcome'],
            marker='o', label=f'Country {country}', linewidth=2, markersize=5)

ax.set_xlabel('Year', fontsize=12, fontweight='bold')
ax.set_ylabel('Outcome', fontsize=12, fontweight='bold')
ax.set_title('Policy Outcomes by Country - Evidence of Common Time Shocks', 
             fontsize=13, fontweight='bold')
ax.legend()
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig(FIG_PATH + 'common_time_shocks.png', dpi=300, bbox_inches='tight')
plt.show()

print("Observation: All countries show similar patterns in certain years")
print("             This suggests cross-sectional correlation at specific time points")

### 5.3 Estimation with Time Clustering

In [None]:
# Cluster by time (year)
fe_policy = FixedEffects("outcome ~ treated + gdp_per_capita + democracy_index",
                          policy, "country_id", "year")

res_cluster_time = fe_policy.fit(cov_type='clustered', cluster_time=True)
res_cluster_entity = fe_policy.fit(cov_type='clustered', cluster_entity=True)
res_robust = fe_policy.fit(cov_type='hc1')

print("=" * 80)
print("TIME CLUSTERING RESULTS")
print("=" * 80)
print(res_cluster_time.summary())

In [None]:
# Comparison table: Entity vs Time clustering
comparison_time = []
for var in ['treated', 'gdp_per_capita', 'democracy_index']:
    if var in res_robust.params.index:
        comparison_time.append({
            'Variable': var,
            'Coef': res_robust.params[var],
            'SE_Robust': res_robust.std_errors[var],
            'SE_Entity': res_cluster_entity.std_errors[var],
            'SE_Time': res_cluster_time.std_errors[var]
        })

comp_time_df = pd.DataFrame(comparison_time)
print("\n" + "=" * 80)
print("COMPARISON: ENTITY vs TIME CLUSTERING")
print("=" * 80)
print(comp_time_df.to_string(index=False))
print()
print("Interpretation:")
print("  - If time clustering gives much larger SEs → strong cross-sectional correlation")
print("  - Common in macro panels (countries share global shocks)")

### 5.4 Decision Rule: Entity vs Time Clustering

**Decision Tree**:

```
1. Both temporal AND cross-sectional correlation?
   YES → Use TWO-WAY clustering (next section)
   NO → Go to 2

2. Is there temporal correlation within entities?
   YES → Cluster by entity
   NO → Go to 3

3. Is there cross-sectional correlation at each time point?
   YES → Cluster by time
   NO → Robust SEs sufficient
```

**Rule of Thumb**:
- **Micro panels** (many firms, few years): Cluster by entity
- **Macro panels** (countries, long time series): Often cluster by time OR two-way
- **Finance** (stocks): Often two-way clustering
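
The decision rule can also be written as a small helper. This is only a sketch: the function name and return labels are made up for illustration, and the two boolean inputs stand for whatever diagnostics you used to detect each kind of correlation. The two-way case is checked first so that it is never masked by the one-way branches.

```python
def choose_clustering(temporal_corr: bool, cross_sectional_corr: bool) -> str:
    """Map the two diagnostic answers to a clustering choice."""
    if temporal_corr and cross_sectional_corr:
        return "two-way"      # correlation in both dimensions
    if temporal_corr:
        return "entity"       # persistence within entities only
    if cross_sectional_corr:
        return "time"         # common shocks only
    return "robust"           # no clustering needed, HC-type SEs suffice

print(choose_clustering(temporal_corr=True, cross_sectional_corr=False))
```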

---

<a id='twoway'></a>
## 6. Two-Way Clustering

### 6.1 When You Need Two-Way Clustering

**Problem**: Correlation in BOTH dimensions
- Within entity over time: Firm i's returns correlated across months
- Across entities at same time: All firms correlated in month t (market shocks)

**Real-World Examples**:
1. **Financial Markets**: Stocks (entity clustering) + Market shocks (time clustering)
2. **Labor Economics**: Workers (entity) + Year effects (time)
3. **Political Science**: Legislators (entity) + Session/year (time)

### 6.2 The Cameron-Gelbach-Miller (2011) Formula

**Two-Way Cluster Variance**:
$$
V_{\text{2way}} = V_{\text{entity}} + V_{\text{time}} - V_{\text{intersection}}
$$

Where:
- $V_{\text{entity}}$: One-way clustering by entity
- $V_{\text{time}}$: One-way clustering by time
- $V_{\text{intersection}}$: Clustering by entity-time pairs (subtracted because both one-way terms include these within-cell products, which would otherwise be counted twice)

**Intuition**: Add correlations from both dimensions, subtract overlap
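
As a rough illustration of the formula, the sketch below assembles the two-way variance from three one-way cluster estimators in NumPy. It is not PanelBox's implementation: the helper names and the simulated panel are assumptions, and no small-sample corrections are applied. Note that in finite samples this difference of matrices is not guaranteed to be positive semi-definite.

```python
import numpy as np

def group_codes(groups):
    """Map arbitrary labels to integer codes 0..G-1."""
    return np.unique(groups, return_inverse=True)[1].ravel()

def cluster_meat(X, resid, codes):
    """Sum over clusters of the outer product of within-cluster score sums."""
    U = np.zeros((codes.max() + 1, X.shape[1]))
    np.add.at(U, codes, X * resid[:, None])
    return U.T @ U

def twoway_cov(X, resid, entity, time):
    """Return (V_2way, V_entity, V_time, V_intersection)."""
    e, t = group_codes(entity), group_codes(time)
    pair = group_codes(e * (t.max() + 1) + t)         # one code per entity-time cell
    bread = np.linalg.inv(X.T @ X)
    V_e, V_t, V_p = (bread @ cluster_meat(X, resid, c) @ bread for c in (e, t, pair))
    return V_e + V_t - V_p, V_e, V_t, V_p

# Hypothetical panel: 30 firms x 40 months, firm and month components everywhere
rng = np.random.default_rng(7)
firm = np.repeat(np.arange(30), 40)
month = np.tile(np.arange(40), 30)
x = rng.normal(size=30)[firm] + rng.normal(size=40)[month] + rng.normal(size=firm.size)
err = rng.normal(size=30)[firm] + rng.normal(size=40)[month] + rng.normal(size=firm.size)
y = 0.5 * x + err

X = np.column_stack([np.ones_like(x), x])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]

V_two, V_entity, V_time, V_pair = twoway_cov(X, resid, firm, month)
for name, V in [("entity", V_entity), ("time", V_time), ("two-way", V_two)]:
    print(f"{name:8s} slope SE: {np.sqrt(V[1, 1]):.4f}")
```

When each entity-time cell holds a single observation, as here, the intersection term is simply the heteroskedasticity-robust (HC0) variance.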

### 6.3 Implementation in PanelBox

In [None]:
# Two-way clustering: entity AND time
fe = FixedEffects("returns ~ market_ret + size + book_to_market",
                   financial, "firm_id", "month")

res_twoway = fe.fit(cov_type='clustered', 
                     cluster_entity=True, 
                     cluster_time=True)

print("=" * 80)
print("TWO-WAY CLUSTERING RESULTS (Entity + Time)")
print("=" * 80)
print(res_twoway.summary())

In [None]:
# Compare all methods
fe_fin = FixedEffects("returns ~ market_ret + size", financial, "firm_id", "month")

res_robust_fin = fe_fin.fit(cov_type='hc1')
res_entity_fin = fe_fin.fit(cov_type='clustered', cluster_entity=True)
res_time_fin = fe_fin.fit(cov_type='clustered', cluster_time=True)
res_twoway_fin = fe_fin.fit(cov_type='clustered', cluster_entity=True, cluster_time=True)

# Create comprehensive comparison
comparison_all = []
for var in ['market_ret', 'size']:
    if var in res_robust_fin.params.index:
        comparison_all.append({
            'Variable': var,
            'Robust': res_robust_fin.std_errors[var],
            'Cluster_Entity': res_entity_fin.std_errors[var],
            'Cluster_Time': res_time_fin.std_errors[var],
            'TwoWay': res_twoway_fin.std_errors[var]
        })

comp_all_df = pd.DataFrame(comparison_all)
print("\n" + "=" * 80)
print("COMPREHENSIVE COMPARISON: ALL CLUSTERING METHODS")
print("=" * 80)
print(comp_all_df.to_string(index=False))
print()
print("Observation: Two-way SEs are typically largest (most conservative)")

### 6.4 Visualizing Two-Way Clustering Components

In [None]:
# Plot comparison for market_ret variable
var = 'market_ret'
methods = ['Robust\n(HC1)', 'Entity\nCluster', 'Time\nCluster', 'Two-Way\nCluster']
values = [
    res_robust_fin.std_errors[var],
    res_entity_fin.std_errors[var],
    res_time_fin.std_errors[var],
    res_twoway_fin.std_errors[var]
]
colors = ['steelblue', 'orange', 'green', 'red']

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(methods, values, color=colors, alpha=0.7, edgecolor='black', linewidth=2)

ax.set_ylabel('Standard Error', fontsize=12, fontweight='bold')
ax.set_title(f'Comparison of Clustering Methods: {var}', fontsize=13, fontweight='bold')
ax.grid(alpha=0.3, axis='y')

# Add value labels
for bar, val in zip(bars, values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{val:.4f}',
            ha='center', va='bottom', fontweight='bold', fontsize=11)

plt.tight_layout()
plt.savefig(FIG_PATH + 'twoway_clustering_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

### 6.5 When Two-Way is NOT Necessary

**Scenario 1**: Time fixed effects absorb all time-specific shocks
- Time FE control for common shocks
- Entity clustering may suffice

**Scenario 2**: Short panels (T < 5)
- Limited temporal correlation
- Entity clustering often sufficient

**Scenario 3**: Balanced design experiments
- Randomization breaks correlations
- Standard robust SEs acceptable

**Rule**: When in doubt, two-way clustering is the conservative choice, provided both dimensions have enough clusters (see Section 7)

---

<a id='diagnostics'></a>
## 7. Cluster Diagnostics

### 7.1 Critical Issue: Too Few Clusters

**Problem**: With G < 20 clusters, cluster-robust SEs are unreliable  
**Reason**: Asymptotic theory requires G → ∞

**Severity**:
- G ≥ 50: Generally safe
- 20 ≤ G < 50: Acceptable, some caution
- 10 ≤ G < 20: **Warning zone** - SEs may be underestimated
- G < 10: **Critical** - Do not trust cluster-robust SEs

### 7.2 Cluster Diagnostics Function

In [None]:
def cluster_diagnostics(data, entity_col=None, time_col=None):
    """
    Diagnose cluster structure and report warnings.
    
    Parameters:
    -----------
    data : DataFrame
        Panel data
    entity_col : str, optional
        Entity/cluster column
    time_col : str, optional
        Time column
        
    Returns:
    --------
    dict with diagnostic information
    """
    results = {}
    
    if entity_col:
        entity_sizes = data.groupby(entity_col).size()
        results['entity'] = {
            'n_clusters': len(entity_sizes),
            'size_min': entity_sizes.min(),
            'size_mean': entity_sizes.mean(),
            'size_max': entity_sizes.max(),
            'balanced': (entity_sizes.nunique() == 1),
            'sizes': entity_sizes
        }
        
        # Generate warning
        G = results['entity']['n_clusters']
        if G < 10:
            results['entity']['warning'] = "CRITICAL: Too few clusters. Cluster-robust SEs unreliable. Consider bootstrap."
        elif G < 20:
            results['entity']['warning'] = "WARNING: Few clusters. SEs may be biased. Interpret with caution."
        else:
            results['entity']['warning'] = None
    
    if time_col:
        time_sizes = data.groupby(time_col).size()
        results['time'] = {
            'n_clusters': len(time_sizes),
            'size_min': time_sizes.min(),
            'size_mean': time_sizes.mean(),
            'size_max': time_sizes.max(),
            'balanced': (time_sizes.nunique() == 1),
            'sizes': time_sizes
        }
        
        # Generate warning
        G = results['time']['n_clusters']
        if G < 10:
            results['time']['warning'] = "CRITICAL: Too few time clusters. Consider HAC instead."
        elif G < 20:
            results['time']['warning'] = "WARNING: Few time clusters. Consider HAC methods."
        else:
            results['time']['warning'] = None
    
    return results

# Run diagnostics on financial data
diag_fin = cluster_diagnostics(financial, entity_col='firm_id', time_col='month')

print("=" * 70)
print("CLUSTER DIAGNOSTICS: FINANCIAL PANEL")
print("=" * 70)
print("\nEntity (Firm) Clustering:")
print(f"  Number of clusters: {diag_fin['entity']['n_clusters']}")
print(f"  Cluster size (min, mean, max): {diag_fin['entity']['size_min']}, "
      f"{diag_fin['entity']['size_mean']:.1f}, {diag_fin['entity']['size_max']}")
print(f"  Balanced: {diag_fin['entity']['balanced']}")
if diag_fin['entity']['warning']:
    print(f"  ⚠️  {diag_fin['entity']['warning']}")
else:
    print("  ✓ Sufficient clusters for reliable inference")

print("\nTime (Month) Clustering:")
print(f"  Number of clusters: {diag_fin['time']['n_clusters']}")
print(f"  Cluster size (min, mean, max): {diag_fin['time']['size_min']}, "
      f"{diag_fin['time']['size_mean']:.1f}, {diag_fin['time']['size_max']}")
print(f"  Balanced: {diag_fin['time']['balanced']}")
if diag_fin['time']['warning']:
    print(f"  ⚠️  {diag_fin['time']['warning']}")
else:
    print("  ✓ Sufficient time clusters")

### 7.3 Checking Cluster Balance

In [None]:
# Check balance: distribution of cluster sizes
sizes = diag_fin['entity']['sizes']

plt.figure(figsize=(10, 5))
plt.hist(sizes, bins=30, edgecolor='black', alpha=0.7, color='steelblue')
plt.axvline(sizes.mean(), color='red', linestyle='--', linewidth=2.5, 
            label=f'Mean: {sizes.mean():.1f}')
plt.xlabel('Cluster Size (Observations per Firm)', fontsize=12, fontweight='bold')
plt.ylabel('Frequency', fontsize=12, fontweight='bold')
plt.title(f'Distribution of Cluster Sizes (N={len(sizes)} firms)', 
          fontsize=13, fontweight='bold')
plt.legend(fontsize=11)
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.savefig(FIG_PATH + 'cluster_size_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("Interpretation:")
print("  - Balanced clusters: All firms have same number of observations")
print("  - Unbalanced: Some firms have more observations than others")
print("  - Impact: Clustering still valid, but extreme imbalance reduces efficiency")

### 7.4 Solutions for Few Clusters

**If G < 10**:

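1. **Wild Cluster Bootstrap** (Cameron-Miller 2015)
   - Keeps every cluster in place and multiplies each cluster's residuals by a random cluster-level weight (for example ±1), rather than resampling observations
   - More reliable with few clusters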

2. **Aggregate to higher level**
   - Instead of clustering by firm, cluster by industry
   - Trade-off: Less precise clustering, but more clusters

3. **Aggregate over time**
   - Collapse panel to cross-section (average over time)
   - Use robust SEs on averaged data

4. **Fixed effects absorb clustering**
   - Entity FE + robust SEs (if correlation purely mechanical)
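
To make the first option concrete, here is a minimal NumPy sketch of one common variant of the wild cluster bootstrap: residuals from a fit that imposes the null are multiplied by a single random sign per cluster (Rademacher weights), and the bootstrap t-statistics are compared with the observed one. The function names, the simulated data, and the choice of 999 draws are assumptions for illustration; this is not a PanelBox API.

```python
import numpy as np

def cluster_se(X, resid, codes, n_clusters):
    """Cluster-robust standard errors (no small-sample correction)."""
    bread = np.linalg.inv(X.T @ X)
    U = np.zeros((n_clusters, X.shape[1]))
    np.add.at(U, codes, X * resid[:, None])           # score sums per cluster
    return np.sqrt(np.diag(bread @ (U.T @ U) @ bread))

def wild_cluster_bootstrap_p(X, y, groups, coef, n_boot=999, seed=0):
    """Two-sided p-value for H0: beta[coef] = 0 with Rademacher cluster weights."""
    rng = np.random.default_rng(seed)
    codes = np.unique(groups, return_inverse=True)[1].ravel()
    G = codes.max() + 1

    # Unrestricted fit gives the observed t-statistic
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    t_obs = beta[coef] / cluster_se(X, y - X @ beta, codes, G)[coef]

    # Restricted fit imposes the null by dropping the tested column
    X_r = np.delete(X, coef, axis=1)
    fitted_r = X_r @ np.linalg.lstsq(X_r, y, rcond=None)[0]
    resid_r = y - fitted_r

    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        w = rng.choice([-1.0, 1.0], size=G)           # one sign per cluster
        y_star = fitted_r + w[codes] * resid_r        # clusters stay intact
        beta_s = np.linalg.lstsq(X, y_star, rcond=None)[0]
        se_s = cluster_se(X, y_star - X @ beta_s, codes, G)[coef]
        t_boot[b] = beta_s[coef] / se_s
    return float(np.mean(np.abs(t_boot) >= abs(t_obs)))

# Hypothetical data: only 10 clusters of 30 observations, true slope of 2
rng = np.random.default_rng(1)
g = np.repeat(np.arange(10), 30)
x = rng.normal(size=10)[g] + rng.normal(size=g.size)
y = 2.0 * x + rng.normal(size=10)[g] + rng.normal(size=g.size)
X = np.column_stack([np.ones_like(x), x])

p_value = wild_cluster_bootstrap_p(X, y, g, coef=1)
print(f"Wild cluster bootstrap p-value for the slope: {p_value:.3f}")
```

With G clusters there are only 2^G distinct sign patterns, so the resulting p-values are coarse when G is very small.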

---

<a id='cases'></a>
## 8. Case Studies by Discipline

### 8.1 Finance: Asset Returns

**Context**: Testing market risk factors

In [None]:
# Finance example: CAPM-style regression
fe_finance = FixedEffects("returns ~ market_ret + size + book_to_market",
                           financial, "firm_id", "month")

# Two-way clustering (firm + time)
res_finance = fe_finance.fit(cov_type='clustered', 
                              cluster_entity=True, 
                              cluster_time=True)

print("=" * 70)
print("CASE STUDY 1: FINANCE - ASSET PRICING")
print("=" * 70)
print(res_finance.summary())
print()
print("Why Two-Way Clustering?")
print("  - Firm-level: Momentum, firm-specific news persist over time")
print("  - Time-level: Market crashes, interest rate changes affect all stocks")
print()
print("Citation Pattern:")
print('  "Standard errors are clustered by firm and month (two-way)."')

### 8.2 Labor Economics: Wage Regressions

**Context**: Returns to education

In [None]:
# Labor economics: wage equation
fe_wage = FixedEffects("wage ~ education + experience",
                        wage, "person_id", "year")

# Cluster by person
res_wage = fe_wage.fit(cov_type='clustered', cluster_entity=True)

print("=" * 70)
print("CASE STUDY 2: LABOR ECONOMICS - WAGE DETERMINATION")
print("=" * 70)
print(res_wage.summary())
print()
print("Why Cluster by Person?")
print("  - Individual-specific shocks (health, motivation) persist over time")
print("  - Wages are autocorrelated within person")
print()
print("Not Clustered by Time (usually):")
print("  - Year fixed effects control for macro shocks")
print("  - Individuals don't share strong year-specific shocks beyond FE")

### 8.3 Political Economy: Policy Impact

In [None]:
# Political economy: policy effects
fe_policy_econ = FixedEffects("outcome ~ treated + gdp_per_capita",
                               policy, "country_id", "year")

# Cluster by country (typically)
res_policy_entity = fe_policy_econ.fit(cov_type='clustered', cluster_entity=True)
res_policy_time = fe_policy_econ.fit(cov_type='clustered', cluster_time=True)

print("=" * 70)
print("CASE STUDY 3: POLITICAL ECONOMY - POLICY EFFECTS")
print("=" * 70)
print("\nEntity Clustering (by country):")
print(res_policy_entity.summary())
print("\nTime Clustering (by year):")
print(res_policy_time.summary())
print()
print("Decision Depends on Main Source of Correlation:")
print("  - Country-specific persistence → Cluster by country")
print("  - Global shocks (recessions, oil crises) → Cluster by year")
print("  - Both → Two-way clustering")

### 8.4 Summary of Discipline-Specific Practices

| Field | Typical Data | Common Clustering | Rationale |
|-------|--------------|-------------------|----------|
| **Finance** | Stocks × Time | Two-way (firm + time) | Firm persistence + market shocks |
| **Labor** | Individuals × Years | Entity (person) | Within-person correlation dominates |
| **Macro** | Countries × Years | Time or Two-way | Global shocks affect all countries |
| **Health** | Patients × Time | Entity (hospital/clinic) | Shared facilities, staff, protocols |
| **Education** | Students × Time | Two-way (school + cohort) | School effects + cohort effects |

---

<a id='pitfalls'></a>
## 9. Common Pitfalls and How to Avoid Them

### Pitfall 1: Clustering in Wrong Dimension

**Error**: Cluster by time when correlation is primarily within-entity  
**Example**: Wage panel clustered by year instead of person  
**Consequence**: SEs still underestimated  
**Solution**: Think carefully about correlation structure. ACF plots help!

### Pitfall 2: Using Clustered SEs with Too Few Clusters

**Error**: G = 8 states, cluster by state  
**Consequence**: Clustered SEs unreliable (may still be biased)  
**Solution**:
- Use wild cluster bootstrap
- Aggregate to higher level (e.g., regions)

### Pitfall 3: Not Reporting Number of Clusters

**Error**: Table says "clustered SEs" but doesn't report G  
**Problem**: Readers can't assess reliability  
**Solution**: Always report: "Standard errors clustered by firm (50 clusters)"

### Pitfall 4: Mechanical Correlation from Fixed Effects

**Example**: Firm fixed effects + cluster by firm  
**Issue**: FE already account for within-firm correlation (partially)  
**Recommendation**: Still use clustered SEs (conservative), but FE reduce need

### Pitfall 5: Treating Clustering as a "Fix" for Bad Models

**Error**: Model is misspecified, use clustered SEs to "fix" it  
**Reality**: Clustering corrects inference, not bad modeling  
**Solution**: Fix the model first (add controls, check functional form), then apply appropriate SEs

---

<a id='exercises'></a>
## 10. Exercises

### Exercise 1: Diagnose and Choose Clustering (Easy)

**Task**: Determine appropriate clustering for wage data

**Steps**:
1. Estimate: `wage ~ education + experience`
2. Plot ACF of residuals for 3 random individuals
3. Calculate average within-person correlation
4. Estimate with: (a) robust, (b) cluster by person, (c) cluster by year
5. Compare SEs and decide which is appropriate

**Expected Finding**: Strong within-person correlation → cluster by person

In [None]:
# Exercise 1: Your code here

# Step 1: Estimate model
# YOUR CODE

# Step 2: Plot ACF for 3 random persons
# YOUR CODE

# Step 3: Calculate average within-person correlation
# YOUR CODE

# Step 4-5: Compare clustering methods
# YOUR CODE

### Exercise 2: Two-Way Clustering Necessity (Moderate)

**Task**: Determine if two-way clustering changes conclusions

**Dataset**: `financial_panel.csv`

**Steps**:
1. Estimate: `returns ~ market_ret + size + book_to_market`
2. Compare: one-way entity, one-way time, two-way
3. Identify coefficient where significance changes
4. Write interpretation: "Does two-way clustering matter for our conclusions?"

**Bonus**: Test with time fixed effects. Does time FE reduce need for time clustering?

In [None]:
# Exercise 2: Your code here

# Step 1: Estimate model
# YOUR CODE

# Step 2: Compare clustering methods
# YOUR CODE

# Step 3: Identify significance changes
# YOUR CODE

# Step 4: Write interpretation
print("""
INTERPRETATION:
[Your analysis here]
""")

### Exercise 3: Few Clusters Problem (Challenging)

**Task**: Simulate and demonstrate failure of clustering with G=5

**Requirements**:
1. Simulate panel with N=5 entities, T=20 time periods
2. Generate within-entity correlation (ρ=0.5)
3. Estimate with clustered SEs
4. Run Monte Carlo (500 replications) to check coverage of 95% CI
5. Compare with theoretical 95% coverage

**Deliverable**: Short write-up explaining why clustered SEs fail with few clusters

In [None]:
# Exercise 3: Your code here

# Step 1-2: Simulate data with few clusters
# YOUR CODE

# Step 3-5: Monte Carlo simulation
# YOUR CODE

# Write-up
print("""
WRITE-UP: Why Clustered SEs Fail with Few Clusters

[Your explanation here - discuss:
 - Asymptotic vs finite-sample performance
 - Coverage rates observed
 - Recommendations for practice]
""")

---

<a id='summary'></a>
## 11. Summary and Key Takeaways

### What We Learned

1. **Robust SEs are not enough** for panel data due to within-cluster correlation
2. **Cluster by entity** is typical for panels (firms, individuals over time)
3. **Cluster by time** when common shocks dominate
4. **Two-way clustering** handles correlation in both dimensions
5. **G ≥ 20** is a rough rule-of-thumb minimum for reliable cluster-robust SEs; below that, use the few-cluster remedies from Pitfall 2
6. **ACF plots** and diagnostics help choose clustering dimension

### Key Formula

**Cluster-Robust Variance**:
$$
V_{\text{cluster}} = (X'X)^{-1} \left[\sum_{g=1}^{G} u_g u_g'\right] (X'X)^{-1}
$$

Where $u_g = \sum_{t \in g} x_{gt} \epsilon_{gt}$
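
To make the formula concrete, here is a minimal NumPy sketch that builds the sandwich term by term. It applies no small-sample correction, so its values will generally differ slightly from what PanelBox or other packages report; the function name `cluster_robust_cov` and the simulated data are illustrative assumptions.

```python
import numpy as np


def cluster_robust_cov(X, resid, clusters):
    """(X'X)^-1 [sum_g u_g u_g'] (X'X)^-1 with u_g = sum_{t in g} x_gt * e_gt."""
    X = np.asarray(X, dtype=float)
    resid = np.asarray(resid, dtype=float)
    clusters = np.asarray(clusters)
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        mask = clusters == g
        u_g = X[mask].T @ resid[mask]  # score summed within cluster g
        meat += np.outer(u_g, u_g)
    return bread @ meat @ bread


# Simulated data: 40 clusters of 10 observations with a shared cluster shock
rng = np.random.default_rng(1)
n_groups, n_per = 40, 10
clusters = np.repeat(np.arange(n_groups), n_per)
n_obs = n_groups * n_per
X = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
y = (X @ np.array([1.0, 2.0])
     + rng.normal(size=n_groups)[clusters]
     + rng.normal(size=n_obs))

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
V = cluster_robust_cov(X, resid, clusters)
print("clustered SEs:", np.sqrt(np.diag(V)))
```

When every observation is its own cluster, the sum collapses to the heteroskedasticity-robust (HC0) estimator, which is a handy sanity check.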

**Two-Way Clustering** (Cameron-Gelbach-Miller 2011):
$$
V_{\text{2way}} = V_{\text{entity}} + V_{\text{time}} - V_{\text{intersection}}
$$
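
The two-way formula can be sketched as three one-way calculations, where the intersection clusters are the distinct (entity, time) pairs. The function names and simulated panel below are assumptions for illustration, and no finite-sample correction is applied. Because a matrix is subtracted, the result is not guaranteed to be positive semi-definite in finite samples.

```python
import numpy as np


def one_way_cov(X, resid, codes):
    """One-way cluster-robust covariance; codes are integer cluster labels 0..G-1."""
    scores = X * resid[:, None]
    sums = np.zeros((codes.max() + 1, X.shape[1]))
    np.add.at(sums, codes, scores)  # accumulate scores within each cluster
    bread = np.linalg.inv(X.T @ X)
    return bread @ (sums.T @ sums) @ bread


def two_way_cov(X, resid, entity, time):
    """V_entity + V_time - V_intersection."""
    e = np.unique(entity, return_inverse=True)[1]
    t = np.unique(time, return_inverse=True)[1]
    both = np.unique(e * (t.max() + 1) + t, return_inverse=True)[1]
    return (one_way_cov(X, resid, e)
            + one_way_cov(X, resid, t)
            - one_way_cov(X, resid, both))


# Simulated balanced panel: 30 entities, 15 periods
rng = np.random.default_rng(7)
n_entities, n_periods = 30, 15
entity = np.repeat(np.arange(n_entities), n_periods)
time = np.tile(np.arange(n_periods), n_entities)
n_obs = entity.size
X = np.column_stack([np.ones(n_obs), rng.normal(size=n_obs)])
# Errors with an entity component and a common time shock
err = (rng.normal(size=n_entities)[entity]
       + rng.normal(size=n_periods)[time]
       + rng.normal(size=n_obs))
y = X @ np.array([0.5, 1.0]) + err

beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
V_2way = two_way_cov(X, resid, entity, time)
print("two-way SEs:", np.sqrt(np.diag(V_2way)))
```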

### Decision Flowchart

```
Panel Data
    │
    ├─→ Temporal correlation within entities? → YES → Cluster by entity
    │                                           NO ↓
    │
    ├─→ Cross-sectional correlation at same time? → YES → Cluster by time
    │                                               NO ↓
    │
    └─→ Both? → YES → Two-way clustering
                NO → Robust SEs sufficient (rare in panels)
```
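
Read as a rule, the flowchart maps two yes/no diagnostics to a recommendation. The sketch below encodes that mapping, checking the "both" case first so it is not shadowed by the one-way branches; the function and its return labels are hypothetical and unrelated to any PanelBox argument.

```python
def choose_clustering(serial_within_entity: bool,
                      cross_sectional_same_period: bool) -> str:
    """Map the two diagnostics from the flowchart to a clustering choice."""
    if serial_within_entity and cross_sectional_same_period:
        return "two-way"
    if serial_within_entity:
        return "entity"
    if cross_sectional_same_period:
        return "time"
    return "robust"


for serial in (True, False):
    for cross in (True, False):
        print(f"serial={serial!s:<5} cross-sectional={cross!s:<5} "
              f"-> {choose_clustering(serial, cross)}")
```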

### PanelBox Implementation

```python
# Entity clustering (most common)
result = model.fit(cov_type='clustered', cluster_entity=True)

# Time clustering
result = model.fit(cov_type='clustered', cluster_time=True)

# Two-way clustering
result = model.fit(cov_type='clustered', 
                   cluster_entity=True, 
                   cluster_time=True)
```

### Connection to Next Tutorials

➡️ **Tutorial 03: HAC (Newey-West, Driscoll-Kraay)**

**Why?** Clustering allows arbitrary correlation, but requires G → ∞

**Alternative**: HAC methods model correlation structure (require T → ∞)

**Difference**:
- **Clustering**: For micro panels (large N, small T)
- **HAC**: For time series and macro panels (small N, large T)

---


<a id='references'></a>
## 12. References

### Foundational Papers

1. **Cameron, A. C., Gelbach, J. B., & Miller, D. L. (2011)**. "Robust inference with multiway clustering." *Journal of Business & Economic Statistics*, 29(2), 238-249.

2. **Petersen, M. A. (2009)**. "Estimating standard errors in finance panel data sets: Comparing approaches." *Review of Financial Studies*, 22(1), 435-480.

3. **Cameron, A. C., & Miller, D. L. (2015)**. "A practitioner's guide to cluster-robust inference." *Journal of Human Resources*, 50(2), 317-372.

4. **Bertrand, M., Duflo, E., & Mullainathan, S. (2004)**. "How much should we trust differences-in-differences estimates?" *Quarterly Journal of Economics*, 119(1), 249-275.

### Textbooks

1. **Wooldridge, J. M. (2010)**. *Econometric Analysis of Cross Section and Panel Data* (2nd ed.). MIT Press. [Chapter 10]

2. **Cameron, A. C., & Trivedi, P. K. (2005)**. *Microeconometrics: Methods and Applications*. Cambridge University Press. [Chapter 21]

### Online Resources

- [PanelBox Documentation](https://panelbox.readthedocs.io/)
- [Clustered Standard Errors Guide](https://panelbox.readthedocs.io/robust-inference/clustering.html)
- Petersen (2009) replication code and data

### Next Tutorial

➡️ **Tutorial 03**: HAC Standard Errors (Newey-West, Driscoll-Kraay)

---

**End of Tutorial 02**