# Tutorial 01: Robust Standard Errors - Fundamentals

**Author**: PanelBox Development Team
**Date**: 2026-02-16
**Estimated Duration**: 45-60 minutes
**Prerequisites**: Basic econometrics, Python, pandas

---

## Learning Objectives

By the end of this tutorial, you will be able to:

1. Diagnose heteroskedasticity in panel data using residual plots and formal tests
2. Understand the difference between HC0, HC1, HC2, and HC3 robust standard errors
3. Apply robust standard errors to linear panel models (Pooled OLS and Fixed Effects)
4. Interpret the impact of heteroskedasticity on statistical inference
5. Choose appropriate robust standard error corrections for different data structures

---

## Table of Contents

1. [Setup and Data Loading](#setup)
2. [The Heteroskedasticity Problem](#problem)
3. [Robust Standard Error Variants (HC0-HC3)](#variants)
4. [Application to Panel Data](#application)
5. [Comparison and Interpretation](#comparison)
6. [Exercises](#exercises)
7. [References](#references)

---

<a id='setup'></a>
## 1. Setup and Data Loading

We'll start by importing the necessary libraries and loading the Grunfeld dataset.

In [None]:
# Standard imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# PanelBox imports
import panelbox as pb
from panelbox.models.static import PooledOLS, FixedEffects

# Local utilities
import sys
sys.path.append('../utils')
from plotting import plot_residuals, plot_se_comparison, plot_heteroskedasticity_test
from diagnostics import test_heteroskedasticity

# Configuration
np.random.seed(42)
sns.set_style("whitegrid")
plt.rcParams['figure.dpi'] = 100

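# Define paths
DATA_PATH = '../data/'
FIG_PATH = '../outputs/figures/01_robust/'

# Make sure the figure directory exists before any plt.savefig() call
import os
os.makedirs(FIG_PATH, exist_ok=True)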

print("Setup complete!")

### Load Grunfeld Dataset

The Grunfeld dataset contains investment data for 10 firms over 20 years (1935-1954).

In [None]:
# Load data
data = pd.read_csv(DATA_PATH + 'grunfeld.csv')

# Display basic info
print(f"Shape: {data.shape}")
print(f"\nColumns: {list(data.columns)}")
print(f"\nEntities (firms): {data['firm_id'].nunique()}")
print(f"Time periods (years): {data['year'].nunique()}")
print(f"\nFirst few rows:")
data.head()

---

<a id='problem'></a>
## 2. The Heteroskedasticity Problem

### What is Heteroskedasticity?

**Heteroskedasticity** occurs when the variance of the error term is not constant across observations:

$$\text{Var}(u_i | X_i) = \sigma_i^2 \neq \sigma^2$$

In the presence of heteroskedasticity:
- OLS coefficients remain **unbiased** and **consistent**
- Classical standard errors are **biased** → invalid t-tests and confidence intervals
- Efficiency is lost (OLS is no longer BLUE)
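
The three consequences above can be checked with a small Monte Carlo experiment. The sketch below is not part of PanelBox: it uses only NumPy, a synthetic design in which $\text{Var}(u_i | x_i) = x_i^2$, and a hypothetical helper name `simulate_se_bias`. It compares the average classical and HC0 standard errors of the slope with the actual spread of the slope estimates across replications.

```python
import numpy as np

def simulate_se_bias(n=200, reps=1000, seed=0):
    """Compare classical and HC0 standard errors with the true sampling spread of the OLS slope."""
    rng = np.random.default_rng(seed)
    slopes = np.empty(reps)
    se_classical = np.empty(reps)
    se_hc0 = np.empty(reps)
    for r in range(reps):
        x = rng.normal(size=n)
        u = rng.normal(size=n) * np.abs(x)      # Var(u | x) = x^2: heteroskedastic errors
        y = 1.0 + 2.0 * x + u                   # true slope is 2.0 (synthetic assumption)
        X = np.column_stack([np.ones(n), x])
        XtX_inv = np.linalg.inv(X.T @ X)
        beta = XtX_inv @ X.T @ y
        resid = y - X @ beta
        # Classical covariance: s^2 (X'X)^{-1}
        s2 = resid @ resid / (n - X.shape[1])
        se_classical[r] = np.sqrt(s2 * XtX_inv[1, 1])
        # HC0 sandwich: (X'X)^{-1} [sum u_i^2 x_i x_i'] (X'X)^{-1}
        meat = (X * resid[:, None] ** 2).T @ X
        se_hc0[r] = np.sqrt((XtX_inv @ meat @ XtX_inv)[1, 1])
        slopes[r] = beta[1]
    return {
        "mean_slope": slopes.mean(),
        "true_sd": slopes.std(ddof=1),
        "mean_classical_se": se_classical.mean(),
        "mean_hc0_se": se_hc0.mean(),
    }

sim = simulate_se_bias()
for name, value in sim.items():
    print(f"{name}: {value:.4f}")
```

In this design the average slope stays close to the true value, while the classical standard error understates the actual sampling spread and the HC0 standard error tracks it much more closely.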

### Diagnosing Heteroskedasticity

Let's estimate a simple investment model and check for heteroskedasticity.

In [None]:
# Estimate pooled OLS model
model_pooled = PooledOLS(
    data=data,
    dependent='invest',
    regressors=['value', 'capital'],
    entity_var='firm_id',
    time_var='year'
)

result_pooled = model_pooled.fit()
print(result_pooled.summary())

# Extract fitted values and residuals
fitted = result_pooled.fitted_values
residuals = result_pooled.residuals

In [None]:
# Plot residuals vs fitted values
fig = plot_residuals(fitted, residuals,
                     title="Residuals vs Fitted Values - Grunfeld Data")
plt.savefig(FIG_PATH + 'residuals_vs_fitted.png', dpi=300, bbox_inches='tight')
plt.show()

# Interpretation
print("\n📊 Visual Inspection:")
print("Look for:")
print("  - Fan-shaped pattern → increasing variance")
print("  - Funnel pattern → decreasing variance")
print("  - Horizontal band → homoskedasticity (ideal)")

### Formal Tests for Heteroskedasticity

We'll use two standard tests:

1. **White Test**: General test with no specific functional form assumption
2. **Breusch-Pagan Test**: Assumes the error variance is a linear function of the regressors
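
To make the Breusch-Pagan idea concrete, here is a minimal NumPy sketch of Koenker's studentized version of the statistic: regress the squared residuals on a constant and the regressors, then compute $n R^2$. The helper name `breusch_pagan_lm` and the synthetic data are assumptions for illustration; this is not the implementation behind `test_heteroskedasticity`, which may differ in details.

```python
import numpy as np

def breusch_pagan_lm(residuals, X):
    """Return (LM statistic, degrees of freedom) for the studentized Breusch-Pagan test.

    X is an (n, k) array of regressors WITHOUT a constant; a constant is added here.
    """
    u2 = np.asarray(residuals, dtype=float) ** 2
    X = np.asarray(X, dtype=float)
    n = u2.shape[0]
    Z = np.column_stack([np.ones(n), X])
    # Auxiliary regression of squared residuals on [1, X]
    gamma, *_ = np.linalg.lstsq(Z, u2, rcond=None)
    ss_res = np.sum((u2 - Z @ gamma) ** 2)
    ss_tot = np.sum((u2 - u2.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return n * r2, X.shape[1]

# Synthetic example: the error standard deviation grows with x1
rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(1, 5, size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 1.0 * x2 + rng.normal(size=n) * x1

X_demo = np.column_stack([x1, x2])
X_const = np.column_stack([np.ones(n), X_demo])
beta = np.linalg.lstsq(X_const, y, rcond=None)[0]
resid_demo = y - X_const @ beta

lm, df = breusch_pagan_lm(resid_demo, X_demo)
CHI2_95_DF2 = 5.991  # 5% critical value of a chi-squared distribution with 2 df
print(f"LM = {lm:.2f}, df = {df}, reject homoskedasticity at 5%: {lm > CHI2_95_DF2}")
```

Under the null of homoskedasticity the statistic is asymptotically chi-squared with as many degrees of freedom as there are regressors. Adding the squares and cross-products of the regressors to the auxiliary regression gives the White test.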

In [None]:
# Prepare regressor matrix (without constant)
X = data[['value', 'capital']].values

# White test
white_result = test_heteroskedasticity(residuals, X, test_type='white')
print("=" * 60)
print("WHITE TEST FOR HETEROSKEDASTICITY")
print("=" * 60)
print(white_result)
print("\n")

# Breusch-Pagan test
bp_result = test_heteroskedasticity(residuals, X, test_type='breusch_pagan')
print("=" * 60)
print("BREUSCH-PAGAN TEST FOR HETEROSKEDASTICITY")
print("=" * 60)
print(bp_result)

---

<a id='variants'></a>
## 3. Robust Standard Error Variants (HC0-HC3)

When heteroskedasticity is detected, the standard errors need to be corrected. The **robust covariance matrix** (also called the sandwich estimator or the Huber-White estimator) is:

$$\hat{V}_{\text{robust}} = (X'X)^{-1} \left(\sum_{i=1}^n \hat{u}_i^2 x_i x_i'\right) (X'X)^{-1}$$

Several variants exist that differ in how they weight the residuals:

### HC0 (Original White)
$$\hat{V}_{\text{HC0}} = (X'X)^{-1} \left(\sum_{i=1}^n \hat{u}_i^2 x_i x_i'\right) (X'X)^{-1}$$

### HC1 (Degrees of Freedom Correction)
$$\hat{V}_{\text{HC1}} = \frac{n}{n-k} \hat{V}_{\text{HC0}}$$

where $n$ is the number of observations and $k$ is the number of estimated coefficients.

### HC2 (Leverage Correction)
$$\hat{u}_i^{(2)} = \frac{\hat{u}_i}{\sqrt{1 - h_i}}$$

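where $h_i = x_i'(X'X)^{-1}x_i$ is the leverage of observation $i$, that is, the $i$-th diagonal element of the hat matrix. The rescaled residual $\hat{u}_i^{(2)}$ replaces $\hat{u}_i$ in the HC0 formula.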
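
The leverage values can be computed without forming the full $n \times n$ hat matrix. The sketch below is a NumPy-only illustration with a hypothetical helper name `leverage`; it assumes $X$ has full column rank and uses the reduced QR decomposition, since $h_i$ equals the squared norm of the $i$-th row of $Q$.

```python
import numpy as np

def leverage(X):
    """Diagonal of the hat matrix X (X'X)^{-1} X', computed from a reduced QR decomposition."""
    Q, _ = np.linalg.qr(np.asarray(X, dtype=float))  # Q has shape (n, k)
    return np.sum(Q ** 2, axis=1)

# Synthetic design matrix with a constant and two regressors
rng = np.random.default_rng(3)
n = 50
X_lev = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
X_lev[0, 1] = 8.0  # give the first observation an unusual regressor value

h = leverage(X_lev)
print(f"sum of leverages: {h.sum():.3f} (equals k = {X_lev.shape[1]})")
print(f"highest leverage: observation {h.argmax()} with h = {h.max():.3f}")
```

Each $h_i$ lies between 0 and 1 and the values sum to $k$, so an observation with $h_i$ far above $k/n$ has a large influence on the fit and receives a large HC2 or HC3 adjustment.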

### HC3 (Davidson-MacKinnon)
$$\hat{u}_i^{(3)} = \frac{\hat{u}_i}{1 - h_i}$$

**Recommendation**: HC3 is generally preferred in small samples because it has better finite-sample properties (see Long & Ervin, 2000, in the references).
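
The four formulas above translate almost line by line into NumPy. The function below is a minimal sketch, not PanelBox's internal code: the name `hc_covariance` and the synthetic data are assumptions, and `X` is expected to already contain a constant column.

```python
import numpy as np

def hc_covariance(X, resid, kind="HC3"):
    """Heteroskedasticity-consistent covariance matrix of the OLS coefficients."""
    X = np.asarray(X, dtype=float)
    resid = np.asarray(resid, dtype=float)
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    # Leverage h_i = x_i' (X'X)^{-1} x_i for every observation
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    if kind in ("HC0", "HC1"):
        w = resid ** 2
    elif kind == "HC2":
        w = resid ** 2 / (1.0 - h)
    elif kind == "HC3":
        w = resid ** 2 / (1.0 - h) ** 2
    else:
        raise ValueError(f"unknown kind: {kind}")
    meat = (X * w[:, None]).T @ X            # sum_i w_i x_i x_i'
    V = XtX_inv @ meat @ XtX_inv
    if kind == "HC1":
        V = V * n / (n - k)                  # degrees-of-freedom correction
    return V

# Small synthetic sample where the error spread grows with x
rng = np.random.default_rng(1)
n = 30
x = rng.uniform(1, 5, size=n)
X_hc = np.column_stack([np.ones(n), x])
y = 1.0 + 0.5 * x + rng.normal(size=n) * x
beta = np.linalg.solve(X_hc.T @ X_hc, X_hc.T @ y)
resid_hc = y - X_hc @ beta

se = {kind: np.sqrt(np.diag(hc_covariance(X_hc, resid_hc, kind)))
      for kind in ("HC0", "HC1", "HC2", "HC3")}
for kind, values in se.items():
    print(kind, np.round(values, 4))
```

Because $0 \le h_i < 1$, the HC2 weights are never smaller than the HC0 weights and the HC3 weights are never smaller than the HC2 weights, so the standard errors are ordered HC0 ≤ HC2 ≤ HC3 for every coefficient.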

---

<a id='application'></a>
## 4. Application to Panel Data

Let's estimate the model with different robust SE variants and compare results.

In [None]:
# Re-estimate with different robust SE methods
se_types = ['classical', 'HC0', 'HC1', 'HC2', 'HC3']
results_dict = {}

for se_type in se_types:
    if se_type == 'classical':
        cov_type = 'nonrobust'
    else:
        cov_type = se_type.lower()

    result = model_pooled.fit(cov_type=cov_type)
    results_dict[se_type] = result

print("✓ Estimated models with all SE variants")

In [None]:
# Create comparison table
comparison_data = []

for var in ['value', 'capital']:
    for se_type in se_types:
        res = results_dict[se_type]
        comparison_data.append({
            'Variable': var,
            'SE Type': se_type,
            'Coefficient': res.params[var],
            'Std Error': res.std_errors[var],
            't-statistic': res.t_stats[var],
            'p-value': res.p_values[var]
        })

comparison_df = pd.DataFrame(comparison_data)
print("\n" + "=" * 80)
print("COMPARISON OF STANDARD ERROR METHODS")
print("=" * 80)
print(comparison_df.to_string(index=False))

In [None]:
# Plot SE comparison for 'value' variable
estimates = {se: results_dict[se].params['value'] for se in se_types}
std_errors = {se: results_dict[se].std_errors['value'] for se in se_types}

fig = plot_se_comparison(
    coef_name='value',
    estimates=estimates,
    std_errors=std_errors,
    methods=se_types,
    title='Comparison of Standard Error Methods: Value Coefficient'
)
plt.savefig(FIG_PATH + 'se_comparison_value.png', dpi=300, bbox_inches='tight')
plt.show()

### Application to Fixed Effects Model

Now let's see how robust SEs work with entity fixed effects.
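
As background, the fixed effects (within) estimator is OLS on data from which each entity's time average has been subtracted. The sketch below illustrates that transformation on a synthetic panel with pandas and NumPy; the helper `within_transform` and the column names are hypothetical, and whether PanelBox computes its robust covariance on exactly these demeaned quantities (and with which degrees-of-freedom adjustment) is not confirmed here.

```python
import numpy as np
import pandas as pd

def within_transform(df, entity, cols):
    """Subtract each entity's time average from the listed columns."""
    return df[cols] - df.groupby(entity)[cols].transform("mean")

# Synthetic panel: firm effects are correlated with the regressor x
rng = np.random.default_rng(2)
n_firms, n_years = 5, 8
firm = np.repeat(np.arange(n_firms), n_years)
alpha = rng.normal(scale=3.0, size=n_firms)[firm]
x = alpha + rng.normal(size=firm.size)
y = alpha + 1.5 * x + rng.normal(size=firm.size)
panel = pd.DataFrame({"firm": firm, "x": x, "y": y})

# Within estimator: OLS on demeaned data (no constant needed after demeaning)
demeaned = within_transform(panel, "firm", ["x", "y"])
xd = demeaned["x"].to_numpy()
yd = demeaned["y"].to_numpy()
beta_within = (xd @ yd) / (xd @ xd)

# Equivalent least-squares dummy-variable (LSDV) regression
D = (firm[:, None] == np.arange(n_firms)).astype(float)
beta_lsdv = np.linalg.lstsq(np.column_stack([D, x]), y, rcond=None)[0][-1]

# Pooled OLS slope for comparison (ignores the firm effects)
beta_pooled = np.polyfit(x, y, 1)[0]

print(f"within: {beta_within:.4f}, LSDV: {beta_lsdv:.4f}, pooled: {beta_pooled:.4f}")
```

The within and LSDV slopes coincide, which is why the sandwich formula can be applied to the demeaned regressors and the within residuals.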

In [None]:
# Estimate fixed effects model
model_fe = FixedEffects(
    data=data,
    dependent='invest',
    regressors=['value', 'capital'],
    entity_var='firm_id',
    time_var='year'
)

# Compare classical vs robust SEs
result_fe_classical = model_fe.fit(cov_type='nonrobust')
result_fe_robust = model_fe.fit(cov_type='hc3')

print("=" * 80)
print("FIXED EFFECTS: CLASSICAL vs ROBUST SE")
print("=" * 80)
print("\nClassical SE:")
print(result_fe_classical.summary())
print("\nRobust SE (HC3):")
print(result_fe_robust.summary())

---

<a id='comparison'></a>
## 5. Comparison and Interpretation

### Key Insights

1. **Magnitude of Correction**: How much do robust SEs differ from classical SEs?
2. **Inference Impact**: Do conclusions change when using robust SEs?
3. **Choice of Variant**: How much do HC0-HC3 differ in practice?
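
The second question can be made mechanical: recompute the t-statistic and p-value with each standard error and check whether the significance decision flips. The sketch below uses only the standard library and a normal approximation for the p-value; the function names and the numbers passed in are hypothetical and are not Grunfeld results.

```python
import math

def two_sided_p_value(coef, se):
    """Two-sided p-value for H0: coefficient = 0, using the normal approximation."""
    t_stat = coef / se
    return math.erfc(abs(t_stat) / math.sqrt(2.0))

def inference_changes(coef, se_classical, se_robust, alpha=0.05):
    """Return True if significance at level alpha differs between the two standard errors."""
    significant_classical = two_sided_p_value(coef, se_classical) < alpha
    significant_robust = two_sided_p_value(coef, se_robust) < alpha
    return significant_classical != significant_robust

# Hypothetical values chosen for illustration only
coef, se_classical, se_robust = 0.10, 0.04, 0.06
p_classical = two_sided_p_value(coef, se_classical)
p_robust = two_sided_p_value(coef, se_robust)
print(f"classical: t = {coef / se_classical:.2f}, p = {p_classical:.4f}")
print(f"robust:    t = {coef / se_robust:.2f}, p = {p_robust:.4f}")
print("conclusion changes at 5%:", inference_changes(coef, se_classical, se_robust))
```

The coefficient itself is identical in both cases; only the denominator of the t-statistic changes, so a robust/classical SE ratio above 1 always moves the p-value upward.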

In [None]:
# Calculate SE ratios (robust/classical)
print("=" * 60)
print("STANDARD ERROR RATIOS (Robust / Classical)")
print("=" * 60)

for var in ['value', 'capital']:
    classical_se = results_dict['classical'].std_errors[var]
    print(f"\nVariable: {var}")
    print(f"  Classical SE: {classical_se:.6f}")

    for se_type in ['HC0', 'HC1', 'HC2', 'HC3']:
        robust_se = results_dict[se_type].std_errors[var]
        ratio = robust_se / classical_se
        print(f"  {se_type} SE: {robust_se:.6f} (ratio: {ratio:.3f})")

### Summary of Findings

**When to Use Robust Standard Errors:**

✅ **Always use** when:
- Heteroskedasticity is suspected or detected
- Sample size is large (n > 50)
- You want inference that is robust to misspecification of the error variance

⚠️ **Be cautious** when:
- Sample size is very small (n < 30)
- Model is severely misspecified
- High leverage points are present

**Which Variant?**

- **HC0**: Original White estimator; can underestimate SEs in small samples
- **HC1**: Simple degrees-of-freedom correction; better than HC0 in small samples
- **HC2**: Leverage adjustment; good theoretical properties
- **HC3**: **Recommended default**; best small-sample performance of the four

**PanelBox Default**: HC3 (when `cov_type='robust'`)

---

<a id='exercises'></a>
## 6. Exercises

### Exercise 1: Different Dataset

Load the `wage_panel.csv` dataset and:
1. Estimate a wage equation: `wage ~ education + experience`
2. Test for heteroskedasticity
3. Compare classical vs robust (HC3) standard errors
4. Interpret the results

### Exercise 2: Simulation Study

Generate synthetic data with known heteroskedasticity:
1. Use `generate_heteroskedastic_data()` from utils
2. Estimate model with classical SEs
3. Estimate model with robust SEs
4. Verify that robust SEs provide correct inference

### Exercise 3: Model Comparison

Using the Grunfeld data:
1. Estimate Random Effects model
2. Compare robust SEs across Pooled OLS, FE, and RE
3. Discuss which model is most affected by heteroskedasticity

---

<a id='references'></a>
## 7. References

### Key Papers

1. **White, H. (1980)**. "A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity". *Econometrica*, 48(4), 817-838.

2. **MacKinnon, J. G., & White, H. (1985)**. "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties". *Journal of Econometrics*, 29(3), 305-325.

3. **Long, J. S., & Ervin, L. H. (2000)**. "Using Heteroscedasticity Consistent Standard Errors in the Linear Regression Model". *The American Statistician*, 54(3), 217-224.

### Software Documentation

- [PanelBox Documentation](https://panelbox.readthedocs.io/)
- [Robust Covariance Guide](https://panelbox.readthedocs.io/robust-inference.html)

### Next Tutorial

➡️ **Tutorial 02**: Clustered Standard Errors for Panel Data

---

**End of Tutorial 01**