# Fixed Effects: Controlling for Unobserved Heterogeneity

**Tutorial 02 - Static Panel Models Fundamentals**

**Level**: Intermediate  
**Duration**: 60-75 minutes  
**Date**: 2026-02-16

---

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Understand** the problem of omitted variable bias from unobserved heterogeneity (α_i)
2. **Explain** the within transformation (demeaning) and how it eliminates α_i
3. **Estimate** Fixed Effects models using PanelBox
4. **Distinguish** between LSDV and demeaning approaches
5. **Implement** two-way fixed effects (entity + time)
6. **Interpret** within vs between variation and coefficients
7. **Conduct** F-tests for FE vs Pooled OLS
8. **Access** and interpret estimated fixed effects (α̂_i)
9. **Visualize** within vs between relationships

---

## Prerequisites

### Conceptual
- Completed Notebook 01 (Pooled OLS Introduction)
- Understanding of omitted variable bias
- Familiarity with variance decomposition

### Technical
- Comfortable with pandas groupby operations
- Understanding of OLS mechanics
- Basic matrix algebra

---

## Setup

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import panelbox as pb
from scipy import stats
from sklearn.linear_model import LinearRegression
import warnings
warnings.filterwarnings('ignore')

# Configure visualization
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)
pd.set_option('display.precision', 4)
pd.set_option('display.max_columns', None)

print(f"PanelBox version: {pb.__version__}")
print("Setup complete!")

---

# Section 1: The Problem of Omitted Variable Bias

## 1.1 Motivating Example: Returns to Education

Consider the following research question:

> **What is the causal effect of education on wages?**

A naive model might be:

$$\text{wage}_{it} = \beta_1 \cdot \text{education}_{it} + \varepsilon_{it}$$

### The Problem

**Innate ability** (α_i) is unobserved but correlates with education:
- Smarter people tend to get more education
- Smarter people also earn higher wages (even with same education)
- β̂₁ confounds **true education effect** + **ability bias**

Let's demonstrate this with a simulation:

In [None]:
# Simulate biased Pooled OLS
np.random.seed(42)
N, T = 100, 5

# Generate data with unobserved ability
ability = np.random.normal(0, 2, N)  # α_i: unobserved ability
education = 12 + 0.5 * ability + np.random.normal(0, 1, N)  # Education correlates with ability

# Create panel
data_sim = []
for i in range(N):
    for t in range(T):
        # True model: wage = 2*education + ability + noise
        wage = 2 * education[i] + ability[i] + np.random.normal(0, 0.5)
        data_sim.append({
            'person': i, 
            'year': t, 
            'wage': wage, 
            'education': education[i]
        })

df_sim = pd.DataFrame(data_sim)

# Pooled OLS (biased)
pooled = pb.PooledOLS("wage ~ education", df_sim, 'person', 'year')
res_pooled = pooled.fit(cov_type='clustered')

print("="*70)
print("DEMONSTRATION: Omitted Variable Bias")
print("="*70)
print(f"True education effect:     2.000")
print(f"Pooled OLS estimate:       {res_pooled.params['education']:.3f}")
print(f"Bias:                      {res_pooled.params['education'] - 2:.3f}")
print(f"Relative bias:             {(res_pooled.params['education'] - 2)/2 * 100:.1f}%")
print("="*70)
print(f"\nPooled OLS overestimates the education effect by ~{(res_pooled.params['education'] - 2)/2 * 100:.0f}%")
print("Reason: Ability (α_i) is omitted and correlates with education")

### Key Insight

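The **bias formula** for the Pooled OLS slope is:

$$\text{Bias} = \gamma \cdot \frac{\text{Cov}(\text{education}, \text{ability})}{\text{Var}(\text{education})}$$

where γ is the effect of ability on wages (γ = 1 in our simulation).

In our simulation, Cov(education, ability) = 0.5 × Var(ability) = 0.5 × 4 = 2 and Var(education) = 0.5² × 4 + 1 = 2, so Bias ≈ 1.0 and Pooled OLS converges to roughly 3 rather than the true value of 2.

**This bias persists even as sample size → ∞!**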
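The bias formula can be checked numerically. The sketch below is a minimal, self-contained check that assumes the same data-generating process as the simulation above, with a coefficient of 1 on ability. The sample size `n`, the seed, and the variable names are illustrative choices rather than part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000  # large sample so estimates sit close to their population values

ability = rng.normal(0, 2, n)                          # sd = 2, so Var(ability) = 4
education = 12 + 0.5 * ability + rng.normal(0, 1, n)   # correlated with ability
wage = 2 * education + ability + rng.normal(0, 0.5, n)

# OLS slope of wage on education when ability is omitted
cov_mat = np.cov(education, wage)
slope_ols = cov_mat[0, 1] / cov_mat[0, 0]

# Omitted-variable bias: gamma * Cov(education, ability) / Var(education), with gamma = 1
bias_formula = np.cov(education, ability)[0, 1] / np.var(education, ddof=1)

print(f"OLS slope:          {slope_ols:.3f}")
print(f"True effect + bias: {2 + bias_formula:.3f}")
```

The two printed numbers should agree closely even in a very large sample, which is the sense in which the bias does not vanish as the sample grows.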

## 1.2 General Framework: Unobserved Heterogeneity

The **true model** for many panel applications is:

$$y_{it} = X_{it}\beta + \alpha_i + \varepsilon_{it}$$

Where:
- **α_i**: Entity-specific effect (fixed over time)
  - Examples: Managerial quality, innate ability, geography, brand value
- **ε_it**: Idiosyncratic error (varies over time and entity)

### The Problem with Pooled OLS

Pooled OLS treats the model as:

$$y_{it} = X_{it}\beta + (\alpha_i + \varepsilon_{it})$$

It **lumps α_i into the composite error**. If α_i is correlated with X_it:

$$\mathbb{E}[X_{it} \cdot (\alpha_i + \varepsilon_{it})] \neq 0$$

This violates the **exogeneity assumption** → β̂^OLS is **biased and inconsistent**!

### Solution

We need to **eliminate or control for α_i**. This is what **Fixed Effects** does.

## 1.3 Graphical Intuition: Between vs Within Variation

Let's visualize the difference between:
- **Between-firm** variation (confounded by α_i)
- **Within-firm** variation (α_i eliminated)

In [None]:
# Load Grunfeld data
data = pb.load_grunfeld()

print("Grunfeld Investment Dataset")
print(f"Firms: {data['firm'].nunique()}, Years: {data['year'].nunique()}")
print(f"Total observations: {len(data)}")
print("\nFirst few rows:")
print(data.head())

In [None]:
# Compute firm means (proxies for α_i)
firm_means = data.groupby('firm')[['invest', 'value']].mean()
firm_means['firm_id'] = firm_means.index

# Compute demeaned data
data_demeaned = data.copy()
for col in ['invest', 'value']:
    data_demeaned[col + '_dm'] = data.groupby('firm')[col].transform(lambda x: x - x.mean())

# Create visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Left: Between-firm scatter (includes α_i)
axes[0].scatter(firm_means['value'], firm_means['invest'], s=120, alpha=0.7, color='steelblue', edgecolor='black')
# Label points with the index (firm id); iterrows() casts row values to float
for firm_id, row in firm_means.iterrows():
    axes[0].annotate(f"F{firm_id}", (row['value'], row['invest']), 
                     fontsize=9, ha='center', va='bottom', fontweight='bold')
axes[0].set_xlabel('Average Value (x̄_i)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Average Investment (ȳ_i)', fontsize=12, fontweight='bold')
axes[0].set_title('Between-Firm Variation\n(Confounded by α_i)', fontsize=13, fontweight='bold')
axes[0].grid(alpha=0.3)

# Right: Within-firm scatter (α_i eliminated)
colors = plt.cm.tab10(range(data['firm'].nunique()))
for i, firm in enumerate(data['firm'].unique()):
    firm_data = data_demeaned[data_demeaned['firm'] == firm]
    axes[1].scatter(firm_data['value_dm'], firm_data['invest_dm'], 
                   alpha=0.6, s=50, color=colors[i], label=f'Firm {firm}')

axes[1].axhline(0, color='red', linestyle='--', linewidth=1.5, alpha=0.7)
axes[1].axvline(0, color='red', linestyle='--', linewidth=1.5, alpha=0.7)
axes[1].set_xlabel('Value - Firm Mean (x_it - x̄_i)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Investment - Firm Mean (y_it - ȳ_i)', fontsize=12, fontweight='bold')
axes[1].set_title('Within-Firm Variation\n(α_i Removed)', fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3)
axes[1].legend(bbox_to_anchor=(1.05, 1), loc='upper left', fontsize=8)

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("• LEFT: Between-firm relationship may be confounded by firm characteristics (α_i)")
print("• RIGHT: Within-firm relationship isolates effect of value changes holding α_i constant")
print("• Fixed Effects uses ONLY the within variation (right panel)")

---

# Section 2: The Within Transformation (Demeaning)

## 2.1 Derivation of the Within Estimator

Here's how Fixed Effects eliminates α_i:

### Step 1: Start with the true model
$$y_{it} = X_{it}\beta + \alpha_i + \varepsilon_{it}$$

### Step 2: Take entity means (average over t)
$$\bar{y}_i = \bar{X}_i\beta + \alpha_i + \bar{\varepsilon}_i$$

Where: $\bar{y}_i = \frac{1}{T}\sum_{t=1}^T y_{it}$

### Step 3: Subtract entity mean from each observation
$$(y_{it} - \bar{y}_i) = (X_{it} - \bar{X}_i)\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$$

**α_i cancels!** Because: $\alpha_i - \alpha_i = 0$

### Step 4: Estimate β via OLS on demeaned data
$$\hat{\beta}^{FE} = \left[(X-\bar{X})'(X-\bar{X})\right]^{-1} (X-\bar{X})'(y-\bar{y})$$

### Key Result
**Fixed Effects eliminates α_i without ever estimating it!**
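The four steps above can be sketched on synthetic data where the regressor varies over time and is correlated with α_i. This is a minimal illustration under assumed parameters (`n_entities`, `n_periods`, `beta_true`, and the error scales are hypothetical choices), not the notebook's dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_entities, n_periods, beta_true = 500, 6, 1.5

entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(0, 1, n_entities)[entity]     # alpha_i, repeated for every t
x = alpha + rng.normal(0, 1, entity.size)        # x_it is correlated with alpha_i
y = beta_true * x + alpha + rng.normal(0, 0.5, entity.size)
df = pd.DataFrame({"entity": entity, "x": x, "y": y})

# Pooled OLS slope (with intercept): Cov(x, y) / Var(x)
beta_pooled = np.cov(df["x"], df["y"])[0, 1] / df["x"].var()

# Steps 2-3: subtract entity means; Step 4: OLS on demeaned data, no intercept
x_dm = df["x"] - df.groupby("entity")["x"].transform("mean")
y_dm = df["y"] - df.groupby("entity")["y"].transform("mean")
beta_within = (x_dm * y_dm).sum() / (x_dm ** 2).sum()

print(f"True beta:   {beta_true:.3f}")
print(f"Pooled OLS:  {beta_pooled:.3f}")
print(f"Within (FE): {beta_within:.3f}")
```

In this setup the pooled slope absorbs part of α_i, while the within slope should land near `beta_true` because α_i drops out of the demeaned equation.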

## 2.2 Manual Implementation of Demeaning

In [None]:
# Manual within transformation
def demean_manually(data, entity_col, vars_to_demean):
    """
    Manually demean variables by entity.
    
    Parameters
    ----------
    data : pd.DataFrame
        Panel data
    entity_col : str
        Entity identifier column
    vars_to_demean : list
        Variables to demean
        
    Returns
    -------
    pd.DataFrame
        Data with demeaned variables (suffix '_dm')
    """
    data_dm = data.copy()
    for var in vars_to_demean:
        # Compute entity means
        entity_means = data.groupby(entity_col)[var].transform('mean')
        # Subtract from original
        data_dm[var + '_dm'] = data[var] - entity_means
    return data_dm

# Apply demeaning
data_manual = demean_manually(data, 'firm', ['invest', 'value', 'capital'])

# Verify demeaning worked (firm means should be ~0)
print("Verification: Firm means of demeaned variables (should be ≈ 0)")
print(data_manual.groupby('firm')[['invest_dm', 'value_dm', 'capital_dm']].mean().round(10))

In [None]:
# Estimate FE manually using sklearn on demeaned data
X_dm = data_manual[['value_dm', 'capital_dm']].values
y_dm = data_manual['invest_dm'].values

# No intercept (already demeaned)
lr = LinearRegression(fit_intercept=False)
lr.fit(X_dm, y_dm)

print("="*70)
print("Manual Fixed Effects (via OLS on demeaned data)")
print("="*70)
print(f"value coefficient:   {lr.coef_[0]:.6f}")
print(f"capital coefficient: {lr.coef_[1]:.6f}")
print("="*70)

## 2.3 PanelBox Implementation

Now let's estimate the same model using PanelBox's `FixedEffects` class:

In [None]:
# Estimate Fixed Effects using PanelBox
fe_model = pb.FixedEffects(
    formula="invest ~ value + capital",
    data=data,
    entity_col='firm',
    time_col='year'
)

fe_results = fe_model.fit(cov_type='clustered')
print(fe_results.summary())

In [None]:
# Compare manual vs PanelBox
print("="*70)
print("COMPARISON: Manual Demeaning vs PanelBox")
print("="*70)
comparison_df = pd.DataFrame({
    'Manual (sklearn)': [lr.coef_[0], lr.coef_[1]],
    'PanelBox': [fe_results.params['value'], fe_results.params['capital']],
    'Difference': [lr.coef_[0] - fe_results.params['value'], 
                   lr.coef_[1] - fe_results.params['capital']]
}, index=['value', 'capital'])

print(comparison_df)
print("="*70)
print("✓ Coefficients match up to floating-point precision!")
print("\nBoth methods apply the within transformation and estimate β on demeaned data.")

## 2.4 What Happens to the Intercept?

**Key Point**: Demeaning removes the common intercept!

Instead, Fixed Effects estimates **N separate intercepts** (one per entity):
- α̂₁, α̂₂, ..., α̂ₙ

These are the **entity fixed effects** and can be recovered after estimation.

### Degrees of Freedom

| Model | df_resid |
|-------|----------|
| **Pooled OLS** | NT - k - 1 |
| **Fixed Effects** | NT - N - k |

FE loses **N-1 additional degrees of freedom** (relative to common intercept).

In [None]:
# Compare degrees of freedom
pooled_model = pb.PooledOLS("invest ~ value + capital", data, 'firm', 'year')
pooled_results = pooled_model.fit(cov_type='nonrobust')

N = data['firm'].nunique()
T = data['year'].nunique()
k = 2  # Number of X variables (excluding intercept)

print("Degrees of Freedom Comparison:")
print(f"N (firms) = {N}, T (years) = {T}, k (covariates) = {k}")
print(f"Total observations (NT) = {N*T}")
print()
print(f"Pooled OLS:     df_resid = NT - k - 1 = {N*T} - {k} - 1 = {pooled_results.df_resid}")
print(f"Fixed Effects:  df_resid = NT - N - k = {N*T} - {N} - {k} = {fe_results.df_resid}")
print()
print(f"FE loses {N-1} degrees of freedom relative to Pooled OLS")

---

# Section 3: LSDV vs Demeaning

## 3.1 Least Squares Dummy Variables (LSDV)

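An **alternative approach** to Fixed Effects is to include a common intercept plus N-1 entity dummies:

$$y_{it} = \beta_0 + \sum_{j=2}^N \delta_j D_{ij} + X_{it}\beta + \varepsilon_{it}$$

Where:
- $D_{ij} = 1$ if i=j, 0 otherwise
- Omit the first dummy to avoid perfect multicollinearity with the intercept
- $\beta_0 = \alpha_1$ is the baseline entity's intercept, and $\delta_j = \alpha_j - \alpha_1$ is entity j's effect relative to that baseline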

### LSDV vs Demeaning

| Aspect | LSDV | Demeaning |
|--------|------|------------|
| **Computational cost** | High (N-1 extra parameters) | Low |
| **β̂ estimates** | Identical | Identical |
| **Access to α̂_i** | Direct (dummy coefficients) | Post-estimation |
| **Memory usage** | High | Low |
| **Numerical stability** | Worse | Better |

**Recommendation**: Use demeaning (PanelBox default) for estimation efficiency.
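The "post-estimation" route to α̂_i in the table can be sketched as follows: after estimating β on demeaned data, each intercept is recovered as α̂_i = ȳ_i − x̄_i β̂. The example uses synthetic data and plain NumPy; all names and parameter values are illustrative assumptions. For the comparison it uses one dummy per entity and no common intercept, which is an equivalent parameterization of the dummy-variable regression.

```python
import numpy as np

rng = np.random.default_rng(2)
n_entities, n_periods = 8, 12
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(0, 3, n_entities)
x = alpha[entity] + rng.normal(0, 1, entity.size)
y = 0.7 * x + alpha[entity] + rng.normal(0, 0.5, entity.size)

# Demeaning route: slope from within-transformed data
x_bar = np.bincount(entity, weights=x) / n_periods
y_bar = np.bincount(entity, weights=y) / n_periods
x_dm, y_dm = x - x_bar[entity], y - y_bar[entity]
beta_within = (x_dm @ y_dm) / (x_dm @ x_dm)
alpha_within = y_bar - beta_within * x_bar       # intercepts recovered afterwards

# LSDV route: one dummy per entity, no common intercept
dummies = (entity[:, None] == np.arange(n_entities)).astype(float)
design = np.column_stack([x, dummies])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
beta_lsdv, alpha_lsdv = coef[0], coef[1:]

print("Slopes agree:    ", np.isclose(beta_within, beta_lsdv))
print("Intercepts agree:", np.allclose(alpha_within, alpha_lsdv))
```

The design matrix in the LSDV route grows with the number of entities, which is where the computational and memory costs listed in the table come from.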

In [None]:
# LSDV: Add firm dummies
import statsmodels.formula.api as smf

data_lsdv = data.copy()
# dtype=float keeps the dummies numeric (recent pandas versions default to bool)
firm_dummies = pd.get_dummies(data_lsdv['firm'], prefix='firm', drop_first=True, dtype=float)
data_lsdv = pd.concat([data_lsdv, firm_dummies], axis=1)

# Construct formula with dummies
dummy_vars = ' + '.join([col for col in data_lsdv.columns if col.startswith('firm_')])
formula_lsdv = f"invest ~ value + capital + {dummy_vars}"

# Estimate LSDV
lsdv_results = smf.ols(formula_lsdv, data=data_lsdv).fit(
    cov_type='cluster', 
    cov_kwds={'groups': data_lsdv['firm']}
)

print("="*70)
print("LSDV Estimation Results")
print("="*70)
print("\nCoefficients on X variables:")
print(lsdv_results.params[['value', 'capital']])
print("\nFirm dummy coefficients (α̂_i relative to the omitted first firm):")
dummy_cols = [col for col in lsdv_results.params.index if col.startswith('firm_')]
print(lsdv_results.params[dummy_cols])

In [None]:
# Compare LSDV vs FE (demeaning)
print("="*70)
print("COMPARISON: LSDV vs Fixed Effects (Demeaning)")
print("="*70)

comparison_lsdv_fe = pd.DataFrame({
    'LSDV': [lsdv_results.params['value'], lsdv_results.params['capital']],
    'FE (Demeaning)': [fe_results.params['value'], fe_results.params['capital']],
    'Difference': [
        lsdv_results.params['value'] - fe_results.params['value'],
        lsdv_results.params['capital'] - fe_results.params['capital']
    ]
}, index=['value', 'capital'])

print(comparison_lsdv_fe)
print("="*70)
print("✓ β̂^LSDV = β̂^FE exactly!")
print("\nBoth methods yield identical slope coefficients.")
print("Difference: LSDV directly estimates dummy coefficients, FE recovers them post-estimation.")

---

# Section 4: Two-Way Fixed Effects

## 4.1 Adding Time Fixed Effects

**Motivation**: Control for common time shocks affecting all entities

Examples:
- Macroeconomic trends (GDP growth, interest rates)
- Policy changes (tax reform, regulation)
- Seasonal effects
- Global shocks (pandemic, financial crisis)

### Two-Way FE Model

$$y_{it} = \alpha_i + \gamma_t + X_{it}\beta + \varepsilon_{it}$$

Where:
- **α_i**: Entity fixed effects (as before)
- **γ_t**: Time fixed effects (new!)

### Transformation

For a balanced panel, subtract the entity mean and the time mean from each variable, then add back the grand mean:

$$\ddot{y}_{it} = y_{it} - \bar{y}_i - \bar{y}_t + \bar{y}_{..}$$

Where:
- $\bar{y}_t$ = mean across entities in period t
- $\bar{y}_{..}$ = grand mean
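The two-way transformation can be sketched on a balanced synthetic panel and cross-checked against a regression with entity and period dummies. This is a minimal illustration; the helper `two_way_demean` and all parameter values are hypothetical, and the simple double-demeaning shortcut is only exact when every entity is observed in every period.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n_entities, n_periods, beta_true = 30, 10, 0.8
entity = np.repeat(np.arange(n_entities), n_periods)
period = np.tile(np.arange(n_periods), n_entities)
alpha = rng.normal(0, 2, n_entities)[entity]     # entity effects
gamma = rng.normal(0, 2, n_periods)[period]      # time effects
x = alpha + gamma + rng.normal(0, 1, entity.size)
y = beta_true * x + alpha + gamma + rng.normal(0, 0.5, entity.size)
df = pd.DataFrame({"entity": entity, "period": period, "x": x, "y": y})

def two_way_demean(frame, col):
    """Subtract entity and period means, add back the grand mean (balanced panel)."""
    return (frame[col]
            - frame.groupby("entity")[col].transform("mean")
            - frame.groupby("period")[col].transform("mean")
            + frame[col].mean())

x_dd, y_dd = two_way_demean(df, "x"), two_way_demean(df, "y")
beta_twoway = (x_dd * y_dd).sum() / (x_dd ** 2).sum()

# Cross-check: intercept + x + entity dummies + period dummies
dummies = pd.get_dummies(df[["entity", "period"]].astype(str), drop_first=True, dtype=float)
design = np.column_stack([np.ones(len(df)), df["x"].to_numpy(), dummies.to_numpy()])
beta_dummies = np.linalg.lstsq(design, df["y"].to_numpy(), rcond=None)[0][1]

print(f"Double demeaning: {beta_twoway:.4f}")
print(f"Dummy regression: {beta_dummies:.4f}")
```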

## 4.2 Estimating Two-Way Fixed Effects

In [None]:
# Two-way Fixed Effects
fe_twoway = pb.FixedEffects(
    formula="invest ~ value + capital",
    data=data,
    entity_col='firm',
    time_col='year',
    entity_effects=True,   # Default
    time_effects=True      # Add time FE
)

fe_twoway_results = fe_twoway.fit(cov_type='clustered')
print(fe_twoway_results.summary())

In [None]:
# Compare one-way vs two-way FE
print("="*70)
print("COMPARISON: One-Way FE vs Two-Way FE")
print("="*70)

comparison_fe = pd.DataFrame({
    'One-Way FE': fe_results.params,
    'Two-Way FE': fe_twoway_results.params,
    'Difference': fe_results.params - fe_twoway_results.params,
    'Pct Change': (fe_twoway_results.params / fe_results.params - 1) * 100
})

print(comparison_fe)
print("="*70)
print("\nInterpretation:")
print("• Coefficients differ when time trends affect both X and y")
print("• Two-way FE is more conservative (controls for more confounders)")
print("• Use two-way FE when common time shocks are present")

## 4.3 Visualizing Entity and Time Fixed Effects

In [None]:
# Extract fixed effects
time_fe = fe_twoway.time_fe      # pd.Series indexed by year
entity_fe = fe_twoway.entity_fe  # pd.Series indexed by firm

print("Time Fixed Effects (γ̂_t):")
print(time_fe)
print("\nEntity Fixed Effects (α̂_i):")
print(entity_fe)

In [None]:
# Visualization
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Time FE (γ̂_t)
axes[0].plot(time_fe.index, time_fe.values, marker='o', linewidth=2.5, 
             markersize=8, color='steelblue', label='Time FE')
axes[0].axhline(0, color='red', linestyle='--', linewidth=1.5, alpha=0.7, label='Zero line')
axes[0].fill_between(time_fe.index, time_fe.values, 0, alpha=0.3, color='steelblue')
axes[0].set_xlabel('Year', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Time Fixed Effect (γ̂_t)', fontsize=12, fontweight='bold')
axes[0].set_title('Estimated Time Fixed Effects\n(Common Time Shocks)', fontsize=13, fontweight='bold')
axes[0].grid(alpha=0.3)
axes[0].legend()

# Entity FE (α̂_i)
bars = axes[1].bar(entity_fe.index, entity_fe.values, alpha=0.7, 
                   color=['steelblue' if x > 0 else 'coral' for x in entity_fe.values],
                   edgecolor='black', linewidth=1.2)
axes[1].axhline(0, color='red', linestyle='--', linewidth=1.5, alpha=0.7)
axes[1].set_xlabel('Firm', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Entity Fixed Effect (α̂_i)', fontsize=12, fontweight='bold')
axes[1].set_title('Estimated Entity Fixed Effects\n(Firm-Specific Intercepts)', fontsize=13, fontweight='bold')
axes[1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("• LEFT (γ̂_t): Common shocks in each year (e.g., post-war investment boom in 1940s)")
print("• RIGHT (α̂_i): Firm-specific intercepts (e.g., some firms consistently invest more)")

---

# Section 5: Interpretation - Within vs Between

## 5.1 What Does β̂^FE Mean?

**Critical distinction**:

### Within Interpretation (what FE estimates)
> "When firm i increases X by 1 unit, Y changes by β̂ units **holding α_i constant**"

### Between Interpretation (what FE does NOT estimate)
> "Firms with 1 unit higher X have β̂ higher Y on average"

### Example: Education and Wages

If β̂_education^FE = 0.08 in a log-wage regression:

✅ **Correct**: "When an individual gains 1 more year of education, their wage increases by about 8% (holding time-invariant traits such as innate ability fixed)"

❌ **Incorrect**: "People with 1 more year of education earn 8% more on average"
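The gap between the two interpretations can be made concrete with a synthetic panel in which the within and between slopes even have opposite signs. This is a deliberately extreme sketch under assumed parameters (all names and values are illustrative), meant only to show that the two estimators answer different questions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n_entities, n_periods = 200, 8
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(0, 1, n_entities)[entity]
x = 2 * alpha + rng.normal(0, 1, entity.size)
# True within effect of x is -1, but alpha_i raises both x and y
y = -1.0 * x + 5 * alpha + rng.normal(0, 0.5, entity.size)
df = pd.DataFrame({"entity": entity, "x": x, "y": y})

def slope(u, v):
    """OLS slope of v on u (with an intercept)."""
    u, v = np.asarray(u), np.asarray(v)
    return np.cov(u, v)[0, 1] / np.var(u, ddof=1)

# Between: regress entity means on entity means
means = df.groupby("entity")[["x", "y"]].mean()
beta_between = slope(means["x"], means["y"])

# Within: regress deviations from entity means on each other
within = df[["x", "y"]] - df.groupby("entity")[["x", "y"]].transform("mean")
beta_within = slope(within["x"], within["y"])

print(f"Between slope: {beta_between:.3f}")
print(f"Within slope:  {beta_within:.3f}")
```

Here entities with higher average x also have higher average y, yet raising x within an entity lowers y, so reading β̂^FE as a cross-sectional comparison would get the sign wrong.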

## 5.2 Decomposition of R-squared

PanelBox reports three R² measures:

1. **R²_within**: Fit of demeaned model (what FE maximizes)
   - *How well does the model explain within-entity variation?*

2. **R²_between**: Fit of entity means ($\bar{y}_i$ on $\bar{X}_i$)
   - *How well does the model explain cross-sectional differences?*

3. **R²_overall**: Fit of original data (not demeaned)
   - *Overall goodness of fit*
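The three measures can be sketched by hand using one common convention, in which each R² is the squared correlation between actual and fitted values on the relevant scale. Whether PanelBox uses exactly these definitions is an assumption not confirmed here; the data, names, and parameters below are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n_entities, n_periods = 50, 10
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(0, 2, n_entities)[entity]
x = 0.5 * alpha + rng.normal(0, 1, entity.size)
y = 1.0 * x + alpha + rng.normal(0, 1, entity.size)
df = pd.DataFrame({"entity": entity, "x": x, "y": y})

g = df.groupby("entity")
x_dm = df["x"] - g["x"].transform("mean")
y_dm = df["y"] - g["y"].transform("mean")
beta_fe = (x_dm * y_dm).sum() / (x_dm ** 2).sum()   # within estimator

def r2(actual, fitted):
    """Squared correlation between actual and fitted values."""
    return np.corrcoef(actual, fitted)[0, 1] ** 2

# With several regressors, the fitted values would be X @ beta_fe on each scale
r2_within = r2(y_dm, beta_fe * x_dm)                      # demeaned data
r2_between = r2(g["y"].mean(), beta_fe * g["x"].mean())   # entity means
r2_overall = r2(df["y"], beta_fe * df["x"])               # original data

print(f"Within: {r2_within:.3f}  Between: {r2_between:.3f}  Overall: {r2_overall:.3f}")
```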

In [None]:
print("="*70)
print("R-squared Decomposition")
print("="*70)
print(f"R² Within:   {fe_results.rsquared_within:.4f}  (FE maximizes this)")
print(f"R² Between:  {fe_results.rsquared_between:.4f}  (Cross-sectional fit)")
print(f"R² Overall:  {fe_results.rsquared_overall:.4f}  (Overall fit)")
print("="*70)

# Visualization
r2_values = [
    fe_results.rsquared_within, 
    fe_results.rsquared_between, 
    fe_results.rsquared_overall
]
r2_labels = ['Within', 'Between', 'Overall']

plt.figure(figsize=(10, 6))
bars = plt.bar(r2_labels, r2_values, alpha=0.8, 
               color=['steelblue', 'coral', 'lightgreen'],
               edgecolor='black', linewidth=1.5)

# Add value labels
for bar, val in zip(bars, r2_values):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height,
             f'{val:.4f}', ha='center', va='bottom', fontsize=12, fontweight='bold')

plt.ylabel('R-squared', fontsize=12, fontweight='bold')
plt.title('R-squared Decomposition: Within vs Between vs Overall', fontsize=13, fontweight='bold')
plt.ylim(0, 1)
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nInterpretation:")
if fe_results.rsquared_within > fe_results.rsquared_between:
    print("• Model explains time variation BETTER than cross-sectional differences")
else:
    print("• Model explains cross-sectional differences BETTER than time variation")

## 5.3 Variance Decomposition Exercise

Let's decompose the total variance into within and between components:

In [None]:
# Decompose variance of 'value' (ddof=0 throughout, so within + between = total)
firm_mean_value = data.groupby('firm')['value'].transform('mean')
var_total = data['value'].var(ddof=0)
var_within = (data['value'] - firm_mean_value).var(ddof=0)
var_between = firm_mean_value.var(ddof=0)  # firm means, weighted by observations per firm

print("="*70)
print("Variance Decomposition of 'value'")
print("="*70)
print(f"Total Variance:   {var_total:.2f}")
print(f"Within Variance:  {var_within:.2f}  ({var_within/var_total*100:.1f}%)")
print(f"Between Variance: {var_between:.2f}  ({var_between/var_total*100:.1f}%)")
print("="*70)

# Visualization
plt.figure(figsize=(10, 6))
plt.bar(['Within', 'Between'], [var_within, var_between], 
        alpha=0.8, color=['steelblue', 'coral'], 
        edgecolor='black', linewidth=1.5)

# Add value labels
plt.text(0, var_within, f'{var_within:.2f}\n({var_within/var_total*100:.1f}%)', 
         ha='center', va='bottom', fontsize=11, fontweight='bold')
plt.text(1, var_between, f'{var_between:.2f}\n({var_between/var_total*100:.1f}%)', 
         ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.ylabel('Variance', fontsize=12, fontweight='bold')
plt.title('Variance Decomposition: Within vs Between', fontsize=13, fontweight='bold')
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nDiscussion:")
if var_within > var_between:
    print("• Strong time variation → FE has plenty of identifying variation")
else:
    print("• Strong cross-sectional variation → FE discards informative between variation")

---

# Section 6: Testing Fixed Effects vs Pooled OLS

## 6.1 F-Test for Joint Significance of Fixed Effects

**Question**: Are entity fixed effects necessary, or is Pooled OLS sufficient?

### Hypotheses

- **H₀**: α₁ = α₂ = ... = αₙ (all entity effects equal → use Pooled OLS)
- **H₁**: At least one α_i differs (use Fixed Effects)

### Test Statistic

$$F = \frac{(\text{SSR}_{\text{pooled}} - \text{SSR}_{\text{FE}}) / (N-1)}{\text{SSR}_{\text{FE}} / (NT-N-k)} \sim F(N-1, NT-N-k)$$

### Decision Rule

- **p < 0.05**: Reject H₀ → Use Fixed Effects
- **p ≥ 0.05**: Fail to reject H₀ → Pooled OLS sufficient

## 6.2 Conducting the F-Test

In [None]:
# PanelBox computes F-test automatically
print("="*70)
print("F-Test: Fixed Effects vs Pooled OLS")
print("="*70)
print(f"F-statistic: {fe_results.f_statistic:.2f}")
print(f"P-value:     {fe_results.f_pvalue:.6f}")
print(f"df:          ({fe_results.f_df_num}, {fe_results.f_df_denom})")
print("="*70)

if fe_results.f_pvalue < 0.05:
    print("\n✓ Reject H₀ at 5% level → Use Fixed Effects")
    print("  Conclusion: Entity fixed effects are jointly significant.")
else:
    print("\n✗ Fail to reject H₀ → Pooled OLS sufficient")
    print("  Conclusion: No evidence of entity-specific effects.")

## 6.3 Manual F-Test Calculation

Let's verify PanelBox's F-test by computing it manually:

In [None]:
# Get sum of squared residuals
SSR_pooled = np.sum(pooled_results.resid ** 2)
SSR_fe = np.sum(fe_results.resid ** 2)

# Degrees of freedom
N = data['firm'].nunique()
T_mean = data.groupby('firm').size().mean()  # Average T per entity
k = 2  # Number of X variables (excluding intercept/FE)

df_num = N - 1
df_denom = len(data) - N - k

# Compute F-statistic
F_manual = ((SSR_pooled - SSR_fe) / df_num) / (SSR_fe / df_denom)
# Survival function avoids rounding to 0 for very small p-values (vs 1 - cdf)
p_value_manual = stats.f.sf(F_manual, df_num, df_denom)

print("="*70)
print("Manual F-Test Calculation")
print("="*70)
print(f"SSR_pooled = {SSR_pooled:.2f}")
print(f"SSR_FE     = {SSR_fe:.2f}")
print(f"df_num     = {df_num}")
print(f"df_denom   = {df_denom}")
print()
print(f"F-statistic: {F_manual:.2f}")
print(f"P-value:     {p_value_manual:.6f}")
print("="*70)
print(f"\n✓ Matches PanelBox: {np.isclose(F_manual, fe_results.f_statistic)}")

---

# Section 7: Accessing and Interpreting Fixed Effects

## 7.1 Extracting Entity Fixed Effects (α̂_i)

In [None]:
# Access entity fixed effects
entity_fe = fe_model.entity_fe

print("="*70)
print("Estimated Entity Fixed Effects (α̂_i)")
print("="*70)
print(entity_fe)
print("="*70)

# Summary statistics
print(f"\nMean:   {entity_fe.mean():.4f}  (normalized to ~0)")
print(f"Std:    {entity_fe.std():.4f}")
print(f"Min:    {entity_fe.min():.4f}  (Firm {entity_fe.idxmin()})")
print(f"Max:    {entity_fe.max():.4f}  (Firm {entity_fe.idxmax()})")
print(f"Range:  {entity_fe.max() - entity_fe.min():.4f}")

## 7.2 Interpreting Fixed Effects

### What does α̂_i represent?

- **α̂_i > 0**: Firm i invests **more** than average (controlling for value, capital)
  - Possible reasons: Better management, growth opportunities, access to credit

- **α̂_i < 0**: Firm i invests **less** than average
  - Possible reasons: Mature firm, risk aversion, capital constraints

### Important Caveat

**The level of α̂_i depends on the normalization** (e.g., effects centered on zero, or measured relative to an omitted baseline entity)

Only **differences** matter:
- α̂₅ - α̂₃ = 20 means Firm 5 invests 20 units more than Firm 3 (controlling for X)

In [None]:
# Visualization: Distribution of entity fixed effects
plt.figure(figsize=(12, 6))
bars = plt.bar(entity_fe.index, entity_fe.values, alpha=0.8, 
               color=['steelblue' if x > 0 else 'coral' for x in entity_fe.values],
               edgecolor='black', linewidth=1.2)

plt.axhline(0, color='red', linestyle='--', linewidth=2, label='Zero line', alpha=0.7)
plt.xlabel('Firm', fontsize=12, fontweight='bold')
plt.ylabel('Fixed Effect (α̂_i)', fontsize=12, fontweight='bold')
plt.title('Estimated Entity Fixed Effects\n(Firm-Specific Investment Propensity)', 
          fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\nInterpretation:")
print(f"• Firm {entity_fe.idxmax()} has highest investment propensity (α̂ = {entity_fe.max():.2f})")
print(f"• Firm {entity_fe.idxmin()} has lowest investment propensity (α̂ = {entity_fe.min():.2f})")
print(f"• Difference: Firm {entity_fe.idxmax()} invests {entity_fe.max() - entity_fe.min():.2f} more than Firm {entity_fe.idxmin()}")

---

# Section 8: Practical Exercises

## Exercise 8.1: Fixed Effects on Wage Data

**Research Question**: What is the effect of experience and tenure on log wages, controlling for individual heterogeneity?

### Tasks

1. Load wage panel data
2. Estimate **Pooled OLS**: `log_wage ~ experience + tenure`
3. Estimate **Fixed Effects** with same formula
4. Compare coefficients: Does FE differ from Pooled?
5. Conduct **F-test**: Are fixed effects necessary?
6. Plot distribution of individual fixed effects (α̂_i)
7. **Interpret**: What does α̂_i represent in this context?

In [None]:
# TODO: Complete this exercise

# 1. Load wage panel data
wage_data = pb.load_wage_panel()

print("Wage Panel Dataset")
print(wage_data.head())
print(f"\nIndividuals: {wage_data['person_id'].nunique()}")
print(f"Years: {wage_data['year'].nunique()}")
print(f"Total observations: {len(wage_data)}")

# 2. Pooled OLS
# YOUR CODE HERE

# 3. Fixed Effects
# YOUR CODE HERE

# 4. Compare coefficients
# YOUR CODE HERE

# 5. F-test
# YOUR CODE HERE

# 6. Plot FE distribution
# YOUR CODE HERE

# 7. Interpretation
# YOUR CODE HERE

### Solution (Run after attempting)

In [None]:
# SOLUTION: Exercise 8.1

# 1. Data already loaded above

# 2. Pooled OLS
pooled_wage = pb.PooledOLS("log_wage ~ experience + tenure", wage_data, 'person_id', 'year')
res_pooled_wage = pooled_wage.fit(cov_type='clustered')

print("="*70)
print("Pooled OLS Results")
print("="*70)
print(res_pooled_wage.summary())

# 3. Fixed Effects
fe_wage = pb.FixedEffects("log_wage ~ experience + tenure", wage_data, 'person_id', 'year')
res_fe_wage = fe_wage.fit(cov_type='clustered')

print("\n" + "="*70)
print("Fixed Effects Results")
print("="*70)
print(res_fe_wage.summary())

# 4. Compare
print("\n" + "="*70)
print("Comparison: Pooled OLS vs Fixed Effects")
print("="*70)
comparison_wage = pd.DataFrame({
    'Pooled OLS': res_pooled_wage.params,
    'Fixed Effects': res_fe_wage.params,
    'Difference': res_pooled_wage.params - res_fe_wage.params
})
print(comparison_wage)

# 5. F-test
print("\n" + "="*70)
print("F-Test for Fixed Effects")
print("="*70)
print(f"F-statistic: {res_fe_wage.f_statistic:.2f}")
print(f"P-value:     {res_fe_wage.f_pvalue:.6f}")
if res_fe_wage.f_pvalue < 0.05:
    print("\n✓ Use Fixed Effects (entity effects are significant)")

# 6. Plot FE distribution
individual_fe = fe_wage.entity_fe

plt.figure(figsize=(12, 6))
plt.hist(individual_fe, bins=30, alpha=0.7, color='steelblue', edgecolor='black', linewidth=1.2)
plt.axvline(individual_fe.mean(), color='red', linestyle='--', linewidth=2, label=f'Mean = {individual_fe.mean():.4f}')
plt.xlabel('Individual Fixed Effect (α̂_i)', fontsize=12, fontweight='bold')
plt.ylabel('Frequency', fontsize=12, fontweight='bold')
plt.title('Distribution of Individual Wage Fixed Effects', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

# 7. Interpretation
print("\n" + "="*70)
print("Interpretation")
print("="*70)
print("α̂_i represents individual-specific wage component:")
print("  • Innate ability")
print("  • Motivation / work ethic")
print("  • Unobserved skills")
print("  • Family background")
print("\nFixed Effects controls for these time-invariant factors when estimating")
print("the effect of experience and tenure on wages.")

## Exercise 8.2: Two-Way Fixed Effects with Time Trends

**Research Question**: How do time shocks affect investment in Grunfeld data?

### Tasks

1. Estimate **two-way FE** on Grunfeld data
2. Compare **one-way vs two-way** coefficients
3. Plot **time fixed effects** (γ̂_t)
4. **Interpret**: What do time FE capture?

In [None]:
# TODO: Complete this exercise

# 1. Two-way FE (already done above, but re-estimate for clarity)
# YOUR CODE HERE

# 2. Compare one-way vs two-way
# YOUR CODE HERE

# 3. Plot time FE
# YOUR CODE HERE

# 4. Interpretation
# YOUR CODE HERE

### Solution (Run after attempting)

In [None]:
# SOLUTION: Exercise 8.2

# Already estimated above, but let's extract results

# 2. Comparison
print("="*70)
print("Comparison: One-Way vs Two-Way Fixed Effects")
print("="*70)
comparison_twoway = pd.DataFrame({
    'One-Way FE': fe_results.params,
    'Two-Way FE': fe_twoway_results.params,
    'Difference': fe_results.params - fe_twoway_results.params,
    '% Change': ((fe_twoway_results.params / fe_results.params - 1) * 100).round(2)
})
print(comparison_twoway)

# 3. Plot time FE
time_fe_plot = fe_twoway.time_fe

plt.figure(figsize=(12, 6))
plt.plot(time_fe_plot.index, time_fe_plot.values, marker='o', linewidth=2.5, 
         markersize=10, color='steelblue', label='Time FE (γ̂_t)')
plt.axhline(0, color='red', linestyle='--', linewidth=2, alpha=0.7, label='Zero line')
plt.fill_between(time_fe_plot.index, time_fe_plot.values, 0, alpha=0.3, color='steelblue')

# Annotate key periods
plt.axvspan(1941, 1945, alpha=0.2, color='orange', label='WWII (1941-1945)')

plt.xlabel('Year', fontsize=12, fontweight='bold')
plt.ylabel('Time Fixed Effect (γ̂_t)', fontsize=12, fontweight='bold')
plt.title('Time Fixed Effects: Capturing Common Macro Shocks', fontsize=13, fontweight='bold')
plt.legend()
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

# 4. Interpretation
print("\n" + "="*70)
print("Interpretation of Time Fixed Effects")
print("="*70)
print("Time fixed effects (γ̂_t) capture common shocks affecting all firms:")
print("\n• Late 1930s: aftermath of the Great Depression (time effects may be negative)")
print("• 1940s: WWII and post-war boom (time effects may be positive)")
print("• General trend: Secular changes in investment climate")
print("\nTwo-way FE removes these macro trends to isolate firm-specific responses.")

---

# Section 9: Summary and Key Takeaways

## What We Learned

### 1. The Problem
- **Unobserved heterogeneity** (α_i) causes omitted variable bias in Pooled OLS
- If α_i correlates with X_it, β̂^OLS is biased and inconsistent

### 2. The Solution: Fixed Effects
- **Within transformation** (demeaning) eliminates α_i
- Transformation: $(y_{it} - \bar{y}_i) = (X_{it} - \bar{X}_i)\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$
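The effect of demeaning can be checked on simulated data. The sketch below is illustrative only: the panel, the variable names, and the true coefficient are assumptions made for the example, not part of the tutorial's dataset. It builds an α_i that is correlated with x, then compares the pooled slope with the slope from the demeaned data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_entities, n_periods, beta_true = 50, 6, 2.0

# Simulated panel (illustrative): alpha_i is correlated with x by construction
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(size=n_entities)
x = alpha[entity] + rng.normal(size=entity.size)
y = alpha[entity] + beta_true * x + rng.normal(scale=0.5, size=entity.size)
df = pd.DataFrame({"entity": entity, "x": x, "y": y})

# Within transformation: subtract each entity's own mean
x_dm = df["x"] - df.groupby("entity")["x"].transform("mean")
y_dm = df["y"] - df.groupby("entity")["y"].transform("mean")

# Pooled OLS slope (contaminated by alpha_i) vs. within slope
x_c = df["x"] - df["x"].mean()
beta_pooled = (x_c * (df["y"] - df["y"].mean())).sum() / (x_c ** 2).sum()
beta_within = (x_dm * y_dm).sum() / (x_dm ** 2).sum()

print(f"True beta:   {beta_true:.3f}")
print(f"Pooled OLS:  {beta_pooled:.3f}")
print(f"Within (FE): {beta_within:.3f}")
```

Because α_i is constant within an entity, it drops out of the demeaned variables, so the within slope should land near the true value while the pooled slope is pulled upward.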

### 3. Implementation Methods
- **Demeaning**: Efficient, numerically stable (PanelBox default)
- **LSDV**: Include N-1 entity dummies (equivalent but costly)
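The equivalence of the two approaches can be verified numerically. The following is a minimal NumPy sketch on simulated data (all names and values are assumptions for illustration); it does not use PanelBox, so it says nothing about how PanelBox implements either method internally.

```python
import numpy as np

rng = np.random.default_rng(1)
n_entities, n_periods = 8, 5
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(size=n_entities)
x = alpha[entity] + rng.normal(size=entity.size)
y = alpha[entity] + 1.5 * x + rng.normal(scale=0.3, size=entity.size)

# LSDV: intercept + x + (N-1) entity dummies (entity 0 is the baseline)
dummies = (entity[:, None] == np.arange(1, n_entities)[None, :]).astype(float)
X_lsdv = np.column_stack([np.ones_like(x), x, dummies])
beta_lsdv = np.linalg.lstsq(X_lsdv, y, rcond=None)[0][1]

# Demeaning: subtract entity means, then regress without an intercept
counts = np.bincount(entity)
x_dm = x - (np.bincount(entity, weights=x) / counts)[entity]
y_dm = y - (np.bincount(entity, weights=y) / counts)[entity]
beta_within = (x_dm @ y_dm) / (x_dm @ x_dm)

print(f"LSDV slope:   {beta_lsdv:.6f}")
print(f"Within slope: {beta_within:.6f}")
print(f"LSDV columns: {X_lsdv.shape[1]} vs. demeaned columns: 1")
```

The slopes agree to numerical precision, but the LSDV design matrix grows by one column per entity, which is where the cost comes from when N is large.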

### 4. Two-Way Fixed Effects
- Add **time fixed effects** (γ_t) to control for common time shocks
- Use when macro trends affect all entities
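For a balanced panel, two-way fixed effects can be sketched as a double demeaning: subtract the entity mean and the period mean, then add back the grand mean. The example below uses simulated data with a common time trend that affects both x and y (the data and names are assumptions for illustration, and the simple formula does not carry over to unbalanced panels).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n_entities, n_periods, beta_true = 40, 8, 1.0
entity = np.repeat(np.arange(n_entities), n_periods)
period = np.tile(np.arange(n_periods), n_entities)
alpha = rng.normal(size=n_entities)
gamma = np.linspace(-3.0, 3.0, n_periods)  # common time trend (gamma_t)

# Both x and y load on the entity effect and on the time effect
x = alpha[entity] + gamma[period] + rng.normal(size=entity.size)
y = (alpha[entity] + gamma[period] + beta_true * x
     + rng.normal(scale=0.5, size=entity.size))
df = pd.DataFrame({"entity": entity, "period": period, "x": x, "y": y})

def one_way(col):
    """Entity demeaning only."""
    return df[col] - df.groupby("entity")[col].transform("mean")

def two_way(col):
    """Entity and period demeaning (valid for balanced panels)."""
    return (df[col]
            - df.groupby("entity")[col].transform("mean")
            - df.groupby("period")[col].transform("mean")
            + df[col].mean())

def slope(xv, yv):
    return float((xv * yv).sum() / (xv ** 2).sum())

beta_one_way = slope(one_way("x"), one_way("y"))
beta_two_way = slope(two_way("x"), two_way("y"))
print(f"One-way FE: {beta_one_way:.3f}")
print(f"Two-way FE: {beta_two_way:.3f}  (true beta = {beta_true})")
```

In this setup the one-way estimate absorbs part of the common trend, while the two-way estimate should be close to the true coefficient.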

### 5. Interpretation
- β̂^FE is a **within effect** (NOT between)
- "When entity i changes X by 1 unit, Y changes by β̂"

### 6. Testing
- **F-test** of H₀: all α_i are equal (Pooled OLS adequate) against the FE alternative
- H₀ is rejected in most applied panels → Fixed Effects is usually preferred over Pooled OLS
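The F-statistic compares the residual sum of squares of the restricted model (Pooled OLS) with that of the unrestricted model (one intercept per entity): F = [(SSR_pooled − SSR_FE)/(N − 1)] / [SSR_FE/(NT − N − K)]. A hand-rolled sketch on simulated data follows; the data and helper names are assumptions for illustration, and PanelBox's own test output may be organized differently.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_entities, n_periods, k = 20, 6, 1
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(scale=2.0, size=n_entities)
x = rng.normal(size=entity.size)
y = alpha[entity] + 0.8 * x + rng.normal(size=entity.size)

def ssr(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(resid @ resid)

# Restricted model: pooled OLS with a single intercept
ssr_pooled = ssr(np.column_stack([np.ones_like(x), x]), y)

# Unrestricted model: one intercept per entity (LSDV form of FE)
dummies = (entity[:, None] == np.arange(n_entities)[None, :]).astype(float)
ssr_fe = ssr(np.column_stack([x, dummies]), y)

df_num = n_entities - 1                   # number of restrictions
df_den = entity.size - n_entities - k     # residual df of the FE model
f_stat = ((ssr_pooled - ssr_fe) / df_num) / (ssr_fe / df_den)
p_value = stats.f.sf(f_stat, df_num, df_den)

print(f"F({df_num}, {df_den}) = {f_stat:.2f}, p-value = {p_value:.4g}")
```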

### 7. R-squared Decomposition
- **R²_within**: Fit of demeaned model (FE maximizes this)
- **R²_between**: Cross-sectional fit
- **R²_overall**: Overall fit
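One common convention defines the three measures as squared correlations between actual and fitted values on demeaned data, entity means, and raw data respectively. The sketch below follows that convention on simulated data; it is an assumption for illustration, and software packages (possibly including PanelBox) may use a different definition.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n_entities, n_periods = 30, 5
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(size=n_entities)
x = alpha[entity] + rng.normal(size=entity.size)
y = alpha[entity] + 1.0 * x + rng.normal(scale=0.5, size=entity.size)
df = pd.DataFrame({"entity": entity, "x": x, "y": y})

# FE slope from demeaned data
means = df.groupby("entity")[["x", "y"]].transform("mean")
x_dm, y_dm = df["x"] - means["x"], df["y"] - means["y"]
beta_fe = (x_dm * y_dm).sum() / (x_dm ** 2).sum()

def r2(actual, fitted):
    """Squared correlation between actual and fitted values."""
    return float(np.corrcoef(actual, fitted)[0, 1] ** 2)

entity_means = df.groupby("entity")[["x", "y"]].mean()
r2_within = r2(y_dm, beta_fe * x_dm)                              # demeaned data
r2_between = r2(entity_means["y"], beta_fe * entity_means["x"])   # entity means
r2_overall = r2(df["y"], beta_fe * df["x"])                       # raw data

print(f"R2 within:  {r2_within:.3f}")
print(f"R2 between: {r2_between:.3f}")
print(f"R2 overall: {r2_overall:.3f}")
```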

### 8. Limitations (Preview for Next Notebook)
- ❌ Cannot estimate **time-invariant variables** (race, gender, geography)
- ❌ Inefficient if α_i ⊥ X_it (**Random Effects** better)
- ❌ Loses **N degrees of freedom**

In [None]:
print("="*70)
print("KEY TAKEAWAYS - FIXED EFFECTS")
print("="*70)
print("1. FE eliminates α_i via demeaning (within transformation)")
print("2. Use FE when α_i is correlated with X_it (unobserved heterogeneity)")
print("3. β̂^FE = within effect (not between)")
print("4. F-test: FE vs Pooled (usually reject → use FE)")
print("5. Two-way FE: controls entity + time effects")
print("6. Limitation: Cannot estimate time-invariant variables")
print("\n⏭ Next Notebook: Random Effects")
print("  • When is RE more efficient than FE?")
print("  • Hausman test for model selection")
print("="*70)

---

## Troubleshooting Tips

### Common Issues and Solutions

#### 1. Singular Matrix Error

**Error**: `LinAlgError: Singular matrix`

**Causes**:
- Perfect multicollinearity (e.g., including time-invariant variables with entity FE)
- Entity with only 1 observation (no within variation)
- Constant variables within all entities

**Solutions**:
```python
# Check for time-invariant variables (skip the entity identifier itself)
for col in data.select_dtypes(include=[np.number]).columns.drop('firm', errors='ignore'):
    # No within variation if every firm has at most one distinct value
    if data.groupby('firm')[col].nunique().max() <= 1:
        print(f"Warning: {col} has no within variation")

# Check entity observation counts
obs_counts = data.groupby('firm').size()
print(f"Entities with T=1: {(obs_counts == 1).sum()}")

# Remove entities with T=1
data_filtered = data[data.groupby('firm')['firm'].transform('size') > 1]
```

---

#### 2. Low R²_within but High R²_between

**Symptom**: R²_within ≈ 0.05, R²_between ≈ 0.85

**Interpretation**:
- Model explains cross-sectional differences well
- But weak at explaining within-entity changes over time
- FE remains consistent, but it relies only on within variation, so estimates may be imprecise when that variation is small

**Consider**:
- Check if X variables vary over time
- Inspect variance decomposition (see Section 5.3)
- If most variation is between entities, Random Effects may be more efficient, but only if α_i is uncorrelated with X_it (check with the Hausman test in Notebook 03)

---

#### 3. Standard Errors Seem Too Small

**Issue**: Unrealistically low standard errors, high t-statistics

**Likely Cause**: Not using clustered standard errors

**Solution**:
```python
# Always use clustered SE for panel data
fe_results = fe_model.fit(cov_type='clustered')  # ✓ Correct

# Not:
fe_results = fe_model.fit(cov_type='nonrobust')  # ✗ Wrong (underestimates SE)
```

**Why**: Residuals are serially correlated within entities → need to cluster by entity
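To see why clustering matters, the sketch below computes both standard errors by hand for a one-regressor within estimator on simulated data with AR(1) errors. Everything here is an assumption for illustration: the data, the `ar1` helper, and the formulas, which omit the small-sample corrections that packages usually apply.

```python
import numpy as np

rng = np.random.default_rng(5)
n_entities, n_periods, rho = 200, 20, 0.9

def ar1(rho, shape, rng):
    """Simulate independent AR(1) series, one per row."""
    out = np.zeros(shape)
    shocks = rng.normal(size=shape)
    out[:, 0] = shocks[:, 0]
    for t in range(1, shape[1]):
        out[:, t] = rho * out[:, t - 1] + shocks[:, t]
    return out

x = ar1(rho, (n_entities, n_periods), rng)   # persistent regressor
e = ar1(rho, (n_entities, n_periods), rng)   # serially correlated errors
y = rng.normal(size=(n_entities, 1)) + 0.5 * x + e

# Within transformation (rows are entities)
x_dm = x - x.mean(axis=1, keepdims=True)
y_dm = y - y.mean(axis=1, keepdims=True)
sxx = (x_dm ** 2).sum()
beta = (x_dm * y_dm).sum() / sxx
resid = y_dm - beta * x_dm

# Classical SE: treats all residuals as i.i.d.
dof = n_entities * n_periods - n_entities - 1
se_classical = np.sqrt((resid ** 2).sum() / dof / sxx)

# Cluster-robust SE: sum the scores x*e within each entity before squaring
scores = (x_dm * resid).sum(axis=1)
se_cluster = np.sqrt((scores ** 2).sum()) / sxx

print(f"beta = {beta:.3f}")
print(f"Classical SE: {se_classical:.4f}")
print(f"Clustered SE: {se_cluster:.4f}")
```

With persistent regressors and errors, the clustered standard error comes out larger because it does not treat the T observations of one entity as independent pieces of information.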

---

#### 4. F-Test Shows p > 0.05 (Fail to Reject)

**Result**: F-test p-value = 0.23 → fail to reject H₀ (Pooled OLS is not rejected)

**Interpretation**: Insufficient evidence of entity-specific effects (this is not proof that they are absent)

**Action**:
- Pooled OLS is more efficient than FE if H₀ holds
- FE is still consistent, but loses N degrees of freedom for no gain

```python
# Use Pooled OLS
pooled_model = pb.PooledOLS(formula, data, entity_col, time_col)
pooled_results = pooled_model.fit(cov_type='clustered')
```

---

#### 5. "Cannot Estimate Time-Invariant Variables"

**Example**: Gender, race, geography (don't change over time)

**Why**: Within transformation eliminates all time-invariant variables

$$x_{it} - \bar{x}_i = \text{constant} - \text{constant} = 0$$

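**Solutions**:
1. **If interested in time-invariant effects**: Use Random Effects or Pooled OLS (consistent only if α_i is uncorrelated with the regressors)
2. **If only interested in time-varying effects**: Drop time-invariant variables
3. **Hybrid approaches**: Estimate FE, then regress the estimated α̂_i on time-invariant variables in a second step (valid only if those variables are uncorrelated with the remaining unobserved heterogeneity), or use the Hausman-Taylor instrumental-variables estimator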
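A toy example makes the problem concrete. In the sketch below (hypothetical data and column names), `region_code` never changes within an entity, so demeaning turns it into a column of zeros, while the time-varying `capital` keeps its within variation.

```python
import pandas as pd

df = pd.DataFrame({
    "entity": ["A", "A", "A", "B", "B", "B"],
    "year": [2001, 2002, 2003, 2001, 2002, 2003],
    "region_code": [1, 1, 1, 2, 2, 2],                 # time-invariant
    "capital": [10.0, 12.0, 11.0, 30.0, 33.0, 36.0],   # time-varying
})

cols = ["region_code", "capital"]
# Within transformation: subtract each entity's mean from its observations
demeaned = df[cols] - df.groupby("entity")[cols].transform("mean")
print(demeaned)
```

A column of zeros carries no information, so its coefficient cannot be identified after the within transformation.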

---

#### 6. Negative R²_overall or R²_between

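**This can happen and is not necessarily a problem.**

**Why**: 
- When R²_between or R²_overall is computed as 1 − SSR/SST using the FE coefficients, it can be negative (definitions based on squared correlations cannot)
- FE maximizes R²_within, not R²_overall
- Coefficients fitted to demeaned data can fit the original (non-demeaned) data worse than a constant does

**Interpretation**: 
- R²_within is the most relevant fit measure for FE
- Negative R²_overall doesn't mean "bad model"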

---

#### 7. Coefficients Change Dramatically with Two-Way FE

**Observation**: One-way FE β̂ = 0.5, Two-way FE β̂ = 0.1

**Interpretation**:
- Common time shocks or trends were likely confounding the one-way estimate
- Two-way FE removes these time effects → the coefficient is identified from deviations from both entity and period means

**Recommendation**: 
- Use two-way FE when time shocks likely present
- Test for time FE significance (joint F-test on γ_t)
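The joint test on γ_t can be sketched as a comparison of residual sums of squares between the one-way and two-way models. The example below uses simulated data and plain NumPy/SciPy; the data, the `ssr` helper, and the variable names are assumptions for illustration, not PanelBox calls.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_entities, n_periods, k = 25, 8, 1
entity = np.repeat(np.arange(n_entities), n_periods)
period = np.tile(np.arange(n_periods), n_entities)
alpha = rng.normal(size=n_entities)
gamma = np.linspace(-1.0, 1.0, n_periods)   # true time effects
x = rng.normal(size=entity.size)
y = alpha[entity] + gamma[period] + 0.5 * x + rng.normal(size=entity.size)

def ssr(X, y):
    """Residual sum of squares from an OLS fit of y on X."""
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return float(resid @ resid)

entity_d = (entity[:, None] == np.arange(n_entities)[None, :]).astype(float)
# Drop the first period to avoid collinearity with the entity dummies
time_d = (period[:, None] == np.arange(1, n_periods)[None, :]).astype(float)

ssr_one_way = ssr(np.column_stack([x, entity_d]), y)          # H0: gamma_t = 0
ssr_two_way = ssr(np.column_stack([x, entity_d, time_d]), y)

df_num = n_periods - 1
df_den = entity.size - n_entities - df_num - k
f_stat = ((ssr_one_way - ssr_two_way) / df_num) / (ssr_two_way / df_den)
p_value = stats.f.sf(f_stat, df_num, df_den)

print(f"F({df_num}, {df_den}) = {f_stat:.2f}, p-value = {p_value:.4g}")
```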

---

#### 8. Memory Error with LSDV

**Error**: `MemoryError` when using LSDV with many entities

**Cause**: LSDV creates N-1 dummy variables → huge design matrix

**Solution**: Use demeaning (PanelBox default) instead of LSDV

```python
# Efficient (demeaning)
fe_model = pb.FixedEffects(formula, data, entity_col, time_col)

# Memory-intensive (LSDV) - avoid for large N
# data_with_dummies = pd.get_dummies(data, columns=[entity_col], drop_first=True)
```

---

### Data Requirements Checklist

Before running Fixed Effects, verify:

✅ **Panel structure**: Each entity observed in multiple time periods (T ≥ 2)

✅ **Within variation**: Explanatory variables vary over time within entities

✅ **No perfect multicollinearity**: No time-invariant X variables (unless using RE)

✅ **Balanced panel** (preferred): Same T for all entities (not required, but helpful)

✅ **Entity and time identifiers**: Clear entity_col and time_col

```python
# Check panel structure
print(f"Entities: {data[entity_col].nunique()}")
print(f"Time periods: {data[time_col].nunique()}")
print(f"Balanced: {data.groupby(entity_col).size().nunique() == 1}")
print(f"\nObservations per entity:\n{data.groupby(entity_col).size().describe()}")
```
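Comparing observation counts alone does not catch duplicated entity-period rows or entities that have the same number of observations but cover different periods. A slightly stricter check could look like the sketch below; `check_panel` and the toy data are hypothetical helpers written for this example, not part of PanelBox.

```python
import pandas as pd

def check_panel(data, entity_col, time_col):
    """Return basic structural diagnostics for a long-format panel."""
    n_dupes = int(data.duplicated([entity_col, time_col]).sum())
    counts = data.groupby(entity_col)[time_col].nunique()
    n_periods = data[time_col].nunique()
    return {
        "duplicate_rows": n_dupes,
        "singleton_entities": int((counts == 1).sum()),
        # Balanced only if every entity is observed in every period
        "balanced": bool((counts == n_periods).all()),
    }

toy = pd.DataFrame({
    "firm": ["A", "A", "B", "B", "C"],
    "year": [2001, 2002, 2001, 2003, 2001],
    "invest": [1.0, 1.2, 2.0, 2.1, 0.5],
})
report = check_panel(toy, "firm", "year")
print(report)
```

In the toy data, firms A and B both have two observations but in different years, and firm C appears only once, so the panel is reported as unbalanced with one singleton entity.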

---

### Saving Plots

To save plots for your reports or presentations:

In [None]:
# Optional: Save all plots to outputs/plots/
from pathlib import Path

# Create output directory
output_dir = Path("../../outputs/plots/02_fixed_effects")
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Plots will be saved to: {output_dir.absolute()}")
print("\nTo save plots, add this line before each plt.show():")
print("plt.savefig(output_dir / 'plot_name.png', dpi=300, bbox_inches='tight')")
print("\nExample:")
print("```python")
print("plt.figure(figsize=(10, 6))")
print("# ... your plot code ...")
print("plt.tight_layout()")
print("plt.savefig(output_dir / 'between_vs_within.png', dpi=300, bbox_inches='tight')")
print("plt.show()")
print("```")

---

## References

1. **Wooldridge, J. M. (2010)**. *Econometric Analysis of Cross Section and Panel Data* (2nd ed.). MIT Press. Chapter 10.

2. **Baltagi, B. H. (2021)**. *Econometric Analysis of Panel Data* (6th ed.). Springer. Chapter 2.

3. **Cameron, A. C., & Trivedi, P. K. (2005)**. *Microeconometrics: Methods and Applications*. Cambridge University Press. Chapter 21.

4. **Angrist, J. D., & Pischke, J.-S. (2009)**. *Mostly Harmless Econometrics*. Princeton University Press. Chapter 5.

---

## Next Steps

Continue to:
- **Notebook 03**: Random Effects and Hausman Test
- **Notebook 04**: First Differences and Instrumental Variables

---

**End of Notebook 02**