# 01a Random Effects and the Hausman Test

When to use FE vs RE, and how to decide.

## Table of Contents
- [Review: Fixed Effects recap](#review-fixed-effects-recap)
- [Random Effects model](#random-effects-model)
- [Hausman test](#hausman-test)
- [Practical comparison](#practical-comparison)
- [When to use which](#when-to-use-which)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)

## Why This Notebook Matters
Causal notebooks focus on **identification**: what would have to be true for a coefficient to represent a causal effect.
In the previous notebook you estimated fixed effects models that sweep out all time-invariant unobserved heterogeneity.
But FE comes at a cost: it throws away all between-entity variation and cannot estimate coefficients on time-invariant regressors.
Random Effects (RE) keeps that variation and is more efficient -- **if** its key assumption holds.

You will practice:
- fitting a Random Effects model with `linearmodels`,
- implementing the Hausman test from scratch,
- comparing FE and RE coefficient estimates side-by-side,
- building a decision framework for when to use which estimator.


## Prerequisites (Quick Self-Check)
- Completed notebook `01_panel_fixed_effects_clustered_se`.
- Understanding of entity fixed effects and the within estimator.
- Basic familiarity with panels (same unit over time) and the idea of identification assumptions.

## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can explain what you built and why each step exists.
- You can run your work end-to-end without undefined variables.
- You can articulate the core assumption difference between FE and RE.
- You can implement and interpret a Hausman test.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Treating regression output as causal without stating identification assumptions.
- Confusing "more efficient" with "more correct" -- RE efficiency only matters if the assumption holds.
- Using non-clustered SE when shocks are correlated within groups (e.g., states).

## Quick Fixes (When You Get Stuck)
- If you see `ModuleNotFoundError`, re-run the bootstrap cell and restart the kernel; make sure `PROJECT_ROOT` is the repo root.
- If a `data/processed/*` file is missing, either run the matching build script (see guide) or use the notebook's `data/sample/*` fallback.
- If results look "too good," suspect leakage; re-check shifts, rolling windows, and time splits.
- If a model errors, check dtypes (`astype(float)`) and missingness (`dropna()` on required columns).

## Matching Guide
- `docs/guides/07_causal/01_panel_fixed_effects_clustered_se.md`


## How To Use This Notebook
- Work section-by-section; don't skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2--4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/07_causal/01_panel_fixed_effects_clustered_se.md`) for the math, assumptions, and deeper context.


<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.


In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT

## Goal
Compare Fixed Effects and Random Effects estimators on the same county-year panel.
Then use the Hausman test to decide which is appropriate.

Key question: **is the unobserved county-level heterogeneity correlated with the regressors?**
- If yes: FE is consistent, RE is not.
- If no: both are consistent, but RE is more efficient.
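
To make the key question concrete, here is a minimal simulation sketch that is separate from the notebook's data pipeline. All names (`n_entities`, `alpha`, `pooled_slope`, `within_slope`) and parameter values are made up for illustration. It builds a toy panel in which the entity effect is correlated with the regressor, then compares a pooled slope (which, like RE, leans on between-entity variation) with the within (FE) slope.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_periods, beta = 500, 5, 1.0

# Entity effect alpha_i, and a regressor that is correlated with it.
alpha = rng.normal(size=n_entities)
x = alpha[:, None] + rng.normal(size=(n_entities, n_periods))
y = beta * x + alpha[:, None] + rng.normal(size=(n_entities, n_periods))


def slope(x_arr, y_arr):
    """OLS slope of y on x (with intercept), using all observations."""
    xc = x_arr.ravel() - x_arr.mean()
    yc = y_arr.ravel() - y_arr.mean()
    return float(xc @ yc / (xc @ xc))


# Pooled OLS ignores alpha_i, so its slope absorbs Cov(alpha_i, x_it).
pooled_slope = slope(x, y)

# Within estimator: subtract each entity's own mean, which removes alpha_i.
x_within = x - x.mean(axis=1, keepdims=True)
y_within = y - y.mean(axis=1, keepdims=True)
within_slope = slope(x_within, y_within)

print(f"true beta:    {beta:.2f}")
print(f"pooled slope: {pooled_slope:.2f}")
print(f"within slope: {within_slope:.2f}")
```

In this design the pooled slope is pulled toward 1.5 (the true value plus $\mathrm{Cov}(\alpha_i, x_{it}) / \mathrm{Var}(x_{it}) = 0.5$), while the within slope stays near 1. That gap is the kind of systematic difference the Hausman test looks for.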

<a id="review-fixed-effects-recap"></a>
## Review: Fixed Effects recap

### Background
In the previous notebook (`01_panel_fixed_effects_clustered_se`), you estimated models of the form:

$$
Y_{it} = X_{it}'\beta + \alpha_i + \gamma_t + \varepsilon_{it}
$$

where $\alpha_i$ are entity (county) fixed effects and $\gamma_t$ are time (year) fixed effects.

The **key feature** of FE: it allows the unobserved heterogeneity $\alpha_i$ to be **correlated** with the regressors $X_{it}$.
FE handles this by demeaning within each entity, which eliminates $\alpha_i$ entirely. What FE still requires is that $\varepsilon_{it}$ be strictly exogenous.

**Cost of FE**:
- Cannot estimate coefficients on time-invariant regressors (they get absorbed).
- Uses only within-entity variation, discarding between-entity variation.
- Less efficient than RE when the RE assumption actually holds.
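
The first cost can be seen directly on a toy table. The sketch below uses a hypothetical six-row panel (the `is_coastal` column and all values are invented) and applies the within transformation by hand with pandas.

```python
import pandas as pd

# Hypothetical toy panel: `is_coastal` never changes within a county.
toy = pd.DataFrame({
    "fips": ["01001"] * 3 + ["06037"] * 3,
    "year": [2018, 2019, 2020] * 2,
    "unemployment_rate": [4.0, 3.5, 6.0, 4.5, 4.0, 12.0],
    "is_coastal": [0.0, 0.0, 0.0, 1.0, 1.0, 1.0],
})

cols = ["unemployment_rate", "is_coastal"]
# Within transformation: subtract each county's own mean from its rows.
demeaned = toy[cols] - toy.groupby("fips")[cols].transform("mean")
print(demeaned)
```

After demeaning, `is_coastal` is zero in every row, so it carries no variation from which a coefficient could be estimated, while `unemployment_rate` keeps its year-to-year movement within each county.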

### What you should see
- A loaded panel with MultiIndex `(fips, year)`.
- An entity FE regression result to use as a baseline for comparison.

### Interpretation prompts
- What does entity demeaning do to a time-invariant variable?
- Why is FE considered the "safe default" in applied work?

### Goal
Load the panel data and fit an entity FE model as a baseline.

### Your Turn: Load panel and fit entity FE baseline

In [None]:
import numpy as np
import pandas as pd

path = PROCESSED_DIR / 'census_county_panel.csv'
if path.exists():
    df = pd.read_csv(path)
else:
    df = pd.read_csv(SAMPLE_DIR / 'census_county_panel_sample.csv')

# TODO: Ensure fips/year exist and build a MultiIndex
df['fips'] = df['fips'].astype(str)
df['year'] = df['year'].astype(int)
df = df.set_index(['fips', 'year'], drop=False).sort_index()

# Starter transforms
df['log_income'] = np.log(df['B19013_001E'].astype(float))
df['log_rent'] = np.log(df['B25064_001E'].astype(float))

df[['poverty_rate', 'unemployment_rate', 'log_income', 'log_rent']].describe()

In [None]:
from src.causal import fit_twfe_panel_ols

y_col = 'poverty_rate'
x_cols = ['log_income', 'unemployment_rate']

# TODO: Fit entity FE model (entity_effects=True, time_effects=False for pure entity FE)
res_fe = fit_twfe_panel_ols(
    df,
    y_col=y_col,
    x_cols=x_cols,
    entity_effects=True,
    time_effects=False,
)
print(res_fe.summary)

<a id="random-effects-model"></a>
## Random Effects model

### Background
The Random Effects (RE) model assumes:

$$
Y_{it} = X_{it}'\beta + \alpha_i + \varepsilon_{it}, \quad \mathrm{Cov}(\alpha_i, X_{it}) = 0
$$

The critical difference from FE: RE assumes the unobserved entity effect $\alpha_i$ is **uncorrelated** with the regressors.

Under this assumption, the RE estimator is a matrix-weighted average of the between and within estimators. Because it uses both within-entity and between-entity variation, it is **more efficient** than FE (smaller standard errors).

RE uses a GLS-type transformation: it partially demeans the data (by a fraction $\theta$, estimated from the variance components) rather than fully demeaning like FE.
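
For a balanced panel with $T$ periods, the textbook expression is $\theta = 1 - \sqrt{\sigma_\varepsilon^2 / (\sigma_\varepsilon^2 + T\sigma_\alpha^2)}$, and the transformed outcome is $Y_{it} - \theta \bar{Y}_i$. The sketch below only evaluates that formula. The function name `re_theta` and the variance values are made up for illustration, and the way `linearmodels` estimates the variance components internally may differ in detail.

```python
import numpy as np


def re_theta(sigma2_eps: float, sigma2_alpha: float, n_periods: int) -> float:
    """Quasi-demeaning fraction for a balanced panel with T = n_periods."""
    return 1.0 - np.sqrt(sigma2_eps / (sigma2_eps + n_periods * sigma2_alpha))


# Illustrative variance components (made-up numbers).
for sigma2_alpha in [0.0, 0.5, 5.0, 500.0]:
    theta = re_theta(sigma2_eps=1.0, sigma2_alpha=sigma2_alpha, n_periods=5)
    print(f"sigma2_alpha={sigma2_alpha:>6}: theta={theta:.3f}")
```

With no entity-level variance, $\theta = 0$ and RE reduces to pooled OLS. As the entity-level variance grows relative to the idiosyncratic variance, $\theta$ approaches 1 and RE approaches the full demeaning used by FE.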

Use `linearmodels.panel.RandomEffects`:
```python
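import statsmodels.api as sm
from linearmodels.panel import RandomEffects

# linearmodels does not add an intercept automatically, so include one in X.
res_re = RandomEffects(y, sm.add_constant(X)).fit()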
```

### What you should see
- A `RandomEffects` summary with coefficient estimates.
- Standard errors that are typically **smaller** than the FE standard errors (because RE is more efficient under its assumption).

### Interpretation prompts
- In words, what does $\mathrm{Cov}(\alpha_i, X_{it}) = 0$ mean for this county panel?
- Why would RE standard errors be smaller than FE standard errors if the assumption holds?

### Goal
Fit a Random Effects model on the same outcome and regressors.

### Your Turn: Fit Random Effects model

In [None]:
from linearmodels.panel import RandomEffects
import statsmodels.api as sm

# Build modeling table (drop missing, ensure float types)
tmp = df[[y_col] + x_cols].dropna().copy()
y = tmp[y_col].astype(float)
X = tmp[x_cols].astype(float)

# TODO: Fit the RandomEffects model and print the summary
# Hint: RandomEffects(y, sm.add_constant(X)).fit()
# (linearmodels does not add an intercept automatically)
res_re = ...
print(res_re.summary)

<a id="hausman-test"></a>
## Hausman test

### Background
The Hausman test checks whether the RE assumption ($\mathrm{Cov}(\alpha_i, X_{it}) = 0$) is plausible.

**Intuition**: Under $H_0$ (RE is consistent), both FE and RE are consistent, but RE is more efficient.
Under the alternative, FE is consistent but RE is not. So the coefficients should differ systematically.

**Test statistic**:
$$
H = (\hat{\beta}_{FE} - \hat{\beta}_{RE})' \left[\widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE})\right]^{-1} (\hat{\beta}_{FE} - \hat{\beta}_{RE})
$$

Under $H_0$, $H \sim \chi^2_k$ where $k$ is the number of coefficients being compared (here, the time-varying regressors that appear in both models).
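
For a single regressor ($k = 1$) the matrix expression collapses to a ratio that is easy to check by hand. The numbers below are made up for illustration and are not results from this panel:

$$
H = \frac{(\hat{\beta}_{FE} - \hat{\beta}_{RE})^2}{\widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE})}
= \frac{(-0.50 - (-0.40))^2}{0.05^2 - 0.03^2}
= \frac{0.0100}{0.0016}
= 6.25
$$

Since 6.25 exceeds 3.84, the 5% critical value of $\chi^2_1$, this hypothetical test would reject $H_0$.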

**Decision rule**:
- If $p < 0.05$: reject $H_0$ -- use FE (RE assumption likely violated).
- If $p \geq 0.05$: fail to reject $H_0$ -- RE may be appropriate.

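**Important caveats**: failing to reject does not prove the RE assumption is true; it could simply reflect low power. The classical statistic also assumes RE is fully efficient under $H_0$, so compute it from the default (non-robust) covariance matrices rather than clustered ones.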
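
A practical wrinkle with the statistic above: in finite samples $\widehat{\mathrm{Var}}(\hat{\beta}_{FE}) - \widehat{\mathrm{Var}}(\hat{\beta}_{RE})$ is not guaranteed to be positive definite, in which case $H$ can come out negative and the $\chi^2$ p-value is not meaningful. Below is a minimal sketch of a guard. The helper name `is_positive_definite` and the toy matrices are hypothetical.

```python
import numpy as np


def is_positive_definite(V_diff: np.ndarray, tol: float = 1e-12) -> bool:
    """Return True if the symmetric part of V_diff has all eigenvalues > tol."""
    sym = (V_diff + V_diff.T) / 2.0
    return bool(np.linalg.eigvalsh(sym).min() > tol)


# Made-up covariance matrices for two coefficients.
V_fe_toy = np.array([[0.040, 0.002], [0.002, 0.030]])
V_re_ok = np.array([[0.020, 0.001], [0.001, 0.010]])
# In this one, the RE variance exceeds the FE variance for the first coefficient.
V_re_bad = np.array([[0.050, 0.001], [0.001, 0.010]])

print(is_positive_definite(V_fe_toy - V_re_ok))
print(is_positive_definite(V_fe_toy - V_re_bad))
```

If the check fails on real estimates, treat the Hausman result with caution and lean on a side-by-side comparison of the coefficients or on a regression-based variant of the test.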

### What you should see
- A manually computed Hausman test statistic.
- A p-value from the chi-squared distribution.
- A clear conclusion: FE or RE.

### Interpretation prompts
- What does it mean, economically, if the test rejects?
- Why might you still prefer FE even if the test fails to reject?

### Goal
Implement the Hausman test manually and interpret the result.

### Your Turn: Manual Hausman test

In [None]:
from linearmodels.panel import PanelOLS, RandomEffects
from scipy import stats

# --- Step 1: Fit FE and RE on the same modeling table ---
# We need both models estimated on identical observations.
tmp = df[[y_col] + x_cols].dropna().copy()
y = tmp[y_col].astype(float)
X = tmp[x_cols].astype(float)

# FE (entity effects, no constant -- FE absorbs it)
res_fe_h = PanelOLS(y, X, entity_effects=True).fit()

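# RE (include an intercept; linearmodels does not add one automatically).
# `sm` is statsmodels.api, imported in the Random Effects cell above.
res_re_h = RandomEffects(y, sm.add_constant(X)).fit()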

# --- Step 2: Extract coefficients and covariance matrices ---
# TODO: Get the coefficient vectors (as numpy arrays)
b_fe = ...  # Hint: res_fe_h.params.values
b_re = ...  # Hint: res_re_h.params.values

# TODO: Get the covariance matrices (as numpy arrays)
V_fe = ...  # Hint: res_fe_h.cov.values
V_re = ...  # Hint: res_re_h.cov.values

# --- Step 3: Align coefficients ---
# RE may include a constant that FE does not. We compare only the
# coefficients that appear in BOTH models (the x_cols regressors).
common = [c for c in res_fe_h.params.index if c in res_re_h.params.index]
b_fe = res_fe_h.params[common].values
b_re = res_re_h.params[common].values
V_fe = res_fe_h.cov.loc[common, common].values
V_re = res_re_h.cov.loc[common, common].values

In [None]:
# --- Step 4: Compute the Hausman test statistic ---
# H = (b_fe - b_re)' [V_fe - V_re]^{-1} (b_fe - b_re)

# TODO: Compute the difference in coefficients
b_diff = ...  # Hint: b_fe - b_re

# TODO: Compute the difference in variance matrices
V_diff = ...  # Hint: V_fe - V_re

# TODO: Compute the test statistic
# Hint: Use np.linalg.inv() for the matrix inverse
# H = b_diff @ np.linalg.inv(V_diff) @ b_diff
H = ...

# --- Step 5: Compute p-value ---
# TODO: degrees of freedom = number of common coefficients
k = ...  # Hint: len(common)
p_value = ...  # Hint: stats.chi2.sf(H, df=k), the upper-tail probability

print(f'Hausman test statistic: {H:.4f}')
print(f'Degrees of freedom:     {k}')
print(f'p-value:                {p_value:.6f}')
print()
if p_value < 0.05:
    print('Reject H0: RE assumption likely violated. Use FE.')
else:
    print('Fail to reject H0: RE may be appropriate (but FE is still safe).')

<a id="practical-comparison"></a>
## Practical comparison

### Background
Before relying on the Hausman test alone, it is useful to look at the FE and RE estimates side-by-side.
If the coefficient estimates are very close, the choice between FE and RE may not matter much in practice.
If they diverge substantially, that is itself a signal that unobserved heterogeneity may be correlated with the regressors.
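
One informal way to judge "close" versus "substantially different" is to express the gap in units of the FE standard error. The sketch below does this on made-up estimates. The column name `diff_in_fe_se` is hypothetical, and the ratio is a rough heuristic, not a formal test.

```python
import pandas as pd

# Made-up estimates for two regressors (not results from this panel).
toy = pd.DataFrame(
    {
        "FE_coef": [-0.50, 0.80],
        "RE_coef": [-0.40, 0.79],
        "FE_se": [0.05, 0.10],
    },
    index=["log_income", "unemployment_rate"],
)

# Gap between the estimators, expressed in FE standard errors.
toy["coef_diff"] = toy["FE_coef"] - toy["RE_coef"]
toy["diff_in_fe_se"] = toy["coef_diff"] / toy["FE_se"]
print(toy.round(3))
```

In this toy table the first regressor moves by about two FE standard errors between estimators, while the second barely moves, so only the first would raise a concern about the RE assumption.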

### What you should see
- A table comparing FE vs RE: coefficients, standard errors, and the difference.
- A visual sense of how much the estimates diverge.

### Interpretation prompts
- For which regressor do FE and RE disagree the most? What story could explain that?
- If RE standard errors are smaller, does that automatically make RE better? Why or why not?

### Goal
Build a side-by-side comparison table of FE vs RE estimates.

### Your Turn: Compare FE and RE side-by-side

In [None]:
import pandas as pd

# TODO: Build a comparison DataFrame with columns:
#   'FE_coef', 'RE_coef', 'FE_se', 'RE_se', 'coef_diff'
# Use only the common coefficients (the regressors shared by both models).

comparison = pd.DataFrame({
    'FE_coef': ...,   # Hint: res_fe_h.params[common]
    'RE_coef': ...,   # Hint: res_re_h.params[common]
    'FE_se':   ...,   # Hint: res_fe_h.std_errors[common]
    'RE_se':   ...,   # Hint: res_re_h.std_errors[common]
})
comparison['coef_diff'] = comparison['FE_coef'] - comparison['RE_coef']

comparison

In [None]:
import matplotlib.pyplot as plt

# TODO: Create a visual comparison (bar chart or coefficient plot)
# Hint: Plot FE and RE coefficients side by side with error bars

fig, ax = plt.subplots(figsize=(8, 4))
x_pos = np.arange(len(common))
width = 0.35

ax.bar(x_pos - width/2, ..., width, yerr=..., label='FE', alpha=0.8, capsize=4)
ax.bar(x_pos + width/2, ..., width, yerr=..., label='RE', alpha=0.8, capsize=4)

ax.set_xticks(x_pos)
ax.set_xticklabels(common, rotation=15)
ax.set_ylabel('Coefficient estimate')
ax.set_title('FE vs RE coefficient estimates (with SE error bars)')
ax.legend()
ax.axhline(0, color='gray', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()

<a id="when-to-use-which"></a>
## When to use which

### Decision framework

| Criterion | Fixed Effects (FE) | Random Effects (RE) |
|---|---|---|
| **Core assumption** | $\alpha_i$ may be correlated with $X_{it}$ | $\mathrm{Cov}(\alpha_i, X_{it}) = 0$ |
| **Consistency** | Consistent under strict exogeneity of $\varepsilon$, even if $\alpha_i$ is correlated with $X_{it}$ | Consistent only if, in addition, $\mathrm{Cov}(\alpha_i, X_{it}) = 0$ |
| **Efficiency** | Less efficient (uses only within variation) | More efficient (uses within + between variation) |
| **Time-invariant regressors** | Cannot estimate (absorbed by FE) | Can estimate |
| **Hausman test rejects** | Use FE | Do not use RE |
| **Hausman test fails to reject** | Still safe to use FE | RE is a valid (and more efficient) choice |

### Practical guidance

1. **FE is the safe default.** In most applied economics, researchers worry about endogeneity (omitted variables correlated with regressors). FE is robust to omitted variables that are constant within an entity over time, though not to time-varying ones.

2. **RE is more efficient** if you genuinely believe that unobserved county-level factors (culture, geography, institutions) are uncorrelated with your regressors. This is a strong assumption.

3. **In practice**: RE is more common in fields where the zero-correlation assumption is more defensible (e.g., randomized experiments with clustering, some clinical trial designs). In observational settings, most applied work defaults to FE.

4. **The Hausman test is a guide, not a guarantee.** Failure to reject could reflect low power rather than a true lack of correlation. When in doubt, report both and discuss.

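5. **Mundlak (1978) compromise**: add entity means of the time-varying regressors to the RE model. The coefficients on the time-varying regressors then reproduce the FE estimates (exactly so in a balanced panel), and a joint test that the coefficients on the means are zero serves as a regression-based alternative to the Hausman test. This is an extension for further study.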
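
The Mundlak idea in item 5 can be sketched on simulated data. To stay dependency-free, the sketch below swaps in pooled OLS via `numpy.linalg.lstsq` instead of the RE estimator, which is an assumption of this example and not the notebook's method. All names (`panel`, `x_mean`, `coef_mundlak`) and parameter values are made up.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_entities, n_periods = 200, 4

# Toy panel where the entity effect is correlated with x (illustrative only).
entity = np.repeat(np.arange(n_entities), n_periods)
alpha = rng.normal(size=n_entities)[entity]
panel = pd.DataFrame({"entity": entity, "x": alpha + rng.normal(size=entity.size)})
panel["y"] = 2.0 * panel["x"] + alpha + rng.normal(size=entity.size)

# Mundlak device: add the entity mean of each time-varying regressor.
panel["x_mean"] = panel.groupby("entity")["x"].transform("mean")

# Pooled OLS of y on [1, x, x_mean]; the coefficient on x uses only within variation.
design = np.column_stack([np.ones(len(panel)), panel["x"], panel["x_mean"]])
coef_mundlak = np.linalg.lstsq(design, panel["y"].to_numpy(), rcond=None)[0]

# Within (FE) slope for comparison.
x_w = panel["x"] - panel["x_mean"]
y_w = panel["y"] - panel.groupby("entity")["y"].transform("mean")
coef_within = float(x_w @ y_w / (x_w @ x_w))

print(f"coef on x with Mundlak means: {coef_mundlak[1]:.4f}")
print(f"within (FE) slope:            {coef_within:.4f}")
print(f"coef on x_mean:               {coef_mundlak[2]:.4f}")
```

The first two printed numbers coincide: once the entity mean is in the regression, the coefficient on `x` is identified from within-entity variation only. A coefficient on `x_mean` that is clearly different from zero is the regression-based signal that the entity effect is correlated with the regressor.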

### Your Turn: Summarize your decision

In [None]:
# TODO: Write 3-5 sentences summarizing your results.
# Address:
# 1. What did the Hausman test say?
# 2. How different were the FE and RE coefficients?
# 3. Which estimator would you recommend for this panel and why?
...

<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2-3 sentences summarizing what you verified.


In [None]:
import pandas as pd

# Expected output: (see notebook front matter)
# TODO: Verify your panel indexing and model results.
# Example (adjust variable names):
# assert isinstance(df.index, pd.MultiIndex)
# assert df.index.names[:2] == ['fips', 'year']
# assert res_fe_h is not None, 'FE model not fitted'
# assert res_re_h is not None, 'RE model not fitted'
# assert H > 0, 'Negative H means V_fe - V_re is not positive definite in this sample'
# assert 0 <= p_value <= 1, 'p-value out of range'
# assert len(common) == len(x_cols), 'Common coefficients should match x_cols'
#
# TODO: Write 2-3 sentences:
# - What is the key assumption you are testing with the Hausman test?
# - What did you conclude?
...

## Extensions (Optional)
- Try one additional variant beyond the main path (different features, different split, different model).
- Write down what improved, what got worse, and your hypothesis for why.

Suggestions:
- **Mundlak approach**: Add within-entity means of the time-varying regressors to the RE model and test whether their coefficients are jointly zero. Does the conclusion agree with your Hausman result? (This is the Mundlak (1978) device that nests FE within RE.)
- **Different regressors**: Swap in `log_rent` or add additional controls. How sensitive is the Hausman result?
- **Two-way FE vs entity-only FE**: Compare the Hausman test using entity-only FE vs TWFE (with time effects). Which comparison is more appropriate?

## Reflection
- What did you assume implicitly (about timing, availability, stationarity, or costs)?
- If you had to ship this model, what would you monitor?
- Under what real-world conditions might you prefer RE over FE despite the Hausman test?
- How does the FE vs RE decision relate to the broader theme of the bias-variance tradeoff?

<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Load panel and fit entity FE baseline</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01a_random_effects_hausman -- Load panel and FE baseline
import numpy as np
import pandas as pd
from src.causal import fit_twfe_panel_ols

path = PROCESSED_DIR / 'census_county_panel.csv'
if path.exists():
    df = pd.read_csv(path)
else:
    df = pd.read_csv(SAMPLE_DIR / 'census_county_panel_sample.csv')

df['fips'] = df['fips'].astype(str)
df['year'] = df['year'].astype(int)
df = df.set_index(['fips', 'year'], drop=False).sort_index()

df['log_income'] = np.log(df['B19013_001E'].astype(float))
df['log_rent'] = np.log(df['B25064_001E'].astype(float))

y_col = 'poverty_rate'
x_cols = ['log_income', 'unemployment_rate']

res_fe = fit_twfe_panel_ols(
    df,
    y_col=y_col,
    x_cols=x_cols,
    entity_effects=True,
    time_effects=False,
)
print(res_fe.summary)
```

</details>

<details><summary>Solution: Fit Random Effects model</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01a_random_effects_hausman -- Random Effects model
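import statsmodels.api as sm
from linearmodels.panel import RandomEffects

tmp = df[[y_col] + x_cols].dropna().copy()
y = tmp[y_col].astype(float)
X = tmp[x_cols].astype(float)

# Include an intercept; linearmodels does not add one automatically.
res_re = RandomEffects(y, sm.add_constant(X)).fit()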
print(res_re.summary)
```

</details>

<details><summary>Solution: Manual Hausman test</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01a_random_effects_hausman -- Hausman test
from linearmodels.panel import PanelOLS, RandomEffects
from scipy import stats
import numpy as np
import statsmodels.api as sm

tmp = df[[y_col] + x_cols].dropna().copy()
y = tmp[y_col].astype(float)
X = tmp[x_cols].astype(float)

# FE absorbs the intercept; RE needs one added explicitly.
res_fe_h = PanelOLS(y, X, entity_effects=True).fit()
res_re_h = RandomEffects(y, sm.add_constant(X)).fit()

# Align on common coefficients
common = [c for c in res_fe_h.params.index if c in res_re_h.params.index]
b_fe = res_fe_h.params[common].values
b_re = res_re_h.params[common].values
V_fe = res_fe_h.cov.loc[common, common].values
V_re = res_re_h.cov.loc[common, common].values

b_diff = b_fe - b_re
V_diff = V_fe - V_re
H = b_diff @ np.linalg.inv(V_diff) @ b_diff

k = len(common)
p_value = stats.chi2.sf(H, df=k)  # upper-tail probability, same as 1 - cdf

print(f'Hausman test statistic: {H:.4f}')
print(f'Degrees of freedom:     {k}')
print(f'p-value:                {p_value:.6f}')
print()
if p_value < 0.05:
    print('Reject H0: RE assumption likely violated. Use FE.')
else:
    print('Fail to reject H0: RE may be appropriate (but FE is still safe).')
```

</details>

<details><summary>Solution: Compare FE and RE side-by-side</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01a_random_effects_hausman -- Practical comparison
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

comparison = pd.DataFrame({
    'FE_coef': res_fe_h.params[common],
    'RE_coef': res_re_h.params[common],
    'FE_se':   res_fe_h.std_errors[common],
    'RE_se':   res_re_h.std_errors[common],
})
comparison['coef_diff'] = comparison['FE_coef'] - comparison['RE_coef']
print(comparison)

fig, ax = plt.subplots(figsize=(8, 4))
x_pos = np.arange(len(common))
width = 0.35

ax.bar(x_pos - width/2, comparison['FE_coef'], width,
       yerr=comparison['FE_se'], label='FE', alpha=0.8, capsize=4)
ax.bar(x_pos + width/2, comparison['RE_coef'], width,
       yerr=comparison['RE_se'], label='RE', alpha=0.8, capsize=4)

ax.set_xticks(x_pos)
ax.set_xticklabels(common, rotation=15)
ax.set_ylabel('Coefficient estimate')
ax.set_title('FE vs RE coefficient estimates (with SE error bars)')
ax.legend()
ax.axhline(0, color='gray', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()
```

</details>

<details><summary>Solution: Summarize your decision</summary>

_One possible approach._

```python
# Example summary (replace with your own words):
#
# The Hausman test statistic was [value] with p-value [value].
# Since we [reject / fail to reject] H0, this suggests that the
# unobserved county effects [are / may not be] correlated with
# the regressors (log_income, unemployment_rate).
#
# The FE and RE coefficients [were close / diverged], particularly
# for [regressor]. This is consistent with [the Hausman result].
#
# For this panel, I would recommend [FE / RE] because [reasoning].
# In general, FE is the safer default for observational county data
# where unobserved county characteristics (geography, institutions)
# are plausibly correlated with income and employment.
```

</details>