# 01 Panel Fixed Effects + Clustered SE

Compare pooled OLS with two-way fixed effects (TWFE), and heteroskedasticity-robust with clustered standard errors.


## Table of Contents
- [Load panel and define variables](#load-panel-and-define-variables)
- [Pooled OLS baseline](#pooled-ols-baseline)
- [Two-way fixed effects](#two-way-fixed-effects)
- [Clustered standard errors](#clustered-standard-errors)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)


## Why This Notebook Matters
Causal notebooks focus on **identification**: what would have to be true for a coefficient to represent a causal effect.
Across the causal notebooks you will practice:
- building a county-year panel,
- fixed effects (TWFE),
- clustered standard errors,
- DiD + event studies,
- IV/2SLS.


## What You Will Produce
- (no file output; learning/analysis notebook)

## Success Criteria
- You can explain what you built and why each step exists.
- You can run your work end-to-end without undefined variables.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Treating regression output as causal without stating identification assumptions.
- Using non-clustered SE when shocks are correlated within groups (e.g., states).

## Matching Guide
- `docs/guides/07_causal/01_panel_fixed_effects_clustered_se.md`



## How To Use This Notebook
- This notebook is hands-on. Most code cells are incomplete on purpose.
- Complete each TODO, then run the cell.
- Use the matching guide (`docs/guides/07_causal/01_panel_fixed_effects_clustered_se.md`) for deep explanations and alternative examples.
- Write short interpretation notes as you go (what changed, why it matters).



<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.



In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT



## Goal
Compare:
- pooled OLS (ignores panel structure)
- two-way fixed effects (county FE + year FE)
- robust vs clustered standard errors

This is still not causal by default. County FE absorb time-invariant county characteristics and year FE absorb shocks common to all counties in a year, but time-varying confounders remain.
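To make the comparison concrete, here is a minimal simulation sketch (not part of the project code). It assumes a made-up data-generating process in which a time-invariant entity effect `alpha` drives both the regressor and the outcome; all names and parameter values are hypothetical. Pooled OLS picks up the confounding, while demeaning within each entity removes it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_periods = 200, 6
true_beta = 2.0

# alpha: time-invariant entity effect that also drives x (the confounder)
alpha = rng.normal(size=n_entities)
entity = np.repeat(np.arange(n_entities), n_periods)
x = alpha[entity] + rng.normal(size=entity.size)
y = true_beta * x + 3.0 * alpha[entity] + rng.normal(size=entity.size)


def ols_slope(x_vec, y_vec):
    """Slope from a one-regressor OLS with an intercept."""
    xc = x_vec - x_vec.mean()
    return float(xc @ (y_vec - y_vec.mean()) / (xc @ xc))


def demean_by_group(values, groups):
    """Subtract each group's mean (the within transformation)."""
    sums = np.bincount(groups, weights=values)
    counts = np.bincount(groups)
    return values - (sums / counts)[groups]


beta_pooled = ols_slope(x, y)
beta_within = ols_slope(demean_by_group(x, entity), demean_by_group(y, entity))
print(f"pooled: {beta_pooled:.2f}  within (entity FE): {beta_within:.2f}")
```

The within estimate should land near `true_beta`, while the pooled estimate is pushed upward by `alpha`. If the confounder varied over time, demeaning would not remove it.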



## Primer: Panel and IV regression with `linearmodels`

`statsmodels` is great for OLS inference, but panel and IV workflows are often cleaner with `linearmodels`.

This project uses `linearmodels` for:
- **PanelOLS** (fixed effects / TWFE)
- **IV2SLS** (instrumental variables)

### Panel data shape
Most panel estimators expect a **MultiIndex**:
- level 0: entity (e.g., county `fips`)
- level 1: time (e.g., `year`)

In pandas:
```python
# df is a DataFrame with columns fips, year, y, x1, x2
# df = df.set_index(['fips', 'year']).sort_index()
```
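Before fitting, it can help to confirm that the index really identifies one row per entity-period. The sketch below uses a toy frame with made-up values (the column names mirror this notebook, the numbers are illustrative only) and checks for duplicate index pairs and for balance.

```python
import pandas as pd

# Toy county-year rows (made-up values, for illustration only)
toy = pd.DataFrame(
    {
        "fips": ["01001", "01001", "01003", "01003"],
        "year": [2019, 2020, 2019, 2020],
        "y": [0.15, 0.14, 0.11, 0.12],
    }
)
panel = toy.set_index(["fips", "year"]).sort_index()

# Each (entity, time) pair should appear exactly once
has_duplicates = bool(panel.index.duplicated().any())

# Balanced panel: every entity is observed in the same number of periods
periods_per_entity = panel.groupby(level="fips").size()
is_balanced = periods_per_entity.nunique() == 1

print(has_duplicates, is_balanced)
```

An unbalanced panel is not an error in itself, but duplicated index pairs usually point to a merge problem upstream.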

### Minimal PanelOLS (two-way fixed effects)
```python
from linearmodels.panel import PanelOLS

# y: Series with MultiIndex
# X: DataFrame with MultiIndex

# model: y_it = beta'X_it + alpha_i + gamma_t + eps_it
# res = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(cov_type='robust')
# print(res.summary)
```
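For intuition about what the two fixed effects remove, the model in the comment above can be rewritten in demeaned form. This is a sketch that assumes a balanced panel (every entity observed in every period); the bar and double-dot notation is introduced here for illustration and does not describe how any particular library computes the estimate. Define the two-way within transformation

$$
\ddot{y}_{it} = y_{it} - \bar{y}_{i\cdot} - \bar{y}_{\cdot t} + \bar{y}_{\cdot\cdot},
$$

where $\bar{y}_{i\cdot}$ is the mean over time for entity $i$, $\bar{y}_{\cdot t}$ is the mean over entities in period $t$, and $\bar{y}_{\cdot\cdot}$ is the overall mean. Applying the same transformation to every term of $y_{it} = \beta' X_{it} + \alpha_i + \gamma_t + \varepsilon_{it}$ gives

$$
\ddot{y}_{it} = \beta' \ddot{X}_{it} + \ddot{\varepsilon}_{it},
$$

because $\alpha_i - \alpha_i - \bar{\alpha} + \bar{\alpha} = 0$ and $\gamma_t - \bar{\gamma} - \gamma_t + \bar{\gamma} = 0$. The coefficient $\beta$ is therefore identified only from variation in $X_{it}$ that is neither constant within an entity nor common to all entities in a period. In unbalanced panels this simple double demeaning is no longer exact.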

### Clustered standard errors (common in applied work)
If errors are correlated within clusters (e.g., state), use clustered SE:
```python
# clusters must align with y/X index (same rows)
# res = PanelOLS(y, X, entity_effects=True, time_effects=True).fit(
#     cov_type='clustered',
#     clusters=df['state'],  # e.g., state-level clustering
# )
```
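To see what `cov_type='clustered'` is estimating conceptually, the sketch below hand-rolls the cluster-robust "sandwich" covariance for plain OLS in NumPy. It is an illustration on simulated data, not the `linearmodels` implementation: the function name `ols_with_cluster_se` is hypothetical, and no finite-sample correction is applied, so library output will generally differ somewhat.

```python
import numpy as np


def ols_with_cluster_se(X, y, clusters):
    """OLS coefficients and cluster-robust SE (no small-sample correction)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    clusters = np.asarray(clusters)

    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta

    # "Meat": sum over clusters of (X_g' u_g)(X_g' u_g)'
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(clusters):
        mask = clusters == g
        score_g = X[mask].T @ resid[mask]
        meat += np.outer(score_g, score_g)

    cov = xtx_inv @ meat @ xtx_inv
    return beta, np.sqrt(np.diag(cov))


rng = np.random.default_rng(1)
n_clusters, per_cluster = 40, 25
cluster_id = np.repeat(np.arange(n_clusters), per_cluster)

# Regressor and error both share a cluster-level component
x = rng.normal(size=n_clusters)[cluster_id] + rng.normal(size=cluster_id.size)
u = rng.normal(size=n_clusters)[cluster_id] + rng.normal(size=cluster_id.size)
y = 1.0 + 0.5 * x + u
X = np.column_stack([np.ones_like(x), x])

_, se_cluster = ols_with_cluster_se(X, y, cluster_id)
# One cluster per observation reduces the formula to White's HC0 estimator
_, se_hc0 = ols_with_cluster_se(X, y, np.arange(y.size))
print("clustered:", se_cluster, "HC0:", se_hc0)
```

When both the regressor and the error are correlated within clusters, as simulated here, the clustered SE on the slope is typically noticeably larger than the heteroskedasticity-robust one.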

### Minimal IV2SLS (one endogenous regressor)
```python
from linearmodels.iv import IV2SLS
import statsmodels.api as sm

# y: Series
# endog: DataFrame with endogenous regressor(s)
# exog: DataFrame with controls (include a constant)
# instr: DataFrame with instruments

# exog = sm.add_constant(exog, has_constant='add')
# res = IV2SLS(y, exog, endog, instr).fit(cov_type='robust')
# print(res.summary)
```
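The two stages behind 2SLS can also be written out by hand. The sketch below uses simulated data with a hypothetical instrument `z` and unobserved confounder `c` (all values made up) and plain least squares from NumPy; it illustrates the point estimate only and is not how `IV2SLS` is implemented.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000
true_beta = 1.0

z = rng.normal(size=n)            # instrument: shifts x, unrelated to the confounder
c = rng.normal(size=n)            # unobserved confounder
x = 0.8 * z + c + rng.normal(size=n)
y = true_beta * x + 2.0 * c + rng.normal(size=n)

const = np.ones(n)
X = np.column_stack([const, x])   # regressors (constant + endogenous x)
Z = np.column_stack([const, z])   # instruments (constant + z)

# Stage 1: project the regressors onto the instrument space
first_stage_coef, *_ = np.linalg.lstsq(Z, X, rcond=None)
X_hat = Z @ first_stage_coef

# Stage 2: regress y on the fitted regressors
beta_2sls, *_ = np.linalg.lstsq(X_hat, y, rcond=None)
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"OLS slope: {beta_ols[1]:.2f}  2SLS slope: {beta_2sls[1]:.2f}")
```

Standard errors taken from a naive second-stage regression are not valid, because they use residuals computed from `X_hat` rather than `X`. That is one reason to rely on a dedicated estimator such as `IV2SLS` for inference.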

### Practical rule
- If the goal is **causal identification**, always write down the assumptions first (parallel trends, exclusion restriction, etc.).
- Then treat the model output as conditional on those assumptions, not as “truth”.


<a id="load-panel-and-define-variables"></a>
## Load panel and define variables

### Goal
Load the county-year panel and build a small modeling table.



### Your Turn: Load panel (processed or sample)


In [None]:
import numpy as np
import pandas as pd

path = PROCESSED_DIR / 'census_county_panel.csv'
if path.exists():
    df = pd.read_csv(path)
else:
    df = pd.read_csv(SAMPLE_DIR / 'census_county_panel_sample.csv')

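# TODO: Ensure fips/year exist and build a MultiIndex
# zfill restores leading zeros if read_csv parsed fips as an integer (e.g., 1001 -> '01001')
df['fips'] = df['fips'].astype(str).str.zfill(5)
df['year'] = df['year'].astype(int)
df = df.set_index(['fips', 'year'], drop=False).sort_index()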

# Starter transforms
df['log_income'] = np.log(df['B19013_001E'].astype(float))
df['log_rent'] = np.log(df['B25064_001E'].astype(float))

df[['poverty_rate', 'unemployment_rate', 'log_income', 'log_rent']].describe()



<a id="pooled-ols-baseline"></a>
## Pooled OLS baseline

### Goal
Fit a pooled OLS model that ignores the panel structure (no county or year fixed effects).



### Your Turn: Fit pooled OLS


In [None]:
import statsmodels.api as sm

y_col = 'poverty_rate'
x_cols = ['log_income', 'unemployment_rate']

tmp = df[[y_col] + x_cols].dropna().copy()
y = tmp[y_col].astype(float)
X = sm.add_constant(tmp[x_cols].astype(float), has_constant='add')

# TODO: Fit and print a summary (HC3 as a baseline)
res_pool = sm.OLS(y, X).fit(cov_type='HC3')
print(res_pool.summary())



<a id="two-way-fixed-effects"></a>
## Two-way fixed effects

### Goal
Estimate a TWFE model:
- county FE (entity)
- year FE (time)



### Your Turn: Fit TWFE with PanelOLS


In [None]:
from src.causal import fit_twfe_panel_ols

# TODO: Fit TWFE (robust SE)
res_twfe = fit_twfe_panel_ols(
    df,
    y_col=y_col,
    x_cols=x_cols,
    entity_effects=True,
    time_effects=True,
)
print(res_twfe.summary)



<a id="clustered-standard-errors"></a>
## Clustered standard errors

### Goal
Re-fit TWFE with clustered SE.

Typical clustering choice here:
- by state (shared shocks/policies)
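The cell below passes `cluster_col='state'`, which assumes the panel already carries a `state` column. If yours does not, one option is to derive it from the county FIPS code, whose first two characters identify the state. The sketch uses a toy frame and assumes `fips` is stored as a 5-character string with leading zeros intact.

```python
import pandas as pd

# Toy frame with 5-character county FIPS strings (leading zeros preserved)
toy = pd.DataFrame({"fips": ["01001", "01003", "06037", "48201"]})

# The first two characters of a county FIPS code identify the state
toy["state"] = toy["fips"].str[:2]

n_clusters = toy["state"].nunique()
print(toy)
print("number of clusters:", n_clusters)
```

Counting the clusters is worth doing on the real data too: cluster-robust SE rely on having many clusters, and they can be unreliable when only a handful of states are present (for example, in a small sample file).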



### Your Turn: Cluster by state and compare SE


In [None]:
import pandas as pd
from src.causal import fit_twfe_panel_ols

# TODO: Compare robust vs clustered SE
res_cluster = fit_twfe_panel_ols(
    df,
    y_col=y_col,
    x_cols=x_cols,
    entity_effects=True,
    time_effects=True,
    cluster_col='state',
)

pd.DataFrame({'robust_se': res_twfe.std_errors, 'cluster_se': res_cluster.std_errors})



<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2-3 sentences summarizing what you verified.



In [None]:
import pandas as pd

# Expected output: (see notebook front matter)
# TODO: Verify the indexing + core columns of the panel DataFrame (`df` in this notebook).
# Example (adjust variable names if yours differ):
# assert isinstance(df.index, pd.MultiIndex)
# assert list(df.index.names[:2]) == ['fips', 'year']
# assert df['year'].astype(int).between(1900, 2100).all()
# assert df['fips'].astype(str).str.len().eq(5).all()
#
# TODO: Write 2-3 sentences:
# - What is the identification assumption for your causal estimate?
# - What diagnostic/falsification did you run?
...



## Extensions (Optional)
- Try one additional variant beyond the main path (different features, different split, different model).
- Write down what improved, what got worse, and your hypothesis for why.



## Reflection
- What did you assume implicitly (about timing, availability, stationarity, or costs)?
- If you had to ship this model, what would you monitor?



<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Load panel and define variables</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_panel_fixed_effects_clustered_se — Load panel and define variables
import numpy as np
import pandas as pd

path = PROCESSED_DIR / 'census_county_panel.csv'
if path.exists():
    df = pd.read_csv(path)
else:
    df = pd.read_csv(SAMPLE_DIR / 'census_county_panel_sample.csv')

# zfill restores leading zeros if read_csv parsed fips as an integer
df['fips'] = df['fips'].astype(str).str.zfill(5)
df['year'] = df['year'].astype(int)
df = df.set_index(['fips', 'year'], drop=False).sort_index()

df['log_income'] = np.log(df['B19013_001E'].astype(float))
df['log_rent'] = np.log(df['B25064_001E'].astype(float))
df[['poverty_rate', 'log_income', 'unemployment_rate']].describe()
```

</details>

<details><summary>Solution: Pooled OLS baseline</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_panel_fixed_effects_clustered_se — Pooled OLS baseline
import statsmodels.api as sm

tmp = df[['poverty_rate', 'log_income', 'unemployment_rate']].dropna().copy()
y = tmp['poverty_rate'].astype(float)
X = sm.add_constant(tmp[['log_income', 'unemployment_rate']].astype(float), has_constant='add')
# Same result name as the notebook cell (res_pool)
res_pool = sm.OLS(y, X).fit(cov_type='HC3')
print(res_pool.summary())
```

</details>

<details><summary>Solution: Two-way fixed effects</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_panel_fixed_effects_clustered_se — Two-way fixed effects
from src.causal import fit_twfe_panel_ols

res_twfe = fit_twfe_panel_ols(
    df,
    y_col='poverty_rate',
    x_cols=['log_income', 'unemployment_rate'],
    entity_effects=True,
    time_effects=True,
)
print(res_twfe.summary)
```

</details>

<details><summary>Solution: Clustered standard errors</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 01_panel_fixed_effects_clustered_se — Clustered standard errors
import pandas as pd
from src.causal import fit_twfe_panel_ols

res_cluster = fit_twfe_panel_ols(
    df,
    y_col='poverty_rate',
    x_cols=['log_income', 'unemployment_rate'],
    entity_effects=True,
    time_effects=True,
    cluster_col='state',
)

pd.DataFrame({'robust_se': res_twfe.std_errors, 'cluster_se': res_cluster.std_errors})
```

</details>

