# 00 Build Census County Panel

Build a county-year ACS panel for panel/DiD methods.


## Table of Contents
- [Choose years + variables](#choose-years-variables)
- [Fetch/cache ACS tables](#fetch-cache-acs-tables)
- [Build panel + FIPS](#build-panel-fips)
- [Save processed panel](#save-processed-panel)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)


## Why This Notebook Matters
Causal notebooks focus on **identification**: what would have to be true for a coefficient to represent a causal effect.
You will practice:
- building a county-year panel,
- fixed effects (TWFE),
- clustered standard errors,
- DiD + event studies,
- IV/2SLS.


## Prerequisites (Quick Self-Check)
- Completed Part 02 (regression + robust SE).
- Basic familiarity with panels (same unit over time) and the idea of identification assumptions.

## What You Will Produce
- data/processed/census_county_panel.csv

## Success Criteria
- You can explain what you built and why each step exists.
- You can run your work end-to-end without undefined variables.
- You can point to the concrete deliverable(s) listed above and explain how they were produced.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Treating regression output as causal without stating identification assumptions.
- Using non-clustered SE when shocks are correlated within groups (e.g., states).

## Quick Fixes (When You Get Stuck)
- If you see `ModuleNotFoundError`, re-run the bootstrap cell and restart the kernel; make sure `PROJECT_ROOT` is the repo root.
- If a `data/processed/*` file is missing, either run the matching build script (see guide) or use the notebook’s `data/sample/*` fallback.
- If results look “too good,” suspect leakage; re-check shifts, rolling windows, and time splits.
- If a model errors, check dtypes (`astype(float)`) and missingness (`dropna()` on required columns).

## Matching Guide
- `docs/guides/07_causal/00_build_census_county_panel.md`



## How To Use This Notebook
- Work section-by-section; don’t skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2–4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/07_causal/00_build_census_county_panel.md`) for the math, assumptions, and deeper context.



<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.



In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT



## Goal
Build a multi-year county dataset suitable for panel methods (FE/DiD).

Important framing:
- This is **not** a panel of the same individuals.
- It is repeated cross-sections summarized at the county level.
- Panel methods can still be useful, but interpretation must be careful.



## Primer: Paths, files, and environment variables (how this repo stays reproducible)

You will see a few patterns repeatedly in notebooks and scripts.

### Environment variables (API keys)

Environment variables are key/value settings provided by your shell to Python.
This repo uses them for API keys:
- `FRED_API_KEY`
- `CENSUS_API_KEY` (optional)

```python
import os

fred_key = os.getenv("FRED_API_KEY")
print("FRED key set?", fred_key is not None)
```

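If you set a key in a terminal, restart Jupyter from that same terminal session so Python sees it. A kernel restart alone is not enough, because the kernel inherits its environment from the Jupyter server process.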

### Paths (why `pathlib.Path` is the default)

Use `Path` to build OS-safe file paths:

```python
from pathlib import Path

p = Path("data") / "sample" / "macro_quarterly_sample.csv"
print(p, "exists?", p.exists())
```

### Repo bootstrap variables (defined in every notebook)

The notebook bootstrap cell defines:
- `PROJECT_ROOT` (repo root)
- `DATA_DIR`, `RAW_DIR`, `PROCESSED_DIR`, `SAMPLE_DIR`

Prefer these over hard-coded relative paths.

### Sample vs processed data (offline-first)

Most notebooks follow this pattern:
1) try `data/processed/*` (real pipeline output)
2) fall back to `data/sample/*` (small offline dataset)

This keeps notebooks runnable without network access.
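
The processed-then-sample pattern can be sketched as a small helper. This is a minimal illustration, not the repo's actual loader: `first_existing` is a hypothetical name, and the demo uses a temporary directory and made-up file names in place of `data/`.

```python
from pathlib import Path
import tempfile


def first_existing(*candidates: Path) -> Path:
    """Return the first candidate path that exists on disk."""
    for path in candidates:
        if path.exists():
            return path
    raise FileNotFoundError(f"None of these paths exist: {[str(c) for c in candidates]}")


# Demo in a throwaway directory: only the sample file exists.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    processed = root / "processed" / "panel.csv"
    sample = root / "sample" / "panel_sample.csv"
    sample.parent.mkdir(parents=True)
    sample.write_text("fips,year\n01001,2014\n")

    chosen = first_existing(processed, sample)
    chosen_name = chosen.name
    print("using", chosen_name)
```

Listing the candidates in priority order keeps the "real data first, sample second" rule in one place.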

### Common “file not found” fixes

- Print the path and check `.exists()`
- Print current working directory:
  - `import os; print(os.getcwd())`
- Start Jupyter from the repo root (so bootstrap can find `src/` and `docs/`)


<a id="choose-years-variables"></a>
## Choose years + variables

### Background
This project treats a dataset config as a **contract**:
- which years are included,
- which variables are fetched,
- and which geography level the rows represent.

ACS variable names look like codes (e.g., `B19013_001E`). That is normal.
Your job is to keep a small “data dictionary” as you go: what each code measures and what the units are.
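
One lightweight way to keep that data dictionary is a plain Python mapping next to your code. The sketch below is an assumption about how you might organize it, not something the repo provides. The labels are paraphrased, so confirm them against the Census variable list for the dataset and year you pull.

```python
# Hypothetical data dictionary for ACS codes that appear in this notebook.
# Labels are paraphrased; verify them against the official variable list.
DATA_DICTIONARY = {
    "B19013_001E": {"label": "Median household income (past 12 months)", "unit": "dollars"},
    "B23025_002E": {"label": "Population 16+ in the labor force", "unit": "people"},
    "B23025_005E": {"label": "Civilian labor force, unemployed", "unit": "people"},
    "B17001_002E": {"label": "Income in the past 12 months below poverty level", "unit": "people"},
    "B01003_001E": {"label": "Total population", "unit": "people"},
}


def describe(code: str) -> str:
    """Format one dictionary entry, or flag a code that is not documented yet."""
    entry = DATA_DICTIONARY.get(code)
    if entry is None:
        return f"{code}: (not documented yet)"
    return f"{code}: {entry['label']} [{entry['unit']}]"


for code in ["B19013_001E", "B23025_005E", "B99999_001E"]:
    print(describe(code))
```

Running `describe` over `acs_vars` is a quick way to spot codes you have not written down yet.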

### What you should see
- `years` is a list of years (default: 2014–2022).
- `acs_vars` is a list of ACS variable codes.
- `geo_for`/`geo_in` describe a county-within-state query.

### Interpretation prompts
- Pick 2 ACS variables and write (in words) what they measure.
- Which variables will be numerators vs denominators for rates?

### Goal
Load a default panel config (`configs/census_panel.yaml`) and inspect:
- years
- ACS variables
- geography
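
If you want the "config as contract" idea to fail loudly, a small validation step can help. The sketch below assumes the config has the shape implied by the keys the loading cell reads (`acs_panel`, `years`, `get`, `geography`). `validate_panel_config` is a hypothetical helper and the example values are illustrative, not the contents of `configs/census_panel.yaml`.

```python
def validate_panel_config(cfg: dict) -> dict:
    """Check the keys the loading cell reads and return normalized values."""
    acs = cfg["acs_panel"]
    years = [int(y) for y in acs["years"]]
    acs_vars = list(acs["get"])
    geography = acs["geography"]

    if not years:
        raise ValueError("config lists no years")
    if len(set(years)) != len(years):
        raise ValueError("config lists a year more than once")
    if not acs_vars:
        raise ValueError("config lists no ACS variables")
    if "for" not in geography:
        raise ValueError("geography needs a 'for' clause")

    return {
        "years": sorted(years),
        "dataset": acs.get("dataset", "acs/acs5"),
        "get": acs_vars,
        "for": geography["for"],
        "in": geography.get("in"),
    }


# Illustrative config only; the real values live in configs/census_panel.yaml.
example_cfg = {
    "acs_panel": {
        "years": [2015, 2014],
        "get": ["B01003_001E"],
        "geography": {"for": "county:*", "in": "state:*"},
    }
}
validated = validate_panel_config(example_cfg)
print(validated)
```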



### Your Turn: Load the panel config


In [None]:
import yaml

cfg_path = PROJECT_ROOT / 'configs' / 'census_panel.yaml'
cfg = yaml.safe_load(cfg_path.read_text())

acs = cfg['acs_panel']
years = list(acs['years'])
dataset = acs.get('dataset', 'acs/acs5')
acs_vars = list(acs['get'])
geo_for = acs['geography']['for']
geo_in = acs['geography'].get('in')

years[:5], acs_vars



<a id="fetch-cache-acs-tables"></a>
## Fetch/cache ACS tables

### Background
In applied work, you almost never want to hit an API repeatedly during experiments.
So we cache raw pulls under `data/raw/` and build a clean panel under `data/processed/`.

This notebook is offline-first:
- if cached raw CSVs exist, we load them,
- otherwise we fall back to the bundled sample panel.

### What you should see
- Either `frames` is non-empty (cached raw CSVs found), or you see a message that the sample panel is used.
- `panel_raw` contains county rows with columns like `state`, `county`, and ACS variables.

### Interpretation prompts
- Where on disk is the cache for a given year stored?
- What would you change in the config to add/remove variables?

### Goal
For each year, load a cached raw CSV if available. Fetching from the Census API and caching the result is left as an optional TODO.

Offline default:
- If nothing is cached, use `data/sample/census_county_panel_sample.csv`.
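
The cache-or-fetch idea can be sketched independently of the Census client. In the example below, `load_or_fetch` is a hypothetical helper and `fake_fetch` is a stand-in for a real API call that returns made-up data, so nothing here reflects the actual `src.census_api` interface.

```python
from pathlib import Path
import tempfile

import pandas as pd


def load_or_fetch(path: Path, fetch) -> pd.DataFrame:
    """Read `path` if it is cached; otherwise call `fetch()` and cache the result."""
    if path.exists():
        # Read geo codes as strings so '01' is not parsed as the integer 1.
        return pd.read_csv(path, dtype={"state": str, "county": str})
    df = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df


calls = []


def fake_fetch() -> pd.DataFrame:
    # Stand-in for a real API call; records how often it runs.
    calls.append(1)
    return pd.DataFrame({"state": ["01"], "county": ["001"], "B01003_001E": [58000]})


with tempfile.TemporaryDirectory() as tmp:
    cache_path = Path(tmp) / "census" / "acs_county_2014.csv"
    first = load_or_fetch(cache_path, fake_fetch)   # fetches and writes the cache
    second = load_or_fetch(cache_path, fake_fetch)  # served from disk

print("fetch calls:", len(calls))
print(second)
```

Passing the fetcher in as an argument keeps the caching logic testable offline.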



### Your Turn: Load cached tables or fall back to sample


In [None]:
import pandas as pd
from src import census_api

raw_dir = RAW_DIR / 'census'
raw_dir.mkdir(parents=True, exist_ok=True)

frames = []
for year in years:
    p = raw_dir / f'acs_county_{int(year)}.csv'
    if p.exists():
        df_y = pd.read_csv(p)
        frames.append((int(year), df_y))
    else:
        # TODO (optional): fetch and cache.
        # df_y = census_api.fetch_acs(year=int(year), dataset=dataset, get=acs_vars, for_geo=geo_for, in_geo=geo_in)
        # df_y.to_csv(p, index=False)
        # frames.append((int(year), df_y))
        pass

if not frames:
    print('No cached raw CSVs found. Using bundled sample panel.')
    panel_raw = pd.read_csv(SAMPLE_DIR / 'census_county_panel_sample.csv')
else:
    # Attach year and concatenate
    tmp = []
    for year, df_y in frames:
        df_y = df_y.copy()
        df_y['year'] = year
        tmp.append(df_y)
    panel_raw = pd.concat(tmp, ignore_index=True)

panel_raw.head()



<a id="build-panel-fips"></a>
## Build panel + FIPS

### Background
Panel methods require stable unit identifiers.
For U.S. counties, a standard identifier is **FIPS**:
- 2-digit state code + 3-digit county code.

We also build key derived outcomes as rates so later regressions have consistent units.

### What you should see
- `fips` is a 5-character string.
- `year` is an integer.
- `poverty_rate` and `unemployment_rate` are usually between 0 and 1.
- the DataFrame has a MultiIndex `('fips','year')` and is sorted.

### Interpretation prompts
- Why do we `zfill` the state/county codes?
- If a rate is outside [0, 1], what data issues could cause it?

### Goal
Create stable identifiers and derived rates:
- `fips` = state (2-digit) + county (3-digit)
- `unemployment_rate`, `poverty_rate`
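
To see why the zero-padding and the zero-denominator guard matter, here is a toy sketch with made-up rows. The column names `unemployed` and `labor_force` are placeholders for the ACS codes, and the codes are stored as integers the way a CSV reader would infer them.

```python
import pandas as pd

# Made-up rows: geo codes as integers, one county with an empty labor force.
toy = pd.DataFrame(
    {
        "state": [1, 6, 48],
        "county": [1, 37, 301],
        "unemployed": [50.0, 120.0, 0.0],
        "labor_force": [1000.0, 2000.0, 0.0],
    }
)

# Without zero-padding, state 1 + county 1 would give '11' instead of '01001'.
toy["fips"] = toy["state"].astype(str).str.zfill(2) + toy["county"].astype(str).str.zfill(3)

# Turn zero denominators into NaN so the rate is missing rather than 0/0.
denominator = toy["labor_force"].replace(0.0, float("nan"))
toy["unemployment_rate"] = toy["unemployed"] / denominator

# Flag rates outside [0, 1]; missing rates are not flagged.
out_of_range = toy["unemployment_rate"].notna() & ~toy["unemployment_rate"].between(0, 1)

print(toy[["fips", "unemployment_rate"]])
print("rows out of range:", int(out_of_range.sum()))
```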



### Your Turn: Clean geo ids, build fips, derived rates


In [None]:
import pandas as pd

df = panel_raw.copy()

# Geo ids
df['state'] = df['state'].astype(str).str.zfill(2)
df['county'] = df['county'].astype(str).str.zfill(3)
df['fips'] = df['state'] + df['county']
df['year'] = df['year'].astype(int)

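# Derived rates. Zero denominators become NaN so the rate is missing, not 0/0.
# (Replacing 0 with pd.NA can leave an object column that .astype(float) cannot convert.)
labor_force = df['B23025_002E'].astype(float).replace(0.0, float('nan'))
population = df['B01003_001E'].astype(float).replace(0.0, float('nan'))
df['unemployment_rate'] = df['B23025_005E'].astype(float) / labor_force
df['poverty_rate'] = df['B17001_002E'].astype(float) / population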

# Panel index (PanelOLS-ready)
panel = df.set_index(['fips', 'year'], drop=False).sort_index()

panel[['state', 'county', 'fips', 'year', 'unemployment_rate', 'poverty_rate']].head()



<a id="save-processed-panel"></a>
## Save processed panel

### Background
This file is the handoff between the data pipeline and the causal notebooks.
Once you write `data/processed/census_county_panel.csv`, later notebooks can run without rebuilding the panel.

### What you should see
- a new file at `data/processed/census_county_panel.csv`.
- reloading the file produces a non-empty DataFrame.

### Interpretation prompts
- What columns are essential for later FE/DiD notebooks?
- What would you add to the panel if you wanted a richer causal story?

### Goal
Write a panel dataset to `data/processed/census_county_panel.csv`.
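
One thing to watch when a CSV containing FIPS codes is read back: pandas infers integer columns by default, which drops leading zeros. The sketch below uses made-up rows and an in-memory buffer instead of the real output file to show the difference.

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"fips": ["01001", "06037"], "year": [2014, 2014], "poverty_rate": [0.12, 0.15]}
)

# Write to an in-memory buffer instead of disk for the demo.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Default parsing infers integers, so '01001' comes back as 1001.
naive = pd.read_csv(io.StringIO(csv_text))

# Forcing a string dtype preserves the 5-character code.
safe = pd.read_csv(io.StringIO(csv_text), dtype={"fips": str})

print(naive["fips"].tolist())
print(safe["fips"].tolist())
```

Later notebooks can either pass `dtype` on read or re-apply `zfill(5)` after loading.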



### Your Turn: Save + reload


In [None]:
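out_path = PROCESSED_DIR / 'census_county_panel.csv'
out_path.parent.mkdir(parents=True, exist_ok=True)
# `fips` and `year` are still regular columns (drop=False above), so skip the
# index to avoid writing each of them twice.
panel.to_csv(out_path, index=False)

print('wrote', out_path)

# Quick reload; read the geo codes as strings so leading zeros survive.
check = pd.read_csv(out_path, dtype={'state': str, 'county': str, 'fips': str})
check.head()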



<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2-3 sentences summarizing what you verified.
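
Beyond the index and column checks, two panel-specific checks are often worth running: that each (fips, year) pair is unique, and whether every county has the same number of years. The sketch below uses a made-up three-row panel; adapt the names to your own `panel`.

```python
import pandas as pd

# Made-up panel: county 01001 has two years, county 06037 only one.
toy = pd.DataFrame(
    {
        "fips": ["01001", "01001", "06037"],
        "year": [2014, 2015, 2014],
        "poverty_rate": [0.12, 0.11, 0.15],
    }
).set_index(["fips", "year"]).sort_index()

# Each (fips, year) pair should appear at most once.
has_duplicates = bool(toy.index.duplicated().any())

# A balanced panel has the same number of years for every county.
years_per_county = toy.groupby(level="fips").size()
is_balanced = bool(years_per_county.nunique() == 1)

print("duplicates:", has_duplicates)
print("balanced:", is_balanced)
print(years_per_county)
```

An unbalanced panel is not necessarily an error, but it is worth knowing about before running FE or DiD models.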



In [None]:
import pandas as pd

# Expected file: data/processed/census_county_panel.csv
# TODO: If you created a panel DataFrame, verify the indexing + core columns.
# Example (adjust variable names):
# assert isinstance(panel.index, pd.MultiIndex)
# assert panel.index.names[:2] == ['fips', 'year']
# assert panel['year'].astype(int).between(1900, 2100).all()
# assert panel['fips'].astype(str).str.len().eq(5).all()
#
# TODO: Write 2-3 sentences:
# - What is the identification assumption for your causal estimate?
# - What diagnostic/falsification did you run?
...



## Extensions (Optional)
- Try one additional variant beyond the main path (different features, different split, different model).
- Write down what improved, what got worse, and your hypothesis for why.



## Reflection
- What did you assume implicitly (about timing, availability, stationarity, or costs)?
- If you had to ship this model, what would you monitor?



<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Choose years + variables</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 00_build_census_county_panel — Choose years + variables
import yaml

cfg = yaml.safe_load((PROJECT_ROOT / 'configs' / 'census_panel.yaml').read_text())
acs = cfg['acs_panel']
years = list(acs['years'])
acs_vars = list(acs['get'])
dataset = acs.get('dataset', 'acs/acs5')
geo_for = acs['geography']['for']
geo_in = acs['geography'].get('in')

years[:3], acs_vars[:5]
```

</details>

<details><summary>Solution: Fetch/cache ACS tables</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 00_build_census_county_panel — Fetch/cache ACS tables
import pandas as pd

# Offline default: load the bundled sample panel.
panel_raw = pd.read_csv(SAMPLE_DIR / 'census_county_panel_sample.csv')
panel_raw.head()
```

</details>

<details><summary>Solution: Build panel + FIPS</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 00_build_census_county_panel — Build panel + FIPS
import pandas as pd

df = panel_raw.copy()
df['state'] = df['state'].astype(str).str.zfill(2)
df['county'] = df['county'].astype(str).str.zfill(3)
df['fips'] = df['state'] + df['county']
df['year'] = df['year'].astype(int)

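# Recompute derived rates. Zero denominators become NaN so the rate is missing, not 0/0.
# (Replacing 0 with pd.NA can leave an object column that .astype(float) cannot convert.)
labor_force = df['B23025_002E'].astype(float).replace(0.0, float('nan'))
population = df['B01003_001E'].astype(float).replace(0.0, float('nan'))
df['unemployment_rate'] = df['B23025_005E'].astype(float) / labor_force
df['poverty_rate'] = df['B17001_002E'].astype(float) / population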

panel = df.set_index(['fips', 'year'], drop=False).sort_index()
panel.head()
```

</details>

<details><summary>Solution: Save processed panel</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 00_build_census_county_panel — Save processed panel
out_path = PROCESSED_DIR / 'census_county_panel.csv'
out_path.parent.mkdir(parents=True, exist_ok=True)
# `fips` and `year` are kept as columns (drop=False), so the index is not written again.
panel.to_csv(out_path, index=False)

print('wrote', out_path)
```

</details>

