# 04 Census API Microdata Fetch

Fetch county-level ACS data and build a micro dataset.


## Table of Contents
- [Browse variables](#browse-variables)
- [Fetch county data](#fetch-county-data)
- [Derived rates](#derived-rates)
- [Save processed data](#save-processed-data)
- [Checkpoint (Self-Check)](#checkpoint-self-check)
- [Solutions (Reference)](#solutions-reference)


## Why This Notebook Matters
Data notebooks build the datasets used everywhere else. If these steps are wrong, every model result is suspect.
You will practice:
- API ingestion and caching,
- numeric type cleaning,
- building derived rates from raw counts.


## Prerequisites (Quick Self-Check)
- Completed Part 00 (foundations) or equivalent time-series basics.
- Optional: Census API key set (`CENSUS_API_KEY`) for real data (sample data works offline).

## What You Will Produce
- data/processed/census_county_<year>.csv

## Success Criteria
- You can explain what you built and why each step exists.
- You can run your work end-to-end without undefined variables.
- You can point to the concrete deliverable(s) listed above and explain how they were produced.

## Common Pitfalls
- Running cells top-to-bottom without reading the instructions.
- Leaving `...` placeholders in code cells.
- Doing arithmetic on API columns before casting them to numeric types (values can arrive as strings).
- Dividing by a zero denominator when building rates instead of replacing it with NaN.

## Quick Fixes (When You Get Stuck)
- If you see `ModuleNotFoundError`, re-run the bootstrap cell and restart the kernel; make sure `PROJECT_ROOT` is the repo root.
- If a `data/processed/*` file is missing, either run the matching build script (see guide) or use the notebook’s `data/sample/*` fallback.
- If a rate falls outside [0, 1], re-check which column is the numerator and which is the denominator.
- If a model errors, check dtypes (`astype(float)`) and missingness (`dropna()` on required columns).

## Matching Guide
- `docs/guides/01_data/04_census_api_microdata_fetch.md`



## How To Use This Notebook
- Work section-by-section; don’t skip the markdown.
- Most code cells are incomplete on purpose: replace TODOs and `...`, then run.
- After each section, write 2–4 sentences answering the interpretation prompts (what changed, why it matters).
- Prefer `data/processed/*` if you have built the real datasets; otherwise use the bundled `data/sample/*` fallbacks.
- Use the **Checkpoint (Self-Check)** section to catch mistakes early.
- Use **Solutions (Reference)** only to unblock yourself; then re-implement without looking.
- Use the matching guide (`docs/guides/01_data/04_census_api_microdata_fetch.md`) for the math, assumptions, and deeper context.



<a id="environment-bootstrap"></a>
## Environment Bootstrap
Run this cell first. It makes the repo importable and defines common directories.



In [None]:
from __future__ import annotations

from pathlib import Path
import sys


def find_repo_root(start: Path) -> Path:
    p = start
    for _ in range(8):
        if (p / 'src').exists() and (p / 'docs').exists():
            return p
        p = p.parent
    raise RuntimeError('Could not find repo root. Start Jupyter from the repo root.')


PROJECT_ROOT = find_repo_root(Path.cwd())
if str(PROJECT_ROOT) not in sys.path:
    sys.path.append(str(PROJECT_ROOT))

DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
SAMPLE_DIR = DATA_DIR / 'sample'

PROJECT_ROOT



## Goal
Build a county-level micro dataset from the US Census ACS API.

### Why this matters
This micro track is deliberately different from macro time series:
- observations are counties (not time)
- regression interpretation focuses on cross-sectional relationships
- robust SE (HC3) is usually more relevant than time-series HAC



## Primer: Paths, files, and environment variables (how this repo stays reproducible)

You will see a few patterns repeatedly in notebooks and scripts.

### Environment variables (API keys)

Environment variables are key/value settings provided by your shell to Python.
This repo uses them for API keys:
- `FRED_API_KEY`
- `CENSUS_API_KEY` (optional)

```python
import os

fred_key = os.getenv("FRED_API_KEY")
print("FRED key set?", fred_key is not None)
```

If you set a key in a terminal, restart the Jupyter kernel so Python sees it.

### Paths (why `pathlib.Path` is the default)

Use `Path` to build OS-safe file paths:

```python
from pathlib import Path

p = Path("data") / "sample" / "macro_quarterly_sample.csv"
print(p, "exists?", p.exists())
```

### Repo bootstrap variables (defined in every notebook)

The notebook bootstrap cell defines:
- `PROJECT_ROOT` (repo root)
- `DATA_DIR`, `RAW_DIR`, `PROCESSED_DIR`, `SAMPLE_DIR`

Prefer these over hard-coded relative paths.

### Sample vs processed data (offline-first)

Most notebooks follow this pattern:
1) try `data/processed/*` (real pipeline output)
2) fall back to `data/sample/*` (small offline dataset)

This keeps notebooks runnable without network access.
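
The two-step lookup above can be sketched as a small helper. This is a minimal illustration, not code from the repo: the name `pick_data_path` is hypothetical, and it only chooses a path, leaving the actual loading to the caller.

```python
from pathlib import Path


def pick_data_path(processed: Path, sample: Path) -> Path:
    """Return the processed file if it exists, otherwise the sample file.

    Hypothetical helper for illustration; the repo may structure this differently.
    """
    if processed.exists():
        return processed
    if sample.exists():
        return sample
    raise FileNotFoundError(f"Neither {processed} nor {sample} exists.")
```

In this notebook it could be called as `pick_data_path(PROCESSED_DIR / f'census_county_{year}.csv', SAMPLE_DIR / 'census_county_sample.csv')`, assuming the bootstrap cell has already run.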

### Common “file not found” fixes

- Print the path and check `.exists()`
- Print current working directory:
  - `import os; print(os.getcwd())`
- Start Jupyter from the repo root (so bootstrap can find `src/` and `docs/`)


<a id="browse-variables"></a>
## Browse variables

### Goal
Learn how ACS variable codes work and choose a starter set.

We'll focus on a practical starter set:
- population
- median household income
- median gross rent
- median home value
- poverty count (to build a poverty rate)
- labor force / unemployment (to build an unemployment rate)
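
To make the code structure concrete: an ACS detailed-table variable such as `B19013_001E` combines a table ID (`B19013`), a three-digit line number within that table (`001`), and a suffix (`E` for estimate, `M` for margin of error). The sketch below splits a code into those parts. The names `AcsCode` and `parse_acs_code` are hypothetical, and the pattern deliberately handles only the plain `E`/`M` suffixes.

```python
import re
from typing import NamedTuple, Optional


class AcsCode(NamedTuple):
    table: str  # table ID, e.g. 'B19013'
    line: int   # line number within the table, e.g. 1
    kind: str   # 'E' = estimate, 'M' = margin of error


# Table ID, underscore, three-digit line number, then E or M.
_CODE_RE = re.compile(r"^([A-Z0-9]+)_(\d{3})([EM])$")


def parse_acs_code(code: str) -> Optional[AcsCode]:
    """Split an ACS variable code into its parts; return None if it does not match."""
    match = _CODE_RE.match(code)
    if match is None:
        return None  # e.g. 'NAME' is a label column, not a table variable
    table, line, kind = match.groups()
    return AcsCode(table=table, line=int(line), kind=kind)


for code in ["B19013_001E", "B23025_005M", "NAME"]:
    print(code, "->", parse_acs_code(code))
```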



### Your Turn (1): Fetch or load variables.json


In [None]:
import json
from src import census_api

year = 2022  # TODO: change if you want a different year
raw_dir = RAW_DIR / 'census'
raw_dir.mkdir(parents=True, exist_ok=True)
vars_path = raw_dir / f'variables_{year}.json'

# TODO: Load variables metadata.
# - If vars_path exists, load it from disk.
# - Otherwise, fetch from the API and save it to vars_path.
...



### Your Turn (2): Search for relevant variables


In [None]:
# The variables metadata is a nested JSON structure.
# TODO: Explore it and search for keywords like:
# - 'Median household income'
# - 'Median gross rent'
# - 'Poverty'
# - 'Labor force'

# Hint: variables are typically under payload['variables'].
...



<a id="fetch-county-data"></a>
## Fetch county data

### Goal
Fetch a county-level table for your chosen variables.

Default geography:
- all counties: `for=county:*`
- within all states: `in=state:*`
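
For orientation, here is a sketch of how those geography clauses appear in a raw request to the ACS 5-year endpoint, and how the JSON response (a list of rows whose first row is the header) can be turned into records. `build_acs_url` and `rows_to_records` are hypothetical names, and this is not necessarily how `census_api.fetch_acs` works internally. The response row is made up.

```python
from urllib.parse import urlencode


def build_acs_url(year, variables, for_geo="county:*", in_geo="state:*", api_key=None):
    """Build a query URL for the ACS 5-year dataset (illustrative helper)."""
    base = f"https://api.census.gov/data/{year}/acs/acs5"
    params = {"get": ",".join(variables), "for": for_geo, "in": in_geo}
    if api_key:
        params["key"] = api_key
    # Keep ',', ':' and '*' unescaped so the URL stays readable.
    return f"{base}?{urlencode(params, safe=',:*')}"


def rows_to_records(rows):
    """Convert [header, row, row, ...] into a list of dicts keyed by the header."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


print(build_acs_url(2022, ["NAME", "B01003_001E"]))

# Made-up response in the API's shape: values are strings, geography columns come last.
fake_response = [
    ["NAME", "B01003_001E", "state", "county"],
    ["Example County, Example State", "12345", "01", "001"],
]
print(rows_to_records(fake_response))
```

Note that every value in the response is a string, including the population count and the `state`/`county` codes.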



### Your Turn (1): Choose a starter variable set


In [None]:
# TODO: Use a starter set.
# These are commonly-used ACS 5-year estimate codes:
acs_vars = [
    'NAME',
    'B01003_001E',  # total population
    'B19013_001E',  # median household income
    'B25064_001E',  # median gross rent
    'B25077_001E',  # median home value
    'B17001_002E',  # count below poverty level
    'B23025_002E',  # in labor force
    'B23025_005E',  # unemployed
]

acs_vars



### Your Turn (2): Fetch the ACS table


In [None]:
import pandas as pd
from src import census_api

# TODO: Fetch the data from the API.
# Hint: census_api.fetch_acs(year=..., get=..., for_geo='county:*', in_geo='state:*')
try:
    df_raw = census_api.fetch_acs(year=year, get=acs_vars, for_geo='county:*', in_geo='state:*')
except Exception as exc:
    df_raw = None
    print('Fetch failed, will use sample. Error:', exc)

df_raw.head() if df_raw is not None else None



### Your Turn (3): Fallback to sample


In [None]:
import pandas as pd

# TODO: If df_raw is None, load the sample dataset.
if df_raw is None:
    df_raw = pd.read_csv(SAMPLE_DIR / 'census_county_sample.csv')

df_raw.head()



<a id="derived-rates"></a>
## Derived rates

### Goal
Turn raw counts into rates (more comparable across counties).

You will build:
- unemployment_rate = unemployed / labor_force
- poverty_rate = below_poverty / population (an approximation: the poverty universe total, `B17001_001E`, is the stricter denominator)
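
A worked toy example of both formulas, assuming values arrive as strings and that empty denominators should give a missing rate. The rows are invented, and `to_clean_numeric` and `safe_rate` are hypothetical helper names. The sketch also treats negative values as missing, because ACS estimates can carry large negative sentinel codes (for example `-666666666`) where an estimate is unavailable; that masking rule is an assumption that suits counts and medians, not every variable.

```python
import numpy as np
import pandas as pd

# Invented rows shaped like an API response: every value is a string.
toy = pd.DataFrame({
    "B01003_001E": ["1000", "0", "500"],        # total population
    "B17001_002E": ["150", "0", "-666666666"],  # below poverty (last row: sentinel)
    "B23025_002E": ["600", "0", "250"],         # in labor force
    "B23025_005E": ["30", "0", "10"],           # unemployed
})


def to_clean_numeric(s: pd.Series) -> pd.Series:
    """Cast to numeric, turning unparseable and negative values into NaN."""
    out = pd.to_numeric(s, errors="coerce")
    return out.where(out >= 0)


def safe_rate(num: pd.Series, den: pd.Series) -> pd.Series:
    """Divide num by den, returning NaN where the denominator is zero."""
    return num / den.replace(0, np.nan)


clean = toy.apply(to_clean_numeric)
clean["poverty_rate"] = safe_rate(clean["B17001_002E"], clean["B01003_001E"])
clean["unemployment_rate"] = safe_rate(clean["B23025_005E"], clean["B23025_002E"])
print(clean[["poverty_rate", "unemployment_rate"]])
```

The empty county (second row) ends up with missing rates rather than `inf`, and the sentinel in the third row yields a missing poverty rate instead of a large negative number.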



### Your Turn (1): Cast numeric columns


In [None]:
# TODO: Ensure numeric columns are numeric (API responses often arrive as strings).
# Hint: pd.to_numeric(..., errors='coerce')
...



### Your Turn (2): Build derived rates safely


In [None]:
import numpy as np

# TODO: Compute rates with safe division.
# Replace division-by-zero with NaN.

pop = df_raw['B01003_001E'].astype(float)
labor_force = df_raw['B23025_002E'].astype(float)
unemployed = df_raw['B23025_005E'].astype(float)
below_pov = df_raw['B17001_002E'].astype(float)

df_raw['unemployment_rate'] = unemployed / labor_force.replace({0: np.nan})
df_raw['poverty_rate'] = below_pov / pop.replace({0: np.nan})

df_raw[['unemployment_rate', 'poverty_rate']].describe()



<a id="save-processed-data"></a>
## Save processed data

### Goal
Save a cleaned dataset to `data/processed/census_county_<year>.csv`.
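
One thing to watch in the save and reload round trip: `state` and `county` are FIPS codes with leading zeros (for example `01` and `001`), and a CSV reader that infers integer types will drop those zeros. The sketch below rebuilds a five-character county code from either form. `county_fips` is a hypothetical helper, not part of the repo.

```python
def county_fips(state, county) -> str:
    """Build a 5-character county FIPS code from state (2 digits) and county (3 digits).

    Accepts strings such as '01' or integers such as 1, so it also repairs
    codes whose leading zeros were lost in a CSV round trip.
    """
    return f"{int(state):02d}{int(county):03d}"


print(county_fips("01", "001"))
print(county_fips(1, 1))
```

Both calls give the same code, which makes the combined value a convenient join key across files.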



### Your Turn (1): Save + reload


In [None]:
out_path = PROCESSED_DIR / f'census_county_{year}.csv'
out_path.parent.mkdir(parents=True, exist_ok=True)

# TODO: Select a useful subset of columns and save.
# Suggested: NAME, state, county, raw vars, unemployment_rate, poverty_rate
cols = ['NAME', 'state', 'county'] + [c for c in acs_vars if c not in {'NAME'}] + ['unemployment_rate', 'poverty_rate']
df_out = df_raw[cols].copy()
df_out.to_csv(out_path, index=False)

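# Read the FIPS code columns as strings so leading zeros (e.g. '01') survive the reload.
df_check = pd.read_csv(out_path, dtype={'state': str, 'county': str})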
df_check.head()



### Checkpoint


In [None]:
# TODO: Validate rates are in [0, 1] for most rows.
assert (df_out['unemployment_rate'].dropna().between(0, 1).mean() > 0.95)
assert (df_out['poverty_rate'].dropna().between(0, 1).mean() > 0.95)
...



<a id="checkpoint-self-check"></a>
## Checkpoint (Self-Check)
Run a few asserts and write 2-3 sentences summarizing what you verified.



In [None]:
import pandas as pd

# Expected file: data/processed/census_county_<year>.csv
# TODO: After saving your processed dataset, load it and run checks.
# df = pd.read_csv(PROCESSED_DIR / f'census_county_{year}.csv', dtype={'state': str, 'county': str})
# assert not df.duplicated(subset=['state', 'county']).any()
# assert df.shape[0] > 20
# print(df.dtypes)
...



## Extensions (Optional)
- Try one additional variant beyond the main path (a different year, extra ACS variables, or a single state).
- Write down what improved, what got worse, and your hypothesis for why.



## Reflection
- What did you assume implicitly (about data availability, missing values, or how comparable counties are)?
- If you had to ship this dataset, what would you monitor?



<a id="solutions-reference"></a>
## Solutions (Reference)

Try the TODOs first. Use these only to unblock yourself or to compare approaches.

<details><summary>Solution: Browse variables</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 04_census_api_microdata_fetch — Browse variables
import json

# Offline default
print('Open the Census variables metadata in data/raw/census/variables_<year>.json if available.')
```
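
For the online path, a cache-then-search pattern might look like the sketch below. It assumes the metadata is a dict with a `variables` key mapping each code to a dict that has `label` and `concept` fields. `load_or_fetch_json` and `search_variables` are hypothetical names, and `fetch` stands in for whatever function actually downloads the metadata (no `census_api` function name is assumed here). The payload is a shortened, made-up example.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_fetch_json(path: Path, fetch: Callable[[], dict]) -> dict:
    """Load JSON from disk if cached; otherwise call fetch() and cache the result."""
    if path.exists():
        return json.loads(path.read_text())
    payload = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload))
    return payload


def search_variables(payload: dict, keyword: str) -> dict:
    """Return {code: label} for variables whose label or concept contains keyword."""
    kw = keyword.lower()
    hits = {}
    for code, meta in payload.get("variables", {}).items():
        text = f"{meta.get('label') or ''} {meta.get('concept') or ''}".lower()
        if kw in text:
            hits[code] = meta.get("label") or ""
    return hits


# Shortened, made-up payload in the assumed shape.
toy_payload = {
    "variables": {
        "B19013_001E": {"label": "Estimate!!Median household income", "concept": "MEDIAN HOUSEHOLD INCOME"},
        "B01003_001E": {"label": "Estimate!!Total", "concept": "TOTAL POPULATION"},
    }
}
print(search_variables(toy_payload, "median household income"))
```

Passing the downloader in as a callable keeps the caching logic testable offline.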

</details>

<details><summary>Solution: Fetch county data</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 04_census_api_microdata_fetch — Fetch county data
import pandas as pd

# Offline default sample
df = pd.read_csv(SAMPLE_DIR / 'census_county_sample.csv')
df.head()
```

</details>

<details><summary>Solution: Derived rates</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 04_census_api_microdata_fetch — Derived rates
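import numpy as np

# Zero denominators become NaN so the rate is missing instead of inf.
df['unemployment_rate'] = df['B23025_005E'] / df['B23025_002E'].replace(0, np.nan)
df['poverty_rate'] = df['B17001_002E'] / df['B01003_001E'].replace(0, np.nan)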
df[['unemployment_rate', 'poverty_rate']].describe()
```

</details>

<details><summary>Solution: Save processed data</summary>

_One possible approach. Your variable names may differ; align them with the notebook._

```python
# Reference solution for 04_census_api_microdata_fetch — Save processed data
from src import data as data_utils
year = 2022
data_utils.save_csv(df.set_index(['state','county'], drop=False), PROCESSED_DIR / f'census_county_{year}.csv')
print('saved')
```

</details>

