Carry the block-anchored US geography ladder as spine columns #277

Merged: MaxGhenis merged 2 commits into main from geo-ladder-275 on Jul 2, 2026


Fixes #275.

What

Anchors every household's geography at a 2020 census tabulation block and carries the full ladder as household spine columns on the national artifact: block_geoid, tract_geoid, county_fips, place_fips, sldu, sldl, cbsa_code, alongside the existing state_fips and congressional_district_geoid. One dataset, filter by geography, at any grain — no per-area files.

Blocks are sampled within each household's already-assigned congressional district proportional to 2020 block population, so the calibrated CD and state distributions are preserved exactly; every other rung is a deterministic function of the block (tract/county by prefix, place/SLD by BAF crosswalk, CBSA by county → OMB delineation). Adding a future layer (school districts, PUMAs, ZCTAs) is a crosswalk-only change to the ladder artifact — no rebuild, no re-assignment.
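The assignment step above can be sketched as follows. This is a minimal illustration, not the code in populace.build.us_runtime.geography_ladder: the function names `assign_blocks` and `prefix_rungs`, the array-based inputs, and the seeding scheme are all assumptions. It relies only on the fixed layout of a 15-digit block GEOID (2-digit state, 3-digit county, 6-digit tract, 4-digit block).

```python
import numpy as np


def assign_blocks(household_cd, block_geoid, block_cd, block_pop, seed=0):
    """Draw one block per household within its CD, weighted by block population."""
    household_cd = np.asarray(household_cd)
    block_geoid = np.asarray(block_geoid)
    block_cd = np.asarray(block_cd)
    block_pop = np.asarray(block_pop, dtype=float)
    rng = np.random.default_rng(seed)  # fixed seed keeps the assignment reproducible

    assigned = np.empty(household_cd.shape[0], dtype=block_geoid.dtype)
    for cd in np.unique(household_cd):
        in_cd = block_cd == cd
        total = block_pop[in_cd].sum()
        if total <= 0:
            # Also covers a district with no blocks at all (empty sum is 0).
            raise ValueError(f"no populated blocks for district {cd!r}")
        members = household_cd == cd
        assigned[members] = rng.choice(
            block_geoid[in_cd], size=int(members.sum()), p=block_pop[in_cd] / total
        )
    return assigned


def prefix_rungs(block_geoid):
    """Tract and county are fixed-width prefixes of the 15-digit block GEOID."""
    return {
        "state_fips": block_geoid[:2],
        "county_fips": block_geoid[:5],
        "tract_geoid": block_geoid[:11],
    }
```

Because each household only ever receives a block from its own district, the household-to-CD mapping (and therefore the calibrated CD and state totals) is untouched by the draw; zero-population blocks get probability zero and are never selected.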

Pieces

  • populace.build.us_runtime.geography_ladder: the artifact loader (refuses an artifact missing any layer vintage, i.e. vintage_policy: error, the lesson of #205, "Translate old-vintage CD targets to current district geography"), CD-conditioned block assignment with loud vintage-mismatch and state-prefix checks, the Frame wrapper, a provenance summary, and the us_geography_ladder gate, whose NYC checks are the permanent form of the #34 regression, "Fix in_nyc before enforcing no-formula exports" (NYC weighted share within bounds nationally and within New York State; place/CBSA/SLD coverage anchors; block/tract/county prefix consistency).
  • populace.build.us_runtime.block_ladder_sources: pure, unit-tested parsers for the primary sources. These are the 119th Congressional District BEF (NationalCD119.txt, columns GEOID,CDFP; delegate code 98 maps to at-large 00, and ZZ rows are dropped), the 2020 BAF SLDU/SLDL/INCPLACE_CDP layers (ZZZ maps to unassigned), the P.L. 94-171 geographic headers (block POP100, validated per state against the summary-level-040 row), and the OMB 2023 CBSA delineation workbook.
  • tools/build_us_block_ladder_artifact.py: builds the single national NPZ from those sources with cached downloads, per-layer {vintage, source, url} metadata, per-file sha256 provenance, and a load-back self-check. --states supports smoke subsets.
  • tools/build_us_puf_support_base.py: --block-ladder-artifact assigns the ladder immediately after CD assignment (requires the CD vintage crosswalk so household districts are current-vintage; a ladder/district vintage mismatch is an error, never a partial join), runs the gate as a release blocker (--allow-geography-ladder-gate-failures is the diagnostic escape hatch), records the assignment summary plus weighted coverage shares, and stamps the populace_geography_ladder_artifact_sha256 and populace_geography_ladder_vintages H5 root attrs next to the existing CD-vintage attrs. The fiscal-refresh/L0 pipeline propagates household columns and populace_* attrs unchanged, so the ladder flows to the release artifact with no further wiring.
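The loader's refusal policy can be illustrated with a short sketch. Everything named here is hypothetical: the layer keys, the `LadderVintageError` class, and the shape of the metadata dict are assumptions made for illustration, and the real artifact schema may differ. The point is the behaviour described above: a missing vintage or a ladder/district vintage mismatch raises instead of producing a partial join.

```python
# Hypothetical layer keys; the real artifact metadata may name them differently.
REQUIRED_LAYERS = ("block", "congressional_district", "sldu", "sldl", "place", "cbsa")


class LadderVintageError(ValueError):
    """Raised when ladder metadata cannot be trusted for a join."""


def require_layer_vintages(layers):
    """Return {layer: vintage}, refusing metadata with any vintage missing."""
    missing = [
        name for name in REQUIRED_LAYERS
        if not (layers.get(name) or {}).get("vintage")
    ]
    if missing:
        raise LadderVintageError(f"ladder artifact missing vintage for: {missing}")
    return {name: layers[name]["vintage"] for name in REQUIRED_LAYERS}


def require_matching_cd_vintage(ladder_vintages, household_cd_vintage):
    """Refuse to assign blocks when household districts use another CD vintage."""
    ladder_cd = ladder_vintages["congressional_district"]
    if ladder_cd != household_cd_vintage:
        raise LadderVintageError(
            f"ladder CD vintage {ladder_cd!r} != household CD vintage "
            f"{household_cd_vintage!r}"
        )
```

Failing at load time keeps a stale or incomplete artifact from silently yielding households with some rungs filled and others blank.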

Column names follow the policyengine-us household input surface, so the artifact carries engine inputs, never formula outputs. The issue sketched sldu_geoid/sldl_geoid; the artifact stores the policyengine-us inputs sldu/sldl (3-character within-state BAF codes) — state_fips + sldu is the full SLD geoid.
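As a small illustration of that composition rule, the full SLD geoid can be rebuilt from the two stored inputs. The helper name `sld_geoid` is hypothetical, and the sketch assumes the SLD code is stored as a 3-character string; how an unassigned district is represented in the artifact is not covered here.

```python
def sld_geoid(state_fips, sld_code):
    """Compose a full SLD GEOID from a state FIPS code and a 3-character BAF code."""
    code = str(sld_code)
    if len(code) != 3:
        raise ValueError(f"expected a 3-character SLD code, got {code!r}")
    # Zero-pad the state so integer-typed state_fips values still give 2 digits.
    return f"{int(state_fips):02d}{code}"
```

For example, `sld_geoid(10, "001")` gives `"10001"` for Delaware district 001, for either chamber.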

Vintages (recorded per layer in artifact metadata, build summary, and H5 attrs)

| layer | vintage | source |
| --- | --- | --- |
| block anchor | 2020_tabulation_blocks | P.L. 94-171 geographic headers (POP100 per block) |
| congressional district | 119th_congress | Census 119th CD BEF (cd119.zip), not the 116th-vintage baf2020 CD layer |
| sldu / sldl | 2020_baf | 2020 BAF (legislative plans as of the 2020 P.L. 94-171 release) |
| place | 2020_census | 2020 BAF INCPLACE_CDP |
| cbsa | omb_2023_delineations | OMB Bulletin 23-01 (list1_2023.xlsx) |

A newer SLD plan (post-2020 redistricting BEF) is a data swap in the artifact builder, not a code change.

The #34 regression (in_nyc / nyc_income_tax)

The county rung restores computable NYC taxes, paired with PolicyEngine/policyengine-us#8843 (over a dataset, county now maps stored county_fips instead of collapsing to first_county_in_state). Verified end to end with that fix installed: a DE+NY+VT ladder build → H5 export → Microsimulation recomputes in_nyc exactly equal to the ladder's NYC-county assignment (all five boroughs present), nyc_income_tax > 0 for every NYC household and 0 elsewhere, and the gate passes with NYC at 43.5% of New York State weight (actual ≈ 42%). Until #8843 ships, the artifact still carries the county rung; the gate checks it structurally without needing the engine.
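The structural NYC check can be sketched without the engine. This is an illustration only: the function names and the default bounds are assumptions, not the thresholds the us_geography_ladder gate uses, and county_fips is assumed to be stored as 5-character strings. The five county codes are the FIPS codes of the NYC boroughs (Bronx, Kings, New York, Queens, Richmond).

```python
import numpy as np

# FIPS codes of the five NYC boroughs.
NYC_COUNTY_FIPS = ("36005", "36047", "36061", "36081", "36085")


def nyc_share_of_new_york_state(county_fips, weights):
    """Weighted share of New York State households assigned to an NYC county."""
    county = np.asarray(county_fips).astype(str)
    w = np.asarray(weights, dtype=float)
    in_state = np.char.startswith(county, "36")
    state_weight = w[in_state].sum()
    if state_weight <= 0:
        raise ValueError("no weighted New York State households")
    in_nyc = np.isin(county, NYC_COUNTY_FIPS)
    return float(w[in_nyc].sum() / state_weight)


def nyc_share_within_bounds(share, lower=0.30, upper=0.55):
    """Hypothetical bounds; a collapse to a single upstate county gives share 0."""
    return lower <= share <= upper
```

If every New York household collapsed onto one non-NYC county, the share would be 0 and the check would fail, which is the failure mode the #34 regression describes.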

Verification

  • 45 new tests (test_us_geography_ladder.py, test_us_block_ladder_sources.py + loader round-trip): loader refusals (missing vintage, ZZZ markers, duplicate/mismatched blocks, nonpositive population), deterministic population-weighted assignment, vintage-mismatch refusal, frame mass/strata preservation, gate pass/fail behavior including the NYC-collapse case. Full populace-build suite green; ruff check clean.
  • Real smoke build (DE+VT): 33,293 populated blocks; population exactly matches census (1,633,025); assigned county shares match 2020 census shares to three decimals (New Castle .576/.577, Sussex .240/.239, Kent .184/.184); SLDU/SLDL fully assigned; place/CBSA population shares plausible (43%/92%).
  • Real DE+NY+VT build: 263,632 blocks, population exact (21,834,274), 28 CDs (1+26+1).
  • Adapter write_dataset round-trip: all nine spine columns survive USSingleYearDataset save/reload with dtypes intact.

🤖 Generated with Claude Code

MaxGhenis and others added 2 commits July 2, 2026 11:20
Anchor every household at a 2020 census tabulation block (sampled within
its assigned congressional district proportional to block population) and
derive the full geography ladder as household spine columns on the
national artifact: block_geoid, tract_geoid, county_fips, place_fips,
sldu, sldl, and cbsa_code alongside the existing state_fips and
congressional_district_geoid. One dataset, filter by geography, at any
grain — no per-area files.

- populace.build.us_runtime.geography_ladder: ladder artifact loader
  (refuses artifacts missing any layer vintage — vintage_policy: error),
  block assignment, Frame wrapper, provenance summary, and the
  us_geography_ladder gate whose NYC checks are the permanent form of the
  #34 in_nyc-collapse regression.
- populace.build.us_runtime.block_ladder_sources: pure parsers for the
  primary sources — 119th CD BEF (NationalCD119.txt), 2020 BAF
  SLDU/SLDL/INCPLACE_CDP layers, P.L. 94-171 geographic headers (block
  POP100, validated against each state row), and OMB 2023 CBSA
  delineations.
- tools/build_us_block_ladder_artifact.py: builds the national NPZ
  artifact from those sources with cached downloads, per-layer vintage
  metadata, sha256 provenance, and a load-back self-check.
- tools/build_us_puf_support_base.py: --block-ladder-artifact assigns the
  ladder after CD assignment (requires the CD vintage crosswalk so
  households carry current-vintage districts), runs the gate, records the
  assignment summary, and stamps populace_geography_ladder_* H5 attrs.

Column names follow the policyengine-us household input surface, so the
county rung recomputes in_nyc/nyc_income_tax from inputs rather than
persisting formula outputs.

Fixes #275

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

- Atomic download cache writes (write-then-rename) so a crash mid-write
  cannot leave a truncated file a later run treats as a cache hit.
- CBSA delineation parser accepts numeric-typed spreadsheet cells for all
  three code columns and raises a clear error when a data row carries
  malformed FIPS cells instead of an opaque int() failure.
- Correct the P.L. 94-171 geoheader field count (97, not 93) in the
  parser docstring and test fixture.
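
The write-then-rename pattern named in the first bullet can be sketched as below. This is a generic illustration, not the builder's code: the helper name `atomic_write_bytes` and the ".part" suffix are assumptions. The temporary file is created in the destination directory so that os.replace stays on one filesystem.

```python
import contextlib
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(dest, data):
    """Write data to dest so readers see the old file or the new one, never a partial."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Same directory as dest, so the final rename does not cross filesystems.
    fd, tmp_name = tempfile.mkstemp(
        dir=dest.parent, prefix=dest.name + ".", suffix=".part"
    )
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the rename
        os.replace(tmp_name, dest)  # atomic swap into place
    except BaseException:
        # Remove the partial temp file so it can never be mistaken for a cache hit.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
```

A cache lookup that only checks whether dest exists is then safe: the path appears only once the full payload has been written.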

Review validated the parsers against the real published files (8.17M-row
NationalCD119.txt, DE/AK/NE/DC BAF zips, list1_2023.xlsx, degeo2020.pl)
with no confirmed bugs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
MaxGhenis merged commit 10881bf into main on Jul 2, 2026
4 checks passed
MaxGhenis deleted the geo-ladder-275 branch on July 2, 2026 at 13:16
MaxGhenis added a commit that referenced this pull request on Jul 2, 2026:
The ladder shipped opt-in (#277): nothing stopped a base or release from
being built without the geography spine, which is the silent-degradation
family (#225 everyone-is-a-citizen, #34 NYC-zero) this repo legislates
against.

- tools/build_us_puf_support_base.py: omitting --block-ladder-artifact is
  now an error; diagnostic builds opt out explicitly with
  --without-block-ladder (recorded in the summary).
- L0/refit release export: the geography spine (state_fips,
  congressional_district_geoid, and the seven ladder columns) joins the
  required release source columns (presence; value quality is the gate's
  job), and the us_geography_ladder gate runs on the selected support
  with its calibrated weights — a release whose spine is inconsistent or
  whose NYC mass collapsed fails by default.
  --allow-geography-ladder-gate-failures is the diagnostic escape hatch;
  the gate result is recorded in the export summary either way.
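
The required-by-default flag handling described in this commit could look roughly like the following argparse sketch. The three flag names come from the commit message; the function names, help text, and the rule that the two ladder flags are mutually exclusive are assumptions for illustration.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="build_us_puf_support_base.py")
    parser.add_argument("--block-ladder-artifact", default=None,
                        help="path to the national block ladder NPZ")
    parser.add_argument("--without-block-ladder", action="store_true",
                        help="explicit opt-out for diagnostic builds")
    parser.add_argument("--allow-geography-ladder-gate-failures",
                        action="store_true",
                        help="record gate failures instead of blocking")
    return parser


def parse_ladder_args(argv):
    """Parse argv, treating a silently missing ladder as a usage error."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.block_ladder_artifact and args.without_block_ladder:
        parser.error("pass --block-ladder-artifact or --without-block-ladder, not both")
    if not args.block_ladder_artifact and not args.without_block_ladder:
        parser.error("--block-ladder-artifact is required "
                     "(use --without-block-ladder for diagnostic builds)")
    return args
```

Making the opt-out an explicit flag means a build without the spine is always a recorded decision rather than an omission.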

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Carry the full geography ladder from census block: county, place, state leg districts, metro/CBSA as spine columns
