Build the 117th->119th CD vintage crosswalk from Census sources#288

Draft: MaxGhenis wants to merge 3 commits into main from cd-vintage-crosswalk-205

@MaxGhenis


What

The congressional-district (CD) geography-vintage translation merged in #207 /
#208 / #209 consumed a crosswalk passed as an external CLI path: there was
no versioned, reproducible, or Census-cited crosswalk artifact in the repo, and
the only file that existed (an ad-hoc validation CSV) had demonstrably wrong
weights (Montana split 39/61 rather than the equal-population ~50/50, and DC
keyed as …1198 rather than the repo-wide at-large/delegate …1100).
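
The DC keying problem above comes down to normalizing district codes before building keys. A minimal sketch, assuming keys are state FIPS followed by a two-digit district code and that the Census delegate code `98` should collapse to the at-large code `00`; `normalize_cd_key` is a hypothetical helper name, not the repo's function:

```python
def normalize_cd_key(state_fips: str, district: str) -> str:
    """Build a state+district key, mapping delegate code 98 to at-large 00.

    Hypothetical helper for illustration; the key layout is an assumption.
    """
    code = district.strip().zfill(2)
    if code == "98":  # Census code for a non-voting delegate district
        code = "00"   # assumed repo-wide at-large/delegate convention
    return f"{state_fips.strip().zfill(2)}{code}"
```

With this, DC (`"11"`, `"98"`) and an at-large state such as Montana before redistricting (`"30"`, `"00"`) both land on keys ending in `00`.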

This PR ships the missing piece: a versioned, reproducible, population-weighted
117th → 119th CD crosswalk built from primary Census sources, its generator, and
its provenance, and makes it the packaged default. It closes #205.

The mechanics (translate_congressional_district_facts_to_current_vintage, the
state-total proxy for at-large states, the support-provenance guard) already
landed in #207 / #208 / #209 and are unchanged here; the hole was the data
artifact and its lineage.

Method

A single-vintage block overlay, so no 2010↔2020 block relationship file is
needed:

| Role | Census source | Format |
|---|---|---|
| Old (117th) district of each 2020 block | 2020 Block Assignment Files, CD layer (BlockAssign_ST{fips}_{usps}_CD.txt): the 116th-Congress plan (identical district geography to the 117th) on 2020 blocks | BLOCKID\|DISTRICT |
| Current (119th) district of each 2020 block | 119th Congressional District BEF (NationalCD119.txt), the same source the block ladder (#277) already uses | GEOID,CDFP |
| Weight | 2020 P.L. 94-171 POP100 per block (parse_pl_geo_blocks convention) | pipe-delimited |

Both district assignments are read on the same 2020 blocks and weighted by the
same 2020 block populations, so each old district's population is redistributed
across the current districts it overlaps and never invented. Population is the
correct default basis (apportionment and equal-population redistricting are
population operations); ACS income/tax proxy weights for fiscal targets are the
documented next refinement noted in #205.
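
The overlay described above can be sketched as a plain dictionary join. This is an illustrative reduction under stated assumptions, not the packaged implementation: `build_crosswalk` and its three input mappings (block ID to old district, block ID to current district, block ID to population) are hypothetical names, and the block IDs and populations in the usage note are toy values.

```python
from collections import defaultdict


def build_crosswalk(old_cd_by_block, new_cd_by_block, pop_by_block):
    """Return (rows, unmatched_pop) for a population-weighted overlay.

    rows are (old_cd, new_cd, population, weight), where weight is the
    share of the old district's population that falls in the new district.
    """
    overlap = defaultdict(int)    # (old_cd, new_cd) -> shared population
    old_total = defaultdict(int)  # old_cd -> total matched population
    unmatched = 0                 # population on blocks missing an assignment
    for block, pop in pop_by_block.items():
        old_cd = old_cd_by_block.get(block)
        new_cd = new_cd_by_block.get(block)
        if old_cd is None or new_cd is None:
            unmatched += pop
            continue
        overlap[(old_cd, new_cd)] += pop
        old_total[old_cd] += pop

    rows = []
    for (old_cd, new_cd), pop in sorted(overlap.items()):
        if pop == 0:
            continue  # zero-population overlaps carry no weight
        rows.append((old_cd, new_cd, pop, pop / old_total[old_cd]))
    return rows, unmatched
```

For example, three toy blocks with populations 100, 150, and 250, all in old district `"3000"`, with the first two assigned to `"3001"` and the third to `"3002"`, yield two rows with weight 0.5 each and no unmatched population. Because every weight is a share of the old district's own block population, the weights for each source district sum to 1.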

The committed national crosswalk

  • 1,444 rows; 436 source districts → 436 current 119th-Congress districts
    (including one-district states and DC).
  • Exact per-state population conservation across all 50 states + DC:
    331,449,281 people, zero unmatched or cross-state population.
  • Apportionment-shrunk districts (CA-53, IL-18, MI-14, NY-27, OH-16, PA-18,
    WV-03) appear only as sources; new districts (CO-08, FL-28, MT-02, NC-14,
    OR-06, TX-37, TX-38) appear as populated targets.
  • Montana's at-large district splits ~50/50 into MT-01/MT-02.
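
The properties listed above can be checked mechanically. The following is a sketch only, assuming each row is `(source_key, target_key, population)` and that the first two characters of a key are the state FIPS code; `summarize_crosswalk` is a hypothetical name and the rows in the usage note are made-up toy values, not figures from the committed artifact.

```python
from collections import defaultdict


def summarize_crosswalk(rows):
    """Report per-state conservation and source-only / target-only districts."""
    source_pop = defaultdict(int)  # state -> population leaving old districts
    target_pop = defaultdict(int)  # state -> population arriving in new districts
    cross_state = 0
    sources, targets = set(), set()
    for source, target, pop in rows:
        sources.add(source)
        targets.add(target)
        if source[:2] != target[:2]:
            cross_state += pop  # population that changed state: should be zero
        source_pop[source[:2]] += pop
        target_pop[target[:2]] += pop
    return {
        "conserved": dict(source_pop) == dict(target_pop) and cross_state == 0,
        "cross_state_population": cross_state,
        "source_only": sorted(sources - targets),
        "target_only": sorted(targets - sources),
    }
```

On a toy three-to-two-district state with rows `("5401", "5401", 600)`, `("5402", "5401", 100)`, `("5402", "5402", 500)`, and `("5403", "5402", 400)`, the report marks the state as conserved and lists `"5403"` as the only source-only district.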

Every source file's URL + SHA-256, the crosswalk SHA-256, the source/target
vintages, and the full per-state conservation table are recorded in
congressional_district_vintage_crosswalk.csv.provenance.json; the human-readable
recipe and citations are in CONGRESSIONAL_DISTRICT_VINTAGE_CROSSWALK.md.
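
Recording per-file SHA-256 digests, as described above, might look like the sketch below. The record layout and the names `sha256_of` and `provenance_record` are assumptions for illustration; the actual schema of the committed provenance JSON is not shown in this PR description.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large Census downloads need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def provenance_record(crosswalk_path, cached_sources):
    """Build a provenance dict; cached_sources maps source URL -> local path.

    Hypothetical layout, not the repo's provenance schema.
    """
    return {
        "crosswalk": {
            "file": Path(crosswalk_path).name,
            "sha256": sha256_of(crosswalk_path),
        },
        "sources": [
            {"url": url, "sha256": sha256_of(local_path)}
            for url, local_path in sorted(cached_sources.items())
        ],
    }
```

The resulting dictionary can be written with `json.dump(record, handle, indent=2, sort_keys=True)` so that repeated runs over the same inputs produce identical bytes.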

Fit with the Ledger schema direction

The translated CD targets are derived build artifacts, never Ledger facts: this is
the fact-vs-computed boundary of PolicyEngine/ledger#71. The crosswalk is a
declared consumer-side transform over facts with its own lineage, exactly the
pattern #71 prescribes and that #280 mirrors for period aging. The same mechanism
is called out for the Belgian NIS-code vintage crosswalk in PolicyEngine/ledger#69.

Files

  • packages/populace-build/src/populace/build/us_runtime/congressional_district_vintage_crosswalk.py — pure, tested parsers (BAF CD layer, CD BEF) and the population-weighted join with conservation diagnostics.
  • tools/build_us_congressional_district_vintage_crosswalk.py — download + cache + provenance orchestration, mirroring build_us_block_ladder_artifact.py.
  • packages/populace-build/src/populace/build/us/congressional_district_vintage_crosswalk.csv (+ .provenance.json, + .md) — the committed artifact, per-source SHA-256s, and the data-source doc.
  • packages/populace-build/src/populace/build/us_runtime/congressional_district_vintage.py — packaged-default loader helpers (load_default_…, default_…_path).
  • tools/build_us_fiscal_refresh_release.py — default to the packaged crosswalk when CD targets are requested and no path is passed (explicit paths still override).
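
The defaulting rule in the last item (use the packaged crosswalk only when CD targets are requested and no path is passed) reduces to a small resolution function. This is a sketch under assumptions: `resolve_crosswalk_path` and `PACKAGED_DEFAULT` are hypothetical names, and the placeholder path stands in for whatever the packaged-default loader actually returns.

```python
from pathlib import Path
from typing import Optional

# Placeholder for the packaged artifact location (assumption, not the real path).
PACKAGED_DEFAULT = Path("congressional_district_vintage_crosswalk.csv")


def resolve_crosswalk_path(
    cd_targets_requested: bool,
    explicit_path: Optional[str],
    packaged_default: Path = PACKAGED_DEFAULT,
) -> Optional[Path]:
    """Pick the crosswalk to use; an explicit path always wins over the default."""
    if not cd_targets_requested:
        return None  # no CD targets, so no crosswalk is needed
    if explicit_path is not None:
        return Path(explicit_path)
    return packaged_default
```

Keeping the explicit path as the first choice preserves the earlier CLI behavior for callers who already pass their own crosswalk.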

Verification

  • pytest test_us_congressional_district_vintage_crosswalk.py → 14 passed (parsers, at-large/delegate normalization, conservation math, uncovered-population reporting, and integration tests that load the real packaged crosswalk: 436/436, MT ~50/50, extra-only-as-source / new-only-as-target, exact fact-value conservation).
  • pytest test_us_congressional_district_vintage.py test_us_congressional_district_geography.py test_us_block_ladder_sources.py test_us_fiscal_targets.py → all green (no regressions).
  • pytest --extra us test_us_fiscal_refresh_builder.py → all green (provenance-guard path).
  • The committed CSV and provenance JSON are byte-reproducible from the generator over the cached Census sources; ruff check, ruff format --check, and git diff --check all clean.
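
The "exact fact-value conservation" check listed above rests on one property: if each source district's weights sum to 1, redistributing a value across targets preserves its total. A minimal sketch, assuming crosswalk rows of `(old_cd, new_cd, weight)` and using `Fraction` to keep the arithmetic exact; `translate_facts` is a hypothetical stand-in and not the repo's translate_congressional_district_facts_to_current_vintage:

```python
from collections import defaultdict
from fractions import Fraction


def translate_facts(facts, crosswalk):
    """Redistribute old-district values onto current districts by weight.

    facts maps old_cd -> value; crosswalk rows are (old_cd, new_cd, weight).
    """
    translated = defaultdict(Fraction)
    for old_cd, new_cd, weight in crosswalk:
        if old_cd in facts:
            translated[new_cd] += Fraction(facts[old_cd]) * Fraction(weight)
    return dict(translated)
```

For a toy at-large district `"3000"` with value 1000 and two rows of weight `Fraction(1, 2)`, each current district receives 500 and the total stays 1000.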

Closes #205. Refs PolicyEngine/ledger#69, PolicyEngine/ledger#71.

🤖 Generated with Claude Code

MaxGhenis and others added 3 commits July 2, 2026 17:13
The CD geography-vintage translation (#207/#208/#209) consumed a crosswalk
passed as an external CLI path with no versioned, reproducible, or cited
artifact in the repo. This adds that artifact and its generator, and makes it
the packaged default.

The crosswalk is built by a single-vintage block overlay, so no 2010<->2020
block bridge is needed:

- old (117th) district of each 2020 block: the 2020 Block Assignment File CD
  layer (BlockAssign_ST{fips}_{usps}_CD.txt), which carries the 116th-Congress
  plan (identical district geography to the 117th) on 2020 tabulation blocks;
- current (119th) district of each 2020 block: the 119th BEF (NationalCD119.txt),
  the same source the block ladder already uses;
- weight: 2020 P.L. 94-171 POP100 per block (the block ladder's
  parse_pl_geo_blocks convention).

Both district assignments are read on the same 2020 blocks weighted by the same
2020 block populations, so each old district's population is redistributed
across the current districts it overlaps and never invented. Population is the
correct default basis (apportionment and equal-population redistricting are
population operations); ACS income/tax proxy weights for fiscal targets are a
documented future refinement.

The committed national crosswalk covers all 436 current 119th-Congress districts
from 436 source districts, with exact per-state population conservation over all
331,449,281 people in the 50 states + DC (zero unmatched or cross-state). The
apportionment-shrunk districts (CA-53, IL-18, MI-14, NY-27, OH-16, PA-18, WV-03)
appear only as sources; the new districts (CO-08, FL-28, MT-02, NC-14, OR-06,
TX-37, TX-38) appear as populated targets. Montana's at-large district splits
~50/50 into MT-01/MT-02, as equal-population districts require.

The derived crosswalk is a regenerable build artifact, not a Ledger fact -- the
fact-vs-computed boundary of PolicyEngine/ledger#71; the same declared-consumer-
side-transform pattern applies to the Belgian NIS-code vintage work in
PolicyEngine/ledger#69.

Files:
- congressional_district_vintage_crosswalk.py: pure parsers (BAF CD layer, CD
  BEF) and the population-weighted join with conservation diagnostics.
- tools/build_us_congressional_district_vintage_crosswalk.py: download +
  cache + provenance orchestration, mirroring build_us_block_ladder_artifact.
- us/congressional_district_vintage_crosswalk.csv (+ .provenance.json + .md):
  the committed artifact, per-source SHA-256s, and the data-source doc.
- congressional_district_vintage.py: packaged-default loader helpers.
- build_us_fiscal_refresh_release.py: default to the packaged crosswalk when CD
  targets are requested and no path is passed.

Closes #205.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
CI (wheels job) surfaced two issues:

1. The `us/` country package is a spec-only directory: the governance tests
   (test_spec_only_country_packages) and country_spec.py require it to contain
   only .json resources, all declared in country_package.json. Move the
   crosswalk CSV, its provenance JSON, and the data-source doc to a new
   us_runtime/data/ package (us_runtime is exempt from the spec-only rule), and
   re-anchor the packaged-default loader at populace.build.us_runtime.data.

2. test_cd_targets_require_vintage_crosswalk asserted the release builder errors
   when CD targets are requested without a crosswalk; the builder now defaults
   to the packaged crosswalk, so rename/rewrite the test to assert the default
   is applied and an explicit path still overrides it.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

Translate old-vintage CD targets to current district geography
