
Impute SSN card type and immigration status from CPS ASEC citizenship#266

Merged: MaxGhenis merged 1 commit into main from fix-225-immigration-ssn on Jul 2, 2026

@MaxGhenis

Contributor

Fixes #225.

Problem

The published US dataset stored zero SSN/immigration/ITIN columns, so policyengine-us defaulted every person to a citizen with a valid SSN. Every SSN- or citizenship-conditioned policy was a no-op: OBBBA's CTC SSN tightening scored ≈ $0 (vs +$3.3B in PolicyEngine's official OBBBA model), EITC SSN requirements excluded nobody, and ACA/Medicaid/SNAP/SSI immigration axes treated everyone as eligible.

Approach

A new immigration_status source stage (manifest entry in source_stages.json + a derive_immigration_status US runtime handler), designed fresh rather than porting the incumbent wholesale:

  1. Citizenship is measured, not imputed: PRCITSHP 1–4 → CITIZEN, 5 → non-citizen.
  2. ASEC-UA residual method (Van Hook et al., SSRN 4662801): non-citizens with any legal-status indicator (pre-1982 IRCA arrival, naturalization eligibility, Medicare/Medicaid/SSI/Social Security, federal pension, IHS/CHAMPVA/military coverage, government employment, subsidized housing, veteran status) → OTHER_NON_CITIZEN.
  3. Work/study authorization split — the CPS has no work-authorization variable, so residual non-citizen workers/students spill to NON_CITIZEN_VALID_EAD in deterministic seeded order until the remaining undocumented worker/student counts match published controls (Pew 8.3M workers; Presidents' Alliance ~408k students). These are the only forced margins, and each control carries a citation in the manifest.
  4. The total undocumented population is emergent, not forced — 13.3M on the full 2024 ASEC, inside the range of published 2023–24 estimates. A release gate checks it against Pew's cited 11.0M (2022) anchor with a coarse [0.5, 1.6] band. Reconciling the level is the calibration lane's job (follow-up: a Ledger fact with SE).
  5. Status tags carry only statutory tests the data supports: DACA (arrival-cohort test among EAD holders), CUBAN_HAITIAN_ENTRANT (nativity + post-1980 arrival); other documented non-citizens stay LEGAL_PERMANENT_RESIDENT.
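
Steps 1 and 2 above can be sketched as a small vectorized function. This is a minimal illustration, not the PR's derive_immigration_status handler: the function name and the single precomputed `has_legal_indicator` flag (standing in for the full list of ASEC-UA indicators) are assumptions made for the example, and the work/study split of step 3 is left out.

```python
import numpy as np


def classify_ssn_card_type(prcitshp, has_legal_indicator):
    """Steps 1-2 only: measured citizenship, then the residual split.

    Hypothetical helper for illustration, not the PR's actual handler.
    """
    prcitshp = np.asarray(prcitshp)
    has_legal_indicator = np.asarray(has_legal_indicator, dtype=bool)

    # Fail loudly on codes outside the 1-5 range used in the description.
    if not np.isin(prcitshp, [1, 2, 3, 4, 5]).all():
        raise ValueError("unexpected PRCITSHP code")

    is_citizen = prcitshp <= 4  # PRCITSHP 1-4 -> CITIZEN, 5 -> non-citizen
    card = np.full(prcitshp.shape, "NONE", dtype=object)  # residual default
    card[is_citizen] = "CITIZEN"
    # Any affirmative legal-status indicator moves a non-citizen out of the residual.
    card[~is_citizen & has_legal_indicator] = "OTHER_NON_CITIZEN"
    return card


example = classify_ssn_card_type([1, 4, 5, 5], [False, True, True, False])
print(example.tolist())
```

Records left as NONE here are only the residual pool; in the PR, part of that pool is then moved to NON_CITIZEN_VALID_EAD by the control-bound spill.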

Deliberate divergences from the incumbent (policyengine-us-data)

  • No family-correlation step. The incumbent flipped OTHER_NON_CITIZEN members of mixed households back to undocumented to hit a 13M total — reclassifying people with affirmative legal-status indicators (Medicaid, Social Security) and corrupting exactly the program-participation↔status correlation that benefit analysis needs. It was also a no-op on current data (the emergent total already exceeded the target). Representation belongs to calibration per the charter.
  • No blanket REFUGEE/TPS labels. The incumbent tagged every recently-arrived documented non-citizen REFUGEE and every leftover EAD holder TPS — mislabeling millions (true stocks are under 1M each) and over-granting refugee-class benefit exemptions. LPR treatment gives near-identical means-tested eligibility for those populations and is honest about modal composition.
  • No np.random state. Selection draws are blake2b hashes keyed by source person identity (source_year/source_person_id), so support-channel clones of one source person always land on the same side of a selection threshold and reruns are bit-reproducible.
  • Below-control counts spill nothing (the incumbent moved a share to EAD even when already below target).
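
The hash-keyed draw and the control-bound spill can be sketched as follows. This is a hedged illustration under stated assumptions: the key layout, the seed string, the `spill_to_ead` name, and the rule of cutting at a draw threshold that never overshoots the control are choices made for this example; the PR text only states that draws are blake2b hashes keyed by source_year/source_person_id and that below-control counts spill nothing.

```python
import hashlib

import numpy as np


def selection_draw(seed: str, source_year: int, source_person_id: int) -> float:
    """Deterministic draw in [0, 1) keyed by source person identity."""
    key = f"{seed}:{source_year}:{source_person_id}".encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    # Keep the top 53 bits so the float is exact and strictly below 1.
    return (int.from_bytes(digest, "big") >> 11) / 2**53


def spill_to_ead(draws, weights, control):
    """Boolean mask of candidates moved to EAD (hypothetical helper).

    Moves the lowest-draw records while the weighted remainder stays at or
    above the control. Cutting on a draw threshold keeps records that share
    a draw (clones of one source person) on the same side.
    """
    draws = np.asarray(draws, dtype=float)
    weights = np.asarray(weights, dtype=float)
    excess = weights.sum() - control
    if excess <= 0:
        return np.zeros(draws.shape, dtype=bool)  # below control: spill nothing

    uniq, inverse = np.unique(draws, return_inverse=True)
    group_weight = np.bincount(inverse, weights=weights, minlength=len(uniq))
    cumulative = np.cumsum(group_weight)
    n_groups = int(np.searchsorted(cumulative, excess, side="right"))
    if n_groups == 0:
        return np.zeros(draws.shape, dtype=bool)
    return draws <= uniq[n_groups - 1]


mask = spill_to_ead([0.1, 0.9, 0.5, 0.1], [1, 1, 1, 1], control=2)
print(mask.tolist())
```

Because the draw depends only on the key, two clones of the same source person receive the same value, and a rerun with the same seed reproduces the same mask without any global random state.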

Wiring

  • build_us_puf_support_base.py runs the stage after derive_cps_carried, before channel cloning; the summary records the composition.
  • build_us_fiscal_refresh_release.py runs it idempotently after the census mass repair (so absolute controls bind at full population scale) and blocks the release on a new immigration_composition gate: columns present, non-constant, enum-domain-valid, cross-column-consistent (CITIZEN ↔ CITIZEN, NONE ↔ UNDOCUMENTED), non-citizen share in [3%, 12%], undocumented total within the anchor band. The gate lands in calibration_diagnostics.json, build_manifest.json, and release_manifest.json alongside the existing gates.
  • The L0/refit exporter now requires both person columns to carry signal, so a sparse release can't ship the #225 failure mode (dataset imputes 100% of the population as citizens with valid SSNs, which breaks SSN/citizenship-conditioned policies such as the OBBBA CTC).
  • New stage name in US_STAGE_NAMES + US_DONORS citation entry.
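
The composition gate's checks can be sketched as a function that returns a list of failure messages. This is an illustrative outline only: the signature, the failure strings, and passing the enum domains in as arguments are assumptions, and the cross-column rule is implemented as a two-way equivalence, which is one reading of the consistency check described above. The column-presence check is omitted because the sketch takes the columns as arguments.

```python
import numpy as np


def immigration_composition_gate(card, status, weights, card_domain, status_domain,
                                 anchor=11.0e6, anchor_band=(0.5, 1.6),
                                 share_band=(0.03, 0.12)):
    """Return a list of failure messages; an empty list means the gate passes."""
    card, status = np.asarray(card), np.asarray(status)
    weights = np.asarray(weights, dtype=float)
    failures = []

    # Non-constant: each column must take at least two distinct values.
    if len(set(card.tolist())) < 2 or len(set(status.tolist())) < 2:
        failures.append("constant column")
    # Enum domains.
    if not set(card.tolist()) <= set(card_domain):
        failures.append("ssn_card_type outside enum domain")
    if not set(status.tolist()) <= set(status_domain):
        failures.append("immigration status outside enum domain")
    # Cross-column consistency, read here as a two-way equivalence.
    if not np.array_equal(card == "CITIZEN", status == "CITIZEN"):
        failures.append("CITIZEN mismatch between columns")
    if not np.array_equal(card == "NONE", status == "UNDOCUMENTED"):
        failures.append("NONE/UNDOCUMENTED mismatch between columns")
    # Weighted non-citizen share.
    share = weights[card != "CITIZEN"].sum() / weights.sum()
    if not share_band[0] <= share <= share_band[1]:
        failures.append("non-citizen share out of band")
    # Weighted undocumented total relative to the cited anchor.
    ratio = weights[card == "NONE"].sum() / anchor
    if not anchor_band[0] <= ratio <= anchor_band[1]:
        failures.append("undocumented total outside anchor band")
    return failures


CARD_DOMAIN = {"CITIZEN", "NONE", "NON_CITIZEN_VALID_EAD", "OTHER_NON_CITIZEN"}
STATUS_DOMAIN = {"CITIZEN", "UNDOCUMENTED", "DACA", "CUBAN_HAITIAN_ENTRANT",
                 "LEGAL_PERMANENT_RESIDENT"}  # only the values named in this PR

ok = immigration_composition_gate(
    ["CITIZEN"] * 9 + ["NONE"], ["CITIZEN"] * 9 + ["UNDOCUMENTED"],
    [12e6] * 10, CARD_DOMAIN, STATUS_DOMAIN)
degenerate = immigration_composition_gate(
    ["CITIZEN"] * 10, ["CITIZEN"] * 10, [12e6] * 10, CARD_DOMAIN, STATUS_DOMAIN)
print(ok, degenerate)
```

The all-citizen input reproduces the #225 failure mode and trips three checks at once: the constant-column check, the share band, and the anchor band.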

Validation

  • 51 new behavioral tests (test_us_immigration.py): residual conditions, control binding, weight-awareness, determinism/seed-sensitivity, clone consistency, statutory tags, idempotence, loud failures, manifest citation discipline, gate bands.
  • Full suite: 1,126 passed, 10 skipped; ruff check clean.
  • Full real-data run on the 2024 ASEC (326.0M persons):
| ssn_card_type | this PR | incumbent usdata |
|---|---|---|
| CITIZEN | 299.37M | 299.37M |
| NONE (undocumented) | 13.34M | 13.24M |
| NON_CITIZEN_VALID_EAD | 4.67M | 4.76M |
| OTHER_NON_CITIZEN | 8.59M | 8.59M |

Status tags: 360k DACA, 660k Cuban/Haitian entrants, 12.2M LPR. Composition gate passes.
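
As a quick arithmetic cross-check, the reported figures for this PR can be compared against the gate bands quoted earlier. The numbers are copied from the composition above; treating LPR as the documented non-citizens (EAD plus OTHER_NON_CITIZEN) left after the DACA and Cuban/Haitian tags is an assumption of this sketch.

```python
# Weighted counts in millions, as reported for this PR on the 2024 ASEC.
composition_m = {
    "CITIZEN": 299.37,
    "NONE": 13.34,
    "NON_CITIZEN_VALID_EAD": 4.67,
    "OTHER_NON_CITIZEN": 8.59,
}

total_m = sum(composition_m.values())
non_citizen_share = (total_m - composition_m["CITIZEN"]) / total_m
anchor_ratio = composition_m["NONE"] / 11.0  # Pew's cited 11.0M (2022) anchor

# Assumption: LPR is what remains of documented non-citizens after the two tags.
documented_m = composition_m["NON_CITIZEN_VALID_EAD"] + composition_m["OTHER_NON_CITIZEN"]
lpr_m = documented_m - 0.36 - 0.66  # 360k DACA, 660k Cuban/Haitian entrants

print(round(total_m, 2), round(non_citizen_share, 4), round(anchor_ratio, 3), round(lpr_m, 2))
```

Under these assumptions the total comes to about 325.97M (reported as 326.0M), the non-citizen share to about 8.2% (inside [3%, 12%]), the undocumented-to-anchor ratio to about 1.21 (inside [0.5, 1.6]), and the LPR residual to about 12.24M (reported as 12.2M).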

Notes

  • The published artifact needs a rebuild to pick this up; the next fiscal-refresh release derives the columns idempotently from any base H5 that still carries the raw ASEC columns (all current ones do).
  • Follow-ups filed after merge: an undocumented-population calibration target in the Ledger facts lane, and a conditional model for the EAD split if a status-observed donor ever materializes.

🤖 Generated with Claude Code

Fixes #225: the published US dataset stored no SSN/immigration columns,
so the engine defaulted every person to a citizen with a valid SSN and
every SSN/citizenship-conditioned policy (OBBBA CTC SSN tightening, EITC
SSN requirements, ACA/Medicaid/SNAP/SSI immigration axes, ITIN filers)
was a no-op.

New immigration_status source stage (manifest entry + US runtime
handler): citizenship is measured from PRCITSHP; non-citizens with any
ASEC-UA legal-status indicator (Van Hook et al., SSRN 4662801) become
OTHER_NON_CITIZEN; residual non-citizen workers/students spill to
NON_CITIZEN_VALID_EAD in deterministic seeded order until the remaining
undocumented worker/student counts match published controls (Pew 8.3M
workers; Presidents' Alliance ~408k students) — the only forced margins.
The total undocumented population is emergent (13.3M on the full 2024
ASEC) and release-gated against Pew's cited 11.0M anchor with a coarse
band; reconciling the level belongs to the calibration lane. Status
tags carry only statutory tests the data supports (DACA cohort,
Cuban/Haitian entrant class); documented non-citizens default to LPR
rather than fabricated blanket REFUGEE/TPS labels.

Selection draws are blake2b hashes keyed by source person identity, so
support-channel clones always share a status and reruns are
bit-reproducible. The stage runs in the support-base builder before
channel cloning and idempotently in the fiscal-refresh driver (after
the census mass repair, so controls bind at full scale), where a new
release-blocking immigration_composition gate also enforces presence,
enum domains, cross-column consistency, a plausible non-citizen share,
and the anchor band; the L0/refit exporter now requires the two person
columns to carry signal.

Full 2024 ASEC composition: 299.4M CITIZEN / 13.3M NONE / 4.7M EAD /
8.6M OTHER_NON_CITIZEN; 360k DACA, 660k Cuban/Haitian entrants.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit 480cc7c into main Jul 2, 2026
4 checks passed
@MaxGhenis MaxGhenis deleted the fix-225-immigration-ssn branch July 2, 2026 08:02
daphnehanse11 added a commit that referenced this pull request Jul 2, 2026
The pre-#266 main-line release ships persisted PolicyEngine input
columns that are constant at the engine default for every record —
weekly_hours_worked_before_lsr at the 40-hour default, SNAP/TANF/SSI
take-up flags at True, spm_unit_tenure_type at RENTER, s_corp_income
at zero. Such columns carry zero information while looking populated,
which is how a failed or missing imputation ships silently. The
existing nonconstant_columns_gate only checks a hand-picked ACA
allowlist, and input_mass_parity_gate passes when the parent artifact
is equally degenerate — the constant-40 hours column had full mass in
every ancestor.

Add default_valued_columns_gate: a sweep over every persisted input
column that fails when all observed values equal the engine default,
with no reference artifact needed. Constant-but-not-default columns
pass and are reported (an intentional broadcast is a modeling choice).
Reviewed exclusions accept known degenerate columns with a recorded
reason; a stale exclusion (column now carries signal) fails so the
list cannot rot, while a dormant one (column absent from this release
line's surface) is only reported.
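
A minimal sketch of the sweep described above, assuming columns arrive as a name-to-values mapping and engine defaults as a name-to-default mapping. The function signature, the result layout, and the reading of "carries signal" as "not all values equal the default" are assumptions of this example, not the gate's actual interface.

```python
import numpy as np


def default_valued_columns_gate(columns, defaults, exclusions=None):
    """Sweep persisted input columns against engine defaults (illustrative).

    columns: name -> values; defaults: name -> engine default;
    exclusions: name -> recorded reason for a reviewed exclusion.
    """
    exclusions = exclusions or {}
    failures, reported = [], []
    for name, raw in columns.items():
        if name not in defaults:
            continue  # not an engine input with a known default
        values = np.asarray(raw)
        if values.size == 0:
            continue
        all_default = bool(np.all(values == defaults[name]))
        if all_default and name in exclusions:
            reported.append((name, "excluded: " + exclusions[name]))
        elif all_default:
            failures.append((name, "all values equal engine default"))
        elif name in exclusions:
            # The column now carries signal, so the exclusion must be removed.
            failures.append((name, "stale exclusion"))
        elif bool(np.all(values == values[0])):
            reported.append((name, "constant but not default"))
    for name in exclusions:
        if name not in columns:
            reported.append((name, "dormant exclusion"))
    return {"failures": failures, "reported": reported}


result = default_valued_columns_gate(
    columns={
        "weekly_hours_worked_before_lsr": [40, 40, 40],
        "s_corp_income": [0.0, 0.0, 0.0],
        "spm_unit_tenure_type": ["OWNER", "OWNER", "OWNER"],
    },
    defaults={
        "weekly_hours_worked_before_lsr": 40,
        "s_corp_income": 0.0,
        "spm_unit_tenure_type": "RENTER",
    },
    exclusions={
        "s_corp_income": "reviewed, tracked separately",
        "absent_column": "reviewed, not on this release line",  # hypothetical name
    },
)
print(result)
```

No reference artifact appears anywhere in the check, which is the point of the design: a column that has been constant at the default in every ancestor still fails.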

Expose engine defaults through PolicyEngineUSEngine.default_values —
input variables only, enum defaults normalized to their stored member
name — and wire the gate into the US fiscal refresh builder across all
entity tables, threading the result into the release gate failures,
calibration diagnostics, and build manifest alongside the health input
and input-mass gates. The builder seeds reviewed exclusions for the 20
known offenders, each naming its tracking issue; ssn_card_type and
immigration_status_str are deliberately not excluded — #266 imputes
them now, so a base where they are still constant skipped that stage
and should fail.

Closes #257

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>