Impute SSN card type and immigration status from CPS ASEC citizenship #266
Merged
Fixes #225: the published US dataset stored no SSN/immigration columns, so the engine defaulted every person to a citizen with a valid SSN and every SSN/citizenship-conditioned policy (OBBBA CTC SSN tightening, EITC SSN requirements, ACA/Medicaid/SNAP/SSI immigration axes, ITIN filers) was a no-op.

New immigration_status source stage (manifest entry + US runtime handler): citizenship is measured from PRCITSHP; non-citizens with any ASEC-UA legal-status indicator (Van Hook et al., SSRN 4662801) become OTHER_NON_CITIZEN; residual non-citizen workers/students spill to NON_CITIZEN_VALID_EAD in deterministic seeded order until the remaining undocumented worker/student counts match published controls (Pew 8.3M workers; Presidents' Alliance ~408k students). These are the only forced margins. The total undocumented population is emergent (13.3M on the full 2024 ASEC) and release-gated against Pew's cited 11.0M anchor with a coarse band; reconciling the level belongs to the calibration lane.

Status tags carry only statutory tests the data supports (DACA cohort, Cuban/Haitian entrant class); documented non-citizens default to LPR rather than fabricated blanket REFUGEE/TPS labels. Selection draws are blake2b hashes keyed by source person identity, so support-channel clones always share a status and reruns are bit-reproducible.

The stage runs in the support-base builder before channel cloning and idempotently in the fiscal-refresh driver (after the census mass repair, so controls bind at full scale), where a new release-blocking immigration_composition gate also enforces presence, enum domains, cross-column consistency, a plausible non-citizen share, and the anchor band; the L0/refit exporter now requires the two person columns to carry signal.

Full 2024 ASEC composition: 299.4M CITIZEN / 13.3M NONE / 4.7M EAD / 8.6M OTHER_NON_CITIZEN; 360k DACA, 660k Cuban/Haitian entrants.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
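The seeded, hash-keyed spill described in this commit message can be illustrated with a minimal sketch. It is not the stage's actual code: the function names, the seed string, the record layout, and the stopping rule (stop once the remaining weighted count is at or below the control) are all assumptions made for illustration. Only the use of blake2b keyed by source_year/source_person_id comes from the description.

```python
import hashlib


def selection_draw(source_year, source_person_id, seed="immigration_status"):
    """Return a reproducible integer draw keyed by source person identity.

    The seed string and key layout are hypothetical; the point is that the
    draw depends only on the identity, never on row order or RNG state.
    """
    key = f"{seed}:{source_year}:{source_person_id}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(key, digest_size=8).digest(), "big")


def spill_until_control(candidates, control_total, seed="immigration_status"):
    """Move residual candidates to EAD, in draw order, until the weighted
    count left as undocumented (NONE) is at or below the control total.

    `candidates` is a list of dicts with source_year, source_person_id and
    weight, one per unique source person (an assumed layout).
    """
    ordered = sorted(
        candidates,
        key=lambda p: (
            selection_draw(p["source_year"], p["source_person_id"], seed),
            p["source_year"],
            p["source_person_id"],
        ),
    )
    remaining = sum(p["weight"] for p in candidates)
    statuses = {}
    for person in ordered:
        identity = (person["source_year"], person["source_person_id"])
        if remaining > control_total:
            statuses[identity] = "NON_CITIZEN_VALID_EAD"
            remaining -= person["weight"]
        else:
            statuses[identity] = "NONE"
    return statuses, remaining


# Toy data: three residual non-citizen workers with made-up weights.
people = [
    {"source_year": 2024, "source_person_id": 1, "weight": 3.0},
    {"source_year": 2024, "source_person_id": 2, "weight": 2.0},
    {"source_year": 2024, "source_person_id": 3, "weight": 5.0},
]
statuses, remaining = spill_until_control(people, control_total=6.0)
```

Because the status is a function of the identity key alone, any clone that carries the same source_year and source_person_id would map to the same draw, which is the property the commit message relies on for clone consistency.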
This was referenced Jul 1, 2026
Open
daphnehanse11 added a commit that referenced this pull request on Jul 2, 2026
The pre-#266 main-line release ships persisted PolicyEngine input columns that are constant at the engine default for every record: weekly_hours_worked_before_lsr at the 40-hour default, SNAP/TANF/SSI take-up flags at True, spm_unit_tenure_type at RENTER, s_corp_income at zero. Such columns carry zero information while looking populated, which is how a failed or missing imputation ships silently. The existing nonconstant_columns_gate only checks a hand-picked ACA allowlist, and input_mass_parity_gate passes when the parent artifact is equally degenerate; the constant-40 hours column had full mass in every ancestor.

Add default_valued_columns_gate: a sweep over every persisted input column that fails when all observed values equal the engine default, with no reference artifact needed. Constant-but-not-default columns pass and are reported (an intentional broadcast is a modeling choice). Reviewed exclusions accept known degenerate columns with a recorded reason; a stale exclusion (column now carries signal) fails so the list cannot rot, while a dormant one (column absent from this release line's surface) is only reported.

Expose engine defaults through PolicyEngineUSEngine.default_values (input variables only, enum defaults normalized to their stored member name) and wire the gate into the US fiscal refresh builder across all entity tables, threading the result into the release gate failures, calibration diagnostics, and build manifest alongside the health input and input-mass gates. The builder seeds reviewed exclusions for the 20 known offenders, each naming its tracking issue; ssn_card_type and immigration_status_str are deliberately not excluded. #266 imputes them now, so a base where they are still constant skipped that stage and should fail.

Closes #257

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
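As a rough sketch of the gate logic this commit describes, assuming columns arrive as plain lists and engine defaults as a dict: the function signature, the return shape, and the reading of "stale" as "no longer all-default" are illustrative assumptions, not the repository's actual API.

```python
def default_valued_columns_gate(columns, defaults, exclusions=None):
    """Fail when a persisted input column equals the engine default everywhere.

    columns:    {column name: list of observed values}      (assumed layout)
    defaults:   {column name: engine default value}         (assumed layout)
    exclusions: {column name: recorded reason} for reviewed degenerate columns
    """
    exclusions = exclusions or {}
    failures, reported = [], []

    for name, values in columns.items():
        if name not in defaults:
            continue  # not an engine input variable, so outside the sweep
        observed = set(values)
        all_default = bool(observed) and observed == {defaults[name]}

        if name in exclusions:
            # Assumption: an exclusion is stale once the column is no
            # longer default-valued, and a stale exclusion must fail.
            if not all_default:
                failures.append(f"{name}: stale exclusion ({exclusions[name]})")
            continue

        if all_default:
            failures.append(f"{name}: every value equals the engine default")
        elif len(observed) == 1:
            reported.append(f"{name}: constant but not default")

    for name in exclusions:
        if name not in columns:
            reported.append(f"{name}: dormant exclusion (column absent)")

    return {"passed": not failures, "failures": failures, "reported": reported}


# Toy inputs; the exclusion reasons are placeholders, not real issue links.
result = default_valued_columns_gate(
    columns={
        "weekly_hours_worked_before_lsr": [40, 40, 40],
        "s_corp_income": [0.0, 0.0, 0.0],
        "ssn_card_type": ["CITIZEN", "NONE", "CITIZEN"],
        "spm_unit_tenure_type": ["OWNER", "OWNER", "OWNER"],
    },
    defaults={
        "weekly_hours_worked_before_lsr": 40,
        "s_corp_income": 0.0,
        "ssn_card_type": "CITIZEN",
        "spm_unit_tenure_type": "RENTER",
    },
    exclusions={
        "s_corp_income": "reviewed, placeholder reason",
        "absent_column": "reviewed, placeholder reason",
    },
)
```

In this toy run the hours column fails, the excluded s_corp_income is accepted, the tenure column is reported as constant but not default, and the exclusion for a column that is not present is reported as dormant.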
This was referenced Jul 2, 2026
Fixes #225.
Problem
The published US dataset stored zero SSN/immigration/ITIN columns, so policyengine-us defaulted every person to a citizen with a valid SSN. Every SSN- or citizenship-conditioned policy was a no-op: OBBBA's CTC SSN tightening scored ≈ $0 (vs +$3.3B in PolicyEngine's official OBBBA model), EITC SSN requirements excluded nobody, and ACA/Medicaid/SNAP/SSI immigration axes treated everyone as eligible.
Approach
A new immigration_status source stage (manifest entry in source_stages.json + a derive_immigration_status US runtime handler), designed fresh rather than porting the incumbent wholesale:
- Citizenship is measured: PRCITSHP 1–4 → CITIZEN, 5 → non-citizen.
- Non-citizens with any ASEC-UA legal-status indicator (Van Hook et al., SSRN 4662801) become OTHER_NON_CITIZEN.
- Residual non-citizen workers/students spill to NON_CITIZEN_VALID_EAD in deterministic seeded order until the remaining undocumented worker/student counts match published controls (Pew 8.3M workers; Presidents' Alliance ~408k students). These are the only forced margins, and each control carries a citation in the manifest.
- Status tags carry only statutory tests the data supports: DACA (arrival-cohort test among EAD holders), CUBAN_HAITIAN_ENTRANT (nativity + post-1980 arrival); other documented non-citizens stay LEGAL_PERMANENT_RESIDENT.

Deliberate divergences from the incumbent (policyengine-us-data)
- The incumbent moved OTHER_NON_CITIZEN members of mixed households back to undocumented to hit a 13M total, reclassifying people with affirmative legal-status indicators (Medicaid, Social Security) and corrupting exactly the program-participation↔status correlation that benefit analysis needs. It was also a no-op on current data (the emergent total already exceeded the target). Representation belongs to calibration per the charter.
- No blanket REFUGEE/TPS labels. The incumbent tagged every recently-arrived documented non-citizen REFUGEE and every leftover EAD holder TPS, mislabeling millions (true stocks are under 1M each) and over-granting refugee-class benefit exemptions. LPR treatment gives near-identical means-tested eligibility for those populations and is honest about modal composition.
- No np.random state. Selection draws are blake2b hashes keyed by source person identity (source_year/source_person_id), so support-channel clones of one source person always land on the same side of a selection threshold and reruns are bit-reproducible.

Wiring
- build_us_puf_support_base.py runs the stage after derive_cps_carried, before channel cloning; the summary records the composition.
- build_us_fiscal_refresh_release.py runs it idempotently after the census mass repair (so absolute controls bind at full population scale) and blocks the release on a new immigration_composition gate: columns present, non-constant, enum-domain-valid, cross-column-consistent (CITIZEN⟺CITIZEN, NONE⟺UNDOCUMENTED), non-citizen share in [3%, 12%], undocumented total within the anchor band. The gate lands in calibration_diagnostics.json, build_manifest.json, and release_manifest.json alongside the existing gates.
- US_STAGE_NAMES + US_DONORS citation entry.

Validation
- Tests (test_us_immigration.py): residual conditions, control binding, weight-awareness, determinism/seed-sensitivity, clone consistency, statutory tags, idempotence, loud failures, manifest citation discipline, gate bands.
- ruff check clean.
- Status tags: 360k DACA, 660k Cuban/Haitian entrants, 12.2M LPR. Composition gate passes.
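Two of the immigration_composition gate conditions can be checked by hand against the reported full 2024 ASEC composition. The sketch below is an illustration only: the helper names are hypothetical, the weighted counts are the rounded millions quoted in the commit message, and the anchor band is omitted because its width is not stated here.

```python
# Weighted SSN card type counts in millions, as reported for the 2024 ASEC.
composition = {
    "CITIZEN": 299.4,
    "NONE": 13.3,
    "NON_CITIZEN_VALID_EAD": 4.7,
    "OTHER_NON_CITIZEN": 8.6,
}


def non_citizen_share(weighted_counts):
    """Share of the weighted population whose card type is not CITIZEN."""
    total = sum(weighted_counts.values())
    return (total - weighted_counts["CITIZEN"]) / total


def cross_column_consistent(ssn_card_type, immigration_status):
    """CITIZEN card type iff CITIZEN status; NONE card type iff UNDOCUMENTED."""
    citizen_ok = (ssn_card_type == "CITIZEN") == (immigration_status == "CITIZEN")
    undocumented_ok = (ssn_card_type == "NONE") == (
        immigration_status == "UNDOCUMENTED"
    )
    return citizen_ok and undocumented_ok


share = non_citizen_share(composition)  # 26.6 / 326.0, about 8.2%
share_in_band = 0.03 <= share <= 0.12
```

With these rounded figures the non-citizen share is roughly 8.2%, inside the [3%, 12%] band, which is consistent with the reported gate pass.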
Notes
🤖 Generated with Claude Code