QRF-impute CPS-only variables for PUF clone half#589
QRF-impute CPS-only variables for PUF clone half#589
Conversation
48530a9 to
b890507
Compare
Instead of naively duplicating CPS donor values for ~60 income- correlated variables (retirement distributions, transfers, hours, medical expenses, SS sub-components, SPM variables), train a second- stage QRF on CPS data using demographics + PUF-imputed income as predictors and predict plausible values for PUF clones. Fixes #561 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Verify no variable appears in multiple imputation lists - Verify stage-2 income predictors come from stage-1 imputation - Verify sequential QRF preserves y1-y2 correlation (synthetic data) - Verify single sequential call beats separate calls (the old batched approach) on inter-variable correlation All tests use synthetic data — no real PUF/CPS needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Extract _fit_and_predict_qrf() to eliminate duplication between impute_income_variables and impute_cps_only_variables - Extract _to_entity() to deduplicate entity mapping in concat loop - Replace CPS_STAGE2_DEMOGRAPHIC_PREDICTORS with shared DEMOGRAPHIC_PREDICTORS + STAGE1_EXTRA_PREDICTORS - Convert variable lists to sets for O(1) lookup in concat loop - Extract _QRF_SAMPLE_SIZE and _QRF_RANDOM_STATE constants - Pre-compute training/test DataFrames in generate() to avoid redundant calculate_dataframe() calls - Remove unused MagicMock import from tests Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
b890507 to
71e9aec
Compare
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Some variables in CPS_ONLY_IMPUTED_VARIABLES may not exist in the current policyengine-us package. Filter them out before calling calculate_dataframe to avoid ValueError. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use CPS Microsimulation to calculate demographic predictors (is_male, tax_unit_is_joint, etc.) since they aren't stored in the h5 file - Fix entity half-length calculation in splice function: spm_unit variables need spm_unit_id count, not person_id count Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
baogorek
left a comment
There was a problem hiding this comment.
@MaxGhenis , The more I review this, the more I realize that imputation is just as complex as calibration (if not more) and probably should have its own dashboards, etc. But for now, my review is in two inline comments plus other comments from Claude below.
Other CC comments:
SS sub-components (line 39): Same overwrite issue as the retirement contributions. reconcile_ss_subcomponents() in
puf_clone_dataset() ensures these 4 shares sum to 1 and match the imputed social_security total. Stage 2's
unconstrained QRF replaces them with values that have no summing guarantee.
Pension leaf variables (line 31): taxable_private_pension_income and tax_exempt_private_pension_income get silently
discarded. _rename_imputed_to_inputs() runs after Stage 2 and overwrites them with the PUF-imputed values. Wasted
QRF effort.
Formula variables (line 75): weekly_hours_worked and employment_income_last_year have formulas/adds in
policyengine-us, so _drop_formula_variables() removes them after Stage 2. Predictions are wasted unless they're
added to _KEEP_FORMULA_VARS.
child_support_expense (line 73): Listed under # Medical expenses but it's not a medical expense.
Test gap (test file line 33): Should also check overlap with CPS_RETIREMENT_VARIABLES and SS_SUBCOMPONENTS from
puf_impute.py — those have specialized imputation that Stage 2 would overwrite.
| "taxable_private_pension_income", | ||
| "tax_exempt_private_pension_income", | ||
| # Retirement contributions | ||
| "traditional_401k_contributions", |
There was a problem hiding this comment.
At first I was thinking that "cloning" was referring to the stage 2 stuff, but there are "cloning" operations here as well. CC is saying that:
- traditional_401k_contributions
- roth_401k_contributions
- roth_ira_contributions
- self_employed_pension_contributions
"are already imputed by _impute_retirement_contributions() in puf_clone_dataset() with domain-specific constraints — IRS contribution limits by year, age-based catch-up provisions, and income requirements (e.g., 401k zeroed when employment_income = 0).Stage 2's unconstrained QRF overwrites these with values that violate legal limits. "
● 1. puf_clone_dataset() imputes retirement contributions for the PUF clone half with explicit IRS constraints (e.g.,
401k capped at $23,000 + catch-up, zeroed when no employment income)
2. _impute_cps_only_variables() trains a generic QRF that also predicts those same 4 retirement variables — with no
constraints
3. _splice_cps_only_predictions() overwrites the constrained values from step 1 with the unconstrained values from
step 2
Okay, and later it says's that these variables are not even in policyengine-us!
- 3 variables don't exist in policyengine-us
roth_ira_distributions, regular_ira_distributions, other_type_retirement_account_distributions are not in the TBS.
They'll always be zero-filled. The code handles this gracefully via the valid_outputs filter, but consider removing
them or adding a comment.
I know they've been "impossible targets" in calibration.
| "social_security_retirement", | ||
| "social_security_disability", | ||
| "social_security_dependents", | ||
| "social_security_survivors", |
There was a problem hiding this comment.
CC:
All 4 SS sub-components are already handled by reconcile_ss_subcomponents() (puf_impute.py:376), which predicts
shares via QRF, normalizes them to sum to 1, and scales to match the PUF-imputed total. Stage 2's generic QRF would
overwrite with unconstrained values that likely don't sum to social_security.
Overlapping variables:
- social_security_retirement
- social_security_disability
- social_security_dependents
- social_security_survivors
Move all CPS-only variable imputation (retirement contributions, SS sub-components) exclusively to Stage 2 QRF in extended_cps.py, with post-processing constraints: - Clean CPS_ONLY_IMPUTED_VARIABLES: remove 3 non-existent vars, 2 vars handled by Stage 1 rename, add traditional_ira_contributions - Add apply_retirement_constraints() with explicit variable-to-limit mapping (IRS 401k/IRA caps, SE pension rate cap, income-based zeroing) - Add reconcile_ss_subcomponents() to normalize sub-component shares to match imputed social_security total - Extract get_se_pension_limits() into retirement_limits.py (cached) - Add SE pension params to imputation_parameters.yaml - Add 12 new tests covering variable lists, retirement caps, SS normalization Addresses Ben Ogorek's review on PR PolicyEngine#589. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Clean CPS_ONLY_IMPUTED_VARIABLES: remove 3 non-existent vars, 2 vars handled by Stage 1 rename, add traditional_ira_contributions - Add apply_retirement_constraints() with explicit variable-to-limit mapping (IRS 401k/IRA caps, SE pension rate cap, income-based zeroing) - Add reconcile_ss_subcomponents() to normalize sub-component shares to match imputed social_security total - Extract get_se_pension_limits() into retirement_limits.py (cached) - Add 12 new tests covering variable lists, retirement caps, SS normalization Addresses Ben Ogorek's review on PR #589. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
baogorek
left a comment
There was a problem hiding this comment.
@MaxGhenis I 'm going to give you a preemptive check mark since I know you're working through the feedback and I've got to leave the house for a few hours.
Also, I'm eager to see the full publication process happen again.
…ine/policyengine-us-data into impute-cps-only-vars-puf-clone
|
Merged main into this branch to catch up after PR #538 (calibration pipeline improvements). No conflicts — merged cleanly. |
Summary
Fixes #561
The extended CPS doubles the dataset — CPS records on the left, PUF clones on the right. Income variables are already QRF-imputed from PUF data, but ~60 other CPS-only variables (retirement distributions, transfers, hours, medical expenses, Social Security sub-components, SPM variables) were naively duplicated from the CPS donor regardless of the PUF clone's imputed income.
This PR adds a second-stage QRF imputation so the PUF clone half gets plausible values for these CPS-only variables that are consistent with its imputed income profile:
CPS_ONLY_IMPUTED_VARIABLESlist (~60 variables): retirement distributions/contributions, Social Security sub-components, transfer income, SPM variables, medical expenses, hours/employment, and previous-year incomeimpute_cps_only_variables()function: trains a QRF on CPS person-level data where predictors = demographics (age, sex, filing status, dependents) + key income variables (employment, self-employment, Social Security), and outputs = the CPS-only variables. For PUF clone prediction, the income predictor columns are swapped with PUF-imputed values from the first stage.IMPUTED_VARIABLES)Uses the same batching, memory management, and entity-mapping patterns as the existing QRF code.
Test plan
make data)