Exclude SIPP allocation flags from tip income sum (closes #524) #773
Merged
MaxGhenis merged 1 commit into PolicyEngine:main on Apr 17, 2026
`sipp.py` computed tip income as the sum of every column containing "TXAMT", which matches both the `TJB<n>_TXAMT` dollar-amount columns and the `AJB<n>_TXAMT` allocation flags (small ints 0/1/2 indicating Census imputation status). Allocation-flag values were being added to dollar amounts.

Change the filter to `str.match(r"TJB\d_TXAMT")`. Impact is small because the flags are 0-2, but it is incorrect. See issue PolicyEngine#524.

Add a unit test covering the regex (`TJB*` matched, `AJB*` excluded) and a row-sum test showing the new filter yields the correct totals while the old `contains("TXAMT")` would have inflated them by the allocation-flag integers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MaxGhenis added a commit that referenced this pull request on Apr 17, 2026
SIPP panels have one row per person per month. train_tip_model annualized tip_income and employment_income as ``monthly * 12`` on *every* row and then sampled 10,000 training rows. Without a MONTHCODE filter the training frame contained 12 rows per person, each annualized from a different month, so the QRF:

- inflated effective sample size (12x duplicates per person), and
- mixed seasonal tip amounts (restaurant, holiday) with annual figures because each Jan-annualized row disagreed with its Dec-annualized sibling.

Filter to MONTHCODE == 12 before the resample so every training row represents one person-year.

Orthogonal to #524 / #773 (which is about column selection); this is about row selection.

tests/unit/datasets/test_sipp_monthcode_filter.py asserts:

- train_tip_model body now references ``MONTHCODE == 12``.
- The filter collapses a 12-rows-per-person toy frame to one row.
- Without the filter, a single person has 12 duplicate rows (pins the bug premise).
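The row-selection issue described in this commit message can be illustrated with a toy person-month panel. This is only a sketch: apart from `MONTHCODE`, the column names (`PERSON_ID`, `monthly_tips`) and all values are hypothetical, and the real `train_tip_model` may structure the step differently.

```python
import pandas as pd

# Toy panel: 2 people x 12 months. PERSON_ID and monthly_tips are
# hypothetical names; only MONTHCODE comes from the commit message.
rows = [
    {"PERSON_ID": pid, "MONTHCODE": month, "monthly_tips": 10 * month}
    for pid in (1, 2)
    for month in range(1, 13)
]
panel = pd.DataFrame(rows)

# Without the filter: every monthly row is annualized, so each person
# contributes 12 rows with 12 different "annual" figures.
unfiltered = panel.assign(tip_income=panel["monthly_tips"] * 12)

# With the filter: keep only the December row, then annualize, so each
# training row stands for one person-year.
december = panel[panel["MONTHCODE"] == 12]
filtered = december.assign(tip_income=december["monthly_tips"] * 12)

rows_per_person_before = unfiltered.groupby("PERSON_ID").size()
rows_per_person_after = filtered.groupby("PERSON_ID").size()
```

In this toy frame `rows_per_person_before` is 12 for each person and `rows_per_person_after` is 1, which is the collapse the unit test is described as checking.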
Summary
Fixes #524.
`sipp.py` computed tip income by summing every column containing `"TXAMT"`, which matched both:

- `TJB<n>_TXAMT`: the per-job tip dollar amounts (the intended source)
- `AJB<n>_TXAMT`: Census allocation flags (small integers 0/1/2 indicating whether the value was imputed)

Allocation-flag integers were added to dollar totals. The effect is small (the flags range 0–2, so typical inflation per job is <$3/month × 12 = <$36/year) but incorrect.
Fix
Implements the fix proposed in #524 verbatim.
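The column-selection change can be sketched on a toy frame as follows. This is an illustration rather than the code in `sipp.py`: the frame, its values, and the variable names are assumptions, while the two filters are the ones named in this PR. Note that pandas `str.match` anchors only at the start of the string, and `\d` matches exactly one digit.

```python
import pandas as pd

# Toy frame using the SIPP-style column names discussed above.
# Values are illustrative: TJB* are dollar amounts, AJB* are flags.
df = pd.DataFrame(
    {
        "TJB1_TXAMT": [100, 200],
        "TJB2_TXAMT": [50, 75],
        "AJB1_TXAMT": [1, 2],
        "AJB2_TXAMT": [0, 1],
        "TPTOTINC": [3000, 4000],
    }
)

# Old filter: substring match, which also picks up the allocation flags.
old_cols = df.columns[df.columns.str.contains("TXAMT")]

# New filter: match from the start of the name, dollar-amount columns only.
new_cols = df.columns[df.columns.str.match(r"TJB\d_TXAMT")]

old_tip_income = df[old_cols].sum(axis=1)  # flags leak into the total
new_tip_income = df[new_cols].sum(axis=1)  # dollar amounts only
```

With these values, the old filter selects all four `*_TXAMT` columns, so each row total is inflated by that row's flag integers, while the new filter selects only `TJB1_TXAMT` and `TJB2_TXAMT`.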
Regression tests
New `tests/unit/datasets/test_sipp_tip_columns.py`:

- `test_tip_regex_matches_dollar_amounts_only`: the new regex matches `TJB1_TXAMT` / `TJB2_TXAMT` but excludes `AJB1_TXAMT`, `AJB2_TXAMT`, `SOME_TXAMT_OTHER`, and `TPTOTINC`.
- `test_tip_sum_excludes_allocation_flags`: row-sum check. With dollar amounts [100, 50]/[200, 75] and allocation flags [1, 0]/[2, 1], the new filter yields 150/275 while the old `contains("TXAMT")` would have yielded 151/278.

Test plan