Exclude SIPP allocation flags from tip income sum (closes #524) by MaxGhenis · Pull Request #773 · PolicyEngine/policyengine-us-data

MaxGhenis · 2026-04-17T11:30:20Z

Summary

Fixes #524.

sipp.py computed tip income by summing every column containing "TXAMT", which matched both:

TJB<n>_TXAMT — the per-job tip dollar amounts (the intended source)
AJB<n>_TXAMT — Census allocation flags (small integers 0/1/2 indicating whether the value was imputed)

Allocation-flag integers were added to dollar totals. The effect is small (the flags range 0–2, so the inflation is at most $2/month per job, or $24/year after the × 12 annualization) but incorrect.

Fix

# before
df["tip_income"] = (
    df[df.columns[df.columns.str.contains("TXAMT")]].fillna(0).sum(axis=1) * 12
)

# after — match dollar-amount columns only
df["tip_income"] = (
    df[df.columns[df.columns.str.match(r"TJB\d_TXAMT")]].fillna(0).sum(axis=1)
    * 12
)

Implements the fix proposed in #524 verbatim.
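
To make the column-selection difference concrete, here is a minimal sketch comparing the two filters on a toy column index. The column list is illustrative, built from the name patterns mentioned in this PR rather than read from an actual SIPP file.

```python
import pandas as pd

# Illustrative column names following the patterns named in this PR;
# a real SIPP extract has many more columns.
columns = pd.Index(
    ["TJB1_TXAMT", "TJB2_TXAMT", "AJB1_TXAMT", "AJB2_TXAMT", "TPTOTINC"]
)

# Old filter: substring search, so the AJB allocation flags are selected too.
old_selected = list(columns[columns.str.contains("TXAMT")])

# New filter: str.match anchors the pattern at the start of the name,
# so only names beginning with TJB<digit>_TXAMT are kept.
new_selected = list(columns[columns.str.match(r"TJB\d_TXAMT")])

print(old_selected)  # ['TJB1_TXAMT', 'TJB2_TXAMT', 'AJB1_TXAMT', 'AJB2_TXAMT']
print(new_selected)  # ['TJB1_TXAMT', 'TJB2_TXAMT']
```

Note that `str.match` anchors only at the start of the string and `\d` matches a single digit, so a name with a trailing suffix after `TXAMT` would still be selected. If that ever matters, `str.fullmatch` is the stricter option.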

Regression tests

New tests/unit/datasets/test_sipp_tip_columns.py:

test_tip_regex_matches_dollar_amounts_only — the new regex matches TJB1_TXAMT/TJB2_TXAMT but excludes AJB1_TXAMT, AJB2_TXAMT, SOME_TXAMT_OTHER, and TPTOTINC.
test_tip_sum_excludes_allocation_flags — row-sum check: with dollar amounts [100, 50]/[200, 75] and allocation flags [1, 0]/[2, 1], the new filter yields 150/275 while the old contains("TXAMT") would have yielded 151/278.
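
The row-sum check can be reproduced with a two-row toy frame. This is a sketch using the numbers quoted above, not the test file itself; the helper `tip_sum` is a hypothetical name introduced for illustration, and the × 12 annualization is left out to match the quoted totals.

```python
import pandas as pd

# Two people, two jobs each: dollar amounts plus allocation flags.
df = pd.DataFrame(
    {
        "TJB1_TXAMT": [100, 200],
        "TJB2_TXAMT": [50, 75],
        "AJB1_TXAMT": [1, 2],
        "AJB2_TXAMT": [0, 1],
    }
)


def tip_sum(frame, mask):
    """Row-wise sum over the columns selected by a boolean mask (hypothetical helper)."""
    return frame[frame.columns[mask]].fillna(0).sum(axis=1)


new_totals = tip_sum(df, df.columns.str.match(r"TJB\d_TXAMT")).tolist()
old_totals = tip_sum(df, df.columns.str.contains("TXAMT")).tolist()

print(new_totals)  # [150, 275]
print(old_totals)  # [151, 278]
```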

Test plan

2 new unit tests pass.
CI passes.

`sipp.py` computed tip income as the sum of every column containing "TXAMT", which matches both the TJB<n>_TXAMT dollar-amount columns and the AJB<n>_TXAMT allocation flags (small ints 0/1/2 indicating Census imputation status). Allocation-flag values were being added to dollar amounts.

Change the filter to `str.match(r"TJB\d_TXAMT")`. Impact is small because the flags are 0-2, but it is incorrect. See issue PolicyEngine#524.

Add a unit test covering the regex (TJB* matched, AJB* excluded) and a row-sum test showing the new filter yields the correct totals while the old `contains("TXAMT")` would have inflated them by the allocation-flag integers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SIPP panels have one row per person per month. train_tip_model annualized tip_income and employment_income as ``monthly * 12`` on *every* row and then sampled 10,000 training rows. Without a MONTHCODE filter the training frame contained 12 rows per person, each annualized from a different month, so the QRF:

- inflated effective sample size (12x duplicates per person), and
- mixed seasonal tip amounts (restaurant, holiday) with annual figures because each Jan-annualized row disagreed with its Dec-annualized sibling.

Filter to MONTHCODE == 12 before the resample so every training row represents one person-year. Orthogonal to #524 / #773 (which is about column selection); this is about row selection.

tests/unit/datasets/test_sipp_monthcode_filter.py asserts:

- train_tip_model body now references ``MONTHCODE == 12``.
- The filter collapses a 12-rows-per-person toy frame to one row.
- Without the filter, a single person has 12 duplicate rows (pins the bug premise).
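
The row-selection issue can be illustrated with a toy panel. This is a sketch under stated assumptions: `person_id` and `monthly_tips` are hypothetical column names chosen for the example, and only `MONTHCODE` comes from the description above.

```python
import pandas as pd

# Toy panel: 2 people x 12 months. "person_id" and "monthly_tips" are
# hypothetical names; MONTHCODE is the month index referenced above.
panel = pd.DataFrame(
    {
        "person_id": [p for p in (1, 2) for _ in range(12)],
        "MONTHCODE": list(range(1, 13)) * 2,
        "monthly_tips": [10.0 * m for m in range(1, 13)] * 2,
    }
)

# Annualizing every row gives 12 conflicting "annual" values per person.
panel["annualized"] = panel["monthly_tips"] * 12
distinct_annual_values = panel.groupby("person_id")["annualized"].nunique()

# Keeping only MONTHCODE == 12 leaves one row per person.
december = panel[panel["MONTHCODE"] == 12]
rows_before = panel.groupby("person_id").size()
rows_after = december.groupby("person_id").size()

print(rows_before.tolist())             # [12, 12]
print(rows_after.tolist())              # [1, 1]
print(distinct_annual_values.tolist())  # [12, 12]
```

In this toy data the January row annualizes to 120 while the December row annualizes to 1440 for the same person, which is the disagreement between sibling rows that the filter removes.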

MaxGhenis mentioned this pull request Apr 17, 2026

Filter SIPP tip-model training frame to MONTHCODE == 12 #785

Merged

2 tasks

MaxGhenis merged commit fe8e239 into PolicyEngine:main Apr 17, 2026
8 of 9 checks passed

MaxGhenis merged 1 commit into PolicyEngine:main from MaxGhenis:fix-sipp-tip-txamt-regex
