Exclude SIPP allocation flags from tip income sum (closes #524) #773

Merged
MaxGhenis merged 1 commit into PolicyEngine:main from MaxGhenis:fix-sipp-tip-txamt-regex
Apr 17, 2026
Conversation

@MaxGhenis
Contributor

Summary

Fixes #524.

sipp.py computed tip income by summing every column containing "TXAMT", which matched both:

  • TJB<n>_TXAMT — the per-job tip dollar amounts (the intended source)
  • AJB<n>_TXAMT — Census allocation flags (small integers 0/1/2 indicating whether the value was imputed)

Allocation-flag integers were added to dollar totals. The effect is small (the flags range 0–2, so inflation per job is at most $2/month × 12 = $24/year) but incorrect.

Fix

# before
df["tip_income"] = (
    df[df.columns[df.columns.str.contains("TXAMT")]].fillna(0).sum(axis=1) * 12
)

# after — match dollar-amount columns only
df["tip_income"] = (
    df[df.columns[df.columns.str.match(r"TJB\d_TXAMT")]].fillna(0).sum(axis=1)
    * 12
)

Implements the fix proposed in #524 verbatim.
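
To make the difference between the two filters concrete, here is a minimal sketch run on column names only. The list of names is illustrative: TJB<n>_TXAMT and AJB<n>_TXAMT come from this PR, while TJB1_TXAMT_X is a hypothetical name added to show how the pattern is anchored. It is not known here whether any such column exists in SIPP.

```python
import pandas as pd

# Illustrative column names. TJB1_TXAMT_X is hypothetical and exists only
# to show where the pattern is (and is not) anchored.
columns = pd.Index(
    [
        "TJB1_TXAMT",
        "TJB2_TXAMT",
        "AJB1_TXAMT",
        "AJB2_TXAMT",
        "SOME_TXAMT_OTHER",
        "TPTOTINC",
        "TJB1_TXAMT_X",
    ]
)

# Old filter: substring search, so it also picks up the AJB allocation flags.
old_selection = list(columns[columns.str.contains("TXAMT")])

# New filter from this PR: str.match anchors the pattern at the start of the name.
new_selection = list(columns[columns.str.match(r"TJB\d_TXAMT")])

# Stricter variant (an alternative, not part of this PR): the whole name must match.
strict_selection = list(columns[columns.str.fullmatch(r"TJB\d_TXAMT")])

print(old_selection)
print(new_selection)
print(strict_selection)
```

Because str.match anchors only at the start, a name that merely begins with TJB<n>_TXAMT would still be selected; str.fullmatch would reject it. Whether that distinction matters depends on the actual SIPP column set, which this sketch does not assume.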

Regression tests

New tests/unit/datasets/test_sipp_tip_columns.py:

  • test_tip_regex_matches_dollar_amounts_only — the new regex matches TJB1_TXAMT/TJB2_TXAMT but excludes AJB1_TXAMT, AJB2_TXAMT, SOME_TXAMT_OTHER, and TPTOTINC.
  • test_tip_sum_excludes_allocation_flags — row-sum check: with dollar amounts [100, 50]/[200, 75] and allocation flags [1, 0]/[2, 1], the new filter yields 150/275 while the old contains("TXAMT") would have yielded 151/278.
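
The row-sum check described above can be reproduced on a two-row toy frame. This is a sketch of the idea using the numbers quoted in the bullet, not the contents of the actual test file; the helper name monthly_tip_sum is hypothetical.

```python
import pandas as pd

# Two people, two jobs each: dollar amounts plus allocation flags,
# using the values quoted in the test description.
df = pd.DataFrame(
    {
        "TJB1_TXAMT": [100, 200],
        "TJB2_TXAMT": [50, 75],
        "AJB1_TXAMT": [1, 2],
        "AJB2_TXAMT": [0, 1],
    }
)


def monthly_tip_sum(frame, mask):
    """Sum the selected columns row-wise, treating missing values as 0."""
    return frame[frame.columns[mask]].fillna(0).sum(axis=1)


# Old filter: every column containing "TXAMT", flags included.
old_monthly = monthly_tip_sum(df, df.columns.str.contains("TXAMT"))

# New filter: dollar-amount columns only.
new_monthly = monthly_tip_sum(df, df.columns.str.match(r"TJB\d_TXAMT"))

# Annualized overstatement caused by the flags.
annual_inflation = (old_monthly - new_monthly) * 12

print(old_monthly.tolist())       # [151, 278]
print(new_monthly.tolist())       # [150, 275]
print(annual_inflation.tolist())  # [12, 36]
```

The second row shows the largest overstatement in this toy data: flags of 2 and 1 add $3 per month, or $36 per year, across two jobs.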

Test plan

  • 2 new unit tests pass.
  • CI passes.

`sipp.py` computed tip income as the sum of every column containing
"TXAMT", which matches both the TJB<n>_TXAMT dollar-amount columns
and the AJB<n>_TXAMT allocation flags (small ints 0/1/2 indicating
Census imputation status). Allocation-flag values were being added
to dollar amounts.

Change the filter to `str.match(r"TJB\d_TXAMT")`. Impact is small
because the flags are 0-2, but it is incorrect. See issue PolicyEngine#524.

Add a unit test covering the regex (TJB* matched, AJB* excluded)
and a row-sum test showing the new filter yields the correct
totals while the old `contains("TXAMT")` would have inflated them
by the allocation-flag integers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MaxGhenis added a commit that referenced this pull request Apr 17, 2026
SIPP panels have one row per person per month. train_tip_model
annualized tip_income and employment_income as ``monthly * 12`` on
*every* row and then sampled 10,000 training rows. Without a
MONTHCODE filter the training frame contained 12 rows per person —
each annualized from a different month — so the QRF:

- inflated effective sample size (12x duplicates per person), and
- mixed seasonal tip amounts (restaurant, holiday) with annual
  figures because each Jan-annualized row disagreed with its
  Dec-annualized sibling.

Filter to MONTHCODE == 12 before the resample so every training row
represents one person-year. Orthogonal to #524 / #773 (which is about
column selection); this is about row selection.

tests/unit/datasets/test_sipp_monthcode_filter.py asserts:
- train_tip_model body now references ``MONTHCODE == 12``.
- The filter collapses a 12-rows-per-person toy frame to one row.
- Without the filter, a single person has 12 duplicate rows
  (pins the bug premise).
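
A minimal sketch of the row-selection issue, assuming a toy panel with hypothetical person_id and monthly_tips columns (only MONTHCODE and tip_income are names taken from the commit message). It illustrates the premise and the filter, not the actual train_tip_model code.

```python
import pandas as pd

# Toy SIPP-like panel: 2 people x 12 months, one row per person per month.
# person_id and monthly_tips are hypothetical names used for illustration.
rows = [
    {"person_id": person, "MONTHCODE": month, "monthly_tips": 10 * month}
    for person in (1, 2)
    for month in range(1, 13)
]
panel = pd.DataFrame(rows)

# Annualizing on every row gives 12 different "annual" figures per person.
panel["tip_income"] = panel["monthly_tips"] * 12

rows_per_person_before = panel.groupby("person_id").size()
distinct_annual_values = panel.groupby("person_id")["tip_income"].nunique()

# Keep only the December row so each training row is one person-year.
filtered = panel[panel["MONTHCODE"] == 12]
rows_per_person_after = filtered.groupby("person_id").size()

print(rows_per_person_before.tolist())  # [12, 12]
print(distinct_annual_values.tolist())  # [12, 12]
print(rows_per_person_after.tolist())   # [1, 1]
```

In this toy frame the monthly amounts differ by construction, so each person carries 12 conflicting annualized values before the filter and exactly one afterwards.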
MaxGhenis merged commit fe8e239 into PolicyEngine:main Apr 17, 2026
8 of 9 checks passed
Development

Successfully merging this pull request may close these issues.

Bug: SIPP tip income imputation includes allocation flag columns
