Add salary sacrifice imputation to dataset pipeline#220

Merged: nikhilwoodruff merged 6 commits into main from impute-salary-sacrifice on Nov 26, 2025

@MaxGhenis
Contributor

Summary

  • Adds salary sacrifice participation imputation using an ML model trained on the FRS SALSAC routing question
  • Extracts the SALSAC field before numeric conversion so that Yes, No, and not-asked responses stay distinguishable
  • Imputes a ~30% participation rate, per HMRC survey data, avoiding heavy calibration weight adjustments

Problem

The FRS under-reports salary sacrifice usage: only ~1% of jobs have a non-zero SPNAMT, while HMRC data indicates that ~30% of private sector employees use it. Calibrating to HMRC targets without imputation inflates the population from 68M to 74M (6% over target, against a 2% tolerance).

Solution

Use the FRS SALSAC routing question to train an ML model that predicts salary sacrifice participation:

  • Training data: SALSAC='1' (Yes: 224 jobs) + SALSAC='2' (No: 3,803 jobs)
  • Imputation candidates: SALSAC=' ' (skip/not asked: 13,265 jobs)
  • External validation: ~30% participation rate from HMRC surveys
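
The split described above can be sketched in a few lines. This is only an illustration: the record layout, the `job_id` field, and the `split_by_salsac` helper are hypothetical, and only the SALSAC codes (`'1'`, `'2'`, `' '`) come from this PR.

```python
from collections import Counter

# Hypothetical job records; only the SALSAC codes ('1', '2', ' ') come from the PR.
jobs = [
    {"job_id": 1, "SALSAC": "1"},
    {"job_id": 2, "SALSAC": "2"},
    {"job_id": 3, "SALSAC": " "},
    {"job_id": 4, "SALSAC": "2"},
    {"job_id": 5, "SALSAC": " "},
]


def split_by_salsac(records):
    """Return (training, candidates) based on the raw SALSAC code."""
    training, candidates = [], []
    for rec in records:
        code = rec["SALSAC"]
        if code in ("1", "2"):
            # Asked: the label is True for Yes ('1') and False for No ('2').
            training.append({**rec, "participates": code == "1"})
        elif code.strip() == "":
            # Blank: the question was skipped, so the job is an imputation candidate.
            candidates.append(rec)
    return training, candidates


training, candidates = split_by_salsac(jobs)
print(Counter(r["participates"] for r in training))
print([r["job_id"] for r in candidates])
```

With the counts quoted above, 224 of the 4,027 jobs that were asked (about 5.6%) answered Yes, so the training labels are heavily imbalanced.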

Note on Approach

This differs from the other imputations (wealth, consumption, income), which train on external microdata (WAS, LCFS, SPI) and impute to all FRS records. This imputation trains on FRS respondents who answered SALSAC and imputes only to jobs where the question was skipped or not asked. The closest analogy is the Council Tax imputation in frs.py.

Changes

  • frs.py: Extract SALSAC before numeric conversion, add salary_sacrifice_reported and salary_sacrifice_asked indicators
  • salary_sacrifice.py: New imputation module using QRF model
  • create_datasets.py: Add salary sacrifice imputation step (after income, before uprating)
  • __init__.py: Export impute_salary_sacrifice function
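
To illustrate why the SALSAC extraction in frs.py has to happen before numeric conversion, here is a minimal pandas sketch. The toy table is hypothetical, and the `fillna(0)` step is an assumption about how a numeric conversion might treat blanks, not a description of the actual frs.py code.

```python
import pandas as pd

# Hypothetical raw job table; values are read as strings, blanks mean "not asked".
raw = pd.DataFrame(
    {"SALSAC": ["1", "2", " ", "2"], "SPNAMT": ["50", "0", " ", "0"]}
)

# Capture the indicators from the raw strings first.
salsac = raw["SALSAC"].str.strip()
salary_sacrifice_asked = salsac.isin(["1", "2"])
salary_sacrifice_reported = salsac == "1"

# Numeric conversion afterwards: blanks are coerced to NaN, and a later
# fill with 0 (assumed here) erases the "not asked" category.
numeric = raw.apply(pd.to_numeric, errors="coerce").fillna(0)

print(salary_sacrifice_asked.tolist())
print(numeric["SALSAC"].tolist())
```

After conversion the blank row carries a plain numeric value, so the asked and not-asked groups can only be recovered from indicators built beforehand.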

Test plan

  • Verify FRS processing extracts SALSAC correctly
  • Confirm imputation targets ~30% participation rate
  • Check population test passes with imputation (should be within 2% tolerance)
  • Run full dataset creation pipeline
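
The population check in the test plan amounts to a relative-error comparison. A minimal sketch follows; the `within_tolerance` helper and the `TARGET` value are illustrative assumptions, since the PR does not state the exact target used by the population test.

```python
def within_tolerance(estimate: float, target: float, tolerance: float = 0.02) -> bool:
    """Return True if the estimate is within the relative tolerance of the target."""
    return abs(estimate / target - 1) <= tolerance


# Illustrative target only; the real test target is not given in this PR.
TARGET = 70_000_000

print(within_tolerance(74_000_000, TARGET))  # population inflated by calibration
print(within_tolerance(71_000_000, TARGET))  # population close to the target
```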

Closes #219

🤖 Generated with Claude Code

MaxGhenis and others added 6 commits on November 26, 2025 at 16:05

Uses SALSAC routing question from FRS to train ML model predicting
salary sacrifice participation, then imputes to ~30% of employees
per HMRC survey data. This avoids the need for heavy calibration
weight adjustments that inflate population estimates.

Changes:
- frs.py: Extract SALSAC field before numeric conversion, add
  salary_sacrifice_reported and salary_sacrifice_asked indicators
- salary_sacrifice.py: New imputation module using QRF model
- create_datasets.py: Add salary sacrifice imputation step
- __init__.py: Export impute_salary_sacrifice function

Closes #219

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The .clip(upper=1) call uses the pandas keyword signature and fails when the
value is a numpy array (observed under numpy 2.x). Changed to
np.clip(arr, 0, 1) for compatibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
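
The NumPy compatibility fix in the commit above can be illustrated with a small sketch. The `probs` array is a made-up example, not data from the pipeline.

```python
import numpy as np

# Made-up values, some of which fall outside the [0, 1] range.
probs = np.array([-0.2, 0.4, 1.3])

# np.clip takes the lower and upper bounds positionally and works on arrays,
# unlike the pandas-style keyword form .clip(upper=1).
array_clipped = np.clip(probs, 0, 1)

print(array_clipped)
```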
- Use QRF model from microimpute to predict boolean participation
- Remove targeting logic - that happens via weight calibration
- Train on FRS respondents who answered SALSAC
- Predict for non-respondents based on age and employment income
- Assign SS contributions = employee pension for imputed participants

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Instead of predicting boolean participation then setting amounts,
directly predict pension_contributions_via_salary_sacrifice as a
continuous variable. The model learns from those who were asked:
- Non-participants have 0 (SALSAC='2')
- Participants have their reported SPNAMT amount (SALSAC='1')

This is simpler and preserves the distribution of amounts from
the training data rather than assuming SS = employee_pension.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
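
The continuous-target approach in the commit above can be sketched as follows. This is not the QRF model from microimpute: a simple nearest-neighbour random draw stands in for it, and the toy rows, the `training_target` and `impute_amount` helpers, and the distance scaling are all hypothetical. Only the target definition (0 for SALSAC='2', SPNAMT for SALSAC='1') and the predictors (age and employment income) come from the commit messages.

```python
import random

# Hypothetical training rows for jobs that were asked SALSAC.
asked = [
    {"age": 30, "employment_income": 42_000, "SALSAC": "1", "SPNAMT": 120.0},
    {"age": 45, "employment_income": 25_000, "SALSAC": "2", "SPNAMT": 0.0},
    {"age": 52, "employment_income": 60_000, "SALSAC": "1", "SPNAMT": 300.0},
    {"age": 28, "employment_income": 22_000, "SALSAC": "2", "SPNAMT": 0.0},
]


def training_target(row):
    """Reported amount for participants ('1'), zero for non-participants ('2')."""
    return row["SPNAMT"] if row["SALSAC"] == "1" else 0.0


def impute_amount(candidate, training_rows, k=2, rng=random):
    """Draw a target from the k nearest training rows (stand-in for a QRF)."""

    def distance(row):
        # Arbitrary scaling so that age and income contribute comparably.
        return (
            abs(row["age"] - candidate["age"]) / 10
            + abs(row["employment_income"] - candidate["employment_income"]) / 10_000
        )

    nearest = sorted(training_rows, key=distance)[:k]
    return training_target(rng.choice(nearest))


rng = random.Random(0)
candidate = {"age": 50, "employment_income": 58_000}
print(impute_amount(candidate, asked, rng=rng))
```

Because the draw returns an observed training value, zeros and reported amounts both appear among the imputed values, which is the property the commit describes as preserving the distribution of amounts.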
@nikhilwoodruff nikhilwoodruff merged commit 3245377 into main Nov 26, 2025
3 checks passed
Linked issue: Impute salary sacrifice participation to full population
