Add salary sacrifice imputation to dataset pipeline #220
Merged
nikhilwoodruff merged 6 commits into main on Nov 26, 2025
Uses SALSAC routing question from FRS to train ML model predicting salary sacrifice participation, then imputes to ~30% of employees per HMRC survey data. This avoids the need for heavy calibration weight adjustments that inflate population estimates.

Changes:
- frs.py: Extract SALSAC field before numeric conversion, add salary_sacrifice_reported and salary_sacrifice_asked indicators
- salary_sacrifice.py: New imputation module using QRF model
- create_datasets.py: Add salary sacrifice imputation step
- __init__.py: Export impute_salary_sacrifice function

Closes #219

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
The .clip(upper=1) method doesn't work on numpy arrays in numpy 2.x. Changed to np.clip(arr, 0, 1) for compatibility.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
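A minimal sketch of the replacement call described in this commit, assuming the values being bounded are predicted shares held in a plain numpy array. The array contents are made up for illustration and are not taken from the PR.

```python
import numpy as np

# Hypothetical predictions, some of which fall outside the [0, 1] range.
arr = np.array([-0.2, 0.4, 1.3])

# np.clip takes the array, then the lower and upper bounds, positionally.
# Unlike .clip(upper=1), it also bounds the values from below at 0.
clipped = np.clip(arr, 0, 1)
print(clipped)
```

The `upper=` keyword belongs to the pandas `Series.clip` signature, so the function form is the safer choice when the object may be either a Series or an ndarray.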
- Use QRF model from microimpute to predict boolean participation
- Remove targeting logic - that happens via weight calibration
- Train on FRS respondents who answered SALSAC
- Predict for non-respondents based on age and employment income
- Assign SS contributions = employee pension for imputed participants

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Instead of predicting boolean participation then setting amounts, directly predict pension_contributions_via_salary_sacrifice as a continuous variable.

The model learns from those who were asked:
- Non-participants have 0 (SALSAC='2')
- Participants have their reported SPNAMT amount (SALSAC='1')

This is simpler and preserves the distribution of amounts from the training data rather than assuming SS = employee_pension.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
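The training target described in this commit could be built roughly as follows. This is a sketch under stated assumptions, not the code in salary_sacrifice.py: the rows are invented, and treating any SALSAC value other than '1' or '2' (here a blank string) as "not asked" is an assumption about how the raw field is coded.

```python
import numpy as np
import pandas as pd

# Hypothetical job-level records. SALSAC and SPNAMT are the FRS fields named
# in the commit message; the values are made up for illustration.
jobs = pd.DataFrame(
    {
        "SALSAC": ["1", "2", " ", "2", "1"],
        "SPNAMT": [50.0, 0.0, 0.0, 0.0, 120.0],
    }
)

# Assumption: only codes '1' and '2' mean the question was asked.
asked = jobs["SALSAC"].isin(["1", "2"])
participant = jobs["SALSAC"] == "1"

# Continuous target: reported amount for participants, 0 for asked
# non-participants, NaN (unknown) for jobs where the question was not asked.
jobs["pension_contributions_via_salary_sacrifice"] = np.where(
    participant, jobs["SPNAMT"], np.where(asked, 0.0, np.nan)
)
jobs["salary_sacrifice_asked"] = asked

# Only rows with a known target are used to fit the model.
training_rows = jobs[jobs["salary_sacrifice_asked"]]
print(training_rows)
```

Because the zeros stay in the training data, a model fitted to this target learns the share of non-participants and the spread of amounts together, which is the point the commit message makes about preserving the distribution.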
Summary
Problem
The FRS under-reports salary sacrifice usage: only ~1% of jobs have non-zero SPNAMT, while HMRC data indicates ~30% of private sector employees use it. Attempting to calibrate to HMRC targets without imputation inflates the population estimate from 68M to 74M (6% over target vs 2% tolerance).
Solution
Use the FRS SALSAC routing question to train an ML model predicting salary sacrifice participation.
Note on Approach
This differs from other imputations (wealth, consumption, income) which train on external microdata (WAS, LCFS, SPI) and impute to all FRS records. This imputation trains on within-FRS respondents who answered SALSAC and imputes only to non-responders. The closest analogy is the Council Tax imputation in frs.py.
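The within-survey pattern can be sketched as below: fit on the records that were asked, then fill values only for the records that were not. The PR uses a QRF model from microimpute; this sketch swaps in a simple random-donor draw within an age band purely to show the train/impute split, and uses age alone as the predictor. The data, the ten-year band, and the helper name `draw_from_donors` are all hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
col = "pension_contributions_via_salary_sacrifice"

# Hypothetical person-level records; the last two were never asked SALSAC.
people = pd.DataFrame(
    {
        "age": [25, 31, 38, 45, 52, 58, 29, 47],
        "salary_sacrifice_asked": [True] * 6 + [False] * 2,
        col: [0.0, 40.0, 0.0, 90.0, 0.0, 60.0, np.nan, np.nan],
    }
)

asked = people["salary_sacrifice_asked"]
train = people[asked]        # donors: answered the routing question
to_impute = people[~asked]   # recipients: question not asked


def draw_from_donors(row):
    """Stand-in for the QRF: copy the value of a random donor of similar age."""
    donors = train[(train["age"] - row["age"]).abs() <= 10]
    if donors.empty:
        donors = train
    return rng.choice(donors[col].to_numpy())


# Observed answers are left untouched; only the gaps are filled.
people.loc[~asked, col] = to_impute.apply(draw_from_donors, axis=1)
print(people)
```

The imputations that train on external microdata would instead fit on a separate dataset and predict for every FRS record, so there is no mask of this kind in those steps.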
Changes
- frs.py: Extract SALSAC before numeric conversion, add salary_sacrifice_reported and salary_sacrifice_asked indicators
- salary_sacrifice.py: New imputation module using QRF model
- create_datasets.py: Add salary sacrifice imputation step (after income, before uprating)
- __init__.py: Export impute_salary_sacrifice function
Test plan
Closes #219
🤖 Generated with Claude Code