Tighten population tolerance and add fidelity tests#366
Merged
Conversation
The weighted-UK-population drift that motivated #310 has already dropped from ~6.5% to ~1.6% on current main as a side-effect of the data-pipeline improvements landed yesterday (stage-2 QRF #362, TFC target refresh #363, reported-anchor takeup #359). Tightens `test_population` tolerance from 7 % to 3 % to lock in that gain — any future calibration change that regresses back toward the pre-April-2026 overshoot now trips CI instead of silently drifting. Adds a new `test_population_fidelity.py` with four regression tests extracted from the #310 draft: - weighted-total ONS match (3 % tolerance) - household-count sanity range (25-33 M) - non-inflation guard (< 72 M) - country-populations-sum-to-UK consistency Does not include #310's loss-function change or Scotland target removal; those are independent proposals and should be evaluated on their own merits once the practical overshoot is resolved. Co-authored-by: Vahid Ahmadi <va.vahidahmadi@gmail.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
5 tasks
First CI run on this branch produced 71.8M (3.31% over target) where yesterday's main build produced 70.97M (1.58%). Stochastic dropout in the calibration optimiser (`dropout_weights(weights, 0.05)`) gives ~1-2 percentage point build-to-build variance on the population total. 4% keeps the regression gate well below the pre-April-2026 overshoot (~6.5%) while not flaking on normal stochastic variance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The calibrated-UK-population overshoot that motivated #310 has already dropped from ~6.5 % to ~1.6 % on current main as a side-effect of yesterday's data-pipeline merges (#362 stage-2 QRF, #363 TFC target refresh, #359 reported-anchor takeup). Latest push CI's constituency calibration log settles at 70.97 M vs 69.87 M target = +1.58 %.
This PR locks in that gain with tests:
test_populationtolerance from 7 % -> 3 %test_population_fidelity.pywith four regression tests extracted from Fix calibration population overshoot (~6% drift) #310:Why this instead of #310
#310 proposed (in its final form) (1) a log-ratio loss function rewrite and (2) removing two Scotland NRS targets. Both are defensible but touch broader calibration mechanics and deserve separate justification after the practical overshoot is resolved. Nikhil already flagged concerns with weighted-target / post-hoc-rescale approaches in that PR; this follow-up sidesteps all of that by just encoding the current-state gain as a test.
Credit
Tests extracted from @vahid-ahmadi's work in #310. This PR leaves the loss-function and target-removal proposals there for separate review.
Test plan
make formatpassesGenerated with Claude Code