Fix enhanced CPS OOM with half-sample ExtendedCPS by baogorek · Pull Request #579 · PolicyEngine/policyengine-us-data

baogorek · 2026-03-06T14:55:01Z

Closes #580, closes #581, closes #582, closes #583, closes #584, closes #585, closes #586

Summary

- Fix the enhanced CPS OOM by building ExtendedCPS from a 50% sample instead of the full CPS, reducing peak memory from ~30GB to ~18GB (Enhanced CPS OOM: revert to half-sample ExtendedCPS input, #580; Enhanced CPS OOM during Modal CI builds, #584)
- Free memory earlier in the calibration pipeline: delete stale DataFrames and downcast to float32 (Free memory earlier in calibration pipeline, #582)
- Fix CI checkpoint cache invalidation: scope checkpoint paths by commit SHA so that new commits force a full rebuild instead of restoring stale H5 files from previous builds (CI checkpoint cache never invalidates on code changes, #583 and #585)
- Widen the poverty rate test threshold to 30% to accommodate sampling variance from the half-sample approach (Poverty rate sanity test threshold too tight, #581; Poverty rate test threshold too tight for sampled CPS, #586)

Details

OOM fix (#580, #584)

ExtendedCPS was consuming ~30GB during construction, exceeding Modal's 32GB limit. The fix samples 50% of CPS households before extending, cutting peak memory to ~18GB while preserving calibration targets (applied post-extension).
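The household-level sampling described above can be sketched as follows. This is a minimal illustration using only the standard library, not the code from this PR: the function names, the record layout, and the seed are all assumptions. The point it shows is that sampling happens on household IDs, so every member of a sampled household is kept together.

```python
import random


def sample_households(household_ids, frac=0.5, seed=0):
    """Return a reproducible random subset of the unique household IDs."""
    unique_ids = sorted(set(household_ids))
    k = round(len(unique_ids) * frac)
    rng = random.Random(seed)  # fixed seed so rebuilds pick the same households
    return set(rng.sample(unique_ids, k))


def filter_persons(persons, kept_households):
    """Keep every person whose household was sampled, so households stay whole."""
    return [p for p in persons if p["household_id"] in kept_households]


# Toy data (hypothetical): 10 households with 1 to 3 members each.
persons = [
    {"person_id": 10 * h + i, "household_id": h}
    for h in range(10)
    for i in range(1 + h % 3)
]

kept = sample_households([p["household_id"] for p in persons], frac=0.5, seed=42)
subset = filter_persons(persons, kept)
```

Because the extension step runs on the sampled households only, everything downstream scales with roughly half the rows.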

Memory cleanup (#582)

Delete the full loss matrix DataFrame after creating the cleaned copy, free the Microsimulation object after build_loss_matrix returns, and convert the loss matrix to float32 before passing to reweight().
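The order of operations above can be illustrated with a small standard-library sketch. It is not the pipeline code: `build_loss_matrix` here is a hypothetical stand-in that returns a toy matrix plus a placeholder for the simulation object, and `array` typecodes `"d"` and `"f"` stand in for float64 and float32 columns.

```python
import gc
from array import array


def build_loss_matrix(n_rows, n_cols):
    """Hypothetical stand-in: returns a toy matrix and a large helper object."""
    sim = {"scratch": [0.0] * (n_rows * n_cols)}  # plays the role of the simulation
    full = [
        [float(r + c) if c % 5 else 0.0 for c in range(n_cols)]
        for r in range(n_rows)
    ]
    return full, sim


full, sim = build_loss_matrix(100, 20)
del sim  # the simulation is no longer needed once the matrix exists

# "Cleaned" copy: drop columns that are entirely zero, store as 8-byte floats.
keep = [c for c in range(20) if any(row[c] != 0.0 for row in full)]
cleaned = array("d", (row[c] for row in full for c in keep))

del full      # drop the uncleaned matrix as soon as the copy exists
gc.collect()  # ask the collector to reclaim it before the next allocation

bytes64 = cleaned.itemsize * len(cleaned)
loss32 = array("f", iter(cleaned))  # downcast to 4-byte floats
del cleaned
bytes32 = loss32.itemsize * len(loss32)
```

On common platforms `bytes32` is half of `bytes64`, which is the saving the float32 conversion is meant to provide before reweighting.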

Checkpoint invalidation fix (#583, #585)

is_checkpointed() only checked if the file existed and was non-empty — it never checked whether the code had changed. New commits on main kept restoring stale H5 files from previous builds (broken since Feb 4, ~27 commits affected).

Approach: Include the git commit SHA in the checkpoint path:

Old layout: /checkpoints/{branch}/{filename}
New layout: /checkpoints/{branch}/{commit_sha}/{filename}

Within the same build (same commit), preemption resilience works exactly as before. Across different commits, checkpoints are invisible because they live under a different SHA directory. Stale commit directories are cleaned up at the start of each build.
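A minimal sketch of the SHA-scoped layout, assuming a local directory in place of the real checkpoint volume. The function signatures, the example SHAs, and the file name `cps_2024.h5` are illustrative assumptions; only the path layout and the "exists and is non-empty" check come from the description above.

```python
import shutil
import tempfile
from pathlib import Path


def checkpoint_path(root, branch, commit_sha, filename):
    """New layout: {root}/{branch}/{commit_sha}/{filename}."""
    return Path(root) / branch / commit_sha / filename


def is_checkpointed(root, branch, commit_sha, filename):
    """True only if a non-empty checkpoint exists for this exact commit."""
    path = checkpoint_path(root, branch, commit_sha, filename)
    return path.is_file() and path.stat().st_size > 0


def clean_stale_commits(root, branch, current_sha):
    """Remove checkpoint directories that belong to other commits."""
    branch_dir = Path(root) / branch
    if not branch_dir.is_dir():
        return []
    removed = []
    for child in list(branch_dir.iterdir()):
        if child.is_dir() and child.name != current_sha:
            shutil.rmtree(child)
            removed.append(child.name)
    return sorted(removed)


# Demo in a temporary directory (the real root would be /checkpoints).
root = Path(tempfile.mkdtemp())
old = checkpoint_path(root, "main", "aaa1111", "cps_2024.h5")
old.parent.mkdir(parents=True)
old.write_bytes(b"stale")

visible_to_old = is_checkpointed(root, "main", "aaa1111", "cps_2024.h5")
visible_to_new = is_checkpointed(root, "main", "bbb2222", "cps_2024.h5")

new = checkpoint_path(root, "main", "bbb2222", "cps_2024.h5")
new.parent.mkdir(parents=True)
new.write_bytes(b"fresh")
removed = clean_stale_commits(root, "main", "bbb2222")
```

The old checkpoint is visible only to the commit that wrote it, so a new commit rebuilds from scratch while a preempted build on the same commit can still resume.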

Test threshold (#581, #586)

Widen poverty rate test tolerance from 20% to 30% to accommodate sampling variance from the half-sample approach.
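For illustration, here is how such a check behaves if the tolerance is interpreted as a relative deviation from a target rate. That interpretation, the function name, and the numbers below are assumptions made for this sketch; they are not taken from the test suite.

```python
def within_relative_tolerance(estimate, target, rel_tol):
    """True if the estimate deviates from the target by at most rel_tol (relative)."""
    return abs(estimate - target) <= rel_tol * abs(target)


# Made-up numbers: a target rate of 11.0% and a sampled estimate of 13.6%.
target_rate = 0.110
estimated_rate = 0.136

passes_at_20 = within_relative_tolerance(estimated_rate, target_rate, 0.20)
passes_at_30 = within_relative_tolerance(estimated_rate, target_rate, 0.30)
```

With these numbers the estimate is about 24% above the target, so it fails the old 20% bound and passes the new 30% bound.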

Test plan

CI passes with full rebuild (no stale checkpoints restored)

Poverty rate tests pass within 30% threshold
Peak memory stays under 32GB on Modal

🤖 Generated with Claude Code

The switch from CPS_2024 (frac=0.5) to CPS_2024_Full (frac=1) in commit 1e8d6e1 doubled the household count to 111k, making the 2913-column calibration loss matrix 2.6 GB and causing OOM on 28 GB machines.

- Add ExtendedCPS_2024_Half using CPS_2024 (frac=0.5) with its own H5
- Point EnhancedCPS_2024 at the half-sample input (~56k households)
- Free loss_matrix and sim objects earlier in the calibration pipeline
- Convert loss matrix to float32 before reweighting

The full ExtendedCPS_2024 (111k households) is preserved unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
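The 2.6 GB figure is consistent with a dense matrix of 8-byte floats, as the arithmetic below shows. The sketch assumes dense storage, float64 before the downcast, and decimal gigabytes (1e9 bytes); the helper name is made up for illustration.

```python
def dense_matrix_gb(n_rows, n_cols, bytes_per_value):
    """Size of a dense n_rows x n_cols matrix in decimal gigabytes."""
    return n_rows * n_cols * bytes_per_value / 1e9


full_f64 = dense_matrix_gb(111_000, 2913, 8)  # full sample, float64
half_f64 = dense_matrix_gb(56_000, 2913, 8)   # half sample, float64
half_f32 = dense_matrix_gb(56_000, 2913, 4)   # half sample, float32

print(f"{full_f64:.2f} GB, {half_f64:.2f} GB, {half_f32:.2f} GB")
```

Under these assumptions, halving the sample and downcasting to float32 together shrink a single copy of the loss matrix to roughly a quarter of its original size. Any temporary copies made during calibration shrink by the same factor.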


Scope checkpoint paths by commit SHA so new commits rebuild from scratch instead of restoring stale H5 files from previous builds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

- Add @functools.cache to avoid repeated subprocess calls for the same commit SHA during a single build (~10+ calls reduced to 1)
- Move inline `import gc` to top of loss.py for consistency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
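A sketch of the caching nit, assuming the SHA is read with `git rev-parse HEAD`. The review comment names `get_current_commit`, but the body, the `"unknown"` fallback, and the `CALLS` counter below are illustrative additions so the effect of the cache can be observed.

```python
import functools
import subprocess

CALLS = {"n": 0}  # counts how often the function body actually runs


@functools.cache
def get_current_commit() -> str:
    """Return the current commit SHA, running git at most once per process."""
    CALLS["n"] += 1
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        # Fallback for this sketch only: git missing or not inside a repository.
        return "unknown"


first = get_current_commit()
second = get_current_commit()
third = get_current_commit()
```

Only the first call spawns a subprocess. Later calls return the cached string, which is safe here because the commit does not change during a single build.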

MaxGhenis

Looks good — OOM fix, memory cleanup, checkpoint invalidation, and test threshold all make sense. Added minor nit fixes (cached get_current_commit, moved inline gc import to top-level).

baogorek and others added 4 commits March 6, 2026 09:43

Widen poverty rate test threshold to 30% and reformat

45bdd86

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Reformat with ruff (CI switched from black to ruff in #577)

b7a7a57

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Fix CI checkpoint cache invalidation on code changes

0722d45

Scope checkpoint paths by commit SHA so new commits rebuild from scratch instead of restoring stale H5 files from previous builds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

baogorek requested review from MaxGhenis and PavelMakarchuk March 6, 2026 16:42

MaxGhenis approved these changes Mar 6, 2026


PavelMakarchuk approved these changes Mar 6, 2026


MaxGhenis merged commit fb50ef8 into main Mar 6, 2026
6 checks passed

MaxGhenis merged 5 commits into main from fix-enhanced-cps-oom
