
Fix enhanced CPS OOM with half-sample ExtendedCPS#579

Merged
MaxGhenis merged 5 commits into main from fix-enhanced-cps-oom
Mar 6, 2026

Conversation

@baogorek
Collaborator

@baogorek baogorek commented Mar 6, 2026

Closes #580, closes #581, closes #582, closes #583, closes #584, closes #585, closes #586

Summary

Details

OOM fix (#580, #584)

ExtendedCPS was consuming ~30GB during construction, exceeding Modal's 32GB limit. The fix samples 50% of CPS households before extending, cutting peak memory to ~18GB while preserving calibration targets (applied post-extension).
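
As a rough illustration of sampling at the household level before extension, here is a minimal sketch. It is not the repository's implementation (the commit message indicates the real mechanism is the `frac=0.5` setting on `CPS_2024`); the function names, the `household_id` field, and the seed are assumptions made for this example.

```python
import random


def sample_households(household_ids, frac=0.5, seed=0):
    """Return a reproducible random subset of household IDs.

    Hypothetical helper: sampling is done on whole households so that
    every person in a kept household stays in the data.
    """
    if not 0 < frac <= 1:
        raise ValueError("frac must be in (0, 1]")
    unique_ids = sorted(set(household_ids))  # sort for determinism
    k = round(len(unique_ids) * frac)
    rng = random.Random(seed)  # local RNG, does not touch global state
    return set(rng.sample(unique_ids, k))


def filter_records(records, kept_ids):
    """Keep every person record whose household was sampled."""
    return [r for r in records if r["household_id"] in kept_ids]
```

Because reweighting against the calibration targets happens after extension, the sketch does not rescale any weights itself.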

Memory cleanup (#582)

Delete the full loss matrix DataFrame after creating the cleaned copy, free the Microsimulation object after build_loss_matrix returns, and convert the loss matrix to float32 before passing to reweight().
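
The release pattern described above can be sketched with the standard library alone. `FakeSimulation` and `build_loss_matrix_sketch` are hypothetical stand-ins for the real Microsimulation object and `build_loss_matrix`; plain lists replace the DataFrame, and the float32 conversion is not shown.

```python
import gc
import weakref


class FakeSimulation:
    """Hypothetical stand-in for the large Microsimulation object."""

    def __init__(self, n):
        self.values = [float(i) for i in range(n)]


def build_loss_matrix_sketch(sim):
    # Returns a new list that holds no reference back to `sim`.
    return [v * 2.0 for v in sim.values]


def run_pipeline(n=1000):
    sim = FakeSimulation(n)
    sim_ref = weakref.ref(sim)  # lets us observe when `sim` is freed

    full = build_loss_matrix_sketch(sim)

    # Drop the simulation as soon as the matrix has been built.
    del sim
    gc.collect()

    # Create the cleaned copy, then drop the full version.
    cleaned = [v for v in full if v >= 0.0]
    del full
    gc.collect()

    return cleaned, sim_ref() is None
```

The point of the sketch is ordering: each large object is released before the next large allocation, so the peak holds fewer of them at once.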

Checkpoint invalidation fix (#583, #585)

is_checkpointed() only checked that the file existed and was non-empty; it never checked whether the code had changed. As a result, new commits on main kept restoring stale H5 files from previous builds (broken since Feb 4, ~27 commits affected).

Approach: Include the git commit SHA in the checkpoint path:

  • Old layout: /checkpoints/{branch}/{filename}
  • New layout: /checkpoints/{branch}/{commit_sha}/{filename}

Within the same build (same commit), preemption resilience works exactly as before. Across different commits, checkpoints are invisible because they live under a different SHA directory. Stale commit directories are cleaned up at the start of each build.
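
A minimal sketch of the commit-scoped layout follows. Only `get_current_commit` and `is_checkpointed` are names that appear in this PR; their bodies here, plus `checkpoint_path`, `cleanup_stale_commits`, and all parameters, are assumptions for illustration rather than the merged code.

```python
import functools
import shutil
import subprocess
from pathlib import Path


@functools.cache
def get_current_commit() -> str:
    """Resolve HEAD once per process; later calls reuse the cached SHA."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def checkpoint_path(root, branch, commit_sha, filename) -> Path:
    # New layout: {root}/{branch}/{commit_sha}/{filename}
    return Path(root) / branch / commit_sha / filename


def is_checkpointed(path) -> bool:
    # Existence and non-emptiness are enough once the path is commit-scoped.
    path = Path(path)
    return path.is_file() and path.stat().st_size > 0


def cleanup_stale_commits(root, branch, current_sha) -> None:
    """Remove every commit directory under the branch except the current one."""
    branch_dir = Path(root) / branch
    if not branch_dir.is_dir():
        return
    for child in branch_dir.iterdir():
        if child.is_dir() and child.name != current_sha:
            shutil.rmtree(child)
```

In a build, `checkpoint_path("/checkpoints", branch, get_current_commit(), filename)` would give the scoped location. Keeping the SHA as an explicit argument means the path logic can be exercised without a git checkout.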

Test threshold (#581, #586)

Widen poverty rate test tolerance from 20% to 30% to accommodate sampling variance from the half-sample approach.
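
To make the effect of the wider threshold concrete, here is a small sketch that assumes the tolerance is a relative error against a target rate. The helper name and the rates are hypothetical and are not taken from the test suite.

```python
def within_tolerance(estimate, target, rel_tol):
    """True if the estimate is within rel_tol (as a fraction) of the target."""
    if target == 0:
        raise ValueError("target must be non-zero for a relative tolerance")
    return abs(estimate - target) / abs(target) <= rel_tol


# Hypothetical rates: an estimate 25% above the target fails at 20%
# but passes at 30%.
target_rate = 0.12
estimated_rate = 0.15
print(within_tolerance(estimated_rate, target_rate, 0.20))
print(within_tolerance(estimated_rate, target_rate, 0.30))
```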

Test plan

  • CI passes with full rebuild (no stale checkpoints restored)
  • Poverty rate tests pass within 30% threshold
  • Peak memory stays under 32GB on Modal

🤖 Generated with Claude Code

baogorek and others added 4 commits March 6, 2026 09:43
The switch from CPS_2024 (frac=0.5) to CPS_2024_Full (frac=1) in commit
1e8d6e1 doubled the household count to 111k, making the 2913-column
calibration loss matrix 2.6 GB and causing OOM on 28 GB machines.

- Add ExtendedCPS_2024_Half using CPS_2024 (frac=0.5) with its own H5
- Point EnhancedCPS_2024 at the half-sample input (~56k households)
- Free loss_matrix and sim objects earlier in the calibration pipeline
- Convert loss matrix to float32 before reweighting

The full ExtendedCPS_2024 (111k households) is preserved unchanged.
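
The 2.6 GB figure is consistent with a dense float64 matrix of 111k rows by 2913 columns. A back-of-the-envelope check, assuming dense storage with no index or container overhead (the helper name is illustrative):

```python
def matrix_bytes(rows, cols, bytes_per_value):
    """Size of a dense rows x cols matrix, ignoring container overhead."""
    return rows * cols * bytes_per_value


full_f64 = matrix_bytes(111_000, 2913, 8)  # full sample, float64
full_f32 = matrix_bytes(111_000, 2913, 4)  # full sample, float32
half_f32 = matrix_bytes(56_000, 2913, 4)   # ~56k households, float32

for label, size in [("full float64", full_f64),
                    ("full float32", full_f32),
                    ("half float32", half_f32)]:
    print(f"{label}: {size / 1e9:.2f} GB")
```

Halving the rows and halving the bytes per value each cut the matrix size in two, so the two changes together reduce it to roughly a quarter.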

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scope checkpoint paths by commit SHA so new commits rebuild from scratch
instead of restoring stale H5 files from previous builds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add @functools.cache to avoid repeated subprocess calls for the same
  commit SHA during a single build (~10+ calls reduced to 1)
- Move inline `import gc` to top of loss.py for consistency

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@MaxGhenis MaxGhenis left a comment


Looks good — OOM fix, memory cleanup, checkpoint invalidation, and test threshold all make sense. Added minor nit fixes (cached get_current_commit, moved inline gc import to top-level).

@MaxGhenis MaxGhenis merged commit fb50ef8 into main Mar 6, 2026
6 checks passed


3 participants