perf: selective activation checkpointing feature support #2280
Conversation
Plumbs Megatron-Core recompute_granularity and recompute_modules through the policy config and Megatron setup so training can selectively recompute activations (core_attn, moe, moe_act, etc.) instead of full checkpointing. Documents the new fields in the grpo_math_1B base config.

Signed-off-by: sna <sna@nvidia.com>
terrykong
left a comment
Review Summary
Thanks for adding selective activation checkpointing support — this is a valuable feature that exposes a real MCore knob MoE users need. The implementation is well-scoped and the branching logic is clean.
Suggestions
Config-conventions guideline (inline): The .get("recompute_granularity", "full") pattern uses a hidden non-None default, which the repo guidelines forbid. The key should be accessed directly.
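A minimal sketch of the suggested fix, assuming the policy config is a plain dict (names are illustrative):

```python
# Illustrative stand-in for the policy config dict.
cfg = {"recompute_granularity": "selective"}

# Discouraged: the hidden non-None default silently masks a missing key.
granularity_hidden = cfg.get("recompute_granularity", "full")

# Preferred per the repo guideline: direct access, so a missing key
# raises KeyError instead of silently falling back to "full".
granularity = cfg["recompute_granularity"]
```

With direct access, a misspelled or forgotten config key fails loudly at setup time rather than quietly training with full recompute.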
Incomplete module list (inline): The recompute_modules comment lists 3 valid options, but MCore allows 7.
Validation gap (inline): Invalid recompute_modules values (typos) are silently accepted because MCore's __post_init__ validation doesn’t re-run after attribute assignment. Filed #2291 to track a broader refactor.
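One possible eager check, sketched here as a hedged illustration; the allowed set below is deliberately partial (only the modules named in this PR), not MCore's full list of 7:

```python
# Hypothetical early-validation helper; ALLOWED_RECOMPUTE_MODULES is an
# assumption for illustration, not MCore's actual complete set.
ALLOWED_RECOMPUTE_MODULES = {"core_attn", "moe", "moe_act"}

def validate_recompute_modules(modules):
    """Raise on typos instead of letting them be silently accepted."""
    unknown = set(modules) - ALLOWED_RECOMPUTE_MODULES
    if unknown:
        raise ValueError(f"Unknown recompute_modules: {sorted(unknown)}")
    return list(modules)
```

Running a check like this at setup time would surface typos immediately, independent of whether MCore's __post_init__ validation re-runs.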
Performance Evidence
Since this is a perf:-labeled PR and the TypedDict comment claims "∼10–18GB savings for MoE models," could you share a before/after comparison? For example, peak GPU memory and tokens/sec with selective vs full recompute on a representative MoE model would help future users understand the expected benefit and source the claim. Filling in the PR template sections would also be helpful.
Test Coverage (nit)
The new selective branch and ValueError path have no unit test coverage. Consider adding tests for: (a) selective with custom modules, (b) selective with None modules (MCore default), (c) invalid granularity raises ValueError. Happy to share three pre-verified tests that match the existing MagicMock-based style.
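As an illustration of the suggested cases, a hedged sketch in the MagicMock style; apply_recompute_settings is a stand-in for the setup function under review, not its real name:

```python
from unittest.mock import MagicMock

def apply_recompute_settings(mcore_cfg, policy_cfg):
    # Stand-in for the branching logic under review.
    granularity = policy_cfg["recompute_granularity"]
    if granularity not in (None, "full", "selective"):
        raise ValueError(f"Invalid recompute_granularity: {granularity}")
    mcore_cfg.recompute_granularity = granularity
    if granularity == "selective":
        # A None value here falls through to MCore's own default modules.
        mcore_cfg.recompute_modules = policy_cfg.get("recompute_modules")

def test_selective_with_custom_modules():
    mcore_cfg = MagicMock()
    apply_recompute_settings(
        mcore_cfg,
        {"recompute_granularity": "selective",
         "recompute_modules": ["core_attn", "moe"]},
    )
    assert mcore_cfg.recompute_granularity == "selective"
    assert mcore_cfg.recompute_modules == ["core_attn", "moe"]

def test_invalid_granularity_raises():
    try:
        apply_recompute_settings(MagicMock(), {"recompute_granularity": "sel"})
    except ValueError:
        return
    assert False, "expected ValueError"
```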
Linter: PASS
All hooks passed (ruff, ruff-format, taplo, pyrefly, end-of-files, trailing-whitespace, minimize-check).
Generated by Claude Code
Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Seonjin <sna@nvidia.com>
/ok to 150ab01
What does this PR do ?
Supports Megatron-Core recompute_granularity and recompute_modules through the policy config and Megatron setup so training can selectively recompute activations (core_attn, moe, moe_act, etc.) instead of full checkpointing.
Issues
List issues that this PR closes (syntax):
Usage
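As a hedged illustration (the exact config paths are assumptions, not confirmed by this PR body), the new fields might be enabled like:

```python
# Hypothetical policy-config fragment enabling selective recomputation;
# the nesting under "megatron_cfg" is an assumption about the repo layout.
policy_cfg = {
    "megatron_cfg": {
        "recompute_granularity": "selective",  # "full" for full checkpointing
        "recompute_modules": ["core_attn", "moe", "moe_act"],
    }
}
```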
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information