
perf: selective activation checkpointing feature support #2280

Open
seonjinn wants to merge 5 commits into main from sj/selective-recompute

Conversation

@seonjinn
Contributor

Supports Megatron-Core's recompute_granularity and recompute_modules through the policy config and Megatron setup so training can selectively recompute activations (core_attn, moe, moe_act, etc.) instead of using full checkpointing.

What does this PR do ?

Adds selective activation recompute support by exposing Megatron-Core's recompute_granularity and recompute_modules through the policy config, so activations can be recomputed per module instead of via full checkpointing.

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this
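As a sketch only — the exact config path below is illustrative, and the field names come from this PR's description and the grpo_math_1B base config it documents:

```yaml
policy:
  megatron_cfg:
    # "full" keeps the previous behavior; "selective" enables per-module recompute
    recompute_granularity: selective
    # Illustrative subset -- see Megatron-Core's TransformerConfig for the full list
    recompute_modules: ["core_attn", "moe", "moe_act"]
```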

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

Plumbs Megatron-Core recompute_granularity and recompute_modules through
policy config and megatron setup so training can selectively recompute
activations (core_attn, moe, moe_act, etc.) instead of full checkpointing.
Documents the new fields in the grpo_math_1B base config.

Signed-off-by: sna <sna@nvidia.com>
@seonjinn seonjinn requested review from a team as code owners April 17, 2026 06:57
@copy-pr-bot

copy-pr-bot bot commented Apr 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@seonjinn seonjinn changed the title Add selective activation recompute support and perf_test configs perf: selective activation checkpointing feature support Apr 17, 2026
@seonjinn seonjinn self-assigned this Apr 17, 2026
@seonjinn seonjinn requested a review from terrykong April 17, 2026 21:11
Collaborator

@terrykong terrykong left a comment


Review Summary

Thanks for adding selective activation checkpointing support — this is a valuable feature that exposes a real MCore knob MoE users need. The implementation is well-scoped and the branching logic is clean.

Suggestions

Config-conventions guideline (inline): The .get("recompute_granularity", "full") pattern uses a hidden non-None default, which the repo guidelines forbid. The key should be accessed directly.
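A minimal illustration of the failure mode (the dict literals here are made up for the example):

```python
# Why a hidden non-None default can mask config mistakes: a config with a
# typo'd key silently falls back to "full" under .get(), while direct
# indexing surfaces the problem as a KeyError at startup.
good_cfg = {"recompute_granularity": "selective"}
typo_cfg = {"recompute_granularty": "selective"}  # note the missing "i"

# .get() hides the typo -- training silently runs with full recompute:
assert typo_cfg.get("recompute_granularity", "full") == "full"

# Direct access fails fast instead:
try:
    typo_cfg["recompute_granularity"]
except KeyError:
    print("typo caught at startup")
```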

Incomplete module list (inline): The recompute_modules comments list 3 valid options but MCore allows 7.

Validation gap (inline): Invalid recompute_modules values (typos) are silently accepted because MCore's __post_init__ validation doesn’t re-run after attribute assignment. Filed #2291 to track a broader refactor.
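One way to close the gap in the meantime is an explicit guard before assignment. The allowed set below is deliberately partial (only the three modules this PR mentions) and should be sourced from the installed Megatron-Core version:

```python
# Hypothetical guard -- reject typos in recompute_modules instead of letting
# them pass silently after MCore's __post_init__ has already run.
# NOTE: partial allowed-set for illustration; take the real one from
# megatron.core's TransformerConfig for your installed version.
ALLOWED_RECOMPUTE_MODULES = {"core_attn", "moe", "moe_act"}

def validate_recompute_modules(modules):
    if modules is None:
        return  # defer to MCore's own default
    invalid = sorted(set(modules) - ALLOWED_RECOMPUTE_MODULES)
    if invalid:
        raise ValueError(
            f"Invalid recompute_modules {invalid}; "
            f"allowed: {sorted(ALLOWED_RECOMPUTE_MODULES)}"
        )
```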

Performance Evidence

Since this is a perf:-labeled PR and the TypedDict comment claims "~10–18 GB savings for MoE models," could you share a before/after comparison? For example, peak GPU memory and tokens/sec with selective vs full recompute on a representative MoE model would help future users understand the expected benefit and source the claim. Filling in the PR template sections would also be helpful.

Test Coverage (nit)

The new selective branch and ValueError path have no unit test coverage. Consider adding tests for: (a) selective with custom modules, (b) selective with None modules (MCore default), (c) invalid granularity raises ValueError. Happy to share three pre-verified tests that match the existing MagicMock-based style.

Linter: PASS

All hooks passed (ruff, ruff-format, taplo, pyrefly, end-of-files, trailing-whitespace, minimize-check).

Generated by Claude Code

Comment thread nemo_rl/models/megatron/setup.py Outdated
Comment thread nemo_rl/models/megatron/setup.py
Comment thread nemo_rl/models/policy/__init__.py Outdated
Comment thread examples/configs/grpo_math_1B.yaml Outdated
seonjinn and others added 4 commits April 19, 2026 18:13
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin <sna@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin <sna@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin <sna@nvidia.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Seonjin <sna@nvidia.com>
@seonjinn
Contributor Author

/ok to 150ab01

