perf: selective activation checkpointing feature support #2280
Conversation
Plumbs Megatron-Core recompute_granularity and recompute_modules through the policy config and Megatron setup so training can selectively recompute activations (core_attn, moe, moe_act, etc.) instead of full checkpointing. Documents the new fields in the grpo_math_1B base config.

Signed-off-by: sna <sna@nvidia.com>
terrykong
left a comment
Review Summary
Thanks for adding selective activation checkpointing support — this is a valuable feature that exposes a real MCore knob MoE users need. The implementation is well-scoped and the branching logic is clean.
Suggestions
Config-conventions guideline (inline): The .get("recompute_granularity", "full") pattern uses a hidden non-None default, which the repo guidelines forbid. The key should be accessed directly.
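A minimal sketch of the suggested fix, assuming the policy config is a plain dict (names are illustrative):

```python
# Illustrative stand-in for the policy config dict.
cfg = {"recompute_granularity": "selective"}

# Discouraged: the hidden non-None default silently masks a missing key.
granularity_hidden = cfg.get("recompute_granularity", "full")

# Preferred per the repo guideline: direct access, so a missing key
# raises KeyError instead of silently falling back to "full".
granularity = cfg["recompute_granularity"]
```

With direct access, a misspelled or forgotten config key fails loudly at setup time rather than quietly training with full recompute.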
Incomplete module list (inline): The recompute_modules comment lists 3 valid options, but MCore allows 7.
Validation gap (inline): Invalid recompute_modules values (typos) are silently accepted because MCore's __post_init__ validation doesn’t re-run after attribute assignment. Filed #2291 to track a broader refactor.
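One possible eager check, sketched here as a hedged illustration; the allowed set below is deliberately partial (only the modules named in this PR), not MCore's full list of 7:

```python
# Hypothetical early-validation helper; ALLOWED_RECOMPUTE_MODULES is an
# assumption for illustration, not MCore's actual complete set.
ALLOWED_RECOMPUTE_MODULES = {"core_attn", "moe", "moe_act"}

def validate_recompute_modules(modules):
    """Raise on typos instead of letting them be silently accepted."""
    unknown = set(modules) - ALLOWED_RECOMPUTE_MODULES
    if unknown:
        raise ValueError(f"Unknown recompute_modules: {sorted(unknown)}")
    return list(modules)
```

Running a check like this at setup time would surface typos immediately, independent of whether MCore's __post_init__ validation re-runs.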
Performance Evidence
Since this is a perf:-labeled PR and the TypedDict comment claims "∼10–18GB savings for MoE models," could you share a before/after comparison? For example, peak GPU memory and tokens/sec with selective vs full recompute on a representative MoE model would help future users understand the expected benefit and source the claim. Filling in the PR template sections would also be helpful.
Test Coverage (nit)
The new selective branch and ValueError path have no unit test coverage. Consider adding tests for: (a) selective with custom modules, (b) selective with None modules (MCore default), (c) invalid granularity raises ValueError. Happy to share three pre-verified tests that match the existing MagicMock-based style.
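As an illustration of the suggested cases, a hedged sketch in the MagicMock style; apply_recompute_settings is a stand-in for the setup function under review, not its real name:

```python
from unittest.mock import MagicMock

def apply_recompute_settings(mcore_cfg, policy_cfg):
    # Stand-in for the branching logic under review.
    granularity = policy_cfg["recompute_granularity"]
    if granularity not in (None, "full", "selective"):
        raise ValueError(f"Invalid recompute_granularity: {granularity}")
    mcore_cfg.recompute_granularity = granularity
    if granularity == "selective":
        # A None value here falls through to MCore's own default modules.
        mcore_cfg.recompute_modules = policy_cfg.get("recompute_modules")

def test_selective_with_custom_modules():
    mcore_cfg = MagicMock()
    apply_recompute_settings(
        mcore_cfg,
        {"recompute_granularity": "selective",
         "recompute_modules": ["core_attn", "moe"]},
    )
    assert mcore_cfg.recompute_granularity == "selective"
    assert mcore_cfg.recompute_modules == ["core_attn", "moe"]

def test_invalid_granularity_raises():
    try:
        apply_recompute_settings(MagicMock(), {"recompute_granularity": "sel"})
    except ValueError:
        return
    assert False, "expected ValueError"
```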
Linter: PASS
All hooks passed (ruff, ruff-format, taplo, pyrefly, end-of-files, trailing-whitespace, minimize-check).
Generated by Claude Code
Co-authored-by: Terry Kong <terryk@nvidia.com> Signed-off-by: Seonjin <sna@nvidia.com>
/ok to 150ab01
What does this PR do ?
Supports Megatron-Core recompute_granularity and recompute_modules through the policy config and Megatron setup so training can selectively recompute activations (core_attn, moe, moe_act, etc.) instead of full checkpointing.
Issues
List issues that this PR closes (syntax):
Usage
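As a hedged illustration (the exact config paths are assumptions, not confirmed by this PR body), the new fields might be enabled like:

```python
# Hypothetical policy-config fragment enabling selective recomputation;
# the nesting under "megatron_cfg" is an assumption about the repo layout.
policy_cfg = {
    "megatron_cfg": {
        "recompute_granularity": "selective",  # "full" for full checkpointing
        "recompute_modules": ["core_attn", "moe", "moe_act"],
    }
}
```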
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"
Pre checks:
Additional Information