
fix: set moe_permute_fusion default to true for deterministic MoE forward#2258

Merged
terrykong merged 3 commits into NVIDIA-NeMo:main from zpqiu:alexq/fix-moe-permute-fusion-default on Apr 14, 2026

Conversation

@zpqiu
Contributor

@zpqiu zpqiu commented Apr 13, 2026

What does this PR do ?

With moe_permute_fusion=false, MoE models produce non-deterministic forward-pass results due to scatter_add_ in the unpermute operation, causing train/probs_ratio to deviate from 1.0 in on-policy GRPO. This PR flips the default to true in the example configs so the fused, deterministic permute/unpermute path is used.
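
For context on the non-determinism: on CUDA, scatter_add_ accumulates with floating-point atomics, so the summation order (and hence the low-order bits of the result) can change from run to run. A minimal illustration, assuming a CUDA device is available (shapes and routing below are made up for the demo, not taken from the repo):

```python
# Illustration only (not code from this PR): scatter_add_ on CUDA uses
# atomic float adds, so identical inputs can yield slightly different sums.
import torch

assert torch.cuda.is_available(), "the non-determinism only appears on GPU"

tokens = torch.randn(4096, 1024, device="cuda")
# Route many tokens to a few "expert" slots to force index collisions,
# much like MoE unpermute does when scattering expert outputs back per token.
index = torch.randint(0, 8, (4096, 1), device="cuda").expand(-1, 1024)

out_a = torch.zeros(8, 1024, device="cuda").scatter_add_(0, index, tokens)
out_b = torch.zeros(8, 1024, device="cuda").scatter_add_(0, index, tokens)

# Low-order-bit differences here are enough to make recomputed log-probs,
# and thus train/probs_ratio, drift from 1.0 in on-policy GRPO.
print(torch.equal(out_a, out_b))  # typically False on GPU
```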

Issues

Fixes #2255

Usage

  • A usage example is sketched below.
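
Since the new default lives in the example YAML configs, one way to confirm it is to read the flag back. The file path and config keys come from this PR's changed files; the snippet itself is illustrative, not code from the repo:

```python
# Illustrative check (not code from this PR): confirm the flipped default
# in one of the example configs this PR touches.
import yaml

with open("examples/configs/grpo_math_1B.yaml") as f:
    cfg = yaml.safe_load(f)

# The PR changes this flag from false to true so MoE forward passes are
# deterministic and train/probs_ratio stays at 1.0 in on-policy GRPO.
print(cfg["policy"]["megatron_cfg"]["moe_permute_fusion"])  # expected: True
```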

Before your PR is "Ready for review"

Pre checks:

  • Make sure you have read and followed the Contributor guidelines.
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests.
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build, and test the docs.

Additional Information

  • ...

fix: set moe_permute_fusion default to true for deterministic MoE forward

With moe_permute_fusion=false, MoE models produce non-deterministic
forward pass results due to scatter_add_ in the unpermute operation,
causing train/probs_ratio to deviate from 1.0 in on-policy GRPO.

Fixes NVIDIA-NeMo#2255

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
@zpqiu zpqiu requested review from a team as code owners April 13, 2026 09:30
@copy-pr-bot

copy-pr-bot Bot commented Apr 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

Copilot AI left a comment


Pull request overview

Updates example training configs to default moe_permute_fusion: true to avoid non-deterministic MoE forward passes (and resulting train/probs_ratio drift) caused by the unfused unpermute path.

Changes:

  • Flip policy.megatron_cfg.moe_permute_fusion from false to true in several example configs.
  • Apply the same default in both standard and Megatron variants of GRPO/distillation configs.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Summary per file:

  • examples/configs/sft.yaml: sets megatron_cfg.moe_permute_fusion: true in the SFT example config.
  • examples/configs/grpo_math_1B.yaml: sets megatron_cfg.moe_permute_fusion: true in the GRPO math 1B baseline config.
  • examples/configs/grpo_math_1B_megatron.yaml: sets megatron_cfg.moe_permute_fusion: true in the Megatron GRPO math 1B config.
  • examples/configs/dpo.yaml: sets megatron_cfg.moe_permute_fusion: true in the DPO example config.
  • examples/configs/distillation_math.yaml: sets megatron_cfg.moe_permute_fusion: true in the distillation math config.
  • examples/configs/distillation_math_megatron.yaml: sets megatron_cfg.moe_permute_fusion: true in the Megatron distillation math config.


Comment thread on examples/configs/grpo_math_1B.yaml
Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
@zpqiu
Contributor Author

zpqiu commented Apr 13, 2026

/ok to test 48e920e

@copy-pr-bot

copy-pr-bot Bot commented Apr 13, 2026

/ok to test 48e920e

@zpqiu, there was an error processing your request: E2

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/

@zpqiu zpqiu added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label Apr 13, 2026
@zpqiu
Contributor Author

zpqiu commented Apr 13, 2026

/ok to test 018ca26

These recipe configs previously overrode the base default (false) with
true. Now that the base default is true, these overrides are redundant
and fail the minimize-check lint.

Signed-off-by: Zhaopeng Qiu <alexq@nvidia.com>
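
The commit message above alludes to a minimize-check lint that rejects recipe-config keys that merely restate base-config defaults. A hypothetical sketch of that idea (function name, structure, and the inline configs are all illustrative, not the actual lint):

```python
# Hypothetical sketch of a "minimize" check: walk a recipe config and report
# keys whose values merely repeat the base config. Not the actual lint.
def redundant_overrides(base, recipe, prefix=""):
    for key, value in recipe.items():
        if key not in base:
            continue
        if isinstance(value, dict) and isinstance(base[key], dict):
            yield from redundant_overrides(base[key], value, f"{prefix}{key}.")
        elif base[key] == value:
            yield f"{prefix}{key}"  # identical to the base default -> redundant

# With the base default now true, a recipe that still sets the flag is flagged.
base = {"policy": {"megatron_cfg": {"moe_permute_fusion": True}}}
recipe = {"policy": {"megatron_cfg": {"moe_permute_fusion": True}}}
print(list(redundant_overrides(base, recipe)))
# -> ['policy.megatron_cfg.moe_permute_fusion']
```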
@terrykong terrykong enabled auto-merge (squash) April 14, 2026 05:48
@terrykong
Collaborator

/ok to test 4346565

@terrykong terrykong merged commit dd3e8b7 into NVIDIA-NeMo:main Apr 14, 2026
27 checks passed
snivertynv pushed a commit to snivertynv/RL that referenced this pull request May 5, 2026



Development

Successfully merging this pull request may close these issues.

Should moe_permute_fusion default to true for MoE models?

4 participants