@CodersAcademy006

When Expert Parallelism (EP) is enabled, MoE expert parameters are already
partitioned across ranks. However, the current FSDP auto-wrapping logic still
wraps these expert modules, so their parameters end up tracked and sharded
twice: once by EP and once more by FSDP.

This results in:

  • Increased peak GPU memory usage
  • Duplicated optimizer state
  • Additional communication overhead
  • Reduced throughput compared to EP-only runs

What this PR does

  • Skips FSDP auto-wrapping for MoE expert modules that are already managed
    by Expert Parallelism (see the sketch after this list)
  • Adds an explicit ownership signal (expert_parallel_enabled) to MoE layers
    to avoid heuristic or name-based checks
  • Preserves existing FSDP behavior for all non-expert parameters
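
As a rough illustration only, the sketch below shows how an ownership flag can
gate auto-wrapping. It uses PyTorch's upstream FSDP API rather than
Megatron-FSDP's internal wrapping code; apart from the
`expert_parallel_enabled` attribute described above, every name, threshold,
and structural choice here is an assumption, not code from this PR.

```python
# Illustrative sketch only: uses torch.distributed.fsdp, not Megatron-FSDP's
# actual wrapping path. `expert_parallel_enabled` is the ownership flag this
# PR describes; everything else is an assumption for illustration.
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def ep_aware_auto_wrap_policy(module: nn.Module, recurse: bool, nonwrapped_numel: int) -> bool:
    """Custom auto-wrap policy: never wrap (or descend into) EP-owned modules."""
    if getattr(module, "expert_parallel_enabled", False):
        return False  # EP already shards these parameters; leave them alone
    if recurse:
        return True  # keep walking the rest of the module tree
    return nonwrapped_numel >= 100_000_000  # placeholder size heuristic for non-expert modules


def wrap_with_fsdp(model: nn.Module) -> FSDP:
    # Additionally exclude EP-owned modules from FSDP's flat parameters, so
    # they receive neither FSDP gathering nor FSDP optimizer state.
    ep_owned = [m for m in model.modules() if getattr(m, "expert_parallel_enabled", False)]
    return FSDP(
        model,
        auto_wrap_policy=ep_aware_auto_wrap_policy,
        ignored_modules=ep_owned,
    )
```

In this sketch, passing the EP-owned modules through `ignored_modules` keeps
their parameters out of FSDP's flat parameters and optimizer state entirely,
which matches the intent of handing ownership to EP; Megatron-FSDP's actual
mechanism may differ.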

Impact

  • Restores expected memory behavior (EP + FSDP ≤ EP-only)
  • Reduces redundant parameter metadata and optimizer state
  • Improves training throughput
  • No behavior change for non-MoE or non-EP models

Reproduction

This memory regression is reproducible on large MoE models and is reported in
#2772 (Using Megatron-FSDP + EP consumes more GPU memory than using EP alone).
The change removes the redundant sharding path responsible for it.
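
A quick sanity check of the memory claim can compare per-rank peak allocation
along the lines below; `build_model` and `train_steps` are hypothetical
placeholders, not functions from this repository.

```python
# Hypothetical verification harness; build_model(...) and train_steps(...) are
# placeholders. The assertion mirrors the expected invariant stated above:
# per-rank peak memory with EP + FSDP <= EP-only.
import torch


def peak_memory_bytes(run_fn) -> int:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated()


ep_only_peak = peak_memory_bytes(lambda: train_steps(build_model(ep=True, fsdp=False), steps=10))
ep_fsdp_peak = peak_memory_bytes(lambda: train_steps(build_model(ep=True, fsdp=True), steps=10))

assert ep_fsdp_peak <= ep_only_peak, (
    f"EP + FSDP peaked at {ep_fsdp_peak} bytes vs {ep_only_peak} bytes for EP-only"
)
```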

Fixes #2772
