@CodersAcademy006

When Expert Parallelism (EP) is enabled, MoE expert parameters are already
partitioned across ranks. However, the current FSDP auto-wrapping logic still
wraps these expert modules, so their parameters end up tracked and sharded
twice: once by EP and once more by FSDP.

This results in:

  • Increased peak GPU memory usage
  • Duplicated optimizer state
  • Additional communication overhead
  • Reduced throughput compared to EP-only runs

What this PR does

  • Skips FSDP auto-wrapping for MoE expert modules that are already managed
    by Expert Parallelism (see the sketch after this list)
  • Adds an explicit ownership signal (expert_parallel_enabled) to MoE layers
    to avoid heuristic or name-based checks
  • Preserves existing FSDP behavior for all non-expert parameters
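
As a rough illustration only, the sketch below shows how an ownership flag can
gate auto-wrapping. It uses PyTorch's upstream FSDP API rather than
Megatron-FSDP's internal wrapping code; apart from the
`expert_parallel_enabled` attribute described above, every name, threshold,
and structural choice here is an assumption, not code from this PR.

```python
# Illustrative sketch only: uses torch.distributed.fsdp, not Megatron-FSDP's
# actual wrapping path. `expert_parallel_enabled` is the ownership flag this
# PR describes; everything else is an assumption for illustration.
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def ep_aware_auto_wrap_policy(module: nn.Module, recurse: bool, nonwrapped_numel: int) -> bool:
    """Custom auto-wrap policy: never wrap (or descend into) EP-owned modules."""
    if getattr(module, "expert_parallel_enabled", False):
        return False  # EP already shards these parameters; leave them alone
    if recurse:
        return True  # keep walking the rest of the module tree
    return nonwrapped_numel >= 100_000_000  # placeholder size heuristic for non-expert modules


def wrap_with_fsdp(model: nn.Module) -> FSDP:
    # Additionally exclude EP-owned modules from FSDP's flat parameters, so
    # they receive neither FSDP gathering nor FSDP optimizer state.
    ep_owned = [m for m in model.modules() if getattr(m, "expert_parallel_enabled", False)]
    return FSDP(
        model,
        auto_wrap_policy=ep_aware_auto_wrap_policy,
        ignored_modules=ep_owned,
    )
```

In this sketch, passing the EP-owned modules through `ignored_modules` keeps
their parameters out of FSDP's flat parameters and optimizer state entirely,
which matches the intent of handing ownership to EP; Megatron-FSDP's actual
mechanism may differ.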

Impact

  • Restores expected memory behavior (EP + FSDP ≤ EP-only)
  • Reduces redundant parameter metadata and optimizer state
  • Improves training throughput
  • No behavior change for non-MoE or non-EP models

Reproduction

This memory regression is reproducible on large MoE models and is reported in
#2772 (Using Megatron-FSDP + EP consumes more GPU memory than using EP alone).
The change removes the redundant sharding path responsible for it.
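
A quick sanity check of the memory claim can compare per-rank peak allocation
along the lines below; `build_model` and `train_steps` are hypothetical
placeholders, not functions from this repository.

```python
# Hypothetical verification harness; build_model(...) and train_steps(...) are
# placeholders. The assertion mirrors the expected invariant stated above:
# per-rank peak memory with EP + FSDP <= EP-only.
import torch


def peak_memory_bytes(run_fn) -> int:
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    run_fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated()


ep_only_peak = peak_memory_bytes(lambda: train_steps(build_model(ep=True, fsdp=False), steps=10))
ep_fsdp_peak = peak_memory_bytes(lambda: train_steps(build_model(ep=True, fsdp=True), steps=10))

assert ep_fsdp_peak <= ep_only_peak, (
    f"EP + FSDP peaked at {ep_fsdp_peak} bytes vs {ep_only_peak} bytes for EP-only"
)
```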

Fixes #2772
