Add NCCL symmetric-memory staging to experimental FSDP by wujingyue · Pull Request #5440 · NVIDIA/Megatron-LM

wujingyue · 2026-06-23T03:39:00Z

Summary

This adds an opt-in use_symm_mem=True path for experimental Megatron-FSDP staging buffers. The full-parameter all-gather buffer and full-gradient reduce-scatter buffer are allocated from PyTorch's NCCL symmetric-memory pool, rendezvoused before collectives, and verified through profiler activity.

Note: the communication-time benchmark (and its pytest-benchmark/uv.lock changes) was split out to #5596 to keep this PR focused on the core staging feature.

Details

Threads use_symm_mem through fully_shard, FsdpModule, and FsdpParameterGroup.
Adds DBuffer rendezvous support and a shared placement-axis helper used by redistribution and symmetric-memory call sites.
Allocates unsharded parameter and partial-gradient staging buffers under torch.cuda.use_mem_pool while leaving the DBuffer allocation API unchanged.
Uses SUM reduce-scatter for the symmetric-memory path and scales locally to preserve AVG gradient semantics.
unshard_parameters now raises on a None gather axis and rendezvouses unconditionally under symmetric memory, mirroring reduce_gradients.
Adds distributed CUDA/NCCL coverage in test_symmetric_memory.py that checks loss parity and requires observed ncclSymk all-gather and reduce-scatter kernels.
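
To illustrate the SUM-then-scale point in the list above, here is a minimal pure-Python sketch (no GPU or NCCL involved) showing that a SUM reduce-scatter followed by a local division by the world size gives the same shards as an AVG reduce-scatter. The helper names `reduce_scatter` and `scale_locally` are hypothetical and exist only for this illustration; they are not the functions used in the PR.

```python
def reduce_scatter(per_rank_grads, op):
    """Simulate reduce-scatter: rank r ends up with the reduced r-th shard.

    per_rank_grads holds one full-length gradient list per rank.
    op is "sum" or "avg". Hypothetical helper, for illustration only.
    """
    world_size = len(per_rank_grads)
    shard_len = len(per_rank_grads[0]) // world_size
    shards = []
    for rank in range(world_size):
        lo, hi = rank * shard_len, (rank + 1) * shard_len
        reduced = [sum(g[i] for g in per_rank_grads) for i in range(lo, hi)]
        if op == "avg":
            reduced = [x / world_size for x in reduced]
        shards.append(reduced)
    return shards


def scale_locally(shards, world_size):
    """Divide each rank's shard by the world size after a SUM reduction."""
    return [[x / world_size for x in shard] for shard in shards]


# Two ranks, each holding a full (unsharded) gradient of four elements.
grads = [
    [1.0, 2.0, 3.0, 4.0],  # rank 0
    [5.0, 6.0, 7.0, 8.0],  # rank 1
]

avg_shards = reduce_scatter(grads, op="avg")
sum_then_scaled = scale_locally(reduce_scatter(grads, op="sum"), world_size=2)
```

With these small values the two results match exactly. In low-precision dtypes the rounding order can differ between the two formulations, which is one reason a loss-parity check is a useful companion to this change.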

Validation

python -m isort megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/dbuffer.py megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/fully_shard.py megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/module.py megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/parameter_group.py megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/placement.py tests/unit_tests/distributed/megatron_fsdp/test_symmetric_memory.py
python -m torch.distributed.run --nproc-per-node 2 --standalone -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_symmetric_memory.py
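
The test run above requires that symmetric all-gather and reduce-scatter kernels actually show up in the profiler. As a rough sketch of that kind of check, the snippet below counts kernel names carrying the ncclSymk prefix mentioned in this PR. The function `count_symmetric_kernels` and the sample kernel names are made up for illustration; the real test and the real kernel names may look different.

```python
from collections import Counter


def count_symmetric_kernels(kernel_names):
    """Count kernels whose name starts with "ncclSymk", grouped by collective.

    Hypothetical helper: the grouping relies on the collective's name
    appearing inside the kernel name, which is an assumption.
    """
    counts = Counter()
    for name in kernel_names:
        if not name.startswith("ncclSymk"):
            continue  # not a symmetric-memory kernel
        lowered = name.lower()
        if "allgather" in lowered:
            counts["all_gather"] += 1
        elif "reducescatter" in lowered:
            counts["reduce_scatter"] += 1
    return counts


# Made-up kernel names standing in for what a profiler trace might list.
observed = [
    "ncclSymkDevKernel_AllGather_LL",
    "ncclSymkDevKernel_ReduceScatter_LD",
    "ncclDevKernel_AllGather_RING_LL",  # non-symmetric fallback, ignored
    "elementwise_kernel",
]

symmetric_counts = count_symmetric_kernels(observed)
```

A test built this way would fail if NCCL silently fell back to a non-symmetric algorithm, since the fallback kernels would not carry the prefix.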

copy-pr-bot · 2026-06-23T03:39:04Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

wujingyue · 2026-06-23T06:35:18Z

/ok to test

copy-pr-bot · 2026-06-25T04:47:08Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Allocate experimental FSDP all-gather and reduce-scatter staging buffers from PyTorch's NCCL symmetric-memory pool when use_symm_mem=True. Add explicit rendezvous before the symmetric-memory collectives and cover the path with a CUDA/NCCL profiler test that checks the symmetric kernel counts.

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

- unshard_parameters now raises on a None gather axis and rendezvous unconditionally under symmetric memory, mirroring reduce_gradients.
- Inline the single-use num_sharded_modules constant in the parity test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

The parity test used a tiny model (Linear(8,16)+Linear(16,4)); its sub-KB collectives make NCCL fall back to ring on runners with NCCL_NVLS_ENABLE=0 (e.g. CI), so the ncclSymk* kernel-count assertions failed there even though the runner supports symmetric memory. Widen the two sharded Linears to 1024 (a few-MiB bf16 weight), which reliably engages the symmetric kernels while preserving the loss-parity check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
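
For a sense of scale, the arithmetic behind "sub-KB" versus "a few-MiB" can be sketched as below. This assumes bf16 weights (2 bytes per element), ignores biases, and assumes the widened layers are 1024 by 1024; the commit message only says the Linears were widened to 1024, so the exact shape is a guess.

```python
BF16_BYTES = 2  # bytes per bf16 element


def linear_weight_bytes(in_features, out_features, bytes_per_element=BF16_BYTES):
    """Size in bytes of a Linear layer's weight matrix, bias excluded."""
    return in_features * out_features * bytes_per_element


# Original tiny model: Linear(8, 16) followed by Linear(16, 4).
tiny_bytes = [linear_weight_bytes(8, 16), linear_weight_bytes(16, 4)]

# Assumed widened layer: Linear(1024, 1024).
wide_bytes = linear_weight_bytes(1024, 1024)
wide_mib = wide_bytes / (1024 * 1024)
```

Under these assumptions the tiny weights are 256 and 128 bytes, while one widened weight is 2 MiB, several orders of magnitude larger.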

wujingyue · 2026-07-02T03:30:01Z

/ok to test

svcnvidia-nemo-ci · 2026-07-02T21:32:55Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28622613847

wujingyue force-pushed the symm branch 2 times, most recently from 3d9c0e4 to a499fa1 June 23, 2026 06:34

wujingyue marked this pull request as ready for review June 23, 2026 06:34

wujingyue requested review from a team as code owners June 23, 2026 06:34

wujingyue added Run tests MFSDPv2 labels Jun 23, 2026

copy-pr-bot Bot temporarily deployed to public June 23, 2026 06:35 Inactive

svcnvidia-nemo-ci added the complexity: medium label Jun 23, 2026

copy-pr-bot Bot temporarily deployed to public June 23, 2026 06:38 Inactive

copy-pr-bot Bot temporarily deployed to public June 23, 2026 06:39 Inactive

copy-pr-bot Bot temporarily deployed to public June 23, 2026 06:47 Inactive

wujingyue requested a review from ahmadki June 23, 2026 16:12

wujingyue changed the base branch from pull-request/5387 to main June 23, 2026 23:17

wujingyue changed the title ~~Add NCCL symmetric-memory staging to experimental FSDP~~ Add NCCL symmetric-memory staging Jun 25, 2026

wujingyue force-pushed the symm branch from a499fa1 to 44d60be June 25, 2026 04:47

wujingyue removed the MFSDPv2 label Jun 29, 2026

wujingyue self-assigned this Jun 29, 2026

Autumn1998 approved these changes Jun 29, 2026

svcnvidia-nemo-ci added the Final Review label (PR is in the "final review" stage) Jun 29, 2026

wujingyue changed the title ~~Add NCCL symmetric-memory staging~~ Add NCCL symmetric-memory staging to experimental FSDP Jun 29, 2026

wujingyue added 2 commits June 30, 2026 04:22

Test symmetric memory with bf16 parameters

190cfc1

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>

copy-pr-bot Bot temporarily deployed to public July 1, 2026 07:05 Inactive

wujingyue force-pushed the symm branch from f30b4fa to b10ac1e July 1, 2026 07:07

copy-pr-bot Bot temporarily deployed to public July 1, 2026 07:08 Inactive

copy-pr-bot Bot had a problem deploying to test July 1, 2026 07:08 Error

wujingyue force-pushed the symm branch 2 times, most recently from b10ac1e to aad98de July 1, 2026 07:11

copy-pr-bot Bot temporarily deployed to test July 1, 2026 07:12 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 07:17 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 07:20 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 07:21 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 07:30 Inactive

deepakn94 approved these changes Jul 1, 2026

wujingyue force-pushed the symm branch from aad98de to 1ab3bf7 July 1, 2026 14:59

copy-pr-bot Bot temporarily deployed to public July 1, 2026 14:59 Inactive

copy-pr-bot Bot temporarily deployed to test July 1, 2026 15:00 Inactive

copy-pr-bot Bot temporarily deployed to public July 1, 2026 15:03 Inactive

wujingyue mentioned this pull request Jul 1, 2026

DO NOT MERGE: symmetric-memory multicast CI diagnostic #5603

Closed

wujingyue mentioned this pull request Jul 2, 2026

DO NOT MERGE: NVLS=1 symmetric-memory CI diagnostic #5624

Open

Phlip79 approved these changes Jul 2, 2026

wujingyue mentioned this pull request Jul 2, 2026

CUDA autograd post-accumulate hooks are missed by coverage.py #5633

Open

wujingyue merged 4 commits into NVIDIA:main from wujingyue:symm
