Add NCCL symmetric-memory staging to experimental FSDP #5440

Merged
wujingyue merged 4 commits into NVIDIA:main from wujingyue:symm
Jul 2, 2026

@wujingyue wujingyue commented Jun 23, 2026

Contributor

Summary

This adds an opt-in use_symm_mem=True path for experimental Megatron-FSDP staging buffers. The full-parameter all-gather buffer and the full-gradient reduce-scatter buffer are allocated from PyTorch's NCCL symmetric-memory pool and rendezvoused before the collectives run. A profiler-based test verifies that the symmetric-memory kernels are actually used.

Note: the communication-time benchmark (and its pytest-benchmark/uv.lock changes) was split out to #5596 to keep this PR focused on the core staging feature.

Details

  • Threads use_symm_mem through fully_shard, FsdpModule, and FsdpParameterGroup.
  • Adds DBuffer rendezvous support and a shared placement-axis helper used by redistribution and symmetric-memory call sites.
  • Allocates unsharded parameter and partial-gradient staging buffers under torch.cuda.use_mem_pool while leaving the DBuffer allocation API unchanged.
  • Uses SUM reduce-scatter for the symmetric-memory path and scales locally to preserve AVG gradient semantics.
  • unshard_parameters raises on a None gather axis and rendezvouses unconditionally under symmetric memory, mirroring reduce_gradients.
  • Adds distributed CUDA/NCCL coverage in test_symmetric_memory.py that checks loss parity and requires observed ncclSymk all-gather and reduce-scatter kernels.
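
The SUM-then-scale bullet above can be illustrated without any GPU. The sketch below is a pure-Python simulation, not the PR's code: the `reduce_scatter` helper and the sample gradients are hypothetical, and the real path uses NCCL collectives on tensors. It only shows why a SUM reduce-scatter followed by a local division by the group size gives the same shards as an AVG reduce-scatter.

```python
def reduce_scatter(per_rank_grads, op):
    """Simulate reduce-scatter: rank r receives the reduced r-th shard.

    per_rank_grads: one flat gradient list per rank (hypothetical data).
    op: "sum" or "avg".
    """
    world_size = len(per_rank_grads)
    numel = len(per_rank_grads[0])
    assert numel % world_size == 0, "gradient must split evenly across ranks"
    shard_len = numel // world_size

    shards = []
    for rank in range(world_size):
        lo, hi = rank * shard_len, (rank + 1) * shard_len
        # Element-wise sum of this shard's slice across all ranks.
        reduced = [sum(g[i] for g in per_rank_grads) for i in range(lo, hi)]
        if op == "avg":
            reduced = [v / world_size for v in reduced]
        shards.append(reduced)
    return shards


# Two ranks, four gradient elements each (made-up values).
grads = [
    [1.0, 2.0, 3.0, 4.0],  # rank 0
    [5.0, 6.0, 7.0, 8.0],  # rank 1
]

avg_shards = reduce_scatter(grads, "avg")

# SUM reduce-scatter, then each rank scales its own shard locally.
sum_shards = reduce_scatter(grads, "sum")
scaled_shards = [[v / len(grads) for v in shard] for shard in sum_shards]

assert scaled_shards == avg_shards
```

With real floating-point tensors the two orders of operation may differ by rounding, which is presumably why the test compares losses for parity rather than relying on this identity alone.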

Validation

  • python -m isort megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/dbuffer.py megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/fully_shard.py megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/module.py megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/parameter_group.py megatron/core/distributed/fsdp/src/megatron_fsdp/experimental/placement.py tests/unit_tests/distributed/megatron_fsdp/test_symmetric_memory.py
  • python -m torch.distributed.run --nproc-per-node 2 --standalone -m pytest -q tests/unit_tests/distributed/megatron_fsdp/test_symmetric_memory.py
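
The kernel check that the test performs can be sketched as a filter over profiler kernel names. This is an illustrative outline only: `count_symm_kernels` is a hypothetical helper, the sample names are made up, and it assumes the symmetric kernels carry the ncclSymk prefix mentioned above and that their names contain "AllGather" or "ReduceScatter". The actual test reads names from a PyTorch profiler trace, which is not reproduced here.

```python
from collections import Counter

# Prefix taken from the PR description; everything else is an assumption.
SYMM_KERNEL_PREFIX = "ncclSymk"


def count_symm_kernels(kernel_names):
    """Count symmetric-memory all-gather and reduce-scatter kernels by name."""
    counts = Counter()
    for name in kernel_names:
        if not name.startswith(SYMM_KERNEL_PREFIX):
            continue  # not a symmetric-memory kernel
        if "AllGather" in name:
            counts["all_gather"] += 1
        elif "ReduceScatter" in name:
            counts["reduce_scatter"] += 1
    return counts


# Made-up kernel names standing in for what a profiler run might report.
observed = [
    "ncclSymkExample_AllGather",
    "ncclSymkExample_ReduceScatter",
    "ncclExample_AllGather_Ring",  # non-symmetric fallback, ignored
    "elementwise_kernel",          # unrelated compute kernel, ignored
    "ncclSymkExample_AllGather",
]

counts = count_symm_kernels(observed)

# The test requires that both kinds of symmetric kernel were observed.
assert counts["all_gather"] > 0
assert counts["reduce_scatter"] > 0
```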

copy-pr-bot Bot commented Jun 23, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@wujingyue wujingyue force-pushed the symm branch 2 times, most recently from 3d9c0e4 to a499fa1 June 23, 2026 06:34
@wujingyue wujingyue marked this pull request as ready for review June 23, 2026 06:34
@wujingyue wujingyue requested review from a team as code owners June 23, 2026 06:34
@wujingyue

Contributor Author

/ok to test

@wujingyue wujingyue requested a review from ahmadki June 23, 2026 16:12
@wujingyue wujingyue changed the base branch from pull-request/5387 to main June 23, 2026 23:17
@wujingyue wujingyue changed the title Add NCCL symmetric-memory staging to experimental FSDP Add NCCL symmetric-memory staging Jun 25, 2026

copy-pr-bot Bot commented Jun 25, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wujingyue wujingyue removed the MFSDPv2 label Jun 29, 2026
@wujingyue wujingyue self-assigned this Jun 29, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the Final Review label Jun 29, 2026
@wujingyue wujingyue changed the title Add NCCL symmetric-memory staging Add NCCL symmetric-memory staging to experimental FSDP Jun 29, 2026
Allocate experimental FSDP all-gather and reduce-scatter staging buffers from PyTorch's NCCL symmetric-memory pool when use_symm_mem=True. Add explicit rendezvous before the symmetric-memory collectives and cover the path with a CUDA/NCCL profiler test that checks the symmetric kernel counts.

Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
- unshard_parameters now raises on a None gather axis and rendezvouses
  unconditionally under symmetric memory, mirroring reduce_gradients.
- Inline the single-use num_sharded_modules constant in the parity test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
The parity test used a tiny model (Linear(8,16)+Linear(16,4)); its sub-KB
collectives make NCCL fall back to ring on runners with NCCL_NVLS_ENABLE=0
(e.g. CI), so the ncclSymk* kernel-count assertions failed there even though
the runner supports symmetric memory. Widen the two sharded Linears to 1024
(a few-MiB bf16 weight), which reliably engages the symmetric kernels while
preserving the loss-parity check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jingyue Wu <jingyuew@nvidia.com>
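
The sizes quoted in the commit message above can be checked with a little arithmetic. The sketch assumes 2 bytes per bf16 element, ignores biases, and assumes the widened layers are Linear(1024, 1024); the commit message only says the Linears were widened to 1024, so the exact shapes are a guess. `linear_weight_bytes` is a hypothetical helper.

```python
BYTES_PER_BF16 = 2  # bf16 stores each element in 16 bits


def linear_weight_bytes(in_features, out_features, bytes_per_elem=BYTES_PER_BF16):
    """Size in bytes of a Linear layer's weight matrix (bias ignored)."""
    return in_features * out_features * bytes_per_elem


# Original tiny model: Linear(8, 16) + Linear(16, 4).
tiny_sizes = [linear_weight_bytes(8, 16), linear_weight_bytes(16, 4)]

# Assumed widened shape: Linear(1024, 1024).
wide_size = linear_weight_bytes(1024, 1024)

KIB = 1024
MIB = 1024 * KIB

# Both tiny weights are well under 1 KiB; the wide one is 2 MiB.
assert all(size < KIB for size in tiny_sizes)
assert wide_size == 2 * MIB
```

Under these assumptions the tiny weights are 256 and 128 bytes, which matches the "sub-KB collectives" description, while each widened weight is 2 MiB.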
@wujingyue

Contributor Author

/ok to test

@svcnvidia-nemo-ci


🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28622613847

Labels

Approved (all necessary approvals have been made), complexity: medium, Run tests

5 participants