[Megatron-FSDP] MaxPoolAllocator for double-buffering hybrid architectures. by cspades · Pull Request #5462 · NVIDIA/Megatron-LM

cspades · 2026-06-23T22:20:15Z

I, the PR author, have personally reviewed every line of this PR.

What does this PR do ?

The allocator iterates through all FSDP units. For each unit, the data buckets are grouped by data type, sorted from smallest to largest, and compared against the current MaxPool. If the pool has too few buckets to support the unit, new buckets of size 0 are added to the pool. If the largest buckets in the pool are too small for the unit's buckets (which are assigned to the pool from smallest to largest), those pool buckets are enlarged. After this process, we arrive at a minimal set of buckets that can symmetrically double-buffer every FSDP unit in the model.
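The pass described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the PR's `_build_fixed_max_pool`: the function name `build_max_pool`, the input layout (one dict per FSDP unit, mapping a dtype name to bucket sizes in numel), and the omission of bucket IDs are all hypothetical choices made for brevity.

```python
from collections import defaultdict


def build_max_pool(units):
    """Return, per dtype, the smallest pool that fits every unit's buckets.

    `units` is a hypothetical input: a list of dicts mapping a dtype name
    to that unit's bucket sizes (numel).
    """
    pool = defaultdict(list)
    for unit in units:
        for dtype, sizes in unit.items():
            bucket_sizes = sorted(sizes)
            max_bucket_sizes = pool[dtype]
            # Not enough buckets for this unit: extend the pool with 0's.
            if len(bucket_sizes) > len(max_bucket_sizes):
                extend_len = len(bucket_sizes) - len(max_bucket_sizes)
                max_bucket_sizes.extend([0] * extend_len)
            max_bucket_sizes.sort()
            # Align the unit's buckets with the largest slots of the pool
            # and enlarge any slot that is too small.
            offset = len(max_bucket_sizes) - len(bucket_sizes)
            for i, size in enumerate(bucket_sizes):
                slot = offset + i
                max_bucket_sizes[slot] = max(max_bucket_sizes[slot], size)
    return dict(pool)


units = [{"bf16": [4, 2]}, {"bf16": [3]}, {"bf16": [1, 1, 5], "fp32": [7]}]
print(build_max_pool(units))  # {'bf16': [1, 2, 5], 'fp32': [7]}
```

Because both lists are sorted before the element-wise maximum is taken, the pool stays sorted after each unit, and a unit with fewer buckets only ever touches the largest slots.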

- Adds hybrid-architecture double buffering via FSDP unit max-pooling for Megatron-FSDP. (V1)
  - Opens up CUDA graph (CG) or NCCL UBR support for hybrid architectures, which will help support users for a while.
- Adds the strict_assignment state, which first attempts to re-assign the bucket previously assigned to an FSDP unit, and otherwise warns the user and assigns a different bucket to the unit.
  - If this warning appears during warmup or CUDA graph capture, some memory is likely being orphaned and you will hit numerical errors.
- Fixes an issue where parameters / buckets that are not members of an FSDP unit pre-fetch subsequent buckets that are not used until later, exhausting buffers in the double buffer allocator and causing an allocation error.
  - Only necessary for double buffer allocators, which require careful management of the 2 buffers in the pool.
- Deprecates --grad-reduce-in-bf16 / grad_reduce_in_fp32 for Megatron-FSDP, which has been incredibly confusing to use. The default arguments (auto) assume BF16 for both the main gradient and gradient communication dtypes, so this will not OOM any existing user's configs.
- Deprecates fsdp_unit_id == -1. It is never set to -1.
- Adds a call to torch.autograd.graph.set_override_stale_capture_stream(True) (only supported on PyTorch versions that include pytorch/pytorch#180090, "Detect and fix stale stream references in autograd during CUDA graph capture") to prevent full-iteration CG errors like this:

[rank0]: RuntimeError: During CUDA graph capture, autograd node 'torch::autograd::AccumulateGrad' has a stale reference to the default stream (stream 0) from warmup. This will invalidate the capture because cudaStreamWaitEvent on the default stream pulls a non-capturing stream into the graph.

[rank0]: To fix, either:
[rank0]:   (a) Run warmup on the same stream that capture will use, or
[rank0]:   (b) Delete references to the loss / autograd graph (e.g. `del loss`) before capture, or
[rank0]:   (c) Call torch.autograd.graph.set_override_stale_capture_stream(True) to automatically redirect stale nodes to the capturing stream.

^ (a) is annoying to implement, (b) is dirty, and (c) is EZ-PZ and recommended.
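The strict_assignment behavior in the list above (prefer the bucket a unit held last time, otherwise warn and hand out a different one) can be illustrated with a toy allocator. Everything here is hypothetical: the class `StickyDoubleBuffer`, its methods, and the `reassignments` counter are invented for illustration and are not the allocator added by this PR.

```python
import logging

logger = logging.getLogger("sticky_double_buffer_sketch")


class StickyDoubleBuffer:
    """Toy 2-buffer pool that prefers each unit's previous buffer."""

    def __init__(self, num_buffers: int = 2) -> None:
        self.free = set(range(num_buffers))
        self.held = {}      # unit_id -> buffer currently held
        self.previous = {}  # unit_id -> buffer held last time
        self.reassignments = 0

    def allocate(self, unit_id: int) -> int:
        if unit_id in self.held:
            return self.held[unit_id]
        if not self.free:
            raise RuntimeError("double buffer pool exhausted")
        preferred = self.previous.get(unit_id)
        if preferred in self.free:
            buffer_id = preferred
        else:
            buffer_id = min(self.free)
            if preferred is not None:
                # The unit's previous buffer is busy: warn and fall back.
                self.reassignments += 1
                logger.warning(
                    "unit %d was assigned buffer %d instead of %d",
                    unit_id, buffer_id, preferred,
                )
        self.free.remove(buffer_id)
        self.held[unit_id] = buffer_id
        self.previous[unit_id] = buffer_id
        return buffer_id

    def release(self, unit_id: int) -> None:
        self.free.add(self.held.pop(unit_id))
```

In this sketch, a warning during steady state means a unit's buffer changed between iterations, which is the situation the PR description flags as risky during warmup or CUDA graph capture.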

⚠️ For major changes (either in lines of code or in its impact), please make sure to first share a design doc with the team. If you're unsure what's the best way to do so, contact @NVIDIA/mcore-oncall.


copy-pr-bot · 2026-06-23T22:20:19Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

cspades · 2026-06-25T02:55:09Z

+    def _build_fixed_max_pool(self):
+        """
+        Compute the maximum double-buffer pool required to support all FSDP units.
+        """


The max pooling algorithm is here. The rest of the code is similar to FixedPoolAllocator.

wujingyue · 2026-06-28T18:53:07Z

Do max pooling decisions depend on prefetching/overlapping? Conceptually, more aggressive prefetching needs more memory and therefore affects the max pooling algorithm?

cspades · 2026-06-29

Max pooling is a hard requirement for memory efficiency unless we have a more sophisticated FSDP scheduling algorithm that pre-computes all possible paths for this set of asymmetric model layers.

Once you treat the max pool as the representative size of a unit in this model, pre-fetch sizing can be controlled using suggested_communication_unit_size, which controls how many max pool buckets we pre-fetch. It defaults to 500M or 1B numel.

Finally, if your performance requirements put the size of a max pool bucket at odds with the required communication size, and you don't want to increase the communication size, the last resort is to use fine-grained all-gather (AG) or fine-grained reduce-scatter (RS), which adds more checkpoints at which an AG or RS may be launched. This allows multiple rounds of AG or RS within a single FSDP unit.
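As a rough illustration of how a communication-size target could translate into a pre-fetch depth once every unit is treated as max-pool sized: the helper below is purely hypothetical, and the rule it uses (keep pre-fetching whole units until the target numel is reached, always at least one) is an assumption, not the scheduler's confirmed behavior. Only the parameter name suggested_communication_unit_size comes from the discussion above.

```python
def num_units_to_prefetch(
    max_pool_unit_numel: int, suggested_communication_unit_size: int
) -> int:
    """Hypothetical estimate of how many max-pool units cover the target size."""
    if max_pool_unit_numel <= 0:
        raise ValueError("max_pool_unit_numel must be positive")
    # Integer ceiling division, then clamp to at least one unit.
    covered = -(-suggested_communication_unit_size // max_pool_unit_numel)
    return max(1, covered)


print(num_units_to_prefetch(300_000_000, 1_000_000_000))    # 4
print(num_units_to_prefetch(2_000_000_000, 1_000_000_000))  # 1
```

Under this assumed rule, a max-pool unit larger than the target still pre-fetches one unit, which is the "awkward position" where fine-grained AG or RS becomes the remaining lever.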

cspades · 2026-06-25T03:28:28Z

+            if hasattr(torch.autograd.graph, 'set_override_stale_capture_stream'):
+                torch.autograd.graph.set_override_stale_capture_stream(True)
+            else:
+                logger.warning(
+                    'torch.autograd.graph.set_override_stale_capture_stream is not '
+                    'available in this PyTorch version; CUDA graph capture may fail '
+                    'if autograd nodes hold stale references to non-capturing streams. '
+                    'Upgrade to a PyTorch build that includes pytorch/pytorch#180090.'
+                )


We should simply call this whenever the PyTorch version is new enough to include pytorch/pytorch#180090. (A PyTorch version with this change has not been published yet.)

It harmlessly makes things a lot easier w.r.t. stragglers on the Autograd / accumulate stream. cc @nanz-nv

wujingyue

Deprecates --grad-reduce-in-bf16 / grad_reduce_in_fp32 for Megatron-FSDP, which has been incredibly confusing to use. Default arguments (auto) assume BF16 for both, so will not OOM any existing user's configs.
Adds a call to torch.autograd.graph.set_override_stale_capture_stream(True) (only supported on new PyTorch versions since pytorch/pytorch#180090) to prevent full-iteration CG errors like this:

Thanks for the PR and the figures!

While I'm still reviewing the rest, can these two changes go to a separate PR(s)? https://google.github.io/eng-practices/review/developer/small-cls.html

cspades · 2026-06-26T23:02:54Z

While I'm still reviewing the rest, can these two changes go to a separate PR(s)? https://google.github.io/eng-practices/review/developer/small-cls.html

@wujingyue Considering this exact commit needs to be merged for the NeMo release code freeze in a few days, could we make an exception in this case? These three features are all needed for Nemotron benchmarks. I'm concerned that waiting on 3 PRs to be merged in a few work days is not feasible.

wujingyue · 2026-06-28T01:36:33Z

I'm concerned that waiting on 3 PRs to be merged in a few work days is not feasible.

In my experience, reviewing three stacked PRs is usually faster than reviewing a single large PR. Stacked PRs can also be reviewed in parallel, though I may be missing something about how the review process works in Megatron-LM.

As a less ideal alternative, you could keep everything in a single PR but split it into three well-structured commits. GitHub's UI supports reviewing commits individually, which provides a similar incremental review experience.

wujingyue · 2026-06-28T18:53:07Z

+    def _build_fixed_max_pool(self):
+        """
+        Compute the maximum double-buffer pool required to support all FSDP units.
+        """


Do max pooling decisions depend on prefetching/overlapping? Conceptually, more aggressive prefetching needs more memory and therefore affects the max pooling algorithm?

wujingyue

LGTM otherwise

wujingyue · 2026-06-28T22:45:40Z

+                # If more buckets are needed for this unit, extend the pool with 0's.
+                if len(bucket_sizes) > len(max_bucket_sizes):
+                    extend_len = len(bucket_sizes) - len(max_bucket_sizes)
+                    max_bucket_sizes.extend([0] * extend_len)


Isn't max_bucket_sizes already sorted so we can prepend 0s without having to sort max_bucket_sizes again?

cspades · 2026-06-29

Well, we have already assigned the previous bucket IDs, and I'm using the enumerated index of this list as a bucket offset. If I prepend, it will shift all existing buckets to the right relative to their bucket offsets and break this algorithm.

sorted(enumerate(max_bucket_sizes), key=lambda x: x[1])

We can avoid this by reversing the zip: add the new buckets to the end of the pool, but take the largest N buckets from the top of the pool and assign them to the largest N buckets of the unit (so also bucket_sizes.sort() -> bucket_sizes.sort(reverse=True)). I think that should preserve a reversed sorting order.
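The reversed-order idea can be sketched as follows, assuming the pool is kept in descending order and a list index doubles as the bucket offset. The function name `update_pool_descending` is hypothetical; the variable names mirror the diff hunk above, but this is not the merged implementation.

```python
def update_pool_descending(max_bucket_sizes, unit_bucket_sizes):
    """Grow a descending-sorted pool in place so it fits one more unit."""
    bucket_sizes = sorted(unit_bucket_sizes, reverse=True)
    # New buckets are appended, so existing bucket offsets never shift.
    if len(bucket_sizes) > len(max_bucket_sizes):
        extend_len = len(bucket_sizes) - len(max_bucket_sizes)
        max_bucket_sizes.extend([0] * extend_len)
    # Pair the unit's largest bucket with the pool's largest, and so on.
    for offset, size in enumerate(bucket_sizes):
        max_bucket_sizes[offset] = max(max_bucket_sizes[offset], size)
    return max_bucket_sizes


pool = []
for unit in ([2, 4], [3], [1, 1, 5]):
    update_pool_descending(pool, unit)
print(pool)  # [5, 2, 1]
```

Appending zeros and taking an element-wise maximum of two descending lists both leave the pool descending, so no re-sort is needed and offset 0 always refers to the largest bucket.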

ericharper · 2026-06-30T23:55:23Z

-                if ddp_config.grad_reduce_in_fp32
-                else ddp_config.megatron_fsdp_grad_comm_dtype
-            ),
+            main_grads_dtype=ddp_config.megatron_fsdp_main_grads_dtype,


does this break bridge? https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/d12548a3cf7a72e0b2f38cd67da5598624abb3fd/src/megatron/bridge/recipes/qwen_vl/qwen35_vl.py#L131-L141

@gautham-kollu , can you update bridge?

FYI @yaoyu-33 and @cuichenx

cspades · 2026-07-01

The migration action items are to set:

- --megatron-fsdp-main-grads-dtype fp32 / ddp.megatron_fsdp_main_grads_dtype=torch.float32
- --megatron-fsdp-grad-comm-dtype fp32 / ddp.megatron_fsdp_grad_comm_dtype=torch.float32

for any recipe that uses grad_reduce_in_fp32=True (i.e. does not use --grad-reduce-in-bf16).

For completeness, if --grad-reduce-in-bf16 is set (grad_reduce_in_fp32=False), the defaults for megatron_fsdp_grad_comm_dtype and megatron_fsdp_main_grads_dtype are both BF16, which matches the old behavior of that flag and needs no changes. (This is the logical spaghetti I was talking about: two levels of arguments.)
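The migration rule above fits in a small helper. The function `migrate_grad_dtype_args` is hypothetical and only restates the mapping from this comment; the two CLI flags are the ones named above, and the empty list stands for "keep the defaults".

```python
def migrate_grad_dtype_args(grad_reduce_in_fp32: bool) -> list[str]:
    """Map the deprecated setting to the Megatron-FSDP dtype flags."""
    if grad_reduce_in_fp32:
        # Old FP32 behavior must now be requested explicitly.
        return [
            "--megatron-fsdp-main-grads-dtype", "fp32",
            "--megatron-fsdp-grad-comm-dtype", "fp32",
        ]
    # --grad-reduce-in-bf16 was set: the defaults already resolve to BF16.
    return []


print(" ".join(migrate_grad_dtype_args(True)))
```

Recipes that set the fields on the config object instead of the CLI would make the equivalent change via ddp.megatron_fsdp_main_grads_dtype and ddp.megatron_fsdp_grad_comm_dtype.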

cc @gautham-kollu if you can hit this in your next benchmark update. 🙏🏻 IMO low-ish priority because this will not OOM anyone's script.

ericharper

Approve but bridge needs to be updated.

ericharper · 2026-07-01T00:02:24Z

-                if ddp_config.grad_reduce_in_fp32
-                else ddp_config.megatron_fsdp_grad_comm_dtype
-            ),
+            main_grads_dtype=ddp_config.megatron_fsdp_main_grads_dtype,


@gautham-kollu , can you update bridge?

FYI @yaoyu-33 and @cuichenx

svcnvidia-nemo-ci · 2026-07-01T00:19:49Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28484555475

svcnvidia-nemo-ci · 2026-07-01T00:36:29Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28485237545

svcnvidia-nemo-ci · 2026-07-01T16:27:36Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28532337643


svcnvidia-nemo-ci · 2026-07-02T00:11:49Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28556300115

svcnvidia-nemo-ci · 2026-07-02T00:35:46Z

🔄 Merge queue validation started!

You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28557199610

cspades self-assigned this Jun 23, 2026

cspades force-pushed the cye/maxpool-dbuf branch from af4ad72 to b81af2c Compare June 23, 2026 22:21

cspades added the module: megatron-fsdp label Jun 23, 2026

dingqingy-nv added nemotron 26.06.01 labels Jun 24, 2026

cspades marked this pull request as ready for review June 25, 2026 01:24

cspades requested review from a team as code owners June 25, 2026 01:24

svcnvidia-nemo-ci added the complexity: medium label Jun 25, 2026

cspades commented Jun 25, 2026

wujingyue reviewed Jun 26, 2026

Comment thread megatron/core/distributed/fsdp/src/megatron_fsdp/param_and_grad_buffer.py

wujingyue reviewed Jun 28, 2026

shjwudp approved these changes Jun 29, 2026

yashaswikarnati approved these changes Jun 29, 2026

deepakn94 approved these changes Jun 29, 2026

ericharper reviewed Jun 30, 2026

ericharper approved these changes Jul 1, 2026

cspades added 11 commits July 1, 2026 16:14

Add the MaxPoolAllocator for double-buffering hybrid architectures.

d9ffb3c

Signed-off-by: Cory Ye <cye@nvidia.com>

Add documentation for MaxPoolAllocator.

78ee5cc

Signed-off-by: Cory Ye <cye@nvidia.com>

Fix DDPConfig typo.

3e084c7

Signed-off-by: Cory Ye <cye@nvidia.com>

Fix non-FSDP unit buckets prefetching buckets that are not used until…

b6ef81d

… later, and grad_comm_dtype not respected during FixedPool/MaxPool bucket planning. Signed-off-by: Cory Ye <cye@nvidia.com>

Lint.

6f50b24

Signed-off-by: Cory Ye <cye@nvidia.com>

Add full CG accumulate stream override and DRY the FP8 parameter dete…

d7ca759

…ction. Signed-off-by: Cory Ye <cye@nvidia.com>

Add tests.

2200de0

Signed-off-by: Cory Ye <cye@nvidia.com>

Fully deprecate --grad-reduce-in-bf16 for Megatron-FSDP.

24b485c

Signed-off-by: Cory Ye <cye@nvidia.com>

Only skip pre-fetch for non-unit modules when using double buffering.

76cabed

Signed-off-by: Cory Ye <cye@nvidia.com>

Remove excess MaxPool sorting, deprecate fsdp_unit_id==-1, and other …

5da6cda

…nits. Signed-off-by: Cory Ye <cye@nvidia.com>

Update functional tests.

96d05ab

Signed-off-by: Cory Ye <cye@nvidia.com>
