[Megatron-FSDP] MaxPoolAllocator for double-buffering hybrid architectures. #5462
| def _build_fixed_max_pool(self): | ||
| """ | ||
| Compute the maximum double-buffer pool required to support all FSDP units. | ||
| """ |
The max pooling algorithm is here. The rest of the code is similar to FixedPoolAllocator.
Do max pooling decisions depend on prefetching/overlapping? Conceptually, more aggressive prefetching needs more memory and therefore affects the max pooling algorithm?
Max pooling is a hard requirement for memory efficiency unless we have a more sophisticated FSDP scheduling algorithm that pre-computes all possible paths for this set of asymmetric model layers.
Once you treat the max pool as the representative unit size for this model, pre-fetch sizing can be controlled with suggested_communication_unit_size, which determines how many max pool buckets we pre-fetch. It defaults to 500M or 1B numel.
Finally, if your performance requirements put the max pool bucket size at odds with the required communication size, and you don't want to increase the communication size, the last option is fine-grained AG (all-gather) or fine-grained RS (reduce-scatter), which adds more checkpoints at which an AG or RS may be launched. This allows multiple AG or RS episodes within a single FSDP unit.
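As a rough illustration of the sizing trade-off described above, here is a toy calculation. It assumes, purely for illustration, that the pre-fetcher groups whole max pool units until their total numel reaches suggested_communication_unit_size. The function name units_to_prefetch and that grouping policy are hypothetical and are not taken from the PR.

```python
def units_to_prefetch(max_pool_unit_numel: int,
                      suggested_communication_unit_size: int) -> int:
    """Hypothetical policy: how many max-pool-sized units fit in one
    communication unit. At least one unit is always fetched, even when
    a single unit already exceeds the suggested size."""
    if max_pool_unit_numel <= 0:
        raise ValueError("max_pool_unit_numel must be positive")
    return max(1, suggested_communication_unit_size // max_pool_unit_numel)


# Example inputs (made-up unit sizes, in number of elements).
small_units = units_to_prefetch(200_000_000, 500_000_000)
large_units = units_to_prefetch(800_000_000, 500_000_000)
```

The second call shows the "awkward position" case: one max pool unit is already larger than the suggested communication size, so raising the size or splitting the unit with fine-grained AG/RS are the remaining knobs.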
| if hasattr(torch.autograd.graph, 'set_override_stale_capture_stream'): | ||
| torch.autograd.graph.set_override_stale_capture_stream(True) | ||
| else: | ||
| logger.warning( | ||
| 'torch.autograd.graph.set_override_stale_capture_stream is not ' | ||
| 'available in this PyTorch version; CUDA graph capture may fail ' | ||
| 'if autograd nodes hold stale references to non-capturing streams. ' | ||
| 'Upgrade to a PyTorch build that includes pytorch/pytorch#180090.' | ||
| ) |
We should simply call this whenever the PyTorch version is new enough: pytorch/pytorch#180090 (A PyTorch version that includes it has not been published yet.)
It is harmless and makes things a lot easier w.r.t. stragglers on the Autograd / accumulate stream. cc @nanz-nv
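The guard in the diff above is an instance of plain feature detection. A generic sketch of the same pattern, runnable without PyTorch, might look like the following. The helper name call_if_available is hypothetical, and the stand-in namespaces below are made up for the demonstration.

```python
import logging
from types import SimpleNamespace

logger = logging.getLogger(__name__)


def call_if_available(namespace, name, *args):
    """Call namespace.<name>(*args) if it exists; otherwise log a warning.

    Returns True when the call was made, False when the attribute is missing.
    """
    fn = getattr(namespace, name, None)
    if callable(fn):
        fn(*args)
        return True
    logger.warning("%s is not available in this build; skipping.", name)
    return False


# Stand-ins for a new and an old library build (illustrative only).
calls = []
new_build = SimpleNamespace(set_flag=lambda value: calls.append(value))
old_build = SimpleNamespace()

enabled_on_new = call_if_available(new_build, "set_flag", True)
enabled_on_old = call_if_available(old_build, "set_flag", True)
```

With PyTorch installed, the equivalent call would be call_if_available(torch.autograd.graph, "set_override_stale_capture_stream", True), which mirrors the hasattr check in the diff.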
wujingyue
left a comment
Deprecates --grad-reduce-in-bf16 / grad_reduce_in_fp32 for Megatron-FSDP, which has been incredibly confusing to use. Default arguments (auto) assume BF16 for both, so will not OOM any existing user's configs.
Adds a call to torch.autograd.graph.set_override_stale_capture_stream(True) (only supported on new PyTorch versions since pytorch/pytorch#180090) to prevent full-iteration CG errors like this:
Thanks for the PR and the figures!
While I'm still reviewing the rest, can these two changes go to a separate PR(s)? https://google.github.io/eng-practices/review/developer/small-cls.html
@wujingyue Considering this exact commit needs to be merged for the NeMo release code freeze in a few days, could we make an exception in this case? These three features are all needed for Nemotron benchmarks. I'm concerned that waiting on three PRs to be merged within a few work days is not feasible.
In my experience, reviewing three stacked PRs is usually faster than reviewing a single large PR. Stacked PRs can also be reviewed in parallel, though I may be missing something about how the review process works in Megatron-LM. As a less ideal alternative, you could keep everything in a single PR but split it into three well-structured commits. GitHub's UI supports reviewing commits individually, which provides a similar incremental review experience.
| # If more buckets are needed for this unit, extend the pool with 0's. | ||
| if len(bucket_sizes) > len(max_bucket_sizes): | ||
| extend_len = len(bucket_sizes) - len(max_bucket_sizes) | ||
| max_bucket_sizes.extend([0] * extend_len) |
Isn't max_bucket_sizes already sorted so we can prepend 0s without having to sort max_bucket_sizes again?
Well, we have already assigned the previous bucket IDs, and I'm using the enumerated index of this list as the bucket offset. If I prepend, all existing buckets shift to the right relative to their bucket offset, which breaks this algorithm.
sorted(enumerate(max_bucket_sizes), key=lambda x: x[1])
We can avoid this by reversing the zip: add the new buckets to the end of the pool, but take the largest N buckets from the top of the pool and assign them to the largest N buckets of the unit (so also bucket_sizes.sort() -> bucket_sizes.sort(reverse=True)). I think that should preserve a reverse-sorted order.
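To make the discussion above concrete, here is a minimal sketch of the reversed-zip variant: new slots are appended to the pool so existing offsets never move, and each unit's largest buckets are matched with the pool's largest slots. This is an illustration under those assumptions, not the PR's _build_fixed_max_pool implementation. The function name build_max_pool and the example sizes are hypothetical.

```python
from typing import List, Tuple


def build_max_pool(
    unit_bucket_sizes: List[List[int]],
) -> Tuple[List[int], List[List[int]]]:
    """Return (pool slot sizes, per-unit slot offsets for each bucket)."""
    max_bucket_sizes: List[int] = []
    slot_assignments: List[List[int]] = []
    for sizes in unit_bucket_sizes:
        # Append zero-sized slots so previously assigned offsets stay valid.
        if len(sizes) > len(max_bucket_sizes):
            max_bucket_sizes.extend([0] * (len(sizes) - len(max_bucket_sizes)))
        # Rank this unit's buckets and the pool's slots, largest first.
        bucket_order = sorted(range(len(sizes)), key=lambda i: sizes[i], reverse=True)
        ranked_slots = sorted(enumerate(max_bucket_sizes), key=lambda x: x[1], reverse=True)
        unit_slots = [0] * len(sizes)
        # zip stops at the shorter list, so a unit with N buckets
        # only touches the N largest slots.
        for bucket_idx, (slot, capacity) in zip(bucket_order, ranked_slots):
            max_bucket_sizes[slot] = max(capacity, sizes[bucket_idx])
            unit_slots[bucket_idx] = slot
        slot_assignments.append(unit_slots)
    return max_bucket_sizes, slot_assignments


# Three made-up FSDP units with different bucket counts and sizes.
units = [[4, 2], [3, 5, 1], [6]]
pool, assignments = build_max_pool(units)
```

Because slot capacities only ever grow, an offset handed to an earlier unit still points at a slot large enough for that unit's bucket after later units have been processed.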
| if ddp_config.grad_reduce_in_fp32 | ||
| else ddp_config.megatron_fsdp_grad_comm_dtype | ||
| ), | ||
| main_grads_dtype=ddp_config.megatron_fsdp_main_grads_dtype, |
The migration action items are:
--megatron-fsdp-main-grads-dtype fp32 / ddp.megatron_fsdp_main_grads_dtype=torch.float32
--megatron-fsdp-grad-comm-dtype fp32 / ddp.megatron_fsdp_grad_comm_dtype=torch.float32
for any recipe that uses grad_reduce_in_fp32=True (i.e. does not use --grad-reduce-in-bf16).
For completeness: with --grad-reduce-in-bf16 / grad_reduce_in_fp32=False, the default megatron_fsdp_grad_comm_dtype and megatron_fsdp_main_grads_dtype are both BF16, which already matches the behavior of that flag and needs no changes. (This is the logical spaghetti I was talking about: two levels of arguments.)
cc @gautham-kollu if you can hit this in your next benchmark update. 🙏🏻 IMO low-ish priority because this will not OOM anyone's script.
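The migration above can be sketched with a small stand-in config. The field names come from the diff, but the class DDPConfigSketch, the string dtypes (used instead of torch dtypes so the example runs without PyTorch), and the helper functions are hypothetical. The sketch also assumes that the elided branch in the diff selects FP32 when grad_reduce_in_fp32 is set.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class DDPConfigSketch:
    """Hypothetical stand-in for the relevant DDP config fields."""
    grad_reduce_in_fp32: bool = False
    megatron_fsdp_grad_comm_dtype: str = "bf16"
    megatron_fsdp_main_grads_dtype: str = "bf16"


def resolve_grad_dtypes(cfg: DDPConfigSketch) -> Tuple[str, str]:
    """Mirror the diff: the legacy flag overrides only the grad comm dtype."""
    grad_comm = "fp32" if cfg.grad_reduce_in_fp32 else cfg.megatron_fsdp_grad_comm_dtype
    return grad_comm, cfg.megatron_fsdp_main_grads_dtype


def migrate(cfg: DDPConfigSketch) -> DDPConfigSketch:
    """Replace the legacy flag with the two explicit dtype arguments."""
    if not cfg.grad_reduce_in_fp32:
        return cfg  # BF16 defaults already match the legacy BF16 behavior.
    return replace(
        cfg,
        grad_reduce_in_fp32=False,
        megatron_fsdp_grad_comm_dtype="fp32",
        megatron_fsdp_main_grads_dtype="fp32",
    )


legacy_fp32 = DDPConfigSketch(grad_reduce_in_fp32=True)
migrated = migrate(legacy_fp32)
```

In this sketch the legacy flag alone leaves main grads at the BF16 default, which is why both explicit arguments appear in the action items.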
ericharper
left a comment
Approve but bridge needs to be updated.
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28484555475 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28485237545 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28532337643 |
Signed-off-by: Cory Ye <cye@nvidia.com>
… later, and grad_comm_dtype not respected during FixedPool/MaxPool bucket planning. Signed-off-by: Cory Ye <cye@nvidia.com>
…ction. Signed-off-by: Cory Ye <cye@nvidia.com>
…nits. Signed-off-by: Cory Ye <cye@nvidia.com>
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28556300115 |
|
🔄 Merge queue validation started! You can track the progress here: https://github.com/NVIDIA/Megatron-LM/actions/runs/28557199610 |
What does this PR do?
- … strict_assignment state to attempt to assign the same bucket previously assigned to an FSDP unit before warning the user and assigning a different bucket to the unit.
- Deprecates --grad-reduce-in-bf16 / grad_reduce_in_fp32 for Megatron-FSDP, which has been incredibly confusing to use. Default arguments (auto) assume BF16 for both, so will not OOM any existing user's configs.
- … fsdp_unit_id == -1. It is never set to -1.
- Adds a call to torch.autograd.graph.set_override_stale_capture_stream(True) (only supported on new PyTorch versions since pytorch/pytorch#180090, "Detect and fix stale stream references in autograd during CUDA graph capture") to prevent full-iteration CG errors like this:

^ (a) is annoying to implement, (b) is dirty, and (c) is EZ-PZ and recommended.