Fix incorrectly set decoupled_grad in training.py for MFSDP. #4133
cspades wants to merge 2 commits into NVIDIA:main
Conversation
cspades force-pushed from 3860daf to a230105
cspades force-pushed from a230105 to 349b8ff
cspades force-pushed from 349b8ff to d562ada
shjwudp left a comment:
Overall LGTM, just have a little concern about megatron_fsdp_use_decoupled_grad.
This PR prevents FusedAdam from redundantly creating master weights when M-FSDP already maintains FP32 main weights, which is the correct way to use FusedAdam under M-FSDP (in terms of memory usage).
# the same conditions that the distributed optimizer uses decoupled gradients.
args.main_params_dtype != torch.float32
or (args.fp8_recipe is None or args.fp8_recipe == "delayed")
or args.optimizer_cpu_offload
Do we really need to follow this combination of conditions? 🤔
Unless there are some specific constraints, I think we’d better avoid using such a complicated setup — it makes maintenance harder.
Theoretically, M-FSDP shouldn’t be restricted by these conditions and can freely use decoupled_grad.
Not sure if I can change the current logic for non-MFSDP, so if there are no constraints on the Megatron-FSDP side, I think matching the behavior should be inconsequential, right?
Otherwise, I will likely need to do something like this:
- If Megatron-FSDP, use a separate decoupled_grad argument to determine if it is used: if use_megatron_fsdp and use_precision_aware_optimizer: ...
- If not Megatron-FSDP, follow the MLM logic where only BF16 and FP8 DS use it: if use_precision_aware_optimizer_no_fp8_or_ds_fp8: ...
Another thing I can do is to move the OptimizerConfig.__post_init__ logic outside so we can use the same variable for the DP wrapper and the optimizer, but it won't make it significantly easier to maintain.
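To make the second option above concrete, here is a minimal, self-contained sketch of hoisting the derived condition out of OptimizerConfig.__post_init__ into one shared helper that both the DP wrapper and the optimizer could read. All names here (ToyOptimizerConfig, precision_aware_no_fp8_or_ds_fp8) are hypothetical stand-ins, and the predicate body is illustrative only, not the exact MLM condition:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical, simplified stand-in for the fields that, in MLM, live on
# OptimizerConfig and are populated from the parsed training args.
@dataclass
class ToyOptimizerConfig:
    use_precision_aware_optimizer: bool = False
    fp8_recipe: Optional[str] = None

def precision_aware_no_fp8_or_ds_fp8(cfg: ToyOptimizerConfig) -> bool:
    """One shared derivation of the decoupled-grad condition, hoisted out of
    __post_init__ so the DDP wrapper and the optimizer cannot diverge.
    Illustrative logic only: precision-aware optimizer with no FP8 recipe,
    or with the delayed-scaling recipe."""
    return cfg.use_precision_aware_optimizer and (
        cfg.fp8_recipe is None or cfg.fp8_recipe == "delayed"
    )
```

Both call sites would then consume the same boolean instead of re-deriving it, which is the maintenance benefit being weighed here.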
Maybe this is the cleanest solution?
```python
# Optimizer Functions
if use_precision_aware_optimizer_no_fp8_or_ds_fp8 or (
    use_megatron_fsdp and use_precision_aware_optimizer
):
    # Make sure no use of FusedAdam master weights when using Megatron-FSDP.
```
And Megatron-FSDP can just always use decoupled_grad when using FusedAdam, or it can be controlled using a new argument but in my opinion users do not need to care about this.
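As a truth-table check of this proposed gate, a runnable sketch (the function name and flat boolean signature are hypothetical; in MLM these flags come from the args/config objects):

```python
def use_decoupled_grad(use_megatron_fsdp: bool,
                       use_precision_aware_optimizer: bool,
                       no_fp8_or_ds_fp8: bool) -> bool:
    """Proposed gating: M-FSDP always takes the decoupled-grad path whenever
    the precision-aware optimizer is on; non-FSDP runs keep the existing
    MLM condition (BF16 / FP8 delayed-scaling only)."""
    return no_fp8_or_ds_fp8 or (use_megatron_fsdp and use_precision_aware_optimizer)
```

Note that under this gate M-FSDP is never excluded by the FP8-recipe or CPU-offload conditions, which is exactly the simplification being argued for.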
I think this approach is better — it avoids complex condition checks later that might make us wonder why M-FSDP doesn’t support using decoupled_grad in certain cases.
Signed-off-by: Cory Ye <cye@nvidia.com>
cspades force-pushed from d562ada to f30d840
What does this PR do ?

The training.py DDPConfig argument assignment for megatron_fsdp_use_decoupled_grad is incorrect (introduced in "m-fsdp: wire use_precision_aware_optimizer from ddp_config to ParamAn… #4024"); it should follow the logic of OptimizerConfig.use_precision_aware_optimizer_no_fp8_or_ds_fp8.

Details
FusedAdam sets master_weights=True if OptimizerConfig.use_precision_aware_optimizer_no_fp8_or_ds_fp8 / use_decoupled_grad are True. This PR turns off FusedAdam master weights when using Megatron-FSDP, as FusedAdam should only provide an optimizer.step() to Megatron-FSDP's DTensor (FP32/BF16) main weights.

Testing

- Added a --use-precision-aware-optimizer unit test to temporarily guarantee functionality.
- master_weights: TODO
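As a toy illustration of the master-weights behavior this PR targets (the helper name and signature are hypothetical; the real decision is made when constructing FusedAdam in MLM):

```python
def fused_adam_master_weights(use_megatron_fsdp: bool,
                              use_decoupled_grad: bool) -> bool:
    """Hypothetical sketch of the intended gating: when Megatron-FSDP owns
    the FP32/BF16 DTensor main weights, FusedAdam must NOT allocate its own
    master copy and should only run optimizer.step() on those weights.
    Outside M-FSDP, master weights follow the decoupled-grad condition."""
    if use_megatron_fsdp:
        return False  # M-FSDP maintains main weights; avoid a redundant copy.
    return use_decoupled_grad
```

A unit test of this shape is what the Testing section refers to: assert the flag stays off under M-FSDP regardless of the precision-aware settings.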
Contribution process
Pre-checks
Code review
Feel free to message or comment @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS. Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned.

For PRs outside megatron/core, this step is skipped.

Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.

Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.