Fix Distributed Fused Adam Issues #8880
Conversation
Overall looks good, although we are still hashing out the design in NVIDIA/apex#1794. As discussed in NVIDIA/apex#1794 (comment), I think we should set `MegatronDistributedFusedAdam._step_support_amp_scaling=False` to signal that the NeMo grad scaler can accommodate the distributed optimizer (unlike the plain PyTorch grad scaler). As a bonus, this approach fixes the grad scaling issue even without needing to update Apex.
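For reference, a minimal sketch of the suggestion above, assuming the NeMo import path shown below; note that the attribute PyTorch's `GradScaler` actually checks is spelled `_step_supports_amp_scaling`:

```python
# Sketch only (assumed import path); not the change made in this PR.
# torch.cuda.amp.GradScaler checks getattr(optimizer, "_step_supports_amp_scaling", False):
# when it is falsy, the grad scaler unscales gradients and checks for infs/NaNs
# itself before calling optimizer.step(), instead of delegating scaling into the
# fused optimizer step.
from nemo.core.optim.distributed_adam import MegatronDistributedFusedAdam

MegatronDistributedFusedAdam._step_supports_amp_scaling = False
```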
You are probably right; to make sure it won't hurt perf, I need to confirm with our use cases. Anyway, I think that relates to the changes in Apex; the changes here have no relation to gradient clipping.
LGTM. Thanks!
`_step_support_amp_scaling=False` will be considered in a future PR once perf is verified.
Agreed offline that it could be dismissed.
jenkins
jenkins
* Fix distributed fused adam issue with NHWC layout.
* Fix the CUDA graph issue if there's a kernel in zero_grad.
* Add option to distribute adam states within node.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
What does this PR do?
This PR fixes distributed optimizer issues:
* `zero_grad()` not being captured by CUDA graph (a sketch of this interaction follows below);
* the distributed fused adam issue with the NHWC layout.
It also adds an option to distribute Adam states within a node.

Collection: [Note which collection this PR will affect]
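To illustrate the `zero_grad()` / CUDA graph interaction mentioned above, here is a generic, self-contained sketch in plain PyTorch (not the NeMo/Apex code touched by this PR): a kernel launched outside the captured region runs only eagerly and is not replayed with the graph, so the gradient buffer is not reset on replay.

```python
import torch

p = torch.zeros(1024, device="cuda", requires_grad=True)
p.grad = torch.zeros_like(p)
torch.cuda.synchronize()  # finish prior work before capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    p.grad += 1.0  # stands in for a captured backward pass

p.grad.zero_()     # launched outside capture: runs once, eagerly
g.replay()
g.replay()
torch.cuda.synchronize()
print(p.grad[0].item())  # 2.0 -- the zeroing was not part of the replayed graph
```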
Changelog
Usage
# Add a code snippet demonstrating how to use this
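Since the usage snippet was left as the template placeholder, here is a hypothetical sketch of selecting NeMo's distributed fused adam optimizer via an OmegaConf config; the `distribute_within_nodes` key is an assumed name for the new within-node state-distribution option and may not match the actual flag added in this PR.

```python
# Hypothetical usage sketch; the keys below (especially distribute_within_nodes)
# are assumptions for illustration, not taken from this PR.
from omegaconf import OmegaConf

optim_cfg = OmegaConf.create(
    {
        "name": "distributed_fused_adam",  # NeMo's distributed optimizer
        "lr": 1e-4,
        "weight_decay": 0.01,
        "betas": [0.9, 0.98],
        # assumed flag: shard Adam states within each node rather than
        # across the full data-parallel group
        "distribute_within_nodes": True,
    }
)
print(OmegaConf.to_yaml(optim_cfg))
```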
Jenkins CI
To run Jenkins, a NeMo User with write access must comment `jenkins` on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information