Fix Distributed Fused Adam Issues #8880
Conversation
Overall looks good, although we are still hashing out the design in NVIDIA/apex#1794. As discussed in NVIDIA/apex#1794 (comment), I think we should set `MegatronDistributedFusedAdam._step_support_amp_scaling=False` to signal that the NeMo grad scaler can accommodate the distributed optimizer (unlike the plain PyTorch grad scaler). As a bonus, this approach fixes the grad scaling issue even without needing to update Apex.
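For reference, a minimal sketch of the suggestion above, assuming the NeMo import path shown below; note that the attribute PyTorch's `GradScaler` actually checks is spelled `_step_supports_amp_scaling`:

```python
# Sketch only (assumed import path); not the change made in this PR.
# torch.cuda.amp.GradScaler checks getattr(optimizer, "_step_supports_amp_scaling", False):
# when it is falsy, the grad scaler unscales gradients and checks for infs/NaNs
# itself before calling optimizer.step(), instead of delegating scaling into the
# fused optimizer step.
from nemo.core.optim.distributed_adam import MegatronDistributedFusedAdam

MegatronDistributedFusedAdam._step_supports_amp_scaling = False
```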
You are probably right; to make sure it won't hurt perf, I need to confirm with our use cases. Anyway, I think that relates to the changes in Apex; the changes here have no relation to gradient clipping.
LGTM. Thanks!
`_step_support_amp_scaling=False` will be considered in a future PR once perf is verified.
Agreed offline that it could be dismissed.
jenkins
jenkins
* Fix distributed fused adam issue with NHWC layout.
* Fix the CUDA graph issue if there's a kernel in zero_grad.
* Add option to distribute adam states within node.
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
What does this PR do?
This PR fixes distributed optimizer issues:
* `zero_grad()` not being captured by CUDA graph (a sketch of this interaction follows below);
* the distributed fused adam issue with the NHWC layout.
It also adds an option to distribute Adam states within a node.

Collection: [Note which collection this PR will affect]
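To illustrate the `zero_grad()` / CUDA graph interaction mentioned above, here is a generic, self-contained sketch in plain PyTorch (not the NeMo/Apex code touched by this PR): a kernel launched outside the captured region runs only eagerly and is not replayed with the graph, so the gradient buffer is not reset on replay.

```python
import torch

p = torch.zeros(1024, device="cuda", requires_grad=True)
p.grad = torch.zeros_like(p)
torch.cuda.synchronize()  # finish prior work before capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    p.grad += 1.0  # stands in for a captured backward pass

p.grad.zero_()     # launched outside capture: runs once, eagerly
g.replay()
g.replay()
torch.cuda.synchronize()
print(p.grad[0].item())  # 2.0 -- the zeroing was not part of the replayed graph
```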
Changelog
Usage
# Add a code snippet demonstrating how to use this
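Since the usage snippet was left as the template placeholder, here is a hypothetical sketch of selecting NeMo's distributed fused adam optimizer via an OmegaConf config; the `distribute_within_nodes` key is an assumed name for the new within-node state-distribution option and may not match the actual flag added in this PR.

```python
# Hypothetical usage sketch; the keys below (especially distribute_within_nodes)
# are assumptions for illustration, not taken from this PR.
from omegaconf import OmegaConf

optim_cfg = OmegaConf.create(
    {
        "name": "distributed_fused_adam",  # NeMo's distributed optimizer
        "lr": 1e-4,
        "weight_decay": 0.01,
        "betas": [0.9, 0.98],
        # assumed flag: shard Adam states within each node rather than
        # across the full data-parallel group
        "distribute_within_nodes": True,
    }
)
print(OmegaConf.to_yaml(optim_cfg))
```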
Jenkins CI
To run Jenkins, a NeMo User with write access must comment `jenkins` on the PR.
Before your PR is "Ready for review"
Pre checks:
PR Type:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Who can review?
Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contain specific people who can review PRs to various areas.
Additional Information