@alpha0422
Contributor

Continuation of #1794.

This PR enhances the distributed fused Adam optimizer by:

  • Supporting the NHWC layout (required by some convolution-based models, e.g. diffusion models);
  • Fixing the gradient clipping bug;
  • Supporting CUDA graphs.
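
For readers unfamiliar with the gradient clipping item: when gradients are sharded across ranks, clipping by global norm generally requires combining per-rank squared norms before any shard is scaled. The snippet below is a minimal pure-Python sketch of that general idea, not the code in this PR. The function names are hypothetical, and the all-reduce is simulated with a plain sum over a list of shards.

```python
import math

def local_sq_norm(shard):
    """Sum of squares of one rank's gradient shard."""
    return sum(g * g for g in shard)

def clip_coefficient(shards, max_norm, eps=1e-6):
    """Return (global_norm, scale) for gradients split across shards."""
    # Assumption: this sum stands in for an all-reduce (sum) across ranks.
    total_sq = sum(local_sq_norm(s) for s in shards)
    global_norm = math.sqrt(total_sq)
    # Cap the scale at 1 so gradients below the threshold are unchanged.
    coef = min(1.0, max_norm / (global_norm + eps))
    return global_norm, coef

def clip_shards(shards, max_norm):
    """Scale every shard by the same coefficient derived from the global norm."""
    global_norm, coef = clip_coefficient(shards, max_norm)
    clipped = [[g * coef for g in s] for s in shards]
    return global_norm, clipped

# Two simulated ranks, each holding part of the gradient.
shards = [[3.0, 0.0], [0.0, 4.0]]
norm, clipped = clip_shards(shards, max_norm=1.0)
```

Every shard is scaled by the same coefficient, so the direction of the full gradient is preserved. Clipping each shard against its own local norm would not give that guarantee.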

@timmoon10 @crcrpar Please help review, thanks.

alpha0422 and others added 12 commits August 22, 2024 15:55
Signed-off-by: Wil Kong <alpha0422@gmail.com>
… copy after all-gather.

Signed-off-by: Wil Kong <alpha0422@gmail.com>
Call unscale_grads within step if grad scaler is provided. Revert grad clipping logic.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Contributor

@timmoon10 timmoon10 left a comment

Overall, this looks good to me, aside from some stylistic suggestions.

alpha0422 and others added 4 commits August 28, 2024 22:39
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Collaborator

@crcrpar crcrpar left a comment

Could you add a test case for capturable?

Collaborator

@crcrpar crcrpar left a comment

Agreed that @alpha0422 will add a test in a follow-up.

@crcrpar crcrpar merged commit 7d5ecf1 into NVIDIA:master Aug 30, 2024