Move logic for distopt FP32 grads to models #8867

timmoon10 · 2024-04-09T23:14:16Z

What does this PR do?

#8792 introduced runtime failures in T5 because it added GPT-specific logic to MegatronBaseModel. This PR moves the distopt FP32 grad logic out of the base class and into the specific models.
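To illustrate the idea, here is a minimal sketch of the pattern, not the actual NeMo code. The class names, the `fp32_grad_param_names` hook, and the parameter names are all hypothetical stand-ins. It assumes the base class only asks a hook which parameters need FP32 grad reduction, and that each model decides for itself.

```python
class BaseModel:
    """Hypothetical stand-in for a shared base class with model-agnostic logic only."""

    def __init__(self, params):
        # params: mapping from parameter name to a placeholder value
        self.params = dict(params)

    def fp32_grad_param_names(self):
        # Hook: the base class assumes nothing about model structure,
        # so by default no parameter is singled out.
        return []

    def configure_optimizer(self):
        # Generic behavior only; model-specific choices come from the hook.
        fp32 = set(self.fp32_grad_param_names())
        return {name: ("fp32" if name in fp32 else "default") for name in self.params}


class GPTLikeModel(BaseModel):
    def fp32_grad_param_names(self):
        # GPT-specific choice (assumed names). A missing position embedding
        # is tolerated by checking membership before selecting the param.
        candidates = ["word_embeddings", "position_embeddings"]
        return [name for name in candidates if name in self.params]


class T5LikeModel(BaseModel):
    # Inherits the default hook, so no GPT-specific attribute is ever touched.
    pass


gpt = GPTLikeModel({"word_embeddings": 0, "linear": 0})
t5 = T5LikeModel({"encoder_embedding": 0})
print(gpt.configure_optimizer())
print(t5.configure_optimizer())
```

With this layout, adding a model that lacks GPT-style embeddings cannot trip over logic in the base class, since the base class never names a model-specific parameter.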

Collection: NLP

Changelog

Move logic for distopt FP32 grads to models

Usage

Run T5, e.g. with the config at https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/conf/megatron_t5_config.yaml.

Enable the distributed optimizer with model.optim.name=distributed_fused_adam.

Jenkins CI

To run Jenkins, a NeMo user with write access must comment "jenkins" on the PR.

Before your PR is "Ready for review"

Pre checks:

- Make sure you read and followed Contributor guidelines
- Did you write any new necessary tests?
- Did you add or update any necessary documentation?
- Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
  - Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs in various areas.

Additional Information

Fixes bug from "Distributed optimizer reduces GPT embedding grads in FP32" (#8792)
Closes "Check if model has position embed before accessing param" (#8857)
Closes "Fix import of get_gpt_layer_ammo_spec" (#8810)

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 · 2024-04-09T23:14:36Z

jenkins

timmoon10 · 2024-04-09T23:17:43Z

jenkins

ericharper

LGTM. Thanks!

ericharper · 2024-04-09T23:22:14Z

jenkins

timmoon10 · 2024-04-10T01:45:12Z

jenkins

ericharper · 2024-04-10T14:46:22Z

jenkins

timmoon10 · 2024-04-10T16:09:20Z

jenkins

ericharper

LGTM. Thanks!

* Move logic for FP32 embedding grads to models
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>

* Move logic for FP32 embedding grads to models
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: jxin <jxin@nvidia.com>

* Move logic for FP32 embedding grads to models
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
for more information, see https://pre-commit.ci
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: Ao Tang <aot@nvidia.com>

Move logic for FP32 embedding grads to models

2557118

Signed-off-by: Tim Moon <tmoon@nvidia.com>

timmoon10 added the bug label Apr 9, 2024

Merge branch 'main' into debug-fp32-embedding-grads

fad089b

github-actions bot added the NLP label Apr 9, 2024

[pre-commit.ci] auto fixes from pre-commit.com hooks

a78050a

for more information, see https://pre-commit.ci

ericharper self-requested a review April 9, 2024 23:17

ericharper previously approved these changes Apr 9, 2024

Merge branch 'main' into debug-fp32-embedding-grads

3ba95f9

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

timmoon10 dismissed ericharper’s stale review via 3ba95f9 April 10, 2024 01:44

ericharper and others added 2 commits April 10, 2024 08:51

Merge branch 'main' into debug-fp32-embedding-grads

1c85fd3

Merge branch 'main' into debug-fp32-embedding-grads

1faa3a4

Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

ericharper approved these changes Apr 10, 2024

ericharper merged commit f7941cb into NVIDIA:main Apr 10, 2024
10 checks passed

timmoon10 mentioned this pull request Apr 16, 2024

Add config option for FP32 embedding grads #8946

Merged

8 tasks

github-actions bot mentioned this pull request Apr 17, 2024

Add config option for FP32 embedding grads #8953

Merged

8 tasks
