Support distributed Adam with T5 and support overlapped grad reductions with pipeline parallelism #4900

Conversation

timmoon10
Collaborator

@timmoon10 timmoon10 commented Sep 7, 2022

What does this PR do?

This PR generalizes distributed Adam support from GPT-3 to T5 and other Megatron-LM models. It also implements several performance optimizations.

Collection: NLP

Changelog

  • When params are BF16, distributed Adam stores 16-bit param remainders instead of FP32 main params (see the sketch after this list)
  • Decouple distributed Adam support from Megatron O2-level optimizations
  • Add support for the Apex distributed Adam optimizer in other Megatron-LM models, namely T5
  • Add support for overlapped grad reductions with pipeline or sequence parallelism
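
Since BF16 shares FP32's sign and exponent layout, each FP32 main param can be split into its upper 16 bits (a valid BF16 value) and its lower 16 bits (the remainder), and the two halves can be recombined to recover the FP32 value exactly. The NumPy sketch below only illustrates this idea; the actual Apex distributed Adam implementation may differ in storage layout and rounding.

```python
import numpy as np

def split_fp32(params_fp32):
    """Split FP32 params into BF16 bits (upper 16 bits) and a 16-bit remainder (lower 16 bits)."""
    bits = np.ascontiguousarray(params_fp32, dtype=np.float32).view(np.uint32)
    bf16_bits = (bits >> 16).astype(np.uint16)     # bit pattern of the BF16 model params
    remainder = (bits & 0xFFFF).astype(np.uint16)  # extra 16 bits kept by the optimizer
    return bf16_bits, remainder

def merge_fp32(bf16_bits, remainder):
    """Reconstruct the exact FP32 params from the BF16 bits and the remainder."""
    bits = (bf16_bits.astype(np.uint32) << 16) | remainder.astype(np.uint32)
    return bits.view(np.float32)

p = np.random.randn(4).astype(np.float32)
assert np.array_equal(merge_fp32(*split_fp32(p)), p)
```

With this layout the optimizer keeps 16 bits per param instead of a full 32-bit FP32 copy of the weights.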

Usage

Set the optimizer to distributed_fused_adam in the model config file:
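
A minimal sketch, assuming the standard NeMo Megatron YAML layout; the name field is what selects the distributed optimizer, and the remaining fields are the usual (illustrative) optimizer settings:

```yaml
model:
  optim:
    name: distributed_fused_adam  # selects the Apex distributed Adam optimizer
    lr: 2e-4                      # illustrative values; keep your existing settings
    weight_decay: 0.01
    betas: [0.9, 0.98]
```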

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

timmoon10 and others added 9 commits August 23, 2022 11:53
If params are bf16, dist Adam will only store 16-bit remainder needed to reconstruct fp32 params.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Requires dist Adam optimizer.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@lgtm-com

lgtm-com bot commented Sep 7, 2022

This pull request introduces 1 alert and fixes 1 when merging 065a89b into abbe643 - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Sep 8, 2022

This pull request introduces 1 alert and fixes 1 when merging 7943ebc into b9cf05c - view on LGTM.com

new alerts:

  • 1 for Unused import

fixed alerts:

  • 1 for Unused import

Requires dist Adam optimizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@lgtm-com

lgtm-com bot commented Sep 9, 2022

This pull request fixes 1 alert when merging e06d34a into b9cf05c - view on LGTM.com

fixed alerts:

  • 1 for Unused import

@timmoon10
Collaborator Author

timmoon10 commented Sep 10, 2022

Running T5 41B on 32 Selene nodes, I see a 1.2x speedup over the pure data-parallel implementation, 66% of the expected memory savings, and nearly identical loss values after 20 steps.

Full results for T5 41B and GPT-3 175B, with the run configurations detailed inside. Note that I ran with a relatively small global batch size, which makes communication a more significant portion of the runtime.

@timmoon10 timmoon10 changed the title from "Enable overlapped grad reductions with pipeline or sequence parallelism" to "Support distributed Adam with T5 and support overlapped grad reductions with pipeline parallelism" Sep 15, 2022
…tion

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 marked this pull request as ready for review September 15, 2022 20:52
@lgtm-com

lgtm-com bot commented Sep 15, 2022

This pull request fixes 1 alert when merging d528a89 into f1825bc - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@lgtm-com

lgtm-com bot commented Sep 20, 2022

This pull request fixes 1 alert when merging 811b59c into f1825bc - view on LGTM.com

fixed alerts:

  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Sep 27, 2022

This pull request fixes 1 alert when merging ebd98c4 into 971485c - view on LGTM.com

fixed alerts:

  • 1 for Unused import

@lgtm-com

lgtm-com bot commented Sep 27, 2022

This pull request fixes 1 alert when merging b2a61ad into 73fcfd7 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

Changes were made to the GPT model to support interleaved pipeline parallelism, which distributed Adam does not currently support.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
@timmoon10 timmoon10 force-pushed the dist-adam-pipeline-parallel-async-grad-reduction branch from 2638974 to 4ef0255 Compare October 19, 2022 23:09
@lgtm-com

lgtm-com bot commented Oct 19, 2022

This pull request fixes 1 alert when merging a088304 into 8656574 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

@timmoon10 timmoon10 marked this pull request as draft October 20, 2022 02:45
@timmoon10 timmoon10 marked this pull request as ready for review October 20, 2022 04:41
@timmoon10
Collaborator Author

timmoon10 commented Oct 20, 2022

Running on a DGX A100 node for 50 steps with 2-way data, tensor, and pipeline parallelism, I see nearly identical learning behavior with and without the distributed optimizer:

| Model | ZeRO | O2 | Data type | Throughput | Train loss | Val loss |
|---|---|---|---|---|---|---|
| GPT-2 124M | Yes | No | FP16 | 2.69it/s | 8.4 | 7.870 |
| GPT-2 124M | No | No | FP16 | 3.31it/s | 8.4 | 7.870 |
| GPT-2 124M | Yes | No | BF16 | 3.19it/s | 8.28 | 7.790 |
| GPT-2 124M | No | No | BF16 | 3.39it/s | 8.28 | 7.790 |
| GPT-2 124M | Yes | Yes | BF16 | 3.44it/s | 8.3 | 7.800 |
| GPT-2 124M | No | Yes | BF16 | 3.60it/s | 8.3 | 7.800 |
| T5 220M | Yes | No | FP32 | 1.46it/s | 7.64 | 7.530 |
| T5 220M | No | No | FP32 | 1.69it/s | 7.64 | 7.530 |
| T5 220M | Yes | No | FP16 | 1.43it/s | 8.45 | 8.290 |
| T5 220M | No | No | FP16 | 1.43it/s | 8.45 | 8.290 |
| T5 220M | Yes | No | BF16 | 1.50it/s | 7.66 | 7.560 |
| T5 220M | No | No | BF16 | 1.45it/s | 7.65 | 7.550 |
| T5 220M | Yes | Yes | BF16 | 1.58it/s | 7.65 | 7.540 |
| T5 220M | No | Yes | BF16 | 1.61it/s | 7.65 | 7.540 |

I get runtime failures when I run GPT-2 with FP32 and with pipeline parallelism enabled. This error shows up in the main branch as well.

@lgtm-com

lgtm-com bot commented Oct 20, 2022

This pull request fixes 1 alert when merging aed0e00 into 85fc659 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

ericharper previously approved these changes Oct 20, 2022
Collaborator

@ericharper ericharper left a comment

LGTM. Thanks!

…enabled

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Collaborator

@ericharper ericharper left a comment

Re-approving

@lgtm-com

lgtm-com bot commented Oct 20, 2022

This pull request fixes 1 alert when merging 190f992 into 0336000 - view on LGTM.com

fixed alerts:

  • 1 for Unused import

@timmoon10
Collaborator Author

With NVIDIA/apex#1514 the distributed optimizer supports interleaved pipeline parallelism. Running GPT-2 124M for 20 steps, I get the same loss values with and without the distributed optimizer.

@ericharper ericharper merged commit 456410d into NVIDIA:main Oct 21, 2022
1-800-BAD-CODE pushed a commit to 1-800-BAD-CODE/NeMo that referenced this pull request Nov 13, 2022
…ns with pipeline parallelism (NVIDIA#4900)

* Avoid storing extra copy of params in dist Adam optimizer

If params are bf16, dist Adam will only store 16-bit remainder needed to reconstruct fp32 params.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add support for dist Adam in GPT-3 without O2-level AMP

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add support for dist Adam in Megatron-LM models

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug dist Adam support without Megatron AMP O2

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add support for overlapped grad sync with pipeline parallelism in GPT-3

Requires dist Adam optimizer.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Debug dist Adam support for T5

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Add support for overlapped grad sync with pipeline parallelism in T5

Requires dist Adam optimizer

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update Apex commits in Dockerfile and Jenkinsfile

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Support distributed Adam in Megatron grad scaler class.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update dist Adam to accommodate changes in GPT model

Changes were made to the GPT model to support interleaved pipeline parallelism, which distributed Adam does not currently support.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Minor tweaks to dist Adam integration

Review suggestions from @ericharper and @crcrpar.

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Remove error when dist Adam and interleaved pipeline parallelism are enabled

Signed-off-by: Tim Moon <tmoon@nvidia.com>

Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Eric Harper <complex451@gmail.com>
Signed-off-by: 1-800-bad-code <shane.carroll@utsa.edu>
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 29, 2022
…ns with pipeline parallelism (NVIDIA#4900)
hainan-xv pushed a commit to hainan-xv/NeMo that referenced this pull request Nov 29, 2022
…ns with pipeline parallelism (NVIDIA#4900)