Support overlapped grad sync with Megatron pipeline parallelism #1475
Conversation
Each grad bucket independently keeps track of grads that have been generated. Add helper function to create callback functions. Change default param arg in grad norm functions to None. Perform communication for checkpointing in main stream to avoid memory pool overheads.
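The per-bucket tracking described above can be sketched in plain Python. This is an illustrative mock, not apex's actual implementation: the names `GradBucket` and `make_grad_hook` are hypothetical, and `_launch_reduction` stands in for the async reduce-scatter the real optimizer would enqueue.

```python
class GradBucket:
    """Hypothetical sketch: a bucket that tracks which of its grads have
    been generated and fires its reduction independently once full."""

    def __init__(self, param_names):
        self.param_names = set(param_names)
        self.generated = set()
        self.reduced = False

    def mark_generated(self, name):
        # Record that this param's grad was produced during backward.
        self.generated.add(name)
        if self.generated == self.param_names and not self.reduced:
            self.reduced = True
            self._launch_reduction()

    def _launch_reduction(self):
        # Placeholder: the real optimizer would launch an async
        # reduce-scatter on a side stream here.
        pass


def make_grad_hook(bucket, name):
    """Helper that creates a callback bound to one parameter's bucket,
    mirroring the 'helper function to create callback functions' above."""
    def hook(*_):
        bucket.mark_generated(name)
    return hook
```

Each parameter gets its own hook, so buckets fill and reduce independently of one another rather than waiting on a single global counter.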
Enables async grad reduction in first pipeline stage during last backward pass, and disables async grad reduction in all other pipeline stages.
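The stage-dependent toggle could look roughly like the context manager below. This is a sketch under assumptions: the flag name `overlap_grad_sync` and the function name `grad_sync_context` are illustrative, not apex's API.

```python
from contextlib import contextmanager

@contextmanager
def grad_sync_context(optimizer, is_first_stage, is_last_backward):
    """Hypothetical sketch: enable async grad reduction only in the first
    pipeline stage during the last backward pass; disable it everywhere
    else, restoring the previous setting on exit."""
    prev = optimizer.overlap_grad_sync
    optimizer.overlap_grad_sync = is_first_stage and is_last_backward
    try:
        yield
    finally:
        optimizer.overlap_grad_sync = prev
```

Non-first stages then perform their reductions explicitly after backward, while the first stage overlaps them with compute.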
knowing this would be too early...
apex/transformer/pipeline_parallel/schedules/fwd_bwd_pipelining_without_interleaving.py
Add unit test for pipeline parallelism with custom sync context. Style tweaks.
SGTM
Review suggestion from @crcrpar
thank you :)
…atron pipeline parallelism (NVIDIA#1475)

* Refactor how dist Adam handles overlapped grad sync

  Each grad bucket independently keeps track of grads that have been generated. Add helper function to create callback functions. Change default param arg in grad norm functions to None. Perform communication for checkpointing in main stream to avoid memory pool overheads.

* Support Megatron pipeline parallelism with async grad reduction

  Enables async grad reduction in first pipeline stage during last backward pass, and disables async grad reduction in all other pipeline stages.

* Review suggestions from crcrpar

  Add unit test for pipeline parallelism with custom sync context. Style tweaks.

* Use unittest assert functions in pipeline parallelism test

  Review suggestion from crcrpar
This PR adds functionality so that the distributed Adam optimizer can overlap grad reduce-scatters with backward compute in the first pipeline stage. Async grad reduction is disabled in the other pipeline stages since it slows down the backward pass; their reductions should instead be performed externally while waiting for the first stage to finish. Note that this is not compatible with `DistributedDataParallel`, since I am not aware of an option to manually trigger a reduction after using `no_sync`. I've also done some refactoring of distributed Adam to support NeMo-Megatron integration, mostly so that I can selectively disable async grad reductions for model-parallel operations. The default param arg in the grad norm functions has changed from `[]` to `None`. This allows us to pass in empty iterators
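The `no_sync` limitation mentioned above can be made concrete with a small sketch. This is a hypothetical mock, not apex's or PyTorch's actual class: unlike DDP's `no_sync`, it records deferred reductions and exposes a manual trigger. The names `OverlapGradSync`, `on_grad_ready`, and `trigger_grad_sync` are illustrative.

```python
from contextlib import contextmanager

class OverlapGradSync:
    """Hypothetical sketch of a grad-sync switch that, unlike DDP's
    no_sync, lets you trigger the deferred reductions manually later."""

    def __init__(self):
        self._async_enabled = True
        self._pending = []

    @contextmanager
    def no_sync(self):
        # Temporarily defer reductions instead of launching them.
        self._async_enabled = False
        try:
            yield
        finally:
            self._async_enabled = True

    def on_grad_ready(self, name):
        if self._async_enabled:
            return f"reduced:{name}"   # reduce immediately (overlapped)
        self._pending.append(name)     # defer until manually triggered
        return None

    def trigger_grad_sync(self):
        """Manually reduce everything deferred inside no_sync()."""
        done, self._pending = [f"reduced:{n}" for n in self._pending], []
        return done
```

With DDP, grads skipped under `no_sync` are only reduced by the *next* backward pass outside the context; a manual `trigger_grad_sync`-style hook is what the pipeline schedule needs to flush non-first stages on its own timetable.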