add async grad allreduce and chunk optimization #4084

Merged
merged 28 commits into NVIDIA:main on

Conversation

xrennvidia (Collaborator) commented Apr 28, 2022

What does this PR do ?

Add async grad allreduce; works for T5 and GPT-3.
Increase allreduce granularity by grouping params into multiple chunks; every time a chunk's gradients are ready, we start an allreduce for that chunk (sending a large chunk of data better utilizes bandwidth).
The current implementation works with BF16 O2 and PP=1.
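
For illustration, here is a minimal sketch of the chunked async allreduce idea in plain PyTorch. The helper names, flattening strategy, and end-of-step wait are assumptions for this sketch, not the NeMo/Apex implementation:

```python
# Minimal sketch, not the NeMo implementation: group gradients into
# ~chunk_size_mb buckets and launch one async allreduce per full bucket.
import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors


def chunked_async_allreduce(grads, chunk_size_mb=125):
    """Launch one async allreduce per roughly chunk_size_mb of gradient data.

    Assumes all grads share one dtype (e.g. fp32 main grads under BF16 O2).
    """
    budget = chunk_size_mb * 1024 * 1024          # bytes per chunk
    handles, bucket, bucket_bytes = [], [], 0
    for g in grads:
        bucket.append(g)
        bucket_bytes += g.numel() * g.element_size()
        if bucket_bytes >= budget:                # chunk is "finished": reduce it now
            flat = _flatten_dense_tensors(bucket)
            handles.append((dist.all_reduce(flat, async_op=True), flat, bucket))
            bucket, bucket_bytes = [], 0
    if bucket:                                    # flush the last partial chunk
        flat = _flatten_dense_tensors(bucket)
        handles.append((dist.all_reduce(flat, async_op=True), flat, bucket))
    return handles


def wait_all(handles):
    """At the end of the training step, wait for all outstanding allreduces."""
    for work, flat, bucket in handles:
        work.wait()
        for g, reduced in zip(bucket, _unflatten_dense_tensors(flat, bucket)):
            g.copy_(reduced)
```

Note that, per the commit history below, the PR eventually replaced per-work wait() calls with a single device synchronization at the end of each training step.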

Collection: [Note which collection this PR will affect]

Changelog

  • Update train_step of GPT-3 and T5 for async grad allreduce
  • Add chunk optimization in MainParamsOptimizerWrapper

Usage

Add a config knob "grad_allreduce_chunk_size_mb"; the default is 125 MB. You can tune it for different models and/or model sizes. A rough sizing example is sketched below.
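
As a rough sizing aid (illustrative arithmetic only; the fp32-main-grad assumption reflects the BF16 O2 setup mentioned above):

```python
# Illustrative arithmetic, not NeMo code: how many gradient elements
# fit into one grad_allreduce_chunk_size_mb chunk.
chunk_size_mb = 125                  # default value of grad_allreduce_chunk_size_mb
bytes_per_elem = 4                   # assuming fp32 main grads; use 2 for bf16 grads
elems_per_chunk = chunk_size_mb * 1024 * 1024 // bytes_per_elem
print(f"{elems_per_chunk:,} elements per ~{chunk_size_mb} MB chunk")
# -> 32,768,000 elements per ~125 MB chunk
```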

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

ericharper and others added 16 commits March 18, 2022 16:20
@ericharper ericharper changed the base branch from gpt_sync_handler to main April 29, 2022 16:48
@ericharper ericharper marked this pull request as draft April 29, 2022 17:09
@xrennvidia xrennvidia changed the title from "add async grad allreduce and chunk optimization to T5" to "add async grad allreduce and chunk optimization" May 17, 2022
@xrennvidia xrennvidia marked this pull request as ready for review May 17, 2022 21:05
@ericharper (Collaborator) left a comment

LGTM. Thanks!

@ericharper ericharper merged commit de0b445 into NVIDIA:main May 17, 2022
yaoyu-33 pushed a commit that referenced this pull request May 17, 2022
* O2 runs but O1 does not

Signed-off-by: ericharper <complex451@gmail.com>

* disable async for O1

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* update async flag in configure_optimizers

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* revert

Signed-off-by: ericharper <complex451@gmail.com>

* update _require if using async

Signed-off-by: ericharper <complex451@gmail.com>

* clean comments

Signed-off-by: ericharper <complex451@gmail.com>

* always all_reduce

Signed-off-by: ericharper <complex451@gmail.com>

* add async grad allreduce and chunk optimization to T5

* push reformatted files after style check

* set chunk size as 0 while async grad allreduce is off

* more experiments show that 125MB is a better default chunk size for most cases

* add grad_allreduce_chunk_size_mb for GPT-3

* at the end of each training step, wait until all async grad allreduce works are done

* replace individual allreduce work.wait() with a single GPU device synchronization

* recording the status of each allreduce work seems too costly for perf

* add more comments

* push a reformatted file

Co-authored-by: ericharper <complex451@gmail.com>