add async grad allreduce and chunk optimization #4084

Merged
merged 28 commits into NVIDIA:main on

Conversation

xrennvidia (Collaborator) commented Apr 28, 2022

What does this PR do ?

Add async grad allreduce; works for T5 and GPT-3.
Increase allreduce granularity by grouping params into multiple chunks; every time a chunk's gradients are ready, we start an allreduce for that chunk (sending a large chunk of data better utilizes bandwidth).
The current implementation works with BF16 O2 and PP=1.
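
For illustration, here is a minimal sketch of the chunked async allreduce idea in plain PyTorch. The helper names, flattening strategy, and end-of-step wait are assumptions for this sketch, not the NeMo/Apex implementation:

```python
# Minimal sketch, not the NeMo implementation: group gradients into
# ~chunk_size_mb buckets and launch one async allreduce per full bucket.
import torch
import torch.distributed as dist
from torch._utils import _flatten_dense_tensors, _unflatten_dense_tensors


def chunked_async_allreduce(grads, chunk_size_mb=125):
    """Launch one async allreduce per roughly chunk_size_mb of gradient data.

    Assumes all grads share one dtype (e.g. fp32 main grads under BF16 O2).
    """
    budget = chunk_size_mb * 1024 * 1024          # bytes per chunk
    handles, bucket, bucket_bytes = [], [], 0
    for g in grads:
        bucket.append(g)
        bucket_bytes += g.numel() * g.element_size()
        if bucket_bytes >= budget:                # chunk is "finished": reduce it now
            flat = _flatten_dense_tensors(bucket)
            handles.append((dist.all_reduce(flat, async_op=True), flat, bucket))
            bucket, bucket_bytes = [], 0
    if bucket:                                    # flush the last partial chunk
        flat = _flatten_dense_tensors(bucket)
        handles.append((dist.all_reduce(flat, async_op=True), flat, bucket))
    return handles


def wait_all(handles):
    """At the end of the training step, wait for all outstanding allreduces."""
    for work, flat, bucket in handles:
        work.wait()
        for g, reduced in zip(bucket, _unflatten_dense_tensors(flat, bucket)):
            g.copy_(reduced)
```

Note that, per the commit history below, the PR eventually replaced per-work wait() calls with a single device synchronization at the end of each training step.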

Collection: [Note which collection this PR will affect]

Changelog

  • Update train_step of GPT-3 and T5 for async grad allreduce
  • Add chunk optimization in MainParamsOptimizerWrapper

Usage

Add a config knob "grad_allreduce_chunk_size_mb"; the default is 125 MB. You can tune it for different models and/or model sizes. A rough sizing example is sketched below.
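
As a rough sizing aid (illustrative arithmetic only; the fp32-main-grad assumption reflects the BF16 O2 setup mentioned above):

```python
# Illustrative arithmetic, not NeMo code: how many gradient elements
# fit into one grad_allreduce_chunk_size_mb chunk.
chunk_size_mb = 125                  # default value of grad_allreduce_chunk_size_mb
bytes_per_elem = 4                   # assuming fp32 main grads; use 2 for bf16 grads
elems_per_chunk = chunk_size_mb * 1024 * 1024 // bytes_per_elem
print(f"{elems_per_chunk:,} elements per ~{chunk_size_mb} MB chunk")
# -> 32,768,000 elements per ~125 MB chunk
```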

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items, you can still open a "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
The Contributor guidelines list specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

ericharper and others added 16 commits March 18, 2022 16:20
@ericharper ericharper changed the base branch from gpt_sync_handler to main April 29, 2022 16:48
@ericharper ericharper marked this pull request as draft April 29, 2022 17:09
@xrennvidia xrennvidia changed the title from "add async grad allreduce and chunk optimization to T5" to "add async grad allreduce and chunk optimization" May 17, 2022
@xrennvidia xrennvidia marked this pull request as ready for review May 17, 2022 21:05
@ericharper (Collaborator) left a comment

LGTM. Thanks!

@ericharper ericharper merged commit de0b445 into NVIDIA:main May 17, 2022
yaoyu-33 pushed a commit that referenced this pull request May 17, 2022
* O2 runs but O1 does not

Signed-off-by: ericharper <complex451@gmail.com>

* disable async for O1

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* update async flag in configure_optimizers

Signed-off-by: ericharper <complex451@gmail.com>

* typo

Signed-off-by: ericharper <complex451@gmail.com>

* revert

Signed-off-by: ericharper <complex451@gmail.com>

* update _require if using async

Signed-off-by: ericharper <complex451@gmail.com>

* clean comments

Signed-off-by: ericharper <complex451@gmail.com>

* always all_reduce

Signed-off-by: ericharper <complex451@gmail.com>

* add async grad allreduce and chunk optimization to T5

* push reformatted files after style check

* set chunk size as 0 while async grad allreduce is off

* more experiments show that 125MB is a better default chunk size for most cases

* add grad_allreduce_chunk_size_mb for GPT-3

* at the end of each training step, wait until all async grad allreduce works are done

* replace individual allreduce work.wait() with a single GPU device synchronization

* recording the status of each allreduce work seems too costly for perf

* add more comments

* push a reformatted file

Co-authored-by: ericharper <complex451@gmail.com>