
Commit

Release fixes
shoeybi authored and deepakn94 committed Apr 8, 2021
1 parent 23632ee commit 50a4b5f
Showing 9 changed files with 12 additions and 58 deletions.
7 changes: 3 additions & 4 deletions README.md
@@ -10,13 +10,12 @@ Below are some of the projects where we have directly used Megatron:
* [RACE Reading Comprehension Dataset Leaderboard](http://www.qizhexie.com/data/RACE_leaderboard.html)
* [Training Question Answering Models From Synthetic Data](https://www.aclweb.org/anthology/2020.emnlp-main.468.pdf)

-Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs per second (both per GPU and aggregate over all GPUs). Note that the FLOPs are measured for end-to-end training, i.e., they include all operations, including data loading, optimization, and even logging.
+Our codebase is capable of efficiently training very large (hundreds of billions of parameters) language models with both model and data parallelism. To demonstrate how the code scales with multiple GPUs and model sizes, we consider GPT models from 1 billion all the way to 1 trillion parameters. All models use a vocabulary size of 51,200 and a sequence length of 2048. We vary hidden size, number of attention heads, and number of layers to arrive at a specific model size. As the model size increases, we also modestly increase the batch size. We leverage [NVIDIA's Selene supercomputer](https://www.top500.org/system/179842/) to perform scaling studies and use up to 3072 [A100](https://www.nvidia.com/en-us/data-center/a100/) GPUs for the largest model. The table below shows the model configurations along with the achieved FLOPs (both per GPU and aggregate over all GPUs). Note that the FLOPs are measured for end-to-end training, i.e., they include all operations, including data loading, optimization, and even logging.

-![Cases](images/cases_jan2021.png)
+![Cases](images/cases_april2021.png)

-The following figures show achieved percentage of theoretical peak FLOPs and achieved aggregate petaFLOPs per second as a function of number of GPUs. All the cases from 1 billion to 1 trillion achieve more than 41% half precision utilization, which is high for an end-to-end application. We observe that initially as the model parallel size increases, utilization slightly decreases; as hidden size increases for larger models, utilization starts increasing and reaches 49% for the largest model. We also note that achieved aggregate petaFLOPs per second across all GPUs increases almost linearly with number of GPUs, demonstrating good weak scaling.
+All the cases from 1 billion to 1 trillion parameters achieve more than 43% half precision utilization, which is high for an end-to-end application. We observe that initially the utilization remains constant but as hidden size increases for larger models, utilization starts increasing and reaches 52% for the largest model. We also note that achieved aggregate petaFLOPs across all GPUs increases almost linearly with number of GPUs, demonstrating good weak scaling.

![Model Parallel Scaling](images/scaling.png)

# Contents
* [Contents](#contents)
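As a sanity check on the throughput numbers above, per-iteration training FLOPs for a GPT model can be estimated from the configuration alone. Below is a minimal sketch using a commonly cited approximation for transformer training FLOPs with activation recomputation; the example configuration is a hypothetical GPT-3-like model, not a row from the table:

```python
def gpt_training_flops_per_iteration(batch_size, seq_len, num_layers,
                                     hidden_size, vocab_size):
    # Common approximation for end-to-end training FLOPs of one
    # iteration, assuming activation recomputation:
    #   96 * B * s * l * h^2 * (1 + s/(6h) + V/(16*l*h))
    B, s, l, h, V = batch_size, seq_len, num_layers, hidden_size, vocab_size
    return 96 * B * s * l * h ** 2 * (1 + s / (6 * h) + V / (16 * l * h))

# Hypothetical GPT-3-like configuration (illustrative only).
flops = gpt_training_flops_per_iteration(
    batch_size=1536, seq_len=2048, num_layers=96,
    hidden_size=12288, vocab_size=51200)
print('%.0f petaFLOPs per iteration' % (flops / 1e15))
```

Dividing such an estimate by the measured iteration time gives an achieved-FLOPs-per-second figure of the kind reported in the table.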
12 changes: 0 additions & 12 deletions images/Makefile

This file was deleted.

Binary file added images/cases_april2021.png
Binary file removed images/cases_jan2021.png
Binary file not shown.
Binary file removed images/scaling.png
Binary file not shown.
40 changes: 0 additions & 40 deletions images/tables.tex

This file was deleted.

7 changes: 7 additions & 0 deletions megatron/arguments.py
@@ -136,6 +136,13 @@ def parse_args(extra_args_provider=None, defaults={},
     if args.bf16:
         assert not args.fp16
         args.params_dtype = torch.bfloat16
+        # bfloat16 requires gradient accumulation and all-reduce to
+        # be done in fp32.
+        if not args.accumulate_allreduce_grads_in_fp32:
+            args.accumulate_allreduce_grads_in_fp32 = True
+            if args.rank == 0:
+                print('accumulate and all-reduce gradients in fp32 for '
+                      'bfloat16 data type.', flush=True)

     if args.rank == 0:
         print('using {} for parameters ...'.format(args.params_dtype),
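The added check above is worth a closer look: bfloat16 keeps fp32's exponent range but has only about 8 bits of effective mantissa, so when many small micro-batch gradients are summed into a growing bf16 accumulator, the late contributions round away to nothing. A minimal standalone illustration (not from the repository):

```python
import torch

# Accumulate 10,000 contributions of 1e-4 each; the true sum is 1.0.
acc_bf16 = torch.zeros(1, dtype=torch.bfloat16)
acc_fp32 = torch.zeros(1, dtype=torch.float32)
step_bf16 = torch.tensor([1e-4], dtype=torch.bfloat16)
step_fp32 = torch.tensor([1e-4], dtype=torch.float32)

for _ in range(10_000):
    acc_bf16 += step_bf16
    acc_fp32 += step_fp32

# The bf16 sum stalls once the accumulator outgrows the step size by
# more than the mantissa can resolve; the fp32 sum stays close to 1.0.
print('bf16:', acc_bf16.item())  # far below 1.0
print('fp32:', acc_fp32.item())  # ~1.0
```

This is why the flag flip is forced rather than merely recommended when bf16 is selected.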
2 changes: 1 addition & 1 deletion tasks/finetune_utils.py
@@ -170,7 +170,7 @@ def _train(model, optimizer, lr_scheduler, forward_step,
     report_memory_flag = True
 
     # For each remaining epoch
-    timers('interval time').start()
+    timers('interval-time').start()
     for epoch in range(start_epoch, args.epochs):
         print_rank_0('working on epoch {} ...'.format(epoch + 1))

2 changes: 1 addition & 1 deletion tasks/vision/finetune_utils.py
@@ -149,7 +149,7 @@ def _train(
     report_memory_flag = True
 
     # For each remaining epoch
-    timers("interval time").start()
+    timers("interval-time").start()
     for epoch in range(start_epoch, args.epochs):
         print_rank_0("working on epoch {} ...".format(epoch + 1))

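Both hunks above make the same one-character change: the timer key loses its embedded space. Registries of this kind look clocks up by exact string, so every start/stop site has to agree on the spelling, and a single-token name is also friendlier to log scraping; presumably the hyphenated form matches the key used elsewhere in the codebase. A toy registry (an illustration, not Megatron's actual `timers` implementation) shows the failure mode a mismatched key would cause:

```python
import time

class ToyTimers:
    """Name-keyed timer registry (illustrative stand-in only)."""
    def __init__(self):
        self._started = {}
        self.elapsed = {}

    def start(self, name):
        self._started[name] = time.perf_counter()

    def stop(self, name):
        # A stop under a differently spelled key raises KeyError,
        # which is why a rename must touch every call site at once.
        begun = self._started.pop(name)
        self.elapsed[name] = self.elapsed.get(name, 0.0) + \
            time.perf_counter() - begun

timers = ToyTimers()
timers.start('interval-time')
time.sleep(0.01)
timers.stop('interval-time')     # OK: the key matches everywhere.
# timers.stop('interval time')   # KeyError: stale, space-separated key.
print(timers.elapsed)
```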
