
ASC22-Yuan

Introduction

Training a large-scale language model like Yuan is difficult: it requires not only massive computing resources but also complex training methods to handle a huge number of parameters efficiently. We completed the Yuan Large Language Model Challenge in 67.75 h using four 32 GB Tesla V100 DGXS GPUs, the ZeRO parallel strategy, and various acceleration training methods. Yuan, the largest singleton language model at the time (February 2022) with 246B parameters, achieved excellent performance on thousands of GPUs and state-of-the-art results on a range of natural language processing tasks.

Environment

Hardware Environment

Software Environment

The Yuan Large Language Model Challenge must be completed with PyTorch. The software mainly used in this challenge is listed below:
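As a quick sanity check of the software stack, a minimal sketch like the following can confirm the PyTorch, DeepSpeed, and CUDA setup before launching training. It assumes torch and deepspeed are already installed; it is illustrative and not the exact tooling used in the challenge.

```python
# Minimal environment sanity check (assumes torch and deepspeed are installed).
import torch
import deepspeed

print("PyTorch version:", torch.__version__)
print("DeepSpeed version:", deepspeed.__version__)
print("CUDA available:", torch.cuda.is_available())

# List every visible GPU and its memory; we expect four 32 GB Tesla V100 DGXS cards.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GiB")
```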

Methodology(Main)

Based on our environment and investigation, the figure shows how we narrowed down the choice of parallel strategies; the yellow line is our chosen path. Our candidates were MP, TP, and 2D parallelism. The basic bookkeeping behind this choice is sketched below.
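The sketch below only illustrates the arithmetic constraint behind these candidates: with Megatron-style parallelism the world size factors into data-, tensor-, and pipeline-parallel degrees, so on 4 GPUs any 2D scheme leaves little room for data parallelism. The numbers are hedged examples, not the exact calculation from our report.

```python
# Hedged illustration: how tensor/pipeline parallel degrees constrain data parallelism.
WORLD_SIZE = 4  # four V100 GPUs in our setup

def data_parallel_degree(tensor_parallel: int, pipeline_parallel: int) -> int:
    """Megatron-style factorization: world_size = data * tensor * pipeline."""
    assert WORLD_SIZE % (tensor_parallel * pipeline_parallel) == 0
    return WORLD_SIZE // (tensor_parallel * pipeline_parallel)

for tp, pp in [(1, 1), (2, 1), (1, 2), (2, 2), (4, 1)]:
    print(f"TP={tp} PP={pp} -> DP={data_parallel_degree(tp, pp)}")
```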

Result

We build GPT-2 on Megatron-LM as the baseline, use the DeepSpeed engine to add ZeRO-optimized memory management, and apply General, GPU-specific, and ZeRO optimizations. The training time required by the different methods is shown in the figure. Our final training time is 67.75 h; the reason we did not reach 33.87 h is explained below.
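For reference, a ZeRO setup of this kind is typically wired up through a DeepSpeed configuration. The following is a minimal, hedged sketch: the batch size, optimizer, stage, and offload settings are illustrative placeholders, not the exact configuration we used.

```python
# Hedged sketch of a DeepSpeed + ZeRO setup (illustrative values, not our exact config).
# Intended to be launched with the deepspeed launcher so distributed init is handled for us.
import torch
import deepspeed

# Placeholder module standing in for the Megatron-LM GPT-2 baseline.
model = torch.nn.Linear(1024, 1024)

ds_config = {
    "train_batch_size": 16,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},                   # mixed precision to reduce GPU memory
    "zero_optimization": {
        "stage": 2,                              # partition optimizer states and gradients
        "offload_optimizer": {"device": "cpu"},  # spill optimizer states to CPU RAM
    },
}

# deepspeed.initialize wraps the model in an engine that applies the ZeRO settings above.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```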

A "DefaultCPUAllocator can't allocate memory" error occurs during the training of 1B tokens with the same parameters. To analyze the cause of this problem, we used htop [18] to observe CPU usage during training and found that CPU memory usage grew steadily throughout the training process until it crashed.
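To track this programmatically rather than by watching htop, a small helper along these lines can log host memory from inside the training loop. This is a hedged sketch using psutil, which was not necessarily part of our toolchain.

```python
# Hedged sketch: log host (CPU) memory usage from inside the training loop.
# psutil is used here for illustration; in the challenge we observed the leak with htop.
import psutil

def log_cpu_memory(step: int) -> None:
    vm = psutil.virtual_memory()
    rss = psutil.Process().memory_info().rss
    print(f"step {step}: system used {vm.used / 1024**3:.1f} GiB "
          f"({vm.percent:.0f}%), process RSS {rss / 1024**3:.1f} GiB")

# Example usage: call every N steps; a steadily growing RSS points at the
# CPU memory exhaustion behind the DefaultCPUAllocator error.
for step in range(0, 1000, 100):
    log_cpu_memory(step)
```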

Conclusion

Our main contribution is to build GPT-2 on Megatron-LM as the baseline, use the DeepSpeed engine to add ZeRO-optimized memory management, and apply General, GPU-specific, and ZeRO optimizations; the final training of 1B tokens takes 67.75 h. With extra CPU memory, we could reduce this time to 33.87 h.

Acknowledgement

Many thanks to the Nanchang University Supercomputer Student Competition Cluster in 2022. All rights reserved @NCUSCC.

If you have any problems, please contact congruiyin@email.ncu.edu.cn.
