TDS

Release note | Chinese documentation (中文文档)

Introduction

  • Tsinghua/Temporary DeepSpeed (TDS) is a plug-in for Microsoft DeepSpeed that fixes bugs in the DeepSpeed PipelineEngine.

  • Although DeepSpeed provides interfaces to support pipeline-parallel training, its code still contains bugs and hacky implementations, especially in the code that sends tensors between different pipeline stages. We therefore reimplement the PipelineEngine of DeepSpeed in TDS.

How to use TDS

  1. Install DeepSpeed. For installation instructions, refer to DeepSpeed Installation.

  2. Copy the folder "tds" into your project and use "import tds as deepspeed" instead of "import deepspeed" in your code (see the sketch after this list).

  3. If you want to use pipeline-parallel training, you must add code that tells your model some essential settings for its forward and backward operations: the types of its tensors (both input data and hidden states), whether these tensors need to save gradients, and whether they can be partitioned across GPUs to save memory. We take training GPT-2 as an example; the detailed code can be found in GPT-2.

    • The code when using DeepSpeed
    def model_provider():
        """Build the model for GPT-2."""
        args = get_args()
        print_rank_0('building GPT2 model ...')
        if args.pipe_parallel_size == 0:
            model = GPT2Model(num_tokentypes=0, parallel_output=True)
        else:
            model = GPT2ModelPipe(num_tokentypes=0, parallel_output=True, topology=mpu.get_topology())
            model._megatron_batch_fn = get_batch_pipe
        return model
    • The code when using TDS
    def model_provider():
        """Build the model for GPT-2."""
        args = get_args()
        print_rank_0('building GPT2 model ...')
        if args.pipe_parallel_size == 0:
            model = GPT2Model(num_tokentypes=0, parallel_output=True)
        else:
            model = GPT2ModelPipe(num_tokentypes=0, parallel_output=True, topology=mpu.get_topology())
            model._megatron_batch_fn = get_batch_pipe
            # The first input tensor is the input embeddings/hidden states and needs to save its gradients. The second input tensor is the attention mask.
            model._input_grad = [True, False]
            # The first input tensor (input embeddings/hidden states) is float. The second input tensor (attention mask) is boolean.
            model._input_type = ['float', 'bool']
            # Input embeddings and hidden states can be partitioned across GPUs to save memory.
            model._input_pipe_partitioned = [True, False]
        return model
  4. All other operations directly follow DeepSpeed and DeepSpeedExamples (see the training-loop sketch after this list).
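
To make steps 2 and 4 concrete, the sketch below shows how an engine could be created with TDS in place of DeepSpeed. It reuses get_args() and model_provider() from the GPT-2 example above; everything else follows the standard deepspeed.initialize() interface. This is a minimal sketch, not code from the TDS repository.

    # Minimal sketch, assuming get_args() and model_provider() as defined in
    # the GPT-2 example above, and that the "tds" folder is on your import path.
    import tds as deepspeed  # drop-in replacement for "import deepspeed"

    args = get_args()
    model = model_provider()

    # deepspeed.initialize() is the standard DeepSpeed entry point; since TDS
    # reimplements the PipelineEngine, a pipeline model should come back wrapped
    # in the TDS engine rather than the stock DeepSpeed one.
    engine, optimizer, _, lr_scheduler = deepspeed.initialize(
        args=args,
        model=model,
        model_parameters=[p for p in model.parameters() if p.requires_grad],
    )

Training then proceeds exactly as with DeepSpeed's pipeline engine (step 4). The loop below assumes the engine from the previous snippet plus your own train_data_iterator, args.train_iters, and args.log_interval (Megatron-style settings, not part of TDS); train_batch() is the standard DeepSpeed PipelineEngine method that runs one full forward/backward/optimizer-step schedule over the gradient-accumulation micro-batches.

    # Minimal sketch of the training loop; train_data_iterator must yield batches
    # in the format expected by get_batch_pipe.
    for step in range(args.train_iters):
        # Runs one complete pipeline schedule and returns the averaged loss.
        loss = engine.train_batch(data_iter=train_data_iterator)
        if step % args.log_interval == 0:
            print_rank_0(f'step {step}: loss = {loss.item():.4f}')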

Examples

For more examples, such as using TDS for GPT-2 and T5, refer to CPM-Pretrain.

Citation

If you use the code, please cite the following paper:

@article{cpm-v1,
  title={CPM: A Large-scale Generative Chinese Pre-trained Language Model},
  author={Zhang, Zhengyan and Han, Xu and Zhou, Hao and Ke, Pei and Gu, Yuxian and Ye, Deming and Qin, Yujia and Su, Yusheng and Ji, Haozhe and Guan, Jian and Qi, Fanchao and Wang, Xiaozhi and Zheng, Yanan and Zeng, Guoyang and Cao, Huanqi and Chen, Shengqi and Li, Daixuan and Sun, Zhenbo and Liu, Zhiyuan and Huang, Minlie and Han, Wentao and Tang, Jie and Li, Juanzi and Sun, Maosong},
  year={2020}
}
