
A Multi-Level Framework for Accelerating Training Transformer Models

Brief Introduction

This is the repository for the multi-level training framework. We provide the Coalescing, De-coalescing, and Interpolation operators described in the paper, along with an example of accelerating the pre-training of GPT-2 on Wiki-En. The framework is built on top of the Hugging Face transformers library.
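
For intuition, width coalescing can be pictured as projecting each weight matrix onto a smaller hidden dimension through a fixed mapping matrix that averages neighboring units, while de-coalescing maps the trained small model back to the original size. The sketch below is only a conceptual illustration under that assumption; the actual operators in map_tools also handle depth, biases, embeddings, and layer norms, and may use different mapping matrices.

import torch

def width_coalesce(weight: torch.Tensor) -> torch.Tensor:
    """Conceptual width coalescing: halve a square weight matrix by averaging adjacent units."""
    d = weight.shape[0]
    mapping = torch.zeros(d, d // 2)
    for i in range(d // 2):
        mapping[2 * i, i] = 0.5      # each column of the mapping matrix
        mapping[2 * i + 1, i] = 0.5  # averages one pair of neighboring units
    # Project both the output and input dimensions onto the coarser space.
    return mapping.T @ weight @ mapping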

Install Requirements

Step 1

To use map_tools for model mapping, please install the following packages.

pip install torch==2.0.1+cu118 transformers==4.31.0

Step 2

To run the pre-training acceleration example, please install the packages as follows.

cd example
pip install -r requirements.txt

Map Tools Usage

We implement the three operators that orchestrate the multi-level training framework in map_tools. With map_tools, it is convenient to resize and merge transformer models. The usage of map_tools can be found in the map_tools documentation.
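
For intuition, the merge (Interpolation) step amounts to a parameter-wise weighted average of the de-coalesced model and the original model. The following is a minimal sketch under the assumption that both checkpoints share the same architecture, state-dict keys, and floating-point parameters; the checkpoint path and the ratio alpha are placeholders, not the map_tools interface.

def interpolate(original_sd, decoalesced_sd, alpha=0.5):
    """Blend two state dicts parameter-by-parameter: alpha * de-coalesced + (1 - alpha) * original."""
    return {name: alpha * decoalesced_sd[name] + (1.0 - alpha) * param
            for name, param in original_sd.items()}

# Hypothetical usage with two same-sized GPT-2 checkpoints:
# from transformers import GPT2LMHeadModel
# base = GPT2LMHeadModel.from_pretrained("gpt2")
# grown = GPT2LMHeadModel.from_pretrained("./decoalesced_ckpt")  # placeholder path
# base.load_state_dict(interpolate(base.state_dict(), grown.state_dict(), alpha=0.5))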

Accelerate Pre-training of GPT-2 on Wikipedia-En

To better illustrate the usage of map_tools and demonstrate the effectiveness of the multi-level training framework, we provide an example of accelerating GPT-2 pre-training on Wikipedia-En.

Running the example requires 150 GB of disk space to preprocess the Wikipedia dataset.
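
Most of that footprint comes from the raw dump plus the tokenized cache written during preprocessing. As a rough sketch of the kind of preprocessing involved (the example's own script may pin a different dataset snapshot and tokenizer settings; the names below are assumptions):

from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Assumed dataset snapshot; the example may use a different one.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"])

# The tokenized Arrow cache written here accounts for much of the disk usage.
tokenized = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)
tokenized.save_to_disk("wiki_en_tokenized")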

Adaptive Interpolation Ratio ($\alpha$)

In the original paper, the interpolation ratio is set based on preliminary experimental results. After acceptance, we found that this process can be further refined adaptively.

We have implemented an adaptive mechanism that determines the interpolation ratio dynamically. First, we normalize the parameters of the de-coalesced model to align them with those of the original model prior to coalescing. This normalization balances the parameter scales of the two models. Then we simply merge all parameters with a ratio of 0.5. We term this process adaptive interpolation, since it effectively merges different parameters with different ratios. Our experiments show that adaptive interpolation can save a further ~10% of FLOPs.
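
The following is a conceptual sketch of this normalize-then-merge step, based on the description above rather than on the exact --norm-src implementation; the per-tensor norm matching, the epsilon guard, and the assumption of floating-point parameter tensors are illustrative choices.

def adaptive_interpolate(original_sd, decoalesced_sd, eps=1e-12):
    """Rescale each de-coalesced tensor to the original's norm, then merge with ratio 0.5."""
    merged = {}
    for name, p_orig in original_sd.items():
        p_dec = decoalesced_sd[name]
        # Normalization: align the scale of the de-coalesced parameters with the
        # original (pre-coalescing) model.
        p_dec = p_dec * (p_orig.norm() / (p_dec.norm() + eps))
        # With the scales balanced, a fixed 0.5 ratio acts like a per-tensor adaptive ratio.
        merged[name] = 0.5 * p_dec + 0.5 * p_orig
    return merged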

To enable adaptive interpolation in multi-level training, simply pass the --norm-src flag to map_tools during the model interpolation phase.

Supported Models

  • BERT
  • GPT2
  • LLaMA
  • DeiT

Citation

Please cite our paper if you find this repository helpful:

@inproceedings{
    zou2024a,
    title={A Multi-Level Framework for Accelerating Training Transformer Models},
    author={Longwei Zou and Han Zhang and Yangdong Deng},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=BI1N3lTWtn}
}
