
A Multi-Level Framework for Accelerating Training Transformer Models

Brief Introduction

This is the repository for the multi-level training framework. We provide the Coalescing, De-coalescing, and Interpolation operators described in the paper, along with an example of accelerating the pre-training of GPT-2 on Wiki-En. The framework is built on top of the Hugging Face transformers library.
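
For intuition, width coalescing can be pictured as projecting each weight matrix onto a smaller hidden dimension through a fixed mapping matrix that averages neighboring units, while de-coalescing maps the trained small model back to the original size. The sketch below is only a conceptual illustration under that assumption; the actual operators in map_tools also handle depth, biases, embeddings, and layer norms, and may use different mapping matrices.

import torch

def width_coalesce(weight: torch.Tensor) -> torch.Tensor:
    """Conceptual width coalescing: halve a square weight matrix by averaging adjacent units."""
    d = weight.shape[0]
    mapping = torch.zeros(d, d // 2)
    for i in range(d // 2):
        mapping[2 * i, i] = 0.5      # each column of the mapping matrix
        mapping[2 * i + 1, i] = 0.5  # averages one pair of neighboring units
    # Project both the output and input dimensions onto the coarser space.
    return mapping.T @ weight @ mapping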

Install Requirements

Step 1

To use map_tools for model mapping, please install the following packages.

pip install torch==2.0.1+cu118 transformers==4.31.0

Step 2

To run the pre-training acceleration example, please install the packages as follows.

cd example
pip install -r requirements.txt

Map Tools Usage

We implement the three operators that orchestrate the multi-level training framework in map_tools. With map_tools, it is convenient to resize and merge transformer models. The usage of map_tools can be found in the map_tools documentation.
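
For intuition, the merge (Interpolation) step amounts to a parameter-wise weighted average of the de-coalesced model and the original model. The following is a minimal sketch under the assumption that both checkpoints share the same architecture, state-dict keys, and floating-point parameters; the checkpoint path and the ratio alpha are placeholders, not the map_tools interface.

def interpolate(original_sd, decoalesced_sd, alpha=0.5):
    """Blend two state dicts parameter-by-parameter: alpha * de-coalesced + (1 - alpha) * original."""
    return {name: alpha * decoalesced_sd[name] + (1.0 - alpha) * param
            for name, param in original_sd.items()}

# Hypothetical usage with two same-sized GPT-2 checkpoints:
# from transformers import GPT2LMHeadModel
# base = GPT2LMHeadModel.from_pretrained("gpt2")
# grown = GPT2LMHeadModel.from_pretrained("./decoalesced_ckpt")  # placeholder path
# base.load_state_dict(interpolate(base.state_dict(), grown.state_dict(), alpha=0.5))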

Accelerate Pre-training of GPT-2 on Wikipedia-En

To better illustrate the usage of map_tools and demonstrate the effectiveness of the multi-level training framework, we provide an example of accelerating GPT-2 pre-training on Wikipedia-En.

Running the example requires 150 GB of disk space to preprocess the Wikipedia dataset.
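
Most of that footprint comes from the raw dump plus the tokenized cache written during preprocessing. As a rough sketch of the kind of preprocessing involved (the example's own script may pin a different dataset snapshot and tokenizer settings; the names below are assumptions):

from datasets import load_dataset
from transformers import GPT2TokenizerFast

# Assumed dataset snapshot; the example may use a different one.
wiki = load_dataset("wikipedia", "20220301.en", split="train")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"])

# The tokenized Arrow cache written here accounts for much of the disk usage.
tokenized = wiki.map(tokenize, batched=True, remove_columns=wiki.column_names)
tokenized.save_to_disk("wiki_en_tokenized")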

Adaptive Interpolation Ratio ($\alpha$)

In the original paper, the interpolation ratio is set based on preliminary experimental results. After acceptance, we found that this process can be further refined adaptively.

We have implemented an adaptive mechanism that determines the interpolation ratio dynamically. First, we normalize the parameters of the de-coalesced model to align them with those of the original model prior to coalescing. This normalization balances the parameter scales of the two models. Then we simply merge all parameters with a ratio of 0.5. We term this process adaptive interpolation, since it effectively merges different parameters with different ratios. Our experiments show that adaptive interpolation can save a further ~10% of FLOPs.
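
The following is a conceptual sketch of this normalize-then-merge step, based on the description above rather than on the exact --norm-src implementation; the per-tensor norm matching, the epsilon guard, and the assumption of floating-point parameter tensors are illustrative choices.

def adaptive_interpolate(original_sd, decoalesced_sd, eps=1e-12):
    """Rescale each de-coalesced tensor to the original's norm, then merge with ratio 0.5."""
    merged = {}
    for name, p_orig in original_sd.items():
        p_dec = decoalesced_sd[name]
        # Normalization: align the scale of the de-coalesced parameters with the
        # original (pre-coalescing) model.
        p_dec = p_dec * (p_orig.norm() / (p_dec.norm() + eps))
        # With the scales balanced, a fixed 0.5 ratio acts like a per-tensor adaptive ratio.
        merged[name] = 0.5 * p_dec + 0.5 * p_orig
    return merged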

To enable adaptive interpolation in multi-level training, simply pass the --norm-src flag to map_tools during the model interpolation phase.

Supported Models

  • BERT
  • GPT2
  • LLaMA
  • DeiT

Citation

Please cite our paper if you find this repository helpful:

@inproceedings{
    zou2024a,
    title={A Multi-Level Framework for Accelerating Training Transformer Models},
    author={Longwei Zou and Han Zhang and Yangdong Deng},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=BI1N3lTWtn}
}
