This repository provides the framework for the paper "ScaleMoE: A Fast and Scalable Distributed Training Framework for Large-Scale Mixture-of-Experts Models" (PACT 2025).
We will continue to update the framework code and annotations.
This repository is based on the DeepSpeed training framework and the Tutel MoE library.
Please refer to the official Microsoft DeepSpeed and Tutel repositories for the fundamentals of both libraries.
• All-to-all communication optimization. We propose adaptive all-to-all communication to minimize communication volume by removing unnecessary zero padding (a rough sketch follows this list).
• Balanced expert selection. We propose dynamic expert clustering, facilitating more balanced expert selection.
• Heterogeneous network-aware data placement. We propose topology-aware expert remapping to fully leverage any type of network configuration.
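As a rough illustration of the first optimization (not the actual ScaleMoE code), the sketch below first exchanges the real per-rank token counts and then issues an all-to-all with variable split sizes, so no zero padding is transferred; the function name adaptive_all_to_all and the tokens_per_rank input are hypothetical.
# Illustrative sketch only, not the ScaleMoE implementation. Assumes
# torch.distributed is already initialized (NCCL requires CUDA tensors).
import torch
import torch.distributed as dist

def adaptive_all_to_all(tokens_per_rank):
    # tokens_per_rank: list of [n_i, hidden] tensors, one per destination rank.
    hidden = tokens_per_rank[0].shape[1]
    device = tokens_per_rank[0].device
    # 1) Exchange the real per-rank token counts instead of padding to a fixed capacity.
    send_counts = torch.tensor([t.shape[0] for t in tokens_per_rank],
                               dtype=torch.long, device=device)
    recv_counts = torch.empty_like(send_counts)
    dist.all_to_all_single(recv_counts, send_counts)
    # 2) All-to-all with variable split sizes: only real tokens travel, no zeros.
    send_buf = torch.cat(tokens_per_rank, dim=0)
    recv_buf = send_buf.new_empty((int(recv_counts.sum()), hidden))
    dist.all_to_all_single(recv_buf, send_buf,
                           output_split_sizes=recv_counts.tolist(),
                           input_split_sizes=send_counts.tolist())
    return recv_buf, recv_counts
Compared with capacity-padded dispatch, the communicated volume then scales with the tokens actually routed rather than with the worst-case expert capacity.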
To use ScaleMoE, users need to install DeepSpeed and Tutel first.
For a quick and consistent setup, we provide a Dockerfile that automatically builds a container including all required dependencies (e.g., CUDA, PyTorch, DeepSpeed, and Tutel). Users who prefer an easy setup can simply build the Docker image and start a container to run ScaleMoE without manually configuring the environment.
Please refer to the docker/ directory for detailed instructions and preconfigured files.
# Build the docker image
docker build -t scalemoe:latest -f docker/Dockerfile_cuda12.2 .
# Run the container
docker run --gpus all -it --rm scalemoe:latest
If you want to use your own environment, please install DeepSpeed and Tutel:
# Install DeepSpeed
pip install deepspeed
# Install Tutel
git clone https://github.com/microsoft/tutel --branch main
python3 -m pip uninstall tutel -y
python3 ./tutel/setup.py install --user
See the scalemoe folder for details on the provided scripts:
- baseline
  Runs the baseline MoE implementation.
  Use this script to reproduce the behavior of standard Tutel as the baseline with minimal modifications. Use the script that matches the CUDA version selected when building the Docker image.
- run_adaptive.sh
  Runs the Adaptive MoE implementation.
  This script introduces adaptive all-to-all communication, which removes unnecessary zero padding to reduce communication volume.
- run_scalable.sh
  Runs the K-means and GA MoE implementation.
  This script introduces dynamic expert clustering (K-means) and topology-aware expert remapping (GA). A toy K-means sketch follows this list.
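For intuition only, the following is a plain K-means grouping of experts by a per-expert feature vector; the feature choice (e.g., gate statistics), the cluster count, and how the clusters feed into balanced expert selection are assumptions for illustration, not ScaleMoE's actual clustering code.
# Toy K-means over per-expert features; not the ScaleMoE implementation.
import torch

def kmeans_cluster_experts(expert_features, num_clusters, iters=20):
    # expert_features: [num_experts, feature_dim], e.g., rows of the gate projection.
    perm = torch.randperm(expert_features.shape[0])[:num_clusters]
    centroids = expert_features[perm].clone()
    for _ in range(iters):
        dists = torch.cdist(expert_features, centroids)    # [E, C] distances
        assign = dists.argmin(dim=1)                        # nearest centroid per expert
        for c in range(num_clusters):
            members = expert_features[assign == c]
            if members.numel() > 0:                         # leave empty clusters unchanged
                centroids[c] = members.mean(dim=0)
    return assign                                           # cluster id per expert

# Example: group 64 experts into 8 clusters of similar experts, so that routing
# and load-balancing decisions can be made at cluster granularity.
cluster_ids = kmeans_cluster_experts(torch.randn(64, 1024), num_clusters=8)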
Run the desired script with your configuration. For example:
bash scalemoe/scripts/baseline/run_cuda11.sh
bash scalemoe/scripts/run_adaptive.sh
bash scalemoe/scripts/run_scalable.sh
ScaleMoE provides out-of-the-box support for running large-scale Mixture-of-Experts (MoE) models such as BERT-MoE and GPT-MoE.
You can find example training scripts and configuration files under the models/ directory.
We provide a MoE-enhanced BERT implementation based on DeepSpeed and Tutel.
This example demonstrates how ScaleMoE optimizations—such as adaptive all-to-all communication, dynamic expert clustering, and topology-aware expert remapping—can significantly accelerate BERT training on large-scale distributed environments.
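To convey what topology-aware expert remapping optimizes, the toy sketch below greedily places the heaviest-traffic experts on the best-connected GPUs. This greedy heuristic is only for intuition; it is not ScaleMoE's GA-based remapping, and all names and matrices are hypothetical.
# Toy greedy placement for intuition only; ScaleMoE's actual remapping is GA-based.
import torch

def greedy_remap(expert_traffic, link_bandwidth):
    # expert_traffic: [E] expected number of tokens routed to each expert.
    # link_bandwidth: [G, G] bandwidth between GPU pairs.
    num_gpus = link_bandwidth.shape[0]
    gpu_order = link_bandwidth.sum(dim=1).argsort(descending=True)  # best-connected GPUs first
    expert_order = expert_traffic.argsort(descending=True)          # busiest experts first
    placement = torch.empty_like(expert_order)
    for i, e in enumerate(expert_order):
        placement[e] = gpu_order[i % num_gpus]                       # round-robin heavy experts
    return placement                                                 # placement[e] = hosting GPU

placement = greedy_remap(torch.rand(16), torch.rand(4, 4))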
To run the BERT-MoE example:
- Prepare data
bash models/BERT/data/getdata.sh
- Prepare checkpoints
bash models/BERT/prepare_ck.sh
- Run baseline
bash models/BERT/tutel_run.sh
- Run adaptive all-to-all communication
bash models/BERT/adaptive_run.sh
- Run dynamic expert clustering
bash models/BERT/kmean_run.sh
- Run topology-aware expert remapping
bash models/BERT/ga_run.sh
After the run completes, the logs and exploration results are saved in the corresponding logs folder, e.g., models/BERT/logs/.
This work has been published at PACT'25 as ScaleMoE:
@inproceedings{choi2025scalemoe,
title={ScaleMoE: A Fast and Scalable Distributed Training Framework for Large-Scale Mixture-of-Experts Models},
author={Choi, Seohong and Hong, Huize and Han, Tae Hee and Kim, Joonsung},
booktitle={2025 34th International Conference on Parallel Architectures and Compilation Techniques (PACT)},
pages={30--42},
year={2025},
organization={IEEE}
}