Demystifying the Compression of Mixture-of-Experts Through a Unified Framework

Shwai He*, Daize Dong*, Liang Ding, Ang Li

This is the official implementation of the paper Demystifying the Compression of Mixture-of-Experts Through a Unified Framework. We provide a comprehensive framework for compressing Mixture-of-Experts models.

Introduction

The Mixture of Experts (MoE) approach dynamically selects and activates only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE introduces potential redundancy (e.g., parameters) and extra costs (e.g., communication overhead). Since the compression of MoE remains under-explored, we address this gap with a cutting-edge unified framework that seamlessly integrates mainstream compression methods and helps systematically understand MoE compression. This framework approaches compression from two perspectives: Expert Slimming, which compresses individual experts, and Expert Trimming, which removes structured modules. Within this framework, we explore the optimization space unexplored by existing methods and introduce aggressive Expert Trimming techniques, such as Layer Drop and Block Drop, to eliminate redundancy on a larger scale. Based on these insights, we present a comprehensive recipe to guide practitioners in effectively compressing MoE.
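
To make the two perspectives concrete, the toy sketch below (our illustration, not code from this repository; names such as ToyMoELayer are hypothetical) contrasts Expert Slimming, which compresses the weights inside each expert, with Expert Trimming, which removes whole structured modules such as experts.

# Toy illustration of the two compression perspectives (not the repository's code).
import torch
import torch.nn as nn

class ToyMoELayer(nn.Module):  # hypothetical minimal MoE layer
    def __init__(self, hidden=16, num_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_experts))

def expert_slimming(layer, sparsity=0.5):
    # Compress each expert internally, e.g. by magnitude pruning of its weights.
    for expert in layer.experts:
        w = expert.weight.data
        threshold = w.abs().flatten().kthvalue(int(sparsity * w.numel())).values
        w.mul_((w.abs() > threshold).float())

def expert_trimming(layer, keep_ids):
    # Remove structured modules, here whole experts (Expert Drop);
    # Layer Drop / Block Drop remove MoE layers or entire transformer blocks instead.
    layer.experts = nn.ModuleList(layer.experts[i] for i in keep_ids)

layer = ToyMoELayer()
expert_slimming(layer, sparsity=0.5)
expert_trimming(layer, keep_ids=[0, 2])
print(len(layer.experts))  # 2 experts remain, each with ~50% of weights zeroed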

Figures: unified-view.svg and unified-view-table.svg (the unified view of MoE compression and its summary table).

Installation

Environment

Create a conda environment and install the pipeline for pruning and Expert Trimming (based on LLaMA-Factory).

conda create -n moe-compression python=3.10
conda activate moe-compression

git clone git@github.com:DaizeDong/Unified-MoE-Compression.git
cd ./Unified-MoE-Compression
pip install -e .
pip install flash-attn --no-build-isolation

Install the pipeline for quantization (based on AutoAWQ and AutoGPTQ). Make sure the packages you install match your CUDA version; for more details, refer to the README files in the corresponding folders.

cd ./AutoAWQ
pip install -e .

cd ./AutoAWQ/AutoAWQ_kernels
pip install -e .

cd ./AutoGPTQ
pip install -vvv --no-build-isolation -e .
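
Before building the quantization kernels, a quick generic check (not part of this repository) confirms that your PyTorch build and GPU setup match the CUDA toolkit you intend to compile against:

# Check that the local PyTorch build and GPU are visible before compiling CUDA kernels.
import torch
print("torch:", torch.__version__)
print("built with CUDA:", torch.version.cuda)
print("GPU available:", torch.cuda.is_available())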

Prepare Models

Download the Mixtral-8x7B and DeepSeek-MoE-16B models from HuggingFace, and delete the following lines from the config.json of DeepSeek-MoE-16B.

"auto_map": {
  "AutoConfig": "configuration_deepseek.DeepseekConfig",
  "AutoModel": "modeling_deepseek.DeepseekModel",
  "AutoModelForCausalLM": "modeling_deepseek.DeepseekForCausalLM"
},
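
If you prefer to patch the config programmatically, a small script like the following removes the auto_map entry; the model path is a placeholder you should adjust to your local checkpoint.

# Remove the "auto_map" entry from DeepSeek-MoE-16B's config.json so that the
# repository's own modeling files are used instead of the remote code.
import json

config_path = "/path/to/deepseek-moe-16b/config.json"  # adjust to your local path
with open(config_path) as f:
    config = json.load(f)
config.pop("auto_map", None)
with open(config_path, "w") as f:
    json.dump(config, f, indent=2)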

Running Compression

Expert Slimming

Pruning

bash scripts/compression/pruning/mixtral_prune.sh
bash scripts/compression/pruning/deepseek_prune.sh
bash scripts/compression/pruning/deepseek_prune_noshared.sh
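
The scripts above drive the repository's pruning pipeline. For intuition, the snippet below sketches a Wanda-style criterion, which scores each weight by its magnitude times the norm of its input activation; this is a conceptual example, not necessarily the exact configuration the scripts use.

# Conceptual Wanda-style pruning: score each weight by |W| times the norm of its
# input channel on calibration data, then zero the lowest-scoring weights per row.
import torch

def wanda_prune(weight, calib_inputs, sparsity=0.5):
    # weight: (out_features, in_features); calib_inputs: (num_tokens, in_features)
    act_norm = calib_inputs.norm(p=2, dim=0)           # per-input-channel norm
    scores = weight.abs() * act_norm.unsqueeze(0)      # importance of each weight
    k = int(sparsity * weight.size(1))
    idx = scores.argsort(dim=1)[:, :k]                 # least important weights per row
    mask = torch.ones_like(weight)
    mask.scatter_(1, idx, 0.0)
    return weight * mask

w = torch.randn(16, 32)
x = torch.randn(128, 32)
print((wanda_prune(w, x) == 0).float().mean())  # roughly 0.5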

Quantization

bash scripts/compression/quantization/awq.sh
bash scripts/compression/quantization/gptq.sh
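
These scripts wrap the bundled AutoAWQ and AutoGPTQ. As a reference point, a minimal standalone AWQ run typically looks like the sketch below; paths and quantization settings are placeholders, and the scripts may pass different arguments.

# Minimal AWQ quantization sketch using AutoAWQ (paths are placeholders).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/Mixtral-8x7B"       # adjust to your local checkpoint
quant_path = "/path/to/Mixtral-8x7B-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)   # runs AWQ calibration
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)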

Expert Trimming

Expert Drop

bash scripts/compression/expert_drop/mixtral_expert_drop.sh
bash scripts/compression/expert_drop/deepseek_expert_drop.sh
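
Conceptually, Expert Drop removes the experts that the router selects least often. The sketch below (our illustration, not the repository's implementation, which may use a different importance criterion) counts routing decisions on calibration data and keeps the most frequently used experts:

# Conceptual Expert Drop: rank experts by routing frequency and keep the most used.
import torch

def rank_experts(router_logits, top_k=2):
    # router_logits: (num_tokens, num_experts) gating scores from calibration data
    chosen = router_logits.topk(top_k, dim=-1).indices          # experts picked per token
    counts = torch.bincount(chosen.flatten(), minlength=router_logits.size(-1))
    return counts.argsort(descending=True)                      # most- to least-used experts

logits = torch.randn(1024, 8)          # e.g. Mixtral-style 8 experts with top-2 routing
order = rank_experts(logits, top_k=2)
keep = order[:6]                       # drop the 2 least-used experts
print(keep.tolist())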

Layer Drop

bash scripts/compression/layer_drop/mixtral_layer_drop.sh
bash scripts/compression/layer_drop/deepseek_layer_drop.sh

Block Drop

bash scripts/compression/block_drop/mixtral_block_drop.sh
bash scripts/compression/block_drop/deepseek_block_drop.sh
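
Both Layer Drop and Block Drop follow the same intuition: a module whose output stays close to its input contributes little and can be removed. A rough sketch of that similarity criterion (our illustration, not the repository's code) is shown below; a "module" is an MoE layer for Layer Drop or an entire transformer block for Block Drop.

# Conceptual redundancy criterion for Layer Drop / Block Drop:
# modules whose outputs are most similar to their inputs are dropped first.
import torch
import torch.nn.functional as F

def redundancy_scores(hidden_in, hidden_out):
    # hidden_in / hidden_out: lists of (num_tokens, hidden) activations per module
    scores = []
    for x, y in zip(hidden_in, hidden_out):
        scores.append(F.cosine_similarity(x, y, dim=-1).mean().item())
    return scores  # higher similarity => more redundant module

x = [torch.randn(64, 16) for _ in range(4)]
y = [xi + 0.01 * torch.randn(64, 16) for xi in x]
print(redundancy_scores(x, y))  # values near 1.0 indicate redundant modules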

Running Evaluation

FLOPs & Speed

bash scripts/evaluation/speedup/measure_flops.sh
bash scripts/evaluation/speedup/measure_speed.sh
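
For a quick sanity check outside the provided scripts, a rough wall-clock measurement of generation throughput can be done as follows; this is a generic sketch, and the model path and generation settings are placeholders.

# Rough decoding-speed check (generic sketch, not the repository's benchmark).
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/compressed-model"   # adjust to your local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("The Mixture of Experts approach", return_tensors="pt").to(model.device)
torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start
print(f"{(out.shape[1] - inputs['input_ids'].shape[1]) / elapsed:.1f} tokens/s")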

Loss & PPL

bash scripts/evaluation/loss/mixtral_evaluate.sh
bash scripts/evaluation/loss/deepseek_evaluate.sh
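
For reference, perplexity is simply the exponential of the average token-level cross-entropy loss. A minimal computation (generic sketch, independent of the evaluation scripts; paths and texts are placeholders) looks like:

# Minimal perplexity computation: PPL = exp(mean token-level cross-entropy).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/compressed-model"   # adjust to your local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
).eval()

texts = ["MoE compression keeps performance while cutting cost."]  # replace with eval data
losses = []
with torch.no_grad():
    for text in texts:
        enc = tokenizer(text, return_tensors="pt").to(model.device)
        out = model(**enc, labels=enc["input_ids"])
        losses.append(out.loss.item())
print("PPL:", math.exp(sum(losses) / len(losses)))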

Benchmarks

Coming soon. We are still cleaning the code...

For now, please refer to EleutherAI/lm-evaluation-harness.

Remember to use the modeling files in src/llmtuner/model to load the Mixtral-8x7B and DeepSeek-MoE-16B models.

Citation

@article{he2024demystifying,
  title={Demystifying the Compression of Mixture-of-Experts Through a Unified Framework},
  author={He, Shwai and Dong, Daize and Ding, Liang and Li, Ang},
  journal={arXiv preprint arXiv:2406.02500},
  year={2024}
}

Contact Us

If you have any questions, please contact: