CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts
International Conference on Learning Representations (ICLR), 2026
This repository provides the official implementation of CodeQuant, a unified clustering and quantization framework for Mixture-of-Experts (MoE) Large Language Models (LLMs). CodeQuant addresses activation outliers with a fine-tuned rotation and a robust clustering method, enabling efficient low-precision deployment.

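For intuition, rotating activations by an orthogonal matrix spreads an outlier channel's energy across all channels, shrinking the maximum magnitude the quantizer must cover. Below is a toy sketch with a fixed Hadamard rotation; this is illustration only, not the fine-tuned rotation CodeQuant learns with its MoE-specific objectives:

```python
import math

def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of
    two), scaled by 1/sqrt(n) so it is orthonormal (and symmetric)."""
    h = [[1.0]]
    while len(h) < n:
        m = len(h)
        h = [[h[i % m][j % m] * (-1.0 if (i >= m and j >= m) else 1.0)
              for j in range(2 * m)] for i in range(2 * m)]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]

def rotate(vec, rot):
    """Matrix-vector product: apply the rotation to one activation vector."""
    return [sum(r * v for r, v in zip(row, vec)) for row in rot]

x = [0.1, -0.2, 8.0, 0.05]   # channel 2 is an activation outlier
y = rotate(x, hadamard(4))   # outlier energy is now spread across channels
# max |y_i| is about 4.1, roughly half of the original outlier's 8.0
```

Because the rotation is orthonormal, it can be folded into the adjacent weight matrices offline and undone exactly, so it changes the quantization range without changing the model's function.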
- Unified rotation and clustering framework for low-precision deployment of MoE LLMs, with carefully designed MoE-specific fine-tuning objectives.
- Fully offline quantization with no on-the-fly computation overhead, achieving strong performance on language modeling, zero-shot QA, and few-shot mathematical reasoning.
- Lookup-table (LUT) based system for efficient deployment and inference, achieving a 4.15x speedup on CPU and an average 2.63x speedup on an A100 GPU (simulator).
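For illustration, LUT-based inference stores clustered weights as small indices into a codebook, so dequantization is a pure table lookup. A minimal sketch (names and shapes hypothetical, not our actual kernel):

```python
def dequantize_lut(indices, codebook):
    """Reconstruct a weight matrix from per-weight cluster indices and a
    codebook (the LUT).

    Dequantization is a pure table lookup (gather) with no arithmetic
    decoding, which is what makes LUT-based kernels efficient.
    """
    return [[codebook[i] for i in row] for row in indices]

# 2-bit example: 4 centroids, each weight stored as a 2-bit index.
codebook = [-0.5, -0.1, 0.1, 0.5]
indices = [[0, 3, 2], [1, 1, 0]]
weights = dequantize_lut(indices, codebook)
# weights == [[-0.5, 0.5, 0.1], [-0.1, -0.1, -0.5]]
```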
Our implementation requires different transformers versions for different models.
The DeepSeek model we use is DeepSeek-V2-Lite, which requires an older transformers version; we used transformers==4.45.0.
The best practice is to install the required packages in separate environments. For the DeepSeek-V2-Lite model, use requirements-deepseek.txt. For other models (e.g. Qwen3-30B-A3B, Mixtral 8x7B), use requirements.txt.
- DeepSeek-V2-Lite model:

```shell
pip install -r requirements-deepseek.txt
```

- Other models:

```shell
pip install -r requirements.txt
```

In our framework, we decouple the configuration from the pipeline. We provide a set of example configurations for different models under the configs/ directory.
You can use our examples first for a quick start with the pipeline, then modify the configuration according to your needs.
Our scripts are stored under the script/ directory.
Follow the steps below to reproduce our results.
- Step 1: run AOS to fine-tune the rotation matrix:

```shell
cd script/
python rotation_fine_tune_script.py --config model_name.yaml
```

- Step 2: run ACCF (set `permutation=True` and `weight_group_size` to some number to enable POG):

```shell
# cd script/
python cluster_fine_tune_script.py --config model_name.yaml
```

- Step 3: evaluate. We use fake quantization for evaluation (for downstream tasks and math reasoning, we use the third-party evaluation tool lm-eval):

```shell
# cd script/
python evaluation_script.py --config model_name.yaml
```

You can use our examples or create your own configurations. To create your own configuration, follow the examples' structure and modify the following parameters:
- accelerator:
  - `device`: The accelerator to use. If you use a GPU, set it to `cuda`.
- path:
  - `rotation_data_path`: The path for saving the fine-tuned rotation matrix. We recommend an absolute path.
  - `cluster_data_path`: The path for saving clustering results. We recommend an absolute path.
- model:
  - `model_name`: The Hugging Face model path, e.g. `Qwen/Qwen3-30B-A3B`.
- calibration:
  - `dataset_name`: The calibration dataset name.
- common_setting:
  - `weight_group_size`: The group size for weight clustering. Set it to `-1` for the embedding-wise setup.
  - `input_group_size`: The group size for activation quantization. Set it to `-1` for the embedding-wise setup.
- cluster:
  - `permutation`: The switch for POG. Set it to `True` to enable POG.
  - `max_sample`: The number of calibration samples to use for clustering fine-tuning (ACCF).
  - `batch_size`: The batch size for clustering fine-tuning (ACCF).
  - `max_length`: The maximum length of input tokens. Set it to a smaller value to save memory.
  - `epochs`: The number of epochs for clustering fine-tuning (ACCF).
  - `fine_tune_lr`: The learning rate for clustering fine-tuning (ACCF). Don't use scientific notation here (e.g. `1e-3`); use decimal notation instead (e.g. `0.001`).
- rotation:
  - `max_sample`: The number of calibration samples to use for rotation fine-tuning (AOS).
  - `batch_size`: The batch size for rotation fine-tuning (AOS).
  - `max_length`: The maximum length of input tokens. Set it to a smaller value to save memory.
  - `epochs`: The number of epochs for rotation fine-tuning (AOS).
  - `fine_tune_lr`: The learning rate for rotation fine-tuning (AOS). Don't use scientific notation here (e.g. `1e-3`); use decimal notation instead (e.g. `0.001`).
- eval:
  - `activation_quantization_bit`: The bitwidth for activation quantization.
  - `weight_quantization_bit`: The bitwidth for weight quantization. This is only used for benchmark evaluation; if you evaluate a clustered model, this parameter is ignored.
  - `tasks`: The evaluation tasks, in the format `task1,task2,...,taskN`, where each task follows the lm-eval naming convention.
  - `ppls`: The perplexity tasks, in the format `ppl1,ppl2,...,pplN`, where each task is a Hugging Face dataset path.
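As a rough illustration of our reading of the group-size and bitwidth settings (values hypothetical, not the actual quantizer): a positive group size gives each contiguous chunk of values its own scale, while `-1` uses one scale for the whole embedding vector, so an outlier only degrades the resolution of its own group:

```python
def fake_quantize(vector, bits, group_size=-1):
    """Symmetric uniform fake quantization (quantize, then dequantize).

    group_size == -1 is the embedding-wise setup: the whole vector shares
    one scale. Otherwise each contiguous chunk of `group_size` values gets
    its own scale.
    """
    assert bits >= 2
    if group_size == -1:
        group_size = len(vector)
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit symmetric
    out = []
    for start in range(0, len(vector), group_size):
        group = vector[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0  # avoid scale == 0
        out.extend(max(-qmax, min(qmax, round(v / scale))) * scale
                   for v in group)
    return out

x = [0.05, -0.7, 0.02, 2.4]                        # last value is an outlier
per_group = fake_quantize(x, bits=4, group_size=2)
per_vector = fake_quantize(x, bits=4, group_size=-1)
# with group_size=2 the outlier stays in its own group, so the first two
# values are quantized with a much finer scale than in the -1 setup
```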
If you find our work useful for your research, please consider citing our paper:

```bibtex
@inproceedings{
yin2026codequant,
title={CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts},
author={Xiangyang Yin and Xingyu Liu and Tianhua Xia and BO BAO and Vithursan Thangarasa and Valavan Manohararajah and Eric Sather and Sai Qian Zhang},
booktitle={The Fourteenth International Conference on Learning Representations},
year={2026},
url={https://openreview.net/forum?id=ATpchFiBQi}
}
```

We thank the community for sharing their projects. Our project builds on top of the following existing works: