CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts

International Conference on Learning Representations (ICLR), 2026

📄 OpenReview

This repository provides the official implementation of CodeQuant, a unified clustering and quantization framework for Mixture-of-Experts (MoE) Large Language Models (LLMs). CodeQuant addresses activation outliers with a fine-tuned rotation and a robust clustering method, enabling efficient low-precision deployment.

(Figure: CodeQuant Overview)

⭐️Highlights

  • Unified rotation and clustering framework for low-precision deployment of MoE LLMs, with carefully designed MoE-specific fine-tuning objectives.
  • Fully offline quantization with no on-the-fly computation overhead, achieving strong performance on language modeling, zero-shot QA, and few-shot mathematical reasoning.
  • Lookup-table (LUT) based system for efficient deployment and inference, achieving a 4.15x speedup on CPU and an average 2.63x speedup on an A100 GPU (simulator).
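To make the LUT idea concrete, here is a minimal sketch (not the paper's actual kernels): weights are replaced by small integer indices into a per-group codebook, so inference-time dequantization is a pure table lookup. The codebook size (8 entries, i.e. 3-bit indices) and the nearest-centroid assignment below are illustrative assumptions, not the repository's method.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = np.sort(rng.normal(size=8))   # 8 representative values (3-bit codes)
weights = rng.normal(size=(4, 16))       # original full-precision weights

# Quantize offline: map each weight to the index of its nearest codebook entry.
indices = np.abs(weights[..., None] - codebook).argmin(axis=-1).astype(np.uint8)

# At inference, "dequantization" is a single table lookup, no arithmetic.
reconstructed = codebook[indices]

print(indices.max(), reconstructed.shape)
```

Because only the small index tensor and codebook are stored, the memory footprint shrinks and the lookup maps well onto CPU and GPU hardware.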

🔧Requirements

Our implementation requires different transformers versions for different models. The DeepSeek model we use is DeepSeek-V2-Lite, which requires an older transformers version; we used transformers==4.45.0. The best practice is to install the required packages separately: for DeepSeek-V2-Lite, use requirements-deepseek.txt; for other models (e.g. Qwen3-30B-A3B, Mixtral 8x7B), use requirements.txt.

  • DeepSeek-V2-Lite Model:
pip install -r requirements-deepseek.txt
  • Other Models:
pip install -r requirements.txt

👨‍💻Pipeline

In our framework, we decouple the configuration from the pipeline. We provide a set of examples for different models under the configs/ directory. You can use these examples first for a quick start with the pipeline, then modify the configuration to fit your needs.

🎯Run:

Our scripts are stored under the script/ directory. Follow these steps to reproduce our results.

  • Step 1: run AOS to fine-tune the rotation matrix:
cd script/
python rotation_fine_tune_script.py --config model_name.yaml
  • Step 2: run ACCF (set permutation=True and weight_group_size to a positive group size to enable POG):
# cd script/
python cluster_fine_tune_script.py --config model_name.yaml
  • Step 3: evaluate; we use fake quantization for evaluation (for downstream tasks and math reasoning, we use the third-party evaluation tool lm-eval):
# cd script/
python evaluation_script.py --config model_name.yaml
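For readers unfamiliar with the term, "fake quantization" means quantizing and immediately dequantizing, so the model still runs in floating point while seeing quantization error. The sketch below is a generic symmetric per-group version, not the repository's implementation; the group_size=-1 convention (one group per embedding) mirrors the config documentation, and everything else is an assumption.

```python
import numpy as np

def fake_quantize(x, n_bits=4, group_size=-1):
    """Symmetric per-group fake quantization: round to an n-bit grid,
    then immediately scale back to floating point.
    group_size=-1 treats the whole last dimension as one group
    (the embedding-wise setting described in the config section)."""
    orig_shape = x.shape
    if group_size == -1:
        group_size = x.shape[-1]
    x = x.reshape(-1, group_size)
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)           # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)  # integer grid
    return (q * scale).reshape(orig_shape)             # back to float

x = np.random.default_rng(0).normal(size=(2, 8)).astype(np.float32)
xq = fake_quantize(x, n_bits=4, group_size=4)
print(xq.shape)
```

Evaluating with fake quantization measures accuracy under quantization error without needing the low-precision kernels themselves.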

🔍Config:

You can use our examples or create your own configurations. To create your own, follow the examples' structure and modify the following parameters:

  • accelerator:
    • device: The accelerator to use. If you use a GPU, set it to cuda.
  • path:
    • rotation_data_path: The path to save the fine-tuned rotation matrix. An absolute path is preferred.
    • cluster_data_path: The path to save the clustering results. An absolute path is preferred.
  • model:
    • model_name: The Hugging Face model path, e.g. Qwen/Qwen3-30B-A3B.
  • calibration:
    • dataset_name: The calibration dataset name.
  • common_setting:
    • weight_group_size: The group size for weight clustering. Set it to -1 for embedding-wise setup.
    • input_group_size: The group size for activation quantization. Set it to -1 for embedding-wise setup.
  • cluster:
    • permutation: The switch for POG. Set it to True to enable POG.
    • max_sample: The number of calibration samples to use for clustering fine-tune (ACCF).
    • batch_size: The batch size for clustering fine-tune (ACCF).
    • max_length: The maximum length of input tokens. Set it to a smaller value to save memory.
    • epochs: The number of epochs for clustering fine-tune (ACCF).
    • fine_tune_lr: The learning rate for clustering fine-tune (ACCF). Don't use scientific notation here (e.g. 1e-3). Use decimal notation instead (e.g. 0.001).
  • rotation:
    • max_sample: The number of calibration samples to use for rotation fine-tune (AOS).
    • batch_size: The batch size for rotation fine-tune (AOS).
    • max_length: The maximum length of input tokens. Set it to a smaller value to save memory.
    • epochs: The number of epochs for rotation fine-tune (AOS).
    • fine_tune_lr: The learning rate for rotation fine-tune (AOS). Don't use scientific notation here (e.g. 1e-3). Use decimal notation instead (e.g. 0.001).
  • eval:
    • activation_quantization_bit: The bitwidth for activation quantization.
    • weight_quantization_bit: The bitwidth for weight quantization. This is only used for benchmark evaluation. If you evaluate a clustered model, this parameter will not be used.
    • tasks: The evaluation tasks, in the format task1,task2,...,taskN where each task follows the naming convention of lm-eval.
    • ppls: The perplexity tasks, in the format ppl1,ppl2,...,pplN where each entry is a Hugging Face dataset path.
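Putting the fields above together, a configuration might look like the following sketch. The structure follows the parameter list above, but every concrete value (paths, dataset name, group sizes, sample counts, tasks) is illustrative, not taken from the repository's example configs.

```yaml
accelerator:
  device: cuda
path:
  rotation_data_path: /abs/path/to/rotation_data    # absolute paths preferred
  cluster_data_path: /abs/path/to/cluster_data
model:
  model_name: Qwen/Qwen3-30B-A3B
calibration:
  dataset_name: wikitext          # illustrative calibration dataset
common_setting:
  weight_group_size: 128          # -1 for embedding-wise
  input_group_size: -1
cluster:
  permutation: True               # enable POG
  max_sample: 128
  batch_size: 4
  max_length: 2048
  epochs: 1
  fine_tune_lr: 0.001             # decimal notation, not 1e-3
rotation:
  max_sample: 128
  batch_size: 4
  max_length: 2048
  epochs: 1
  fine_tune_lr: 0.001
eval:
  activation_quantization_bit: 4
  weight_quantization_bit: 4
  tasks: arc_easy,hellaswag       # lm-eval task names
  ppls: wikitext                  # Hugging Face dataset path
```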

📚Citation

If you find our work useful for your research, please consider citing our paper:

@inproceedings{yin2026codequant,
    title={CodeQuant: Unified Clustering and Quantization for Enhanced Outlier Smoothing in Low-Precision Mixture-of-Experts},
    author={Xiangyang Yin and Xingyu Liu and Tianhua Xia and BO BAO and Vithursan Thangarasa and Valavan Manohararajah and Eric Sather and Sai Qian Zhang},
    booktitle={The Fourteenth International Conference on Learning Representations},
    year={2026},
    url={https://openreview.net/forum?id=ATpchFiBQi}
}

🤝Contributing

We thank the community for sharing their projects. Our project builds on top of existing open-source works.
