
Official implementation of "LittleBit: Ultra Low-Bit Quantization via Latent Factorization" (NeurIPS 2025)


LittleBit: Ultra Low-Bit Quantization via Latent Factorization

Banseok Lee*, Dongkyu Kim*, Youngcheon You, Youngmin Kim

*Equal Contribution, Corresponding Author

arXiv NeurIPS


📢 Abstract

LittleBit is a novel method for extreme LLM compression that targets levels around 0.1 bits per weight (BPW). It represents each weight matrix in low-rank form via latent matrix factorization and then binarizes the factors, achieving nearly 31× memory reduction (e.g., Llama2-13B fits in under 0.9 GB). To counteract the resulting information loss, it adds a multi-scale compensation mechanism that learns scales along the row and column dimensions, plus an additional latent dimension that captures per-rank importance.
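To make the weight representation concrete, below is a minimal, illustrative PyTorch sketch of a reconstruction of this form: binarized low-rank factors combined with learned row, column, and per-rank (latent) scales. The function and variable names are ours, not this repository's API, and the exact formulation in the paper may differ.

import torch

def reconstruct_weight(U, V, s_row, s_col, s_rank):
    # Illustrative only: binarize the low-rank factors, then re-scale along
    # the row, column, and latent (per-rank) dimensions.
    Ub = torch.sign(U)   # (out_features, rank), sign bits (±1)
    Vb = torch.sign(V)   # (in_features, rank)
    W_hat = s_row[:, None] * ((Ub * s_rank) @ Vb.T) * s_col[None, :]
    return W_hat         # (out_features, in_features)

out_f, in_f, rank = 128, 256, 8
W_hat = reconstruct_weight(
    torch.randn(out_f, rank), torch.randn(in_f, rank),
    torch.rand(out_f), torch.rand(in_f), torch.rand(rank),
)
# Only the sign bits of U and V plus three small scale vectors are stored,
# so storage grows with rank * (out_features + in_features) bits rather than
# 16 * out_features * in_features bits as in FP16.
print(W_hat.shape)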


✨ Key Features

🧠 Model Architecture & Support

  • Extreme Compression: Targets the 0.1 BPW regime.
  • High Efficiency: Nearly 31× memory reduction compared to FP16.
  • Novel Method: Latent matrix factorization with binarization and multi-scale compensation.

🏗️ Supported Models

The codebase currently supports the following architectures:

  • OPT
  • Llama (Llama-2, Llama-3)
  • Phi-4
  • Qwen2.5 (QwQ)
  • Gemma 2 & Gemma 3

💿 Installation

Set up the environment with Conda and pip. Python 3.12 is recommended.

conda create -n littlebit python=3.12
conda activate littlebit

# Install CUDA toolkit (adjust version as necessary)
conda install nvidia/label/cuda-12.4.1::cuda-toolkit -c nvidia/label/cuda-12.4.1

# Install PyTorch
pip install torch==2.8.0+cu124 torchvision==0.23.0+cu124 torchaudio==2.8.0+cu124 --index-url https://download.pytorch.org/whl/cu124

# Install dependencies
pip install -r requirements.txt
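
Optionally, sanity-check that the CUDA-enabled PyTorch build is active before training. The expected values below simply reflect the install commands above:

import torch

print(torch.__version__)          # expected: 2.8.0+cu124
print(torch.version.cuda)         # expected: 12.4
print(torch.cuda.is_available())  # expected: True on a GPU machine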

🚀 Usage

1. Training (QAT)

Train the model using Quantization-Aware Training (QAT) with the LittleBit approach.

Single GPU Example:

CUDA_VISIBLE_DEVICES=0 python -m main \
    --model_id meta-llama/Llama-2-7b-hf \
    --dataset c4_wiki \
    --save_dir ./outputs/Llama-2-7b-LittleBit \
    --num_train_epochs 5.0 \
    --per_device_train_batch_size 4 \
    --lr 4e-05 \
    --warmup_ratio 0.02 \
    --report wandb \
    --quant_func SmoothSign \
    --quant_mod LittleBitLinear \
    --residual True \
    --eff_bit 1.0 \
    --kv_factor 1.0 \
    --l2l_loss_scale 10.0

Multi-GPU (DeepSpeed) Example:

deepspeed --num_gpus=4 main.py \
    --model_id meta-llama/Llama-2-7b-hf \
    --dataset c4_wiki \
    --save_dir ./outputs/Llama-2-7b-LittleBit \
    --ds_config_path configs/ds_config.json \
    --num_train_epochs 5.0 \
    --per_device_train_batch_size 4 \
    --lr 4e-05 \
    --report wandb \
    --quant_func SmoothSign \
    --quant_mod LittleBitLinear \
    --residual True \
    --eff_bit 1.0
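
The multi-GPU run reads its DeepSpeed settings from the file passed via --ds_config_path (configs/ds_config.json above). If you want to adapt it, a minimal configuration along these lines (bf16, ZeRO stage 2, micro-batch size matching --per_device_train_batch_size) is a reasonable starting point. The values and the output path below are illustrative assumptions, not the repository's shipped settings:

import json

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # keep in sync with --per_device_train_batch_size
    "gradient_accumulation_steps": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},    # shard optimizer states and gradients
    "gradient_clipping": 1.0,
}

# Hypothetical path; pass it to --ds_config_path instead of the default config.
with open("configs/ds_config_custom.json", "w") as f:
    json.dump(ds_config, f, indent=2)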

2. Evaluation

Evaluate the trained LittleBit model on perplexity (PPL) tasks and zero-shot benchmarks.

CUDA_VISIBLE_DEVICES=0 python -m eval \
    --model_type llama \
    --model_id ./outputs/Llama-2-7b-LittleBit \
    --quant_func SmoothSign \
    --quant_mod LittleBitLinear \
    --residual True \
    --eff_bit 1.0 \
    --kv_factor 1.0 \
    --ppl_task wikitext2,c4 \
    --zeroshot_task boolq,piqa,hellaswag,winogrande,arc_easy,arc_challenge,openbookqa
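
For reference, wikitext2 perplexity is commonly computed as a windowed negative log-likelihood over the concatenated test split; the eval entry point above handles this (and loading the LittleBit layers) for you. A rough stand-alone sketch on the FP16 base model, using Hugging Face transformers and datasets, looks like this; it is an approximation for orientation, not this repository's evaluation code:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # FP16 baseline; LittleBit checkpoints require this repo's modules
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
).eval()

# Concatenate the test split and score it in non-overlapping 2048-token windows.
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
ids = tok(text, return_tensors="pt").input_ids
seqlen, nlls = 2048, []
for i in range(0, ids.size(1) - seqlen + 1, seqlen):
    chunk = ids[:, i:i + seqlen].to(model.device)
    with torch.no_grad():
        nlls.append(model(chunk, labels=chunk).loss.float())
print("wikitext2 PPL:", torch.exp(torch.stack(nlls).mean()).item())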

📝 Citation

If you find this work useful, please cite our paper:

@inproceedings{littlebit,
  title={LittleBit: Ultra Low-Bit Quantization via Latent Factorization},
  author={Lee, Banseok and Kim, Dongkyu and You, Youngcheon and Kim, Youngmin},
  booktitle={Advances in Neural Information Processing Systems},
  year={2025},
}

⚖️ License

This project is licensed under the CC BY-NC 4.0 license.
