v1.0.0
turboquant-vllm 1.0.0 — First stable release
First open-source TurboQuant implementation — paper to working vLLM plugin in 72 hours.
Google published TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (ICLR 2026) on March 24. By March 27, turboquant-vllm was serving compressed video inference on a stock vLLM container.
Install
pip install turboquant-vllm[vllm]
Use with vLLM (zero code changes)
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
Benchmark Results
Molmo2-4B (bfloat16, 36 layers) on RTX 4090 — 11K visual tokens from 2fps video + 256 generation tokens:
| Mode | KV Cache | Compression | Output Quality | Overhead |
|---|---|---|---|---|
| FP16 baseline | 1,639 MiB | 1.0x | -- | -- |
| TQ3 (3-bit) | 845 MiB | 1.94x | ~95% cosine similarity | 2.35x |
| TQ4 (incremental) | 435 MiB | 3.76x | ~97% cosine, 100+ matching tokens | 1.78x |
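The Compression column can be reproduced from the KV Cache column as baseline size divided by compressed size. The sketch below does that arithmetic on the MiB figures in the table; the helper name `compression_ratio` is illustrative and not part of the package.

```python
def compression_ratio(baseline_mib: float, compressed_mib: float) -> float:
    """Return how many times smaller the compressed cache is than the baseline."""
    return baseline_mib / compressed_mib


# KV cache sizes in MiB, copied from the benchmark table.
FP16_MIB = 1639
MODES = [("TQ3", 845), ("TQ4", 435)]

for mode, size_mib in MODES:
    ratio = compression_ratio(FP16_MIB, size_mib)
    saved_mib = FP16_MIB - size_mib
    print(f"{mode}: {ratio:.2f}x smaller, {saved_mib} MiB saved")
```

Because the table lists sizes rounded to whole MiB, a ratio recomputed this way can differ from the reported figure in the last digit.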
What shipped
- Core TurboQuant algorithm — Lloyd-Max codebook solver, MSE quantizer, nibble-packed compressors
- CompressedDynamicCache — Drop-in HuggingFace DynamicCache wrapper with incremental dequantization
- vLLM TQ4 attention backend — Auto-registers via the vllm.general_plugins entry point, serves through the OpenAI-compatible API
- Fused Triton kernels — 17.8x Q@K^T speedup, Flash Attention fusion with K+V decompression
- 180+ tests across 9 test files, 95% coverage threshold
- 10 GPU experiments validating compression, quality, and performance end-to-end
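To make the first bullet concrete, here is a minimal pure-Python sketch of two of the named building blocks: a Lloyd-Max scalar codebook fitted by alternating nearest-level assignment and centroid updates, and nibble packing of the resulting 4-bit indices (two per byte). It assumes Gaussian training samples as a stand-in for real key/value data, and every function name is hypothetical. It is not the package's API, and it omits the other parts of the TurboQuant pipeline, such as the random rotation described in the paper.

```python
import random


def quantize(x, codebook):
    """Return the index of the codebook level nearest to x (minimum squared error)."""
    return min(range(len(codebook)), key=lambda i: (x - codebook[i]) ** 2)


def lloyd_max(samples, levels=16, iters=20):
    """Fit a scalar codebook by Lloyd-Max iterations on empirical samples."""
    data = sorted(samples)
    n = len(data)
    # Start from evenly spaced quantiles so every level begins with data nearby.
    codebook = [data[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        sums = [0.0] * levels
        counts = [0] * levels
        # Assignment step: each sample goes to its nearest level.
        for x in data:
            idx = quantize(x, codebook)
            sums[idx] += x
            counts[idx] += 1
        # Update step: each level moves to the mean of its assigned samples.
        codebook = [
            sums[i] / counts[i] if counts[i] else codebook[i]
            for i in range(levels)
        ]
    return codebook


def pack_nibbles(indices):
    """Pack 4-bit indices two per byte, first index in the high nibble."""
    indices = list(indices)
    if len(indices) % 2:
        indices.append(0)  # pad to an even count
    return bytes((hi << 4) | lo for hi, lo in zip(indices[0::2], indices[1::2]))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; count drops any padding nibble."""
    out = []
    for byte in packed:
        out.append(byte >> 4)
        out.append(byte & 0x0F)
    return out[:count]


rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(2000)]
codebook = lloyd_max(train, levels=16, iters=20)

vector = [rng.gauss(0.0, 1.0) for _ in range(8)]
codes = [quantize(x, codebook) for x in vector]
packed = pack_nibbles(codes)  # 8 values stored in 4 bytes
restored = [codebook[i] for i in unpack_nibbles(packed, len(codes))]
```

With 16 levels each index fits in 4 bits, which is why two indices share one byte; a 3-bit codebook would need a different packing scheme.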
Validated end-to-end
Installed from PyPI into a stock vllm/vllm-openai:latest container, served Molmo2-8B video inference with --attention-backend CUSTOM. Zero code changes, zero errors.