Skip to content

v1.0.0

Choose a tag to compare

@Alberto-Codes Alberto-Codes released this 27 Mar 22:25
37eee44

turboquant-vllm 1.0.0 — First stable release

First open-source TurboQuant implementation — paper to working vLLM plugin in 72 hours.

Google published TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (ICLR 2026) on March 24. By March 27, turboquant-vllm was serving compressed video inference on a stock vLLM container.

Install

pip install turboquant-vllm[vllm]

Use with vLLM (zero code changes)

vllm serve allenai/Molmo2-8B --attention-backend CUSTOM

Benchmark Results

Molmo2-4B (bfloat16, 36 layers) on RTX 4090 — 11K visual tokens from 2fps video + 256 generation tokens:

Mode KV Cache Compression Output Quality Overhead
FP16 baseline 1,639 MiB 1.0x -- --
TQ3 (3-bit) 845 MiB 1.94x ~95% cosine similarity 2.35x
TQ4 (incremental) 435 MiB 3.76x ~97% cosine, 100+ matching tokens 1.78x

What shipped

  • Core TurboQuant algorithm — Lloyd-Max codebook solver, MSE quantizer, nibble-packed compressors
  • CompressedDynamicCache — Drop-in HuggingFace DynamicCache wrapper with incremental dequantization
  • vLLM TQ4 attention backend — Auto-registers via vllm.general_plugins entry point, serves through the OpenAI-compatible API
  • Fused Triton kernels — 17.8x Q@K^T speedup, Flash Attention fusion with K+V decompression
  • 180+ tests across 9 test files, 95% coverage threshold
  • 10 GPU experiments validating compression, quality, and performance end-to-end

Validated end-to-end

Installed from PyPI into a stock vllm/vllm-openai:latest container, served Molmo2-8B video inference with --attention-backend CUSTOM. Zero code changes, zero errors.

Links