turboquant-vllm 1.0.0 — First stable release

First open-source TurboQuant implementation — paper to working vLLM plugin in 72 hours.

Google published TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate (ICLR 2026) on March 24. By March 27, turboquant-vllm was serving compressed video inference on a stock vLLM container.

Install

pip install turboquant-vllm[vllm]

Use with vLLM (zero code changes)

vllm serve allenai/Molmo2-8B --attention-backend CUSTOM

Benchmark Results

Molmo2-4B (bfloat16, 36 layers) on RTX 4090 — 11K visual tokens from 2fps video + 256 generation tokens:

Mode	KV Cache	Compression	Output Quality	Overhead
FP16 baseline	1,639 MiB	1.0x	--	--
TQ3 (3-bit)	845 MiB	1.94x	~95% cosine similarity	2.35x
TQ4 (incremental)	435 MiB	3.76x	~97% cosine, 100+ matching tokens	1.78x

What shipped

Core TurboQuant algorithm — Lloyd-Max codebook solver, MSE quantizer, nibble-packed compressors
CompressedDynamicCache — Drop-in HuggingFace DynamicCache wrapper with incremental dequantization
vLLM TQ4 attention backend — Auto-registers via vllm.general_plugins entry point, serves through the OpenAI-compatible API
Fused Triton kernels — 17.8x Q@K^T speedup, Flash Attention fusion with K+V decompression
180+ tests across 9 test files, 95% coverage threshold
10 GPU experiments validating compression, quality, and performance end-to-end

Validated end-to-end

Installed from PyPI into a stock vllm/vllm-openai:latest container, served Molmo2-8B video inference with --attention-backend CUSTOM. Zero code changes, zero errors.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

v1.0.0

Choose a tag to compare

Sorry, something went wrong.