turboquant-vllm 1.0.0 — First stable release #4

Alberto-Codes · 2026-03-27T22:59:15Z

Alberto-Codes
Mar 27, 2026
Maintainer

First open-source TurboQuant implementation as a vLLM plugin. Paper to PyPI in 72 hours.

Google published TurboQuant at ICLR 2026 on March 24. By March 27, turboquant-vllm was serving compressed video inference from a stock vLLM container.

What shipped

Core TurboQuant algorithm — Lloyd-Max codebook solver, MSE quantizer, nibble-packed compressors
CompressedDynamicCache — Drop-in HuggingFace DynamicCache wrapper with incremental dequantization
vLLM TQ4 attention backend — Auto-registers via vllm.general_plugins, serves through the OpenAI-compatible API
Fused Triton kernels — Compress, decompress, Q@K^T, Flash Attention + TQ4 K/V decompression
180+ tests, 95%+ coverage, validated on NVIDIA RTX 4090 + AMD Radeon 890M (ROCm)
16 GPU experiments documenting the full research-to-production journey
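
For readers new to the terms in the first bullet, here is a minimal sketch of a 1-D Lloyd-Max codebook solver and nibble packing. It is not the package's code: the function names are hypothetical, Gaussian samples stand in for whatever distribution the real solver targets, and pure Python replaces the Triton kernels for readability.

```python
import bisect
import random


def lloyd_max(samples, levels=16, iters=50):
    """Fit a scalar codebook by alternating cell assignment and centroid update."""
    xs = sorted(samples)
    n = len(xs)
    # Start from evenly spaced quantiles of the data.
    centroids = [xs[(2 * i + 1) * n // (2 * levels)] for i in range(levels)]
    for _ in range(iters):
        # Decision boundaries are midpoints between adjacent centroids.
        bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
        sums = [0.0] * levels
        counts = [0] * levels
        k = 0
        for x in xs:  # xs is sorted, so the cell index only moves forward
            while k < levels - 1 and x > bounds[k]:
                k += 1
            sums[k] += x
            counts[k] += 1
        # Each centroid becomes the mean of its cell; empty cells keep the old value.
        centroids = [s / c if c else old
                     for s, c, old in zip(sums, counts, centroids)]
    return centroids


def quantize(values, centroids):
    """Map each value to the index of its nearest centroid (0..levels-1)."""
    bounds = [(a + b) / 2 for a, b in zip(centroids, centroids[1:])]
    return [bisect.bisect_left(bounds, v) for v in values]


def pack_nibbles(indices):
    """Pack 4-bit indices two per byte, high nibble first; pad odd lengths with 0."""
    padded = list(indices)
    if len(padded) % 2:
        padded.append(0)
    return bytes((hi << 4) | lo for hi, lo in zip(padded[0::2], padded[1::2]))


def unpack_nibbles(packed, count):
    """Inverse of pack_nibbles; count drops any padding nibble."""
    out = []
    for b in packed:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out[:count]


random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(4096)]
codebook = lloyd_max(data)            # 16 levels = 4 bits per value
codes = quantize(data[:8], codebook)
packed = pack_nibbles(codes)          # 8 values fit in 4 bytes
assert unpack_nibbles(packed, len(codes)) == codes
```

Dequantization in this sketch is just a table lookup, `codebook[code]`, which is why a 4-bit code can stand in for a wider floating-point value.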

Benchmarks

Molmo2-4B on RTX 4090 — 11K visual tokens + 256 generation tokens:

KV cache: 1,639 MiB → 435 MiB (3.76x compression)
Quality: ~97% cosine similarity, 100+ tokens match word-for-word
Overhead: 1.78x (incremental dequantization)
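
The KV cache figures above can be checked with a few lines of arithmetic. This sketch uses only the two rounded MiB values from this post; the variable names are illustrative.

```python
baseline_mib = 1639    # KV cache before compression, as reported above
compressed_mib = 435   # KV cache after compression, as reported above

ratio = baseline_mib / compressed_mib
saved_mib = baseline_mib - compressed_mib
saved_fraction = saved_mib / baseline_mib

print(f"compression ratio: {ratio:.2f}x")
print(f"memory saved: {saved_mib} MiB ({saved_fraction:.1%})")
```

The rounded inputs give about 3.77x, so the reported 3.76x was presumably computed from unrounded sizes. Either way, roughly 1,204 MiB (about 73%) of KV cache memory is freed for this workload.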

This is the first TurboQuant implementation validated on vision-language models with video input.

Install

pip install "turboquant-vllm[vllm]"
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM

What's next

Upstream vLLM contribution (vllm#38171)
Full Flash Attention-style kernel fusion for multi-layer correctness
Stacking with token pruning methods (VL-Cache) for multiplicative VLM savings

If you're working on KV cache compression or VLM inference optimization, I'd love to hear from you — especially if you've hit precision issues at long sequences.

Full changelog: v1.0.0
Deep-dive blog post: Paper to PyPI in 72 hours
