turboquant-vllm v1.3.0 — Seven validated model families #58

Alberto-Codes · 2026-04-01T00:52:59Z

Alberto-Codes
Apr 1, 2026
Maintainer

v1.3.0 expands model support from Molmo2-only to seven validated families. Fused paged kernels (v1.2.0) provide the performance foundation; v1.3.0 adds the kernel-level changes for architectural diversity.

What's new since v1.1.0

This release covers v1.2.0 through v1.3.0 — fused kernels, model expansion, and two production hotfixes.

Fused paged TQ4 kernels (v1.2.0) — decompress + attend in a single SRAM pass. HBM traffic: 1,160 → 136 bytes/token (8.5x reduction). Includes both decode and INT8 prefill paths.
Non-pow2 head_dim support (v1.3.0) — pad-to-pow2 + boundary masking across all 5 Triton kernels. Enables head_dim 96 (Phi-3-mini) and 256 (Gemma-2/3). ~5–15% throughput penalty for non-pow2 dims.
Sliding window attention bypass (v1.3.0) — Gemma-2/3 SWA layers bypass compression automatically. Global layers compress normally.
Verify CLI (v1.2.0) — python -m turboquant_vllm.verify --model <name> --bits 4 checks any model in ~30 seconds.
7 validated model families — Molmo2, Llama 3.1 8B, Mistral 7B, Qwen2.5-3B, Phi-3-mini, Phi-4, Gemma-2-2b, Gemma-3-4B-it. All pass at cosine ≥0.99.
Production hotfixes (v1.2.1, v1.2.2) — container benchmarks found OOM bugs in fused kernel scratch buffers. Both patched within 24 hours.
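The pad-to-pow2 plus boundary-masking idea behind the non-pow2 head_dim support can be sketched in plain NumPy. This is an illustration of the general technique only, not the project's Triton kernels: `next_pow2` and `padded_dot` are hypothetical names, and the NaN padding is an assumption used to stand in for whatever lies past the true head_dim in memory.

```python
import numpy as np


def next_pow2(n: int) -> int:
    """Smallest power of two >= n (n must be positive)."""
    return 1 << (n - 1).bit_length()


def padded_dot(q: np.ndarray, k: np.ndarray) -> float:
    """Dot product computed over a block padded to a power of two.

    Mimics a block-based kernel: load a pow2-sized block, mask the
    lanes past the true head_dim, then reduce over the whole block.
    """
    head_dim = q.shape[0]
    block = next_pow2(head_dim)          # e.g. 96 -> 128
    offsets = np.arange(block)
    mask = offsets < head_dim            # boundary mask

    # Pad with NaN to simulate unknown data past the boundary
    # (assumption for illustration), then apply the mask so that
    # out-of-range lanes contribute exactly 0.0 to the reduction.
    pad = (0, block - head_dim)
    q_blk = np.where(mask, np.pad(q, pad, constant_values=np.nan), 0.0)
    k_blk = np.where(mask, np.pad(k, pad, constant_values=np.nan), 0.0)
    return float(np.sum(q_blk * k_blk))


rng = np.random.default_rng(0)
q = rng.standard_normal(96)              # head_dim 96, as in Phi-3-mini
k = rng.standard_normal(96)
print(next_pow2(96), np.isclose(padded_dot(q, k), np.dot(q, k)))
```

The padded lanes still occupy block slots even though they contribute nothing, which is one plausible source of the throughput penalty quoted above for non-pow2 dims.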

Benchmarks

VLM (Molmo2-4B, FP16 baseline): 3.76x KV compression, ~97% cosine similarity
Text-only (Llama 3.1 / Mistral, FP8 baseline): 1.88x KV capacity, lossless at temperature=0
16K context: TQ4 serves 6 concurrent requests vs. 3 for the baseline
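The ratios quoted in this post can be cross-checked with a few lines of arithmetic. The sketch below assumes the FP16 and FP8 baselines store 16 and 8 bits per KV element and that the same effective TQ4 storage cost applies in both benchmarks. The function names are hypothetical, and the post does not break down where the bits above 4 go.

```python
def effective_bits(baseline_bits: float, compression_ratio: float) -> float:
    """Effective bits per KV element implied by a compression ratio."""
    return baseline_bits / compression_ratio


def ratio_against(baseline_bits: float, bits_per_element: float) -> float:
    """Compression ratio of a format relative to a given baseline width."""
    return baseline_bits / bits_per_element


# 3.76x vs FP16 implies roughly 4.26 effective bits per element.
tq4_bits = effective_bits(16, 3.76)

# The same storage cost measured against FP8 gives the 1.88x figure.
vs_fp8 = ratio_against(8, tq4_bits)

# HBM traffic figures from the fused-kernel item: 1,160 -> 136 bytes/token.
hbm_reduction = 1160 / 136

print(round(tq4_bits, 2), round(vs_fp8, 2), round(hbm_reduction, 1))
# prints: 4.26 1.88 8.5
```

Under these assumptions the 1.88x FP8 figure is exactly half of the 3.76x FP16 figure, so the two benchmark lines are consistent with each other.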

Install / Upgrade

pip install "turboquant-vllm[vllm]>=1.3.0"
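After upgrading, the installed version can be confirmed from Python using only the standard library. This is a minimal sketch: the helper names are hypothetical, the distribution name is taken from the install command above, and the parser only handles plain numeric releases such as 1.3.0.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(v: str) -> tuple[int, ...]:
    """Parse the leading numeric components of a version like '1.3.0'."""
    parts = []
    for piece in v.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "1.3.0") -> bool:
    """True if `installed` is at least `minimum` (numeric parts only)."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.3' and '1.3.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_installed(dist: str = "turboquant-vllm", minimum: str = "1.3.0") -> bool:
    """True if the distribution is installed at `minimum` or newer."""
    try:
        return meets_minimum(version(dist), minimum)
    except PackageNotFoundError:
        return False


print(check_installed())
```

For anything beyond simple X.Y.Z releases (pre-releases, local versions), a full version parser would be a better fit than this hand-rolled one.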

What's Next

Upstream vLLM contribution (vllm#38171 — 49 upvotes)
Flash Attention kernel fusion for multi-layer decode
VL-Cache stacking for multiplicative VLM savings

Full changelog: v1.2.0 | v1.2.1 | v1.2.2 | v1.3.0
Blog post: From one model to seven — making TurboQuant model-portable
Docs: alberto-codes.github.io/turboquant-vllm
