This library is now part of DeepNetz, a complete LLM inference framework with 6 backends, a Web UI, and KV cache optimization. Install via `pip install deepnetz`.
First real implementation of Google's TurboQuant (ICLR 2026) for llama.cpp and Ollama.
Compress your LLM's KV cache by 4.6x at 3-bit with near-zero quality loss. Run bigger models, longer contexts, less VRAM.
TQ4: 3.6x compression — MSE 0.003 — cosine sim > 0.99
TQ3: 4.6x compression — MSE 0.011 — cosine sim > 0.95
TQ2: 6.4x compression — aggressive, best for values
TurboQuant (arXiv:2504.19874) is Google Research's algorithm for extreme KV cache compression. Instead of naive min/max quantization, it:
- Rotates vectors with Walsh-Hadamard Transform (spreads information evenly)
- Quantizes each coordinate with mathematically optimal Lloyd-Max codebooks
- Packs indices into 2/3/4 bits per element
The result: near-optimal distortion at extreme compression ratios. Your 8B model's 32K context KV cache drops from 2GB to 430MB.
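The rotation step can be sketched as an in-place fast Walsh-Hadamard transform. This is a minimal illustration, not the code in src/turboquant.c: the function name `wht_inplace` is hypothetical, and scaling each butterfly stage by 1/sqrt(2) is one common way to keep the transform orthonormal, which also makes it its own inverse.

```c
#include <assert.h>
#include <stddef.h>

/* In-place fast Walsh-Hadamard transform, O(n log n).
 * n must be a power of two. Each butterfly stage is scaled by 1/sqrt(2),
 * so the whole transform is orthonormal and applying it twice returns
 * the original vector. (Hypothetical helper, for illustration only.) */
void wht_inplace(float *x, size_t n) {
    const float inv_sqrt2 = 0.70710678118654752f;
    assert(n != 0 && (n & (n - 1)) == 0);
    for (size_t len = 1; len < n; len <<= 1) {
        for (size_t i = 0; i < n; i += 2 * len) {
            for (size_t j = i; j < i + len; ++j) {
                float a = x[j];
                float b = x[j + len];
                x[j] = (a + b) * inv_sqrt2;
                x[j + len] = (a - b) * inv_sqrt2;
            }
        }
    }
}
```

Applied to a one-hot vector of length 32, this yields 32 coordinates of equal magnitude 1/sqrt(32), which is the "spreads information evenly" effect. A second application recovers the input.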
| Current (q8_0) | With TurboQuant (tq3) | Savings |
|---|---|---|
| 8 bits/element | 3.5 bits/element | 4.6x less VRAM |
| 2.0 GB KV cache (8B, 32K) | 430 MB | 1.6 GB freed |
| 4.0 GB KV cache (8B, 64K) | 870 MB | 3.1 GB freed |
This means: run longer contexts on the same GPU, or run bigger models that didn't fit before.
git clone https://github.com/Keyvanhardani/turboquant-ggml.git
cd turboquant-ggml
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
./build/test_turboquant # 19/19 tests passing

include/turboquant.h # GGML-compatible API + block structs
src/turboquant.c # Core: WHT, quantize, dequantize, bit-packing
src/turboquant-tables.c # Precomputed Lloyd-Max codebooks
tests/test_turboquant.c # Test suite (19/19 passing)
tests/bench_turboquant.c # Performance benchmark
ggml-integration/
ggml-turboquant.h # Drop-in header for llama.cpp
ggml-turboquant.c # Drop-in implementation for llama.cpp
INTEGRATION.md # Step-by-step llama.cpp integration guide
// TQ3_0: 14 bytes per 32 elements (3.5 bits/elem)
typedef struct {
uint16_t d; // fp16 L2 norm
uint8_t qs[12]; // 32 x 3-bit packed
} block_tq3_0;
// TQ4_0: 18 bytes per 32 elements (4.5 bits/elem)
typedef struct {
uint16_t d; // fp16 L2 norm
uint8_t qs[16]; // 32 x 4-bit packed
} block_tq4_0;

#include "turboquant.h"
float kv_data[128]; // one head's KV vector
// Quantize
block_tq3_0 compressed[4]; // 128/32 = 4 blocks
quantize_row_tq3_0(kv_data, compressed, 128, 128);
// Dequantize
float reconstructed[128];
dequantize_row_tq3_0(compressed, reconstructed, 128, 128);
// cosine_similarity(kv_data, reconstructed) > 0.95

| Paper says | We do | Why |
|---|---|---|
| Random orthogonal rotation | Walsh-Hadamard Transform | 59x better quality (community finding) |
| QJL residual correction | MSE-only | QJL degrades quality from 80.4% to 69.6% |
| Block size 128 | Block size 32 | Better flash attention parallelism |
| Symmetric K/V bits | Asymmetric support | Keys need more precision than values |
Sources: llama.cpp Discussion #20969, ik_llama.cpp #1509
Ready-to-use GGML integration files in ggml-integration/. See INTEGRATION.md for the step-by-step guide.
5 files to modify, 2 files to add — that's it.
# Usage after integration:
./llama-server -m model.gguf --cache-type-k turbo3_0 --cache-type-v turbo3_0 --flash-attn on
# Ollama (after rebuild):
OLLAMA_KV_CACHE_TYPE=turbo3_0 ollama run llama3

- Phase 1: Block structs + type registration interfaces
- Phase 2: CPU quantize/dequantize with norm correction (pure C)
- Phase 3: Test suite (19/19 passing) + benchmark
- Phase 4: GGML drop-in integration files
- Phase 5: Integration guide for llama.cpp
- Phase 6: llama.cpp PR submission
- Phase 7: CUDA kernels (fused FA for peak performance)
- Phase 8: Ollama native support
Input vector (FP16)
|
v
[Store L2 norm] ─── norm stored as FP16 (2 bytes)
|
v
[Normalize to unit sphere]
|
v
[Walsh-Hadamard Transform] ─── O(n log n), self-inverse
| spreads info evenly across coords
v
[Lloyd-Max quantization] ─── optimal codebook for Beta distribution
| precomputed, zero per-block overhead
v
[Bit-pack to 2/3/4 bits] ─── compact storage
|
v
Compressed block (14 bytes for 32 FP16 elements)
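The last stage of the pipeline, for the 3-bit case, can be sketched as follows: 32 codebook indices of 3 bits each make 96 bits, which fills exactly 12 bytes. The names `pack3` and `unpack3` are hypothetical, and the LSB-first bit order is an assumption. The layout used in src/turboquant.c may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TQ3_BLOCK 32 /* elements per block */
#define TQ3_BYTES 12 /* 32 * 3 bits = 96 bits */

/* Pack 32 codebook indices (each in 0..7) into 12 bytes, LSB first.
 * Bit order is an assumption made for this sketch. */
void pack3(const uint8_t idx[TQ3_BLOCK], uint8_t out[TQ3_BYTES]) {
    memset(out, 0, TQ3_BYTES);
    for (int i = 0; i < TQ3_BLOCK; ++i) {
        assert(idx[i] < 8);
        int bit = 3 * i;     /* position of this index in the bit stream */
        int byte = bit >> 3;
        int shift = bit & 7;
        out[byte] |= (uint8_t)(idx[i] << shift);
        if (shift > 5) {     /* index straddles a byte boundary */
            out[byte + 1] |= (uint8_t)(idx[i] >> (8 - shift));
        }
    }
}

/* Inverse of pack3: recover the 32 indices from 12 packed bytes. */
void unpack3(const uint8_t in[TQ3_BYTES], uint8_t idx[TQ3_BLOCK]) {
    for (int i = 0; i < TQ3_BLOCK; ++i) {
        int bit = 3 * i;
        int byte = bit >> 3;
        int shift = bit & 7;
        unsigned v = (unsigned)in[byte] >> shift;
        if (shift > 5) {
            v |= (unsigned)in[byte + 1] << (8 - shift);
        }
        idx[i] = (uint8_t)(v & 7u);
    }
}
```

Adding the 2-byte fp16 norm to these 12 bytes gives the 14-byte block shown at the bottom of the diagram.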
Tested on wikitext-2 (ctx=512, 5 chunks) with our llama.cpp integration:
| Model | f16 PPL | turbo4_0 PPL | Delta | KV Memory |
|---|---|---|---|---|
| Llama-3.2-3B Q4_K_M | 9.77 | 9.82 | +0.4% | 224 -> 63 MiB |
| Qwen2.5-3B Q4_K_M | 9.14 | 9.84 | +7.7% | 72 -> 20 MiB |
| Qwen3-4B Q4_K_M | 17.78 | 16.61 | -6.6% | 288 -> 81 MiB |
| Gemma-3-27B Q2_K | 8.53 | 8.70 | +2.0% | — |
| Qwen3VL-8B Q4_K_M | 8.15 | 8.57 | +5.2% | 288 -> 81 MiB |
| Qwen3VL-30B-A3B Q4_K_M | 6.24 | 6.63 | +6.3% | 192 -> 54 MiB |
| Qwen3.5-35B-A3B Q4_K_XL | 5.91 | 6.07 | +2.7% | 40 -> 11 MiB |
| Llama-3.3-70B IQ2_M | 4.91 | — | — | — |
| Qwen3.5-122B-A10B IQ2_XXS | — | — | — | — |
Tested from 3B to 122B — all models produce correct output with turbo4_0 KV cache.
| Model | f16 | turbo4_0 | Overhead |
|---|---|---|---|
| Qwen3.5-35B-A3B | 7.6 tok/s | 7.4 tok/s | -2.6% |
| Gemma-3-27B | 2.4 tok/s | 2.3 tok/s | -4.2% |
| Qwen3.5-122B-A10B | 1.3 tok/s | 1.3 tok/s | ~0% |
| Llama-3.3-70B | 0.7 tok/s | 0.4 tok/s | — |
Speed overhead stays under 5% on the three models with a reported overhead figure (the Llama-3.3-70B row has none). Cross-platform verified (WSL2 + Windows).
| Type | Bits/elem | Compression | MSE | Cosine Sim |
|---|---|---|---|---|
| TQ4_0 | 4.5 | 3.6x | 0.008 | 0.996 |
| TQ3_0 | 3.5 | 4.6x | 0.031 | 0.985 |
| TQ2_0 | 2.5 | 6.4x | 0.112 | 0.944 |
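The bits-per-element and compression columns follow directly from the block sizes. The sketch below reproduces that arithmetic. The helper names are hypothetical, the ratios are taken relative to a 16-bit (fp16) cache, and the 10-byte TQ2_0 block size is inferred from the 2.5 bits/elem figure rather than from a struct shown in this README.

```c
#include <assert.h>
#include <stddef.h>

#define TQ_BLOCK_ELEMS 32 /* elements per block */

/* Storage cost in bits per element for a block of `block_bytes` bytes
 * (2-byte fp16 norm plus packed indices). */
double bits_per_elem(size_t block_bytes) {
    assert(block_bytes > 0);
    return (double)block_bytes * 8.0 / TQ_BLOCK_ELEMS;
}

/* Compression ratio relative to an fp16 cache (16 bits per element). */
double ratio_vs_f16(size_t block_bytes) {
    return 16.0 / bits_per_elem(block_bytes);
}
```

With 18, 14, and 10 bytes per block this gives 4.5, 3.5, and 2.5 bits per element, and ratios of about 3.6x, 4.6x, and 6.4x.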
Community validation (from llama.cpp #20969)
Real-world results from independent implementations:
| Source | Hardware | Finding |
|---|---|---|
| Madreag | RTX 5090 | turbo2 beats q8_0 by 5.4% at 32K context |
| TheTom | M5 Max | turbo3 at 98.7-99.5% of q8_0 speed, PPL +1.1% |
| Aaryan-Kapoor | CPU | Zero speed penalty on prompt processing (20.1 vs 19.3 t/s) |
| scos-lab | 8 models | K/V norm ratio predicts optimal bit allocation |
| sjoerdmaessen | 2x L40S | 2x128K dual-slot from 1x82K — zero decode penalty |
| AmesianX | DGX Spark | tbqp3/tbq3 at 5.2x compression, +1.1% PPL |
| spiritbuun | RTX 3090 | 98.8% of q8_0 prefill speed |
Norm correction: each block stores original_norm / ||reconstruction|| instead of the raw norm. This adds zero decode cost and gives a measurable quality improvement (-0.36% PPL). It was discovered independently by TheTom (turbo3) and spiritbuun (turbo4).
Keys need more precision than values. Qwen models show a 100-180x K/V norm ratio. Recommended: 4-bit K, 2-3 bit V.
Keyvan Hardani — GitHub | LinkedIn
MIT
If you use this implementation, please cite both the original paper and this work:
@software{hardani2026turboquant_ggml,
title={turboquant-ggml: TurboQuant KV Cache Compression for llama.cpp and Ollama},
author={Hardani, Keyvan},
url={https://github.com/Keyvanhardani/turboquant-ggml},
year={2026},
note={GGML-compatible implementation with norm correction, WHT rotation,
and Lloyd-Max optimal codebooks. 42 tests, drop-in integration.}
}
@inproceedings{zandieh2026turboquant,
title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
author={Zandieh, Amir and Daliri, Majid and Hadian, Ali and Mirrokni, Vahab},
booktitle={ICLR},
year={2026}
}