turboquant-ggml

This library is now part of DeepNetz — a complete LLM inference framework with 6 backends, Web UI, and KV cache optimization. Install via pip install deepnetz.

First real implementation of Google's TurboQuant (ICLR 2026) for llama.cpp and Ollama.

Compress your LLM's KV cache by 4.6x at 3-bit with near-zero quality loss. Run bigger models, longer contexts, less VRAM.

TQ4: 3.6x compression — MSE 0.003 — cosine sim > 0.99
TQ3: 4.6x compression — MSE 0.011 — cosine sim > 0.95
TQ2: 6.4x compression — aggressive, best for values

What is this?

TurboQuant (arXiv:2504.19874) is Google Research's algorithm for extreme KV cache compression. Instead of naive min/max quantization, it:

Rotates vectors with Walsh-Hadamard Transform (spreads information evenly)
Quantizes each coordinate with mathematically optimal Lloyd-Max codebooks
Packs indices into 2/3/4 bits per element

The result: near-optimal distortion at extreme compression ratios. Your 8B model's 32K context KV cache drops from 2GB to 430MB.

Why this matters for Ollama / llama.cpp

Current (`q8_0`)	With TurboQuant (`tq3`)	Savings
8 bits/element	3.5 bits/element	4.6x less VRAM
2.0 GB KV cache (8B, 32K)	430 MB	1.6 GB freed
4.0 GB KV cache (8B, 64K)	870 MB	3.1 GB freed

This means: run longer contexts on the same GPU, or run bigger models that didn't fit before.

Build

git clone https://github.com/Keyvanhardani/turboquant-ggml.git
cd turboquant-ggml
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
./build/test_turboquant   # 19/19 tests passing

Architecture

include/turboquant.h              # GGML-compatible API + block structs
src/turboquant.c                  # Core: WHT, quantize, dequantize, bit-packing
src/turboquant-tables.c           # Precomputed Lloyd-Max codebooks
tests/test_turboquant.c           # Test suite (19/19 passing)
tests/bench_turboquant.c          # Performance benchmark
ggml-integration/
  ggml-turboquant.h               # Drop-in header for llama.cpp
  ggml-turboquant.c               # Drop-in implementation for llama.cpp
  INTEGRATION.md                  # Step-by-step llama.cpp integration guide

Block formats (GGML-compatible)

// TQ3_0: 14 bytes per 32 elements (3.5 bits/elem)
typedef struct {
    uint16_t d;        // fp16 L2 norm
    uint8_t  qs[12];   // 32 x 3-bit packed
} block_tq3_0;

// TQ4_0: 18 bytes per 32 elements (4.5 bits/elem)
typedef struct {
    uint16_t d;        // fp16 L2 norm
    uint8_t  qs[16];   // 32 x 4-bit packed
} block_tq4_0;

API

#include "turboquant.h"

float kv_data[128];  // one head's KV vector

// Quantize
block_tq3_0 compressed[4];  // 128/32 = 4 blocks
quantize_row_tq3_0(kv_data, compressed, 128, 128);

// Dequantize
float reconstructed[128];
dequantize_row_tq3_0(compressed, reconstructed, 128, 128);
// cosine_similarity(kv_data, reconstructed) > 0.95

Design decisions (backed by community research)

Paper says	We do	Why
Random orthogonal rotation	Walsh-Hadamard Transform	59x better quality (community finding)
QJL residual correction	MSE-only	QJL degrades quality from 80.4% to 69.6%
Block size 128	Block size 32	Better flash attention parallelism
Symmetric K/V bits	Asymmetric support	Keys need more precision than values

Sources: llama.cpp Discussion #20969, ik_llama.cpp #1509

llama.cpp integration

Ready-to-use GGML integration files in ggml-integration/. See INTEGRATION.md for the step-by-step guide.

5 files to modify, 2 files to add — that's it.

# Usage after integration:
./llama-server -m model.gguf --cache-type-k turbo3_0 --cache-type-v turbo3_0 --flash-attn on

# Ollama (after rebuild):
OLLAMA_KV_CACHE_TYPE=turbo3_0 ollama run llama3

Roadmap

Phase 1: Block structs + type registration interfaces
Phase 2: CPU quantize/dequantize with norm correction (pure C)
Phase 3: Test suite (19/19 passing) + benchmark
Phase 4: GGML drop-in integration files
Phase 5: Integration guide for llama.cpp
Phase 6: llama.cpp PR submission
Phase 7: CUDA kernels (fused FA for peak performance)
Phase 8: Ollama native support

How it works

Input vector (FP16)
    |
    v
[Store L2 norm] ─── norm stored as FP16 (2 bytes)
    |
    v
[Normalize to unit sphere]
    |
    v
[Walsh-Hadamard Transform] ─── O(n log n), self-inverse
    |                           spreads info evenly across coords
    v
[Lloyd-Max quantization] ─── optimal codebook for Beta distribution
    |                         precomputed, zero per-block overhead
    v
[Bit-pack to 2/3/4 bits] ─── compact storage
    |
    v
Compressed block (14 bytes for 32 FP16 elements)

Benchmarks — Real Model Perplexity

Tested on wikitext-2 (ctx=512, 5 chunks) with our llama.cpp integration:

Model	f16 PPL	turbo4_0 PPL	Delta	KV Memory
Llama-3.2-3B Q4_K_M	9.77	9.82	+0.4%	224 -> 63 MiB
Qwen2.5-3B Q4_K_M	9.14	9.84	+7.7%	72 -> 20 MiB
Qwen3-4B Q4_K_M	17.78	16.61	-6.6%	288 -> 81 MiB
Gemma-3-27B Q2_K	8.53	8.70	+2.0%	—
Qwen3VL-8B Q4_K_M	8.15	8.57	+5.2%	288 -> 81 MiB
Qwen3VL-30B-A3B Q4_K_M	6.24	6.63	+6.3%	192 -> 54 MiB
Qwen3.5-35B-A3B Q4_K_XL	5.91	6.07	+2.7%	40 -> 11 MiB
Llama-3.3-70B IQ2_M	4.91	—	—	—
Qwen3.5-122B-A10B IQ2_XXS	—	—	—	—

Tested from 3B to 122B — all models produce correct output with turbo4_0 KV cache.

Generation speed (RTX 4060, 8GB VRAM)

Model	f16	turbo4_0	Overhead
Qwen3.5-35B-A3B	7.6 tok/s	7.4 tok/s	-2.6%
Gemma-3-27B	2.4 tok/s	2.3 tok/s	-4.2%
Qwen3.5-122B-A10B	1.3 tok/s	1.3 tok/s	~0%
Llama-3.3-70B	0.7 tok/s	0.4 tok/s	—

Near-zero speed overhead. Cross-platform verified (WSL2 + Windows).

Synthetic benchmarks (10k vectors, dim=128)

Type	Bits/elem	Compression	MSE	Cosine Sim
TQ4_0	4.5	3.6x	0.008	0.996
TQ3_0	3.5	4.6x	0.031	0.985
TQ2_0	2.5	6.4x	0.112	0.944

Community validation (from llama.cpp #20969)

Real-world results from independent implementations:

Source	Hardware	Finding
Madreag	RTX 5090	turbo2 beats q8_0 by 5.4% at 32K context
TheTom	M5 Max	turbo3 at 98.7-99.5% of q8_0 speed, PPL +1.1%
Aaryan-Kapoor	CPU	Zero speed penalty on prompt processing (20.1 vs 19.3 t/s)
scos-lab	8 models	K/V norm ratio predicts optimal bit allocation
sjoerdmaessen	2x L40S	2x128K dual-slot from 1x82K — zero decode penalty
AmesianX	DGX Spark	tbqp3/tbq3 at 5.2x compression, +1.1% PPL
spiritbuun	RTX 3090	98.8% of q8_0 prefill speed

Key optimization: Norm correction

Stores original_norm / ||reconstruction|| instead of raw norm. Zero decode cost, measurable quality improvement (-0.36% PPL). Discovered independently by TheTom (turbo3) and spiritbuun (turbo4).

K/V asymmetry (scos-lab finding)

Keys need more precision than values. Qwen models show 100-180x K/V norm ratio. Recommended: 4-bit K, 2-3 bit V.

Author

Keyvan Hardani — GitHub | LinkedIn

License

MIT

Citation

If you use this implementation, please cite both the original paper and this work:

@software{hardani2026turboquant_ggml,
  title={turboquant-ggml: TurboQuant KV Cache Compression for llama.cpp and Ollama},
  author={Hardani, Keyvan},
  url={https://github.com/Keyvanhardani/turboquant-ggml},
  year={2026},
  note={GGML-compatible implementation with norm correction, WHT rotation,
        and Lloyd-Max optimal codebooks. 42 tests, drop-in integration.}
}

@inproceedings{zandieh2026turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Ali and Mirrokni, Vahab},
  booktitle={ICLR},
  year={2026}
}

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
ggml-integration		ggml-integration
include		include
src		src
tests		tests
tools		tools
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md
windows-test-instructions.md		windows-test-instructions.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

turboquant-ggml

What is this?

Why this matters for Ollama / llama.cpp

Build

Architecture

Block formats (GGML-compatible)

API

Design decisions (backed by community research)

llama.cpp integration

Roadmap

How it works

Benchmarks — Real Model Perplexity

Generation speed (RTX 4060, 8GB VRAM)

Synthetic benchmarks (10k vectors, dim=128)

Community validation (from llama.cpp #20969)

Key optimization: Norm correction

K/V asymmetry (scos-lab finding)

Author

License

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

turboquant-ggml

What is this?

Why this matters for Ollama / llama.cpp

Build

Architecture

Block formats (GGML-compatible)

API

Design decisions (backed by community research)

llama.cpp integration

Roadmap

How it works

Benchmarks — Real Model Perplexity

Generation speed (RTX 4060, 8GB VRAM)

Synthetic benchmarks (10k vectors, dim=128)

Community validation (from llama.cpp #20969)

Key optimization: Norm correction

K/V asymmetry (scos-lab finding)

Author

License

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages