Skip to content

Keyvanhardani/turboquant-ggml

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

9 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

turboquant-ggml

This library is now part of DeepNetz — a complete LLM inference framework with 6 backends, Web UI, and KV cache optimization. Install via pip install deepnetz.

First real implementation of Google's TurboQuant (ICLR 2026) for llama.cpp and Ollama.

Compress your LLM's KV cache by 4.6x at 3-bit with near-zero quality loss. Run bigger models, longer contexts, less VRAM.

TQ4: 3.6x compression — MSE 0.003 — cosine sim > 0.99
TQ3: 4.6x compression — MSE 0.011 — cosine sim > 0.95
TQ2: 6.4x compression — aggressive, best for values

What is this?

TurboQuant (arXiv:2504.19874) is Google Research's algorithm for extreme KV cache compression. Instead of naive min/max quantization, it:

  1. Rotates vectors with Walsh-Hadamard Transform (spreads information evenly)
  2. Quantizes each coordinate with mathematically optimal Lloyd-Max codebooks
  3. Packs indices into 2/3/4 bits per element

The result: near-optimal distortion at extreme compression ratios. Your 8B model's 32K context KV cache drops from 2GB to 430MB.

Why this matters for Ollama / llama.cpp

Current (q8_0) With TurboQuant (tq3) Savings
8 bits/element 3.5 bits/element 4.6x less VRAM
2.0 GB KV cache (8B, 32K) 430 MB 1.6 GB freed
4.0 GB KV cache (8B, 64K) 870 MB 3.1 GB freed

This means: run longer contexts on the same GPU, or run bigger models that didn't fit before.

Build

git clone https://github.com/Keyvanhardani/turboquant-ggml.git
cd turboquant-ggml
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build
./build/test_turboquant   # 19/19 tests passing

Architecture

include/turboquant.h              # GGML-compatible API + block structs
src/turboquant.c                  # Core: WHT, quantize, dequantize, bit-packing
src/turboquant-tables.c           # Precomputed Lloyd-Max codebooks
tests/test_turboquant.c           # Test suite (19/19 passing)
tests/bench_turboquant.c          # Performance benchmark
ggml-integration/
  ggml-turboquant.h               # Drop-in header for llama.cpp
  ggml-turboquant.c               # Drop-in implementation for llama.cpp
  INTEGRATION.md                  # Step-by-step llama.cpp integration guide

Block formats (GGML-compatible)

// TQ3_0: 14 bytes per 32 elements (3.5 bits/elem)
typedef struct {
    uint16_t d;        // fp16 L2 norm
    uint8_t  qs[12];   // 32 x 3-bit packed
} block_tq3_0;

// TQ4_0: 18 bytes per 32 elements (4.5 bits/elem)
typedef struct {
    uint16_t d;        // fp16 L2 norm
    uint8_t  qs[16];   // 32 x 4-bit packed
} block_tq4_0;

API

#include "turboquant.h"

float kv_data[128];  // one head's KV vector

// Quantize
block_tq3_0 compressed[4];  // 128/32 = 4 blocks
quantize_row_tq3_0(kv_data, compressed, 128, 128);

// Dequantize
float reconstructed[128];
dequantize_row_tq3_0(compressed, reconstructed, 128, 128);
// cosine_similarity(kv_data, reconstructed) > 0.95

Design decisions (backed by community research)

Paper says We do Why
Random orthogonal rotation Walsh-Hadamard Transform 59x better quality (community finding)
QJL residual correction MSE-only QJL degrades quality from 80.4% to 69.6%
Block size 128 Block size 32 Better flash attention parallelism
Symmetric K/V bits Asymmetric support Keys need more precision than values

Sources: llama.cpp Discussion #20969, ik_llama.cpp #1509

llama.cpp integration

Ready-to-use GGML integration files in ggml-integration/. See INTEGRATION.md for the step-by-step guide.

5 files to modify, 2 files to add — that's it.

# Usage after integration:
./llama-server -m model.gguf --cache-type-k turbo3_0 --cache-type-v turbo3_0 --flash-attn on

# Ollama (after rebuild):
OLLAMA_KV_CACHE_TYPE=turbo3_0 ollama run llama3

Roadmap

  • Phase 1: Block structs + type registration interfaces
  • Phase 2: CPU quantize/dequantize with norm correction (pure C)
  • Phase 3: Test suite (19/19 passing) + benchmark
  • Phase 4: GGML drop-in integration files
  • Phase 5: Integration guide for llama.cpp
  • Phase 6: llama.cpp PR submission
  • Phase 7: CUDA kernels (fused FA for peak performance)
  • Phase 8: Ollama native support

How it works

Input vector (FP16)
    |
    v
[Store L2 norm] ─── norm stored as FP16 (2 bytes)
    |
    v
[Normalize to unit sphere]
    |
    v
[Walsh-Hadamard Transform] ─── O(n log n), self-inverse
    |                           spreads info evenly across coords
    v
[Lloyd-Max quantization] ─── optimal codebook for Beta distribution
    |                         precomputed, zero per-block overhead
    v
[Bit-pack to 2/3/4 bits] ─── compact storage
    |
    v
Compressed block (14 bytes for 32 FP16 elements)

Benchmarks — Real Model Perplexity

Tested on wikitext-2 (ctx=512, 5 chunks) with our llama.cpp integration:

Model f16 PPL turbo4_0 PPL Delta KV Memory
Llama-3.2-3B Q4_K_M 9.77 9.82 +0.4% 224 -> 63 MiB
Qwen2.5-3B Q4_K_M 9.14 9.84 +7.7% 72 -> 20 MiB
Qwen3-4B Q4_K_M 17.78 16.61 -6.6% 288 -> 81 MiB
Gemma-3-27B Q2_K 8.53 8.70 +2.0%
Qwen3VL-8B Q4_K_M 8.15 8.57 +5.2% 288 -> 81 MiB
Qwen3VL-30B-A3B Q4_K_M 6.24 6.63 +6.3% 192 -> 54 MiB
Qwen3.5-35B-A3B Q4_K_XL 5.91 6.07 +2.7% 40 -> 11 MiB
Llama-3.3-70B IQ2_M 4.91
Qwen3.5-122B-A10B IQ2_XXS

Tested from 3B to 122B — all models produce correct output with turbo4_0 KV cache.

Generation speed (RTX 4060, 8GB VRAM)

Model f16 turbo4_0 Overhead
Qwen3.5-35B-A3B 7.6 tok/s 7.4 tok/s -2.6%
Gemma-3-27B 2.4 tok/s 2.3 tok/s -4.2%
Qwen3.5-122B-A10B 1.3 tok/s 1.3 tok/s ~0%
Llama-3.3-70B 0.7 tok/s 0.4 tok/s

Near-zero speed overhead. Cross-platform verified (WSL2 + Windows).

Synthetic benchmarks (10k vectors, dim=128)

Type Bits/elem Compression MSE Cosine Sim
TQ4_0 4.5 3.6x 0.008 0.996
TQ3_0 3.5 4.6x 0.031 0.985
TQ2_0 2.5 6.4x 0.112 0.944

Community validation (from llama.cpp #20969)

Real-world results from independent implementations:

Source Hardware Finding
Madreag RTX 5090 turbo2 beats q8_0 by 5.4% at 32K context
TheTom M5 Max turbo3 at 98.7-99.5% of q8_0 speed, PPL +1.1%
Aaryan-Kapoor CPU Zero speed penalty on prompt processing (20.1 vs 19.3 t/s)
scos-lab 8 models K/V norm ratio predicts optimal bit allocation
sjoerdmaessen 2x L40S 2x128K dual-slot from 1x82K — zero decode penalty
AmesianX DGX Spark tbqp3/tbq3 at 5.2x compression, +1.1% PPL
spiritbuun RTX 3090 98.8% of q8_0 prefill speed

Key optimization: Norm correction

Stores original_norm / ||reconstruction|| instead of raw norm. Zero decode cost, measurable quality improvement (-0.36% PPL). Discovered independently by TheTom (turbo3) and spiritbuun (turbo4).

K/V asymmetry (scos-lab finding)

Keys need more precision than values. Qwen models show 100-180x K/V norm ratio. Recommended: 4-bit K, 2-3 bit V.

Author

Keyvan HardaniGitHub | LinkedIn

License

MIT

Citation

If you use this implementation, please cite both the original paper and this work:

@software{hardani2026turboquant_ggml,
  title={turboquant-ggml: TurboQuant KV Cache Compression for llama.cpp and Ollama},
  author={Hardani, Keyvan},
  url={https://github.com/Keyvanhardani/turboquant-ggml},
  year={2026},
  note={GGML-compatible implementation with norm correction, WHT rotation,
        and Lloyd-Max optimal codebooks. 42 tests, drop-in integration.}
}

@inproceedings{zandieh2026turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Ali and Mirrokni, Vahab},
  booktitle={ICLR},
  year={2026}
}

About

TurboQuant for llama.cpp — First real integration of Google's extreme KV cache compression (ICLR 2026). 3-bit keys, near-zero quality loss, 6x memory savings.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors