Compress your LLM's KV cache by 8–32× with near-zero accuracy loss. PCA decorrelation + adaptive quantization + entropy coding.
First open-source implementation of NVIDIA's KVTC (ICLR 2026). No model changes required — compress, store, decompress, resume.
KV caches grow linearly with context length and can consume multiple gigabytes for long conversations. KVTC compresses them for compact storage using the same ideas behind JPEG: decorrelate, quantize, entropy-code.
Key insight: KV cache vectors have strong low-rank structure. PCA exposes this structure, then a DP algorithm allocates bits only where they matter. The rest gets pruned to zero — free dimensionality reduction.
| Method | Compression | Accuracy Retention | Approach |
|---|---|---|---|
| KVTC | 20× | < 1% loss | PCA + DP quantization + DEFLATE |
| KIVI | 2.6× | Moderate | 2-bit asymmetric quantization |
| GEAR | 4× | Good | Low-rank + quantization |
| H2O | 4–8× | Task-dependent | Token eviction |
| xKV | 8–16× | Strong | SVD-based compression |
| Mode | Key Cosine | Value Cosine | Compression | Use Case |
|---|---|---|---|---|
| High Quality | 0.9996 | 0.9982 | 7.9× | Production serving |
| Balanced | 0.9950 | 0.9818 | 16.2× | Memory-constrained |
| Aggressive | 0.9895 | 0.9654 | 31.7× | Maximum compression |
Tested on 32 layers × 8 KV heads × head_dim=128. Attention sinks (first 4 tokens) and sliding window (last 128 tokens) preserved exactly — only middle tokens are compressed.
```
Input: KV Cache [layers, tokens, heads, dim]
        │
        ├── Sinks (first 4 tokens) ──────────────── stored exactly
        ├── Window (last 128 tokens) ────────────── stored exactly
        │
        └── Middle tokens
                │
                ▼
       ┌─────────────────┐
       │   Undo RoPE     │  (keys only — exposes low-rank structure)
       │   on keys       │
       └────────┬────────┘
                ▼
       ┌─────────────────┐
       │  PCA Transform  │  Calibrated offline, one-time per model
       │   decorrelate   │  V^T · (x - μ) → principal components
       └────────┬────────┘
                ▼
       ┌─────────────────┐
       │   DP Quantize   │  Optimal bit allocation per component
       │ 0–16 bits/comp  │  0 bits = component pruned entirely
       └────────┬────────┘
                ▼
       ┌─────────────────┐
       │   Bit Pack +    │  Variable-width packing → zlib DEFLATE
       │    DEFLATE      │  Lossless entropy coding
       └────────┬────────┘
                ▼
Output: CompressedKVCache (bytes + metadata)
```
Decompression reverses the pipeline: DEFLATE → unpack → dequantize → PCA inverse → reapply RoPE → concatenate with sinks and window.
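The RoPE undo/reapply step can be sketched as follows, assuming the HuggingFace-style rotate-half convention. The helper names here are illustrative, not the repo's `src/pca.py` API; the point is that rotating by the negative angle exactly inverts the position-dependent rotation, so the step is lossless:

```python
import torch

def rope_angles(positions, dim, base=10000.0):
    # Standard RoPE frequencies; cos/sin are duplicated across both halves
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions.float()[:, None] * inv_freq[None, :]  # [tokens, dim/2]
    emb = torch.cat((angles, angles), dim=-1)                # [tokens, dim]
    return emb.cos(), emb.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(x, cos, sin):
    # Rotate each (x1, x2) pair by +theta(position)
    return x * cos + rotate_half(x) * sin

def undo_rope(x, cos, sin):
    # Rotate by -theta(position): cos(-t) = cos(t), sin(-t) = -sin(t)
    return x * cos - rotate_half(x) * sin
```

Because the rotation is orthogonal, `undo_rope(apply_rope(x, ...), ...)` recovers `x` up to floating-point error, which is why keys can be de-rotated before PCA and re-rotated after decompression.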
```shell
git clone https://github.com/OnlyTerp/kvtc.git
cd kvtc
pip install -e ".[dev]"
pytest src/test_kvtc.py  # 38 tests
```

```python
import torch

from src.pca import PCACalibrator
from src.pipeline import KVTCCompressor

# Simulate a KV cache: [layers, tokens, heads, dim]
kv_cache = {
    "keys": torch.randn(2, 256, 4, 64),
    "values": torch.randn(2, 256, 4, 64),
}
positions = torch.arange(256)

# Step 1: Calibrate PCA (one-time, per model)
calibrator = PCACalibrator(head_group_size=1)
for layer_idx in range(2):
    calibrator.collect(layer_idx, "keys", kv_cache["keys"][layer_idx], positions)
    calibrator.collect(layer_idx, "values", kv_cache["values"][layer_idx])
calibration = calibrator.compute(bit_budget_ratio=0.12)  # 12% = ~16x compression

# Step 2: Compress
compressor = KVTCCompressor(calibration)
compressed = compressor.compress(kv_cache, positions)
print(f"Compression ratio: {compressed.metadata.compression_ratio:.1f}x")

# Step 3: Decompress (lossless for sinks/window, lossy for middle)
restored = compressor.decompress(compressed)
```

| `bit_budget_ratio` | Compression | Quality | When to use |
|---|---|---|---|
| 0.25 | ~8× | Excellent (0.999+ cosine) | Production, quality-critical |
| 0.12 | ~16× | Very good (0.995 cosine) | Balanced memory/quality |
| 0.06 | ~32× | Good (0.989 cosine) | Maximum savings, long context |
- Reference implementation — Pure PyTorch on CPU. Not optimized for production throughput.
- Entropy coding is CPU-only — Uses `zlib` DEFLATE. The paper uses NVIDIA's `nvCOMP` for GPU-accelerated DEFLATE.
- DP quantization is O(d × B × 16) — Fast enough for reference use, but production would need optimized kernels.
- Compression is slow (~30-60s per prompt on CPU) — The DP + PCA transform dominates. GPU acceleration would fix this.
- Tested on 2 models — TinyLlama-1.1B and Mistral-7B. More model validation welcome.
- Not affiliated with NVIDIA — Independent implementation from the public paper.
- Collect KV cache samples from a calibration dataset (10 texts, ~2 seconds)
- Undo RoPE on keys before computing PCA — RoPE rotation hides low-rank structure
- Compute SVD per `(layer, head, key/value)` to get eigenvectors and eigenvalues
- At compression time: project vectors into PCA space via matrix multiply
- Eigenvalues sorted descending — first components capture most variance
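The calibration math above can be sketched in a few lines (illustrative helpers, not the `PCACalibrator` API): SVD of the centered samples yields variance-ordered eigenvectors, and projection is a single matrix multiply.

```python
import torch

def calibrate_pca(samples):
    """samples: [n, dim] flattened KV vectors for one (layer, head, key/value)."""
    mu = samples.mean(dim=0)
    centered = samples - mu
    # Rows of Vt are eigenvectors of the sample covariance, variance-ordered
    _, S, Vt = torch.linalg.svd(centered, full_matrices=False)
    eigvals = S ** 2 / (samples.shape[0] - 1)  # sorted descending
    return mu, Vt, eigvals

def project(x, mu, Vt):
    # V^T (x - mu): decorrelated principal components
    return (x - mu) @ Vt.T

def unproject(z, mu, Vt):
    # Inverse transform (Vt is orthonormal, so its transpose inverts it)
    return z @ Vt + mu
```

Because `Vt` is orthonormal, `unproject(project(x, ...), ...)` is an exact roundtrip; all loss comes from the quantization stage that follows.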
- DP algorithm over eigenvalues and bit budget minimizes total reconstruction error
- Error model: `λᵢ / 4^bᵢ` — each bit halves the quantization step, reducing MSE by 4×
- Components assigned 0 bits are pruned entirely (dimensionality reduction for free)
- Uniform affine quantization within each bit width: `scale = (max - min) / (2^b - 1)`
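The allocation and quantization steps above can be sketched with a small DP over components and bit budget, using the `λᵢ / 4^bᵢ` error model, plus a uniform affine quantizer. This is illustrative code, not the repo's `src/quantize.py` internals:

```python
import torch

def allocate_bits(eigvals, budget, max_bits=16):
    """Minimize sum(lam / 4**b) subject to sum(b) <= budget, 0 <= b <= max_bits."""
    d, INF = len(eigvals), float("inf")
    dp = [0.0] * (budget + 1)                      # dp[j]: min error using j bits
    choice = [[0] * (budget + 1) for _ in range(d)]
    for i, lam in enumerate(eigvals):
        new = [INF] * (budget + 1)
        for j in range(budget + 1):
            for b in range(min(max_bits, j) + 1):  # bits given to component i
                err = dp[j - b] + lam / (4 ** b)
                if err < new[j]:
                    new[j], choice[i][j] = err, b
        dp = new
    bits, j = [], budget                           # backtrack the allocation
    for i in range(d - 1, -1, -1):
        bits.append(choice[i][j])
        j -= choice[i][j]
    return bits[::-1], dp[budget]

def quant_dequant(x, b):
    """Uniform affine quantization: scale = (max - min) / (2^b - 1)."""
    if b == 0:
        return torch.zeros_like(x)                 # 0-bit components are pruned
    lo, hi = x.min(), x.max()
    scale = (hi - lo).clamp(min=1e-12) / (2 ** b - 1)
    return torch.round((x - lo) / scale) * scale + lo
```

The DP naturally starves low-variance components: with a tight budget, their optimal allocation is 0 bits, which is exactly the free dimensionality reduction described above.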
- Pack variable-width quantized indices into compact byte stream
- Apply zlib DEFLATE for lossless compression of statistical redundancy
- Typically adds 1.2–1.5× additional compression beyond quantization alone
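The packing and entropy-coding stage can be sketched as follows (a pure-Python illustration; the repo's `src/entropy.py` may differ in layout details):

```python
import zlib

def pack_bits(values, widths):
    """Pack integers into a little-endian bit stream; assumes 0 <= v < 2**w."""
    buf, acc, nbits = bytearray(), 0, 0
    for v, w in zip(values, widths):
        acc |= v << nbits
        nbits += w
        while nbits >= 8:
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    if nbits:
        buf.append(acc & 0xFF)                     # flush the partial final byte
    return bytes(buf)

def unpack_bits(data, widths):
    """Recover the integers given the same per-value bit widths."""
    acc, out, shift = int.from_bytes(data, "little"), [], 0
    for w in widths:
        out.append((acc >> shift) & ((1 << w) - 1))
        shift += w
    return out

# DEFLATE on top of the packed stream squeezes out remaining redundancy
blob = zlib.compress(pack_bits([3, 15, 0, 7], [2, 4, 3, 3]))
```

Decompression mirrors this: `unpack_bits(zlib.decompress(blob), widths)` restores the quantized indices exactly, since both steps are lossless.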
- Attention sinks (first 4 tokens): Never compressed. These receive disproportionate attention weight regardless of content.
- Sliding window (last 128 tokens): Never compressed. Most relevant context for next-token generation.
- Ablation studies in the paper show compressing these tokens collapses accuracy at high compression ratios.
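The token split can be sketched as below (the helper is illustrative; sink and window sizes match the defaults above):

```python
import torch

def split_cache(kv, num_sinks=4, window=128):
    """Split along the token axis into [sinks | middle | window].

    kv: [layers, tokens, heads, dim]; only `middle` goes through the
    lossy pipeline, sinks and window are stored exactly.
    """
    tokens = kv.shape[1]
    sinks = kv[:, :num_sinks]
    middle = kv[:, num_sinks:tokens - window]
    tail = kv[:, tokens - window:]
    return sinks, middle, tail
```

After decompression, `torch.cat([sinks, restored_middle, tail], dim=1)` reassembles the cache in the original token order.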
```
kvtc/
├── src/
│   ├── __init__.py           # Package exports
│   ├── pca.py                # PCA calibration, RoPE undo/reapply
│   ├── quantize.py           # DP bit allocation, uniform quantization
│   ├── entropy.py            # Bit packing, zlib DEFLATE
│   ├── pipeline.py           # Full KVTCCompressor (compress/decompress)
│   ├── cache.py              # HuggingFace DynamicCache wrapper
│   ├── calibrate.py          # Model calibration utilities
│   ├── common.py             # Shared dataclasses
│   ├── test_kvtc.py          # 38 unit tests
│   └── test_real_model.py    # Optional TinyLlama integration test
├── notebooks/
│   └── demo.ipynb            # Colab notebook
├── deploy/
│   ├── Dockerfile
│   └── run.sh
├── .github/workflows/test.yml
├── IMPLEMENTATION_NOTES.md   # Detailed algorithm documentation
├── CONTRIBUTING.md           # How to contribute
├── BENCHMARKS.md             # Full benchmark results
├── LICENSE                   # MIT
├── README.md
└── setup.py
```
```bibtex
@inproceedings{staniszewski2026kvtc,
  title={KV Cache Transform Coding for Compact Storage in LLM Inference},
  author={Staniszewski, Konrad and {\L}a{\'n}cucki, Adrian},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}
```

This is an independent open-source implementation of the KVTC algorithm. All credit for the algorithm design and research belongs to the paper authors at NVIDIA.
- Paper: KV Cache Transform Coding for Compact Storage in LLM Inference — Accepted at ICLR 2026
- Authors: Konrad Staniszewski, Adrian Łańcucki (NVIDIA)
- Implementation: Terp AI Labs
Not affiliated with or endorsed by NVIDIA. Built from the public paper to make KVTC accessible to the open-source community.
MIT — see LICENSE for details.