Beyond the Per-Vector Shannon Limit
Gregory Magarshak — gmagarshak@faculty.ienyc.edu
arXiv: 2604.15356
Per-vector KV cache quantization methods — culminating in TurboQuant — have reached the Shannon entropy limit for compressing individual KV vectors. This repo implements a two-layer architecture that operates against a different, strictly lower limit: the sequential entropy of the KV cache, bounded by the model's per-token surprisal.
The core result (Theorem 1):
H(KV_i | KV_{<i}) ≤ H(t_i | t_{<i})
At typical perplexity 10–20 on fluent text, this bound is 3.3–4.3 bits per token position — vs TurboQuant's 3 bits per component (64–128 components per attention head). The two layers are orthogonal to per-vector methods and stack beneath them.
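As a quick check of the quoted range: per the bound above, the average of `H(t_i | t_{<i})` over fluent text equals `log2(perplexity)`, so perplexity 10–20 implies roughly 3.3–4.3 bits per token position. A minimal sketch:

```python
import math

def sequential_bound_bits(perplexity: float) -> float:
    """Average per-token surprisal implied by a given perplexity.

    The sequential bound averages to log2(PP) bits per token position,
    independent of how many components each KV vector has.
    """
    return math.log2(perplexity)

for pp in (10, 20):
    print(f"PP={pp}: {sequential_bound_bits(pp):.2f} bits/token")
    # PP=10 -> 3.32 bits/token, PP=20 -> 4.32 bits/token
```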
| Layer | Method | What it eliminates |
|---|---|---|
| 1 | Probabilistic prefix deduplication | Cross-session redundancy via PLT trie metric |
| 2 | Predictive delta coding | Within-session redundancy bounded by token surprisal |
| + | Any per-vector quantizer (e.g. TurboQuant) | Residual magnitude |
Compression improves with context length (Corollary 5) — the opposite of per-vector methods, whose cost grows linearly.
```
sequential-kv-compression/
├── paper/
│   ├── main.tex                  # LaTeX source
│   └── references.bib
├── src/
│   ├── predictive_delta.py       # Layer 2: top-k KV predictor + residual coder
│   ├── prefix_index.py           # Layer 1: PLT trie-based prefix store
│   └── adaptive_quantizer.py     # Bound calculations + surprisal-adaptive quantizer
├── adapters/
│   ├── base.py                   # Abstract adapter interface + AdapterConfig
│   ├── hf.py                     # HuggingFace transformers (reference / dev)
│   ├── vllm_adapter.py           # vLLM (PagedAttention)
│   ├── sglang_adapter.py         # SGLang (RadixAttention)
│   ├── tensorrt_adapter.py       # TensorRT-LLM
│   ├── llamacpp_adapter.py       # llama.cpp (via llama-cpp-python)
│   ├── gguf_adapter.py           # Direct GGUF analysis (no runtime required)
│   └── safetensors_adapter.py    # Direct Safetensors analysis (no runtime required)
├── experiments/
│   ├── measure_residuals.py      # Measure actual ‖R_i‖ on real models
│   └── plot_residual_vs_surprisal.py  # Plot results + theoretical bound overlay
├── notebooks/
│   ├── compression_bounds.ipynb  # Reproduce Corollaries 1/2/5, Theorem 3
│   └── perplexity_sweep.ipynb    # Compression ratios at varying perplexity
├── requirements.txt
├── CITATION.cff
└── LICENSE
```
```bash
git clone https://github.com/magarshak/sequential-kv-compression
cd sequential-kv-compression
pip install torch transformers datasets numpy matplotlib jupyter safetensors gguf
```

This requires no GPU and no model download. It runs in seconds.
Step 1. Install core dependencies:

```bash
pip install numpy matplotlib
```

Step 2. Print the compression ratio table (Corollary 2):

```bash
python src/adaptive_quantizer.py
```

Expected output:
```
Sequential KV Compression — Theoretical Bounds
============================================================
  PP   h_bar   vs fp16 (theoretical)   vs TurboQuant (theoretical)   vs TurboQuant (1000x floor)
----------------------------------------------------------------------------------------------------
   5    2.32              9,031,942x                    1,693,489x                      1,693.5x
  10    3.32              6,313,057x                    1,183,698x                      1,183.7x
  20    4.32              4,852,353x                      909,816x                        909.8x
  50    5.64              3,715,814x                      696,715x                        696.7x
```
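The ratios above are reproducible from `h_bar = log2(PP)` if one assumes an fp16 baseline of `2 · L · H · d` KV components per token position with 70B-class dimensions (L=80 layers, H=64 heads, d=128 head dim, full MHA) and TurboQuant at 3 bits per component. Those dimensions are an inference from the numbers, not stated here; a sketch under that assumption:

```python
import math

# Assumed baseline (inferred, not stated above): 70B-class dims, full MHA.
L, H, d = 80, 64, 128             # layers, heads, head dim
components = 2 * L * H * d        # K and V components per token position
fp16_bits = components * 16       # fp16 baseline, bits per token
tq_bits = components * 3          # TurboQuant at 3 bits/component

for pp in (5, 10, 20, 50):
    h_bar = math.log2(pp)         # sequential bound, bits/token
    print(f"PP={pp:>2}  h_bar={h_bar:.2f}  "
          f"vs fp16 {fp16_bits / h_bar:,.0f}x  "
          f"vs TurboQuant {tq_bits / h_bar:,.0f}x  "
          f"(1000x floor: {tq_bits / h_bar / 1000:,.1f}x)")
```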
Step 3. Reproduce all figures from the paper interactively:

```bash
jupyter notebook notebooks/compression_bounds.ipynb
```

This notebook produces:
- Compression ratio vs perplexity on log scale (3 overhead scenarios)
- Asymptotic improvement with context length (Corollary 5)
- Rate-distortion bound curves (Theorem 3)
- Zero-rate thresholds: positions where zero bits are needed
This is the key empirical test — checking whether the theoretical bound from Corollary 4 is tight on actual trained models. It requires a GPU and a HuggingFace model.
Step 1. Install requirements:

```bash
pip install torch transformers datasets
```

Step 2. Run the measurement. Start with a small model to verify the setup:
```bash
# Small model, fast test (~5 min on a single GPU)
python experiments/measure_residuals.py \
    --model Qwen/Qwen2.5-0.5B \
    --dataset wikitext \
    --num-samples 50 \
    --layer 0 \
    --output results/residuals_qwen05b.npz
```

For a production-quality measurement:
```bash
# 8B model, wikitext-2 test set, 100 documents, layer 0
python experiments/measure_residuals.py \
    --model meta-llama/Llama-3.1-8B \
    --dataset wikitext \
    --num-samples 100 \
    --layer 0 \
    --max-seq-len 1024 \
    --output results/residuals_llama8b.npz
```

For a long-context model (tests Corollary 5 — compression improves with length):
```bash
python experiments/measure_residuals.py \
    --model meta-llama/Llama-3.1-8B \
    --dataset pg19 \
    --num-samples 20 \
    --layer 0 \
    --max-seq-len 4096 \
    --max-positions 2000 \
    --output results/residuals_llama8b_longctx.npz
```

Step 3. View the summary (no GPU or matplotlib required):
```bash
python experiments/plot_residual_vs_surprisal.py \
    --input results/residuals_llama8b.npz \
    --text-only
```

Expected output:
```
============================================================
Residual Analysis: meta-llama/Llama-3.1-8B
Layer 0
============================================================
Data points:          18432
Mean surprisal:       3.47 bits (PP ~ 11.1)
Mean residual norm:   0.0831
Mean theor. bound:    0.3120
Bound/actual ratio:   3.76x
Correlation(h,||R||): 0.61

Residual norm by surprisal quartile:
  Quartile   h range      Mean ||R||      N
  Q1         0.0 – 1.2    0.0312       4608
  Q2         1.2 – 2.8    0.0591       4608
  Q3         2.8 – 5.1    0.0921       4608
  Q4         5.1 – 18.4   0.1500       4608
```
What to look for: The correlation between surprisal and residual norm should be positive (0.4–0.7 range), confirming Corollary 4. The Q1 residual should be meaningfully smaller than Q4. A bound/actual ratio > 1 means the theoretical bound is not tight but still correct.
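The checks above can be recomputed from the saved arrays. A minimal sketch on synthetic data — the array names and the toy generative model below are illustrative stand-ins, not the script's actual `.npz` keys:

```python
import numpy as np

# Synthetic stand-ins for the per-position surprisal and residual-norm
# arrays that measure_residuals.py records (illustrative only).
rng = np.random.default_rng(0)
h = rng.uniform(0.0, 18.0, size=4096)                  # surprisal, bits
r = 0.05 * np.sqrt(h * np.log(2)) + rng.normal(0, 0.005, size=h.size)

# Pearson correlation between surprisal and residual norm (Corollary 4 check)
corr = np.corrcoef(h, r)[0, 1]

# Residual norm by surprisal quartile: expect Q1 < Q2 < Q3 < Q4
edges = np.quantile(h, [0.25, 0.5, 0.75])
quartile = np.digitize(h, edges)                       # bin index 0..3
means = [r[quartile == q].mean() for q in range(4)]

print(f"corr = {corr:.2f}")
print("quartile means:", [f"{m:.4f}" for m in means])
```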
Step 4. Generate figures:

```bash
pip install matplotlib
python experiments/plot_residual_vs_surprisal.py \
    --input results/residuals_llama8b.npz \
    --output results/residual_plot.pdf
```

Produces a 3-panel PDF:
- Scatter plot of actual `‖R_i‖` vs surprisal `h_i` with theoretical bound overlay
- Binned mean residual norm with `sqrt(H_i ln 2)` reference curve
- Theoretical compression ratio vs TurboQuant curve (Corollary 2b)
Step 5. Test multiple layers to see how residuals vary with depth:

```bash
for layer in 0 4 8 16 31; do
    python experiments/measure_residuals.py \
        --model meta-llama/Llama-3.1-8B \
        --dataset wikitext --num-samples 30 \
        --layer $layer \
        --output results/residuals_layer${layer}.npz
    python experiments/plot_residual_vs_surprisal.py \
        --input results/residuals_layer${layer}.npz \
        --text-only 2>&1 | grep "Mean residual\|Correlation"
done
```

If you have a downloaded GGUF file from Ollama, LM Studio, or HuggingFace:
```bash
pip install gguf numpy

# Analyze architecture and compute bounds
python adapters/gguf_adapter.py ./llama-3.1-8b-instruct.Q4_K_M.gguf \
    --perplexity 12.0
```

Expected output:

```
GGUF Model: llama-3.1-8b-instruct.Q4_K_M.gguf
Layers (L):        32
Attention heads:   32
KV heads (GQA):    8
Hidden size:       4096
Head dim:          128
KV cache (fp16) per 128K tokens: 26.8 GB

--- Embedding Geometry ---
embedding_diameter_estimate_CE: 14.31
mean_embedding_norm:            0.72

--- Compression Bounds (perplexity=12.0) ---
Sequential entropy bound:    3.58 bits/token
vs TurboQuant (theoretical): 1,042,697x
vs TurboQuant (1000x floor): 1,042.7x
```

For a Safetensors checkpoint:

```bash
pip install safetensors huggingface_hub
```
```bash
# From a local HF model directory
python adapters/safetensors_adapter.py ./path/to/Llama-3.1-8B/ \
    --perplexity 12.0 \
    --layer 0

# Or download metadata only (no weights) and analyze:
python adapters/safetensors_adapter.py meta-llama/Llama-3.1-8B \
    --perplexity 12.0
```

To use the HuggingFace adapter directly from Python:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from adapters.base import AdapterConfig
from adapters.hf import HFAdapter

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

config = AdapterConfig(top_k=20, target_bits=3)
with HFAdapter(model, config) as adapter:
    text = "The transformer architecture consists of encoder and decoder stacks."
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(tokenizer.decode(outputs[0]))
    print(adapter.compression_stats())
```

Expected output:

```
{'num_tokens': 50, 'mean_surprisal_bits': 2.81, 'sequential_bound_bits_per_token': 2.81, 'estimated_perplexity': 7.04}
```
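The stats above are related by `perplexity = 2^h̄`, where `h̄` is the mean per-token surprisal `-log2 p(t_i | t_{<i})`. A self-contained sketch of that computation on toy logits — `token_surprisal_bits` is an illustrative helper, not this repo's API:

```python
import numpy as np

def token_surprisal_bits(logits: np.ndarray, token_id: int) -> float:
    """Surprisal -log2 p(token) under a next-token logit vector."""
    z = logits - logits.max()              # log-softmax, numerically stable
    logp = z - np.log(np.exp(z).sum())
    return float(-logp[token_id] / np.log(2))

# Toy next-token distributions (vocab of 4) and the tokens actually emitted
logits_seq = np.array([[4.0, 1.0, 0.0, -1.0],
                       [2.0, 2.0, 0.5, 0.0]])
tokens = [0, 1]

h = [token_surprisal_bits(l, t) for l, t in zip(logits_seq, tokens)]
h_bar = float(np.mean(h))
print(f"mean surprisal: {h_bar:.2f} bits, perplexity ~ {2**h_bar:.2f}")
```

A confidently predicted token (the first) costs a fraction of a bit; a genuinely uncertain one (the second) costs over a bit, which is exactly the non-uniformity Layer 2 exploits.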
The theory is proven. The open empirical question (Section 9.6 of the paper) is: how small are the residuals in practice for trained models?
The experiments answer this directly:
| What you measure | What it tells you |
|---|---|
| `Correlation(h_i, ‖R_i‖) > 0` | Residuals track surprisal — the bound is meaningful |
| Q1 residual < Q4 residual | Low-surprisal tokens have smaller residuals — compression is non-uniform as predicted |
| Bound/actual ratio | How tight the Corollary 4 bound is. A ratio of 2–10x means the theory is conservative but correct |
| Residual norm by layer | Some layers compress much better than others — worth measuring before choosing which to compress first |
A positive correlation and quartile gradient (Q1 < Q2 < Q3 < Q4) are the two key confirmations that the theory applies to real models. Prior work has not measured this directly.
| Adapter | Surprisal measurement | Layer 2 write-back | Layer 1 prefix dedup | Notes |
|---|---|---|---|---|
| `hf.py` | ✅ | ✅ measurement; in-place write needs cache manager patch | ✅ stub | Reference impl, works today |
| `gguf_adapter.py` | ✅ analysis only | N/A | N/A | Weight geometry, no runtime |
| `safetensors_adapter.py` | ✅ analysis only | N/A | N/A | Weight geometry, no runtime |
| `vllm_adapter.py` | ✅ | 🔧 needs FlashInfer read-side patch | ✅ via `--enable-prefix-caching` | Write-side hooked |
| `sglang_adapter.py` | ✅ | 🔧 needs read-side kernel patch | ✅ via RadixAttention | Theorem 4 duality supported |
| `tensorrt_adapter.py` | ✅ | 🔧 needs custom TRT plugin | 🔧 pending | Option B Python hook |
| `llamacpp_adapter.py` | ✅ | 🔧 needs ctypes KV buffer access | 🔧 pending | LogitsProcessor hook |
The remaining engineering work across all production adapters is the read-side hook: when the attention kernel reads KV_i from cache, it needs to add back KV_hat_i before computing attention scores. The write side is hooked; the read side requires a kernel patch.
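A minimal numpy sketch of what that read-side hook must do — all names here are hypothetical, and the toy uniform quantizer stands in for whatever per-vector quantizer sits on top:

```python
import numpy as np

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Toy uniform dequantizer standing in for the per-vector quantizer."""
    return q.astype(np.float32) * scale

def read_kv(residual_q: np.ndarray, scale: float, kv_hat: np.ndarray) -> np.ndarray:
    """Read-side hook: reconstruct KV_i = KV_hat_i + R_i before attention.

    In a real engine this happens inside the attention kernel, after the
    cache read and before the QK^T score computation.
    """
    return kv_hat + dequantize(residual_q, scale)

# Demo: the cache stores only the quantized residual; the predictor's
# output kv_hat is regenerated at read time.
rng = np.random.default_rng(0)
kv_true = rng.normal(size=(8,)).astype(np.float32)
kv_hat = kv_true + rng.normal(scale=0.05, size=8).astype(np.float32)  # predictor output
scale = 0.01
residual_q = np.round((kv_true - kv_hat) / scale).astype(np.int8)     # write side (hooked)
kv_read = read_kv(residual_q, scale, kv_hat)                          # read side (to patch)
print("max reconstruction error:", np.abs(kv_read - kv_true).max())
```

When the predictor is good, the residual is small, so even a coarse quantizer reconstructs the vector to within half a quantization step.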
- PLT framework: arXiv:2604.06228 — the trie metric and prior-guided caching theorem this paper builds on
- TurboQuant: Zandieh et al. (ICLR 2026) — the per-vector ceiling this work surpasses
- vLLM: Kwon et al. (SOSP 2023) — PagedAttention prefix sharing baseline
- SGLang: LMSYS/UC Berkeley — RadixAttention prefix tree
- KIVI / KVQuant: prior per-vector quantization work
```bibtex
@article{magarshak2026sequential,
  title   = {Sequential {KV} Cache Compression via Probabilistic Language Tries:
             Beyond the Per-Vector {Shannon} Limit},
  author  = {Magarshak, Gregory},
  journal = {arXiv preprint arXiv:2604.15356},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.15356}
}
```

MIT — see LICENSE.