Beyond the Per-Vector Shannon Limit
Gregory Magarshak — gmagarshak@faculty.ienyc.edu
arXiv: 2604.15356
Per-vector KV cache quantization methods — culminating in TurboQuant — have reached the Shannon entropy limit for compressing individual KV vectors. This repo implements a two-layer architecture that operates against a different, strictly lower limit: the sequential entropy of the KV cache, bounded by the model's per-token surprisal.
The core result (Theorem 1):
H(KV_i | KV_{<i}) ≤ H(t_i | t_{<i})
At typical perplexity 10–20 on fluent text, this bound is 3.3–4.3 bits per token position — vs TurboQuant's 3 bits per component (64–128 components per attention head). The two layers are orthogonal to per-vector methods and stack beneath them.
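As a quick check of the quoted range: per the bound above, the average of `H(t_i | t_{<i})` over fluent text equals `log2(perplexity)`, so perplexity 10–20 implies roughly 3.3–4.3 bits per token position. A minimal sketch:

```python
import math

def sequential_bound_bits(perplexity: float) -> float:
    """Average per-token surprisal implied by a given perplexity.

    The sequential bound averages to log2(PP) bits per token position,
    independent of how many components each KV vector has.
    """
    return math.log2(perplexity)

for pp in (10, 20):
    print(f"PP={pp}: {sequential_bound_bits(pp):.2f} bits/token")
    # PP=10 -> 3.32 bits/token, PP=20 -> 4.32 bits/token
```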
| Layer | Method | What it eliminates |
|---|---|---|
| 1 | Probabilistic prefix deduplication | Cross-session redundancy via PLT trie metric |
| 2 | Predictive delta coding | Within-session redundancy bounded by token surprisal |
| + | Any per-vector quantizer (e.g. TurboQuant) | Residual magnitude |
Compression improves with context length (Corollary 5) — the opposite of per-vector methods, whose cost grows linearly.
```
sequential-kv-compression/
├── paper/
│   ├── main.tex                  # LaTeX source
│   └── references.bib
├── src/
│   ├── predictive_delta.py       # Layer 2: top-k KV predictor + residual coder
│   ├── prefix_index.py           # Layer 1: PLT trie-based prefix store
│   └── adaptive_quantizer.py     # Bound calculations + surprisal-adaptive quantizer
├── adapters/
│   ├── base.py                   # Abstract adapter interface + AdapterConfig
│   ├── hf.py                     # HuggingFace transformers (reference / dev)
│   ├── vllm_adapter.py           # vLLM (PagedAttention)
│   ├── sglang_adapter.py         # SGLang (RadixAttention)
│   ├── tensorrt_adapter.py       # TensorRT-LLM
│   ├── llamacpp_adapter.py       # llama.cpp (via llama-cpp-python)
│   ├── gguf_adapter.py           # Direct GGUF analysis (no runtime required)
│   └── safetensors_adapter.py    # Direct Safetensors analysis (no runtime required)
├── experiments/
│   ├── measure_residuals.py      # Measure actual ‖R_i‖ on real models
│   └── plot_residual_vs_surprisal.py  # Plot results + theoretical bound overlay
├── notebooks/
│   ├── compression_bounds.ipynb  # Reproduce Corollaries 1/2/5, Theorem 3
│   └── perplexity_sweep.ipynb    # Compression ratios at varying perplexity
├── requirements.txt
├── CITATION.cff
└── LICENSE
```
```bash
git clone https://github.com/magarshak/sequential-kv-compression
cd sequential-kv-compression
pip install torch transformers datasets numpy matplotlib jupyter safetensors gguf
```

This requires no GPU and no model download. It runs in seconds.
Step 1. Install core dependencies:

```bash
pip install numpy matplotlib
```

Step 2. Print the compression ratio table (Corollary 2):

```bash
python src/adaptive_quantizer.py
```

Expected output:
```
Sequential KV Compression — Theoretical Bounds
============================================================
  PP   h_bar   vs fp16 (theoretical)   vs TurboQuant (theoretical)   vs TurboQuant (1000x floor)
----------------------------------------------------------------------------------------------------
   5    2.32              9,031,942x                    1,693,489x                      1,693.5x
  10    3.32              6,313,057x                    1,183,698x                      1,183.7x
  20    4.32              4,852,353x                      909,816x                        909.8x
  50    5.64              3,715,814x                      696,715x                        696.7x
```
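The ratios above are reproducible from `h_bar = log2(PP)` if one assumes an fp16 baseline of `2 · L · H · d` KV components per token position with 70B-class dimensions (L=80 layers, H=64 heads, d=128 head dim, full MHA) and TurboQuant at 3 bits per component. Those dimensions are an inference from the numbers, not stated here; a sketch under that assumption:

```python
import math

# Assumed baseline (inferred, not stated above): 70B-class dims, full MHA.
L, H, d = 80, 64, 128             # layers, heads, head dim
components = 2 * L * H * d        # K and V components per token position
fp16_bits = components * 16       # fp16 baseline, bits per token
tq_bits = components * 3          # TurboQuant at 3 bits/component

for pp in (5, 10, 20, 50):
    h_bar = math.log2(pp)         # sequential bound, bits/token
    print(f"PP={pp:>2}  h_bar={h_bar:.2f}  "
          f"vs fp16 {fp16_bits / h_bar:,.0f}x  "
          f"vs TurboQuant {tq_bits / h_bar:,.0f}x  "
          f"(1000x floor: {tq_bits / h_bar / 1000:,.1f}x)")
```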
Step 3. Reproduce all figures from the paper interactively:

```bash
jupyter notebook notebooks/compression_bounds.ipynb
```

This notebook produces:
- Compression ratio vs perplexity on log scale (3 overhead scenarios)
- Asymptotic improvement with context length (Corollary 5)
- Rate-distortion bound curves (Theorem 3)
- Zero-rate thresholds: positions where zero bits are needed
This is the key empirical test — checking whether the theoretical bound from Corollary 4 is tight on actual trained models. It requires a GPU and a HuggingFace model.
Step 1. Install requirements:

```bash
pip install torch transformers datasets
```

Step 2. Run the measurement. Start with a small model to verify the setup:
```bash
# Small model, fast test (~5 min on a single GPU)
python experiments/measure_residuals.py \
    --model Qwen/Qwen2.5-0.5B \
    --dataset wikitext \
    --num-samples 50 \
    --layer 0 \
    --output results/residuals_qwen05b.npz
```

For a production-quality measurement:
```bash
# 8B model, wikitext-2 test set, 100 documents, layer 0
python experiments/measure_residuals.py \
    --model meta-llama/Llama-3.1-8B \
    --dataset wikitext \
    --num-samples 100 \
    --layer 0 \
    --max-seq-len 1024 \
    --output results/residuals_llama8b.npz
```

For a long-context model (tests Corollary 5 — compression improves with length):
```bash
python experiments/measure_residuals.py \
    --model meta-llama/Llama-3.1-8B \
    --dataset pg19 \
    --num-samples 20 \
    --layer 0 \
    --max-seq-len 4096 \
    --max-positions 2000 \
    --output results/residuals_llama8b_longctx.npz
```

Step 3. View the summary (no GPU or matplotlib required):
```bash
python experiments/plot_residual_vs_surprisal.py \
    --input results/residuals_llama8b.npz \
    --text-only
```

Expected output:
```
============================================================
Residual Analysis: meta-llama/Llama-3.1-8B
Layer 0
============================================================
Data points:          18432
Mean surprisal:       3.47 bits (PP ~ 11.1)
Mean residual norm:   0.0831
Mean theor. bound:    0.3120
Bound/actual ratio:   3.76x
Correlation(h,||R||): 0.61

Residual norm by surprisal quartile:
  Quartile   h range      Mean ||R||      N
  Q1         0.0 – 1.2    0.0312       4608
  Q2         1.2 – 2.8    0.0591       4608
  Q3         2.8 – 5.1    0.0921       4608
  Q4         5.1 – 18.4   0.1500       4608
```
What to look for: The correlation between surprisal and residual norm should be positive (0.4–0.7 range), confirming Corollary 4. The Q1 residual should be meaningfully smaller than Q4. A bound/actual ratio > 1 means the theoretical bound is not tight but still correct.
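The checks above can be recomputed from the saved arrays. A minimal sketch on synthetic data — the array names and the toy generative model below are illustrative stand-ins, not the script's actual `.npz` keys:

```python
import numpy as np

# Synthetic stand-ins for the per-position surprisal and residual-norm
# arrays that measure_residuals.py records (illustrative only).
rng = np.random.default_rng(0)
h = rng.uniform(0.0, 18.0, size=4096)                  # surprisal, bits
r = 0.05 * np.sqrt(h * np.log(2)) + rng.normal(0, 0.005, size=h.size)

# Pearson correlation between surprisal and residual norm (Corollary 4 check)
corr = np.corrcoef(h, r)[0, 1]

# Residual norm by surprisal quartile: expect Q1 < Q2 < Q3 < Q4
edges = np.quantile(h, [0.25, 0.5, 0.75])
quartile = np.digitize(h, edges)                       # bin index 0..3
means = [r[quartile == q].mean() for q in range(4)]

print(f"corr = {corr:.2f}")
print("quartile means:", [f"{m:.4f}" for m in means])
```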
Step 4. Generate figures:

```bash
pip install matplotlib
python experiments/plot_residual_vs_surprisal.py \
    --input results/residuals_llama8b.npz \
    --output results/residual_plot.pdf
```

Produces a 3-panel PDF:
- Scatter plot of actual `‖R_i‖` vs surprisal `h_i` with theoretical bound overlay
- Binned mean residual norm with `sqrt(H_i ln 2)` reference curve
- Theoretical compression ratio vs TurboQuant curve (Corollary 2b)
Step 5. Test multiple layers to see how residuals vary with depth:

```bash
for layer in 0 4 8 16 31; do
    python experiments/measure_residuals.py \
        --model meta-llama/Llama-3.1-8B \
        --dataset wikitext --num-samples 30 \
        --layer $layer \
        --output results/residuals_layer${layer}.npz
    python experiments/plot_residual_vs_surprisal.py \
        --input results/residuals_layer${layer}.npz \
        --text-only 2>&1 | grep "Mean residual\|Correlation"
done
```

If you have a downloaded GGUF file from Ollama, LM Studio, or HuggingFace:
```bash
pip install gguf numpy

# Analyze architecture and compute bounds
python adapters/gguf_adapter.py ./llama-3.1-8b-instruct.Q4_K_M.gguf \
    --perplexity 12.0
```

Expected output:

```
GGUF Model: llama-3.1-8b-instruct.Q4_K_M.gguf
Layers (L):        32
Attention heads:   32
KV heads (GQA):    8
Hidden size:       4096
Head dim:          128
KV cache (fp16) per 128K tokens: 26.8 GB

--- Embedding Geometry ---
embedding_diameter_estimate_CE: 14.31
mean_embedding_norm:            0.72

--- Compression Bounds (perplexity=12.0) ---
Sequential entropy bound:    3.58 bits/token
vs TurboQuant (theoretical): 1,042,697x
vs TurboQuant (1000x floor): 1,042.7x
```

For a Safetensors checkpoint:

```bash
pip install safetensors huggingface_hub
```
```bash
# From a local HF model directory
python adapters/safetensors_adapter.py ./path/to/Llama-3.1-8B/ \
    --perplexity 12.0 \
    --layer 0

# Or download metadata only (no weights) and analyze:
python adapters/safetensors_adapter.py meta-llama/Llama-3.1-8B \
    --perplexity 12.0
```

To use the HuggingFace adapter directly from Python:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from adapters.base import AdapterConfig
from adapters.hf import HFAdapter

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

config = AdapterConfig(top_k=20, target_bits=3)
with HFAdapter(model, config) as adapter:
    text = "The transformer architecture consists of encoder and decoder stacks."
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(tokenizer.decode(outputs[0]))
    print(adapter.compression_stats())
```

Expected output:

```
{'num_tokens': 50, 'mean_surprisal_bits': 2.81, 'sequential_bound_bits_per_token': 2.81, 'estimated_perplexity': 7.04}
```
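The stats above are related by `perplexity = 2^h̄`, where `h̄` is the mean per-token surprisal `-log2 p(t_i | t_{<i})`. A self-contained sketch of that computation on toy logits — `token_surprisal_bits` is an illustrative helper, not this repo's API:

```python
import numpy as np

def token_surprisal_bits(logits: np.ndarray, token_id: int) -> float:
    """Surprisal -log2 p(token) under a next-token logit vector."""
    z = logits - logits.max()              # log-softmax, numerically stable
    logp = z - np.log(np.exp(z).sum())
    return float(-logp[token_id] / np.log(2))

# Toy next-token distributions (vocab of 4) and the tokens actually emitted
logits_seq = np.array([[4.0, 1.0, 0.0, -1.0],
                       [2.0, 2.0, 0.5, 0.0]])
tokens = [0, 1]

h = [token_surprisal_bits(l, t) for l, t in zip(logits_seq, tokens)]
h_bar = float(np.mean(h))
print(f"mean surprisal: {h_bar:.2f} bits, perplexity ~ {2**h_bar:.2f}")
```

A confidently predicted token (the first) costs a fraction of a bit; a genuinely uncertain one (the second) costs over a bit, which is exactly the non-uniformity Layer 2 exploits.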
The theory is proven. The open empirical question (Section 9.6 of the paper) is: how small are the residuals in practice for trained models?
The experiments answer this directly:
| What you measure | What it tells you |
|---|---|
| `Correlation(h_i, ‖R_i‖) > 0` | Residuals track surprisal — the bound is meaningful |
| Q1 residual < Q4 residual | Low-surprisal tokens have smaller residuals — compression is non-uniform as predicted |
| Bound/actual ratio | How tight the Corollary 4 bound is. A ratio of 2–10x means the theory is conservative but correct |
| Residual norm by layer | Some layers compress much better than others — worth measuring before choosing which to compress first |
A positive correlation and quartile gradient (Q1 < Q2 < Q3 < Q4) are the two key confirmations that the theory applies to real models. Prior work has not measured this directly.
| Adapter | Surprisal measurement | Layer 2 write-back | Layer 1 prefix dedup | Notes |
|---|---|---|---|---|
| `hf.py` | ✅ | ✅ measurement; in-place write needs cache manager patch | ✅ stub | Reference impl, works today |
| `gguf_adapter.py` | ✅ analysis only | N/A | N/A | Weight geometry, no runtime |
| `safetensors_adapter.py` | ✅ analysis only | N/A | N/A | Weight geometry, no runtime |
| `vllm_adapter.py` | ✅ | 🔧 needs FlashInfer read-side patch | ✅ via `--enable-prefix-caching` | Write-side hooked |
| `sglang_adapter.py` | ✅ | 🔧 needs read-side kernel patch | ✅ via RadixAttention | Theorem 4 duality supported |
| `tensorrt_adapter.py` | ✅ | 🔧 needs custom TRT plugin | 🔧 pending | Option B Python hook |
| `llamacpp_adapter.py` | ✅ | 🔧 needs ctypes KV buffer access | 🔧 pending | LogitsProcessor hook |
The remaining engineering work across all production adapters is the read-side hook: when the attention kernel reads KV_i from cache, it needs to add back KV_hat_i before computing attention scores. The write side is hooked; the read side requires a kernel patch.
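A minimal numpy sketch of what that read-side hook must do — all names here are hypothetical, and the toy uniform quantizer stands in for whatever per-vector quantizer sits on top:

```python
import numpy as np

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Toy uniform dequantizer standing in for the per-vector quantizer."""
    return q.astype(np.float32) * scale

def read_kv(residual_q: np.ndarray, scale: float, kv_hat: np.ndarray) -> np.ndarray:
    """Read-side hook: reconstruct KV_i = KV_hat_i + R_i before attention.

    In a real engine this happens inside the attention kernel, after the
    cache read and before the QK^T score computation.
    """
    return kv_hat + dequantize(residual_q, scale)

# Demo: the cache stores only the quantized residual; the predictor's
# output kv_hat is regenerated at read time.
rng = np.random.default_rng(0)
kv_true = rng.normal(size=(8,)).astype(np.float32)
kv_hat = kv_true + rng.normal(scale=0.05, size=8).astype(np.float32)  # predictor output
scale = 0.01
residual_q = np.round((kv_true - kv_hat) / scale).astype(np.int8)     # write side (hooked)
kv_read = read_kv(residual_q, scale, kv_hat)                          # read side (to patch)
print("max reconstruction error:", np.abs(kv_read - kv_true).max())
```

When the predictor is good, the residual is small, so even a coarse quantizer reconstructs the vector to within half a quantization step.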
- PLT framework: arXiv:2604.06228 — the trie metric and prior-guided caching theorem this paper builds on
- TurboQuant: Zandieh et al. (ICLR 2026) — the per-vector ceiling this work surpasses
- vLLM: Kwon et al. (SOSP 2023) — PagedAttention prefix sharing baseline
- SGLang: LMSYS/UC Berkeley — RadixAttention prefix tree
- KIVI / KVQuant: prior per-vector quantization work
```bibtex
@article{magarshak2026sequential,
  title   = {Sequential {KV} Cache Compression via Probabilistic Language Tries:
             Beyond the Per-Vector {Shannon} Limit},
  author  = {Magarshak, Gregory},
  journal = {arXiv preprint arXiv:2604.15356},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.15356}
}
```

MIT — see LICENSE.