Sequential KV Cache Compression via Probabilistic Language Tries

Beyond the Per-Vector Shannon Limit

Gregory Magarshak — gmagarshak@faculty.ienyc.edu
arXiv: 2604.15356


What this is

Per-vector KV cache quantization methods — culminating in TurboQuant — have reached the Shannon entropy limit for compressing individual KV vectors. This repo implements a two-layer architecture that operates against a different, strictly lower limit: the sequential entropy of the KV cache, bounded by the model's per-token surprisal.

The core result (Theorem 1):

H(KV_i | KV_{<i}) ≤ H(t_i | t_{<i})

At typical perplexity 10–20 on fluent text, this bound is 3.3–4.3 bits per token position — vs TurboQuant's 3 bits per component (64–128 components per attention head). The two layers are orthogonal to per-vector methods and stack beneath them.
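Since perplexity and mean surprisal are related by PP = 2^h_bar, the per-token bound is simply h_bar = log2(PP) bits. A minimal sketch checking the figures quoted above (the repository's fuller calculation lives in src/adaptive_quantizer.py):

```python
import math

def sequential_bound_bits(perplexity: float) -> float:
    """Per-token sequential entropy bound: h_bar = log2(PP)."""
    return math.log2(perplexity)

# Perplexity 10-20 on fluent text gives the 3.3-4.3 bits/token quoted above.
for pp in (10, 20):
    print(f"PP={pp:2d}: {sequential_bound_bits(pp):.2f} bits/token")
```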

| Layer | Method | What it eliminates |
|---|---|---|
| 1 | Probabilistic prefix deduplication | Cross-session redundancy via PLT trie metric |
| 2 | Predictive delta coding | Within-session redundancy bounded by token surprisal |
| + | Any per-vector quantizer (e.g. TurboQuant) | Residual magnitude |

Compression improves with context length (Corollary 5) — the opposite of per-vector methods, whose cost grows linearly.


Repository structure

sequential-kv-compression/
├── paper/
│   ├── main.tex              # LaTeX source
│   └── references.bib
├── src/
│   ├── predictive_delta.py   # Layer 2: top-k KV predictor + residual coder
│   ├── prefix_index.py       # Layer 1: PLT trie-based prefix store
│   └── adaptive_quantizer.py # Bound calculations + surprisal-adaptive quantizer
├── adapters/
│   ├── base.py               # Abstract adapter interface + AdapterConfig
│   ├── hf.py                 # HuggingFace transformers (reference / dev)
│   ├── vllm_adapter.py       # vLLM (PagedAttention)
│   ├── sglang_adapter.py     # SGLang (RadixAttention)
│   ├── tensorrt_adapter.py   # TensorRT-LLM
│   ├── llamacpp_adapter.py   # llama.cpp (via llama-cpp-python)
│   ├── gguf_adapter.py       # Direct GGUF analysis (no runtime required)
│   └── safetensors_adapter.py # Direct Safetensors analysis (no runtime required)
├── experiments/
│   ├── measure_residuals.py           # Measure actual ‖R_i‖ on real models
│   └── plot_residual_vs_surprisal.py  # Plot results + theoretical bound overlay
├── notebooks/
│   ├── compression_bounds.ipynb  # Reproduce Corollaries 1/2/5, Theorem 3
│   └── perplexity_sweep.ipynb    # Compression ratios at varying perplexity
├── requirements.txt
├── CITATION.cff
└── LICENSE

Quick setup

git clone https://github.com/magarshak/sequential-kv-compression
cd sequential-kv-compression
pip install torch transformers datasets numpy matplotlib jupyter safetensors gguf

Step-by-step: reproduce the theoretical bounds

This requires no GPU and no model download. It runs in seconds.

Step 1. Install core dependencies:

pip install numpy matplotlib

Step 2. Print the compression ratio table (Corollary 2):

python src/adaptive_quantizer.py

Expected output:

Sequential KV Compression — Theoretical Bounds
============================================================
    PP     h_bar   vs fp16 (theoretical)   vs TurboQuant (theoretical)   vs TurboQuant (1000x floor)
----------------------------------------------------------------------------------------------------
     5      2.32               9,031,942x                     1,693,489x                       1,693.5x
    10      3.32               6,313,057x                     1,183,698x                       1,183.7x
    20      4.32               4,852,353x                       909,816x                         909.8x
    50      5.64               3,715,814x                       696,715x                         696.7x

Step 3. Reproduce all figures from the paper interactively:

jupyter notebook notebooks/compression_bounds.ipynb

This notebook produces:

  • Compression ratio vs perplexity on log scale (3 overhead scenarios)
  • Asymptotic improvement with context length (Corollary 5)
  • Rate-distortion bound curves (Theorem 3)
  • Zero-rate thresholds: positions where zero bits are needed

Step-by-step: measure actual residuals on a real model

This is the key empirical test — checking whether the theoretical bound from Corollary 4 is tight on actual trained models. It requires a GPU and a HuggingFace model.

Step 1. Install requirements:

pip install torch transformers datasets

Step 2. Run the measurement. Start with a small model to verify the setup:

# Small model, fast test (~5 min on a single GPU)
python experiments/measure_residuals.py \
    --model Qwen/Qwen2.5-0.5B \
    --dataset wikitext \
    --num-samples 50 \
    --layer 0 \
    --output results/residuals_qwen05b.npz

For a production-quality measurement:

# 8B model, wikitext-2 test set, 100 documents, layer 0
python experiments/measure_residuals.py \
    --model meta-llama/Llama-3.1-8B \
    --dataset wikitext \
    --num-samples 100 \
    --layer 0 \
    --max-seq-len 1024 \
    --output results/residuals_llama8b.npz

For a long-context model (tests Corollary 5 — compression improves with length):

python experiments/measure_residuals.py \
    --model meta-llama/Llama-3.1-8B \
    --dataset pg19 \
    --num-samples 20 \
    --layer 0 \
    --max-seq-len 4096 \
    --max-positions 2000 \
    --output results/residuals_llama8b_longctx.npz

Step 3. View the summary (no GPU or matplotlib required):

python experiments/plot_residual_vs_surprisal.py \
    --input results/residuals_llama8b.npz \
    --text-only

Expected output:

============================================================
Residual Analysis: meta-llama/Llama-3.1-8B
Layer 0
============================================================
  Data points:         18432
  Mean surprisal:      3.47 bits  (PP ~ 11.1)
  Mean residual norm:  0.0831
  Mean theor. bound:   0.3120
  Bound/actual ratio:  3.76x
  Correlation(h,||R||): 0.61

  Residual norm by surprisal quartile:
    Quartile      h range     Mean ||R||      N
          Q1   0.0 –  1.2       0.0312   4608
          Q2   1.2 –  2.8       0.0591   4608
          Q3   2.8 –  5.1       0.0921   4608
          Q4   5.1 – 18.4       0.1500   4608

What to look for: the correlation between surprisal and residual norm should be positive (roughly 0.4–0.7), confirming Corollary 4, and the Q1 residual should be meaningfully smaller than Q4. A bound/actual ratio above 1 means the theoretical bound holds; a large ratio means it is conservative rather than tight.
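Both checks can be scripted directly against the saved arrays. The sketch below runs the same analysis on synthetic data, since the array names inside the .npz are not documented here; the residual model (norms growing like the square root of surprisal) is an illustrative assumption, not the repository's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the per-position arrays produced by measure_residuals.py
# (names and the residual model are illustrative only).
h = rng.exponential(scale=3.5, size=10_000)                       # surprisal (bits)
resid = 0.03 * np.sqrt(h) + np.abs(rng.normal(0, 0.01, h.size))   # residual norms

# Correlation(h, ||R||): positive if residuals track surprisal.
corr = np.corrcoef(h, resid)[0, 1]

# Mean residual norm by surprisal quartile: Q1 should sit well below Q4.
edges = np.quantile(h, [0, 0.25, 0.5, 0.75, 1.0])
q_means = [resid[(h >= lo) & (h <= hi)].mean()
           for lo, hi in zip(edges[:-1], edges[1:])]

print(f"Correlation(h, ||R||): {corr:.2f}")
for i, m in enumerate(q_means, 1):
    print(f"  Q{i} mean ||R||: {m:.4f}")
```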

Step 4. Generate figures:

pip install matplotlib
python experiments/plot_residual_vs_surprisal.py \
    --input results/residuals_llama8b.npz \
    --output results/residual_plot.pdf

Produces a 3-panel PDF:

  1. Scatter plot of actual ‖R_i‖ vs surprisal h_i with theoretical bound overlay
  2. Binned mean residual norm with sqrt(H_i ln 2) reference curve
  3. Theoretical compression ratio vs TurboQuant curve (Corollary 2b)

Step 5. Test multiple layers to see how residuals vary with depth:

for layer in 0 4 8 16 31; do
    python experiments/measure_residuals.py \
        --model meta-llama/Llama-3.1-8B \
        --dataset wikitext --num-samples 30 \
        --layer $layer \
        --output results/residuals_layer${layer}.npz
    python experiments/plot_residual_vs_surprisal.py \
        --input results/residuals_layer${layer}.npz \
        --text-only 2>&1 | grep "Mean residual\|Correlation"
done

Step-by-step: analyze a GGUF model (no GPU needed)

If you have a downloaded GGUF file from Ollama, LM Studio, or HuggingFace:

pip install gguf numpy

# Analyze architecture and compute bounds
python adapters/gguf_adapter.py ./llama-3.1-8b-instruct.Q4_K_M.gguf \
    --perplexity 12.0

# Output:
# GGUF Model: llama-3.1-8b-instruct.Q4_K_M.gguf
#   Layers (L):          32
#   Attention heads:     32
#   KV heads (GQA):      8
#   Hidden size:         4096
#   Head dim:            128
#   KV cache (fp16) per 128K tokens: 26.8 GB
#
# --- Embedding Geometry ---
#   embedding_diameter_estimate_CE: 14.31
#   mean_embedding_norm: 0.72
#
# --- Compression Bounds (perplexity=12.0) ---
#   Sequential entropy bound:    3.58 bits/token
#   vs TurboQuant (theoretical): 1,042,697x
#   vs TurboQuant (1000x floor): 1,042.7x

Step-by-step: analyze a Safetensors model

pip install safetensors huggingface_hub

# From a local HF model directory
python adapters/safetensors_adapter.py ./path/to/Llama-3.1-8B/ \
    --perplexity 12.0 \
    --layer 0

# Or download metadata only (no weights) and analyze:
python adapters/safetensors_adapter.py meta-llama/Llama-3.1-8B \
    --perplexity 12.0

Step-by-step: use the HF adapter for measurement on a live model

from transformers import AutoModelForCausalLM, AutoTokenizer
from adapters.hf import HFAdapter
from adapters.base import AdapterConfig

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

config = AdapterConfig(top_k=20, target_bits=3)

with HFAdapter(model, config) as adapter:
    text = "The transformer architecture consists of encoder and decoder stacks."
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(tokenizer.decode(outputs[0]))
    print(adapter.compression_stats())

Expected output:

{'num_tokens': 50, 'mean_surprisal_bits': 2.81, 'sequential_bound_bits_per_token': 2.81, 'estimated_perplexity': 7.04}
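The reported estimated_perplexity follows from the mean surprisal, assuming the adapter inverts the usual relation PP = 2^(mean surprisal in bits):

```python
import math

mean_surprisal_bits = 2.81     # from compression_stats() above
pp = 2 ** mean_surprisal_bits  # perplexity from mean surprisal in bits
print(f"estimated perplexity ~ {pp:.2f}")  # ~7.01; 7.04 above reflects unrounded surprisal
assert abs(math.log2(pp) - mean_surprisal_bits) < 1e-9
```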

Is the compression real? What the experiments tell you

The theory is proven. The open empirical question (Section 9.6 of the paper) is: how small are the residuals in practice for trained models?

The experiments answer this directly:

| What you measure | What it tells you |
|---|---|
| Correlation(h_i, ‖R_i‖) > 0 | Residuals track surprisal — the bound is meaningful |
| Q1 residual < Q4 residual | Low-surprisal tokens have smaller residuals — compression is non-uniform as predicted |
| Bound/actual ratio | How tight the Corollary 4 bound is. Ratio of 2–10x means theory is conservative but correct |
| Residual norm by layer | Some layers compress much better than others — worth measuring before choosing which to compress first |

A positive correlation and quartile gradient (Q1 < Q2 < Q3 < Q4) are the two key confirmations that the theory applies to real models. Prior work has not measured this directly.


Adapter status

| Adapter | Surprisal measurement | Layer 2 write-back | Layer 1 prefix dedup | Notes |
|---|---|---|---|---|
| hf.py | ✅ measurement | in-place write needs cache manager patch | ✅ stub | Reference impl, works today |
| gguf_adapter.py | ✅ analysis only | N/A | N/A | Weight geometry, no runtime |
| safetensors_adapter.py | ✅ analysis only | N/A | N/A | Weight geometry, no runtime |
| vllm_adapter.py | 🔧 needs FlashInfer read-side patch | ✅ | ✅ via --enable-prefix-caching | Write-side hooked |
| sglang_adapter.py | 🔧 needs read-side kernel patch | ✅ | ✅ via RadixAttention | Theorem 4 duality supported |
| tensorrt_adapter.py | 🔧 needs custom TRT plugin | 🔧 pending | | Option B Python hook |
| llamacpp_adapter.py | 🔧 needs ctypes KV buffer access | 🔧 pending | | LogitsProcessor hook |

The remaining engineering work across all production adapters is the read-side hook: when the attention kernel reads KV_i from cache, it needs to add back KV_hat_i before computing attention scores. The write side is hooked; the read side requires a kernel patch.
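A minimal numpy sketch of that read-side step, with illustrative shapes and names (real kernels operate on paged, fused layouts):

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 8, 16                     # cached positions, head dim

k_hat = rng.normal(size=(T, d))                 # Layer 2 prediction, recomputable at read time
k_resid = rng.normal(scale=0.05, size=(T, d))   # small stored residual (quantized in practice)
q = rng.normal(size=(d,))                       # current query

# Read-side hook: reconstruct K_i = K_hat_i + R_i before computing scores.
k = k_hat + k_resid
scores = (k @ q) / np.sqrt(d)

# Softmax over cached positions, as the attention kernel would compute.
weights = np.exp(scores - scores.max())
weights /= weights.sum()
```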


Related work

  • PLT framework: arXiv:2604.06228 — the trie metric and prior-guided caching theorem this paper builds on
  • TurboQuant: Zandieh et al. (ICLR 2026) — the per-vector ceiling this work surpasses
  • vLLM: Kwon et al. (SOSP 2023) — PagedAttention prefix sharing baseline
  • SGLang: LMSYS/UC Berkeley — RadixAttention prefix tree
  • KIVI / KVQuant: prior per-vector quantization work

Citation

@article{magarshak2026sequential,
  title   = {Sequential {KV} Cache Compression via Probabilistic Language Tries:
             Beyond the Per-Vector {Shannon} Limit},
  author  = {Magarshak, Gregory},
  journal = {arXiv preprint arXiv:2604.15356},
  year    = {2026},
  url     = {https://arxiv.org/abs/2604.15356}
}

License

MIT — see LICENSE.
