Delta-K replaces per-channel Key quantization with closed-loop differential encoding, exploiting the DC-dominated spectrum of Key activations to enable per-token 2-bit quantization without grouped channel statistics.
Delta-K's PPL is essentially flat across group sizes (it even improves slightly at larger G), while KIVI degrades sharply; the two methods cross at G=64.
Table 2: Group size sweep (Qwen2.5-7B, WikiText-2, 2-bit K+V, online eval, steady-state PPL)
| G | Delta-K PPL | Delta-K ΔPPL | Delta-K K bpv | KIVI PPL | KIVI ΔPPL | KIVI K bpv | Winner |
|---|---|---|---|---|---|---|---|
| 32 | 6.730 | +0.677 | 2.56 | 6.548 | +0.494 | 3.00 | KIVI |
| 64 | 6.727 | +0.674 | 2.28 | 6.728 | +0.675 | 2.50 | Tie |
| 128 | 6.700 | +0.647 | 2.14 | 6.870 | +0.817 | 2.25 | Delta-K |
| 256 | 6.598 | +0.545 | 2.07 | 7.068 | +1.015 | 2.13 | Delta-K |
Literature: KIVI 2-bit ΔPPL = +0.89 on Qwen2.5-7B (AQUA-KV, ICML 2025, Table 1)
Table 3: Ablation study (Qwen2.5-14B, K-only, G=32, attention output CosSim)
| Step | Configuration | CosSim | Δ |
|---|---|---|---|
| A | Open-loop, 3-level symmetric | 0.419 | baseline |
| B | + Closed-loop (DPCM) | 0.884 | +0.464 (82% of total gain) |
| C | + 4-level symmetric | 0.976 | +0.092 |
| D | + Non-uniform codebook | 0.985 | +0.013 |
| — | KIVI per-channel 2-bit | 0.764 | — |
Table 4: LongBench (Qwen2.5-7B-Instruct, G=128, 2-bit K+V, TREC excluded)
| Method | TriviaQA (F1) | SAMSum (ROUGE-L) | Avg. Δ |
|---|---|---|---|
| FP16 | 79.87 | 34.68 | — |
| Delta-K G=128 | 80.30 (+0.43) | 34.74 (+0.06) | +0.25 |
| KIVI g=128 | 75.84 (-4.03) | 33.10 (-1.58) | -2.81 |
Table 5: GSM8K — Limitation (Qwen2.5-7B, 2-bit K+V, 300 samples)
| Method | Accuracy | ΔAcc | KV bpv |
|---|---|---|---|
| FP16 | 82.0% | — | 16.0 |
| KIVI g=32 | 77.0% | -5.0% | 2.63 |
| KIVI g=128 | 74.0% | -8.0% | 2.25 |
| Delta-K G=32 | 55.7% | -26.3% | 2.40 |
| Delta-K G=128 | 57.3% | -24.7% | 2.40 |
| Delta-K G=256 | 58.3% | -23.7% | 2.40 |
DPCM's temporally correlated errors harm chain-of-thought reasoning. Delta-K is best suited for long-context understanding rather than precise multi-step tasks.
For each group of G tokens in the K cache:
- Anchor: the first token in the group is stored in FP16 (lossless)
- DPCM residuals (closed-loop): r[t] = K[t] - K_hat[t-1]
- Quantization: each residual is quantized with a learned 4-level non-uniform codebook
- Reconstruction: K_hat[t] = K_hat[t-1] + Q(r[t])
Because the quantizer sits inside the loop, the reconstruction error at each step equals the single-step quantization noise; it does not accumulate across the group.
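A minimal sketch of this per-group encode/decode loop (assumptions: a (G, d) Key slice, a per-token scale, and placeholder codebook values; the real codebook is learned, and the helper names are illustrative rather than the released kernels):

```python
import torch

# Hypothetical 4-level non-uniform codebook for the normalized residual.
# Delta-K learns this codebook; these values are placeholders.
CODEBOOK = torch.tensor([-1.5, -0.4, 0.4, 1.5])

def delta_k_encode_group(K: torch.Tensor):
    """Encode one group of G key vectors (G, d) with closed-loop DPCM."""
    G, d = K.shape
    anchor = K[0].to(torch.float16)             # anchor token kept lossless
    K_hat = anchor.float()                      # running reconstruction (decoder state)
    codes, scales = [], []
    for t in range(1, G):
        r = K[t] - K_hat                        # closed-loop residual: uses K_hat, not K[t-1]
        scale = r.abs().mean().clamp_min(1e-8)  # per-token scale shared across d channels
        idx = (r / scale - CODEBOOK.view(-1, 1)).abs().argmin(dim=0)  # nearest codeword per channel
        K_hat = K_hat + CODEBOOK[idx] * scale   # re-quantized state stays in the loop
        codes.append(idx.to(torch.uint8))       # 2-bit payload
        scales.append(scale)
    return anchor, torch.stack(codes), torch.stack(scales)

def delta_k_decode_group(anchor, codes, scales):
    """Rebuild the group; per-step error is one quantization step, not a running sum."""
    K_hat = [anchor.float()]
    for idx, scale in zip(codes, scales):
        K_hat.append(K_hat[-1] + CODEBOOK[idx.long()] * scale)
    return torch.stack(K_hat)
```

The decisive detail is that the encoder subtracts its own reconstruction K_hat rather than the true previous key, so encoder and decoder state stay identical and the quantization error never compounds; the open-loop variant in the ablation (Table 3, step A) lacks exactly this property.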
- KIVI (per-channel grouped): Larger groups → more outliers per group → scale forced larger → precision loss (illustrated in the sketch after the bpv formula below)
- Delta-K DPCM: Group size only affects anchor overhead. DPCM residuals are independent of G.
bpv_DK = 2 + 14/G + 16(G-1)/(G·d), i.e., 2 bits per residual value, plus 14/G for the FP16 anchor (16 bits instead of 2 for the first token of each group), plus 16 bits of per-token metadata amortized across the d channels.
At G=128, d=128: bpv = 2.14 (vs KIVI's 2.25 with asymmetric metadata).
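To make the outlier argument concrete, here is a toy sketch (a plain min-max asymmetric 2-bit quantizer on synthetic data stands in for KIVI's grouped scheme; the outlier pattern, sizes, and helper name are invented for illustration):

```python
import torch

torch.manual_seed(0)

def group_quant_error(x: torch.Tensor, G: int, bits: int = 2) -> float:
    """Mean abs. error of asymmetric min-max quantization over groups of size G."""
    levels = 2 ** bits - 1
    xg = x.view(-1, G)
    lo = xg.min(dim=1, keepdim=True).values
    hi = xg.max(dim=1, keepdim=True).values
    scale = (hi - lo).clamp_min(1e-8) / levels        # one outlier stretches the whole group's range
    q = ((xg - lo) / scale).round().clamp(0, levels)  # 2-bit codes
    return (q * scale + lo - xg).abs().mean().item()

# Mostly small values with sparse large outliers, mimicking outlier-heavy activations.
x = torch.randn(4096)
x[::256] += 8.0
for G in (32, 64, 128, 256):
    print(f"G={G:3d}  mean abs error = {group_quant_error(x, G):.3f}")
```

As G grows, more groups contain an outlier, their min-max range inflates, and the mean error rises; Delta-K's residual quantizer is insulated from this because the Key sequence is DC-dominated, so residual magnitude is independent of G and only the anchor amortization in the formula above changes.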
├── README.md
├── scripts/ # All experiment code
│ ├── analyze.py # KV cache frequency analysis
│ ├── kv_cache_freq.py # KV cache spectral analysis
│ ├── kv_sensitivity.py # K vs V sensitivity comparison
│ ├── neuron_pruning_exp.py # FFN neuron pruning (negative result)
│ ├── delta_k_validation.py # Delta-K v1 open-loop (1.5B)
│ ├── delta_k_v2_revised.py # Delta-K v2 DPCM smoke test (14B)
│ ├── delta_k_v3_ppl.py # PPL evaluation K-only (14B)
│ ├── delta_k_v4_7b.py # K+V prefix PPL (7B)
│ ├── delta_k_v4_2_online.py # Online full-sequence PPL (7B)
│ ├── delta_k_v4_2_g128.py # KIVI g128/g256 supplement
│ ├── delta_k_v4_2_gsweep.py # Group size sweep
│ ├── Delta_k_longbench.py # LongBench eval (includes Triton kernel)
│ ├── delta_k_longbench_final.py # LongBench final version
│ ├── delta_k_gsm8k.py # GSM8K evaluation
│ ├── delta_k_gsm8k_300.py # GSM8K 300-sample version
│ └── benchmark_latency.py # Quantization latency benchmark
├── experiments/ # Results by experiment
│ ├── 01_depth_axis_freq/
│ ├── 02_neuron_pruning/
│ ├── 03_kv_cache_freq/
│ ├── 04_kv_sensitivity/
│ ├── 05_delta_k_v1_1.5B/
│ ├── 06_delta_k_v2_14B/
│ ├── 07_delta_k_v3_ppl_14B/
│ ├── 08_delta_k_v4_kv_7B/
│ ├── 09_delta_k_v4_2_online/
│ ├── 10_gsweep/
│ └── 11_gsm8k/
└── plots/ # Figures and tables
  ├── fig1_group_size_sweep.png
  ├── fig2_ppl_vs_bpv.png
  ├── fig3_error_accumulation.png
  ├── fig4_experiment_journey.png
  └── tables/
    ├── table1_ppl_main.png
    ├── table2_longbench.png
    ├── table3_gsm8k.png
    └── table4_ablation.png
All experiments were run on the NEU Explorer HPC cluster:
- Qwen2.5-1.5B: V100 PCIe 32GB
- Qwen2.5-7B/14B: A100 80GB
- Framework: PyTorch 2.9.1, Transformers 4.57.6
@article{pan2025deltak,
  title={Delta-K: Group-Size-Robust KV Cache Quantization via Closed-Loop Differential Encoding},
  author={Pan, Zhiyuan},
  year={2025}
}


