CyperPan/delta-k-quantization

# Delta-K: Group-Size-Robust KV Cache Quantization via Closed-Loop Differential Encoding

Delta-K replaces per-channel Key quantization with closed-loop differential encoding, exploiting the DC-dominated spectrum of Key activations to enable per-token 2-bit quantization without grouped channel statistics.

## Key Results

### Core Finding: Group-Size Robustness

Delta-K's PPL is essentially invariant to group size, while KIVI's degrades sharply as G grows; the two methods cross over at G=64.

**Table 2:** Group-size sweep (Qwen2.5-7B, WikiText-2, 2-bit K+V, online eval, steady-state PPL)

| G | Delta-K PPL | Delta-K ΔPPL | Delta-K K bpv | KIVI PPL | KIVI ΔPPL | KIVI K bpv | Winner |
|---:|---:|---:|---:|---:|---:|---:|---|
| 32 | 6.730 | +0.677 | 2.56 | 6.548 | +0.494 | 3.00 | KIVI |
| 64 | 6.727 | +0.674 | 2.28 | 6.728 | +0.675 | 2.50 | Tie |
| 128 | 6.700 | +0.647 | 2.14 | 6.870 | +0.817 | 2.25 | Delta-K |
| 256 | 6.598 | +0.545 | 2.07 | 7.068 | +1.015 | 2.13 | Delta-K |

For comparison, the literature reports KIVI 2-bit ΔPPL = +0.89 on Qwen2.5-7B (AQUA-KV, ICML 2025, Table 1).

### DPCM Eliminates Error Accumulation

**Table 3:** Ablation study (Qwen2.5-14B, K-only, G=32, attention-output CosSim)

| Step | Configuration | CosSim | Δ |
|---|---|---:|---|
| A | Open-loop, 3-level symmetric | 0.419 | baseline |
| B | + Closed-loop (DPCM) | 0.884 | +0.464 (82%) |
| C | + 4-level symmetric | 0.976 | +0.092 |
| D | + Non-uniform codebook | 0.985 | +0.013 |
| - | KIVI per-channel 2-bit | 0.764 | - |

### PPL vs. Compression Rate (G=32)

See `plots/fig2_ppl_vs_bpv.png`.

### Downstream Tasks

**Table 4:** LongBench (Qwen2.5-7B-Instruct, G=128, 2-bit K+V, TREC excluded)

| Method | TriviaQA (F1) | SAMSum (ROUGE-L) | Avg. Δ |
|---|---:|---:|---:|
| FP16 | 79.87 | 34.68 | - |
| Delta-K G=128 | 80.30 (+0.43) | 34.74 (+0.06) | +0.25 |
| KIVI g=128 | 75.84 (-4.03) | 33.10 (-1.58) | -2.81 |

**Table 5:** GSM8K, a known limitation (Qwen2.5-7B, 2-bit K+V, 300 samples)

| Method | Accuracy | ΔAcc | KV bpv |
|---|---:|---:|---:|
| FP16 | 82.0% | - | 16.0 |
| KIVI g=32 | 77.0% | -5.0% | 2.63 |
| KIVI g=128 | 74.0% | -8.0% | 2.25 |
| Delta-K G=32 | 55.7% | -26.3% | 2.40 |
| Delta-K G=128 | 57.3% | -24.7% | 2.40 |
| Delta-K G=256 | 58.3% | -23.7% | 2.40 |

DPCM's temporally correlated errors harm chain-of-thought reasoning. Delta-K is best suited for long-context understanding rather than precise multi-step tasks.

## Experiment Journey

See `plots/fig4_experiment_journey.png`.

## Method

### Delta-K DPCM Encoding

For each group of G tokens in the K cache:

1. **Anchor**: the first token is stored in FP16 (lossless).
2. **DPCM residuals**: `r[t] = K[t] - K_hat[t-1]`, computed closed-loop, i.e. against the reconstruction rather than the original previous token.
3. **Quantization**: residuals are quantized with a learned 4-level non-uniform codebook.
4. **Reconstruction**: `K_hat[t] = K_hat[t-1] + Q(r[t])`.

Reconstruction error at each step equals single-step quantization noise — it does not accumulate.
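The steps above can be sketched in a few lines of NumPy. This is an illustrative mock-up, not the repository's implementation: the 4-level codebook values and the per-token absmax scale are placeholder choices, not the learned codebook from the paper.

```python
import numpy as np

# Placeholder 4-level non-uniform codebook (illustrative values,
# NOT the learned codebook from the paper).
CODEBOOK = np.array([-1.0, -0.25, 0.25, 1.0])

def quantize_residual(r, scale):
    """Snap each residual element to the nearest scaled codebook level."""
    levels = CODEBOOK * scale                    # shape (4,)
    idx = np.abs(r[:, None] - levels).argmin(axis=1)
    return levels[idx]

def dpcm_encode_decode(K):
    """Closed-loop DPCM over one group of tokens.

    K: array of shape (G, d) -- Key activations for one group.
    Returns the dequantized reconstruction K_hat.
    """
    K_hat = np.empty_like(K, dtype=np.float64)
    K_hat[0] = K[0]                              # anchor token kept lossless
    for t in range(1, len(K)):
        # Residual against the *reconstruction*, not the original (closed loop).
        r = K[t] - K_hat[t - 1]
        scale = np.abs(r).max() + 1e-12          # per-token scale (illustrative choice)
        K_hat[t] = K_hat[t - 1] + quantize_residual(r, scale)
    return K_hat
```

Because each residual is taken against the reconstructed `K_hat[t-1]` rather than the true `K[t-1]`, the quantizer sees its own past error and cancels it at the next step; an open-loop variant (`r[t] = K[t] - K[t-1]`) would let errors compound across the group.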

### Why Delta-K Is Robust to Group Size

- **KIVI (per-channel grouped)**: larger groups → more outliers per group → the shared scale is forced larger → precision loss on the remaining values.
- **Delta-K DPCM**: group size only affects anchor overhead; the DPCM residuals themselves are independent of G.
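A toy experiment makes the first point concrete: with sparse outliers (as in Key activations), the shared absmax scale of grouped quantization grows with group size, coarsening the 2-bit step for every value in the group. This is a schematic illustration, not KIVI's actual quantizer.

```python
import numpy as np

def mean_group_scale(x, group_size):
    """Mean absmax scale when one quantization scale is shared per group."""
    groups = x.reshape(-1, group_size)
    return np.abs(groups).max(axis=1).mean()

rng = np.random.default_rng(0)
x = rng.standard_normal(4096)
x[::128] *= 10.0          # sparse outliers, mimicking outlier Key channels

for G in (32, 64, 128, 256):
    print(f"G={G:3d}  mean scale={mean_group_scale(x, G):.2f}")
```

At 2 bits the quantization step is proportional to this scale, so every typical value in a large group pays for the group's worst outlier, whereas Delta-K's residual quantization does not pool statistics over more elements as G grows.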

### Storage Efficiency

```
bpv_DK = 2 + 14/G + 16(G-1)/(G·d)
```

At G=128, d=128: bpv = 2.14 (vs KIVI's 2.25 with asymmetric metadata).

Repository Structure

├── README.md
├── scripts/                         # All experiment code
│   ├── analyze.py                   # KV cache frequency analysis
│   ├── kv_cache_freq.py             # KV cache spectral analysis
│   ├── kv_sensitivity.py            # K vs V sensitivity comparison
│   ├── neuron_pruning_exp.py        # FFN neuron pruning (negative result)
│   ├── delta_k_validation.py        # Delta-K v1 open-loop (1.5B)
│   ├── delta_k_v2_revised.py        # Delta-K v2 DPCM smoke test (14B)
│   ├── delta_k_v3_ppl.py            # PPL evaluation K-only (14B)
│   ├── delta_k_v4_7b.py             # K+V prefix PPL (7B)
│   ├── delta_k_v4_2_online.py       # Online full-sequence PPL (7B)
│   ├── delta_k_v4_2_g128.py         # KIVI g128/g256 supplement
│   ├── delta_k_v4_2_gsweep.py       # Group size sweep
│   ├── Delta_k_longbench.py         # LongBench eval (includes Triton kernel)
│   ├── delta_k_longbench_final.py   # LongBench final version
│   ├── delta_k_gsm8k.py             # GSM8K evaluation
│   ├── delta_k_gsm8k_300.py         # GSM8K 300-sample version
│   └── benchmark_latency.py         # Quantization latency benchmark
├── experiments/                     # Results by experiment
│   ├── 01_depth_axis_freq/
│   ├── 02_neuron_pruning/
│   ├── 03_kv_cache_freq/
│   ├── 04_kv_sensitivity/
│   ├── 05_delta_k_v1_1.5B/
│   ├── 06_delta_k_v2_14B/
│   ├── 07_delta_k_v3_ppl_14B/
│   ├── 08_delta_k_v4_kv_7B/
│   ├── 09_delta_k_v4_2_online/
│   ├── 10_gsweep/
│   └── 11_gsm8k/
└── plots/                           # Figures and tables
    ├── fig1_group_size_sweep.png
    ├── fig2_ppl_vs_bpv.png
    ├── fig3_error_accumulation.png
    ├── fig4_experiment_journey.png
    └── tables/
        ├── table1_ppl_main.png
        ├── table2_longbench.png
        ├── table3_gsm8k.png
        └── table4_ablation.png

## Hardware

All experiments were run on the NEU Explorer HPC cluster:

- Qwen2.5-1.5B: V100 PCIe 32GB
- Qwen2.5-7B/14B: A100 80GB
- Framework: PyTorch 2.9.1, Transformers 4.57.6

## Citation

```bibtex
@article{pan2025deltak,
  title={Delta-K: Group-Size-Robust KV Cache Quantization via Closed-Loop Differential Encoding},
  author={Pan, Zhiyuan},
  year={2025}
}
```
