Add DeepSeek and GLM benchmark reports (+ cleanup of legacy top-level reports) #2
Merged
cursor[bot] merged 2 commits into main on Apr 17, 2026
Conversation
Three new models benchmarked using the same codec preset and harness
as the existing 4-model suite:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek)
  Qwen2 architecture, GQA 12:2, head_dim=128, 28 layers
  128k total bf16 ratio: 3.41x (3.50 GiB -> 1.03 GiB)
- THUDM/glm-edge-1.5b-chat (Zhipu AI)
  GLM architecture, GQA 16:4, head_dim=128, 28 layers, max_pos=8192
  128k total bf16 ratio: 3.70x (7.00 GiB -> 1.89 GiB)
  (16k+ rows are codec-only projections since the model caps at 8k)
- THUDM/glm-edge-4b-chat (Zhipu AI)
  GLM architecture, GQA 24:6, head_dim=128, 40 layers, max_pos=8192
  128k total bf16 ratio: 3.88x (15.00 GiB -> 3.87 GiB)
  (the 8k run required --skip-generation --prefill-chunk 1024 to fit in 15 GiB RAM)
All three ran through the existing build_kakeya_cache(model) factory
and run_all_benchmarks.sh orchestrator with zero model-specific code;
the factory auto-dispatches based on config.layer_types (or the
num_hidden_layers + sliding_window fallback for Llama-family configs).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
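For readers skimming the reports, here is a minimal sketch of the auto-dispatch rule the commit message above describes. The helper name is hypothetical; `layer_types`, `num_hidden_layers`, and `sliding_window` are standard Hugging Face config fields, but the exact logic inside `build_kakeya_cache` may differ.

```python
# Hypothetical sketch of the config-based dispatch described above -- not the
# actual build_kakeya_cache implementation.
def infer_layer_kinds(config):
    """Return one attention kind per layer: 'full_attention' or 'sliding_attention'."""
    layer_types = getattr(config, "layer_types", None)
    if layer_types is not None:
        # Configs that spell out the per-layer attention pattern explicitly.
        return list(layer_types)
    # Llama-family fallback: a single sliding_window value (or None) applies
    # uniformly to all num_hidden_layers layers.
    kind = "sliding_attention" if getattr(config, "sliding_window", None) else "full_attention"
    return [kind] * config.num_hidden_layers
```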
- README headline table now lists all 7 models (was 4).
- STANDARD.md per-model index now lists DeepSeek and two GLM-Edge
variants.
- CROSS_MODEL.md: full 7-model matrix (2k-128k bf16), absolute
bytes-saved ranking, grouped-by-vendor summary, and updated
'what drives the differences' section. Also adds a head_dim vs
ratio lookup table for quick architecture-level estimates.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor[bot] pushed a commit that referenced this pull request on Apr 22, 2026
Buckets on the HF (+7.82%) vs vLLM (+35.33%) 27-pp gap:
#1 Engine baseline shift: ~10 pp (clean-model PPL disagreement; 0.145 KL; 18% top-1 disagreement)
#2 Codec residual magnitude: ~0 (codec is engine-agnostic; MSE ratio 1.01)
#3 Noise-sensitivity curve: HF is MORE sensitive per σ in the linear regime; not the cause
#4 Boundary layers already skipped: +69 pp saved by the SPRINT_CLOSEOUT boundary policy
#5 Cross-layer non-linear compounding: +39 pp (joint cell − Σ singletons over 22 quiet layers)

Localised root cause: vLLM's single-forward bf16 residual-stream accumulation through Flash-Attention compounds per-layer codec residuals ~39 pp above their sum, while HF eager's f32-accumulate + teacher-forcing over DynamicCache compounds them less aggressively. Each per-layer residual is small on both engines (Phase 4 matched); what differs is the accumulation path.

Deployment recommendations:
1. Extend the vLLM boundary skip set to {2, 6, 11} on top of the existing {0, 1, 7, 14, 26, 27}; this cuts ~10-15 pp off the joint Delta-ppl.
2. Adaptive per-layer bit-width: K b=4 on the hot layers, b=3 elsewhere; preserves 19/28 of the ratio benefit.

Phase 3 ran only on vLLM (reused the production harness); the HF per-layer curve is left as a follow-up if someone wants to confirm that HF's cross-layer interaction is the ~+10 pp we infer here.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
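A minimal, hypothetical sketch of the two deployment recommendations above expressed as configuration values; the variable names are assumptions, and the hot-layer set is not enumerated in the commit message.

```python
# Hypothetical config sketch for the recommendations above (names are assumptions).
EXISTING_BOUNDARY_SKIP = {0, 1, 7, 14, 26, 27}   # layers already left uncompressed
EXTRA_VLLM_SKIP = {2, 6, 11}                     # recommendation 1: widen the vLLM skip set
vllm_skip_layers = EXISTING_BOUNDARY_SKIP | EXTRA_VLLM_SKIP

# Recommendation 2: adaptive per-layer key bit-width over the 28 layers --
# b=4 on the "hot" layers, b=3 everywhere else. The hot-layer set itself is
# not spelled out above, so it is left as a placeholder here.
HOT_LAYERS: set = set()
key_bits = {
    layer: (4 if layer in HOT_LAYERS else 3)
    for layer in range(28)
    if layer not in vllm_skip_layers   # skipped boundary layers stay uncompressed
}
```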
Summary
Extends the existing cross-model benchmark matrix (Gemma 4 / Qwen2.5 / SmolLM2 / Qwen3) with three additional open-source models from DeepSeek and Zhipu AI (GLM) — the ones that actually fit on the 15 GiB CPU benchmark box.
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` (DeepSeek, 1.5B, Qwen2 backbone)
- `THUDM/glm-edge-1.5b-chat` (Zhipu AI, 1.5B, on-device GLM)
- `THUDM/glm-edge-4b-chat` (Zhipu AI, 4B, on-device GLM)

All three ran through the same `build_kakeya_cache(model)` factory and the same `run_all_benchmarks.sh` orchestrator as before. Zero model-specific code. The codec generalization shipped in PR #1 handled every new config out of the box.

Headline: 128k total compression ratio (bf16 store)
Per-model highlights
DeepSeek-R1-Distill-Qwen-1.5B (3.41× @ 128k)
Notes on the DeepSeek-V2-Lite/V3 family (which uses MLA) are in the report; the short version is "Kakeya should be applied at the MLA latent dimension, needs ~50 lines of adapter".

GLM-Edge-1.5B-Chat (3.70× @ 128k)
GLM-Edge-4B-Chat (3.88× @ 128k)
The 8k run required `--skip-generation --prefill-chunk 1024` to fit in 15 GiB CPU RAM (40 layers × 6 KV heads × head_dim 128 at 8k, plus activations, is close to the limit).

Extrapolator cross-validation on the new models
The 16k / 32k / 64k / 128k rows for each new model are byte-exact projections from each model's 8k measurement (using the same method validated against Gemma 4's real 16k/32k runs to ≤ 0.003 absolute error). For the GLM-Edge models this is strictly a codec-byte projection since the models themselves cannot process 16k+ inputs.
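The projection method itself is not spelled out in this PR; purely as illustration, here is one way such a byte projection could work, assuming the compressed cache splits into a fixed per-model overhead plus a constant per-token payload (both taken from the 8k run) and that raw bf16 bytes grow exactly linearly with context length. The function name and signature are hypothetical, not the project's actual extrapolator.

```python
# Illustrative only -- not the project's actual extrapolator. Assumes compressed
# bytes = fixed_overhead + per_token * ctx, with both terms read off the 8k run.
def project_ratio(raw_bytes_8k: int,
                  compressed_bytes_8k: int,
                  fixed_overhead_bytes: int,
                  target_ctx: int,
                  measured_ctx: int = 8192) -> float:
    per_token_raw = raw_bytes_8k / measured_ctx
    per_token_comp = (compressed_bytes_8k - fixed_overhead_bytes) / measured_ctx
    projected_raw = per_token_raw * target_ctx
    projected_comp = fixed_overhead_bytes + per_token_comp * target_ctx
    return projected_raw / projected_comp
```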
What drives the per-model differences
Added a concrete head_dim-vs-ratio lookup table to `reports/CROSS_MODEL.md` for quick architecture-level estimates. On the same codec preset the model-family difference washes out: Gemma, Qwen, DeepSeek, and GLM all reach the same asymptotic envelope; what varies is how fast they approach it (a function of head_dim × GQA ratio) and what fraction of their KV is full-attention vs sliding.
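As a sanity check on the "before" sizes quoted earlier, the raw bf16 figures follow from the usual KV-cache arithmetic (2 tensors × layers × kv_heads × head_dim × 2 bytes per token), assuming full attention on every layer; a short sketch:

```python
# Raw bf16 KV-cache size: K and V (2 tensors) x layers x kv_heads x head_dim
# x 2 bytes (bf16), per token -- assuming full attention on every layer.
def raw_kv_bytes(layers: int, kv_heads: int, head_dim: int, ctx: int) -> int:
    return 2 * layers * kv_heads * head_dim * 2 * ctx

GIB = 1024 ** 3
# These reproduce the 128k "before" figures quoted in the commit message above.
assert round(raw_kv_bytes(28, 2, 128, 128 * 1024) / GIB, 2) == 3.50   # DeepSeek-R1-Distill-Qwen-1.5B
assert round(raw_kv_bytes(28, 4, 128, 128 * 1024) / GIB, 2) == 7.00   # GLM-Edge-1.5B-Chat
assert round(raw_kv_bytes(40, 6, 128, 128 * 1024) / GIB, 2) == 15.00  # GLM-Edge-4B-Chat
```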
Cleanup
Removed duplicated legacy files from `reports/` (the top-level `bench_2k.json` … `bench_32k.json`, `extrapolation_from_*.json`, `quick.json`). The canonical copies live under `reports/gemma4_e2b/`; the duplicates were an artifact of the original single-model benchmark directory layout before the per-model folders were introduced in PR #1.

Files
- `reports/deepseek_r1_distill_qwen_1_5b/` — per-context `bench_*.json` + `extrapolation.json` + `REPORT.md`
- `reports/glm_edge_1_5b/` — same layout
- `reports/glm_edge_4b/` — same layout
- `README.md`, `reports/STANDARD.md`, `reports/CROSS_MODEL.md` updated to include the three new models

Repro
Environment
Same as previous benchmark PRs: CPU-only x86_64, 15 GiB RAM, BF16, eager attention, `torch==2.11.0+cu130`, `transformers==5.5.4`. Public (not gated) open-source models only.