
Add DeepSeek and GLM benchmark reports (+ cleanup of legacy top-level reports) #2

Merged
cursor[bot] merged 2 commits into main from cursor/add-deepseek-glm-benchmarks-12f5
Apr 17, 2026

Conversation

@FluffyAIcode (Owner)

Summary

Extends the existing cross-model benchmark matrix (Gemma 4 / Qwen2.5 / SmolLM2 / Qwen3) with three additional open-source models from DeepSeek and Zhipu AI (GLM) — the ones that actually fit on the 15 GiB CPU benchmark box.

  • deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek, 1.5B, Qwen2 backbone)
  • THUDM/glm-edge-1.5b-chat (Zhipu AI, 1.5B, on-device GLM)
  • THUDM/glm-edge-4b-chat (Zhipu AI, 4B, on-device GLM)

All three ran through the same build_kakeya_cache(model) factory and the same run_all_benchmarks.sh orchestrator as before. Zero model-specific code. The codec generalization shipped in PR #1 handled every new config out of the box.
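
As context for "zero model-specific code": the commit message further down notes that the factory auto-dispatches on config.layer_types, with a sliding_window fallback for Llama-family configs. Below is a minimal sketch of that kind of config-driven dispatch — describe_kv_layout is a hypothetical helper for illustration, not the repo's build_kakeya_cache API.

```python
# Hedged sketch (not the repo's actual code) of config-driven dispatch:
# everything the cache factory needs can be read off the HF config object,
# which is why no model-specific code was required for the three new models.
from transformers import AutoConfig

def describe_kv_layout(model_path: str) -> dict:   # hypothetical helper name
    cfg = AutoConfig.from_pretrained(model_path)
    num_layers = cfg.num_hidden_layers
    kv_heads = getattr(cfg, "num_key_value_heads", cfg.num_attention_heads)
    head_dim = getattr(cfg, "head_dim", None) or cfg.hidden_size // cfg.num_attention_heads

    # Newer configs label every layer (config.layer_types); older Llama-family
    # configs only expose a global sliding_window, so fall back to one type.
    layer_types = getattr(cfg, "layer_types", None)
    if layer_types is None:
        default = "sliding_attention" if getattr(cfg, "sliding_window", None) else "full_attention"
        layer_types = [default] * num_layers

    return {"num_layers": num_layers, "kv_heads": kv_heads,
            "head_dim": head_dim, "layer_types": layer_types}
```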

Headline: 128k total compression ratio (bf16 store)

| Model | Baseline | Kakeya (bf16) | Total ratio |
| --- | --- | --- | --- |
| Qwen/Qwen3-0.6B (existing) | 14.00 GiB | 3.10 GiB | 4.51× |
| google/gemma-4-E2B-it (existing) | 774 MiB | 180 MiB | 4.29× |
| THUDM/glm-edge-4b-chat (new) | 15.00 GiB | 3.87 GiB | 3.88× |
| THUDM/glm-edge-1.5b-chat (new) | 7.00 GiB | 1.89 GiB | 3.70× |
| deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B (new) | 3.50 GiB | 1.03 GiB | 3.41× |
| HuggingFaceTB/SmolLM2-1.7B-Instruct (existing) | 24.00 GiB | 10.65 GiB | 2.25× |
| Qwen/Qwen2.5-0.5B-Instruct (existing) | 1.50 GiB | 714 MiB | 2.15× |
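
The Baseline column is just the dense bf16 KV-cache size each config implies at 128k tokens; a quick cross-check (mine, not copied from the reports) reproduces the three new entries from the GQA shapes listed below.

```python
# Cross-check (not from the repo): dense bf16 KV-cache footprint at 128k tokens
# is 2 tensors (K and V) x layers x kv_heads x head_dim x seq_len x 2 bytes.
def baseline_kv_gib(layers: int, kv_heads: int, head_dim: int, seq_len: int = 131_072) -> float:
    return 2 * layers * kv_heads * head_dim * seq_len * 2 / 2**30

print(baseline_kv_gib(28, 2, 128))   # DeepSeek-R1-Distill-Qwen-1.5B -> 3.5
print(baseline_kv_gib(28, 4, 128))   # glm-edge-1.5b-chat            -> 7.0
print(baseline_kv_gib(40, 6, 128))   # glm-edge-4b-chat              -> 15.0
```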

Per-model highlights

DeepSeek-R1-Distill-Qwen-1.5B (3.41× @ 128k)

  • Qwen2 architecture (DeepSeek distilled R1 reasoning into a Qwen2.5 backbone). The codec operates on the KV tensors, not on the reasoning trace, so distillation is invisible to the compressor.
  • GQA 12:2, head_dim=128. 28 dense full-attention layers.
  • Sits just below Qwen3-0.6B's 4.51× because Qwen3 has GQA 16:8 (more KV rows per block → more stable K-means).
  • Notes on the larger DeepSeek-V2-Lite / V3 family with MLA are in the report; the short version is "Kakeya should be applied at the MLA latent dimension, needs ~50 lines of adapter".
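
To make the MLA note concrete: in MLA the per-layer cache holds a shared low-rank latent plus a small decoupled RoPE key instead of expanded K/V heads, so "apply Kakeya at the MLA latent dimension" means handing the codec that latent matrix directly. The sketch below is purely illustrative — the helper name is hypothetical, and the 512/64 dims are the published DeepSeek-V2 values, not anything measured here.

```python
# Illustration only: the tensors an MLA-aware adapter would cache per layer and
# feed to the codec (hypothetical helper; 512/64 are DeepSeek-V2's published
# kv_lora_rank / qk_rope_head_dim, not values measured in this repo).
import torch

def mla_cache_tensors(seq_len: int, kv_lora_rank: int = 512, qk_rope_head_dim: int = 64):
    latent = torch.zeros(seq_len, kv_lora_rank, dtype=torch.bfloat16)      # c_KV: shared K/V latent
    rope_k = torch.zeros(seq_len, qk_rope_head_dim, dtype=torch.bfloat16)  # decoupled RoPE key
    # The codec would quantize `latent` (and optionally `rope_k`) instead of the
    # per-head K/V tensors it sees for the GQA models benchmarked here.
    return latent, rope_k
```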

GLM-Edge-1.5B-Chat (3.70× @ 128k)

  • 28 layers, GQA 16:4, head_dim=128, max_pos=8k.
  • Real measurements at 2k / 4k / 8k; 16k+ rows are codec-only projections (the model itself caps at 8k context).
  • Ratio sits slightly above DeepSeek-R1-Distill-Qwen-1.5B's because of the wider GQA (4 KV heads vs 2).

GLM-Edge-4B-Chat (3.88× @ 128k)

  • 40 layers, GQA 24:6, head_dim=128.
  • Best in the non-Qwen3 tier: 11.13 GiB absolute saving per 128k sequence (15.00 GiB → 3.87 GiB bf16).
  • 8k prefill required --skip-generation --prefill-chunk 1024 to fit in 15 GiB CPU RAM (40 layers × 6 KV heads × head_dim 128 at 8k, plus activations, is close to the limit).
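
The memory-limited path amounts to feeding the 8k prompt through the model in fixed-size chunks, so peak activation memory scales with the chunk rather than the full prompt while the KV cache still grows to 8k. A minimal sketch using the plain transformers API (not the harness's internals):

```python
# Chunked-prefill sketch (plain transformers usage, not kakeya_benchmark.py's
# internals). Activations are bounded by chunk_size; the KV cache still grows
# to the full prompt length.
import torch

def chunked_prefill(model, input_ids: torch.Tensor, chunk_size: int = 1024):
    past = None
    with torch.no_grad():
        for start in range(0, input_ids.shape[1], chunk_size):
            chunk = input_ids[:, start:start + chunk_size]
            out = model(input_ids=chunk, past_key_values=past, use_cache=True)
            past = out.past_key_values  # accumulated KV cache so far
    return past
```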

Extrapolator cross-validation on the new models

The 16k / 32k / 64k / 128k rows for each new model are byte-exact projections from each model's 8k measurement (using the same method validated against Gemma 4's real 16k/32k runs to ≤ 0.003 absolute error). For the GLM-Edge models this is strictly a codec-byte projection since the models themselves cannot process 16k+ inputs.
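
The projection arithmetic is simple to restate (my paraphrase, not a copy of kakeya_extrapolate.py): past 8k the codec's per-token byte rate is flat, so both the dense baseline and the codec bytes scale linearly in tokens, plus a fixed per-model overhead term.

```python
# Hedged restatement of the projection (assumes linear-in-tokens scaling past
# 8k plus a fixed per-model overhead; the real kakeya_extrapolate.py may
# differ in details).
def project_ratio(baseline_bytes_8k: float, codec_bytes_8k: float,
                  codec_fixed_overhead: float, target_tokens: int,
                  measured_tokens: int = 8192) -> float:
    scale = target_tokens / measured_tokens
    baseline_bytes = baseline_bytes_8k * scale                   # dense KV grows linearly
    codec_bytes = codec_fixed_overhead + (codec_bytes_8k - codec_fixed_overhead) * scale
    return baseline_bytes / codec_bytes
```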

What drives the per-model differences

Added a concrete lookup table to reports/CROSS_MODEL.md:

| head_dim | Typical bf16 full-attn ratio @ 128k | Models |
| --- | --- | --- |
| 64 | 2.1–2.3× | Qwen2.5, SmolLM2 |
| 128 | 3.4–4.5× | Qwen3, DeepSeek, GLM-Edge |
| 256+ | 4.4× | Gemma 4 (MQA with global_head_dim=512) |

So on the same codec preset the model-family difference largely washes out: Gemma, Qwen, DeepSeek and GLM all reach the same asymptotic envelope; what varies is how fast they approach it (a function of head_dim × GQA ratio) and what fraction of their KV is full-attention vs sliding-window.

Cleanup

Removed duplicated legacy files from reports/ (top-level bench_2k.json, bench_32k.json, extrapolation_from_*.json, quick.json). The canonical copies live under reports/gemma4_e2b/; the duplicates were an artifact of the original single-model benchmark directory layout before the per-model folders were introduced in PR #1.

Files

  • reports/deepseek_r1_distill_qwen_1_5b/ — per-context bench_*.json + extrapolation.json + REPORT.md
  • reports/glm_edge_1_5b/ — same layout
  • reports/glm_edge_4b/ — same layout
  • README.md, reports/STANDARD.md, reports/CROSS_MODEL.md updated to include the three new models

Repro

./run_all_benchmarks.sh models/DeepSeek-R1-Distill-Qwen-1.5B deepseek_r1_distill_qwen_1_5b
./run_all_benchmarks.sh models/glm-edge-1.5b-chat glm_edge_1_5b
# The 4b variant needs the memory-limited path for 8k:
./run_all_benchmarks.sh models/glm-edge-4b-chat glm_edge_4b   # will OOM at 8k on 15 GiB
# Rerun just 8k with:
python3 kakeya_benchmark.py --model-path models/glm-edge-4b-chat --model-name glm_edge_4b \
  --context-tokens 8192 --new-tokens 0 \
  --skip-baseline-prefill --skip-generation --prefill-chunk 1024 \
  --report reports/glm_edge_4b/bench_8192.json
python3 kakeya_extrapolate.py --report reports/glm_edge_4b/bench_8192.json \
  --targets 16384,32768,65536,131072,262144 \
  --out reports/glm_edge_4b/extrapolation.json

Environment

Same as previous benchmark PRs: CPU-only x86_64, 15 GiB RAM, bf16, eager attention, torch==2.11.0+cu130, transformers==5.5.4. Public (not gated) open-source models only.


cursoragent and others added 2 commits April 17, 2026 05:55
Three new models benchmarked using the same codec preset and harness
as the existing 4-model suite:

  - deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek)
    Qwen2 architecture, GQA 12:2, head_dim=128, 28 layers
    128k total bf16 ratio: 3.41x (3.50 GiB -> 1.03 GiB)

  - THUDM/glm-edge-1.5b-chat (Zhipu AI)
    GLM architecture, GQA 16:4, head_dim=128, 28 layers, max_pos=8192
    128k total bf16 ratio: 3.70x (7.00 GiB -> 1.89 GiB)
    (16k+ rows are codec-only projections since the model caps at 8k)

  - THUDM/glm-edge-4b-chat (Zhipu AI)
    GLM architecture, GQA 24:6, head_dim=128, 40 layers, max_pos=8192
    128k total bf16 ratio: 3.88x (15.00 GiB -> 3.87 GiB)
    8k required --skip-generation --prefill-chunk 1024 to fit in 15 GiB RAM

All three ran through the existing build_kakeya_cache(model) factory
and run_all_benchmarks.sh orchestrator with zero model-specific code;
the factory auto-dispatches based on config.layer_types (or the
num_hidden_layers + sliding_window fallback for Llama-family configs).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
  - README headline table now lists all 7 models (was 4).
  - STANDARD.md per-model index now lists DeepSeek and two GLM-Edge
    variants.
  - CROSS_MODEL.md: full 7-model matrix (2k-128k bf16), absolute
    bytes-saved ranking, grouped-by-vendor summary, and updated
    'what drives the differences' section. Also adds a head_dim vs
    ratio lookup table for quick architecture-level estimates.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor Bot marked this pull request as ready for review April 17, 2026 05:56
cursor Bot merged commit 8f4ba66 into main Apr 17, 2026
cursor Bot pushed a commit that referenced this pull request Apr 22, 2026
Buckets on the HF (+7.82%) vs vLLM (+35.33%) 27-pp gap:

  #1  Engine baseline shift            ~10 pp (clean-model PPL
                                        disagreement; 0.145 KL;
                                        18% top-1 disagreement)
  #2  Codec residual magnitude         ~0    (codec is engine-
                                        agnostic; mse ratio 1.01)
  #3  Noise-sensitivity curve          HF MORE sensitive per σ in
                                        linear regime; not the cause
  #4  Boundary layers already skipped  +69 pp saved by SPRINT_CLOSEOUT
                                        boundary policy
  #5  Cross-layer non-linear compound  +39 pp (joint-cell - Σ
                                        singletons over 22 quiet
                                        layers)

Localised root cause: vLLM's single-forward bf16 residual-stream
accumulation through Flash-Attention compounds per-layer codec
residuals ~39 pp above their sum, while HF eager's f32-accumulate
+ teacher-force over DynamicCache compounds them less aggressively.
Each per-layer residual is small on both engines (Phase 4 matched);
what differs is the accumulation path.

Deployment recommendations:
  1. Extend vLLM boundary skip to {2, 6, 11} on top of the existing
     {0,1,7,14,26,27}; cuts ~10-15 pp off the joint Delta-ppl.
  2. Adaptive per-layer bit-width: K b=4 on the hot layers, b=3
     elsewhere; preserves 19/28 of the ratio benefit.

Phase 3 ran only on vLLM (reused production harness); the HF per-
layer curve is left as a follow-up if someone wants to confirm
that HF's cross-layer interaction is the ~+10 pp we infer here.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
FluffyAIcode deleted the cursor/add-deepseek-glm-benchmarks-12f5 branch April 23, 2026 15:52