Add DeepSeek and GLM benchmark reports (+ cleanup of legacy top-level reports) #2
Merged
cursor[bot] merged 2 commits into main on Apr 17, 2026
Conversation
Three new models benchmarked using the same codec preset and harness
as the existing 4-model suite:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B (DeepSeek)
  Qwen2 architecture, GQA 12:2, head_dim=128, 28 layers
  128k total bf16 ratio: 3.41x (3.50 GiB -> 1.03 GiB)
- THUDM/glm-edge-1.5b-chat (Zhipu AI)
  GLM architecture, GQA 16:4, head_dim=128, 28 layers, max_pos=8192
  128k total bf16 ratio: 3.70x (7.00 GiB -> 1.89 GiB)
  (16k+ rows are codec-only projections since the model caps at 8k)
- THUDM/glm-edge-4b-chat (Zhipu AI)
  GLM architecture, GQA 24:6, head_dim=128, 40 layers, max_pos=8192
  128k total bf16 ratio: 3.88x (15.00 GiB -> 3.87 GiB)
  (the 8k run required --skip-generation --prefill-chunk 1024 to fit in 15 GiB RAM)
All three ran through the existing build_kakeya_cache(model) factory
and run_all_benchmarks.sh orchestrator with zero model-specific code;
the factory auto-dispatches based on config.layer_types (or the
num_hidden_layers + sliding_window fallback for Llama-family configs).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
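For readers skimming the reports, here is a minimal sketch of the auto-dispatch rule the commit message above describes. The helper name is hypothetical; `layer_types`, `num_hidden_layers`, and `sliding_window` are standard Hugging Face config fields, but the exact logic inside `build_kakeya_cache` may differ.

```python
# Hypothetical sketch of the config-based dispatch described above -- not the
# actual build_kakeya_cache implementation.
def infer_layer_kinds(config):
    """Return one attention kind per layer: 'full_attention' or 'sliding_attention'."""
    layer_types = getattr(config, "layer_types", None)
    if layer_types is not None:
        # Configs that spell out the per-layer attention pattern explicitly.
        return list(layer_types)
    # Llama-family fallback: a single sliding_window value (or None) applies
    # uniformly to all num_hidden_layers layers.
    kind = "sliding_attention" if getattr(config, "sliding_window", None) else "full_attention"
    return [kind] * config.num_hidden_layers
```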
- README headline table now lists all 7 models (was 4).
- STANDARD.md per-model index now lists DeepSeek and two GLM-Edge
variants.
- CROSS_MODEL.md: full 7-model matrix (2k-128k bf16), absolute
bytes-saved ranking, grouped-by-vendor summary, and updated
'what drives the differences' section. Also adds a head_dim vs
ratio lookup table for quick architecture-level estimates.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
cursor[bot] pushed a commit that referenced this pull request on Apr 22, 2026
Buckets on the HF (+7.82%) vs vLLM (+35.33%) 27-pp gap:
#1 Engine baseline shift: ~10 pp (clean-model PPL disagreement; 0.145 KL; 18% top-1 disagreement)
#2 Codec residual magnitude: ~0 (codec is engine-agnostic; MSE ratio 1.01)
#3 Noise-sensitivity curve: HF is MORE sensitive per σ in the linear regime; not the cause
#4 Boundary layers already skipped: +69 pp saved by the SPRINT_CLOSEOUT boundary policy
#5 Cross-layer non-linear compounding: +39 pp (joint cell − Σ singletons over 22 quiet layers)

Localised root cause: vLLM's single-forward bf16 residual-stream accumulation through Flash-Attention compounds per-layer codec residuals ~39 pp above their sum, while HF eager's f32-accumulate + teacher-forcing over DynamicCache compounds them less aggressively. Each per-layer residual is small on both engines (Phase 4 matched); what differs is the accumulation path.

Deployment recommendations:
1. Extend the vLLM boundary skip set to {2, 6, 11} on top of the existing {0, 1, 7, 14, 26, 27}; this cuts ~10-15 pp off the joint Delta-ppl.
2. Adaptive per-layer bit-width: K b=4 on the hot layers, b=3 elsewhere; preserves 19/28 of the ratio benefit.

Phase 3 ran only on vLLM (reused the production harness); the HF per-layer curve is left as a follow-up if someone wants to confirm that HF's cross-layer interaction is the ~+10 pp we infer here.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
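A minimal, hypothetical sketch of the two deployment recommendations above expressed as configuration values; the variable names are assumptions, and the hot-layer set is not enumerated in the commit message.

```python
# Hypothetical config sketch for the recommendations above (names are assumptions).
EXISTING_BOUNDARY_SKIP = {0, 1, 7, 14, 26, 27}   # layers already left uncompressed
EXTRA_VLLM_SKIP = {2, 6, 11}                     # recommendation 1: widen the vLLM skip set
vllm_skip_layers = EXISTING_BOUNDARY_SKIP | EXTRA_VLLM_SKIP

# Recommendation 2: adaptive per-layer key bit-width over the 28 layers --
# b=4 on the "hot" layers, b=3 everywhere else. The hot-layer set itself is
# not spelled out above, so it is left as a placeholder here.
HOT_LAYERS: set = set()
key_bits = {
    layer: (4 if layer in HOT_LAYERS else 3)
    for layer in range(28)
    if layer not in vllm_skip_layers   # skipped boundary layers stay uncompressed
}
```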
Summary
Extends the existing cross-model benchmark matrix (Gemma 4 / Qwen2.5 / SmolLM2 / Qwen3) with three additional open-source models from DeepSeek and Zhipu AI (GLM) — the ones that actually fit on the 15 GiB CPU benchmark box.
- `deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B` (DeepSeek, 1.5B, Qwen2 backbone)
- `THUDM/glm-edge-1.5b-chat` (Zhipu AI, 1.5B, on-device GLM)
- `THUDM/glm-edge-4b-chat` (Zhipu AI, 4B, on-device GLM)

All three ran through the same `build_kakeya_cache(model)` factory and the same `run_all_benchmarks.sh` orchestrator as before. Zero model-specific code. The codec generalization shipped in PR #1 handled every new config out of the box.

Headline: 128k total compression ratio (bf16 store)
Per-model highlights
DeepSeek-R1-Distill-Qwen-1.5B (3.41× @ 128k)
Notes on the DeepSeek-V2-Lite/V3 family (which uses MLA) are in the report; the short version is "Kakeya should be applied at the MLA latent dimension, needs ~50 lines of adapter".

GLM-Edge-1.5B-Chat (3.70× @ 128k)
GLM-Edge-4B-Chat (3.88× @ 128k)
The 8k run required `--skip-generation --prefill-chunk 1024` to fit in 15 GiB CPU RAM (40 layers × 6 KV heads × head_dim 128 at 8k, plus activations, is close to the limit).

Extrapolator cross-validation on the new models
The 16k / 32k / 64k / 128k rows for each new model are byte-exact projections from each model's 8k measurement (using the same method validated against Gemma 4's real 16k/32k runs to ≤ 0.003 absolute error). For the GLM-Edge models this is strictly a codec-byte projection since the models themselves cannot process 16k+ inputs.
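The projection method itself is not spelled out in this PR; purely as illustration, here is one way such a byte projection could work, assuming the compressed cache splits into a fixed per-model overhead plus a constant per-token payload (both taken from the 8k run) and that raw bf16 bytes grow exactly linearly with context length. The function name and signature are hypothetical, not the project's actual extrapolator.

```python
# Illustrative only -- not the project's actual extrapolator. Assumes compressed
# bytes = fixed_overhead + per_token * ctx, with both terms read off the 8k run.
def project_ratio(raw_bytes_8k: int,
                  compressed_bytes_8k: int,
                  fixed_overhead_bytes: int,
                  target_ctx: int,
                  measured_ctx: int = 8192) -> float:
    per_token_raw = raw_bytes_8k / measured_ctx
    per_token_comp = (compressed_bytes_8k - fixed_overhead_bytes) / measured_ctx
    projected_raw = per_token_raw * target_ctx
    projected_comp = fixed_overhead_bytes + per_token_comp * target_ctx
    return projected_raw / projected_comp
```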
What drives the per-model differences
Added a concrete head_dim-vs-ratio lookup table to `reports/CROSS_MODEL.md` for quick architecture-level estimates. On the same codec preset the model-family difference washes out: Gemma, Qwen, DeepSeek, and GLM all reach the same asymptotic envelope; what varies is how fast they approach it (a function of head_dim × GQA ratio) and what fraction of their KV is full-attention vs sliding.
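As a sanity check on the "before" sizes quoted earlier, the raw bf16 figures follow from the usual KV-cache arithmetic (2 tensors × layers × kv_heads × head_dim × 2 bytes per token), assuming full attention on every layer; a short sketch:

```python
# Raw bf16 KV-cache size: K and V (2 tensors) x layers x kv_heads x head_dim
# x 2 bytes (bf16), per token -- assuming full attention on every layer.
def raw_kv_bytes(layers: int, kv_heads: int, head_dim: int, ctx: int) -> int:
    return 2 * layers * kv_heads * head_dim * 2 * ctx

GIB = 1024 ** 3
# These reproduce the 128k "before" figures quoted in the commit message above.
assert round(raw_kv_bytes(28, 2, 128, 128 * 1024) / GIB, 2) == 3.50   # DeepSeek-R1-Distill-Qwen-1.5B
assert round(raw_kv_bytes(28, 4, 128, 128 * 1024) / GIB, 2) == 7.00   # GLM-Edge-1.5B-Chat
assert round(raw_kv_bytes(40, 6, 128, 128 * 1024) / GIB, 2) == 15.00  # GLM-Edge-4B-Chat
```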
Cleanup
Removed duplicated legacy files from `reports/` (the top-level `bench_2k.json` … `bench_32k.json`, `extrapolation_from_*.json`, `quick.json`). The canonical copies live under `reports/gemma4_e2b/`; the duplicates were an artifact of the original single-model benchmark directory layout before the per-model folders were introduced in PR #1.

Files
- `reports/deepseek_r1_distill_qwen_1_5b/` — per-context `bench_*.json` + `extrapolation.json` + `REPORT.md`
- `reports/glm_edge_1_5b/` — same layout
- `reports/glm_edge_4b/` — same layout
- `README.md`, `reports/STANDARD.md`, `reports/CROSS_MODEL.md` updated to include the three new models

Repro
Environment
Same as previous benchmark PRs: CPU-only x86_64, 15 GiB RAM, BF16, eager attention, `torch==2.11.0+cu130`, `transformers==5.5.4`. Public (not gated) open-source models only.