[Klaud Cold] Add glm5-fp8-h200-sglang-mtp recipe#1480
Adds the MTP/EAGLE speculative-decoding variant of glm5-fp8-h200-sglang. TP=8, conc 4..64, ISL/OSL 1k1k + 8k1k — same search-space shape as the existing non-MTP H200 recipe. Launch script mirrors benchmarks/single_node/glm5_fp8_h200.sh and adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 (matching the b200/b300 MTP siblings) plus --use-chat-template on the bench client (required for EAGLE per AGENTS.md). Doesn't pull in the NSA / trtllm-mha args from glm5_fp8_b300_mtp.sh — those backends are Blackwell-specific. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If you re-run failed jobs, you are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016384420
--tp-size "$TP" \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--mem-fraction-static 0.85 \
--served-model-name glm-5-fp8 \
🔴 The new MTP launch script is missing SGLANG_ENABLE_SPEC_V2=1. Every other SGLang MTP recipe in this repo sets it — including the closest sibling qwen3.5_fp8_h200_mtp.sh (same H200/SGLang/EAGLE) and all glm5 MTP siblings (b200/b300/fp4/mi355x). Without it, the --speculative-* flags likely fall back to the legacy spec-decoding path, undermining the purpose of the recipe. Fix: add export SGLANG_ENABLE_SPEC_V2=1 near the other env setup (or inline it before the python3 -m sglang.launch_server invocation, matching qwen3.5_fp8_h200_mtp.sh:38).
Extended reasoning...
What is missing
benchmarks/single_node/glm5_fp8_h200_mtp.sh adds the four EAGLE speculative-decoding flags (--speculative-algorithm EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4) but never enables SGLang's spec-v2 scheduler via the SGLANG_ENABLE_SPEC_V2=1 environment variable. The PR description notes the script mirrors glm5_fp8_h200.sh (the non-MTP recipe, which correctly has no spec env var) and then bolts on the EAGLE flags — but the env var that gates SGLang's optimized spec-decoding path was not bolted on alongside them.
Why this matters
Every other SGLang MTP recipe in the repo sets SGLANG_ENABLE_SPEC_V2=1 — either exported (glm5_fp8_b200_mtp.sh:25, glm5_fp8_b300_mtp.sh:29, glm5_fp4_b200_mtp.sh:25, glm5_fp4_b300_mtp.sh:29, glm5_fp8_mi355x_mtp.sh:25) or as a command prefix (qwen3.5_fp8_h200_mtp.sh:38, qwen3.5_fp4_b200_mtp.sh:36, qwen3.5_fp8_b200_mtp.sh:36, qwen3.5_fp8_b300_mtp.sh:34, dsr1_fp8_b200_mtp.sh:57, dsr1_fp8_b300_mtp.sh:61). The new glm5_fp8_h200_mtp.sh is the lone outlier.
The closest direct sibling is qwen3.5_fp8_h200_mtp.sh — same hardware (H200), same framework (SGLang), same EAGLE flag set — and it launches the server with SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server …. The new recipe omits this and uses bare python3 -m sglang.launch_server.
perf-changelog.yaml history reinforces that this is a deliberate, required toggle for SGLang spec-decoding recipes. PR #1017 was titled "Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP" (line 1371). The five existing GLM5 MTP recipes are each documented as adding EAGLE "behind SGLANG_ENABLE_SPEC_V2=1" (lines 1623, 1633, 1643, 1653, 1663). Line 2185 documents aligning B200 with B300 by setting SGLANG_ENABLE_SPEC_V2=1, and line 2219 describes adding MTP flags together with SGLANG_ENABLE_SPEC_V2=1 as a unit.
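The consistency claim above (every SGLang MTP recipe sets the variable) is easy to check mechanically. The following is a minimal sketch, not part of the PR: `find_missing_spec_v2` is a hypothetical helper name, and it assumes MTP launch scripts follow the `*_mtp.sh` naming seen in this thread and that SGLang recipes can be recognized by the string `sglang.launch_server`.

```shell
#!/usr/bin/env bash
# Sketch: print the names of SGLang MTP launch scripts in a directory that
# never mention SGLANG_ENABLE_SPEC_V2=1 (neither exported nor as a prefix).
# find_missing_spec_v2 is a hypothetical helper, not something in the repo.
find_missing_spec_v2() {
  local dir="$1" script
  for script in "$dir"/*_mtp.sh; do
    [ -e "$script" ] || continue                           # glob matched nothing
    grep -q 'sglang.launch_server' "$script" || continue   # skip non-SGLang recipes
    grep -q 'SGLANG_ENABLE_SPEC_V2=1' "$script" || basename "$script"
  done
}
```

Run as `find_missing_spec_v2 benchmarks/single_node`, it would be expected to print only `glm5_fp8_h200_mtp.sh` if the review's survey of the sibling scripts is accurate.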
Impact
Without SGLANG_ENABLE_SPEC_V2=1, the EAGLE config will either run through SGLang's legacy speculative-decoding scheduler (slower) or initialize sub-optimally — silently defeating the performance purpose of the MTP recipe. The sweep would still execute and post numbers, but they would not reflect what an H200 GLM-5 MTP recipe is supposed to measure.
How to fix
Add the env var alongside the other setup. Either:
- export SGLANG_ENABLE_SPEC_V2=1 near the top of the script (matching the glm5 b200/b300/mi355x style), or
- inline it before the launch command (matching qwen3.5_fp8_h200_mtp.sh:38):
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server …
Step-by-step proof
- The recipe is invoked by the harness; lines 1–43 of glm5_fp8_h200_mtp.sh set up env-var checks, monitor, and EVAL_CONTEXT_ARGS. No environment variable named SGLANG_ENABLE_SPEC_V2 is exported anywhere in the file (the diff shows the full file; grep confirms 0 hits).
- Line 44 begins python3 -m sglang.launch_server, not SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server as in qwen3.5_fp8_h200_mtp.sh:38.
- SGLang reads SGLANG_ENABLE_SPEC_V2 from the process environment at server startup; with the variable unset, the speculative-decoding stack falls back to its v1/legacy path.
- The --speculative-algorithm EAGLE … flags are still parsed and applied, but they run on the legacy scheduler, which is precisely what every other MTP recipe in the repo, and the perf-changelog history, deliberately avoids.
- Result: the recipe ships claiming to benchmark GLM-5 FP8 H200 with MTP, but is actually measuring GLM-5 FP8 H200 with EAGLE on the slower legacy spec path. The numbers published from this sweep will not match the MTP recipe's intent.
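The two fix styles named under "How to fix" differ only in how far the variable reaches. The sketch below illustrates that with plain shell semantics; it does not launch SGLang, and `probe` is a stand-in child command used purely for illustration.

```shell
#!/usr/bin/env bash
# Sketch: scope of a command-prefix assignment vs. an export.
# `probe` stands in for the server process; it only reports what it inherits.
probe='echo "${SGLANG_ENABLE_SPEC_V2:-unset}"'

unset SGLANG_ENABLE_SPEC_V2
prefixed=$(SGLANG_ENABLE_SPEC_V2=1 sh -c "$probe")  # set for this one child only
after_prefix=$(sh -c "$probe")                      # the script itself is unchanged

export SGLANG_ENABLE_SPEC_V2=1
exported=$(sh -c "$probe")                          # every later child inherits it

echo "prefix: $prefixed, after prefix: $after_prefix, export: $exported"
# prints "prefix: 1, after prefix: unset, export: 1"
```

Either style gets the variable into the server's environment. The prefix form keeps it off any other process the script starts afterwards, while the export form matches the existing glm5 siblings.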
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016387220
/reuse-sweep-run
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26017405120
Summary
Adds the MTP/EAGLE speculative-decoding sibling of glm5-fp8-h200-sglang. TP=8, conc 4..64, ISL/OSL 1k1k + 8k1k: the same search-space shape as the existing non-MTP H200 recipe.
Changes
- nvidia-master.yaml: new glm5-fp8-h200-sglang-mtp entry (image lmsysorg/sglang:v0.5.12-cu130, model zai-org/GLM-5-FP8).
- benchmarks/single_node/glm5_fp8_h200_mtp.sh: new launch script that mirrors glm5_fp8_h200.sh and adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4, plus --use-chat-template on the bench client per AGENTS.md.
- perf-changelog.yaml: trigger entry.
Why not copy the B300 MTP launch script verbatim?
glm5_fp8_b300_mtp.sh uses NSA + trtllm-mha attention/MoE backends that are Blackwell-specific. On Hopper (H200) we stick with the same args the existing non-MTP H200 recipe uses and just bolt on the EAGLE flags.
Test plan
- bash -n syntax-checks the launch script.
- full-sweep-enabled sweep finishes green on H200 across tp=8 / conc 4..64 / 1k1k + 8k1k.
🤖 Generated with Claude Code
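The first test-plan item can be sketched as a small wrapper around `bash -n`, which parses a script without executing it. `syntax_check` is a hypothetical helper name, not something defined in the repo.

```shell
#!/usr/bin/env bash
# Sketch: parse-only check of a launch script. `bash -n` reads the file and
# reports syntax errors without running any of its commands.
# syntax_check is a hypothetical wrapper name.
syntax_check() {
  local script="$1"
  if bash -n "$script" 2>/dev/null; then
    echo "OK: $script"
  else
    echo "SYNTAX ERROR: $script"
    return 1
  fi
}
```

For example, `syntax_check benchmarks/single_node/glm5_fp8_h200_mtp.sh`. A parse-only check catches quoting and control-flow mistakes but not a missing environment variable, so it complements rather than replaces the sweep run.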