[Klaud Cold] Add glm5-fp8-h200-sglang-mtp recipe #1480

Merged
functionstackx merged 3 commits into main from add-glm5-fp8-h200-sglang-mtp
May 18, 2026
@functionstackx
Collaborator

Summary

Adds the MTP/EAGLE speculative-decoding sibling of glm5-fp8-h200-sglang. TP=8, conc 4..64, ISL/OSL 1k1k + 8k1k — same search-space shape as the existing non-MTP H200 recipe.
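A rough sketch of how that search space might be enumerated. The doubling of concurrency between 4 and 64, and the reading of 1k1k / 8k1k as 1024/1024 and 8192/1024 tokens, are assumptions for illustration; the real values live in the nvidia-master.yaml entry.

```shell
# Hypothetical enumeration of the sweep points (not the harness's own code).
# Assumptions: conc doubles from 4 to 64; 1k1k = 1024/1024, 8k1k = 8192/1024.
TP=8
POINTS=0
for seq in "1024 1024" "8192 1024"; do
  isl=${seq% *}   # text before the space: input sequence length
  osl=${seq#* }   # text after the space: output sequence length
  conc=4
  while [ "$conc" -le 64 ]; do
    echo "tp=$TP conc=$conc isl=$isl osl=$osl"
    POINTS=$((POINTS + 1))
    conc=$((conc * 2))
  done
done
echo "total points: $POINTS"
```

Under those assumptions the sweep has 5 concurrency levels for each of the 2 sequence-length pairs.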

Changes

  • nvidia-master.yaml: new glm5-fp8-h200-sglang-mtp entry (image lmsysorg/sglang:v0.5.12-cu130, model zai-org/GLM-5-FP8).
  • benchmarks/single_node/glm5_fp8_h200_mtp.sh: new launch script — mirrors glm5_fp8_h200.sh and adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4, plus --use-chat-template on the bench client per AGENTS.md.
  • perf-changelog.yaml: trigger entry.
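For orientation, a dry-run sketch of the server invocation assembled only from flags named in this PR. It prints the command rather than running it; the function name and the MODEL/TP defaults are hypothetical, and the actual script may pass further arguments not listed here.

```shell
# Hypothetical dry-run: print the launch command instead of executing it.
# MODEL and TP defaults are assumptions taken from the PR description.
MODEL="${MODEL:-zai-org/GLM-5-FP8}"
TP="${TP:-8}"

build_launch_cmd() {
  # printf repeats the '%s ' format once per argument.
  printf '%s ' \
    python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --tp-size "$TP" \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.85 \
    --served-model-name glm-5-fp8 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
  printf '\n'
}

LAUNCH_CMD="$(build_launch_cmd)"
echo "$LAUNCH_CMD"
```

The last four flags are the MTP additions; everything above them is shared with the non-MTP H200 recipe as far as this PR shows.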

Why not copy the B300 MTP launch script verbatim?

glm5_fp8_b300_mtp.sh uses NSA + trtllm-mha attention/MoE backends that are Blackwell-specific. On Hopper (H200) we stick with the same args the existing non-MTP H200 recipe uses and just bolt on the EAGLE flags.

Test plan

  • bash -n syntax-checks the launch script.
  • YAML loads cleanly + new recipe entry shape matches existing MTP siblings.
  • full-sweep-enabled sweep finishes green on H200 across tp=8 / conc 4..64 / 1k1k + 8k1k.
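The first test-plan item can be illustrated with a small sketch. bash -n parses a script without executing it, so it catches syntax errors only, not wrong flags or missing environment variables. The temporary files below are stand-ins, not the real launch script.

```shell
# Hypothetical stand-in scripts; the real target would be the launch script.
tmp_script="$(mktemp)"

# A syntactically valid script: bash -n exits 0.
printf '%s\n' '#!/usr/bin/env bash' 'echo "would launch server"' > "$tmp_script"
if bash -n "$tmp_script"; then SYNTAX_OK=yes; else SYNTAX_OK=no; fi

# An unterminated "if": bash -n reports a syntax error and exits non-zero.
printf '%s\n' 'if true; then' 'echo "missing fi"' > "$tmp_script"
if bash -n "$tmp_script" 2>/dev/null; then BROKEN_OK=yes; else BROKEN_OK=no; fi

rm -f "$tmp_script"
echo "valid script: $SYNTAX_OK, broken script: $BROKEN_OK"
```

Against the repo this would be bash -n benchmarks/single_node/glm5_fp8_h200_mtp.sh.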

🤖 Generated with Claude Code

Adds the MTP/EAGLE speculative-decoding variant of glm5-fp8-h200-sglang.
TP=8, conc 4..64, ISL/OSL 1k1k + 8k1k — same search-space shape as the
existing non-MTP H200 recipe.

Launch script mirrors benchmarks/single_node/glm5_fp8_h200.sh and adds
--speculative-algorithm EAGLE --speculative-num-steps 3
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4 (matching
the b200/b300 MTP siblings) plus --use-chat-template on the bench
client (required for EAGLE per AGENTS.md). Doesn't pull in the NSA /
trtllm-mha args from glm5_fp8_b300_mtp.sh — those backends are
Blackwell-specific.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@github-actions
Contributor

Comment on lines +44 to +48
--tp-size "$TP" \
--tool-call-parser glm47 \
--reasoning-parser glm45 \
--mem-fraction-static 0.85 \
--served-model-name glm-5-fp8 \
🔴 The new MTP launch script is missing SGLANG_ENABLE_SPEC_V2=1. Every other SGLang MTP recipe in this repo sets it — including the closest sibling qwen3.5_fp8_h200_mtp.sh (same H200/SGLang/EAGLE) and all glm5 MTP siblings (b200/b300/fp4/mi355x). Without it, the --speculative-* flags likely fall back to the legacy spec-decoding path, undermining the purpose of the recipe. Fix: add export SGLANG_ENABLE_SPEC_V2=1 near the other env setup (or inline it before the python3 -m sglang.launch_server invocation, matching qwen3.5_fp8_h200_mtp.sh:38).

Extended reasoning...

What is missing

benchmarks/single_node/glm5_fp8_h200_mtp.sh adds the four EAGLE speculative-decoding flags (--speculative-algorithm EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4) but never enables SGLang's spec-v2 scheduler via the SGLANG_ENABLE_SPEC_V2=1 environment variable. The PR description notes the script mirrors glm5_fp8_h200.sh (the non-MTP recipe, which correctly has no spec env var) and then bolts on the EAGLE flags — but the env var that gates SGLang's optimized spec-decoding path was not bolted on alongside them.

Why this matters

Every other SGLang MTP recipe in the repo sets SGLANG_ENABLE_SPEC_V2=1 — either exported (glm5_fp8_b200_mtp.sh:25, glm5_fp8_b300_mtp.sh:29, glm5_fp4_b200_mtp.sh:25, glm5_fp4_b300_mtp.sh:29, glm5_fp8_mi355x_mtp.sh:25) or as a command prefix (qwen3.5_fp8_h200_mtp.sh:38, qwen3.5_fp4_b200_mtp.sh:36, qwen3.5_fp8_b200_mtp.sh:36, qwen3.5_fp8_b300_mtp.sh:34, dsr1_fp8_b200_mtp.sh:57, dsr1_fp8_b300_mtp.sh:61). The new glm5_fp8_h200_mtp.sh is the lone outlier.

The closest direct sibling is qwen3.5_fp8_h200_mtp.sh — same hardware (H200), same framework (SGLang), same EAGLE flag set — and it launches the server with SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server …. The new recipe omits this and uses bare python3 -m sglang.launch_server.

perf-changelog.yaml history reinforces that this is a deliberate, required toggle for SGLang spec-decoding recipes. PR #1017 was titled "Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP" (line 1371). The five existing GLM5 MTP recipes are each documented as adding EAGLE "behind SGLANG_ENABLE_SPEC_V2=1" (lines 1623, 1633, 1643, 1653, 1663). Line 2185 documents aligning B200 with B300 by setting SGLANG_ENABLE_SPEC_V2=1, and line 2219 describes adding MTP flags together with SGLANG_ENABLE_SPEC_V2=1 as a unit.

Impact

Without SGLANG_ENABLE_SPEC_V2=1, the EAGLE config will either run through SGLang's legacy speculative-decoding scheduler (slower) or initialize sub-optimally — silently defeating the performance purpose of the MTP recipe. The sweep would still execute and post numbers, but they would not reflect what an H200 GLM-5 MTP recipe is supposed to measure.

How to fix

Add the env var alongside the other setup. Either:

export SGLANG_ENABLE_SPEC_V2=1

near the top of the script (matching the glm5 b200/b300/mi355x style), or inline it before the launch command (matching qwen3.5_fp8_h200_mtp.sh:38):

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server …
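Both styles make the variable visible to the server process; they differ only in scope. A minimal sketch of that difference, using sh -c as a stand-in for the server (the result variable names are hypothetical):

```shell
unset SGLANG_ENABLE_SPEC_V2

# Inline prefix: set for this one child process only.
INLINE_CHILD="$(SGLANG_ENABLE_SPEC_V2=1 sh -c 'echo "${SGLANG_ENABLE_SPEC_V2:-unset}"')"
# The calling shell itself is unaffected by the prefix.
AFTER_INLINE="${SGLANG_ENABLE_SPEC_V2:-unset}"

# export: set for the rest of the script and every child started afterwards.
export SGLANG_ENABLE_SPEC_V2=1
EXPORT_CHILD="$(sh -c 'echo "${SGLANG_ENABLE_SPEC_V2:-unset}"')"

echo "inline child: $INLINE_CHILD"
echo "shell after inline: $AFTER_INLINE"
echo "child after export: $EXPORT_CHILD"
```

The inline form keeps the toggle next to the command it affects, while export also reaches any other process the script starts later, such as a bench client.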

Step-by-step proof

  1. The recipe is invoked by the harness; lines 1–43 of glm5_fp8_h200_mtp.sh set up env-var checks, monitor, and EVAL_CONTEXT_ARGS. No environment variable named SGLANG_ENABLE_SPEC_V2 is exported anywhere in the file (the diff shows the full file; grep confirms 0 hits).
  2. Line 44 begins python3 -m sglang.launch_server — not SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server as in qwen3.5_fp8_h200_mtp.sh:38.
  3. SGLang reads SGLANG_ENABLE_SPEC_V2 from the process environment at server startup; with the variable unset, the speculative-decoding stack falls back to its v1/legacy path.
  4. The --speculative-algorithm EAGLE … flags are still parsed and applied, but they run on the legacy scheduler — which is precisely what every other MTP recipe in the repo, and the perf-changelog history, deliberately avoids.
  5. Result: the recipe ships claiming to benchmark GLM-5 FP8 H200 with MTP, but is actually measuring GLM-5 FP8 H200 with EAGLE on the slower legacy spec path. The numbers published from this sweep will not match the MTP recipe's intent.
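The grep check in step 1 could be generalized into a small lint over all MTP scripts. A sketch under stated assumptions: the three files below are hypothetical fixtures standing in for real recipes, and the *_mtp.sh glob is assumed to match the repo's naming.

```shell
# Hypothetical fixtures in a temp dir, one per style plus one outlier.
workdir="$(mktemp -d)"
printf '%s\n' 'export SGLANG_ENABLE_SPEC_V2=1' 'python3 -m sglang.launch_server' \
  > "$workdir/with_export_mtp.sh"
printf '%s\n' 'SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server' \
  > "$workdir/with_prefix_mtp.sh"
printf '%s\n' 'python3 -m sglang.launch_server' \
  > "$workdir/without_toggle_mtp.sh"

# grep -L prints the names of files that contain no matching line.
MISSING="$(cd "$workdir" && grep -L 'SGLANG_ENABLE_SPEC_V2=1' ./*_mtp.sh || true)"

rm -rf "$workdir"
echo "scripts missing the toggle: $MISSING"
```

Run against benchmarks/single_node/, a check like this would presumably also list MTP scripts for other frameworks, so the output would need filtering before being treated as a failure.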

@functionstackx
Collaborator Author

/reuse-sweep-run

functionstackx merged commit ec15908 into main on May 18, 2026
3 of 5 checks passed
functionstackx deleted the add-glm5-fp8-h200-sglang-mtp branch on May 18, 2026 at 06:31