[Klaud Cold] Add glm5-fp8-h200-sglang-mtp recipe by functionstackx · Pull Request #1480 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T06:01:33Z

Summary

Adds the MTP/EAGLE speculative-decoding sibling of glm5-fp8-h200-sglang. TP=8, conc 4..64, ISL/OSL 1k1k + 8k1k — same search-space shape as the existing non-MTP H200 recipe.

Changes

nvidia-master.yaml: new glm5-fp8-h200-sglang-mtp entry (image lmsysorg/sglang:v0.5.12-cu130, model zai-org/GLM-5-FP8).
benchmarks/single_node/glm5_fp8_h200_mtp.sh: new launch script — mirrors glm5_fp8_h200.sh and adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4, plus --use-chat-template on the bench client per AGENTS.md.
perf-changelog.yaml: trigger entry.
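A minimal sketch of how the four EAGLE flags listed above could be kept in one place and appended to the server command. `build_spec_args` is a hypothetical helper for illustration, not a function from the repo, and the rest of the real launch command is not reproduced here.

```shell
# Hypothetical helper: emits the EAGLE speculative-decoding flags named in
# this PR as a single space-separated string.
build_spec_args() {
  printf '%s ' \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
}

# Show the flags as they would be appended to the launch command.
echo "spec args: $(build_spec_args)"
```

Keeping the flags in one helper makes it easier to compare them against the b200/b300 MTP siblings when the values change.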

Why not copy the B300 MTP launch script verbatim?

glm5_fp8_b300_mtp.sh uses NSA + trtllm-mha attention/MoE backends that are Blackwell-specific. On Hopper (H200) we stick with the same args the existing non-MTP H200 recipe uses and just bolt on the EAGLE flags.

Test plan

bash -n syntax-checks the launch script.
YAML loads cleanly + new recipe entry shape matches existing MTP siblings.
full-sweep-enabled sweep finishes green on H200 across tp=8 / conc 4..64 / 1k1k + 8k1k.
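The first item of the test plan can be sketched as a small wrapper around `bash -n`, which parses a script without executing it. `check_syntax` is a hypothetical helper name, and the path in the usage comment is the one given in this PR.

```shell
# Hypothetical helper: reports whether a script parses cleanly under bash -n.
# bash -n reads the file and checks syntax only; nothing in it is executed.
check_syntax() {
  if [ ! -f "$1" ]; then
    echo "missing: $1"
    return 2
  fi
  if bash -n "$1" 2>/dev/null; then
    echo "ok: $1"
  else
    echo "syntax error: $1"
    return 1
  fi
}

# Usage (from the repo root):
#   check_syntax benchmarks/single_node/glm5_fp8_h200_mtp.sh
```

A syntax check like this catches quoting and control-flow mistakes but says nothing about missing environment variables or flags, so it complements the sweep rather than replacing it.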

🤖 Generated with Claude Code

Adds the MTP/EAGLE speculative-decoding variant of glm5-fp8-h200-sglang. TP=8, conc 4..64, ISL/OSL 1k1k + 8k1k: same search-space shape as the existing non-MTP H200 recipe.

Launch script mirrors benchmarks/single_node/glm5_fp8_h200.sh and adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 (matching the b200/b300 MTP siblings) plus --use-chat-template on the bench client (required for EAGLE per AGENTS.md). Doesn't pull in the NSA / trtllm-mha args from glm5_fp8_b300_mtp.sh; those backends are Blackwell-specific.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-18T06:01:42Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-18T06:02:10Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016384420
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26016384420

claude · 2026-05-18T06:12:06Z

+  --tp-size "$TP" \
+  --tool-call-parser glm47 \
+  --reasoning-parser glm45 \
+  --mem-fraction-static 0.85 \
+  --served-model-name glm-5-fp8 \


🔴 The new MTP launch script is missing SGLANG_ENABLE_SPEC_V2=1. Every other SGLang MTP recipe in this repo sets it — including the closest sibling qwen3.5_fp8_h200_mtp.sh (same H200/SGLang/EAGLE) and all glm5 MTP siblings (b200/b300/fp4/mi355x). Without it, the --speculative-* flags likely fall back to the legacy spec-decoding path, undermining the purpose of the recipe. Fix: add export SGLANG_ENABLE_SPEC_V2=1 near the other env setup (or inline it before the python3 -m sglang.launch_server invocation, matching qwen3.5_fp8_h200_mtp.sh:38).

Extended reasoning...

What is missing

benchmarks/single_node/glm5_fp8_h200_mtp.sh adds the four EAGLE speculative-decoding flags (--speculative-algorithm EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4) but never enables SGLang's spec-v2 scheduler via the SGLANG_ENABLE_SPEC_V2=1 environment variable. The PR description notes the script mirrors glm5_fp8_h200.sh (the non-MTP recipe, which correctly has no spec env var) and then bolts on the EAGLE flags — but the env var that gates SGLang's optimized spec-decoding path was not bolted on alongside them.

Why this matters

Every other SGLang MTP recipe in the repo sets SGLANG_ENABLE_SPEC_V2=1 — either exported (glm5_fp8_b200_mtp.sh:25, glm5_fp8_b300_mtp.sh:29, glm5_fp4_b200_mtp.sh:25, glm5_fp4_b300_mtp.sh:29, glm5_fp8_mi355x_mtp.sh:25) or as a command prefix (qwen3.5_fp8_h200_mtp.sh:38, qwen3.5_fp4_b200_mtp.sh:36, qwen3.5_fp8_b200_mtp.sh:36, qwen3.5_fp8_b300_mtp.sh:34, dsr1_fp8_b200_mtp.sh:57, dsr1_fp8_b300_mtp.sh:61). The new glm5_fp8_h200_mtp.sh is the lone outlier.

The closest direct sibling is qwen3.5_fp8_h200_mtp.sh — same hardware (H200), same framework (SGLang), same EAGLE flag set — and it launches the server with SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server …. The new recipe omits this and uses bare python3 -m sglang.launch_server.

perf-changelog.yaml history reinforces that this is a deliberate, required toggle for SGLang spec-decoding recipes. PR #1017 was titled "Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP" (line 1371). The five existing GLM5 MTP recipes are each documented as adding EAGLE "behind SGLANG_ENABLE_SPEC_V2=1" (lines 1623, 1633, 1643, 1653, 1663). Line 2185 documents aligning B200 with B300 by setting SGLANG_ENABLE_SPEC_V2=1, and line 2219 describes adding MTP flags together with SGLANG_ENABLE_SPEC_V2=1 as a unit.

Impact

Without SGLANG_ENABLE_SPEC_V2=1, the EAGLE config will either run through SGLang's legacy speculative-decoding scheduler (slower) or initialize sub-optimally — silently defeating the performance purpose of the MTP recipe. The sweep would still execute and post numbers, but they would not reflect what an H200 GLM-5 MTP recipe is supposed to measure.

How to fix

Add the env var alongside the other setup. Either:

export SGLANG_ENABLE_SPEC_V2=1

near the top of the script (matching the glm5 b200/b300/mi355x style), or inline it before the launch command (matching qwen3.5_fp8_h200_mtp.sh:38):

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server …
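The two forms differ only in scope, which a short experiment can illustrate. In this sketch, `sh -c "$PROBE"` is a stand-in for the server process (an assumption for demonstration; it does not launch SGLang), and it simply reports whether the child process sees the variable.

```shell
# Stand-in child process: prints the value it inherits, or "unset".
PROBE='echo "SGLANG_ENABLE_SPEC_V2=${SGLANG_ENABLE_SPEC_V2:-unset}"'

unset SGLANG_ENABLE_SPEC_V2

# Bare launch: the child does not see the variable.
bare="$(sh -c "$PROBE")"

# Inline prefix: the variable is set for this one command only.
inline="$(SGLANG_ENABLE_SPEC_V2=1 sh -c "$PROBE")"

# The prefix did not leak into later commands.
after_inline="$(sh -c "$PROBE")"

# export: every later child process in this script inherits the variable.
export SGLANG_ENABLE_SPEC_V2=1
exported="$(sh -c "$PROBE")"

printf '%s\n' "bare:         $bare" \
              "inline:       $inline" \
              "after inline: $after_inline" \
              "exported:     $exported"
```

Either form reaches the server process. The export style also covers any other child process started later in the script, while the inline prefix keeps the setting local to the launch command.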

Step-by-step proof

The recipe is invoked by the harness; lines 1–43 of glm5_fp8_h200_mtp.sh set up env-var checks, monitor, and EVAL_CONTEXT_ARGS. No environment variable named SGLANG_ENABLE_SPEC_V2 is exported anywhere in the file (the diff shows the full file; grep confirms 0 hits).

Line 44 begins python3 -m sglang.launch_server — not SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server as in qwen3.5_fp8_h200_mtp.sh:38.

SGLang reads SGLANG_ENABLE_SPEC_V2 from the process environment at server startup; with the variable unset, the speculative-decoding stack falls back to its v1/legacy path.

The --speculative-algorithm EAGLE … flags are still parsed and applied, but they run on the legacy scheduler — which is precisely what every other MTP recipe in the repo, and the perf-changelog history, deliberately avoids.

Result: the recipe ships claiming to benchmark GLM-5 FP8 H200 with MTP, but is actually measuring GLM-5 FP8 H200 with EAGLE on the slower legacy spec path. The numbers published from this sweep will not match the MTP recipe's intent.
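The grep check in the proof could be generalized into a small lint over all MTP launch scripts. This is a sketch under two assumptions: that MTP recipes follow the `*_mtp.sh` naming seen in this thread, and that an SGLang recipe can be recognized by the string `sglang.launch_server`. `list_missing_spec_v2` is a hypothetical helper, not part of the repo.

```shell
# Hypothetical helper: prints each *_mtp.sh in the given directory that
# launches SGLang but never sets SGLANG_ENABLE_SPEC_V2=1 (export or prefix).
list_missing_spec_v2() {
  dir="$1"
  for f in "$dir"/*_mtp.sh; do
    [ -f "$f" ] || continue
    # Skip recipes that do not launch the SGLang server.
    grep -q 'sglang.launch_server' "$f" || continue
    # Report the file when the toggle is absent.
    grep -q 'SGLANG_ENABLE_SPEC_V2=1' "$f" || echo "$f"
  done
}

# Usage (from the repo root):
#   list_missing_spec_v2 benchmarks/single_node
```

Because the pattern matches both `export SGLANG_ENABLE_SPEC_V2=1` and the inline prefix form, the check accepts either style used by the sibling recipes.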

github-actions · 2026-05-18T06:27:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016387220
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26016387220

functionstackx · 2026-05-18T06:30:57Z

/reuse-sweep-run

github-actions · 2026-05-18T06:31:40Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26017405120
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26017405120

functionstackx requested a review from a team May 18, 2026 06:01

functionstackx added the full-sweep-enabled label May 18, 2026

functionstackx requested review from jgangani and kedarpotdar-nv as code owners May 18, 2026 06:01

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

chore: fill pr-link for #1480

04e3d38

claude Bot reviewed May 18, 2026

Merge branch 'main' into add-glm5-fp8-h200-sglang-mtp

d74a0b0

functionstackx merged commit ec15908 into main May 18, 2026
3 of 5 checks passed

functionstackx deleted the add-glm5-fp8-h200-sglang-mtp branch May 18, 2026 06:31

github-project-automation Bot moved this to Done in InferenceMAX Board May 18, 2026

functionstackx mentioned this pull request May 26, 2026

feat(blog): B200 NVFP4 vs H200 FP8 on GLM-5 — up to 3.65x better perf/$ SemiAnalysisAI/InferenceX-app#386

Merged

6 tasks

functionstackx merged 3 commits into main from add-glm5-fp8-h200-sglang-mtp