[Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes#1485
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you! PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring that they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016489321 |
#!/usr/bin/env bash

# GLM-5 FP8 on MI325X with EAGLE / MTP speculative decoding.
# Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
🔴 The new glm5_fp8_mi325x_mtp.sh script enables EAGLE speculative decoding but does NOT export SGLANG_ENABLE_SPEC_V2=1, which every other GLM-5 MTP recipe in the codebase explicitly sets (b200, b300, fp4-b200, fp4-b300, and the closest ROCm sibling glm5_fp8_mi355x_mtp.sh:25). Without it, sglang on v0.5.12 falls back to the legacy spec-decoding implementation, which would give degraded/incorrect MTP numbers relative to the rest of the GLM-5 fleet. Fix: add export SGLANG_ENABLE_SPEC_V2=1 alongside the other top-of-script setup, mirroring glm5_fp8_mi355x_mtp.sh:25.
Extended reasoning...
What the bug is
benchmarks/single_node/glm5_fp8_mi325x_mtp.sh (lines 1–30 of the new file) launches sglang with full EAGLE / MTP speculative-decoding flags:
--speculative-algorithm EAGLE
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
…but the script never exports SGLANG_ENABLE_SPEC_V2=1. This env var is the documented opt-in for sglang's V2 spec-decoding path on v0.5.x — the path the rest of the GLM-5 fleet uses.
Why this matters / the GLM-5 family pattern
A grep across benchmarks/single_node/ shows that every GLM-5 MTP recipe sets this var, on both NVIDIA and AMD:
- glm5_fp8_b200_mtp.sh:25
- glm5_fp8_b300_mtp.sh:29
- glm5_fp4_b200_mtp.sh:25
- glm5_fp4_b300_mtp.sh:29
- glm5_fp8_mi355x_mtp.sh:25 ← closest sibling (same model, sglang framework, MTP/EAGLE, ROCm)
The new glm5_fp8_mi325x_mtp.sh is the only GLM-5 MTP recipe missing it. The script's own header says it "Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags" — the mirroring is from the non-MTP mi325x and qwen3.5-mi325x lineage rather than from the GLM-5 MTP family, which is the likely source of the oversight. (The two qwen3.5 mi355x mtp scripts also lack the var, but they run sglang v0.5.10rc0 on a different model family; this is a GLM-5-specific knob tied to GLM-5's draft-model layout on v0.5.12.)
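The grep described above can be turned into a small audit. The following is a minimal sketch, not code from the repository: the function name `list_mtp_missing_spec_v2` is hypothetical, and the demo runs against a throwaway directory with stand-in files rather than the real benchmarks/single_node/ tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every glm5_*_mtp.sh script in a directory
# that never exports SGLANG_ENABLE_SPEC_V2=1.
list_mtp_missing_spec_v2() {
    local dir="$1"
    local f
    for f in "$dir"/glm5_*_mtp.sh; do
        [ -e "$f" ] || continue   # glob matched nothing
        if ! grep -q 'export SGLANG_ENABLE_SPEC_V2=1' "$f"; then
            basename "$f"
        fi
    done
}

# Demo on stand-in files (contents are illustrative, not the real recipes).
demo_dir="$(mktemp -d)"
printf 'export SGLANG_ENABLE_SPEC_V2=1\n' > "$demo_dir/glm5_fp8_mi355x_mtp.sh"
printf '# no SGLANG_* exports\n' > "$demo_dir/glm5_fp8_mi325x_mtp.sh"
missing="$(list_mtp_missing_spec_v2 "$demo_dir")"
echo "$missing"   # prints glm5_fp8_mi325x_mtp.sh
rm -rf "$demo_dir"
```

Pointed at the real directory, a check like this would flag any future GLM-5 MTP recipe that omits the export.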
Step-by-step proof
- The new recipe uses image lmsysorg/sglang:v0.5.12-rocm720-mi30x. This is the same v0.5.12 era as the b200/b300 GLM-5 MTP recipes that gate spec-V2 behind this env var.
- glm5_fp8_mi355x_mtp.sh (same model, same framework, same MTP feature, ROCm) does export SGLANG_ENABLE_SPEC_V2=1 near the top of the script (line 25), right alongside other ROCm sglang env exports.
- glm5_fp8_mi325x_mtp.sh defines no SGLANG_* exports at all; the only top-of-script setup is check_env_vars and the HF download branch.
- With the V2 path disabled, sglang routes EAGLE/MTP through the legacy spec-decoding implementation that the GLM-5 family explicitly opts out of in every other recipe.
- Net effect: when this script does run, MTP results will silently land on a different code path than the rest of the GLM-5 fleet, giving degraded or non-comparable numbers without any obvious failure signal.
Why existing code doesn't prevent it
There is no central place that injects SGLANG_ENABLE_SPEC_V2=1 — each recipe sets it directly in its launch script. The env var is unset by default in v0.5.12. benchmark_lib.sh does not export it. So a script that omits the export simply runs the legacy path.
Coupling with bug_001 (dispatch)
This bug is latent until the dispatcher actually invokes glm5_fp8_mi325x_mtp.sh (the related dispatch issue). However:
- The two should be fixed together — fixing dispatch alone would still produce wrong-path MTP numbers.
- It's a trivial one-line addition; deferring it just means the next sweep after the dispatch fix produces invalid data.
Fix
Add the export near the top of benchmarks/single_node/glm5_fp8_mi325x_mtp.sh, mirroring glm5_fp8_mi355x_mtp.sh:25:
export SGLANG_ENABLE_SPEC_V2=1
Placement: after the hf download line, before SERVER_LOG=... (i.e. immediately before line 22 of the new file).
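As a sketch of how the export could be made harder to forget, the snippet below pairs it with a fail-fast check. Only the `export` line comes from the review; the `require_spec_v2` guard is a hypothetical addition and is not part of the PR or of benchmark_lib.sh.

```shell
#!/usr/bin/env bash
# Opt in to sglang's V2 speculative-decoding path, as the sibling
# GLM-5 MTP recipes do.
export SGLANG_ENABLE_SPEC_V2=1

# Hypothetical guard (an assumption, not repo code): refuse to continue
# if an MTP recipe is about to launch without the opt-in.
require_spec_v2() {
    if [ "${SGLANG_ENABLE_SPEC_V2:-0}" != "1" ]; then
        echo "SGLANG_ENABLE_SPEC_V2=1 must be exported before launching an MTP recipe" >&2
        return 1
    fi
}

if require_spec_v2; then
    spec_v2_status="enabled"
else
    spec_v2_status="disabled"
fi
echo "spec v2: $spec_v2_status"
```

Placing such a guard in a shared library would give the central check that the review notes is currently missing, at the cost of touching every MTP recipe's call site.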
🔴 The new glm5_fp8_mi325x_mtp.sh will never execute. runners/launch_mi325x-amds.sh:42 dispatches to benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi325x.sh without appending any FRAMEWORK_SUFFIX or SPEC_SUFFIX, so both glm5-fp8-mi325x-sglang and glm5-fp8-mi325x-sglang-mtp resolve to the same path (glm5_fp8_mi325x.sh) — the MTP sweep silently runs without EAGLE flags or --use-chat-template and produces numbers indistinguishable from the off sweep. Fix by extending the mi325x launcher to mirror launch_mi355x-amds.sh:221-228 (build ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback), or inline the EAGLE flags into glm5_fp8_mi325x.sh behind a SPEC_DECODING check and delete this new file.
Extended reasoning...
The bug. The MTP launch script added in this PR (benchmarks/single_node/glm5_fp8_mi325x_mtp.sh) is dead code — it will never be invoked by the mi325x runner, and the glm5-fp8-mi325x-sglang-mtp recipe in amd-master.yaml will silently use the non-MTP script instead.
Why. runners/launch_mi325x-amds.sh:42 dispatches via:
bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi325x.sh
There is no FRAMEWORK_SUFFIX or SPEC_SUFFIX appended. Contrast with runners/launch_mi355x-amds.sh:182-228, which sets SPEC_SUFFIX=_mtp when SPEC_DECODING=mtp and builds ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback. The mi325x launcher has no such logic: this is the first mi325x MTP recipe in the file (grep mi325x.*mtp .github/configs/amd-master.yaml returns only the newly-added entry), so the dispatch path has never had to exist.
Step-by-step proof. EXP_NAME is built in utils/matrix_logic/generate_sweep_configs.py:290,362 as f"{model_code}_{seq_len_str}". Both new recipes share model-prefix: glm5 in the yaml, so for the 1k1k scenario both produce EXP_NAME='glm5_1k1k', giving ${EXP_NAME%%_*}='glm5'. With PRECISION='fp8' and runner mi325x, both recipes resolve to exactly the same path: benchmarks/single_node/glm5_fp8_mi325x.sh. The newly-added glm5_fp8_mi325x_mtp.sh is never selected.
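The collision can be reproduced with plain parameter expansion. This is a minimal sketch that assumes SCENARIO_SUBDIR is empty; the function name `resolve_mi325x_script` and its third argument are hypothetical and exist only to show that the spec-decoding mode never reaches the path.

```shell
#!/usr/bin/env bash
# Rebuild the path the mi325x launcher constructs, per the dispatch line
# quoted in this review. SCENARIO_SUBDIR is assumed to be empty here.
resolve_mi325x_script() {
    local exp_name="$1" precision="$2"
    # A third argument (the spec-decoding mode) may be passed but is never
    # read, which is the behaviour the review describes.
    echo "benchmarks/single_node/${exp_name%%_*}_${precision}_mi325x.sh"
}

# Both recipes share model-prefix glm5, so both get EXP_NAME=glm5_1k1k.
off_script="$(resolve_mi325x_script glm5_1k1k fp8 off)"
mtp_script="$(resolve_mi325x_script glm5_1k1k fp8 mtp)"
echo "off: $off_script"
echo "mtp: $mtp_script"
```

Both lines print benchmarks/single_node/glm5_fp8_mi325x.sh, because `%%_*` strips everything from the first underscore onward and nothing else in the path depends on the recipe variant.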
Impact on the MTP sweep. Because glm5_fp8_mi325x.sh (the non-MTP script) is what actually runs for the MTP recipe:
- The server starts without --speculative-algorithm EAGLE, --speculative-num-steps, --speculative-eagle-topk, or --speculative-num-draft-tokens, so the "MTP" numbers are actually non-MTP numbers.
- EP_SIZE is set by the runner for the mtp recipe (ep: 1 in the yaml) but the non-MTP script ignores it (it hardcodes --data-parallel-size 1 instead).
- The bench client is invoked without --use-chat-template.
- Net effect: the mtp sweep results will be statistically indistinguishable from the off sweep, polluting perf-changelog with bogus MTP-labeled data.
How to fix. Two options, either is fine:
- (a) Extend launch_mi325x-amds.sh to mirror launch_mi355x-amds.sh:221-228: compute SPEC_SUFFIX from SPEC_DECODING, then construct ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback to the bare ${SCRIPT_BASE}.sh. This is the cleaner long-term fix because future mi325x recipes (e.g. atom on mi325x, or other-framework MTP variants) will need the same dispatch.
- (b) Inline the EAGLE flags and --use-chat-template into glm5_fp8_mi325x.sh behind a SPEC_DECODING check, and delete glm5_fp8_mi325x_mtp.sh. Lower-blast-radius but doesn't generalize.
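Option (a) might look roughly like the sketch below. The contents of launch_mi355x-amds.sh are not shown in this thread, so this is an assumption-laden illustration rather than a copy of that logic: the function name `resolve_script` is hypothetical, and the middle candidate (`${script_base}${spec_suffix}.sh`) is an extra assumption added so that a file named glm5_fp8_mi325x_mtp.sh, which carries no framework segment, would still be found.

```shell
#!/usr/bin/env bash
# Hypothetical resolver: pick the most specific launch script that exists.
# Argument names mirror the variables named in the review; the function
# itself and the middle candidate are assumptions, not code from the repo.
resolve_script() {
    local script_base="$1" framework="$2" spec_decoding="$3"
    local spec_suffix=""
    if [ "$spec_decoding" = "mtp" ]; then
        spec_suffix="_mtp"
    fi

    local candidate
    for candidate in \
        "${script_base}_${framework}${spec_suffix}.sh" \
        "${script_base}${spec_suffix}.sh" \
        "${script_base}.sh"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no launch script found for ${script_base}" >&2
    return 1
}

# Demo in a throwaway directory holding stand-ins for the two script
# names discussed in this PR.
demo_dir="$(mktemp -d)"
touch "$demo_dir/glm5_fp8_mi325x.sh" "$demo_dir/glm5_fp8_mi325x_mtp.sh"
mtp_pick="$(resolve_script "$demo_dir/glm5_fp8_mi325x" sglang mtp)"
off_pick="$(resolve_script "$demo_dir/glm5_fp8_mi325x" sglang off)"
echo "mtp -> $(basename "$mtp_pick")"
echo "off -> $(basename "$off_pick")"
rm -rf "$demo_dir"
```

Without the middle candidate, a strict `${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh` lookup with a bare fallback would still skip glm5_fp8_mi325x_mtp.sh, so either the lookup order or the file name would need to change.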
|
Filed upstream issue for the aiter MHA Triton kernel crash: sgl-project/sglang#25672 Root cause: GLM-5's Local workaround: switch |
Filed upstream: sgl-project/sglang#25672. The repeated FAILUREs on this PR are not infra; they're a deterministic crash in sglang's GLM-5's Local workaround: swap Upstream fix: track sgl-project/sglang#25672, which suggests padding |
|
Handing off to @Oseltamivir — tracked alongside 7 other stuck Klaud-Cold PRs in #1511. /loop will stop auto-retrying this one. AI-generated via Claude Code /loop. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016491988 |
|
Trying an alternate launch-arg recipe suggested upstream in sgl-project/sglang#25672 (comment) — dropping the aiter backend in favor of tilelang NSA backends + fp8_e4m3 KV cache + multithread model loader: Applied to both the off and MTP variants in |
New family on MI325X using lmsysorg/sglang:v0.5.12-rocm720-mi30x. TP=8, conc 4..64, 1k1k + 8k1k. Launch scripts follow the qwen3.5-fp8-mi325x SGLang recipe (aiter attention backend + AMD allreduce fusion), adding GLM-5-specific --tool-call-parser glm47 and --reasoning-parser glm45. MTP variant adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 and the required --use-chat-template on the bench client. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in
favor of the recipe suggested in sglang issue #25672 comment 4485916205:
--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang
--kv-cache-dtype fp8_e4m3
Applied to both the off and MTP variants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dcd8bce to 8b81e6a
Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in
favor of the recipe suggested in sglang issue #25672 comment 4485916205:
--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang
--kv-cache-dtype fp8_e4m3
Same fix applied to glm5-fp8-mi325x in #1485; both recipes share the
aiter Triton MHA tl.arange power-of-2 crash on GLM-5. Applied to both
the off and MTP variants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
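The flag set in this commit message includes a JSON value with spaces and quotes, which is easy to mangle when flags are stored in a flat string. A minimal sketch of holding them in a bash array instead is below; the flags are copied from the commit message, while the array name `SGLANG_EXTRA_ARGS` is illustrative and does not appear in the PR.

```shell
#!/usr/bin/env bash
# Flags copied from the commit message; the array name is illustrative.
# Single quotes keep the JSON as one argument with its inner double quotes.
SGLANG_EXTRA_ARGS=(
    --mem-fraction-static 0.80
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
    --nsa-prefill-backend tilelang
    --nsa-decode-backend tilelang
    --kv-cache-dtype fp8_e4m3
)

# Print one argument per line to confirm the JSON stayed intact.
printf '%s\n' "${SGLANG_EXTRA_ARGS[@]}"
```

Expanding the array as `"${SGLANG_EXTRA_ARGS[@]}"` on the server command line passes each element as exactly one argument, so the JSON is not split on its spaces.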
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26130045148 |
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26143688774 |
Summary
Adds a new GLM-5 FP8 SGLang ROCm recipe family for MI325X, both the off and MTP/EAGLE variants in one PR (grouped per the project convention of pairing MTP with its non-MTP sibling).
Recipes
- glm5-fp8-mi325x-sglang
- glm5-fp8-mi325x-sglang-mtp
Image
lmsysorg/sglang:v0.5.12-rocm720-mi30x (same tag as the existing qwen3.5-*-mi325x-sglang recipes; the mi30x suffix is shared between mi300x and mi325x).
Launch scripts
Launch args follow sglang issue #25672 comment 4485916205: tilelang NSA backends, fp8_e4m3 KV cache, multithread model loader, and a bumped --mem-fraction-static 0.80.
GLM-5 parsers: --tool-call-parser glm47, --reasoning-parser glm45. The MTP variant adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 and --use-chat-template on the bench client.
Test plan
bash -n syntax passes on both launch scripts.
🤖 Generated with Claude Code
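The syntax check in the test plan can be scripted as follows. This is a sketch only: the `check_syntax` helper is hypothetical, and the demo uses two throwaway files rather than the real launch scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run `bash -n` (parse only, no execution) on each
# file and return non-zero if any of them fails to parse.
check_syntax() {
    local status=0
    local f
    for f in "$@"; do
        if bash -n "$f" 2>/dev/null; then
            echo "ok: $f"
        else
            echo "syntax error: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Demo: one well-formed script and one with an unterminated `if`.
demo_dir="$(mktemp -d)"
printf 'if true; then echo ok; fi\n' > "$demo_dir/good.sh"
printf 'if true; then echo broken\n' > "$demo_dir/bad.sh"
if check_syntax "$demo_dir/good.sh" >/dev/null; then good_result="pass"; else good_result="fail"; fi
if check_syntax "$demo_dir/bad.sh" >/dev/null 2>&1; then bad_result="pass"; else bad_result="fail"; fi
echo "good.sh: $good_result, bad.sh: $bad_result"
rm -rf "$demo_dir"
```

Because `bash -n` only parses, it confirms that a script is well-formed but says nothing about whether the runner ever dispatches to it or whether the server flags are accepted.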