[Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes#1486
functionstackx wants to merge 3 commits into
Conversation
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you! PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If a re-run is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016532295 |
    #!/usr/bin/env bash

    # GLM-5 FP8 on MI300X with EAGLE / MTP speculative decoding.
    # Mirrors glm5_fp8_mi300x.sh and adds the speculative-* flags.

    source "$(dirname "$0")/../benchmark_lib.sh"

    check_env_vars \
        MODEL \
        TP \
        CONC \
        ISL \
        OSL \
        RANDOM_RANGE_RATIO \
        RESULT_FILENAME \
        EP_SIZE

    if [[ -n "$SLURM_JOB_ID" ]]; then
        echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
    fi

    if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

    SERVER_LOG=/workspace/server.log
    PORT=${PORT:-8888}
    CONTEXT_LENGTH=$((ISL + OSL + 20))
    MAX_PREFILL_TOKENS=32768

    EVAL_CONTEXT_ARGS=""
    if [ "${EVAL_ONLY}" = "true" ]; then
🔴 The new glm5_fp8_mi300x_mtp.sh passes the --speculative-* CLI flags but does not export SGLANG_ENABLE_SPEC_V2=1, which every other GLM-5 MTP sglang recipe in this repo sets (glm5_fp8_mi355x_mtp.sh:25, glm5_fp8_b200_mtp.sh:25, glm5_fp8_b300_mtp.sh:29, glm5_fp4_b200_mtp.sh:25, glm5_fp4_b300_mtp.sh:29). perf-changelog.yaml repeatedly describes the GLM-5 EAGLE codepath as gated "behind SGLANG_ENABLE_SPEC_V2=1" — without that env var the MTP datapoints for glm5-fp8-mi300x-sglang-mtp will run a different codepath than every other GLM-5 MTP datapoint in perf history, breaking cross-runner comparability. Fix: add export SGLANG_ENABLE_SPEC_V2=1 near the top of the script, mirroring glm5_fp8_mi355x_mtp.sh:25.
Extended reasoning...
What goes wrong
The new `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh` launches sglang with the standard EAGLE knobs:
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
…but it never sets `SGLANG_ENABLE_SPEC_V2=1`. Every other GLM-5 MTP launch script in this repo does set it, right next to the same `--speculative-*` flags:
| Script | Line |
|---|---|
| `benchmarks/single_node/glm5_fp8_mi355x_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp8_b200_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp8_b300_mtp.sh` | 29 |
| `benchmarks/single_node/glm5_fp4_b200_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp4_b300_mtp.sh` | 29 |
Why this matters
`perf-changelog.yaml` documents the GLM-5 EAGLE path as gated on this env var on the sglang versions used here (v0.5.10–v0.5.12):
- lines 1623 / 1633 / 1643: "Mirrors the glm5-fp8-XXX-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) behind SGLANG_ENABLE_SPEC_V2=1"
- line 2219: "Add MTP flags: SGLANG_ENABLE_SPEC_V2=1, EAGLE speculative decoding (steps=3, topk=1, draft=4)"
The wording "behind" indicates the new spec-v2 codepath is selected only when the env var is set; without it, sglang falls back to the v1 spec path (or in some builds, ignores the flags entirely). Either way the codepath is different from the one every other GLM-5 MTP datapoint already in perf history was collected on.
Step-by-step proof
- PR adds `glm5_fp8_mi300x_mtp.sh` modeled after the mi300x non-MTP script + EAGLE knobs.
- Compare line-by-line against `glm5_fp8_mi355x_mtp.sh` (the canonical GLM-5 MTP sibling): mi355x sets `export SGLANG_ENABLE_SPEC_V2=1` at line 25, mi300x has no such export anywhere in the file.
- `perf-changelog.yaml` lines 1623/1633/1643/1653/1663/2219 all describe GLM-5 EAGLE as living **behind** `SGLANG_ENABLE_SPEC_V2=1`.
- Therefore on `lmsysorg/sglang:v0.5.12-rocm720-mi30x` (the image this recipe uses), starting sglang without `SGLANG_ENABLE_SPEC_V2=1` will route the `--speculative-*` flags through the spec-v1 path (or no-op), not the spec-v2 EAGLE path that mi355x/b200/b300 MTP numbers were collected on.
- Result: the `glm5-fp8-mi300x-sglang-mtp` datapoints recorded by the full-sweep CI will measure a different (and per the changelog, less performant) codepath than the rest of the GLM-5 MTP perf history, so cross-runner comparisons are invalid.
Fix
Add the env-var export near the top of `glm5_fp8_mi300x_mtp.sh`, mirroring `glm5_fp8_mi355x_mtp.sh:25`:
export SGLANG_ENABLE_SPEC_V2=1
A natural place is right after the `source ../benchmark_lib.sh` line, alongside the other env-driven knobs.
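One way to make a missing export fail loudly, rather than silently change codepaths, is a small guard placed before the server launch. The sketch below is illustrative only: `require_spec_v2` is a hypothetical helper, not something known to exist in `benchmark_lib.sh`, and it assumes the recipe should refuse to start when the variable is unset.

```shell
#!/usr/bin/env bash
# Hypothetical guard (not part of benchmark_lib.sh): refuse to launch the
# MTP recipe unless the spec-v2 codepath has been requested.
require_spec_v2() {
    if [[ "${SGLANG_ENABLE_SPEC_V2:-}" != "1" ]]; then
        echo "SGLANG_ENABLE_SPEC_V2=1 must be exported for the GLM-5 MTP recipe" >&2
        return 1
    fi
}

# The export proposed in the fix, followed by the guard.
export SGLANG_ENABLE_SPEC_V2=1
require_spec_v2 && echo "spec-v2 enabled"
```

With the export in place the guard is a no-op; if a later edit drops the export, the script stops before any benchmark numbers are recorded.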
    glm5-fp8-mi300x-sglang-mtp:
      image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
      model: zai-org/GLM-5-FP8
      model-prefix: glm5
      runner: mi300x
      precision: fp8
      framework: sglang
      multinode: false
      scenarios:
        fixed-seq-len:
        - isl: 1024
          osl: 1024
          search-space:
          - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
        - isl: 8192
          osl: 1024
          search-space:
          - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
🔴 The new glm5-fp8-mi300x-sglang-mtp recipe (with spec-decoding: mtp) will never invoke the new benchmarks/single_node/glm5_fp8_mi300x_mtp.sh script: runners/launch_mi300x-amds.sh:41 hardcodes bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh and never appends a SPEC_SUFFIX, so both glm5-fp8-mi300x-sglang and glm5-fp8-mi300x-sglang-mtp resolve to glm5_fp8_mi300x.sh (the vanilla decode path). Because benchmark-tmpl.yml:180 bakes spec-${SPEC_DECODING} into RESULT_FILENAME, the run is recorded as MTP data while actually executing without speculative decoding — silently misattributed perf numbers. Fix by adding SPEC_SUFFIX dispatch to launch_mi300x-amds.sh mirroring launch_mi355x-amds.sh:182-228, or drop the -mtp recipe + _mtp.sh script from this PR until the launcher supports it.
Extended reasoning...
Bug: MI300X launcher does not dispatch to _mtp.sh
What is broken
The PR adds two recipes in `.github/configs/amd-master.yaml`:
- `glm5-fp8-mi300x-sglang` → expected to run `benchmarks/single_node/glm5_fp8_mi300x.sh`
- `glm5-fp8-mi300x-sglang-mtp` (with `spec-decoding: mtp`) → expected to run `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh`
The second one is the one that breaks. The MI300X launcher does not know how to route to the `_mtp.sh` variant.
The misrouting code
`runners/launch_mi300x-amds.sh:41` hardcodes:
bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh
There is no `SPEC_SUFFIX`/`FRAMEWORK_SUFFIX` computation anywhere in this file (verified by re-reading the whole 43-line script: every line is shown above the bash invocation, and nothing computes `SPEC_SUFFIX`).
Compare `runners/launch_mi355x-amds.sh:182-228`, which is the working pattern:
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "atom" ]] && printf '_atom' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
SCRIPT_BASE="${EXP_NAME%%_*}_${PRECISION}_mi355x"
SCRIPT_FW="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
SCRIPT_FALLBACK="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
Step-by-step proof for `glm5-fp8-mi300x-sglang-mtp` at isl=1024, osl=1024, tp=8, conc=4:
- Sweep generator (`utils/matrix_logic/generate_sweep_configs.py:362`) produces `EXP_NAME = f"{model_code}_{seq_len_str}"`, i.e. `EXP_NAME = "glm5_1k1k"`.
- Env vars passed to launcher: `MODEL=zai-org/GLM-5-FP8`, `PRECISION=fp8`, `FRAMEWORK=sglang`, `SPEC_DECODING=mtp`, `EXP_NAME=glm5_1k1k`, `EP_SIZE=1`.
- In `launch_mi300x-amds.sh:41`, `${EXP_NAME%%_*}` strips at the first underscore → `"glm5"`.
- The composed path is `benchmarks/single_node/glm5_fp8_mi300x.sh`, the vanilla decode script, not the new `glm5_fp8_mi300x_mtp.sh`.
- `glm5_fp8_mi300x.sh` is invoked. It has none of `--speculative-algorithm EAGLE`, `--speculative-num-steps`, `--speculative-eagle-topk`, `--speculative-num-draft-tokens`, `--ep-size`, or `--use-chat-template`. The server starts in vanilla decode mode.
- `.github/workflows/benchmark-tmpl.yml:180` defines `RESULT_FILENAME = "${EXP_NAME}_${PRECISION}_${FRAMEWORK}_tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-${DISAGG}_spec-${SPEC_DECODING}_conc${CONC}_${RUNNER_NAME}"`, so the result file name still contains `spec-mtp`. The datapoint is filed in perf history as MTP data, but it was produced by the vanilla decode path. Silent misattribution.
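The path composition in the walkthrough can be reproduced in isolation. The snippet below is a minimal sketch that reuses the expansion quoted from `launch_mi300x-amds.sh:41` with the environment values listed in the steps; the empty `SCENARIO_SUBDIR` is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Values taken from the step-by-step walkthrough.
EXP_NAME="glm5_1k1k"
PRECISION="fp8"
SPEC_DECODING="mtp"
SCENARIO_SUBDIR=""   # assumption: no scenario subdirectory for this run

# Same expansion as the hardcoded launcher line. SPEC_DECODING plays no part.
resolved="benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh"

echo "model prefix: ${EXP_NAME%%_*}"
echo "resolved:     ${resolved}"
echo "result tag:   spec-${SPEC_DECODING}"
```

The resolved path is `benchmarks/single_node/glm5_fp8_mi300x.sh` even though the result tag reads `spec-mtp`, which is the mismatch the comment describes.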
Why nothing catches this
- `glm5_fp8_mi300x.sh`'s `check_env_vars` requires only `MODEL TP CONC ISL OSL RANDOM_RANGE_RATIO RESULT_FILENAME` (no `EP_SIZE`), so the misrouted run does not fail loudly. It produces a plausible result file under a misleading name.
- The new `glm5_fp8_mi300x_mtp.sh` does require `EP_SIZE`, which would have surfaced the misrouting, but it is never executed.
- `grep -n 'mi300x.*mtp' .github/configs/amd-master.yaml` returns only this PR's new entry (line 1840). This is the first MI300X recipe to depend on a `SPEC_SUFFIX` dispatch, so the absence has gone unnoticed.
- `runners/launch_mi300x*.sh` glob returns exactly one file, so there is no alternate launcher that could pick up the slack.
- `.github/workflows/benchmark-tmpl.yml` dispatches via `bash ./runners/launch_${RUNNER_NAME%%_*}.sh`, so MI300X jobs go through this one launcher only.
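Since nothing currently catches the mismatch, a cheap consistency check in the launcher could. The sketch below is a hypothetical helper, not existing repo code: `assert_spec_matches_script` assumes that any MTP launch script contains the `--speculative-algorithm` flag, which holds for the scripts quoted in this thread but is not a documented convention.

```shell
#!/usr/bin/env bash
# Hypothetical sanity check: if the run is labeled MTP, the script that will
# actually be executed must contain the EAGLE flag.
assert_spec_matches_script() {
    local spec_decoding="$1" script_path="$2"
    if [[ "$spec_decoding" == "mtp" ]] \
        && ! grep -q -- '--speculative-algorithm' "$script_path"; then
        echo "spec-decoding=mtp but $script_path has no --speculative-algorithm flag" >&2
        return 1
    fi
}
```

Calling `assert_spec_matches_script "$SPEC_DECODING" "$script"` just before the `bash` invocation would turn the silent misattribution into a failed job.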
Impact
The -mtp variant of the new recipe is non-functional end-to-end on the targeted runner, and the resulting spec-mtp perf datapoints would be vanilla decode numbers in disguise. This is exactly the failure mode the bug description calls out, and it should block the PR.
Fix options
- Add `SPEC_SUFFIX` dispatch to `runners/launch_mi300x-amds.sh` mirroring `launch_mi355x-amds.sh:182-228` (compute `SPEC_SUFFIX`, append it to the script path, optionally with a fallback chain). This is the proper fix and lets the new `_mtp.sh` actually run.
- Drop the `glm5-fp8-mi300x-sglang-mtp` recipe and `glm5_fp8_mi300x_mtp.sh` from this PR and re-add them after the launcher gains MTP support. The non-MTP `glm5-fp8-mi300x-sglang` recipe is unaffected and can ship as-is.
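The first option could look roughly like the following. This is a sketch under assumptions, not the launcher's actual code: `resolve_mi300x_script` is a hypothetical function name, the fallback chain from the mi355x launcher is omitted, and only the `SPEC_SUFFIX` idea is carried over.

```shell
#!/usr/bin/env bash
# Hypothetical helper sketching the SPEC_SUFFIX dispatch for MI300X.
resolve_mi300x_script() {
    local exp_name="$1" precision="$2" spec_decoding="${3:-none}" subdir="${4:-}"
    local spec_suffix=""
    if [[ "$spec_decoding" == "mtp" ]]; then
        spec_suffix="_mtp"
    fi
    printf 'benchmarks/single_node/%s%s_%s_mi300x%s.sh\n' \
        "$subdir" "${exp_name%%_*}" "$precision" "$spec_suffix"
}

script="$(resolve_mi300x_script "glm5_1k1k" "fp8" "mtp")"
echo "$script"

# Report a missing script instead of silently running the non-MTP one.
if [[ ! -f "$script" ]]; then
    echo "launch script not found: $script" >&2
fi
```

For the MTP recipe this resolves to `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh`; any other `spec_decoding` value leaves the existing path unchanged, so non-MTP MI300X recipes would be unaffected.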
|
Handing off to @Oseltamivir — added as §13 in #1511. Expected to hit the same aiter MHA bug as #1485 (sgl#25672) once the queued jobs run; same recipe-side workaround applies (swap to triton). AI-generated via Claude Code /loop. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016535368 |
e718115 to 909a0bc
|
Trying the same alternate launch-arg recipe as #1485 (suggested upstream in sgl-project/sglang#25672 (comment)) — dropping the aiter backend in favor of tilelang NSA backends + fp8_e4m3 KV cache + multithread model loader: Applied to both the off and MTP variants in |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26130164798 |
New family on MI300X using lmsysorg/sglang:v0.5.12-rocm720-mi30x. TP=8, conc 4..64, 1k1k + 8k1k. Launch scripts follow the qwen3.5-fp8-mi300x SGLang recipe (aiter attention + AMD allreduce fusion) and add GLM-5 parsers (glm47 tool calls, glm45 reasoning). MTP variant adds --speculative-algorithm EAGLE plus the standard EAGLE knobs (num-steps 3, eagle-topk 1, num-draft-tokens 4) and --use-chat-template on the bench client per AGENTS.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in
favor of the recipe suggested in sglang issue #25672 comment 4485916205:
--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang
--kv-cache-dtype fp8_e4m3
Same fix applied to glm5-fp8-mi325x in #1485; both recipes share the
aiter Triton MHA tl.arange power-of-2 crash on GLM-5. Applied to both
the off and MTP variants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
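For readers who want to see how the flags in this commit message quote in a launch script, here is a minimal sketch that collects them in a bash array. The array name `SGLANG_EXTRA_ARGS` is illustrative and not taken from the PR; the flag names and values are copied from the commit message and are not verified against the sglang CLI here.

```shell
#!/usr/bin/env bash
# Flags copied from the commit message; the array name is illustrative.
SGLANG_EXTRA_ARGS=(
    --mem-fraction-static 0.80
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
    --nsa-prefill-backend tilelang
    --nsa-decode-backend tilelang
    --kv-cache-dtype fp8_e4m3
)

# Print one argument per line to confirm the JSON stays a single word.
printf '%s\n' "${SGLANG_EXTRA_ARGS[@]}"
```

Expanding the array as `"${SGLANG_EXTRA_ARGS[@]}"` on the server command line keeps the JSON value intact, which is easy to get wrong with a plain string variable.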
909a0bc to 8fe48f7
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26185779422 |
Summary
Adds a new GLM-5 FP8 SGLang ROCm recipe family for MI300X, with both the off and MTP/EAGLE variants in one PR (grouped per the project convention).
Recipes
- `glm5-fp8-mi300x-sglang`
- `glm5-fp8-mi300x-sglang-mtp`
Image
`lmsysorg/sglang:v0.5.12-rocm720-mi30x` (same tag used by the existing `qwen3.5-fp8-mi300x-sglang` and the freshly-opened mi325x GLM-5 sibling PR #1485).
Launch scripts
Both are based on the existing `qwen3.5-fp8-mi300x.sh` recipe (AMD Andy's LinkedIn-recommended args: `--attention-backend aiter`, `--enable-aiter-allreduce-fusion`, `--mem-fraction-static 0.75`) plus the GLM-5 parsers (`--tool-call-parser glm47`, `--reasoning-parser glm45`). The MTP variant adds the standard EAGLE knobs and `--use-chat-template` on the bench client per AGENTS.md.
Test plan
- `bash -n` syntax passes on both launch scripts.
🤖 Generated with Claude Code
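The syntax check in the test plan can be scripted so that it covers every launch script in one pass. The helper below is a sketch; `syntax_check` is a hypothetical name, and `bash -n` only parses the files, so it would not catch the routing or env-var issues raised in review.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run `bash -n` over each given script and report
# failures without stopping at the first one.
syntax_check() {
    local failed=0 script
    for script in "$@"; do
        if bash -n "$script"; then
            echo "ok: $script"
        else
            echo "syntax error: $script" >&2
            failed=1
        fi
    done
    return "$failed"
}
```

From the repo root this could be invoked as `syntax_check benchmarks/single_node/glm5_fp8_mi300x.sh benchmarks/single_node/glm5_fp8_mi300x_mtp.sh`.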