[Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes by functionstackx · Pull Request #1486 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T06:06:04Z

Summary

Adds a new GLM-5 FP8 SGLang ROCm recipe family for MI300X, both the off and MTP/EAGLE variants in one PR (grouped per the project convention).

Recipes

glm5-fp8-mi300x-sglang
glm5-fp8-mi300x-sglang-mtp

Image

lmsysorg/sglang:v0.5.12-rocm720-mi30x (same tag used by the existing qwen3.5-fp8-mi300x-sglang and the freshly-opened mi325x GLM-5 sibling PR #1485).

Launch scripts

Both base on the existing qwen3.5-fp8-mi300x.sh recipe (AMD Andy's LinkedIn-recommended args: --attention-backend aiter, --enable-aiter-allreduce-fusion, --mem-fraction-static 0.75) plus the GLM-5 parsers (--tool-call-parser glm47, --reasoning-parser glm45). The MTP variant adds the standard EAGLE knobs and --use-chat-template on the bench client per AGENTS.md.

Test plan

YAML loads; bash -n syntax passes on both launch scripts.
full-sweep-enabled sweep finishes green on mi300x for tp=8 / conc 4..64 / 1k1k + 8k1k (both off + mtp).

🤖 Generated with Claude Code

github-actions · 2026-05-18T06:06:13Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-18T06:06:40Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016532295
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26016532295

claude · 2026-05-18T06:15:47Z

+#!/usr/bin/env bash
+
+# GLM-5 FP8 on MI300X with EAGLE / MTP speculative decoding.
+# Mirrors glm5_fp8_mi300x.sh and adds the speculative-* flags.
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    MODEL \
+    TP \
+    CONC \
+    ISL \
+    OSL \
+    RANDOM_RANGE_RATIO \
+    RESULT_FILENAME \
+    EP_SIZE
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
+
+SERVER_LOG=/workspace/server.log
+PORT=${PORT:-8888}
+CONTEXT_LENGTH=$((ISL + OSL + 20))
+MAX_PREFILL_TOKENS=32768
+
+EVAL_CONTEXT_ARGS=""
+if [ "${EVAL_ONLY}" = "true" ]; then


🔴 The new glm5_fp8_mi300x_mtp.sh passes the --speculative-* CLI flags but does not export SGLANG_ENABLE_SPEC_V2=1, which every other GLM-5 MTP sglang recipe in this repo sets (glm5_fp8_mi355x_mtp.sh:25, glm5_fp8_b200_mtp.sh:25, glm5_fp8_b300_mtp.sh:29, glm5_fp4_b200_mtp.sh:25, glm5_fp4_b300_mtp.sh:29). perf-changelog.yaml repeatedly describes the GLM-5 EAGLE codepath as gated "behind SGLANG_ENABLE_SPEC_V2=1" — without that env var the MTP datapoints for glm5-fp8-mi300x-sglang-mtp will run a different codepath than every other GLM-5 MTP datapoint in perf history, breaking cross-runner comparability. Fix: add export SGLANG_ENABLE_SPEC_V2=1 near the top of the script, mirroring glm5_fp8_mi355x_mtp.sh:25.

Extended reasoning...

What goes wrong

The new `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh` launches sglang with the standard EAGLE knobs:

--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \

…but it never sets `SGLANG_ENABLE_SPEC_V2=1`. Every other GLM-5 MTP launch script in this repo does set it, right next to the same `--speculative-*` flags:

| Script | Line |
| --- | --- |
| `benchmarks/single_node/glm5_fp8_mi355x_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp8_b200_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp8_b300_mtp.sh` | 29 |
| `benchmarks/single_node/glm5_fp4_b200_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp4_b300_mtp.sh` | 29 |

Why this matters

`perf-changelog.yaml` documents the GLM-5 EAGLE path as gated on this env var on the sglang versions used here (v0.5.10–v0.5.12):

line 1623 / 1633 / 1643 — "Mirrors the glm5-fp8-XXX-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) behind SGLANG_ENABLE_SPEC_V2=1"

line 2219 — "Add MTP flags: SGLANG_ENABLE_SPEC_V2=1, EAGLE speculative decoding (steps=3, topk=1, draft=4)"

The wording "behind" indicates the new spec-v2 codepath is selected only when the env var is set; without it, sglang falls back to the v1 spec path (or in some builds, ignores the flags entirely). Either way the codepath is different from the one every other GLM-5 MTP datapoint already in perf history was collected on.

Step-by-step proof

1. The PR adds `glm5_fp8_mi300x_mtp.sh`, modeled after the mi300x non-MTP script plus the EAGLE knobs.

2. Compare line-by-line against `glm5_fp8_mi355x_mtp.sh` (the canonical GLM-5 MTP sibling): mi355x sets `export SGLANG_ENABLE_SPEC_V2=1` at line 25; mi300x has no such export anywhere in the file.

3. `perf-changelog.yaml` lines 1623/1633/1643/1653/1663/2219 all describe GLM-5 EAGLE as living **behind** `SGLANG_ENABLE_SPEC_V2=1`.

4. Therefore on `lmsysorg/sglang:v0.5.12-rocm720-mi30x` (the image this recipe uses), starting sglang without `SGLANG_ENABLE_SPEC_V2=1` will route the `--speculative-*` flags through the spec-v1 path (or no-op), not the spec-v2 EAGLE path that mi355x/b200/b300 MTP numbers were collected on.

5. Result: the `glm5-fp8-mi300x-sglang-mtp` datapoints recorded by the full-sweep CI will measure a different (and per the changelog, less performant) codepath than the rest of the GLM-5 MTP perf history, so cross-runner comparisons are invalid.

Fix

Add the env-var export near the top of `glm5_fp8_mi300x_mtp.sh`, mirroring `glm5_fp8_mi355x_mtp.sh:25`:

export SGLANG_ENABLE_SPEC_V2=1

A natural place is right after the `source ../benchmark_lib.sh` line, alongside the other env-driven knobs.
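To keep this from regressing, a small check along the following lines could flag launch scripts that enable speculative decoding without the export. This is only a sketch: `check_spec_v2` is a hypothetical helper, not something that exists in the repo, and it assumes the export sits on its own line as in the sibling scripts.

```bash
#!/usr/bin/env bash
# Hypothetical lint (not part of the repo): report scripts that pass
# --speculative-algorithm but never export SGLANG_ENABLE_SPEC_V2=1.
check_spec_v2() {
  local rc=0 script
  for script in "$@"; do
    # Scripts without speculative decoding flags are not relevant here.
    grep -q -- '--speculative-algorithm' "$script" || continue
    # Assumption: the export is written on its own line, as in the siblings.
    if ! grep -Eq '^[[:space:]]*export[[:space:]]+SGLANG_ENABLE_SPEC_V2=1' "$script"; then
      echo "missing SGLANG_ENABLE_SPEC_V2=1: $script"
      rc=1
    fi
  done
  return "$rc"
}
```

It could be run as `check_spec_v2 benchmarks/single_node/glm5_*_mtp.sh`; a non-zero exit status means at least one script was reported.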

claude · 2026-05-18T06:15:47Z

+
+glm5-fp8-mi300x-sglang-mtp:
+  image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
+  model: zai-org/GLM-5-FP8
+  model-prefix: glm5
+  runner: mi300x
+  precision: fp8
+  framework: sglang
+  multinode: false
+  scenarios:
+    fixed-seq-len:
+    - isl: 1024
+      osl: 1024
+      search-space:
+      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
+    - isl: 8192
+      osl: 1024
+      search-space:
+      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }


🔴 The new glm5-fp8-mi300x-sglang-mtp recipe (with spec-decoding: mtp) will never invoke the new benchmarks/single_node/glm5_fp8_mi300x_mtp.sh script: runners/launch_mi300x-amds.sh:41 hardcodes bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh and never appends a SPEC_SUFFIX, so both glm5-fp8-mi300x-sglang and glm5-fp8-mi300x-sglang-mtp resolve to glm5_fp8_mi300x.sh (the vanilla decode path). Because benchmark-tmpl.yml:180 bakes spec-${SPEC_DECODING} into RESULT_FILENAME, the run is recorded as MTP data while actually executing without speculative decoding — silently misattributed perf numbers. Fix by adding SPEC_SUFFIX dispatch to launch_mi300x-amds.sh mirroring launch_mi355x-amds.sh:182-228, or drop the -mtp recipe + _mtp.sh script from this PR until the launcher supports it.

Extended reasoning...

Bug: MI300X launcher does not dispatch to _mtp.sh

What is broken

The PR adds two recipes in .github/configs/amd-master.yaml:

glm5-fp8-mi300x-sglang → expected to run benchmarks/single_node/glm5_fp8_mi300x.sh

glm5-fp8-mi300x-sglang-mtp (with spec-decoding: mtp) → expected to run benchmarks/single_node/glm5_fp8_mi300x_mtp.sh

The second one is the one that breaks. The MI300X launcher does not know how to route to the _mtp.sh variant.

The misrouting code

runners/launch_mi300x-amds.sh:41 hardcodes:

bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh

There is no SPEC_SUFFIX/FRAMEWORK_SUFFIX computation anywhere in this file (verified by re-reading the whole 43-line script — every line is shown above the bash invocation, and nothing computes SPEC_SUFFIX).

Compare runners/launch_mi355x-amds.sh:182-228, which is the working pattern:

FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "atom" ]] && printf '_atom' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
SCRIPT_BASE="${EXP_NAME%%_*}_${PRECISION}_mi355x"
SCRIPT_FW="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
SCRIPT_FALLBACK="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"

Step-by-step proof for glm5-fp8-mi300x-sglang-mtp at isl=1024, osl=1024, tp=8, conc=4:

1. Sweep generator (utils/matrix_logic/generate_sweep_configs.py:362) produces EXP_NAME = f"{model_code}_{seq_len_str}", i.e. EXP_NAME = "glm5_1k1k".

2. Env vars passed to launcher: MODEL=zai-org/GLM-5-FP8, PRECISION=fp8, FRAMEWORK=sglang, SPEC_DECODING=mtp, EXP_NAME=glm5_1k1k, EP_SIZE=1.

3. In launch_mi300x-amds.sh:41, ${EXP_NAME%%_*} strips at the first underscore → "glm5".

4. The composed path is benchmarks/single_node/glm5_fp8_mi300x.sh: the vanilla decode script, not the new glm5_fp8_mi300x_mtp.sh.

5. glm5_fp8_mi300x.sh is invoked. It has none of --speculative-algorithm EAGLE, --speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens, --ep-size, or --use-chat-template. The server starts in vanilla decode mode.

6. .github/workflows/benchmark-tmpl.yml:180 defines RESULT_FILENAME = "${EXP_NAME}_${PRECISION}_${FRAMEWORK}_tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-${DISAGG}_spec-${SPEC_DECODING}_conc${CONC}_${RUNNER_NAME}", so the result file name still contains spec-mtp. The datapoint is filed in perf history as MTP data, but it was produced by the vanilla decode path. Silent misattribution.

Why nothing catches this

glm5_fp8_mi300x.sh's check_env_vars requires only MODEL TP CONC ISL OSL RANDOM_RANGE_RATIO RESULT_FILENAME (no EP_SIZE), so the misrouted run does not fail loudly. It produces a plausible result file under a misleading name.

The new glm5_fp8_mi300x_mtp.sh does require EP_SIZE, which would have surfaced the misrouting — but it is never executed.

grep -n 'mi300x.*mtp' .github/configs/amd-master.yaml returns only this PR's new entry (line 1840). This is the first MI300X recipe to depend on a SPEC_SUFFIX dispatch, so the absence has gone unnoticed.

runners/launch_mi300x*.sh glob returns exactly one file — there is no alternate launcher that could pick up the slack.

.github/workflows/benchmark-tmpl.yml dispatches via bash ./runners/launch_${RUNNER_NAME%%_*}.sh, so MI300X jobs go through this one launcher only.

Impact

The -mtp variant of the new recipe is non-functional end-to-end on the targeted runner, and the resulting spec-mtp perf datapoints would be vanilla decode numbers in disguise. This is exactly the failure mode the bug description calls out, and it should block the PR.

Fix options

Add SPEC_SUFFIX dispatch to runners/launch_mi300x-amds.sh mirroring launch_mi355x-amds.sh:182-228 (compute SPEC_SUFFIX, append it to the script path, optionally with a fallback chain). This is the proper fix and lets the new _mtp.sh actually run.

Drop the glm5-fp8-mi300x-sglang-mtp recipe and glm5_fp8_mi300x_mtp.sh from this PR and re-add them after the launcher gains MTP support. The non-MTP glm5-fp8-mi300x-sglang recipe is unaffected and can ship as-is.
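For the first option, a minimal sketch of the suffix dispatch might look like the following. `resolve_bench_script` is a hypothetical helper written for illustration; it follows the `SPEC_SUFFIX` idea quoted from `launch_mi355x-amds.sh` but leaves out the framework-specific fallback chain, and it is not the actual launcher code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compose the MI300X benchmark script path and append
# "_mtp" when speculative decoding is requested.
# Args: EXP_NAME, PRECISION, SPEC_DECODING (optional), SCENARIO_SUBDIR (optional)
resolve_bench_script() {
  local exp_name="$1" precision="$2" spec_decoding="${3:-none}" subdir="${4:-}"
  local spec_suffix=""
  if [ "$spec_decoding" = "mtp" ]; then
    spec_suffix="_mtp"
  fi
  # ${exp_name%%_*} keeps everything before the first underscore (e.g. "glm5").
  printf 'benchmarks/single_node/%s%s_%s_mi300x%s.sh\n' \
    "$subdir" "${exp_name%%_*}" "$precision" "$spec_suffix"
}
```

With this shape, `resolve_bench_script "$EXP_NAME" "$PRECISION" "$SPEC_DECODING"` yields the `_mtp.sh` path only when `SPEC_DECODING=mtp`. A launcher using it would probably also want to check that the resolved file exists and exit non-zero otherwise, so a missing script fails loudly instead of silently falling back.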

functionstackx · 2026-05-19T04:40:25Z

Handing off to @Oseltamivir — added as §13 in #1511. Expected to hit the same aiter MHA bug as #1485 (sgl#25672) once the queued jobs run; same recipe-side workaround applies (swap to triton).

AI-generated via Claude Code /loop.

github-actions · 2026-05-19T05:32:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016535368
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26016535368

functionstackx · 2026-05-19T22:53:47Z

Trying the same alternate launch-arg recipe as #1485 (suggested upstream in sgl-project/sglang#25672 (comment)) — dropping the aiter backend in favor of tilelang NSA backends + fp8_e4m3 KV cache + multithread model loader:

--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang --disable-radix-cache
--kv-cache-dtype fp8_e4m3

Applied to both the off and MTP variants in 909a0bc1. Sweep will tell us if this dodges the aiter Triton MHA tl.arange power-of-2 crash on mi300x.
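One way to carry these flags in a launch script without mangling the JSON value is a bash array, sketched below. The array name `ALT_ARGS` is hypothetical, and how the actual scripts in 909a0bc1 pass the flags is not shown in this thread; only the flag values themselves come from the list above.

```bash
#!/usr/bin/env bash
# Hypothetical array holding the alternate launch args listed above.
# Each flag and each value is its own element, so the JSON string survives
# word splitting when the array is expanded as "${ALT_ARGS[@]}".
ALT_ARGS=(
  --mem-fraction-static 0.80
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
  --nsa-prefill-backend tilelang
  --nsa-decode-backend tilelang
  --disable-radix-cache
  --kv-cache-dtype fp8_e4m3
)

# Print one element per line to confirm the quoting is intact.
printf '%s\n' "${ALT_ARGS[@]}"
```

The array would then be appended to the server command as `"${ALT_ARGS[@]}"`, keeping the quotes around the expansion.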

github-actions · 2026-05-20T11:22:20Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26130164798
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26130164798

New family on MI300X using lmsysorg/sglang:v0.5.12-rocm720-mi30x. TP=8, conc 4..64, 1k1k + 8k1k. Launch scripts follow the qwen3.5-fp8-mi300x SGLang recipe (aiter attention + AMD allreduce fusion) and add GLM-5 parsers (glm47 tool calls, glm45 reasoning). MTP variant adds --speculative-algorithm EAGLE plus the standard EAGLE knobs (num-steps 3, eagle-topk 1, num-draft-tokens 4) and --use-chat-template on the bench client per AGENTS.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in favor of the recipe suggested in sglang issue #25672 comment 4485916205:
--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang
--kv-cache-dtype fp8_e4m3
Same fix applied to glm5-fp8-mi325x in #1485; both recipes share the aiter Triton MHA tl.arange power-of-2 crash on GLM-5. Applied to both the off and MTP variants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T19:45:28Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26130164798
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26130164798

github-actions · 2026-05-21T02:12:22Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26185779422
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26185779422

functionstackx requested a review from a team May 18, 2026 06:06

functionstackx added the full-sweep-enabled label May 18, 2026

functionstackx requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners May 18, 2026 06:06

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

functionstackx added a commit that referenced this pull request May 18, 2026

chore: fill pr-link for #1486

e718115

claude Bot reviewed May 18, 2026

functionstackx changed the title ~~[Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes~~ [Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes May 19, 2026

functionstackx mentioned this pull request May 19, 2026

[AI Generated] [Handoff] out of 70+ image updates, 13 stuck Klaud Cold PRs need upstream coordination / scope decisions #1511

Open

functionstackx added a commit that referenced this pull request May 19, 2026

chore: fill pr-link for #1486

becfe88

functionstackx force-pushed the add-glm5-fp8-mi300x-sglang branch from e718115 to 909a0bc Compare May 19, 2026 22:52

functionstackx changed the title ~~[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes~~ [Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes May 19, 2026

functionstackx mentioned this pull request May 20, 2026

[Klaud Cold] mi300x runner: switch --nodelist pin to --exclude -049 #1532

Merged

1 task

functionstackx and others added 3 commits May 20, 2026 15:42

chore: fill pr-link for #1486

eb4caea

functionstackx force-pushed the add-glm5-fp8-mi300x-sglang branch from 909a0bc to 8fe48f7 Compare May 20, 2026 19:43

functionstackx wants to merge 3 commits into main from add-glm5-fp8-mi300x-sglang
