
[Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes#1486

Open
functionstackx wants to merge 3 commits into main from add-glm5-fp8-mi300x-sglang

Conversation

@functionstackx
Collaborator

Summary

Adds a new GLM-5 FP8 SGLang ROCm recipe family for MI300X, both the off and MTP/EAGLE variants in one PR (grouped per the project convention).

Recipes

  • glm5-fp8-mi300x-sglang
  • glm5-fp8-mi300x-sglang-mtp

Image

lmsysorg/sglang:v0.5.12-rocm720-mi30x (same tag used by the existing qwen3.5-fp8-mi300x-sglang and the freshly-opened mi325x GLM-5 sibling PR #1485).

Launch scripts

Both are based on the existing qwen3.5-fp8-mi300x.sh recipe (AMD Andy's LinkedIn-recommended args: --attention-backend aiter, --enable-aiter-allreduce-fusion, --mem-fraction-static 0.75) plus the GLM-5 parsers (--tool-call-parser glm47, --reasoning-parser glm45). The MTP variant adds the standard EAGLE knobs and --use-chat-template on the bench client per AGENTS.md.

Test plan

  • YAML loads; bash -n syntax passes on both launch scripts.
  • full-sweep-enabled sweep finishes green on mi300x for tp=8 / conc 4..64 / 1k1k + 8k1k (both off + mtp).
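The `bash -n` item in the test plan can be sketched as a small helper. This is a minimal illustration, not the project's actual CI step; `syntax_check` is a hypothetical name, and the demo runs on throwaway files rather than the real recipes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run `bash -n` (parse only, nothing is executed)
# on every script passed in, and report which ones fail to parse.
syntax_check() {
    local failed=0 script
    for script in "$@"; do
        if bash -n "$script" 2>/dev/null; then
            echo "ok   $script"
        else
            echo "FAIL $script"
            failed=1
        fi
    done
    return "$failed"
}

# Demo on two temporary files (assumption: stand-ins for the launch scripts).
demo_dir=$(mktemp -d)
printf 'echo hello\n' > "$demo_dir/good.sh"
printf 'if true; then\n' > "$demo_dir/bad.sh"   # unterminated if, a parse error
syntax_check "$demo_dir/good.sh" "$demo_dir/bad.sh" \
    || echo "at least one script failed to parse"
```

Against this PR the call would be `syntax_check benchmarks/single_node/glm5_fp8_mi300x.sh benchmarks/single_node/glm5_fp8_mi300x_mtp.sh`. Note that `bash -n` only catches parse errors; it says nothing about which script the launcher actually dispatches to.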

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@github-actions
Contributor

Comment on lines +1 to +30
#!/usr/bin/env bash

# GLM-5 FP8 on MI300X with EAGLE / MTP speculative decoding.
# Mirrors glm5_fp8_mi300x.sh and adds the speculative-* flags.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then

🔴 The new glm5_fp8_mi300x_mtp.sh passes the --speculative-* CLI flags but does not export SGLANG_ENABLE_SPEC_V2=1, which every other GLM-5 MTP sglang recipe in this repo sets (glm5_fp8_mi355x_mtp.sh:25, glm5_fp8_b200_mtp.sh:25, glm5_fp8_b300_mtp.sh:29, glm5_fp4_b200_mtp.sh:25, glm5_fp4_b300_mtp.sh:29). perf-changelog.yaml repeatedly describes the GLM-5 EAGLE codepath as gated "behind SGLANG_ENABLE_SPEC_V2=1" — without that env var the MTP datapoints for glm5-fp8-mi300x-sglang-mtp will run a different codepath than every other GLM-5 MTP datapoint in perf history, breaking cross-runner comparability. Fix: add export SGLANG_ENABLE_SPEC_V2=1 near the top of the script, mirroring glm5_fp8_mi355x_mtp.sh:25.

Extended reasoning...

What goes wrong

The new `benchmarks/single_node/glm5_fp8_mi300x_mtp.sh` launches sglang with the standard EAGLE knobs:

--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \

…but it never sets `SGLANG_ENABLE_SPEC_V2=1`. Every other GLM-5 MTP launch script in this repo does set it, right next to the same `--speculative-*` flags:

| Script | Line |
| --- | --- |
| `benchmarks/single_node/glm5_fp8_mi355x_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp8_b200_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp8_b300_mtp.sh` | 29 |
| `benchmarks/single_node/glm5_fp4_b200_mtp.sh` | 25 |
| `benchmarks/single_node/glm5_fp4_b300_mtp.sh` | 29 |

Why this matters

`perf-changelog.yaml` documents the GLM-5 EAGLE path as gated on this env var on the sglang versions used here (v0.5.10–v0.5.12):

  • line 1623 / 1633 / 1643 — "Mirrors the glm5-fp8-XXX-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) behind SGLANG_ENABLE_SPEC_V2=1"
  • line 2219 — "Add MTP flags: SGLANG_ENABLE_SPEC_V2=1, EAGLE speculative decoding (steps=3, topk=1, draft=4)"

The wording "behind" indicates the new spec-v2 codepath is selected only when the env var is set; without it, sglang falls back to the v1 spec path (or in some builds, ignores the flags entirely). Either way the codepath is different from the one every other GLM-5 MTP datapoint already in perf history was collected on.

Step-by-step proof

  1. PR adds `glm5_fp8_mi300x_mtp.sh`, modeled after the mi300x non-MTP script plus the EAGLE knobs.
  2. Compare line-by-line against `glm5_fp8_mi355x_mtp.sh` (the canonical GLM-5 MTP sibling): mi355x sets `export SGLANG_ENABLE_SPEC_V2=1` at line 25, mi300x has no such export anywhere in the file.
  3. `perf-changelog.yaml` lines 1623/1633/1643/1653/1663/2219 all describe GLM-5 EAGLE as living **behind** `SGLANG_ENABLE_SPEC_V2=1`.
  4. Therefore on `lmsysorg/sglang:v0.5.12-rocm720-mi30x` (the image this recipe uses), starting sglang without `SGLANG_ENABLE_SPEC_V2=1` will route the `--speculative-*` flags through the spec-v1 path (or no-op), not the spec-v2 EAGLE path that mi355x/b200/b300 MTP numbers were collected on.
  5. Result: the `glm5-fp8-mi300x-sglang-mtp` datapoints recorded by the full-sweep CI will measure a different (and per the changelog, less performant) codepath than the rest of the GLM-5 MTP perf history, so cross-runner comparisons are invalid.

Fix

Add the env-var export near the top of `glm5_fp8_mi300x_mtp.sh`, mirroring `glm5_fp8_mi355x_mtp.sh:25`:

export SGLANG_ENABLE_SPEC_V2=1

A natural place is right after the `source ../benchmark_lib.sh` line, alongside the other env-driven knobs.
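A repo-side check could catch this class of omission before a sweep runs. The sketch below is an assumption about how such a lint might look, not something the repo is known to contain; `check_spec_v2_export` is a hypothetical name, and the demo files are stand-ins for real launch scripts.

```shell
#!/usr/bin/env bash
# Hypothetical lint: a script that passes --speculative-algorithm must also
# export SGLANG_ENABLE_SPEC_V2=1. Scripts without speculative flags pass.
check_spec_v2_export() {
    local script=$1
    if grep -q -- '--speculative-algorithm' "$script" &&
       ! grep -Eq '^[[:space:]]*export[[:space:]]+SGLANG_ENABLE_SPEC_V2=1' "$script"; then
        echo "missing export SGLANG_ENABLE_SPEC_V2=1: $script" >&2
        return 1
    fi
    return 0
}

# Demo on two temporary stand-in scripts (not the real recipes).
demo_dir=$(mktemp -d)
cat > "$demo_dir/with_export.sh" <<'EOF'
export SGLANG_ENABLE_SPEC_V2=1
python3 -m sglang.launch_server --speculative-algorithm EAGLE
EOF
cat > "$demo_dir/without_export.sh" <<'EOF'
python3 -m sglang.launch_server --speculative-algorithm EAGLE
EOF

check_spec_v2_export "$demo_dir/with_export.sh" && echo "with_export.sh: ok"
check_spec_v2_export "$demo_dir/without_export.sh" || echo "without_export.sh: flagged"
```

Run over `benchmarks/single_node/*_mtp.sh`, a check like this would flag the new mi300x script while passing the five siblings listed above.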

Comment on lines +1822 to +1840

glm5-fp8-mi300x-sglang-mtp:
  image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
  model: zai-org/GLM-5-FP8
  model-prefix: glm5
  runner: mi300x
  precision: fp8
  framework: sglang
  multinode: false
  scenarios:
    fixed-seq-len:
      - isl: 1024
        osl: 1024
        search-space:
          - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
      - isl: 8192
        osl: 1024
        search-space:
          - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }

🔴 The new glm5-fp8-mi300x-sglang-mtp recipe (with spec-decoding: mtp) will never invoke the new benchmarks/single_node/glm5_fp8_mi300x_mtp.sh script: runners/launch_mi300x-amds.sh:41 hardcodes bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh and never appends a SPEC_SUFFIX, so both glm5-fp8-mi300x-sglang and glm5-fp8-mi300x-sglang-mtp resolve to glm5_fp8_mi300x.sh (the vanilla decode path). Because benchmark-tmpl.yml:180 bakes spec-${SPEC_DECODING} into RESULT_FILENAME, the run is recorded as MTP data while actually executing without speculative decoding — silently misattributed perf numbers. Fix by adding SPEC_SUFFIX dispatch to launch_mi300x-amds.sh mirroring launch_mi355x-amds.sh:182-228, or drop the -mtp recipe + _mtp.sh script from this PR until the launcher supports it.

Extended reasoning...

Bug: MI300X launcher does not dispatch to _mtp.sh

What is broken

The PR adds two recipes in .github/configs/amd-master.yaml:

  • glm5-fp8-mi300x-sglang → expected to run benchmarks/single_node/glm5_fp8_mi300x.sh
  • glm5-fp8-mi300x-sglang-mtp (with spec-decoding: mtp) → expected to run benchmarks/single_node/glm5_fp8_mi300x_mtp.sh

The second one is the one that breaks. The MI300X launcher does not know how to route to the _mtp.sh variant.

The misrouting code

runners/launch_mi300x-amds.sh:41 hardcodes:

bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh

There is no SPEC_SUFFIX/FRAMEWORK_SUFFIX computation anywhere in this file (verified by re-reading the whole 43-line script — every line is shown above the bash invocation, and nothing computes SPEC_SUFFIX).

Compare runners/launch_mi355x-amds.sh:182-228, which is the working pattern:

FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "atom" ]] && printf '_atom' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
SCRIPT_BASE="${EXP_NAME%%_*}_${PRECISION}_mi355x"
SCRIPT_FW="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"
SCRIPT_FALLBACK="benchmarks/single_node/${SCENARIO_SUBDIR:-}${SCRIPT_BASE}${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"

Step-by-step proof for glm5-fp8-mi300x-sglang-mtp at isl=1024, osl=1024, tp=8, conc=4:

  1. Sweep generator (utils/matrix_logic/generate_sweep_configs.py:362) produces EXP_NAME = f"{model_code}_{seq_len_str}", i.e. EXP_NAME = "glm5_1k1k".
  2. Env vars passed to launcher: MODEL=zai-org/GLM-5-FP8, PRECISION=fp8, FRAMEWORK=sglang, SPEC_DECODING=mtp, EXP_NAME=glm5_1k1k, EP_SIZE=1.
  3. In launch_mi300x-amds.sh:41, ${EXP_NAME%%_*} strips at the first underscore → "glm5".
  4. The composed path is benchmarks/single_node/glm5_fp8_mi300x.sh — the vanilla decode script, not the new glm5_fp8_mi300x_mtp.sh.
  5. glm5_fp8_mi300x.sh is invoked. It has none of --speculative-algorithm EAGLE, --speculative-num-steps, --speculative-eagle-topk, --speculative-num-draft-tokens, --ep-size, or --use-chat-template. The server starts in vanilla decode mode.
  6. .github/workflows/benchmark-tmpl.yml:180 defines RESULT_FILENAME = "${EXP_NAME}_${PRECISION}_${FRAMEWORK}_tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-${DISAGG}_spec-${SPEC_DECODING}_conc${CONC}_${RUNNER_NAME}", so the result file name still contains spec-mtp. The datapoint is filed in perf history as MTP data, but it was produced by the vanilla decode path. Silent misattribution.
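The walk-through above can be reproduced with plain parameter expansion. This sketch only composes the two strings as they are quoted in this comment; `DP_ATTENTION`, `DISAGG`, and `RUNNER_NAME` are placeholder values chosen for illustration, not values taken from the PR.

```shell
#!/usr/bin/env bash
# Values from the step-by-step walk-through above.
EXP_NAME="glm5_1k1k"
PRECISION="fp8"
FRAMEWORK="sglang"
SPEC_DECODING="mtp"
TP=8
EP_SIZE=1
CONC=4
SCENARIO_SUBDIR=""

# Placeholders (assumptions): the real values come from the CI matrix.
DP_ATTENTION="false"
DISAGG="false"
RUNNER_NAME="mi300x-amds_0"

# Script path as composed at launch_mi300x-amds.sh:41 (no SPEC_SUFFIX).
# ${EXP_NAME%%_*} removes the longest suffix starting at an underscore,
# which leaves everything before the first underscore: "glm5".
script="benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi300x.sh"

# Result name as composed at benchmark-tmpl.yml:180.
result_filename="${EXP_NAME}_${PRECISION}_${FRAMEWORK}_tp${TP}-ep${EP_SIZE}-dpa${DP_ATTENTION}_disagg-${DISAGG}_spec-${SPEC_DECODING}_conc${CONC}_${RUNNER_NAME}"

echo "script: $script"            # script: benchmarks/single_node/glm5_fp8_mi300x.sh
echo "result: $result_filename"   # contains "spec-mtp" although no _mtp script ran
```

The script path carries no trace of `SPEC_DECODING`, while the result name does, which is the mismatch described in steps 4 and 6.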

Why nothing catches this

  • glm5_fp8_mi300x.sh's check_env_vars requires only MODEL TP CONC ISL OSL RANDOM_RANGE_RATIO RESULT_FILENAME (no EP_SIZE), so the misrouted run does not fail loudly. It produces a plausible result file under a misleading name.
  • The new glm5_fp8_mi300x_mtp.sh does require EP_SIZE, which would have surfaced the misrouting — but it is never executed.
  • grep -n 'mi300x.*mtp' .github/configs/amd-master.yaml returns only this PR's new entry (line 1840). This is the first MI300X recipe to depend on a SPEC_SUFFIX dispatch, so the absence has gone unnoticed.
  • runners/launch_mi300x*.sh glob returns exactly one file — there is no alternate launcher that could pick up the slack.
  • .github/workflows/benchmark-tmpl.yml dispatches via bash ./runners/launch_${RUNNER_NAME%%_*}.sh, so MI300X jobs go through this one launcher only.

Impact

The -mtp variant of the new recipe is non-functional end-to-end on the targeted runner, and the resulting spec-mtp perf datapoints would be vanilla decode numbers in disguise. This is exactly the failure mode the bug description calls out, and it should block the PR.

Fix options

  1. Add SPEC_SUFFIX dispatch to runners/launch_mi300x-amds.sh mirroring launch_mi355x-amds.sh:182-228 (compute SPEC_SUFFIX, append it to the script path, optionally with a fallback chain). This is the proper fix and lets the new _mtp.sh actually run.
  2. Drop the glm5-fp8-mi300x-sglang-mtp recipe and glm5_fp8_mi300x_mtp.sh from this PR and re-add them after the launcher gains MTP support. The non-MTP glm5-fp8-mi300x-sglang recipe is unaffected and can ship as-is.
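Option 1 might look roughly like the following. This is a simplified sketch under stated assumptions, not the actual `launch_mi355x-amds.sh` logic: `resolve_mi300x_script` and `run_resolved_script` are hypothetical names, the fallback chain is omitted, and any `SPEC_DECODING` value other than `mtp` (here the placeholder `none`) is assumed to mean no suffix.

```shell
#!/usr/bin/env bash
# Hypothetical resolver: append _mtp to the script name when spec decoding is mtp.
# Args: EXP_NAME PRECISION [SPEC_DECODING] [SCENARIO_SUBDIR]
resolve_mi300x_script() {
    local exp_name=$1 precision=$2 spec_decoding=${3:-none} subdir=${4:-}
    local spec_suffix=""
    if [[ "$spec_decoding" == "mtp" ]]; then
        spec_suffix="_mtp"
    fi
    printf 'benchmarks/single_node/%s%s_%s_mi300x%s.sh\n' \
        "$subdir" "${exp_name%%_*}" "$precision" "$spec_suffix"
}

# Hypothetical wrapper: fail loudly instead of silently running another script.
run_resolved_script() {
    local script
    script=$(resolve_mi300x_script "$@")
    if [[ ! -f "$script" ]]; then
        echo "launch script not found: $script" >&2
        return 1
    fi
    bash "$script"
}

resolve_mi300x_script glm5_1k1k fp8 mtp    # benchmarks/single_node/glm5_fp8_mi300x_mtp.sh
resolve_mi300x_script glm5_1k1k fp8 none   # benchmarks/single_node/glm5_fp8_mi300x.sh
```

Checking that the resolved file exists before invoking it means a missing `_mtp.sh` script stops the job rather than producing a result file under a misleading name.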

@functionstackx changed the title from "[Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes" to "[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes" on May 19, 2026
@functionstackx
Collaborator Author

Handing off to @Oseltamivir — added as §13 in #1511. Expected to hit the same aiter MHA bug as #1485 (sgl#25672) once the queued jobs run; same recipe-side workaround applies (swap to triton).

AI-generated via Claude Code /loop.

functionstackx added a commit that referenced this pull request May 19, 2026
@functionstackx force-pushed the add-glm5-fp8-mi300x-sglang branch from e718115 to 909a0bc on May 19, 2026 22:52
@functionstackx changed the title from "[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes" to "[Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes" on May 19, 2026
@functionstackx
Collaborator Author

Trying the same alternate launch-arg recipe as #1485 (suggested upstream in sgl-project/sglang#25672 (comment)) — dropping the aiter backend in favor of tilelang NSA backends + fp8_e4m3 KV cache + multithread model loader:

--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang --disable-radix-cache
--kv-cache-dtype fp8_e4m3

Applied to both the off and MTP variants in 909a0bc1. Sweep will tell us if this dodges the aiter Triton MHA tl.arange power-of-2 crash on mi300x.
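One way to carry these flags in a launch script is a bash array, which keeps the JSON argument intact as a single word. This is a sketch only: it prints the command instead of running it, the flags are the ones listed in this comment, and invoking the server as `python3 -m sglang.launch_server` is an assumption about how the recipe scripts start sglang.

```shell
#!/usr/bin/env bash
# Flags copied from the comment above; the array name is a hypothetical choice.
SERVER_ARGS=(
    --mem-fraction-static 0.80
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
    --nsa-prefill-backend tilelang
    --nsa-decode-backend tilelang
    --disable-radix-cache
    --kv-cache-dtype fp8_e4m3
)

# Print a shell-quoted command line for inspection rather than launching a server.
# "${SERVER_ARGS[@]}" expands each element as exactly one argument, so the
# JSON value stays a single word despite its spaces and quotes.
printf '%q ' python3 -m sglang.launch_server "${SERVER_ARGS[@]}"
echo
```

Building the flags once in an array would also let the off and MTP scripts share the same base arguments, with the MTP variant appending its `--speculative-*` flags.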

functionstackx and others added 3 commits May 20, 2026 15:42
New family on MI300X using lmsysorg/sglang:v0.5.12-rocm720-mi30x.
TP=8, conc 4..64, 1k1k + 8k1k. Launch scripts follow the
qwen3.5-fp8-mi300x SGLang recipe (aiter attention + AMD allreduce
fusion) and add GLM-5 parsers (glm47 tool calls, glm45 reasoning).

MTP variant adds --speculative-algorithm EAGLE plus the standard
EAGLE knobs (num-steps 3, eagle-topk 1, num-draft-tokens 4) and
--use-chat-template on the bench client per AGENTS.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in
favor of the recipe suggested in sglang issue #25672 comment 4485916205:
  --mem-fraction-static 0.80
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
  --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
  --kv-cache-dtype fp8_e4m3

Same fix applied to glm5-fp8-mi325x in #1485; both recipes share the
aiter Triton MHA tl.arange power-of-2 crash on GLM-5. Applied to both
the off and MTP variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx force-pushed the add-glm5-fp8-mi300x-sglang branch from 909a0bc to 8fe48f7 on May 20, 2026 19:43