[Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes by functionstackx · Pull Request #1485 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T06:04:46Z

Summary

Adds a new GLM-5 FP8 SGLang ROCm recipe family for MI325X, both the off and MTP/EAGLE variants in one PR (grouped per the project convention of pairing MTP with its non-MTP sibling).

Recipes

glm5-fp8-mi325x-sglang
glm5-fp8-mi325x-sglang-mtp

Image

lmsysorg/sglang:v0.5.12-rocm720-mi30x (same tag as the existing qwen3.5-*-mi325x-sglang recipes; mi30x suffix is shared between mi300x and mi325x).

Launch scripts

Launch args follow sglang issue #25672 comment 4485916205: tilelang NSA backends, fp8_e4m3 KV cache, multithread model loader, and a bumped --mem-fraction-static 0.80:

--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang --disable-radix-cache
--kv-cache-dtype fp8_e4m3

GLM-5 parsers: --tool-call-parser glm47, --reasoning-parser glm45. The MTP variant adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 and --use-chat-template on the bench client.
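As a rough illustration of how the flags above fit together, the sketch below collects them in a bash array and appends the EAGLE flags only for the MTP variant. The function name `build_server_args` and its `off`/`mtp` argument are assumptions made for this example; the actual launch scripts in this PR may be organized differently.

```shell
#!/usr/bin/env bash
# Sketch only: assemble the sglang server flags named in this PR.
# build_server_args and its off/mtp argument are illustrative names,
# not the contents of the actual launch scripts.
build_server_args() {
  local spec_decoding="${1:-off}"
  local -a args=(
    --mem-fraction-static 0.80
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
    --nsa-prefill-backend tilelang
    --nsa-decode-backend tilelang
    --disable-radix-cache
    --kv-cache-dtype fp8_e4m3
    --tool-call-parser glm47
    --reasoning-parser glm45
  )
  if [[ "$spec_decoding" == "mtp" ]]; then
    # MTP variant only: the speculative-decoding flags listed above.
    args+=(
      --speculative-algorithm EAGLE
      --speculative-num-steps 3
      --speculative-eagle-topk 1
      --speculative-num-draft-tokens 4
    )
  fi
  # One argument per line so callers can read it back with mapfile.
  printf '%s\n' "${args[@]}"
}

mapfile -t SERVER_ARGS < <(build_server_args mtp)
echo "${#SERVER_ARGS[@]} arguments"
```

Keeping the flags in an array preserves the quoting of the JSON value passed to --model-loader-extra-config when the array is later expanded as `"${SERVER_ARGS[@]}"`. The --use-chat-template flag belongs to the bench client rather than the server, so it is left out of this sketch.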

Test plan

YAML loads; bash -n syntax passes on both launch scripts.
full-sweep-enabled sweep finishes green on mi325x for tp=8 / conc 4..64 / 1k1k + 8k1k (both off + mtp matrices).
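The `bash -n` step in the test plan could look roughly like the sketch below. The helper name `syntax_check` is hypothetical, and the example paths assume the command runs from the repository root.

```shell
#!/usr/bin/env bash
# Sketch: parse-check launch scripts without executing them.
# syntax_check is a hypothetical helper name.
syntax_check() {
  local failed=0 script
  for script in "$@"; do
    if bash -n "$script" 2>/dev/null; then
      echo "ok:   $script"
    else
      echo "FAIL: $script"
      failed=1
    fi
  done
  return "$failed"
}

# Example call, assuming it runs from the repository root:
# syntax_check benchmarks/single_node/glm5_fp8_mi325x.sh \
#              benchmarks/single_node/glm5_fp8_mi325x_mtp.sh
```

Note that `bash -n` only verifies that a script parses. It does not confirm that the script is ever invoked or that it sets the environment it needs.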

🤖 Generated with Claude Code

github-actions · 2026-05-18T06:04:56Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs usually fixes them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-18T06:05:20Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016489321
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26016489321

claude · 2026-05-18T06:17:28Z

+#!/usr/bin/env bash
+
+# GLM-5 FP8 on MI325X with EAGLE / MTP speculative decoding.
+# Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags.
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    MODEL \
+    TP \
+    CONC \
+    ISL \
+    OSL \
+    RANDOM_RANGE_RATIO \
+    RESULT_FILENAME \
+    EP_SIZE
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
+
+SERVER_LOG=/workspace/server.log
+PORT=${PORT:-8888}
+CONTEXT_LENGTH=$((ISL + OSL + 20))
+MAX_PREFILL_TOKENS=32768
+
+EVAL_CONTEXT_ARGS=""
+if [ "${EVAL_ONLY}" = "true" ]; then


🔴 The new glm5_fp8_mi325x_mtp.sh script enables EAGLE speculative decoding but does NOT export SGLANG_ENABLE_SPEC_V2=1, which every other GLM-5 MTP recipe in the codebase explicitly sets (b200, b300, fp4-b200, fp4-b300, and the closest ROCm sibling glm5_fp8_mi355x_mtp.sh:25). Without it, sglang on v0.5.12 falls back to the legacy spec-decoding implementation, which would give degraded/incorrect MTP numbers relative to the rest of the GLM-5 fleet. Fix: add export SGLANG_ENABLE_SPEC_V2=1 alongside the other top-of-script setup, mirroring glm5_fp8_mi355x_mtp.sh:25.

Extended reasoning...

What the bug is

benchmarks/single_node/glm5_fp8_mi325x_mtp.sh (lines 1–30 of the new file) launches sglang with full EAGLE / MTP speculative-decoding flags:

--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

…but the script never exports SGLANG_ENABLE_SPEC_V2=1. This env var is the documented opt-in for sglang's V2 spec-decoding path on v0.5.x — the path the rest of the GLM-5 fleet uses.

Why this matters / the GLM-5 family pattern

A grep across benchmarks/single_node/ shows that every GLM-5 MTP recipe sets this var, on both NVIDIA and AMD:

glm5_fp8_b200_mtp.sh:25

glm5_fp8_b300_mtp.sh:29

glm5_fp4_b200_mtp.sh:25

glm5_fp4_b300_mtp.sh:29

glm5_fp8_mi355x_mtp.sh:25 ← closest sibling (same model, sglang framework, MTP/EAGLE, ROCm)

The new glm5_fp8_mi325x_mtp.sh is the only GLM-5 MTP recipe missing it. The script's own header says it "Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags" — the mirroring is from the non-MTP mi325x and qwen3.5-mi325x lineage rather than from the GLM-5 MTP family, which is the likely source of the oversight. (The two qwen3.5 mi355x mtp scripts also lack the var, but they run sglang v0.5.10rc0 on a different model family; this is a GLM-5-specific knob tied to GLM-5's draft-model layout on v0.5.12.)

Step-by-step proof

The new recipe uses image lmsysorg/sglang:v0.5.12-rocm720-mi30x. This is the same v0.5.12 era as the b200/b300 GLM-5 MTP recipes that gate spec-V2 behind this env var.

glm5_fp8_mi355x_mtp.sh (same model, same framework, same MTP feature, ROCm) does export SGLANG_ENABLE_SPEC_V2=1 near the top of the script (line 25), right alongside other ROCm sglang env exports.

glm5_fp8_mi325x_mtp.sh defines no SGLANG_* exports at all; the only top-of-script setup is check_env_vars and the HF download branch.

With the V2 path disabled, sglang routes EAGLE/MTP through the legacy spec-decoding implementation that the GLM-5 family explicitly opts out of in every other recipe.

Net effect: when this script does run, MTP results will silently land on a different code path than the rest of the GLM-5 fleet — degraded or non-comparable numbers without any obvious failure signal.

Why existing code doesn't prevent it

There is no central place that injects SGLANG_ENABLE_SPEC_V2=1 — each recipe sets it directly in its launch script. The env var is unset by default in v0.5.12. benchmark_lib.sh does not export it. So a script that omits the export simply runs the legacy path.

Coupling with bug_001 (dispatch)

This bug is latent until the dispatcher actually invokes glm5_fp8_mi325x_mtp.sh (the related dispatch issue). However:

The two should be fixed together — fixing dispatch alone would still produce wrong-path MTP numbers.

It's a trivial one-line addition; deferring it just means the next sweep after the dispatch fix produces invalid data.

Fix

Add the export near the top of benchmarks/single_node/glm5_fp8_mi325x_mtp.sh, mirroring glm5_fp8_mi355x_mtp.sh:25:

export SGLANG_ENABLE_SPEC_V2=1

Placement: after the hf download line, before SERVER_LOG=... (i.e. immediately before line 22 of the new file).
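Since each recipe sets the variable in its own launch script, a small check along the following lines could catch a missing export before a sweep runs. This is a sketch under assumptions: `find_missing_spec_v2` is a hypothetical helper that is not part of the repository, and it assumes GLM-5 MTP scripts match the `glm5_*_mtp.sh` naming seen in the list above.

```shell
#!/usr/bin/env bash
# Sketch: flag GLM-5 MTP launch scripts that omit the spec-V2 opt-in.
# find_missing_spec_v2 is a hypothetical helper, not part of the repository.
find_missing_spec_v2() {
  local dir="$1" missing=0 script
  for script in "$dir"/glm5_*_mtp.sh; do
    [[ -e "$script" ]] || continue   # glob matched nothing
    if ! grep -qE '^[[:space:]]*export[[:space:]]+SGLANG_ENABLE_SPEC_V2=1' "$script"; then
      echo "missing SGLANG_ENABLE_SPEC_V2: ${script##*/}"
      missing=1
    fi
  done
  return "$missing"
}

# Example call, assuming it runs from the repository root:
# find_missing_spec_v2 benchmarks/single_node
```

The function returns non-zero when any matching script lacks the export, so it could gate a CI step if that is wanted.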

claude · 2026-05-18T06:17:29Z

+#!/usr/bin/env bash
+
+# GLM-5 FP8 on MI325X with EAGLE / MTP speculative decoding.
+# Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags.
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    MODEL \
+    TP \
+    CONC \
+    ISL \
+    OSL \
+    RANDOM_RANGE_RATIO \
+    RESULT_FILENAME \
+    EP_SIZE
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
+
+SERVER_LOG=/workspace/server.log
+PORT=${PORT:-8888}
+CONTEXT_LENGTH=$((ISL + OSL + 20))
+MAX_PREFILL_TOKENS=32768
+
+EVAL_CONTEXT_ARGS=""
+if [ "${EVAL_ONLY}" = "true" ]; then


🔴 The new glm5_fp8_mi325x_mtp.sh will never execute. runners/launch_mi325x-amds.sh:42 dispatches to benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi325x.sh without appending any FRAMEWORK_SUFFIX or SPEC_SUFFIX, so both glm5-fp8-mi325x-sglang and glm5-fp8-mi325x-sglang-mtp resolve to the same path (glm5_fp8_mi325x.sh) — the MTP sweep silently runs without EAGLE flags or --use-chat-template and produces numbers indistinguishable from the off sweep. Fix by extending the mi325x launcher to mirror launch_mi355x-amds.sh:221-228 (build ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback), or inline the EAGLE flags into glm5_fp8_mi325x.sh behind a SPEC_DECODING check and delete this new file.

Extended reasoning...

The bug. The MTP launch script added in this PR (benchmarks/single_node/glm5_fp8_mi325x_mtp.sh) is dead code — it will never be invoked by the mi325x runner, and the glm5-fp8-mi325x-sglang-mtp recipe in amd-master.yaml will silently use the non-MTP script instead.

Why. runners/launch_mi325x-amds.sh:42 dispatches via:

bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi325x.sh

There is no FRAMEWORK_SUFFIX or SPEC_SUFFIX appended. Contrast with runners/launch_mi355x-amds.sh:182-228, which sets SPEC_SUFFIX=_mtp when SPEC_DECODING=mtp and builds ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback. The mi325x launcher has no such logic — this is the first mi325x MTP recipe in the file (grep mi325x.*mtp .github/configs/amd-master.yaml returns only the newly-added entry), so the dispatch path has never had to exist.

Step-by-step proof. EXP_NAME is built in utils/matrix_logic/generate_sweep_configs.py:290,362 as f"{model_code}_{seq_len_str}". Both new recipes share model-prefix: glm5 in the yaml, so for the 1k1k scenario both produce EXP_NAME='glm5_1k1k', giving ${EXP_NAME%%_*}='glm5'. With PRECISION='fp8' and runner mi325x, both recipes resolve to exactly the same path: benchmarks/single_node/glm5_fp8_mi325x.sh. The newly-added glm5_fp8_mi325x_mtp.sh is never selected.
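The collision can be reproduced in isolation. The sketch below applies the same `${EXP_NAME%%_*}` expansion quoted from the launcher; `script_path` is an illustrative helper, and SCENARIO_SUBDIR is assumed to be empty.

```shell
#!/usr/bin/env bash
# Sketch of the path construction described above. script_path is an
# illustrative helper; the expansion is the one quoted from the launcher.
# SCENARIO_SUBDIR is assumed to be empty here.
script_path() {
  local exp_name="$1" precision="$2"
  # %%_* strips the longest suffix that starts at the first underscore.
  echo "benchmarks/single_node/${exp_name%%_*}_${precision}_mi325x.sh"
}

# Both recipes share model-prefix glm5, so both get EXP_NAME=glm5_1k1k.
# The speculative-decoding mode is not an input to the path at all.
off_path=$(script_path glm5_1k1k fp8)
mtp_path=$(script_path glm5_1k1k fp8)
echo "off: $off_path"
echo "mtp: $mtp_path"
```

Both lines print benchmarks/single_node/glm5_fp8_mi325x.sh, and the 8k1k scenario collapses the same way because everything after the first underscore is discarded.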

Impact on the MTP sweep. Because glm5_fp8_mi325x.sh (the non-MTP script) is what actually runs for the MTP recipe:

The server starts without --speculative-algorithm EAGLE, --speculative-num-steps, --speculative-eagle-topk, or --speculative-num-draft-tokens, so the "MTP" numbers are actually non-MTP numbers.

EP_SIZE is set by the runner for the mtp recipe (ep: 1 in the yaml) but the non-MTP script ignores it (it hardcodes --data-parallel-size 1 instead).

The bench client is invoked without --use-chat-template.

Net effect: the mtp sweep results will be statistically indistinguishable from the off sweep, polluting perf-changelog with bogus MTP-labeled data.

How to fix. Two options, either is fine:

(a) Extend launch_mi325x-amds.sh to mirror launch_mi355x-amds.sh:221-228 — compute SPEC_SUFFIX from SPEC_DECODING, construct ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback to the bare ${SCRIPT_BASE}.sh. This is the cleaner long-term fix because future mi325x recipes (e.g. atom on mi325x, or other-framework MTP variants) will need the same dispatch.

(b) Inline the EAGLE flags and --use-chat-template into glm5_fp8_mi325x.sh behind a SPEC_DECODING check, and delete glm5_fp8_mi325x_mtp.sh. Lower-blast-radius but doesn't generalize.
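A minimal sketch of option (a) follows. The function name `resolve_script` and the candidate order are assumptions, since the actual logic in launch_mi355x-amds.sh is not reproduced in this thread. The middle candidate, `${SCRIPT_BASE}${SPEC_SUFFIX}.sh`, is added here only so that the file name used in this PR (glm5_fp8_mi325x_mtp.sh) would be found.

```shell
#!/usr/bin/env bash
# Sketch of option (a). resolve_script and the candidate order are assumptions;
# the real launch_mi355x-amds.sh logic is not reproduced in this thread.
resolve_script() {
  local dir="$1" script_base="$2" framework="$3" spec_decoding="$4"
  local spec_suffix="" candidate
  if [[ "$spec_decoding" == "mtp" ]]; then
    spec_suffix="_mtp"
  fi
  # Most specific name first, bare script last.
  for candidate in \
    "${script_base}_${framework}${spec_suffix}.sh" \
    "${script_base}${spec_suffix}.sh" \
    "${script_base}.sh"; do
    if [[ -f "$dir/$candidate" ]]; then
      echo "$dir/$candidate"
      return 0
    fi
  done
  echo "no launch script found for ${script_base} (${framework}, ${spec_decoding})" >&2
  return 1
}

# Example call, assuming it runs from the repository root:
# resolve_script benchmarks/single_node glm5_fp8_mi325x sglang mtp
```

One design caveat: falling back to the bare script when SPEC_DECODING=mtp would reintroduce the silent non-MTP run described above if the _mtp file were ever missing. A stricter variant could skip the final candidate for MTP recipes and fail instead.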

functionstackx · 2026-05-18T16:35:19Z

Filed upstream issue for the aiter MHA Triton kernel crash: sgl-project/sglang#25672

Root cause: GLM-5's qk_nope_head_dim = 192 is not a power of 2, and the concat_and_cast_mha_k_kernel Triton kernel uses tl.arange(0, nope_dim) which Triton requires to be a power of 2.

Local workaround: switch --attention-backend aiter to --attention-backend triton or --attention-backend flashinfer — those backends do not hit this code path.

functionstackx · 2026-05-18T16:49:01Z

Filed upstream: sgl-project/sglang#25672

The repeated FAILUREs on this PR are not infra — they're a deterministic crash in sglang's aiter MHA Triton kernel. Server starts and binds, then dies on the first warmup forward pass:

File ".../forward_mha.py:520 in _concat_and_cast_mha_k
File ".../layers/attention/utils.py:178 in concat_and_cast_mha_k_triton
File ".../triton/compiler/compiler.py:80 in make_ir
triton.compiler.errors.CompilationError: at 19:16:
    nope_offs = tl.arange(0, nope_dim)
                ^
arange's range must be a power of 2

GLM-5's qk_nope_head_dim is 192 (= 3×64), which is not a power of 2 — Triton's tl.arange requires power-of-2. The same kernel works for DeepSeek-v3 (head dim 128) but not GLM-5.

Local workaround: swap --attention-backend aiter → triton (or flashinfer if available) in glm5_fp8_mi325x{,_mtp}.sh. The non-aiter paths handle non-power-of-2 head dims correctly.

Upstream fix: track sgl-project/sglang#25672 — suggests padding nope_dim to next pow-2 and masking the tail in the Triton kernel. Once it lands and reships in the ROCm MI325X image, the aiter backend can come back.
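The arithmetic behind the crash and the proposed padding can be checked with plain integer operations. This is only an illustration of the power-of-2 test and the round-up, not the Triton kernel change itself; `is_pow2` and `next_pow2` are helper names made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: power-of-2 test and round-up, as plain integer arithmetic.
# is_pow2 and next_pow2 are illustrative helper names.
is_pow2() {
  local n="$1"
  # A power of 2 has exactly one bit set, so n & (n - 1) clears it to 0.
  (( n > 0 && (n & (n - 1)) == 0 ))
}

next_pow2() {
  local n="$1" p=1
  while (( p < n )); do
    (( p <<= 1 ))
  done
  echo "$p"
}

# 128 is the DeepSeek-v3 head dim cited above; 192 is GLM-5's qk_nope_head_dim.
for dim in 128 192; do
  if is_pow2 "$dim"; then
    echo "$dim: power of 2"
  else
    echo "$dim: not a power of 2, pad to $(next_pow2 "$dim")"
  fi
done
```

For 192 the padded range is 256, so a kernel taking the padding approach would need to mask off the last 64 lanes.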

functionstackx · 2026-05-18T19:56:14Z

Handing off to @Oseltamivir — tracked alongside 7 other stuck Klaud-Cold PRs in #1511. /loop will stop auto-retrying this one.

AI-generated via Claude Code /loop.

github-actions · 2026-05-18T21:30:53Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26016491988
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26016491988

functionstackx · 2026-05-19T22:47:32Z

Trying an alternate launch-arg recipe suggested upstream in sgl-project/sglang#25672 (comment) — dropping the aiter backend in favor of tilelang NSA backends + fp8_e4m3 KV cache + multithread model loader:

--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang --disable-radix-cache
--kv-cache-dtype fp8_e4m3

Applied to both the off and MTP variants in dcd8bcef. Sweep will tell us if this dodges the aiter Triton MHA tl.arange power-of-2 crash.

New family on MI325X using lmsysorg/sglang:v0.5.12-rocm720-mi30x. TP=8, conc 4..64, 1k1k + 8k1k. Launch scripts follow the qwen3.5-fp8-mi325x SGLang recipe (aiter attention backend + AMD allreduce fusion), adding GLM-5-specific --tool-call-parser glm47 and --reasoning-parser glm45. MTP variant adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 and the required --use-chat-template on the bench client.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in favor of the recipe suggested in sglang issue #25672 comment 4485916205:

--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang
--kv-cache-dtype fp8_e4m3

Applied to both the off and MTP variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in favor of the recipe suggested in sglang issue #25672 comment 4485916205:

--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang
--kv-cache-dtype fp8_e4m3

Same fix applied to glm5-fp8-mi325x in #1485; both recipes share the aiter Triton MHA tl.arange power-of-2 crash on GLM-5. Applied to both the off and MTP variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T01:47:00Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26130045148
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26130045148

functionstackx · 2026-05-20T05:38:34Z

/reuse-sweep-run

github-actions · 2026-05-20T05:39:15Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26143688774
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26143688774

functionstackx requested a review from a team May 18, 2026 06:04

functionstackx added the full-sweep-enabled label May 18, 2026

functionstackx requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners May 18, 2026 06:04

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

functionstackx added a commit that referenced this pull request May 18, 2026

chore: fill pr-link for #1485

bc5662d

functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] Add glm5-fp8-mi300x-sglang (off + mtp) recipes #1486

Open

2 tasks

claude Bot reviewed May 18, 2026

functionstackx mentioned this pull request May 18, 2026

[Bug] GLM-5 on MI325X + aiter backend: concat_and_cast_mha_k_kernel Triton compile error "arange's range must be a power of 2" sgl-project/sglang#25672

Closed

functionstackx mentioned this pull request May 18, 2026

[AI Generated] [Handoff] out of 70+ image updates, 13 stuck Klaud Cold PRs need upstream coordination / scope decisions #1511

Open

functionstackx changed the title ~~[Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes~~ [Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes May 18, 2026

functionstackx changed the title ~~[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes~~ [Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes May 19, 2026

functionstackx and others added 3 commits May 19, 2026 18:49

chore: fill pr-link for #1485

de51e35

functionstackx force-pushed the add-glm5-fp8-mi325x-sglang branch from dcd8bce to 8b81e6a Compare May 19, 2026 22:49

Merge branch 'main' into add-glm5-fp8-mi325x-sglang

0fa4574

functionstackx merged commit e6549f8 into main May 20, 2026
4 of 5 checks passed

functionstackx deleted the add-glm5-fp8-mi325x-sglang branch May 20, 2026 05:38

github-project-automation Bot moved this to Done in InferenceMAX Board May 20, 2026
