[Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes #1485

Merged
functionstackx merged 4 commits into main from add-glm5-fp8-mi325x-sglang
May 20, 2026

@functionstackx
Collaborator

@functionstackx functionstackx commented May 18, 2026

Summary

Adds a new GLM-5 FP8 SGLang ROCm recipe family for MI325X, both the off and MTP/EAGLE variants in one PR (grouped per the project convention of pairing MTP with its non-MTP sibling).

Recipes

  • glm5-fp8-mi325x-sglang
  • glm5-fp8-mi325x-sglang-mtp

Image

lmsysorg/sglang:v0.5.12-rocm720-mi30x (same tag as the existing qwen3.5-*-mi325x-sglang recipes; mi30x suffix is shared between mi300x and mi325x).

Launch scripts

Launch args follow sglang issue #25672 comment 4485916205: tilelang NSA backends, fp8_e4m3 KV cache, multithread model loader, and a bumped --mem-fraction-static 0.80:

--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang --disable-radix-cache
--kv-cache-dtype fp8_e4m3

GLM-5 parsers: --tool-call-parser glm47, --reasoning-parser glm45. The MTP variant adds --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 and --use-chat-template on the bench client.
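
As a rough sketch of how these flags combine, the snippet below collects the shared arguments and the MTP-only arguments into one bash array. The `build_server_args` helper and the idea of switching on a single argument are illustrative assumptions (this PR uses two separate launch scripts); only the flag names and values are taken from the description above.

```shell
#!/usr/bin/env bash
# Sketch only: collects the flags listed above into one argument array.
# build_server_args is a hypothetical helper, not part of this PR.
build_server_args() {
    local spec_decoding="${1:-off}"
    SERVER_ARGS=(
        --mem-fraction-static 0.80
        --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
        --nsa-prefill-backend tilelang
        --nsa-decode-backend tilelang
        --disable-radix-cache
        --kv-cache-dtype fp8_e4m3
        --tool-call-parser glm47
        --reasoning-parser glm45
    )
    # The MTP variant only appends the speculative-decoding flags.
    if [[ "$spec_decoding" == "mtp" ]]; then
        SERVER_ARGS+=(
            --speculative-algorithm EAGLE
            --speculative-num-steps 3
            --speculative-eagle-topk 1
            --speculative-num-draft-tokens 4
        )
    fi
}

build_server_args mtp
printf '%s\n' "${SERVER_ARGS[@]}"
```

Keeping the JSON value as a single quoted array element avoids the word-splitting problems that a plain string of flags would have.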

Test plan

  • YAML loads; bash -n syntax passes on both launch scripts.
  • full-sweep-enabled sweep finishes green on mi325x for tp=8 / conc 4..64 / 1k1k + 8k1k (both off + mtp matrices).
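
The syntax check in the first item could be scripted roughly as follows. `syntax_check` is a hypothetical helper name; `bash -n` itself only parses a script without executing it.

```shell
#!/usr/bin/env bash
# Sketch: run `bash -n` (parse only, no execution) over each script given.
# syntax_check is a hypothetical helper name.
syntax_check() {
    local rc=0 script
    for script in "$@"; do
        if bash -n "$script" 2>/dev/null; then
            echo "ok   $script"
        else
            echo "FAIL $script"
            rc=1
        fi
    done
    return "$rc"
}
```

For this PR the call would be `syntax_check benchmarks/single_node/glm5_fp8_mi325x.sh benchmarks/single_node/glm5_fp8_mi325x_mtp.sh`. Note that a clean parse says nothing about whether the runner actually dispatches to either script.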

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@github-actions
Contributor

Comment on lines +1 to +30
#!/usr/bin/env bash

# GLM-5 FP8 on MI325X with EAGLE / MTP speculative decoding.
# Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then

🔴 The new glm5_fp8_mi325x_mtp.sh script enables EAGLE speculative decoding but does NOT export SGLANG_ENABLE_SPEC_V2=1, which every other GLM-5 MTP recipe in the codebase explicitly sets (b200, b300, fp4-b200, fp4-b300, and the closest ROCm sibling glm5_fp8_mi355x_mtp.sh:25). Without it, sglang on v0.5.12 falls back to the legacy spec-decoding implementation, which would give degraded/incorrect MTP numbers relative to the rest of the GLM-5 fleet. Fix: add export SGLANG_ENABLE_SPEC_V2=1 alongside the other top-of-script setup, mirroring glm5_fp8_mi355x_mtp.sh:25.

Extended reasoning...

What the bug is

benchmarks/single_node/glm5_fp8_mi325x_mtp.sh (lines 1–30 of the new file) launches sglang with full EAGLE / MTP speculative-decoding flags:

--speculative-algorithm EAGLE
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4

…but the script never exports SGLANG_ENABLE_SPEC_V2=1. This env var is the documented opt-in for sglang's V2 spec-decoding path on v0.5.x — the path the rest of the GLM-5 fleet uses.

Why this matters / the GLM-5 family pattern

A grep across benchmarks/single_node/ shows that every GLM-5 MTP recipe sets this var, on both NVIDIA and AMD:

  • glm5_fp8_b200_mtp.sh:25
  • glm5_fp8_b300_mtp.sh:29
  • glm5_fp4_b200_mtp.sh:25
  • glm5_fp4_b300_mtp.sh:29
  • glm5_fp8_mi355x_mtp.sh:25 ← closest sibling (same model, sglang framework, MTP/EAGLE, ROCm)

The new glm5_fp8_mi325x_mtp.sh is the only GLM-5 MTP recipe missing it. The script's own header says it "Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags" — the mirroring is from the non-MTP mi325x and qwen3.5-mi325x lineage rather than from the GLM-5 MTP family, which is the likely source of the oversight. (The two qwen3.5 mi355x mtp scripts also lack the var, but they run sglang v0.5.10rc0 on a different model family; this is a GLM-5-specific knob tied to GLM-5's draft-model layout on v0.5.12.)

Step-by-step proof

  1. The new recipe uses image lmsysorg/sglang:v0.5.12-rocm720-mi30x. This is the same v0.5.12 era as the b200/b300 GLM-5 MTP recipes that gate spec-V2 behind this env var.
  2. glm5_fp8_mi355x_mtp.sh (same model, same framework, same MTP feature, ROCm) does export SGLANG_ENABLE_SPEC_V2=1 near the top of the script (line 25), right alongside other ROCm sglang env exports.
  3. glm5_fp8_mi325x_mtp.sh defines no SGLANG_* exports at all; the only top-of-script setup is check_env_vars and the HF download branch.
  4. With the V2 path disabled, sglang routes EAGLE/MTP through the legacy spec-decoding implementation that the GLM-5 family explicitly opts out of in every other recipe.
  5. Net effect: when this script does run, MTP results will silently land on a different code path than the rest of the GLM-5 fleet — degraded or non-comparable numbers without any obvious failure signal.

Why existing code doesn't prevent it

There is no central place that injects SGLANG_ENABLE_SPEC_V2=1 — each recipe sets it directly in its launch script. The env var is unset by default in v0.5.12. benchmark_lib.sh does not export it. So a script that omits the export simply runs the legacy path.

Coupling with bug_001 (dispatch)

This bug is latent until the dispatcher actually invokes glm5_fp8_mi325x_mtp.sh (the related dispatch issue). However:

  • The two should be fixed together — fixing dispatch alone would still produce wrong-path MTP numbers.
  • It's a trivial one-line addition; deferring it just means the next sweep after the dispatch fix produces invalid data.

Fix

Add the export near the top of benchmarks/single_node/glm5_fp8_mi325x_mtp.sh, mirroring glm5_fp8_mi355x_mtp.sh:25:

export SGLANG_ENABLE_SPEC_V2=1

Placement: after the hf download line (line 22 of the new file) and before SERVER_LOG=... (line 24).
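
A check along these lines could catch the same omission in future recipes. This is a minimal sketch: `find_missing_spec_v2` is a hypothetical helper, and the `glm5_*_mtp.sh` glob is an assumption based on the file names listed above.

```shell
#!/usr/bin/env bash
# Sketch: list glm5_*_mtp.sh scripts in a directory that never export
# SGLANG_ENABLE_SPEC_V2=1. find_missing_spec_v2 is a hypothetical helper.
find_missing_spec_v2() {
    local dir="$1" script
    for script in "$dir"/glm5_*_mtp.sh; do
        [[ -e "$script" ]] || continue   # glob matched nothing
        if ! grep -qE '^[[:space:]]*export[[:space:]]+SGLANG_ENABLE_SPEC_V2=1' "$script"; then
            echo "$script"
        fi
    done
}
```

Running `find_missing_spec_v2 benchmarks/single_node` from the repository root would print one path per offending script and nothing when every GLM-5 MTP recipe sets the variable.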

Comment on lines +1 to +30
#!/usr/bin/env bash

# GLM-5 FP8 on MI325X with EAGLE / MTP speculative decoding.
# Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
CONC \
ISL \
OSL \
RANDOM_RANGE_RATIO \
RESULT_FILENAME \
EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then

🔴 The new glm5_fp8_mi325x_mtp.sh will never execute. runners/launch_mi325x-amds.sh:42 dispatches to benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi325x.sh without appending any FRAMEWORK_SUFFIX or SPEC_SUFFIX, so both glm5-fp8-mi325x-sglang and glm5-fp8-mi325x-sglang-mtp resolve to the same path (glm5_fp8_mi325x.sh) — the MTP sweep silently runs without EAGLE flags or --use-chat-template and produces numbers indistinguishable from the off sweep. Fix by extending the mi325x launcher to mirror launch_mi355x-amds.sh:221-228 (build ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback), or inline the EAGLE flags into glm5_fp8_mi325x.sh behind a SPEC_DECODING check and delete this new file.

Extended reasoning...

The bug. The MTP launch script added in this PR (benchmarks/single_node/glm5_fp8_mi325x_mtp.sh) is dead code — it will never be invoked by the mi325x runner, and the glm5-fp8-mi325x-sglang-mtp recipe in amd-master.yaml will silently use the non-MTP script instead.

Why. runners/launch_mi325x-amds.sh:42 dispatches via:

bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi325x.sh

There is no FRAMEWORK_SUFFIX or SPEC_SUFFIX appended. Contrast with runners/launch_mi355x-amds.sh:182-228, which sets SPEC_SUFFIX=_mtp when SPEC_DECODING=mtp and builds ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback. The mi325x launcher has no such logic — this is the first mi325x MTP recipe in the file (grep mi325x.*mtp .github/configs/amd-master.yaml returns only the newly-added entry), so the dispatch path has never had to exist.

Step-by-step proof. EXP_NAME is built in utils/matrix_logic/generate_sweep_configs.py:290,362 as f"{model_code}_{seq_len_str}". Both new recipes share model-prefix: glm5 in the yaml, so for the 1k1k scenario both produce EXP_NAME='glm5_1k1k', giving ${EXP_NAME%%_*}='glm5'. With PRECISION='fp8' and runner mi325x, both recipes resolve to exactly the same path: benchmarks/single_node/glm5_fp8_mi325x.sh. The newly-added glm5_fp8_mi325x_mtp.sh is never selected.
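
The collision can be reproduced in isolation with the same parameter expansion the launcher uses. In this sketch the empty `SCENARIO_SUBDIR` and the `none` value for the non-MTP case are assumptions; the point is only that `SPEC_DECODING` never enters the path.

```shell
#!/usr/bin/env bash
# Sketch: reproduce the path the mi325x launcher builds for each recipe.
# SCENARIO_SUBDIR is left empty here; that value is an assumption.
SCENARIO_SUBDIR=""
PRECISION="fp8"
EXP_NAME="glm5_1k1k"   # same for the off and mtp recipes (shared model-prefix)

for SPEC_DECODING in none mtp; do
    # ${EXP_NAME%%_*} strips the longest suffix starting at an underscore.
    script="benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi325x.sh"
    echo "SPEC_DECODING=${SPEC_DECODING} -> ${script}"
done
```

Both iterations print `benchmarks/single_node/glm5_fp8_mi325x.sh`.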

Impact on the MTP sweep. Because glm5_fp8_mi325x.sh (the non-MTP script) is what actually runs for the MTP recipe:

  1. The server starts without --speculative-algorithm EAGLE, --speculative-num-steps, --speculative-eagle-topk, or --speculative-num-draft-tokens, so the "MTP" numbers are actually non-MTP numbers.
  2. EP_SIZE is set by the runner for the mtp recipe (ep: 1 in the yaml) but the non-MTP script ignores it (it hardcodes --data-parallel-size 1 instead).
  3. The bench client is invoked without --use-chat-template.
  4. Net effect: the mtp sweep results will be statistically indistinguishable from the off sweep, polluting perf-changelog with bogus MTP-labeled data.

How to fix. Two options, either is fine:

  • (a) Extend launch_mi325x-amds.sh to mirror launch_mi355x-amds.sh:221-228 — compute SPEC_SUFFIX from SPEC_DECODING, construct ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback to the bare ${SCRIPT_BASE}.sh. This is the cleaner long-term fix because future mi325x recipes (e.g. atom on mi325x, or other-framework MTP variants) will need the same dispatch.
  • (b) Inline the EAGLE flags and --use-chat-template into glm5_fp8_mi325x.sh behind a SPEC_DECODING check, and delete glm5_fp8_mi325x_mtp.sh. This has a lower blast radius but doesn't generalize.
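
Option (a) might look roughly like the sketch below. `resolve_script` and its search order are assumptions, not the contents of launch_mi355x-amds.sh. One deliberate difference from a plain fallback: when MTP is requested, the sketch refuses to fall back to the bare script, so an MTP sweep fails loudly instead of silently running without EAGLE.

```shell
#!/usr/bin/env bash
# Sketch of option (a). resolve_script and its search order are assumptions,
# not the contents of launch_mi355x-amds.sh.
resolve_script() {
    local script_base="$1" framework="$2" spec_decoding="$3"
    local spec_suffix="" candidate
    if [[ "$spec_decoding" == "mtp" ]]; then
        spec_suffix="_mtp"
    fi

    # Prefer the framework-specific script, then the framework-agnostic one.
    # With MTP requested, both candidates carry the _mtp suffix, so the
    # non-MTP script can never be selected by accident.
    for candidate in \
        "${script_base}_${framework}${spec_suffix}.sh" \
        "${script_base}${spec_suffix}.sh"; do
        if [[ -f "$candidate" ]]; then
            echo "$candidate"
            return 0
        fi
    done

    echo "no launch script for ${script_base} (framework=${framework}, spec=${spec_decoding})" >&2
    return 1
}
```

With `script_base=benchmarks/single_node/glm5_fp8_mi325x`, the `mtp` case would resolve to glm5_fp8_mi325x_mtp.sh and the non-MTP case to glm5_fp8_mi325x.sh.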

@functionstackx
Collaborator Author

Filed upstream issue for the aiter MHA Triton kernel crash: sgl-project/sglang#25672

Root cause: GLM-5's qk_nope_head_dim = 192 is not a power of 2, and the concat_and_cast_mha_k_kernel Triton kernel calls tl.arange(0, nope_dim), whose range Triton requires to be a power of 2.

Local workaround: switch --attention-backend aiter to --attention-backend triton or --attention-backend flashinfer — those backends do not hit this code path.

@functionstackx
Collaborator Author

Filed upstream: sgl-project/sglang#25672

The repeated FAILUREs on this PR are not infra — they're a deterministic crash in sglang's aiter MHA Triton kernel. Server starts and binds, then dies on the first warmup forward pass:

File ".../forward_mha.py:520 in _concat_and_cast_mha_k
File ".../layers/attention/utils.py:178 in concat_and_cast_mha_k_triton
File ".../triton/compiler/compiler.py:80 in make_ir
triton.compiler.errors.CompilationError: at 19:16:
    nope_offs = tl.arange(0, nope_dim)
                ^
arange's range must be a power of 2

GLM-5's qk_nope_head_dim is 192 (= 3×64), which is not a power of 2 — Triton's tl.arange requires power-of-2. The same kernel works for DeepSeek-v3 (head dim 128) but not GLM-5.

Local workaround: swap --attention-backend aiter for triton (or flashinfer if available) in glm5_fp8_mi325x{,_mtp}.sh. The non-aiter paths handle non-power-of-2 head dims correctly.

Upstream fix: track sgl-project/sglang#25672 — suggests padding nope_dim to next pow-2 and masking the tail in the Triton kernel. Once it lands and reships in the ROCm MI325X image, the aiter backend can come back.
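
The arithmetic behind that suggestion, as a small sketch (`is_pow2` and `next_pow2` are illustrative helpers, not code from sglang or Triton):

```shell
#!/usr/bin/env bash
# Sketch: the power-of-2 test and the padded size the proposed fix would use.
is_pow2() {
    local n="$1"
    # A power of 2 has exactly one bit set, so n & (n - 1) is zero.
    (( n > 0 && (n & (n - 1)) == 0 ))
}

next_pow2() {
    local n="$1" p=1
    while (( p < n )); do
        (( p <<= 1 ))
    done
    echo "$p"
}

for dim in 128 192; do
    if is_pow2 "$dim"; then
        echo "nope_dim=$dim is a power of 2"
    else
        echo "nope_dim=$dim is not a power of 2; pad to $(next_pow2 "$dim") and mask the tail"
    fi
done
```

For GLM-5 this gives a padded range of 256, of which the last 64 offsets would need to be masked out.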

@functionstackx functionstackx changed the title [Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes [Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes May 18, 2026
@functionstackx
Collaborator Author

Handing off to @Oseltamivir — tracked alongside 7 other stuck Klaud-Cold PRs in #1511. /loop will stop auto-retrying this one.

AI-generated via Claude Code /loop.

@functionstackx functionstackx changed the title [Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes [Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes May 19, 2026
@functionstackx
Collaborator Author

Trying an alternate launch-arg recipe suggested upstream in sgl-project/sglang#25672 (comment) — dropping the aiter backend in favor of tilelang NSA backends + fp8_e4m3 KV cache + multithread model loader:

--mem-fraction-static 0.80
--model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
--nsa-prefill-backend tilelang --nsa-decode-backend tilelang --disable-radix-cache
--kv-cache-dtype fp8_e4m3

Applied to both the off and MTP variants in dcd8bcef. Sweep will tell us if this dodges the aiter Triton MHA tl.arange power-of-2 crash.

functionstackx and others added 3 commits May 19, 2026 18:49
New family on MI325X using lmsysorg/sglang:v0.5.12-rocm720-mi30x.
TP=8, conc 4..64, 1k1k + 8k1k. Launch scripts follow the
qwen3.5-fp8-mi325x SGLang recipe (aiter attention backend + AMD
allreduce fusion), adding GLM-5-specific --tool-call-parser glm47
and --reasoning-parser glm45.

MTP variant adds --speculative-algorithm EAGLE --speculative-num-steps 3
--speculative-eagle-topk 1 --speculative-num-draft-tokens 4 and the
required --use-chat-template on the bench client.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in
favor of the recipe suggested in sglang issue #25672 comment 4485916205:
  --mem-fraction-static 0.80
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
  --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
  --kv-cache-dtype fp8_e4m3

Applied to both the off and MTP variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the add-glm5-fp8-mi325x-sglang branch from dcd8bce to 8b81e6a Compare May 19, 2026 22:49
functionstackx added a commit that referenced this pull request May 19, 2026
Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in
favor of the recipe suggested in sglang issue #25672 comment 4485916205:
  --mem-fraction-static 0.80
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
  --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
  --kv-cache-dtype fp8_e4m3

Same fix applied to glm5-fp8-mi325x in #1485; both recipes share the
aiter Triton MHA tl.arange power-of-2 crash on GLM-5. Applied to both
the off and MTP variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx
Collaborator Author

/reuse-sweep-run

@functionstackx functionstackx merged commit e6549f8 into main May 20, 2026
4 of 5 checks passed
@functionstackx functionstackx deleted the add-glm5-fp8-mi325x-sglang branch May 20, 2026 05:38
functionstackx added a commit that referenced this pull request May 20, 2026
Drops --attention-backend aiter / --enable-aiter-allreduce-fusion in
favor of the recipe suggested in sglang issue #25672 comment 4485916205:
  --mem-fraction-static 0.80
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
  --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
  --kv-cache-dtype fp8_e4m3

Same fix applied to glm5-fp8-mi325x in #1485; both recipes share the
aiter Triton MHA tl.arange power-of-2 crash on GLM-5. Applied to both
the off and MTP variants.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>