38 changes: 38 additions & 0 deletions .github/configs/amd-master.yaml
@@ -1876,3 +1876,41 @@ qwen3.5-fp8-mi325x-sglang-mtp:
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }

glm5-fp8-mi325x-sglang:
  image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
  model: zai-org/GLM-5-FP8
  model-prefix: glm5
  runner: mi325x
  precision: fp8
  framework: sglang
  multinode: false
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 4, conc-end: 64 }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 4, conc-end: 64 }

glm5-fp8-mi325x-sglang-mtp:
  image: lmsysorg/sglang:v0.5.12-rocm720-mi30x
  model: zai-org/GLM-5-FP8
  model-prefix: glm5
  runner: mi325x
  precision: fp8
  framework: sglang
  multinode: false
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, ep: 1, conc-start: 4, conc-end: 64, spec-decoding: mtp }
79 changes: 79 additions & 0 deletions benchmarks/single_node/glm5_fp8_mi325x.sh
@@ -0,0 +1,79 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi

start_gpu_monitor

# Launch args follow sglang issue #25672 comment 4485916205:
# tilelang NSA backends + fp8_e4m3 KV cache + multithread model load.
python3 -m sglang.launch_server \
    --model-path $MODEL \
    --host=0.0.0.0 \
    --port $PORT \
    --tensor-parallel-size $TP \
    --data-parallel-size 1 \
    --trust-remote-code \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --tokenizer-worker-num 6 \
    --cuda-graph-max-bs $CONC \
    --disable-radix-cache \
    --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --scheduler-recv-interval 30 \
    --mem-fraction-static 0.80 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang \
    --kv-cache-dtype fp8_e4m3 \
    $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/

if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor
set +x
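
As a quick check of the sizing arithmetic in the script above, the sketch below evaluates the `CONTEXT_LENGTH` expression and the `--num-prompts` expression for the two scenarios and the two concurrency endpoints listed in amd-master.yaml. The helper names `context_length` and `num_prompts` are hypothetical and exist only for this illustration; they are not part of benchmark_lib.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of benchmark_lib.sh) that restate the
# arithmetic used by glm5_fp8_mi325x.sh.

# Same expression as CONTEXT_LENGTH=$((ISL + OSL + 20)) in the script.
context_length() {
    local isl=$1 osl=$2
    echo $((isl + osl + 20))
}

# Same expression as --num-prompts "$((CONC * 10))" in the script.
num_prompts() {
    local conc=$1
    echo $((conc * 10))
}

# Scenarios from the recipe: isl 1024 and 8192, osl 1024.
for isl in 1024 8192; do
    echo "isl=$isl osl=1024 context_length=$(context_length "$isl" 1024)"
done

# Concurrency endpoints from the search space: conc-start 4, conc-end 64.
for conc in 4 64; do
    echo "conc=$conc num_prompts=$(num_prompts "$conc")"
done
```

With these inputs the server context window is 2068 tokens for the 1k/1k scenario and 9236 tokens for the 8k/1k scenario, and the client sends between 40 and 640 prompts per run.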
89 changes: 89 additions & 0 deletions benchmarks/single_node/glm5_fp8_mi325x_mtp.sh
@@ -0,0 +1,89 @@
#!/usr/bin/env bash

# GLM-5 FP8 on MI325X with EAGLE / MTP speculative decoding.
# Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
CONTEXT_LENGTH=$((ISL + OSL + 20))
MAX_PREFILL_TOKENS=32768

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
Comment on lines +1 to +30

🔴 The new glm5_fp8_mi325x_mtp.sh script enables EAGLE speculative decoding but does NOT export SGLANG_ENABLE_SPEC_V2=1, which every other GLM-5 MTP recipe in the codebase explicitly sets (b200, b300, fp4-b200, fp4-b300, and the closest ROCm sibling glm5_fp8_mi355x_mtp.sh:25). Without it, sglang on v0.5.12 falls back to the legacy spec-decoding implementation, which would give degraded/incorrect MTP numbers relative to the rest of the GLM-5 fleet. Fix: add export SGLANG_ENABLE_SPEC_V2=1 alongside the other top-of-script setup, mirroring glm5_fp8_mi355x_mtp.sh:25.

Extended reasoning...

What the bug is

benchmarks/single_node/glm5_fp8_mi325x_mtp.sh (lines 1–30 of the new file) launches sglang with full EAGLE / MTP speculative-decoding flags:

--speculative-algorithm EAGLE
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4

…but the script never exports SGLANG_ENABLE_SPEC_V2=1. This env var is the documented opt-in for sglang's V2 spec-decoding path on v0.5.x — the path the rest of the GLM-5 fleet uses.

Why this matters / the GLM-5 family pattern

A grep across benchmarks/single_node/ shows that every GLM-5 MTP recipe sets this var, on both NVIDIA and AMD:

  • glm5_fp8_b200_mtp.sh:25
  • glm5_fp8_b300_mtp.sh:29
  • glm5_fp4_b200_mtp.sh:25
  • glm5_fp4_b300_mtp.sh:29
  • glm5_fp8_mi355x_mtp.sh:25 ← closest sibling (same model, sglang framework, MTP/EAGLE, ROCm)

The new glm5_fp8_mi325x_mtp.sh is the only GLM-5 MTP recipe missing it. The script's own header says it "Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags" — the mirroring is from the non-MTP mi325x and qwen3.5-mi325x lineage rather than from the GLM-5 MTP family, which is the likely source of the oversight. (The two qwen3.5 mi355x mtp scripts also lack the var, but they run sglang v0.5.10rc0 on a different model family; this is a GLM-5-specific knob tied to GLM-5's draft-model layout on v0.5.12.)
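
The grep described above could be reproduced roughly as follows. This is a sketch, not the command the reviewer ran: the function name `find_missing_spec_v2`, the `glm5_*_mtp.sh` glob, and the stand-in files in the demo directory are all assumptions made for illustration. It relies only on `grep -L`, which lists files that contain no matching line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the GLM-5 MTP recipes in a directory that do
# not export SGLANG_ENABLE_SPEC_V2=1. The glob pattern is an assumption.
find_missing_spec_v2() {
    local dir=$1
    # grep -L prints the names of files with no matching line. Its exit
    # status differs between GNU grep versions, so it is not relied on here.
    grep -L -E '^[[:space:]]*export[[:space:]]+SGLANG_ENABLE_SPEC_V2=1' \
        "$dir"/glm5_*_mtp.sh || true
}

# Demo with stand-in files; the contents are placeholders, not the real recipes.
demo_dir=$(mktemp -d)
printf 'export SGLANG_ENABLE_SPEC_V2=1\n' > "$demo_dir/glm5_fp8_mi355x_mtp.sh"
printf 'PORT=8888\n' > "$demo_dir/glm5_fp8_mi325x_mtp.sh"

find_missing_spec_v2 "$demo_dir"
```

In the demo only the mi325x stand-in is listed, because it is the only file without the export. Run against the real benchmarks/single_node/ directory, the same check could serve as a cheap pre-merge lint.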

Step-by-step proof

  1. The new recipe uses image lmsysorg/sglang:v0.5.12-rocm720-mi30x. This is the same v0.5.12 era as the b200/b300 GLM-5 MTP recipes that gate spec-V2 behind this env var.
  2. glm5_fp8_mi355x_mtp.sh (same model, same framework, same MTP feature, ROCm) does export SGLANG_ENABLE_SPEC_V2=1 near the top of the script (line 25), right alongside other ROCm sglang env exports.
  3. glm5_fp8_mi325x_mtp.sh defines no SGLANG_* exports at all; the only top-of-script setup is check_env_vars and the HF download branch.
  4. With the V2 path disabled, sglang routes EAGLE/MTP through the legacy spec-decoding implementation that the GLM-5 family explicitly opts out of in every other recipe.
  5. Net effect: when this script does run, MTP results will silently land on a different code path than the rest of the GLM-5 fleet — degraded or non-comparable numbers without any obvious failure signal.

Why existing code doesn't prevent it

There is no central place that injects SGLANG_ENABLE_SPEC_V2=1 — each recipe sets it directly in its launch script. The env var is unset by default in v0.5.12. benchmark_lib.sh does not export it. So a script that omits the export simply runs the legacy path.

Coupling with bug_001 (dispatch)

This bug is latent until the dispatcher actually invokes glm5_fp8_mi325x_mtp.sh (the related dispatch issue). However:

  • The two should be fixed together — fixing dispatch alone would still produce wrong-path MTP numbers.
  • It's a trivial one-line addition; deferring it just means the next sweep after the dispatch fix produces invalid data.

Fix

Add the export near the top of benchmarks/single_node/glm5_fp8_mi325x_mtp.sh, mirroring glm5_fp8_mi355x_mtp.sh:25:

export SGLANG_ENABLE_SPEC_V2=1

Placement: after the hf download line and before SERVER_LOG=... (i.e. between lines 22 and 24 of the new file).
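
Because the missing export produces no failure signal, a recipe could also check for it explicitly before launching the server. The following is a minimal sketch under that assumption; `require_spec_v2` is a hypothetical guard and is not part of benchmark_lib.sh or any existing recipe.

```shell
#!/usr/bin/env bash
# Hypothetical guard, not part of benchmark_lib.sh: fail fast when an MTP
# recipe is about to launch without SGLANG_ENABLE_SPEC_V2=1 in its environment.
require_spec_v2() {
    if [[ "${SGLANG_ENABLE_SPEC_V2:-}" != "1" ]]; then
        echo "SGLANG_ENABLE_SPEC_V2=1 is not set; refusing to launch MTP recipe" >&2
        return 1
    fi
}

# The one-line fix from the review, followed by the guard.
export SGLANG_ENABLE_SPEC_V2=1
require_spec_v2 && echo "spec v2 opt-in present"
```

Using `export` rather than a plain assignment matters here, since the variable has to reach the `python3 -m sglang.launch_server` child process.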

Comment on lines +1 to +30

🔴 The new glm5_fp8_mi325x_mtp.sh will never execute. runners/launch_mi325x-amds.sh:42 dispatches to benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_mi325x.sh without appending any FRAMEWORK_SUFFIX or SPEC_SUFFIX, so both glm5-fp8-mi325x-sglang and glm5-fp8-mi325x-sglang-mtp resolve to the same path (glm5_fp8_mi325x.sh) — the MTP sweep silently runs without EAGLE flags or --use-chat-template and produces numbers indistinguishable from the off sweep. Fix by extending the mi325x launcher to mirror launch_mi355x-amds.sh:221-228 (build ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback), or inline the EAGLE flags into glm5_fp8_mi325x.sh behind a SPEC_DECODING check and delete this new file.

Extended reasoning...

The bug. The MTP launch script added in this PR (benchmarks/single_node/glm5_fp8_mi325x_mtp.sh) is dead code — it will never be invoked by the mi325x runner, and the glm5-fp8-mi325x-sglang-mtp recipe in amd-master.yaml will silently use the non-MTP script instead.

Why. runners/launch_mi325x-amds.sh:42 dispatches via:

bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi325x.sh

There is no FRAMEWORK_SUFFIX or SPEC_SUFFIX appended. Contrast with runners/launch_mi355x-amds.sh:182-228, which sets SPEC_SUFFIX=_mtp when SPEC_DECODING=mtp and builds ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback. The mi325x launcher has no such logic — this is the first mi325x MTP recipe in the file (grep mi325x.*mtp .github/configs/amd-master.yaml returns only the newly-added entry), so the dispatch path has never had to exist.

Step-by-step proof. EXP_NAME is built in utils/matrix_logic/generate_sweep_configs.py:290,362 as f"{model_code}_{seq_len_str}". Both new recipes share model-prefix: glm5 in the yaml, so for the 1k1k scenario both produce EXP_NAME='glm5_1k1k', giving ${EXP_NAME%%_*}='glm5'. With PRECISION='fp8' and runner mi325x, both recipes resolve to exactly the same path: benchmarks/single_node/glm5_fp8_mi325x.sh. The newly-added glm5_fp8_mi325x_mtp.sh is never selected.
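
The collision can be reproduced with the dispatch expression quoted from runners/launch_mi325x-amds.sh:42. In this sketch, `resolve_script` is a hypothetical wrapper, `SCENARIO_SUBDIR` is left empty, and the `none` value for the non-MTP case is an assumption; the path expression itself is copied from the review comment.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the dispatch expression quoted from
# runners/launch_mi325x-amds.sh:42.
resolve_script() {
    local EXP_NAME=$1 PRECISION=$2
    local SPEC_DECODING=${3:-none}   # accepted but unused, as in the launcher
    local SCENARIO_SUBDIR=""         # left empty for this illustration
    # ${EXP_NAME%%_*} removes the longest suffix that starts at the first "_",
    # so "glm5_1k1k" becomes "glm5".
    echo "benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi325x.sh"
}

# Both recipes share model-prefix glm5, so both receive EXP_NAME=glm5_1k1k.
off_script=$(resolve_script glm5_1k1k fp8 none)
mtp_script=$(resolve_script glm5_1k1k fp8 mtp)

echo "off: $off_script"   # off: benchmarks/single_node/glm5_fp8_mi325x.sh
echo "mtp: $mtp_script"   # mtp: benchmarks/single_node/glm5_fp8_mi325x.sh
```

Since nothing in the expression depends on `SPEC_DECODING`, no value of that variable can select glm5_fp8_mi325x_mtp.sh.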

Impact on the MTP sweep. Because glm5_fp8_mi325x.sh (the non-MTP script) is what actually runs for the MTP recipe:

  1. The server starts without --speculative-algorithm EAGLE, --speculative-num-steps, --speculative-eagle-topk, or --speculative-num-draft-tokens, so the "MTP" numbers are actually non-MTP numbers.
  2. EP_SIZE is set by the runner for the mtp recipe (ep: 1 in the yaml) but the non-MTP script ignores it (it hardcodes --data-parallel-size 1 instead).
  3. The bench client is invoked without --use-chat-template.
  4. Net effect: the mtp sweep results will be statistically indistinguishable from the off sweep, polluting perf-changelog with bogus MTP-labeled data.

How to fix. Two options, either is fine:

  • (a) Extend launch_mi325x-amds.sh to mirror launch_mi355x-amds.sh:221-228 — compute SPEC_SUFFIX from SPEC_DECODING, construct ${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh with a fallback to the bare ${SCRIPT_BASE}.sh. This is the cleaner long-term fix because future mi325x recipes (e.g. atom on mi325x, or other-framework MTP variants) will need the same dispatch.
  • (b) Inline the EAGLE flags and --use-chat-template into glm5_fp8_mi325x.sh behind a SPEC_DECODING check, and delete glm5_fp8_mi325x_mtp.sh. Lower-blast-radius but doesn't generalize.
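
Option (a) might look roughly like the sketch below. It is not the actual launch_mi355x-amds.sh code: the variable names `SCRIPT_BASE`, `FRAMEWORK`, `SPEC_DECODING` and `SPEC_SUFFIX` follow the review comment, while the function `pick_script`, the intermediate `${SCRIPT_BASE}${SPEC_SUFFIX}.sh` candidate (added so that the file name used in this PR would match), and the demo files are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of option (a); not the actual launch_mi355x-amds.sh code.
pick_script() {
    local SCRIPT_BASE=$1 FRAMEWORK=$2 SPEC_DECODING=${3:-}
    local SPEC_SUFFIX=""
    if [[ "$SPEC_DECODING" == "mtp" ]]; then
        SPEC_SUFFIX="_mtp"
    fi

    # Try the most specific script first, then fall back step by step.
    # The middle candidate is an assumption, not taken from the review.
    local candidate
    for candidate in \
        "${SCRIPT_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh" \
        "${SCRIPT_BASE}${SPEC_SUFFIX}.sh" \
        "${SCRIPT_BASE}.sh"; do
        if [[ -f "$candidate" ]]; then
            echo "$candidate"
            return 0
        fi
    done
    echo "no benchmark script found for ${SCRIPT_BASE}" >&2
    return 1
}

# Demo with empty stand-in files named like the two scripts in this PR.
demo_dir=$(mktemp -d)
touch "$demo_dir/glm5_fp8_mi325x.sh" "$demo_dir/glm5_fp8_mi325x_mtp.sh"

pick_script "$demo_dir/glm5_fp8_mi325x" sglang mtp
pick_script "$demo_dir/glm5_fp8_mi325x" sglang
```

One design caveat: falling back to the bare `${SCRIPT_BASE}.sh` when `SPEC_DECODING=mtp` would reintroduce the silent mislabeling described above if the MTP script were ever missing, so a stricter variant could refuse the final fallback whenever `SPEC_SUFFIX` is non-empty.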

    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH"
fi

start_gpu_monitor

# Launch args follow sglang issue #25672 comment 4485916205:
# tilelang NSA backends + fp8_e4m3 KV cache + multithread model load,
# plus EAGLE/MTP speculative decoding.
python3 -m sglang.launch_server \
    --model-path $MODEL \
    --host=0.0.0.0 \
    --port $PORT \
    --tensor-parallel-size $TP \
    --ep-size $EP_SIZE \
    --trust-remote-code \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --tokenizer-worker-num 6 \
    --cuda-graph-max-bs $CONC \
    --disable-radix-cache \
    --max-prefill-tokens $MAX_PREFILL_TOKENS \
    --scheduler-recv-interval 30 \
    --mem-fraction-static 0.80 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang \
    --kv-cache-dtype fp8_e4m3 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor
set +x
7 changes: 7 additions & 0 deletions perf-changelog.yaml
@@ -2971,3 +2971,10 @@
  description:
    - "Update Atom ROCm image (off: rocm7.1.1-...-atom0.1.1-MI350x 125d / mtp: rocm7.2.0-...-atom0.1.1 83d) to rocm7.2.3_..._atom20260511"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1518

- config-keys:
    - glm5-fp8-mi325x-sglang
    - glm5-fp8-mi325x-sglang-mtp
  description:
    - "Add GLM-5 FP8 SGLang ROCm recipes (off + MTP/EAGLE) for MI325X on lmsysorg/sglang:v0.5.12-rocm720-mi30x"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1485