Merged
18 changes: 18 additions & 0 deletions .github/configs/amd-master.yaml
@@ -301,6 +301,24 @@ glm5-fp8-mi355x-sglang:
      search-space:
        - { tp: 8, conc-start: 4, conc-end: 64 }

glm5-fp8-mi355x-sglang-mtp:
  image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413
  model: zai-org/GLM-5-FP8
  model-prefix: glm5
  runner: mi355x
  precision: fp8
  framework: sglang
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }

glm5-fp8-mi355x-atom:
  image: rocm/atom:rocm7.2.1-ubuntu24.04-pytorch2.9.1-atom0.1.2.post
  model: zai-org/GLM-5-FP8
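Each `search-space` entry fans out into one benchmark job per concurrency level. As a rough sketch only (the harness, the sweep rule, and the function name below are assumptions, not taken from this repo), the `glm5-fp8-mi355x-sglang-mtp` entry above could expand to:

```shell
#!/usr/bin/env bash
# Hypothetical expansion of the glm5-fp8-mi355x-sglang-mtp entry.
# Assumes concurrency doubles from conc-start (4) up to conc-end (64).
expand_mtp_jobs() {
  local tp=8 isl osl conc pair
  for pair in "1024 1024" "8192 1024"; do
    read -r isl osl <<< "$pair"
    for conc in 4 8 16 32 64; do
      echo "job: tp=$tp isl=$isl osl=$osl conc=$conc spec-decoding=mtp"
    done
  done
}
expand_mtp_jobs
```

Ten jobs total: five concurrency levels for each of the two ISL/OSL shapes.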
87 changes: 87 additions & 0 deletions benchmarks/single_node/glm5_fp8_mi355x_mtp.sh
@@ -0,0 +1,87 @@
#!/usr/bin/env bash

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

# GLM-5 requires transformers with glm_moe_dsa model type support.
# However, the Image rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219 doesn't provide this support.
python3 -m pip install -U --no-cache-dir \
    "git+https://github.com/huggingface/transformers.git@6ed9ee36f608fd145168377345bfc4a5de12e1e2"
Comment on lines +18 to +21
Contributor


🟡 The comment on lines 18-19 of glm5_fp8_mi355x_mtp.sh references image rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219, but the script actually runs against lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413 (as defined in the YAML config). Update the comment to reference the correct image so maintainers can accurately evaluate whether the pinned transformers pip install is still required.

Extended reasoning...

What the bug is: Lines 18–19 of the new benchmarks/single_node/glm5_fp8_mi355x_mtp.sh contain the comment:

# However, the Image rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219 doesn't provide this support.

This comment explains why the script unconditionally runs pip install -U … transformers@<commit>. The stated reason is that a specific image lacks GLM-5 glm_moe_dsa model type support.

The specific code path that triggers it: A developer reading the script to decide whether the workaround is still necessary will look at lines 18–21 to understand the rationale. The comment identifies rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219 as the deficient image.

Why existing code doesn't prevent it: The comment was copy-pasted from benchmarks/single_node/glm5_fp8_mi355x.sh (the non-MTP script), which itself already had a stale comment after PR #1023 upgraded the non-MTP image from that old rocm/sgl-dev image to lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413 without updating the explanatory comment. The MTP script was then created by copying the non-MTP script and inheriting the already-stale comment.

What the impact would be: The actual image used—as defined by glm5-fp8-mi355x-sglang-mtp in amd-master.yaml—is lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413. A maintainer reading the comment might:

  1. Assume the pip install workaround is still needed because of a limitation in the old v0.5.8 image, and not think to re-check it against the newer v0.5.10rc0 image.
  2. Conversely, if they do notice the discrepancy, they waste time tracing the history to figure out which image the comment was meant to describe.

The pip install runs regardless, so there is no functional impact—this is purely a documentation accuracy issue.

How to fix it: Update the comment to reference the image actually in use. For example:

# GLM-5 requires transformers with glm_moe_dsa model type support.
# However, the image lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413 doesn't provide this support.

The same fix should also be applied to the identical stale comment in glm5_fp8_mi355x.sh (the non-MTP script), which was introduced in PR #1023.

Step-by-step proof:

  1. Open .github/configs/amd-master.yaml and look at the glm5-fp8-mi355x-sglang-mtp key → image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413.
  2. Open benchmarks/single_node/glm5_fp8_mi355x_mtp.sh, line 19 → # However, the Image rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219 doesn't provide this support.
  3. The image name in the comment (rocm/sgl-dev:v0.5.8.post1-rocm720-mi35x-20260219) is entirely different from the image actually pulled (lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413) — different registry, different tag, different date.
  4. Check benchmarks/single_node/glm5_fp8_mi355x.sh line 19 → identical stale comment, confirming the MTP script was copy-pasted from there without updating the comment.


hf download "$MODEL"

# ROCm / SGLang performance tuning for MI355X
export SGLANG_ROCM_FUSED_DECODE_MLA=0
export ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export SAFETENSORS_FAST_GPU=1
export SGLANG_ENABLE_SPEC_V2=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --host=0.0.0.0 \
    --port "$PORT" \
    --tensor-parallel-size "$TP" \
    --trust-remote-code \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.85 \
    --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
    --nsa-prefill-backend tilelang \
    --nsa-decode-backend tilelang $EVAL_CONTEXT_ARGS \
    --kv-cache-dtype fp8_e4m3 \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    --disable-radix-cache > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
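The EAGLE settings above (`--speculative-num-steps 3`, `--speculative-num-draft-tokens 4`) trade extra draft compute for more tokens committed per target-model pass. A back-of-envelope way to reason about that trade, under the usual chain speculative-decoding model with an i.i.d. per-token acceptance rate (the 0.7 below is an illustrative assumption, not a measured GLM-5 number):

```shell
#!/usr/bin/env bash
# Expected tokens committed per verification step in chain speculative
# decoding with k draft tokens and i.i.d. per-token acceptance rate a:
#   E = 1 + a + a^2 + ... + a^k = (1 - a^(k+1)) / (1 - a)   for a < 1
expected_tokens() {
  awk -v a="$1" -v k="$2" 'BEGIN { printf "%.3f\n", (1 - a^(k+1)) / (1 - a) }'
}
expected_tokens 0.7 4   # 4 draft tokens at 70% acceptance -> ~2.773
expected_tokens 0.0 4   # nothing accepted -> 1.000 (target token only)
```

Even the a=0 floor still yields one token per step, since the target model always emits a token during verification; the benchmark's measured speedup then depends on how GLM-5's actual acceptance rate compares to the draft overhead.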
10 changes: 10 additions & 0 deletions perf-changelog.yaml
@@ -1576,3 +1576,13 @@
    - "Mirrors the qwen3.5-fp8-mi355x-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)"
    - "Configs: 1k1k (TP8/EP1, TP8/EP8, TP2/EP2) and 8k1k (TP2/EP2, TP4/EP1) with spec-decoding=mtp"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

- config-keys:
    - glm5-fp8-mi355x-sglang-mtp
  description:
    - "Add GLM-5 FP8 MI355X SGLang MTP benchmark"
    - "Image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm720-mi35x-20260413"
    - "Model: zai-org/GLM-5-FP8"
    - "Mirrors the glm5-fp8-mi355x-sglang non-MTP recipe and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) behind SGLANG_ENABLE_SPEC_V2=1"
    - "Configs: 1k1k and 8k1k, TP=8 conc 4-64 with spec-decoding=mtp"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX