Merged
19 changes: 19 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -3140,6 +3140,25 @@ glm5-fp8-h200-sglang:
      search-space:
      - { tp: 8, conc-start: 4, conc-end: 64 }

glm5-fp8-h200-sglang-mtp:
  image: lmsysorg/sglang:v0.5.12-cu130
  model: zai-org/GLM-5-FP8
  model-prefix: glm5
  runner: h200
  precision: fp8
  framework: sglang
  multinode: false
  scenarios:
    fixed-seq-len:
    - isl: 1024
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }
    - isl: 8192
      osl: 1024
      search-space:
      - { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp }

dsr1-fp8-h200-trt:
  image: nvcr.io#nvidia/tensorrt-llm/release:1.1.0rc2.post2
  model: deepseek-ai/DeepSeek-R1-0528
81 changes: 81 additions & 0 deletions benchmarks/single_node/glm5_fp8_h200_mtp.sh
@@ -0,0 +1,81 @@
#!/usr/bin/env bash

# GLM-5 FP8 on H200 (Hopper) with EAGLE / MTP speculative decoding.
# Mirrors glm5_fp8_h200.sh but adds the speculative-* flags. We keep the
# server-arg shape from the non-MTP H200 recipe (sglang defaults — no
# nsa/trtllm-mha) since those backends are Blackwell-specific and not
# applicable to Hopper.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

EVAL_CONTEXT_ARGS=""
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN"
fi

start_gpu_monitor

set -x
python3 -m sglang.launch_server \
    --model-path "$MODEL" \
    --host 0.0.0.0 \
    --port "$PORT" \
    --tp-size "$TP" \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --mem-fraction-static 0.85 \
    --served-model-name glm-5-fp8 \
Comment on lines +44 to +48
🔴 The new MTP launch script is missing SGLANG_ENABLE_SPEC_V2=1. Every other SGLang MTP recipe in this repo sets it, including the closest sibling qwen3.5_fp8_h200_mtp.sh (same H200/SGLang/EAGLE) and all glm5 MTP siblings (b200/b300/fp4/mi355x). Without it, the server likely runs the --speculative-* configuration on the legacy spec-decoding path, undermining the purpose of the recipe. Fix: add export SGLANG_ENABLE_SPEC_V2=1 near the other env setup, or inline it before the python3 -m sglang.launch_server invocation, matching qwen3.5_fp8_h200_mtp.sh:38.

Extended reasoning...

What is missing

benchmarks/single_node/glm5_fp8_h200_mtp.sh adds the four EAGLE speculative-decoding flags (--speculative-algorithm EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4) but never enables SGLang's spec-v2 scheduler via the SGLANG_ENABLE_SPEC_V2=1 environment variable. The PR description notes the script mirrors glm5_fp8_h200.sh (the non-MTP recipe, which correctly has no spec env var) and then bolts on the EAGLE flags — but the env var that gates SGLang's optimized spec-decoding path was not bolted on alongside them.

Why this matters

Every other SGLang MTP recipe in the repo sets SGLANG_ENABLE_SPEC_V2=1 — either exported (glm5_fp8_b200_mtp.sh:25, glm5_fp8_b300_mtp.sh:29, glm5_fp4_b200_mtp.sh:25, glm5_fp4_b300_mtp.sh:29, glm5_fp8_mi355x_mtp.sh:25) or as a command prefix (qwen3.5_fp8_h200_mtp.sh:38, qwen3.5_fp4_b200_mtp.sh:36, qwen3.5_fp8_b200_mtp.sh:36, qwen3.5_fp8_b300_mtp.sh:34, dsr1_fp8_b200_mtp.sh:57, dsr1_fp8_b300_mtp.sh:61). The new glm5_fp8_h200_mtp.sh is the lone outlier.

The closest direct sibling is qwen3.5_fp8_h200_mtp.sh — same hardware (H200), same framework (SGLang), same EAGLE flag set — and it launches the server with SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server …. The new recipe omits this and uses bare python3 -m sglang.launch_server.

perf-changelog.yaml history reinforces that this is a deliberate, required toggle for SGLang spec-decoding recipes. PR #1017 was titled "Enable SGLANG_ENABLE_SPEC_V2=1 for Qwen3.5 FP8 H200 SGLang MTP" (line 1371). The five existing GLM5 MTP recipes are each documented as adding EAGLE "behind SGLANG_ENABLE_SPEC_V2=1" (lines 1623, 1633, 1643, 1653, 1663). Line 2185 documents aligning B200 with B300 by setting SGLANG_ENABLE_SPEC_V2=1, and line 2219 describes adding MTP flags together with SGLANG_ENABLE_SPEC_V2=1 as a unit.

Impact

Without SGLANG_ENABLE_SPEC_V2=1, the EAGLE config will either run through SGLang's legacy speculative-decoding scheduler (slower) or initialize sub-optimally — silently defeating the performance purpose of the MTP recipe. The sweep would still execute and post numbers, but they would not reflect what an H200 GLM-5 MTP recipe is supposed to measure.
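
Because this failure mode is silent, a recipe could refuse to launch when the toggle is absent. The following is a minimal sketch only: `require_spec_v2` is a hypothetical helper name and is not shown to exist in benchmark_lib.sh in this diff.

```bash
#!/usr/bin/env bash

# Hypothetical guard (not part of benchmark_lib.sh as far as this diff shows).
# Returns non-zero unless SGLANG_ENABLE_SPEC_V2 is exactly "1" in the
# current shell environment.
require_spec_v2() {
    if [[ "${SGLANG_ENABLE_SPEC_V2:-}" != "1" ]]; then
        echo "SGLANG_ENABLE_SPEC_V2=1 is not set; refusing to launch MTP recipe" >&2
        return 1
    fi
    return 0
}
```

A recipe that exports the variable could call `require_spec_v2 || exit 1` just before launching the server. The guard inspects the script's own environment, so it only suits the `export` style; a recipe that sets the variable as a command prefix would never pass it.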

How to fix

Add the env var alongside the other setup. Either:

export SGLANG_ENABLE_SPEC_V2=1

near the top of the script (matching the glm5 b200/b300/mi355x style), or inline it before the launch command (matching qwen3.5_fp8_h200_mtp.sh:38):

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server …
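
The two styles differ in scope, which may matter because the recipe starts the benchmark client after the server. The sketch below illustrates standard shell behavior only; `show_flag` is a hypothetical helper and `bash -c` stands in for any child process such as the server or the client.

```bash
#!/usr/bin/env bash

# Hypothetical helper: report the variable as a child process sees it.
show_flag() {
    bash -c 'echo "${SGLANG_ENABLE_SPEC_V2:-unset}"'
}

unset SGLANG_ENABLE_SPEC_V2

# Inline prefix: the variable exists only in that one command's environment.
inline_seen=$(SGLANG_ENABLE_SPEC_V2=1 bash -c 'echo "${SGLANG_ENABLE_SPEC_V2:-unset}"')
after_inline=$(show_flag)

# export: every child process started afterwards inherits the variable.
export SGLANG_ENABLE_SPEC_V2=1
after_export=$(show_flag)

echo "inline=$inline_seen after_inline=$after_inline after_export=$after_export"
# prints: inline=1 after_inline=unset after_export=1
```

With the inline form only the server process sees the toggle; with `export`, later commands in the script inherit it as well. Either satisfies the request above, so the choice is mostly about matching the sibling recipe's style.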

Step-by-step proof

  1. The recipe is invoked by the harness; lines 1–43 of glm5_fp8_h200_mtp.sh set up env-var checks, monitor, and EVAL_CONTEXT_ARGS. No environment variable named SGLANG_ENABLE_SPEC_V2 is exported anywhere in the file (the diff shows the full file; grep confirms 0 hits).
  2. Line 44 begins python3 -m sglang.launch_server — not SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server as in qwen3.5_fp8_h200_mtp.sh:38.
  3. SGLang reads SGLANG_ENABLE_SPEC_V2 from the process environment at server startup; with the variable unset, the speculative-decoding stack is expected to fall back to its v1/legacy path.
  4. The --speculative-algorithm EAGLE … flags are still parsed and applied, but they run on the legacy scheduler — which is precisely what every other MTP recipe in the repo, and the perf-changelog history, deliberately avoids.
  5. Result: the recipe ships claiming to benchmark GLM-5 FP8 H200 with MTP, but is actually measuring GLM-5 FP8 H200 with EAGLE on the slower legacy spec path. The numbers published from this sweep will not match the MTP recipe's intent.
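
The grep check from step 1 can be generalized into a small audit over a recipe directory. This is a sketch under assumptions: `find_mtp_recipes_missing_spec_v2` is a hypothetical helper, and it relies on the `*_mtp.sh` naming and the `sglang.launch_server` invocation seen in the recipes named above to identify SGLang MTP scripts.

```bash
#!/usr/bin/env bash

# Hypothetical audit helper: print the names of SGLang MTP recipes in a
# directory that never mention SGLANG_ENABLE_SPEC_V2=1.
find_mtp_recipes_missing_spec_v2() {
    local dir="$1"
    local f
    for f in "$dir"/*_mtp.sh; do
        [[ -e "$f" ]] || continue                           # glob matched nothing
        grep -qF 'sglang.launch_server' "$f" || continue    # skip non-SGLang recipes
        grep -qF 'SGLANG_ENABLE_SPEC_V2=1' "$f" || basename "$f"
    done
}
```

Running `find_mtp_recipes_missing_spec_v2 benchmarks/single_node` on this branch should list any recipe that lacks the toggle. The match is textual, so a commented-out occurrence would still count as present.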

    --trust-remote-code \
    --speculative-algorithm EAGLE \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4 \
    $EVAL_CONTEXT_ARGS > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts $(( CONC * 10 )) \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code \
    --use-chat-template

if [ "${RUN_EVAL}" = "true" ]; then
    export MODEL_NAME=glm-5-fp8
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor
set +x
6 changes: 6 additions & 0 deletions perf-changelog.yaml
@@ -2717,3 +2717,9 @@
  description:
    - "Update SGLang image from v0.5.10-rocm720-mi30x to v0.5.12-rocm720-mi30x"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1426

- config-keys:
    - glm5-fp8-h200-sglang-mtp
  description:
    - "Add MTP/EAGLE speculative-decoding sibling for glm5-fp8-h200-sglang on lmsysorg/sglang:v0.5.12-cu130"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1480