[Klaud Cold] Add glm5-fp8-mi325x-sglang (off + mtp) recipes #1485
Changes from all commits
4be466c
de51e35
8b81e6a
0fa4574
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,79 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi | ||
|
|
||
| SERVER_LOG=/workspace/server.log | ||
| PORT=${PORT:-8888} | ||
| CONTEXT_LENGTH=$((ISL + OSL + 20)) | ||
| MAX_PREFILL_TOKENS=32768 | ||
|
|
||
| EVAL_CONTEXT_ARGS="" | ||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN" | ||
| else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH" | ||
| fi | ||
|
|
||
| start_gpu_monitor | ||
|
|
||
| # Launch args follow sglang issue #25672 comment 4485916205: | ||
| # tilelang NSA backends + fp8_e4m3 KV cache + multithread model load. | ||
| python3 -m sglang.launch_server \ | ||
| --model-path $MODEL \ | ||
| --host=0.0.0.0 \ | ||
| --port $PORT \ | ||
| --tensor-parallel-size $TP \ | ||
| --data-parallel-size 1 \ | ||
| --trust-remote-code \ | ||
| --tool-call-parser glm47 \ | ||
| --reasoning-parser glm45 \ | ||
| --tokenizer-worker-num 6 \ | ||
| --cuda-graph-max-bs $CONC \ | ||
| --disable-radix-cache \ | ||
| --max-prefill-tokens $MAX_PREFILL_TOKENS \ | ||
| --scheduler-recv-interval 30 \ | ||
| --mem-fraction-static 0.80 \ | ||
| --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \ | ||
| --nsa-prefill-backend tilelang \ | ||
| --nsa-decode-backend tilelang \ | ||
| --kv-cache-dtype fp8_e4m3 \ | ||
| $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,89 @@ | ||
| #!/usr/bin/env bash | ||
|
|
||
| # GLM-5 FP8 on MI325X with EAGLE / MTP speculative decoding. | ||
| # Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags. | ||
|
|
||
| source "$(dirname "$0")/../benchmark_lib.sh" | ||
|
|
||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME \ | ||
| EP_SIZE | ||
|
|
||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
|
|
||
| if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi | ||
|
|
||
| SERVER_LOG=/workspace/server.log | ||
| PORT=${PORT:-8888} | ||
| CONTEXT_LENGTH=$((ISL + OSL + 20)) | ||
| MAX_PREFILL_TOKENS=32768 | ||
|
|
||
| EVAL_CONTEXT_ARGS="" | ||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
Comment on lines +1 to +30 (Contributor):
🔴 The new […]
Extended reasoning...
The bug. The MTP launch script added in this PR ([…]
Why. `bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_mi325x.sh` There is no […]
Step-by-step proof. […]
Impact on the MTP sweep. Because […]
How to fix. Two options, either is fine: […]
| setup_eval_context | ||
| EVAL_CONTEXT_ARGS="--context-length $EVAL_MAX_MODEL_LEN" | ||
| else EVAL_CONTEXT_ARGS="--context-length $CONTEXT_LENGTH" | ||
| fi | ||
|
|
||
| start_gpu_monitor | ||
|
|
||
| # Launch args follow sglang issue #25672 comment 4485916205: | ||
| # tilelang NSA backends + fp8_e4m3 KV cache + multithread model load, | ||
| # plus EAGLE/MTP speculative decoding. | ||
| python3 -m sglang.launch_server \ | ||
| --model-path $MODEL \ | ||
| --host=0.0.0.0 \ | ||
| --port $PORT \ | ||
| --tensor-parallel-size $TP \ | ||
| --ep-size $EP_SIZE \ | ||
| --trust-remote-code \ | ||
| --tool-call-parser glm47 \ | ||
| --reasoning-parser glm45 \ | ||
| --tokenizer-worker-num 6 \ | ||
| --cuda-graph-max-bs $CONC \ | ||
| --disable-radix-cache \ | ||
| --max-prefill-tokens $MAX_PREFILL_TOKENS \ | ||
| --scheduler-recv-interval 30 \ | ||
| --mem-fraction-static 0.80 \ | ||
| --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \ | ||
| --nsa-prefill-backend tilelang \ | ||
| --nsa-decode-backend tilelang \ | ||
| --kv-cache-dtype fp8_e4m3 \ | ||
| --speculative-algorithm EAGLE \ | ||
| --speculative-num-steps 3 \ | ||
| --speculative-eagle-topk 1 \ | ||
| --speculative-num-draft-tokens 4 \ | ||
| $EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 & | ||
|
|
||
| SERVER_PID=$! | ||
|
|
||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
|
|
||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ \ | ||
| --use-chat-template | ||
|
|
||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
|
|
||
| stop_gpu_monitor | ||
| set +x | ||
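The dispatch command quoted in the inline review comment on this file builds the script path with `${EXP_NAME%%_*}`, which keeps only the text before the first underscore. The sketch below shows how that expansion behaves and one way a path could gain an `_mtp` suffix. It is a hedged illustration only: `resolve_recipe`, its `spec` argument, and the sample experiment name are hypothetical, `SCENARIO_SUBDIR` is left out for brevity, and the reviewer's own two fix options are not preserved on this page.

```shell
#!/usr/bin/env bash
# Sketch: how ${EXP_NAME%%_*} shapes the dispatched script path.
# resolve_recipe and its "spec" argument are hypothetical, for illustration only.
resolve_recipe() {
  local exp_name="$1" precision="$2" spec="${3:-none}" suffix=""
  # %%_* removes the longest suffix matching "_*",
  # i.e. everything from the first underscore onward.
  local model_prefix="${exp_name%%_*}"
  # Assumed convention: MTP recipes carry an "_mtp" suffix before ".sh".
  if [[ "$spec" == "mtp" ]]; then
    suffix="_mtp"
  fi
  printf 'benchmarks/single_node/%s_%s_mi325x%s.sh\n' \
    "$model_prefix" "$precision" "$suffix"
}
```

With a made-up name such as `glm5_1k1k`, `resolve_recipe glm5_1k1k fp8` yields the non-MTP path, and passing `mtp` as the third argument yields the `_mtp.sh` variant.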
🔴 The new `glm5_fp8_mi325x_mtp.sh` script enables EAGLE speculative decoding but does NOT `export SGLANG_ENABLE_SPEC_V2=1`, which every other GLM-5 MTP recipe in the codebase explicitly sets (b200, b300, fp4-b200, fp4-b300, and the closest ROCm sibling `glm5_fp8_mi355x_mtp.sh:25`). Without it, sglang on v0.5.12 falls back to the legacy spec-decoding implementation, which would give degraded/incorrect MTP numbers relative to the rest of the GLM-5 fleet. Fix: add `export SGLANG_ENABLE_SPEC_V2=1` alongside the other top-of-script setup, mirroring `glm5_fp8_mi355x_mtp.sh:25`.
Extended reasoning...
What the bug is
`benchmarks/single_node/glm5_fp8_mi325x_mtp.sh` (lines 1–30 of the new file) launches sglang with full EAGLE / MTP speculative-decoding flags (the `--speculative-*` options in the launch command), but the script never exports `SGLANG_ENABLE_SPEC_V2=1`. This env var is the documented opt-in for sglang's V2 spec-decoding path on v0.5.x, the path the rest of the GLM-5 fleet uses.
Why this matters / the GLM-5 family pattern
A grep across `benchmarks/single_node/` shows that every GLM-5 MTP recipe sets this var, on both NVIDIA and AMD:
- `glm5_fp8_b200_mtp.sh:25`
- `glm5_fp8_b300_mtp.sh:29`
- `glm5_fp4_b200_mtp.sh:25`
- `glm5_fp4_b300_mtp.sh:29`
- `glm5_fp8_mi355x_mtp.sh:25` ← closest sibling (same model, sglang framework, MTP/EAGLE, ROCm)
The new `glm5_fp8_mi325x_mtp.sh` is the only GLM-5 MTP recipe missing it. The script's own header says it "Mirrors glm5_fp8_mi325x.sh and adds the speculative-* flags": the mirroring is from the non-MTP `mi325x` and `qwen3.5-mi325x` lineage rather than from the GLM-5 MTP family, which is the likely source of the oversight. (The two qwen3.5 mi355x mtp scripts also lack the var, but they run sglang v0.5.10rc0 on a different model family; this is a GLM-5-specific knob tied to GLM-5's draft-model layout on v0.5.12.)
Step-by-step proof
- […] `lmsysorg/sglang:v0.5.12-rocm720-mi30x`. This is the same v0.5.12 era as the b200/b300 GLM-5 MTP recipes that gate spec-V2 behind this env var.
- `glm5_fp8_mi355x_mtp.sh` (same model, same framework, same MTP feature, ROCm) does `export SGLANG_ENABLE_SPEC_V2=1` near the top of the script (line 25), right alongside other ROCm sglang env exports.
- `glm5_fp8_mi325x_mtp.sh` defines no `SGLANG_*` exports at all; the only top-of-script setup is `check_env_vars` and the HF download branch.
Why existing code doesn't prevent it
There is no central place that injects `SGLANG_ENABLE_SPEC_V2=1`; each recipe sets it directly in its launch script. The env var is unset by default in v0.5.12. `benchmark_lib.sh` does not export it. So a script that omits the export simply runs the legacy path.
Coupling with bug_001 (dispatch)
This bug is latent until the dispatcher actually invokes `glm5_fp8_mi325x_mtp.sh` (the related dispatch issue). However: […]
Fix
Add the export near the top of `benchmarks/single_node/glm5_fp8_mi325x_mtp.sh`, mirroring `glm5_fp8_mi355x_mtp.sh:25`: `export SGLANG_ENABLE_SPEC_V2=1`
Placement: after the `hf download` line, before `SERVER_LOG=...` (i.e. immediately before line 22 of the new file).
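Since the review notes that no central place injects the variable, a small audit helper could catch the same omission in future recipes. The following is a minimal sketch, not part of this PR: the function name `find_recipes_missing_spec_v2` and the `glm5_*_mtp.sh` glob are assumptions chosen to match the file names listed above.

```shell
#!/usr/bin/env bash
# Sketch: print the GLM-5 MTP recipes in a directory that never export
# SGLANG_ENABLE_SPEC_V2=1. The function name and the glob are illustrative
# assumptions, not existing repository tooling.
find_recipes_missing_spec_v2() {
  local dir="$1" script
  for script in "$dir"/glm5_*_mtp.sh; do
    # Skip the literal pattern when the glob matches nothing.
    [[ -e "$script" ]] || continue
    # -q: no output, exit status only; -E: extended regex.
    # Only an uncommented "export SGLANG_ENABLE_SPEC_V2=1" line counts.
    if ! grep -qE '^[[:space:]]*export[[:space:]]+SGLANG_ENABLE_SPEC_V2=1([[:space:]]|$)' "$script"; then
      printf '%s\n' "${script##*/}"
    fi
  done
}
```

Running `find_recipes_missing_spec_v2 benchmarks/single_node` from the repository root would list any matching recipe without the export; per the review above, that should be only `glm5_fp8_mi325x_mtp.sh` until the fix lands.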