[Klaud Cold] Add qwen3.5-fp8-h100-sglang (off + mtp) recipes#1509
H100 was missing all qwen3.5 sglang coverage. Adds FP8 on lmsysorg/sglang:v0.5.12-cu130. TP=8, EP=8, conc 4..32, 1k1k + 8k1k.
BF16 intentionally skipped: Qwen3.5-397B-A17B BF16 doesn't fit in H100's 80GB HBM3 at TP=8 (~100GB/GPU just for weights).
Launch scripts mirror qwen3.5_fp8_h200.sh but with tighter memory accommodations for H100 (80GB vs H200's 141GB):
- mem-fraction-static 0.80 → 0.75
- chunked-prefill-size 16384 → 8192
- max-running-requests 128 → 64
- sweep conc cap 64 → 32
MTP variant adds SGLANG_ENABLE_SPEC_V2=1, the standard EAGLE knobs (num-steps 3, eagle-topk 1, num-draft-tokens 4), and --use-chat-template on the bench client per AGENTS.md.
If the conservative settings leave throughput on the table once the first sweep lands, we can iterate mem-fraction / chunked-prefill up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
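The BF16 sizing claim in the description can be checked with a back-of-envelope calculation. The sketch below assumes 397B total parameters (read off the model name), weights sharded evenly across the TP ranks, and nothing else resident in memory; the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Weights-only footprint per GPU in GB, using integer division.
# Assumptions: 1e9 params * N bytes ~= N GB; even sharding across TP ranks;
# KV cache, activations, and runtime overhead are ignored.
weights_gb_per_gpu() {
    local params_billion="$1" bytes_per_param="$2" tp="$3"
    echo $(( params_billion * bytes_per_param / tp ))
}

weights_gb_per_gpu 397 2 8   # BF16: prints 99, above the 80 GB of an H100
weights_gb_per_gpu 397 1 8   # FP8:  prints 49
```

Under these assumptions FP8 leaves roughly 30 GB per GPU for KV cache and runtime state, which is consistent with the conservative memory settings chosen here.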
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR there first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26053484112
#!/usr/bin/env bash

# Qwen-3.5-397B-A17B FP8 on H100 with EAGLE / MTP speculative decoding.
# Mirrors qwen3.5_fp8_h100.sh; adds the speculative-* flags + SGLANG_ENABLE_SPEC_V2=1
# and passes --use-chat-template per AGENTS.md.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

export SGLANG_ENABLE_SPEC_V2=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}
🔴 The new qwen3.5_fp8_h100_mtp.sh will never be invoked: all three H100 launchers (runners/launch_h100-cw.sh:34, runners/launch_h100-dgxc-slurm.sh:301, runners/launch_h100-cr.sh:18) build the bench script path as benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_h100.sh with no FRAMEWORK_SUFFIX/SPEC_SUFFIX appended, unlike their H200/B200/B300 peers. So qwen3.5-fp8-h100-sglang-mtp will dispatch to the OFF script qwen3.5_fp8_h100.sh, none of the MTP server flags (SGLANG_ENABLE_SPEC_V2=1, EAGLE, --use-chat-template) will run, yet benchmark-tmpl.yml:180 still bakes spec-mtp into RESULT_FILENAME — so OFF numbers get filed as MTP in the changelog. The H100 launchers need the same FRAMEWORK_SUFFIX/SPEC_SUFFIX handling as the H200/B200/B300 launchers before this PR can produce valid MTP numbers.
Extended reasoning...
The bug
All three H100 launchers construct the bench script path without a framework or spec-decoding suffix:
- `runners/launch_h100-cw.sh:34` → `bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_h100.sh`
- `runners/launch_h100-dgxc-slurm.sh:301` → identical pattern (single-node else branch)
- `runners/launch_h100-cr.sh:18` → identical pattern
Compare `runners/launch_h200-cw.sh:6-8,47`, which is the obvious template the new H100 scripts were modelled on:
MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
bash benchmarks/single_node/${SCENARIO_SUBDIR}${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
The same pattern appears in `launch_h200-dgxc-slurm.sh:300-305`, `launch_h200-nb.sh:7-8,22`, `launch_b200-cw.sh`, `launch_b300-nv.sh:294-303`, `launch_mi355x-amds.sh`, etc. The H100 launchers are the only family that omits this logic.
Why it didn't bite until now
Prior H100 recipes (`gptoss-fp4-h100-vllm`, `minimaxm2.5-fp8-h100-vllm`, `kimik2.5-int4-h100-vllm`) were all vLLM-only with spec=none, so the path always resolved to the only file that existed. This PR introduces the first MTP-bearing recipe on the H100 runner family, which is why the gap surfaces here.
Step-by-step proof
- PR adds matrix entry `qwen3.5-fp8-h100-sglang-mtp` with `spec-decoding: mtp` and `model-prefix: qwen3.5` (`.github/configs/nvidia-master.yaml`).
- `utils/matrix_logic/generate_sweep_configs.py:290` builds `EXP_NAME = f"{model_code}_{seq_len_str}"`, so for this recipe `EXP_NAME` is e.g. `qwen3.5_1k1k` and `${EXP_NAME%%_*}` → `qwen3.5`.
- `benchmark-tmpl.yml:180` sets `RESULT_FILENAME=..._spec-${SPEC_DECODING}_...` → embeds `_spec-mtp_`.
- The job dispatches to an H100 runner; `benchmark-tmpl.yml:188` invokes `runners/launch_h100-*.sh`.
- The launcher (e.g. `launch_h100-cw.sh:34`) runs `bash benchmarks/single_node/qwen3.5_fp8_h100.sh`, i.e. the OFF recipe added in this same PR, not `qwen3.5_fp8_h100_mtp.sh`.
- None of the MTP-specific flags fire: `SGLANG_ENABLE_SPEC_V2=1`, `--speculative-algorithm EAGLE`, `--speculative-num-steps 3`, `--speculative-eagle-topk 1`, `--speculative-num-draft-tokens 4`, `--use-chat-template`.
- The non-MTP throughput numbers are written to a file tagged `_spec-mtp_` and `perf-changelog.yaml` lists this recipe as MTP, so the dashboard now shows the OFF numbers under the MTP label.
Impact
- `qwen3.5_fp8_h100_mtp.sh` is unreachable, so the stated test plan ("full-sweep-enabled sweep finishes green for both off + mtp matrices") cannot validate MTP behavior.
- Worse than a silent no-op: the resulting data is mislabeled. The OFF and MTP entries in `perf-changelog.yaml` will both reflect OFF runs, but the MTP one gets posted as a speculative-decoding result.
Fix
Mirror the H200 pattern in launch_h100-cw.sh, launch_h100-dgxc-slurm.sh, and launch_h100-cr.sh: compute FRAMEWORK_SUFFIX from $FRAMEWORK and SPEC_SUFFIX from $SPEC_DECODING, and append ${FRAMEWORK_SUFFIX}${SPEC_SUFFIX} to the bench script path. The H100 launchers are the only blocker — the OFF and MTP scripts in this PR are correct, they just need to be reachable.
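A minimal sketch of what that path construction could look like for H100, mirroring the H200 lines quoted above. The function name `bench_script_path` is hypothetical and used only so the logic can be exercised in isolation; the real launchers inline this and then call `bash` on the result.

```bash
#!/usr/bin/env bash
# Sketch only: mirrors the suffix handling quoted from launch_h200-cw.sh.
# bench_script_path is a hypothetical helper, not a function in the repo.
bench_script_path() {
    local model_code="${EXP_NAME%%_*}"      # e.g. qwen3.5_1k1k -> qwen3.5
    local framework_suffix="" spec_suffix=""
    if [[ "${FRAMEWORK:-}" == "trt" ]]; then framework_suffix="_trt"; fi
    if [[ "${SPEC_DECODING:-}" == "mtp" ]]; then spec_suffix="_mtp"; fi
    printf '%s\n' "benchmarks/single_node/${SCENARIO_SUBDIR:-}${model_code}_${PRECISION}_h100${framework_suffix}${spec_suffix}.sh"
}

EXP_NAME=qwen3.5_1k1k PRECISION=fp8 FRAMEWORK=sglang SPEC_DECODING=mtp bench_script_path
# prints benchmarks/single_node/qwen3.5_fp8_h100_mtp.sh
```

With `SPEC_DECODING` set to anything other than `mtp`, the same helper resolves to the OFF script, so existing H100 recipes keep their current paths.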
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor
🟡 Nit — for consistency with the H200 MTP sibling (qwen3.5_fp8_h200_mtp.sh:88), consider adding export EVAL_CONCURRENT_REQUESTS=${EVAL_CONCURRENT_REQUESTS:-$CONC} before run_eval. Otherwise eval falls back to the default of 64 (benchmark_lib.sh:686), which is the only MTP-flavored recipe that diverges from this precedent. Not blocking — the server caps at --max-running-requests 64 so there is no over-subscription, just an unexplained intra-family inconsistency.
Extended reasoning...
What this is
benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh:87-91 explicitly overrides the lm-eval concurrency before running eval:
if [ "${RUN_EVAL}" = "true" ]; then
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONCURRENT_REQUESTS:-$CONC}"
run_eval --framework lm-eval --port "$PORT"
...
The new qwen3.5_fp8_h100_mtp.sh does not, so it falls back to the default in benchmark_lib.sh:686 (local concurrent_requests=${EVAL_CONCURRENT_REQUESTS:-64}).
Why the H200 MTP sibling chose to set it
There is no commit message or code comment explaining the H200 MTP override, so this is partly speculative. The plausible reason is: MTP runs add draft-model overhead on top of the verifier, and the eval phase issues an unbounded burst from the lm-eval harness — capping at $CONC matches the steady-state load the server was warmed up for (--cuda-graph-max-bs $CONC).
The refutation's strongest points, and why I still think this is worth a one-line fix
The refutation argues:
- The server is provisioned for `--max-running-requests 64` at startup, so EVAL_CONCURRENT_REQUESTS=64 matches by construction. This is correct: the KV-cache pool reservation comes from `--mem-fraction-static 0.75` at server launch, not from eval-time concurrency. No OOM. Agreed.
- EVAL_CONCURRENT_REQUESTS is set in exactly one script across the directory, so the H200 MTP is an outlier, not a convention. Also correct as a count. But it is set in the one other MTP-flavored sglang recipe in this model family, which is the nearest sibling to the new script.
- The PR description says "Mirrors qwen3.5_fp8_h100.sh". True, and the H100 OFF script in this PR also lacks the override, consistent with that intent. The choice not to mirror the H200 MTP precedent looks deliberate.
So the refutation is right that this is not a correctness or perf bug. But for the MTP variant specifically, the H200 MTP author judged this knob worth setting, and the new MTP recipe runs at tighter memory (--mem-fraction-static 0.75 vs 0.8) and smaller sweep concurrencies (conc-start: 4) than the H200 MTP did. The cost of mirroring is one line; the cost of diverging is that the next person reading the two MTP scripts side by side has to figure out why.
Step-by-step
- Sweep launches H100 MTP recipe with `CONC=4`, `RUN_EVAL=true`.
- Server starts with `--cuda-graph-max-bs 4`, `--max-running-requests 64`, `--mem-fraction-static 0.75`, `--disable-radix-cache`.
- Benchmark phase runs at `--max-concurrency 4` (fine).
- Eval phase calls `run_eval --framework lm-eval --port $PORT`.
- `benchmark_lib.sh:686` reads `EVAL_CONCURRENT_REQUESTS`; finds it unset; uses default 64.
- Eval bursts up to 64 concurrent requests. Server queues them up to its 64 cap and serves them with cuda graphs falling back to eager for batch sizes >4. No crash, no number contamination, just behaviour that diverges from H200 MTP for no recorded reason.
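The fallback in the steps above comes down to two shell parameter expansions. The sketch below reproduces only the expansion quoted from `benchmark_lib.sh:686`; the function name `eval_concurrency` is hypothetical and stands in for the relevant line of `run_eval`.

```bash
#!/usr/bin/env bash
# Sketch only: eval_concurrency is a stand-in for the fallback line quoted
# from benchmark_lib.sh:686, not the real run_eval.
eval_concurrency() {
    local concurrent_requests=${EVAL_CONCURRENT_REQUESTS:-64}
    printf '%s\n' "$concurrent_requests"
}

# Without the override (H100 MTP script as submitted), CONC is ignored.
( unset EVAL_CONCURRENT_REQUESTS; CONC=4; eval_concurrency )   # prints 64

# With the H200 MTP override, eval follows the sweep concurrency unless the
# caller has already exported EVAL_CONCURRENT_REQUESTS.
( unset EVAL_CONCURRENT_REQUESTS; CONC=4
  export EVAL_CONCURRENT_REQUESTS="${EVAL_CONCURRENT_REQUESTS:-$CONC}"
  eval_concurrency )                                            # prints 4
```

Because both expansions use `:-`, a value exported by the caller still takes precedence over `$CONC` and over the default of 64.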
How to fix
Add one line before run_eval in qwen3.5_fp8_h100_mtp.sh, matching qwen3.5_fp8_h200_mtp.sh:88:
if [ "${RUN_EVAL}" = "true" ]; then
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONCURRENT_REQUESTS:-$CONC}"
run_eval --framework lm-eval --port "$PORT"
append_lm_eval_summary
fi
Optional and not required to land this PR.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26053502947
…weep race (#1510)
The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE` without any locking, so when multiple sweep jobs landed on the cluster simultaneously they all tried to import the same image into the shared NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. The first one wins; the rest crash with `[ERROR] File already exists: ...sqsh` and `OSError: [Errno 116] Stale file handle` (from the partial sqsh) once sglang/vllm tries to start.
Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs failed, all hitting the same race on h100-dgxc-slurm_0 + _1. The failure rate scales with sweep concurrency; it was masked previously because older H100 recipes had fewer matrix points sharing the cluster.
Switches to the canonical `flock -w 600 + unsquashfs -l skip-if-valid + enroot import` pattern already used in launch_h100-cw.sh, plus the mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior change.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
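The lock-then-validate pattern named in that commit message can be sketched roughly as follows. This is not the launcher's actual code: the function names, the `.lock` file location, and the `docker://` image URI form are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the lock-file path are illustrative assumptions.

# Succeeds when the squashfs file exists and its listing can be read.
validate_image() {
    [[ -f "$1" ]] && unsquashfs -l "$1" >/dev/null 2>&1
}

# Imports the container image into the given squashfs path
# (the docker:// URI form is an assumption about how the image is named).
import_image() {
    enroot import -o "$1" "docker://$2"
}

# Serialize importers on a lock file so only the first one does the import.
import_once() {
    local squash_file="$1" image="$2"
    local lock_file="${squash_file}.lock"
    (
        flock -w 600 9 || exit 1      # wait up to 600 s for the lock on fd 9
        if validate_image "$squash_file"; then
            exit 0                    # an earlier job already produced a valid image
        fi
        rm -f "$squash_file"          # drop any partial file from a failed import
        import_image "$squash_file" "$image"
    ) 9>"$lock_file"
}
```

Jobs that arrive while an import is in flight block on the lock, then find a valid image and skip the import, which avoids both the "File already exists" error and reads of a partially written file.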
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26055185038
/reuse-sweep-run |
Summary
H100 had no qwen3.5 sglang recipes at all. Adds FP8 (off + MTP) on `lmsysorg/sglang:v0.5.12-cu130`. BF16 intentionally skipped: Qwen3.5-397B-A17B BF16 doesn't fit in H100's 80GB HBM3 at TP=8 (~100GB/GPU just for weights).
Recipes
- `qwen3.5-fp8-h100-sglang`
- `qwen3.5-fp8-h100-sglang-mtp`
TP=8, EP=8, conc 4..32, 1k1k + 8k1k.
Launch scripts
Mirror `qwen3.5_fp8_h200.sh` but with tighter memory accommodations for H100 (80GB vs H200's 141GB):
- `--mem-fraction-static`: 0.80 → 0.75
- `--chunked-prefill-size`: 16384 → 8192
- `--max-running-requests`: 128 → 64
MTP variant adds `SGLANG_ENABLE_SPEC_V2=1`, the standard EAGLE knobs, and `--use-chat-template`.
If the conservative settings leave throughput on the table once the first sweep lands, we can iterate upward.
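The settings described above could be assembled along these lines. The PR's launch scripts are not reproduced in full here, so this is a sketch under assumptions: the helper name `build_server_args` is hypothetical, and `--model-path`, `--tp`, `--ep-size`, and `--port` are assumed to be how the scripts pass those values. The remaining flags and values are the ones named in the description and review.

```bash
#!/usr/bin/env bash
# Sketch only: build_server_args is a hypothetical helper, not part of the PR.
build_server_args() {
    local spec="${1:-none}"   # "none" for the off recipe, "mtp" for the MTP recipe
    SERVER_ARGS=(
        --model-path "$MODEL"
        --tp "$TP"
        --ep-size "$EP_SIZE"
        --port "${PORT:-8888}"
        --mem-fraction-static 0.75      # H200 script: 0.80
        --chunked-prefill-size 8192     # H200 script: 16384
        --max-running-requests 64       # H200 script: 128
    )
    if [[ "$spec" == "mtp" ]]; then
        export SGLANG_ENABLE_SPEC_V2=1
        SERVER_ARGS+=(
            --speculative-algorithm EAGLE
            --speculative-num-steps 3
            --speculative-eagle-topk 1
            --speculative-num-draft-tokens 4
        )
    fi
}
```

A launch script would then call something like `build_server_args mtp` followed by `python3 -m sglang.launch_server "${SERVER_ARGS[@]}"`; keeping the memory knobs in one array makes it easier to raise them later if the first sweep shows headroom.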
Test plan
- `bash -n` syntax passes.
🤖 Generated with Claude Code