[Klaud Cold] Add qwen3.5-fp8-h100-sglang (off + mtp) recipes by functionstackx · Pull Request #1509 · SemiAnalysisAI/InferenceX

functionstackx · 2026-05-18T18:47:09Z

Summary

H100 had no qwen3.5 sglang recipes at all. Adds FP8 (off + MTP) on lmsysorg/sglang:v0.5.12-cu130. BF16 intentionally skipped — Qwen3.5-397B-A17B BF16 doesn't fit in H100's 80GB HBM3 at TP=8 (~100GB/GPU just for weights).
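The BF16 estimate can be reproduced with a back-of-the-envelope calculation. This is a rough sketch, assuming 397B total parameters (taken from the model name), 2 bytes per parameter for BF16, 1 byte for FP8, and weights sharded evenly across TP=8; `weights_gb_per_gpu` is a hypothetical helper, and KV cache, activations, and runtime overhead are ignored.

```shell
# Hypothetical helper: approximate weight footprint per GPU in GB.
# 1e9 params * N bytes/param is roughly N GB per billion params.
weights_gb_per_gpu() {
  local params_b="$1" bytes_per_param="$2" tp="$3"
  # Integer division; good enough for a fit / no-fit check.
  echo $(( params_b * bytes_per_param / tp ))
}

echo "BF16: ~$(weights_gb_per_gpu 397 2 8) GB/GPU (H100 has 80 GB)"
echo "FP8:  ~$(weights_gb_per_gpu 397 1 8) GB/GPU (H100 has 80 GB)"
```

Under these assumptions BF16 weights alone exceed the 80GB budget, while FP8 leaves headroom for the KV cache.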

Recipes

qwen3.5-fp8-h100-sglang
qwen3.5-fp8-h100-sglang-mtp

TP=8, EP=8, conc 4..32, 1k1k + 8k1k.

Launch scripts

Mirror qwen3.5_fp8_h200.sh but with tighter memory accommodations for H100 (80GB vs H200's 141GB):

| Knob | H200 | H100 (this PR) |
| --- | --- | --- |
| `--mem-fraction-static` | 0.80 | 0.75 |
| `--chunked-prefill-size` | 16384 | 8192 |
| `--max-running-requests` | 128 | 64 |
| Sweep conc cap | 64 | 32 |
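As an illustration of how the two knob sets differ, the sketch below maps a GPU name to the server flags from the table. `memory_knobs` is a hypothetical helper for comparison only; the actual recipes keep these values in separate per-GPU scripts.

```shell
# Hypothetical helper: print the memory-related server flags per GPU,
# using the values from the table above.
memory_knobs() {
  case "$1" in
    h200) echo "--mem-fraction-static 0.80 --chunked-prefill-size 16384 --max-running-requests 128" ;;
    h100) echo "--mem-fraction-static 0.75 --chunked-prefill-size 8192 --max-running-requests 64" ;;
    *)    echo "unknown GPU: $1" >&2; return 1 ;;
  esac
}

memory_knobs h100
```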

MTP variant adds SGLANG_ENABLE_SPEC_V2=1, the standard EAGLE knobs, and --use-chat-template.
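For reference, the server-side MTP additions can be sketched as follows. The flag values (num-steps 3, eagle-topk 1, num-draft-tokens 4) come from the commit message and review in this PR; `mtp_server_flags` is a hypothetical helper, and `--use-chat-template` is left out because it applies to the bench client rather than the server.

```shell
# Hypothetical helper: emit the speculative-decoding server flags
# only when the recipe asks for MTP; print nothing otherwise.
mtp_server_flags() {
  local spec_decoding="${1:-none}"
  if [ "$spec_decoding" = "mtp" ]; then
    echo "--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4"
  fi
}

# The MTP script also exports this before starting the server.
export SGLANG_ENABLE_SPEC_V2=1
mtp_server_flags mtp
```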

If the conservative settings leave throughput on the table once the first sweep lands, we can iterate upward.

Test plan

YAML loads; bash -n syntax passes.
full-sweep-enabled sweep finishes green for both off + mtp matrices on H100.
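The `bash -n` part of the plan could be scripted roughly like this. It is a minimal sketch: `syntax_check` is a hypothetical helper, and the script paths in the usage comment are assumed from the file names mentioned in this PR.

```shell
# Hypothetical helper: run `bash -n` (parse only, no execution) on each
# script and return non-zero if any of them fails to parse.
syntax_check() {
  local failed=0 script
  for script in "$@"; do
    if ! bash -n "$script" 2>/dev/null; then
      echo "syntax error: $script" >&2
      failed=1
    fi
  done
  return "$failed"
}

# Usage (paths assumed):
#   syntax_check benchmarks/single_node/qwen3.5_fp8_h100.sh \
#                benchmarks/single_node/qwen3.5_fp8_h100_mtp.sh
```

Note that `bash -n` only catches parse errors; it would not detect the dispatch problem raised in the review below.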

🤖 Generated with Claude Code

H100 was missing all qwen3.5 sglang coverage. Adds FP8 on lmsysorg/sglang:v0.5.12-cu130. TP=8, EP=8, conc 4..32, 1k1k + 8k1k.

BF16 intentionally skipped: Qwen3.5-397B-A17B BF16 doesn't fit in H100's 80GB HBM3 at TP=8 (~100GB/GPU just for weights).

Launch scripts mirror qwen3.5_fp8_h200.sh but with tighter memory accommodations for H100 (80GB vs H200's 141GB):

- mem-fraction-static 0.80 → 0.75
- chunked-prefill-size 16384 → 8192
- max-running-requests 128 → 64
- sweep conc cap 64 → 32

MTP variant adds SGLANG_ENABLE_SPEC_V2=1, the standard EAGLE knobs (num-steps 3, eagle-topk 1, num-draft-tokens 4), and --use-chat-template on the bench client per AGENTS.md.

If the conservative settings leave throughput on the table once the first sweep lands, can iterate mem-fraction / chunked-prefill up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-18T18:47:21Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

github-actions · 2026-05-18T18:51:52Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26053484112
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26053484112

claude · 2026-05-18T18:57:15Z

+#!/usr/bin/env bash
+
+# Qwen-3.5-397B-A17B FP8 on H100 with EAGLE / MTP speculative decoding.
+# Mirrors qwen3.5_fp8_h100.sh; adds the speculative-* flags + SGLANG_ENABLE_SPEC_V2=1
+# and passes --use-chat-template per AGENTS.md.
+
+source "$(dirname "$0")/../benchmark_lib.sh"
+
+check_env_vars \
+    MODEL \
+    TP \
+    CONC \
+    ISL \
+    OSL \
+    RANDOM_RANGE_RATIO \
+    RESULT_FILENAME \
+    EP_SIZE
+
+if [[ -n "$SLURM_JOB_ID" ]]; then
+  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
+fi
+
+nvidia-smi
+
+if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi
+
+export SGLANG_ENABLE_SPEC_V2=1
+
+SERVER_LOG=/workspace/server.log
+PORT=${PORT:-8888}


🔴 The new qwen3.5_fp8_h100_mtp.sh will never be invoked: all three H100 launchers (runners/launch_h100-cw.sh:34, runners/launch_h100-dgxc-slurm.sh:301, runners/launch_h100-cr.sh:18) build the bench script path as benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_h100.sh with no FRAMEWORK_SUFFIX/SPEC_SUFFIX appended, unlike their H200/B200/B300 peers. So qwen3.5-fp8-h100-sglang-mtp will dispatch to the OFF script qwen3.5_fp8_h100.sh, none of the MTP server flags (SGLANG_ENABLE_SPEC_V2=1, EAGLE, --use-chat-template) will run, yet benchmark-tmpl.yml:180 still bakes spec-mtp into RESULT_FILENAME — so OFF numbers get filed as MTP in the changelog. The H100 launchers need the same FRAMEWORK_SUFFIX/SPEC_SUFFIX handling as the H200/B200/B300 launchers before this PR can produce valid MTP numbers.

Extended reasoning...

The bug

All three H100 launchers construct the bench script path without a framework or spec-decoding suffix:

`runners/launch_h100-cw.sh:34` → `bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_h100.sh`

`runners/launch_h100-dgxc-slurm.sh:301` → identical pattern (single-node else branch)

`runners/launch_h100-cr.sh:18` → identical pattern

Compare `runners/launch_h200-cw.sh:6-8,47`, which is the obvious template the new H100 scripts were modelled on:

MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
bash benchmarks/single_node/${SCENARIO_SUBDIR}${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

The same pattern appears in `launch_h200-dgxc-slurm.sh:300-305`, `launch_h200-nb.sh:7-8,22`, `launch_b200-cw.sh`, `launch_b300-nv.sh:294-303`, `launch_mi355x-amds.sh`, etc. The H100 launchers are the only family that omits this logic.

Why it didn't bite until now

Prior H100 recipes (`gptoss-fp4-h100-vllm`, `minimaxm2.5-fp8-h100-vllm`, `kimik2.5-int4-h100-vllm`) were all vLLM-only with spec=none, so the path always resolved to the only file that existed. This PR introduces the first MTP-bearing recipe on the H100 runner family, which is why the gap surfaces here.

Step-by-step proof

PR adds matrix entry qwen3.5-fp8-h100-sglang-mtp with spec-decoding: mtp and model-prefix: qwen3.5 (.github/configs/nvidia-master.yaml).

utils/matrix_logic/generate_sweep_configs.py:290 builds EXP_NAME = f"{model_code}_{seq_len_str}", so for this recipe EXP_NAME is e.g. qwen3.5_1k1k and ${EXP_NAME%%_*} → qwen3.5.

benchmark-tmpl.yml:180 sets RESULT_FILENAME=..._spec-${SPEC_DECODING}_... → embeds _spec-mtp_.

The job dispatches to an H100 runner; benchmark-tmpl.yml:188 invokes runners/launch_h100-*.sh.

The launcher (e.g. launch_h100-cw.sh:34) runs bash benchmarks/single_node/qwen3.5_fp8_h100.sh — i.e. the OFF recipe added in this same PR, not qwen3.5_fp8_h100_mtp.sh.

None of the MTP-specific flags fire: SGLANG_ENABLE_SPEC_V2=1, --speculative-algorithm EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4, --use-chat-template.

The non-MTP throughput numbers are written to a file tagged _spec-mtp_ and perf-changelog.yaml lists this recipe as MTP — the dashboard now shows the OFF numbers under the MTP label.
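The mismatch in the steps above can be reproduced in isolation. This sketch uses the example values from the proof (`qwen3.5_1k1k`, `fp8`, `mtp`) and omits `SCENARIO_SUBDIR` for brevity; the variable names `resolved_script` and `result_tag` are illustrative only.

```shell
# Example values from the step-by-step proof.
EXP_NAME="qwen3.5_1k1k"
PRECISION="fp8"
SPEC_DECODING="mtp"

# H100 launcher pattern: no FRAMEWORK_SUFFIX / SPEC_SUFFIX appended.
# ${EXP_NAME%%_*} strips the longest suffix starting at the first "_".
resolved_script="benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_h100.sh"

# The result file name is tagged independently of which script ran.
result_tag="spec-${SPEC_DECODING}"

echo "script run: $resolved_script"
echo "result tag: $result_tag"
```

The script path resolves to the OFF recipe while the result tag still says `spec-mtp`, which is the mislabeling described above.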

Impact

qwen3.5_fp8_h100_mtp.sh is unreachable, so the stated test plan ("full-sweep-enabled sweep finishes green for both off + mtp matrices") cannot validate MTP behavior.

Worse than a silent no-op: the resulting data is mislabeled. The OFF and MTP entries in perf-changelog.yaml will both reflect OFF runs, but the MTP one gets posted as a speculative-decoding result.

Fix

Mirror the H200 pattern in launch_h100-cw.sh, launch_h100-dgxc-slurm.sh, and launch_h100-cr.sh: compute FRAMEWORK_SUFFIX from $FRAMEWORK and SPEC_SUFFIX from $SPEC_DECODING, and append ${FRAMEWORK_SUFFIX}${SPEC_SUFFIX} to the bench script path. The H100 launchers are the only blocker — the OFF and MTP scripts in this PR are correct, they just need to be reachable.
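A minimal sketch of what the H100 path construction could look like after the fix, following the H200 snippet quoted earlier. `bench_script_path` is a hypothetical helper (the real launchers inline this logic), and `SCENARIO_SUBDIR` is omitted for brevity.

```shell
# Hypothetical helper: build the H100 bench script path with the
# framework and spec-decoding suffixes appended, as on H200.
bench_script_path() {
  local exp_name="$1" precision="$2" framework="$3" spec_decoding="$4"
  local model_code="${exp_name%%_*}"
  local framework_suffix="" spec_suffix=""
  if [ "$framework" = "trt" ]; then framework_suffix="_trt"; fi
  if [ "$spec_decoding" = "mtp" ]; then spec_suffix="_mtp"; fi
  printf 'benchmarks/single_node/%s_%s_h100%s%s.sh\n' \
    "$model_code" "$precision" "$framework_suffix" "$spec_suffix"
}

bench_script_path qwen3.5_1k1k fp8 sglang mtp
bench_script_path qwen3.5_1k1k fp8 sglang none
```

With this in place the MTP matrix entry would resolve to `qwen3.5_fp8_h100_mtp.sh`, and existing spec=none recipes keep resolving to their current files.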

claude · 2026-05-18T18:57:15Z

+    run_eval --framework lm-eval --port "$PORT"
+    append_lm_eval_summary
+fi
+
+stop_gpu_monitor


🟡 Nit: for consistency with the H200 MTP sibling (qwen3.5_fp8_h200_mtp.sh:88), consider adding export EVAL_CONCURRENT_REQUESTS=${EVAL_CONCURRENT_REQUESTS:-$CONC} before run_eval. Otherwise eval falls back to the default of 64 (benchmark_lib.sh:686), making this the only MTP-flavored recipe in the family that diverges from that precedent. Not blocking: the server caps at --max-running-requests 64, so there is no over-subscription, just an unexplained intra-family inconsistency.

Extended reasoning...

What this is

benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh:87-91 explicitly overrides the lm-eval concurrency before running eval:

if [ "${RUN_EVAL}" = "true" ]; then
    export EVAL_CONCURRENT_REQUESTS="${EVAL_CONCURRENT_REQUESTS:-$CONC}"
    run_eval --framework lm-eval --port "$PORT"
    ...

The new qwen3.5_fp8_h100_mtp.sh does not, so it falls back to the default in benchmark_lib.sh:686 (local concurrent_requests=${EVAL_CONCURRENT_REQUESTS:-64}).

Why the H200 MTP sibling chose to set it

There is no commit message or code comment explaining the H200 MTP override, so this is partly speculative. The plausible reason is: MTP runs add draft-model overhead on top of the verifier, and the eval phase issues an unbounded burst from the lm-eval harness — capping at $CONC matches the steady-state load the server was warmed up for (--cuda-graph-max-bs $CONC).

The refutation's strongest points, and why I still think this is worth a one-line fix

The refutation argues:

The server is provisioned for --max-running-requests 64 at startup, so EVAL_CONCURRENT_REQUESTS=64 matches by construction. This is correct — the KV-cache pool reservation comes from --mem-fraction-static 0.75 at server launch, not from eval-time concurrency. No OOM. Agreed.

EVAL_CONCURRENT_REQUESTS is set in exactly one script across the directory, so the H200 MTP is an outlier, not a convention. Also correct as a count. But it is set in the one other MTP-flavored sglang recipe in this model family, which is the nearest sibling to the new script.

The PR description says "Mirrors qwen3.5_fp8_h100.sh". True — and the H100 OFF script in this PR also lacks the override, consistent with that intent. The choice not to mirror the H200 MTP precedent looks deliberate.

So the refutation is right that this is not a correctness or perf bug. But for the MTP variant specifically, the H200 MTP author judged this knob worth setting, and the new MTP recipe runs at tighter memory (--mem-fraction-static 0.75 vs 0.8) and smaller sweep concurrencies (conc-start: 4) than the H200 MTP did. The cost of mirroring is one line; the cost of diverging is that the next person reading the two MTP scripts side by side has to figure out why.

Step-by-step

Sweep launches H100 MTP recipe with CONC=4, RUN_EVAL=true.

Server starts with --cuda-graph-max-bs 4, --max-running-requests 64, --mem-fraction-static 0.75, --disable-radix-cache.

Benchmark phase runs at --max-concurrency 4 (fine).

Eval phase calls run_eval --framework lm-eval --port $PORT.

benchmark_lib.sh:686 reads EVAL_CONCURRENT_REQUESTS; finds it unset; uses default 64.

Eval bursts up to 64 concurrent requests. Server queues them up to its 64 cap and serves them with cuda graphs falling back to eager for batch sizes >4. No crash, no number contamination — just behaviour that diverges from H200 MTP for no recorded reason.
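The fallback in the steps above comes down to a `${VAR:-default}` expansion. The sketch below isolates it; `eval_concurrency` is a hypothetical stand-in for the line quoted from benchmark_lib.sh:686, not the real function.

```shell
# Hypothetical stand-in for the default quoted from benchmark_lib.sh:686.
eval_concurrency() {
  echo "${EVAL_CONCURRENT_REQUESTS:-64}"
}

CONC=4

# Without the override (current H100 MTP script): default of 64 applies.
unset EVAL_CONCURRENT_REQUESTS
without_override=$(eval_concurrency)

# With the H200-style override: eval concurrency follows $CONC.
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONCURRENT_REQUESTS:-$CONC}"
with_override=$(eval_concurrency)

echo "without override: $without_override, with override: $with_override"
```

Because the override itself uses `:-`, a value exported by the caller would still take precedence over `$CONC`.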

How to fix

Add one line before run_eval in qwen3.5_fp8_h100_mtp.sh, matching qwen3.5_fp8_h200_mtp.sh:88:

if [ "${RUN_EVAL}" = "true" ]; then
    export EVAL_CONCURRENT_REQUESTS="${EVAL_CONCURRENT_REQUESTS:-$CONC}"
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

Optional and not required to land this PR.

github-actions · 2026-05-18T19:17:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26053502947
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26053502947

…weep race (#1510)

The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE` without any locking, so when multiple sweep jobs landed on the cluster simultaneously they all tried to import the same image into the shared NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. First one wins; the rest crash with `[ERROR] File already exists: ...sqsh` and `OSError: [Errno 116] Stale file handle` (from the partial sqsh) once sglang/vllm tries to start.

Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs failed, all hitting the same race on h100-dgxc-slurm_0 + _1. Failure rate scales with sweep concurrency; it was masked previously because older H100 recipes had fewer matrix points sharing the cluster.

Switches to the canonical `flock -w 600` + `unsquashfs -l` skip-if-valid + `enroot import` pattern already used in launch_h100-cw.sh, plus the mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
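The lock-then-skip-if-valid pattern described in this commit can be sketched as below. This is an assumption-laden illustration, not the launcher's actual code: `import_image_once` and `image_is_valid` are hypothetical names, the lock-file location is arbitrary, and `IMAGE_URI` in the usage comment is a placeholder.

```shell
# Hypothetical validity check: a squashfs image that lists cleanly
# is treated as complete.
image_is_valid() {
  unsquashfs -l "$1" >/dev/null 2>&1
}

# import_image_once LOCK_FILE SQUASH_FILE IMPORT_CMD [ARGS...]
# Serialises concurrent importers on LOCK_FILE; only the first one
# (or one that finds a corrupt image) actually runs IMPORT_CMD.
import_image_once() {
  local lock_file="$1" squash_file="$2"
  shift 2
  (
    # Wait up to 600 s for an exclusive lock on file descriptor 9.
    flock -w 600 9 || { echo "timed out waiting for $lock_file" >&2; exit 1; }
    if image_is_valid "$squash_file"; then
      echo "reusing $squash_file"
    else
      rm -f "$squash_file"   # drop any partial image from a failed import
      "$@"
    fi
  ) 9>"$lock_file"
}

# Usage (IMAGE_URI is a placeholder):
#   import_image_once "$SQUASH_FILE.lock" "$SQUASH_FILE" \
#     srun enroot import -o "$SQUASH_FILE" "$IMAGE_URI"
```

Holding the lock across both the check and the import is what closes the race: a second job cannot observe a half-written file, because it blocks until the first job has finished.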

github-actions · 2026-05-18T21:26:27Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26055185038
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26055185038

functionstackx · 2026-05-18T21:30:33Z

/reuse-sweep-run

functionstackx requested a review from a team May 18, 2026 18:47

functionstackx added the full-sweep-enabled label May 18, 2026

functionstackx requested review from jgangani and kedarpotdar-nv as code owners May 18, 2026 18:47

github-project-automation Bot added this to InferenceMAX Board May 18, 2026

chore: fill pr-link for #1509

6774ea8

claude Bot reviewed May 18, 2026

View reviewed changes

functionstackx mentioned this pull request May 18, 2026

[Klaud Cold] runners(h100-dgxc-slurm): flock the enroot import to fix concurrent-sweep race #1510

Merged

2 tasks

Merge remote-tracking branch 'origin/main' into HEAD

f34542a

functionstackx merged commit 7ec633f into main May 18, 2026
37 checks passed

functionstackx deleted the add-qwen3.5-fp8-h100-sglang branch May 18, 2026 21:30

github-project-automation Bot moved this to Done in InferenceMAX Board May 18, 2026

claude Bot mentioned this pull request May 18, 2026

[DO NOT MERGE][WIP][NV] update- qwen3.5-fp8-b200-sglang-mtp #1513

Closed

functionstackx merged 3 commits into main from add-qwen3.5-fp8-h100-sglang
