[Klaud Cold] Add qwen3.5-fp8-h100-sglang (off + mtp) recipes #1509

Merged
functionstackx merged 3 commits into main from add-qwen3.5-fp8-h100-sglang
May 18, 2026

@functionstackx
Collaborator

Summary

H100 had no qwen3.5 sglang recipes at all. Adds FP8 (off + MTP) on lmsysorg/sglang:v0.5.12-cu130. BF16 intentionally skipped — Qwen3.5-397B-A17B BF16 doesn't fit in H100's 80GB HBM3 at TP=8 (~100GB/GPU just for weights).
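
The per-GPU weight estimate can be reproduced with a rough back-of-the-envelope calculation. This is only a sketch: it takes the 397B parameter count from the model name, treats 1 billion parameters at 1 byte each as roughly 1 GB, assumes weights are sharded evenly across the TP ranks, and ignores KV cache, activations, and any layers kept at higher precision. The function name `weights_gb_per_gpu` is hypothetical.

```bash
# weights_gb_per_gpu <params_in_billions> <bytes_per_param> <tp>
# Rough per-GPU weight footprint in GB (integer division, weights only).
weights_gb_per_gpu() {
    local params_b="$1" bytes_per_param="$2" tp="$3"
    echo $(( params_b * bytes_per_param / tp ))
}

weights_gb_per_gpu 397 2 8   # BF16 at TP=8: prints 99, above the 80 GB on an H100
weights_gb_per_gpu 397 1 8   # FP8 at TP=8: prints 49
```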

Recipes

  • qwen3.5-fp8-h100-sglang
  • qwen3.5-fp8-h100-sglang-mtp

TP=8, EP=8, concurrency 4..32, sequence lengths (ISL/OSL) 1k1k + 8k1k.

Launch scripts

Mirror qwen3.5_fp8_h200.sh but with tighter memory accommodations for H100 (80GB vs H200's 141GB):

| Knob | H200 | H100 (this PR) |
| --- | --- | --- |
| --mem-fraction-static | 0.80 | 0.75 |
| --chunked-prefill-size | 16384 | 8192 |
| --max-running-requests | 128 | 64 |
| Sweep conc cap | 64 | 32 |
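
For illustration, the table can be expressed as a small lookup that prints the memory-related server flags for each GPU. This is a sketch, not the contents of the actual launch scripts; the function name `memory_knobs` is hypothetical and only the flag names and values come from the table.

```bash
# memory_knobs <gpu>: print the memory-related server flags from the table.
# Hypothetical helper; the real scripts may set these flags inline.
memory_knobs() {
    case "$1" in
        h200) echo "--mem-fraction-static 0.80 --chunked-prefill-size 16384 --max-running-requests 128" ;;
        h100) echo "--mem-fraction-static 0.75 --chunked-prefill-size 8192 --max-running-requests 64" ;;
        *)    echo "unknown GPU: $1" >&2; return 1 ;;
    esac
}

# Split the flags into an array so they can be appended to a server command line.
read -r -a KNOBS <<< "$(memory_knobs h100)"
printf '%s\n' "${KNOBS[@]}"
```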

MTP variant adds SGLANG_ENABLE_SPEC_V2=1, the standard EAGLE knobs, and --use-chat-template.
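
A minimal sketch of the MTP-specific additions, using the flag values quoted elsewhere on this page (num-steps 3, eagle-topk 1, num-draft-tokens 4). The function name `mtp_server_args` is hypothetical, and how the real script passes these to the server is not shown here.

```bash
# Environment switch named in the PR description.
export SGLANG_ENABLE_SPEC_V2=1

# mtp_server_args: print the speculative-decoding flags as one space-separated line.
# Hypothetical helper for illustration only.
mtp_server_args() {
    local -a args=(
        --speculative-algorithm EAGLE
        --speculative-num-steps 3
        --speculative-eagle-topk 1
        --speculative-num-draft-tokens 4
    )
    echo "${args[*]}"
}

mtp_server_args
# Note: --use-chat-template belongs to the benchmark client, not the server.
```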

If the conservative settings leave throughput on the table once the first sweep lands, we can iterate upward.

Test plan

  • YAML loads; bash -n syntax passes.
  • full-sweep-enabled sweep finishes green for both off + mtp matrices on H100.
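
The `bash -n` half of the first check could look roughly like the following. It is a sketch under the assumption that the scripts to check are passed as arguments; the function name `syntax_check` is hypothetical, and the YAML-loading half is not covered.

```bash
# syntax_check <script>...: parse each file with `bash -n` (no execution).
# Returns non-zero if any file fails to parse.
syntax_check() {
    local rc=0 f
    for f in "$@"; do
        if bash -n "$f" 2>/dev/null; then
            echo "ok   $f"
        else
            echo "FAIL $f"
            rc=1
        fi
    done
    return "$rc"
}

# Example: syntax_check benchmarks/single_node/qwen3.5_fp8_h100*.sh
```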

🤖 Generated with Claude Code

H100 was missing all qwen3.5 sglang coverage. Adds FP8 on
lmsysorg/sglang:v0.5.12-cu130. TP=8, EP=8, conc 4..32, 1k1k + 8k1k.

BF16 intentionally skipped — Qwen3.5-397B-A17B BF16 doesn't fit in
H100's 80GB HBM3 at TP=8 (~100GB/GPU just for weights).

Launch scripts mirror qwen3.5_fp8_h200.sh but with tighter memory
accommodations for H100 (80GB vs H200's 141GB):
  mem-fraction-static     0.80 → 0.75
  chunked-prefill-size    16384 → 8192
  max-running-requests    128 → 64
  sweep conc cap          64 → 32

MTP variant adds SGLANG_ENABLE_SPEC_V2=1, the standard EAGLE knobs
(num-steps 3, eagle-topk 1, num-draft-tokens 4), and
--use-chat-template on the bench client per AGENTS.md.

If the conservative settings leave throughput on the table once the
first sweep lands, can iterate mem-fraction / chunked-prefill up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@github-actions
Contributor

Comment on lines +1 to +30
#!/usr/bin/env bash

# Qwen-3.5-397B-A17B FP8 on H100 with EAGLE / MTP speculative decoding.
# Mirrors qwen3.5_fp8_h100.sh; adds the speculative-* flags + SGLANG_ENABLE_SPEC_V2=1
# and passes --use-chat-template per AGENTS.md.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME \
    EP_SIZE

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

nvidia-smi

if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi

export SGLANG_ENABLE_SPEC_V2=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

🔴 The new qwen3.5_fp8_h100_mtp.sh will never be invoked: all three H100 launchers (runners/launch_h100-cw.sh:34, runners/launch_h100-dgxc-slurm.sh:301, runners/launch_h100-cr.sh:18) build the bench script path as benchmarks/single_node/${EXP_NAME%%_*}_${PRECISION}_h100.sh with no FRAMEWORK_SUFFIX/SPEC_SUFFIX appended, unlike their H200/B200/B300 peers. So qwen3.5-fp8-h100-sglang-mtp will dispatch to the OFF script qwen3.5_fp8_h100.sh, none of the MTP server flags (SGLANG_ENABLE_SPEC_V2=1, EAGLE, --use-chat-template) will run, yet benchmark-tmpl.yml:180 still bakes spec-mtp into RESULT_FILENAME — so OFF numbers get filed as MTP in the changelog. The H100 launchers need the same FRAMEWORK_SUFFIX/SPEC_SUFFIX handling as the H200/B200/B300 launchers before this PR can produce valid MTP numbers.

Extended reasoning...

The bug

All three H100 launchers construct the bench script path without a framework or spec-decoding suffix:

  • `runners/launch_h100-cw.sh:34` → `bash benchmarks/single_node/${SCENARIO_SUBDIR}${EXP_NAME%%_*}_${PRECISION}_h100.sh`
  • `runners/launch_h100-dgxc-slurm.sh:301` → identical pattern (single-node else branch)
  • `runners/launch_h100-cr.sh:18` → identical pattern

Compare `runners/launch_h200-cw.sh:6-8,47`, which is the obvious template the new H100 scripts were modelled on:

MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
bash benchmarks/single_node/${SCENARIO_SUBDIR}${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

The same pattern appears in `launch_h200-dgxc-slurm.sh:300-305`, `launch_h200-nb.sh:7-8,22`, `launch_b200-cw.sh`, `launch_b300-nv.sh:294-303`, `launch_mi355x-amds.sh`, etc. The H100 launchers are the only family that omits this logic.

Why it didn't bite until now

Prior H100 recipes (`gptoss-fp4-h100-vllm`, `minimaxm2.5-fp8-h100-vllm`, `kimik2.5-int4-h100-vllm`) were all vLLM-only with spec=none, so the path always resolved to the only file that existed. This PR introduces the first MTP-bearing recipe on the H100 runner family, which is why the gap surfaces here.

Step-by-step proof

  1. PR adds matrix entry qwen3.5-fp8-h100-sglang-mtp with spec-decoding: mtp and model-prefix: qwen3.5 (.github/configs/nvidia-master.yaml).
  2. utils/matrix_logic/generate_sweep_configs.py:290 builds EXP_NAME = f"{model_code}_{seq_len_str}", so for this recipe EXP_NAME is e.g. qwen3.5_1k1k and ${EXP_NAME%%_*} expands to qwen3.5.
  3. benchmark-tmpl.yml:180 sets RESULT_FILENAME=..._spec-${SPEC_DECODING}_... → embeds _spec-mtp_.
  4. The job dispatches to an H100 runner; benchmark-tmpl.yml:188 invokes runners/launch_h100-*.sh.
  5. The launcher (e.g. launch_h100-cw.sh:34) runs bash benchmarks/single_node/qwen3.5_fp8_h100.sh — i.e. the OFF recipe added in this same PR, not qwen3.5_fp8_h100_mtp.sh.
  6. None of the MTP-specific flags fire: SGLANG_ENABLE_SPEC_V2=1, --speculative-algorithm EAGLE, --speculative-num-steps 3, --speculative-eagle-topk 1, --speculative-num-draft-tokens 4, --use-chat-template.
  7. The non-MTP throughput numbers are written to a file tagged _spec-mtp_ and perf-changelog.yaml lists this recipe as MTP — the dashboard now shows the OFF numbers under the MTP label.
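
The dispatch described in the steps above can be reproduced in isolation. The snippet below is a sketch that assumes `SCENARIO_SUBDIR` is empty and uses the example values from step 2; it only shows that the path expression never consults `SPEC_DECODING`.

```bash
EXP_NAME="qwen3.5_1k1k"
PRECISION="fp8"
SPEC_DECODING="mtp"   # set by the matrix entry, but unused below

# %%_* strips the longest suffix that starts at an underscore,
# i.e. everything from the first underscore onward.
MODEL_CODE="${EXP_NAME%%_*}"

# Path built the way the review describes the H100 launchers building it.
SCRIPT_PATH="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h100.sh"

echo "$MODEL_CODE"    # qwen3.5
echo "$SCRIPT_PATH"   # benchmarks/single_node/qwen3.5_fp8_h100.sh
```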

Impact

  • qwen3.5_fp8_h100_mtp.sh is unreachable, so the stated test plan ("full-sweep-enabled sweep finishes green for both off + mtp matrices") cannot validate MTP behavior.
  • Worse than a silent no-op: the resulting data is mislabeled. The OFF and MTP entries in perf-changelog.yaml will both reflect OFF runs, but the MTP one gets posted as a speculative-decoding result.

Fix

Mirror the H200 pattern in launch_h100-cw.sh, launch_h100-dgxc-slurm.sh, and launch_h100-cr.sh: compute FRAMEWORK_SUFFIX from $FRAMEWORK and SPEC_SUFFIX from $SPEC_DECODING, and append ${FRAMEWORK_SUFFIX}${SPEC_SUFFIX} to the bench script path. The H100 launchers are the only blocker — the OFF and MTP scripts in this PR are correct, they just need to be reachable.
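
One possible shape for the fix, modelled on the H200 snippet quoted earlier in this comment. This is a sketch rather than the actual patch: wrapping the logic in a function named `bench_script_path` and passing the values as arguments are assumptions made for illustration.

```bash
# bench_script_path <exp_name> <precision> <framework> <spec_decoding> [scenario_subdir]
# Hypothetical helper that appends the framework and spec-decoding suffixes
# to the H100 bench script path.
bench_script_path() {
    local exp_name="$1" precision="$2" framework="$3" spec_decoding="$4"
    local scenario_subdir="${5:-}"
    local model_code="${exp_name%%_*}"
    local framework_suffix="" spec_suffix=""

    if [[ "$framework" == "trt" ]]; then framework_suffix="_trt"; fi
    if [[ "$spec_decoding" == "mtp" ]]; then spec_suffix="_mtp"; fi

    echo "benchmarks/single_node/${scenario_subdir}${model_code}_${precision}_h100${framework_suffix}${spec_suffix}.sh"
}

bench_script_path qwen3.5_1k1k fp8 sglang mtp    # ..._h100_mtp.sh
bench_script_path qwen3.5_1k1k fp8 sglang none   # ..._h100.sh
```

With this shape, the MTP matrix entry resolves to qwen3.5_fp8_h100_mtp.sh while existing spec=none recipes keep resolving to their current scripts.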

Comment on lines +91 to +95
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

stop_gpu_monitor

🟡 Nit: for consistency with the H200 MTP sibling (qwen3.5_fp8_h200_mtp.sh:88), consider adding export EVAL_CONCURRENT_REQUESTS=${EVAL_CONCURRENT_REQUESTS:-$CONC} before run_eval. Otherwise eval falls back to the default of 64 (benchmark_lib.sh:686), which makes this the only MTP-flavored recipe that diverges from the H200 precedent. Not blocking: the server caps at --max-running-requests 64, so there is no over-subscription, just an unexplained intra-family inconsistency.

Extended reasoning...

What this is

benchmarks/single_node/qwen3.5_fp8_h200_mtp.sh:87-91 explicitly overrides the lm-eval concurrency before running eval:

if [ "${RUN_EVAL}" = "true" ]; then
    export EVAL_CONCURRENT_REQUESTS="${EVAL_CONCURRENT_REQUESTS:-$CONC}"
    run_eval --framework lm-eval --port "$PORT"
    ...

The new qwen3.5_fp8_h100_mtp.sh does not, so it falls back to the default in benchmark_lib.sh:686 (local concurrent_requests=${EVAL_CONCURRENT_REQUESTS:-64}).

Why the H200 MTP sibling chose to set it

There is no commit message or code comment explaining the H200 MTP override, so this is partly speculative. The plausible reason is: MTP runs add draft-model overhead on top of the verifier, and the eval phase issues an unbounded burst from the lm-eval harness — capping at $CONC matches the steady-state load the server was warmed up for (--cuda-graph-max-bs $CONC).

The refutation's strongest points, and why I still think this is worth a one-line fix

The refutation argues:

  1. The server is provisioned for --max-running-requests 64 at startup, so EVAL_CONCURRENT_REQUESTS=64 matches by construction. This is correct — the KV-cache pool reservation comes from --mem-fraction-static 0.75 at server launch, not from eval-time concurrency. No OOM. Agreed.
  2. EVAL_CONCURRENT_REQUESTS is set in exactly one script across the directory, so the H200 MTP is an outlier, not a convention. Also correct as a count. But it is set in the one other MTP-flavored sglang recipe in this model family, which is the nearest sibling to the new script.
  3. The PR description says "Mirrors qwen3.5_fp8_h100.sh". True — and the H100 OFF script in this PR also lacks the override, consistent with that intent. The choice not to mirror the H200 MTP precedent looks deliberate.

So the refutation is right that this is not a correctness or perf bug. But for the MTP variant specifically, the H200 MTP author judged this knob worth setting, and the new MTP recipe runs at tighter memory (--mem-fraction-static 0.75 vs 0.8) and smaller sweep concurrencies (conc-start: 4) than the H200 MTP did. The cost of mirroring is one line; the cost of diverging is that the next person reading the two MTP scripts side by side has to figure out why.

Step-by-step

  1. Sweep launches H100 MTP recipe with CONC=4, RUN_EVAL=true.
  2. Server starts with --cuda-graph-max-bs 4, --max-running-requests 64, --mem-fraction-static 0.75, --disable-radix-cache.
  3. Benchmark phase runs at --max-concurrency 4 (fine).
  4. Eval phase calls run_eval --framework lm-eval --port $PORT.
  5. benchmark_lib.sh:686 reads EVAL_CONCURRENT_REQUESTS; finds it unset; uses default 64.
  6. Eval bursts up to 64 concurrent requests. Server queues them up to its 64 cap and serves them with cuda graphs falling back to eager for batch sizes >4. No crash, no number contamination — just behaviour that diverges from H200 MTP for no recorded reason.
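
The fallback in steps 4 and 5 comes down to shell default expansion. The sketch below mimics it with a stand-in function, `eval_concurrency` (hypothetical; the real logic lives inside benchmark_lib.sh and is only quoted here as `${EVAL_CONCURRENT_REQUESTS:-64}`).

```bash
# Stand-in for the default quoted from benchmark_lib.sh:686.
eval_concurrency() {
    echo "${EVAL_CONCURRENT_REQUESTS:-64}"
}

CONC=4
unset EVAL_CONCURRENT_REQUESTS
eval_concurrency   # prints 64: variable unset, library default applies

# The one-line override from the H200 MTP script:
export EVAL_CONCURRENT_REQUESTS="${EVAL_CONCURRENT_REQUESTS:-$CONC}"
eval_concurrency   # prints 4: eval now runs at the sweep concurrency
```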

How to fix

Add one line before run_eval in qwen3.5_fp8_h100_mtp.sh, matching qwen3.5_fp8_h200_mtp.sh:88:

if [ "${RUN_EVAL}" = "true" ]; then
    export EVAL_CONCURRENT_REQUESTS="${EVAL_CONCURRENT_REQUESTS:-$CONC}"
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

Optional and not required to land this PR.

@github-actions
Contributor

functionstackx added a commit that referenced this pull request May 18, 2026
…weep race (#1510)

The h100-dgxc-slurm launcher was doing `srun enroot import -o $SQUASH_FILE`
without any locking, so when multiple sweep jobs landed on the cluster
simultaneously they all tried to import the same image into the shared
NFS path `/mnt/nfs/lustre/containers/<image>.sqsh`. First one wins; the
rest crash with `[ERROR] File already exists: ...sqsh` and
`OSError: [Errno 116] Stale file handle` (from the partial sqsh) once
sglang/vllm tries to start.

Observed on PR #1509 (qwen3.5-fp8-h100-sglang new recipe): 13/30 jobs
failed, all hitting the same race on h100-dgxc-slurm_0 + _1. Failure
rate scales with sweep concurrency — was masked previously because
older H100 recipes had fewer matrix points sharing the cluster.

Switches to the canonical `flock -w 600 + unsquashfs -l skip-if-valid +
enroot import` pattern already used in launch_h100-cw.sh, plus the
mi300x/mi325x/mi355x launchers (#1462/#1477/#1498). No other behavior
change.
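
A rough sketch of the lock-then-skip-if-valid pattern the commit message names. It is not the launcher's actual code: the function names, the `.lock` file naming, the removal of an invalid leftover file, and the choice to pass the validator and import command as arguments are all assumptions made so the logic can be exercised without enroot installed.

```bash
# squash_is_valid <file>: true if unsquashfs can list the image contents.
squash_is_valid() {
    unsquashfs -l "$1" >/dev/null 2>&1
}

# import_squash_once <squash_file> <validate_cmd> <import_cmd...>
# Hypothetical helper: serialize imports of one image onto shared storage.
import_squash_once() {
    local squash_file="$1" validate_cmd="$2"
    shift 2
    local lock_file="${squash_file}.lock"
    (
        # Wait up to 600 s for an exclusive lock held on file descriptor 9.
        flock -w 600 9 || { echo "timed out waiting for $lock_file" >&2; exit 1; }
        # Another job may have finished the import while we waited.
        if [[ -f "$squash_file" ]] && "$validate_cmd" "$squash_file"; then
            exit 0
        fi
        rm -f "$squash_file"   # assumption: discard a partial or invalid file
        "$@"
    ) 9>"$lock_file"
}

# Intended use (not executed here; SQUASH_FILE and IMAGE are placeholders):
#   import_squash_once "$SQUASH_FILE" squash_is_valid \
#       enroot import -o "$SQUASH_FILE" "docker://$IMAGE"
```

Because the validity check runs after the lock is acquired, jobs that lose the race find a finished image and skip the import instead of colliding with it.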

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

@functionstackx
Collaborator Author

/reuse-sweep-run

@functionstackx functionstackx merged commit 7ec633f into main May 18, 2026
37 checks passed
@functionstackx functionstackx deleted the add-qwen3.5-fp8-h100-sglang branch May 18, 2026 21:30