
Add DSv4 FP8 H200 vLLM MTP benchmark#1222

Merged
functionstackx merged 6 commits into main from claude/add-dsv4-fp8-h200-vllm-mtp
May 4, 2026

Conversation


@functionstackx functionstackx commented Apr 29, 2026

Summary

Ports the H200 STP recipe to MTP.

  • New dsv4-fp8-h200-vllm-mtp config + benchmarks/single_node/dsv4_fp8_h200_mtp.sh script.
  • MTP counterpart of dsv4-fp8-h200-vllm: identical launch flags (EP + DP=$TP, --gpu-memory-utilization 0.95, --max-num-seqs 512, --no-enable-flashinfer-autotune, FULL_DECODE_ONLY compile) with one addition — --speculative-config '{"method":"mtp","num_speculative_tokens":1}'.
  • num_speculative_tokens=1 because the recipe states H200 supports spec token=1 only for now (B200 / Blackwell SKUs are where token=2 lands on the Pareto front per @wzhao18's vLLM Blackwell MTP work; the H200 kernel coverage is still token=1).
  • Image: vllm/vllm-openai:v0.20.1 (pinned by SHA256 digest 9eff9734...23aa4). The non-MTP dsv4-fp8-h200-vllm entry is unchanged and still on the deepseekv4-cu129 tag.
  • max-model-len comes from the runner's $MAX_MODEL_LEN env var rather than being hardcoded to 800k — keeps this script consistent with the rest of the H200 fleet.
  • Sets VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 to skip the cudagraph-memory estimator during worker memory profiling — it overestimates and pushes us over the GPU memory budget on H200 + MTP, even though the actual cudagraph capture works fine.
  • run_benchmark_serving invocation passes --dsv4 so prompts get chat-formatted encoding, per the AGENTS.md MTP rule (raw random tokens silently regress EAGLE-style acceptance).
  • Search space mirrors the non-MTP H200 entry: TP=8, EP=8, DP-attn=true, CONC 4-64, both 1k1k and 8k1k, with spec-decoding: mtp on each entry.
  • Adds a perf-changelog.yaml entry to trigger the new config.
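For reference, the bullets above can be assembled into the serve invocation the script issues. This is a sketch only: the model id is a placeholder, the `MAX_MODEL_LEN` default is invented for illustration, and the EP/DP flag spellings are assumptions — only the pinned values come from the summary.

```shell
# Sketch of the launch described above (not the actual script).
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0  # skip the cudagraph-memory estimator

TP=8
MAX_MODEL_LEN="${MAX_MODEL_LEN:-163840}"  # runner-supplied in the real script; default is illustrative

CMD=(vllm serve deepseek-ai/DeepSeek-V4   # placeholder model id
  --tensor-parallel-size "$TP"
  --data-parallel-size "$TP"              # DP=$TP per the recipe
  --enable-expert-parallel                # EP
  --max-model-len "$MAX_MODEL_LEN"
  --gpu-memory-utilization 0.95
  --max-num-seqs 512
  --no-enable-flashinfer-autotune
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
)

printf '%s ' "${CMD[@]}"; echo
```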

Test plan

  • Trigger the dsv4-fp8-h200-vllm-mtp benchmark workflow on an H200 runner and confirm the engine starts and the sweep completes for at least one cell from each of the two seq-len-configs.
  • Confirm vllm/vllm-openai:v0.20.1@sha256:9eff9734...23aa4 pulls cleanly.
  • server.log shows --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and the rest of the H200 recipe flags.
  • Acceptance rate is in a sane range — --dsv4 is wired into run_benchmark_serving so the prompts go through the chat template.
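The server.log check above can be smoke-tested with a couple of greps. The helper below is a sketch (the log path and helper name are invented; the flag strings come from the test plan):

```shell
# Sketch: verify the MTP and recipe flags made it into the engine log.
# Takes the path to the log; the real log location is runner-specific.
check_log() {
  local log="$1"
  grep -q -- '--speculative-config' "$log"            || return 1
  grep -q '"method":"mtp"' "$log"                     || return 1
  grep -qF -- '--gpu-memory-utilization 0.95' "$log"  || return 1
  grep -qF -- '--max-num-seqs 512' "$log"             || return 1
  echo "recipe + MTP flags: OK"
}
```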

🤖 Generated with Claude Code

@github-actions

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@@ -0,0 +1,99 @@
#!/usr/bin/env bash

🔴 The MTP benchmark script is added at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh, but all three H200 launch scripts (runners/launch_h200-cw.sh:47, runners/launch_h200-nb.sh:22, runners/launch_h200-dgxc-slurm.sh:295) build the script path as benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh where FRAMEWORK_SUFFIX is empty for vllm — so they will look for dsv4_fp8_h200_mtp.sh and fail with 'No such file or directory' on every cell of the sweep. Unlike launch_b300-nv.sh, the H200 launchers have no framework-tagged-name fallback. Fix by either renaming the script to dsv4_fp8_h200_mtp.sh (matches the existing convention — see qwen3.5_fp8_h200_mtp.sh) or porting the B300 fallback logic to the H200 launchers.

Extended reasoning...

What the bug is

The PR adds a new vLLM MTP benchmark script at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh and a corresponding dsv4-fp8-h200-vllm-mtp config in .github/configs/nvidia-master.yaml. However, the filename does not match what the H200 launch scripts will look for at runtime, so the workflow will hard-fail before vLLM ever starts.

How the launcher resolves the script path

All three H200 launch scripts build the benchmark script path the same way:

# runners/launch_h200-cw.sh:7-8, 47
MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
bash benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

runners/launch_h200-nb.sh:7-8,22 is identical, and runners/launch_h200-dgxc-slurm.sh:295 inlines the same construction.

FRAMEWORK_SUFFIX is _trt only when the framework is trt; for vllm (and sglang) it is empty. SPEC_SUFFIX is _mtp when SPEC_DECODING=mtp.

Step-by-step proof for the new config

For the new dsv4-fp8-h200-vllm-mtp entry:

  • model-prefix: dsv4
  • MODEL_CODE: dsv4 (from EXP_NAME="${model_code}_${seq_len_str}")
  • PRECISION: fp8
  • FRAMEWORK: vllm → FRAMEWORK_SUFFIX=""
  • SPEC_DECODING: mtp → SPEC_SUFFIX="_mtp"

So the resolved path is:

benchmarks/single_node/dsv4_fp8_h200_mtp.sh

But the PR added the file at:

benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh

bash will exit with No such file or directory, the runner will mark the cell as failed, and every cell of the new sweep (TP=8/EP=8, conc 4–64, both 1k1k and 8k1k) will fail before the engine starts.
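The resolution above can be reproduced in a few lines. The suffix logic is copied from the quoted launcher code; the seq-len part of EXP_NAME ("1k1k") is assumed for illustration:

```shell
# Reproduce the H200 launcher's script-path construction for this config.
EXP_NAME="dsv4_1k1k"   # ${model_code}_${seq_len_str}; seq-len part assumed
PRECISION="fp8"
FRAMEWORK="vllm"
SPEC_DECODING="mtp"

MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

SCRIPT="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
echo "$SCRIPT"   # benchmarks/single_node/dsv4_fp8_h200_mtp.sh — no _vllm_ in the name
```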

Why the existing code does not save it

Unlike runners/launch_b300-nv.sh:267-272, which prefers a framework-tagged name and falls back to the legacy un-tagged name (this is exactly why dsv4_fp4_b300_vllm_mtp.sh and dsv4_fp4_b300_sglang_mtp.sh work on B300), the H200 launchers have no fallback — they construct one path and run it.

The existing H200 file naming convention confirms the expected name: every other vLLM/SGLang H200 MTP/non-MTP script in the tree omits the framework name (qwen3.5_fp8_h200_mtp.sh, dsr1_fp8_h200.sh, glm5_fp8_h200.sh, dsv4_fp8_h200.sh from this same series), and the only framework-tagged H200 scripts use _trt (dsr1_fp8_h200_trt_mtp.sh). The non-MTP counterpart in this PR's series — dsv4_fp8_h200.sh — already follows the no-suffix convention and works, which is itself evidence of the bug.

Impact and fix

This is a hard, deterministic PR-blocker: every cell of the new benchmark sweep fails to launch. Two fixes:

  1. Simplest: rename benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh → benchmarks/single_node/dsv4_fp8_h200_mtp.sh to match the existing H200 convention.
  2. Or: port the B300-style framework-tagged-then-legacy-fallback logic to all three H200 launch scripts so framework-tagged filenames also work.
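Option 2 could look roughly like the following. This is an approximation modeled on the described B300 behavior, not the actual launch_b300-nv.sh code, and the helper name is invented:

```shell
# Sketch of option 2: prefer a framework-tagged script name, then fall
# back to the legacy un-tagged name (B300-style fallback, approximated).
resolve_bench_script() {
  local model_code="$1" precision="$2" framework="$3" spec_suffix="$4"
  local dir="benchmarks/single_node"
  local tagged="${dir}/${model_code}_${precision}_h200_${framework}${spec_suffix}.sh"
  local legacy="${dir}/${model_code}_${precision}_h200${spec_suffix}.sh"
  if [[ -f "$tagged" ]]; then
    echo "$tagged"
  else
    echo "$legacy"
  fi
}
```

With this in place, either filename would work: `resolve_bench_script dsv4 fp8 vllm _mtp` picks dsv4_fp8_h200_vllm_mtp.sh when it exists and falls back to dsv4_fp8_h200_mtp.sh otherwise.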

@functionstackx
Contributor Author

waiting for 0.20.1 vllm-project/vllm#41189 and vllm-project/vllm#41444

@functionstackx functionstackx changed the title Add DSv4 FP8 H200 vLLM MTP benchmark [waiting for bug fix to land in v0.20.1] Add DSv4 FP8 H200 vLLM MTP benchmark May 1, 2026
@functionstackx functionstackx force-pushed the claude/add-dsv4-fp8-h200-vllm-mtp branch 3 times, most recently from 4199098 to 07b3736 Compare May 4, 2026 20:06
functionstackx and others added 4 commits May 4, 2026 17:26
Mirror of dsv4-fp8-h200-vllm + --speculative-config
'{"method":"mtp","num_speculative_tokens":2}', so we get an MTP
counterpart of the existing H200 vLLM DeepSeek-V4-Pro recipe at
https://vllm.ai/blog/deepseek-v4.

- Image: vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0; the
  non-MTP entry is still on the deepseekv4-cu129 tag).
- Launch flags otherwise identical to dsv4_fp8_h200.sh: EP + DP=$TP,
  --gpu-memory-utilization 0.95, --max-num-seqs 512,
  --no-enable-flashinfer-autotune, FULL_DECODE_ONLY compile.
- run_benchmark_serving uses --dsv4 per the AGENTS.md MTP rule —
  EAGLE-style spec decoding regresses acceptance on raw random tokens.
- Search space mirrors the non-MTP H200 entry (TP=8, EP=8, DP-attn,
  CONC 4-64, both 1k1k and 8k1k) with spec-decoding: mtp.

Adds a perf-changelog entry to trigger the new config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The H200 runner (runners/launch_h200-cw.sh) constructs the script name
as ${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
where FRAMEWORK_SUFFIX is empty for vllm — so it expects
benchmarks/single_node/dsv4_fp8_h200_mtp.sh, not the framework-named
dsv4_fp8_h200_vllm_mtp.sh.

Run 12597 failed with "No such file or directory"; rename to fix it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…um_speculative_tokens=1

- Export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 before vllm serve.
  The estimator overshoots H200 + MTP at memory-profile time and pushes
  us over budget even though actual cudagraph capture works fine.
- Drop num_speculative_tokens from 2 to 1 for now; bring it back up
  once we have a stable baseline on this image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dcoded 800k

Take the max-model-len from the runner-supplied MAX_MODEL_LEN env var
(added to check_env_vars) so the value is set centrally per config
instead of pinned in the script. Eval-only path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx and others added 2 commits May 4, 2026 17:26
v0.20.1 contains the bug fix the PR was waiting on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/add-dsv4-fp8-h200-vllm-mtp branch from 07b3736 to dd982cd Compare May 4, 2026 21:26

github-actions Bot commented May 4, 2026

@functionstackx functionstackx changed the title [waiting for bug fix to land in v0.20.1] Add DSv4 FP8 H200 vLLM MTP benchmark Add DSv4 FP8 H200 vLLM MTP benchmark May 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot May 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot May 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot May 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot May 4, 2026
@functionstackx functionstackx merged commit c898aeb into main May 4, 2026
62 of 63 checks passed
@functionstackx functionstackx deleted the claude/add-dsv4-fp8-h200-vllm-mtp branch May 4, 2026 22:20
