
Add DSv4 FP8 H200 vLLM MTP benchmark#1222

Merged
functionstackx merged 6 commits into main from claude/add-dsv4-fp8-h200-vllm-mtp
May 4, 2026

Conversation


@functionstackx functionstackx commented Apr 29, 2026

Summary

Ports the H200 STP recipe to MTP.

  • New dsv4-fp8-h200-vllm-mtp config + benchmarks/single_node/dsv4_fp8_h200_mtp.sh script.
  • MTP counterpart of dsv4-fp8-h200-vllm: identical launch flags (EP + DP=$TP, --gpu-memory-utilization 0.95, --max-num-seqs 512, --no-enable-flashinfer-autotune, FULL_DECODE_ONLY compile) with one addition — --speculative-config '{"method":"mtp","num_speculative_tokens":1}'.
  • num_speculative_tokens=1 because the recipe states H200 supports spec token=1 only for now (B200 / Blackwell SKUs are where token=2 lands on the Pareto front per @wzhao18's vLLM Blackwell MTP work; the H200 kernel coverage is still token=1).
  • Image: vllm/vllm-openai:v0.20.1 (pinned by SHA256 digest 9eff9734...23aa4). The non-MTP dsv4-fp8-h200-vllm entry is unchanged and still on the deepseekv4-cu129 tag.
  • max-model-len comes from the runner's $MAX_MODEL_LEN env var rather than being hardcoded to 800k — keeps this script consistent with the rest of the H200 fleet.
  • Sets VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 to skip the cudagraph-memory estimator during worker memory profiling — it overestimates and pushes us over the GPU memory budget on H200 + MTP, even though the actual cudagraph capture works fine.
  • run_benchmark_serving invocation passes --dsv4 so prompts get chat-formatted encoding, per the AGENTS.md MTP rule (raw random tokens silently regress EAGLE-style acceptance).
  • Search space mirrors the non-MTP H200 entry: TP=8, EP=8, DP-attn=true, CONC 4-64, both 1k1k and 8k1k, with spec-decoding: mtp on each entry.
  • Adds a perf-changelog.yaml entry to trigger the new config.
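For reference, the bullets above can be assembled into the serve invocation the script issues. This is a sketch only: the model id is a placeholder, the `MAX_MODEL_LEN` default is invented for illustration, and the EP/DP flag spellings are assumptions — only the pinned values come from the summary.

```shell
# Sketch of the launch described above (not the actual script).
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0  # skip the cudagraph-memory estimator

TP=8
MAX_MODEL_LEN="${MAX_MODEL_LEN:-163840}"  # runner-supplied in the real script; default is illustrative

CMD=(vllm serve deepseek-ai/DeepSeek-V4   # placeholder model id
  --tensor-parallel-size "$TP"
  --data-parallel-size "$TP"              # DP=$TP per the recipe
  --enable-expert-parallel                # EP
  --max-model-len "$MAX_MODEL_LEN"
  --gpu-memory-utilization 0.95
  --max-num-seqs 512
  --no-enable-flashinfer-autotune
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'
)

printf '%s ' "${CMD[@]}"; echo
```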

Test plan

  • Trigger the dsv4-fp8-h200-vllm-mtp benchmark workflow on an H200 runner and confirm the engine starts and the sweep completes for at least one cell from each of the two seq-len-configs.
  • Confirm vllm/vllm-openai:v0.20.1@sha256:9eff9734...23aa4 pulls cleanly.
  • server.log shows --speculative-config '{"method":"mtp","num_speculative_tokens":1}' and the rest of the H200 recipe flags.
  • Acceptance rate is in a sane range — --dsv4 is wired into run_benchmark_serving so the prompts go through the chat template.
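The server.log check above can be smoke-tested with a couple of greps. The helper below is a sketch (the log path and helper name are invented; the flag strings come from the test plan):

```shell
# Sketch: verify the MTP and recipe flags made it into the engine log.
# Takes the path to the log; the real log location is runner-specific.
check_log() {
  local log="$1"
  grep -q -- '--speculative-config' "$log"            || return 1
  grep -q '"method":"mtp"' "$log"                     || return 1
  grep -qF -- '--gpu-memory-utilization 0.95' "$log"  || return 1
  grep -qF -- '--max-num-seqs 512' "$log"             || return 1
  echo "recipe + MTP flags: OK"
}
```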

🤖 Generated with Claude Code

@github-actions

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first-class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@@ -0,0 +1,99 @@
#!/usr/bin/env bash

🔴 The MTP benchmark script is added at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh, but all three H200 launch scripts (runners/launch_h200-cw.sh:47, runners/launch_h200-nb.sh:22, runners/launch_h200-dgxc-slurm.sh:295) build the script path as benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh where FRAMEWORK_SUFFIX is empty for vllm — so they will look for dsv4_fp8_h200_mtp.sh and fail with 'No such file or directory' on every cell of the sweep. Unlike launch_b300-nv.sh, the H200 launchers have no framework-tagged-name fallback. Fix by either renaming the script to dsv4_fp8_h200_mtp.sh (matches the existing convention — see qwen3.5_fp8_h200_mtp.sh) or porting the B300 fallback logic to the H200 launchers.

Extended reasoning...

What the bug is

The PR adds a new vLLM MTP benchmark script at benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh and a corresponding dsv4-fp8-h200-vllm-mtp config in .github/configs/nvidia-master.yaml. However, the filename does not match what the H200 launch scripts will look for at runtime, so the workflow will hard-fail before vLLM ever starts.

How the launcher resolves the script path

All three H200 launch scripts build the benchmark script path the same way:

# runners/launch_h200-cw.sh:7-8, 47
MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')
...
bash benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh

runners/launch_h200-nb.sh:7-8,22 is identical, and runners/launch_h200-dgxc-slurm.sh:295 inlines the same construction.

FRAMEWORK_SUFFIX is _trt only when the framework is trt; for vllm (and sglang) it is empty. SPEC_SUFFIX is _mtp when SPEC_DECODING=mtp.

Step-by-step proof for the new config

For the new dsv4-fp8-h200-vllm-mtp entry:

  • model-prefix: dsv4
  • MODEL_CODE: dsv4 (from EXP_NAME="${model_code}_${seq_len_str}")
  • PRECISION: fp8
  • FRAMEWORK: vllm → FRAMEWORK_SUFFIX=""
  • SPEC_DECODING: mtp → SPEC_SUFFIX="_mtp"

So the resolved path is:

benchmarks/single_node/dsv4_fp8_h200_mtp.sh

But the PR added the file at:

benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh

bash will exit with No such file or directory, the runner will mark the cell as failed, and every cell of the new sweep (TP=8/EP=8, conc 4–64, both 1k1k and 8k1k) will fail before the engine starts.
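The resolution above can be reproduced in a few lines. The suffix logic is copied from the quoted launcher code; the seq-len part of EXP_NAME ("1k1k") is assumed for illustration:

```shell
# Reproduce the H200 launcher's script-path construction for this config.
EXP_NAME="dsv4_1k1k"   # ${model_code}_${seq_len_str}; seq-len part assumed
PRECISION="fp8"
FRAMEWORK="vllm"
SPEC_DECODING="mtp"

MODEL_CODE="${EXP_NAME%%_*}"
FRAMEWORK_SUFFIX=$([[ "$FRAMEWORK" == "trt" ]] && printf '_trt' || printf '')
SPEC_SUFFIX=$([[ "$SPEC_DECODING" == "mtp" ]] && printf '_mtp' || printf '')

SCRIPT="benchmarks/single_node/${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh"
echo "$SCRIPT"   # benchmarks/single_node/dsv4_fp8_h200_mtp.sh — no _vllm_ in the name
```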

Why the existing code does not save it

Unlike runners/launch_b300-nv.sh:267-272, which prefers a framework-tagged name and falls back to the legacy un-tagged name (this is exactly why dsv4_fp4_b300_vllm_mtp.sh and dsv4_fp4_b300_sglang_mtp.sh work on B300), the H200 launchers have no fallback — they construct one path and run it.

The existing H200 file naming convention confirms the expected name: every other vLLM/SGLang H200 MTP/non-MTP script in the tree omits the framework name (qwen3.5_fp8_h200_mtp.sh, dsr1_fp8_h200.sh, glm5_fp8_h200.sh, dsv4_fp8_h200.sh from this same series), and the only framework-tagged H200 scripts use _trt (dsr1_fp8_h200_trt_mtp.sh). The non-MTP counterpart in this PR's series — dsv4_fp8_h200.sh — already follows the no-suffix convention and works, which is itself evidence of the bug.

Impact and fix

This is a hard, deterministic PR-blocker: every cell of the new benchmark sweep fails to launch. Two fixes:

  1. Simplest: rename benchmarks/single_node/dsv4_fp8_h200_vllm_mtp.sh → benchmarks/single_node/dsv4_fp8_h200_mtp.sh to match the existing H200 convention.
  2. Or: port the B300-style framework-tagged-then-legacy-fallback logic to all three H200 launch scripts so framework-tagged filenames also work.
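Option 2 could look roughly like the following. This is an approximation modeled on the described B300 behavior, not the actual launch_b300-nv.sh code, and the helper name is invented:

```shell
# Sketch of option 2: prefer a framework-tagged script name, then fall
# back to the legacy un-tagged name (B300-style fallback, approximated).
resolve_bench_script() {
  local model_code="$1" precision="$2" framework="$3" spec_suffix="$4"
  local dir="benchmarks/single_node"
  local tagged="${dir}/${model_code}_${precision}_h200_${framework}${spec_suffix}.sh"
  local legacy="${dir}/${model_code}_${precision}_h200${spec_suffix}.sh"
  if [[ -f "$tagged" ]]; then
    echo "$tagged"
  else
    echo "$legacy"
  fi
}
```

With this in place, either filename would work: `resolve_bench_script dsv4 fp8 vllm _mtp` picks dsv4_fp8_h200_vllm_mtp.sh when it exists and falls back to dsv4_fp8_h200_mtp.sh otherwise.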

@functionstackx
Contributor Author

waiting for 0.20.1 vllm-project/vllm#41189 and vllm-project/vllm#41444

@functionstackx functionstackx changed the title Add DSv4 FP8 H200 vLLM MTP benchmark [waiting for bug fix to land in v0.20.1] Add DSv4 FP8 H200 vLLM MTP benchmark May 1, 2026
@functionstackx functionstackx force-pushed the claude/add-dsv4-fp8-h200-vllm-mtp branch 3 times, most recently from 4199098 to 07b3736 Compare May 4, 2026 20:06
functionstackx and others added 4 commits May 4, 2026 17:26
Mirror of dsv4-fp8-h200-vllm + --speculative-config
'{"method":"mtp","num_speculative_tokens":2}', so we get an MTP
counterpart of the existing H200 vLLM DeepSeek-V4-Pro recipe at
https://vllm.ai/blog/deepseek-v4.

- Image: vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0; the
  non-MTP entry is still on the deepseekv4-cu129 tag).
- Launch flags otherwise identical to dsv4_fp8_h200.sh: EP + DP=$TP,
  --gpu-memory-utilization 0.95, --max-num-seqs 512,
  --no-enable-flashinfer-autotune, FULL_DECODE_ONLY compile.
- run_benchmark_serving uses --dsv4 per the AGENTS.md MTP rule —
  EAGLE-style spec decoding regresses acceptance on raw random tokens.
- Search space mirrors the non-MTP H200 entry (TP=8, EP=8, DP-attn,
  CONC 4-64, both 1k1k and 8k1k) with spec-decoding: mtp.

Adds a perf-changelog entry to trigger the new config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The H200 runner (runners/launch_h200-cw.sh) constructs the script name
as ${MODEL_CODE}_${PRECISION}_h200${FRAMEWORK_SUFFIX}${SPEC_SUFFIX}.sh
where FRAMEWORK_SUFFIX is empty for vllm — so it expects
benchmarks/single_node/dsv4_fp8_h200_mtp.sh, not the framework-named
dsv4_fp8_h200_vllm_mtp.sh.

Run 12597 failed with "No such file or directory"; rename to fix it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…um_speculative_tokens=1

- Export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=0 before vllm serve.
  The estimator overshoots H200 + MTP at memory-profile time and pushes
  us over budget even though actual cudagraph capture works fine.
- Drop num_speculative_tokens from 2 to 1 for now; bring it back up
  once we have a stable baseline on this image.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…dcoded 800k

Take the max-model-len from the runner-supplied MAX_MODEL_LEN env var
(added to check_env_vars) so the value is set centrally per config
instead of pinned in the script. Eval-only path is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx and others added 2 commits May 4, 2026 17:26
v0.20.1 contains the bug fix the PR was waiting on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/add-dsv4-fp8-h200-vllm-mtp branch from 07b3736 to dd982cd Compare May 4, 2026 21:26

github-actions Bot commented May 4, 2026

@functionstackx functionstackx changed the title [waiting for bug fix to land in v0.20.1] Add DSv4 FP8 H200 vLLM MTP benchmark Add DSv4 FP8 H200 vLLM MTP benchmark May 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot May 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot May 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot May 4, 2026
@SemiAnalysisAI SemiAnalysisAI deleted a comment from github-actions Bot May 4, 2026
@functionstackx functionstackx merged commit c898aeb into main May 4, 2026
62 of 63 checks passed
@functionstackx functionstackx deleted the claude/add-dsv4-fp8-h200-vllm-mtp branch May 4, 2026 22:20
