
dsv4-fp4-b200-vllm: bump to vllm v0.20.0, deep_gemm_mega_moe MoE #1204

Open
functionstackx wants to merge 8 commits into main from claude/dsv4-fp4-b200-vllm-v0.20.0

Conversation

@functionstackx
Contributor

@functionstackx functionstackx commented Apr 28, 2026

Summary

  • Pin dsv4-fp4-b200-vllm image to vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0 tag) — replaces the floating deepseekv4-cu130 tag. DeepGEMM is preinstalled in this image, so no install_deepgemm.sh step is needed.
  • Update launch flags to the v0.20.0 DeepSeek-V4-Pro recipe.
  • Split compilation-config and MoE backend by DP_ATTENTION, since the recipe diverges between the TP-only / TP+EP path and the DP-attn + EP path (see the sketch after this list):

    | search-space entry | DP_ATTENTION | EP | compilation-config | moe-backend |
    | --- | --- | --- | --- | --- |
    | {tp:8} (low-latency) | false | disabled | {"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"} | (default) |
    | {tp:8, ep:8} (mid) | false | enabled | {"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"} | (default) |
    | {tp:8, ep:8, dp-attn:true} (max-thru) | true | enabled | {"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]} | deep_gemm_mega_moe |
  • Use --attention_config.use_fp4_indexer_cache=True (written in the = form, as the recipe does).
  • Drop --pipeline-parallel-size 1 and --max-cudagraph-capture-size 2048 (no longer in the v0.20.0 recipe). Keep --no-enable-prefix-caching to match the other vLLM single-node benchmark scripts so cross-request cache hits don't skew steady-state throughput.
  • PARALLEL_ARGS / EP_ARGS / GMU_ARGS / MOE_ARGS arrays are preserved so the existing DP_ATTENTION / EP_SIZE search-space branching drives --tensor-parallel-size / --data-parallel-size / --enable-expert-parallel / --moe-backend cleanly.
  • Adds a perf-changelog.yaml entry to trigger the affected configs.
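
A minimal sketch of how the DP_ATTENTION gate could drive these arrays (DP_ATTENTION, EP_SIZE, and the *_ARGS array names follow the PR description; $MODEL, the TP_SIZE default, and the overall script shape are assumptions, not the actual benchmark script):

```bash
#!/usr/bin/env bash
# Sketch only: flag values come from this PR; structure and $MODEL are assumed.
COMPILE_ARGS=()
EP_ARGS=()
MOE_ARGS=()

# Expert parallelism is enabled for the EP_SIZE>1 search-space entries.
if [[ "${EP_SIZE:-1}" -gt 1 ]]; then
  EP_ARGS+=(--enable-expert-parallel)
fi

if [[ "${DP_ATTENTION:-false}" == "true" ]]; then
  # Max-throughput path (DP-attn + EP): piecewise cudagraphs + mega-MoE backend.
  COMPILE_ARGS+=(--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}')
  MOE_ARGS+=(--moe-backend deep_gemm_mega_moe)
else
  # TP-only / TP+EP path: decode-only cudagraphs, default MoE backend.
  COMPILE_ARGS+=(--compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}')
fi

vllm serve "$MODEL" \
  --tensor-parallel-size "${TP_SIZE:-8}" \
  "${EP_ARGS[@]}" \
  "${COMPILE_ARGS[@]}" \
  "${MOE_ARGS[@]}" \
  --attention_config.use_fp4_indexer_cache=True \
  --no-enable-prefix-caching
```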

Test plan

  • Trigger the dsv4-fp4-b200-vllm benchmark workflow on a B200 runner and confirm the engine starts and the sweep completes for at least one cell in each of the three search-space classes (TP-only, TP+EP, DP-attn+EP).
  • Confirm the v0.20.0-cu130 image pulls and DeepGEMM is already importable inside the container (no install step at runtime).
  • Spot-check server.log to verify the right --compilation-config / --moe-backend combination is logged for each branch:
    • DP_ATTENTION=false: --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}', no --moe-backend.
    • DP_ATTENTION=true: --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' --moe-backend deep_gemm_mega_moe.
  • Confirm --no-enable-prefix-caching is still present in all branches.
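
For the image and log spot-checks, something along these lines should do it (the import-check invocation and the log path are assumptions; server.log is the name used in the test plan above):

```bash
# Check DeepGEMM is importable in the pinned image with no runtime install:
docker run --rm --entrypoint python3 vllm/vllm-openai:v0.20.0-cu130 \
  -c "import deep_gemm; print('deep_gemm importable')"

# Surface which flag combination each branch actually launched with:
grep -E -e '--compilation-config|--moe-backend|--no-enable-prefix-caching' server.log
```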

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


@functionstackx functionstackx force-pushed the claude/dsv4-fp4-b200-vllm-v0.20.0 branch from 1872f6c to f28dfcc on April 28, 2026 01:29
functionstackx added a commit that referenced this pull request Apr 28, 2026
Per AGENTS.md ("Updating Docker Images"), image bumps and recipe
changes need a perf-changelog entry to trigger the affected configs'
benchmarks and record the change. Adds the entry for #1204:
v0.20.0-x86_64-cu130-ubuntu2404 image, DeepGEMM install step,
new compilation-config / use_fp4_indexer_cache=True / deep_gemm_mega_moe
launch flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx added a commit that referenced this pull request Apr 28, 2026
CI failed in PR #1204 with "git: command not found" because the
v0.20.0 vllm image (vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404)
doesn't ship git, but install_deepgemm.sh git-clones the DeepGEMM
repo. Install git via apt-get before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
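
The fix boils down to a guard like this before the install script (a sketch; it assumes the benchmark runs as root inside the container, so no sudo):

```bash
# The slim image ships without git, which install_deepgemm.sh needs in order
# to clone the DeepGEMM repo; install it first.
if ! command -v git >/dev/null 2>&1; then
  apt-get update
  apt-get install -y --no-install-recommends git
fi
bash install_deepgemm.sh
```
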
functionstackx added a commit that referenced this pull request Apr 28, 2026
Same fix as PR #1204: the v0.20.0 vllm image
(vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404) doesn't ship git,
but install_deepgemm.sh git-clones the DeepGEMM repo. Install git via
apt-get before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx
Contributor Author

Talked to vLLM maintainer esmeetu, and he said the extra DeepGEMM install is only needed for pip wheels; the Docker image already ships the DeepGEMM mega-MoE kernels.



functionstackx and others added 8 commits April 28, 2026 02:28
Pin the image to vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404 (in
place of the floating deepseekv4-cu130 tag) and install DeepGEMM from
the v0.20.0 tools script before launching the engine.

Update launch flags per the v0.20.0 DeepSeek-V4-Pro recipe:
- compilation-config -> {"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}
- --attention_config.use_fp4_indexer_cache=True (= form)
- add --moe-backend deep_gemm_mega_moe
- drop --pipeline-parallel-size 1, --no-enable-prefix-caching, and
  --max-cudagraph-capture-size 2048 (no longer in the recipe)

PARALLEL_ARGS, EP_ARGS, and GMU_ARGS are preserved so DP_ATTENTION /
EP_SIZE branching keeps working across the existing search-space
entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore the prefix-caching disable that the previous launch had,
matching the other vLLM B200 benchmark scripts (gptoss, minimaxm2.5)
so cross-request cache hits don't skew steady-state throughput.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per AGENTS.md ("Updating Docker Images"), image bumps and recipe
changes need a perf-changelog entry to trigger the affected configs'
benchmarks and record the change. Adds the entry for #1204:
v0.20.0-x86_64-cu130-ubuntu2404 image, DeepGEMM install step,
new compilation-config / use_fp4_indexer_cache=True / deep_gemm_mega_moe
launch flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move --moe-backend deep_gemm_mega_moe into EP_ARGS so it only takes
effect when expert parallelism is enabled (EP_SIZE>1). The
deep_gemm_mega_moe backend isn't applicable in TP-only configs, so
applying it unconditionally changed behavior for the small-batch
TP-only search-space entries.

Update the perf-changelog entry to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failed in PR #1204 with "git: command not found" because the
v0.20.0 vllm image (vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404)
doesn't ship git, but install_deepgemm.sh git-clones the DeepGEMM
repo. Install git via apt-get before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The slim x86_64-cu130-ubuntu2404 tag ships without git and without
CUDA library dev headers, which made install_deepgemm.sh fail twice
(first on git, then on cusparse.h while compiling DeepGEMM's torch
extension).

Switch to the canonical vllm/vllm-openai:v0.20.0-cu130 tag, which
includes the dev tooling needed to compile torch extensions, and drop
the workaround apt-get-install-git step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canonical vllm/vllm-openai:v0.20.0-cu130 image already ships with
DeepGEMM, so installing it from source at benchmark time is redundant
(and was the root of the recent CI failures: missing git, then missing
cusparse.h dev headers when building DeepGEMM's torch extension).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the vLLM v0.20.0 DeepSeek-V4-Pro recipe, the cudagraph mode and
the MoE backend differ between the TP-only / TP+EP path and the
DP-attn + EP path. Move --moe-backend deep_gemm_mega_moe out of
EP_ARGS (it doesn't apply to plain TP+EP) and gate it together with
the cudagraph mode on DP_ATTENTION:

- DP_ATTENTION=false (TP-only or TP+EP):
    --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
    no --moe-backend flag (default MoE backend)
- DP_ATTENTION=true (DP-attn + EP):
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}'
    --moe-backend deep_gemm_mega_moe

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/dsv4-fp4-b200-vllm-v0.20.0 branch from dfb4304 to cb08406 on April 28, 2026 06:28