
dsv4-fp4-b300-vllm: bump to vllm v0.20.0, deep_gemm_mega_moe MoE #1206

Open
functionstackx wants to merge 5 commits into main from claude/dsv4-fp4-b300-vllm-v0.20.0

Conversation

@functionstackx
Contributor

Summary

B300 counterpart of #1204.

  • Pin dsv4-fp4-b300-vllm to vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404 (replaces floating deepseekv4-cu130 tag).
  • Install DeepGEMM from the v0.20.0 tools script before launching the engine: bash <(curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/v0.20.0/tools/install_deepgemm.sh).
  • Update launch flags to the v0.20.0 DeepSeek-V4-Pro recipe: a new --compilation-config ({"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}), --attention_config.use_fp4_indexer_cache=True, and --moe-backend deep_gemm_mega_moe; drop --pipeline-parallel-size 1 and --max-cudagraph-capture-size 2048 (see the launch sketch after this list).
  • Keep --no-enable-prefix-caching to match the other vLLM single-node benchmark scripts.
  • PARALLEL_ARGS / EP_ARGS arrays are preserved so the existing DP_ATTENTION / EP_SIZE search-space branching still drives --tensor-parallel-size / --data-parallel-size / --enable-expert-parallel.
  • Adds a perf-changelog.yaml entry to trigger the affected configs.
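
For orientation, a minimal sketch of the resulting launch invocation is below. The flags are the ones listed above; MODEL, PORT, and the contents of PARALLEL_ARGS / EP_ARGS are illustrative placeholders for values the real script derives from the search-space entry, not the script's actual defaults.

```bash
# Sketch only: flag set per the PR description above. MODEL/PORT and the
# PARALLEL_ARGS / EP_ARGS contents are illustrative placeholders.
MODEL="${MODEL:-deepseek-ai/DeepSeek-V4-Pro}"   # assumed model id
PORT="${PORT:-8000}"
PARALLEL_ARGS=(--tensor-parallel-size 8)        # filled in by the DP_ATTENTION branching
EP_ARGS=()                                      # populated when EP_SIZE > 1

vllm serve "$MODEL" \
  --port "$PORT" \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --attention_config.use_fp4_indexer_cache=True \
  --moe-backend deep_gemm_mega_moe \
  --no-enable-prefix-caching \
  "${PARALLEL_ARGS[@]}" \
  "${EP_ARGS[@]}"
```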

Test plan

  • Trigger the dsv4-fp4-b300-vllm benchmark workflow on a B300 runner and confirm the engine starts and the sweep completes for at least one (ISL, OSL, CONC, DP_ATTENTION) cell.
  • Confirm the v0.20.0 image pulls and install_deepgemm.sh succeeds inside the container.
  • Spot-check server.log for the new flags (--moe-backend deep_gemm_mega_moe, new --compilation-config, --attention_config.use_fp4_indexer_cache=True) and that --no-enable-prefix-caching is still present.
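
As a rough illustration of that last spot-check, assuming the engine's arguments are echoed into server.log:

```bash
# Hedged sketch: the log path and exact argv formatting are assumptions.
grep -E 'moe-backend deep_gemm_mega_moe|use_fp4_indexer_cache=True|FULL_DECODE_ONLY|no-enable-prefix-caching' server.log
```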

🤖 Generated with Claude Code

Mirrors the B200 v0.20.0 update (#1204) for the B300 config.

Pin the image to vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404 (in
place of the floating deepseekv4-cu130 tag) and install DeepGEMM from
the v0.20.0 tools script before launching the engine.

Update launch flags per the v0.20.0 DeepSeek-V4-Pro recipe:
- compilation-config -> {"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}
- --attention_config.use_fp4_indexer_cache=True (= form)
- add --moe-backend deep_gemm_mega_moe
- drop --pipeline-parallel-size 1 and --max-cudagraph-capture-size 2048

PARALLEL_ARGS and EP_ARGS are preserved so DP_ATTENTION / EP_SIZE
branching keeps working across the existing search-space entries.
--no-enable-prefix-caching is retained to match the other vLLM
single-node benchmark scripts.

Adds a perf-changelog entry to trigger the affected configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

functionstackx and others added 2 commits April 27, 2026 21:34
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move --moe-backend deep_gemm_mega_moe into EP_ARGS so it only takes
effect when expert parallelism is enabled (EP_SIZE>1). The
deep_gemm_mega_moe backend isn't applicable in TP-only configs, so
applying it unconditionally changed behavior for the small-batch
TP-only search-space entries.

Update the perf-changelog entry to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
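
A minimal sketch of that EP_ARGS move, with the shape of the EP_SIZE guard assumed (the variable and flag names come from the PR text):

```bash
EP_ARGS=()
if [[ "${EP_SIZE:-1}" -gt 1 ]]; then
  # deep_gemm_mega_moe only applies with expert parallelism, so the MoE
  # backend flag now rides along with --enable-expert-parallel rather than
  # sitting in the base command line.
  EP_ARGS+=(--enable-expert-parallel --moe-backend deep_gemm_mega_moe)
fi
```
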
functionstackx and others added 2 commits April 27, 2026 23:09
Same fix as PR #1204: the v0.20.0 vllm image
(vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404) doesn't ship git,
but install_deepgemm.sh git-clones the DeepGEMM repo. Install git via
apt-get before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
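
Sketched below is the workaround as described in this commit (it was dropped again in the following commit in favor of the canonical image):

```bash
# Assumed to run inside the container before launching the engine.
apt-get update && apt-get install -y --no-install-recommends git
bash <(curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/v0.20.0/tools/install_deepgemm.sh)
```
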
Mirror of the B200 fix: switch to the canonical
vllm/vllm-openai:v0.20.0-cu130 image, which already ships with
DeepGEMM preinstalled, and drop both the apt-get-install-git step
and the install_deepgemm.sh invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
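
So the final state is simply a pin to the canonical tag with no install preamble; the variable name below is illustrative:

```bash
# Ships DeepGEMM preinstalled per the commit above, so no extra setup step.
VLLM_IMAGE="vllm/vllm-openai:v0.20.0-cu130"
docker pull "$VLLM_IMAGE"
```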