
dsv4-fp4-b300-vllm: bump to vllm v0.20.0, deep_gemm_mega_moe MoE #1206

Open
functionstackx wants to merge 5 commits into main from claude/dsv4-fp4-b300-vllm-v0.20.0

Conversation

@functionstackx
Contributor

Summary

B300 counterpart of #1204.

  • Pin dsv4-fp4-b300-vllm to vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404 (replaces floating deepseekv4-cu130 tag).
  • Install DeepGEMM from the v0.20.0 tools script before launching the engine: bash <(curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/v0.20.0/tools/install_deepgemm.sh).
  • Update launch flags to the v0.20.0 DeepSeek-V4-Pro recipe: a new --compilation-config ({"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}), --attention_config.use_fp4_indexer_cache=True, and --moe-backend deep_gemm_mega_moe; drop --pipeline-parallel-size 1 and --max-cudagraph-capture-size 2048 (see the launch sketch after this list).
  • Keep --no-enable-prefix-caching to match the other vLLM single-node benchmark scripts.
  • PARALLEL_ARGS / EP_ARGS arrays are preserved so the existing DP_ATTENTION / EP_SIZE search-space branching still drives --tensor-parallel-size / --data-parallel-size / --enable-expert-parallel.
  • Adds a perf-changelog.yaml entry to trigger the affected configs.
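
For orientation, a minimal sketch of the resulting launch invocation is below. The flags are the ones listed above; MODEL, PORT, and the contents of PARALLEL_ARGS / EP_ARGS are illustrative placeholders for values the real script derives from the search-space entry, not the script's actual defaults.

```bash
# Sketch only: flag set per the PR description above. MODEL/PORT and the
# PARALLEL_ARGS / EP_ARGS contents are illustrative placeholders.
MODEL="${MODEL:-deepseek-ai/DeepSeek-V4-Pro}"   # assumed model id
PORT="${PORT:-8000}"
PARALLEL_ARGS=(--tensor-parallel-size 8)        # filled in by the DP_ATTENTION branching
EP_ARGS=()                                      # populated when EP_SIZE > 1

vllm serve "$MODEL" \
  --port "$PORT" \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --attention_config.use_fp4_indexer_cache=True \
  --moe-backend deep_gemm_mega_moe \
  --no-enable-prefix-caching \
  "${PARALLEL_ARGS[@]}" \
  "${EP_ARGS[@]}"
```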

Test plan

  • Trigger the dsv4-fp4-b300-vllm benchmark workflow on a B300 runner and confirm the engine starts and the sweep completes for at least one (ISL, OSL, CONC, DP_ATTENTION) cell.
  • Confirm the v0.20.0 image pulls and install_deepgemm.sh succeeds inside the container.
  • Spot-check server.log for the new flags (--moe-backend deep_gemm_mega_moe, new --compilation-config, --attention_config.use_fp4_indexer_cache=True) and that --no-enable-prefix-caching is still present.
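
As a rough illustration of that last spot-check, assuming the engine's arguments are echoed into server.log:

```bash
# Hedged sketch: the log path and exact argv formatting are assumptions.
grep -E 'moe-backend deep_gemm_mega_moe|use_fp4_indexer_cache=True|FULL_DECODE_ONLY|no-enable-prefix-caching' server.log
```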

🤖 Generated with Claude Code

Mirrors the B200 v0.20.0 update (#1204) for the B300 config.

Pin the image to vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404 (in
place of the floating deepseekv4-cu130 tag) and install DeepGEMM from
the v0.20.0 tools script before launching the engine.

Update launch flags per the v0.20.0 DeepSeek-V4-Pro recipe:
- compilation-config -> {"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}
- --attention_config.use_fp4_indexer_cache=True (= form)
- add --moe-backend deep_gemm_mega_moe
- drop --pipeline-parallel-size 1 and --max-cudagraph-capture-size 2048

PARALLEL_ARGS and EP_ARGS are preserved so DP_ATTENTION / EP_SIZE
branching keeps working across the existing search-space entries.
--no-enable-prefix-caching is retained to match the other vLLM
single-node benchmark scripts.

Adds a perf-changelog entry to trigger the affected configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

functionstackx and others added 2 commits April 27, 2026 21:34
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move --moe-backend deep_gemm_mega_moe into EP_ARGS so it only takes
effect when expert parallelism is enabled (EP_SIZE>1). The
deep_gemm_mega_moe backend isn't applicable in TP-only configs, so
applying it unconditionally changed behavior for the small-batch
TP-only search-space entries.

Update the perf-changelog entry to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
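
A minimal sketch of that EP_ARGS move, with the shape of the EP_SIZE guard assumed (the variable and flag names come from the PR text):

```bash
EP_ARGS=()
if [[ "${EP_SIZE:-1}" -gt 1 ]]; then
  # deep_gemm_mega_moe only applies with expert parallelism, so the MoE
  # backend flag now rides along with --enable-expert-parallel rather than
  # sitting in the base command line.
  EP_ARGS+=(--enable-expert-parallel --moe-backend deep_gemm_mega_moe)
fi
```
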
functionstackx and others added 2 commits April 27, 2026 23:09
Same fix as PR #1204: the v0.20.0 vllm image
(vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404) doesn't ship git,
but install_deepgemm.sh git-clones the DeepGEMM repo. Install git via
apt-get before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
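
Sketched below is the workaround as described in this commit (it was dropped again in the following commit in favor of the canonical image):

```bash
# Assumed to run inside the container before launching the engine.
apt-get update && apt-get install -y --no-install-recommends git
bash <(curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/v0.20.0/tools/install_deepgemm.sh)
```
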
Mirror of the B200 fix: switch to the canonical
vllm/vllm-openai:v0.20.0-cu130 image, which already ships with
DeepGEMM preinstalled, and drop both the apt-get-install-git step
and the install_deepgemm.sh invocation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
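
So the final state is simply a pin to the canonical tag with no install preamble; the variable name below is illustrative:

```bash
# Ships DeepGEMM preinstalled per the commit above, so no extra setup step.
VLLM_IMAGE="vllm/vllm-openai:v0.20.0-cu130"
docker pull "$VLLM_IMAGE"
```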