
dsv4-fp4-b200-vllm: bump to vllm v0.20.0, deep_gemm_mega_moe MoE #1204

Open
functionstackx wants to merge 8 commits into main from claude/dsv4-fp4-b200-vllm-v0.20.0

Conversation

@functionstackx
Contributor

@functionstackx functionstackx commented Apr 28, 2026

Summary

  • Pin dsv4-fp4-b200-vllm image to vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0 tag) — replaces the floating deepseekv4-cu130 tag. DeepGEMM is preinstalled in this image, so no install_deepgemm.sh step is needed.
  • Update launch flags to the v0.20.0 DeepSeek-V4-Pro recipe.
  • Split compilation-config and MoE backend by DP_ATTENTION, since the recipe diverges between the TP-only / TP+EP path and the DP-attn + EP path (see the sketch after this list):

    | search-space entry | DP_ATTENTION | EP | compilation-config | moe-backend |
    | --- | --- | --- | --- | --- |
    | {tp:8} (low-latency) | false | disabled | {"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"} | (default) |
    | {tp:8, ep:8} (mid) | false | enabled | {"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"} | (default) |
    | {tp:8, ep:8, dp-attn:true} (max-thru) | true | enabled | {"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]} | deep_gemm_mega_moe |
  • Use --attention_config.use_fp4_indexer_cache=True (written in the = form, as the recipe does).
  • Drop --pipeline-parallel-size 1 and --max-cudagraph-capture-size 2048 (no longer in the v0.20.0 recipe). Keep --no-enable-prefix-caching to match the other vLLM single-node benchmark scripts so cross-request cache hits don't skew steady-state throughput.
  • PARALLEL_ARGS / EP_ARGS / GMU_ARGS / MOE_ARGS arrays are preserved so the existing DP_ATTENTION / EP_SIZE search-space branching drives --tensor-parallel-size / --data-parallel-size / --enable-expert-parallel / --moe-backend cleanly.
  • Adds a perf-changelog.yaml entry to trigger the affected configs.
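
A minimal sketch of how the DP_ATTENTION gate could drive these arrays (DP_ATTENTION, EP_SIZE, and the *_ARGS array names follow the PR description; $MODEL, the TP_SIZE default, and the overall script shape are assumptions, not the actual benchmark script):

```bash
#!/usr/bin/env bash
# Sketch only: flag values come from this PR; structure and $MODEL are assumed.
COMPILE_ARGS=()
EP_ARGS=()
MOE_ARGS=()

# Expert parallelism is enabled for the EP_SIZE>1 search-space entries.
if [[ "${EP_SIZE:-1}" -gt 1 ]]; then
  EP_ARGS+=(--enable-expert-parallel)
fi

if [[ "${DP_ATTENTION:-false}" == "true" ]]; then
  # Max-throughput path (DP-attn + EP): piecewise cudagraphs + mega-MoE backend.
  COMPILE_ARGS+=(--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}')
  MOE_ARGS+=(--moe-backend deep_gemm_mega_moe)
else
  # TP-only / TP+EP path: decode-only cudagraphs, default MoE backend.
  COMPILE_ARGS+=(--compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}')
fi

vllm serve "$MODEL" \
  --tensor-parallel-size "${TP_SIZE:-8}" \
  "${EP_ARGS[@]}" \
  "${COMPILE_ARGS[@]}" \
  "${MOE_ARGS[@]}" \
  --attention_config.use_fp4_indexer_cache=True \
  --no-enable-prefix-caching
```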

Test plan

  • Trigger the dsv4-fp4-b200-vllm benchmark workflow on a B200 runner and confirm the engine starts and the sweep completes for at least one cell in each of the three search-space classes (TP-only, TP+EP, DP-attn+EP).
  • Confirm the v0.20.0-cu130 image pulls and DeepGEMM is already importable inside the container (no install step at runtime).
  • Spot-check server.log to verify the right --compilation-config / --moe-backend combination is logged for each branch:
    • DP_ATTENTION=false: --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}', no --moe-backend.
    • DP_ATTENTION=true: --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' --moe-backend deep_gemm_mega_moe.
  • Confirm --no-enable-prefix-caching is still present in all branches.
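
For the image and log spot-checks, something along these lines should do it (the import-check invocation and the log path are assumptions; server.log is the name used in the test plan above):

```bash
# Check DeepGEMM is importable in the pinned image with no runtime install:
docker run --rm --entrypoint python3 vllm/vllm-openai:v0.20.0-cu130 \
  -c "import deep_gemm; print('deep_gemm importable')"

# Surface which flag combination each branch actually launched with:
grep -E -e '--compilation-config|--moe-backend|--no-enable-prefix-caching' server.log
```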

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


@functionstackx functionstackx force-pushed the claude/dsv4-fp4-b200-vllm-v0.20.0 branch from 1872f6c to f28dfcc on April 28, 2026 01:29
functionstackx added a commit that referenced this pull request Apr 28, 2026
Per AGENTS.md ("Updating Docker Images"), image bumps and recipe
changes need a perf-changelog entry to trigger the affected configs'
benchmarks and record the change. Adds the entry for #1204:
v0.20.0-x86_64-cu130-ubuntu2404 image, DeepGEMM install step,
new compilation-config / use_fp4_indexer_cache=True / deep_gemm_mega_moe
launch flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
functionstackx added a commit that referenced this pull request Apr 28, 2026
CI failed in PR #1204 with "git: command not found" because the
v0.20.0 vllm image (vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404)
doesn't ship git, but install_deepgemm.sh git-clones the DeepGEMM
repo. Install git via apt-get before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
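
The fix boils down to a guard like this before the install script (a sketch; it assumes the benchmark runs as root inside the container, so no sudo):

```bash
# The slim image ships without git, which install_deepgemm.sh needs in order
# to clone the DeepGEMM repo; install it first.
if ! command -v git >/dev/null 2>&1; then
  apt-get update
  apt-get install -y --no-install-recommends git
fi
bash install_deepgemm.sh
```
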
functionstackx added a commit that referenced this pull request Apr 28, 2026
Same fix as PR #1204: the v0.20.0 vllm image
(vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404) doesn't ship git,
but install_deepgemm.sh git-clones the DeepGEMM repo. Install git via
apt-get before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx
Contributor Author

Talked to vLLM maintainer esmeetu, and he said the extra DeepGEMM install is only needed for pip wheels; the Docker image already ships the DeepGEMM mega-MoE kernels.



functionstackx and others added 8 commits April 28, 2026 02:28
Pin the image to vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404 (in
place of the floating deepseekv4-cu130 tag) and install DeepGEMM from
the v0.20.0 tools script before launching the engine.

Update launch flags per the v0.20.0 DeepSeek-V4-Pro recipe:
- compilation-config -> {"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}
- --attention_config.use_fp4_indexer_cache=True (= form)
- add --moe-backend deep_gemm_mega_moe
- drop --pipeline-parallel-size 1, --no-enable-prefix-caching, and
  --max-cudagraph-capture-size 2048 (no longer in the recipe)

PARALLEL_ARGS, EP_ARGS, and GMU_ARGS are preserved so DP_ATTENTION /
EP_SIZE branching keeps working across the existing search-space
entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Restore the prefix-caching disable that the previous launch had,
matching the other vLLM B200 benchmark scripts (gptoss, minimaxm2.5)
so cross-request cache hits don't skew steady-state throughput.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per AGENTS.md ("Updating Docker Images"), image bumps and recipe
changes need a perf-changelog entry to trigger the affected configs'
benchmarks and record the change. Adds the entry for #1204:
v0.20.0-x86_64-cu130-ubuntu2404 image, DeepGEMM install step,
new compilation-config / use_fp4_indexer_cache=True / deep_gemm_mega_moe
launch flags.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move --moe-backend deep_gemm_mega_moe into EP_ARGS so it only takes
effect when expert parallelism is enabled (EP_SIZE>1). The
deep_gemm_mega_moe backend isn't applicable in TP-only configs, so
applying it unconditionally changed behavior for the small-batch
TP-only search-space entries.

Update the perf-changelog entry to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failed in PR #1204 with "git: command not found" because the
v0.20.0 vllm image (vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404)
doesn't ship git, but install_deepgemm.sh git-clones the DeepGEMM
repo. Install git via apt-get before invoking the script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The slim x86_64-cu130-ubuntu2404 tag ships without git and without
CUDA library dev headers, which made install_deepgemm.sh fail twice
(first on git, then on cusparse.h while compiling DeepGEMM's torch
extension).

Switch to the canonical vllm/vllm-openai:v0.20.0-cu130 tag, which
includes the dev tooling needed to compile torch extensions, and drop
the workaround apt-get-install-git step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canonical vllm/vllm-openai:v0.20.0-cu130 image already ships with
DeepGEMM, so installing it from source at benchmark time is redundant
(and was the root of the recent CI failures: missing git, then missing
cusparse.h dev headers when building DeepGEMM's torch extension).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per the vLLM v0.20.0 DeepSeek-V4-Pro recipe, the cudagraph mode and
the MoE backend differ between the TP-only / TP+EP path and the
DP-attn + EP path. Move --moe-backend deep_gemm_mega_moe out of
EP_ARGS (it doesn't apply to plain TP+EP) and gate it together with
the cudagraph mode on DP_ATTENTION:

- DP_ATTENTION=false (TP-only or TP+EP):
    --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
    no --moe-backend flag (default MoE backend)
- DP_ATTENTION=true (DP-attn + EP):
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}'
    --moe-backend deep_gemm_mega_moe

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/dsv4-fp4-b200-vllm-v0.20.0 branch from dfb4304 to cb08406 on April 28, 2026 06:28