dsv4-fp4-b200-vllm: bump to vllm v0.20.0, deep_gemm_mega_moe MoE #1204
functionstackx wants to merge 8 commits into main
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
functionstackx force-pushed from 1872f6c to f28dfcc
Per AGENTS.md ("Updating Docker Images"), image bumps and recipe
changes need a perf-changelog entry to trigger the affected configs'
benchmarks and record the change. Adds the entry for #1204:
v0.20.0-x86_64-cu130-ubuntu2404 image, DeepGEMM install step,
new compilation-config / use_fp4_indexer_cache=True / deep_gemm_mega_moe
launch flags.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CI failed in PR #1204 with "git: command not found" because the v0.20.0 vllm image (vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404) doesn't ship git, but install_deepgemm.sh git-clones the DeepGEMM repo. Install git via apt-get before invoking the script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same fix as PR #1204: the v0.20.0 vllm image (vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404) doesn't ship git, but install_deepgemm.sh git-clones the DeepGEMM repo. Install git via apt-get before invoking the script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
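For illustration, a minimal sketch of the workaround these two commits describe, assuming the benchmark script runs as root inside the container and that install_deepgemm.sh sits under a tools/ directory (the exact path is an assumption):

```bash
# Workaround sketch: the slim v0.20.0 image ships without git, so install it
# before install_deepgemm.sh tries to git-clone the DeepGEMM repo.
apt-get update && apt-get install -y --no-install-recommends git
bash tools/install_deepgemm.sh  # path is illustrative
```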
Pin the image to vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404 (in
place of the floating deepseekv4-cu130 tag) and install DeepGEMM from
the v0.20.0 tools script before launching the engine.
Update launch flags per the v0.20.0 DeepSeek-V4-Pro recipe:
- compilation-config -> {"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}
- --attention_config.use_fp4_indexer_cache=True (= form)
- add --moe-backend deep_gemm_mega_moe
- drop --pipeline-parallel-size 1, --no-enable-prefix-caching, and
--max-cudagraph-capture-size 2048 (no longer in the recipe)
PARALLEL_ARGS, EP_ARGS, and GMU_ARGS are preserved so DP_ATTENTION /
EP_SIZE branching keeps working across the existing search-space
entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
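Roughly, the flag set after this commit looks like the sketch below; the vllm serve invocation, $MODEL, and the surrounding script structure are assumptions, and only the flags themselves come from this commit (note that --no-enable-prefix-caching is restored in a later commit):

```bash
# Illustrative only — the real script assembles these flags around the existing
# PARALLEL_ARGS / EP_ARGS / GMU_ARGS search-space branching.
vllm serve "$MODEL" \
  "${PARALLEL_ARGS[@]}" "${EP_ARGS[@]}" "${GMU_ARGS[@]}" \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --attention_config.use_fp4_indexer_cache=True \
  --moe-backend deep_gemm_mega_moe
```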
Restore the prefix-caching disable that the previous launch had, matching the other vLLM B200 benchmark scripts (gptoss, minimaxm2.5) so cross-request cache hits don't skew steady-state throughput. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move --moe-backend deep_gemm_mega_moe into EP_ARGS so it only takes effect when expert parallelism is enabled (EP_SIZE>1). The deep_gemm_mega_moe backend isn't applicable in TP-only configs, so applying it unconditionally changed behavior for the small-batch TP-only search-space entries. Update the perf-changelog entry to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
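A sketch of what this gating might look like; EP_ARGS and EP_SIZE are the names used by the existing branching, but the exact variable handling in the script is an assumption:

```bash
# Sketch: only add the MoE backend when expert parallelism is enabled.
# Note: superseded by DP_ATTENTION-based gating in a later commit.
EP_ARGS=()
if [ "${EP_SIZE:-1}" -gt 1 ]; then
  EP_ARGS+=(--enable-expert-parallel --moe-backend deep_gemm_mega_moe)
fi
```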
The slim x86_64-cu130-ubuntu2404 tag ships without git and without CUDA library dev headers, which made install_deepgemm.sh fail twice (first on git, then on cusparse.h while compiling DeepGEMM's torch extension). Switch to the canonical vllm/vllm-openai:v0.20.0-cu130 tag, which includes the dev tooling needed to compile torch extensions, and drop the workaround apt-get-install-git step. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The canonical vllm/vllm-openai:v0.20.0-cu130 image already ships with DeepGEMM, so installing it from source at benchmark time is redundant (and was the root of the recent CI failures: missing git, then missing cusparse.h dev headers when building DeepGEMM's torch extension). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
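One quick way to sanity-check this locally is sketched below; the deep_gemm import name and the use of --entrypoint to bypass the image's API-server entrypoint are assumptions:

```bash
# Sketch: confirm DeepGEMM already ships in the canonical image, so no source build is needed.
docker run --rm --entrypoint python3 vllm/vllm-openai:v0.20.0-cu130 \
  -c "import deep_gemm; print(deep_gemm.__file__)"
```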
Per the vLLM v0.20.0 DeepSeek-V4-Pro recipe, the cudagraph mode and
the MoE backend differ between the TP-only / TP+EP path and the
DP-attn + EP path. Move --moe-backend deep_gemm_mega_moe out of
EP_ARGS (it doesn't apply to plain TP+EP) and gate it together with
the cudagraph mode on DP_ATTENTION:
- DP_ATTENTION=false (TP-only or TP+EP):
    --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
    no --moe-backend flag (default MoE backend)
- DP_ATTENTION=true (DP-attn + EP):
    --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}'
    --moe-backend deep_gemm_mega_moe
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
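A minimal sketch of the gating described above; DP_ATTENTION and MOE_ARGS are names already used by the script, while COMPILATION_CONFIG and the surrounding structure are assumptions:

```bash
# Sketch: pick the cudagraph mode and MoE backend per the DP_ATTENTION branch.
if [ "$DP_ATTENTION" = "true" ]; then
  COMPILATION_CONFIG='{"cudagraph_mode":"FULL_AND_PIECEWISE", "custom_ops":["all"]}'
  MOE_ARGS=(--moe-backend deep_gemm_mega_moe)
else
  COMPILATION_CONFIG='{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}'
  MOE_ARGS=()
fi
# later appended to the launch command, e.g.:
#   --compilation-config "$COMPILATION_CONFIG" "${MOE_ARGS[@]}"
```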
functionstackx force-pushed from dfb4304 to cb08406

Summary
- Bump the dsv4-fp4-b200-vllm image to vllm/vllm-openai:v0.20.0-cu130 (canonical v0.20.0 tag) — replaces the floating deepseekv4-cu130 tag. DeepGEMM is preinstalled in this image, so no install_deepgemm.sh step is needed.
- Gate the compilation-config and MoE backend on DP_ATTENTION (the recipe diverges between TP-only / TP+EP and DP-attn + EP):
    - {tp:8} (low-latency): {"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}
    - {tp:8, ep:8} (mid): {"mode":0,"cudagraph_mode":"FULL_DECODE_ONLY"}
    - {tp:8, ep:8, dp-attn:true} (max-thru): {"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]} with deep_gemm_mega_moe
- Add --attention_config.use_fp4_indexer_cache=True (= form, per the recipe).
- Drop --pipeline-parallel-size 1 and --max-cudagraph-capture-size 2048 (no longer in the v0.20.0 recipe). Keep --no-enable-prefix-caching to match the other vLLM single-node benchmark scripts so cross-request cache hits don't skew steady-state throughput.
- The PARALLEL_ARGS / EP_ARGS / GMU_ARGS / MOE_ARGS arrays are preserved so the existing DP_ATTENTION / EP_SIZE search-space branching drives --tensor-parallel-size / --data-parallel-size / --enable-expert-parallel / --moe-backend cleanly.
- Add the perf-changelog.yaml entry to trigger the affected configs.

Test plan
- Run the dsv4-fp4-b200-vllm benchmark workflow on a B200 runner and confirm the engine starts and the sweep completes for at least one cell in each of the three search-space classes (TP-only, TP+EP, DP-attn+EP).
- Confirm the v0.20.0-cu130 image pulls and DeepGEMM is already importable inside the container (no install step at runtime).
- Check server.log to verify the right --compilation-config / --moe-backend combination is logged for each branch:
    - TP-only / TP+EP: --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}', no --moe-backend.
    - DP-attn + EP: --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' --moe-backend deep_gemm_mega_moe.
- Verify --no-enable-prefix-caching is still present in all branches.

🤖 Generated with Claude Code