dsv4-fp4-b300-vllm: bump to vllm v0.20.0, deep_gemm_mega_moe MoE #1206
functionstackx wants to merge 5 commits into main from
Conversation
Mirrors the B200 v0.20.0 update (#1204) for the B300 config. Pin the image to vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404 (in place of the floating deepseekv4-cu130 tag) and install DeepGEMM from the v0.20.0 tools script before launching the engine.

Update launch flags per the v0.20.0 DeepSeek-V4-Pro recipe:
- compilation-config -> {"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}
- --attention_config.use_fp4_indexer_cache=True (= form)
- add --moe-backend deep_gemm_mega_moe
- drop --pipeline-parallel-size 1 and --max-cudagraph-capture-size 2048

PARALLEL_ARGS and EP_ARGS are preserved so DP_ATTENTION / EP_SIZE branching keeps working across the existing search-space entries. --no-enable-prefix-caching is retained to match the other vLLM single-node benchmark scripts.

Adds a perf-changelog entry to trigger the affected configs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
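For context, a minimal sketch of how the updated flags might come together in the launch invocation. The flag set is the one listed above; the `vllm serve` form, the placeholder model id, and the surrounding script variables are assumptions rather than the actual config:

```bash
# Sketch only: flags are the ones listed in this commit; the model id and script
# structure are assumptions about the existing benchmark launch script.
MODEL="deepseek-ai/DeepSeek-V4-Pro"   # placeholder model id

vllm serve "$MODEL" \
  --compilation-config '{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}' \
  --attention_config.use_fp4_indexer_cache=True \
  --moe-backend deep_gemm_mega_moe \
  --no-enable-prefix-caching \
  "${PARALLEL_ARGS[@]}" \
  "${EP_ARGS[@]}"
# --pipeline-parallel-size 1 and --max-cudagraph-capture-size 2048 are no longer passed;
# a later commit in this PR moves --moe-backend into EP_ARGS.
```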
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should generally request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Move --moe-backend deep_gemm_mega_moe into EP_ARGS so it only takes effect when expert parallelism is enabled (EP_SIZE>1). The deep_gemm_mega_moe backend isn't applicable in TP-only configs, so applying it unconditionally changed behavior for the small-batch TP-only search-space entries. Update the perf-changelog entry to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
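A minimal sketch of the gating this commit describes; the EP_ARGS / EP_SIZE names and the flags come from the PR, while the exact array construction is an assumption:

```bash
# Sketch: only enable the deep_gemm_mega_moe backend when expert parallelism is on.
EP_ARGS=()
if [[ "${EP_SIZE:-1}" -gt 1 ]]; then
  EP_ARGS+=(--enable-expert-parallel)
  EP_ARGS+=(--moe-backend deep_gemm_mega_moe)   # not applied to TP-only entries
fi
```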
Same fix as PR #1204: the v0.20.0 vllm image (vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404) doesn't ship git, but install_deepgemm.sh git-clones the DeepGEMM repo. Install git via apt-get before invoking the script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
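A sketch of the workaround described here; the placement within the setup script is assumed, and the install-script URL is the one given in the PR description:

```bash
# Sketch: the ubuntu2404 image lacks git, which install_deepgemm.sh needs for its clone step.
apt-get update && apt-get install -y --no-install-recommends git

# Then install DeepGEMM from the pinned v0.20.0 tools script.
bash <(curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/v0.20.0/tools/install_deepgemm.sh)
```

The next commit supersedes this step by switching to the canonical image, which ships with DeepGEMM preinstalled.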
Mirror of the b200 fix: switch to the canonical vllm/vllm-openai:v0.20.0-cu130 image, which already ships with DeepGEMM preinstalled, and drop both the apt-get-install-git step and the install_deepgemm.sh invocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
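After this change the container setup reduces to pulling the pinned image; a sketch, with the variable name and the pull step assumed:

```bash
# Sketch: with the canonical image, no git install or DeepGEMM bootstrap is needed.
IMAGE="vllm/vllm-openai:v0.20.0-cu130"   # tag from this commit; variable name assumed
docker pull "$IMAGE"
# Dropped by this commit:
#   apt-get install -y git
#   bash <(curl -fsSL .../install_deepgemm.sh)
```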
Summary
B300 counterpart of #1204.
- Pin `dsv4-fp4-b300-vllm` to `vllm/vllm-openai:v0.20.0-x86_64-cu130-ubuntu2404` (replaces the floating `deepseekv4-cu130` tag).
- Install DeepGEMM via `bash <(curl -fsSL https://raw.githubusercontent.com/vllm-project/vllm/v0.20.0/tools/install_deepgemm.sh)`.
- Update launch flags: new `--compilation-config` (`{"mode": 0, "cudagraph_mode": "FULL_DECODE_ONLY"}`), `--attention_config.use_fp4_indexer_cache=True`, add `--moe-backend deep_gemm_mega_moe`. Drop `--pipeline-parallel-size 1` and `--max-cudagraph-capture-size 2048`.
- Keep `--no-enable-prefix-caching` to match the other vLLM single-node benchmark scripts.
- The `PARALLEL_ARGS` / `EP_ARGS` arrays are preserved so the existing `DP_ATTENTION` / `EP_SIZE` search-space branching still drives `--tensor-parallel-size` / `--data-parallel-size` / `--enable-expert-parallel` (see the sketch after this list).
- Add a `perf-changelog.yaml` entry to trigger the affected configs.
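A minimal sketch of that branching, assuming `DP_ATTENTION` selects between a TP-only layout and data-parallel attention; the `TP_SIZE` / `DP_SIZE` variables and the array construction are assumptions, and only the flag and variable names above come from this PR:

```bash
# Sketch: search-space branching the summary refers to (structure assumed).
PARALLEL_ARGS=()
if [[ "${DP_ATTENTION:-0}" == "1" ]]; then
  # DP-attention entries: combine tensor parallelism with data-parallel attention.
  PARALLEL_ARGS+=(--tensor-parallel-size "$TP_SIZE" --data-parallel-size "$DP_SIZE")
else
  # TP-only entries.
  PARALLEL_ARGS+=(--tensor-parallel-size "$TP_SIZE")
fi
```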
Test plan

- Run the `dsv4-fp4-b300-vllm` benchmark workflow on a B300 runner and confirm the engine starts and the sweep completes for at least one (ISL, OSL, CONC, DP_ATTENTION) cell.
- Confirm `install_deepgemm.sh` succeeds inside the container.
- Check `server.log` for the new flags (`--moe-backend deep_gemm_mega_moe`, new `--compilation-config`, `--attention_config.use_fp4_indexer_cache=True`) and that `--no-enable-prefix-caching` is still present.

🤖 Generated with Claude Code