Add B300 config: minimaxm2.5-fp8-vllm#1054
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
```yaml
- { tp: 2, ep: 2, conc-start: 512, conc-end: 512 }
- { tp: 4, ep: 4, conc-start: 256, conc-end: 512 }
```
🔴 The { tp: 2, ep: 2 } and { tp: 4, ep: 4 } search-space entries added to the B300 config will silently produce results identical to the plain tp: 2 and tp: 4 entries because the B300 runner allocates exactly --gres=gpu:$TP GPUs, causing vLLM to infer EP=1. These entries should either be removed or fixed by passing --expert-parallel-size $EP_SIZE explicitly and allocating TP*EP GPUs in the runner.
Extended reasoning...
Root cause: GPU allocation mismatch between runner and benchmark script
The B300 single-node runner (runners/launch_b300-nv.sh, line 232) uses salloc ... --gres=gpu:$TP --exclusive, allocating exactly TP GPUs for the job. This is correct for pure-TP runs, but expert parallelism requires TP×EP GPUs to be visible.
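The allocation gap described above can be sketched with a few lines of shell; the TP/EP values are examples from the search-space entries, and the flag names are as quoted in the review:

```shell
# Back-of-envelope sketch of the runner/benchmark mismatch (illustrative
# values; not the actual runner script).
TP=2
EP_SIZE=2
ALLOCATED=$TP                  # runner: salloc --gres=gpu:$TP --exclusive
NEEDED=$((TP * EP_SIZE))       # expert parallelism needs TP*EP visible GPUs
echo "allocated=$ALLOCATED needed=$NEEDED"
```

For every EP entry in the search-space, `NEEDED` exceeds `ALLOCATED`, which is the root of the degeneration below.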
How vLLM infers expert-parallel size
The new benchmark script (benchmarks/single_node/minimaxm2.5_fp8_b300.sh) conditionally sets the flag:
```bash
if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
fi
# Then: vllm serve ... --tensor-parallel-size=$TP $EP ...
```

No --expert-parallel-size argument is passed. vLLM infers expert_parallel_size = total_visible_GPUs / TP. With only TP GPUs allocated by the salloc command, the inference yields EP = TP / TP = 1, making --enable-expert-parallel a complete no-op.
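The inference rule just described can be checked in isolation (assumption, per the review: with `--enable-expert-parallel` and no explicit size, vLLM derives EP as visible GPUs divided by TP):

```shell
# Sketch of the EP inference under the B300 runner's allocation.
TP=2
VISIBLE_GPUS=2                        # only $TP GPUs granted by the salloc
EP_INFERRED=$((VISIBLE_GPUS / TP))
echo "inferred EP=$EP_INFERRED"       # EP=1: the flag becomes a no-op
```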
Step-by-step proof for the { tp: 2, ep: 2 } entry

1. CI picks up the entry and sets `TP=2`, `EP_SIZE=2`. `launch_b300-nv.sh` runs `salloc --gres=gpu:2 --exclusive ...`, granting exactly 2 GPUs.
2. The benchmark script sees `EP_SIZE=2 > 1` and sets `EP=" --enable-expert-parallel"`.
3. vLLM is launched as: `vllm serve ... --tensor-parallel-size=2 --enable-expert-parallel ...`
4. vLLM has 2 visible GPUs and TP=2, so it computes `expert_parallel_size = 2/2 = 1`.
5. With EP=1, `--enable-expert-parallel` has no effect; the run is pure TP=2.
6. The result filename nevertheless includes `ep2` in its path (injected by the workflow template via `EP_SIZE` in `RESULT_FILENAME`).
7. The same degeneration occurs for `{ tp: 4, ep: 4 }` → EP inferred as 4/4 = 1.
Why B200 is not affected
The B200 runner uses docker run --gpus all with CUDA_VISIBLE_DEVICES=0..7, always exposing 8 GPUs. With TP=2, vLLM computes EP = 8/2 = 4, so expert parallelism is genuinely active on B200. The B300 runner's Slurm-based GPU allocation strategy breaks this assumption entirely.
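The contrast can be reduced to two divisions (GPU counts as stated above: B200 exposes 8 GPUs via `--gpus all`, B300 exposes only TP via `--gres=gpu:$TP`):

```shell
# Side-by-side sketch of inferred EP on the two runners (illustrative only).
TP=2
B200_EP=$((8 / TP))     # docker run --gpus all -> 8 visible GPUs
B300_EP=$((TP / TP))    # salloc --gres=gpu:$TP -> TP visible GPUs
echo "B200 inferred EP=$B200_EP, B300 inferred EP=$B300_EP"
```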
Why SGLang B300 configs are not affected
The DSR1 FP4 B300 SGLang config (dsr1_fp4_b300.sh, line 51) passes --ep-size $EP_SIZE as an explicit argument, which is independent of GPU count. vLLM lacks an equivalent flag and instead infers EP from the available hardware.
Impact
The two EP entries waste CI time running duplicate non-EP experiments while labeling them as EP configs in result filenames. Anyone comparing ep2/ep4 results to tp2/tp4 results will see identical throughput numbers and may draw incorrect conclusions (e.g., that expert parallelism provides no benefit on B300, or that the hardware doesn't support it).
Fix options
- Remove the EP entries from the 1k1k search-space for the B300 config until B300-specific EP tuning is available.
- Pass an explicit EP size via `--expert-parallel-size $EP_SIZE` in the benchmark script, and update `launch_b300-nv.sh` to allocate `TP*EP` GPUs (`--gres=gpu:$((TP * EP_SIZE))`) for EP runs.
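The second option's allocation change amounts to the following sketch. The `--expert-parallel-size` flag name is as proposed in the review; confirm it exists in the pinned vLLM image before adopting it, since the review elsewhere notes vLLM may lack an explicit EP-size flag:

```shell
# Hedged sketch of fix option 2 (commands shown as strings, not executed).
TP=4
EP_SIZE=4
GRES="gpu:$((TP * EP_SIZE))"          # allocate TP*EP GPUs instead of TP
echo "salloc --gres=$GRES --exclusive ..."
echo "vllm serve ... --tensor-parallel-size=$TP --expert-parallel-size=$EP_SIZE"
```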
```yaml
- "Add MiniMax-M2.5 FP8 B300 vLLM benchmark"
- "Image: vllm/vllm-openai:v0.19.0-cu130"
- "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html does not have a B300-specific recipe, so this reuses the existing MiniMax-M2.5 FP8 B200 vLLM recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1054
```
🟡 The perf-changelog.yaml entry for minimaxm2.5-fp8-b300-vllm has a placeholder pr-link ending in /pull/XXXX instead of the actual PR number 1054. Update the last line so the pr-link ends in /pull/1054 to complete the changelog entry.
Extended reasoning...
What the bug is: The newly added perf-changelog.yaml entry for minimaxm2.5-fp8-b300-vllm (the last entry in the file) ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX — the placeholder was never replaced with the actual PR number 1054.
The specific code path: perf-changelog.yaml line 1414 contains the string /pull/XXXX. Multiple verifiers independently confirmed via git show 2e94bd5 that the committed code does indeed contain the placeholder. Notably, the PR diff shown in the review interface shows /pull/1054, but the actual committed file contains XXXX, suggesting the author staged an older version or forgot to update before committing.
Why existing validation does not catch it: The ChangelogEntry model in validation.py types pr_link as a plain str with no URL pattern or PR-number validation. Any non-empty string passes CI, so XXXX is accepted just as well as a real PR number.
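A minimal guard against this class of slip could look like the sketch below. The file name and field come from the review; the function and regex are hypothetical, since the review states no such check exists in validation.py today:

```shell
# Hypothetical lint sketch: fail when a changelog pr-link still carries a
# placeholder pull number like /pull/XXXX.
check_changelog() {
  # succeed only if no pr-link ends in an all-X placeholder pull number
  ! grep -qE 'pr-link:.*/pull/X+' "$1"
}

# Demonstration on a one-line stand-in for perf-changelog.yaml:
printf 'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX\n' > /tmp/pc.yaml
if check_changelog /tmp/pc.yaml; then
  echo "changelog clean"
else
  echo "placeholder pr-link found"
fi
```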
Addressing the refutation: The refutation notes that 7 other entries in the file also use pull/XXX placeholders, framing this as accepted codebase practice. However, those entries were submitted before the PR number was known. In this case the PR author clearly knew the PR number — it is PR 1054, and the PR diff itself shows /pull/1054. This is not an intentional placeholder pattern; it is an oversight where the wrong file state was committed.
Impact: The changelog is used by process_changelog.py to trigger CI sweeps and cross-reference benchmark runs to PRs. An incorrect placeholder link breaks traceability: tooling or humans following the link cannot navigate to the originating PR, and any automated changelog queries filtering by PR number will miss this entry.
Fix: On the last line of perf-changelog.yaml, replace pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1054.
At the time of submission, the vLLM MiniMax-M2 recipes page (https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html) does not have a B300-specific recipe, so this config reuses the existing MiniMax-M2.5 FP8 B200 vLLM recipe as-is until B300-specific tuning is available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
32ca0a4 to 4df8ff3
Summary
- Adds the `minimaxm2.5-fp8-b300-vllm` benchmark config and the corresponding `benchmarks/single_node/minimaxm2.5_fp8_b300.sh` launch script
- Image: `vllm/vllm-openai:v0.19.0-cu130` (same as B200), runner: `b300`, same TP/EP/concurrency search-space as B200

Test plan

- Run the `minimaxm2.5-fp8-b300-vllm` single-node benchmark on a B300 node and confirm the server starts, the benchmark completes, and the result file is produced

🤖 Generated with Claude Code