
Add B300 config: minimaxm2.5-fp8-vllm #1054

Merged
functionstackx merged 2 commits into main from claude/add-minimaxm2.5-fp8-b300-vllm
Apr 17, 2026

Conversation

@functionstackx
Contributor

Summary

  • Add minimaxm2.5-fp8-b300-vllm benchmark config and the corresponding benchmarks/single_node/minimaxm2.5_fp8_b300.sh launch script
  • At the time of submission, the vLLM MiniMax-M2 recipes page does not have a B300-specific recipe, so this reuses the existing MiniMax-M2.5 FP8 B200 vLLM recipe as-is until B300-specific tuning is available
  • Image: vllm/vllm-openai:v0.19.0-cu130 (same as B200), runner: b300, same TP/EP/concurrency search-space as B200

Test plan

  • CI config validation passes
  • Run minimaxm2.5-fp8-b300-vllm single-node benchmark on a B300 node and confirm server starts, benchmark completes, and result file is produced

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a documentation PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. Often, failures are just flakes, and simply re-running the failed jobs will fix them. If failed jobs are re-run, PR authors remain responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment
@github-actions

Comment on lines +3240 to +3241
- { tp: 2, ep: 2, conc-start: 512, conc-end: 512 }
- { tp: 4, ep: 4, conc-start: 256, conc-end: 512 }
Contributor


🔴 The { tp: 2, ep: 2 } and { tp: 4, ep: 4 } search-space entries added to the B300 config will silently produce results identical to the plain tp: 2 and tp: 4 entries, because the B300 runner allocates exactly --gres=gpu:$TP GPUs, causing vLLM to infer EP=1. These entries should either be removed or fixed by allocating TP*EP GPUs in the runner so that vLLM infers the intended expert-parallel size.

Extended reasoning...

Root cause: GPU allocation mismatch between runner and benchmark script

The B300 single-node runner (runners/launch_b300-nv.sh, line 232) uses salloc ... --gres=gpu:$TP --exclusive, allocating exactly TP GPUs for the job. This is correct for pure-TP runs, but expert parallelism requires TP×EP GPUs to be visible.

How vLLM infers expert-parallel size

The new benchmark script (benchmarks/single_node/minimaxm2.5_fp8_b300.sh) conditionally sets the flag:

if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
fi
# Then: vllm serve ... --tensor-parallel-size=$TP $EP ...

No --expert-parallel-size argument is passed. vLLM infers expert_parallel_size = total_visible_GPUs / TP. With only TP GPUs allocated by the salloc command, the inference yields EP = TP / TP = 1, making --enable-expert-parallel a complete no-op.
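
The inference described above can be sketched with shell arithmetic (a toy model of the reviewer's description, not vLLM's actual code):

```shell
# Toy model of the EP inference described above (not vLLM source):
# expert_parallel_size = visible_GPUs / TP
infer_ep() { echo $(( $1 / $2 )); }   # $1 = visible GPUs, $2 = TP

echo "B300, tp=2: EP=$(infer_ep 2 2)"   # prints EP=1 (degenerate)
echo "B200, tp=2: EP=$(infer_ep 8 2)"   # prints EP=4 (real expert parallelism)
```

With the B300 runner's TP-only allocation, the quotient is always 1, regardless of the EP_SIZE requested by the search-space entry.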

Step-by-step proof for the { tp: 2, ep: 2 } entry

  1. CI picks up the entry, sets TP=2, EP_SIZE=2.
  2. launch_b300-nv.sh runs salloc --gres=gpu:2 --exclusive ..., granting exactly 2 GPUs.
  3. The benchmark script sees EP_SIZE=2 > 1 and sets EP=" --enable-expert-parallel".
  4. vLLM is launched as: vllm serve ... --tensor-parallel-size=2 --enable-expert-parallel ...
  5. vLLM has 2 visible GPUs, TP=2, so it computes expert_parallel_size = 2/2 = 1.
  6. With EP=1, --enable-expert-parallel has no effect; the run is pure-TP=2.
  7. The result filename nevertheless includes ep2 in its path (injected by the workflow template via EP_SIZE in RESULT_FILENAME).
  8. The same degeneration occurs for { tp: 4, ep: 4 } → EP inferred as 4/4 = 1.

Why B200 is not affected

The B200 runner uses docker run --gpus all with CUDA_VISIBLE_DEVICES=0..7, always exposing 8 GPUs. With TP=2, vLLM computes EP = 8/2 = 4, so expert parallelism is genuinely active on B200. The B300 runner's Slurm-based GPU allocation strategy breaks this assumption entirely.

Why SGLang B300 configs are not affected

The DSR1 FP4 B300 SGLang config (dsr1_fp4_b300.sh, line 51) passes --ep-size $EP_SIZE as an explicit argument, which is independent of GPU count. vLLM lacks an equivalent flag and instead infers EP from the available hardware.

Impact

The two EP entries waste CI time running duplicate non-EP experiments while labeling them as EP configs in result filenames. Anyone comparing ep2/ep4 results to tp2/tp4 results will see identical throughput numbers and may draw incorrect conclusions (e.g., that expert parallelism provides no benefit on B300, or that the hardware doesn't support it).

Fix options

  1. Remove the EP entries from the 1k1k search-space for the B300 config until B300-specific EP tuning is available.
  2. Make the EP entries real: update launch_b300-nv.sh to allocate TP*EP GPUs (--gres=gpu:$((TP * EP_SIZE))) for EP runs, so that vLLM infers expert_parallel_size = EP_SIZE from the visible GPU count.
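
The allocation change in option 2 can be sketched as follows (TP and EP_SIZE are the variables already used by the scripts quoted above; the salloc invocation is illustrative, not verified against the repo):

```shell
# Illustrative only: request TP*EP GPUs so vLLM's inferred EP equals EP_SIZE.
TP=2
EP_SIZE=2
GPUS=$(( TP * EP_SIZE ))
echo "salloc --gres=gpu:${GPUS} --exclusive ..."   # prints: salloc --gres=gpu:4 --exclusive ...
```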

Comment on perf-changelog.yaml
- "Add MiniMax-M2.5 FP8 B300 vLLM benchmark"
- "Image: vllm/vllm-openai:v0.19.0-cu130"
- "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html does not have a B300-specific recipe, so this reuses the existing MiniMax-M2.5 FP8 B200 vLLM recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1054
Contributor


🟡 The perf-changelog.yaml entry for minimaxm2.5-fp8-b300-vllm has a placeholder pr-link ending in /pull/XXXX instead of the actual PR number, 1054. Update the last line to point at https://github.com/SemiAnalysisAI/InferenceX/pull/1054 to complete the changelog entry.

Extended reasoning...

What the bug is: The newly added perf-changelog.yaml entry for minimaxm2.5-fp8-b300-vllm (the last entry in the file) ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX — the placeholder was never replaced with the actual PR number 1054.

The specific code path: perf-changelog.yaml line 1414 contains the string /pull/XXXX. Multiple verifiers independently confirmed via git show 2e94bd5 that the committed code does indeed contain the placeholder. Notably, the PR diff shown in the review interface shows /pull/1054, but the actual committed file contains XXXX, suggesting the author staged an older version or forgot to update before committing.

Why existing validation does not catch it: The ChangelogEntry model in validation.py types pr_link as a plain str with no URL pattern or PR-number validation. Any non-empty string passes CI, so XXXX is accepted just as well as a real PR number.
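
Because pr_link is accepted as any non-empty string, a lightweight guard would catch such placeholders. A hypothetical check (not part of the repo's validation.py; function name and patterns are illustrative):

```shell
# Hypothetical guard (not in the repo's CI): flag placeholder pr-links.
check_pr_link() {
  case "$1" in
    */pull/XXXX) echo "placeholder" ;;
    */pull/[0-9]*) echo "ok" ;;
    *) echo "malformed" ;;
  esac
}

check_pr_link "https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX"   # prints: placeholder
check_pr_link "https://github.com/SemiAnalysisAI/InferenceX/pull/1054"   # prints: ok
```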

Addressing the refutation: The refutation notes that 7 other entries in the file also use pull/XXX placeholders, framing this as accepted codebase practice. However, those entries were submitted before the PR number was known. In this case the PR author clearly knew the PR number — it is PR 1054, and the PR diff itself shows /pull/1054. This is not an intentional placeholder pattern; it is an oversight where the wrong file state was committed.

Impact: The changelog is used by process_changelog.py to trigger CI sweeps and cross-reference benchmark runs to PRs. An incorrect placeholder link breaks traceability: tooling or humans following the link cannot navigate to the originating PR, and any automated changelog queries filtering by PR number will miss this entry.

Fix: Replace pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1054 on the last line of perf-changelog.yaml.

functionstackx and others added 2 commits April 17, 2026 08:47
At the time of submission, the vLLM MiniMax-M2 recipes page
(https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html)
does not have a B300-specific recipe, so this config reuses the existing
MiniMax-M2.5 FP8 B200 vLLM recipe as-is until B300-specific tuning is
available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/add-minimaxm2.5-fp8-b300-vllm branch from 32ca0a4 to 4df8ff3 on April 17, 2026 12:47
@functionstackx functionstackx merged commit 340d785 into main Apr 17, 2026
3 checks passed
@functionstackx functionstackx deleted the claude/add-minimaxm2.5-fp8-b300-vllm branch April 17, 2026 12:47
@claude claude bot mentioned this pull request Apr 20, 2026