Add B300 config: minimaxm2.5-fp8-vllm#1054
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
```yaml
- { tp: 2, ep: 2, conc-start: 512, conc-end: 512 }
- { tp: 4, ep: 4, conc-start: 256, conc-end: 512 }
```
🔴 The { tp: 2, ep: 2 } and { tp: 4, ep: 4 } search-space entries added to the B300 config will silently produce results identical to the plain tp: 2 and tp: 4 entries because the B300 runner allocates exactly --gres=gpu:$TP GPUs, causing vLLM to infer EP=1. These entries should either be removed or fixed by passing --expert-parallel-size $EP_SIZE explicitly and allocating TP*EP GPUs in the runner.
Extended reasoning...
Root cause: GPU allocation mismatch between runner and benchmark script
The B300 single-node runner (runners/launch_b300-nv.sh, line 232) uses salloc ... --gres=gpu:$TP --exclusive, allocating exactly TP GPUs for the job. This is correct for pure-TP runs, but expert parallelism requires TP×EP GPUs to be visible.
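The allocation gap described above can be sketched with a few lines of shell; the TP/EP values are examples from the search-space entries, and the flag names are as quoted in the review:

```shell
# Back-of-envelope sketch of the runner/benchmark mismatch (illustrative
# values; not the actual runner script).
TP=2
EP_SIZE=2
ALLOCATED=$TP                  # runner: salloc --gres=gpu:$TP --exclusive
NEEDED=$((TP * EP_SIZE))       # expert parallelism needs TP*EP visible GPUs
echo "allocated=$ALLOCATED needed=$NEEDED"
```

For every EP entry in the search-space, `NEEDED` exceeds `ALLOCATED`, which is the root of the degeneration below.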
How vLLM infers expert-parallel size
The new benchmark script (benchmarks/single_node/minimaxm2.5_fp8_b300.sh) conditionally sets the flag:
```bash
if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
fi
# Then: vllm serve ... --tensor-parallel-size=$TP $EP ...
```

No --expert-parallel-size argument is passed. vLLM infers expert_parallel_size = total_visible_GPUs / TP. With only TP GPUs allocated by the salloc command, the inference yields EP = TP / TP = 1, making --enable-expert-parallel a complete no-op.
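The inference rule just described can be checked in isolation (assumption, per the review: with `--enable-expert-parallel` and no explicit size, vLLM derives EP as visible GPUs divided by TP):

```shell
# Sketch of the EP inference under the B300 runner's allocation.
TP=2
VISIBLE_GPUS=2                        # only $TP GPUs granted by the salloc
EP_INFERRED=$((VISIBLE_GPUS / TP))
echo "inferred EP=$EP_INFERRED"       # EP=1: the flag becomes a no-op
```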
Step-by-step proof for the { tp: 2, ep: 2 } entry

1. CI picks up the entry and sets `TP=2`, `EP_SIZE=2`. `launch_b300-nv.sh` runs `salloc --gres=gpu:2 --exclusive ...`, granting exactly 2 GPUs.
2. The benchmark script sees `EP_SIZE=2 > 1` and sets `EP=" --enable-expert-parallel"`.
3. vLLM is launched as: `vllm serve ... --tensor-parallel-size=2 --enable-expert-parallel ...`
4. vLLM has 2 visible GPUs and TP=2, so it computes `expert_parallel_size = 2/2 = 1`.
5. With EP=1, `--enable-expert-parallel` has no effect; the run is pure TP=2.
6. The result filename nevertheless includes `ep2` in its path (injected by the workflow template via `EP_SIZE` in `RESULT_FILENAME`).
7. The same degeneration occurs for `{ tp: 4, ep: 4 }` → EP inferred as 4/4 = 1.
Why B200 is not affected
The B200 runner uses docker run --gpus all with CUDA_VISIBLE_DEVICES=0..7, always exposing 8 GPUs. With TP=2, vLLM computes EP = 8/2 = 4, so expert parallelism is genuinely active on B200. The B300 runner's Slurm-based GPU allocation strategy breaks this assumption entirely.
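The contrast can be reduced to two divisions (GPU counts as stated above: B200 exposes 8 GPUs via `--gpus all`, B300 exposes only TP via `--gres=gpu:$TP`):

```shell
# Side-by-side sketch of inferred EP on the two runners (illustrative only).
TP=2
B200_EP=$((8 / TP))     # docker run --gpus all -> 8 visible GPUs
B300_EP=$((TP / TP))    # salloc --gres=gpu:$TP -> TP visible GPUs
echo "B200 inferred EP=$B200_EP, B300 inferred EP=$B300_EP"
```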
Why SGLang B300 configs are not affected
The DSR1 FP4 B300 SGLang config (dsr1_fp4_b300.sh, line 51) passes --ep-size $EP_SIZE as an explicit argument, which is independent of GPU count. vLLM lacks an equivalent flag and instead infers EP from the available hardware.
Impact
The two EP entries waste CI time running duplicate non-EP experiments while labeling them as EP configs in result filenames. Anyone comparing ep2/ep4 results to tp2/tp4 results will see identical throughput numbers and may draw incorrect conclusions (e.g., that expert parallelism provides no benefit on B300, or that the hardware doesn't support it).
Fix options
- Remove the EP entries from the 1k1k search-space for the B300 config until B300-specific EP tuning is available.
- Pass an explicit EP size via `--expert-parallel-size $EP_SIZE` in the benchmark script, and update `launch_b300-nv.sh` to allocate `TP*EP` GPUs (`--gres=gpu:$((TP * EP_SIZE))`) for EP runs.
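The second option's allocation change amounts to the following sketch. The `--expert-parallel-size` flag name is as proposed in the review; confirm it exists in the pinned vLLM image before adopting it, since the review elsewhere notes vLLM may lack an explicit EP-size flag:

```shell
# Hedged sketch of fix option 2 (commands shown as strings, not executed).
TP=4
EP_SIZE=4
GRES="gpu:$((TP * EP_SIZE))"          # allocate TP*EP GPUs instead of TP
echo "salloc --gres=$GRES --exclusive ..."
echo "vllm serve ... --tensor-parallel-size=$TP --expert-parallel-size=$EP_SIZE"
```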
```yaml
- "Add MiniMax-M2.5 FP8 B300 vLLM benchmark"
- "Image: vllm/vllm-openai:v0.19.0-cu130"
- "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html does not have a B300-specific recipe, so this reuses the existing MiniMax-M2.5 FP8 B200 vLLM recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1054
```
🟡 The perf-changelog.yaml entry for minimaxm2.5-fp8-b300-vllm has a placeholder pr-link ending in /pull/XXXX instead of the actual PR number 1054. Update the last line so the pr-link ends in /pull/1054 to complete the changelog entry.
Extended reasoning...
What the bug is: The newly added perf-changelog.yaml entry for minimaxm2.5-fp8-b300-vllm (the last entry in the file) ends with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX — the placeholder was never replaced with the actual PR number 1054.
The specific code path: perf-changelog.yaml line 1414 contains the string /pull/XXXX. Multiple verifiers independently confirmed via git show 2e94bd5 that the committed code does indeed contain the placeholder. Notably, the PR diff shown in the review interface shows /pull/1054, but the actual committed file contains XXXX, suggesting the author staged an older version or forgot to update before committing.
Why existing validation does not catch it: The ChangelogEntry model in validation.py types pr_link as a plain str with no URL pattern or PR-number validation. Any non-empty string passes CI, so XXXX is accepted just as well as a real PR number.
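A minimal guard against this class of slip could look like the sketch below. The file name and field come from the review; the function and regex are hypothetical, since the review states no such check exists in validation.py today:

```shell
# Hypothetical lint sketch: fail when a changelog pr-link still carries a
# placeholder pull number like /pull/XXXX.
check_changelog() {
  # succeed only if no pr-link ends in an all-X placeholder pull number
  ! grep -qE 'pr-link:.*/pull/X+' "$1"
}

# Demonstration on a one-line stand-in for perf-changelog.yaml:
printf 'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX\n' > /tmp/pc.yaml
if check_changelog /tmp/pc.yaml; then
  echo "changelog clean"
else
  echo "placeholder pr-link found"
fi
```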
Addressing the refutation: The refutation notes that 7 other entries in the file also use pull/XXX placeholders, framing this as accepted codebase practice. However, those entries were submitted before the PR number was known. In this case the PR author clearly knew the PR number — it is PR 1054, and the PR diff itself shows /pull/1054. This is not an intentional placeholder pattern; it is an oversight where the wrong file state was committed.
Impact: The changelog is used by process_changelog.py to trigger CI sweeps and cross-reference benchmark runs to PRs. An incorrect placeholder link breaks traceability: tooling or humans following the link cannot navigate to the originating PR, and any automated changelog queries filtering by PR number will miss this entry.
Fix: On the last line of perf-changelog.yaml, replace pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1054.
At the time of submission, the vLLM MiniMax-M2 recipes page (https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html) does not have a B300-specific recipe, so this config reuses the existing MiniMax-M2.5 FP8 B200 vLLM recipe as-is until B300-specific tuning is available.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
32ca0a4 to 4df8ff3
Summary
- Adds the `minimaxm2.5-fp8-b300-vllm` benchmark config and the corresponding `benchmarks/single_node/minimaxm2.5_fp8_b300.sh` launch script
- Image: `vllm/vllm-openai:v0.19.0-cu130` (same as B200), runner: `b300`, same TP/EP/concurrency search-space as B200

Test plan

- Run the `minimaxm2.5-fp8-b300-vllm` single-node benchmark on a B300 node and confirm the server starts, the benchmark completes, and the result file is produced

🤖 Generated with Claude Code