[NVIDIA] Update Minimax fp8 B200 Configs #1010
Conversation
Update perf-changelog.yaml
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
@claude please add PR description
Claude finished @Ankur-singh's task in 57s (Adding PR Description)
Done! Added a detailed PR description summarizing the search-space config changes for
Co-authored-by: Ankur Singh <Ankur-singh@users.noreply.github.com>
@functionstackx @cquil11 Can you please review this PR? I've already informed the team to update the corresponding recipe/cookbook. We should receive the PR any time now.
Based on validated benchmark configs in SemiAnalysisAI/InferenceX#1010, tp:4/ep:4 and tp:2/ep:2 are now confirmed for B200. Also enables 2-GPU selection for B200, adds --kv-cache-dtype fp8_e4m3 and --disable-radix-cache as B200-specific flags per the benchmark script. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Add benchmark-validated flags for B200 FP8 from SemiAnalysisAI/InferenceX#1010: --enable-expert-parallel (tp:4/ep:4 validated, tp:2/ep:2 also supported), --gpu-memory-utilization 0.90, --block-size 32, --kv-cache-dtype fp8, --stream-interval 20, --no-enable-prefix-caching. Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
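For reference, the flags listed in the commit above would combine into a launch command roughly like the following. This is only a sketch: the model path, TP size, and port are illustrative placeholders, not values taken from this PR.

```shell
# Hypothetical vLLM launch for MiniMax-M2.5 FP8 on B200 (tp:4/ep:4),
# combining the benchmark-validated flags from the commit message above.
# Model path and port are placeholders.
vllm serve MiniMaxAI/MiniMax-M2.5 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --gpu-memory-utilization 0.90 \
  --block-size 32 \
  --kv-cache-dtype fp8 \
  --stream-interval 20 \
  --no-enable-prefix-caching \
  --port 8000
```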
Recipe link: vllm-project/recipes#321. @functionstackx - could you please help review this?
cquil11
left a comment
evals + throughput look good
merge
* MiniMax-M2.5 B200: add EP, FP8 KV cache, disable radix cache
  Based on validated benchmark configs in SemiAnalysisAI/InferenceX#1010, tp:4/ep:4 and tp:2/ep:2 are now confirmed for B200. Also enables 2-GPU selection for B200, and adds `--kv-cache-dtype fp8_e4m3` and `--disable-radix-cache` as B200-specific flags per the benchmark script.
* Update Qwen35ConfigGenerator for B200 FP4 (NVFP4)
  Based on SemiAnalysisAI/InferenceX#820:
  - Set `mem-fraction-static` to 0.85 for B200 FP4 (benchmark uses 0.85)
  - Add `--quantization modelopt_fp4` (required flag, was missing)
  - Add `--chunked-prefill-size 32768`, `--max-prefill-tokens 32768`
  - Add `--max-running-requests 128`, `--stream-interval 30`
  - Add `--disable-radix-cache` (always required for FP4)
  - Skip `--enable-flashinfer-allreduce-fusion` for FP4 (TP=4, not used per benchmark)
* Remove `--disable-radix-cache` flag for B200 in MiniMaxM25ConfigGenerator
* revert: remove accidental MiniMax B200 changes from Qwen3.5 PR
  PR #230 should only touch Qwen35ConfigGenerator. Reverts all changes to MiniMaxM25ConfigGenerator (B200 2-GPU support, B200 EP, B200 kv-cache-dtype) that were accidentally included on this branch.
* revert: restore MiniMax comment order to match main
  Undoes an accidental comment/variable reorder in MiniMaxM25ConfigGenerator that was not part of the intended Qwen3.5 B200 FP4 changes.
* Update Qwen3.5 config to conditionally enable allreduce fusion based on quantization

Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Co-authored-by: zijiexia <37504505+zijiexia@users.noreply.github.com>
Co-authored-by: Zijie Xia <zijie_xia@icloud.com>
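The last commit above ("conditionally enable allreduce fusion based on quantization") suggests generator logic along the following lines. This is a hypothetical sketch only: the function name, structure, and return type are illustrative and do not reflect the actual Qwen35ConfigGenerator implementation; the flag values are taken from the commit messages above.

```python
# Hypothetical sketch of quantization-conditional flag selection, modeled on
# the commit messages above. All names here are illustrative, not the real
# Qwen35ConfigGenerator code.

def build_b200_flags(quantization: str, tp: int) -> list[str]:
    """Assemble SGLang-style server flags for a B200 configuration."""
    flags = [f"--tp-size {tp}"]
    if quantization == "modelopt_fp4":
        # FP4 path per SemiAnalysisAI/InferenceX#820: explicit quantization
        # flag, 0.85 static memory fraction, radix cache disabled, and no
        # allreduce fusion.
        flags += [
            "--quantization modelopt_fp4",
            "--mem-fraction-static 0.85",
            "--chunked-prefill-size 32768",
            "--max-prefill-tokens 32768",
            "--max-running-requests 128",
            "--stream-interval 30",
            "--disable-radix-cache",
        ]
    else:
        # Non-FP4 quantizations keep allreduce fusion enabled.
        flags.append("--enable-flashinfer-allreduce-fusion")
    return flags
```

Keeping the conditional in one place like this makes the "skip fusion for FP4" rule explicit rather than scattering per-quantization special cases through the generator.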
Summary
Update benchmark search-space configurations for `minimaxm2.5-fp8-b200-vllm` to refine concurrency ranges and parallelism strategies.

Changes

`.github/configs/nvidia-master.yaml`: updated search space for MiniMax-M2.5 FP8 B200 vLLM.

ISL 1024 / OSL 1024:
- Removed the `tp:2` (no EP) sweep; replaced with `tp:2, ep:2` at `conc: 512`
- `tp:4` (no EP) range narrowed from `conc 4–512` to `conc 4–128`
- `tp:4, ep:4` range moved from `conc 16–64` to `conc 256–512`

ISL 8192 / OSL 1024:
- `tp:2` range moved from `conc 4–256` to `conc 64–512`
- `tp:4` range narrowed from `conc 4–256` to `conc 4–64`
- Added a `tp:4` sweep point at `conc 512`

`perf-changelog.yaml`: added a changelog entry for this config update.
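Concretely, the updated search-space entries might take a shape like the YAML below. This is a sketch under assumed key names (`isl`, `osl`, `tp`, `ep`, `conc`); consult `.github/configs/nvidia-master.yaml` for the file's real schema.

```yaml
# Hypothetical shape of the updated MiniMax-M2.5 FP8 B200 vLLM search space.
# Key names are illustrative; the concurrency values follow the ranges above.
minimaxm2.5-fp8-b200-vllm:
  - isl: 1024
    osl: 1024
    sweeps:
      - {tp: 2, ep: 2, conc: [512]}
      - {tp: 4, conc: [4, 8, 16, 32, 64, 128]}
      - {tp: 4, ep: 4, conc: [256, 512]}
  - isl: 8192
    osl: 1024
    sweeps:
      - {tp: 2, conc: [64, 128, 256, 512]}
      - {tp: 4, conc: [4, 16, 64]}
      - {tp: 4, conc: [512]}
```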