Tune MiniMax MI355X vLLM scheduling thresholds #1276
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. If failed jobs are re-run, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
17bc2cc to c2b7d37
c2b7d37 to 98bc84c
98bc84c to a9a3cef
|
/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys minimaxm2.5-fp8-mi355x-vllm |
|
@jiacao-amd Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/25346292897 |
Summary
Tune the MiniMax-M2.5 FP8 MI355X vLLM launch policy for better throughput across the 1k/1k and 8k/1k sweep points.
- `block-size=32`, shuffled KV cache disabled, async scheduling enabled.
- 1k/1k: use `block-size=16` with shuffled KV cache; disable async scheduling through c128.
- 1k/1k TP8/EP8 c2: keep `block-size=32` and shuffled KV cache disabled, but disable async scheduling.
- TP8/EP8 fallback path: keep `block-size=32` and shuffled KV cache disabled.
- 8k/1k: disable async scheduling through c64; keep c32 on `block-size=32` with shuffled KV cache disabled, and enable `block-size=16` with shuffled KV cache from c64 upward.
Throughput Comparison
Metric: `tput_per_gpu` only.
Testing
- `bash -n benchmarks/single_node/minimaxm2.5_fp8_mi355x.sh`
- `git diff --check`
- `results_bmk` artifacts from the validation and baseline runs above.
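As a reading aid for the thresholds listed in the Summary, here is a minimal sketch of how such a policy could be expressed as a shell function. It is not taken from `minimaxm2.5_fp8_mi355x.sh`: the function name `select_policy`, its argument order, and the `shuffle_kv`/`async` output fields are hypothetical. It also assumes that the unlabeled first bullet is the default, that TP8/EP8 points are checked before the sequence-length rules, and that "through cN" means concurrency up to and including N.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): prints the settings that one reading
# of the Summary bullets would select for a single sweep point.
# Usage: select_policy <isl_tokens> <tp> <ep> <concurrency>
select_policy() {
  local isl=$1 tp=$2 ep=$3 conc=$4
  # Assumed default: block-size=32, shuffled KV cache off, async scheduling on.
  local block_size=32 shuffle_kv=0 async=1

  if [ "$tp" -eq 8 ] && [ "$ep" -eq 8 ]; then
    # Assumed TP8/EP8 fallback: keep the default block size and KV layout.
    # The 1k/1k c2 point additionally turns async scheduling off.
    if [ "$isl" -eq 1024 ] && [ "$conc" -eq 2 ]; then
      async=0
    fi
  elif [ "$isl" -eq 1024 ]; then
    # 1k/1k: block-size=16 with shuffled KV cache; async off through c128.
    block_size=16
    shuffle_kv=1
    if [ "$conc" -le 128 ]; then
      async=0
    fi
  elif [ "$isl" -eq 8192 ]; then
    # 8k/1k: async off through c64; block-size=16 with shuffle from c64 upward.
    if [ "$conc" -le 64 ]; then
      async=0
    fi
    if [ "$conc" -ge 64 ]; then
      block_size=16
      shuffle_kv=1
    fi
  fi

  echo "block_size=${block_size} shuffle_kv=${shuffle_kv} async=${async}"
}
```

Under these assumptions, `select_policy 8192 4 1 64` prints `block_size=16 shuffle_kv=1 async=0`. The real script may order its checks differently, so the actual diff remains the reference for boundary cases.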