[NVIDIA] Update H200/B200 SGLang image to v0.5.5-cu129-amd64 and fix deprecated flags#204
Conversation
Pull Request Overview
Updates SGLang Docker images across B200 and H200 configurations to use a unified v0.5.5-cu129-amd64 image tag, replacing GPU-specific and older release candidate versions.
Key Changes:
- Consolidates all three SGLang configurations to use the same unified image version
- Upgrades H200 from CUDA 12.6 to CUDA 12.9
- Removes GPU-specific image tags in favor of a single amd64 architecture tag
@copilot set --ep-size to 8 instead and fix `--enable-flashinfer-trtllm-moe`
@copilot use the $EP_SIZE var instead of hard-coding it to 8, and add it to nvidia-master.yaml
@copilot shouldn't you enable ep=4 for tp=4 too?
Pull Request Overview
Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.
Can you help review this PR? Here is the validation on B200 FP4, B200 FP8 & H200 FP8: https://github.com/InferenceMAX/InferenceMAX/actions/runs/19215140966?pr=204. I had to change a couple of flags because sglang upstream removed them: https://github.com/InferenceMAX/InferenceMAX/pull/204#issuecomment-3508853560 and https://github.com/InferenceMAX/InferenceMAX/pull/204#issuecomment-3508855924
Force-pushed from 5f0a2d5 to b868866
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
… and --enable-flashinfer-trtllm-moe with --moe-runner-backend flashinfer_trtllm Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
…master.yaml for B200 SGLang configs Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
… scripts Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Force-pushed from b868866 to 28534c7

Fixes https://github.com/InferenceMAX/InferenceMAX/issues/208
Consolidates H200 and B200 SGLang configurations to use unified v0.5.5-cu129-amd64 image tag and updates deprecated SGLang server arguments to their current equivalents.
`--enable-flashinfer-trtllm-moe` & `--enable-ep-moe` are no longer available in sglang, so we needed to change them.

Changes:
- Updated image tags: `v0.5.3rc1-cu129-b200` → `v0.5.5-cu129-amd64` (both B200 SGLang configs) and `v0.5.2rc2-cu126` → `v0.5.5-cu129-amd64` (H200 config)
- Added `ep` configuration to B200 SGLang search-space entries, matching `tp` values (9 occurrences total):
  - `ep: 4` for all `tp: 4` entries (3 occurrences in dsr1-fp4-b200-sglang)
  - `ep: 8` for all `tp: 8` entries (6 occurrences across dsr1-fp4-b200-sglang and dsr1-fp8-b200-sglang)
- Replaced `--enable-ep-moe` with `--ep-size $EP_SIZE` and `--enable-flashinfer-trtllm-moe` with `--moe-runner-backend flashinfer_trtllm`
- Replaced `--enable-flashinfer-trtllm-moe` with `--moe-runner-backend flashinfer_trtllm` and added `--ep-size $EP_SIZE`
- Added `-e EP_SIZE` to both Docker run commands to pass the environment variable into the containers

The previous H200 configuration used CUDA 12.6 while B200 used CUDA 12.9 with GPU-specific tags; now all three configs use the same CUDA 12.9 image. The deprecated flags were causing launch errors with the updated SGLang version and have been replaced per the official documentation. EP_SIZE is now configured through nvidia-master.yaml (with `ep` matching `tp`, as per SGLang documentation) and passed as an environment variable to the benchmark scripts. The runner scripts have been updated to ensure EP_SIZE is properly passed into the Docker containers.
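The search-space additions described above can be sketched as a config fragment. The entry names match the configs mentioned in the PR, but the exact key layout is an illustrative assumption, not the actual nvidia-master.yaml contents:

```yaml
# Hypothetical sketch of the nvidia-master.yaml search-space additions;
# key layout is assumed, not copied from the real file.
dsr1-fp4-b200-sglang:
  search_space:
    - tp: 4
      ep: 4   # ep matches tp, per SGLang documentation
    - tp: 8
      ep: 8
dsr1-fp8-b200-sglang:
  search_space:
    - tp: 8
      ep: 8
```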
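Put together, the flag migration and the environment-variable plumbing look roughly like the sketch below. The image name, server arguments, and overall launch line are placeholders based on the PR description, not the repo's exact runner script:

```shell
# Sketch of the updated runner flow; image and launch details are placeholders.
EP_SIZE=8   # comes from nvidia-master.yaml (ep matching tp)

# Old, now-removed flags:  --enable-ep-moe --enable-flashinfer-trtllm-moe
# New equivalents in SGLang v0.5.5:
LAUNCH_ARGS="--ep-size $EP_SIZE --moe-runner-backend flashinfer_trtllm"

# The runner passes EP_SIZE into the container with -e so the command
# inside the container can expand it:
DOCKER_CMD="docker run --gpus all -e EP_SIZE lmsysorg/sglang:v0.5.5-cu129-amd64 \
  python3 -m sglang.launch_server $LAUNCH_ARGS"

echo "$DOCKER_CMD"
```

Without `-e EP_SIZE` on the `docker run` line, `$EP_SIZE` would expand to an empty string inside the container and the launch would fail, which is why both runner scripts needed the change.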