[Handoff to @Oseltamivir Claude /loop] [Klaud Cold] Update dsv4-fp4-b200-sglang SGLang image to v0.5.12-cu130 #1450
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -2629,3 +2629,9 @@ | |
| description: | ||
| - "Update vLLM ROCm image from v0.18.0 to v0.21.0" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1404 | ||
|
|
||
| - config-keys: | ||
| - dsv4-fp4-b200-sglang | ||
Comment on lines +2632 to +2634

Contributor

🔴 This PR accidentally commits a 1.5 MB / 10,883-line …

Extended reasoning...

What is committed
The diff for this PR shows three changed files: the SGLang image bump in …

Why this is clearly an accidental commit
Reading …

So the content is a vLLM v0.20.0 startup log, not SGLang. The PR is a SGLang image-tag bump for the …

Step-by-step proof of origin
Impact
How to fix
Addressing the duplicate-refutation objections
The two refutations from the verifier round flagged that …
||
| description: | ||
| - "Update SGLang image from custom deepseek-v4-blackwell@sha256:df18bfc4... (21d old) to v0.5.12-cu130" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1450 | ||
🟡 This image bump silences the runner conditional and stale TODO in `runners/launch_b200-{cw,nb,dgxc}.sh` and `benchmarks/single_node/dsv4_fp4_b200.sh:31-35` that explicitly said "Drop the runner conditional once lmsys moves sglang back out of /workspace". `v0.5.12-cu130` is exactly that trigger, but the cleanup wasn't done, leaving four dead `if [[ "$IMAGE" == *deepseek-v4-blackwell* ]]` branches and five stale TODOs referencing an unused tag. Functionally low-risk since sibling b200 recipes (`nvidia-master.yaml:1852,1938,1961,…`) already run `v0.5.12-cu130` against the default `/workspace` mount, but worth confirming `v0.5.12-cu130` installs sglang under `/sgl-workspace` (not `/workspace`) and then removing the now-dead conditional + TODOs in this PR.

Extended reasoning...
What the bug is
This PR replaces the SHA-pinned `lmsysorg/sglang:deepseek-v4-blackwell` image with `lmsysorg/sglang:v0.5.12-cu130` for `dsv4-fp4-b200-sglang`. After this change, no recipe in `.github/configs/nvidia-master.yaml` uses the `deepseek-v4-blackwell` tag anymore (grep on the post-PR tree returns no matches outside the launchers themselves).

But the runner scripts still carry conditional branches gated on that exact tag:

- `runners/launch_b200-cw.sh:27`: `if [[ "$IMAGE" == *deepseek-v4-blackwell* ]]; then CONTAINER_MOUNT_DIR=/ix`
- `runners/launch_b200-nb.sh:23`: same
- `runners/launch_b200-dgxc.sh:351`: same
- `runners/launch_b300-nv.sh:314`: broader pattern; the `*deepseek-v4-blackwell*` arm is dead, but the `*deepseek-v4-b300*` / `*sglang-b300*` arms still match other recipes

Plus five TODO comments (`launch_b200-cw.sh:22-26`, `launch_b200-nb.sh:18-22`, `launch_b200-dgxc.sh:346-350`, `launch_b300-nv.sh:308-313`, `benchmarks/single_node/dsv4_fp4_b200.sh:31-35`) describing the special `/ix` mount handling. The bench-script TODO is most explicit: "Drop the runner conditional once lmsys moves sglang back out of /workspace".

That is exactly the event this PR represents, but the cleanup wasn't performed.
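The guard described above can be sketched as follows. This is a minimal illustration, assuming the launchers fall back to `/workspace` when the tag does not match, as the review states. The function names `choose_mount_dir` and `is_masked_by_mount` are hypothetical and do not appear in the repository; the real scripts set `CONTAINER_MOUNT_DIR` inline.

```bash
#!/usr/bin/env bash
# Hypothetical helper mirroring the guard quoted from the launchers:
# the special tag selects /ix, everything else gets the default mount.
choose_mount_dir() {
  local image="$1"
  if [[ "$image" == *deepseek-v4-blackwell* ]]; then
    echo /ix
  else
    echo /workspace
  fi
}

# Hypothetical helper: succeeds when a bind mount at $2 would hide
# a path $1 that is baked into the image.
is_masked_by_mount() {
  local install_path="$1" mount_dir="$2"
  [[ "$install_path" == "$mount_dir" || "$install_path" == "$mount_dir"/* ]]
}

# Compare the old tag with the new one.
for image in "lmsysorg/sglang:deepseek-v4-blackwell" "lmsysorg/sglang:v0.5.12-cu130"; do
  echo "$image -> $(choose_mount_dir "$image")"
done

# An install under /workspace is hidden by the default mount.
if is_masked_by_mount /workspace/sglang/python /workspace; then
  echo "an install under /workspace would be hidden by the default mount"
fi
```

With the new tag the guard never fires, so the recipe depends on the image keeping sglang outside `/workspace`.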
Step-by-step proof

1. Before this PR, `dsv4-fp4-b200-sglang` uses `lmsysorg/sglang:deepseek-v4-blackwell@sha256:…`. When a job runs on `b200-dsv4` and `launch_b200-cw.sh` evaluates `if [[ "$IMAGE" == *deepseek-v4-blackwell* ]]`, it matches → `CONTAINER_MOUNT_DIR=/ix` → repo is mounted at `/ix`, leaving the editable sglang install at `/workspace/sglang/python` visible.
2. After this PR, `$IMAGE` becomes `lmsysorg/sglang:v0.5.12-cu130`. The conditional no longer matches → `CONTAINER_MOUNT_DIR=/workspace` (default). For the recipe to keep working, `v0.5.12-cu130` must not install sglang under `/workspace` (otherwise the bind-mount would mask the install).
3. `v0.5.12-cu130` is safe with the default `/workspace` mount: nine other recipes in `nvidia-master.yaml` already use `lmsysorg/sglang:v0.5.12-cu130` (lines 1852, 1938, 1961, 2290, 2311, 2708, 2731, 2825) on b200 launchers without an `/ix` override, so the released `v0.5.12-cu130` evidently installs sglang at `/sgl-workspace/sglang` as the TODO implies. The recipe should still function.
4. The `*deepseek-v4-blackwell*` arm of the b300 launcher is also unreachable (no recipe uses that tag).

Impact and fix
Low functional risk: the recipe will almost certainly still pass the sweep, since sibling recipes prove `v0.5.12-cu130` works with the default mount. The cost is technical-debt drift: future readers will see TODOs whose precondition has already been met and a conditional whose match arm is never exercised. The PR's own description warns about possible incompatibility, which is why pulling these guards out should ideally be paired with sweep validation.

Suggested cleanup for this PR (or an immediate follow-up):

- Remove the `if [[ "$IMAGE" == *deepseek-v4-blackwell* ]]` branch and its TODO in `runners/launch_b200-cw.sh`, `runners/launch_b200-nb.sh`, and `runners/launch_b200-dgxc.sh` (replace with a single unconditional `CONTAINER_MOUNT_DIR=/workspace` line, or drop the variable entirely if the default suffices).
- In `runners/launch_b300-nv.sh:314`, drop just the `*deepseek-v4-blackwell*` arm of the disjunction; keep the others.
- Remove the TODO in `benchmarks/single_node/dsv4_fp4_b200.sh:31-35`: its premise no longer holds.
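One way to confirm the cleanup is complete is to search the tree for any remaining mention of the retired tag. The sketch below is an assumption about how such a check might look, not a script from the repository: `find_stale_refs` is a hypothetical name, and the demo files are throwaway stand-ins for the real launchers and configs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: list files under the given paths that still
# mention the retired image tag. Prints nothing when the tree is clean.
find_stale_refs() {
  grep -rl -- 'deepseek-v4-blackwell' "$@" || true
}

# Demo on a throwaway tree (file names and contents are illustrative only).
demo="$(mktemp -d)"
mkdir -p "$demo/runners" "$demo/configs"
printf '%s\n' 'if [[ "$IMAGE" == *deepseek-v4-blackwell* ]]; then' \
  > "$demo/runners/launch_example.sh"
printf '%s\n' 'image: lmsysorg/sglang:v0.5.12-cu130' \
  > "$demo/configs/example.yaml"

# Only the launcher stand-in still references the old tag.
find_stale_refs "$demo"
rm -rf "$demo"
```

Run from the repository root, something like `find_stale_refs runners benchmarks .github/configs` should print nothing once the dead branches and TODOs are removed.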