Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that, after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```bash
elif [[ $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" ]]; then
  # SRT_SLURM_MODEL_PREFIX matches the model.path alias in our DSv4
  # sglang recipes (benchmarks/multi_node/srt-slurm-recipes/sglang/
  # deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml).
  export SERVED_MODEL_NAME="deepseek-v4-pro"
  export MODEL_PATH=/raid/shared/models/deepseek-v4-pro
  export SRT_SLURM_MODEL_PREFIX="deepseek-v4-pro"
```
🔴 The new GB300 sglang DSv4 recipe declares container: "lmsysorg/sglang:deepseek-v4-grace-blackwell" (a literal image name), but the containers: map emitted by runners/launch_gb300-nv.sh only declares the aliases dynamo-trtllm, dynamo-sglang, and nginx-sqsh — it has no entry for the literal IMAGE. launch_gb200-nv.sh already includes "${IMAGE}": ${SQUASH_FILE} precisely for locally-shipped recipes that use literal image names; without the same line here, srtctl cannot resolve the container for the new dsv4-fp4-gb300-dynamo-sglang sweep and will fall through to a docker pull on compute nodes rather than mount the imported squashfs. Fix by adding "${IMAGE}": ${SQUASH_FILE} to the containers block (or change the recipe to container: dynamo-sglang to use the existing alias).
Extended reasoning...
## What the bug is

benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml line 21 sets:

```yaml
container: "lmsysorg/sglang:deepseek-v4-grace-blackwell"
```

This is a literal docker image name, not an alias. The srtslurm.yaml heredoc emitted by runners/launch_gb300-nv.sh only contains:

```yaml
containers:
  dynamo-trtllm: ${SQUASH_FILE}
  dynamo-sglang: ${SQUASH_FILE}
  nginx-sqsh: ${NGINX_SQUASH_FILE}
```

There is no entry mapping `lmsysorg/sglang:deepseek-v4-grace-blackwell` (or `${IMAGE}`) to the locally-imported squashfs.

## Why existing code does not save us

In srt-slurm/src/srtctl/core/config.py the alias resolver only rewrites model.container when the value is a key in the containers: map; otherwise the literal string passes through unchanged. runtime.py then treats only strings starting with `/` or `./` as a squashfs path — anything else becomes a literal `--container-image` argument to srun, which under pyxis/enroot will attempt a registry pull on the compute node rather than mount the pre-imported sqsh.

## Why this is new on GB300

The pre-PR launch_gb300-nv.sh only handled dsr1 recipes from upstream srt-slurm, all of which use container: dynamo-sglang / dynamo-trtllm (aliases) — so the missing literal-IMAGE mapping never bit. This PR ships the first GB300 recipe that uses a literal image name, and it ships exactly the launcher change the PR description claims it mirrors from GB200 — except the containers-map line was not mirrored.

## Direct precedent in the GB200 sibling

runners/launch_gb200-nv.sh (the launcher whose gates the PR description explicitly says it mirrors) already has at line 206:

```yaml
  "${IMAGE}": ${SQUASH_FILE}
```

That line was added because GB200's locally-shipped vllm DSv4 recipes use literal container names like `vllm/vllm-openai:deepseekv4-cu130`. The H100 and H200 launchers (launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh) include the same line for the same reason. The new GB300 sglang DSv4 recipe is structurally identical (a literal image string in container:) and needs the same launcher entry.

## Step-by-step proof

1. CI runs dsv4-fp4-gb300-dynamo-sglang. launch_gb300-nv.sh enters the new dsv4-fp4 branch, sets `IMAGE=lmsysorg/sglang:deepseek-v4-grace-blackwell`, and imports it into /home/sa-shared/squash/lmsysorg_sglang_deepseek-v4-grace-blackwell.sqsh.
2. The heredoc writes srtslurm.yaml with three container aliases (none of them the literal image string).
3. `srtctl apply -f recipes/sglang/deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml` reads model.container = "lmsysorg/sglang:deepseek-v4-grace-blackwell".
4. config.py: `if container in containers` — false (only dynamo-trtllm, dynamo-sglang, nginx-sqsh). The literal string passes through unchanged.
5. runtime.py: the string does not start with `/` or `./`, so it is treated as a docker image name and passed to `srun --container-image=lmsysorg/sglang:deepseek-v4-grace-blackwell`.
6. Pyxis/enroot tries to pull from Docker Hub on the compute node. The pre-imported squashfs in /home/sa-shared/squash/ is never used. Compute nodes typically lack registry pull access (and even if they did, the import was wasted) — runtime failure for the whole sweep.

## Fix

Either add the literal-IMAGE line to the heredoc in launch_gb300-nv.sh next to the existing aliases:

```yaml
containers:
  dynamo-trtllm: ${SQUASH_FILE}
  dynamo-sglang: ${SQUASH_FILE}
  "${IMAGE}": ${SQUASH_FILE}
  nginx-sqsh: ${NGINX_SQUASH_FILE}
```

(mirroring launch_gb200-nv.sh:206 exactly), or change the new recipe to `container: "dynamo-sglang"` to reuse the existing alias.

## Impact / severity

normal — this blocks the only feature this PR adds. Any dsv4-fp4-gb300-dynamo-sglang run will fail at srun time, never reaching the SGLang server. Should block merge until either the launcher gets the IMAGE mapping or the recipe switches to the alias.
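The fall-through described above (alias lookup in config.py, then path-vs-image classification in runtime.py) can be illustrated with a minimal Python sketch. The function names and the `containers` map below are assumptions for illustration only — the real logic lives in srtctl, and this is not its actual code:

```python
def resolve_container(container: str, containers: dict[str, str]) -> str:
    # config.py-style behavior: rewrite only when the value is a known
    # alias; any other string passes through unchanged.
    return containers.get(container, container)

def classify_container(resolved: str) -> str:
    # runtime.py-style behavior: only "/"- or "./"-prefixed strings are
    # treated as local squashfs paths; everything else is handed to srun
    # as --container-image, i.e. a registry pull on the compute node.
    if resolved.startswith("/") or resolved.startswith("./"):
        return "squashfs"
    return "registry-pull"

# Aliases as emitted by the GB300 heredoc (no literal-IMAGE entry).
containers = {
    "dynamo-trtllm": "/home/sa-shared/squash/image.sqsh",
    "dynamo-sglang": "/home/sa-shared/squash/image.sqsh",
}

# Alias resolves to the imported squashfs:
print(classify_container(resolve_container("dynamo-sglang", containers)))
# Literal image name is not a key, so it falls through to a docker pull —
# the reported bug:
print(classify_container(
    resolve_container("lmsysorg/sglang:deepseek-v4-grace-blackwell",
                      containers)))
```

Adding a `"lmsysorg/sglang:deepseek-v4-grace-blackwell": "/…/image.sqsh"` entry to the map (the `"${IMAGE}"` line in the heredoc) is what flips the second case back to the squashfs path.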
```yaml
# dynamo frontend. Ported from NVIDIA/srt-slurm PR #75
# (recipes/gb300-fp4/1k1k-dsv4/disagg-1p1d-tp4-mxfp4.yaml). GB300 sibling of
# the dsv4-sglang-disagg-gb200-1p1d-dep8-tep8 recipe in this directory tree.
#
# Topology: 1 prefill node + 1 decode node, each TP=4 on a single GB300
# (4 GPUs / node). KV transfer over NIXL. Targets steady decode TPOT under
# moderate-to-high concurrency.
#
# Local deltas vs upstream PR #75:
#   * benchmark.type = sa-bench (upstream uses "manual" because they pair
#     with a separate sa-bench launcher; our sweep harness drives sa-bench
#     in-recipe).
#   * Disagg timeout triple + NCCL_MNNVL/CUMEM env vars copied from the
#     GB200 sglang sibling — same handshake-stability rationale.
```
🟡 The new recipe header at lines 5-6 and 16-17 of disagg-gb300-1p1d-tp4.yaml refers to a 'GB200 sglang sibling' (dsv4-sglang-disagg-gb200-1p1d-dep8-tep8) that does not exist in this repository — the only DSv4 GB200 recipes live under srt-slurm-recipes/vllm/deepseek-v4/, and launch_gb200-nv.sh routes dsv4-fp4 exclusively through the dynamo-vllm branch. This is a comment-only inconsistency with no runtime impact, but the PR description's claim that this 'mirrors the gates the GB200 launcher already uses for the SGLang sibling' is also inaccurate. Suggest editing the header to drop the sibling references or point at the actual upstream PR #75 source instead.
Extended reasoning...
What the bug is
The new file benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml carries a header comment (lines 4-6) and a 'Local deltas' block (lines 16-17) that twice reference a sibling recipe — dsv4-sglang-disagg-gb200-1p1d-dep8-tep8 — said to live 'in this directory tree'. It also claims that the disagg-timeout triple and NCCL_MNNVL_ENABLE/NCCL_CUMEM_ENABLE env vars were 'copied from the GB200 sglang sibling'. No such sibling exists in the repo today.
The specific code path / proof
Step-by-step:
- The directory `benchmarks/multi_node/srt-slurm-recipes/sglang/` contains only the new GB300 file added by this PR — no GB200 sglang DSv4 file is present.
- A repo-wide search for `dsv4-sglang-disagg-gb200` returns only the self-reference inside this new recipe's header comment.
- DSv4 GB200 recipes do exist, but under the vllm subtree (`benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/`), not sglang. `runners/launch_gb200-nv.sh` routes `dsv4-fp4` only through the `dynamo-vllm` branch (the elif on `FRAMEWORK==dynamo-vllm`), and the overlay block copies from `srt-slurm-recipes/vllm/deepseek-v4`. There is no `dynamo-sglang` branch for `dsv4` on GB200, so the PR description's claim 'Mirrors the gates the GB200 launcher already uses for the SGLang sibling' is also off.
- `perf-changelog.yaml` has `dsv4-fp4-gb200-dynamo-vllm` but no GB200 sglang DSv4 entry.
Why existing code doesn't prevent it
YAML comments are free-form text; nothing in the launcher, srtctl, or sweep harness validates them.
Impact
Zero runtime impact — the field values themselves are valid. The cost is purely documentation: a future reader trying to compare this recipe against the alleged sibling, or wanting to update env-var rationale in lockstep, will hit a dead-end.
How to fix it
Either:
- Drop the sibling references entirely; or
- Repoint them at the actual provenance — upstream NVIDIA/srt-slurm PR #75 ("improve NVIDIA CI stability", already cited elsewhere in the same header) — for the timeout/env-var rationale; or
- If the intent was to track parity with the existing GB200 `dynamo-vllm` recipe under `srt-slurm-recipes/vllm/deepseek-v4/`, name that file instead and adjust the wording (it isn't an sglang sibling).
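Dangling sibling references like this are easy to catch mechanically. The following is a hypothetical sketch of such a check — the recipe-name pattern and the helper are illustrative assumptions, not existing CI tooling in this repo:

```python
import re

# Hypothetical lint: collect recipe-like names (several hyphenated
# segments, e.g. "dsv4-sglang-disagg-gb200-1p1d-dep8-tep8") from header
# comments and flag any that do not name a recipe that actually exists.
RECIPE_REF = re.compile(r"\b([a-z0-9]+(?:-[a-z0-9]+){3,})\b")

def dangling_refs(comment_lines, known_recipe_names):
    refs = set()
    for line in comment_lines:
        if line.lstrip().startswith("#"):
            refs.update(RECIPE_REF.findall(line))
    return sorted(r for r in refs if r not in known_recipe_names)

header = [
    "# GB300 sibling of",
    "# the dsv4-sglang-disagg-gb200-1p1d-dep8-tep8 recipe in this directory tree.",
]
known = {"disagg-gb300-1p1d-tp4"}  # names actually present in the tree
print(dangling_refs(header, known))
# → ['dsv4-sglang-disagg-gb200-1p1d-dep8-tep8']
```

In a real check, `known_recipe_names` would be built by globbing the recipes directory; here it is stubbed with a literal set to keep the sketch self-contained.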
```yaml
    - "Image: lmsysorg/sglang:deepseek-v4-grace-blackwell"
    - "Topology: 1P + 1D, both TP=4 on a single GB300; MXFP4 MoE kernels, NIXL KV transfer"
    - "Recipe ported from NVIDIA/srt-slurm PR #75 (recipes/gb300-fp4/1k1k-dsv4/disagg-1p1d-tp4-mxfp4.yaml)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```
🟡 The new perf-changelog.yaml entry for 'dsv4-fp4-gb300-dynamo-sglang' (line 1844) has 'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX' — a literal 'XXX' placeholder rather than the actual PR number. Every other entry in this file references a real PR number, so this will produce a 404 for any tooling that follows the link. Fix by replacing 'XXX' with '1169' (this PR's number) before merge.
Extended reasoning...
What the bug is
In perf-changelog.yaml at line 1844, the new changelog entry for dsv4-fp4-gb300-dynamo-sglang ends with:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

The `XXX` is a literal placeholder string, not a real PR number. The PR description acknowledges this explicitly: "perf-changelog.yaml entry triggers the new sweep (PR-link placeholder to update post-merge)".
Why this is a bug
A quick scan of the rest of perf-changelog.yaml shows ~150+ other entries, every one of which uses a real numeric PR number (e.g., /pull/1160, /pull/1144, /pull/1129). This is the only pull/XXX literal in the file. Any consumer that iterates entries to render a changelog or validate links will get a 404 on this single entry.
Why the existing review process doesn't prevent it
There is no automated post-merge backfill that rewrites XXX → the real PR number — the author plans to update it manually post-merge, but there is nothing enforcing that. If it gets forgotten (which is the typical failure mode for "I'll fix it post-merge" promises), the placeholder ships permanently.
Why it's trivially fixable now
The PR number is already known: this is PR #1169. The author can replace XXX with 1169 in a single-character-range edit before merge — there is no actual reason it has to wait until post-merge. GitHub will resolve pull/1169 correctly the moment the PR is merged (and also pre-merge, since the URL just redirects to the open PR).
Step-by-step proof
- Open `perf-changelog.yaml` and grep for `pull/XXX`: exactly one match, at line 1844, in the new `dsv4-fp4-gb300-dynamo-sglang` entry added by this PR.
- Compare with the immediately preceding entry (line 1836): `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1160` — a real, resolvable URL.
- Visit `https://github.com/SemiAnalysisAI/InferenceX/pull/XXX` in a browser → GitHub returns "Page not found" (404), since `XXX` is not a valid pull request number.
- Visit `https://github.com/SemiAnalysisAI/InferenceX/pull/1169` (this PR) → resolves correctly.
Impact
- Severity: nit. This is documentation/changelog hygiene, not a runtime bug. It does not affect benchmark execution, sweep generation (`pr-link` is metadata only), or any CI gate.
- However, it leaves a permanently dead link in the changelog for any consumer that follows pr-link URLs (e.g., changelog rendering, link validation, audit tooling).

Fix
Replace line 1844:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
```

with:

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1169
```

Note on duplicate refutations
Two verifiers refuted this as a duplicate of the other report describing the same issue. They are correct that bug_002 and bug_005 describe the same underlying issue — the synthesis agent merged them, and this single comment now represents both. The underlying issue is real and confirmed by 6 verifiers across the two original bugs; only the duplication concern was raised, not the validity.
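This class of placeholder could be caught pre-merge by a simple changelog lint. Here is a hypothetical sketch — the `pr-link:` key format is taken from the entries quoted above, but the validator itself is an illustrative assumption, not existing repo tooling:

```python
import re

# Hypothetical validator: flag any pr-link whose trailing path segment
# is not a numeric PR id (e.g. the literal "XXX" placeholder).
PR_LINK = re.compile(r"^\s*pr-link:\s*(\S+)\s*$")

def bad_pr_links(changelog_text: str):
    bad = []
    for lineno, line in enumerate(changelog_text.splitlines(), start=1):
        m = PR_LINK.match(line)
        if m and not m.group(1).rstrip("/").rsplit("/", 1)[-1].isdigit():
            bad.append((lineno, m.group(1)))
    return bad

sample = """\
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1160
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
"""
print(bad_pr_links(sample))
# flags only the XXX placeholder on line 2
```

Wired into CI, a non-empty result would fail the job, so a placeholder could never ship "to fix post-merge".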
Summary
- `benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml` overlaid onto the upstream srt-slurm checkout at runtime
- `dsv4-fp4-gb300-dynamo-sglang` master config; conc-list `[1, 4, 16, 64, 256]` matches NVIDIA/srt-slurm PR #75's ("improve NVIDIA CI stability") verified test plan
- `runners/launch_gb300-nv.sh` learns dsv4-fp4 (model path `/raid/shared/models/deepseek-v4-pro`) and overlays the local sglang recipe directory when `FRAMEWORK=dynamo-sglang && MODEL_PREFIX=dsv4`. Mirrors the gates the GB200 launcher already uses for the SGLang sibling.
- `perf-changelog.yaml` entry triggers the new sweep (PR-link placeholder to update post-merge)

Test plan
- `bash -n runners/launch_gb300-nv.sh` clean
- `python utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml --framework dynamo-sglang --runner-type gb300 --model-prefix dsv4` produces exactly one config row with the expected topology + concurrencies
- `pytest utils/matrix_logic/` — 149/149 pass
- `perf-changelog.yaml` triggers

Out of scope
- `disagg-2p1d`, `disagg-1p2d` — upstream defers these as follow-ups