Skip to content

gb300 1k1k sglang#1169

Open
Oseltamivir wants to merge 1 commit intomainfrom
gb300-1k1k-sglang
Open

gb300 1k1k sglang#1169
Oseltamivir wants to merge 1 commit intomainfrom
gb300-1k1k-sglang

Conversation

@Oseltamivir
Copy link
Copy Markdown
Collaborator

Summary

  • Port NVIDIA/srt-slurm PR #75 (DeepSeek-V4-Pro disagg 1P+1D TP=4 MXFP4 on GB300, 1k/1k) into the InferenceX sweep
  • New recipe benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml overlaid onto the upstream srt-slurm checkout at runtime
  • New dsv4-fp4-gb300-dynamo-sglang master config; conc-list [1, 4, 16, 64, 256] matches PR improve NVIDIA CI stability #75's verified test plan
  • runners/launch_gb300-nv.sh learns dsv4-fp4 (model path /raid/shared/models/deepseek-v4-pro) and overlays the local sglang recipe directory when FRAMEWORK=dynamo-sglang && MODEL_PREFIX=dsv4. Mirrors the gates the GB200 launcher already uses for the SGLang sibling.
  • perf-changelog.yaml entry triggers the new sweep (PR-link placeholder to update post-merge)

Test plan

  • bash -n runners/launch_gb300-nv.sh clean
  • python utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml --framework dynamo-sglang --runner-type gb300 --model-prefix dsv4 produces exactly one config row with the expected topology + concurrencies
  • Full-sweep generation parses end-to-end (1319 total configs)
  • pytest utils/matrix_logic/ — 149/149 pass
  • End-to-end on the gb300-nv cluster — exercised by the workflow once perf-changelog.yaml triggers

Out of scope

  • Aggregated GB300 DSv4 SGLang recipes (no agg-recipe area in this repo)
  • 8k/1k variant — upstream PR improve NVIDIA CI stability #75 is 1k/1k only
  • Additional disagg topologies (disagg-2p1d, disagg-1p2d) — upstream defers these as follow-ups

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +21 to +27
elif [[ $MODEL_PREFIX == "dsv4" && $PRECISION == "fp4" ]]; then
# SRT_SLURM_MODEL_PREFIX matches the model.path alias in our DSv4
# sglang recipes (benchmarks/multi_node/srt-slurm-recipes/sglang/
# deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml).
export SERVED_MODEL_NAME="deepseek-v4-pro"
export MODEL_PATH=/raid/shared/models/deepseek-v4-pro
export SRT_SLURM_MODEL_PREFIX="deepseek-v4-pro"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The new GB300 sglang DSv4 recipe declares container: "lmsysorg/sglang:deepseek-v4-grace-blackwell" (a literal image name), but the containers: map emitted by runners/launch_gb300-nv.sh only declares the aliases dynamo-trtllm, dynamo-sglang, and nginx-sqsh — it has no entry for the literal IMAGE. launch_gb200-nv.sh already includes "${IMAGE}": ${SQUASH_FILE} precisely for locally-shipped recipes that use literal image names; without the same line here, srtctl cannot resolve the container for the new dsv4-fp4-gb300-dynamo-sglang sweep and will fall through to a docker pull on compute nodes rather than mount the imported squashfs. Fix by adding "${IMAGE}": ${SQUASH_FILE} to the containers block (or change the recipe to container: dynamo-sglang to use the existing alias).

Extended reasoning...

What the bug is\n\nbenchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml line 21 sets:\nyaml\ncontainer: "lmsysorg/sglang:deepseek-v4-grace-blackwell"\n\nThis is a literal docker image name, not an alias. The srtslurm.yaml heredoc emitted by runners/launch_gb300-nv.sh only contains:\nyaml\ncontainers:\n dynamo-trtllm: ${SQUASH_FILE}\n dynamo-sglang: ${SQUASH_FILE}\n nginx-sqsh: ${NGINX_SQUASH_FILE}\n\nThere is no entry mapping lmsysorg/sglang:deepseek-v4-grace-blackwell (or ${IMAGE}) to the locally-imported squashfs.\n\n## Why existing code does not save us\n\nIn srt-slurm/src/srtctl/core/config.py the alias resolver only rewrites model.container when the value is a key in the containers: map; otherwise the literal string passes through unchanged. runtime.py then treats only strings starting with / or ./ as a squashfs path — anything else becomes a literal --container-image argument to srun, which under pyxis/enroot will attempt a registry pull on the compute node rather than mount the pre-imported sqsh.\n\n## Why this is new on GB300\n\nThe pre-PR launch_gb300-nv.sh only handled dsr1 recipes from upstream srt-slurm, all of which use container: dynamo-sglang / dynamo-trtllm (aliases) — so the missing literal-IMAGE mapping never bit. This PR ships the first GB300 recipe that uses a literal image name, and it ships exactly the launcher change the PR description claims it mirrors from GB200 — except the containers-map line was not mirrored.\n\n## Direct precedent in the GB200 sibling\n\nrunners/launch_gb200-nv.sh (the launcher whose gates the PR description explicitly says it mirrors) already has at line 206:\nyaml\n "${IMAGE}": ${SQUASH_FILE}\n\nThat line was added because GB200's locally-shipped vllm DSv4 recipes use literal container names like vllm/vllm-openai:deepseekv4-cu130. The H100 and H200 launchers (launch_h100-dgxc-slurm.sh, launch_h200-dgxc-slurm.sh) include the same line for the same reason. The new GB300 sglang DSv4 recipe is structurally identical (literal image string in container:), and needs the same launcher entry.\n\n## Step-by-step proof\n\n1. CI runs dsv4-fp4-gb300-dynamo-sglang. launch_gb300-nv.sh enters the new dsv4-fp4 branch, sets IMAGE=lmsysorg/sglang:deepseek-v4-grace-blackwell, and imports it into /home/sa-shared/squash/lmsysorg_sglang_deepseek-v4-grace-blackwell.sqsh.\n2. The heredoc writes srtslurm.yaml with three container aliases (none of them the literal image string).\n3. srtctl apply -f recipes/sglang/deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml reads model.container = "lmsysorg/sglang:deepseek-v4-grace-blackwell".\n4. config.py: if container in containers — false (only dynamo-trtllm, dynamo-sglang, nginx-sqsh). Literal string passes through unchanged.\n5. runtime.py: the string does not start with / or ./, so it is treated as a docker image name and passed to srun --container-image=lmsysorg/sglang:deepseek-v4-grace-blackwell.\n6. Pyxis/enroot tries to pull from Docker Hub on the compute node. The pre-imported squashfs in /home/sa-shared/squash/ is never used. Compute nodes typically lack registry pull access (and even if they did, the import was wasted) — runtime failure for the whole sweep.\n\n## Fix\n\nEither add the literal-IMAGE line to the heredoc in launch_gb300-nv.sh next to the existing aliases:\nyaml\ncontainers:\n dynamo-trtllm: ${SQUASH_FILE}\n dynamo-sglang: ${SQUASH_FILE}\n "${IMAGE}": ${SQUASH_FILE}\n nginx-sqsh: ${NGINX_SQUASH_FILE}\n\n(mirroring launch_gb200-nv.sh:206 exactly), or change the new recipe to container: "dynamo-sglang" to reuse the existing alias.\n\n## Impact / severity\n\nnormal — this blocks the only feature this PR adds. Any dsv4-fp4-gb300-dynamo-sglang run will fail at srun time, never reaching the SGLang server. Should block merge until either the launcher gets the IMAGE mapping or the recipe switches to the alias.

Comment on lines +4 to +17
# dynamo frontend. Ported from NVIDIA/srt-slurm PR #75
# (recipes/gb300-fp4/1k1k-dsv4/disagg-1p1d-tp4-mxfp4.yaml). GB300 sibling of
# the dsv4-sglang-disagg-gb200-1p1d-dep8-tep8 recipe in this directory tree.
#
# Topology: 1 prefill node + 1 decode node, each TP=4 on a single GB300
# (4 GPUs / node). KV transfer over NIXL. Targets steady decode TPOT under
# moderate-to-high concurrency.
#
# Local deltas vs upstream PR #75:
# * benchmark.type = sa-bench (upstream uses "manual" because they pair
# with a separate sa-bench launcher; our sweep harness drives sa-bench
# in-recipe).
# * Disagg timeout triple + NCCL_MNNVL/CUMEM env vars copied from the
# GB200 sglang sibling — same handshake-stability rationale.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new recipe header at lines 5-6 and 16-17 of disagg-gb300-1p1d-tp4.yaml refers to a 'GB200 sglang sibling' (dsv4-sglang-disagg-gb200-1p1d-dep8-tep8) that does not exist in this repository — the only DSv4 GB200 recipes live under srt-slurm-recipes/vllm/deepseek-v4/, and launch_gb200-nv.sh routes dsv4-fp4 exclusively through the dynamo-vllm branch. This is a comment-only inconsistency with no runtime impact, but the PR description's claim that this 'mirrors the gates the GB200 launcher already uses for the SGLang sibling' is also inaccurate. Suggest editing the header to drop the sibling references or point at the actual upstream PR #75 source instead.

Extended reasoning...

What the bug is

The new file benchmarks/multi_node/srt-slurm-recipes/sglang/deepseek-v4/1k1k/disagg-gb300-1p1d-tp4.yaml carries a header comment (lines 4-6) and a 'Local deltas' block (lines 16-17) that twice reference a sibling recipe — dsv4-sglang-disagg-gb200-1p1d-dep8-tep8 — said to live 'in this directory tree'. It also claims that the disagg-timeout triple and NCCL_MNNVL_ENABLE/NCCL_CUMEM_ENABLE env vars were 'copied from the GB200 sglang sibling'. No such sibling exists in the repo today.

The specific code path / proof

Step-by-step:

  1. The directory benchmarks/multi_node/srt-slurm-recipes/sglang/ contains only the new GB300 file added by this PR — no GB200 sglang DSv4 file is present.
  2. A repo-wide search for dsv4-sglang-disagg-gb200 returns only the self-reference inside this new recipe's header comment.
  3. DSv4 GB200 recipes do exist, but under the vllm subtree (benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/), not sglang.
  4. runners/launch_gb200-nv.sh routes dsv4-fp4 only through the dynamo-vllm branch (the elif on FRAMEWORK==dynamo-vllm) and the overlay block copies from srt-slurm-recipes/vllm/deepseek-v4. There is no dynamo-sglang branch for dsv4 on GB200, so the PR description's claim 'Mirrors the gates the GB200 launcher already uses for the SGLang sibling' is also off.
  5. perf-changelog.yaml has dsv4-fp4-gb200-dynamo-vllm but no GB200 sglang DSv4 entry.

Why existing code doesn't prevent it

YAML comments are free-form text; nothing in the launcher, srtctl, or sweep harness validates them.

Impact

Zero runtime impact — the field values themselves are valid. The cost is purely documentation: a future reader trying to compare this recipe against the alleged sibling, or wanting to update env-var rationale in lockstep, will hit a dead-end.

How to fix it

Either:

  • Drop the sibling references entirely; or
  • Repoint them at the actual provenance — upstream NVIDIA/srt-slurm PR improve NVIDIA CI stability #75 (already cited elsewhere in the same header) for the timeout/env-var rationale; or
  • If the intent was to track parity with the existing GB200 dynamo-vllm recipe under srt-slurm-recipes/vllm/deepseek-v4/, name that file instead and adjust the wording (it isn't an sglang sibling).

Comment thread perf-changelog.yaml
- "Image: lmsysorg/sglang:deepseek-v4-grace-blackwell"
- "Topology: 1P + 1D, both TP=4 on a single GB300; MXFP4 MoE kernels, NIXL KV transfer"
- "Recipe ported from NVIDIA/srt-slurm PR #75 (recipes/gb300-fp4/1k1k-dsv4/disagg-1p1d-tp4-mxfp4.yaml)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new perf-changelog.yaml entry for 'dsv4-fp4-gb300-dynamo-sglang' (line 1844) has 'pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX' — a literal 'XXX' placeholder rather than the actual PR number. Every other entry in this file references a real PR number, so this will produce a 404 for any tooling that follows the link. Fix by replacing 'XXX' with '1169' (this PR's number) before merge.

Extended reasoning...

What the bug is

In perf-changelog.yaml at line 1844, the new changelog entry for dsv4-fp4-gb300-dynamo-sglang ends with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

The XXX is a literal placeholder string, not a real PR number. The PR description acknowledges this explicitly: "perf-changelog.yaml entry triggers the new sweep (PR-link placeholder to update post-merge)".

Why this is a bug

A quick scan of the rest of perf-changelog.yaml shows ~150+ other entries, every one of which uses a real numeric PR number (e.g., /pull/1160, /pull/1144, /pull/1129). This is the only pull/XXX literal in the file. Any consumer that iterates entries to render a changelog or validate links will get a 404 on this single entry.

Why the existing review process doesn't prevent it

There is no automated post-merge backfill that rewrites XXX → the real PR number — the author plans to update it manually post-merge, but there is nothing enforcing that. If it gets forgotten (which is the typical failure mode for "I'll fix it post-merge" promises), the placeholder ships permanently.

Why it's trivially fixable now

The PR number is already known: this is PR #1169. The author can replace XXX with 1169 in a single-character-range edit before merge — there is no actual reason it has to wait until post-merge. GitHub will resolve pull/1169 correctly the moment the PR is merged (and also pre-merge, since the URL just redirects to the open PR).

Step-by-step proof

  1. Open perf-changelog.yaml and grep for pull/XXX:
    • Exactly one match, at line 1844, in the new dsv4-fp4-gb300-dynamo-sglang entry added by this PR.
  2. Compare with the immediately preceding entry (line 1836): pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1160 — a real, resolvable URL.
  3. Visit https://github.com/SemiAnalysisAI/InferenceX/pull/XXX in a browser → GitHub returns "Page not found" (404), since XXX is not a valid pull request number.
  4. Visit https://github.com/SemiAnalysisAI/InferenceX/pull/1169 (this PR) → resolves correctly.

Impact

  • Severity: nit. This is documentation/changelog hygiene, not a runtime bug. It does not affect benchmark execution, sweep generation (pr-link is metadata only), or any CI gate.
  • However, it leaves a permanently dead link in the changelog for any consumer that follows pr-link URLs (e.g., changelog rendering, link validation, audit tooling).

Fix

Replace line 1844:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1169

Note on duplicate refutations

Two verifiers refuted this as a duplicate of the other report describing the same issue. They are correct that bug_002 and bug_005 describe the same underlying issue — the synthesis agent merged them, and this single comment now represents both. The underlying issue is real and confirmed by 6 verifiers across the two original bugs; only the duplication concern was raised, not the validity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant