[NVIDIA] port b200 from docker to slurm due to change of cluster #71

Merged
functionstackx merged 1 commit into main from b200-vllm-sglang-docker-to-slurm on Sep 28, 2025
Conversation

@functionstackx
Contributor

No description provided.

@functionstackx functionstackx marked this pull request as ready for review September 28, 2025 23:06
@functionstackx functionstackx merged commit c88a4c3 into main Sep 28, 2025
@functionstackx functionstackx deleted the b200-vllm-sglang-docker-to-slurm branch September 28, 2025 23:06
@cquil11 cquil11 added the NVIDIA label Apr 8, 2026
@cquil11 cquil11 changed the title port b200 from docker to slurm due to change of cluster [NVIDIA] port b200 from docker to slurm due to change of cluster Apr 8, 2026
Oseltamivir added a commit that referenced this pull request Apr 24, 2026
Replaces our hand-rolled 8k/1k DSV4-Pro vLLM disagg recipes with the
four topologies from NVIDIA/srt-slurm PR #71 (source fork:
alec-flowers/srt-slurm, branch aflowers/dsv4-pr67-pr68, pinned at
commit d60e3f1c). PR #71 supersedes PR #67, on which our original
8k/1k recipes were based, adding more topologies, a wider concurrency
sweep per recipe, new env vars, an explicit tokenizer mode, and
CPU/DRAM expert offload.

We take everything except offload:

  * launch_gb200-nv.sh clones alec-flowers/srt-slurm for dsv4 instead
    of NVIDIA/srt-slurm.
  * Runtime post-clone patch strips `offload-group-size`,
    `offload-num-in-group`, `offload-prefetch-step`, and the commented
    `# offload-params` line from all four 8k/1k recipes.
  * Same post-clone patch injects our `slurm.time_limit: 8:00:00` and
    `health_check: {max_attempts: 1440, interval_seconds: 10}` (4 h
    budget) so the recipes match our cold-cache Lustre load budget.
  * Model-path alias changed from `deepseek-v4-pro` to `deepseekv4-fp4`
    to match PR #71 recipes' `model.path` field; 1k/1k local recipes
    updated to the same alias.
  * nvidia-master.yaml 8k/1k block rewritten: 4 search-space entries
    (1p1d-dep8-dep8, 3p1d-dep8-dep8, 3p1d-dep8-dep16, 6p1d-dep8-dep16),
    each running conc list [4, 8, 16, 32, 64, 256, 512, 1024] — 32 total
    8k/1k benchmark points across 4 cluster startups.
  * Obsolete local 8k/1k recipes under srt-slurm-recipes/vllm/deepseek-v4/8k1k/
    removed (superseded by the PR #71 upstream files).

1k/1k sweep is unchanged otherwise (2 matrix entries, 9 benchmark
points using the hand-rolled recipes — no PR #71 equivalent at 1k/1k).
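The post-clone patch described above (strip the offload keys, then append our time-limit and health-check budget) could be sketched roughly as below. The recipe filename and the exact key layout are illustrative, not the actual upstream files; only the key names and the injected values come from the commit message.

```shell
#!/bin/sh
# Hypothetical sketch of the post-clone recipe patch: drop the four
# offload lines and append our slurm/health_check overrides.
set -eu
recipe=recipe-1p1d-dep8-dep8.yaml   # illustrative recipe filename

# Stand-in recipe content so the sketch is self-contained.
cat > "$recipe" <<'EOF'
model:
  path: deepseekv4-fp4
offload-group-size: 8
offload-num-in-group: 4
offload-prefetch-step: 2
# offload-params
EOF

# Strip every offload line, including the commented-out one.
sed -i -e '/offload-group-size/d' \
       -e '/offload-num-in-group/d' \
       -e '/offload-prefetch-step/d' \
       -e '/# offload-params/d' "$recipe"

# Append the time limit and the health-check budget
# (1440 attempts x 10 s = 14400 s = 4 h).
cat >> "$recipe" <<'EOF'
slurm:
  time_limit: 8:00:00
health_check:
  max_attempts: 1440
  interval_seconds: 10
EOF
```

In the real launcher this would run once per recipe after the clone, so the upstream files stay byte-identical to PR #71 apart from these deltas.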
Oseltamivir added a commit that referenced this pull request Apr 24, 2026
Oseltamivir added a commit that referenced this pull request Apr 25, 2026
* runners/launch_gb200-nv.sh: switch the recipe overlay step from
  `cp -r src dst` to `cp -rT src dst` (with an explicit `mkdir -p dst`
  first). Addresses the bot review nit at line 144: `cp -r src dst`
  currently works only because the upstream sa-submission-q2-2026
  branch has no `recipes/vllm/deepseek-v4/` directory; if upstream
  ever ships one, `cp -r` would nest the copy as
  `recipes/vllm/deepseek-v4/deepseek-v4/...` and CONFIG_FILE in
  nvidia-master.yaml would silently resolve to the upstream stub.
  `-T` treats the destination as the target directory itself, so the
  overlay always lands in place.

* perf-changelog.yaml: refresh the dsv4-fp4-gb200-dynamo-vllm entry's
  description. The previous wording referenced "8k1k, 7p1d-dep8-dep16"
  and "Mirrors NVIDIA/srt-slurm PR #67" which is stale after the move
  to a 1k/1k sweep with TEP low-conc (mirrored from PR #71) plus two
  hand-rolled mid/high topologies. Also fixes the directory reference
  (recipes moved to benchmarks/multi_node/srt-slurm-recipes/ during
  the cleanup pass).
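The `cp -r` vs `cp -rT` hazard in the first bullet can be reproduced in isolation. The paths below mirror the recipe layout but are a synthetic fixture, not the actual repository:

```shell
#!/bin/sh
# Demonstrates why `cp -r src dst` nests when dst already exists,
# while `cp -rT src dst` (GNU cp) overlays in place.
set -eu
work=$(mktemp -d)

# Local overlay recipes we want to land at dst/recipes/vllm/deepseek-v4/.
mkdir -p "$work/src/recipes/vllm/deepseek-v4"
echo "local" > "$work/src/recipes/vllm/deepseek-v4/recipe.yaml"

# Upstream clone that (hypothetically) already ships that directory.
mkdir -p "$work/dst/recipes/vllm/deepseek-v4"
echo "upstream stub" > "$work/dst/recipes/vllm/deepseek-v4/recipe.yaml"

# Plain -r: because the destination directory exists, cp copies src
# INTO it, producing the nested .../deepseek-v4/deepseek-v4/ path and
# leaving the upstream stub untouched at the expected location.
cp -r "$work/src/recipes/vllm/deepseek-v4" \
      "$work/dst/recipes/vllm/deepseek-v4"
test -f "$work/dst/recipes/vllm/deepseek-v4/deepseek-v4/recipe.yaml"
echo "-r nested"

# -T treats the destination as the directory itself, overlaying
# file-by-file, so the local recipe replaces the stub in place.
cp -rT "$work/src/recipes/vllm/deepseek-v4" \
       "$work/dst/recipes/vllm/deepseek-v4"
grep -q local "$work/dst/recipes/vllm/deepseek-v4/recipe.yaml"
echo "-T overlaid"

rm -rf "$work"
```

Note that `-T` is a GNU coreutils extension; the launcher runs on Linux cluster nodes, where that is a safe assumption.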