[NVIDIA] port b200 from docker to slurm due to change of cluster #71

Merged
functionstackx merged 1 commit into main from b200-vllm-sglang-docker-to-slurm on Sep 28, 2025
Conversation

@functionstackx
Contributor

No description provided.

@functionstackx functionstackx marked this pull request as ready for review September 28, 2025 23:06
@functionstackx functionstackx merged commit c88a4c3 into main Sep 28, 2025
@functionstackx functionstackx deleted the b200-vllm-sglang-docker-to-slurm branch September 28, 2025 23:06
@cquil11 cquil11 added the NVIDIA label Apr 8, 2026
@cquil11 cquil11 changed the title port b200 from docker to slurm due to change of cluster [NVIDIA] port b200 from docker to slurm due to change of cluster Apr 8, 2026
Oseltamivir added a commit that referenced this pull request Apr 24, 2026
Replaces our hand-rolled 8k/1k DSV4-Pro vLLM disagg recipes with the
four topologies from NVIDIA/srt-slurm PR #71 (source fork:
alec-flowers/srt-slurm, branch aflowers/dsv4-pr67-pr68, pinned at
commit d60e3f1c). PR #71 supersedes PR #67, on which our original
8k/1k recipes were based, adding more topologies, a wider concurrency
sweep per recipe, new env vars, an explicit tokenizer mode, and
CPU/DRAM expert offload.

We take everything except offload:

  * launch_gb200-nv.sh clones alec-flowers/srt-slurm for dsv4 instead
    of NVIDIA/srt-slurm.
  * Runtime post-clone patch strips `offload-group-size`,
    `offload-num-in-group`, `offload-prefetch-step`, and the commented
    `# offload-params` line from all four 8k/1k recipes.
  * Same post-clone patch injects our `slurm.time_limit: 8:00:00` and
    `health_check: {max_attempts: 1440, interval_seconds: 10}` (4 h
    budget) so the recipes match our cold-cache Lustre load budget.
  * Model-path alias changed from `deepseek-v4-pro` to `deepseekv4-fp4`
    to match PR #71 recipes' `model.path` field; 1k/1k local recipes
    updated to the same alias.
  * nvidia-master.yaml 8k/1k block rewritten: 4 search-space entries
    (1p1d-dep8-dep8, 3p1d-dep8-dep8, 3p1d-dep8-dep16, 6p1d-dep8-dep16),
    each running conc list [4, 8, 16, 32, 64, 256, 512, 1024] — 32 total
    8k/1k benchmark points across 4 cluster startups.
  * Obsolete local 8k/1k recipes under srt-slurm-recipes/vllm/deepseek-v4/8k1k/
    removed (superseded by the PR #71 upstream files).

1k/1k sweep is unchanged otherwise (2 matrix entries, 9 benchmark
points using the hand-rolled recipes — no PR #71 equivalent at 1k/1k).
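The post-clone patch described above (strip the offload keys, then append our time-limit and health-check budget) could be sketched roughly as below. The recipe filename and the exact key layout are illustrative, not the actual upstream files; only the key names and the injected values come from the commit message.

```shell
#!/bin/sh
# Hypothetical sketch of the post-clone recipe patch: drop the four
# offload lines and append our slurm/health_check overrides.
set -eu
recipe=recipe-1p1d-dep8-dep8.yaml   # illustrative recipe filename

# Stand-in recipe content so the sketch is self-contained.
cat > "$recipe" <<'EOF'
model:
  path: deepseekv4-fp4
offload-group-size: 8
offload-num-in-group: 4
offload-prefetch-step: 2
# offload-params
EOF

# Strip every offload line, including the commented-out one.
sed -i -e '/offload-group-size/d' \
       -e '/offload-num-in-group/d' \
       -e '/offload-prefetch-step/d' \
       -e '/# offload-params/d' "$recipe"

# Append the time limit and the health-check budget
# (1440 attempts x 10 s = 14400 s = 4 h).
cat >> "$recipe" <<'EOF'
slurm:
  time_limit: 8:00:00
health_check:
  max_attempts: 1440
  interval_seconds: 10
EOF
```

In the real launcher this would run once per recipe after the clone, so the upstream files stay byte-identical to PR #71 apart from these deltas.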
Oseltamivir added a commit that referenced this pull request Apr 24, 2026
Oseltamivir added a commit that referenced this pull request Apr 25, 2026
* runners/launch_gb200-nv.sh: switch the recipe overlay step from
  `cp -r src dst` to `cp -rT src dst` (with an explicit `mkdir -p dst`
  first). Addresses the bot review nit at line 144: `cp -r src dst`
  currently works only because the upstream sa-submission-q2-2026
  branch has no `recipes/vllm/deepseek-v4/` directory; if upstream
  ever ships one, `cp -r` would nest the copy as
  `recipes/vllm/deepseek-v4/deepseek-v4/...` and CONFIG_FILE in
  nvidia-master.yaml would silently resolve to the upstream stub.
  `-T` treats the destination as the target directory itself, so the
  overlay always lands in place.

* perf-changelog.yaml: refresh the dsv4-fp4-gb200-dynamo-vllm entry's
  description. The previous wording referenced "8k1k, 7p1d-dep8-dep16"
  and "Mirrors NVIDIA/srt-slurm PR #67" which is stale after the move
  to a 1k/1k sweep with TEP low-conc (mirrored from PR #71) plus two
  hand-rolled mid/high topologies. Also fixes the directory reference
  (recipes moved to benchmarks/multi_node/srt-slurm-recipes/ during
  the cleanup pass).
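The `cp -r` vs `cp -rT` hazard in the first bullet can be reproduced in isolation. The paths below mirror the recipe layout but are a synthetic fixture, not the actual repository:

```shell
#!/bin/sh
# Demonstrates why `cp -r src dst` nests when dst already exists,
# while `cp -rT src dst` (GNU cp) overlays in place.
set -eu
work=$(mktemp -d)

# Local overlay recipes we want to land at dst/recipes/vllm/deepseek-v4/.
mkdir -p "$work/src/recipes/vllm/deepseek-v4"
echo "local" > "$work/src/recipes/vllm/deepseek-v4/recipe.yaml"

# Upstream clone that (hypothetically) already ships that directory.
mkdir -p "$work/dst/recipes/vllm/deepseek-v4"
echo "upstream stub" > "$work/dst/recipes/vllm/deepseek-v4/recipe.yaml"

# Plain -r: because the destination directory exists, cp copies src
# INTO it, producing the nested .../deepseek-v4/deepseek-v4/ path and
# leaving the upstream stub untouched at the expected location.
cp -r "$work/src/recipes/vllm/deepseek-v4" \
      "$work/dst/recipes/vllm/deepseek-v4"
test -f "$work/dst/recipes/vllm/deepseek-v4/deepseek-v4/recipe.yaml"
echo "-r nested"

# -T treats the destination as the directory itself, overlaying
# file-by-file, so the local recipe replaces the stub in place.
cp -rT "$work/src/recipes/vllm/deepseek-v4" \
       "$work/dst/recipes/vllm/deepseek-v4"
grep -q local "$work/dst/recipes/vllm/deepseek-v4/recipe.yaml"
echo "-T overlaid"

rm -rf "$work"
```

Note that `-T` is a GNU coreutils extension; the launcher runs on Linux cluster nodes, where that is a safe assumption.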