
0927 final #67

Merged

functionstackx merged 2 commits into main from amd-gpt-oss-0927final
Sep 28, 2025
Conversation

@seungrokj
Collaborator

@kimbochen @japarada @qcolombet

Please merge this.

Signed-off-by: seungrokjung <seungrok.jung@amd.com>
Signed-off-by: seungrokjung <seungrok.jung@amd.com>
@functionstackx functionstackx merged commit 079c016 into main Sep 28, 2025
@functionstackx functionstackx deleted the amd-gpt-oss-0927final branch September 28, 2025 19:03
Oseltamivir added a commit that referenced this pull request Apr 24, 2026
Replaces our hand-rolled 8k/1k DSV4-Pro vLLM disagg recipes with the
four topologies from NVIDIA/srt-slurm PR #71 (source fork:
alec-flowers/srt-slurm, branch aflowers/dsv4-pr67-pr68, pinned at
commit d60e3f1c). PR #71 supersedes PR #67, on which our original
8k/1k recipes were based, adding more topologies, a wider concurrency
sweep per recipe, new env vars, an explicit tokenizer-mode, and
CPU/DRAM expert offload.

We take everything except the offload changes:

  * launch_gb200-nv.sh clones alec-flowers/srt-slurm for dsv4 instead
    of NVIDIA/srt-slurm.
  * A runtime post-clone patch strips `offload-group-size`,
    `offload-num-in-group`, `offload-prefetch-step`, and the commented
    `# offload-params` line from all four 8k/1k recipes.
  * The same post-clone patch injects our `slurm.time_limit: 8:00:00` and
    `health_check: {max_attempts: 1440, interval_seconds: 10}` (a 4 h
    budget) so the recipes match our cold-cache Lustre load budget; a
    sketch of the patch follows this list.
  * Model-path alias changed from `deepseek-v4-pro` to `deepseekv4-fp4`
    to match PR #71 recipes' `model.path` field; 1k/1k local recipes
    updated to the same alias.
  * nvidia-master.yaml 8k/1k block rewritten: 4 search-space entries
    (1p1d-dep8-dep8, 3p1d-dep8-dep8, 3p1d-dep8-dep16, 6p1d-dep8-dep16),
    each run over the concurrency list [4, 8, 16, 32, 64, 256, 512,
    1024], for 4 × 8 = 32 total 8k/1k benchmark points across 4 cluster
    startups (enumerated in the second sketch below).
  * Obsolete local 8k/1k recipes under srt-slurm-recipes/vllm/deepseek-v4/8k1k/
    removed (superseded by the PR #71 upstream files).
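A minimal sketch of what that post-clone patch could look like. The
recipe glob, the flat YAML layout, and the append-style override are
assumptions for illustration, not lifted from launch_gb200-nv.sh:

```bash
#!/usr/bin/env bash
# Hypothetical post-clone patch; run from the srt-slurm clone root.
# The recipe path below is an assumed location, not the real one.
for recipe in recipes/vllm/deepseek-v4/8k1k/*.yaml; do
  # Drop the CPU/DRAM expert-offload knobs we do not take, including
  # the commented-out "# offload-params" line.
  sed -i \
    -e '/offload-group-size/d' \
    -e '/offload-num-in-group/d' \
    -e '/offload-prefetch-step/d' \
    -e '/# *offload-params/d' \
    "$recipe"

  # Append our overrides, assuming the recipes do not already define
  # these top-level keys. 1440 attempts x 10 s = 14400 s, i.e. the
  # 4 h health-check budget.
  cat >> "$recipe" <<'EOF'
slurm:
  time_limit: 8:00:00
health_check:
  max_attempts: 1440
  interval_seconds: 10
EOF
done
```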

The 1k/1k sweep is otherwise unchanged (2 matrix entries, 9 benchmark
points using the hand-rolled recipes; PR #71 has no 1k/1k equivalent).
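For orientation, the shape of the resulting 8k/1k sweep can be
enumerated with a throwaway loop (purely illustrative; this is not how
nvidia-master.yaml is consumed):

```bash
# 4 topologies x 8 concurrency levels = 32 benchmark points,
# one cluster startup per topology.
topologies="1p1d-dep8-dep8 3p1d-dep8-dep8 3p1d-dep8-dep16 6p1d-dep8-dep16"
concurrencies="4 8 16 32 64 256 512 1024"
n=0
for topo in $topologies; do
  for conc in $concurrencies; do
    n=$((n + 1))
    printf 'point %2d: topology=%s concurrency=%d\n' "$n" "$topo" "$conc"
  done
done
echo "total: $n benchmark points"   # prints: total: 32 benchmark points
```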
Oseltamivir added a commit that referenced this pull request Apr 25, 2026
* runners/launch_gb200-nv.sh: switch the recipe overlay step from
  `cp -r src dst` to `cp -rT src dst` (with an explicit `mkdir -p dst`
  first). Addresses the bot review nit at line 144: `cp -r src dst`
  works today only because the upstream sa-submission-q2-2026 branch
  has no `recipes/vllm/deepseek-v4/` directory; if upstream ever ships
  one, `cp -r` would nest the copy as
  `recipes/vllm/deepseek-v4/deepseek-v4/...` and CONFIG_FILE in
  nvidia-master.yaml would silently resolve to the upstream stub.
  `-T` overlays unconditionally (see the sketch after this list).

* perf-changelog.yaml: refresh the dsv4-fp4-gb200-dynamo-vllm entry's
  description. The previous wording referenced "8k1k, 7p1d-dep8-dep16"
  and "Mirrors NVIDIA/srt-slurm PR #67", which is stale after the move
  to a 1k/1k sweep with TEP low-conc (mirrored from PR #71) plus two
  hand-rolled mid/high topologies. It also fixes the directory
  reference (the recipes moved to
  benchmarks/multi_node/srt-slurm-recipes/ during the cleanup pass).
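A sketch of the overlay fix, with placeholder paths (`SRC_RECIPES` and
`DST_RECIPES` stand in for the launcher's real paths and variable
names):

```bash
# Placeholder paths; the launcher's actual variables differ.
SRC_RECIPES=benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4
DST_RECIPES=srt-slurm/recipes/vllm/deepseek-v4

# Before: if DST_RECIPES already exists (e.g. upstream starts shipping
# it), `cp -r` copies SRC_RECIPES *into* it, producing the nested
# .../deepseek-v4/deepseek-v4/... layout and leaving the upstream stub
# where CONFIG_FILE expects our recipes.
#cp -r "$SRC_RECIPES" "$DST_RECIPES"

# After: -T (--no-target-directory) treats DST_RECIPES as the
# destination itself, so our tree overlays the upstream one whether or
# not it already exists.
mkdir -p "$DST_RECIPES"
cp -rT "$SRC_RECIPES" "$DST_RECIPES"
```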
