Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge yours into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
@Klaud-Cold Can you update the PR description?
@claude Can you update the PR description?
Claude finished @functionstackx's task in 1m 15s — Updated PR Description
Updated the PR description.
@claude can you cross-check the master YAML diff to the recipe diff to make sure all params such as tp, ep, image, etc. match
Claude finished @cquil11's task in 2m 52s — Cross-checking master YAML vs recipe YAML parameters
Mapping convention: the master YAML uses abstract fields that map to vLLM config as follows:
Global Fields (all 6 recipes)
Recipe 1: disagg-gb300-1p6d-dep4-tp4.yaml (conc=192)
| Field | Master YAML | Recipe | Match? |
|---|---|---|---|
| Concurrency | 192 | concurrencies: "192" | ✅ |
| Prefill workers | num-worker: 1 | prefill_workers: 1 | ✅ |
| Decode workers | num-worker: 6 | decode_workers: 6 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true (1×4=4 GPUs) | ✅ |
| Decode tp/ep/dp-attn | tp:4, ep:1, dp-attn:false → plain TP=4 | TP=4, no DP, no EP | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 4 | gpus_per_decode: 4 | ✅ |
⚠️ The recipe's decode also sets `enable-ep-weight-filter: true` and `all2all-backend: "flashinfer_nvlink_one_sided"` despite master having `ep: 1` (no EP). These EP-only flags are inconsistent with a plain TP=4 decode. (Also flagged in the prior review.)
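For reference, the dp-attn mapping applied throughout these tables can be sketched as a tiny helper. The function name is invented for illustration; the logic is only what the tables imply (dp-attn:true means DEP, i.e. TP=1 with DP=tp and EP on; otherwise EP is on iff ep>1):

```shell
# Hypothetical helper, not from the repo: derive the effective vLLM
# parallelism a recipe should declare from the master YAML's tp/ep/dp-attn.
derive_parallelism() {
  local tp="$1" ep="$2" dp_attn="$3"
  if [ "$dp_attn" = "true" ]; then
    # DEP: data-parallel attention, so vLLM runs TP=1, DP=tp, EP on.
    echo "TP=1 DP=${tp} EP=true"
  elif [ "$ep" -gt 1 ]; then
    # TEP: tensor-parallel attention with expert parallelism.
    echo "TP=${tp} DP=1 EP=true"
  else
    # Plain TP: no data parallelism, no expert parallelism.
    echo "TP=${tp} DP=1 EP=false"
  fi
}

derive_parallelism 4 4 true    # Recipe 1 prefill: DEP=4
derive_parallelism 4 1 false   # Recipe 1 decode: plain TP=4
```

Under this reading, the prefill call prints `TP=1 DP=4 EP=true` and the decode call prints `TP=4 DP=1 EP=false`, matching Recipe 1's rows.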
Recipe 2: disagg-gb300-1p17d-tep4-tp4.yaml (conc=18)
| Field | Master YAML | Recipe | Match? |
|---|---|---|---|
| Concurrency | 18 | concurrencies: "18" | ✅ |
| Prefill workers | num-worker: 1 | prefill_workers: 1 | ✅ |
| Decode workers | num-worker: 17 | decode_workers: 17 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:false → TEP=4 | TP=4, EP=true, no DP | ✅ |
| Decode tp/ep/dp-attn | tp:4, ep:1, dp-attn:false → plain TP=4 | TP=4, no DP, no EP | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 4 | gpus_per_decode: 4 | ✅ |
⚠️ The recipe's decode sets `enable-ep-weight-filter: true` and `all2all-backend: "flashinfer_nvlink_one_sided"` without EP enabled.
Recipe 3: disagg-gb300-4p1d-dep4-dep8-24-c4096.yaml (conc=4096)
| Field | Master YAML | Recipe | Match? |
|---|---|---|---|
| Concurrency | 4096 | concurrencies: "4096" | ✅ |
| Prefill workers | num-worker: 4 | prefill_workers: 4 | ✅ |
| Decode workers | num-worker: 1 | decode_workers: 1 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true (1×4=4 GPUs) | ✅ |
| Decode tp/ep/dp-attn | tp:8, ep:8, dp-attn:true → DEP=8 | TP=1, DP=8, EP=true (1×8=8 GPUs) | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 8 | gpus_per_decode: 8 | ✅ |
| Total GPUs | — | 4×4 + 1×8 = 24 (matches name) | ✅ |
⚠️ Decode is missing `max-num-batched-tokens: 512` (all sibling DEP-decode recipes set it). (Also flagged in the prior review.)
Recipe 4: disagg-gb300-5p1d-dep4-dep8-28-c4096.yaml (conc=4096)
| Field | Master YAML | Recipe | Match? |
|---|---|---|---|
| Concurrency | 4096 | concurrencies: "4096" | ✅ |
| Prefill workers | num-worker: 5 | prefill_workers: 5 | ✅ |
| Decode workers | num-worker: 1 | decode_workers: 1 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true | ✅ |
| Decode tp/ep/dp-attn | tp:8, ep:8, dp-attn:true → DEP=8 | TP=1, DP=8, EP=true | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 8 | gpus_per_decode: 8 | ✅ |
| Total GPUs | — | 5×4 + 1×8 = 28 (matches name) | ✅ |
⚠️ Decode is missing `max-num-batched-tokens: 512`.
Recipe 5: disagg-gb300-6p1d-dep4-dep8-32-c4096.yaml (conc=4096)
| Field | Master YAML | Recipe | Match? |
|---|---|---|---|
| Concurrency | 4096 | concurrencies: "4096" | ✅ |
| Prefill workers | num-worker: 6 | prefill_workers: 6 | ✅ |
| Decode workers | num-worker: 1 | decode_workers: 1 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true | ✅ |
| Decode tp/ep/dp-attn | tp:8, ep:8, dp-attn:true → DEP=8 | TP=1, DP=8, EP=true | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 8 | gpus_per_decode: 8 | ✅ |
| Total GPUs | — | 6×4 + 1×8 = 32 (matches name) | ✅ |
⚠️ Decode is missing `max-num-batched-tokens: 512`.
Recipe 6: disagg-gb300-7p2d-dep4-dep16.yaml (conc=3072)
| Field | Master YAML | Recipe | Match? |
|---|---|---|---|
| Concurrency | 3072 | concurrencies: "3072" | ✅ |
| Prefill workers | num-worker: 7 | prefill_workers: 7 | ✅ |
| Decode workers | num-worker: 2 | decode_workers: 2 | ✅ |
| Prefill tp/ep/dp-attn | tp:4, ep:4, dp-attn:true → DEP=4 | TP=1, DP=4, EP=true | ✅ |
| Decode tp/ep/dp-attn | tp:16, ep:16, dp-attn:true → DEP=16 | TP=1, DP=16, EP=true | ✅ |
| GPUs/prefill | 4 | gpus_per_prefill: 4 | ✅ |
| GPUs/decode | 16 | gpus_per_decode: 16 | ✅ |
| Total GPUs | — | 7×4 + 2×16 = 60 | ✅ |
✅ No issues — this recipe is fully consistent.
Summary
All core params (image, model, precision, tp, ep, dp-attn, worker counts, GPU counts, concurrencies, ISL/OSL) match between the master YAML and all 6 recipes. No mismatches found on the primary fields.
Two consistency issues (also flagged in the prior automated review) remain worth noting:
| Issue | Affected Recipes | Severity |
|---|---|---|
| Decode has EP-only flags (`enable-ep-weight-filter`, `all2all-backend`) but master says `ep: 1` (no EP) | 1p6d, 1p17d | 🟡 Medium — likely a no-op, but inconsistent with all sibling recipes |
| Decode missing `max-num-batched-tokens: 512` (all siblings set it) | 4p1d, 5p1d, 6p1d | 🟡 Low — `max-num-seqs: 512` caps decode batches in practice, but deviates from the pattern |
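The second issue is mechanically checkable. A minimal sketch (the helper name is invented, and it assumes the flag appears as a plain `key: value` line somewhere in the recipe YAML):

```shell
# Hypothetical check, not part of the repo tooling: warn when a recipe
# file never sets max-num-batched-tokens.
check_decode_flags() {
  local f="$1"
  if ! grep -q 'max-num-batched-tokens:' "$f"; then
    echo "WARN: $f missing max-num-batched-tokens"
  fi
}
```

Running it over the 4p1d/5p1d/6p1d recipes would confirm the gap before patching them.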
Mirror the launch_gb300-nv.sh pattern: compute nodes are aarch64, the runner pod is x86_64, so dispatch `enroot import` to a compute node via srun rather than relying on a manually-staged sqsh. flock on the shared VAST lock file serializes parallel cw_0/1/2/3 jobs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
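The srun-under-flock shape described here looks roughly like the following sketch (the lock and sqsh paths are made up, and the actual script may differ):

```shell
# Sketch of the import pattern: the x86_64 runner pod cannot produce an
# aarch64 sqsh locally, so the enroot import is dispatched to a compute
# node via srun, and an exclusive flock serializes sibling cw_* jobs.
import_sqsh() {
  local lock="$1" sqsh="$2" image="$3"
  (
    flock -x 9   # wait until no sibling job holds the shared lock
    if [ ! -f "$sqsh" ]; then
      srun --nodes=1 --ntasks=1 enroot import -o "$sqsh" "$image"
    fi
  ) 9>"$lock"
}

# e.g. (illustrative paths):
# import_sqsh /scratch/.import.lock /scratch/images/vllm.sqsh \
#   docker://vllm/vllm-openai:v0.20.0-ubuntu2404
```

The `9>"$lock"` redirection opens the lock file on a fixed descriptor for the subshell, so the lock is released automatically when the subshell exits.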
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25202898401
The cw runners hit a srtctl version mismatch on the dynamo-vllm srt-slurm pin (aflowers/gb200-dsv4-recipes rejects the default_bash_preamble field, dropping the model_paths block). Route this config to the nv runners until the cw srtctl pin is bumped. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25202974835
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=25203293262
* Add recipes
* fix benchmark, fix srt-slurm branch
* update runner
* chore: resolve dsv4 gb300 changelog merge markers
* fix: use gb300 local dsv4 model path
* ci: support runner filtering for test configs
* fix: isolate gb300 srt setup state
* fix: remove unsupported gb300 recipe metadata
* clean up
* fix: support gb300 cw vllm launcher
* gb300-cw: import squash files via srun under flock

  Mirror the launch_gb300-nv.sh pattern: compute nodes are aarch64, the runner pod is x86_64, so dispatch `enroot import` to a compute node via srun rather than relying on a manually-staged sqsh. flock on the shared VAST lock file serializes parallel cw_0/1/2/3 jobs.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Pin dsv4-fp4-gb300-dynamo-vllm to gb300-nv runners

  The cw runners hit a srtctl version mismatch on the dynamo-vllm srt-slurm pin (aflowers/gb200-dsv4-recipes rejects the default_bash_preamble field, dropping the model_paths block). Route this config to the nv runners until the cw srtctl pin is bumped.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Cameron Quilici <cjquilici@gmail.com>
Co-authored-by: Alec Flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
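The "fix: support gb300 cw vllm launcher" change reworks how the CW launcher dispatches frameworks. A rough sketch of a model/precision outer gate with a framework inner gate, with the function name invented and the srt-slurm pins taken from this PR's description:

```shell
# Illustrative only: outer gate on model prefix + precision, inner gate on
# framework, echoing which srt-slurm pin the launcher would check out.
select_srt_repo() {
  local model_prefix="$1" precision="$2" framework="$3"
  if [ "$model_prefix" = "dsv4" ] && [ "$precision" = "fp4" ]; then
    case "$framework" in
      dynamo-sglang) echo "fzyzcjy/srt-slurm" ;;  # existing fork pin
      dynamo-vllm)   echo "NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes" ;;
      *) echo "unsupported framework: $framework" >&2; return 1 ;;
    esac
  else
    echo "unsupported model/precision: $model_prefix/$precision" >&2
    return 1
  fi
}
```

Structuring it this way lets CW accept both `dynamo-sglang` and `dynamo-vllm` for `dsv4/fp4` while rejecting everything else with a nonzero exit.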
Summary

Adds DeepSeek-V4-Pro FP4 disaggregated Dynamo vLLM benchmark recipes for GB300 at the 8k/1k sequence-length sweep, and updates both GB300 launchers (NV and CW) to support the new `dynamo-vllm` framework for DSV4.

What Changed
New config: `dsv4-fp4-gb300-dynamo-vllm` (nvidia-master.yaml)

Six pareto points covering the 8k/1k ISL/OSL sweep:

- `1p6d-dep4-tp4`
- `1p17d-tep4-tp4`
- `4p1d-dep4-dep8-24-c4096`
- `5p1d-dep4-dep8-28-c4096`
- `6p1d-dep4-dep8-32-c4096`
- `7p2d-dep4-dep16`

All recipes use `vllm/vllm-openai:v0.20.0-ubuntu2404`, the `deep_gemm_mega_moe` MoE backend, and NATS/etcd disaggregated orchestration.

Launcher updates
`runners/launch_gb300-nv.sh`

- `dsv4/fp4` model gate → model path `/scratch/models/DeepSeek-V4-Pro`.
- `dynamo-vllm` + `dsv4` branch that clones `NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes` and overlays the vLLM DSV4 recipes.
- (`RUN_KEY`) to avoid collisions when multiple jobs share the same runner.
- `set -exo pipefail` for stricter error handling.

`runners/launch_gb300-cw.sh`

- `FRAMEWORK == dynamo-sglang` to a `MODEL_PREFIX + PRECISION` outer gate with a `FRAMEWORK` inner gate, so CW now accepts both `dynamo-sglang` and `dynamo-vllm` for `dsv4/fp4`.
- `dynamo-sglang`: keeps the existing `fzyzcjy/srt-slurm` fork pin and SGLang recipe overlay.
- `dynamo-vllm`: checks out `NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes` and overlays the vLLM DSV4 recipes.
- `dynamo-vllm` container entry in `srtslurm.yaml`.
- `/scratch/models/dsv4/`.

Sweep config tooling (`generate_sweep_configs.py`)

- `_runner_values_for_filter()` helper for `--runner-node-filter` support in `test-config` sweeps.
- `generate_test_config_sweep()` now accepts `runner_data` and expands runner entries per the filter, enabling targeted single-node dispatch.

Other
- `perf-changelog.yaml`: added entry for `dsv4-fp4-gb300-dynamo-vllm`.
- `runner: gb300` (broad) so GitHub can schedule onto either NV or CW GB300 runners.

Validation

Local:

- `bash -n runners/launch_gb300-cw.sh` ✓
- `bash -n runners/launch_gb300-nv.sh` ✓
- `python3 -m pytest utils/matrix_logic -q` → 151 passed

Workflow runs: