Conversation
…-dsv4-recipes (PR #77)

Repoint launch_gb200-nv.sh to NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes, which supersedes #71 and ships the vllm_numa_bind_hash_fix.py patch and sa-bench DSV4 tokenizer support, so numa-bind, benchmark.use_chat_template, and benchmark.tokenizer_mode no longer have to be stripped from recipes.

- 8k/1k search-space expanded from 3 topologies to 8: adds 1p4d/1p8d pure-TP decode (offload), 1p1d/2p1d/3p1d DEP-8 decode, and a 3p1d-dep16-40 wide decode shape.
- 1k/1k topologies unchanged (no upstream 1k/1k counterpart); 1k/1k tep8 also re-enables numa-bind + chat template to stay consistent.
- Local recipe deltas vs upstream are limited to: model.path alias rename deepseekv4-fp4 -> deepseek-v4-pro (matches SRT_SLURM_MODEL_PREFIX), container kept on the floating :deepseekv4-cu130 tag, slurm.time_limit added, and health_check.max_attempts bumped 360 -> 1440 for cold-cache loads.
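For illustration, a minimal sketch of those local recipe deltas as they might appear in one of the recipe YAMLs; only the four settings named above come from the PR, while the key nesting, image path, and time_limit value are placeholders:

```yaml
# Hypothetical sketch of the local deltas vs upstream; real key layout,
# image path, and time_limit value may differ.
model:
  path: deepseek-v4-pro              # renamed from deepseekv4-fp4 to match SRT_SLURM_MODEL_PREFIX
container: <registry>/dynamo:deepseekv4-cu130  # kept on the floating tag
slurm:
  time_limit: "04:00:00"             # added locally; value here is a placeholder
health_check:
  max_attempts: 1440                 # bumped from 360 to ride out cold-cache model loads
```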
…amo-vllm

# Conflicts:
#	perf-changelog.yaml
The 1k/1k tep8 numa-bind + chat-template re-enabling is rolled back; 1k/1k stays at the previous local-extrapolation tuning. Updates the perf-changelog entry to reflect that.
Additional findings (outside current diff — PR may have been updated during review):
- 🔴 `benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:140-144`: Within the 1k/1k sweep, this PR updated only `1k1k/disagg-gb200-1p1d-dep8-tep8.yaml` (`use_chat_template: true`, `tokenizer_mode: deepseek_v4`, prefill `numa-bind: true`) but left the two sibling DEP-16 recipes (`1k1k/disagg-gb200-1p1d-dep8-dep16.yaml` and `1k1k/disagg-gb200-3p1d-dep8-dep16.yaml`) on the old workaround path (`use_chat_template: false`, no `tokenizer_mode`, no numa-bind). All three are routed from the same `dsv4-fp4-gb200-dynamo-vllm` 1k/1k search-space (`.github/configs/nvidia-master.yaml` lines 7577-7620), so conc 1-64 now runs through sa-bench's chat-template + DSV4 tokenizer path while conc 128-8192 uses the legacy random-token `/v1/completions` path; the changelog explicitly states the tep8 update was made "to stay consistent", but the result is the opposite. Either also flip the two DEP-16 recipes (`use_chat_template: true`, `tokenizer_mode: deepseek_v4`, prefill `numa-bind: true`) or revert the tep8 update.
What is happening
The PR description says "drops local workarounds: numa-bind, benchmark.use_chat_template, and benchmark.tokenizer_mode are restored now that PR #77 ships vllm_numa_bind_hash_fix.py and sa-bench DSV4 tokenizer support", and the perf-changelog entry adds "1k/1k tep8 also re-enables numa-bind + chat template to stay consistent". The intent is clearly to get all 1k/1k recipes onto the same input/tokenization pipeline as the 8k/1k recipes that PR #77 ships.
But only the 1k/1k tep8 recipe was actually touched. The two DEP-16 siblings still carry the pre-PR workaround:
| 1k/1k recipe | use_chat_template | tokenizer_mode | prefill numa-bind |
| --- | --- | --- | --- |
| `1k1k/disagg-gb200-1p1d-dep8-tep8.yaml` (modified) | true | deepseek_v4 | true |
| `1k1k/disagg-gb200-1p1d-dep8-dep16.yaml` (line 125, unchanged) | false | (unset) | (unset) |
| `1k1k/disagg-gb200-3p1d-dep8-dep16.yaml` (line 117, unchanged) | false | (unset) | (unset) |

Why it matters
`.github/configs/nvidia-master.yaml` lines 7577-7620 route all three of these inside the same `dsv4-fp4-gb200-dynamo-vllm` 1k/1k seq-len-config:

- conc [1, 4, 8, 16, 32, 64] -> `1p1d-dep8-tep8.yaml` (chat_template=true, tokenizer_mode=deepseek_v4)
- conc [128, 256, 1024, 2048, 4096] -> `1p1d-dep8-dep16.yaml` (chat_template=false, tokenizer_mode unset)
- conc [4096, 8192] -> `3p1d-dep8-dep16.yaml` (chat_template=false, tokenizer_mode unset)

sa-bench's `use_chat_template: true` path calls `tokenizer.apply_chat_template()` and adds role markers / a system prompt, so the actual on-the-wire prompt token count and request endpoint differ from the random-token `/v1/completions` path used when `use_chat_template: false`. So within a single 1k/1k sweep:

- conc 1-64 (tep8) measures latency through chat-template + DSV4-aware tokenization
- conc 128-8192 (DEP-16) measures throughput through raw tokens, no DSV4 tokenizer, plain completions endpoint
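To make the routing concrete, a hypothetical sketch of the shape such a seq-len-config could take; the real schema of `.github/configs/nvidia-master.yaml` is not visible in this thread, so every key name below is an assumption:

```yaml
# Assumed schema, for illustration only; captures the conc -> recipe
# routing above, not the actual nvidia-master.yaml layout.
dsv4-fp4-gb200-dynamo-vllm:
  - isl: 1024
    osl: 1024
    recipes:
      - path: 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml
        concurrencies: [1, 4, 8, 16, 32, 64]
      - path: 1k1k/disagg-gb200-1p1d-dep8-dep16.yaml
        concurrencies: [128, 256, 1024, 2048, 4096]
      - path: 1k1k/disagg-gb200-3p1d-dep8-dep16.yaml
        concurrencies: [4096, 8192]
```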
This breaks the comparable-curve invariant the PR was trying to restore: a low-conc TEP latency point and a higher-conc DEP-16 throughput point on the same 1k/1k pareto front are no longer apples-to-apples, because their input/tokenization pipelines differ.
Step-by-step proof
- The 1k/1k tep8 recipe (touched by this PR) ends with:

  ```yaml
  benchmark:
    # ...
    tokenizer_mode: "deepseek_v4"
    use_chat_template: true
  ```

- `1k1k/disagg-gb200-1p1d-dep8-dep16.yaml` line 125 (NOT in this PR's diff) ends with:

  ```yaml
  benchmark:
    # ...
    use_chat_template: false
  ```

  No `tokenizer_mode`, no prefill `numa-bind: true`.

- `1k1k/disagg-gb200-3p1d-dep8-dep16.yaml` line 117 (NOT in this PR's diff): identical situation.
- `.github/configs/nvidia-master.yaml` lines 7577-7620 confirm all three are in the same 1k/1k search-space, so a single benchmark sweep at `isl=1024`, `osl=1024` will hit all three recipes at different concurrencies.
- Pre-PR, all three 1k/1k recipes had `use_chat_template: false` (consistent). Post-PR, only one was flipped, so this PR introduces the inconsistency.
How to fix
Either:
- (Most consistent with the changelog's stated intent) Update both DEP-16 recipes the same way: set `benchmark.use_chat_template: true`, `benchmark.tokenizer_mode: "deepseek_v4"`, and prefill `numa-bind: true` (plus the worker-side `tokenizer-mode: deepseek_v4` if not already there); see the sketch after this list.
- Or revert the 1k/1k tep8 update for now and leave all three on the old workarounds, matching pre-PR behavior. The 1k/1k recipes have no upstream PR #77 reference anyway, so this is a self-consistent fallback until 1k/1k DEP-16 also gets a chance to be validated end-to-end.
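A minimal sketch of the first option as it could land in each DEP-16 recipe; the three settings come from the finding above, while the surrounding key nesting is assumed:

```yaml
# Hypothetical placement; the real recipes may nest these keys differently.
prefill:
  numa-bind: true                  # safe again now that vllm_numa_bind_hash_fix.py ships upstream
benchmark:
  tokenizer_mode: "deepseek_v4"    # sa-bench's DSV4-aware tokenizer
  use_chat_template: true          # same chat-template path as the tep8 recipe
```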
Runs will succeed either way (the recipes are individually valid), so this is a benchmark-correctness / comparability bug, not a runtime crash. Severity: normal.
```yaml
VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
```
We need to comment these out in all the YAMLs; they were only added to see best-case engine perf.
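For example, a hedged sketch of the requested change; the name and position of the env block inside the recipes is an assumption:

```yaml
# Commented out per review: these were only set to probe best-case engine
# perf and should not influence the recorded benchmark numbers.
env:
  # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
  # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
```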
Repoint at NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes (NVIDIA/srt-slurm#77, supersedes #71); mirrors all 8 new 8k/1k recipes and drops the local workarounds (numa-bind, chat template, tokenizer mode) the previous submission (#1129) had to apply.