
[NV] dsv4-fp4-gb200-dynamo-vllm#1163

Open
Ankur-singh wants to merge 3 commits into main from nv/dsv4-fp4-gb200-dynamo-vllm

Conversation

Collaborator

Ankur-singh commented Apr 26, 2026

Repoints at NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes (NVIDIA/srt-slurm#77, which supersedes #71), mirrors all 8 new 8k/1k recipes, and drops the local workarounds (numa-bind, chat template, tokenizer mode) that the previous submission (#1129) had to apply.

…-dsv4-recipes (PR #77)

Repoint launch_gb200-nv.sh to NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes,
which supersedes #71 and ships the vllm_numa_bind_hash_fix.py patch and
sa-bench DSV4 tokenizer support — so numa-bind, benchmark.use_chat_template,
and benchmark.tokenizer_mode no longer have to be stripped from recipes.

8k/1k search-space expanded from 3 topologies to 8: adds 1p4d/1p8d pure-TP
decode (offload), 1p1d/2p1d/3p1d DEP-8 decode, and a 3p1d-dep16-40 wide
decode shape. 1k/1k topologies unchanged (no upstream 1k/1k counterpart);
1k/1k tep8 also re-enables numa-bind + chat template to stay consistent.

Local recipe deltas vs upstream are limited to: model.path alias rename
deepseekv4-fp4 -> deepseek-v4-pro (matches SRT_SLURM_MODEL_PREFIX), container
kept on the floating :deepseekv4-cu130 tag, slurm.time_limit added, and
health_check.max_attempts bumped 360 -> 1440 for cold-cache loads.
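As a rough illustration, the local delta over the upstream recipes might look like the following YAML fragment. The key nesting, the container key name, and the time-limit value are assumptions for illustration, not copied from the actual recipe files:

```yaml
# Hypothetical sketch of the local overrides vs. upstream (key nesting assumed).
model:
  path: deepseek-v4-pro         # renamed from deepseekv4-fp4 to match SRT_SLURM_MODEL_PREFIX
container:
  image: "...:deepseekv4-cu130" # kept on the floating tag rather than a pinned digest
slurm:
  time_limit: "04:00:00"        # illustrative value; the PR only says the field was added
health_check:
  max_attempts: 1440            # bumped from 360 to tolerate cold-cache model loads
```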
@Ankur-singh Ankur-singh requested a review from a team April 26, 2026 02:40
@Ankur-singh Ankur-singh changed the title Re-submit dsv4-fp4-gb200-dynamo-vllm against srt-slurm PR #77 [NV] dsv4-fp4-gb200-dynamo-vllm Apr 26, 2026
The 1k/1k tep8 numa-bind + chat-template re-enabling is rolled back —
1k/1k stays at the previous local-extrapolation tuning. Updates the
perf-changelog entry to reflect that.
Contributor

claude (bot) left a comment


Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:140-144 — Within the 1k/1k sweep, this PR updated only 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml (use_chat_template: true, tokenizer_mode: deepseek_v4, prefill numa-bind: true) but left the two sibling DEP-16 recipes (1k1k/disagg-gb200-1p1d-dep8-dep16.yaml and 1k1k/disagg-gb200-3p1d-dep8-dep16.yaml) on the old workaround path (use_chat_template: false, no tokenizer_mode, no numa-bind). All three are routed from the same dsv4-fp4-gb200-dynamo-vllm 1k/1k search-space (.github/configs/nvidia-master.yaml 7577-7620), so conc 1-64 now run through sa-bench's chat-template + DSV4 tokenizer path while conc 128-8192 use the legacy random-token /v1/completions path — the changelog explicitly states the tep8 update was made "to stay consistent" but the result is the opposite. Either also flip the two DEP-16 recipes (use_chat_template: true, tokenizer_mode: deepseek_v4, prefill numa-bind: true) or revert the tep8 update.

    Extended reasoning...

    What is happening

    The PR description says "drops local workarounds: numa-bind, benchmark.use_chat_template, and benchmark.tokenizer_mode are restored now that PR #77 ships vllm_numa_bind_hash_fix.py and sa-bench DSV4 tokenizer support", and the perf-changelog entry adds "1k/1k tep8 also re-enables numa-bind + chat template to stay consistent". The intent is clearly to get all 1k/1k recipes onto the same input/tokenization pipeline as the 8k/1k recipes that PR #77 ships.

    But only the 1k/1k tep8 recipe was actually touched. The two DEP-16 siblings still carry the pre-PR workaround:

    | 1k/1k recipe | use_chat_template | tokenizer_mode | prefill numa-bind |
    | --- | --- | --- | --- |
    | 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml (modified) | true | deepseek_v4 | true |
    | 1k1k/disagg-gb200-1p1d-dep8-dep16.yaml (line 125, unchanged) | false | (unset) | (unset) |
    | 1k1k/disagg-gb200-3p1d-dep8-dep16.yaml (line 117, unchanged) | false | (unset) | (unset) |

    Why it matters

    .github/configs/nvidia-master.yaml lines 7577-7620 route all three of these inside the same dsv4-fp4-gb200-dynamo-vllm 1k/1k seq-len-config:

    conc [1, 4, 8, 16, 32, 64]                  -> 1p1d-dep8-tep8.yaml         (chat_template=true,  tokenizer_mode=deepseek_v4)
    conc [128, 256, 1024, 2048, 4096]           -> 1p1d-dep8-dep16.yaml        (chat_template=false, tokenizer_mode unset)
    conc [4096, 8192]                           -> 3p1d-dep8-dep16.yaml        (chat_template=false, tokenizer_mode unset)
    

    sa-bench's use_chat_template: true path calls tokenizer.apply_chat_template() and adds role markers / system prompt, so the actual on-the-wire prompt token count and request endpoint differs from the random-token /v1/completions path used when use_chat_template: false. So within a single 1k/1k sweep:

    • conc 1-64 (tep8) measures latency through chat-template + DSV4-aware tokenization
    • conc 128-8192 (DEP-16) measures throughput through raw tokens, no DSV4 tokenizer, plain completions endpoint

    This breaks the comparable-curve invariant the PR was trying to restore: a low-conc TEP latency point and a higher-conc DEP-16 throughput point on the same 1k/1k pareto front are no longer apples-to-apples, because their input/tokenization pipelines differ.

    Step-by-step proof

    1. The 1k/1k tep8 recipe (touched by this PR) ends with:
      benchmark:
        ...
        tokenizer_mode: "deepseek_v4"
        use_chat_template: true
    2. 1k1k/disagg-gb200-1p1d-dep8-dep16.yaml line 125 (NOT in this PR's diff) ends with:
      benchmark:
        ...
        use_chat_template: false
      No tokenizer_mode, no prefill numa-bind: true.
    3. 1k1k/disagg-gb200-3p1d-dep8-dep16.yaml line 117 (NOT in this PR's diff): identical situation.
    4. .github/configs/nvidia-master.yaml lines 7577-7620 confirm all three are in the same 1k/1k search-space, so a single benchmark sweep at isl=1024,osl=1024 will hit all three recipes at different concurrencies.
    5. Pre-PR all three 1k/1k recipes had use_chat_template: false (consistent). Post-PR, only one was flipped, so this PR introduces the inconsistency.

    How to fix

    Either:

    • (Most consistent with the changelog's stated intent) Update both DEP-16 recipes the same way: set benchmark.use_chat_template: true, benchmark.tokenizer_mode: "deepseek_v4", and prefill numa-bind: true (plus the worker-side tokenizer-mode: deepseek_v4, if not already present).
    • Or revert the 1k/1k tep8 update for now and leave all three on the old workarounds, matching pre-PR behavior. The 1k/1k recipes have no upstream PR #77 reference anyway, so this is a self-consistent fallback until 1k/1k DEP-16 also gets a chance to be validated end-to-end.
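    If taking the first option, the change to both DEP-16 recipes would be on the order of this sketch (the nesting is assumed from the field paths quoted above, not read from the actual recipe files):

    ```yaml
    # Sketch of the fix for both 1k1k DEP-16 recipes (assumed nesting).
    prefill:
      numa-bind: true               # re-enable now that PR #77 ships vllm_numa_bind_hash_fix.py
    benchmark:
      tokenizer_mode: "deepseek_v4"
      use_chat_template: true       # match the tep8 recipe so the 1k/1k sweep is apples-to-apples
    ```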

    Runs will succeed either way (the recipes are individually valid), so this is a benchmark-correctness / comparability bug, not a runtime crash. Severity: normal.

Comment on lines +70 to +71
VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
Collaborator


We need to comment these out in all YAMLs; they were only added to see the most optimal engine perf.
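Concretely, that could mean commenting the two variables out rather than deleting them, so the knobs stay discoverable, e.g. (the exact nesting of the env map is an assumption):

```yaml
# Keep the perf-probing knobs visible but inactive (assumed env-map nesting).
env:
  # Disabled: only used to probe best-case engine perf, not for standing benchmarks.
  # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
  # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
```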

