
[NV] dsv4-fp4-gb200-dynamo-vllm#1163

Open
Ankur-singh wants to merge 3 commits into main from nv/dsv4-fp4-gb200-dynamo-vllm

Conversation

Collaborator

Ankur-singh commented Apr 26, 2026

Repoints at NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes (NVIDIA/srt-slurm#77, which supersedes #71), mirrors all 8 new 8k/1k recipes, and drops the local workarounds (numa-bind, chat template, tokenizer mode) that the previous submission (#1129) had to apply.

…-dsv4-recipes (PR #77)

Repoint launch_gb200-nv.sh to NVIDIA/srt-slurm@aflowers/gb200-dsv4-recipes,
which supersedes #71 and ships the vllm_numa_bind_hash_fix.py patch and
sa-bench DSV4 tokenizer support — so numa-bind, benchmark.use_chat_template,
and benchmark.tokenizer_mode no longer have to be stripped from recipes.

8k/1k search-space expanded from 3 topologies to 8: adds 1p4d/1p8d pure-TP
decode (offload), 1p1d/2p1d/3p1d DEP-8 decode, and a 3p1d-dep16-40 wide
decode shape. 1k/1k topologies unchanged (no upstream 1k/1k counterpart);
1k/1k tep8 also re-enables numa-bind + chat template to stay consistent.

Local recipe deltas vs upstream are limited to: model.path alias rename
deepseekv4-fp4 -> deepseek-v4-pro (matches SRT_SLURM_MODEL_PREFIX), container
kept on the floating :deepseekv4-cu130 tag, slurm.time_limit added, and
health_check.max_attempts bumped 360 -> 1440 for cold-cache loads.
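As a rough illustration, the local delta over the upstream recipes might look like the following YAML fragment. The key nesting, the container key name, and the time-limit value are assumptions for illustration, not copied from the actual recipe files:

```yaml
# Hypothetical sketch of the local overrides vs. upstream (key nesting assumed).
model:
  path: deepseek-v4-pro         # renamed from deepseekv4-fp4 to match SRT_SLURM_MODEL_PREFIX
container:
  image: "...:deepseekv4-cu130" # kept on the floating tag rather than a pinned digest
slurm:
  time_limit: "04:00:00"        # illustrative value; the PR only says the field was added
health_check:
  max_attempts: 1440            # bumped from 360 to tolerate cold-cache model loads
```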
@Ankur-singh Ankur-singh requested a review from a team April 26, 2026 02:40
@Ankur-singh Ankur-singh changed the title Re-submit dsv4-fp4-gb200-dynamo-vllm against srt-slurm PR #77 [NV] dsv4-fp4-gb200-dynamo-vllm Apr 26, 2026
The 1k/1k tep8 numa-bind + chat-template re-enabling is rolled back —
1k/1k stays at the previous local-extrapolation tuning. Updates the
perf-changelog entry to reflect that.
Contributor

claude (bot) left a comment


Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 benchmarks/multi_node/srt-slurm-recipes/vllm/deepseek-v4/1k1k/disagg-gb200-1p1d-dep8-tep8.yaml:140-144 — Within the 1k/1k sweep, this PR updated only 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml (use_chat_template: true, tokenizer_mode: deepseek_v4, prefill numa-bind: true) but left the two sibling DEP-16 recipes (1k1k/disagg-gb200-1p1d-dep8-dep16.yaml and 1k1k/disagg-gb200-3p1d-dep8-dep16.yaml) on the old workaround path (use_chat_template: false, no tokenizer_mode, no numa-bind). All three are routed from the same dsv4-fp4-gb200-dynamo-vllm 1k/1k search-space (.github/configs/nvidia-master.yaml 7577-7620), so conc 1-64 now run through sa-bench's chat-template + DSV4 tokenizer path while conc 128-8192 use the legacy random-token /v1/completions path — the changelog explicitly states the tep8 update was made "to stay consistent" but the result is the opposite. Either also flip the two DEP-16 recipes (use_chat_template: true, tokenizer_mode: deepseek_v4, prefill numa-bind: true) or revert the tep8 update.

    Extended reasoning...

    What is happening

    The PR description says "drops local workarounds: numa-bind, benchmark.use_chat_template, and benchmark.tokenizer_mode are restored now that PR #77 ships vllm_numa_bind_hash_fix.py and sa-bench DSV4 tokenizer support", and the perf-changelog entry adds "1k/1k tep8 also re-enables numa-bind + chat template to stay consistent". The intent is clearly to get all 1k/1k recipes onto the same input/tokenization pipeline as the 8k/1k recipes that PR #77 ships.

    But only the 1k/1k tep8 recipe was actually touched. The two DEP-16 siblings still carry the pre-PR workaround:

    | 1k/1k recipe | use_chat_template | tokenizer_mode | prefill numa-bind |
    | --- | --- | --- | --- |
    | 1k1k/disagg-gb200-1p1d-dep8-tep8.yaml (modified) | true | deepseek_v4 | true |
    | 1k1k/disagg-gb200-1p1d-dep8-dep16.yaml (line 125, unchanged) | false | (unset) | (unset) |
    | 1k1k/disagg-gb200-3p1d-dep8-dep16.yaml (line 117, unchanged) | false | (unset) | (unset) |

    Why it matters

    .github/configs/nvidia-master.yaml lines 7577-7620 route all three of these inside the same dsv4-fp4-gb200-dynamo-vllm 1k/1k seq-len-config:

    conc [1, 4, 8, 16, 32, 64]                  -> 1p1d-dep8-tep8.yaml         (chat_template=true,  tokenizer_mode=deepseek_v4)
    conc [128, 256, 1024, 2048, 4096]           -> 1p1d-dep8-dep16.yaml        (chat_template=false, tokenizer_mode unset)
    conc [4096, 8192]                           -> 3p1d-dep8-dep16.yaml        (chat_template=false, tokenizer_mode unset)
    

    sa-bench's use_chat_template: true path calls tokenizer.apply_chat_template() and adds role markers / system prompt, so the actual on-the-wire prompt token count and request endpoint differs from the random-token /v1/completions path used when use_chat_template: false. So within a single 1k/1k sweep:

    • conc 1-64 (tep8) measures latency through chat-template + DSV4-aware tokenization
    • conc 128-8192 (DEP-16) measures throughput through raw tokens, no DSV4 tokenizer, plain completions endpoint

    This breaks the comparable-curve invariant the PR was trying to restore: a low-conc TEP latency point and a higher-conc DEP-16 throughput point on the same 1k/1k pareto front are no longer apples-to-apples, because their input/tokenization pipelines differ.

    Step-by-step proof

    1. The 1k/1k tep8 recipe (touched by this PR) ends with:
      benchmark:
        ...
        tokenizer_mode: "deepseek_v4"
        use_chat_template: true
    2. 1k1k/disagg-gb200-1p1d-dep8-dep16.yaml line 125 (NOT in this PR's diff) ends with:
      benchmark:
        ...
        use_chat_template: false
      No tokenizer_mode, no prefill numa-bind: true.
    3. 1k1k/disagg-gb200-3p1d-dep8-dep16.yaml line 117 (NOT in this PR's diff): identical situation.
    4. .github/configs/nvidia-master.yaml lines 7577-7620 confirm all three are in the same 1k/1k search-space, so a single benchmark sweep at isl=1024,osl=1024 will hit all three recipes at different concurrencies.
    5. Pre-PR all three 1k/1k recipes had use_chat_template: false (consistent). Post-PR, only one was flipped, so this PR introduces the inconsistency.

    How to fix

    Either:

    • (Most consistent with the changelog's stated intent) Update both DEP-16 recipes the same way: set benchmark.use_chat_template: true, benchmark.tokenizer_mode: "deepseek_v4", and prefill numa-bind: true (plus the worker-side tokenizer-mode: deepseek_v4, if not already present).
    • Or revert the 1k/1k tep8 update for now and leave all three on the old workarounds, matching pre-PR behavior. The 1k/1k recipes have no upstream PR #77 reference anyway, so this is a self-consistent fallback until 1k/1k DEP-16 also gets a chance to be validated end-to-end.
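    If taking the first option, the change to both DEP-16 recipes would be on the order of this sketch (the nesting is assumed from the field paths quoted above, not read from the actual recipe files):

    ```yaml
    # Sketch of the fix for both 1k1k DEP-16 recipes (assumed nesting).
    prefill:
      numa-bind: true               # re-enable now that PR #77 ships vllm_numa_bind_hash_fix.py
    benchmark:
      tokenizer_mode: "deepseek_v4"
      use_chat_template: true       # match the tep8 recipe so the 1k/1k sweep is apples-to-apples
    ```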

    Runs will succeed either way (the recipes are individually valid), so this is a benchmark-correctness / comparability bug, not a runtime crash. Severity: normal.

Comment on lines +70 to +71
VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
Collaborator


We need to comment these out in all YAMLs; they were only added to see the most optimal engine perf.
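Concretely, that could mean commenting the two variables out rather than deleting them, so the knobs stay discoverable, e.g. (the exact nesting of the env map is an assumption):

```yaml
# Keep the perf-probing knobs visible but inactive (assumed env-map nesting).
env:
  # Disabled: only used to probe best-case engine perf, not for standing benchmarks.
  # VLLM_RANDOMIZE_DP_DUMMY_INPUTS: "1"
  # VLLM_MOE_ROUTING_SIMULATION_STRATEGY: "uniform_random"
```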

