28 changes: 28 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -2448,6 +2448,34 @@ dsv4-fp8-h200-vllm:
search-space:
- { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }

# DeepSeek-V4-Pro B300 single-node aggregate recipe from the submitted B300
# pareto sweep. The single-node schema has no explicit data-parallel-size
# field, so dp-attn=true is used as the existing vLLM script switch for DP4
# layouts on 4 allocated GPUs.
dsv4-fp4-b300-vllm:
image: vllm/vllm-openai:deepseekv4-cu130
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: b300
precision: fp4
framework: vllm
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 4 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, conc-start: 128, conc-end: 128 }
- { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 4 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, conc-start: 128, conc-end: 128 }
- { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
Contributor comment on lines +2466 to +2477:

🔴 All 4 search-space entries for dsv4-fp8-b300-vllm (nvidia-master.yaml:2402-2413) omit the ep field, so generate_sweep_configs.py defaults each matrix entry to ep=1. But benchmarks/single_node/dsv4_fp8_b300.sh always passes --enable-expert-parallel, meaning the actual EP is 8 (for tp:8), 4 (for tp:4), or 4 (for tp:4/dp-attn:true) — never 1. Downstream metadata (RESULT_FILENAME, process_result.py, compare_results.py/summarize.py grouping keys) will therefore record ep=1 for every data point. Fix by adding ep: 8 to the two tp:8 entries and ep: 4 to the two tp:4 entries, mirroring the adjacent dsv4-fp8-h200-vllm config and PR #919's metadata cleanup.

Extended reasoning...

What the bug is. The newly added dsv4-fp8-b300-vllm block (.github/configs/nvidia-master.yaml:2388-2413) declares four search-space entries across its two seq-len configs, and none of them sets the ep field: {tp:8,...}, {tp:4,...}, {tp:8,...}, {tp:4,dp-attn:true,...}. In contrast, the sibling dsv4-fp8-h200-vllm at line 2385 correctly specifies ep: 8, which is the established convention for MoE configs in this file.

Why the default is wrong for this recipe. utils/matrix_logic/generate_sweep_configs.py:354 initializes Fields.EP.value to 1 for single-node entries and only overrides it (lines 362-363) when ep is explicitly present in the YAML entry. So every generated matrix row for this config gets ep=1. However, benchmarks/single_node/dsv4_fp8_b300.sh unconditionally passes --enable-expert-parallel on the vllm serve command (line ~76 of the new script), independent of TP or DP_ATTENTION. With vLLM's expert-parallel semantics, the effective expert-parallel degree equals the world size (TP × DP), so the runtime EP is 8 or 4, never 1.
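
For illustration, a minimal sketch of the defaulting behavior described above (the function name and dict layout are assumptions, not the actual generate_sweep_configs.py code):

```python
# Hypothetical sketch: single-node matrix rows are seeded with ep=1 and the
# default is only overridden when the YAML entry carries an explicit ep key.
def expand_single_node_entry(entry: dict) -> dict:
    row = {"tp": 1, "ep": 1, "dp-attn": False}            # ep defaults to 1
    row.update({k: v for k, v in entry.items() if k in row})
    return row

# An entry without an explicit ep key keeps the default:
print(expand_single_node_entry({"tp": 4, "conc-start": 4, "conc-end": 128}))
# -> {'tp': 4, 'ep': 1, 'dp-attn': False}   (runtime EP would actually be 4)
```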

How the metadata mismatch propagates. The EP value from the matrix becomes EP_SIZE via .github/workflows/benchmark-tmpl.yml:85, and that value is then (a) embedded in RESULT_FILENAME at line 146 as ep${EP_SIZE}, (b) written into the aggregated JSON by utils/process_result.py:100-108 as data['ep'] = ep_size, (c) used as a grouping key in utils/summarize.py:82,104, and (d) forms the tp{tp}/ep{ep} lookup key in utils/compare_results.py:244. So every single B300 result file for this PR will be named ...ep1... and every aggregated data point will claim ep: 1, while the actual run executed with EP=4 or EP=8. Any downstream baseline comparison or eval grouping will key on a value that doesn't exist in the launched recipe.
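
A rough sketch of that propagation path (the filename pattern and key names below are illustrative assumptions; see benchmark-tmpl.yml, process_result.py and compare_results.py for the real formats):

```python
# Hypothetical sketch of how a matrix-row ep value gets stamped into every
# downstream artifact once EP_SIZE is exported by the workflow.
import os

os.environ["EP_SIZE"] = "1"                      # what the generated matrix row exports
tp, ep, dpa = 4, int(os.environ["EP_SIZE"]), False

result_filename = f"dsv4_isl1024_osl1024_tp{tp}-ep{ep}-dpa{dpa}_conc4.json"
record = {"tp": tp, "ep": ep, "dp_attn": dpa}    # aggregated JSON data point
group_key = f"tp{tp}/ep{ep}"                     # baseline-comparison lookup key

print(result_filename, record, group_key)
# Every artifact claims ep=1 even though the server actually ran with EP=4.
```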

Step-by-step proof for the second entry (tp:4, conc 4-128 on 1k1k); a numeric sketch of the same flow follows the list.

  1. YAML entry: { tp: 4, conc-start: 4, conc-end: 128 } — no ep key.
  2. generate_sweep_configs.py:354 seeds the row with ep: 1 (default) and the tp override sets tp: 4; line 362-363 does not run because 'ep' is not in the dict.
  3. Matrix row is emitted with tp=4, ep=1, dp-attn=false.
  4. benchmark-tmpl.yml:85 exports EP_SIZE=1; line 146 stamps the result file as ..._tp4-ep1-dpaFalse_....
  5. The launch script enters the else-branch (DP_ATTENTION != true), so PARALLEL_ARGS=--tensor-parallel-size 4 --data-parallel-size 1, and --enable-expert-parallel is always present → vLLM runs with TP=4, DP=1, EP enabled over world size 4 → effective EP=4.
  6. process_result.py reads EP_SIZE=1 from env and writes {'ep': 1, ...} to the JSON — the ep field recorded is 1, the actual EP used was 4.
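
Putting numbers on steps 3-6 (a hypothetical recap; the (tp, dp) pairs are the values the launch script actually passes to vllm serve):

```python
# Hypothetical recap: with --enable-expert-parallel always on, the effective
# EP equals the launched world size (tp * dp), never the recorded ep=1.
launches = [
    ("tp:8 rows",           8, 1),
    ("tp:4 rows",           4, 1),
    ("tp:4 + dp-attn rows", 1, 4),   # DP_ATTENTION=true flips TP4 into DP4
]
for label, tp, dp in launches:
    effective_ep = tp * dp           # expert parallel spans the whole world size
    print(f"{label}: world={tp * dp}, recorded ep=1, effective EP={effective_ep}")
```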

Why this was not caught earlier. There is no validation that cross-references --enable-expert-parallel in a launch script against the ep field in matrix entries; the coupling is by convention. This is precisely the class of mismatch that PR #919 ('Fix metadata inconsistencies in nvidia-master.yaml - TP/EP/DP-attn values now match actual recipe files') was created to clean up, and that the gptoss-fp4-* and dsr1-fp4-* changelogs repeatedly reference ('Explicitly add EP=TP for DP attention configs', 'Set ep:4 for all tp:4 entries, ep:8 for all tp:8 entries').
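
A sketch of what such a cross-check could look like (the recipe-dict shape and file paths are assumptions for illustration, not an existing utility):

```python
# Hypothetical validation: flag matrix entries that default to ep=1 while the
# paired launch script enables expert parallel.
from pathlib import Path

def find_ep_mismatches(recipe: dict, script_path: str) -> list[dict]:
    """recipe is one parsed config block (e.g. dsv4-fp4-b300-vllm) from the YAML."""
    script_enables_ep = "--enable-expert-parallel" in Path(script_path).read_text()
    mismatches = []
    for slc in recipe.get("seq-len-configs", []):
        for entry in slc.get("search-space", []):
            if script_enables_ep and entry.get("ep", 1) == 1:
                mismatches.append(entry)   # recorded ep=1, runtime EP > 1
    return mismatches
```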

Fix. Add explicit ep to each B300 search-space entry to match the launched EP:

  • { tp: 8, ep: 8, conc-start: 4, conc-end: 4 }
  • { tp: 4, ep: 4, conc-start: 4, conc-end: 128 }
  • { tp: 8, ep: 8, conc-start: 128, conc-end: 128 }
  • { tp: 4, ep: 4, dp-attn: true, conc-start: 256, conc-end: 512 }

This mirrors the adjacent dsv4-fp8-h200-vllm convention (ep: 8 for tp: 8, dp-attn: true) and keeps RESULT_FILENAME/process_result.py/compare_results.py in sync with the actual runtime EP. The fix is metadata-only; no recipe-file changes are required.


qwen3.5-fp8-h200-sglang:
image: lmsysorg/sglang:v0.5.9-cu129-amd64
model: Qwen/Qwen3.5-397B-A17B-FP8
80 changes: 40 additions & 40 deletions benchmarks/single_node/dsv4_fp4_b300_vllm.sh
@@ -1,14 +1,16 @@
#!/usr/bin/env bash

# Per https://vllm.ai/blog/deepseek-v4 the DeepSeek-V4-Pro recipe lists
# 8xB200 and 8xB300 with identical flags, so this script mirrors
# dsv4_fp4_b200.sh.
# DeepSeek-V4-Pro B300 single-node aggregate recipe from the submitted B300
# pareto sweep. The matrix uses dp-attn=true as the existing switch to flip a
# 4-GPU run from TP4 to DP4. Expert parallel is always enabled to match the
# provided vllm serve command exactly.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
DP_ATTENTION \
CONC \
ISL \
OSL \
@@ -22,56 +24,54 @@ fi

nvidia-smi

hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# DeepSeek-V4-Pro weights are large and engine startup on B300 can exceed
# the default 600s. Give it an hour to load.
# DeepSeek-V4-Pro weights are large; engine startup can exceed the default
# 600s. Give it an hour to load.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
PARALLEL_ARGS=(--tensor-parallel-size "$TP" --data-parallel-size 1)
if [ "${DP_ATTENTION}" = "true" ]; then
PARALLEL_ARGS=(--tensor-parallel-size 1 --data-parallel-size "$TP")
fi

# Monkey-patch: bypass persistent_topk unconditionally. It raises "k out of
# range" during CUDA graph capture when the dummy batch has rows with
# seq_lens[i] < k (=2048 for DSV4). An attn_metadata.max_seq_len-based gate is
# not strict enough because dummy batches can have max >= k while individual
# rows have seq_lens[i] = 1. Fall back to top_k_per_row_decode everywhere so
# 1k/1k capture completes; 8k/1k already worked without the patch but we trade
# a small decode-time perf cost there to keep the script single-branch.
INDEXER_PY=/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py
echo "[monkey-patch] patching $INDEXER_PY"
sed -i 's/if current_platform.is_cuda() and topk_tokens in (512, 1024, 2048)[^:]*:/if False: # monkey-patched: bypass persistent_topk (k out of range)/' "$INDEXER_PY"
if ! grep -Fq 'if False: # monkey-patched: bypass persistent_topk' "$INDEXER_PY"; then
echo "[monkey-patch] FAILED: expected marker not found in $INDEXER_PY" >&2
echo "[monkey-patch] current line around persistent_topk dispatch:" >&2
grep -n 'topk_tokens in\|persistent_topk' "$INDEXER_PY" >&2 || true
exit 1
BENCHMARK_MAX_MODEL_LEN="$MAX_MODEL_LEN"
if [ "$ISL" -eq 1024 ] && [ "$OSL" -eq 1024 ]; then
BENCHMARK_MAX_MODEL_LEN=4096
fi

if [ "${EVAL_ONLY}" = "true" ]; then
EVAL_MAX_MODEL_LEN=$(compute_eval_context_length "$MODEL" "$BENCHMARK_MAX_MODEL_LEN")
export EVAL_MAX_MODEL_LEN
SERVE_MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
else
SERVE_MAX_MODEL_LEN="$BENCHMARK_MAX_MODEL_LEN"
fi
echo "[monkey-patch] applied: $(grep -n 'if False: # monkey-patched' $INDEXER_PY)"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

# Per the recipe, run with EP + DP=8 (no --tensor-parallel-size flag). TP
# from the search space is used only for GPU allocation by the runner and
# as the DP size.
set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size $TP \
--max-model-len $MAX_MODEL_LEN \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
"${PARALLEL_ARGS[@]}" \
--pipeline-parallel-size 1 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--block-size 256 \
--no-enable-prefix-caching \
--enable-expert-parallel \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--max-cudagraph-capture-size 2048 \
--max-model-len "$SERVE_MAX_MODEL_LEN" \
--max-num-batched-tokens 2048 > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

13 changes: 12 additions & 1 deletion perf-changelog.yaml
@@ -1755,7 +1755,7 @@
- "VLLM_ENGINE_READY_TIMEOUT_S=3600 to accommodate large weight loading"
- "Configs: 1k1k conc 4-64, 8k1k conc 4-64"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1130

- config-keys:
- dsv4-fp4-b300-sglang
description:
@@ -1775,3 +1775,14 @@
- "Model: sgl-project/DeepSeek-V4-Pro-FP8"
- "https://github.com/sgl-project/sglang/pull/23608#issuecomment-4311952977"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1134

- config-keys:
- dsv4-fp4-b300-vllm
description:
- "Add DeepSeek-V4-Pro single-node B300 vLLM aggregate benchmark"
- "Image: vllm/vllm-openai:deepseekv4-cu130"
- "Model: deepseek-ai/DeepSeek-V4-Pro"
- "Uses the submitted B300 pareto schedule for both 1k1k and 8k1k, excluding conc 1: TP8 at conc 4/128, TP4 at conc 4/8/16/32/64/128, DP4 at conc 256/512"
- "Launch args match the provided vllm serve command, including FP4 indexer cache, FULL_AND_PIECEWISE cudagraph config, and max-num-batched-tokens 2048"
- "1k1k uses --max-model-len 4096; 8k1k uses the workflow-provided benchmark context length"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1144