28 changes: 28 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -2448,6 +2448,34 @@ dsv4-fp8-h200-vllm:
search-space:
- { tp: 8, ep: 8, dp-attn: true, conc-start: 4, conc-end: 64 }

# DeepSeek-V4-Pro B300 single-node aggregate recipe from the submitted B300
# pareto sweep. The single-node schema has no explicit data-parallel-size
# field, so dp-attn=true is used as the existing vLLM script switch for DP4
# layouts on 4 allocated GPUs.
dsv4-fp4-b300-vllm:
image: vllm/vllm-openai:deepseekv4-cu130
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: b300
precision: fp4
framework: vllm
multinode: false
seq-len-configs:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 4 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, conc-start: 128, conc-end: 128 }
- { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, conc-start: 4, conc-end: 4 }
- { tp: 4, conc-start: 4, conc-end: 128 }
- { tp: 8, conc-start: 128, conc-end: 128 }
- { tp: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
Contributor comment on lines +2466 to +2477:

🔴 All 4 search-space entries for dsv4-fp8-b300-vllm (nvidia-master.yaml:2402-2413) omit the ep field, so generate_sweep_configs.py defaults each matrix entry to ep=1. But benchmarks/single_node/dsv4_fp8_b300.sh always passes --enable-expert-parallel, meaning the actual EP is 8 (for tp:8), 4 (for tp:4), or 4 (for tp:4/dp-attn:true) — never 1. Downstream metadata (RESULT_FILENAME, process_result.py, compare_results.py/summarize.py grouping keys) will therefore record ep=1 for every data point. Fix by adding ep: 8 to the two tp:8 entries and ep: 4 to the two tp:4 entries, mirroring the adjacent dsv4-fp8-h200-vllm config and PR #919's metadata cleanup.

Extended reasoning...

What the bug is. The newly added dsv4-fp8-b300-vllm block (.github/configs/nvidia-master.yaml:2388-2413) declares four search-space entries across its two seq-len configs, and none of them sets the ep field: {tp:8,...}, {tp:4,...}, {tp:8,...}, {tp:4,dp-attn:true,...}. In contrast, the sibling dsv4-fp8-h200-vllm at line 2385 correctly specifies ep: 8, which is the established convention for MoE configs in this file.

Why the default is wrong for this recipe. utils/matrix_logic/generate_sweep_configs.py:354 initializes Fields.EP.value to 1 for single-node entries and only overrides it (lines 362-363) when ep is explicitly present in the YAML entry. So every generated matrix row for this config gets ep=1. However, benchmarks/single_node/dsv4_fp8_b300.sh unconditionally passes --enable-expert-parallel on the vllm serve command (line ~76 of the new script), independent of TP or DP_ATTENTION. With vLLM's expert-parallel semantics, the effective expert-parallel degree equals the world size (TP × DP), so the runtime EP is 8 or 4, never 1.
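
For illustration, a minimal sketch of the defaulting behavior described above (the function name and dict layout are assumptions, not the actual generate_sweep_configs.py code):

```python
# Hypothetical sketch: single-node matrix rows are seeded with ep=1 and the
# default is only overridden when the YAML entry carries an explicit ep key.
def expand_single_node_entry(entry: dict) -> dict:
    row = {"tp": 1, "ep": 1, "dp-attn": False}            # ep defaults to 1
    row.update({k: v for k, v in entry.items() if k in row})
    return row

# An entry without an explicit ep key keeps the default:
print(expand_single_node_entry({"tp": 4, "conc-start": 4, "conc-end": 128}))
# -> {'tp': 4, 'ep': 1, 'dp-attn': False}   (runtime EP would actually be 4)
```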

How the metadata mismatch propagates. The EP value from the matrix becomes EP_SIZE via .github/workflows/benchmark-tmpl.yml:85, and that value is then (a) embedded in RESULT_FILENAME at line 146 as ep${EP_SIZE}, (b) written into the aggregated JSON by utils/process_result.py:100-108 as data['ep'] = ep_size, (c) used as a grouping key in utils/summarize.py:82,104, and (d) forms the tp{tp}/ep{ep} lookup key in utils/compare_results.py:244. So every single B300 result file for this PR will be named ...ep1... and every aggregated data point will claim ep: 1, while the actual run executed with EP=4 or EP=8. Any downstream baseline comparison or eval grouping will key on a value that doesn't exist in the launched recipe.
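
A rough sketch of that propagation path (the filename pattern and key names below are illustrative assumptions; see benchmark-tmpl.yml, process_result.py and compare_results.py for the real formats):

```python
# Hypothetical sketch of how a matrix-row ep value gets stamped into every
# downstream artifact once EP_SIZE is exported by the workflow.
import os

os.environ["EP_SIZE"] = "1"                      # what the generated matrix row exports
tp, ep, dpa = 4, int(os.environ["EP_SIZE"]), False

result_filename = f"dsv4_isl1024_osl1024_tp{tp}-ep{ep}-dpa{dpa}_conc4.json"
record = {"tp": tp, "ep": ep, "dp_attn": dpa}    # aggregated JSON data point
group_key = f"tp{tp}/ep{ep}"                     # baseline-comparison lookup key

print(result_filename, record, group_key)
# Every artifact claims ep=1 even though the server actually ran with EP=4.
```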

Step-by-step proof for the second entry (tp:4, conc 4-128 on 1k1k); a numeric sketch of the same flow follows the list.

  1. YAML entry: { tp: 4, conc-start: 4, conc-end: 128 } — no ep key.
  2. generate_sweep_configs.py:354 seeds the row with ep: 1 (default) and the tp override sets tp: 4; line 362-363 does not run because 'ep' is not in the dict.
  3. Matrix row is emitted with tp=4, ep=1, dp-attn=false.
  4. benchmark-tmpl.yml:85 exports EP_SIZE=1; line 146 stamps the result file as ..._tp4-ep1-dpaFalse_....
  5. The launch script enters the else-branch (DP_ATTENTION != true), so PARALLEL_ARGS=--tensor-parallel-size 4 --data-parallel-size 1, and --enable-expert-parallel is always present → vLLM runs with TP=4, DP=1, EP enabled over world size 4 → effective EP=4.
  6. process_result.py reads EP_SIZE=1 from env and writes {'ep': 1, ...} to the JSON — the ep field recorded is 1, the actual EP used was 4.
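
Putting numbers on steps 3-6 (a hypothetical recap; the (tp, dp) pairs are the values the launch script actually passes to vllm serve):

```python
# Hypothetical recap: with --enable-expert-parallel always on, the effective
# EP equals the launched world size (tp * dp), never the recorded ep=1.
launches = [
    ("tp:8 rows",           8, 1),
    ("tp:4 rows",           4, 1),
    ("tp:4 + dp-attn rows", 1, 4),   # DP_ATTENTION=true flips TP4 into DP4
]
for label, tp, dp in launches:
    effective_ep = tp * dp           # expert parallel spans the whole world size
    print(f"{label}: world={tp * dp}, recorded ep=1, effective EP={effective_ep}")
```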

Why this was not caught earlier. There is no validation that cross-references --enable-expert-parallel in a launch script against the ep field in matrix entries; the coupling is by convention. This is precisely the class of mismatch that PR #919 ('Fix metadata inconsistencies in nvidia-master.yaml - TP/EP/DP-attn values now match actual recipe files') was created to clean up, and that the gptoss-fp4-* and dsr1-fp4-* changelogs repeatedly reference ('Explicitly add EP=TP for DP attention configs', 'Set ep:4 for all tp:4 entries, ep:8 for all tp:8 entries').
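
A sketch of what such a cross-check could look like (the recipe-dict shape and file paths are assumptions for illustration, not an existing utility):

```python
# Hypothetical validation: flag matrix entries that default to ep=1 while the
# paired launch script enables expert parallel.
from pathlib import Path

def find_ep_mismatches(recipe: dict, script_path: str) -> list[dict]:
    """recipe is one parsed config block (e.g. dsv4-fp4-b300-vllm) from the YAML."""
    script_enables_ep = "--enable-expert-parallel" in Path(script_path).read_text()
    mismatches = []
    for slc in recipe.get("seq-len-configs", []):
        for entry in slc.get("search-space", []):
            if script_enables_ep and entry.get("ep", 1) == 1:
                mismatches.append(entry)   # recorded ep=1, runtime EP > 1
    return mismatches
```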

Fix. Add explicit ep to each B300 search-space entry to match the launched EP:

  • { tp: 8, ep: 8, conc-start: 4, conc-end: 4 }
  • { tp: 4, ep: 4, conc-start: 4, conc-end: 128 }
  • { tp: 8, ep: 8, conc-start: 128, conc-end: 128 }
  • { tp: 4, ep: 4, dp-attn: true, conc-start: 256, conc-end: 512 }

This mirrors the adjacent dsv4-fp8-h200-vllm convention (ep: 8 for tp: 8, dp-attn: true) and keeps RESULT_FILENAME/process_result.py/compare_results.py in sync with the actual runtime EP. The fix is metadata-only; no recipe-file changes are required.


qwen3.5-fp8-h200-sglang:
image: lmsysorg/sglang:v0.5.9-cu129-amd64
model: Qwen/Qwen3.5-397B-A17B-FP8
80 changes: 40 additions & 40 deletions benchmarks/single_node/dsv4_fp4_b300_vllm.sh
@@ -1,14 +1,16 @@
#!/usr/bin/env bash

# Per https://vllm.ai/blog/deepseek-v4 the DeepSeek-V4-Pro recipe lists
# 8xB200 and 8xB300 with identical flags, so this script mirrors
# dsv4_fp4_b200.sh.
# DeepSeek-V4-Pro B300 single-node aggregate recipe from the submitted B300
# pareto sweep. The matrix uses dp-attn=true as the existing switch to flip a
# 4-GPU run from TP4 to DP4. Expert parallel is always enabled to match the
# provided vllm serve command exactly.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
MODEL \
TP \
DP_ATTENTION \
CONC \
ISL \
OSL \
@@ -22,56 +24,54 @@ fi

nvidia-smi

hf download "$MODEL"

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

# DeepSeek-V4-Pro weights are large and engine startup on B300 can exceed
# the default 600s. Give it an hour to load.
# DeepSeek-V4-Pro weights are large; engine startup can exceed the default
# 600s. Give it an hour to load.
export VLLM_ENGINE_READY_TIMEOUT_S=3600

if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
PARALLEL_ARGS=(--tensor-parallel-size "$TP" --data-parallel-size 1)
if [ "${DP_ATTENTION}" = "true" ]; then
PARALLEL_ARGS=(--tensor-parallel-size 1 --data-parallel-size "$TP")
fi

# Monkey-patch: bypass persistent_topk unconditionally. It raises "k out of
# range" during CUDA graph capture when the dummy batch has rows with
# seq_lens[i] < k (=2048 for DSV4). An attn_metadata.max_seq_len-based gate is
# not strict enough because dummy batches can have max >= k while individual
# rows have seq_lens[i] = 1. Fall back to top_k_per_row_decode everywhere so
# 1k/1k capture completes; 8k/1k already worked without the patch but we trade
# a small decode-time perf cost there to keep the script single-branch.
INDEXER_PY=/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/sparse_attn_indexer.py
echo "[monkey-patch] patching $INDEXER_PY"
sed -i 's/if current_platform.is_cuda() and topk_tokens in (512, 1024, 2048)[^:]*:/if False: # monkey-patched: bypass persistent_topk (k out of range)/' "$INDEXER_PY"
if ! grep -Fq 'if False: # monkey-patched: bypass persistent_topk' "$INDEXER_PY"; then
echo "[monkey-patch] FAILED: expected marker not found in $INDEXER_PY" >&2
echo "[monkey-patch] current line around persistent_topk dispatch:" >&2
grep -n 'topk_tokens in\|persistent_topk' "$INDEXER_PY" >&2 || true
exit 1
BENCHMARK_MAX_MODEL_LEN="$MAX_MODEL_LEN"
if [ "$ISL" -eq 1024 ] && [ "$OSL" -eq 1024 ]; then
BENCHMARK_MAX_MODEL_LEN=4096
fi

if [ "${EVAL_ONLY}" = "true" ]; then
EVAL_MAX_MODEL_LEN=$(compute_eval_context_length "$MODEL" "$BENCHMARK_MAX_MODEL_LEN")
export EVAL_MAX_MODEL_LEN
SERVE_MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
else
SERVE_MAX_MODEL_LEN="$BENCHMARK_MAX_MODEL_LEN"
fi
echo "[monkey-patch] applied: $(grep -n 'if False: # monkey-patched' $INDEXER_PY)"

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

# Per the recipe, run with EP + DP=8 (no --tensor-parallel-size flag). TP
# from the search space is used only for GPU allocation by the runner and
# as the DP size.
set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
--trust-remote-code \
--kv-cache-dtype fp8 \
--block-size 256 \
--no-enable-prefix-caching \
--enable-expert-parallel \
--data-parallel-size $TP \
--max-model-len $MAX_MODEL_LEN \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
"${PARALLEL_ARGS[@]}" \
--pipeline-parallel-size 1 \
--kv-cache-dtype fp8 \
--trust-remote-code \
--block-size 256 \
--no-enable-prefix-caching \
--enable-expert-parallel \
--compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
--attention_config.use_fp4_indexer_cache True \
--tokenizer-mode deepseek_v4 \
--tool-call-parser deepseek_v4 \
--enable-auto-tool-choice \
--reasoning-parser deepseek_v4 \
--max-cudagraph-capture-size 2048 \
--max-model-len "$SERVE_MAX_MODEL_LEN" \
--max-num-batched-tokens 2048 > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

13 changes: 12 additions & 1 deletion perf-changelog.yaml
@@ -1755,7 +1755,7 @@
- "VLLM_ENGINE_READY_TIMEOUT_S=3600 to accommodate large weight loading"
- "Configs: 1k1k conc 4-64, 8k1k conc 4-64"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1130

- config-keys:
- dsv4-fp4-b300-sglang
description:
@@ -1775,3 +1775,14 @@
- "Model: sgl-project/DeepSeek-V4-Pro-FP8"
- "https://github.com/sgl-project/sglang/pull/23608#issuecomment-4311952977"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1134

- config-keys:
- dsv4-fp4-b300-vllm
description:
- "Add DeepSeek-V4-Pro single-node B300 vLLM aggregate benchmark"
- "Image: vllm/vllm-openai:deepseekv4-cu130"
- "Model: deepseek-ai/DeepSeek-V4-Pro"
- "Uses the submitted B300 pareto schedule for both 1k1k and 8k1k, excluding conc 1: TP8 at conc 4/128, TP4 at conc 4/8/16/32/64/128, DP4 at conc 256/512"
- "Launch args match the provided vllm serve command, including FP4 indexer cache, FULL_AND_PIECEWISE cudagraph config, and max-num-batched-tokens 2048"
- "1k1k uses --max-model-len 4096; 8k1k uses the workflow-provided benchmark context length"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1144