Add B300 config: dsv4-fp4fp8-vllm (DeepSeek-V4-Pro) #1128
functionstackx wants to merge 3 commits into main from
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should generally request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
LGTM — straightforward B300 port mirroring the companion B200 recipe; the inline EP-flag nit is non-blocking since the search space only emits ep: 8.
Overview
This PR adds a new B300 vLLM config dsv4-fp4-b300-vllm for DeepSeek-V4-Pro per the official vLLM recipe (which lists 8xB200 and 8xB300 with identical flags). It touches three files: a new search-space entry in .github/configs/nvidia-master.yaml, a new launch script benchmarks/single_node/dsv4_fp4_b300.sh that's a direct port of the B200 script, and an additions-only entry in perf-changelog.yaml.
Security risks
None. This is a benchmarking config + launch script — no auth, crypto, permissions, or input-handling changes. The script runs in an isolated benchmark container.
Level of scrutiny
Low. This is a routine benchmark recipe addition following an established pattern (≈10 similar B200→B300 ports already in this changelog). The script mirrors dsv4_fp4_b200.sh from the companion PR #1127, the YAML entry is structurally identical to neighboring vLLM configs, and the PR description shows test-config expansion was verified.
Other factors
A nit was filed inline about the hardcoded --enable-expert-parallel flag not following the conditional-EP convention. The bug report itself classifies it as WARNING-level and notes runtime behavior is unaffected today (search space only emits ep: 8). The same pattern exists in the B200 source script being ported, so it's a pre-existing convention drift rather than a regression introduced here. The PR's only outstanding test-plan item is the actual B300 runner sweep, which can only be validated post-merge.
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --no-enable-prefix-caching \
    --enable-expert-parallel \
🟡 The new dsv4_fp4_b300.sh hardcodes --enable-expert-parallel at line 47, violating the project's PR-review rule (.github/workflows/claude-pr-review.yml lines 141-159) that scripts must conditionally enable EP based on EP_SIZE. Every other vLLM/ATOM script in benchmarks/single_node/ uses the if [ "$EP_SIZE" -gt 1 ] pattern; please wrap the flag the same way (and add EP_SIZE to check_env_vars) so a future search-space entry with ep: 1 doesn't silently still apply expert parallelism.
What the bug is
In benchmarks/single_node/dsv4_fp4_b300.sh line 47, the vllm serve invocation hardcodes --enable-expert-parallel unconditionally. The repo's own PR-review rules (.github/workflows/claude-pr-review.yml lines 141-159) explicitly forbid this and prescribe the canonical pattern:
if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
else
  EP=" "
fi

The rule is documented as a 🟡 WARNING-level review issue. Today the script also doesn't list EP_SIZE in its check_env_vars call (lines 8-16), so even if a caller exported EP_SIZE=1, the script would ignore it.
Why existing code doesn't prevent it
The sole reason runtime behavior is currently fine is that the new YAML search space (.github/configs/nvidia-master.yaml) only emits ep: 8 entries, so the runner always intends EP. There is nothing structural preventing a future ep: 1 entry — and once one is added, this script will silently still pass --enable-expert-parallel, contradicting the YAML's declared intent.
Code-path proof
- A future contributor adds { tp: 8, ep: 1, ... } to dsv4-fp4-b300-vllm.search-space in nvidia-master.yaml to compare TP-only vs EP performance.
- The runner expands the entry and exports EP_SIZE=1 into the script's environment (this is the standard contract used by every other ATOM/vLLM script in benchmarks/single_node/).
- The script ignores EP_SIZE: check_env_vars doesn't list it, and the vllm serve command unconditionally has --enable-expert-parallel baked in.
- vLLM launches with expert parallelism on, producing perf numbers that don't match the search-space's declared ep: 1 configuration. The run is silently mislabeled in the result store.
Convention evidence
Grep over benchmarks/single_node/ shows ~24 sibling vLLM/ATOM scripts that use the conditional pattern (minimaxm2.5_fp8_b300.sh:34, minimaxm2.5_fp8_b200.sh:30, dsr1_fp4_mi355x_atom.sh, dsr1_fp4_mi355x_atom_mtp.sh, dsr1_fp8_mi355x_atom.sh, dsr1_fp8_mi355x_atom_mtp.sh, glm5_fp8_mi355x_atom.sh, glm5.1_fp4_mi355x_atom.sh, gptoss_fp4_mi355x_atom.sh, kimik2.5_fp4_mi355x_atom.sh, qwen3.5_fp8_mi355x_atom.sh, qwen3.5_fp8_mi355x_atom_mtp.sh, etc.). dsv4_fp4_b300.sh is the lone outlier.
Fix
- Add EP_SIZE to the check_env_vars call (lines 8-16).
- Above the vllm serve block, insert:

    if [ "$EP_SIZE" -gt 1 ]; then
      EP=" --enable-expert-parallel"
    else
      EP=" "
    fi

- Replace the hardcoded --enable-expert-parallel \ line with $EP \ (matching the pattern in the sibling scripts).
Severity rationale
The project's own review rule classifies this as WARNING (not blocking), and runtime behavior is unaffected today because the YAML always emits ep: 8. It's a convention/robustness issue rather than a current functional bug — filing as nit.
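The requested fix can be sketched end-to-end. This is a hypothetical, trimmed stand-in for the launch section of dsv4_fp4_b300.sh, not the real script: the vllm serve call is replaced by an echo so it runs without GPUs, and the EP_SIZE default is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the conditional-EP pattern the review asks for.
set -euo pipefail

EP_SIZE="${EP_SIZE:-8}"   # exported by the runner from the search-space entry

# Canonical pattern from the review rule: pass the flag only when EP_SIZE > 1.
if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
else
  EP=" "
fi

# Echo instead of launching so the sketch is runnable anywhere.
echo "vllm serve deepseek-ai/DeepSeek-V4-Pro --kv-cache-dtype fp8 --block-size 256${EP}"
```

With an ep: 1 search-space entry the flag simply disappears from the command line, matching the YAML's declared intent.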
Mirrors the B200 dsv4-fp4-vllm config for B300. The recipe at https://vllm.ai/blog/deepseek-v4 lists 8xB200 and 8xB300 with identical flags for DeepSeek-V4-Pro, so the launch script is a straight port of dsv4_fp4_b200.sh with a note at the top.

Launch: EP + DP=$TP (no --tensor-parallel-size), FP8 KV cache, block size 256, FP4 indexer cache, prefix caching disabled, deepseek_v4 tokenizer/tool-call/reasoning parsers.

Configs: 1k1k conc 4-1024, 8k1k conc 4-512.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Set VLLM_ENGINE_READY_TIMEOUT_S=1200 (20 min) in the launch script.
DeepSeek-V4-Pro weights are large and engine startup on B300 was
exceeding the default 600s, surfacing as:
TimeoutError: Timed out waiting for engine core processes to start
- Cap both 1k1k and 8k1k search-space concurrency at 64 (conc-start 4,
conc-end 64) to keep the initial sweep bounded while the config
stabilizes. Higher-conc points can be re-enabled once the runner is
consistently coming up green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename config key dsv4-fp4-b300-vllm -> dsv4-fp4fp8-b300-vllm
and precision fp4 -> fp4fp8, reflecting that the model runs fp4
weights with an fp8 KV cache.
- Rename benchmarks/single_node/dsv4_fp4_b300.sh -> dsv4_fp4fp8_b300.sh
so the runner (${EXP_NAME%%_*}_${PRECISION}_b300.sh) resolves to
the new script. User wrote "fp4p8" in the script name; I'm using
"fp4fp8" to match the PRECISION field — otherwise the launcher
would look for a script that doesn't exist.
- Bump VLLM_ENGINE_READY_TIMEOUT_S from 1200 (20m) to 3600 (1h)
so slow weight loading doesn't trip the engine-ready gate.
- Update perf-changelog entry accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
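The rename rationale above hinges on the runner's name-resolution expression. A worked example, with illustrative EXP_NAME/PRECISION values:

```shell
# How the runner derives the script name per the commit message
# (${EXP_NAME%%_*}_${PRECISION}_b300.sh); values below are examples.
EXP_NAME="dsv4_1k1k"
PRECISION="fp4fp8"

# %%_* strips the longest suffix starting at the first underscore,
# leaving just the model prefix.
BENCH_SCRIPT="${EXP_NAME%%_*}_${PRECISION}_b300.sh"
echo "$BENCH_SCRIPT"   # dsv4_fp4fp8_b300.sh
```

This is why the script had to be named dsv4_fp4fp8_b300.sh rather than the user's "fp4p8" spelling: any mismatch with the PRECISION field and the launcher looks for a file that doesn't exist.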
Force-pushed: 711236f to 9434fca
* feat: add DeepSeek-V4-Flash FP4 B300 SGLang benchmark
Adds dsv4-fp4-b300-sglang config, single-node benchmark script, and
perf-changelog entry for the DeepSeek-V4 recipe from the SGLang
cookbook. The cookbook ships a B200 (not B300) recipe, so this
reuses the B200 Flash Low-Latency recipe on B300 until a
B300-specific recipe lands. Speculative decoding (EAGLE) and prefix
caching are disabled per request.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: switch dsv4-fp4-b300-sglang to Pro + Max-Throughput recipe
Match parallelism (TP=8/EP=8/dp-attn=true) and concurrency ranges
(4-1024 for 1k1k, 4-512 for 8k1k) to dsv4-fp4-b200-vllm. Use the
DeepSeek-V4-Pro variant with the cookbook Max-Throughput recipe
(DP=8 + DeepEP, no MTP), which aligns with the requested no-spec
parallelism. Prefix caching remains disabled.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: sync launch_b200-dgxc-slurm.sh cache mount from claude/add-dsv4-fp4-b200-vllm
Port the HF cache mount rework from the DSV4 B200 VLLM branch so
both PRs stay consistent: use the shared /scratch/fsw/gharunners/hf-hub-cache
path, drop the local MODEL override, and mount onto \$HF_HUB_CACHE
inside the container.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: restore trailing whitespace stripped from glm5.1 changelog entry
The dsv4-fp4-b300-sglang entry was appended correctly, but the earlier
edit also stripped trailing spaces on an existing line, producing a
spurious deletion. Revert so the diff is additive-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: add flock-guarded squash import to B300 runner
Mirror the lockfile logic already in launch_b200-dgxc-slurm.sh and
launch_h200-dgxc-slurm.sh: serialize concurrent enroot imports of
the same squash file via flock, skip the import when the squash is
already valid, and override ENROOT_CACHE_PATH to avoid permission
issues with the system-wide cache on worker nodes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
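The flock-guarded import described above can be sketched minimally. The enroot import is replaced by a placeholder write so the sketch runs without enroot, and the paths are temp-dir stand-ins rather than the real cache locations:

```shell
# Minimal sketch of a flock-guarded squash import (pattern described above).
WORKDIR="$(mktemp -d)"
SQUASH="$WORKDIR/image.sqsh"

(
  flock -x 200                      # serialize concurrent GH jobs on the same squash
  if [ ! -s "$SQUASH" ]; then
    # stand-in for: enroot import -o "$SQUASH" docker://<image>
    printf 'squashfs' > "$SQUASH"
    echo "imported squash"
  else
    echo "valid squash already present, skipping import"
  fi
) 200>"$SQUASH.lock"
```

Whichever job grabs the lock first does the import; later jobs see a non-empty squash and skip straight through.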
* fix: drop ENROOT_CACHE_PATH override from B300 runner
The override ("avoid permission issues with system-wide cache on
worker nodes") is a dgxc-slurm-specific workaround; launch_b300-nv.sh
is on the NV slurm cluster, not dgxc-slurm. Copying it in caused
the benchmark srun's pyxis shadow hook to fail with
'mkdir: cannot create directory pyxis_$JOBID.1/data: File exists'.
Keep the flock + skip-if-valid logic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: point B300 runner at shared gharunners/{squash,hf-hub-cache}
Move the squash cache from /data/squash to /data/home/sa-shared/gharunners/squash,
and the HF cache mount from /scratch/models to /data/home/sa-shared/gharunners/hf-hub-cache.
Also mount the host HF cache onto \$HF_HUB_CACHE inside the container so
tools reading the default HF path pick it up (matches the B200 dgxc-slurm
runner). Drop the /scratch/models Qwen3.5 path override since that path
is no longer used.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: move enroot import out of srun to avoid pyxis namespace collision
Running two srun steps in the same allocation (flock+import, then the
benchmark --container-image srun) reproducibly fails on this cluster
with:
error: pyxis: mkdir: cannot create directory
'/scratch/data/user-$UID/pyxis_$JOBID.1/data': File exists
error: pyxis: [ERROR] /etc/enroot/hooks.d/10-shadow.sh exited with return code 1
Per NVIDIA/pyxis#138, two srun steps sharing an allocation can leave
enroot/pyxis state between steps. Collapsing to a single srun (the
benchmark) is the cleanest workaround. Move the flock-guarded
enroot import to the host side, before salloc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: wipe stale pyxis scratch dirs for this JOB_ID before benchmark srun
Even with a single srun step, pyxis fails with
error: pyxis: mkdir: cannot create directory
'/scratch/data/user-$UID/pyxis_$JOBID.0/data': File exists
on fresh SLURM JOB_IDs. The /scratch path is left behind by previous
jobs whose IDs SLURM later reuses (and the cluster's pyxis epilog
doesn't clean it up). Wipe pyxis_$JOBID.* from the host after salloc;
no-op if /scratch is node-local, effective if it's shared NFS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
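An illustrative version of the stale-scratch wipe described above. The real script targets /scratch/data/user-$UID; a temp dir stands in here so the sketch is safe to run, and JOB_ID is a made-up example:

```shell
# Wipe stale pyxis scratch dirs left behind by a reused SLURM JOB_ID.
SCRATCH_ROOT="$(mktemp -d)"   # stand-in for /scratch/data/user-$UID
JOB_ID=12345

# Simulate leftovers from an earlier job whose ID SLURM has now reused.
mkdir -p "$SCRATCH_ROOT/pyxis_${JOB_ID}.0/data"

# The fix: remove every pyxis_$JOBID.* dir before the benchmark srun.
rm -rf "$SCRATCH_ROOT"/pyxis_${JOB_ID}.*

ls -A "$SCRATCH_ROOT"   # prints nothing: the stale dirs are gone
```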
* Revert: drop all B300 runner changes, mirror #1128's approach
PR #1128 (dsv4-fp4fp8-b300-vllm) runs on the same cluster with ZERO
changes to launch_b300-nv.sh. The pyxis 10-shadow.sh failures we were
chasing aren't caused by the runner -- reset it to origin/main and
keep the sglang config/bench additions only.
Reverts (from this branch):
- 4bb1f1a point B300 runner at shared gharunners/{squash,hf-hub-cache}
- 106deea drop ENROOT_CACHE_PATH override
- 97a488e add flock-guarded squash import
- 744c5a0 move enroot import out of srun
- d003c59 wipe stale pyxis scratch before benchmark srun
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runner: add head-node flock-guarded squash import on B300
Move enroot import out of srun to the head node and serialize parallel
GH jobs with flock on the shared squash file. Skips the import when a
valid squash already exists. The benchmark srun is now the only step
in the allocation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: mount at /ix and clear baked-in CUDA_VISIBLE_DEVICES
Port the B200 branch's fix for the lmsysorg/sglang:deepseek-v4-blackwell
image on B300:
- The image installs sglang editable under /workspace/sglang; the default
$GITHUB_WORKSPACE:/workspace/ bind-mount masks the install and breaks
'import sglang'. For this image, mount at /ix instead.
- The image's ENV bakes CUDA_VISIBLE_DEVICES=4,5,6,7, masking half the
GPUs Slurm allocates. Unset it in the bench script so TP=8 sees all 8.
- Write artefacts under $PWD instead of hard-coded /workspace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
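The CUDA_VISIBLE_DEVICES fix above boils down to one unset. A sketch, where the export below merely simulates the image's baked-in ENV:

```shell
# Simulate the image ENV that masks half the allocated GPUs.
export CUDA_VISIBLE_DEVICES="4,5,6,7"

# The bench-script fix: drop the baked-in value so Slurm's allocation
# governs GPU visibility and TP=8 sees all 8 devices.
unset CUDA_VISIBLE_DEVICES

echo "${CUDA_VISIBLE_DEVICES:-unset}"   # prints "unset"
```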
* runner: use /data/models pre-staged path for dsv4 on B300
Pre-staged models on the B300 cluster live under /data/models
(Qwen3.5-397B-A17B-FP8, dsv4-pro, etc.). Switch HF_HUB_CACHE_MOUNT
from /scratch/models to /data/models, and export MODEL to
/data/models/dsv4-pro when MODEL_PREFIX=dsv4 so the benchmark reads
from the mounted dir directly. The bench script skips `hf download`
when MODEL looks like an absolute path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
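The "skips hf download when MODEL looks like an absolute path" behavior could be implemented with a simple leading-slash check. This is a hypothetical sketch; the real script's test may differ:

```shell
# Hypothetical absolute-path check: pre-staged local paths are used
# directly, hub ids are fetched into the HF cache first.
MODEL="/data/models/dsv4-pro"

case "$MODEL" in
  /*) ACTION="use-prestaged" ;;   # absolute path: read weights from the mount
  *)  ACTION="hf-download"   ;;   # hub id: run hf download first
esac

echo "$ACTION"   # use-prestaged
```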
* fix: switch B300 dsv4 sglang to bw-ultra-compiled image
The stock lmsysorg/sglang:deepseek-v4-blackwell image ships kernels
compiled for B200 (SM_100) and crashes on B300 with
RuntimeError: RMSNorm failed with error code no kernel image is
available for execution on the device
during CUDA graph capture. Switch to cquil/sglang-deepseek-v4-bw-ultra:v1,
which is recompiled with B300 SM support.
Broaden the /ix mount conditional to match both image tags: the fork
keeps the same /workspace/sglang editable install that would otherwise
be masked by $GITHUB_WORKSPACE:/workspace/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: switch B300 dsv4 sglang image to yhyang201/sglang-b300:v3
Use the B300-recompiled image from yhyang201; extend the /ix mount
conditional to match the new tag in addition to the previous
deepseek-v4-blackwell / deepseek-v4-bw-ultra patterns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* update b300
* feat(dsv4-fp4-b300-sglang): pick recipe by CONC; split search-space
Mirror chore/dsv4-sgl-b200 commits 103a202 + 43be495 for B300:
Bench script now selects one of three cookbook recipes by CONC instead
of a single static flag set:
CONC <= 32 -> low-latency (TP only, chunked-prefill 4096,
disable-flashinfer-autotune)
33..128 -> balanced (+ DP-attention, max-running-reqs=128,
cuda-graph-max-bs=64, deepep-config)
CONC > 128 -> max-throughput (+ DP-attention, max-running-reqs=256,
cuda-graph-max-bs=64, deepep-config)
No speculative decoding in any recipe; --disable-radix-cache kept for
the no-prefix-caching baseline.
Split the dsv4-fp4-b300-sglang search-space rows per recipe boundary so
result filenames (ep=, dpa=) accurately reflect which recipe ran.
ep=8 on balanced/max-throughput reflects sglang's implicit
ep_size=tp_size override when --moe-a2a-backend deepep is set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
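The CONC-based dispatch in the commit above can be sketched as a plain threshold chain; the thresholds come from the message, and each recipe's flag set is abbreviated to its name:

```shell
# Sketch of the recipe-per-CONC dispatch described above.
CONC="${CONC:-4}"   # concurrency for this sweep point (default is illustrative)

if [ "$CONC" -le 32 ]; then
  RECIPE="low-latency"      # TP only, chunked-prefill 4096, no flashinfer autotune
elif [ "$CONC" -le 128 ]; then
  RECIPE="balanced"         # + DP-attention, max-running-reqs=128, deepep-config
else
  RECIPE="max-throughput"   # + DP-attention, max-running-reqs=256, deepep-config
fi

echo "$RECIPE"
```

Splitting the search-space rows at the same boundaries keeps the ep=/dpa= fields in result filenames consistent with whichever branch actually ran.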
* update b300
Switch B300 dsv4 sglang image to lmsysorg/sglang:deepseek-v4-b300
and extend the /ix mount conditional to match the new tag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at every CONC
The DeepEP FP8 weight-postprocess path is broken for
deepseek-ai/DeepSeek-V4-Pro on B300 with
lmsysorg/sglang:deepseek-v4-b300 -- every sglang launch with
--moe-a2a-backend deepep fails during model load with
RuntimeError: Recipe must be a list/tuple of 3 integers.
raised from sglang.srt.layers.quantization.fp8
.process_weights_after_loading_block_quant (fp8.py:957). The balanced
and max-throughput recipes both go through that path; the low-latency
recipe (TP-only, flashinfer_mxfp4 MoE) does not and loads cleanly.
Collapse the yaml search-space back to a single row spanning the full
CONC range (4..1024 for 1k1k, 4..512 for 8k1k) and hardcode the bench
script to the low-latency flags at every CONC. TODO(Cam) noted in both
files to restore the recipe-per-CONC dispatch once the DeepEP FP8 load
path is fixed upstream.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* trigger test check
* Revert "feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at every CONC"
This reverts commit bc43672.
* trigger test check
* Move dsv4 b300 sglang bench script to framework-tagged path
Per the runner naming convention introduced in #1146
(BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"), the b300
runner now prefers benchmarks/single_node/dsv4_fp4_b300_sglang.sh over
the legacy dsv4_fp4_b300.sh. The merge from main left this branch with
both scripts: the legacy file carrying the recipe-per-CONC dispatch
this PR added, and the framework-tagged file with the low-latency-only
fallback content from main. CI was therefore picking the wrong script.
Move the recipe-per-CONC dispatch onto dsv4_fp4_b300_sglang.sh and
delete the legacy filename so the runner picks up the intended logic.
Update the yaml comment to point at the new path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(perf-changelog): tighten dsv4-fp4-b300-sglang entry
Now that DeepEP FP8 loads cleanly, this PR is purely about restoring
the recipe-per-CONC split on top of the low-latency-only fallback
from #1143. Trim the changelog to that delta.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Adds dsv4-fp4-b300-vllm for DeepSeek-V4-Pro, per the recipe at https://vllm.ai/blog/deepseek-v4 (which lists 8xB200 and 8xB300 with identical flags).
- Runs vllm/vllm-openai:deepseekv4-cu130 against deepseek-ai/DeepSeek-V4-Pro. EP + DP=8 (no --tensor-parallel-size flag), FP8 KV cache, block size 256, FP4 indexer cache, prefix caching disabled.
- benchmarks/single_node/dsv4_fp4_b300.sh is a direct port of dsv4_fp4_b200.sh with a header note.

Companion PR
Recipe flags
Search space
- 1k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..1024 }
- 8k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..512 }

Test plan
- generate_sweep_configs.py test-config --config-keys dsv4-fp4-b300-vllm expands to 17 entries (exp-name dsv4_1k1k/dsv4_8k1k, runner b300, tp=8, ep=8, dp-attn=true, max-model-len 2304/9472).
- bash -n benchmarks/single_node/dsv4_fp4_b300.sh passes.
- perf-changelog.yaml diff vs main is pure additions (additions-only gate in process_changelog.py).

🤖 Generated with Claude Code