Add B300 config: dsv4-fp4fp8-vllm (DeepSeek-V4-Pro) #1128
functionstackx wants to merge 3 commits into main from
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should generally request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
LGTM — straightforward B300 port mirroring the companion B200 recipe; the inline EP-flag nit is non-blocking since the search space only emits ep: 8.
Overview
This PR adds a new B300 vLLM config dsv4-fp4-b300-vllm for DeepSeek-V4-Pro per the official vLLM recipe (which lists 8xB200 and 8xB300 with identical flags). It touches three files: a new search-space entry in .github/configs/nvidia-master.yaml, a new launch script benchmarks/single_node/dsv4_fp4_b300.sh that's a direct port of the B200 script, and an additions-only entry in perf-changelog.yaml.
Security risks
None. This is a benchmarking config + launch script — no auth, crypto, permissions, or input-handling changes. The script runs in an isolated benchmark container.
Level of scrutiny
Low. This is a routine benchmark recipe addition following an established pattern (≈10 similar B200→B300 ports already in this changelog). The script mirrors dsv4_fp4_b200.sh from the companion PR #1127, the YAML entry is structurally identical to neighboring vLLM configs, and the PR description shows test-config expansion was verified.
Other factors
A nit was filed inline about the hardcoded --enable-expert-parallel flag not following the conditional-EP convention. The bug report itself classifies it as WARNING-level and notes runtime behavior is unaffected today (search space only emits ep: 8). The same pattern exists in the B200 source script being ported, so it's a pre-existing convention drift rather than a regression introduced here. The PR's only outstanding test-plan item is the actual B300 runner sweep, which can only be validated post-merge.
    --kv-cache-dtype fp8 \
    --block-size 256 \
    --no-enable-prefix-caching \
    --enable-expert-parallel \
🟡 The new dsv4_fp4_b300.sh hardcodes --enable-expert-parallel at line 47, violating the project's PR-review rule (.github/workflows/claude-pr-review.yml lines 141-159) that scripts must conditionally enable EP based on EP_SIZE. Every other vLLM/ATOM script in benchmarks/single_node/ uses the if [ "$EP_SIZE" -gt 1 ] pattern; please wrap the flag the same way (and add EP_SIZE to check_env_vars) so a future search-space entry with ep: 1 doesn't silently still apply expert parallelism.
What the bug is
In benchmarks/single_node/dsv4_fp4_b300.sh line 47, the vllm serve invocation hardcodes --enable-expert-parallel unconditionally. The repo's own PR-review rules (.github/workflows/claude-pr-review.yml lines 141-159) explicitly forbid this and prescribe the canonical pattern:
if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
else
  EP=" "
fi

The rule is documented as a 🟡 WARNING-level review issue. Today the script also doesn't list EP_SIZE in its check_env_vars call (lines 8-16), so even if a caller exported EP_SIZE=1, the script would ignore it.
Why existing code doesn't prevent it
The sole reason runtime behavior is currently fine is that the new YAML search space (.github/configs/nvidia-master.yaml) only emits ep: 8 entries, so the runner always intends EP. There is nothing structural preventing a future ep: 1 entry — and once one is added, this script will silently still pass --enable-expert-parallel, contradicting the YAML's declared intent.
Code-path proof
- A future contributor adds { tp: 8, ep: 1, ... } to dsv4-fp4-b300-vllm.search-space in nvidia-master.yaml to compare TP-only vs EP performance.
- The runner expands the entry and exports EP_SIZE=1 into the script's environment (this is the standard contract used by every other ATOM/vLLM script in benchmarks/single_node/).
- The script ignores EP_SIZE: check_env_vars doesn't list it, and the vllm serve command unconditionally has --enable-expert-parallel baked in.
- vLLM launches with expert parallelism on, producing perf numbers that don't match the search-space's declared ep: 1 configuration. The run is silently mislabeled in the result store.
Convention evidence
Grep over benchmarks/single_node/ shows ~24 sibling vLLM/ATOM scripts that use the conditional pattern (minimaxm2.5_fp8_b300.sh:34, minimaxm2.5_fp8_b200.sh:30, dsr1_fp4_mi355x_atom.sh, dsr1_fp4_mi355x_atom_mtp.sh, dsr1_fp8_mi355x_atom.sh, dsr1_fp8_mi355x_atom_mtp.sh, glm5_fp8_mi355x_atom.sh, glm5.1_fp4_mi355x_atom.sh, gptoss_fp4_mi355x_atom.sh, kimik2.5_fp4_mi355x_atom.sh, qwen3.5_fp8_mi355x_atom.sh, qwen3.5_fp8_mi355x_atom_mtp.sh, etc.). dsv4_fp4_b300.sh is the lone outlier.
Fix
- Add EP_SIZE to the check_env_vars call (lines 8-16).
- Above the vllm serve block, insert:

    if [ "$EP_SIZE" -gt 1 ]; then
      EP=" --enable-expert-parallel"
    else
      EP=" "
    fi

- Replace the hardcoded --enable-expert-parallel \ line with $EP \ (matching the pattern in the sibling scripts).
Severity rationale
The project's own review rule classifies this as WARNING (not blocking), and runtime behavior is unaffected today because the YAML always emits ep: 8. It's a convention/robustness issue rather than a current functional bug — filing as nit.
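The requested fix can be sketched end-to-end. This is a hypothetical, trimmed stand-in for the launch section of dsv4_fp4_b300.sh, not the real script: the vllm serve call is replaced by an echo so it runs without GPUs, and the EP_SIZE default is illustrative.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the conditional-EP pattern the review asks for.
set -euo pipefail

EP_SIZE="${EP_SIZE:-8}"   # exported by the runner from the search-space entry

# Canonical pattern from the review rule: pass the flag only when EP_SIZE > 1.
if [ "$EP_SIZE" -gt 1 ]; then
  EP=" --enable-expert-parallel"
else
  EP=" "
fi

# Echo instead of launching so the sketch is runnable anywhere.
echo "vllm serve deepseek-ai/DeepSeek-V4-Pro --kv-cache-dtype fp8 --block-size 256${EP}"
```

With an ep: 1 search-space entry the flag simply disappears from the command line, matching the YAML's declared intent.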
Mirrors the B200 dsv4-fp4-vllm config for B300. The recipe at https://vllm.ai/blog/deepseek-v4 lists 8xB200 and 8xB300 with identical flags for DeepSeek-V4-Pro, so the launch script is a straight port of dsv4_fp4_b200.sh with a note at the top.

Launch: EP + DP=$TP (no --tensor-parallel-size), FP8 KV cache, block size 256, FP4 indexer cache, prefix caching disabled, deepseek_v4 tokenizer/tool-call/reasoning parsers.

Configs: 1k1k conc 4-1024, 8k1k conc 4-512.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Set VLLM_ENGINE_READY_TIMEOUT_S=1200 (20 min) in the launch script.
DeepSeek-V4-Pro weights are large and engine startup on B300 was
exceeding the default 600s, surfacing as:
TimeoutError: Timed out waiting for engine core processes to start
- Cap both 1k1k and 8k1k search-space concurrency at 64 (conc-start 4,
conc-end 64) to keep the initial sweep bounded while the config
stabilizes. Higher-conc points can be re-enabled once the runner is
consistently coming up green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Rename config key dsv4-fp4-b300-vllm -> dsv4-fp4fp8-b300-vllm
and precision fp4 -> fp4fp8, reflecting that the model runs fp4
weights with an fp8 KV cache.
- Rename benchmarks/single_node/dsv4_fp4_b300.sh -> dsv4_fp4fp8_b300.sh
so the runner (${EXP_NAME%%_*}_${PRECISION}_b300.sh) resolves to
the new script. User wrote "fp4p8" in the script name; I'm using
"fp4fp8" to match the PRECISION field — otherwise the launcher
would look for a script that doesn't exist.
- Bump VLLM_ENGINE_READY_TIMEOUT_S from 1200 (20m) to 3600 (1h)
so slow weight loading doesn't trip the engine-ready gate.
- Update perf-changelog entry accordingly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
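The rename rationale above hinges on the runner's name-resolution expression. A worked example, with illustrative EXP_NAME/PRECISION values:

```shell
# How the runner derives the script name per the commit message
# (${EXP_NAME%%_*}_${PRECISION}_b300.sh); values below are examples.
EXP_NAME="dsv4_1k1k"
PRECISION="fp4fp8"

# %%_* strips the longest suffix starting at the first underscore,
# leaving just the model prefix.
BENCH_SCRIPT="${EXP_NAME%%_*}_${PRECISION}_b300.sh"
echo "$BENCH_SCRIPT"   # dsv4_fp4fp8_b300.sh
```

This is why the script had to be named dsv4_fp4fp8_b300.sh rather than the user's "fp4p8" spelling: any mismatch with the PRECISION field and the launcher looks for a file that doesn't exist.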
Force-pushed: 711236f to 9434fca
* feat: add DeepSeek-V4-Flash FP4 B300 SGLang benchmark
Adds dsv4-fp4-b300-sglang config, single-node benchmark script, and
perf-changelog entry for the DeepSeek-V4 recipe from the SGLang
cookbook. The cookbook ships a B200 (not B300) recipe, so this
reuses the B200 Flash Low-Latency recipe on B300 until a
B300-specific recipe lands. Speculative decoding (EAGLE) and prefix
caching are disabled per request.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: switch dsv4-fp4-b300-sglang to Pro + Max-Throughput recipe
Match parallelism (TP=8/EP=8/dp-attn=true) and concurrency ranges
(4-1024 for 1k1k, 4-512 for 8k1k) to dsv4-fp4-b200-vllm. Use the
DeepSeek-V4-Pro variant with the cookbook Max-Throughput recipe
(DP=8 + DeepEP, no MTP), which aligns with the requested no-spec
parallelism. Prefix caching remains disabled.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: sync launch_b200-dgxc-slurm.sh cache mount from claude/add-dsv4-fp4-b200-vllm
Port the HF cache mount rework from the DSV4 B200 VLLM branch so
both PRs stay consistent: use the shared /scratch/fsw/gharunners/hf-hub-cache
path, drop the local MODEL override, and mount onto \$HF_HUB_CACHE
inside the container.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: restore trailing whitespace stripped from glm5.1 changelog entry
The dsv4-fp4-b300-sglang entry was appended correctly, but the earlier
edit also stripped trailing spaces on an existing line, producing a
spurious deletion. Revert so the diff is additive-only.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: add flock-guarded squash import to B300 runner
Mirror the lockfile logic already in launch_b200-dgxc-slurm.sh and
launch_h200-dgxc-slurm.sh: serialize concurrent enroot imports of
the same squash file via flock, skip the import when the squash is
already valid, and override ENROOT_CACHE_PATH to avoid permission
issues with the system-wide cache on worker nodes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
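The flock-guarded import described above can be sketched minimally. The enroot import is replaced by a placeholder write so the sketch runs without enroot, and the paths are temp-dir stand-ins rather than the real cache locations:

```shell
# Minimal sketch of a flock-guarded squash import (pattern described above).
WORKDIR="$(mktemp -d)"
SQUASH="$WORKDIR/image.sqsh"

(
  flock -x 200                      # serialize concurrent GH jobs on the same squash
  if [ ! -s "$SQUASH" ]; then
    # stand-in for: enroot import -o "$SQUASH" docker://<image>
    printf 'squashfs' > "$SQUASH"
    echo "imported squash"
  else
    echo "valid squash already present, skipping import"
  fi
) 200>"$SQUASH.lock"
```

Whichever job grabs the lock first does the import; later jobs see a non-empty squash and skip straight through.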
* fix: drop ENROOT_CACHE_PATH override from B300 runner
The override ("avoid permission issues with system-wide cache on
worker nodes") is a dgxc-slurm-specific workaround; launch_b300-nv.sh
is on the NV slurm cluster, not dgxc-slurm. Copying it in caused
the benchmark srun's pyxis shadow hook to fail with
'mkdir: cannot create directory pyxis_$JOBID.1/data: File exists'.
Keep the flock + skip-if-valid logic.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore: point B300 runner at shared gharunners/{squash,hf-hub-cache}
Move the squash cache from /data/squash to /data/home/sa-shared/gharunners/squash,
and the HF cache mount from /scratch/models to /data/home/sa-shared/gharunners/hf-hub-cache.
Also mount the host HF cache onto \$HF_HUB_CACHE inside the container so
tools reading the default HF path pick it up (matches the B200 dgxc-slurm
runner). Drop the /scratch/models Qwen3.5 path override since that path
is no longer used.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: move enroot import out of srun to avoid pyxis namespace collision
Running two srun steps in the same allocation (flock+import, then the
benchmark --container-image srun) reproducibly fails on this cluster
with:
error: pyxis: mkdir: cannot create directory
'/scratch/data/user-$UID/pyxis_$JOBID.1/data': File exists
error: pyxis: [ERROR] /etc/enroot/hooks.d/10-shadow.sh exited with return code 1
Per NVIDIA/pyxis#138, two srun steps sharing an allocation can leave
enroot/pyxis state between steps. Collapsing to a single srun (the
benchmark) is the cleanest workaround. Move the flock-guarded
enroot import to the host side, before salloc.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: wipe stale pyxis scratch dirs for this JOB_ID before benchmark srun
Even with a single srun step, pyxis fails with
error: pyxis: mkdir: cannot create directory
'/scratch/data/user-$UID/pyxis_$JOBID.0/data': File exists
on fresh SLURM JOB_IDs. The /scratch path is left behind by previous
jobs whose IDs SLURM later reuses (and the cluster's pyxis epilog
doesn't clean it up). Wipe pyxis_$JOBID.* from the host after salloc;
no-op if /scratch is node-local, effective if it's shared NFS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
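An illustrative version of the stale-scratch wipe described above. The real script targets /scratch/data/user-$UID; a temp dir stands in here so the sketch is safe to run, and JOB_ID is a made-up example:

```shell
# Wipe stale pyxis scratch dirs left behind by a reused SLURM JOB_ID.
SCRATCH_ROOT="$(mktemp -d)"   # stand-in for /scratch/data/user-$UID
JOB_ID=12345

# Simulate leftovers from an earlier job whose ID SLURM has now reused.
mkdir -p "$SCRATCH_ROOT/pyxis_${JOB_ID}.0/data"

# The fix: remove every pyxis_$JOBID.* dir before the benchmark srun.
rm -rf "$SCRATCH_ROOT"/pyxis_${JOB_ID}.*

ls -A "$SCRATCH_ROOT"   # prints nothing: the stale dirs are gone
```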
* Revert: drop all B300 runner changes, mirror #1128's approach
PR #1128 (dsv4-fp4fp8-b300-vllm) runs on the same cluster with ZERO
changes to launch_b300-nv.sh. The pyxis 10-shadow.sh failures we were
chasing aren't caused by the runner -- reset it to origin/main and
keep the sglang config/bench additions only.
Reverts (from this branch):
- 4bb1f1a point B300 runner at shared gharunners/{squash,hf-hub-cache}
- 106deea drop ENROOT_CACHE_PATH override
- 97a488e add flock-guarded squash import
- 744c5a0 move enroot import out of srun
- d003c59 wipe stale pyxis scratch before benchmark srun
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* runner: add head-node flock-guarded squash import on B300
Move enroot import out of srun to the head node and serialize parallel
GH jobs with flock on the shared squash file. Skips the import when a
valid squash already exists. The benchmark srun is now the only step
in the allocation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: mount at /ix and clear baked-in CUDA_VISIBLE_DEVICES
Port the B200 branch's fix for the lmsysorg/sglang:deepseek-v4-blackwell
image on B300:
- The image installs sglang editable under /workspace/sglang; the default
$GITHUB_WORKSPACE:/workspace/ bind-mount masks the install and breaks
'import sglang'. For this image, mount at /ix instead.
- The image's ENV bakes CUDA_VISIBLE_DEVICES=4,5,6,7, masking half the
GPUs Slurm allocates. Unset it in the bench script so TP=8 sees all 8.
- Write artefacts under $PWD instead of hard-coded /workspace.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
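The CUDA_VISIBLE_DEVICES fix above boils down to one unset. A sketch, where the export below merely simulates the image's baked-in ENV:

```shell
# Simulate the image ENV that masks half the allocated GPUs.
export CUDA_VISIBLE_DEVICES="4,5,6,7"

# The bench-script fix: drop the baked-in value so Slurm's allocation
# governs GPU visibility and TP=8 sees all 8 devices.
unset CUDA_VISIBLE_DEVICES

echo "${CUDA_VISIBLE_DEVICES:-unset}"   # prints "unset"
```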
* runner: use /data/models pre-staged path for dsv4 on B300
Pre-staged models on the B300 cluster live under /data/models
(Qwen3.5-397B-A17B-FP8, dsv4-pro, etc.). Switch HF_HUB_CACHE_MOUNT
from /scratch/models to /data/models, and export MODEL to
/data/models/dsv4-pro when MODEL_PREFIX=dsv4 so the benchmark reads
from the mounted dir directly. The bench script skips `hf download`
when MODEL looks like an absolute path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
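The "skips hf download when MODEL looks like an absolute path" behavior could be implemented with a simple leading-slash check. This is a hypothetical sketch; the real script's test may differ:

```shell
# Hypothetical absolute-path check: pre-staged local paths are used
# directly, hub ids are fetched into the HF cache first.
MODEL="/data/models/dsv4-pro"

case "$MODEL" in
  /*) ACTION="use-prestaged" ;;   # absolute path: read weights from the mount
  *)  ACTION="hf-download"   ;;   # hub id: run hf download first
esac

echo "$ACTION"   # use-prestaged
```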
* fix: switch B300 dsv4 sglang to bw-ultra-compiled image
The stock lmsysorg/sglang:deepseek-v4-blackwell image ships kernels
compiled for B200 (SM_100) and crashes on B300 with
RuntimeError: RMSNorm failed with error code no kernel image is
available for execution on the device
during CUDA graph capture. Switch to cquil/sglang-deepseek-v4-bw-ultra:v1,
which is recompiled with B300 SM support.
Broaden the /ix mount conditional to match both image tags: the fork
keeps the same /workspace/sglang editable install that would otherwise
be masked by $GITHUB_WORKSPACE:/workspace/.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* fix: switch B300 dsv4 sglang image to yhyang201/sglang-b300:v3
Use the B300-recompiled image from yhyang201; extend the /ix mount
conditional to match the new tag in addition to the previous
deepseek-v4-blackwell / deepseek-v4-bw-ultra patterns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* update b300
* feat(dsv4-fp4-b300-sglang): pick recipe by CONC; split search-space
Mirror chore/dsv4-sgl-b200 commits 103a202 + 43be495 for B300:
Bench script now selects one of three cookbook recipes by CONC instead
of a single static flag set:
CONC <= 32 -> low-latency (TP only, chunked-prefill 4096,
disable-flashinfer-autotune)
33..128 -> balanced (+ DP-attention, max-running-reqs=128,
cuda-graph-max-bs=64, deepep-config)
CONC > 128 -> max-throughput (+ DP-attention, max-running-reqs=256,
cuda-graph-max-bs=64, deepep-config)
No speculative decoding in any recipe; --disable-radix-cache kept for
the no-prefix-caching baseline.
Split the dsv4-fp4-b300-sglang search-space rows per recipe boundary so
result filenames (ep=, dpa=) accurately reflect which recipe ran.
ep=8 on balanced/max-throughput reflects sglang's implicit
ep_size=tp_size override when --moe-a2a-backend deepep is set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
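The CONC-based dispatch in the commit above can be sketched as a plain threshold chain; the thresholds come from the message, and each recipe's flag set is abbreviated to its name:

```shell
# Sketch of the recipe-per-CONC dispatch described above.
CONC="${CONC:-4}"   # concurrency for this sweep point (default is illustrative)

if [ "$CONC" -le 32 ]; then
  RECIPE="low-latency"      # TP only, chunked-prefill 4096, no flashinfer autotune
elif [ "$CONC" -le 128 ]; then
  RECIPE="balanced"         # + DP-attention, max-running-reqs=128, deepep-config
else
  RECIPE="max-throughput"   # + DP-attention, max-running-reqs=256, deepep-config
fi

echo "$RECIPE"
```

Splitting the search-space rows at the same boundaries keeps the ep=/dpa= fields in result filenames consistent with whichever branch actually ran.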
* update b300
Switch B300 dsv4 sglang image to lmsysorg/sglang:deepseek-v4-b300
and extend the /ix mount conditional to match the new tag.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at every CONC
The DeepEP FP8 weight-postprocess path is broken for
deepseek-ai/DeepSeek-V4-Pro on B300 with
lmsysorg/sglang:deepseek-v4-b300 -- every sglang launch with
--moe-a2a-backend deepep fails during model load with
RuntimeError: Recipe must be a list/tuple of 3 integers.
raised from sglang.srt.layers.quantization.fp8
.process_weights_after_loading_block_quant (fp8.py:957). The balanced
and max-throughput recipes both go through that path; the low-latency
recipe (TP-only, flashinfer_mxfp4 MoE) does not and loads cleanly.
Collapse the yaml search-space back to a single row spanning the full
CONC range (4..1024 for 1k1k, 4..512 for 8k1k) and hardcode the bench
script to the low-latency flags at every CONC. TODO(Cam) noted in both
files to restore the recipe-per-CONC dispatch once the DeepEP FP8 load
path is fixed upstream.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* trigger test check
* Revert "feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at every CONC"
This reverts commit bc43672.
* trigger test check
* Move dsv4 b300 sglang bench script to framework-tagged path
Per the runner naming convention introduced in #1146
(BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"), the b300
runner now prefers benchmarks/single_node/dsv4_fp4_b300_sglang.sh over
the legacy dsv4_fp4_b300.sh. The merge from main left this branch with
both scripts: the legacy file carrying the recipe-per-CONC dispatch
this PR added, and the framework-tagged file with the low-latency-only
fallback content from main. CI was therefore picking the wrong script.
Move the recipe-per-CONC dispatch onto dsv4_fp4_b300_sglang.sh and
delete the legacy filename so the runner picks up the intended logic.
Update the yaml comment to point at the new path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* chore(perf-changelog): tighten dsv4-fp4-b300-sglang entry
Now that DeepEP FP8 loads cleanly, this PR is purely about restoring
the recipe-per-CONC split on top of the low-latency-only fallback
from #1143. Trim the changelog to that delta.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
- Adds dsv4-fp4-b300-vllm for DeepSeek-V4-Pro, per the recipe at https://vllm.ai/blog/deepseek-v4 (which lists 8xB200 and 8xB300 with identical flags).
- Runs vllm/vllm-openai:deepseekv4-cu130 against deepseek-ai/DeepSeek-V4-Pro. EP + DP=8 (no --tensor-parallel-size flag), FP8 KV cache, block size 256, FP4 indexer cache, prefix caching disabled.
- benchmarks/single_node/dsv4_fp4_b300.sh is a direct port of dsv4_fp4_b200.sh with a header note.

Companion PR
Recipe flags
Search space
- 1k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..1024 }
- 8k1k: { tp: 8, ep: 8, dp-attn: true, conc: 4..512 }

Test plan
- generate_sweep_configs.py test-config --config-keys dsv4-fp4-b300-vllm expands to 17 entries (exp-name dsv4_1k1k/dsv4_8k1k, runner b300, tp=8, ep=8, dp-attn=true, max-model-len 2304/9472).
- bash -n benchmarks/single_node/dsv4_fp4_b300.sh passes.
- perf-changelog.yaml diff vs main is pure additions (additions-only gate in process_changelog.py).

🤖 Generated with Claude Code