[NVIDIA] chore: B300 single node DeepSeek v4 SGLang LOW LATENCY ONLY by cquil11 · Pull Request #1143 · SemiAnalysisAI/InferenceX

cquil11 · 2026-04-24T21:33:34Z

Summary

Fallback branch off `chore/dsv4-sgl-b300`. Strips the balanced and max-throughput rows from the `dsv4-fp4-b300-sglang` search-space so only the low-latency (TP-only) recipe runs.

Why

Every `sglang serve` launch with `--moe-a2a-backend deepep` against `deepseek-ai/DeepSeek-V4-Pro` on `lmsysorg/sglang:deepseek-v4-b300` fails during model load with:

```
RuntimeError: Recipe must be a list/tuple of 3 integers.
File "sglang/srt/layers/quantization/fp8.py", line 957,
in process_weights_after_loading_block_quant
```

Confirmed in run 24911516760 — every `dpa=true` job (CONC 64–1024 on both ISL) hits the same traceback at the same line, while every `dpa=false` (low-latency) job passes load and runs.

Matrix

`generate_sweep_configs.py --runner-type b300 --model-prefix dsv4` yields 8 entries, all `tp=8 ep=1 dpa=false`, CONC ∈ {4, 8, 16, 32} for both 1k1k and 8k1k.
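The count of 8 follows from two sequence-length configurations times four concurrency levels. A minimal sketch of that enumeration, assuming concurrency doubles from the start value to the end value; the real logic in `generate_sweep_configs.py` is not shown in this PR, and `concurrency_levels` and `expand_matrix` are hypothetical helpers:

```python
from itertools import product


def concurrency_levels(start: int, end: int) -> list[int]:
    """Return start, 2*start, 4*start, ... up to and including end (assumed doubling)."""
    levels = []
    conc = start
    while conc <= end:
        levels.append(conc)
        conc *= 2
    return levels


def expand_matrix(seq_lens: list[str], start: int, end: int) -> list[dict]:
    """Build one entry per (sequence length, concurrency) pair for the low-latency row."""
    return [
        {"seq-len": seq, "tp": 8, "ep": 1, "dpa": False, "conc": conc}
        for seq, conc in product(seq_lens, concurrency_levels(start, end))
    ]


entries = expand_matrix(["1k1k", "8k1k"], start=4, end=32)
print(len(entries))
```

With `start=4` and `end=32` the helper produces the four levels listed above for each of the two sequence-length configurations.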

Follow-up

Re-introduce the balanced and max-throughput rows on #1132 once the FP8+DeepEP weight-postprocess issue is fixed upstream.

Test plan

`generate_sweep_configs.py` → 8 matrix entries, all dpa=false
`pytest utils/matrix_logic/`
Sweep run completes without the FP8 DeepEP traceback

Adds dsv4-fp4-b300-sglang config, single-node benchmark script, and perf-changelog entry for the DeepSeek-V4 recipe from the SGLang cookbook. The cookbook ships a B200 (not B300) recipe, so this reuses the B200 Flash Low-Latency recipe on B300 until a B300-specific recipe lands. Speculative decoding (EAGLE) and prefix caching are disabled per request. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Match parallelism (TP=8/EP=8/dp-attn=true) and concurrency ranges (4-1024 for 1k1k, 4-512 for 8k1k) to dsv4-fp4-b200-vllm. Use the DeepSeek-V4-Pro variant with the cookbook Max-Throughput recipe (DP=8 + DeepEP, no MTP), which aligns with the requested no-spec parallelism. Prefix caching remains disabled. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…4-fp4-b200-vllm Port the HF cache mount rework from the DSV4 B200 VLLM branch so both PRs stay consistent: use the shared /scratch/fsw/gharunners/hf-hub-cache path, drop the local MODEL override, and mount onto \$HF_HUB_CACHE inside the container. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The dsv4-fp4-b300-sglang entry was appended correctly, but the earlier edit also stripped trailing spaces on an existing line, producing a spurious deletion. Revert so the diff is additive-only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirror the lockfile logic already in launch_b200-dgxc-slurm.sh and launch_h200-dgxc-slurm.sh: serialize concurrent enroot imports of the same squash file via flock, skip the import when the squash is already valid, and override ENROOT_CACHE_PATH to avoid permission issues with the system-wide cache on worker nodes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The override ("avoid permission issues with system-wide cache on worker nodes") is a dgxc-slurm-specific workaround; launch_b300-nv.sh is on the NV slurm cluster, not dgxc-slurm. Copying it in caused the benchmark srun's pyxis shadow hook to fail with 'mkdir: cannot create directory pyxis_$JOBID.1/data: File exists'. Keep the flock + skip-if-valid logic. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move the squash cache from /data/squash to /data/home/sa-shared/gharunners/squash, and the HF cache mount from /scratch/models to /data/home/sa-shared/gharunners/hf-hub-cache. Also mount the host HF cache onto \$HF_HUB_CACHE inside the container so tools reading the default HF path pick it up (matches the B200 dgxc-slurm runner). Drop the /scratch/models Qwen3.5 path override since that path is no longer used. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Running two srun steps in the same allocation (flock+import, then the benchmark --container-image srun) reproducibly fails on this cluster with:

    error: pyxis: mkdir: cannot create directory '/scratch/data/user-$UID/pyxis_$JOBID.1/data': File exists
    error: pyxis: [ERROR] /etc/enroot/hooks.d/10-shadow.sh exited with return code 1

Per NVIDIA/pyxis#138, two srun steps sharing an allocation can leave enroot/pyxis state between steps. Collapsing to a single srun (the benchmark) is the cleanest workaround. Move the flock-guarded enroot import to the host side, before salloc. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Even with a single srun step, pyxis fails on fresh SLURM JOB_IDs with:

    error: pyxis: mkdir: cannot create directory '/scratch/data/user-$UID/pyxis_$JOBID.0/data': File exists

The /scratch path is left behind by previous jobs whose IDs SLURM later reuses (and the cluster's pyxis epilog doesn't clean it up). Wipe pyxis_$JOBID.* from the host after salloc; no-op if /scratch is node-local, effective if it's shared NFS. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

PR #1128 (dsv4-fp4fp8-b300-vllm) runs on the same cluster with ZERO changes to launch_b300-nv.sh. The pyxis 10-shadow.sh failures we were chasing aren't caused by the runner -- reset it to origin/main and keep the sglang config/bench additions only.

Reverts (from this branch):
- 4bb1f1a point B300 runner at shared gharunners/{squash,hf-hub-cache}
- 106deea drop ENROOT_CACHE_PATH override
- 97a488e add flock-guarded squash import
- 744c5a0 move enroot import out of srun
- d003c59 wipe stale pyxis scratch before benchmark srun

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Move enroot import out of srun to the head node and serialize parallel GH jobs with flock on the shared squash file. Skips the import when a valid squash already exists. The benchmark srun is now the only step in the allocation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
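The lock-then-skip pattern described in this commit can be sketched as follows. This is an illustration only: the runner itself is a shell script, `import_once` and `squash_is_valid` are hypothetical names, and the validity check (file exists and is non-empty) is an assumption standing in for whatever check the runner actually performs. The import step is passed in as a callable rather than spelled out as a command.

```python
import fcntl
import os
from typing import Callable


def squash_is_valid(path: str) -> bool:
    """Placeholder validity check: the file exists and is non-empty (assumption)."""
    return os.path.isfile(path) and os.path.getsize(path) > 0


def import_once(squash_path: str, do_import: Callable[[str], None]) -> bool:
    """Run do_import(squash_path) under an exclusive lock unless a valid squash exists.

    Returns True if the import ran, False if it was skipped.
    """
    lock_path = squash_path + ".lock"
    with open(lock_path, "w") as lock_file:
        # Blocks until any other process holding the lock releases it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            # Re-check under the lock: another job may have finished the import
            # while this one was waiting.
            if squash_is_valid(squash_path):
                return False
            do_import(squash_path)
            return True
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Checking validity only after the lock is acquired is what lets parallel jobs that lose the race skip the import instead of repeating it.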

Port the B200 branch's fix for the lmsysorg/sglang:deepseek-v4-blackwell image on B300:
- The image installs sglang editable under /workspace/sglang; the default $GITHUB_WORKSPACE:/workspace/ bind-mount masks the install and breaks 'import sglang'. For this image, mount at /ix instead.
- The image's ENV bakes CUDA_VISIBLE_DEVICES=4,5,6,7, masking half the GPUs Slurm allocates. Unset it in the bench script so TP=8 sees all 8.
- Write artefacts under $PWD instead of hard-coded /workspace.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Pre-staged models on the B300 cluster live under /data/models (Qwen3.5-397B-A17B-FP8, dsv4-pro, etc.). Switch HF_HUB_CACHE_MOUNT from /scratch/models to /data/models, and export MODEL to /data/models/dsv4-pro when MODEL_PREFIX=dsv4 so the benchmark reads from the mounted dir directly. The bench script skips `hf download` when MODEL looks like an absolute path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
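The two rules in this commit (pick the pre-staged directory for dsv4, skip the download for absolute paths) can be sketched as below. The real logic lives in the shell runner and bench script; `resolve_model` and `needs_hf_download` are hypothetical names, and treating "looks like an absolute path" as `os.path.isabs` on a POSIX host is an assumption.

```python
import os


def resolve_model(model_prefix: str, default_model: str) -> str:
    """Use the pre-staged directory for dsv4, otherwise keep the configured model."""
    if model_prefix == "dsv4":
        return "/data/models/dsv4-pro"
    return default_model


def needs_hf_download(model: str) -> bool:
    """Hub IDs such as 'org/name' need a download; absolute paths do not."""
    return not os.path.isabs(model)


model = resolve_model("dsv4", "deepseek-ai/DeepSeek-V4-Pro")
print(model, needs_hf_download(model))
```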

The stock lmsysorg/sglang:deepseek-v4-blackwell image ships kernels compiled for B200 (SM_100) and crashes on B300 during CUDA graph capture with "RuntimeError: RMSNorm failed with error code no kernel image is available for execution on the device". Switch to cquil/sglang-deepseek-v4-bw-ultra:v1, which is recompiled with B300 SM support. Broaden the /ix mount conditional to match both image tags: the fork keeps the same /workspace/sglang editable install that would otherwise be masked by $GITHUB_WORKSPACE:/workspace/. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Use the B300-recompiled image from yhyang201; extend the /ix mount conditional to match the new tag in addition to the previous deepseek-v4-blackwell / deepseek-v4-bw-ultra patterns. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Mirror chore/dsv4-sgl-b200 commits 103a202 + 43be495 for B300:

Bench script now selects one of three cookbook recipes by CONC instead of a single static flag set:

    CONC <= 32  -> low-latency (TP only, chunked-prefill 4096, disable-flashinfer-autotune)
    33..128     -> balanced (+ DP-attention, max-running-reqs=128, cuda-graph-max-bs=64, deepep-config)
    CONC > 128  -> max-throughput (+ DP-attention, max-running-reqs=256, cuda-graph-max-bs=64, deepep-config)

No speculative decoding in any recipe; --disable-radix-cache kept for the no-prefix-caching baseline.

Split the dsv4-fp4-b300-sglang search-space rows per recipe boundary so result filenames (ep=, dpa=) accurately reflect which recipe ran. ep=8 on balanced/max-throughput reflects sglang's implicit ep_size=tp_size override when --moe-a2a-backend deepep is set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
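The CONC thresholds in this commit can be expressed as a small dispatch function. This is a sketch, not the bench script (which is shell): `select_recipe` and `result_tags` are hypothetical names, the dictionary keys paraphrase the settings named in the commit message rather than exact CLI flags, and treating DeepEP as enabled exactly when DP-attention is enabled is an assumption drawn from the three recipes listed.

```python
def select_recipe(conc: int) -> dict:
    """Map a concurrency level to the recipe settings listed in the commit message."""
    if conc <= 32:
        return {
            "name": "low-latency",
            "dp_attention": False,
            "chunked_prefill": 4096,
            "disable_flashinfer_autotune": True,
        }
    if conc <= 128:
        return {
            "name": "balanced",
            "dp_attention": True,
            "max_running_reqs": 128,
            "cuda_graph_max_bs": 64,
        }
    return {
        "name": "max-throughput",
        "dp_attention": True,
        "max_running_reqs": 256,
        "cuda_graph_max_bs": 64,
    }


def result_tags(conc: int, tp: int = 8) -> dict:
    """Derive the ep= and dpa= filename tags for a concurrency level.

    With DeepEP the effective EP size equals TP (per the commit message);
    the TP-only low-latency recipe runs with ep=1.
    """
    uses_deepep = select_recipe(conc)["dp_attention"]  # assumption: DeepEP iff DP-attention
    return {"ep": tp if uses_deepep else 1, "dpa": uses_deepep}


for conc in (4, 32, 33, 128, 129, 1024):
    print(conc, select_recipe(conc)["name"], result_tags(conc))
```

Keeping the thresholds in one function makes it easier to see why the search-space rows have to be split at the same boundaries: the tags in the result filenames change exactly where the recipe changes.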

Switch B300 dsv4 sglang image to lmsysorg/sglang:deepseek-v4-b300 and extend the /ix mount conditional to match the new tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
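The mount conditional that these image-switch commits keep extending, together with the earlier CUDA_VISIBLE_DEVICES fix, can be sketched as follows. The runner and bench script are shell; the function names here are hypothetical, and matching by substring on the image name is an assumption about how the conditional is written. The fragments are the image names that appear in this PR's commit messages.

```python
# Image name fragments taken from the commit messages in this PR; the real
# conditional in the runner script may be written differently.
EDITABLE_INSTALL_IMAGES = (
    "deepseek-v4-blackwell",
    "deepseek-v4-bw-ultra",
    "yhyang201/sglang-b300",
    "deepseek-v4-b300",
)


def workspace_mount_target(image: str) -> str:
    """Return /ix for images that install sglang under /workspace/sglang, else /workspace."""
    if any(fragment in image for fragment in EDITABLE_INSTALL_IMAGES):
        return "/ix"
    return "/workspace"


def clear_baked_gpu_mask(env: dict) -> dict:
    """Return a copy of env without CUDA_VISIBLE_DEVICES so all allocated GPUs are visible."""
    cleaned = dict(env)
    cleaned.pop("CUDA_VISIBLE_DEVICES", None)
    return cleaned


print(workspace_mount_target("lmsysorg/sglang:deepseek-v4-b300"))
print(clear_baked_gpu_mask({"CUDA_VISIBLE_DEVICES": "4,5,6,7", "HOME": "/root"}))
```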

github-actions · 2026-04-24T21:33:42Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


The DeepEP FP8 weight-postprocess path is broken for deepseek-ai/DeepSeek-V4-Pro on B300 with lmsysorg/sglang:deepseek-v4-b300 -- every sglang launch with --moe-a2a-backend deepep fails during model load with "RuntimeError: Recipe must be a list/tuple of 3 integers." raised from sglang.srt.layers.quantization.fp8.process_weights_after_loading_block_quant (fp8.py:957). Hardcode the bench script to the low-latency recipe flags at every CONC (drop the CONC-based dispatch) and collapse the yaml search-space back to a single row spanning CONC 4..1024 / 4..512 so the full sweep still runs, just without DP-attn or DeepEP. TODO(Cam) noted in both files to revert to the recipe-per-CONC dispatch on chore/dsv4-sgl-b300 once sglang can load the checkpoint under --moe-a2a-backend deepep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cquil11 · 2026-04-24T21:41:05Z

Subsumed by #1132 — low-latency-only fallback applied directly to chore/dsv4-sgl-b300 (commit bc436727). Closing this standalone branch.

cquil11 · 2026-04-24T21:42:04Z

Reopened — the low-latency fallback also lives in #1132 now (commit bc436727), but keeping this branch/PR as the isolated low-latency-only track.

claude · 2026-04-24T21:43:31Z

+- config-keys:
+    - dsv4-fp4-b300-sglang
+  description:
+    - "Add DeepSeek-V4-Pro FP4 B300 SGLang benchmark"
+    - "Image: lmsysorg/sglang:deepseek-v4-blackwell"
+    - "Model: deepseek-ai/DeepSeek-V4-Pro (FP4 MoE experts + FP8 attention/dense)"
+    - "Reuses the B200 Pro Max-Throughput recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4 on B300 until a B300-specific recipe ships"
+    - "DP=8 + DeepEP, prefix caching disabled, no speculative decoding"
+    - "Parallelism (TP=8/EP=8/dp-attn=true) and concurrency ranges (4-1024 for 1k1k, 4-512 for 8k1k) mirror dsv4-fp4-b200-vllm"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1132


🔴 The perf-changelog entry added at lines 1749-1758 (and the outer NOTE comment at nvidia-master.yaml:1799-1803) describes the max-throughput config from the sibling PR #1132, not the low-latency-only fallback this PR actually adds: every field is wrong — image is deepseek-v4-blackwell vs the actual deepseek-v4-b300, it claims DP=8 + DeepEP and TP=8/EP=8/dp-attn=true with concurrency 4-1024/4-512 vs the actual tp:8, ep:1 with conc 4-32 and no DP-attn/DeepEP, and it references a nonexistent config dsv4-fp4-b200-vllm with pr-link pointing at #1132. Both blocks look like they were inherited from the parent branch and should be rewritten to describe the low-latency-only fallback (or the changelog entry deferred until #1132 lands); the in-block comment at nvidia-master.yaml:1812-1817 already has the correct description and directly contradicts the stale outer NOTE.

Extended reasoning...

What the bug is

This PR adds a new dsv4-fp4-b300-sglang config to .github/configs/nvidia-master.yaml and a matching entry to perf-changelog.yaml. The PR title and description make clear that it is a low-latency-only fallback — it strips the balanced and max-throughput rows because --moe-a2a-backend deepep is broken on this image/checkpoint. But the new changelog entry (lines 1749-1758) and the outer NOTE comment in the yaml (lines 1799-1803) both describe the opposite: the balanced/max-throughput recipe that the sibling PR #1132 will add once DeepEP is fixed.

Field-by-field comparison

Changelog entry at perf-changelog.yaml:1749-1758 vs the actual yaml that this PR adds:

| Field | Changelog says | Actual yaml says |
| --- | --- | --- |
| Image | lmsysorg/sglang:deepseek-v4-blackwell | lmsysorg/sglang:deepseek-v4-b300 (line 1805) |
| Recipe | "B200 Pro Max-Throughput recipe" with "DP=8 + DeepEP" | "Low-latency only (TP-only, no DP-attn, no DeepEP)" (line 1812) |
| Parallelism | TP=8/EP=8/dp-attn=true | { tp: 8, ep: 1 }, no dp-attn (lines 1821-1826) |
| Concurrency | 4-1024 (1k1k), 4-512 (8k1k) | conc-start: 4, conc-end: 32 for both |
| Mirror reference | dsv4-fp4-b200-vllm | No such key exists in nvidia-master.yaml (only dsv4-fp8-h200-vllm) |
| pr-link | pull/1132 | This PR is #1143; #1132 is explicitly called the follow-up in the description |

Step-by-step proof

1. Read the diff for .github/configs/nvidia-master.yaml. The new config has exactly one search-space tuple per ISL: { tp: 8, ep: 1, conc-start: 4, conc-end: 32 }. No dp-attention, no --moe-a2a-backend deepep, no EP>1.

2. The in-yaml comment at lines 1812-1817 explicitly reads: "Low-latency only (TP-only, no DP-attn, no DeepEP) while the DeepEP FP8 weight-postprocess path is broken for this checkpoint on B300."

3. Now read the outer NOTE at lines 1799-1803 of the same diff: "B200 Pro FP4 Max-Throughput recipe (DP=8 + DeepEP, no MTP) on B300... Parallelisms and concurrency ranges mirror dsv4-fp4-b200-vllm." These two comment blocks describe mutually exclusive recipes for the same config entry.

4. Read the PR description: "Strips the balanced and max-throughput rows from the dsv4-fp4-b300-sglang search-space so only the low-latency (TP-only) recipe runs" and "Re-introduce the balanced and max-throughput rows on #1132 once the FP8+DeepEP weight-postprocess issue is fixed upstream." The outer NOTE and the changelog entry describe the #1132 version, not this PR.

5. A grep over .github/configs/nvidia-master.yaml for dsv4-fp4-b200-vllm returns only the stale comment; the referenced config does not exist.

6. The changelog entry's pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1132 points at the wrong PR: this is #1143, and per the description #1132 is the follow-up.

Why existing code doesn't prevent it

perf-changelog.yaml and the yaml comments are free-form documentation — nothing validates that the changelog description matches the config it's documenting. The entry appears to have been inherited from the parent branch (chore/dsv4-sgl-b300 → #1132) and the search-space was rewritten for the fallback without updating the surrounding comments or the changelog.
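A lightweight check along these lines could catch the mismatch before merge. This is a sketch under stated assumptions, not an existing tool in the repository: `changelog_mismatches` is a hypothetical helper, the entry and config are passed as already-parsed dictionaries, and only two of the fields compared in this review (image and pr-link) are checked.

```python
def changelog_mismatches(entry: dict, config: dict, pr_number: int) -> list[str]:
    """Return human-readable mismatches between a changelog entry and its config."""
    problems = []
    text = " ".join(entry.get("description", []))
    # The image named in the config should appear somewhere in the description.
    if config["image"] not in text:
        problems.append(f"image {config['image']} not mentioned in description")
    # The pr-link should point at the PR that actually adds the entry.
    if not entry.get("pr-link", "").endswith(f"/pull/{pr_number}"):
        problems.append(f"pr-link does not point at #{pr_number}")
    return problems


# The stale entry quoted in this review, reduced to the fields the check reads.
stale_entry = {
    "config-keys": ["dsv4-fp4-b300-sglang"],
    "description": ["Image: lmsysorg/sglang:deepseek-v4-blackwell"],
    "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/1132",
}
actual_config = {"image": "lmsysorg/sglang:deepseek-v4-b300"}

for problem in changelog_mismatches(stale_entry, actual_config, pr_number=1143):
    print(problem)
```

A substring match on free-form description text is deliberately crude; it flags the two most mechanical errors here without trying to parse the recipe prose.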

Impact

After this PR merges, perf-changelog.yaml — the authoritative record of what changed in each PR — will state that PR #1132 added a DP=8/DeepEP/EP=8 max-throughput recipe with concurrency up to 1024. In reality, #1132 has not merged, and PR #1143 added a TP=8/EP=1 low-latency recipe capped at concurrency 32. Future readers diffing historical performance numbers against the changelog will be materially misled about the recipe in effect. Inside the yaml, the two contradictory comment blocks make it hard to tell which one reflects the actual intent.

How to fix

Rewrite both blocks to describe what this PR actually does. Minimal fix:

Update the perf-changelog.yaml entry to: image lmsysorg/sglang:deepseek-v4-b300, "Low-latency only (TP=8, EP=1, no DP-attn, no DeepEP) — fallback while the FP8 + DeepEP weight-postprocess path is broken on this checkpoint", concurrency 4-32 for both ISLs, drop the dsv4-fp4-b200-vllm reference, and set pr-link to pull/1143.

Either delete the outer NOTE comment at nvidia-master.yaml:1799-1803 (the in-block comment at 1812-1817 already explains the situation correctly) or rewrite it to match.

Alternatively, defer the changelog entry and add it in #1132 when the full max-throughput recipe actually lands.

@claude please fix this

Claude finished @cquil11's task in 1m 28s —— View job

Fixing stale perf-changelog and config comments

Read current perf-changelog.yaml and nvidia-master.yaml to identify exact content

Fix perf-changelog.yaml entry to describe the low-latency fallback (correct image, parallelism, concurrency, pr-link)

Fix stale NOTE comment in nvidia-master.yaml

Commit and push changes

Changes (410df74)

perf-changelog.yaml — rewrote the dsv4-fp4-b300-sglang entry:

Image: deepseek-v4-blackwell → deepseek-v4-b300

Description: max-throughput DP=8/DeepEP/EP=8 → low-latency TP=8/EP=1, no DP-attn, no DeepEP

Concurrency: updated to match actual config (1k1k 4-1024, 8k1k 4-512)

Removed stale dsv4-fp4-b200-vllm mirror reference

pr-link: #1132 → #1143

.github/configs/nvidia-master.yaml — replaced the outer NOTE (lines 1799-1803) that described the B200 max-throughput recipe with a note matching the actual low-latency fallback. The in-block TODO comment at line 1812 was already correct and is unchanged.

Oseltamivir

lgtm

The perf-changelog entry and outer NOTE comment in nvidia-master.yaml described the max-throughput recipe from #1132, not the low-latency fallback this PR actually adds. Rewrite both to match the actual config: TP=8/EP=1, no DP-attn, no DeepEP, image deepseek-v4-b300, pr-link #1143. Co-authored-by: Cameron Quilici <cquil11@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Now that DeepEP FP8 loads cleanly, this PR is purely about restoring the recipe-per-CONC split on top of the low-latency-only fallback from #1143. Trim the changelog to that delta. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat: add DeepSeek-V4-Flash FP4 B300 SGLang benchmark

  Adds dsv4-fp4-b300-sglang config, single-node benchmark script, and perf-changelog entry for the DeepSeek-V4 recipe from the SGLang cookbook. The cookbook ships a B200 (not B300) recipe, so this reuses the B200 Flash Low-Latency recipe on B300 until a B300-specific recipe lands. Speculative decoding (EAGLE) and prefix caching are disabled per request.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: switch dsv4-fp4-b300-sglang to Pro + Max-Throughput recipe

  Match parallelism (TP=8/EP=8/dp-attn=true) and concurrency ranges (4-1024 for 1k1k, 4-512 for 8k1k) to dsv4-fp4-b200-vllm. Use the DeepSeek-V4-Pro variant with the cookbook Max-Throughput recipe (DP=8 + DeepEP, no MTP), which aligns with the requested no-spec parallelism. Prefix caching remains disabled.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: sync launch_b200-dgxc-slurm.sh cache mount from claude/add-dsv4-fp4-b200-vllm

  Port the HF cache mount rework from the DSV4 B200 VLLM branch so both PRs stay consistent: use the shared /scratch/fsw/gharunners/hf-hub-cache path, drop the local MODEL override, and mount onto \$HF_HUB_CACHE inside the container.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: restore trailing whitespace stripped from glm5.1 changelog entry

  The dsv4-fp4-b300-sglang entry was appended correctly, but the earlier edit also stripped trailing spaces on an existing line, producing a spurious deletion. Revert so the diff is additive-only.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: add flock-guarded squash import to B300 runner

  Mirror the lockfile logic already in launch_b200-dgxc-slurm.sh and launch_h200-dgxc-slurm.sh: serialize concurrent enroot imports of the same squash file via flock, skip the import when the squash is already valid, and override ENROOT_CACHE_PATH to avoid permission issues with the system-wide cache on worker nodes.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: drop ENROOT_CACHE_PATH override from B300 runner

  The override ("avoid permission issues with system-wide cache on worker nodes") is a dgxc-slurm-specific workaround; launch_b300-nv.sh is on the NV slurm cluster, not dgxc-slurm. Copying it in caused the benchmark srun's pyxis shadow hook to fail with 'mkdir: cannot create directory pyxis_$JOBID.1/data: File exists'. Keep the flock + skip-if-valid logic.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore: point B300 runner at shared gharunners/{squash,hf-hub-cache}

  Move the squash cache from /data/squash to /data/home/sa-shared/gharunners/squash, and the HF cache mount from /scratch/models to /data/home/sa-shared/gharunners/hf-hub-cache. Also mount the host HF cache onto \$HF_HUB_CACHE inside the container so tools reading the default HF path pick it up (matches the B200 dgxc-slurm runner). Drop the /scratch/models Qwen3.5 path override since that path is no longer used.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: move enroot import out of srun to avoid pyxis namespace collision

  Running two srun steps in the same allocation (flock+import, then the benchmark --container-image srun) reproducibly fails on this cluster with:

      error: pyxis: mkdir: cannot create directory '/scratch/data/user-$UID/pyxis_$JOBID.1/data': File exists
      error: pyxis: [ERROR] /etc/enroot/hooks.d/10-shadow.sh exited with return code 1

  Per NVIDIA/pyxis#138, two srun steps sharing an allocation can leave enroot/pyxis state between steps. Collapsing to a single srun (the benchmark) is the cleanest workaround. Move the flock-guarded enroot import to the host side, before salloc.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: wipe stale pyxis scratch dirs for this JOB_ID before benchmark srun

  Even with a single srun step, pyxis fails on fresh SLURM JOB_IDs with:

      error: pyxis: mkdir: cannot create directory '/scratch/data/user-$UID/pyxis_$JOBID.0/data': File exists

  The /scratch path is left behind by previous jobs whose IDs SLURM later reuses (and the cluster's pyxis epilog doesn't clean it up). Wipe pyxis_$JOBID.* from the host after salloc; no-op if /scratch is node-local, effective if it's shared NFS.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Revert: drop all B300 runner changes, mirror #1128's approach

  PR #1128 (dsv4-fp4fp8-b300-vllm) runs on the same cluster with ZERO changes to launch_b300-nv.sh. The pyxis 10-shadow.sh failures we were chasing aren't caused by the runner -- reset it to origin/main and keep the sglang config/bench additions only.

  Reverts (from this branch):
  - 4bb1f1a point B300 runner at shared gharunners/{squash,hf-hub-cache}
  - 106deea drop ENROOT_CACHE_PATH override
  - 97a488e add flock-guarded squash import
  - 744c5a0 move enroot import out of srun
  - d003c59 wipe stale pyxis scratch before benchmark srun

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* runner: add head-node flock-guarded squash import on B300

  Move enroot import out of srun to the head node and serialize parallel GH jobs with flock on the shared squash file. Skips the import when a valid squash already exists. The benchmark srun is now the only step in the allocation.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: mount at /ix and clear baked-in CUDA_VISIBLE_DEVICES

  Port the B200 branch's fix for the lmsysorg/sglang:deepseek-v4-blackwell image on B300:
  - The image installs sglang editable under /workspace/sglang; the default $GITHUB_WORKSPACE:/workspace/ bind-mount masks the install and breaks 'import sglang'. For this image, mount at /ix instead.
  - The image's ENV bakes CUDA_VISIBLE_DEVICES=4,5,6,7, masking half the GPUs Slurm allocates. Unset it in the bench script so TP=8 sees all 8.
  - Write artefacts under $PWD instead of hard-coded /workspace.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* runner: use /data/models pre-staged path for dsv4 on B300

  Pre-staged models on the B300 cluster live under /data/models (Qwen3.5-397B-A17B-FP8, dsv4-pro, etc.). Switch HF_HUB_CACHE_MOUNT from /scratch/models to /data/models, and export MODEL to /data/models/dsv4-pro when MODEL_PREFIX=dsv4 so the benchmark reads from the mounted dir directly. The bench script skips `hf download` when MODEL looks like an absolute path.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: switch B300 dsv4 sglang to bw-ultra-compiled image

  The stock lmsysorg/sglang:deepseek-v4-blackwell image ships kernels compiled for B200 (SM_100) and crashes on B300 with RuntimeError: RMSNorm failed with error code no kernel image is available for execution on the device during CUDA graph capture. Switch to cquil/sglang-deepseek-v4-bw-ultra:v1, which is recompiled with B300 SM support. Broaden the /ix mount conditional to match both image tags: the fork keeps the same /workspace/sglang editable install that would otherwise be masked by $GITHUB_WORKSPACE:/workspace/.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* fix: switch B300 dsv4 sglang image to yhyang201/sglang-b300:v3

  Use the B300-recompiled image from yhyang201; extend the /ix mount conditional to match the new tag in addition to the previous deepseek-v4-blackwell / deepseek-v4-bw-ultra patterns.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* update b300

* feat(dsv4-fp4-b300-sglang): pick recipe by CONC; split search-space

  Mirror chore/dsv4-sgl-b200 commits 103a202 + 43be495 for B300:

  Bench script now selects one of three cookbook recipes by CONC instead of a single static flag set:

      CONC <= 32  -> low-latency (TP only, chunked-prefill 4096, disable-flashinfer-autotune)
      33..128     -> balanced (+ DP-attention, max-running-reqs=128, cuda-graph-max-bs=64, deepep-config)
      CONC > 128  -> max-throughput (+ DP-attention, max-running-reqs=256, cuda-graph-max-bs=64, deepep-config)

  No speculative decoding in any recipe; --disable-radix-cache kept for the no-prefix-caching baseline.

  Split the dsv4-fp4-b300-sglang search-space rows per recipe boundary so result filenames (ep=, dpa=) accurately reflect which recipe ran. ep=8 on balanced/max-throughput reflects sglang's implicit ep_size=tp_size override when --moe-a2a-backend deepep is set.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* update b300

  Switch B300 dsv4 sglang image to lmsysorg/sglang:deepseek-v4-b300 and extend the /ix mount conditional to match the new tag.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at every CONC

  The DeepEP FP8 weight-postprocess path is broken for deepseek-ai/DeepSeek-V4-Pro on B300 with lmsysorg/sglang:deepseek-v4-b300 -- every sglang launch with --moe-a2a-backend deepep fails during model load with "RuntimeError: Recipe must be a list/tuple of 3 integers." raised from sglang.srt.layers.quantization.fp8.process_weights_after_loading_block_quant (fp8.py:957). The balanced and max-throughput recipes both go through that path; the low-latency recipe (TP-only, flashinfer_mxfp4 MoE) does not and loads cleanly.

  Collapse the yaml search-space back to a single row spanning the full CONC range (4..1024 for 1k1k, 4..512 for 8k1k) and hardcode the bench script to the low-latency flags at every CONC. TODO(Cam) noted in both files to restore the recipe-per-CONC dispatch once the DeepEP FP8 load path is fixed upstream.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* trigger test check

* Revert "feat(dsv4-fp4-b300-sglang): hardcode low-latency recipe at every CONC"

  This reverts commit bc43672.

* trigger test check

* Move dsv4 b300 sglang bench script to framework-tagged path

  Per the runner naming convention introduced in #1146 (BENCH_SCRIPT="${BENCH_BASE}_${FRAMEWORK}${SPEC_SUFFIX}.sh"), the b300 runner now prefers benchmarks/single_node/dsv4_fp4_b300_sglang.sh over the legacy dsv4_fp4_b300.sh. The merge from main left this branch with both scripts: the legacy file carrying the recipe-per-CONC dispatch this PR added, and the framework-tagged file with the low-latency-only fallback content from main. CI was therefore picking the wrong script.

  Move the recipe-per-CONC dispatch onto dsv4_fp4_b300_sglang.sh and delete the legacy filename so the runner picks up the intended logic. Update the yaml comment to point at the new path.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* chore(perf-changelog): tighten dsv4-fp4-b300-sglang entry

  Now that DeepEP FP8 loads cleanly, this PR is purely about restoring the recipe-per-CONC split on top of the low-latency-only fallback from #1143. Trim the changelog to that delta.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* dsv4-fp4-b300-sglang: revert to #1143 low-latency-only baseline

  Reverts the matrix expansion (#1132), script edits (#1158, #1173, #1174), and changelog retriggers (#1178) on top of the original #1143 entry. Restores the script and config block to their #1143 state and clears all prior dsv4-fp4-b300-sglang changelog entries to start fresh. The dsv4-fp4-b300-sglang-mtp config (#1166) is untouched.

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: add pr-link for #1184

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* perf-changelog: keep only the original #1143 entry, drop new entry

  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cquil11 and others added 19 commits April 24, 2026 01:14

Merge branch 'main' into chore/dsv4-sgl-b300

5d93913

update b300

08edf26

update b300

d35696c

Switch B300 dsv4 sglang image to lmsysorg/sglang:deepseek-v4-b300 and extend the /ix mount conditional to match the new tag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cquil11 added the sweep-enabled label Apr 24, 2026

cquil11 requested a review from a team April 24, 2026 21:33

github-project-automation Bot added this to InferenceMAX Board Apr 24, 2026

cquil11 requested review from jgangani and kedarpotdar-nv as code owners April 24, 2026 21:33

cquil11 added full-sweep-enabled and removed sweep-enabled labels Apr 24, 2026

cquil11 force-pushed the chore/dsv4-sgl-b300-low-latency branch from a63e6f4 to c3b562c Compare April 24, 2026 21:36

cquil11 changed the title ~~feat(dsv4-fp4-b300-sglang): low-latency-only while DeepEP is broken~~ [NVIDIA][needs fixing] chore: B300 single node DeepSeek v4 SGLang LOW LATENCY ONLY Apr 24, 2026

cquil11 changed the title ~~[NVIDIA][needs fixing] chore: B300 single node DeepSeek v4 SGLang LOW LATENCY ONLY~~ [NVIDIA] chore: B300 single node DeepSeek v4 SGLang LOW LATENCY ONLY Apr 24, 2026

cquil11 closed this Apr 24, 2026

cquil11 deleted the chore/dsv4-sgl-b300-low-latency branch April 24, 2026 21:41

github-project-automation Bot moved this to Done in InferenceMAX Board Apr 24, 2026

cquil11 restored the chore/dsv4-sgl-b300-low-latency branch April 24, 2026 21:41

cquil11 reopened this Apr 24, 2026

claude Bot reviewed Apr 24, 2026

Oseltamivir approved these changes Apr 24, 2026

cquil11 merged commit a0ddd19 into main Apr 24, 2026
2 checks passed

cquil11 deleted the chore/dsv4-sgl-b300-low-latency branch April 24, 2026 22:17

claude Bot mentioned this pull request Apr 26, 2026

retry sglang b300 #1171

Merged

cquil11 mentioned this pull request Apr 26, 2026

dsv4-fp4-b300-sglang: revert to #1143 low-latency-only baseline #1184

Merged

2 tasks

cquil11 added a commit that referenced this pull request Apr 26, 2026

perf-changelog: keep only the original #1143 entry, drop new entry

3941b76

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

[NVIDIA] chore: B300 single node DeepSeek v4 SGLang LOW LATENCY ONLY #1143
cquil11 merged 21 commits into main from chore/dsv4-sgl-b300-low-latency

3 participants

| Field | Changelog says | Actual yaml says |
| --- | --- | --- |
| Image | `lmsysorg/sglang:deepseek-v4-blackwell` | `lmsysorg/sglang:deepseek-v4-b300` (line 1805) |
| Recipe | "B200 Pro Max-Throughput recipe" with "DP=8 + DeepEP" | "Low-latency only (TP-only, no DP-attn, no DeepEP)" (line 1812) |
| Parallelism | `TP=8/EP=8/dp-attn=true` | `{ tp: 8, ep: 1 }`, no dp-attn (lines 1821-1826) |
| Concurrency | 4-1024 (1k1k), 4-512 (8k1k) | `conc-start: 4, conc-end: 32` for both |
| Mirror reference | `dsv4-fp4-b200-vllm` | No such key exists in nvidia-master.yaml (only `dsv4-fp8-h200-vllm`) |
| pr-link | `pull/1132` | This PR is #1143; #1132 is explicitly called the follow-up in the description |
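A comparison like the one in the table above could be automated along these lines. This is only a sketch: the dictionary keys and the `find_drift` helper are hypothetical, and the values are transcribed by hand from the table rather than parsed from the real perf-changelog or nvidia-master.yaml.

```python
def find_drift(claimed, actual):
    """Return {field: (claimed, actual)} for every field whose values differ."""
    fields = sorted(set(claimed) | set(actual))
    return {
        f: (claimed.get(f), actual.get(f))
        for f in fields
        if claimed.get(f) != actual.get(f)
    }


# Values copied from the table; key names are illustrative assumptions.
changelog = {
    "image": "lmsysorg/sglang:deepseek-v4-blackwell",
    "tp": 8, "ep": 8, "dp_attn": True,
    "conc_end_1k1k": 1024, "conc_end_8k1k": 512,
    "pr": 1132,
}
config = {
    "image": "lmsysorg/sglang:deepseek-v4-b300",
    "tp": 8, "ep": 1, "dp_attn": False,
    "conc_end_1k1k": 32, "conc_end_8k1k": 32,
    "pr": 1143,
}

drift = find_drift(changelog, config)
for field, (said, is_) in drift.items():
    print(f"{field}: changelog says {said!r}, yaml says {is_!r}")
```

Only `tp` agrees between the two sources here; every other field is reported as drift, matching the review's conclusion that the changelog entry described a different recipe than the config that shipped.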

Conversation

cquil11 commented Apr 24, 2026

Summary

Why

Matrix

Follow-up

Test plan

github-actions Bot commented Apr 24, 2026

github-actions Bot commented Apr 24, 2026

cquil11 commented Apr 24, 2026

cquil11 commented Apr 24, 2026

claude Bot Apr 24, 2026

What the bug is

Field-by-field comparison

Step-by-step proof

Why existing code doesn't prevent it

Impact

How to fix

cquil11 Apr 24, 2026

Klaud-Cold Apr 24, 2026 • edited

Fixing stale perf-changelog and config comments

Changes (410df74)

Oseltamivir left a comment
