[NV] llm-d-vllm: Add llm-d to the InferenceX benchmarking framework by ezrasilvera · Pull Request #2050 · SemiAnalysisAI/InferenceX

ezrasilvera · 2026-07-04T10:13:30Z

Adding llm-d into InferenceX Starting with DeepSeek-V4-Pro FP4 GB200 disaggregated benchmarks (mid-curve + high-tpt)

Summary

Adds llm-d (vLLM) as a new benchmark framework in InferenceX, with a no-k8s multi-node path (Envoy + EPP + P/D sidecar over SLURM/pyxis), and theDeepSeek-V4-Pro FP4 GB200 P/D-disaggregated recipes:

dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe - 1P1D DEP8, conc 256/512/1024
dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe - 2P1D DEP8, conc 4096

Both keys share one engine/EPP recipe (benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-mid-curve-megamoe.yaml);
topology and concurrency come from the master-config keys, not the recipe.

Notable

These keys mirror dsv4-fp4-gb200-dynamo-vllm - same SKU, model, ISL/OSL, disagg TP1/DP8/EP8 layout, and the same concurrency points (256/512/1024/4096).
Emits the same agg_*.json rows through process_result.py (framework=llm-d-vllm, X=conc, Y=tput_per_gpu) and the same _gpus_/_ctx_/_gen_ result-filename convention the multinode ingest already parses. The only downstream addition is
the framework label in InferenceX-app (separate PR).
Every reported point is a full 100%-completion sweep: (e.g. 12288/12288 at conc4096).
One recipe, two topologies: Mid-curve (1P1D) and high-tpt (2P1D) share a single engine/EPP recipe; node counts and concurrency live in the master-config keys - so adding a curve point is a config-key change, not a new recipe file.

Scope

GB200 DeepSeek-V4-Pro FP4 only. No changes to other SKUs/models/frameworks.

Known merge-blocker

The image: on both keys is a personal build (ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23). It should be republished to the org registry and the image: lines updated before merge.

Before merge

Republish image to org registry; update image: on both keys.
Canonicalize perf-changelog.yaml pr-link from pull/XXX to this PR's number.
Full sweep green.

github-actions · 2026-07-04T10:13:38Z

Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase As a PR reviewer and CODEOWNER, I have reviewed this and have.

For PR verification, add the full-sweep-enabled or full-sweep-fail-fast label to this PR — the benchmark sweep only runs on labeled PRs.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs

感谢你的贡献！请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md，然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人，你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板，包括保留英文语句 As a PR reviewer and CODEOWNER, I have reviewed this and have。

如需进行 PR 验证，请为此 PR 添加 full-sweep-enabled 或 full-sweep-fail-fast 标签 — 基准测试 sweep 仅在带有标签的 PR 上运行。

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动（flake），重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档

github-actions · 2026-07-04T10:13:38Z

Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase As a PR reviewer and CODEOWNER, I have reviewed this and have.

For PR verification, add the full-sweep-enabled or full-sweep-fail-fast label to this PR — the benchmark sweep only runs on labeled PRs.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs

感谢你的贡献！请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md，然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人，你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板，包括保留英文语句 As a PR reviewer and CODEOWNER, I have reviewed this and have。

如需进行 PR 验证，请为此 PR 添加 full-sweep-enabled 或 full-sweep-fail-fast 标签 — 基准测试 sweep 仅在带有标签的 PR 上运行。

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动（flake），重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档

github-actions · 2026-07-04T10:13:39Z

Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase As a PR reviewer and CODEOWNER, I have reviewed this and have.

For PR verification, add the full-sweep-enabled or full-sweep-fail-fast label to this PR — the benchmark sweep only runs on labeled PRs.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs

感谢你的贡献！请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md，然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人，你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板，包括保留英文语句 As a PR reviewer and CODEOWNER, I have reviewed this and have。

如需进行 PR 验证，请为此 PR 添加 full-sweep-enabled 或 full-sweep-fail-fast 标签 — 基准测试 sweep 仅在带有标签的 PR 上运行。

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动（flake），重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档

claude · 2026-07-04T10:27:59Z

+#
+# This is currently not used as we build an use an image that contains 
+# all binaries


🟡 The comment on lines 10-12 says "This is currently not used as we build an use an image that contains all binaries," but benchmarks/multi_node/llm-d/job.slurm (lines 236-241 of the pyxis branch) explicitly sources this file and uses LLMD_BIN_DIR to conditionally bind-mount epp/pd-sidecar/envoy — so the file IS wired into the runtime as an optional override. Also a typo: "build an use" → "build and use". Nit — no runtime impact, but the stale comment will mislead maintainers into thinking they can delete the file.

Extended reasoning...

What the comment claims vs. what the code does

Lines 10-12 of benchmarks/llm-d/binaries.env state:

This is currently not used as we build an use an image that contains all binaries

However, the pyxis branch of benchmarks/multi_node/llm-d/job.slurm explicitly sources this file and consumes LLMD_BIN_DIR to build optional bind-mounts:

[[ -f "${DI_REPO_DIR}/benchmarks/llm-d/binaries.env" ]] && \ source "${DI_REPO_DIR}/benchmarks/llm-d/binaries.env" for _bin in epp pd-sidecar envoy; do if [[ -n "${LLMD_BIN_DIR:-}" && -x "${LLMD_BIN_DIR}/${_bin}" ]]; then PYXIS_MOUNTS+=",${LLMD_BIN_DIR}/${_bin}:/usr/local/bin/${_bin}:ro" ... fi done

So binaries.env is wired into the runtime — it acts as the single source of truth for LLMD_BIN_DIR (defaulted to /mnt/lustre01/users-public/sa-shared/llm-d-bins), which the pyxis path uses to swap in mounted binaries and skip a combined-image rebuild.

Step-by-step proof

Author enables the pyxis path (the default on GB200 via LLMD_CONTAINER_ENGINE=pyxis set in runners/launch_gb200-nv.sh).

job.slurm reaches the pyxis branch and sources binaries.env, populating LLMD_BIN_DIR.

If a user has run extract-binaries.sh (which also sources this file), the three binaries exist under $LLMD_BIN_DIR and get bind-mounted into every container.

This is the entire point of the mounted-binary path documented in extract-binaries.sh and the Dockerfile — running a stock vllm/vllm-openai image without rebuilding the combined image.

The comment therefore contradicts both the sibling files (extract-binaries.sh, Dockerfile, envoy.yaml docstrings) that describe binaries.env as the single source of truth, and job.slurm itself.

Addressing the counter-argument

One could argue the comment is "arguably accurate for the DEFAULT deployment path" — in the current recipe the combined image bakes binaries in, so the -x check makes the mount loop a no-op unless the operator has explicitly run extract-binaries.sh. That's a fair characterization of when the file has effect, but "currently not used" is stronger than "currently a no-op by default" and it flatly contradicts the sourcing on job.slurm:236. A future maintainer trying to trim "unused" scaffolding would delete this file and quietly break the mounted-binary path.

Impact and fix

No runtime impact — the code works today. The concern is purely documentation drift in a new file this PR is introducing. Fix in one edit:

# This file is sourced by benchmarks/multi_node/llm-d/job.slurm to # optionally bind-mount extracted binaries in place of the combined # image's baked-in copies (see extract-binaries.sh). It is a no-op # by default (the -x check fails when the binaries have not been # extracted); the Dockerfile also reads the same URLs when building # and using an image that contains all binaries.

Also fix the "build an use" → "build and use" typo while you're in there.

claude · 2026-07-04T10:28:01Z

+### Why one entry per *leader* (not per node)
+
+In the wide-EP guide each instance is a single vLLM engine that spans
+multiple nodes via `--data-parallel-hybrid-lb`. With hybrid-lb, the
+leader pod (`LWS_WORKER_INDEX=0`) accepts external traffic and
+distributes it internally across the local DP ranks; in our LWS-free
+SLURM mapping, the prefill-leader and decode-leader are the only nodes
+addressable from outside. Adding an entry per worker would cause EPP to
+route directly to a worker, bypassing the engine's internal load
+balancing.
+
+If we later want to expose all pods of an instance (the alternative
+hybrid-lb interpretation: external LB across nodes too), we can extend
+the loop in `server.sh` to emit one entry per `IPS[i]` in the prefill
+range and one per `IPS[i]` in the decode range, all carrying the same
+role label. EPP then load-balances across them via `random-picker`.


🟡 The README section "Why one entry per leader (not per node)" (lines 80-89) plus the example endpoints.yaml at lines 57-69 both document per-leader emission, but server.sh add_role() (lines 334-343) actually emits ONE endpoint per node — its own comment says "ONE endpoint per node at base_port (hybrid LB: ...)". The lines 91-95 "future work" section describing per-node emission as hypothetical has already been implemented. Docs-only staleness (per-node is the intended behavior — see server.sh:289-291 and its consistency with the multi-node coordinator health poll in lines 519-535), but future maintainers may misdiagnose whether current behavior is intentional. Please update the README to reflect actual per-node emission, or add a note explaining why the leader-only text was preserved.

Extended reasoning...

What the bug is

The new file benchmarks/multi_node/llm-d/README.md (added by this PR) describes the endpoints.yaml file-discovery contract as emitting one entry per leader. Concretely:

Line 55: "…writes /tmp/endpoints.yaml inside the container with one entry per leader"

Lines 57-69: the example YAML shows only prefill-0 (PREFILL_LEADER_IP) and decode-0 (DECODE_LEADER_IP)

Lines 80-89: an entire section titled "Why one entry per leader (not per node)" justifies leader-only emission and warns: "Adding an entry per worker would cause EPP to route directly to a worker, bypassing the engine's internal load balancing"

Lines 91-95: describes per-node emission as a hypothetical future extension ("we can extend the loop … to emit one entry per IPS[i]")

But the code at benchmarks/multi_node/llm-d/server.sh:334-348 (the add_role() block that writes endpoints.yaml) does exactly the opposite:

def add_role(role, ips, base_port): # ONE endpoint per node at base_port (hybrid LB: the node's api-server / # sidecar internally load-balances its local DP ranks). for i, ip in enumerate(ips): endpoints.append({'name': f'{role}-{i}', ...}) add_role('prefill', prefill_ips, VLLM_PORT) # prefill_ips = all_ips[:pn] add_role('decode', decode_ips, SIDECAR_PORT) # decode_ips = all_ips[pn:pn + dn]

The inline comment at server.sh:339-340 literally says the opposite of the README section header: "ONE endpoint per node at base_port".

Step-by-step proof

Take the new high-tpt 2P+1D key (dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe, configs/nvidia-master.yaml). Topology: PREFILL_NODES=4, DECODE_NODES=2, PREFILL_WORKERS=2.

submit.sh sbatches 6 nodes. job.slurm collects their IPs in rank order into ALL_IPS = [P0,P1,P2,P3,D0,D1].

On the decode leader, server.sh runs the coordinator Python block:

prefill_ips = all_ips[:4] = [P0, P1, P2, P3]

decode_ips = all_ips[4:6] = [D0, D1]

add_role('prefill', prefill_ips, 8200) iterates 4 times → emits prefill-0..prefill-3.

add_role('decode', decode_ips, 8000) iterates 2 times → emits decode-0..decode-1.

Final endpoints.yaml has 6 entries — not the 2 (prefill-0 + decode-0) the README example shows.

For the 1P+1D mid-curve key it emits 1 prefill + 2 decode entries (still per-node, not per-leader).

Why this is intentional (i.e. the code is right, docs are stale)

Several signals confirm per-node emission is the deliberate design, not a bug:

server.sh:289-291 comment explicitly justifies it: "Each decode node runs its own sidecar (SIDECAR_PORT -> local decode vLLM), and endpoints.yaml lists one decode endpoint per node so EPP fans out across all decode ranks."

server.sh:519-535 (the decode coordinator's cross-node prefill health poll) iterates over _prefill_ips=( "${_ALL_IPS[@]:0:${PREFILL_NODES}}" ) — polling every prefill node, not just the leader. That poll is only meaningful if all prefill nodes are individually addressable endpoints.

--api-server-count is set on every EP-enabled node (server.sh:236-240), plus --data-parallel-hybrid-lb — meaning each node has its own api-server binding, so per-node endpoints are actually the correct interpretation of hybrid-lb.

PREFILL_WORKERS=2 in the high-tpt key requires per-node emission: with 2 independent DP engines across 4 prefill nodes, per-leader emission would only expose one of the two engines.

Impact

Zero runtime impact — code is self-consistent and produces the intended routing behavior. But a future maintainer reading the README's "Why one entry per leader (not per node)" section will:

Believe the current behavior is per-leader when it isn't.

Reference the warning about "EPP would bypass internal load balancing" and (incorrectly) conclude the current implementation is broken.

Not realize the "future work" paragraph at lines 91-95 has already been implemented.

Suggested fix

Update benchmarks/multi_node/llm-d/README.md:

Line 55: change "one entry per leader" → "one entry per node".

Lines 57-69: expand the example YAML to show all prefill and decode nodes (e.g. prefill-0..prefill-N-1, decode-0..decode-M-1).

Lines 80-89: rewrite the "Why one entry per leader" section to explain why per-node is used (hybrid-lb + independent api-servers + PREFILL_WORKERS multi-engine).

Lines 91-95: delete or rewrite the "future work" paragraph since per-node emission is what the code already does. The leader-only mode could be described here as the hypothetical alternative if desired.

Fixed in d7e4180 - README now documents one endpoint per node (not per leader) to match server.sh add_role(), with corrected example ports (prefill 8200 / decode 8000) and a rewritten rationale section.

claude · 2026-07-04T10:28:02Z

+# Build source: benchmarks/llm-d/Dockerfile.
+dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe:
+  image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23


🟡 Both new keys (mid-curve line 9080, high-tpt line 9125) pin image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23, a personal ghcr namespace rather than the SemiAnalysisAI org registry. The PR description's "Before merge" checklist and the in-file comment at 9075-9077 both flag this — surfacing here as a merge-time reminder so it isn't missed. Republish to ghcr.io/semianalysisai/... and update both image: lines before merge.

Extended reasoning...

What the bug is

Both new master-config keys added by this PR pin their engine image to a personal ghcr namespace:

configs/nvidia-master.yaml:9080 (dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe): image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23

configs/nvidia-master.yaml:9125 (dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe): image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23

The PR author has explicitly flagged this in two places already: the PR description's "Known merge-blocker" section ("The image: on both keys is a personal build... It should be republished to the org registry and the image: lines updated before merge") and an in-file comment at lines 9075-9077 ("PR follow-up (merge-blocking): the image: on both llm-d-vllm keys below (mid-curve + high-tpt) is a personal build. Republish it to the org registry (ghcr.io/semianalysisai/...) and update these image: lines before merge").

Why it matters

Reproducibility. On the GB200 pyxis path, runners/launch_gb200-nv.sh derives SQUASH_FILE="${SQUASH_DIR}/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" — so the enroot-import cache key is a direct function of the image string. Any historical rerun (e.g., re-ingesting a failed sweep, reproducing a curve point months later) resolves the image via that exact URL. A personal ghcr namespace can be retagged, deleted, or made private without notice by the individual owner, silently breaking reruns.

Step-by-step proof

configs/nvidia-master.yaml:9080 sets image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23 on the mid-curve key.

configs/nvidia-master.yaml:9125 sets the same image on the high-tpt key.

runners/launch_gb200-nv.sh (the llm-d-vllm branch, top of file) reads $IMAGE and constructs ENROOT_URL="docker://ghcr.io#ezrasilvera/llm-d-nokube-vllm:vllm0.23" for enroot import.

enroot import authenticates and pulls against ezrasilvera's personal ghcr namespace on every fresh cluster / cache miss.

If ezrasilvera retags/deletes/private-marks this tag, step 4 fails; the sweep can no longer be reproduced.

Addressing the refutation

One verifier argued this duplicates the author's own annotations (PR description merge-blocker + in-file comment) and should be dropped as "no new information." That's a fair point, and it's why I'm filing this as nit rather than normal — I'm not claiming the author is unaware. But surfacing it in the code-review pass provides a merge-time reminder against the actual diff (as opposed to a checklist item in the description that can be overlooked when the reviewer clicks Merge). If this shows up alongside other findings in the same review, it's cheap to satisfy; if the maintainer knows the image swap is already in flight, the reminder can be dismissed in a second.

Fix

Republish the combined image (built from benchmarks/llm-d/Dockerfile) to ghcr.io/semianalysisai/ and update both image: lines. The enroot cache key changes with the image string, so this is a valid config change to land in this PR rather than a follow-up — landing it here keeps the initial ingest and all historical reruns pinned to the org-owned image.

claude · 2026-07-04T10:28:04Z

+    - "Add DeepSeek-V4-Pro FP4 GB200 llm-d-vllm disaggregated multinode benchmarks"
+    - "Mid-curve 1P+1D DEP8 (conc 256/512/1024)"
+    - "High-tpt 2P+1D DEP8 (conc 4096)"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX


🟡 The new perf-changelog entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (literal XXX placeholder at perf-changelog.yaml:4444). Every other entry in this file resolves to a real PR number — this one alone will 404 if merged as-is. Fix by replacing XXX with 2050 before merge. (The author already listed this under 'Before merge' in the PR description, so filing as a friendly nit rather than a blocker.)

Extended reasoning...

What the bug is. The perf-changelog entry appended by this PR (perf-changelog.yaml:4441-4444) carries a literal pull/XXX placeholder rather than the real PR number:

- config-keys: - dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe - dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe description: - "Add DeepSeek-V4-Pro FP4 GB200 llm-d-vllm disaggregated multinode benchmarks" - "Mid-curve 1P+1D DEP8 (conc 256/512/1024)" - "High-tpt 2P+1D DEP8 (conc 4096)" pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

Why the current diff is inconsistent. Every other entry in perf-changelog.yaml resolves to a real PR (the immediately preceding entry ends with pull/2001; earlier entries reference pull/1990, pull/1978, etc.). The pr-link contract of this file is to trace which PR introduced which benchmark set — a literal XXX breaks that contract.

Impact if merged as-is. Clicking the link in the changelog will hit https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which resolves to no PR (GitHub returns 404 for numeric paths that don't exist, or worse silently redirects). No functionality is affected — the sweep still runs, the config keys still work — but the diagnostic "which PR introduced this row?" trail is broken for the two new dsv4-fp4-gb200-llm-d-vllm keys.

Why existing checks don't catch it. There is no linter validating perf-changelog pr-link values against real PR numbers, so nothing stops the placeholder from shipping. The PR description's 'Before merge' checklist explicitly lists this: 'Canonicalize perf-changelog.yaml pr-link from pull/XXX to this PR's number' — the author has already flagged it as a known merge-blocker to canonicalize.

Step-by-step proof.

Open perf-changelog.yaml at line 4444 → observe pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.

grep -c 'pull/XXX' perf-changelog.yaml → 1 (only this new entry).

grep -oE 'pull/[0-9]+' perf-changelog.yaml | tail -3 → all resolve to real numeric PRs (2001, 1990, 1978).

Attempt https://github.com/SemiAnalysisAI/InferenceX/pull/XXX in a browser → GitHub 404.

Expected value: pull/2050 (this PR's number, per the PR metadata).

Fix. One-character-scope edit: replace XXX with 2050 on perf-changelog.yaml:4444. No other files need to change.

Fixed in d7e4180 - pr-link canonicalized from pull/XXX to pull/2050.

- llm-d/README.md: document endpoints.yaml as one entry per node (not per leader) to match server.sh add_role(); fix example ports (prefill 8200, decode 8000) and rewrite the rationale section (C2) - perf-changelog.yaml: canonicalize pr-link XXX -> 2050 (C4) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

github-actions · 2026-07-04T10:53:34Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28702992096
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28702992096

github-actions · 2026-07-04T11:55:36Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28703886323
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28703886323

…ent (C1) Bump the EPP and P/D routing-sidecar source images from the moving -dev:main tags to release :v0.9.0 in both binaries.env and the combined-image Dockerfile, so a rebuild is reproducible. Verified both tags ship linux/arm64 and still expose /app/epp and /app/pd-sidecar (statically-linked aarch64). Envoy pin unchanged. Also reword the binaries.env header (PR #2050 review C1): the old comment claimed the file is 'not used', but job.slurm still sources it and conditionally bind-mounts the binaries via LLMD_BIN_DIR. Clarify it is an optional mounted-binary override, a no-op by default, and fix the 'build an use' -> 'build and use' typo. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

github-actions · 2026-07-04T14:18:42Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28707148515
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28707148515

Add dsv4-fp4-gb200-llm-d-vllm-max-tpt-megamoe: 3 prefill (DEP8) + 1 decode (DEP8) at conc=4096, mirroring NVIDIA/srt-slurm disagg-gb200-max-tpt-megamoe (engine-identical to high-tpt; only prefill scale-out 2->3), so it reuses the same CONFIG_FILE. Validated standalone via e2e run 28711505502 at 8797 tok/s/GPU (median TTFT 58 ms) - above high-tpt's 8241 tok/s/GPU, since the extra prefill instance relieves the conc=4096 prefill bottleneck. Adds a matching perf-changelog entry under PR #2050. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

github-actions · 2026-07-04T19:32:38Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28714892207
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28714892207

Add a new llm-d-vllm disaggregated (prefill/decode) multi-node benchmark path to InferenceX, orchestrated entirely by SLURM (no Kubernetes). vLLM engines with llm-d EPP routing, an Envoy front-end, and NIXL KV transfer. - benchmarks/llm-d/: container image (Dockerfile on vLLM v0.23 base), Envoy + EPP config, binaries.env, and an extraction helper that decouples epp/sidecar/envoy from the vLLM image. - benchmarks/multi_node/llm-d/: server.sh orchestrator (per-role TP/EP, prefill/decode coordinators, PREFILL_WORKERS multi-engine prefill), job.slurm (pyxis/enroot + docker paths, per-node model probe), submit.sh. - runners/launch_gb200-nv.sh: dispatch branch for FRAMEWORK=llm-d-vllm. - benchmarks/benchmark_lib.sh: anchor a relative lm_eval task yaml to the repo root so evals resolve inside the serving container. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

Route dsv4 models through the vLLM tokenizer wrapper and allow deepseek_v4 as a --tokenizer-mode choice, bypassing HF AutoTokenizer which does not recognize the model on the benchmark image. Adds --trust-remote-code / --use-chat-template / --dsv4 passthroughs. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

- benchmarks/multi_node/dsv4_fp4_gb200_llm-d-vllm.sh: GB200 wrapper (GPUS_PER_NODE=4) for the DSV4-Pro FP4 llm-d-vllm path. - llm-d-recipes/dsv4-fp4-gb200-mid-curve-megamoe.yaml: 1P+1D DEP8 recipe. - nvidia-master.yaml: register the mid-curve-megamoe (conc 256/512/1024) and high-tpt-megamoe (Option-B 2P+1D, conc 4096) config keys. - perf-changelog.yaml: changelog entry for the two keys. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…diagnostics The decode leader's prefill-readiness curl had no connect/max timeout, so a blackholed prefill endpoint made a single curl block forever - the 5-min deadline is only checked between calls, so it never fired. A 2P (PREFILL_WORKERS=2) run wedged ~7h on this before being cancelled. - Add --connect-timeout 5 --max-time 10 so an unreachable endpoint trips the 5-min deadline and FATALs cleanly instead of hanging the whole run. - Probe every prefill node (ALL_IPS[0:PREFILL_NODES]), not just IPS[0]: endpoints.yaml lists one prefill endpoint per node, and PREFILL_WORKERS>1 spans multiple independent DP engines, so IPS[0] only covered engine 1. - On failure, dump network diagnostics - source IP toward target, /proc/net/route on-link-vs-gateway subnet classification, bash /dev/tcp L4 connect test, ping, verbose curl - to separate a server-not-ready problem from a subnet/routing/firewall problem. Log the subnet layout up front too. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

Two teardown fixes so a failed/hung benchmark can no longer wedge the whole SLURM allocation (a conc4096 2P run hung 8h until GitHub killed it at the max-execution limit, losing the server-log tarball): - server.sh coordinator: write the .bench_done release marker from an EXIT trap, so job.slurm scancels the allocation on ANY coordinator exit (failed bench, set -e abort, EPP/Envoy bring-up error), not only a clean finish. - benchmark_lib.sh: optional `timeout` wrapper around the bench client, gated on BENCH_TIMEOUT_S (default 2400s, set by server.sh); a hung client (asyncio shutdown wedged on thousands of failed in-flight requests) is killed instead of pinning 24 GPUs to TIME_LIMIT. The bench call is also made non-fatal so a multi-conc sweep records every point it can and always reaches teardown. Replace the leader-only .prom scraper with metrics_scrape.py, which scrapes EVERY prefill and decode vLLM (on the vLLM port, not the sidecar) plus EPP into one time-correlated CSV. This makes per-engine load visible - crucially whether EPP fans out across all prefill engines - which the old PREFILL_LEADER_IP-only scraper could not show for PREFILL_WORKERS>1. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

dsv4-fp4-gb200-llm-d-vllm-smoke-2p1d: two independent 1-node DEP4 prefill engines (PREFILL_NODES=2 PREFILL_WORKERS=2) + 1 DEP4 decode = 3 nodes / 12 GPUs, ISL/OSL 1024/128, single conc 128 x ~512 prompts, RANDOM_RANGE_RATIO=1.0. Runs in minutes; validates the per-engine metrics scraper and confirms EPP fans out across both prefill engines before spending the 6-node high-tpt run. Reuses the mid-curve recipe. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

The 3-node DEP4 smoke (1 node/engine) OOMs at model load: DSV4-Pro's MegaMoE needs each EP rank to hold 1/4 of the experts at EP4 (~2x the per-GPU weight of the validated EP8), which does not fit in HBM (run 28588980717: "CUDA Error: out of memory at cumem_allocator.cpp:139"). DEP8 (2 nodes/engine) is the minimum viable EP. Smoke is now PREFILL_NODES=4 PREFILL_WORKERS=2 (2x DEP8) + DECODE_NODES=2 = 6 nodes, same layout as high-tpt but a tiny workload (conc 128), so it validates the scraper + EPP fan-out fast on the real topology. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

LLMD_LB_MODE (recipe env, default hybrid). multi-port switches both roles to --data-parallel-multi-port-external-lb + --data-parallel-supervisor-port and drops --api-server-count, so each DP rank serves on VLLM_PORT+rank (base+global rank; a worker node binds VLLM_PORT+START_RANK..). The decode pd-sidecar gets --data-parallel-size=$DP_SIZE, which per data_parallel.go makes it expose SIDECAR_PORT+i -> vLLM VLLM_PORT+i; per node only its local rank ports work, so endpoints.yaml lists each node's local rank ports (node j: base + j*GPUS_PER_NODE .. +GPUS_PER_NODE). Health waits + the metrics scraper follow the same per-rank ports (scraper hits the vLLM port, never the sidecar). Mirrors the upstream GENERIC wide-ep-lws guide (DeepSeek-R1), which uses per-rank ports for P/D; our DSV4 path was hybrid-lb (one port/node). Recipe dsv4-fp4-gb200-megamoe-mp.yaml = mid-curve engine config + LLMD_LB_MODE multi-port + EPP scorers matched to the upstream router config (drop kv-cache-utilization-scorer; prefill = prefix-cache/queue/active-request, decode = active-request only). Keys smoke-2p1d-mp (fast validation) and high-tpt-megamoe-mp (conc4096 A/B vs the hybrid high-tpt key) select it. Hybrid path is unchanged (default), so existing keys are byte-identical. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

mp-smoke 28596473295 showed every prefill node (leader AND worker) binds VLLM_PORT+local_index (8200-8203), not base+global rank: --data-parallel-start-rank only sets DP rank IDs, the HTTP servers are always the node-local range. My base+global assumption made the health gate probe a dead worker port (10.30.1.39:8204 not ready). Fix everywhere to base+local: - HEALTH_PORT / SIDECAR_HEALTH_PORT = VLLM_PORT / SIDECAR_PORT (local rank-0). - sidecar --data-parallel-size = DP_SIZE_LOCAL (each node exposes its 4 local ranks at SIDECAR_PORT..+3 -> vLLM VLLM_PORT..+3). - endpoints + scraper: each node lists base..base+GPUS_PER_NODE-1 (same range on every node, distinct address). Also fix a double 'export export LLMD_LB_MODE' from overlapping edits. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

… bundle At conc4096 both Dynamo (8153 tok/s/GPU) and upstream llm-d PR #1737 (8858) succeed on the IDENTICAL 2P TP1/DP8/EP8 layout we use, while ours collapses to 8.3% completion. The difference is the engine bundle, so this recipe applies it: prefill gpu-mem 0.95, max-model-len 9280, max-num-seqs 16, max-num-batched-tokens 32768, no-enable-flashinfer-autotune; decode gpu-mem 0.9, max-model-len 9280, no-flashinfer-autotune. Same EPP/env/topology + kv producer-consumer + hybrid-lb as mid-curve, to isolate the engine bundle. The max-model-len/no-autotune/ max-num-seqs guards are what let this 32768/0.95 bundle bring up without the two OOMs our earlier speed-bundle hit. Keys tuned-smoke + tuned-high-tpt select it. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

The no-kube Envoy front door (benchmarks/llm-d/envoy.yaml) defined the epp ext_proc cluster and the original_dst cluster with no circuit_breakers block, so Envoy applied its DEFAULT per-cluster thresholds of max_requests=1024 and max_pending_requests=1024. The ext_proc filter holds one long-lived HTTP/2 stream to EPP for the entire lifetime of every request, so at max-concurrency 4096 only 1024 active + 1024 pending = 2048 requests ever reached EPP; the remaining ~18432 (warmups + prompts) got a fast 500 and never saw the engine. This capped every high-tpt run at exactly 1024 completions / 2048 disagg decisions, identically across engine configs (mid-curve, Dynamo-tuned) and LB modes (hybrid, multi-port) - because the cap was the front door, not the engine. It is why Dynamo (no Envoy) and upstream llm-d (k8s gateway) hit 8000+ tok/s/GPU on the same TP1/DP8/EP8 2P1D layout where we stalled at ~1783. Fix: add circuit_breakers with thresholds far above max-concurrency to both clusters. Evidence: envoy_access RESPONSE_CODE histogram = 2048x200 + 18432x500 in all three walled runs; disagg_decision_total pinned at 2048 in all three. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…0000 Align the circuit_breakers thresholds added in the previous commit to the exact values from the authoritative upstream preset - the llm-d-router-standalone Helm chart (llm-d-inference-scheduler config/charts/llm-d-router-standalone/values.yaml lines 191-208): max_connections/max_pending_requests/max_requests = 40000 on both the ext_proc (EPP) and original_dst clusters, plus max_retries 1024 on ext_proc. Both clear max-concurrency 4096 with wide headroom; 40000 just matches upstream rather than an arbitrary ceiling. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

… upstream) The prior CB commit lifted the epp ext_proc cluster to 40000 but left the original_dst cluster at 1000000. Both clear max-concurrency 4096 identically (this is the exact config the 8452 tok/s/GPU high-tpt confirmation run used), but pin original_dst to 40000 too so both clusters match each other and the upstream llm-d-router-standalone preset. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…util + decode prefix-cache) Align the mid-curve/high-tpt recipe's EPP scheduling profiles to the upstream generic wide-ep-lws router config (llm-d guides/wide-ep-lws/router/wide-ep-lws.values.yaml): prefill = prefix-cache(3) + queue(2) + active-request(2) decode = active-request only Removed: kv-cache-utilization-scorer from prefill (upstream does not use it; its weight becomes active-request(2)); and prefix-cache-scorer from decode - at weight 3 it outweighed active-request(2) and pinned decode to one pod on a shared prompt prefix, producing the ~2:1 decode routing imbalance (decode-0 1389 vs decode-1 659). Upstream decode routes on active-request alone and balances. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…keys Pre-PR cleanup of debug/experiment scaffolding added while root-causing the conc4096 wall (now fixed via the Envoy circuit-breaker change). Removed: - benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-megamoe-mp.yaml (multi-port LB experiment - dead: NIXL KV-expiry) - benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-megamoe-tuned.yaml (Dynamo engine-bundle A/B - superseded; the circuit-breaker fix, not engine args, was the wall) - nvidia-master.yaml keys: smoke-2p1d, smoke-2p1d-mp, high-tpt-megamoe-mp, tuned-smoke, tuned-high-tpt (experiment/smoke keys tied to the above) Kept: mid-curve-megamoe + high-tpt-megamoe keys, the envoy circuit_breakers fix, the teardown/timeout fix, and the DSV4 tokenizer plumbing. Deferred to a follow-up (after the mid-curve rerun validates and its scraper CSV is analysed): removing the per-engine vLLM metrics scraper and the multi-port branches in server.sh. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…pper - Remove metrics_scrape.py + its scraper bring-up/stop/conc-tag in server.sh - Remove dead multi-port LB path (LLMD_LB_MODE and all branches); hybrid only - Remove EPP --v=4 and NCCL/UCX transport debug logging (keep UCX_TLS pin) - Rename dsv4_fp4_gb200_llm-d-vllm.sh -> ...-disagg.sh; update launcher SCRIPT_NAME Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

Comment-only cleanup across the llm-d-vllm GB200 assets: - recipe header rewritten to describe the common per-engine shape and the two topologies (mid-curve 1P1D / high-tpt 2P1D) it backs - drop historical tuning-log comments from the recipe, launcher, envoy.yaml, binaries.env, and Dockerfile Verified: no non-comment lines changed; envoy circuit_breakers (40000) intact. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

Include dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe (2P1D DEP8, conc 4096) alongside the mid-curve key so a labeled PR sweep validates both topologies, not just mid-curve. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

- llm-d/README.md: document endpoints.yaml as one entry per node (not per leader) to match server.sh add_role(); fix example ports (prefill 8200, decode 8000) and rewrite the rationale section (C2) - perf-changelog.yaml: canonicalize pr-link XXX -> 2050 (C4) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…ent (C1) Bump the EPP and P/D routing-sidecar source images from the moving -dev:main tags to release :v0.9.0 in both binaries.env and the combined-image Dockerfile, so a rebuild is reproducible. Verified both tags ship linux/arm64 and still expose /app/epp and /app/pd-sidecar (statically-linked aarch64). Envoy pin unchanged. Also reword the binaries.env header (PR #2050 review C1): the old comment claimed the file is 'not used', but job.slurm still sources it and conditionally bind-mounts the binaries via LLMD_BIN_DIR. Clarify it is an optional mounted-binary override, a no-op by default, and fix the 'build an use' -> 'build and use' typo. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…ild-arg BuildKit rejects `COPY --from=${VAR}` with 'variable expansion is not supported for --from'. Promote VLLM_BASE and the three source-image ARGs (EPP/sidecar/envoy) to global scope, bind each to a named FROM stage (epp_src / sidecar_src / envoy_src), and COPY --from those stage aliases. No change to defaults or --build-arg override behavior; still in sync with binaries.env. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

…sses The eval-only jobs failed 'eval metadata concurrency does not match workflow request' even though gsm8k accuracy passed (0.95 >= 0.91). Cause: run_eval/append_lm_eval_summary read EVAL_CONCURRENT_REQUESTS and CONC (benchmark_lib.sh), never EVAL_CONC, so with CONC empty meta_env.json recorded conc=1 while validate_scores.py --expected-concs checked 4096. Mirror the AMD multi-node servers: before run_eval, set EVAL_CONCURRENT_REQUESTS from EVAL_CONC (fallback: max of the x-delimited BENCH_MAX_CONCURRENCY) and export CONC so the stamped concurrency matches the workflow request. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

Add dsv4-fp4-gb200-llm-d-vllm-max-tpt-megamoe: 3 prefill (DEP8) + 1 decode (DEP8) at conc=4096, mirroring NVIDIA/srt-slurm disagg-gb200-max-tpt-megamoe (engine-identical to high-tpt; only prefill scale-out 2->3), so it reuses the same CONFIG_FILE. Validated standalone via e2e run 28711505502 at 8797 tok/s/GPU (median TTFT 58 ms) - above high-tpt's 8241 tok/s/GPU, since the extra prefill instance relieves the conc=4096 prefill bottleneck. Adds a matching perf-changelog entry under PR #2050. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

Move the max-tpt config-key to the top of the changelog entry so it dispatches first in the serialized multi-node matrix. The prior sweep (28714892207) had max-tpt land last; a transient decode-node slowdown degraded that point (4.5x TPOT vs high-tpt on an identical decode config). Running it first surfaces its result early and re-tests the transient before the shorter mid/high legs finish. Ordering only - all three points still run and land on the graph. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

github-actions · 2026-07-05T13:51:44Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28742470098
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28742470098

ezrasilvera requested a review from a team July 4, 2026 10:13

ezrasilvera requested review from Ankur-singh, jgangani and kedarpotdar-nv as code owners July 4, 2026 10:13

ezrasilvera added the full-sweep-enabled label Jul 4, 2026

github-project-automation Bot added this to InferenceMAX Board Jul 4, 2026

claude Bot reviewed Jul 4, 2026

View reviewed changes

ezrasilvera added full-sweep-enabled and removed full-sweep-enabled labels Jul 5, 2026

ezrasilvera added 7 commits July 5, 2026 16:16

ezrasilvera added 17 commits July 5, 2026 16:29

[NV] llm-d: add high-tpt key to perf-changelog sweep entry

c59e3a4

Include dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe (2P1D DEP8, conc 4096) alongside the mid-curve key so a labeled PR sweep validates both topologies, not just mid-curve. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>

ezrasilvera force-pushed the llm-d-initial branch from 3bd0b1d to 67be9f4 Compare July 5, 2026 13:31

Uh oh!

Conversation

ezrasilvera commented Jul 4, 2026

Summary

Notable

Scope

Known merge-blocker

Related

Before merge

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

claude Bot Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 4, 2026

Choose a reason for hiding this comment

What the bug is

Step-by-step proof

Why this is intentional (i.e. the code is right, docs are stale)

Impact

Suggested fix

Uh oh!

ezrasilvera Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

ezrasilvera Jul 4, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

github-actions Bot commented Jul 4, 2026

Uh oh!

github-actions Bot commented Jul 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant