[NV] llm-d-vllm: Add llm-d to the InferenceX benchmarking framework#2050
[NV] llm-d-vllm: Add llm-d to the InferenceX benchmarking framework#2050ezrasilvera wants to merge 24 commits into
Conversation
|
Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase For PR verification, add the PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs 感谢你的贡献!请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md,然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人,你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板,包括保留英文语句 如需进行 PR 验证,请为此 PR 添加 PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档 |
2 similar comments
|
Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase For PR verification, add the PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs 感谢你的贡献!请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md,然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人,你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板,包括保留英文语句 如需进行 PR 验证,请为此 PR 添加 PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档 |
|
Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase For PR verification, add the PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs 感谢你的贡献!请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md,然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人,你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板,包括保留英文语句 如需进行 PR 验证,请为此 PR 添加 PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档 |
| # | ||
| # This is currently not used as we build an use an image that contains | ||
| # all binaries |
There was a problem hiding this comment.
🟡 The comment on lines 10-12 says "This is currently not used as we build an use an image that contains all binaries," but benchmarks/multi_node/llm-d/job.slurm (lines 236-241 of the pyxis branch) explicitly sources this file and uses LLMD_BIN_DIR to conditionally bind-mount epp/pd-sidecar/envoy — so the file IS wired into the runtime as an optional override. Also a typo: "build an use" → "build and use". Nit — no runtime impact, but the stale comment will mislead maintainers into thinking they can delete the file.
Extended reasoning...
What the comment claims vs. what the code does
Lines 10-12 of benchmarks/llm-d/binaries.env state:
This is currently not used as we build an use an image that contains all binaries
However, the pyxis branch of benchmarks/multi_node/llm-d/job.slurm explicitly sources this file and consumes LLMD_BIN_DIR to build optional bind-mounts:
[[ -f "${DI_REPO_DIR}/benchmarks/llm-d/binaries.env" ]] && \
source "${DI_REPO_DIR}/benchmarks/llm-d/binaries.env"
for _bin in epp pd-sidecar envoy; do
if [[ -n "${LLMD_BIN_DIR:-}" && -x "${LLMD_BIN_DIR}/${_bin}" ]]; then
PYXIS_MOUNTS+=",${LLMD_BIN_DIR}/${_bin}:/usr/local/bin/${_bin}:ro"
...
fi
doneSo binaries.env is wired into the runtime — it acts as the single source of truth for LLMD_BIN_DIR (defaulted to /mnt/lustre01/users-public/sa-shared/llm-d-bins), which the pyxis path uses to swap in mounted binaries and skip a combined-image rebuild.
Step-by-step proof
- Author enables the pyxis path (the default on GB200 via
LLMD_CONTAINER_ENGINE=pyxisset inrunners/launch_gb200-nv.sh). job.slurmreaches the pyxis branch and sourcesbinaries.env, populatingLLMD_BIN_DIR.- If a user has run
extract-binaries.sh(which also sources this file), the three binaries exist under$LLMD_BIN_DIRand get bind-mounted into every container. - This is the entire point of the mounted-binary path documented in
extract-binaries.shand the Dockerfile — running a stockvllm/vllm-openaiimage without rebuilding the combined image.
The comment therefore contradicts both the sibling files (extract-binaries.sh, Dockerfile, envoy.yaml docstrings) that describe binaries.env as the single source of truth, and job.slurm itself.
Addressing the counter-argument
One could argue the comment is "arguably accurate for the DEFAULT deployment path" — in the current recipe the combined image bakes binaries in, so the -x check makes the mount loop a no-op unless the operator has explicitly run extract-binaries.sh. That's a fair characterization of when the file has effect, but "currently not used" is stronger than "currently a no-op by default" and it flatly contradicts the sourcing on job.slurm:236. A future maintainer trying to trim "unused" scaffolding would delete this file and quietly break the mounted-binary path.
Impact and fix
No runtime impact — the code works today. The concern is purely documentation drift in a new file this PR is introducing. Fix in one edit:
# This file is sourced by benchmarks/multi_node/llm-d/job.slurm to
# optionally bind-mount extracted binaries in place of the combined
# image's baked-in copies (see extract-binaries.sh). It is a no-op
# by default (the -x check fails when the binaries have not been
# extracted); the Dockerfile also reads the same URLs when building
# and using an image that contains all binaries.
Also fix the "build an use" → "build and use" typo while you're in there.
| ### Why one entry per *leader* (not per node) | ||
|
|
||
| In the wide-EP guide each instance is a single vLLM engine that spans | ||
| multiple nodes via `--data-parallel-hybrid-lb`. With hybrid-lb, the | ||
| leader pod (`LWS_WORKER_INDEX=0`) accepts external traffic and | ||
| distributes it internally across the local DP ranks; in our LWS-free | ||
| SLURM mapping, the prefill-leader and decode-leader are the only nodes | ||
| addressable from outside. Adding an entry per worker would cause EPP to | ||
| route directly to a worker, bypassing the engine's internal load | ||
| balancing. | ||
|
|
||
| If we later want to expose all pods of an instance (the alternative | ||
| hybrid-lb interpretation: external LB across nodes too), we can extend | ||
| the loop in `server.sh` to emit one entry per `IPS[i]` in the prefill | ||
| range and one per `IPS[i]` in the decode range, all carrying the same | ||
| role label. EPP then load-balances across them via `random-picker`. |
There was a problem hiding this comment.
🟡 The README section "Why one entry per leader (not per node)" (lines 80-89) plus the example endpoints.yaml at lines 57-69 both document per-leader emission, but server.sh add_role() (lines 334-343) actually emits ONE endpoint per node — its own comment says "ONE endpoint per node at base_port (hybrid LB: ...)". The lines 91-95 "future work" section describing per-node emission as hypothetical has already been implemented. Docs-only staleness (per-node is the intended behavior — see server.sh:289-291 and its consistency with the multi-node coordinator health poll in lines 519-535), but future maintainers may misdiagnose whether current behavior is intentional. Please update the README to reflect actual per-node emission, or add a note explaining why the leader-only text was preserved.
Extended reasoning...
What the bug is
The new file benchmarks/multi_node/llm-d/README.md (added by this PR) describes the endpoints.yaml file-discovery contract as emitting one entry per leader. Concretely:
- Line 55: "…writes
/tmp/endpoints.yamlinside the container with one entry per leader" - Lines 57-69: the example YAML shows only
prefill-0(PREFILL_LEADER_IP) anddecode-0(DECODE_LEADER_IP) - Lines 80-89: an entire section titled "Why one entry per leader (not per node)" justifies leader-only emission and warns: "Adding an entry per worker would cause EPP to route directly to a worker, bypassing the engine's internal load balancing"
- Lines 91-95: describes per-node emission as a hypothetical future extension ("we can extend the loop … to emit one entry per IPS[i]")
But the code at benchmarks/multi_node/llm-d/server.sh:334-348 (the add_role() block that writes endpoints.yaml) does exactly the opposite:
def add_role(role, ips, base_port):
# ONE endpoint per node at base_port (hybrid LB: the node's api-server /
# sidecar internally load-balances its local DP ranks).
for i, ip in enumerate(ips):
endpoints.append({'name': f'{role}-{i}', ...})
add_role('prefill', prefill_ips, VLLM_PORT) # prefill_ips = all_ips[:pn]
add_role('decode', decode_ips, SIDECAR_PORT) # decode_ips = all_ips[pn:pn + dn]The inline comment at server.sh:339-340 literally says the opposite of the README section header: "ONE endpoint per node at base_port".
Step-by-step proof
Take the new high-tpt 2P+1D key (dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe, configs/nvidia-master.yaml). Topology: PREFILL_NODES=4, DECODE_NODES=2, PREFILL_WORKERS=2.
submit.shsbatches 6 nodes.job.slurmcollects their IPs in rank order intoALL_IPS = [P0,P1,P2,P3,D0,D1].- On the decode leader,
server.shruns the coordinator Python block:prefill_ips = all_ips[:4] = [P0, P1, P2, P3]decode_ips = all_ips[4:6] = [D0, D1]
add_role('prefill', prefill_ips, 8200)iterates 4 times → emitsprefill-0..prefill-3.add_role('decode', decode_ips, 8000)iterates 2 times → emitsdecode-0..decode-1.- Final endpoints.yaml has 6 entries — not the 2 (prefill-0 + decode-0) the README example shows.
For the 1P+1D mid-curve key it emits 1 prefill + 2 decode entries (still per-node, not per-leader).
Why this is intentional (i.e. the code is right, docs are stale)
Several signals confirm per-node emission is the deliberate design, not a bug:
- server.sh:289-291 comment explicitly justifies it: "Each decode node runs its own sidecar (SIDECAR_PORT -> local decode vLLM), and endpoints.yaml lists one decode endpoint per node so EPP fans out across all decode ranks."
- server.sh:519-535 (the decode coordinator's cross-node prefill health poll) iterates over
_prefill_ips=( "${_ALL_IPS[@]:0:${PREFILL_NODES}}" )— polling every prefill node, not just the leader. That poll is only meaningful if all prefill nodes are individually addressable endpoints. --api-server-countis set on every EP-enabled node (server.sh:236-240), plus--data-parallel-hybrid-lb— meaning each node has its own api-server binding, so per-node endpoints are actually the correct interpretation of hybrid-lb.- PREFILL_WORKERS=2 in the high-tpt key requires per-node emission: with 2 independent DP engines across 4 prefill nodes, per-leader emission would only expose one of the two engines.
Impact
Zero runtime impact — code is self-consistent and produces the intended routing behavior. But a future maintainer reading the README's "Why one entry per leader (not per node)" section will:
- Believe the current behavior is per-leader when it isn't.
- Reference the warning about "EPP would bypass internal load balancing" and (incorrectly) conclude the current implementation is broken.
- Not realize the "future work" paragraph at lines 91-95 has already been implemented.
Suggested fix
Update benchmarks/multi_node/llm-d/README.md:
- Line 55: change "one entry per leader" → "one entry per node".
- Lines 57-69: expand the example YAML to show all prefill and decode nodes (e.g.
prefill-0..prefill-N-1,decode-0..decode-M-1). - Lines 80-89: rewrite the "Why one entry per leader" section to explain why per-node is used (hybrid-lb + independent api-servers + PREFILL_WORKERS multi-engine).
- Lines 91-95: delete or rewrite the "future work" paragraph since per-node emission is what the code already does. The leader-only mode could be described here as the hypothetical alternative if desired.
There was a problem hiding this comment.
Fixed in d7e4180 - README now documents one endpoint per node (not per leader) to match server.sh add_role(), with corrected example ports (prefill 8200 / decode 8000) and a rewritten rationale section.
| # Build source: benchmarks/llm-d/Dockerfile. | ||
| dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe: | ||
| image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23 |
There was a problem hiding this comment.
🟡 Both new keys (mid-curve line 9080, high-tpt line 9125) pin image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23, a personal ghcr namespace rather than the SemiAnalysisAI org registry. The PR description's "Before merge" checklist and the in-file comment at 9075-9077 both flag this — surfacing here as a merge-time reminder so it isn't missed. Republish to ghcr.io/semianalysisai/... and update both image: lines before merge.
Extended reasoning...
What the bug is
Both new master-config keys added by this PR pin their engine image to a personal ghcr namespace:
configs/nvidia-master.yaml:9080(dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe):image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23configs/nvidia-master.yaml:9125(dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe):image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23
The PR author has explicitly flagged this in two places already: the PR description's "Known merge-blocker" section ("The image: on both keys is a personal build... It should be republished to the org registry and the image: lines updated before merge") and an in-file comment at lines 9075-9077 ("PR follow-up (merge-blocking): the image: on both llm-d-vllm keys below (mid-curve + high-tpt) is a personal build. Republish it to the org registry (ghcr.io/semianalysisai/...) and update these image: lines before merge").
Why it matters
Reproducibility. On the GB200 pyxis path, runners/launch_gb200-nv.sh derives SQUASH_FILE="${SQUASH_DIR}/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" — so the enroot-import cache key is a direct function of the image string. Any historical rerun (e.g., re-ingesting a failed sweep, reproducing a curve point months later) resolves the image via that exact URL. A personal ghcr namespace can be retagged, deleted, or made private without notice by the individual owner, silently breaking reruns.
Step-by-step proof
configs/nvidia-master.yaml:9080setsimage: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23on the mid-curve key.configs/nvidia-master.yaml:9125sets the same image on the high-tpt key.runners/launch_gb200-nv.sh(the llm-d-vllm branch, top of file) reads$IMAGEand constructsENROOT_URL="docker://ghcr.io#ezrasilvera/llm-d-nokube-vllm:vllm0.23"for enroot import.enroot importauthenticates and pulls against ezrasilvera's personal ghcr namespace on every fresh cluster / cache miss.- If ezrasilvera retags/deletes/private-marks this tag, step 4 fails; the sweep can no longer be reproduced.
Addressing the refutation
One verifier argued this duplicates the author's own annotations (PR description merge-blocker + in-file comment) and should be dropped as "no new information." That's a fair point, and it's why I'm filing this as nit rather than normal — I'm not claiming the author is unaware. But surfacing it in the code-review pass provides a merge-time reminder against the actual diff (as opposed to a checklist item in the description that can be overlooked when the reviewer clicks Merge). If this shows up alongside other findings in the same review, it's cheap to satisfy; if the maintainer knows the image swap is already in flight, the reminder can be dismissed in a second.
Fix
Republish the combined image (built from benchmarks/llm-d/Dockerfile) to ghcr.io/semianalysisai/ and update both image: lines. The enroot cache key changes with the image string, so this is a valid config change to land in this PR rather than a follow-up — landing it here keeps the initial ingest and all historical reruns pinned to the org-owned image.
| - "Add DeepSeek-V4-Pro FP4 GB200 llm-d-vllm disaggregated multinode benchmarks" | ||
| - "Mid-curve 1P+1D DEP8 (conc 256/512/1024)" | ||
| - "High-tpt 2P+1D DEP8 (conc 4096)" | ||
| pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX |
There was a problem hiding this comment.
🟡 The new perf-changelog entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (literal XXX placeholder at perf-changelog.yaml:4444). Every other entry in this file resolves to a real PR number — this one alone will 404 if merged as-is. Fix by replacing XXX with 2050 before merge. (The author already listed this under 'Before merge' in the PR description, so filing as a friendly nit rather than a blocker.)
Extended reasoning...
What the bug is. The perf-changelog entry appended by this PR (perf-changelog.yaml:4441-4444) carries a literal pull/XXX placeholder rather than the real PR number:
- config-keys:
- dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe
- dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe
description:
- "Add DeepSeek-V4-Pro FP4 GB200 llm-d-vllm disaggregated multinode benchmarks"
- "Mid-curve 1P+1D DEP8 (conc 256/512/1024)"
- "High-tpt 2P+1D DEP8 (conc 4096)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXWhy the current diff is inconsistent. Every other entry in perf-changelog.yaml resolves to a real PR (the immediately preceding entry ends with pull/2001; earlier entries reference pull/1990, pull/1978, etc.). The pr-link contract of this file is to trace which PR introduced which benchmark set — a literal XXX breaks that contract.
Impact if merged as-is. Clicking the link in the changelog will hit https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which resolves to no PR (GitHub returns 404 for numeric paths that don't exist, or worse silently redirects). No functionality is affected — the sweep still runs, the config keys still work — but the diagnostic "which PR introduced this row?" trail is broken for the two new dsv4-fp4-gb200-llm-d-vllm keys.
Why existing checks don't catch it. There is no linter validating perf-changelog pr-link values against real PR numbers, so nothing stops the placeholder from shipping. The PR description's 'Before merge' checklist explicitly lists this: 'Canonicalize perf-changelog.yaml pr-link from pull/XXX to this PR's number' — the author has already flagged it as a known merge-blocker to canonicalize.
Step-by-step proof.
- Open
perf-changelog.yamlat line 4444 → observepr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX. grep -c 'pull/XXX' perf-changelog.yaml→ 1 (only this new entry).grep -oE 'pull/[0-9]+' perf-changelog.yaml | tail -3→ all resolve to real numeric PRs (2001, 1990, 1978).- Attempt
https://github.com/SemiAnalysisAI/InferenceX/pull/XXXin a browser → GitHub 404. - Expected value:
pull/2050(this PR's number, per the PR metadata).
Fix. One-character-scope edit: replace XXX with 2050 on perf-changelog.yaml:4444. No other files need to change.
There was a problem hiding this comment.
Fixed in d7e4180 - pr-link canonicalized from pull/XXX to pull/2050.
- llm-d/README.md: document endpoints.yaml as one entry per node (not per leader) to match server.sh add_role(); fix example ports (prefill 8200, decode 8000) and rewrite the rationale section (C2) - perf-changelog.yaml: canonicalize pr-link XXX -> 2050 (C4) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28702992096 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28703886323 |
…ent (C1) Bump the EPP and P/D routing-sidecar source images from the moving -dev:main tags to release :v0.9.0 in both binaries.env and the combined-image Dockerfile, so a rebuild is reproducible. Verified both tags ship linux/arm64 and still expose /app/epp and /app/pd-sidecar (statically-linked aarch64). Envoy pin unchanged. Also reword the binaries.env header (PR #2050 review C1): the old comment claimed the file is 'not used', but job.slurm still sources it and conditionally bind-mounts the binaries via LLMD_BIN_DIR. Clarify it is an optional mounted-binary override, a no-op by default, and fix the 'build an use' -> 'build and use' typo. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28707148515 |
Add dsv4-fp4-gb200-llm-d-vllm-max-tpt-megamoe: 3 prefill (DEP8) + 1 decode (DEP8) at conc=4096, mirroring NVIDIA/srt-slurm disagg-gb200-max-tpt-megamoe (engine-identical to high-tpt; only prefill scale-out 2->3), so it reuses the same CONFIG_FILE. Validated standalone via e2e run 28711505502 at 8797 tok/s/GPU (median TTFT 58 ms) - above high-tpt's 8241 tok/s/GPU, since the extra prefill instance relieves the conc=4096 prefill bottleneck. Adds a matching perf-changelog entry under PR #2050. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28714892207 |
Add a new llm-d-vllm disaggregated (prefill/decode) multi-node benchmark path to InferenceX, orchestrated entirely by SLURM (no Kubernetes). vLLM engines with llm-d EPP routing, an Envoy front-end, and NIXL KV transfer. - benchmarks/llm-d/: container image (Dockerfile on vLLM v0.23 base), Envoy + EPP config, binaries.env, and an extraction helper that decouples epp/sidecar/envoy from the vLLM image. - benchmarks/multi_node/llm-d/: server.sh orchestrator (per-role TP/EP, prefill/decode coordinators, PREFILL_WORKERS multi-engine prefill), job.slurm (pyxis/enroot + docker paths, per-node model probe), submit.sh. - runners/launch_gb200-nv.sh: dispatch branch for FRAMEWORK=llm-d-vllm. - benchmarks/benchmark_lib.sh: anchor a relative lm_eval task yaml to the repo root so evals resolve inside the serving container. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Route dsv4 models through the vLLM tokenizer wrapper and allow deepseek_v4 as a --tokenizer-mode choice, bypassing HF AutoTokenizer which does not recognize the model on the benchmark image. Adds --trust-remote-code / --use-chat-template / --dsv4 passthroughs. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
- benchmarks/multi_node/dsv4_fp4_gb200_llm-d-vllm.sh: GB200 wrapper (GPUS_PER_NODE=4) for the DSV4-Pro FP4 llm-d-vllm path. - llm-d-recipes/dsv4-fp4-gb200-mid-curve-megamoe.yaml: 1P+1D DEP8 recipe. - nvidia-master.yaml: register the mid-curve-megamoe (conc 256/512/1024) and high-tpt-megamoe (Option-B 2P+1D, conc 4096) config keys. - perf-changelog.yaml: changelog entry for the two keys. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…diagnostics The decode leader's prefill-readiness curl had no connect/max timeout, so a blackholed prefill endpoint made a single curl block forever - the 5-min deadline is only checked between calls, so it never fired. A 2P (PREFILL_WORKERS=2) run wedged ~7h on this before being cancelled. - Add --connect-timeout 5 --max-time 10 so an unreachable endpoint trips the 5-min deadline and FATALs cleanly instead of hanging the whole run. - Probe every prefill node (ALL_IPS[0:PREFILL_NODES]), not just IPS[0]: endpoints.yaml lists one prefill endpoint per node, and PREFILL_WORKERS>1 spans multiple independent DP engines, so IPS[0] only covered engine 1. - On failure, dump network diagnostics - source IP toward target, /proc/net/route on-link-vs-gateway subnet classification, bash /dev/tcp L4 connect test, ping, verbose curl - to separate a server-not-ready problem from a subnet/routing/firewall problem. Log the subnet layout up front too. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Two teardown fixes so a failed/hung benchmark can no longer wedge the whole SLURM allocation (a conc4096 2P run hung 8h until GitHub killed it at the max-execution limit, losing the server-log tarball): - server.sh coordinator: write the .bench_done release marker from an EXIT trap, so job.slurm scancels the allocation on ANY coordinator exit (failed bench, set -e abort, EPP/Envoy bring-up error), not only a clean finish. - benchmark_lib.sh: optional `timeout` wrapper around the bench client, gated on BENCH_TIMEOUT_S (default 2400s, set by server.sh); a hung client (asyncio shutdown wedged on thousands of failed in-flight requests) is killed instead of pinning 24 GPUs to TIME_LIMIT. The bench call is also made non-fatal so a multi-conc sweep records every point it can and always reaches teardown. Replace the leader-only .prom scraper with metrics_scrape.py, which scrapes EVERY prefill and decode vLLM (on the vLLM port, not the sidecar) plus EPP into one time-correlated CSV. This makes per-engine load visible - crucially whether EPP fans out across all prefill engines - which the old PREFILL_LEADER_IP-only scraper could not show for PREFILL_WORKERS>1. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
dsv4-fp4-gb200-llm-d-vllm-smoke-2p1d: two independent 1-node DEP4 prefill engines (PREFILL_NODES=2 PREFILL_WORKERS=2) + 1 DEP4 decode = 3 nodes / 12 GPUs, ISL/OSL 1024/128, single conc 128 x ~512 prompts, RANDOM_RANGE_RATIO=1.0. Runs in minutes; validates the per-engine metrics scraper and confirms EPP fans out across both prefill engines before spending the 6-node high-tpt run. Reuses the mid-curve recipe. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
The 3-node DEP4 smoke (1 node/engine) OOMs at model load: DSV4-Pro's MegaMoE needs each EP rank to hold 1/4 of the experts at EP4 (~2x the per-GPU weight of the validated EP8), which does not fit in HBM (run 28588980717: "CUDA Error: out of memory at cumem_allocator.cpp:139"). DEP8 (2 nodes/engine) is the minimum viable EP. Smoke is now PREFILL_NODES=4 PREFILL_WORKERS=2 (2x DEP8) + DECODE_NODES=2 = 6 nodes, same layout as high-tpt but a tiny workload (conc 128), so it validates the scraper + EPP fan-out fast on the real topology. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
LLMD_LB_MODE (recipe env, default hybrid). multi-port switches both roles to --data-parallel-multi-port-external-lb + --data-parallel-supervisor-port and drops --api-server-count, so each DP rank serves on VLLM_PORT+rank (base+global rank; a worker node binds VLLM_PORT+START_RANK..). The decode pd-sidecar gets --data-parallel-size=$DP_SIZE, which per data_parallel.go makes it expose SIDECAR_PORT+i -> vLLM VLLM_PORT+i; per node only its local rank ports work, so endpoints.yaml lists each node's local rank ports (node j: base + j*GPUS_PER_NODE .. +GPUS_PER_NODE). Health waits + the metrics scraper follow the same per-rank ports (scraper hits the vLLM port, never the sidecar). Mirrors the upstream GENERIC wide-ep-lws guide (DeepSeek-R1), which uses per-rank ports for P/D; our DSV4 path was hybrid-lb (one port/node). Recipe dsv4-fp4-gb200-megamoe-mp.yaml = mid-curve engine config + LLMD_LB_MODE multi-port + EPP scorers matched to the upstream router config (drop kv-cache-utilization-scorer; prefill = prefix-cache/queue/active-request, decode = active-request only). Keys smoke-2p1d-mp (fast validation) and high-tpt-megamoe-mp (conc4096 A/B vs the hybrid high-tpt key) select it. Hybrid path is unchanged (default), so existing keys are byte-identical. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
mp-smoke 28596473295 showed every prefill node (leader AND worker) binds
VLLM_PORT+local_index (8200-8203), not base+global rank: --data-parallel-start-rank
only sets DP rank IDs, the HTTP servers are always the node-local range. My
base+global assumption made the health gate probe a dead worker port
(10.30.1.39:8204 not ready). Fix everywhere to base+local:
- HEALTH_PORT / SIDECAR_HEALTH_PORT = VLLM_PORT / SIDECAR_PORT (local rank-0).
- sidecar --data-parallel-size = DP_SIZE_LOCAL (each node exposes its 4 local
ranks at SIDECAR_PORT..+3 -> vLLM VLLM_PORT..+3).
- endpoints + scraper: each node lists base..base+GPUS_PER_NODE-1 (same range
on every node, distinct address).
Also fix a double 'export export LLMD_LB_MODE' from overlapping edits.
Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
… bundle At conc4096 both Dynamo (8153 tok/s/GPU) and upstream llm-d PR #1737 (8858) succeed on the IDENTICAL 2P TP1/DP8/EP8 layout we use, while ours collapses to 8.3% completion. The difference is the engine bundle, so this recipe applies it: prefill gpu-mem 0.95, max-model-len 9280, max-num-seqs 16, max-num-batched-tokens 32768, no-enable-flashinfer-autotune; decode gpu-mem 0.9, max-model-len 9280, no-flashinfer-autotune. Same EPP/env/topology + kv producer-consumer + hybrid-lb as mid-curve, to isolate the engine bundle. The max-model-len/no-autotune/ max-num-seqs guards are what let this 32768/0.95 bundle bring up without the two OOMs our earlier speed-bundle hit. Keys tuned-smoke + tuned-high-tpt select it. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
The no-kube Envoy front door (benchmarks/llm-d/envoy.yaml) defined the epp ext_proc cluster and the original_dst cluster with no circuit_breakers block, so Envoy applied its DEFAULT per-cluster thresholds of max_requests=1024 and max_pending_requests=1024. The ext_proc filter holds one long-lived HTTP/2 stream to EPP for the entire lifetime of every request, so at max-concurrency 4096 only 1024 active + 1024 pending = 2048 requests ever reached EPP; the remaining ~18432 (warmups + prompts) got a fast 500 and never saw the engine. This capped every high-tpt run at exactly 1024 completions / 2048 disagg decisions, identically across engine configs (mid-curve, Dynamo-tuned) and LB modes (hybrid, multi-port) - because the cap was the front door, not the engine. It is why Dynamo (no Envoy) and upstream llm-d (k8s gateway) hit 8000+ tok/s/GPU on the same TP1/DP8/EP8 2P1D layout where we stalled at ~1783. Fix: add circuit_breakers with thresholds far above max-concurrency to both clusters. Evidence: envoy_access RESPONSE_CODE histogram = 2048x200 + 18432x500 in all three walled runs; disagg_decision_total pinned at 2048 in all three. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…0000 Align the circuit_breakers thresholds added in the previous commit to the exact values from the authoritative upstream preset - the llm-d-router-standalone Helm chart (llm-d-inference-scheduler config/charts/llm-d-router-standalone/values.yaml lines 191-208): max_connections/max_pending_requests/max_requests = 40000 on both the ext_proc (EPP) and original_dst clusters, plus max_retries 1024 on ext_proc. Both clear max-concurrency 4096 with wide headroom; 40000 just matches upstream rather than an arbitrary ceiling. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
… upstream) The prior CB commit lifted the epp ext_proc cluster to 40000 but left the original_dst cluster at 1000000. Both clear max-concurrency 4096 identically (this is the exact config the 8452 tok/s/GPU high-tpt confirmation run used), but pin original_dst to 40000 too so both clusters match each other and the upstream llm-d-router-standalone preset. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…util + decode prefix-cache) Align the mid-curve/high-tpt recipe's EPP scheduling profiles to the upstream generic wide-ep-lws router config (llm-d guides/wide-ep-lws/router/wide-ep-lws.values.yaml): prefill = prefix-cache(3) + queue(2) + active-request(2) decode = active-request only Removed: kv-cache-utilization-scorer from prefill (upstream does not use it; its weight becomes active-request(2)); and prefix-cache-scorer from decode - at weight 3 it outweighed active-request(2) and pinned decode to one pod on a shared prompt prefix, producing the ~2:1 decode routing imbalance (decode-0 1389 vs decode-1 659). Upstream decode routes on active-request alone and balances. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…keys Pre-PR cleanup of debug/experiment scaffolding added while root-causing the conc4096 wall (now fixed via the Envoy circuit-breaker change). Removed: - benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-megamoe-mp.yaml (multi-port LB experiment - dead: NIXL KV-expiry) - benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-megamoe-tuned.yaml (Dynamo engine-bundle A/B - superseded; the circuit-breaker fix, not engine args, was the wall) - nvidia-master.yaml keys: smoke-2p1d, smoke-2p1d-mp, high-tpt-megamoe-mp, tuned-smoke, tuned-high-tpt (experiment/smoke keys tied to the above) Kept: mid-curve-megamoe + high-tpt-megamoe keys, the envoy circuit_breakers fix, the teardown/timeout fix, and the DSV4 tokenizer plumbing. Deferred to a follow-up (after the mid-curve rerun validates and its scraper CSV is analysed): removing the per-engine vLLM metrics scraper and the multi-port branches in server.sh. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…pper - Remove metrics_scrape.py + its scraper bring-up/stop/conc-tag in server.sh - Remove dead multi-port LB path (LLMD_LB_MODE and all branches); hybrid only - Remove EPP --v=4 and NCCL/UCX transport debug logging (keep UCX_TLS pin) - Rename dsv4_fp4_gb200_llm-d-vllm.sh -> ...-disagg.sh; update launcher SCRIPT_NAME Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Comment-only cleanup across the llm-d-vllm GB200 assets: - recipe header rewritten to describe the common per-engine shape and the two topologies (mid-curve 1P1D / high-tpt 2P1D) it backs - drop historical tuning-log comments from the recipe, launcher, envoy.yaml, binaries.env, and Dockerfile Verified: no non-comment lines changed; envoy circuit_breakers (40000) intact. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Include dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe (2P1D DEP8, conc 4096) alongside the mid-curve key so a labeled PR sweep validates both topologies, not just mid-curve. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
- llm-d/README.md: document endpoints.yaml as one entry per node (not per leader) to match server.sh add_role(); fix example ports (prefill 8200, decode 8000) and rewrite the rationale section (C2) - perf-changelog.yaml: canonicalize pr-link XXX -> 2050 (C4) Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…ent (C1) Bump the EPP and P/D routing-sidecar source images from the moving -dev:main tags to release :v0.9.0 in both binaries.env and the combined-image Dockerfile, so a rebuild is reproducible. Verified both tags ship linux/arm64 and still expose /app/epp and /app/pd-sidecar (statically-linked aarch64). Envoy pin unchanged. Also reword the binaries.env header (PR #2050 review C1): the old comment claimed the file is 'not used', but job.slurm still sources it and conditionally bind-mounts the binaries via LLMD_BIN_DIR. Clarify it is an optional mounted-binary override, a no-op by default, and fix the 'build an use' -> 'build and use' typo. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…ild-arg
BuildKit rejects `COPY --from=${VAR}` with 'variable expansion is not
supported for --from'. Promote VLLM_BASE and the three source-image ARGs
(EPP/sidecar/envoy) to global scope, bind each to a named FROM stage
(epp_src / sidecar_src / envoy_src), and COPY --from those stage aliases.
No change to defaults or --build-arg override behavior; still in sync
with binaries.env.
Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…sses The eval-only jobs failed 'eval metadata concurrency does not match workflow request' even though gsm8k accuracy passed (0.95 >= 0.91). Cause: run_eval/append_lm_eval_summary read EVAL_CONCURRENT_REQUESTS and CONC (benchmark_lib.sh), never EVAL_CONC, so with CONC empty meta_env.json recorded conc=1 while validate_scores.py --expected-concs checked 4096. Mirror the AMD multi-node servers: before run_eval, set EVAL_CONCURRENT_REQUESTS from EVAL_CONC (fallback: max of the x-delimited BENCH_MAX_CONCURRENCY) and export CONC so the stamped concurrency matches the workflow request. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Add dsv4-fp4-gb200-llm-d-vllm-max-tpt-megamoe: 3 prefill (DEP8) + 1 decode (DEP8) at conc=4096, mirroring NVIDIA/srt-slurm disagg-gb200-max-tpt-megamoe (engine-identical to high-tpt; only prefill scale-out 2->3), so it reuses the same CONFIG_FILE. Validated standalone via e2e run 28711505502 at 8797 tok/s/GPU (median TTFT 58 ms) - above high-tpt's 8241 tok/s/GPU, since the extra prefill instance relieves the conc=4096 prefill bottleneck. Adds a matching perf-changelog entry under PR #2050. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Move the max-tpt config-key to the top of the changelog entry so it dispatches first in the serialized multi-node matrix. The prior sweep (28714892207) had max-tpt land last; a transient decode-node slowdown degraded that point (4.5x TPOT vs high-tpt on an identical decode config). Running it first surfaces its result early and re-tests the transient before the shorter mid/high legs finish. Ordering only - all three points still run and land on the graph. Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
3bd0b1d to
67be9f4
Compare
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28742470098 |
Adding llm-d into InferenceX Starting with DeepSeek-V4-Pro FP4 GB200 disaggregated benchmarks (mid-curve + high-tpt)
Summary
Adds llm-d (vLLM) as a new benchmark framework in InferenceX, with a no-k8s multi-node path (Envoy + EPP + P/D sidecar over SLURM/pyxis), and theDeepSeek-V4-Pro FP4 GB200 P/D-disaggregated recipes:
dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe- 1P1D DEP8, conc 256/512/1024dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe- 2P1D DEP8, conc 4096Both keys share one engine/EPP recipe (
benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-mid-curve-megamoe.yaml);topology and concurrency come from the master-config keys, not the recipe.
Notable
dsv4-fp4-gb200-dynamo-vllm- same SKU, model, ISL/OSL, disagg TP1/DP8/EP8 layout, and the same concurrency points (256/512/1024/4096).agg_*.jsonrows throughprocess_result.py(framework=llm-d-vllm, X=conc, Y=tput_per_gpu) and the same_gpus_/_ctx_/_gen_result-filename convention the multinode ingest already parses. The only downstream addition isthe framework label in InferenceX-app (separate PR).
Scope
Known merge-blocker
image:on both keys is a personal build (ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23). It should be republished to the org registry and theimage:lines updated before merge.Related
llm-d-vllm-> "llm-d vLLM" inSemiAnalysisAI/InferenceX-app(FW_REGISTRY) so the new series is recognized and labeled on inferencex.semianalysis.com. https://github.com/SemiAnalysisAI/InferenceX-app/pull/524Before merge
image:on both keys.perf-changelog.yamlpr-linkfrompull/XXXto this PR's number.