Skip to content

[NV] llm-d-vllm: Add llm-d to the InferenceX benchmarking framework#2050

Open
ezrasilvera wants to merge 24 commits into
mainfrom
llm-d-initial
Open

[NV] llm-d-vllm: Add llm-d to the InferenceX benchmarking framework#2050
ezrasilvera wants to merge 24 commits into
mainfrom
llm-d-initial

Conversation

@ezrasilvera

Copy link
Copy Markdown
Collaborator

Adding llm-d into InferenceX Starting with DeepSeek-V4-Pro FP4 GB200 disaggregated benchmarks (mid-curve + high-tpt)

Summary

Adds llm-d (vLLM) as a new benchmark framework in InferenceX, with a no-k8s multi-node path (Envoy + EPP + P/D sidecar over SLURM/pyxis), and theDeepSeek-V4-Pro FP4 GB200 P/D-disaggregated recipes:

  • dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe - 1P1D DEP8, conc 256/512/1024
  • dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe - 2P1D DEP8, conc 4096

Both keys share one engine/EPP recipe (benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-mid-curve-megamoe.yaml);
topology and concurrency come from the master-config keys, not the recipe.

Notable

  • These keys mirror dsv4-fp4-gb200-dynamo-vllm - same SKU, model, ISL/OSL, disagg TP1/DP8/EP8 layout, and the same concurrency points (256/512/1024/4096).
  • Emits the same agg_*.json rows through process_result.py (framework=llm-d-vllm, X=conc, Y=tput_per_gpu) and the same _gpus_/_ctx_/_gen_ result-filename convention the multinode ingest already parses. The only downstream addition is
    the framework label in InferenceX-app (separate PR).
  • Every reported point is a full 100%-completion sweep: (e.g. 12288/12288 at conc4096).
  • One recipe, two topologies: Mid-curve (1P1D) and high-tpt (2P1D) share a single engine/EPP recipe; node counts and concurrency live in the master-config keys - so adding a curve point is a config-key change, not a new recipe file.

Scope

  • GB200 DeepSeek-V4-Pro FP4 only. No changes to other SKUs/models/frameworks.

Known merge-blocker

  • The image: on both keys is a personal build (ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23). It should be republished to the org registry and the image: lines updated before merge.

Related

Before merge

  • Republish image to org registry; update image: on both keys.
  • Canonicalize perf-changelog.yaml pr-link from pull/XXX to this PR's number.
  • Full sweep green.

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase As a PR reviewer and CODEOWNER, I have reviewed this and have.

For PR verification, add the full-sweep-enabled or full-sweep-fail-fast label to this PR — the benchmark sweep only runs on labeled PRs.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs


感谢你的贡献!请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md,然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人,你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板,包括保留英文语句 As a PR reviewer and CODEOWNER, I have reviewed this and have

如需进行 PR 验证,请为此 PR 添加 full-sweep-enabledfull-sweep-fail-fast 标签 — 基准测试 sweep 仅在带有标签的 PR 上运行。

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档

2 similar comments
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase As a PR reviewer and CODEOWNER, I have reviewed this and have.

For PR verification, add the full-sweep-enabled or full-sweep-fail-fast label to this PR — the benchmark sweep only runs on labeled PRs.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs


感谢你的贡献!请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md,然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人,你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板,包括保留英文语句 As a PR reviewer and CODEOWNER, I have reviewed this and have

如需进行 PR 验证,请为此 PR 添加 full-sweep-enabledfull-sweep-fail-fast 标签 — 基准测试 sweep 仅在带有标签的 PR 上运行。

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Thanks for the contribution! Please reach out to respective companies' CODEOWNER to fill in the latest PR_REVIEW_CHECKLIST.md before pinging core maintainer on Slack for review. In order for the signoff PR check bot to trigger, you must follow the PR_REVIEW_CHECKLIST.md template correctly, including the phrase As a PR reviewer and CODEOWNER, I have reviewed this and have.

For PR verification, add the full-sweep-enabled or full-sweep-fail-fast label to this PR — the benchmark sweep only runs on labeled PRs.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. See GitHub's docs on re-running failed jobs


感谢你的贡献!请联系相应公司的 CODEOWNER 填写最新的 PR_REVIEW_CHECKLIST.md,然后再在 Slack 上联系核心维护者进行审阅。为了触发 signoff PR 检查机器人,你必须正确遵循 PR_REVIEW_CHECKLIST.md 模板,包括保留英文语句 As a PR reviewer and CODEOWNER, I have reviewed this and have

如需进行 PR 验证,请为此 PR 添加 full-sweep-enabledfull-sweep-fail-fast 标签 — 基准测试 sweep 仅在带有标签的 PR 上运行。

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。参见 GitHub 关于重新运行失败任务的文档

Comment thread benchmarks/llm-d/binaries.env Outdated
Comment on lines +10 to +12
#
# This is currently not used as we build an use an image that contains
# all binaries

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The comment on lines 10-12 says "This is currently not used as we build an use an image that contains all binaries," but benchmarks/multi_node/llm-d/job.slurm (lines 236-241 of the pyxis branch) explicitly sources this file and uses LLMD_BIN_DIR to conditionally bind-mount epp/pd-sidecar/envoy — so the file IS wired into the runtime as an optional override. Also a typo: "build an use" → "build and use". Nit — no runtime impact, but the stale comment will mislead maintainers into thinking they can delete the file.

Extended reasoning...

What the comment claims vs. what the code does

Lines 10-12 of benchmarks/llm-d/binaries.env state:

This is currently not used as we build an use an image that contains all binaries

However, the pyxis branch of benchmarks/multi_node/llm-d/job.slurm explicitly sources this file and consumes LLMD_BIN_DIR to build optional bind-mounts:

[[ -f "${DI_REPO_DIR}/benchmarks/llm-d/binaries.env" ]] && \
    source "${DI_REPO_DIR}/benchmarks/llm-d/binaries.env"
for _bin in epp pd-sidecar envoy; do
    if [[ -n "${LLMD_BIN_DIR:-}" && -x "${LLMD_BIN_DIR}/${_bin}" ]]; then
        PYXIS_MOUNTS+=",${LLMD_BIN_DIR}/${_bin}:/usr/local/bin/${_bin}:ro"
        ...
    fi
done

So binaries.env is wired into the runtime — it acts as the single source of truth for LLMD_BIN_DIR (defaulted to /mnt/lustre01/users-public/sa-shared/llm-d-bins), which the pyxis path uses to swap in mounted binaries and skip a combined-image rebuild.

Step-by-step proof

  1. Author enables the pyxis path (the default on GB200 via LLMD_CONTAINER_ENGINE=pyxis set in runners/launch_gb200-nv.sh).
  2. job.slurm reaches the pyxis branch and sources binaries.env, populating LLMD_BIN_DIR.
  3. If a user has run extract-binaries.sh (which also sources this file), the three binaries exist under $LLMD_BIN_DIR and get bind-mounted into every container.
  4. This is the entire point of the mounted-binary path documented in extract-binaries.sh and the Dockerfile — running a stock vllm/vllm-openai image without rebuilding the combined image.

The comment therefore contradicts both the sibling files (extract-binaries.sh, Dockerfile, envoy.yaml docstrings) that describe binaries.env as the single source of truth, and job.slurm itself.

Addressing the counter-argument

One could argue the comment is "arguably accurate for the DEFAULT deployment path" — in the current recipe the combined image bakes binaries in, so the -x check makes the mount loop a no-op unless the operator has explicitly run extract-binaries.sh. That's a fair characterization of when the file has effect, but "currently not used" is stronger than "currently a no-op by default" and it flatly contradicts the sourcing on job.slurm:236. A future maintainer trying to trim "unused" scaffolding would delete this file and quietly break the mounted-binary path.

Impact and fix

No runtime impact — the code works today. The concern is purely documentation drift in a new file this PR is introducing. Fix in one edit:

# This file is sourced by benchmarks/multi_node/llm-d/job.slurm to
# optionally bind-mount extracted binaries in place of the combined
# image's baked-in copies (see extract-binaries.sh). It is a no-op
# by default (the -x check fails when the binaries have not been
# extracted); the Dockerfile also reads the same URLs when building
# and using an image that contains all binaries.

Also fix the "build an use" → "build and use" typo while you're in there.

Comment thread benchmarks/multi_node/llm-d/README.md Outdated
Comment on lines +80 to +95
### Why one entry per *leader* (not per node)

In the wide-EP guide each instance is a single vLLM engine that spans
multiple nodes via `--data-parallel-hybrid-lb`. With hybrid-lb, the
leader pod (`LWS_WORKER_INDEX=0`) accepts external traffic and
distributes it internally across the local DP ranks; in our LWS-free
SLURM mapping, the prefill-leader and decode-leader are the only nodes
addressable from outside. Adding an entry per worker would cause EPP to
route directly to a worker, bypassing the engine's internal load
balancing.

If we later want to expose all pods of an instance (the alternative
hybrid-lb interpretation: external LB across nodes too), we can extend
the loop in `server.sh` to emit one entry per `IPS[i]` in the prefill
range and one per `IPS[i]` in the decode range, all carrying the same
role label. EPP then load-balances across them via `random-picker`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The README section "Why one entry per leader (not per node)" (lines 80-89) plus the example endpoints.yaml at lines 57-69 both document per-leader emission, but server.sh add_role() (lines 334-343) actually emits ONE endpoint per node — its own comment says "ONE endpoint per node at base_port (hybrid LB: ...)". The lines 91-95 "future work" section describing per-node emission as hypothetical has already been implemented. Docs-only staleness (per-node is the intended behavior — see server.sh:289-291 and its consistency with the multi-node coordinator health poll in lines 519-535), but future maintainers may misdiagnose whether current behavior is intentional. Please update the README to reflect actual per-node emission, or add a note explaining why the leader-only text was preserved.

Extended reasoning...

What the bug is

The new file benchmarks/multi_node/llm-d/README.md (added by this PR) describes the endpoints.yaml file-discovery contract as emitting one entry per leader. Concretely:

  • Line 55: "…writes /tmp/endpoints.yaml inside the container with one entry per leader"
  • Lines 57-69: the example YAML shows only prefill-0 (PREFILL_LEADER_IP) and decode-0 (DECODE_LEADER_IP)
  • Lines 80-89: an entire section titled "Why one entry per leader (not per node)" justifies leader-only emission and warns: "Adding an entry per worker would cause EPP to route directly to a worker, bypassing the engine's internal load balancing"
  • Lines 91-95: describes per-node emission as a hypothetical future extension ("we can extend the loop … to emit one entry per IPS[i]")

But the code at benchmarks/multi_node/llm-d/server.sh:334-348 (the add_role() block that writes endpoints.yaml) does exactly the opposite:

def add_role(role, ips, base_port):
    # ONE endpoint per node at base_port (hybrid LB: the node's api-server /
    # sidecar internally load-balances its local DP ranks).
    for i, ip in enumerate(ips):
        endpoints.append({'name': f'{role}-{i}', ...})

add_role('prefill', prefill_ips, VLLM_PORT)     # prefill_ips = all_ips[:pn]
add_role('decode', decode_ips, SIDECAR_PORT)    # decode_ips = all_ips[pn:pn + dn]

The inline comment at server.sh:339-340 literally says the opposite of the README section header: "ONE endpoint per node at base_port".

Step-by-step proof

Take the new high-tpt 2P+1D key (dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe, configs/nvidia-master.yaml). Topology: PREFILL_NODES=4, DECODE_NODES=2, PREFILL_WORKERS=2.

  1. submit.sh sbatches 6 nodes. job.slurm collects their IPs in rank order into ALL_IPS = [P0,P1,P2,P3,D0,D1].
  2. On the decode leader, server.sh runs the coordinator Python block:
    • prefill_ips = all_ips[:4] = [P0, P1, P2, P3]
    • decode_ips = all_ips[4:6] = [D0, D1]
  3. add_role('prefill', prefill_ips, 8200) iterates 4 times → emits prefill-0..prefill-3.
  4. add_role('decode', decode_ips, 8000) iterates 2 times → emits decode-0..decode-1.
  5. Final endpoints.yaml has 6 entries — not the 2 (prefill-0 + decode-0) the README example shows.

For the 1P+1D mid-curve key it emits 1 prefill + 2 decode entries (still per-node, not per-leader).

Why this is intentional (i.e. the code is right, docs are stale)

Several signals confirm per-node emission is the deliberate design, not a bug:

  1. server.sh:289-291 comment explicitly justifies it: "Each decode node runs its own sidecar (SIDECAR_PORT -> local decode vLLM), and endpoints.yaml lists one decode endpoint per node so EPP fans out across all decode ranks."
  2. server.sh:519-535 (the decode coordinator's cross-node prefill health poll) iterates over _prefill_ips=( "${_ALL_IPS[@]:0:${PREFILL_NODES}}" ) — polling every prefill node, not just the leader. That poll is only meaningful if all prefill nodes are individually addressable endpoints.
  3. --api-server-count is set on every EP-enabled node (server.sh:236-240), plus --data-parallel-hybrid-lb — meaning each node has its own api-server binding, so per-node endpoints are actually the correct interpretation of hybrid-lb.
  4. PREFILL_WORKERS=2 in the high-tpt key requires per-node emission: with 2 independent DP engines across 4 prefill nodes, per-leader emission would only expose one of the two engines.

Impact

Zero runtime impact — code is self-consistent and produces the intended routing behavior. But a future maintainer reading the README's "Why one entry per leader (not per node)" section will:

  • Believe the current behavior is per-leader when it isn't.
  • Reference the warning about "EPP would bypass internal load balancing" and (incorrectly) conclude the current implementation is broken.
  • Not realize the "future work" paragraph at lines 91-95 has already been implemented.

Suggested fix

Update benchmarks/multi_node/llm-d/README.md:

  • Line 55: change "one entry per leader" → "one entry per node".
  • Lines 57-69: expand the example YAML to show all prefill and decode nodes (e.g. prefill-0..prefill-N-1, decode-0..decode-M-1).
  • Lines 80-89: rewrite the "Why one entry per leader" section to explain why per-node is used (hybrid-lb + independent api-servers + PREFILL_WORKERS multi-engine).
  • Lines 91-95: delete or rewrite the "future work" paragraph since per-node emission is what the code already does. The leader-only mode could be described here as the hypothetical alternative if desired.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d7e4180 - README now documents one endpoint per node (not per leader) to match server.sh add_role(), with corrected example ports (prefill 8200 / decode 8000) and a rewritten rationale section.

Comment on lines +9078 to +9080
# Build source: benchmarks/llm-d/Dockerfile.
dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe:
image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Both new keys (mid-curve line 9080, high-tpt line 9125) pin image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23, a personal ghcr namespace rather than the SemiAnalysisAI org registry. The PR description's "Before merge" checklist and the in-file comment at 9075-9077 both flag this — surfacing here as a merge-time reminder so it isn't missed. Republish to ghcr.io/semianalysisai/... and update both image: lines before merge.

Extended reasoning...

What the bug is

Both new master-config keys added by this PR pin their engine image to a personal ghcr namespace:

  • configs/nvidia-master.yaml:9080 (dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe): image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23
  • configs/nvidia-master.yaml:9125 (dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe): image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23

The PR author has explicitly flagged this in two places already: the PR description's "Known merge-blocker" section ("The image: on both keys is a personal build... It should be republished to the org registry and the image: lines updated before merge") and an in-file comment at lines 9075-9077 ("PR follow-up (merge-blocking): the image: on both llm-d-vllm keys below (mid-curve + high-tpt) is a personal build. Republish it to the org registry (ghcr.io/semianalysisai/...) and update these image: lines before merge").

Why it matters

Reproducibility. On the GB200 pyxis path, runners/launch_gb200-nv.sh derives SQUASH_FILE="${SQUASH_DIR}/$(echo "$IMAGE" | sed 's/[\/:@#]/_/g').sqsh" — so the enroot-import cache key is a direct function of the image string. Any historical rerun (e.g., re-ingesting a failed sweep, reproducing a curve point months later) resolves the image via that exact URL. A personal ghcr namespace can be retagged, deleted, or made private without notice by the individual owner, silently breaking reruns.

Step-by-step proof

  1. configs/nvidia-master.yaml:9080 sets image: ghcr.io/ezrasilvera/llm-d-nokube-vllm:vllm0.23 on the mid-curve key.
  2. configs/nvidia-master.yaml:9125 sets the same image on the high-tpt key.
  3. runners/launch_gb200-nv.sh (the llm-d-vllm branch, top of file) reads $IMAGE and constructs ENROOT_URL="docker://ghcr.io#ezrasilvera/llm-d-nokube-vllm:vllm0.23" for enroot import.
  4. enroot import authenticates and pulls against ezrasilvera's personal ghcr namespace on every fresh cluster / cache miss.
  5. If ezrasilvera retags/deletes/private-marks this tag, step 4 fails; the sweep can no longer be reproduced.

Addressing the refutation

One verifier argued this duplicates the author's own annotations (PR description merge-blocker + in-file comment) and should be dropped as "no new information." That's a fair point, and it's why I'm filing this as nit rather than normal — I'm not claiming the author is unaware. But surfacing it in the code-review pass provides a merge-time reminder against the actual diff (as opposed to a checklist item in the description that can be overlooked when the reviewer clicks Merge). If this shows up alongside other findings in the same review, it's cheap to satisfy; if the maintainer knows the image swap is already in flight, the reminder can be dismissed in a second.

Fix

Republish the combined image (built from benchmarks/llm-d/Dockerfile) to ghcr.io/semianalysisai/ and update both image: lines. The enroot cache key changes with the image string, so this is a valid config change to land in this PR rather than a follow-up — landing it here keeps the initial ingest and all historical reruns pinned to the org-owned image.

Comment thread perf-changelog.yaml Outdated
Comment on lines +4441 to +4444
- "Add DeepSeek-V4-Pro FP4 GB200 llm-d-vllm disaggregated multinode benchmarks"
- "Mid-curve 1P+1D DEP8 (conc 256/512/1024)"
- "High-tpt 2P+1D DEP8 (conc 4096)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new perf-changelog entry has pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX (literal XXX placeholder at perf-changelog.yaml:4444). Every other entry in this file resolves to a real PR number — this one alone will 404 if merged as-is. Fix by replacing XXX with 2050 before merge. (The author already listed this under 'Before merge' in the PR description, so filing as a friendly nit rather than a blocker.)

Extended reasoning...

What the bug is. The perf-changelog entry appended by this PR (perf-changelog.yaml:4441-4444) carries a literal pull/XXX placeholder rather than the real PR number:

- config-keys:
    - dsv4-fp4-gb200-llm-d-vllm-mid-curve-megamoe
    - dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe
  description:
    - "Add DeepSeek-V4-Pro FP4 GB200 llm-d-vllm disaggregated multinode benchmarks"
    - "Mid-curve 1P+1D DEP8 (conc 256/512/1024)"
    - "High-tpt 2P+1D DEP8 (conc 4096)"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX

Why the current diff is inconsistent. Every other entry in perf-changelog.yaml resolves to a real PR (the immediately preceding entry ends with pull/2001; earlier entries reference pull/1990, pull/1978, etc.). The pr-link contract of this file is to trace which PR introduced which benchmark set — a literal XXX breaks that contract.

Impact if merged as-is. Clicking the link in the changelog will hit https://github.com/SemiAnalysisAI/InferenceX/pull/XXX, which resolves to no PR (GitHub returns 404 for numeric paths that don't exist, or worse silently redirects). No functionality is affected — the sweep still runs, the config keys still work — but the diagnostic "which PR introduced this row?" trail is broken for the two new dsv4-fp4-gb200-llm-d-vllm keys.

Why existing checks don't catch it. There is no linter validating perf-changelog pr-link values against real PR numbers, so nothing stops the placeholder from shipping. The PR description's 'Before merge' checklist explicitly lists this: 'Canonicalize perf-changelog.yaml pr-link from pull/XXX to this PR's number' — the author has already flagged it as a known merge-blocker to canonicalize.

Step-by-step proof.

  1. Open perf-changelog.yaml at line 4444 → observe pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXX.
  2. grep -c 'pull/XXX' perf-changelog.yaml → 1 (only this new entry).
  3. grep -oE 'pull/[0-9]+' perf-changelog.yaml | tail -3 → all resolve to real numeric PRs (2001, 1990, 1978).
  4. Attempt https://github.com/SemiAnalysisAI/InferenceX/pull/XXX in a browser → GitHub 404.
  5. Expected value: pull/2050 (this PR's number, per the PR metadata).

Fix. One-character-scope edit: replace XXX with 2050 on perf-changelog.yaml:4444. No other files need to change.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d7e4180 - pr-link canonicalized from pull/XXX to pull/2050.

ezrasilvera added a commit that referenced this pull request Jul 4, 2026
- llm-d/README.md: document endpoints.yaml as one entry per node (not per
  leader) to match server.sh add_role(); fix example ports (prefill 8200,
  decode 8000) and rewrite the rationale section (C2)
- perf-changelog.yaml: canonicalize pr-link XXX -> 2050 (C4)

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

ezrasilvera added a commit that referenced this pull request Jul 4, 2026
…ent (C1)

Bump the EPP and P/D routing-sidecar source images from the moving
-dev:main tags to release :v0.9.0 in both binaries.env and the
combined-image Dockerfile, so a rebuild is reproducible. Verified both
tags ship linux/arm64 and still expose /app/epp and /app/pd-sidecar
(statically-linked aarch64). Envoy pin unchanged.

Also reword the binaries.env header (PR #2050 review C1): the old
comment claimed the file is 'not used', but job.slurm still sources it
and conditionally bind-mounts the binaries via LLMD_BIN_DIR. Clarify
it is an optional mounted-binary override, a no-op by default, and fix
the 'build an use' -> 'build and use' typo.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

ezrasilvera added a commit that referenced this pull request Jul 4, 2026
Add dsv4-fp4-gb200-llm-d-vllm-max-tpt-megamoe: 3 prefill (DEP8) + 1 decode
(DEP8) at conc=4096, mirroring NVIDIA/srt-slurm disagg-gb200-max-tpt-megamoe
(engine-identical to high-tpt; only prefill scale-out 2->3), so it reuses the
same CONFIG_FILE. Validated standalone via e2e run 28711505502 at 8797
tok/s/GPU (median TTFT 58 ms) - above high-tpt's 8241 tok/s/GPU, since the
extra prefill instance relieves the conc=4096 prefill bottleneck.

Adds a matching perf-changelog entry under PR #2050.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@github-actions

github-actions Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Add a new llm-d-vllm disaggregated (prefill/decode) multi-node benchmark
path to InferenceX, orchestrated entirely by SLURM (no Kubernetes). vLLM
engines with llm-d EPP routing, an Envoy front-end, and NIXL KV transfer.

- benchmarks/llm-d/: container image (Dockerfile on vLLM v0.23 base),
  Envoy + EPP config, binaries.env, and an extraction helper that
  decouples epp/sidecar/envoy from the vLLM image.
- benchmarks/multi_node/llm-d/: server.sh orchestrator (per-role TP/EP,
  prefill/decode coordinators, PREFILL_WORKERS multi-engine prefill),
  job.slurm (pyxis/enroot + docker paths, per-node model probe), submit.sh.
- runners/launch_gb200-nv.sh: dispatch branch for FRAMEWORK=llm-d-vllm.
- benchmarks/benchmark_lib.sh: anchor a relative lm_eval task yaml to the
  repo root so evals resolve inside the serving container.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Route dsv4 models through the vLLM tokenizer wrapper and allow
deepseek_v4 as a --tokenizer-mode choice, bypassing HF AutoTokenizer
which does not recognize the model on the benchmark image. Adds
--trust-remote-code / --use-chat-template / --dsv4 passthroughs.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
- benchmarks/multi_node/dsv4_fp4_gb200_llm-d-vllm.sh: GB200 wrapper
  (GPUS_PER_NODE=4) for the DSV4-Pro FP4 llm-d-vllm path.
- llm-d-recipes/dsv4-fp4-gb200-mid-curve-megamoe.yaml: 1P+1D DEP8 recipe.
- nvidia-master.yaml: register the mid-curve-megamoe (conc 256/512/1024)
  and high-tpt-megamoe (Option-B 2P+1D, conc 4096) config keys.
- perf-changelog.yaml: changelog entry for the two keys.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…diagnostics

The decode leader's prefill-readiness curl had no connect/max timeout, so a
blackholed prefill endpoint made a single curl block forever - the 5-min
deadline is only checked between calls, so it never fired. A 2P
(PREFILL_WORKERS=2) run wedged ~7h on this before being cancelled.

- Add --connect-timeout 5 --max-time 10 so an unreachable endpoint trips the
  5-min deadline and FATALs cleanly instead of hanging the whole run.
- Probe every prefill node (ALL_IPS[0:PREFILL_NODES]), not just IPS[0]:
  endpoints.yaml lists one prefill endpoint per node, and PREFILL_WORKERS>1
  spans multiple independent DP engines, so IPS[0] only covered engine 1.
- On failure, dump network diagnostics - source IP toward target,
  /proc/net/route on-link-vs-gateway subnet classification, bash /dev/tcp L4
  connect test, ping, verbose curl - to separate a server-not-ready problem
  from a subnet/routing/firewall problem. Log the subnet layout up front too.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Two teardown fixes so a failed/hung benchmark can no longer wedge the whole
SLURM allocation (a conc4096 2P run hung 8h until GitHub killed it at the
max-execution limit, losing the server-log tarball):

- server.sh coordinator: write the .bench_done release marker from an EXIT
  trap, so job.slurm scancels the allocation on ANY coordinator exit (failed
  bench, set -e abort, EPP/Envoy bring-up error), not only a clean finish.
- benchmark_lib.sh: optional `timeout` wrapper around the bench client, gated
  on BENCH_TIMEOUT_S (default 2400s, set by server.sh); a hung client (asyncio
  shutdown wedged on thousands of failed in-flight requests) is killed instead
  of pinning 24 GPUs to TIME_LIMIT. The bench call is also made non-fatal so a
  multi-conc sweep records every point it can and always reaches teardown.

Replace the leader-only .prom scraper with metrics_scrape.py, which scrapes
EVERY prefill and decode vLLM (on the vLLM port, not the sidecar) plus EPP into
one time-correlated CSV. This makes per-engine load visible - crucially whether
EPP fans out across all prefill engines - which the old PREFILL_LEADER_IP-only
scraper could not show for PREFILL_WORKERS>1.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
dsv4-fp4-gb200-llm-d-vllm-smoke-2p1d: two independent 1-node DEP4 prefill
engines (PREFILL_NODES=2 PREFILL_WORKERS=2) + 1 DEP4 decode = 3 nodes / 12 GPUs,
ISL/OSL 1024/128, single conc 128 x ~512 prompts, RANDOM_RANGE_RATIO=1.0. Runs
in minutes; validates the per-engine metrics scraper and confirms EPP fans out
across both prefill engines before spending the 6-node high-tpt run. Reuses the
mid-curve recipe.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
The 3-node DEP4 smoke (1 node/engine) OOMs at model load: DSV4-Pro's MegaMoE
needs each EP rank to hold 1/4 of the experts at EP4 (~2x the per-GPU weight of
the validated EP8), which does not fit in HBM (run 28588980717:
"CUDA Error: out of memory at cumem_allocator.cpp:139"). DEP8 (2 nodes/engine)
is the minimum viable EP. Smoke is now PREFILL_NODES=4 PREFILL_WORKERS=2 (2x
DEP8) + DECODE_NODES=2 = 6 nodes, same layout as high-tpt but a tiny workload
(conc 128), so it validates the scraper + EPP fan-out fast on the real topology.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
LLMD_LB_MODE (recipe env, default hybrid). multi-port switches both roles to
--data-parallel-multi-port-external-lb + --data-parallel-supervisor-port and
drops --api-server-count, so each DP rank serves on VLLM_PORT+rank (base+global
rank; a worker node binds VLLM_PORT+START_RANK..). The decode pd-sidecar gets
--data-parallel-size=$DP_SIZE, which per data_parallel.go makes it expose
SIDECAR_PORT+i -> vLLM VLLM_PORT+i; per node only its local rank ports work, so
endpoints.yaml lists each node's local rank ports (node j: base + j*GPUS_PER_NODE
.. +GPUS_PER_NODE). Health waits + the metrics scraper follow the same per-rank
ports (scraper hits the vLLM port, never the sidecar). Mirrors the upstream
GENERIC wide-ep-lws guide (DeepSeek-R1), which uses per-rank ports for P/D; our
DSV4 path was hybrid-lb (one port/node).

Recipe dsv4-fp4-gb200-megamoe-mp.yaml = mid-curve engine config + LLMD_LB_MODE
multi-port + EPP scorers matched to the upstream router config (drop
kv-cache-utilization-scorer; prefill = prefix-cache/queue/active-request,
decode = active-request only). Keys smoke-2p1d-mp (fast validation) and
high-tpt-megamoe-mp (conc4096 A/B vs the hybrid high-tpt key) select it. Hybrid
path is unchanged (default), so existing keys are byte-identical.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
mp-smoke 28596473295 showed every prefill node (leader AND worker) binds
VLLM_PORT+local_index (8200-8203), not base+global rank: --data-parallel-start-rank
only sets DP rank IDs, the HTTP servers are always the node-local range. My
base+global assumption made the health gate probe a dead worker port
(10.30.1.39:8204 not ready). Fix everywhere to base+local:
  - HEALTH_PORT / SIDECAR_HEALTH_PORT = VLLM_PORT / SIDECAR_PORT (local rank-0).
  - sidecar --data-parallel-size = DP_SIZE_LOCAL (each node exposes its 4 local
    ranks at SIDECAR_PORT..+3 -> vLLM VLLM_PORT..+3).
  - endpoints + scraper: each node lists base..base+GPUS_PER_NODE-1 (same range
    on every node, distinct address).
Also fix a double 'export export LLMD_LB_MODE' from overlapping edits.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
… bundle

At conc4096 both Dynamo (8153 tok/s/GPU) and upstream llm-d PR #1737 (8858)
succeed on the IDENTICAL 2P TP1/DP8/EP8 layout we use, while ours collapses to
8.3% completion. The difference is the engine bundle, so this recipe applies it:
prefill gpu-mem 0.95, max-model-len 9280, max-num-seqs 16, max-num-batched-tokens
32768, no-enable-flashinfer-autotune; decode gpu-mem 0.9, max-model-len 9280,
no-flashinfer-autotune. Same EPP/env/topology + kv producer-consumer + hybrid-lb
as mid-curve, to isolate the engine bundle. The max-model-len/no-autotune/
max-num-seqs guards are what let this 32768/0.95 bundle bring up without the two
OOMs our earlier speed-bundle hit. Keys tuned-smoke + tuned-high-tpt select it.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
The no-kube Envoy front door (benchmarks/llm-d/envoy.yaml) defined the epp
ext_proc cluster and the original_dst cluster with no circuit_breakers block,
so Envoy applied its DEFAULT per-cluster thresholds of max_requests=1024 and
max_pending_requests=1024. The ext_proc filter holds one long-lived HTTP/2
stream to EPP for the entire lifetime of every request, so at max-concurrency
4096 only 1024 active + 1024 pending = 2048 requests ever reached EPP; the
remaining ~18432 (warmups + prompts) got a fast 500 and never saw the engine.

This capped every high-tpt run at exactly 1024 completions / 2048 disagg
decisions, identically across engine configs (mid-curve, Dynamo-tuned) and LB
modes (hybrid, multi-port) - because the cap was the front door, not the
engine. It is why Dynamo (no Envoy) and upstream llm-d (k8s gateway) hit
8000+ tok/s/GPU on the same TP1/DP8/EP8 2P1D layout where we stalled at ~1783.

Fix: add circuit_breakers with thresholds far above max-concurrency to both
clusters. Evidence: envoy_access RESPONSE_CODE histogram = 2048x200 + 18432x500
in all three walled runs; disagg_decision_total pinned at 2048 in all three.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…0000

Align the circuit_breakers thresholds added in the previous commit to the exact
values from the authoritative upstream preset - the llm-d-router-standalone Helm
chart (llm-d-inference-scheduler config/charts/llm-d-router-standalone/values.yaml
lines 191-208): max_connections/max_pending_requests/max_requests = 40000 on both
the ext_proc (EPP) and original_dst clusters, plus max_retries 1024 on ext_proc.
Both clear max-concurrency 4096 with wide headroom; 40000 just matches upstream
rather than an arbitrary ceiling.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
… upstream)

The prior CB commit lifted the epp ext_proc cluster to 40000 but left the
original_dst cluster at 1000000. Both clear max-concurrency 4096 identically
(this is the exact config the 8452 tok/s/GPU high-tpt confirmation run used),
but pin original_dst to 40000 too so both clusters match each other and the
upstream llm-d-router-standalone preset.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…util + decode prefix-cache)

Align the mid-curve/high-tpt recipe's EPP scheduling profiles to the upstream
generic wide-ep-lws router config
(llm-d guides/wide-ep-lws/router/wide-ep-lws.values.yaml):
  prefill = prefix-cache(3) + queue(2) + active-request(2)
  decode  = active-request only

Removed: kv-cache-utilization-scorer from prefill (upstream does not use it; its
weight becomes active-request(2)); and prefix-cache-scorer from decode - at
weight 3 it outweighed active-request(2) and pinned decode to one pod on a
shared prompt prefix, producing the ~2:1 decode routing imbalance (decode-0 1389
vs decode-1 659). Upstream decode routes on active-request alone and balances.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…keys

Pre-PR cleanup of debug/experiment scaffolding added while root-causing the
conc4096 wall (now fixed via the Envoy circuit-breaker change). Removed:
- benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-megamoe-mp.yaml (multi-port
  LB experiment - dead: NIXL KV-expiry)
- benchmarks/multi_node/llm-d-recipes/dsv4-fp4-gb200-megamoe-tuned.yaml (Dynamo
  engine-bundle A/B - superseded; the circuit-breaker fix, not engine args, was
  the wall)
- nvidia-master.yaml keys: smoke-2p1d, smoke-2p1d-mp, high-tpt-megamoe-mp,
  tuned-smoke, tuned-high-tpt (experiment/smoke keys tied to the above)

Kept: mid-curve-megamoe + high-tpt-megamoe keys, the envoy circuit_breakers fix,
the teardown/timeout fix, and the DSV4 tokenizer plumbing. Deferred to a
follow-up (after the mid-curve rerun validates and its scraper CSV is analysed):
removing the per-engine vLLM metrics scraper and the multi-port branches in
server.sh.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…pper

- Remove metrics_scrape.py + its scraper bring-up/stop/conc-tag in server.sh
- Remove dead multi-port LB path (LLMD_LB_MODE and all branches); hybrid only
- Remove EPP --v=4 and NCCL/UCX transport debug logging (keep UCX_TLS pin)
- Rename dsv4_fp4_gb200_llm-d-vllm.sh -> ...-disagg.sh; update launcher SCRIPT_NAME

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Comment-only cleanup across the llm-d-vllm GB200 assets:
- recipe header rewritten to describe the common per-engine shape and the
  two topologies (mid-curve 1P1D / high-tpt 2P1D) it backs
- drop historical tuning-log comments from the recipe, launcher, envoy.yaml,
  binaries.env, and Dockerfile
Verified: no non-comment lines changed; envoy circuit_breakers (40000) intact.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Include dsv4-fp4-gb200-llm-d-vllm-high-tpt-megamoe (2P1D DEP8, conc 4096)
alongside the mid-curve key so a labeled PR sweep validates both topologies,
not just mid-curve.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
- llm-d/README.md: document endpoints.yaml as one entry per node (not per
  leader) to match server.sh add_role(); fix example ports (prefill 8200,
  decode 8000) and rewrite the rationale section (C2)
- perf-changelog.yaml: canonicalize pr-link XXX -> 2050 (C4)

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…ent (C1)

Bump the EPP and P/D routing-sidecar source images from the moving
-dev:main tags to release :v0.9.0 in both binaries.env and the
combined-image Dockerfile, so a rebuild is reproducible. Verified both
tags ship linux/arm64 and still expose /app/epp and /app/pd-sidecar
(statically-linked aarch64). Envoy pin unchanged.

Also reword the binaries.env header (PR #2050 review C1): the old
comment claimed the file is 'not used', but job.slurm still sources it
and conditionally bind-mounts the binaries via LLMD_BIN_DIR. Clarify
it is an optional mounted-binary override, a no-op by default, and fix
the 'build an use' -> 'build and use' typo.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…ild-arg

BuildKit rejects `COPY --from=${VAR}` with 'variable expansion is not
supported for --from'. Promote VLLM_BASE and the three source-image ARGs
(EPP/sidecar/envoy) to global scope, bind each to a named FROM stage
(epp_src / sidecar_src / envoy_src), and COPY --from those stage aliases.
No change to defaults or --build-arg override behavior; still in sync
with binaries.env.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
…sses

The eval-only jobs failed 'eval metadata concurrency does not match
workflow request' even though gsm8k accuracy passed (0.95 >= 0.91). Cause:
run_eval/append_lm_eval_summary read EVAL_CONCURRENT_REQUESTS and CONC
(benchmark_lib.sh), never EVAL_CONC, so with CONC empty meta_env.json
recorded conc=1 while validate_scores.py --expected-concs checked 4096.

Mirror the AMD multi-node servers: before run_eval, set
EVAL_CONCURRENT_REQUESTS from EVAL_CONC (fallback: max of the x-delimited
BENCH_MAX_CONCURRENCY) and export CONC so the stamped concurrency matches
the workflow request.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Add dsv4-fp4-gb200-llm-d-vllm-max-tpt-megamoe: 3 prefill (DEP8) + 1 decode
(DEP8) at conc=4096, mirroring NVIDIA/srt-slurm disagg-gb200-max-tpt-megamoe
(engine-identical to high-tpt; only prefill scale-out 2->3), so it reuses the
same CONFIG_FILE. Validated standalone via e2e run 28711505502 at 8797
tok/s/GPU (median TTFT 58 ms) - above high-tpt's 8241 tok/s/GPU, since the
extra prefill instance relieves the conc=4096 prefill bottleneck.

Adds a matching perf-changelog entry under PR #2050.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
Move the max-tpt config-key to the top of the changelog entry so it
dispatches first in the serialized multi-node matrix. The prior sweep
(28714892207) had max-tpt land last; a transient decode-node slowdown
degraded that point (4.5x TPOT vs high-tpt on an identical decode
config). Running it first surfaces its result early and re-tests the
transient before the shorter mid/high legs finish.

Ordering only - all three points still run and land on the graph.

Signed-off-by: Ezra Silvera <ezra@il.ibm.com>
@github-actions

github-actions Bot commented Jul 5, 2026

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant