Add GLM5 FP8 dynamo-sglang GB300 disagg configs by yeswanthk-26 · Pull Request #1557 · SemiAnalysisAI/InferenceX

yeswanthk-26 · 2026-05-22T20:19:28Z

Summary

Add new glm5-fp8-gb300-dynamo-sglang entry in .github/configs/nvidia-master.yaml with 1k1k and 8k1k STP hightpt/lowlat scenarios.
Wire glm5-fp8 support in runners/launch_gb300-nv.sh

Note

Low Risk
Benchmark and CI launcher/config YAML only; no production serving or auth logic changes.

Overview
Adds GLM-5 FP8 disaggregated Dynamo + SGLang benchmark coverage on GB300, parallel to the existing glm5-fp4-gb300-dynamo-sglang setup.

A new glm5-fp8-gb300-dynamo-sglang block in nvidia-master.yaml defines 1k1k and 8k1k STP scenarios (high-throughput wide-EP decode vs low-latency per-node TP=4 decode workers), each pointing at recipes/sglang/glm5/gb300-fp8/... configs. 14 new Slurm recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/ implement those topologies (prefill/decode parallelism, DeepEP env tuning, sa-bench concurrencies).

Runners: launch_gb300-nv.sh maps glm5/fp8 to /scratch/models/GLM-5-FP8 and glm-5-fp8, and copies the full vendored sglang/glm5 recipe tree into srt-slurm (drops the old fp4-only comment block). launch_gb300-cw.sh gains the same glm5/fp8 + dynamo-sglang path with glm-5-fp8 in srtslurm.yaml model paths. perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 893aa82. Bugbot is set up for automated code reviews on this repo.

Port PR69 GLM5 FP8 GB300 disaggregated SGLang recipes to SA upstream and wire gb300-nv launcher support while keeping SA-default SLURM account/partition and sqsh paths.

claude · 2026-05-22T20:27:52Z

+      ep-dispatch-algorithm: static
+      moe-a2a-backend: deepep


🔴 In 1k1k_stp_hightpt_2.yaml the decode-side max-running-requests: 256 (line 134) is far below the benchmark's target concurrency of 7300 and is an outlier vs all sibling hightpt configs (which set it to 8192/8192/6500/5700, all aligned with their concurrency). The value 256 exactly matches the prefill section's setting in the same file, which strongly suggests a copy-paste error from prefill into decode. With this cap, ~7044 of the 7300 concurrent requests will perpetually queue inside the decode server and this sweep point will not reach intended decode throughput — please bump it to track the concurrency target (e.g. 7300 or 8192) like the other hightpt configs.

Extended reasoning...

What this is

In benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_hightpt_2.yaml, the decode block sets:

max-running-requests: 256
cuda-graph-max-bs: 180

while the benchmark concurrency at the bottom of the same file is concurrencies: 7300.

Why this is a bug — comparison across the sweep

All five 1k1k_stp_hightpt_* configs are part of the same hightpt sweep, and in every other file the decode max-running-requests is set to (or above) the target concurrency:

File	concurrency	decode max-running-requests	decode cuda-graph-max-bs
hightpt_0	8192	8192	512
hightpt_1	7500	8192	256
hightpt_2	7300	256 ← outlier	180
hightpt_3	6500	6500	128
hightpt_4	5700	5700	100

Only hightpt_2 has decode max-running-requests: 256. That value is identical to the prefill block earlier in the same file (line 71: max-running-requests: 256 in the prefill section), which is the classic copy-paste signature — the decode block was authored by copying prefill and the max-running-requests line was not bumped.
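The comparison above can be expressed as a small consistency check. This is a minimal sketch, not part of the PR: the numbers are copied from the table, while the dictionary layout and the function name `find_undersized_decode_caps` are hypothetical and chosen only for illustration.

```python
# Values copied from the comparison table; the data layout is an assumption
# made for this sketch, not the structure of the recipe YAML files.
SWEEP = {
    "hightpt_0": {"concurrency": 8192, "decode_max_running_requests": 8192},
    "hightpt_1": {"concurrency": 7500, "decode_max_running_requests": 8192},
    "hightpt_2": {"concurrency": 7300, "decode_max_running_requests": 256},
    "hightpt_3": {"concurrency": 6500, "decode_max_running_requests": 6500},
    "hightpt_4": {"concurrency": 5700, "decode_max_running_requests": 5700},
}


def find_undersized_decode_caps(sweep):
    """Return the configs whose decode cap is below their target concurrency."""
    return [
        name
        for name, cfg in sweep.items()
        if cfg["decode_max_running_requests"] < cfg["concurrency"]
    ]


print(find_undersized_decode_caps(SWEEP))
```

Run against the five sweep points, only hightpt_2 is flagged, which matches the outlier called out in the table.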

Why existing settings do not save us

The same decode block also sets cuda-graph-max-bs: 180 with data-parallel-size: 40, implying the decode server was sized for roughly 180 * 40 = 7200 in-flight requests. So the rest of the decode config is consistent with a ~7300-concurrent workload — only the max-running-requests: 256 line is out of place. SGLang enforces max-running-requests as a hard cap on simultaneously-scheduled requests across all DP ranks, so the lower of (256, 7200) wins.

Step-by-step proof of the symptom

1. Bench harness launches with concurrencies: 7300 → opens 7300 simultaneous client connections.

2. Each client request arrives at the prefill stage, gets prefilled (prefill max-running-requests: 256 throttles the prefill side; that is intentional and matches all siblings).

3. After KV transfer, the request flips to the decode server. Decode SGLang sees max-running-requests=256 and will only schedule 256 requests at a time.

4. The remaining ~7044 client requests sit in the waiting/queue state; decode TBT is measured only over the 256 actually running.

5. The reported throughput at conc=7300 is effectively the throughput at running-concurrency ≈ 256, not 7300. The hightpt_2 sweep point reports a number that is unrelated to what a 40-way DP decode at conc=7300 actually does.
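The arithmetic behind these steps can be sketched as follows. The sketch takes the review's premise at face value, namely that max-running-requests acts as a single hard cap on the decode side; that premise is an assumption here and is not checked against SGLang's scheduler. The function name `decode_admission` is hypothetical.

```python
def decode_admission(concurrency, max_running_requests, cuda_graph_max_bs, dp_size):
    """Estimate running vs. queued requests on the decode side.

    Assumes max_running_requests is one global hard cap, as the review
    comment states; this is a model of the argument, not of SGLang itself.
    """
    graph_capacity = cuda_graph_max_bs * dp_size
    running = min(concurrency, max_running_requests)
    return {
        "graph_capacity": graph_capacity,
        "running": running,
        "queued": concurrency - running,
    }


# hightpt_2 as submitted, then with the cap raised to the target concurrency.
as_submitted = decode_admission(7300, 256, 180, 40)
with_fix = decode_admission(7300, 7300, 180, 40)
print(as_submitted)
print(with_fix)
```

Under that assumption the submitted config leaves 7044 requests queued, while raising the cap to 7300 leaves none.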

Impact

This is a freshly added config, so there is no regression to bisect, but it silently produces a misleading data point in a perf sweep that is specifically meant to characterize decode throughput at high concurrency. Across the five 1k1k hightpt points, hightpt_2 will look anomalously low (or show anomalously low utilization on the decode side), and the curve becomes uninterpretable around 7300 concurrency.

Fix

Bump the decode-side max-running-requests in this file to match the target concurrency, the same way every sibling does — e.g. max-running-requests: 7300 (mirroring the hightpt_3/hightpt_4 pattern of "decode cap == conc") or max-running-requests: 8192 (mirroring hightpt_0/hightpt_1). One-line change at line 134.

claude · 2026-05-22T20:27:52Z

+    SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0'
+    SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1'
+    DYN_REQUEST_PLANE: nats
+      # DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size).


🟡 All 5 lowlat decode configs (1k1k_stp_lowlat_0/1 and 8k1k_stp_lowlat_0/1/2) copy the decode_environment block from the hightpt configs, including SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' and a comment explaining DeepEP per-rank dispatch buffer sizing. However these lowlat decode workers use moe-runner-backend: flashinfer_trtllm and do not set moe-a2a-backend: deepep or deepep-mode — DeepEP is not in use, so the env var is a no-op and the comment (talking about 4096/24 ~= 171) is misleading since lowlat decode runs DP=1 with cuda-graph-max-bs <= 32. Nit/cosmetic — consider dropping both the env var and the comment from the 5 lowlat files.

Extended reasoning...

What's going on

In each of the 5 lowlat decode configs added in this PR:

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_0.yaml

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_1.yaml

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml

benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml

the decode_environment block ends with:

# DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size).
# Default 128 overflows with large DP + batch (e.g. 4096/24 ~= 171 > 128). Limit 1024.
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512'

That block was copy-pasted from the hightpt configs where DeepEP is actually enabled (moe-a2a-backend: deepep, deepep-mode: low_latency, deepep-config: /configs/deepep_config.json).

Why this is misleading in the lowlat files

The lowlat decode sglang_config.decode section in each of these files uses:

tensor-parallel-size: 4
expert-parallel-size: 1
data-parallel-size: 1
enable-flashinfer-allreduce-fusion: true
moe-runner-backend: flashinfer_trtllm

There is no moe-a2a-backend: deepep and no deepep-mode / deepep-config — DeepEP is not in use, so SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK is a no-op.

The accompanying comment is also wrong-by-context: it talks about ceil(cuda_graph_max_bs / dp_size) with an example of 4096/24 ~= 171, but the lowlat decode configs run with data-parallel-size: 1 and cuda-graph-max-bs of 1, 8, 15, or 32 — the formula gives 1..32, nowhere near the 128 default.
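To make the sizing formula from the recipe comment concrete, here is a minimal sketch that evaluates ceil(cuda_graph_max_bs / dp_size) for the lowlat batch sizes and for the comment's own 4096/24 example. The helper name `min_dispatch_tokens_per_rank` is hypothetical; the default of 128 and the batch sizes are taken from the text above.

```python
import math

DEFAULT_DISPATCH_TOKENS = 128  # default quoted in the recipe comment


def min_dispatch_tokens_per_rank(cuda_graph_max_bs, dp_size):
    """Lower bound from the recipe comment: ceil(cuda_graph_max_bs / dp_size)."""
    return math.ceil(cuda_graph_max_bs / dp_size)


# lowlat decode: data-parallel-size 1, cuda-graph-max-bs of 1, 8, 15, or 32.
lowlat_needs = [min_dispatch_tokens_per_rank(bs, 1) for bs in (1, 8, 15, 32)]

# The example in the comment: batch 4096 spread over 24 DP ranks.
hightpt_example = min_dispatch_tokens_per_rank(4096, 24)

print(lowlat_needs, max(lowlat_needs) <= DEFAULT_DISPATCH_TOKENS)
print(hightpt_example, hightpt_example > DEFAULT_DISPATCH_TOKENS)
```

The lowlat values stay at or below 32, well under the default, while the comment's example is the only case that exceeds it.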

Step-by-step proof for 1k1k_stp_lowlat_0.yaml

1. Lines 49–52 (decode_environment) set SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' with the DeepEP buffer comment.

2. Lines 117–122 (decode sglang_config) set moe-runner-backend: flashinfer_trtllm, data-parallel-size: 1, cuda-graph-max-bs: 32.

3. No moe-a2a-backend or deepep-mode is set anywhere in the decode block, so DeepEP is not invoked.

4. SGLang reads SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK only on the DeepEP code path, so it has no effect.

5. The comment's example (4096/24 ~= 171) describes hightpt math (large DP + large batch); for this file the relevant value would be ceil(32/1) = 32, which never approaches the 128 default the comment warns about.

Impact and fix

No functional impact — SGLang silently ignores the unused env var, and benchmark behavior is unchanged. The cost is cosmetic but real: anyone reading these lowlat recipes will see a comment promising DeepEP-related tuning and an env var that doesn't apply, which is exactly the kind of friction that erodes trust in copy-pasted recipes.

Suggested fix: in all 5 lowlat yamls, drop the 3 lines (the two # comments and the SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK entry) from decode_environment. The hightpt files should keep them as-is.
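A check along these lines could catch the same leftover in future recipes. It is only a sketch: the recipes are represented as plain dictionaries, and the nesting used here (a top-level `decode_environment` plus `sglang_config` with a `decode` section) is an assumption inferred from the names in this comment, not the verified layout of the recipe files. The function name `has_stale_deepep_env` is hypothetical.

```python
DEEPEP_ENV = "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK"


def has_stale_deepep_env(recipe):
    """True if the DeepEP env var is set while the decode block does not use DeepEP.

    The dictionary layout is an assumption made for this sketch.
    """
    env = recipe.get("decode_environment", {})
    decode_cfg = recipe.get("sglang_config", {}).get("decode", {})
    uses_deepep = decode_cfg.get("moe-a2a-backend") == "deepep"
    return DEEPEP_ENV in env and not uses_deepep


# Hand-written stand-ins for a lowlat and a hightpt recipe.
lowlat = {
    "decode_environment": {DEEPEP_ENV: "512"},
    "sglang_config": {"decode": {"moe-runner-backend": "flashinfer_trtllm"}},
}
hightpt = {
    "decode_environment": {DEEPEP_ENV: "512"},
    "sglang_config": {"decode": {"moe-a2a-backend": "deepep"}},
}

print(has_stale_deepep_env(lowlat), has_stale_deepep_env(hightpt))
```

The lowlat stand-in is flagged and the hightpt one is not, which mirrors the suggested fix of dropping the entry only from the lowlat files.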

claude · 2026-05-22T20:27:52Z

+    - "Add GLM-5 FP8 GB300 Dynamo SGLang disaggregated multi-node coverage using lmsysorg/sglang:v0.5.11-cu130"
+    - "1k1k and 8k1k STP hightpt and lowlat srt-slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/ (resolved from upstream srt-slurm PR #160 via srtctl resolve-override)"
+    - "Wire glm5/fp8 model + dynamo-sglang framework branches into runners/launch_gb300-nv.sh with SA upstream defaults (SLURM_PARTITION=batch_1, SLURM_ACCOUNT=benchmark, SQUASH_FILE under /home/sa-shared/gharunners/squash/)"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX


🟡 The pr-link for this changelog entry contains the literal placeholder pull/XXXX instead of the actual PR number (1557). All other recent entries (lines 3084, 3091, 3102) use real PR numbers — please substitute XXXX with 1557 before merge so the changelog link resolves correctly.

Extended reasoning...

What the bug is: perf-changelog.yaml line 3110 contains:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

The XXXX is a literal placeholder that was never replaced with this PR's actual number, 1557 (visible in the PR metadata). The link as written points to /pull/XXXX, which GitHub will resolve to a 404 (or an unrelated page) forever once this is merged.

The code path that triggers it: This is purely a static YAML metadata entry. Anyone — tooling or a human — who walks the changelog and follows the link for the glm5-fp8-gb300-dynamo-sglang entry will hit a broken link.

Why existing code doesn't prevent it: perf-changelog.yaml is a hand-edited document; there is no validator that checks pr-link URLs for placeholder tokens. The PR author clearly intended to fill it in but forgot before pushing.
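A validator of the kind described as missing could look roughly like this. It is a hedged sketch using only the standard library: it scans raw text for `pr-link:` lines rather than parsing YAML, and the regular expression and the function name `invalid_pr_links` are illustrative choices, not existing tooling in the repo.

```python
import re

# A GitHub PR URL must end in a numeric ID; a placeholder such as XXXX fails.
PR_LINK = re.compile(r"^https://github\.com/[^/\s]+/[^/\s]+/pull/\d+$")


def invalid_pr_links(changelog_text):
    """Return (line_number, url) pairs for pr-link values that are not numeric PR URLs."""
    bad = []
    for lineno, line in enumerate(changelog_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("pr-link:"):
            url = stripped[len("pr-link:"):].strip()
            if not PR_LINK.match(url):
                bad.append((lineno, url))
    return bad


sample = (
    "  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1534\n"
    "  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX\n"
)
print(invalid_pr_links(sample))
```

On the two-line sample, only the second entry with the XXXX placeholder is reported.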

Impact: Cosmetic / documentation only — no runtime impact, the YAML still parses, the benchmarks still run. But the changelog's whole purpose is to let readers trace each config change back to its PR; a permanently broken link defeats that for this entry, and the mistake will be frozen in git history once merged.

How to fix: One-line substitution. Replace line 3110 with:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1557

Step-by-step proof:

1. Open perf-changelog.yaml at line 3110. The line reads pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX.

2. Compare with the most recent prior entries: line 3102 uses /pull/1534, line 3091 uses /pull/1451, line 3084 uses /pull/1548, all real, resolvable PR numbers.

3. The PR metadata for this change states <pr number="1557">, so the correct value is 1557, not XXXX.

4. Once merged, https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX will 404 (GitHub PR URLs require a numeric ID). The changelog entry will be unable to cross-reference its own PR.

Severity: nit — purely documentation/metadata, no functional consequence, but trivial to fix and worth catching pre-merge before it becomes a permanent artifact of git history.


github-actions · 2026-05-29T12:52:56Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26606969606
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=26606969606

functionstackx · 2026-05-29T23:09:33Z

/reuse-sweep-run

[GB300][SGLang] Add GLM5 FP8 dynamo-sglang disagg configs

886e619

Port PR69 GLM5 FP8 GB300 disaggregated SGLang recipes to SA upstream and wire gb300-nv launcher support while keeping SA-default SLURM account/partition and sqsh paths.

yeswanthk-26 requested a review from a team May 22, 2026 20:19

yeswanthk-26 requested review from jgangani and kedarpotdar-nv as code owners May 22, 2026 20:19

github-project-automation Bot added this to InferenceMAX Board May 22, 2026

claude Bot reviewed May 22, 2026


Merge remote-tracking branch 'origin/main' into yeswanth/glm5-fp8-gb3…

893aa82

…00-disagg # Conflicts: # .github/configs/nvidia-master.yaml # perf-changelog.yaml # runners/launch_gb300-nv.sh

Ankur-singh added the full-sweep-enabled label May 28, 2026

functionstackx merged commit c088658 into main May 29, 2026
58 checks passed

functionstackx deleted the yeswanth/glm5-fp8-gb300-disagg branch May 29, 2026 23:09

github-project-automation Bot moved this to Done in InferenceMAX Board May 29, 2026

functionstackx merged 2 commits into main from yeswanth/glm5-fp8-gb300-disagg


3 participants
