
Add GLM5 FP8 dynamo-sglang GB300 disagg configs #1557

Merged: functionstackx merged 2 commits into main from yeswanth/glm5-fp8-gb300-disagg on May 29, 2026


@yeswanthk-26 yeswanthk-26 commented May 22, 2026

Summary

  • Add new glm5-fp8-gb300-dynamo-sglang entry in .github/configs/nvidia-master.yaml with 1k1k and 8k1k STP hightpt/lowlat scenarios.
  • Wire glm5-fp8 support in runners/launch_gb300-nv.sh

Note

Low Risk
Benchmark and CI launcher/config YAML only; no production serving or auth logic changes.

Overview
Adds GLM-5 FP8 disaggregated Dynamo + SGLang benchmark coverage on GB300, parallel to the existing glm5-fp4-gb300-dynamo-sglang setup.

A new glm5-fp8-gb300-dynamo-sglang block in nvidia-master.yaml defines 1k1k and 8k1k STP scenarios (high-throughput wide-EP decode vs low-latency per-node TP=4 decode workers), each pointing at recipes/sglang/glm5/gb300-fp8/... configs. 14 new Slurm recipe YAMLs under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/ implement those topologies (prefill/decode parallelism, DeepEP env tuning, sa-bench concurrencies).

Runners: launch_gb300-nv.sh maps glm5/fp8 to /scratch/models/GLM-5-FP8 and glm-5-fp8, and copies the full vendored sglang/glm5 recipe tree into srt-slurm (drops the old fp4-only comment block). launch_gb300-cw.sh gains the same glm5/fp8 + dynamo-sglang path with glm-5-fp8 in srtslurm.yaml model paths. perf-changelog.yaml documents the new config key.

Reviewed by Cursor Bugbot for commit 893aa82. Bugbot is set up for automated code reviews on this repo.

Port PR69 GLM5 FP8 GB300 disaggregated SGLang recipes to SA upstream and wire gb300-nv launcher support while keeping SA-default SLURM account/partition and sqsh paths.
Comment on lines +133 to +134
ep-dispatch-algorithm: static
moe-a2a-backend: deepep

🔴 In 1k1k_stp_hightpt_2.yaml the decode-side max-running-requests: 256 (line 134) is far below the benchmark's target concurrency of 7300 and is an outlier vs all sibling hightpt configs (which set it to 8192/8192/6500/5700, all aligned with their concurrency). The value 256 exactly matches the prefill section's setting in the same file, which strongly suggests a copy-paste error from prefill into decode. With this cap, ~7044 of the 7300 concurrent requests will perpetually queue inside the decode server and this sweep point will not reach intended decode throughput — please bump it to track the concurrency target (e.g. 7300 or 8192) like the other hightpt configs.

Extended reasoning...

What this is

In benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_hightpt_2.yaml, the decode block sets:

      max-running-requests: 256
      cuda-graph-max-bs: 180

while the benchmark concurrency at the bottom of the same file is concurrencies: 7300.

Why this is a bug — comparison across the sweep

All five 1k1k_stp_hightpt_* configs are part of the same hightpt sweep, and in every other file the decode max-running-requests is set to (or above) the target concurrency:

| File | concurrency | decode max-running-requests | decode cuda-graph-max-bs |
|---|---|---|---|
| hightpt_0 | 8192 | 8192 | 512 |
| hightpt_1 | 7500 | 8192 | 256 |
| hightpt_2 | 7300 | 256 ← outlier | 180 |
| hightpt_3 | 6500 | 6500 | 128 |
| hightpt_4 | 5700 | 5700 | 100 |

Only hightpt_2 has decode max-running-requests: 256. That value is identical to the prefill block earlier in the same file (line 71: max-running-requests: 256 in the prefill section), which is the classic copy-paste signature — the decode block was authored by copying prefill and the max-running-requests line was not bumped.
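The comparison in the table above can also be checked mechanically. This is a minimal sketch, assuming the table values are transcribed by hand into a Python list; `SWEEP` and `find_undersized` are hypothetical names for illustration and do not exist in the repo.

```python
# Values transcribed from the table above:
# (file, benchmark concurrency, decode max-running-requests).
SWEEP = [
    ("hightpt_0", 8192, 8192),
    ("hightpt_1", 7500, 8192),
    ("hightpt_2", 7300, 256),
    ("hightpt_3", 6500, 6500),
    ("hightpt_4", 5700, 5700),
]


def find_undersized(sweep):
    """Return the files whose decode cap is below the benchmark concurrency."""
    return [name for name, concurrency, cap in sweep if cap < concurrency]


print(find_undersized(SWEEP))  # → ['hightpt_2']
```

Only the one file is flagged, which matches the outlier marked in the table.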

Why existing settings do not save us

The same decode block also sets cuda-graph-max-bs: 180 with data-parallel-size: 40, implying the decode server was sized for roughly 180 * 40 = 7200 in-flight requests. So the rest of the decode config is consistent with a ~7300-concurrent workload — only the max-running-requests: 256 line is out of place. SGLang enforces max-running-requests as a hard cap on simultaneously-scheduled requests across all DP ranks, so the lower of (256, 7200) wins.

Step-by-step proof of the symptom

  1. Bench harness launches with concurrencies: 7300 → opens 7300 simultaneous client connections.
  2. Each client request arrives at the prefill stage, gets prefilled (prefill max-running-requests: 256 throttles the prefill side; that is intentional and matches all siblings).
  3. After KV transfer, the request flips to the decode server. Decode SGLang sees max-running-requests=256 and will only schedule 256 requests at a time.
  4. The remaining ~7044 client requests sit in the waiting/queue state; decode TBT is measured only over the 256 actually running.
  5. The reported throughput at conc=7300 is effectively the throughput at running-concurrency ≈ 256, not 7300. The hightpt_2 sweep point reports a number that is unrelated to what a 40-way DP decode at conc=7300 actually does.
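The arithmetic behind these steps can be sketched as follows. This is an illustration under the same assumptions the comment makes (that max-running-requests acts as a global cap on the decode side, and that cuda-graph-max-bs times data-parallel-size approximates the sized capacity); `decode_occupancy` is a hypothetical helper, not SGLang code.

```python
def decode_occupancy(concurrency, max_running_requests, cuda_graph_max_bs, dp_size):
    """Split client concurrency into running vs. queued requests on the decode side.

    Assumption: max_running_requests is a global cap across all DP ranks.
    sized_capacity is reported for reference only; it is not applied as a cap.
    """
    sized_capacity = cuda_graph_max_bs * dp_size
    running = min(concurrency, max_running_requests)
    queued = concurrency - running
    return sized_capacity, running, queued


# Config as submitted: cap of 256 at concurrency 7300.
print(decode_occupancy(7300, 256, 180, 40))   # → (7200, 256, 7044)
# With the cap raised to the concurrency target.
print(decode_occupancy(7300, 7300, 180, 40))  # → (7200, 7300, 0)
```

Under these assumptions, raising the cap to 7300 leaves nothing queued behind max-running-requests.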

Impact

This is a freshly added config, so there is no regression to bisect, but it silently produces a misleading data point in a perf sweep that is specifically meant to characterize decode throughput at high concurrency. Across the five 1k1k hightpt points, hightpt_2 will look anomalously low (or anomalously low-utilization on the decode side) and the whole curve becomes uninterpretable around 7300 concurrency.

Fix

Bump the decode-side max-running-requests in this file to match the target concurrency, the same way every sibling does — e.g. max-running-requests: 7300 (mirroring the hightpt_3/hightpt_4 pattern of "decode cap == conc") or max-running-requests: 8192 (mirroring hightpt_0/hightpt_1). One-line change at line 134.

Comment on lines +49 to +52
SGLANG_USE_MESSAGE_QUEUE_BROADCASTER: '0'
SGLANG_DISABLE_TP_MEMORY_INBALANCE_CHECK: '1'
DYN_REQUEST_PLANE: nats
# DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size).

🟡 All 5 lowlat decode configs (1k1k_stp_lowlat_0/1 and 8k1k_stp_lowlat_0/1/2) copy the decode_environment block from the hightpt configs, including SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' and a comment explaining DeepEP per-rank dispatch buffer sizing. However these lowlat decode workers use moe-runner-backend: flashinfer_trtllm and do not set moe-a2a-backend: deepep or deepep-mode — DeepEP is not in use, so the env var is a no-op and the comment (talking about 4096/24 ~= 171) is misleading since lowlat decode runs DP=1 with cuda-graph-max-bs <= 32. Nit/cosmetic — consider dropping both the env var and the comment from the 5 lowlat files.

Extended reasoning...

What's going on

In each of the 5 lowlat decode configs added in this PR:

  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_0.yaml
  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/1k1k/disagg/stp/1k1k_stp_lowlat_1.yaml
  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_0.yaml
  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_1.yaml
  • benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/8k1k/disagg/stp/8k1k_stp_lowlat_2.yaml

the decode_environment block ends with:

      # DeepEP per-rank dispatch buffer; must be >= ceil(cuda_graph_max_bs / dp_size).
      # Default 128 overflows with large DP + batch (e.g. 4096/24 ~= 171 > 128). Limit 1024.
      SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512'

That block was copy-pasted from the hightpt configs where DeepEP is actually enabled (moe-a2a-backend: deepep, deepep-mode: low_latency, deepep-config: /configs/deepep_config.json).

Why this is misleading in the lowlat files

The lowlat decode sglang_config.decode section in each of these files uses:

      tensor-parallel-size: 4
      expert-parallel-size: 1
      data-parallel-size: 1
      enable-flashinfer-allreduce-fusion: true
      moe-runner-backend: flashinfer_trtllm

There is no moe-a2a-backend: deepep and no deepep-mode / deepep-config — DeepEP is not in use, so SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK is a no-op.

The accompanying comment is also wrong-by-context: it talks about ceil(cuda_graph_max_bs / dp_size) with an example of 4096/24 ~= 171, but the lowlat decode configs run with data-parallel-size: 1 and cuda-graph-max-bs of 1, 8, 15, or 32 — the formula gives 1..32, nowhere near the 128 default.
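To make the formula concrete, here is a small sketch that evaluates ceil(cuda_graph_max_bs / dp_size) for the example in the recipe comment and for the lowlat values listed above. The default of 128 is the figure quoted in the recipe comment; `min_dispatch_tokens_per_rank` is a hypothetical helper name.

```python
import math

# Default quoted in the recipe comment, not verified against SGLang source.
DEEPEP_DEFAULT_TOKENS_PER_RANK = 128


def min_dispatch_tokens_per_rank(cuda_graph_max_bs, dp_size):
    """Lower bound from the recipe comment: ceil(cuda_graph_max_bs / dp_size)."""
    return math.ceil(cuda_graph_max_bs / dp_size)


# (cuda-graph-max-bs, data-parallel-size): the comment's example, then lowlat values.
for bs, dp in [(4096, 24), (32, 1), (15, 1), (8, 1), (1, 1)]:
    needed = min_dispatch_tokens_per_rank(bs, dp)
    print(bs, dp, needed, needed > DEEPEP_DEFAULT_TOKENS_PER_RANK)
```

Only the (4096, 24) case exceeds 128; every lowlat combination stays at 32 or below.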

Step-by-step proof for 1k1k_stp_lowlat_0.yaml

  1. Lines 49–52 (decode_environment) set SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK: '512' with the DeepEP buffer comment.
  2. Lines 117–122 (decode sglang_config) set moe-runner-backend: flashinfer_trtllm, data-parallel-size: 1, cuda-graph-max-bs: 32.
  3. No moe-a2a-backend or deepep-mode is set anywhere in the decode block — DeepEP is not invoked.
  4. SGLang reads SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK only on the DeepEP code path, so it has no effect.
  5. The comment's example (4096/24 ~= 171) describes hightpt math (large DP + large batch); for this file the relevant value would be ceil(32/1) = 32, which never approaches the 128 default the comment warns about.

Impact and fix

No functional impact — SGLang silently ignores the unused env var, and benchmark behavior is unchanged. The cost is cosmetic but real: anyone reading these lowlat recipes will see a comment promising DeepEP-related tuning and an env var that doesn't apply, which is exactly the kind of friction that erodes trust in copy-pasted recipes.

Suggested fix: in all 5 lowlat yamls, drop the 3 lines (the two # comments and the SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK entry) from decode_environment. The hightpt files should keep them as-is.
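A check along these lines could catch the same mismatch in future recipes. This is a rough sketch that operates on plain dicts rather than the real recipe files; the function name and the rule that `moe-a2a-backend: deepep` is the signal for DeepEP being active are assumptions drawn from this comment, not from the srt-slurm tooling.

```python
DEEPEP_ENV_VAR = "SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK"


def has_unused_deepep_env(decode_environment, decode_config):
    """True when the DeepEP env var is set but the decode block does not select DeepEP."""
    uses_deepep = decode_config.get("moe-a2a-backend") == "deepep"
    return DEEPEP_ENV_VAR in decode_environment and not uses_deepep


# Shaped like a lowlat decode block: no moe-a2a-backend key at all.
lowlat = has_unused_deepep_env(
    {DEEPEP_ENV_VAR: "512"},
    {"moe-runner-backend": "flashinfer_trtllm", "data-parallel-size": 1},
)
# Shaped like a hightpt decode block: DeepEP explicitly selected.
hightpt = has_unused_deepep_env(
    {DEEPEP_ENV_VAR: "512"},
    {"moe-a2a-backend": "deepep", "deepep-mode": "low_latency"},
)
print(lowlat, hightpt)  # → True False
```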

Comment thread perf-changelog.yaml Outdated
- "Add GLM-5 FP8 GB300 Dynamo SGLang disaggregated multi-node coverage using lmsysorg/sglang:v0.5.11-cu130"
- "1k1k and 8k1k STP hightpt and lowlat srt-slurm recipes under benchmarks/multi_node/srt-slurm-recipes/sglang/glm5/gb300-fp8/ (resolved from upstream srt-slurm PR #160 via srtctl resolve-override)"
- "Wire glm5/fp8 model + dynamo-sglang framework branches into runners/launch_gb300-nv.sh with SA upstream defaults (SLURM_PARTITION=batch_1, SLURM_ACCOUNT=benchmark, SQUASH_FILE under /home/sa-shared/gharunners/squash/)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

🟡 The pr-link for this changelog entry contains the literal placeholder pull/XXXX instead of the actual PR number (1557). All other recent entries (lines 3084, 3091, 3102) use real PR numbers — please substitute XXXX with 1557 before merge so the changelog link resolves correctly.

Extended reasoning...

What the bug is: perf-changelog.yaml line 3110 contains:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX

The XXXX is a literal placeholder that was never replaced with this PR's actual number, 1557 (visible in the PR metadata). The link as written points to /pull/XXXX, which GitHub will resolve to a 404 (or an unrelated page) forever once this is merged.

The code path that triggers it: This is purely a static YAML metadata entry. Anyone — tooling or a human — who walks the changelog and follows the link for the glm5-fp8-gb300-dynamo-sglang entry will hit a broken link.

Why existing code doesn't prevent it: perf-changelog.yaml is a hand-edited document; there is no validator that checks pr-link URLs for placeholder tokens. The PR author clearly intended to fill it in but forgot before pushing.
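A validator of the kind described as missing could be quite small. The following is a minimal sketch, assuming each changelog entry's pr-link has already been extracted as a string; `PR_LINK_RE` and `is_valid_pr_link` are hypothetical names, and nothing like this is known to exist in the repo.

```python
import re

# A GitHub PR link must end in a numeric ID; "XXXX" and similar placeholders fail.
PR_LINK_RE = re.compile(r"https://github\.com/[\w.-]+/[\w.-]+/pull/\d+")


def is_valid_pr_link(url):
    """Return True only if the whole URL is a GitHub PR link with a numeric ID."""
    return PR_LINK_RE.fullmatch(url) is not None


print(is_valid_pr_link("https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX"))  # → False
print(is_valid_pr_link("https://github.com/SemiAnalysisAI/InferenceX/pull/1557"))  # → True
```

Run over every pr-link in the changelog as a CI step, a check like this would fail on the placeholder before merge.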

Impact: Cosmetic / documentation only — no runtime impact, the YAML still parses, the benchmarks still run. But the changelog's whole purpose is to let readers trace each config change back to its PR; a permanently broken link defeats that for this entry, and the mistake will be frozen in git history once merged.

How to fix: A one-line substitution. Replace line 3110 with:

  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1557

Step-by-step proof:

  1. Open perf-changelog.yaml at line 3110. The line reads pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX.
  2. Compare with the most recent prior entries: line 3102 uses /pull/1534, line 3091 uses /pull/1451, line 3084 uses /pull/1548 — all real, resolvable PR numbers.
  3. The PR metadata for this change states <pr number="1557">, so the correct value is 1557, not XXXX.
  4. Once merged, https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX will 404 (GitHub PR URLs require a numeric ID). The changelog entry will be unable to cross-reference its own PR.

Severity: nit — purely documentation/metadata, no functional consequence, but trivial to fix and worth catching pre-merge before it becomes a permanent artifact of git history.

…00-disagg

# Conflicts:
#	.github/configs/nvidia-master.yaml
#	perf-changelog.yaml
#	runners/launch_gb300-nv.sh

@functionstackx

/reuse-sweep-run

@functionstackx functionstackx merged commit c088658 into main May 29, 2026
58 checks passed
@functionstackx functionstackx deleted the yeswanth/glm5-fp8-gb300-disagg branch May 29, 2026 23:09