Add B200 config: glm5-fp4-sglang-mtp by functionstackx · Pull Request #1087 · SemiAnalysisAI/InferenceX

functionstackx · 2026-04-18T04:18:16Z

Status

Draft — opening for review before labeling for sweep.

Summary

Adds glm5-fp4-b200-sglang-mtp config + new benchmarks/single_node/glm5_fp4_b200_mtp.sh launch script.
Launch recipe: follows glm5-fp8-b200-sglang non-MTP script verbatim (as requested), not the existing glm5_fp4_b200.sh recipe. Please sanity-check whether this is intended — the fp8 script has --quantization fp8 and --kv-cache-dtype fp8_e4m3 hardcoded, which may or may not interact correctly with the NVFP4 weights in nvidia/GLM-5-NVFP4.
Adds EAGLE speculative decoding flags on top of the fp8 script: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
Sets SGLANG_ENABLE_SPEC_V2=1 in the env before sglang.launch_server (required for GLM-5 MTP).
Passes --use-chat-template to run_benchmark_serving per the AGENTS.md requirement for all MTP scripts.
Config block: NVFP4 model, fp4 precision, search-space mirrors glm5-fp4-b200-sglang (TP8/EP1 conc 4-4 + TP4/EP1 conc 4-256, for 1k1k and 8k1k), with spec-decoding: mtp on every row.
perf-changelog.yaml diff is append-only.
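As a rough illustration of how the MTP additions sit on top of a base recipe, here is a minimal Python sketch that assembles the environment and argument list described in the summary. The `build_launch` helper and its `base_flags` parameter are hypothetical and not part of the repository; only the flag names and values are taken from this PR.

```python
import shlex

# Speculative-decoding flags exactly as listed in the PR summary.
EAGLE_FLAGS = [
    "--speculative-algorithm", "EAGLE",
    "--speculative-num-steps", "3",
    "--speculative-eagle-topk", "1",
    "--speculative-num-draft-tokens", "4",
]


def build_launch(model, base_flags):
    """Return (extra environment, argv) for an MTP server launch.

    `base_flags` stands in for whatever the non-MTP recipe passes; this
    helper is hypothetical and only shows where the MTP additions sit.
    """
    env = {"SGLANG_ENABLE_SPEC_V2": "1"}  # set before sglang.launch_server
    argv = ["python3", "-m", "sglang.launch_server", "--model-path", model]
    return env, argv + list(base_flags) + EAGLE_FLAGS


env, argv = build_launch(
    "nvidia/GLM-5-NVFP4",
    # Flags copied from the fp8 recipe, as this PR currently does.
    ["--kv-cache-dtype", "fp8_e4m3", "--quantization", "fp8"],
)
print(" ".join(f"{k}={v}" for k, v in env.items()), shlex.join(argv))
```

Keeping the base flags as a parameter makes it easy to swap the quantization flags once the open question in this PR is settled.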

Test plan

YAML parses for both master config and perf-changelog.
bash -n benchmarks/single_node/glm5_fp4_b200_mtp.sh — bash syntax OK.
git diff perf-changelog.yaml shows only additions.
python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml — emits 16 entries (2 ISL/OSL × 2 search-space rows with concurrencies as configured) with spec-decoding=mtp.
Decide whether fp8 launch flags are correct for NVFP4 weights, or if script should switch to fp4-specific quantization.
CI sweep passes on B200 once labeled sweep-enabled.
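The 16-entry count can be reproduced with a small sketch, assuming the concurrency range is expanded by doubling from the start value to the end value. That step rule is an assumption and is not confirmed by this PR; `concurrency_points` is a hypothetical helper, not the logic in generate_sweep_configs.py.

```python
def concurrency_points(start, end):
    """Concurrency values from start to end inclusive, doubling each step (assumed)."""
    points = []
    value = start
    while value <= end:
        points.append(value)
        value *= 2
    return points


# (tensor parallel, expert parallel, conc start, conc end) from the PR summary.
SEARCH_SPACE = [(8, 1, 4, 4), (4, 1, 4, 256)]
SEQ_LENS = ["1k1k", "8k1k"]

entries = [
    {"seq": seq, "tp": tp, "ep": ep, "conc": conc, "spec-decoding": "mtp"}
    for seq in SEQ_LENS
    for tp, ep, lo, hi in SEARCH_SPACE
    for conc in concurrency_points(lo, hi)
]
print(len(entries))  # 16 under the doubling assumption
```

Under that assumption the TP8 row contributes one point and the TP4 row seven, giving eight per ISL/OSL pair and 16 in total.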

🤖 Generated with Claude Code

github-actions · 2026-04-18T04:18:25Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude

Additional findings (outside current diff — PR may have been updated during review):

🟡 perf-changelog.yaml:1538 — The new perf-changelog.yaml entry for glm5-fp4-b200-sglang-mtp uses pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX instead of the actual PR number 1087. Update the placeholder to /pull/1087 before merging.
Extended reasoning...

What the bug is: The newly added perf-changelog.yaml entry (lines 1535–1538 in the diff) for glm5-fp4-b200-sglang-mtp sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX. The PR number (1087) is known at the time of submission and visible in the PR title/URL itself.

The specific code path: The diff appends a new changelog block ending with pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX. This is the only line that needs updating.

Why existing code doesn't prevent it: There is no lint check or CI gate that validates pr-link values are not placeholder strings. The YAML parses successfully with XXXX present, so no automated check catches it.

Impact: The changelog serves as a human-readable audit trail linking each config change to the PR that introduced it. A placeholder XXXX breaks that traceability — anyone reading the changelog later cannot follow the link to see the associated discussion, test results, or review comments. Several other nearby entries also use XXXX (e.g., the entries immediately preceding this one at lines ~1508, 1518, 1528), but those are pre-existing issues from draft PRs that were not updated; the one added by this PR can and should be corrected before merge.

Addressing the refutation: The refutation argues the XXXX placeholder is intentional for draft PRs and will be updated before merge. That may be the workflow intent, but since the PR number is already known (1087), there is no practical reason to leave it as XXXX. Calling this out now is precisely the purpose of review — catching things that are easy to fix pre-merge.

How to fix: In perf-changelog.yaml, change the last line of the newly added block from:
```
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
```
to:
```
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1087
```
Step-by-step proof:
1. PR #1087 is opened (this PR).
2. The author adds a new changelog entry for glm5-fp4-b200-sglang-mtp.
3. The pr-link field is set to .../pull/XXXX — a placeholder.
4. The actual PR number 1087 is visible in the GitHub URL and PR description.
5. After merge, any reader of perf-changelog.yaml who tries to follow the pr-link for glm5-fp4-b200-sglang-mtp will get a 404 or land on an unrelated PR.
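Since no lint check currently catches placeholder links, a check along these lines could be one option. This is a minimal sketch that scans the file as plain text; the `find_placeholder_links` helper and the sample text are hypothetical, and the real changelog schema may differ.

```python
import re

# Captures the URL on a line of the form "pr-link: <url>".
PR_LINK = re.compile(r"^\s*pr-link:\s*(\S+)\s*$")


def find_placeholder_links(text):
    """Return (line number, url) for every pr-link not ending in /pull/<digits>."""
    bad = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = PR_LINK.match(line)
        if match and not re.search(r"/pull/\d+$", match.group(1)):
            bad.append((lineno, match.group(1)))
    return bad


# Hypothetical sample; only the pr-link lines mirror the review above.
sample = """\
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1087
"""
print(find_placeholder_links(sample))
```

A CI step could fail when the returned list is non-empty, though pre-existing XXXX entries would need to be fixed or allow-listed first.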

claude · 2026-04-18T05:48:16Z

+--data-parallel-size 1 --expert-parallel-size 1 \
+--tool-call-parser glm47 \
+--reasoning-parser glm45 \
+--kv-cache-dtype fp8_e4m3 --quantization fp8 \


🔴 The new script passes --quantization fp8 and --kv-cache-dtype fp8_e4m3 (copied verbatim from the FP8 sibling script), but the model is nvidia/GLM-5-NVFP4 which uses pre-quantized NVFP4 weights. The correct flag for NVFP4 weights is --quantization modelopt_fp4, as used by the existing benchmarks/single_node/glm5_fp4_b200.sh (line 42). With the wrong quantization flag, SGLang will either fail at model load or produce invalid benchmark results.

Extended reasoning...

What the bug is and how it manifests

The new launch script benchmarks/single_node/glm5_fp4_b200_mtp.sh (line 48) passes --kv-cache-dtype fp8_e4m3 --quantization fp8 to sglang.launch_server. However, the model being loaded is nvidia/GLM-5-NVFP4 — a model whose weights are already pre-quantized in NVIDIA's NVFP4 (modelopt_fp4) format. The fp8 quantization flags are appropriate for FP8-quantized models like zai-org/GLM-5-FP8, not for NVFP4 weights.

The specific code path that triggers it

When a sweep job runs for the glm5-fp4-b200-sglang-mtp config, the harness selects glm5_fp4_b200_mtp.sh as the launch script. That script calls:

python3 -m sglang.launch_server --model-path= ... --kv-cache-dtype fp8_e4m3 --quantization fp8

with MODEL=nvidia/GLM-5-NVFP4. SGLang will attempt to apply FP8 quantization to weights that are already stored in NVFP4 format.

Why existing safeguards do not catch this

This is a semantic error invisible to bash -n syntax checking (which the PR author confirms passed). The YAML config and model field correctly reference nvidia/GLM-5-NVFP4 with precision: fp4, but the quantization directive in the launch script contradicts this at runtime. There is no static analysis that cross-validates the quantization flag against the model weights format.

Impact

SGLang will either: (a) reject the conflicting quantization scheme and fail to start the server, causing the sweep job to error out; or (b) silently misinterpret the NVFP4 weights under an FP8 quantization scheme, resulting in benchmark numbers that do not represent actual NVFP4 performance. Either outcome invalidates any measurements collected under this config. The config is already labeled sweep-enabled, meaning it could be swept as-is.

How to fix it

Replace line 48 of benchmarks/single_node/glm5_fp4_b200_mtp.sh:

# Before (wrong):
--kv-cache-dtype fp8_e4m3 --quantization fp8 \
# After (correct):
--quantization modelopt_fp4 \

This matches the existing non-MTP counterpart benchmarks/single_node/glm5_fp4_b200.sh line 42. Whether to also add --kv-cache-dtype fp8_e4m3 for FP8 KV cache should be verified against the FP4 B200 recipe.

Step-by-step proof

1. The YAML config glm5-fp4-b200-sglang-mtp sets model: nvidia/GLM-5-NVFP4 and precision: fp4.
2. A sweep job instantiates glm5_fp4_b200_mtp.sh with MODEL=nvidia/GLM-5-NVFP4.
3. Line 48 of that script passes --quantization fp8 to sglang.launch_server.
4. SGLang sees NVFP4 weights but is told to use FP8 quantization, which is a mismatch.
5. Compare with benchmarks/single_node/glm5_fp4_b200.sh line 42, which correctly uses --quantization modelopt_fp4 for the same nvidia/GLM-5-NVFP4 model.
6. The PR description itself flags this as an unresolved TODO: 'Decide whether fp8 launch flags are correct for NVFP4 weights, or if script should switch to fp4-specific quantization.'
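The missing cross-validation could look roughly like the sketch below, which compares the --quantization value in a launch script against the config's precision. The `EXPECTED_QUANT` mapping and both helpers are illustrative assumptions; only the flag values themselves come from this review.

```python
# Expected --quantization value per config precision. Both values appear in
# the review; the mapping itself is an illustrative assumption.
EXPECTED_QUANT = {"fp4": "modelopt_fp4", "fp8": "fp8"}


def quantization_flag(script_text):
    """Return the value following --quantization in a launch script, if any."""
    # Join backslash-continued lines, then split on whitespace.
    tokens = script_text.replace("\\\n", " ").split()
    for i, tok in enumerate(tokens):
        if tok == "--quantization" and i + 1 < len(tokens):
            return tokens[i + 1]
        if tok.startswith("--quantization="):
            return tok.split("=", 1)[1]
    return None


def check_quantization(precision, script_text):
    """Return an error message on mismatch, or None when the flag matches."""
    expected = EXPECTED_QUANT.get(precision)
    found = quantization_flag(script_text)
    if expected is not None and found != expected:
        return f"precision {precision} expects --quantization {expected}, found {found}"
    return None


snippet = r"""
--data-parallel-size 1 --expert-parallel-size 1 \
--kv-cache-dtype fp8_e4m3 --quantization fp8 \
"""
print(check_quantization("fp4", snippet))
```

Whitespace splitting is deliberately simple here; a real check would need to handle quoting and variables in the script.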

Follows the glm5-fp8-b200-sglang non-MTP launch recipe (as requested) and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. SGLANG_ENABLE_SPEC_V2=1 is set before launching the server as required for GLM-5 MTP. Script also passes --use-chat-template to run_benchmark_serving, as required by AGENTS.md for all MTP configs. Config block uses the NVFP4 model (nvidia/GLM-5-NVFP4) and mirrors the existing glm5-fp4-b200-sglang search space.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-project-automation Bot added this to InferenceMAX Board Apr 18, 2026

functionstackx marked this pull request as ready for review April 18, 2026 05:42

functionstackx requested a review from a team April 18, 2026 05:42

functionstackx requested review from jgangani and kedarpotdar-nv as code owners April 18, 2026 05:42

functionstackx added the sweep-enabled label Apr 18, 2026

functionstackx force-pushed the claude/add-glm5-fp4-b200-mtp branch from 1107d9f to ec74c17 Compare April 18, 2026 05:43

claude Bot reviewed Apr 18, 2026

functionstackx force-pushed the claude/add-glm5-fp4-b200-mtp branch from ec74c17 to 76f25d2 Compare April 18, 2026 06:40

functionstackx merged commit d2430ae into main Apr 18, 2026
27 checks passed

functionstackx deleted the claude/add-glm5-fp4-b200-mtp branch April 18, 2026 10:48

github-project-automation Bot moved this to Done in InferenceMAX Board Apr 18, 2026

functionstackx mentioned this pull request May 26, 2026

feat(blog): B200 NVFP4 vs H200 FP8 on GLM-5 — up to 3.65x better perf/$ SemiAnalysisAI/InferenceX-app#386

Merged

6 tasks
