Add B200 config: glm5-fp4-sglang-mtp#1087
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
1107d9f to ec74c17
Additional findings (outside current diff; PR may have been updated during review):

- 🟡 `perf-changelog.yaml:1538`: The new perf-changelog.yaml entry for `glm5-fp4-b200-sglang-mtp` uses `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX` instead of the actual PR number 1087. Update the placeholder to `/pull/1087` before merging.

What the bug is: The newly added perf-changelog.yaml entry (lines 1535–1538 in the diff) for `glm5-fp4-b200-sglang-mtp` sets `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX`. The PR number (1087) is known at the time of submission and visible in the PR title/URL itself.

The specific code path: The diff appends a new changelog block ending with `pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX`. This is the only line that needs updating.

Why existing code doesn't prevent it: There is no lint check or CI gate that validates pr-link values are not placeholder strings. The YAML parses successfully with `XXXX` present, so no automated check catches it.

Impact: The changelog serves as a human-readable audit trail linking each config change to the PR that introduced it. A placeholder `XXXX` breaks that traceability: anyone reading the changelog later cannot follow the link to see the associated discussion, test results, or review comments. Several other nearby entries also use `XXXX` (e.g., the entries immediately preceding this one at lines ~1508, 1518, 1528), but those are pre-existing issues from draft PRs that were not updated; the one added by this PR can and should be corrected before merge.

Addressing the refutation: The refutation argues the XXXX placeholder is intentional for draft PRs and will be updated before merge. That may be the workflow intent, but since the PR number is already known (1087), there is no practical reason to leave it as `XXXX`. Calling this out now is precisely the purpose of review: catching things that are easy to fix pre-merge.

How to fix: In `perf-changelog.yaml`, change the last line of the newly added block from:
`pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX`
to:
`pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1087`

Step-by-step proof:
- PR #1087 is opened (this PR).
- The author adds a new changelog entry for `glm5-fp4-b200-sglang-mtp`.
- The `pr-link` field is set to `.../pull/XXXX`, a placeholder.
- The actual PR number 1087 is visible in the GitHub URL and PR description.
- After merge, any reader of perf-changelog.yaml who tries to follow the pr-link for `glm5-fp4-b200-sglang-mtp` will get a 404 or land on an unrelated PR.
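The review notes that no lint check rejects placeholder `pr-link` values. A minimal sketch of such a check follows, assuming only that each entry carries a `pr-link:` line whose value should end in `/pull/<digits>`. The function name `find_placeholder_pr_links` is hypothetical and not part of the repository.

```python
import re

# Assumption: a valid link is an https URL ending in /pull/<digits>.
VALID_PR_LINK = re.compile(r"https://\S+/pull/\d+")
PR_LINK_LINE = re.compile(r"pr-link:\s*(\S+)")


def find_placeholder_pr_links(text):
    """Return (line_number, stripped_line) for every pr-link without a numeric PR id."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = PR_LINK_LINE.search(line)
        # Flag the line when the link value is not a fully numeric PR URL.
        if match and not VALID_PR_LINK.fullmatch(match.group(1)):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    sample = (
        "  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/XXXX\n"
        "  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1087\n"
    )
    for number, line in find_placeholder_pr_links(sample):
        print(f"line {number}: placeholder pr-link: {line}")
```

Working on raw text rather than parsed YAML keeps the sketch dependency-free and lets it report line numbers directly. A real CI gate might instead parse the file and exit non-zero when any hit is found.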
    --data-parallel-size 1 --expert-parallel-size 1 \
    --tool-call-parser glm47 \
    --reasoning-parser glm45 \
    --kv-cache-dtype fp8_e4m3 --quantization fp8 \
🔴 The new script passes --quantization fp8 and --kv-cache-dtype fp8_e4m3 (copied verbatim from the FP8 sibling script), but the model is nvidia/GLM-5-NVFP4 which uses pre-quantized NVFP4 weights. The correct flag for NVFP4 weights is --quantization modelopt_fp4, as used by the existing benchmarks/single_node/glm5_fp4_b200.sh (line 42). With the wrong quantization flag, SGLang will either fail at model load or produce invalid benchmark results.
What the bug is and how it manifests
The new launch script benchmarks/single_node/glm5_fp4_b200_mtp.sh (line 48) passes --kv-cache-dtype fp8_e4m3 --quantization fp8 to sglang.launch_server. However, the model being loaded is nvidia/GLM-5-NVFP4 — a model whose weights are already pre-quantized in NVIDIA's NVFP4 (modelopt_fp4) format. The fp8 quantization flags are appropriate for FP8-quantized models like zai-org/GLM-5-FP8, not for NVFP4 weights.
The specific code path that triggers it
When a sweep job runs for the glm5-fp4-b200-sglang-mtp config, the harness selects glm5_fp4_b200_mtp.sh as the launch script. That script calls:

    python3 -m sglang.launch_server --model-path= ... --kv-cache-dtype fp8_e4m3 --quantization fp8

with MODEL=nvidia/GLM-5-NVFP4. SGLang will attempt to apply FP8 quantization to weights that are already stored in NVFP4 format.
Why existing safeguards do not catch this
This is a semantic error invisible to bash -n syntax checking (which the PR author confirms passed). The YAML config and model field correctly reference nvidia/GLM-5-NVFP4 with precision: fp4, but the quantization directive in the launch script contradicts this at runtime. There is no static analysis that cross-validates the quantization flag against the model weights format.
Impact
SGLang will either: (a) reject the conflicting quantization scheme and fail to start the server, causing the sweep job to error out; or (b) silently misinterpret the NVFP4 weights under an FP8 quantization scheme, resulting in benchmark numbers that do not represent actual NVFP4 performance. Either outcome invalidates any measurements collected under this config. The config is already labeled sweep-enabled, meaning it could be swept as-is.
How to fix it
Replace line 48 of benchmarks/single_node/glm5_fp4_b200_mtp.sh:

    # Before (wrong):
    --kv-cache-dtype fp8_e4m3 --quantization fp8 \
    # After (correct):
    --quantization modelopt_fp4 \

This matches the existing non-MTP counterpart benchmarks/single_node/glm5_fp4_b200.sh line 42. Whether to also add --kv-cache-dtype fp8_e4m3 for FP8 KV cache should be verified against the FP4 B200 recipe.
Step-by-step proof
- The YAML config `glm5-fp4-b200-sglang-mtp` sets `model: nvidia/GLM-5-NVFP4` and `precision: fp4`.
- A sweep job instantiates `glm5_fp4_b200_mtp.sh` with `MODEL=nvidia/GLM-5-NVFP4`.
- Line 48 of that script passes `--quantization fp8` to `sglang.launch_server`.
- SGLang sees NVFP4 weights but is told to use FP8 quantization, which is a mismatch.
- Compare with `benchmarks/single_node/glm5_fp4_b200.sh` line 42, which correctly uses `--quantization modelopt_fp4` for the same `nvidia/GLM-5-NVFP4` model.
- The PR description itself flags this as an unresolved TODO: 'Decide whether fp8 launch flags are correct for NVFP4 weights, or if script should switch to fp4-specific quantization.'
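The comment observes that nothing cross-validates the quantization flag against the model's weight format. The sketch below illustrates one way such a check could look. The mapping from model-name suffix to expected flag is an assumption drawn only from this review (NVFP4 to `modelopt_fp4`, FP8 to `fp8`), and the helper names `quantization_flag` and `check_quantization` are hypothetical.

```python
import shlex

# Assumption: model-name suffix implies the expected --quantization value,
# based solely on the two models discussed in this review.
EXPECTED_QUANT = {"NVFP4": "modelopt_fp4", "FP8": "fp8"}


def quantization_flag(command):
    """Extract the value passed to --quantization, or None if absent."""
    # Join backslash-continued lines before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    for index, token in enumerate(tokens):
        if token == "--quantization" and index + 1 < len(tokens):
            return tokens[index + 1]
        if token.startswith("--quantization="):
            return token.split("=", 1)[1]
    return None


def check_quantization(model, command):
    """Return None when the flag matches the model suffix, else a message."""
    for suffix, expected in EXPECTED_QUANT.items():
        if model.upper().endswith(suffix):
            actual = quantization_flag(command)
            if actual != expected:
                return f"{model}: expected --quantization {expected}, got {actual}"
            return None
    return None  # Unknown suffix: nothing to validate in this sketch.


if __name__ == "__main__":
    launch = "python3 -m sglang.launch_server --kv-cache-dtype fp8_e4m3 --quantization fp8"
    print(check_quantization("nvidia/GLM-5-NVFP4", launch))
```

Matching on the model-name suffix is a heuristic. A sturdier check would likely read the quantization metadata shipped with the checkpoint, which this sketch does not attempt.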
Follows the glm5-fp8-b200-sglang non-MTP launch recipe (as requested) and adds EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4) via the standard spec-decoding=mtp suffix. SGLANG_ENABLE_SPEC_V2=1 is set before launching the server as required for GLM-5 MTP. Script also passes --use-chat-template to run_benchmark_serving, as required by AGENTS.md for all MTP configs. Config block uses the NVFP4 model (nvidia/GLM-5-NVFP4) and mirrors the existing glm5-fp4-b200-sglang search space.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ec74c17 to 76f25d2
Status
Draft — opening for review before labeling for sweep.
Summary
- Adds the `glm5-fp4-b200-sglang-mtp` config + new `benchmarks/single_node/glm5_fp4_b200_mtp.sh` launch script.
- Follows the `glm5-fp8-b200-sglang` non-MTP script verbatim (as requested), not the existing `glm5_fp4_b200.sh` recipe. Please sanity-check whether this is intended: the fp8 script has `--quantization fp8` and `--kv-cache-dtype fp8_e4m3` hardcoded, which may or may not interact correctly with the NVFP4 weights in `nvidia/GLM-5-NVFP4`.
- Adds EAGLE speculative decoding: `--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4`.
- Sets `SGLANG_ENABLE_SPEC_V2=1` in the env before `sglang.launch_server` (required for GLM-5 MTP).
- Passes `--use-chat-template` to `run_benchmark_serving` per the AGENTS.md requirement for all MTP scripts.
- Search space mirrors `glm5-fp4-b200-sglang` (TP8/EP1 conc 4-4 + TP4/EP1 conc 4-256, for 1k1k and 8k1k), with `spec-decoding: mtp` on every row.
- `perf-changelog.yaml` diff is append-only.

Test plan
- `bash -n benchmarks/single_node/glm5_fp4_b200_mtp.sh`: bash syntax OK.
- `git diff perf-changelog.yaml` shows only additions.
- `python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/nvidia-master.yaml`: emits 16 entries (2 ISL/OSL × 2 search-space rows with concurrencies as configured) with spec-decoding=mtp.
- `sweep-enabled`.

🤖 Generated with Claude Code
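As a rough sanity check on the 16-entry count, the sketch below enumerates the search space described in the summary. It assumes concurrency is stepped by doubling from the start to the end of each range. The source does not state the step, and the names `concurrency_points`, `SEARCH_SPACE`, and `SEQ_LENS` are hypothetical and do not reflect how `generate_sweep_configs.py` is implemented.

```python
def concurrency_points(start, end, factor=2):
    """List concurrencies from start to end inclusive, multiplying by factor.

    Assumption: the sweep doubles concurrency at each step.
    """
    points = []
    value = start
    while value <= end:
        points.append(value)
        value *= factor
    return points


# Search-space rows as described in the PR summary.
SEARCH_SPACE = [
    {"tp": 8, "ep": 1, "conc_start": 4, "conc_end": 4},
    {"tp": 4, "ep": 1, "conc_start": 4, "conc_end": 256},
]
SEQ_LENS = ["1k1k", "8k1k"]


def count_entries(search_space, seq_lens):
    """Count one entry per (sequence length, row, concurrency) combination."""
    per_seq_len = sum(
        len(concurrency_points(row["conc_start"], row["conc_end"]))
        for row in search_space
    )
    return per_seq_len * len(seq_lens)


if __name__ == "__main__":
    print(count_entries(SEARCH_SPACE, SEQ_LENS))
```

Under the doubling assumption, the TP8 row contributes one concurrency and the TP4 row contributes seven (4 through 256), giving eight entries per sequence-length pair and 16 in total, consistent with the count reported in the test plan.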