feat(blog): add MI355X vs B200 GLM-5 FP8 SGLang post #378
Merged
14 weeks after GLM-5's release, MI355X SGLang FP8 undercuts B200 SGLang FP8 per million tokens across the single-node Pareto on 8k/1k — peak 1.41x with MTP at 18 tok/s/user, 1.36x non-MTP at 10 tok/s/user. Walks through SGLang PR #21511 (HaiShaw) fusing QK rope cat + MLA cache + FP8 quant on MI355 via TileLang. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Removes the redundant kernel-fusion recap (already covered in the "What Shipped to Make This Happen" section) and lifts the MI355X capability sentence into its own paragraph for clearer pacing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kill Removes the stray blank line between the MTP iso-interactivity table header and its data rows that was preventing markdown from parsing them as a table (rendering all rows as a single pipe-delimited paragraph instead). Also adds .claude/skills/write-inferencex-blog/SKILL.md, codifying the structure, numeric-verification workflow, frontmatter, MDX components, dashboard-link conventions, and FAQ JSON-LD pattern that this PR's post follows — so future InferenceX blog posts can be authored against a consistent template. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
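The table-parsing failure described in this commit can be caught mechanically. The following is a minimal sketch, not part of the PR: `find_table_gaps` is a hypothetical helper name, and the heuristic assumes table rows start with a pipe character. It flags any blank line that sits directly between two pipe-delimited rows, which is the pattern that makes a markdown renderer stop treating the following rows as part of the table.

```python
def is_row(line: str) -> bool:
    """Heuristic: a markdown table row starts with a pipe after trimming."""
    return line.strip().startswith("|")


def find_table_gaps(markdown: str) -> list[int]:
    """Return 1-based line numbers of blank lines sandwiched between table rows."""
    lines = markdown.splitlines()
    gaps = []
    for i in range(1, len(lines) - 1):
        if lines[i].strip() == "" and is_row(lines[i - 1]) and is_row(lines[i + 1]):
            gaps.append(i + 1)
    return gaps


# Placeholder tables (not data from the post): one broken, one fixed.
broken = "| x | y |\n| --- | --- |\n\n| 1 | 2 |\n"
fixed = "| x | y |\n| --- | --- |\n| 1 | 2 |\n"

print(find_table_gaps(broken))
print(find_table_gaps(fixed))
```

Two intentionally separate tables with only a blank line between them would also be flagged, so the output is a prompt for manual review rather than a definitive error list.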
B200 ran on lmsysorg/sglang:v0.5.12-cu130; MI355X ran on lmsysorg/sglang-rocm:v0.5.12-rocm720-mi35x-20260517. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There is no MI355X GLM-5 disagg or wide-EP recipe yet. Updates both the What's Next bullet and the matching FAQ answer to state the gap directly rather than implying a recipe exists but underperforms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…llout Replaces "playbook exists" framing with the direct statement that AMD has still not shipped disagg for GLM-5. Applied to both the bullet and the matching FAQ answer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Data run date (2026-05-20) stays as-is in the body since that's when the InferenceX measurement happened. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit e8a9524.
1. Soften "across the entire Pareto" claim in lede and subtitle to "across most of the Pareto" with the ~10-77 tok/s/user range called out explicitly. The MTP table already shows B200 noses ahead above ~90 tok/s/user.
2. Correct "TP=4 dominates across the whole range" in the iso-interactivity intro: TP=4 dominates up to ~77 tok/s/user; TP=8 conc 4 takes over at ~90 tok/s/user where TP=4 can't reach.
3. Fix FAQ overstatement: MTP "roughly doubles" -> "lifts ~1.34x" on the cited concurrency 32 data point (1,274 -> 1,707 tok/s/GPU).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
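The corrected FAQ figure in item 3 can be checked directly from the two throughput numbers quoted in the commit message. This is a small sketch for illustration; the `speedup` helper and variable names are hypothetical and not taken from the repository.

```python
def speedup(baseline: float, improved: float) -> float:
    """Return improved / baseline as a multiplier."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return improved / baseline


# Figures quoted in the commit message (tok/s/GPU at concurrency 32).
non_mtp = 1274.0
mtp = 1707.0

ratio = speedup(non_mtp, mtp)
print(f"MTP uplift: {ratio:.2f}x")  # prints "MTP uplift: 1.34x"
```

The result is well short of 2x, which is why "roughly doubles" was replaced with "lifts ~1.34x".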

Summary
- … fused_qk_rope_cat_and_cache_mla for both Q and KV quant on MI355
- … g_runid=26187777287); chart preset linked from both DashboardCTA blocks

Test plan
- pnpm dev → /blog/mi355x-glm5-fp8-sglang-40-cheaper-than-b200)
- … i_metric=y_costh and the four series active

🤖 Generated with Claude Code
Note
Low Risk
Content-only additions (documentation skill and static MDX); no application logic, auth, or data pipeline changes.
Overview
Adds a Claude skill (.claude/skills/write-inferencex-blog/SKILL.md) that documents how to draft InferenceX benchmark posts: source-of-truth priority (CSV vs chart), TCO/cost formulas, slug/frontmatter, MDX sections (DashboardCTA, Figure, FAQJsonLd), and commit/PR workflow. It points at this post as the AMD-vs-NVIDIA single-node cost template.
Publishes a new MDX article at packages/app/content/blog/mi355x-glm5-fp8-sglang-40-cheaper-than-b200.mdx claiming MI355X SGLang FP8 on GLM-5 8k/1k is up to 40% cheaper per million tokens than B200 (peak 1.41x with MTP at 18 tok/s/user), tied to SGLang PR #21511 and InferenceX PR #1440, with per-concurrency tables, iso-interactivity comparisons (including where B200 wins above ~90 tok/s/user), preset dashboard links, and five FAQ JSON-LD entries.
Reviewed by Cursor Bugbot for commit c2f98a5.