
[None][doc] Refactor blog18 #13956

Merged
bobboli merged 1 commit into NVIDIA:main from bobboli:update_blog18_alltoall
May 11, 2026

Conversation

@bobboli
Collaborator

@bobboli bobboli commented May 10, 2026

Summary

Restructures the Performance Benchmark section of blog18 into focused subsections (Methodology / Scaling With EP Size / Post-Quant Dispatch / Latency Floor / Reproduction) and adds new MXFP8 and NVFP4 results so the post-quant story is no longer hypothetical.

  • Methodology now spells out `bytes_per_token` per recipe (BF16 / MXFP8 / NVFP4) and clarifies that the reported bandwidth is logical — includes the local-rank fraction of traffic — matching the convention used by other MoE comm libraries.
  • Scaling With EP Size retains the BF16 ep ∈ {8, 16, 32, 64} sweep with corrected GB/s numbers (the previous tables conflated FP8 and BF16 byte counts on the dispatch column; both phases ship BF16 since blockwise FP8 has no post-quant dispatch path today).
  • Post-Quant Dispatch (new) — MXFP8 hits 1.81× speedup vs BF16 at ep=8 / bsz=2048; NVFP4 hits 3.06×, both close to their byte-ratio asymptotes. Includes a new `tech_blog18_post_quant_dispatch.png` chart.

Bandwidth chart re-rendered as a landscape side-by-side panel using BF16 byte counts throughout. Adds reference figures for quant formats and the dispatch-MoE-combine R0 detail; re-renders the rank-major vs expert-major figure.
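
To make the "byte-ratio asymptote" in the Post-Quant Dispatch bullet concrete, here is a minimal sketch of how `bytes_per_token` could be computed per recipe. The hidden size (7168) and the scale-factor layouts (one 1-byte scale per 32 elements for MXFP8, one 1-byte scale per 16 elements for NVFP4) are assumptions for illustration; the blog's Methodology table may count bytes differently.

```python
# Illustrative only: the hidden size and scale-factor layouts below are
# assumptions, not values taken from the blog's Methodology table.
HIDDEN_SIZE = 7168  # hypothetical hidden dimension


def bytes_per_token(recipe: str, hidden_size: int = HIDDEN_SIZE) -> float:
    """Payload bytes for one token's activation under a given recipe."""
    if recipe == "bf16":
        # 2 bytes per element, no scale factors.
        return 2.0 * hidden_size
    if recipe == "mxfp8":
        # 1 byte per element plus one 1-byte scale per 32-element block.
        return 1.0 * hidden_size + hidden_size / 32
    if recipe == "nvfp4":
        # 0.5 byte per element plus one 1-byte scale per 16-element block.
        return 0.5 * hidden_size + hidden_size / 16
    raise ValueError(f"unknown recipe: {recipe}")


def byte_ratio_vs_bf16(recipe: str, hidden_size: int = HIDDEN_SIZE) -> float:
    """Upper bound on dispatch speedup if latency scaled purely with bytes."""
    return bytes_per_token("bf16", hidden_size) / bytes_per_token(recipe, hidden_size)


if __name__ == "__main__":
    for recipe in ("bf16", "mxfp8", "nvfp4"):
        print(recipe, bytes_per_token(recipe), round(byte_ratio_vs_bf16(recipe), 3))
```

Under these assumptions the ratios come out near 1.94 for MXFP8 and 3.56 for NVFP4, so the reported 1.81× and 3.06× speedups would sit somewhat below the bound, as expected when fixed synchronization cost does not shrink with payload size.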

Test plan

  • Markdown structure verified (no broken anchors / TOC consistent).
  • Numbers cross-checked against `tests/microbenchmarks/bench_moe_comm.py` JSON output for ep=8 BF16 / MXFP8 / NVFP4 runs.
  • Reviewer to spot-check the chart against the table values.

Summary by CodeRabbit

  • Documentation
    • Updated blog article on MoE communication optimization with refined terminology and improved framework descriptions.
    • Enhanced performance benchmarking section with updated bandwidth measurements and comprehensive methodology details.
    • Expanded discussion of dispatch optimization techniques with updated performance metrics.
    • Restructured sections for improved clarity and navigation.


@bobboli bobboli requested a review from a team as a code owner May 10, 2026 14:48
@bobboli bobboli requested review from QiJune and arysef May 10, 2026 14:48
@coderabbitai
Contributor

coderabbitai Bot commented May 10, 2026

📝 Walkthrough

Walkthrough

This PR updates a technical blog post documenting NVIDIA's NVLink one-sided AlltoAll optimization for MoE communication. Changes include renaming concepts (dispatch push and combine pull in place of dispatch put and combine get), expanding the raw-token data layout explanation, rewriting the performance benchmarking section with a detailed methodology and updated metrics, adding a post-quantization dispatch analysis, and restructuring the future-work guidance.

Changes

MoE Communication Blog Article Update

Layer / File(s) Summary
Navigation and Terminology Foundations
docs/source/blogs/tech_blog/blog18_Optimizing_MoE_Communication_with_One_Sided_AlltoAll_Over_NVLink.md (lines 14–28, 43)
Table of contents restructured to add "Raw Token Data Layout," "Quantization-Agnostic Communication," methodology, scaling, post-quant dispatch, latency floor, and reproduction sections. Design overview updated to reference "raw-token data layout" rather than "token-major."
Core Communication Concepts
docs/source/blogs/tech_blog/blog18_Optimizing_MoE_Communication_with_One_Sided_AlltoAll_Over_NVLink.md (lines 61–81)
One-sided communication section reframed with push (dispatch) and pull (combine) semantics. Raw token data layout expanded with explanation of token delivery, deduplication behavior for multiple experts on same rank, and smaller recv buffer requirements.
Interface and Mechanism Details
docs/source/blogs/tech_blog/blog18_Optimizing_MoE_Communication_with_One_Sided_AlltoAll_Over_NVLink.md (lines 113–140)
Interface description updated to reference raw-token layout for recv buffer allocation. "Dispatch Put and Combine Get" section renamed to "Dispatch Push and Combine Pull" with expanded description of atomic-based slot assignment, deduplication, combine's reuse of routing for weighted reduction, and zero-copy path where MoE output writes directly to symmetric workspace.
Performance Methodology and Analysis
docs/source/blogs/tech_blog/blog18_Optimizing_MoE_Communication_with_One_Sided_AlltoAll_Over_NVLink.md (lines 169–211, 272–320)
Performance benchmarking completely rewritten with detailed methodology including bytes_per_token table for BF16, MXFP8, and NVFP4; clarified bandwidth calculation and timing scope; refreshed dispatch/combine latency and bandwidth tables for ep_size (8). Added post-quantization dispatch section with new recipe comparison table, speedup and GB/s observations. Updated latency floor narrative with quantified statement that synchronization accounts for ~40% of dispatch time at batch size 1, decreasing to ~7% at batch size 2048.
Reproduction and Future Work
docs/source/blogs/tech_blog/blog18_Optimizing_MoE_Communication_with_One_Sided_AlltoAll_Over_NVLink.md (lines 321–341)
Reproduction section updated and reformatted. Future work and conclusion sections restructured with explicit future-work bullet points and updated description of NVLinkOneSided AlltoAll's role as default communication strategy within single NVLink domain in TensorRT-LLM.
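
The slot assignment and deduplication described for "Dispatch Push and Combine Pull" can be pictured with a small host-side simulation. This is a sketch under stated assumptions, not the kernel from the blog: the function name `dispatch_push`, the contiguous expert-to-rank mapping (`expert_id // experts_per_rank`), and the plain Python counter standing in for a GPU atomic fetch-and-add are all hypothetical.

```python
from collections import defaultdict


def dispatch_push(topk_experts, experts_per_rank, ep_size):
    """Simulate raw-token dispatch: one copy per (token, destination rank).

    topk_experts[t] lists the expert ids selected for token t.
    Returns {rank: [token ids in the order their slots were claimed]}.
    """
    next_slot = [0] * ep_size  # stands in for a per-rank atomic counter
    recv = defaultdict(list)
    for token, experts in enumerate(topk_experts):
        # Deduplicate: several experts on the same rank need only one copy.
        dest_ranks = sorted({e // experts_per_rank for e in experts})
        for rank in dest_ranks:
            if not 0 <= rank < ep_size:
                raise ValueError(f"expert maps to invalid rank {rank}")
            slot = next_slot[rank]  # a real kernel would use fetch-and-add
            next_slot[rank] += 1
            recv[rank].append(token)  # token lands in slot `slot` on `rank`
    return dict(recv)


if __name__ == "__main__":
    # 3 tokens, top-2 routing, 4 experts spread over 2 ranks.
    routing = [[0, 1], [1, 2], [3, 0]]
    print(dispatch_push(routing, experts_per_rank=2, ep_size=2))
```

In this toy routing, token 0 selects two experts on rank 0 and is sent once, so five copies are pushed in total instead of six. That is the sense in which the raw-token layout needs a smaller recv buffer than a per-expert layout.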

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The PR title '[None][doc] Refactor blog18' is vague and generic. While it indicates a documentation change to blog18, it does not convey the specific nature of the changes (performance section restructuring, post-quant dispatch results, benchmark updates). | Consider a more descriptive title like '[doc] Restructure blog18 perf section + add post-quant dispatch results' to better reflect the main changes and help reviewers understand the scope of the update. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description clearly explains the changes, objectives, and test plan for the blog restructuring and benchmark updates. |


Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/source/blogs/tech_blog/blog18_Optimizing_MoE_Communication_with_One_Sided_AlltoAll_Over_NVLink.md`:
- Around lines 175-177: The fenced code block containing the formula `bandwidth = batch_size × min(ep_size, top_k) × bytes_per_token / latency` lacks a language identifier. Add a language to the opening fence (e.g., `text` or `python`) so that Markdown lint (MD040) passes and syntax highlighting works. Leave the formula line itself, with the variables bandwidth, batch_size, ep_size, top_k, bytes_per_token, and latency, unchanged.
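
For readers who want to apply the quoted formula to their own measurements, a minimal Python sketch follows. The function name, the microsecond latency unit, and the decimal GB/s conversion (1e9 bytes) are assumptions for illustration; only the formula itself comes from the blog. The `min(ep_size, top_k)` factor presumably reflects that a token is sent at most once per destination rank.

```python
def logical_bandwidth_gbps(batch_size: int, ep_size: int, top_k: int,
                           bytes_per_token: float, latency_us: float) -> float:
    """bandwidth = batch_size * min(ep_size, top_k) * bytes_per_token / latency.

    Returns GB/s, assuming latency is given in microseconds and 1 GB = 1e9 bytes.
    """
    if latency_us <= 0:
        raise ValueError("latency_us must be positive")
    total_bytes = batch_size * min(ep_size, top_k) * bytes_per_token
    return total_bytes / (latency_us * 1e-6) / 1e9


if __name__ == "__main__":
    # All inputs here are made-up placeholders, not measured results.
    print(logical_bandwidth_gbps(batch_size=2048, ep_size=8, top_k=8,
                                 bytes_per_token=14336, latency_us=1000.0))
```

Because the PR describes this bandwidth as logical (it includes the local-rank fraction of traffic), the value is a convention for comparing runs rather than a direct measure of bytes crossing NVLink.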

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7acf4322-7919-4306-abf6-3e7ba1d94713

📥 Commits

Reviewing files that changed from the base of the PR and between afe1a31 and fab1e30.

⛔ Files ignored due to path filters (9)
  • docs/source/blogs/media/tech_blog18_bandwidth.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog18_dispatch_moe_combine.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog18_dispatch_moe_combine_R0.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog18_one_sided_vs_two_sided.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog18_post_quant_dispatch.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog18_quant_formats.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog18_rank_major_vs_expert_major.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog18_raw_tokens_vs_permuted_tokens.png is excluded by !**/*.png
  • docs/source/blogs/media/tech_blog18_token_major_vs_expert_major.png is excluded by !**/*.png
📒 Files selected for processing (1)
  • docs/source/blogs/tech_blog/blog18_Optimizing_MoE_Communication_with_One_Sided_AlltoAll_Over_NVLink.md

@bobboli bobboli changed the title from "[None][doc] Restructure blog18 perf section + add post-quant dispatch results" to "[None][doc] Refactor blog18" May 10, 2026
@bobboli bobboli force-pushed the update_blog18_alltoall branch from 6e62b8c to ea43673 May 10, 2026 18:01
@bobboli
Collaborator Author

bobboli commented May 10, 2026

/bot run

Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
@bobboli bobboli force-pushed the update_blog18_alltoall branch from ea43673 to 59590cc May 10, 2026 18:03
@tensorrt-cicd
Collaborator

PR_Github #47610 [ run ] triggered by Bot. Commit: 59590cc Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #47610 [ run ] completed with state SUCCESS. Commit: 59590cc
/LLM/main/L0_MergeRequest_PR pipeline #37516 completed with status: 'SUCCESS'

CI Report

Link to invocation

@bobboli bobboli merged commit 944b7eb into NVIDIA:main May 11, 2026
7 of 10 checks passed
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
3 participants