[None][doc] Blogpost for Helix Parallelism by brb-nv · Pull Request #13547 · NVIDIA/TensorRT-LLM

brb-nv · 2026-04-28T06:10:09Z

Description

This PR adds a blog post on Helix Parallelism.

Test Coverage

N/A

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Documentation
- Added a new blog post detailing Helix Parallelism techniques for scaling multi-million-token decoding with KV cache sharding, including performance optimization strategies, distributed architecture insights, and practical implementation considerations with performance benchmarks.

coderabbitai · 2026-04-28T06:12:13Z

📝 Walkthrough

A new blog post is added documenting Helix Parallelism for multi-million-token decoding with KV cache sharding. The post covers decoding bottlenecks, temporal disaggregation of parallelism between attention and FFN phases, distributed KV cache partitioning strategies, and TensorRT-LLM integration points with performance results.

Changes

Cohort / File(s)	Summary
Documentation - Blog Post `docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md`	New technical blog post (269 lines) detailing Helix Parallelism architecture, KV cache sharding techniques, distributed attention reconstruction via log-sum-exp rescaling, KV cache partitioning policies, TensorRT-LLM integration configuration and custom CUDA collectives, with DeepSeek-R1 performance analysis.
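
The "distributed attention reconstruction via log-sum-exp rescaling" mentioned in the summary can be illustrated with a small sketch. It assumes each KV shard returns a locally normalized attention output together with the log-sum-exp (LSE) of its local scores, and that the partial results are then reweighted by their LSE values. The function names `partial_attention` and `merge_partials` are hypothetical and are not taken from TensorRT-LLM or the blog post; the usual `1/sqrt(d)` score scaling is omitted for brevity.

```python
import math

def partial_attention(q, keys, values):
    """Attention of one query over one KV shard.

    Returns (locally normalized output, log-sum-exp of the shard's scores).
    Hypothetical helper for illustration only.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                                  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) / denom for d in range(dim)]
    return out, m + math.log(denom)

def merge_partials(partials):
    """Combine per-shard (output, lse) pairs into the full-softmax output."""
    max_lse = max(lse for _, lse in partials)
    # Each shard's weight is its softmax denominator, rescaled by the largest LSE.
    scales = [math.exp(lse - max_lse) for _, lse in partials]
    total = sum(scales)
    dim = len(partials[0][0])
    return [sum(s * out[d] for s, (out, _) in zip(scales, partials)) / total
            for d in range(dim)]

# Toy data: four cached tokens split across two shards.
q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

full, _ = partial_attention(q, keys, values)
merged = merge_partials([
    partial_attention(q, keys[:2], values[:2]),
    partial_attention(q, keys[2:], values[2:]),
])
```

In this toy setup `merged` matches `full` up to floating-point error, which is the property that lets the KV cache be sharded across devices without changing the attention result.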

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~15 minutes

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly identifies that this PR adds documentation (blog post) for Helix Parallelism, which matches the changeset that introduces a new blog post file.
Description check	✅ Passed	The PR description includes a brief explanation of the changes (blogpost for Helix Parallelism), marks test coverage as N/A (appropriate for documentation), and includes the required checklist with confirmation, matching the template structure.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md (1)

33-33: Small wording polish for concision.

“On top of that” reads a bit wordy here; a shorter transition improves flow.

Suggested wording tweak

-...preserving long-range context is essential for relevance and coherence. On top of that, users expect fast, interactive responses.
+...preserving long-range context is essential for relevance and coherence. Users also expect fast, interactive responses.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In
`@docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md`
at line 33, The sentence "On top of that, users expect fast, interactive
responses." is wordy; replace the transition "On top of that" with a shorter
alternative (e.g., "Additionally," or "Furthermore,") in the paragraph that
begins "Modern AI applications increasingly rely on models..." so the line reads
concisely like "Additionally, users expect fast, interactive responses." to
improve flow and concision.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md`:
- Around line 47-49: The text uses the ambiguous acronym "TTL" when discussing
decode responsiveness; update the two occurrences of "TTL" in the paragraph (the
instances describing DRAM bandwidth/attention ceilings and batch-size
constraints) to a clear term such as "latency" or "TTFT (time-to-first-token)"
and ensure a brief parenthetical clarifier is added on first use (e.g., "latency
(time-to-first-token, TTFT)") so subsequent mentions can use "latency" or "TTFT"
consistently to avoid misinterpretation.


ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 90cb031a-248f-4049-9cef-8923899b0990

📥 Commits

Reviewing files that changed from the base of the PR and between f3270f9 and 1dba74b.

⛔ Files ignored due to path filters (4)

docs/source/blogs/media/tech_blog21_attention_sharding.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog21_dsr1_fp4_pareto.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog21_helix_execution_flow.png is excluded by !**/*.png
docs/source/blogs/media/tech_blog21_roofline_analysis.png is excluded by !**/*.png

📒 Files selected for processing (1)

docs/source/blogs/tech_blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md

pcastonguay

Overall looks good.

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

laikhtewari

Please ensure data is measured, not simulated. Great blog.

brb-nv · 2026-05-05T17:43:33Z

/bot skip --comment "docs update only"

tensorrt-cicd · 2026-05-05T17:53:01Z

PR_Github #46847 [ skip ] triggered by Bot. Commit: 167b795 Link to invocation

tensorrt-cicd · 2026-05-05T18:06:01Z

PR_Github #46847 [ skip ] completed with state SUCCESS. Commit: 167b795
Skipping testing for commit 167b795

Link to invocation

brb-nv requested a review from a team as a code owner April 28, 2026 06:10

brb-nv requested review from QiJune and Shixiaowei02 April 28, 2026 06:10

github-actions Bot assigned brb-nv Apr 28, 2026

brb-nv requested review from juney-nvidia, pcastonguay and schetlur-nv April 28, 2026 06:10

coderabbitai Bot reviewed Apr 28, 2026

Comment thread ...blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md Outdated

pcastonguay reviewed Apr 29, 2026

Comment thread docs/source/blogs/media/tech_blog21_dsr1_fp4_pareto.png

Comment thread docs/source/blogs/media/tech_blog21_dsr1_fp4_pareto.png

ankmore-nv reviewed Apr 29, 2026

schetlur-nv reviewed May 1, 2026

brb-nv force-pushed the user/brb/helix-blog-post branch 3 times, most recently from e3ecaae to d0dbe16 Compare May 4, 2026 02:32

[None][doc] Blogpost for Helix Parallelism

9f3f214

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv force-pushed the user/brb/helix-blog-post branch from d0dbe16 to 9f3f214 Compare May 4, 2026 02:43

brb-nv added 2 commits May 4, 2026 18:40

explanation for baseline

c170736

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

update plot

1a9c049

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

schetlur-nv approved these changes May 4, 2026

brb-nv added 2 commits May 4, 2026 22:05

add description for roofline analysis params

440d8d2

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

make moe_ep and moe_tp clearer

167b795

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

brb-nv force-pushed the user/brb/helix-blog-post branch from 82587a8 to 167b795 Compare May 4, 2026 23:16

brb-nv requested review from ankmore-nv and pcastonguay May 5, 2026 00:03

pcastonguay approved these changes May 5, 2026

laikhtewari reviewed May 5, 2026

Comment thread ...blog/blog21_Helix_Parallelism_Scaling_Multi_Million_Token_Decoding_with_KV_Cache_Sharding.md

laikhtewari approved these changes May 5, 2026

brb-nv enabled auto-merge (squash) May 5, 2026 17:43

brb-nv merged commit 2da7a97 into NVIDIA:main May 5, 2026
7 checks passed

yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request May 19, 2026

[None][doc] Blogpost for Helix Parallelism (NVIDIA#13547)

0c48439

Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>

6 participants
