[https://nvbugs/6250866][fix] Fix deep ep partial warp sync for gptoss shapes by dongfengy · Pull Request #14977 · NVIDIA/TensorRT-LLM

dongfengy · 2026-06-04T22:42:46Z

NOTE: This ensures that partial warp sync works correctly.

[NVBUG-6250866][bugfix] wire DeepEP patch into FetchContent

Summary by CodeRabbit

Bug Fixes
- Patched the DeepEP intranode kernel so that warp synchronization uses `__syncwarp(__activemask())`, which keeps it correct when the final tile occupies only some lanes of a warp.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

- PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
- PR follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
- Test cases are provided for new code paths (see test instructions).
- If PR introduces API changes, an appropriate PR label is added, either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
- Any new dependencies have been scanned for license and vulnerabilities.
- CODEOWNERS updated if ownership changes.
- Documentation updated as needed.
- Update tava architecture diagram if there is a significant design change in PR.
- The reviewers assigned automatically/manually are appropriate for the PR.
- Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

coderabbitai · 2026-06-04T22:45:36Z

📝 Walkthrough

The PR registers a CUDA kernel synchronization fix for the deep_ep third-party dependency. The fetch configuration now references a new patch file that replaces bare warp synchronization calls with active-mask-aware variants to handle the case where the final tile contains fewer lanes than a full warp.

Changes

CUDA Synchronization Fix for deep_ep

Layer / File(s)	Summary
Patch configuration and intranode synchronization fix `3rdparty/fetch_content.json`, `3rdparty/patches/deep_ep_intranode_combine_fix.patch`	The `deep_ep_download` entry in `fetch_content.json` is updated to register `patches/deep_ep_intranode_combine_fix.patch` for automatic application. The patch replaces `__syncwarp()` with `__syncwarp(__activemask())` at TMA wait, fence, and final synchronization points in `intranode.cu`, ensuring only active warp lanes participate when the final tile has a subset of lanes.
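
To make the walkthrough concrete, here is a minimal sketch of the pattern it describes. This is not the DeepEP code: the kernel `rotate_tiles`, the shared buffer `staging`, and the element count of 40 are hypothetical names and values chosen for illustration. A single warp processes its input in tiles of 32, so the final tile is entered by only 8 lanes, and those lanes synchronize with `__syncwarp(__activemask())` rather than a bare `__syncwarp()`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kWarpSize = 32;

// Hypothetical kernel (illustration only, not DeepEP code): one warp walks `n`
// elements in tiles of 32 and rotates each tile left by one via shared memory.
__global__ void rotate_tiles(const int* in, int* out, int n) {
    __shared__ int staging[kWarpSize];
    const int lane = threadIdx.x;

    for (int base = 0; base < n; base += kWarpSize) {
        const int lanes_in_tile = min(kWarpSize, n - base);
        if (lane < lanes_in_tile) {  // the final tile may admit only some lanes
            staging[lane] = in[base + lane];

            // A bare __syncwarp() names all 32 lanes, but lanes that skipped
            // this branch never reach this call. __activemask() names only the
            // lanes currently executing here.
            const unsigned mask = __activemask();
            __syncwarp(mask);

            out[base + lane] = staging[(lane + 1) % lanes_in_tile];

            // Keep the next tile's writes from overtaking this tile's reads.
            __syncwarp(mask);
        }
    }
}

int main() {
    const int n = 40;  // one full tile plus a final tile of 8 lanes (assumed size)
    int h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    rotate_tiles<<<1, kWarpSize>>>(d_in, d_out, n);
    const cudaError_t err = cudaDeviceSynchronize();
    cudaMemcpy(h_out, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Check every element against the per-tile rotation computed on the host.
    bool ok = (err == cudaSuccess);
    for (int i = 0; i < n && ok; ++i) {
        const int base = (i / kWarpSize) * kWarpSize;
        const int lanes = (n - base < kWarpSize) ? (n - base) : kWarpSize;
        ok = (h_out[i] == h_in[base + (i - base + 1) % lanes]);
    }
    printf("%s\n", ok ? "ok" : "mismatch");

    cudaFree(d_in);
    cudaFree(d_out);
    return ok ? 0 : 1;
}
```

One design note: `__activemask()` reports the lanes that happen to be converged at the call site, which is not guaranteed to be every lane that takes the branch. Where the participant set must be exact, a mask computed before the branch with `__ballot_sync` is a common alternative. Whether that distinction matters for the patched call sites in `intranode.cu` is not stated in this PR.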

Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description lacks essential details required by the template, with only a brief note and title but missing implementation explanation and test coverage information.	Fill in the Description section explaining the issue being fixed and the solution approach, and detail relevant tests that validate the warp synchronization fix.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Title check	✅ Passed	The title clearly describes the main change: fixing partial warp synchronization in Deep EP for GPT-OSS shapes, which is the core issue addressed by the patch to intranode.cu and the FetchContent configuration update.

dongfengy · 2026-06-04T22:51:57Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-04T22:57:32Z

PR_Github #52176 [ run ] triggered by Bot. Commit: ec4d1f3

tensorrt-cicd · 2026-06-05T02:52:51Z

PR_Github #52176 [ run ] completed with state SUCCESS. Commit: ec4d1f3
/LLM/main/L0_MergeRequest_PR pipeline #41496 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR.
- If you cannot view the failures, ask the CI triggerer to share details.
- Once fixed, request an NVIDIA team member to trigger CI again.

dongfengy · 2026-06-05T05:12:40Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-05T05:19:25Z

PR_Github #52254 [ run ] triggered by Bot. Commit: 1df8987

tensorrt-cicd · 2026-06-05T08:48:54Z

PR_Github #52254 [ run ] completed with state FAILURE. Commit: 1df8987
/LLM/main/L0_MergeRequest_PR pipeline #41568 completed with status: 'FAILURE'

dongfengy · 2026-06-05T16:43:08Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-05T16:49:00Z

PR_Github #52384 [ run ] triggered by Bot. Commit: 339ef49

tensorrt-cicd · 2026-06-05T20:45:21Z

PR_Github #52384 [ run ] completed with state SUCCESS. Commit: 339ef49
/LLM/main/L0_MergeRequest_PR pipeline #41680 completed with status: 'FAILURE'

dongfengy · 2026-06-05T21:14:17Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-05T21:20:12Z

PR_Github #52433 [ run ] triggered by Bot. Commit: 339ef49

github-actions Bot assigned dongfengy Jun 4, 2026

dongfengy marked this pull request as ready for review June 4, 2026 22:43

dongfengy requested a review from a team as a code owner June 4, 2026 22:43

dongfengy changed the title ~~[NVBUG-6250866][bugfix] Fix deep ep partial warp sync for gptoss shapes~~ [https://nvbugs/6250866][bugfix] Fix deep ep partial warp sync for gptoss shapes Jun 4, 2026

dongfengy changed the title ~~[https://nvbugs/6250866][bugfix] Fix deep ep partial warp sync for gptoss shapes~~ [https://nvbugs/6250866][fix] Fix deep ep partial warp sync for gptoss shapes Jun 4, 2026

tijyojwad approved these changes Jun 5, 2026

dongfengy force-pushed the user/dongfengy/deepep-fix-v2 branch from ec4d1f3 to 1df8987 on June 5, 2026 03:24

[NVBUG-6250866][bugfix] fix DeepEP intranode combine fallback

339ef49

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>

[NVBUG-6250866][bugfix] wire DeepEP patch into FetchContent

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>

dongfengy force-pushed the user/dongfengy/deepep-fix-v2 branch from 1df8987 to 339ef49 on June 5, 2026 16:43
