[https://nvbugs/6168859][fix] move tinygemm PDL release after reduction #14537
📝 Walkthrough
The PR relocates a CUDA synchronization call within the tinygemm2 kernel. The …
Changes: Kernel synchronization barrier repositioning
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh (1)
Lines 1-2: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update the NVIDIA header year.
This file is modified in this PR, so the header should reflect 2026 as the latest meaningful modification year.
As per coding guidelines: "`**/*.{cpp,cc,h,hpp,py,cu,cuh}`: Include NVIDIA copyright header on ALL new files; update year on modified files."
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh` around lines 1-2: the copyright header in tinygemm2_kernel.cuh still shows 2025–2025; update the top-of-file NVIDIA copyright comment to reflect 2026 as the latest modification year (e.g., change the year range to 2025–2026 or 2026–2026, consistent with project style) so the file header matches the PR modifications.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh`:
- Around lines 424-427: The Programmatic Dependent Launch (PDL) trigger is invoked too
early: move the cudaTriggerProgrammaticLaunchCompletion() call so it executes
after the final global-memory stores performed by the warp_id == 0 path (the
output[...] stores) rather than immediately after filling reduction_buffer and
__syncthreads(); specifically, ensure that
cudaTriggerProgrammaticLaunchCompletion() is placed after the code block that
does the output[...] global writes (so that a dependent that returns from
cudaGridDependencySynchronize() sees the completed stores), keeping the existing
checks (threadIdx.x/warp_id conditions) consistent with the original
synchronization logic.
---
Outside diff comments:
In `@cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh`:
- Around lines 1-2: The copyright header in tinygemm2_kernel.cuh still shows
2025–2025; update the top-of-file NVIDIA copyright comment to reflect 2026 as
the latest modification year (e.g., change the year range to 2025–2026 or
2026–2026 consistent with project style) so the file header matches the PR
modifications.
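A minimal sketch of the ordering the inline comment asks for, assuming an sm_90+ GPU and compilation with `-arch=sm_90` or newer. `primary_reduce`, `secondary_consume`, and `pdl_sum` are hypothetical stand-ins, not the actual tinygemm2 code: the primary kernel fills a shared `reduction_buffer`, warp 0 writes the final result to global memory, and only then does the block call `cudaTriggerProgrammaticLaunchCompletion()`. The extra `__syncthreads()` before the trigger is a conservative choice so that no thread in the block triggers before warp 0's store has been issued. The dependent kernel is launched with `cudaLaunchAttributeProgrammaticStreamSerialization` and calls `cudaGridDependencySynchronize()` before reading the primary's output.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the tinygemm2 epilogue: per-block sum of `in`.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void primary_reduce(const float* in, float* out, int n)
{
    __shared__ float reduction_buffer[32];  // one partial sum per warp
    const int tid = threadIdx.x;
    const int warp_id = tid / 32;
    const int lane = tid % 32;

    float v = 0.f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        v += in[i];
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_down_sync(0xffffffffu, v, off);
    if (lane == 0) reduction_buffer[warp_id] = v;
    __syncthreads();
    // Triggering here (right after filling reduction_buffer) would release the
    // dependent grid before the output[...] stores below have been issued.

    if (warp_id == 0) {
        const int num_warps = blockDim.x / 32;
        float s = lane < num_warps ? reduction_buffer[lane] : 0.f;
        for (int off = 16; off > 0; off >>= 1)
            s += __shfl_down_sync(0xffffffffu, s, off);
        if (lane == 0) out[blockIdx.x] = s;  // final global-memory store
    }
    __syncthreads();  // no thread triggers before warp 0 has stored

#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    cudaTriggerProgrammaticLaunchCompletion();  // PDL release after the stores
#endif
}

// Hypothetical dependent kernel: sums the per-block partials.
__global__ void secondary_consume(const float* partial, float* total, int num_partials)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 900
    cudaGridDependencySynchronize();  // wait for the primary grid before reading
#endif
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.f;
        for (int i = 0; i < num_partials; ++i) s += partial[i];
        *total = s;
    }
}

// Launches the primary normally and the secondary with PDL enabled.
float pdl_sum(const std::vector<float>& h_in)
{
    const int n = static_cast<int>(h_in.size());
    const int blocks = 4, threads = 256;
    float *d_in = nullptr, *d_partial = nullptr, *d_total = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    primary_reduce<<<blocks, threads, 0, stream>>>(d_in, d_partial, n);

    cudaLaunchAttribute attr[1];
    attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr[0].val.programmaticStreamSerializationAllowed = 1;
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim = dim3(1);
    cfg.blockDim = dim3(32);
    cfg.dynamicSmemBytes = 0;
    cfg.stream = stream;
    cfg.attrs = attr;
    cfg.numAttrs = 1;
    cudaLaunchKernelEx(&cfg, secondary_consume, d_partial, d_total, blocks);

    float h_total = 0.f;
    cudaMemcpyAsync(&h_total, d_total, sizeof(float), cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return h_total;
}
```

Moving the trigger later gives up some overlap between the primary's tail and the dependent kernel's prologue in exchange for a simpler ordering guarantee.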
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 1dd17842-40b1-4cbf-85fe-4cc074c1b8d2
📒 Files selected for processing (1)
cpp/tensorrt_llm/kernels/tinygemm2/tinygemm2_kernel.cuh
/bot run

PR_Github #50235 [ run ] triggered by Bot. Commit:

/bot run

PR_Github #50237 [ run ] triggered by Bot. Commit:

PR_Github #50235 [ run ] completed with state

PR_Github #50237 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #50251 [ run ] triggered by Bot. Commit:

/bot run

PR_Github #50254 [ run ] triggered by Bot. Commit:

PR_Github #50251 [ run ] completed with state

Should we propagate the fix to FlashInfer?

I think so, if we want to be on the safe side, although the issue seems to be reproducible only within TRTLLM; I was not able to create a standalone repro.

PR_Github #50254 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #50299 [ run ] triggered by Bot. Commit:

PR_Github #50299 [ run ] completed with state

Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Signed-off-by: dongfengy <99041270+dongfengy@users.noreply.github.com>
6159db5 to cbb1b08

/bot run --disable-fail-fast

PR_Github #50357 [ run ] triggered by Bot. Commit:

PR_Github #50357 [ run ] completed with state

…on (NVIDIA#14537)
Signed-off-by: Dongfeng Yu <dongfengy@nvidia.com>
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added, either `api-compatible` or `api-breaking`. For `api-breaking`, include `BREAKING` in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.