
[None][feat] reuse triton slicing kernel for GDN prefill transpose#12737

Merged
nv-guomingz merged 1 commit into NVIDIA:main from nv-guomingz:user/guomingz/qwen3.5-fla-triton-slice
Apr 7, 2026

Conversation

@nv-guomingz
Collaborator

@nv-guomingz nv-guomingz commented Apr 3, 2026

Summary by CodeRabbit

  • Refactor
  • Consolidated internal tensor-operation handling, reducing duplicated logic and improving maintainability of the model-inference path.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@coderabbitai
Contributor

coderabbitai bot commented Apr 3, 2026

📝 Walkthrough

Walkthrough

The changes refactor tensor transposition and slicing logic for prefill tokens by introducing a new utility function extract_transpose_prefill_slice that consolidates these operations, reducing code duplication and simplifying the prefill processing path in the Qwen3 model's forward pass.

Changes

Cohort / File(s) | Summary

  • New Utility Helper Function: tensorrt_llm/_torch/modules/mamba/fuse_elementwise_ops.py
    Added an extract_transpose_prefill_slice() helper that allocates and returns a transposed/sliced tensor by invoking the existing _extract_transpose_prefill_kernel. Refactored extract_transpose_xbc_prefill() to delegate to the new helper, eliminating duplicated logic and consolidating the kernel invocation.
  • Prefill Tensor Operations Refactor: tensorrt_llm/_torch/models/modeling_qwen3_next.py
    Replaced explicit prefill tensor transposition and post-convolution operations with calls to extract_transpose_prefill_slice(). Updated both the prefill+decode and prefill-only code paths to use the new function, removing the combined torch.cat concatenation step in the prefill+decode case.
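The operation the new helper consolidates, slicing the first num_prefill_tokens rows and transposing the result, can be sketched in plain Python. This is a reference for the semantics only (hypothetical names); the real helper performs the slice and transpose in one pass via the Triton kernel _extract_transpose_prefill_kernel:

```python
def extract_transpose_prefill_slice_ref(x, num_prefill_tokens):
    """Reference semantics: take the first num_prefill_tokens rows of a
    2-D [tokens, width] matrix and return its [width, num_prefill_tokens]
    transpose. The production helper fuses this into a single kernel to
    avoid a separate slice, transpose, and copy."""
    prefill = x[:num_prefill_tokens]  # slice the prefill rows
    width = len(x[0])
    # transpose: column c of the slice becomes row c of the output
    return [[row[c] for row in prefill] for c in range(width)]

x = [[1, 2], [3, 4], [5, 6]]  # 3 tokens, width 2
out = extract_transpose_prefill_slice_ref(x, 2)
print(out)  # [[1, 3], [2, 4]]
```

Fusing the slice and transpose avoids materializing the intermediate sliced tensor, which is the duplicated logic the refactor removes from both call sites.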

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.
  • Description check ⚠️ Warning: the PR description lacks substantive content; the Description and Test Coverage sections remain empty placeholders, and the checklist is unchecked despite being marked complete. Resolution: fill in the Description section explaining what the refactoring accomplishes and why, list the tests that validate the changes under Test Coverage, and verify and mark the appropriate checklist items.

✅ Passed checks (1 passed)

  • Title check ✅ Passed: the title 'reuse triton slicing kernel for GDN prefill transpose' accurately describes the main change, refactoring to reuse a Triton kernel for the prefill transpose, as evidenced by the addition of extract_transpose_prefill_slice and its integration across the codebase.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_qwen3_next.py (1)

701-709: ⚠️ Potential issue | 🟠 Major

Explicitly copy back the decode result to ensure correctness across all backends.

Lines 701–709 assign the decode result to mixed_qkv_d without explicitly copying it back to the split view, whereas the prefill result is explicitly copied at line 709 via mixed_qkv_p.copy_(...). This asymmetry creates a latent bug: if causal_conv1d_update materializes a new tensor (as the Triton backend does), the parent mixed_qkv tensor retains stale decode rows, which corrupts the subsequent Q/K/V split.

Suggested fix with data_ptr guard
-            mixed_qkv_d = causal_conv1d_update(
+            mixed_qkv_d_out = causal_conv1d_update(
                 mixed_qkv_d,
                 conv_states_to_use,
                 self.conv1d.weight,
                 self.conv1d.bias,
                 activation=self.activation,
                 conv_state_indices=state_indices_d,
             )
+            if mixed_qkv_d_out.data_ptr() != mixed_qkv_d.data_ptr():
+                mixed_qkv_d.copy_(mixed_qkv_d_out)
             mixed_qkv_p.copy_(mixed_qkv_p_t.transpose(0, 1))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py` around lines 701 - 709,
The decode path may materialize a new tensor in causal_conv1d_update causing
mixed_qkv's decode rows to remain stale; mirror the prefill handling by
explicitly copying the updated decode data back into the split view. After
calling causal_conv1d_update (for mixed_qkv_d), perform an in-place copy into
the original decode view (the same way mixed_qkv_p.copy_(...) is used) and
optionally guard with a data_ptr comparison between the returned tensor and the
destination to avoid redundant copies; update references around mixed_qkv_d,
mixed_qkv_p.copy_, causal_conv1d_update and mixed_qkv accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/modules/mamba/fuse_elementwise_ops.py`:
- Around line 54-80: The store-side offset computation in the Triton kernel
_extract_transpose_prefill_kernel is done in 32-bit arithmetic and can overflow
when width * num_prefill_tokens exceeds 2,147,483,647; mirror the fix used for
src_offsets by casting the operands to tl.int64 (or performing the
multiplication in tl.int64) before computing dst_offsets and before any
subsequent index arithmetic or calls to tl.store so the write addresses cannot
overflow. Locate the dst_offsets/dst_ptr computation inside
_extract_transpose_prefill_kernel and change the multiplication/additions to use
tl.int64 (e.g., cast row/col/width or the product) and ensure the tl.store uses
the 64-bit offset variable.
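The overflow described above can be reproduced in plain Python by emulating 32-bit index arithmetic; ctypes.c_int32 stands in for Triton's default tl.int32 math (the shape values below are illustrative, not taken from the PR):

```python
import ctypes

def offset_i32(row, col, width):
    """Offset computed with 32-bit wraparound, mimicking Triton's
    default int32 index arithmetic."""
    return ctypes.c_int32(row * width + col).value

def offset_i64(row, col, width):
    """Same offset with the operands promoted to 64-bit (tl.int64)."""
    return row * width + col

# A large-but-plausible flattened shape: the product exceeds 2**31 - 1.
row, col, width = 20_000_000, 5, 128
print(offset_i64(row, col, width))  # 2560000005: the correct element offset
print(offset_i32(row, col, width))  # wrapped to a negative value
```

A negative (wrapped) offset fed to tl.store addresses memory outside the destination tensor, which is why the comment asks for the dst_offsets computation to be done in tl.int64, mirroring the fix already applied to src_offsets.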

---

Outside diff comments:
In `@tensorrt_llm/_torch/models/modeling_qwen3_next.py`:
- Around line 701-709: The decode path may materialize a new tensor in
causal_conv1d_update causing mixed_qkv's decode rows to remain stale; mirror the
prefill handling by explicitly copying the updated decode data back into the
split view. After calling causal_conv1d_update (for mixed_qkv_d), perform an
in-place copy into the original decode view (the same way mixed_qkv_p.copy_(...)
is used) and optionally guard with a data_ptr comparison between the returned
tensor and the destination to avoid redundant copies; update references around
mixed_qkv_d, mixed_qkv_p.copy_, causal_conv1d_update and mixed_qkv accordingly.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 124c4bc8-c452-4797-af80-6edb29df5311

📥 Commits

Reviewing files that changed from the base of the PR and between 1045f38 and 8bd70de.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/models/modeling_qwen3_next.py
  • tensorrt_llm/_torch/modules/mamba/fuse_elementwise_ops.py

@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-fla-triton-slice branch 2 times, most recently from 1f78c5a to d522412, on April 3, 2026 15:22
@nv-guomingz
Collaborator Author

/bot run --add-multi-gpu-test

@nv-guomingz nv-guomingz changed the title reuse triton slicing kernel for GDN prefill transpose [None][feat] reuse triton slicing kernel for GDN prefill transpose Apr 3, 2026
@tensorrt-cicd
Collaborator

PR_Github #41693 [ run ] triggered by Bot. Commit: d522412 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41693 [ run ] completed with state SUCCESS. Commit: d522412
/LLM/main/L0_MergeRequest_PR pipeline #32596 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41775 [ run ] triggered by Bot. Commit: d522412 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41775 [ run ] completed with state SUCCESS. Commit: d522412
/LLM/main/L0_MergeRequest_PR pipeline #32670 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
@nv-guomingz nv-guomingz force-pushed the user/guomingz/qwen3.5-fla-triton-slice branch from d522412 to 114834c, on April 5, 2026 11:07
@nv-guomingz
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41867 [ run ] triggered by Bot. Commit: 114834c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41867 [ run ] completed with state SUCCESS. Commit: 114834c
/LLM/main/L0_MergeRequest_PR pipeline #32733 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@rosenrodt
Collaborator

/bot run

@rosenrodt rosenrodt self-requested a review April 5, 2026 19:50
Collaborator

@rosenrodt rosenrodt left a comment


LGTM as long as tests pass

@tensorrt-cicd
Collaborator

PR_Github #41886 [ run ] triggered by Bot. Commit: 114834c Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41886 [ run ] completed with state SUCCESS. Commit: 114834c
/LLM/main/L0_MergeRequest_PR pipeline #32751 completed with status: 'SUCCESS'
Pipeline passed with automatically retried tests. Check the rerun report for details.

CI Report

Link to invocation

Collaborator

@yechank-nvidia yechank-nvidia left a comment


LGTM.

@nv-guomingz nv-guomingz merged commit 6488d7f into NVIDIA:main Apr 7, 2026
5 checks passed
xinhe-nv pushed a commit to xinhe-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026
…VIDIA#12737)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
yufeiwu-nv pushed a commit to yufeiwu-nv/TensorRT-LLM that referenced this pull request Apr 7, 2026
…VIDIA#12737)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
karen-sy pushed a commit to karen-sy/TensorRT-LLM that referenced this pull request Apr 7, 2026
…VIDIA#12737)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>
suyoggupta pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Apr 8, 2026
…VIDIA#12737)

Signed-off-by: nv-guomingz <137257613+nv-guomingz@users.noreply.github.com>


5 participants