[TRTLLM-11794][feat] Optimize ViT Attention kernel on Nemotron #12911

2ez4bz merged 3 commits into NVIDIA:main
Conversation
📝 Walkthrough

The changes extend the FlashInfer attention backend to support ragged prefill operations without requiring a KV cache manager. This involves making the KV cache manager optional in the interface, adding ragged prefill wrapper support, implementing a cuDNN-based planning path, and updating model configuration handling.
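As a rough illustration of the optional-manager idea described above, the backend can fall back to a ragged prefill wrapper when no KV cache manager is present. This is a hypothetical sketch: the class and function names below are stand-ins, not the actual TRT-LLM or FlashInfer API.

```python
# Hypothetical sketch of the dispatch; names are illustrative only.
class PagedPrefillWrapper:
    """Attends against a paged KV cache owned by a cache manager."""


class RaggedPrefillWrapper:
    """Attends over packed Q/K/V directly; no KV cache required."""


def select_prefill_wrapper(kv_cache_manager):
    # With no KV cache manager (e.g. the ViT/vision attention case),
    # use the ragged path; otherwise keep the usual paged-KV path.
    if kv_cache_manager is None:
        return RaggedPrefillWrapper()
    return PagedPrefillWrapper()
```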
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🧹 Nitpick comments (1)
tensorrt_llm/_torch/models/modeling_nemotron_nano.py (1)
**1237-1243**: Consider removing redundant deep copy of `vision_model_config`.

Line 1238 creates a deep copy of `model_config` for the vision encoder, but this copy is never modified before being passed to `NanoV2VLVisionEncoder` at line 1243. Additionally, `NanoV2VLVisionEncoder.__init__` creates its own deep copy at lines 296-298 when instantiating `RADIOVisionModel`.

Since `vision_model_config` is not modified and the encoder creates its own copy, you could pass `model_config` directly to `NanoV2VLVisionEncoder` to avoid the redundant deep copy.

♻️ Suggested optimization

```diff
 llm_model_config = copy.deepcopy(model_config)
-vision_model_config = copy.deepcopy(model_config)
 if hasattr(self, "llm"):
     return
 if not _is_disagg():
-    self.vision_encoder = NanoV2VLVisionEncoder(vision_model_config).eval()
+    self.vision_encoder = NanoV2VLVisionEncoder(model_config).eval()
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 1237-1243, remove the unnecessary deep copy for `vision_model_config`: instead of creating `vision_model_config = copy.deepcopy(model_config)` and passing it to `NanoV2VLVisionEncoder`, pass `model_config` directly in the block where `self.vision_encoder` is created (the code that checks `hasattr(self, "llm")` and `if not _is_disagg()`). Leave `llm_model_config` as-is; `NanoV2VLVisionEncoder.__init__` already deep-copies when instantiating `RADIOVisionModel`, so the external deep copy is redundant.
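The redundancy the reviewer points out can be seen in a minimal standalone example. The class names here are stand-ins, not the real Nemotron classes; the pattern is the same: the callee already deep-copies, so an extra copy at the call site buys nothing.

```python
import copy


class ModelConfig:
    def __init__(self, name):
        self.name = name


class Encoder:
    # Mirrors the described pattern: the encoder deep-copies internally.
    def __init__(self, config):
        self.config = copy.deepcopy(config)


cfg = ModelConfig("nemotron-vit")

# Redundant: an extra deep copy that is never mutated before the handoff.
encoder_a = Encoder(copy.deepcopy(cfg))
# Equivalent and cheaper: pass the original; Encoder's own deepcopy
# still isolates it from later mutation.
encoder_b = Encoder(cfg)

encoder_b.config.name = "mutated"
assert cfg.name == "nemotron-vit"  # caller's config untouched either way
```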
ℹ️ Review info

⚙️ Run configuration
- Configuration used: Path: .coderabbit.yaml
- Review profile: CHILL
- Plan: Pro
- Run ID: 564bc019-e5e8-4c0e-bfed-aa106130cc2b
📒 Files selected for processing (5)

- tensorrt_llm/_torch/attention_backend/flashinfer.py
- tensorrt_llm/_torch/attention_backend/interface.py
- tensorrt_llm/_torch/models/modeling_nemotron_nano.py
- tensorrt_llm/_torch/models/modeling_radio.py
- tests/unittest/_torch/attention/test_flashinfer_attention.py
/bot run

PR_Github #42676 [ run ] triggered by Bot. Commit:

PR_Github #42676 [ run ] completed with state

/bot run

PR_Github #42997 [ run ] triggered by Bot. Commit:

PR_Github #42997 [ run ] completed with state
Force-pushed: 9a3dda0 to 0e80eb3 (Compare)
/bot run

PR_Github #43225 [ run ] triggered by Bot. Commit:

PR_Github #43225 [ run ] completed with state
Force-pushed: 0e80eb3 to 97748f9 (Compare)
/bot run

PR_Github #43348 [ run ] triggered by Bot. Commit:

PR_Github #43348 [ run ] completed with state
Force-pushed: 97748f9 to 90ce46a (Compare)
/bot run

PR_Github #43518 [ run ] triggered by Bot. Commit:
brb-nv left a comment:

Just an informational question. LGTM.

PR_Github #43518 [ run ] completed with state
…Nemotron ViT Attention to Flashinfer

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
/bot run --disable-fail-fast

1 similar comment

/bot run --disable-fail-fast

PR_Github #43614 [ run ] triggered by Bot. Commit:

PR_Github #43614 [ run ] completed with state
/bot run

PR_Github #43722 [ run ] triggered by Bot. Commit:

PR_Github #43722 [ run ] completed with state

/bot run

PR_Github #43758 [ run ] triggered by Bot. Commit:

PR_Github #43758 [ run ] completed with state

/bot run

PR_Github #43794 [ run ] triggered by Bot. Commit:

/bot run

PR_Github #43824 [ run ] triggered by Bot. Commit:

/bot run

PR_Github #43844 [ run ] triggered by Bot. Commit:

PR_Github #43824 [ run ] completed with state

PR_Github #43844 [ run ] completed with state

/bot run

PR_Github #44004 [ run ] triggered by Bot. Commit:

PR_Github #44004 [ run ] completed with state
Description

This PR introduces a new FlashInfer backend path using `BatchPrefillWithRaggedKVCacheWrapper` for the case where `kv_cache_manager` is None (e.g., the vision attention case). For sequence lengths above 1024, FlashInfer's cuDNN backend shows promising results, and image inputs average more than 1024 tokens per sequence, so switching vision attention to FlashInfer's cuDNN path is well justified.

Below is the data from the microbenchmark.
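For context, the ragged (no-KV-cache) prefill computation can be illustrated with a small NumPy reference. This is only a sketch of the math, assuming FlashInfer's `indptr` convention for packed sequences; it is not the TRT-LLM or FlashInfer implementation, and the function name is illustrative.

```python
import numpy as np


def ragged_prefill_attention(q, k, v, indptr):
    """Non-causal attention over packed variable-length sequences.

    q, k, v: [total_tokens, num_heads, head_dim], sequences packed back
    to back; indptr[i]:indptr[i+1] delimits sequence i (the qo_indptr /
    kv_indptr convention of the ragged layout).
    """
    out = np.empty_like(q)
    head_dim = q.shape[-1]
    scale = 1.0 / np.sqrt(head_dim)
    for s in range(len(indptr) - 1):
        a, b = indptr[s], indptr[s + 1]
        for h in range(q.shape[1]):
            scores = (q[a:b, h] @ k[a:b, h].T) * scale
            scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
            p = np.exp(scores)
            p /= p.sum(axis=-1, keepdims=True)
            out[a:b, h] = p @ v[a:b, h]
    return out


# Two packed sequences of lengths 3 and 1, with 2 heads of dim 4.
rng = np.random.default_rng(0)
q = rng.standard_normal((4, 2, 4))
k = rng.standard_normal((4, 2, 4))
v = rng.standard_normal((4, 2, 4))
out = ragged_prefill_attention(q, k, v, indptr=[0, 3, 4])
# A length-1 sequence attends only to itself, so its output is its value.
assert np.allclose(out[3], v[3])
```

Because each sequence attends only within its own `[indptr[i], indptr[i+1])` slice, no KV cache pages are involved, which is why this path can run without a KV cache manager.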
Summary by CodeRabbit
New Features
Tests