[None][feat] Autodeploy add triton configs and optimize mamba prefill #9083
Conversation
📝 Walkthrough

This PR introduces chunk-based processing support throughout the auto-deploy pipeline, implements dynamic MoE kernel configuration loading from JSON files, adds a causal convolution fusion optimization pass, and extends various backend implementations to handle activation parameters and chunking metadata.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GraphModule
    participant Matcher as Pattern Matcher
    participant Transform as FuseCausalConvActivation
    participant Backend as CUDA Backend
    GraphModule->>Matcher: Scan for causal_conv1d + activation
    Matcher-->>Transform: Return matched (conv_node, activation_node, op_name)
    Transform->>Transform: Extract activation function name
    Note over Transform: Identify silu/activation type
    Transform->>GraphModule: Insert fused op call<br/>(cuda_cached_causal_conv1d + activation arg)
    GraphModule->>GraphModule: Replace activation node with fused call
    GraphModule->>GraphModule: Erase original conv & activation nodes
    GraphModule->>Backend: Execute fused kernel<br/>(activation baked in)
    Backend-->>GraphModule: Return fused output
```
```mermaid
sequenceDiagram
    participant User
    participant Factory as ModelFactory
    participant Executor as ADExecutor
    participant Config as get_moe_configs()
    participant Kernel as Triton Kernel
    User->>Factory: Query model chunk_size
    Factory-->>User: Return chunk_size from config
    User->>Executor: Initialize with factory
    Executor->>Factory: Fetch chunk_size
    Executor->>Executor: Build SequenceInfo with chunk_size
    Kernel->>Config: Request optimized config<br/>(E, N, dtype, batch_size)
    Config->>Config: Load JSON from disk (cached)
    Config->>Config: Find closest batch-size key to M
    alt Config Found
        Config-->>Kernel: Return optimized block sizes
        Kernel->>Kernel: Use tuned BLOCK_SIZE_M/N/K
    else Fallback
        Config-->>Kernel: Return None
        Kernel->>Kernel: Use default_kernel_config(M, E)
    end
```
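To make the config-lookup path in the second diagram concrete, here is a minimal, self-contained sketch of the JSON-based lookup it describes. The helper names (`load_moe_configs`, `pick_block_config`) and the exact JSON layout are assumptions for illustration; only the behavior shown in the diagram (cached load, closest batch-size key, fallback to a default config) is taken from this PR.

```python
import json
from functools import lru_cache
from pathlib import Path
from typing import Dict, Optional

# Hypothetical config directory mirroring the files added in this PR, e.g.
# "E=128,N=1856,device_name=NVIDIA_H100_80GB_HBM3.json".
CONFIG_DIR = Path(__file__).parent / "triton_fused_moe_configs"


@lru_cache(maxsize=None)
def load_moe_configs(num_experts: int, n: int, device_name: str) -> Optional[Dict[int, dict]]:
    """Load tuned kernel configs from disk once; keys are batch sizes (M)."""
    path = CONFIG_DIR / f"E={num_experts},N={n},device_name={device_name}.json"
    if not path.exists():
        return None
    with path.open() as f:
        raw = json.load(f)
    # JSON keys are strings; convert to int batch sizes for nearest-key lookup.
    return {int(m): cfg for m, cfg in raw.items()}


def pick_block_config(m: int, num_experts: int, n: int, device_name: str) -> Optional[dict]:
    """Return the tuned config whose batch-size key is closest to M, or None."""
    configs = load_moe_configs(num_experts, n, device_name)
    if not configs:
        return None  # caller falls back to default_kernel_config(M, E)
    closest_m = min(configs.keys(), key=lambda k: abs(k - m))
    return configs[closest_m]
```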
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Areas requiring extra attention:
Pre-merge checks and finishing touches: ❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (1)
358-384: Fix the fake op signature to include `chunk_size`.

`torch_backend_prepare_metadata` now receives `chunk_size`, but the fake registration still exposes the old signature. During fake tensor tracing the dispatcher will pass the extra argument, leading to an immediate `TypeError` and breaking export. Please update the fake to accept the new parameter as well.

```diff
 @torch_backend_prepare_metadata.register_fake
 def torch_backend_prepare_metadata_fake(
-    position_ids, seq_len, input_pos, cache_loc, pages_per_seq, slot_idx, page_size
+    position_ids,
+    seq_len,
+    input_pos,
+    cache_loc,
+    pages_per_seq,
+    slot_idx,
+    page_size,
+    chunk_size,
 ):
     num_seq = SequenceInfo._get_sanitized_num_sequences(position_ids, seq_len)
     return (
```
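For context on why the fake must mirror the real signature, here is a minimal standalone sketch using a made-up `demo::prepare_metadata` op and the newer `torch.library.custom_op` API (not the actual TensorRT-LLM op or its registration style). If the fake below omitted `chunk_size`, tracing with fake tensors would raise the same kind of `TypeError` the comment describes.

```python
import torch


# Made-up demo op (not the actual TensorRT-LLM op) showing why the fake
# registration must mirror the real signature once chunk_size is added.
@torch.library.custom_op("demo::prepare_metadata", mutates_args=())
def prepare_metadata(position_ids: torch.Tensor, seq_len: torch.Tensor, chunk_size: int) -> torch.Tensor:
    # Real implementation: derive some per-sequence metadata (contents irrelevant here).
    return torch.arange(seq_len.numel(), device=position_ids.device, dtype=torch.int32)


@prepare_metadata.register_fake
def _(position_ids, seq_len, chunk_size):  # must accept chunk_size too
    # Fake tensors only carry shape/dtype/device; no real compute happens here.
    return torch.empty(seq_len.numel(), device=position_ids.device, dtype=torch.int32)
```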
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (1)
299-301: Drop the unused `# noqa`.

Ruff flags this `# noqa: E501` as unused. Removing the directive (or splitting the string if needed) keeps the file clean.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)
- tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/config/default.yaml (1 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_L40S.json (1 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (4 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (6 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py (7 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1 hunks)
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2 hunks)
- tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py (1 hunks)
- tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py, tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py, tensorrt_llm/_torch/auto_deploy/models/factory.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py, tensorrt_llm/_torch/auto_deploy/models/hf.py, tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py, tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py, tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py, tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py, tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py, tensorrt_llm/_torch/auto_deploy/models/factory.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py, tensorrt_llm/_torch/auto_deploy/models/hf.py, tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py, tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py, tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py, tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py, tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py, tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py, tensorrt_llm/_torch/auto_deploy/models/factory.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py, tensorrt_llm/_torch/auto_deploy/models/hf.py, tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py, tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py, tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py, tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_L40S.json, tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.
Applied to files:
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Applied to files:
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.
Applied to files:
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
🧬 Code graph analysis (13)
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (4)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): chunk_size (128-131)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): chunk_size (198-200)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4): seq_len (296-297), input_pos (300-301), cache_loc (304-305), pages_per_seq (308-309)
- tensorrt_llm/_torch/attention_backend/flashinfer.py (1): page_size (197-201)

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1)
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1): extract_op_args (469-506)

tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py (2)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): chunk_size (128-131)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): chunk_size (198-200)

tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py (4)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): chunk_size (128-131)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): chunk_size (198-200)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4): seq_len (296-297), input_pos (300-301), cache_loc (304-305), pages_per_seq (308-309)
- tensorrt_llm/_torch/attention_backend/flashinfer.py (1): page_size (197-201)

tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): chunk_size (128-131)

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py (4)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): chunk_size (128-131)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): chunk_size (198-200)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4): seq_len (296-297), input_pos (300-301), cache_loc (304-305), pages_per_seq (308-309)
- tensorrt_llm/_torch/attention_backend/flashinfer.py (1): page_size (197-201)

tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): chunk_size (198-200)

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py (3)
- tensorrt_llm/_torch/auto_deploy/shim/interface.py (1): CachedSequenceInterface (11-92)
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1): is_op (197-220)
- tensorrt_llm/_torch/auto_deploy/transform/interface.py (4): BaseTransform (217-504), SharedConfig (61-66), TransformInfo (121-178), TransformRegistry (507-535)

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): chunk_size (128-131)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): chunk_size (198-200)

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py (2)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4): seq_len (296-297), _get_sanitized_seq_len (388-428), to (465-472), device (190-191)
- tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py (1): cu_seqlens_to_chunk_indices_offsets (24-85)

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (2)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): chunk_size (128-131)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): chunk_size (198-200)

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (3)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): chunk_size (128-131)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): chunk_size (198-200)
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1): extract_op_args (469-506)

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (3)
- tensorrt_llm/_torch/auto_deploy/models/hf.py (1): chunk_size (128-131)
- tensorrt_llm/_torch/auto_deploy/models/factory.py (1): chunk_size (198-200)
- tensorrt_llm/_torch/auto_deploy/llm.py (1): factory (110-113)
🪛 Ruff (0.14.4)
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py
165-165: Unused function argument: chunk_size
(ARG001)
217-217: Unused function argument: input_pos
(ARG001)
217-217: Unused function argument: pages_per_seq
(ARG001)
217-217: Unused function argument: slot_idx
(ARG001)
217-217: Unused function argument: page_size
(ARG001)
217-217: Unused function argument: chunk_size
(ARG001)
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
294-294: Unused function argument: chunk_size
(ARG001)
312-312: Unused function argument: pages_per_seq
(ARG001)
312-312: Unused function argument: slot_idx
(ARG001)
312-312: Unused function argument: page_size
(ARG001)
312-312: Unused function argument: chunk_size
(ARG001)
tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py
185-185: Unused function argument: chunk_size
(ARG001)
200-200: Unused function argument: position_ids
(ARG001)
200-200: Unused function argument: pages_per_seq
(ARG001)
200-200: Unused function argument: slot_idx
(ARG001)
200-200: Unused function argument: page_size
(ARG001)
200-200: Unused function argument: chunk_size
(ARG001)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py
123-123: Unused function argument: chunk_size
(ARG001)
147-147: Unused function argument: input_pos
(ARG001)
147-147: Unused function argument: cache_loc
(ARG001)
147-147: Unused function argument: pages_per_seq
(ARG001)
147-147: Unused function argument: page_size
(ARG001)
147-147: Unused function argument: chunk_size
(ARG001)
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
43-43: Prefer next(iter(node.users.keys())) over single element slice
Replace with next(iter(node.users.keys()))
(RUF015)
82-82: Unused method argument: cm
(ARG002)
83-83: Unused method argument: factory
(ARG002)
84-84: Unused method argument: shared_config
(ARG002)
99-99: Consider [*list(conv_node.args[:-1]), activation_name] instead of concatenation
Replace with [*list(conv_node.args[:-1]), activation_name]
(RUF005)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
32-32: Unused function argument: cache_loc
(ARG001)
33-33: Unused function argument: pages_per_seq
(ARG001)
35-35: Unused function argument: page_size
(ARG001)
98-98: Unused function argument: input_pos
(ARG001)
98-98: Unused function argument: cache_loc
(ARG001)
98-98: Unused function argument: pages_per_seq
(ARG001)
98-98: Unused function argument: page_size
(ARG001)
98-98: Unused function argument: chunk_size
(ARG001)
260-260: Unused function argument: cu_seqlens
(ARG001)
261-261: Unused function argument: chunk_indices
(ARG001)
262-262: Unused function argument: chunk_offsets
(ARG001)
263-263: Unused function argument: batch_info_tensor
(ARG001)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
366-366: Unused function argument: chunk_size
(ARG001)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
300-300: Unused noqa directive (non-enabled: E501)
Remove unused noqa directive
(RUF100)
361-361: Unused function argument: dtype
(ARG001)
361-361: Unused function argument: block_shape
(ARG001)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
64-64: Unused function argument: chunk_size
(ARG001)
85-85: Unused function argument: input_pos
(ARG001)
85-85: Unused function argument: cache_loc
(ARG001)
85-85: Unused function argument: pages_per_seq
(ARG001)
85-85: Unused function argument: page_size
(ARG001)
85-85: Unused function argument: chunk_size
(ARG001)
232-232: Unused function argument: activation
(ARG001)
🔇 Additional comments (18)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1)

354-363: Approved with a note on error handling.

The implementation correctly handles the optional `activation` parameter extraction with an appropriate fallback to `None`. The try/except block accounts for cases where the parameter doesn't exist in the source node (as noted in the comment, it may be added by fusion later).

Note: The broad exception catching (RuntimeError, IndexError) follows the pattern from `extract_op_args`, which can raise `RuntimeError` when a parameter is not found. This is acceptable given the optional nature of the parameter.

tensorrt_llm/_torch/auto_deploy/config/default.yaml (1)

168-169: LGTM!

The new `fuse_causal_conv_activation` transform is correctly placed at the compile stage, which is appropriate for fusion optimizations that occur after cache initialization and before model compilation.

tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

127-131: LGTM!

The `chunk_size` property implementation follows the established pattern used by `vocab_size_padded` above it, correctly retrieving the value from the model config with an appropriate None fallback.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py (1)

85-85: LGTM!

The test is correctly updated to pass `None` for the new activation parameter, maintaining backward compatibility while adapting to the extended op signature.

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

124-124: LGTM!

The `chunk_size` parameter is correctly passed from the factory to `SequenceInfo`, enabling chunk-based processing throughout the pipeline.

tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py (1)

185-185: Signature extension for interface consistency.

The `chunk_size` parameter is added to maintain consistency with other `prepare_*_metadata` interfaces being updated across the codebase. While currently unused in this implementation, it ensures a uniform signature for future enhancements.

Also applies to: 200-200
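The chunk_size plumbing called out in the entries above (a factory property that reads the model config, and an executor that forwards the value into SequenceInfo) can be summarized in a few lines. This is a minimal sketch with stand-in names (DemoFactory, DemoSequenceInfo, build_sequence_info), not the actual TensorRT-LLM classes.

```python
from typing import Optional


class DemoFactory:
    """Model factory exposing chunk_size read from an HF-style model config."""

    def __init__(self, model_config) -> None:
        self._model_config = model_config

    @property
    def chunk_size(self) -> Optional[int]:
        # Same pattern as the reviewed hf.py property: return the config value
        # if present, otherwise fall back to None.
        return getattr(self._model_config, "chunk_size", None)


class DemoSequenceInfo:
    """Sequence metadata container that now also carries chunk_size."""

    def __init__(self, max_seq_len: int, chunk_size: Optional[int] = None) -> None:
        self.max_seq_len = max_seq_len
        self.chunk_size = chunk_size  # threaded through to prepare_metadata ops


def build_sequence_info(factory: DemoFactory, max_seq_len: int) -> DemoSequenceInfo:
    # Mirrors the ad_executor.py change: fetch chunk_size from the factory and
    # forward it when constructing the sequence-info object.
    return DemoSequenceInfo(max_seq_len=max_seq_len, chunk_size=factory.chunk_size)
```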
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)

91-91: LGTM!

The `chunk_size` parameter is correctly added as an optional attribute with appropriate initialization and storage.

Also applies to: 118-118

179-179: LGTM!

Adding `"chunk_size"` to `_cached_constants` correctly enables it to be passed as a constant argument to `prepare_metadata` operations.

169-169: Update tests to match the new `slot_idx` dtype specification.

The change to `torch.long` is valid and aligns with PyTorch's indexing requirements (operations like `index_select()` and `index_copy_()` require the `torch.long` dtype). However, existing tests create `slot_idx` with `dtype=torch.int32`. While the implementation currently handles the conversion via `.to(torch.long)`, tests should be updated to match the new interface specification in attention_interface.py:169.

Update the following test files to create `slot_idx` with `dtype=torch.long`:

- test_triton_mamba_cached_op.py (lines 46, 117)
- test_torch_causal_conv_cached_op.py (lines 47, 111, 178)
- test_torch_attention_op.py (line 478)
- test_cuda_causal_conv_cached_op.py (lines 49, 115, 187)
- test_torch_mamba_cached_op.py (lines 55, 127, 192)

⛔ Skipped due to learnings

Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1068-1085
Timestamp: 2025-08-28T10:21:46.652Z
Learning: torch.index_select works with int32 indices in practice despite documentation stating LongTensor requirement. In TensorRT-LLM codebase, int32 indices are used intentionally and work correctly.

tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (1)

165-165: Signature extension for interface consistency.

The `chunk_size` parameter is added to maintain a uniform signature across all `prepare_*_metadata` operations in the codebase. While not currently utilized by FlashInfer's metadata preparation, this ensures interface consistency for future chunked prefill support.

Also applies to: 217-217
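A tiny standalone illustration of the `slot_idx` dtype point in the 169-169 comment above (tensors and shapes here are made up, not taken from the PR): `index_copy_` is documented to take a LongTensor index, so an int32 `slot_idx` is converted before use.

```python
import torch

# Standalone illustration of the slot_idx dtype point; shapes are arbitrary.
cache = torch.zeros(8, 4)                           # e.g. a per-slot state cache
new_state = torch.ones(2, 4)                        # states for two active slots
slot_idx = torch.tensor([1, 5], dtype=torch.int32)  # tests currently use int32

# index_copy_ is documented to take a LongTensor index, so convert explicitly,
# mirroring the `.to(torch.long)` handling mentioned in the review comment.
cache.index_copy_(0, slot_idx.to(torch.long), new_state)
print(cache[1], cache[5])  # both rows are now all ones
```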
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py (1)

286-295: Interface expansion: `chunk_size` parameter added but not yet used.

The `chunk_size` parameter has been added to maintain consistency with other metadata preparation functions across the codebase. While currently unused, this is part of a coordinated API expansion for future chunk-based processing support.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (5)

116-116: Well-structured activation parameter propagation.

The optional `activation` parameter is correctly threaded through the prefill and decode paths, with proper propagation to both `causal_conv1d_fn` and `causal_conv1d_update`. The try/except block in `get_constants` appropriately handles cases where the activation parameter is added later by the fusion transform.

Also applies to: 180-180, 199-199, 232-232, 299-304

190-192: Improved clarity with slice-based token selection.

The change from index-based mapping to explicit slicing makes the decode path token selection more readable and maintainable.

209-210: Appropriate optimization: removed unnecessary `contiguous()` call.

Since `y` is allocated with `torch.empty()` at line 140, it's already contiguous. The `.contiguous()` call was redundant. The comment correctly notes that `y` is not an alias of any input tensor.

185-186: No issues found. The dtype optimization is safe.

The dtype flow is consistent throughout the operation:

- `y` is initialized with `dtype=input.dtype` (line 140)
- `causal_conv1d_fn` returns its input tensor unchanged (line 74 of causal_conv1d.py), preserving the input dtype
- `y_varlen` inherits `input.dtype` from the function return
- `y_prefill = y_varlen.transpose(0, 1)` preserves dtype through transpose
- Both `y_flat[:total_prefill_tokens]` and `y_prefill` have matching dtype, making the `.to(y_flat.dtype)` cast redundant

The removal of the explicit cast is a valid optimization.

205-207: Verify dtype compatibility for the decode path; manual testing recommended.

The dtype concern is valid but unverifiable from the Python wrapper alone. Line 207's `copy_` operation assumes `y_dec` (returned from `causal_conv1d_update`) preserves the dtype of its input `x_decode`. However, the underlying CUDA kernel implementation is not accessible in the codebase, making it impossible to confirm dtype preservation behavior. Test this with different input dtypes (e.g., float16, bfloat16) to ensure `copy_` succeeds without unexpected conversions.

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py (2)
15-58: Pattern matcher correctly identifies fusible activations.

The pattern matching logic properly identifies causal conv nodes with a single activation user. Currently supports SiLU, with clear extensibility points for additional activations. The implementation is sound.

95-110: The fusion logic correctly assumes activation is the last parameter; verified against the signature.

The `_cuda_cached_causal_conv1d` function signature confirms `activation` is the final parameter, making the code at lines 99-100 correct: `list(conv_node.args[:-1]) + [activation_name]` properly constructs the new arguments.
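To make the shape of this transform concrete, here is a heavily simplified FX sketch of the same idea: find an op whose single user is a SiLU call and replace the pair with one fused call. It uses a plain nn.Conv1d and a wrapper module instead of the real cached causal-conv custom op and its activation argument, so all names here (ConvThenSilu, FusedConvSilu, fuse_silu_into_conv) are illustrative only, not the PR's implementation.

```python
import torch
from torch import fx, nn


class ConvThenSilu(nn.Module):
    """Toy stand-in for the real graph: a conv followed by a SiLU activation."""

    def __init__(self) -> None:
        super().__init__()
        self.conv = nn.Conv1d(4, 4, kernel_size=3, padding=2, groups=4)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.silu(self.conv(x))


class FusedConvSilu(nn.Module):
    """Stand-in for a fused kernel: the conv with the activation baked in."""

    def __init__(self, conv: nn.Conv1d) -> None:
        super().__init__()
        self.conv = conv

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.nn.functional.silu(self.conv(x))


def fuse_silu_into_conv(gm: fx.GraphModule) -> fx.GraphModule:
    """Fold a single-user SiLU into its producer, mirroring the transform's shape."""
    for node in list(gm.graph.nodes):
        if node.op != "call_module":
            continue
        conv = gm.get_submodule(node.target)
        if not isinstance(conv, nn.Conv1d) or len(node.users) != 1:
            continue
        act = next(iter(node.users))  # the lone consumer of the conv output
        if act.op == "call_function" and act.target is torch.nn.functional.silu:
            fused_name = f"{node.target}_fused_silu".replace(".", "_")
            gm.add_submodule(fused_name, FusedConvSilu(conv))
            with gm.graph.inserting_before(node):
                fused_node = gm.graph.call_module(fused_name, args=node.args, kwargs=node.kwargs)
            act.replace_all_uses_with(fused_node)  # reroute users of the activation
            gm.graph.erase_node(act)               # then drop the now-dead nodes
            gm.graph.erase_node(node)
    gm.graph.lint()
    gm.recompile()
    return gm


if __name__ == "__main__":
    gm = fx.symbolic_trace(ConvThenSilu())
    fuse_silu_into_conv(gm)
    print(gm(torch.randn(2, 4, 16)).shape)  # numerically identical, one graph node fewer
```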
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
/bot run

PR_Github #24370 [ run ] triggered by Bot. Commit:
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
PR_Github #24370 [ run ] completed with state
…version of the op Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
adding @atrifex to review on behalf of oss-compliance
/bot run
PR_Github #24416 [ run ] triggered by Bot. Commit:
PR_Github #24416 [ run ] completed with state
/bot run
PR_Github #24484 [ run ] triggered by Bot. Commit:
PR_Github #24484 [ run ] completed with state
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
/bot reuse-pipeline
PR_Github #24525 [ reuse-pipeline ] triggered by Bot. Commit:
PR_Github #24525 [ reuse-pipeline ] completed with state
…NVIDIA#9083) Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
…NVIDIA#9083) Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Summary by CodeRabbit
New Features
Optimizations