
Conversation

Collaborator

@suyoggupta suyoggupta commented Nov 12, 2025

Summary by CodeRabbit

  • New Features

    • Added fused causal convolution with activation optimization for improved kernel performance
    • Introduced dynamic hardware-specific tuning for MoE kernel configurations
    • Extended support for chunked sequence processing across attention and SSM operations
  • Optimizations

    • Improved output memory handling in CUDA graph compilation
    • Enhanced metadata computation for cached SSM operations with additional context information

Contributor

coderabbitai bot commented Nov 12, 2025

📝 Walkthrough

Walkthrough

This PR introduces chunk-based processing support throughout the auto-deploy pipeline, implements dynamic MoE kernel configuration loading from JSON files, adds a causal convolution fusion optimization pass, and extends various backend implementations to handle activation parameters and chunking metadata.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Configuration & Model Properties**<br>tensorrt_llm/_torch/auto_deploy/config/default.yaml, tensorrt_llm/_torch/auto_deploy/models/factory.py, tensorrt_llm/_torch/auto_deploy/models/hf.py | Added the fuse_causal_conv_activation transformation to the compile stage. Introduced a chunk_size property on ModelFactory and AutoModelForCausalLMFactory with fallback retrieval from the model config (see the sketch after this table). |
| **Custom Ops: Attention & Metadata Interfaces**<br>tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py, tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py, tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py | Extended all metadata preparation function signatures to accept a chunk_size: int parameter. Updated SequenceInfo to store chunk_size as an optional attribute and changed the slot_idx dtype from int to long. Updated the cached constants tuple to include chunk_size. |
| **Custom Ops: Causal Convolution (CUDA & Torch)**<br>tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py | Added chunk_size and activation: Optional[str] parameters to the metadata preparation functions. Extended _cuda_cached_causal_conv1d to accept and propagate activation through the prefill and decode paths. Modified output handling to eliminate dtype casting and avoid cloning. Updated get_constants to extract the optional activation from the source node with a fallback to None. |
| **Custom Ops: Mamba SSM (Torch & Triton)**<br>tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py, tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py | Added a chunk_size parameter to the torch metadata preparation. Introduced a new _triton_ssm_prepare_metadata op returning an 8-tuple with extended metadata (cu_seqlens, chunk_indices, chunk_offsets, batch_info_tensor). Updated TritonBackendSSM to consume and propagate the augmented metadata through the prefill and decode phases. |
| **Custom Ops: MoE Configuration Infrastructure**<br>tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py, tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_H100_80GB_HBM3.json, tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_L40S.json | Added config-loading infrastructure: get_config_file_name(), get_moe_configs() (cached), and _get_kernel_config() to support dynamic kernel tuning. Introduced JSON config files with the Triton version and block-size presets keyed by batch size. Replaced direct _default_kernel_config() calls with a resolver that attempts an optimized config lookup before falling back. |
| **Transforms: Causal Convolution Fusion**<br>tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py | New module introducing the FuseCausalConvActivation transform registered as "fuse_causal_conv_activation". It pattern-matches causal_conv1d followed by an activation (silu), rewrites matched patterns to a fused op call with the activation baked in as an argument, and erases the original nodes. |
| **Runtime & Execution**<br>tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py | Extended SequenceInfo construction in build_from_config to pass chunk_size. Added debug output of llm_args during ADEngine initialization. Modified the forward output path in torch_cudagraph.py to return a sliced buffer without detach/clone, altering memory sharing and gradient tracking. |
| **Tests**<br>tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py | Updated test calls to the cuda_cached_causal_conv1d op to pass an additional None argument reflecting the updated op signature. |
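
A minimal sketch of the chunk_size fallback described in the first row of the table above; the property name and None default mirror the summary, but the class body is illustrative rather than the actual factory implementation in models/hf.py:

```python
from typing import Optional


class AutoModelForCausalLMFactorySketch:
    """Illustrative stand-in for the factory classes touched by this PR."""

    def __init__(self, model_config):
        # e.g., a transformers PretrainedConfig for a Mamba/hybrid model (assumption)
        self.model_config = model_config

    @property
    def chunk_size(self) -> Optional[int]:
        # Fallback retrieval from the model config: return the configured SSM
        # chunk size if present, otherwise None so callers can skip chunking.
        return getattr(self.model_config, "chunk_size", None)
```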

Sequence Diagram(s)

sequenceDiagram
    participant GraphModule
    participant Matcher as Pattern Matcher
    participant Transform as FuseCausalConvActivation
    participant Backend as CUDA Backend

    GraphModule->>Matcher: Scan for causal_conv1d + activation
    Matcher-->>Transform: Return matched (conv_node, activation_node, op_name)
    
    Transform->>Transform: Extract activation function name
    Note over Transform: Identify silu/activation type
    
    Transform->>GraphModule: Insert fused op call<br/>(cuda_cached_causal_conv1d + activation arg)
    GraphModule->>GraphModule: Replace activation node with fused call
    GraphModule->>GraphModule: Erase original conv & activation nodes
    
    GraphModule->>Backend: Execute fused kernel<br/>(activation baked in)
    Backend-->>GraphModule: Return fused output
sequenceDiagram
    participant User
    participant Factory as ModelFactory
    participant Executor as ADExecutor
    participant Config as get_moe_configs()
    participant Kernel as Triton Kernel

    User->>Factory: Query model chunk_size
    Factory-->>User: Return chunk_size from config
    
    User->>Executor: Initialize with factory
    Executor->>Factory: Fetch chunk_size
    Executor->>Executor: Build SequenceInfo with chunk_size
    
    Kernel->>Config: Request optimized config<br/>(E, N, dtype, batch_size)
    Config->>Config: Load JSON from disk (cached)
    Config->>Config: Find closest batch-size key to M
    alt Config Found
        Config-->>Kernel: Return optimized block sizes
        Kernel->>Kernel: Use tuned BLOCK_SIZE_M/N/K
    else Fallback
        Config-->>Kernel: Return None
        Kernel->>Kernel: Use default_kernel_config(M, E)
    end
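
To make the config-resolution branch in the diagram above concrete, here is a hedged Python sketch of the lookup flow. The real helpers are get_config_file_name(), get_moe_configs(), and _get_kernel_config() in triton_moe.py; the simplified names, signatures, and JSON layout below are assumptions, with only the file-name pattern taken from the config files added in this PR.

```python
import functools
import json
import os
from typing import Dict, Optional

import torch


def config_file_name(num_experts: int, shard_n: int) -> str:
    # Mirrors the naming of the JSON files in this PR, e.g.
    # "E=128,N=1856,device_name=NVIDIA_H100_80GB_HBM3.json".
    device_name = torch.cuda.get_device_name().replace(" ", "_")
    return f"E={num_experts},N={shard_n},device_name={device_name}.json"


@functools.lru_cache(maxsize=None)
def load_moe_configs(num_experts: int, shard_n: int) -> Optional[Dict[int, dict]]:
    # Cached load of the per-device tuning table, keyed by batch size (M).
    path = os.path.join(os.path.dirname(__file__), "triton_fused_moe_configs",
                        config_file_name(num_experts, shard_n))
    if not os.path.exists(path):
        return None  # no tuned table for this device/shape: use the default config
    with open(path) as f:
        return {int(m): cfg for m, cfg in json.load(f).items()}


def get_kernel_config(m: int, num_experts: int, shard_n: int) -> Optional[dict]:
    configs = load_moe_configs(num_experts, shard_n)
    if not configs:
        return None
    # Pick the entry whose batch-size key is closest to the actual M.
    closest_m = min(configs.keys(), key=lambda key: abs(key - m))
    return configs[closest_m]
```

Keying the closest batch-size entry lets a single tuning table serve arbitrary M without retuning, with the default kernel config as a safety net when no table exists.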

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • torch_cudagraph.py (forward output handling): The change from detach/clone to returning a direct slice affects memory ownership and gradient tracking; verify there are no unintended side effects on backprop or the memory lifecycle (see the sketch after this list).
  • triton_backend_mamba.py (augmented metadata outputs): New 8-tuple return value with batch info tensors and chunk metadata; ensure all consumers properly unpack and utilize new fields, and verify prefill/decode logic correctly applies new indexing.
  • triton_moe.py (config loading & resolver): New _get_kernel_config() resolution logic selects closest batch-size key; verify rounding/fallback behavior handles edge cases and cache invalidation is correct.
  • cuda_backend_causal_conv.py & torch_backend_causal_conv.py (activation propagation): Activation parameter extraction and threading through prefill/decode—ensure all code paths handle None activation correctly and no activation is lost.
  • fuse_causal_conv.py (pattern matching & graph rewriting): New transform logic that modifies GraphModule; verify pattern matcher correctly identifies all intended causal_conv1d+activation patterns and rewrites do not break computation.
  • Signature consistency across custom ops: chunk_size added to many metadata functions; check that all callers have been updated and no mismatches exist between function signatures and call sites.
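
A hedged sketch of the torch_cudagraph.py output-handling change flagged in the first bullet above; the class and buffer names are hypothetical, not the actual compiled-graph wrapper:

```python
import torch


class GraphedForwardSketch:
    """Hypothetical CUDA-graph wrapper illustrating the output-handling change."""

    def __init__(self, max_tokens: int, hidden_size: int, device: str = "cpu"):
        # Static output buffer that a captured graph would write into on each replay.
        self._out_buffer = torch.empty(max_tokens, hidden_size, device=device)

    def forward(self, num_tokens: int) -> torch.Tensor:
        # ... graph.replay() would populate self._out_buffer[:num_tokens] here ...
        # Old behavior (private copy): return self._out_buffer[:num_tokens].detach().clone()
        # New behavior (zero-copy view): the caller now aliases the static buffer, so
        # the result must be consumed or copied before the next replay overwrites it,
        # and it is no longer explicitly detached from autograd.
        return self._out_buffer[:num_tokens]
```

Returning the view saves an allocation and a copy per step, at the cost of the aliasing caveat noted in the comments.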

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 48.84%, which is below the required threshold of 80.00%. | Run @coderabbitai generate docstrings to improve docstring coverage. |
| Description check | ⚠️ Warning | The pull request description is entirely empty, missing all required sections from the template, including a description of changes, test coverage details, and PR checklist verification. | Provide a comprehensive PR description including: (1) what changes are made and why; (2) relevant test coverage information; (3) verification of the PR checklist items. Refer to the description template for the required sections. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main changes: adding Triton configs and optimizing mamba prefill for Autodeploy, which aligns with the file changes and objectives. |
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (1)

358-384: Fix the fake op signature to include chunk_size.

torch_backend_prepare_metadata now receives chunk_size, but the fake registration still exposes the old signature. During fake tensor tracing the dispatcher will pass the extra argument, leading to an immediate TypeError and breaking export. Please update the fake to accept the new parameter as well.

 @torch_backend_prepare_metadata.register_fake
 def torch_backend_prepare_metadata_fake(
-    position_ids, seq_len, input_pos, cache_loc, pages_per_seq, slot_idx, page_size
+    position_ids,
+    seq_len,
+    input_pos,
+    cache_loc,
+    pages_per_seq,
+    slot_idx,
+    page_size,
+    chunk_size,
 ):
     num_seq = SequenceInfo._get_sanitized_num_sequences(position_ids, seq_len)
     return (
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (1)

299-301: Drop the unused # noqa.

Ruff flags this # noqa: E501 as unused. Removing the directive (or splitting the string if needed) keeps the file clean.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bb6eb95 and 40622e9.

📒 Files selected for processing (19)
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_H100_80GB_HBM3.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_L40S.json (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (4 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (6 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py (7 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/factory.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/models/hf.py (1 hunks)
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (2 hunks)
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py (1 hunks)
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Use only spaces, no tabs; indent with 4 spaces.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).

Files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py
  • tensorrt_llm/_torch/auto_deploy/models/factory.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/models/hf.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
🧠 Learnings (10)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
📚 Learning: 2025-08-08T04:10:19.038Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6728
File: cpp/tensorrt_llm/plugins/mixtureOfExperts/mixtureOfExpertsPlugin.cpp:966-966
Timestamp: 2025-08-08T04:10:19.038Z
Learning: TensorRT plugins currently don't support padding functionality, and TensorRT is not getting new features (in maintenance mode). This means that duplicating parameters like mExpertHiddenSize in function calls, even with TODO comments, can be acceptable as pragmatic solutions within these constraints.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_fused_moe_configs/E=128,N=1856,device_name=NVIDIA_L40S.json
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
📚 Learning: 2025-08-09T20:57:04.084Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu:118-127
Timestamp: 2025-08-09T20:57:04.084Z
Learning: In the CUTLASS MoE finalize fusion implementation (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_gemm_tma_warp_specialized_input.cu), when setting `fused_finalize_epilogue.stride_final_output` with shape `(hidden_size, num_output_tokens, 1)`, the `num_rows_in_final_output` should be set to `num_output_tokens` (not `hidden_size`) because of a swap+transpose operation that maps rows of the output tensor to `hidden_size` and columns to `num_output_tokens`.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
📚 Learning: 2025-08-20T06:56:02.889Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6768
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:577-579
Timestamp: 2025-08-20T06:56:02.889Z
Learning: In cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, maxSequenceLength is now enforced as a non-optional argument in the BlockManager constructor, so concerns about std::nullopt defaulting to 0 are not applicable. When windowSize > maxSequenceLength, a warning should be added instead of handling optional parameter cases.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
📚 Learning: 2025-08-14T23:23:27.449Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 6915
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:4010-4012
Timestamp: 2025-08-14T23:23:27.449Z
Learning: For MOE (Mixture of Experts) code reviews in TensorRT-LLM, avoid repeatedly suggesting finalize fusion validation checks and safety assertions. The user djns99 has indicated these suggestions are repetitive and unwanted across multiple MOE-related changes.

Applied to files:

  • tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py
🧬 Code graph analysis (13)
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (4)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • chunk_size (128-131)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • chunk_size (198-200)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
  • seq_len (296-297)
  • input_pos (300-301)
  • cache_loc (304-305)
  • pages_per_seq (308-309)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
  • page_size (197-201)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
  • extract_op_args (469-506)
tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py (2)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • chunk_size (128-131)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • chunk_size (198-200)
tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py (4)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • chunk_size (128-131)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • chunk_size (198-200)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
  • seq_len (296-297)
  • input_pos (300-301)
  • cache_loc (304-305)
  • pages_per_seq (308-309)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
  • page_size (197-201)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • chunk_size (128-131)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py (4)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • chunk_size (128-131)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • chunk_size (198-200)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
  • seq_len (296-297)
  • input_pos (300-301)
  • cache_loc (304-305)
  • pages_per_seq (308-309)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
  • page_size (197-201)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • chunk_size (198-200)
tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py (3)
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)
  • CachedSequenceInterface (11-92)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
  • is_op (197-220)
tensorrt_llm/_torch/auto_deploy/transform/interface.py (4)
  • BaseTransform (217-504)
  • SharedConfig (61-66)
  • TransformInfo (121-178)
  • TransformRegistry (507-535)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • chunk_size (128-131)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • chunk_size (198-200)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (4)
  • seq_len (296-297)
  • _get_sanitized_seq_len (388-428)
  • to (465-472)
  • device (190-191)
tensorrt_llm/_torch/modules/mamba/mamba2_metadata.py (1)
  • cu_seqlens_to_chunk_indices_offsets (24-85)
tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py (2)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • chunk_size (128-131)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • chunk_size (198-200)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (3)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • chunk_size (128-131)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • chunk_size (198-200)
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)
  • extract_op_args (469-506)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (3)
tensorrt_llm/_torch/auto_deploy/models/hf.py (1)
  • chunk_size (128-131)
tensorrt_llm/_torch/auto_deploy/models/factory.py (1)
  • chunk_size (198-200)
tensorrt_llm/_torch/auto_deploy/llm.py (1)
  • factory (110-113)
🪛 Ruff (0.14.4)
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py

165-165: Unused function argument: chunk_size

(ARG001)


217-217: Unused function argument: input_pos

(ARG001)


217-217: Unused function argument: pages_per_seq

(ARG001)


217-217: Unused function argument: slot_idx

(ARG001)


217-217: Unused function argument: page_size

(ARG001)


217-217: Unused function argument: chunk_size

(ARG001)

tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py

294-294: Unused function argument: chunk_size

(ARG001)


312-312: Unused function argument: pages_per_seq

(ARG001)


312-312: Unused function argument: slot_idx

(ARG001)


312-312: Unused function argument: page_size

(ARG001)


312-312: Unused function argument: chunk_size

(ARG001)

tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py

185-185: Unused function argument: chunk_size

(ARG001)


200-200: Unused function argument: position_ids

(ARG001)


200-200: Unused function argument: pages_per_seq

(ARG001)


200-200: Unused function argument: slot_idx

(ARG001)


200-200: Unused function argument: page_size

(ARG001)


200-200: Unused function argument: chunk_size

(ARG001)

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py

123-123: Unused function argument: chunk_size

(ARG001)


147-147: Unused function argument: input_pos

(ARG001)


147-147: Unused function argument: cache_loc

(ARG001)


147-147: Unused function argument: pages_per_seq

(ARG001)


147-147: Unused function argument: page_size

(ARG001)


147-147: Unused function argument: chunk_size

(ARG001)

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py

43-43: Prefer next(iter(node.users.keys())) over single element slice

Replace with next(iter(node.users.keys()))

(RUF015)


82-82: Unused method argument: cm

(ARG002)


83-83: Unused method argument: factory

(ARG002)


84-84: Unused method argument: shared_config

(ARG002)


99-99: Consider [*list(conv_node.args[:-1]), activation_name] instead of concatenation

Replace with [*list(conv_node.args[:-1]), activation_name]

(RUF005)

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py

32-32: Unused function argument: cache_loc

(ARG001)


33-33: Unused function argument: pages_per_seq

(ARG001)


35-35: Unused function argument: page_size

(ARG001)


98-98: Unused function argument: input_pos

(ARG001)


98-98: Unused function argument: cache_loc

(ARG001)


98-98: Unused function argument: pages_per_seq

(ARG001)


98-98: Unused function argument: page_size

(ARG001)


98-98: Unused function argument: chunk_size

(ARG001)


260-260: Unused function argument: cu_seqlens

(ARG001)


261-261: Unused function argument: chunk_indices

(ARG001)


262-262: Unused function argument: chunk_offsets

(ARG001)


263-263: Unused function argument: batch_info_tensor

(ARG001)

tensorrt_llm/_torch/auto_deploy/custom_ops/torch_backend_attention.py

366-366: Unused function argument: chunk_size

(ARG001)

tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py

300-300: Unused noqa directive (non-enabled: E501)

Remove unused noqa directive

(RUF100)


361-361: Unused function argument: dtype

(ARG001)


361-361: Unused function argument: block_shape

(ARG001)

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py

64-64: Unused function argument: chunk_size

(ARG001)


85-85: Unused function argument: input_pos

(ARG001)


85-85: Unused function argument: cache_loc

(ARG001)


85-85: Unused function argument: pages_per_seq

(ARG001)


85-85: Unused function argument: page_size

(ARG001)


85-85: Unused function argument: chunk_size

(ARG001)


232-232: Unused function argument: activation

(ARG001)

🔇 Additional comments (18)
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1)

354-363: Approved with a note on error handling.

The implementation correctly handles the optional activation parameter extraction with appropriate fallback to None. The try/except block accounts for cases where the parameter doesn't exist in the source node (as noted in the comment, it may be added by fusion later).

Note: The broad exception catching (RuntimeError, IndexError) follows the pattern from extract_op_args which can raise RuntimeError when a parameter is not found. This is acceptable given the optional nature of the parameter.
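
A minimal sketch of that extraction-with-fallback pattern, assuming extract_op_args(node, name) returns the matching argument values as a sequence and raises RuntimeError when the argument is absent (the node_utils helper referenced above); the wrapper name here is hypothetical:

```python
from typing import Optional

from tensorrt_llm._torch.auto_deploy.utils.node_utils import extract_op_args


def _maybe_get_activation(source_node) -> Optional[str]:
    """Return the activation constant from the source op if present, else None."""
    try:
        (activation,) = extract_op_args(source_node, "activation")
    except (RuntimeError, IndexError):
        # The un-fused op carries no activation yet; the
        # fuse_causal_conv_activation pass may bake one in later.
        activation = None
    return activation
```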

tensorrt_llm/_torch/auto_deploy/config/default.yaml (1)

168-169: LGTM!

The new fuse_causal_conv_activation transform is correctly placed at the compile stage, which is appropriate for fusion optimizations that occur after cache initialization and before model compilation.

tensorrt_llm/_torch/auto_deploy/models/hf.py (1)

127-131: LGTM!

The chunk_size property implementation follows the established pattern used by vocab_size_padded above it, correctly retrieving the value from the model config with an appropriate None fallback.

tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_cuda_causal_conv_cached_op.py (1)

85-85: LGTM!

The test is correctly updated to pass None for the new activation parameter, maintaining backward compatibility while adapting to the extended op signature.

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

124-124: LGTM!

The chunk_size parameter is correctly passed from the factory to SequenceInfo, enabling chunk-based processing throughout the pipeline.

tensorrt_llm/_torch/auto_deploy/custom_ops/mla.py (1)

185-185: Signature extension for interface consistency.

The chunk_size parameter is added to maintain consistency with other prepare_*_metadata interfaces being updated across the codebase. While currently unused in this implementation, it ensures a uniform signature for future enhancements.

Also applies to: 200-200

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (3)

91-91: LGTM!

The chunk_size parameter is correctly added as an optional attribute with appropriate initialization and storage.

Also applies to: 118-118


179-179: LGTM!

Adding "chunk_size" to _cached_constants correctly enables it to be passed as a constant argument to prepare_metadata operations.


169-169: Update tests to match new slot_idx dtype specification.

The change to torch.long is valid and aligns with PyTorch's indexing requirements (operations like index_select() and index_copy_() require torch.long dtype). However, existing tests create slot_idx with dtype=torch.int32. While the implementation currently handles the conversion via .to(torch.long), tests should be updated to match the new interface specification in attention_interface.py:169.

Update the following test files to create slot_idx with dtype=torch.long:

  • test_triton_mamba_cached_op.py (lines 46, 117)
  • test_torch_causal_conv_cached_op.py (lines 47, 111, 178)
  • test_torch_attention_op.py (line 478)
  • test_cuda_causal_conv_cached_op.py (lines 49, 115, 187)
  • test_torch_mamba_cached_op.py (lines 55, 127, 192)
⛔ Skipped due to learnings
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/pyexecutor/sampler.py:1068-1085
Timestamp: 2025-08-28T10:21:46.652Z
Learning: torch.index_select works with int32 indices in practice despite documentation stating LongTensor requirement. In TensorRT-LLM codebase, int32 indices are used intentionally and work correctly.
tensorrt_llm/_torch/auto_deploy/custom_ops/flashinfer_attention.py (1)

165-165: Signature extension for interface consistency.

The chunk_size parameter is added to maintain a uniform signature across all prepare_*_metadata operations in the codebase. While not currently utilized by FlashInfer's metadata preparation, this ensures interface consistency for future chunked prefill support.

Also applies to: 217-217

tensorrt_llm/_torch/auto_deploy/custom_ops/triton_attention.py (1)

286-295: Interface expansion: chunk_size parameter added but not yet used.

The chunk_size parameter has been added to maintain consistency with other metadata preparation functions across the codebase. While currently unused, this is part of a coordinated API expansion for future chunk-based processing support.

tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py (5)

116-116: Well-structured activation parameter propagation.

The optional activation parameter is correctly threaded through the prefill and decode paths, with proper propagation to both causal_conv1d_fn and causal_conv1d_update. The try/except block in get_constants appropriately handles cases where the activation parameter is added later by the fusion transform.

Also applies to: 180-180, 199-199, 232-232, 299-304
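
On the None-handling point, a small hedged fallback sketch (a reference path, not the fused CUDA kernel): when the activation is not fused into the kernel call, it can be applied explicitly afterwards, with None meaning pass-through. The supported-name set is an assumption based on the SiLU fusion in this PR.

```python
from typing import Optional

import torch
import torch.nn.functional as F


def apply_optional_activation(y: torch.Tensor, activation: Optional[str]) -> torch.Tensor:
    """Apply the activation baked in by the fusion pass, or pass through if None."""
    if activation is None:
        return y
    if activation in ("silu", "swish"):
        return F.silu(y)
    raise ValueError(f"Unsupported fused activation: {activation}")
```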


190-192: Improved clarity with slice-based token selection.

The change from index-based mapping to explicit slicing makes the decode path token selection more readable and maintainable.


209-210: Appropriate optimization: removed unnecessary contiguous() call.

Since y is allocated with torch.empty() at line 140, it's already contiguous. The .contiguous() call was redundant. The comment correctly notes that y is not an alias of any input tensor.


185-186: No issues found. The dtype optimization is safe.

The dtype flow is consistent throughout the operation:

  • y is initialized with dtype=input.dtype (line 140)
  • causal_conv1d_fn returns its input tensor unchanged (line 74 of causal_conv1d.py), preserving the input dtype
  • y_varlen inherits input.dtype from the function return
  • y_prefill = y_varlen.transpose(0, 1) preserves dtype through transpose
  • Both y_flat[:total_prefill_tokens] and y_prefill have matching dtype, making the .to(y_flat.dtype) cast redundant

The removal of the explicit cast is a valid optimization.


205-207: Verify dtype compatibility for decode path—manual testing recommended.

The dtype concern is valid but unverifiable from the Python wrapper alone. Line 207's copy_ operation assumes y_dec (returned from causal_conv1d_update) preserves the dtype of its input x_decode. However, the underlying CUDA kernel implementation is not accessible in the codebase, making it impossible to confirm dtype preservation behavior. Test this with different input dtypes (e.g., float16, bfloat16) to ensure copy_ succeeds without unexpected conversions.

tensorrt_llm/_torch/auto_deploy/transform/library/fuse_causal_conv.py (2)

15-58: Pattern matcher correctly identifies fusible activations.

The pattern matching logic properly identifies causal conv nodes with a single activation user. Currently supports SiLU with clear extensibility points for additional activations. The implementation is sound.


95-110: The fusion logic correctly assumes activation is the last parameter—verified against the signature.

The _cuda_cached_causal_conv1d function signature confirms activation is the final parameter, making the code at lines 99–100 correct: list(conv_node.args[:-1]) + [activation_name] properly constructs the new arguments.
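
To make the argument rewrite concrete, here is a hedged torch.fx sketch of the rewrite step. It is simplified in that it reuses the conv node's target, whereas the real transform swaps in the fused cuda_cached_causal_conv1d variant; the helper name is illustrative.

```python
import torch.fx as fx


def fuse_conv_with_activation(gm: fx.GraphModule, conv_node: fx.Node,
                              act_node: fx.Node, activation_name: str) -> None:
    graph = gm.graph
    with graph.inserting_after(act_node):
        # Assumption: the fused op shares the conv op's signature, with the
        # activation name replacing the trailing (previously None) argument.
        fused = graph.call_function(
            conv_node.target,
            args=(*conv_node.args[:-1], activation_name),
            kwargs=conv_node.kwargs,
        )
    act_node.replace_all_uses_with(fused)
    graph.erase_node(act_node)   # erase the activation first (it consumes conv_node)
    graph.erase_node(conv_node)  # conv_node now has no users and can be removed
    gm.recompile()
```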

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
@suyoggupta
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24370 [ run ] triggered by Bot. Commit: f2ec3d8

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
@tensorrt-cicd
Collaborator

PR_Github #24370 [ run ] completed with state SUCCESS. Commit: f2ec3d8
/LLM/main/L0_MergeRequest_PR pipeline #18392 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

…version of the op

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
@suyoggupta suyoggupta requested a review from atrifex November 13, 2025 06:57
@suyoggupta
Collaborator Author

adding @atrifex to review on behalf of oss-compliance

@suyoggupta
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24416 [ run ] triggered by Bot. Commit: b15a2f8

@tensorrt-cicd
Collaborator

PR_Github #24416 [ run ] completed with state SUCCESS. Commit: b15a2f8
/LLM/main/L0_MergeRequest_PR pipeline #18422 completed with status: 'FAILURE'

@suyoggupta
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #24484 [ run ] triggered by Bot. Commit: b15a2f8

@tensorrt-cicd
Collaborator

PR_Github #24484 [ run ] completed with state SUCCESS. Commit: b15a2f8
/LLM/main/L0_MergeRequest_PR pipeline #18478 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
@suyoggupta
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Collaborator

PR_Github #24525 [ reuse-pipeline ] triggered by Bot. Commit: 33ef830

@tensorrt-cicd
Collaborator

PR_Github #24525 [ reuse-pipeline ] completed with state SUCCESS. Commit: 33ef830
Reusing PR_Github #24484 for commit 33ef830

@suyoggupta suyoggupta merged commit d12cb94 into NVIDIA:main Nov 14, 2025
5 checks passed
@github-project-automation github-project-automation bot moved this from Backlog to Done in AutoDeploy Board Nov 14, 2025
zheyuf pushed a commit to zheyuf/TensorRT-LLM that referenced this pull request Nov 19, 2025
…NVIDIA#9083)

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>
greg-kwasniewski1 pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Nov 20, 2025
…NVIDIA#9083)

Signed-off-by: Suyog Gupta <41447211+suyoggupta@users.noreply.github.com>