Skip to content

[#12633][feat] AutoDeploy: Support torch-cudagraph for Eagle#12745

Merged
govind-ramnarayan merged 3 commits intoNVIDIA:mainfrom
nv-auto-deploy:gramnarayan/mtp-enable-cudagraph
Apr 30, 2026
Merged

[#12633][feat] AutoDeploy: Support torch-cudagraph for Eagle#12745
govind-ramnarayan merged 3 commits intoNVIDIA:mainfrom
nv-auto-deploy:gramnarayan/mtp-enable-cudagraph

Conversation

@govind-ramnarayan
Copy link
Copy Markdown
Collaborator

@govind-ramnarayan govind-ramnarayan commented Apr 3, 2026

torch-cudagraph mainly tested on Llama-3.1-8B-Instruct + Eagle3, as well as SuperV3 + MTP.

Fixes: #12633

Note: There are some TODOs left from this PR to investigate some lingering issues. They are here:

#13100
#13133
#13134
#13135
#13143

Summary by CodeRabbit

  • New Features

    • Enhanced speculative decoding support for Eagle and MTP models with improved batch handling for extend-only sequences.
    • Extended maximum sequence length to 65K tokens.
    • Added TensorRT-LLM as a supported attention backend with CUDA graph capture optimization.
    • Improved distributed synchronization for speculative decoding in multi-GPU environments.
  • Bug Fixes

    • Fixed attention backend compatibility issues with speculative decoding configurations.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py Outdated
@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/mtp-enable-cudagraph branch from ff75700 to d6a5cfe Compare April 8, 2026 05:37
Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/llm_args.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py Outdated
@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/mtp-enable-cudagraph branch 4 times, most recently from f092c06 to d05d4b4 Compare April 15, 2026 22:39
Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Comment thread tests/integration/defs/accuracy/test_llm_api_autodeploy.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py Outdated
@govind-ramnarayan govind-ramnarayan marked this pull request as ready for review April 16, 2026 06:39
@govind-ramnarayan govind-ramnarayan requested review from a team as code owners April 16, 2026 06:39
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 16, 2026

📝 Walkthrough

Walkthrough

The PR integrates Eagle3 speculative-decoding support into AutoDeploy with TensorRT-LLM attention backend, including CUDA graph dynamic-input refresh callbacks, metadata handling for prompt lengths and extend-only batches, draft model dtype management, and expanded transform pipeline support for spec-decoding-aware graph compilation.

Changes

Cohort / File(s) Summary
Configuration & Model Registry
examples/auto_deploy/model_registry/configs/super_v3_mtp.yaml
Upgraded SuperV3 MTP config from torch/flashinfer to TensorRT-LLM runtime: added runtime: trtllm, changed backends (compile_backend to torch-cudagraph, attn_backend to trtllm), increased max sequence length to 65536, added CUDA graph/KV cache configs, and extended transforms with sharding detection and MoE fusion.
CUDA Graph Backend
tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Extended graph capture to support optional refresh_inputs_fn callback for dynamic input re-computation during warmup vs actual capture, refactored batch input preparation into reusable _prepare_capture_inputs, and moved get_args_kwargs call inside warm-up context for PiecewiseCapturedGraph.
TensorRT-LLM Attention
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
Added spec-decoding state management (init_spec_decoding(), persistent tensors for position offsets/packed masks), extended metadata preparation to accept prompt lengths and spec config, updated context-length handling to use prompt lengths instead of sequence lengths, and added runtime spec-decoding parameter derivation with SM-version-aware adjustments.
Attention Interface & Utilities
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py, tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py
Added extend-only batch detection (is_extend_only()), introduced prompt_lens tensor staging, replaced flatten() with generic flatten_3d(), updated position-offset expansion for extend-only batches, added set_eagle_extend_batch() method, and removed runtime shape assertion in gather_tokens.
LLM Configuration & Validation
tensorrt_llm/_torch/auto_deploy/llm_args.py
Added model validator to conditionally force torch-simple compile backend when speculative decoding is enabled with flashinfer attention, and updated factory creation to pass sync_before_hidden_state_capture flag.
Eagle Drafter Models
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py, tensorrt_llm/_torch/auto_deploy/models/eagle.py
Added target_dtype parameter to EagleConfig for explicit dtype specification, introduced conditional torch.cuda.synchronize() for hidden-state capture, added distributed token synchronization for multi-rank setups, refactored hidden-state collection, added extend-only logits reshaping, and enhanced factory to propagate dtype overrides and sync flags.
AutoDeploy Executor & Runtime
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py, tensorrt_llm/_torch/auto_deploy/shim/interface.py
Updated executor to construct and pass prompt_lens tensor during sequence nesting, changed spec-decoding invocation to unconditionally pass cache_seq_interface, and enhanced KV resource compatibility logging with per-layer block-offset multiplier consistency checks.
Transform Pipeline
tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py, tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py, tensorrt_llm/_torch/auto_deploy/transform/library/hidden_states.py, tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py, tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
Removed early-return logic for draft models in collectives/sharding, added draft-aware source selection in sharding, updated compile_model to support extend-only batch creation with draft length configuration, enhanced KV cache metadata registration to conditionally inject spec_config parameter, and added comment clarification in hidden-states capture.
Utilities
tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py, tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
Made CudaGraphWarmUpPhase re-entrant by checking warm-up status before state transitions, and added draft embedding-size inference utilities (infer_draft_embedding_size) with pattern matching for MTP and Eagle draft families, including helper functions for dimension detection.
Test Updates
tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/defs/examples/test_ad_speculative_decoding.py, tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py, tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_trtllm_attention_op.py, tests/unittest/auto_deploy/singlegpu/models/test_eagle.py, tests/unittest/auto_deploy/singlegpu/shim/test_cached_sequence_interface.py, tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py, tests/unittest/auto_deploy/singlegpu/smoke/test_ad_speculative_decoding.py
Extended CUDA graph tests with opaque-kwarg and extend-only mode coverage, added comprehensive TrtllmAttentionMetadata test class validating planner state consistency, introduced Eagle config/dtype/wrapper unit tests, parameterized Eagle3 tests across attention backends, added extend-batch validation test, added compile-backend auto-downgrade test, and introduced Eagle3 smoke test with both flashinfer and trtllm backends.

Sequence Diagram(s)

sequenceDiagram
    participant Executor as AD Executor
    participant CudaGraph as CUDA Graph Capturer
    participant AttentionOp as TensorRT-LLM Attention
    participant KVCache as KV Cache Manager

    Executor->>CudaGraph: capture_graph(get_args_kwargs_fn)
    CudaGraph->>CudaGraph: _prepare_capture_inputs(batch_size)
    CudaGraph->>Executor: refresh_inputs_fn() [if needed]
    Executor->>Executor: compute prompt_lens<br/>nest_sequences(prompt_lens=...)
    CudaGraph->>CudaGraph: warmup_and_capture()
    activate CudaGraph
    CudaGraph->>AttentionOp: prepare_trtllm_metadata_host<br/>(prompt_lens_host, spec_config)
    AttentionOp->>AttentionOp: init_spec_decoding()<br/>populate context_lengths<br/>from prompt_lens
    CudaGraph->>AttentionOp: trtllm_mha_with_cache<br/>(context_lengths_gpu)
    AttentionOp->>KVCache: manage KV with spec_decoding<br/>tensors/offsets
    KVCache-->>AttentionOp: return outputs
    deactivate CudaGraph
    CudaGraph-->>Executor: captured_graph
Loading
sequenceDiagram
    participant Exec as AutoDeployLLM
    participant Comp as Compilation Pipeline
    participant CSI as CacheSeqInterface
    participant EagleWrapper as Eagle Wrapper
    participant Attention as Attention Op

    Exec->>Comp: set_eagle_extend_batch<br/>(batch_size, max_draft_len)
    Comp->>CSI: set_eagle_extend_batch()
    CSI->>CSI: create synthetic<br/>extend-only batch
    CSI->>CSI: mark batch_info<br/>as extend_only=True
    Exec->>EagleWrapper: forward(extend_batch)
    EagleWrapper->>EagleWrapper: detect is_extend_only()
    alt Extend-Only Mode
        EagleWrapper->>EagleWrapper: reshape target_logits to<br/>(num_seqs, 1+draft_len, vocab)
        EagleWrapper->>EagleWrapper: broadcast tokens<br/>across ranks (multi-rank)
    end
    EagleWrapper->>Attention: pass cache_seq_interface<br/>for metadata
    Attention->>Attention: use prompt_lens for<br/>context_lengths
    Attention-->>EagleWrapper: return logits
    EagleWrapper-->>Exec: outputs
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

waiting for feedback

Suggested reviewers

  • lucaslie
  • suyoggupta
  • zheyuf
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.80% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check ⚠️ Warning The PR description is largely incomplete. It references issue #12633 and includes TODO links, but lacks a comprehensive explanation of what is being implemented, why, and how. Add detailed Description, Test Coverage, and PR Checklist sections explaining the changes, affected components, testing strategy, and verification that all checklist items have been reviewed.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title '[#12633][feat] AutoDeploy: Support torch-cudagraph for Eagle' clearly and specifically summarizes the main change: enabling torch-cudagraph support for Eagle in AutoDeploy.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the standard NVIDIA SPDX header before the imports.

This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.

As per coding guidelines All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py` at line
1, Add the standard NVIDIA SPDX copyright/license header (with the latest
modification year) at the very top of the file before the existing imports;
ensure the header precedes the first import statement "from typing import List,
Literal, Optional, Tuple, Type" so the file complies with the TensorRT-LLM
source file header requirement.
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the standard NVIDIA SPDX header before the imports.

This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.

As per coding guidelines All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` at line 1, Add the
standard NVIDIA SPDX copyright header (including the latest modification year)
at the top of the file before any imports or the module docstring (the existing
triple-quoted string at the top of
tensorrt_llm._torch.auto_deploy.utils.node_utils), so the file begins with the
required NVIDIA SPDX header line(s) followed by the existing docstring; ensure
the header matches the project’s standard format and contains the correct year.
tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the standard NVIDIA SPDX header before the imports.

This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.

As per coding guidelines All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py` at line 1, Add the
standard NVIDIA SPDX copyright/license header at the very top of the file (above
the first import statement "from contextlib import contextmanager") using the
current/latest modification year; ensure the header matches the project's
required NVIDIA SPDX format exactly and appears before any code or imports so
the file complies with the TensorRT-LLM source file header policy.
tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the standard NVIDIA SPDX header before the imports.

This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.

As per coding guidelines All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` at line 1,
This file (module tensorrt_llm._torch.auto_deploy.transform.library.sharding.py)
is missing the NVIDIA SPDX copyright header; add the standard NVIDIA header
block containing the latest modification year and the SPDX license identifier
(e.g., NVIDIA CORPORATION & AFFILIATES with the SPDX-License-Identifier)
immediately at the top of the file before the module docstring and any imports
so the file-level header appears before the existing triple-quoted docstring.
tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor

Add the standard NVIDIA SPDX header before the imports.

This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.

As per coding guidelines All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py` at line 1, This file is
missing the required NVIDIA SPDX copyright header; add the standard NVIDIA SPDX
header (including the latest modification year) as the very first lines of
tensorrt_llm/_torch/auto_deploy/shim/interface.py before any imports (i.e.,
before the existing "import copy") so the file contains the mandated
copyright/license header.
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)

747-805: ⚠️ Potential issue | 🔴 Critical

Append full prompt length to prompt_lens for all request types, not chunk size for prefill.

Line 763 appends request.context_chunk_size for context requests, while Line 805 appends request.py_prompt_len for extend/decode requests. The downstream consumer in attention_interface.py:1230 documents that prompt_lens must contain "original context length per sequence, constant across iterations." For chunked-prefill requests, the chunk size is not the original context length—it is the current iteration's slice. This mismatch causes context_lengths_gpu in trtllm_attention.py:225 to record incorrect values for sequences undergoing chunk-based prefill, breaking attention metadata computation.

Append request.py_prompt_len (or equivalent full prompt length) for context requests at line 763, matching the semantics of line 805.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py` around lines 747 - 805,
The bug is that prompt_lens is populated with request.context_chunk_size for
context_requests (in the context_requests loop) instead of the full original
prompt length, causing downstream consumers (expecting constant "original
context length per sequence") to get wrong values; fix by appending the full
prompt length (request.py_prompt_len or equivalent full-context property) to
prompt_lens inside the context_requests loop (replace the
prompt_lens.append(request.context_chunk_size) call with
prompt_lens.append(request.py_prompt_len) or the request field that represents
the original context length) so both context_requests and gen_requests append
the same semantic value; check attention_interface.py expectations (line ~1230)
to ensure the chosen field matches "original context length per sequence."
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py (1)

942-952: ⚠️ Potential issue | 🔴 Critical

Sort hidden-state caches by numeric layer index, not by raw name.

sorted(..., key=lambda x: x[0]) will misorder layer-suffixed buffers once the index has more than one digit. For example, the Eagle3 path in tests/integration/defs/accuracy/test_llm_api_autodeploy.py captures layers {1, 15, 28}, and names like ..._15_hidden_states_cache sort before ..._1_hidden_states_cache. That scrambles the concatenation fed into the drafter FC and changes speculative outputs.

🔧 Proposed fix
+import re
...
-        buffers = sorted(
-            [
-                (name, tensor)
-                for name, tensor in kwargs.items()
-                if name.endswith("hidden_states_cache")
-            ],
-            key=lambda x: x[0],
-        )
+        def _cache_sort_key(item: tuple[str, torch.Tensor]) -> int:
+            name, _ = item
+            match = re.search(r"(\d+)_hidden_states_cache$", name)
+            if match is None:
+                raise ValueError(f"Could not extract layer index from {name!r}")
+            return int(match.group(1))
+
+        buffers = sorted(
+            [
+                (name, tensor)
+                for name, tensor in kwargs.items()
+                if name.endswith("hidden_states_cache")
+            ],
+            key=_cache_sort_key,
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py` around lines
942 - 952, The current sort of hidden-state buffers uses the raw name
(sorted(..., key=lambda x: x[0])), which misorders multi-digit layer suffixes;
change the sort to extract the numeric layer index from each buffer name (the
digits immediately preceding "_hidden_states_cache") and sort by that integer
instead, e.g., parse the index from the tuple's name (the first element of
buffers), convert to int with a safe fallback if parsing fails, then torch.cat
the buffers in the resulting numeric order so the concatenation passed to the
drafter FC preserves correct layer order.
♻️ Duplicate comments (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)

598-602: ⚠️ Potential issue | 🟠 Major

Restore a memory gate for test_mtp.

The TODO is right: without a replacement for skip_less_device_memory(180000), this 120B MTP accuracy test will get scheduled on machines the test itself says are under-provisioned, which is likely to turn it into an OOM/timeout source instead of a useful regression test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py` around lines 598
- 602, Reinstate a memory gate for the 120B MTP test by re-applying the skip
decorator (or an equivalent guard) to test_mtp so it only runs on nodes with
>=180GB device memory: restore the commented-out
`@pytest.mark.skip_less_device_memory`(180000) above the test_mtp definition (or
wrap the test_mtp function with a runtime check that calls pytest.skip when
available memory <180000) so the parametric test (test_mtp, attn_backend,
world_size) is not scheduled on under-provisioned CI machines.
🧹 Nitpick comments (6)
tests/unittest/auto_deploy/singlegpu/models/test_eagle.py (1)

157-168: Avoid torch.device() in the default argument.

This default is evaluated at import time, and Ruff is already flagging the pattern on Line 167. Using None and normalizing inside __init__ keeps the helper aligned with the repo lint rules.

Suggested refactor
 class _FakeCSI:
     def __init__(
         self,
         *,
         max_batch_size: int,
         max_num_tokens: int,
         hidden_size: int,
         num_capture_layers: int,
         ids_dtype: torch.dtype = torch.int64,
         hidden_states_dtype: torch.dtype = torch.float16,
-        device: torch.device = torch.device("cpu"),
+        device: torch.device | None = None,
     ):
+        if device is None:
+            device = torch.device("cpu")
         self.info = SimpleNamespace(
             max_batch_size=max_batch_size,
             max_num_tokens=max_num_tokens,
             device=device,
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/auto_deploy/singlegpu/models/test_eagle.py` around lines 157 -
168, The _FakeCSI.__init__ currently uses torch.device("cpu") as a default
argument which is evaluated at import time; change the signature to accept
device: Optional[torch.device] = None (or remove the default device), then
inside __init__ normalize it with something like device = torch.device("cpu") if
device is None else device (or torch.device(device)) so the device is
constructed at runtime; update the type hint and any uses of self.device
accordingly in class _FakeCSI to avoid the import-time torch.device() default.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)

1424-1450: Address unused variable warnings from static analysis.

The static analysis flagged num_decode and num_decode_tokens as unused. Since these are from tuple unpacking where only some values are needed, prefix them with underscores to indicate intentional non-use.

♻️ Proposed fix
-            num_prefill, num_extend, num_decode = self.batch_info.get_num_sequences()
-            num_prefill_tokens, num_extend_tokens, num_decode_tokens = (
+            num_prefill, num_extend, _num_decode = self.batch_info.get_num_sequences()
+            num_prefill_tokens, num_extend_tokens, _num_decode_tokens = (
                 self.batch_info.get_num_tokens()
             )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around
lines 1424 - 1450, Change the unused tuple-unpacked variables from
batch_info.get_num_sequences() and batch_info.get_num_tokens() to explicitly
ignored names so static analysis stops flagging them; replace num_decode and
num_decode_tokens with _num_decode and _num_decode_tokens (keeping num_prefill,
num_extend, num_prefill_tokens, num_extend_tokens as-is) in the block that
assigns from self.batch_info.get_num_sequences() and
self.batch_info.get_num_tokens() used by the logic in attention_interface.py
(the code that computes offset_for_pos_ids and uses tokens_per_seq and
repeat_interleave).

1478-1486: Verify the overlap mode assumption for extend requests.

The comment states "AD overlap mode packs real overlap-carried non-prefill sequences first" and "CUDA-graph dummy requests are appended later". This relies on ordering invariants maintained elsewhere. Consider adding a brief comment or assertion to document this contract more explicitly at this location.

📝 Optional: Add defensive comment or soft assertion
         num_overlap = gather_slot_idx.numel()
         # AD overlap mode packs real overlap-carried non-prefill sequences first. CUDA-graph dummy
         # requests are appended later and therefore remain in the trailing zero-increment region.
         # This keeps the padded dummy requests from consuming a new_tokens_lens entry while still
         # preserving the existing ordering-based contract used throughout SequenceInfo.
+        # Invariant: either all non-prefill are extend OR all are decode, never mixed.
         increment[num_prefill : num_prefill + num_overlap] = (
             new_lens_ungathered[gather_slot_idx] - 1
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around
lines 1478 - 1486, Add a short defensive comment documenting the AD overlap-mode
ordering contract and a soft assertion that verifies gather_slot_idx maps into
the expected slice immediately following num_prefill: ensure num_overlap ==
gather_slot_idx.numel() and that all indices in gather_slot_idx are within the
range [num_prefill, num_prefill + num_overlap) before computing increment and
calling self.offset_pos_and_cache_. Reference the local symbols gather_slot_idx,
num_prefill, num_overlap, increment, and self.offset_pos_and_cache_ when adding
this check so the invariant is explicit and fails fast if ordering elsewhere
changes.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py (3)

445-461: Duplicate variable assignments for num_tokens and max_context_length.

Lines 447-449 assign num_tokens, max_context_length, and max_num_requests from batch_info, but lines 458-461 reassign the same variables. The first set of assignments appear unused before being overwritten.

♻️ Proposed fix to remove duplicate assignments
     # Get batch dimensions and model-level constants from host tensors (no device sync)
     batch_info = BatchInfo(batch_info_host)
-    num_tokens = batch_info.get_total_num_tokens()
-    max_context_length = batch_info.get_max_context_length()
-    max_num_requests = batch_info.get_max_batch_size()
     _GlobalTrtllmPlanner.update_host_request_types(batch_info)
 
     # Infer dimensions from tensor shapes (bsnd layout)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`
around lines 445 - 461, BatchInfo-derived variables num_tokens,
max_context_length, and max_num_requests are assigned twice (before and after
inferring tensor shapes); remove the first set of redundant assignments (the
ones immediately after BatchInfo(batch_info_host)) and keep the single
assignments that occur after shape inference so num_tokens, max_context_length,
and max_num_requests are only set once; retain the
_GlobalTrtllmPlanner.update_host_request_types(batch_info) call but do not
re-declare or reassign those variables earlier.

278-305: Consider adding cache size limits for @functools.cache decorators.

These cached functions allocate GPU tensors keyed by (max_num_requests, draft_len). While typically only a few distinct combinations exist at runtime, unbounded caching could lead to GPU memory accumulation if many different parameter combinations are used. Consider using @functools.lru_cache(maxsize=...) with a reasonable limit instead.

♻️ Suggested change
-@functools.cache
+@functools.lru_cache(maxsize=8)
 def _generate_spec_decoding_position_offsets(max_num_requests: int, draft_len: int) -> torch.Tensor:
     ...

-@functools.cache
+@functools.lru_cache(maxsize=8)
 def _generate_spec_decoding_packed_mask(max_num_requests: int, draft_len: int) -> torch.Tensor:
     ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`
around lines 278 - 305, The two GPU-tensor-producing functions
_generate_spec_decoding_position_offsets and _generate_spec_decoding_packed_mask
currently use `@functools.cache` (unbounded) and should use a bounded cache to
avoid unbounded GPU memory growth; replace `@functools.cache` with
`@functools.lru_cache`(maxsize=...) (pick a reasonable maxsize like 8 or make it
configurable) so the cache keys (max_num_requests, draft_len) are limited, and
ensure the functions' signatures stay hashable for lru_cache to work.

532-538: Refactor to use the existing is_sm_version_trtllm_gen_kernel() method instead of duplicating the logic inline.

The condition not (sm_version < 100 or sm_version in (120, 121)) is correctly mirrored from is_sm_version_trtllm_gen_kernel() in trtllm.py, but duplicating it here creates unnecessary maintenance risk. Replace the inline condition with a call to the method to ensure they stay synchronized.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`
around lines 532 - 538, Replace the inline SM-version check with the shared
helper: instead of calling get_sm_version() and using the duplicated condition,
call is_sm_version_trtllm_gen_kernel() (the same helper in
trtllm.py/TrtllmAttentionWrapper) to decide whether to append three None entries
to spec_decoding_tensor_params; i.e., if is_sm_version_trtllm_gen_kernel()
returns True then extend spec_decoding_tensor_params with [None, None, None],
otherwise do nothing—remove the duplicated sm_version logic and any unused
get_sm_version references.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 1230-1237: The decode loop in demollm.py calls
sequence_info.nest_sequences() without passing the original prompt lengths,
causing prompt_lens to default to sl_host and break attention masking; fix it by
preserving total_lens from the prefill phase and passing it into the
decode-phase call to sequence_info.nest_sequences(...) as the prompt_lens
argument (same value used in the prefill call), ensuring the decode call (in the
loop around line ~198) includes prompt_lens=total_lens so attention masking uses
the original context lengths.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`:
- Around line 136-150: init_spec_decoding may access self.block_offsets before
reset() has allocated it, causing torch.zeros_like(self.block_offsets) to fail;
add a defensive guard at the top of init_spec_decoding that returns if
self.block_offsets is None (or ensure reset() has run) and only then proceed to
set self._scratch_block_offsets, spec_decoding_generation_lengths,
spec_decoding_position_offsets, and spec_decoding_packed_mask; reference
init_spec_decoding, reset, block_offsets, _scratch_block_offsets, and
is_spec_decoding_enabled when making the change.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py`:
- Around line 30-32: The function gather_tokens in torch_gather_logits.py
incorrectly collapses extend-only inputs ([bs, sl, ...]) into a decode-shaped
output ([N, 1, ...]) by branching only on bsz==1; update gather_tokens (and any
related reshape logic used by flatten_3d) to preserve the original
second-dimension when input is extend-only by deriving it from BatchInfo
per-sequence token counts instead of forcing [N,1,...]. Specifically, detect
extend-only layout via BatchInfo (or the per-sequence counts API), compute the
correct per-batch sequence lengths for the second dimension, and reshape/permute
using those lengths so outputs keep [bs, sl, ...] for extend-only cases while
still supporting [bs,1,...] decode and [1,total,...] prefill layouts.

In `@tensorrt_llm/_torch/auto_deploy/models/eagle.py`:
- Around line 315-329: The target dtype and trust settings are being inferred by
unconditionally calling AutoConfig.from_pretrained(..., trust_remote_code=True),
which ignores explicit overrides in model_kwargs or kwargs; update the logic
that computes target_dtype (used when constructing self.draft_factory /
EagleDrafterFactory) to first check for explicit dtype overrides in model_kwargs
(e.g. "torch_dtype") and kwargs (e.g. "dtype"), and to respect the caller's
trust_remote_code value rather than hard-coding True when calling
AutoConfig.from_pretrained; i.e., derive a local trust_remote_code variable from
kwargs or a surrounding parameter and only call AutoConfig.from_pretrained with
that value if no dtype was supplied in model_kwargs/kwargs, then pass the
resolved target_dtype into EagleDrafterFactory.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 1534-1540: The current Eagle-drafter check only looks at
node.users directly, which misses cases where q/k/v linears are piped through
passthrough ops (view/reshape/transpose) before reaching the attention op;
update the branch that checks "if in_eagle_drafter and in_dim == 2 * embd" to
follow passthrough users transitively instead of only immediate users: reuse or
add a small helper that starting from node (or each user) recursively walks
through allowed passthrough op kinds (e.g., view/reshape/transpose) and returns
true if any reachable user satisfies is_any_attention_op; use that helper in
place of the current any(is_any_attention_op(u) for u in node.users) call so
Eagle q/k/v openings are recognized through reshape/view chains.

---

Outside diff comments:
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py`:
- Around line 942-952: The current sort of hidden-state buffers uses the raw
name (sorted(..., key=lambda x: x[0])), which misorders multi-digit layer
suffixes; change the sort to extract the numeric layer index from each buffer
name (the digits immediately preceding "_hidden_states_cache") and sort by that
integer instead, e.g., parse the index from the tuple's name (the first element
of buffers), convert to int with a safe fallback if parsing fails, then
torch.cat the buffers in the resulting numeric order so the concatenation passed
to the drafter FC preserves correct layer order.

In `@tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py`:
- Around line 747-805: The bug is that prompt_lens is populated with
request.context_chunk_size for context_requests (in the context_requests loop)
instead of the full original prompt length, causing downstream consumers
(expecting constant "original context length per sequence") to get wrong values;
fix by appending the full prompt length (request.py_prompt_len or equivalent
full-context property) to prompt_lens inside the context_requests loop (replace
the prompt_lens.append(request.context_chunk_size) call with
prompt_lens.append(request.py_prompt_len) or the request field that represents
the original context length) so both context_requests and gen_requests append
the same semantic value; check attention_interface.py expectations (line ~1230)
to ensure the chosen field matches "original context length per sequence."

In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py`:
- Line 1: This file is missing the required NVIDIA SPDX copyright header; add
the standard NVIDIA SPDX header (including the latest modification year) as the
very first lines of tensorrt_llm/_torch/auto_deploy/shim/interface.py before any
imports (i.e., before the existing "import copy") so the file contains the
mandated copyright/license header.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py`:
- Line 1: Add the standard NVIDIA SPDX copyright/license header (with the latest
modification year) at the very top of the file before the existing imports;
ensure the header precedes the first import statement "from typing import List,
Literal, Optional, Tuple, Type" so the file complies with the TensorRT-LLM
source file header requirement.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py`:
- Line 1: This file (module
tensorrt_llm._torch.auto_deploy.transform.library.sharding.py) is missing the
NVIDIA SPDX copyright header; add the standard NVIDIA header block containing
the latest modification year and the SPDX license identifier (e.g., NVIDIA
CORPORATION & AFFILIATES with the SPDX-License-Identifier) immediately at the
top of the file before the module docstring and any imports so the file-level
header appears before the existing triple-quoted docstring.

In `@tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py`:
- Line 1: Add the standard NVIDIA SPDX copyright/license header at the very top
of the file (above the first import statement "from contextlib import
contextmanager") using the current/latest modification year; ensure the header
matches the project's required NVIDIA SPDX format exactly and appears before any
code or imports so the file complies with the TensorRT-LLM source file header
policy.

In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Line 1: Add the standard NVIDIA SPDX copyright header (including the latest
modification year) at the top of the file before any imports or the module
docstring (the existing triple-quoted string at the top of
tensorrt_llm._torch.auto_deploy.utils.node_utils), so the file begins with the
required NVIDIA SPDX header line(s) followed by the existing docstring; ensure
the header matches the project’s standard format and contains the correct year.

---

Duplicate comments:
In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py`:
- Around line 598-602: Reinstate a memory gate for the 120B MTP test by
re-applying the skip decorator (or an equivalent guard) to test_mtp so it only
runs on nodes with >=180GB device memory: restore the commented-out
`@pytest.mark.skip_less_device_memory`(180000) above the test_mtp definition (or
wrap the test_mtp function with a runtime check that calls pytest.skip when
available memory <180000) so the parametric test (test_mtp, attn_backend,
world_size) is not scheduled on under-provisioned CI machines.

---

Nitpick comments:
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 1424-1450: Change the unused tuple-unpacked variables from
batch_info.get_num_sequences() and batch_info.get_num_tokens() to explicitly
ignored names so static analysis stops flagging them; replace num_decode and
num_decode_tokens with _num_decode and _num_decode_tokens (keeping num_prefill,
num_extend, num_prefill_tokens, num_extend_tokens as-is) in the block that
assigns from self.batch_info.get_num_sequences() and
self.batch_info.get_num_tokens() used by the logic in attention_interface.py
(the code that computes offset_for_pos_ids and uses tokens_per_seq and
repeat_interleave).
- Around line 1478-1486: Add a short defensive comment documenting the AD
overlap-mode ordering contract and a soft assertion that verifies
gather_slot_idx maps into the expected slice immediately following num_prefill:
ensure num_overlap == gather_slot_idx.numel() and that all indices in
gather_slot_idx are within the range [num_prefill, num_prefill + num_overlap)
before computing increment and calling self.offset_pos_and_cache_. Reference the
local symbols gather_slot_idx, num_prefill, num_overlap, increment, and
self.offset_pos_and_cache_ when adding this check so the invariant is explicit
and fails fast if ordering elsewhere changes.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`:
- Around line 445-461: BatchInfo-derived variables num_tokens,
max_context_length, and max_num_requests are assigned twice (before and after
inferring tensor shapes); remove the first set of redundant assignments (the
ones immediately after BatchInfo(batch_info_host)) and keep the single
assignments that occur after shape inference so num_tokens, max_context_length,
and max_num_requests are only set once; retain the
_GlobalTrtllmPlanner.update_host_request_types(batch_info) call but do not
re-declare or reassign those variables earlier.
- Around line 278-305: The two GPU-tensor-producing functions
_generate_spec_decoding_position_offsets and _generate_spec_decoding_packed_mask
currently use `@functools.cache` (unbounded) and should use a bounded cache to
avoid unbounded GPU memory growth; replace `@functools.cache` with
`@functools.lru_cache`(maxsize=...) (pick a reasonable maxsize like 8 or make it
configurable) so the cache keys (max_num_requests, draft_len) are limited, and
ensure the functions' signatures stay hashable for lru_cache to work.
- Around line 532-538: Replace the inline SM-version check with the shared
helper: instead of calling get_sm_version() and using the duplicated condition,
call is_sm_version_trtllm_gen_kernel() (the same helper in
trtllm.py/TrtllmAttentionWrapper) to decide whether to append three None entries
to spec_decoding_tensor_params; i.e., if is_sm_version_trtllm_gen_kernel()
returns True then extend spec_decoding_tensor_params with [None, None, None],
otherwise do nothing—remove the duplicated sm_version logic and any unused
get_sm_version references.

In `@tests/unittest/auto_deploy/singlegpu/models/test_eagle.py`:
- Around line 157-168: The _FakeCSI.__init__ currently uses torch.device("cpu")
as a default argument which is evaluated at import time; change the signature to
accept device: Optional[torch.device] = None (or remove the default device),
then inside __init__ normalize it with something like device =
torch.device("cpu") if device is None else device (or torch.device(device)) so
the device is constructed at runtime; update the type hint and any uses of
self.device accordingly in class _FakeCSI to avoid the import-time
torch.device() default.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 7d79db9b-89eb-4e40-a47d-8ccd70d3daa5

📥 Commits

Reviewing files that changed from the base of the PR and between 51f7956 and 2b3d27c.

📒 Files selected for processing (25)
  • examples/auto_deploy/model_registry/configs/super_v3_mtp.yaml
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py
  • tensorrt_llm/_torch/auto_deploy/models/eagle.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/hidden_states.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
  • tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py
  • tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/integration/defs/examples/test_ad_speculative_decoding.py
  • tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_trtllm_attention_op.py
  • tests/unittest/auto_deploy/singlegpu/models/test_eagle.py
  • tests/unittest/auto_deploy/singlegpu/shim/test_cached_sequence_interface.py
  • tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py
  • tests/unittest/auto_deploy/singlegpu/smoke/test_ad_speculative_decoding.py
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py

Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/eagle.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py
Comment thread tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
Comment thread tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py Outdated
@govind-ramnarayan govind-ramnarayan changed the title [feat] AutoDeploy: Support torch-cudagraph for Eagle [#12633][feat] AutoDeploy: Support torch-cudagraph for Eagle Apr 16, 2026
@xinhe-nv
Copy link
Copy Markdown
Collaborator

@coderabbitai Act as a QA engineer reviewing test changes for TensorRT-LLM.

    QA test list hygiene (integration / release runs):
    - If the change adds or materially alters an integration test under
      tests/integration/defs/ (or otherwise affects what QA should run on a
      schedule), call out whether an entry is needed under
      tests/integration/test_lists/qa/. See tests/integration/test_lists/qa/README.md
      for which file to use (e.g. llm_function_core.txt for primary single-node
      multi-GPU functional cases; llm_function_multinode.txt for multi-node;
      llm_perf_*.yml for perf; llm_triton_integration.txt for Triton).
    - If the PR only touches unittest/ or narrow unit scope, say explicitly
      whether QA list updates are unnecessary or optional.

    Coverage expectations:
    - Assess whether new/changed tests cover happy path, important edge cases,
      and failure modes relevant to the feature or fix (skips, guards, env
      vars like LLM_MODELS_ROOT, GPU count, backend-specific paths).
    - Flag tests that assert overly brittle behavior (e.g. exact token match
      across speculative vs non-speculative paths) unless the product contract
      requires it.
    - Note missing negative tests, missing parametrization where multiple
      backends or dtypes apply, or tests that cannot fail when behavior regresses.

    Performance test coverage:
    - If the PR touches performance-sensitive paths (attention kernels, MoE
      routing/dispatch, KV cache management, scheduler, batching logic,
      CUDA graph capture, speculative decoding, or quantization kernels),
      check whether a perf test entry is present or updated in:
      (a) tests/integration/test_lists/test-db/l0_perf.yml or the
          appropriate per-GPU l0_*.yml (primary requirement — this is how
          tests enter pre-merge and scheduled CI); and
      (b) tests/integration/test_lists/qa/llm_perf_*.yml (single-node) or
          tests/integration/test_lists/qa/llm_perf_multinode.txt (multi-node)
          for QA scheduled runs.
      Flag if (a) is missing even when (b) is present, since test-db is
      the authoritative source for CI execution.
    - Flag if a test in tests/integration/defs/perf/test_perf_sanity.py is
      missing for a new model or serving configuration that warrants a
      latency/throughput baseline (examples/benchmark/ is deprecated).
    - Note if only functional correctness is tested for a change where a
      performance regression would not be caught (e.g. a kernel rewrite with
      no perf test, or a new scheduling policy with no throughput assertion).
    - Do not require perf tests for doc-only, infra, style, or pure refactor
      changes where no runtime behavior changes.

    Keep feedback actionable: suggest concrete list file names and whether
    coverage is sufficient, insufficient, or needs follow-up outside the PR.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 17, 2026

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Read the QA README
cat tests/integration/test_lists/qa/README.md 2>/dev/null || echo "FILE NOT FOUND"

echo "=== llm_function_core.txt (grep autodeploy/eagle/speculative) ==="
grep -i -E "autodeploy|eagle|speculative|mtp|cudagraph" tests/integration/test_lists/qa/llm_function_core.txt 2>/dev/null || echo "FILE NOT FOUND or no matches"

echo "=== llm_perf_*.yml files listing ==="
ls tests/integration/test_lists/qa/llm_perf*.yml 2>/dev/null || echo "No perf yml found"

echo "=== grep autodeploy/eagle in llm_perf files ==="
grep -r -i -E "autodeploy|eagle|speculative|mtp|cudagraph" tests/integration/test_lists/qa/ 2>/dev/null || echo "No matches"

echo "=== test-db l0_perf.yml grep ==="
grep -i -E "autodeploy|eagle|speculative|mtp|cudagraph" tests/integration/test_lists/test-db/l0_perf.yml 2>/dev/null || echo "FILE NOT FOUND or no matches"

echo "=== test-db l0_*.yml listing ==="
ls tests/integration/test_lists/test-db/l0_*.yml 2>/dev/null | head -20

echo "=== all test-db perf ymls grep for autodeploy/eagle ==="
grep -r -i -E "autodeploy|eagle|speculative|mtp|cudagraph" tests/integration/test_lists/test-db/ 2>/dev/null | head -40

echo "=== test_perf_sanity.py grep for eagle/autodeploy ==="
grep -i -E "autodeploy|eagle|speculative|mtp|cudagraph" tests/integration/defs/perf/test_perf_sanity.py 2>/dev/null | head -20 || echo "FILE NOT FOUND or no matches"

echo "=== qa list files listing ==="
ls tests/integration/test_lists/qa/ 2>/dev/null

echo "=== integration test: test_ad_speculative_decoding.py ==="
cat tests/integration/defs/examples/test_ad_speculative_decoding.py 2>/dev/null || echo "FILE NOT FOUND"

echo "=== integration test accuracy changes for eagle3 ==="
grep -n -A5 -B5 "eagle3\|Eagle3\|autodeploy_eagle\|test_autodeploy_eagle" tests/integration/defs/accuracy/test_llm_api_autodeploy.py 2>/dev/null | head -80

Length of output: 202933


🏁 Script executed:

#!/bin/bash
echo "=== llm_function_core.txt full content (first 100 lines) ==="
head -100 tests/integration/test_lists/qa/llm_function_core.txt 2>/dev/null || echo "FILE NOT FOUND"

echo "=== llm_function_core.txt grep ad_spec ==="
grep -n -E "ad_spec|speculative|eagle|autodeploy" tests/integration/test_lists/qa/llm_function_core.txt 2>/dev/null || echo "no matches"

echo "=== check if test_ad_speculative_decoding appears anywhere in qa lists ==="
grep -r "test_ad_speculative_decoding\|test_autodeploy_eagle" tests/integration/test_lists/ 2>/dev/null || echo "no matches"

echo "=== check PR changed files in test_lists ==="
git diff --name-only origin/main HEAD 2>/dev/null | grep test_lists || echo "no test_lists files changed"

echo "=== list all qa txt/yml files ==="
find tests/integration/test_lists/qa/ -type f 2>/dev/null | sort

Length of output: 23279


@xinhe-nv: Here is the QA-engineer review for this PR.


QA Test List Hygiene

✅ QA list entries present — but parametrization IDs need updating

The PR touches tests/integration/test_lists/qa/llm_function_core.txt and several test-db/l0_*.yml files. Two entries for the changed integration tests already exist:

accuracy/test_llm_api_autodeploy.py::TestLlama3_1_8B_Instruct_Eagle3::test_eagle3_one_model
accuracy/test_llm_api_autodeploy.py::TestNemotronSuperV3::test_mtp[4]

Problem 1 — llm_function_core.txt unparametrized entry is stale.
test_eagle3_one_model is now parametrized over attn_backend=["flashinfer", "trtllm"]. With an exact-ID test list, the unparametrized entry ::test_eagle3_one_model will no longer collect either variant. The new trtllm+cudagraph path — the core feature of this PR — risks being invisible to QA unless the list is updated:

# Suggested replacements in llm_function_core.txt (and llm_function_core_sanity.txt):
accuracy/test_llm_api_autodeploy.py::TestLlama3_1_8B_Instruct_Eagle3::test_eagle3_one_model[flashinfer]
accuracy/test_llm_api_autodeploy.py::TestLlama3_1_8B_Instruct_Eagle3::test_eagle3_one_model[trtllm]

Problem 2 — test-db/l0_h100.yml entries are stale.
test_autodeploy_eagle3_one_model_acceptance_rate in tests/integration/defs/examples/test_ad_speculative_decoding.py was reparametrized from disable_overlap_scheduler: bool to (attn_backend, compile_backend) pairs. The existing l0_h100.yml entries use the old unparametrized name and will fail to collect under the new fixture IDs:

# Old (stale):
- examples/test_ad_speculative_decoding.py::test_autodeploy_eagle3_one_model_acceptance_rate

# Suggested updates:
- examples/test_ad_speculative_decoding.py::test_autodeploy_eagle3_one_model_acceptance_rate[trtllm-torch-cudagraph]
- examples/test_ad_speculative_decoding.py::test_autodeploy_eagle3_one_model_acceptance_rate[flashinfer-torch-simple]

✅ Unit tests — QA list updates not required

All new tests under tests/unittest/auto_deploy/singlegpu/ (captured graph, trtllm attention op metadata, Eagle model, CSI, llm_config, smoke) are unit scope. No QA list entries are needed for these.


Performance Test Coverage

❌ Missing entry in tests/integration/test_lists/test-db/l0_perf.yml (blocking gap)

This PR introduces torch-cudagraph support for Eagle speculative decoding — a highly performance-sensitive path touching CUDA graph capture, trtllm attention kernels, KV cache block offset management, and the spec-dec scheduler loop. There is no entry in l0_perf.yml (the authoritative source for pre-merge CI perf) for AutoDeploy Eagle3 + cudagraph. The only existing autodeploy perf entries are for unrelated models:

# Current l0_perf.yml (autodeploy entries only):
perf/test_perf.py::test_perf[deepseek_r1_distill_qwen_32b-bench-_autodeploy-...]
perf/test_perf.py::test_perf[nemotron_nano_3_30b_fp8-bench-_autodeploy-...]

A perf regression in graph capture throughput (e.g., a latency increase from extra cudaEventSynchronize calls, a scratch-buffer copy on every extend step, or a capture-time blowup) would not be caught in pre-merge CI.

Recommendation: Add an entry to l0_perf.yml for Llama-3.1-8B-Instruct + Eagle3 + torch-cudagraph + trtllm in a representative throughput or latency configuration, and a matching entry in llm_perf_core.yml (b).

❌ No test_perf_sanity.py baseline for AutoDeploy Eagle3+cudagraph

tests/integration/defs/perf/test_perf_sanity.py has no test entry for AutoDeploy Eagle3 one-model with torch-cudagraph. Given the PR's stated goal (nsys screenshots showing graph launch), a latency/throughput sanity baseline here would catch regressions in future PRs that touch the same code path.

✅ QA llm_perf_core.yml — appears updated by this PR

The PR includes llm_perf_core.yml in the diff. The exact content added was not extracted, but since the PR author modified it, please verify it includes an Eagle3+cudagraph entry; if it does, point (a) above remains the outstanding gap.


Coverage Assessment

Happy path — adequate

Test Path covered Verdict
test_eagle3_one_model[trtllm] AutoDeploy Eagle3 one-model, trtllm+cudagraph, GSM8K accuracy
test_mtp[4] via super_v3_mtp.yaml SuperV3 MTP with torch-cudagraph, 4-GPU
test_autodeploy_eagle3_one_model_acceptance_rate[trtllm-torch-cudagraph] Acceptance rate ≥ 10% for Eagle3+cudagraph
test_ad_eagle3_one_model_smoke[trtllm] Short+long prompt smoke with skip_loading_weights ✅ (unit/smoke scope)
test_speculative_flashinfer_forces_torch_simple Fallback guard correctness ✅ (unit scope)

Edge-case and regression gaps — flagged

  1. No TP/multi-GPU test for AutoDeploy Eagle3+cudagraph. The test_eagle3_one_model[trtllm] is single-GPU. A regression in the distributed token broadcast path (EagleWrapper.sample_greedy) or the NCCL allreduce unfusion (removed early-return in FuseAllreduceResidualRMSNorm) would not be caught. Consider a 4-GPU parametrized variant or a separate TP smoke.

  2. multi_stream_moe disabled in super_v3_mtp.yaml without a guard test. The YAML now hard-codes multi_stream_moe: {enabled: false}. There is no test that would catch a regression if the flag is accidentally re-enabled or the disablement breaks under a config update.

  3. Stale @pytest.mark.skip_less_device_memory removal in SuperV3 MTP test. The PR replaces @pytest.mark.skip_less_device_memory(180000) with a TODO comment. This means the test will run (and likely OOM) on GPUs with < 180 GB memory, potentially producing false failures on hardware that previously skipped it. This should be restored or the test should be guarded.

  4. No negative/failure-mode test for CUDA graph capture errors. If CapturedGraph._capture_one_graph raises during the extend-only Eagle warmup (e.g., non-CUDA-graph-safe op captured), the only signal is a runtime crash. A unit test that injects a bad refresh_inputs_fn would make this fail explicitly.

  5. test_autodeploy_spec_dec_output exact-match assertion is brittle for larger batch sizes. The test already has an inline comment noting it can fail with larger batch sizes due to nondeterminism. The current batch_size=1 restriction prevents this in practice, but if a future PR inadvertently increases max_batch_size or removes disable_overlap_scheduler, the test will silently become flaky rather than failing hard. Consider asserting on acceptance rate or token-overlap instead of exact string equality.


Summary

Category Status
llm_function_core.txt QA entry (parametrized [trtllm] variant) ⚠️ Needs explicit parametrized ID
l0_h100.yml entry for test_autodeploy_eagle3_one_model_acceptance_rate ⚠️ Needs new parametrized IDs
l0_perf.yml perf entry for Eagle3+AutoDeploy+cudagraph ❌ Missing (primary CI gap)
llm_perf_core.yml QA perf entry Check PR diff
test_perf_sanity.py baseline for AutoDeploy+Eagle3+cudagraph ❌ Missing
Multi-GPU AutoDeploy Eagle3+cudagraph coverage ⚠️ Gap
Unit test QA list updates ✅ Not required
Happy path functional coverage ✅ Adequate

Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
Comment thread tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
Comment thread tests/integration/defs/accuracy/test_llm_api_autodeploy.py Outdated
@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/mtp-enable-cudagraph branch from d777423 to 2878730 Compare April 17, 2026 22:52
@govind-ramnarayan govind-ramnarayan requested review from a team as code owners April 20, 2026 19:30
@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/mtp-enable-cudagraph branch from 97f97fd to 9b59b6c Compare April 20, 2026 20:41
Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/models/eagle.py Outdated
Comment thread tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
@govind-ramnarayan
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45433 [ run ] triggered by Bot. Commit: 76c8e79 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45433 [ run ] completed with state SUCCESS. Commit: 76c8e79
/LLM/main/L0_MergeRequest_PR pipeline #35666 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@govind-ramnarayan
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45456 [ run ] triggered by Bot. Commit: 7e82884 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45456 [ run ] completed with state SUCCESS. Commit: 7e82884
/LLM/main/L0_MergeRequest_PR pipeline #35688 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/mtp-enable-cudagraph branch from 3d22ea7 to 416a88a Compare April 27, 2026 17:15
@govind-ramnarayan
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45767 [ run ] triggered by Bot. Commit: 416a88a Link to invocation

@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/mtp-enable-cudagraph branch from 416a88a to a8ddccc Compare April 27, 2026 20:44
@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45767 [ run ] completed with state FAILURE. Commit: 416a88a
/LLM/main/L0_MergeRequest_PR pipeline #35958 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/mtp-enable-cudagraph branch from a8ddccc to d4bcc30 Compare April 27, 2026 22:16
@govind-ramnarayan
Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45796 [ run ] triggered by Bot. Commit: d4bcc30 Link to invocation

@govind-ramnarayan
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45872 [ run ] triggered by Bot. Commit: d4bcc30 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #45872 [ run ] completed with state FAILURE. Commit: d4bcc30
/LLM/main/L0_MergeRequest_PR pipeline #36046 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@govind-ramnarayan
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
- test_capture_graph_uses_per_input_extents_for_truncation: the fake
  _capture_one_graph used (self, *args, **kwargs), but capture_graph
  calls it with keyword args (args=..., kwargs=..., refresh_args_static=...),
  so positional *args was always empty. Match the real signature so the
  fake can introspect tensor shapes.
- test_ad_engine_prepare_inputs_generation_with_hybrid_cache: this PR added
  request.py_prompt_len access in ad_executor._prepare_inputs, but the
  local _GenRequest mock did not define it. Add py_prompt_len to the mock.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
…pec dec mode on this architecture

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/mtp-enable-cudagraph branch from d4bcc30 to ef82208 Compare April 29, 2026 16:30
@govind-ramnarayan
Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46182 [ run ] triggered by Bot. Commit: ef82208 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46182 [ run ] completed with state SUCCESS. Commit: ef82208
/LLM/main/L0_MergeRequest_PR pipeline #36299 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@galagam
Copy link
Copy Markdown
Collaborator

galagam commented Apr 30, 2026

/bot run

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46315 [ run ] triggered by Bot. Commit: ef82208 Link to invocation

@tensorrt-cicd
Copy link
Copy Markdown
Collaborator

PR_Github #46315 [ run ] completed with state SUCCESS. Commit: ef82208
/LLM/main/L0_MergeRequest_PR pipeline #36414 completed with status: 'SUCCESS'

CI Report

Link to invocation

@govind-ramnarayan govind-ramnarayan merged commit 5f28ce2 into NVIDIA:main Apr 30, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: [AutoDeploy]: TRTLLM Attention + CUDA Graph for Eagle

6 participants