[#12633][feat] AutoDeploy: Support torch-cudagraph for Eagle (#12745)
Conversation
ff75700 to d6a5cfe (Compare)
f092c06 to d05d4b4 (Compare)
📝 Walkthrough

The PR integrates Eagle3 speculative-decoding support into AutoDeploy with the TensorRT-LLM attention backend, including CUDA graph dynamic-input refresh callbacks, metadata handling for prompt lengths and extend-only batches, draft model dtype management, and expanded transform pipeline support for spec-decoding-aware graph compilation.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Executor as AD Executor
    participant CudaGraph as CUDA Graph Capturer
    participant AttentionOp as TensorRT-LLM Attention
    participant KVCache as KV Cache Manager
    Executor->>CudaGraph: capture_graph(get_args_kwargs_fn)
    CudaGraph->>CudaGraph: _prepare_capture_inputs(batch_size)
    CudaGraph->>Executor: refresh_inputs_fn() [if needed]
    Executor->>Executor: compute prompt_lens<br/>nest_sequences(prompt_lens=...)
    CudaGraph->>CudaGraph: warmup_and_capture()
    activate CudaGraph
    CudaGraph->>AttentionOp: prepare_trtllm_metadata_host<br/>(prompt_lens_host, spec_config)
    AttentionOp->>AttentionOp: init_spec_decoding()<br/>populate context_lengths<br/>from prompt_lens
    CudaGraph->>AttentionOp: trtllm_mha_with_cache<br/>(context_lengths_gpu)
    AttentionOp->>KVCache: manage KV with spec_decoding<br/>tensors/offsets
    KVCache-->>AttentionOp: return outputs
    deactivate CudaGraph
    CudaGraph-->>Executor: captured_graph
```
```mermaid
sequenceDiagram
    participant Exec as AutoDeployLLM
    participant Comp as Compilation Pipeline
    participant CSI as CacheSeqInterface
    participant EagleWrapper as Eagle Wrapper
    participant Attention as Attention Op
    Exec->>Comp: set_eagle_extend_batch<br/>(batch_size, max_draft_len)
    Comp->>CSI: set_eagle_extend_batch()
    CSI->>CSI: create synthetic<br/>extend-only batch
    CSI->>CSI: mark batch_info<br/>as extend_only=True
    Exec->>EagleWrapper: forward(extend_batch)
    EagleWrapper->>EagleWrapper: detect is_extend_only()
    alt Extend-Only Mode
        EagleWrapper->>EagleWrapper: reshape target_logits to<br/>(num_seqs, 1+draft_len, vocab)
        EagleWrapper->>EagleWrapper: broadcast tokens<br/>across ranks (multi-rank)
    end
    EagleWrapper->>Attention: pass cache_seq_interface<br/>for metadata
    Attention->>Attention: use prompt_lens for<br/>context_lengths
    Attention-->>EagleWrapper: return logits
    EagleWrapper-->>Exec: outputs
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (7)
tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor: Add the standard NVIDIA SPDX header before the imports.
This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.
As per coding guidelines
All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py` at line 1, Add the standard NVIDIA SPDX copyright/license header (with the latest modification year) at the very top of the file before the existing imports; ensure the header precedes the first import statement "from typing import List, Literal, Optional, Tuple, Type" so the file complies with the TensorRT-LLM source file header requirement.
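For reference, a header of roughly this shape satisfies the guideline; treat it as an illustrative sketch, since the exact license identifier and year must follow the project's actual policy (Apache-2.0 and 2025 are assumptions here):

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# The header goes above the first import, e.g.:
from typing import List, Literal, Optional, Tuple, Type
```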
tensorrt_llm/_torch/auto_deploy/utils/node_utils.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor: Add the standard NVIDIA SPDX header before the imports.
This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.
As per coding guidelines
All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py` at line 1, Add the standard NVIDIA SPDX copyright header (including the latest modification year) at the top of the file before any imports or the module docstring (the existing triple-quoted string at the top of tensorrt_llm._torch.auto_deploy.utils.node_utils), so the file begins with the required NVIDIA SPDX header line(s) followed by the existing docstring; ensure the header matches the project's standard format and contains the correct year.

tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor: Add the standard NVIDIA SPDX header before the imports.
This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.
As per coding guidelines
All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py` at line 1, Add the standard NVIDIA SPDX copyright/license header at the very top of the file (above the first import statement "from contextlib import contextmanager") using the current/latest modification year; ensure the header matches the project's required NVIDIA SPDX format exactly and appears before any code or imports so the file complies with the TensorRT-LLM source file header policy.

tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor: Add the standard NVIDIA SPDX header before the imports.
This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.
As per coding guidelines
All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py` at line 1, This file (module tensorrt_llm._torch.auto_deploy.transform.library.sharding.py) is missing the NVIDIA SPDX copyright header; add the standard NVIDIA header block containing the latest modification year and the SPDX license identifier (e.g., NVIDIA CORPORATION & AFFILIATES with the SPDX-License-Identifier) immediately at the top of the file before the module docstring and any imports so the file-level header appears before the existing triple-quoted docstring.

tensorrt_llm/_torch/auto_deploy/shim/interface.py (1)

1-1: ⚠️ Potential issue | 🟡 Minor: Add the standard NVIDIA SPDX header before the imports.
This modified TensorRT-LLM source file still lacks the required copyright/license header. Please add the standard NVIDIA SPDX header with the latest modification year.
As per coding guidelines
All TensorRT-LLM source files must contain an NVIDIA copyright header with the year of latest meaningful modification.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/shim/interface.py` at line 1, This file is missing the required NVIDIA SPDX copyright header; add the standard NVIDIA SPDX header (including the latest modification year) as the very first lines of tensorrt_llm/_torch/auto_deploy/shim/interface.py before any imports (i.e., before the existing "import copy") so the file contains the mandated copyright/license header.

tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
747-805: ⚠️ Potential issue | 🔴 Critical: Append full prompt length to `prompt_lens` for all request types, not chunk size for prefill.

Line 763 appends `request.context_chunk_size` for context requests, while Line 805 appends `request.py_prompt_len` for extend/decode requests. The downstream consumer in `attention_interface.py:1230` documents that `prompt_lens` must contain "original context length per sequence, constant across iterations." For chunked-prefill requests, the chunk size is not the original context length; it is the current iteration's slice. This mismatch causes `context_lengths_gpu` in `trtllm_attention.py:225` to record incorrect values for sequences undergoing chunk-based prefill, breaking attention metadata computation.

Append `request.py_prompt_len` (or an equivalent full prompt length) for context requests at line 763, matching the semantics of line 805.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py` around lines 747 - 805, The bug is that prompt_lens is populated with request.context_chunk_size for context_requests (in the context_requests loop) instead of the full original prompt length, causing downstream consumers (expecting constant "original context length per sequence") to get wrong values; fix by appending the full prompt length (request.py_prompt_len or equivalent full-context property) to prompt_lens inside the context_requests loop (replace the prompt_lens.append(request.context_chunk_size) call with prompt_lens.append(request.py_prompt_len) or the request field that represents the original context length) so both context_requests and gen_requests append the same semantic value; check attention_interface.py expectations (line ~1230) to ensure the chosen field matches "original context length per sequence."tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py (1)
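A minimal sketch of the suggested change; it assumes `request.py_prompt_len` holds the original context length for context requests as well, per the comment above (the function and loop here are illustrative, not the real `_prepare_inputs` code):

```python
def collect_prompt_lens(context_requests, gen_requests):
    """Record the original context length for every request, never the current chunk size."""
    prompt_lens = []
    for request in context_requests:
        # was: prompt_lens.append(request.context_chunk_size)  # only this iteration's slice
        prompt_lens.append(request.py_prompt_len)  # full original prompt length
    for request in gen_requests:
        prompt_lens.append(request.py_prompt_len)  # extend/decode path, unchanged
    return prompt_lens
```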
942-952: ⚠️ Potential issue | 🔴 Critical: Sort hidden-state caches by numeric layer index, not by raw name.

`sorted(..., key=lambda x: x[0])` will misorder layer-suffixed buffers once the index has more than one digit. For example, the Eagle3 path in `tests/integration/defs/accuracy/test_llm_api_autodeploy.py` captures layers `{1, 15, 28}`, and names like `..._15_hidden_states_cache` sort before `..._1_hidden_states_cache`. That scrambles the concatenation fed into the drafter FC and changes speculative outputs.

🔧 Proposed fix

```diff
+import re
 ...
-    buffers = sorted(
-        [
-            (name, tensor)
-            for name, tensor in kwargs.items()
-            if name.endswith("hidden_states_cache")
-        ],
-        key=lambda x: x[0],
-    )
+    def _cache_sort_key(item: tuple[str, torch.Tensor]) -> int:
+        name, _ = item
+        match = re.search(r"(\d+)_hidden_states_cache$", name)
+        if match is None:
+            raise ValueError(f"Could not extract layer index from {name!r}")
+        return int(match.group(1))
+
+    buffers = sorted(
+        [
+            (name, tensor)
+            for name, tensor in kwargs.items()
+            if name.endswith("hidden_states_cache")
+        ],
+        key=_cache_sort_key,
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py` around lines 942 - 952, The current sort of hidden-state buffers uses the raw name (sorted(..., key=lambda x: x[0])), which misorders multi-digit layer suffixes; change the sort to extract the numeric layer index from each buffer name (the digits immediately preceding "_hidden_states_cache") and sort by that integer instead, e.g., parse the index from the tuple's name (the first element of buffers), convert to int with a safe fallback if parsing fails, then torch.cat the buffers in the resulting numeric order so the concatenation passed to the drafter FC preserves correct layer order.
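A tiny standalone repro of the misordering (names are illustrative only):

```python
import re

names = [
    "model_1_hidden_states_cache",
    "model_15_hidden_states_cache",
    "model_28_hidden_states_cache",
]

# Lexicographic sort puts the "15" buffer before the "1" buffer.
print(sorted(names))
# ['model_15_hidden_states_cache', 'model_1_hidden_states_cache', 'model_28_hidden_states_cache']

# Sorting by the extracted integer restores layer order.
numeric = sorted(names, key=lambda n: int(re.search(r"(\d+)_hidden_states_cache$", n).group(1)))
print(numeric)
# ['model_1_hidden_states_cache', 'model_15_hidden_states_cache', 'model_28_hidden_states_cache']
```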
♻️ Duplicate comments (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
598-602: ⚠️ Potential issue | 🟠 Major: Restore a memory gate for `test_mtp`.

The TODO is right: without a replacement for `skip_less_device_memory(180000)`, this 120B MTP accuracy test will get scheduled on machines the test itself says are under-provisioned, which is likely to turn it into an OOM/timeout source instead of a useful regression test.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/integration/defs/accuracy/test_llm_api_autodeploy.py` around lines 598 - 602, Reinstate a memory gate for the 120B MTP test by re-applying the skip decorator (or an equivalent guard) to test_mtp so it only runs on nodes with >=180GB device memory: restore the commented-out `@pytest.mark.skip_less_device_memory`(180000) above the test_mtp definition (or wrap the test_mtp function with a runtime check that calls pytest.skip when available memory <180000) so the parametric test (test_mtp, attn_backend, world_size) is not scheduled on under-provisioned CI machines.
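If the marker cannot simply be restored, a runtime guard along these lines would keep the test off under-provisioned nodes; the helper name is hypothetical and the 180000 MiB threshold is carried over from the removed decorator:

```python
import pytest
import torch

def skip_if_insufficient_device_memory(required_mib: int = 180000) -> None:
    """Skip the calling test when GPU 0 has less memory than the old marker required."""
    if not torch.cuda.is_available():
        pytest.skip("CUDA device required")
    total_mib = torch.cuda.get_device_properties(0).total_memory // (1024 * 1024)
    if total_mib < required_mib:
        pytest.skip(f"needs >= {required_mib} MiB device memory, found {total_mib} MiB")
```

Calling `skip_if_insufficient_device_memory()` at the top of `test_mtp` would mirror the behavior of the removed `skip_less_device_memory(180000)` marker.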
🧹 Nitpick comments (6)
tests/unittest/auto_deploy/singlegpu/models/test_eagle.py (1)
157-168: Avoid `torch.device()` in the default argument.

This default is evaluated at import time, and Ruff is already flagging the pattern on Line 167. Using `None` and normalizing inside `__init__` keeps the helper aligned with the repo lint rules.

Suggested refactor

```diff
 class _FakeCSI:
     def __init__(
         self,
         *,
         max_batch_size: int,
         max_num_tokens: int,
         hidden_size: int,
         num_capture_layers: int,
         ids_dtype: torch.dtype = torch.int64,
         hidden_states_dtype: torch.dtype = torch.float16,
-        device: torch.device = torch.device("cpu"),
+        device: torch.device | None = None,
     ):
+        if device is None:
+            device = torch.device("cpu")
         self.info = SimpleNamespace(
             max_batch_size=max_batch_size,
             max_num_tokens=max_num_tokens,
             device=device,
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/auto_deploy/singlegpu/models/test_eagle.py` around lines 157 - 168, The _FakeCSI.__init__ currently uses torch.device("cpu") as a default argument which is evaluated at import time; change the signature to accept device: Optional[torch.device] = None (or remove the default device), then inside __init__ normalize it with something like device = torch.device("cpu") if device is None else device (or torch.device(device)) so the device is constructed at runtime; update the type hint and any uses of self.device accordingly in class _FakeCSI to avoid the import-time torch.device() default.

tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (2)
1424-1450: Address unused variable warnings from static analysis.

The static analysis flagged `num_decode` and `num_decode_tokens` as unused. Since these are from tuple unpacking where only some values are needed, prefix them with underscores to indicate intentional non-use.

♻️ Proposed fix

```diff
-    num_prefill, num_extend, num_decode = self.batch_info.get_num_sequences()
-    num_prefill_tokens, num_extend_tokens, num_decode_tokens = (
+    num_prefill, num_extend, _num_decode = self.batch_info.get_num_sequences()
+    num_prefill_tokens, num_extend_tokens, _num_decode_tokens = (
         self.batch_info.get_num_tokens()
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around lines 1424 - 1450, Change the unused tuple-unpacked variables from batch_info.get_num_sequences() and batch_info.get_num_tokens() to explicitly ignored names so static analysis stops flagging them; replace num_decode and num_decode_tokens with _num_decode and _num_decode_tokens (keeping num_prefill, num_extend, num_prefill_tokens, num_extend_tokens as-is) in the block that assigns from self.batch_info.get_num_sequences() and self.batch_info.get_num_tokens() used by the logic in attention_interface.py (the code that computes offset_for_pos_ids and uses tokens_per_seq and repeat_interleave).
1478-1486: Verify the overlap mode assumption for extend requests.

The comment states "AD overlap mode packs real overlap-carried non-prefill sequences first" and "CUDA-graph dummy requests are appended later". This relies on ordering invariants maintained elsewhere. Consider adding a brief comment or assertion to document this contract more explicitly at this location.

📝 Optional: Add defensive comment or soft assertion

```diff
     num_overlap = gather_slot_idx.numel()
     # AD overlap mode packs real overlap-carried non-prefill sequences first. CUDA-graph dummy
     # requests are appended later and therefore remain in the trailing zero-increment region.
     # This keeps the padded dummy requests from consuming a new_tokens_lens entry while still
     # preserving the existing ordering-based contract used throughout SequenceInfo.
+    # Invariant: either all non-prefill are extend OR all are decode, never mixed.
     increment[num_prefill : num_prefill + num_overlap] = (
         new_lens_ungathered[gather_slot_idx] - 1
     )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` around lines 1478 - 1486, Add a short defensive comment documenting the AD overlap-mode ordering contract and a soft assertion that verifies gather_slot_idx maps into the expected slice immediately following num_prefill: ensure num_overlap == gather_slot_idx.numel() and that all indices in gather_slot_idx are within the range [num_prefill, num_prefill + num_overlap) before computing increment and calling self.offset_pos_and_cache_. Reference the local symbols gather_slot_idx, num_prefill, num_overlap, increment, and self.offset_pos_and_cache_ when adding this check so the invariant is explicit and fails fast if ordering elsewhere changes.

tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py (3)
445-461: Duplicate variable assignments for `num_tokens` and `max_context_length`.

Lines 447-449 assign `num_tokens`, `max_context_length`, and `max_num_requests` from `batch_info`, but lines 458-461 reassign the same variables. The first set of assignments appears unused before being overwritten.

♻️ Proposed fix to remove duplicate assignments

```diff
     # Get batch dimensions and model-level constants from host tensors (no device sync)
     batch_info = BatchInfo(batch_info_host)
-    num_tokens = batch_info.get_total_num_tokens()
-    max_context_length = batch_info.get_max_context_length()
-    max_num_requests = batch_info.get_max_batch_size()
     _GlobalTrtllmPlanner.update_host_request_types(batch_info)

     # Infer dimensions from tensor shapes (bsnd layout)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py` around lines 445 - 461, BatchInfo-derived variables num_tokens, max_context_length, and max_num_requests are assigned twice (before and after inferring tensor shapes); remove the first set of redundant assignments (the ones immediately after BatchInfo(batch_info_host)) and keep the single assignments that occur after shape inference so num_tokens, max_context_length, and max_num_requests are only set once; retain the _GlobalTrtllmPlanner.update_host_request_types(batch_info) call but do not re-declare or reassign those variables earlier.
278-305: Consider adding cache size limits for `@functools.cache` decorators.

These cached functions allocate GPU tensors keyed by `(max_num_requests, draft_len)`. While typically only a few distinct combinations exist at runtime, unbounded caching could lead to GPU memory accumulation if many different parameter combinations are used. Consider using `@functools.lru_cache(maxsize=...)` with a reasonable limit instead.

♻️ Suggested change

```diff
-@functools.cache
+@functools.lru_cache(maxsize=8)
 def _generate_spec_decoding_position_offsets(max_num_requests: int, draft_len: int) -> torch.Tensor:
     ...
-@functools.cache
+@functools.lru_cache(maxsize=8)
 def _generate_spec_decoding_packed_mask(max_num_requests: int, draft_len: int) -> torch.Tensor:
     ...
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py` around lines 278 - 305, The two GPU-tensor-producing functions _generate_spec_decoding_position_offsets and _generate_spec_decoding_packed_mask currently use `@functools.cache` (unbounded) and should use a bounded cache to avoid unbounded GPU memory growth; replace `@functools.cache` with `@functools.lru_cache`(maxsize=...) (pick a reasonable maxsize like 8 or make it configurable) so the cache keys (max_num_requests, draft_len) are limited, and ensure the functions' signatures stay hashable for lru_cache to work.
532-538: Refactor to use the existing `is_sm_version_trtllm_gen_kernel()` method instead of duplicating the logic inline.

The condition `not (sm_version < 100 or sm_version in (120, 121))` is correctly mirrored from `is_sm_version_trtllm_gen_kernel()` in `trtllm.py`, but duplicating it here creates unnecessary maintenance risk. Replace the inline condition with a call to the method to ensure they stay synchronized.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py` around lines 532 - 538, Replace the inline SM-version check with the shared helper: instead of calling get_sm_version() and using the duplicated condition, call is_sm_version_trtllm_gen_kernel() (the same helper in trtllm.py/TrtllmAttentionWrapper) to decide whether to append three None entries to spec_decoding_tensor_params; i.e., if is_sm_version_trtllm_gen_kernel() returns True then extend spec_decoding_tensor_params with [None, None, None], otherwise do nothing—remove the duplicated sm_version logic and any unused get_sm_version references.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 1230-1237: The decode loop in demollm.py calls
sequence_info.nest_sequences() without passing the original prompt lengths,
causing prompt_lens to default to sl_host and break attention masking; fix it by
preserving total_lens from the prefill phase and passing it into the
decode-phase call to sequence_info.nest_sequences(...) as the prompt_lens
argument (same value used in the prefill call), ensuring the decode call (in the
loop around line ~198) includes prompt_lens=total_lens so attention masking uses
the original context lengths.
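The shape of the suggested change, sketched with illustrative names (`total_lens`, `sample_next_tokens`); only the `nest_sequences(..., prompt_lens=...)` call is taken from the comment and the walkthrough above, so treat this as a sketch rather than the real demollm.py code:

```python
# Hedged sketch: capture the original context lengths once and reuse them in decode.
total_lens = [len(ids) for ids in input_ids]                          # prefill-time prompt lengths
sequence_info.nest_sequences(input_ids, prompt_lens=total_lens)       # prefill call
for _ in range(max_new_tokens):
    new_tokens = sample_next_tokens(logits)                           # placeholder decode step
    sequence_info.nest_sequences(new_tokens, prompt_lens=total_lens)  # keep original lengths
```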
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`:
- Around line 136-150: init_spec_decoding may access self.block_offsets before
reset() has allocated it, causing torch.zeros_like(self.block_offsets) to fail;
add a defensive guard at the top of init_spec_decoding that returns if
self.block_offsets is None (or ensure reset() has run) and only then proceed to
set self._scratch_block_offsets, spec_decoding_generation_lengths,
spec_decoding_position_offsets, and spec_decoding_packed_mask; reference
init_spec_decoding, reset, block_offsets, _scratch_block_offsets, and
is_spec_decoding_enabled when making the change.
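A sketch of the defensive guard being asked for; the class and surrounding state are stand-ins, not the real planner code:

```python
import torch

class _PlannerSketch:
    """Illustrative stand-in for the object that owns block_offsets."""
    block_offsets: torch.Tensor | None = None
    _scratch_block_offsets: torch.Tensor | None = None

    def init_spec_decoding(self) -> None:
        # Guard from the comment above: reset() may not have allocated block_offsets yet,
        # so bail out instead of calling torch.zeros_like(None).
        if self.block_offsets is None:
            return
        self._scratch_block_offsets = torch.zeros_like(self.block_offsets)
        # ...spec_decoding_generation_lengths / position_offsets / packed_mask follow here...
```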
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py`:
- Around line 30-32: The function gather_tokens in torch_gather_logits.py
incorrectly collapses extend-only inputs ([bs, sl, ...]) into a decode-shaped
output ([N, 1, ...]) by branching only on bsz==1; update gather_tokens (and any
related reshape logic used by flatten_3d) to preserve the original
second-dimension when input is extend-only by deriving it from BatchInfo
per-sequence token counts instead of forcing [N,1,...]. Specifically, detect
extend-only layout via BatchInfo (or the per-sequence counts API), compute the
correct per-batch sequence lengths for the second dimension, and reshape/permute
using those lengths so outputs keep [bs, sl, ...] for extend-only cases while
still supporting [bs,1,...] decode and [1,total,...] prefill layouts.
In `@tensorrt_llm/_torch/auto_deploy/models/eagle.py`:
- Around line 315-329: The target dtype and trust settings are being inferred by
unconditionally calling AutoConfig.from_pretrained(..., trust_remote_code=True),
which ignores explicit overrides in model_kwargs or kwargs; update the logic
that computes target_dtype (used when constructing self.draft_factory /
EagleDrafterFactory) to first check for explicit dtype overrides in model_kwargs
(e.g. "torch_dtype") and kwargs (e.g. "dtype"), and to respect the caller's
trust_remote_code value rather than hard-coding True when calling
AutoConfig.from_pretrained; i.e., derive a local trust_remote_code variable from
kwargs or a surrounding parameter and only call AutoConfig.from_pretrained with
that value if no dtype was supplied in model_kwargs/kwargs, then pass the
resolved target_dtype into EagleDrafterFactory.
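The override-aware resolution might look roughly like this; `model_kwargs`, `kwargs`, and the helper name are taken from the comment rather than from verified code:

```python
from transformers import AutoConfig

def resolve_target_dtype(model: str, model_kwargs: dict, kwargs: dict):
    """Prefer explicit dtype overrides and honor the caller's trust_remote_code setting."""
    dtype = model_kwargs.get("torch_dtype") or kwargs.get("dtype")
    if dtype is not None:
        return dtype
    trust_remote_code = kwargs.get("trust_remote_code", False)
    cfg = AutoConfig.from_pretrained(model, trust_remote_code=trust_remote_code)
    return getattr(cfg, "torch_dtype", None)
```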
In `@tensorrt_llm/_torch/auto_deploy/utils/node_utils.py`:
- Around line 1534-1540: The current Eagle-drafter check only looks at
node.users directly, which misses cases where q/k/v linears are piped through
passthrough ops (view/reshape/transpose) before reaching the attention op;
update the branch that checks "if in_eagle_drafter and in_dim == 2 * embd" to
follow passthrough users transitively instead of only immediate users: reuse or
add a small helper that starting from node (or each user) recursively walks
through allowed passthrough op kinds (e.g., view/reshape/transpose) and returns
true if any reachable user satisfies is_any_attention_op; use that helper in
place of the current any(is_any_attention_op(u) for u in node.users) call so
Eagle q/k/v openings are recognized through reshape/view chains.
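One possible shape for the transitive walk over torch.fx users; the passthrough set and helper name are assumptions based on the comment:

```python
from torch.fx import Node

_PASSTHROUGH_OPS = ("view", "reshape", "transpose")  # assumed passthrough targets

def reaches_attention_through_passthrough(node: Node, is_any_attention_op) -> bool:
    """Return True if an attention op is reachable from node via passthrough users."""
    stack, seen = list(node.users), set()
    while stack:
        user = stack.pop()
        if user in seen:
            continue
        seen.add(user)
        if is_any_attention_op(user):
            return True
        target_name = getattr(user.target, "__name__", None) or str(user.target)
        if any(op in target_name for op in _PASSTHROUGH_OPS):
            stack.extend(user.users)
    return False
```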
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 7d79db9b-89eb-4e40-a47d-8ccd70d3daa5
📒 Files selected for processing (25)
- examples/auto_deploy/model_registry/configs/super_v3_mtp.yaml
- tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
- tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py
- tensorrt_llm/_torch/auto_deploy/llm_args.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py
- tensorrt_llm/_torch/auto_deploy/models/eagle.py
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tensorrt_llm/_torch/auto_deploy/shim/interface.py
- tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py
- tensorrt_llm/_torch/auto_deploy/transform/library/compile_model.py
- tensorrt_llm/_torch/auto_deploy/transform/library/hidden_states.py
- tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
- tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
- tensorrt_llm/_torch/auto_deploy/utils/cuda_graph.py
- tensorrt_llm/_torch/auto_deploy/utils/node_utils.py
- tests/integration/defs/accuracy/test_llm_api_autodeploy.py
- tests/integration/defs/examples/test_ad_speculative_decoding.py
- tests/unittest/auto_deploy/singlegpu/compile/test_captured_graph.py
- tests/unittest/auto_deploy/singlegpu/custom_ops/attention/test_trtllm_attention_op.py
- tests/unittest/auto_deploy/singlegpu/models/test_eagle.py
- tests/unittest/auto_deploy/singlegpu/shim/test_cached_sequence_interface.py
- tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py
- tests/unittest/auto_deploy/singlegpu/smoke/test_ad_speculative_decoding.py
💤 Files with no reviewable changes (1)
- tensorrt_llm/_torch/auto_deploy/transform/library/collectives.py
@coderabbitai Act as a QA engineer reviewing test changes for TensorRT-LLM.
QA Test List Hygiene

✅ QA list entries present — but parametrization IDs need updating

The PR touches tests whose QA-list parametrization IDs are now stale; two problems were flagged, with the old and suggested entries below:

# Old (stale):
- examples/test_ad_speculative_decoding.py::test_autodeploy_eagle3_one_model_acceptance_rate
# Suggested updates:
- examples/test_ad_speculative_decoding.py::test_autodeploy_eagle3_one_model_acceptance_rate[trtllm-torch-cudagraph]
- examples/test_ad_speculative_decoding.py::test_autodeploy_eagle3_one_model_acceptance_rate[trtllm-torch-cudagraph]
- examples/test_ad_speculative_decoding.py::test_autodeploy_eagle3_one_model_acceptance_rate[flashinfer-torch-simple]

✅ Unit tests — QA list updates not required

All new tests under

Performance Test Coverage

❌ Missing entry in
| Test | Path covered | Verdict |
|---|---|---|
| test_eagle3_one_model[trtllm] | AutoDeploy Eagle3 one-model, trtllm+cudagraph, GSM8K accuracy | ✅ |
| test_mtp[4] via super_v3_mtp.yaml | SuperV3 MTP with torch-cudagraph, 4-GPU | ✅ |
| test_autodeploy_eagle3_one_model_acceptance_rate[trtllm-torch-cudagraph] | Acceptance rate ≥ 10% for Eagle3+cudagraph | ✅ |
| test_ad_eagle3_one_model_smoke[trtllm] | Short+long prompt smoke with skip_loading_weights | ✅ (unit/smoke scope) |
| test_speculative_flashinfer_forces_torch_simple | Fallback guard correctness | ✅ (unit scope) |
Edge-case and regression gaps — flagged
- No TP/multi-GPU test for AutoDeploy Eagle3+cudagraph. The `test_eagle3_one_model[trtllm]` case is single-GPU. A regression in the distributed token broadcast path (`EagleWrapper.sample_greedy`) or the NCCL allreduce unfusion (removed early-return in `FuseAllreduceResidualRMSNorm`) would not be caught. Consider a 4-GPU parametrized variant or a separate TP smoke.
- `multi_stream_moe` disabled in `super_v3_mtp.yaml` without a guard test. The YAML now hard-codes `multi_stream_moe: {enabled: false}`. There is no test that would catch a regression if the flag is accidentally re-enabled or the disablement breaks under a config update.
- Stale `@pytest.mark.skip_less_device_memory` removal in SuperV3 MTP test. The PR replaces `@pytest.mark.skip_less_device_memory(180000)` with a TODO comment. This means the test will run (and likely OOM) on GPUs with < 180 GB memory, potentially producing false failures on hardware that previously skipped it. This should be restored or the test should be guarded.
- No negative/failure-mode test for CUDA graph capture errors. If `CapturedGraph._capture_one_graph` raises during the extend-only Eagle warmup (e.g., a non-CUDA-graph-safe op is captured), the only signal is a runtime crash. A unit test that injects a bad `refresh_inputs_fn` would make this fail explicitly.
- `test_autodeploy_spec_dec_output` exact-match assertion is brittle for larger batch sizes. The test already has an inline comment noting it can fail with larger batch sizes due to nondeterminism. The current `batch_size=1` restriction prevents this in practice, but if a future PR inadvertently increases `max_batch_size` or removes `disable_overlap_scheduler`, the test will silently become flaky rather than failing hard. Consider asserting on acceptance rate or token overlap instead of exact string equality (see the sketch after this list).
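For the last point, a token-overlap assertion could replace exact string equality; this is a sketch with hypothetical variable names (`ref_output_text`, `spec_dec_output_text`) and an illustrative threshold:

```python
def token_overlap(reference: str, candidate: str) -> float:
    """Fraction of positions where whitespace-split tokens agree."""
    ref, out = reference.split(), candidate.split()
    matches = sum(r == c for r, c in zip(ref, out))
    return matches / max(len(ref), 1)

# Tolerate small nondeterministic divergence instead of failing on any mismatch.
assert token_overlap(ref_output_text, spec_dec_output_text) >= 0.9
```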
Summary
| Category | Status |
|---|---|
| llm_function_core.txt QA entry (parametrized [trtllm] variant) | |
| l0_h100.yml entry for test_autodeploy_eagle3_one_model_acceptance_rate | |
| l0_perf.yml perf entry for Eagle3+AutoDeploy+cudagraph | ❌ Missing (primary CI gap) |
| llm_perf_core.yml QA perf entry | Check PR diff |
| test_perf_sanity.py baseline for AutoDeploy+Eagle3+cudagraph | ❌ Missing |
| Multi-GPU AutoDeploy Eagle3+cudagraph coverage | |
| Unit test QA list updates | ✅ Not required |
| Happy path functional coverage | ✅ Adequate |
d777423 to 2878730 (Compare)
97f97fd to 9b59b6c (Compare)
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"
PR_Github #45433 [ run ] triggered by Bot. Commit:
PR_Github #45433 [ run ] completed with state
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"
PR_Github #45456 [ run ] triggered by Bot. Commit:
PR_Github #45456 [ run ] completed with state
3d22ea7 to 416a88a (Compare)
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"
PR_Github #45767 [ run ] triggered by Bot. Commit:
416a88a to a8ddccc (Compare)
PR_Github #45767 [ run ] completed with state
a8ddccc to d4bcc30 (Compare)
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1"
PR_Github #45796 [ run ] triggered by Bot. Commit:
/bot run --disable-fail-fast
PR_Github #45872 [ run ] triggered by Bot. Commit:
PR_Github #45872 [ run ] completed with state
/bot run --disable-fail-fast
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
- test_capture_graph_uses_per_input_extents_for_truncation: the fake _capture_one_graph used (self, *args, **kwargs), but capture_graph calls it with keyword args (args=..., kwargs=..., refresh_args_static=...), so positional *args was always empty. Match the real signature so the fake can introspect tensor shapes.
- test_ad_engine_prepare_inputs_generation_with_hybrid_cache: this PR added request.py_prompt_len access in ad_executor._prepare_inputs, but the local _GenRequest mock did not define it. Add py_prompt_len to the mock.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
…pec dec mode on this architecture Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
d4bcc30 to ef82208 (Compare)
/bot run --disable-fail-fast
PR_Github #46182 [ run ] triggered by Bot. Commit:
PR_Github #46182 [ run ] completed with state
/bot run
PR_Github #46315 [ run ] triggered by Bot. Commit:
PR_Github #46315 [ run ] completed with state
torch-cudagraph mainly tested on Llama-3.1-8B-Instruct + Eagle3, as well as SuperV3 + MTP.
Fixes: #12633
Note: There are some TODOs left from this PR to investigate some lingering issues. They are here:
#13100
#13133
#13134
#13135
#13143
Summary by CodeRabbit
New Features
Bug Fixes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.