
[#10931][feat] AutoDeploy: one-model spec dec #11701

Merged

lucaslie merged 8 commits into NVIDIA:main from nv-auto-deploy:ll/gr/eagle-enable-overlap on Mar 18, 2026

Conversation

@lucaslie
Member

@lucaslie lucaslie commented Feb 25, 2026

Summary by CodeRabbit

  • New Features

    • Added support for Eagle3 one-model speculative decoding path, enabling single-model inference optimizations with speculative decoding capabilities.
    • Introduced speculative_model_kwargs configuration field for passing additional parameters to speculative models.
  • Improvements

    • Enhanced batch metadata handling and KV cache resource management for improved inference efficiency.
    • Streamlined public APIs by consolidating internal metadata structures.
  • Bug Fixes

    • Fixed overlap scheduler behavior and token acceptance rate tracking in speculative decoding workflows.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
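
For example, a hypothetical invocation that combines flags documented above (the GPU type shown is an illustrative placeholder, not a recommendation) could look like:

/bot run --gpu-type "H100_PCIe" --disable-fail-fast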

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@lucaslie lucaslie self-assigned this Feb 25, 2026
@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch from 9a689e7 to f0e5e4a on February 25, 2026 03:00
@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch from f0e5e4a to def4c7e on February 25, 2026 17:58
@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch 5 times, most recently from 39de91a to db6d948 on March 3, 2026 16:34
@suyoggupta
Collaborator

@lucaslie , @govind-ramnarayan : is this ready to be reviewed?

@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch 2 times, most recently from 49c0d3c to 11876c8 on March 5, 2026 00:46
@lucaslie lucaslie marked this pull request as ready for review March 6, 2026 00:15
@lucaslie lucaslie requested review from a team as code owners March 6, 2026 00:15
@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch 2 times, most recently from fe4a894 to 0b22405 on March 6, 2026 00:17
@coderabbitai
Contributor

coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Introduces a new BatchInfo class to encapsulate and manage batch metadata tensors across attention and other backend operations, refactors Eagle speculative decoding for one-model support, updates block-table management with page re-insertion logic, and integrates these changes across flashinfer, torch, triton, mamba, FLA, MLA backends, plus test and configuration updates.
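
To make the new metadata flow concrete, the following is a minimal, self-contained Python sketch of the consumption pattern described above. It is illustrative only: the class name BatchInfoSketch, the six-slot layout, and the sample numbers are assumptions inferred from the BatchInfo().update(...).serialize() and get_absorbed_info() usages quoted later on this page, not the actual TensorRT-LLM implementation.

import torch


class BatchInfoSketch:
    """Minimal stand-in for the described BatchInfo class (names and layout assumed)."""

    # Assumed six-slot layout, inferred from the test call
    # BatchInfo().update([nc, npt, 0, 0, ng, decode_tokens]) quoted below:
    # [num_prefill, num_prefill_tokens, num_extend, num_extend_tokens,
    #  num_decode, num_decode_tokens]
    def __init__(self, batch_info_host=None):
        self._vals = [0] * 6 if batch_info_host is None else [int(v) for v in batch_info_host.tolist()]

    def update(self, batch_info):
        if len(batch_info) != 6:  # explicit length check, as suggested in the review below
            raise ValueError("batch_info must be length 6")
        self._vals[:] = [int(v) for v in batch_info]
        return self

    def serialize(self):
        # host-side metadata tensor that backends receive as `batch_info_host`
        return torch.tensor(self._vals, dtype=torch.int32)

    def get_absorbed_info(self):
        num_prefill, num_prefill_tokens, _, _, num_decode, _ = self._vals
        return num_prefill, num_prefill_tokens, num_decode


# Backend-style consumption: 2 prefill sequences (17 tokens total) and 3 decode sequences.
batch_info_host = BatchInfoSketch().update([2, 17, 0, 0, 3, 3]).serialize()
num_prefill, _num_prefill_tokens, num_decode = BatchInfoSketch(batch_info_host).get_absorbed_info()
num_seq = num_prefill + num_decode  # pattern used across the refactored attention/mamba/MLA backends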

Changes

Cohort / File(s) Summary
BatchInfo Core Implementation
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
Introduced BatchInfo class with serialization, update, and accessor methods for batch metadata; integrated into SequenceInfo to replace direct tensor manipulation; refactored batch info handling, token gathering logic, and host-device synchronization with new helper methods (offset_pos_and_cache_, switch_to_generate_, rescatter_input_ids_).
BatchInfo Integration in Attention Backends
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py, torch_backend_attention.py, triton_attention.py, trtllm_attention.py
Replaced direct tensor-to-list conversions with BatchInfo wrapper and get_absorbed_info() method; updated function signatures to remove max_seq_info_host parameter in trtllm paths and centralize metadata extraction via BatchInfo.
BatchInfo Integration in Other Backends
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/*, mamba/*, mla/*
Applied consistent BatchInfo-based batch metadata extraction across FLA, Mamba, and MLA backend implementations (flashinfer, torch, triton variants), replacing direct tensor unpacking with BatchInfo().get_absorbed_info() or get_num_tokens_to_gather().
Block Table and Ragged Cache Management
tensorrt_llm/_torch/auto_deploy/custom_ops/utils/block_table_ragged.py
Added extra_idx parameter to track removed pages for re-insertion when delta < 0; introduced max_blocks_per_seq parameter to guide memory sizing; updated public function signatures for adjust_block_table_torch, adjust_ragged_torch, and Triton variants to support new cache management semantics.
Gather Logits Refactoring
tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py, transform/library/gather_logits_before_lm_head.py
Replaced tokens_gather_info_host parameter with batch_info_host; updated control flow to use BatchInfo.get_num_tokens_to_gather() and is_gather_required() for conditional gathering logic.
Eagle Speculative Decoding Refactoring
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py, models/eagle.py
Refactored EagleWrapper.__init__ to remove resource_manager parameter; added _draft_inner_model and _draft_dtype properties; introduced KV-cache-aware forward path (_forward_with_kv_cache) and prefill-only path (_forward_prefill_only); introduced EagleOneModelFactory and export-info classes (TargetModelExportInfo, DraftModelExportInfo) for one-model Eagle composition.
Speculative Decoding Infrastructure
tensorrt_llm/_torch/auto_deploy/llm_args.py, shim/ad_executor.py, shim/interface.py, shim/demollm.py
Added speculative_model_kwargs field to LlmArgs; extended CachedSequenceInterface to accept spec_config; updated add_resource to return the generated full resource name; added a _run_forward() method to ADEngine and updated input preparation for the new new_tokens_lens and gather_context_logits parameters; introduced Eagle3OneModel sampler support.
KVCache Transform Updates
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
Updated cache registration to use original key names instead of per-node indices; extended forward invocation with CSI compatibility check (_requires_csi flag) to conditionally pass cache_seq_interface parameter.
Hidden States and Configuration
tensorrt_llm/_torch/auto_deploy/config/default.yaml, transform/library/hidden_states.py
Added explicit enabled: false flag for detect_hidden_states_for_capture transform; refactored detection logic to skip transform when gm.is_draft is true or residual_add_for_capture already exists.
Shim Module Export Changes
tensorrt_llm/_torch/auto_deploy/shim/__init__.py
Removed create_autodeploy_executor from exported symbols, reducing public API surface.
Test Updates for BatchInfo
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_*.py, mamba/test_*.py, mla/test_*.py, fla/test_*.py, utils/test_*.py, models/test_*.py
Updated test fixtures across multiple backends to construct batch_info_host via BatchInfo().update(...).serialize() instead of direct tensor literals; covers the flashinfer, torch, Triton, and CUDA variants.
New Comprehensive Tests
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py, tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/defs/examples/test_ad_speculative_decoding.py
Added extensive test coverage for SequenceInfo.switch_to_generate_inplace() behavior; introduced TestLlama3_1_8B_Instruct_Eagle3 for Eagle3 one-model speculative decoding; added test_autodeploy_eagle3_one_model_acceptance_rate() and helper _run_acceptance_rate_check() for acceptance-rate validation.
Block Table and Cache Tests
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/utils/test_block_table_ragged_conversion.py
Updated ragged/block-table conversion tests to pass max_blocks_per_seq parameter; added comprehensive test test_adjust_remove_saves_to_extra_idx_and_reinserts() demonstrating page re-insertion cycles.
CachedSequenceInterface Tests
tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_cached_sequence_interface.py, test_engine.py
Updated tests to use returned full_name from add_resource() for resource lookups and caches; validated token-gathering behavior with new gather_context_logits parameter; added py_seq_slot attribute to request objects; updated engine tests to use _run_forward()["logits"] instead of _compute_logits().
Minor Configuration and Utility Changes
tensorrt_llm/_torch/auto_deploy/models/patches/bamba.py, transform/library/sharding.py, tests/integration/test_lists/test-db/l0_h100.yml
Updated Bamba patch to use BatchInfo-based metadata construction; added conditional config validation guard for multi-process setups; added placeholder comments for AutoDeploy tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 57.44%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.
  • Description check (⚠️ Warning): The PR description is the template skeleton without completion; it lacks the title format, an explanation of the changes, test coverage details, and PR checklist completion. Resolution: complete the PR description with (1) a proper title following the [ticket][type] format, (2) a clear explanation of the one-model speculative decoding feature in the Description section, (3) a list of relevant tests in the Test Coverage section, and (4) confirmation that all PR checklist items have been reviewed.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The PR title '[#10931][feat] AutoDeploy: one-model spec dec' is fully related to the main changes in the changeset, which implement one-model Eagle speculative decoding support in AutoDeploy.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

🧹 Nitpick comments (7)
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py (1)

20-20: Use module-level import instead of importing SequenceInfo directly.

Line 20 imports a class symbol directly, which conflicts with the repository’s Python import rule.

Proposed refactor
 import torch
 
-from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import SequenceInfo
+from tensorrt_llm._torch.auto_deploy.custom_ops import attention_interface
@@
-def _make_seq_info(extra_activate=()) -> SequenceInfo:
+def _make_seq_info(extra_activate=()) -> attention_interface.SequenceInfo:
@@
-    si = SequenceInfo(
+    si = attention_interface.SequenceInfo(
@@
-def _nest_prefill(si: SequenceInfo, input_ids, pages_per_seq, cache_loc, **kw):
+def _nest_prefill(si: attention_interface.SequenceInfo, input_ids, pages_per_seq, cache_loc, **kw):

As per coding guidelines, "**/*.py: Python imports must use form from package.subpackage import module (never from module import Class)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py`
at line 20, Replace the direct class import of SequenceInfo with a module-level
import to follow the repo import rule: change the import to bring in the
attention_interface module (e.g., import
tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface as
attention_interface) and then update all usages of SequenceInfo in this test
(function/class names in this file) to reference
attention_interface.SequenceInfo; ensure only the import line and references are
modified so behavior remains unchanged.
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)

274-278: Run the resize probe forward in inference mode.

This path is used only for memory sizing; wrapping the forward call in torch.inference_mode() avoids autograd allocations that can inflate mem_reserved_for_forward and reduce KV-cache capacity estimates.

♻️ Suggested patch
         cm.info.set_max_num_tokens_sample()
         try:
             # TODO (lucaslie): revisit this logic as part of spec dec cudagraph support...
-            if getattr(mod, "_requires_csi", False):
-                mod(cache_seq_interface=cm)
-            else:
-                mod(**cm.named_args)
+            with torch.inference_mode():
+                if getattr(mod, "_requires_csi", False):
+                    mod(cache_seq_interface=cm)
+                else:
+                    mod(**cm.named_args)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py` around lines
274 - 278, The forward probe used for memory sizing should run under
torch.inference_mode() to avoid autograd allocations: wrap the calls to mod(...)
in a torch.inference_mode() context manager so both branches (when getattr(mod,
"_requires_csi", False) calls mod(cache_seq_interface=cm) and the else branch
calling mod(**cm.named_args)) execute inside torch.inference_mode(), preserving
existing argument usage and behavior while preventing grad-related memory
allocations during the resize probe.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py (1)

309-310: Avoid unused unpacked token count in host metadata prep.

Line 310 unpacks num_prefill_tokens but never uses it (RUF059).

💡 Proposed fix
-    num_prefill, num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
+    num_prefill, _num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py`
around lines 309 - 310, The host metadata unpack currently assigns an unused
variable num_prefill_tokens from BatchInfo.get_absorbed_info(); update the
unpack to avoid the unused symbol — either ignore the middle value by assigning
it to _ or only unpack the two needed values (e.g., assign num_prefill and
num_decode) from BatchInfo(batch_info_host).get_absorbed_info() so
num_prefill_tokens is not created but the rest of the code (using BatchInfo and
get_absorbed_info) continues to work.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_backend_attention.py (1)

320-322: Clean up unused unpacked value from absorbed metadata.

Line 321 unpacks num_prefill_tokens but doesn’t use it (RUF059).

💡 Proposed fix
-    num_prefill, num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
+    num_prefill, _num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_backend_attention.py`
around lines 320 - 322, The code unpacks three values from
BatchInfo.get_absorbed_info() but never uses num_prefill_tokens; update the
unpack to remove the unused variable (e.g., unpack only num_prefill and
num_decode) or replace num_prefill_tokens with an underscore to indicate it’s
intentionally ignored in the BatchInfo / get_absorbed_info call site where
batch_info_host is converted and num_seq is computed from num_prefill +
num_decode.
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1)

192-194: Drop unused unpacked metadata to keep lint clean.

At Line 193, num_prefill_tokens is unpacked but unused (RUF059).

💡 Proposed fix
-    num_prefill, num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
+    num_prefill, _num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py`
around lines 192 - 194, The unpacked variable num_prefill_tokens from
BatchInfo(batch_info_host).get_absorbed_info() is unused; change the unpacking
in the block that creates BatchInfo and calls get_absorbed_info() so the unused
metadata is discarded (e.g., replace num_prefill_tokens with a throwaway name
like _ or otherwise only extract needed elements) and keep the subsequent
computation of num_seq using num_prefill and num_decode unchanged; locate this
in the code around the BatchInfo constructor call and the get_absorbed_info()
unpacking.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

423-423: Avoid unpacking an unused value.

Line 423 unpacks num_decode_tokens but never uses it, which triggers Ruff (RUF059).

💡 Proposed fix
-        num_prefill_tokens, num_extend_tokens, num_decode_tokens = self.get_num_tokens()
+        num_prefill_tokens, num_extend_tokens, _num_decode_tokens = self.get_num_tokens()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` at line
423, The code unpacks three values from self.get_num_tokens() but never uses
num_decode_tokens, causing a linter warning; change the unpack to only capture
the needed values (e.g., num_prefill_tokens, num_extend_tokens =
self.get_num_tokens()) or replace the third target with a discard variable
(e.g., _ or _num_decode_tokens) where the call appears (the call to
self.get_num_tokens() in attention_interface.py), and ensure any downstream code
that expects num_decode_tokens is updated accordingly.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gather_logits_before_lm_head.py (1)

25-25: Prefer module-namespace import for BatchInfo

Line 25 imports a class symbol directly. The repository guideline asks for importing module namespaces and using module.Symbol.

♻️ Example refactor pattern
-from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import BatchInfo
+from tensorrt_llm._torch.auto_deploy.custom_ops import attention_interface

-        batch_info = BatchInfo()
+        batch_info = attention_interface.BatchInfo()

As per coding guidelines "Python imports must use form from package.subpackage import module (never from module import Class)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gather_logits_before_lm_head.py`
at line 25, The test imports the class BatchInfo directly; change the import to
the module form (import the module namespace) so usages use module.BatchInfo
instead. Replace the line that currently does "from
tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import BatchInfo"
with a module import (e.g., "from tensorrt_llm._torch.auto_deploy.custom_ops
import attention_interface") and update all references in
test_gather_logits_before_lm_head.py to use attention_interface.BatchInfo
wherever BatchInfo is referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 411-413: The update method (update(self, batch_info: List[int]) ->
None) currently assigns batch_info into self._batch_info_np[:6] without
validating length, causing obscure numpy broadcast errors for legacy 3-value
inputs; add an explicit check that len(batch_info) == 6 at the top of update and
raise a clear ValueError (e.g. "batch_info must be length 6: [fields...]") when
it isn't, so callers get an immediate, descriptive error rather than a
silent/broadcast failure; keep the assignment to self._batch_info_np[:6] after
the check.
- Around line 1362-1375: The code overwrites cu_seqlen before computing
extraction_indices, so extraction_indices uses the reset arange rather than each
sequence's true end; to fix, read or save the original cu_seqlen (via
self.get_arg or a local variable) before you create/overwrite the new cu_seqlen,
then compute extraction_indices = (original_cu_seqlen[1:] - 1).long() and use
that when calling self.copy_ for input_ids; keep references to cu_seqlen,
input_ids_flat, extraction_indices, self.get_arg and self.copy_ to locate the
change.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`:
- Around line 286-289: The BatchInfo initialization calls .numpy() which breaks
under torch.compile FakeTensor tracing; inside the register_fake implementation
replace the direct BatchInfo(batch_info_host) call with a FakeTensor guard:
detect FakeTensor instances (e.g. isinstance(batch_info_host,
torch._subclasses.fake_tensor.FakeTensor) or similar project convention) and
when fake, compute max_blocks_per_seq and max_batch_size from
batch_info_host.shape/metadata directly instead of constructing BatchInfo;
otherwise keep using BatchInfo(batch_info_host) and then call
get_max_blocks_per_seq() and get_max_batch_size() as before (refer to symbols
BatchInfo, register_fake, get_max_blocks_per_seq, get_max_batch_size).

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/mamba_backend_common.py`:
- Line 54: The unpacking on the call to batch_info.get_absorbed_info() creates
unused variables num_prefill_tokens and num_decode causing a lint error; change
the unpacking in the metadata preparation to only capture the used value (e.g.,
assign to num_prefill or use _ placeholders) so that only the necessary variable
from get_absorbed_info() is bound (refer to batch_info.get_absorbed_info and the
num_prefill usage to locate the edit).

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py`:
- Around line 163-164: The tuple returned from
BatchInfo(batch_info_host).get_absorbed_info() is unpacked into num_prefill,
num_prefill_tokens, num_decode but num_prefill_tokens is never used; update the
unpacking in torch_backend_mamba.py to drop the unused binding (e.g., unpack
into num_prefill, num_decode or use a throwaway name like _ for the middle
element) so only the used symbols (num_prefill and num_decode) remain.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py`:
- Around line 439-440: The unpacked value num_prefill_tokens from
BatchInfo(batch_info_host).get_absorbed_info() is unused and triggers RUF059;
update the unpacking in the function containing this call so the unused slot is
intentionally marked (e.g., replace num_prefill_tokens with a throwaway name
like _ or _num_prefill_tokens) to indicate it’s intentionally unused and satisfy
the linter while keeping num_prefill and num_decode as-is.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_backend_mla.py`:
- Around line 368-369: The unpacked value num_prefill_tokens from
BatchInfo.get_absorbed_info() is unused; update the unpack to either ignore it
or rename it to indicate intentional unusedness (e.g., use "_" or
"_num_prefill_tokens") so the RUF059 warning is resolved—locate the assignment
to batch_info = BatchInfo(batch_info_host) and the subsequent unpack of
get_absorbed_info() and change "num_prefill, num_prefill_tokens, num_decode =
batch_info.get_absorbed_info()" to "num_prefill, _, num_decode =
batch_info.get_absorbed_info()" (or "num_prefill, _num_prefill_tokens,
num_decode") to make the unused value explicit.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/utils/block_table_ragged.py`:
- Around line 95-97: The custom Triton kernel adjust_block_table_triton writes
into extra_idx via tl.store(extra_idx_ptr + seq_id, removed) but extra_idx is
not listed in mutates_args, breaking functionalization; update the kernel
registration/call site for adjust_block_table_triton to include extra_idx in
mutates_args (or the equivalent mutating-args list) so PyTorch knows extra_idx
is mutated by the op, ensuring graph correctness and preserving the tl.store
side-effect referencing extra_idx_ptr/extra_idx (a hypothetical registration sketch is shown after this list of comments).

In `@tensorrt_llm/_torch/auto_deploy/llm_args.py`:
- Around line 173-178: Replace the loose Dict[str, Any] on the LlmArgs field
speculative_model_kwargs with a concrete Pydantic model (e.g.,
SpeculativeModelConfig) that enumerates the supported keys (model_name, device,
batch_size, temperature, max_tokens, any other allowed options) and their
types/defaults; declare that dataclass/Model in the same module (or an adjacent
models module), update the Field(...) type on LlmArgs.speculative_model_kwargs
to use SpeculativeModelConfig with default_factory=SpeculativeModelConfig and
keep/adjust the description, and update any code that accessed keys as dicts to
read attributes from SpeculativeModelConfig (and update imports or tests
accordingly) so runtime/schema validation and generated OpenAPI/JSON schema
reflect the concrete fields.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py`:
- Around line 746-747: The wrapper currently assigns target_logits =
csi.info.maybe_gather_and_squeeze(out.logits) for local verification and then
returns that gathered tensor as the wrapper's logits, but
ADEngine._run_forward() will gather again and double-corrupt shapes/content;
update the wrapper so you only use maybe_gather_and_squeeze(out.logits) for
local checking (use a separate variable like gathered_for_check) and ensure the
wrapper output's logits field returns the raw, ungathered tensor (out.logits)
instead of target_logits so ADEngine._run_forward() receives the original
logits; adjust references around target_logits and the return at the code that
constructs the wrapper output (the place returning logits at the end of the
forward wrapper) accordingly.
- Around line 705-716: The code currently filters kwargs with
name.endswith("hidden_states_cache"), missing keys like "hidden_states_cache_1"
and causing empty buffers or wrong ordering; update the filter to match any key
that starts with "hidden_states_cache" (e.g.,
name.startswith("hidden_states_cache")) and change the sort key to order capture
layers deterministically by numeric suffix (use a small regex to extract
trailing digits from the buffer name and sort by that integer, falling back to
the full name or 0 when no digits are present). Ensure you import re if needed
and keep the subsequent concatenation (hidden_states =
torch.cat([buf[:num_tokens] for _, buf in buffers], dim=1)) intact.

In `@tensorrt_llm/_torch/auto_deploy/models/eagle.py`:
- Around line 142-145: The assert currently passes multi-element weight tensors
(e.g., n_embed from sub_gm.graph.get_attr(f"{embed_name}.weight")) into
torch._assert which triggers an "ambiguous bool" runtime error; change the
condition to a scalar boolean by testing the tensor's element count (e.g., use
n_embed.numel() > 0 or n_embed.numel() != 0) and pass that scalar boolean to
torch._assert along with the same message, and apply the same fix to the other
five places where a weight tensor from get_attr() is passed directly into
torch._assert.

In `@tests/integration/test_lists/test-db/l0_h100.yml`:
- Around line 442-443: The one-model Eagle3 test has been commented out,
removing CI coverage for the feature; either re-enable the test case
accuracy/test_llm_api_autodeploy.py::TestLlama3_1_8B_Instruct_Eagle3::test_eagle3_one_model
by undoing the commented-out entry in the YAML and ensuring it passes CI, or
move it to a tracked quarantine entry: add the test to an explicit quarantine
list with a linked issue ID and clear re-enable criteria (what needs to be fixed
and target re-enable date/condition) so CI visibility and traceability are
preserved.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_attention_op.py`:
- Around line 57-67: BatchInfo.update(...) currently uses ng for both
generate-related fields which can misrepresent decode token count in edge cases
(e.g., max_seq_len=0); compute the actual decode token count used by the test
(e.g., decode_tokens = int(num_decode_tokens.item()) if a Tensor else
num_decode_tokens or otherwise derive it from the test's generate/seq_len logic)
and pass that value into BatchInfo.update([nc, npt, 0, 0, ng, decode_tokens])
instead of using ng for the final element so BatchInfo reflects the real decode
token count; update the local variable computation near ng and then call
BatchInfo.update with the new decode_tokens variable before serializing
batch_info_host.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py`:
- Around line 165-190: The helper _setup_decode_batch currently accepts an
unused parameter swc_vals; remove this dead parameter from the
_setup_decode_batch signature and from every call site (e.g.,
test_decode_increment_by_one and other tests that pass swc_vals) and delete any
leftover references or comments about swc_vals so the helper and tests only pass
and accept the actual used arguments (positions, pages_per_seq, cache_loc,
etc.). Ensure the signature and all call sites are updated consistently to avoid
mismatches.
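
To illustrate the functionalization point raised for block_table_ragged.py above, here is a hypothetical torch.library.custom_op registration that declares every mutated buffer, including extra_idx. The op name, signature, and body are assumptions used only to show the mutates_args mechanics; the real kernel is a Triton implementation.

import torch


@torch.library.custom_op(
    "auto_deploy_sketch::adjust_block_table",  # illustrative namespace/op name
    mutates_args=("block_table", "extra_idx"),  # declare ALL tensors the op writes in place
)
def adjust_block_table_sketch(block_table: torch.Tensor, extra_idx: torch.Tensor, delta: int) -> None:
    # Placeholder body standing in for the Triton kernel launch: when delta < 0,
    # the removed page id of each sequence is saved into extra_idx so it can be
    # re-inserted later, mirroring the behavior described in the walkthrough.
    if delta < 0:
        extra_idx.copy_(block_table[:, -1])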


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 08f8a9aa-6077-4847-b52b-0c5142dbef07

📥 Commits

Reviewing files that changed from the base of the PR and between 497b07d and 0b22405.

📒 Files selected for processing (56)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_backend_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/torch_backend_gated_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/flashinfer_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/mamba_backend_common.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_backend_mla.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/utils/block_table_ragged.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py
  • tensorrt_llm/_torch/auto_deploy/models/eagle.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/bamba.py
  • tensorrt_llm/_torch/auto_deploy/shim/__init__.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tensorrt_llm/_torch/auto_deploy/shim/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_logits_before_lm_head.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/hidden_states.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/integration/defs/examples/test_ad_speculative_decoding.py
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_attention_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_flashinfer_attention_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_torch_attention_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_trtllm_attention_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_torch_cached_gated_delta_rule.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_cuda_causal_conv_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_flashinfer_mamba_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_torch_causal_conv_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_torch_mamba_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_triton_mamba_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mla/test_flashinfer_mla_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mla/test_torch_mla_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_triton_causal_conv_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/utils/test_block_table_ragged_conversion.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_deepseek_custom.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_cached_sequence_interface.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gather_logits_before_lm_head.py
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/auto_deploy/shim/__init__.py

@govind-ramnarayan govind-ramnarayan force-pushed the ll/gr/eagle-enable-overlap branch from f2b09b1 to 5b88196 on March 6, 2026 21:21
@govind-ramnarayan
Collaborator

/bot run

@lucaslie
Member Author

lucaslie commented Mar 6, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@govind-ramnarayan govind-ramnarayan force-pushed the ll/gr/eagle-enable-overlap branch from 4494ca3 to 6d6cf0e on March 16, 2026 19:02
@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

2 similar comments
@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39131 [ run ] triggered by Bot. Commit: 9328f9f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39131 [ run ] completed with state FAILURE. Commit: 9328f9f
/LLM/main/L0_MergeRequest_PR pipeline #30390 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lucaslie and others added 8 commits March 17, 2026 09:31
Co-authored-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
…ix low acceptance rates

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
…but it now defaults to False

unnest_sequences() now uses batch_info.is_gather_required() to automatically
match how tokens were gathered in nest_sequences(), making it a true inverse.
Also fix test_engine to pass gather_context_logits=True so the full-sequence
reference logits comparison remains valid.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Three categories of failures addressed:

1. Pydantic type check (test_all_fields_have_allowed_types):
   Add speculative_model_kwargs to AutoDeployLlmArgs compatibility
   exempt list. It uses Dict[str, Any] for the same reason as
   model_kwargs: arbitrary HF config overrides for the draft model.

2. KV cache / delta rule tensor shape mismatches:
   The new unnest_sequences() branch assumed gather_context_logits=False
   (the new default) means the model output is pre-gathered, but the
   three affected tests don't include gather_logits_before_lm_head so
   their output is still [total_tokens, hidden]. Fix by:
   - Restoring the is_gather_required() branch in unnest_sequences()
     with a clear docstring explaining both modes
   - Explicitly passing gather_context_logits=True in the nest_sequences()
     calls in test_kv_cache.py, test_gated_delta_rule_cache.py, and
     test_torch_gated_delta_rule_cache.py

3. EagleWrapper resource_manager kwarg (test_eagle_wrapper_forward):
   Skip the test with a detailed TODO. The EagleWrapper interface was
   refactored (resource_manager removed from __init__, sample_and_verify
   removed), so the test needs a substantial rewrite. It is valuable for
   validating Eagle3 acceptance ratio before the full export+transforms+
   KV-cache pipeline and should be reinstated once updated.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
@govind-ramnarayan govind-ramnarayan force-pushed the ll/gr/eagle-enable-overlap branch from 8a63e90 to 341432a on March 17, 2026 16:32
@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39291 [ run ] triggered by Bot. Commit: 341432a Link to invocation

@lucaslie lucaslie linked an issue Mar 17, 2026 that may be closed by this pull request
1 task
@tensorrt-cicd
Collaborator

PR_Github #39291 [ run ] completed with state SUCCESS. Commit: 341432a
/LLM/main/L0_MergeRequest_PR pipeline #30541 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #39329 [ run ] triggered by Bot. Commit: 341432a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39329 [ run ] completed with state SUCCESS. Commit: 341432a
/LLM/main/L0_MergeRequest_PR pipeline #30576 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lucaslie lucaslie merged commit 3816c0b into NVIDIA:main Mar 18, 2026
5 checks passed
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Co-authored-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: one-model speculative decoding

5 participants