
[#10931][feat] AutoDeploy: one-model spec dec #11701

Merged

lucaslie merged 8 commits into NVIDIA:main from nv-auto-deploy:ll/gr/eagle-enable-overlap on Mar 18, 2026

Conversation

@lucaslie
Member

@lucaslie lucaslie commented Feb 25, 2026

Summary by CodeRabbit

  • New Features

    • Added support for Eagle3 one-model speculative decoding path, enabling single-model inference optimizations with speculative decoding capabilities.
    • Introduced speculative_model_kwargs configuration field for passing additional parameters to speculative models.
  • Improvements

    • Enhanced batch metadata handling and KV cache resource management for improved inference efficiency.
    • Streamlined public APIs by consolidating internal metadata structures.
  • Bug Fixes

    • Fixed overlap scheduler behavior and token acceptance rate tracking in speculative decoding workflows.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
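
For example, a hypothetical invocation that combines flags documented above (the GPU type shown is an illustrative placeholder, not a recommendation) could look like:

/bot run --gpu-type "H100_PCIe" --disable-fail-fast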

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@lucaslie lucaslie self-assigned this Feb 25, 2026
@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch from 9a689e7 to f0e5e4a on February 25, 2026 03:00
@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch from f0e5e4a to def4c7e on February 25, 2026 17:58
@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch 5 times, most recently from 39de91a to db6d948 on March 3, 2026 16:34
@suyoggupta
Collaborator

@lucaslie , @govind-ramnarayan : is this ready to be reviewed?

@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch 2 times, most recently from 49c0d3c to 11876c8 on March 5, 2026 00:46
@lucaslie lucaslie marked this pull request as ready for review March 6, 2026 00:15
@lucaslie lucaslie requested review from a team as code owners March 6, 2026 00:15
@lucaslie lucaslie force-pushed the ll/gr/eagle-enable-overlap branch 2 times, most recently from fe4a894 to 0b22405 on March 6, 2026 00:17
@coderabbitai
Contributor

coderabbitai bot commented Mar 6, 2026

📝 Walkthrough

Introduces a new BatchInfo class to encapsulate and manage batch metadata tensors across attention and other backend operations, refactors Eagle speculative decoding for one-model support, updates block-table management with page re-insertion logic, and integrates these changes across flashinfer, torch, triton, mamba, FLA, MLA backends, plus test and configuration updates.
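
To make the new metadata flow concrete, the following is a minimal, self-contained Python sketch of the consumption pattern described above. It is illustrative only: the class name BatchInfoSketch, the six-slot layout, and the sample numbers are assumptions inferred from the BatchInfo().update(...).serialize() and get_absorbed_info() usages quoted later on this page, not the actual TensorRT-LLM implementation.

import torch


class BatchInfoSketch:
    """Minimal stand-in for the described BatchInfo class (names and layout assumed)."""

    # Assumed six-slot layout, inferred from the test call
    # BatchInfo().update([nc, npt, 0, 0, ng, decode_tokens]) quoted below:
    # [num_prefill, num_prefill_tokens, num_extend, num_extend_tokens,
    #  num_decode, num_decode_tokens]
    def __init__(self, batch_info_host=None):
        self._vals = [0] * 6 if batch_info_host is None else [int(v) for v in batch_info_host.tolist()]

    def update(self, batch_info):
        if len(batch_info) != 6:  # explicit length check, as suggested in the review below
            raise ValueError("batch_info must be length 6")
        self._vals[:] = [int(v) for v in batch_info]
        return self

    def serialize(self):
        # host-side metadata tensor that backends receive as `batch_info_host`
        return torch.tensor(self._vals, dtype=torch.int32)

    def get_absorbed_info(self):
        num_prefill, num_prefill_tokens, _, _, num_decode, _ = self._vals
        return num_prefill, num_prefill_tokens, num_decode


# Backend-style consumption: 2 prefill sequences (17 tokens total) and 3 decode sequences.
batch_info_host = BatchInfoSketch().update([2, 17, 0, 0, 3, 3]).serialize()
num_prefill, _num_prefill_tokens, num_decode = BatchInfoSketch(batch_info_host).get_absorbed_info()
num_seq = num_prefill + num_decode  # pattern used across the refactored attention/mamba/MLA backends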

Changes

Cohort / File(s) Summary
BatchInfo Core Implementation
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
Introduced BatchInfo class with serialization, update, and accessor methods for batch metadata; integrated into SequenceInfo to replace direct tensor manipulation; refactored batch info handling, token gathering logic, and host-device synchronization with new helper methods (offset_pos_and_cache_, switch_to_generate_, rescatter_input_ids_).
BatchInfo Integration in Attention Backends
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py, torch_backend_attention.py, triton_attention.py, trtllm_attention.py
Replaced direct tensor-to-list conversions with BatchInfo wrapper and get_absorbed_info() method; updated function signatures to remove max_seq_info_host parameter in trtllm paths and centralize metadata extraction via BatchInfo.
BatchInfo Integration in Other Backends
tensorrt_llm/_torch/auto_deploy/custom_ops/fla/*, mamba/*, mla/*
Applied consistent BatchInfo-based batch metadata extraction across FLA, Mamba, and MLA backend implementations (flashinfer, torch, triton variants), replacing direct tensor unpacking with BatchInfo().get_absorbed_info() or get_num_tokens_to_gather().
Block Table and Ragged Cache Management
tensorrt_llm/_torch/auto_deploy/custom_ops/utils/block_table_ragged.py
Added extra_idx parameter to track removed pages for re-insertion when delta < 0; introduced max_blocks_per_seq parameter to guide memory sizing; updated public function signatures for adjust_block_table_torch, adjust_ragged_torch, and Triton variants to support new cache management semantics.
Gather Logits Refactoring
tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py, transform/library/gather_logits_before_lm_head.py
Replaced tokens_gather_info_host parameter with batch_info_host; updated control flow to use BatchInfo.get_num_tokens_to_gather() and is_gather_required() for conditional gathering logic.
Eagle Speculative Decoding Refactoring
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py, models/eagle.py
Refactored EagleWrapper.__init__ to remove resource_manager parameter; added _draft_inner_model and _draft_dtype properties; introduced KV-cache-aware forward path (_forward_with_kv_cache) and prefill-only path (_forward_prefill_only); introduced EagleOneModelFactory and export-info classes (TargetModelExportInfo, DraftModelExportInfo) for one-model Eagle composition.
Speculative Decoding Infrastructure
tensorrt_llm/_torch/auto_deploy/llm_args.py, shim/ad_executor.py, shim/interface.py, shim/demollm.py
Added speculative_model_kwargs field to LlmArgs; extended CachedSequenceInterface to accept spec_config; updated add_resource to return the generated full resource name; added a _run_forward() method to ADEngine and updated input preparation for the new new_tokens_lens and gather_context_logits parameters; introduced Eagle3OneModel sampler support.
KVCache Transform Updates
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
Updated cache registration to use original key names instead of per-node indices; extended forward invocation with CSI compatibility check (_requires_csi flag) to conditionally pass cache_seq_interface parameter.
Hidden States and Configuration
tensorrt_llm/_torch/auto_deploy/config/default.yaml, transform/library/hidden_states.py
Added explicit enabled: false flag for detect_hidden_states_for_capture transform; refactored detection logic to skip transform when gm.is_draft is true or residual_add_for_capture already exists.
Shim Module Export Changes
tensorrt_llm/_torch/auto_deploy/shim/__init__.py
Removed create_autodeploy_executor from exported symbols, reducing public API surface.
Test Updates for BatchInfo
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_*.py, mamba/test_*.py, mla/test_*.py, fla/test_*.py, utils/test_*.py, models/test_*.py
Updated test fixtures across multiple backends to construct batch_info_host via BatchInfo().update(...).serialize() instead of direct tensor literals; covers the flashinfer, torch, Triton, and CUDA variants.
New Comprehensive Tests
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py, tests/integration/defs/accuracy/test_llm_api_autodeploy.py, tests/integration/defs/examples/test_ad_speculative_decoding.py
Added extensive test coverage for SequenceInfo.switch_to_generate_inplace() behavior; introduced TestLlama3_1_8B_Instruct_Eagle3 for Eagle3 one-model speculative decoding; added test_autodeploy_eagle3_one_model_acceptance_rate() and helper _run_acceptance_rate_check() for acceptance-rate validation.
Block Table and Cache Tests
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/utils/test_block_table_ragged_conversion.py
Updated ragged/block-table conversion tests to pass max_blocks_per_seq parameter; added comprehensive test test_adjust_remove_saves_to_extra_idx_and_reinserts() demonstrating page re-insertion cycles.
CachedSequenceInterface Tests
tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_cached_sequence_interface.py, test_engine.py
Updated tests to use returned full_name from add_resource() for resource lookups and caches; validated token-gathering behavior with new gather_context_logits parameter; added py_seq_slot attribute to request objects; updated engine tests to use _run_forward()["logits"] instead of _compute_logits().
Minor Configuration and Utility Changes
tensorrt_llm/_torch/auto_deploy/models/patches/bamba.py, transform/library/sharding.py, tests/integration/test_lists/test-db/l0_h100.yml
Updated Bamba patch to use BatchInfo-based metadata construction; added conditional config validation guard for multi-process setups; added placeholder comments for AutoDeploy tests.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 57.44%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them to satisfy the coverage threshold.
  • Description check (⚠️ Warning): The PR description is the template skeleton without completion; it lacks the title format, an explanation of the changes, test coverage details, and PR checklist completion. Resolution: complete the PR description with (1) a proper title following the [ticket][type] format, (2) a clear explanation of the one-model speculative decoding feature in the Description section, (3) a list of relevant tests in the Test Coverage section, and (4) confirmation that all PR checklist items have been reviewed.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The PR title '[#10931][feat] AutoDeploy: one-model spec dec' is fully related to the main changes in the changeset, which implement one-model Eagle speculative decoding support in AutoDeploy.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 15

🧹 Nitpick comments (7)
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py (1)

20-20: Use module-level import instead of importing SequenceInfo directly.

Line 20 imports a class symbol directly, which conflicts with the repository’s Python import rule.

Proposed refactor
 import torch
 
-from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import SequenceInfo
+from tensorrt_llm._torch.auto_deploy.custom_ops import attention_interface
@@
-def _make_seq_info(extra_activate=()) -> SequenceInfo:
+def _make_seq_info(extra_activate=()) -> attention_interface.SequenceInfo:
@@
-    si = SequenceInfo(
+    si = attention_interface.SequenceInfo(
@@
-def _nest_prefill(si: SequenceInfo, input_ids, pages_per_seq, cache_loc, **kw):
+def _nest_prefill(si: attention_interface.SequenceInfo, input_ids, pages_per_seq, cache_loc, **kw):

As per coding guidelines, "**/*.py: Python imports must use form from package.subpackage import module (never from module import Class)."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py`
at line 20, Replace the direct class import of SequenceInfo with a module-level
import to follow the repo import rule: change the import to bring in the
attention_interface module (e.g., import
tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface as
attention_interface) and then update all usages of SequenceInfo in this test
(function/class names in this file) to reference
attention_interface.SequenceInfo; ensure only the import line and references are
modified so behavior remains unchanged.
tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py (1)

274-278: Run the resize probe forward in inference mode.

This path is used only for memory sizing; wrapping the forward call in torch.inference_mode() avoids autograd allocations that can inflate mem_reserved_for_forward and reduce KV-cache capacity estimates.

♻️ Suggested patch
         cm.info.set_max_num_tokens_sample()
         try:
             # TODO (lucaslie): revisit this logic as part of spec dec cudagraph support...
-            if getattr(mod, "_requires_csi", False):
-                mod(cache_seq_interface=cm)
-            else:
-                mod(**cm.named_args)
+            with torch.inference_mode():
+                if getattr(mod, "_requires_csi", False):
+                    mod(cache_seq_interface=cm)
+                else:
+                    mod(**cm.named_args)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py` around lines
274 - 278, The forward probe used for memory sizing should run under
torch.inference_mode() to avoid autograd allocations: wrap the calls to mod(...)
in a torch.inference_mode() context manager so both branches (when getattr(mod,
"_requires_csi", False) calls mod(cache_seq_interface=cm) and the else branch
calling mod(**cm.named_args)) execute inside torch.inference_mode(), preserving
existing argument usage and behavior while preventing grad-related memory
allocations during the resize probe.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py (1)

309-310: Avoid unused unpacked token count in host metadata prep.

Line 310 unpacks num_prefill_tokens but never uses it (RUF059).

💡 Proposed fix
-    num_prefill, num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
+    num_prefill, _num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py`
around lines 309 - 310, The host metadata unpack currently assigns an unused
variable num_prefill_tokens from BatchInfo.get_absorbed_info(); update the
unpack to avoid the unused symbol — either ignore the middle value by assigning
it to _ or only unpack the two needed values (e.g., assign num_prefill and
num_decode) from BatchInfo(batch_info_host).get_absorbed_info() so
num_prefill_tokens is not created but the rest of the code (using BatchInfo and
get_absorbed_info) continues to work.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_backend_attention.py (1)

320-322: Clean up unused unpacked value from absorbed metadata.

Line 321 unpacks num_prefill_tokens but doesn’t use it (RUF059).

💡 Proposed fix
-    num_prefill, num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
+    num_prefill, _num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_backend_attention.py`
around lines 320 - 322, The code unpacks three values from
BatchInfo.get_absorbed_info() but never uses num_prefill_tokens; update the
unpack to remove the unused variable (e.g., unpack only num_prefill and
num_decode) or replace num_prefill_tokens with an underscore to indicate it’s
intentionally ignored in the BatchInfo / get_absorbed_info call site where
batch_info_host is converted and num_seq is computed from num_prefill +
num_decode.
tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py (1)

192-194: Drop unused unpacked metadata to keep lint clean.

At Line 193, num_prefill_tokens is unpacked but unused (RUF059).

💡 Proposed fix
-    num_prefill, num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
+    num_prefill, _num_prefill_tokens, num_decode = batch_info.get_absorbed_info()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py`
around lines 192 - 194, The unpacked variable num_prefill_tokens from
BatchInfo(batch_info_host).get_absorbed_info() is unused; change the unpacking
in the block that creates BatchInfo and calls get_absorbed_info() so the unused
metadata is discarded (e.g., replace num_prefill_tokens with a throwaway name
like _ or otherwise only extract needed elements) and keep the subsequent
computation of num_seq using num_prefill and num_decode unchanged; locate this
in the code around the BatchInfo constructor call and the get_absorbed_info()
unpacking.
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py (1)

423-423: Avoid unpacking an unused value.

Line 423 unpacks num_decode_tokens but never uses it, which triggers Ruff (RUF059).

💡 Proposed fix
-        num_prefill_tokens, num_extend_tokens, num_decode_tokens = self.get_num_tokens()
+        num_prefill_tokens, num_extend_tokens, _num_decode_tokens = self.get_num_tokens()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py` at line
423, The code unpacks three values from self.get_num_tokens() but never uses
num_decode_tokens, causing a linter warning; change the unpack to only capture
the needed values (e.g., num_prefill_tokens, num_extend_tokens =
self.get_num_tokens()) or replace the third target with a discard variable
(e.g., _ or _num_decode_tokens) where the call appears (the call to
self.get_num_tokens() in attention_interface.py), and ensure any downstream code
that expects num_decode_tokens is updated accordingly.
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gather_logits_before_lm_head.py (1)

25-25: Prefer module-namespace import for BatchInfo

Line 25 imports a class symbol directly. The repository guideline asks for importing module namespaces and using module.Symbol.

♻️ Example refactor pattern
-from tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import BatchInfo
+from tensorrt_llm._torch.auto_deploy.custom_ops import attention_interface

-        batch_info = BatchInfo()
+        batch_info = attention_interface.BatchInfo()

As per coding guidelines "Python imports must use form from package.subpackage import module (never from module import Class)".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gather_logits_before_lm_head.py`
at line 25, The test imports the class BatchInfo directly; change the import to
the module form (import the module namespace) so usages use module.BatchInfo
instead. Replace the line that currently does "from
tensorrt_llm._torch.auto_deploy.custom_ops.attention_interface import BatchInfo"
with a module import (e.g., "from tensorrt_llm._torch.auto_deploy.custom_ops
import attention_interface") and update all references in
test_gather_logits_before_lm_head.py to use attention_interface.BatchInfo
wherever BatchInfo is referenced.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py`:
- Around line 411-413: The update method (update(self, batch_info: List[int]) ->
None) currently assigns batch_info into self._batch_info_np[:6] without
validating length, causing obscure numpy broadcast errors for legacy 3-value
inputs; add an explicit check that len(batch_info) == 6 at the top of update and
raise a clear ValueError (e.g. "batch_info must be length 6: [fields...]") when
it isn't, so callers get an immediate, descriptive error rather than a
silent/broadcast failure; keep the assignment to self._batch_info_np[:6] after
the check.
- Around line 1362-1375: The code overwrites cu_seqlen before computing
extraction_indices, so extraction_indices uses the reset arange rather than each
sequence's true end; to fix, read or save the original cu_seqlen (via
self.get_arg or a local variable) before you create/overwrite the new cu_seqlen,
then compute extraction_indices = (original_cu_seqlen[1:] - 1).long() and use
that when calling self.copy_ for input_ids; keep references to cu_seqlen,
input_ids_flat, extraction_indices, self.get_arg and self.copy_ to locate the
change.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py`:
- Around line 286-289: The BatchInfo initialization calls .numpy() which breaks
under torch.compile FakeTensor tracing; inside the register_fake implementation
replace the direct BatchInfo(batch_info_host) call with a FakeTensor guard:
detect FakeTensor instances (e.g. isinstance(batch_info_host,
torch._subclasses.fake_tensor.FakeTensor) or similar project convention) and
when fake, compute max_blocks_per_seq and max_batch_size from
batch_info_host.shape/metadata directly instead of constructing BatchInfo;
otherwise keep using BatchInfo(batch_info_host) and then call
get_max_blocks_per_seq() and get_max_batch_size() as before (refer to symbols
BatchInfo, register_fake, get_max_blocks_per_seq, get_max_batch_size).

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/mamba_backend_common.py`:
- Line 54: The unpacking on the call to batch_info.get_absorbed_info() creates
unused variables num_prefill_tokens and num_decode causing a lint error; change
the unpacking in the metadata preparation to only capture the used value (e.g.,
assign to num_prefill or use _ placeholders) so that only the necessary variable
from get_absorbed_info() is bound (refer to batch_info.get_absorbed_info and the
num_prefill usage to locate the edit).

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py`:
- Around line 163-164: The tuple returned from
BatchInfo(batch_info_host).get_absorbed_info() is unpacked into num_prefill,
num_prefill_tokens, num_decode but num_prefill_tokens is never used; update the
unpacking in torch_backend_mamba.py to drop the unused binding (e.g., unpack
into num_prefill, num_decode or use a throwaway name like _ for the middle
element) so only the used symbols (num_prefill and num_decode) remain.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py`:
- Around line 439-440: The unpacked value num_prefill_tokens from
BatchInfo(batch_info_host).get_absorbed_info() is unused and triggers RUF059;
update the unpacking in the function containing this call so the unused slot is
intentionally marked (e.g., replace num_prefill_tokens with a throwaway name
like _ or _num_prefill_tokens) to indicate it’s intentionally unused and satisfy
the linter while keeping num_prefill and num_decode as-is.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_backend_mla.py`:
- Around line 368-369: The unpacked value num_prefill_tokens from
BatchInfo.get_absorbed_info() is unused; update the unpack to either ignore it
or rename it to indicate intentional unusedness (e.g., use "_" or
"_num_prefill_tokens") so the RUF059 warning is resolved—locate the assignment
to batch_info = BatchInfo(batch_info_host) and the subsequent unpack of
get_absorbed_info() and change "num_prefill, num_prefill_tokens, num_decode =
batch_info.get_absorbed_info()" to "num_prefill, _, num_decode =
batch_info.get_absorbed_info()" (or "num_prefill, _num_prefill_tokens,
num_decode") to make the unused value explicit.

In `@tensorrt_llm/_torch/auto_deploy/custom_ops/utils/block_table_ragged.py`:
- Around line 95-97: The custom Triton kernel adjust_block_table_triton writes
into extra_idx via tl.store(extra_idx_ptr + seq_id, removed) but extra_idx is
not listed in mutates_args, breaking functionalization; update the kernel
registration/call site for adjust_block_table_triton to include extra_idx in
mutates_args (or the equivalent mutating-args list) so PyTorch knows extra_idx
is mutated by the op, ensuring graph correctness and preserving the tl.store
side-effect referencing extra_idx_ptr/extra_idx (a hypothetical registration sketch is shown after this list of comments).

In `@tensorrt_llm/_torch/auto_deploy/llm_args.py`:
- Around line 173-178: Replace the loose Dict[str, Any] on the LlmArgs field
speculative_model_kwargs with a concrete Pydantic model (e.g.,
SpeculativeModelConfig) that enumerates the supported keys (model_name, device,
batch_size, temperature, max_tokens, any other allowed options) and their
types/defaults; declare that dataclass/Model in the same module (or an adjacent
models module), update the Field(...) type on LlmArgs.speculative_model_kwargs
to use SpeculativeModelConfig with default_factory=SpeculativeModelConfig and
keep/adjust the description, and update any code that accessed keys as dicts to
read attributes from SpeculativeModelConfig (and update imports or tests
accordingly) so runtime/schema validation and generated OpenAPI/JSON schema
reflect the concrete fields.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py`:
- Around line 746-747: The wrapper currently assigns target_logits =
csi.info.maybe_gather_and_squeeze(out.logits) for local verification and then
returns that gathered tensor as the wrapper's logits, but
ADEngine._run_forward() will gather again and double-corrupt shapes/content;
update the wrapper so you only use maybe_gather_and_squeeze(out.logits) for
local checking (use a separate variable like gathered_for_check) and ensure the
wrapper output's logits field returns the raw, ungathered tensor (out.logits)
instead of target_logits so ADEngine._run_forward() receives the original
logits; adjust references around target_logits and the return at the code that
constructs the wrapper output (the place returning logits at the end of the
forward wrapper) accordingly.
- Around line 705-716: The code currently filters kwargs with
name.endswith("hidden_states_cache"), missing keys like "hidden_states_cache_1"
and causing empty buffers or wrong ordering; update the filter to match any key
that starts with "hidden_states_cache" (e.g.,
name.startswith("hidden_states_cache")) and change the sort key to order capture
layers deterministically by numeric suffix (use a small regex to extract
trailing digits from the buffer name and sort by that integer, falling back to
the full name or 0 when no digits are present). Ensure you import re if needed
and keep the subsequent concatenation (hidden_states =
torch.cat([buf[:num_tokens] for _, buf in buffers], dim=1)) intact.

In `@tensorrt_llm/_torch/auto_deploy/models/eagle.py`:
- Around line 142-145: The assert currently passes multi-element weight tensors
(e.g., n_embed from sub_gm.graph.get_attr(f"{embed_name}.weight")) into
torch._assert which triggers an "ambiguous bool" runtime error; change the
condition to a scalar boolean by testing the tensor's element count (e.g., use
n_embed.numel() > 0 or n_embed.numel() != 0) and pass that scalar boolean to
torch._assert along with the same message, and apply the same fix to the other
five places where a weight tensor from get_attr() is passed directly into
torch._assert.

In `@tests/integration/test_lists/test-db/l0_h100.yml`:
- Around line 442-443: The one-model Eagle3 test has been commented out,
removing CI coverage for the feature; either re-enable the test case
accuracy/test_llm_api_autodeploy.py::TestLlama3_1_8B_Instruct_Eagle3::test_eagle3_one_model
by undoing the commented-out entry in the YAML and ensuring it passes CI, or
move it to a tracked quarantine entry: add the test to an explicit quarantine
list with a linked issue ID and clear re-enable criteria (what needs to be fixed
and target re-enable date/condition) so CI visibility and traceability are
preserved.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_attention_op.py`:
- Around line 57-67: BatchInfo.update(...) currently uses ng for both
generate-related fields which can misrepresent decode token count in edge cases
(e.g., max_seq_len=0); compute the actual decode token count used by the test
(e.g., decode_tokens = int(num_decode_tokens.item()) if a Tensor else
num_decode_tokens or otherwise derive it from the test's generate/seq_len logic)
and pass that value into BatchInfo.update([nc, npt, 0, 0, ng, decode_tokens])
instead of using ng for the final element so BatchInfo reflects the real decode
token count; update the local variable computation near ng and then call
BatchInfo.update with the new decode_tokens variable before serializing
batch_info_host.

In
`@tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py`:
- Around line 165-190: The helper _setup_decode_batch currently accepts an
unused parameter swc_vals; remove this dead parameter from the
_setup_decode_batch signature and from every call site (e.g.,
test_decode_increment_by_one and other tests that pass swc_vals) and delete any
leftover references or comments about swc_vals so the helper and tests only pass
and accept the actual used arguments (positions, pages_per_seq, cache_loc,
etc.). Ensure the signature and all call sites are updated consistently to avoid
mismatches.
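
To illustrate the functionalization point raised for block_table_ragged.py above, here is a hypothetical torch.library.custom_op registration that declares every mutated buffer, including extra_idx. The op name, signature, and body are assumptions used only to show the mutates_args mechanics; the real kernel is a Triton implementation.

import torch


@torch.library.custom_op(
    "auto_deploy_sketch::adjust_block_table",  # illustrative namespace/op name
    mutates_args=("block_table", "extra_idx"),  # declare ALL tensors the op writes in place
)
def adjust_block_table_sketch(block_table: torch.Tensor, extra_idx: torch.Tensor, delta: int) -> None:
    # Placeholder body standing in for the Triton kernel launch: when delta < 0,
    # the removed page id of each sequence is saved into extra_idx so it can be
    # re-inserted later, mirroring the behavior described in the walkthrough.
    if delta < 0:
        extra_idx.copy_(block_table[:, -1])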


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 08f8a9aa-6077-4847-b52b-0c5142dbef07

📥 Commits

Reviewing files that changed from the base of the PR and between 497b07d and 0b22405.

📒 Files selected for processing (56)
  • tensorrt_llm/_torch/auto_deploy/config/default.yaml
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/flashinfer_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/torch_backend_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/triton_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention/trtllm_attention.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/fla_backend_gated_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/fla/torch_backend_gated_delta.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/cuda_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/flashinfer_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/mamba_backend_common.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/torch_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_causal_conv.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mamba/triton_backend_mamba.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/flashinfer_mla.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/mla/torch_backend_mla.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/utils/block_table_ragged.py
  • tensorrt_llm/_torch/auto_deploy/custom_ops/utils/torch_gather_logits.py
  • tensorrt_llm/_torch/auto_deploy/llm_args.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py
  • tensorrt_llm/_torch/auto_deploy/models/eagle.py
  • tensorrt_llm/_torch/auto_deploy/models/patches/bamba.py
  • tensorrt_llm/_torch/auto_deploy/shim/__init__.py
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
  • tensorrt_llm/_torch/auto_deploy/shim/demollm.py
  • tensorrt_llm/_torch/auto_deploy/shim/interface.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/gather_logits_before_lm_head.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/hidden_states.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/kvcache.py
  • tensorrt_llm/_torch/auto_deploy/transform/library/sharding.py
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
  • tests/integration/defs/examples/test_ad_speculative_decoding.py
  • tests/integration/test_lists/test-db/l0_h100.yml
  • tests/unittest/_torch/auto_deploy/_utils_test/torch_attention_reference.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_attention_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_flashinfer_attention_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_torch_attention_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/attention/test_trtllm_attention_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_fla_cached_gated_delta_rule.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/fla/test_torch_cached_gated_delta_rule.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_cuda_causal_conv_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_flashinfer_mamba_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_torch_causal_conv_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_torch_mamba_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mamba/test_triton_mamba_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mla/test_flashinfer_mla_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/mla/test_torch_mla_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_switch_to_generate_inplace.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_triton_causal_conv_cached_op.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/utils/test_block_table_ragged_conversion.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/models/test_deepseek_custom.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_cached_sequence_interface.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/shim/test_engine.py
  • tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_gather_logits_before_lm_head.py
💤 Files with no reviewable changes (1)
  • tensorrt_llm/_torch/auto_deploy/shim/__init__.py

@govind-ramnarayan govind-ramnarayan force-pushed the ll/gr/eagle-enable-overlap branch from f2b09b1 to 5b88196 on March 6, 2026 21:21
@govind-ramnarayan
Collaborator

/bot run

@lucaslie
Member Author

lucaslie commented Mar 6, 2026

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@govind-ramnarayan govind-ramnarayan force-pushed the ll/gr/eagle-enable-overlap branch from 4494ca3 to 6d6cf0e on March 16, 2026 19:02
@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

2 similar comments
@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39131 [ run ] triggered by Bot. Commit: 9328f9f Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39131 [ run ] completed with state FAILURE. Commit: 9328f9f
/LLM/main/L0_MergeRequest_PR pipeline #30390 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

lucaslie and others added 8 commits March 17, 2026 09:31
Co-authored-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
…ix low acceptance rates

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
…but it now defaults to False

unnest_sequences() now uses batch_info.is_gather_required() to automatically
match how tokens were gathered in nest_sequences(), making it a true inverse.
Also fix test_engine to pass gather_context_logits=True so the full-sequence
reference logits comparison remains valid.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Three categories of failures addressed:

1. Pydantic type check (test_all_fields_have_allowed_types):
   Add speculative_model_kwargs to AutoDeployLlmArgs compatibility
   exempt list. It uses Dict[str, Any] for the same reason as
   model_kwargs: arbitrary HF config overrides for the draft model.

2. KV cache / delta rule tensor shape mismatches:
   The new unnest_sequences() branch assumed gather_context_logits=False
   (the new default) means the model output is pre-gathered, but the
   three affected tests don't include gather_logits_before_lm_head so
   their output is still [total_tokens, hidden]. Fix by:
   - Restoring the is_gather_required() branch in unnest_sequences()
     with a clear docstring explaining both modes
   - Explicitly passing gather_context_logits=True in the nest_sequences()
     calls in test_kv_cache.py, test_gated_delta_rule_cache.py, and
     test_torch_gated_delta_rule_cache.py

3. EagleWrapper resource_manager kwarg (test_eagle_wrapper_forward):
   Skip the test with a detailed TODO. The EagleWrapper interface was
   refactored (resource_manager removed from __init__, sample_and_verify
   removed), so the test needs a substantial rewrite. It is valuable for
   validating Eagle3 acceptance ratio before the full export+transforms+
   KV-cache pipeline and should be reinstated once updated.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
@govind-ramnarayan govind-ramnarayan force-pushed the ll/gr/eagle-enable-overlap branch from 8a63e90 to 341432a on March 17, 2026 16:32
@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #39291 [ run ] triggered by Bot. Commit: 341432a Link to invocation

@lucaslie lucaslie linked an issue Mar 17, 2026 that may be closed by this pull request
1 task
@tensorrt-cicd
Collaborator

PR_Github #39291 [ run ] completed with state SUCCESS. Commit: 341432a
/LLM/main/L0_MergeRequest_PR pipeline #30541 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@govind-ramnarayan
Collaborator

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

@tensorrt-cicd
Collaborator

PR_Github #39329 [ run ] triggered by Bot. Commit: 341432a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #39329 [ run ] completed with state SUCCESS. Commit: 341432a
/LLM/main/L0_MergeRequest_PR pipeline #30576 completed with status: 'SUCCESS'

CI Report

Link to invocation

@lucaslie lucaslie merged commit 3816c0b into NVIDIA:main Mar 18, 2026
5 checks passed
limin2021 pushed a commit to limin2021/TensorRT-LLM that referenced this pull request Mar 19, 2026
Signed-off-by: Lucas Liebenwein <11156568+lucaslie@users.noreply.github.com>
Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
Co-authored-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[Feature]: AutoDeploy: one-model speculative decoding

5 participants