[None][feat] Add llm.encode() fast path for encoder-only models #12801
pcastonguay merged 9 commits into NVIDIA:main from
Conversation
04bcd64 to 83bc6b9
nvrohanv left a comment
Some comments on the tokenization piece and the handling of empty batches, but overall looks good!
@Superjomn could you review since it adds a new method to the LLM API? Thx.
📝 Walkthrough
Introduces encoder-only inference support to TensorRT-LLM, bypassing generation components.
Changes
Sequence Diagram
```mermaid
sequenceDiagram
participant Client
participant LLM
participant InputProcessor
participant EncoderExecutor
participant ModelEngine
participant CUDA
Client->>LLM: encode(inputs)
activate LLM
LLM->>InputProcessor: tokenize(inputs)
InputProcessor-->>LLM: token_ids
LLM->>EncoderExecutor: batch_forward(packed_inputs)
activate EncoderExecutor
EncoderExecutor->>ModelEngine: encoder_forward(inputs)
activate ModelEngine
ModelEngine->>ModelEngine: _prepare_encoder_inputs(inputs)
Note over ModelEngine: Validate tokens, copy to GPU, build attn_metadata
ModelEngine->>ModelEngine: _forward_step(model_inputs, None, False)
ModelEngine->>CUDA: Forward pass
CUDA-->>ModelEngine: logits
ModelEngine-->>EncoderExecutor: encoder_output
deactivate ModelEngine
EncoderExecutor-->>LLM: encoder_output
deactivate EncoderExecutor
LLM-->>Client: EncoderOutput(logits, prompt_token_ids, prompt)
deactivate LLM
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 5
♻️ Duplicate comments (1)
tensorrt_llm/llmapi/llm.py (1)
753-758: ⚠️ Potential issue | 🟠 Major
Normalize batched inputs before indexing the first element.
The signature accepts `Sequence[PromptInputs]`, but this branch only recognizes `list`. `encode([])` throws an `IndexError` at Line 755, and `encode(("a", "b"))` gets treated as one prompt instead of a batch. Please handle empty batches and non-list sequences before peeking at `inputs[0]`.
Suggested fix
```diff
-        unbatched = not isinstance(inputs, list)
+        unbatched = not isinstance(inputs, (list, tuple))
         if not unbatched:
+            if len(inputs) == 0:
+                raise ValueError("encode() requires at least one input.")
             if isinstance(inputs[0], int):
                 unbatched = True
         if unbatched:
             inputs = [inputs]
+        else:
+            inputs = list(inputs)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/llmapi/llm.py` around lines 753 - 758, In the encode path normalize inputs without indexing inputs[0] prematurely: treat any non-sequence (or str/bytes) as a single prompt and wrap it in a list, treat empty sequences as an empty batch (do not access inputs[0]), and only peek the first element when len(inputs) > 0 to detect an unbatched numeric input; update the logic in the function/method handling the inputs variable (the block that currently checks isinstance(inputs, list) and inputs[0]) to use collections.abc.Sequence checks, guard against str/bytes, and check len(inputs) before accessing index 0.
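As a reference for the fix above, here is a minimal, hypothetical sketch of the normalization the comment describes, using `collections.abc.Sequence` checks (the helper name is illustrative, not the actual code in `llm.py`):

```python
from collections.abc import Sequence
from typing import Any, List


def normalize_prompt_inputs(inputs: Any) -> List[Any]:
    """Normalize encode() inputs to a list of prompts without
    dereferencing inputs[0] prematurely."""
    # str/bytes are Sequences too, but each is a single prompt.
    if isinstance(inputs, (str, bytes)) or not isinstance(inputs, Sequence):
        return [inputs]
    if len(inputs) == 0:
        raise ValueError("encode() requires at least one input.")
    # A flat sequence of token ids is one unbatched prompt.
    if isinstance(inputs[0], int):
        return [list(inputs)]
    return list(inputs)
```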
🧹 Nitpick comments (3)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
228-231: Add return type annotation.
The function signature is missing a return type annotation; the coding guidelines require static type annotations for all functions.
♻️ Suggested fix
```diff
+from .encoder_executor import EncoderExecutor
+
 def create_encoder_executor(
     llm_args: TorchLlmArgs,
     checkpoint_dir: Optional[str] = None,
-):
+) -> "EncoderExecutor":
```
Note: Use a string annotation to avoid a circular import, or move the import to the top of the file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` around lines 228 - 231, The create_encoder_executor function is missing a return type annotation; update its signature to include the correct return type (e.g., the executor class/type returned by create_encoder_executor) using a string literal annotation to avoid circular imports or alternatively import the return type at module top if safe; locate the function create_encoder_executor in py_executor_creator.py and add the string-based return type annotation matching the executor class name used elsewhere in this module.
tensorrt_llm/_torch/pyexecutor/encoder_executor.py (2)
57-59: Consider explicit resource cleanup in `shutdown()`.
The current implementation relies on Python's garbage collector to release CUDA resources. While this works, explicit cleanup (e.g., calling any model engine cleanup methods or clearing the CUDA cache) would be more deterministic, especially for repeated instantiation scenarios.
♻️ Suggested improvement
```diff
 def shutdown(self):
     """No background thread to stop — just release model engine resources."""
+    if hasattr(self.model_engine, 'model') and self.model_engine.model is not None:
+        del self.model_engine.model
     del self.model_engine
+    import gc
+    gc.collect()
+    torch.cuda.empty_cache()
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/encoder_executor.py` around lines 57 - 59, shutdown currently just deletes self.model_engine; update it to perform explicit deterministic cleanup by calling any available cleanup/close method on the model engine (e.g., self.model_engine.cleanup() or self.model_engine.close() if present), then delete the attribute and clear CUDA memory with torch.cuda.empty_cache() (ensure torch is imported), and handle missing cleanup methods gracefully (use getattr or try/except) so repeated instantiation releases GPU resources reliably.
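A sketch of the getattr-based variant the prompt asks for; it assumes the engine may expose a `cleanup()` or `close()` hook, which the current code does not guarantee:

```python
import gc

import torch


def shutdown(self) -> None:
    """Release model engine resources deterministically."""
    engine = getattr(self, "model_engine", None)
    if engine is not None:
        # Prefer an explicit cleanup hook when the engine provides one.
        for hook_name in ("cleanup", "close"):
            hook = getattr(engine, hook_name, None)
            if callable(hook):
                hook()
                break
        del self.model_engine
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```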
33-35: Add type hints for `__init__` parameters.
The parameters `model_engine` and `dist` lack type annotations. Per coding guidelines, all function parameters should have static type annotations.
♻️ Suggested fix
```diff
+from .model_engine import PyTorchModelEngine
+from ..distributed import Distributed
+
 class EncoderExecutor:
     ...
-    def __init__(self, model_engine, dist):
+    def __init__(self, model_engine: PyTorchModelEngine, dist: Distributed):
         self.model_engine = model_engine
         self.dist = dist
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/pyexecutor/encoder_executor.py` around lines 33 - 35, Add static type annotations to the __init__ parameters: import Any from typing and change the signature to def __init__(self, model_engine: Any, dist: Any) -> None so both model_engine and dist are typed and the constructor return type is explicit; update any relevant imports if needed and keep the parameter names (model_engine, dist) as-is to match the existing usage in the class.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/pyexecutor/model_engine.py`:
- Around line 3648-3649: The method signature for _prepare_encoder_inputs
violates Flake8 E128 due to under-indented continuation lines; fix by aligning
the continued parameter lines consistently—either place all parameters on the
same line or use a hanging indent where continuation lines are indented to align
under the opening parenthesis (e.g., 8 spaces from the left margin for
continuation) so the def _prepare_encoder_inputs(self, inputs: Dict[str, Any])
-> Dict[str, Any]: signature and its wrapped parameters follow PEP8/4-space
indentation rules and resolve E128.
- Around line 3663-3701: The model_inputs dict currently spreads **inputs
(including internal keys like 'seq_lens') which can cause TypeError for models
whose forward() doesn't accept these kwargs and also lacks input validation;
update the block that builds model_inputs in model_engine.py (around
model_inputs, input_ids_cuda, position_ids_cuda, and attn_metadata set-up) to:
1) explicitly remove/filter internal keys such as 'seq_lens', 'position_ids',
'token_type_ids', 'attention_mask' (and any other internal-only keys) from
inputs before merging so only valid model args are forwarded; 2) add validation
checks using seq_lens and batch_size to assert batch_size <= self.batch_size,
sum(seq_lens) == num_tokens, and position_ids.shape[0] == num_tokens (raise
clear ValueError messages); and 3) ensure position_ids is the correct
dtype/shape before copying to self.position_ids_cuda so the subsequent
attn_metadata and model_inputs construction use the sanitized inputs.
In `@tensorrt_llm/llmapi/llm.py`:
- Around line 738-740: The docstring advertises support for inputs_embeds but
the current flow strips it via _RESERVED_KEYS and always materializes input_ids
before calling batch_forward (see methods batch_forward and the forward
wrapper), so inputs_embeds is effectively unusable; either implement an
inputs_embeds execution branch that checks for "inputs_embeds" in model_kwargs,
skips materializing input_ids, preserves inputs_embeds through _RESERVED_KEYS to
the model, and passes it into batch_forward/forward path (including handling
attention_mask/position ids), or remove "inputs_embeds" from the public
docstrings and any places it’s removed by _RESERVED_KEYS; update both the
forward wrapper around batch_forward and the analogous code block referenced at
lines ~819-826 to match the chosen approach.
- Around line 748-760: In encode(), before dereferencing
self._encoder_executor.model_engine, check whether the encoder executor has been
shut down (e.g., if self._encoder_executor is None or has a closed flag) and
raise the same RuntimeError used for shutdown instead of letting an
AttributeError bubble up; update the beginning of the encode() method (which
already checks self._encoder_only) to verify the executor's presence and raise a
clear RuntimeError like "LLM has been shut down" if missing, so encode()
consistently errors on closed encoder-only LLMs.
In `@tests/unittest/llmapi/test_llm_encode.py`:
- Around line 119-131: The HF model is being moved to tllm_logits.device which
ended up being CPU; ensure the HuggingFace inference runs on the same CUDA
device as the TRT-LLM: compute a device variable (e.g., device =
tllm_logits.device if tllm_logits.device.type == "cuda" else
torch.device("cuda") if torch.cuda.is_available() else tllm_logits.device), then
call hf_model = hf_model.half().to(device) and move tokenizer outputs with
inputs = tokenizer(...).to(device) so AutoModelForSequenceClassification
(hf_model) and its inputs run on CUDA for a fair comparison with tllm_logits and
PROMPTS.
---
Duplicate comments:
In `@tensorrt_llm/llmapi/llm.py`:
- Around line 753-758: In the encode path normalize inputs without indexing
inputs[0] prematurely: treat any non-sequence (or str/bytes) as a single prompt
and wrap it in a list, treat empty sequences as an empty batch (do not access
inputs[0]), and only peek the first element when len(inputs) > 0 to detect an
unbatched numeric input; update the logic in the function/method handling the
inputs variable (the block that currently checks isinstance(inputs, list) and
inputs[0]) to use collections.abc.Sequence checks, guard against str/bytes, and
check len(inputs) before accessing index 0.
---
Nitpick comments:
In `@tensorrt_llm/_torch/pyexecutor/encoder_executor.py`:
- Around line 57-59: shutdown currently just deletes self.model_engine; update
it to perform explicit deterministic cleanup by calling any available
cleanup/close method on the model engine (e.g., self.model_engine.cleanup() or
self.model_engine.close() if present), then delete the attribute and clear CUDA
memory with torch.cuda.empty_cache() (ensure torch is imported), and handle
missing cleanup methods gracefully (use getattr or try/except) so repeated
instantiation releases GPU resources reliably.
- Around line 33-35: Add static type annotations to the __init__ parameters:
import Any from typing and change the signature to def __init__(self,
model_engine: Any, dist: Any) -> None so both model_engine and dist are typed
and the constructor return type is explicit; update any relevant imports if
needed and keep the parameter names (model_engine, dist) as-is to match the
existing usage in the class.
In `@tensorrt_llm/_torch/pyexecutor/py_executor_creator.py`:
- Around line 228-231: The create_encoder_executor function is missing a return
type annotation; update its signature to include the correct return type (e.g.,
the executor class/type returned by create_encoder_executor) using a string
literal annotation to avoid circular imports or alternatively import the return
type at module top if safe; locate the function create_encoder_executor in
py_executor_creator.py and add the string-based return type annotation matching
the executor class name used elsewhere in this module.
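For the model_engine.py prompt above, a hypothetical sketch of the key-filtering and validation steps (names such as `seq_lens` and `max_batch_size` follow the prompt's wording; this is not the shipped `_prepare_encoder_inputs`):

```python
from typing import Any, Dict

import torch

# Internal-only keys that must not be forwarded to model.forward().
_INTERNAL_KEYS = ("seq_lens", "position_ids", "token_type_ids", "attention_mask")


def validate_encoder_inputs(inputs: Dict[str, Any], max_batch_size: int) -> None:
    """Raise clear ValueErrors before any tensors are copied to the GPU."""
    seq_lens = inputs["seq_lens"]
    num_tokens = sum(seq_lens)
    if len(seq_lens) > max_batch_size:
        raise ValueError(
            f"batch size {len(seq_lens)} exceeds maximum {max_batch_size}")
    position_ids: torch.Tensor = inputs["position_ids"]
    if position_ids.shape[0] != num_tokens:
        raise ValueError(
            f"position_ids has {position_ids.shape[0]} rows, expected {num_tokens}")


def filter_model_kwargs(inputs: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only kwargs that every model's forward() can safely accept."""
    return {k: v for k, v in inputs.items() if k not in _INTERNAL_KEYS}
```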
📒 Files selected for processing (6)
- tensorrt_llm/_torch/pyexecutor/encoder_executor.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
- tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/llmapi/llm_args.py
- tests/unittest/llmapi/test_llm_encode.py
Superjomn left a comment
LGTM on the llmapi changes.
/bot run
👎 Promotion blocked, new vulnerability found
Vulnerability report
/bot run
/bot run
/bot run
PR_Github #44047 [ run ] triggered by Bot. Commit:
PR_Github #45260 [ run ] completed with state
/bot run
PR_Github #45417 [ run ] triggered by Bot. Commit:
8b74a57 to 285ba46
/bot kill
/bot run --disable-fail-fast
PR_Github #45427 [ kill ] triggered by Bot. Commit:
PR_Github #45428 [ run ] triggered by Bot. Commit:
PR_Github #45427 [ kill ] completed with state
PR_Github #45428 [ run ] completed with state
Signed-off-by: tingyangk <tingyangk@nvidia.com>
285ba46 to 38dac93
/bot run --disable-fail-fast
PR_Github #45556 [ run ] triggered by Bot. Commit:
PR_Github #45556 [ run ] completed with state
/bot run
PR_Github #45597 [ run ] triggered by Bot. Commit:
PR_Github #45597 [ run ] completed with state
/bot run
PR_Github #45652 [ run ] triggered by Bot. Commit:
PR_Github #45652 [ run ] completed with state
Summary by CodeRabbit
New Features
- `encode_only` parameter for models like BERT
- `encode()` API method for efficient encode-only inference
- `EncoderOutput` dataclass containing logits and tokenized prompts

Tests
Summary
Adds a dedicated `llm.encode()` API for encode-only paths that bypasses the decoder-oriented PyExecutor loop entirely. Works for encoder models and decoder models running a "single-prefill" path.

Problem
The current LLM API routes encoder models through the same `PyExecutor` designed for autoregressive decoders, introducing significant CPU overhead per batch from the scheduler, KV cache management, sampling, and request state machine, none of which apply to encoders. Encoder models need a simple, direct path to the model's forward call with batch inference executed in a single pass.

Solution
A new execution path (`encode_only=True`) that creates a lightweight `EncoderExecutor` instead of the full `PyExecutor`. The `encode()` method tokenizes, packs, and runs a single forward pass directly through `ModelEngine.encoder_forward()`, returning `EncoderOutput` with logits. This new API demonstrates a 3.92× speedup for the BERT 110M model (textattack/bert-base-uncased-yelp-polarity) in eager mode with batch size 10.

Usage
- `encode_only=True` must be explicitly set. Default (None) uses the old `generate()` path.
- `encode_only=True` creates only `EncoderExecutor`; False/None creates only `PyExecutor`. Mutually exclusive.
- `generate()`/`generate_async()` raise `RuntimeError` when `encode_only=True`; `encode()` is the only API.
- `llm.encode()` reuses `PyTorchModelEngine` and its `_forward_step()` path, so features like `TorchCompileConfig` are compatible (see the usage sketch after this list).
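A minimal sketch of the flow described above; the prompts are illustrative, and it assumes a batched `encode()` returns one `EncoderOutput` per prompt:

```python
from tensorrt_llm import LLM

# encode_only=True must be set explicitly; it builds the lightweight
# EncoderExecutor instead of the full PyExecutor.
llm = LLM(model="textattack/bert-base-uncased-yelp-polarity",
          encode_only=True)

outputs = llm.encode([
    "The food was amazing!",
    "Service was painfully slow.",
])
for out in outputs:
    # Each EncoderOutput carries logits plus the tokenized prompt.
    print(out.prompt, out.logits.shape)
```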
Future Works

- `EncoderExecutor` generalization (`AttentionMetadata`, etc.)
- `tests/unittest/llmapi/test_llm_encode.py`: 11 new tests covering `BertForSequenceClassification`
- Existing tests (`encode_only=True` unset → old path): `tests/integration/defs/test_e2e.py::test_ptp_quickstart_bert`, `tests/unittest/llmapi/test_llm_pytorch.py::test_llm_reward_model`

CC: @symphonylyh @amukkara @nvrohanv @schetlur-nv @juney-nvidia
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.