[TRTLLM-7728][perf] improve batched sampling perf for contiguous batches #7908
Conversation
/bot run |
PR_Github #19584 [ run ] triggered by Bot |
PR_Github #19584 [ run ] completed with state |
/bot run |
PR_Github #19589 [ run ] triggered by Bot |
PR_Github #19589 [ run ] completed with state |
/bot run --disable-fail-fast |
PR_Github #19594 [ run ] triggered by Bot |
fb76f2a
to
96457be
Compare
/bot run |
PR_Github #19599 [ run ] triggered by Bot |
PR_Github #19594 [ run ] completed with state |
/bot kill |
PR_Github #19616 [ kill ] triggered by Bot |
PR_Github #19599 [ run ] completed with state |
/bot run |
PR_Github #19616 [ kill ] completed with state |
PR_Github #19618 [ run ] triggered by Bot |
📝 Walkthrough
The changes remove the enable_mixed_sampler option throughout configs, executors, scaffolding, and tests; overhaul the PyTorch sampler to a strategy-driven, batched dispatch; adjust speculative drafting flow and gating; add a load_format field to decoding configs; tweak RMSNorm residual handling via a sentinel; set temperature=0 in some eval paths; and bump the Python requirement to >=3.10.
Changes
Sequence Diagram(s)
sequenceDiagram
autonumber
actor Client
participant PyExecutor as PyExecutor
participant Sampler as Sampler
participant Strategy as Strategy Detector
participant ImplTopK as TopK Impl
participant ImplTopP as TopP Impl
participant ImplTopKTopP as TopK+TopP Impl
participant ImplGreedy as Greedy Impl
Client->>PyExecutor: decode step (batch of requests, logits)
PyExecutor->>Sampler: sample(logits, requests, generator)
Sampler->>Strategy: determine per-request strategy
Strategy-->>Sampler: grouped indices by {TopK, TopP, TopKTopP, Greedy}
rect rgb(245,248,255)
note right of Sampler: Batched dispatch per strategy
Sampler->>ImplTopK: top_k_sampling_batch(logits[TopK])
ImplTopK-->>Sampler: tokens, softmax
Sampler->>ImplTopP: top_p_sampling_batch(logits[TopP])
ImplTopP-->>Sampler: tokens, softmax
Sampler->>ImplTopKTopP: top_k_top_p_sampling_batch(logits[TopKTopP])
ImplTopKTopP-->>Sampler: tokens, softmax
Sampler->>ImplGreedy: greedy_search_sampling_batch(logits[Greedy])
ImplGreedy-->>Sampler: tokens, softmax
end
Sampler-->>PyExecutor: merged tokens, softmax (batched mapping)
PyExecutor-->>Client: next tokens (+ optional probs)
sequenceDiagram
autonumber
actor Client
participant Creator as PyExecutor Creator
participant Drafter as ModelDrafter
participant Sampler as Sampler
participant LlmReq as LlmRequest
Client->>Creator: create executor (configs)
Creator->>Creator: check attn_backend == "TRTLLM"
Creator->>Creator: check guided_decoding_config is None
Creator->>Creator: check draft_config._allow_greedy_draft_tokens
alt conditions met
Creator-->>Drafter: enable chain-drafter
else
Creator-->>Drafter: chain-drafter disabled
end
note over Drafter: Draft step
Drafter->>LlmReq: build request(exclude_last_generation_logits=True,<br/>return_generation_logits=Sampler.should_provide_draft_probs(...))
Drafter->>Sampler: update_requests(...) and sample(...)
Sampler-->>Drafter: draft tokens (+ logits if requested)
Drafter-->>Client: processed draft results
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~70 minutes
Suggested reviewers
Pre-merge checks and finishing touches
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests
Tip
👮 Agentic pre-merge checks are now available in preview! Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs. Please see the documentation for more information. Example:
reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).
Actionable comments posted: 0
🧹 Nitpick comments (2)
setup.py (1)
263-263: Confirm the Python >=3.10 update — the codebase uses 3.10+ features; address the remaining 3.12-gated code.
The repository already uses Python 3.10+ syntax (match/case, PEP 604 X | Y unions, builtin generics), so the bump to >=3.10 is consistent. I found explicit 3.12-gated checks/imports and test skips that need action:
- tensorrt_llm/_torch/pyexecutor/sampler.py — conditional from typing import override guarded by sys.version_info[:2] >= (3, 12).
- tests/unittest/_torch/thop/parallel/test_fp4_linear.py — @pytest.mark.skipif(sys.version_info < (3, 12)) (cutlass-dsl dependency).
- tests/unittest/test_pip_install.py — asserts the Python minor version is in [10, 12].
Either raise python_requires to >=3.12 or backport/adjust these locations (e.g., import override from typing_extensions, and relax/adjust the test skips and version assertions).
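A minimal sketch of the suggested backport (an assumption about how it could look; it presumes typing_extensions is available as a dependency, which ships override for pre-3.12 interpreters):

import sys

if sys.version_info[:2] >= (3, 12):
    from typing import override
else:
    # Backport for Python 3.10/3.11: typing_extensions provides the same decorator.
    from typing_extensions import override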
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
455-511: torch_multi_arange implementation is clever and well-documented.
This efficient implementation for computing multiple aranges in a single operation is well thought out. The algorithm explanation and handling of edge cases (invalid ranges) are thorough.
Consider adding a simple usage example in the docstring to make the function's purpose immediately clear:
""" Example: >>> ends = torch.tensor([3, 5, 2]) >>> starts = torch.tensor([0, 2, 1]) >>> result = torch_multi_arange(ends, starts=starts) >>> # result is torch.tensor([0, 1, 2, 2, 3, 4, 1]) """
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (18)
setup.py (1 hunks)
tensorrt_llm/_torch/auto_deploy/llm_args.py (0 hunks)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (0 hunks)
tensorrt_llm/_torch/modules/rms_norm.py (5 hunks)
tensorrt_llm/_torch/pyexecutor/_util.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/config.py (0 hunks)
tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1 hunks)
tensorrt_llm/_torch/pyexecutor/sampler.py (22 hunks)
tensorrt_llm/_torch/speculative/eagle3.py (2 hunks)
tensorrt_llm/_torch/speculative/model_drafter.py (6 hunks)
tensorrt_llm/evaluate/json_mode_eval.py (1 hunks)
tensorrt_llm/evaluate/mmlu.py (1 hunks)
tensorrt_llm/llmapi/llm_args.py (1 hunks)
tensorrt_llm/scaffolding/worker.py (0 hunks)
tests/unittest/api_stability/references/llm.yaml (0 hunks)
tests/unittest/llmapi/apps/_test_openai_misc.py (1 hunks)
tests/unittest/llmapi/test_llm_pytorch.py (2 hunks)
💤 Files with no reviewable changes (5)
- tensorrt_llm/_torch/auto_deploy/llm_args.py
- tensorrt_llm/_torch/pyexecutor/config.py
- tests/unittest/api_stability/references/llm.yaml
- tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
- tensorrt_llm/scaffolding/worker.py
🧰 Additional context used
📓 Path-based instructions (3)
**/*.{h,hpp,hh,hxx,cpp,cxx,cc,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Use only spaces, no tabs; indent with 4 spaces.
Files:
tensorrt_llm/_torch/speculative/eagle3.py
tensorrt_llm/evaluate/json_mode_eval.py
tensorrt_llm/evaluate/mmlu.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/llmapi/apps/_test_openai_misc.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/pyexecutor/_util.py
tests/unittest/llmapi/test_llm_pytorch.py
tensorrt_llm/_torch/speculative/model_drafter.py
setup.py
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py
: Python code must target Python 3.8+.
Indent Python code with 4 spaces; do not use tabs.
Maintain module namespace when importing; prefer 'from package.subpackage import foo' then 'foo.SomeClass()' instead of importing the class directly.
Python filenames should be snake_case (e.g., some_file.py).
Python classes use PascalCase names.
Functions and methods use snake_case names.
Local variables use snake_case; prefix 'k' for variables that start with a number (e.g., k_99th_percentile).
Global variables use upper SNAKE_CASE prefixed with 'G' (e.g., G_MY_GLOBAL).
Constants use upper SNAKE_CASE (e.g., MY_CONSTANT).
Avoid shadowing variables from an outer scope.
Initialize all externally visible members of a class in the constructor.
Prefer docstrings for interfaces that may be used outside a file; comments for in-function or file-local interfaces.
Use Google-style docstrings for classes and functions (Sphinx-parsable).
Document attributes and variables inline so they render under the class/function docstring.
Avoid reflection when a simpler, explicit approach suffices (e.g., avoid dict(**locals()) patterns).
In try/except, catch the most specific exceptions possible.
For duck-typing try/except, keep the try body minimal and use else for the main logic.
Files:
tensorrt_llm/_torch/speculative/eagle3.py
tensorrt_llm/evaluate/json_mode_eval.py
tensorrt_llm/evaluate/mmlu.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/llmapi/apps/_test_openai_misc.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/pyexecutor/_util.py
tests/unittest/llmapi/test_llm_pytorch.py
tensorrt_llm/_torch/speculative/model_drafter.py
setup.py
tensorrt_llm/_torch/pyexecutor/sampler.py
**/*.{cpp,cxx,cc,h,hpp,hh,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Prepend the NVIDIA Apache-2.0 copyright header with current year to the top of all source files (e.g., .cpp, .h, .cu, .py).
Files:
tensorrt_llm/_torch/speculative/eagle3.py
tensorrt_llm/evaluate/json_mode_eval.py
tensorrt_llm/evaluate/mmlu.py
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tensorrt_llm/_torch/pyexecutor/llm_request.py
tests/unittest/llmapi/apps/_test_openai_misc.py
tensorrt_llm/llmapi/llm_args.py
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/pyexecutor/_util.py
tests/unittest/llmapi/test_llm_pytorch.py
tensorrt_llm/_torch/speculative/model_drafter.py
setup.py
tensorrt_llm/_torch/pyexecutor/sampler.py
🧠 Learnings (2)
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
PR: NVIDIA/TensorRT-LLM#7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
📚 Learning: 2025-08-27T14:41:56.665Z
Learnt from: ixlmar
PR: NVIDIA/TensorRT-LLM#7294
File: tensorrt_llm/_torch/modules/rms_norm.py:96-99
Timestamp: 2025-08-27T14:41:56.665Z
Learning: In tensorrt_llm/_torch/modules/rms_norm.py, the RMSNorm class uses a custom sentinel (_ARGUMENT_NOT_SPECIFIED_SENTINEL) instead of Ellipsis (...) for detecting unspecified optional arguments. Other modules in the codebase may use Ellipsis as a sentinel but do not forward it to RMSNorm methods, so there's no need for backward compatibility with Ellipsis in RMSNorm.
Applied to files:
tensorrt_llm/_torch/modules/rms_norm.py
🧬 Code graph analysis (7)
tensorrt_llm/_torch/speculative/eagle3.py (1)
tensorrt_llm/llmapi/llm_args.py (1)
EagleDecodingConfig (442-480)
tensorrt_llm/evaluate/json_mode_eval.py (1)
tensorrt_llm/sampling_params.py (1)
GuidedDecodingParams (14-36)
tests/unittest/llmapi/apps/_test_openai_misc.py (1)
tests/unittest/llmapi/apps/_test_openai_chat.py (2)
client (82-83)
model_name (21-22)
tensorrt_llm/_torch/modules/rms_norm.py (5)
tensorrt_llm/_torch/custom_ops/flashinfer_custom_ops.py (2)
flashinfer_fused_add_rmsnorm (32-35)
flashinfer_rmsnorm (21-23)
tensorrt_llm/_torch/models/modeling_utils.py (1)
skip_forward (148-159)
tensorrt_llm/_torch/modules/embedding.py (2)
skip_forward (117-125)
skip_forward (259-266)
tensorrt_llm/_torch/modules/layer_norm.py (1)
skip_forward (102-120)
tensorrt_llm/_torch/modules/decoder_layer.py (1)
skip_forward (23-34)
tests/unittest/llmapi/test_llm_pytorch.py (1)
tests/unittest/llmapi/test_llm.py (1)
llm_test_harness
(109-136)
tensorrt_llm/_torch/speculative/model_drafter.py (5)
tensorrt_llm/_torch/pyexecutor/sampler.py (21)
Sampler (58-89)
SampleState (49-55)
SampleStateTensors (40-45)
should_provide_draft_probs (87-89, 1522-1524)
beam_width (78-81)
is_generation_model (84-85, 115-116, 168-169, 759-760, 1558-1559)
sample_async (69-71, 99-102, 140-147, 986-1009, 1711-1799)
update_requests (74-75, 105-112, 150-165, 943-970, 1803-1816)
tensorrt_llm/_torch/pyexecutor/scheduler.py (2)
ScheduledRequests (18-39)
all_requests (38-39)
tensorrt_llm/llmapi/llm_args.py (1)
DecodingBaseConfig (351-410)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
LlmRequest (282-425)
append (79-98, 125-142)
generation_logits (214-223)
tensorrt_llm/_torch/pyexecutor/handle_logits.py (1)
HandleLogits (11-81)
tensorrt_llm/_torch/pyexecutor/sampler.py (4)
tensorrt_llm/_torch/pyexecutor/llm_request.py (4)
LlmRequest (282-425)
append (79-98, 125-142)
get_draft_token_length (561-572)
tensorrt_llm/_torch/speculative/model_drafter.py (1)
sample_async (286-306)
tensorrt_llm/_torch/speculative/mtp.py (1)
sample_async (276-323)
tensorrt_llm/_torch/pyexecutor/scheduler.py (3)
ScheduledRequests (18-39)
batch_size (35-36)
all_requests (38-39)
🪛 Ruff (0.13.1)
tensorrt_llm/_torch/speculative/model_drafter.py
59-59: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/pyexecutor/sampler.py
87-87: Unused method argument: request
(ARG002)
742-742: Avoid specifying long messages outside the exception class
(TRY003)
751-751: Avoid specifying long messages outside the exception class
(TRY003)
1135-1135: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (33)
tensorrt_llm/_torch/modules/rms_norm.py (7)
17-18: LGTM! The imports for EllipsisType and expanded typing support are correctly added to support the sentinel-based argument handling pattern.
28-30: LGTM! The sentinel pattern is well implemented, using ellipsis (...) as the sentinel value with proper type aliasing. This aligns with the retrieved learning that this module uses a custom sentinel instead of relying on None for optional arguments.
55-63: LGTM! The parameter type update and sentinel detection logic are correctly implemented. The approach allows distinguishing between "no residual provided" (sentinel) and "residual explicitly set to None", which provides more precise API control.
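A minimal sketch of the sentinel pattern being described (hypothetical signature, not the actual RMSNorm code; the normalization itself is elided):

import torch
from types import EllipsisType

_NOT_SPECIFIED = ...  # module-level sentinel, distinct from None

def forward(hidden_states: torch.Tensor,
            residual: torch.Tensor | None | EllipsisType = _NOT_SPECIFIED):
    # "Argument omitted" (sentinel) is distinguishable from "explicitly passed None".
    residual_provided = residual is not _NOT_SPECIFIED
    if not residual_provided:
        residual = None
    out = hidden_states  # ... normalization and optional residual add elided ...
    return (out, residual) if residual_provided else out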
67-67: LGTM! The conditional checks have been correctly updated to use residual is not None instead of isinstance checks, which is more appropriate after the sentinel handling logic sets residual = None when the sentinel is detected. Also applies to: 76-76.
85-89: LGTM! The return logic correctly handles both cases - returning a tuple when a residual was originally provided, or just the hidden states when the sentinel was used. The cast is appropriate for type checker satisfaction.
93-100: LGTM! The skip_forward method correctly adopts the same sentinel-based pattern as the main forward method, maintaining API consistency. The implementation properly returns either just the hidden states or the tuple based on sentinel detection.
58-58: No cross-module inconsistency — the pattern exists only in rms_norm.py. A repository search shows _ARGUMENT_NOT_SPECIFIED_SENTINEL and the return type Union[torch.Tensor, Tuple[torch.Tensor, Optional[torch.Tensor]]] appear only in tensorrt_llm/_torch/modules/rms_norm.py (lines 58 and 96).
tests/unittest/llmapi/apps/_test_openai_misc.py (1)
97-103: LGTM! The temperature parameter aligns with deterministic evaluation expectations. The addition of temperature=0 ensures deterministic output generation during the request cancellation test, which is appropriate for testing behavior consistency. The inline FIXME comment properly documents the known issue.
tensorrt_llm/evaluate/mmlu.py (1)
222-222: LGTM! The temperature parameter ensures deterministic evaluation. Setting temperature=0 in the sampling arguments ensures deterministic and reproducible evaluation results for MMLU, which is appropriate for benchmark consistency.
tensorrt_llm/evaluate/json_mode_eval.py (1)
67-69: LGTM! The temperature parameter ensures consistent JSON generation. The addition of temperature=0 to the sampling arguments ensures deterministic JSON generation, which is appropriate for structured output evaluation where consistency is critical.
tensorrt_llm/_torch/pyexecutor/_util.py (1)
699-702: LGTM! Simplified sampler configuration by removing the mixed sampler option. The removal of the enable_mixed_sampler parameter from create_torch_sampler_args aligns with the broader PR objective to simplify sampling configuration. The function signature is cleaner and the sampler construction is more straightforward.
tensorrt_llm/llmapi/llm_args.py (1)
360-364: LGTM! New fields support improved decoding configuration. The addition of load_format as a public field and _allow_greedy_draft_tokens as a private attribute enhances decoding configuration capabilities. The private attribute properly uses PrivateAttr for internal functionality.
tensorrt_llm/_torch/speculative/eagle3.py (1)
2-2: LGTM! The type-only import reduces runtime dependencies. Adding the TYPE_CHECKING import and wrapping the EagleDecodingConfig import in the type-checking block is good practice that provides type hints without creating runtime circular dependencies. Also applies to: 18-20.
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)
369-369: LGTM! Improved type annotation for the embedding bias attribute. The explicit type annotation Optional[torch.Tensor] for the _py_embedding_bias_1d attribute improves code clarity and type safety without changing functionality.
tensorrt_llm/_torch/pyexecutor/py_executor_creator.py (1)
349-353: The simplified drafting activation logic looks good. The changes correctly remove the enable_mixed_sampler dependency and now rely solely on draft_spec_config._allow_greedy_draft_tokens for determining chain-drafter usage. This aligns well with the broader refactoring to simplify speculative decoding control flow.
tests/unittest/llmapi/test_llm_pytorch.py (1)
230-253: The test simplification looks correct. The removal of the enable_mixed_sampler and enable_logprobs parameters from the embedding bias test aligns with the overall PR changes. The test now focuses on the core embedding bias functionality without the complexity of multiple parameter combinations.
tensorrt_llm/_torch/speculative/model_drafter.py (5)
15-16: The import changes align with the new sampling architecture. The replacement of TorchSampler with the more generic Sampler, SampleState, and SampleStateTensors imports properly reflects the move to a more abstracted sampling interface.
77-78: Updated validation for greedy-only drafting. The assertion now correctly validates against spec_config._allow_greedy_draft_tokens instead of the removed sampler.enable_mixed_sampler, maintaining the requirement for greedy sampling in static draft loops.
82-96: Draft request construction properly uses the sampler's draft-probs decision. The new approach delegates the decision about requesting draft probabilities to self.sampler.should_provide_draft_probs(request), which allows for strategy-aware decision making. This is a cleaner separation of concerns.
289-303: The sampling flow now always computes prefix sums and handles logits. The removal of conditional logic simplifies the code path and makes it more predictable. This unified approach should improve maintainability.
335-336: Generation logits forwarding simplified. The direct forwarding of req.py_result.generation_logits without conditional checks aligns with the new unified sampling approach. Please verify that all callers of process_decoded_tokens handle potential None values in py_draft_logits appropriately.
tensorrt_llm/_torch/pyexecutor/sampler.py (12)
1-10: Comprehensive imports for the new batching infrastructure. The addition of dataclasses, enum, defaultdict, repeat, and other imports properly supports the new batched sampling implementation with strategy-based routing.
172-199: Well-structured batch sampling functions with clear signatures. The new top_k_sampling_batch and other sampling functions return both tokens and softmax probabilities, providing a consistent interface. The use of keyword-only arguments in top_p_sampling_batch and top_k_top_p_sampling_batch improves API clarity.
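As an illustration of the kind of batched helper being described, here is a hypothetical nucleus (top-p) sampling sketch over a [batch, vocab] logits tensor; the name, keyword-only argument, and return convention are assumptions, not the PR's code:

import torch

def top_p_sampling_batch(logits: torch.Tensor, *, top_p: float,
                         generator: torch.Generator | None = None):
    """Sample one token per row, restricted to the smallest nucleus of mass top_p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(dim=-1, descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    # Drop tokens once the cumulative mass before them exceeds top_p (keep at least one).
    mask = cumulative - sorted_probs > top_p
    sorted_probs = sorted_probs.masked_fill(mask, 0.0)
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, 1, generator=generator)
    tokens = sorted_idx.gather(-1, choice).squeeze(-1)
    return tokens, probs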
284-293: Greedy search properly handles optional softmax filtering. The greedy_search_sampling_batch function correctly implements softmax index filtering when provided, avoiding unnecessary softmax computation for non-selected indices.
346-370: Strategy detection logic correctly handles temperature=0. The _request_strategy function properly treats temperature=0 as greedy sampling regardless of top_k/top_p settings, which is the correct behavior.
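A minimal sketch of what such per-request strategy detection can look like (hypothetical name, signature, and strategy keys; the actual function lives in sampler.py):

from typing import Optional, Tuple

def request_strategy(temperature: float,
                     top_k: Optional[int],
                     top_p: Optional[float]) -> Tuple:
    """Map sampling parameters to a hashable strategy key for batched dispatch."""
    if temperature == 0:
        return ("greedy",)  # temperature 0 overrides any top_k/top_p setting
    if top_k and top_p is not None:
        return ("top_k_top_p", top_k, top_p, temperature)
    if top_k:
        return ("top_k", top_k, temperature)
    if top_p is not None:
        return ("top_p", top_p, temperature)
    return ("greedy",)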
372-386: Request grouping by strategy is efficient. The _group_requests_by_sampling_strategy function efficiently groups requests and maintains sorted indices as documented. The use of defaultdict and pin_memory options shows good attention to performance.
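For intuition, a hypothetical grouping helper could look like the following; because requests are visited in batch order, each group's index list comes out already sorted:

from collections import defaultdict

def group_requests_by_strategy(strategies):
    """Group batch positions by their sampling strategy key."""
    groups = defaultdict(list)
    for batch_index, strategy in enumerate(strategies):
        groups[strategy].append(batch_index)  # appended in ascending batch order
    return groups

# Example: three greedy requests and one top-p request.
print(group_requests_by_strategy(
    [("greedy",), ("top_p", 0.9, 1.0), ("greedy",), ("greedy",)]))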
388-421: Centralized sampling dispatcher well implemented. The sample function provides a clean interface for strategy-based sampling with proper handling of softmax filtering for non-greedy strategies. The pattern matching approach is clear and maintainable.
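A hypothetical, stripped-down version of such a match-based dispatcher (the strategy keys and helpers are assumptions carried over from the sketches above, not the actual sample implementation):

import torch

def sample(strategy, logits: torch.Tensor,
           generator: torch.Generator | None = None) -> torch.Tensor:
    """Route one strategy group's logits to the matching sampling routine."""
    match strategy:
        case ("greedy",):
            return torch.argmax(logits, dim=-1)
        case ("top_k", k, temperature):
            probs = torch.softmax(logits / temperature, dim=-1)
            topk_probs, topk_idx = probs.topk(k, dim=-1)
            choice = torch.multinomial(topk_probs, 1, generator=generator)
            return topk_idx.gather(-1, choice).squeeze(-1)
        case _:
            raise NotImplementedError(f"unsupported strategy: {strategy}")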
439-451: The BatchedSamplingResult dataclass is well designed. The frozen dataclass clearly documents the batched sampling results with appropriate field names and types. The default factory for py_draft_logits_indices is correctly used.
808-809: Generator device assertion added for safety. Good addition of the device check to ensure the generator is on the expected device.
1022-1109: Sophisticated embedding bias implementation with good performance considerations. The new _apply_embedding_bias implementation handles per-request bias tensors efficiently with proper caching of unique biases and batched application. The extensive comment about implementation trade-offs shows careful thought about performance implications.
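To make the dedup-and-batch idea concrete, here is a hypothetical sketch (not the actual _apply_embedding_bias): unique bias tensors are stacked once, and a single indexed add applies them to the rows that use them:

import torch

def apply_embedding_bias(logits: torch.Tensor, biases: list) -> torch.Tensor:
    """Add per-request vocab biases (or None) to a [batch, vocab] logits tensor."""
    unique_slot: dict = {}          # id(bias) -> row in the stacked bias matrix
    stacked, rows, slots = [], [], []
    for row, bias in enumerate(biases):
        if bias is None:
            continue                # this request has no bias
        key = id(bias)
        if key not in unique_slot:
            unique_slot[key] = len(stacked)
            stacked.append(bias)
        rows.append(row)
        slots.append(unique_slot[key])
    if not rows:
        return logits
    bias_matrix = torch.stack(stacked).to(logits.device, logits.dtype)
    logits[rows] += bias_matrix[slots]  # one batched, indexed addition
    return logits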
1449-1520: The main request processing pipeline is properly refactored. The _process_requests method now cleanly separates concerns: logit selection, embedding bias application, min length penalty, batched sampling, and result unbatching. The flow is logical and maintainable.
1521-1524: Draft probability logic correctly implements the greedy optimization. The should_provide_draft_probs override correctly avoids requesting draft probabilities for greedy sampling, which is an important optimization since greedy sampling doesn't need them.
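Reusing the hypothetical request_strategy helper sketched above, the described behavior boils down to something like this (assumed attribute names; the real override takes an LlmRequest):

def should_provide_draft_probs(self, request) -> bool:
    # Greedy drafting never consumes draft probabilities, so skip computing them.
    return request_strategy(request.temperature, request.top_k,
                            request.top_p) != ("greedy",)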
1110-1263: Complex batched sampling logic appears sound; test coverage for the indexers is missing. An rg search for PackedStepIndexer/UnpackedStepIndexer in test files returned no matches — add unit tests exercising edge cases: contiguous vs. non-contiguous indices, empty groups, mixed draft logits/d2t, and log_probs_host on/off to validate indexing/speculation behavior.
PR_Github #19618 [ run ] completed with state |
96457be
to
0e0a9b5
Compare
/bot run |
PR_Github #19783 [ run ] triggered by Bot |
PR_Github #19783 [ run ] completed with state |
0e0a9b5
to
3507535
Compare
/bot run |
PR_Github #19808 [ run ] triggered by Bot |
PR_Github #19808 [ run ] completed with state |
3507535
to
c074ad8
Compare
/bot run |
PR_Github #19821 [ run ] triggered by Bot |
PR_Github #19821 [ run ] completed with state |
c074ad8
to
b99bfae
Compare
/bot run |
PR_Github #19830 [ run ] triggered by Bot |
PR_Github #19830 [ run ] completed with state |
Signed-off-by: ixlmar <206748156+ixlmar@users.noreply.github.com>
b99bfae
to
057d988
Compare
/bot run |
PR_Github #19911 [ run ] triggered by Bot |
PR_Github #19911 [ run ] completed with state |
LGTM
Description
This small improvement generalizes the pre-existing shortcut to reduce data movement when the request sampling groups are contiguous within the batch. This is relevant, e.g., if no requests share the same sampling parameters.
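A minimal sketch of the shortcut being generalized (hypothetical helper, not the PR's code): when a strategy group's sorted batch indices form one contiguous run, the logits can be sliced as a view instead of gathered into a copy.

import torch

def gather_group(logits: torch.Tensor, indices: list) -> torch.Tensor:
    """Select the rows of one sampling-strategy group from [batch, vocab] logits."""
    first, last = indices[0], indices[-1]
    if last - first + 1 == len(indices):
        # Contiguous group: a slice is a zero-copy view, so no data movement.
        return logits[first:last + 1]
    # Non-contiguous group: fall back to an explicit (copying) gather.
    return logits[torch.tensor(indices, device=logits.device)]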
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...
Provide a user friendly way for developers to interact with a Jenkins server.
Run /bot [-h|--help] to print this help message. See details below for each supported subcommand.
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]
Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL): Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL): Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL): Force run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
--post-merge (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.
kill
kill
Kill all running builds associated with pull request.
skip
skip --comment COMMENT
Skip testing for latest commit on pull request.
--comment "Reason for skipping build/test"
is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit