[None][fix] Fix early_stopping type and plumb through Triton ensemble…#13692
Conversation
/bot run --disable-fail-fast
📝 Walkthrough
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 4 passed, ❌ 1 failed (1 warning)
Actionable comments posted: 1
Inline comments:
In `@triton_backend/inflight_batcher_llm/tests/utilsTest.cpp`:
- Around line 385-388: The inline comment for the early_stopping test is
incorrect: update the comment near the pushTensor<int32_t>(...,
InputFieldsNames::earlyStopping, nvinfer1::DataType::kINT32, {1}, {2}) call to
state that 2 means "stop only when all beams emit <eos>" (HuggingFace tri-state:
0=heuristic, 1=fast, 2=stop-only-when-all-beams-emit-eos) instead of `"never"`,
keeping the rest of the test unchanged.
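The tri-state the reviewer cites can be captured as a small enum. This is an illustration only, not code from the PR; the class and member names are invented here, while the numeric values and their meanings come from the review comment above:

```python
from enum import IntEnum

# Illustrative names for the HuggingFace tri-state the review comment
# describes; only the 0/1/2 values and their semantics come from the PR.
class EarlyStopping(IntEnum):
    HEURISTIC = 0       # 0 = heuristic
    FAST = 1            # 1 = fast
    ALL_BEAMS_EOS = 2   # 2 = stop only when all beams emit <eos>

print(EarlyStopping(2).name)  # ALL_BEAMS_EOS
```

An int-backed enum like this round-trips cleanly through the INT32 wire type that the PR switches to, which a plain bool cannot do for value 2.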
ℹ️ Review info
Configuration used: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7dac8a54-a694-4bec-b8c3-01e482633db8
📒 Files selected for processing (10)
- triton_backend/all_models/disaggregated_serving/disaggregated_serving_bls/config.pbtxt
- triton_backend/all_models/gpt/ensemble/config.pbtxt
- triton_backend/all_models/gpt/tensorrt_llm/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt
- triton_backend/all_models/multimodal/ensemble/config.pbtxt
- triton_backend/inflight_batcher_llm/tests/utilsTest.cpp
PR_Github #46518 [ run ] triggered by Bot.
PR_Github #46518 [ run ] completed with state
Force-pushed 83f951b to e72c92c
… + BLS

After PR NVIDIA#13633 plumbed promptIgnoreLength correctly, early_stopping is the only sampling field that has two remaining issues:

1. Type mismatch between the Triton config declaration (TYPE_BOOL, 1 byte) and the C++ extraction in getSamplingConfigFromTensors (extractOptionalSingleton<int32_t>, reads 4 bytes). The mismatch works accidentally for {0, 1} (adjacent memory is zero), but cannot represent the executor's documented value 2 ("stop only when all beams emit <eos>"), and is undefined behavior in principle.
2. early_stopping is missing entirely from the ensemble + BLS configs in five places, so clients hitting Triton via `ensemble`, `tensorrt_llm_bls`, `multimodal/ensemble`, or `gpt/ensemble` cannot set early_stopping at all. This pre-dates PR NVIDIA#8127.

This change fixes both issues.

Type fix (BREAKING for clients sending early_stopping as bool):
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt: data_type TYPE_BOOL -> TYPE_INT32
- triton_backend/all_models/disaggregated_serving/disaggregated_serving_bls/config.pbtxt: same.

This aligns the wire-protocol declaration with executor::SamplingConfig semantics (std::optional<SizeType32> accepting 0/1/2). Clients previously sending numpy bool must now send numpy int32; behavior for values 0 and 1 is preserved.

Ensemble + BLS plumbing (additive, no compat impact):
- triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt: declare optional INT32 early_stopping input + add input_map block forwarding it to the tensorrt_llm step.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt: declare optional INT32 early_stopping input.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py: add early_stopping field to the Request dataclass.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py: add early_stopping to both input_names lists and the BLS->engine name mapping.
- triton_backend/all_models/multimodal/ensemble/config.pbtxt: declare + input_map.
- triton_backend/all_models/gpt/ensemble/config.pbtxt: declare + input_map.
- triton_backend/all_models/gpt/tensorrt_llm/config.pbtxt: declare (was missing entirely).

The Python tensorrt_llm/1/model.py already forwards early_stopping to trtllm.SamplingConfig kwargs; only the wire-protocol declaration was wrong.

Verified end-to-end on TinyLlama-1.1B with beam_width=4, prompt "Hello world. Goodbye." for all three Triton entry points:

Path 1 (direct tensorrt_llm):
early_stopping=0 -> beam_lens=[60, 60, 54, 60]
early_stopping=1 -> beam_lens=[3, 0, 2, 1]

Path 2 (ensemble):
early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
early_stopping=1 -> "Hello world." (3 tokens)

Path 3 (tensorrt_llm_bls):
early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
early_stopping=1 -> "Hello world." (3 tokens)

All three paths now honor early_stopping correctly.

Signed-off-by: Jhao-Ting Chen <jtchen0528@gmail.com>
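The ensemble-side "declare + input_map" plumbing described in the commit message amounts to a fragment of roughly this shape. This is a hedged sketch following Triton's config.pbtxt conventions, not the PR's diff: the tensor name `early_stopping`, the TYPE_INT32 type, the `optional` flag, and the `tensorrt_llm` step name come from the commit message, while `dims: [ 1 ]` and the surrounding structure are assumptions.

```
# In the ensemble config.pbtxt: declare the optional input...
input [
  {
    name: "early_stopping"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  }
]
# ...and forward it to the tensorrt_llm step:
ensemble_scheduling {
  step [
    {
      model_name: "tensorrt_llm"
      input_map {
        key: "early_stopping"
        value: "early_stopping"
      }
    }
  ]
}
```

Because the input is `optional: true`, clients that never send early_stopping are unaffected, which is why the plumbing half of the change is additive.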
Force-pushed e72c92c to 37311ba
/bot run

PR_Github #46527 [ run ] triggered by Bot.
PR_Github #46527 [ run ] completed with state
Summary by CodeRabbit
New Features
- `early_stopping` input parameter to language model configurations across disaggregated serving, GPT ensembles, inflight batcher, and multimodal models.

Tests
Description
After PR #13633 plumbed promptIgnoreLength correctly, early_stopping is the only sampling field that has two remaining issues:
Type mismatch between the Triton config declaration (TYPE_BOOL, 1 byte) and the C++ extraction in getSamplingConfigFromTensors (extractOptionalSingleton<int32_t>, reads 4 bytes). The mismatch works accidentally for {0, 1} (adjacent memory is zero), but cannot represent the executor's documented value 2 ("stop only when all beams emit <eos>"), and is undefined behavior in principle.
early_stopping is missing entirely from the ensemble + BLS configs in five places, so clients hitting Triton via `ensemble`, `tensorrt_llm_bls`, `multimodal/ensemble`, or `gpt/ensemble` cannot set early_stopping at all. This pre-dates PR #8127 ([None][feat] Support ignored prompt length for penalties via new sampling config parameter).

Verified end-to-end on TinyLlama-1.1B with beam_width=4, prompt "Hello world. Goodbye." for all three Triton entry points:
Path 1 (direct tensorrt_llm):
early_stopping=0 -> beam_lens=[60, 60, 54, 60]
early_stopping=1 -> beam_lens=[3, 0, 2, 1]
Path 2 (ensemble):
early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
early_stopping=1 -> "Hello world." (3 tokens)
Path 3 (tensorrt_llm_bls):
early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
early_stopping=1 -> "Hello world." (3 tokens)
All three paths now honor early_stopping correctly.
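The byte-width hazard behind issue 1 can be reproduced with numpy alone. This is a standalone illustration, not the backend code: `np.frombuffer` stands in for the C++ `extractOptionalSingleton<int32_t>` read, and the appended bytes model whatever happens to sit next to the 1-byte bool payload in memory (little-endian host assumed):

```python
import numpy as np

# A TYPE_BOOL tensor serializes 1 byte per element; the backend reads 4.
bool_wire = np.array([True], dtype=np.bool_).tobytes()   # b'\x01'
int32_wire = np.array([2], dtype=np.int32).tobytes()     # 4 bytes

# Reading int32 from the 1-byte payload only "works" when the three
# adjacent bytes happen to be zero (little-endian):
lucky = np.frombuffer(bool_wire + b"\x00\x00\x00", dtype=np.int32)[0]  # 1
dirty = np.frombuffer(bool_wire + b"\x07\x00\x00", dtype=np.int32)[0]  # 1793

# And a bool can never carry the executor's documented value 2;
# it collapses to True (1) on construction:
collapsed = np.array([2], dtype=np.bool_)[0]
```

Declaring the input as TYPE_INT32 makes the wire payload 4 bytes, so the backend's int32 read is exact and the value 2 survives the round trip.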
Test Coverage
Setup: TinyLlama-1.1B-Chat engine (cached from session 1) served via the rebuilt `libtriton_tensorrtllm.so` from PR #13633 inside `tekit_source:260429-tritondevel`. Full Triton repo built from `triton_backend/all_models/inflight_batcher_llm/*` with all 5 sub-models loaded (preprocessing, postprocessing, ensemble, tensorrt_llm, tensorrt_llm_bls).

Test: prompt `"Hello world. Goodbye."`, `beam_width=4`, `max_tokens=60`, `len_penalty=1.0`. Vary `early_stopping` between `0` (never) and `1` (stop on worst-beam EOS). Send to all three model paths.

| Model | early_stopping=0 | early_stopping=1 |
| --- | --- | --- |
| tensorrt_llm | beam_lens=[60, 60, 54, 60] (no early stop) | beam_lens=[3, 0, 2, 1] (early stop) |
| ensemble | 'Hello world. Goodbye. Hello world. Goodbye. Hello world. Goodbye...' | 'Hello world.' |
| tensorrt_llm_bls | 'Hello world. Goodbye. Hello world. Goodbye. Hello world. Goodbye...' | 'Hello world.' |

All three Triton entry points now honor `early_stopping` correctly.

PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.