[None][fix] Plumb promptIgnoreLength through Triton backend to fix silently-dropped lengthPenalty and earlyStopping #13633
Conversation
@CodeRabbit title
[None][fix] Plumb promptIgnoreLength through Triton backend to fix silently-dropped lengthPenalty and earlyStopping
Force-pushed from 149456d to 7eadbad
/bot run --disable-fail-fast
📝 Walkthrough: This change introduces support for a new optional `prompt_ignore_length` parameter in inference requests, plumbed from the Triton model configs through to `executor::SamplingConfig`.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning).
✅ Passed checks (4 passed)
🧹 Nitpick comments (1)
triton_backend/inflight_batcher_llm/tests/utilsTest.cpp (1)
391-391: ⚡ Quick win: Add one omitted-field regression test.
`getRequest()` now always injects `promptIgnoreLength`, so this test no longer exercises the backward-compatible path where the tensor is absent and `SamplingConfig::promptIgnoreLength` remains unset. If you'd like, I can draft the extra regression test.
Also applies to: 589-589
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@triton_backend/inflight_batcher_llm/tests/utilsTest.cpp` at line 391, The test currently always injects InputFieldsNames::promptIgnoreLength via pushTensor which prevents exercising the backward-compatible path; add an extra regression test in utilsTest.cpp that builds inputsTensors without calling pushTensor for InputFieldsNames::promptIgnoreLength, call getRequest() and assert that the resulting SamplingConfig::promptIgnoreLength is still unset/has the default state (e.g., optional empty or sentinel), mirroring the existing test structure but omitting the promptIgnoreLength tensor to verify legacy behavior; reference getRequest(), pushTensor, InputFieldsNames::promptIgnoreLength, and SamplingConfig::promptIgnoreLength when locating where to add the new test.
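For reference, a rough sketch of the omitted-field regression test suggested above. The fixture name and the `makeBaseInputTensors`/`getRequest` call shapes are hypothetical placeholders inferred from the comment, not the actual `utilsTest.cpp` fixture, so treat this as an outline rather than a drop-in test.

```cpp
// Sketch only: helper names and signatures are placeholders inferred from the
// review comment; adapt them to the real utilsTest.cpp fixture.
TEST_F(UtilsTest, PromptIgnoreLengthStaysUnsetWhenTensorIsAbsent)
{
    // Build the usual set of input tensors, but deliberately do NOT push
    // InputFieldsNames::promptIgnoreLength.
    auto inputsTensors = makeBaseInputTensors(); // hypothetical helper

    auto request = getRequest(inputsTensors); // hypothetical call shape

    // Backward-compatible path: the field must remain unset, not default to 0.
    EXPECT_FALSE(request.getSamplingConfig().getPromptIgnoreLength().has_value());
}
```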
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9bc76f60-8342-4001-b503-ecd37113922c
📒 Files selected for processing (9)
- triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt
- triton_backend/inflight_batcher_llm/src/utils.cc
- triton_backend/inflight_batcher_llm/src/utils.h
- triton_backend/inflight_batcher_llm/tests/utilsTest.cpp
PR_Github #46277 [ run ] triggered by Bot. Commit:
PR_Github #46277 [ run ] completed with state
Force-pushed from 7eadbad to 58a6589
/bot run
PR_Github #46414 [ run ] triggered by Bot. Commit:
/bot kill
/bot help
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Force-pushed from 58a6589 to 77e52ed
/bot kill
PR_Github #46415 [ kill ] triggered by Bot. Commit:
PR_Github #46415 [ kill ] completed with state
Force-pushed from 77e52ed to c15b7e2
/bot run
PR_Github #46417 [ kill ] triggered by Bot. Commit:
PR_Github #46417 [ kill ] completed with state
/bot run
PR_Github #46419 [ run ] triggered by Bot. Commit:
PR_Github #46420 [ run ] triggered by Bot. Commit:
PR_Github #46419 [ run ] completed with state
PR_Github #46420 [ run ] completed with state
Force-pushed from c15b7e2 to 01c4613
The executor::SamplingConfig constructor takes 20 positional std::optional parameters, with promptIgnoreLength (added in PR NVIDIA#8127) at position 14. The Triton backend's getSamplingConfigFromTensors() in utils.cc was only passing 17 args, omitting promptIgnoreLength. This compiled silently because std::optional<float> implicitly converts to std::optional<int32> via the contained type, so all params from position 14 onward shifted: the caller's lengthPenalty bound to promptIgnoreLength, earlyStopping bound to lengthPenalty, and so on. As a result, length_penalty and early_stopping sent over Triton (gRPC/HTTP) were silently ignored, and prompt_ignore_length had no way to be set at all.

This change adds full plumbing for prompt_ignore_length so callers can configure it from the Triton client all the way through to the executor and the penalty kernels:

- triton_backend/inflight_batcher_llm/src/utils.{h,cc}: declare the new InputFieldsNames::promptIgnoreLength input field, extract it from input tensors via extractOptionalSingleton<int32_t>, and pass it into executor::SamplingConfig at position 14 (replacing the silent default).
- triton_backend/inflight_batcher_llm/tests/utilsTest.cpp: extend the extractSingleton fixture to push a prompt_ignore_length tensor and assert SamplingConfig::getPromptIgnoreLength() round-trips correctly.
- Triton model configs in all_models/: declare an optional INT32 prompt_ignore_length input on every model that already exposes len_penalty (the sibling sampling field), and add the corresponding ensemble input_map entry where applicable:
  inflight_batcher_llm/{tensorrt_llm,tensorrt_llm_bls,ensemble}/config.pbtxt
  disaggregated_serving/disaggregated_serving_bls/config.pbtxt
  gpt/{tensorrt_llm,ensemble}/config.pbtxt
  multimodal/ensemble/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py: forward prompt_ignore_length from the request to trtllm.SamplingConfig kwargs (covers both Triton+engine and Triton+LLMAPI flows).
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/{decode,triton_decoder}.py: add prompt_ignore_length to the BLS Request dataclass, the input list, and the BLS->engine name mapping.

Backward compatibility: prompt_ignore_length is an optional input. When the tensor is not provided, getSamplingConfigFromTensors yields std::nullopt, matching the previous (already broken) default behavior, but now also correctly forwarding lengthPenalty/earlyStopping/etc. to their intended SamplingConfig slots.

Signed-off-by: Jhao-Ting Chen <jtchen0528@gmail.com>
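To make the failure mode concrete, here is a small, self-contained C++ sketch of the hazard the commit describes. MiniSamplingConfig and its four parameters are hypothetical stand-ins (the real executor::SamplingConfig takes 20), but the mechanism is the same: omitting one argument mid-list still compiles because std::optional's converting constructor accepts the neighbouring type.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>

// Hypothetical analog of a constructor with many defaulted std::optional
// parameters; NOT the real executor::SamplingConfig signature.
struct MiniSamplingConfig
{
    MiniSamplingConfig(std::optional<float> frequencyPenalty = std::nullopt,
        std::optional<int32_t> promptIgnoreLength = std::nullopt, // newly inserted field
        std::optional<float> lengthPenalty = std::nullopt,
        std::optional<int32_t> earlyStopping = std::nullopt)
        : mPromptIgnoreLength(promptIgnoreLength)
        , mLengthPenalty(lengthPenalty)
        , mEarlyStopping(earlyStopping)
    {
        (void) frequencyPenalty; // not stored; present only to mirror the parameter list shape
    }

    std::optional<int32_t> mPromptIgnoreLength;
    std::optional<float> mLengthPenalty;
    std::optional<int32_t> mEarlyStopping;
};

int main()
{
    std::optional<float> frequencyPenalty = 0.0F;
    std::optional<float> lengthPenalty = 2.0F; // caller intends lengthPenalty = 2.0
    std::optional<int32_t> earlyStopping = 1;  // caller intends earlyStopping = 1

    // Caller written against the old parameter list: promptIgnoreLength is omitted,
    // yet this still compiles. lengthPenalty (optional<float>) binds to the
    // optional<int32_t> promptIgnoreLength slot through std::optional's converting
    // constructor, earlyStopping binds to lengthPenalty, and the real earlyStopping
    // slot silently keeps its std::nullopt default.
    MiniSamplingConfig config(frequencyPenalty, lengthPenalty, earlyStopping);

    std::cout << "promptIgnoreLength = " << config.mPromptIgnoreLength.value_or(-1) // 2 (truncated from 2.0)
              << ", lengthPenalty = " << config.mLengthPenalty.value_or(-1.0F)      // 1 (the earlyStopping value)
              << ", earlyStopping set = " << config.mEarlyStopping.has_value()      // 0 (never set)
              << '\n';
    return 0;
}
```

Built with any C++17 compiler, this prints promptIgnoreLength = 2, lengthPenalty = 1, earlyStopping set = 0, which is the same silent positional shift the commit message describes.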
Force-pushed from 01c4613 to bafcf33
/bot run
PR_Github #46462 [ run ] triggered by Bot. Commit:
PR_Github #46462 [ run ] completed with state
… + BLS

After PR NVIDIA#13633 plumbed promptIgnoreLength correctly, early_stopping is the only sampling field that has two remaining issues:

1. Type mismatch between the Triton config declaration (TYPE_BOOL, 1 byte) and the C++ extraction in getSamplingConfigFromTensors (extractOptionalSingleton<int32_t>, reads 4 bytes). The mismatch works accidentally for {0, 1} (adjacent memory is zero), but cannot represent the executor's documented value 2 ("stop only when all beams emit <eos>"), and is undefined behavior in principle.
2. early_stopping is missing entirely from the ensemble + BLS configs in five places, so clients hitting Triton via `ensemble`, `tensorrt_llm_bls`, `multimodal/ensemble`, or `gpt/ensemble` cannot set early_stopping at all. This pre-dates PR NVIDIA#8127.

This change fixes both issues.

Type fix (BREAKING for clients sending early_stopping as bool):
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt: data_type TYPE_BOOL -> TYPE_INT32
- triton_backend/all_models/disaggregated_serving/disaggregated_serving_bls/config.pbtxt: same.
Aligns the wire-protocol declaration with executor::SamplingConfig semantics (std::optional<SizeType32> accepting 0/1/2). Clients previously sending numpy bool must now send numpy int32; behavior for values 0 and 1 is preserved.

Ensemble + BLS plumbing (additive, no compat impact):
- triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt: declare optional INT32 early_stopping input + add input_map block forwarding it to the tensorrt_llm step.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt: declare optional INT32 early_stopping input.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py: add early_stopping field to the Request dataclass.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py: add early_stopping to both input_names lists and the BLS->engine name mapping.
- triton_backend/all_models/multimodal/ensemble/config.pbtxt: declare + input_map.
- triton_backend/all_models/gpt/ensemble/config.pbtxt: declare + input_map.
- triton_backend/all_models/gpt/tensorrt_llm/config.pbtxt: declare (was missing entirely).

The Python tensorrt_llm/1/model.py already forwards early_stopping to trtllm.SamplingConfig kwargs; only the wire-protocol declaration was wrong.

Verified end-to-end on TinyLlama-1.1B with beam_width=4, prompt "Hello world. Goodbye." for all three Triton entry points:

Path 1 (direct tensorrt_llm): early_stopping=0 -> beam_lens=[60, 60, 54, 60]; early_stopping=1 -> beam_lens=[3, 0, 2, 1]
Path 2 (ensemble): early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."; early_stopping=1 -> "Hello world." (3 tokens)
Path 3 (tensorrt_llm_bls): early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."; early_stopping=1 -> "Hello world." (3 tokens)

All three paths now honor early_stopping correctly.

Signed-off-by: Jhao-Ting Chen <jtchen0528@gmail.com>
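For illustration, a minimal self-contained sketch of the size mismatch behind the type fix. extractAndCheck is a hypothetical stand-in for the backend's extraction helper (not the real extractOptionalSingleton); the 1-byte vs 4-byte figures follow the Triton datatype sizes quoted above.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a singleton-tensor extraction helper: it insists
// that the payload is exactly sizeof(T) bytes before copying the value out.
template <typename T>
T extractAndCheck(std::vector<uint8_t> const& payload)
{
    if (payload.size() != sizeof(T))
    {
        throw std::runtime_error("tensor byte size does not match extracted type");
    }
    T value{};
    std::memcpy(&value, payload.data(), sizeof(T));
    return value;
}

int main()
{
    // With the old TYPE_BOOL declaration the wire payload is a single byte, so a
    // 4-byte int32_t read would run past the tensor, and the payload can never
    // carry the executor's documented value 2.
    std::vector<uint8_t> boolPayload = {1};

    // With TYPE_INT32 the payload is 4 bytes and all of 0/1/2 round-trip.
    int32_t earlyStopping = 2;
    std::vector<uint8_t> int32Payload(sizeof(int32_t));
    std::memcpy(int32Payload.data(), &earlyStopping, sizeof(int32_t));

    try
    {
        extractAndCheck<int32_t>(boolPayload); // size mismatch: 1 byte vs 4
    }
    catch (std::exception const& e)
    {
        std::cout << "bool-declared payload rejected: " << e.what() << '\n';
    }
    std::cout << "int32-declared payload: " << extractAndCheck<int32_t>(int32Payload) << '\n';
    return 0;
}
```

With the config declared as TYPE_INT32, the client puts four bytes on the wire and values 0, 1, and 2 all survive the trip; a TYPE_BOOL declaration leaves only one byte and only {0, 1} representable.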
Summary by CodeRabbit
New Features
New optional `prompt_ignore_length` parameter in inference requests, enabling customizable prompt length handling throughout the inference pipeline.
Tests
Added coverage for `prompt_ignore_length` parameter validation.
Description
The customer reported that on TRT-LLM v1.2.0rc4 (and confirmed identical on v1.2.0 GA),
`length_penalty` and `early_stopping` were silently ignored when sent through the Triton C++ backend. Root cause traced to `triton_backend/inflight_batcher_llm/src/utils.cc::getSamplingConfigFromTensors()` passing 17 positional args to `executor::SamplingConfig(...)` while the constructor signature in `cpp/include/tensorrt_llm/executor/executor.h` had 20 parameters with `promptIgnoreLength` (added by PR #8127) at position 14 between `frequencyPenalty` and `lengthPenalty`. Because all parameters are `std::optional<...>` and `std::optional<float>` implicitly converts to `std::optional<int32>` via the contained type, the call compiled silently but all params from #14 onward shifted positions:

| Caller's argument | Actually bound to |
| --- | --- |
| `lengthPenalty` (#14) | `promptIgnoreLength` (lossy float→int) |
| `earlyStopping` (#15) | `lengthPenalty` |
| `noRepeatNgramSize` (#16) | `earlyStopping` |
| `numReturnSequences` (#17) | `noRepeatNgramSize` |

`numReturnSequences`, `minP`, and `beamWidthArray` defaulted to `std::nullopt`.
Test Coverage
Test 1:
`length_penalty` — E2E PROVEN ✓
Setup: prompt
"The capital of France is Paris. The capital of Germany is Berlin. The capital of Japan is",beam_width=4, max_tokens=120, early_stopping=True, varylen_penaltybetween0.1and5.0.len_penalty''(just<eos>)'.C. The capital of Australia is Canberra. The capital of New Zealand is Wellington...'output_idsdiffer: ✓. The chosen beam is dramatically different — empty vs 100-token continuation — provinglength_penaltyis now consumed at slot 15 ofexecutor::SamplingConfiginstead of being silently dropped.Test 2:
`early_stopping` — E2E PROVEN ✓
Setup: prompt
"Hello world. Goodbye.",beam_width=4, max_tokens=60, len_penalty=1.0, varyearly_stoppingbetweenFalseandTrue.early_stoppingFalse(run to max_tokens / heuristic)'world. Goodbye. Hello world. Goodbye. Hello world. Goodbye...'True(stop on first<eos>)''output_idsdiffer: ✓. Withearly_stopping=Falsethe beams continued for 47-53 tokens; withTruethey all terminated immediately at<eos>. This provesearly_stoppingis now consumed at slot 16 ofexecutor::SamplingConfig.Test 3:
`prompt_ignore_length` — UNIT-TEST PROVEN ✓ (e2e demo inconclusive on TinyLlama)
Setup: various prompts (
"a a a a...","the the...","banana banana...","France is a country. France is in Europe. France is famous for..."), variedrepetition_penalty1.3-10.0,prompt_ignore_lengthfromNonetoprompt_len.Outcome: no observable behavioral divergence at the e2e level on TinyLlama-Chat-1.1B. The model emits
`<eos>` as the first generated token whenever the prompt contains many repetitions of a token, regardless of `prompt_ignore_length`. This is a property of TinyLlama-Chat (which is heavily fine-tuned for short turn-by-turn dialogue) rather than a defect in the plumbing. Verifying behavioral effect with a non-chat-tuned base model would require building another engine.
Plumbing IS verified at multiple lower layers, and these mechanically determine the e2e behavior:
- `executor::SamplingConfig` extraction: `triton_backend/inflight_batcher_llm/tests/utilsTest::extractSingleton` (extended in this PR to push `prompt_ignore_length=7`): `getPromptIgnoreLength().value() == 7` ✓
- `executor::SamplingConfig` getter/setter API: `cpp/tests/unit_tests/executor/samplingConfigTest::getterSetter`: `setPromptIgnoreLength(1)` round-trips ✓ (see the sketch below)
- `runtime::SamplingConfig` field: `cpp/tests/unit_tests/runtime/samplingConfigTest::validInputs`
- Penalty kernel: `cpp/tests/unit_tests/kernels/sampling/samplingPenaltyTest::PenaltyTypeFullWithPartialPromptIgnore` (added by upstream PR #8127)
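For reference, a minimal sketch of the round-trip the two executor-level rows describe. It assumes GTest and the `setPromptIgnoreLength`/`getPromptIgnoreLength` accessors named above; the default-constructed config and the header path are assumptions, not lines copied from the repository's test.

```cpp
#include <gtest/gtest.h>

#include "tensorrt_llm/executor/executor.h" // assumed header location for executor::SamplingConfig

namespace texec = tensorrt_llm::executor;

// Sketch in the spirit of samplingConfigTest::getterSetter: set
// promptIgnoreLength and read it back through the public accessors.
TEST(SamplingConfigSketch, PromptIgnoreLengthRoundTrip)
{
    texec::SamplingConfig config; // assumption: defaults leave promptIgnoreLength unset

    EXPECT_FALSE(config.getPromptIgnoreLength().has_value());

    config.setPromptIgnoreLength(7);
    ASSERT_TRUE(config.getPromptIgnoreLength().has_value());
    EXPECT_EQ(config.getPromptIgnoreLength().value(), 7);
}
```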
getSamplingConfigFromTensors→SamplingConfig::mPromptIgnoreLength→ penalty kernel skips N tokens. Each link is unit-tested. The Triton model repo also accepted theprompt_ignore_lengthinput tensor (INT32) without protocol error, confirming theconfig.pbtxtdeclaration is valid.Summary
- `length_penalty`: E2E proven ('' vs 100-token text)
- `early_stopping`: E2E proven
- `prompt_ignore_length`: unit-test proven (TinyLlama-Chat emits `<eos>` too aggressively for a conclusive e2e demo)
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.