
[None][fix] Plumb promptIgnoreLength through Triton backend to fix silently-dropped lengthPenalty and earlyStopping#13633

Merged
jhaotingc merged 1 commit into NVIDIA:main from jhaotingc:jhaotingc/fix-triton-prompt-ignore-length
May 1, 2026

Conversation

@jhaotingc
Collaborator

@jhaotingc jhaotingc commented Apr 30, 2026

Summary by CodeRabbit

  • New Features

    • Added support for optional prompt_ignore_length parameter in inference requests, enabling customizable prompt length handling throughout the inference pipeline.
  • Tests

    • Added test coverage for the new prompt_ignore_length parameter validation.

Description

The customer reported that on TRT-LLM v1.2.0rc4 (and confirmed identical on v1.2.0 GA), length_penalty and early_stopping were silently ignored when sent through the Triton C++ backend. The root cause was traced to triton_backend/inflight_batcher_llm/src/utils.cc::getSamplingConfigFromTensors(), which passed 17 positional args to executor::SamplingConfig(...) even though the constructor signature in cpp/include/tensorrt_llm/executor/executor.h has 20 parameters, with promptIgnoreLength (added by PR #8127) at position 14 between frequencyPenalty and lengthPenalty. Because all parameters are std::optional<...> and std::optional<float> implicitly converts to std::optional<int32> via the contained type, the call compiled silently, but every parameter from position 14 onward shifted:

| Caller pos in utils.cc (pre-fix) | Bound to slot | Param it landed in |
| --- | --- | --- |
| lengthPenalty (#14) | 14 | promptIgnoreLength (lossy float→int) |
| earlyStopping (#15) | 15 | lengthPenalty |
| noRepeatNgramSize (#16) | 16 | earlyStopping |
| numReturnSequences (#17) | 17 | noRepeatNgramSize |

numReturnSequences, minP, and beamWidthArray therefore defaulted to std::nullopt.
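To make the failure mode concrete, here is a minimal, self-contained sketch of the same pattern. The makeConfig function below is a hypothetical stand-in for a slice of the real 20-parameter constructor, not the executor API itself:

```cpp
#include <cstdint>
#include <iostream>
#include <optional>
#include <string>

// Hypothetical stand-in for a slice of the 20-parameter constructor: an optional int
// slot (promptIgnoreLength) sits between slots the Triton backend actually fills.
void makeConfig(std::optional<float> frequencyPenalty,
                std::optional<std::int32_t> promptIgnoreLength = std::nullopt, // slot 14
                std::optional<float> lengthPenalty = std::nullopt,             // slot 15
                std::optional<std::int32_t> earlyStopping = std::nullopt)      // slot 16
{
    auto show = [](auto const& opt) { return opt ? std::to_string(*opt) : std::string("nullopt"); };
    std::cout << "promptIgnoreLength=" << show(promptIgnoreLength)
              << " lengthPenalty=" << show(lengthPenalty)
              << " earlyStopping=" << show(earlyStopping) << '\n';
}

int main()
{
    std::optional<float> lengthPenalty = 2.0f;     // what the caller intends to set
    std::optional<std::int32_t> earlyStopping = 1;

    // Pre-fix call shape: promptIgnoreLength is omitted. lengthPenalty binds to the int
    // slot (std::optional<float> converts implicitly to std::optional<int32_t>, truncating
    // 2.0f to 2), earlyStopping binds to lengthPenalty, and the real earlyStopping slot
    // silently falls back to std::nullopt. This compiles without any error.
    makeConfig(std::nullopt, lengthPenalty, earlyStopping);

    // Post-fix call shape: promptIgnoreLength is passed explicitly, so every later
    // argument lands in its intended slot.
    makeConfig(std::nullopt, std::nullopt, lengthPenalty, earlyStopping);
}
```

The same shape explains why the 17-argument call in utils.cc compiled cleanly while shifting every later sampling field by one slot.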

Test Coverage

Test 1: length_penalty — E2E PROVEN ✓

Setup: prompt "The capital of France is Paris. The capital of Germany is Berlin. The capital of Japan is", beam_width=4, max_tokens=120, early_stopping=True, vary len_penalty between 0.1 and 5.0.

| len_penalty | beam[0] gen_len | beam[0] text |
| --- | --- | --- |
| 0.1 (favors short beams) | 0 | '' (just <eos>) |
| 5.0 (favors long beams) | 100 | '.C. The capital of Australia is Canberra. The capital of New Zealand is Wellington...' |

output_ids differ: ✓. The chosen beam is dramatically different — empty vs 100-token continuation — proving length_penalty is now consumed at slot 15 of executor::SamplingConfig instead of being silently dropped.

Test 2: early_stopping — E2E PROVEN ✓

Setup: prompt "Hello world. Goodbye.", beam_width=4, max_tokens=60, len_penalty=1.0, vary early_stopping between False and True.

| early_stopping | beam_lens | beam[0] text |
| --- | --- | --- |
| False (run to max_tokens / heuristic) | [53, 53, 47, 53] | 'world. Goodbye. Hello world. Goodbye. Hello world. Goodbye...' |
| True (stop on first <eos>) | [0, 0, 0, 0] | '' |

output_ids differ: ✓. With early_stopping=False the beams continued for 47-53 tokens; with True they all terminated immediately at <eos>. This proves early_stopping is now consumed at slot 16 of executor::SamplingConfig.

Test 3: prompt_ignore_length — UNIT-TEST PROVEN ✓ (e2e demo inconclusive on TinyLlama)

Setup: various prompts ("a a a a...", "the the...", "banana banana...", "France is a country. France is in Europe. France is famous for..."), varied repetition_penalty 1.3-10.0, prompt_ignore_length from None to prompt_len.

Outcome: no observable behavioral divergence at the e2e level on TinyLlama-Chat-1.1B. The model emits <eos> as the first generated token whenever the prompt contains many repetitions of a token, regardless of prompt_ignore_length. This is a property of TinyLlama-Chat (which is heavily fine-tuned for short turn-by-turn dialogue) rather than a defect in the plumbing. Verifying behavioral effect with a non-chat-tuned base model would require building another engine.
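For intuition on what prompt_ignore_length means downstream, here is a rough CPU-side illustration of the skip behavior checked by the kernel test listed in the table below. This is not the actual GPU penalty kernel; the function shape and names are assumptions of this sketch:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Rough CPU-side sketch (not the real CUDA penalty kernel) of the promptIgnoreLength
// semantics: the first `promptIgnoreLength` prompt tokens are excluded when collecting
// the token occurrences that a repetition-style penalty acts on.
void applyRepetitionPenaltySketch(std::vector<float>& logits,
                                  std::vector<std::int32_t> const& promptIds,
                                  std::vector<std::int32_t> const& generatedIds,
                                  float penalty, std::int32_t promptIgnoreLength)
{
    std::vector<bool> seen(logits.size(), false);
    for (std::size_t i = static_cast<std::size_t>(promptIgnoreLength); i < promptIds.size(); ++i)
    {
        seen[static_cast<std::size_t>(promptIds[i])] = true; // skip the first N prompt tokens
    }
    for (auto id : generatedIds)
    {
        seen[static_cast<std::size_t>(id)] = true; // generated tokens are always penalized
    }
    for (std::size_t v = 0; v < logits.size(); ++v)
    {
        if (seen[v])
        {
            // Standard repetition-penalty rule: shrink positive logits, push negatives further down.
            logits[v] = logits[v] > 0.f ? logits[v] / penalty : logits[v] * penalty;
        }
    }
}
```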

Plumbing IS verified at multiple lower layers, and these mechanically determine the e2e behavior:

| Layer | Test | Result |
| --- | --- | --- |
| Triton input tensor → executor::SamplingConfig extraction | triton_backend/inflight_batcher_llm/tests/utilsTest::extractSingleton (extended in this PR to push prompt_ignore_length=7) | getPromptIgnoreLength().value() == 7 |
| executor::SamplingConfig getter/setter API | cpp/tests/unit_tests/executor/samplingConfigTest::getterSetter | setPromptIgnoreLength(1) round-trips ✓ |
| runtime::SamplingConfig field | cpp/tests/unit_tests/runtime/samplingConfigTest::validInputs | round-trip ✓ |
| GPU penalty kernel uses field correctly | cpp/tests/unit_tests/kernels/sampling/samplingPenaltyTest::PenaltyTypeFullWithPartialPromptIgnore (added by upstream PR #8127) | kernel skips first N prompt tokens ✓ |

This is a complete chain: input tensor → getSamplingConfigFromTensors → SamplingConfig::mPromptIgnoreLength → penalty kernel skips N tokens. Each link is unit-tested. The Triton model repo also accepted the prompt_ignore_length input tensor (INT32) without protocol error, confirming the config.pbtxt declaration is valid.
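For concreteness, the executor-level round-trip in the table above would look roughly like this in gtest terms. The setPromptIgnoreLength/getPromptIgnoreLength accessors are named in this PR; the test name and the use of a default-constructed SamplingConfig are assumptions of this sketch, not the actual samplingConfigTest code:

```cpp
#include <gtest/gtest.h>

#include "tensorrt_llm/executor/executor.h"

namespace texec = tensorrt_llm::executor;

// Sketch of the getter/setter round-trip; not the real samplingConfigTest::getterSetter body.
TEST(PromptIgnoreLengthSketch, RoundTrip)
{
    texec::SamplingConfig config;                              // promptIgnoreLength starts unset
    EXPECT_FALSE(config.getPromptIgnoreLength().has_value());

    config.setPromptIgnoreLength(7);                           // the value the extended utilsTest pushes
    ASSERT_TRUE(config.getPromptIgnoreLength().has_value());
    EXPECT_EQ(config.getPromptIgnoreLength().value(), 7);
}
```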

Summary

| Field | E2E behavioral demo | Mechanical proof |
| --- | --- | --- |
| length_penalty | ✓ — chosen beam differs ('' vs 100-token text) | ✓ — slot 15 alignment |
| early_stopping | ✓ — beam_lens differ ([53,53,47,53] vs [0,0,0,0]) | ✓ — slot 16 alignment |
| prompt_ignore_length | inconclusive on TinyLlama-Chat (model emits <eos> too aggressively) | ✓ — slot 14 alignment + kernel test + tensor accepted by tritonserver |

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@jhaotingc jhaotingc requested a review from a team as a code owner April 30, 2026 03:00
@jhaotingc
Collaborator Author

@CodeRabbit title

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

Tip

For best results, initiate chat on the files or code changes.

@jhaotingc: Here is a suggested title for this PR following the TRT-LLM title template:

[None][fix] Plumb promptIgnoreLength through Triton backend to fix silently-dropped lengthPenalty and earlyStopping

@jhaotingc jhaotingc changed the title fix: Plumb promptIgnoreLength through Triton backend [None][fix] Plumb promptIgnoreLength through Triton backend to fix silently-dropped lengthPenalty and earlyStopping Apr 30, 2026
@jhaotingc jhaotingc force-pushed the jhaotingc/fix-triton-prompt-ignore-length branch from 149456d to 7eadbad on April 30, 2026 03:04
@jhaotingc jhaotingc requested a review from Tabrizian April 30, 2026 03:05
@jhaotingc
Collaborator Author

/bot run --disable-fail-fast

@coderabbitai
Contributor

coderabbitai Bot commented Apr 30, 2026

📝 Walkthrough


This change introduces support for a new optional prompt_ignore_length parameter throughout the TensorRT-LLM inference pipeline, from ensemble configuration through Python backend layers to C++ backend sampling configuration parsing and testing.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Configuration Declarations: triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt, triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt, triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt | Added optional INT32 input prompt_ignore_length with shape [1] to ensemble, TensorRT-LLM, and BLS model configurations. |
| TensorRT-LLM Python Backend: triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py | Extended sampling configuration mapping to extract and include prompt_ignore_length from batch request elements. |
| BLS Python Backend: triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py, triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py | Added prompt_ignore_length field to Request dataclass and integrated the parameter into the decoder's input handling and tensor mapping. |
| C++ Backend: triton_backend/inflight_batcher_llm/src/utils.h, triton_backend/inflight_batcher_llm/src/utils.cc | Added promptIgnoreLength constant to InputFieldsNames and updated sampling configuration parsing to extract and pass the parameter to executor SamplingConfig. |
| Testing: triton_backend/inflight_batcher_llm/tests/utilsTest.cpp | Added test coverage for promptIgnoreLength parameter injection and validation in sampling configuration. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main fix: plumbing promptIgnoreLength through the Triton backend to resolve parameter-misalignment issues affecting length_penalty and early_stopping. |
| Description check | ✅ Passed | The description is comprehensive and well-structured, with clear root cause analysis, detailed test coverage demonstrating the fix works, and completion of the PR checklist. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
triton_backend/inflight_batcher_llm/tests/utilsTest.cpp (1)

391-391: ⚡ Quick win

Add one omitted-field regression test.

getRequest() now always injects promptIgnoreLength, so this test no longer exercises the backward-compatible path where the tensor is absent and SamplingConfig::promptIgnoreLength remains unset.

If you'd like, I can draft the extra regression test.

Also applies to: 589-589

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@triton_backend/inflight_batcher_llm/tests/utilsTest.cpp` at line 391, The
test currently always injects InputFieldsNames::promptIgnoreLength via
pushTensor which prevents exercising the backward-compatible path; add an extra
regression test in utilsTest.cpp that builds inputsTensors without calling
pushTensor for InputFieldsNames::promptIgnoreLength, call getRequest() and
assert that the resulting SamplingConfig::promptIgnoreLength is still unset/has
the default state (e.g., optional empty or sentinel), mirroring the existing
test structure but omitting the promptIgnoreLength tensor to verify legacy
behavior; reference getRequest(), pushTensor,
InputFieldsNames::promptIgnoreLength, and SamplingConfig::promptIgnoreLength
when locating where to add the new test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9bc76f60-8342-4001-b503-ecd37113922c

📥 Commits

Reviewing files that changed from the base of the PR and between 2bc8f7f and 149456d.

📒 Files selected for processing (9)
  • triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt
  • triton_backend/inflight_batcher_llm/src/utils.cc
  • triton_backend/inflight_batcher_llm/src/utils.h
  • triton_backend/inflight_batcher_llm/tests/utilsTest.cpp

@tensorrt-cicd
Collaborator

PR_Github #46277 [ run ] triggered by Bot. Commit: 7eadbad Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46277 [ run ] completed with state SUCCESS. Commit: 7eadbad
/LLM/main/L0_MergeRequest_PR pipeline #36382 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@jhaotingc jhaotingc force-pushed the jhaotingc/fix-triton-prompt-ignore-length branch from 7eadbad to 58a6589 on April 30, 2026 17:26
@jhaotingc
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46414 [ run ] triggered by Bot. Commit: 58a6589 Link to invocation

@jhaotingc
Collaborator Author

/bot kill

@jhaotingc
Collaborator Author

/bot help

@github-actions

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@jhaotingc jhaotingc force-pushed the jhaotingc/fix-triton-prompt-ignore-length branch from 58a6589 to 77e52ed on April 30, 2026 17:42
@jhaotingc
Collaborator Author

/bot kill

@tensorrt-cicd
Collaborator

PR_Github #46415 [ kill ] triggered by Bot. Commit: 77e52ed Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46415 [ kill ] completed with state SUCCESS. Commit: 77e52ed
Successfully killed previous jobs for commit 77e52ed

Link to invocation

@jhaotingc jhaotingc force-pushed the jhaotingc/fix-triton-prompt-ignore-length branch from 77e52ed to c15b7e2 on April 30, 2026 17:48
@jhaotingc
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46417 [ kill ] triggered by Bot. Commit: c15b7e2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46417 [ kill ] completed with state SUCCESS. Commit: c15b7e2
Successfully killed previous jobs for commit c15b7e2

Link to invocation

@jhaotingc
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46419 [ run ] triggered by Bot. Commit: c15b7e2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46420 [ run ] triggered by Bot. Commit: c15b7e2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46419 [ run ] completed with state ABORTED. Commit: c15b7e2

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46420 [ run ] completed with state FAILURE. Commit: c15b7e2
/LLM/main/L0_MergeRequest_PR pipeline #36492 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@jhaotingc jhaotingc force-pushed the jhaotingc/fix-triton-prompt-ignore-length branch from c15b7e2 to 01c4613 on April 30, 2026 22:00
The executor::SamplingConfig constructor takes 20 positional std::optional
parameters, with promptIgnoreLength (added in PR NVIDIA#8127) at position 14.
The Triton backend's getSamplingConfigFromTensors() in utils.cc was only
passing 17 args, omitting promptIgnoreLength. This compiled silently
because std::optional<float> implicitly converts to std::optional<int32>
via the contained type, so all params from position 14 onward shifted positions:
caller's lengthPenalty bound to promptIgnoreLength, earlyStopping bound
to lengthPenalty, etc. As a result, length_penalty and early_stopping
sent over Triton (gRPC/HTTP) were silently ignored, and prompt_ignore_length
had no way to be set at all.

This change adds full plumbing for prompt_ignore_length so callers can
configure it from the Triton client all the way through to the executor
and the penalty kernels:

- triton_backend/inflight_batcher_llm/src/utils.{h,cc}: declare new
  InputFieldsNames::promptIgnoreLength input field, extract it from input
  tensors via extractOptionalSingleton<int32_t>, and pass it into
  executor::SamplingConfig at position 14 (replacing the silent default).
- triton_backend/inflight_batcher_llm/tests/utilsTest.cpp: extend the
  extractSingleton fixture to push a prompt_ignore_length tensor and
  assert SamplingConfig::getPromptIgnoreLength() round-trips correctly.
- Triton model configs in all_models/ — declare optional INT32
  prompt_ignore_length input on every model that already exposes
  len_penalty (the sibling sampling field), and add the corresponding
  ensemble input_map entry where applicable:
    inflight_batcher_llm/{tensorrt_llm,tensorrt_llm_bls,ensemble}/config.pbtxt
    disaggregated_serving/disaggregated_serving_bls/config.pbtxt
    gpt/{tensorrt_llm,ensemble}/config.pbtxt
    multimodal/ensemble/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py:
  forward prompt_ignore_length from request to trtllm.SamplingConfig kwargs
  (covers both Triton+engine and Triton+LLMAPI flows).
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/{decode,triton_decoder}.py:
  add prompt_ignore_length to the BLS Request dataclass, the input list,
  and the BLS->engine name mapping.

Backward compatibility: prompt_ignore_length is an optional input. When the
tensor is not provided, getSamplingConfigFromTensors yields std::nullopt,
matching the previous (already broken) default behavior, but now also
correctly forwarding lengthPenalty/earlyStopping/etc to their intended
SamplingConfig slots.

Signed-off-by: Jhao-Ting Chen <jtchen0528@gmail.com>
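The backward-compatibility claim above rests on the extraction pattern sketched below. This is a simplified stand-in for illustration only, not the real utils::extractOptionalSingleton signature or implementation:

```cpp
#include <cstdint>
#include <cstring>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Simplified stand-in (not the real utils::extractOptionalSingleton) for the extraction
// pattern the commit relies on: if the named tensor is absent, the result is std::nullopt,
// which is exactly what SamplingConfig receives for prompt_ignore_length by default.
template <typename T>
std::optional<T> extractOptionalSingletonSketch(
    std::map<std::string, std::vector<std::uint8_t>> const& tensors, std::string const& name)
{
    auto it = tensors.find(name);
    if (it == tensors.end() || it->second.size() < sizeof(T))
    {
        return std::nullopt; // tensor not provided by the client: keep the executor default
    }
    T value{};
    std::memcpy(&value, it->second.data(), sizeof(T));
    return value;
}
```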
@jhaotingc jhaotingc force-pushed the jhaotingc/fix-triton-prompt-ignore-length branch from 01c4613 to bafcf33 on May 1, 2026 02:41
@jhaotingc
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #46462 [ run ] triggered by Bot. Commit: bafcf33 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #46462 [ run ] completed with state SUCCESS. Commit: bafcf33
/LLM/main/L0_MergeRequest_PR pipeline #36530 completed with status: 'SUCCESS'

CI Report

Link to invocation

@jhaotingc jhaotingc merged commit 81b5673 into NVIDIA:main May 1, 2026
6 checks passed
jhaotingc added a commit to jhaotingc/TensorRT-LLM that referenced this pull request May 1, 2026
… + BLS

After PR NVIDIA#13633 plumbed promptIgnoreLength correctly, early_stopping
is the only sampling field that has two remaining issues:

1. Type mismatch between Triton config declaration (TYPE_BOOL, 1 byte)
   and the C++ extraction in getSamplingConfigFromTensors
   (extractOptionalSingleton<int32_t>, reads 4 bytes). The mismatch
   works accidentally for {0, 1} (adjacent memory is zero), but cannot
   represent the executor's documented value 2 ("stop only when all
   beams emit <eos>"), and is undefined behavior in principle.

2. early_stopping is missing entirely from the ensemble + BLS configs
   in five places, so clients hitting Triton via `ensemble`,
   `tensorrt_llm_bls`, `multimodal/ensemble`, or `gpt/ensemble`
   cannot set early_stopping at all. This pre-dates PR NVIDIA#8127.

This change fixes both issues:

Type fix (BREAKING for clients sending early_stopping as bool):
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt:
  data_type TYPE_BOOL -> TYPE_INT32
- triton_backend/all_models/disaggregated_serving/disaggregated_serving_bls/config.pbtxt:
  same.
  Aligns the wire-protocol declaration with executor::SamplingConfig
  semantics (std::optional<SizeType32> accepting 0/1/2). Clients
  previously sending numpy bool must now send numpy int32; behavior
  for values 0 and 1 is preserved.

Ensemble + BLS plumbing (additive, no compat impact):
- triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt:
  declare optional INT32 early_stopping input + add input_map block
  forwarding it to the tensorrt_llm step.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt:
  declare optional INT32 early_stopping input.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py:
  add early_stopping field to the Request dataclass.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py:
  add early_stopping to both input_names lists and the BLS->engine
  name mapping.
- triton_backend/all_models/multimodal/ensemble/config.pbtxt: declare
  + input_map.
- triton_backend/all_models/gpt/ensemble/config.pbtxt: declare +
  input_map.
- triton_backend/all_models/gpt/tensorrt_llm/config.pbtxt: declare
  (was missing entirely).

The Python tensorrt_llm/1/model.py already forwards early_stopping
to trtllm.SamplingConfig kwargs; only the wire-protocol declaration
was wrong.

Verified end-to-end on TinyLlama-1.1B with beam_width=4, prompt
"Hello world. Goodbye." for all three Triton entry points:

  Path 1 (direct tensorrt_llm):
    early_stopping=0 -> beam_lens=[60, 60, 54, 60]
    early_stopping=1 -> beam_lens=[3, 0, 2, 1]

  Path 2 (ensemble):
    early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
    early_stopping=1 -> "Hello world." (3 tokens)

  Path 3 (tensorrt_llm_bls):
    early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
    early_stopping=1 -> "Hello world." (3 tokens)

All three paths now honor early_stopping correctly.

Signed-off-by: Jhao-Ting Chen <jtchen0528@gmail.com>
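As a small aside on point 1 of this follow-up, here is a contrived illustration (not the backend code) of why reading the 1-byte TYPE_BOOL element as a 4-byte int32 only worked by accident, and why values other than 0/1 need an INT32 tensor:

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    // A TYPE_BOOL element is 1 byte. Reading it as int32_t pulls in 3 extra bytes.
    unsigned char tensorBytes[4] = {1, 0, 0, 0}; // the extra bytes just happen to be zero here
    std::int32_t earlyStopping = 0;
    std::memcpy(&earlyStopping, tensorBytes, sizeof(earlyStopping)); // reads past the 1-byte element
    std::cout << earlyStopping << '\n'; // prints 1 on little-endian, only because the padding is zero

    // A bool element also cannot encode the executor's documented value 2
    // ("stop only when all beams emit <eos>"), which is why the follow-up
    // switches the config.pbtxt declaration to TYPE_INT32.
}
```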