[None][fix] Fix early_stopping type and plumb through Triton ensemble…#13692

Merged: jhaotingc merged 1 commit into NVIDIA:main from jhaotingc:jhaotingc/fix-triton-early-stopping on May 2, 2026
Conversation

@jhaotingc (Collaborator) commented May 1, 2026

Summary by CodeRabbit

  • New Features

    • Added optional early_stopping input parameter to language model configurations across disaggregated serving, GPT ensembles, inflight batcher, and multimodal models.
    • Updated model decoders to recognize and forward the early stopping parameter during inference requests.
  • Tests

    • Updated test validations for the new early stopping parameter functionality.

Description

After PR #13633 plumbed promptIgnoreLength correctly, early_stopping is the only sampling field with two remaining issues:

  1. Type mismatch between the Triton config declaration (TYPE_BOOL, 1 byte) and the C++ extraction in getSamplingConfigFromTensors (extractOptionalSingleton<int32_t>, which reads 4 bytes). The mismatch works accidentally for {0, 1} when the adjacent memory happens to be zero, but it cannot represent the executor's documented value 2 ("stop only when all beams emit <eos>"), and it is undefined behavior in principle.

  2. early_stopping is missing entirely from the ensemble + BLS configs in five places, so clients hitting Triton via ensemble, tensorrt_llm_bls, multimodal/ensemble, or gpt/ensemble cannot set early_stopping at all. This pre-dates PR #8127 ([None][feat] Support ignored prompt length for penalties via new sampling config parameter).
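
The 1-byte vs 4-byte issue in point 1 can be illustrated in a few lines. This is a hypothetical sketch of the reinterpretation, not the actual backend code; it uses numpy to mimic reading a TYPE_BOOL payload through an int32 view, and assumes a little-endian host:

```python
import numpy as np

# A TYPE_BOOL tensor element occupies 1 byte on the wire, but an
# extraction that reads it as int32 consumes 4 bytes starting at that
# address. Mimic that by viewing a 1-byte payload plus its neighbors.
adjacent_zero = np.array([1, 0, 0, 0], dtype=np.uint8)  # bool True, zeroed neighbors
adjacent_junk = np.array([1, 7, 0, 0], dtype=np.uint8)  # bool True, nonzero neighbor

# "Works" accidentally when the adjacent bytes happen to be zero
# (little-endian: [1, 0, 0, 0] reads as 1)...
assert adjacent_zero.view(np.int32)[0] == 1
# ...but yields garbage once a neighboring byte is nonzero
# ([1, 7, 0, 0] reads as 1 + 7 * 256 = 1793).
assert adjacent_junk.view(np.int32)[0] == 1793

# And a bool simply cannot carry the executor's documented value 2
# ("stop only when all beams emit <eos>"); an int32 tensor can.
assert np.array([2], dtype=np.int32)[0] == 2
```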

Verified end-to-end on TinyLlama-1.1B with beam_width=4, prompt "Hello world. Goodbye." for all three Triton entry points:

Path 1 (direct tensorrt_llm):
early_stopping=0 -> beam_lens=[60, 60, 54, 60]
early_stopping=1 -> beam_lens=[3, 0, 2, 1]

Path 2 (ensemble):
early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
early_stopping=1 -> "Hello world." (3 tokens)

Path 3 (tensorrt_llm_bls):
early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
early_stopping=1 -> "Hello world." (3 tokens)

All three paths now honor early_stopping correctly.

Test Coverage

Setup: TinyLlama-1.1B-Chat engine (cached from session 1) served via the rebuilt libtriton_tensorrtllm.so from PR #13633 inside tekit_source:260429-tritondevel. Full Triton repo built from triton_backend/all_models/inflight_batcher_llm/* with all 5 sub-models loaded (preprocessing, postprocessing, ensemble, tensorrt_llm, tensorrt_llm_bls).

Test: prompt "Hello world. Goodbye.", beam_width=4, max_tokens=60, len_penalty=1.0. Vary early_stopping between 0 (never) and 1 (stop on worst-beam EOS). Send to all three model paths.
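
For reference, a request along these lines can be expressed against Triton's KServe v2 HTTP inference API. This is a hypothetical payload sketch: the input names mirror the configs touched by this PR, but the exact tensor set the ensemble requires is not reproduced here.

```python
import json

# Hypothetical v2 inference payload; note early_stopping is now INT32, not BOOL.
payload = {
    "inputs": [
        {"name": "text_input",     "datatype": "BYTES", "shape": [1, 1], "data": ["Hello world. Goodbye."]},
        {"name": "max_tokens",     "datatype": "INT32", "shape": [1, 1], "data": [60]},
        {"name": "beam_width",     "datatype": "INT32", "shape": [1, 1], "data": [4]},
        {"name": "len_penalty",    "datatype": "FP32",  "shape": [1, 1], "data": [1.0]},
        {"name": "early_stopping", "datatype": "INT32", "shape": [1, 1], "data": [1]},
    ]
}
# This body would be POSTed to /v2/models/ensemble/infer (or the bls/direct model).
body = json.dumps(payload)
```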

| Path | early_stopping=0 | early_stopping=1 |
| --- | --- | --- |
| direct tensorrt_llm | beam_lens=[60, 60, 54, 60] (no early stop) | beam_lens=[3, 0, 2, 1] (early stop) |
| ensemble | 60-token text: 'Hello world. Goodbye. Hello world. Goodbye. Hello world. Goodbye...' | 3-token text: 'Hello world.' |
| tensorrt_llm_bls | 60-token text: 'Hello world. Goodbye. Hello world. Goodbye. Hello world. Goodbye...' | 3-token text: 'Hello world.' |

All three Triton entry points now honor early_stopping correctly.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@jhaotingc jhaotingc requested a review from a team as a code owner May 1, 2026 18:30
@jhaotingc (Collaborator, Author):

/bot run --disable-fail-fast

@coderabbitai (Contributor) bot commented May 1, 2026

📝 Walkthrough

Walkthrough

The changes introduce an early_stopping parameter as an INT32 input across multiple LLM model configurations. Type conversions from boolean to integer are applied to several models, while ensemble configurations are updated to declare and route this parameter through to underlying tensorrt_llm models.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Disaggregated Serving Config: triton_backend/all_models/disaggregated_serving/disaggregated_serving_bls/config.pbtxt | Changed early_stopping input type from TYPE_BOOL to TYPE_INT32 while preserving optional status and shape. |
| GPT Models: triton_backend/all_models/gpt/ensemble/config.pbtxt, triton_backend/all_models/gpt/tensorrt_llm/config.pbtxt | Added optional early_stopping input (TYPE_INT32, dims [1]) to ensemble and tensorrt_llm configs; wired the ensemble input to the tensorrt_llm step. |
| Inflight Batcher Configuration: triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt, triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt, triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt | Added optional early_stopping input (TYPE_INT32) to ensemble and bls configs; converted the tensorrt_llm input type from TYPE_BOOL to TYPE_INT32. |
| Inflight Batcher BLS Logic: triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py, triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py | Added early_stopping field to the Request dataclass; extended the decoder to include early_stopping in the input-names whitelist and the request-to-tensor mapping. |
| Multimodal Models: triton_backend/all_models/multimodal/ensemble/config.pbtxt | Added optional early_stopping input (TYPE_INT32, dims [1]) and wired it to the tensorrt_llm step. |
| Test Updates: triton_backend/inflight_batcher_llm/tests/utilsTest.cpp | Updated the earlyStopping test tensor value from 4 to 2, with comments describing the tri-state semantics. |
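
The BLS-side change can be pictured with a minimal sketch. The real Request dataclass in decode.py carries many more fields; the shape here is paraphrased from the summary above:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np

# Minimal stand-in for the BLS Request dataclass: early_stopping joins the
# other optional sampling fields, carried as an int32 tensor rather than a bool.
@dataclass
class Request:
    text_input: np.ndarray
    early_stopping: Optional[np.ndarray] = None

req = Request(
    text_input=np.array([["Hello world. Goodbye."]], dtype=object),
    early_stopping=np.array([[1]], dtype=np.int32),
)
assert req.early_stopping.dtype == np.int32
```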

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[None][fix] Fix early_stopping type and plumb through Triton ensemble…' clearly describes the main changes: fixing the early_stopping type mismatch and adding plumbing through ensemble configs. |
| Description check | ✅ Passed | The PR description explains both issues fixed (type mismatch and missing ensemble/BLS plumbing), includes test coverage results across all three entry points, and follows the repository template with the PR checklist completed. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@triton_backend/inflight_batcher_llm/tests/utilsTest.cpp`:
- Around line 385-388: The inline comment for the early_stopping test is
incorrect: update the comment near the pushTensor<int32_t>(...,
InputFieldsNames::earlyStopping, nvinfer1::DataType::kINT32, {1}, {2}) call to
state that 2 means "stop only when all beams emit <eos>" (HuggingFace tri-state:
0=heuristic, 1=fast, 2=stop-only-when-all-beams-emit-eos) instead of `"never"`,
keeping the rest of the test unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7dac8a54-a694-4bec-b8c3-01e482633db8

📥 Commits

Reviewing files that changed from the base of the PR and between 81b5673 and 83f951b.

📒 Files selected for processing (10)
  • triton_backend/all_models/disaggregated_serving/disaggregated_serving_bls/config.pbtxt
  • triton_backend/all_models/gpt/ensemble/config.pbtxt
  • triton_backend/all_models/gpt/tensorrt_llm/config.pbtxt
  • triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py
  • triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt
  • triton_backend/all_models/multimodal/ensemble/config.pbtxt
  • triton_backend/inflight_batcher_llm/tests/utilsTest.cpp

Comment thread triton_backend/inflight_batcher_llm/tests/utilsTest.cpp Outdated
@tensorrt-cicd (Collaborator):
PR_Github #46518 [ run ] triggered by Bot. Commit: 83f951b Link to invocation

@tensorrt-cicd (Collaborator):
PR_Github #46518 [ run ] completed with state SUCCESS. Commit: 83f951b
/LLM/main/L0_MergeRequest_PR pipeline #36577 completed with status: 'SUCCESS'

CI Report

Link to invocation

@jhaotingc jhaotingc enabled auto-merge (squash) May 1, 2026 21:43
@jhaotingc jhaotingc force-pushed the jhaotingc/fix-triton-early-stopping branch from 83f951b to e72c92c Compare May 1, 2026 21:51
… + BLS

After PR NVIDIA#13633 plumbed promptIgnoreLength correctly, early_stopping
is the only sampling field that has two remaining issues:

1. Type mismatch between Triton config declaration (TYPE_BOOL, 1 byte)
   and the C++ extraction in getSamplingConfigFromTensors
   (extractOptionalSingleton<int32_t>, reads 4 bytes). The mismatch
   works accidentally for {0, 1} (adjacent memory is zero), but cannot
   represent the executor's documented value 2 ("stop only when all
   beams emit <eos>"), and is undefined behavior in principle.

2. early_stopping is missing entirely from the ensemble + BLS configs
   in five places, so clients hitting Triton via `ensemble`,
   `tensorrt_llm_bls`, `multimodal/ensemble`, or `gpt/ensemble`
   cannot set early_stopping at all. This pre-dates PR NVIDIA#8127.

This change fixes both issues:

Type fix (BREAKING for clients sending early_stopping as bool):
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt:
  data_type TYPE_BOOL -> TYPE_INT32
- triton_backend/all_models/disaggregated_serving/disaggregated_serving_bls/config.pbtxt:
  same.
  Aligns the wire-protocol declaration with executor::SamplingConfig
  semantics (std::optional<SizeType32> accepting 0/1/2). Clients
  previously sending numpy bool must now send numpy int32; behavior
  for values 0 and 1 is preserved.

Ensemble + BLS plumbing (additive, no compat impact):
- triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt:
  declare optional INT32 early_stopping input + add input_map block
  forwarding it to the tensorrt_llm step.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt:
  declare optional INT32 early_stopping input.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py:
  add early_stopping field to the Request dataclass.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py:
  add early_stopping to both input_names lists and the BLS->engine
  name mapping.
- triton_backend/all_models/multimodal/ensemble/config.pbtxt: declare
  + input_map.
- triton_backend/all_models/gpt/ensemble/config.pbtxt: declare +
  input_map.
- triton_backend/all_models/gpt/tensorrt_llm/config.pbtxt: declare
  (was missing entirely).
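
As a sketch, the declarations described above look roughly like the following protobuf-text fragments (paraphrased; the real config.pbtxt entries sit among many other inputs and may differ in surrounding detail):

```
# In each model's config.pbtxt: declare the optional INT32 input.
input [
  {
    name: "early_stopping"
    data_type: TYPE_INT32
    dims: [ 1 ]
    optional: true
  }
]

# In the ensemble scheduling step for tensorrt_llm: forward the input through.
input_map {
  key: "early_stopping"
  value: "early_stopping"
}
```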

The Python tensorrt_llm/1/model.py already forwards early_stopping
to trtllm.SamplingConfig kwargs; only the wire-protocol declaration
was wrong.
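
That forwarding step can be sketched as follows. This is a hypothetical reduction of the model.py pattern; get_optional_scalar and build_sampling_kwargs are invented names for illustration, not the real API:

```python
import numpy as np

def get_optional_scalar(tensors, name):
    """Hypothetical helper: return the first element of an optional input, or None."""
    t = tensors.get(name)
    return None if t is None else int(np.asarray(t).flat[0])

def build_sampling_kwargs(tensors):
    # Only include early_stopping when the client actually sent it, so the
    # executor's default applies otherwise.
    kwargs = {}
    es = get_optional_scalar(tensors, "early_stopping")
    if es is not None:
        kwargs["early_stopping"] = es
    return kwargs

assert build_sampling_kwargs({"early_stopping": np.array([[2]], dtype=np.int32)}) == {"early_stopping": 2}
assert build_sampling_kwargs({}) == {}
```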

Verified end-to-end on TinyLlama-1.1B with beam_width=4, prompt
"Hello world. Goodbye." for all three Triton entry points:

  Path 1 (direct tensorrt_llm):
    early_stopping=0 -> beam_lens=[60, 60, 54, 60]
    early_stopping=1 -> beam_lens=[3, 0, 2, 1]

  Path 2 (ensemble):
    early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
    early_stopping=1 -> "Hello world." (3 tokens)

  Path 3 (tensorrt_llm_bls):
    early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."
    early_stopping=1 -> "Hello world." (3 tokens)

All three paths now honor early_stopping correctly.

Signed-off-by: Jhao-Ting Chen <jtchen0528@gmail.com>
@jhaotingc jhaotingc force-pushed the jhaotingc/fix-triton-early-stopping branch from e72c92c to 37311ba Compare May 1, 2026 21:52
@jhaotingc (Collaborator, Author):
/bot run

@tensorrt-cicd (Collaborator):
PR_Github #46527 [ run ] triggered by Bot. Commit: 37311ba Link to invocation

@tensorrt-cicd (Collaborator):
PR_Github #46527 [ run ] completed with state SUCCESS. Commit: 37311ba
/LLM/main/L0_MergeRequest_PR pipeline #36586 completed with status: 'SUCCESS'

CI Report

Link to invocation

@jhaotingc jhaotingc merged commit fb0efdd into NVIDIA:main May 2, 2026
6 checks passed