[None][fix] Plumb promptIgnoreLength through Triton backend to fix silently-dropped lengthPenalty and earlyStopping #13633
Conversation
@CodeRabbit title
[None][fix] Plumb promptIgnoreLength through Triton backend to fix silently-dropped lengthPenalty and earlyStopping
Force-pushed from 149456d to 7eadbad
/bot run --disable-fail-fast
📝 Walkthrough: This change introduces support for a new optional `prompt_ignore_length` parameter in inference requests, plumbed from the Triton model configs through to `executor::SamplingConfig`.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes. Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning).
✅ Passed checks (4 passed)
🧹 Nitpick comments (1)
triton_backend/inflight_batcher_llm/tests/utilsTest.cpp (1)
391-391: ⚡ Quick win: Add one omitted-field regression test.
`getRequest()` now always injects `promptIgnoreLength`, so this test no longer exercises the backward-compatible path where the tensor is absent and `SamplingConfig::promptIgnoreLength` remains unset. If you'd like, I can draft the extra regression test.
Also applies to: 589-589
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@triton_backend/inflight_batcher_llm/tests/utilsTest.cpp` at line 391, The test currently always injects InputFieldsNames::promptIgnoreLength via pushTensor which prevents exercising the backward-compatible path; add an extra regression test in utilsTest.cpp that builds inputsTensors without calling pushTensor for InputFieldsNames::promptIgnoreLength, call getRequest() and assert that the resulting SamplingConfig::promptIgnoreLength is still unset/has the default state (e.g., optional empty or sentinel), mirroring the existing test structure but omitting the promptIgnoreLength tensor to verify legacy behavior; reference getRequest(), pushTensor, InputFieldsNames::promptIgnoreLength, and SamplingConfig::promptIgnoreLength when locating where to add the new test.
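For reference, a rough sketch of the omitted-field regression test suggested above. The fixture name and the `makeBaseInputTensors`/`getRequest` call shapes are hypothetical placeholders inferred from the comment, not the actual `utilsTest.cpp` fixture, so treat this as an outline rather than a drop-in test.

```cpp
// Sketch only: helper names and signatures are placeholders inferred from the
// review comment; adapt them to the real utilsTest.cpp fixture.
TEST_F(UtilsTest, PromptIgnoreLengthStaysUnsetWhenTensorIsAbsent)
{
    // Build the usual set of input tensors, but deliberately do NOT push
    // InputFieldsNames::promptIgnoreLength.
    auto inputsTensors = makeBaseInputTensors(); // hypothetical helper

    auto request = getRequest(inputsTensors); // hypothetical call shape

    // Backward-compatible path: the field must remain unset, not default to 0.
    EXPECT_FALSE(request.getSamplingConfig().getPromptIgnoreLength().has_value());
}
```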
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9bc76f60-8342-4001-b503-ecd37113922c
📒 Files selected for processing (9)
- triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt
- triton_backend/inflight_batcher_llm/src/utils.cc
- triton_backend/inflight_batcher_llm/src/utils.h
- triton_backend/inflight_batcher_llm/tests/utilsTest.cpp
PR_Github #46277 [ run ] triggered by Bot. Commit:
PR_Github #46277 [ run ] completed with state
Force-pushed from 7eadbad to 58a6589
/bot run
PR_Github #46414 [ run ] triggered by Bot. Commit:
/bot kill
/bot help
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Force-pushed from 58a6589 to 77e52ed
/bot kill
PR_Github #46415 [ kill ] triggered by Bot. Commit:
PR_Github #46415 [ kill ] completed with state
Force-pushed from 77e52ed to c15b7e2
/bot run
PR_Github #46417 [ kill ] triggered by Bot. Commit:
PR_Github #46417 [ kill ] completed with state
/bot run
PR_Github #46419 [ run ] triggered by Bot. Commit:
PR_Github #46420 [ run ] triggered by Bot. Commit:
PR_Github #46419 [ run ] completed with state
PR_Github #46420 [ run ] completed with state
Force-pushed from c15b7e2 to 01c4613
The executor::SamplingConfig constructor takes 20 positional std::optional parameters, with promptIgnoreLength (added in PR NVIDIA#8127) at position 14. The Triton backend's getSamplingConfigFromTensors() in utils.cc was only passing 17 args, omitting promptIgnoreLength. This compiled silently because std::optional<float> implicitly converts to std::optional<int32> via the contained type, so all params from position 14 onward shifted: the caller's lengthPenalty bound to promptIgnoreLength, earlyStopping bound to lengthPenalty, and so on. As a result, length_penalty and early_stopping sent over Triton (gRPC/HTTP) were silently ignored, and prompt_ignore_length had no way to be set at all.

This change adds full plumbing for prompt_ignore_length so callers can configure it from the Triton client all the way through to the executor and the penalty kernels:

- triton_backend/inflight_batcher_llm/src/utils.{h,cc}: declare the new InputFieldsNames::promptIgnoreLength input field, extract it from input tensors via extractOptionalSingleton<int32_t>, and pass it into executor::SamplingConfig at position 14 (replacing the silent default).
- triton_backend/inflight_batcher_llm/tests/utilsTest.cpp: extend the extractSingleton fixture to push a prompt_ignore_length tensor and assert SamplingConfig::getPromptIgnoreLength() round-trips correctly.
- Triton model configs in all_models/: declare an optional INT32 prompt_ignore_length input on every model that already exposes len_penalty (the sibling sampling field), and add the corresponding ensemble input_map entry where applicable:
  inflight_batcher_llm/{tensorrt_llm,tensorrt_llm_bls,ensemble}/config.pbtxt
  disaggregated_serving/disaggregated_serving_bls/config.pbtxt
  gpt/{tensorrt_llm,ensemble}/config.pbtxt
  multimodal/ensemble/config.pbtxt
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py: forward prompt_ignore_length from the request to trtllm.SamplingConfig kwargs (covers both Triton+engine and Triton+LLMAPI flows).
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/{decode,triton_decoder}.py: add prompt_ignore_length to the BLS Request dataclass, the input list, and the BLS->engine name mapping.

Backward compatibility: prompt_ignore_length is an optional input. When the tensor is not provided, getSamplingConfigFromTensors yields std::nullopt, matching the previous (already broken) default behavior, but now also correctly forwarding lengthPenalty/earlyStopping/etc. to their intended SamplingConfig slots.

Signed-off-by: Jhao-Ting Chen <jtchen0528@gmail.com>
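To make the failure mode concrete, here is a small, self-contained C++ sketch of the hazard the commit describes. MiniSamplingConfig and its four parameters are hypothetical stand-ins (the real executor::SamplingConfig takes 20), but the mechanism is the same: omitting one argument mid-list still compiles because std::optional's converting constructor accepts the neighbouring type.

```cpp
#include <cstdint>
#include <iostream>
#include <optional>

// Hypothetical analog of a constructor with many defaulted std::optional
// parameters; NOT the real executor::SamplingConfig signature.
struct MiniSamplingConfig
{
    MiniSamplingConfig(std::optional<float> frequencyPenalty = std::nullopt,
        std::optional<int32_t> promptIgnoreLength = std::nullopt, // newly inserted field
        std::optional<float> lengthPenalty = std::nullopt,
        std::optional<int32_t> earlyStopping = std::nullopt)
        : mPromptIgnoreLength(promptIgnoreLength)
        , mLengthPenalty(lengthPenalty)
        , mEarlyStopping(earlyStopping)
    {
        (void) frequencyPenalty; // not stored; present only to mirror the parameter list shape
    }

    std::optional<int32_t> mPromptIgnoreLength;
    std::optional<float> mLengthPenalty;
    std::optional<int32_t> mEarlyStopping;
};

int main()
{
    std::optional<float> frequencyPenalty = 0.0F;
    std::optional<float> lengthPenalty = 2.0F; // caller intends lengthPenalty = 2.0
    std::optional<int32_t> earlyStopping = 1;  // caller intends earlyStopping = 1

    // Caller written against the old parameter list: promptIgnoreLength is omitted,
    // yet this still compiles. lengthPenalty (optional<float>) binds to the
    // optional<int32_t> promptIgnoreLength slot through std::optional's converting
    // constructor, earlyStopping binds to lengthPenalty, and the real earlyStopping
    // slot silently keeps its std::nullopt default.
    MiniSamplingConfig config(frequencyPenalty, lengthPenalty, earlyStopping);

    std::cout << "promptIgnoreLength = " << config.mPromptIgnoreLength.value_or(-1) // 2 (truncated from 2.0)
              << ", lengthPenalty = " << config.mLengthPenalty.value_or(-1.0F)      // 1 (the earlyStopping value)
              << ", earlyStopping set = " << config.mEarlyStopping.has_value()      // 0 (never set)
              << '\n';
    return 0;
}
```

Built with any C++17 compiler, this prints promptIgnoreLength = 2, lengthPenalty = 1, earlyStopping set = 0, which is the same silent positional shift the commit message describes.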
Force-pushed from 01c4613 to bafcf33
/bot run
PR_Github #46462 [ run ] triggered by Bot. Commit:
PR_Github #46462 [ run ] completed with state
… + BLS

After PR NVIDIA#13633 plumbed promptIgnoreLength correctly, early_stopping is the only sampling field that has two remaining issues:

1. Type mismatch between the Triton config declaration (TYPE_BOOL, 1 byte) and the C++ extraction in getSamplingConfigFromTensors (extractOptionalSingleton<int32_t>, reads 4 bytes). The mismatch works accidentally for {0, 1} (adjacent memory is zero), but cannot represent the executor's documented value 2 ("stop only when all beams emit <eos>"), and is undefined behavior in principle.
2. early_stopping is missing entirely from the ensemble + BLS configs in five places, so clients hitting Triton via `ensemble`, `tensorrt_llm_bls`, `multimodal/ensemble`, or `gpt/ensemble` cannot set early_stopping at all. This pre-dates PR NVIDIA#8127.

This change fixes both issues.

Type fix (BREAKING for clients sending early_stopping as bool):
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt: data_type TYPE_BOOL -> TYPE_INT32
- triton_backend/all_models/disaggregated_serving/disaggregated_serving_bls/config.pbtxt: same.
Aligns the wire-protocol declaration with executor::SamplingConfig semantics (std::optional<SizeType32> accepting 0/1/2). Clients previously sending numpy bool must now send numpy int32; behavior for values 0 and 1 is preserved.

Ensemble + BLS plumbing (additive, no compat impact):
- triton_backend/all_models/inflight_batcher_llm/ensemble/config.pbtxt: declare optional INT32 early_stopping input + add input_map block forwarding it to the tensorrt_llm step.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/config.pbtxt: declare optional INT32 early_stopping input.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/decode.py: add early_stopping field to the Request dataclass.
- triton_backend/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/lib/triton_decoder.py: add early_stopping to both input_names lists and the BLS->engine name mapping.
- triton_backend/all_models/multimodal/ensemble/config.pbtxt: declare + input_map.
- triton_backend/all_models/gpt/ensemble/config.pbtxt: declare + input_map.
- triton_backend/all_models/gpt/tensorrt_llm/config.pbtxt: declare (was missing entirely).

The Python tensorrt_llm/1/model.py already forwards early_stopping to trtllm.SamplingConfig kwargs; only the wire-protocol declaration was wrong.

Verified end-to-end on TinyLlama-1.1B with beam_width=4, prompt "Hello world. Goodbye." for all three Triton entry points:

Path 1 (direct tensorrt_llm): early_stopping=0 -> beam_lens=[60, 60, 54, 60]; early_stopping=1 -> beam_lens=[3, 0, 2, 1]
Path 2 (ensemble): early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."; early_stopping=1 -> "Hello world." (3 tokens)
Path 3 (tensorrt_llm_bls): early_stopping=0 -> 60-token continuation of "Hello world. Goodbye..."; early_stopping=1 -> "Hello world." (3 tokens)

All three paths now honor early_stopping correctly.

Signed-off-by: Jhao-Ting Chen <jtchen0528@gmail.com>
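For illustration, a minimal self-contained sketch of the size mismatch behind the type fix. extractAndCheck is a hypothetical stand-in for the backend's extraction helper (not the real extractOptionalSingleton); the 1-byte vs 4-byte figures follow the Triton datatype sizes quoted above.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a singleton-tensor extraction helper: it insists
// that the payload is exactly sizeof(T) bytes before copying the value out.
template <typename T>
T extractAndCheck(std::vector<uint8_t> const& payload)
{
    if (payload.size() != sizeof(T))
    {
        throw std::runtime_error("tensor byte size does not match extracted type");
    }
    T value{};
    std::memcpy(&value, payload.data(), sizeof(T));
    return value;
}

int main()
{
    // With the old TYPE_BOOL declaration the wire payload is a single byte, so a
    // 4-byte int32_t read would run past the tensor, and the payload can never
    // carry the executor's documented value 2.
    std::vector<uint8_t> boolPayload = {1};

    // With TYPE_INT32 the payload is 4 bytes and all of 0/1/2 round-trip.
    int32_t earlyStopping = 2;
    std::vector<uint8_t> int32Payload(sizeof(int32_t));
    std::memcpy(int32Payload.data(), &earlyStopping, sizeof(int32_t));

    try
    {
        extractAndCheck<int32_t>(boolPayload); // size mismatch: 1 byte vs 4
    }
    catch (std::exception const& e)
    {
        std::cout << "bool-declared payload rejected: " << e.what() << '\n';
    }
    std::cout << "int32-declared payload: " << extractAndCheck<int32_t>(int32Payload) << '\n';
    return 0;
}
```

With the config declared as TYPE_INT32, the client puts four bytes on the wire and values 0, 1, and 2 all survive the trip; a TYPE_BOOL declaration leaves only one byte and only {0, 1} representable.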
Summary by CodeRabbit
New Features
New optional `prompt_ignore_length` parameter in inference requests, enabling customizable prompt length handling throughout the inference pipeline.
Tests
Added coverage for `prompt_ignore_length` parameter validation.
Description
The customer reported that on TRT-LLM v1.2.0rc4 (and confirmed identical on v1.2.0 GA),
`length_penalty` and `early_stopping` were silently ignored when sent through the Triton C++ backend. Root cause traced to `triton_backend/inflight_batcher_llm/src/utils.cc::getSamplingConfigFromTensors()` passing 17 positional args to `executor::SamplingConfig(...)` while the constructor signature in `cpp/include/tensorrt_llm/executor/executor.h` had 20 parameters with `promptIgnoreLength` (added by PR #8127) at position 14 between `frequencyPenalty` and `lengthPenalty`. Because all parameters are `std::optional<...>` and `std::optional<float>` implicitly converts to `std::optional<int32>` via the contained type, the call compiled silently but all params from #14 onward shifted positions:

| Caller's argument | Actually bound to |
| --- | --- |
| `lengthPenalty` (#14) | `promptIgnoreLength` (lossy float→int) |
| `earlyStopping` (#15) | `lengthPenalty` |
| `noRepeatNgramSize` (#16) | `earlyStopping` |
| `numReturnSequences` (#17) | `noRepeatNgramSize` |

`numReturnSequences`, `minP`, and `beamWidthArray` defaulted to `std::nullopt`.
Test Coverage
Test 1:
`length_penalty` — E2E PROVEN ✓
Setup: prompt
"The capital of France is Paris. The capital of Germany is Berlin. The capital of Japan is",beam_width=4, max_tokens=120, early_stopping=True, varylen_penaltybetween0.1and5.0.len_penalty''(just<eos>)'.C. The capital of Australia is Canberra. The capital of New Zealand is Wellington...'output_idsdiffer: ✓. The chosen beam is dramatically different — empty vs 100-token continuation — provinglength_penaltyis now consumed at slot 15 ofexecutor::SamplingConfiginstead of being silently dropped.Test 2:
`early_stopping` — E2E PROVEN ✓
Setup: prompt
"Hello world. Goodbye.",beam_width=4, max_tokens=60, len_penalty=1.0, varyearly_stoppingbetweenFalseandTrue.early_stoppingFalse(run to max_tokens / heuristic)'world. Goodbye. Hello world. Goodbye. Hello world. Goodbye...'True(stop on first<eos>)''output_idsdiffer: ✓. Withearly_stopping=Falsethe beams continued for 47-53 tokens; withTruethey all terminated immediately at<eos>. This provesearly_stoppingis now consumed at slot 16 ofexecutor::SamplingConfig.Test 3:
`prompt_ignore_length` — UNIT-TEST PROVEN ✓ (e2e demo inconclusive on TinyLlama)
Setup: various prompts (
"a a a a...","the the...","banana banana...","France is a country. France is in Europe. France is famous for..."), variedrepetition_penalty1.3-10.0,prompt_ignore_lengthfromNonetoprompt_len.Outcome: no observable behavioral divergence at the e2e level on TinyLlama-Chat-1.1B. The model emits
`<eos>` as the first generated token whenever the prompt contains many repetitions of a token, regardless of `prompt_ignore_length`. This is a property of TinyLlama-Chat (which is heavily fine-tuned for short turn-by-turn dialogue) rather than a defect in the plumbing. Verifying behavioral effect with a non-chat-tuned base model would require building another engine.
Plumbing IS verified at multiple lower layers, and these mechanically determine the e2e behavior:
- `executor::SamplingConfig` extraction: `triton_backend/inflight_batcher_llm/tests/utilsTest::extractSingleton` (extended in this PR to push `prompt_ignore_length=7`): `getPromptIgnoreLength().value() == 7` ✓
- `executor::SamplingConfig` getter/setter API: `cpp/tests/unit_tests/executor/samplingConfigTest::getterSetter`: `setPromptIgnoreLength(1)` round-trips ✓ (see the sketch below)
- `runtime::SamplingConfig` field: `cpp/tests/unit_tests/runtime/samplingConfigTest::validInputs`
- Penalty kernel: `cpp/tests/unit_tests/kernels/sampling/samplingPenaltyTest::PenaltyTypeFullWithPartialPromptIgnore` (added by upstream PR #8127)
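For reference, a minimal sketch of the round-trip the two executor-level rows describe. It assumes GTest and the `setPromptIgnoreLength`/`getPromptIgnoreLength` accessors named above; the default-constructed config and the header path are assumptions, not lines copied from the repository's test.

```cpp
#include <gtest/gtest.h>

#include "tensorrt_llm/executor/executor.h" // assumed header location for executor::SamplingConfig

namespace texec = tensorrt_llm::executor;

// Sketch in the spirit of samplingConfigTest::getterSetter: set
// promptIgnoreLength and read it back through the public accessors.
TEST(SamplingConfigSketch, PromptIgnoreLengthRoundTrip)
{
    texec::SamplingConfig config; // assumption: defaults leave promptIgnoreLength unset

    EXPECT_FALSE(config.getPromptIgnoreLength().has_value());

    config.setPromptIgnoreLength(7);
    ASSERT_TRUE(config.getPromptIgnoreLength().has_value());
    EXPECT_EQ(config.getPromptIgnoreLength().value(), 7);
}
```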
getSamplingConfigFromTensors→SamplingConfig::mPromptIgnoreLength→ penalty kernel skips N tokens. Each link is unit-tested. The Triton model repo also accepted theprompt_ignore_lengthinput tensor (INT32) without protocol error, confirming theconfig.pbtxtdeclaration is valid.Summary
- `length_penalty`: E2E proven ('' vs 100-token text)
- `early_stopping`: E2E proven
- `prompt_ignore_length`: unit-test proven (TinyLlama-Chat emits `<eos>` too aggressively for a conclusive e2e demo)
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.