[TRTLLM-11523][feat] Handle different chat template types#12336
2ez4bz merged 1 commit into NVIDIA:main from
Conversation
📝 Walkthrough

This PR introduces a content-format-driven system to replace hardcoded chat template exceptions.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Chat as Chat Handler
    participant Registry as Placeholder Registry
    participant Resolver as Content Format Resolver
    participant Template as Template Processor
    participant Placeholder as Placeholder Manager
    Chat->>Registry: get_content_format(model_type)
    Registry-->>Chat: ContentFormat (OPENAI/STRING/PASSTHROUGH/None)
    Chat->>Resolver: _resolve_content_format(format, template)
    alt Explicit Format
        Resolver-->>Chat: ContentFormat (explicit)
    else Auto-detect (None)
        Resolver->>Template: parse_jinja_template(template)
        Template->>Template: _ast_has_content_iteration()
        alt Has multimodal iteration
            Template-->>Resolver: OPENAI
        else Plain content
            Template-->>Resolver: STRING
        end
        Resolver-->>Chat: Detected ContentFormat
    end
    alt PASSTHROUGH
        Chat->>Template: skip_template_processing()
    else OPENAI
        Chat->>Placeholder: _build_openai_content(message)
        Placeholder-->>Chat: rebuild content with dicts
        Chat->>Template: render_with_openai_dicts()
    else STRING
        Chat->>Placeholder: interleave_mm_placeholders()
        Placeholder-->>Chat: inject placeholders at media positions
        Chat->>Template: render_with_placeholders()
    end
    Template-->>Chat: rendered_output
```
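The three dispatch branches in the diagram can be sketched as a toy Python function. `ContentFormat` matches the enum named in the PR, but `render_message` and its placeholder scheme are illustrative stand-ins, not the actual TensorRT-LLM code:

```python
from enum import Enum


class ContentFormat(Enum):
    OPENAI = "openai"            # template consumes OpenAI-style typed dicts
    STRING = "string"            # template expects plain strings with placeholders
    PASSTHROUGH = "passthrough"  # template processing is skipped entirely


def render_message(fmt: ContentFormat, text: str, media: list) -> object:
    """Toy dispatch mirroring the diagram's three branches."""
    if fmt is ContentFormat.PASSTHROUGH:
        return text  # skip template processing
    if fmt is ContentFormat.OPENAI:
        # rebuild content as a list of typed dicts for the template to iterate
        return [{"type": "text", "text": text}] + [
            {"type": "image_url", "image_url": m} for m in media
        ]
    # STRING: inject a placeholder for each media item into the flat string
    return "".join(f"<{m}>" for m in media) + text
```

The real code renders through the chat template after this step; the sketch only shows how the branch on `ContentFormat` decides what shape of content reaches the template.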
Actionable comments posted: 3
🧹 Nitpick comments (3)
tensorrt_llm/inputs/utils.py (3)
732-741: Consider extracting modality inference logic to a shared helper.

The logic to infer modality from the placeholder string (checking for "video", "audio", "so_embedding" in lowercase) is duplicated between `_build_openai_content` (lines 734-739) and `interleave_mm_placeholders` (lines 577-584). Extracting this into a small helper function would improve maintainability.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/inputs/utils.py` around lines 732 - 741, Extract the modality inference logic duplicated in _build_openai_content and interleave_mm_placeholders into a small helper function (e.g., infer_modality_from_placeholder) that accepts a placeholder string and returns "image", "video", or "audio"; replace the inline checks in both functions to call this helper (use mm_placeholder_count and placeholder variables as inputs) so both sites use the same implementation and avoid duplicated lowercase checks for "video", "audio", and "so_embedding".
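A minimal sketch of the helper this comment suggests; the name `infer_modality_from_placeholder` comes from the review itself, and mapping "so_embedding" to audio is an assumption made for illustration:

```python
def infer_modality_from_placeholder(placeholder: str) -> str:
    """Map a placeholder token to its modality via lowercase substring checks.

    The substrings mirror the ones the review flags as duplicated; the
    "so_embedding" -> audio mapping is assumed, not taken from the code.
    """
    lowered = placeholder.lower()
    if "video" in lowered:
        return "video"
    if "audio" in lowered or "so_embedding" in lowered:
        return "audio"
    return "image"  # default modality when no other substring matches
```

Both call sites would then replace their inline checks with a call to this helper, so any future placeholder naming change lands in one place.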
663-670: Potential false positive with substring matching in placeholder count.

`rendered_text.count(placeholder)` may over-count if the placeholder string appears as a substring of another token. For example, if the placeholder is `"<image>"` and the text contains `"<image_special>"`, both would be counted. Consider using a regex with word boundaries or a more precise matching approach if placeholder substrings are a concern in practice.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/inputs/utils.py` around lines 663 - 670, The current count using rendered_text.count(placeholder) can over-count when a placeholder appears as a substring of another token; change the counting to use a regex-based exact-token match: escape the placeholder with re.escape and build a pattern that matches the placeholder as a standalone token (e.g., using lookarounds like (?<!\S) and (?!\S) or appropriate word-boundary logic), then compute actual_count = len(re.findall(pattern, rendered_text)). Update the block that references placeholder, expected_count, actual_count, rendered_text, and model_type to use this regex count (and add import re if needed).
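A sketch of the regex-based count the prompt describes, using the suggested `(?<!\S)`/`(?!\S)` lookarounds. Note the caveat: these also skip back-to-back placeholders written with no separator, which may or may not be acceptable for the real code:

```python
import re


def count_exact_placeholders(rendered_text: str, placeholder: str) -> int:
    """Count placeholder occurrences only when they stand alone.

    The lookarounds reject matches that touch adjacent non-whitespace, so a
    placeholder embedded in a longer token is not counted.
    """
    pattern = rf"(?<!\S){re.escape(placeholder)}(?!\S)"
    return len(re.findall(pattern, rendered_text))
```

The `re.escape` call matters because placeholders such as `<|image|>` contain regex metacharacters.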
800-804: Add `strict=True` to `zip()` to catch length mismatches.

The `zip()` call assumes `conversation` and `mm_placeholder_counts` have the same length. Adding `strict=True` (Python 3.10+) would raise a `ValueError` if the lengths differ, catching potential bugs early rather than silently truncating.

```diff
-        for conv, mm_placeholder_count in zip(conversation,
-                                              mm_placeholder_counts):
+        for conv, mm_placeholder_count in zip(conversation,
+                                              mm_placeholder_counts,
+                                              strict=True):
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/inputs/utils.py` around lines 800 - 804, The zip over conversation and mm_placeholder_counts in the loop can silently truncate if lengths differ; update the zip(...) call used with conversation and mm_placeholder_counts inside the loop (where _build_openai_content is invoked) to use zip(..., strict=True) so a ValueError is raised on length mismatch, catching bugs early (requires Python 3.10+).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/inputs/content_format.py`:
- Around line 109-116: Replace the insecure and overly broad parse block by
constructing the Jinja2 Environment with autoescaping enabled (e.g., call
Environment(autoescape=True) or use select_autoescape) instead of Environment(),
and narrow the except to only the specific parse exception (catch
jinja2.exceptions.TemplateSyntaxError) so that _ast_has_content_iteration(ast)
is still checked and on TemplateSyntaxError the function falls back to returning
STRING; update imports to reference TemplateSyntaxError and adjust the
try/except around env.parse accordingly while leaving the ContentFormat.OPENAI
return intact.
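A minimal sketch of the narrowed parse path this comment asks for, assuming `jinja2` is available. `_ast_has_content_iteration` is stood in by a trivial "any for-loop" check rather than the project's real AST walk, and the string return values are placeholders for the `ContentFormat` enum:

```python
import jinja2
from jinja2 import nodes
from jinja2.exceptions import TemplateSyntaxError


def _ast_has_content_iteration(ast) -> bool:
    # Stand-in: the real helper inspects loops over message content;
    # here any {% for %} node counts as "content iteration".
    return any(True for _ in ast.find_all(nodes.For))


def detect_content_format(chat_template: str) -> str:
    env = jinja2.Environment(autoescape=True)  # autoescaping enabled, per the review
    try:
        ast = env.parse(chat_template)
    except TemplateSyntaxError:
        return "string"  # only genuine syntax errors trigger the fallback now
    return "openai" if _ast_has_content_iteration(ast) else "string"
```

Catching only `TemplateSyntaxError` means unrelated errors (e.g. a `TypeError` from a bad argument) surface instead of being swallowed into the STRING fallback.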
In `@tensorrt_llm/serve/chat_utils.py`:
- Around line 316-324: parse_chat_messages_coroutines currently resolves content
format using only MULTIMODAL_PLACEHOLDER_REGISTRY (defaulting to
ContentFormat.STRING), which can conflict with _resolve_content_format
(inputs/utils.py) that auto-detects Jinja/OPENAI formats; update
parse_chat_messages_coroutines to accept either the chat_template or a
pre-resolved content_format flag and call _resolve_content_format (or use the
passed format) instead of registry-only logic so its placeholder strategy
matches apply_chat_template and _build_openai_content, or alternatively add a
clear comment/docstring in parse_chat_messages_coroutines explaining why
registry-only fallback is safe and referencing MULTIMODAL_PLACEHOLDER_REGISTRY
and _resolve_content_format.
In `@tests/unittest/inputs/test_content_format.py`:
- Around line 92-98: The test_caching currently only checks identity (result1 is
result2) which will pass due to enum singletons; instead verify the actual
lru_cache behavior by inspecting detect_content_format.cache_info() before and
after calls: call detect_content_format.cache_info() to capture initial
hits/misses, invoke detect_content_format(template) twice (or once then again)
and assert that cache_info().hits increased (or misses decreased appropriately)
to prove a cache hit, and optionally call detect_content_format.cache_clear() at
the start to ensure a clean state; reference the test method name test_caching
and the function detect_content_format and its cache_info()/cache_clear()
methods when making the changes.
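A sketch of the strengthened test, using a stand-in `lru_cache`-decorated detector so the example is self-contained; only the `cache_info()`/`cache_clear()` pattern is the point:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def detect_content_format(template: str) -> str:
    # Stand-in detector; only the caching behaviour matters here.
    return "openai" if "for" in template else "string"


def test_caching():
    detect_content_format.cache_clear()  # start from a clean cache
    before = detect_content_format.cache_info()
    detect_content_format("{{ messages }}")  # miss: populates the cache
    detect_content_format("{{ messages }}")  # hit: served from the cache
    after = detect_content_format.cache_info()
    assert after.misses == before.misses + 1
    assert after.hits == before.hits + 1


test_caching()
```

Unlike an identity check on the returned enum members, this asserts directly that the second call was served from the cache.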
📒 Files selected for processing (11)

- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_nemotron_flash.py
- tensorrt_llm/_torch/models/modeling_llava_next.py
- tensorrt_llm/_torch/models/modeling_mistral.py
- tensorrt_llm/_torch/models/modeling_vila.py
- tensorrt_llm/inputs/__init__.py
- tensorrt_llm/inputs/content_format.py
- tensorrt_llm/inputs/registry.py
- tensorrt_llm/inputs/utils.py
- tensorrt_llm/serve/chat_utils.py
- tests/unittest/inputs/test_chat_template_dispatch.py
- tests/unittest/inputs/test_content_format.py
/bot run

PR_Github #39642 [ run ] triggered by Bot.
jdebache left a comment

Is it correct that this will simply preserve the existing behaviour for Mistral models?
PR_Github #39642 [ run ] completed with state
@hypdeb that is definitely the intent. I will run some A/B tests for some of the models to print the rendered prompts before and after the change and update the PR description.
JunyiXu-nv left a comment

Overall LGTM. Are there any e2e tests that could validate these changes?
/bot run --disable-fail-fast

PR_Github #41499 [ run ] triggered by Bot.

PR_Github #41499 [ run ] completed with state

/bot run

PR_Github #41587 [ run ] triggered by Bot.

PR_Github #41587 [ run ] completed with state

/bot run

PR_Github #41598 [ run ] triggered by Bot.

PR_Github #41598 [ run ] completed with state
* Why?

  Previously, the multimodal placeholder insertion was dictated by hardcoded exception lists, which add cognitive burden when onboarding new models, and did not account for the fact that different versions of the same model architecture could have different types of chat templates that require different handling. In addition, all placeholders were either added before or after the text, instead of possibly interleaved.

* What?

  This commit addresses the above gaps by mimicking what is done in vLLM. To that end, it:

  1. introduces content format detection based on Jinja AST inspection of the chat template.
  2. preserves the interleaved positions of text and media items during message parsing.
  3. dispatches to the appropriate logic based on the (possibly auto-detected) content format before applying the chat template: either the template handles multimodal content natively (OpenAI-style dicts), or expects plain strings with placeholders pre-inserted.
  4. inserts the multimodal placeholders.
  5. validates that the expected count is met.

  Models can also explicitly declare their `content_format` during registration if they are only meant to support one.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
/bot run

PR_Github #41613 [ run ] triggered by Bot.

PR_Github #41613 [ run ] completed with state

/bot run

PR_Github #41639 [ run ] triggered by Bot.

PR_Github #41639 [ run ] completed with state

/bot run

PR_Github #41714 [ run ] triggered by Bot.
amukkara left a comment

Approving phi model changes
PR_Github #41714 [ run ] completed with state

/bot run

PR_Github #41758 [ run ] triggered by Bot.

PR_Github #41758 [ run ] completed with state
Description

Previously, the multimodal placeholder insertion was dictated by hardcoded exception lists, which add cognitive burden when onboarding new models, and did not account for the fact that different versions of the same model architecture could have different types of chat templates that require different handling.

In addition, all placeholders were either added before or after the text, instead of possibly interleaved.

This commit addresses the above gaps by mimicking what is done in vLLM. To that end, it:

1. introduces content format detection based on Jinja AST inspection of the chat template.
2. preserves the interleaved positions of text and media items during message parsing.
3. dispatches to the appropriate logic based on the (possibly auto-detected) content format before applying the chat template: either the template handles multimodal content natively (OpenAI-style dicts), or expects plain strings with placeholders pre-inserted.
4. inserts the multimodal placeholders.
5. validates that the expected count is met.

Models can also explicitly declare their `content_format` during registration if they are only meant to support one.

Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.