[TRTLLM-11974][feat] Handle multimodal placeholder expansion in token space for Nemotron Nano (#13069)
Force-pushed 642eb68 to a0d696a
@coderabbitai full review

✅ Actions performed: Full review triggered.

/bot run

PR_Github #44887 [ run ] triggered by Bot. Commit:
📝 Walkthrough

Added a tokenized multimodal fast path for Nemotron Nano (image/audio/video), changed LLaVA-Next's expand function to return (expanded_ids, mm_data_updates), introduced a tubelet-based retained-token helper for EVS, and wired mm_data_updates through the tokenized multimodal registry; tests updated/added.
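In code terms, the updated contract looks roughly like this (a minimal sketch; parameter names are inferred from the diagrams and tests below, and the body is a stub, not the implementation):

```python
from typing import Any, Dict, List, Optional, Tuple

# Illustrative only: the real method lives on the model's input processor.
def expand_prompt_token_ids_for_mm(
    prompt_token_ids: List[int],
    mm_token_lengths: List[int],
    hf_processor_mm_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[List[int], Optional[Dict[str, Dict[str, Any]]]]:
    expanded_ids = list(prompt_token_ids)  # placeholder expansion happens here
    mm_data_updates = None  # e.g. {"video": {"evs_ids": ...}} when EVS prunes
    return expanded_ids, mm_data_updates
```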
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant ExpandFn as expand_prompt_token_ids_for_mm
    participant ModalityDetect as Modality Detector
    participant ImageExp as Image Expander
    participant AudioExp as Audio Expander
    participant VideoExp as Video Expander
    participant EVS as EVS Stream
    Client->>ExpandFn: prompt token ids, _multi_modal_data, hf_processor_mm_kwargs
    ExpandFn->>ModalityDetect: determine active modality
    ModalityDetect-->>ExpandFn: image / audio / video
    alt image
        ExpandFn->>ImageExp: expand image placeholders -> expanded_ids
        ImageExp-->>ExpandFn: expanded_ids
    else audio
        ExpandFn->>AudioExp: expand audio placeholders -> expanded_ids
        AudioExp-->>ExpandFn: expanded_ids
    else video
        ExpandFn->>VideoExp: expand video placeholders -> expanded_ids, maybe evs_ids
        VideoExp->>EVS: build evs_ids (if EVS & pruning)
        EVS-->>VideoExp: evs_ids
        VideoExp-->>ExpandFn: expanded_ids + evs_ids
    end
    ExpandFn-->>Client: (expanded_ids, mm_data_updates)
```
```mermaid
sequenceDiagram
    participant Registry
    participant ExpandFn
    participant Extra as Extra Processed Inputs
    Registry->>Registry: build expand_kwargs (include _multi_modal_data)
    Registry->>ExpandFn: call expand_prompt_token_ids_for_mm(expand_kwargs)
    ExpandFn-->>Registry: (expanded_ids, mm_data_updates)
    alt mm_data_updates != None
        Registry->>Extra: merge mm_data_updates into multimodal_data
        Extra-->>Registry: updated multimodal_data
    end
    Registry-->>Registry: return processed result
```
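The merge step in the second diagram can be sketched as follows, assuming per-modality dicts; `merge_mm_data_updates` is a hypothetical helper name, not the registry's actual API:

```python
from typing import Any, Dict, Optional

def merge_mm_data_updates(
    multimodal_data: Dict[str, Dict[str, Any]],
    mm_data_updates: Optional[Dict[str, Dict[str, Any]]],
) -> Dict[str, Dict[str, Any]]:
    # No updates returned by the expand function: keep inputs untouched.
    if not mm_data_updates:
        return multimodal_data
    # Merge per-modality updates (e.g. {"video": {"evs_ids": ...}}) without
    # clobbering keys the processor already produced for that modality.
    for modality, updates in mm_data_updates.items():
        multimodal_data.setdefault(modality, {}).update(updates)
    return multimodal_data
```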
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (4 passed)
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/models/modeling_nemotron_nano.py (1)
1-1: ⚠️ Potential issue | 🟡 Minor — Update the NVIDIA copyright year.
This file is meaningfully modified in this PR, but the header still says 2025. As per coding guidelines: "Add NVIDIA copyright header on ALL new files and update year on modified files".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` at line 1, The file header still shows "Copyright (c) 2025, NVIDIA CORPORATION. All rights reserved." — update the year to the current year (change 2025 to 2026) in the file's top copyright header (the copyright comment string) to comply with the "update year on modified files" guideline; ensure the header format remains identical aside from the year.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py`:
- Around line 1196-1284: The fast-path in expand_prompt_token_ids_for_mm drops
audio stored under multimodal_data["video"], so detect when mm_data contains
video with an "audio" field (check mm_data.get("video") and "audio" in
mm_data["video"]) and avoid the video-only fast-path; either (A) preserve and
return the appropriate audio updates by mirroring the logic in
_extract_audio_from_video and include those updates in mm_data_updates alongside
the evs_ids returned from _expand_video_placeholders_in_token_ids, or (B
simpler) explicitly fall back to the full HF processor by returning
prompt_token_ids, None (or otherwise signaling the caller to use the full
processor) whenever video metadata contains audio so the original prompt/audio
handling remains intact. Ensure you reference expand_prompt_token_ids_for_mm,
_expand_video_placeholders_in_token_ids, and _extract_audio_from_video when
implementing the chosen fix.
- Around line 1008-1034: get_text_with_mm_placeholders currently uses module
constants (IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, AUDIO_PLACEHOLDER) which can
diverge from the instance configuration; change it to append the configured
placeholder attributes (self.img_context_token, self.video_context_token,
self._sound_context_token) instead, repeating each configured token
num_images/num_videos/num_audios times and returning the concatenated string so
the fast path matches the configured processor tokens.
In `@tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py`:
- Around line 1452-1456: The test assigns the first return value from
proc.expand_prompt_token_ids_for_mm to an unused variable `expanded`, which
triggers RUF059; change the unpack to use `_` for the unused first element
(e.g., `_, mm_data_updates = proc.expand_prompt_token_ids_for_mm(...)`) so only
`mm_data_updates` is treated as the meaningful binding while satisfying the
linter.
📒 Files selected for processing (9)
- tensorrt_llm/_torch/models/modeling_llava_next.py
- tensorrt_llm/_torch/models/modeling_nemotron_nano.py
- tensorrt_llm/inputs/__init__.py
- tensorrt_llm/inputs/evs.py
- tensorrt_llm/inputs/multimodal.py
- tensorrt_llm/inputs/registry.py
- tests/integration/test_lists/test-db/l0_l40s.yml
- tests/unittest/_torch/modeling/test_modeling_llava_next.py
- tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py
Actionable comments posted: 2
♻️ Duplicate comments (3)
tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py (1)
1452-1456: ⚠️ Potential issue | 🟡 Minor — Unused `expanded` variable flagged by linter.
Ruff RUF059 flags `expanded` as unused. Use `_` to indicate the value is intentionally discarded.
Proposed fix:

```diff
-expanded, mm_data_updates = proc.expand_prompt_token_ids_for_mm(
+_, mm_data_updates = proc.expand_prompt_token_ids_for_mm(
     prompt,
     [total_mm],
     hf_processor_mm_kwargs={"_multi_modal_data": mm_data},
 )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py` around lines 1452-1456, The linter flags the local variable expanded as unused; when calling proc.expand_prompt_token_ids_for_mm, discard that first return value by replacing expanded with an underscore (use `_`), i.e., change the unpacking to `_, mm_data_updates = proc.expand_prompt_token_ids_for_mm(prompt, [total_mm], hf_processor_mm_kwargs={"_multi_modal_data": mm_data})` so the unused return is intentionally ignored while preserving mm_data_updates and the call to expand_prompt_token_ids_for_mm.

tensorrt_llm/_torch/models/modeling_nemotron_nano.py (2)
1008-1034: ⚠️ Potential issue | 🟠 Major — Use the processor's configured placeholder strings here.
`get_text_with_mm_placeholders()` still hard-codes the module constants. If a checkpoint overrides `img_context_token`, `video_context_token`, or `sound_context_token`, the dummy HF-processor call diverges from the actual processor config and the fast path can miscount or fail.
Suggested fix:

```diff
-parts.extend([IMAGE_PLACEHOLDER] * num_images)
-parts.extend([VIDEO_PLACEHOLDER] * num_videos)
-parts.extend([AUDIO_PLACEHOLDER] * num_audios)
+parts.extend([self.img_context_token] * num_images)
+parts.extend([self.video_context_token] * num_videos)
+parts.extend([self._sound_context_token] * num_audios)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 1008-1034, get_text_with_mm_placeholders currently uses hard-coded module constants (IMAGE_PLACEHOLDER, VIDEO_PLACEHOLDER, AUDIO_PLACEHOLDER); change it to read the actual processor/configured placeholder tokens (e.g., use self.processor.img_context_token, self.processor.video_context_token, self.processor.sound_context_token or the model/checkpoint attributes img_context_token, video_context_token, sound_context_token) so the dummy text matches the HF processor configuration; locate the method get_text_with_mm_placeholders and replace the uses of IMAGE_PLACEHOLDER/VIDEO_PLACEHOLDER/AUDIO_PLACEHOLDER with the corresponding runtime-configured tokens and fall back to the constants if the processor attributes are missing.
1239-1284: ⚠️ Potential issue | 🟠 Major — The tokenized video fast path still misses audio carried in video metadata.
`__call__()`'s video path runs `_extract_audio_from_video()` and injects audio placeholders for videos whose metadata contains audio. This path only checks top-level `"audio"` and always dispatches raw video requests to `_expand_video_placeholders_in_token_ids()`, so `expanded_ids` stays video-only while the dummy-processor output can already contain `multimodal_data["video"]["audio"]`. That leaves token IDs and multimodal embeddings out of sync for Omni video+audio requests.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 1239-1284, The video fast-path in __call__ currently always treats a request with video as video-only and calls _expand_video_placeholders_in_token_ids, which misses audio carried inside mm_data["video"]["audio"] injected by _extract_audio_from_video; update the detection/dispatch so that if mm_data contains video.audio (or the dummy processor output includes multimodal_data["video"]["audio"]) you either (A) set has_audio alongside has_video and run the audio-expansion logic (e.g., call _expand_audio_placeholders_in_token_ids or a combined expansion), or (B) after _expand_video_placeholders_in_token_ids apply the audio placeholder expansion and merge audio evs/ids into mm_data_updates and the returned expanded_ids so token IDs and mm_data_updates stay in sync; reference __call__, mm_data, _extract_audio_from_video, _expand_video_placeholders_in_token_ids, _expand_audio_placeholders_in_token_ids, evs_ids_tensor and mm_data_updates when making the change.
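A hedged sketch of the guard this comment asks for, using the fall-back-to-full-processor variant suggested in the earlier review pass; the helper names `_video_carries_audio` and `expand_or_fall_back` are illustrative, not the file's actual code:

```python
from typing import Any, Dict, List, Optional, Tuple

def _video_carries_audio(mm_data: Dict[str, Any]) -> bool:
    # Audio extracted by _extract_audio_from_video is nested under the
    # video entry rather than at the top level of mm_data.
    video = mm_data.get("video")
    return isinstance(video, dict) and "audio" in video

def expand_or_fall_back(
    prompt_token_ids: List[int],
    mm_data: Dict[str, Any],
) -> Tuple[List[int], Optional[Dict[str, Any]]]:
    if _video_carries_audio(mm_data):
        # Returning the ids unchanged with no updates signals the caller
        # to use the full HF-processor path, keeping ids and audio in sync.
        return prompt_token_ids, None
    # Video-only request: safe to run the token-space expansion here
    # (the real code would call _expand_video_placeholders_in_token_ids).
    expanded_ids = list(prompt_token_ids)
    return expanded_ids, None
```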
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py`:
- Around line 1306-1341: The loop that expands img placeholders computes
num_context incorrectly by subtracting len(self._img_start_token_ids) and
len(self._img_end_token_ids); because get_num_tokens_per_image budgets wrappers
as +2 (one start + one end), change the calculation to num_context = n - 2 (and
clamp to >= 0) so you repeat img_context_token_id the correct number of times;
keep using self._img_start_token_ids, self._img_end_token_ids,
img_context_token_id and num_mm_tokens_per_placeholder as before and validate
image_idx bounds as currently implemented.
- Around line 1557-1567: The zip over frame_seps and tokens_per_frame should be
made strict to surface length mismatches instead of silently truncating; update
the loop that iterates "for frame_sep, num_tokens in zip(frame_seps,
tokens_per_frame):" to use zip(frame_seps, tokens_per_frame, strict=True) so any
mismatch between frame_seps and tokens_per_frame raises immediately; ensure this
change preserves the existing logic that encodes frame_sep via
self.tokenizer.encode and extends expansion and evs_expansion with
_img_start_token_ids, img_context_token_id / video_context_token_id,
_img_end_token_ids as currently implemented.
📒 Files selected for processing (9)
- tensorrt_llm/_torch/models/modeling_llava_next.py
- tensorrt_llm/_torch/models/modeling_nemotron_nano.py
- tensorrt_llm/inputs/__init__.py
- tensorrt_llm/inputs/evs.py
- tensorrt_llm/inputs/multimodal.py
- tensorrt_llm/inputs/registry.py
- tests/integration/test_lists/test-db/l0_l40s.yml
- tests/unittest/_torch/modeling/test_modeling_llava_next.py
- tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py
PR_Github #44887 [ run ] completed with state
Force-pushed 83c288b to 37b9a3f
Code review: No issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code
/bot run

PR_Github #45770 [ run ] triggered by Bot. Commit:

PR_Github #45770 [ run ] completed with state

/bot run

PR_Github #45791 [ run ] triggered by Bot. Commit:

PR_Github #46285 [ run ] completed with state
Force-pushed 2593b4d to ab95b4c
/bot run

PR_Github #46450 [ run ] triggered by Bot. Commit:

PR_Github #46450 [ run ] completed with state

/bot run

PR_Github #46473 [ run ] triggered by Bot. Commit:

PR_Github #46473 [ run ] completed with state
Signed-off-by: Michal Guzek <mguzek@nvidia.com>
Force-pushed 8a3e5b7 to 41f5784
/bot run

PR_Github #46511 [ run ] triggered by Bot. Commit:

PR_Github #46511 [ run ] completed with state

/bot run

PR_Github #46541 [ run ] triggered by Bot. Commit:

PR_Github #46541 [ run ] completed with state

/bot run

PR_Github #46611 [ run ] triggered by Bot. Commit:

PR_Github #46611 [ run ] completed with state
Description
Adds the tokenized-prompt + multimodal-data fast path (the same pattern PR #11708 introduced for LLaVA-Next) to Nemotron Nano V3 OMNI, covering all three modalities — image, video, audio — including EVS-enabled video. Also fixes a pre-existing bug in `NanoV2VLInputProcessor.get_num_tokens_per_video`'s EVS branch, where the per-tubelet retained token count was inflated by routing through `get_num_tokens_per_image`'s dynamic-tiler logic, causing `find_mm_token_positions` to disagree with the vision encoder's real post-EVS output.

- `tensorrt_llm/_torch/models/modeling_nemotron_nano.py` — Adds `get_text_with_mm_placeholders`, `get_num_tokens_per_audio`, and `expand_prompt_token_ids_for_mm` (+ per-modality helpers) to `NanoV2VLInputProcessor`. Video expansion uses a two-tier strategy mirroring vLLM's `_apply_prompt_updates` (token-level subsequence match with a text-level decode/split/re-encode fallback; sketched after this list). EVS support is included via a parallel `evs_ids` stream consumed by `merge_evs_mm_embeds`. Fixes `get_num_tokens_per_video`'s EVS branch to derive `video_size` the same way `_process_videos_frames` does.
- `tensorrt_llm/inputs/evs.py` — Extracts `compute_retained_tokens_from_tubelet_budget` as the shared retention-count helper; `compute_retained_tokens_count` now delegates to it.
- `tensorrt_llm/inputs/multimodal.py` — Adds an audio branch to `find_mm_token_lengths`; fixes `find_mm_token_positions`'s no-match early return from `[]` to `([], [])` to match its declared 2-tuple signature.
- `tensorrt_llm/inputs/registry.py` — `tokenized_multimodal_process` threads `multi_modal_data` through to the input processor (needed by video expansion) and merges any returned `mm_data_updates` (e.g. `evs_ids`) into `extra_processed_inputs["multimodal_data"]`. Adds "audio" to the multimodal-hashing modality whitelist.
- `tensorrt_llm/_torch/models/modeling_llava_next.py` — `expand_prompt_token_ids_for_mm` signature updated to return `Tuple[List[int], Optional[Dict[str, Dict[str, Any]]]]`; always returns `(expanded, None)`.
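The token-level tier of that two-tier strategy can be sketched as follows (a simplified illustration under the assumption of exact subsequence matching; the real code also implements the text-level decode/split/re-encode fallback):

```python
from typing import List

def expand_placeholder_subsequence(
    prompt_token_ids: List[int],
    placeholder_ids: List[int],
    replacement_ids: List[int],
) -> List[int]:
    # Tier 1: scan for the placeholder's token-id subsequence and splice in
    # the replacement. If a tokenizer merges the placeholder with adjacent
    # text (no subsequence match), tier 2 decodes the prompt, splits on the
    # placeholder string, and re-encodes the pieces instead.
    out: List[int] = []
    i, m = 0, len(placeholder_ids)
    while i < len(prompt_token_ids):
        if m and prompt_token_ids[i:i + m] == placeholder_ids:
            out.extend(replacement_ids)
            i += m
        else:
            out.append(prompt_token_ids[i])
            i += 1
    return out

# Example: a single-id placeholder 99 expands to three context tokens.
assert expand_placeholder_subsequence([1, 99, 2], [99], [7, 7, 7]) == [1, 7, 7, 7, 2]
```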
Test Coverage

- `tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py` — 8 new test classes covering each fast-path method, multi-token BPE patterns, EVS token-level + text-level tiers, dispatch integration, and audio.

Tested with VLMEvalKitMcore for the following datasets: InfoVQA_VAL, OCRBench, VoxPopuli_ASR_test, TedLium_ASR_Test, Video-MME (EVS off), and Video-MME (EVS of 0.5), by temporarily tokenizing the text prompt to trigger the fast path. The scores match the default, non-fast path, where the text prompt and MM data are handled directly by the HF processor.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment `/bot help`.