[TRTLLM-11268][feat] Video temporal compression to Nemotron Nano and RADIO #12649
2ez4bz merged 2 commits into NVIDIA:main
Conversation
Force-pushed 52a836f to 9640b75
/bot run
📝 Walkthrough
This PR adds video temporal compression support to Nemotron Nano and RADIO vision models, including aspect-ratio-preserving video preprocessing and tubelet-based temporal grouping.
Changes
Sequence Diagram

```mermaid
sequenceDiagram
    participant Input as Video Input
    participant Preproc as Preprocessing<br/>(video_to_pixel_values)
    participant Encoder as Vision Encoder<br/>(extract_feature)
    participant Temporal as Temporal Processor<br/>(forward_video)
    participant Embed as Embedder<br/>(video_embedder)
    participant Attn as Attention<br/>(transformer blocks)
    participant Output as Feature Output
    Input->>Preproc: pixel_values, target_size
    Preproc->>Preproc: aspect-preserving resize<br/>normalize
    Preproc->>Encoder: processed_pixels
    Encoder->>Temporal: num_frames provided
    Temporal->>Temporal: pad frames to<br/>multiple of T
    Temporal->>Temporal: group into<br/>tubelets (T frames)
    Temporal->>Embed: tubelet patches
    Embed->>Embed: temporal projection<br/>3*T*patch_size²
    Embed->>Temporal: embedded tubelets
    Temporal->>Temporal: add positional encoding<br/>add CLS token
    Temporal->>Attn: repacked tubelets<br/>for attention
    Attn->>Attn: process through<br/>transformer blocks
    Attn->>Temporal: attended features
    Temporal->>Temporal: reshape back<br/>to frames
    Temporal->>Output: temporal embeddings
```
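The pad-and-group steps in the diagram can be sketched numerically. This is a hypothetical helper (the function name and shapes are assumptions, not from the PR) showing how a frame count maps to tubelets; the real code pads and reshapes pixel tensors, not counts.

```python
import math

def pad_and_group(num_frames: int, temporal_patch_size: int) -> tuple[int, int]:
    """Pad the frame count up to a multiple of T, then count tubelets.

    Illustrative only: mirrors the "pad frames to multiple of T" and
    "group into tubelets (T frames)" steps of the sequence diagram.
    """
    T = temporal_patch_size
    padded = math.ceil(num_frames / T) * T  # e.g. 10 frames, T=4 -> 12
    return padded, padded // T              # 12 padded frames -> 3 tubelets

print(pad_and_group(10, 4))  # (12, 3)
```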
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60–75 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py (1)
714-716: Unused unpacked variables `w` and `h`. These variables are unpacked but never used. Consider prefixing with underscore to indicate intentional discard.
♻️ Proposed fix

```diff
 def test_predicted_vs_actual_token_count(self):
-    w, h = self.FRAME_SIZE
+    _w, _h = self.FRAME_SIZE
     proc = _make_processor(max_num_patches=256, min_num_patches=4)
```

Or simply remove the unpacking, since `FRAME_SIZE` is accessed directly elsewhere:

```python
def test_predicted_vs_actual_token_count(self):
    proc = _make_processor(max_num_patches=256, min_num_patches=4)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py` around lines 714-716: in `test_predicted_vs_actual_token_count`, the unpacking `w, h = self.FRAME_SIZE` creates unused variables `w` and `h`; either remove that unpacking entirely or rename them to indicate intentional discard (e.g., `_w, _h = self.FRAME_SIZE`) so linters won't flag unused variables; update the test around `_make_processor` and `FRAME_SIZE` usage accordingly (no behavior change).

tensorrt_llm/_torch/models/modeling_nemotron_nano.py (2)
191-192: Consider handling partial normalization parameters. If only one of `norm_mean` or `norm_std` is provided (but not both), normalization is silently skipped, which can lead to unexpected behavior. Consider raising an error for this edge case.
🛡️ Proposed fix

```diff
 # Apply mean/std normalization (matches vLLM's input_conditioner).
-if norm_mean is not None and norm_std is not None:
+if norm_mean is not None or norm_std is not None:
+    if norm_mean is None or norm_std is None:
+        raise ValueError(
+            "Both norm_mean and norm_std must be provided for normalization, "
+            f"got norm_mean={norm_mean is not None}, norm_std={norm_std is not None}"
+        )
     video_tensor = (video_tensor - norm_mean) / norm_std
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 191 - 192, Currently normalization only runs when both norm_mean and norm_std are non-None which silently skips normalization if only one is provided; update the logic around the video_tensor normalization to validate the parameters: if exactly one of norm_mean or norm_std is None, raise a ValueError (or appropriate exception) with a clear message referencing norm_mean and norm_std, otherwise apply the normalization video_tensor = (video_tensor - norm_mean) / norm_std; keep the check and transformation located with the existing video_tensor normalization block so callers cannot pass partial normalization parameters without being alerted.
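The validation suggested above can be expressed as a standalone sketch. The function name `normalize_pixels` and the list-based input are invented for illustration; the real code applies the same arithmetic inline on a torch tensor.

```python
def normalize_pixels(values, norm_mean, norm_std):
    """Apply (x - mean) / std, rejecting partially specified parameters."""
    # Exactly one of the two being None is the silent-skip case the
    # review flags; fail loudly instead.
    if (norm_mean is None) != (norm_std is None):
        raise ValueError(
            "Both norm_mean and norm_std must be provided for normalization, "
            f"got norm_mean={norm_mean is not None}, norm_std={norm_std is not None}")
    if norm_mean is None:
        return list(values)  # no normalization requested
    return [(v - norm_mean) / norm_std for v in values]

print(normalize_pixels([0.5, 1.0], norm_mean=0.5, norm_std=0.25))  # [0.0, 2.0]
```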
74-75: Unnecessary `int()` cast on `round()` result. In Python 3, `round()` already returns an integer when called with one argument, so the `int()` wrapper is redundant.
♻️ Proposed fix

```diff
-reduction_factor = int(round(1 / downsample_ratio))
+reduction_factor = round(1 / downsample_ratio)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 74 - 75, The assignment to reduction_factor unnecessarily wraps round(1 / downsample_ratio) in int(); update the code in modeling_nemotron_nano.py to set reduction_factor = round(1 / downsample_ratio) (and keep required_divisor = reduction_factor) so the redundant int() cast is removed while preserving the same behavior for reduction_factor and required_divisor.
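For reference, single-argument `round()` returns an `int` in Python 3, so the cast is a no-op; the ratio value below is an example, not taken from the model config.

```python
downsample_ratio = 0.25  # example value only
reduction_factor = round(1 / downsample_ratio)  # round(4.0) -> 4, already an int
print(type(reduction_factor).__name__, reduction_factor)  # int 4
```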
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/models/modeling_radio.py`:
- Around line 1121-1124: The condition for setting
patch_gen._video_embedder_loaded is ambiguous and can be true even when the
checkpoint lacked video_embedder weights; instead, only set
_video_embedder_loaded when the patch_gen actually has a video_embedder module
and the key wasn't unexpected. Update the logic in the load path around
radio_model.model.patch_generator (patch_gen) to first verify the presence of
the submodule (e.g., hasattr/ getattr(patch_gen, "video_embedder")) and then
check that 'model.patch_generator.video_embedder.weight' is not in
unexpected_keys before assigning patch_gen._video_embedder_loaded = True.
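The suggested load-path guard could look like this sketch. The `PatchGen` stub is invented for illustration; only the attribute name, flag, and weight key come from the review comment above.

```python
class PatchGen:
    """Minimal stand-in for radio_model.model.patch_generator."""
    def __init__(self, has_video_embedder: bool):
        if has_video_embedder:
            self.video_embedder = object()  # placeholder for the real submodule
        self._video_embedder_loaded = False

def mark_video_embedder_loaded(patch_gen, unexpected_keys):
    key = "model.patch_generator.video_embedder.weight"
    # Only flag as loaded when the submodule actually exists AND the
    # checkpoint key was consumed (i.e. not reported as unexpected).
    if hasattr(patch_gen, "video_embedder") and key not in unexpected_keys:
        patch_gen._video_embedder_loaded = True

pg = PatchGen(has_video_embedder=True)
mark_video_embedder_loaded(pg, unexpected_keys=[])
print(pg._video_embedder_loaded)  # True
```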
---
Nitpick comments:
In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py`:
- Around line 191-192: Currently normalization only runs when both norm_mean and
norm_std are non-None which silently skips normalization if only one is
provided; update the logic around the video_tensor normalization to validate the
parameters: if exactly one of norm_mean or norm_std is None, raise a ValueError
(or appropriate exception) with a clear message referencing norm_mean and
norm_std, otherwise apply the normalization video_tensor = (video_tensor -
norm_mean) / norm_std; keep the check and transformation located with the
existing video_tensor normalization block so callers cannot pass partial
normalization parameters without being alerted.
- Around line 74-75: The assignment to reduction_factor unnecessarily wraps
round(1 / downsample_ratio) in int(); update the code in
modeling_nemotron_nano.py to set reduction_factor = round(1 / downsample_ratio)
(and keep required_divisor = reduction_factor) so the redundant int() cast is
removed while preserving the same behavior for reduction_factor and
required_divisor.
In `@tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py`:
- Around line 714-716: In test_predicted_vs_actual_token_count, the unpacking
"w, h = self.FRAME_SIZE" creates unused variables w and h; either remove that
unpacking entirely or rename them to indicate intentional discard (e.g., "_w, _h
= self.FRAME_SIZE") so linters won’t flag unused variables; update the test
around _make_processor and FRAME_SIZE usage accordingly (no behavior change).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c6b77b2c-737c-4418-a7d8-7996a8b516a2
📒 Files selected for processing (3)
tensorrt_llm/_torch/models/modeling_nemotron_nano.py
tensorrt_llm/_torch/models/modeling_radio.py
tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py
PR_Github #41809 [ run ] triggered by Bot. Commit:
PR_Github #41809 [ run ] completed with state
/bot run
PR_Github #41985 [ run ] triggered by Bot. Commit:
PR_Github #41985 [ run ] completed with state
Force-pushed 9640b75 to b3b2152
…RADIO

Implement tubelet-based temporal compression for video inputs, matching the Megatron-LM / vLLM video processing pipeline. T consecutive frames are grouped into tubelets before embedding, reducing the token count by a factor of `video_temporal_patch_size`.

Key additions:
- Aspect-ratio-preserving video frame resize and normalization
- Separate video embedder in RADIO ViT for tubelet projection
- Tubelet-aware token counting, frame separators, and EVS paths
- Fix align_corners=True -> False in position embedding interpolation

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
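Assuming `video_temporal_patch_size` is the tubelet depth T and a fixed token budget per frame, the token reduction described in the commit message can be estimated with this hypothetical helper (names and numbers are illustrative, not from the PR):

```python
import math

def compressed_video_tokens(num_frames: int, tokens_per_frame: int,
                            temporal_patch_size: int) -> int:
    """Tokens after grouping T consecutive frames into one tubelet.

    Frames are padded up to a multiple of T before grouping, hence ceil.
    """
    tubelets = math.ceil(num_frames / temporal_patch_size)
    return tubelets * tokens_per_frame

# 16 frames at 256 tokens each: 4096 tokens uncompressed vs 1024 with T=4.
print(compressed_video_tokens(16, 256, 4))  # 1024
```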
Force-pushed 7c38538 to 3192c7a
/bot run
PR_Github #42593 [ run ] triggered by Bot. Commit:
PR_Github #42593 [ run ] completed with state
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Force-pushed 3192c7a to 696c3d2
/bot run --disable-fail-fast
PR_Github #42673 [ run ] triggered by Bot. Commit:
PR_Github #42673 [ run ] completed with state
/bot run
PR_Github #42732 [ run ] triggered by Bot. Commit:
PR_Github #42732 [ run ] completed with state
Summary by CodeRabbit
New Features
Tests
Description
Implement tubelet-based temporal compression for video inputs, matching the Megatron-LM / vLLM video processing pipeline. T consecutive frames are grouped into tubelets before embedding, reducing the token count by a factor of `video_temporal_patch_size`.

Key additions:
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.