
[TRTLLM-11268][feat] Video temporal compression to Nemotron Nano and RADIO#12649

Merged
2ez4bz merged 2 commits into NVIDIA:main from
2ez4bz:dev-nano-v3-video
Apr 10, 2026

Conversation

@2ez4bz
Collaborator

@2ez4bz 2ez4bz commented Apr 1, 2026

Summary by CodeRabbit

  • New Features

    • Added video preprocessing with aspect-ratio preservation for improved frame handling.
    • Introduced temporal compression support for efficient video frame grouping and processing.
    • Extended vision models to support video inputs with configurable temporal and spatial parameters.
  • Tests

    • Added comprehensive test coverage for video preprocessing utilities and temporal compression workflows.

Description

Implement tubelet-based temporal compression for video inputs, matching the Megatron-LM / vLLM video processing pipeline. T consecutive frames are grouped into tubelets before embedding, reducing the token count by a factor of video_temporal_patch_size.

Key additions:

  • Aspect-ratio-preserving video frame resize and normalization
  • Separate video embedder in RADIO ViT for tubelet projection
  • Tubelet-aware token counting, frame separators, and EVS paths
  • Fix align_corners=True -> False in position embedding interpolation
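
A minimal sketch of the tubelet bookkeeping described above (illustrative only — the helper names here are hypothetical, not the PR's actual functions; only `video_temporal_patch_size` / T comes from the description):

```python
from math import ceil


def group_frames_into_tubelets(num_frames: int, temporal_patch_size: int) -> tuple[int, int]:
    """Pad the frame count up to a multiple of T, then group T consecutive
    frames into one tubelet. Returns (padded_frames, num_tubelets)."""
    t = temporal_patch_size
    padded = ceil(num_frames / t) * t
    return padded, padded // t


def video_token_count(num_frames: int, tokens_per_frame: int, temporal_patch_size: int) -> int:
    """Each tubelet contributes the same number of spatial tokens as a single
    frame, so the total token count drops by roughly a factor of T."""
    _, num_tubelets = group_frames_into_tubelets(num_frames, temporal_patch_size)
    return num_tubelets * tokens_per_frame


# 16 frames, 256 spatial tokens each, T=4 -> 4 tubelets * 256 = 1024 tokens
print(video_token_count(16, 256, 4))  # 1024
# 17 frames are padded to 20 before grouping into 5 tubelets
print(group_frames_into_tubelets(17, 4))  # (20, 5)
```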

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz 2ez4bz force-pushed the dev-nano-v3-video branch from 52a836f to 9640b75 Compare April 4, 2026 06:14
@2ez4bz 2ez4bz marked this pull request as ready for review April 4, 2026 06:16
@2ez4bz 2ez4bz requested review from a team as code owners April 4, 2026 06:16
@2ez4bz
Collaborator Author

2ez4bz commented Apr 4, 2026

/bot run

@coderabbitai
Contributor

coderabbitai bot commented Apr 4, 2026

📝 Walkthrough


This PR adds video temporal compression support to Nemotron Nano and RADIO vision models, including aspect-ratio-preserving video preprocessing, tubelet-based temporal grouping via temporal_patch_size configuration, updated vision encoders and input processors to handle temporal frames, and comprehensive test coverage for the new video preprocessing pipeline.

Changes

Cohort / File(s) Summary
Video Preprocessing Utilities
tensorrt_llm/_torch/models/modeling_nemotron_nano.py
Added functions _compute_aspect_preserving_size, get_video_target_size_and_feature_size, and video_to_pixel_values to compute aspect-ratio-aware target dimensions, perform bicubic resizing with clamping/rescaling, and apply optional mean/std normalization.
Vision Encoder & Temporal Compression
tensorrt_llm/_torch/models/modeling_nemotron_nano.py
Modified NanoV2VLVisionEncoder.extract_feature to accept optional num_frames and forward it to vision model when temporal compression is enabled; added _extract_video_embeddings_temporal method to process videos with temporal compression and flatten embeddings.
Input Processor Configuration & Video Frame Handling
tensorrt_llm/_torch/models/modeling_nemotron_nano.py
Extended NanoV2VLInputProcessor with config fields for video_temporal_patch_size, video_maintain_aspect_ratio, video_target_num_patches, and video frame normalization tensors; updated get_num_tokens_per_video and _compute_token_numbers_per_video to group frames into tubelets and compute tokens using aspect-aware sizing; modified _process_videos_frames to use video_to_pixel_values and handle variable-sized frame batches as lists.
Prompt Construction for Video
tensorrt_llm/_torch/models/modeling_nemotron_nano.py
Added _build_tubelet_separators method for temporal-compression-aware prompt formatting; updated _get_frame_separators and _process_video_prompts to conditionally generate tubelet-based separators and prepend video prefix based on configuration; modified __call__ to handle pixel_values as either tensor or list.
EVS Application for Video
tensorrt_llm/_torch/models/modeling_nemotron_nano.py
Updated apply_evs_per_video to handle both 2D and 3D mm_embed layouts by computing spatial token counts from pixel/tile dimensions and slicing/reshaping accordingly.
RADIO Model Temporal Patch Generation
tensorrt_llm/_torch/models/modeling_radio.py
Added temporal_patch_size and separate_video_embedder configuration to ViTPatchGenerator; introduced conditional video_embedder (ViTPatchLinear) when temporal compression is enabled; added forward_video method to pad frame patches, group into tubelets, and apply embedder. Modified ViTPatchLinear to accept temporal patch size and adjust input projection dimension accordingly.
RADIO Model Integration
tensorrt_llm/_torch/models/modeling_radio.py
Extended VisionTransformer, RADIOVisionModelBase, and RADIOVisionModel to accept optional num_frames parameter; when provided and temporal compression is enabled, calls patch_generator.forward_video, packs/reshapes tubelets for attention, and computes sequence lengths accordingly. Updated positional interpolation in window_select from align_corners=True to False. Updated weight-loading logic to mark _video_embedder_loaded flag.
Test Coverage for Video Preprocessing
tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py
Added comprehensive unit tests: TestComputeAspectPreservingSize (validates patch-size divisibility and aspect-ratio preservation), TestGetVideoTargetSizeAndFeatureSize (checks feature-size computation), TestVideoToPixelValues (verifies shape, normalization, resizing), TestBuildTubeletSeparators (asserts separator formatting and numbering), and TestGetNumTokensPerVideoTemporal (validates token count reduction and tubelet grouping).
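
Based on the preprocessing summary above, an aspect-ratio-preserving target size snapped to patch-size multiples might look like the following sketch (a hedged illustration — the real `_compute_aspect_preserving_size` in `modeling_nemotron_nano.py` may differ; the `target_num_patches` budget semantics are an assumption):

```python
from math import sqrt


def compute_aspect_preserving_size(width: int, height: int,
                                   patch_size: int,
                                   target_num_patches: int) -> tuple[int, int]:
    """Scale (width, height) so the patch grid holds roughly
    target_num_patches patches, preserving aspect ratio, and snap each
    side to a multiple of patch_size."""
    target_pixels = target_num_patches * patch_size * patch_size
    scale = sqrt(target_pixels / (width * height))
    new_w = max(patch_size, round(width * scale / patch_size) * patch_size)
    new_h = max(patch_size, round(height * scale / patch_size) * patch_size)
    return new_w, new_h


# A 1280x720 frame with 16px patches and a 256-patch budget
print(compute_aspect_preserving_size(1280, 720, 16, 256))  # (336, 192)
```

Both sides stay divisible by the patch size, and 336/192 ≈ 1.75 stays close to the source aspect ratio of 16:9 while keeping 21 × 12 = 252 patches within the 256-patch budget.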

Sequence Diagram

sequenceDiagram
    participant Input as Video Input
    participant Preproc as Preprocessing<br/>(video_to_pixel_values)
    participant Encoder as Vision Encoder<br/>(extract_feature)
    participant Temporal as Temporal Processor<br/>(forward_video)
    participant Embed as Embedder<br/>(video_embedder)
    participant Attn as Attention<br/>(transformer blocks)
    participant Output as Feature Output

    Input->>Preproc: pixel_values, target_size
    Preproc->>Preproc: aspect-preserving resize<br/>normalize
    Preproc->>Encoder: processed_pixels
    
    Encoder->>Temporal: num_frames provided
    Temporal->>Temporal: pad frames to<br/>multiple of T
    Temporal->>Temporal: group into<br/>tubelets (T frames)
    Temporal->>Embed: tubelet patches
    Embed->>Embed: temporal projection<br/>3*T*patch_size²
    Embed->>Temporal: embedded tubelets
    Temporal->>Temporal: add positional encoding<br/>add CLS token
    
    Temporal->>Attn: repacked tubelets<br/>for attention
    Attn->>Attn: process through<br/>transformer blocks
    Attn->>Temporal: attended features
    Temporal->>Temporal: reshape back<br/>to frames
    Temporal->>Output: temporal embeddings
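
As the diagram's "temporal projection 3*T*patch_size²" step notes, grouping T frames into a tubelet multiplies the embedder's input width by T. A sketch of that dimension bookkeeping (illustrative names only — not the actual ViTPatchLinear signature):

```python
def tubelet_projection_in_features(patch_size: int,
                                   temporal_patch_size: int,
                                   channels: int = 3) -> int:
    """A tubelet patch stacks T frames of a (patch_size x patch_size) RGB
    patch, so the linear embedder's input width is channels * T * patch_size**2
    instead of the single-frame channels * patch_size**2."""
    return channels * temporal_patch_size * patch_size ** 2


# Single frame (T=1) vs. a 4-frame tubelet with 16px patches
print(tubelet_projection_in_features(16, 1))  # 768
print(tubelet_projection_in_features(16, 4))  # 3072
```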

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60–75 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 43.64% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly identifies the main change: implementing video temporal compression for Nemotron Nano and RADIO models, with proper JIRA ticket reference and feature type.
Description check ✅ Passed The description covers the key implementation details and objectives, though the Test Coverage section is incomplete and some checklist items are unchecked.


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py (1)

714-716: Unused unpacked variables w and h.

These variables are unpacked but never used. Consider prefixing with underscore to indicate intentional discard.

♻️ Proposed fix
     def test_predicted_vs_actual_token_count(self):
-        w, h = self.FRAME_SIZE
+        _w, _h = self.FRAME_SIZE
         proc = _make_processor(max_num_patches=256, min_num_patches=4)

Or simply remove the unpacking since FRAME_SIZE is accessed directly elsewhere:

     def test_predicted_vs_actual_token_count(self):
         proc = _make_processor(max_num_patches=256, min_num_patches=4)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py` around
lines 714 - 716, In test_predicted_vs_actual_token_count, the unpacking "w, h =
self.FRAME_SIZE" creates unused variables w and h; either remove that unpacking
entirely or rename them to indicate intentional discard (e.g., "_w, _h =
self.FRAME_SIZE") so linters won’t flag unused variables; update the test around
_make_processor and FRAME_SIZE usage accordingly (no behavior change).
tensorrt_llm/_torch/models/modeling_nemotron_nano.py (2)

191-192: Consider handling partial normalization parameters.

If only one of norm_mean or norm_std is provided (but not both), the normalization is silently skipped. This could lead to unexpected behavior. Consider raising an error for this edge case.

🛡️ Proposed fix
     # Apply mean/std normalization (matches vLLM's input_conditioner).
-    if norm_mean is not None and norm_std is not None:
+    if norm_mean is not None or norm_std is not None:
+        if norm_mean is None or norm_std is None:
+            raise ValueError(
+                "Both norm_mean and norm_std must be provided for normalization, "
+                f"got norm_mean={norm_mean is not None}, norm_std={norm_std is not None}"
+            )
         video_tensor = (video_tensor - norm_mean) / norm_std
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 191 - 192,
Currently normalization only runs when both norm_mean and norm_std are non-None
which silently skips normalization if only one is provided; update the logic
around the video_tensor normalization to validate the parameters: if exactly one
of norm_mean or norm_std is None, raise a ValueError (or appropriate exception)
with a clear message referencing norm_mean and norm_std, otherwise apply the
normalization video_tensor = (video_tensor - norm_mean) / norm_std; keep the
check and transformation located with the existing video_tensor normalization
block so callers cannot pass partial normalization parameters without being
alerted.

74-75: Unnecessary int() cast on round() result.

In Python 3, round() already returns an integer when called with one argument. The int() wrapper is redundant.

♻️ Proposed fix
-    reduction_factor = int(round(1 / downsample_ratio))
+    reduction_factor = round(1 / downsample_ratio)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` around lines 74 - 75,
The assignment to reduction_factor unnecessarily wraps round(1 /
downsample_ratio) in int(); update the code in modeling_nemotron_nano.py to set
reduction_factor = round(1 / downsample_ratio) (and keep required_divisor =
reduction_factor) so the redundant int() cast is removed while preserving the
same behavior for reduction_factor and required_divisor.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/models/modeling_radio.py`:
- Around line 1121-1124: The condition for setting
patch_gen._video_embedder_loaded is ambiguous and can be true even when the
checkpoint lacked video_embedder weights; instead, only set
_video_embedder_loaded when the patch_gen actually has a video_embedder module
and the key wasn't unexpected. Update the logic in the load path around
radio_model.model.patch_generator (patch_gen) to first verify the presence of
the submodule (e.g., hasattr/ getattr(patch_gen, "video_embedder")) and then
check that 'model.patch_generator.video_embedder.weight' is not in
unexpected_keys before assigning patch_gen._video_embedder_loaded = True.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c6b77b2c-737c-4418-a7d8-7996a8b516a2

📥 Commits

Reviewing files that changed from the base of the PR and between 9ab5cef and 9640b75.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/models/modeling_nemotron_nano.py
  • tensorrt_llm/_torch/models/modeling_radio.py
  • tests/unittest/_torch/modeling/test_nemotron_nano_preprocessing.py

@tensorrt-cicd
Collaborator

PR_Github #41809 [ run ] triggered by Bot. Commit: 9640b75 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41809 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 9 PM PST on 4/4.

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Apr 6, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #41985 [ run ] triggered by Bot. Commit: 9640b75 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #41985 [ run ] completed with state SUCCESS. Commit: 9640b75
/LLM/main/L0_MergeRequest_PR pipeline #32837 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nano-v3-video branch from 9640b75 to b3b2152 Compare April 7, 2026 18:56
…RADIO

Implement tubelet-based temporal compression for video inputs,
matching the Megatron-LM / vLLM video processing pipeline. T
consecutive frames are grouped into tubelets before embedding,
reducing the token count by a factor of `video_temporal_patch_size`.

Key additions:
- Aspect-ratio-preserving video frame resize and normalization
- Separate video embedder in RADIO ViT for tubelet projection
- Tubelet-aware token counting, frame separators, and EVS paths
- Fix align_corners=True -> False in position embedding interpolation

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz force-pushed the dev-nano-v3-video branch 2 times, most recently from 7c38538 to 3192c7a Compare April 9, 2026 23:45
@2ez4bz
Collaborator Author

2ez4bz commented Apr 9, 2026

/bot run

@2ez4bz 2ez4bz enabled auto-merge (squash) April 9, 2026 23:46
@tensorrt-cicd
Collaborator

PR_Github #42593 [ run ] triggered by Bot. Commit: 3192c7a Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42593 [ run ] completed with state SUCCESS. Commit: 3192c7a
/LLM/main/L0_MergeRequest_PR pipeline #33319 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz force-pushed the dev-nano-v3-video branch from 3192c7a to 696c3d2 Compare April 10, 2026 05:29
@2ez4bz
Collaborator Author

2ez4bz commented Apr 10, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #42673 [ run ] triggered by Bot. Commit: 696c3d2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42673 [ run ] completed with state SUCCESS. Commit: 696c3d2
/LLM/main/L0_MergeRequest_PR pipeline #33379 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Apr 10, 2026

/bot run

@tensorrt-cicd
Collaborator

PR_Github #42732 [ run ] triggered by Bot. Commit: 696c3d2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #42732 [ run ] completed with state SUCCESS. Commit: 696c3d2
/LLM/main/L0_MergeRequest_PR pipeline #33415 completed with status: 'SUCCESS'

CI Report

Link to invocation

@2ez4bz 2ez4bz merged commit 07ba6d0 into NVIDIA:main Apr 10, 2026
5 checks passed