
[TRTLLM-11265][feat] Implement dynamic resolution for Nemotron VL #11894

Merged
2ez4bz merged 3 commits into NVIDIA:main from 2ez4bz:dev-nano-vl-dyn-res
Mar 10, 2026

Conversation

@2ez4bz
Collaborator

@2ez4bz 2ez4bz commented Mar 4, 2026

Summary by CodeRabbit

  • New Features

    • Added dynamic per-image resolution handling for multimodal models, enabling adaptive tiling and variable image size processing.
    • Enhanced vision encoders to support flexible image resolutions with improved token budgeting and per-image patch optimization.
  • Tests

    • Added comprehensive unit tests for preprocessing logic with parametrized image sizes and resolution constraints.

[None][feat] Implement dynamic resolution

NOTE: this essentially ports over the changes from @netanel-haber's PR to vLLM.

This change implements "dynamic resolution" handling of images
for Nemotron VL models.

It also adds some logic to handle newer configuration class definitions
for Nemotron VL.
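
For intuition, dynamic-resolution preprocessing in this family of VL models typically selects a per-image tile grid under a tile (token) budget, preserving aspect ratio as closely as possible. A minimal sketch, with all names and defaults hypothetical rather than taken from this PR:

```python
# Illustrative only: choose a tile grid for an image under a budget.
# `tile_size` and `max_tiles` are assumed parameters, not the PR's API.

def choose_tile_grid(height, width, tile_size=512, max_tiles=12):
    """Pick the (rows, cols) grid that best matches the image's aspect
    ratio without exceeding the tile budget; ties go to more tiles."""
    aspect = width / height
    best, best_err = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles + 1):
            if rows * cols > max_tiles:
                continue
            err = abs((cols / rows) - aspect)
            # Closer aspect ratio wins; on a tie, prefer more tiles (more detail).
            if err < best_err or (err == best_err and rows * cols > best[0] * best[1]):
                best, best_err = (rows, cols), err
    return best

rows, cols = choose_tile_grid(1080, 1920, max_tiles=12)  # → (2, 4)
# Each tile would then be resized to tile_size x tile_size before encoding,
# so the token cost scales with rows * cols.
```

The budget is what makes this "token budgeting": a wide panorama gets more horizontal tiles, a small icon gets a single tile, and the total never exceeds what the vision encoder's sequence budget allows.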

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in this PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@2ez4bz 2ez4bz changed the title [None][feat] Implement dynamic resolution [None][feat] Implement dynamic resolution for Nemotron VL Mar 4, 2026
@Wanli-Jiang Wanli-Jiang self-requested a review March 4, 2026 08:58
@2ez4bz 2ez4bz force-pushed the dev-nano-vl-dyn-res branch from cbac145 to ff21ea8 on March 5, 2026 06:52
@2ez4bz 2ez4bz marked this pull request as ready for review March 5, 2026 06:52
@2ez4bz 2ez4bz requested review from a team as code owners March 5, 2026 06:52
@2ez4bz 2ez4bz requested a review from jaedeok-nvidia March 5, 2026 06:52
@2ez4bz
Collaborator Author

2ez4bz commented Mar 5, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #37824 [ run ] triggered by Bot. Commit: ff21ea8 Link to invocation

@coderabbitai
Contributor

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Introduces dynamic, per-image adaptive tiling and resolution handling for multimodal models. The Nemotron Nano model adds dynamic resolution tiling with token budgeting for variable-sized images, while the RADIO model adds dynamic sequence handling for per-image position embeddings. Both maintain backward compatibility with fixed-resolution paths. Comprehensive preprocessing tests validate the new functionality.

Changes

  • tensorrt_llm/_torch/models/modeling_nemotron_nano.py (Nemotron Nano Dynamic Resolution): Added DynamicResolutionParams and DynamicResolutionImageTiler classes for adaptive tiling with token budgeting. Introduced _process_images_dynamic and dynamic feature extraction methods in NanoV2VLVisionEncoder. Extended image processing to support per-image sizing and token budgeting. Enhanced RMSNorm eps handling for config compatibility. Falls back to the existing fixed-tile path when dynamic tiling is disabled.
  • tensorrt_llm/_torch/models/modeling_radio.py (RADIO Dynamic Sequences): Added calc_seq_len and calc_seq_lens utility functions for dynamic sequence length computation. Extended ViTPatchGenerator, VisionTransformer, and RADIOVisionModelBase with dynamic processing branches supporting variable image resolutions via the imgs_sizes parameter. Introduced per-image position encoding and CLS token handling. Fixed the max sequence length for attention metadata stability. Updated forward_features and _extract_final paths with dynamic sequence propagation.
  • tests/integration/test_lists/test-db/l0_a10.yml (Test Infrastructure): Added a new test entry for Nemotron Nano v2 VL preprocessing validation.
  • tests/unittest/_torch/modeling/test_nemotron_nano_v2_vl_preprocessing.py (Preprocessing Tests): New test module validating DynamicResolutionImageTiler parameter bounds, token budgeting constraints, patch processing, and convergence. Tests both dynamic and fixed-tile vision encoder forward paths with mocks and parametrized scenarios.
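
The calc_seq_len / calc_seq_lens utilities mentioned above presumably derive per-image ViT token counts from image dimensions and the patch size; a minimal sketch under that assumption (signatures and the CLS-token parameter are guesses, not the file's actual code):

```python
# Hedged sketch: the per-image token count for a ViT is the patch-grid
# area, plus any prepended summary/CLS tokens.

def calc_seq_len(height, width, patch_size, num_cls_tokens=0):
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size) + num_cls_tokens

def calc_seq_lens(imgs_sizes, patch_size, num_cls_tokens=0):
    # One sequence length per (height, width) pair.
    return [calc_seq_len(h, w, patch_size, num_cls_tokens) for h, w in imgs_sizes]

calc_seq_lens([(224, 224), (224, 448)], patch_size=16)
# 224/16 = 14, so the grids are 14x14 = 196 and 14x28 = 392 tokens.
```

Per-image sequence lengths like these are what let variable-resolution images share one flattened batch while attention metadata still knows where each image's tokens begin and end.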

Sequence Diagram(s)

sequenceDiagram
    participant Client as Image Input
    participant Processor as NanoV2VLInputProcessor
    participant Tiler as DynamicResolutionImageTiler
    participant Encoder as NanoV2VLVisionEncoder
    participant Model as NemotronH_Nano_VL_V2

    Client->>Processor: Provide images + token budget
    Processor->>Tiler: Request tiling parameters
    Tiler->>Tiler: Compute per-image patches<br/>(budget constrained)
    Tiler-->>Processor: Return resizing + patch counts
    Processor->>Processor: Resize images per tiling
    Processor->>Processor: Normalize + stack patches
    Processor->>Encoder: Forward patches + imgs_sizes
    Encoder->>Encoder: extract_feature_dynamic<br/>(per-image processing)
    Encoder-->>Model: Embeddings + per-image tokens
    Model->>Model: Integrate into multimodal flow
sequenceDiagram
    participant Client as Variable-Size Images
    participant Generator as ViTPatchGenerator
    participant Transformer as VisionTransformer
    participant RADIOVision as RADIOVisionModel
    participant Output as Feature Output

    Client->>Generator: Forward x, imgs_sizes
    Generator->>Generator: extract_patches_dynamic
    Generator->>Generator: apply_pos_enc_dynamic<br/>(per-image embeddings)
    Generator->>Generator: cls_token_dynamic<br/>(per-image CLS tokens)
    Generator-->>Transformer: patches with per-image<br/>sequence structure
    Transformer->>Transformer: forward_features<br/>(with dynamic branches)
    Transformer->>Transformer: prepare_attn_metadata<br/>(fixed max_seq_len)
    Transformer-->>RADIOVision: Processed features
    RADIOVision->>RADIOVision: _extract_final<br/>(dynamic reshape)
    RADIOVision-->>Output: Per-image aligned features

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 38.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check: ❓ Inconclusive. The PR description explains the feature (dynamic resolution handling for Nemotron VL) and acknowledges the upstream source, but the Test Coverage section is incomplete and lacks specific test details. Resolution: complete the Test Coverage section by explicitly listing the test files and their coverage (e.g., 'test_nemotron_nano_v2_vl_preprocessing.py validates dynamic tiling parameters and vision encoder paths').
✅ Passed checks (1 passed)
  • Title check: ✅ Passed. The title clearly and specifically describes the main feature, implementing dynamic resolution for Nemotron VL models, which aligns with the primary changes across all modified files.



Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
tests/unittest/_torch/modeling/test_nemotron_nano_v2_vl_preprocessing.py (1)

10-14: Switch to module-level import for modeling_nemotron_nano.

Please import the module and reference symbols via namespace instead of importing classes directly.

♻️ Suggested change
-from tensorrt_llm._torch.models.modeling_nemotron_nano import (
-    DynamicResolutionImageTiler,
-    DynamicResolutionParams,
-    NanoV2VLVisionEncoder,
-)
+from tensorrt_llm._torch.models import modeling_nemotron_nano
...
-    return DynamicResolutionImageTiler(**defaults)
+    return modeling_nemotron_nano.DynamicResolutionImageTiler(**defaults)
...
-                DynamicResolutionParams(
+                modeling_nemotron_nano.DynamicResolutionParams(
...
-    encoder = mock.MagicMock(spec=NanoV2VLVisionEncoder)
+    encoder = mock.MagicMock(spec=modeling_nemotron_nano.NanoV2VLVisionEncoder)
...
-    NanoV2VLVisionEncoder.forward(vision_encoder, [mm_param])
+    modeling_nemotron_nano.NanoV2VLVisionEncoder.forward(vision_encoder, [mm_param])
As per coding guidelines, `Python imports must use form from package.subpackage import module (never from module import Class)`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unittest/_torch/modeling/test_nemotron_nano_v2_vl_preprocessing.py`
around lines 10 - 14, The test imports classes directly from
modeling_nemotron_nano; change to a module-level import (import
tensorrt_llm._torch.models.modeling_nemotron_nano as modeling_nemotron_nano) and
update all references to DynamicResolutionImageTiler, DynamicResolutionParams,
and NanoV2VLVisionEncoder in this file to use the module namespace
(modeling_nemotron_nano.DynamicResolutionImageTiler,
modeling_nemotron_nano.DynamicResolutionParams,
modeling_nemotron_nano.NanoV2VLVisionEncoder) to comply with the package import
guideline.
tensorrt_llm/_torch/models/modeling_nemotron_nano.py (1)

40-40: Use module namespace import for modeling_radio.

Line 40 imports symbols directly; switch to module import and access members via module namespace.

♻️ Suggested change
-from .modeling_radio import RADIOVisionModel, calc_seq_lens
+from . import modeling_radio
...
-        self.vision_model = RADIOVisionModel(vision_model_config, disable_quantization=True)
+        self.vision_model = modeling_radio.RADIOVisionModel(
+            vision_model_config, disable_quantization=True
+        )
...
-        seq_lens = calc_seq_lens(imgs_sizes, patch_dim)
+        seq_lens = modeling_radio.calc_seq_lens(imgs_sizes, patch_dim)
As per coding guidelines, `When importing in Python, always maintain the namespace. Import the module, not individual classes or functions`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` at line 40, Replace the
direct symbol import from .modeling_radio with a module-level import (e.g., use
"from . import modeling_radio" or "import
tensorrt_llm._torch.models.modeling_radio as modeling_radio") and then update
all references in this file that currently use RADIOVisionModel and
calc_seq_lens to use the module namespace (modeling_radio.RADIOVisionModel and
modeling_radio.calc_seq_lens); ensure any type hints, instantiations, or calls
are updated accordingly so the file no longer relies on direct symbol imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py`:
- Around line 470-483: The current dynamic-mode gating uses has_dynamic =
any(...) which can enable the dynamic image path for mixed image/video batches
and then access imgs_sizes on modalities that are video (causing failures) and
skip EVS handling; change the gating to compute per-modality flags (e.g.,
image_needs_dynamic = ["imgs_sizes" in multimodal_data.get(modality_type, {})
for modality_type, multimodal_data in zip(modality_types, multimodal_data_lst)])
and only take the dynamic branch for modalities that are strictly images (or
where all requests for that modality support imgs_sizes), calling
extract_feature_dynamic only for those modalities and falling back to the
existing static/EVS paths for video or modalities without imgs_sizes; update
mm_embedding assembly so you don't return early for mixed batches and preserve
EVS handling by invoking the EVS-specific code path where appropriate.
- Around line 121-122: The current computation of
closest_patch_height/closest_patch_width uses round(orig / self._patch_size +
0.5) which increments exact multiples of patch_size; replace the expression with
a proper half-up integer division so exact multiples stay unchanged (e.g.,
compute using floor((orig + self._patch_size/2) / self._patch_size) or int((orig
+ self._patch_size/2) / self._patch_size)). Update both closest_patch_height and
closest_patch_width (references: closest_patch_height, closest_patch_width,
orig_height, orig_width, self._patch_size) and add/import math.floor if you
choose the math.floor variant. Ensure behavior preserves exact multiples and
avoids over-allocating patches.
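
The rounding pitfall flagged in that comment is easy to demonstrate: Python's built-in round() uses banker's rounding (half to even), so round(orig / patch_size + 0.5) behaves inconsistently on exact multiples, while half-up integer division leaves them unchanged. A small standalone illustration (not the file's actual code):

```python
patch_size = 16

def patches_round(orig):
    # Problematic form: round() rounds halves to the nearest even value,
    # so an exact multiple like 48 gives round(3.5) == 4, over-allocating,
    # while 32 gives round(2.5) == 2 — inconsistent behavior.
    return round(orig / patch_size + 0.5)

def patches_half_up(orig):
    # Half-up integer division: exact multiples stay unchanged,
    # and anything past the halfway point rounds up.
    return (orig + patch_size // 2) // patch_size

print(patches_round(48), patches_half_up(48))  # 4 vs 3: 48 is exactly 3 patches
print(patches_round(32), patches_half_up(32))  # 2 vs 2: agrees here, by luck
```

This is why the suggested fix computes floor((orig + patch_size/2) / patch_size) instead of adding 0.5 before round().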

In `@tensorrt_llm/_torch/models/modeling_radio.py`:
- Around line 855-867: The dynamic-resolution branch that slices flattened patch
tokens (uses imgs_sizes, calc_seq_lens, patch_gen.num_skip/patch_size and builds
all_patches/all_feat) must be guarded against inputs in feature_fmt == 'NCHW'
(or whenever x/y are full NCHW tensors rather than flattened patches); add a
check (e.g., if imgs_sizes is not None and feature_fmt != 'NCHW') and only run
the calc_seq_lens/patch slicing logic when the tensor is flattened patches,
otherwise follow the existing NCHW path or reshape using explicit H/W from
imgs_sizes; ensure the code references patch_gen.num_skip, patch_gen.patch_size
and calc_seq_lens to compute num_patches and build all_feat accordingly.
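
The guard that comment asks for can be sketched in a few lines; the function name, the imgs_sizes layout, and the feature_fmt values below are illustrative assumptions, not the actual code in modeling_radio.py:

```python
# Sketch: only run flattened-patch slicing when features are token
# sequences; full NCHW feature maps already carry explicit H/W.

def split_dynamic_features(flat_tokens, imgs_sizes, patch_size, feature_fmt):
    if imgs_sizes is None or feature_fmt == "NCHW":
        # Fixed-resolution or NCHW path: no per-image slicing needed.
        return flat_tokens
    out, start = [], 0
    for h, w in imgs_sizes:
        n = (h // patch_size) * (w // patch_size)
        out.append(flat_tokens[start:start + n])
        start += n
    return out
```

The point of the guard is that slicing by computed sequence lengths is only meaningful when the input is a flat token sequence; applying it to an NCHW tensor would misinterpret channel/spatial dimensions as token counts.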

---

Nitpick comments:
In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py`:
- Line 40: Replace the direct symbol import from .modeling_radio with a
module-level import (e.g., use "from . import modeling_radio" or "import
tensorrt_llm._torch.models.modeling_radio as modeling_radio") and then update
all references in this file that currently use RADIOVisionModel and
calc_seq_lens to use the module namespace (modeling_radio.RADIOVisionModel and
modeling_radio.calc_seq_lens); ensure any type hints, instantiations, or calls
are updated accordingly so the file no longer relies on direct symbol imports.

In `@tests/unittest/_torch/modeling/test_nemotron_nano_v2_vl_preprocessing.py`:
- Around line 10-14: The test imports classes directly from
modeling_nemotron_nano; change to a module-level import (import
tensorrt_llm._torch.models.modeling_nemotron_nano as modeling_nemotron_nano) and
update all references to DynamicResolutionImageTiler, DynamicResolutionParams,
and NanoV2VLVisionEncoder in this file to use the module namespace
(modeling_nemotron_nano.DynamicResolutionImageTiler,
modeling_nemotron_nano.DynamicResolutionParams,
modeling_nemotron_nano.NanoV2VLVisionEncoder) to comply with the package import
guideline.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9ea41539-110d-4e56-8e5d-8abb815daa2e

📥 Commits

Reviewing files that changed from the base of the PR and between 12f2f39 and ff21ea8.

📒 Files selected for processing (4)
  • tensorrt_llm/_torch/models/modeling_nemotron_nano.py
  • tensorrt_llm/_torch/models/modeling_radio.py
  • tests/integration/test_lists/test-db/l0_a10.yml
  • tests/unittest/_torch/modeling/test_nemotron_nano_v2_vl_preprocessing.py

@tensorrt-cicd
Collaborator

PR_Github #37824 [ run ] completed with state SUCCESS. Commit: ff21ea8
/LLM/main/L0_MergeRequest_PR pipeline #29285 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Mar 5, 2026

/bot run --disable-fail-fast

@2ez4bz 2ez4bz changed the title [None][feat] Implement dynamic resolution for Nemotron VL [TRTLLM-11264][feat] Implement dynamic resolution for Nemotron VL Mar 5, 2026
@2ez4bz 2ez4bz changed the title [TRTLLM-11264][feat] Implement dynamic resolution for Nemotron VL [TRTLLM-11265][feat] Implement dynamic resolution for Nemotron VL Mar 5, 2026
@tensorrt-cicd
Collaborator

PR_Github #37897 [ run ] triggered by Bot. Commit: ff21ea8 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #37897 [ run ] completed with state SUCCESS. Commit: ff21ea8
/LLM/main/L0_MergeRequest_PR pipeline #29343 completed with status: 'SUCCESS'

Link to invocation

@2ez4bz 2ez4bz force-pushed the dev-nano-vl-dyn-res branch from ff21ea8 to ccfa733 on March 7, 2026 06:02
@2ez4bz 2ez4bz force-pushed the dev-nano-vl-dyn-res branch from ccfa733 to 080d917 on March 7, 2026 06:14
@2ez4bz
Collaborator Author

2ez4bz commented Mar 7, 2026

/bot run --disable-fail-fast

Collaborator

@Wanli-Jiang Wanli-Jiang left a comment


LGTM now.

2ez4bz added 2 commits March 9, 2026 09:16
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
This commit implements "dynamic resolution" handling of images
for Nemotron VL models.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
@2ez4bz 2ez4bz force-pushed the dev-nano-vl-dyn-res branch from 080d917 to b7844fb on March 9, 2026 16:20
@2ez4bz
Collaborator Author

2ez4bz commented Mar 9, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38294 [ run ] triggered by Bot. Commit: b7844fb Link to invocation

@2ez4bz 2ez4bz enabled auto-merge (squash) March 9, 2026 20:40
@tensorrt-cicd
Collaborator

PR_Github #38294 [ run ] completed with state SUCCESS. Commit: b7844fb
/LLM/main/L0_MergeRequest_PR pipeline #29673 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Mar 10, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38375 [ run ] triggered by Bot. Commit: 283de88 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38375 [ run ] completed with state SUCCESS. Commit: 283de88
/LLM/main/L0_MergeRequest_PR pipeline #29743 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@2ez4bz
Collaborator Author

2ez4bz commented Mar 10, 2026

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #38463 [ run ] triggered by Bot. Commit: 283de88 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #38463 [ run ] completed with state SUCCESS. Commit: 283de88
/LLM/main/L0_MergeRequest_PR pipeline #29820 completed with status: 'SUCCESS'

Link to invocation

@2ez4bz 2ez4bz merged commit 3ce0ec8 into NVIDIA:main Mar 10, 2026
5 checks passed
bmarimuthu-nv pushed a commit to nv-auto-deploy/TensorRT-LLM that referenced this pull request Mar 12, 2026
…IDIA#11894)

This commit implements "dynamic resolution" handling of images
for Nemotron VL models.

Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>

4 participants