[TRTLLM-11265][feat] Implement dynamic resolution for Nemotron VL#11894
2ez4bz merged 3 commits into NVIDIA:main
Conversation
Force-pushed 84b6e9e to cbac145 (compare)
Force-pushed cbac145 to ff21ea8 (compare)
/bot run --disable-fail-fast

PR_Github #37824 [ run ] triggered by Bot. Commit:
📝 Walkthrough

Introduces dynamic, per-image adaptive tiling and resolution handling for multimodal models. The Nemotron Nano model adds dynamic resolution tiling with token budgeting for variable-sized images, while the RADIO model adds dynamic sequence handling for per-image position embeddings. Both maintain backward compatibility with fixed-resolution paths. Comprehensive preprocessing tests validate the new functionality.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Image Input
    participant Processor as NanoV2VLInputProcessor
    participant Tiler as DynamicResolutionImageTiler
    participant Encoder as NanoV2VLVisionEncoder
    participant Model as NemotronH_Nano_VL_V2
    Client->>Processor: Provide images + token budget
    Processor->>Tiler: Request tiling parameters
    Tiler->>Tiler: Compute per-image patches<br/>(budget constrained)
    Tiler-->>Processor: Return resizing + patch counts
    Processor->>Processor: Resize images per tiling
    Processor->>Processor: Normalize + stack patches
    Processor->>Encoder: Forward patches + imgs_sizes
    Encoder->>Encoder: extract_feature_dynamic<br/>(per-image processing)
    Encoder-->>Model: Embeddings + per-image tokens
    Model->>Model: Integrate into multimodal flow
```
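As a rough illustration of the budget-constrained tiling step in the flow above, here is a minimal sketch. The helper name and signature are hypothetical (they do not match the actual `DynamicResolutionImageTiler` API): pick the tile grid whose aspect ratio best matches the image while keeping the tile count within the budget.

```python
from typing import Tuple


def pick_tile_grid(width: int, height: int, max_tiles: int) -> Tuple[int, int]:
    """Pick a (cols, rows) tile grid whose aspect ratio is closest to the
    image's, subject to cols * rows <= max_tiles; ties prefer more tiles
    (i.e. more resolution) within the budget."""
    aspect = width / height
    best, best_err = (1, 1), float("inf")
    for rows in range(1, max_tiles + 1):
        for cols in range(1, max_tiles // rows + 1):
            err = abs(cols / rows - aspect)
            if err < best_err or (err == best_err and cols * rows > best[0] * best[1]):
                best, best_err = (cols, rows), err
    return best
```

For a 2:1 image and a 12-tile budget this picks a 4x2 grid rather than the minimal 2x1, spending the available token budget on resolution.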
```mermaid
sequenceDiagram
    participant Client as Variable-Size Images
    participant Generator as ViTPatchGenerator
    participant Transformer as VisionTransformer
    participant RADIOVision as RADIOVisionModel
    participant Output as Feature Output
    Client->>Generator: Forward x, imgs_sizes
    Generator->>Generator: extract_patches_dynamic
    Generator->>Generator: apply_pos_enc_dynamic<br/>(per-image embeddings)
    Generator->>Generator: cls_token_dynamic<br/>(per-image CLS tokens)
    Generator-->>Transformer: patches with per-image<br/>sequence structure
    Transformer->>Transformer: forward_features<br/>(with dynamic branches)
    Transformer->>Transformer: prepare_attn_metadata<br/>(fixed max_seq_len)
    Transformer-->>RADIOVision: Processed features
    RADIOVision->>RADIOVision: _extract_final<br/>(dynamic reshape)
    RADIOVision-->>Output: Per-image aligned features
```
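The per-image sequence structure above hinges on knowing how many tokens each image contributes. A plausible sketch of what a helper like `calc_seq_lens` computes — the real signature in `modeling_radio` may differ — assuming one token per patch plus `num_skip` prefix tokens:

```python
from math import ceil


def calc_seq_lens_sketch(imgs_sizes, patch_size, num_skip=1):
    """Per-image token counts: one token per patch plus `num_skip`
    prefix tokens (e.g. a CLS token) for each (height, width) pair."""
    return [ceil(h / patch_size) * ceil(w / patch_size) + num_skip
            for h, w in imgs_sizes]
```

For example, a 224x224 image with 16-pixel patches yields 14 * 14 + 1 = 197 tokens.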
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (1 warning, 1 inconclusive)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches: 🧪 Generate unit tests (beta)
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/unittest/_torch/modeling/test_nemotron_nano_v2_vl_preprocessing.py (1)

10-14: Switch to module-level import for modeling_nemotron_nano. Please import the module and reference symbols via the namespace instead of importing classes directly. As per coding guidelines, `Python imports must use form from package.subpackage import module (never from module import Class)`.

♻️ Suggested change

```diff
-from tensorrt_llm._torch.models.modeling_nemotron_nano import (
-    DynamicResolutionImageTiler,
-    DynamicResolutionParams,
-    NanoV2VLVisionEncoder,
-)
+from tensorrt_llm._torch.models import modeling_nemotron_nano
 ...
-    return DynamicResolutionImageTiler(**defaults)
+    return modeling_nemotron_nano.DynamicResolutionImageTiler(**defaults)
 ...
-    DynamicResolutionParams(
+    modeling_nemotron_nano.DynamicResolutionParams(
 ...
-    encoder = mock.MagicMock(spec=NanoV2VLVisionEncoder)
+    encoder = mock.MagicMock(spec=modeling_nemotron_nano.NanoV2VLVisionEncoder)
 ...
-    NanoV2VLVisionEncoder.forward(vision_encoder, [mm_param])
+    modeling_nemotron_nano.NanoV2VLVisionEncoder.forward(vision_encoder, [mm_param])
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/unittest/_torch/modeling/test_nemotron_nano_v2_vl_preprocessing.py` around lines 10-14: the test imports classes directly from modeling_nemotron_nano; change to a module-level import (import tensorrt_llm._torch.models.modeling_nemotron_nano as modeling_nemotron_nano) and update all references to DynamicResolutionImageTiler, DynamicResolutionParams, and NanoV2VLVisionEncoder in this file to use the module namespace to comply with the package import guideline.

tensorrt_llm/_torch/models/modeling_nemotron_nano.py (1)
40-40: Use module namespace import for modeling_radio. Line 40 imports symbols directly; switch to a module import and access members via the module namespace. As per coding guidelines, `When importing in Python, always maintain the namespace. Import the module, not individual classes or functions`.

♻️ Suggested change

```diff
-from .modeling_radio import RADIOVisionModel, calc_seq_lens
+from . import modeling_radio
 ...
-    self.vision_model = RADIOVisionModel(vision_model_config, disable_quantization=True)
+    self.vision_model = modeling_radio.RADIOVisionModel(
+        vision_model_config, disable_quantization=True
+    )
 ...
-    seq_lens = calc_seq_lens(imgs_sizes, patch_dim)
+    seq_lens = modeling_radio.calc_seq_lens(imgs_sizes, patch_dim)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py` at line 40: replace the direct symbol import from .modeling_radio with a module-level import (e.g., "from . import modeling_radio") and update all references to RADIOVisionModel and calc_seq_lens to use the module namespace (modeling_radio.RADIOVisionModel and modeling_radio.calc_seq_lens); ensure any type hints, instantiations, or calls are updated accordingly so the file no longer relies on direct symbol imports.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py`:
- Around line 470-483: The current dynamic-mode gating uses has_dynamic =
any(...) which can enable the dynamic image path for mixed image/video batches
and then access imgs_sizes on modalities that are video (causing failures) and
skip EVS handling; change the gating to compute per-modality flags (e.g.,
image_needs_dynamic = ["imgs_sizes" in multimodal_data.get(modality_type, {})
for modality_type, multimodal_data in zip(modality_types, multimodal_data_lst)])
and only take the dynamic branch for modalities that are strictly images (or
where all requests for that modality support imgs_sizes), calling
extract_feature_dynamic only for those modalities and falling back to the
existing static/EVS paths for video or modalities without imgs_sizes; update
mm_embedding assembly so you don't return early for mixed batches and preserve
EVS handling by invoking the EVS-specific code path where appropriate.
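A minimal sketch of the per-modality gating suggested above. Names like `imgs_sizes` and the data layout follow the review comment; the helper itself is hypothetical, not the actual model code:

```python
def select_dynamic_modalities(modality_types, multimodal_data_lst):
    """Per-modality flags: only image entries carrying 'imgs_sizes' take
    the dynamic-resolution path; video (and any entry without sizes)
    keeps the existing static/EVS path."""
    flags = []
    for modality_type, multimodal_data in zip(modality_types, multimodal_data_lst):
        has_sizes = "imgs_sizes" in multimodal_data.get(modality_type, {})
        flags.append(modality_type == "image" and has_sizes)
    return flags
```

With per-modality flags like these, a mixed image/video batch no longer flips the whole batch into the dynamic branch.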
- Around line 121-122: The current computation of
closest_patch_height/closest_patch_width uses round(orig / self._patch_size +
0.5) which increments exact multiples of patch_size; replace the expression with
a proper half-up integer division so exact multiples stay unchanged (e.g.,
compute using floor((orig + self._patch_size/2) / self._patch_size) or int((orig
+ self._patch_size/2) / self._patch_size)). Update both closest_patch_height and
closest_patch_width (references: closest_patch_height, closest_patch_width,
orig_height, orig_width, self._patch_size) and add/import math.floor if you
choose the math.floor variant. Ensure behavior preserves exact multiples and
avoids over-allocating patches.
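The rounding issue above can be reproduced in isolation. For exact multiples of `patch_size`, `orig / patch_size + 0.5` lands exactly on a `.5` value, where Python's round-half-to-even then bumps odd quotients up by a whole patch; half-up integer division keeps exact multiples unchanged. (Standalone sketch, not the actual model code.)

```python
def closest_patch_count_buggy(orig, patch_size):
    # round(x + 0.5) lands on an exact .5 for exact multiples, where
    # Python's banker's rounding bumps odd quotients up a full patch.
    return round(orig / patch_size + 0.5)


def closest_patch_count_fixed(orig, patch_size):
    # Half-up integer division: exact multiples stay unchanged,
    # and true half values still round up.
    return (orig + patch_size // 2) // patch_size
```

240 px at a 16-px patch size is exactly 15 patches, but the buggy variant allocates 16; the fixed variant preserves exact multiples and still rounds 104/16 = 6.5 up to 7.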
In `@tensorrt_llm/_torch/models/modeling_radio.py`:
- Around line 855-867: The dynamic-resolution branch that slices flattened patch
tokens (uses imgs_sizes, calc_seq_lens, patch_gen.num_skip/patch_size and builds
all_patches/all_feat) must be guarded against inputs in feature_fmt == 'NCHW'
(or whenever x/y are full NCHW tensors rather than flattened patches); add a
check (e.g., if imgs_sizes is not None and feature_fmt != 'NCHW') and only run
the calc_seq_lens/patch slicing logic when the tensor is flattened patches,
otherwise follow the existing NCHW path or reshape using explicit H/W from
imgs_sizes; ensure the code references patch_gen.num_skip, patch_gen.patch_size
and calc_seq_lens to compute num_patches and build all_feat accordingly.
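A sketch of the guard suggested above, using plain Python sequences to show the slicing logic. The helper is hypothetical (names like `feature_fmt`, `imgs_sizes`, and `num_skip` follow the comment), and it assumes image sizes are exact patch multiples; the real code operates on tensors.

```python
def maybe_slice_patches(x, imgs_sizes, feature_fmt, patch_size, num_skip):
    """Return per-image patch-token slices only when `x` holds flattened
    patches; for NCHW inputs (or when no sizes are given) fall back to
    the caller's existing path by returning None."""
    if imgs_sizes is None or feature_fmt == "NCHW":
        return None
    # One entry per image: patches plus `num_skip` prefix (CLS) tokens.
    seq_lens = [(h // patch_size) * (w // patch_size) + num_skip
                for h, w in imgs_sizes]
    out, start = [], 0
    for n in seq_lens:
        out.append(x[start + num_skip:start + n])  # drop the skip/CLS tokens
        start += n
    return out
```

The key point is the early `None` return: the flattened-patch slicing never runs on full NCHW tensors.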
---
Nitpick comments:
In `@tensorrt_llm/_torch/models/modeling_nemotron_nano.py`:
- Line 40: Replace the direct symbol import from .modeling_radio with a
module-level import (e.g., use "from . import modeling_radio" or "import
tensorrt_llm._torch.models.modeling_radio as modeling_radio") and then update
all references in this file that currently use RADIOVisionModel and
calc_seq_lens to use the module namespace (modeling_radio.RADIOVisionModel and
modeling_radio.calc_seq_lens); ensure any type hints, instantiations, or calls
are updated accordingly so the file no longer relies on direct symbol imports.
In `@tests/unittest/_torch/modeling/test_nemotron_nano_v2_vl_preprocessing.py`:
- Around line 10-14: The test imports classes directly from
modeling_nemotron_nano; change to a module-level import (import
tensorrt_llm._torch.models.modeling_nemotron_nano as modeling_nemotron_nano) and
update all references to DynamicResolutionImageTiler, DynamicResolutionParams,
and NanoV2VLVisionEncoder in this file to use the module namespace
(modeling_nemotron_nano.DynamicResolutionImageTiler,
modeling_nemotron_nano.DynamicResolutionParams,
modeling_nemotron_nano.NanoV2VLVisionEncoder) to comply with the package import
guideline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9ea41539-110d-4e56-8e5d-8abb815daa2e
📒 Files selected for processing (4)
- tensorrt_llm/_torch/models/modeling_nemotron_nano.py
- tensorrt_llm/_torch/models/modeling_radio.py
- tests/integration/test_lists/test-db/l0_a10.yml
- tests/unittest/_torch/modeling/test_nemotron_nano_v2_vl_preprocessing.py
PR_Github #37824 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #37897 [ run ] triggered by Bot. Commit:

PR_Github #37897 [ run ] completed with state
Force-pushed ff21ea8 to ccfa733 (compare)
Force-pushed ccfa733 to 080d917 (compare)
/bot run --disable-fail-fast
Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
This commit implements "dynamic resolution" handling of images for Nemotron VL models. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Force-pushed 080d917 to b7844fb (compare)
/bot run --disable-fail-fast

PR_Github #38294 [ run ] triggered by Bot. Commit:

PR_Github #38294 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #38375 [ run ] triggered by Bot. Commit:

PR_Github #38375 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #38463 [ run ] triggered by Bot. Commit:

PR_Github #38463 [ run ] completed with state
…IDIA#11894) This commit implements "dynamic resolution" handling of images for Nemotron VL models. Signed-off-by: William Zhang <133824995+2ez4bz@users.noreply.github.com>
Summary by CodeRabbit
New Features
Tests
[None][feat] Implement dynamic resolution
NOTE: this essentially ports over the changes from @netanel-haber's PR to vLLM.
This change implements "dynamic resolution" handling of images
for Nemotron VL models.
It also adds some logic to handle newer configuration class definitions
for Nemotron VL.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.