[TRTLLM-12427][perf] Qwen2.5/3/3.5-VL Performance Optimization by yechank-nvidia · Pull Request #11943 · NVIDIA/TensorRT-LLM

yechank-nvidia · 2026-03-05T09:45:24Z

Summary

This PR reworks the Qwen2-VL / Qwen2.5-VL / Qwen3-VL PyTorch-backend implementations to cut host (CPU) overhead in the vision tower and input processors, enables piecewise CUDA graph for these models, and fixes several correctness issues around mRoPE and vision-block RoPE. The optimizations target the high-concurrency serving regime, where launch overhead and host↔device syncs dominate.

What changed

Vision tower / rotary embedding

Drop the HF rotary dependency and memoize the frequency table; pre-compute cos/sin into init-time buffers so the forward path no longer calls .cos()/.sin() per step.
Add an L2 per-tile GPU rotary cache for Qwen2.5/3-VL vision and annotate its measured GPU footprint.
Remove dead code and a redundant batched pos-embed kernel from the vision tower.

Host-overhead reduction

Add an async_tensor_h2d helper and route all Qwen2.5/3-VL H2D copies (vision pos_ids, window_index, rope_position_ids) through it.
Skip redundant pinning in maybe_pin_memory when the input is already pinned.
Pre-allocate deepstack scratch and skip vision-encoder host syncs.
Add a text-only fast path in the Qwen2.5/3-VL input processors so text-only requests avoid vision-path work.

CUDA graph

Enable piecewise CUDA graph for LLM prefill on Qwen2/3-VL.

Refactor

Inherit the Qwen3-VL input processor from the Qwen2-VL base to remove duplication.

Performance

Model: Qwen3VLForConditionalGeneration (FP8 weights + KV cache, bf16 vision encoder). Hardware: H200 ×1. Workload: image+text, ISL=1000, OSL=1000, 512×512 image, KV block reuse off, 3-run mean. Comparison is against upstream main with identical serving config (max_batch_size=256, max_num_tokens=8192, num_postprocess_workers=4, cuda_graph_config.enable_padding=true, chunked prefill on).

System output-token throughput (tok/s)

concurrency	upstream	this PR	Δ
32	4735	4838	+2.2%
64	6948	7359	+5.9%
128	8553	9767	+14.2%

Other metrics

Per-user throughput (tok/s/user) — c=128: 77.5 → 89.8 (+15.9%)
TTFT (ms) — c=1: 147 → 81 (−45%); c=32: 815 → 690 (−15%); c=128: 1953 → 1804 (−8%)
ITL (ms/token) — c=128: 12.96 → 11.18 (−13.7%)
Request latency (ms) — c=128: 14901 → 12975 (−12.9%)

Gains are concentrated at high concurrency, where the host-side savings (fewer CPU launches, async H2D, higher CUDA-graph hit ratio) and the lower-overhead mRoPE path matter most. Low-concurrency TTFT also improves substantially (c=1 nearly halved) from the text fast path and removed vision-encoder syncs.

Summary by CodeRabbit

Release Notes

Bug Fixes
- Fixed rotary cache index computation in attention kernels.
- Corrected request broadcasting logic for distributed inference configurations.
Performance
- Optimized multimodal model inference through vectorized RoPE position indexing and GPU memory operations.
- Added Triton kernel support for position embedding interpolation in vision models.
- Improved device transfer efficiency with async host-to-device operations and optimized tensor placement.
- Enhanced distributed tensor parallelism handling.
Tests
- Expanded test coverage for multimodal RoPE configurations and vision component equivalence validation.

yechank-nvidia · 2026-05-21T11:16:09Z

/bot run

tensorrt-cicd · 2026-05-21T11:21:57Z

PR_Github #49687 [ run ] triggered by Bot. Commit: 3489cf2 Link to invocation

tensorrt-cicd · 2026-05-21T11:35:15Z

PR_Github #49687 [ run ] completed with state FAILURE. Commit: 3489cf2
/LLM/main/L0_MergeRequest_PR pipeline #39294 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yechank-nvidia · 2026-05-26T05:25:44Z

/bot run

tensorrt-cicd · 2026-05-26T05:31:03Z

PR_Github #50284 [ run ] triggered by Bot. Commit: da59f84 Link to invocation

tensorrt-cicd · 2026-05-26T07:11:16Z

PR_Github #50284 [ run ] completed with state SUCCESS. Commit: da59f84
/LLM/main/L0_MergeRequest_PR pipeline #39813 completed with status: 'FAILURE'


yechank-nvidia · 2026-05-27T04:04:02Z

/bot run

tensorrt-cicd · 2026-05-27T04:10:44Z

PR_Github #50450 [ run ] triggered by Bot. Commit: 657bb10 Link to invocation

tensorrt-cicd · 2026-05-27T09:16:15Z

PR_Github #50450 [ run ] completed with state SUCCESS. Commit: 657bb10
/LLM/main/L0_MergeRequest_PR pipeline #39969 completed with status: 'FAILURE'


…en3-VL vision

Replace ``HFQwen3VLVisionRotaryEmbedding(head_dim // 2)`` with an in-tree implementation:

* ``Qwen3VisionModel.__init__`` registers a ``rope_inv_freq`` buffer using the same formula the HF class used internally (``1.0 / (10000.0 ** (torch.arange(0, rope_dim, 2) / rope_dim))``, ``rope_dim = head_dim // 2``).
* New ``_freq_table(max_hw)`` is ``@lru_cache(maxsize=64)``. It computes ``torch.outer(arange(max_hw), inv_freq)`` once per unique ``max_hw``. Previously every ``_rotary_pos_emb_thw`` cache miss re-ran HF's ``forward()``, which built a fresh ``arange + outer`` every time. Production has at most a handful of distinct max-extents across the served tile set, so the cache stays small (``maxsize=64`` is plenty).

The HF dependency was only used as a stateless functor; nothing about its module identity or persistent buffers was reused. Dropping it also cuts the ``from transformers...Qwen3VLVisionRotaryEmbedding`` import.

Tests:

* ``test_freq_table_matches_hf_rotary``: 6 ``max_hw`` values (16/32/48/64/100/128) -> in-tree output bit-matches HF (atol=0, rtol=0, same dtype, same shape).
* ``test_freq_table_lru_cache_hit``: repeated ``max_hw`` returns the same cached device tensor object.
* All pre-existing ``_rotary_pos_emb_thw`` / ``rot_pos_emb_l2`` / ``batched_pos_embed_*`` tests still pass; they now exercise the in-tree path end-to-end.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
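A minimal sketch of the memoized frequency table this commit describes, written in plain Python rather than torch so it runs without a GPU. The names `inv_freq`, `freq_table`, and `ROPE_DIM` are illustrative stand-ins, not the in-tree code; only the inverse-frequency formula and the `lru_cache(maxsize=64)` idea come from the commit message.

```python
from functools import lru_cache

ROPE_DIM = 36  # assumption: head_dim // 2 with head_dim = 72


def inv_freq(rope_dim: int = ROPE_DIM, theta: float = 10000.0) -> tuple:
    """1 / theta ** (i / rope_dim) for i = 0, 2, ..., rope_dim - 2."""
    return tuple(1.0 / (theta ** (i / rope_dim)) for i in range(0, rope_dim, 2))


@lru_cache(maxsize=64)
def freq_table(max_hw: int) -> tuple:
    """Outer product of positions [0, max_hw) with inv_freq, built once per max_hw."""
    freqs = inv_freq()
    return tuple(tuple(pos * f for f in freqs) for pos in range(max_hw))


table = freq_table(16)
same = freq_table(16)  # second call is served from the cache
```

Because the cache key is just `max_hw`, a repeated extent returns the identical object, which is the property `test_freq_table_lru_cache_hit` checks for the real tensor version.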

Two small simplifications in the per-forward hot path:

* Qwen3-VL: replace the per-layer ``layer_num in self.deepstack_visual_indexes`` ``in`` check + ``.index(layer_num)`` linear scan (each O(len(deepstack_visual_indexes))) with a precomputed ``self._deepstack_layer_to_merger_idx`` dict built once in ``__init__``. ``forward`` now does a single ``.get()`` per block iteration; O(1) instead of O(L) per layer. The dict is a plain Python attribute (not a parameter); the underlying ``ModuleList`` indexing is unchanged.
* Qwen2.5-VL: replace

      reverse_indices = torch.empty_like(window_index)
      reverse_indices[window_index] = torch.arange(N, device=..., dtype=...)

  with

      reverse_indices = torch.argsort(window_index)

  ``window_index`` is a permutation of ``[0, N)``; its inverse permutation is exactly ``argsort(window_index)``. torch implements argsort as a single fused GPU sort, which both removes the explicit ``empty_like + arange + scatter`` triplet and skips an intermediate allocation.

Test: ``test_argsort_matches_scatter_inverse`` (n = 16/256/1024) proves the new path equals the old one (atol=0, rtol=0) and is the true inverse permutation (window_index[reverse_indices] == arange(N)).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
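The two simplifications can be illustrated with lists in place of tensors. This is a sketch, not the model code: the function names and the sample `deepstack_visual_indexes` values are made up for the example.

```python
def inverse_by_scatter(perm):
    """Old style: allocate, then scatter arange(N) into positions given by perm."""
    inverse = [0] * len(perm)
    for i, p in enumerate(perm):
        inverse[p] = i
    return inverse


def inverse_by_argsort(perm):
    """New style: the argsort of a permutation is its inverse permutation."""
    return sorted(range(len(perm)), key=perm.__getitem__)


window_index = [2, 0, 3, 1]
reverse_indices = inverse_by_argsort(window_index)

# Dict lookup instead of `in` + `.index()` (index values are hypothetical).
deepstack_visual_indexes = [8, 16, 24]
layer_to_merger_idx = {layer: i for i, layer in enumerate(deepstack_visual_indexes)}
merger_for_layer_16 = layer_to_merger_idx.get(16)
merger_for_layer_5 = layer_to_merger_idx.get(5)  # None: no merger at this layer
```

Indexing `window_index` with `reverse_indices` recovers `0..N-1`, which is the inverse-permutation property the commit's test asserts.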

…es through it

vLLM-style pinned + async H->D helper, kept in ``tensorrt_llm._utils`` alongside ``maybe_pin_memory`` (the only file where the pinned-memory pre-commit check allows raw ``.pin_memory()``):

    def async_tensor_h2d(data, dtype, device):
        """List/tuple OR CPU Tensor -> pinned host buffer -> non_blocking cudaMemcpyAsync to ``device``."""

Without an explicit pin, ``.to(device, non_blocking=True)`` on a pageable source silently stages through a pinned buffer that PyTorch allocates per call, eating CPU time even though nsys still reports ``cudaMemcpyAsync``. The helper centralizes the pinned-buffer + async-DMA idiom so callers don't choose between ``torch.tensor(..., pin_memory=prefer_pinned()).to(..., non_blocking=True)`` (sequence input) and ``maybe_pin_memory(t).to(..., non_blocking=True)`` (existing tensor input).

Routed through it:

* Qwen3-VL ``_triton_pos_embed_interpolate_batched``: the 5 per-image metadata arrays (starts, hs, ws, h_scales, w_scales) -- previously five ``torch.tensor(..., pin_memory=prefer_pinned()).to(device, non_blocking=True)`` copies of the same shape.
* Qwen2.5-VL ``Qwen2_5_VisionModel.forward``: the per-forward ``window_index = torch.cat(window_indices).to(device, non_blocking=True)`` -- the cat result is a fresh pageable tensor, so without the helper it silently staged.

Tests:

* ``test_async_tensor_h2d_sequence_input``: list -> CUDA int32 tensor, values preserved.
* ``test_async_tensor_h2d_tensor_input``: CPU fp64 -> CUDA fp32 with dtype cast applied, values preserved.

No new memory allocations beyond what the prior code already issued (``pin_memory()`` reuses Torch's pinned-buffer pool).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

…d input

Gate the ``.pin_memory()`` dispatch on ``not tensor.is_pinned()``. PyTorch's pin_memory is already a no-op for pinned tensors, but the call still pays a CPython + pybind round-trip. Several hot paths upstream-pin the input and then ``maybe_pin_memory`` is called again inside generic helpers (e.g. ``AttentionMetadata.seq_lens`` / ``seq_lens_kv`` setters, ``TrtllmAttentionMetadata.prepare`` re-pinning ``kv_lens``), so the guard is a microscopic but free win.

Behaviorally identical to before; an already-pinned tensor now returns from the function as the same Python object.

Test: ``test_maybe_pin_memory_idempotent_on_already_pinned`` -- second call on a pinned tensor returns ``is`` the first call's result.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

…()/.sin() in Qwen3-VL vision

Mirror vLLM's pattern (vllm/model_executor/layers/rotary_embedding/base.py ``_compute_cos_sin_cache``): build the rotary cos/sin tables at ``__init__`` time and store them as module buffers. ``forward`` then only gathers and reshapes -- no per-forward ``torch.outer``, ``.cos()``, or ``.sin()`` kernels.

Concrete changes in ``Qwen3VisionModel``:

* ``__init__`` registers two buffers, ``rope_cos_cache`` and ``rope_sin_cache``, each shape ``(max_rope_seqlen=8192, freq_dim)`` fp32. Built from the same ``inv_freq`` (``theta=10000``, ``dim=head_dim//2``) formula HF's ``Qwen3VLVisionRotaryEmbedding`` and the prior ``_freq_table`` used. Total init buffer cost: ~1.15 MB (576 KB x 2). vLLM's equivalent ``get_rope`` cache lives at the model dtype (bf16); ours stays fp32 to match ``RopeParams.create_rope_const_params``.
* ``_freq_table`` lru_cache is removed -- the gather source is now the pre-computed cos/sin buffers, so there is no transient freqs tensor to memoize.
* ``_rotary_pos_emb_thw(t, h, w)`` now returns a ``(cos, sin)`` tuple of device tensors instead of a single freqs tensor. Each half has shape ``(t*h*w, 2*freq_dim)`` after ``cos_cache[pos_ids].flatten(1)``. L2 cache footprint per-token doubles (144 B -> 288 B fp32) because cos and sin are stored separately; production ~10-30 unique tiles still lands at 4-10 MB total. ``maxsize=1024`` is unchanged.
* ``rot_pos_emb`` likewise returns ``(cos, sin)``; multi-image batches cat the cos and sin lists separately (device-side, since each L2 entry is already on device).
* ``forward`` drops the ``rotary_pos_emb.flatten(1).repeat(1, 2).cos()/.sin()`` chain. It only does ``cos.repeat(1, 2)`` and ``sin.repeat(1, 2)`` to expand to ``head_dim``, then hands the pair to the blocks.

nsys (5-image batch, 20 steady-state iters, all L2 hits):

    before: per-forward = 4 kernels (cat + repeat + cos + sin)
    after:  per-forward = 4 kernels (2 cats + 2 repeats)

with the two transcendental cos/sin kernels replaced by plain ``CatArrayBatchedCopy_vectorized`` and ``elementwise_kernel`` (repeat copies). 0 H->D in steady state.

Tests:

* ``test_rope_cos_sin_buffers_match_hf_rotary``: pre-computed cos/sin buffers match HF's ``Qwen3VLVisionRotaryEmbedding(36)(max_hw).cos()`` / ``.sin()`` within ~1 ULP fp32 (host-vs-device codegen).
* ``test_rotary_pos_emb_thw_returns_device_tensor`` / ``test_rotary_pos_emb_thw_lru_cache_hit``: updated for the ``(cos, sin)`` tuple return.
* ``test_rot_pos_emb_l2_matches_per_tile``: cos and sin halves match per-tile cats bit-exact.
* ``test_rot_pos_emb_l2_no_device_transfer_on_hit``: still 0 new misses on a repeated grid.
* ``test_rot_pos_emb_cos_sin_matches_old_repeat_chain``: forward's final ``cos.repeat(1, 2)`` / ``sin.repeat(1, 2)`` equals the prior ``freqs.repeat(1, 2).cos()/.sin()`` chain within ~1 ULP.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
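A rough sketch of the "build once at init, gather in forward" pattern described in this commit, using nested lists instead of tensors. `RotaryCache` and its `gather` method are hypothetical names for illustration; the concatenation of the row for `h` with the row for `w` is meant to mimic `cos_cache[pos_ids].flatten(1)` for `pos_ids` of shape `(N, 2)`.

```python
import math


class RotaryCache:
    """Precompute cos/sin rows for every position; forward only gathers."""

    def __init__(self, max_seqlen: int, rope_dim: int, theta: float = 10000.0):
        inv = [1.0 / (theta ** (i / rope_dim)) for i in range(0, rope_dim, 2)]
        # The transcendental work happens once, here.
        self.cos = [[math.cos(p * f) for f in inv] for p in range(max_seqlen)]
        self.sin = [[math.sin(p * f) for f in inv] for p in range(max_seqlen)]

    def gather(self, pos_ids):
        """pos_ids: list of (h, w) pairs -> rows of width 2 * freq_dim."""
        cos = [self.cos[h] + self.cos[w] for h, w in pos_ids]
        sin = [self.sin[h] + self.sin[w] for h, w in pos_ids]
        return cos, sin


cache = RotaryCache(max_seqlen=64, rope_dim=36)  # small table for the example
cos, sin = cache.gather([(0, 0), (3, 5)])
```

With `rope_dim = 36` each gathered row has `2 * 18 = 36` entries, which is the `(t*h*w, 2*freq_dim)` layout the commit message mentions.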

… rebase

The qwen3vl_opt branch's ``1c4e86892d "Further optimize Qwen3-VL"`` removed ``rope_position_ids`` from the per-block call and added a ``rotary_pos_emb.flatten(1).repeat(1, 2).cos()/.sin()`` chain in the vision tower forward. At authoring time that matched the HF ``apply_rotary_pos_emb_vision`` ABI (cos/sin shape ``(seq, head_dim)``), which ``Qwen2_5_VLVisionAttention.forward`` used.

After upstream ``7279d6322d "EXAONE-4.5 Support"`` refactored that attention class to route through the new ``apply_rope`` helper (``RotaryEmbedding.apply_rotary_pos_emb`` / FlashInfer), the expected ``cos.shape[-1]`` became ``head_dim // 2``: ``apply_rotary_pos_emb`` computes ``rot_dim = cos.shape[-1] * 2`` and chunks q/k into halves of size ``cos.shape[-1]``. With Qwen3-VL's ``head_dim = 72`` (``72 % 64 != 0``, so the FlashInfer gate misses and the PyTorch fallback runs), the prior ``.repeat(1, 2)`` made cos/sin ``(total, 72)`` and the elementwise multiply against ``q.chunk(2)[i]`` (shape ``(..., 36)``) raised ``RuntimeError: The size of tensor a (36) must match the size of tensor b (72) at non-singleton dimension 2`` -- silently broken at rebase time, missed by every unit test on this branch because they verified isolated cos/sin construction (pos_ids, lru_cache identity, gather equivalence) but never piped them through the actual attention block. CI's model-level test would have caught it.

Fix:

* ``Qwen3VisionModel.forward``: drop ``cos.repeat(1, 2)`` / ``sin.repeat(1, 2)``. The cos/sin tuple returned by ``rot_pos_emb`` is already at the post-EXAONE shape ``(total_tokens, 2 * freq_dim) = (total_tokens, head_dim // 2)``.
* ``Qwen3VisionModel.forward``: restore the pre-1c4e86892d ``rope_position_ids = torch.arange(seq_len, dtype=int32, pin_memory=prefer_pinned()).to(device, non_blocking=True)`` and pass ``position_ids=rope_position_ids`` into each block. This keeps the FlashInfer ``position_ids is not None`` gate satisfied for any future config where ``head_dim % 64 == 0``; for Qwen3-VL itself it still falls through to the PyTorch path, which now has the correct broadcast.

All prior optimizations on this branch are preserved: cos/sin pre-computed init buffers (``rope_cos_cache`` / ``rope_sin_cache``), L2 per-(t, h, w) cache, batched Triton pos-embed kernel, rot_pos_ids lru_cache, async_tensor_h2d helper, argsort inverse permutation, deepstack dict lookup. The fix is a layout correction on the output side of ``rot_pos_emb``, not a revert of the optimizations.

Test: ``test_qwen3vl_vision_block_forward_end_to_end`` builds one ``Qwen3VLVisionBlock`` with random weights and runs forward with the post-EXAONE cos/sin shape and ``position_ids``. Without this fix it raises the same broadcasting error; with it the block returns the expected ``(seq_len, hidden)`` tensor. Any future regression that re-introduces a stray ``.repeat(1, 2)`` on cos/sin (or drops ``position_ids``) will surface here, not only at CI time.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
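To make the shape contract concrete, here is a small sketch of a rotate-half application in which cos/sin have width `head_dim // 2`. It assumes a NeoX-style rotation (the commit message only states that q/k are chunked into halves of size `cos.shape[-1]`), and `apply_rotary_half` is a hypothetical name, not the TensorRT-LLM function.

```python
import math


def apply_rotary_half(x, cos, sin):
    """x: one head vector of length head_dim; cos/sin: length head_dim // 2."""
    half = len(cos)
    if len(x) != 2 * half or len(sin) != half:
        raise ValueError(
            f"cos/sin width {half} must be half of head_dim {len(x)}")
    x1, x2 = x[:half], x[half:]
    out1 = [a * c - b * s for a, b, c, s in zip(x1, x2, cos, sin)]
    out2 = [b * c + a * s for a, b, c, s in zip(x1, x2, cos, sin)]
    return out1 + out2


head_dim = 72
x = [float(i) for i in range(head_dim)]
angles = [0.1 * i for i in range(head_dim // 2)]
cos = [math.cos(a) for a in angles]
sin = [math.sin(a) for a in angles]
rotated = apply_rotary_half(x, cos, sin)

# Doubling cos/sin to width 72 (the stray repeat) violates the contract.
try:
    apply_rotary_half(x, cos + cos, sin + sin)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
```

The rotation preserves the vector norm when the widths agree; the doubled-width call fails in the same spirit as the 36-vs-72 broadcast error quoted in the commit message.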

Four CPU-dispatch optimizations stacked on top of the prior PR commits, each verified by the qwen3vl unittests and nsys conc=64 OSL=8.

1. get_rope_index: rewrite the per-token loop using numpy (np.arange/np.indices/np.concatenate) instead of small torch tensor ops. Same algorithm, output exactly equal. ~2x per call on a 512x512 image.
2. Vision attention apply_rope: add a flash_attn Triton fast-path (flash_attn.ops.triton.rotary.apply_rotary, in-place) between the FlashInfer fused path and the PyTorch fallback. Qwen3-VL head_dim=72 and Qwen2.5-VL head_dim=80 both miss FlashInfer's "head_size % 64 == 0" precondition and used to land on the 7-launch PyTorch path; the Triton kernel is a single launch per (q, k).
3. Vision cos/sin buffers materialised in the vision-tower dtype: Qwen3-VL register_buffer stores rope_cos_cache/rope_sin_cache as bf16, Qwen2.5-VL get_rope_and_window_index_by_thw casts cos_thw/sin_thw once per cached (t, h, w). apply_rope guards the per-call .to(dtype=q.dtype) with a dtype check so it becomes a no-op on the hot path.
4. Vision block residual fusion (vLLM pattern, see vllm/model_executor/models/qwen2_5_vl Qwen2_5_VisionBlock.forward): collapse the post-attention residual add into norm2's residual path, which our LayerNorm / RMSNorm already supports as a torch.compile-fused kernel. One fewer elementwise add launch per block.

nsys conc=64 OSL=8 (Qwen3-VL-8B, 1x512x512 image): vision-tower wall median 183 -> 107 ms (-42%), launches/iter 1571 -> 1220 (-22%), CPU API time/iter 134 -> 115 ms (-15%). Throughput moves modestly because GPU work is unchanged and the time is largely absorbed at stream-sync points within _forward_step.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
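The residual fusion in item 4 can be sketched with plain lists. This is an illustration under assumptions: the norm here is an unweighted RMSNorm, `attn` and `mlp` are arbitrary callables, and all function names are made up; the point is only that folding the residual add into the second norm yields the same block output.

```python
import math


def rms_norm(x, eps=1e-6):
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * scale for v in x]


def fused_add_rms_norm(x, residual, eps=1e-6):
    """Return (norm(x + residual), x + residual) from a single call."""
    summed = [a + b for a, b in zip(x, residual)]
    return rms_norm(summed, eps), summed


def block_unfused(x, attn, mlp):
    x = [a + b for a, b in zip(x, attn(rms_norm(x)))]  # separate residual add
    return [a + b for a, b in zip(x, mlp(rms_norm(x)))]


def block_fused(x, attn, mlp):
    # The post-attention add rides along with norm2.
    normed, residual = fused_add_rms_norm(attn(rms_norm(x)), x)
    return [a + b for a, b in zip(residual, mlp(normed))]


def toy_attn(v):
    return [2.0 * t for t in v]


def toy_mlp(v):
    return [t + 1.0 for t in v]


hidden = [0.5, -1.0, 2.0, 3.0]
out_unfused = block_unfused(hidden, toy_attn, toy_mlp)
out_fused = block_fused(hidden, toy_attn, toy_mlp)
```

In a real kernel the benefit is that the add and the normalization share one launch; the list version only demonstrates the algebraic equivalence.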

Stripped layer of commit 7368e5b after measurement showed the batched single-launch kernel did not move the conc=64 throughput needle and added non-trivial setup cost (5 small H->D copies for per-image metadata) on the low-concurrency path.

* Vision pos-embed interpolation goes back to the per-image Triton kernel ``_triton_pos_embed_interpolate`` + ``torch.cat`` join in ``Qwen3VisionModel.fast_pos_embed_interpolate``. The kernel takes ``(h, w, h_scale, w_scale)`` as scalar args, so no H<->D metadata transfers happen on the hot path. The batched ``_bilinear_pos_embed_batched_kernel`` / ``_triton_pos_embed_interpolate_batched`` pair is removed.
* Vision rotary cache no longer hard-codes ``max_rope_seqlen = 8192``; ``Qwen3VisionModel.__init__`` now constructs a standard ``RotaryEmbedding`` whose ``rotary_cos_sin`` buffer is sized to ``text_config.max_position_embeddings`` via ``RopeParams.from_config``. ``_rotary_pos_emb_thw`` slices ``self.rotary_pos_emb.rotary_cos_sin[:max(h, w)]`` then indexes with ``pos_ids``, mirroring upstream/main.
* Removed vLLM cross-reference comments in modeling_qwen3vl.py, modeling_qwen2vl.py, and modules/rotary_embedding.py while keeping the local technical descriptions intact.
* Updated unit tests to mock ``rotary_pos_emb.rotary_cos_sin`` (the new buffer location) and dropped the batched-vs-per-image comparison test together with the batched kernel.

3-run aiperf sweep (conc=64, ISL=1000, OSL=100, warmup=20, single 512x512 image, Qwen3-VL-8B): throughput 18.01 +/- 0.94 req/s vs main 17.25 +/- 0.72 (+4.4%), TTFT 1082 +/- 107 ms vs 1240 +/- 137 (-12.7%), latency 3528 +/- 178 ms vs 3692 +/- 149 (-4.4%).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

…c_tensor_h2d

The per-(t, h, w) rotary helpers in modeling_qwen2vl.py and modeling_qwen3vl.py were doing a bare ``pos_ids.to(device, non_blocking=True)`` (and likewise for ``window_index_thw``). Without pin_memory the non_blocking flag silently falls back to a staging copy, so the H->D never actually overlaps the surrounding GPU work.

Route those transfers through the project-wide ``async_tensor_h2d`` helper instead; it pins (when ``prefer_pinned()`` is enabled) and issues a real ``cudaMemcpyAsync``. Each unique (t, h, w) tile still only pays this once (lru_cache around the helper), but now the copy genuinely runs asynchronously.

Unit tests (qwen3vl) pass unchanged.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

CI yapf hook reformatted the torch.arange call signature. Apply the same reformat locally so pre-commit passes on push. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

Skip the multi-modal HF processor and call the tokenizer directly when ``inputs["multi_modal_data"]`` is empty. The two outputs match bit-exactly when ``images`` / ``videos`` are ``None``, so the only thing we lose is the ``bypass_processor_output_validation`` context plus the image/video processing branches inside ``ProcessorMixin.__call__``. mrope_config is still populated since the LM is M-RoPE and needs the position ids on every request -- it just goes through the short ``get_rope_index`` path with ``image_grid_thw=None``.

nsys (conc=64, OSL=8, text-only requests, Qwen3-VL-8B): the per-request ``Qwen3VLInputProcessorBase forward()`` median drops from 36.1 ms to 12.5 ms (-65%) and the cumulative wall-clock share drops from 45.8% to 21.1% of the capture window. End-to-end throughput is essentially unchanged at this concurrency (36.54 -> 36.78 req/s) because the executor's LM forward is the gating phase; the win is for low-conc / TTFT-sensitive text-only traffic where the input processor sits on the request critical path.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
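A minimal sketch of the text-only fast path, assuming the tokenizer and the multimodal processor are plain callables. `preprocess`, `fake_tokenizer`, and `fake_processor` are hypothetical stand-ins used only to show the routing; they are not the real input-processor classes.

```python
def preprocess(inputs, tokenizer, processor):
    """Send text-only requests straight to the tokenizer."""
    mm_data = inputs.get("multi_modal_data") or {}
    if not mm_data:
        return tokenizer(inputs["prompt"])  # fast path: no image/video branches
    return processor(text=inputs["prompt"], **mm_data)


calls = {"tokenizer": 0, "processor": 0}


def fake_tokenizer(text):
    calls["tokenizer"] += 1
    return {"input_ids": [len(word) for word in text.split()]}


def fake_processor(text, **mm_data):
    calls["processor"] += 1
    return {"input_ids": [0], "pixel_values": sorted(mm_data)}


text_only = preprocess(
    {"prompt": "hello there", "multi_modal_data": {}},
    fake_tokenizer, fake_processor)
with_image = preprocess(
    {"prompt": "describe", "multi_modal_data": {"image": ["<pixels>"]}},
    fake_tokenizer, fake_processor)
```

The text-only request never touches the heavier processor, which is where the per-request saving reported above would come from.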

Four coordinated fixes so torch.compile + piecewise CUDA graph stops crashing on the Qwen2.5-VL / Qwen3-VL multimodal wrappers:

1. Share ``extra_attrs`` between the outer VL wrapper and the inner LM. ``Qwen{2VL,3VL}ModelBase.__init__`` deep-copied ``model_config`` for the LM, so ``AutoModelForCausalLM.from_config`` registered LM attention layers in the copied dict while ``model_engine.model_forward`` bound the thread-local ``model_extra_attrs`` to ``self.model.extra_attrs`` (the outer dict). Under ``set_torch_compiling(True)``, ``attn_custom_op_inplace`` looks up its layer in the global TLS dict -- missed the LM and hit a stale vision entry, producing a ``[1, 1152] X [4096, 4096]`` fake-tensor mismatch at ``o_proj``.
2. Unregister vision attention from ``extra_attrs`` in ``Qwen2_5_VLVisionAttention.__init__`` (base class shared by both Qwen2.5-VL and Qwen3-VL). Vision runs from the outer wrapper, outside the compiled LM region, so ``forward_impl`` must use the eager path with its own ``attn_metadata``. With the global flag on, the custom-op path otherwise consults ``extra_attrs["attention_metadata"]`` -- which ``model_engine.model_forward`` populates with the LM's metadata -- so vision FMHA dispatched with the LM's S/num_contexts and vision's head_dim and failed (``FMHA kernels are not found ... D: 72, S: 0``).
3. Piecewise compile only the LM body, not the outer wrapper. When the model has ``self.llm.model``, ``torch.compile`` is applied there; tracing the outer wrapper otherwise pulls the vision-tower output path and ``fuse_input_embeds`` into the same graph, which lets the vision hidden_dim leak into the LM ``o_proj`` fake-tensor trace and blows up the piecewise warmup.
4. Teach the piecewise backend to detect ``inputs_embeds`` placeholders. The VL wrapper invokes the LM with ``input_ids=None`` and ``inputs_embeds=<tensor>``; dynamo eliminates the ``input_ids`` placeholder, so ``Backend.__call__`` could not find ``l_input_ids_`` and raised ``Cannot detect input_num_tokens``. Accept ``l_inputs_embeds_``/``l_kwargs_inputs_embeds_`` as the alternate num_tokens carrier.

Validated end-to-end with piecewise CUDA graph enabled on Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct: server boots through warmup, text and image requests both succeed.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
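The first fix hinges on a plain Python fact: a deep copy gets its own dict, so registrations made through the copy are invisible through the original. The sketch below uses a toy `ToyModelConfig` class and a made-up layer key; whether the real fix re-assigns the dict exactly this way is an assumption.

```python
import copy


class ToyModelConfig:
    """Stand-in for a config object that carries a registry dict."""

    def __init__(self):
        self.extra_attrs = {}


outer = ToyModelConfig()

# Deep copy: the inner model registers into a separate dict.
copied = copy.deepcopy(outer)
copied.extra_attrs["llm.layers.0.attn"] = "lm-attention"
visible_after_deepcopy = "llm.layers.0.attn" in outer.extra_attrs

# Re-share the registry after copying: both configs now see the same entries.
shared = copy.deepcopy(outer)
shared.extra_attrs = outer.extra_attrs
shared.extra_attrs["llm.layers.0.attn"] = "lm-attention"
visible_after_sharing = "llm.layers.0.attn" in outer.extra_attrs
```

A lookup keyed on the outer dict misses the entry in the first case and finds it in the second, which matches the symptom described in item 1.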

…c_tensor_h2d

``Qwen3VisionModel.forward`` built the per-iter ``rope_position_ids`` with the manual ``torch.arange(..., pin_memory=prefer_pinned()).to(device=..., non_blocking=True)`` pattern. ``async_tensor_h2d`` already centralizes the same pinned-CPU + ``cudaMemcpyAsync`` pattern that the other VL hot-path H2D sites use. Use it here too for consistency.

No functional or performance change -- both paths construct a pinned CPU tensor and issue a non-blocking H->D copy.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

Multi-context-request batches and chunked prefill both broke the per-request indexing scheme once ``batch_idx`` was dropped from the attention kernel's mrope cos/sin lookup. Two requests at the same ``rotary_position`` shared a buffer entry even though their (T, H, W) coordinates differ, silently corrupting attention; chunked prefill of a single request hit the same issue from chunk 2 onward because the kernel indexed by request-internal position while Python only materialized cos/sin for the current chunk. The failure surfaced as ``test_chunked_prefill_multimodal_smoke`` returning empty outputs / immediate ``<|im_end|>``.

Switch ``applyBiasRopeUpdateKVCacheV2``'s mrope cos/sin lookup from ``rotary_position`` (= ``past_seen + token_idx_in_seq``, request-internal) to ``bounded_global_token_idx`` (batch-flat per-token index, same scheme regular rope already uses). Each batch token now reads its own cos/sin entry independent of request boundary, ``past_seen_token_num``, or chunk size.

Simplify ``Qwen3VLModelBase`` / ``Qwen2VLModelBase``'s ``prepare_mrope_config`` to a single ``get_cos_sin(position_ids)`` over the batch's already-stitched ``position_ids`` (3, 1, total_tokens); drop the per-request loop, the per-request ``torch.cat(dim=0)`` layout that the kernel can no longer index, and the ``self.mrope_position_ids_padding_cuda`` (3, 1, max_position_embeddings) buffer (cos/sin is now sized to the current iteration's token count instead of the model's full context capacity).

Update ``test_modeling_qwen3vl`` / ``test_modeling_qwen2_5vl``'s ``get_trtllm_inputs`` to slice ``mrope_position_ids`` to the chunk range (matching production ``PyTorchModelEngine`` behavior): chunked prefill / KV cache reuse scenarios used to pass full-request ``mrope_position_ids`` regardless of chunk, which only worked because the prior code read mrope from ``multimodal_data`` and ignored the ``position_ids`` argument.

Validated end-to-end with the chunked-prefill smoke test, HF↔TRT-LLM logit matching across all 7 scenarios per model (image / video / multiple-image / cuda-graph / chunked-prefill / kv-cache-reuse / no-fuse-rope), the multi-request batched quickstart example, and aiperf on Qwen3-VL-8B-FP8 across concurrencies 1..128 (throughput-neutral versus the prior per-request layout, 0-8% lower TTFT in context phase).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
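The indexing problem can be reproduced with a few integers. This sketch only models the two index schemes named in the commit message (request-internal `past_seen + token_idx_in_seq` versus a batch-flat per-token index); the helper names are illustrative and the kernel's bounding logic is not modeled.

```python
def rotary_positions(seq_lens, past_seen):
    """Request-internal index: past_seen + token_idx_in_seq, per request."""
    return [past + i
            for n, past in zip(seq_lens, past_seen)
            for i in range(n)]


def global_token_indices(seq_lens):
    """Batch-flat index: one distinct row per token in the packed batch."""
    return list(range(sum(seq_lens)))


# Two context requests (3 and 2 tokens) packed into one batch.
batch_rotary = rotary_positions([3, 2], [0, 0])
batch_global = global_token_indices([3, 2])

# Second chunk of a chunked prefill: 4 new tokens after 4 already seen.
chunk_rotary = rotary_positions([4], [4])
chunk_global = global_token_indices([4])
rows_materialized = 4  # cos/sin built only for the current chunk
chunk_rotary_in_range = all(i < rows_materialized for i in chunk_rotary)
chunk_global_in_range = all(i < rows_materialized for i in chunk_global)
```

With the request-internal scheme, two different tokens map to row 0 and row 1, and the chunked case points past the rows that exist; the batch-flat scheme avoids both.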

…t syncs

Three Qwen2/3-VL host-side micro-fixes in the prefill / vision-encoder path:

1. Pre-allocated `(L, max_num_tokens, H)` `deepstack_input_embeds` buffer in `Qwen3VLModelBase.__init__` (`register_buffer(..., persistent=False)` so it stays out of `state_dict`). Forward path zero_()s the `(L, N, H)` slice and scatters `stack(deepstack_embeds, dim=0)` in one packed assignment, replacing `L` fresh `torch.zeros` + `L` indexed scatters that ran on every prefill iter inside `fuse_input_embeds`. Pattern mirrors vLLM's `deepstack_input_embeds`.
2. Reuse the `text_token_indices` / `mm_token_indices` that `PyTorchModelEngine._prepare_tp_inputs` already computes on CPU and async-H2Ds into the forward kwargs; on engine-driven paths the `filter_mm_token_from_input_ids` `torch.where` invocation is fully skipped (the fallback only runs on direct-`forward` unit-test paths). Pass them through to `fuse_input_embeds` so its internal filter is also bypassed.
3. Vision encoder forward cleanup:
   - Build `seq_lens: List[int]` from `grid_thw.tolist()` in Python (single Python loop) instead of `torch.repeat_interleave(...).tolist()`; the repeat-interleave path created two small CPU tensor ops + an extra `.tolist()` purely to produce a Python list `prepare_attn_metadata` then re-tensors.
   - `prepare_attn_metadata` takes a `List[int]` and computes `max_seq_len = max(seq_lens)` in Python, dropping `seq_lens.max().item()`.
   - Pre-allocate a 32K `arange(int32)` buffer for the vision block's `rope_position_ids` (`register_buffer(persistent=False)`); per-call code just slices `[:seq_len]` instead of `torch.arange(seq_len) + async_tensor_h2d` every encoder forward.

Test rigs (`tests/unittest/_torch/modeling/test_modeling_qwen3vl.py`, `test_modeling_qwen2_5vl.py`) only had their existing comments normalized to single backticks for style consistency; no logic change.

Validated with the full Qwen2.5-VL / Qwen3-VL test surface (53/53 passing) -- HF<->TRT-LLM logit matching across image / video / multiple-image / cuda-graph / chunked-prefill / kv-cache-reuse / no-fuse-rope scenarios, the chunked-prefill smoke test, the P-D disagg no-reuse test, and the multi-image / multi-request batch tests. End-to-end aiperf throughput on Qwen3-VL-8B-FP8 across concurrencies 1..128 is within noise (the host-side savings sit well under the attention/MoE wall time budget); the win is in lower per-iter overhead and cleaner module state.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
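A sketch of building the per-frame sequence lengths in pure Python, as item 3 describes. It assumes the replaced tensor path was a repeat-interleave of `h * w` by `t` over the rows of `grid_thw`; the function name and the sample grid are made up for the example.

```python
from typing import List


def seq_lens_from_grid(grid_thw: List[List[int]]) -> List[int]:
    """One entry of h * w per temporal slice, for every (t, h, w) row."""
    seq_lens: List[int] = []
    for t, h, w in grid_thw:
        seq_lens.extend([h * w] * t)
    return seq_lens


grid = [[1, 4, 6], [2, 2, 2]]        # stand-in for grid_thw.tolist()
seq_lens = seq_lens_from_grid(grid)  # per-frame token counts
max_seq_len = max(seq_lens)          # plain Python max, no .item() sync
```

Keeping everything as Python ints means the maximum can be taken without creating a tensor and calling `.item()` on it.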

The `has_tp` guard around `self.dist.broadcast(payloads, root=0)` in `RequestBroadcaster._broadcast_requests` (introduced in c1d294e to skip a hang on `world_size > 1, tp_size == 1`) is no longer needed -- the underlying behavior is now handled upstream. Revert to the pre-guard form. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

Apply the codebase's RST-literal style (``foo``) to inline code references in comments / docstrings on the branch-touched lines of Qwen2/3-VL model and test files; no logic change. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

…uff-format)

Pure formatting cleanup -- no logic change. Brings the branch-touched Qwen2/3-VL Python, vision-encoder C++ kernel template, and modeling tests in line with the repo's pre-commit hooks (yapf, clang-format, ruff-format).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

- Restore the original `torch_dtype` / `max_position_embeddings` propagation comment on `Qwen3VLVisionAttention.__init__`.
- Drop the redundant `if cos.dtype != q.dtype` / `if sin.dtype != q.dtype` guards in `Qwen2_5_VLVisionAttention.apply_rope`; `tensor.to(dtype=...)` already short-circuits when the dtype matches.
- Collapse the doubled backticks introduced earlier in this branch back to single backticks on the branch-touched lines, matching the reviewer-preferred style.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

Replace the misleading `pragma: no cover - flash_attn is part of the default deps` line on the flash_attn rotary import: flash_attn is only declared in `triton_backend/requirements.txt` and the multimodal extras, not the main `requirements.txt`, so the guarded import is genuinely the load-time fallback when flash_attn isn't installed. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

Python 3's `/` is true division and always returns float, so `float(...) / float(...)` was redundant. Same effective values, less noise. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

Extract the ``_build_temporal_block`` step as a classmethod hook on ``Qwen2VLInputProcessorBase`` so Qwen3-VL's only meaningful difference (per-frame timestamps vs. ``tokens_per_second`` scaling) can be expressed as a one-line override. ``Qwen3VLInputProcessorBase`` now subclasses ``Qwen2VLInputProcessorBase``, overrides ``__init__`` (dtype source), ``get_rope_index`` (``repeat_interleave`` of ``video_grid_thw`` before super), and ``_build_temporal_block`` (plain ``np.indices``). Drops ~95% of the duplicated tokenizer / processor / mrope / call logic and the matching unused imports.

Also dispatch ``get_mrope_config``'s ``get_rope_index`` call via ``type(self)`` so the subclass override is actually used, and condense the ``bypass_processor_output_validation`` rationale comment.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
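A toy illustration of the hook-plus-dispatch idea. The class names, the `tokens_per_second` value, and the simplified temporal indices are all hypothetical; the sketch shows one way a subclass override can be bypassed (a call hard-coded on the base class) and how dispatching through `type(self)` picks it up.

```python
class Qwen2LikeProcessor:
    tokens_per_second = 2  # hypothetical scaling factor

    @classmethod
    def _build_temporal_block(cls, num_frames):
        # Base behavior: scale the frame index.
        return [frame * cls.tokens_per_second for frame in range(num_frames)]

    @classmethod
    def get_rope_index(cls, num_frames):
        return cls._build_temporal_block(num_frames)

    def mrope_hardcoded(self, num_frames):
        # Always runs the base implementation, even on a subclass instance.
        return Qwen2LikeProcessor.get_rope_index(num_frames)

    def mrope_dispatched(self, num_frames):
        # Resolves through the instance's actual class.
        return type(self).get_rope_index(num_frames)


class Qwen3LikeProcessor(Qwen2LikeProcessor):
    @classmethod
    def _build_temporal_block(cls, num_frames):
        # Override: plain per-frame indices, no scaling.
        return list(range(num_frames))


proc = Qwen3LikeProcessor()
hardcoded = proc.mrope_hardcoded(3)
dispatched = proc.mrope_dispatched(3)
```

Only the dispatched call reflects the subclass hook, so the difference between the two model families stays confined to a single overridden method.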

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

yechank-nvidia · 2026-05-29T14:05:16Z

/bot run

tensorrt-cicd · 2026-05-29T14:11:55Z

PR_Github #51050 [ run ] triggered by Bot. Commit: b8c317d Link to invocation

tensorrt-cicd · 2026-05-29T14:17:01Z

PR_Github #51041 [ run ] completed with state ABORTED. Commit: 5c4dfbc

Link to invocation

tensorrt-cicd · 2026-05-29T17:19:11Z

PR_Github #51050 [ run ] completed with state SUCCESS. Commit: b8c317d
/LLM/main/L0_MergeRequest_PR pipeline #40496 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

- Please check the failed tests and fix your PR
- If you cannot view the failures, ask the CI triggerer to share details
- Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

yechank-nvidia self-assigned this Mar 5, 2026

yechank-nvidia added the Multimodal label (Label for issues & PRs regarding Multimodal related objects) Mar 5, 2026

moraxu self-assigned this May 19, 2026

yechank-nvidia force-pushed the qwen3vl_opt branch from 87e587d to 3489cf2 Compare May 21, 2026 11:13

yechank-nvidia changed the title ~~[Draft][perf] Qwen3-VL Performance Optimization~~ [None][perf] Qwen3/3.5-VL Performance Optimization May 21, 2026

yechank-nvidia changed the title ~~[None][perf] Qwen3/3.5-VL Performance Optimization~~ [None][perf] Qwen2/2.5/3/3.5-VL Performance Optimization May 22, 2026

yechank-nvidia changed the title ~~[None][perf] Qwen2/2.5/3/3.5-VL Performance Optimization~~ [None][perf] Qwen2.5/3/3.5-VL Performance Optimization May 22, 2026

yechank-nvidia force-pushed the qwen3vl_opt branch from 3489cf2 to 124f649 Compare May 26, 2026 02:25

yechank-nvidia force-pushed the qwen3vl_opt branch from da59f84 to 657bb10 Compare May 27, 2026 01:15

yechank-nvidia marked this pull request as ready for review May 27, 2026 10:55

yechank-nvidia requested review from a team as code owners May 27, 2026 10:55

yechank-nvidia requested review from byshiue, moraxu, symphonylyh, tijyojwad and xxi-nv May 27, 2026 10:55

yechank-nvidia added 25 commits May 29, 2026 13:57

[None][fix] yapf formatting in _make_warmup_mrope_position_ids

4511bb6

CI yapf hook reformatted the torch.arange call signature. Apply the same reformat locally so pre-commit passes on push. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][style] Drop redundant float() casts on rotary scale factors

389d363

Python 3's `/` is true division and always returns float, so `float(...) / float(...)` was redundant. Same effective values, less noise. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][style] Apply ruff-format auto-fixes from pre-commit

d0b8571

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][perf] cache Qwen VL MRoPE deltas

ba39c04

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

[None][fix] make Qwen VL flash-attn optional

b8c317d

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>

yechank-nvidia force-pushed the qwen3vl_opt branch from 5c4dfbc to b8c317d Compare May 29, 2026 14:05

yechank-nvidia commented Mar 5, 2026 (edited)