[TRTLLM-12427][perf] Qwen2.5/3/3.5-VL Performance Optimization#11943

Open: yechank-nvidia wants to merge 34 commits into NVIDIA:main from yechank-nvidia:qwen3vl_opt

@yechank-nvidia
Collaborator

@yechank-nvidia yechank-nvidia commented Mar 5, 2026

Summary

This PR reworks the Qwen2-VL / Qwen2.5-VL / Qwen3-VL PyTorch-backend implementations to cut host (CPU) overhead in the vision tower and input processors, enables piecewise CUDA graph for these models, and fixes several correctness issues around mRoPE and vision-block RoPE. The optimizations target the high-concurrency serving regime, where launch overhead and host↔device syncs dominate.

What changed

Vision tower / rotary embedding

  • Drop the HF rotary dependency and memoize the frequency table; pre-compute cos/sin into init-time buffers so the forward path no longer calls .cos()/.sin() per step.
  • Add an L2 per-tile GPU rotary cache for Qwen2.5/3-VL vision and annotate its measured GPU footprint.
  • Remove dead code and a redundant batched pos-embed kernel from the vision tower.

Host-overhead reduction

  • Add an async_tensor_h2d helper and route all Qwen2.5/3-VL H2D copies (vision pos_ids, window_index, rope_position_ids) through it.
  • Skip redundant pinning in maybe_pin_memory when the input is already pinned.
  • Pre-allocate deepstack scratch and skip vision-encoder host syncs.
  • Add a text-only fast path in the Qwen2.5/3-VL input processors so text-only requests avoid vision-path work.

CUDA graph

  • Enable piecewise CUDA graph for LLM prefill on Qwen2/3-VL.

Refactor

  • Inherit the Qwen3-VL input processor from the Qwen2-VL base to remove duplication.

Performance

Model: Qwen3VLForConditionalGeneration (FP8 weights + KV cache, bf16 vision encoder). Hardware: H200 ×1. Workload: image+text, ISL=1000, OSL=1000, 512×512 image, KV block reuse off, 3-run mean. Comparison is against upstream main with identical serving config (max_batch_size=256, max_num_tokens=8192, num_postprocess_workers=4, cuda_graph_config.enable_padding=true, chunked prefill on).

System output-token throughput (tok/s)

| Concurrency | Upstream (tok/s) | This PR (tok/s) | Δ |
|---|---|---|---|
| 32 | 4735 | 4838 | +2.2% |
| 64 | 6948 | 7359 | +5.9% |
| 128 | 8553 | 9767 | +14.2% |

Other metrics

  • Per-user throughput (tok/s/user) — c=128: 77.5 → 89.8 (+15.9%)
  • TTFT (ms) — c=1: 147 → 81 (−45%); c=32: 815 → 690 (−15%); c=128: 1953 → 1804 (−8%)
  • ITL (ms/token) — c=128: 12.96 → 11.18 (−13.7%)
  • Request latency (ms) — c=128: 14901 → 12975 (−12.9%)

Gains are concentrated at high concurrency, where the host-side savings (fewer CPU launches, async H2D, higher CUDA-graph hit ratio) and the lower-overhead mRoPE path matter most. Low-concurrency TTFT also improves substantially (c=1 nearly halved) from the text fast path and removed vision-encoder syncs.
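
The Δ column and the percentages under "Other metrics" follow directly from the raw numbers quoted above. A quick arithmetic check, as a sketch (the helper name `pct_change` is illustrative and not part of the PR):

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (concurrency, upstream tok/s, this-PR tok/s), copied from the table above.
throughput = [(32, 4735, 4838), (64, 6948, 7359), (128, 8553, 9767)]
throughput_deltas = {
    c: round(pct_change(base, new), 1) for c, base, new in throughput
}

# c=128 latency-style metrics, where a negative change is an improvement.
itl_delta = round(pct_change(12.96, 11.18), 1)
latency_delta = round(pct_change(14901, 12975), 1)

for c, delta in throughput_deltas.items():
    print(f"c={c}: {delta:+.1f}%")
```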

Summary by CodeRabbit

Release Notes

  • Bug Fixes

    • Fixed rotary cache index computation in attention kernels.
    • Corrected request broadcasting logic for distributed inference configurations.
  • Performance

    • Optimized multimodal model inference through vectorized RoPE position indexing and GPU memory operations.
    • Added Triton kernel support for position embedding interpolation in vision models.
    • Improved device transfer efficiency with async host-to-device operations and optimized tensor placement.
    • Enhanced distributed tensor parallelism handling.
  • Tests

    • Expanded test coverage for multimodal RoPE configurations and vision component equivalence validation.

@yechank-nvidia yechank-nvidia self-assigned this Mar 5, 2026
@yechank-nvidia yechank-nvidia added the Multimodal label Mar 5, 2026
@moraxu moraxu self-assigned this May 19, 2026
@yechank-nvidia yechank-nvidia changed the title [Draft][perf] Qwen3-VL Performance Optimization [None][perf] Qwen3/3.5-VL Performance Optimization May 21, 2026
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #49687 [ run ] triggered by Bot. Commit: 3489cf2 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #49687 [ run ] completed with state FAILURE. Commit: 3489cf2
/LLM/main/L0_MergeRequest_PR pipeline #39294 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@yechank-nvidia yechank-nvidia changed the title [None][perf] Qwen3/3.5-VL Performance Optimization [None][perf] Qwen2/2.5/3/3.5-VL Performance Optimization May 22, 2026
@yechank-nvidia yechank-nvidia changed the title [None][perf] Qwen2/2.5/3/3.5-VL Performance Optimization [None][perf] Qwen2.5/3/3.5-VL Performance Optimization May 22, 2026
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #50284 [ run ] triggered by Bot. Commit: da59f84 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50284 [ run ] completed with state SUCCESS. Commit: da59f84
/LLM/main/L0_MergeRequest_PR pipeline #39813 completed with status: 'FAILURE'

@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #50450 [ run ] triggered by Bot. Commit: 657bb10 Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50450 [ run ] completed with state SUCCESS. Commit: 657bb10
/LLM/main/L0_MergeRequest_PR pipeline #39969 completed with status: 'FAILURE'

@yechank-nvidia yechank-nvidia marked this pull request as ready for review May 27, 2026 10:55
@yechank-nvidia yechank-nvidia requested review from a team as code owners May 27, 2026 10:55
…en3-VL vision

Replace ``HFQwen3VLVisionRotaryEmbedding(head_dim // 2)`` with an
in-tree implementation:

* ``Qwen3VisionModel.__init__`` registers an ``rope_inv_freq`` buffer
  using the same formula the HF class used internally
  (``1.0 / (10000.0 ** (torch.arange(0, rope_dim, 2) / rope_dim))``,
  ``rope_dim = head_dim // 2``).

* New ``_freq_table(max_hw)`` is ``@lru_cache(maxsize=64)``. It
  computes ``torch.outer(arange(max_hw), inv_freq)`` once per unique
  ``max_hw``. Previously every ``_rotary_pos_emb_thw`` cache miss
  re-ran HF's ``forward()``, which built a fresh ``arange + outer``
  every time. Production has at most a handful of distinct max-extents
  across the served tile set, so the cache stays small (``maxsize=64``
  is plenty).

The HF dependency was only used as a stateless functor; nothing about
its module identity or persistent buffers was reused. Dropping it
also cuts the ``from transformers...Qwen3VLVisionRotaryEmbedding``
import.
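
The memoized frequency table described above can be sketched in pure Python. This is a stand-in only: the real code works on torch tensors and a registered ``rope_inv_freq`` buffer, and the names `INV_FREQ` and `freq_table` here are illustrative. ``head_dim = 72`` is the Qwen3-VL vision value quoted elsewhere in this PR.

```python
from functools import lru_cache

HEAD_DIM = 72             # Qwen3-VL vision head_dim, as quoted in this PR
ROPE_DIM = HEAD_DIM // 2  # 36

# inv_freq = 1 / 10000 ** (i / rope_dim) for i = 0, 2, 4, ...
INV_FREQ = tuple(1.0 / (10000.0 ** (i / ROPE_DIM)) for i in range(0, ROPE_DIM, 2))


@lru_cache(maxsize=64)
def freq_table(max_hw: int) -> tuple:
    """Outer product of positions [0, max_hw) with INV_FREQ, built once per max_hw."""
    return tuple(tuple(pos * f for f in INV_FREQ) for pos in range(max_hw))


first = freq_table(32)
second = freq_table(32)   # served from the cache, no recomputation
```

Because the cache key is just ``max_hw``, a repeated extent returns the very same object, which is the property the ``test_freq_table_lru_cache_hit`` test is described as checking.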

Tests:
* ``test_freq_table_matches_hf_rotary``: 6 ``max_hw`` values
  (16/32/48/64/100/128) -> in-tree output bit-matches HF (atol=0,
  rtol=0, same dtype, same shape).
* ``test_freq_table_lru_cache_hit``: repeated ``max_hw`` returns the
  same cached device tensor object.
* All pre-existing ``_rotary_pos_emb_thw`` / ``rot_pos_emb_l2`` /
  ``batched_pos_embed_*`` tests still pass; they now exercise the
  in-tree path end-to-end.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Two small simplifications in the per-forward hot path:

* Qwen3-VL: replace the per-layer ``layer_num in
  self.deepstack_visual_indexes`` ``in`` check + ``.index(layer_num)``
  linear scan (each O(len(deepstack_visual_indexes))) with a precomputed
  ``self._deepstack_layer_to_merger_idx`` dict built once in
  ``__init__``. ``forward`` now does a single ``.get()`` per block
  iteration; O(1) instead of O(L) per layer. The dict is a plain Python
  attribute (not a parameter); the underlying ``ModuleList`` indexing
  is unchanged.

* Qwen2.5-VL: replace
    reverse_indices = torch.empty_like(window_index)
    reverse_indices[window_index] = torch.arange(N, device=..., dtype=...)
  with
    reverse_indices = torch.argsort(window_index)
  ``window_index`` is a permutation of ``[0, N)``; its inverse
  permutation is exactly ``argsort(window_index)``. torch implements
  argsort as a single fused GPU sort, which both removes the explicit
  ``empty_like + arange + scatter`` triplet and skips an intermediate
  allocation.
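
The Qwen3-VL lookup change can be illustrated with plain Python. This is a sketch under assumptions: the layer numbers and model depth below are hypothetical, and the function names are not taken from the PR.

```python
deepstack_visual_indexes = [8, 16, 24]   # hypothetical layer numbers
num_layers = 27                          # hypothetical vision-tower depth


def merger_idx_scan(layer_num):
    """Old pattern: membership test plus .index(), both linear scans."""
    if layer_num in deepstack_visual_indexes:
        return deepstack_visual_indexes.index(layer_num)
    return None


# New pattern: build the mapping once, then one dict lookup per layer.
layer_to_merger_idx = {
    layer: i for i, layer in enumerate(deepstack_visual_indexes)
}


def merger_idx_lookup(layer_num):
    """Return the merger index for a deepstack layer, or None otherwise."""
    return layer_to_merger_idx.get(layer_num)


# Both patterns agree on every layer of the (hypothetical) tower.
assert all(
    merger_idx_scan(n) == merger_idx_lookup(n) for n in range(num_layers)
)
```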

Test: ``test_argsort_matches_scatter_inverse`` (n = 16/256/1024)
proves the new path equals the old one (atol=0, rtol=0) and is the
true inverse permutation (window_index[reverse_indices] == arange(N)).
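
The equivalence between the scatter form and the argsort form holds for any permutation. A minimal pure-Python sketch (lists stand in for tensors, and the helper names are illustrative):

```python
import random


def inverse_by_scatter(window_index):
    """Old pattern: reverse[window_index[i]] = i."""
    reverse = [0] * len(window_index)
    for i, w in enumerate(window_index):
        reverse[w] = i
    return reverse


def inverse_by_argsort(window_index):
    """New pattern: the indices that would sort window_index."""
    return sorted(range(len(window_index)), key=window_index.__getitem__)


rng = random.Random(0)
window_index = list(range(16))
rng.shuffle(window_index)          # a permutation of [0, 16)

reverse_indices = inverse_by_argsort(window_index)
```

Position ``k`` of the argsort result holds the index ``j`` with ``window_index[j] == k``, which is exactly what the scatter writes into slot ``k``.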

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…es through it

vLLM-style pinned + async H->D helper, kept in ``tensorrt_llm._utils``
alongside ``maybe_pin_memory`` (the only file where the pinned-memory
pre-commit check allows raw ``.pin_memory()``):

  def async_tensor_h2d(data, dtype, device):
      """List/tuple OR CPU Tensor -> pinned host buffer ->
      non_blocking cudaMemcpyAsync to ``device``."""

Without an explicit pin, ``.to(device, non_blocking=True)`` on a
pageable source silently stages through a pinned buffer that PyTorch
allocates per call, eating CPU time even though nsys still reports
``cudaMemcpyAsync``. The helper centralizes the pinned-buffer +
async-DMA idiom so callers don't choose between
``torch.tensor(..., pin_memory=prefer_pinned()).to(...,
non_blocking=True)`` (sequence input) and ``maybe_pin_memory(t).to(
..., non_blocking=True)`` (existing tensor input).

Routed through it:

* Qwen3-VL ``_triton_pos_embed_interpolate_batched``: the 5 per-image
  metadata arrays (starts, hs, ws, h_scales, w_scales) -- previously
  five ``torch.tensor(..., pin_memory=prefer_pinned()).to(device,
  non_blocking=True)`` copies of the same shape.
* Qwen2.5-VL ``Qwen2_5_VisionModel.forward``: the per-forward
  ``window_index = torch.cat(window_indices).to(device,
  non_blocking=True)`` -- the cat result is a fresh pageable tensor,
  so without the helper it silently staged.

Tests:
* ``test_async_tensor_h2d_sequence_input``: list -> CUDA int32 tensor,
  values preserved.
* ``test_async_tensor_h2d_tensor_input``: CPU fp64 -> CUDA fp32 with
  dtype cast applied, values preserved.

No new memory allocations beyond what the prior code already issued
(``pin_memory()`` reuses Torch's pinned-buffer pool).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…d input

Gate the ``.pin_memory()`` dispatch on ``not tensor.is_pinned()``.
PyTorch's pin_memory is already a no-op for pinned tensors, but the
call still pays a CPython + pybind round-trip. Several hot paths
upstream-pin the input and then ``maybe_pin_memory`` is called again
inside generic helpers (e.g. ``AttentionMetadata.seq_lens`` /
``seq_lens_kv`` setters, ``TrtllmAttentionMetadata.prepare`` re-pinning
``kv_lens``), so the guard is a microscopic but free win.

Behaviorally identical to before; an already-pinned tensor now
returns from the function as the same Python object.

Test: ``test_maybe_pin_memory_idempotent_on_already_pinned`` -- second
call on a pinned tensor returns ``is`` the first call's result.
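
The guard can be sketched without torch. Everything below is a hypothetical stand-in: `FakeTensor` only mimics the ``is_pinned()`` and ``pin_memory()`` methods of ``torch.Tensor``, and `maybe_pin_memory_sketch` is not the real helper's signature.

```python
class FakeTensor:
    """Hypothetical stand-in mimicking the two tensor methods used here."""

    def __init__(self, pinned=False):
        self._pinned = pinned
        self.pin_calls = 0

    def is_pinned(self):
        return self._pinned

    def pin_memory(self):
        self.pin_calls += 1
        return FakeTensor(pinned=True)


def maybe_pin_memory_sketch(tensor, prefer_pinned=True):
    """Pin only when pinning is wanted and the tensor is not pinned yet."""
    if prefer_pinned and not tensor.is_pinned():
        return tensor.pin_memory()
    return tensor


pageable = FakeTensor()
pinned_once = maybe_pin_memory_sketch(pageable)
pinned_twice = maybe_pin_memory_sketch(pinned_once)   # guard short-circuits
```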

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…()/.sin() in Qwen3-VL vision

Mirror vLLM's pattern (vllm/model_executor/layers/rotary_embedding/base.py
``_compute_cos_sin_cache``): build the rotary cos/sin tables at
``__init__`` time and store them as module buffers. ``forward`` then
only gathers and reshapes -- no per-forward ``torch.outer``,
``.cos()``, or ``.sin()`` kernels.

Concrete changes in ``Qwen3VisionModel``:

* ``__init__`` registers two buffers, ``rope_cos_cache`` and
  ``rope_sin_cache``, each shape ``(max_rope_seqlen=8192, freq_dim)``
  fp32. Built from the same ``inv_freq`` (``theta=10000``,
  ``dim=head_dim//2``) formula HF's
  ``Qwen3VLVisionRotaryEmbedding`` and the prior
  ``_freq_table`` used. Total init buffer cost: ~1.15 MB
  (576 KB x 2). vLLM's equivalent ``get_rope`` cache lives at the
  model dtype (bf16); ours stays fp32 to match
  ``RopeParams.create_rope_const_params``.

* ``_freq_table`` lru_cache is removed -- the gather source is now
  the pre-computed cos/sin buffers, so there is no transient freqs
  tensor to memoize.

* ``_rotary_pos_emb_thw(t, h, w)`` now returns a ``(cos, sin)`` tuple
  of device tensors instead of a single freqs tensor. Each half has
  shape ``(t*h*w, 2*freq_dim)`` after ``cos_cache[pos_ids].flatten(1)``.
  L2 cache footprint per-token doubles (144 B -> 288 B fp32) because
  cos and sin are stored separately; production ~10-30 unique tiles
  still lands at 4-10 MB total. ``maxsize=1024`` is unchanged.

* ``rot_pos_emb`` likewise returns ``(cos, sin)``; multi-image batches
  cat the cos and sin lists separately (device-side, since each L2
  entry is already on device).

* ``forward`` drops the
  ``rotary_pos_emb.flatten(1).repeat(1, 2).cos()/.sin()`` chain. It
  only does ``cos.repeat(1, 2)`` and ``sin.repeat(1, 2)`` to expand
  to ``head_dim``, then hands the pair to the blocks.
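
The buffer sizes quoted above can be reproduced from the shapes. This is an arithmetic sketch; ``freq_dim = 18`` is inferred here from the ``inv_freq`` formula with ``head_dim = 72`` rather than stated explicitly in the commit message.

```python
HEAD_DIM = 72                 # Qwen3-VL vision head_dim
ROPE_DIM = HEAD_DIM // 2      # 36, the dim fed to the inv_freq formula
FREQ_DIM = ROPE_DIM // 2      # 18 frequencies (every second index of rope_dim)
MAX_ROPE_SEQLEN = 8192
FP32_BYTES = 4

# One init-time buffer of shape (max_rope_seqlen, freq_dim) in fp32.
buffer_bytes = MAX_ROPE_SEQLEN * FREQ_DIM * FP32_BYTES
total_init_bytes = 2 * buffer_bytes          # cos + sin

# Per-token L2 entry: 2 * freq_dim floats for freqs alone, doubled for cos + sin.
per_token_before = 2 * FREQ_DIM * FP32_BYTES
per_token_after = 2 * per_token_before

print(buffer_bytes // 1024, "KiB per buffer")
print(per_token_before, "->", per_token_after, "bytes per token")
```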

nsys (5-image batch, 20 steady-state iters, all L2 hits):
  before: per-forward = 4 kernels (cat + repeat + cos + sin)
  after:  per-forward = 4 kernels (2 cats + 2 repeats) with the two
          transcendental cos/sin kernels replaced by plain
          ``CatArrayBatchedCopy_vectorized`` and ``elementwise_kernel``
          (repeat copies). 0 H->D in steady state.

Tests:
* ``test_rope_cos_sin_buffers_match_hf_rotary``: pre-computed cos/sin
  buffers match HF's ``Qwen3VLVisionRotaryEmbedding(36)(max_hw).cos()``
  / ``.sin()`` within ~1 ULP fp32 (host-vs-device codegen).
* ``test_rotary_pos_emb_thw_returns_device_tensor`` /
  ``test_rotary_pos_emb_thw_lru_cache_hit``: updated for the
  ``(cos, sin)`` tuple return.
* ``test_rot_pos_emb_l2_matches_per_tile``: cos and sin halves match
  per-tile cats bit-exact.
* ``test_rot_pos_emb_l2_no_device_transfer_on_hit``: still 0 new
  misses on a repeated grid.
* ``test_rot_pos_emb_cos_sin_matches_old_repeat_chain``: forward's
  final ``cos.repeat(1, 2)`` / ``sin.repeat(1, 2)`` equals the
  prior ``freqs.repeat(1, 2).cos()/.sin()`` chain within ~1 ULP.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
… rebase

The qwen3vl_opt branch's ``1c4e86892d "Further optimize Qwen3-VL"``
removed ``rope_position_ids`` from the per-block call and added a
``rotary_pos_emb.flatten(1).repeat(1, 2).cos()/.sin()`` chain in the
vision tower forward. At authoring time that matched the HF
``apply_rotary_pos_emb_vision`` ABI (cos/sin shape
``(seq, head_dim)``), which ``Qwen2_5_VLVisionAttention.forward`` used.

After upstream ``7279d6322d "EXAONE-4.5 Support"`` refactored that
attention class to route through the new ``apply_rope`` helper
(``RotaryEmbedding.apply_rotary_pos_emb`` / FlashInfer), the expected
``cos.shape[-1]`` became ``head_dim // 2``: ``apply_rotary_pos_emb``
computes ``rot_dim = cos.shape[-1] * 2`` and chunks q/k into halves
of size ``cos.shape[-1]``. With Qwen3-VL's ``head_dim = 72``
(``72 % 64 != 0``, so the FlashInfer gate misses and the PyTorch
fallback runs), the prior ``.repeat(1, 2)`` made cos/sin
``(total, 72)`` and the elementwise multiply against
``q.chunk(2)[i]`` (shape ``(..., 36)``) raised
``RuntimeError: The size of tensor a (36) must match the size of
tensor b (72) at non-singleton dimension 2`` -- silently broken at
rebase time, missed by every unit test on this branch because they
verified isolated cos/sin construction (pos_ids, lru_cache identity,
gather equivalence) but never piped them through the actual attention
block. CI's model-level test would have caught it.

Fix:

* ``Qwen3VisionModel.forward``: drop ``cos.repeat(1, 2)`` /
  ``sin.repeat(1, 2)``. The cos/sin tuple returned by ``rot_pos_emb``
  is already at the post-EXAONE shape ``(total_tokens, 2 * freq_dim)
  = (total_tokens, head_dim // 2)``.

* ``Qwen3VisionModel.forward``: restore the pre-1c4e86892d
  ``rope_position_ids = torch.arange(seq_len, dtype=int32,
  pin_memory=prefer_pinned()).to(device, non_blocking=True)`` and
  pass ``position_ids=rope_position_ids`` into each block. This
  keeps the FlashInfer ``position_ids is not None`` gate satisfied
  for any future config where ``head_dim % 64 == 0``; for Qwen3-VL
  itself it still falls through to the PyTorch path, which now has
  the correct broadcast.
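
The shape arithmetic behind the failure can be modeled in a few lines. This is a simplified sketch of the broadcast rule only, not the real ``apply_rotary_pos_emb``; the function name `check_rope_operands` is hypothetical.

```python
def check_rope_operands(head_dim: int, cos_last_dim: int) -> int:
    """Return rot_dim if cos/sin can multiply against the q/k halves."""
    half = head_dim // 2          # size of each q.chunk(2) half
    if cos_last_dim != half:
        raise ValueError(
            f"The size of tensor a ({half}) must match the size of "
            f"tensor b ({cos_last_dim})"
        )
    return cos_last_dim * 2       # rot_dim = cos.shape[-1] * 2


HEAD_DIM = 72
rot_dim = check_rope_operands(HEAD_DIM, HEAD_DIM // 2)   # post-fix layout

try:
    check_rope_operands(HEAD_DIM, HEAD_DIM)              # stray .repeat(1, 2)
    repeat_layout_ok = True
except ValueError:
    repeat_layout_ok = False
```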

All prior optimizations on this branch are preserved: cos/sin
pre-computed init buffers (``rope_cos_cache`` / ``rope_sin_cache``),
L2 per-(t, h, w) cache, batched Triton pos-embed kernel,
rot_pos_ids lru_cache, async_tensor_h2d helper, argsort inverse
permutation, deepstack dict lookup. The fix is a layout correction
on the output side of ``rot_pos_emb``, not a revert of the
optimizations.

Test: ``test_qwen3vl_vision_block_forward_end_to_end`` builds one
``Qwen3VLVisionBlock`` with random weights and runs forward with the
post-EXAONE cos/sin shape and ``position_ids``. Without this fix it
raises the same broadcasting error; with it the block returns the
expected ``(seq_len, hidden)`` tensor. Any future regression that
re-introduces a stray ``.repeat(1, 2)`` on cos/sin (or drops
``position_ids``) will surface here, not only at CI time.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Four CPU-dispatch optimizations stacked on top of the prior PR commits,
each verified by the qwen3vl unittests and nsys conc=64 OSL=8.

1. get_rope_index: rewrite the per-token loop using numpy (np.arange/
   np.indices/np.concatenate) instead of small torch tensor ops. Same
   algorithm, output exactly equal. ~2x per call on a 512x512 image.
2. Vision attention apply_rope: add a flash_attn Triton fast-path
   (flash_attn.ops.triton.rotary.apply_rotary, in-place) between the
   FlashInfer fused path and the PyTorch fallback. Qwen3-VL head_dim=72
   and Qwen2.5-VL head_dim=80 both miss FlashInfer's "head_size % 64 == 0"
   precondition and used to land on the 7-launch PyTorch path; the Triton
   kernel is a single launch per (q, k).
3. Vision cos/sin buffers materialised in the vision-tower dtype:
   Qwen3-VL register_buffer stores rope_cos_cache/rope_sin_cache as bf16,
   Qwen2.5-VL get_rope_and_window_index_by_thw casts cos_thw/sin_thw once
   per cached (t, h, w). apply_rope guards the per-call .to(dtype=q.dtype)
   with a dtype check so it becomes a no-op on the hot path.
4. Vision block residual fusion (vLLM pattern, see
   vllm/model_executor/models/qwen2_5_vl Qwen2_5_VisionBlock.forward):
   collapse the post-attention residual add into norm2's residual path,
   which our LayerNorm / RMSNorm already supports as a torch.compile-
   fused kernel. One fewer elementwise add launch per block.
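
The three-tier RoPE dispatch from item 2 can be sketched as a small decision function. This is an illustration under assumptions: the function name and return labels are hypothetical, and only the two gate conditions (``head_dim % 64 == 0`` and ``position_ids is not None``) come from the commit messages in this PR.

```python
def choose_rope_path(head_dim: int, has_position_ids: bool,
                     flash_attn_available: bool) -> str:
    """Pick the cheapest usable RoPE implementation for a vision head."""
    if has_position_ids and head_dim % 64 == 0:
        return "flashinfer"          # fused path
    if flash_attn_available:
        return "flash_attn_triton"   # single launch per (q, k)
    return "pytorch"                 # multi-launch fallback


paths = {
    "qwen3_vl": choose_rope_path(72, True, True),       # 72 % 64 == 8
    "qwen2_5_vl": choose_rope_path(80, True, True),     # 80 % 64 == 16
    "no_flash_attn": choose_rope_path(72, True, False),
    "head_dim_128": choose_rope_path(128, True, True),
}
```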

nsys conc=64 OSL=8 (Qwen3-VL-8B, 1x512x512 image): vision-tower wall
median 183 -> 107 ms (-42%), launches/iter 1571 -> 1220 (-22%), CPU API
time/iter 134 -> 115 ms (-15%). Throughput moves modestly because GPU
work is unchanged and the time is largely absorbed at stream-sync points
within _forward_step.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Strip one layer of commit 7368e5b: measurement showed that the batched
single-launch kernel did not improve conc=64 throughput and added
non-trivial setup cost (5 small H->D copies for per-image metadata) on
the low-concurrency path.

* Vision pos-embed interpolation goes back to the per-image Triton
  kernel ``_triton_pos_embed_interpolate`` + ``torch.cat`` join in
  ``Qwen3VisionModel.fast_pos_embed_interpolate``. The kernel takes
  ``(h, w, h_scale, w_scale)`` as scalar args, so no H<->D metadata
  transfers happen on the hot path. The batched
  ``_bilinear_pos_embed_batched_kernel`` /
  ``_triton_pos_embed_interpolate_batched`` pair is removed.

* Vision rotary cache no longer hard-codes ``max_rope_seqlen = 8192``;
  ``Qwen3VisionModel.__init__`` now constructs a standard
  ``RotaryEmbedding`` whose ``rotary_cos_sin`` buffer is sized to
  ``text_config.max_position_embeddings`` via ``RopeParams.from_config``.
  ``_rotary_pos_emb_thw`` slices ``self.rotary_pos_emb.rotary_cos_sin
  [:max(h, w)]`` then indexes with ``pos_ids``, mirroring upstream/main.

* Removed vLLM cross-reference comments in modeling_qwen3vl.py,
  modeling_qwen2vl.py, and modules/rotary_embedding.py while keeping
  the local technical descriptions intact.

* Updated unit tests to mock ``rotary_pos_emb.rotary_cos_sin`` (the new
  buffer location) and dropped the batched-vs-per-image comparison
  test together with the batched kernel.

3-run aiperf sweep (conc=64, ISL=1000, OSL=100, warmup=20, single
512x512 image, Qwen3-VL-8B): throughput 18.01 +/- 0.94 req/s vs main
17.25 +/- 0.72 (+4.4%), TTFT 1082 +/- 107 ms vs 1240 +/- 137 (-12.7%),
latency 3528 +/- 178 ms vs 3692 +/- 149 (-4.4%).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…c_tensor_h2d

The per-(t, h, w) rotary helpers in modeling_qwen2vl.py and
modeling_qwen3vl.py were doing a bare ``pos_ids.to(device,
non_blocking=True)`` (and likewise for ``window_index_thw``). Without
pin_memory the non_blocking flag silently falls back to a staging copy,
so the H->D never actually overlaps the surrounding GPU work.

Route those transfers through the project-wide ``async_tensor_h2d``
helper instead; it pins (when ``prefer_pinned()`` is enabled) and
issues a real ``cudaMemcpyAsync``. Each unique (t, h, w) tile still
only pays this once (lru_cache around the helper), but now the copy
genuinely runs asynchronously.

Unit tests (qwen3vl) pass unchanged.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
CI yapf hook reformatted the torch.arange call signature. Apply the
same reformat locally so pre-commit passes on push.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Skip the multi-modal HF processor and call the tokenizer directly when
``inputs["multi_modal_data"]`` is empty. The two outputs match
bit-exactly when ``images`` / ``videos`` are ``None``, so the only thing
we lose is the ``bypass_processor_output_validation`` context plus the
image/video processing branches inside ``ProcessorMixin.__call__``.

mrope_config is still populated since the LM is M-RoPE and needs the
position ids on every request -- it just goes through the short
``get_rope_index`` path with ``image_grid_thw=None``.
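
The routing decision can be sketched as follows. Assumptions: the ``prompt`` key, the function name, and both fake callables are hypothetical stand-ins; only the ``multi_modal_data`` key and the "tokenizer directly when empty" rule come from the description above.

```python
def process_inputs(inputs, tokenizer, mm_processor):
    """Route text-only requests straight to the tokenizer."""
    mm_data = inputs.get("multi_modal_data")
    if not mm_data:                       # None or empty dict
        return {"input_ids": tokenizer(inputs["prompt"]), "path": "text_only"}
    return {
        "input_ids": mm_processor(inputs["prompt"], mm_data),
        "path": "multimodal",
    }


calls = []


def fake_tokenizer(prompt):
    """Hypothetical tokenizer: one fake id per whitespace-separated word."""
    calls.append("tokenizer")
    return [len(word) for word in prompt.split()]


def fake_mm_processor(prompt, mm_data):
    """Hypothetical multimodal processor: text ids plus one -1 per image."""
    calls.append("processor")
    return [len(word) for word in prompt.split()] + [-1] * len(mm_data["image"])


text_out = process_inputs(
    {"prompt": "hello there", "multi_modal_data": {}},
    fake_tokenizer, fake_mm_processor)
image_out = process_inputs(
    {"prompt": "describe this", "multi_modal_data": {"image": ["img0"]}},
    fake_tokenizer, fake_mm_processor)
```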

nsys (conc=64, OSL=8, text-only requests, Qwen3-VL-8B): the per-request
``Qwen3VLInputProcessorBase forward()`` median drops from 36.1 ms to
12.5 ms (-65%) and the cumulative wall-clock share drops from 45.8% to
21.1% of the capture window. End-to-end throughput is essentially
unchanged at this concurrency (36.54 -> 36.78 req/s) because the
executor's LM forward is the gating phase; the win is for low-conc /
TTFT-sensitive text-only traffic where the input processor sits on
the request critical path.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Four coordinated fixes so torch.compile + piecewise CUDA graph stops
crashing on the Qwen2.5-VL / Qwen3-VL multimodal wrappers:

1. Share ``extra_attrs`` between the outer VL wrapper and the inner
   LM. ``Qwen{2VL,3VL}ModelBase.__init__`` deep-copied ``model_config``
   for the LM, so ``AutoModelForCausalLM.from_config`` registered LM
   attention layers in the copied dict while ``model_engine.model_forward``
   bound the thread-local ``model_extra_attrs`` to ``self.model.extra_attrs``
   (the outer dict). Under ``set_torch_compiling(True)``,
   ``attn_custom_op_inplace`` looks up its layer in the global TLS dict,
   so the lookup missed the LM layer and hit a stale vision entry,
   producing a ``[1, 1152] X [4096, 4096]`` fake-tensor mismatch at
   ``o_proj``.

2. Unregister vision attention from ``extra_attrs`` in
   ``Qwen2_5_VLVisionAttention.__init__`` (base class shared by both
   Qwen2.5-VL and Qwen3-VL). Vision runs from the outer wrapper, outside
   the compiled LM region, so ``forward_impl`` must use the eager path
   with its own ``attn_metadata``. With the global flag on, the custom-op
   path would otherwise consult ``extra_attrs["attention_metadata"]``,
   which ``model_engine.model_forward`` populates with the LM's metadata,
   so vision FMHA was dispatched with the LM's S/num_contexts and vision's
   head_dim and failed (``FMHA kernels are not found ... D: 72, S: 0``).

3. Piecewise compile only the LM body, not the outer wrapper. When the
   model has ``self.llm.model``, ``torch.compile`` is applied there;
   tracing the outer wrapper otherwise pulls the vision-tower output
   path and ``fuse_input_embeds`` into the same graph, which lets the
   vision hidden_dim leak into the LM ``o_proj`` fake-tensor trace and
   blows up the piecewise warmup.

4. Teach the piecewise backend to detect ``inputs_embeds`` placeholders.
   The VL wrapper invokes the LM with ``input_ids=None`` and
   ``inputs_embeds=<tensor>``; dynamo eliminates the ``input_ids``
   placeholder, so ``Backend.__call__`` could not find ``l_input_ids_``
   and raised ``Cannot detect input_num_tokens``. Accept
   ``l_inputs_embeds_``/``l_kwargs_inputs_embeds_`` as the alternate
   num_tokens carrier.
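
The root cause in item 1 is ordinary Python copy semantics: a deep copy gets its own dict, so registrations made through the copy are invisible to the original. A minimal sketch, where the `ModelConfig` class and the key ``lm_attn_0`` are hypothetical:

```python
import copy


class ModelConfig:
    """Hypothetical minimal config carrying only the shared registry."""

    def __init__(self):
        self.extra_attrs = {}


outer_config = ModelConfig()

# Before: the LM gets a deep copy, so its registrations land in a private dict.
lm_config = copy.deepcopy(outer_config)
lm_config.extra_attrs["lm_attn_0"] = "lm attention layer"
visible_before_fix = "lm_attn_0" in outer_config.extra_attrs

# After: keep the deep copy but point it back at the outer registry.
lm_config = copy.deepcopy(outer_config)
lm_config.extra_attrs = outer_config.extra_attrs
lm_config.extra_attrs["lm_attn_0"] = "lm attention layer"
visible_after_fix = "lm_attn_0" in outer_config.extra_attrs
```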

Validated end-to-end with piecewise CUDA graph enabled on
Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct: server boots through
warmup, text and image requests both succeed.
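
The placeholder detection from item 4 amounts to trying a short list of candidate names in order. A sketch under assumptions: the function name and the ``l_position_ids_`` placeholder are hypothetical, while the three carrier names and the error text are the ones quoted above.

```python
NUM_TOKENS_CARRIERS = (
    "l_input_ids_",
    "l_inputs_embeds_",
    "l_kwargs_inputs_embeds_",
)


def find_num_tokens_carrier(placeholder_names):
    """Return the first placeholder whose leading dimension is the token count."""
    for candidate in NUM_TOKENS_CARRIERS:
        if candidate in placeholder_names:
            return candidate
    raise RuntimeError("Cannot detect input_num_tokens")


# Text-only LM graph: input_ids survives tracing.
text_lm = find_num_tokens_carrier(["l_input_ids_", "l_position_ids_"])
# VL wrapper graph: input_ids is gone, inputs_embeds carries the token count.
vl_lm = find_num_tokens_carrier(["l_position_ids_", "l_kwargs_inputs_embeds_"])
```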

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…c_tensor_h2d

``Qwen3VisionModel.forward`` built the per-iter ``rope_position_ids``
with the manual ``torch.arange(..., pin_memory=prefer_pinned()).to(
device=..., non_blocking=True)`` pattern. ``async_tensor_h2d`` already
centralizes the same pinned-CPU + ``cudaMemcpyAsync`` pattern that the
other VL hot-path H2D sites use. Use it here too for consistency. No
functional or performance change -- both paths construct a pinned CPU
tensor and issue a non-blocking H->D copy.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Multi-context-request batches and chunked prefill both broke the
per-request indexing scheme once ``batch_idx`` was dropped from the
attention kernel's mrope cos/sin lookup. Two requests at the same
``rotary_position`` shared a buffer entry even though their (T, H, W)
coordinates differ, silently corrupting attention; chunked prefill of
a single request hit the same issue from chunk 2 onward because the
kernel indexed by request-internal position while Python only
materialized cos/sin for the current chunk. The failure surfaced as
``test_chunked_prefill_multimodal_smoke`` returning empty outputs /
immediate ``<|im_end|>``.

Switch ``applyBiasRopeUpdateKVCacheV2``'s mrope cos/sin lookup from
``rotary_position`` (= ``past_seen + token_idx_in_seq``, request-
internal) to ``bounded_global_token_idx`` (batch-flat per-token
index, same scheme regular rope already uses). Each batch token now
reads its own cos/sin entry independent of request boundary,
``past_seen_token_num``, or chunk size.
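
The collision is easy to reproduce with a toy batch. In this sketch a list of per-token (t, h, w) tuples stands in for the cos/sin buffer, and all names and coordinates are illustrative rather than taken from the kernel.

```python
# Per-token (t, h, w) coordinates, stitched in batch order as the kernel sees them.
request_a = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
request_b = [(5, 2, 2), (5, 2, 3)]
batch_table = request_a + request_b       # stands in for the cos/sin buffer


def lookup_by_rotary_position(table, past_seen, token_idx_in_seq):
    """Old scheme: index by request-internal position."""
    return table[past_seen + token_idx_in_seq]


def lookup_by_global_idx(table, global_token_idx):
    """New scheme: index by batch-flat token position."""
    return table[global_token_idx]


# First token of request B: request-internal position 0, batch-flat index 3.
old_entry = lookup_by_rotary_position(batch_table, past_seen=0, token_idx_in_seq=0)
new_entry = lookup_by_global_idx(batch_table, global_token_idx=len(request_a))
```

With the old scheme, request B's first token reads request A's entry; the batch-flat index reads its own. The same mismatch appears in chunked prefill, where a non-zero ``past_seen`` pushes the request-internal index past a table that only covers the current chunk.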

Simplify ``Qwen3VLModelBase`` / ``Qwen2VLModelBase``'s
``prepare_mrope_config`` to a single ``get_cos_sin(position_ids)`` over
the batch's already-stitched ``position_ids`` (3, 1, total_tokens);
drop the per-request loop, the per-request ``torch.cat(dim=0)`` layout
that the kernel can no longer index, and the
``self.mrope_position_ids_padding_cuda`` (3, 1,
max_position_embeddings) buffer (cos/sin is now sized to the current
iteration's token count instead of the model's full context capacity).

Update ``test_modeling_qwen3vl`` / ``test_modeling_qwen2_5vl``'s
``get_trtllm_inputs`` to slice ``mrope_position_ids`` to the chunk
range (matching production ``PyTorchModelEngine`` behavior): chunked
prefill / KV cache reuse scenarios used to pass full-request
``mrope_position_ids`` regardless of chunk, which only worked because
the prior code read mrope from ``multimodal_data`` and ignored the
``position_ids`` argument.

Validated end-to-end with the chunked-prefill smoke test, HF↔TRT-LLM
logit matching across all 7 scenarios per model (image / video /
multiple-image / cuda-graph / chunked-prefill / kv-cache-reuse /
no-fuse-rope), the multi-request batched quickstart example, and aiperf
on Qwen3-VL-8B-FP8 across concurrencies 1..128 (throughput-neutral
versus the prior per-request layout, 0-8% lower TTFT in context phase).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…t syncs

Three Qwen2/3-VL host-side micro-fixes in the prefill / vision-encoder
path:

1. Pre-allocated `(L, max_num_tokens, H)` `deepstack_input_embeds`
   buffer in `Qwen3VLModelBase.__init__` (`register_buffer(..., persistent=False)`
   so it stays out of `state_dict`). Forward path zero_()s the
   `(L, N, H)` slice and scatters `stack(deepstack_embeds, dim=0)` in
   one packed assignment, replacing `L` fresh `torch.zeros` + `L`
   indexed scatters that ran on every prefill iter inside
   `fuse_input_embeds`. Pattern mirrors vLLM's `deepstack_input_embeds`.

2. Reuse the `text_token_indices` / `mm_token_indices` that
   `PyTorchModelEngine._prepare_tp_inputs` already computes on CPU and
   async-H2Ds into the forward kwargs; on engine-driven paths the
   `filter_mm_token_from_input_ids` `torch.where` invocation is fully
   skipped (the fallback only runs on direct-`forward` unit-test paths).
   Pass them through to `fuse_input_embeds` so its internal filter is
   also bypassed.

3. Vision encoder forward cleanup:
   - Build `seq_lens: List[int]` from `grid_thw.tolist()` in Python
     (single Python loop) instead of
     `torch.repeat_interleave(...).tolist()`; the repeat-interleave path
     created two small CPU tensor ops + an extra `.tolist()` purely to
     produce a Python list `prepare_attn_metadata` then re-tensors.
   - `prepare_attn_metadata` takes a `List[int]` and computes
     `max_seq_len = max(seq_lens)` in Python, dropping
     `seq_lens.max().item()`.
   - Pre-allocate a 32K `arange(int32)` buffer for the vision block's
     `rope_position_ids` (`register_buffer(persistent=False)`); per-call
     code just slices `[:seq_len]` instead of `torch.arange(seq_len) +
     async_tensor_h2d` every encoder forward.
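
The Python-side `seq_lens` construction from item 3 can be sketched as below. It assumes each `(t, h, w)` row contributes `t` sequences of `h * w` tokens, which is what a `repeat_interleave(h * w, t)` would produce; the function name and grid values are hypothetical.

```python
def build_seq_lens(grid_thw):
    """One entry of h * w per temporal frame, for every (t, h, w) row."""
    seq_lens = []
    for t, h, w in grid_thw:
        seq_lens.extend([h * w] * t)
    return seq_lens


grid_thw = [[1, 32, 32], [2, 16, 24]]     # hypothetical image + 2-frame video
seq_lens = build_seq_lens(grid_thw)
max_seq_len = max(seq_lens)               # plain Python, no tensor .item() call
```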

Test rigs (`tests/unittest/_torch/modeling/test_modeling_qwen3vl.py`,
`test_modeling_qwen2_5vl.py`) only had their existing comments
normalized to single backticks for style consistency; no logic change.

Validated with the full Qwen2.5-VL / Qwen3-VL test surface (53/53
passing) -- HF<->TRT-LLM logit matching across image / video /
multiple-image / cuda-graph / chunked-prefill / kv-cache-reuse /
no-fuse-rope scenarios, the chunked-prefill smoke test, the P-D disagg
no-reuse test, and the multi-image / multi-request batch tests.
End-to-end aiperf throughput on Qwen3-VL-8B-FP8 across concurrencies
1..128 is within noise (the host-side savings sit well under the
attention/MoE wall time budget); the win is in lower per-iter
overhead and cleaner module state.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
The `has_tp` guard around `self.dist.broadcast(payloads, root=0)` in
`RequestBroadcaster._broadcast_requests` (introduced in c1d294e to
skip a hang on `world_size > 1, tp_size == 1`) is no longer needed --
the underlying behavior is now handled upstream. Revert to the
pre-guard form.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Apply the codebase's RST-literal style (``foo``) to inline code
references in comments / docstrings on the branch-touched lines of
Qwen2/3-VL model and test files; no logic change.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…uff-format)

Pure formatting cleanup -- no logic change. Brings the branch-touched
Qwen2/3-VL Python, vision-encoder C++ kernel template, and modeling
tests in line with the repo's pre-commit hooks (yapf, clang-format,
ruff-format).

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
- Restore the original `torch_dtype` / `max_position_embeddings`
  propagation comment on `Qwen3VLVisionAttention.__init__`.
- Drop the redundant `if cos.dtype != q.dtype` / `if sin.dtype != q.dtype`
  guards in `Qwen2_5_VLVisionAttention.apply_rope`; `tensor.to(dtype=...)`
  already short-circuits when the dtype matches.
- Collapse the doubled backticks introduced earlier in this branch back
  to single backticks on the branch-touched lines, matching the
  reviewer-preferred style.

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Replace the misleading `pragma: no cover - flash_attn is part of the
default deps` line on the flash_attn rotary import: flash_attn is
only declared in `triton_backend/requirements.txt` and the multimodal
extras, not the main `requirements.txt`, so the guarded import is
genuinely the load-time fallback when flash_attn isn't installed.
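
As a rough illustration of a guarded optional import, the sketch below resolves a symbol only when its package is installed. The helper `optional_import` is hypothetical, and the module path `flash_attn.layers.rotary` with the symbol `apply_rotary_emb` is an assumption about what the guarded import targets; the branch may import something else.

```python
import importlib
from typing import Any, Optional


def optional_import(module_name: str, attr: str) -> Optional[Any]:
    """Return `module_name.attr`, or None when the module is not installed."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Reached on installs that do not ship the optional package.
        return None
    return getattr(module, attr, None)


# Assumed target of the guarded import; None means "use the fallback path".
apply_rotary_emb = optional_import("flash_attn.layers.rotary", "apply_rotary_emb")
HAS_FLASH_ATTN_ROTARY = apply_rotary_emb is not None
```

Because the `except` branch really runs on installs without the optional package, it should not be excluded from coverage.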

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Python 3's `/` is true division and always returns float, so
`float(...) / float(...)` was redundant. Same effective values, less
noise.
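
A quick check of the division behavior, using made-up operands rather than values from the branch:

```python
numerator, denominator = 7, 2

explicit = float(numerator) / float(denominator)  # the old, redundant form
implicit = numerator / denominator                # true division on ints

# Both forms give the same float; `//` is the operator that floors.
same_value = explicit == implicit
result_type = type(implicit).__name__
floored = numerator // denominator
```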

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Extract the ``_build_temporal_block`` step as a classmethod hook on
``Qwen2VLInputProcessorBase`` so Qwen3-VL's only meaningful difference
(per-frame timestamps vs. ``tokens_per_second`` scaling) can be expressed
as a one-line override. ``Qwen3VLInputProcessorBase`` now subclasses
``Qwen2VLInputProcessorBase``, overrides ``__init__`` (dtype source),
``get_rope_index`` (``repeat_interleave`` of ``video_grid_thw`` before
super), and ``_build_temporal_block`` (plain ``np.indices``). Drops
~95% of the duplicated tokenizer / processor / mrope / call logic and
the matching unused imports.

Also dispatch ``get_mrope_config``'s ``get_rope_index`` call via
``type(self)`` so the subclass override is actually used, and condense
the ``bypass_processor_output_validation`` rationale comment.
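
The hook-plus-dispatch pattern can be sketched as follows. The classes `BaseProcessor` and `TimestampProcessor` are hypothetical stand-ins for the two input processors, and the method bodies are toy logic chosen only to make the override visible; they are not the branch's actual mRoPE computation.

```python
from typing import List


class BaseProcessor:
    """Hypothetical stand-in for the Qwen2-VL input processor base."""

    @classmethod
    def _build_temporal_block(cls, num_frames: int, scale: int = 2) -> List[int]:
        # Base behavior (toy): frame indices scaled by a per-second factor.
        return [i * scale for i in range(num_frames)]

    @classmethod
    def get_rope_index(cls, num_frames: int) -> List[int]:
        # Shared logic calls the hook through `cls`, so overrides apply.
        return cls._build_temporal_block(num_frames)

    def get_mrope_config(self, num_frames: int) -> List[int]:
        # `type(self)` resolves to the runtime subclass. Naming
        # `BaseProcessor` here would pin the call to the base version.
        return type(self).get_rope_index(num_frames)


class TimestampProcessor(BaseProcessor):
    """Hypothetical stand-in for the Qwen3-VL input processor."""

    @classmethod
    def _build_temporal_block(cls, num_frames: int, scale: int = 2) -> List[int]:
        # Override (toy): plain frame indices, no scaling.
        return list(range(num_frames))


base_positions = BaseProcessor().get_mrope_config(3)
sub_positions = TimestampProcessor().get_mrope_config(3)
```

With this layout the subclass changes one method, while the shared tokenizer, processor, and call logic stay in the base class.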

Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
@yechank-nvidia
Collaborator Author

/bot run

@tensorrt-cicd
Collaborator

PR_Github #51050 [ run ] triggered by Bot. Commit: b8c317d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #51041 [ run ] completed with state ABORTED. Commit: 5c4dfbc

Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #51050 [ run ] completed with state SUCCESS. Commit: b8c317d
/LLM/main/L0_MergeRequest_PR pipeline #40496 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation


Labels

Multimodal Label for issues & PRs regarding Multimodal related objects


5 participants