[TRTLLM-12427][perf] Qwen2.5/3/3.5-VL Performance Optimization #11943
yechank-nvidia wants to merge 34 commits
…en3-VL vision
Replace ``HFQwen3VLVisionRotaryEmbedding(head_dim // 2)`` with an in-tree implementation:
* ``Qwen3VisionModel.__init__`` registers a ``rope_inv_freq`` buffer using the same formula the HF class used internally (``1.0 / (10000.0 ** (torch.arange(0, rope_dim, 2) / rope_dim))``, ``rope_dim = head_dim // 2``).
* New ``_freq_table(max_hw)`` is ``@lru_cache(maxsize=64)``. It computes ``torch.outer(arange(max_hw), inv_freq)`` once per unique ``max_hw``. Previously every ``_rotary_pos_emb_thw`` cache miss re-ran HF's ``forward()``, which built a fresh ``arange + outer`` every time. Production has at most a handful of distinct max-extents across the served tile set, so the cache stays small (``maxsize=64`` is plenty).
The HF dependency was only used as a stateless functor; nothing about its module identity or persistent buffers was reused. Dropping it also cuts the ``from transformers...Qwen3VLVisionRotaryEmbedding`` import.
Tests:
* ``test_freq_table_matches_hf_rotary``: 6 ``max_hw`` values (16/32/48/64/100/128) -> in-tree output bit-matches HF (atol=0, rtol=0, same dtype, same shape).
* ``test_freq_table_lru_cache_hit``: repeated ``max_hw`` returns the same cached device tensor object.
* All pre-existing ``_rotary_pos_emb_thw`` / ``rot_pos_emb_l2`` / ``batched_pos_embed_*`` tests still pass; they now exercise the in-tree path end-to-end.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
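The memoization described in this commit can be sketched without PyTorch. The snippet below is a minimal stand-in that assumes plain Python tuples in place of tensors; `inv_freq` and `freq_table` are illustrative names, and only the inverse-frequency formula, `rope_dim = head_dim // 2`, and `maxsize=64` are taken from the commit message.

```python
from functools import lru_cache

HEAD_DIM = 72             # Qwen3-VL vision head_dim quoted elsewhere in this PR
ROPE_DIM = HEAD_DIM // 2  # rope_dim = head_dim // 2


def inv_freq(rope_dim: int, theta: float = 10000.0) -> tuple:
    """1 / theta ** (i / rope_dim) for i = 0, 2, 4, ... (formula from the commit)."""
    return tuple(1.0 / (theta ** (i / rope_dim)) for i in range(0, rope_dim, 2))


@lru_cache(maxsize=64)
def freq_table(max_hw: int) -> tuple:
    """Outer product of arange(max_hw) and inv_freq, built once per unique max_hw."""
    freqs = inv_freq(ROPE_DIM)
    return tuple(tuple(pos * f for f in freqs) for pos in range(max_hw))


first = freq_table(32)
second = freq_table(32)  # served from the cache, nothing is recomputed
print(first is second)              # True
print(len(first), len(first[0]))    # 32 18
```

The identity check mirrors what ``test_freq_table_lru_cache_hit`` is described as asserting: a repeated ``max_hw`` hands back the same cached object.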
Two small simplifications in the per-forward hot path:
* Qwen3-VL: replace the per-layer ``layer_num in
self.deepstack_visual_indexes`` ``in`` check + ``.index(layer_num)``
linear scan (each O(len(deepstack_visual_indexes))) with a precomputed
``self._deepstack_layer_to_merger_idx`` dict built once in
``__init__``. ``forward`` now does a single ``.get()`` per block
iteration; O(1) instead of O(L) per layer. The dict is a plain Python
attribute (not a parameter); the underlying ``ModuleList`` indexing
is unchanged.
* Qwen2.5-VL: replace
    reverse_indices = torch.empty_like(window_index)
    reverse_indices[window_index] = torch.arange(N, device=..., dtype=...)
with
    reverse_indices = torch.argsort(window_index)
``window_index`` is a permutation of ``[0, N)``; its inverse
permutation is exactly ``argsort(window_index)``. torch implements
argsort as a single fused GPU sort, which both removes the explicit
``empty_like + arange + scatter`` triplet and skips an intermediate
allocation.
Test: ``test_argsort_matches_scatter_inverse`` (n = 16/256/1024)
proves the new path equals the old one (atol=0, rtol=0) and is the
true inverse permutation (window_index[reverse_indices] == arange(N)).
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
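Both simplifications in the commit above can be checked in plain Python. This is a minimal sketch that assumes lists in place of tensors; the helper names and the example ``deepstack_visual_indexes`` values are hypothetical, chosen only to show the identity and the lookup.

```python
import random


def inverse_by_scatter(perm):
    """Old form: reverse[perm[i]] = i (empty_like + arange + scatter)."""
    reverse = [0] * len(perm)
    for i, p in enumerate(perm):
        reverse[p] = i
    return reverse


def inverse_by_argsort(perm):
    """New form: argsort(perm), the indices that would sort perm."""
    return sorted(range(len(perm)), key=perm.__getitem__)


random.seed(0)
window_index = list(range(16))
random.shuffle(window_index)  # any permutation of [0, N)

reverse_indices = inverse_by_argsort(window_index)
assert reverse_indices == inverse_by_scatter(window_index)
assert [window_index[j] for j in reverse_indices] == list(range(16))

# Deepstack lookup: one dict built up front instead of `in` + `.index()` per layer.
deepstack_visual_indexes = [8, 16, 24]  # hypothetical layer numbers
layer_to_merger_idx = {layer: i for i, layer in enumerate(deepstack_visual_indexes)}
print(layer_to_merger_idx.get(16))  # 1
print(layer_to_merger_idx.get(5))   # None
```

The argsort form only equals the scatter form because ``window_index`` is a permutation; with repeated or missing values the two would diverge.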
…es through it
vLLM-style pinned + async H->D helper, kept in ``tensorrt_llm._utils``
alongside ``maybe_pin_memory`` (the only file where the pinned-memory
pre-commit check allows raw ``.pin_memory()``):
    def async_tensor_h2d(data, dtype, device):
        """List/tuple OR CPU Tensor -> pinned host buffer ->
        non_blocking cudaMemcpyAsync to ``device``."""
Without an explicit pin, ``.to(device, non_blocking=True)`` on a
pageable source silently stages through a pinned buffer that PyTorch
allocates per call, eating CPU time even though nsys still reports
``cudaMemcpyAsync``. The helper centralizes the pinned-buffer +
async-DMA idiom so callers don't choose between
``torch.tensor(..., pin_memory=prefer_pinned()).to(...,
non_blocking=True)`` (sequence input) and ``maybe_pin_memory(t).to(
..., non_blocking=True)`` (existing tensor input).
Routed through it:
* Qwen3-VL ``_triton_pos_embed_interpolate_batched``: the 5 per-image
metadata arrays (starts, hs, ws, h_scales, w_scales) -- previously
five ``torch.tensor(..., pin_memory=prefer_pinned()).to(device,
non_blocking=True)`` copies of the same shape.
* Qwen2.5-VL ``Qwen2_5_VisionModel.forward``: the per-forward
``window_index = torch.cat(window_indices).to(device,
non_blocking=True)`` -- the cat result is a fresh pageable tensor,
so without the helper it silently staged.
Tests:
* ``test_async_tensor_h2d_sequence_input``: list -> CUDA int32 tensor,
values preserved.
* ``test_async_tensor_h2d_tensor_input``: CPU fp64 -> CUDA fp32 with
dtype cast applied, values preserved.
No new memory allocations beyond what the prior code already issued
(``pin_memory()`` reuses Torch's pinned-buffer pool).
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…d input
Gate the ``.pin_memory()`` dispatch on ``not tensor.is_pinned()``. PyTorch's pin_memory is already a no-op for pinned tensors, but the call still pays a CPython + pybind round-trip. Several hot paths upstream-pin the input and then ``maybe_pin_memory`` is called again inside generic helpers (e.g. ``AttentionMetadata.seq_lens`` / ``seq_lens_kv`` setters, ``TrtllmAttentionMetadata.prepare`` re-pinning ``kv_lens``), so the guard is a microscopic but free win.
Behaviorally identical to before; an already-pinned tensor now returns from the function as the same Python object.
Test: ``test_maybe_pin_memory_idempotent_on_already_pinned`` -- second call on a pinned tensor returns ``is`` the first call's result.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…()/.sin() in Qwen3-VL vision
Mirror vLLM's pattern (vllm/model_executor/layers/rotary_embedding/base.py
``_compute_cos_sin_cache``): build the rotary cos/sin tables at
``__init__`` time and store them as module buffers. ``forward`` then
only gathers and reshapes -- no per-forward ``torch.outer``,
``.cos()``, or ``.sin()`` kernels.
Concrete changes in ``Qwen3VisionModel``:
* ``__init__`` registers two buffers, ``rope_cos_cache`` and
``rope_sin_cache``, each shape ``(max_rope_seqlen=8192, freq_dim)``
fp32. Built from the same ``inv_freq`` (``theta=10000``,
``dim=head_dim//2``) formula HF's
``Qwen3VLVisionRotaryEmbedding`` and the prior
``_freq_table`` used. Total init buffer cost: ~1.15 MB
(576 KB x 2). vLLM's equivalent ``get_rope`` cache lives at the
model dtype (bf16); ours stays fp32 to match
``RopeParams.create_rope_const_params``.
* ``_freq_table`` lru_cache is removed -- the gather source is now
the pre-computed cos/sin buffers, so there is no transient freqs
tensor to memoize.
* ``_rotary_pos_emb_thw(t, h, w)`` now returns a ``(cos, sin)`` tuple
of device tensors instead of a single freqs tensor. Each half has
shape ``(t*h*w, 2*freq_dim)`` after ``cos_cache[pos_ids].flatten(1)``.
L2 cache footprint per-token doubles (144 B -> 288 B fp32) because
cos and sin are stored separately; production ~10-30 unique tiles
still lands at 4-10 MB total. ``maxsize=1024`` is unchanged.
* ``rot_pos_emb`` likewise returns ``(cos, sin)``; multi-image batches
cat the cos and sin lists separately (device-side, since each L2
entry is already on device).
* ``forward`` drops the
``rotary_pos_emb.flatten(1).repeat(1, 2).cos()/.sin()`` chain. It
only does ``cos.repeat(1, 2)`` and ``sin.repeat(1, 2)`` to expand
to ``head_dim``, then hands the pair to the blocks.
nsys (5-image batch, 20 steady-state iters, all L2 hits):
before: per-forward = 4 kernels (cat + repeat + cos + sin)
after: per-forward = 4 kernels (2 cats + 2 repeats) with the two
transcendental cos/sin kernels replaced by plain
``CatArrayBatchedCopy_vectorized`` and ``elementwise_kernel``
(repeat copies). 0 H->D in steady state.
Tests:
* ``test_rope_cos_sin_buffers_match_hf_rotary``: pre-computed cos/sin
buffers match HF's ``Qwen3VLVisionRotaryEmbedding(36)(max_hw).cos()``
/ ``.sin()`` within ~1 ULP fp32 (host-vs-device codegen).
* ``test_rotary_pos_emb_thw_returns_device_tensor`` /
``test_rotary_pos_emb_thw_lru_cache_hit``: updated for the
``(cos, sin)`` tuple return.
* ``test_rot_pos_emb_l2_matches_per_tile``: cos and sin halves match
per-tile cats bit-exact.
* ``test_rot_pos_emb_l2_no_device_transfer_on_hit``: still 0 new
misses on a repeated grid.
* ``test_rot_pos_emb_cos_sin_matches_old_repeat_chain``: forward's
final ``cos.repeat(1, 2)`` / ``sin.repeat(1, 2)`` equals the
prior ``freqs.repeat(1, 2).cos()/.sin()`` chain within ~1 ULP.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
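The buffer and per-token sizes quoted in the commit above follow from the shapes it lists. The arithmetic below is a quick check; it assumes ``freq_dim = rope_dim // 2 = 18`` (inferred from ``arange(0, rope_dim, 2)`` with ``head_dim = 72``) and 4-byte fp32 elements.

```python
MAX_ROPE_SEQLEN = 8192
HEAD_DIM = 72
ROPE_DIM = HEAD_DIM // 2   # 36
FREQ_DIM = ROPE_DIM // 2   # 18, assumed from arange(0, rope_dim, 2)
FP32_BYTES = 4

# One (max_rope_seqlen, freq_dim) fp32 buffer.
buffer_bytes = MAX_ROPE_SEQLEN * FREQ_DIM * FP32_BYTES
print(buffer_bytes / 1024)           # 576.0 (KiB per buffer)
print(2 * buffer_bytes / 2**20)      # 1.125 (MiB for cos + sin)

# One cached row has width 2 * freq_dim after cos_cache[pos_ids].flatten(1).
per_token_single = 2 * FREQ_DIM * FP32_BYTES   # one freqs row, as before
per_token_cos_sin = 2 * per_token_single       # cos and sin stored separately
print(per_token_single, per_token_cos_sin)     # 144 288
```

The cos + sin total comes out at 1.125 MiB, or about 1.18 MB in decimal units, which brackets the rounded "~1.15 MB" in the message.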
… rebase
The qwen3vl_opt branch's ``1c4e86892d "Further optimize Qwen3-VL"`` removed ``rope_position_ids`` from the per-block call and added a ``rotary_pos_emb.flatten(1).repeat(1, 2).cos()/.sin()`` chain in the vision tower forward. At authoring time that matched the HF ``apply_rotary_pos_emb_vision`` ABI (cos/sin shape ``(seq, head_dim)``), which ``Qwen2_5_VLVisionAttention.forward`` used.
After upstream ``7279d6322d "EXAONE-4.5 Support"`` refactored that attention class to route through the new ``apply_rope`` helper (``RotaryEmbedding.apply_rotary_pos_emb`` / FlashInfer), the expected ``cos.shape[-1]`` became ``head_dim // 2``: ``apply_rotary_pos_emb`` computes ``rot_dim = cos.shape[-1] * 2`` and chunks q/k into halves of size ``cos.shape[-1]``. With Qwen3-VL's ``head_dim = 72`` (``72 % 64 != 0``, so the FlashInfer gate misses and the PyTorch fallback runs), the prior ``.repeat(1, 2)`` made cos/sin ``(total, 72)`` and the elementwise multiply against ``q.chunk(2)[i]`` (shape ``(..., 36)``) raised ``RuntimeError: The size of tensor a (36) must match the size of tensor b (72) at non-singleton dimension 2`` -- silently broken at rebase time, missed by every unit test on this branch because they verified isolated cos/sin construction (pos_ids, lru_cache identity, gather equivalence) but never piped them through the actual attention block. CI's model-level test would have caught it.
Fix:
* ``Qwen3VisionModel.forward``: drop ``cos.repeat(1, 2)`` / ``sin.repeat(1, 2)``. The cos/sin tuple returned by ``rot_pos_emb`` is already at the post-EXAONE shape ``(total_tokens, 2 * freq_dim) = (total_tokens, head_dim // 2)``.
* ``Qwen3VisionModel.forward``: restore the pre-1c4e86892d ``rope_position_ids = torch.arange(seq_len, dtype=int32, pin_memory=prefer_pinned()).to(device, non_blocking=True)`` and pass ``position_ids=rope_position_ids`` into each block. This keeps the FlashInfer ``position_ids is not None`` gate satisfied for any future config where ``head_dim % 64 == 0``; for Qwen3-VL itself it still falls through to the PyTorch path, which now has the correct broadcast.
All prior optimizations on this branch are preserved: cos/sin pre-computed init buffers (``rope_cos_cache`` / ``rope_sin_cache``), L2 per-(t, h, w) cache, batched Triton pos-embed kernel, rot_pos_ids lru_cache, async_tensor_h2d helper, argsort inverse permutation, deepstack dict lookup. The fix is a layout correction on the output side of ``rot_pos_emb``, not a revert of the optimizations.
Test: ``test_qwen3vl_vision_block_forward_end_to_end`` builds one ``Qwen3VLVisionBlock`` with random weights and runs forward with the post-EXAONE cos/sin shape and ``position_ids``. Without this fix it raises the same broadcasting error; with it the block returns the expected ``(seq_len, hidden)`` tensor. Any future regression that re-introduces a stray ``.repeat(1, 2)`` on cos/sin (or drops ``position_ids``) will surface here, not only at CI time.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
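The shape contract behind this fix can be illustrated on a single head vector. The sketch below is not the TRT-LLM implementation: `apply_rotary` is a hypothetical rotate-half routine over Python lists, written only to show why cos/sin must have width ``head_dim // 2`` when ``rot_dim = cos.shape[-1] * 2``. It raises a ``ValueError`` where the real code raised a broadcasting ``RuntimeError``.

```python
import math


def apply_rotary(q, cos, sin):
    """Rotate-half RoPE on one head vector; cos/sin have width rot_dim // 2."""
    half = len(cos)
    rot_dim = 2 * half                  # rot_dim = cos.shape[-1] * 2
    q1, q2 = q[:half], q[half:rot_dim]  # two halves of size cos.shape[-1]
    if len(q2) != half:
        raise ValueError(f"cos width {half} does not fit head_dim {len(q)}")
    first = [a * c - b * s for a, b, c, s in zip(q1, q2, cos, sin)]
    second = [b * c + a * s for a, b, c, s in zip(q1, q2, cos, sin)]
    return first + second + q[rot_dim:]


head_dim = 72
q = [float(i) for i in range(head_dim)]
angles = [0.1 * i for i in range(head_dim // 2)]
cos = [math.cos(a) for a in angles]
sin = [math.sin(a) for a in angles]

out = apply_rotary(q, cos, sin)   # width head_dim // 2 = 36: accepted
print(len(out))                   # 72

try:
    apply_rotary(q, cos + cos, sin + sin)  # width 72, like the stray repeat(1, 2)
except ValueError as err:
    print("rejected:", err)
```

A rotation leaves the vector norm unchanged, which is a cheap sanity check for any routine of this kind.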
Four CPU-dispatch optimizations stacked on top of the prior PR commits, each verified by the qwen3vl unittests and nsys conc=64 OSL=8.
1. get_rope_index: rewrite the per-token loop using numpy (np.arange/np.indices/np.concatenate) instead of small torch tensor ops. Same algorithm, output exactly equal. ~2x per call on a 512x512 image.
2. Vision attention apply_rope: add a flash_attn Triton fast-path (flash_attn.ops.triton.rotary.apply_rotary, in-place) between the FlashInfer fused path and the PyTorch fallback. Qwen3-VL head_dim=72 and Qwen2.5-VL head_dim=80 both miss FlashInfer's "head_size % 64 == 0" precondition and used to land on the 7-launch PyTorch path; the Triton kernel is a single launch per (q, k).
3. Vision cos/sin buffers materialised in the vision-tower dtype: Qwen3-VL register_buffer stores rope_cos_cache/rope_sin_cache as bf16, Qwen2.5-VL get_rope_and_window_index_by_thw casts cos_thw/sin_thw once per cached (t, h, w). apply_rope guards the per-call .to(dtype=q.dtype) with a dtype check so it becomes a no-op on the hot path.
4. Vision block residual fusion (vLLM pattern, see vllm/model_executor/models/qwen2_5_vl Qwen2_5_VisionBlock.forward): collapse the post-attention residual add into norm2's residual path, which our LayerNorm / RMSNorm already supports as a torch.compile-fused kernel. One fewer elementwise add launch per block.
nsys conc=64 OSL=8 (Qwen3-VL-8B, 1x512x512 image): vision-tower wall median 183 -> 107 ms (-42%), launches/iter 1571 -> 1220 (-22%), CPU API time/iter 134 -> 115 ms (-15%). Throughput moves modestly because GPU work is unchanged and the time is largely absorbed at stream-sync points within _forward_step.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
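The three-tier RoPE dispatch from item 2 can be summarized as a small selection function. This is a hedged sketch: `pick_rope_path` and its string labels are hypothetical, and the gate conditions are the ones stated in the commit messages (``head_size % 64 == 0`` and a non-``None`` ``position_ids`` for FlashInfer, an importable flash_attn for the Triton path).

```python
def pick_rope_path(head_dim, has_position_ids, flash_attn_available):
    """Return which RoPE implementation a vision attention call would use."""
    if has_position_ids and head_dim % 64 == 0:
        return "flashinfer"          # fused path, needs head_size % 64 == 0
    if flash_attn_available:
        return "flash_attn_triton"   # single launch per (q, k)
    return "pytorch"                 # multi-launch fallback


# head_dim values quoted in the commit: 72 (Qwen3-VL) and 80 (Qwen2.5-VL).
for head_dim in (72, 80, 128):
    print(head_dim, pick_rope_path(head_dim, True, True))
```

With these inputs, 72 and 80 both land on the Triton path while 128 would take the FlashInfer one, which matches the motivation given for adding the middle tier.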
Stripped layer of commit 7368e5b after measurement showed the batched single-launch kernel did not move the conc=64 throughput needle and added non-trivial setup cost (5 small H->D copies for per-image metadata) on the low-concurrency path.
* Vision pos-embed interpolation goes back to the per-image Triton kernel ``_triton_pos_embed_interpolate`` + ``torch.cat`` join in ``Qwen3VisionModel.fast_pos_embed_interpolate``. The kernel takes ``(h, w, h_scale, w_scale)`` as scalar args, so no H<->D metadata transfers happen on the hot path. The batched ``_bilinear_pos_embed_batched_kernel`` / ``_triton_pos_embed_interpolate_batched`` pair is removed.
* Vision rotary cache no longer hard-codes ``max_rope_seqlen = 8192``; ``Qwen3VisionModel.__init__`` now constructs a standard ``RotaryEmbedding`` whose ``rotary_cos_sin`` buffer is sized to ``text_config.max_position_embeddings`` via ``RopeParams.from_config``. ``_rotary_pos_emb_thw`` slices ``self.rotary_pos_emb.rotary_cos_sin[:max(h, w)]`` then indexes with ``pos_ids``, mirroring upstream/main.
* Removed vLLM cross-reference comments in modeling_qwen3vl.py, modeling_qwen2vl.py, and modules/rotary_embedding.py while keeping the local technical descriptions intact.
* Updated unit tests to mock ``rotary_pos_emb.rotary_cos_sin`` (the new buffer location) and dropped the batched-vs-per-image comparison test together with the batched kernel.
3-run aiperf sweep (conc=64, ISL=1000, OSL=100, warmup=20, single 512x512 image, Qwen3-VL-8B): throughput 18.01 +/- 0.94 req/s vs main 17.25 +/- 0.72 (+4.4%), TTFT 1082 +/- 107 ms vs 1240 +/- 137 (-12.7%), latency 3528 +/- 178 ms vs 3692 +/- 149 (-4.4%).
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
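The percentage deltas in the aiperf sweep above are plain relative changes of the reported means against ``main``. A short check, using only the numbers quoted in the commit message (`rel_change_pct` is an illustrative helper):

```python
def rel_change_pct(new, base):
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


throughput = rel_change_pct(18.01, 17.25)   # req/s, branch vs main
ttft = rel_change_pct(1082, 1240)           # ms
latency = rel_change_pct(3528, 3692)        # ms

print(round(throughput, 1), round(ttft, 1), round(latency, 1))  # 4.4 -12.7 -4.4
```

Note that the quoted spreads (for example +/- 0.94 vs +/- 0.72 req/s) are of the same order as the throughput delta itself, so the TTFT change is the more robust signal in this sweep.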
…c_tensor_h2d
The per-(t, h, w) rotary helpers in modeling_qwen2vl.py and modeling_qwen3vl.py were doing a bare ``pos_ids.to(device, non_blocking=True)`` (and likewise for ``window_index_thw``). Without pin_memory the non_blocking flag silently falls back to a staging copy, so the H->D never actually overlaps the surrounding GPU work.
Route those transfers through the project-wide ``async_tensor_h2d`` helper instead; it pins (when ``prefer_pinned()`` is enabled) and issues a real ``cudaMemcpyAsync``. Each unique (t, h, w) tile still only pays this once (lru_cache around the helper), but now the copy genuinely runs asynchronously.
Unit tests (qwen3vl) pass unchanged.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
CI yapf hook reformatted the torch.arange call signature. Apply the same reformat locally so pre-commit passes on push. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Skip the multi-modal HF processor and call the tokenizer directly when ``inputs["multi_modal_data"]`` is empty. The two outputs match bit-exactly when ``images`` / ``videos`` are ``None``, so the only thing we lose is the ``bypass_processor_output_validation`` context plus the image/video processing branches inside ``ProcessorMixin.__call__``. mrope_config is still populated since the LM is M-RoPE and needs the position ids on every request -- it just goes through the short ``get_rope_index`` path with ``image_grid_thw=None``.
nsys (conc=64, OSL=8, text-only requests, Qwen3-VL-8B): the per-request ``Qwen3VLInputProcessorBase forward()`` median drops from 36.1 ms to 12.5 ms (-65%) and the cumulative wall-clock share drops from 45.8% to 21.1% of the capture window. End-to-end throughput is essentially unchanged at this concurrency (36.54 -> 36.78 req/s) because the executor's LM forward is the gating phase; the win is for low-conc / TTFT-sensitive text-only traffic where the input processor sits on the request critical path.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
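The text-only fast path amounts to a single branch on ``multi_modal_data``. The sketch below assumes stand-in callables instead of the real HF tokenizer and processor; `preprocess`, `fake_tokenizer`, and `fake_processor` are hypothetical names used only to show the control flow.

```python
def preprocess(inputs, tokenizer, processor):
    """Return (path_taken, model_inputs) for one request."""
    mm_data = inputs.get("multi_modal_data") or {}
    if not mm_data:
        # Text-only request: skip the multimodal processor entirely.
        return "tokenizer", tokenizer(inputs["prompt"])
    return "processor", processor(inputs["prompt"], mm_data)


# Stand-in callables so the sketch runs without any model files.
def fake_tokenizer(text):
    return {"input_ids": [ord(ch) for ch in text]}


def fake_processor(text, mm_data):
    out = fake_tokenizer(text)
    out["num_images"] = len(mm_data.get("image", []))
    return out


text_only = preprocess({"prompt": "hi", "multi_modal_data": {}},
                       fake_tokenizer, fake_processor)
with_image = preprocess({"prompt": "hi", "multi_modal_data": {"image": ["<pixels>"]}},
                        fake_tokenizer, fake_processor)
print(text_only[0], with_image[0])  # tokenizer processor
```

The fast path is only safe because, as the commit states, both routes produce identical token ids when no images or videos are present.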
Four coordinated fixes so torch.compile + piecewise CUDA graph stops
crashing on the Qwen2.5-VL / Qwen3-VL multimodal wrappers:
1. Share ``extra_attrs`` between the outer VL wrapper and the inner
LM. ``Qwen{2VL,3VL}ModelBase.__init__`` deep-copied ``model_config``
for the LM, so ``AutoModelForCausalLM.from_config`` registered LM
attention layers in the copied dict while ``model_engine.model_forward``
bound the thread-local ``model_extra_attrs`` to ``self.model.extra_attrs``
(the outer dict). Under ``set_torch_compiling(True)``,
``attn_custom_op_inplace`` looks up its layer in the global TLS dict --
missed the LM and hit a stale vision entry, producing a
``[1, 1152] X [4096, 4096]`` fake-tensor mismatch at ``o_proj``.
2. Unregister vision attention from ``extra_attrs`` in
``Qwen2_5_VLVisionAttention.__init__`` (base class shared by both
Qwen2.5-VL and Qwen3-VL). Vision runs from the outer wrapper, outside
the compiled LM region, so ``forward_impl`` must use the eager path
with its own ``attn_metadata``. With the global flag on the custom-op
path otherwise consults ``extra_attrs["attention_metadata"]`` -- which
``model_engine.model_forward`` populates with the LM's metadata -- so
vision FMHA dispatched with the LM's S/num_contexts and vision's
head_dim and failed (``FMHA kernels are not found ... D: 72, S: 0``).
3. Piecewise compile only the LM body, not the outer wrapper. When the
model has ``self.llm.model``, ``torch.compile`` is applied there;
tracing the outer wrapper otherwise pulls the vision-tower output
path and ``fuse_input_embeds`` into the same graph, which lets the
vision hidden_dim leak into the LM ``o_proj`` fake-tensor trace and
blows up the piecewise warmup.
4. Teach the piecewise backend to detect ``inputs_embeds`` placeholders.
The VL wrapper invokes the LM with ``input_ids=None`` and
``inputs_embeds=<tensor>``; dynamo eliminates the ``input_ids``
placeholder, so ``Backend.__call__`` could not find ``l_input_ids_``
and raised ``Cannot detect input_num_tokens``. Accept
``l_inputs_embeds_``/``l_kwargs_inputs_embeds_`` as the alternate
num_tokens carrier.
Validated end-to-end with piecewise CUDA graph enabled on
Qwen3-VL-8B-Instruct and Qwen2.5-VL-7B-Instruct: server boots through
warmup, text and image requests both succeed.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
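Fix 4 in the commit above boils down to widening the set of graph placeholder names that may carry the token count. A minimal sketch, assuming the backend sees placeholder names as strings; `find_num_tokens_placeholder` is a hypothetical helper, while the three names and the error text are the ones quoted in the commit message.

```python
# Placeholder names that can carry num_tokens, in lookup order.
NUM_TOKENS_CARRIERS = (
    "l_input_ids_",
    "l_inputs_embeds_",
    "l_kwargs_inputs_embeds_",
)


def find_num_tokens_placeholder(placeholder_names):
    """Pick the first known carrier present among the graph placeholders."""
    present = set(placeholder_names)
    for name in NUM_TOKENS_CARRIERS:
        if name in present:
            return name
    raise ValueError("Cannot detect input_num_tokens")


# LM called directly: input_ids is a graph input.
print(find_num_tokens_placeholder(["l_input_ids_", "l_position_ids_"]))
# VL wrapper: input_ids=None was eliminated, inputs_embeds remains.
print(find_num_tokens_placeholder(["l_kwargs_inputs_embeds_", "l_position_ids_"]))
```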
…c_tensor_h2d
``Qwen3VisionModel.forward`` built the per-iter ``rope_position_ids`` with the manual ``torch.arange(..., pin_memory=prefer_pinned()).to(device=..., non_blocking=True)`` pattern. ``async_tensor_h2d`` already centralizes the same pinned-CPU + ``cudaMemcpyAsync`` pattern that the other VL hot-path H2D sites use. Use it here too for consistency.
No functional or performance change -- both paths construct a pinned CPU tensor and issue a non-blocking H->D copy.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Multi-context-request batches and chunked prefill both broke the per-request indexing scheme once ``batch_idx`` was dropped from the attention kernel's mrope cos/sin lookup. Two requests at the same ``rotary_position`` shared a buffer entry even though their (T, H, W) coordinates differ, silently corrupting attention; chunked prefill of a single request hit the same issue from chunk 2 onward because the kernel indexed by request-internal position while Python only materialized cos/sin for the current chunk. The failure surfaced as ``test_chunked_prefill_multimodal_smoke`` returning empty outputs / immediate ``<|im_end|>``.
Switch ``applyBiasRopeUpdateKVCacheV2``'s mrope cos/sin lookup from ``rotary_position`` (= ``past_seen + token_idx_in_seq``, request-internal) to ``bounded_global_token_idx`` (batch-flat per-token index, same scheme regular rope already uses). Each batch token now reads its own cos/sin entry independent of request boundary, ``past_seen_token_num``, or chunk size.
Simplify ``Qwen3VLModelBase`` / ``Qwen2VLModelBase``'s ``prepare_mrope_config`` to a single ``get_cos_sin(position_ids)`` over the batch's already-stitched ``position_ids`` (3, 1, total_tokens); drop the per-request loop, the per-request ``torch.cat(dim=0)`` layout that the kernel can no longer index, and the ``self.mrope_position_ids_padding_cuda`` (3, 1, max_position_embeddings) buffer (cos/sin is now sized to the current iteration's token count instead of the model's full context capacity).
Update ``test_modeling_qwen3vl`` / ``test_modeling_qwen2_5vl``'s ``get_trtllm_inputs`` to slice ``mrope_position_ids`` to the chunk range (matching production ``PyTorchModelEngine`` behavior): chunked prefill / KV cache reuse scenarios used to pass full-request ``mrope_position_ids`` regardless of chunk, which only worked because the prior code read mrope from ``multimodal_data`` and ignored the ``position_ids`` argument.
Validated end-to-end with the chunked-prefill smoke test, HF↔TRT-LLM logit matching across all 7 scenarios per model (image / video / multiple-image / cuda-graph / chunked-prefill / kv-cache-reuse / no-fuse-rope), the multi-request batched quickstart example, and aiperf on Qwen3-VL-8B-FP8 across concurrencies 1..128 (throughput-neutral versus the prior per-request layout, 0-8% lower TTFT in context phase).
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
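The two failure modes described above are visible from the index values alone. The sketch below is a toy model of the lookup, not the CUDA kernel: requests are given as hypothetical ``(past_seen, num_new_tokens)`` pairs, and the cos/sin buffer is assumed to hold exactly one row per token of the current iteration.

```python
def rotary_positions(requests):
    """Request-internal index: past_seen + token_idx_in_seq for each token."""
    return [past + i for past, n in requests for i in range(n)]


def global_token_indices(requests):
    """Batch-flat index: one slot per token in the current iteration."""
    return list(range(sum(n for _, n in requests)))


# Two context requests in one batch, nothing cached yet.
batch = [(0, 3), (0, 2)]
print(rotary_positions(batch))       # [0, 1, 2, 0, 1] -> rows 0 and 1 are shared
print(global_token_indices(batch))   # [0, 1, 2, 3, 4] -> one row per token

# Second chunk of a chunked prefill: 4 tokens already seen, 4 new ones.
chunk2 = [(4, 4)]
rows_available = sum(n for _, n in chunk2)   # cos/sin built for this chunk only
print(rotary_positions(chunk2))      # [4, 5, 6, 7] -> all beyond the 4 rows
print(global_token_indices(chunk2))  # [0, 1, 2, 3]
```

With the batch-flat index, every token maps to a distinct row that exists in the per-iteration buffer, regardless of how requests or chunks are laid out.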
…t syncs
Three Qwen2/3-VL host-side micro-fixes in the prefill / vision-encoder
path:
1. Pre-allocated `(L, max_num_tokens, H)` `deepstack_input_embeds`
buffer in `Qwen3VLModelBase.__init__` (`register_buffer(..., persistent=False)`
so it stays out of `state_dict`). Forward path zero_()s the
`(L, N, H)` slice and scatters `stack(deepstack_embeds, dim=0)` in
one packed assignment, replacing `L` fresh `torch.zeros` + `L`
indexed scatters that ran on every prefill iter inside
`fuse_input_embeds`. Pattern mirrors vLLM's `deepstack_input_embeds`.
2. Reuse the `text_token_indices` / `mm_token_indices` that
`PyTorchModelEngine._prepare_tp_inputs` already computes on CPU and
async-H2Ds into the forward kwargs; on engine-driven paths the
`filter_mm_token_from_input_ids` `torch.where` invocation is fully
skipped (the fallback only runs on direct-`forward` unit-test paths).
Pass them through to `fuse_input_embeds` so its internal filter is
also bypassed.
3. Vision encoder forward cleanup:
- Build `seq_lens: List[int]` from `grid_thw.tolist()` in Python
(single Python loop) instead of
`torch.repeat_interleave(...).tolist()`; the repeat-interleave path
created two small CPU tensor ops + an extra `.tolist()` purely to
produce a Python list `prepare_attn_metadata` then re-tensors.
- `prepare_attn_metadata` takes a `List[int]` and computes
`max_seq_len = max(seq_lens)` in Python, dropping
`seq_lens.max().item()`.
- Pre-allocate a 32K `arange(int32)` buffer for the vision block's
`rope_position_ids` (`register_buffer(persistent=False)`); per-call
code just slices `[:seq_len]` instead of `torch.arange(seq_len) +
async_tensor_h2d` every encoder forward.
Test rigs (`tests/unittest/_torch/modeling/test_modeling_qwen3vl.py`,
`test_modeling_qwen2_5vl.py`) only had their existing comments
normalized to single backticks for style consistency; no logic change.
Validated with the full Qwen2.5-VL / Qwen3-VL test surface (53/53
passing) -- HF<->TRT-LLM logit matching across image / video /
multiple-image / cuda-graph / chunked-prefill / kv-cache-reuse /
no-fuse-rope scenarios, the chunked-prefill smoke test, the P-D disagg
no-reuse test, and the multi-image / multi-request batch tests.
End-to-end aiperf throughput on Qwen3-VL-8B-FP8 across concurrencies
1..128 is within noise (the host-side savings sit well under the
attention/MoE wall time budget); the win is in lower per-iter
overhead and cleaner module state.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
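The ``seq_lens`` change in item 3 replaces a tensor round-trip with a list comprehension. This sketch assumes the elided ``repeat_interleave`` arguments follow the usual Qwen-VL pattern, where each of the ``t`` frames of an item contributes one sequence of ``h * w`` patches; `seq_lens_from_grid` is a hypothetical helper name.

```python
def seq_lens_from_grid(grid_thw):
    """One sequence of h * w tokens per frame, for every (t, h, w) item."""
    return [h * w for t, h, w in grid_thw for _ in range(t)]


# grid_thw.tolist() for a single image and a two-frame video (toy sizes).
grid_thw = [[1, 4, 6], [2, 2, 2]]

seq_lens = seq_lens_from_grid(grid_thw)
max_seq_len = max(seq_lens)   # plain Python max, no .max().item() sync
print(seq_lens, max_seq_len)  # [24, 4, 4] 24
```

Because both values are already Python objects, ``prepare_attn_metadata`` can consume them without any extra tensor-to-host conversion.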
The `has_tp` guard around `self.dist.broadcast(payloads, root=0)` in `RequestBroadcaster._broadcast_requests` (introduced in c1d294e to skip a hang on `world_size > 1, tp_size == 1`) is no longer needed -- the underlying behavior is now handled upstream. Revert to the pre-guard form. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Apply the codebase's RST-literal style (``foo``) to inline code references in comments / docstrings on the branch-touched lines of Qwen2/3-VL model and test files; no logic change. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
…uff-format) Pure formatting cleanup -- no logic change. Brings the branch-touched Qwen2/3-VL Python, vision-encoder C++ kernel template, and modeling tests in line with the repo's pre-commit hooks (yapf, clang-format, ruff-format). Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
- Restore the original `torch_dtype` / `max_position_embeddings` propagation comment on `Qwen3VLVisionAttention.__init__`.
- Drop the redundant `if cos.dtype != q.dtype` / `if sin.dtype != q.dtype` guards in `Qwen2_5_VLVisionAttention.apply_rope`; `tensor.to(dtype=...)` already short-circuits when the dtype matches.
- Collapse the doubled backticks introduced earlier in this branch back to single backticks on the branch-touched lines, matching the reviewer-preferred style.
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Replace the misleading `pragma: no cover - flash_attn is part of the default deps` line on the flash_attn rotary import: flash_attn is only declared in `triton_backend/requirements.txt` and the multimodal extras, not the main `requirements.txt`, so the guarded import is genuinely the load-time fallback when flash_attn isn't installed. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Python 3's `/` is true division and always returns float, so `float(...) / float(...)` was redundant. Same effective values, less noise. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Extract the ``_build_temporal_block`` step as a classmethod hook on ``Qwen2VLInputProcessorBase`` so Qwen3-VL's only meaningful difference (per-frame timestamps vs. ``tokens_per_second`` scaling) can be expressed as a one-line override. ``Qwen3VLInputProcessorBase`` now subclasses ``Qwen2VLInputProcessorBase``, overrides ``__init__`` (dtype source), ``get_rope_index`` (``repeat_interleave`` of ``video_grid_thw`` before super), and ``_build_temporal_block`` (plain ``np.indices``). Drops ~95% of the duplicated tokenizer / processor / mrope / call logic and the matching unused imports. Also dispatch ``get_mrope_config``'s ``get_rope_index`` call via ``type(self)`` so the subclass override is actually used, and condense the ``bypass_processor_output_validation`` rationale comment. Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
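The hook-plus-``type(self)`` pattern from this refactor can be shown with two toy classes. Everything here is hypothetical (class names, the ``tokens_per_second`` value, the simplified signatures); the sketch assumes ``get_rope_index`` is a classmethod and only demonstrates why dispatching through ``type(self)`` is what makes the subclass override take effect.

```python
class Qwen2StyleProcessor:
    tokens_per_second = 2  # hypothetical scaling factor for the demo

    @classmethod
    def _build_temporal_block(cls, num_frames):
        """Hook: temporal positions scaled by tokens_per_second."""
        return [i * cls.tokens_per_second for i in range(num_frames)]

    @classmethod
    def get_rope_index(cls, num_frames):
        return cls._build_temporal_block(num_frames)

    def get_mrope_config(self, num_frames):
        # Dispatch through the runtime class so subclass hooks are honored.
        return type(self).get_rope_index(num_frames)

    def get_mrope_config_hardcoded(self, num_frames):
        # Naming the base class pins the base hook, even on a subclass instance.
        return Qwen2StyleProcessor.get_rope_index(num_frames)


class Qwen3StyleProcessor(Qwen2StyleProcessor):
    @classmethod
    def _build_temporal_block(cls, num_frames):
        """Override: plain per-frame indices, no tokens_per_second scaling."""
        return list(range(num_frames))


p3 = Qwen3StyleProcessor()
print(p3.get_mrope_config(3))            # [0, 1, 2]
print(p3.get_mrope_config_hardcoded(3))  # [0, 2, 4]
```

The second call shows the pitfall the commit avoids: a hard-coded base-class call silently ignores the subclass override.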
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Signed-off-by: yechank <161688079+yechank-nvidia@users.noreply.github.com>
Summary
This PR reworks the Qwen2-VL / Qwen2.5-VL / Qwen3-VL PyTorch-backend implementations to cut host (CPU) overhead in the vision tower and input processors, enables piecewise CUDA graph for these models, and fixes several correctness issues around mRoPE and vision-block RoPE. The optimizations target the high-concurrency serving regime, where launch overhead and host↔device syncs dominate.
What changed
Vision tower / rotary embedding
- … `cos`/`sin` into init-time buffers so the forward path no longer calls `.cos()`/`.sin()` per step.
Host-overhead reduction
- … `async_tensor_h2d` helper and route all Qwen2.5/3-VL H2D copies (vision `pos_ids`, `window_index`, `rope_position_ids`) through it.
- … `maybe_pin_memory` when the input is already pinned.
CUDA graph
Refactor
Performance
Model: `Qwen3VLForConditionalGeneration` (FP8 weights + KV cache, bf16 vision encoder). Hardware: H200 ×1. Workload: image+text, ISL=1000, OSL=1000, 512×512 image, KV block reuse off, 3-run mean. Comparison is against upstream `main` with identical serving config (`max_batch_size=256`, `max_num_tokens=8192`, `num_postprocess_workers=4`, `cuda_graph_config.enable_padding=true`, chunked prefill on).
System output-token throughput (tok/s)
Other metrics
Gains are concentrated at high concurrency, where the host-side savings (fewer CPU launches, async H2D, higher CUDA-graph hit ratio) and the lower-overhead mRoPE path matter most. Low-concurrency TTFT also improves substantially (c=1 nearly halved) from the text fast path and removed vision-encoder syncs.