[Multimodal] Optimize Qwen2/2.5-VL startup time #19756

WoosukKwon · 2025-06-17T16:56:45Z

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.

Purpose

Currently, processing large dummy inputs takes 40 secs of the startup time for Qwen2/2.5-VL (it happens twice and each takes 20 secs). This can be skipped by pre-computing the maximum token count per modality.

Test Plan

Test Result

(Optional) Documentation Update

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

github-actions · 2025-06-17T16:56:58Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

gemini-code-assist

Summary of Changes

Hello @WoosukKwon, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly optimizes the startup time for Qwen2/2.5-VL models by introducing a mechanism to pre-compute and provide maximum token counts for multimodal inputs. This change bypasses the previously time-consuming process of generating and processing large dummy inputs during model initialization, leading to faster loading times.

Highlights

Performance Optimization: Implemented a new mechanism to pre-compute maximum token counts for multimodal inputs, specifically targeting Qwen2/2.5-VL models to significantly reduce their startup time by avoiding the generation and processing of large dummy inputs.
API Extension: Introduced a new method, get_max_tokens_per_item, to the BaseProcessingInfo class in vllm/multimodal/processing.py. This method, which returns None by default, allows specific models to override it and provide pre-defined maximum token counts per modality.
Model-Specific Implementation: Overrode the newly added get_max_tokens_per_item method within the Qwen2VLForConditionalGeneration class in vllm/model_executor/models/qwen2_vl.py to return fixed maximum token counts for image (16384) and video (98304) modalities.
Multimodal Profiling Logic: Modified the get_mm_max_tokens method in MultiModalProfiler (vllm/multimodal/profiling.py) to first check whether pre-computed maximum token counts are available via processing_info.get_max_tokens_per_item(). If so, these values are used directly; otherwise, the system falls back to the original method of generating and processing dummy inputs.
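The control flow described in the highlights above can be sketched as follows. This is a minimal, self-contained illustration and not the actual vLLM code: the classes are simplified stand-ins, and `QwenLikeProcessingInfo`, `ToyProfiler`, `slow_path_calls`, and `_max_tokens_from_dummy_inputs` are hypothetical names introduced only for this example. Only the method names `get_max_tokens_per_item` and `get_mm_max_tokens` and the values 16384 and 98304 come from the PR.

```python
from typing import Mapping, Optional


class BaseProcessingInfo:
    """Simplified stand-in for the base class (not the real vLLM class)."""

    def get_max_tokens_per_item(self) -> Optional[Mapping[str, int]]:
        # Default: no pre-computed values, so the profiler must use dummy inputs.
        return None


class QwenLikeProcessingInfo(BaseProcessingInfo):
    """Hypothetical model-specific override returning fixed per-item maxima."""

    def get_max_tokens_per_item(self) -> Optional[Mapping[str, int]]:
        return {"image": 16384, "video": 98304}


class ToyProfiler:
    """Hypothetical profiler showing the fast path and the fallback."""

    def __init__(self, info: BaseProcessingInfo) -> None:
        self.info = info
        self.slow_path_calls = 0  # counts how often the fallback is taken

    def _max_tokens_from_dummy_inputs(self) -> Mapping[str, int]:
        # Placeholder for the expensive step (building and processing large
        # dummy inputs). The returned numbers are arbitrary.
        self.slow_path_calls += 1
        return {"image": 1, "video": 1}

    def get_mm_max_tokens(self) -> Mapping[str, int]:
        precomputed = self.info.get_max_tokens_per_item()
        if precomputed is not None:
            return precomputed  # fast path: skip dummy input processing
        return self._max_tokens_from_dummy_inputs()


fast = ToyProfiler(QwenLikeProcessingInfo())
slow = ToyProfiler(BaseProcessingInfo())
fast_result = fast.get_mm_max_tokens()
slow_result = slow.get_mm_max_tokens()
```

Returning None by default keeps every other model on the existing dummy-input path, so only models that opt in change behavior.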


Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces an optimization to reduce the startup time for Qwen2/2.5-VL models by allowing them to provide pre-computed maximum token counts per modality, thus avoiding the slow process of generating and processing dummy inputs. The changes involve adding a new method get_max_tokens_per_item to the base multimodal processing information class and overriding it in the Qwen2-VL model. The profiling code has also been updated to leverage this new method.

The core idea and implementation appear sound and should lead to the described performance improvements. I have a suggestion regarding the use of named constants for better maintainability.

Please also ensure the PR description checklist is completed, particularly the sections for 'Test Plan' and 'Test Result'. Given this is a performance optimization, including some benchmark numbers (e.g., startup time before and after this change) would be highly beneficial.

gemini-code-assist · 2025-06-17T16:57:40Z

vllm/model_executor/models/qwen2_vl.py

+    def get_max_tokens_per_item(self) -> Mapping[str, int]:
+        return {"image": 16384, "video": 98304}


The hardcoded values 16384 and 98304 represent the maximum tokens for images and videos respectively for Qwen2-VL. For better readability and maintainability, consider defining these as named constants at the module level or within the class. This makes their meaning clearer and simplifies updates if these values change in the future.

For example:

# At the module level or as class attributes
_MAX_IMAGE_TOKENS_QWEN2_VL = 16384
_MAX_VIDEO_TOKENS_QWEN2_VL = 98304

class Qwen2VLForCausalLM(nn.Module, SupportsMultiModal):
    # ...
    def get_max_tokens_per_item(self) -> Mapping[str, int]:
        return {"image": _MAX_IMAGE_TOKENS_QWEN2_VL,
                "video": _MAX_VIDEO_TOKENS_QWEN2_VL}

DarkLight1337

The dummy data processing is already cached such that it's processed only once even though it can be called multiple times during startup (#17935). We currently still use dummy data for the profiling run so I'm not sure how much this PR helps.

WoosukKwon · 2025-06-17T17:26:55Z

@DarkLight1337 Thanks for sharing it! In my experiment, this PR reduces the startup time of Qwen2.5-VL-3B from 120 secs to 55 secs. It definitely helps.

That said, I'm not sure if the pre-computed values should depend on the limit_mm_per_prompt parameter.

ywang96 · 2025-06-17T17:36:25Z

vllm/model_executor/models/qwen2_vl.py

+    def get_max_tokens_per_item(self) -> Mapping[str, int]:
+        return {"image": 16384, "video": 98304}


vllm/multimodal/processing.py

vllm/multimodal/profiling.py


DarkLight1337 · 2025-06-19T05:49:57Z

vllm/model_executor/models/qwen2_vl.py

+
+        max_image_tokens = self.get_max_image_tokens()
+        max_video_tokens = self.get_max_video_tokens(seq_len, mm_counts)
+        return {"image": max_image_tokens, "video": max_video_tokens}


Can you validate whether the startup time is actually reduced (compared to before this PR) after this latest change?

@DarkLight1337 Yep that's exactly what I'm going to do next
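To illustrate why the non-hardcoded version takes seq_len and mm_counts, here is a toy sketch of a per-item maximum that depends on the sequence length and on how many items of each modality are allowed. The budgeting rule below is an invented assumption for illustration only, not the formula vLLM uses; `max_tokens_per_item`, `IMAGE_TOKEN_CAP`, and `VIDEO_TOKEN_CAP` are hypothetical names, with the cap values taken from the initial commit.

```python
from typing import Mapping

# Caps taken from the hardcoded values in the initial commit.
IMAGE_TOKEN_CAP = 16384
VIDEO_TOKEN_CAP = 98304


def max_tokens_per_item(seq_len: int, mm_counts: Mapping[str, int]) -> dict[str, int]:
    """Toy rule (an assumption, not vLLM's logic): reserve room for the
    allowed images first, then split what is left evenly among the videos."""
    max_images = mm_counts.get("image", 0)
    max_videos = mm_counts.get("video", 0)

    image_tokens = IMAGE_TOKEN_CAP
    remaining = max(seq_len - image_tokens * max_images, 0)
    # Guard against division by zero when no videos are allowed.
    video_tokens = min(VIDEO_TOKEN_CAP, remaining // max(max_videos, 1))
    return {"image": image_tokens, "video": video_tokens}


tight = max_tokens_per_item(32768, {"image": 1, "video": 1})
roomy = max_tokens_per_item(200000, {"image": 0, "video": 1})
```

Under this toy rule a short context or a higher image limit shrinks the per-video maximum, which is the kind of dependence on limit_mm_per_prompt that a fixed constant cannot express.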


ywang96 · 2025-06-19T08:24:44Z

@DarkLight1337 @WoosukKwon Here's a short repro script - let me know if this is reasonable.

import time
from vllm import LLM

st = time.perf_counter()
llm = LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", enforce_eager=True)
print("Time taken", time.perf_counter() - st)

Results below are averages over 10 rounds, in seconds - profiling is done with dummy video input

Main branch: 29.343433478847146
Woosuk's initial commit 3fe1893: 15.75505773164332
Updated version of this branch without hardcoded values: 15.77296781912446

Adding some constraints with limit_mm_per_prompt={"video": 0} so that profiling is done with dummy image input

Main branch: 16.037972562015057
Woosuk's initial commit 3fe1893: 15.723176507279277
Updated version of this branch without hardcoded values: 15.553956482559443

I think this means there is something wrong with caching the processed video inputs? Probably also has something to do with serialization. Will do more digging to verify.
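For reference, averages like the ones above could be collected with a small harness along these lines. This is only a sketch: `average_seconds` and `speedup` are hypothetical helpers, and the harness times an arbitrary callable so that it runs without vLLM installed. The four numbers are copied from the results above.

```python
import statistics
import time
from typing import Callable


def average_seconds(build: Callable[[], object], rounds: int = 10) -> float:
    """Call `build` `rounds` times and return the mean wall-clock time in seconds."""
    samples = []
    for _ in range(rounds):
        start = time.perf_counter()
        build()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def speedup(before: float, after: float) -> float:
    """Ratio of the old time to the new time (values above 1 mean faster)."""
    return before / after


# Reported averages: main branch vs. the updated branch without hardcoded values.
video_speedup = speedup(29.343433478847146, 15.77296781912446)
image_speedup = speedup(16.037972562015057, 15.553956482559443)

# Stand-in workload so the harness can be exercised anywhere.
demo_average = average_seconds(lambda: sum(range(1000)), rounds=3)
```

To time the real startup, `build` would be something like `lambda: LLM(model="Qwen/Qwen2.5-VL-3B-Instruct", enforce_eager=True)`. Running each round in a fresh process is probably safer, since an engine created in an earlier round may still hold GPU memory. With the reported numbers, the video-profiling case comes out at roughly 1.86x faster and the image-only case at roughly 1.03x, consistent with the suspicion that the video path accounts for the gap.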


WoosukKwon · 2025-06-19T18:19:46Z

@ywang96 Thanks for the investigation. Didn't know that it is caused by the video input. 🤔


[Multimodal] Optimize Qwen2/2.5-VL startup time

3fe1893

Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>

WoosukKwon requested review from DarkLight1337 and ywang96 as code owners June 17, 2025 16:56

mergify bot added the multi-modality Related to multi-modality (#4194) label Jun 17, 2025

gemini-code-assist bot reviewed Jun 17, 2025

View reviewed changes

DarkLight1337 reviewed Jun 17, 2025

View reviewed changes

ywang96 reviewed Jun 17, 2025

View reviewed changes

mergify bot added the qwen Related to Qwen models label Jun 18, 2025

ywang96 and others added 2 commits June 18, 2025 22:41

clarify and remove hardcoded values

883bf7e

Signed-off-by: Roger Wang <hey@rogerw.me> Signed-off-by: Roger Wang <ywang@roblox.com>

comment

f6250e7

Signed-off-by: Roger Wang <hey@rogerw.me>

DarkLight1337 reviewed Jun 19, 2025

View reviewed changes

ywang96 added 2 commits June 18, 2025 23:02

Merge branch 'main' into woosuk/optimize-qwen-vl-startup-time

e17ac34

add missing kwarg

9ec904b

Signed-off-by: Roger Wang <hey@rogerw.me>

typing

85de93e

Signed-off-by: Roger Wang <hey@rogerw.me>

WoosukKwon added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 20, 2025

WoosukKwon enabled auto-merge (squash) June 20, 2025 17:46

ywang96 disabled auto-merge June 20, 2025 17:55

update

fc97bb8

Signed-off-by: Roger Wang <hey@rogerw.me>

ywang96 enabled auto-merge (squash) June 20, 2025 18:08

ywang96 added 4 commits June 20, 2025 11:24

update

a1fde53

Signed-off-by: Roger Wang <hey@rogerw.me>

debug

75767f0

Signed-off-by: Roger Wang <hey@rogerw.me>

warning

bf105d0

Signed-off-by: Roger Wang <hey@rogerw.me>

Merge branch 'main' into woosuk/optimize-qwen-vl-startup-time

da4c240

ywang96 merged commit 2c5302f into main Jun 21, 2025
75 checks passed

ywang96 deleted the woosuk/optimize-qwen-vl-startup-time branch June 21, 2025 20:01
