
[Usage]: Can vllm multimodal generate use preprocessed image? #14998

Open
pjgao opened this issue Mar 18, 2025 · 2 comments

Labels
usage How to use vllm

Comments

@pjgao

pjgao commented Mar 18, 2025

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

When using vLLM's generate method for multimodal inference, we can pass PIL Image objects directly. Is there an interface that lets us instead pass the preprocessed pixel_values and image_grid_thw (produced by the Hugging Face processor), so that image preprocessing does not have to run inside vLLM? The snippet below shows how I currently pass the image.

MODEL_PATH = "Qwen2-VL-7B-Instruct/"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./demo.jpeg",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "describe this picture?"},
        ],
    },
]
from transformers import AutoProcessor, AutoTokenizer
from vllm import LLM, SamplingParams

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# Tokenize the chat-template output so token ids can be passed to vLLM.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
prompt_token_ids = tokenizer(prompt, add_special_tokens=False).input_ids

# qwen_vl_utils loads and resizes the image(s) referenced in `messages`.
from qwen_vl_utils import process_vision_info
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

# Minimal engine setup (sampling values are only an example).
llm = LLM(model=MODEL_PATH)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

llm_inputs = {
    "prompt_token_ids": prompt_token_ids,
    "multi_modal_data": {"image": image_inputs},
    "mm_processor_kwargs": video_kwargs,
}
outputs = llm.generate(prompts=[llm_inputs], sampling_params=sampling_params)
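
For reference, this is the kind of preprocessed data I would like to hand to vLLM directly. A minimal sketch, reusing MODEL_PATH, processor and prompt from the snippet above: calling the Hugging Face processor on the text plus image yields the pixel_values and image_grid_thw tensors (for Qwen2-VL, alongside input_ids and attention_mask); the exact shapes depend on the image and the min/max pixel settings.

from PIL import Image

image = Image.open("./demo.jpeg")
# The processor tokenizes the text and preprocesses the image in one call.
hf_inputs = processor(text=[prompt], images=[image], return_tensors="pt")
print(list(hf_inputs.keys()))           # includes pixel_values and image_grid_thw
print(hf_inputs["pixel_values"].shape)  # flattened vision patches
print(hf_inputs["image_grid_thw"])      # (t, h, w) grid per image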

@pjgao pjgao added the usage How to use vllm label Mar 18, 2025
@DarkLight1337
Member

You will have to define your own multi-modal processor to handle that. See #14281

@pjgao
Author

pjgao commented Mar 18, 2025

You will have to define your own multi-modal processor to handle that. See #14281

Thanks for your kind reply, I will check it out ❤️
