Description
Your current environment
The output of `python collect_env.py`
How would you like to use vllm
When using the `generate` method in vLLM for inference with multimodal data, we can directly pass `Image` objects. Is there an interface that allows us to directly pass preprocessed `pixel_values` and `image_grid_thw` (generated by a processor) instead, so that image preprocessing does not have to be performed inside vLLM?
from transformers import AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info
from vllm import LLM, SamplingParams

MODEL_PATH = "Qwen2-VL-7B-Instruct/"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./demo.jpeg",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "describe this picture?"},
        ],
    },
]

# Build the chat prompt and tokenize it.
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
prompt_token_ids = tokenizer(prompt).input_ids

# Extract the PIL images (and any video kwargs) from the messages.
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

llm = LLM(model=MODEL_PATH)
sampling_params = SamplingParams(max_tokens=512)

llm_inputs = {
    "prompt_token_ids": prompt_token_ids,
    "multi_modal_data": {"image": image_inputs},
    "mm_processor_kwargs": video_kwargs,
}
outputs = llm.generate(prompts=[llm_inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
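For clarity, the two tensors in question can already be produced outside vLLM with the Hugging Face processor. The sketch below (assuming the standard Qwen2-VL `AutoProcessor`) only shows that preprocessing side, i.e. which tensors are meant by `pixel_values` and `image_grid_thw`; it does not assume any particular vLLM interface for consuming them, since that is exactly what this issue is asking about.

# Sketch: producing the preprocessed tensors outside vLLM with the HF processor.
# Assumes the standard Qwen2-VL AutoProcessor; no vLLM API for consuming these
# tensors is assumed here.
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen2-VL-7B-Instruct/"

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "./demo.jpeg"},
        {"type": "text", "text": "describe this picture?"},
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, _ = process_vision_info(messages)

# The processor returns the tensors we would like to hand to vLLM directly.
inputs = processor(text=[prompt], images=image_inputs, return_tensors="pt")
pixel_values = inputs["pixel_values"]      # flattened image patches
image_grid_thw = inputs["image_grid_thw"]  # (t, h, w) patch grid per image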
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.