Description
Your current environment
The output of `python collect_env.py`
How would you like to use vllm
When using the `generate` method in vLLM for inference with multimodal data, we can directly pass `Image` objects. Is there an interface that allows us to directly pass preprocessed `pixel_values` and `image_grid_thw` (generated by a processor) instead, so that image preprocessing does not have to be performed inside vLLM?
from transformers import AutoProcessor, AutoTokenizer
from qwen_vl_utils import process_vision_info
from vllm import LLM, SamplingParams

MODEL_PATH = "Qwen2-VL-7B-Instruct/"

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./demo.jpeg",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "describe this picture?"},
        ],
    },
]

# Build the chat prompt and tokenize it.
processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
prompt_token_ids = tokenizer(prompt).input_ids

# Extract the PIL images (and any video kwargs) from the messages.
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

llm = LLM(model=MODEL_PATH)
sampling_params = SamplingParams(max_tokens=512)

llm_inputs = {
    "prompt_token_ids": prompt_token_ids,
    "multi_modal_data": {"image": image_inputs},
    "mm_processor_kwargs": video_kwargs,
}
outputs = llm.generate(prompts=[llm_inputs], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
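For clarity, the two tensors in question can already be produced outside vLLM with the Hugging Face processor. The sketch below (assuming the standard Qwen2-VL `AutoProcessor`) only shows that preprocessing side, i.e. which tensors are meant by `pixel_values` and `image_grid_thw`; it does not assume any particular vLLM interface for consuming them, since that is exactly what this issue is asking about.

# Sketch: producing the preprocessed tensors outside vLLM with the HF processor.
# Assumes the standard Qwen2-VL AutoProcessor; no vLLM API for consuming these
# tensors is assumed here.
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen2-VL-7B-Instruct/"

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "./demo.jpeg"},
        {"type": "text", "text": "describe this picture?"},
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, _ = process_vision_info(messages)

# The processor returns the tensors we would like to hand to vLLM directly.
inputs = processor(text=[prompt], images=image_inputs, return_tensors="pt")
pixel_values = inputs["pixel_values"]      # flattened image patches
image_grid_thw = inputs["image_grid_thw"]  # (t, h, w) patch grid per image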
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.