
[Usage]: Can vllm multimodal generate use preprocessed image? #14998

Open
pjgao opened this issue Mar 18, 2025 · 2 comments

Labels
usage How to use vllm

Comments

@pjgao

pjgao commented Mar 18, 2025

Your current environment

The output of `python collect_env.py`

How would you like to use vllm

When using vLLM's generate method for multimodal inference, we can pass PIL Image objects directly. Is there an interface that lets us instead pass the preprocessed pixel_values and image_grid_thw (produced by the Hugging Face processor), so that image preprocessing does not have to run inside vLLM? The snippet below shows how I currently pass the image.

MODEL_PATH = "Qwen2-VL-7B-Instruct/"
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "./demo.jpeg",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "describe this picture?"},
        ],
    },
]
from transformers import AutoProcessor, AutoTokenizer
from vllm import LLM, SamplingParams

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
# Tokenize the chat-template output so token ids can be passed to vLLM.
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
prompt_token_ids = tokenizer(prompt, add_special_tokens=False).input_ids

# qwen_vl_utils loads and resizes the image(s) referenced in `messages`.
from qwen_vl_utils import process_vision_info
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

# Minimal engine setup (sampling values are only an example).
llm = LLM(model=MODEL_PATH)
sampling_params = SamplingParams(temperature=0.0, max_tokens=256)

llm_inputs = {
    "prompt_token_ids": prompt_token_ids,
    "multi_modal_data": {"image": image_inputs},
    "mm_processor_kwargs": video_kwargs,
}
outputs = llm.generate(prompts=[llm_inputs], sampling_params=sampling_params)
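
For reference, this is the kind of preprocessed data I would like to hand to vLLM directly. A minimal sketch, reusing MODEL_PATH, processor and prompt from the snippet above: calling the Hugging Face processor on the text plus image yields the pixel_values and image_grid_thw tensors (for Qwen2-VL, alongside input_ids and attention_mask); the exact shapes depend on the image and the min/max pixel settings.

from PIL import Image

image = Image.open("./demo.jpeg")
# The processor tokenizes the text and preprocesses the image in one call.
hf_inputs = processor(text=[prompt], images=[image], return_tensors="pt")
print(list(hf_inputs.keys()))           # includes pixel_values and image_grid_thw
print(hf_inputs["pixel_values"].shape)  # flattened vision patches
print(hf_inputs["image_grid_thw"])      # (t, h, w) grid per image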

@pjgao pjgao added the usage How to use vllm label Mar 18, 2025
@DarkLight1337
Member

You will have to define your own multi-modal processor to handle that. See #14281

@pjgao
Author

pjgao commented Mar 18, 2025

You will have to define your own multi-modal processor to handle that. See #14281

Thanks for your kind reply, I will check it out ❤️
