Your current environment
vLLM: 0.7.3
Python: 3.12.8
transformers: 4.48.2
🐛 Describe the bug
I compared a specific TextVQA result from vLLM with HuggingFace Transformers, and vLLM gives a different answer (neither run uses sampling post-processing).
# Question: "what is this food place selling? Answer the question using a single word or phrase."
# HuggingFace Transformers answer
German sausages.
# vLLM answer
Hot dogs and sausages.
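For reference, a greedy-decoding run on the vLLM side looks roughly like the sketch below (the checkpoint name THUDM/glm-4v-9b, the image path, and max_model_len are illustrative placeholders, not my exact setup; temperature=0 forces greedy decoding):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Illustrative checkpoint and image; greedy decoding via temperature=0.
llm = LLM(model="THUDM/glm-4v-9b", trust_remote_code=True, max_model_len=4096)
image = Image.open("textvqa_food_stand.jpg")
prompt = ("what is this food place selling? "
          "Answer the question using a single word or phrase.")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```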
I checked glm4v.py and chatglm.py under vllm/model_executor/models and found that vLLM does not preprocess the image part of the position IDs the way the original GLM4V source code does.
For example, GLM4V's position IDs for a multimodal input containing an image should be [0, 1, 2, 3, 3, 3, 3, 4, 5, 6], where position 3 is repeated for every image token; vLLM, however, just produces [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
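To make the expected pattern concrete, here is a tiny illustration of the layout described above (build_position_ids is a hypothetical helper, not the actual GLM4V code; it only reproduces the [0, 1, 2, 3, 3, 3, 3, 4, 5, 6] pattern):

```python
def build_position_ids(num_prefix_text: int, num_image_tokens: int, num_suffix_text: int) -> list[int]:
    # Hypothetical helper: every image placeholder token shares the position
    # of the slot where the image occurs, and the text after the image
    # continues counting from the next position.
    pos = list(range(num_prefix_text))                  # text before the image -> [0, 1, 2]
    image_pos = num_prefix_text                         # one position for the whole image -> 3
    pos += [image_pos] * num_image_tokens               # image tokens all reuse it -> [3, 3, 3, 3]
    pos += list(range(image_pos + 1, image_pos + 1 + num_suffix_text))  # trailing text -> [4, 5, 6]
    return pos

print(build_position_ids(3, 4, 3))
# [0, 1, 2, 3, 3, 3, 3, 4, 5, 6]   <- what GLM4V expects
print(list(range(10)))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]   <- what vLLM currently produces
```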
I think this is the main reason for the different answers, and therefore the lower TextVQA score, and I hope the vLLM team can fix it.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.