Your current environment
vLLM: 0.7.3
Python: 3.12.8
transformers: 4.48.2
🐛 Describe the bug
I compared a specific TextVQA result from vLLM with HuggingFace Transformers, and vLLM gives a different answer (neither run uses sampling post-processing).
# Question: "what is this food place selling? Answer the question using a single word or phrase."
# HuggingFace Transformers answer
German sausages.
# vLLM answer
Hot dogs and sausages.
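For reference, a greedy-decoding run on the vLLM side looks roughly like the sketch below (the checkpoint name THUDM/glm-4v-9b, the image path, and max_model_len are illustrative placeholders, not my exact setup; temperature=0 forces greedy decoding):

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Illustrative checkpoint and image; greedy decoding via temperature=0.
llm = LLM(model="THUDM/glm-4v-9b", trust_remote_code=True, max_model_len=4096)
image = Image.open("textvqa_food_stand.jpg")
prompt = ("what is this food place selling? "
          "Answer the question using a single word or phrase.")

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0, max_tokens=32),
)
print(outputs[0].outputs[0].text)
```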
I checked glm4v.py and chatglm.py under vllm/model_executor/models and found that vLLM does not preprocess the image part of the position IDs the way the original GLM4V source code does.
For example, GLM4V's position IDs for a multimodal input containing an image should be [0, 1, 2, 3, 3, 3, 3, 4, 5, 6], where position 3 is repeated for every image token; vLLM, however, just produces [0, 1, 2, 3, 4, 5, 6, 7, 8, 9].
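To make the expected pattern concrete, here is a tiny illustration of the layout described above (build_position_ids is a hypothetical helper, not the actual GLM4V code; it only reproduces the [0, 1, 2, 3, 3, 3, 3, 4, 5, 6] pattern):

```python
def build_position_ids(num_prefix_text: int, num_image_tokens: int, num_suffix_text: int) -> list[int]:
    # Hypothetical helper: every image placeholder token shares the position
    # of the slot where the image occurs, and the text after the image
    # continues counting from the next position.
    pos = list(range(num_prefix_text))                  # text before the image -> [0, 1, 2]
    image_pos = num_prefix_text                         # one position for the whole image -> 3
    pos += [image_pos] * num_image_tokens               # image tokens all reuse it -> [3, 3, 3, 3]
    pos += list(range(image_pos + 1, image_pos + 1 + num_suffix_text))  # trailing text -> [4, 5, 6]
    return pos

print(build_position_ids(3, 4, 3))
# [0, 1, 2, 3, 3, 3, 3, 4, 5, 6]   <- what GLM4V expects
print(list(range(10)))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]   <- what vLLM currently produces
```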
I think this is the main reason for the different answers, and therefore the lower TextVQA score, and I hope the vLLM team can fix it.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.