
Why can't vLLM start with InternVL3-8B after it has been converted to the Hugging Face format? It shows the error: `ValueError: 'limit_mm_per_prompt' is only supported for multimodal models.` #38000

FloSophorae opened this issue May 7, 2025 · 9 comments

@FloSophorae

System Info

vllm 0.8.5.post1
transformers 4.52.0.dev0

Who can help?

@amyeroberts
@qubvel

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

CUDA_VISIBLE_DEVICES=0,1 vllm serve $MODEL_PATH \
    --tensor-parallel-size 2 \
    --port $MODEL_PROT \
    --host 0.0.0.0 \
    --dtype float16 \
    --max-model-len 65536 \
    --limit-mm-per-prompt image=30,video=0 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.6 \
    --block-size 16 > "$VLLM_LOG"

Expected behavior

I downloaded the model from OpenGVLab/InternVL3-8B, which natively supports running OpenAI-style chat completions with vLLM. However, after converting it to the Hugging Face format using the script transformers/src/transformers/models/internvl/convert_internvl_weights_to_hf.py, launching vLLM resulted in the error:

ValueError: 'limit_mm_per_prompt' is only supported for multimodal models.

The command I used to launch vLLM is the same one shown in the Reproduction section above.

vLLM runs correctly when I set MODEL_PATH to the original OpenGVLab/InternVL3-8B checkpoint, but it throws the error above when I change the path to the converted InternVL3-8B-hf directory: ValueError: 'limit_mm_per_prompt' is only supported for multimodal models.

Could someone explain why this is happening and suggest solutions?
Thank you very much!

@FloSophorae FloSophorae added the bug label May 7, 2025
@zucchini-nlp
Member

Hm, I don't think the converted weights are compatible with vLLM yet. Usually vLLM implements multimodal models itself and loads the official checkpoints. Support for Transformers-style multimodal models is already planned, and we'll try to add it soon.

@FloSophorae
Author

Hello, thank you for your response. Could you please advise if there’s any way to convert a Hugging Face (HF) format model back to its original format? My training task only generated the model in HF format. Looking forward to your reply!

@zucchini-nlp
Member

Ah, you are using a custom tuned model. I don't think we have an easy way to do the reverse conversion, unless you write your own conversion script.

Though I will cc @hmellor as well, as I'm not sure if there's a way to run converted models in vLLM without using the Transformers backend.

@hmellor
Member

hmellor commented May 7, 2025

I can't find any mention of needing to convert this model for it to be compatible with vLLM. You should be able to use it directly.

I think the problem is that this is a custom model and you're not using --trust-remote-code.

This flag is necessary to get the custom processor from the Hub repo. I've found a doc showing that InternVL is an example of a model with a custom processor: https://docs.vllm.ai/en/latest/contributing/model/multimodal.html#custom-hf-processor.
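
For example, adding the flag to the launch command from the issue would look something like this (a sketch, untested; $MODEL_PATH, $MODEL_PROT and $VLLM_LOG are the placeholders from the original report):

CUDA_VISIBLE_DEVICES=0,1 vllm serve $MODEL_PATH \
    --tensor-parallel-size 2 \
    --port $MODEL_PROT \
    --host 0.0.0.0 \
    --dtype float16 \
    --max-model-len 65536 \
    --limit-mm-per-prompt image=30,video=0 \
    --enable-prefix-caching \
    --trust-remote-code \
    --gpu-memory-utilization 0.6 \
    --block-size 16 > "$VLLM_LOG"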

@yonigozlan
Member

Btw, not sure if this helps, but there's already a converted checkpoint on the Hub for the native Transformers InternVL3-8B: https://huggingface.co/OpenGVLab/InternVL3-8B-hf

@hmellor
Member

hmellor commented May 9, 2025

In vLLM CI we use https://hf.co/OpenGVLab/InternVL2-1B, so I think unconverted is ok

@hrdxwandg

> In vLLM CI we use https://hf.co/OpenGVLab/InternVL2-1B, so I think unconverted is ok

I use the following command to start vLLM, and it starts successfully.

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /xxx/InternVL3/InternVL3-78B \
    --tensor-parallel-size 4 \
    --port 8092 \
    --host 0.0.0.0 \
    --dtype float16 \
    --max-model-len 65536 \
    --limit-mm-per-prompt "video=5" \
    --enable-prefix-caching \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --block-size 16

But when I send a request, I get an error; it seems the --limit-mm-per-prompt param does not take effect. How can I fix it?

ERROR 05-15 17:04:34 [serving_chat.py:200] Error in preprocessing prompt inputs
ERROR 05-15 17:04:34 [serving_chat.py:200] Traceback (most recent call last):
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/openai/serving_chat.py", line 183, in create_chat_completion
ERROR 05-15 17:04:34 [serving_chat.py:200]     ) = await self._preprocess_chat(
ERROR 05-15 17:04:34 [serving_chat.py:200]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/openai/serving_engine.py", line 403, in _preprocess_chat
ERROR 05-15 17:04:34 [serving_chat.py:200]     conversation, mm_data_future = parse_chat_messages_futures(
ERROR 05-15 17:04:34 [serving_chat.py:200]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 1165, in parse_chat_messages_futures
ERROR 05-15 17:04:34 [serving_chat.py:200]     sub_messages = _parse_chat_message_content(
ERROR 05-15 17:04:34 [serving_chat.py:200]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 1089, in _parse_chat_message_content
ERROR 05-15 17:04:34 [serving_chat.py:200]     result = _parse_chat_message_content_parts(
ERROR 05-15 17:04:34 [serving_chat.py:200]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 989, in _parse_chat_message_content_parts
ERROR 05-15 17:04:34 [serving_chat.py:200]     parse_res = _parse_chat_message_content_part(
ERROR 05-15 17:04:34 [serving_chat.py:200]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 1064, in _parse_chat_message_content_part
ERROR 05-15 17:04:34 [serving_chat.py:200]     mm_parser.parse_video(str_content)
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 784, in parse_video
ERROR 05-15 17:04:34 [serving_chat.py:200]     placeholder = self._tracker.add("video", video)
ERROR 05-15 17:04:34 [serving_chat.py:200]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 568, in add
ERROR 05-15 17:04:34 [serving_chat.py:200]     raise ValueError(
ERROR 05-15 17:04:34 [serving_chat.py:200] ValueError: At most 0 video(s) may be provided in one request. You can set `--limit-mm-per-prompt` to increase this limit if the model supports it.
INFO:     172.16.8.240:43962 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

@hmellor
Member

hmellor commented May 15, 2025

This is a vLLM problem, not a Transformers problem. Let's discuss in vLLM.

@yoyouC

yoyouC commented May 19, 2025

@FloSophorae Don't know if this helps, but I've managed to convert an InternVL3-1B-hf model back to the original OpenGVLab/InternVL3-1B format via the following script:

import re
import torch
from safetensors.torch import load_file, save_file
from collections import OrderedDict

# ==== Inverse Mappings ====

CONVERTED_TO_ORIGINAL_KEY_MAPPING = [
    (r"vision_tower", "vision_model"),
    (r"layernorm_before", "norm1"),
    (r"layernorm_after", "norm2"),
    (r"lambda_1", "ls1"),
    (r"lambda_2", "ls2"),
    (r"patch_embeddings\.projection", "patch_embedding"),
    (r"position_embeddings", "position_embedding"),
    (r"cls_token", "class_embedding"),
    (r"\bencoder\.layer\.", "encoder.layers."),
    (r"attention\.q_proj", "attn.qkv_q"),
    (r"attention\.k_proj", "attn.qkv_k"),
    (r"attention\.v_proj", "attn.qkv_v"),
    (r"attention\.projection_layer", "attn.proj"),
    (r"multi_modal_projector\.layer_norm", "mlp1.0"),
    (r"multi_modal_projector\.linear_1", "mlp1.1"),
    (r"multi_modal_projector\.linear_2", "mlp1.3"),
]

QKV_PATTERN = re.compile(r"(.*)\.attn\.qkv_([qkv])\.(weight|bias)")

# ==== Merge QKV ====

def merge_qkv(state_dict):
    qkv_buffers = {}

    for key, tensor in state_dict.items():
        match = QKV_PATTERN.match(key)
        if match:
            base, part, suffix = match.groups()
            qkv_key = f"{base}.attn.qkv.{suffix}"
            if qkv_key not in qkv_buffers:
                qkv_buffers[qkv_key] = {}
            qkv_buffers[qkv_key][part] = tensor
        else:
            continue

    for merged_key, parts in qkv_buffers.items():
        if set(parts) == {"q", "k", "v"}:
            if parts["q"].shape[0] == parts["k"].shape[0] == parts["v"].shape[0]:
                merged_tensor = torch.cat([parts["q"], parts["k"], parts["v"]], dim=0)
            else:
                merged_tensor = torch.cat([parts["q"], parts["k"], parts["v"]], dim=1)
            state_dict[merged_key] = merged_tensor

    # Remove the split keys
    for key in list(state_dict.keys()):
        if QKV_PATTERN.match(key):
            del state_dict[key]

    return state_dict

# ==== Key Renaming ====

def rename_keys(state_dict):
    new_state = OrderedDict()

    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in CONVERTED_TO_ORIGINAL_KEY_MAPPING:
            new_key = re.sub(pattern, replacement, new_key)
        new_state[new_key] = value

    return new_state

# ==== Entry Point ====

def reverse_safetensors(input_path, output_path):
    print(f"Loading: {input_path}")
    state_dict = load_file(input_path)

    print("Renaming keys...")
    renamed = rename_keys(state_dict)

    print("Merging Q/K/V...")
    merged = merge_qkv(renamed)

    print(f"Saving to: {output_path}")
    save_file(merged, output_path)
    print("Done.")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Path to HF model.safetensors")
    parser.add_argument("--output", required=True, help="Path to output original-format .safetensors")
    args = parser.parse_args()

    reverse_safetensors(args.input, args.output)
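
For anyone who wants to try this, a usage sketch (the filename reverse_internvl_hf_to_original.py is just the name I saved the script under, and the paths are placeholders):

python reverse_internvl_hf_to_original.py \
    --input /path/to/InternVL3-1B-hf/model.safetensors \
    --output /path/to/InternVL3-1B-reversed/model.safetensors

Note that the script only rewrites a single .safetensors file. Larger checkpoints such as InternVL3-8B are typically sharded across several safetensors files, so you would need to run it per shard and regenerate the weight index, and it doesn't touch the config or tokenizer files, which you'd have to copy and adjust yourself.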
