
Why can't vLLM start with InternVL3-8B after it has been converted to the Hugging Face format? It shows the error: `ValueError: 'limit_mm_per_prompt' is only supported for multimodal models.` #38000

FloSophorae opened this issue May 7, 2025 · 9 comments

@FloSophorae

System Info

vllm 0.8.5.post1
transformers 4.52.0.dev0

Who can help?

@amyeroberts
@qubvel

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

CUDA_VISIBLE_DEVICES=0,1 vllm serve $MODEL_PATH \
    --tensor-parallel-size 2 \
    --port $MODEL_PROT \
    --host 0.0.0.0 \
    --dtype float16 \
    --max-model-len 65536 \
    --limit-mm-per-prompt image=30,video=0 \
    --enable-prefix-caching \
    --gpu-memory-utilization 0.6 \
    --block-size 16 > "$VLLM_LOG"

Expected behavior

I downloaded the model from OpenGVLab/InternVL3-8B, which natively supports running OpenAI-style chat completions with vLLM. However, after converting it to the Hugging Face format using the script transformers/src/transformers/models/internvl/convert_internvl_weights_to_hf.py, launching vLLM resulted in the error:

ValueError: 'limit_mm_per_prompt' is only supported for multimodal models.

The command I used to launch vLLM is the same one shown in the Reproduction section above.

vLLM runs correctly when I set MODEL_PATH to the original OpenGVLab/InternVL3-8B checkpoint, but it throws the error above when I change the path to the converted InternVL3-8B-hf directory: ValueError: 'limit_mm_per_prompt' is only supported for multimodal models.

Could someone explain why this is happening and suggest solutions?
Thank you very much!

@FloSophorae FloSophorae added the bug label May 7, 2025
@zucchini-nlp
Member

Hm, I don't think the converted weights are compatible with vLLM yet. Usually vLLM implements multimodal models itself and loads the official checkpoints. Support for Transformers-style multimodal models is already planned, and we'll try to add it soon.

@FloSophorae
Author

Hello, thank you for your response. Could you please advise if there’s any way to convert a Hugging Face (HF) format model back to its original format? My training task only generated the model in HF format. Looking forward to your reply!

@zucchini-nlp
Member

Ah, you are using a custom tuned model. I don't think we have an easy way to do the reverse conversion, unless you write your own conversion script.

Though I will cc @hmellor as well, as I'm not sure if there's a way to run converted models in vLLM without using the Transformers backend.

@hmellor
Member

hmellor commented May 7, 2025

I can't find any mention of needing to convert this model for it to be compatible with vLLM. You should be able to use it directly.

I think the problem is that this is a custom model and you're not using --trust-remote-code.

This flag is necessary to get the custom processor from the Hub repo. I've found a doc showing that InternVL is an example of a model with a custom processor: https://docs.vllm.ai/en/latest/contributing/model/multimodal.html#custom-hf-processor.
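
For example, adding the flag to the launch command from the issue would look something like this (a sketch, untested; $MODEL_PATH, $MODEL_PROT and $VLLM_LOG are the placeholders from the original report):

CUDA_VISIBLE_DEVICES=0,1 vllm serve $MODEL_PATH \
    --tensor-parallel-size 2 \
    --port $MODEL_PROT \
    --host 0.0.0.0 \
    --dtype float16 \
    --max-model-len 65536 \
    --limit-mm-per-prompt image=30,video=0 \
    --enable-prefix-caching \
    --trust-remote-code \
    --gpu-memory-utilization 0.6 \
    --block-size 16 > "$VLLM_LOG"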

@yonigozlan
Member

Btw, not sure if this helps, but there's already a converted checkpoint on the Hub for the native Transformers InternVL3-8B: https://huggingface.co/OpenGVLab/InternVL3-8B-hf

@hmellor
Member

hmellor commented May 9, 2025

In vLLM CI we use https://hf.co/OpenGVLab/InternVL2-1B, so I think unconverted is ok

@hrdxwandg

> In vLLM CI we use https://hf.co/OpenGVLab/InternVL2-1B, so I think unconverted is ok

I use the following command to start vLLM, and it starts successfully.

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /xxx/InternVL3/InternVL3-78B \
    --tensor-parallel-size 4 \
    --port 8092 \
    --host 0.0.0.0 \
    --dtype float16 \
    --max-model-len 65536 \
    --limit-mm-per-prompt "video=5" \
    --enable-prefix-caching \
    --trust-remote-code \
    --gpu-memory-utilization 0.9 \
    --block-size 16

But when I send a request, I get an error; it seems the --limit-mm-per-prompt param does not take effect. How can I fix it?

ERROR 05-15 17:04:34 [serving_chat.py:200] Error in preprocessing prompt inputs
ERROR 05-15 17:04:34 [serving_chat.py:200] Traceback (most recent call last):
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/openai/serving_chat.py", line 183, in create_chat_completion
ERROR 05-15 17:04:34 [serving_chat.py:200]     ) = await self._preprocess_chat(
ERROR 05-15 17:04:34 [serving_chat.py:200]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/openai/serving_engine.py", line 403, in _preprocess_chat
ERROR 05-15 17:04:34 [serving_chat.py:200]     conversation, mm_data_future = parse_chat_messages_futures(
ERROR 05-15 17:04:34 [serving_chat.py:200]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 1165, in parse_chat_messages_futures
ERROR 05-15 17:04:34 [serving_chat.py:200]     sub_messages = _parse_chat_message_content(
ERROR 05-15 17:04:34 [serving_chat.py:200]                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 1089, in _parse_chat_message_content
ERROR 05-15 17:04:34 [serving_chat.py:200]     result = _parse_chat_message_content_parts(
ERROR 05-15 17:04:34 [serving_chat.py:200]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 989, in _parse_chat_message_content_parts
ERROR 05-15 17:04:34 [serving_chat.py:200]     parse_res = _parse_chat_message_content_part(
ERROR 05-15 17:04:34 [serving_chat.py:200]                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 1064, in _parse_chat_message_content_part
ERROR 05-15 17:04:34 [serving_chat.py:200]     mm_parser.parse_video(str_content)
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 784, in parse_video
ERROR 05-15 17:04:34 [serving_chat.py:200]     placeholder = self._tracker.add("video", video)
ERROR 05-15 17:04:34 [serving_chat.py:200]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-15 17:04:34 [serving_chat.py:200]   File "xxx/python3.11/site-packages/vllm/entrypoints/chat_utils.py", line 568, in add
ERROR 05-15 17:04:34 [serving_chat.py:200]     raise ValueError(
ERROR 05-15 17:04:34 [serving_chat.py:200] ValueError: At most 0 video(s) may be provided in one request. You can set `--limit-mm-per-prompt` to increase this limit if the model supports it.
INFO:     172.16.8.240:43962 - "POST /v1/chat/completions HTTP/1.1" 400 Bad Request

@hmellor
Member

hmellor commented May 15, 2025

This is a vLLM problem, not a Transformers problem. Let's discuss in vLLM.

@yoyouC

yoyouC commented May 19, 2025

@FloSophorae Don't know if this helps, but I've managed to convert an InternVL3-1B-hf model back to the original OpenGVLab/InternVL3-1B format via the following script:

import re
import torch
from safetensors.torch import load_file, save_file
from collections import OrderedDict

# ==== Inverse Mappings ====

CONVERTED_TO_ORIGINAL_KEY_MAPPING = [
    (r"vision_tower", "vision_model"),
    (r"layernorm_before", "norm1"),
    (r"layernorm_after", "norm2"),
    (r"lambda_1", "ls1"),
    (r"lambda_2", "ls2"),
    (r"patch_embeddings\.projection", "patch_embedding"),
    (r"position_embeddings", "position_embedding"),
    (r"cls_token", "class_embedding"),
    (r"\bencoder\.layer\.", "encoder.layers."),
    (r"attention\.q_proj", "attn.qkv_q"),
    (r"attention\.k_proj", "attn.qkv_k"),
    (r"attention\.v_proj", "attn.qkv_v"),
    (r"attention\.projection_layer", "attn.proj"),
    (r"multi_modal_projector\.layer_norm", "mlp1.0"),
    (r"multi_modal_projector\.linear_1", "mlp1.1"),
    (r"multi_modal_projector\.linear_2", "mlp1.3"),
]

QKV_PATTERN = re.compile(r"(.*)\.attn\.qkv_([qkv])\.(weight|bias)")

# ==== Merge QKV ====

def merge_qkv(state_dict):
    qkv_buffers = {}

    for key, tensor in state_dict.items():
        match = QKV_PATTERN.match(key)
        if match:
            base, part, suffix = match.groups()
            qkv_key = f"{base}.attn.qkv.{suffix}"
            if qkv_key not in qkv_buffers:
                qkv_buffers[qkv_key] = {}
            qkv_buffers[qkv_key][part] = tensor
        else:
            continue

    for merged_key, parts in qkv_buffers.items():
        if set(parts) == {"q", "k", "v"}:
            if parts["q"].shape[0] == parts["k"].shape[0] == parts["v"].shape[0]:
                merged_tensor = torch.cat([parts["q"], parts["k"], parts["v"]], dim=0)
            else:
                merged_tensor = torch.cat([parts["q"], parts["k"], parts["v"]], dim=1)
            state_dict[merged_key] = merged_tensor

    # Remove the split keys
    for key in list(state_dict.keys()):
        if QKV_PATTERN.match(key):
            del state_dict[key]

    return state_dict

# ==== Key Renaming ====

def rename_keys(state_dict):
    new_state = OrderedDict()

    for key, value in state_dict.items():
        new_key = key
        for pattern, replacement in CONVERTED_TO_ORIGINAL_KEY_MAPPING:
            new_key = re.sub(pattern, replacement, new_key)
        new_state[new_key] = value

    return new_state

# ==== Entry Point ====

def reverse_safetensors(input_path, output_path):
    print(f"Loading: {input_path}")
    state_dict = load_file(input_path)

    print("Renaming keys...")
    renamed = rename_keys(state_dict)

    print("Merging Q/K/V...")
    merged = merge_qkv(renamed)

    print(f"Saving to: {output_path}")
    save_file(merged, output_path)
    print("Done.")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--input", required=True, help="Path to HF model.safetensors")
    parser.add_argument("--output", required=True, help="Path to output original-format .safetensors")
    args = parser.parse_args()

    reverse_safetensors(args.input, args.output)
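
For anyone who wants to try this, a usage sketch (the filename reverse_internvl_hf_to_original.py is just the name I saved the script under, and the paths are placeholders):

python reverse_internvl_hf_to_original.py \
    --input /path/to/InternVL3-1B-hf/model.safetensors \
    --output /path/to/InternVL3-1B-reversed/model.safetensors

Note that the script only rewrites a single .safetensors file. Larger checkpoints such as InternVL3-8B are typically sharded across several safetensors files, so you would need to run it per shard and regenerate the weight index, and it doesn't touch the config or tokenizer files, which you'd have to copy and adjust yourself.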
