
LmDeploy InternVL3.5 reason error when input multiple images #1198

@zhongjiaru

Description


Dear InternVL3.5 Team,

Thanks for your great work. I am trying to use LMDeploy to deploy InternVL3.5-38B and feed it sampled video frames so that the VLM can answer questions about the video.
When I increase the number of video frames to 13, I encounter this error:
2025-09-29 20:22:46,084 - lmdeploy - INFO - async_engine.py:732 - session=7, history_tokens=0, input_tokens=43382, max_new_tokens=700, seq_start=True, seq_end=True, step=0, prep=True
2025-09-29 20:23:41,280 - lmdeploy - ERROR - async_engine.py:874 - session 7 finished, reason "error"
2025-09-29 20:23:41,282 - lmdeploy - INFO - request.py:297 - Receive END_SESSION Request: 1

And the command I use to deploy the VLM is as follows:
lmdeploy serve api_server OpenGVLab/InternVL3_5-38B-HF --server-port 23333 --tp 2 --backend pytorch --log-level INFO
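Given that the failing request has input_tokens=43382, I wonder whether the default session length is simply too small for 13 frames. If so (this is only a guess on my part, not confirmed), starting the server with an explicit --session-len might be worth testing:

```shell
# Same deployment command, but with the context window raised explicitly.
# 65536 is an arbitrary value chosen to comfortably fit the ~43k-token request.
lmdeploy serve api_server OpenGVLab/InternVL3_5-38B-HF \
    --server-port 23333 --tp 2 --backend pytorch --log-level INFO \
    --session-len 65536
```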

The test code is as follows:

from openai import OpenAI
from PIL import Image
from io import BytesIO
import base64
import os
import json
from lmdeploy.vl.constants import IMAGE_TOKEN

LMDEPLOY_API_URL = "http://localhost:23333/v1"
ROOT_PATH = './'
MODEL_NAME = "OpenGVLab/InternVL3_5-38B-HF"

def local_image_to_base64(root_path, image_paths: list, target_size=(480, 360)):
    """Encode local images as JPEG data URLs; returns (urls, size of last image)."""
    if len(image_paths) == 0:
        return [], None
    base64_list = []
    size = None
    for path in image_paths:
        path = os.path.join(root_path, path)
        # convert() returns a new image, so close the file handle inside the with-block
        with Image.open(path) as im:
            img = im.convert("RGB")
        img.thumbnail(target_size)  # downscale in place, preserving aspect ratio
        buffer = BytesIO()
        img.save(buffer, format="JPEG", quality=95)
        base64_str = base64.b64encode(buffer.getvalue()).decode("utf-8")
        size = (img.width, img.height)
        base64_list.append(f"data:image/jpeg;base64,{base64_str}")

    return base64_list, size


client = OpenAI(
    base_url=LMDEPLOY_API_URL,
    api_key="dummy_key",
)
input_images = [
    '001.jpg', '005.jpg', '010.jpg', '014.jpg', '020.jpg', '028.jpg', '034.jpg', '040.jpg', '043.jpg', '046.jpg', '050.jpg', '054.jpg', '060.jpg'
]

imgs_base64, _ = local_image_to_base64(ROOT_PATH, input_images)

question = ''
for i in range(len(imgs_base64)):
    question = question + f'Frame{i+1}: {IMAGE_TOKEN}\n'

question += 'Describe the camera motion in detail. And find which frames contain a piano.'
content = [{'type': 'text', 'text': question}]
for img in imgs_base64:
    content.append(
        {
            "type": "image_url",
            "image_url": {'max_dynamic_patch': 1, "url": img}
        }
    )
message = [
    {
        'role': 'user',
        'content': content
    }
]

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=message,
    max_tokens=700,
    temperature=0.6,
    stream=False
)
print(response.choices[0].message.content.strip())
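For completeness, the data-URL format the script relies on can be checked standalone with only the standard library (the JPEG bytes below are a stand-in, not a real image):

```python
import base64

def to_data_url(jpeg_bytes: bytes) -> str:
    # Wrap raw JPEG bytes in the data-URL form that OpenAI-style APIs accept.
    return "data:image/jpeg;base64," + base64.b64encode(jpeg_bytes).decode("utf-8")

fake_jpeg = b"\xff\xd8\xff\xe0" + b"\x00" * 16  # JPEG SOI/APP0 marker stub only
url = to_data_url(fake_jpeg)
print(url[:23])  # data:image/jpeg;base64,
```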

By the way, I find that the answer about which frames contain a piano is always wrong. It seems the model cannot ground a target in a specific frame:
The camera starts by focusing on a wooden door with glass panels, slowly moving forward to reveal an ornate room. As the camera progresses, it pans slightly to the right, showcasing blue upholstered furniture and paintings on the walls. The camera continues to move forward into another room, revealing more of the interior, including a long dining table set with chairs. The ceiling has visible damage, indicating possible neglect or disrepair. In the first frame, there is a piano on the right side of the doorway. The camera angle shifts to provide a broader view of the room as it moves deeper into the space.

I would like to know if this is normal. Have I overlooked anything or made any mistakes? Could you please help me with this? Thank you very much.

images.zip
