Description
Dear InternVL3.5 Team,
Thanks for your great work. I am trying to use LMDeploy to deploy InternVL3.5-38B and feed it a video (as sampled frames) so the VLM can answer my question.
When I increase the number of video frames to 13, I encounter this error:
```
2025-09-29 20:22:46,084 - lmdeploy - INFO - async_engine.py:732 - session=7, history_tokens=0, input_tokens=43382, max_new_tokens=700, seq_start=True, seq_end=True, step=0, prep=True
2025-09-29 20:23:41,280 - lmdeploy - ERROR - async_engine.py:874 - session 7 finished, reason "error"
2025-09-29 20:23:41,282 - lmdeploy - INFO - request.py:297 - Receive END_SESSION Request: 1
```
And the command I use to deploy the VLM is as follows:
```
lmdeploy serve api_server OpenGVLab/InternVL3_5-38B-HF --server-port 23333 --tp 2 --backend pytorch --log-level INFO
```
The test code is as follows:
```python
from openai import OpenAI
from PIL import Image
from io import BytesIO
import base64
import os

from lmdeploy.vl.constants import IMAGE_TOKEN

LMDEPLOY_API_URL = "http://localhost:23333/v1"
ROOT_PATH = './'
MODEL_NAME = "OpenGVLab/InternVL3_5-38B-HF"


def local_image_to_base64(root_path, image_paths: list, target_size=(480, 360)) -> tuple:
    """Resize each image and encode it as a base64 JPEG data URL."""
    if len(image_paths) == 0:
        return [], None
    base64_list = []
    for path in image_paths:
        path = os.path.join(root_path, path)
        with Image.open(path).convert("RGB") as img:
            img.thumbnail(target_size)
            buffer = BytesIO()
            img.save(buffer, format="JPEG", quality=95)
            base64_str = base64.b64encode(buffer.getvalue()).decode("utf-8")
            size = (img.width, img.height)
        base64_list.append(f"data:image/jpeg;base64,{base64_str}")
    return base64_list, size


client = OpenAI(
    base_url=LMDEPLOY_API_URL,
    api_key="dummy_key"
)

input_images = [
    '001.jpg', '005.jpg', '010.jpg', '014.jpg', '020.jpg', '028.jpg', '034.jpg',
    '040.jpg', '043.jpg', '046.jpg', '050.jpg', '054.jpg', '060.jpg'
]
imgs_base64, _ = local_image_to_base64(ROOT_PATH, input_images)

# Build the prompt: one IMAGE_TOKEN placeholder per frame, then the question.
question = ''
for i in range(len(imgs_base64)):
    question = question + f'Frame{i+1}: {IMAGE_TOKEN}\n'
question += 'Describe the camera motion in detail. And find which frames contain a piano.'

content = [{'type': 'text', 'text': question}]
for img in imgs_base64:
    content.append(
        {
            "type": "image_url",
            "image_url": {'max_dynamic_patch': 1, "url": img}
        }
    )

message = [
    {
        'role': 'user',
        'content': content
    }
]

response = client.chat.completions.create(
    model=MODEL_NAME,
    messages=message,
    max_tokens=700,
    temperature=0.6,
    stream=False
)
print(response.choices[0].message.content.strip())
```
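For completeness, here is a small sketch of a sweep over the number of frames, reusing `local_image_to_base64`, `input_images` and `client` from the script above (the loop bounds and `max_tokens` are arbitrary choices). It should pin down exactly at which frame count the request starts to fail:

```python
# Sketch: reuse the helper, image list and client from the script above to
# find the frame count at which the request starts to fail.
for n_frames in range(10, len(input_images) + 1):
    frames_b64, _ = local_image_to_base64(ROOT_PATH, input_images[:n_frames])
    prompt = ''.join(f'Frame{i+1}: {IMAGE_TOKEN}\n' for i in range(n_frames))
    prompt += 'Describe the camera motion in detail.'
    body = [{'type': 'text', 'text': prompt}]
    body += [{'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': b64}}
             for b64 in frames_b64]
    try:
        resp = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[{'role': 'user', 'content': body}],
            max_tokens=64,
            temperature=0.6,
        )
        print(f'{n_frames} frames -> finish_reason={resp.choices[0].finish_reason}')
    except Exception as exc:
        print(f'{n_frames} frames -> request failed: {exc}')
```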
By the way, I find that the answer about which frames contain a piano is always wrong. It seems the model cannot ground a target to a specific frame:
The camera starts by focusing on a wooden door with glass panels, slowly moving forward to reveal an ornate room. As the camera progresses, it pans slightly to the right, showcasing blue upholstered furniture and paintings on the walls. The camera continues to move forward into another room, revealing more of the interior, including a long dining table set with chairs. The ceiling has visible damage, indicating possible neglect or disrepair. In the first frame, there is a piano on the right side of the doorway. The camera angle shifts to provide a broader view of the room as it moves deeper into the space.
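For comparison, a single-frame probe like the sketch below (same client and helper as above; the choice of '040.jpg' and the wording of the question are just examples) would show whether the model can find the piano at all when only one image is in the prompt:

```python
# Sketch: ask about one frame at a time to separate per-frame grounding
# from the multi-image prompt (the frame chosen here is arbitrary).
single_b64, _ = local_image_to_base64(ROOT_PATH, ['040.jpg'])
single_content = [
    {'type': 'text', 'text': f'Frame1: {IMAGE_TOKEN}\nIs there a piano in this frame?'},
    {'type': 'image_url', 'image_url': {'max_dynamic_patch': 1, 'url': single_b64[0]}},
]
resp = client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{'role': 'user', 'content': single_content}],
    max_tokens=128,
    temperature=0.6,
)
print(resp.choices[0].message.content.strip())
```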
I would like to know if this is normal. Have I overlooked anything or made any mistakes? Could you please help me with this? Thank you very much.