Does LMDeploy support batch inference? #2347
Comments
How are you using it? Did you start a server and then call it through the OpenAI-compatible interface?
I initially followed here and started the service on 4 A100s, then called it through the OpenAI interface. I found that even when multiple requests arrive almost simultaneously, the server still finishes one request before processing the next.

model_path='/path/to/InternVL2-Llama3-76B'
MODEL_NAME='InternVL2-Llama3-76B'
PORT=23334
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmdeploy serve api_server ${model_path} \
    --model-name ${MODEL_NAME} \
    --backend turbomind \
    --enable-prefix-caching \
    --vision-max-batch-size 128 \
    --log-level INFO \
    --server-port ${PORT} \
    --tp 8 \
    --chat-template /path/to/lmdeploy_internvl2_chat_template.json

On the client side (local machine), the following test script is used as a calling example (multiple processes call the service at the same time); the code is adapted from here:

from openai import OpenAI
import multiprocessing as mp
import time
def call_api(client, model_name, rank):
    # Each worker sends 10 sequential chat completion requests, each with one image.
    for i in range(10):
        t = time.time()
        response = client.chat.completions.create(
            model=model_name,
            messages=[{
                'role': 'user',
                'content': [{
                    'type': 'text',
                    'text': 'describe this image',
                }, {
                    'type': 'image_url',
                    'image_url': {
                        'url':
                        'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
                    },
                }],
            }],
            temperature=0.8,
            top_p=0.8)
        tt = time.time() - t
        print("rank", rank, 'time', tt, 'turn', i)
if __name__ == "__main__":
    client = OpenAI(
        api_key='YOUR_API_KEY',
        base_url="http://172.16.78.10:23334/v1"
    )
    model_name = client.models.list().data[0].id  # InternVL2-Llama3-76B
    processes = []
    tt = time.time()
    # Launch 40 processes that call the service concurrently.
    for i in range(40):
        p = mp.Process(target=call_api, args=(client, model_name, i))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    ttt = time.time() - tt
    print("Total Duration", ttt)

In the server log, the ImageEncoder forward batch size is greater than 1 in only 22 calls (16 with batch size 2, 2 with batch size 3, 4 with batch size 4), which is only about 5% of all calls.
Yes, my colleague described the exact usage above. Thank you very much.
The vision part and the LLM part have to be looked at separately. The vision encoder is not concurrent by default (it processes one image at a time, but if an image is split into 12 patches during preprocessing, those patches are in fact batched). The LLM part batches automatically: as long as earlier requests are still being processed when new requests arrive, they are grouped into the same batch. To check this, set log_level to INFO and look for log lines like the one below, which indicates that the current batch holds 8 requests, 7 of them in the decode phase and 1 in the prefill phase.
[server log screenshot]
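For reference, the LLM-side batching can also be exercised offline through LMDeploy's pipeline API by passing a list of prompts in one call. The snippet below is a minimal sketch, not taken from this thread; the model path, tp value, number of prompts, and the reuse of the tiger image are assumptions for illustration.

from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

# Assumed local path and tensor-parallel size, mirroring the server command above.
pipe = pipeline('/path/to/InternVL2-Llama3-76B',
                backend_config=TurbomindEngineConfig(tp=8))

image = load_image(
    'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg')

# Passing a list of (prompt, image) pairs in a single call lets the engine
# schedule the requests together instead of strictly one after another.
prompts = [('describe this image', image) for _ in range(8)]
responses = pipe(prompts)
for r in responses:
    print(r.text)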
OK, thanks. I'll take a look.
Thanks, I understand now.
Hello, I deployed the InternVL2-72B model with LMDeploy and want to achieve batch inference. In practice, however, InternVL still finishes one request before processing the next, so true batch inference is not happening, and batch inference matters a lot for my use case. Thank you very much.