Does LMDeploy support batch inference? #2347

leoozy · 2024-08-20T18:26:19Z

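Hello, I am deploying the internvl2-72b model with lmdeploy and want to run batch inference. In practice, however, internvl still finishes one request before starting the next, so there is no real batch inference. Batch inference is very important for us. Thank you very much.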

irexyc · 2024-08-21T05:26:32Z

How are you using it? Did you start the service and then call it through the OpenAI interface?

zzjchen · 2024-08-21T05:55:00Z

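I first followed the instructions here to start the service on 4 A100s and called it through the OpenAI interface. I found that even when multiple requests arrive almost simultaneously, the server finishes one request before handling the next.
Then I saw issue 2327 and changed --vision-max-batch-size to 128, but I still could not see multiple requests being processed together as one batch (with --log-level INFO, the ImageEncoder forward batch size is still almost always 1).
Next I tried starting the service on 8 A100s with the following command (the earlier commands were essentially the same):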

model_path='/path/to/InternVL2-Llama3-76B'
MODEL_NAME='InternVL2-Llama3-76B'
PORT=23334
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 lmdeploy serve api_server ${model_path} --model-name ${MODEL_NAME} --backend turbomind --enable-prefix-caching --vision-max-batch-size 128 --log-level INFO --server-port ${PORT} --tp 8 --chat-template /path/to/lmdeploy_internvl2_chat_template.json

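On the client side (local), I use the following test script as an example of the calls (multiple processes call the service at the same time). The code is adapted from here: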

from openai import OpenAI
import multiprocessing as mp
import time

BASE_URL = "http://172.16.78.10:23334/v1"


def call_api(model_name, rank):
    # Create the client inside the worker so that the processes do not
    # share one HTTP connection pool across fork/spawn.
    client = OpenAI(api_key='YOUR_API_KEY', base_url=BASE_URL)
    for i in range(10):
        start = time.time()
        response = client.chat.completions.create(
            model=model_name,
            messages=[{
                'role': 'user',
                'content': [{
                    'type': 'text',
                    'text': 'describe this image',
                }, {
                    'type': 'image_url',
                    'image_url': {
                        'url':
                        'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
                    },
                }],
            }],
            temperature=0.8,
            top_p=0.8)
        elapsed = time.time() - start
        print("rank", rank, 'time', elapsed, 'turn', i)


if __name__ == "__main__":
    client = OpenAI(api_key='YOUR_API_KEY', base_url=BASE_URL)
    model_name = client.models.list().data[0].id  # InternVL2-Llama3-76B
    processes = []
    total_start = time.time()
    # 40 client processes, each sending 10 requests one after another.
    for rank in range(40):
        p = mp.Process(target=call_api, args=(model_name, rank))
        processes.append(p)
        p.start()
    for p in processes:
        p.join()
    print("Total Duration", time.time() - total_start)

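In the server log, the ImageEncoder forward batch size was greater than 1 only 22 times (16 times it was 2, 2 times it was 3, and 4 times it was 4), which is only about 5% of the calls.
It looks like the server is batching automatically? But most of the time it does not wait to see whether a batch can form and just sends a single image through on its own?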

leoozy · 2024-08-21T05:57:01Z

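> How are you using it? Did you start the service and then call it through the OpenAI interface?

Yes. My classmate described exactly how we use it in the comment above. Thank you very much.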

irexyc · 2024-08-21T06:08:48Z

@zzjchen @leoozy

The vision part and the LLM part need to be looked at separately.

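By default the vision part does not process requests concurrently: it handles one image at a time (although if an image is split into 12 patches during preprocessing, those patches are in effect processed as a batch). If vision-max-batch-size is set to a value greater than 1, the vision part may process several images together. However, the current strategy does not wait for a short period after receiving a request, so in practice vision batches rarely form. Vision batching also brings little benefit under the PyTorch engine: for larger models the time is roughly linear in the batch size, and a very large batch severely increases the time to the first output token.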
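
The effect of a missing wait window can be illustrated with a small simulation. This is a hypothetical sketch, not LMDeploy's scheduler: the function `simulate_vision_batching`, its parameters, and the arrival pattern are all assumptions made for illustration, and the cost of a batch is modeled as strictly linear in its size.

```python
def simulate_vision_batching(arrivals_ms, per_image_ms, max_batch, wait_ms=0):
    """Simulate one vision worker; return (batch sizes, latency of oldest request per batch).

    Hypothetical model, not LMDeploy code: a batch costs per_image_ms * size,
    and the worker launches as soon as it is free and the oldest queued
    request has waited at least wait_ms.
    """
    batch_sizes, latencies_ms = [], []
    free_at = 0  # time at which the worker finishes its current batch
    i = 0        # index of the oldest request not yet processed
    while i < len(arrivals_ms):
        start = max(free_at, arrivals_ms[i] + wait_ms)
        # Take every request that has arrived by `start`, up to max_batch.
        j = i
        while (j < len(arrivals_ms) and j - i < max_batch
               and arrivals_ms[j] <= start):
            j += 1
        size = j - i
        free_at = start + per_image_ms * size
        batch_sizes.append(size)
        # Latency seen by the oldest request in this batch.
        latencies_ms.append(free_at - arrivals_ms[i])
        i = j
    return batch_sizes, latencies_ms


arrivals = [30 * k for k in range(12)]  # one request every 30 ms (assumed)
print(simulate_vision_batching(arrivals, per_image_ms=20, max_batch=128))
print(simulate_vision_batching(arrivals, per_image_ms=20, max_batch=128, wait_ms=100))
```

With these assumed numbers, the no-wait policy forms twelve batches of size 1 at 20 ms each, while a 100 ms wait forms three batches of 4 but raises the latency of the oldest request in each batch to 180 ms. That matches the trade-off described above: larger vision batches come at the cost of a later first token.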

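The LLM part batches automatically: whenever a new request arrives while earlier requests are still being processed, they are batched together. To see this, set log_level to INFO and look for log lines like the one below, which shows that the batch currently holds 8 requests, 7 of them in the decode stage and 1 in the prefill stage.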

[TM][INFO] [Forward] [0, 8), dc=7, pf=1, ...
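
To track LLM batch sizes over a whole run, lines like this can be pulled out of the server log with a regular expression. The sketch below is based only on the single example line quoted above: the pattern, the helper name `parse_forward_line`, and the reading of `[0, 8)` as a half-open range whose length is the batch size are assumptions, and the fields elided after `pf=` are ignored.

```python
import re

# Pattern inferred from the one example line in this thread (an assumption,
# not a documented log format).
FORWARD_RE = re.compile(r"\[Forward\] \[(\d+), (\d+)\), dc=(\d+), pf=(\d+)")


def parse_forward_line(line):
    """Return (batch_size, decode_count, prefill_count), or None if no match."""
    m = FORWARD_RE.search(line)
    if m is None:
        return None
    lo, hi, dc, pf = map(int, m.groups())
    # Assumes [lo, hi) is a half-open range of request slots in the batch.
    return hi - lo, dc, pf


print(parse_forward_line("[TM][INFO] [Forward] [0, 8), dc=7, pf=1, ..."))
```

For the example line this returns (8, 7, 1), that is, 8 requests in the batch with 7 decoding and 1 in prefill.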

zzjchen · 2024-08-21T06:28:00Z

OK, thanks. I'll take a look.

zzjchen · 2024-08-29T08:34:09Z

Thanks, I understand now.

lvhan028 assigned irexyc Aug 21, 2024

lvhan028 added the awaiting response label Aug 28, 2024

zzjchen mentioned this issue Aug 29, 2024

(Continuous) Batch Serving Internvl2 using lmdeploy OpenGVLab/InternVL#523

Closed

lvhan028 closed this as completed Sep 4, 2024
