Balance vision model weights on multi gpus #1591

irexyc · 2024-05-14T07:41:30Z

TODO

hangs issue when using nccl(turbomind).
docs & cli

lmdeploy/messages.py

lvhan028 · 2024-05-16T07:50:39Z

src/turbomind/triton_backend/llama/LlamaTritonModel.cc

@@ -279,7 +279,7 @@ std::unique_ptr<LlamaTritonSharedModelInstance<T>> LlamaTritonModel<T>::createSh

    /// TODO: this stream handle is leaked
    cudaStream_t stream{};
-    ft::check_cuda_error(cudaStreamCreate(&stream));
+    ft::check_cuda_error(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));


Non blocking, why?

不是 non blocking 的话同步时会和default stream 同步。怀疑 nccl 和 python 那边的 default stream 同步的时候会有冲突，然后卡主。

lmdeploy/cli/serve.py

docs/en/inference/vl_pipeline.md

lmdeploy/vl/model/yi.py

lmdeploy/messages.py

AllentDan · 2024-05-21T03:17:17Z

lmdeploy/turbomind/turbomind.py

-@contextmanager
-def cuda_ctx(device_id):
-    old_device = torch.cuda.current_device()
-    torch.cuda.set_device(device_id)
-    yield
-    torch.cuda.set_device(old_device)


Just out of curiosity, why add and remove it?

pybind的那几个函数在c++那边都有做cudaSetDevice，所以感觉不需要了。
之前的话，forward thread c++ 那边没做cudaSetDevice，不过有没有cuda_ctx 我这里都能正常跑。

AllentDan · 2024-05-21T03:55:27Z

lmdeploy/vl/engine.py

        self.model = load_vl_model(model_path)
-        self.max_batch_size = max_batch_size
+        self.max_batch_size = (1 if vicion_config is None else


Why change the default value from 16 to 1 here?

因为有issue提到，服务可以启动，但是多个请求过来显存会超。另外有一些vision模型的显存对batch比较敏感

buaadf · 2024-05-21T09:19:19Z

@lzhangzz waiting for your review...

rTrQqgH74lc2PT5k · 2024-05-21T13:23:27Z

非常期待这个功能

AllentDan

现在的 tp 相当于只要 CUDA_VISIBLE_DEVICES 可访问的 GPU 都会用吗？即使指定了 tp==2, 也会用四卡，如果四卡均可访问

irexyc · 2024-05-22T08:33:32Z

现在的 tp 相当于只要 CUDA_VISIBLE_DEVICES 可访问的 GPU 都会用吗？即使指定了 tp==2, 也会用四卡，如果四卡均可访问

是的。

lmdeploy/vl/engine.py

RunningLeon

tested OK with vl pipeline on tp=2 for these available models:

llava-v1.57b
llava-v1.6-vicuna-7b
llava-v1.6-34b
deepseek-vl-7b-chat
Qwen-VL-Chat
Yi-VL-6B
Mini-Gemini-7B

RunningLeon

LGTM

lvhan028 · 2024-05-23T03:51:35Z

runtime.txt 中要明确下 accelerate的最低版本

src/turbomind/triton_backend/llama/LlamaTritonModel.cc

lzhangzz

LGTM

covdvoyager · 2024-05-23T10:28:32Z

大佬，请问在执行python文件时 File "D:\新建文件夹\InternDog-master\app_cli.py", line 3, in
from agent.model import chat
File "D:\新建文件夹\InternDog-master\agent\model.py", line 2, in
from lmdeploy import turbomind as tm
File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind_init_.py", line 24, in
from .turbomind import TurboMind # noqa: E402
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind\turbomind.py", line 26, in
from .deploy.converter import (get_model_format, supported_formats,
File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind\deploy\converter.py", line 16, in
from .target_model.base import OUTPUT_MODELS, TurbomindModelConfig
File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind\deploy\target_model_init_.py", line 3, in
from .w4 import TurbomindW4Model # noqa: F401
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\86186\anaconda3\envs\pythonProject2\Lib\site-packages\lmdeploy\turbomind\deploy\target_model\w4.py", line 17, in
import _turbomind as _tm # noqa: E402
^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: DLL load failed while importing _turbomind: 找不到指定的模块。这样报错是有甚麽问题吗

irexyc · 2024-05-23T11:39:13Z

@covdvoyager

可以看下这个是否能帮到你。
#1146 (comment)

buaadf · 2024-05-24T01:22:23Z

@irexyc 请问多卡并行必须要2的幂次张卡吗，我这里用3张A30跑不起来

irexyc · 2024-05-24T01:25:02Z

@buaadf

backend_config 里面的 tp 需要 2的幂次。

buaadf · 2024-05-24T01:37:01Z

@buaadf

backend_config 里面的 tp 需要 2的幂次。

请问 tp的设置和卡数有什么关系吗，A30（24G）至少需要几张才能跑起来呀？

irexyc · 2024-05-24T01:43:26Z

@buaadf

LM 模型切分 tp 需要是2的幂次。tp=2就是说LM需要两块卡，会从可见的卡里面选择0,1号卡。

如果你跑的是VLM模型，backend_config tp设置2，CUDA_VISIBLE_DEVICES=“0,1,2”，那么 vision 模型会均分到三块卡上，LM模型会均分到前两块卡上。

能不能跑看你跑的是什么模型。就权重来说（不量化），7b的模型，大概需要14G的显存。20b的模型需要40G的显存。除了模型的显存外，kv cache 也需要显存，会影响 session_len 以及 batch 的大小。可以通过 cache_max_entry_count 来控制大小。

ysyx2008 · 2024-05-27T03:25:20Z

tp=2的情况下，双4090卡仍然无法运行int8版本的InternVL（25G权重文件），显存占用会爆掉。
求赐教。

(internvl) yushen@YuShen-Work:~/ai/InternVL$ python gradio_InternVL.py
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Exception in thread Thread-3 (_create_weight_func):
Traceback (most recent call last):
File "/home/yushen/micromamba/envs/internvl/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/yushen/micromamba/envs/internvl/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/yushen/micromamba/envs/internvl/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 199, in _create_weight_func
model_comm.create_shared_weights(device_id, rank)
RuntimeError: [TM][ERROR] CUDA runtime error: out of memory /lmdeploy/src/turbomind/utils/memory_utils.cu:32

Exception in thread Thread-5 (_get_params):
Traceback (most recent call last):
File "/home/yushen/micromamba/envs/internvl/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/yushen/micromamba/envs/internvl/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/yushen/micromamba/envs/internvl/lib/python3.10/site-packages/lmdeploy/turbomind/turbomind.py", line 215, in _get_params
out = model_comm.get_params(device_id, rank)
RuntimeError: [TM][ERROR] Assertion fail: /lmdeploy/src/turbomind/triton_backend/llama/LlamaTritonModel.cc:417

irexyc · 2024-05-27T03:33:29Z

@ysyx2008

我们不支持加载bnb形式的int8模型。请用我们的量化工具进行量化。这个工具应该在0.4.2中可以使用

这个是针对 LLM 的量化文档，VLM 也是通用的，把DEMO中的模型换成VLM模型就好。

近期也会有一篇针对 VLM量化的文章发布，可以关注一下。

ysyx2008 · 2024-05-27T05:19:15Z

@ysyx2008

我们不支持加载bnb形式的int8模型。请用我们的量化工具进行量化。这个工具应该在0.4.2中可以使用

这个是针对 LLM 的量化文档，VLM 也是通用的，把DEMO中的模型换成VLM模型就好。

近期也会有一篇针对 VLM量化的文章发布，可以关注一下。

非常感谢，之前自行量化报错，刚发现pip默认安装的是0.4.1版本，我再去研究更新到0.4.2再试一次。再次感谢。

sshuair · 2024-05-27T06:51:34Z

@irexyc error with internlm/internlm-xcomposer2-4khd-7b model

Dummy Resized
Traceback (most recent call last):
  File "/opt/py38/bin/lmdeploy", line 11, in <module>
    load_entry_point('lmdeploy', 'console_scripts', 'lmdeploy')()
  File "/opt/lmdeploy/lmdeploy/cli/entrypoint.py", line 37, in run
    args.run(args)
  File "/opt/lmdeploy/lmdeploy/cli/serve.py", line 303, in api_server
    run_api_server(args.model_path,
  File "/opt/lmdeploy/lmdeploy/serve/openai/api_server.py", line 1191, in serve
    VariableInterface.async_engine = pipeline_class(
  File "/opt/lmdeploy/lmdeploy/serve/vl_async_engine.py", line 20, in __init__
    self.vl_encoder = ImageEncoder(model_path, vision_config)
  File "/opt/lmdeploy/lmdeploy/vl/engine.py", line 69, in __init__
    self.model = load_vl_model(model_path)
  File "/opt/lmdeploy/lmdeploy/vl/model/builder.py", line 40, in load_vl_model
    return Xcomposer2VisionModel(model_path, with_llm)
  File "/opt/lmdeploy/lmdeploy/vl/model/xcomposer2.py", line 42, in __init__
    self.build_model()
  File "/opt/lmdeploy/lmdeploy/vl/model/xcomposer2.py", line 76, in build_model
    max_memory = get_balanced_memory(
UnboundLocalError: local variable 'get_balanced_memory' referenced before assignment

irexyc · 2024-05-27T07:14:34Z

@sshuair

shoud be fixed in #1661

serser · 2024-06-24T13:59:42Z

curious to know, is VLM pipeline support persistent batching? @irexyc

Pass-O-Guava · 2024-07-23T08:32:36Z

vision均分是tp还是pp? @irexyc

irexyc added 5 commits May 11, 2024 10:46

remove cuda_ctx as turbomind will set device itself

b6b137a

balanced vision model

8650e6d

fix mini_gemeni

247ae54

fix hanging issue

0e6febe

update docs & cli

f096268

irexyc changed the title ~~[WIP] Balance vision model weights on multi gpus~~ Balance vision model weights on multi gpus May 14, 2024

fix internvl template

9b4085f

lvhan028 added the improvement label May 16, 2024

lvhan028 requested review from AllentDan, RunningLeon and lzhangzz May 16, 2024 03:02

lvhan028 reviewed May 16, 2024

View reviewed changes

lmdeploy/messages.py Outdated Show resolved Hide resolved

lvhan028 reviewed May 16, 2024

View reviewed changes

lmdeploy/messages.py Outdated Show resolved Hide resolved

lvhan028 reviewed May 16, 2024

View reviewed changes

lmdeploy/cli/serve.py Outdated Show resolved Hide resolved

lvhan028 reviewed May 16, 2024

View reviewed changes

lmdeploy/cli/serve.py Outdated Show resolved Hide resolved

irexyc added 2 commits May 16, 2024 09:59

remove unused args

62d02d7

update xcomposer2

ff0e7ac

RunningLeon reviewed May 20, 2024

View reviewed changes

docs/en/inference/vl_pipeline.md Show resolved Hide resolved

RunningLeon reviewed May 21, 2024

View reviewed changes

lmdeploy/vl/model/yi.py Outdated Show resolved Hide resolved

RunningLeon reviewed May 21, 2024

View reviewed changes

lmdeploy/messages.py Show resolved Hide resolved

AllentDan reviewed May 21, 2024

View reviewed changes

irexyc mentioned this pull request May 21, 2024

[Bug] 多卡部署InternVL-Chat-V1-5时，在显存足够的情况下也会OutOfMemory。 #1555

Closed

2 tasks

RunningLeon mentioned this pull request May 22, 2024

[Feature]: Support cogvlm-chat #1502

Merged

6 tasks

fix yi-vl

468c9b4

AllentDan reviewed May 22, 2024

View reviewed changes

AllentDan approved these changes May 22, 2024

View reviewed changes

lvhan028 reviewed May 22, 2024

View reviewed changes

lmdeploy/vl/engine.py Outdated Show resolved Hide resolved

RunningLeon reviewed May 23, 2024

View reviewed changes

RunningLeon approved these changes May 23, 2024

View reviewed changes

RunningLeon mentioned this pull request May 23, 2024

[Feature]: Support llava for pytorch engine #1641

Merged

lzhangzz reviewed May 23, 2024

View reviewed changes

src/turbomind/triton_backend/llama/LlamaTritonModel.cc Outdated Show resolved Hide resolved

lzhangzz approved these changes May 23, 2024

View reviewed changes

irexyc added 3 commits May 23, 2024 07:35

fix comments

aa6e5a5

update requirements

9ac8abb

fix lint

9ede071

lvhan028 approved these changes May 23, 2024

View reviewed changes

lvhan028 merged commit 2f28531 into InternLM:main May 23, 2024
5 of 9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Balance vision model weights on multi gpus #1591

Balance vision model weights on multi gpus #1591

irexyc commented May 14, 2024 •

edited

Loading

lvhan028 May 16, 2024

irexyc May 22, 2024

AllentDan May 21, 2024

irexyc May 22, 2024

AllentDan May 21, 2024

irexyc May 22, 2024

buaadf commented May 21, 2024

rTrQqgH74lc2PT5k commented May 21, 2024

AllentDan left a comment

irexyc commented May 22, 2024

RunningLeon left a comment

RunningLeon left a comment

lvhan028 commented May 23, 2024 •

edited

Loading

lzhangzz left a comment

covdvoyager commented May 23, 2024

irexyc commented May 23, 2024

buaadf commented May 24, 2024

irexyc commented May 24, 2024

buaadf commented May 24, 2024

irexyc commented May 24, 2024

ysyx2008 commented May 27, 2024

irexyc commented May 27, 2024

ysyx2008 commented May 27, 2024

sshuair commented May 27, 2024

irexyc commented May 27, 2024

serser commented Jun 24, 2024

Pass-O-Guava commented Jul 23, 2024

Balance vision model weights on multi gpus #1591

Balance vision model weights on multi gpus #1591

Conversation

irexyc commented May 14, 2024 • edited Loading

TODO

lvhan028 May 16, 2024

Choose a reason for hiding this comment

irexyc May 22, 2024

Choose a reason for hiding this comment

AllentDan May 21, 2024

Choose a reason for hiding this comment

irexyc May 22, 2024

Choose a reason for hiding this comment

AllentDan May 21, 2024

Choose a reason for hiding this comment

irexyc May 22, 2024

Choose a reason for hiding this comment

buaadf commented May 21, 2024

rTrQqgH74lc2PT5k commented May 21, 2024

AllentDan left a comment

Choose a reason for hiding this comment

irexyc commented May 22, 2024

RunningLeon left a comment

Choose a reason for hiding this comment

RunningLeon left a comment

Choose a reason for hiding this comment

lvhan028 commented May 23, 2024 • edited Loading

lzhangzz left a comment

Choose a reason for hiding this comment

covdvoyager commented May 23, 2024

irexyc commented May 23, 2024

buaadf commented May 24, 2024

irexyc commented May 24, 2024

buaadf commented May 24, 2024

irexyc commented May 24, 2024

ysyx2008 commented May 27, 2024

irexyc commented May 27, 2024

ysyx2008 commented May 27, 2024

sshuair commented May 27, 2024

irexyc commented May 27, 2024

serser commented Jun 24, 2024

Pass-O-Guava commented Jul 23, 2024

irexyc commented May 14, 2024 •

edited

Loading

lvhan028 commented May 23, 2024 •

edited

Loading