Balance vision model weights on multi gpus #1591
Conversation
```diff
@@ -279,7 +279,7 @@ std::unique_ptr<LlamaTritonSharedModelInstance<T>> LlamaTritonModel<T>::createSh
     /// TODO: this stream handle is leaked
     cudaStream_t stream{};
-    ft::check_cuda_error(cudaStreamCreate(&stream));
+    ft::check_cuda_error(cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking));
```
Non-blocking, why?
If it's not non-blocking, synchronizing it will also synchronize with the default stream. I suspect there's a conflict when NCCL and the Python side both synchronize with the default stream, and then it hangs.
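For reference, a minimal PyTorch sketch of the same stream semantics (my illustration, not part of the PR): a non-blocking stream is not implicitly ordered against the legacy default stream, so default-stream activity from other libraries cannot stall it.

```python
import torch

# torch.cuda.Stream() hands out streams created with the
# cudaStreamNonBlocking flag -- the same behavior the
# cudaStreamCreateWithFlags change above opts into.
side = torch.cuda.Stream()
with torch.cuda.stream(side):
    a = torch.randn(1024, 1024, device='cuda')
    b = a @ a        # queued on `side`, independent of the default stream
side.synchronize()   # waits only for work queued on `side`
```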
```python
from contextlib import contextmanager
import torch

@contextmanager
def cuda_ctx(device_id):
    # switch to the target device for the duration of the block,
    # restoring the previous device even if the body raises
    old_device = torch.cuda.current_device()
    torch.cuda.set_device(device_id)
    try:
        yield
    finally:
        torch.cuda.set_device(old_device)
```
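Hypothetical usage sketch (not from the PR) showing what the helper does; note that torch.cuda.device is the built-in equivalent:

```python
import torch

# run a vision forward on GPU 1, then restore the previous device
with cuda_ctx(1):
    feats = torch.randn(8, 4096, device='cuda')  # allocated on GPU 1

# built-in equivalent shipped with PyTorch
with torch.cuda.device(1):
    feats = torch.randn(8, 4096, device='cuda')
```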
Just out of curiosity, why add and remove it?
The pybind functions all call cudaSetDevice on the C++ side, so it feels unnecessary now.
Previously the forward thread didn't call cudaSetDevice on the C++ side, but it runs fine for me either with or without cuda_ctx.
lmdeploy/vl/engine.py (Outdated)
```diff
     self.model = load_vl_model(model_path)
-    self.max_batch_size = max_batch_size
+    self.max_batch_size = (1 if vision_config is None else
+                           vision_config.max_batch_size)
```
Why change the default value from 16 to 1 here?
Because an issue reported that the service could start, but GPU memory would overflow once multiple requests came in. Also, some vision models' memory usage is quite sensitive to batch size.
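For context, a hedged sketch of raising the limit back when memory allows, assuming the vision_config/max_batch_size knob this PR threads through (check your installed lmdeploy version for the exact import path):

```python
from lmdeploy import pipeline, VisionConfig

# assumption: VisionConfig exposes max_batch_size, defaulting to 1 per this PR
pipe = pipeline('liuhaotian/llava-v1.5-7b',
                vision_config=VisionConfig(max_batch_size=4))
```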
@lzhangzz waiting for your review...
Really looking forward to this feature.
Does the current tp mean that every GPU accessible through CUDA_VISIBLE_DEVICES gets used? That is, even with tp==2, all four cards are used if four are accessible?
Yes.
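So constraining the vision model to specific cards goes through visibility; a sketch of my own (not from the PR):

```python
import os
# must be set before CUDA is initialized in this process
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'

from lmdeploy import pipeline, TurbomindEngineConfig

# with only two cards visible, both the vision model and the tp=2 LM
# stay on GPUs 0 and 1
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(tp=2))
```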
tested OK with vl pipeline on tp=2 for these available models (a reproduction sketch follows the list):
llava-v1.5-7b
llava-v1.6-vicuna-7b
llava-v1.6-34b
deepseek-vl-7b-chat
Qwen-VL-Chat
Yi-VL-6B
Mini-Gemini-7B
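A minimal sketch of this kind of tp=2 VL pipeline check (model name and image URL are placeholders):

```python
from lmdeploy import pipeline, TurbomindEngineConfig
from lmdeploy.vl import load_image

pipe = pipeline('liuhaotian/llava-v1.5-7b',
                backend_config=TurbomindEngineConfig(tp=2))
image = load_image('https://example.com/tiger.jpeg')  # placeholder URL
print(pipe(('describe this image', image)))
```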
LGTM
runtime.txt should pin the minimum version of accelerate.
LGTM
Hi, when running the Python file I get:
File "D:\新建文件夹\InternDog-master\app_cli.py", line 3, in
Take a look at this and see if it helps.
@irexyc Does multi-GPU parallelism require a power-of-two number of cards? I can't get it running with 3x A30 here.
The tp in backend_config needs to be a power of two.
How does the tp setting relate to the number of cards? How many A30s (24G) are needed at minimum to run?
Splitting the LM model requires tp to be a power of two. tp=2 means the LM needs two cards, chosen as cards 0 and 1 from the visible ones. If you run a VLM with backend_config tp=2 and CUDA_VISIBLE_DEVICES="0,1,2", the vision model is split evenly over the three cards and the LM over the first two. Whether it runs depends on the model. In weights alone (unquantized), a 7b model needs roughly 14G of GPU memory and a 20b model roughly 40G. On top of the weights, the kv cache also takes memory, which affects session_len and batch size; its share can be controlled with cache_max_entry_count.
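Putting those numbers into a concrete (hypothetical) configuration, with the LM on two cards, vision spread over three, and a smaller kv-cache share:

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2'  # vision model split over 3 cards

from lmdeploy import pipeline, TurbomindEngineConfig

# cache_max_entry_count is the fraction of free GPU memory handed to the
# kv cache (0.8 by default); lowering it frees headroom at the cost of
# batch size / session_len
pipe = pipeline('liuhaotian/llava-v1.6-vicuna-7b',
                backend_config=TurbomindEngineConfig(
                    tp=2,                       # LM on the first two visible cards
                    cache_max_entry_count=0.4))
```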
With tp=2, dual 4090s still can't run the int8 version of InternVL (25G of weight files); GPU memory usage blows up.
(internvl) yushen@YuShen-Work:~/ai/InternVL$ python gradio_InternVL.py
Exception in thread Thread-5 (_get_params):
@irexyc error with
Curious to know, does the VLM pipeline support persistent batching? @irexyc
Is the even split of the vision model tp or pp? @irexyc
TODO
#1563