Your current environment
How would you like to use vllm
I tried to use vLLM to serve Qwen-32B-chat-AWQ on two 3090s (24 GB x 2).
I expected 24 GB of memory to be enough on a single GPU, so I tried one GPU first, but it failed.
Then I tried serving the model with tensor parallelism, and that worked, but the memory usage exceeded my expectation: 18 GB on each GPU, 36 GB in total, which is far more than I expected for a model that should fit on one GPU. I want to know whether this is common.
In my estimation, 13-14 GB per GPU should be enough.
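For reference, a minimal sketch of the tensor-parallel setup described above, using the vLLM Python API. The Hugging Face model ID (`Qwen/Qwen1.5-32B-Chat-AWQ`) and the `gpu_memory_utilization` value are assumptions for illustration, not taken from the report:

```python
# Minimal sketch: serving an AWQ-quantized model across two GPUs with vLLM.
# Model ID and memory settings are assumptions; adjust to your environment.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen1.5-32B-Chat-AWQ",  # assumed HF ID for "Qwen-32B-chat-AWQ"
    quantization="awq",
    tensor_parallel_size=2,             # split the model across both 3090s
    gpu_memory_utilization=0.9,         # fraction of each GPU vLLM may preallocate
)

outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```

Note that vLLM preallocates GPU memory for the KV cache up to the `gpu_memory_utilization` fraction (0.9 by default), so the per-GPU usage reported by `nvidia-smi` reflects that target rather than the model weights alone; lowering the value should reduce the observed footprint at the cost of KV-cache capacity.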