Ubuntu 22, RTX 3090.
I ran vLLM 0.8.1 with a very small model, https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-AWQ, using the command below:
vllm serve Qwen/Qwen2.5-3B-Instruct-AWQ
INFO 03-20 17:28:12 [__init__.py:256] Automatically detected platform cuda.
INFO 03-20 17:28:13 [api_server.py:977] vLLM API server version 0.8.1
It works well, but when I check nvidia-smi, it takes almost 16 GB:
anaconda3/envs/vllm/bin/python 15982MiB
I then switched to a bigger model, https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-AWQ, but got the same GPU memory usage.
Questions:
- Why does 3B-Instruct-AWQ take 16 GB?
- Why does 7B-Instruct-AWQ take the same amount of GPU memory?
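For context: vLLM reserves a fixed fraction of total GPU memory up front for the model weights plus the KV cache (the --gpu-memory-utilization option, which defaults to 0.9), regardless of how small the model is, which would explain both observations. Assuming that is what is happening here, lowering the cap and the maximum context length should shrink the footprint; the values below are illustrative, not recommendations:
vllm serve Qwen/Qwen2.5-3B-Instruct-AWQ --gpu-memory-utilization 0.5 --max-model-len 4096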