
[Misc]: why 3B-Instruct-AWQ takes 16G #15204

@shaojun

Description


Anything you want to discuss about vllm.

Ubuntu 22, RTX 3090.
I ran vLLM 0.8.1 with a very small model, https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-AWQ, using the command below:

vllm serve Qwen/Qwen2.5-3B-Instruct-AWQ
INFO 03-20 17:28:12 [__init__.py:256] Automatically detected platform cuda.
INFO 03-20 17:28:13 [api_server.py:977] vLLM API server version 0.8.1

It works fine, but nvidia-smi shows the process using almost 16 GB:

anaconda3/envs/vllm/bin/python      15982MiB
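One way to test whether the footprint is governed by vLLM's reservation fraction rather than by model size would be to lower it explicitly; a minimal sketch, assuming the documented --gpu-memory-utilization flag (which defaults to 0.9) and --max-model-len:

vllm serve Qwen/Qwen2.5-3B-Instruct-AWQ --gpu-memory-utilization 0.5 --max-model-len 4096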

I then switched to a bigger model, https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-AWQ, but got the same GPU memory usage.

Questions:
Why does 3B-Instruct-AWQ take 16 GB?
Why does 7B-Instruct-AWQ take the same amount of GPU memory?
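For scale, the quantized weights alone should sit well under 16 GB. A rough back-of-envelope, assuming ~4 bits (0.5 bytes) per parameter for AWQ and approximate parameter counts of 3.1B and 7.6B:

3.1e9 params × 0.5 bytes/param ≈ 1.4 GiB (3B)
7.6e9 params × 0.5 bytes/param ≈ 3.5 GiB (7B)

If those estimates are roughly right, most of the ~16 GB would be vLLM's pre-reserved KV-cache pool, which is sized from --gpu-memory-utilization rather than from the model, and which would also explain why both models show the same usage.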

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
