Your current environment
I am serving the qwen2.5vl-7b model with vLLM 0.7.3. The launch command is: nohup env CUDA_VISIBLE_DEVICES=4, vllm serve /Qwen/Qwen2___5-VL-7B-Instruct/ --trust-remote-code --served-model-name qwen_model --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --port 8000 &>qwen.log &
The hardware is four NVIDIA 4090 GPUs. Right after startup, each card uses roughly 8 GB of memory, but usage keeps growing the longer the server runs; after one night it had climbed to roughly 12 GB per card.
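As an aside, the trailing comma in CUDA_VISIBLE_DEVICES=4, looks like a truncated device list: as written it exposes only GPU index 4, which would not satisfy --tensor-parallel-size 4. A minimal sketch of the launch, assuming the intent was to expose four GPUs (the indices 4,5,6,7 are hypothetical):

```bash
# Sketch only: device indices 4,5,6,7 are an assumption; substitute your own.
nohup env CUDA_VISIBLE_DEVICES=4,5,6,7 \
    vllm serve /Qwen/Qwen2___5-VL-7B-Instruct/ \
    --trust-remote-code \
    --served-model-name qwen_model \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --port 8000 &>qwen.log &
```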
How would you like to use vllm
How should I be using vLLM so that GPU memory usage does not keep growing like this?
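For reference, a hedged sketch of vllm serve flags that are commonly used to bound memory; the values below are illustrative assumptions, not tuned recommendations:

```bash
# Sketch, not a tuned config (all values are assumptions):
#   --gpu-memory-utilization  fraction of each GPU vLLM reserves up front
#   --max-model-len           caps context length, and with it the per-request KV cache
#   --max-num-seqs            caps how many requests are batched concurrently
#   --enforce-eager           skips CUDA graph capture, trading speed for memory headroom
vllm serve /Qwen/Qwen2___5-VL-7B-Instruct/ \
    --trust-remote-code \
    --served-model-name qwen_model \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.85 \
    --max-model-len 8192 \
    --max-num-seqs 64 \
    --enforce-eager \
    --port 8000
```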
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.