[V1] Feedback Thread #12568
Comments
👍 I have not done a proper benchmark, but V1 feels superior, i.e. higher throughput plus lower latency and TTFT. I have encountered a possible higher-memory-consumption issue, but am overall very pleased with the vLLM community's hard work on V1. |
Does anyone know about this bug with n>1? Thanks |
Logging is in progress. Current main has a lot more and we will maintain compatibility with V0. Thanks! |
Quick feedback [VLLM_USE_V1=1]:
|
Thanks, both are in progress |
Are logprobs outputs (and specifically prompt logprobs with echo=True) expected to work with the current V1 (0.7.0)? |
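For reference, a minimal sketch of requesting those outputs offline through SamplingParams (prompt_logprobs is roughly what echo=True exposes over the API); the model and values here are illustrative:

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model choice
params = SamplingParams(
    max_tokens=32,
    logprobs=5,         # top-5 logprobs for each generated token
    prompt_logprobs=5,  # per-token logprobs for the prompt itself
)
out = llm.generate(["The capital of France is"], params)[0]
print(out.prompt_logprobs)      # prompt logprobs (what echo=True surfaces)
print(out.outputs[0].logprobs)  # logprobs of the generated tokens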
Maybe there is a better place to discuss this but the implementation for models that use more than one extra modality is quite non-intuitive. |
Still in progress |
Thanks for fixing metrics logs in 0.7.1! |
I'm either going insane, or with V1 the Qwen 8B Instruct LLM just breaks in fp8 and around 25% of generations are pure gibberish, with the same running code and everything. Do I need to file a bug report, or is this expected behaviour and I need some specific setup of sampling params for it to work in V1? |
The V1 engine doesn't seem to support logits processors or min-p filtering. Issue #12678 |
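For context, a rough sketch of the two features as they are used with the V0 engine; the custom processor below is purely an illustrative placeholder:

from vllm import LLM, SamplingParams

def demote_token_processor(token_ids, logits):
    # Illustrative logits processor: lower the score of an arbitrary token id
    # (198) before sampling. Real processors would implement custom logic here.
    logits[198] -= 5.0
    return logits

params = SamplingParams(
    max_tokens=64,
    min_p=0.05,                                  # min-p filtering
    logits_processors=[demote_token_processor],  # custom logits processor (V0)
)
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model
print(llm.generate(["Write one sentence:"], params)[0].outputs[0].text)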
Something is weird with memory calculation in V1 and tensor parallel. Here are 2 cases that I tested recently. vLLM 0.7.0 on 2x A6000: starting a 32B AWQ model normally, everything works as previously and both GPUs get to ~44-46GB usage. Using V1, both GPUs load up to only ~24-25GB and usage slowly goes up as inference runs; I've seen it go up to 32GB on each GPU. Updating to vLLM 0.7.1 and running a 7B AWQ model this time, I also noticed that when running the above command "normally" the logs show maximum concurrency at 44x. Using V1 I get:
And finally, with vLLM 0.7.0 and 4x L4, loading a 32B AWQ model with tp 4 works in "normal mode", but OOMs with V1. |
I did a little experiment with DeepSeek-R1 on 8x H200 GPUs. vLLM 0.7.0 showed the following results with
In general, vLLM without VLLM_USE_V1 looked more performant. I also tried V0 with
Throughput was still 2 times lower than SGLang in the same benchmark. Today I updated vLLM to the new version (0.7.1) and decided to repeat the experiment, and the results in V0 have become much better!
But running vLLM with
|
V1 doesn't support T4; do you plan to support it? |
Hi @bao231, V1 does not support T4 or older-generation GPUs since the kernel libraries used in V1 (e.g., flash-attn) do not support them. |
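Until then, the workaround referenced elsewhere in this thread is to force the V0 engine on unsupported GPUs. A minimal sketch (the model name is illustrative):

import os

# Fall back to the V0 engine before vLLM spins up; V1's kernels
# (e.g. flash-attn) do not support T4-class GPUs.
os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model
out = llm.generate(["Hello"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)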
Does V1 support other attention libraries? Do you have a plan for that? @WoosukKwon |
Thanks!
|
Can you provide more detailed reproduction instructions? cc @WoosukKwon |
Thanks. We are actively working on PP |
Check out #sig-multi-modality in our slack! This is the best place for a discussion like this |
It's pretty hard to follow what you are seeing. Please attach:
Thanks! |
Hi, please see Launch command
|
I am trying to run inference on Qwen2.5-VL-72B for video processing using 4x A800 GPUs. However, I encountered errors when executing the code with vLLM V1, whereas it works correctly with vLLM V0 by setting VLLM_USE_V1=0.

from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info  # Qwen multimodal preprocessing helper

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
    tensor_parallel_size=4,
    gpu_memory_utilization=0.7,
)
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=256,
    stop_token_ids=[],
)
question = ''
messages = [
    {"role": "system", "content": "You are a good video analyst"},
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": file,
            },
            {"type": "text", "text": question},
        ],
    },
]
# self.processor is the model's AutoProcessor
prompt = self.processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)
mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs
llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,
    # FPS will be returned in video_kwargs
    # "mm_processor_kwargs": video_kwargs,
}
outputs = llm.generate(llm_inputs, sampling_params=sampling_params)
|
Cannot work with Qwen 2 #13284 |
CUDA error when using |
Somehow GGUF isn't working well: it crashes the V1 engine with weird CUDA errors or OOM errors, but no such errors show up with V0. |
On two 8xH20 machines, I start Docker with the following compose file.

Node1:

services:
  v3-1-vllm:
    container_name: v3-1-vllm
    image: vllm/vllm-openai:latest
    privileged: true
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    shm_size: "1024g"
    ipc: "host"
    network_mode: "host"
    volumes:
      - /data/deepseek-v3:/root/.cache/huggingface
      - /data/torchcache-v3:/root/torchcache
    environment:
      - VLLM_HOST_IP=10.0.251.33
      - GLOO_SOCKET_IFNAME=ens12f0np0
      - NCCL_SOCKET_IFNAME=ibs1
      - NCCL_IB_ALLOW=1
      - NCCL_IB_DISABLE=0
      - NCCL_IB_CUDA_SUPPORT=1
      - NCCL_IB_HCA=ibp1
      - NCCL_IB_RETRY_CNT=13
      - NCCL_IB_GID_INDEX=3
      - NCCL_NET_GDR_LEVEL=2
      - NCCL_IB_TIMEOUT=22
      - NCCL_DEBUG=INFO
      - NCCL_P2P_LEVEL=NVL
      - NCCL_CROSS_NIC=1
      - NCCL_NET_GDR_LEVEL=SYS
    entrypoint:
      - /bin/bash
      - -c
      - |
        (nohup ray start --disable-usage-stats --block --head --port=6379 > /init.log 2>&1 &)
        sleep 10 && python3 -m vllm.entrypoints.openai.api_server \
          --served-model-name "deepseek/deepseek-v3" \
          --model /root/.cache/huggingface \
          --host 0.0.0.0 \
          --port 30000 \
          --enable-prefix-caching \
          --enable-chunked-prefill \
          --pipeline-parallel-size 2 \
          --tensor-parallel-size 8 \
          --gpu-memory-utilization 0.95 \
          --max-model-len 64128 \
          --max-num-batched-tokens 8192 \
          --scheduling-policy priority \
          --trust-remote-code \
          --max-num-seqs 12 \
          --swap-space 16 \
          --block_size 32 \
          --disable_log_requests
    restart: always

Node2:
But I have a problem
How should I specify the model name and model path? |
Seeking suggestions on how to use a customized stat logger with vLLM V1. It is allowed in V0. |
There are lots of mismatches in the V1 Automatic Prefix Caching example. It may be necessary to review it carefully again to avoid misleading readers. |
Thanks for the link. I apologize if I am asking the wrong person, but is it possible to use a common prefix for all requests? Like, I set up a very long system prompt and skip the cache lookup, etc.? Thanks! |
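A minimal sketch of that shared-long-system-prompt pattern with prefix caching enabled; the model and prompt text are illustrative:

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # illustrative model
    enable_prefix_caching=True,
)

# One long system prompt shared by every request: its KV blocks are cached
# after the first request and reused by the following ones.
system_prompt = "You are a helpful assistant. " * 200
questions = ["What is vLLM?", "What is paged attention?"]
prompts = [f"{system_prompt}\nUser: {q}\nAssistant:" for q in questions]

for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
    print(out.outputs[0].text)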
[ngram] Speculative decoding easily generates repetitive text for multilingual tasks. |
Speculative decoding uses the larger model as "error correction" for the tiny draft model, so repetitive text might be caused by your larger model; speculative decoding should (in theory) only accelerate generation rather than interfere with the generated text. |
But the larger model works well. That's weird. And it often generates repetitive text when there are multiple requests at the same time. |
Sorry for any potentially misleading information. Could you elaborate on where the mismatches are? Or you could just send a fix PR and I'd review it. |
Please also try the same seed, or zero temperature, without speculative decoding, to see if the problem still exists. |
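A small sketch of that check: rerun the same prompt with a fixed seed (or zero temperature) on the engine without speculative decoding and compare the outputs; names and values are illustrative:

from vllm import LLM, SamplingParams

# Baseline engine, no speculative decoding configured.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # illustrative model

# Pin the seed or use greedy decoding (temperature=0) so runs are comparable.
params = SamplingParams(temperature=0.0, seed=1234, max_tokens=128)

out = llm.generate(["Translate to French: good morning"], params)
print(out[0].outputs[0].text)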
Which Docker image is recommended for deploying V1, and are there any best practices for deploying it? |
It's related to this PR |
Could you try main again and see if it is fixed? Thanks! |
I rely on |
Hi, are you already working on resolving the size mismatch issue when loading the MixtralForCausalLM GGUF model? |
#14915 Hi, I hit this bug |
#15046 When trying one of the listed supported models with architecture |
Is it possible to disable prefix caching in V1? |
Hi. V1 only supports using the XGrammar backend for structured generation, but XGrammar does not support as many JSON schemas as Outlines. Specifically, I'm using |
You can set |
We are aware and are close to finishing other structured generation backends in V1. Ideally EOW |
I would suggest using |
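For context on the backend discussion above, a rough sketch of offline structured generation, assuming the GuidedDecodingParams helper and the guided_decoding_backend engine argument available in recent releases; the schema and model are illustrative:

from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",    # illustrative model
    guided_decoding_backend="xgrammar",  # the only backend V1 supported at the time
)
params = SamplingParams(
    max_tokens=128,
    guided_decoding=GuidedDecodingParams(json=schema),
)
print(llm.generate(["Give me a JSON person record:"], params)[0].outputs[0].text)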
Have to disable v1 engine due to this small restriction: #15252. Will x-post in the Slack for vis. |
I have documented the results of my experiments comparing the throughput of V0 and V1 in a newly created issue. The findings suggest that when GPU memory is fully utilized, preemption occurs, and V1 fails to demonstrate a significant throughput advantage over V0. Can anyone explain why this happens? |
We've encountered a critical memory leak when using the V1 engine for image inference — system RAM usage exceeds 200 GB over time. Full bug report with reproduction steps and details can be found here: #15294. |
Please leave comments here about your usage of V1: does it work? Does it not work? Which feature do you need in order to adopt it? Any bugs?
For bug reports, please file them separately and link the issue here.
For in-depth discussion, please feel free to join #sig-v1 in the vLLM Slack workspace.