[Bug]: vllm 0.7.3-0.8.0rc3 - QWEN2.5-VL: video_grid_thw[video_index][0], [multiproc_executor.py:375] IndexError: list index out of range #14986

Closed

denadai2 opened this issue Mar 17, 2025 · 7 comments
Labels: bug (Something isn't working)

denadai2 commented Mar 17, 2025

Your current environment

The output of `python collect_env.py`

The script does not work in a uv venv:

$ python collect_env.py
INFO 03-17 23:06:44 [__init__.py:256] Automatically detected platform cuda.
Collecting environment information...
Traceback (most recent call last):
  File "/home/mdenadai/recsys_2025/collect_env.py", line 767, in <module>
    main()
  File "/home/mdenadai/recsys_2025/collect_env.py", line 746, in main
    output = get_pretty_env_info()
  File "/home/mdenadai/recsys_2025/collect_env.py", line 741, in get_pretty_env_info
    return pretty_str(get_env_info())
  File "/home/mdenadai/recsys_2025/collect_env.py", line 539, in get_env_info
    pip_version, pip_list_output = get_pip_packages(run_lambda)
  File "/home/mdenadai/recsys_2025/collect_env.py", line 493, in get_pip_packages
    out = run_with_pip([sys.executable, '-mpip'])
  File "/home/mdenadai/recsys_2025/collect_env.py", line 489, in run_with_pip
    return "\n".join(line for line in out.splitlines()
AttributeError: 'NoneType' object has no attribute 'splitlines'

However: Python 3.10, CUDA 12.4, 2x A100 40 GB.
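
The collect_env.py failure is likely because uv-created virtual environments do not ship pip, so `python -m pip` produces no output. As a rough workaround (a sketch independent of collect_env.py, not a fix for it), the installed packages can be listed with the standard library instead:

# Hypothetical workaround for venvs without pip (e.g. created by `uv venv`):
# enumerate installed distributions via importlib.metadata.
from importlib.metadata import distributions

for dist in distributions():
    print(f"{dist.metadata['Name']}=={dist.version}")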

🐛 Describe the bug

I have an error that appears whenever I run inference multiple times. I tried:

  • using different videos as input
  • disabling prefix caching
  • switching from engine V1 to V0
  • upgrading from 0.7.3 to the latest vLLM master
  • using bigger GPUs
  • flash attention vs. xformers
# Tried all these envs
#import os
#os.environ["VLLM_ATTENTION_BACKEND"] = "XFORMERS"
#os.environ["VLLM_USE_V1"]="0"

from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 1, "video": 1},
    tensor_parallel_size=2,
    max_num_seqs=1,
    #enable_prefix_caching=False, enable_chunked_prefill=False,
    max_model_len=80000,
    enforce_eager=True,
    #gpu_memory_utilization=0.6
)
sampling_params = SamplingParams(
    temperature=0.1,
    top_p=0.001,
    repetition_penalty=1.05,
    max_tokens=512,
    stop_token_ids=[],
)

prompt = """
Describe the video
"""

for i in range(2):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": prompt},
            {
                "type": "video",
                "video": f"/data/sampled_videos/16935.mp4",  # attached
                "fps": 1,
                "max_pixels": 426 * 240,
            }
        ]
         },
    ]

    processor = AutoProcessor.from_pretrained(MODEL_PATH)

    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )
    image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

    mm_data = {}
    if video_inputs is not None:
        mm_data["video"] = video_inputs

    llm_inputs = {
        "prompt": prompt,
        "multi_modal_data": mm_data,

        # FPS will be returned in video_kwargs
        "mm_processor_kwargs": video_kwargs,
    }

    outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
    generated_text = outputs[0].outputs[0].text

    print(generated_text)
    print()
    print()

Error:

INFO 03-17 23:01:45 [__init__.py:30] Available plugins for group vllm.general_plugins:
INFO 03-17 23:01:45 [__init__.py:32] name=register_dummy_model, value=vllm_plugin_meralion:register
INFO 03-17 23:01:45 [__init__.py:34] all available plugins for group vllm.general_plugins will be loaded.
INFO 03-17 23:01:45 [__init__.py:36] set environment variable VLLM_PLUGINS to control which plugins to load.
INFO 03-17 23:01:47 [__init__.py:44] plugin register_dummy_model loaded.
INFO 03-17 23:01:56 [config.py:583] This model supports multiple tasks: {'generate', 'reward', 'classify', 'embed', 'score'}. Defaulting to 'generate'.
INFO 03-17 23:01:56 [config.py:1499] Defaulting to use mp for distributed inference
INFO 03-17 23:01:56 [config.py:1677] Chunked prefill is enabled with max_num_batched_tokens=8192.
WARNING 03-17 23:01:56 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 03-17 23:01:57 [core.py:53] Initializing a V1 LLM engine (v0.8.0rc3.dev4+g18551e82) with config: model='Qwen/Qwen2.5-VL-7B-Instruct', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=80000, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=Qwen/Qwen2.5-VL-7B-Instruct, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}
WARNING 03-17 23:01:57 [multiproc_worker_utils.py:310] Reducing Torch parallelism from 12 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 03-17 23:01:57 [custom_cache_manager.py:19] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 03-17 23:01:57 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0, 1], buffer_handle=(2, 10485760, 10, 'psm_625548f5'), local_subscribe_addr='ipc:///var/tmp/c9f38fd7-57cf-4211-8479-9af83d9609ca', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 03-17 23:01:58 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f129b24e1a0>
(VllmWorker rank=0 pid=18086) INFO 03-17 23:01:58 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_7602b52c'), local_subscribe_addr='ipc:///var/tmp/736ceb70-dba8-42db-ac51-ac6f2c22901d', remote_subscribe_addr=None, remote_addr_ipv6=False)
WARNING 03-17 23:01:58 [utils.py:2282] Methods determine_num_available_blocks,device_config,get_cache_block_size_bytes,initialize_cache not implemented in <vllm.v1.worker.gpu_worker.Worker object at 0x7f129b24e8c0>
(VllmWorker rank=1 pid=18107) INFO 03-17 23:01:58 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[0], buffer_handle=(1, 10485760, 10, 'psm_d790acd7'), local_subscribe_addr='ipc:///var/tmp/be3583b9-4121-44de-be46-d02b74defe30', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=18107) INFO 03-17 23:01:59 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorker rank=0 pid=18086) INFO 03-17 23:01:59 [utils.py:925] Found nccl from library libnccl.so.2
(VllmWorker rank=1 pid=18107) INFO 03-17 23:01:59 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=0 pid=18086) INFO 03-17 23:01:59 [pynccl.py:69] vLLM is using nccl==2.21.5
(VllmWorker rank=1 pid=18107) INFO 03-17 23:01:59 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/mdenadai/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=18086) INFO 03-17 23:01:59 [custom_all_reduce_utils.py:244] reading GPU P2P access cache from /home/mdenadai/.cache/vllm/gpu_p2p_access_cache_for_0,1.json
(VllmWorker rank=0 pid=18086) INFO 03-17 23:01:59 [shm_broadcast.py:258] vLLM message queue communication handle: Handle(local_reader_ranks=[1], buffer_handle=(1, 4194304, 6, 'psm_0bb42966'), local_subscribe_addr='ipc:///var/tmp/f3ea327c-89ac-484d-88bd-84cd012d7e07', remote_subscribe_addr=None, remote_addr_ipv6=False)
(VllmWorker rank=1 pid=18107) INFO 03-17 23:01:59 [parallel_state.py:948] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 1
(VllmWorker rank=1 pid=18107) INFO 03-17 23:01:59 [cuda.py:215] Using Flash Attention backend on V1 engine.
(VllmWorker rank=0 pid=18086) INFO 03-17 23:01:59 [parallel_state.py:948] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, TP rank 0
(VllmWorker rank=0 pid=18086) INFO 03-17 23:01:59 [cuda.py:215] Using Flash Attention backend on V1 engine.
(VllmWorker rank=1 pid=18107) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker rank=0 pid=18086) Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
(VllmWorker rank=1 pid=18107) INFO 03-17 23:02:00 [gpu_model_runner.py:1128] Starting to load model Qwen/Qwen2.5-VL-7B-Instruct...
(VllmWorker rank=0 pid=18086) INFO 03-17 23:02:00 [gpu_model_runner.py:1128] Starting to load model Qwen/Qwen2.5-VL-7B-Instruct...
(VllmWorker rank=1 pid=18107) WARNING 03-17 23:02:00 [vision.py:94] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(VllmWorker rank=0 pid=18086) WARNING 03-17 23:02:00 [vision.py:94] Current `vllm-flash-attn` has a bug inside vision module, so we use xformers backend instead. You can run `pip install flash-attn` to use flash-attention backend.
(VllmWorker rank=1 pid=18107) INFO 03-17 23:02:00 [config.py:3206] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorker rank=0 pid=18086) INFO 03-17 23:02:00 [config.py:3206] cudagraph sizes specified by model runner [] is overridden by config []
(VllmWorker rank=0 pid=18086) WARNING 03-17 23:02:00 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=1 pid=18107) WARNING 03-17 23:02:00 [topk_topp_sampler.py:63] FlashInfer is not available. Falling back to the PyTorch-native implementation of top-p & top-k sampling. For the best performance, please install FlashInfer.
(VllmWorker rank=0 pid=18086) INFO 03-17 23:02:01 [weight_utils.py:257] Using model weights format ['*.safetensors']
(VllmWorker rank=1 pid=18107) INFO 03-17 23:02:01 [weight_utils.py:257] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:03,  1.32it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:01<00:02,  1.30it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:01<00:00,  2.01it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:02<00:00,  1.72it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.51it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:03<00:00,  1.55it/s]
(VllmWorker rank=0 pid=18086)
(VllmWorker rank=0 pid=18086) INFO 03-17 23:02:04 [loader.py:429] Loading weights took 3.26 seconds
(VllmWorker rank=1 pid=18107) INFO 03-17 23:02:04 [loader.py:429] Loading weights took 3.23 seconds
(VllmWorker rank=0 pid=18086) INFO 03-17 23:02:04 [gpu_model_runner.py:1140] Model loading took 7.8681 GB and 3.855880 seconds
(VllmWorker rank=1 pid=18107) INFO 03-17 23:02:04 [gpu_model_runner.py:1140] Model loading took 7.8681 GB and 4.036642 seconds
(VllmWorker rank=1 pid=18107) INFO 03-17 23:02:04 [gpu_model_runner.py:1358] Encoder cache will be initialized with a budget of 49152 tokens, and profiled with 1 video items of the maximum feature size.
(VllmWorker rank=0 pid=18086) INFO 03-17 23:02:04 [gpu_model_runner.py:1358] Encoder cache will be initialized with a budget of 49152 tokens, and profiled with 1 video items of the maximum feature size.
(VllmWorker rank=1 pid=18107) It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
(VllmWorker rank=0 pid=18086) It looks like you are trying to rescale already rescaled images. If the input images have pixel values between 0 and 1, set `do_rescale=False` to avoid rescaling them again.
INFO 03-17 23:02:15 [kv_cache_utils.py:537] GPU KV cache size: 877,952 tokens
INFO 03-17 23:02:15 [kv_cache_utils.py:540] Maximum concurrency for 80,000 tokens per request: 10.97x
INFO 03-17 23:02:15 [kv_cache_utils.py:537] GPU KV cache size: 877,952 tokens
INFO 03-17 23:02:15 [kv_cache_utils.py:540] Maximum concurrency for 80,000 tokens per request: 10.97x
INFO 03-17 23:02:15 [core.py:138] init engine (profile, create kv cache, warmup model) took 10.43 seconds
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.48, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
qwen-vl-utils using decord to read video.
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:15<00:00, 15.15s/it, est. speed input: 397.82 toks/s, output: 20.47 toks/s]
The video appears to be a humorous and lively sequence featuring multiple scenes. It starts with a man in a black suit standing in a wooden hallway, holding a microphone and singing or speaking into it. The text on the screen suggests he is performing or engaging in some form of entertainment.

The scene then transitions to a different setting where a group of people are seated at desks, possibly in an office environment. One person is speaking, and the text on the screen indicates they are asking for something specific, adding a layer of interaction and dialogue.

Next, the video cuts back to the man in the wooden hallway, now wearing sunglasses and continuing his performance. He seems to be enjoying himself, moving energetically while singing. The text on the screen includes phrases like "let's play again" and "let's go," suggesting a playful and enthusiastic tone.

The video then shifts to another scene where a man in a blue shirt is seen smiling and laughing, indicating a light-hearted and fun atmosphere. This is followed by a scene with a colorful, glitchy effect, which might be used as a transition or to add visual interest.

Finally, the video returns to the man in the wooden hallway, who continues his performance with more animated gestures and expressions. The text on the screen includes phrases like "thank you everyone" and "thank you Amir Khan," suggesting that this could be part of a live performance or event where he is addressing an audience.

Overall, the video combines elements of performance, humor, and interaction, creating a dynamic and entertaining experience.


Processed prompts:   0%|                                                                                                    | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375] WorkerProc hit an exception: %s
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375] Traceback (most recent call last):
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 371, in worker_busy_loop
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     output = func(*args, **kwargs)
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 242, in execute_model
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     return func(*args, **kwargs)
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 921, in execute_model
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     self._update_states(scheduler_output)
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 359, in _update_states
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     MRotaryEmbedding.get_input_positions_tensor(
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 1014, in get_input_positions_tensor
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     video_grid_thw[video_index][0],
(VllmWorker rank=0 pid=18086) ERROR 03-17 23:02:43 [multiproc_executor.py:375] IndexError: list index out of range
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375] WorkerProc hit an exception: %s
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375] Traceback (most recent call last):
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 371, in worker_busy_loop
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     output = func(*args, **kwargs)
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     return func(*args, **kwargs)
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 242, in execute_model
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     output = self.model_runner.execute_model(scheduler_output)
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     return func(*args, **kwargs)
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 921, in execute_model
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     self._update_states(scheduler_output)
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 359, in _update_states
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     MRotaryEmbedding.get_input_positions_tensor(
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding.py", line 1014, in get_input_positions_tensor
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375]     video_grid_thw[video_index][0],
(VllmWorker rank=1 pid=18107) ERROR 03-17 23:02:43 [multiproc_executor.py:375] IndexError: list index out of range
ERROR 03-17 23:02:43 [core.py:340] EngineCore hit an exception: Traceback (most recent call last):
ERROR 03-17 23:02:43 [core.py:340]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 333, in run_engine_core
ERROR 03-17 23:02:43 [core.py:340]     engine_core.run_busy_loop()
ERROR 03-17 23:02:43 [core.py:340]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 367, in run_busy_loop
ERROR 03-17 23:02:43 [core.py:340]     outputs = step_fn()
ERROR 03-17 23:02:43 [core.py:340]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 192, in step
ERROR 03-17 23:02:43 [core.py:340]     output = self.model_executor.execute_model(scheduler_output)
ERROR 03-17 23:02:43 [core.py:340]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 80, in execute_model
ERROR 03-17 23:02:43 [core.py:340]     output = self.collective_rpc("execute_model",
ERROR 03-17 23:02:43 [core.py:340]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 133, in collective_rpc
ERROR 03-17 23:02:43 [core.py:340]     raise e
ERROR 03-17 23:02:43 [core.py:340]   File "/home/mdenadai/uv_venvs/bug/lib/python3.10/site-packages/vllm/v1/executor/multiproc_executor.py", line 122, in collective_rpc
ERROR 03-17 23:02:43 [core.py:340]     raise result
ERROR 03-17 23:02:43 [core.py:340] IndexError: list index out of range
ERROR 03-17 23:02:43 [core.py:340]
CRITICAL 03-17 23:02:43 [core_client.py:269] Got fatal signal from worker processes, shutting down. See stack trace above for root cause issue.
Killed

16935.mp4.zip

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
denadai2 added the bug label Mar 17, 2025
denadai2 changed the title from "[Bug]: QWEN2.5-VL: video_grid_thw[video_index][0], [multiproc_executor.py:375] IndexError: list index out of range" to "[Bug]: vllm 0.7.3-0.8.0rc3 - QWEN2.5-VL: video_grid_thw[video_index][0], [multiproc_executor.py:375] IndexError: list index out of range" Mar 17, 2025
@DarkLight1337 (Member)

cc @imkero @Isotr0py

@Isotr0py Isotr0py self-assigned this Mar 18, 2025
@Isotr0py (Collaborator)

Will investigate it tonight.

@denadai2 (Author)

thx

@Isotr0py (Collaborator)

I think the cause is in your script: you are overwriting `prompt` inside the loop, so on the second iteration the already-templated prompt (which already contains a video placeholder) is fed back into `apply_chat_template`. The resulting prompt then ends up with more video placeholders than the single video you actually pass in (an incorrect video count):

prompt = """
Describe the video
"""

for i in range(2):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": prompt},
            {
                "type": "video",
                "video": f"/data/sampled_videos/16935.mp4",  # attached
                "fps": 1,
                "max_pixels": 426 * 240,
            }
        ]
         },
    ]

    processor = AutoProcessor.from_pretrained(MODEL_PATH)

    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

You just need to modify it like this to avoid the override:

question = """
Describe the video
"""

for i in range(2):
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "text", "text": question},
            {
                "type": "video",
                "video": f"./16935.mp4",
                "fps": 1,
                "max_pixels": 426 * 240,
            }
        ]
         },
    ]

    processor = AutoProcessor.from_pretrained(MODEL_PATH)

    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

@imkero (Contributor)

imkero commented Mar 18, 2025

Reproduced the problem and reached the same conclusion as @Isotr0py above.

@denadai2 (Author)

denadai2 commented Mar 18, 2025

I think you intended to put `question` inside the loop. With

for i in range(2):
    prompt = """
    Describe the video
    """

it works.

OOMMGGG!! Thanks, this came totally unexpected. However, should we change how this failure is handled, e.g. with an assertion?
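
For illustration, a minimal user-side guard along those lines (a sketch only; the `<|video_pad|>` token name is an assumption about what the Qwen2.5-VL chat template emits, and this is not an existing vLLM API) could catch the mismatch before `llm.generate(...)` is called:

# Hypothetical sanity check: the number of video placeholders in the
# templated prompt should match the number of videos in multi_modal_data.
# Assumes the Qwen2.5-VL chat template marks each video with "<|video_pad|>".
def check_video_placeholders(prompt: str, videos) -> None:
    num_placeholders = prompt.count("<|video_pad|>")
    num_videos = len(videos) if isinstance(videos, list) else 1
    assert num_placeholders == num_videos, (
        f"prompt has {num_placeholders} video placeholder(s) but "
        f"{num_videos} video(s) were provided; was the prompt passed "
        f"through apply_chat_template twice?"
    )

Calling something like check_video_placeholders(llm_inputs["prompt"], mm_data.get("video", [])) right before llm.generate(...) should turn the worker-side IndexError into an immediate, readable failure in the script above.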

@denadai2 (Author)

Closed by mistake.
