Your current environment
The output of python collect_env.py
INFO 06-17 11:26:00 [__init__.py:239] Automatically detected platform cuda.
Collecting environment information...
==============================
System Info
==============================
OS : Ubuntu 22.04.5 LTS (x86_64)
GCC version : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version : Could not collect
CMake version : version 3.22.1
Libc version : glibc-2.35
==============================
PyTorch Info
==============================
PyTorch version : 2.6.0+cu124
Is debug build : False
CUDA used to build PyTorch : 12.4
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.10.18 | packaged by conda-forge | (main, Jun 4 2025, 14:45:41) [GCC 13.3.0] (64-bit runtime)
Python platform : Linux-5.10.134-008.16.kangaroo.al8.x86_64-x86_64-with-glibc2.35
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.4.131
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version : 535.54.03
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 16
On-line CPU(s) list: 0-15
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Processor @ 2.90GHz
CPU family: 6
Model: 106
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
Stepping: 6
BogoMIPS: 5800.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd avx512vbmi umip pku avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear arch_capabilities
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 384 KiB (8 instances)
L1i cache: 256 KiB (8 instances)
L2 cache: 10 MiB (8 instances)
L3 cache: 48 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==27.0.0
[pip3] torch==2.6.0+cu124
[pip3] torchaudio==2.6.0+cu124
[pip3] torchvision==0.21.0+cu124
[pip3] transformers==4.52.4
[pip3] triton==3.2.0
[conda] numpy 2.2.6 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.4.5.8 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.2.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.5.147 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.6.1.9 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.3.1.170 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.6.2 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.21.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.4.127 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.4.127 pypi_0 pypi
[conda] pyzmq 27.0.0 pypi_0 pypi
[conda] torch 2.6.0+cu124 pypi_0 pypi
[conda] torchaudio 2.6.0+cu124 pypi_0 pypi
[conda] torchvision 0.21.0+cu124 pypi_0 pypi
[conda] transformers 4.52.4 pypi_0 pypi
[conda] triton 3.2.0 pypi_0 pypi
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.8.5.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 NIC0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X PHB 0-15 N/A N/A
NIC0 PHB X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
==============================
Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NCCL_IB_TC=16
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NCCL_MIN_NCHANNELS=4
NCCL_VERSION=2.20.5-1
NCCL_SOCKET_IFNAME=eth
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
VLLM_USE_MODELSCOPE=True
NCCL_IB_HCA=mlx5
NVIDIA_PRODUCT_NAME=CUDA
NCCL_IB_GID_INDEX=3
CUDA_VERSION=12.4.0
NCCL_IB_QPS_PER_CONNECTION=8
NCCL_IB_TIMEOUT=22
NCCL_IB_SL=5
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY
🐛 Describe the bug
I used the example code from the docs and got a CUDA out-of-memory error.
'''
This code generates responses from vLLM for the llava and llava-onevision models with single-image input. It runs out of CUDA memory at the second sample.
'''
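# Script structure: main() reads samples from final_merged_question_list.json,
# and for each sample builds EngineArgs, constructs a vllm.LLM engine, and
# calls generate() on a single-image prompt.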
import setuptools.dist
import os
import random
from contextlib import contextmanager
from dataclasses import asdict
from typing import NamedTuple, Optional
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer
from vllm import LLM, EngineArgs, SamplingParams
from vllm.assets.image import ImageAsset
from vllm.assets.video import VideoAsset
from vllm.lora.request import LoRARequest
from vllm.utils import FlexibleArgumentParser
from vllm.multimodal.utils import fetch_image
from PIL import Image
import json
from tqdm import tqdm
def convert_image_mode(image: Image.Image, to_mode: str):
    if image.mode == to_mode:
        return image
    elif image.mode == "RGBA" and to_mode == "RGB":
        # rgba_to_rgb is not defined in this script, so composite the RGBA
        # image onto a white background before dropping the alpha channel.
        background = Image.new("RGB", image.size, (255, 255, 255))
        background.paste(image, mask=image.split()[-1])
        return background
    else:
        return image.convert(to_mode)
class ModelRequestData(NamedTuple):
engine_args: EngineArgs
prompts: list[str]
stop_token_ids: Optional[list[int]] = None
lora_requests: Optional[list[LoRARequest]] = None
# LLaVA-1.5
def run_llava(questions: list[str], modality: str) -> ModelRequestData:
assert modality == "image"
prompts = [
f"USER: <image>\n{question}\nASSISTANT:" for question in questions
]
# import pdb
# pdb.set_trace()
engine_args = EngineArgs(
model="/home/xj_data/zhaohongbo/hf_model/llava15-7b-hf",
max_model_len=4096,
limit_mm_per_prompt={modality: 1},
)
return ModelRequestData(
engine_args=engine_args,
prompts=prompts,
)
# LLaVA-OneVision
def run_llava_onevision(questions: list[str],
modality: str) -> ModelRequestData:
if modality == "video":
prompts = [
f"<|im_start|>user <video>\n{question}<|im_end|> \
<|im_start|>assistant\n" for question in questions
]
elif modality == "image":
prompts = [
f"<|im_start|>user <image>\n{question}<|im_end|> \
<|im_start|>assistant\n" for question in questions
]
engine_args = EngineArgs(
model="/home/xj_data/zhaohongbo/hf_model/llavaov-7b",
max_model_len=16384,
limit_mm_per_prompt={modality: 1},
)
return ModelRequestData(
engine_args=engine_args,
prompts=prompts,
)
from typing import Union
def get_multi_modal_input(args, image_input: Union[str, list], question: str):
"""
return {
"data": image or video,
"question": question,
}
"""
if args.modality == "image":
# Input image and question
image = convert_image_mode(Image.open(image_input), "RGB")
img_questions = [question] # should be a list
return {
"data": image,
"questions": img_questions,
}
# if args.modality == "video":
# # Input video and question
# video = VideoAsset(name="baby_reading", num_frames=args.num_frames).np_ndarrays
# vid_questions = ["Why is this video funny?"]
# return {
# "data": video,
# "questions": vid_questions,
# }
msg = f"Modality {args.modality} is not supported."
raise ValueError(msg)
model_example_map = {
"llava": run_llava,
"llava-onevision": run_llava_onevision,
}
def apply_image_repeat(image_repeat_prob, num_prompts, data,
prompts: list[str], modality):
"""Repeats images with provided probability of "image_repeat_prob".
Used to simulate hit/miss for the MM preprocessor cache.
"""
assert image_repeat_prob <= 1.0 and image_repeat_prob >= 0
no_yes = [0, 1]
probs = [1.0 - image_repeat_prob, image_repeat_prob]
inputs = []
cur_image = data
for i in range(num_prompts):
if image_repeat_prob is not None:
res = random.choices(no_yes, probs)[0]
if res == 0:
# No repeat => Modify one pixel
cur_image = cur_image.copy()
new_val = (i // 256 // 256, i // 256, i % 256)
cur_image.putpixel((0, 0), new_val)
inputs.append({
"prompt": prompts[i % len(prompts)],
"multi_modal_data": {
modality: cur_image
},
})
return inputs
@contextmanager
def time_counter(enable: bool):
if enable:
import time
start_time = time.time()
yield
elapsed_time = time.time() - start_time
# print("-" * 50)
print("-- generate time = {}".format(elapsed_time))
# print("-" * 50)
else:
yield
def parse_args():
parser = FlexibleArgumentParser(
description="Demo on using vLLM for offline inference with "
"vision language models for text generation")
parser.add_argument(
"--model-type",
"-m",
type=str,
default="llava",
choices=model_example_map.keys(),
help='Huggingface "model_type".',
)
parser.add_argument("--num-prompts",
type=int,
default=1,
help="Number of prompts to run.")
parser.add_argument(
"--modality",
type=str,
default="image",
choices=["image", "video"],
help="Modality of the input.",
)
parser.add_argument(
"--num-frames",
type=int,
default=1,
help="Number of frames to extract from the video.",
)
parser.add_argument(
"--seed",
type=int,
default=None,
help="Set the seed when initializing `vllm.LLM`.",
)
parser.add_argument(
"--image-repeat-prob",
type=float,
default=None,
help=
"Simulates the hit-ratio for multi-modal preprocessor cache (if enabled)",
)
parser.add_argument(
"--disable-mm-preprocessor-cache",
action="store_true",
help="If True, disables caching of multi-modal preprocessor/mapper.",
)
parser.add_argument(
"--time-generate",
action="store_true",
help="If True, then print the total generate() call time",
)
parser.add_argument(
"--use-different-prompt-per-request",
action="store_true",
help="If True, then use different prompt (with the same multi-modal "
"data) for each request.",
)
return parser.parse_args()
# Read a JSON file
def read_json_file(file_path):
with open(file_path, 'r', encoding='utf-8') as file:
return json.load(file)
def main(args):
model = args.model_type
if model not in model_example_map:
raise ValueError(f"Model type {model} is not supported.")
modality = args.modality
# MEGABench_core_json = read_json_file('LMUData/MEGABench_image_core.json')
# MEGABench_open_json = read_json_file('LMUData/MEGABench_image_open.json')
# debug_json = read_json_file('test_muliti_image.json')
total_json = read_json_file('final_merged_question_list.json')
data_json = total_json
for item in tqdm(data_json):
id = item['id']
image_file = item['image']
if isinstance(image_file, list) and model == 'llava':
continue # just skip multi-image for llava
if isinstance(image_file, str):
image = f"{os.path.abspath(item['image'])}"
num_images = 1
elif isinstance(image_file, list):
image = [f"{os.path.abspath(i)}" for i in image_file]
num_images = len(image_file)
else:
            raise AssertionError("image_file should be str or list")
GT = item['GT']
question = item['question']
mm_input = get_multi_modal_input(args, image, question)
# import pdb; pdb.set_trace()
data = mm_input["data"]
questions = mm_input["questions"]
if num_images == 1:
req_data = model_example_map[model](questions, modality)
elif num_images > 1:
if 'internvl' in model:
model = 'internvl_chat_multi'
req_data = model_example_map[model](questions, image)
elif 'onevision' in model:
model = 'llava-onevision' # TODO: ADD support for multiple images
# req_data = model_example_map[model](questions, modality)
else:
raise ValueError(
f"Model {model} does not support multiple images.")
# Disable other modalities to save memory
default_limits = {"image": 0, "video": 0, "audio": 0}
req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
req_data.engine_args.limit_mm_per_prompt or {})
engine_args = asdict(req_data.engine_args) | {
"seed": args.seed,
"disable_mm_preprocessor_cache":
args.disable_mm_preprocessor_cache,
}
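        # NOTE: a new LLM engine is constructed for every sample in this loop;
        # the previous engine is not released before the next one is created.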
llm = LLM(**engine_args)
# Don't want to check the flag multiple times, so just hijack `prompts`.
prompts = (req_data.prompts if args.use_different_prompt_per_request
else [req_data.prompts[0]])
# We set temperature to 0.2 so that outputs can be different
# even when all prompts are identical when running batch inference.
sampling_params = SamplingParams(
temperature=0.2,
max_tokens=64,
stop_token_ids=req_data.stop_token_ids)
assert args.num_prompts > 0
# import pdb; pdb.set_trace()
if args.num_prompts == 1:
# Single inference
inputs = {
"prompt": prompts[0],
"multi_modal_data": {
modality: data
},
}
else:
# Batch inference
if args.image_repeat_prob is not None:
# Repeat images with specified probability of "image_repeat_prob"
inputs = apply_image_repeat(args.image_repeat_prob,
args.num_prompts, data, prompts,
modality)
else:
# Use the same image for all prompts
inputs = [{
"prompt": prompts[i % len(prompts)],
"multi_modal_data": {
modality: data
},
} for i in range(args.num_prompts)]
# Add LoRA request if applicable
lora_request = (req_data.lora_requests *
args.num_prompts if req_data.lora_requests else None)
with time_counter(args.time_generate):
outputs = llm.generate(
inputs,
sampling_params=sampling_params,
lora_request=lora_request,
)
# print("-" * 50)
# import pdb; pdb.set_trace()
for o in outputs:
generated_text = o.outputs[0].text
print(generated_text)
# print("-" * 50)
item['pred'] = generated_text
item['response_model'] = args.model_type
        # break  # TODO: only process one sample
## create the output directory if it doesn't exist
os.makedirs('Responses', exist_ok=True)
    # Save the updated results to a JSON file
with open(
f'Responses/total_{args.model_type}_response.json',
'w',
encoding='utf-8') as file:
json.dump(data_json, file, ensure_ascii=False, indent=4)
if __name__ == "__main__":
args = parse_args()
main(args)
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.