[Bug]: Strange CUDA out of memory when running llava1.5 7b on 80G A100 #19724

Closed
@bjzhb666

Description


Your current environment

The output of python collect_env.py
INFO 06-17 11:26:00 [__init__.py:239] Automatically detected platform cuda.                                                                                                                          
Collecting environment information...                                                                                                                                                                
==============================                                                                                                                                                                       
        System Info                                                                                                                                                                                  
==============================                                                                                                                                                                       
OS                           : Ubuntu 22.04.5 LTS (x86_64)                                                                                                                                           
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0                                                                                                                                 
Clang version                : Could not collect                                                                                                                                                     
CMake version                : version 3.22.1                                                                                                                                                        
Libc version                 : glibc-2.35                                                                                                                                                            
                                                                                                                                                                                                     
==============================                                                                                                                                                                       
       PyTorch Info                                                                                                                                                                                  
==============================                                                                                                                                                                       
PyTorch version              : 2.6.0+cu124                                                                                                                                                           
Is debug build               : False                                                                                                                                                                 
CUDA used to build PyTorch   : 12.4                                                                                                                                                                  
ROCM used to build PyTorch   : N/A                                                                                                                                                                   
                                                                                                                                                                                                     
==============================                                                                                                                                                                       
      Python Environment                                                                                                                                                                             
==============================                                                                                                                                                                       
Python version               : 3.10.18 | packaged by conda-forge | (main, Jun  4 2025, 14:45:41) [GCC 13.3.0] (64-bit runtime)                                                                       
Python platform              : Linux-5.10.134-008.16.kangaroo.al8.x86_64-x86_64-with-glibc2.35                                                                                                       
                                                                                                                                                                                                     
==============================                                                                                                                                                                       
       CUDA / GPU Info                                                                                                                                                                               
==============================                                                                                                                                                                       
Is CUDA available            : True                                                                                                                                                                  
CUDA runtime version         : 12.4.131                                                                                                                                                              
CUDA_MODULE_LOADING set to   : LAZY                                                                                                                                                                  
GPU models and configuration : GPU 0: NVIDIA A100-SXM4-80GB                                                                                                                                          
Nvidia driver version        : 535.54.03                                                                                                                                                             
cuDNN version                : Could not collect                                                                                                                                                     
HIP runtime version          : N/A                                                                                                                                                                   
MIOpen runtime version       : N/A                                                                                                                                                                   
Is XNNPACK available         : True            
==============================                                                                                                                                                                       
          CPU Info                                                                                                                                                                                   
==============================                                                                                                                                                                       
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   46 bits physical, 57 bits virtual
Byte Order:                      Little Endian
CPU(s):                          16
On-line CPU(s) list:             0-15
Vendor ID:                       GenuineIntel
Model name:                      Intel(R) Xeon(R) Processor @ 2.90GHz
CPU family:                      6
Model:                           106
Thread(s) per core:              1
Core(s) per socket:              16
Socket(s):                       1
Stepping:                        6
BogoMIPS:                        5800.00
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd avx512vbmi umip pku avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear arch_capabilities
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       384 KiB (8 instances)
L1i cache:                       256 KiB (8 instances)
L2 cache:                        10 MiB (8 instances)
L3 cache:                        48 MiB (1 instance)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-15
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Vulnerable
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Vulnerable
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.4.5.8
[pip3] nvidia-cuda-cupti-cu12==12.4.127
[pip3] nvidia-cuda-nvrtc-cu12==12.4.127
[pip3] nvidia-cuda-runtime-cu12==12.4.127
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.2.1.3
[pip3] nvidia-curand-cu12==10.3.5.147
[pip3] nvidia-cusolver-cu12==11.6.1.9
[pip3] nvidia-cusparse-cu12==12.3.1.170
[pip3] nvidia-cusparselt-cu12==0.6.2
[pip3] nvidia-nccl-cu12==2.21.5
[pip3] nvidia-nvjitlink-cu12==12.4.127
[pip3] nvidia-nvtx-cu12==12.4.127
[pip3] pyzmq==27.0.0
[pip3] torch==2.6.0+cu124
[pip3] torchaudio==2.6.0+cu124                                                                                                                                                                       
[pip3] torchvision==0.21.0+cu124                                                                                                                                                                     
[pip3] transformers==4.52.4                                                                                                                                                                          
[pip3] triton==3.2.0                                                                                                                                                                                 
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.4.5.8                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.4.127                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.4.127                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.2.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.5.147               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.6.1.9                 pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.3.1.170               pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.6.2                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.21.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.4.127                 pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.4.127                 pypi_0    pypi
[conda] pyzmq                     27.0.0                   pypi_0    pypi
[conda] torch                     2.6.0+cu124              pypi_0    pypi
[conda] torchaudio                2.6.0+cu124              pypi_0    pypi
[conda] torchvision               0.21.0+cu124             pypi_0    pypi
[conda] transformers              4.52.4                   pypi_0    pypi
[conda] triton                    3.2.0                    pypi_0    pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.8.5.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     0-15            N/A             N/A
NIC0    PHB      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=all
NCCL_IB_TC=16
NVIDIA_REQUIRE_CUDA=cuda>=12.4 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526 brand=tesla,driver>=535,driver<536 brand=unknown,driver>=535,driver<536 brand=nvidia,driver>=535,driver<536 brand=nvidiartx,driver>=535,driver<536 brand=geforce,driver>=535,driver<536 brand=geforcertx,driver>=535,driver<536 brand=quadro,driver>=535,driver<536 brand=quadrortx,driver>=535,driver<536 brand=titan,driver>=535,driver<536 brand=titanrtx,driver>=535,driver<536
NCCL_MIN_NCHANNELS=4
NCCL_VERSION=2.20.5-1
NCCL_SOCKET_IFNAME=eth
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
VLLM_USE_MODELSCOPE=True
NCCL_IB_HCA=mlx5
NVIDIA_PRODUCT_NAME=CUDA
NCCL_IB_GID_INDEX=3
CUDA_VERSION=12.4.0
NCCL_IB_QPS_PER_CONNECTION=8
NCCL_IB_TIMEOUT=22
NCCL_IB_SL=5
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_CUMEM_ENABLE=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
CUDA_MODULE_LOADING=LAZY

🐛 Describe the bug

Use the example code in the doc and get this error

"""
This code generates responses from vLLM using the llava and llava-onevision
models with single-image input. I got CUDA out of memory at the second sample.
"""
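# Illustrative invocation (the script filename is hypothetical; the defaults for
# --model-type, --modality and --num-prompts come from parse_args() below):
#   python run_vlm_inference.py --model-type llava --modality image --num-prompts 1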
import setuptools.dist
import os
import random
from contextlib import contextmanager
from dataclasses import asdict
from typing import NamedTuple, Optional

from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

from vllm import LLM, EngineArgs, SamplingParams
from vllm.assets.image import ImageAsset
from vllm.assets.video import VideoAsset
from vllm.lora.request import LoRARequest
from vllm.utils import FlexibleArgumentParser
from vllm.multimodal.utils import fetch_image
from PIL import Image
import json
from tqdm import tqdm


def convert_image_mode(image: Image.Image, to_mode: str):
    if image.mode == to_mode:
        return image
    elif image.mode == "RGBA" and to_mode == "RGB":
        # Composite the RGBA image onto a white background before converting,
        # so transparent regions do not turn black.
        background = Image.new("RGBA", image.size, "WHITE")
        background.paste(image, mask=image)
        return background.convert("RGB")
    else:
        return image.convert(to_mode)

class ModelRequestData(NamedTuple):
    engine_args: EngineArgs
    prompts: list[str]
    stop_token_ids: Optional[list[int]] = None
    lora_requests: Optional[list[LoRARequest]] = None


# LLaVA-1.5
def run_llava(questions: list[str], modality: str) -> ModelRequestData:
    assert modality == "image"

    prompts = [
        f"USER: <image>\n{question}\nASSISTANT:" for question in questions
    ]
    # import pdb
    # pdb.set_trace()
    engine_args = EngineArgs(
        model="/home/xj_data/zhaohongbo/hf_model/llava15-7b-hf",
        max_model_len=4096,
        limit_mm_per_prompt={modality: 1},
    )

    return ModelRequestData(
        engine_args=engine_args,
        prompts=prompts,
    )


# LLaVA-OneVision
def run_llava_onevision(questions: list[str],
                        modality: str) -> ModelRequestData:
    if modality == "video":
        prompts = [
            f"<|im_start|>user <video>\n{question}<|im_end|> "
            f"<|im_start|>assistant\n" for question in questions
        ]

    elif modality == "image":
        prompts = [
            f"<|im_start|>user <image>\n{question}<|im_end|> "
            f"<|im_start|>assistant\n" for question in questions
        ]

    engine_args = EngineArgs(
        model="/home/xj_data/zhaohongbo/hf_model/llavaov-7b",
        max_model_len=16384,
        limit_mm_per_prompt={modality: 1},
    )

    return ModelRequestData(
        engine_args=engine_args,
        prompts=prompts,
    )




from typing import Union

def get_multi_modal_input(args, image_input: Union[str, list], question: str):
    """
    return {
        "data": image or video,
        "question": question,
    }
    """
    if args.modality == "image":
        # Input image and question
        image = convert_image_mode(Image.open(image_input), "RGB")
        img_questions = [question]  # should be a list

        return {
            "data": image,
            "questions": img_questions,
        }

    # if args.modality == "video":
    #     # Input video and question
    #     video = VideoAsset(name="baby_reading", num_frames=args.num_frames).np_ndarrays
    #     vid_questions = ["Why is this video funny?"]

    #     return {
    #         "data": video,
    #         "questions": vid_questions,
    #     }

    msg = f"Modality {args.modality} is not supported."
    raise ValueError(msg)


model_example_map = {
    "llava": run_llava,
    "llava-onevision": run_llava_onevision,
}


def apply_image_repeat(image_repeat_prob, num_prompts, data,
                       prompts: list[str], modality):
    """Repeats images with provided probability of "image_repeat_prob".
    Used to simulate hit/miss for the MM preprocessor cache.
    """
    assert image_repeat_prob <= 1.0 and image_repeat_prob >= 0
    no_yes = [0, 1]
    probs = [1.0 - image_repeat_prob, image_repeat_prob]

    inputs = []
    cur_image = data
    for i in range(num_prompts):
        if image_repeat_prob is not None:
            res = random.choices(no_yes, probs)[0]
            if res == 0:
                # No repeat => Modify one pixel
                cur_image = cur_image.copy()
                new_val = (i // 256 // 256, i // 256, i % 256)
                cur_image.putpixel((0, 0), new_val)

        inputs.append({
            "prompt": prompts[i % len(prompts)],
            "multi_modal_data": {
                modality: cur_image
            },
        })

    return inputs


@contextmanager
def time_counter(enable: bool):
    if enable:
        import time
        start_time = time.time()
        yield
        elapsed_time = time.time() - start_time
        # print("-" * 50)
        print("-- generate time = {}".format(elapsed_time))
        # print("-" * 50)
    else:
        yield


def parse_args():
    parser = FlexibleArgumentParser(
        description="Demo on using vLLM for offline inference with "
        "vision language models for text generation")
    parser.add_argument(
        "--model-type",
        "-m",
        type=str,
        default="llava",
        choices=model_example_map.keys(),
        help='Huggingface "model_type".',
    )
    parser.add_argument("--num-prompts",
                        type=int,
                        default=1,
                        help="Number of prompts to run.")
    parser.add_argument(
        "--modality",
        type=str,
        default="image",
        choices=["image", "video"],
        help="Modality of the input.",
    )
    parser.add_argument(
        "--num-frames",
        type=int,
        default=1,
        help="Number of frames to extract from the video.",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=None,
        help="Set the seed when initializing `vllm.LLM`.",
    )

    parser.add_argument(
        "--image-repeat-prob",
        type=float,
        default=None,
        help=
        "Simulates the hit-ratio for multi-modal preprocessor cache (if enabled)",
    )

    parser.add_argument(
        "--disable-mm-preprocessor-cache",
        action="store_true",
        help="If True, disables caching of multi-modal preprocessor/mapper.",
    )

    parser.add_argument(
        "--time-generate",
        action="store_true",
        help="If True, then print the total generate() call time",
    )

    parser.add_argument(
        "--use-different-prompt-per-request",
        action="store_true",
        help="If True, then use different prompt (with the same multi-modal "
        "data) for each request.",
    )
    return parser.parse_args()


# Read the JSON file
def read_json_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return json.load(file)


def main(args):
    model = args.model_type
    if model not in model_example_map:
        raise ValueError(f"Model type {model} is not supported.")

    modality = args.modality

    # MEGABench_core_json = read_json_file('LMUData/MEGABench_image_core.json')
    # MEGABench_open_json = read_json_file('LMUData/MEGABench_image_open.json')
    # debug_json = read_json_file('test_muliti_image.json')
    total_json = read_json_file('final_merged_question_list.json')
    data_json = total_json

    for item in tqdm(data_json):
        id = item['id']
        image_file = item['image']
        
        if isinstance(image_file, list) and model == 'llava':
            continue # just skip multi-image for llava

        if isinstance(image_file, str):
            image = f"{os.path.abspath(item['image'])}"
            num_images = 1
        elif isinstance(image_file, list):
            image = [f"{os.path.abspath(i)}" for i in image_file]
            num_images = len(image_file)
        else:
            raise AssertionError("image_file should be str or list")

        GT = item['GT']
        question = item['question']

        mm_input = get_multi_modal_input(args, image, question)
        # import pdb; pdb.set_trace()
        data = mm_input["data"]
        questions = mm_input["questions"]

        if num_images == 1:
            req_data = model_example_map[model](questions, modality)
        elif num_images > 1:
            if 'internvl' in model:
                model = 'internvl_chat_multi'
                req_data = model_example_map[model](questions, image)
            elif 'onevision' in model:
                model = 'llava-onevision' # TODO: ADD support for multiple images
                # req_data = model_example_map[model](questions, modality)
            else:
                raise ValueError(
                    f"Model {model} does not support multiple images.")
        # Disable other modalities to save memory
        default_limits = {"image": 0, "video": 0, "audio": 0}
        req_data.engine_args.limit_mm_per_prompt = default_limits | dict(
            req_data.engine_args.limit_mm_per_prompt or {})

        engine_args = asdict(req_data.engine_args) | {
            "seed": args.seed,
            "disable_mm_preprocessor_cache":
            args.disable_mm_preprocessor_cache,
        }
        llm = LLM(**engine_args)

        # Don't want to check the flag multiple times, so just hijack `prompts`.
        prompts = (req_data.prompts if args.use_different_prompt_per_request
                   else [req_data.prompts[0]])

        # We set temperature to 0.2 so that outputs can be different
        # even when all prompts are identical when running batch inference.
        sampling_params = SamplingParams(
            temperature=0.2,
            max_tokens=64,
            stop_token_ids=req_data.stop_token_ids)

        assert args.num_prompts > 0
        # import pdb; pdb.set_trace()
        if args.num_prompts == 1:
            # Single inference
            inputs = {
                "prompt": prompts[0],
                "multi_modal_data": {
                    modality: data
                },
            }
        else:
            # Batch inference
            if args.image_repeat_prob is not None:
                # Repeat images with specified probability of "image_repeat_prob"
                inputs = apply_image_repeat(args.image_repeat_prob,
                                            args.num_prompts, data, prompts,
                                            modality)
            else:
                # Use the same image for all prompts
                inputs = [{
                    "prompt": prompts[i % len(prompts)],
                    "multi_modal_data": {
                        modality: data
                    },
                } for i in range(args.num_prompts)]

        # Add LoRA request if applicable
        lora_request = (req_data.lora_requests *
                        args.num_prompts if req_data.lora_requests else None)

        with time_counter(args.time_generate):
            outputs = llm.generate(
                inputs,
                sampling_params=sampling_params,
                lora_request=lora_request,
            )

        # print("-" * 50)
        # import pdb; pdb.set_trace()
        for o in outputs:
            generated_text = o.outputs[0].text
            print(generated_text)
            # print("-" * 50)
        item['pred'] = generated_text
        item['response_model'] = args.model_type
        # break  # TODO: only process one sample
    ## create the output directory if it doesn't exist
    os.makedirs('Responses', exist_ok=True)
    # Save the updated JSON data to a results file
    with open(
            f'Responses/total_{args.model_type}_response.json',
            'w',
            encoding='utf-8') as file:
        json.dump(data_json, file, ensure_ascii=False, indent=4)

if __name__ == "__main__":
    args = parse_args()
    main(args)
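
A note on the code above: main() calls LLM(**engine_args) inside the per-item loop, so a new engine (model weights plus the pre-allocated KV cache) is constructed for every dataset item while the previous one may not yet have released its GPU memory. In case that is related to the OOM at the second sample, below is a minimal sketch of building the engine once and reusing it. It reuses convert_image_mode and model_example_map from the script above; run_all and its arguments are illustrative placeholders, not part of the original code.

# Minimal sketch (assumption: a single model type per run).
from dataclasses import asdict

from vllm import LLM, SamplingParams


def run_all(args, data_json):
    # Build the engine once, using the prompt/EngineArgs helper defined above.
    req_data = model_example_map[args.model_type](["placeholder"], args.modality)
    llm = LLM(**(asdict(req_data.engine_args) | {"seed": args.seed}))
    sampling_params = SamplingParams(temperature=0.2,
                                     max_tokens=64,
                                     stop_token_ids=req_data.stop_token_ids)

    for item in data_json:
        # Only the prompt and the image change per item; the LLM is reused.
        prompt = model_example_map[args.model_type]([item["question"]],
                                                    args.modality).prompts[0]
        image = convert_image_mode(Image.open(item["image"]), "RGB")
        outputs = llm.generate(
            {"prompt": prompt, "multi_modal_data": {"image": image}},
            sampling_params=sampling_params,
        )
        item["pred"] = outputs[0].outputs[0].text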

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
