##### Master Degree in Computer Science and Data Science for Economics

# vLLM

### Sergio Picascia

[vLLM](https://github.com/vllm-project/vllm) is a fast and easy-to-use library for LLM inference and serving.

In [1]:
%pip install vllm



vLLM is an open-source library that provides a fast, memory-efficient **serving engine** for large language models (LLMs); it is not a model itself. It is designed to maximize throughput, i.e. the number of requests (or tokens) processed per unit of time on a given infrastructure, which is a major challenge because LLMs are large and computationally demanding.

It achieves this speed-up by using a few key techniques:

* PagedAttention: This is the core innovation of vLLM. It is an attention algorithm inspired by virtual memory and paging in operating systems. It manages the Key-Value (KV) cache more efficiently by dividing it into fixed-size blocks. This prevents memory waste and allows for better resource utilization, especially when serving multiple requests in parallel with varying sequence lengths.

* Continuous Batching: Unlike traditional batching, which waits for all sequences in a batch to finish before starting a new one, continuous batching allows new sequences to enter the batch as soon as a sequence finishes. This keeps the GPU busy and improves overall throughput.

* Optimized CUDA Kernels: vLLM uses custom, highly optimized CUDA kernels to accelerate the attention calculation and other parts of the LLM pipeline.
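The first two ideas can be illustrated with a toy model in plain Python. This is only a sketch under simplifying assumptions, not vLLM's actual implementation: the block size of 16 tokens, the pre-allocation length of 2048, the sequence lengths, and all function names are hypothetical values chosen for illustration. The first part compares how many KV-cache slots are reserved with contiguous pre-allocation versus fixed-size blocks; the second counts decoding steps under static and continuous batching.

```python
import math

BLOCK_SIZE = 16  # tokens per KV-cache block (illustrative value, not read from vLLM)


def contiguous_slots(seq_lens, max_len):
    """Slots reserved when every sequence pre-allocates room for max_len tokens."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size=BLOCK_SIZE):
    """Slots reserved when each sequence holds only the fixed-size blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence has finished."""
    return sum(
        max(lengths[i : i + batch_size]) for i in range(0, len(lengths), batch_size)
    )


def continuous_batching_steps(lengths, batch_size):
    """A freed slot is refilled as soon as a sequence finishes."""
    pending = list(lengths)
    running = []  # remaining decode steps of the sequences currently in the batch
    steps = 0
    while pending or running:
        # Admit waiting sequences while there is room in the batch.
        while pending and len(running) < batch_size:
            running.append(pending.pop(0))
        steps += 1
        # Every running sequence produces one token; finished ones leave the batch.
        running = [r - 1 for r in running if r > 1]
    return steps


# Part 1: memory reserved for four requests of very different lengths.
seq_lens = [37, 512, 120, 9]
used = sum(seq_lens)                                            # 678 tokens stored
reserved_contiguous = contiguous_slots(seq_lens, max_len=2048)  # 8192 slots
reserved_paged = paged_slots(seq_lens)                          # 704 slots
print(f"contiguous waste: {reserved_contiguous - used} slots")
print(f"paged waste:      {reserved_paged - used} slots")

# Part 2: decode steps for four requests with a batch of two.
gen_lens = [8, 2, 2, 8]
print("static batching steps:    ", static_batching_steps(gen_lens, batch_size=2))
print("continuous batching steps:", continuous_batching_steps(gen_lens, batch_size=2))
```

In this toy setting the paged layout wastes at most one partially filled block per sequence, and continuous batching finishes the same work in fewer steps because short requests no longer wait for the longest one in their batch.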

In [2]:
from vllm import LLM, SamplingParams
%pip install qwen_vl_utils  # helper utilities for preparing Qwen-VL image/video inputs
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor
import torch

INFO 08-31 14:07:08 [__init__.py:241] Automatically detected platform cuda.
Collecting qwen_vl_utils
  Downloading qwen_vl_utils-0.0.11-py3-none-any.whl.metadata (6.3 kB)
Collecting av (from qwen_vl_utils)
  Downloading av-15.1.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (4.6 kB)
Downloading qwen_vl_utils-0.0.11-py3-none-any.whl (7.6 kB)
Downloading av-15.1.0-cp312-cp312-manylinux_2_28_x86_64.whl (39.9 MB)
Installing collected packages: av, qwen_vl_utils
Successfully installed av-15.1.0 qwen_vl_utils-0.0.11


In [3]:
DEVICE = (
    "cuda:0"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)
MODEL_PATH = "Qwen/Qwen2.5-VL-7B-Instruct-AWQ"
DEVICE

'cuda:0'

In [None]:
model = LLM(
    model=MODEL_PATH,
    enforce_eager=False,  # keep CUDA graph capture enabled (True would force eager-mode PyTorch)
    max_model_len=8192,  # maximum context length (prompt + generated tokens) per request
    # device=DEVICE,  # not needed: vLLM detects the platform automatically
    gpu_memory_utilization=0.3,  # fraction of GPU memory the engine may use (weights, activations, KV cache)
    limit_mm_per_prompt={"video": 0, "image": 1},  # maximum number of multi-modal inputs per prompt
    max_num_batched_tokens=2048,  # maximum number of tokens processed in one scheduler step
    max_num_seqs=64,  # maximum number of sequences processed in one scheduler step
    enable_prefix_caching=True,  # reuse KV-cache blocks across requests that share a prompt prefix
    enable_chunked_prefill=True,  # split long prefills into chunks that can be batched with decode steps
)
processor = AutoProcessor.from_pretrained(MODEL_PATH, use_fast=True)
sampling_params = SamplingParams(
    temperature=0,  # greedy decoding: the model always picks the most likely next token
    max_tokens=4096,  # maximum number of tokens to generate in the output
)
processor

INFO 08-31 14:07:34 [utils.py:326] non-default args: {'model': 'Qwen/Qwen2.5-VL-7B-Instruct-AWQ', 'max_model_len': 8192, 'enable_prefix_caching': True, 'gpu_memory_utilization': 0.3, 'max_num_batched_tokens': 2048, 'max_num_seqs': 64, 'disable_log_stats': True, 'limit_mm_per_prompt': {'video': 0, 'image': 1}, 'enable_chunked_prefill': True}


INFO 08-31 14:08:11 [__init__.py:711] Resolved architecture: Qwen2_5_VLForConditionalGeneration
INFO 08-31 14:08:11 [__init__.py:1750] Using max model len 8192
INFO 08-31 14:08:15 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 08-31 14:08:15 [llm_engine.py:222] Initializing a V0 LLM engine (v0.10.1.1) with config: model='Qwen/Qwen2.5-VL-7B-Instruct-AWQ', speculative_config=None, tokenizer='Qwen/Qwen2.5-VL-7B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config={}, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_backend=

INFO 08-31 14:08:20 [cuda.py:384] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-31 14:08:20 [cuda.py:433] Using XFormers backend.
INFO 08-31 14:08:21 [parallel_state.py:1134] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
INFO 08-31 14:08:21 [model_runner.py:1080] Starting to load model Qwen/Qwen2.5-VL-7B-Instruct-AWQ...
INFO 08-31 14:08:22 [weight_utils.py:296] Using model weights format ['*.safetensors']


model-00001-of-00002.safetensors:   0%|          | 0.00/3.98G [00:00<?, ?B/s]
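To get a feeling for how `max_model_len` and `gpu_memory_utilization` interact, the following back-of-the-envelope sketch estimates the KV-cache footprint of one sequence. The architecture values (28 layers, 4 key-value heads, head dimension 128) are assumptions about the Qwen2.5-7B language backbone and should be checked against the model's `config.json`; the 2-byte element size matches the `dtype=torch.float16` reported in the log above. The helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache needed for one token.

    The leading factor 2 accounts for storing one key and one value
    vector per KV head in every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(budget_bytes, bytes_per_token):
    """Number of tokens whose KV entries fit into a given memory budget."""
    return budget_bytes // bytes_per_token


# Assumed architecture values (verify against config.json before relying on them).
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
per_sequence = per_token * 8192  # one sequence at max_model_len=8192

print(f"KV cache per token:    {per_token / 1024:.0f} KiB")
print(f"KV cache per sequence: {per_sequence / 2**20:.0f} MiB")
print(f"tokens that fit in 1 GiB: {max_cached_tokens(2**30, per_token)}")
```

Under these assumptions a full 8192-token sequence needs about 448 MiB of KV cache, so the memory left after loading the weights determines how many long sequences can actually be held in parallel, regardless of `max_num_seqs`.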

In [None]:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://redpandanetwork.org/get/files/image/galleries/28138502587_a0a020ae9a_k.jpeg",
                "min_pixels": 224 * 224,
                "max_pixels": 1280 * 28 * 28,
            },
            {"type": "text", "text": "Describe the image."},
        ],
    },
]
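The `min_pixels` and `max_pixels` bounds control how far the image is resized, and therefore roughly how many visual tokens it contributes to the prompt. The sketch below assumes that one visual token corresponds to a 28 × 28 pixel patch, which is how the `1280 * 28 * 28` convention is usually read for Qwen-VL models; treat that constant as an assumption rather than a guarantee.

```python
PIXELS_PER_VISUAL_TOKEN = 28 * 28  # assumption: one visual token per 28x28 pixel patch

min_pixels = 224 * 224
max_pixels = 1280 * 28 * 28

min_visual_tokens = min_pixels // PIXELS_PER_VISUAL_TOKEN
max_visual_tokens = max_pixels // PIXELS_PER_VISUAL_TOKEN

print(f"image resized to between {min_pixels} and {max_pixels} pixels")
print(f"approx. visual tokens: {min_visual_tokens} to {max_visual_tokens}")
```

With these bounds a single image would occupy at most about 1280 of the 8192 tokens allowed by `max_model_len`, leaving the rest for the text prompt and the generated answer.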

In [None]:
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, _ = process_vision_info(messages)
llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": {"image": image_inputs},
}

In [None]:
output = model.generate([llm_inputs], sampling_params=sampling_params)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Processed prompts: 100%|██████████| 1/1 [00:01<00:00,  1.14s/it, est. speed input: 1117.17 toks/s, output: 99.24 toks/s]


In [None]:
output[0].outputs[0].text

"The image shows a red panda, also known as a lesser panda or a red cat-bear, perched on a tree branch. The red panda has a distinctive coat with a mix of black, white, and reddish-brown fur. Its face is predominantly white with a black nose and a white muzzle. The red panda's ears are upright and rounded, and its eyes are dark and expressive. The background is a blurred green, suggesting a natural, forested environment. The red panda appears to be in a relaxed state, possibly observing its surroundings."
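Since `model.generate` accepts a list of inputs and returns one result per prompt, it is often convenient to collect the generated text together with some metadata. The helper below is a small sketch: `summarize` is a hypothetical name, and it relies only on the `outputs`, `text`, `token_ids`, and `finish_reason` attributes of vLLM's result objects. So that the snippet runs without a GPU, it is demonstrated on a stand-in object built with `SimpleNamespace`, not on a real vLLM result.

```python
from types import SimpleNamespace


def summarize(request_outputs):
    """Collect text, token count, and finish reason of the first completion per request."""
    rows = []
    for request_output in request_outputs:
        completion = request_output.outputs[0]
        rows.append(
            {
                "text": completion.text,
                "num_tokens": len(completion.token_ids),
                "finish_reason": completion.finish_reason,
            }
        )
    return rows


# Stand-in object mimicking the attributes used above (not a real vLLM result).
fake_output = SimpleNamespace(
    outputs=[
        SimpleNamespace(
            text="A red panda on a branch.",
            token_ids=[1, 2, 3, 4, 5, 6],
            finish_reason="stop",
        )
    ]
)

summary = summarize([fake_output])
print(summary)
```

In the notebook, the same helper could be called as `summarize(output)`; a `finish_reason` of `"length"` would indicate that the answer was cut off by `max_tokens`.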