# **vLLM: High-Performance LLM Serving**

## **What is vLLM?**

vLLM is an open-source library designed for high-throughput and memory-efficient serving of Large Language Models (LLMs). It implements optimized memory management techniques to maximize inference speed while minimizing resource usage.

## **Key Features**

- **PagedAttention**: Memory management technique that stores the attention key-value (KV) cache in fixed-size blocks, similar to virtual-memory paging, which reduces fragmentation and wasted GPU memory
- **Continuous batching**: Dynamically processes requests without waiting for batch completion
- **Tensor parallelism**: Distributes model weights across multiple GPUs
- **Quantization support**: Runs models with reduced-precision weights (e.g., AWQ, GPTQ, INT8, FP8)
- **OpenAI-compatible API**: Drop-in replacement for OpenAI's API
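
To make the PagedAttention idea concrete, here is a toy sketch of block-based KV-cache allocation. It is an illustration only, not vLLM's implementation: the class names, the pool size, and `BLOCK_SIZE = 16` are assumptions chosen for the example.

```python
# Toy model of paged KV-cache allocation (illustrative only, not vLLM's code).
BLOCK_SIZE = 16  # tokens per block; an assumed value for this sketch


class BlockAllocator:
    """Hands out fixed-size physical blocks from a shared pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache is full")
        return self.free_blocks.pop()

    def free(self, block_id: int) -> None:
        self.free_blocks.append(block_id)


class Sequence:
    """Tracks one request's token count and its logical-to-physical block table."""

    def __init__(self, allocator: BlockAllocator) -> None:
        self.allocator = allocator
        self.num_tokens = 0
        self.block_table: list[int] = []

    def append_token(self) -> None:
        # A new physical block is needed only when the last one is full.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def release(self) -> None:
        # Return every block to the pool so other requests can reuse it.
        for block_id in self.block_table:
            self.allocator.free(block_id)
        self.block_table.clear()
        self.num_tokens = 0


allocator = BlockAllocator(num_blocks=8)
seq = Sequence(allocator)
for _ in range(40):  # 40 tokens need ceil(40 / 16) = 3 blocks
    seq.append_token()
print(len(seq.block_table), len(allocator.free_blocks))  # 3 5
```

Because memory is claimed one block at a time, a request never reserves space for its maximum possible length up front, and at most one partially filled block per sequence is wasted.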

## **Performance Benefits**

- Up to 24x higher throughput than Hugging Face Transformers, as reported by the vLLM authors
- Significantly reduced latency for concurrent requests
- Efficient memory usage enabling larger context lengths
- Seamless scaling across multiple GPUs
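
The effect of continuous batching on concurrent requests can be illustrated with a small simulation. This is a simplified sketch, not vLLM's scheduler: it assumes every request needs a known number of decode steps and that each step produces one token per running request.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request's slot is refilled on the very next step."""
    waiting = deque(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step generates one token for every running request.
        running = [remaining - 1 for remaining in running]
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


lengths = [2, 8, 2, 8]  # decode steps needed by four requests
print(static_batching_steps(lengths, batch_size=2))      # 16
print(continuous_batching_steps(lengths, batch_size=2))  # 12
```

In the static case the short requests hold their slots idle until the long request in the same batch completes; in the continuous case the freed slot is reused immediately.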

## **Supported Architectures**

- Supports most popular model families:
  - Llama, Llama 2, Mistral, CodeLlama
  - Mixtral, Falcon, MPT, Gemma
  - Phi, Qwen, BLOOM, and more

## **Integration Options**

- **Python API**: Direct integration into Python applications
- **REST API**: OpenAI-compatible endpoint for language-agnostic use
- **Framework integrations**: Works with LangChain, LlamaIndex, etc.


## **vLLM can work with distributed GPUs in several ways:**

### **Tensor Parallelism:**

  - Splits model weights across multiple GPUs on a single machine
  - Configured using the `--tensor-parallel-size` parameter
  - Each layer's computation is distributed across GPUs
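
The idea of splitting a single layer across GPUs can be sketched with plain Python lists. This is a minimal illustration of column-wise sharding of one weight matrix, assuming two "devices"; real tensor parallelism runs on GPU tensors with collective communication, which is not shown.

```python
def matmul(x, w):
    """Multiply a vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_shards):
    """Give each shard a contiguous slice of the weight matrix's columns."""
    n = len(w[0])
    assert n % num_shards == 0, "columns must divide evenly across shards"
    width = n // num_shards
    return [[row[s * width:(s + 1) * width] for row in w] for s in range(num_shards)]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

shards = split_columns(w, num_shards=2)              # one shard per device
partial_outputs = [matmul(x, shard) for shard in shards]
# Concatenating the partial outputs plays the role of an all-gather.
gathered = [value for part in partial_outputs for value in part]
print(gathered == matmul(x, w))  # True
```

Each device holds only half of the weights yet the gathered result equals the unsharded computation.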


### **Pipeline Parallelism:**

  - Splits model layers across different GPUs
  - Differs from tensor parallelism, which splits each individual layer
  - Useful for extremely large models that do not fit on the GPUs of a single machine even with tensor parallelism; configured using the `--pipeline-parallel-size` parameter
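
As a rough sketch of how layers might be divided into pipeline stages, the helper below assigns contiguous layer ranges to each stage. The function name and the even-split policy are assumptions for illustration; vLLM's actual partitioning logic may differ.

```python
def partition_layers(num_layers, num_stages):
    """Assign contiguous layer ranges to stages, giving any remainder to the first stages."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# 32 layers over 4 stages (32 is used here as an example layer count).
stages = partition_layers(num_layers=32, num_stages=4)
print([(s.start, s.stop) for s in stages])  # [(0, 8), (8, 16), (16, 24), (24, 32)]
```

Each stage holds whole layers, and activations flow from one stage to the next, in contrast to tensor parallelism where every device takes part in every layer.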


### **Multi-Node Distributed Inference:**

  - vLLM supports distributing models across multiple machines
  - Uses Ray as the backend for distributed computing
  - Can combine both tensor and pipeline parallelism across nodes
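
A common way to combine the two strategies is tensor parallelism within each node and pipeline parallelism across nodes. The sketch below only computes the sizes and assembles a launch command using the `--tensor-parallel-size` and `--pipeline-parallel-size` flags; the helper names and the `/your/model/path` placeholder are hypothetical, and starting the Ray cluster itself is not shown.

```python
import shlex


def plan_parallelism(num_nodes: int, gpus_per_node: int) -> dict:
    """Tensor parallelism within a node, pipeline parallelism across nodes."""
    return {
        "tensor_parallel_size": gpus_per_node,
        "pipeline_parallel_size": num_nodes,
        "total_gpus": gpus_per_node * num_nodes,
    }


def server_command(model: str, plan: dict) -> str:
    """Build the server launch command as a shell-safe string."""
    args = [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--tensor-parallel-size", str(plan["tensor_parallel_size"]),
        "--pipeline-parallel-size", str(plan["pipeline_parallel_size"]),
    ]
    return shlex.join(args)


plan = plan_parallelism(num_nodes=2, gpus_per_node=8)
command = server_command("/your/model/path", plan)  # placeholder model path
print(plan["total_gpus"])  # 16
print(command)
```

The product of the two sizes should equal the total number of GPUs in the cluster.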

## **Usage Example**


In [None]:
# Starting vLLM server
# python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.3

# Using with OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response.choices[0].message.content)  # text of the model's reply
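
Because the endpoint speaks the OpenAI chat-completions format, the same request can be built with only the Python standard library. This is a minimal sketch: the helper names are hypothetical, and the server address matches the example above.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message, api_key="EMPTY"):
    """Build a POST request for the /chat/completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant's text out of a chat-completions response."""
    return response_body["choices"][0]["message"]["content"]


request = build_chat_request(
    "http://localhost:8000/v1",
    "mistralai/Mistral-7B-Instruct-v0.3",
    "Hello, how are you?",
)
# Sending the request requires a running vLLM server:
# with urllib.request.urlopen(request) as resp:
#     print(extract_reply(json.load(resp)))
```

This is what makes the REST API language-agnostic: any HTTP client that can send this JSON body can talk to the server.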

## **When to Use vLLM**

- Serving LLMs in production environments
- Running local models with near cloud-service performance
- Building applications that require high throughput
- Handling concurrent user requests efficiently

# **Running Local LLMs with vLLM and OpenAI API Compatibility**

This notebook documents the process of running local large language models using vLLM with OpenAI API compatibility.

## **Setup Process**

### **Installing vLLM**


First, ensure vLLM is installed with the appropriate CUDA version for your system:

In [None]:
!pip install vllm
# Alternative: also point pip at the PyTorch wheel index for your CUDA version
# (cu124 below means CUDA 12.4; replace it with the tag for your CUDA version)
# !pip install "vllm[triton]" --extra-index-url https://download.pytorch.org/whl/cu124
# For builds targeting other CUDA versions, see the vLLM installation guide
# Upgrade vLLM if needed
# !pip install --upgrade --quiet vllm
# To stop a running vLLM process, kill it by its process ID
# !kill -9 <PID>

## **LangChain Integration**

To use vLLM from LangChain, load the model in-process through the `VLLM` wrapper:

In [1]:
from langchain_community.llms import VLLM

llm = VLLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    trust_remote_code=True,  # only needed for Hugging Face models that ship custom code
    max_new_tokens=1000,
    top_k=10,
    top_p=0.95,
    temperature=0.8,
)

print(llm.invoke("What is the capital of France ?"))


INFO 04-24 16:08:55 [__init__.py:239] Automatically detected platform cuda.



INFO 04-24 16:09:08 [config.py:689] This model supports multiple tasks: {'reward', 'classify', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
INFO 04-24 16:09:09 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=8192.



INFO 04-24 16:09:11 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=mistralai/Mistral-7B-Instruct-v0.3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill

Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:03<00:06,  3.00s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:06<00:03,  3.20s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:09<00:00,  3.05s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:09<00:00,  3.07s/it]



INFO 04-24 16:09:25 [loader.py:458] Loading weights took 9.28 seconds
INFO 04-24 16:09:26 [gpu_model_runner.py:1291] Model loading took 13.5084 GiB and 12.519891 seconds
INFO 04-24 16:09:39 [backends.py:416] Using cache directory: /home/omjadhav/.cache/vllm/torch_compile_cache/eaff6eae66/rank_0_0 for vLLM's torch.compile
INFO 04-24 16:09:39 [backends.py:426] Dynamo bytecode transform time: 13.40 s
INFO 04-24 16:09:40 [backends.py:115] Directly load the compiled graph for shape None from the cache
INFO 04-24 16:09:52 [monitor.py:33] torch.compile takes 13.40 s in total
INFO 04-24 16:09:56 [kv_cache_utils.py:634] GPU KV cache size: 454,288 tokens
INFO 04-24 16:09:56 [kv_cache_utils.py:637] Maximum concurrency for 32,768 tokens per request: 13.86x
INFO 04-24 16:10:23 [gpu_model_runner.py:1626] Graph capturing finished in 28 secs, took 0.52 GiB
INFO 04-24 16:10:23 [core.py:163] init engine (profile, create kv cache, warmup model) took 57.75 seconds
INFO 04-24 16:10:24 [core_client.py:435] 

Processed prompts: 100%|██████████| 1/1 [00:03<00:00,  3.34s/it, est. speed input: 2.39 toks/s, output: 41.28 toks/s]



Paris is the capital of France. It is located in the north-central part of the country and is the most populous city in France, as well as the most populous city in the European Union. Paris is known for its iconic landmarks, such as the Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral. It is also a major cultural and artistic center, with a vibrant arts scene and a rich history. Paris is located on the Seine River and is known for its beautiful architecture, gardens, and parks. It is a popular tourist destination and is also a major hub for international trade and finance.
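
Two figures in the log above can be checked with simple arithmetic. The sketch below uses only the numbers printed in the log; the assumption is that the reported model-loading memory is dominated by bfloat16 weights (2 bytes per parameter).

```python
# Values copied from the engine log above.
kv_cache_tokens = 454_288   # "GPU KV cache size"
max_seq_len = 32_768        # tokens per request at full context length
weights_gib = 13.5084       # "Model loading took ... GiB"

# Maximum concurrency: how many full-length requests fit in the KV cache.
max_concurrency = kv_cache_tokens / max_seq_len
print(f"{max_concurrency:.2f}x")  # 13.86x

# Rough parameter count implied by the weight memory (bfloat16 = 2 bytes each).
bytes_per_param = 2
approx_params = weights_gib * 1024**3 / bytes_per_param
print(f"{approx_params / 1e9:.2f}B parameters")
```

The first result matches the `Maximum concurrency` line in the log. Requests shorter than the full context length use fewer cache blocks, so more of them can run at once in practice.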





## **Integrate the model in an LLMChain**

In [2]:
from langchain.chains import LLMChain
from langchain_core.prompts import PromptTemplate

template = """Question: {question}

Answer: Let's think step by step."""

prompt = PromptTemplate.from_template(template)

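# Note: LLMChain is deprecated in recent LangChain releases in favor of `prompt | llm`;
# it is kept here so the code matches the output shown below.
llm_chain = LLMChain(prompt=prompt, llm=llm)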

question = "Who was the US president in the year the first Pokemon game was released?"

print(llm_chain.invoke(question))

Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.62s/it, est. speed input: 12.23 toks/s, output: 73.38 toks/s]

{'question': 'Who was the US president in the year the first Pokemon game was released?', 'text': '\n\n1. The first Pokemon game, "Pokemon Red and Green," was released in Japan on February 27, 1996.\n\n2. To find out who the U.S. president was at that time, we need to know that the United States is 5 hours behind Japan. So, February 27, 1996, in the U.S., would be February 26, 1996.\n\n3. President Bill Clinton was the U.S. president from January 20, 1993, to January 20, 2001.\n\n4. Therefore, on February 26, 1996, Bill Clinton was the U.S. president.\n\nSo, the U.S. president in the year the first Pokemon game was released was President Bill Clinton.'}
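
To see exactly what text the chain sends to the model, the template can be rendered by hand. This sketch assumes the default f-string-style template format, which for a simple `{question}` placeholder behaves like Python's `str.format`.

```python
template = """Question: {question}

Answer: Let's think step by step."""

question = "Who was the US president in the year the first Pokemon game was released?"

# Fill the placeholder the same way the prompt template does for simple variables.
rendered_prompt = template.format(question=question)
print(rendered_prompt)
```

The model continues from the final line, which is why the generated `text` in the output above begins directly with the numbered reasoning steps.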





## **Distributed Inference**

vLLM supports distributed tensor-parallel inference and serving.

To run multi-GPU inference, set the `tensor_parallel_size` argument to the number of GPUs you want to use. For example, to run inference on 2 GPUs through the LangChain `VLLM` wrapper:

In [1]:
from langchain_community.llms import VLLM

llm = VLLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    tensor_parallel_size=2,
    trust_remote_code=True,  # only needed for Hugging Face models that ship custom code
)

llm.invoke("What is the future of AI?")


INFO 04-24 16:16:39 [__init__.py:239] Automatically detected platform cuda.



INFO 04-24 16:16:59 [config.py:689] This model supports multiple tasks: {'embed', 'reward', 'classify', 'generate', 'score'}. Defaulting to 'generate'.
INFO 04-24 16:17:00 [config.py:1713] Defaulting to use mp for distributed inference
INFO 04-24 16:17:00 [config.py:1901] Chunked prefill is enabled with max_num_batched_tokens=8192.



INFO 04-24 16:17:02 [core.py:61] Initializing a V1 LLM engine (v0.8.4) with config: model='mistralai/Mistral-7B-Instruct-v0.3', speculative_config=None, tokenizer='mistralai/Mistral-7B-Instruct-v0.3', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto,  device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='auto', reasoning_backend=None), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=None, served_model_name=mistralai/Mistral-7B-Instruct-v0.3, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill

Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:02<00:05,  2.73s/it]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:04<00:02,  2.02s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00,  1.72s/it]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:05<00:00,  1.88s/it]

(VllmWorker rank=1 pid=1106532) INFO 04-24 16:17:16 [loader.py:458] Loading weights took 5.00 seconds
(VllmWorker rank=0 pid=1106505) INFO 04-24 16:17:16 [loader.py:458] Loading weights took 5.72 seconds
(VllmWorker rank=1 pid=1106532) INFO 04-24 16:17:16 [gpu_model_runner.py:1291] Model loading took 6.7584 GiB and 8.649958 seconds
(VllmWorker rank=0 pid=1106505) INFO 04-24 16:17:16 [gpu_model_runner.py:1291] Model loading took 6.7584 GiB and 8.827003 seconds
(VllmWorker rank=0 pid=1106505) INFO 04-24 16:17:25 [backends.py:416] Using cache directory: /home/omjadhav/.cache/vllm/torch_compile_cache/778a2b3212/rank_0_0 for vLLM's torch.compile
(VllmWorker rank=0 pid=1106505) INFO 04-24 16:17:25 [backends.py:426] Dynamo bytecode transform time: 8.82 s
(VllmWorker rank=1 pid=1106532) INFO 04-24 16:17:25 [backends.py:416] Using cache directory: /home/omjadhav/.cache/vllm/torch_compile_cache/778a2b3212/rank_1_0 for vLLM's torch.compile
(VllmWorker rank=1 pid=1106532) INFO 04-24 16:17:25 [backends.py:426

Processed prompts: 100%|██████████| 1/1 [00:04<00:00,  4.80s/it, est. speed input: 1.67 toks/s, output: 106.78 toks/s]


' Are we headed towards a future where machines will take over the world and robots will enslave our race? Or will it be a bright future where AI, being our close companions, will help us achieving things we never thought were possible? Let’s discuss the surprising predictions made by experts and entrepreneurs. There are two possible futures for AI: optimization or superintelligence.\n\nOptimization AI is what we already have today. It’s the AI that powers our cars, our smartphones, and the Alexas in our living rooms. It’s designed to respond to specific tasks and, as you train it with more information, it gets better at completing those tasks. It’s great at playing chess, winning Jeopardy!, and analyzing medicine, but it’s not going to take over the world.\n\nSuperintelligence AI would be different. This AI would have human-level intelligence and the ability to continue learning and improving without human intervention. While we are just starting to see the beginning of superintellige
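
The worker logs for the two-GPU run show how tensor parallelism divides the weights. The short check below compares the per-worker figure with half of the single-GPU figure from the earlier run; both numbers come from the `Model loading took` log lines, and the logs do not explain the small remainder.

```python
single_gpu_gib = 13.5084    # "Model loading took" with tensor_parallel_size=1
per_rank_gib = 6.7584       # reported by each worker with tensor_parallel_size=2
tensor_parallel_size = 2

# With a perfectly even split, each rank would hold this much.
ideal_per_rank_gib = single_gpu_gib / tensor_parallel_size
overhead_gib = per_rank_gib - ideal_per_rank_gib

print(f"ideal {ideal_per_rank_gib:.4f} GiB, reported {per_rank_gib:.4f} GiB")
print(f"difference {overhead_gib:.4f} GiB")
```

Each GPU holds roughly half of the model, which frees the rest of its memory for the KV cache.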

## **Quantization**

vLLM supports AWQ quantization. To enable it, pass `quantization` in `vllm_kwargs`.

In [None]:
from langchain_community.llms import VLLM

llm_q = VLLM(
    model="TheBloke/Llama-2-7b-Chat-AWQ",  # requires a checkpoint that was quantized with AWQ
    trust_remote_code=True,
    max_new_tokens=512,
    vllm_kwargs={"quantization": "awq"},  # raises an error if the model is not AWQ-quantized
)
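
A back-of-the-envelope estimate shows why 4-bit weights matter on memory-limited GPUs. This sketch assumes a round 7 billion parameters and ignores quantization metadata (scales and zero points) and any layers left unquantized, so real checkpoints are somewhat larger.

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to store the weights alone."""
    return num_params * bits_per_weight / 8 / 1024**3


num_params = 7e9  # rough parameter count for a 7B model (assumption)
fp16_gib = weight_memory_gib(num_params, 16)
int4_gib = weight_memory_gib(num_params, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {int4_gib:.1f} GiB")
```

The memory saved on weights becomes available for the KV cache, which allows longer contexts or more concurrent requests on the same GPU.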

## **OpenAI-Compatible Server**

vLLM can be deployed as a server that implements the OpenAI API protocol, so it can act as a drop-in replacement in applications that use the OpenAI API. The server accepts requests in the same format as the OpenAI API.

In [None]:
# First, start the server (preferably in a terminal, since the command blocks until the server stops).
# Flags used below; add or remove them as needed:
#   --model                   path to a local model, or a Hugging Face model name (downloads are cached under ~/.cache/huggingface)
#   --tensor-parallel-size    number of GPUs to use for tensor parallelism
#   --max-model-len           maximum context length (prompt + response) in tokens
#   --disable-log-requests    turns off per-request logging to reduce logging overhead
#   --gpu-memory-utilization  fraction of each GPU's memory vLLM may use (lower it if you hit out-of-memory errors)
#   --max-num-batched-tokens  maximum number of tokens processed per scheduler step (tune for your workload)
#   --enforce-eager           disables CUDA graph capture: faster startup and less memory, usually at some cost in speed
#   --seed                    random seed for reproducible runs

!python -m vllm.entrypoints.openai.api_server \
  --model /your/model/path \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --disable-log-requests \
  --gpu-memory-utilization 0.95 \
  --max-num-batched-tokens 4096 \
  --enforce-eager \
  --seed 42

# Then check that the server is running and get the served model name
!curl http://localhost:8000/v1/models
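
The same readiness check can be done from Python. The sketch below assumes the server returns the standard OpenAI list format (`{"object": "list", "data": [{"id": ...}]}`); the helper names are hypothetical, and the sample payload is made up for illustration.

```python
import json
import urllib.request


def parse_model_ids(models_response: dict) -> list[str]:
    """Extract model names from a /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def list_served_models(base_url: str = "http://localhost:8000/v1", timeout: float = 5.0) -> list[str]:
    """Query a running server; raises URLError if the server is not reachable."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return parse_model_ids(json.load(resp))


# Illustrative payload in the OpenAI list format (not captured from a real server).
sample = {
    "object": "list",
    "data": [{"id": "mistralai/Mistral-7B-Instruct-v0.3", "object": "model"}],
}
print(parse_model_ids(sample))  # ['mistralai/Mistral-7B-Instruct-v0.3']
```

The returned name is the value to pass as `model` (or `model_name` in LangChain) when sending requests to the server.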


In [16]:
from langchain_community.llms import VLLMOpenAI

llm = VLLMOpenAI(
    openai_api_key="EMPTY",
    openai_api_base="http://localhost:8000/v1",
    model_name="mistralai/Mistral-7B-Instruct-v0.3",  # must match the model name reported by /v1/models
    model_kwargs={"stop": ["."]},
)

print(llm.invoke("Eiffel Tower is ")) 

330 meters high and is one of the most famous landmarks in the world
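
The output above ends without a period because `"."` was passed as a stop sequence, and the stop sequence itself is not included in the returned text. The function below is a toy illustration of that resulting behavior; the server stops generating at the stop sequence rather than trimming text afterwards.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Return text up to (but excluding) the earliest stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


full_text = "330 meters high. It was completed in 1889."
print(truncate_at_stop(full_text, ["."]))  # 330 meters high
```

Stop sequences are a simple way to limit a completion to a single sentence or line.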


## **LoRA adapter**

LoRA adapters can be used with any vLLM model that implements `SupportsLoRA`.

In [None]:
from langchain_community.llms import VLLM
from vllm.lora.request import LoRARequest

llm = VLLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    max_new_tokens=300,
    top_k=1,
    top_p=0.90,
    temperature=0.1,
    vllm_kwargs={
        "gpu_memory_utilization": 0.5,
        "enable_lora": True,
        "max_model_len": 350,
    },
)

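LORA_ADAPTER_PATH = "path/to/adapter"  # placeholder: directory containing the LoRA adapter weights
# LoRARequest arguments: adapter name, unique integer ID, adapter path
lora_adapter = LoRARequest("lora_adapter", 1, LORA_ADAPTER_PATH)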

print(
    llm.invoke("What are some popular Korean street foods?", lora_request=lora_adapter)
)
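
Passing the adapter per request works because a LoRA adapter is a small low-rank correction added on top of the frozen base weights. The sketch below shows only the underlying math on tiny hand-picked matrices; it is not how vLLM implements it, and all names and values are illustrative.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def forward(x, base_weight, adapter=None):
    """Base layer output, plus the low-rank LoRA correction if an adapter is given."""
    y = matvec(base_weight, x)
    if adapter is not None:
        A, B, scale = adapter  # A: r x d_in, B: d_out x r, scale = alpha / r
        correction = matvec(B, matvec(A, x))
        y = [y_i + scale * c_i for y_i, c_i in zip(y, correction)]
    return y


W = [[1.0, 0.0],
     [0.0, 1.0]]        # frozen base weight (d_out = d_in = 2)
A = [[3.0, 4.0]]        # rank r = 1
B = [[1.0],
     [2.0]]
scale = 2.0             # alpha / r with alpha = 2, r = 1
x = [1.0, 1.0]

print(forward(x, W))                    # [1.0, 1.0]
print(forward(x, W, (A, B, scale)))     # [15.0, 29.0]
```

Since the base weights are untouched, different requests can use different adapters against the same loaded model.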

## **Best Practices**

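1. **Memory Management**: Monitor GPU memory usage with tools like `nvidia-smi`
2. **Batch Size Tuning**: Adjust the batching limits (`--max-num-seqs` and `--max-num-batched-tokens`) for optimal throughput when serving many concurrent requests
3. **Quantization**: Consider quantized models (such as 4-bit or 8-bit) to fit larger models into limited GPU memory
4. **Caching**: vLLM's automatic prefix caching reuses the KV cache for shared prompt prefixes (`enable_prefix_caching=True` in the logs above); for identical repeated queries, add response caching at the application level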
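
For the memory-monitoring practice, `nvidia-smi` can emit machine-readable CSV that is easy to parse. The sketch below assumes the `--query-gpu` and `--format=csv,noheader,nounits` options; the helper names are hypothetical, and the sample string is made up so the parser can be tried without a GPU.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[dict]:
    """Parse 'index, used MiB, total MiB' lines into dictionaries."""
    usage = []
    for line in csv_text.strip().splitlines():
        index, used, total = (field.strip() for field in line.split(","))
        usage.append({
            "gpu": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
            "utilization": int(used) / int(total),
        })
    return usage


def read_gpu_memory() -> list[dict]:
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


sample_output = "0, 13824, 24576\n1, 512, 24576\n"  # illustrative values
for gpu in parse_gpu_memory(sample_output):
    print(f"GPU {gpu['gpu']}: {gpu['utilization']:.0%} used")
```

Polling this before and after loading a model gives a quick view of how much memory the weights and KV cache occupy on each GPU.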
