# **vLLM: High-Performance LLM Serving**

## **What is vLLM?**

vLLM is an open-source library designed for high-throughput and memory-efficient serving of Large Language Models (LLMs). It implements optimized memory management techniques to maximize inference speed while minimizing resource usage.

## **Key Features**

- **PagedAttention**: Stores the attention key-value (KV) cache in fixed-size blocks, much like paged virtual memory in an operating system, which reduces fragmentation and wasted GPU memory
- **Continuous batching**: Dynamically processes requests without waiting for batch completion
- **Tensor parallelism**: Distributes model weights across multiple GPUs
- **Quantization support**: Runs models in reduced-precision formats (for example GPTQ, AWQ, INT8, FP8)
- **OpenAI-compatible API**: Drop-in replacement for OpenAI's API
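
To make the PagedAttention idea concrete, here is a toy sketch of block-based KV-cache bookkeeping. It is not vLLM's implementation: the class name `ToyBlockAllocator`, its method names, and the block size of 4 tokens are assumptions chosen for illustration. The point it shows is that each sequence holds a list of fixed-size blocks (a block table) instead of one contiguous buffer, so memory is claimed only as tokens arrive and is returned as soon as a sequence finishes.

```python
class ToyBlockAllocator:
    """Toy paged KV-cache bookkeeping: maps each sequence to a list of block ids."""

    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # ids of unused physical blocks
        self.block_tables = {}  # sequence id -> list of block ids
        self.num_tokens = {}    # sequence id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Record one more token for seq_id, allocating a block only when needed."""
        count = self.num_tokens.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if count % self.block_size == 0:  # no block yet, or the last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.num_tokens[seq_id] = count + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.num_tokens.pop(seq_id, None)


allocator = ToyBlockAllocator(num_blocks=8, block_size=4)
for _ in range(5):
    allocator.append_token("request-a")  # 5 tokens need 2 blocks
for _ in range(3):
    allocator.append_token("request-b")  # 3 tokens need 1 block
print(allocator.block_tables)

allocator.free("request-a")  # request-a's blocks become reusable immediately
print(len(allocator.free_blocks))
```

Because blocks are small and interchangeable, the memory wasted per sequence is at most one partially filled block, which is what lets more requests share the same GPU.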

## **Performance Benefits**

- Up to 24x higher throughput than Hugging Face Transformers, according to the vLLM project's own benchmarks
- Significantly reduced latency for concurrent requests
- Efficient memory usage enabling larger context lengths
- Seamless scaling across multiple GPUs
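
The link between memory efficiency and context length can be estimated with simple arithmetic. The sketch below assumes a standard transformer KV cache (one key and one value vector per layer, per KV head, per token) and uses example dimensions of 32 layers, 32 KV heads, head size 128, and 2-byte FP16 values. Those numbers and the 10 GiB budget are illustrative assumptions, not measurements of any particular deployment.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element=2):
    """Bytes of KV cache needed for one token: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def max_cached_tokens(kv_budget_bytes, per_token_bytes):
    """How many tokens fit in the memory left over for the KV cache."""
    return kv_budget_bytes // per_token_bytes


per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
budget = 10 * 1024**3  # assume 10 GiB remain for the KV cache after loading weights

print(per_token / 1024**2)                   # MiB of cache per token
print(max_cached_tokens(budget, per_token))  # tokens shared by all running requests
```

With these assumed numbers each token costs 0.5 MiB, so a 10 GiB budget holds 20480 tokens in total across all concurrent requests.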

## **Supported Architectures**

- Supports most popular model families:
  - Llama, Llama 2, Mistral, CodeLlama
  - Mixtral, Falcon, MPT, Gemma
  - Phi, Qwen, BLOOM, and more

## **Integration Options**

- **Python API**: Direct integration into Python applications
- **REST API**: OpenAI-compatible endpoint for language-agnostic use
- **Framework integrations**: Works with LangChain, LlamaIndex, etc.
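
For the Python API, a minimal offline-inference sketch might look like the following. `LLM`, `SamplingParams`, and `generate` come from vLLM's Python interface, and the model name is the one used in the Usage Example section. The helper name `generate_offline` and the sampling values are assumptions for illustration. The import sits inside the function so that defining the helper does not itself require vLLM or a GPU.

```python
def generate_offline(prompts, model="mistralai/Mistral-7B-Instruct-v0.2", max_tokens=128):
    """Generate one completion per prompt with vLLM's in-process engine."""
    # Imported lazily: this needs vLLM installed and a supported GPU.
    from vllm import LLM, SamplingParams

    sampling = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=max_tokens)
    llm = LLM(model=model)  # loads the weights, which can take a while
    outputs = llm.generate(prompts, sampling)
    # Each result holds one or more candidate completions; take the first.
    return [output.outputs[0].text for output in outputs]
```

Calling `generate_offline(["Hello, how are you?"])` on a machine with vLLM installed would return a list with one generated string. For a long-running service, the REST API is usually the better fit because the model is loaded once and shared by all clients.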


## **vLLM can work with distributed GPUs in several ways:**

### **Tensor Parallelism:**

  - Splits model weights across multiple GPUs on a single machine
  - Configured using the `--tensor-parallel-size` parameter
  - Each layer's computation is distributed across GPUs
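
The idea of splitting one layer's computation can be shown with a toy, pure-Python matrix multiply. This is a conceptual sketch only: real tensor parallelism runs on GPUs with collective communication, and the helper names `matmul` and `split_columns` are assumptions for illustration. Each shard holds a slice of the weight matrix's columns, computes its part of the output, and the parts are concatenated.

```python
def matmul(x, w):
    """Multiply a (rows x k) matrix by a (k x cols) matrix, both as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, num_shards):
    """Split the columns of w into num_shards equal contiguous shards."""
    cols = len(w[0])
    if cols % num_shards != 0:
        raise ValueError("column count must be divisible by the number of shards")
    width = cols // num_shards
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(num_shards)]


x = [[1, 2], [3, 4]]              # activations: 2 tokens, hidden size 2
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # layer weights: 2 x 4

shards = split_columns(w, num_shards=2)            # one weight shard per "GPU"
partials = [matmul(x, shard) for shard in shards]  # each shard computes its slice
combined = [sum((p[r] for p in partials), []) for r in range(len(x))]

print(combined)
print(combined == matmul(x, w))  # sharded result matches the unsharded one
```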


### **Pipeline Parallelism:**

  - Splits model layers across different GPUs
  - Different from tensor parallelism which splits individual layers
  - Useful for very large models that do not fit on a single node even with tensor parallelism
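
As a rough illustration of splitting by layer, the sketch below assigns contiguous layer ranges to pipeline stages. The function name `partition_layers` and the even-split policy are assumptions for illustration; how vLLM actually assigns layers to stages may differ.

```python
def partition_layers(num_layers, num_stages):
    """Assign contiguous layer ranges to pipeline stages, as evenly as possible."""
    if not 1 <= num_stages <= num_layers:
        raise ValueError("need between 1 and num_layers stages")
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for stage in range(num_stages):
        size = base + (1 if stage < extra else 0)  # early stages absorb the remainder
        stages.append(range(start, start + size))
        start += size
    return stages


for stage, layers in enumerate(partition_layers(num_layers=32, num_stages=4)):
    print(f"stage {stage}: layers {layers.start}-{layers.stop - 1}")
```

Each stage holds complete layers, so only the activations between consecutive stages have to move from one GPU to the next.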


### **Multi-Node Distributed Inference:**

  - vLLM supports distributing models across multiple machines
  - Uses Ray as the backend for distributed computing
  - Can combine both tensor and pipeline parallelism across nodes
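
When both forms of parallelism are combined, the number of GPUs required is the product of the two sizes. A common starting point, treated here as an assumption to verify against the vLLM documentation for your version, is to use tensor parallelism within a node and pipeline parallelism across nodes. The helper `plan_parallelism` is hypothetical. The two flags it prints are vLLM server options.

```python
def plan_parallelism(gpus_per_node, num_nodes):
    """Suggest parallel sizes: tensor parallel inside a node, pipeline across nodes."""
    if gpus_per_node < 1 or num_nodes < 1:
        raise ValueError("gpus_per_node and num_nodes must be positive")
    return {
        "tensor_parallel_size": gpus_per_node,
        "pipeline_parallel_size": num_nodes,
        "total_gpus": gpus_per_node * num_nodes,
    }


plan = plan_parallelism(gpus_per_node=8, num_nodes=2)
flags = [
    "--tensor-parallel-size", str(plan["tensor_parallel_size"]),
    "--pipeline-parallel-size", str(plan["pipeline_parallel_size"]),
]
print(" ".join(flags))
print(plan["total_gpus"])
```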

## **Usage Example**


In [None]:
# Starting vLLM server
# python -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-Instruct-v0.2

# Using with OpenAI client
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)
print(response.choices[0].message.content)

## **When to Use vLLM**

- Serving LLMs in production environments
- Running local models with near cloud-service performance
- Building applications that require high throughput
- Handling concurrent user requests efficiently

# **Running Local LLMs with vLLM and OpenAI API Compatibility**

This notebook documents the process of running local large language models using vLLM with OpenAI API compatibility.

## **Setup Process**

### 1. **Installing vLLM**


First, ensure vLLM is installed with the appropriate CUDA version for your system:

In [None]:
! pip install vllm
# Optional: install against a specific CUDA build of PyTorch (replace cu124 with your CUDA version)
! pip install "vllm[triton]" --extra-index-url https://download.pytorch.org/whl/cu124
# For other CUDA versions, see the installation guide in the vLLM documentation

### 2. **Starting the vLLM Server**

Launch the vLLM server with the following command to utilize multiple GPUs:

In [None]:
# Set --tensor-parallel-size to the number of GPUs to use
!python -m vllm.entrypoints.openai.api_server \
  --model /path/to/model/snapshots/snapshotID \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.9

Parameters explained:
- `--model`: Path to the model snapshot
- `--tensor-parallel-size`: Number of GPUs to use in parallel
- `--gpu-memory-utilization`: Portion of GPU memory to utilize (0.9 = 90%)
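
If the server is launched from a script rather than a notebook cell, the same command can be assembled and validated in Python. This is a sketch under assumptions: the helper name `build_server_command` and its validation rules are illustrative, while the module path and flags are the ones shown above.

```python
import shlex


def build_server_command(model_path, tensor_parallel_size=1, gpu_memory_utilization=0.9):
    """Build the argument list for launching the OpenAI-compatible vLLM server."""
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    if not 0 < gpu_memory_utilization <= 1:
        raise ValueError("gpu_memory_utilization must be in the range (0, 1]")
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_path,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
    ]


command = build_server_command("/path/to/model/snapshots/snapshotID", tensor_parallel_size=2)
print(shlex.join(command))  # shell-quoted form, handy for logging
```

The returned list can be passed directly to `subprocess.Popen(command)`, which avoids shell quoting problems with unusual model paths.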

### 3. **Verifying Available Models**

Check which models are available on your server:


In [None]:
!curl http://localhost:8000/v1/models

The response lists each served model. Use the model's `id` (by default the path passed to `--model`, which also appears under `root`) as the `model` value in API calls.
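
The same check can be done from Python using only the standard library. In this sketch the helper names `fetch_models` and `model_ids` are assumptions, and the sample payload is a hypothetical response trimmed to the fields used here, so it can be read without a running server.

```python
import json
import urllib.request


def fetch_models(base_url="http://localhost:8000/v1"):
    """Query the server's model list (requires a running vLLM server)."""
    with urllib.request.urlopen(f"{base_url}/models") as response:
        return json.load(response)


def model_ids(payload):
    """Extract the model identifiers from an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", [])]


# Hypothetical response, trimmed to the fields used here.
sample = json.loads("""
{
  "object": "list",
  "data": [
    {
      "id": "/path/to/model/snapshots/snapshotID",
      "object": "model",
      "root": "/path/to/model/snapshots/snapshotID"
    }
  ]
}
""")
print(model_ids(sample))
```

With the server running, `model_ids(fetch_models())` returns the names that are valid for the `model` argument of later API calls.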

## **Using the Model with OpenAI API**

### **Basic Chat Completion**

In [None]:
from openai import OpenAI

# Initialize client with local server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"  # Required parameter even if not used
)

# Make API call
response = client.chat.completions.create(
    model="/path/to/model/snapshots/snapshotID",  # must match the id reported by /v1/models
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What's the capital of France?"}
    ]
)

# Extract and print response
print(response.choices[0].message.content)
# Note: Ensure that the model path and snapshot ID are correctly specified.
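
For long answers, the OpenAI client can also stream the response by passing `stream=True` to `client.chat.completions.create`, which yields chunks whose text sits in `chunk.choices[0].delta.content`. The sketch below shows one way to reassemble those pieces. The helper names `collect_stream` and `fake_chunk` are assumptions, and the stand-in chunks only mimic the attributes the helper reads so the example can be followed without a server.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Join the text deltas of streamed chat-completion chunks into one string."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks may carry no choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # the content can be None, for example on the final chunk
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for a streamed chunk, exposing only the attributes used above."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


print(collect_stream([fake_chunk("Par"), fake_chunk("is"), fake_chunk(None)]))
```

With a live server, the iterable returned by `client.chat.completions.create(..., stream=True)` can be passed to `collect_stream` directly, or iterated in a loop that prints each delta as it arrives.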

### **With LangChain Integration**

When using with LangChain, you can connect as follows:


In [2]:
from langchain_openai import ChatOpenAI  # requires the langchain-openai package

# Initialize the LLM
gpu_llm = ChatOpenAI(
    model_name="/path/to/model/snapshots/snapshotID",
    openai_api_base="http://localhost:8000/v1",
    openai_api_key="EMPTY"
)

# Example usage
response = gpu_llm.invoke("What's the capital of France?")
print(response)

content=' The capital of France is Paris. It is located in the north-central part of the country. Paris is known for its iconic landmarks such as the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and the Arc de Triomphe. It is also a major cultural, artistic, and fashion center. The city has a population of over 2 million people, and the wider metropolitan area has a population of over 12 million people, making it one of the most populous urban areas in Europe. The city of Paris was founded in the 3rd century BC by a Celtic people called the Parisii, and it became the capital of the Kingdom of France in the 10th century. It has been the capital of France ever since, with the exception of a brief period during the French Revolution when the seat of government was moved to the city of Tours.' additional_kwargs={} response_metadata={'token_usage': {'completion_tokens': 190, 'prompt_tokens': 11, 'total_tokens': 201, 'completion_tokens_details': None, 'prompt_tokens_details': None}, 'm

## **Best Practices**

1. **Memory Management**: Monitor GPU memory usage with tools like `nvidia-smi`
2. **Batch Size Tuning**: vLLM batches requests continuously, so tune the limits on concurrent work (for example `--max-num-seqs` and `--max-num-batched-tokens`) rather than a fixed batch size
3. **Quantization**: Consider quantized models (like 4-bit or 8-bit) for larger models with limited GPU memory
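4. **Caching**: Cache responses for repetitive queries at the application layer, and consider `--enable-prefix-caching` so that requests sharing a prompt prefix can reuse KV-cache blocks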
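
Memory monitoring can be scripted by asking `nvidia-smi` for CSV output and parsing it. In this sketch the helper names and the sample output are assumptions for illustration. The query flags are standard `nvidia-smi` options, and with `nounits` the memory values are reported in MiB.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' lines (MiB) into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append({
            "gpu": index,
            "used_mib": used,
            "total_mib": total,
            "used_fraction": used / total,
        })
    return rows


def read_gpu_memory():
    """Run nvidia-smi and parse its output (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Hypothetical output for two 80 GB GPUs, used so the parser can be tried anywhere.
sample_output = "0, 72000, 80000\n1, 71500, 80000\n"
for row in parse_gpu_memory(sample_output):
    print(f"GPU {row['gpu']}: {row['used_fraction']:.0%} of {row['total_mib']} MiB in use")
```

A used fraction close to the configured `--gpu-memory-utilization` is expected, because vLLM reserves that share of memory up front for weights and the KV cache.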

## **Troubleshooting**

- If experiencing CUDA out-of-memory errors, reduce `--gpu-memory-utilization` or use a smaller model
- For tensor parallelism issues, ensure all GPUs are of the same architecture
- If the model loads but inference is slow, review the settings that determine how much KV cache is available, such as `--gpu-memory-utilization` and `--max-model-len`

## **References**

- [vLLM GitHub Repository](https://github.com/vllm-project/vllm)
- [Documentation](https://docs.vllm.ai/)
- [PagedAttention Paper](https://arxiv.org/abs/2309.06180)
- [OpenAI API Compatibility Guide](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html)
- [LangChain Integration Documentation](https://python.langchain.com/docs/integrations/llms/vllm)
