# vLLM Setup Guide

This notebook walks through installing vLLM, running offline inference, serving models through an OpenAI-compatible API, and tuning performance. vLLM is a high-throughput, memory-efficient inference and serving library for LLMs.

## What is vLLM?

vLLM is an open-source library designed for:
- Fast LLM inference and serving
- Efficient memory management with PagedAttention
- Continuous batching for high throughput
- OpenAI-compatible API server
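
To make the PagedAttention idea more concrete, here is a toy sketch of the bookkeeping behind a paged KV cache: each sequence holds a table of fixed-size blocks drawn from a shared pool, so memory is claimed one block at a time and released as soon as a sequence finishes. The class name, block size, and pool size are hypothetical, and this is an illustration of the concept rather than vLLM's actual implementation.

```python
# Toy model of paged KV-cache bookkeeping; NOT vLLM's actual implementation.
BLOCK_SIZE = 4  # tokens per block (hypothetical value chosen for readability)


class ToyBlockAllocator:
    """Hands out fixed-size cache blocks from a shared free pool."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks = list(range(num_blocks))
        self.block_tables: dict[str, list[int]] = {}  # sequence id -> block ids
        self.lengths: dict[str, int] = {}             # sequence id -> token count

    def append_token(self, seq_id: str) -> None:
        """Record one more token, allocating a new block only when needed."""
        length = self.lengths.get(seq_id, 0)
        if length % BLOCK_SIZE == 0:  # existing blocks are full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free cache blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


allocator = ToyBlockAllocator(num_blocks=8)
for _ in range(5):
    allocator.append_token("seq-a")  # 5 tokens occupy 2 blocks
for _ in range(3):
    allocator.append_token("seq-b")  # 3 tokens occupy 1 block
print(allocator.block_tables)
allocator.free("seq-a")              # its blocks are reusable right away
print(len(allocator.free_blocks))
```

Because blocks need not be contiguous, a sequence wastes at most one partially filled block, which is what lets many sequences share the cache.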

## System Requirements

- **Operating System**: Linux
- **Python**: 3.9 - 3.12
- **Hardware**: NVIDIA GPUs (recommended)
- **CUDA**: Compatible version for your GPU
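
The requirements above can be checked from Python before installing anything. The sketch below uses only the standard library; the helper name `check_requirements` is hypothetical, and the version bounds are copied from the list above, so adjust them if your vLLM release documents a different range.

```python
import platform
import sys

# Bounds copied from the requirements list above (assumption: they match your release).
MIN_PYTHON = (3, 9)
MAX_PYTHON = (3, 12)


def check_requirements(python_version: tuple[int, int], system: str) -> list[str]:
    """Return human-readable problems; an empty list means all checks passed."""
    problems = []
    if not (MIN_PYTHON <= python_version <= MAX_PYTHON):
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is outside 3.9-3.12"
        )
    if system != "Linux":
        problems.append(f"Operating system is {system}, expected Linux")
    return problems


issues = check_requirements(tuple(sys.version_info[:2]), platform.system())
print("OK" if not issues else "\n".join(issues))
```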

## Installation

### Option 1: Using uv (Recommended)

uv is a fast Python package manager that's recommended for vLLM installation.

In [None]:
# First, install uv if you haven't already
!pip install uv

In [None]:
# Create a virtual environment (uv writes it to .venv by default)
!uv venv --python 3.12
# Note: a `!` command cannot activate the environment for the running kernel.
# In Jupyter, restart the kernel and select the new environment.

In [None]:
# Install vLLM with automatic torch backend selection
!uv pip install vllm --torch-backend=auto

### Option 2: Using conda

In [None]:
# Create a new conda environment
# Run this in terminal:
# conda create -n vllm_env python=3.12 -y
# conda activate vllm_env

# Then install vLLM (`--torch-backend` is a uv option; plain pip does not accept it)
!pip install vllm

### Option 3: Using pip (in existing environment)

In [None]:
# Direct pip install (`--torch-backend` is a uv option; plain pip does not accept it)
!pip install vllm

## Verify Installation

In [None]:
# Verify vLLM installation
import vllm
print(f"vLLM version: {vllm.__version__}")

# Check GPU availability
import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")

## Offline Inference

Let's start with basic offline inference using vLLM.

In [None]:
from vllm import LLM, SamplingParams

# Define prompts
prompts = [
    "Hello, my name is",
    "The capital of France is",
    "The future of AI is"
]

# Set sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=100
)

# Initialize the model (using a small model for demo)
llm = LLM(model="facebook/opt-125m")

# Generate outputs
outputs = llm.generate(prompts, sampling_params)

# Print the outputs
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt}")
    print(f"Generated: {generated_text}")
    print("-" * 50)

## Using Larger Models

vLLM supports a wide range of models from Hugging Face. Here's how to use larger, more capable models:

In [None]:
# Example with a more capable model (adjust based on your GPU memory)
# Uncomment and run based on your GPU capacity:

# For 8GB GPU:
# llm = LLM(model="microsoft/phi-2")

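# For 16GB GPU (tight: about 14.5 GB of FP16 weights; pass a smaller max_model_len if loading fails):
# llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")

# For 40GB+ GPU (13B parameters in FP16 need about 26 GB for the weights alone):
# llm = LLM(model="meta-llama/Llama-2-13b-chat-hf")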

# Example with Mistral 7B (requires ~16GB GPU memory)
try:
    llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.1")
    
    # Chat-style prompt
    prompts = [
        "[INST] Explain quantum computing in simple terms. [/INST]"
    ]
    
    sampling_params = SamplingParams(temperature=0.7, max_tokens=200)
    outputs = llm.generate(prompts, sampling_params)
    
    print(outputs[0].outputs[0].text)
except Exception as e:
    print(f"Error loading model: {e}")
    print("Try a smaller model based on your GPU memory.")
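
A quick way to judge whether a model fits is to multiply its parameter count by the bytes stored per parameter. The sketch below does that for FP16/BF16 weights (2 bytes each). The helper name is hypothetical and the parameter counts are rounded nominal values, so treat the results as lower bounds: the KV cache and activations need additional memory.

```python
def weight_memory_gb(num_params_billion: float, bytes_per_param: float = 2.0) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes); 2 bytes/param is FP16/BF16."""
    return num_params_billion * bytes_per_param


# Rounded nominal parameter counts (assumptions for illustration).
for name, billions in [("phi-2", 2.7), ("Mistral-7B", 7.2), ("Llama-2-13b", 13.0)]:
    print(f"{name}: ~{weight_memory_gb(billions):.1f} GB of FP16 weights")
```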

## Advanced Sampling Parameters

vLLM offers various sampling parameters to control generation:

In [None]:
# Demonstrating different sampling strategies
from vllm import LLM, SamplingParams

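# Each LLM() call reserves GPU memory; restart the kernel first if an earlier
# cell already created an engine.
llm = LLM(model="facebook/opt-125m")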

prompt = "The weather today is"

# Greedy decoding (deterministic)
greedy_params = SamplingParams(temperature=0, max_tokens=30)

# Creative sampling
creative_params = SamplingParams(
    temperature=1.2,
    top_p=0.95,
    top_k=40,
    max_tokens=30
)

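# Beam search
# Recent vLLM releases removed `use_beam_search` from SamplingParams;
# beam search is exposed through LLM.beam_search with BeamSearchParams instead.
from vllm.sampling_params import BeamSearchParams

beam_params = BeamSearchParams(beam_width=3, max_tokens=30)

# Generate with different strategies
print("Greedy decoding:")
print(llm.generate([prompt], greedy_params)[0].outputs[0].text)
print("\nCreative sampling:")
print(llm.generate([prompt], creative_params)[0].outputs[0].text)
print("\nBeam search:")
# beam_search takes prompt dicts and returns beams ordered best-first
print(llm.beam_search([{"prompt": prompt}], beam_params)[0].sequences[0].text)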
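
To see what `temperature`, `top_k`, and `top_p` do to a next-token distribution, the following pure-Python sketch applies them to a handful of made-up logits. It is a simplified illustration of the standard techniques, not vLLM's sampler; the function name and the toy logits are hypothetical.

```python
import math


def filter_distribution(logits: dict[str, float], temperature: float = 1.0,
                        top_k: int = 0, top_p: float = 1.0) -> dict[str, float]:
    """Return the renormalized token probabilities that survive filtering."""
    if temperature == 0:
        # Greedy decoding: all probability mass goes to the highest logit.
        best = max(logits, key=logits.get)
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((tok, e / total) for tok, e in exps.items()),
                    key=lambda item: item[1], reverse=True)

    if top_k > 0:
        # Keep the k most likely tokens and renormalize.
        ranked = ranked[:top_k]
        mass = sum(prob for _, prob in ranked)
        ranked = [(tok, prob / mass) for tok, prob in ranked]

    # Nucleus (top-p): keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"sunny": 2.0, "rainy": 1.0, "cold": 0.5, "purple": -1.0}
print(filter_distribution(toy_logits, temperature=0))
print(filter_distribution(toy_logits, temperature=1.2, top_k=3, top_p=0.95))
```

Higher temperatures flatten the distribution before filtering, so more tokens survive a given `top_p` threshold; lower temperatures concentrate mass on the top few.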

## OpenAI-Compatible API Server

vLLM can serve models with an OpenAI-compatible API, making it easy to integrate with existing applications.

### Starting the Server

To start the vLLM server, run this command in your terminal:

```bash
# Basic server start
vllm serve facebook/opt-125m

# With custom options
vllm serve facebook/opt-125m --port 8000 --host 0.0.0.0

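# For a chat model (the tokenizer's built-in chat template is used by default;
# --chat-template takes a path to a Jinja template file if you need to override it)
vllm serve mistralai/Mistral-7B-Instruct-v0.1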
```

In [None]:
# Example of using the API (run after starting the server)
import requests
import json

# API endpoint
url = "http://localhost:8000/v1/completions"

# Request payload
payload = {
    "model": "facebook/opt-125m",
    "prompt": "The meaning of life is",
    "max_tokens": 50,
    "temperature": 0.7
}

# Make request (uncomment when server is running)
# response = requests.post(url, json=payload)
# print(json.dumps(response.json(), indent=2))

In [None]:
# Using OpenAI Python client with vLLM
from openai import OpenAI

# Point to local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy-key"  # Any value works unless the server was started with --api-key
)

# Use it like OpenAI API (uncomment when server is running)
# completion = client.completions.create(
#     model="facebook/opt-125m",
#     prompt="Once upon a time",
#     max_tokens=50
# )
# print(completion.choices[0].text)
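
The examples above send raw prompts to `/v1/completions`. For an instruction-tuned model, the OpenAI-compatible server also exposes `/v1/chat/completions`, which accepts a list of messages and applies the chat template on the server side. The sketch below builds such a request with the standard library only. The helper name is hypothetical, and it assumes the server was started with the Mistral model on port 8000.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": 50,
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8000/v1",
    "mistralai/Mistral-7B-Instruct-v0.1",
    "Explain quantum computing in simple terms.",
)
print(request.full_url)

# Send it once the server is running (uncomment):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response)["choices"][0]["message"]["content"])
```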

## Attention Backend Configuration

vLLM supports multiple attention backends for optimal performance on different hardware.

In [None]:
import os

# Check current attention backend
current_backend = os.environ.get('VLLM_ATTENTION_BACKEND', 'unset (automatic selection)')
print(f"Current attention backend: {current_backend}")

# Backends the variable accepts include (availability depends on vLLM version and hardware):
# - FLASH_ATTN: FlashAttention-2 backend
# - FLASHINFER: FlashInfer backend
# - XFORMERS: xFormers backend
# - ROCM_FLASH: ROCm flash attention
# Leave the variable unset for automatic selection (the default).

# To set a specific backend, do so before creating the LLM:
# os.environ['VLLM_ATTENTION_BACKEND'] = 'FLASH_ATTN'

# Example with specific backend
# llm = LLM(model="facebook/opt-125m")
# This will use the backend specified in the environment variable

## Performance Optimization Tips

1. **GPU Memory Management**:

In [None]:
# Control GPU memory usage
llm = LLM(
    model="facebook/opt-125m",
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    max_model_len=2048  # Limit context length
)

2. **Tensor Parallelism** (for multi-GPU systems):

In [None]:
# For multi-GPU systems
# llm = LLM(
#     model="meta-llama/Llama-2-13b-hf",
#     tensor_parallel_size=2  # Use 2 GPUs
# )

3. **Quantization** (reduce memory usage):

In [None]:
# Load quantized models
# llm = LLM(
#     model="TheBloke/Llama-2-7B-AWQ",  # AWQ quantized model
#     quantization="awq"
# )
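
The memory settings above interact: `gpu_memory_utilization` sets the overall budget, the weights take a fixed share of it, and the remainder holds the KV cache, whose size grows with `max_model_len`. The sketch below gives a back-of-the-envelope estimate. The helper names and the 16 GB GPU are hypothetical, and the formula ignores activations and framework overhead, so real capacity will be lower.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                             dtype_bytes: int = 2) -> int:
    """Keys and values (factor 2) for every layer, for a single token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_memory_gb: float, gpu_memory_utilization: float,
                      weight_gb: float, bytes_per_token: int) -> int:
    """Tokens that fit after weights are loaded; ignores activations and overhead."""
    budget_bytes = (gpu_memory_gb * gpu_memory_utilization - weight_gb) * 1e9
    return max(0, int(budget_bytes // bytes_per_token))


# facebook/opt-125m: 12 layers, 12 attention heads, head dimension 64, FP16.
per_token = kv_cache_bytes_per_token(num_layers=12, num_kv_heads=12, head_dim=64)
print(per_token)         # bytes of KV cache per token
print(per_token * 2048)  # bytes for one sequence at max_model_len=2048

# Hypothetical 16 GB GPU at 90% utilization with about 0.25 GB of FP16 weights.
print(max_cached_tokens(16, 0.9, 0.25, per_token))
```

Quantization shrinks the weight term, and tensor parallelism splits both weights and KV cache across GPUs, which is why either can free room for longer contexts or more concurrent sequences.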

## Troubleshooting Common Issues

### 1. CUDA Out of Memory

In [None]:
# Solutions for OOM errors (pick one; each LLM() call reserves GPU memory again):

# 1. Reduce batch size
llm = LLM(model="facebook/opt-125m", max_num_seqs=1)

# 2. Reduce memory utilization (alternative to option 1)
# llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.7)

# 3. Use smaller model or quantized version
# 4. Reduce max_model_len

### 2. Check System Compatibility

In [None]:
# Check CUDA and GPU info
import subprocess

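# Check CUDA toolkit version (nvcc ships with the toolkit, not with the driver)
try:
    cuda_version = subprocess.check_output(['nvcc', '--version']).decode('utf-8')
    print("CUDA Version:")
    print(cuda_version)
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvcc not found in PATH (the CUDA toolkit may not be installed)")

# Check GPU info
try:
    gpu_info = subprocess.check_output(['nvidia-smi', '--query-gpu=name,memory.total', '--format=csv']).decode('utf-8')
    print("\nGPU Info:")
    print(gpu_info)
except (FileNotFoundError, subprocess.CalledProcessError):
    print("nvidia-smi not found or failed to run")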

## Next Steps

1. **Explore Model Zoo**: Browse Hugging Face for models compatible with vLLM
2. **Production Deployment**: Set up vLLM with proper monitoring and scaling
3. **Custom Models**: Learn to serve your fine-tuned models
4. **Integration**: Connect vLLM to your applications via the OpenAI-compatible API

## Resources

- [vLLM Documentation](https://docs.vllm.ai/)
- [vLLM GitHub](https://github.com/vllm-project/vllm)
- [Supported Models](https://docs.vllm.ai/en/latest/models/supported_models.html)
- [Performance Guide](https://docs.vllm.ai/en/latest/models/performance.html)