# vLLM Parallel Inference Demo
This notebook demonstrates how to use vLLM with tensor and pipeline parallelism to run model inference across multiple GPUs.

## Setup and Installation
First, let's install vLLM and the other required dependencies.

In [None]:
# Install required packages
!pip install vllm torch transformers accelerate
!pip install pandas matplotlib  # For benchmarking visualization

In [None]:
# Restrict which GPUs this process can see; set this before CUDA is initialized
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"  # Adjust based on available GPUs
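
Before picking a parallel size, it can help to confirm how many GPUs the setting above exposes. The following is a minimal sketch using only the standard library. The helper name `visible_gpu_ids` is hypothetical, and it only parses the environment variable; it does not query the driver.

```python
import os

def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device identifiers.

    Returns None when the variable is unset, which means no restriction
    has been applied and all GPUs are visible.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    # Drop empty entries so that an empty string yields an empty list
    return [part.strip() for part in raw.split(",") if part.strip()]

# Example with an explicit mapping instead of the real environment
ids = visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1,2,3"})
print(ids, len(ids))
```

Calling `visible_gpu_ids()` with no argument reads the real process environment.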

## Initialize vLLM with a Model
We'll use vLLM's `LLM` class to load and configure the model.

In [None]:
from vllm import LLM, SamplingParams
import torch

# Initialize the model on a single GPU with a basic configuration
model_name = "deepseek-ai/DeepSeek-V2-Lite"  # Replace with your preferred model
llm = LLM(
    model=model_name,
    trust_remote_code=True,  # Allow custom model code from the model repository
    dtype="bfloat16",
    gpu_memory_utilization=0.85,  # Fraction of GPU memory vLLM may reserve
)

## Configure Tensor Parallelism
Tensor parallelism shards each layer's weights across multiple GPUs, so every GPU computes part of every layer.

In [None]:
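# Initialize model with tensor parallelism
# Note: the single-GPU `llm` above already reserves most of one GPU's memory, so
# creating another engine in the same process may fail with an out-of-memory
# error; if it does, delete `llm` or restart the kernel first
tensor_parallel_llm = LLM(
    model=model_name,
    tensor_parallel_size=2,  # Number of GPUs the weights are sharded across
    trust_remote_code=True,
    dtype="bfloat16",
)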
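
A tensor parallel size is only usable if the attention heads can be split evenly across the GPUs. The check below is a rough sketch of that constraint, not vLLM's own validation code. The function name and the head count of 16 are illustrative assumptions; read the real value from your model's configuration.

```python
def heads_per_gpu(num_attention_heads, tensor_parallel_size, num_gpus):
    """Return the number of attention heads each GPU would hold.

    Raises ValueError if the requested size cannot be satisfied.
    """
    if tensor_parallel_size < 1:
        raise ValueError("tensor_parallel_size must be at least 1")
    if tensor_parallel_size > num_gpus:
        raise ValueError(
            f"tensor_parallel_size={tensor_parallel_size} exceeds {num_gpus} visible GPUs"
        )
    if num_attention_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"{num_attention_heads} heads are not divisible by {tensor_parallel_size}"
        )
    return num_attention_heads // tensor_parallel_size

# Illustrative numbers: 16 heads (assumed), 2-way sharding, 4 visible GPUs
print(heads_per_gpu(num_attention_heads=16, tensor_parallel_size=2, num_gpus=4))
```

Running a check like this before constructing the engine fails fast instead of waiting for the model load to abort.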

## Configure Pipeline Parallelism
Pipeline parallelism splits the model's layers into consecutive stages and places each stage on a different GPU.

In [None]:
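# Initialize model with pipeline parallelism
# Note: whether the offline LLM class accepts pipeline_parallel_size > 1
# depends on the installed vLLM version
pipeline_parallel_llm = LLM(
    model=model_name,
    pipeline_parallel_size=2,  # Number of pipeline stages
    trust_remote_code=True,
    dtype="bfloat16",
)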
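
To make the idea of pipeline stages concrete, the sketch below assigns contiguous layer ranges to stages and computes how many GPUs a combined configuration needs (tensor parallel size times pipeline parallel size). It is an illustration under assumptions: `partition_layers` and `required_gpus` are hypothetical helpers, the layer count of 27 is an example value, and vLLM's actual partitioning may differ.

```python
def partition_layers(num_layers, pipeline_parallel_size):
    """Split layer indices into contiguous (start, end) ranges, end exclusive.

    Earlier stages receive one extra layer when the split is uneven.
    """
    if pipeline_parallel_size < 1 or pipeline_parallel_size > num_layers:
        raise ValueError("pipeline_parallel_size must be between 1 and num_layers")
    base, extra = divmod(num_layers, pipeline_parallel_size)
    ranges = []
    start = 0
    for stage in range(pipeline_parallel_size):
        size = base + (1 if stage < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def required_gpus(tensor_parallel_size, pipeline_parallel_size):
    """Each pipeline stage is itself sharded across tensor_parallel_size GPUs."""
    return tensor_parallel_size * pipeline_parallel_size

# Illustrative numbers: 27 layers (assumed) split into 2 stages
print(partition_layers(27, 2))
print(required_gpus(tensor_parallel_size=2, pipeline_parallel_size=2))
```

With four visible GPUs, a 2 by 2 configuration would use all of them, while either setting alone at size 2 uses only two.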

## Sample Text Generation
Let's test the model with different parallel configurations.

In [None]:
# Sample prompt for testing
prompt = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"

# Configure sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.95,
    max_tokens=100
)

# Generate with both parallel configurations
# (outputs will differ between runs because sampling is stochastic at temperature > 0)
result_tensor = tensor_parallel_llm.generate([prompt], sampling_params)
result_pipeline = pipeline_parallel_llm.generate([prompt], sampling_params)

print("Tensor Parallel Output:")
print(result_tensor[0].outputs[0].text)
print("\nPipeline Parallel Output:")
print(result_pipeline[0].outputs[0].text)
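
For intuition about what `temperature` and `top_p` control, here is a minimal pure-Python sketch of temperature scaling followed by nucleus (top-p) filtering over a toy set of logits. It is not vLLM's sampler; the function name `top_p_filter` and the logits are made up for illustration.

```python
import math

def top_p_filter(logits, temperature=0.7, top_p=0.95):
    """Return {token_index: probability} for the tokens kept by top-p filtering."""
    # Temperature scaling: values below 1 sharpen the distribution
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the most likely tokens until their cumulative mass reaches top_p
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1
    kept_total = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_total for i in kept}

# Toy logits for a four-token vocabulary (illustrative values)
print(top_p_filter([2.0, 1.0, 0.0, -5.0]))
```

A token is then drawn at random from the kept set, which is why repeated calls with the same prompt can produce different text.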

## Benchmarking Different Parallel Configurations
Compare the average generation latency of the different parallel setups.

In [None]:
import time
import pandas as pd
import matplotlib.pyplot as plt

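def benchmark_generation(llm, prompt, n_runs=5):
    """Return the mean wall-clock time in seconds over n_runs generate() calls.

    Uses the global sampling_params defined earlier in the notebook.
    """
    llm.generate([prompt], sampling_params)  # Warm-up run, excluded from timing
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()  # Monotonic, high-resolution timer
        _ = llm.generate([prompt], sampling_params)
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)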

# Benchmark different configurations
configs = {
    'Base': llm,
    'Tensor Parallel': tensor_parallel_llm,
    'Pipeline Parallel': pipeline_parallel_llm
}

results = {}
for name, model in configs.items():
    avg_time = benchmark_generation(model, prompt)
    results[name] = avg_time

# Create visualization
df = pd.DataFrame(list(results.items()), columns=['Configuration', 'Time (s)'])
plt.figure(figsize=(10, 6))
plt.bar(df['Configuration'], df['Time (s)'])
plt.title('Inference Time Comparison')
plt.ylabel('Average Time (seconds)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
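
A single average hides run-to-run variation, so it can be useful to report the spread and an approximate throughput as well. The sketch below shows one way to do that with the standard library. `time_calls` and `summarize` are hypothetical helpers, the timed function is a CPU-only stand-in for a `generate` call, and `tokens_per_call=100` assumes every call produces the full `max_tokens`, which real generations may not.

```python
import statistics
import time

def time_calls(fn, n_runs=5, warmup=1):
    """Time n_runs calls of fn after a few untimed warm-up calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples

def summarize(samples, tokens_per_call):
    """Return mean, sample standard deviation, and tokens per second."""
    mean = statistics.mean(samples)
    spread = statistics.stdev(samples) if len(samples) > 1 else 0.0
    return {
        "mean_s": mean,
        "stdev_s": spread,
        "tokens_per_s": tokens_per_call / mean,
    }

# Stand-in workload; in the notebook this would wrap an llm.generate(...) call
samples = time_calls(lambda: sum(range(100_000)), n_runs=5)
print(summarize(samples, tokens_per_call=100))
```

Plotting the standard deviation as error bars on the bar chart would show whether the differences between configurations exceed the noise.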