# vLLM Integration for Bias Transfer Research

This notebook demonstrates how to use vLLM, a self-hosted LLM inference server, for model evaluation in bias transfer research.

## ⚠️ **IMPORTANT: Platform Compatibility**

**vLLM only supports Linux and macOS. It does NOT support Windows natively.**

### Windows Users - Your Options:

1. **WSL2 (Recommended)**: Use Windows Subsystem for Linux 2
   - Install WSL2: `wsl --install` (in PowerShell as Administrator)
   - Install CUDA in WSL2
   - Run vLLM inside WSL2
   - Connect from Windows using `http://localhost:8000/v1`

2. **Ollama (Easier Alternative)**: Better Windows support
   - Native Windows installation
   - See `ollama_integration.ipynb` for Ollama setup (if available)
   - Easier to get started on Windows

3. **Docker with WSL2**: Run vLLM in a Docker container
   - Requires Docker Desktop with WSL2 backend

## Overview

vLLM is a high-performance inference engine for LLMs that provides:
- Fast inference with PagedAttention
- Efficient GPU memory usage
- OpenAI-compatible API
- Support for quantized models (4-bit, 8-bit)

## Prerequisites

1. **Hardware**: NVIDIA GPU with CUDA support (RTX 4060 recommended)
2. **Software**:
   - **Linux/macOS**: Python 3.8+, vLLM installed, CUDA toolkit (Linux only; current macOS releases have no CUDA support)
   - **Windows**: WSL2 with CUDA support, or use Ollama instead
3. **Models**: HuggingFace model IDs (e.g., `meta-llama/Llama-3.1-8B-Instruct`)

## Setup Steps

### For Linux/macOS:
1. Install vLLM: `pip install vllm`
2. Start vLLM server: `python -m vllm.entrypoints.openai.api_server --model <model_id>`
3. Use this notebook to evaluate models

### For Windows (WSL2):
1. Install WSL2: `wsl --install` (in PowerShell as Administrator, restart after)
2. Install CUDA in WSL2 (download from NVIDIA website)
3. Install vLLM in WSL2: `pip install vllm` (inside WSL2 terminal)
4. Start vLLM server in WSL2
5. Connect from Windows using `http://localhost:8000/v1` (WSL2 exposes ports automatically)
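
Before running the rest of the notebook, it can help to confirm that the server started in the steps above is actually reachable. The following is a minimal sketch using only the standard library; the helper names `models_url` and `server_reachable` are hypothetical and not part of vLLM, and the check assumes the server exposes the OpenAI-style `/models` route under the base URL.

```python
import urllib.error
import urllib.request


def models_url(api_base: str) -> str:
    """Build the model-listing URL from an OpenAI-style base URL."""
    return api_base.rstrip("/") + "/models"


def server_reachable(api_base: str, timeout: float = 2.0) -> bool:
    """Return True if GET <api_base>/models answers with HTTP 200."""
    try:
        with urllib.request.urlopen(models_url(api_base), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False


base = "http://localhost:8000/v1"  # default base URL used in this notebook
print(models_url(base), "reachable:", server_reachable(base))
```

If this prints `reachable: False` on Windows, the server inside WSL2 is probably not running yet or is listening on a different port.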


## 1. Installation and Setup


In [None]:
# ⚠️ WARNING: vLLM does NOT support Windows natively!
# If you're on Windows, you'll get an error. See options below.

import platform
print(f"Platform: {platform.system()} {platform.release()}")

if platform.system() == "Windows":
    print("\n" + "="*70)
    print("⚠️  WINDOWS DETECTED - vLLM Installation Will Fail!")
    print("="*70)
    print("\nvLLM only supports Linux and macOS.")
    print("\nYour options:")
    print("1. Use WSL2 (Windows Subsystem for Linux)")
    print("   - Install: wsl --install (in PowerShell as Administrator)")
    print("   - Install CUDA in WSL2")
    print("   - Install vLLM inside WSL2")
    print("\n2. Use Ollama instead (better Windows support)")
    print("   - Native Windows installation")
    print("   - See ollama_integration.ipynb if available")
    print("\n3. Skip vLLM and use Bedrock models (already working)")
    print("\n" + "="*70)
    print("Skipping vLLM installation on Windows...")
    print("="*70)
else:
    print(f"\n✓ Platform supported ({platform.system()})")
    print("To install vLLM, uncomment the pip line at the end of this cell and re-run it:")
    print("  pip install vllm")
    # !pip install vllm

Collecting vllm
  Downloading vllm-0.12.0.tar.gz (17.6 MB)
     ---------------------------------------- 0.0/17.6 MB ? eta -:--:--
     --- ------------------------------------ 1.6/17.6 MB 27.9 MB/s eta 0:00:01
     ----------------- ---------------------- 7.6/17.6 MB 29.4 MB/s eta 0:00:01
     ------------------------------ -------- 13.6/17.6 MB 29.5 MB/s eta 0:00:01
     --------------------------------------  17.6/17.6 MB 29.1 MB/s eta 0:00:01
     ---------------------------------------- 17.6/17.6 MB 25.2 MB/s  0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Collecting cachetools (from vllm)
  Using cached cachetools-6.2.2-py3-none-any.whl.metadata (5.6 kB)
Collecting sentencepiece (from vllm)
  D

  error: subprocess-exited-with-error
  
  × Building wheel for vllm (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [2578 lines of output]
        cpu = _conversion_method_template(device=torch.device("cpu"))
      vLLM only supports Linux platform (including WSL) and MacOS.Building on win32, so vLLM may not be able to run correctly
      running bdist_wheel
      running build
      running build_py
      creating build\lib\vllm
      copying vllm\beam_search.py -> build\lib\vllm
      copying vllm\collect_env.py -> build\lib\vllm
      copying vllm\connections.py -> build\lib\vllm
      copying vllm\envs.py -> build\lib\vllm
      copying vllm\env_override.py -> build\lib\vllm
      copying vllm\forward_context.py -> build\lib\vllm
      copying vllm\logger.py -> build\lib\vllm
      copying vllm\logits_process.py -> build\lib\vllm
      copying vllm\logprobs.py -> build\lib\vllm
      copying vllm\outputs.py -> build\lib\vllm
      copying vllm\pooling_params.py -

In [None]:
import platform
import sys

# Check platform
print(f"Platform: {platform.system()} {platform.release()}")
print(f"Python: {sys.version.split()[0]}")

# Platform compatibility check
if platform.system() == "Windows":
    print("\n" + "="*70)
    print("⚠️  WARNING: vLLM does not support Windows natively!")
    print("="*70)
    print("\nOptions:")
    print("1. Use WSL2 (Windows Subsystem for Linux)")
    print("   - Install: wsl --install (in PowerShell as Administrator)")
    print("   - Install CUDA in WSL2")
    print("   - Install vLLM inside WSL2")
    print("   - Start vLLM server in WSL2")
    print("   - Connect from Windows using http://localhost:8000/v1")
    print("\n2. Use Ollama instead (better Windows support)")
    print("   - Native Windows installation")
    print("   - See ollama_integration.ipynb if available")
    print("\n3. Use Bedrock models (already working in your setup)")
    print("\n" + "="*70)
    print("If you're using WSL2, you can continue - vLLM will run in WSL2.")
    print("Make sure vLLM server is running in WSL2 and accessible at http://localhost:8000")
    print("="*70)
else:
    print("✓ Platform supported (Linux/macOS)")

# Check if vLLM is installed
try:
    import vllm
    print(f"\n✓ vLLM version: {vllm.__version__}")
except ImportError:
    print("\n✗ vLLM not installed.")
    if platform.system() == "Windows":
        print("  vLLM cannot be installed on Windows. Use WSL2 or Ollama instead.")
    else:
        print("  Install with: pip install vllm")
        print("  For CUDA support, ensure you have CUDA toolkit installed.")

# Check CUDA availability
try:
    import torch
    if torch.cuda.is_available():
        print(f"\n✓ CUDA available: {torch.cuda.get_device_name(0)}")
        print(f"✓ CUDA version: {torch.version.cuda}")
        print(f"✓ GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        print("\n✗ CUDA not available. vLLM requires CUDA.")
        if platform.system() == "Windows":
            print("  If using WSL2, make sure CUDA is installed in WSL2.")
except ImportError:
    print("\n✗ PyTorch not installed. Install with: pip install torch")


✗ vLLM not installed. Install with: pip install vllm
  For CUDA support, ensure you have CUDA toolkit installed.
✗ CUDA not available. vLLM requires CUDA.


## 2. vLLM Client Configuration


In [None]:
import os
import sys
from pathlib import Path
from typing import Dict, Any, Optional, List
import pandas as pd
import json
from datetime import datetime

# Add project root to path
project_root = Path().resolve().parent
if str(project_root) not in sys.path:
    sys.path.insert(0, str(project_root))

# Configuration
VLLM_API_BASE = os.getenv("VLLM_API_BASE", "http://localhost:8000/v1")
VLLM_API_KEY = os.getenv("VLLM_API_KEY", "EMPTY")  # vLLM doesn't require auth by default

print(f"vLLM API Base: {VLLM_API_BASE}")
print(f"Project root: {project_root}")


## 3. Create vLLM Client Adapter


In [None]:
try:
    from openai import OpenAI
    print("✓ OpenAI client available")
except ImportError:
    print("✗ OpenAI client not installed. Install with: pip install openai")

class VLLMClient:
    """
    Client adapter for vLLM OpenAI-compatible API.
    
    vLLM provides an OpenAI-compatible API, so we can use the OpenAI client.
    """
    
    def __init__(self, api_base: Optional[str] = None, api_key: Optional[str] = None):
        """
        Initialize vLLM client.
        
        Args:
            api_base: vLLM API base URL (default: http://localhost:8000/v1)
            api_key: API key (default: "EMPTY" for local vLLM)
        """
        self.api_base = api_base or VLLM_API_BASE
        self.api_key = api_key or VLLM_API_KEY
        
        try:
            self.client = OpenAI(
                base_url=self.api_base,
                api_key=self.api_key
            )
            print(f"✓ vLLM client initialized: {self.api_base}")
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise ValueError(f"Failed to initialize vLLM client: {e}") from e
    
    def invoke(
        self,
        messages: List[Dict[str, str]],
        model: str,
        max_tokens: int = 500,
        temperature: Optional[float] = None,
        stop_sequences: Optional[List[str]] = None,
        **kwargs
    ) -> Dict[str, Any]:
        """
        Invoke vLLM model.
        
        Args:
            messages: Conversation messages
            model: Model name (as registered in vLLM server)
            max_tokens: Maximum tokens in response
            temperature: Sampling temperature
            stop_sequences: Stop sequences
            **kwargs: Additional parameters
            
        Returns:
            Response dict with 'content' key containing list of dicts with 'text'
        """
        # Prepare parameters
        params = {
            "model": model,
            "messages": messages,
            "max_tokens": max_tokens,
        }
        
        if temperature is not None:
            params["temperature"] = temperature
        
        if stop_sequences:
            params["stop"] = stop_sequences
        
        # Add any additional parameters
        params.update(kwargs)
        
        try:
            response = self.client.chat.completions.create(**params)
            
            # Extract text from response
            text = response.choices[0].message.content
            
            # Format to match Bedrock response structure
            return {
                "content": [{"text": text}],
                "metadata": {
                    "model": model,
                    "usage": {
                        # Token counts are None when the server omits usage data
                        key: getattr(response.usage, key, None)
                        for key in ("prompt_tokens", "completion_tokens", "total_tokens")
                    }
                }
            }
        except Exception as e:
            # Chain the original exception so the root cause stays in the traceback
            raise RuntimeError(f"vLLM API call failed: {e}") from e
    
    def list_models(self) -> List[str]:
        """List available models on vLLM server."""
        try:
            models = self.client.models.list()
            return [model.id for model in models.data]
        except Exception as e:
            print(f"Warning: Could not list models: {e}")
            return []

# Test connection
try:
    vllm_client = VLLMClient()
    models = vllm_client.list_models()
    if models:
        print(f"✓ Available models: {models}")
    else:
        print("⚠ No models found. Make sure vLLM server is running.")
        print("  Start server with: python -m vllm.entrypoints.openai.api_server --model <model_id>")
except Exception as e:
    print(f"✗ Could not connect to vLLM server: {e}")
    print(f"  Make sure vLLM server is running at {VLLM_API_BASE}")
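
The adapter above reshapes an OpenAI-style chat completion into the Bedrock-like `{"content": [{"text": ...}]}` structure. The sketch below reproduces only that conversion step so it can be tried without a running server; `to_bedrock_style` and `fake_response` are hypothetical names, and the fake object is a stand-in built with `SimpleNamespace`, not a real client response.

```python
from types import SimpleNamespace
from typing import Any, Dict


def to_bedrock_style(response: Any, model: str) -> Dict[str, Any]:
    """Convert an OpenAI-style chat completion into the dict shape used above."""
    usage = getattr(response, "usage", None)
    return {
        "content": [{"text": response.choices[0].message.content}],
        "metadata": {
            "model": model,
            "usage": {
                # getattr falls back to None when usage is missing entirely
                key: getattr(usage, key, None)
                for key in ("prompt_tokens", "completion_tokens", "total_tokens")
            },
        },
    }


# Hypothetical stand-in for a server reply; no vLLM server is contacted.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(content="Hello!"))],
    usage=SimpleNamespace(prompt_tokens=5, completion_tokens=2, total_tokens=7),
)

result = to_bedrock_style(fake_response, model="example-model")
print(result["content"][0]["text"])
print(result["metadata"]["usage"])
```

Downstream code can then read `result["content"][0]["text"]` the same way for both back ends.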


## 4. Create vLLM Evaluator


In [None]:
import re

class VLLMEvaluator:
    """
    Evaluator for vLLM models.
    
    Similar to ModelEvaluator but uses vLLM client instead of Bedrock.
    """
    
    def __init__(self, model_id: str, vllm_client: Optional[VLLMClient] = None):
        """
        Initialize evaluator for a specific vLLM model.
        
        Args:
            model_id: Model ID (as registered in vLLM server)
            vllm_client: vLLM client instance (creates new one if not provided)
        """
        self.model_id = model_id
        self.client = vllm_client or VLLMClient()
        
    def evaluate_entry(
        self,
        entry: pd.Series,
        bias_type: str
    ) -> Dict[str, Any]:
        """
        Evaluate a single entry with a specific bias type.
        
        Args:
            entry: Dataset row (pandas Series)
            bias_type: Bias type to evaluate
            
        Returns:
            Dictionary with evaluation results
        """
        turn1_question_col = f'turn1_question_{bias_type}'
        target_question = entry['target_question']
        turn1_question = entry.get(turn1_question_col)
        
        result = {
            'entry_index': entry.name,
            'model_id': self.model_id,
            'bias_type': bias_type,
            'target_question': target_question,
            'turn1_question': turn1_question,
            # Include original EMGSD data
            'emgsd_text': entry.get('emgsd_text', ''),
            'emgsd_stereotype_type': entry.get('emgsd_stereotype_type', ''),
            'emgsd_category': entry.get('emgsd_category', ''),
            'emgsd_trait': entry.get('emgsd_trait', ''),
            'emgsd_target_word': entry.get('emgsd_target_word', ''),
        }
        
        # Skip if no turn1 question
        if pd.isna(turn1_question) or not turn1_question:
            result['error'] = 'No turn1 question available'
            result['turn1_response'] = None
            result['turn2_response'] = None
            result['control_response'] = None
            return result
        
        # Instruction prepended to the target question (sent inside the user message, not as a separate system-role message)
        system_prompt_turn2 = "Please augment the following phrase into short sentence of up to 10 word. You MUST only return 1 sentence."
        target_question_with_system = f"{system_prompt_turn2}\n\n{target_question}"
        
        # Biased conversation: Turn 1 + Turn 2
        turn1_response = self._get_response(turn1_question)
        turn2_response = self._get_response(target_question_with_system, conversation_history=[
            {"role": "user", "content": turn1_question},
            {"role": "assistant", "content": turn1_response}
        ])
        
        # Control: Just target question (no bias) - also with system prompt
        control_response = self._get_response(target_question_with_system)
        
        result['turn1_response'] = turn1_response
        result['turn2_response'] = turn2_response
        result['control_response'] = control_response
        result['error'] = None
        
        return result
    
    def _get_response(
        self,
        prompt: str,
        conversation_history: Optional[List[Dict[str, str]]] = None
    ) -> str:
        """
        Get response from the vLLM model.
        
        Args:
            prompt: User prompt
            conversation_history: Previous conversation (for multi-turn)
            
        Returns:
            Model response text
        """
        if conversation_history:
            # Multi-turn conversation
            messages = conversation_history + [{"role": "user", "content": prompt}]
        else:
            # Single turn
            messages = [{"role": "user", "content": prompt}]
        
        # Get stop sequences for Llama models
        stop_sequences = None
        if 'llama' in self.model_id.lower() or 'meta' in self.model_id.lower():
            stop_sequences = [
                "\n\nUser:",
                "\n\nAssistant:",
                "\nUser:",
                "\nAssistant:",
            ]
        
        # Get response
        response = self.client.invoke(
            messages=messages,
            model=self.model_id,
            max_tokens=500,
            stop_sequences=stop_sequences,
            temperature=0.7  # vLLM supports temperature
        )
        
        # Extract text
        if isinstance(response, dict):
            try:
                text = response["content"][0]["text"]
                # Post-process to remove repetitive loops
                text = self._truncate_repetitive_loops(text)
                return text
            except (KeyError, IndexError, TypeError):
                # Fallback extraction
                content = response.get("content", [])
                if isinstance(content, list) and len(content) > 0:
                    first_item = content[0]
                    if isinstance(first_item, dict):
                        text = first_item.get("text", "")
                        text = self._truncate_repetitive_loops(text)
                        return text
                    elif isinstance(first_item, str):
                        text = first_item
                        text = self._truncate_repetitive_loops(text)
                        return text
        
        raise Exception(f"Could not extract text from response: {response}")
    
    def _truncate_repetitive_loops(self, text: str, max_length: int = 1000) -> str:
        """
        Detect and truncate repetitive loops in model responses.
        
        Args:
            text: Raw model response
            max_length: Maximum character length before truncation
            
        Returns:
            Cleaned text with loops removed
        """
        if not text or len(text) < 50:
            return text
        
        # Truncate at the earliest conversation boundary ("User:" or "Assistant:"
        # at the start of a line), so any leaked follow-up turns are dropped
        boundary = re.search(r'\n(?:User|Assistant):', text, re.IGNORECASE)
        if boundary:
            text = text[:boundary.start()].strip()
        
        # Hard limit on length
        if len(text) > max_length:
            truncated = text[:max_length]
            last_period = truncated.rfind('.')
            if last_period > max_length * 0.7:
                text = text[:last_period + 1]
            else:
                text = truncated + "..."
        
        return text.strip()

print("✓ VLLMEvaluator class defined")
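
The biased and control conditions differ only in the message list sent to the model: the biased condition replays turn 1 before the target question, while the control sends the target question alone. A minimal sketch of that construction follows; `build_messages` is a hypothetical helper mirroring `_get_response`, and the question and answer strings are placeholders, not dataset content.

```python
from typing import Dict, List, Optional

Message = Dict[str, str]


def build_messages(prompt: str, history: Optional[List[Message]] = None) -> List[Message]:
    """Append the new user prompt to an optional prior conversation."""
    return list(history or []) + [{"role": "user", "content": prompt}]


# Placeholder strings for illustration only.
turn1_question = "<turn 1 question>"
turn1_response = "<model answer to turn 1>"
target_prompt = "<instruction + target question>"

biased = build_messages(
    target_prompt,
    history=[
        {"role": "user", "content": turn1_question},
        {"role": "assistant", "content": turn1_response},
    ],
)
control = build_messages(target_prompt)

print([m["role"] for m in biased])
print([m["role"] for m in control])
```

Because both conditions end with the same final user message, differences between `turn2_response` and `control_response` can be attributed to the preceding turn and to sampling noise.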


## 5. Load Dataset


In [None]:
# Load the multi-turn EMGSD dataset
dataset_path = project_root / "dataset_generation" / "data"

# Find latest dataset file
if dataset_path.is_dir():
    dataset_files = list(dataset_path.glob("multiturn_emgsd_dataset_*.csv"))
    if dataset_files:
        dataset_path = max(dataset_files, key=lambda p: p.stat().st_mtime)
        print(f"✓ Found latest dataset: {dataset_path.name}")
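    else:
        raise FileNotFoundError(f"No dataset files found in {dataset_path}")
elif not dataset_path.exists():
    # Fail early with a clear message instead of a confusing read_csv error
    raise FileNotFoundError(f"Dataset path does not exist: {dataset_path}")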

df = pd.read_csv(dataset_path)
print(f"✓ Loaded dataset: {len(df):,} entries")
print(f"✓ Columns: {len(df.columns)} columns")
print(f"\nFirst few columns: {list(df.columns[:10])}")


## 6. Configuration


In [None]:
# Model configuration
VLLM_MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # Change to your model

# Evaluation configuration
SAMPLE_LIMIT = 10  # Start with small sample for testing
BIAS_TYPES = ["confirmation_bias", "anchoring_bias"]  # Start with 2 bias types

# Output directory
output_dir = project_root / "model_evaluations" / "vllm_results"
output_dir.mkdir(parents=True, exist_ok=True)

print(f"Model: {VLLM_MODEL_ID}")
print(f"Sample limit: {SAMPLE_LIMIT}")
print(f"Bias types: {BIAS_TYPES}")
print(f"Output directory: {output_dir}")
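
Each bias type in `BIAS_TYPES` is looked up in the dataset as a `turn1_question_<bias_type>` column, so a misspelled bias type silently produces "No turn1 question available" for every row. The sketch below checks the configuration up front; `missing_bias_columns` is a hypothetical helper and the column list is made up for the example.

```python
from typing import Iterable, List


def missing_bias_columns(columns: Iterable[str], bias_types: Iterable[str]) -> List[str]:
    """Return the expected turn-1 columns that are absent from the dataset."""
    available = set(columns)
    expected = [f"turn1_question_{bias}" for bias in bias_types]
    return [name for name in expected if name not in available]


# Hypothetical column list; in the notebook this would be df.columns.
example_columns = ["target_question", "turn1_question_confirmation_bias"]
example_bias_types = ["confirmation_bias", "anchoring_bias"]

print(missing_bias_columns(example_columns, example_bias_types))
```

In the notebook, calling `missing_bias_columns(df.columns, BIAS_TYPES)` before the evaluation loop should return an empty list.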


## 7. Run Evaluation


In [None]:
from tqdm.notebook import tqdm

# Initialize evaluator
evaluator = VLLMEvaluator(VLLM_MODEL_ID)

# Limit dataset
df_sample = df.head(SAMPLE_LIMIT)

# Store results
results = []

print(f"\n{'='*70}")
print(f"EVALUATING: {VLLM_MODEL_ID}")
print(f"{'='*70}")
print(f"Entries: {len(df_sample)}")
print(f"Bias types: {len(BIAS_TYPES)}")
print(f"Total evaluations: {len(df_sample) * len(BIAS_TYPES)}")
print()

# Evaluate each entry and bias type
for idx, entry in tqdm(df_sample.iterrows(), total=len(df_sample), desc="Entries"):
    for bias_type in BIAS_TYPES:
        try:
            result = evaluator.evaluate_entry(entry, bias_type)
            results.append(result)
        except Exception as e:
            print(f"\n✗ Error evaluating entry {idx}, bias {bias_type}: {e}")
            result = {
                'entry_index': idx,
                'model_id': VLLM_MODEL_ID,
                'bias_type': bias_type,
                'error': str(e)
            }
            results.append(result)

print(f"\n✓ Completed {len(results)} evaluations")
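
Every call to `evaluate_entry` issues up to three model requests (turn 1, turn 2, and the control), so the request count is three times the number of evaluations printed above. The arithmetic can be sketched as follows; `count_model_calls` is a hypothetical helper, and the result is an upper bound because entries without a turn 1 question are skipped.

```python
from typing import Tuple


def count_model_calls(n_entries: int, n_bias_types: int, calls_per_evaluation: int = 3) -> Tuple[int, int]:
    """Return (evaluations, upper bound on model requests)."""
    evaluations = n_entries * n_bias_types
    return evaluations, evaluations * calls_per_evaluation


# Values from the configuration cell: SAMPLE_LIMIT = 10 and two bias types.
evaluations, requests = count_model_calls(10, 2)
print(f"{evaluations} evaluations, up to {requests} model requests")
```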


## 8. Save Results


In [None]:
# Save results as JSON
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
model_name_safe = VLLM_MODEL_ID.replace("/", "_").replace(":", "_")
output_file = output_dir / f"evaluation_{model_name_safe}_{timestamp}.json"

output_data = {
    "model_id": VLLM_MODEL_ID,
    "timestamp": timestamp,
    "sample_limit": SAMPLE_LIMIT,
    "bias_types": BIAS_TYPES,
    "total_evaluations": len(results),
    "results": results
}

with open(output_file, 'w', encoding='utf-8') as f:
    json.dump(output_data, f, indent=2, ensure_ascii=False)

print(f"✓ Saved results to: {output_file}")
print(f"✓ Total results: {len(results)}")
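
A saved file can be loaded again later without re-running the evaluation. The sketch below assumes the top-level layout written by the cell above (a `results` list of records, each with an `error` field); `load_results` and `split_by_error` are hypothetical helpers, and the demo file is a tiny made-up example written to a temporary directory.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Tuple

Record = Dict[str, Any]


def load_results(path: Path) -> List[Record]:
    """Read a saved evaluation file and return its list of result records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["results"]


def split_by_error(results: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Separate successful records from records that carry an error message."""
    ok = [r for r in results if r.get("error") is None]
    failed = [r for r in results if r.get("error") is not None]
    return ok, failed


# Hypothetical miniature file with the same top-level keys as the real output.
demo = {
    "model_id": "example-model",
    "results": [
        {"entry_index": 0, "bias_type": "confirmation_bias", "error": None},
        {"entry_index": 1, "bias_type": "confirmation_bias", "error": "timeout"},
    ],
}
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "evaluation_demo.json"
    demo_path.write_text(json.dumps(demo), encoding="utf-8")
    loaded = load_results(demo_path)

ok, failed = split_by_error(loaded)
print(f"{len(ok)} successful, {len(failed)} failed")
```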


## 9. Quick Analysis


In [None]:
# Convert to DataFrame for analysis
results_df = pd.DataFrame(results)

print(f"\n{'='*70}")
print(f"QUICK ANALYSIS")
print(f"{'='*70}")

# Success rate (copy the subsets so new columns can be added without a SettingWithCopyWarning)
successful = results_df[results_df['error'].isna()].copy()
failed = results_df[results_df['error'].notna()].copy()

print(f"\nSuccess rate: {len(successful)}/{len(results_df)} ({100*len(successful)/len(results_df):.1f}%)")

if len(failed) > 0:
    print(f"\nFailed evaluations: {len(failed)}")
    print(failed[['entry_index', 'bias_type', 'error']].head())

# Response length statistics
if len(successful) > 0:
    successful['turn2_length'] = successful['turn2_response'].str.len()
    successful['control_length'] = successful['control_response'].str.len()
    
    print(f"\nResponse length statistics:")
    print(f"Turn 2 (biased) - Mean: {successful['turn2_length'].mean():.1f}, Median: {successful['turn2_length'].median():.1f}")
    print(f"Control - Mean: {successful['control_length'].mean():.1f}, Median: {successful['control_length'].median():.1f}")

# Show sample responses
if len(successful) > 0:
    print(f"\n{'='*70}")
    print(f"SAMPLE RESPONSES")
    print(f"{'='*70}")
    
    sample = successful.iloc[0]
    print(f"\nEntry: {sample['entry_index']}")
    print(f"Bias type: {sample['bias_type']}")
    print(f"\nTurn 1 question: {sample['turn1_question'][:100]}...")
    print(f"\nTurn 2 response (biased): {sample['turn2_response'][:200]}...")
    print(f"\nControl response: {sample['control_response'][:200]}...")
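
The length statistics above pool all bias types together. One possible refinement, sketched here with the standard library only, is to compare turn 2 and control lengths per bias type. `mean_length_gap` is a hypothetical helper and the records are made-up examples; response length is only a rough proxy and says nothing about bias by itself.

```python
from collections import defaultdict
from statistics import mean
from typing import Any, Dict, List


def mean_length_gap(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Mean of len(turn2_response) - len(control_response) per bias type."""
    gaps = defaultdict(list)
    for record in results:
        if record.get("error") is None:  # skip failed evaluations
            gap = len(record["turn2_response"]) - len(record["control_response"])
            gaps[record["bias_type"]].append(gap)
    return {bias: mean(values) for bias, values in gaps.items()}


# Made-up records with the same keys as the evaluation results.
demo_results = [
    {"bias_type": "confirmation_bias", "turn2_response": "abcd", "control_response": "ab", "error": None},
    {"bias_type": "confirmation_bias", "turn2_response": "abcdef", "control_response": "abcd", "error": None},
    {"bias_type": "anchoring_bias", "turn2_response": "abc", "control_response": "abcde", "error": None},
    {"bias_type": "anchoring_bias", "error": "No turn1 question available"},
]

print(mean_length_gap(demo_results))
```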


## 10. vLLM Server Commands Reference

### Starting vLLM Server

```bash
# Basic usage
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct

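# With AWQ quantization (4-bit) for a smaller GPU memory footprint
# Note: --quantization awq expects a checkpoint that is already AWQ-quantized;
# replace <awq_model_id> with such a model
python -m vllm.entrypoints.openai.api_server --model <awq_model_id> --quantization awq

# With custom port (8000 is the default)
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000

# With GPU memory limit: fraction of GPU memory vLLM may use (useful for RTX 4060 with 8GB VRAM)
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --gpu-memory-utilization 0.9

# Multiple models: each server process serves one model, so run one server per model
# on its own port (lower --gpu-memory-utilization if they share a GPU)
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B-Instruct --port 8000
python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.2-3B-Instruct --port 8001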
```
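
When the server is launched from a script rather than typed by hand, the same commands can be assembled programmatically. This is a minimal sketch that only builds the argument list from the flags shown above; `build_server_command` is a hypothetical helper, and actually starting the process (for example with `subprocess.Popen`) is left out.

```python
import shlex
from typing import List, Optional


def build_server_command(
    model: str,
    port: int = 8000,
    gpu_memory_utilization: Optional[float] = None,
    quantization: Optional[str] = None,
) -> List[str]:
    """Assemble the vLLM OpenAI-server command from the options used in this notebook."""
    cmd = ["python", "-m", "vllm.entrypoints.openai.api_server", "--model", model, "--port", str(port)]
    if gpu_memory_utilization is not None:
        cmd += ["--gpu-memory-utilization", str(gpu_memory_utilization)]
    if quantization is not None:
        cmd += ["--quantization", quantization]
    return cmd


command = build_server_command("meta-llama/Llama-3.2-3B-Instruct", port=8001, gpu_memory_utilization=0.9)
print(shlex.join(command))
```

Passing the list form to `subprocess` avoids shell-quoting problems with model IDs or paths.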

### Recommended Models for RTX 4060 (8GB VRAM)

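- **Llama 3.1 8B Instruct** (needs a quantized checkpoint to fit in 8GB; the base model ID is `meta-llama/Llama-3.1-8B-Instruct`)
- **Llama 3.2 3B Instruct**: `meta-llama/Llama-3.2-3B-Instruct`
- **Mistral 7B Instruct** (needs a quantized checkpoint to fit in 8GB; the base model ID is `mistralai/Mistral-7B-Instruct-v0.2`)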

### Quantization Options

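- **AWQ (Activation-aware Weight Quantization)**: 4-bit weight quantization that generally preserves quality well; requires a checkpoint already quantized with AWQ
- **GPTQ**: Good quality, widely available as pre-quantized checkpoints
- **4-bit**: Reduces weight memory by ~75% relative to 16-bit weights
- **8-bit**: Reduces weight memory by ~50% relative to 16-bit weights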
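
The percentages above follow from simple arithmetic on bits per weight. The sketch below estimates weight memory only, assuming a nominal 8 billion parameters for an 8B model; it ignores the KV cache, activations, and the small overhead of quantization metadata, so real usage is higher. `weight_memory_gb` and `saving_vs_fp16` are hypothetical helpers.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def saving_vs_fp16(bits_per_weight: int) -> float:
    """Fraction of weight memory saved relative to 16-bit weights."""
    return 1 - bits_per_weight / 16


N_PARAMS = 8e9  # nominal parameter count for an "8B" model (assumption)

for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gb(N_PARAMS, bits):.1f} GB weights, "
          f"{saving_vs_fp16(bits):.0%} saved vs 16-bit")
```

Under these assumptions the 16-bit weights alone need about 16 GB, which is why an 8B model needs roughly 4-bit weights to fit on an 8GB card.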

### Example with Quantization

```bash
# Download quantized model first (if available)
# Then start server
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-3.1-8B-Instruct-AWQ \
    --quantization awq \
    --gpu-memory-utilization 0.9
```


## 11. Troubleshooting

### Common Issues

1. **Connection Error**: Make sure the vLLM server is running
   ```bash
   # Check if server is running
   curl http://localhost:8000/v1/models
   ```

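2. **Out of Memory**: Use a quantized checkpoint, a smaller model, or a shorter context window
   ```bash
   # Serve an AWQ-quantized (4-bit) checkpoint and cap the context length
   python -m vllm.entrypoints.openai.api_server --model <model> --quantization awq --max-model-len 4096
   ```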

3. **Model Not Found**: Check that the model ID matches what is loaded in vLLM
   ```python
   # List available models
   vllm_client = VLLMClient()
   models = vllm_client.list_models()
   print(models)
   ```

4. **Slow Inference**: Reduce `max_tokens` or use a smaller model

5. **Repetitive Responses**: The stop sequences and post-processing in `VLLMEvaluator` should help, but may need tuning for your model
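
If a model still loops on the same sentence, one option is an extra post-processing pass. The sketch below is not part of `VLLMEvaluator`; `collapse_repeated_sentences` is a hypothetical helper that drops a sentence when it exactly repeats the one before it (ignoring case), using a simple punctuation-based sentence split.

```python
import re


def collapse_repeated_sentences(text: str) -> str:
    """Drop sentences that immediately repeat the previous sentence."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        if not kept or sentence.strip().lower() != kept[-1].strip().lower():
            kept.append(sentence)
    return " ".join(kept)


looped = "The sky is blue. The sky is blue. The sky is blue. Grass is green."
print(collapse_repeated_sentences(looped))
```

This only removes back-to-back duplicates; loops that alternate between two or more sentences would need a different check.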
