# GGR Experiment with vLLM Inference Monitoring

This notebook demonstrates the complete workflow for evaluating the Greedy Group Recursion (GGR) algorithm's impact on vLLM inference performance. We will:

1. **Load Reordered Dataset**: Load the dataset reordered by the GGR algorithm
2. **Initialize vLLM**: Set up vLLM with metrics logging enabled
3. **Perform Inference**: Run inference on the reordered dataset
4. **Monitor Resources**: Track GPU, CPU, and memory usage during inference
5. **Analyze Performance**: Evaluate KV cache utilization and prefix hit rates

## Key Metrics to Monitor

- **KV Cache Usage**: `vllm:gpu_cache_usage_perc`
- **Prefix Hit Rate**: `vllm:gpu_prefix_cache_hits / vllm:gpu_prefix_cache_queries`
- **Token Throughput**: Total tokens processed per second
- **Latency**: End-to-end request latency and time-to-first-token
- **System Resources**: GPU utilization, CPU usage, memory consumption
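
The prefix hit rate above is a ratio of two counters, so it can be computed from any snapshot of the metrics. The sketch below assumes the metrics are available as Prometheus text (for instance from a vLLM server's `/metrics` endpoint). The helper names `parse_prometheus_text` and `prefix_hit_rate` and the sample values are illustrative, and the exact metric names and suffixes may differ between vLLM versions.

```python
from typing import Dict


def parse_prometheus_text(text: str) -> Dict[str, float]:
    """Return one value per metric name, summing samples across label sets."""
    totals: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # Skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            totals[name] = totals.get(name, 0.0) + float(value)
        except ValueError:
            continue  # Ignore lines whose last field is not a number
    return totals


def prefix_hit_rate(hits: float, queries: float) -> float:
    """Hits divided by queries, or 0.0 when nothing has been queried yet."""
    return hits / queries if queries > 0 else 0.0


# Made-up sample in Prometheus text format (values are not real measurements)
SAMPLE_METRICS = """\
# TYPE vllm:gpu_prefix_cache_queries counter
vllm:gpu_prefix_cache_queries{model_name="demo"} 1200.0
vllm:gpu_prefix_cache_hits{model_name="demo"} 900.0
vllm:gpu_cache_usage_perc{model_name="demo"} 0.35
"""

parsed = parse_prometheus_text(SAMPLE_METRICS)
rate = prefix_hit_rate(
    parsed.get("vllm:gpu_prefix_cache_hits", 0.0),
    parsed.get("vllm:gpu_prefix_cache_queries", 0.0),
)
print(f"Prefix hit rate: {rate:.1%}")  # Prefix hit rate: 75.0%
```

Because both values are cumulative counters, a per-run hit rate comes from the difference between a snapshot taken before the run and one taken after it.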

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import os
import sys
import time
import json
import threading
from datetime import datetime
from typing import List, Dict, Any, Optional
import warnings
warnings.filterwarnings('ignore')

# Add project src to path
sys.path.append(os.path.join(os.getcwd(), 'src'))

# vLLM imports
try:
    from vllm import LLM, SamplingParams
    vllm_available = True
except ImportError:
    print("Warning: vLLM not installed. Install with: pip install vllm")
    vllm_available = False

# Resource monitoring imports
try:
    import psutil
    psutil_available = True
except ImportError:
    print("Warning: psutil not installed. Install with: pip install psutil")
    psutil_available = False

try:
    import pynvml
    pynvml_available = True
except ImportError:
    print("Warning: pynvml not installed. Install with: pip install pynvml")
    pynvml_available = False

# Data processing imports
import matplotlib.pyplot as plt
import seaborn as sns

print("Libraries imported successfully!")
print(f"vLLM available: {vllm_available}")
print(f"psutil available: {psutil_available}")
print(f"pynvml available: {pynvml_available}")

## 1. Load Reordered Dataset

First, we'll load the dataset that has been reordered by the GGR algorithm. GGR groups rows with shared field values together, so the prompts built from adjacent rows share long prefixes and can reuse each other's KV cache entries.
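
To see why ordering matters, the following sketch measures how many leading characters each prompt shares with the one before it. Character overlap is only a rough proxy for token-level prefix reuse, and the function name and toy prompts are illustrative rather than part of the GGR implementation.

```python
import os
from typing import List


def adjacent_shared_prefix_chars(prompts: List[str]) -> List[int]:
    """Length of the common character prefix between each prompt and its predecessor."""
    return [
        len(os.path.commonprefix([previous, current]))
        for previous, current in zip(prompts, prompts[1:])
    ]


# Toy prompts: the same four rows in a grouped and an interleaved order
grouped = [
    "Movie: A\nReview: good",
    "Movie: A\nReview: bad",
    "Movie: B\nReview: good",
    "Movie: B\nReview: bad",
]
interleaved = [grouped[0], grouped[2], grouped[1], grouped[3]]

print(adjacent_shared_prefix_chars(grouped))      # [17, 7, 17]
print(adjacent_shared_prefix_chars(interleaved))  # [7, 7, 7]
```

The grouped order shares 41 characters across adjacent pairs versus 21 for the interleaved order, which is the kind of gap a reordering step aims to widen.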

In [None]:
# Configuration
CONFIG = {
    'reordered_dataset_path': 'results/test_experiment_reordered_table.csv',  # Path to GGR reordered dataset
    'baseline_dataset_path': 'data/sample_movies.csv',  # Original dataset for comparison
    'model_name': 'microsoft/DialoGPT-medium',  # Smaller model for testing (change to larger model if needed)
    'output_dir': 'inference_results',
    'max_samples': 50,  # Limit samples for testing (set to None for full dataset)
    'sampling_params': {
        'temperature': 0.7,
        'top_p': 0.9,
        'max_tokens': 100,
        'seed': 42  # For reproducibility
    },
    'monitoring_interval': 5  # seconds between resource monitoring samples
}

# Create output directory
os.makedirs(CONFIG['output_dir'], exist_ok=True)

def load_dataset(file_path: str, max_samples: Optional[int] = None) -> Optional[pd.DataFrame]:
    """Load a CSV dataset, optionally truncated to max_samples rows; return None on failure"""
    try:
        df = pd.read_csv(file_path)
        print(f"Dataset loaded: {df.shape}")
        print(f"Columns: {list(df.columns)}")
        
        if max_samples:
            df = df.head(max_samples)
            print(f"Limited to {max_samples} samples")
        
        return df
    except FileNotFoundError:
        print(f"Error: Dataset file not found at {file_path}")
        print("Please ensure you have run the GGR experiment first to generate the reordered dataset.")
        return None
    except Exception as e:
        print(f"Error loading dataset: {e}")
        return None

# Load reordered dataset
print("Loading GGR reordered dataset...")
reordered_df = load_dataset(CONFIG['reordered_dataset_path'], CONFIG['max_samples'])

if reordered_df is not None:
    print("\nFirst few rows of reordered dataset:")
    print(reordered_df.head())
else:
    print("\nCreating sample dataset for demonstration...")
    # Create sample dataset if reordered dataset not available
    sample_data = {
        'movie_id': [f'movie_{i:03d}' for i in range(1, 21)] * 2,
        'movie_title': [f'Movie Title {i}' for i in range(1, 21)] * 2,
        'review_content': [
            f'This movie about {i} is excellent. Great acting and plot.' if i % 2 == 0 
            else f'This movie about {i} is disappointing. Poor execution.'
            for i in range(1, 21)
        ] * 2
    }
    reordered_df = pd.DataFrame(sample_data)
    if CONFIG['max_samples']:
        reordered_df = reordered_df.head(CONFIG['max_samples'])
    print(f"Created sample dataset with {len(reordered_df)} rows")

In [None]:
def create_prompts(df: pd.DataFrame) -> List[str]:
    """Create inference prompts from dataset rows"""
    prompts = []
    
    for _, row in df.iterrows():
        # Create a structured prompt that encourages prefix reuse
        if 'review_content' in df.columns and 'movie_title' in df.columns:
            prompt = (
                "Analyze the following movie review and determine the sentiment:\n\n"
                f"Movie: {row.get('movie_title', 'Unknown')}\n"
                f"Review: {row.get('review_content', 'No review')}\n\n"
                "Sentiment:"
            )
        else:
            # Generic prompt format
            prompt = f"Analyze the following data:\n{dict(row)}\n\nAnalysis:"
        
        prompts.append(prompt)
    
    return prompts

# Create prompts from the reordered dataset
if reordered_df is not None:
    inference_prompts = create_prompts(reordered_df)
    print(f"\nCreated {len(inference_prompts)} prompts for inference")
    print("\nExample prompt:")
    print("-" * 50)
    print(inference_prompts[0])
    print("-" * 50)
    
    # Group prompts by the text before "Review:" (instruction plus movie title).
    # The first line alone is the fixed instruction, which every prompt shares.
    # Generic-format prompts contain no "Review:" line, so the whole prompt is the key.
    prefix_analysis = {}
    for i, prompt in enumerate(inference_prompts[:10]):
        prefix = prompt.split('\nReview:')[0]
        if prefix not in prefix_analysis:
            prefix_analysis[prefix] = []
        prefix_analysis[prefix].append(i)
    
    print(f"\nPrefix analysis (first 10 prompts):")
    for prefix, indices in prefix_analysis.items():
        if len(indices) > 1:
            print(f"Prefix ending {prefix[-50:]!r} appears in prompts: {indices}")
else:
    inference_prompts = []

## 2. Initialize vLLM and Configure Metrics Logging

Now we'll set up the vLLM model with appropriate configuration for metrics collection. We'll also prepare resource monitoring systems.
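
The configuration in this section sets `block_size` to 16. As a rough sketch, assuming prefix caching reuses only whole KV cache blocks, the number of reusable tokens is the shared prefix length rounded down to a multiple of the block size. The helper name below is illustrative and is not a vLLM API.

```python
def reusable_prefix_tokens(shared_prefix_tokens: int, block_size: int = 16) -> int:
    """Round a shared prefix length down to a whole number of KV cache blocks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if shared_prefix_tokens < 0:
        raise ValueError("shared_prefix_tokens must be non-negative")
    return (shared_prefix_tokens // block_size) * block_size


for shared in (15, 16, 40):
    print(shared, "->", reusable_prefix_tokens(shared))
# 15 -> 0
# 16 -> 16
# 40 -> 32
```

Under this assumption, two prompts that share fewer than 16 leading tokens gain nothing from the cache, so short shared instructions alone may not produce measurable hits.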

In [None]:
class ResourceMonitor:
    """Monitor system resources during inference"""
    
    def __init__(self, interval: int = 5):
        self.interval = interval
        self.monitoring = False
        self.metrics = []
        self.thread = None
        
        # Initialize monitoring libraries
        if pynvml_available:
            try:
                pynvml.nvmlInit()
                self.gpu_available = True
                self.gpu_count = pynvml.nvmlDeviceGetCount()
                print(f"GPU monitoring initialized: {self.gpu_count} GPU(s) detected")
            except Exception as e:
                print(f"GPU monitoring initialization failed: {e}")
                self.gpu_available = False
        else:
            self.gpu_available = False
    
    def _monitor(self):
        """Background monitoring function"""
        while self.monitoring:
            try:
                timestamp = time.time()
                metric = {'timestamp': timestamp, 'datetime': datetime.now().isoformat()}
                
                # CPU and Memory monitoring
                if psutil_available:
                    # cpu_percent(interval=1) blocks for 1 s, so samples land interval + 1 s apart
                    metric['cpu_percent'] = psutil.cpu_percent(interval=1)
                    memory = psutil.virtual_memory()
                    metric['memory_percent'] = memory.percent
                    metric['memory_used_gb'] = memory.used / (1024**3)
                    metric['memory_available_gb'] = memory.available / (1024**3)
                
                # GPU monitoring
                if self.gpu_available:
                    for gpu_idx in range(self.gpu_count):
                        handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_idx)
                        
                        # GPU utilization
                        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
                        metric[f'gpu_{gpu_idx}_util_percent'] = util.gpu
                        metric[f'gpu_{gpu_idx}_memory_util_percent'] = util.memory
                        
                        # GPU memory
                        memory_info = pynvml.nvmlDeviceGetMemoryInfo(handle)
                        metric[f'gpu_{gpu_idx}_memory_used_gb'] = memory_info.used / (1024**3)
                        metric[f'gpu_{gpu_idx}_memory_free_gb'] = memory_info.free / (1024**3)
                        metric[f'gpu_{gpu_idx}_memory_total_gb'] = memory_info.total / (1024**3)
                
                self.metrics.append(metric)
                time.sleep(self.interval)
                
            except Exception as e:
                print(f"Monitoring error: {e}")
                time.sleep(self.interval)
    
    def start(self):
        """Start monitoring"""
        if not self.monitoring:
            self.monitoring = True
            self.thread = threading.Thread(target=self._monitor)
            self.thread.daemon = True
            self.thread.start()
            print(f"Resource monitoring started (interval: {self.interval}s)")
    
    def stop(self):
        """Stop monitoring"""
        if self.monitoring:
            self.monitoring = False
            if self.thread:
                self.thread.join(timeout=self.interval + 2)
            print("Resource monitoring stopped")
    
    def get_metrics(self) -> List[Dict]:
        """Get collected metrics"""
        return self.metrics.copy()
    
    def save_metrics(self, filepath: str):
        """Save metrics to file"""
        df = pd.DataFrame(self.metrics)
        df.to_csv(filepath, index=False)
        print(f"Resource metrics saved to {filepath}")

# Initialize resource monitor
resource_monitor = ResourceMonitor(interval=CONFIG['monitoring_interval'])

In [None]:
def initialize_vllm_model(model_name: str, **kwargs):
    """Initialize vLLM model with appropriate configuration"""
    if not vllm_available:
        print("vLLM not available. Please install with: pip install vllm")
        return None, None
    
    try:
        print(f"Initializing vLLM model: {model_name}")
        print("This may take a few minutes for the first run...")
        
        # Configure vLLM with prefix caching enabled
        llm_config = {
            'model': model_name,
            'gpu_memory_utilization': 0.8,  # Use 80% of GPU memory
            'enable_prefix_caching': True,   # Enable prefix caching for GGR benefits
            'block_size': 16,                # KV cache block size in tokens
            'max_num_seqs': 8,              # Maximum sequences scheduled together per step
            'seed': CONFIG['sampling_params']['seed'],
            **kwargs
        }
        
        llm = LLM(**llm_config)
        
        # Create sampling parameters
        sampling_params = SamplingParams(**CONFIG['sampling_params'])
        
        print("vLLM model initialized successfully!")
        print(f"Model config: {llm_config}")
        print(f"Sampling params: {CONFIG['sampling_params']}")
        
        return llm, sampling_params
        
    except Exception as e:
        print(f"Error initializing vLLM model: {e}")
        print("This might be due to:")
        print("1. Model not available or incorrect name")
        print("2. Insufficient GPU memory")
        print("3. CUDA/GPU setup issues")
        return None, None

# Initialize vLLM (comment out if running without GPU or vLLM)
if vllm_available and len(inference_prompts) > 0:
    print("Initializing vLLM model...")
    llm, sampling_params = initialize_vllm_model(CONFIG['model_name'])
else:
    print("Skipping vLLM initialization (not available or no prompts)")
    llm, sampling_params = None, None

## 3. Perform Inference with vLLM

Now we'll run inference on our reordered dataset while monitoring vLLM internal metrics and system resources.
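
Prompts per second hides differences in prompt and output length, so token throughput is often the more comparable figure. The sketch below computes it from per-request token counts. It assumes those counts are read from each vLLM output as `len(output.prompt_token_ids)` and `len(output.outputs[0].token_ids)`. The function name and the sample numbers are illustrative.

```python
from typing import Dict, Sequence


def token_throughput(
    prompt_token_counts: Sequence[int],
    generated_token_counts: Sequence[int],
    duration_s: float,
) -> Dict[str, float]:
    """Tokens per second for one batch, split into prompt and generated tokens."""
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    prompt_total = sum(prompt_token_counts)
    generated_total = sum(generated_token_counts)
    return {
        "prompt_tokens_per_s": prompt_total / duration_s,
        "generated_tokens_per_s": generated_total / duration_s,
        "total_tokens_per_s": (prompt_total + generated_total) / duration_s,
    }


# Made-up counts for a batch of four requests that took 2 seconds
stats = token_throughput([30, 32, 31, 29], [100, 100, 80, 100], duration_s=2.0)
print(stats)
# {'prompt_tokens_per_s': 61.0, 'generated_tokens_per_s': 190.0, 'total_tokens_per_s': 251.0}
```

Reporting prompt and generated tokens separately helps here because prefix caching speeds up prompt processing, not generation.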

In [None]:
class VLLMMetricsCollector:
    """Collect vLLM internal metrics during inference"""
    
    def __init__(self):
        self.metrics = {
            'inference_runs': [],
            'total_metrics': {}
        }
    
    def collect_engine_stats(self, llm, run_info: Dict):
        """Collect vLLM engine statistics"""
        try:
            # Try to access engine stats (method varies by vLLM version)
            stats = None
            
            # Try different methods to access stats
            if hasattr(llm, 'llm_engine'):
                engine = llm.llm_engine
                
                # Method 1: Direct stats access (older versions)
                if hasattr(engine, '_get_stats'):
                    stats = engine._get_stats()
                elif hasattr(engine, 'get_stats'):
                    stats = engine.get_stats()
                
                # Method 2: Access via scheduler (newer versions)
                elif hasattr(engine, 'scheduler') and hasattr(engine.scheduler, 'get_stats'):
                    stats = engine.scheduler.get_stats()
            
            if stats:
                metric_data = {
                    'run_info': run_info,
                    'timestamp': time.time(),
                    'stats': self._extract_stats(stats)
                }
                self.metrics['inference_runs'].append(metric_data)
                print(f"Collected vLLM stats: {metric_data['stats']}")
            else:
                print("Could not access vLLM internal stats (this is normal for newer versions)")
                # Fallback: collect basic timing and token info
                self.metrics['inference_runs'].append({
                    'run_info': run_info,
                    'timestamp': time.time(),
                    'stats': {'note': 'Internal stats not accessible'}
                })
                
        except Exception as e:
            print(f"Error collecting vLLM stats: {e}")
    
    def _extract_stats(self, stats):
        """Extract relevant metrics from stats object"""
        extracted = {}
        
        # Common stat attributes to check
        stat_attrs = [
            'gpu_cache_usage_sys', 'gpu_cache_usage_perc',
            'gpu_prefix_cache_hit_rate', 'gpu_prefix_cache_queries', 'gpu_prefix_cache_hits',
            'prompt_tokens', 'generation_tokens', 'total_tokens',
            'num_requests_running', 'num_requests_waiting'
        ]
        
        for attr in stat_attrs:
            if hasattr(stats, attr):
                value = getattr(stats, attr)
                extracted[attr] = float(value) if isinstance(value, (int, float)) else str(value)
        
        return extracted
    
    def get_metrics(self):
        """Get collected metrics"""
        return self.metrics.copy()
    
    def save_metrics(self, filepath: str):
        """Save metrics to file"""
        with open(filepath, 'w') as f:
            json.dump(self.metrics, f, indent=2, default=str)
        print(f"vLLM metrics saved to {filepath}")

# Initialize metrics collector
vllm_metrics = VLLMMetricsCollector()

In [None]:
def run_inference_with_monitoring(llm, sampling_params, prompts: List[str], batch_size: int = 4):
    """Run inference while monitoring performance metrics"""
    if not llm or not prompts:
        print("Skipping inference: no model or prompts available")
        return [], []
    
    print(f"Starting inference on {len(prompts)} prompts...")
    
    # Start resource monitoring
    resource_monitor.start()
    
    # Track inference timing and results
    inference_results = []
    timing_data = []
    
    try:
        # Process prompts in batches
        for batch_start in range(0, len(prompts), batch_size):
            batch_end = min(batch_start + batch_size, len(prompts))
            batch_prompts = prompts[batch_start:batch_end]
            
            print(f"Processing batch {batch_start//batch_size + 1}/{(len(prompts)-1)//batch_size + 1} ({len(batch_prompts)} prompts)")
            
            # Time the batch
            batch_start_time = time.perf_counter()
            
            # Run inference
            outputs = llm.generate(batch_prompts, sampling_params)
            
            batch_end_time = time.perf_counter()
            batch_duration = batch_end_time - batch_start_time
            
            # Collect results and timing
            for i, output in enumerate(outputs):
                prompt_idx = batch_start + i
                generated_text = output.outputs[0].text if output.outputs else ""
                
                result = {
                    'prompt_idx': prompt_idx,
                    'prompt': batch_prompts[i],
                    'generated_text': generated_text,
                    'batch_duration': batch_duration,
                    'batch_size': len(batch_prompts)
                }
                inference_results.append(result)
            
            # Collect vLLM metrics after each batch
            run_info = {
                'batch_idx': batch_start // batch_size,
                'batch_size': len(batch_prompts),
                'batch_duration': batch_duration,
                'prompts_processed': batch_end
            }
            vllm_metrics.collect_engine_stats(llm, run_info)
            
            timing_data.append({
                'batch_idx': batch_start // batch_size,
                'batch_size': len(batch_prompts),
                'duration': batch_duration,
                'prompts_per_second': len(batch_prompts) / batch_duration,
                'cumulative_prompts': batch_end
            })
            
            # Small delay between batches for monitoring
            time.sleep(1)
    
    except Exception as e:
        print(f"Error during inference: {e}")
    
    finally:
        # Stop resource monitoring
        resource_monitor.stop()
    
    print(f"Inference completed! Processed {len(inference_results)} prompts")
    return inference_results, timing_data

# Run inference if model is available
if llm and len(inference_prompts) > 0:
    print("\n" + "="*50)
    print("STARTING INFERENCE WITH MONITORING")
    print("="*50)
    
    inference_start_time = time.perf_counter()
    
    results, timing_data = run_inference_with_monitoring(
        llm, sampling_params, inference_prompts, batch_size=4
    )
    
    inference_end_time = time.perf_counter()
    total_inference_time = inference_end_time - inference_start_time
    
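    # Wall-clock time includes the 1 s pause after each batch, so per-prompt
    # figures use the summed batch durations instead.
    generation_time = sum(t['duration'] for t in timing_data)
    print(f"\nTotal wall-clock time: {total_inference_time:.2f} seconds")
    if results and generation_time > 0:
        print(f"Average time per prompt: {generation_time/len(results):.2f} seconds")
        print(f"Throughput: {len(results)/generation_time:.2f} prompts/second")
    else:
        print("No prompts were processed; check the error output above")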
    
else:
    print("Skipping inference - model not available or no prompts")
    results, timing_data = [], []

## 4. Log Resource Usage

Let's examine the resource usage data we collected during inference and save it for analysis.

In [None]:
# Save all collected data
print("Saving experimental data...")

# Save resource monitoring data
resource_metrics_file = os.path.join(CONFIG['output_dir'], 'resource_metrics.csv')
resource_monitor.save_metrics(resource_metrics_file)

# Save vLLM metrics
vllm_metrics_file = os.path.join(CONFIG['output_dir'], 'vllm_metrics.json')
vllm_metrics.save_metrics(vllm_metrics_file)

# Save inference results
if results:
    results_df = pd.DataFrame(results)
    results_file = os.path.join(CONFIG['output_dir'], 'inference_results.csv')
    results_df.to_csv(results_file, index=False)
    print(f"Inference results saved to {results_file}")
    
    # Show sample results
    print("\nSample inference results:")
    print(results_df[['prompt_idx', 'generated_text']].head())

# Save timing data
if timing_data:
    timing_df = pd.DataFrame(timing_data)
    timing_file = os.path.join(CONFIG['output_dir'], 'timing_data.csv')
    timing_df.to_csv(timing_file, index=False)
    print(f"Timing data saved to {timing_file}")
    
    print("\nTiming summary:")
    print(timing_df.describe())

# Display resource usage summary
resource_metrics = resource_monitor.get_metrics()
if resource_metrics:
    print(f"\nResource monitoring summary ({len(resource_metrics)} samples):")
    
    df_resources = pd.DataFrame(resource_metrics)
    
    if psutil_available and 'cpu_percent' in df_resources.columns:
        print(f"CPU Usage: {df_resources['cpu_percent'].mean():.1f}% (avg), {df_resources['cpu_percent'].max():.1f}% (max)")
        print(f"Memory Usage: {df_resources['memory_percent'].mean():.1f}% (avg), {df_resources['memory_percent'].max():.1f}% (max)")
    
    # GPU metrics if available
    gpu_columns = [col for col in df_resources.columns if col.startswith('gpu_')]
    if gpu_columns:
        for col in gpu_columns[:4]:  # Show first few GPU metrics
            if col.endswith('_util_percent') or col.endswith('_memory_util_percent'):
                print(f"{col}: {df_resources[col].mean():.1f}% (avg), {df_resources[col].max():.1f}% (max)")
else:
    print("No resource metrics collected")

## 5. Analyze Metrics and Resource Utilization

Now let's analyze the collected metrics to evaluate the performance impact of the GGR ordering and visualize the results.

In [None]:
# Create comprehensive analysis
print("="*60)
print("COMPREHENSIVE PERFORMANCE ANALYSIS")
print("="*60)

# 1. Inference Performance Analysis
if results and timing_data:
    print("\n1. INFERENCE PERFORMANCE METRICS")
    print("-" * 40)
    
    timing_df = pd.DataFrame(timing_data)
    total_time = timing_df['duration'].sum()
    total_prompts = len(results)
    avg_throughput = total_prompts / total_time
    
    print(f"Total Prompts Processed: {total_prompts}")
    print(f"Total Processing Time: {total_time:.2f} seconds")
    print(f"Average Throughput: {avg_throughput:.2f} prompts/second")
    print(f"Average Batch Size: {timing_df['batch_size'].mean():.1f}")
    print(f"Average Time per Batch: {timing_df['duration'].mean():.2f} seconds")
    print(f"Min Batch Time: {timing_df['duration'].min():.2f} seconds")
    print(f"Max Batch Time: {timing_df['duration'].max():.2f} seconds")

# 2. Resource Utilization Analysis
resource_metrics = resource_monitor.get_metrics()
if resource_metrics:
    print("\n2. RESOURCE UTILIZATION ANALYSIS")
    print("-" * 40)
    
    df_resources = pd.DataFrame(resource_metrics)
    
    if psutil_available and 'cpu_percent' in df_resources.columns:
        print(f"CPU Utilization:")
        print(f"  Average: {df_resources['cpu_percent'].mean():.1f}%")
        print(f"  Peak: {df_resources['cpu_percent'].max():.1f}%")
        print(f"  Std Dev: {df_resources['cpu_percent'].std():.1f}%")
        
        print(f"Memory Utilization:")
        print(f"  Average: {df_resources['memory_percent'].mean():.1f}%")
        print(f"  Peak: {df_resources['memory_percent'].max():.1f}%")
        print(f"  Average Used: {df_resources['memory_used_gb'].mean():.1f} GB")
    
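    # GPU Analysis (skip the memory-activity columns, which also end in '_util_percent')
    gpu_util_cols = [
        col for col in df_resources.columns
        if col.endswith('_util_percent') and not col.endswith('_memory_util_percent')
    ]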
    gpu_mem_cols = [col for col in df_resources.columns if col.endswith('_memory_used_gb')]
    
    if gpu_util_cols:
        print(f"GPU Utilization:")
        for col in gpu_util_cols:
            gpu_id = col.split('_')[1]
            print(f"  GPU {gpu_id} Average: {df_resources[col].mean():.1f}%")
            print(f"  GPU {gpu_id} Peak: {df_resources[col].max():.1f}%")
    
    if gpu_mem_cols:
        print(f"GPU Memory Usage:")
        for col in gpu_mem_cols:
            gpu_id = col.split('_')[1]
            print(f"  GPU {gpu_id} Average: {df_resources[col].mean():.1f} GB")
            print(f"  GPU {gpu_id} Peak: {df_resources[col].max():.1f} GB")

# 3. vLLM Metrics Analysis
vllm_metrics_data = vllm_metrics.get_metrics()
if vllm_metrics_data['inference_runs']:
    print("\n3. vLLM INTERNAL METRICS")
    print("-" * 40)
    
    # Analyze prefix cache performance
    prefix_cache_data = []
    for run in vllm_metrics_data['inference_runs']:
        stats = run['stats']
        if 'gpu_prefix_cache_hit_rate' in stats:
            prefix_cache_data.append({
                'batch_idx': run['run_info']['batch_idx'],
                'hit_rate': stats['gpu_prefix_cache_hit_rate'],
                'cache_usage': stats.get('gpu_cache_usage_perc', 0)
            })
    
    if prefix_cache_data:
        cache_df = pd.DataFrame(prefix_cache_data)
        print(f"Prefix Cache Performance:")
        print(f"  Average Hit Rate: {cache_df['hit_rate'].mean():.1%}")
        print(f"  Peak Hit Rate: {cache_df['hit_rate'].max():.1%}")
        print(f"  Average Cache Usage: {cache_df['cache_usage'].mean():.1%}")
        
        # This is where GGR should show improvement!
        if cache_df['hit_rate'].mean() > 0.5:  # 50% hit rate threshold
            print(f"  ✅ HIGH PREFIX CACHE HIT RATE - GGR appears effective!")
        else:
            print(f"  ⚠️  Low prefix cache hit rate - consider optimizing data ordering")
    else:
        print("  Internal metrics not available (normal for newer vLLM versions)")
else:
    print("\n3. vLLM metrics not available")

print("\n4. EXPERIMENTAL CONCLUSIONS")
print("-" * 40)
print("To evaluate GGR effectiveness, compare these metrics with a baseline run:")
print("• Higher prefix cache hit rate indicates better GGR ordering")
print("• Higher throughput (prompts/second) shows performance improvement")
print("• More stable GPU utilization suggests better batching")
print("• Lower variance in batch times indicates consistent performance")

In [None]:
# Create visualizations
if len(resource_metrics) > 1 or len(timing_data) > 1:
    print("\nCreating performance visualizations...")
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('GGR vLLM Inference Performance Analysis', fontsize=16)
    
    # 1. Timing Analysis
    if timing_data:
        timing_df = pd.DataFrame(timing_data)
        axes[0, 0].plot(timing_df['batch_idx'], timing_df['prompts_per_second'], 'b-o')
        axes[0, 0].set_title('Inference Throughput Over Time')
        axes[0, 0].set_xlabel('Batch Index')
        axes[0, 0].set_ylabel('Prompts/Second')
        axes[0, 0].grid(True)
    
    # 2. Resource Usage
    if resource_metrics:
        df_resources = pd.DataFrame(resource_metrics)
        if 'cpu_percent' in df_resources.columns:
            axes[0, 1].plot(range(len(df_resources)), df_resources['cpu_percent'], 'g-', label='CPU %')
            axes[0, 1].plot(range(len(df_resources)), df_resources['memory_percent'], 'r-', label='Memory %')
            axes[0, 1].set_title('System Resource Usage')
            axes[0, 1].set_xlabel('Time Sample')
            axes[0, 1].set_ylabel('Usage %')
            axes[0, 1].legend()
            axes[0, 1].grid(True)
    
    # 3. GPU Utilization (if available); memory-activity columns are excluded
    gpu_util_cols = [
        col for col in df_resources.columns
        if col.endswith('_util_percent') and not col.endswith('_memory_util_percent')
    ] if resource_metrics else []
    if gpu_util_cols:
        for col in gpu_util_cols[:2]:  # Show up to 2 GPUs
            gpu_id = col.split('_')[1]
            axes[1, 0].plot(range(len(df_resources)), df_resources[col], label=f'GPU {gpu_id}')
        axes[1, 0].set_title('GPU Utilization')
        axes[1, 0].set_xlabel('Time Sample')
        axes[1, 0].set_ylabel('GPU Usage %')
        axes[1, 0].legend()
        axes[1, 0].grid(True)
    else:
        axes[1, 0].text(0.5, 0.5, 'GPU data not available', ha='center', va='center', transform=axes[1, 0].transAxes)
        axes[1, 0].set_title('GPU Utilization')
    
    # 4. Batch Performance Distribution
    if timing_data:
        axes[1, 1].hist(timing_df['duration'], bins=min(10, len(timing_df)), alpha=0.7, color='skyblue')
        axes[1, 1].set_title('Batch Duration Distribution')
        axes[1, 1].set_xlabel('Duration (seconds)')
        axes[1, 1].set_ylabel('Frequency')
        axes[1, 1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    
    # Save plot
    plot_file = os.path.join(CONFIG['output_dir'], 'performance_analysis.png')
    plt.savefig(plot_file, dpi=300, bbox_inches='tight')
    print(f"Performance plots saved to {plot_file}")
    
    plt.show()
else:
    print("Insufficient data for visualization")

## Experiment Summary and Next Steps

This notebook demonstrates the complete workflow for evaluating GGR's impact on vLLM inference performance. Here's what we accomplished:

### ✅ **Completed Tasks:**
1. **Dataset Loading**: Loaded GGR-reordered dataset with prompts optimized for prefix reuse
2. **vLLM Setup**: Configured vLLM with prefix caching enabled for maximum GGR benefits
3. **Inference Monitoring**: Ran inference while collecting detailed performance metrics
4. **Resource Tracking**: Monitored GPU, CPU, and memory usage throughout the process
5. **Analysis & Visualization**: Analyzed results and created performance visualizations

### 📊 **Key Metrics Collected:**
- **Inference Performance**: Throughput (prompts/second), batch timing, batch duration distribution
- **vLLM Internal Metrics**: KV cache usage, prefix cache hit rate, and token processing stats, when the installed vLLM version exposes them
- **System Resources**: GPU utilization, memory usage, CPU load
- **GGR Effectiveness**: Prefix reuse patterns and cache hit rates; performance improvements can only be claimed after a baseline comparison

### 🚀 **Next Steps for Complete Evaluation:**

1. **Baseline Comparison**: Run the same experiment with randomly ordered data to establish baseline performance
2. **Statistical Analysis**: Compare GGR vs baseline using proper statistical tests
3. **Scale Testing**: Test with larger datasets to see GGR benefits at scale
4. **Model Variations**: Test different model sizes to see where GGR provides most benefit
5. **Parameter Tuning**: Optimize vLLM parameters (batch size, cache settings) for GGR
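
For the baseline comparison and statistical analysis steps, one option is a permutation test on per-batch durations, which makes no assumption about their distribution. The sketch below uses only the standard library. The function name and the duration values are made up for illustration and are not results from this notebook.

```python
import random
from statistics import mean
from typing import Sequence


def permutation_test(
    a: Sequence[float], b: Sequence[float], n_resamples: int = 2000, seed: int = 0
) -> float:
    """Two-sided p-value for the difference in means between samples a and b."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # Reassign durations to the two groups at random
        if abs(mean(pooled[:n_a]) - mean(pooled[n_a:])) >= observed:
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p = 0
    return (extreme + 1) / (n_resamples + 1)


# Hypothetical per-batch durations in seconds
ggr_durations = [1.9, 2.0, 2.1, 1.8, 2.0, 1.9]
baseline_durations = [2.9, 3.1, 3.0, 3.2, 2.8, 3.0]

speedup = mean(baseline_durations) / mean(ggr_durations)
p_value = permutation_test(ggr_durations, baseline_durations)
print(f"Speedup: {speedup:.2f}x, p-value: {p_value:.4f}")
```

With real data, `ggr_durations` and `baseline_durations` would come from the `duration` column of `timing_data.csv` for each run, ideally after discarding the first batch so that model warm-up does not skew either side.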

### 📈 **Expected GGR Benefits:**
- **Higher Prefix Cache Hit Rate**: for example, a rise from roughly 20% to 80% or more
- **Better Throughput**: 1.5-4x faster inference, as reported in the literature
- **Consistent Performance**: Lower variance in batch timing
- **Efficient Resource Usage**: Higher GPU utilization, better memory efficiency

### 🔧 **Experiment Configuration Used:**

In [None]:
# Display final experiment summary
print("="*60)
print("FINAL EXPERIMENT SUMMARY")
print("="*60)

print("\nExperiment Configuration:")
for key, value in CONFIG.items():
    if key != 'sampling_params':
        print(f"  {key}: {value}")
    else:
        print(f"  {key}:")
        for subkey, subvalue in value.items():
            print(f"    {subkey}: {subvalue}")

print(f"\nFiles Generated:")
output_files = [
    'resource_metrics.csv',
    'vllm_metrics.json', 
    'inference_results.csv',
    'timing_data.csv',
    'performance_analysis.png'
]

for filename in output_files:
    filepath = os.path.join(CONFIG['output_dir'], filename)
    if os.path.exists(filepath):
        size_mb = os.path.getsize(filepath) / (1024*1024)
        print(f"  ✅ {filename} ({size_mb:.2f} MB)")
    else:
        print(f"  ❌ {filename} (not found)")

print(f"\nExperiment completed at: {datetime.now().isoformat()}")
print(f"Results saved in: {CONFIG['output_dir']}")

if vllm_available:
    print("\n🎯 To compare with baseline:")
    print("1. Save this experiment as 'GGR_results'")
    print("2. Shuffle your dataset randomly")
    print("3. Re-run this notebook with the shuffled dataset")
    print("4. Compare prefix cache hit rates and throughput")
else:
    print("\n⚠️  vLLM was not available for this run")
    print("Install vLLM with: pip install vllm")
    print("Then re-run to see actual performance metrics")

print("\n📊 Analysis complete!")