# Phi-2 AI Judge Evaluation Demo

This notebook demonstrates how to use the Microsoft Phi-2 model as an AI judge that scores responses to prompts.
Phi-2 is a compact 2.7B-parameter transformer language model, small enough to run on modest hardware while still handling reasoning and code-related text.

## Features:
- Single model approach for comprehensive evaluation
- Resource-efficient implementation
- Support for multiple evaluation aspects
- Batch processing capabilities
- Response comparison functionality

## Setup and Installation

First, let's install the required dependencies and import necessary modules.

In [None]:
# Install required packages
!pip install torch transformers accelerate pandas matplotlib seaborn

In [None]:
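import sys
import os

# Make the project's app/ directory importable. This relative path assumes the
# notebook is run from a directory that sits next to app/.
sys.path.append('../app')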

# Import our Phi-2 judge
from models.phi2_judge import get_phi2_judge, Phi2Judge

# Other imports
import json
import time
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print("✅ All dependencies loaded successfully!")

## Initialize the Phi-2 Judge

Let's create an instance of our Phi-2 judge and check its configuration.

In [None]:
# Get the singleton instance
judge = get_phi2_judge()

# Check model information
model_info = judge.get_model_info()
print("📊 Model Information:")
for key, value in model_info.items():
    print(f"  {key}: {value}")

## Basic Evaluation Example

Let's start with a simple evaluation example to see how the Phi-2 judge works.

In [None]:
# Define a sample prompt and response
sample_prompt = "Explain how machine learning works in simple terms."
sample_response = """
Machine learning is like teaching a computer to recognize patterns and make decisions 
by showing it lots of examples. Instead of programming specific rules, we feed the 
computer data and let it learn from that data. For example, if we want to teach a 
computer to recognize cats in photos, we show it thousands of cat pictures. The 
computer finds patterns in these images and learns what makes a cat look like a cat. 
Then when we show it a new photo, it can predict whether there's a cat in it based 
on what it learned.
"""

print("🔍 Evaluating sample prompt-response pair...")
print(f"Prompt: {sample_prompt}")
print(f"Response: {sample_response.strip()[:100]}...")
print()

In [None]:
# Perform the evaluation
start_time = time.time()
result = judge.evaluate(sample_prompt, sample_response)
evaluation_time = time.time() - start_time

print(f"⏱️ Evaluation completed in {evaluation_time:.2f} seconds")
print(f"📊 Overall Score: {result['overall_score']}/10")
print()
print("📋 Detailed Scores:")
for aspect, score in result['detailed_scores'].items():
    print(f"  {aspect.capitalize()}: {score}/10")

print()
print("💬 Detailed Feedback:")
for aspect, feedback in result['detailed_feedback'].items():
    print(f"  {aspect.capitalize()}: {feedback[:100]}...")

## Custom Weight Evaluation

You can customize the importance of different evaluation aspects by providing custom weights.

In [None]:
# Define custom weights (prioritizing relevance and accuracy)
custom_weights = {
    "relevance": 0.4,
    "accuracy": 0.4,
    "coherence": 0.15,
    "completeness": 0.05
}

print("🎯 Custom weighted evaluation:")
print(f"Weights: {custom_weights}")
print()

# Evaluate with custom weights
weighted_result = judge.evaluate(sample_prompt, sample_response, weights=custom_weights)

print(f"📊 Weighted Overall Score: {weighted_result['overall_score']}/10")
print(f"📊 Default Overall Score: {result['overall_score']}/10")
print(f"📈 Difference: {abs(weighted_result['overall_score'] - result['overall_score']):.2f}")
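The judge's internal aggregation is not shown in this notebook. As a minimal sketch, assuming the overall score is a weighted average of the per-aspect scores, the effect of custom weights could be reproduced as follows. The function name `weighted_overall_score` and the example scores are hypothetical and only serve the illustration.

```python
def weighted_overall_score(scores, weights):
    """Combine per-aspect scores into one number using normalized weights.

    Assumption: the overall score is a weighted average. The real judge
    may aggregate differently.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"No score for weighted aspects: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("Weights must sum to a positive value")
    # Dividing by the total keeps the result on the 0-10 scale even when
    # the weights do not sum exactly to 1.
    return sum(scores[aspect] * w for aspect, w in weights.items()) / total_weight


# Hypothetical aspect scores, not produced by the model
demo_scores = {"relevance": 8, "accuracy": 6, "coherence": 8, "completeness": 4}
demo_weights = {"relevance": 0.4, "accuracy": 0.4, "coherence": 0.15, "completeness": 0.05}
demo_overall = weighted_overall_score(demo_scores, demo_weights)
print(f"Weighted overall score: {demo_overall:.2f}/10")
```

Normalizing by the sum of the weights means a caller can pass relative importances (for example 4, 4, 1.5, 0.5) without first scaling them to sum to 1.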

## Response Comparison

Let's compare two different responses to the same prompt.

In [None]:
# Define two responses to compare
comparison_prompt = "What are the benefits of renewable energy?"

response_a = """
Renewable energy has several key benefits: it's environmentally friendly because it doesn't 
produce greenhouse gas emissions, it's sustainable since sources like solar and wind are 
naturally replenished, and it can reduce long-term energy costs once the infrastructure 
is established. Additionally, it helps reduce dependence on fossil fuel imports and creates 
jobs in the green energy sector.
"""

response_b = """
Renewable energy is good for the planet. Solar panels and wind turbines make electricity 
without pollution. It's better than coal and oil.
"""

print("🔄 Comparing two responses:")
print(f"Prompt: {comparison_prompt}")
print(f"Response A: {response_a.strip()[:80]}...")
print(f"Response B: {response_b.strip()[:80]}...")
print()

In [None]:
# Perform comparison
comparison_result = judge.compare_responses(comparison_prompt, response_a, response_b)

print("🏆 Comparison Results:")
print(f"Winner: {comparison_result['winner']}")
print(f"Margin: {comparison_result['margin']} points")
print()
print("📊 Score Summary:")
summary = comparison_result['comparison_summary']
print(f"Response A Score: {summary['response1_score']}/10")
print(f"Response B Score: {summary['response2_score']}/10")
print(f"Difference: {summary['difference']} points")
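How `compare_responses` picks a winner is not shown here. A minimal sketch, assuming the comparison simply contrasts the two overall scores, might look like the following. The function `summarize_comparison`, the `tie_threshold` parameter, and the winner labels are hypothetical.

```python
def summarize_comparison(score_a, score_b, tie_threshold=0.25):
    """Derive a winner and margin from two overall scores.

    Assumption: scores within `tie_threshold` of each other are treated
    as a tie, since small gaps between model-generated scores may be noise.
    """
    margin = round(abs(score_a - score_b), 2)
    if margin < tie_threshold:
        winner = "tie"
    elif score_a > score_b:
        winner = "response_a"
    else:
        winner = "response_b"
    return {"winner": winner, "margin": margin}


# Hypothetical scores, not produced by the model
demo_comparison = summarize_comparison(8.2, 5.1)
print(demo_comparison)
```

Treating near-equal scores as a tie avoids declaring a winner on a difference the judge would likely not reproduce on a second run.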

## Batch Evaluation

For evaluating multiple prompt-response pairs efficiently, we can use batch evaluation.

In [None]:
# Create a batch of evaluations
batch_data = [
    {
        "prompt": "Explain quantum computing",
        "response": "Quantum computing uses quantum mechanical phenomena like superposition and entanglement to perform computations that would be impossible for classical computers."
    },
    {
        "prompt": "What is artificial intelligence?",
        "response": "AI is the simulation of human intelligence in machines that are programmed to think and learn like humans."
    },
    {
        "prompt": "How do neural networks work?",
        "response": "Neural networks are computing systems inspired by biological neural networks. They consist of interconnected nodes that process information."
    }
]

print(f"🔄 Processing batch of {len(batch_data)} evaluations...")
batch_results = judge.batch_evaluate(batch_data)

print("✅ Batch evaluation completed!")
print()
print("📊 Batch Results Summary:")
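# Use a distinct loop variable so `result` from the basic evaluation
# is still available to later cells
for i, batch_result in enumerate(batch_results):
    print(f"  Evaluation {i+1}: {batch_result['overall_score']}/10 (Time: {batch_result['evaluation_time']}s)")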
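Whether `batch_evaluate` isolates failures per item is not documented in this notebook. If it does not, one option is a thin wrapper that evaluates items one by one and records errors instead of aborting the whole batch. The sketch below is an assumption-laden illustration: `evaluate_batch_safely` and `fake_evaluate` are hypothetical, and `fake_evaluate` stands in for `judge.evaluate` so the example runs without the model.

```python
def evaluate_batch_safely(items, evaluate_fn):
    """Evaluate each item independently, capturing errors per item."""
    results = []
    for index, item in enumerate(items):
        try:
            outcome = evaluate_fn(item["prompt"], item["response"])
            results.append({"index": index, "ok": True, "result": outcome})
        except Exception as exc:  # record the failure and continue with the rest
            results.append({"index": index, "ok": False, "error": str(exc)})
    return results


def fake_evaluate(prompt, response):
    """Stand-in for judge.evaluate (hypothetical behavior)."""
    if not response.strip():
        raise ValueError("empty response")
    return {"overall_score": 5.0}


demo_items = [
    {"prompt": "What is AI?", "response": "A field of computer science."},
    {"prompt": "What is the capital of France?", "response": ""},
]
demo_results = evaluate_batch_safely(demo_items, fake_evaluate)
for entry in demo_results:
    print(entry)
```

With the real judge, the call would be `evaluate_batch_safely(batch_data, judge.evaluate)`, assuming `judge.evaluate` accepts the prompt and response as its first two positional arguments, as it does elsewhere in this notebook.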

## Data Visualization

Let's create some visualizations to better understand the evaluation results.

In [None]:
# Extract data for visualization
aspects = list(result['detailed_scores'].keys())
scores = list(result['detailed_scores'].values())
weights = list(result['weights_used'].values())

# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Score breakdown
bars1 = ax1.bar(aspects, scores, color=['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4'])
ax1.set_title('Evaluation Scores by Aspect', fontsize=14, fontweight='bold')
ax1.set_ylabel('Score (out of 10)')
ax1.set_ylim(0, 11)  # headroom so the label above a 10/10 bar is not clipped
ax1.grid(axis='y', alpha=0.3)

# Add score labels on bars
for bar, score in zip(bars1, scores):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1, 
             f'{score:.1f}', ha='center', va='bottom', fontweight='bold')

# Weights visualization
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#96CEB4']
wedges, texts, autotexts = ax2.pie(weights, labels=aspects, colors=colors, 
                                   autopct='%1.1f%%', startangle=90)
ax2.set_title('Aspect Weights Distribution', fontsize=14, fontweight='bold')

# Make percentage text bold
for autotext in autotexts:
    autotext.set_fontweight('bold')

plt.tight_layout()
plt.show()

## Performance Analysis

Let's analyze the performance characteristics of our Phi-2 judge.

In [None]:
# Collect performance data from batch results
performance_data = {
    'evaluation_times': [result['evaluation_time'] for result in batch_results],
    'prompt_lengths': [result['input_info']['prompt_length'] for result in batch_results],
    'response_lengths': [result['input_info']['response_length'] for result in batch_results],
    'overall_scores': [result['overall_score'] for result in batch_results]
}

# Create DataFrame for analysis
df = pd.DataFrame(performance_data)

print("📊 Performance Statistics:")
print(f"Average evaluation time: {df['evaluation_times'].mean():.2f}s")
print(f"Min evaluation time: {df['evaluation_times'].min():.2f}s")
print(f"Max evaluation time: {df['evaluation_times'].max():.2f}s")
print(f"Average overall score: {df['overall_scores'].mean():.2f}/10")
print(f"Score standard deviation: {df['overall_scores'].std():.2f}")

print("\n📋 Performance Summary Table:")
print(df.describe().round(2))

## Testing Edge Cases

Let's test some edge cases to see how robust our evaluation system is.

In [None]:
# Test cases
edge_cases = [
    {
        "name": "Irrelevant Response",
        "prompt": "Explain photosynthesis",
        "response": "I love pizza. It's my favorite food. The weather today is nice."
    },
    {
        "name": "Empty Response",
        "prompt": "What is the capital of France?",
        "response": ""
    },
    {
        "name": "Perfect Match",
        "prompt": "What is 2+2?",
        "response": "2+2 equals 4. This is basic arithmetic."
    },
    {
        "name": "Very Short Response",
        "prompt": "Explain the theory of relativity",
        "response": "Einstein."
    }
]

print("🧪 Testing Edge Cases:")
print("=" * 50)

edge_results = []
for case in edge_cases:
    print(f"\n🔍 {case['name']}:")
    print(f"Prompt: {case['prompt']}")
    print(f"Response: '{case['response']}'")
    
    # A distinct name keeps the `result` from the basic evaluation intact
    edge_result = judge.evaluate(case['prompt'], case['response'])
    edge_results.append(edge_result)

    print(f"Overall Score: {edge_result['overall_score']}/10")
    print(f"Relevance: {edge_result['detailed_scores']['relevance']}/10")
    print(f"Evaluation Time: {edge_result['evaluation_time']}s")
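Some of these edge cases can be caught before the model is ever called. The sketch below shows one possible pre-check for empty or very short responses. The function `precheck_response` and the `min_words` threshold are hypothetical and are not part of the Phi-2 judge.

```python
def precheck_response(response, min_words=3):
    """Return a short reason string if the response is not worth scoring,
    or None if it should be passed on to the judge."""
    text = response.strip()
    if not text:
        return "empty"
    if len(text.split()) < min_words:
        return "too_short"
    return None


for sample in ["", "Einstein.", "2+2 equals 4. This is basic arithmetic."]:
    print(repr(sample), "->", precheck_response(sample))
```

Skipping the model for such inputs saves inference time and avoids relying on the judge to score text that contains almost nothing to assess.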

## Model Memory Management

The Phi-2 judge includes memory management features. Let's demonstrate them.

In [None]:
# Check current model status
model_info = judge.get_model_info()
print(f"Model loaded: {model_info['model_loaded']}")
print(f"Device: {model_info['device']}")

# Demonstrate model unloading (optional for memory management)
print("\n🔄 Unloading model to free memory...")
judge.unload_model()

# Check status after unloading
model_info = judge.get_model_info()
print(f"Model loaded after unload: {model_info['model_loaded']}")

# Model will be reloaded automatically on next evaluation
print("\n🔄 Model will be reloaded automatically on next evaluation.")
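The unload-then-reload behavior described above is a form of lazy loading. The judge's actual implementation is not shown in this notebook, so the following is only a sketch of the general pattern. `LazyModelHolder` and `fake_loader` are hypothetical names, and the loader returns a placeholder object instead of real model weights.

```python
class LazyModelHolder:
    """Load a model on first use and allow it to be released again."""

    def __init__(self, loader):
        self._loader = loader  # callable that builds and returns the model
        self._model = None

    @property
    def is_loaded(self):
        return self._model is not None

    def get(self):
        # Load only when needed; later calls reuse the cached object
        if self._model is None:
            self._model = self._loader()
        return self._model

    def unload(self):
        # Drop the reference so the memory can be reclaimed
        self._model = None


load_calls = []

def fake_loader():
    """Stand-in for an expensive model load (hypothetical)."""
    load_calls.append(1)
    return object()

holder = LazyModelHolder(fake_loader)
holder.get()      # first use triggers a load
holder.get()      # reuses the cached object
holder.unload()
holder.get()      # loads again after unloading
print(f"Loader was called {len(load_calls)} times")
```

With a PyTorch model on a CUDA device, dropping the reference is usually followed by `gc.collect()` and `torch.cuda.empty_cache()` so that cached GPU memory is returned to the driver. Whether `unload_model` does this is an assumption.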

## Conclusion

This notebook demonstrated the key features of our Phi-2 AI Judge:

✅ **Single Model Architecture**: Uses only Microsoft Phi-2 for all evaluations

✅ **Resource Efficient**: Optimized for memory usage and performance

✅ **Comprehensive Evaluation**: Covers relevance, coherence, accuracy, and completeness

✅ **Flexible Weighting**: Allows custom importance weights for different aspects

✅ **Batch Processing**: Efficient handling of multiple evaluations

✅ **Response Comparison**: Side-by-side comparison of different responses

✅ **Memory Management**: Built-in model loading/unloading for resource management

✅ **Edge Case Handling**: Exercised the judge on irrelevant, empty, and very short responses

### Next Steps:
- Integrate this judge into your web application
- Fine-tune evaluation prompts for your specific use case
- Experiment with different aspect weights
- Monitor performance in production

In [None]:
print("🎉 Phi-2 AI Judge Demo Complete!")
print("Ready for production deployment.")