# Comprehensive Model Evaluation and Comparison

This notebook compares the performance of all implemented models:
- Baseline (LLM + RAG + Popularity)
- Transformer (LLM + RAG + BERT-based Transformer)
- RGCN (LLM + RAG + Relational Graph Convolutional Network)
- NCF (LLM + RAG + Neural Collaborative Filtering)

Evaluation includes:
1. Standard recommendation metrics (HIT@K, MRR@K, NDCG@K, Recall@K)
2. Contextual understanding evaluation
3. Statistical analysis
4. Visualizations and insights
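
For reference, the four standard metrics can be sketched for a single query as below. This is a minimal illustration, not the code in `scripts.evaluator`: it assumes each query has exactly one ground-truth item, and the function name `rank_metrics` and the item ids are hypothetical. Under that assumption Recall@K equals HIT@K, and the ideal DCG is 1, so NDCG@K reduces to the discounted gain at the hit position.

```python
import math

def rank_metrics(ranked_items, relevant_item, k):
    """Return HIT@K, MRR@K, NDCG@K and Recall@K for one query.

    Assumes exactly one relevant item per query (an assumption of this sketch).
    """
    top_k = list(ranked_items)[:k]
    if relevant_item not in top_k:
        return {"HIT": 0.0, "MRR": 0.0, "NDCG": 0.0, "Recall": 0.0}
    rank = top_k.index(relevant_item) + 1  # 1-based position of the hit
    return {
        "HIT": 1.0,
        "MRR": 1.0 / rank,
        "NDCG": 1.0 / math.log2(rank + 1),  # ideal DCG is 1 with one relevant item
        "Recall": 1.0,  # identical to HIT when there is a single relevant item
    }

# Hypothetical ranking: the relevant item "m1" sits at position 3
example = rank_metrics(["m3", "m7", "m1", "m9"], relevant_item="m1", k=3)
missed = rank_metrics(["m3", "m7", "m1", "m9"], relevant_item="m1", k=2)
```

The reported scores would then be the mean of these per-query values over the test set.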

## Environment Setup

In [None]:
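import os

# Move to the project root only once, so re-running this cell does not keep climbing directories
if not os.path.isdir("scripts"):
    os.chdir("..")
print("Current Working Directory:", os.getcwd())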

In [None]:
import json
import pandas as pd
import numpy as np
from pathlib import Path
import plotly.graph_objects as go
import plotly.express as px
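from plotly.subplots import make_subplots
from IPython.display import display  # Jupyter injects display() automatically; the explicit import also covers script runs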

from scripts.evaluator import EvaluationVisualizer

## 1. Standard Metrics Evaluation

In [None]:
# Initialize visualizer
visualizer = EvaluationVisualizer(results_file="data/evaluation")

# Load all model results
model_files = {
    'Baseline': 'baseline_metrics.json',
    'Transformer': 'trained_transformer_metrics.json',
    'RGCN': 'rgcn_metrics.json',
    'NCF': 'ncf_metrics.json'
}

results_data = visualizer.load_results(model_files)

# Display loaded data summary
print("\nLoaded Results Summary:")
for model_name, metrics in results_data.items():
    print(f"\n{model_name}:")
    for metric, value in list(metrics.items())[:4]:  # Show first 4 metrics
        print(f"  {metric}: {value:.4f}")

In [None]:
# Create comprehensive comparison table
comparison_data = []

for model_name, metrics in results_data.items():
    for metric_type in ['HIT', 'MRR', 'NDCG', 'Recall']:
        for k in [1, 3, 5, 10]:
            metric_name = f'{metric_type}@{k}'
            comparison_data.append({
                'Model': model_name,
                'Metric': metric_type,
                'K': k,
                'Score': metrics.get(metric_name, np.nan)  # NaN marks a missing metric instead of a fake zero
            })

comparison_df = pd.DataFrame(comparison_data)

# Pivot for better viewing
pivot_df = comparison_df.pivot_table(
    index=['Metric', 'K'],
    columns='Model',
    values='Score'
)

print("Standard Metrics Comparison")
print("="*80)
display(pivot_df)

In [None]:
# Identify which model performs best for each metric@K
best_performers = []

for metric_type in ['HIT', 'MRR', 'NDCG', 'Recall']:
    for k in [1, 3, 5, 10]:
        metric_name = f'{metric_type}@{k}'
        scores = {model: metrics.get(metric_name, 0) 
                 for model, metrics in results_data.items()}
        best_model = max(scores, key=scores.get)
        best_score = scores[best_model]
        
        best_performers.append({
            'Metric': metric_name,
            'Best Model': best_model,
            'Score': best_score
        })

best_df = pd.DataFrame(best_performers)
print("\nBest Performing Models")
print("="*80)
display(best_df)

### 1.1 Visualization: All Metrics Comparison

In [None]:
fig = visualizer.plot_all_metrics_comparison(k_values=[1, 3, 5, 10])
fig.show()

### 1.2 Visualization: Metric vs K

In [None]:
# Show how each metric changes as K grows
for metric in ['HIT', 'MRR', 'NDCG', 'Recall']:
    fig = visualizer.plot_metric_vs_k(metric_type=metric, k_values=[1, 3, 5, 10])
    fig.show()

### 1.3 Visualization: Holistic Comparison

In [None]:
# Compare all models across all metrics at K=10
fig = visualizer.plot_radar_chart(k_value=10)
fig.show()

In [None]:
fig = visualizer.plot_heatmap(k_values=[1, 3, 5, 10])
fig.show()

## 2. Contextual Understanding Evaluation

In [None]:
# Load contextual evaluation results
contextual_results_path = Path("data/evaluation/contextual_results.json")

if contextual_results_path.exists():
    with open(contextual_results_path, 'r') as f:
        contextual_results = json.load(f)
    
    print(f"Loaded contextual results from {len(contextual_results)} evaluators")
else:
    print("No contextual results found. Run contextual evaluation first.")
    contextual_results = {}

In [None]:
# Calculate average success rates across all evaluators
if contextual_results:
    contextual_summary = []
    
    for evaluator_id, models in contextual_results.items():
        for model_name, data in models.items():
            contextual_summary.append({
                'Evaluator': evaluator_id,
                'Model': model_name,
                'Success Rate': data['success_rate'],
                'Successes': data['success_count'],
                'Total': data['total_rated']
            })
    
    contextual_df = pd.DataFrame(contextual_summary)
    
    print("Contextual Evaluation Results")
    print("="*80)
    display(contextual_df)
    
    # Calculate average per model
    avg_by_model = contextual_df.groupby('Model')['Success Rate'].mean().sort_values(ascending=False)
    print("\nAverage Success Rate by Model:")
    print(avg_by_model)
else:
    print("No contextual data to analyze")
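
The per-model figure above is an unweighted mean of evaluator-level rates, so an evaluator who rated 10 queries counts as much as one who rated 20. A pooled rate (total successes divided by total ratings) is one alternative. The sketch below assumes a frame with the same `Model`, `Successes` and `Total` columns as `contextual_df`; the sample rows and the names `sample_df` and `pooled_success_rate` are hypothetical placeholders, not real results.

```python
import pandas as pd

# Hypothetical rows mirroring the columns of contextual_df (not real results)
sample_df = pd.DataFrame([
    {"Evaluator": "eval_1", "Model": "model_a", "Successes": 3, "Total": 10},
    {"Evaluator": "eval_2", "Model": "model_a", "Successes": 18, "Total": 20},
    {"Evaluator": "eval_1", "Model": "model_b", "Successes": 5, "Total": 10},
    {"Evaluator": "eval_2", "Model": "model_b", "Successes": 10, "Total": 20},
])

def pooled_success_rate(df):
    """Success rate per model, weighting every rated query equally."""
    totals = df.groupby("Model")[["Successes", "Total"]].sum()
    return (totals["Successes"] / totals["Total"]).sort_values(ascending=False)

pooled = pooled_success_rate(sample_df)
print(pooled)
```

In this toy data the unweighted mean for `model_a` is 0.6 while the pooled rate is 0.7, which shows how the two aggregations can diverge when evaluators rate different numbers of queries.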

In [None]:
# Analyze performance by context type (temporal, mood, audience)
# Placeholder: Will need to load individual query results

# TODO: Break down by context type
# - Temporal shift queries
# - Mood context queries  
# - Audience context queries

print("TODO: Implement context-type breakdown")
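
One possible shape for the breakdown, once per-query results are available, is sketched below. The record schema is an assumption: the field names `model`, `context_type` and `success`, the function `success_by_context`, and the sample rows are all hypothetical, since the notebook does not yet load individual query results.

```python
import pandas as pd

# Hypothetical per-query records; the field names are assumptions, not the project's schema
query_records = [
    {"model": "model_a", "context_type": "temporal", "success": True},
    {"model": "model_a", "context_type": "temporal", "success": False},
    {"model": "model_a", "context_type": "mood", "success": True},
    {"model": "model_b", "context_type": "temporal", "success": True},
    {"model": "model_b", "context_type": "temporal", "success": True},
    {"model": "model_b", "context_type": "mood", "success": False},
]

def success_by_context(records):
    """Mean success rate per context type (rows) and model (columns)."""
    df = pd.DataFrame(records)
    df["success"] = df["success"].astype(float)  # True/False -> 1.0/0.0
    return df.groupby(["context_type", "model"])["success"].mean().unstack("model")

breakdown = success_by_context(query_records)
print(breakdown)
```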

In [None]:
if contextual_results and len(contextual_df) > 0:
    # Bar chart of success rates
    fig = go.Figure()
    
    for model in contextual_df['Model'].unique():
        model_data = contextual_df[contextual_df['Model'] == model]
        fig.add_trace(go.Bar(
            name=model,
            x=model_data['Evaluator'],
            y=model_data['Success Rate'],
            text=[f"{rate:.1%}" for rate in model_data['Success Rate']],
            textposition='outside'
        ))
    
    fig.update_layout(
        title='Contextual Understanding Success Rates by Evaluator',
        xaxis_title='Evaluator',
        yaxis_title='Success Rate',
        barmode='group',
        height=500
    )
    
    fig.show()

## 3. Statistical Analysis

In [None]:
# Calculate improvement over baseline
baseline_metrics = results_data.get('Baseline', {})

improvement_data = []

for model_name, metrics in results_data.items():
    if model_name == 'Baseline':
        continue
    
    for metric_type in ['HIT', 'MRR', 'NDCG', 'Recall']:
        for k in [1, 3, 5, 10]:
            metric_name = f'{metric_type}@{k}'
            baseline_score = baseline_metrics.get(metric_name, 0)
            model_score = metrics.get(metric_name, 0)
            
            if baseline_score > 0:
                improvement = ((model_score - baseline_score) / baseline_score) * 100
            else:
                # Relative improvement is undefined for a zero baseline; NaN keeps it out of the averages
                improvement = np.nan
            
            improvement_data.append({
                'Model': model_name,
                'Metric': metric_name,
                'Baseline': baseline_score,
                'Model Score': model_score,
                'Improvement (%)': improvement
            })

improvement_df = pd.DataFrame(improvement_data)

print("Performance Improvement Over Baseline")
print("="*80)
display(improvement_df.head(20))

# Average improvement per model
avg_improvement = improvement_df.groupby('Model')['Improvement (%)'].mean().sort_values(ascending=False)
print("\nAverage Improvement Over Baseline:")
print(avg_improvement)

In [None]:
# Bar chart of average improvements
fig = go.Figure(data=[
    go.Bar(
        x=avg_improvement.index,
        y=avg_improvement.values,
        text=[f"{v:.1f}%" for v in avg_improvement.values],
        textposition='outside'
    )
])

fig.update_layout(
    title='Average Performance Improvement Over Baseline',
    xaxis_title='Model',
    yaxis_title='Improvement (%)',
    height=500
)

fig.show()

## 4. Model Rankings

In [None]:
# Rank models by their mean score across all numeric metrics
model_rankings = []

for model_name, metrics in results_data.items():
    # Keep numeric values only so non-metric entries cannot break the average
    all_scores = [v for v in metrics.values() if isinstance(v, (int, float))]
    avg_score = np.mean(all_scores) if all_scores else np.nan
    
    model_rankings.append({
        'Model': model_name,
        'Average Score': avg_score
    })

# Sort by average score, then assign ranks (1 = best)
ranking_df = pd.DataFrame(model_rankings).sort_values('Average Score', ascending=False)
ranking_df['Rank'] = range(1, len(ranking_df) + 1)

print("Overall Model Rankings (Based on Average Metric Scores)")
print("="*80)
display(ranking_df)
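
Averaging raw scores mixes metrics with different scales, so metrics with larger values (for example HIT@10) dominate the ranking. A mean-rank aggregation is one alternative: rank the models within each metric, then average the ranks. The sketch below assumes a dictionary shaped like `results_data` (model name mapped to a metric dictionary); `mean_rank` and the sample scores are hypothetical placeholders, not results from this evaluation.

```python
import pandas as pd

def mean_rank(results):
    """Average each model's rank across metrics (1 = best); lower is better."""
    scores = pd.DataFrame(results)  # rows: metrics, columns: models
    ranks = scores.rank(axis=1, ascending=False, method="average")
    return ranks.mean(axis=0).sort_values()

# Hypothetical scores, only to illustrate the mechanics
sample_results = {
    "model_a": {"HIT@10": 0.40, "MRR@10": 0.20},
    "model_b": {"HIT@10": 0.50, "MRR@10": 0.15},
    "model_c": {"HIT@10": 0.45, "MRR@10": 0.25},
}

mean_ranks = mean_rank(sample_results)
print(mean_ranks)
```

Ties receive the average of the tied positions (`method="average"`), and a metric missing for a model is left out of that model's mean rather than counted as last place.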

## 5. Key Insights and Findings

### Key Findings:

**Standard Metrics:**
- [TO BE FILLED: Which model performs best overall?]
- [TO BE FILLED: Which metrics show the most improvement?]
- [TO BE FILLED: Performance trends across K values]

**Contextual Understanding:**
- [TO BE FILLED: Which model handles context shifts best?]
- [TO BE FILLED: Performance differences by context type]
- [TO BE FILLED: Common failure patterns]

**Model Comparison:**
- [TO BE FILLED: Baseline vs trained models]
- [TO BE FILLED: Strengths and weaknesses of each approach]
- [TO BE FILLED: Trade-offs (accuracy vs complexity)]

## 6. Recommendations for Deployment

### Model Selection Recommendations:

**Best Overall Model:** [TO BE FILLED]

**Use Cases:**
- For accuracy: [MODEL]
- For speed: [MODEL]
- For contextual understanding: [MODEL]
- For production deployment: [MODEL]

**Future Improvements:**
- [TO BE FILLED]

## 7. Conclusion

### Summary

This evaluation compared four approaches to conversational movie recommendation:

1. **Baseline (LLM+RAG+Popularity)**: [SUMMARY]
2. **Transformer**: [SUMMARY]
3. **RGCN**: [SUMMARY]
4. **NCF**: [SUMMARY]

**Main Conclusions:**
- [TO BE FILLED]
- [TO BE FILLED]
- [TO BE FILLED]

**Impact of External Recommenders:**
[TO BE FILLED: Did they improve performance? By how much?]