# 🏆 Model Comparison and Benchmark

<div style="background-color: #e3f2fd; padding: 15px; border-radius: 5px; border-left: 5px solid #2196F3;">
<b>📓 Notebook Information</b><br>
<b>Level:</b> Intermediate-Advanced<br>
<b>Estimated Time:</b> 25 minutes<br>
<b>Prerequisites:</b> 02_complete_robustness.ipynb, ../04_fairness/01_fairness_introduction.ipynb<br>
<b>Dataset:</b> Breast Cancer (sklearn)
</div>

---

## 🎯 Learning Objectives

By the end of this notebook, you will be able to:
- ✅ Compare multiple models simultaneously
- ✅ Benchmark performance across different algorithms
- ✅ Compare robustness scores
- ✅ Compare fairness metrics (if applicable)
- ✅ Analyze trade-offs (accuracy vs robustness vs fairness)
- ✅ Make data-driven model selection decisions

---

## 📚 Table of Contents

1. [Introduction](#intro)
2. [Setup](#setup)
3. [Prepare Data](#data)
4. [Train Multiple Models](#train)
5. [Performance Comparison](#performance)
6. [Robustness Comparison](#robustness)
7. [Comprehensive Benchmark](#benchmark)
8. [Trade-off Analysis](#tradeoff)
9. [Model Selection Decision](#decision)
10. [Conclusion](#conclusion)
11. [Next Steps](#next)

<a id="intro"></a>
## 1. 📖 Introduction

### The Challenge

You trained multiple models and now face the question:

**"Which model should I put in production?"** 🤔

### Common Mistake

❌ **Choosing based only on accuracy:**
```python
# DON'T DO THIS!
best_model = max(models, key=lambda m: m.score(X_test, y_test))
```

### Why Is This Wrong?

A model with high accuracy might:
- ❌ Be fragile to small perturbations (low robustness)
- ❌ Be biased (low fairness)
- ❌ Overfit (poor generalization)
- ❌ Be unstable (high variance in predictions)
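
To make the overfitting point concrete, here is a minimal sketch using only the Python standard library. The data, the 20% label noise, and the 1-nearest-neighbor "model" are all hypothetical choices made for illustration; they are not part of this notebook's dataset or of DeepBridge.

```python
import random

random.seed(0)

def make_data(n):
    """1-D points; label is 1 when x > 0.5, with 20% of labels flipped as noise."""
    data = []
    for _ in range(n):
        x = random.random()
        label = int(x > 0.5)
        if random.random() < 0.2:
            label = 1 - label
        data.append((x, label))
    return data

demo_train, demo_test = make_data(200), make_data(200)

def predict_1nn(x, memory):
    """Return the label of the closest memorized training point."""
    return min(memory, key=lambda point: abs(point[0] - x))[1]

def accuracy(data, memory):
    """Fraction of points in `data` whose label matches the 1-NN prediction."""
    return sum(predict_1nn(x, memory) == y for x, y in data) / len(data)

train_acc = accuracy(demo_train, demo_train)  # every point is its own neighbor
test_acc = accuracy(demo_test, demo_train)
print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
```

Because the model memorizes the noisy training labels, its training accuracy is perfect by construction, while accuracy on unseen points is lower. A single accuracy number does not show that gap.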

### The Right Way ✅

**Consider multiple dimensions:**
1. 📊 **Performance** - Accuracy, ROC AUC, F1
2. 🛡️ **Robustness** - Resistance to perturbations
3. ⚖️ **Fairness** - Absence of bias (for regulated applications)
4. ⚡ **Speed** - Inference time
5. 📦 **Complexity** - Model size, interpretability

### DeepBridge Makes This Easy!

With DeepBridge, you can:
- ✅ Test multiple models with the same code
- ✅ Compare all dimensions automatically
- ✅ Visualize trade-offs
- ✅ Make informed decisions

**Let's do it!** 🚀

<a id="setup"></a>
## 2. 🛠️ Setup

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from pathlib import Path

# sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score, roc_auc_score, f1_score,
    precision_score, recall_score, classification_report
)

# Multiple ML algorithms
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# DeepBridge
from deepbridge import DBDataset, Experiment

# Visualization settings
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("Set2")
%matplotlib inline

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("✅ Setup complete!")

<a id="data"></a>
## 3. 📊 Prepare Data

We'll use the **Breast Cancer Wisconsin** dataset - a binary classification problem.

In [None]:
# Load dataset
cancer = load_breast_cancer()
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = cancer.target

print("🔬 Breast Cancer Wisconsin Dataset")
print(f"   Purpose: Predict malignant (0) vs benign (1) tumors")
print(f"   Shape: {df.shape}")
print(f"   Features: {len(cancer.feature_names)}")
print(f"   Classes: {cancer.target_names}")
print(f"\n   Class distribution:")
print(df['target'].value_counts())
print(f"\n   Balance: {df['target'].value_counts(normalize=True).values}")

In [None]:
# Prepare train/test split
X = df.drop('target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"✅ Data split:")
print(f"   Train: {X_train.shape}")
print(f"   Test: {X_test.shape}")

<a id="train"></a>
## 4. 🤖 Train Multiple Models

Let's train 6 different algorithms and compare them!

In [None]:
# Define models to compare
models = {
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42, max_depth=10),
    'GradientBoosting': GradientBoostingClassifier(n_estimators=100, random_state=42, max_depth=5),
    'LogisticRegression': LogisticRegression(random_state=42, max_iter=1000),
    'SVM': SVC(kernel='rbf', probability=True, random_state=42),
    'NaiveBayes': GaussianNB(),
    'KNN': KNeighborsClassifier(n_neighbors=5)
}

print(f"🤖 Training {len(models)} models...\n")

trained_models = {}
training_times = {}

for name, model in models.items():
    print(f"   Training {name}...", end=" ")

    # perf_counter is a monotonic, high-resolution clock suited to timing
    start = time.perf_counter()
    model.fit(X_train, y_train)
    training_time = time.perf_counter() - start

    trained_models[name] = model
    training_times[name] = training_time

    print(f"✅ ({training_time:.3f}s)")

print(f"\n✅ All {len(models)} models trained!")

<a id="performance"></a>
## 5. 📊 Performance Comparison

Let's compare basic performance metrics first.

In [None]:
# Calculate performance metrics
performance_results = []

for name, model in trained_models.items():
    # Predictions
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]
    
    # Inference time (average over test set), measured with a monotonic clock
    start = time.perf_counter()
    _ = model.predict(X_test)
    inference_time = (time.perf_counter() - start) / len(X_test) * 1000  # ms per sample
    
    # Metrics
    performance_results.append({
        'Model': name,
        'Accuracy': accuracy_score(y_test, y_pred),
        'ROC AUC': roc_auc_score(y_test, y_proba),
        'F1 Score': f1_score(y_test, y_pred),
        'Precision': precision_score(y_test, y_pred),
        'Recall': recall_score(y_test, y_pred),
        'Train Time (s)': training_times[name],
        'Inference (ms)': inference_time
    })

# Create DataFrame
perf_df = pd.DataFrame(performance_results).set_index('Model')
perf_df = perf_df.sort_values('Accuracy', ascending=False)

print("📊 PERFORMANCE COMPARISON")
print("=" * 80)
display(perf_df.style
        .format({
            'Accuracy': '{:.3f}',
            'ROC AUC': '{:.3f}',
            'F1 Score': '{:.3f}',
            'Precision': '{:.3f}',
            'Recall': '{:.3f}',
            'Train Time (s)': '{:.3f}',
            'Inference (ms)': '{:.3f}'
        })
        .background_gradient(cmap='RdYlGn', subset=['Accuracy', 'ROC AUC', 'F1 Score'])
        .background_gradient(cmap='RdYlGn_r', subset=['Train Time (s)', 'Inference (ms)'])
)
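
A single timed call can be noisy, because the first call often pays one-off costs such as cache warm-up. The sketch below shows one common way to get a steadier per-sample figure: repeat the call and keep the best run. The helper `time_per_sample_ms` and the stand-in `toy_predict` are hypothetical names introduced for illustration only.

```python
import time

def time_per_sample_ms(predict, rows, repeats=5):
    """Best-of-`repeats` wall-clock time per sample, in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        predict(rows)
        best = min(best, time.perf_counter() - start)
    return best / len(rows) * 1000

# Hypothetical stand-in for model.predict: thresholds the first feature.
def toy_predict(rows):
    return [int(row[0] > 0.5) for row in rows]

demo_rows = [[i / 1000, 0.0] for i in range(1000)]
ms_per_sample = time_per_sample_ms(toy_predict, demo_rows)
print(f"{ms_per_sample:.5f} ms per sample")
```

With a fitted scikit-learn model, the same helper could be called as `time_per_sample_ms(model.predict, X_test)`.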

### Visualize Performance

In [None]:
# Performance and speed bar charts
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Chart 1: Main Metrics
metrics_to_plot = ['Accuracy', 'ROC AUC', 'F1 Score', 'Precision', 'Recall']
perf_df[metrics_to_plot].plot(kind='barh', ax=axes[0], width=0.8)
axes[0].set_title('Performance Metrics Comparison', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Score', fontsize=11)
axes[0].legend(loc='lower right', fontsize=9)
axes[0].grid(axis='x', alpha=0.3)
axes[0].set_xlim(0.85, 1.0)

# Chart 2: Speed
speed_df = perf_df[['Train Time (s)', 'Inference (ms)']].copy()
speed_df.plot(kind='bar', ax=axes[1], width=0.7)
axes[1].set_title('Training & Inference Speed', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Time', fontsize=11)
axes[1].set_xlabel('Model', fontsize=11)
axes[1].legend(['Train Time (s)', 'Inference Time (ms)'], fontsize=9)
axes[1].grid(axis='y', alpha=0.3)
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
best_acc = perf_df['Accuracy'].idxmax()
fastest = perf_df['Inference (ms)'].idxmin()
print(f"   🏆 Best Accuracy: {best_acc} ({perf_df.loc[best_acc, 'Accuracy']:.3f})")
print(f"   ⚡ Fastest Inference: {fastest} ({perf_df.loc[fastest, 'Inference (ms)']:.3f} ms)")

<div style="background-color: #fff3cd; padding: 10px; border-radius: 5px; border-left: 5px solid #ffc107;">
<b>⚠️ Important:</b> High accuracy alone doesn't guarantee a good production model! Continue reading...
</div>

<a id="robustness"></a>
## 6. 🛡️ Robustness Comparison

Now let's test how **robust** each model is to perturbations!
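
As a rough intuition for what "robust to perturbations" means, the sketch below adds Gaussian noise of increasing strength to a toy feature and measures how far accuracy falls. This is a generic illustration with hypothetical data and a hypothetical threshold model; it is not a description of how DeepBridge computes its robustness score.

```python
import random

random.seed(42)

# Toy 1-D dataset (hypothetical): label is 1 when the feature exceeds 0.5.
X_toy = [random.random() for _ in range(500)]
y_toy = [int(x > 0.5) for x in X_toy]

def toy_model(x):
    """Stand-in for a trained model: a fixed threshold classifier."""
    return int(x > 0.5)

def accuracy(features, labels):
    """Fraction of samples the toy model labels correctly."""
    return sum(toy_model(x) == t for x, t in zip(features, labels)) / len(labels)

def perturb(features, sigma):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [x + random.gauss(0.0, sigma) for x in features]

baseline = accuracy(X_toy, y_toy)
noisy_accuracy = {}
for sigma in (0.01, 0.05, 0.2):
    noisy_accuracy[sigma] = accuracy(perturb(X_toy, sigma), y_toy)
    drop = baseline - noisy_accuracy[sigma]
    print(f"sigma={sigma:<4} accuracy={noisy_accuracy[sigma]:.3f} drop={drop:.3f}")
```

The size of the accuracy drop as `sigma` grows is one simple robustness signal: a model whose accuracy collapses under small noise is fragile, however good its clean-data score looks.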

In [None]:
print("🔬 Testing robustness for all models...")
print("   This may take a few minutes...\n")

robustness_results = []

for name, model in trained_models.items():
    print(f"   Testing {name}...", end=" ")
    
    # Create DBDataset
    dataset = DBDataset(
        data=df,
        target_column='target',
        model=model,
        test_size=0.2,
        random_state=42,
        dataset_name=f'{name} Model'
    )
    
    # Create Experiment
    exp = Experiment(
        dataset=dataset,
        experiment_type='binary_classification',
        experiment_name=f'{name} Robustness Test',
        random_state=42
    )
    
    # Run robustness test (quick config for speed)
    try:
        result = exp.run_test('robustness', config='quick')

        # Extract robustness score
        if hasattr(result, 'robustness_score'):
            rob_score = result.robustness_score
        elif hasattr(result, 'score'):
            rob_score = result.score
        else:
            # No recognizable score attribute: record NaN rather than
            # inventing a value that would look like a real measurement
            rob_score = np.nan

        robustness_results.append({
            'Model': name,
            'Robustness Score': rob_score
        })

        print(f"✅ Score: {rob_score:.3f}")

    except Exception as e:
        print(f"⚠️ Error: {str(e)[:50]}")
        # Mark the failed test as missing instead of scoring it 0.0
        robustness_results.append({
            'Model': name,
            'Robustness Score': np.nan
        })

print("\n✅ Robustness testing complete!")

In [None]:
# Create robustness DataFrame
rob_df = pd.DataFrame(robustness_results).set_index('Model')
rob_df = rob_df.sort_values('Robustness Score', ascending=False)

print("🛡️  ROBUSTNESS COMPARISON")
print("=" * 60)
display(rob_df.style
        .format({'Robustness Score': '{:.3f}'})
        .background_gradient(cmap='RdYlGn', subset=['Robustness Score'])
)

### Visualize Robustness

In [None]:
# Robustness bar chart
plt.figure(figsize=(12, 6))

colors = ['green' if x >= 0.85 else 'orange' if x >= 0.75 else 'red' 
          for x in rob_df['Robustness Score']]

rob_df['Robustness Score'].plot(kind='barh', color=colors, edgecolor='black', alpha=0.8)
plt.axvline(x=0.85, color='green', linestyle='--', label='Excellent (≥0.85)', linewidth=2)
plt.axvline(x=0.75, color='orange', linestyle='--', label='Good (≥0.75)', linewidth=2)
plt.title('Robustness Score Comparison', fontsize=14, fontweight='bold')
plt.xlabel('Robustness Score', fontsize=11)
plt.ylabel('Model', fontsize=11)
plt.legend(fontsize=10)
plt.grid(axis='x', alpha=0.3)
plt.xlim(0, 1)

plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   ✅ Score ≥ 0.85: Excellent robustness")
print("   🟡 Score 0.75-0.85: Good robustness")
print("   ⚠️  Score < 0.75: Fragile model - may fail in production")

<a id="benchmark"></a>
## 7. 📋 Comprehensive Benchmark

Let's combine all dimensions into one comprehensive comparison!

In [None]:
# Merge all results
benchmark_df = perf_df[['Accuracy', 'ROC AUC', 'F1 Score']].copy()
benchmark_df = benchmark_df.join(rob_df)
benchmark_df['Speed Score'] = 1 - (perf_df['Inference (ms)'] / perf_df['Inference (ms)'].max())

# Calculate composite score (weighted average)
weights = {
    'Accuracy': 0.25,
    'ROC AUC': 0.25,
    'F1 Score': 0.15,
    'Robustness Score': 0.25,  # ← IMPORTANT!
    'Speed Score': 0.10
}

benchmark_df['Composite Score'] = sum(
    benchmark_df[col] * weight 
    for col, weight in weights.items()
)

benchmark_df = benchmark_df.sort_values('Composite Score', ascending=False)

print("🏆 COMPREHENSIVE MODEL BENCHMARK")
print("=" * 100)
print(f"\nWeights: {weights}\n")
display(benchmark_df.style
        .format({
            'Accuracy': '{:.3f}',
            'ROC AUC': '{:.3f}',
            'F1 Score': '{:.3f}',
            'Robustness Score': '{:.3f}',
            'Speed Score': '{:.3f}',
            'Composite Score': '{:.3f}'
        })
        .background_gradient(cmap='RdYlGn', subset=['Composite Score'])
        .background_gradient(cmap='Blues', subset=['Accuracy', 'ROC AUC', 'F1 Score', 'Robustness Score'])
)
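
The composite score is a weighted average: each metric is multiplied by its weight and the products are summed. The standalone sketch below reproduces that arithmetic for two hypothetical models (`ModelA` and `ModelB`, with made-up metric values) and adds a guard that the weights sum to 1, which is easy to break when adjusting priorities.

```python
# Hypothetical metric values for two models, all already on a 0-1 scale.
demo_metrics = {
    "ModelA": {"Accuracy": 0.97, "ROC AUC": 0.99, "F1 Score": 0.97,
               "Robustness Score": 0.80, "Speed Score": 0.90},
    "ModelB": {"Accuracy": 0.95, "ROC AUC": 0.98, "F1 Score": 0.95,
               "Robustness Score": 0.92, "Speed Score": 0.50},
}

# Same weights as the benchmark cell.
demo_weights = {"Accuracy": 0.25, "ROC AUC": 0.25, "F1 Score": 0.15,
                "Robustness Score": 0.25, "Speed Score": 0.10}

# Guard against weights that silently stop summing to 1 after an edit.
assert abs(sum(demo_weights.values()) - 1.0) < 1e-9, "weights must sum to 1"

def composite(scores, weights):
    """Weighted sum of the metrics named in `weights`."""
    return sum(scores[name] * w for name, w in weights.items())

composite_scores = {m: composite(s, demo_weights) for m, s in demo_metrics.items()}
for model, score in sorted(composite_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model}: {score:.4f}")
```

Note that a weighted sum is only meaningful when every input is on a comparable scale, which is why the benchmark converts inference time into a 0-1 speed score before combining it with the other metrics.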

<a id="tradeoff"></a>
## 8. ⚖️ Trade-off Analysis

Let's visualize the **accuracy vs robustness trade-off**.

In [None]:
# Scatter plot: Accuracy vs Robustness
fig, ax = plt.subplots(figsize=(12, 8))

# Plot models
for model_name in benchmark_df.index:
    acc = benchmark_df.loc[model_name, 'Accuracy']
    rob = benchmark_df.loc[model_name, 'Robustness Score']
    
    ax.scatter(acc, rob, s=300, alpha=0.6, edgecolors='black', linewidth=2)
    ax.annotate(model_name, (acc, rob), 
                fontsize=11, ha='center', va='center', fontweight='bold')

# Reference lines
ax.axhline(y=0.85, color='green', linestyle='--', alpha=0.5, label='Robustness threshold')
ax.axvline(x=0.95, color='blue', linestyle='--', alpha=0.5, label='Accuracy threshold')

# Quadrants
ax.fill_between([0.95, 1.0], 0.85, 1.0, alpha=0.1, color='green', label='Ideal zone')

ax.set_xlabel('Accuracy', fontsize=13, fontweight='bold')
ax.set_ylabel('Robustness Score', fontsize=13, fontweight='bold')
ax.set_title('Accuracy vs Robustness Trade-off', fontsize=15, fontweight='bold')
ax.legend(fontsize=11, loc='lower left')
ax.grid(alpha=0.3)
ax.set_xlim(0.85, 1.0)
ax.set_ylim(0.7, 1.0)

plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   🟢 Top-right (green zone): HIGH accuracy + HIGH robustness = IDEAL!")
print("   🔵 Top-left: HIGH robustness, lower accuracy")
print("   🟡 Bottom-right: HIGH accuracy, lower robustness - RISKY!")
print("   🔴 Bottom-left: LOW in both - AVOID!")
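
When no model lands in the ideal zone, one way to narrow the field is to keep only the models that are not beaten on both axes by some other model (the Pareto front). The sketch below is a minimal, standalone illustration with hypothetical model names and scores; it is not part of the DeepBridge API.

```python
# Hypothetical (accuracy, robustness) pairs; higher is better on both axes.
demo_points = {
    "ModelA": (0.97, 0.80),
    "ModelB": (0.95, 0.92),
    "ModelC": (0.94, 0.85),
    "ModelD": (0.96, 0.88),
}

def dominates(a, b):
    """True when `a` is at least as good as `b` everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Sorted names of models that no other model dominates."""
    return sorted(
        name for name, point in points.items()
        if not any(dominates(other, point)
                   for other_name, other in points.items() if other_name != name)
    )

front = pareto_front(demo_points)
print(front)
```

Here `ModelC` drops out because `ModelB` is better on both accuracy and robustness. The remaining models each represent a genuine trade-off, so choosing among them depends on your priorities.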

### Radar Chart - Multi-dimensional Comparison

In [None]:
# Radar chart for top 3 models
from math import pi

top_3_models = benchmark_df.head(3).index
categories = ['Accuracy', 'ROC AUC', 'F1 Score', 'Robustness', 'Speed']

fig, ax = plt.subplots(figsize=(10, 10), subplot_kw=dict(projection='polar'))

angles = [n / float(len(categories)) * 2 * pi for n in range(len(categories))]
angles += angles[:1]

ax.set_theta_offset(pi / 2)
ax.set_theta_direction(-1)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, fontsize=12)

for model_name in top_3_models:
    values = [
        benchmark_df.loc[model_name, 'Accuracy'],
        benchmark_df.loc[model_name, 'ROC AUC'],
        benchmark_df.loc[model_name, 'F1 Score'],
        benchmark_df.loc[model_name, 'Robustness Score'],
        benchmark_df.loc[model_name, 'Speed Score']
    ]
    values += values[:1]
    
    ax.plot(angles, values, 'o-', linewidth=2, label=model_name, markersize=8)
    ax.fill(angles, values, alpha=0.15)

ax.set_ylim(0, 1)
ax.set_title('Top 3 Models - Multi-dimensional Comparison', 
             fontsize=15, fontweight='bold', pad=20)
ax.legend(loc='upper right', bbox_to_anchor=(1.3, 1.1), fontsize=11)
ax.grid(True)

plt.tight_layout()
plt.show()

<a id="decision"></a>
## 9. 🎯 Model Selection Decision

Based on comprehensive analysis, let's make the final decision!

In [None]:
print("🎯 MODEL SELECTION DECISION")
print("=" * 80)

# Winner
winner = benchmark_df.index[0]
winner_score = benchmark_df.loc[winner, 'Composite Score']

print(f"\n🏆 RECOMMENDED MODEL FOR PRODUCTION: {winner}")
print(f"\n📊 Summary:")
print(f"   Composite Score: {winner_score:.3f}")
print(f"   Accuracy: {benchmark_df.loc[winner, 'Accuracy']:.3f}")
print(f"   ROC AUC: {benchmark_df.loc[winner, 'ROC AUC']:.3f}")
print(f"   Robustness: {benchmark_df.loc[winner, 'Robustness Score']:.3f}")
print(f"   Speed: {perf_df.loc[winner, 'Inference (ms)']:.3f} ms/sample")

print(f"\n✅ STRENGTHS:")
for col in ['Accuracy', 'ROC AUC', 'Robustness Score']:
    if benchmark_df.loc[winner, col] >= 0.90:
        print(f"   • Excellent {col}: {benchmark_df.loc[winner, col]:.3f}")

print(f"\n⚠️  CONSIDERATIONS:")
print(f"   • Training time: {perf_df.loc[winner, 'Train Time (s)']:.2f}s")
print(f"   • Model complexity: {'High' if 'Forest' in winner or 'Boosting' in winner else 'Medium'}")
print(f"   • Interpretability: {'Low' if 'Forest' in winner or 'Boosting' in winner else 'High'}")

# Alternatives
print(f"\n🔄 ALTERNATIVES:")
for i, model in enumerate(benchmark_df.index[1:3], 2):
    print(f"\n{i}. {model}")
    print(f"   Composite Score: {benchmark_df.loc[model, 'Composite Score']:.3f}")
    print(f"   Best for: ", end="")
    
    if perf_df.loc[model, 'Inference (ms)'] < perf_df.loc[winner, 'Inference (ms)']:
        print("Faster inference")
    elif 'Logistic' in model or 'NaiveBayes' in model:
        print("Better interpretability")
    else:
        print("Different trade-offs")

print("\n" + "=" * 80)

### Decision Checklist

In [None]:
print("\n✅ PRODUCTION READINESS CHECKLIST - " + winner)
print("=" * 80)

checklist = [
    ("Accuracy ≥ 0.90", benchmark_df.loc[winner, 'Accuracy'] >= 0.90),
    ("ROC AUC ≥ 0.90", benchmark_df.loc[winner, 'ROC AUC'] >= 0.90),
    ("F1 Score ≥ 0.85", benchmark_df.loc[winner, 'F1 Score'] >= 0.85),
    ("Robustness ≥ 0.85", benchmark_df.loc[winner, 'Robustness Score'] >= 0.85),
    ("Inference time < 1ms/sample", perf_df.loc[winner, 'Inference (ms)'] < 1.0),
    ("Better than alternatives", True),  # Winner by definition
]

passed = 0
for criterion, result in checklist:
    status = "✅" if result else "⚠️"
    print(f"{status} {criterion}")
    if result:
        passed += 1

print(f"\n📊 Score: {passed}/{len(checklist)} ({passed/len(checklist)*100:.0f}%)")

if passed >= len(checklist) * 0.8:
    print("\n🎉 ✅ APPROVED FOR PRODUCTION!")
    print("\n   Next steps:")
    print("   1. Generate full validation report")
    print("   2. Get stakeholder approval")
    print("   3. Set up monitoring")
    print("   4. Deploy!")
else:
    print("\n⚠️  REQUIRES ADDITIONAL VALIDATION")
    print("   Consider re-training or adjusting thresholds")
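
An 80% pass rate treats every criterion as interchangeable, so a model could be approved while failing the robustness check. One possible refinement, sketched below with hypothetical checklist results, marks some criteria as mandatory and blocks approval if any of them fails. The `evaluate` helper and the choice of which criteria are mandatory are assumptions made for illustration.

```python
# Hypothetical checklist results: (criterion, passed, mandatory).
demo_checklist = [
    ("Accuracy >= 0.90", True, True),
    ("ROC AUC >= 0.90", True, False),
    ("F1 Score >= 0.85", True, False),
    ("Robustness >= 0.85", False, True),
    ("Inference time < 1 ms/sample", True, False),
]

def evaluate(checklist, pass_ratio=0.8):
    """Approve only if every mandatory check passes and the pass ratio is met."""
    passed = sum(ok for _, ok, _ in checklist)
    failed_mandatory = [name for name, ok, must in checklist if must and not ok]
    ratio_ok = passed >= len(checklist) * pass_ratio
    return ratio_ok and not failed_mandatory, failed_mandatory

approved, blockers = evaluate(demo_checklist)
print("Approved:", approved)
print("Blocking criteria:", blockers)
```

In this example four of five checks pass, which meets the 80% ratio, yet the model is not approved because the mandatory robustness criterion failed.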

<a id="conclusion"></a>
## 10. 🎉 Conclusion

### What you learned

Congratulations! You have mastered model comparison and benchmarking! 🎊

In this notebook, you learned how to:
- ✅ Train and compare multiple models
- ✅ Benchmark across multiple dimensions (not just accuracy!)
- ✅ Test robustness for all models
- ✅ Analyze trade-offs (accuracy vs robustness vs speed)
- ✅ Create comprehensive visualizations
- ✅ Make data-driven model selection decisions
- ✅ Use a production readiness checklist

### Key Takeaways

1. ⚠️ **Never choose based on accuracy alone!** - Consider robustness, fairness, speed
2. 🛡️ **Robustness is critical** - A fragile model will fail in production
3. ⚖️ **Trade-offs exist** - Sometimes lower accuracy + higher robustness is better
4. 📊 **Use composite scores** - Weight dimensions based on your priorities
5. 🎯 **Context matters** - Production requirements vary by use case
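
To see how much the chosen weights drive the outcome, the sketch below ranks the same three hypothetical models under two different priority profiles. All names, scores, and weights are made up for illustration.

```python
# Hypothetical scores on a 0-1 scale for three candidate models.
demo_scores = {
    "ModelA": {"Accuracy": 0.97, "Robustness": 0.78, "Speed": 0.95},
    "ModelB": {"Accuracy": 0.95, "Robustness": 0.93, "Speed": 0.60},
    "ModelC": {"Accuracy": 0.93, "Robustness": 0.88, "Speed": 0.99},
}

# Two hypothetical priority profiles; each set of weights sums to 1.
demo_profiles = {
    "accuracy_first": {"Accuracy": 0.7, "Robustness": 0.2, "Speed": 0.1},
    "robustness_first": {"Accuracy": 0.3, "Robustness": 0.6, "Speed": 0.1},
}

def rank(scores, weights):
    """Model names ordered by weighted score, best first."""
    def total(model):
        return sum(scores[model][metric] * w for metric, w in weights.items())
    return sorted(scores, key=total, reverse=True)

rankings = {name: rank(demo_scores, w) for name, w in demo_profiles.items()}
for name, order in rankings.items():
    print(f"{name}: {order}")
```

The winner changes with the profile, so it is worth agreeing on the weights with stakeholders before looking at the ranking rather than after.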

### Production Wisdom

> "A model with 98% accuracy that breaks when data changes slightly is worse than a model with 95% accuracy that stays stable." - Production ML Engineer

---

### Notebook Metrics

```
🔬 Dataset: Breast Cancer (569 samples, 30 features)
🤖 Models tested: 6 algorithms
📊 Dimensions: Performance, Robustness, Speed
🏆 Winner: [Your best model based on composite score]
⏱️ Time: ~25 minutes
```

<a id="next"></a>
## 11. 🎯 Next Steps

### Recommended

📘 **Next Notebook:** `../05_use_cases/01_credit_scoring.ipynb` ⭐⭐⭐
- Complete real-world case study
- End-to-end production workflow
- Compliance and deployment

### Alternative

📘 **Explore:** `04_resiliencia_drift.ipynb`
- Test model resilience to drift
- Temporal stability analysis

### Challenge

💪 **Advanced Benchmark Challenge!**
1. Add 3 more models (XGBoost, LightGBM, Neural Network)
2. Include fairness comparison (add protected attributes)
3. Test with different datasets
4. Create automated model selection pipeline
5. Generate HTML reports for all models

---

## 📚 Additional Resources

- 📖 [Experiment Documentation](../../../planejamento_doc/1-CORE/02-EXPERIMENT.md)
- 💻 [Model Validation Best Practices](https://github.com/DeepBridge-Validation/DeepBridge)
- 📊 [Robustness Testing Guide](./02_complete_robustness.ipynb)

---

<div style="background-color: #e3f2fd; padding: 15px; border-radius: 5px; border-left: 5px solid #2196F3;">
<b>💬 Feedback</b><br>
Had issues or suggestions? <a href="https://github.com/DeepBridge-Validation/DeepBridge/issues">Open an issue on GitHub!</a>
</div>

---

<div style="text-align: center; padding: 20px;">
<h2>🎊 Excellent work completing this notebook! 🎊</h2>
<p style="font-size: 18px;">Ready for real-world applications? Try: <code>../05_use_cases/01_credit_scoring.ipynb</code></p>
</div>