# 📈 Model Evaluation: Measuring and Comparing Performance

Welcome to **Model Evaluation** - the final and crucial step in our machine learning journey! 🎯 This is where we become data scientists and thoroughly assess our models to choose the best one for production.

## 🔍 What is Model Evaluation?

Model evaluation is like conducting a **comprehensive job interview** for our ML models. We test them from every angle to see which one truly understands our data and can make reliable predictions in the real world.

### 🎭 **The Evaluation Challenge:**
- **📊 Accuracy isn't everything**: A model with 95% accuracy might still be terrible
- **⚖️ Imbalanced data requires special metrics**: Standard metrics can be misleading
- **🎯 Business context matters**: Different errors have different costs
- **🔮 Real-world performance**: Lab performance vs. production performance

## 📚 **Comprehensive Evaluation Framework:**

### 📊 **Classification Metrics Deep Dive**
- **Accuracy**: Overall correctness (but can be misleading)
- **Precision**: Of predicted positives, how many were correct?
- **Recall**: Of actual positives, how many did we catch?
- **F1-Score**: Harmonic mean of precision and recall
- **ROC-AUC**: Ability to distinguish between classes
- **PR-AUC**: Performance on imbalanced data

### 🎯 **Advanced Evaluation Techniques**
- **Confusion Matrix Analysis**: Detailed error breakdown
- **ROC and PR Curves**: Visual performance assessment
- **Feature Importance**: What drives predictions?
- **Cross-Validation**: Robust performance estimation
- **Error Analysis**: Understanding model failures

### 🏆 **Model Selection Criteria**
- **Primary Metric**: Best performance on key business metric
- **Stability**: Consistent performance across different data splits
- **Interpretability**: Can we understand and trust the model?
- **Efficiency**: Training and prediction speed
- **Robustness**: Performance on edge cases

### 📈 **Business Impact Assessment**
- **Cost-Benefit Analysis**: Value of correct vs. cost of errors
- **Production Readiness**: Real-world deployment considerations
- **Monitoring Strategy**: How to track performance over time

---

## 🚀 **What We'll Accomplish:**

By the end of this notebook, you'll:
- ✅ **Master evaluation metrics** for imbalanced classification
- ✅ **Create comprehensive performance reports** with visualizations
- ✅ **Understand model strengths and weaknesses** deeply
- ✅ **Select the optimal model** for your specific use case
- ✅ **Prepare for production deployment** with confidence
- ✅ **Set up monitoring and maintenance** strategies

Let's evaluate our models like true data scientists! 🔬📊

In [18]:
# 📦 Step 1: Import Comprehensive Evaluation Libraries
print("📦 IMPORTING EVALUATION LIBRARIES...")
print("="*35)

# Core libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

# Enhanced plotting setup for evaluation
plt.style.use('default')
sns.set_palette("Set2")
plt.rcParams['figure.figsize'] = (15, 10)
plt.rcParams['font.size'] = 11

# Comprehensive evaluation metrics
from sklearn.metrics import (
    # Basic metrics
    accuracy_score, precision_score, recall_score, f1_score,
    # Advanced metrics
    roc_auc_score, average_precision_score, log_loss,
    # Detailed analysis
    classification_report, confusion_matrix,
    # Curves and plots
    roc_curve, precision_recall_curve, det_curve,
    # Multi-class extensions
    roc_auc_score, precision_recall_fscore_support
)

# Model interpretation and feature importance
try:
    import shap
    print("✅ SHAP available for model interpretation")
except ImportError:
    print("⚠️ SHAP not available - will use basic feature importance")
    shap = None

# Advanced plotting
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, PrecisionRecallDisplay

# Statistical analysis
from scipy import stats
from scipy.stats import chi2_contingency

# Model loading utilities
import pickle
import joblib
from pathlib import Path
import json
import sys

# 📁 Step 2: Set up evaluation environment
print("\n📁 SETTING UP EVALUATION ENVIRONMENT...")
print("="*37)

current_dir = Path.cwd()
project_root = current_dir.parent
src_path = project_root / 'src'
sys.path.append(str(src_path))

# Directories
models_dir = project_root / 'models'
results_dir = project_root / 'results'
reports_dir = results_dir / 'reports'
figures_dir = results_dir / 'figures'

# Create directories if they don't exist
for directory in [results_dir, reports_dir, figures_dir]:
    directory.mkdir(parents=True, exist_ok=True)

print(f"📂 Models: {models_dir}")
print(f"📂 Results: {results_dir}")
print(f"📂 Reports: {reports_dir}")
print(f"📂 Figures: {figures_dir}")

# 🎯 Step 3: Define evaluation parameters
print(f"\n🎯 EVALUATION PARAMETERS...")
print("="*25)

# Evaluation configuration
RANDOM_STATE = 42
CONFIDENCE_LEVEL = 0.95
DECIMAL_PLACES = 4

# Metrics to focus on (for imbalanced classification)
PRIMARY_METRICS = ['f1_score', 'precision', 'recall', 'roc_auc', 'average_precision']
BUSINESS_METRIC = 'f1_score'  # Can be changed based on business needs

print(f"🎲 Random state: {RANDOM_STATE}")
print(f"📊 Primary metrics: {PRIMARY_METRICS}")
print(f"🎯 Business metric: {BUSINESS_METRIC}")
print(f"📈 Confidence level: {CONFIDENCE_LEVEL}")

print("\n✅ Evaluation environment ready!")

📦 IMPORTING EVALUATION LIBRARIES...
✅ SHAP available for model interpretation

📁 SETTING UP EVALUATION ENVIRONMENT...
📂 Models: c:\Users\DELL\Desktop\AI-Project\AI-Project\models
📂 Results: c:\Users\DELL\Desktop\AI-Project\AI-Project\results
📂 Reports: c:\Users\DELL\Desktop\AI-Project\AI-Project\results\reports
📂 Figures: c:\Users\DELL\Desktop\AI-Project\AI-Project\results\figures

🎯 EVALUATION PARAMETERS...
🎲 Random state: 42
📊 Primary metrics: ['f1_score', 'precision', 'recall', 'roc_auc', 'average_precision']
🎯 Business metric: f1_score
📈 Confidence level: 0.95

✅ Evaluation environment ready!


## 📊 Step 1: Load Trained Models and Test Data

Let's start by loading our trained models and preparing for comprehensive evaluation. We'll analyze performance from multiple angles to make an informed decision.

### 🔍 **Evaluation Philosophy:**
- **📊 Multiple metrics**: No single metric tells the whole story
- **📈 Visual analysis**: Charts reveal patterns that numbers hide
- **🎯 Business context**: Consider real-world impact of predictions
- **⚖️ Fairness assessment**: Ensure models work for all groups
- **🔮 Generalization**: Test on truly unseen data

### 📋 **Evaluation Checklist:**
- ✅ Load trained models and test data
- ✅ Calculate comprehensive metrics
- ✅ Create visualization dashboards
- ✅ Analyze feature importance
- ✅ Assess model reliability
- ✅ Make final model selection

In [None]:
# 📊 Step 1: Load trained models and test data
print("📊 LOADING MODELS AND TEST DATA...")
print("="*33)

# Try to load from saved files
try:
    trained_models = {}
    models_loaded = False
    
    # Load all model files
    model_files = list(models_dir.glob('*_model.pkl'))
    for model_file in model_files:
        try:
            model_info = joblib.load(model_file)
            model_name = model_file.stem.replace('_model', '').replace('_', ' ').title()
            trained_models[model_name] = model_info
            models_loaded = True
            print(f"✅ Loaded {model_name}")
        except Exception as e:
            print(f"❌ Failed to load {model_file}: {e}")
    
    # Load evaluation data
    data_path = models_dir / 'evaluation_data.pkl'
    if data_path.exists():
        evaluation_data = joblib.load(data_path)
        X_test = evaluation_data['X_test']
        y_test = evaluation_data['y_test']
        X_train = evaluation_data['X_train']
        X_test_scaled = evaluation_data['X_test_scaled']
        feature_columns = evaluation_data['feature_columns']
        target_col = evaluation_data['target_col']
        print("✅ Loaded evaluation data")
    
    models_available = models_loaded and 'X_test' in locals()
    
except Exception as e:
    print(f"❌ Error loading from files: {e}")
    models_available = False

if not models_available:
    print("🔄 Trying to use in-memory models...")
    # Check if models are available from training notebook
    if 'trained_models' in globals() and 'X_test' in globals():
        print("✅ Using models from training notebook session")
        models_available = True
    else:
        print("❌ No models available. Please run training first.")
        models_available = False
    
    # 📈 Step 2: Generate predictions on test set
    print(f"\n📈 GENERATING TEST SET PREDICTIONS...")
    print("="*36)
    
    test_results = {}
    
    for model_name, model_info in trained_models.items():
        print(f"\n🔮 Evaluating {model_name}...")
        
        try:
            model = model_info['model']
            needs_scaling = model_info['needs_scaling']
            
            # Use appropriate test data (scaled or original)
            X_test_use = X_test_scaled if needs_scaling else X_test
            
            # Generate predictions
            y_test_pred = model.predict(X_test_use)
            
            # Generate probability predictions if available
            if hasattr(model, 'predict_proba'):
                y_test_proba = model.predict_proba(X_test_use)
                y_test_proba_pos = y_test_proba[:, 1]  # Probability of positive class
            else:
                y_test_proba = None
                y_test_proba_pos = None
            
            # Calculate comprehensive metrics
            test_accuracy = accuracy_score(y_test, y_test_pred)
            test_precision = precision_score(y_test, y_test_pred, average='binary')
            test_recall = recall_score(y_test, y_test_pred, average='binary')
            test_f1 = f1_score(y_test, y_test_pred, average='binary')
            
            if y_test_proba_pos is not None:
                test_roc_auc = roc_auc_score(y_test, y_test_proba_pos)
                test_pr_auc = average_precision_score(y_test, y_test_proba_pos)
            else:
                test_roc_auc = None
                test_pr_auc = None
            
            # Store results
            test_results[model_name] = {
                'predictions': y_test_pred,
                'probabilities': y_test_proba_pos,
                'accuracy': test_accuracy,
                'precision': test_precision,
                'recall': test_recall,
                'f1_score': test_f1,
                'roc_auc': test_roc_auc,
                'pr_auc': test_pr_auc,
                'confusion_matrix': confusion_matrix(y_test, y_test_pred)
            }
            
            print(f"  ✅ Metrics calculated:")
            print(f"     • Accuracy: {test_accuracy:.4f}")
            print(f"     • Precision: {test_precision:.4f}")
            print(f"     • Recall: {test_recall:.4f}")
            print(f"     • F1-Score: {test_f1:.4f}")
            if test_roc_auc:
                print(f"     • ROC-AUC: {test_roc_auc:.4f}")
                print(f"     • PR-AUC: {test_pr_auc:.4f}")
                
        except Exception as e:
            print(f"  ❌ Error: {str(e)}")
            test_results[model_name] = None
    
    # 📊 Step 3: Create performance comparison table
    print(f"\n📊 PERFORMANCE COMPARISON TABLE:")
    print("="*33)
    
    # Create DataFrame for easy comparison
    results_data = []
    for model_name, results in test_results.items():
        if results is not None:
            row = {
                'Model': model_name,
                'Accuracy': results['accuracy'],
                'Precision': results['precision'],
                'Recall': results['recall'],
                'F1-Score': results['f1_score'],
                'ROC-AUC': results['roc_auc'] if results['roc_auc'] else 'N/A',
                'PR-AUC': results['pr_auc'] if results['pr_auc'] else 'N/A'
            }
            results_data.append(row)
    
    results_df = pd.DataFrame(results_data)
    
    if not results_df.empty:
        # Display formatted table
        print(results_df.round(4).to_string(index=False))
        
        # Find best models for different metrics
        print(f"\n🏆 BEST PERFORMERS:")
        print("="*18)
        
        metrics_to_rank = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
        for metric in metrics_to_rank:
            if metric in results_df.columns:
                best_idx = results_df[metric].idxmax()
                best_model = results_df.loc[best_idx, 'Model']
                best_score = results_df.loc[best_idx, metric]
                print(f"  • Best {metric}: {best_model} ({best_score:.4f})")
        
        # ROC-AUC (only for models with probabilities)
        roc_auc_scores = results_df[results_df['ROC-AUC'] != 'N/A']['ROC-AUC']
        if not roc_auc_scores.empty:
            best_roc_idx = roc_auc_scores.idxmax()
            best_roc_model = results_df.loc[best_roc_idx, 'Model']
            best_roc_score = roc_auc_scores.loc[best_roc_idx]
            print(f"  • Best ROC-AUC: {best_roc_model} ({best_roc_score:.4f})")
    
    # 📈 Step 4: Create comprehensive visualizations
    print(f"\n📈 CREATING PERFORMANCE VISUALIZATIONS...")
    print("="*38)
    
    # Set up the visualization grid
    fig = plt.figure(figsize=(20, 15))
    
    # 1. Performance metrics comparison (bar chart)
    ax1 = plt.subplot(2, 3, 1)
    if not results_df.empty:
        metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
        x = np.arange(len(results_df))
        width = 0.2
        
        for i, metric in enumerate(metrics):
            values = results_df[metric].values
            ax1.bar(x + i*width, values, width, label=metric, alpha=0.8)
        
        ax1.set_xlabel('Models')
        ax1.set_ylabel('Score')
        ax1.set_title('📊 Model Performance Comparison', fontweight='bold')
        ax1.set_xticks(x + width * 1.5)
        ax1.set_xticklabels(results_df['Model'], rotation=45, ha='right')
        ax1.legend()
        ax1.grid(True, alpha=0.3)
        ax1.set_ylim(0, 1)
    
    # 2. ROC Curves comparison
    ax2 = plt.subplot(2, 3, 2)
    
    for model_name, results in test_results.items():
        if results is not None and results['probabilities'] is not None:
            fpr, tpr, _ = roc_curve(y_test, results['probabilities'])
            roc_auc = results['roc_auc']
            ax2.plot(fpr, tpr, label=f'{model_name} (AUC = {roc_auc:.3f})', linewidth=2)
    
    ax2.plot([0, 1], [0, 1], 'k--', linewidth=1, alpha=0.5)
    ax2.set_xlabel('False Positive Rate')
    ax2.set_ylabel('True Positive Rate')
    ax2.set_title('📈 ROC Curves Comparison', fontweight='bold')
    ax2.legend()
    ax2.grid(True, alpha=0.3)
    
    # 3. Precision-Recall Curves
    ax3 = plt.subplot(2, 3, 3)
    
    for model_name, results in test_results.items():
        if results is not None and results['probabilities'] is not None:
            precision, recall, _ = precision_recall_curve(y_test, results['probabilities'])
            pr_auc = results['pr_auc']
            ax3.plot(recall, precision, label=f'{model_name} (AUC = {pr_auc:.3f})', linewidth=2)
    
    # Baseline (random classifier)
    baseline = (y_test == 1).mean()
    ax3.axhline(y=baseline, color='k', linestyle='--', alpha=0.5, label=f'Baseline = {baseline:.3f}')
    
    ax3.set_xlabel('Recall')
    ax3.set_ylabel('Precision')
    ax3.set_title('📊 Precision-Recall Curves', fontweight='bold')
    ax3.legend()
    ax3.grid(True, alpha=0.3)
    
    # 4. Confusion Matrices (show best model)
    if not results_df.empty:
        best_model_name = results_df.loc[results_df['F1-Score'].idxmax(), 'Model']
        best_results = test_results[best_model_name]
        
        ax4 = plt.subplot(2, 3, 4)
        cm = best_results['confusion_matrix']
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax4)
        ax4.set_title(f'🎯 Confusion Matrix: {best_model_name}', fontweight='bold')
        ax4.set_xlabel('Predicted')
        ax4.set_ylabel('Actual')
    
    # 5. Feature Importance (for tree-based models)
    ax5 = plt.subplot(2, 3, 5)
    
    # Find a tree-based model for feature importance
    tree_models = ['Random Forest', 'XGBoost', 'Gradient Boosting']
    importance_model = None
    
    for model_name in tree_models:
        if model_name in trained_models:
            model = trained_models[model_name]['model']
            if hasattr(model, 'feature_importances_'):
                importance_model = model_name
                break
    
    if importance_model:
        model = trained_models[importance_model]['model']
        importances = model.feature_importances_
        
        # Get feature names
        feature_names = X_train.columns if not trained_models[importance_model]['needs_scaling'] else X_train_scaled.columns
        
        # Sort by importance
        indices = np.argsort(importances)[::-1][:10]  # Top 10 features
        
        ax5.bar(range(len(indices)), importances[indices], alpha=0.8)
        ax5.set_xlabel('Features')
        ax5.set_ylabel('Importance')
        ax5.set_title(f'🌟 Feature Importance: {importance_model}', fontweight='bold')
        ax5.set_xticks(range(len(indices)))
        ax5.set_xticklabels([feature_names[i] for i in indices], rotation=45, ha='right')
        ax5.grid(True, alpha=0.3)
    
    # 6. Model Reliability (performance consistency)
    ax6 = plt.subplot(2, 3, 6)
    
    if 'cv_results' in globals():
        cv_f1_means = []
        cv_f1_stds = []
        cv_model_names = []
        
        for model_name, cv_data in cv_results.items():
            if cv_data and 'f1' in cv_data and cv_data['f1']:
                cv_f1_means.append(cv_data['f1']['mean'])
                cv_f1_stds.append(cv_data['f1']['std'])
                cv_model_names.append(model_name)
        
        if cv_f1_means:
            x_pos = range(len(cv_model_names))
            ax6.bar(x_pos, cv_f1_means, yerr=cv_f1_stds, capsize=5, alpha=0.8)
            ax6.set_xlabel('Models')
            ax6.set_ylabel('F1-Score')
            ax6.set_title('📊 Model Reliability (CV F1 ± Std)', fontweight='bold')
            ax6.set_xticks(x_pos)
            ax6.set_xticklabels(cv_model_names, rotation=45, ha='right')
            ax6.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    print("✅ Comprehensive evaluation complete!")
    
else:
    print("❌ No trained models available for evaluation")
    print("💡 Please run the model training notebook first")

📊 LOADING MODELS AND TEST DATA...
✅ Loaded Gradient Boosting
✅ Loaded Logistic Regression
✅ Loaded Random Forest
✅ Loaded Support Vector Machine
✅ Loaded Xgboost
✅ Loaded evaluation data
❌ No trained models available for evaluation
💡 Please run the model training notebook first
