# 🤖 Model Training Tutorial

Welcome to the third tutorial in our ML Pipeline series! In this notebook, we'll train several machine learning models on the features engineered in the previous tutorial, tune their hyperparameters, compare them on a held-out test set, and save the best ones for reuse.

## 🎯 What You'll Learn
- Training multiple ML algorithms (Random Forest, Logistic Regression, SVM, etc.)
- Hyperparameter tuning with Grid Search and Random Search
- Cross-validation for robust model evaluation
- Model comparison and selection
- Performance metrics and evaluation
- Model persistence and saving

## 🏆 Target Performance Goals
- **Titanic Classification**: Achieve **89.4% accuracy** with Logistic Regression
- **Housing Regression**: Achieve **R² = 0.681** with Linear Regression
- Train and compare **6+ different algorithms**
- Implement **hyperparameter optimization**
- Create **production-ready models**

## 🛠️ Setup and Imports

In [None]:
# =============================================================================
# UNIVERSAL SETUP - Works on all PCs and environments
# =============================================================================

import os
import sys
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Navigate to project root if we're in notebooks directory
if os.getcwd().endswith('notebooks'):
    os.chdir('..')
    print(f"📁 Changed to project root: {os.getcwd()}")
else:
    print(f"📁 Already in project root: {os.getcwd()}")

# Add src to Python path
src_path = os.path.join(os.getcwd(), 'src')
if src_path not in sys.path:
    sys.path.append(src_path)
    print(f"📦 Added to Python path: {src_path}")

# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import joblib
from datetime import datetime
import json

# Scikit-learn imports
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.svm import SVC, SVR
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Metrics
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve,
    confusion_matrix, classification_report,
    mean_squared_error, mean_absolute_error, r2_score
)

# Configure plotting
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    # Matplotlib raises OSError when a style name is unknown
    plt.style.use('seaborn')  # Fallback for older versions

sns.set_palette("husl")
plt.rcParams['figure.figsize'] = (10, 6)

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.precision', 4)

print("✅ Setup completed successfully!")
import sklearn
print(f"📊 Scikit-learn version: {sklearn.__version__}")

## 📥 Load Engineered Features

Let's load the features we created in the previous tutorial.

In [None]:
# =============================================================================
# LOAD ENGINEERED FEATURES
# =============================================================================

def load_engineered_features():
    """Load engineered features from previous tutorial"""
    print("📥 Loading engineered features...")
    
    datasets = {}
    
    # Try to load Titanic features
    titanic_paths = [
        'data/features/titanic_features.csv',
        'data/raw/titanic.csv'  # Fallback to raw data
    ]
    
    for path in titanic_paths:
        if Path(path).exists():
            try:
                datasets['titanic'] = pd.read_csv(path)
                print(f"✅ Titanic data loaded from: {path}")
                print(f"   Shape: {datasets['titanic'].shape}")
                break
            except Exception as e:
                print(f"⚠️ Error loading {path}: {e}")
                continue
    
    # Try to load Housing features
    housing_paths = [
        'data/features/housing_features.csv',
        'data/raw/housing.csv'  # Fallback to raw data
    ]
    
    for path in housing_paths:
        if Path(path).exists():
            try:
                datasets['housing'] = pd.read_csv(path)
                print(f"✅ Housing data loaded from: {path}")
                print(f"   Shape: {datasets['housing'].shape}")
                break
            except Exception as e:
                print(f"⚠️ Error loading {path}: {e}")
                continue
    
    if not datasets:
        print("❌ No datasets could be loaded!")
        print("💡 Please run the previous notebooks first:")
        print("   1. 01_data_exploration.ipynb")
        print("   2. 02_feature_engineering.ipynb")
        print("   Or run: python download_datasets.py")
    
    return datasets

# Load the datasets
datasets = load_engineered_features()

# Display dataset information
for name, data in datasets.items():
    print(f"\n📊 {name.title()} Dataset Info:")
    print(f"   Shape: {data.shape}")
    print(f"   Features: {data.shape[1] - 1}")
    print(f"   Samples: {data.shape[0]}")
    print(f"   Missing values: {data.isnull().sum().sum()}")
    
    # Show first few columns
    print(f"   Columns (first 10): {list(data.columns[:10])}")
    if len(data.columns) > 10:
        print(f"   ... and {len(data.columns) - 10} more")

## 🔧 Data Preparation

Let's prepare our data for training by filling any remaining missing values, encoding categorical columns as integers, and splitting into train/test sets.

In [None]:
def prepare_data_for_training(data, target_col, test_size=0.2, random_state=42):
    """Prepare data for machine learning training"""
    print(f"🔧 Preparing data for training...")
    
    # Make a copy
    df = data.copy()
    
    # Handle missing values if any
    missing_values = df.isnull().sum().sum()
    if missing_values > 0:
        print(f"   🔧 Handling {missing_values} missing values...")
        
        # Fill numerical columns with median.
        # Assign the result back instead of using inplace=True: in-place fillna
        # on a selected column is deprecated in pandas 2.x and does not update
        # the parent DataFrame under Copy-on-Write.
        numerical_cols = df.select_dtypes(include=[np.number]).columns
        for col in numerical_cols:
            if df[col].isnull().sum() > 0:
                df[col] = df[col].fillna(df[col].median())
        
        # Fill categorical columns with mode
        categorical_cols = df.select_dtypes(include=['object', 'category']).columns
        for col in categorical_cols:
            if col != target_col and df[col].isnull().sum() > 0:
                df[col] = df[col].fillna(df[col].mode()[0])
    
    # Handle categorical variables
    categorical_cols = df.select_dtypes(include=['object', 'category']).columns.tolist()
    if target_col in categorical_cols:
        categorical_cols.remove(target_col)
    
    if categorical_cols:
        print(f"   🏷️ Encoding {len(categorical_cols)} categorical columns...")
        for col in categorical_cols:
            # Simple label encoding for now
            df[col] = pd.Categorical(df[col]).codes
    
    # Separate features and target
    if target_col not in df.columns:
        raise ValueError(f"Target column '{target_col}' not found in data")
    
    X = df.drop([target_col], axis=1)
    y = df[target_col]
    
    # Handle infinite values
    X = X.replace([np.inf, -np.inf], np.nan)
    X = X.fillna(X.median())
    
    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, 
        stratify=y if len(y.unique()) < 20 else None  # Heuristic: fewer than 20 unique target values is treated as classification
    )
    
    print(f"   ✅ Data preparation completed")
    print(f"   📊 Training set: {X_train.shape}")
    print(f"   📊 Test set: {X_test.shape}")
    print(f"   🎯 Target distribution:")
    if len(y.unique()) < 10:  # Categorical target
        print(f"      {dict(y.value_counts())}")
    else:  # Continuous target
        print(f"      Mean: {y.mean():.3f}, Std: {y.std():.3f}")
        print(f"      Range: [{y.min():.3f}, {y.max():.3f}]")
    
    return X_train, X_test, y_train, y_test, X.columns.tolist()

# Prepare datasets
prepared_data = {}

if 'titanic' in datasets:
    print("🚢 Preparing Titanic dataset...")
    try:
        X_train_t, X_test_t, y_train_t, y_test_t, features_t = prepare_data_for_training(
            datasets['titanic'], 'Survived'
        )
        prepared_data['titanic'] = {
            'X_train': X_train_t, 'X_test': X_test_t,
            'y_train': y_train_t, 'y_test': y_test_t,
            'features': features_t, 'target': 'Survived',
            'task_type': 'classification'
        }
        print("✅ Titanic data prepared successfully")
    except Exception as e:
        print(f"❌ Error preparing Titanic data: {e}")

if 'housing' in datasets:
    print("\n🏠 Preparing Housing dataset...")
    try:
        X_train_h, X_test_h, y_train_h, y_test_h, features_h = prepare_data_for_training(
            datasets['housing'], 'MEDV'
        )
        prepared_data['housing'] = {
            'X_train': X_train_h, 'X_test': X_test_h,
            'y_train': y_train_h, 'y_test': y_test_h,
            'features': features_h, 'target': 'MEDV',
            'task_type': 'regression'
        }
        print("✅ Housing data prepared successfully")
    except Exception as e:
        print(f"❌ Error preparing Housing data: {e}")

print(f"\n🎉 Data preparation completed for {len(prepared_data)} datasets!")
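
The helper above computes fill values (medians and modes) on the full dataset before splitting, so test-set statistics can influence training. One common way to avoid this is to fit the imputer inside a scikit-learn `Pipeline`, so it only ever sees training rows. The following is a minimal sketch on synthetic data; the `demo` frame, its `age` and `fare` columns, and the `leak_free` name are made up for illustration and are not part of the tutorial's datasets.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (hypothetical columns, not the Titanic features)
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age": rng.normal(30, 10, 200),
    "fare": rng.normal(50, 20, 200),
})
demo_target = (demo["age"] + rng.normal(0, 5, 200) > 30).astype(int)

# Knock out 10% of the 'age' values to simulate missing data
demo.loc[demo.sample(frac=0.1, random_state=0).index, "age"] = np.nan

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    demo, demo_target, test_size=0.2, random_state=42, stratify=demo_target
)

# The imputer is fitted on the training rows only when the pipeline is fitted
leak_free = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("clf", LogisticRegression(max_iter=1000)),
])
leak_free.fit(Xd_train, yd_train)

# The learned fill value for 'age' is the training median, not the global one
train_median = leak_free.named_steps["impute"].statistics_[0]
demo_accuracy = leak_free.score(Xd_test, yd_test)
print(f"Training-only median for 'age': {train_median:.2f}")
print(f"Test accuracy: {demo_accuracy:.3f}")
```

Because the pipeline is a single estimator, it can also be passed to `GridSearchCV` or `cross_val_score`, in which case the imputer is refitted on each training fold.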

## 🤖 Model Training Framework

Let's create a small framework that tunes each model with 5-fold grid search, evaluates it on the test set, and keeps track of the best model per dataset.

In [None]:
class MLModelTrainer:
    """Comprehensive ML Model Training and Evaluation Framework"""
    
    def __init__(self):
        self.models = {}
        self.results = {}
        self.best_models = {}
        
        # Define model configurations
        self.classification_models = {
            'Random Forest': {
                'model': RandomForestClassifier(random_state=42),
                'params': {
                    'n_estimators': [50, 100, 200],
                    'max_depth': [5, 10, None],
                    'min_samples_split': [2, 5, 10]
                }
            },
            'Logistic Regression': {
                'model': LogisticRegression(random_state=42, max_iter=1000),
                'params': {
                    'C': [0.1, 1.0, 10.0],
                    'solver': ['liblinear', 'lbfgs']
                }
            },
            'SVM': {
                'model': SVC(random_state=42, probability=True),
                'params': {
                    'C': [0.1, 1.0, 10.0],
                    'kernel': ['rbf', 'linear']
                }
            },
            'Gradient Boosting': {
                'model': GradientBoostingClassifier(random_state=42),
                'params': {
                    'n_estimators': [50, 100],
                    'learning_rate': [0.1, 0.2],
                    'max_depth': [3, 5]
                }
            },
            'K-Nearest Neighbors': {
                'model': KNeighborsClassifier(),
                'params': {
                    'n_neighbors': [3, 5, 7, 9],
                    'weights': ['uniform', 'distance']
                }
            },
            'Naive Bayes': {
                'model': GaussianNB(),
                'params': {}  # No hyperparameters to tune
            }
        }
        
        self.regression_models = {
            'Random Forest': {
                'model': RandomForestRegressor(random_state=42),
                'params': {
                    'n_estimators': [50, 100, 200],
                    'max_depth': [5, 10, None],
                    'min_samples_split': [2, 5, 10]
                }
            },
            'Linear Regression': {
                'model': LinearRegression(),
                'params': {}  # No hyperparameters to tune
            },
            'Ridge Regression': {
                'model': Ridge(random_state=42),
                'params': {
                    'alpha': [0.1, 1.0, 10.0, 100.0]
                }
            },
            'Lasso Regression': {
                'model': Lasso(random_state=42),
                'params': {
                    'alpha': [0.1, 1.0, 10.0, 100.0]
                }
            },
            'SVR': {
                'model': SVR(),
                'params': {
                    'C': [0.1, 1.0, 10.0],
                    'kernel': ['rbf', 'linear']
                }
            },
            'Gradient Boosting': {
                'model': GradientBoostingRegressor(random_state=42),
                'params': {
                    'n_estimators': [50, 100],
                    'learning_rate': [0.1, 0.2],
                    'max_depth': [3, 5]
                }
            }
        }
    
    def train_models(self, dataset_name, X_train, X_test, y_train, y_test, task_type='classification'):
        """Train multiple models with hyperparameter tuning"""
        print(f"🤖 Training models for {dataset_name} ({task_type})...")
        
        # Select appropriate models
        if task_type == 'classification':
            models_config = self.classification_models
        else:
            models_config = self.regression_models
        
        results = []
        
        for model_name, config in models_config.items():
            print(f"\n🔄 Training {model_name}...")
            
            try:
                # Get model and parameters
                base_model = config['model']
                param_grid = config['params']
                
                # Perform hyperparameter tuning if parameters exist
                if param_grid:
                    # Use GridSearchCV for hyperparameter tuning
                    grid_search = GridSearchCV(
                        base_model, param_grid, cv=5, 
                        scoring='accuracy' if task_type == 'classification' else 'r2',
                        n_jobs=-1, verbose=0
                    )
                    grid_search.fit(X_train, y_train)
                    best_model = grid_search.best_estimator_
                    best_params = grid_search.best_params_
                    cv_score = grid_search.best_score_
                else:
                    # No hyperparameters to tune
                    best_model = base_model
                    best_model.fit(X_train, y_train)
                    best_params = {}
                    # Calculate CV score manually
                    cv_scores = cross_val_score(
                        best_model, X_train, y_train, cv=5,
                        scoring='accuracy' if task_type == 'classification' else 'r2'
                    )
                    cv_score = cv_scores.mean()
                
                # Make predictions
                y_pred = best_model.predict(X_test)
                
                # Calculate metrics
                if task_type == 'classification':
                    metrics = self._calculate_classification_metrics(y_test, y_pred, best_model, X_test)
                else:
                    metrics = self._calculate_regression_metrics(y_test, y_pred)
                
                # Store results
                result = {
                    'model_name': model_name,
                    'model': best_model,
                    'best_params': best_params,
                    'cv_score': cv_score,
                    'predictions': y_pred,
                    **metrics
                }
                
                results.append(result)
                
                # Print results
                if task_type == 'classification':
                    print(f"   ✅ {model_name}: Accuracy = {metrics['accuracy']:.4f}, CV = {cv_score:.4f}")
                else:
                    print(f"   ✅ {model_name}: R² = {metrics['r2_score']:.4f}, CV = {cv_score:.4f}")
                
            except Exception as e:
                print(f"   ❌ {model_name} failed: {e}")
                continue
        
        # Store results
        self.results[dataset_name] = results
        
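        # Find best model (ranked by test-set score, so the winning score is
        # slightly optimistic; cv_score is the less biased estimate)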
        if results:
            if task_type == 'classification':
                best_result = max(results, key=lambda x: x['accuracy'])
                print(f"\n🏆 Best model: {best_result['model_name']} (Accuracy: {best_result['accuracy']:.4f})")
            else:
                best_result = max(results, key=lambda x: x['r2_score'])
                print(f"\n🏆 Best model: {best_result['model_name']} (R²: {best_result['r2_score']:.4f})")
            
            self.best_models[dataset_name] = best_result
        
        return results
    
    def _calculate_classification_metrics(self, y_true, y_pred, model, X_test):
        """Calculate classification metrics"""
        metrics = {
            'accuracy': accuracy_score(y_true, y_pred),
            'precision': precision_score(y_true, y_pred, average='weighted', zero_division=0),
            'recall': recall_score(y_true, y_pred, average='weighted', zero_division=0),
            'f1_score': f1_score(y_true, y_pred, average='weighted', zero_division=0)
        }
        
        # Add ROC AUC if the model supports probability prediction.
        # Default to None so the key is always present in the results.
        metrics['roc_auc'] = None
        try:
            if hasattr(model, 'predict_proba'):
                y_proba = model.predict_proba(X_test)
                if y_proba.shape[1] == 2:  # Binary classification
                    metrics['roc_auc'] = roc_auc_score(y_true, y_proba[:, 1])
                else:  # Multi-class
                    metrics['roc_auc'] = roc_auc_score(y_true, y_proba, multi_class='ovr')
        except Exception:
            # e.g. only one class present in y_true; leave roc_auc as None
            pass
        
        return metrics
    
    def _calculate_regression_metrics(self, y_true, y_pred):
        """Calculate regression metrics"""
        return {
            'r2_score': r2_score(y_true, y_pred),
            'mse': mean_squared_error(y_true, y_pred),
            'rmse': np.sqrt(mean_squared_error(y_true, y_pred)),
            'mae': mean_absolute_error(y_true, y_pred)
        }
    
    def create_results_summary(self):
        """Create a comprehensive results summary"""
        print("📊 MODEL TRAINING RESULTS SUMMARY")
        print("=" * 70)
        
        summary_data = []
        
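        for dataset_name, results in self.results.items():
            print(f"\n🎯 {dataset_name.title()} Results:")
            print("-" * 50)
            
            # Skip datasets where every model failed to train
            if not results:
                print("   ⚠️ No successful models to report")
                continue
            
            # Determine task type
            task_type = 'classification' if 'accuracy' in results[0] else 'regression'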
            
            # Sort results by primary metric
            if task_type == 'classification':
                sorted_results = sorted(results, key=lambda x: x['accuracy'], reverse=True)
                print(f"{'Rank':<4} {'Model':<20} {'Accuracy':<10} {'Precision':<10} {'Recall':<10} {'F1-Score':<10}")
                print("-" * 70)
                
                for i, result in enumerate(sorted_results, 1):
                    print(f"{i:<4} {result['model_name']:<20} {result['accuracy']:<10.4f} "
                          f"{result['precision']:<10.4f} {result['recall']:<10.4f} {result['f1_score']:<10.4f}")
                    
                    summary_data.append({
                        'Dataset': dataset_name.title(),
                        'Model': result['model_name'],
                        'Rank': i,
                        'Primary_Metric': result['accuracy'],
                        'Task_Type': 'Classification'
                    })
            
            else:  # regression
                sorted_results = sorted(results, key=lambda x: x['r2_score'], reverse=True)
                print(f"{'Rank':<4} {'Model':<20} {'R²':<10} {'RMSE':<10} {'MAE':<10}")
                print("-" * 60)
                
                for i, result in enumerate(sorted_results, 1):
                    print(f"{i:<4} {result['model_name']:<20} {result['r2_score']:<10.4f} "
                          f"{result['rmse']:<10.4f} {result['mae']:<10.4f}")
                    
                    summary_data.append({
                        'Dataset': dataset_name.title(),
                        'Model': result['model_name'],
                        'Rank': i,
                        'Primary_Metric': result['r2_score'],
                        'Task_Type': 'Regression'
                    })
        
        return pd.DataFrame(summary_data)

# Initialize the trainer
trainer = MLModelTrainer()
print("✅ ML Model Trainer initialized!")
print(f"📊 Classification models: {len(trainer.classification_models)}")
print(f"📊 Regression models: {len(trainer.regression_models)}")
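
The trainer above tunes every model with an exhaustive `GridSearchCV`. The learning goals also mention Random Search, which samples a fixed number of candidates instead of trying every combination. Here is a minimal sketch of `RandomizedSearchCV` on synthetic data; the dataset, the parameter lists, and the `X_rs`, `y_rs`, and `random_search` names are illustrative assumptions, not part of the trainer class.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for an engineered feature matrix
X_rs, y_rs = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=42
)

# 3 x 3 x 3 = 27 possible combinations in total
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

# Only n_iter=5 of the 27 combinations are sampled and cross-validated
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    scoring="accuracy",
    random_state=42,
)
random_search.fit(X_rs, y_rs)

n_candidates = len(random_search.cv_results_["params"])
print(f"Candidates evaluated: {n_candidates}")
print(f"Best params: {random_search.best_params_}")
print(f"Best CV accuracy: {random_search.best_score_:.4f}")
```

Random search trades completeness for speed: it is most useful when the grid is large and only a few hyperparameters matter much.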

## 🚢 Titanic Classification Training

Let's train multiple classification models on the Titanic dataset and compare the results against our target accuracy of 89.4%.

In [None]:
# Train Titanic classification models
if 'titanic' in prepared_data:
    print("🚢 TITANIC CLASSIFICATION TRAINING")
    print("=" * 50)
    
    titanic_data = prepared_data['titanic']
    
    # Train all models
    titanic_results = trainer.train_models(
        'titanic',
        titanic_data['X_train'],
        titanic_data['X_test'],
        titanic_data['y_train'],
        titanic_data['y_test'],
        task_type='classification'
    )
    
    print(f"\n✅ Titanic training completed! Trained {len(titanic_results)} models.")
    
else:
    print("⚠️ Titanic data not available for training")
    titanic_results = []
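
Logistic Regression, SVM, and K-Nearest Neighbors are sensitive to feature scale. If the engineered features were not already standardized in the previous tutorial, one option is to wrap the model in a `Pipeline` with a `StandardScaler` and tune the pipeline as a whole. This is a sketch under that assumption, using synthetic data; `X_sc`, `y_sc`, `scaled_svm`, and `scaled_grid` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data with one feature on a much larger scale than the others
X_sc, y_sc = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=42
)
X_sc[:, 0] *= 1000.0

# The scaler is refitted on the training part of every CV fold
scaled_svm = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC()),
])

# Pipeline parameters are addressed as <step name>__<parameter name>
scaled_grid = GridSearchCV(
    scaled_svm,
    param_grid={"svm__C": [0.1, 1.0, 10.0], "svm__kernel": ["rbf", "linear"]},
    cv=3,
    scoring="accuracy",
)
scaled_grid.fit(X_sc, y_sc)

print(f"Best params: {scaled_grid.best_params_}")
print(f"Best CV accuracy: {scaled_grid.best_score_:.4f}")
```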

## 🏠 Housing Regression Training

Now let's train regression models on the Housing dataset and compare the results against our target R² of 0.681.

In [None]:
# Train Housing regression models
if 'housing' in prepared_data:
    print("🏠 HOUSING REGRESSION TRAINING")
    print("=" * 50)
    
    housing_data = prepared_data['housing']
    
    # Train all models
    housing_results = trainer.train_models(
        'housing',
        housing_data['X_train'],
        housing_data['X_test'],
        housing_data['y_train'],
        housing_data['y_test'],
        task_type='regression'
    )
    
    print(f"\n✅ Housing training completed! Trained {len(housing_results)} models.")
    
else:
    print("⚠️ Housing data not available for training")
    housing_results = []
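
The regression results are reported as R², RMSE, and MAE. To make those numbers easier to interpret, here is a small worked example that computes each metric by hand with NumPy and checks it against scikit-learn. The four target values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up actual and predicted values
y_true_demo = np.array([3.0, 5.0, 7.0, 9.0])
y_pred_demo = np.array([2.5, 5.5, 7.0, 8.0])

residuals = y_true_demo - y_pred_demo

# R² = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum(residuals ** 2)                             # 1.5
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)    # 20.0
r2_manual = 1.0 - ss_res / ss_tot                           # about 0.925

# RMSE = square root of the mean squared residual
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# MAE = mean absolute residual
mae_manual = np.mean(np.abs(residuals))                     # 0.5

print(f"R²:   manual={r2_manual:.4f}  sklearn={r2_score(y_true_demo, y_pred_demo):.4f}")
print(f"RMSE: manual={rmse_manual:.4f}  sklearn={np.sqrt(mean_squared_error(y_true_demo, y_pred_demo)):.4f}")
print(f"MAE:  manual={mae_manual:.4f}  sklearn={mean_absolute_error(y_true_demo, y_pred_demo):.4f}")
```

R² is unitless and compares the model to always predicting the mean, while RMSE and MAE are in the units of the target, with RMSE penalizing large errors more heavily.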

## 📊 Model Performance Visualization

Let's plot each model's primary and secondary metric as bar charts, plus a confusion matrix for the best classifier.

In [None]:
def create_performance_visualizations(trainer):
    """Create comprehensive performance visualizations"""
    print("📊 Creating performance visualizations...")
    
    # Create subplots based on number of datasets
    n_datasets = len(trainer.results)
    if n_datasets == 0:
        print("⚠️ No results to visualize")
        return
    
    # squeeze=False keeps `axes` two-dimensional even with a single dataset
    fig, axes = plt.subplots(2, n_datasets, figsize=(6*n_datasets, 12), squeeze=False)
    
    col_idx = 0
    
    for dataset_name, results in trainer.results.items():
        if not results:
            continue
            
        # Determine task type
        task_type = 'classification' if 'accuracy' in results[0] else 'regression'
        
        # Extract model names and scores
        model_names = [r['model_name'] for r in results]
        
        if task_type == 'classification':
            primary_scores = [r['accuracy'] for r in results]
            secondary_scores = [r['f1_score'] for r in results]
            primary_label = 'Accuracy'
            secondary_label = 'F1-Score'
        else:
            primary_scores = [r['r2_score'] for r in results]
            secondary_scores = [r['rmse'] for r in results]
            primary_label = 'R² Score'
            secondary_label = 'RMSE'
        
        # Primary metric bar plot
        ax1 = axes[0, col_idx]
        bars1 = ax1.bar(range(len(model_names)), primary_scores, 
                       color=plt.cm.Set3(np.linspace(0, 1, len(model_names))))
        ax1.set_title(f'{dataset_name.title()} - {primary_label}')
        ax1.set_xlabel('Models')
        ax1.set_ylabel(primary_label)
        ax1.set_xticks(range(len(model_names)))
        ax1.set_xticklabels(model_names, rotation=45, ha='right')
        
        # Add value labels on bars
        for bar, score in zip(bars1, primary_scores):
            height = bar.get_height()
            ax1.text(bar.get_x() + bar.get_width()/2., height + 0.01,
                    f'{score:.3f}', ha='center', va='bottom', fontsize=9)
        
        # Secondary metric bar plot
        ax2 = axes[1, col_idx]
        bars2 = ax2.bar(range(len(model_names)), secondary_scores,
                       color=plt.cm.Set2(np.linspace(0, 1, len(model_names))))
        ax2.set_title(f'{dataset_name.title()} - {secondary_label}')
        ax2.set_xlabel('Models')
        ax2.set_ylabel(secondary_label)
        ax2.set_xticks(range(len(model_names)))
        ax2.set_xticklabels(model_names, rotation=45, ha='right')
        
        # Add value labels on bars
        for bar, score in zip(bars2, secondary_scores):
            height = bar.get_height()
            ax2.text(bar.get_x() + bar.get_width()/2., height + max(secondary_scores)*0.01,
                    f'{score:.3f}', ha='center', va='bottom', fontsize=9)
        
        col_idx += 1
    
    plt.tight_layout()
    plt.show()
    
    # Create confusion matrices for classification tasks
    classification_datasets = [name for name, results in trainer.results.items() 
                             if results and 'accuracy' in results[0]]
    
    if classification_datasets:
        print("\n📊 Creating confusion matrices...")
        
        fig, axes = plt.subplots(1, len(classification_datasets), 
                               figsize=(6*len(classification_datasets), 5))
        if len(classification_datasets) == 1:
            axes = [axes]
        
        for idx, dataset_name in enumerate(classification_datasets):
            # Get best model results
            best_result = trainer.best_models[dataset_name]
            
            # Get actual vs predicted
            # (reads the notebook-level `prepared_data` dict built in the data preparation cell)
            dataset_info = prepared_data[dataset_name]
            y_true = dataset_info['y_test']
            y_pred = best_result['predictions']
            
            # Create confusion matrix
            cm = confusion_matrix(y_true, y_pred)
            
            # Plot
            sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx])
            axes[idx].set_title(f'{dataset_name.title()} - {best_result["model_name"]}\nAccuracy: {best_result["accuracy"]:.4f}')
            axes[idx].set_xlabel('Predicted')
            axes[idx].set_ylabel('Actual')
        
        plt.tight_layout()
        plt.show()

# Create visualizations
if trainer.results:
    create_performance_visualizations(trainer)
else:
    print("⚠️ No training results available for visualization")
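
The setup cell imports `roc_curve`, but the plots above only cover bar charts and confusion matrices. For binary classifiers, an ROC curve shows the trade-off between true and false positive rates across all decision thresholds. Below is a minimal sketch of computing the curve's points on synthetic data; the `X_roc`, `y_roc`, and `roc_clf` names are illustrative, and plotting is left out so the snippet runs without a display.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification problem
X_roc, y_roc = make_classification(n_samples=300, n_features=6, random_state=42)
X_roc_train, X_roc_test, y_roc_train, y_roc_test = train_test_split(
    X_roc, y_roc, test_size=0.25, random_state=42, stratify=y_roc
)

roc_clf = LogisticRegression(max_iter=1000).fit(X_roc_train, y_roc_train)

# Probability of the positive class for each test sample
positive_proba = roc_clf.predict_proba(X_roc_test)[:, 1]

# One (fpr, tpr) point per decision threshold
fpr, tpr, thresholds = roc_curve(y_roc_test, positive_proba)

# Area under the curve, computed two equivalent ways
auc_from_curve = auc(fpr, tpr)
auc_direct = roc_auc_score(y_roc_test, positive_proba)
print(f"ROC points: {len(fpr)}, AUC: {auc_from_curve:.4f}")
```

To draw the curve, `fpr` and `tpr` can be passed to `plt.plot(fpr, tpr)`, usually with a diagonal reference line for a random classifier.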

## 📈 Detailed Model Analysis

Let's look at the best model for each dataset in detail: its tuned parameters, metrics, feature importances (where available), and a few sample predictions.

In [None]:
def analyze_best_models(trainer, prepared_data):
    """Detailed analysis of best performing models"""
    print("🔍 DETAILED MODEL ANALYSIS")
    print("=" * 50)
    
    for dataset_name, best_result in trainer.best_models.items():
        print(f"\n🏆 Best Model for {dataset_name.title()}: {best_result['model_name']}")
        print("-" * 60)
        
        # Get dataset info
        dataset_info = prepared_data[dataset_name]
        task_type = dataset_info['task_type']
        
        # Model details
        print(f"📊 Model Details:")
        print(f"   Algorithm: {best_result['model_name']}")
        print(f"   Task Type: {task_type.title()}")
        print(f"   Best Parameters: {best_result['best_params']}")
        print(f"   Cross-Validation Score: {best_result['cv_score']:.4f}")
        
        # Performance metrics
        print(f"\n📈 Performance Metrics:")
        if task_type == 'classification':
            print(f"   Accuracy: {best_result['accuracy']:.4f}")
            print(f"   Precision: {best_result['precision']:.4f}")
            print(f"   Recall: {best_result['recall']:.4f}")
            print(f"   F1-Score: {best_result['f1_score']:.4f}")
            if best_result.get('roc_auc') is not None:
                print(f"   ROC AUC: {best_result['roc_auc']:.4f}")
        else:
            print(f"   R² Score: {best_result['r2_score']:.4f}")
            print(f"   RMSE: {best_result['rmse']:.4f}")
            print(f"   MAE: {best_result['mae']:.4f}")
            print(f"   MSE: {best_result['mse']:.4f}")
        
        # Feature importance (if available)
        model = best_result['model']
        if hasattr(model, 'feature_importances_'):
            print(f"\n🎯 Top 10 Most Important Features:")
            feature_names = dataset_info['features']
            importances = model.feature_importances_
            
            # Sort features by importance
            feature_importance = list(zip(feature_names, importances))
            feature_importance.sort(key=lambda x: x[1], reverse=True)
            
            for i, (feature, importance) in enumerate(feature_importance[:10], 1):
                print(f"   {i:2d}. {feature:<25} {importance:.4f}")
        
        # Model complexity
        print(f"\n🔧 Model Complexity:")
        print(f"   Training samples: {len(dataset_info['X_train'])}")
        print(f"   Test samples: {len(dataset_info['X_test'])}")
        print(f"   Features: {len(dataset_info['features'])}")
        
        # Prediction examples
        print(f"\n🔮 Sample Predictions:")
        y_true = dataset_info['y_test']
        y_pred = best_result['predictions']
        
        # Show first 5 predictions
        for i in range(min(5, len(y_true))):
            if task_type == 'classification':
                result_emoji = "✅" if y_true.iloc[i] == y_pred[i] else "❌"
                print(f"   {result_emoji} Actual: {y_true.iloc[i]}, Predicted: {y_pred[i]}")
            else:
                error = abs(y_true.iloc[i] - y_pred[i])
                print(f"   📊 Actual: {y_true.iloc[i]:.2f}, Predicted: {y_pred[i]:.2f}, Error: {error:.2f}")

# Analyze best models
if trainer.best_models and prepared_data:
    analyze_best_models(trainer, prepared_data)
else:
    print("⚠️ No best models available for analysis")
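
The analysis above only lists feature importances for models that expose `feature_importances_` (tree-based models). For other estimators, such as linear models, SVMs, or KNN, permutation importance is a model-agnostic alternative: it measures how much the score drops when one feature's values are shuffled. Here is a sketch on synthetic data; the dataset and the `X_pi`, `y_pi`, and `perm` names are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic regression data: only 2 of the 5 features carry signal
X_pi, y_pi = make_regression(
    n_samples=300, n_features=5, n_informative=2, noise=5.0, random_state=42
)
X_pi_train, X_pi_test, y_pi_train, y_pi_test = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=42
)

linear_model = LinearRegression().fit(X_pi_train, y_pi_train)

# Shuffle each feature 10 times on the test set and record the drop in R²
perm = permutation_importance(
    linear_model, X_pi_test, y_pi_test,
    n_repeats=10, random_state=42, scoring="r2",
)

# Feature indices sorted from most to least important
ranking = perm.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature_{idx}: {perm.importances_mean[idx]:.4f} "
          f"± {perm.importances_std[idx]:.4f}")
```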

## 💾 Save Trained Models

Let's save our best models for future use and deployment.
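
Models are persisted with `joblib`. As a quick sanity check of that approach, the sketch below saves a fitted model to a temporary directory, loads it back, and confirms the restored copy predicts identically. The synthetic data and the file name `demo_logistic_regression.joblib` are illustrative only.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Fit a small model on synthetic data
X_jl, y_jl = make_classification(n_samples=100, n_features=5, random_state=42)
fitted_model = LogisticRegression(max_iter=1000).fit(X_jl, y_jl)

# Round trip: dump to disk, then load into a new object
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "demo_logistic_regression.joblib"
    joblib.dump(fitted_model, model_path)
    restored_model = joblib.load(model_path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(
    fitted_model.predict(X_jl), restored_model.predict(X_jl)
)
print(f"Predictions identical after reload: {same_predictions}")
```

Saved models should be loaded with the same scikit-learn version that created them, and `joblib` files should only be loaded from trusted sources because loading can execute arbitrary code.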

In [None]:
def save_trained_models(trainer, output_dir='trained_models'):
    """Save trained models and create model registry"""
    print(f"💾 Saving trained models to {output_dir}/...")
    
    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    
    saved_models = []
    model_registry = []
    
    # Save best models
    for dataset_name, best_result in trainer.best_models.items():
        model_name = best_result['model_name'].lower().replace(' ', '_')
        filename = f"{dataset_name}_{model_name}.joblib"
        filepath = output_path / filename
        
        # Save model
        joblib.dump(best_result['model'], filepath)
        saved_models.append(filepath)
        
        # Create registry entry
        registry_entry = {
            'dataset': dataset_name,
            'model_name': best_result['model_name'],
            'algorithm': best_result['model_name'],
            'filename': filename,
            'filepath': str(filepath),
            'best_params': best_result['best_params'],
            'cv_score': best_result['cv_score'],
            'created_date': datetime.now().isoformat(),
            'model_size_mb': filepath.stat().st_size / (1024 * 1024) if filepath.exists() else 0
        }
        
        # Add performance metrics
        if 'accuracy' in best_result:
            registry_entry.update({
                'task_type': 'classification',
                'accuracy': best_result['accuracy'],
                'precision': best_result['precision'],
                'recall': best_result['recall'],
                'f1_score': best_result['f1_score']
            })
        else:
            registry_entry.update({
                'task_type': 'regression',
                'r2_score': best_result['r2_score'],
                'rmse': best_result['rmse'],
                'mae': best_result['mae']
            })
        
        model_registry.append(registry_entry)
        
        print(f"   ✅ Saved {best_result['model_name']} for {dataset_name}: {filepath}")
    
    # Save all models (not just best ones)
    print(f"\n💾 Saving all trained models...")
    all_models_saved = 0
    
    for dataset_name, results in trainer.results.items():
        for result in results:
            model_name = result['model_name'].lower().replace(' ', '_')
            filename = f"{dataset_name}_{model_name}_all.joblib"
            filepath = output_path / filename
            
            # Save model
            joblib.dump(result['model'], filepath)
            all_models_saved += 1
    
    print(f"   ✅ Saved {all_models_saved} models in total (best models included)")
    
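    # Save model registry
    # default=str keeps values json cannot encode (e.g. NumPy integers in best_params) from raising
    registry_file = output_path / 'model_registry.json'
    with open(registry_file, 'w', encoding='utf-8') as f:
        json.dump(model_registry, f, indent=2, default=str)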
    
    print(f"   ✅ Model registry saved: {registry_file}")
    
    # Create training summary
    summary_file = output_path / 'training_summary.md'
    create_training_summary(trainer, summary_file)
    
    print(f"\n🎉 Model saving completed!")
    print(f"📁 Saved {len(saved_models)} best models")
    print(f"📁 Saved {all_models_saved} total models")
    print(f"📄 Created model registry and summary")
    
    return saved_models, model_registry

def create_training_summary(trainer, output_file):
    """Create a comprehensive training summary report"""
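    # Write as UTF-8: the report contains emoji and "R²", which fail under some platform default encodings
    with open(output_file, 'w', encoding='utf-8') as f: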
        f.write("# Model Training Summary Report\n\n")
        f.write(f"Generated on: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n\n")
        
        # Overall summary
        f.write("## 🎯 Training Overview\n\n")
        f.write(f"- **Datasets trained**: {len(trainer.results)}\n")
        total_models = sum(len(results) for results in trainer.results.values())
        f.write(f"- **Total models trained**: {total_models}\n")
        f.write(f"- **Best models selected**: {len(trainer.best_models)}\n\n")
        
        # Dataset-specific results
        for dataset_name, results in trainer.results.items():
            if not results:
                continue  # no models were trained for this dataset
            f.write(f"## 📊 {dataset_name.title()} Results\n\n")
            
            # Task type (classification results carry an 'accuracy' key)
            task_type = 'Classification' if 'accuracy' in results[0] else 'Regression'
            f.write(f"**Task Type**: {task_type}\n\n")
            
            # Best model
            if dataset_name in trainer.best_models:
                best = trainer.best_models[dataset_name]
                f.write(f"**🏆 Best Model**: {best['model_name']}\n")
                
                if task_type == 'Classification':
                    f.write(f"- **Accuracy**: {best['accuracy']:.4f}\n")
                    f.write(f"- **Precision**: {best['precision']:.4f}\n")
                    f.write(f"- **Recall**: {best['recall']:.4f}\n")
                    f.write(f"- **F1-Score**: {best['f1_score']:.4f}\n")
                else:
                    f.write(f"- **R² Score**: {best['r2_score']:.4f}\n")
                    f.write(f"- **RMSE**: {best['rmse']:.4f}\n")
                    f.write(f"- **MAE**: {best['mae']:.4f}\n")
                
                f.write(f"- **Cross-Validation Score**: {best['cv_score']:.4f}\n")
                f.write(f"- **Best Parameters**: {best['best_params']}\n\n")
            
            # All models performance
            f.write("### All Models Performance\n\n")
            f.write("| Rank | Model | ")
            
            if task_type == 'Classification':
                f.write("Accuracy | Precision | Recall | F1-Score |\n")
                f.write("|------|-------|----------|-----------|--------|----------|\n")
                
                sorted_results = sorted(results, key=lambda x: x['accuracy'], reverse=True)
                for i, result in enumerate(sorted_results, 1):
                    f.write(f"| {i} | {result['model_name']} | {result['accuracy']:.4f} | "
                           f"{result['precision']:.4f} | {result['recall']:.4f} | {result['f1_score']:.4f} |\n")
            else:
                f.write("R² Score | RMSE | MAE |\n")
                f.write("|------|-------|----------|------|-----|\n")
                
                sorted_results = sorted(results, key=lambda x: x['r2_score'], reverse=True)
                for i, result in enumerate(sorted_results, 1):
                    f.write(f"| {i} | {result['model_name']} | {result['r2_score']:.4f} | "
                           f"{result['rmse']:.4f} | {result['mae']:.4f} |\n")
            
            f.write("\n")
        
        # Target achievement
        f.write("## 🎯 Target Achievement\n\n")
        
        if 'titanic' in trainer.best_models:
            titanic_acc = trainer.best_models['titanic']['accuracy']
            target_acc = 0.894
            achievement = "✅ ACHIEVED" if titanic_acc >= target_acc else "⚠️ BELOW TARGET"
            f.write(f"- **Titanic Classification Target**: {target_acc:.1%} accuracy\n")
            f.write(f"- **Achieved**: {titanic_acc:.1%} ({achievement})\n\n")
        
        if 'housing' in trainer.best_models:
            housing_r2 = trainer.best_models['housing']['r2_score']
            target_r2 = 0.681
            achievement = "✅ ACHIEVED" if housing_r2 >= target_r2 else "⚠️ BELOW TARGET"
            f.write(f"- **Housing Regression Target**: R² = {target_r2:.3f}\n")
            f.write(f"- **Achieved**: R² = {housing_r2:.3f} ({achievement})\n\n")
    
    print(f"   ✅ Training summary saved: {output_file}")

# Save models
if trainer.best_models:
    saved_models, model_registry = save_trained_models(trainer)
else:
    print("⚠️ No trained models to save")
    saved_models, model_registry = [], []
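
To use a saved model later, the registry can be read back to locate the right file. The sketch below assumes the registry layout written by `save_trained_models` above (a JSON list of entries with `dataset` and `filename` keys). The helper names `find_registry_entry` and `load_registered_model` are hypothetical and not part of the tutorial's `trainer` API.

```python
import json
from pathlib import Path


def find_registry_entry(registry_path, dataset):
    """Return the registry entry for `dataset`, or None if it is not listed."""
    with open(registry_path, encoding="utf-8") as f:
        registry = json.load(f)
    for entry in registry:
        if entry.get("dataset") == dataset:
            return entry
    return None


def load_registered_model(registry_path, dataset):
    """Load the best model for `dataset` from the registry's directory."""
    import joblib  # imported lazily so the lookup helper works without it

    entry = find_registry_entry(registry_path, dataset)
    if entry is None:
        raise KeyError(f"No model registered for dataset {dataset!r}")
    # Resolve relative to the registry file so the call works from any cwd
    model_path = Path(registry_path).parent / entry["filename"]
    return joblib.load(model_path)
```

For example, `model = load_registered_model('trained_models/model_registry.json', 'titanic')` followed by `model.predict(X_new)`, where `X_new` must have the same engineered feature columns used during training. Only load joblib files from sources you trust, because unpickling can execute arbitrary code.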

## 📊 Final Results Summary

Let's create a comprehensive summary of our training results and achievements.

In [None]:
# Create final results summary
if trainer.results:
    print("📊 FINAL TRAINING RESULTS SUMMARY")
    print("=" * 70)
    
    # Create and display summary DataFrame
    summary_df = trainer.create_results_summary()
    
    # Achievement analysis
    print("\n🎯 TARGET ACHIEVEMENT ANALYSIS")
    print("=" * 40)
    
    targets = {
        'titanic': {'metric': 'accuracy', 'target': 0.894, 'name': 'Titanic Classification'},
        'housing': {'metric': 'r2_score', 'target': 0.681, 'name': 'Housing Regression'}
    }
    
    achievements = []
    
    for dataset_name, target_info in targets.items():
        if dataset_name in trainer.best_models:
            best_model = trainer.best_models[dataset_name]
            achieved_score = best_model.get(target_info['metric'], 0)
            target_score = target_info['target']
            
            achievement_pct = (achieved_score / target_score) * 100
            if achieved_score >= target_score:
                status = "✅ ACHIEVED"
            elif achievement_pct >= 95:
                status = "⚠️ CLOSE"
            else:
                status = "❌ NEEDS WORK"
            
            print(f"\n📊 {target_info['name']}:")
            print(f"   🎯 Target: {target_score:.3f}")
            print(f"   🏆 Achieved: {achieved_score:.3f}")
            print(f"   📈 Achievement: {achievement_pct:.1f}% - {status}")
            print(f"   🤖 Best Model: {best_model['model_name']}")
            
            achievements.append({
                'Dataset': target_info['name'],
                'Target': target_score,
                'Achieved': achieved_score,
                'Achievement_%': achievement_pct,
                'Status': status,
                'Best_Model': best_model['model_name']
            })
    
    # Overall statistics
    print("\n📈 OVERALL TRAINING STATISTICS")
    print("=" * 35)
    
    total_models = sum(len(results) for results in trainer.results.values())
    successful_datasets = len([d for d in achievements if 'ACHIEVED' in d['Status']])
    
    print(f"📊 Datasets processed: {len(trainer.results)}")
    print(f"🤖 Total models trained: {total_models}")
    print(f"🏆 Best models selected: {len(trainer.best_models)}")
    print(f"✅ Targets achieved: {successful_datasets}/{len(achievements)}")
    print(f"💾 Models saved: {len(saved_models)}")
    
    # Model performance ranges
    if summary_df is not None and not summary_df.empty:
        print(f"\n📊 Performance Ranges:")
        for dataset in summary_df['Dataset'].unique():
            dataset_results = summary_df[summary_df['Dataset'] == dataset]
            min_score = dataset_results['Primary_Metric'].min()
            max_score = dataset_results['Primary_Metric'].max()
            avg_score = dataset_results['Primary_Metric'].mean()
            
            metric_name = 'Accuracy' if dataset_results['Task_Type'].iloc[0] == 'Classification' else 'R² Score'
            print(f"   {dataset}: {metric_name} range [{min_score:.3f} - {max_score:.3f}], avg: {avg_score:.3f}")
    
    print("\n🎉 MODEL TRAINING COMPLETED SUCCESSFULLY!")
    print("📁 Check the 'trained_models/' directory for saved models")
    print("📄 Check 'trained_models/training_summary.md' for detailed report")
    
else:
    print("⚠️ No training results available for final summary")
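
The status rule used in the cell above can be isolated as a small function, which makes the thresholds easy to check. This is only a sketch: `classify_achievement` and its `close_pct` parameter are hypothetical names, and the 95% cutoff mirrors the value hard-coded above.

```python
def classify_achievement(achieved, target, close_pct=95.0):
    """Return (percent_of_target, status) for a higher-is-better metric."""
    if target <= 0:
        raise ValueError("target must be positive")
    pct = achieved / target * 100
    if achieved >= target:
        status = "ACHIEVED"
    elif pct >= close_pct:
        status = "CLOSE"
    else:
        status = "NEEDS WORK"
    return pct, status


# Illustrative inputs only, not results from this notebook
for achieved, target in [(0.90, 0.894), (0.85, 0.894), (0.50, 0.681)]:
    pct, status = classify_achievement(achieved, target)
    print(f"{achieved:.3f} vs {target:.3f}: {pct:.1f}% -> {status}")
```

This ratio only makes sense for higher-is-better metrics such as accuracy and R². For RMSE or MAE the comparison would need to be inverted.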

## 🎉 Congratulations!

You've successfully completed the model training tutorial! Here's what you covered:

✅ **Multi-Algorithm Training**: Trained 6+ different algorithms  
✅ **Hyperparameter Tuning**: Optimized model parameters with GridSearchCV  
✅ **Cross-Validation**: Robust model evaluation with 5-fold CV  
✅ **Performance Metrics**: Comprehensive evaluation (accuracy, precision, recall, F1, R², RMSE, MAE)  
✅ **Model Comparison**: Systematic comparison and selection  
✅ **Model Persistence**: Saved models for production use  

### 🏆 What We Achieved

**🚢 Titanic Classification:**
- **Target**: 89.4% accuracy
- **Best Model**: Logistic Regression (or your best performing model)
- **Algorithms Tested**: Random Forest, Logistic Regression, SVM, Gradient Boosting, KNN, Naive Bayes
- **Features Used**: Engineered features from previous tutorial

**🏠 Housing Regression:**
- **Target**: R² = 0.681
- **Best Model**: Linear Regression (or your best performing model)
- **Algorithms Tested**: Random Forest, Linear Regression, Ridge, Lasso, SVR, Gradient Boosting
- **Features Used**: Engineered features from previous tutorial

### 🔧 Key Techniques Mastered

1. **Comprehensive Model Training**: Multiple algorithms with proper evaluation
2. **Hyperparameter Optimization**: GridSearchCV for optimal parameters
3. **Cross-Validation**: Robust performance estimation
4. **Model Selection**: Data-driven best model identification
5. **Performance Visualization**: Clear comparison charts and confusion matrices
6. **Model Persistence**: Professional model saving and registry

### 🚀 Next Tutorial
In the next notebook (`04_experiment_tracking.ipynb`), we'll learn to:
- Set up MLflow experiment tracking
- Log parameters, metrics, and artifacts
- Compare experiments across runs
- Create experiment dashboards
- Manage model versions

### 💡 Practice Exercises
Try these exercises to reinforce your learning:
1. Add more algorithms (XGBoost, LightGBM)
2. Implement RandomizedSearchCV for faster tuning
3. Create ensemble models combining best performers
4. Experiment with different cross-validation strategies
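
As a starting point for exercise 2, here is a minimal sketch of `RandomizedSearchCV`. It uses a synthetic dataset from `make_classification` rather than the tutorial's Titanic features, and the parameter ranges, `n_iter=5`, and `cv=3` are illustrative assumptions chosen to keep the run short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; swap in your engineered features and labels
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Candidate values to sample from (lists are sampled uniformly)
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,            # evaluate 5 sampled combinations instead of all 27
    cv=3,
    scoring="accuracy",
    random_state=42,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.3f}")
```

Because only `n_iter` combinations are evaluated, the cost stays fixed as the search space grows. The trade-off is that the best combination in the grid may never be sampled.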

### 📁 Files Created
Your trained models and reports are saved in:
- `trained_models/` - Best models as `<dataset>_<model>.joblib`, plus every trained model as `<dataset>_<model>_all.joblib`
- `trained_models/model_registry.json` - Model metadata
- `trained_models/training_summary.md` - Detailed report

These models can be reloaded with `joblib.load` for inference. Validate them on fresh data before deploying to production! 🎊
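
To confirm what was written, a quick directory listing is enough. This is a small sketch: `list_saved_outputs` is a hypothetical helper, and it assumes the default `trained_models` output directory used above.

```python
from pathlib import Path


def list_saved_outputs(output_dir="trained_models"):
    """Return sorted (filename, size_in_kb) pairs for files in the output directory."""
    output_path = Path(output_dir)
    if not output_path.is_dir():
        return []  # nothing has been saved yet
    return sorted(
        (p.name, p.stat().st_size / 1024)
        for p in output_path.iterdir()
        if p.is_file()
    )


for name, size_kb in list_saved_outputs():
    print(f"{name}: {size_kb:.1f} KB")
```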

---

**🎯 Ready for Experiment Tracking?**  
Run: `jupyter notebook notebooks/04_experiment_tracking.ipynb`