# Advanced Data Generation Showcase

This notebook demonstrates the synthetic data generation capabilities of the sklearn-mastery project. We build datasets designed to expose the strengths, weaknesses, and best use cases of different machine learning algorithms across a range of data characteristics and difficulty levels.

## Table of Contents
1. [Setup and Imports](#setup)
2. [Results Management System](#results)
3. [Linear Regression Dataset Spectrum](#linear-regression)
4. [Classification Complexity Hierarchy](#classification)
5. [Clustering Pattern Showcase](#clustering)
6. [Special Purpose Datasets](#special)
7. [Comparative Algorithm Analysis](#analysis)
8. [Interactive Dataset Explorer](#explorer)
9. [Performance Benchmarking](#benchmarking)
10. [Comprehensive Results Saving](#saving)

## 1. Setup and Imports {#setup}

In [None]:
# Standard imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.metrics import accuracy_score, r2_score, silhouette_score, classification_report
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, BaggingClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y, check_array
from sklearn.utils.multiclass import unique_labels
import time
import warnings
warnings.filterwarnings('ignore')

# Results management imports
import os
from pathlib import Path
import joblib
import datetime
import json

In [None]:
# Project imports
import sys
sys.path.append('../src')

from data.generators import SyntheticDataGenerator
from pipelines.pipeline_factory import PipelineFactory
from evaluation.metrics import ModelEvaluator
from evaluation.visualization import ModelVisualizationSuite
from utils.helpers import DataUtils

print("✅ Advanced Data Generation Showcase Setup Complete!")

In [None]:
# Configure plotting for premium appearance
plt.style.use('seaborn-v0_8')
plt.rcParams['figure.figsize'] = (14, 8)
plt.rcParams['axes.grid'] = True
plt.rcParams['grid.alpha'] = 0.3
plt.rcParams['font.size'] = 10
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['axes.labelsize'] = 11
sns.set_palette('husl')

print(f"📊 Matplotlib backend: {plt.get_backend()}")
print(f"🎨 Color palette: {sns.color_palette().as_hex()}")
print("🎨 Plotting configuration optimized for premium visualizations!")

## 2. Results Management System {#results}

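This section defines helpers that save figures, models, experiment results, and reports under a shared `results/` directory. Every file name carries a timestamp, and figures and models also get a JSON metadata file.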

In [None]:
# Enhanced Results Management System
def setup_results_directories():
    """Create comprehensive results directory structure."""
    base_dir = Path(__file__).parent.parent if '__file__' in globals() else Path.cwd().parent
    results_dir = base_dir / 'results'
    
    # Create comprehensive subdirectories
    directories = [
        'figures', 'models', 'reports', 'experiments',
        'data_generation', 'algorithm_analysis', 'benchmarks',
        'interactive_analysis', 'comparative_studies'
    ]
    
    for directory in directories:
        (results_dir / directory).mkdir(parents=True, exist_ok=True)
        print(f"📁 Created/verified: {results_dir / directory}")
    
    return results_dir

def get_timestamp():
    """Get formatted timestamp for file naming."""
    return datetime.datetime.now().strftime("%Y%m%d_%H%M%S")

def save_figure(fig, name, description="", category="general", dpi=300):
    """Save figure with comprehensive metadata."""
    timestamp = get_timestamp()
    filename = f"{timestamp}_data_generation_{category}_{name}.png"
    filepath = results_dir / 'figures' / filename
    
    # Save figure
    fig.savefig(filepath, dpi=dpi, bbox_inches='tight', facecolor='white')
    
    # Save metadata
    metadata = {
        'filename': filename,
        'description': description,
        'category': category,
        'timestamp': timestamp,
        'notebook': '01_data_generation_showcase',
        'dpi': dpi,
        'figure_size': fig.get_size_inches().tolist()
    }
    
    metadata_file = results_dir / 'figures' / f"{timestamp}_data_generation_{category}_{name}_metadata.json"
    with open(metadata_file, 'w') as f:
        json.dump(metadata, f, indent=2)
    
    print(f"💾 Figure saved: {filepath}")
    return filepath

def save_model(model, name, description="", category="general", metadata=None):
    """Save model with comprehensive metadata."""
    timestamp = get_timestamp()
    filename = f"{timestamp}_data_generation_{category}_{name}.joblib"
    filepath = results_dir / 'models' / filename
    
    # Save model
    joblib.dump(model, filepath, compress=3)
    
    # Create comprehensive metadata
    model_metadata = {
        'filename': filename,
        'model_name': name,
        'description': description,
        'category': category,
        'timestamp': timestamp,
        'notebook': '01_data_generation_showcase',
        'model_type': model.__class__.__name__ if hasattr(model, '__class__') else str(type(model)),
        'model_params': model.get_params() if hasattr(model, 'get_params') else {},
        'file_size_mb': filepath.stat().st_size / (1024*1024) if filepath.exists() else 0
    }
    
    if metadata:
        model_metadata.update(metadata)
    
    metadata_file = results_dir / 'models' / f"{timestamp}_data_generation_{category}_{name}_metadata.json"
    with open(metadata_file, 'w') as f:
        json.dump(model_metadata, f, indent=2, default=str)
    
    print(f"💾 Model saved: {filepath}")
    return filepath

def save_experiment_results(experiment_name, results, description="", category="general"):
    """Save experiment results with detailed configuration."""
    timestamp = get_timestamp()
    filename = f"{timestamp}_data_generation_{category}_{experiment_name}.json"
    filepath = results_dir / 'experiments' / filename
    
    experiment_data = {
        'experiment_name': experiment_name,
        'description': description,
        'category': category,
        'timestamp': timestamp,
        'notebook': '01_data_generation_showcase',
        'results': results,
        'system_info': {
            'python_version': sys.version,
            'numpy_version': np.__version__,
            'pandas_version': pd.__version__
        }
    }
    
    with open(filepath, 'w') as f:
        json.dump(experiment_data, f, indent=2, default=str)
    
    print(f"💾 Experiment results saved: {filepath}")
    return filepath

def save_report(content, name, description="", category="general", format='txt'):
    """Save comprehensive analysis report."""
    timestamp = get_timestamp()
    filename = f"{timestamp}_data_generation_{category}_{name}.{format}"
    filepath = results_dir / 'reports' / filename
    
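    if format == 'txt':
        with open(filepath, 'w', encoding='utf-8') as f:
            f.write(content)
    elif format == 'json':
        with open(filepath, 'w', encoding='utf-8') as f:
            json.dump(content, f, indent=2, default=str)
    else:
        # Fail loudly instead of reporting a save that never happened
        raise ValueError(f"Unsupported report format: {format!r} (expected 'txt' or 'json')")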
    
    print(f"💾 Report saved: {filepath}")
    return filepath

# Initialize results directories
results_dir = setup_results_directories()
print(f"\n📊 Results will be saved to: {results_dir}")
print("🔧 Results management system initialized!")
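
Because every artifact name starts with a `YYYYMMDD_HHMMSS` timestamp, the newest file in a category sorts last as a plain string. The sketch below shows one way to locate it and read its metadata file. The helper `find_latest_artifact` and the temporary directory are hypothetical and not part of the project; the file names only imitate the pattern used by `save_figure`.

```python
import json
import tempfile
from pathlib import Path


def find_latest_artifact(directory, suffix):
    """Return the newest file ending in `suffix`, or None (hypothetical helper)."""
    # Timestamp-prefixed names sort chronologically as plain strings
    candidates = sorted(p for p in Path(directory).iterdir() if p.name.endswith(suffix))
    return candidates[-1] if candidates else None


# Build a throwaway directory that imitates results/figures
demo_dir = Path(tempfile.mkdtemp())
for stamp in ("20240101_090000", "20240301_120000", "20240215_183000"):
    stem = f"{stamp}_data_generation_regression_demo"
    (demo_dir / f"{stem}.png").write_bytes(b"")  # placeholder image
    (demo_dir / f"{stem}_metadata.json").write_text(json.dumps({"timestamp": stamp}))

latest_meta = find_latest_artifact(demo_dir, "_metadata.json")
latest_info = json.loads(latest_meta.read_text())
print(latest_meta.name, latest_info["timestamp"])
```

Relying on the name rather than the file modification time keeps the lookup stable when results are copied between machines.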

## 3. Linear Regression Dataset Spectrum {#linear-regression}

We start with four regression datasets, each built around one challenge (multicollinearity, high dimensionality, nonlinearity, and outliers), to show where the regression algorithms behave differently and which one suits each case.

In [None]:
# Initialize advanced data generator
print("🔄 Initializing Advanced Synthetic Data Generator...")

generator = SyntheticDataGenerator(random_state=42)

# Create regression dataset spectrum
print("\n🎯 Generating Regression Dataset Spectrum...")

regression_datasets = {}

# Dataset 1: Perfect for showcasing multicollinearity effects
print("  Creating multicollinearity demonstration dataset...")
X_collinear, y_collinear, true_coef = generator.regression_with_collinearity(
    n_samples=1200,
    n_features=25,
    collinear_groups=[(0, 1, 2, 3), (7, 8, 9), (15, 16, 17, 18, 19)],
    noise_variance=0.15,
    coefficient_sparsity=0.4
)

regression_datasets['Multicollinear'] = {
    'X': X_collinear, 'y': y_collinear, 'true_coef': true_coef,
    'description': 'High multicollinearity with sparse true coefficients',
    'optimal_algorithm': 'Ridge Regression',
    'challenge': 'multicollinearity'
}

print(f"    Generated: {X_collinear.shape}, True non-zero coefficients: {np.sum(true_coef != 0)}/{len(true_coef)}")

# Dataset 2: High-dimensional sparse regression
print("  Creating high-dimensional sparse regression dataset...")
X_sparse, y_sparse, coef_sparse = generator.high_dimensional_regression(
    n_samples=800,
    n_features=100,
    n_informative=12,
    noise_variance=0.1
)

regression_datasets['High_Dimensional'] = {
    'X': X_sparse, 'y': y_sparse, 'true_coef': coef_sparse,
    'description': 'High-dimensional with very sparse coefficients',
    'optimal_algorithm': 'Lasso Regression',
    'challenge': 'high_dimensionality'
}

print(f"    Generated: {X_sparse.shape}, Active features: {np.sum(coef_sparse != 0)}")

# Dataset 3: Nonlinear relationships for comparison
print("  Creating nonlinear regression dataset...")
X_nonlinear, y_nonlinear = generator.nonlinear_regression(
    n_samples=1000,
    n_features=8,
    nonlinearity_type='polynomial',
    noise_variance=0.2
)

regression_datasets['Nonlinear'] = {
    'X': X_nonlinear, 'y': y_nonlinear, 'true_coef': None,
    'description': 'Polynomial nonlinear relationships',
    'optimal_algorithm': 'Random Forest',
    'challenge': 'nonlinearity'
}

print(f"    Generated: {X_nonlinear.shape}")

# Dataset 4: Robust regression scenario (with outliers)
print("  Creating robust regression dataset with outliers...")
X_robust, y_robust, coef_robust = generator.regression_with_outliers(
    n_samples=900,
    n_features=15,
    outlier_fraction=0.1,
    outlier_strength=3.0
)

regression_datasets['With_Outliers'] = {
    'X': X_robust, 'y': y_robust, 'true_coef': coef_robust,
    'description': 'Clean data corrupted with 10% strong outliers',
    'optimal_algorithm': 'Huber Regression',
    'challenge': 'outliers'
}

print(f"    Generated: {X_robust.shape}, Outlier fraction: 10%")

print("\n✨ Regression dataset spectrum generation complete!")

# Save regression datasets metadata
regression_metadata = {
    'total_datasets': len(regression_datasets),
    'dataset_details': {name: {
        'shape': info['X'].shape,
        'description': info['description'],
        'optimal_algorithm': info.get('optimal_algorithm', 'Unknown'),
        'challenge': info.get('challenge', 'general')
    } for name, info in regression_datasets.items()}
}

save_experiment_results('regression_datasets_generation', regression_metadata,
                       'Comprehensive regression dataset generation results', 'regression')
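
The internals of `SyntheticDataGenerator` are not shown in this notebook. As a rough illustration of what a collinear group means, the sketch below builds one with plain NumPy by turning selected columns into noisy copies of a base column. The function `make_collinear_regression` and all of its parameters are assumptions made for this example, not the project's API.

```python
import numpy as np


def make_collinear_regression(n_samples=200, n_features=6, group=(0, 1, 2),
                              noise_std=0.1, seed=0):
    """Toy regression data with one group of nearly identical features."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))

    # Overwrite the rest of the group with noisy copies of its first feature
    base = X[:, group[0]]
    for j in group[1:]:
        X[:, j] = base + 0.05 * rng.normal(size=n_samples)

    # Sparse ground truth: only two features drive the target
    coef = np.zeros(n_features)
    coef[[group[0], n_features - 1]] = [2.0, -1.5]
    y = X @ coef + noise_std * rng.normal(size=n_samples)
    return X, y, coef


X_demo, y_demo, coef_demo = make_collinear_regression()
corr = np.corrcoef(X_demo.T)
print(f"corr(f0, f1)={corr[0, 1]:.3f}, corr(f0, f4)={corr[0, 4]:.3f}")
```

Features inside the group correlate almost perfectly, while features outside it stay close to uncorrelated, which is the pattern the correlation heatmap in the next cell is meant to reveal.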

In [None]:
# Analyze regression dataset characteristics
print("📊 Analyzing Regression Dataset Characteristics...")

fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for idx, (dataset_name, dataset_info) in enumerate(regression_datasets.items()):
    if idx >= 4:  # Only plot first 4 datasets
        break
        
    X, y = dataset_info['X'], dataset_info['y']
    ax = axes[idx]
    
    # For multicollinearity dataset, show correlation heatmap
    if dataset_name == 'Multicollinear':
        corr_matrix = np.corrcoef(X.T)
        # Show only a subset for clarity
        subset_size = min(15, X.shape[1])
        corr_subset = corr_matrix[:subset_size, :subset_size]
        
        im = ax.imshow(corr_subset, cmap='RdBu_r', vmin=-1, vmax=1)
        ax.set_title(f'{dataset_name} Dataset\nFeature Correlation Matrix')
        plt.colorbar(im, ax=ax, shrink=0.8)
        
    # For high-dimensional, show coefficient importance
    elif dataset_name == 'High_Dimensional' and dataset_info['true_coef'] is not None:
        coef = dataset_info['true_coef']
        non_zero_idx = np.where(np.abs(coef) > 1e-6)[0]
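        ax.bar(range(len(non_zero_idx)), coef[non_zero_idx])
        # Label each bar with its original feature index, not its rank
        ax.set_xticks(range(len(non_zero_idx)))
        ax.set_xticklabels(non_zero_idx, rotation=90)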
        ax.set_title(f'{dataset_name} Dataset\nTrue Non-Zero Coefficients')
        ax.set_xlabel('Feature Index')
        ax.set_ylabel('Coefficient Value')
        
    # For other datasets, show target distribution and feature relationship
    else:
        if X.shape[1] >= 2:
            scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.6, s=20)
            ax.set_xlabel('Feature 1')
            ax.set_ylabel('Feature 2')
            plt.colorbar(scatter, ax=ax, shrink=0.8)
        else:
            ax.hist(y, bins=30, alpha=0.7, edgecolor='black')
            ax.set_xlabel('Target Value')
            ax.set_ylabel('Frequency')
        
        ax.set_title(f'{dataset_name} Dataset\n{dataset_info["description"]}')
    
    ax.grid(True, alpha=0.3)

plt.tight_layout()

# Save regression characteristics figure
save_figure(fig, 'regression_dataset_characteristics',
           'Analysis of different regression dataset types and their properties', 'regression')
plt.show()

# Summary statistics
print("\n📈 Dataset Summary Statistics:")
print("=" * 80)
print(f"{'Dataset':<18} {'Samples':<8} {'Features':<10} {'Target Range':<15} {'Description':<25}")
print("=" * 80)

for name, info in regression_datasets.items():
    X, y = info['X'], info['y']
    target_range = f"{y.min():.2f} to {y.max():.2f}"
    # Truncate to 22 characters plus "..." so the text fits the 25-character column
    description = info['description'][:22] + "..." if len(info['description']) > 25 else info['description']
    
    print(f"{name:<18} {X.shape[0]:<8} {X.shape[1]:<10} {target_range:<15} {description:<25}")

print("=" * 80)
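
A correlation heatmap shows pairwise relationships only. A common complementary diagnostic is the variance inflation factor (VIF), which for feature j equals 1 / (1 - R_j²), where R_j² comes from regressing feature j on all the others. The sketch below uses the fact that the VIFs are the diagonal of the inverse correlation matrix. The data and the name `variance_inflation_factors` are illustrative and not part of the project.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF per feature, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
base = rng.normal(size=500)
X_vif = np.column_stack([
    base,                               # feature 0
    base + 0.1 * rng.normal(size=500),  # feature 1: near-copy of feature 0
    rng.normal(size=500),               # feature 2: independent
])
vif = variance_inflation_factors(X_vif)
print(np.round(vif, 1))
```

Values near 1 indicate little collinearity, and values above roughly 10 are often treated as a warning sign. With exactly duplicated columns the correlation matrix is singular and this inversion fails, so the sketch is only suitable for near-collinear data.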

In [None]:
# Enhanced coefficient analysis for regression datasets
print("🔍 Enhanced Coefficient Analysis...")

def analyze_coefficient_patterns(datasets, algorithm_results):
    """Analyze coefficient patterns across different regression algorithms."""
    coefficient_analysis = {}
    
    for dataset_name, dataset_info in datasets.items():
        if 'true_coef' in dataset_info and dataset_info['true_coef'] is not None:
            true_coef = dataset_info['true_coef']
            
            if dataset_name in algorithm_results:
                dataset_coef_analysis = {
                    'true_coefficients': {
                        # Cast NumPy scalars so json.dump writes numbers, not strings
                        'non_zero_count': int(np.sum(np.abs(true_coef) > 1e-6)),
                        'sparsity': float(1 - (np.sum(np.abs(true_coef) > 1e-6) / len(true_coef))),
                        'magnitude_range': [float(np.min(true_coef)), float(np.max(true_coef))],
                        'l1_norm': float(np.sum(np.abs(true_coef))),
                        'l2_norm': float(np.sqrt(np.sum(true_coef ** 2)))
                    },
                    'algorithm_coefficients': {}
                }
                
                # Analyze learned coefficients for each algorithm
                for alg_name, metrics in algorithm_results[dataset_name].items():
                    if 'model' in metrics and hasattr(metrics['model'], 'coef_'):
                        learned_coef = metrics['model'].coef_
                        
                        # Calculate coefficient recovery metrics
                        correlation = np.corrcoef(true_coef, learned_coef)[0, 1] if len(learned_coef) == len(true_coef) else 0
                        mse = np.mean((true_coef - learned_coef) ** 2) if len(learned_coef) == len(true_coef) else float('inf')
                        
                        # Feature selection accuracy
                        true_support = np.abs(true_coef) > 1e-6
                        learned_support = np.abs(learned_coef) > 1e-6
                        
                        if len(true_support) == len(learned_support):
                            precision = np.sum(true_support & learned_support) / np.sum(learned_support) if np.sum(learned_support) > 0 else 0
                            recall = np.sum(true_support & learned_support) / np.sum(true_support) if np.sum(true_support) > 0 else 0
                            f1_feature = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
                        else:
                            precision = recall = f1_feature = 0
                        
                        dataset_coef_analysis['algorithm_coefficients'][alg_name] = {
                            'correlation_with_true': float(correlation),
                            'mse_with_true': float(mse),
                            'feature_selection_precision': float(precision),
                            'feature_selection_recall': float(recall),
                            'feature_selection_f1': float(f1_feature),
                            'learned_sparsity': float(1 - (np.sum(np.abs(learned_coef) > 1e-6) / len(learned_coef))),
                            'l1_norm': float(np.sum(np.abs(learned_coef))),
                            'l2_norm': float(np.sqrt(np.sum(learned_coef ** 2)))
                        }
                
                coefficient_analysis[dataset_name] = dataset_coef_analysis
    
    return coefficient_analysis

# Perform coefficient analysis
if 'regression_datasets' in globals() and 'algorithm_performance' in globals():
    coef_analysis = analyze_coefficient_patterns(regression_datasets, algorithm_performance)
    
    # Display coefficient recovery results
    print("\n📊 Coefficient Recovery Analysis:")
    print("=" * 80)
    
    for dataset_name, analysis in coef_analysis.items():
        print(f"\n🔍 {dataset_name}:")
        print(f"  True coefficients: {analysis['true_coefficients']['non_zero_count']} non-zero, "
              f"sparsity: {analysis['true_coefficients']['sparsity']:.3f}")
        
        print("  Algorithm Performance:")
        for alg_name, metrics in analysis['algorithm_coefficients'].items():
            print(f"    {alg_name:<20}: Corr={metrics['correlation_with_true']:.3f}, "
                  f"F1={metrics['feature_selection_f1']:.3f}, "
                  f"Sparsity={metrics['learned_sparsity']:.3f}")
    
    # Save coefficient analysis
    save_experiment_results('coefficient_recovery_analysis', coef_analysis,
                           'Detailed coefficient recovery and feature selection analysis', 'regression')

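    print("✨ Enhanced coefficient analysis complete!")
else:
    # algorithm_performance is built in the algorithm comparison cell further down
    print("⚠️ Coefficient analysis skipped: run the algorithm comparison cell below, then re-run this cell.")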
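
The feature-selection metrics in `analyze_coefficient_patterns` treat the set of non-zero coefficients as a "support" and compare the learned support with the true one. A small worked example with made-up coefficient vectors (not taken from any dataset in this notebook) shows the arithmetic:

```python
import numpy as np

true_coef_demo = np.array([1.5, 0.0, 0.0, -2.0, 0.0, 0.7])
learned_coef_demo = np.array([1.2, 0.3, 0.0, -1.8, 0.0, 0.0])

true_support = np.abs(true_coef_demo) > 1e-6        # features 0, 3, 5
learned_support = np.abs(learned_coef_demo) > 1e-6  # features 0, 1, 3

hits = np.sum(true_support & learned_support)       # features 0 and 3
precision = hits / np.sum(learned_support)          # 2 of 3 selected are correct
recall = hits / np.sum(true_support)                # 2 of 3 true features found
f1 = 2 * precision * recall / (precision + recall)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.667 recall=0.667 f1=0.667
```

Here the model picks up one spurious feature (index 1) and misses one true feature (index 5), so precision and recall both drop to 2/3.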

### Demonstrate Regression Algorithm Performance Spectrum

In [None]:
# Compare regression algorithms across dataset types
print("🧪 Comprehensive Regression Algorithm Analysis...")

# Define regression algorithms with their optimal use cases
regression_algorithms = {
    'Linear Regression': {
        'model': LinearRegression(),
        'description': 'Baseline linear model',
        'optimal_for': 'Low multicollinearity, sufficient samples'
    },
    'Ridge Regression': {
        'model': Ridge(alpha=1.0),
        'description': 'L2 regularization',
        'optimal_for': 'Multicollinearity, stable coefficients'
    },
    'Lasso Regression': {
        'model': Lasso(alpha=0.1, max_iter=2000),
        'description': 'L1 regularization',
        'optimal_for': 'Feature selection, sparse solutions'
    },
    'Elastic Net': {
        'model': ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=2000),
        'description': 'Combined L1 + L2',
        'optimal_for': 'Balanced regularization'
    }
}

# Evaluate each algorithm on each dataset
algorithm_performance = {}

print("\n🔍 Evaluating Algorithm Performance Across Dataset Types...")

for dataset_name, dataset_info in regression_datasets.items():
    print(f"\n--- Testing on {dataset_name} Dataset ---")
    
    X, y = dataset_info['X'], dataset_info['y']
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=42
    )
    
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    dataset_results = {}
    
    for alg_name, alg_info in regression_algorithms.items():
        try:
            start_time = time.time()
            
            # Train model
            model = alg_info['model'].__class__(**alg_info['model'].get_params())
            model.fit(X_train_scaled, y_train)
            
            training_time = time.time() - start_time
            
            # Evaluate
            train_score = model.score(X_train_scaled, y_train)
            test_score = model.score(X_test_scaled, y_test)
            
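            # Cross-validation score. The scaler was fit on the full training split,
            # so the folds share its statistics and the estimate is slightly optimistic.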
            cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='r2')
            
            # Count non-zero coefficients (for regularized models)
            if hasattr(model, 'coef_'):
                n_nonzero = np.sum(np.abs(model.coef_) > 1e-6)
                coef_sparsity = 1 - (n_nonzero / len(model.coef_))
            else:
                n_nonzero = 0
                coef_sparsity = 0
            
            dataset_results[alg_name] = {
                'train_r2': train_score,
                'test_r2': test_score,
                'cv_mean': cv_scores.mean(),
                'cv_std': cv_scores.std(),
                'training_time': training_time,
                'n_nonzero_coef': n_nonzero,
                'coef_sparsity': coef_sparsity,
                'model': model
            }
            
            print(f"  {alg_name:<20}: Test R²={test_score:.4f}, CV R²={cv_scores.mean():.4f}±{cv_scores.std():.4f}")
            
        except Exception as e:
            print(f"  {alg_name:<20}: ❌ Failed ({str(e)[:30]}...)")
            dataset_results[alg_name] = {'error': str(e)}
    
    algorithm_performance[dataset_name] = dataset_results

print("\n✨ Algorithm performance evaluation complete!")

# Save regression algorithm performance
regression_performance_summary = {}
for dataset_name, algorithms in algorithm_performance.items():
    regression_performance_summary[dataset_name] = {}
    for alg_name, metrics in algorithms.items():
        if 'error' not in metrics:
            regression_performance_summary[dataset_name][alg_name] = {
                'test_r2': metrics['test_r2'],
                'cv_mean': metrics['cv_mean'],
                'training_time': metrics['training_time'],
                'coef_sparsity': metrics['coef_sparsity']
            }

save_experiment_results('regression_algorithm_performance', regression_performance_summary,
                       'Comprehensive regression algorithm performance analysis', 'regression')
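
The cross-validation call above reuses features that were scaled with statistics from the whole training split. One common way to keep scaling inside each fold is to wrap the scaler and the estimator in a scikit-learn pipeline. The following is a minimal sketch on made-up data; `X_demo`, `y_demo`, and the chosen coefficients are illustrative and unrelated to the notebook's datasets.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200)

# The scaler is refit on the training part of every fold,
# so no statistics from the held-out part reach the model
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
cv_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='r2')
print(f"CV R² = {cv_scores.mean():.4f} ± {cv_scores.std():.4f}")
```

For standardization the difference is usually small, but the same pattern matters more for steps such as feature selection or target-dependent encoding.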

In [None]:
# Visualize comprehensive regression algorithm analysis
print("📊 Creating Comprehensive Regression Analysis Visualization...")

# Prepare data for visualization
performance_data = []
sparsity_data = []
time_data = []

for dataset_name, algorithms in algorithm_performance.items():
    for alg_name, metrics in algorithms.items():
        if 'error' not in metrics:
            performance_data.append({
                'Dataset': dataset_name,
                'Algorithm': alg_name,
                'Test_R2': metrics['test_r2'],
                'CV_Mean': metrics['cv_mean'],
                'CV_Std': metrics['cv_std'],
                'Generalization_Gap': metrics['train_r2'] - metrics['test_r2']
            })
            
            sparsity_data.append({
                'Dataset': dataset_name,
                'Algorithm': alg_name,
                'Coefficient_Sparsity': metrics['coef_sparsity'],
                'N_Nonzero': metrics['n_nonzero_coef']
            })
            
            time_data.append({
                'Dataset': dataset_name,
                'Algorithm': alg_name,
                'Training_Time': metrics['training_time']
            })

performance_df = pd.DataFrame(performance_data)
sparsity_df = pd.DataFrame(sparsity_data)
time_df = pd.DataFrame(time_data)

# Create comprehensive visualization
fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 1. Test performance heatmap
if not performance_df.empty:
    performance_pivot = performance_df.pivot(index='Algorithm', columns='Dataset', values='Test_R2')
    sns.heatmap(performance_pivot, annot=True, cmap='RdYlGn', center=0.5, 
                cbar_kws={'label': 'Test R² Score'}, ax=axes[0, 0], fmt='.3f')
    axes[0, 0].set_title('Algorithm Performance Across Dataset Types')
    axes[0, 0].set_ylabel('Regression Algorithm')

# 2. Generalization analysis
if not performance_df.empty:
    gen_pivot = performance_df.pivot(index='Algorithm', columns='Dataset', values='Generalization_Gap')
    sns.heatmap(gen_pivot, annot=True, cmap='RdBu_r', center=0, 
                cbar_kws={'label': 'Generalization Gap'}, ax=axes[0, 1], fmt='.3f')
    axes[0, 1].set_title('Overfitting Analysis (Train R² - Test R²)')
    axes[0, 1].set_ylabel('Regression Algorithm')

# 3. Feature selection effectiveness
if not sparsity_df.empty:
    sparsity_pivot = sparsity_df.pivot(index='Algorithm', columns='Dataset', values='Coefficient_Sparsity')
    sns.heatmap(sparsity_pivot, annot=True, cmap='viridis', 
                cbar_kws={'label': 'Coefficient Sparsity'}, ax=axes[0, 2], fmt='.3f')
    axes[0, 2].set_title('Feature Selection Effectiveness')
    axes[0, 2].set_ylabel('Regression Algorithm')

# 4. Performance distribution by algorithm
if not performance_df.empty:
    sns.boxplot(data=performance_df, x='Algorithm', y='Test_R2', ax=axes[1, 0])
    axes[1, 0].set_title('Performance Distribution by Algorithm')
    axes[1, 0].set_ylabel('Test R² Score')
    axes[1, 0].tick_params(axis='x', rotation=45)
    axes[1, 0].grid(True, alpha=0.3)

# 5. Training efficiency analysis
if not time_df.empty:
    time_pivot = time_df.pivot(index='Algorithm', columns='Dataset', values='Training_Time')
    sns.heatmap(time_pivot, annot=True, cmap='YlOrRd', 
                cbar_kws={'label': 'Training Time (s)'}, ax=axes[1, 1], fmt='.4f')
    axes[1, 1].set_title('Training Efficiency Comparison')
    axes[1, 1].set_ylabel('Regression Algorithm')

# 6. Performance vs complexity trade-off
if not performance_df.empty and not sparsity_df.empty:
    # Merge dataframes for scatter plot
    merged_df = pd.merge(performance_df, sparsity_df, on=['Dataset', 'Algorithm'])
    
    for dataset in merged_df['Dataset'].unique():
        dataset_data = merged_df[merged_df['Dataset'] == dataset]
        axes[1, 2].scatter(dataset_data['Coefficient_Sparsity'], dataset_data['Test_R2'], 
                          label=dataset, s=80, alpha=0.7)
    
    axes[1, 2].set_xlabel('Coefficient Sparsity')
    axes[1, 2].set_ylabel('Test R² Score')
    axes[1, 2].set_title('Performance vs Model Complexity')
    axes[1, 2].legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    axes[1, 2].grid(True, alpha=0.3)

plt.tight_layout()

# Save comprehensive regression analysis
save_figure(fig, 'comprehensive_regression_analysis',
           'Complete performance comparison across all regression algorithms and datasets', 'regression')
plt.show()

# Create advanced coefficient recovery visualization
if 'coef_analysis' in globals():
    print("📊 Creating Advanced Coefficient Recovery Visualization...")
    
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    
    # 1. Coefficient correlation heatmap
    correlation_data = []
    datasets_with_coef = []
    algorithms_with_coef = set()
    
    for dataset_name, analysis in coef_analysis.items():
        datasets_with_coef.append(dataset_name)
        for alg_name, metrics in analysis['algorithm_coefficients'].items():
            algorithms_with_coef.add(alg_name)
            correlation_data.append({
                'Dataset': dataset_name,
                'Algorithm': alg_name,
                'Correlation': metrics['correlation_with_true']
            })
    
    if correlation_data:
        corr_df = pd.DataFrame(correlation_data)
        corr_pivot = corr_df.pivot(index='Algorithm', columns='Dataset', values='Correlation')
        
        sns.heatmap(corr_pivot, annot=True, cmap='RdYlGn', center=0.5, 
                    cbar_kws={'label': 'Correlation with True Coefficients'}, 
                    ax=axes[0, 0], fmt='.3f')
        axes[0, 0].set_title('Coefficient Recovery: Correlation with True Values')
    
    # 2. Feature selection performance
    fs_data = []
    for dataset_name, analysis in coef_analysis.items():
        for alg_name, metrics in analysis['algorithm_coefficients'].items():
            fs_data.append({
                'Dataset': dataset_name,
                'Algorithm': alg_name,
                'F1_Score': metrics['feature_selection_f1']
            })
    
    if fs_data:
        fs_df = pd.DataFrame(fs_data)
        fs_pivot = fs_df.pivot(index='Algorithm', columns='Dataset', values='F1_Score')
        
        sns.heatmap(fs_pivot, annot=True, cmap='Blues', 
                    cbar_kws={'label': 'Feature Selection F1 Score'}, 
                    ax=axes[0, 1], fmt='.3f')
        axes[0, 1].set_title('Feature Selection Performance')
    
    # 3. Sparsity comparison
    sparsity_comparison = []
    for dataset_name, analysis in coef_analysis.items():
        true_sparsity = analysis['true_coefficients']['sparsity']
        for alg_name, metrics in analysis['algorithm_coefficients'].items():
            learned_sparsity = metrics['learned_sparsity']
            sparsity_comparison.append({
                'Dataset': dataset_name,
                'Algorithm': alg_name,
                'True_Sparsity': true_sparsity,
                'Learned_Sparsity': learned_sparsity,
                'Sparsity_Error': abs(true_sparsity - learned_sparsity)
            })
    
    if sparsity_comparison:
        # Use a distinct name so the earlier sparsity_df is not overwritten
        sparsity_cmp_df = pd.DataFrame(sparsity_comparison)
        
        # Scatter plot of true vs learned sparsity
        for alg in sparsity_cmp_df['Algorithm'].unique():
            alg_data = sparsity_cmp_df[sparsity_cmp_df['Algorithm'] == alg]
            axes[1, 0].scatter(alg_data['True_Sparsity'], alg_data['Learned_Sparsity'], 
                              label=alg, alpha=0.7, s=60)
        
        # Perfect sparsity recovery line
        axes[1, 0].plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Perfect Recovery')
        axes[1, 0].set_xlabel('True Sparsity')
        axes[1, 0].set_ylabel('Learned Sparsity')
        axes[1, 0].set_title('Sparsity Recovery Comparison')
        axes[1, 0].legend()
        axes[1, 0].grid(True, alpha=0.3)
    
    # 4. Algorithm ranking by coefficient recovery
    if correlation_data:
        avg_correlation = corr_df.groupby('Algorithm')['Correlation'].mean().sort_values(ascending=True)
        
        bars = axes[1, 1].barh(range(len(avg_correlation)), avg_correlation.values, 
                              color='lightblue', alpha=0.7)
        axes[1, 1].set_yticks(range(len(avg_correlation)))
        axes[1, 1].set_yticklabels(avg_correlation.index)
        axes[1, 1].set_xlabel('Average Correlation with True Coefficients')
        axes[1, 1].set_title('Algorithm Ranking: Coefficient Recovery')
        axes[1, 1].grid(True, alpha=0.3)
        
        # Add value labels
        for bar, value in zip(bars, avg_correlation.values):
            axes[1, 1].text(value + 0.01, bar.get_y() + bar.get_height()/2, 
                           f'{value:.3f}', va='center', fontsize=9)
    
    plt.tight_layout()
    
    # Save advanced coefficient analysis visualization
    save_figure(fig, 'advanced_coefficient_recovery_analysis',
               'Advanced visualization of coefficient recovery and feature selection performance', 'regression')
    plt.show()

# Algorithm recommendation analysis
print("\n🎯 Algorithm Recommendation Analysis:")
print("=" * 100)

for dataset_name, algorithms in algorithm_performance.items():
    if any('error' not in alg for alg in algorithms.values()):
        # Find best performing algorithm for this dataset
        valid_algs = {name: metrics for name, metrics in algorithms.items() if 'error' not in metrics}
        best_alg = max(valid_algs.keys(), key=lambda x: valid_algs[x]['test_r2'])
        best_score = valid_algs[best_alg]['test_r2']
        
        print(f"\n📊 {dataset_name} Dataset:")
        print(f"  Best Algorithm: {best_alg} (R² = {best_score:.4f})")
        print(f"  Optimal for: {regression_algorithms[best_alg]['optimal_for']}")
        
        # Show all algorithm rankings
        sorted_algs = sorted(valid_algs.items(), key=lambda x: x[1]['test_r2'], reverse=True)
        print(f"  Algorithm Rankings:")
        for i, (alg_name, metrics) in enumerate(sorted_algs):
            print(f"    {i+1}. {alg_name}: R² = {metrics['test_r2']:.4f}")

print("=" * 100)
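
The sparsity-recovery comparison plotted in this section can be reproduced in miniature with scikit-learn alone. The sketch below is illustrative rather than the project's own implementation: it assumes sparsity means the fraction of coefficients that are exactly zero, uses `make_regression` in place of the project's generator, and picks `alpha=1.0` arbitrarily.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic problem with a known coefficient vector: 10 of 50 features are informative
X_demo, y_demo, true_coef = make_regression(
    n_samples=300, n_features=50, n_informative=10,
    noise=5.0, coef=True, random_state=42
)

# alpha=1.0 is an illustrative regularization strength, not a tuned value
lasso_demo = Lasso(alpha=1.0).fit(X_demo, y_demo)

# Assumption: sparsity = fraction of coefficients that are exactly zero
true_sparsity = float(np.mean(true_coef == 0))
learned_sparsity = float(np.mean(lasso_demo.coef_ == 0))
sparsity_error = abs(true_sparsity - learned_sparsity)

print(f"true={true_sparsity:.2f}, learned={learned_sparsity:.2f}, error={sparsity_error:.2f}")
```

A point on the "Perfect Recovery" diagonal corresponds to `sparsity_error == 0`; how close Lasso gets depends on the chosen `alpha`.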

## 4. Classification Complexity Hierarchy {#classification}

Next, we build a hierarchy of classification datasets in four levels of increasing difficulty (linear separability, geometric patterns, high dimensionality, and real-world challenges) to expose where each algorithm succeeds and where it breaks down.
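
To make the idea of a difficulty spectrum concrete, here is a minimal sketch that varies class separation and records how a linear classifier responds. It assumes scikit-learn's `make_classification` as a stand-in for the project's `generator.classification_dataset`, whose internals are not shown here; `separability_sweep` is a hypothetical helper name.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def separability_sweep(separations=(2.0, 1.0, 0.3), seed=42):
    """Hypothetical helper: mean 5-fold CV accuracy of logistic regression per class_sep."""
    scores = {}
    for sep in separations:
        # Same shape parameters as the Level 1 configuration; only class_sep changes
        X, y = make_classification(
            n_samples=1000, n_features=20, n_informative=15, n_redundant=3,
            n_classes=3, class_sep=sep, random_state=seed
        )
        clf = LogisticRegression(max_iter=1000)
        scores[sep] = cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()
    return scores


sweep_scores = separability_sweep()
for sep, score in sweep_scores.items():
    print(f"class_sep={sep}: mean CV accuracy = {score:.3f}")
```

Holding everything but `class_sep` fixed keeps the comparison about separability alone rather than about sample size or dimensionality.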

In [None]:
# Generate comprehensive classification dataset hierarchy
print("🎯 Generating Classification Complexity Hierarchy...")

classification_datasets = {}

# Level 1: Linear separability spectrum
print("  Level 1: Linear Separability Analysis...")

linear_configs = [
    {'separation': 2.0, 'name': 'Easy_Linear', 'description': 'Widely separated classes'},
    {'separation': 1.0, 'name': 'Medium_Linear', 'description': 'Moderately separable'},
    {'separation': 0.3, 'name': 'Hard_Linear', 'description': 'Heavily overlapping classes'}
]

for config in linear_configs:
    X, y = generator.classification_dataset(
        n_samples=1000,
        n_features=20,
        n_informative=15,
        n_redundant=3,
        n_classes=3,
        class_sep=config['separation'],
        random_state=42
    )
    
    classification_datasets[config['name']] = {
        'X': X, 'y': y, 'level': 1,
        'description': config['description'],
        'complexity_factors': {'linearity': 'linear', 'separability': config['separation']},
        'optimal_algorithm': 'Logistic Regression' if config['separation'] > 1.0 else 'SVM'
    }
    
    print(f"    {config['name']}: {X.shape}, Class distribution: {np.bincount(y)}")

# Level 2: Geometric complexity
print("  Level 2: Geometric Pattern Complexity...")

geometric_patterns = [
    {'type': 'moons', 'name': 'Moons_Pattern'},
    {'type': 'circles', 'name': 'Circles_Pattern'},
    {'type': 'spirals', 'name': 'Spirals_Pattern'}
]

for pattern in geometric_patterns:
    if pattern['type'] == 'moons':
        X, y = generator.make_moons_advanced(
            n_samples=800, noise=0.15, n_clusters=2
        )
    elif pattern['type'] == 'circles':
        X, y = generator.make_circles_advanced(
            n_samples=800, noise=0.1, factor=0.3
        )
    else:  # spirals
        X, y = generator.make_spirals(
            n_samples=600, noise=0.1, n_spirals=2
        )
    
    classification_datasets[pattern['name']] = {
        'X': X, 'y': y, 'level': 2,
        'description': f'Nonlinear {pattern["type"]} pattern',
        'complexity_factors': {'linearity': 'nonlinear', 'pattern': pattern['type']},
        'optimal_algorithm': 'Random Forest' if pattern['type'] != 'circles' else 'SVM (RBF)'
    }
    
    print(f"    {pattern['name']}: {X.shape}, Pattern: {pattern['type']}")

# Level 3: High-dimensional challenges
print("  Level 3: High-Dimensional Challenges...")

high_dim_configs = [
    {
        'n_samples': 500, 'n_features': 100, 'n_informative': 20,
        'name': 'HighDim_Sparse', 'description': 'High-dimensional sparse features'
    },
    {
        'n_samples': 300, 'n_features': 200, 'n_informative': 10,
        'name': 'HighDim_Very_Sparse', 'description': 'Very high-dimensional, very sparse'
    }
]

for config in high_dim_configs:
    X, y = generator.classification_dataset(
        n_samples=config['n_samples'],
        n_features=config['n_features'],
        n_informative=config['n_informative'],
        n_redundant=5,
        n_classes=2,
        class_sep=0.8,
        random_state=42
    )
    
    classification_datasets[config['name']] = {
        'X': X, 'y': y, 'level': 3,
        'description': config['description'],
        'complexity_factors': {
            'dimensionality': 'high', 
            'sparsity': config['n_informative'] / config['n_features']
        },
        'optimal_algorithm': 'Linear SVM'
    }
    
    print(f"    {config['name']}: {X.shape}, Informative ratio: {config['n_informative']/config['n_features']:.2f}")

# Level 4: Real-world challenges
print("  Level 4: Real-World Challenge Scenarios...")

# Imbalanced classification
X_imbal, y_imbal = generator.imbalanced_classification(
    n_samples=1200, n_features=15, imbalance_ratio=0.05, random_state=42
)

classification_datasets['Imbalanced_Extreme'] = {
    'X': X_imbal, 'y': y_imbal, 'level': 4,
    'description': 'Extremely imbalanced classes (5% minority)',
    'complexity_factors': {'imbalance_ratio': 0.05, 'challenge': 'class_imbalance'},
    'optimal_algorithm': 'Random Forest with balancing'
}

# Noisy features
X_noisy, y_noisy = generator.classification_with_noise(
    n_samples=800, n_features=25, n_informative=10, noise_features=10
)

classification_datasets['Noisy_Features'] = {
    'X': X_noisy, 'y': y_noisy, 'level': 4,
    'description': 'Many irrelevant noisy features',
    'complexity_factors': {'noise_ratio': 0.4, 'challenge': 'feature_noise'},
    'optimal_algorithm': 'Gradient Boosting'
}

print(f"    Imbalanced_Extreme: {X_imbal.shape}, Class counts: {np.bincount(y_imbal)}")
print(f"    Noisy_Features: {X_noisy.shape}, Noise ratio: 40%")

print("\n✨ Classification complexity hierarchy generation complete!")

# Save classification datasets metadata
classification_metadata = {
    'total_datasets': len(classification_datasets),
    'complexity_levels': len(set(info['level'] for info in classification_datasets.values())),
    'dataset_details': {name: {
        'shape': info['X'].shape,
        'level': info['level'],
        'description': info['description'],
        'optimal_algorithm': info.get('optimal_algorithm', 'Unknown'),
        'complexity_factors': info.get('complexity_factors', {})
    } for name, info in classification_datasets.items()}
}

save_experiment_results('classification_datasets_generation', classification_metadata,
                       'Comprehensive classification complexity hierarchy generation', 'classification')
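
The Level 2 patterns exist to separate linear from nonlinear models. The sketch below illustrates that gap on concentric circles, assuming scikit-learn's `make_circles` as a stand-in for the project's `generator.make_circles_advanced`; the split and models are illustrative choices.

```python
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Concentric circles with the same noise and factor as the Level 2 configuration
X_c, y_c = make_circles(n_samples=800, noise=0.1, factor=0.3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_c, y_c, test_size=0.3, random_state=42, stratify=y_c
)

# A linear boundary cannot enclose the inner circle; an RBF kernel can
linear_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel='rbf').fit(X_tr, y_tr).score(X_te, y_te)

print(f"Logistic Regression: {linear_acc:.3f}, SVM (RBF): {rbf_acc:.3f}")
```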

In [None]:
# Visualize classification complexity hierarchy
print("📊 Visualizing Classification Complexity Hierarchy...")

# Create comprehensive visualization of all classification datasets
n_datasets = len(classification_datasets)
n_cols = 4
n_rows = (n_datasets + n_cols - 1) // n_cols

fig, axes = plt.subplots(n_rows, n_cols, figsize=(20, 5 * n_rows))
if n_rows == 1:
    axes = axes.reshape(1, -1)

axes_flat = axes.flatten()

for idx, (dataset_name, dataset_info) in enumerate(classification_datasets.items()):
    if idx >= len(axes_flat):
        break
        
    ax = axes_flat[idx]
    X, y = dataset_info['X'], dataset_info['y']
    
    # Scatter plot for any dataset with at least two features
    if X.shape[1] >= 2:
        # Use first two features for visualization
        scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', alpha=0.7, s=30)
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')
        
        # Overlay a linear decision boundary for the linear-separability datasets
        if 'Linear' in dataset_name:
            # Fit on the two plotted features only, so the boundary matches the axes
            svm_viz = SVC(kernel='linear', random_state=42)
            svm_viz.fit(X[:, :2], y)
            
            # Create mesh for decision boundary
            x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
            y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
            xx, yy = np.meshgrid(np.linspace(x_min, x_max, 50),
                               np.linspace(y_min, y_max, 50))
            
            Z = svm_viz.predict(np.c_[xx.ravel(), yy.ravel()])
            Z = Z.reshape(xx.shape)
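            # Draw contours midway between adjacent class labels, where predictions change
            classes = np.unique(y)
            ax.contour(xx, yy, Z, alpha=0.3, levels=(classes[:-1] + classes[1:]) / 2,
                       colors='black', linestyles='--')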
    
    else:
        # Fallback for single-feature data: show class distribution
        unique_classes, class_counts = np.unique(y, return_counts=True)
        bars = ax.bar(unique_classes, class_counts, alpha=0.7)
        ax.set_xlabel('Class Label')
        ax.set_ylabel('Sample Count')
        
        # Add percentage labels
        total_samples = len(y)
        for bar, count in zip(bars, class_counts):
            percentage = (count / total_samples) * 100
            ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + total_samples * 0.01,
                   f'{percentage:.1f}%', ha='center', va='bottom', fontsize=8)
    
    # Title with complexity level and description
    level = dataset_info['level']
    description = dataset_info['description']
    optimal_alg = dataset_info.get('optimal_algorithm', 'Unknown')
    ax.set_title(f'Level {level}: {dataset_name}\n{description}\n'
                f'({X.shape[0]} samples, {X.shape[1]} features)\nOptimal: {optimal_alg}', fontsize=9)
    ax.grid(True, alpha=0.3)

# Hide empty subplots
for idx in range(len(classification_datasets), len(axes_flat)):
    axes_flat[idx].set_visible(False)

plt.tight_layout()

# Save classification hierarchy visualization
save_figure(fig, 'classification_complexity_hierarchy',
           'Visualization of classification datasets across complexity levels', 'classification')
plt.show()

# Dataset complexity analysis
print("\n📈 Classification Dataset Complexity Analysis:")
print("=" * 100)
print(f"{'Dataset':<20} {'Level':<6} {'Samples':<8} {'Features':<10} {'Classes':<8} {'Complexity Factors':<30}")
print("=" * 100)

for name, info in classification_datasets.items():
    X, y = info['X'], info['y']
    level = info['level']
    n_classes = len(np.unique(y))
    
    # Format complexity factors
    factors = info.get('complexity_factors', {})
    factor_str = ', '.join([f"{k}:{v}" for k, v in factors.items()])
    # Truncate to fit the 30-character column (27 characters plus "...")
    factor_str = factor_str[:27] + "..." if len(factor_str) > 30 else factor_str
    
    print(f"{name:<20} {level:<6} {X.shape[0]:<8} {X.shape[1]:<10} {n_classes:<8} {factor_str:<30}")

print("=" * 100)
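
For the imbalanced dataset in Level 4, plain accuracy can look strong even when the minority class is never predicted. The sketch below shows this with a majority-class baseline, assuming scikit-learn's `make_classification` with `weights` as a stand-in for `generator.imbalanced_classification`.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Roughly 5% minority class, mirroring imbalance_ratio=0.05
X_i, y_i = make_classification(
    n_samples=1200, n_features=15, weights=[0.95, 0.05], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_i, y_i, test_size=0.3, random_state=42, stratify=y_i
)

# Baseline that always predicts the majority class
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
pred = baseline.predict(X_te)

plain_acc = accuracy_score(y_te, pred)
bal_acc = balanced_accuracy_score(y_te, pred)  # mean of per-class recall
print(f"accuracy={plain_acc:.3f}, balanced accuracy={bal_acc:.3f}")
```

Balanced accuracy, per-class recall, or F1 are therefore more informative than accuracy when comparing algorithms on the imbalanced datasets.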

### Comprehensive Classification Algorithm Benchmark

In [None]:
# Comprehensive classification algorithm benchmarking
print("🏆 Comprehensive Classification Algorithm Benchmark...")

# Define comprehensive algorithm suite
classification_algorithms = {
    'Logistic Regression': {
        'model': LogisticRegression(random_state=42, max_iter=1000),
        'strengths': ['Linear boundaries', 'Probabilistic output', 'Fast training'],
        'weaknesses': ['Linear only', 'Sensitive to outliers']
    },
    'Random Forest': {
        'model': RandomForestClassifier(n_estimators=100, random_state=42),
        'strengths': ['Handles nonlinearity', 'Feature importance', 'Robust to outliers'],
        'weaknesses': ['Can overfit', 'Black box', 'Memory intensive']
    },
    'SVM (RBF)': {
        'model': SVC(kernel='rbf', random_state=42, probability=True),
        'strengths': ['Nonlinear boundaries', 'Effective in high dimensions', 'Memory efficient'],
        'weaknesses': ['Slow on large datasets', 'Sensitive to scaling', 'Parameter sensitive']
    },
    'SVM (Linear)': {
        'model': SVC(kernel='linear', random_state=42, probability=True),
        'strengths': ['Linear boundaries', 'Good with high dimensions', 'Regularization'],
        'weaknesses': ['Linear only', 'Sensitive to scaling']
    },
    'Gradient Boosting': {
        'model': GradientBoostingClassifier(random_state=42, n_estimators=100),
        'strengths': ['Excellent performance', 'Handles mixed data', 'Feature importance'],
        'weaknesses': ['Slow training', 'Prone to overfitting', 'Many hyperparameters']
    },
    'Neural Network': {
        'model': MLPClassifier(hidden_layer_sizes=(100, 50), random_state=42, max_iter=500),
        'strengths': ['Universal approximator', 'Learns complex patterns', 'Flexible architecture'],
        'weaknesses': ['Requires large data', 'Black box', 'Sensitive to scaling']
    },
    'Naive Bayes': {
        'model': GaussianNB(),
        'strengths': ['Fast training/prediction', 'Works with small data', 'Probabilistic'],
        'weaknesses': ['Strong independence assumption', 'Poor with correlated features']
    }
}

# Comprehensive benchmarking across all datasets
algorithm_benchmark_results = {}

print("\n🔬 Running Comprehensive Algorithm Benchmark...")

for dataset_name, dataset_info in classification_datasets.items():
    print(f"\n--- Benchmarking on {dataset_name} (Level {dataset_info['level']}) ---")
    
    X, y = dataset_info['X'], dataset_info['y']
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )
    
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    dataset_results = {}
    
    for alg_name, alg_info in classification_algorithms.items():
        try:
            start_time = time.time()
            
            # Train a fresh, unfitted copy built from the template's parameters
            model = alg_info['model'].__class__(**alg_info['model'].get_params())
            model.fit(X_train_scaled, y_train)
            
            training_time = time.time() - start_time
            
            # Evaluate
            start_time = time.time()
            predictions = model.predict(X_test_scaled)
            prediction_time = time.time() - start_time
            
            accuracy = accuracy_score(y_test, predictions)
            
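            # Cross-validation (cross_val_score clones the estimator and refits it per fold;
            # the scaler was fit on the whole training split, so fold scores are slightly optimistic)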
            cv_scores = cross_val_score(model, X_train_scaled, y_train, cv=5, scoring='accuracy')
            
            # Generalization gap
            train_accuracy = model.score(X_train_scaled, y_train)
            generalization_gap = train_accuracy - accuracy
            
            dataset_results[alg_name] = {
                'accuracy': accuracy,
                'train_accuracy': train_accuracy,
                'cv_mean': cv_scores.mean(),
                'cv_std': cv_scores.std(),
                'training_time': training_time,
                'prediction_time': prediction_time,
                'total_time': training_time + prediction_time,
                'generalization_gap': generalization_gap,
                'model': model
            }
            
            print(f"  {alg_name:<20}: Acc={accuracy:.4f}, CV={cv_scores.mean():.4f}±{cv_scores.std():.4f}, Time={training_time:.3f}s")
            
        except Exception as e:
            print(f"  {alg_name:<20}: ❌ Failed ({str(e)[:30]}...)")
            dataset_results[alg_name] = {'error': str(e)}
    
    algorithm_benchmark_results[dataset_name] = dataset_results

print("\n✨ Algorithm benchmarking complete!")

# Save benchmark results
benchmark_summary = {}
for dataset_name, algorithms in algorithm_benchmark_results.items():
    benchmark_summary[dataset_name] = {}
    for alg_name, metrics in algorithms.items():
        if 'error' not in metrics:
            benchmark_summary[dataset_name][alg_name] = {
                'accuracy': metrics['accuracy'],
                'cv_mean': metrics['cv_mean'],
                'training_time': metrics['training_time'],
                'generalization_gap': metrics['generalization_gap']
            }

save_experiment_results('classification_algorithm_benchmark', benchmark_summary,
                       'Comprehensive classification algorithm benchmark results', 'classification')
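
The benchmark scales the training split once and then cross-validates on the already-scaled data, so every validation fold has contributed to the scaler's statistics. One common way to avoid this, sketched here under the assumption that a single estimator is being evaluated, is to wrap the scaler and model in a pipeline so that scaling is refit inside each fold. The effect is usually small for standardization, but the pattern generalizes to heavier preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data; in the notebook this would be a training split
X_p, y_p = make_classification(
    n_samples=500, n_features=20, n_informative=10, random_state=42
)

# The scaler is fit only on the training part of each fold
pipe = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
pipe_scores = cross_val_score(pipe, X_p, y_p, cv=5, scoring='accuracy')

print(f"CV accuracy: {pipe_scores.mean():.4f} ± {pipe_scores.std():.4f}")
```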

## 5. Clustering Pattern Showcase {#clustering}

This section generates five clustering datasets (spherical, non-spherical, varying-density, hierarchical, and high-dimensional), each chosen to favor or stress a different clustering algorithm.
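
As a small illustration of why cluster shape matters, the sketch below compares K-Means and DBSCAN on two interleaved moons. It assumes scikit-learn's `make_moons` rather than the project's generator, and the `noise`, `eps`, and `min_samples` values are illustrative, not tuned.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Two interleaved half-moons: non-spherical clusters
X_m, y_m = make_moons(n_samples=600, noise=0.05, random_state=42)
X_m = StandardScaler().fit_transform(X_m)

# K-Means partitions by distance to centroids; DBSCAN follows density-connected regions
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X_m)
dbscan_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_m)

kmeans_ari = adjusted_rand_score(y_m, kmeans_labels)
dbscan_ari = adjusted_rand_score(y_m, dbscan_labels)
print(f"K-Means ARI={kmeans_ari:.3f}, DBSCAN ARI={dbscan_ari:.3f}")
```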

In [None]:
# Generate clustering datasets
print("🔗 Generating Clustering Pattern Showcase...")

clustering_datasets = {}

# Dataset 1: Well-separated spherical clusters
print("  Creating well-separated spherical clusters...")
X_spherical, y_spherical = generator.make_blobs_advanced(
    n_samples=800,
    centers=4,
    cluster_std=1.0,
    n_features=2,
    random_state=42
)

clustering_datasets['Spherical_Clusters'] = {
    'X': X_spherical, 'y': y_spherical,
    'description': 'Well-separated spherical clusters',
    'optimal_algorithm': 'K-Means',
    'challenge': 'none',
    'true_k': 4
}

# Dataset 2: Non-spherical clusters
print("  Creating non-spherical clusters...")
X_moons, y_moons = generator.make_moons_advanced(
    n_samples=600, noise=0.1, n_clusters=2
)

clustering_datasets['Non_Spherical'] = {
    'X': X_moons, 'y': y_moons,
    'description': 'Non-spherical moon-shaped clusters',
    'optimal_algorithm': 'DBSCAN',
    'challenge': 'non_spherical',
    'true_k': 2
}

# Dataset 3: Density-based clusters
print("  Creating density-based clusters...")
X_density, y_density = generator.make_density_clusters(
    n_samples=700,
    n_centers=3,
    cluster_density_ratio=0.3,
    noise_ratio=0.1
)

clustering_datasets['Density_Based'] = {
    'X': X_density, 'y': y_density,
    'description': 'Varying density clusters with noise',
    'optimal_algorithm': 'DBSCAN',
    'challenge': 'varying_density',
    'true_k': 3
}

# Dataset 4: Hierarchical structure
print("  Creating hierarchical clusters...")
X_hierarchical, y_hierarchical = generator.make_hierarchical_clusters(
    n_samples=500,
    n_levels=3,
    branching_factor=2
)

clustering_datasets['Hierarchical'] = {
    'X': X_hierarchical, 'y': y_hierarchical,
    'description': 'Hierarchical nested clusters',
    'optimal_algorithm': 'Agglomerative Clustering',
    'challenge': 'hierarchical',
    'true_k': 8  # 2^3 leaf clusters
}

# Dataset 5: High-dimensional clustering
print("  Creating high-dimensional clusters...")
X_high_dim, y_high_dim = generator.make_blobs_advanced(
    n_samples=600,
    centers=5,
    cluster_std=2.0,
    n_features=50,
    random_state=42
)

clustering_datasets['High_Dimensional'] = {
    'X': X_high_dim, 'y': y_high_dim,
    'description': 'High-dimensional spherical clusters',
    'optimal_algorithm': 'Gaussian Mixture',
    'challenge': 'high_dimensionality',
    'true_k': 5
}

print("\n✨ Clustering dataset generation complete!")

# Save clustering datasets metadata
clustering_metadata = {
    'total_datasets': len(clustering_datasets),
    'dataset_details': {name: {
        'shape': info['X'].shape,
        'description': info['description'],
        'optimal_algorithm': info.get('optimal_algorithm', 'Unknown'),
        'challenge': info.get('challenge', 'general'),
        'true_k': info.get('true_k', 'Unknown')
    } for name, info in clustering_datasets.items()}
}

save_experiment_results('clustering_datasets_generation', clustering_metadata,
                       'Comprehensive clustering dataset generation results', 'clustering')

In [None]:
# Visualize clustering datasets
print("📊 Visualizing Clustering Datasets...")

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

for idx, (dataset_name, dataset_info) in enumerate(clustering_datasets.items()):
    if idx >= len(axes):
        break
        
    ax = axes[idx]
    X, y = dataset_info['X'], dataset_info['y']
    
    # Scatter the first two features (a 2D slice for higher-dimensional data)
    if X.shape[1] >= 2:
        scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='tab10', alpha=0.7, s=30)
        ax.set_xlabel('Feature 1')
        ax.set_ylabel('Feature 2')
    else:
        # For 1D data, create histogram
        for cluster_id in np.unique(y):
            cluster_data = X[y == cluster_id]
            ax.hist(cluster_data, alpha=0.6, label=f'Cluster {cluster_id}', bins=20)
        ax.set_xlabel('Feature Value')
        ax.set_ylabel('Frequency')
        ax.legend()
    
    # Title with information
    description = dataset_info['description']
    optimal_alg = dataset_info.get('optimal_algorithm', 'Unknown')
    true_k = dataset_info.get('true_k', 'Unknown')
    
    ax.set_title(f'{dataset_name}\n{description}\n'
                f'True K: {true_k}, Optimal: {optimal_alg}', fontsize=9)
    ax.grid(True, alpha=0.3)

# Hide any unused subplots
for ax in axes[len(clustering_datasets):]:
    ax.set_visible(False)

plt.tight_layout()

# Save clustering visualization
save_figure(fig, 'clustering_datasets_showcase',
           'Visualization of different clustering patterns and challenges', 'clustering')
plt.show()

# Clustering dataset summary
print("\n📊 Clustering Dataset Summary:")
print("=" * 80)
print(f"{'Dataset':<20} {'Samples':<8} {'Features':<10} {'True K':<8} {'Challenge':<20}")
print("=" * 80)

for name, info in clustering_datasets.items():
    X, y = info['X'], info['y']
    true_k = info.get('true_k', 'Unknown')
    challenge = info.get('challenge', 'general')
    
    print(f"{name:<20} {X.shape[0]:<8} {X.shape[1]:<10} {true_k:<8} {challenge:<20}")

print("=" * 80)


# Comprehensive clustering algorithm performance visualization
# Note: `clustering_results` is created by the "Test clustering algorithms" cell,
# so this block is skipped until that cell has been run.
if 'clustering_results' in globals():
    print("📊 Creating Comprehensive Clustering Performance Visualization...")
    
    # Prepare clustering performance data
    clustering_perf_data = []
    for dataset_name, algorithms in clustering_results.items():
        for alg_name, metrics in algorithms.items():
            if 'error' not in metrics:
                clustering_perf_data.append({
                    'Dataset': dataset_name,
                    'Algorithm': alg_name,
                    'Silhouette_Score': metrics['silhouette_score'],
                    'ARI_Score': metrics['ari_score'],
                    'N_Clusters_Found': metrics['n_clusters_found'],
                    'Clustering_Time': metrics['clustering_time'],
                    'True_K': clustering_datasets[dataset_name].get('true_k', 0)
                })
    
    if clustering_perf_data:
        clustering_df = pd.DataFrame(clustering_perf_data)
        
        fig, axes = plt.subplots(2, 2, figsize=(16, 12))
        
        # 1. Silhouette score heatmap
        sil_pivot = clustering_df.pivot(index='Algorithm', columns='Dataset', values='Silhouette_Score')
        sns.heatmap(sil_pivot, annot=True, cmap='RdYlGn', center=0.3, 
                    cbar_kws={'label': 'Silhouette Score'}, ax=axes[0, 0], fmt='.3f')
        axes[0, 0].set_title('Clustering Quality: Silhouette Scores')
        
        # 2. ARI score heatmap
        ari_pivot = clustering_df.pivot(index='Algorithm', columns='Dataset', values='ARI_Score')
        sns.heatmap(ari_pivot, annot=True, cmap='viridis', 
                    cbar_kws={'label': 'Adjusted Rand Index'}, ax=axes[0, 1], fmt='.3f')
        axes[0, 1].set_title('Cluster Assignment Accuracy: ARI Scores')
        
        # 3. Clustering efficiency
        time_pivot = clustering_df.pivot(index='Algorithm', columns='Dataset', values='Clustering_Time')
        sns.heatmap(time_pivot, annot=True, cmap='YlOrRd', 
                    cbar_kws={'label': 'Clustering Time (s)'}, ax=axes[1, 0], fmt='.4f')
        axes[1, 0].set_title('Clustering Efficiency')
        
        # 4. Number of clusters found vs true K
        for dataset in clustering_df['Dataset'].unique():
            dataset_data = clustering_df[clustering_df['Dataset'] == dataset]
            true_k = dataset_data['True_K'].iloc[0]
            
            bars = axes[1, 1].bar(
                [f"{alg}\n{dataset[:8]}..." for alg in dataset_data['Algorithm']], 
                dataset_data['N_Clusters_Found'], 
                alpha=0.7, label=dataset if len(clustering_df['Dataset'].unique()) <= 4 else None
            )
            
            # Add true K line
            if true_k > 0:
                axes[1, 1].axhline(y=true_k, color='red', linestyle='--', alpha=0.5)
        
        axes[1, 1].set_ylabel('Number of Clusters Found')
        axes[1, 1].set_title('Cluster Count Accuracy')
        axes[1, 1].tick_params(axis='x', rotation=45)
        if len(clustering_df['Dataset'].unique()) <= 4:
            axes[1, 1].legend()
        axes[1, 1].grid(True, alpha=0.3)
        
        plt.tight_layout()
        
        # Save clustering performance visualization
        save_figure(fig, 'comprehensive_clustering_performance',
                   'Complete clustering algorithm performance analysis across all datasets', 'clustering')
        plt.show()
        
        # Clustering algorithm recommendations
        print("\n🎯 Clustering Algorithm Recommendations:")
        print("=" * 60)
        
        for dataset_name in clustering_df['Dataset'].unique():
            dataset_results = clustering_df[clustering_df['Dataset'] == dataset_name]
            
            # Find best by silhouette score
            best_sil = dataset_results.loc[dataset_results['Silhouette_Score'].idxmax()]
            best_ari = dataset_results.loc[dataset_results['ARI_Score'].idxmax()]
            
            print(f"\n📊 {dataset_name}:")
            print(f"  Best Silhouette: {best_sil['Algorithm']} ({best_sil['Silhouette_Score']:.4f})")
            print(f"  Best ARI: {best_ari['Algorithm']} ({best_ari['ARI_Score']:.4f})")
            print(f"  Predicted Optimal: {clustering_datasets[dataset_name].get('optimal_algorithm', 'Unknown')}")
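
scikit-learn's clustering estimators do not share one name for the number of groups: `KMeans` and `AgglomerativeClustering` take `n_clusters`, while `GaussianMixture` takes `n_components`. A minimal sketch of one way to handle this is a lookup table keyed by estimator class; `K_PARAM` and `build_clusterer` are hypothetical names introduced only for this example.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Name of the "number of groups" argument for each estimator class
K_PARAM = {
    KMeans: 'n_clusters',
    AgglomerativeClustering: 'n_clusters',
    GaussianMixture: 'n_components',
}


def build_clusterer(cls, k, **params):
    """Hypothetical helper: instantiate cls with k under its own parameter name."""
    return cls(**{K_PARAM[cls]: k}, **params)


X_b, _ = make_blobs(n_samples=300, centers=3, random_state=42)

configs = [
    (KMeans, {'n_init': 10, 'random_state': 42}),
    (AgglomerativeClustering, {'linkage': 'ward'}),
    (GaussianMixture, {'random_state': 42}),
]

found = {}
for cls, extra in configs:
    labels = build_clusterer(cls, 3, **extra).fit_predict(X_b)
    found[cls.__name__] = len(set(labels))

print(found)
```

Passing `n_clusters` to `GaussianMixture` raises a `TypeError`, which a broad `except Exception` would report only as a failed run.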

In [None]:
# Test clustering algorithms
print("🧪 Testing Clustering Algorithms...")

# Define clustering algorithms
clustering_algorithms = {
    'K-Means': {
        'class': KMeans,
        'params': {'random_state': 42},
        'strengths': ['Fast', 'Scales well', 'Simple'],
        'weaknesses': ['Assumes spherical clusters', 'Requires K']
    },
    'DBSCAN': {
        'class': DBSCAN,
        'params': {'eps': 0.5, 'min_samples': 5},
        'strengths': ['Finds arbitrary shapes', 'Handles noise', 'No K required'],
        'weaknesses': ['Sensitive to parameters', 'Struggles with varying density']
    },
    'Agglomerative': {
        'class': AgglomerativeClustering,
        'params': {'linkage': 'ward'},
        'strengths': ['Hierarchical structure', 'No assumptions about shape', 'Deterministic'],
        'weaknesses': ['Computationally expensive', 'Requires K', 'Sensitive to outliers']
    },
    'Gaussian Mixture': {
        'class': GaussianMixture,
        'params': {'random_state': 42},
        'strengths': ['Probabilistic', 'Handles overlapping clusters', 'Flexible shapes'],
        'weaknesses': ['Assumes Gaussian distributions', 'Requires K', 'Can overfit']
    }
}

# Evaluate clustering algorithms
clustering_results = {}

for dataset_name, dataset_info in clustering_datasets.items():
    print(f"\n--- Testing on {dataset_name} ---")
    
    X, y_true = dataset_info['X'], dataset_info['y']
    true_k = dataset_info.get('true_k', len(np.unique(y_true)))
    
    # Standardize features for clustering
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    
    dataset_clustering_results = {}
    
    for alg_name, alg_info in clustering_algorithms.items():
        try:
            start_time = time.time()
            
            # Set the cluster count using each estimator's own parameter name
            params = alg_info['params'].copy()
            if alg_name in ['K-Means', 'Agglomerative']:
                params['n_clusters'] = true_k
            elif alg_name == 'Gaussian Mixture':
                params['n_components'] = true_k
            
            # Create and fit model
            model = alg_info['class'](**params)
            
            if hasattr(model, 'fit_predict'):
                labels = model.fit_predict(X_scaled)
            else:
                model.fit(X_scaled)
                labels = model.labels_
            
            clustering_time = time.time() - start_time
            
            # Calculate silhouette score
            if len(np.unique(labels)) > 1:  # Need at least 2 clusters
                silhouette_avg = silhouette_score(X_scaled, labels)
            else:
                silhouette_avg = -1  # Invalid clustering
            
            # Adjusted Rand index against the generator's ground-truth labels
            # (imported here because it is not part of the setup imports)
            from sklearn.metrics import adjusted_rand_score
            ari_score = adjusted_rand_score(y_true, labels)
            
            # Count clusters, excluding DBSCAN's noise label (-1)
            n_clusters_found = len(set(labels) - {-1})
            
            dataset_clustering_results[alg_name] = {
                'silhouette_score': silhouette_avg,
                'ari_score': ari_score,
                'n_clusters_found': n_clusters_found,
                'clustering_time': clustering_time,
                'labels': labels,
                'model': model
            }
            
            print(f"  {alg_name:<20}: Silhouette={silhouette_avg:.4f}, ARI={ari_score:.4f}, K={n_clusters_found}")
            
        except Exception as e:
            print(f"  {alg_name:<20}: ❌ Failed ({str(e)[:30]}...)")
            dataset_clustering_results[alg_name] = {'error': str(e)}
    
    clustering_results[dataset_name] = dataset_clustering_results

print("\n✨ Clustering algorithm evaluation complete!")

# Save clustering results
clustering_summary = {}
for dataset_name, algorithms in clustering_results.items():
    clustering_summary[dataset_name] = {}
    for alg_name, metrics in algorithms.items():
        if 'error' not in metrics:
            clustering_summary[dataset_name][alg_name] = {
                'silhouette_score': metrics['silhouette_score'],
                'ari_score': metrics['ari_score'],
                'n_clusters_found': metrics['n_clusters_found'],
                'clustering_time': metrics['clustering_time']
            }

save_experiment_results('clustering_algorithm_results', clustering_summary,
                       'Comprehensive clustering algorithm evaluation results', 'clustering')
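
The evaluation in this section hands each algorithm the true number of clusters, which is only possible with synthetic data. When K is unknown, one common heuristic is to sweep candidate values and keep the one with the highest silhouette score. The sketch below assumes four hand-placed, well-separated blob centers chosen for illustration; it is not part of the project's generator.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated spherical clusters at illustrative, hand-picked centers
centers = [(-6, -6), (-6, 6), (6, -6), (6, 6)]
X_s, _ = make_blobs(n_samples=800, centers=centers, cluster_std=1.0, random_state=42)

# Score each candidate K by the mean silhouette of the resulting partition
sil_by_k = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_s)
    sil_by_k[k] = silhouette_score(X_s, labels)

best_k = max(sil_by_k, key=sil_by_k.get)
print(f"Selected K = {best_k}")
```

Silhouette favors compact, convex clusters, so this heuristic is less reliable on the moon-shaped and varying-density datasets.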

## 6. Special Purpose Datasets {#special}

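This section creates datasets that isolate specific challenges and edge cases: extreme class imbalance, high-dimensional sparsity, mixed data types, temporal dependencies, multi-label targets, and concept drift.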
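
As one example of such an edge case, multi-label targets are a binary indicator matrix rather than a single label column, which changes both the model interface and the metric. The sketch below assumes scikit-learn's `make_multilabel_classification` as a stand-in for the project's `generator.multilabel_classification` and uses Hamming loss as an illustrative metric.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split

# Y_ml is an (n_samples, n_classes) 0/1 indicator matrix
X_ml, Y_ml = make_multilabel_classification(
    n_samples=1200, n_features=25, n_classes=8, n_labels=3, random_state=42
)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_ml, Y_ml, test_size=0.3, random_state=42)

# RandomForestClassifier accepts a 2D indicator target directly
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_tr, Y_tr)
Y_pred = rf.predict(X_te)

# Hamming loss: fraction of individual label assignments that are wrong
ml_hamming = hamming_loss(Y_te, Y_pred)
print(f"Hamming loss: {ml_hamming:.4f}")
```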

In [None]:
# Generate special purpose datasets for edge cases
print("⚡ Generating Special Purpose Datasets...")

special_datasets = {}

# Dataset 1: Extreme class imbalance
print("  Creating extreme class imbalance dataset...")
X_imbalance, y_imbalance = generator.imbalanced_classification(
    n_samples=2000,
    n_features=20,
    imbalance_ratio=0.01,  # 1% minority class
    random_state=42
)

special_datasets['Extreme_Imbalance'] = {
    'X': X_imbalance, 'y': y_imbalance,
    'type': 'classification',
    'challenge': 'class_imbalance',
    'description': 'Extreme class imbalance (1% minority)',
    'optimal_algorithm': 'Balanced Random Forest'
}

# Dataset 2: High-dimensional sparse data
print("  Creating high-dimensional sparse dataset...")
X_sparse, y_sparse = generator.sparse_classification(
    n_samples=1000,
    n_features=500,
    n_informative=20,
    sparsity=0.95,  # 95% of feature values are zero
    random_state=42
)

special_datasets['High_Dim_Sparse'] = {
    'X': X_sparse, 'y': y_sparse,
    'type': 'classification',
    'challenge': 'high_dimensionality_sparsity',
    'description': 'High-dimensional sparse features (95% sparsity)',
    'optimal_algorithm': 'Linear SVM'
}

# Dataset 3: Mixed data types
print("  Creating mixed data types dataset...")
X_mixed, y_mixed, feature_types = generator.mixed_data_types(
    n_samples=800,
    n_numerical=10,
    n_categorical=8,
    n_binary=5,
    n_classes=3,
    random_state=42
)

special_datasets['Mixed_Data_Types'] = {
    'X': X_mixed, 'y': y_mixed,
    'type': 'classification',
    'challenge': 'mixed_data_types',
    'description': 'Mixed numerical, categorical, and binary features',
    'optimal_algorithm': 'Gradient Boosting',
    'feature_types': feature_types
}

# Dataset 4: Time series classification
print("  Creating time series classification dataset...")
X_time_series, y_time_series = generator.time_series_classification(
    n_samples=400,
    n_timesteps=50,
    n_features=3,
    n_classes=4,
    pattern_type='seasonal',
    random_state=42
)

special_datasets['Time_Series'] = {
    'X': X_time_series, 'y': y_time_series,
    'type': 'classification',
    'challenge': 'temporal_dependencies',
    'description': 'Time series with temporal dependencies',
    'optimal_algorithm': 'LSTM (if available) or Random Forest'
}

# Dataset 5: Multi-label classification
print("  Creating multi-label classification dataset...")
X_multilabel, y_multilabel = generator.multilabel_classification(
    n_samples=1200,
    n_features=25,
    n_classes=8,
    n_labels_per_sample=3,
    random_state=42
)

special_datasets['Multi_Label'] = {
    'X': X_multilabel, 'y': y_multilabel,
    'type': 'multilabel',
    'challenge': 'multi_label',
    'description': 'Multi-label classification problem',
    'optimal_algorithm': 'Multi-label Random Forest'
}

# Dataset 6: Concept drift simulation
print("  Creating concept drift dataset...")
X_drift, y_drift, drift_points = generator.concept_drift_classification(
    n_samples=1500,
    n_features=15,
    n_drift_points=2,
    drift_severity=0.5,
    random_state=42
)

special_datasets['Concept_Drift'] = {
    'X': X_drift, 'y': y_drift,
    'type': 'classification',
    'challenge': 'concept_drift',
    'description': 'Classification with concept drift over time',
    'optimal_algorithm': 'Adaptive Random Forest',
    'drift_points': drift_points
}

print("\n✨ Special purpose dataset generation complete!")

# Save special datasets metadata
special_metadata = {
    'total_datasets': len(special_datasets),
    'dataset_details': {name: {
        'shape': info['X'].shape,
        'type': info['type'],
        'challenge': info['challenge'],
        'description': info['description'],
        'optimal_algorithm': info.get('optimal_algorithm', 'Unknown')
    } for name, info in special_datasets.items()}
}

save_experiment_results('special_datasets_generation', special_metadata,
                       'Special purpose datasets for edge cases and challenges', 'special')

In [None]:
# Visualize special purpose datasets
print("📊 Visualizing Special Purpose Datasets...")

fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.ravel()

plot_idx = 0
for dataset_name, dataset_info in special_datasets.items():
    if plot_idx >= len(axes):
        break
        
    ax = axes[plot_idx]
    X, y = dataset_info['X'], dataset_info['y']
    challenge = dataset_info['challenge']
    description = dataset_info['description']
    
    if challenge == 'class_imbalance':
        # Show class distribution
        unique_classes, class_counts = np.unique(y, return_counts=True)
        colors = ['red' if count < len(y) * 0.2 else 'blue' for count in class_counts]
        bars = ax.bar(unique_classes, class_counts, color=colors, alpha=0.7)
        
        # Add percentage labels
        for bar, count in zip(bars, class_counts):
            percentage = (count / len(y)) * 100
            ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + len(y) * 0.01,
                   f'{percentage:.1f}%', ha='center', va='bottom', fontweight='bold')
        
        ax.set_xlabel('Class')
        ax.set_ylabel('Count')
        ax.set_title(f'{dataset_name}\nClass Distribution')
        
    elif challenge == 'high_dimensionality_sparsity':
        # Show sparsity pattern
        sparsity_per_sample = np.mean(X == 0, axis=1)
        ax.hist(sparsity_per_sample, bins=30, alpha=0.7, color='green', edgecolor='black')
        ax.axvline(np.mean(sparsity_per_sample), color='red', linestyle='--', 
                  label=f'Mean: {np.mean(sparsity_per_sample):.2%}')
        ax.set_xlabel('Sparsity per Sample')
        ax.set_ylabel('Frequency')
        ax.set_title(f'{dataset_name}\nSparsity Distribution')
        ax.legend()
        
    elif challenge == 'mixed_data_types':
        # Show feature type distribution
        feature_types = dataset_info['feature_types']
        type_counts = {}
        for ftype in feature_types:
            type_counts[ftype] = type_counts.get(ftype, 0) + 1
        
        types = list(type_counts.keys())
        counts = list(type_counts.values())
        colors = plt.cm.Set3(np.linspace(0, 1, len(types)))
        
        bars = ax.bar(types, counts, color=colors, alpha=0.7)
        ax.set_xlabel('Feature Type')
        ax.set_ylabel('Count')
        ax.set_title(f'{dataset_name}\nFeature Type Distribution')
        
        # Add count labels
        for bar, count in zip(bars, counts):
            ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.1,
                   str(count), ha='center', va='bottom', fontweight='bold')
    
    elif challenge == 'temporal_dependencies':
        # Show time series pattern for first few samples
        n_samples_show = min(5, X.shape[0])
        for i in range(n_samples_show):
            # Show first feature over time
            ax.plot(X[i, :, 0], alpha=0.7, label=f'Sample {i+1} (Class {y[i]})')
        
        ax.set_xlabel('Time Step')
        ax.set_ylabel('Feature Value')
        ax.set_title(f'{dataset_name}\nTime Series Patterns')
        ax.legend(fontsize=8)
    
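    else:
        # Default: scatter plot of first two features colored by class.
        # A 2-D label matrix (multi-label) cannot be passed to `c` directly, so
        # color by the number of active labels per sample instead. This assumes
        # y is a binary indicator matrix.
        y_color = np.asarray(y).sum(axis=1) if np.ndim(y) > 1 else y
        if X.shape[1] >= 2:
            scatter = ax.scatter(X[:, 0], X[:, 1], c=y_color, cmap='tab10', alpha=0.6, s=20)
            ax.set_xlabel('Feature 1')
            ax.set_ylabel('Feature 2')
            plt.colorbar(scatter, ax=ax, shrink=0.8)
        else:
            # Class distribution
            unique_classes, class_counts = np.unique(y_color, return_counts=True)
            ax.bar(unique_classes, class_counts, alpha=0.7)
            ax.set_xlabel('Class')
            ax.set_ylabel('Count')
        
        ax.set_title(f'{dataset_name}\nData Visualization')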
    
    # Add challenge and description as subtitle
    ax.text(0.5, -0.15, f'Challenge: {challenge}\n{description}', 
            ha='center', va='top', transform=ax.transAxes, fontsize=8, style='italic')
    
    plot_idx += 1

# Hide empty subplots
for idx in range(plot_idx, len(axes)):
    axes[idx].set_visible(False)

plt.tight_layout()

# Save special datasets visualization
save_figure(fig, 'special_purpose_datasets',
           'Visualization of special purpose datasets for edge cases', 'special')
plt.show()

# Special datasets summary
print("\n📈 Special Purpose Datasets Summary:")
print("=" * 120)
print(f"{'Dataset':<20} {'Type':<15} {'Shape':<15} {'Challenge':<20} {'Description':<35}")
print("=" * 120)

for name, info in special_datasets.items():
    X, y = info['X'], info['y']
    dataset_type = info['type']
    challenge = info['challenge']
    # Truncate to fit the 35-character column, including the ellipsis
    description = info['description'][:32] + "..." if len(info['description']) > 35 else info['description']
    
    # X.shape → y.shape covers 1-D labels, 2-D multi-label targets, and 3-D time series inputs
    shape_str = f"{X.shape} → {y.shape}"
    
    print(f"{name:<20} {dataset_type:<15} {shape_str:<15} {challenge:<20} {description:<35}")

print("=" * 120)
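
The concept drift dataset above is generated but not evaluated in this section. As a minimal sketch of what drift does to a fixed model, the example below builds its own toy data with NumPy (an assumption; it does not use `generator.concept_drift_classification`), flips the labeling rule at a single hypothetical drift point, and compares accuracy before and after it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
n_per_segment = 500
X_demo = rng.normal(size=(2 * n_per_segment, 2))
drift_point = n_per_segment  # hypothetical single drift point

# Before the drift the label follows the sign of feature 0; afterwards the rule is inverted
y_demo = (X_demo[:, 0] > 0).astype(int)
y_demo[drift_point:] = 1 - y_demo[drift_point:]

# Train only on early, pre-drift samples
model = LogisticRegression().fit(X_demo[:300], y_demo[:300])

# Evaluate on held-out pre-drift samples and on all post-drift samples
acc_before = model.score(X_demo[300:drift_point], y_demo[300:drift_point])
acc_after = model.score(X_demo[drift_point:], y_demo[drift_point:])
print(f"Accuracy before drift: {acc_before:.3f}, after drift: {acc_after:.3f}")
```

A complete inversion is the most severe form of drift; milder settings would show a smaller drop, but the per-segment evaluation pattern stays the same.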

## 7. Comparative Algorithm Analysis {#analysis}

Now let's compare algorithm performance across every single-label classification dataset generated so far: the complexity hierarchy datasets plus the special purpose datasets whose type is `classification`.
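
The analysis cell reads `algorithm_benchmark_results` as a nested dictionary (dataset, then algorithm, then metrics) and skips entries that contain an `'error'` key. The sketch below illustrates that reshaping on a toy dictionary. The accuracy and timing numbers are made-up placeholders, not results from this notebook.

```python
import pandas as pd

# Toy stand-in for algorithm_benchmark_results; all numbers are invented placeholders
toy_results = {
    'Easy_Linear': {
        'Logistic Regression': {'accuracy': 0.95, 'training_time': 0.01},
        'Random Forest': {'accuracy': 0.93, 'training_time': 0.30},
        'SVM': {'error': 'failed to fit'},  # skipped by the 'error' check
    },
    'Moons_Pattern': {
        'Logistic Regression': {'accuracy': 0.85, 'training_time': 0.01},
        'Random Forest': {'accuracy': 0.97, 'training_time': 0.35},
    },
}

# Flatten to one row per (dataset, algorithm) pair
toy_rows = [
    {'Dataset': dataset, 'Algorithm': alg,
     'Accuracy': metrics['accuracy'], 'Training_Time': metrics['training_time']}
    for dataset, algorithms in toy_results.items()
    for alg, metrics in algorithms.items()
    if 'error' not in metrics
]
toy_df = pd.DataFrame(toy_rows)

# Algorithms as rows, datasets as columns: the layout the heatmap expects
toy_pivot = toy_df.pivot_table(index='Algorithm', columns='Dataset',
                               values='Accuracy', aggfunc='mean')
print(toy_pivot)
```

An algorithm that failed on every dataset never appears in the pivot, so a missing heatmap row indicates errors rather than low accuracy.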

In [None]:
# Comprehensive cross-dataset algorithm performance analysis
print("🔬 Comprehensive Cross-Dataset Algorithm Performance Analysis...")

# Collect all classification datasets for comprehensive analysis
all_classification_data = {}

# Add classification complexity hierarchy datasets
for name, info in classification_datasets.items():
    all_classification_data[name] = {
        'X': info['X'], 'y': info['y'],
        'source': 'complexity_hierarchy',
        'level': info['level'],
        'complexity_factors': info.get('complexity_factors', {}),
        'description': info['description'],
        'optimal_algorithm': info.get('optimal_algorithm', 'Unknown')
    }

# Add special purpose classification datasets
for name, info in special_datasets.items():
    if info['type'] == 'classification':
        all_classification_data[name] = {
            'X': info['X'], 'y': info['y'],
            'source': 'special_purpose',
            'challenge': info['challenge'],
            'description': info['description'],
            'optimal_algorithm': info.get('optimal_algorithm', 'Unknown')
        }

print(f"Total datasets for analysis: {len(all_classification_data)}")

# Create ultimate performance analysis visualization
print("📊 Creating Ultimate Performance Analysis Visualization...")

# Prepare data for mega-analysis
all_performance_data = []

for dataset_name, dataset_info in all_classification_data.items():
    X, y = dataset_info['X'], dataset_info['y']
    
    # Get results from benchmark if available
    if dataset_name in algorithm_benchmark_results:
        results = algorithm_benchmark_results[dataset_name]
        
        for alg_name, metrics in results.items():
            if 'error' not in metrics:
                row = {
                    'Dataset': dataset_name,
                    'Algorithm': alg_name,
                    'Source': dataset_info['source'],
                    'Accuracy': metrics['accuracy'],
                    'Training_Time': metrics['training_time'],
                    'Total_Time': metrics['total_time'],
                    'Generalization_Gap': metrics['generalization_gap'],
                    'Optimal_Algorithm': dataset_info['optimal_algorithm']
                }
                
                # Add dataset-specific information
                if 'level' in dataset_info:
                    row['Complexity_Level'] = dataset_info['level']
                if 'challenge' in dataset_info:
                    row['Challenge'] = dataset_info['challenge']
                
                all_performance_data.append(row)

mega_df = pd.DataFrame(all_performance_data)

# Create the ultimate visualization (2x2 grid)
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Overall algorithm performance heatmap
if not mega_df.empty:
    perf_pivot = mega_df.pivot_table(index='Algorithm', columns='Dataset', values='Accuracy', aggfunc='mean')
    sns.heatmap(perf_pivot, annot=False, cmap='RdYlGn', center=0.7, 
                cbar_kws={'label': 'Accuracy'}, ax=axes[0, 0])
    axes[0, 0].set_title('Ultimate Algorithm Performance Heatmap\nAccuracy Across All Datasets')
    axes[0, 0].set_ylabel('Algorithm')

# 2. Performance vs efficiency trade-off
if not mega_df.empty:
    efficiency_data = mega_df.groupby('Algorithm').agg({
        'Accuracy': 'mean',
        'Training_Time': 'mean'
    }).reset_index()
    
    scatter = axes[0, 1].scatter(efficiency_data['Training_Time'], efficiency_data['Accuracy'], 
                                s=100, alpha=0.7, c=range(len(efficiency_data)), cmap='viridis')
    
    # Add algorithm labels
    for _, row in efficiency_data.iterrows():
        axes[0, 1].annotate(row['Algorithm'], (row['Training_Time'], row['Accuracy']), 
                           xytext=(3, 3), textcoords='offset points', fontsize=8)
    
    axes[0, 1].set_xlabel('Average Training Time (seconds)')
    axes[0, 1].set_ylabel('Average Accuracy')
    axes[0, 1].set_title('Algorithm Efficiency Analysis\n(Top-left is optimal: High accuracy, Low time)')
    axes[0, 1].grid(True, alpha=0.3)

# 3. Performance by dataset source
if not mega_df.empty:
    source_perf = mega_df.groupby(['Source', 'Algorithm'])['Accuracy'].mean().reset_index()
    
    pivot_source = source_perf.pivot(index='Algorithm', columns='Source', values='Accuracy')
    if not pivot_source.empty:
        pivot_source.plot(kind='bar', ax=axes[1, 0], width=0.8)
        axes[1, 0].set_title('Performance by Dataset Source\nComplexity Hierarchy vs Special Purpose')
        axes[1, 0].set_ylabel('Average Accuracy')
        axes[1, 0].tick_params(axis='x', rotation=45)
        axes[1, 0].legend(title='Dataset Source')
        axes[1, 0].grid(True, alpha=0.3)

# 4. Algorithm vs Optimal Algorithm comparison
if not mega_df.empty:
    # Compare actual performance vs optimal algorithms
    optimal_performance = []
    
    for dataset in mega_df['Dataset'].unique():
        dataset_data = mega_df[mega_df['Dataset'] == dataset]
        optimal_alg = dataset_data['Optimal_Algorithm'].iloc[0]
        
        best_actual_performance = dataset_data['Accuracy'].max()
        best_actual_alg = dataset_data.loc[dataset_data['Accuracy'].idxmax(), 'Algorithm']
        
        # Check if optimal algorithm was tested
        if optimal_alg in dataset_data['Algorithm'].values:
            optimal_performance_score = dataset_data[dataset_data['Algorithm'] == optimal_alg]['Accuracy'].iloc[0]
        else:
            optimal_performance_score = None
        
        optimal_performance.append({
            'Dataset': dataset,
            'Optimal_Algorithm': optimal_alg,
            'Best_Actual_Algorithm': best_actual_alg,
            'Best_Actual_Performance': best_actual_performance,
            'Optimal_Performance': optimal_performance_score,
            'Match': optimal_alg == best_actual_alg
        })
    
    optimal_df = pd.DataFrame(optimal_performance)
    
    # Show match percentage
    match_rate = optimal_df['Match'].mean()
    
    # Create bar chart of optimal vs actual
    datasets_subset = optimal_df.head(8)  # Show first 8 for readability
    x_pos = np.arange(len(datasets_subset))
    
    bars1 = axes[1, 1].bar(x_pos - 0.2, datasets_subset['Best_Actual_Performance'], 
                          0.4, label='Best Actual', alpha=0.7)
    
    # Missing scores are stored as NaN (not None) once they are in a DataFrame, so fill them explicitly
    optimal_scores = pd.to_numeric(datasets_subset['Optimal_Performance'], errors='coerce').fillna(0)
    bars2 = axes[1, 1].bar(x_pos + 0.2, optimal_scores, 
                          0.4, label='Predicted Optimal', alpha=0.7)
    
    axes[1, 1].set_xlabel('Dataset')
    axes[1, 1].set_ylabel('Accuracy')
    axes[1, 1].set_title(f'Predicted vs Actual Best Performance\nMatch Rate: {match_rate:.1%}')
    axes[1, 1].set_xticks(x_pos)
    axes[1, 1].set_xticklabels([d[:10] + '...' if len(d) > 10 else d for d in datasets_subset['Dataset']], 
                              rotation=45, ha='right')
    axes[1, 1].legend()
    axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()

# Save ultimate analysis visualization
save_figure(fig, 'ultimate_algorithm_analysis',
           'Ultimate algorithm performance analysis across all datasets', 'analysis')
plt.show()

# Performance insights
if not mega_df.empty:
    print("\n🎯 Key Performance Insights:")
    print("=" * 60)
    
    # Best overall algorithm
    best_overall = mega_df.groupby('Algorithm')['Accuracy'].mean().idxmax()
    best_score = mega_df.groupby('Algorithm')['Accuracy'].mean().max()
    print(f"🏆 Best Overall Algorithm: {best_overall} ({best_score:.4f})")
    
    # Most consistent algorithm
    algo_consistency = mega_df.groupby('Algorithm')['Accuracy'].std()
    most_consistent = algo_consistency.idxmin()
    print(f"📊 Most Consistent: {most_consistent} (std: {algo_consistency.min():.4f})")
    
    # Fastest algorithm
    fastest = mega_df.groupby('Algorithm')['Training_Time'].mean().idxmin()
    fastest_time = mega_df.groupby('Algorithm')['Training_Time'].mean().min()
    print(f"⚡ Fastest Training: {fastest} ({fastest_time:.4f}s)")
    
    # Algorithm-dataset matching analysis
    if 'optimal_df' in locals():
        print(f"🔍 Optimal Algorithm Prediction Accuracy: {match_rate:.1%}")
        
        # Most frequently optimal algorithms
        optimal_counts = optimal_df['Optimal_Algorithm'].value_counts()
        print("🎯 Most Frequently Optimal Algorithms:")
        for alg, count in optimal_counts.head(3).items():
            print(f"   {alg}: {count} datasets")
    
    print("=" * 60)

# Save comprehensive analysis results
comprehensive_analysis_results = {
    'total_datasets_analyzed': len(all_classification_data),
    'total_algorithm_evaluations': len(mega_df) if not mega_df.empty else 0,
    'best_overall_algorithm': best_overall if not mega_df.empty else 'N/A',
    'best_overall_score': float(best_score) if not mega_df.empty else 0,
    'algorithm_rankings': mega_df.groupby('Algorithm')['Accuracy'].mean().sort_values(ascending=False).to_dict() if not mega_df.empty else {},
    'dataset_sources': list(all_classification_data.keys()) if all_classification_data else [],
    'optimal_algorithm_prediction_accuracy': float(match_rate) if 'optimal_df' in locals() else 0
}

save_experiment_results('comprehensive_analysis_results', comprehensive_analysis_results,
                       'Complete analysis results across all datasets and algorithms', 'analysis')

## 8. Interactive Dataset Explorer {#explorer}

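Let's build a small exploration tool for our datasets: a `DatasetExplorer` class that summarizes a dataset, ranks algorithms using the benchmark results, and finds similar datasets.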
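
The explorer's `dataset_similarity_analysis` method scores each pair of quantities (sample count, feature count, class count) as `1 - |a - b| / max(a, b)` and averages the parts. The sketch below isolates that formula. The helper names are hypothetical, and the example assumes both datasets are binary.

```python
def relative_similarity(a, b):
    """Return 1 - |a - b| / max(a, b): 1.0 for equal values, near 0 for very different ones."""
    largest = max(a, b)
    return 1.0 if largest == 0 else 1 - abs(a - b) / largest


def shape_similarity(shape_a, n_classes_a, shape_b, n_classes_b):
    """Average the similarity of sample counts, feature counts, and class counts."""
    parts = [
        relative_similarity(shape_a[0], shape_b[0]),   # samples
        relative_similarity(shape_a[1], shape_b[1]),   # features
        relative_similarity(n_classes_a, n_classes_b)  # classes
    ]
    return sum(parts) / len(parts)


# Shapes of Extreme_Imbalance (2000 x 20) and High_Dim_Sparse (1000 x 500)
score = shape_similarity((2000, 20), 2, (1000, 500), 2)
print(f"Similarity: {score:.3f}")
```

Here the parts are 0.5 (samples), 0.04 (features), and 1.0 (classes), so the large gap in feature count pulls the average down to about 0.513 even though the class counts match.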

In [None]:
# Interactive dataset explorer and recommendation system
print("🔍 Creating Interactive Dataset Explorer and Recommendation System...")

class DatasetExplorer:
    """Interactive dataset exploration and algorithm recommendation system."""
    
    def __init__(self, datasets_dict, algorithm_results):
        self.datasets = datasets_dict
        self.results = algorithm_results
        self.recommendations = {}
        self._build_recommendation_system()
    
    def _build_recommendation_system(self):
        """Build algorithm recommendation system based on performance results."""
        for dataset_name, algorithms in self.results.items():
            valid_results = {alg: metrics for alg, metrics in algorithms.items() if 'error' not in metrics}
            
            if valid_results:
                # Rank algorithms by accuracy
                acc_sorted = sorted(valid_results.items(), key=lambda x: x[1]['accuracy'], reverse=True)
                
                self.recommendations[dataset_name] = {
                    'best_overall': acc_sorted[0][0],
                    'best_score': acc_sorted[0][1]['accuracy'],
                    'all_results': valid_results,
                    'top_3': acc_sorted[:3]
                }
    
    def explore_dataset(self, dataset_name):
        """Explore a specific dataset with detailed analysis."""
        if dataset_name not in self.datasets:
            print(f"❌ Dataset '{dataset_name}' not found!")
            available = list(self.datasets.keys())[:5]
            print(f"Available datasets: {', '.join(available)}{'...' if len(self.datasets) > 5 else ''}")
            return
        
        dataset_info = self.datasets[dataset_name]
        X, y = dataset_info['X'], dataset_info['y']
        
        print(f"\n🔍 DATASET EXPLORATION: {dataset_name}")
        print("=" * 60)
        
        # Basic statistics
        print(f"📊 Basic Statistics:")
        print(f"  • Shape: {X.shape}")
        print(f"  • Classes: {len(np.unique(y))} {np.unique(y)}")
        print(f"  • Class distribution: {dict(zip(*np.unique(y, return_counts=True)))}")
        print(f"  • Features: {X.shape[1]}")
        
        # Feature statistics
        print(f"\n📈 Feature Analysis:")
        print(f"  • Feature means range: [{np.mean(X, axis=0).min():.3f}, {np.mean(X, axis=0).max():.3f}]")
        print(f"  • Feature std range: [{np.std(X, axis=0).min():.3f}, {np.std(X, axis=0).max():.3f}]")
        
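        # Calculate additional metrics
        # np.unique (rather than np.bincount) handles non-contiguous, negative, or non-integer labels
        _, class_counts = np.unique(y, return_counts=True)
        class_balance = class_counts.min() / class_counts.max()
        # Average absolute correlation over off-diagonal pairs only; the diagonal is always 1
        if X.shape[1] > 1:
            corr = np.abs(np.corrcoef(X.T))
            feature_correlation = np.nanmean(corr[~np.eye(corr.shape[0], dtype=bool)])
        else:
            feature_correlation = 0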
        
        print(f"  • Class balance ratio: {class_balance:.3f}")
        print(f"  • Average feature correlation: {feature_correlation:.3f}")
        
        # Dataset characteristics
        if 'description' in dataset_info:
            print(f"\n📝 Description: {dataset_info['description']}")
        
        if 'challenge' in dataset_info:
            print(f"🎯 Challenge: {dataset_info['challenge']}")
        
        if 'optimal_algorithm' in dataset_info:
            print(f"🏆 Predicted Optimal Algorithm: {dataset_info['optimal_algorithm']}")
        
        # Algorithm recommendations
        if dataset_name in self.recommendations:
            recs = self.recommendations[dataset_name]
            print(f"\n🤖 Algorithm Performance Results:")
            print(f"  🏆 Best Performing: {recs['best_overall']} (Accuracy: {recs['best_score']:.4f})")
            
            # Top 3 by accuracy
            print(f"\n  📋 Top 3 Performers:")
            for i, (alg, metrics) in enumerate(recs['top_3'], 1):
                cv_info = f" (CV: {metrics['cv_mean']:.3f})" if metrics.get('cv_mean', 0) > 0 else ""
                time_info = f", Time: {metrics['training_time']:.3f}s" if 'training_time' in metrics else ""
                print(f"    {i}. {alg}: {metrics['accuracy']:.4f}{cv_info}{time_info}")
        
        # Data quality assessment
        print(f"\n🔍 Data Quality Assessment:")
        
        # Assess difficulty based on various factors
        difficulty_factors = []
        if class_balance < 0.1:
            difficulty_factors.append('Severe class imbalance')
        elif class_balance < 0.5:
            difficulty_factors.append('Moderate class imbalance')
        
        if X.shape[1] > 100:
            difficulty_factors.append('High dimensionality')
        
        if feature_correlation > 0.8:
            difficulty_factors.append('High feature correlation')
        
        if len(difficulty_factors) > 0:
            print(f"  ⚠️  Challenges: {', '.join(difficulty_factors)}")
        else:
            print(f"  ✅ Standard difficulty dataset")
        
        # Recommendations based on characteristics
        print(f"\n💡 Recommendations:")
        if class_balance < 0.1:
            print(f"  • Consider resampling techniques (SMOTE, ADASYN)")
            print(f"  • Use metrics like F1-score, AUC-ROC instead of accuracy")
        
        if X.shape[1] > 50:
            print(f"  • Consider dimensionality reduction (PCA, feature selection)")
            print(f"  • Linear models may perform well in high dimensions")
        
        if feature_correlation > 0.7:
            print(f"  • Consider regularized models (Ridge, Lasso)")
            print(f"  • Principal Component Analysis may help")
    
    def find_best_for_criteria(self, criteria='accuracy', top_k=5):
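        """Find best algorithms for specific criteria across all datasets."""
        valid_criteria = ('accuracy', 'speed', 'consistency', 'confidence')
        if criteria not in valid_criteria:
            raise ValueError(f"Unknown criteria '{criteria}'; expected one of {valid_criteria}")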
        print(f"\n🏆 TOP {top_k} ALGORITHMS BY {criteria.upper()}")
        print("=" * 60)
        
        algorithm_scores = {}
        
        for dataset_name, algorithms in self.results.items():
            for alg, results in algorithms.items():
                if 'error' not in results:
                    if alg not in algorithm_scores:
                        algorithm_scores[alg] = []
                    
                    if criteria == 'accuracy':
                        algorithm_scores[alg].append(results['accuracy'])
                    elif criteria == 'speed':
                        algorithm_scores[alg].append(1.0 / (results.get('total_time', results.get('training_time', 1)) + 1e-6))
                    elif criteria == 'consistency':
                        gap = abs(results.get('generalization_gap', 0))
                        algorithm_scores[alg].append(1.0 / (gap + 1e-6))
                    elif criteria == 'confidence':
                        algorithm_scores[alg].append(results.get('confidence', 0.5))
        
        # Calculate average scores
        avg_scores = {alg: np.mean(scores) for alg, scores in algorithm_scores.items()}
        
        # Sort and display top k
        sorted_algorithms = sorted(avg_scores.items(), key=lambda x: x[1], reverse=True)
        
        for i, (alg, score) in enumerate(sorted_algorithms[:top_k], 1):
            std_score = np.std(algorithm_scores[alg])
            n_datasets = len(algorithm_scores[alg])
            print(f"  {i}. {alg:<20}: {score:.4f} ± {std_score:.4f} ({n_datasets} datasets)")
        
        return sorted_algorithms[:top_k]
    
    def dataset_similarity_analysis(self, target_dataset):
        """Find datasets similar to the target dataset."""
        if target_dataset not in self.datasets:
            print(f"❌ Dataset '{target_dataset}' not found!")
            return
        
        target_info = self.datasets[target_dataset]
        target_X, target_y = target_info['X'], target_info['y']
        
        similarities = []
        
        for dataset_name, dataset_info in self.datasets.items():
            if dataset_name == target_dataset:
                continue
            
            X, y = dataset_info['X'], dataset_info['y']
            
            # Calculate similarity based on multiple factors
            similarity_score = 0
            factors = 0
            
            # Shape similarity
            shape_sim = 1 - abs(X.shape[0] - target_X.shape[0]) / max(X.shape[0], target_X.shape[0])
            feature_sim = 1 - abs(X.shape[1] - target_X.shape[1]) / max(X.shape[1], target_X.shape[1])
            class_sim = 1 - abs(len(np.unique(y)) - len(np.unique(target_y))) / max(len(np.unique(y)), len(np.unique(target_y)))
            
            similarity_score += (shape_sim + feature_sim + class_sim)
            factors += 3
            
            # Challenge similarity
            if 'challenge' in target_info and 'challenge' in dataset_info:
                if target_info['challenge'] == dataset_info['challenge']:
                    similarity_score += 1
                factors += 1
            
            # Level similarity
            if 'level' in target_info and 'level' in dataset_info:
                level_sim = 1 - abs(target_info['level'] - dataset_info['level']) / 4  # Max level is 4
                similarity_score += level_sim
                factors += 1
            
            avg_similarity = similarity_score / factors if factors > 0 else 0
            similarities.append((dataset_name, avg_similarity, dataset_info.get('description', 'No description')))
        
        # Sort by similarity
        similarities.sort(key=lambda x: x[1], reverse=True)
        
        print(f"\n🔍 Datasets Most Similar to {target_dataset}:")
        print("=" * 60)
        
        for i, (name, sim_score, description) in enumerate(similarities[:5], 1):
            print(f"{i}. {name} (Similarity: {sim_score:.3f})")
            print(f"   {description}")
            print()
        
        return similarities[:5]
    
    def generate_comprehensive_report(self):
        """Generate comprehensive dataset analysis report."""
        report = "\n" + "="*80 + "\n"
        report += "COMPREHENSIVE DATASET ANALYSIS REPORT\n"
        report += "="*80 + "\n\n"
        
        # Dataset portfolio summary
        total_datasets = len(self.datasets)
        total_samples = sum(info['X'].shape[0] for info in self.datasets.values())
        avg_features = np.mean([info['X'].shape[1] for info in self.datasets.values()])
        
        report += f"📊 Dataset Portfolio Summary:\n"
        report += f"  • Total datasets: {total_datasets}\n"
        report += f"  • Total samples: {total_samples:,}\n"
        report += f"  • Average features: {avg_features:.1f}\n\n"
        
        # Algorithm performance summary
        if self.recommendations:
            all_algorithms = set()
            for recs in self.recommendations.values():
                all_algorithms.update(recs['all_results'].keys())
            
            algorithm_wins = {alg: 0 for alg in all_algorithms}
            algorithm_scores = {alg: [] for alg in all_algorithms}
            
            for dataset_name, recs in self.recommendations.items():
                best_alg = recs['best_overall']
                algorithm_wins[best_alg] += 1
                
                for alg, metrics in recs['all_results'].items():
                    algorithm_scores[alg].append(metrics['accuracy'])
            
            report += f"🏆 Algorithm Championship:\n"
            sorted_wins = sorted(algorithm_wins.items(), key=lambda x: x[1], reverse=True)
            for i, (alg, wins) in enumerate(sorted_wins[:5], 1):
                win_rate = wins / len(self.recommendations) * 100
                avg_score = np.mean(algorithm_scores[alg]) if algorithm_scores[alg] else 0
                report += f"  {i}. {alg:<20}: {wins} wins ({win_rate:.1f}%), Avg Score: {avg_score:.3f}\n"
            
            report += "\n"
        
        # Dataset complexity distribution
        complexity_dist = {}
        for info in self.datasets.values():
            level = info.get('level', 'special')
            complexity_dist[level] = complexity_dist.get(level, 0) + 1
        
        report += f"📈 Dataset Complexity Distribution:\n"
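        # Levels are numeric, but datasets without a level fall back to the string 'special';
        # sort on the string form so mixed key types do not raise a TypeError
        for level, count in sorted(complexity_dist.items(), key=lambda item: str(item[0])):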
            percentage = (count / total_datasets) * 100
            report += f"  Level {level}: {count} datasets ({percentage:.1f}%)\n"
        
        report += "\n" + "="*80 + "\n"
        
        return report
    
    def list_available_datasets(self):
        """List all available datasets with brief descriptions."""
        print("\n📚 Available Datasets:")
        print("=" * 80)
        
        # Group by source
        regression_datasets = []
        classification_datasets = []
        clustering_datasets = []
        special_datasets = []
        
        for name, info in self.datasets.items():
            dataset_type = info.get('type', 'classification')
            description = info.get('description', 'No description')
            shape = info['X'].shape
            
            entry = f"{name:<25} {str(shape):<15} {description[:40]}"
            
            if 'regression' in name.lower() or dataset_type == 'regression':
                regression_datasets.append(entry)
            elif 'clustering' in name.lower() or dataset_type == 'clustering':
                clustering_datasets.append(entry)
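            # Special purpose entries carry a 'challenge' key; 'source' is only set in the analysis cell
            elif info.get('source') == 'special_purpose' or 'challenge' in info: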
                special_datasets.append(entry)
            else:
                classification_datasets.append(entry)
        
        if classification_datasets:
            print("\n🎯 Classification Datasets:")
            for entry in classification_datasets:
                print(f"  {entry}")
        
        if regression_datasets:
            print("\n📈 Regression Datasets:")
            for entry in regression_datasets:
                print(f"  {entry}")
        
        if clustering_datasets:
            print("\n🔗 Clustering Datasets:")
            for entry in clustering_datasets:
                print(f"  {entry}")
        
        if special_datasets:
            print("\n⚡ Special Purpose Datasets:")
            for entry in special_datasets:
                print(f"  {entry}")
        
        print("\n💡 Use explorer.explore_dataset('dataset_name') for detailed analysis")

# Initialize the explorer
print("🔧 Initializing Dataset Explorer...")

all_dataset_info = {}
all_dataset_info.update(classification_datasets)
# Only single-label classification datasets from the special purpose set
all_dataset_info.update({name: info for name, info in special_datasets.items() if info['type'] == 'classification'})
all_dataset_info.update(regression_datasets)
all_dataset_info.update(clustering_datasets)

explorer = DatasetExplorer(all_dataset_info, algorithm_benchmark_results)

print("✅ Dataset Explorer initialized!")
print(f"📊 Total datasets available: {len(all_dataset_info)}")

In [None]:
# Demonstrate the explorer functionality
print("\n🎬 DATASET EXPLORER DEMONSTRATION")
print("=" * 50)

# 1. List available datasets
explorer.list_available_datasets()

# 2. Explore specific datasets
example_datasets = ['Easy_Linear', 'Moons_Pattern', 'Extreme_Imbalance']

for dataset_name in example_datasets:
    if dataset_name in all_dataset_info:
        explorer.explore_dataset(dataset_name)
        
        # Show similarity analysis for first dataset
        if dataset_name == example_datasets[0]:
            explorer.dataset_similarity_analysis(dataset_name)

# 3. Find best algorithms by different criteria
criteria_list = ['accuracy', 'speed', 'consistency']
for criteria in criteria_list:
    top_algorithms = explorer.find_best_for_criteria(criteria, top_k=3)

# 4. Generate comprehensive report
comprehensive_report = explorer.generate_comprehensive_report()
print(comprehensive_report)

# Save the comprehensive report
save_report(comprehensive_report, 'dataset_explorer_report',
           'Comprehensive dataset analysis and exploration report', 'explorer')

print("\n✨ Dataset Explorer demonstration complete!")

## 9. Performance Benchmarking {#benchmarking}

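A benchmarking system that records accuracy, precision, recall, F1, cross-validation scores, wall-clock time, and process memory for a subset of classifiers on the smaller classification datasets.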
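
The idea behind the timing and memory measurements can be illustrated without scikit-learn. The following is a minimal sketch, not the notebook's implementation: `measure_call` and `build_squares` are hypothetical names, and the sketch swaps the `psutil` resident-set-size deltas used in the next cell for the standard library's `tracemalloc`, which only sees allocations made through Python's allocator.

```python
import time
import tracemalloc


def measure_call(func, *args, **kwargs):
    """Run func once and report wall-clock time and peak traced memory.

    Hypothetical helper for illustration; not part of the notebook.
    """
    tracemalloc.start()
    start = time.perf_counter()  # monotonic, high-resolution clock
    try:
        value = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return {
        'value': value,
        'elapsed_s': elapsed,
        'peak_mb': peak_bytes / 1024 / 1024,
    }


def build_squares(n):
    """Stand-in workload: allocate a list of n squared integers."""
    return [i * i for i in range(n)]


stats = measure_call(build_squares, 100_000)
print(f"Time: {stats['elapsed_s']:.4f}s, peak: {stats['peak_mb']:.2f}MB")
```

In a benchmark, the same pattern would wrap `algorithm.fit` and `algorithm.predict` separately so that training and prediction costs are reported on their own.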

In [None]:
# Advanced benchmarking system
print("⚡ Advanced Performance Benchmarking System...")

class PerformanceBenchmark:
    """Comprehensive performance benchmarking system."""
    
    def __init__(self):
        self.benchmark_results = {}
        self.timing_results = {}
        self.memory_results = {}
    
    def benchmark_algorithm(self, algorithm, X_train, X_test, y_train, y_test, 
                          algorithm_name, dataset_name):
        """Comprehensive benchmarking of a single algorithm."""
        import psutil
        import gc
        
        results = {}
        
        # Memory before training
        process = psutil.Process()
        memory_before = process.memory_info().rss / 1024 / 1024  # MB
        
        try:
            # Training phase
            gc.collect()  # Clean memory
            start_time = time.perf_counter()  # monotonic clock, better suited to short intervals
            algorithm.fit(X_train, y_train)
            training_time = time.perf_counter() - start_time
            
            # Memory after training
            memory_after_training = process.memory_info().rss / 1024 / 1024  # MB
            
            # Prediction phase
            start_time = time.perf_counter()
            predictions = algorithm.predict(X_test)
            prediction_time = time.perf_counter() - start_time
            
            # Memory after prediction
            memory_after_prediction = process.memory_info().rss / 1024 / 1024  # MB
            
            # Performance metrics
            if hasattr(algorithm, 'predict_proba'):
                probabilities = algorithm.predict_proba(X_test)
                confidence = np.mean(np.max(probabilities, axis=1))
            else:
                confidence = None
            
            # Classification metrics
            # Import up front so the names are bound on both the binary and multiclass paths
            from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
            
            if len(np.unique(y_test)) > 1:
                accuracy = accuracy_score(y_test, predictions)
                precision = precision_score(y_test, predictions, average='weighted', zero_division=0)
                recall = recall_score(y_test, predictions, average='weighted', zero_division=0)
                f1 = f1_score(y_test, predictions, average='weighted', zero_division=0)
                
                # ROC AUC only for binary problems with probability estimates
                auc = None
                if len(np.unique(y_test)) == 2 and confidence is not None:
                    try:
                        auc = roc_auc_score(y_test, probabilities[:, 1])
                    except Exception:
                        auc = None
            else:
                accuracy = precision = recall = f1 = auc = 0.0
            
            # Cross-validation
            try:
                cv_scores = cross_val_score(algorithm, X_train, y_train, cv=3, scoring='accuracy')
                cv_mean = cv_scores.mean()
                cv_std = cv_scores.std()
            except Exception:  # a bare except would also swallow KeyboardInterrupt
                cv_mean = cv_std = 0.0
            
            # Compile results
            results = {
                'accuracy': accuracy,
                'precision': precision,
                'recall': recall,
                'f1_score': f1,
                'auc_score': auc,
                'confidence': confidence,
                'cv_mean': cv_mean,
                'cv_std': cv_std,
                'training_time': training_time,
                'prediction_time': prediction_time,
                'total_time': training_time + prediction_time,
                'memory_usage_mb': memory_after_training - memory_before,  # RSS delta; noisy, can be negative
                'memory_peak_mb': memory_after_prediction - memory_before,  # RSS delta after prediction, not a true peak
                'samples_per_second': len(X_test) / prediction_time if prediction_time > 0 else 0,
                'model_size_params': self._estimate_model_size(algorithm)
            }
            
        except Exception as e:
            results = {'error': str(e)}
        
        return results
    
    def _estimate_model_size(self, model):
        """Estimate model complexity/size."""
        if hasattr(model, 'coef_'):
            return np.size(model.coef_)
        elif hasattr(model, 'n_estimators'):
            return model.n_estimators
        elif hasattr(model, 'support_vectors_'):
            return len(model.support_vectors_)
        elif hasattr(model, 'hidden_layer_sizes'):
            # Counts hidden units, not weights; np.atleast_1d also handles a bare int
            return int(np.sum(np.atleast_1d(model.hidden_layer_sizes)))
        else:
            return 1  # Simple model
    
    def run_comprehensive_benchmark(self, datasets, algorithms, test_size=0.3):
        """Run comprehensive benchmark across all datasets and algorithms."""
        print("🚀 Running Comprehensive Performance Benchmark...")
        
        for dataset_name, dataset_info in datasets.items():
            print(f"\n--- Benchmarking on {dataset_name} ---")
            
            X, y = dataset_info['X'], dataset_info['y']
            
            # Skip if dataset is too large for benchmarking
            if X.shape[0] > 5000:
                print(f"  Skipping {dataset_name} - too large for detailed benchmarking")
                continue
            
            # Split data
            X_train, X_test, y_train, y_test = train_test_split(
                X, y, test_size=test_size, random_state=42, 
                stratify=y if len(np.unique(y)) > 1 else None
            )
            
            # Scale features
            scaler = StandardScaler()
            X_train_scaled = scaler.fit_transform(X_train)
            X_test_scaled = scaler.transform(X_test)
            
            dataset_results = {}
            
            for alg_name, alg_info in algorithms.items():
                print(f"  Testing {alg_name}...")
                
                # Create a fresh, unfitted copy; clone also handles nested estimators such as pipelines
                from sklearn.base import clone
                algorithm = clone(alg_info['model'])
                
                # Benchmark
                results = self.benchmark_algorithm(
                    algorithm, X_train_scaled, X_test_scaled, y_train, y_test,
                    alg_name, dataset_name
                )
                
                dataset_results[alg_name] = results
                
                if 'error' not in results:
                    print(f"    Accuracy: {results['accuracy']:.4f}, "
                          f"Time: {results['total_time']:.3f}s, "
                          f"Memory: {results['memory_usage_mb']:.1f}MB")
                else:
                    print(f"    ❌ Failed: {results['error']}")
            
            self.benchmark_results[dataset_name] = dataset_results
        
        print("\n✨ Comprehensive benchmark complete!")
        return self.benchmark_results
    
    def generate_benchmark_report(self):
        """Generate comprehensive benchmark report."""
        if not self.benchmark_results:
            return "No benchmark results available."
        
        report = "\n" + "="*100 + "\n"
        report += "COMPREHENSIVE PERFORMANCE BENCHMARK REPORT\n"
        report += "="*100 + "\n\n"
        
        # Overall statistics
        total_tests = sum(len(algorithms) for algorithms in self.benchmark_results.values())
        successful_tests = sum(
            len([alg for alg, results in algorithms.items() if 'error' not in results])
            for algorithms in self.benchmark_results.values()
        )
        
        report += f"📊 Benchmark Overview:\n"
        report += f"  • Total tests: {total_tests}\n"
        report += f"  • Successful tests: {successful_tests}\n"
        success_rate = successful_tests / total_tests * 100 if total_tests else 0.0  # avoid ZeroDivisionError
        report += f"  • Success rate: {success_rate:.1f}%\n\n"
        
        # Performance rankings
        all_results = []
        for dataset_name, algorithms in self.benchmark_results.items():
            for alg_name, results in algorithms.items():
                if 'error' not in results:
                    all_results.append({
                        'dataset': dataset_name,
                        'algorithm': alg_name,
                        **results
                    })
        
        if all_results:
            df = pd.DataFrame(all_results)
            
            # Top performers by accuracy
            report += f"🏆 Top 10 Performers by Accuracy:\n"
            top_accuracy = df.nlargest(10, 'accuracy')
            for i, (_, row) in enumerate(top_accuracy.iterrows(), 1):
                report += f"  {i:2d}. {row['algorithm']} on {row['dataset']}: {row['accuracy']:.4f}\n"
            
            # Fastest algorithms
            report += f"\n⚡ Fastest Algorithms (by total time):\n"
            fastest = df.nsmallest(10, 'total_time')
            for i, (_, row) in enumerate(fastest.iterrows(), 1):
                report += f"  {i:2d}. {row['algorithm']} on {row['dataset']}: {row['total_time']:.4f}s\n"
            
            # Most memory efficient
            report += f"\n💾 Most Memory Efficient:\n"
            memory_efficient = df.nsmallest(10, 'memory_usage_mb')
            for i, (_, row) in enumerate(memory_efficient.iterrows(), 1):
                report += f"  {i:2d}. {row['algorithm']} on {row['dataset']}: {row['memory_usage_mb']:.1f}MB\n"
            
            # Algorithm summary
            report += f"\n📈 Algorithm Performance Summary:\n"
            algo_summary = df.groupby('algorithm').agg({
                'accuracy': ['mean', 'std'],
                'total_time': ['mean', 'std'],
                'memory_usage_mb': ['mean', 'std']
            }).round(4)
            
            for alg in algo_summary.index:
                acc_mean = algo_summary.loc[alg, ('accuracy', 'mean')]
                acc_std = algo_summary.loc[alg, ('accuracy', 'std')]
                time_mean = algo_summary.loc[alg, ('total_time', 'mean')]
                report += f"  {alg}: Acc {acc_mean:.4f}±{acc_std:.4f}, Time {time_mean:.4f}s\n"
        
        report += "\n" + "="*100 + "\n"
        return report

# Run comprehensive benchmark
print("\n--- Running Performance Benchmark ---")

# Use a subset of algorithms for detailed benchmarking
benchmark_algorithms = {
    'Logistic Regression': classification_algorithms['Logistic Regression'],
    'Random Forest': classification_algorithms['Random Forest'],
    'SVM (RBF)': classification_algorithms['SVM (RBF)'],
    'Gradient Boosting': classification_algorithms['Gradient Boosting']
}

# Use a subset of datasets for benchmarking
benchmark_datasets = {name: info for name, info in classification_datasets.items() 
                     if info['X'].shape[0] <= 2000}  # Limit to smaller datasets

# Initialize and run benchmark
benchmark = PerformanceBenchmark()
benchmark_results = benchmark.run_comprehensive_benchmark(benchmark_datasets, benchmark_algorithms)

# Generate and display report
benchmark_report = benchmark.generate_benchmark_report()
print(benchmark_report)

# Save benchmark results
save_experiment_results('performance_benchmark', benchmark_results,
                       'Comprehensive performance benchmark results', 'benchmarking')

save_report(benchmark_report, 'performance_benchmark_report',
           'Detailed performance benchmark analysis', 'benchmarking')

print("\n✨ Performance benchmarking complete!")
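
The benchmark reports precision, recall, and F1 with `average='weighted'`: each class's score is weighted by its share of the true labels. A minimal pure-Python sketch of weighted precision follows. `weighted_precision` is a hypothetical name used only for illustration, and a class that is never predicted contributes a precision of 0, mirroring `zero_division=0`.

```python
from collections import Counter


def weighted_precision(y_true, y_pred):
    """Support-weighted average of per-class precision (illustrative only)."""
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predicted samples per class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        n_pred = predicted.get(label, 0)
        # A class that is never predicted gets precision 0
        precision = true_pos.get(label, 0) / n_pred if n_pred else 0.0
        score += precision * n_true / total
    return score


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
result = weighted_precision(y_true, y_pred)
print(f"Weighted precision: {result:.4f}")
```

Here the per-class precisions are 1/2, 2/3, and 1, each with weight 1/3, so on balanced data the weighted average reduces to the plain mean. On imbalanced datasets the majority class dominates the figure.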

## 10. Comprehensive Results Saving {#saving}

This section saves the trained models, the generated datasets, a final summary, and a master report, each with descriptive metadata.
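
Names such as `SVM (RBF)` end up inside file names when models are saved per dataset and algorithm. The sketch below shows one way to normalise such names and to store a small metadata package next to the data. It is an illustration under stated assumptions: `safe_filename` and `save_package` are hypothetical helpers, and `pickle` stands in for `joblib` so that the example needs only the standard library.

```python
import pickle
import re
import tempfile
from pathlib import Path


def safe_filename(name):
    """Replace characters outside [A-Za-z0-9_.-] with single underscores."""
    cleaned = re.sub(r'[^A-Za-z0-9_.-]+', '_', name)
    return re.sub(r'_+', '_', cleaned).strip('_')


def save_package(directory, name, data, metadata):
    """Write data and metadata together as one pickled dictionary."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # create missing folders
    filepath = directory / f"{safe_filename(name)}.pkl"
    with open(filepath, 'wb') as handle:
        pickle.dump({'data': data, 'metadata': metadata}, handle)
    return filepath


with tempfile.TemporaryDirectory() as tmp:
    path = save_package(
        Path(tmp) / 'data_generation',
        'Moons_Pattern_SVM (RBF)_classification',
        data=[[0.1, 0.2], [0.3, 0.4]],
        metadata={'type': 'classification', 'shape': (2, 2)},
    )
    saved_name = path.name
    with open(path, 'rb') as handle:
        restored = pickle.load(handle)

print(saved_name)
```

Keeping the metadata in the same file as the arrays means a dataset can be reloaded and interpreted without consulting the notebook that produced it.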

In [None]:
# Comprehensive results saving with enhanced metadata
print("💾 COMPREHENSIVE RESULTS SAVING")
print("=" * 60)

def save_all_trained_models():
    """Save all trained models from the analysis."""
    print("🤖 Saving all trained models...")
    
    models_saved = 0
    
    # Save regression models
    if 'algorithm_performance' in globals():
        for dataset_name, algorithms in algorithm_performance.items():
            for alg_name, metrics in algorithms.items():
                if 'model' in metrics and 'error' not in metrics:
                    model_metadata = {
                        'dataset': dataset_name,
                        'algorithm': alg_name,
                        'test_r2': metrics.get('test_r2', 'N/A'),
                        'cv_mean': metrics.get('cv_mean', 'N/A'),
                        'training_time': metrics.get('training_time', 'N/A'),
                        'model_type': 'regression'
                    }
                    save_model(metrics['model'], 
                             f"{dataset_name}_{alg_name}_regression",
                             f"Regression model: {alg_name} on {dataset_name}",
                             'regression', model_metadata)
                    models_saved += 1
    
    # Save classification models
    if 'algorithm_benchmark_results' in globals():
        for dataset_name, algorithms in algorithm_benchmark_results.items():
            for alg_name, metrics in algorithms.items():
                if 'model' in metrics and 'error' not in metrics:
                    model_metadata = {
                        'dataset': dataset_name,
                        'algorithm': alg_name,
                        'accuracy': metrics.get('accuracy', 'N/A'),
                        'cv_mean': metrics.get('cv_mean', 'N/A'),
                        'training_time': metrics.get('training_time', 'N/A'),
                        'model_type': 'classification'
                    }
                    save_model(metrics['model'], 
                             f"{dataset_name}_{alg_name}_classification",
                             f"Classification model: {alg_name} on {dataset_name}",
                             'classification', model_metadata)
                    models_saved += 1
    
    # Save clustering models
    if 'clustering_results' in globals():
        for dataset_name, algorithms in clustering_results.items():
            for alg_name, metrics in algorithms.items():
                if 'model' in metrics and 'error' not in metrics:
                    model_metadata = {
                        'dataset': dataset_name,
                        'algorithm': alg_name,
                        'silhouette_score': metrics.get('silhouette_score', 'N/A'),
                        'ari_score': metrics.get('ari_score', 'N/A'),
                        'clustering_time': metrics.get('clustering_time', 'N/A'),
                        'model_type': 'clustering'
                    }
                    save_model(metrics['model'], 
                             f"{dataset_name}_{alg_name}_clustering",
                             f"Clustering model: {alg_name} on {dataset_name}",
                             'clustering', model_metadata)
                    models_saved += 1
    
    print(f"✅ Saved {models_saved} trained models")
    return models_saved

def save_all_datasets():
    """Save all generated datasets."""
    print("📊 Saving all generated datasets...")
    
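    datasets_saved = 0
    
    # Make sure the target directory exists before any joblib.dump call
    (results_dir / 'data_generation').mkdir(parents=True, exist_ok=True)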
    
    # Save regression datasets
    if 'regression_datasets' in globals():
        for name, info in regression_datasets.items():
            dataset_package = {
                'X': info['X'],
                'y': info['y'],
                'true_coef': info.get('true_coef'),
                'metadata': {
                    'description': info['description'],
                    'optimal_algorithm': info.get('optimal_algorithm'),
                    'challenge': info.get('challenge'),
                    'shape': info['X'].shape,
                    'type': 'regression'
                }
            }
            
            filepath = results_dir / 'data_generation' / f"{name}_regression_dataset.joblib"
            joblib.dump(dataset_package, filepath, compress=3)
            datasets_saved += 1
    
    # Save classification datasets
    if 'classification_datasets' in globals():
        for name, info in classification_datasets.items():
            dataset_package = {
                'X': info['X'],
                'y': info['y'],
                'metadata': {
                    'description': info['description'],
                    'level': info.get('level'),
                    'optimal_algorithm': info.get('optimal_algorithm'),
                    'complexity_factors': info.get('complexity_factors'),
                    'shape': info['X'].shape,
                    'type': 'classification'
                }
            }
            
            filepath = results_dir / 'data_generation' / f"{name}_classification_dataset.joblib"
            joblib.dump(dataset_package, filepath, compress=3)
            datasets_saved += 1
    
    # Save clustering datasets
    if 'clustering_datasets' in globals():
        for name, info in clustering_datasets.items():
            dataset_package = {
                'X': info['X'],
                'y': info['y'],
                'metadata': {
                    'description': info['description'],
                    'optimal_algorithm': info.get('optimal_algorithm'),
                    'challenge': info.get('challenge'),
                    'true_k': info.get('true_k'),
                    'shape': info['X'].shape,
                    'type': 'clustering'
                }
            }
            
            filepath = results_dir / 'data_generation' / f"{name}_clustering_dataset.joblib"
            joblib.dump(dataset_package, filepath, compress=3)
            datasets_saved += 1
    
    # Save special datasets
    if 'special_datasets' in globals():
        for name, info in special_datasets.items():
            dataset_package = {
                'X': info['X'],
                'y': info['y'],
                'metadata': {
                    'description': info['description'],
                    'type': info['type'],
                    'challenge': info['challenge'],
                    'optimal_algorithm': info.get('optimal_algorithm'),
                    'shape': info['X'].shape
                }
            }
            
            filepath = results_dir / 'data_generation' / f"{name}_special_dataset.joblib"
            joblib.dump(dataset_package, filepath, compress=3)
            datasets_saved += 1
    
    print(f"✅ Saved {datasets_saved} datasets")
    return datasets_saved

def generate_final_summary():
    """Generate final comprehensive summary."""
    summary = {
        'notebook_execution': {
            'completion_time': get_timestamp(),
            'notebook_name': '01_data_generation_showcase',
            'status': 'completed'
        },
        'datasets_generated': {
            'regression': len(regression_datasets) if 'regression_datasets' in globals() else 0,
            'classification': len(classification_datasets) if 'classification_datasets' in globals() else 0,
            'clustering': len(clustering_datasets) if 'clustering_datasets' in globals() else 0,
            'special': len(special_datasets) if 'special_datasets' in globals() else 0,
            'total': (len(regression_datasets) if 'regression_datasets' in globals() else 0) + 
                    (len(classification_datasets) if 'classification_datasets' in globals() else 0) + 
                    (len(clustering_datasets) if 'clustering_datasets' in globals() else 0) + 
                    (len(special_datasets) if 'special_datasets' in globals() else 0)
        },
        'algorithms_tested': {
            'regression': len(regression_algorithms) if 'regression_algorithms' in globals() else 0,
            'classification': len(classification_algorithms) if 'classification_algorithms' in globals() else 0,
            'clustering': len(clustering_algorithms) if 'clustering_algorithms' in globals() else 0
        },
        'performance_insights': {
            'best_overall_classifier': best_overall if 'best_overall' in globals() else 'N/A',
            'best_classifier_score': float(best_score) if 'best_score' in globals() else 0,
            'total_experiments': len(mega_df) if 'mega_df' in globals() and not mega_df.empty else 0,
            'optimal_prediction_accuracy': float(match_rate) if 'match_rate' in globals() else 0
        },
        'resources_saved': {
            'figures': len(list((results_dir / 'figures').glob('*data_generation*.png'))),
            'models': 0,  # Will be updated
            'datasets': 0,  # Will be updated
            'reports': len(list((results_dir / 'reports').glob('*data_generation*.txt')))
        }
    }
    
    return summary

# Execute comprehensive saving
print("\n🔄 Executing comprehensive results saving...")

# Save all models
models_count = save_all_trained_models()

# Save all datasets
datasets_count = save_all_datasets()

# Generate final summary
final_summary = generate_final_summary()
final_summary['resources_saved']['models'] = models_count
final_summary['resources_saved']['datasets'] = datasets_count

# Save final summary
save_experiment_results('final_data_generation_summary', final_summary,
                       'Comprehensive summary of all data generation activities', 'summary')

# Generate master report
master_report = f"""
{'='*100}
MASTER DATA GENERATION SHOWCASE REPORT
{'='*100}

🎯 EXECUTIVE SUMMARY
{'-'*50}
This comprehensive data generation showcase successfully demonstrated the creation and 
analysis of sophisticated synthetic datasets across multiple machine learning domains.

📊 DATASETS GENERATED
{'-'*25}
• Regression Datasets: {final_summary['datasets_generated']['regression']}
• Classification Datasets: {final_summary['datasets_generated']['classification']}
• Clustering Datasets: {final_summary['datasets_generated']['clustering']}
• Special Purpose Datasets: {final_summary['datasets_generated']['special']}
• Total Datasets: {final_summary['datasets_generated']['total']}

🤖 ALGORITHMS EVALUATED
{'-'*30}
• Regression Algorithms: {final_summary['algorithms_tested']['regression']}
• Classification Algorithms: {final_summary['algorithms_tested']['classification']}
• Clustering Algorithms: {final_summary['algorithms_tested']['clustering']}

🏆 KEY PERFORMANCE INSIGHTS
{'-'*35}
• Best Overall Classifier: {final_summary['performance_insights']['best_overall_classifier']}
• Best Classification Score: {final_summary['performance_insights']['best_classifier_score']:.4f}
• Total Experiments Conducted: {final_summary['performance_insights']['total_experiments']}
• Algorithm Prediction Accuracy: {final_summary['performance_insights']['optimal_prediction_accuracy']:.1%}

💾 RESOURCES SAVED
{'-'*20}
• Visualization Figures: {final_summary['resources_saved']['figures']}
• Trained Models: {final_summary['resources_saved']['models']}
• Dataset Files: {final_summary['resources_saved']['datasets']}
• Analysis Reports: {final_summary['resources_saved']['reports']}

🎓 METHODOLOGICAL CONTRIBUTIONS
{'-'*40}
1. Progressive Complexity Hierarchy: Developed systematic approach to dataset complexity
2. Comprehensive Algorithm Benchmarking: Standardized evaluation across diverse scenarios
3. Interactive Exploration System: Created tools for dataset analysis and recommendation
4. Special Purpose Edge Cases: Addressed real-world ML challenges and limitations
5. Performance Optimization: Demonstrated efficiency considerations in algorithm selection

🔮 PRACTICAL APPLICATIONS
{'-'*30}
• Algorithm Selection Guidance: Data-driven recommendations for optimal algorithms
• Educational Resource: Comprehensive examples for ML education and training
• Benchmark Standard: Reproducible evaluation framework for new algorithms
• Research Foundation: Baseline datasets for ML research and development

✨ CONCLUSION
{'-'*15}
The data generation showcase successfully created a comprehensive ecosystem of synthetic
datasets that effectively demonstrate the strengths, weaknesses, and optimal use cases
of various machine learning algorithms. The systematic approach to complexity progression
and comprehensive evaluation provides valuable insights for both practitioners and
researchers in the field of machine learning.

All results, models, and datasets have been systematically saved with detailed metadata
for future reference and reproducibility.

{'='*100}
Report Generated: {final_summary['notebook_execution']['completion_time']}
Status: {final_summary['notebook_execution']['status'].upper()}
{'='*100}
"""

# Save master report
save_report(master_report, 'master_data_generation_report',
           'Master report summarizing all data generation activities', 'summary')

print(master_report)

print("\n🎉 DATA GENERATION SHOWCASE COMPLETE!")
print("\n🔬 Key Achievements:")
print("   • Generated comprehensive dataset portfolio across ML domains")
print("   • Demonstrated systematic algorithm evaluation methodology")
print("   • Created interactive exploration and recommendation tools")
print("   • Established performance benchmarking standards")
print("   • Produced reproducible research-quality results")
print("\n💾 All results systematically saved with detailed metadata")
print("📊 Comprehensive visualizations and analysis completed")
print("📋 Master documentation generated for future reference")

print(f"\n📁 Results location: {results_dir}")
print("✨ Ready for integration with advanced ML techniques! ✨")