# Classification Algorithms: Hands-On Practice 🎯

Welcome to your comprehensive classification workshop! In this notebook, you'll:

- 🔍 **Explore real datasets** with different characteristics
- 🛠️ **Implement multiple algorithms** with scikit-learn
- 📊 **Compare performance** across different scenarios
- 🎮 **Work through interactive exercises** to test your understanding
- 🏆 **Build a complete classification pipeline**

Let's dive in! 🚀

In [None]:
# Import all necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_breast_cancer, load_wine
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("🎯 Classification Workshop Setup Complete!")
print("Ready to explore the world of classification algorithms!")

## Dataset 1: Breast Cancer Detection 🏥

Our first challenge: Build a model to detect malignant vs benign breast tumors.
This is a **binary classification** problem with real-world medical implications!

**Business Context:** 
- False Negative (missing cancer) = Very Bad 😰
- False Positive (false alarm) = Less bad but still concerning 😟
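
To make these two error types concrete, here is a minimal sketch on made-up labels (the arrays are illustrative, not real patient data). It assumes the encoding used by scikit-learn's breast cancer dataset, where 0 is malignant and 1 is benign, so a missed cancer is a true 0 predicted as 1.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Illustrative labels only: 0 = malignant, 1 = benign
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 1, 1, 0, 1])

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])

missed_cancers = cm[0, 1]  # truly malignant, predicted benign
false_alarms = cm[1, 0]    # truly benign, predicted malignant

print(cm)
print(f"Missed cancers: {missed_cancers}, false alarms: {false_alarms}")
```

Reading the matrix this way avoids mixing up the clinical meaning of "false negative" with scikit-learn's convention that label 1 is the positive class.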

In [None]:
# Load the breast cancer dataset
cancer_data = load_breast_cancer()
X_cancer = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
y_cancer = cancer_data.target

print("🏥 BREAST CANCER DATASET OVERVIEW")
print(f"Samples: {X_cancer.shape[0]}")
print(f"Features: {X_cancer.shape[1]}")
print(f"Classes: {np.unique(y_cancer)} (0=Malignant, 1=Benign)")
print(f"Class distribution: {np.bincount(y_cancer)}")
print(f"Balance: {np.bincount(y_cancer)[0]/len(y_cancer):.1%} Malignant, {np.bincount(y_cancer)[1]/len(y_cancer):.1%} Benign")

# Display first few features
print("\nFirst 5 rows:")
print(X_cancer.head())

In [None]:
# Exploratory Data Analysis
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Feature distribution
X_cancer['mean radius'].hist(bins=30, alpha=0.7, ax=axes[0,0])  # feature names contain spaces, not underscores
axes[0,0].set_title('Mean Radius Distribution')
axes[0,0].set_xlabel('Mean Radius')

# Class distribution
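# sort_index() keeps the counts in label order (0 = malignant, 1 = benign) so they match the bar labels
class_counts = pd.Series(y_cancer).value_counts().sort_index()
axes[0,1].bar(['Malignant', 'Benign'], class_counts.values, color=['red', 'green'], alpha=0.7)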
axes[0,1].set_title('Class Distribution')
axes[0,1].set_ylabel('Count')

# Feature correlation heatmap (subset)
corr_subset = X_cancer[['mean radius', 'mean texture', 'mean perimeter', 'mean area', 'mean smoothness']].corr()
sns.heatmap(corr_subset, annot=True, cmap='coolwarm', ax=axes[1,0])
axes[1,0].set_title('Feature Correlations (Subset)')

# Boxplot: Malignant vs Benign for a key feature
cancer_df = X_cancer.copy()
cancer_df['target'] = y_cancer
cancer_df['target_name'] = cancer_df['target'].map({0: 'Malignant', 1: 'Benign'})
sns.boxplot(data=cancer_df, x='target_name', y='mean radius', ax=axes[1,1])
axes[1,1].set_title('Mean Radius by Diagnosis')

plt.tight_layout()
plt.show()

print("\n🔍 INSIGHTS:")
print("- Mean radius is clearly different between malignant and benign tumors")
print("- Some features are highly correlated (radius, perimeter, area)")
print("- Classes are moderately imbalanced (about 37% malignant, 63% benign)")

### 🎮 Interactive Exercise 1: Your First Classification Model

**Challenge:** Build a logistic regression model to predict cancer diagnosis.

**Your Task:**
1. Split the data (80% train, 20% test)
2. Scale the features (cancer features have very different ranges!)
3. Train a logistic regression model
4. Evaluate using multiple metrics
5. Interpret the results

**Think about:** What metrics matter most for cancer detection?
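
One way to approach that question: scikit-learn's `recall_score` treats label 1 (benign here) as the positive class by default, so the default recall does not tell you how many cancers were caught. A small sketch with made-up labels, passing `pos_label=0` to score the malignant class instead:

```python
import numpy as np
from sklearn.metrics import recall_score

# Illustrative labels only: 0 = malignant, 1 = benign
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1, 0])

# Default behaviour: recall for label 1 (benign)
benign_recall = recall_score(y_true, y_pred)

# Recall for label 0 (malignant): share of actual cancers the model caught
malignant_recall = recall_score(y_true, y_pred, pos_label=0)

print(f"Benign recall:    {benign_recall:.3f}")
print(f"Malignant recall: {malignant_recall:.3f}")
```

The two numbers can differ noticeably, so it is worth checking which class a reported metric refers to.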

In [None]:
# TODO: Your code here! 
# Hint: Start by splitting the data using train_test_split

# Step 1: Split the data
X_train_cancer, X_test_cancer, y_train_cancer, y_test_cancer = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)

print("✅ Data split complete!")
print(f"Training set: {X_train_cancer.shape[0]} samples")
print(f"Test set: {X_test_cancer.shape[0]} samples")

# Check if you maintained class balance
print(f"Training class distribution: {np.bincount(y_train_cancer)}")
print(f"Test class distribution: {np.bincount(y_test_cancer)}")

In [None]:
# Step 2: Scale the features
scaler = StandardScaler()
X_train_cancer_scaled = scaler.fit_transform(X_train_cancer)
X_test_cancer_scaled = scaler.transform(X_test_cancer)

print("✅ Feature scaling complete!")
print(f"Original feature range (mean radius): {X_train_cancer['mean radius'].min():.2f} to {X_train_cancer['mean radius'].max():.2f}")
print(f"Scaled feature range (mean radius): {X_train_cancer_scaled[:, 0].min():.2f} to {X_train_cancer_scaled[:, 0].max():.2f}")

In [None]:
# Step 3: Train logistic regression
lr_cancer = LogisticRegression(random_state=42)
lr_cancer.fit(X_train_cancer_scaled, y_train_cancer)

# Step 4: Make predictions
y_pred_cancer = lr_cancer.predict(X_test_cancer_scaled)
y_proba_cancer = lr_cancer.predict_proba(X_test_cancer_scaled)[:, 1]

print("✅ Model training complete!")
print(f"Model accuracy: {lr_cancer.score(X_test_cancer_scaled, y_test_cancer):.3f}")

In [None]:
# Step 5: Comprehensive evaluation
from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate_model(y_true, y_pred, y_proba, model_name="Model"):
    """
    Comprehensive model evaluation function
    """
    accuracy = (y_true == y_pred).mean()
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_proba)
    
    print(f"📊 {model_name.upper()} EVALUATION RESULTS:")
    print(f"Accuracy:  {accuracy:.3f} - Overall correctness")
    print(f"Precision: {precision:.3f} - When we predict benign, how often are we right?")
    print(f"Recall:    {recall:.3f} - Of all actual benign cases, how many did we catch?")
    print(f"F1-Score:  {f1:.3f} - Balance between precision and recall")
    print(f"AUC:       {auc:.3f} - Overall discriminative ability")
    
    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    
    plt.figure(figsize=(12, 5))
    
    plt.subplot(1, 2, 1)
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
                xticklabels=['Pred: Malignant', 'Pred: Benign'],
                yticklabels=['True: Malignant', 'True: Benign'])
    plt.title(f'{model_name} Confusion Matrix')
    
    # ROC Curve
    fpr, tpr, _ = roc_curve(y_true, y_proba)
    plt.subplot(1, 2, 2)
    plt.plot(fpr, tpr, linewidth=2, label=f'{model_name} (AUC = {auc:.3f})')
    plt.plot([0, 1], [0, 1], 'k--', alpha=0.5, label='Random Classifier')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend()
    plt.grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    # Medical interpretation
    # cm rows are true classes, columns are predicted classes (0 = malignant, 1 = benign)
    print(f"\n🏥 MEDICAL INTERPRETATION:")
    print(f"Missed cancers (malignant predicted as benign): {cm[0,1]} cases")
    print(f"False alarms (benign predicted as malignant): {cm[1,0]} cases")
    print(f"Correctly identified benign: {cm[1,1]} cases")
    print(f"Correctly identified malignant: {cm[0,0]} cases")
    
    return accuracy, precision, recall, f1, auc

# Evaluate our logistic regression model
lr_results = evaluate_model(y_test_cancer, y_pred_cancer, y_proba_cancer, "Logistic Regression")
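
The default `predict` call labels a tumor benign whenever the predicted probability of benign is at least 0.5. Since missed cancers are the costlier error, one option is to demand more confidence before calling a tumor benign. The sketch below is a hedged illustration, not part of the workshop code above: the helper `count_errors` and the 0.8 cutoff are arbitrary choices for demonstration, and in practice the cutoff would be chosen on a validation set.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
proba_benign = model.predict_proba(X_te)[:, 1]  # column 1 = P(benign)

def count_errors(threshold):
    """Hypothetical helper: predict benign only if P(benign) >= threshold."""
    pred = (proba_benign >= threshold).astype(int)
    missed = int(np.sum((y_te == 0) & (pred == 1)))        # malignant called benign
    false_alarms = int(np.sum((y_te == 1) & (pred == 0)))  # benign called malignant
    return missed, false_alarms

threshold_results = {t: count_errors(t) for t in (0.5, 0.8)}
for t, (missed, alarms) in threshold_results.items():
    print(f"threshold={t}: missed cancers={missed}, false alarms={alarms}")
```

Raising the cutoff can only reduce (or leave unchanged) the number of missed cancers, at the price of more false alarms.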

### 🏆 Algorithm Comparison Arena

Now let's compare multiple algorithms on the same dataset! 
This is where the fun begins - seeing how different approaches handle the same problem.

In [None]:
# Define multiple algorithms to compare
algorithms = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5),
    'Naive Bayes': GaussianNB()
}

# Store results
comparison_results = {}

print("🏟️ ALGORITHM COMPARISON ARENA")
print("Training and evaluating multiple algorithms...")

for name, algorithm in algorithms.items():
    print(f"\nTraining {name}...")
    
    # Train the algorithm
    if name in ['Logistic Regression', 'SVM', 'K-Nearest Neighbors']:
        # These algorithms benefit from scaling
        algorithm.fit(X_train_cancer_scaled, y_train_cancer)
        y_pred = algorithm.predict(X_test_cancer_scaled)
        y_proba = algorithm.predict_proba(X_test_cancer_scaled)[:, 1]
    else:
        # Tree-based models and Gaussian Naive Bayes are largely insensitive to feature scaling
        algorithm.fit(X_train_cancer, y_train_cancer)
        y_pred = algorithm.predict(X_test_cancer)
        y_proba = algorithm.predict_proba(X_test_cancer)[:, 1]
    
    # Calculate metrics
    accuracy = (y_test_cancer == y_pred).mean()
    precision = precision_score(y_test_cancer, y_pred)
    recall = recall_score(y_test_cancer, y_pred)
    f1 = f1_score(y_test_cancer, y_pred)
    auc = roc_auc_score(y_test_cancer, y_proba)
    
    comparison_results[name] = {
        'Accuracy': accuracy,
        'Precision': precision,
        'Recall': recall,
        'F1-Score': f1,
        'AUC': auc
    }

# Create comparison DataFrame
comparison_df = pd.DataFrame(comparison_results).T
print("\n📊 FINAL RESULTS:")
print(comparison_df.round(3))

# Visualize comparison
plt.figure(figsize=(15, 10))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC']
for i, metric in enumerate(metrics):
    plt.subplot(2, 3, i+1)
    values = comparison_df[metric]
    bars = plt.bar(range(len(values)), values, alpha=0.7)
    plt.xticks(range(len(values)), values.index, rotation=45, ha='right')
    plt.ylabel(metric)
    plt.title(f'{metric} Comparison')
    plt.ylim(0, 1)
    
    # Highlight best performer
    best_idx = values.idxmax()
    best_bar_idx = values.index.get_loc(best_idx)
    bars[best_bar_idx].set_color('gold')
    bars[best_bar_idx].set_edgecolor('black')
    bars[best_bar_idx].set_linewidth(2)

plt.tight_layout()
plt.show()

# Find overall best performer
overall_scores = comparison_df.mean(axis=1).sort_values(ascending=False)
print(f"\n🏆 CHAMPION: {overall_scores.index[0]}")
print(f"Average score across all metrics: {overall_scores.iloc[0]:.3f}")

## Dataset 2: Wine Classification 🍷

Let's tackle a **multiclass classification** problem! 
Predicting wine type based on chemical properties.

**Business Context:** 
- Wine quality control and authentication
- 3 different wine classes to predict
- All mistakes are roughly equal cost
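
Because every class and every mistake counts roughly the same here, macro-averaged metrics (the unweighted mean of the per-class scores) are a natural fit. A minimal sketch on made-up labels for three classes:

```python
import numpy as np
from sklearn.metrics import f1_score

# Illustrative labels only, three classes
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# One F1 score per class, in label order
per_class_f1 = f1_score(y_true, y_pred, average=None)

# Macro F1: plain average of the per-class scores
macro_f1 = f1_score(y_true, y_pred, average='macro')

print("Per-class F1:", per_class_f1.round(3))
print(f"Macro F1: {macro_f1:.3f}")
```

If class sizes were very uneven and you wanted larger classes to count more, `average='weighted'` would be the alternative.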

In [None]:
# Load wine dataset
wine_data = load_wine()
X_wine = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
y_wine = wine_data.target

print("🍷 WINE CLASSIFICATION DATASET")
print(f"Samples: {X_wine.shape[0]}")
print(f"Features: {X_wine.shape[1]}")
print(f"Classes: {np.unique(y_wine)} - {wine_data.target_names}")
print(f"Class distribution: {np.bincount(y_wine)}")

# Quick visualization
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Feature distributions by class
for class_idx, class_name in enumerate(wine_data.target_names):
    class_data = X_wine[y_wine == class_idx]['alcohol']
    axes[0].hist(class_data, alpha=0.7, label=f'Class {class_idx}: {class_name}', bins=15)

axes[0].set_xlabel('Alcohol Content')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Alcohol Content by Wine Class')
axes[0].legend()

# 2D scatter plot
scatter = axes[1].scatter(X_wine['alcohol'], X_wine['malic_acid'], c=y_wine, cmap='viridis', alpha=0.7)
axes[1].set_xlabel('Alcohol Content')
axes[1].set_ylabel('Malic Acid')
axes[1].set_title('Wine Classes in 2D')
plt.colorbar(scatter, ax=axes[1])

# Class distribution
class_counts = np.bincount(y_wine)
axes[2].bar(wine_data.target_names, class_counts, alpha=0.7, color=['red', 'green', 'blue'])
axes[2].set_title('Class Distribution')
axes[2].set_ylabel('Count')

plt.tight_layout()
plt.show()

### 🎮 Interactive Exercise 2: Multiclass Classification

**Challenge:** Build a complete pipeline for wine classification!

**Your Mission:**
1. Split and scale the data
2. Train multiple algorithms
3. Use multiclass evaluation metrics
4. Find the best performer
5. Analyze feature importance

**New Twist:** This time, you'll use cross-validation for more robust results!
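
One detail worth knowing: if the scaler is fit on the whole training set before cross-validation, each validation fold has already influenced the scaling statistics. A common way to avoid this is to wrap the scaler and the model in a pipeline, so scaling is refit inside every fold. A minimal sketch under that assumption (the names `pipe` and `scores` are illustrative, and the full wine dataset is used here only for brevity):

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# The scaler is fit on the training part of each fold only
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

scores = cross_validate(pipe, X, y, cv=5, scoring=['accuracy', 'f1_macro'])

acc = scores['test_accuracy']
print(f"Accuracy: {acc.mean():.3f} ± {acc.std():.3f}")
print(f"Macro F1: {scores['test_f1_macro'].mean():.3f}")
```

With a dataset this small the difference is usually minor, but the pipeline pattern also removes the need to track scaled and unscaled copies of the data.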

In [None]:
# TODO: Implement multiclass classification pipeline

from sklearn.model_selection import cross_validate

# Step 1: Split the data
X_train_wine, X_test_wine, y_train_wine, y_test_wine = train_test_split(
    X_wine, y_wine, test_size=0.2, random_state=42, stratify=y_wine
)

# Step 2: Scale features
scaler_wine = StandardScaler()
X_train_wine_scaled = scaler_wine.fit_transform(X_train_wine)
X_test_wine_scaled = scaler_wine.transform(X_test_wine)

print("✅ Wine data prepared!")
print(f"Training set shape: {X_train_wine_scaled.shape}")
print(f"Test set shape: {X_test_wine_scaled.shape}")

# Step 3: Cross-validation comparison
wine_algorithms = {
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'SVM': SVC(probability=True, random_state=42),
    'K-Nearest Neighbors': KNeighborsClassifier(n_neighbors=5)
}

wine_cv_results = {}

print("\n🍷 WINE CLASSIFICATION - CROSS VALIDATION RESULTS")
for name, algorithm in wine_algorithms.items():
    print(f"Cross-validating {name}...")
    
    # Use appropriate features (scaled for some algorithms)
    if name in ['Logistic Regression', 'SVM', 'K-Nearest Neighbors']:
        X_cv = X_train_wine_scaled
    else:
        X_cv = X_train_wine
    
    # 5-fold cross-validation with multiple metrics
    cv_results = cross_validate(
        algorithm, X_cv, y_train_wine, 
        cv=5, 
        scoring=['accuracy', 'precision_macro', 'recall_macro', 'f1_macro'],
        return_train_score=True
    )
    
    wine_cv_results[name] = {
        'CV_Accuracy': cv_results['test_accuracy'].mean(),
        'CV_Precision': cv_results['test_precision_macro'].mean(),
        'CV_Recall': cv_results['test_recall_macro'].mean(),
        'CV_F1': cv_results['test_f1_macro'].mean(),
        'CV_Std': cv_results['test_accuracy'].std()
    }

wine_cv_df = pd.DataFrame(wine_cv_results).T
print("\n📊 CROSS-VALIDATION RESULTS:")
print(wine_cv_df.round(3))

In [None]:
# Train best model on full training set and evaluate on test set
best_wine_model = wine_cv_df['CV_F1'].idxmax()
print(f"🏆 Best model: {best_wine_model}")

# Train and evaluate the best model
if best_wine_model in ['Logistic Regression', 'SVM', 'K-Nearest Neighbors']:
    best_algorithm = wine_algorithms[best_wine_model]
    best_algorithm.fit(X_train_wine_scaled, y_train_wine)
    y_pred_wine = best_algorithm.predict(X_test_wine_scaled)
    y_proba_wine = best_algorithm.predict_proba(X_test_wine_scaled)
else:
    best_algorithm = wine_algorithms[best_wine_model]
    best_algorithm.fit(X_train_wine, y_train_wine)
    y_pred_wine = best_algorithm.predict(X_test_wine)
    y_proba_wine = best_algorithm.predict_proba(X_test_wine)

# Multiclass evaluation (classification_report was already imported in the setup cell)

print(f"\n📋 DETAILED CLASSIFICATION REPORT - {best_wine_model}")
print(classification_report(y_test_wine, y_pred_wine, target_names=wine_data.target_names))

# Multiclass confusion matrix
cm_wine = confusion_matrix(y_test_wine, y_pred_wine)

plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
sns.heatmap(cm_wine, annot=True, fmt='d', cmap='Blues', 
            xticklabels=wine_data.target_names,
            yticklabels=wine_data.target_names)
plt.title(f'{best_wine_model} - Confusion Matrix')
plt.ylabel('True Class')
plt.xlabel('Predicted Class')

# Feature importance (if available)
if hasattr(best_algorithm, 'feature_importances_'):
    feature_importance = pd.DataFrame({
        'Feature': wine_data.feature_names,
        'Importance': best_algorithm.feature_importances_
    }).sort_values('Importance', ascending=False).head(10)
    
    plt.subplot(1, 2, 2)
    plt.barh(range(len(feature_importance)), feature_importance['Importance'])
    plt.yticks(range(len(feature_importance)), feature_importance['Feature'])
    plt.xlabel('Feature Importance')
    plt.title('Top 10 Most Important Features')
elif hasattr(best_algorithm, 'coef_'):
    # For linear models, show coefficient magnitudes
    feature_importance = pd.DataFrame({
        'Feature': wine_data.feature_names,
        'Importance': np.abs(best_algorithm.coef_).mean(axis=0)
    }).sort_values('Importance', ascending=False).head(10)
    
    plt.subplot(1, 2, 2)
    plt.barh(range(len(feature_importance)), feature_importance['Importance'])
    plt.yticks(range(len(feature_importance)), feature_importance['Feature'])
    plt.xlabel('Average |Coefficient|')
    plt.title('Top 10 Most Important Features')

plt.tight_layout()
plt.show()

print(f"\n🎯 KEY INSIGHTS:")
print(f"- {best_wine_model} achieved {(y_pred_wine == y_test_wine).mean():.1%} accuracy on test set")
print(f"- Cross-validation accuracy: {wine_cv_results[best_wine_model]['CV_Accuracy']:.1%} ± {wine_cv_results[best_wine_model]['CV_Std']:.1%}")
print(f"- All three wine classes can be distinguished quite well!")

## Dataset 3: Custom Synthetic Dataset 🎨

Let's create our own dataset with specific characteristics to test algorithm behavior!
This is great for understanding when each algorithm shines.

In [None]:
# Create different types of synthetic datasets
def create_challenge_datasets():
    """
    Create datasets with different characteristics to challenge our algorithms
    """
    datasets = {}
    
    # Dataset 1: Linearly separable
    X_linear, y_linear = make_classification(
        n_samples=1000, n_features=2, n_redundant=0, n_informative=2,
        n_clusters_per_class=1, class_sep=2, random_state=42
    )
    datasets['Linear'] = (X_linear, y_linear, "Linearly separable classes")
    
    # Dataset 2: Non-linear (two clusters per class, weak separation)
    X_nonlinear, y_nonlinear = make_classification(
        n_samples=1000, n_features=2, n_redundant=0, n_informative=2,
        n_clusters_per_class=2, class_sep=0.5, random_state=42
    )
    datasets['Non-Linear'] = (X_nonlinear, y_nonlinear, "Non-linearly separable (multiple clusters per class)")
    
    # Dataset 3: High noise
    X_noisy, y_noisy = make_classification(
        n_samples=1000, n_features=2, n_redundant=0, n_informative=2,
        n_clusters_per_class=1, class_sep=1, flip_y=0.15, random_state=42
    )
    datasets['Noisy'] = (X_noisy, y_noisy, "High noise (15% label flip)")
    
    # Dataset 4: High dimensional
    X_highdim, y_highdim = make_classification(
        n_samples=500, n_features=100, n_informative=20, n_redundant=80,
        n_clusters_per_class=1, random_state=42
    )
    datasets['High-Dim'] = (X_highdim, y_highdim, "High dimensional (100 features)")
    
    return datasets

challenge_datasets = create_challenge_datasets()

# Visualize the 2D datasets
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for i, (name, (X, y, description)) in enumerate(list(challenge_datasets.items())[:3]):
    if X.shape[1] == 2:  # Only plot 2D datasets
        scatter = axes[i].scatter(X[:, 0], X[:, 1], c=y, cmap='viridis', alpha=0.7)
        axes[i].set_title(f'{name} Dataset\n{description}')
        axes[i].set_xlabel('Feature 1')
        axes[i].set_ylabel('Feature 2')
        plt.colorbar(scatter, ax=axes[i])

plt.tight_layout()
plt.show()

print("🎨 SYNTHETIC DATASETS CREATED!")
print("Each dataset tests different algorithm strengths:")
print("- Linear: Should favor linear models (Logistic Regression, linear-kernel SVM)")
print("- Non-Linear: Should favor flexible models (Trees, KNN)")  
print("- Noisy: Should favor robust models (Random Forest)")
print("- High-Dim: Should favor regularized models (Logistic Regression)")
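
Note that `make_classification` with several clusters per class produces overlapping blobs rather than a strict XOR pattern. If you want to see a linear model struggle on a truly non-linear boundary, one option is to build XOR labels by hand. The sketch below is an illustrative addition (the names `X_xor` and `y_xor` are not part of the workshop datasets):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)

# Points in the square [-1, 1] x [-1, 1]; class 1 when the two signs differ
X_xor = rng.uniform(-1, 1, size=(1000, 2))
y_xor = ((X_xor[:, 0] > 0) ^ (X_xor[:, 1] > 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_xor, y_xor, test_size=0.3, random_state=42, stratify=y_xor
)

# A single linear boundary cannot separate XOR; a tree can split on each axis
lr_acc = LogisticRegression().fit(X_tr, y_tr).score(X_te, y_te)
tree_acc = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr).score(X_te, y_te)

print(f"Logistic Regression accuracy: {lr_acc:.3f}")
print(f"Decision Tree accuracy:       {tree_acc:.3f}")
```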

In [None]:
# Compare algorithms across all challenge datasets
challenge_algorithms = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Decision Tree': DecisionTreeClassifier(random_state=42, max_depth=10),
    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'K-NN (k=5)': KNeighborsClassifier(n_neighbors=5),
}

# Store all results
all_results = {}

print("🏟️ ALGORITHM CHALLENGE ARENA")
print("Testing each algorithm on different dataset types...")

for dataset_name, (X, y, description) in challenge_datasets.items():
    print(f"\n📊 Testing on {dataset_name} dataset: {description}")
    
    # Split data
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    
    # Scale features
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    dataset_results = {}
    
    for alg_name, algorithm in challenge_algorithms.items():
        # Use scaled features for scale-sensitive algorithms
        if alg_name in ['Logistic Regression', 'SVM (RBF)', 'K-NN (k=5)']:
            algorithm.fit(X_train_scaled, y_train)
            accuracy = algorithm.score(X_test_scaled, y_test)
        else:
            algorithm.fit(X_train, y_train)
            accuracy = algorithm.score(X_test, y_test)
        
        dataset_results[alg_name] = accuracy
    
    all_results[dataset_name] = dataset_results

# Create comprehensive results DataFrame
results_df = pd.DataFrame(all_results)
print("\n🏆 FINAL CHALLENGE RESULTS:")
print(results_df.round(3))

# Visualize results
plt.figure(figsize=(12, 8))
sns.heatmap(results_df, annot=True, cmap='RdYlGn', center=0.8, 
            fmt='.3f', cbar_kws={'label': 'Accuracy'})
plt.title('Algorithm Performance Across Different Dataset Types')
plt.ylabel('Algorithm')
plt.xlabel('Dataset Type')
plt.tight_layout()
plt.show()

# Find best algorithm for each dataset type
print("\n🥇 WINNERS FOR EACH DATASET TYPE:")
for dataset_name in results_df.columns:
    best_alg = results_df[dataset_name].idxmax()
    best_score = results_df[dataset_name].max()
    print(f"{dataset_name}: {best_alg} ({best_score:.3f})")

# Find most versatile algorithm
versatility_scores = results_df.mean(axis=1).sort_values(ascending=False)
print(f"\n🌟 MOST VERSATILE ALGORITHM: {versatility_scores.index[0]}")
print(f"Average performance: {versatility_scores.iloc[0]:.3f}")

### 🎮 Interactive Exercise 3: Hyperparameter Tuning

**Final Challenge:** Optimize your best algorithm using Grid Search!

**Your Mission:**
1. Choose the best performing algorithm from above
2. Define a hyperparameter grid to search
3. Use GridSearchCV with cross-validation
4. Compare default vs optimized performance
5. Analyze which parameters matter most

**Pro Tip:** Start with a coarse grid, then refine around the best values!
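
The coarse-then-fine idea can be sketched in two passes. This is a hedged illustration rather than the workshop's tuning code: it tunes only `max_depth`, and the grid values and the names `coarse`, `fine`, and `fine_grid` are arbitrary choices for demonstration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=50, random_state=42)

# Pass 1: coarse grid with widely spaced values
coarse = GridSearchCV(rf, {'max_depth': [2, 6, 12]}, cv=5, scoring='f1')
coarse.fit(X_tr, y_tr)
best_depth = coarse.best_params_['max_depth']

# Pass 2: finer grid centred on the coarse winner
fine_grid = {'max_depth': [max(1, best_depth - 1), best_depth, best_depth + 1]}
fine = GridSearchCV(rf, fine_grid, cv=5, scoring='f1')
fine.fit(X_tr, y_tr)

print(f"Coarse best: {coarse.best_params_} (F1 = {coarse.best_score_:.3f})")
print(f"Fine best:   {fine.best_params_} (F1 = {fine.best_score_:.3f})")
```

Two small grids like this evaluate far fewer combinations than one dense grid over the same range, which matters once several parameters are tuned together.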

In [None]:
# TODO: Implement hyperparameter tuning for your chosen algorithm

# Random Forest is used here as an example; substitute whichever algorithm did best in your runs above
print("🔧 HYPERPARAMETER TUNING: Random Forest")

# Define parameter grid for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

print(f"Parameter grid size: {np.prod([len(v) for v in param_grid.values()])} combinations")

# Use the breast cancer dataset for tuning (medical importance!)
rf_tuning = RandomForestClassifier(random_state=42)

# Perform grid search
print("🔍 Performing Grid Search (this might take a moment...)")
grid_search = GridSearchCV(
    rf_tuning, 
    param_grid, 
    cv=5,  # 5-fold cross-validation
    scoring='f1',  # F1 for the positive class (label 1 = benign in this dataset)
    n_jobs=-1,  # Use all available cores
    verbose=1
)

# Random forests are insensitive to feature scaling, so tune directly on the unscaled features
grid_search.fit(X_train_cancer, y_train_cancer)

print("✅ Grid Search Complete!")
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best cross-validation F1 score: {grid_search.best_score_:.3f}")

# Compare default vs optimized
rf_default = RandomForestClassifier(random_state=42)
rf_default.fit(X_train_cancer, y_train_cancer)

# best_estimator_ is already refit on the full training set with the best parameters (refit=True by default)
rf_optimized = grid_search.best_estimator_

# Test both models
y_pred_default = rf_default.predict(X_test_cancer)
y_pred_optimized = rf_optimized.predict(X_test_cancer)

print(f"\n📊 DEFAULT vs OPTIMIZED COMPARISON:")
print(f"Default Random Forest F1-Score: {f1_score(y_test_cancer, y_pred_default):.3f}")
print(f"Optimized Random Forest F1-Score: {f1_score(y_test_cancer, y_pred_optimized):.3f}")
print(f"Improvement: {f1_score(y_test_cancer, y_pred_optimized) - f1_score(y_test_cancer, y_pred_default):.3f}")

In [None]:
# Analyze hyperparameter importance
results_df_tuning = pd.DataFrame(grid_search.cv_results_)

# Plot parameter importance
fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

param_names = list(param_grid.keys())
for i, param in enumerate(param_names):
    if i < len(axes):
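        # Group by parameter value and average the CV scores; cast to str so that
        # None settings (e.g. max_depth=None) are not silently dropped by groupby
        param_key = results_df_tuning[f'param_{param}'].astype(str)
        param_performance = results_df_tuning.groupby(param_key)['mean_test_score'].mean()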
        
        axes[i].bar(range(len(param_performance)), param_performance.values, alpha=0.7)
        axes[i].set_xticks(range(len(param_performance)))
        axes[i].set_xticklabels([str(x) for x in param_performance.index], rotation=45)
        axes[i].set_xlabel(param)
        axes[i].set_ylabel('Mean CV F1 Score')
        axes[i].set_title(f'Impact of {param}')
        axes[i].grid(True, alpha=0.3)

# Feature importance from optimized model
if len(axes) > len(param_names):
    feature_importance = pd.DataFrame({
        'Feature': X_cancer.columns,
        'Importance': rf_optimized.feature_importances_
    }).sort_values('Importance', ascending=False).head(15)
    
    axes[-1].barh(range(len(feature_importance)), feature_importance['Importance'])
    axes[-1].set_yticks(range(len(feature_importance)))
    axes[-1].set_yticklabels(feature_importance['Feature'])
    axes[-1].set_xlabel('Feature Importance')
    axes[-1].set_title('Top 15 Features (Optimized Model)')

plt.tight_layout()
plt.show()

print("🎯 TUNING INSIGHTS:")
print("- Look for parameters that show large performance differences")
print("- Feature importance can guide future feature engineering")
print("- Sometimes default parameters are already quite good!")

## 🏆 Workshop Summary & Key Takeaways

Congratulations! You've completed a comprehensive classification workshop. Let's summarize what you've learned:

### 🎯 Key Insights from Today:

1. **Algorithm Selection Matters**: Different algorithms excel in different scenarios
2. **Data Preprocessing is Crucial**: Scaling can dramatically affect scale-sensitive models such as SVM, KNN, and logistic regression
3. **Evaluation Strategy**: Use multiple metrics, especially for medical/critical applications
4. **Cross-Validation**: Provides more reliable performance estimates
5. **Hyperparameter Tuning**: Can provide meaningful improvements
6. **No Free Lunch**: No single algorithm is always best

### 🛠️ Practical Skills Gained:

- ✅ Built and evaluated multiple classification models
- ✅ Handled both binary and multiclass problems
- ✅ Applied proper data preprocessing techniques
- ✅ Used cross-validation for robust evaluation
- ✅ Performed hyperparameter optimization
- ✅ Interpreted model results in business context

### 🚀 Next Steps:

1. **Practice More**: Try these techniques on your own datasets
2. **Advanced Topics**: Explore ensemble methods, deep learning
3. **Production Skills**: Learn about model deployment and monitoring
4. **Domain Expertise**: Apply to specific fields (healthcare, finance, etc.)

In [None]:
print("🎉 CLASSIFICATION WORKSHOP COMPLETE! 🎉")
print("\n📚 What you've mastered today:")
print("✅ Multiple classification algorithms")
print("✅ Proper evaluation techniques") 
print("✅ Cross-validation and hyperparameter tuning")
print("✅ Real-world problem solving")
print("\n🚀 You're ready to tackle classification problems in the wild!")
print("\nKeep practicing and happy learning! 🤖📊")

# Final challenge for the ambitious learners
print("\n🏆 BONUS CHALLENGE:")
print("Can you achieve >95% F1-score on the breast cancer dataset?")
print("Hint: Try ensemble methods, feature selection, or advanced preprocessing!")