# 🤖 Scikit-Learn Algorithms from Scratch

This notebook implements classic machine learning algorithms from scratch, a task that comes up often in ML engineering interviews.

## 📋 Table of Contents
1. [K-Means Clustering](#k-means-clustering)
2. [Logistic Regression with Regularization](#logistic-regression)
3. [Decision Tree Implementation](#decision-tree)
4. [Model Evaluation and Cross-Validation](#model-evaluation)
5. [Practice Problems](#practice-problems)
6. [Interview Tips](#interview-tips)

In [None]:
# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_blobs, load_iris, load_wine
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.preprocessing import StandardScaler
import time
import warnings
warnings.filterwarnings('ignore')

# Set up plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

print("✅ All libraries imported successfully!")
print(f"📊 NumPy version: {np.__version__}")
print("🤖 Scikit-learn available for comparison")

## 🎯 Problem 1: K-Means Clustering from Scratch

**Problem Statement**: Implement the K-means clustering algorithm (Lloyd's algorithm).

**Requirements**:
- Initialize centroids randomly or using K-means++
- Implement Lloyd's algorithm with convergence criteria
- Calculate within-cluster sum of squares (WCSS)
- Handle edge cases (empty clusters, convergence)

**Time Complexity**: O(n × k × i × d) where n=samples, k=clusters, i=iterations, d=dimensions
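
As a warm-up before the full class, here is a minimal sketch of a single Lloyd iteration on a four-point toy dataset. The helper name `lloyd_step` and the toy arrays are illustrative assumptions, not part of the implementation that follows. It shows where the O(n × k × d) per-iteration cost comes from (one distance per point, centroid, and dimension) and that WCSS does not increase after the centroid update.

```python
import numpy as np

def lloyd_step(X, centroids):
    """One assignment + update step of Lloyd's algorithm (illustrative helper)."""
    # Squared Euclidean distance from every point to every centroid: shape (n, k)
    sq_dists = ((X[:, np.newaxis, :] - centroids[np.newaxis, :, :]) ** 2).sum(axis=2)
    labels = sq_dists.argmin(axis=1)
    # WCSS of this assignment, measured against the centroids passed in
    wcss = sq_dists[np.arange(len(X)), labels].sum()
    # Move each centroid to the mean of its points; leave it in place if the cluster is empty
    new_centroids = centroids.copy()
    for k in range(len(centroids)):
        members = X[labels == k]
        if len(members) > 0:
            new_centroids[k] = members.mean(axis=0)
    return labels, new_centroids, wcss

# Two obvious groups, with initial centroids placed on one point of each group
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
init_centroids = np.array([[0.0, 0.0], [10.0, 0.0]])

toy_labels, toy_centroids, wcss_before = lloyd_step(X_toy, init_centroids)
_, _, wcss_after = lloyd_step(X_toy, toy_centroids)

print(toy_labels)               # [0 0 1 1]
print(toy_centroids)            # centroids move to (0, 1) and (10, 1)
print(wcss_before, wcss_after)  # 8.0 4.0
```

Repeating this step until the centroids stop moving is the whole algorithm; the class below adds initialization strategies, empty-cluster handling, and a convergence tolerance.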

In [None]:
class KMeansFromScratch:
    """K-means clustering implementation from scratch."""
    
    def __init__(self, n_clusters=3, max_iters=100, tol=1e-4, init='random'):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.tol = tol
        self.init = init  # 'random' or 'kmeans++'
        
    def _initialize_centroids(self, X):
        """Initialize centroids using random or K-means++ method."""
        n_samples, n_features = X.shape
        
        if self.init == 'random':
            # Random initialization
            indices = np.random.choice(n_samples, self.n_clusters, replace=False)
            centroids = X[indices].copy()
        
        elif self.init == 'kmeans++':
            # K-means++ initialization
            centroids = np.zeros((self.n_clusters, n_features))
            
            # Choose first centroid randomly
            centroids[0] = X[np.random.randint(n_samples)]
            
            for i in range(1, self.n_clusters):
                # Squared distance from each point to its nearest chosen centroid
                diffs = X[:, np.newaxis, :] - centroids[np.newaxis, :i, :]
                distances = (diffs ** 2).sum(axis=2).min(axis=1)
                
                # Choose next centroid with probability proportional to squared distance
                total = distances.sum()
                if total > 0:
                    probabilities = distances / total
                else:
                    # Every point coincides with a chosen centroid; fall back to uniform
                    probabilities = np.full(n_samples, 1.0 / n_samples)
                
                # Sampling by index avoids leaving a centroid unset due to round-off
                centroids[i] = X[np.random.choice(n_samples, p=probabilities)]
        
        return centroids
    
    def _assign_clusters(self, X, centroids):
        """Assign each point to the nearest centroid."""
        distances = np.sqrt(((X - centroids[:, np.newaxis])**2).sum(axis=2))
        return np.argmin(distances, axis=0)
    
    def _update_centroids(self, X, labels):
        """Update centroids based on current assignments."""
        centroids = np.zeros((self.n_clusters, X.shape[1]))
        
        for k in range(self.n_clusters):
            if np.sum(labels == k) > 0:
                centroids[k] = X[labels == k].mean(axis=0)
            else:
                # Handle empty cluster by reinitializing
                centroids[k] = X[np.random.randint(len(X))]
        
        return centroids
    
    def _calculate_wcss(self, X, labels, centroids):
        """Calculate Within-Cluster Sum of Squares."""
        wcss = 0
        for k in range(self.n_clusters):
            cluster_points = X[labels == k]
            if len(cluster_points) > 0:
                wcss += np.sum((cluster_points - centroids[k]) ** 2)
        return wcss
    
    def fit(self, X):
        """Fit K-means clustering to data."""
        # Initialize centroids
        self.centroids = self._initialize_centroids(X)
        self.wcss_history = []
        
        for iteration in range(self.max_iters):
            # Assign points to clusters
            self.labels = self._assign_clusters(X, self.centroids)
            
            # Calculate WCSS
            wcss = self._calculate_wcss(X, self.labels, self.centroids)
            self.wcss_history.append(wcss)
            
            # Update centroids
            new_centroids = self._update_centroids(X, self.labels)
            
            # Check for convergence
            if np.allclose(self.centroids, new_centroids, rtol=self.tol):
                print(f"Converged after {iteration + 1} iterations")
                break
                
            self.centroids = new_centroids
        
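        # Final assignment and WCSS, so labels and inertia match the final centroids
        # (matters when max_iters is reached without convergence)
        self.labels = self._assign_clusters(X, self.centroids)
        self.inertia_ = self._calculate_wcss(X, self.labels, self.centroids)
        return self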
    
    def predict(self, X):
        """Predict cluster labels for new data."""
        return self._assign_clusters(X, self.centroids)
    
    def fit_predict(self, X):
        """Fit the model and predict cluster labels."""
        return self.fit(X).labels

# Test K-means implementation
print("🧪 Testing K-Means Implementation:")

# Generate sample data
X_blobs, y_true = make_blobs(n_samples=300, centers=4, n_features=2, 
                            random_state=42, cluster_std=0.60)

# Apply our K-means
kmeans_custom = KMeansFromScratch(n_clusters=4, init='kmeans++', max_iters=100)
labels_custom = kmeans_custom.fit_predict(X_blobs)

print(f"Final WCSS: {kmeans_custom.inertia_:.2f}")
print(f"Number of iterations: {len(kmeans_custom.wcss_history)}")

# Compare with sklearn
from sklearn.cluster import KMeans
kmeans_sklearn = KMeans(n_clusters=4, random_state=42, n_init=10)
labels_sklearn = kmeans_sklearn.fit_predict(X_blobs)

print(f"Sklearn WCSS: {kmeans_sklearn.inertia_:.2f}")
print(f"\nCustom implementation WCSS: {kmeans_custom.inertia_:.2f}")
print(f"Difference: {abs(kmeans_sklearn.inertia_ - kmeans_custom.inertia_):.2f}")

print("\n✅ K-means implementation test completed!")

In [None]:
# Visualize K-means results
plt.figure(figsize=(15, 5))

# Plot 1: Original data with true clusters
plt.subplot(1, 3, 1)
scatter = plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, alpha=0.7)
plt.title('True Clusters')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(scatter)
plt.grid(True, alpha=0.3)

# Plot 2: Our K-means results
plt.subplot(1, 3, 2)
scatter = plt.scatter(X_blobs[:, 0], X_blobs[:, 1], c=labels_custom, alpha=0.7)
plt.scatter(kmeans_custom.centroids[:, 0], kmeans_custom.centroids[:, 1], 
           c='red', marker='x', s=200, linewidths=3, label='Centroids')
plt.title('Custom K-means Results')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.colorbar(scatter)
plt.grid(True, alpha=0.3)

# Plot 3: WCSS convergence
plt.subplot(1, 3, 3)
plt.plot(range(1, len(kmeans_custom.wcss_history) + 1), 
         kmeans_custom.wcss_history, 'o-', alpha=0.8)
plt.xlabel('Iteration')
plt.ylabel('Within-Cluster Sum of Squares')
plt.title('WCSS Convergence')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Elbow method for optimal K
def elbow_method(X, max_k=10):
    """Find optimal number of clusters using elbow method."""
    wcss_values = []
    k_values = range(1, max_k + 1)
    
    for k in k_values:
        if k == 1:
            # For k=1, WCSS is total sum of squares from the mean
            center = X.mean(axis=0)
            wcss = np.sum((X - center) ** 2)
        else:
            kmeans = KMeansFromScratch(n_clusters=k, max_iters=50)
            kmeans.fit(X)
            wcss = kmeans.inertia_
        
        wcss_values.append(wcss)
    
    return k_values, wcss_values

# Run elbow method
print("🧪 Running Elbow Method Analysis:")
k_range, wcss_values = elbow_method(X_blobs, max_k=8)

plt.figure(figsize=(10, 6))
plt.plot(k_range, wcss_values, 'o-', linewidth=2, markersize=8, alpha=0.8)
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Within-Cluster Sum of Squares')
plt.title('Elbow Method for Optimal k')
plt.grid(True, alpha=0.3)

# Highlight the elbow point (k=4 in this case)
plt.axvline(x=4, color='red', linestyle='--', alpha=0.7, label='Optimal k=4')
plt.legend()

# Add annotations
for k, wcss in zip(k_range, wcss_values):
    plt.annotate(f'{wcss:.0f}', (k, wcss), textcoords="offset points", 
                xytext=(0,10), ha='center', fontsize=9)

plt.show()

print(f"WCSS values: {[f'{w:.1f}' for w in wcss_values]}")
print("🎯 Optimal k appears to be 4 (elbow point)")

## 🎯 Problem 2: Logistic Regression with Regularization

**Problem Statement**: Implement logistic regression with L1/L2 regularization and multiple solvers.

**Requirements**:
- Binary and multiclass classification support
- L1 (Lasso) and L2 (Ridge) regularization
- Gradient descent with different optimizers
- Probability predictions and decision boundaries
- Handle numerical stability issues

**Key Concepts**: Sigmoid function, cross-entropy loss, gradient descent, regularization
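
Before the full class, a quick way to convince yourself that the gradient matches the cost is a finite-difference check. The sketch below is an illustration under stated assumptions: the standalone helpers `sigmoid`, `loss`, and `grad` and the random demo data are hypothetical, and the L2 penalty is written as `lam * sum(w[1:]**2)` with column 0 treated as the intercept, which mirrors the convention used in the implementation that follows.

```python
import numpy as np

def sigmoid(z):
    # Clip to keep np.exp from overflowing for large |z|
    return 1.0 / (1.0 + np.exp(-np.clip(z, -250, 250)))

def loss(w, X, y, lam):
    """Binary cross-entropy plus an L2 penalty that skips the intercept (column 0)."""
    p = np.clip(sigmoid(X @ w), 1e-15, 1 - 1e-15)
    bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    return bce + lam * np.sum(w[1:] ** 2)

def grad(w, X, y, lam):
    """Analytic gradient of loss() with respect to w."""
    g = X.T @ (sigmoid(X @ w) - y) / len(y)
    penalty = 2 * lam * w
    penalty[0] = 0.0  # the intercept is not regularized
    return g + penalty

rng = np.random.default_rng(0)
X_demo = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 3))])  # intercept + 3 features
y_demo = (rng.random(50) < 0.5).astype(float)
w_demo = rng.normal(scale=0.1, size=4)
lam = 0.01

# Central differences: perturb one weight at a time and measure the change in loss
eps = 1e-6
numeric = np.zeros_like(w_demo)
for j in range(len(w_demo)):
    step = np.zeros_like(w_demo)
    step[j] = eps
    numeric[j] = (loss(w_demo + step, X_demo, y_demo, lam)
                  - loss(w_demo - step, X_demo, y_demo, lam)) / (2 * eps)

max_abs_diff = np.max(np.abs(grad(w_demo, X_demo, y_demo, lam) - numeric))
print(f"Max |analytic - numeric| = {max_abs_diff:.2e}")
```

If the analytic and numeric gradients disagree by more than roughly 1e-6, the usual culprits are a missing factor of 2 in the L2 term or a penalty accidentally applied to the intercept.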

In [None]:
class LogisticRegressionFromScratch:
    """Logistic Regression with regularization from scratch."""
    
    def __init__(self, learning_rate=0.01, max_iters=1000, 
                 regularization=None, reg_strength=0.01,
                 fit_intercept=True, tol=1e-6, solver='gd'):
        self.learning_rate = learning_rate
        self.max_iters = max_iters
        self.regularization = regularization  # None, 'l1', 'l2', 'elasticnet'
        self.reg_strength = reg_strength
        self.fit_intercept = fit_intercept
        self.tol = tol
        self.solver = solver  # 'gd', 'sgd', 'adam'
        
        # For Adam optimizer
        self.beta1 = 0.9
        self.beta2 = 0.999
        self.epsilon = 1e-8
        
    def _add_intercept(self, X):
        """Add bias term to features."""
        intercept = np.ones((X.shape[0], 1))
        return np.concatenate((intercept, X), axis=1)
    
    def _sigmoid(self, z):
        """Sigmoid activation with numerical stability."""
        # Clip z to prevent overflow
        z = np.clip(z, -250, 250)
        return 1 / (1 + np.exp(-z))
    
    def _cost_function(self, X, y, weights):
        """Calculate logistic regression cost with regularization."""
        # Forward pass
        z = X @ weights
        predictions = self._sigmoid(z)
        
        # Avoid log(0) by adding small epsilon
        epsilon = 1e-15
        predictions = np.clip(predictions, epsilon, 1 - epsilon)
        
        # Binary cross-entropy loss
        cost = -np.mean(y * np.log(predictions) + (1 - y) * np.log(1 - predictions))
        
        # Add regularization
        if self.regularization == 'l1':
            # Don't regularize intercept
            reg_weights = weights[1:] if self.fit_intercept else weights
            cost += self.reg_strength * np.sum(np.abs(reg_weights))
        elif self.regularization == 'l2':
            reg_weights = weights[1:] if self.fit_intercept else weights
            cost += self.reg_strength * np.sum(reg_weights ** 2)
        elif self.regularization == 'elasticnet':
            reg_weights = weights[1:] if self.fit_intercept else weights
            l1_penalty = self.reg_strength * 0.5 * np.sum(np.abs(reg_weights))
            l2_penalty = self.reg_strength * 0.5 * np.sum(reg_weights ** 2)
            cost += l1_penalty + l2_penalty
        
        return cost
    
    def _compute_gradients(self, X, y, weights):
        """Compute gradients for weight updates."""
        # Forward pass
        z = X @ weights
        predictions = self._sigmoid(z)
        
        # Basic gradient
        gradients = (1 / len(y)) * X.T @ (predictions - y)
        
        # Add regularization gradients
        if self.regularization == 'l1':
            l1_grad = self.reg_strength * np.sign(weights)
            if self.fit_intercept:
                l1_grad[0] = 0  # Don't regularize intercept
            gradients += l1_grad
        elif self.regularization == 'l2':
            l2_grad = 2 * self.reg_strength * weights
            if self.fit_intercept:
                l2_grad[0] = 0  # Don't regularize intercept
            gradients += l2_grad
        elif self.regularization == 'elasticnet':
            l1_grad = self.reg_strength * 0.5 * np.sign(weights)
            l2_grad = self.reg_strength * 0.5 * 2 * weights
            if self.fit_intercept:
                l1_grad[0] = 0
                l2_grad[0] = 0
            gradients += l1_grad + l2_grad
        
        return gradients
    
    def _update_weights_gd(self, gradients):
        """Standard gradient descent update."""
        self.weights -= self.learning_rate * gradients
    
    def _update_weights_adam(self, gradients, t):
        """Adam optimizer update."""
        # Update biased first moment estimate
        self.m = self.beta1 * self.m + (1 - self.beta1) * gradients
        
        # Update biased second raw moment estimate
        self.v = self.beta2 * self.v + (1 - self.beta2) * (gradients ** 2)
        
        # Compute bias-corrected first moment estimate
        m_corrected = self.m / (1 - self.beta1 ** t)
        
        # Compute bias-corrected second raw moment estimate
        v_corrected = self.v / (1 - self.beta2 ** t)
        
        # Update weights
        self.weights -= self.learning_rate * m_corrected / (np.sqrt(v_corrected) + self.epsilon)
    
    def fit(self, X, y):
        """Train the logistic regression model."""
        # Add intercept term if needed
        if self.fit_intercept:
            X = self._add_intercept(X)
        
        # Initialize weights
        n_features = X.shape[1]
        self.weights = np.random.normal(0, 0.01, n_features)
        
        # Initialize Adam optimizer variables
        if self.solver == 'adam':
            self.m = np.zeros_like(self.weights)
            self.v = np.zeros_like(self.weights)
        
        # Store training history
        self.cost_history = []
        
        # Training loop
        for i in range(self.max_iters):
            # Calculate cost
            cost = self._cost_function(X, y, self.weights)
            self.cost_history.append(cost)
            
            # Compute gradients and update weights based on solver
            if self.solver == 'sgd':
                # Mini-batch SGD: estimate the gradient from a random batch of up to 32 samples
                indices = np.random.choice(len(X), size=min(32, len(X)), replace=False)
                gradients = self._compute_gradients(X[indices], y[indices], self.weights)
                self._update_weights_gd(gradients)
            else:
                # Full-batch gradient for 'gd' and 'adam'
                gradients = self._compute_gradients(X, y, self.weights)
                if self.solver == 'gd':
                    self._update_weights_gd(gradients)
                elif self.solver == 'adam':
                    self._update_weights_adam(gradients, i + 1)
            
            # Check for convergence
            if i > 0 and abs(self.cost_history[-2] - self.cost_history[-1]) < self.tol:
                print(f"Converged after {i + 1} iterations")
                break
        
        return self
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        if self.fit_intercept:
            X = self._add_intercept(X)
        
        return self._sigmoid(X @ self.weights)
    
    def predict(self, X, threshold=0.5):
        """Predict binary class labels."""
        probabilities = self.predict_proba(X)
        return (probabilities >= threshold).astype(int)

# Test Logistic Regression implementation
print("🧪 Testing Logistic Regression Implementation:")

# Generate sample data
X_class, y_class = make_classification(n_samples=1000, n_features=2, n_redundant=0,
                                      n_informative=2, n_clusters_per_class=1, random_state=42)

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_class, y_class, test_size=0.2, random_state=42)

# Test different configurations
configs = [
    {'regularization': None, 'solver': 'gd', 'name': 'No Regularization (GD)'},
    {'regularization': 'l2', 'reg_strength': 0.01, 'solver': 'gd', 'name': 'L2 Regularization (GD)'},
    {'regularization': 'l1', 'reg_strength': 0.01, 'solver': 'gd', 'name': 'L1 Regularization (GD)'},
    {'regularization': 'l2', 'reg_strength': 0.01, 'solver': 'adam', 'name': 'L2 Regularization (Adam)'}
]

results = []

for config in configs:
    print(f"\n=== Testing {config['name']} ===")
    
    # Create and train model
    lr = LogisticRegressionFromScratch(
        learning_rate=0.01,
        max_iters=1000,
        regularization=config.get('regularization'),
        reg_strength=config.get('reg_strength', 0.01),
        solver=config.get('solver', 'gd')
    )
    
    # Train model
    start_time = time.time()
    lr.fit(X_train, y_train)
    training_time = time.time() - start_time
    
    # Make predictions
    y_pred = lr.predict(X_test)
    y_proba = lr.predict_proba(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    
    results.append({
        'name': config['name'],
        'accuracy': accuracy,
        'final_cost': lr.cost_history[-1],
        'iterations': len(lr.cost_history),
        'training_time': training_time,
        'weights': lr.weights.copy(),
        'cost_history': lr.cost_history.copy()
    })
    
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Final cost: {lr.cost_history[-1]:.4f}")
    print(f"Training time: {training_time:.3f}s")
    print(f"Iterations: {len(lr.cost_history)}")

print("\n✅ Logistic Regression tests completed!")

In [None]:
# Visualize Logistic Regression results
plt.figure(figsize=(16, 12))

# Plot 1: Original data
plt.subplot(3, 3, 1)
scatter = plt.scatter(X_class[:, 0], X_class[:, 1], c=y_class, alpha=0.7)
plt.title('Training Data')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.colorbar(scatter)
plt.grid(True, alpha=0.3)

# Plot 2-5: Decision boundaries for different models
def plot_decision_boundary(X, y, model, title, subplot_idx):
    plt.subplot(3, 3, subplot_idx)
    
    # Create a mesh
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    # Predict on mesh
    mesh_points = np.c_[xx.ravel(), yy.ravel()]
    Z = model.predict_proba(mesh_points)
    Z = Z.reshape(xx.shape)
    
    # Plot decision boundary
    plt.contourf(xx, yy, Z, levels=50, alpha=0.6, cmap='RdYlBu')
    plt.contour(xx, yy, Z, levels=[0.5], colors='black', linestyles='--', linewidths=2)
    
    # Plot data points
    scatter = plt.scatter(X[:, 0], X[:, 1], c=y, alpha=0.8, edgecolors='black')
    plt.title(title)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

# Plot decision boundaries for first 4 models
for i, result in enumerate(results[:4]):
    # Recreate model with learned weights
    model = LogisticRegressionFromScratch()
    model.weights = result['weights']
    model.fit_intercept = True
    
    plot_decision_boundary(X_test, y_test, model, result['name'], i + 2)

# Plot 6: Cost histories
plt.subplot(3, 3, 6)
for result in results:
    plt.plot(result['cost_history'], label=result['name'], alpha=0.8)
plt.xlabel('Iteration')
plt.ylabel('Cost')
plt.title('Training Cost History')
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)

# Plot 7: Accuracy comparison
plt.subplot(3, 3, 7)
names = [r['name'] for r in results]
accuracies = [r['accuracy'] for r in results]
bars = plt.bar(range(len(names)), accuracies, alpha=0.7)
plt.xlabel('Model')
plt.ylabel('Accuracy')
plt.title('Model Accuracy Comparison')
# Shorten labels but keep the solver, so the two L2 models stay distinguishable
plt.xticks(range(len(names)), [n.replace('Regularization', 'Reg.') for n in names], rotation=45)

# Add value labels on bars
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{acc:.3f}', ha='center', va='bottom')

plt.grid(True, alpha=0.3)

# Plot 8: Training time comparison
plt.subplot(3, 3, 8)
times = [r['training_time'] for r in results]
bars = plt.bar(range(len(names)), times, alpha=0.7, color='orange')
plt.xlabel('Model')
plt.ylabel('Training Time (s)')
plt.title('Training Time Comparison')
# Shorten labels but keep the solver, so the two L2 models stay distinguishable
plt.xticks(range(len(names)), [n.replace('Regularization', 'Reg.') for n in names], rotation=45)

# Add value labels on bars
for bar, time_val in zip(bars, times):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + bar.get_height()*0.05,
             f'{time_val:.3f}s', ha='center', va='bottom')

plt.grid(True, alpha=0.3)

# Plot 9: Weight magnitudes (regularization effect)
plt.subplot(3, 3, 9)
for i, result in enumerate(results):
    weights = result['weights']
    weight_magnitudes = np.abs(weights[1:])  # Exclude intercept
    plt.bar(np.arange(len(weight_magnitudes)) + i*0.2, weight_magnitudes, 
            width=0.2, label=result['name'].replace('Regularization', 'Reg.'), alpha=0.7)

plt.xlabel('Weight Index')
plt.ylabel('Weight Magnitude')
plt.title('Weight Magnitudes (Regularization Effect)')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary table
print("📊 Model Performance Summary:")
print("=" * 80)
print(f"{'Model':<25} {'Accuracy':<12} {'Final Cost':<12} {'Time (s)':<10} {'Iterations':<12}")
print("=" * 80)
for result in results:
    print(f"{result['name'][:24]:<25} {result['accuracy']:<12.4f} "
          f"{result['final_cost']:<12.4f} {result['training_time']:<10.3f} {result['iterations']:<12}")

## 🌳 Problem 3: Decision Tree Implementation

**Problem Statement**: Implement a decision tree classifier with different splitting criteria.

**Requirements**:
- Support for Gini impurity, Entropy, and Classification Error
- Recursive tree building with stopping criteria
- Handle categorical and numerical features
- Pruning to prevent overfitting
- Visualization of the tree structure

**Key Concepts**: Information gain, impurity measures, recursive partitioning
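
To make the three impurity measures concrete, the sketch below scores one candidate split under each criterion. The helper names `impurity` and `information_gain` and the eight-label toy array are illustrative assumptions, separate from the class that follows. The split is chosen so that classification error reports zero gain while Gini and entropy both reward it, which is a common argument for preferring the latter two when growing a tree.

```python
import numpy as np

def impurity(y, criterion="gini"):
    """Impurity of an integer label array under the given criterion."""
    p = np.bincount(y) / len(y)
    if criterion == "gini":
        return 1.0 - np.sum(p ** 2)
    if criterion == "entropy":
        p = p[p > 0]  # skip empty classes to avoid log(0)
        return -np.sum(p * np.log2(p))
    if criterion == "misclassification":
        return 1.0 - np.max(p)
    raise ValueError(f"Unknown criterion: {criterion}")

def information_gain(y, left_mask, criterion="gini"):
    """Parent impurity minus the size-weighted impurity of the two children."""
    y_left, y_right = y[left_mask], y[~left_mask]
    w_left = len(y_left) / len(y)
    children = (w_left * impurity(y_left, criterion)
                + (1 - w_left) * impurity(y_right, criterion))
    return impurity(y, criterion) - children

# Parent has six samples of class 0 and two of class 1.
# The split sends four class-0 samples left and leaves a 2/2 mix on the right.
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
left_mask = np.array([True, True, True, True, False, False, False, False])

gains = {c: information_gain(y_toy, left_mask, c)
         for c in ("gini", "entropy", "misclassification")}
for name, gain in gains.items():
    print(f"{name:<18} gain = {gain:.4f}")
# gini               gain = 0.1250
# entropy            gain = 0.3113
# misclassification  gain = 0.0000
```

Both children still predict class 0 by majority vote, so the error rate is unchanged, yet the left child is now pure. Gini and entropy are strictly concave in the class proportions, so they register that progress.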

In [None]:
class DecisionTreeNode:
    """Node class for decision tree."""
    
    def __init__(self):
        self.feature_index = None
        self.threshold = None
        self.left = None
        self.right = None
        self.value = None  # For leaf nodes
        self.samples = 0
        self.impurity = 0

class DecisionTreeFromScratch:
    """Decision Tree Classifier implementation from scratch."""
    
    def __init__(self, criterion='gini', max_depth=None, min_samples_split=2,
                 min_samples_leaf=1, max_features=None):
        self.criterion = criterion  # 'gini', 'entropy', 'misclassification'
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.max_features = max_features
        self.root = None
        self.n_classes_ = None
        self.n_features_ = None
    
    def _calculate_impurity(self, y):
        """Calculate impurity based on criterion."""
        if len(y) == 0:
            return 0
        
        proportions = np.bincount(y) / len(y)
        
        if self.criterion == 'gini':
            return 1 - np.sum(proportions ** 2)
        elif self.criterion == 'entropy':
            # Avoid log(0)
            proportions = proportions[proportions > 0]
            return -np.sum(proportions * np.log2(proportions))
        elif self.criterion == 'misclassification':
            return 1 - np.max(proportions)
    
    def _calculate_information_gain(self, y, left_indices, right_indices):
        """Calculate information gain from a split."""
        n = len(y)
        n_left, n_right = len(left_indices), len(right_indices)
        
        if n_left == 0 or n_right == 0:
            return 0
        
        # Parent impurity
        parent_impurity = self._calculate_impurity(y)
        
        # Children impurities
        left_impurity = self._calculate_impurity(y[left_indices])
        right_impurity = self._calculate_impurity(y[right_indices])
        
        # Weighted average of children impurities
        weighted_impurity = (n_left / n) * left_impurity + (n_right / n) * right_impurity
        
        return parent_impurity - weighted_impurity
    
    def _find_best_split(self, X, y):
        """Find the best feature and threshold to split on."""
        best_gain = -1
        best_feature = None
        best_threshold = None
        
        n_features = X.shape[1]
        
        # Feature selection
        if self.max_features is None:
            features_to_consider = range(n_features)
        else:
            n_features_to_consider = min(self.max_features, n_features)
            features_to_consider = np.random.choice(n_features, n_features_to_consider, replace=False)
        
        for feature_index in features_to_consider:
            feature_values = X[:, feature_index]
            possible_thresholds = np.unique(feature_values)
            
            for threshold in possible_thresholds:
                left_indices = np.where(feature_values <= threshold)[0]
                right_indices = np.where(feature_values > threshold)[0]
                
                # Check minimum samples constraints
                if (len(left_indices) < self.min_samples_leaf or 
                    len(right_indices) < self.min_samples_leaf):
                    continue
                
                # Calculate information gain
                gain = self._calculate_information_gain(y, left_indices, right_indices)
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_index
                    best_threshold = threshold
        
        return best_feature, best_threshold, best_gain
    
    def _build_tree(self, X, y, depth=0):
        """Recursively build the decision tree."""
        node = DecisionTreeNode()
        node.samples = len(y)
        node.impurity = self._calculate_impurity(y)
        
        # Stopping criteria
        if (len(y) < self.min_samples_split or 
            (self.max_depth is not None and depth >= self.max_depth) or
            len(np.unique(y)) == 1):  # Pure node
            
            # Create leaf node
            node.value = np.argmax(np.bincount(y))
            return node
        
        # Find best split
        best_feature, best_threshold, best_gain = self._find_best_split(X, y)
        
        if best_feature is None or best_gain <= 0:
            # No good split found, create leaf
            node.value = np.argmax(np.bincount(y))
            return node
        
        # Create internal node
        node.feature_index = best_feature
        node.threshold = best_threshold
        
        # Split data
        left_indices = X[:, best_feature] <= best_threshold
        right_indices = ~left_indices
        
        # Recursively build children
        node.left = self._build_tree(X[left_indices], y[left_indices], depth + 1)
        node.right = self._build_tree(X[right_indices], y[right_indices], depth + 1)
        
        return node
    
    def fit(self, X, y):
        """Train the decision tree."""
        self.n_classes_ = len(np.unique(y))
        self.n_features_ = X.shape[1]
        
        # Build the tree
        self.root = self._build_tree(X, y)
        return self
    
    def _predict_sample(self, x, node):
        """Predict a single sample."""
        if node.value is not None:  # Leaf node
            return node.value
        
        if x[node.feature_index] <= node.threshold:
            return self._predict_sample(x, node.left)
        else:
            return self._predict_sample(x, node.right)
    
    def predict(self, X):
        """Predict class labels for samples."""
        predictions = []
        for x in X:
            predictions.append(self._predict_sample(x, self.root))
        return np.array(predictions)
    
    def _get_tree_depth(self, node):
        """Calculate tree depth as the number of edges on the longest root-to-leaf path."""
        if node.value is not None:  # Leaf node
            return 0  # Matches max_depth, where the root sits at depth 0
        
        left_depth = self._get_tree_depth(node.left) if node.left else 0
        right_depth = self._get_tree_depth(node.right) if node.right else 0
        
        return 1 + max(left_depth, right_depth)
    
    def get_tree_depth(self):
        """Get the depth of the fitted tree."""
        return self._get_tree_depth(self.root)
    
    def _count_nodes(self, node):
        """Count total number of nodes in the tree."""
        if node.value is not None:  # Leaf node
            return 1
        
        left_count = self._count_nodes(node.left) if node.left else 0
        right_count = self._count_nodes(node.right) if node.right else 0
        
        return 1 + left_count + right_count
    
    def count_nodes(self):
        """Count total number of nodes in the fitted tree."""
        return self._count_nodes(self.root)

# Test Decision Tree implementation
print("🧪 Testing Decision Tree Implementation:")

# Load Iris dataset for testing
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(X_iris, y_iris, test_size=0.3, random_state=42)

# Test different criteria
criteria = ['gini', 'entropy', 'misclassification']
dt_results = []

for criterion in criteria:
    print(f"\n=== Testing with {criterion} criterion ===")
    
    # Create and train decision tree
    dt = DecisionTreeFromScratch(
        criterion=criterion,
        max_depth=5,
        min_samples_split=2,
        min_samples_leaf=1
    )
    
    # Train model
    start_time = time.time()
    dt.fit(X_train, y_train)
    training_time = time.time() - start_time
    
    # Make predictions
    y_pred = dt.predict(X_test)
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    tree_depth = dt.get_tree_depth()
    node_count = dt.count_nodes()
    
    dt_results.append({
        'criterion': criterion,
        'accuracy': accuracy,
        'depth': tree_depth,
        'nodes': node_count,
        'training_time': training_time,
        'predictions': y_pred
    })
    
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Tree depth: {tree_depth}")
    print(f"Number of nodes: {node_count}")
    print(f"Training time: {training_time:.4f}s")

# Compare with sklearn
from sklearn.tree import DecisionTreeClassifier
sklearn_dt = DecisionTreeClassifier(criterion='gini', max_depth=5, random_state=42)
sklearn_dt.fit(X_train, y_train)
sklearn_pred = sklearn_dt.predict(X_test)
sklearn_accuracy = accuracy_score(y_test, sklearn_pred)

print("\n🔍 Comparison with Scikit-learn:")
print(f"Custom Decision Tree (Gini): {dt_results[0]['accuracy']:.4f}")
print(f"Sklearn Decision Tree (Gini): {sklearn_accuracy:.4f}")
print(f"Difference: {abs(dt_results[0]['accuracy'] - sklearn_accuracy):.4f}")

print("\n✅ Decision Tree tests completed!")

In [None]:
# Visualize Decision Tree results
plt.figure(figsize=(15, 10))

# Plot 1: Accuracy comparison
plt.subplot(2, 3, 1)
criteria = [r['criterion'] for r in dt_results]
accuracies = [r['accuracy'] for r in dt_results]
bars = plt.bar(criteria, accuracies, alpha=0.7, color=['skyblue', 'lightgreen', 'salmon'])
plt.ylabel('Accuracy')
plt.title('Accuracy by Splitting Criterion')
plt.grid(True, alpha=0.3)

# Add sklearn comparison
plt.axhline(y=sklearn_accuracy, color='red', linestyle='--', label=f'Sklearn: {sklearn_accuracy:.3f}')
plt.legend()

# Add value labels
for bar, acc in zip(bars, accuracies):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{acc:.3f}', ha='center', va='bottom')

# Plot 2: Tree complexity comparison
plt.subplot(2, 3, 2)
depths = [r['depth'] for r in dt_results]
nodes = [r['nodes'] for r in dt_results]

x_pos = np.arange(len(criteria))
width = 0.35

plt.bar(x_pos - width/2, depths, width, label='Tree Depth', alpha=0.7, color='orange')
plt.bar(x_pos + width/2, nodes, width, label='Node Count', alpha=0.7, color='purple')

plt.xlabel('Criterion')
plt.ylabel('Count')
plt.title('Tree Complexity Comparison')
plt.xticks(x_pos, criteria)
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Normalized feature variance (a rough stand-in for feature importance)
plt.subplot(2, 3, 3)
# Note: this is NOT a tree-based importance. A proper importance would
# traverse the fitted tree and sum the impurity decrease of every split
# made on each feature; variance is shown here only for demonstration.
feature_names = iris.feature_names
feature_importance = np.var(X_train, axis=0)
feature_importance = feature_importance / np.sum(feature_importance)  # Normalize

bars = plt.bar(range(len(feature_names)), feature_importance, alpha=0.7, color='green')
plt.xlabel('Features')
plt.ylabel('Relative Variance')
plt.title('Feature Variance (Importance Proxy)')
plt.xticks(range(len(feature_names)), [name[:10] + '...' if len(name) > 10 else name 
                                      for name in feature_names], rotation=45)
plt.grid(True, alpha=0.3)

# Plot 4: Confusion Matrix for best model
plt.subplot(2, 3, 4)
best_result = max(dt_results, key=lambda x: x['accuracy'])
cm = confusion_matrix(y_test, best_result['predictions'])

sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=iris.target_names, yticklabels=iris.target_names)
plt.title(f'Confusion Matrix ({best_result["criterion"].title()} Criterion)')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Plot 5: Training time comparison
plt.subplot(2, 3, 5)
training_times = [r['training_time'] for r in dt_results]
bars = plt.bar(criteria, training_times, alpha=0.7, color='coral')
plt.ylabel('Training Time (s)')
plt.title('Training Time by Criterion')
plt.grid(True, alpha=0.3)

# Add value labels
for bar, time_val in zip(bars, training_times):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + bar.get_height()*0.05,
             f'{time_val:.4f}s', ha='center', va='bottom')

# Plot 6: Decision boundary visualization (using first two features)
plt.subplot(2, 3, 6)
# Use only first two features for visualization
X_2d = X_train[:, :2]
y_2d = y_train

# Train a 2D decision tree
dt_2d = DecisionTreeFromScratch(criterion='gini', max_depth=3)
dt_2d.fit(X_2d, y_2d)

# Create mesh for decision boundary
h = 0.02
x_min, x_max = X_2d[:, 0].min() - 1, X_2d[:, 0].max() + 1
y_min, y_max = X_2d[:, 1].min() - 1, X_2d[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Predict on mesh
mesh_points = np.c_[xx.ravel(), yy.ravel()]
Z = dt_2d.predict(mesh_points)
Z = Z.reshape(xx.shape)

# Plot decision boundary
plt.contourf(xx, yy, Z, alpha=0.6, cmap='viridis')
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y_2d, alpha=0.8, edgecolors='black')
plt.xlabel(feature_names[0])
plt.ylabel(feature_names[1])
plt.title('Decision Boundary (First 2 Features)')
plt.colorbar(scatter)

plt.tight_layout()
plt.show()

# Print detailed results table
print("📊 Decision Tree Performance Summary:")
print("=" * 70)
print(f"{'Criterion':<15} {'Accuracy':<10} {'Depth':<6} {'Nodes':<6} {'Time (s)':<10}")
print("=" * 70)
for result in dt_results:
    print(f"{result['criterion']:<15} {result['accuracy']:<10.4f} "
          f"{result['depth']:<6} {result['nodes']:<6} {result['training_time']:<10.4f}")
print("=" * 70)
print(f"{'Sklearn (Gini)':<15} {sklearn_accuracy:<10.4f} {'N/A':<6} {'N/A':<6} {'N/A':<10}")
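
Plot 3 uses feature variance as a stand-in for importance. A closer proxy is the weighted Gini decrease that a split on each feature achieves. The sketch below scores each feature by its best single split on a toy dataset; it does not traverse a fitted `DecisionTreeFromScratch`, and the helper names (`gini`, `impurity_decrease`, `best_split_decrease`) are hypothetical, chosen for illustration only.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a 1-D array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right):
    """Weighted Gini decrease from splitting `parent` into `left` and `right`."""
    n = len(parent)
    return gini(parent) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

def best_split_decrease(x, y):
    """Largest impurity decrease over all thresholds on a single feature."""
    best = 0.0
    # Skip the largest value: splitting there would leave the right side empty
    for threshold in np.unique(x)[:-1]:
        mask = x <= threshold
        best = max(best, impurity_decrease(y, y[mask], y[~mask]))
    return best

# Toy data (assumption): feature 0 separates the classes cleanly, feature 1 does not
X_demo = np.array([[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]])
y_demo = np.array([0, 0, 1, 1])

raw = np.array([best_split_decrease(X_demo[:, j], y_demo) for j in range(X_demo.shape[1])])
importance = raw / raw.sum()  # normalize so the scores sum to 1
print(importance)
```

A full-tree importance would accumulate `impurity_decrease` over every internal node, weighted by the fraction of samples reaching that node.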

## 📊 Problem 4: Model Evaluation and Cross-Validation

**Problem Statement**: Implement comprehensive model evaluation techniques from scratch.

**Requirements**:
- K-fold cross-validation
- Stratified sampling for imbalanced datasets
- Bootstrap sampling and confidence intervals
- Multiple evaluation metrics
- Learning curves and validation curves

**Key Concepts**: Bias-variance tradeoff, overfitting detection, model selection
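
Before the full evaluator, here is a minimal sketch of the first requirement, k-fold index splitting. It assumes `np.array_split` for the partition, so fold sizes differ by at most one sample instead of pushing the remainder into the last fold. The function name `k_fold_indices` is hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    rng = np.random.default_rng(seed)      # local generator, no global seeding
    shuffled = rng.permutation(n_samples)
    folds = np.array_split(shuffled, k)    # fold sizes differ by at most one
    for i in range(k):
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, folds[i]

# 178 matches the number of samples in the Wine dataset
splits = list(k_fold_indices(n_samples=178, k=5))
test_sizes = [len(test_idx) for _, test_idx in splits]
print(test_sizes)
```

Every sample appears in exactly one test fold, which is the property that makes the averaged fold score an estimate over the whole dataset.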

In [None]:
class ModelEvaluator:
    """Comprehensive model evaluation toolkit."""
    
    @staticmethod
    def k_fold_cross_validation(X, y, model, k=5, random_state=None):
        """Perform k-fold cross-validation."""
        if random_state is not None:
            np.random.seed(random_state)
        
        n_samples = len(X)
        indices = np.arange(n_samples)
        np.random.shuffle(indices)
        
        fold_size = n_samples // k
        scores = []
        predictions_all = np.zeros(n_samples)
        
        for fold in range(k):
            # Define test indices for this fold
            start_idx = fold * fold_size
            end_idx = start_idx + fold_size if fold < k - 1 else n_samples
            
            test_indices = indices[start_idx:end_idx]
            train_indices = np.concatenate([indices[:start_idx], indices[end_idx:]])
            
            # Split data
            X_train_fold = X[train_indices]
            y_train_fold = y[train_indices]
            X_test_fold = X[test_indices]
            y_test_fold = y[test_indices]
            
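            # Train and predict on a fresh copy of the unfitted model.
            # deepcopy avoids passing non-constructor attributes (e.g. learned
            # weights) to __init__, which type(model)(**model.__dict__) would do.
            from copy import deepcopy
            model_copy = deepcopy(model)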
            model_copy.fit(X_train_fold, y_train_fold)
            y_pred_fold = model_copy.predict(X_test_fold)
            
            # Store predictions
            predictions_all[test_indices] = y_pred_fold
            
            # Calculate accuracy for this fold
            fold_accuracy = accuracy_score(y_test_fold, y_pred_fold)
            scores.append(fold_accuracy)
        
        return scores, predictions_all
    
    @staticmethod
    def stratified_k_fold(X, y, model, k=5, random_state=None):
        """Perform stratified k-fold cross-validation."""
        if random_state is not None:
            np.random.seed(random_state)
        
        unique_classes, class_counts = np.unique(y, return_counts=True)
        n_samples = len(X)
        scores = []
        predictions_all = np.zeros(n_samples)
        
        # Create stratified folds
        folds = [[] for _ in range(k)]
        
        for class_label in unique_classes:
            class_indices = np.where(y == class_label)[0]
            np.random.shuffle(class_indices)
            
            # Distribute class samples across folds
            for i, idx in enumerate(class_indices):
                folds[i % k].append(idx)
        
        # Perform cross-validation
        for fold in range(k):
            test_indices = np.array(folds[fold])
            train_indices = np.concatenate([folds[i] for i in range(k) if i != fold])
            
            # Split data
            X_train_fold = X[train_indices]
            y_train_fold = y[train_indices]
            X_test_fold = X[test_indices]
            y_test_fold = y[test_indices]
            
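            # Train and predict on a fresh copy of the unfitted model.
            # deepcopy avoids passing non-constructor attributes (e.g. learned
            # weights) to __init__, which type(model)(**model.__dict__) would do.
            from copy import deepcopy
            model_copy = deepcopy(model)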
            model_copy.fit(X_train_fold, y_train_fold)
            y_pred_fold = model_copy.predict(X_test_fold)
            
            # Store predictions
            predictions_all[test_indices] = y_pred_fold
            
            # Calculate accuracy for this fold
            fold_accuracy = accuracy_score(y_test_fold, y_pred_fold)
            scores.append(fold_accuracy)
        
        return scores, predictions_all
    
    @staticmethod
    def bootstrap_evaluation(X, y, model, n_bootstrap=100, random_state=None):
        """Perform bootstrap evaluation for confidence intervals."""
        if random_state is not None:
            np.random.seed(random_state)
        
        n_samples = len(X)
        scores = []
        
        for i in range(n_bootstrap):
            # Bootstrap sample
            bootstrap_indices = np.random.choice(n_samples, n_samples, replace=True)
            out_of_bag_indices = np.setdiff1d(np.arange(n_samples), bootstrap_indices)
            
            if len(out_of_bag_indices) == 0:
                continue
            
            # Train on bootstrap sample
            X_bootstrap = X[bootstrap_indices]
            y_bootstrap = y[bootstrap_indices]
            
            # Test on out-of-bag samples
            X_oob = X[out_of_bag_indices]
            y_oob = y[out_of_bag_indices]
            
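            # Train and predict on a fresh copy of the unfitted model.
            # deepcopy avoids passing non-constructor attributes (e.g. learned
            # weights) to __init__, which type(model)(**model.__dict__) would do.
            from copy import deepcopy
            model_copy = deepcopy(model)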
            model_copy.fit(X_bootstrap, y_bootstrap)
            y_pred_oob = model_copy.predict(X_oob)
            
            # Calculate accuracy
            oob_accuracy = accuracy_score(y_oob, y_pred_oob)
            scores.append(oob_accuracy)
        
        return scores
    
    @staticmethod
    def learning_curve(X, y, model, train_sizes, cv=5, random_state=None):
        """Generate learning curve data."""
        if random_state is not None:
            np.random.seed(random_state)
        
        train_scores = []
        val_scores = []
        
        for train_size in train_sizes:
            # Limit training data
            n_train = int(train_size * len(X))
            indices = np.random.choice(len(X), n_train, replace=False)
            X_subset = X[indices]
            y_subset = y[indices]
            
            # Perform cross-validation on subset
            cv_scores, _ = ModelEvaluator.k_fold_cross_validation(
                X_subset, y_subset, model, k=cv, random_state=random_state
            )
            
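            # Also calculate the training score on a fresh copy of the model.
            # deepcopy avoids passing non-constructor attributes to __init__.
            from copy import deepcopy
            model_copy = deepcopy(model)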
            model_copy.fit(X_subset, y_subset)
            train_pred = model_copy.predict(X_subset)
            train_score = accuracy_score(y_subset, train_pred)
            
            train_scores.append(train_score)
            val_scores.append(np.mean(cv_scores))
        
        return train_sizes, train_scores, val_scores

# Test Model Evaluation techniques
print("🧪 Testing Model Evaluation Techniques:")

# Load Wine dataset for evaluation
wine = load_wine()
X_wine = wine.data
y_wine = wine.target

# Standardize features
scaler = StandardScaler()
X_wine_scaled = scaler.fit_transform(X_wine)

# Create a logistic regression model for testing
eval_model = LogisticRegressionFromScratch(
    learning_rate=0.01, max_iters=500, regularization='l2', reg_strength=0.01
)

# 1. K-Fold Cross-Validation
print("\n=== K-Fold Cross-Validation ===")
cv_scores, cv_predictions = ModelEvaluator.k_fold_cross_validation(
    X_wine_scaled, y_wine, eval_model, k=5, random_state=42
)

print(f"CV Scores: {[f'{score:.4f}' for score in cv_scores]}")
print(f"Mean CV Score: {np.mean(cv_scores):.4f} (+/- {np.std(cv_scores) * 2:.4f})")

# 2. Stratified K-Fold Cross-Validation
print("\n=== Stratified K-Fold Cross-Validation ===")
stratified_scores, stratified_predictions = ModelEvaluator.stratified_k_fold(
    X_wine_scaled, y_wine, eval_model, k=5, random_state=42
)

print(f"Stratified CV Scores: {[f'{score:.4f}' for score in stratified_scores]}")
print(f"Mean Stratified CV Score: {np.mean(stratified_scores):.4f} (+/- {np.std(stratified_scores) * 2:.4f})")

# 3. Bootstrap Evaluation
print("\n=== Bootstrap Evaluation ===")
bootstrap_scores = ModelEvaluator.bootstrap_evaluation(
    X_wine_scaled, y_wine, eval_model, n_bootstrap=50, random_state=42
)

bootstrap_mean = np.mean(bootstrap_scores)
bootstrap_ci = np.percentile(bootstrap_scores, [2.5, 97.5])

print(f"Bootstrap Mean Score: {bootstrap_mean:.4f}")
print(f"95% Confidence Interval: [{bootstrap_ci[0]:.4f}, {bootstrap_ci[1]:.4f}]")

# 4. Learning Curves
print("\n=== Learning Curves ===")
train_sizes = np.linspace(0.1, 1.0, 10)
lc_train_sizes, lc_train_scores, lc_val_scores = ModelEvaluator.learning_curve(
    X_wine_scaled, y_wine, eval_model, train_sizes, cv=3, random_state=42
)

print(f"Learning curve computed for {len(train_sizes)} training sizes")
print(f"Final training score: {lc_train_scores[-1]:.4f}")
print(f"Final validation score: {lc_val_scores[-1]:.4f}")

print("\n✅ Model evaluation tests completed!")

In [None]:
# Visualize Model Evaluation results
plt.figure(figsize=(16, 12))

# Plot 1: Cross-validation scores comparison
plt.subplot(3, 3, 1)
methods = ['K-Fold CV', 'Stratified CV']
mean_scores = [np.mean(cv_scores), np.mean(stratified_scores)]
std_scores = [np.std(cv_scores), np.std(stratified_scores)]

bars = plt.bar(methods, mean_scores, yerr=std_scores, alpha=0.7, capsize=5)
plt.ylabel('Accuracy')
plt.title('Cross-Validation Comparison')
plt.grid(True, alpha=0.3)

# Add value labels
for bar, score in zip(bars, mean_scores):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.01,
             f'{score:.3f}', ha='center', va='bottom')

# Plot 2: Individual fold scores
plt.subplot(3, 3, 2)
folds = range(1, len(cv_scores) + 1)  # follow the number of folds actually used
plt.plot(folds, cv_scores, 'o-', label='K-Fold CV', alpha=0.8, linewidth=2)
plt.plot(folds, stratified_scores, 's-', label='Stratified CV', alpha=0.8, linewidth=2)
plt.xlabel('Fold')
plt.ylabel('Accuracy')
plt.title('Fold-wise Performance')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 3: Bootstrap score distribution
plt.subplot(3, 3, 3)
plt.hist(bootstrap_scores, bins=20, alpha=0.7, color='green', edgecolor='black')
plt.axvline(bootstrap_mean, color='red', linestyle='--', linewidth=2, label=f'Mean: {bootstrap_mean:.3f}')
plt.axvline(bootstrap_ci[0], color='orange', linestyle='--', alpha=0.7, label='95% CI')
plt.axvline(bootstrap_ci[1], color='orange', linestyle='--', alpha=0.7)
plt.xlabel('Bootstrap Score')
plt.ylabel('Frequency')
plt.title('Bootstrap Score Distribution')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 4: Learning curves
plt.subplot(3, 3, 4)
plt.plot(lc_train_sizes, lc_train_scores, 'o-', label='Training Score', alpha=0.8, linewidth=2)
plt.plot(lc_train_sizes, lc_val_scores, 's-', label='Validation Score', alpha=0.8, linewidth=2)
plt.xlabel('Training Set Size (fraction)')
plt.ylabel('Accuracy')
plt.title('Learning Curves')
plt.legend()
plt.grid(True, alpha=0.3)

# Fill area between curves to show gap
plt.fill_between(lc_train_sizes, lc_train_scores, lc_val_scores, alpha=0.1, color='red')

# Plot 5: Confusion matrix for cross-validation predictions
plt.subplot(3, 3, 5)
cm_cv = confusion_matrix(y_wine, cv_predictions.astype(int))
sns.heatmap(cm_cv, annot=True, fmt='d', cmap='Blues', 
            xticklabels=wine.target_names, yticklabels=wine.target_names)
plt.title('Confusion Matrix (K-Fold CV)')
plt.xlabel('Predicted')
plt.ylabel('Actual')

# Plot 6: Class distribution in dataset
plt.subplot(3, 3, 6)
unique_classes, class_counts = np.unique(y_wine, return_counts=True)
bars = plt.bar(wine.target_names, class_counts, alpha=0.7, color=['red', 'green', 'blue'])
plt.ylabel('Sample Count')
plt.title('Class Distribution in Wine Dataset')
plt.xticks(rotation=45)

# Add value labels
for bar, count in zip(bars, class_counts):
    plt.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1,
             str(count), ha='center', va='bottom')

plt.grid(True, alpha=0.3)

# Plot 7: Model performance metrics
plt.subplot(3, 3, 7)
from sklearn.metrics import precision_recall_fscore_support

# Calculate detailed metrics for CV predictions
precision, recall, f1, support = precision_recall_fscore_support(
    y_wine, cv_predictions.astype(int), average=None
)

x_pos = np.arange(len(wine.target_names))
width = 0.25

plt.bar(x_pos - width, precision, width, label='Precision', alpha=0.7)
plt.bar(x_pos, recall, width, label='Recall', alpha=0.7)
plt.bar(x_pos + width, f1, width, label='F1-Score', alpha=0.7)

plt.xlabel('Class')
plt.ylabel('Score')
plt.title('Per-Class Performance Metrics')
plt.xticks(x_pos, wine.target_names, rotation=45)
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 8: Bias-Variance analysis (simplified)
plt.subplot(3, 3, 8)
# Show training vs validation gap as proxy for bias-variance
training_gap = np.array(lc_train_scores) - np.array(lc_val_scores)
plt.plot(lc_train_sizes, training_gap, 'o-', color='red', alpha=0.8, linewidth=2)
plt.axhline(y=0, color='black', linestyle='--', alpha=0.5)
plt.xlabel('Training Set Size (fraction)')
plt.ylabel('Training - Validation Gap')
plt.title('Bias-Variance Indicator')
plt.grid(True, alpha=0.3)

# Add interpretation text
if training_gap[-1] > 0.1:
    plt.text(0.5, max(training_gap) * 0.8, 'High Variance\n(Overfitting)', 
             ha='center', va='center', bbox=dict(boxstyle='round', facecolor='red', alpha=0.3))
else:
    plt.text(0.5, max(training_gap) * 0.8, 'Good Balance', 
             ha='center', va='center', bbox=dict(boxstyle='round', facecolor='green', alpha=0.3))

# Plot 9: Statistical significance test
plt.subplot(3, 3, 9)
from scipy import stats

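# Compare K-Fold vs Stratified CV using a paired t-test.
# Caveat: the two methods use different splits, so fold i of one is not
# truly paired with fold i of the other, and fold scores are not independent.
# Treat the p-value as illustrative only.
t_stat, p_value = stats.ttest_rel(cv_scores, stratified_scores)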

# Create comparison visualization
methods = ['K-Fold CV', 'Stratified CV']
all_scores = [cv_scores, stratified_scores]

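positions = [1, 2]
box_plot = plt.boxplot(all_scores, positions=positions, patch_artist=True)
# Set tick labels separately; the `labels` keyword of boxplot is deprecated in newer Matplotlib
plt.xticks(positions, methods)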

# Color the boxes
colors = ['lightblue', 'lightgreen']
for patch, color in zip(box_plot['boxes'], colors):
    patch.set_facecolor(color)

plt.ylabel('Accuracy')
plt.title(f'Statistical Comparison\n(p-value: {p_value:.4f})')
plt.grid(True, alpha=0.3)

# Add significance indicator
if p_value < 0.05:
    plt.text(1.5, max(np.concatenate(all_scores)) * 1.05, 'Significant Difference', 
             ha='center', va='center', bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.5))
else:
    plt.text(1.5, max(np.concatenate(all_scores)) * 1.05, 'No Significant Difference', 
             ha='center', va='center', bbox=dict(boxstyle='round', facecolor='gray', alpha=0.5))

plt.tight_layout()
plt.show()

# Print comprehensive evaluation summary
print("\n📊 Comprehensive Model Evaluation Summary:")
print("=" * 60)
print(f"Dataset: Wine Classification ({len(X_wine)} samples, {X_wine.shape[1]} features)")
print(f"Classes: {len(wine.target_names)} ({', '.join(wine.target_names)})")
print("\nCross-Validation Results:")
print(f"  K-Fold CV (5-fold):        {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
print(f"  Stratified CV (5-fold):    {np.mean(stratified_scores):.4f} ± {np.std(stratified_scores):.4f}")
print(f"  Bootstrap (50 samples):    {bootstrap_mean:.4f} [CI: {bootstrap_ci[0]:.4f}, {bootstrap_ci[1]:.4f}]")
print(f"\nLearning Curve Analysis:")
print(f"  Final Training Score:      {lc_train_scores[-1]:.4f}")
print(f"  Final Validation Score:    {lc_val_scores[-1]:.4f}")
print(f"  Training-Validation Gap:   {lc_train_scores[-1] - lc_val_scores[-1]:.4f}")
print(f"\nStatistical Test:")
print(f"  K-Fold vs Stratified t-test: t={t_stat:.4f}, p={p_value:.4f}")
print(f"  Significant difference: {'Yes' if p_value < 0.05 else 'No'}")
print("=" * 60)
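
The bootstrap evaluation above scores each model on its out-of-bag (OOB) samples. A quick simulation shows how large that OOB set tends to be: each sample is left out of a bootstrap draw with probability (1 - 1/n)^n, which approaches 1/e (about 0.368) as n grows. The sample count and number of repetitions below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 1000
oob_fractions = []

for _ in range(200):
    # Draw n_samples indices with replacement, as a bootstrap sample does
    sample = rng.integers(0, n_samples, size=n_samples)
    n_unique = len(np.unique(sample))
    oob_fractions.append(1 - n_unique / n_samples)

mean_oob = float(np.mean(oob_fractions))
expected_oob = (1 - 1 / n_samples) ** n_samples  # close to exp(-1)
print(f"simulated OOB fraction: {mean_oob:.3f}, expected: {expected_oob:.3f}")
```

On a small dataset such as Wine this means each bootstrap model is tested on roughly a third of the data, so individual OOB scores can be noisy.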

## 🏃‍♂️ Practice Problems

The following problems build ensemble methods on top of the decision tree from Problem 3.

In [None]:
# Problem 5: Ensemble Methods - Random Forest from Scratch
class RandomForestFromScratch:
    """Random Forest implementation using our Decision Trees."""
    
    def __init__(self, n_estimators=10, max_features='sqrt', max_depth=None, 
                 min_samples_split=2, bootstrap=True, random_state=None):
        self.n_estimators = n_estimators
        self.max_features = max_features
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.bootstrap = bootstrap
        self.random_state = random_state
        self.trees = []
        
    def _get_max_features(self, n_features):
        """Calculate number of features to consider."""
        if self.max_features == 'sqrt':
            return int(np.sqrt(n_features))
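        elif self.max_features == 'log2':
            # log2(1) == 0, so always keep at least one feature
            return max(1, int(np.log2(n_features)))
        elif isinstance(self.max_features, int):
            return min(self.max_features, n_features)
        elif isinstance(self.max_features, float):
            # Small fractions can round down to 0, so keep at least one feature
            return max(1, int(self.max_features * n_features))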
        else:
            return n_features
    
    def fit(self, X, y):
        """Train the random forest."""
        if self.random_state is not None:
            np.random.seed(self.random_state)
        
        n_samples, n_features = X.shape
        max_features = self._get_max_features(n_features)
        
        self.trees = []
        
        for i in range(self.n_estimators):
            # Create decision tree
            tree = DecisionTreeFromScratch(
                criterion='gini',
                max_depth=self.max_depth,
                min_samples_split=self.min_samples_split,
                max_features=max_features
            )
            
            # Bootstrap sampling
            if self.bootstrap:
                indices = np.random.choice(n_samples, n_samples, replace=True)
                X_bootstrap = X[indices]
                y_bootstrap = y[indices]
            else:
                X_bootstrap = X
                y_bootstrap = y
            
            # Train tree
            tree.fit(X_bootstrap, y_bootstrap)
            self.trees.append(tree)
        
        return self
    
    def predict(self, X):
        """Make predictions using majority voting."""
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])
        
        # Majority voting (np.bincount requires non-negative integer labels)
        predictions = []
        for i in range(X.shape[0]):
            votes = tree_predictions[:, i].astype(int)
            prediction = np.argmax(np.bincount(votes))
            predictions.append(prediction)
        
        return np.array(predictions)
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        # Get predictions from all trees
        tree_predictions = np.array([tree.predict(X) for tree in self.trees])
        
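        # Calculate probabilities based on votes. Labels are assumed to be
        # 0..K-1; sizing by the largest predicted label (not the number of
        # distinct predicted labels) keeps every row the same length.
        n_classes = int(tree_predictions.max()) + 1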
        probabilities = []
        
        for i in range(X.shape[0]):
            votes = tree_predictions[:, i]
            class_counts = np.bincount(votes, minlength=n_classes)
            class_probs = class_counts / len(self.trees)
            probabilities.append(class_probs)
        
        return np.array(probabilities)

# Test Random Forest
print("🧪 Testing Random Forest Implementation:")

# Use Iris dataset
X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.3, random_state=42
)

# Test our Random Forest
rf_custom = RandomForestFromScratch(
    n_estimators=20, max_features='sqrt', max_depth=5, random_state=42
)
rf_custom.fit(X_train_iris, y_train_iris)
rf_predictions = rf_custom.predict(X_test_iris)
rf_probabilities = rf_custom.predict_proba(X_test_iris)

rf_accuracy = accuracy_score(y_test_iris, rf_predictions)

# Compare with sklearn
from sklearn.ensemble import RandomForestClassifier
rf_sklearn = RandomForestClassifier(n_estimators=20, max_features='sqrt', 
                                   max_depth=5, random_state=42)
rf_sklearn.fit(X_train_iris, y_train_iris)
rf_sklearn_pred = rf_sklearn.predict(X_test_iris)
rf_sklearn_accuracy = accuracy_score(y_test_iris, rf_sklearn_pred)

print(f"Custom Random Forest Accuracy: {rf_accuracy:.4f}")
print(f"Sklearn Random Forest Accuracy: {rf_sklearn_accuracy:.4f}")
print(f"Difference: {abs(rf_accuracy - rf_sklearn_accuracy):.4f}")

# Test individual vs ensemble performance
print(f"\n🌳 Individual Tree vs Forest Comparison:")

# Single decision tree
single_tree = DecisionTreeFromScratch(criterion='gini', max_depth=5)
single_tree.fit(X_train_iris, y_train_iris)
single_tree_pred = single_tree.predict(X_test_iris)
single_tree_accuracy = accuracy_score(y_test_iris, single_tree_pred)

print(f"Single Decision Tree Accuracy: {single_tree_accuracy:.4f}")
print(f"Random Forest Accuracy: {rf_accuracy:.4f}")
print(f"Improvement: {rf_accuracy - single_tree_accuracy:.4f}")

print("\n✅ Random Forest test completed!")
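
Vote-based probabilities need a fixed number of columns, one per class, even when some class receives no votes on a given batch. One way to guarantee that is to pass the class count explicitly, for example from the training labels. The sketch below is a standalone helper, not part of `RandomForestFromScratch`; the name `vote_probabilities` and the assumption that labels are integers `0..n_classes-1` are mine.

```python
import numpy as np

def vote_probabilities(tree_predictions, n_classes):
    """Turn an (n_trees, n_samples) array of integer votes into class probabilities."""
    votes = np.asarray(tree_predictions, dtype=int)
    n_trees, n_samples = votes.shape
    counts = np.zeros((n_samples, n_classes), dtype=int)
    for tree_votes in votes:
        # Each tree adds one vote per sample to the column of its predicted class
        counts[np.arange(n_samples), tree_votes] += 1
    return counts / n_trees

# 4 trees, 3 samples; class 1 is never predicted but still gets a column
votes_demo = np.array([[0, 2, 2],
                       [0, 2, 0],
                       [0, 2, 2],
                       [2, 2, 2]])
proba_demo = vote_probabilities(votes_demo, n_classes=3)
labels_demo = proba_demo.argmax(axis=1)
print(proba_demo)
```

Each row sums to 1, and `argmax` over the rows reproduces the majority vote.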

In [None]:
# Problem 6: Gradient Boosting (simplified version)
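class GradientBoostingFromScratch:
    """Simplified gradient boosting for binary labels in {0, 1}.

    Each round fits a classification tree to the sign of the residuals,
    a simplification of the usual approach of fitting regression trees
    to the residuals themselves.
    """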
    
    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate
        self.max_depth = max_depth
        self.models = []
        self.initial_prediction = None
    
    def fit(self, X, y):
        """Train gradient boosting model."""
        # For binary classification, convert to {-1, 1}
        y_transformed = 2 * y - 1
        
        # Initial prediction (mean)
        self.initial_prediction = np.mean(y_transformed)
        
        # Initialize predictions
        predictions = np.full(len(y), self.initial_prediction)
        
        self.models = []
        
        for i in range(self.n_estimators):
            # Calculate residuals (negative gradient)
            residuals = y_transformed - predictions
            
            # Fit a decision tree to residuals
            tree = DecisionTreeFromScratch(
                criterion='gini', 
                max_depth=self.max_depth,
                min_samples_split=2
            )
            
            # Convert residuals to binary classification problem
            # (simplified: use sign of residuals)
            residual_classes = (residuals > 0).astype(int)
            
            # Stop early if all residuals have the same sign (nothing left to split on)
            if len(np.unique(residual_classes)) == 1:
                break
            
            tree.fit(X, residual_classes)
            self.models.append(tree)
            
            # Update predictions
            tree_predictions = tree.predict(X)
            # Convert back to {-1, 1}
            tree_predictions = 2 * tree_predictions - 1
            predictions += self.learning_rate * tree_predictions
        
        return self
    
    def predict(self, X):
        """Make predictions."""
        predictions = np.full(X.shape[0], self.initial_prediction)
        
        for model in self.models:
            tree_pred = model.predict(X)
            tree_pred = 2 * tree_pred - 1  # Convert to {-1, 1}
            predictions += self.learning_rate * tree_pred
        
        # Convert back to {0, 1}
        return (predictions > 0).astype(int)

# Test Gradient Boosting (simplified)
print("🧪 Testing Simplified Gradient Boosting:")

# Use binary classification subset of Iris
binary_mask = y_iris != 2  # Remove class 2 to make it binary
X_binary = X_iris[binary_mask]
y_binary = y_iris[binary_mask]

X_train_gb, X_test_gb, y_train_gb, y_test_gb = train_test_split(
    X_binary, y_binary, test_size=0.3, random_state=42
)

# Test our Gradient Boosting
gb_custom = GradientBoostingFromScratch(
    n_estimators=50, learning_rate=0.1, max_depth=3
)
gb_custom.fit(X_train_gb, y_train_gb)
gb_predictions = gb_custom.predict(X_test_gb)

gb_accuracy = accuracy_score(y_test_gb, gb_predictions)

print(f"Custom Gradient Boosting Accuracy: {gb_accuracy:.4f}")
print(f"Number of weak learners used: {len(gb_custom.models)}")

# Compare with single tree
single_tree_gb = DecisionTreeFromScratch(criterion='gini', max_depth=3)
single_tree_gb.fit(X_train_gb, y_train_gb)
single_pred_gb = single_tree_gb.predict(X_test_gb)
single_accuracy_gb = accuracy_score(y_test_gb, single_pred_gb)

print(f"Single Tree Accuracy: {single_accuracy_gb:.4f}")
print(f"Gradient Boosting Improvement: {gb_accuracy - single_accuracy_gb:.4f}")

print("\n✅ Gradient Boosting test completed!")
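
The class above fits classification trees to the sign of the residuals. For comparison, the standard squared-loss formulation fits a regression model to the residuals themselves. The sketch below does that with one-feature regression stumps on a toy sine curve. It is a swapped-in illustration, not the notebook's method: `fit_stump`, `predict_stump`, and `boost` are hypothetical names, and the code assumes the feature has at least two distinct values.

```python
import numpy as np

def fit_stump(x, residuals):
    """Best single-threshold regression stump on one feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # keep both sides non-empty
        left = residuals[x <= threshold]
        right = residuals[x > threshold]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Squared-loss gradient boosting with regression stumps."""
    base = y.mean()
    pred = np.full(len(y), base)
    stumps = []
    for _ in range(n_rounds):
        residuals = y - pred  # negative gradient of 0.5 * (y - pred)^2
        stump = fit_stump(x, residuals)
        pred = pred + learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, pred

x_demo = np.linspace(0, 1, 40)
y_demo = np.sin(2 * np.pi * x_demo)
base, stumps, fitted = boost(x_demo, y_demo)

mse_start = np.mean((y_demo - base) ** 2)
mse_end = np.mean((y_demo - fitted) ** 2)
print(f"MSE before boosting: {mse_start:.4f}, after: {mse_end:.4f}")
```

Because each stump is a least-squares fit to the current residuals and the learning rate is below 1, the training error cannot increase from one round to the next.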

## 💡 Interview Tips

### 🎯 Scikit-Learn Algorithm Fundamentals
1. **Understand the math** - Know the underlying mathematics, not just the API
2. **Know when to use each algorithm** - Understand strengths and weaknesses
3. **Implement from scratch** - Shows deep understanding of algorithms
4. **Compare with sklearn** - Validate your implementation
5. **Handle edge cases** - Empty clusters, convergence, numerical stability

### 🤖 Algorithm Selection Guide
- **K-Means**: Fast clustering when the number of clusters is known and clusters are roughly spherical
- **Logistic Regression**: Linear decision boundaries, interpretable coefficients
- **Decision Trees**: Non-linear patterns, feature interactions, interpretability
- **Random Forest**: Reduces overfitting relative to a single tree by averaging many trees; missing-value support depends on the implementation
- **Gradient Boosting**: High accuracy, but prone to overfitting without tuning the learning rate, tree depth, and number of estimators

### ⚡ Performance Considerations
- **Time Complexity**: Know the Big O time complexity of each algorithm
- **Space Complexity**: Understand memory requirements
- **Scalability**: How algorithms perform with large datasets
- **Convergence**: When and why algorithms might not converge
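
A common reason for non-convergence is a step size that is too large. The sketch below is a toy example, not taken from the notebook's models: it runs gradient descent on f(w) = w², where each update multiplies w by (1 - 2 × learning_rate), so the iterates shrink when that factor has magnitude below 1 and blow up otherwise.

```python
def gradient_descent(learning_rate, n_steps=50, w0=1.0):
    """Minimize f(w) = w**2 by gradient descent and return the final w."""
    w = w0
    for _ in range(n_steps):
        w -= learning_rate * 2 * w  # f'(w) = 2w
    return w

w_small_step = gradient_descent(learning_rate=0.1)  # factor 0.8 per step: converges
w_large_step = gradient_descent(learning_rate=1.1)  # factor -1.2 per step: diverges

print(f"lr=0.1 -> |w| = {abs(w_small_step):.2e}")
print(f"lr=1.1 -> |w| = {abs(w_large_step):.2e}")
```

The same reasoning motivates checking the cost history of an iterative model: a cost that oscillates or grows usually points to the learning rate.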

### 🔍 Common Interview Questions
1. "Implement K-means from scratch"
2. "Explain the difference between bagging and boosting"
3. "How would you handle missing values in decision trees?"
4. "What's the difference between L1 and L2 regularization?"
5. "How do you choose the optimal number of clusters?"
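
For question 4, a compact way to show the difference is to apply each penalty on its own and watch the weights. The sketch below assumes the penalties (λ/2)‖w‖² for L2 and λ‖w‖₁ for L1, and uses a proximal (soft-thresholding) step for L1 instead of a plain subgradient step. The function names are hypothetical.

```python
import numpy as np

def l2_shrink(w, lr, lam):
    """One gradient step on the penalty (lam / 2) * ||w||^2 alone."""
    return w - lr * lam * w

def l1_soft_threshold(w, lr, lam):
    """One proximal step on the penalty lam * ||w||_1 alone."""
    return np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)

w_start = np.array([0.05, -0.5, 2.0])
w_l2 = w_start.copy()
w_l1 = w_start.copy()

for _ in range(10):
    w_l2 = l2_shrink(w_l2, lr=0.1, lam=1.0)
    w_l1 = l1_soft_threshold(w_l1, lr=0.1, lam=1.0)

print("L2:", w_l2)  # every weight scaled by 0.9 ** 10, none exactly zero
print("L1:", w_l1)  # small weights driven to exactly zero
```

L2 shrinks every weight by the same factor, so weights get small but stay non-zero. L1 subtracts a fixed amount per step, which zeroes out small weights and is why it is associated with sparse solutions.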

### 📊 Model Evaluation Best Practices
- **Cross-validation**: Estimate performance on held-out folds rather than on the training data
- **Stratified sampling**: Preserve class proportions in each fold for imbalanced datasets
- **Multiple metrics**: Don't rely on accuracy alone; report precision, recall, and F1 as well
- **Statistical significance**: Use confidence intervals and hypothesis tests when comparing models
- **Learning curves**: Diagnose bias-variance tradeoffs
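
To make the point about multiple metrics concrete, the sketch below scores a classifier that always predicts the majority class on an imbalanced toy problem. The helper `binary_metrics` is hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not the only option.

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary problem."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))

    accuracy = float(np.mean(y_true == y_pred))
    # Convention (assumption): report 0.0 when a denominator is zero
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# 95 negatives, 5 positives; the "model" always predicts the negative class
y_true_demo = np.array([0] * 95 + [1] * 5)
y_pred_demo = np.zeros(100, dtype=int)

metrics_demo = binary_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```

Accuracy is 0.95 here even though the classifier never finds a positive example, which recall and F1 of 0.0 make obvious.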

## 🎓 Summary

In this notebook, we covered:

✅ **K-Means Clustering** - Lloyd's algorithm, K-means++, elbow method  
✅ **Logistic Regression** - Gradient descent, regularization, multiple solvers  
✅ **Decision Trees** - Information gain, splitting criteria, pruning  
✅ **Model Evaluation** - Cross-validation, bootstrap, learning curves  
✅ **Random Forest** - Bootstrap aggregating, feature randomness  
✅ **Gradient Boosting** - Sequential learning, weak learners  

### 🚀 Next Steps
1. Practice implementing these algorithms from memory
2. Try different datasets and compare performance
3. Move on to neural network implementations
4. Study advanced ensemble methods (XGBoost, LightGBM)

### 📚 Additional Practice
- Implement support vector machines (SVM)
- Create a complete ML pipeline with preprocessing
- Build ensemble methods (voting, stacking)
- Implement feature selection algorithms

### 🔑 Key Takeaways for Interviews
- **Mathematical Foundation**: Understand the underlying math
- **Implementation Skills**: Be able to code algorithms from scratch
- **Practical Knowledge**: Know when and how to use each algorithm
- **Evaluation Expertise**: Apply proper model validation and selection
- **Problem-Solving**: Handle edge cases and numerical issues

**Ready to tackle deep learning next! 🧠**