# Lab 2: Supervised Learning Algorithms

In this lab, we'll explore fundamental supervised learning algorithms for classification. While Lab 1 focused on regression (predicting continuous values), this lab focuses on classification (predicting discrete categories).

## Learning Objectives

By the end of this lab, you will:
- Understand the difference between classification and regression
- Implement logistic regression from scratch
- Build a k-Nearest Neighbors (k-NN) classifier
- Create decision trees using the CART algorithm
- Understand Support Vector Machines (SVM) basics
- Apply Naive Bayes classification
- Compare different algorithms on real datasets

## Classification vs Regression

| Aspect | Regression | Classification |
|--------|-----------|----------------|
| Output | Continuous value | Discrete category |
| Example | House price prediction | Email spam detection |
| Evaluation | MSE, RMSE, R² | Accuracy, precision, recall |
| Algorithms | Linear regression, polynomial regression | Logistic regression, k-NN, decision trees |
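
To make the classification metrics in the table concrete, here is a small sketch that computes accuracy, precision, and recall by hand. The labels are made up for illustration, and the variable names are hypothetical rather than part of the lab code.

```python
import numpy as np

# Toy labels: 1 = spam, 0 = not spam (made-up values for illustration)
y_true_demo = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 0, 1, 0, 1, 0, 0])

# Count the four confusion-matrix cells
tp = np.sum((y_pred_demo == 1) & (y_true_demo == 1))  # true positives
fp = np.sum((y_pred_demo == 1) & (y_true_demo == 0))  # false positives
fn = np.sum((y_pred_demo == 0) & (y_true_demo == 1))  # false negatives
tn = np.sum((y_pred_demo == 0) & (y_true_demo == 0))  # true negatives

accuracy_demo = (tp + tn) / len(y_true_demo)  # fraction of correct predictions
precision_demo = tp / (tp + fp)               # of predicted spam, how much is spam
recall_demo = tp / (tp + fn)                  # of actual spam, how much was caught

print(f"accuracy={accuracy_demo:.3f} precision={precision_demo:.3f} recall={recall_demo:.3f}")
# accuracy=0.625 precision=0.667 recall=0.500
```

The three numbers differ because each one answers a different question about the same set of predictions.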

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Tuple, List, Dict
import pandas as pd
from collections import Counter
from sklearn.datasets import make_classification, make_moons, make_circles, load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression as SKLogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Set style
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

## Part 1: Logistic Regression

Despite its name, logistic regression is a **classification** algorithm. It models the probability that an example belongs to a particular class.

### The Sigmoid Function

Logistic regression uses the sigmoid function to map predictions to probabilities:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

Where $z = \mathbf{w}^T \mathbf{x} + b$

The sigmoid function outputs values between 0 and 1, which we interpret as probabilities.
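
As a quick numerical check of these properties, the sketch below evaluates the sigmoid at a few points. The function and variable names are illustrative only.

```python
import numpy as np

def sigmoid_fn(z):
    """Map real-valued scores to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z_demo = np.array([-5.0, 0.0, 5.0])
p_demo = sigmoid_fn(z_demo)
print(p_demo)  # small for z = -5, exactly 0.5 at z = 0, close to 1 for z = 5

# Symmetry: sigma(-z) = 1 - sigma(z)
print(np.allclose(sigmoid_fn(-z_demo), 1.0 - p_demo))
```

Because $\sigma(0) = 0.5$, thresholding the probability at 0.5 is the same as checking the sign of $z$.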

### Loss Function: Binary Cross-Entropy

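$$L(y, \hat{y}) = -\frac{1}{m} \sum_{i=1}^{m} [y_i \log(\hat{y}_i) + (1-y_i) \log(1-\hat{y}_i)]$$

Where $m$ is the number of training examples, $y_i \in \{0, 1\}$ is the true label, and $\hat{y}_i = \sigma(z_i)$ is the predicted probability.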
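
Differentiating this loss, using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, gives the gradients $\frac{\partial L}{\partial \mathbf{w}} = \frac{1}{m} X^T(\hat{y} - y)$ and $\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^{m} (\hat{y}_i - y_i)$. The sketch below compares these closed-form gradients against central finite differences on random data. The helper names (`bce_loss`, `analytic_grad`) and the random data are assumptions for illustration.

```python
import numpy as np

def bce_loss(w, b, X, y):
    """Mean binary cross-entropy for weights w and bias b."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def analytic_grad(w, b, X, y):
    """Closed-form gradients: (1/m) X^T (y_hat - y) and (1/m) sum(y_hat - y)."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    m = X.shape[0]
    return X.T @ (y_hat - y) / m, np.sum(y_hat - y) / m

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(20, 3))
y_demo = (rng.random(20) > 0.5).astype(float)
w_demo = rng.normal(size=3)
b_demo = 0.1

dw, db = analytic_grad(w_demo, b_demo, X_demo, y_demo)

# Central finite differences as an independent check
eps = 1e-6
dw_num = np.zeros_like(w_demo)
for j in range(w_demo.size):
    step = np.zeros_like(w_demo)
    step[j] = eps
    dw_num[j] = (bce_loss(w_demo + step, b_demo, X_demo, y_demo)
                 - bce_loss(w_demo - step, b_demo, X_demo, y_demo)) / (2 * eps)
db_num = (bce_loss(w_demo, b_demo + eps, X_demo, y_demo)
          - bce_loss(w_demo, b_demo - eps, X_demo, y_demo)) / (2 * eps)

print(np.allclose(dw, dw_num, atol=1e-6), np.isclose(db, db_num, atol=1e-6))
```

A gradient check like this is a cheap way to catch sign or scaling mistakes before running gradient descent.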

In [None]:
class LogisticRegression:
    """
    Logistic Regression for binary classification.
    
    Parameters:
    -----------
    learning_rate : float
        Step size for gradient descent
    n_iterations : int
        Number of training iterations
    """
    
    def __init__(self, learning_rate: float = 0.01, n_iterations: int = 1000):
        self.learning_rate = learning_rate
        self.n_iterations = n_iterations
        self.weights = None
        self.bias = None
        self.loss_history = []
    
    def sigmoid(self, z: np.ndarray) -> np.ndarray:
        """
        Sigmoid activation function.
        """
        return 1 / (1 + np.exp(-np.clip(z, -500, 500)))  # Clip to avoid overflow
    
    def compute_loss(self, y_true: np.ndarray, y_pred: np.ndarray) -> float:
        """
        Binary cross-entropy loss.
        """
        epsilon = 1e-15  # To avoid log(0)
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Train the logistic regression model.
        """
        m, n = X.shape
        
        # Initialize parameters (reset the history so repeated fit() calls start fresh)
        self.weights = np.zeros(n)
        self.bias = 0.0
        self.loss_history = []
        
        # Gradient descent
        for i in range(self.n_iterations):
            # Forward pass
            z = X.dot(self.weights) + self.bias
            y_pred = self.sigmoid(z)
            
            # Compute loss
            loss = self.compute_loss(y, y_pred)
            self.loss_history.append(loss)
            
            # Compute gradients
            dw = (1/m) * X.T.dot(y_pred - y)
            db = (1/m) * np.sum(y_pred - y)
            
            # Update parameters
            self.weights -= self.learning_rate * dw
            self.bias -= self.learning_rate * db
    
    def predict_proba(self, X: np.ndarray) -> np.ndarray:
        """
        Predict probabilities.
        """
        z = X.dot(self.weights) + self.bias
        return self.sigmoid(z)
    
    def predict(self, X: np.ndarray, threshold: float = 0.5) -> np.ndarray:
        """
        Predict class labels.
        """
        return (self.predict_proba(X) >= threshold).astype(int)

In [None]:
# Visualize sigmoid function
z = np.linspace(-10, 10, 100)
sigmoid = 1 / (1 + np.exp(-z))

plt.figure(figsize=(10, 6))
plt.plot(z, sigmoid, linewidth=2)
plt.axhline(y=0.5, color='r', linestyle='--', label='Decision threshold')
plt.axvline(x=0, color='g', linestyle='--', alpha=0.5)
plt.xlabel('z (weighted sum)')
plt.ylabel('σ(z)')
plt.title('Sigmoid Function')
plt.grid(True, alpha=0.3)
plt.legend()
plt.show()

In [None]:
# Test logistic regression on synthetic data
np.random.seed(42)
X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, 
                          n_informative=2, n_clusters_per_class=1, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train model
model_lr = LogisticRegression(learning_rate=0.1, n_iterations=1000)
model_lr.fit(X_train_scaled, y_train)

# Predictions
y_pred = model_lr.predict(X_test_scaled)
y_proba = model_lr.predict_proba(X_test_scaled)

print("Logistic Regression Results:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred))

In [None]:
# Visualize decision boundary
def plot_decision_boundary(model, X, y, title="Decision Boundary"):
    """
    Plot decision boundary for binary classification.
    """
    h = 0.02  # Step size
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap='RdYlBu', edgecolors='black', s=50)
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Decision boundary
plt.subplot(1, 2, 1)
plot_decision_boundary(model_lr, X_test_scaled, y_test, "Logistic Regression Decision Boundary")

# Loss history
plt.subplot(1, 2, 2)
plt.plot(model_lr.loss_history)
plt.xlabel('Iteration')
plt.ylabel('Cross-Entropy Loss')
plt.title('Training Loss')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Part 2: k-Nearest Neighbors (k-NN)

k-NN is a simple, instance-based learning algorithm. To classify a new point:
1. Find the k nearest training examples
2. Take a majority vote of their labels

### Distance Metrics

**Euclidean Distance:**
$$d(\mathbf{x}, \mathbf{x'}) = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}$$

**Manhattan Distance:**
$$d(\mathbf{x}, \mathbf{x'}) = \sum_{i=1}^{n} |x_i - x'_i|$$
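
The two metrics and the two classification steps can be sketched on a handful of points. The training set and query below are made up for illustration, and the names are hypothetical.

```python
import numpy as np
from collections import Counter

# Distance between (0, 0) and (3, 4) under each metric
p = np.array([0.0, 0.0])
q = np.array([3.0, 4.0])
euclidean = np.sqrt(np.sum((p - q) ** 2))
manhattan = np.sum(np.abs(p - q))
print(euclidean, manhattan)  # 5.0 7.0

# Toy training set: three points near the origin (class 0), two far away (class 1)
train_X = np.array([[1.0, 1.0], [1.0, 2.0], [5.0, 5.0], [6.0, 5.0], [2.0, 1.0]])
train_y = np.array([0, 0, 1, 1, 0])
query = np.array([1.5, 1.5])

# Step 1: find the k = 3 nearest training examples (Euclidean distance)
dists = np.sqrt(np.sum((train_X - query) ** 2, axis=1))
nearest = np.argsort(dists)[:3]

# Step 2: majority vote over their labels
vote = Counter(train_y[nearest].tolist()).most_common(1)[0][0]
print(vote)  # 0
```

Manhattan distance is never smaller than Euclidean distance for the same pair of points, so the choice of metric can change which neighbors count as nearest.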

In [None]:
class KNearestNeighbors:
    """
    k-Nearest Neighbors classifier.
    
    Parameters:
    -----------
    k : int
        Number of neighbors to consider
    distance_metric : str
        'euclidean' or 'manhattan'
    """
    
    def __init__(self, k: int = 3, distance_metric: str = 'euclidean'):
        self.k = k
        self.distance_metric = distance_metric
        self.X_train = None
        self.y_train = None
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Store training data (lazy learning).
        """
        self.X_train = X
        self.y_train = y
    
    def compute_distance(self, x1: np.ndarray, x2: np.ndarray) -> np.ndarray:
        """
        Compute distances along the last axis.

        Returns a scalar for two single points, or one distance per row
        when x1 is a matrix of points compared against the point x2.
        """
        if self.distance_metric == 'euclidean':
            return np.sqrt(np.sum((x1 - x2) ** 2, axis=-1))
        elif self.distance_metric == 'manhattan':
            return np.sum(np.abs(x1 - x2), axis=-1)
        else:
            raise ValueError(f"Unknown distance metric: {self.distance_metric}")
    
    def predict_single(self, x: np.ndarray) -> int:
        """
        Predict label for a single example.
        """
        # Compute distances to all training points in one vectorized call
        # (much faster than a Python loop when predicting over a dense grid)
        distances = self.compute_distance(self.X_train, x)
        
        # Get indices of k nearest neighbors
        k_indices = np.argsort(distances)[:self.k]
        
        # Get labels of k nearest neighbors
        k_nearest_labels = self.y_train[k_indices]
        
        # Return most common label
        return Counter(k_nearest_labels).most_common(1)[0][0]
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict labels for multiple examples.
        """
        return np.array([self.predict_single(x) for x in X])

In [None]:
# Test k-NN with different values of k
k_values = [1, 3, 5, 10, 20]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, k in enumerate(k_values):
    # Train model
    model_knn = KNearestNeighbors(k=k)
    model_knn.fit(X_train_scaled, y_train)
    
    # Predict
    y_pred = model_knn.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Plot decision boundary
    plt.subplot(2, 3, idx + 1)
    plot_decision_boundary(model_knn, X_test_scaled, y_test, 
                          f"k-NN (k={k})\nAccuracy: {accuracy:.3f}")

# Remove extra subplot
fig.delaxes(axes[-1])

plt.tight_layout()
plt.show()

print("Notice how:")
print("- k=1 creates very complex, possibly overfitted boundaries")
print("- Larger k creates smoother boundaries")
print("- Very large k can underfit the data")

## Part 3: Decision Trees

Decision trees make predictions by learning a series of if-then-else decision rules from features.

### Information Gain

Decision trees use **entropy** to measure impurity:

$$H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)$$

Where $p_i$ is the proportion of class $i$ in set $S$.

**Information Gain** measures how much entropy is reduced by splitting on a feature:

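$$IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)$$

Where $S_v$ is the subset of $S$ for which attribute $A$ takes value $v$. The implementation in this lab uses binary threshold splits on numeric features, so the sum has just two terms: the examples with feature value $\le$ threshold and those above it.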
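
As a worked instance of the entropy and information gain formulas, the sketch below scores one hypothetical binary split of a balanced node. The helper name `entropy_bits` and the label arrays are made up for illustration.

```python
import numpy as np

def entropy_bits(labels):
    """Shannon entropy (base 2) of an array of non-negative integer labels."""
    counts = np.bincount(labels)
    probs = counts[counts > 0] / len(labels)
    return float(-np.sum(probs * np.log2(probs)))

# Parent node: 4 examples of class 0 and 4 of class 1
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Hypothetical split: left gets three 0s and one 1, right gets one 0 and three 1s
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

h_parent = entropy_bits(parent)  # 1.0 bit for a 50/50 node
h_children = ((len(left) / len(parent)) * entropy_bits(left)
              + (len(right) / len(parent)) * entropy_bits(right))
gain = h_parent - h_children

print(f"H(parent)={h_parent:.3f}  weighted H(children)={h_children:.3f}  IG={gain:.3f}")
```

A split that separated the classes perfectly would leave both children with zero entropy, giving the maximum possible gain of 1 bit for this node.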

In [None]:
class Node:
    """
    A node in the decision tree.
    """
    def __init__(self, feature=None, threshold=None, left=None, right=None, value=None):
        self.feature = feature      # Feature index to split on
        self.threshold = threshold  # Threshold value for split
        self.left = left           # Left child
        self.right = right         # Right child
        self.value = value         # Class label (for leaf nodes)

class DecisionTree:
    """
    Decision Tree classifier using CART-style binary splits,
    scored by information gain (entropy) rather than Gini impurity.
    
    Parameters:
    -----------
    max_depth : int
        Maximum depth of the tree
    min_samples_split : int
        Minimum samples required to split a node
    """
    
    def __init__(self, max_depth: int = 10, min_samples_split: int = 2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None
    
    def entropy(self, y: np.ndarray) -> float:
        """
        Calculate entropy of a set.
        """
        proportions = np.bincount(y) / len(y)
        return -np.sum([p * np.log2(p) for p in proportions if p > 0])
    
    def information_gain(self, X: np.ndarray, y: np.ndarray, 
                        feature: int, threshold: float) -> float:
        """
        Calculate information gain from a split.
        """
        # Parent entropy
        parent_entropy = self.entropy(y)
        
        # Split data
        left_mask = X[:, feature] <= threshold
        right_mask = ~left_mask
        
        if np.sum(left_mask) == 0 or np.sum(right_mask) == 0:
            return 0
        
        # Weighted child entropy
        n = len(y)
        n_left, n_right = np.sum(left_mask), np.sum(right_mask)
        e_left, e_right = self.entropy(y[left_mask]), self.entropy(y[right_mask])
        child_entropy = (n_left / n) * e_left + (n_right / n) * e_right
        
        return parent_entropy - child_entropy
    
    def best_split(self, X: np.ndarray, y: np.ndarray) -> Tuple[int, float]:
        """
        Find the best feature and threshold to split on.
        """
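        # Require a strictly positive gain; this rejects degenerate splits
        # that would send every sample to one side and leave an empty child
        best_gain = 0.0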
        best_feature = None
        best_threshold = None
        
        n_features = X.shape[1]
        
        for feature in range(n_features):
            thresholds = np.unique(X[:, feature])
            
            for threshold in thresholds:
                gain = self.information_gain(X, y, feature, threshold)
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature
                    best_threshold = threshold
        
        return best_feature, best_threshold
    
    def build_tree(self, X: np.ndarray, y: np.ndarray, depth: int = 0) -> Node:
        """
        Recursively build the decision tree.
        """
        n_samples, n_features = X.shape
        n_classes = len(np.unique(y))
        
        # Stopping criteria
        if (depth >= self.max_depth or 
            n_samples < self.min_samples_split or 
            n_classes == 1):
            # Create leaf node with majority class
            leaf_value = Counter(y).most_common(1)[0][0]
            return Node(value=leaf_value)
        
        # Find best split
        best_feature, best_threshold = self.best_split(X, y)
        
        if best_feature is None:
            leaf_value = Counter(y).most_common(1)[0][0]
            return Node(value=leaf_value)
        
        # Split data
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        
        # Recursively build children
        left = self.build_tree(X[left_mask], y[left_mask], depth + 1)
        right = self.build_tree(X[right_mask], y[right_mask], depth + 1)
        
        return Node(feature=best_feature, threshold=best_threshold, 
                   left=left, right=right)
    
    def fit(self, X: np.ndarray, y: np.ndarray):
        """
        Build the decision tree.
        """
        self.root = self.build_tree(X, y)
    
    def predict_single(self, x: np.ndarray, node: Node) -> int:
        """
        Predict label for a single example.
        """
        if node.value is not None:
            return node.value
        
        if x[node.feature] <= node.threshold:
            return self.predict_single(x, node.left)
        else:
            return self.predict_single(x, node.right)
    
    def predict(self, X: np.ndarray) -> np.ndarray:
        """
        Predict labels for multiple examples.
        """
        return np.array([self.predict_single(x, self.root) for x in X])

In [None]:
# Test decision tree with different max depths
depths = [1, 2, 3, 5, 10]

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, depth in enumerate(depths):
    # Train model
    model_dt = DecisionTree(max_depth=depth)
    model_dt.fit(X_train_scaled, y_train)
    
    # Predict
    y_pred = model_dt.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Plot decision boundary
    plt.subplot(2, 3, idx + 1)
    plot_decision_boundary(model_dt, X_test_scaled, y_test, 
                          f"Decision Tree (depth={depth})\nAccuracy: {accuracy:.3f}")

# Remove extra subplot
fig.delaxes(axes[-1])

plt.tight_layout()
plt.show()

print("Notice how:")
print("- Shallow trees create axis-aligned, rectangular decision boundaries")
print("- Deeper trees create more complex boundaries")
print("- Very deep trees can overfit")

## Part 4: Comparing Algorithms on Different Datasets

Different algorithms work better on different types of data. Let's compare them on various synthetic datasets.

In [None]:
# Generate different datasets
np.random.seed(42)

datasets = {
    'Linearly Separable': make_classification(n_samples=200, n_features=2, n_redundant=0,
                                             n_informative=2, n_clusters_per_class=1, 
                                             random_state=42),
    'Moons': make_moons(n_samples=200, noise=0.2, random_state=42),
    'Circles': make_circles(n_samples=200, noise=0.1, factor=0.5, random_state=42)
}

# Models to compare
models = {
    'Logistic Regression': LogisticRegression(learning_rate=0.1, n_iterations=1000),
    'k-NN (k=5)': KNearestNeighbors(k=5),
    'Decision Tree': DecisionTree(max_depth=5)
}

# Compare
fig, axes = plt.subplots(len(datasets), len(models), figsize=(15, 12))

for i, (dataset_name, (X, y)) in enumerate(datasets.items()):
    # Split and scale
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    for j, (model_name, model) in enumerate(models.items()):
        # Train
        model.fit(X_train_scaled, y_train)
        
        # Predict
        y_pred = model.predict(X_test_scaled)
        accuracy = accuracy_score(y_test, y_pred)
        
        # Plot
        plt.subplot(len(datasets), len(models), i * len(models) + j + 1)
        plot_decision_boundary(model, X_test_scaled, y_test, 
                             f"{dataset_name}\n{model_name}\nAcc: {accuracy:.3f}")

plt.tight_layout()
plt.show()

## Part 5: Using Scikit-Learn

Let's use scikit-learn's implementations, which are optimized and include additional features.

In [None]:
# Load breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

print("Breast Cancer Dataset:")
print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Classes: {data.target_names}")
print(f"\nClass distribution: {Counter(y)}")

In [None]:
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train multiple models
sklearn_models = {
    'Logistic Regression': SKLogisticRegression(max_iter=1000),
    'k-NN (k=5)': KNeighborsClassifier(n_neighbors=5),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'Naive Bayes': GaussianNB()
}

results = {}

for name, model in sklearn_models.items():
    # Train
    model.fit(X_train_scaled, y_train)
    
    # Predict
    y_pred = model.predict(X_test_scaled)
    
    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    results[name] = accuracy
    
    print(f"\n{name}:")
    print(f"Accuracy: {accuracy:.3f}")
    print(f"\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=data.target_names))

In [None]:
# Compare models
plt.figure(figsize=(10, 6))
plt.barh(list(results.keys()), list(results.values()))
plt.xlabel('Accuracy')
plt.title('Model Comparison on Breast Cancer Dataset')
plt.xlim([0.9, 1.0])
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Visualize decision tree
model_tree_viz = DecisionTreeClassifier(max_depth=3, random_state=42)
model_tree_viz.fit(X_train_scaled[:, :2], y_train)  # Use only 2 features for visualization

plt.figure(figsize=(20, 10))
plot_tree(model_tree_viz, 
         feature_names=data.feature_names[:2],
         class_names=data.target_names,
         filled=True,
         rounded=True,
         fontsize=10)
plt.title('Decision Tree Visualization (depth=3, first 2 features)')
plt.show()

## Part 6: Support Vector Machines (SVM) Basics

SVMs find the hyperplane that maximizes the margin between classes.

### Key Concepts:
- **Support Vectors**: Training points closest to the decision boundary
- **Margin**: Distance between the hyperplane and the nearest points
- **Kernel Trick**: Map data to higher dimensions for non-linear separation
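
These concepts can be inspected on a fitted model. The sketch below uses scikit-learn's `SVC` with a linear kernel on six made-up points, with a large `C` to approximate a hard margin. `support_vectors_` and `coef_` are attributes of a fitted `SVC` (`coef_` is only available for the linear kernel); the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (made-up points)
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C penalizes margin violations heavily (approximately a hard margin)
clf = SVC(kernel='linear', C=1e3)
clf.fit(X_toy, y_toy)

# For a linear SVM the hyperplane is w.x + b = 0, and the nearest
# points of each class lie at distance 1 / ||w|| from it
w = clf.coef_[0]
margin = 1.0 / np.linalg.norm(w)

print("Support vectors:\n", clf.support_vectors_)
print("Distance from hyperplane to nearest points:", margin)
print("Full gap between the classes:", 2 * margin)
```

Only the support vectors determine the boundary: moving any other training point slightly, without crossing the margin, leaves the fitted hyperplane unchanged.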

In [None]:
# Compare SVM kernels
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

kernels = ['linear', 'poly', 'rbf']

fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for idx, kernel in enumerate(kernels):
    # Train SVM
    svm = SVC(kernel=kernel, random_state=42)
    svm.fit(X_train_scaled, y_train)
    
    # Predict
    y_pred = svm.predict(X_test_scaled)
    accuracy = accuracy_score(y_test, y_pred)
    
    # Plot
    plt.subplot(1, 3, idx + 1)
    
    h = 0.02
    x_min, x_max = X_test_scaled[:, 0].min() - 1, X_test_scaled[:, 0].max() + 1
    y_min, y_max = X_test_scaled[:, 1].min() - 1, X_test_scaled[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    
    Z = svm.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    plt.contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    plt.scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], c=y_test, 
               cmap='RdYlBu', edgecolors='black', s=50)
    plt.title(f"SVM ({kernel} kernel)\nAccuracy: {accuracy:.3f}")
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("Notice how the RBF (Radial Basis Function) kernel follows the curved boundary between the two moons, which a linear kernel cannot represent.")

## Key Takeaways

1. **Logistic Regression**: Simple, interpretable, works well for linearly separable data
2. **k-NN**: Simple, non-parametric, but slow for large datasets and sensitive to feature scaling
3. **Decision Trees**: Interpretable, handle non-linear relationships, prone to overfitting
4. **SVM**: Powerful for high-dimensional data, kernel trick enables non-linear separation
5. **Naive Bayes**: Fast, works well with small datasets, assumes feature independence
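6. **Feature scaling** is crucial for distance-based algorithms (k-NN, SVM with an RBF kernel) and helps gradient descent converge for logistic regression; decision trees are largely unaffected by it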
7. **No single best algorithm** - choice depends on data characteristics and requirements
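
To illustrate the point about feature scaling, the sketch below shows the nearest neighbor of a query changing once the features are standardized. The income and age values are made up, and `nearest_index` is a hypothetical helper.

```python
import numpy as np

# Hypothetical features: income in dollars and age in years
train_pts = np.array([[50_000.0, 25.0],
                      [50_200.0, 60.0],
                      [52_000.0, 26.0]])
query_pt = np.array([50_300.0, 27.0])

def nearest_index(train_X, x):
    """Index of the training row closest to x (Euclidean distance)."""
    return int(np.argmin(np.sqrt(np.sum((train_X - x) ** 2, axis=1))))

# Raw features: income dominates the distance because of its larger scale
raw_nn = nearest_index(train_pts, query_pt)

# Standardized features: each column has zero mean and unit variance
mean, std = train_pts.mean(axis=0), train_pts.std(axis=0)
scaled_nn = nearest_index((train_pts - mean) / std, (query_pt - mean) / std)

print(raw_nn, scaled_nn)  # 1 0
```

On the raw data the nearest neighbor is the 60-year-old with a similar income; after standardization it is the 25-year-old, whose age and income are both close to the query.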

## Exercises

1. **Algorithm Comparison**: Compare all algorithms on the Iris dataset. Which performs best?

2. **Distance Metrics**: Implement and compare Manhattan distance with Euclidean distance for k-NN.

3. **Feature Engineering**: For the breast cancer dataset, try:
   - Creating interaction features (e.g., feature1 * feature2)
   - Polynomial features
   - Feature selection (remove low-importance features)

4. **Weighted k-NN**: Modify the k-NN algorithm to weight neighbors by inverse distance.

5. **Pruning Decision Trees**: Implement post-pruning to reduce tree complexity.

6. **Multi-class Classification**: Extend LogisticRegression to handle more than 2 classes using one-vs-rest or softmax.

7. **Visualization**: Create an interactive tool to explore how k in k-NN or tree depth affects decision boundaries.

## Next Steps

In Lab 3, we'll learn:
- Model evaluation metrics in depth
- Cross-validation techniques
- Handling overfitting and underfitting
- Hyperparameter tuning
- Learning curves and validation curves

Excellent work! You now understand the core classification algorithms in machine learning.