# Module 5: Supervised Learning — Classification

---

Classification is the supervised learning task where we predict a **discrete category** rather than a continuous number. This module covers five widely used classification algorithms, from logistic regression to Naive Bayes, and visualizes the decision boundary each one learns.

**What you will learn:**
- Logistic Regression
- K-Nearest Neighbors (KNN)
- Support Vector Machines (SVM)
- Decision Trees
- Naive Bayes
- Comparing classifiers on the same dataset

---

## Table of Contents

1. [Logistic Regression](#1.-Logistic-Regression)
2. [K-Nearest Neighbors](#2.-K-Nearest-Neighbors-(KNN))
3. [Support Vector Machines](#3.-Support-Vector-Machines-(SVM))
4. [Decision Trees](#4.-Decision-Trees)
5. [Naive Bayes](#5.-Naive-Bayes)
6. [Classifier Comparison](#6.-Classifier-Comparison)
7. [Exercises](#7.-Exercises)
8. [Summary and Further Reading](#8.-Summary-and-Further-Reading)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import ListedColormap

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.datasets import load_breast_cancer, make_moons, make_classification

plt.style.use('seaborn-v0_8-whitegrid')
np.random.seed(42)

In [None]:
# Utility function: plot decision boundaries for any 2D classifier
def plot_decision_boundary(model, X, y, ax, title='', xlabel='Feature 1', ylabel='Feature 2'):
    """Plot the decision boundary of a fitted classifier on 2D data."""
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    
    cmap_bg = ListedColormap(['#BBDEFB', '#FFCCBC'])
    cmap_pts = ListedColormap(['#1565C0', '#E64A19'])
    
    ax.contourf(xx, yy, Z, alpha=0.4, cmap=cmap_bg)
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_pts, s=30, edgecolors='white', linewidth=0.5)
    ax.set_xlabel(xlabel, fontsize=11)
    ax.set_ylabel(ylabel, fontsize=11)
    ax.set_title(title, fontsize=13, fontweight='bold')

In [None]:
# Prepare a 2D dataset for decision boundary visualization (Moons dataset)
X_moons, y_moons = make_moons(n_samples=300, noise=0.25, random_state=42)

# Also prepare the Breast Cancer dataset for full-feature experiments
cancer = load_breast_cancer()
X_cancer = cancer.data
y_cancer = cancer.target

X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
    X_cancer, y_cancer, test_size=0.2, random_state=42, stratify=y_cancer
)
scaler = StandardScaler()
X_train_cs = scaler.fit_transform(X_train_c)
X_test_cs = scaler.transform(X_test_c)

print(f"Moons dataset: {X_moons.shape[0]} samples, {X_moons.shape[1]} features")
print(f"Breast Cancer dataset: {X_cancer.shape[0]} samples, {X_cancer.shape[1]} features")
print(f"  Target classes: {list(cancer.target_names)}")
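
The preprocessing cell above fits `StandardScaler` on the training split and only applies it to the test split, so no test-set statistics leak into training. The following is a minimal NumPy sketch of that fit/transform pattern. The toy arrays and variable names are illustrative assumptions and are not taken from the Breast Cancer data.

```python
import numpy as np

# Hypothetical toy data: 4 training rows and 2 test rows, 2 features each.
X_train_toy = np.array([[1.0, 100.0],
                        [2.0, 200.0],
                        [3.0, 300.0],
                        [4.0, 400.0]])
X_test_toy = np.array([[2.5, 250.0],
                       [5.0, 500.0]])

# "fit": learn per-feature mean and standard deviation from the training rows only.
train_mean = X_train_toy.mean(axis=0)
train_std = X_train_toy.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# "transform": apply the training statistics to both splits.
X_train_scaled = (X_train_toy - train_mean) / train_std
X_test_scaled = (X_test_toy - train_mean) / train_std

print("Train mean after scaling:", X_train_scaled.mean(axis=0))
print("Train std after scaling: ", X_train_scaled.std(axis=0))
print("Scaled test rows:")
print(X_test_scaled)
```

The scaled test rows generally do not have mean 0 and standard deviation 1, which is expected: they are measured against the training distribution.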

---

## 1. Logistic Regression

Despite its name, Logistic Regression is a **classification** algorithm. It models the probability that an input belongs to a particular class using the **sigmoid function**:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z = w_0 + w_1 x_1 + w_2 x_2 + \ldots$

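The sigmoid squashes any real number into the range (0, 1), which we interpret as the probability of the positive class (class 1). With the usual threshold of 0.5, the model predicts class 1 exactly when $z \ge 0$.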
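
As a quick numeric check of the formula, the sketch below computes $z$ and $\sigma(z)$ for a few inputs. The weights `w0`, `w1`, `w2` and the helper `predict_proba_class1` are hypothetical, chosen only for illustration; they are not fitted to any dataset in this module.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical weights for a two-feature model (illustration only).
w0, w1, w2 = -1.0, 2.0, 0.5

def predict_proba_class1(x1, x2):
    """Probability of class 1 for one sample under the hypothetical weights."""
    z = w0 + w1 * x1 + w2 * x2
    return sigmoid(z)

for x1, x2 in [(0.0, 0.0), (0.5, 0.0), (1.0, 2.0)]:
    p = predict_proba_class1(x1, x2)
    label = 1 if p >= 0.5 else 0
    print(f"x=({x1}, {x2})  p(class 1)={p:.3f}  predicted class={label}")
```

The second point gives $z = 0$, so its probability sits exactly on the 0.5 threshold.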

In [None]:
# Visualize the sigmoid function
z = np.linspace(-8, 8, 200)
sigmoid = 1 / (1 + np.exp(-z))

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(z, sigmoid, linewidth=2.5, color='#2196F3')
ax.axhline(y=0.5, color='red', linestyle='--', alpha=0.7, label='Decision threshold (0.5)')
ax.axvline(x=0, color='gray', linestyle='--', alpha=0.3)
ax.fill_between(z, sigmoid, 0.5, where=(sigmoid >= 0.5), alpha=0.1, color='green', label='Predict class 1')
ax.fill_between(z, sigmoid, 0.5, where=(sigmoid < 0.5), alpha=0.1, color='red', label='Predict class 0')
ax.set_xlabel('z = w0 + w1*x1 + w2*x2 + ...', fontsize=13)
ax.set_ylabel('Sigmoid output (probability)', fontsize=13)
ax.set_title('The Sigmoid Function', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.set_ylim(-0.05, 1.05)
plt.tight_layout()
plt.show()

In [None]:
from sklearn.linear_model import LogisticRegression

# Train on the Moons dataset (2D for visualization)
X_train_m, X_test_m, y_train_m, y_test_m = train_test_split(
    X_moons, y_moons, test_size=0.2, random_state=42
)

logreg = LogisticRegression(random_state=42)
logreg.fit(X_train_m, y_train_m)

print(f"Logistic Regression — Moons Dataset")
print(f"  Training accuracy: {logreg.score(X_train_m, y_train_m):.4f}")
print(f"  Test accuracy:     {logreg.score(X_test_m, y_test_m):.4f}")

# Train on Breast Cancer dataset
logreg_bc = LogisticRegression(max_iter=5000, random_state=42)
logreg_bc.fit(X_train_cs, y_train_c)
print(f"\nLogistic Regression — Breast Cancer Dataset")
print(f"  Training accuracy: {logreg_bc.score(X_train_cs, y_train_c):.4f}")
print(f"  Test accuracy:     {logreg_bc.score(X_test_cs, y_test_c):.4f}")

In [None]:
fig, ax = plt.subplots(figsize=(8, 6))
plot_decision_boundary(logreg, X_moons, y_moons, ax,
                       title='Logistic Regression — Decision Boundary')
plt.tight_layout()
plt.show()

print("Logistic Regression produces a linear decision boundary.")
print("It struggles with non-linearly separable data like the Moons dataset.")

---

## 2. K-Nearest Neighbors (KNN)

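KNN is a non-parametric algorithm that classifies a data point by the majority class among its K nearest training points (Euclidean distance by default in scikit-learn). It makes no assumptions about the underlying data distribution, but because it relies on distances, features should be on comparable scales.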
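
To make the voting rule concrete, here is a minimal from-scratch sketch of KNN prediction for a single query point. The function `knn_predict` and the toy data are hypothetical and only illustrate the idea; they are not how `KNeighborsClassifier` is implemented, and ties are resolved naively.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify one query point by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two small, well-separated clusters.
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                  [2.0, 2.0], [2.1, 1.8], [1.9, 2.2]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.1, 0.1]), k=3))
print(knn_predict(X_toy, y_toy, np.array([1.8, 1.9]), k=3))
```

Using an odd `k` for a two-class problem avoids tied votes in this simple version.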

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# Compare different values of K
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
k_values = [1, 5, 15]

for idx, k in enumerate(k_values):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_m, y_train_m)
    acc = knn.score(X_test_m, y_test_m)
    plot_decision_boundary(knn, X_moons, y_moons, axes[idx],
                           title=f'KNN (K={k}) — Test Acc: {acc:.3f}')

plt.suptitle('KNN Decision Boundaries for Different K Values', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("Observations:")
print("  - K=1: Very complex boundary, fits noise (overfitting risk).")
print("  - K=5: Good balance between flexibility and smoothness.")
print("  - K=15: Overly smooth boundary (underfitting risk).")

In [None]:
# KNN on Breast Cancer dataset
knn_bc = KNeighborsClassifier(n_neighbors=5)
knn_bc.fit(X_train_cs, y_train_c)

y_pred_knn = knn_bc.predict(X_test_cs)
print("KNN (K=5) — Breast Cancer Dataset")
print(f"  Test accuracy: {accuracy_score(y_test_c, y_pred_knn):.4f}")
print(f"\n{classification_report(y_test_c, y_pred_knn, target_names=cancer.target_names)}")

---

## 3. Support Vector Machines (SVM)

SVM finds the hyperplane that maximizes the **margin** between classes. It can handle non-linear boundaries using the **kernel trick**.

Key concepts:
- **Margin**: Distance between the decision boundary and the nearest data points (support vectors)
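- **Kernel**: A function that computes the similarity between pairs of points as if they had been mapped to a higher-dimensional space, without constructing that mapping explicitly (common choices: linear, polynomial, RBF)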
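
The kernel idea can be checked numerically. The sketch below uses a homogeneous degree-2 polynomial kernel on 2D inputs, chosen because its explicit feature map is small enough to write out; this is an illustrative choice and not the default configuration of `SVC(kernel='poly')`. All function names here are hypothetical helpers.

```python
import numpy as np

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel: k(x, y) = (x . y)^2."""
    return np.dot(x, y) ** 2

def poly2_feature_map(x):
    """Explicit 3D feature map for 2D inputs whose inner product equals poly2_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def rbf_kernel(x, y, gamma=1.0):
    """RBF (Gaussian) kernel: k(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

# Same number, computed without and with the explicit mapping.
k_implicit = poly2_kernel(a, b)
k_explicit = np.dot(poly2_feature_map(a), poly2_feature_map(b))
print(k_implicit, k_explicit)

# The RBF kernel equals 1 for identical points and decays with distance.
print(rbf_kernel(a, a), rbf_kernel(a, b, gamma=0.1))
```

For the RBF kernel the corresponding feature space is infinite-dimensional, so evaluating the kernel directly is the only practical option.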

In [None]:
from sklearn.svm import SVC

# Compare different kernels
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
kernels = ['linear', 'poly', 'rbf']
kernel_names = ['Linear Kernel', 'Polynomial Kernel', 'RBF (Gaussian) Kernel']

for idx, (kernel, name) in enumerate(zip(kernels, kernel_names)):
    svm = SVC(kernel=kernel, random_state=42)
    svm.fit(X_train_m, y_train_m)
    acc = svm.score(X_test_m, y_test_m)
    plot_decision_boundary(svm, X_moons, y_moons, axes[idx],
                           title=f'{name}\nTest Acc: {acc:.3f}')

plt.suptitle('SVM Decision Boundaries — Different Kernels', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("Observations:")
print("  - Linear: similar to logistic regression — straight boundary.")
print("  - Polynomial: can capture some curvature.")
print("  - RBF: highly flexible, adapts well to complex shapes.")

In [None]:
# SVM (RBF) on Breast Cancer dataset
svm_bc = SVC(kernel='rbf', random_state=42)
svm_bc.fit(X_train_cs, y_train_c)

y_pred_svm = svm_bc.predict(X_test_cs)
print("SVM (RBF Kernel) — Breast Cancer Dataset")
print(f"  Test accuracy: {accuracy_score(y_test_c, y_pred_svm):.4f}")
print(f"\n{classification_report(y_test_c, y_pred_svm, target_names=cancer.target_names)}")

---

## 4. Decision Trees

A decision tree splits the data into subsets based on feature values, creating a tree-like structure of if-then rules. It is one of the most interpretable models.

Key concepts:
- **Splitting criteria**: Gini impurity or Entropy (information gain)
- **Depth**: Number of levels in the tree (controls complexity)
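- **Pruning**: Restricting tree growth (for example with `max_depth` or `min_samples_leaf`) or removing branches after training (cost-complexity pruning via `ccp_alpha`) to prevent overfitting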
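
The two splitting criteria can be computed by hand for a small node. The sketch below is a simplified illustration; the helper names (`gini`, `entropy`, `impurity_decrease`) and the toy labels are assumptions, and scikit-learn's internal implementation differs.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum_k p_k^2, where p_k is the fraction of class k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits: -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def impurity_decrease(parent, left, right, criterion=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * criterion(left) + (len(right) / n) * criterion(right)
    return criterion(parent) - weighted_children

# Hypothetical node with 4 samples per class, and one candidate split of it.
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

print(gini(parent), entropy(parent))
print(impurity_decrease(parent, left, right))                     # Gini decrease
print(impurity_decrease(parent, left, right, criterion=entropy))  # information gain
```

A tree-growing algorithm evaluates many candidate splits like this one and keeps the split with the largest impurity decrease.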

In [None]:
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Compare different depths
fig, axes = plt.subplots(1, 3, figsize=(18, 5))
depths = [2, 5, None]  # None means unlimited depth
depth_labels = ['Depth=2', 'Depth=5', 'No limit']

for idx, (depth, label) in enumerate(zip(depths, depth_labels)):
    dt = DecisionTreeClassifier(max_depth=depth, random_state=42)
    dt.fit(X_train_m, y_train_m)
    acc = dt.score(X_test_m, y_test_m)
    plot_decision_boundary(dt, X_moons, y_moons, axes[idx],
                           title=f'Decision Tree ({label})\nTest Acc: {acc:.3f}')

plt.suptitle('Decision Tree — Effect of Maximum Depth', fontsize=15, fontweight='bold')
plt.tight_layout()
plt.show()

print("Observations:")
print("  - Shallow trees (depth=2): simple, potentially underfitting.")
print("  - Deep trees (no limit): complex boundaries, risk of overfitting.")
print("  - Decision boundaries are always axis-aligned (rectangular).")

In [None]:
# Visualize the tree structure
dt_small = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_small.fit(X_train_m, y_train_m)

fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(dt_small, filled=True, rounded=True,
          feature_names=['Feature 1', 'Feature 2'],
          class_names=['Class 0', 'Class 1'],
          ax=ax, fontsize=10)
ax.set_title('Decision Tree Structure (max_depth=3)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("Gini impurity is the probability of misclassifying a randomly chosen sample from a node")
print("if it were labeled at random according to that node's class distribution.")
print("A Gini of 0.0 means the node is pure (all samples belong to one class).")

In [None]:
# Decision Tree on Breast Cancer dataset — with feature importance
dt_bc = DecisionTreeClassifier(max_depth=5, random_state=42)
dt_bc.fit(X_train_cs, y_train_c)

y_pred_dt = dt_bc.predict(X_test_cs)
print("Decision Tree (depth=5) — Breast Cancer Dataset")
print(f"  Test accuracy: {accuracy_score(y_test_c, y_pred_dt):.4f}")

# Feature importance
importance = pd.DataFrame({
    'Feature': cancer.feature_names,
    'Importance': dt_bc.feature_importances_
}).sort_values('Importance', ascending=False).head(10)

fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(importance['Feature'], importance['Importance'], color='#2196F3', edgecolor='white')
ax.set_xlabel('Feature Importance', fontsize=13)
ax.set_title('Top 10 Features — Decision Tree', fontsize=14, fontweight='bold')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

---

## 5. Naive Bayes

Naive Bayes applies Bayes' theorem with the "naive" assumption that features are conditionally independent given the class label. Despite this simplification, it often performs surprisingly well, especially on text classification and high-dimensional data.
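
The independence assumption means the class-conditional likelihood factorizes over features, so the log-likelihood is a simple sum. The following is a minimal from-scratch sketch of a Gaussian Naive Bayes classifier under that assumption. The function names and toy data are hypothetical, and the small variance constant is an illustrative choice rather than the exact smoothing `GaussianNB` applies.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = {
            "prior": len(Xc) / len(X),
            "mean": Xc.mean(axis=0),
            "var": Xc.var(axis=0) + 1e-9,  # small constant avoids division by zero
        }
    return params

def predict_gaussian_nb(params, x):
    """Return the class with the highest log-posterior for a single sample x."""
    log_posteriors = {}
    for c, p in params.items():
        # Independence assumption: the joint log-likelihood is a sum over features.
        log_likelihood = -0.5 * np.sum(
            np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"]
        )
        log_posteriors[c] = np.log(p["prior"]) + log_likelihood
    return max(log_posteriors, key=log_posteriors.get)

# Hypothetical toy data: two classes with clearly different feature means.
X_toy = np.array([[1.0, 2.0], [1.2, 1.8], [0.8, 2.2],
                  [4.0, 5.0], [4.2, 5.1], [3.8, 4.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

nb_params = fit_gaussian_nb(X_toy, y_toy)
print(predict_gaussian_nb(nb_params, np.array([1.1, 2.1])))
print(predict_gaussian_nb(nb_params, np.array([4.1, 5.0])))
```

Working in log space avoids numerical underflow when many per-feature likelihoods are multiplied together.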

In [None]:
from sklearn.naive_bayes import GaussianNB

# Moons dataset
gnb = GaussianNB()
gnb.fit(X_train_m, y_train_m)

fig, ax = plt.subplots(figsize=(8, 6))
acc = gnb.score(X_test_m, y_test_m)
plot_decision_boundary(gnb, X_moons, y_moons, ax,
                       title=f'Gaussian Naive Bayes — Test Acc: {acc:.3f}')
plt.tight_layout()
plt.show()

# Breast Cancer dataset
gnb_bc = GaussianNB()
gnb_bc.fit(X_train_cs, y_train_c)
y_pred_nb = gnb_bc.predict(X_test_cs)
print(f"Gaussian Naive Bayes — Breast Cancer Dataset")
print(f"  Test accuracy: {accuracy_score(y_test_c, y_pred_nb):.4f}")
print(f"\n{classification_report(y_test_c, y_pred_nb, target_names=cancer.target_names)}")

---

## 6. Classifier Comparison

Let us compare all five classifiers side by side — both visually (decision boundaries) and numerically (accuracy on the Breast Cancer dataset).

In [None]:
# Decision boundary comparison on Moons dataset
classifiers = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'KNN (K=5)': KNeighborsClassifier(n_neighbors=5),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'Decision Tree (d=5)': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Naive Bayes': GaussianNB()
}

fig, axes = plt.subplots(2, 3, figsize=(18, 11))
axes = axes.flatten()

for idx, (name, clf) in enumerate(classifiers.items()):
    clf.fit(X_train_m, y_train_m)
    acc = clf.score(X_test_m, y_test_m)
    plot_decision_boundary(clf, X_moons, y_moons, axes[idx],
                           title=f'{name}\nTest Acc: {acc:.3f}')

axes[5].axis('off')  # hide the empty subplot
plt.suptitle('Classifier Comparison — Moons Dataset', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
# Performance comparison on Breast Cancer dataset
classifiers_bc = {
    'Logistic Regression': LogisticRegression(max_iter=5000, random_state=42),
    'KNN (K=5)': KNeighborsClassifier(n_neighbors=5),
    'SVM (RBF)': SVC(kernel='rbf', random_state=42),
    'Decision Tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'Naive Bayes': GaussianNB()
}

comparison_results = []
for name, clf in classifiers_bc.items():
    clf.fit(X_train_cs, y_train_c)
    train_acc = clf.score(X_train_cs, y_train_c)
    test_acc = clf.score(X_test_cs, y_test_c)
    comparison_results.append({
        'Classifier': name, 'Train Accuracy': train_acc, 'Test Accuracy': test_acc
    })

comp_df = pd.DataFrame(comparison_results).sort_values('Test Accuracy', ascending=False)
print("Classifier Comparison — Breast Cancer Dataset")
print("=" * 60)
print(comp_df.to_string(index=False))

# Bar chart
fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(comp_df))
width = 0.35
ax.bar(x - width/2, comp_df['Train Accuracy'], width, label='Train', color='#2196F3', edgecolor='white')
ax.bar(x + width/2, comp_df['Test Accuracy'], width, label='Test', color='#FF5722', edgecolor='white')
ax.set_xticks(x)
ax.set_xticklabels(comp_df['Classifier'], rotation=20, ha='right')
ax.set_ylabel('Accuracy', fontsize=13)
ax.set_title('Classifier Comparison — Breast Cancer Dataset', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
ax.set_ylim(0.9, 1.01)
plt.tight_layout()
plt.show()

---

## 7. Exercises

### Exercise 1: Iris Classification

In [None]:
# Exercise 1: Using the Iris dataset (3 classes):
# 1. Load the data and split 80/20 with stratification
# 2. Scale the features using StandardScaler
# 3. Train all five classifiers from this module
# 4. Print a comparison table of test accuracies
# 5. Generate a confusion matrix for the best-performing classifier

from sklearn.datasets import load_iris

# Your code here:


### Exercise 2: Hyperparameter Tuning

In [None]:
# Exercise 2: For the SVM classifier on the Breast Cancer dataset:
# 1. Try C values: [0.01, 0.1, 1, 10, 100] with RBF kernel
# 2. Record train and test accuracy for each C
# 3. Plot C vs accuracy (both train and test)
# 4. What value of C gives the best test accuracy?
# 5. What does C control? (Hint: C is the inverse of the regularization strength, so a smaller C means stronger regularization)

# Your code here:


### Exercise 3: Decision Tree Depth Analysis

In [None]:
# Exercise 3: Using the Breast Cancer dataset:
# 1. Train Decision Trees with max_depth from 1 to 20
# 2. Record training and test accuracy for each depth
# 3. Plot depth vs accuracy
# 4. Identify when overfitting begins (train accuracy >> test accuracy)
# 5. What is the optimal depth?

# Your code here:


---

## 8. Summary and Further Reading

### What We Covered

| Algorithm | Type | Strengths | Weaknesses |
|-----------|------|-----------|------------|
| Logistic Regression | Linear | Fast, interpretable, good baseline | Only linear boundaries |
| KNN | Non-parametric | Simple, no training phase | Slow at prediction time, sensitive to scale |
| SVM | Kernel-based | Handles high dimensions well, flexible kernels | Slow on large datasets, needs scaling |
| Decision Tree | Rule-based | Interpretable, handles mixed types | Prone to overfitting |
| Naive Bayes | Probabilistic | Very fast, works well with high dimensions | Assumes feature independence |

### Recommended Reading

- [Scikit-learn Classification Guide](https://scikit-learn.org/stable/supervised_learning.html)
- Chapter 3 (Classification) and Chapter 5 (SVM) of Aurélien Géron, *Hands-On Machine Learning*
- Chapter 4 (Classification) of James, Witten, Hastie, and Tibshirani, *An Introduction to Statistical Learning* (ISLR)

### Next Module

In **Module 6: Model Evaluation and Validation**, we will learn how to rigorously evaluate classifier performance using confusion matrices, ROC curves, cross-validation, and hyperparameter tuning strategies.

---