# Lab 09: Support Vector Machines (SVM)

In this lab, you'll develop intuition for how Support Vector Machines work, explore different kernels, and learn to tune hyperparameters using cross-validation.

### Learning Objectives

- Understand the geometric intuition of SVM (margins, support vectors)
- Know when to use linear vs. non-linear kernels
- Apply cross-validation for hyperparameter tuning
- Interpret classifier performance metrics (precision, recall, ROC curves)

### Overview

| Part | Topic                                    | Time    |
| ---- | ---------------------------------------- | ------- |
| 1    | Linear SVM & Margins                     | ~30 min |
| 2    | Kernels (RBF, Polynomial)                | ~30 min |
| 3    | Hyperparameter Tuning + Cross-Validation | ~30 min |
| 4    | Real-World Application (Stretch)         | ~30 min |


---

## Setup


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.svm import SVC
from sklearn.datasets import make_blobs, make_circles, make_moons, load_digits
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    ConfusionMatrixDisplay,
    RocCurveDisplay,
)
from sklearn.dummy import DummyClassifier

# Set random seed for reproducibility
np.random.seed(42)

### Helper Functions

We'll use these functions throughout the lab to visualize SVM decision boundaries.


In [None]:
def plot_svm_decision_boundary(
    clf, X, y, ax=None, title="SVM Decision Boundary", show_margin=True
):
    """
    Plot the decision boundary, margins, and support vectors for an SVM classifier.

    Parameters:
    -----------
    clf : fitted SVM classifier
    X : feature array (n_samples, 2)
    y : target array
    ax : matplotlib axis (optional)
    title : plot title
    show_margin : whether to show margin lines (only works for linear kernel)
    """
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))

    # Create a mesh grid
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))

    # Get decision function values
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Plot decision boundary and margins
    ax.contourf(xx, yy, Z, levels=50, cmap="RdBu", alpha=0.3)
    ax.contour(xx, yy, Z, levels=[0], colors="k", linewidths=2)  # Decision boundary

    if show_margin and clf.kernel == "linear":
        ax.contour(
            xx, yy, Z, levels=[-1, 1], colors="k", linewidths=1, linestyles="--"
        )  # Margins

    # Plot data points
    ax.scatter(X[:, 0], X[:, 1], c=y, cmap="RdBu", edgecolors="k", s=50)

    # Highlight support vectors
    ax.scatter(
        clf.support_vectors_[:, 0],
        clf.support_vectors_[:, 1],
        s=200,
        facecolors="none",
        edgecolors="green",
        linewidths=2,
        label=f"Support Vectors (n={len(clf.support_vectors_)})",
    )

    ax.set_xlabel("Feature 1")
    ax.set_ylabel("Feature 2")
    ax.set_title(title)
    ax.legend(loc="upper right")

    return ax

---

## Part 1: Visual Intuition of SVM in 2D

### The SVM Idea

A **Support Vector Machine** finds the hyperplane that maximizes the **margin** — the distance between the hyperplane and the nearest data points from each class. These nearest points are called **support vectors**.

For linearly separable data in 2D, we're looking for a line:
$$\mathbf{w}^T \mathbf{x} + b = 0$$

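If we scale $\mathbf{w}$ and $b$ so that the nearest points of each class satisfy $\mathbf{w}^T \mathbf{x} + b = \pm 1$, the margin width is $\frac{2}{\|\mathbf{w}\|}$, so maximizing the margin means minimizing $\|\mathbf{w}\|$.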

The **regularization parameter C** controls the trade-off between:

- Maximizing the margin (small C → wider margin, more misclassifications allowed)
- Minimizing classification errors (large C → narrower margin, fewer misclassifications)
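
The relationship between $\mathbf{w}$, $b$, and the margin can be checked numerically on a fitted model. The sketch below is illustrative rather than part of the lab's required code: it assumes a binary problem and a linear kernel, and the names `X_demo`, `y_demo`, and `clf_demo` are placeholders introduced here.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy data (placeholder names; same generator settings as Section 1.1)
X_demo, y_demo = make_blobs(n_samples=100, centers=2, cluster_std=1.5, random_state=42)

clf_demo = SVC(kernel="linear", C=1.0).fit(X_demo, y_demo)

# For a linear kernel, coef_ holds w and intercept_ holds b
w = clf_demo.coef_[0]
b = clf_demo.intercept_[0]

# decision_function(x) equals w^T x + b; its sign gives the predicted side
manual_scores = X_demo @ w + b
print("Matches decision_function:",
      np.allclose(manual_scores, clf_demo.decision_function(X_demo)))

# Margin width is 2 / ||w||
margin_width = 2.0 / np.linalg.norm(w)
print(f"Margin width: {margin_width:.3f}")
print(f"Support vectors: {len(clf_demo.support_vectors_)}")
```

Refitting with a different `C` and recomputing `margin_width` is one way to put a number on the trade-off described above.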


### 1.1 Generate and Plot Linearly Separable Data

Let's create a simple 2D dataset with two linearly separable classes.


In [None]:
# Generate linearly separable data
X_linear, y_linear = make_blobs(
    n_samples=100, centers=2, cluster_std=1.5, random_state=42
)

# Plot the raw data
plt.figure(figsize=(8, 6))
plt.scatter(
    X_linear[:, 0], X_linear[:, 1], c=y_linear, cmap="RdBu", edgecolors="k", s=50
)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.title("Linearly Separable Data")
plt.colorbar(label="Class")
plt.show()

### 1.2 Fit a Linear SVM

**Task:** Fit a linear SVM to the data and visualize the decision boundary, margins, and support vectors.


In [None]:
# Fit a linear SVM with default C=1
svm_linear = SVC(kernel="linear", C=1.0)
svm_linear.fit(X_linear, y_linear)

# Visualize
fig, ax = plt.subplots(figsize=(10, 7))
plot_svm_decision_boundary(
    svm_linear,
    X_linear,
    y_linear,
    ax=ax,
    title=f"Linear SVM (C=1.0)\nSupport Vectors: {len(svm_linear.support_vectors_)}",
)
plt.show()

print(f"Number of support vectors: {len(svm_linear.support_vectors_)}")
print(f"Training accuracy: {svm_linear.score(X_linear, y_linear):.3f}")

### 1.3 Effect of the Regularization Parameter C

The parameter **C** controls the trade-off between a wide margin and classifying every training point correctly.

- **Small C:** Wider margin, more tolerant of misclassifications (soft margin)
- **Large C:** Narrower margin, less tolerant of misclassifications (approaches a hard margin as C grows)

**Task:** Fit SVMs with different values of C and observe how the margin and support vectors change.


In [None]:
# Compare different values of C
C_values = [0.01, 0.1, 1, 10, 100]

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

for ax, C in zip(axes, C_values):
    svm = SVC(kernel="linear", C=C)
    svm.fit(X_linear, y_linear)
    plot_svm_decision_boundary(
        svm, X_linear, y_linear, ax=ax, title=f"C={C}\nSV: {len(svm.support_vectors_)}"
    )

plt.tight_layout()
plt.show()

**Discussion Questions:**

1. What happens to the number of support vectors as C increases?
2. What happens to the margin width as C increases?
3. Why might a very large C lead to overfitting?


### 1.4 Your Turn: Experiment with C

**Task:** Create a dataset with some overlap between classes (increase `cluster_std`), then compare how different C values affect the decision boundary.


In [None]:
# TODO: Generate data with more overlap (try cluster_std=2.5 or higher)
# X_overlap, y_overlap = make_blobs(...)

# TODO: Fit SVMs with C=0.1 and C=100 and compare

---

## Part 2: Nonlinear SVM with Kernels

### The Kernel Trick

What if the data isn't linearly separable? The **kernel trick** allows SVM to find nonlinear decision boundaries by implicitly mapping data to a higher-dimensional space.

Common kernels:

- **Linear:** $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^T \mathbf{x}_j$
- **Polynomial:** $K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma \mathbf{x}_i^T \mathbf{x}_j + r)^d$
- **RBF (Radial Basis Function):** $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$
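
These formulas can be verified directly with NumPy. The sketch below computes each kernel by hand and compares it with the helpers in `sklearn.metrics.pairwise`. The sample points and the values chosen for $\gamma$, $r$, and $d$ are arbitrary illustrations, not recommended settings.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 2))  # four arbitrary 2D points
B = rng.normal(size=(3, 2))  # three arbitrary 2D points
gamma, r, d = 0.5, 1.0, 3    # illustrative values only

# Linear: x_i^T x_j
K_lin = A @ B.T

# Polynomial: (gamma * x_i^T x_j + r)^d
K_poly = (gamma * (A @ B.T) + r) ** d

# RBF: exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
K_rbf = np.exp(-gamma * sq_dists)

print("linear matches:", np.allclose(K_lin, linear_kernel(A, B)))
print("poly matches:  ", np.allclose(K_poly, polynomial_kernel(A, B, degree=d, gamma=gamma, coef0=r)))
print("rbf matches:   ", np.allclose(K_rbf, rbf_kernel(A, B, gamma=gamma)))
```

In `SVC`, the same quantities are set through the `gamma`, `coef0` (for $r$), and `degree` (for $d$) arguments.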


### 2.1 Non-Linearly Separable Data

Let's create datasets that cannot be separated by a straight line.


In [None]:
# Generate non-linearly separable datasets
# Note: We use moderate noise to create realistic overlap between classes
X_circles, y_circles = make_circles(
    n_samples=300, noise=0.2, factor=0.5, random_state=42
)
X_moons, y_moons = make_moons(n_samples=300, noise=0.25, random_state=42)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].scatter(
    X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap="RdBu", edgecolors="k"
)
axes[0].set_title("Circles Dataset")
axes[0].set_xlabel("Feature 1")
axes[0].set_ylabel("Feature 2")

axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap="RdBu", edgecolors="k")
axes[1].set_title("Moons Dataset")
axes[1].set_xlabel("Feature 1")
axes[1].set_ylabel("Feature 2")

plt.tight_layout()
plt.show()

### 2.2 Linear SVM Fails

Let's see what happens when we try to fit a linear SVM to non-linearly separable data.


In [None]:
# Try linear SVM on circles data
svm_linear_circles = SVC(kernel="linear", C=1.0)
svm_linear_circles.fit(X_circles, y_circles)

fig, ax = plt.subplots(figsize=(8, 6))
plot_svm_decision_boundary(
    svm_linear_circles,
    X_circles,
    y_circles,
    ax=ax,
    title=f"Linear SVM on Circles\nTraining accuracy: {svm_linear_circles.score(X_circles, y_circles):.3f}",
)
plt.show()

### 2.3 RBF Kernel

The **RBF (Radial Basis Function)** kernel can create circular or blob-like decision boundaries.

The parameter **gamma** ($\gamma$) controls the "reach" of each training example:

- **Small gamma:** Larger similarity radius, smoother decision boundary
- **Large gamma:** Smaller similarity radius, more complex decision boundary
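
To see what "reach" means numerically, the sketch below evaluates the RBF similarity $\exp(-\gamma d^2)$ at a few distances $d$ for several values of gamma. It also reproduces the `gamma="scale"` setting, which scikit-learn documents as `1 / (n_features * X.var())`. The variable names and the chosen distances are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Kernel similarity exp(-gamma * d^2) at a few distances d
distances = np.array([0.0, 0.5, 1.0, 2.0])
for gamma in (0.1, 1.0, 10.0):
    similarity = np.exp(-gamma * distances**2)
    print(f"gamma={gamma:<5} similarity at d={distances.tolist()}: "
          f"{np.round(similarity, 3).tolist()}")

# gamma="scale" is documented as 1 / (n_features * X.var())
X_demo, y_demo = make_circles(n_samples=300, noise=0.2, factor=0.5, random_state=42)
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())
print(f"gamma='scale' on this data: {gamma_scale:.3f}")

# Fitting with the string and with the computed number should give the same model
svm_scale = SVC(kernel="rbf", gamma="scale").fit(X_demo, y_demo)
svm_manual = SVC(kernel="rbf", gamma=gamma_scale).fit(X_demo, y_demo)
same = np.allclose(
    svm_scale.decision_function(X_demo), svm_manual.decision_function(X_demo)
)
print("Same decision function:", same)
```

With a large gamma the similarity drops to nearly zero within a short distance, so each training point influences only its immediate neighborhood.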


In [None]:
# Fit RBF SVM on circles data
svm_rbf = SVC(kernel="rbf", C=1.0, gamma="scale")
svm_rbf.fit(X_circles, y_circles)

fig, ax = plt.subplots(figsize=(8, 6))
plot_svm_decision_boundary(
    svm_rbf,
    X_circles,
    y_circles,
    ax=ax,
    show_margin=False,
    title=f"RBF SVM on Circles\nTraining accuracy: {svm_rbf.score(X_circles, y_circles):.3f}",
)
plt.show()

### 2.4 Effect of Gamma

**Task:** Explore how gamma affects the decision boundary.


In [None]:
# Compare different values of gamma
gamma_values = [0.1, 0.5, 1, 5, 10]

fig, axes = plt.subplots(1, 5, figsize=(20, 4))

for ax, gamma in zip(axes, gamma_values):
    svm = SVC(kernel="rbf", C=1.0, gamma=gamma)
    svm.fit(X_circles, y_circles)
    plot_svm_decision_boundary(
        svm,
        X_circles,
        y_circles,
        ax=ax,
        show_margin=False,
        title=f"gamma={gamma}\nAcc: {svm.score(X_circles, y_circles):.2f}",
    )

plt.tight_layout()
plt.show()

**Discussion:**

- What happens with very large gamma? (Hint: look for overfitting)
- What happens with very small gamma?


### 2.5 Polynomial Kernel

The **polynomial kernel** can create polynomial decision boundaries. The **degree** parameter controls the complexity.
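
The "implicit mapping" behind the kernel trick can be made concrete for a degree-2 polynomial kernel. The sketch below assumes 2D inputs with $\gamma = 1$ and $r = 1$, and writes out an explicit 6-dimensional feature map `phi` (a helper introduced here for illustration). Inner products of the mapped points match the kernel values computed in the original 2D space.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel


def phi(X):
    """Explicit degree-2 feature map for 2D inputs, matching (x^T z + 1)^2."""
    x1, x2 = X[:, 0], X[:, 1]
    s = np.sqrt(2.0)
    return np.column_stack(
        [np.ones(len(X)), s * x1, s * x2, x1**2, x2**2, s * x1 * x2]
    )


rng = np.random.default_rng(0)
P = rng.normal(size=(5, 2))  # five arbitrary 2D points

# Inner products after mapping to the 6D feature space
K_explicit = phi(P) @ phi(P).T

# Same values, computed from 2D dot products only
K_kernel = polynomial_kernel(P, degree=2, gamma=1.0, coef0=1.0)

print("Explicit map matches kernel:", np.allclose(K_explicit, K_kernel))
```

The number of explicit features grows quickly with the degree, which is why computing the kernel directly is preferable.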


In [None]:
# Compare polynomial degrees on moons data
degrees = [2, 3, 4, 5]

fig, axes = plt.subplots(1, 4, figsize=(16, 4))

for ax, degree in zip(axes, degrees):
    svm = SVC(kernel="poly", degree=degree, C=1.0, coef0=1)
    svm.fit(X_moons, y_moons)
    plot_svm_decision_boundary(
        svm,
        X_moons,
        y_moons,
        ax=ax,
        show_margin=False,
        title=f"Poly (degree={degree})\nAcc: {svm.score(X_moons, y_moons):.2f}",
    )

plt.tight_layout()
plt.show()

### 2.6 Your Turn: Compare Kernels

**Task:** Fit linear, RBF, and polynomial SVMs to the moons dataset. Which works best?


In [None]:
# TODO: Fit three SVMs with different kernels on X_moons, y_moons
# Compare their accuracies and decision boundaries

---

## Part 3: Hyperparameter Tuning with Cross-Validation

### The Problem

We have multiple hyperparameters to tune:

- **C** (regularization strength)
- **gamma** (for RBF kernel)
- **degree** (for polynomial kernel)

How do we find the best combination without overfitting to our test set?

### Grid Search with Cross-Validation

**Grid Search** systematically tries all combinations of hyperparameters.
**Cross-Validation** gives us a robust estimate of performance without touching the test set.

Together, they let us choose hyperparameters using only the training data, so the test set remains an unbiased check on how well the final model generalizes.
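
As a rough picture of what grid search with cross-validation does internally, the sketch below loops over every (C, gamma) pair, scores each with 5-fold cross-validation on the training split, and keeps the best. It is a simplified illustration with a small, arbitrary grid and placeholder variable names, not a replacement for `GridSearchCV`.

```python
from itertools import product

from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X_demo, y_demo = make_circles(n_samples=300, noise=0.2, factor=0.5, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

best_score, best_params = -1.0, None
for C, gamma in product([0.1, 1, 10], [0.1, 1, 10]):
    # Mean accuracy over 5 folds of the training data only
    fold_scores = cross_val_score(
        SVC(kernel="rbf", C=C, gamma=gamma), X_tr, y_tr, cv=5
    )
    if fold_scores.mean() > best_score:
        best_score, best_params = fold_scores.mean(), {"C": C, "gamma": gamma}

# Refit on the full training split, then touch the test set exactly once
final_model = SVC(kernel="rbf", **best_params).fit(X_tr, y_tr)
test_score = final_model.score(X_te, y_te)

print(f"Best params: {best_params}")
print(f"Best CV accuracy: {best_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")
```

The test split plays no role in choosing `best_params`; it is used only for the final score.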


### 3.1 Prepare Data with Train/Test Split


In [None]:
# Use the circles dataset - split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_circles, y_circles, test_size=0.2, random_state=42, stratify=y_circles
)

print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")

### 3.2 Grid Search for RBF SVM

Let's search over a grid of C and gamma values.


In [None]:
# Define parameter grid
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}

# Create GridSearchCV object
grid_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid,
    cv=5,  # 5-fold cross-validation
    scoring="accuracy",
    return_train_score=True,
)

# Fit on training data
grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.3f}")

### 3.3 Visualize Grid Search Results

A heatmap helps us understand how performance varies across the parameter space.


In [None]:
# Extract results into a DataFrame
results = pd.DataFrame(grid_search.cv_results_)

# Create a pivot table for the heatmap
scores = results.pivot(index="param_gamma", columns="param_C", values="mean_test_score")

# Plot heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(
    scores, annot=True, fmt=".3f", cmap="viridis", cbar_kws={"label": "CV Accuracy"}
)
plt.title("Grid Search Results: RBF SVM")
plt.xlabel("C")
plt.ylabel("gamma")
plt.show()

### 3.4 Evaluate on Test Set

Now we evaluate the best model on our held-out test set.


In [None]:
# Get the best model
best_svm = grid_search.best_estimator_

# Evaluate on test set
y_pred = best_svm.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)

print(f"Test accuracy: {test_accuracy:.3f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=["Class 0", "Class 1"]))

### 3.5 Confusion Matrix and ROC Curve


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Confusion Matrix
ConfusionMatrixDisplay.from_estimator(
    best_svm,
    X_test,
    y_test,
    display_labels=["Class 0", "Class 1"],
    cmap="Blues",
    ax=axes[0],
)
axes[0].set_title("Confusion Matrix")

# ROC Curve
RocCurveDisplay.from_estimator(best_svm, X_test, y_test, ax=axes[1])
axes[1].plot([0, 1], [0, 1], "k--", label="Random Classifier")
axes[1].set_title("ROC Curve")
axes[1].legend()

plt.tight_layout()
plt.show()

### 3.6 Your Turn: Grid Search for Polynomial SVM

**Task:** Perform grid search to tune C and degree for a polynomial SVM on the moons dataset.


In [None]:
# TODO: Split moons data into train/test
# X_train_m, X_test_m, y_train_m, y_test_m = ...

# TODO: Define parameter grid for polynomial SVM (C and degree)
# param_grid_poly = {...}

# TODO: Run GridSearchCV

# TODO: Create a heatmap of results

# TODO: Evaluate best model on test set

---

## Part 4: Real-World Application

Now let's apply what you've learned to a real dataset: the **Digits** dataset. This dataset contains 8×8 pixel images of handwritten digits — a classic machine learning benchmark where SVMs have historically performed well.


In [None]:
# Load digits dataset
digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print(f"Dataset shape: {X_digits.shape}")
print(f"Number of classes: {len(np.unique(y_digits))}")
print(f"Class distribution: {np.bincount(y_digits)}")

# Visualize some sample digits
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap="gray")
    ax.set_title(f"Label: {y_digits[i]}")
    ax.axis("off")
plt.suptitle("Sample Digits from the Dataset", fontsize=14)
plt.tight_layout()
plt.show()

### 4.1 Data Preparation


In [None]:
# Train/test split
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_digits, y_digits, test_size=0.2, random_state=42, stratify=y_digits
)

print(f"Training set: {len(X_train_d)} samples")
print(f"Test set: {len(X_test_d)} samples")

### 4.2 Baseline Model


In [None]:
# Baseline: most frequent class
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train_d, y_train_d)
print(f"Baseline accuracy: {dummy.score(X_test_d, y_test_d):.3f}")
print("(With 10 classes, random guessing would give ~10% accuracy)")

### 4.3 SVM with Grid Search

Now let's tune hyperparameters using cross-validation, just like we did in Part 3.


In [None]:
# Create a pipeline with scaling and SVM
pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC(kernel="rbf"))])

# Parameter grid
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.001, 0.01]}

# Grid search (may take a minute with 10 classes)
print("Running grid search (this may take a moment)...")
grid_search_digits = GridSearchCV(
    pipe, param_grid, cv=5, scoring="accuracy", return_train_score=True
)
grid_search_digits.fit(X_train_d, y_train_d)

print(f"Best parameters: {grid_search_digits.best_params_}")
print(f"Best CV score: {grid_search_digits.best_score_:.3f}")

### 4.4 Visualize Grid Search Results


In [None]:
# Extract results and create heatmap
results_digits = pd.DataFrame(grid_search_digits.cv_results_)

# Create pivot table - need to handle 'scale' as a string
results_digits["param_svm__gamma_str"] = results_digits["param_svm__gamma"].astype(str)
scores_digits = results_digits.pivot(
    index="param_svm__gamma_str", columns="param_svm__C", values="mean_test_score"
)

plt.figure(figsize=(8, 5))
sns.heatmap(
    scores_digits,
    annot=True,
    fmt=".3f",
    cmap="viridis",
    cbar_kws={"label": "CV Accuracy"},
)
plt.title("Grid Search Results: Digits Dataset")
plt.xlabel("C")
plt.ylabel("gamma")
plt.show()

### 4.5 Final Evaluation


In [None]:
# Evaluate on test set
best_model = grid_search_digits.best_estimator_
y_pred_digits = best_model.predict(X_test_d)

print(f"Test accuracy: {accuracy_score(y_test_d, y_pred_digits):.3f}")
print("\nClassification Report:")
print(classification_report(y_test_d, y_pred_digits))

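# Per-digit accuracy: the fraction of each true digit's samples that were
# predicted correctly (this is the same quantity as per-class recall)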
print("\nPer-digit accuracy:")
for digit in range(10):
    mask = y_test_d == digit
    digit_acc = accuracy_score(y_test_d[mask], y_pred_digits[mask])
    n_samples = mask.sum()
    print(f"  Digit {digit}: {digit_acc:.2%} ({n_samples} samples)")

### 4.6 Confusion Matrix and Error Analysis


In [None]:
# Confusion matrix for multiclass
fig, ax = plt.subplots(figsize=(10, 8))

ConfusionMatrixDisplay.from_estimator(
    best_model, X_test_d, y_test_d, cmap="Blues", ax=ax
)
ax.set_title("Confusion Matrix - Digits Dataset")
plt.tight_layout()
plt.show()

# Show some misclassified examples
misclassified_idx = np.where(y_pred_digits != y_test_d)[0]
if len(misclassified_idx) > 0:
    print(f"\nNumber of misclassified samples: {len(misclassified_idx)}")

    # Show up to 5 misclassified examples
    n_show = min(5, len(misclassified_idx))
    fig, axes = plt.subplots(1, n_show, figsize=(2.5 * n_show, 3))
    if n_show == 1:
        axes = [axes]

    for i, ax in enumerate(axes):
        idx = misclassified_idx[i]
        img = X_test_d[idx].reshape(8, 8)
        ax.imshow(img, cmap="gray")
        ax.set_title(f"True: {y_test_d[idx]}\nPred: {y_pred_digits[idx]}", fontsize=10)
        ax.axis("off")

    plt.suptitle("Misclassified Examples", fontsize=12)
    plt.tight_layout()
    plt.show()

### 4.7 Reflection Questions

1. Looking at the confusion matrix, which digit pairs are most often confused? Why might that be?
2. Examine the misclassified examples — do they look ambiguous to you as well?
3. How does the SVM's accuracy compare to the baseline? What does this tell you about the task?
4. Which digits have the lowest per-class accuracy? Can you hypothesize why?
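
One possible way to approach question 1 programmatically is to zero out the diagonal of the confusion matrix and look for the largest remaining entries. The sketch below uses a handful of made-up toy labels purely to show the mechanics. The arrays `y_true_toy` and `y_pred_toy` are hypothetical and are not results from the Digits model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels, invented for illustration only
y_true_toy = np.array([3, 3, 3, 8, 8, 8, 1, 1, 9, 9])
y_pred_toy = np.array([3, 8, 8, 8, 3, 8, 1, 1, 9, 3])
labels = [1, 3, 8, 9]

cm = confusion_matrix(y_true_toy, y_pred_toy, labels=labels)

# Keep only the errors: rows are true labels, columns are predictions
errors = cm.copy()
np.fill_diagonal(errors, 0)

# Locate the single most frequent (true, predicted) confusion
i, j = np.unravel_index(np.argmax(errors), errors.shape)
top_true, top_pred, top_count = labels[i], labels[j], int(errors[i, j])
print(f"Most common confusion: true {top_true} predicted as {top_pred} ({top_count} times)")
```

Applying the same steps to the lab's own test-set labels and predictions would rank the real digit confusions.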


---

## Summary

In this lab, you learned:

1. **Linear SVM** finds the maximum-margin hyperplane; the parameter C controls the trade-off between margin width and misclassification.

2. The **kernel trick** allows SVMs to handle non-linearly separable data by implicitly mapping it to a higher-dimensional space.

3. **RBF kernel** creates circular/blob-like boundaries (controlled by gamma); **Polynomial kernel** creates polynomial boundaries (controlled by degree).

4. **Grid Search + Cross-Validation** helps find optimal hyperparameters without overfitting to the test set.
