# Chapter 5: Support Vector Machines

## 1. Chapter Overview
**Goal:** Understand Support Vector Machines (SVMs), a powerful and versatile Machine Learning model capable of performing linear or nonlinear classification, regression, and even outlier detection.

**Key Concepts:**
* **Large Margin Classification:** The core idea of fitting the "widest possible street" between classes.
* **Hard vs. Soft Margin:** Balancing perfectly separating data vs. allowing some violations to generalize better.
* **Kernel Trick:** A mathematical technique to map data into higher dimensions without actually calculating the coordinates, allowing for nonlinear classification.
* **SVM Regression:** Reversing the objective (trying to fit as many instances *on* the street as possible).

**Practical Skills:**
* Using `LinearSVC` for fast linear classification.
* Implementing `StandardScaler` (crucial for SVMs).
* Using `SVC` with Polynomial and RBF kernels.
* Tuning hyperparameters `C` and `gamma`.

In [None]:
# Setup
import sys
import sklearn
import numpy as np
import os
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

np.random.seed(42)
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

## 2. Theoretical Explanation

### 1. Large Margin Classification
Think of an SVM classifier as fitting the widest possible street between the classes. This is called **Large Margin Classification**. Adding more training instances off the street does not affect the decision boundary; the boundary is fully determined (or "supported") by the instances located on the edge of the street. These instances are called **Support Vectors**.
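
The claim that only the support vectors determine the boundary can be checked on a toy problem. The sketch below is illustrative: the six points, the variable names, and the choice of `SVC(kernel="linear")` with a very large `C` (to approximate a hard margin) are assumptions, not part of the chapter's code.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative values, not from the chapter).
X_toy = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
                  [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)
sv_before = clf.support_vectors_.copy()
w_before = clf.coef_.copy()

# Add two instances that lie well off the street, on the correct side.
X_more = np.vstack([X_toy, [[0.0, 0.0], [6.0, 6.0]]])
y_more = np.append(y_toy, [0, 1])
clf_more = SVC(kernel="linear", C=1e6).fit(X_more, y_more)

print("Support vectors:\n", sv_before)
print("Weights before:", w_before, "after:", clf_more.coef_)
print("Boundary unchanged:", np.allclose(w_before, clf_more.coef_, atol=1e-2))
```

Only the instances on the edge of the street appear in `support_vectors_`, and refitting with the extra points should leave the weights essentially unchanged.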

### 2. Sensitivity to Feature Scaling
SVMs are very sensitive to the scales of the features. If one feature has a much larger range than another (e.g., salary vs. age), distances are dominated by the large-range feature, so the "street" tends to run almost perpendicular to that feature's axis and the smaller feature is nearly ignored. **Always use Feature Scaling (e.g., `StandardScaler`)** before training an SVM.
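
To make the scale problem concrete, the sketch below splits a squared Euclidean distance into per-feature contributions before and after scaling. The age and salary values are hypothetical and chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: column 0 is age (years), column 1 is salary (dollars).
X_raw = np.array([[25.0, 40_000.0],
                  [35.0, 60_000.0],
                  [45.0, 80_000.0],
                  [55.0, 120_000.0]])

# Per-feature contribution to the squared distance between the first two rows.
diff_raw = (X_raw[0] - X_raw[1]) ** 2
print("Raw contributions (age, salary):", diff_raw)

# After standardization each column has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X_raw)
diff_scaled = (X_scaled[0] - X_scaled[1]) ** 2
print("Scaled contributions (age, salary):", diff_scaled)
print("Means:", X_scaled.mean(axis=0), "Stds:", X_scaled.std(axis=0))
```

On the raw data the salary term swamps the age term by several orders of magnitude; after scaling the two contributions are of comparable size.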

### 3. Hard Margin vs. Soft Margin
* **Hard Margin:** Strictly requires every instance to be off the street and on the correct side. It only works if the data is linearly separable, and it is sensitive to outliers.
* **Soft Margin:** Allows some instances to end up on the street or on the wrong side (margin violations) to keep the street wide and generalize better. This balance is controlled by the hyperparameter **C**.
    * **High C:** Fewer margin violations, smaller margin (strict).
    * **Low C:** More margin violations, wider margin (tolerant/regularized).
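
The effect of `C` on the street can be measured directly. The following is a minimal sketch, assuming a linear `SVC` on the scaled Iris petal features; the street width is computed as $2 / \|\mathbf{w}\|$, and the `results` dictionary and the small tolerance used when counting violations are illustrative choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = StandardScaler().fit_transform(iris["data"][:, 2:4])  # petal length, width
y_iris = (iris["target"] == 2).astype(int)                     # Iris-Virginica
t = 2 * y_iris - 1                                             # labels as -1 / +1

results = {}
for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X_iris, y_iris)
    width = 2 / np.linalg.norm(clf.coef_)
    # An instance violates the margin when t * f(x) < 1; the small tolerance
    # avoids counting support vectors that sit exactly on the edge.
    violations = int(np.sum(t * clf.decision_function(X_iris) < 1 - 1e-2))
    results[C] = (width, violations)
    print(f"C={C}: street width={width:.2f}, margin violations={violations}")
```

The low-`C` model should report the wider street, which matches the two bullets above.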

### 4. The Kernel Trick
To handle nonlinear data (like the Moons dataset), we can add polynomial features ($x^2, x^3$). However, adding many features slows down training. The **Kernel Trick** yields the same result as if you had added many polynomial features, without actually adding them: a kernel function computes the dot product of the transformed vectors directly from the original vectors, so the transformation itself never has to be computed.

* **Polynomial Kernel:** Simulates adding polynomial features.
* **RBF (Radial Basis Function) Kernel:** Simulates adding "similarity features" (like Gaussian landmarks).
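
A small numeric check can make the trick less abstract. For 2D vectors, the second-degree mapping $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ satisfies $\phi(\mathbf{a})^\top \phi(\mathbf{b}) = (\mathbf{a}^\top \mathbf{b})^2$. The sketch below verifies this for one pair of vectors; the function names `phi` and `poly_kernel` and the sample vectors are illustrative.

```python
import numpy as np

def phi(x):
    """Explicit second-degree polynomial mapping of a 2D vector into 3D."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def poly_kernel(a, b):
    """Second-degree polynomial kernel, computed in the original 2D space."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, 4.0])

explicit = np.dot(phi(a), phi(b))  # dot product in the 3D feature space
via_kernel = poly_kernel(a, b)     # same value without ever computing phi
print(explicit, via_kernel)
```

Both numbers agree (up to floating-point rounding), even though `poly_kernel` never builds the 3D vectors.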

## 3. Code Reproduction

### 3.1 Linear SVM Classification
We use the Iris dataset again, specifically to detect Iris-Virginica.

In [None]:
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
])

svm_clf.fit(X, y)

# Predict
print("Prediction for [5.5, 1.7]:", svm_clf.predict([[5.5, 1.7]]))

### 3.2 Nonlinear SVM Classification
For nonlinear data, we use the `make_moons` dataset (two interleaving half circles). We will use a Pipeline that adds polynomial features, scales them, and then applies a Linear SVM.

In [None]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge", random_state=42))
])

polynomial_svm_clf.fit(X, y)

# Visualization function
def plot_dataset(X, y, axes):
    plt.plot(X[:, 0][y==0], X[:, 1][y==0], "bs")
    plt.plot(X[:, 0][y==1], X[:, 1][y==1], "g^")
    plt.axis(axes)
    plt.grid(True, which='both')
    plt.xlabel(r"$x_1$", fontsize=20)
    plt.ylabel(r"$x_2$", fontsize=20, rotation=0)

def plot_predictions(clf, axes):
    x0s = np.linspace(axes[0], axes[1], 100)
    x1s = np.linspace(axes[2], axes[3], 100)
    x0, x1 = np.meshgrid(x0s, x1s)
    X = np.c_[x0.ravel(), x1.ravel()]
    y_pred = clf.predict(X).reshape(x0.shape)
    y_decision = clf.decision_function(X).reshape(x0.shape)
    plt.contourf(x0, x1, y_pred, cmap=plt.cm.brg, alpha=0.2)
    plt.contourf(x0, x1, y_decision, cmap=plt.cm.brg, alpha=0.1)

plot_predictions(polynomial_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.title("Linear SVM with Polynomial Features")
plt.show()

### 3.3 Polynomial Kernel (Kernel Trick)
Instead of manually adding features (which explodes combinatorially), we use `SVC(kernel="poly")`.

In [None]:
from sklearn.svm import SVC

# degree=3 selects a third-degree polynomial kernel; coef0 controls how much the
# model is influenced by high-degree terms versus low-degree terms
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)

plot_predictions(poly_kernel_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.title("SVM with Polynomial Kernel")
plt.show()

### 3.4 Gaussian RBF Kernel
The Gaussian RBF kernel is a common first choice for nonlinear problems (it is the default `kernel` of `SVC`). It behaves as if it added features based on each instance's similarity to a set of landmarks.

$$ \phi_\gamma(\mathbf{x}, \ell) = \exp(-\gamma \| \mathbf{x} - \ell \|^2) $$

* **Gamma ($\gamma$):** Controls the bell-curve width. Higher gamma $\rightarrow$ narrower bell curve $\rightarrow$ more complex/irregular boundary (risk of overfitting).
* **C:** Regularization parameter.
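
The similarity function above can be evaluated by hand for a tiny case. The sketch below assumes a one-dimensional instance at $x = -1$ with two landmarks at $-2$ and $1$; these values, the helper name `gaussian_rbf`, and the two gamma settings are illustrative.

```python
import numpy as np

def gaussian_rbf(x, landmarks, gamma):
    """Similarity of x to each landmark: exp(-gamma * ||x - l||^2)."""
    sq_dist = np.linalg.norm(x - landmarks, axis=-1) ** 2
    return np.exp(-gamma * sq_dist)

x = np.array([-1.0])
landmarks = np.array([[-2.0], [1.0]])  # distances to x are 1 and 2

features = gaussian_rbf(x, landmarks, gamma=0.3)
features_narrow = gaussian_rbf(x, landmarks, gamma=3.0)
print("gamma=0.3:", features.round(2))       # roughly 0.74 and 0.30
print("gamma=3.0:", features_narrow.round(4))
```

With the larger gamma both similarities drop sharply, which is the "narrower bell curve" described above: each landmark only influences instances very close to it.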

In [None]:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)

plot_predictions(rbf_kernel_svm_clf, [-1.5, 2.5, -1, 1.5])
plot_dataset(X, y, [-1.5, 2.5, -1, 1.5])
plt.title("SVM with RBF Kernel (gamma=5, C=0.001)")
plt.show()

## 4. Step-by-Step Explanation

### 1. The Pipeline
You will notice `Pipeline` is used everywhere. This is standard practice for SVMs.
1.  **StandardScaler:** SVM tries to maximize distance. If one axis is 100x larger than the other, the distance calculation is dominated by that axis. Scaling makes all axes contribute equally.
2.  **Model:** `LinearSVC` or `SVC`.
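
To see what the `Pipeline` does under the hood, the sketch below fits the same model twice: once through a pipeline and once by scaling manually. The variable names and the two query points are illustrative; the model settings mirror the Linear SVM cell above.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, 2:4]               # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)  # Iris-Virginica

# Pipeline version: scaling and fitting happen in one call.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", random_state=42)),
])
pipe.fit(X_iris, y_iris)

# Manual version: fit the scaler, transform, then fit the classifier.
scaler = StandardScaler().fit(X_iris)
manual_clf = LinearSVC(C=1, loss="hinge", random_state=42)
manual_clf.fit(scaler.transform(X_iris), y_iris)

# New data must go through the same fitted scaler before prediction.
X_new = np.array([[5.5, 1.7], [1.4, 0.2]])
pipe_pred = pipe.predict(X_new)
manual_pred = manual_clf.predict(scaler.transform(X_new))
print(pipe_pred, manual_pred)
```

The pipeline's main benefit is that the scaler fitted on the training data is reapplied automatically at prediction time, so the scaling step cannot be forgotten.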

### 2. Linear vs. Kernel SVM
* **`LinearSVC`:** Optimized for linear tasks. It does not support the kernel trick, but its training time scales roughly linearly with the number of instances $m$ and the number of features $n$ ($O(m \times n)$).
* **`SVC`:** Supports kernels (poly, rbf). It relies on libsvm. Its training time is usually between $O(m^2 \times n)$ and $O(m^3 \times n)$, so it becomes slow on large datasets, but it works well for complex small-to-medium datasets.
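
For a purely linear problem the two classes can be swapped for each other. The sketch below, which assumes the scaled Iris petal features and `C=1` for both models, fits `LinearSVC` and `SVC(kernel="linear")` and measures how often their predictions agree. The two optimizers differ (for example, `LinearSVC` also regularizes the bias term), so identical results are not expected.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X_scaled = StandardScaler().fit_transform(iris["data"][:, 2:4])
y_iris = (iris["target"] == 2).astype(int)  # Iris-Virginica

linear_svc = LinearSVC(C=1, loss="hinge", random_state=42, max_iter=10_000)
linear_svc.fit(X_scaled, y_iris)

kernel_svc = SVC(kernel="linear", C=1)
kernel_svc.fit(X_scaled, y_iris)

# Fraction of training instances on which the two models predict the same class.
agreement = np.mean(linear_svc.predict(X_scaled) == kernel_svc.predict(X_scaled))
print(f"Prediction agreement: {agreement:.2%}")
print("LinearSVC weights:", linear_svc.coef_)
print("SVC weights:      ", kernel_svc.coef_)
```

On a dataset this small the speed difference is invisible; the practical distinction only shows up as the number of instances grows.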

### 3. Hyperparameter Tuning
In the RBF example:
* We used `gamma=5` (high gamma). This makes each instance's range of influence small, so on its own it pushes the decision boundary to become irregular, wiggling around individual instances.
* We used `C=0.001` (low C). This applies strong regularization, forcing a wider street even if it means misclassifying some training points.

With `C` this low, the regularization likely dominates the high gamma, so this combination tends toward **underfitting** in this specific visualization, showing a smooth "blob" rather than a tight fit.
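
Rather than picking `gamma` and `C` by eye, the combinations can be compared with cross-validation. This is a minimal sketch using `GridSearchCV` on the same Moons data; the grid values, the 5-fold setting, and the variable names are assumptions chosen for illustration, not tuned recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf")),
])

# Pipeline step parameters are addressed as "<step name>__<parameter>".
param_grid = {
    "svm_clf__gamma": [0.1, 5],
    "svm_clf__C": [0.001, 1000],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_moons, y_moons)

# Mean cross-validated accuracy for each (gamma, C) combination.
for params, score in zip(search.cv_results_["params"],
                         search.cv_results_["mean_test_score"]):
    print(params, f"mean CV accuracy = {score:.2f}")
print("Best:", search.best_params_)
```

If the model overfits, reducing `gamma` or `C` is the usual remedy; if it underfits, increasing them is.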

## 5. Chapter Summary

* **SVMs** classify data by finding the widest street (margin) between classes.
* **Support Vectors** are the instances located on the edge of the street (and, with a soft margin, any instances that violate it); they fully determine the decision boundary.
* **Scaling is essential.** SVMs are sensitive to feature scales, so scale the features (e.g., with `StandardScaler`) before training.
* **Kernel Trick:** Allows fitting nonlinear data efficiently without manually adding thousands of polynomial features.
* **Hyperparameters:**
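    * **C:** Inversely related to regularization strength. Small C = wide street (more violations). Large C = narrow street (strict).
    * **Gamma:** Controls each instance's range of influence in the RBF kernel (in scikit-learn's `SVC`, the poly and sigmoid kernels also use `gamma` as a coefficient). Small gamma = smooth boundary. Large gamma = irregular boundary.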