# Chapter 5: Support Vector Machines

In [1]:
import numpy as np

## 5.1 Linear SVM Classification

You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes; this is called *large margin classification*.

The decision boundary is fully determined ("supported") by the instances located on the edge of the street; these instances are called *support vectors*. Adding more training instances off the street does not change the boundary at all.

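> Note: SVMs are sensitive to the feature scales. If one feature spans a much larger range than another, the widest street is dominated by that feature and the margin in the other direction is squeezed. Scaling the features first (e.g. with `StandardScaler`) usually gives a much better decision boundary.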
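
A minimal sketch of this effect, using a tiny made-up dataset (the values in `Xs` and `ys` are hypothetical, chosen so that the second feature has a much larger range than the first). It fits the same linear SVM on raw and on scaled features so the two weight vectors can be compared:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical toy data: feature 2 spans a much larger range than feature 1.
Xs = np.array([[1.0, 50.0], [5.0, 20.0], [3.0, 80.0], [5.0, 60.0]])
ys = np.array([0, 0, 1, 1])

# A large C approximates a hard margin on this separable toy set.
raw_clf = SVC(kernel="linear", C=100).fit(Xs, ys)

X_scaled = StandardScaler().fit_transform(Xs)
scaled_clf = SVC(kernel="linear", C=100).fit(X_scaled, ys)

# After scaling, every feature has mean 0 and unit variance.
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))

# The weight vectors show how strongly each feature drives the boundary.
print("raw weights:   ", raw_clf.coef_[0])
print("scaled weights:", scaled_clf.coef_[0])
```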

### 5.1.1 Soft Margin Classification

*Hard margin classification* strictly requires every instance to be off the street and on the correct side. This causes two problems:
- It only works if the data is linearly separable
- It is sensitive to outliers

*Soft margin classification* aims for a good balance between keeping the street as wide as possible and limiting the *margin violations* (instances that end up in the middle of the street or even on the wrong side).
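
In Scikit-Learn's SVM classes this balance is set by the `C` hyperparameter: a small `C` tolerates more margin violations in exchange for a wider street, while a large `C` penalizes them heavily. The sketch below is one way to observe this, assuming the iris petal features and an `SVC` with a linear kernel. The helper `count_margin_violations` and the two `C` values are illustrative choices, not part of the original notes.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]            # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)  # 1 = Iris virginica
t = 2 * y_iris - 1                          # labels recoded as -1 / +1

def count_margin_violations(C):
    """Train a linear SVM and count instances inside the street or on the wrong side."""
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    clf.fit(X_iris, y_iris)
    # An instance violates the margin when t * decision_function(x) < 1.
    scores = clf.decision_function(X_iris)
    return int(np.sum(t * scores < 1))

violations_low_C = count_margin_violations(0.01)
violations_high_C = count_margin_violations(100)
print(violations_low_C, violations_high_C)
```

On this dataset the low-`C` model should report clearly more violations than the high-`C` model.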

The following code loads the iris dataset, scales the features, and then trains a linear SVM model (`LinearSVC` class with `C=1` and *hinge loss function*) to detect Iris virginica flowers.

In [3]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [4]:
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)] # petal length, petal width
y = (iris["target"] == 2).astype(np.float64) # Iris virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge"))
])
svm_clf.fit(X, y)
svm_clf.predict([[5.5, 1.7]])

array([1.])

> Note: Unlike Logistic Regression classifiers, SVM classifiers do not directly output probabilities for each class.
>
> Note: As alternatives, you can use `SVC` with a linear kernel, `SVC(kernel="linear", C=1)`, or `SGDClassifier(loss="hinge", alpha=1/(m*C))`, where `m` is the number of training instances.
>
> Tips when using `LinearSVC`:
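>> - It regularizes the bias term, so center the training set first by subtracting its mean.
>> - `StandardScaler` does this centering automatically (it scales each feature to $\mu=0, \sigma^2=1$).
>> - Set `loss="hinge"` explicitly, as it is not the default value (the default is `"squared_hinge"`).
>> - Set `dual=False` for better performance, unless there are more features than training instances. Note that Scikit-Learn rejects `dual=False` combined with `loss="hinge"`, so this tip only applies with the default squared hinge loss.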
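
To make the first two notes concrete, here is a small sketch that trains both `LinearSVC` and `SVC(kernel="linear")` on the same iris features and inspects the signed score from `decision_function`, which is what these classifiers expose in place of a class probability. The names `linear_svc`, `kernel_svc`, and `sample`, as well as the `max_iter` and `random_state` settings, are choices made for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, (2, 3)]            # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)  # 1 = Iris virginica

linear_svc = make_pipeline(
    StandardScaler(),
    # dual=True is required for the hinge loss.
    LinearSVC(C=1, loss="hinge", dual=True, max_iter=10_000, random_state=42),
)
kernel_svc = make_pipeline(StandardScaler(), SVC(kernel="linear", C=1))

sample = [[5.5, 1.7]]
for name, model in [("LinearSVC", linear_svc), ("SVC linear", kernel_svc)]:
    model.fit(X_iris, y_iris)
    # Positive score -> predicted class 1, negative score -> class 0.
    score = model.decision_function(sample)[0]
    print(name, "score:", round(score, 3), "prediction:", model.predict(sample)[0])

# LinearSVC offers no predict_proba method.
has_proba = hasattr(linear_svc[-1], "predict_proba")
print("LinearSVC has predict_proba:", has_proba)
```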

## 5.2 Nonlinear SVM Classification

One approach to handling nonlinear datasets is to add more features, such as polynomial features.
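
As a quick illustration of the idea, consider a hypothetical one-feature dataset in which the positive class sits in the middle of the range. No single threshold on $x$ separates it, but adding $x^2$ as a second feature does. The variable names below are made up for this sketch.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Hypothetical 1D dataset: the positive class sits in the middle.
X1D = np.arange(-4, 5, dtype=float).reshape(-1, 1)
y1D = (np.abs(X1D[:, 0]) <= 1).astype(int)

# Add x^2 as a second feature (include_bias=False drops the constant column).
X2D = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X1D)

# Training accuracy of a linear SVM before and after adding the feature.
acc_raw = SVC(kernel="linear", C=1000).fit(X1D, y1D).score(X1D, y1D)
acc_poly = SVC(kernel="linear", C=1000).fit(X2D, y1D).score(X2D, y1D)
print(acc_raw, acc_poly)
```

The linear model cannot fit the raw data perfectly, whereas in the $(x, x^2)$ plane the two classes are split by a straight line.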

Let's test this on the moons dataset: a toy dataset for binary classification in which the data points are shaped as two interleaving half circles. Create a `Pipeline` containing a `PolynomialFeatures` transformer, `StandardScaler`, and `LinearSVC`.

In [5]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

In [6]:
X, y = make_moons(n_samples=100, noise=0.15)
polynomial_svm_clf = Pipeline([
    ("pol_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])
polynomial_svm_clf.fit(X, y)

Pipeline(steps=[('pol_features', PolynomialFeatures(degree=3)),
                ('scaler', StandardScaler()),
                ('svm_clf', LinearSVC(C=10, loss='hinge'))])

### 5.2.1 Polynomial Kernel

The *kernel trick* makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them.
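
The reason this works can be checked by hand for a second-degree polynomial. Assuming the explicit mapping $\phi(\vec{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ for 2D vectors, the dot product of two transformed vectors equals the square of the dot product of the original vectors, so the transformation never has to be computed. The vectors `a` and `b` below are arbitrary examples.

```python
import numpy as np

def phi(v):
    """Explicit second-degree polynomial mapping for a 2D vector."""
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a = np.array([1.0, 2.0])   # arbitrary example vectors
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)   # dot product in the transformed space
kernel = (a @ b) ** 2        # same value, computed in the original space
print(explicit, kernel)
```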

In [7]:
from sklearn.svm import SVC

In [8]:
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=5, coef0=1, kernel='poly'))])

If your model is overfitting, try reducing the polynomial degree; if it is underfitting, try increasing it.  
The hyperparameter `coef0` controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.
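
One common way to pick these values is a grid search with cross-validation. The sketch below assumes the moons data with a fixed `random_state` and a small, purely illustrative grid. The ranges are not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", C=5)),
])

# Illustrative search grid; step names are prefixed to reach the SVC parameters.
param_grid = {
    "svm_clf__degree": [2, 3, 5],
    "svm_clf__coef0": [0, 1, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_moons, y_moons)
print(search.best_params_, round(search.best_score_, 3))
```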

### 5.2.2 Similarity Features

Another technique to tackle nonlinear problems is to add features computed using a *similarity function*, which measures how much each instance resembles a particular *landmark*.

*Equation 5-1. Gaussian RBF*

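$$ \phi_\gamma(\vec{x}, \ell) = \exp\left(-\gamma \lVert \vec{x} - \ell \rVert^2\right) $$

where $\lVert \vec{x} - \ell \rVert$ is the distance between the instance $\vec{x}$ and the landmark $\ell$. The function is bell-shaped, ranging from 1 (at the landmark) toward 0 (far away from it). The example here uses $\gamma = 0.3$.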
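
A short worked example: the similarity is $\exp(-\gamma d^2)$, where $d$ is the distance to the landmark. Assuming a 1D instance $x = -1$ and two landmarks at $-2$ and $1$ (positions chosen only for illustration), the distances are 1 and 2, so the new features are $\exp(-0.3 \times 1^2)$ and $\exp(-0.3 \times 2^2)$.

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Similarity between x and a landmark: exp(-gamma * ||x - landmark||^2)."""
    diff = np.atleast_1d(x) - np.atleast_1d(landmark)
    return np.exp(-gamma * np.dot(diff, diff))

gamma = 0.3
landmarks = [-2.0, 1.0]   # assumed landmark positions, for illustration only
x = -1.0

# One new feature per landmark: approximately 0.74 and 0.30.
features = [gaussian_rbf(x, lm, gamma) for lm in landmarks]
print(np.round(features, 2))
```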

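How to select landmarks?  
The simplest approach is to create a landmark at the location of each and every instance in the dataset. This creates many dimensions and thus increases the chances that the transformed training set will be linearly separable. The downside is that a training set with $m$ instances and $n$ features becomes one with $m$ instances and $m$ features (an $m \times m$ matrix, assuming the original features are dropped), which can be very large.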
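
A sketch of this approach, assuming the moons data and Scikit-Learn's `rbf_kernel` helper to compute every instance-to-landmark similarity in one call (`X_similarity` is a name chosen for this example):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X_moons, _ = make_moons(n_samples=100, noise=0.15, random_state=42)

# Column j holds the similarity of every instance to landmark j,
# where landmark j is simply training instance j.
X_similarity = rbf_kernel(X_moons, X_moons, gamma=0.3)
print(X_moons.shape, "->", X_similarity.shape)

# Each instance has similarity 1 with itself (the diagonal).
print(np.allclose(np.diag(X_similarity), 1.0))
```

With 100 instances the two original features turn into 100 similarity features, which shows how quickly this grows with $m$.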

### 5.2.3 Gaussian RBF Kernel

The similarity features method (e.g. with the Gaussian RBF) can be useful with any Machine Learning algorithm, but computing all the additional features may be expensive, especially on large training sets.

The kernel trick with the Gaussian RBF (`kernel="rbf"`) achieves a similar result.

In [9]:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=0.001, gamma=5))])

$\uparrow$ $\gamma$ makes the bell-shaped curve narrower -> each instance's range of influence is smaller and the decision boundary becomes more irregular, wiggling around individual instances.

$\downarrow$ $\gamma$ makes the bell-shaped curve wider -> each instance's range of influence is larger and the decision boundary becomes smoother.

$\gamma$ acts like a regularization hyperparameter.
- If overfitting -> decrease $\gamma$
- If underfitting -> increase $\gamma$
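
A rough way to see this behavior is to compare training accuracy with cross-validated accuracy for a few values of $\gamma$. The sketch assumes the moons data with a fixed `random_state`, `C=1`, and three illustrative $\gamma$ values. A large gap between the two scores hints at overfitting.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_moons, y_moons = make_moons(n_samples=100, noise=0.15, random_state=42)

results = {}
for gamma in (0.01, 1, 100):   # illustrative values only
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=1))
    # Accuracy on the data the model was trained on.
    train_acc = clf.fit(X_moons, y_moons).score(X_moons, y_moons)
    # Mean accuracy on held-out folds (the pipeline is refit per fold).
    cv_acc = cross_val_score(clf, X_moons, y_moons, cv=5).mean()
    results[gamma] = (train_acc, cv_acc)
    print(f"gamma={gamma}: train={train_acc:.2f}, cv={cv_acc:.2f}")
```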

> Note: Which kernel to choose?
>> - Always try the linear kernel first
>> - Prefer `LinearSVC` over `SVC(kernel="linear")`: it is much faster, especially if the training set is large or has many features
>> - If the training set is not too large, also try the Gaussian RBF kernel, `SVC(kernel="rbf")`

### 5.2.4 Computational Complexity

`LinearSVC` does not support the kernel trick, but it scales almost linearly with the number of training instances $m$ and features $n$. Its training time complexity is roughly $O(m \times n)$.

The algorithm takes longer if you require very high precision, which is controlled by the tolerance hyperparameter $\epsilon$ (`tol` in Scikit-Learn). In most cases, the default tolerance is fine.

`SVC` supports the kernel trick, but gets dreadfully slow when the number of training instances gets large (100,000+). It is best suited to complex small or medium-sized training sets, and it scales well with the number of features. Its training time complexity is usually between $O(m^2 \times n)$ and $O(m^3 \times n)$.

## 5.3 SVM Regression

To use SVMs for regression instead of classification, the trick is to reverse the objective:  
Instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible **on** the street while limiting margin violations (instances **off** the street). The width of the street is controlled by the hyperparameter $\epsilon$: a larger $\epsilon$ gives a wider street, a smaller $\epsilon$ a narrower one.

> Note: Adding more training instances within the margin does not affect the model's predictions -> model is *$\epsilon$-insensitive*.
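
The sketch below illustrates this on hypothetical noisy linear data (generated here as $y = 4 + 3x$ plus Gaussian noise, an assumption made for the example). Instances strictly inside the street do not become support vectors, so a wider street should leave fewer of them. `SVR` with a linear kernel is used because it exposes the support vector indices.

```python
import numpy as np
from sklearn.svm import SVR

# Hypothetical noisy linear data: y = 4 + 3x + Gaussian noise.
rng = np.random.default_rng(42)
X_lin = 2 * rng.random((50, 1))
y_lin = 4 + 3 * X_lin[:, 0] + rng.normal(size=50)

narrow = SVR(kernel="linear", epsilon=0.1, C=1).fit(X_lin, y_lin)
wide = SVR(kernel="linear", epsilon=1.5, C=1).fit(X_lin, y_lin)

# support_ lists the indices of the training instances that are support vectors.
n_sv_narrow = len(narrow.support_)
n_sv_wide = len(wide.support_)
print(n_sv_narrow, n_sv_wide)
```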

In [10]:
from sklearn.svm import LinearSVR

In [11]:
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)  # X, y still hold the moons data from In [6]; use a regression dataset in practice

LinearSVR(epsilon=1.5)

To tackle nonlinear regression tasks, use a kernelized SVM model.

In [12]:
from sklearn.svm import SVR

In [13]:
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

SVR(C=100, degree=2, kernel='poly')

`SVR` is the regression equivalent of `SVC`, and `LinearSVR` is the regression equivalent of `LinearSVC`.
- `LinearSVR` scales linearly with the size of the training set
- `SVR` gets much too slow when the training set grows large

## 5.4 Under the Hood

### 5.4.1 Decision Function and Predictions

### 5.4.2 Training Objective

### 5.4.3 Quadratic Programming

### 5.4.4 The Dual Problem

### 5.4.5 Kernelized SVMs

### 5.4.6 Online SVMs