# Chapter 5: Support Vector Machines

<i>Support Vector Machines</i> (SVMs) are machine learning models capable of both
linear and nonlinear classification. They are best suited to small- to medium-sized
datasets.

## Linear SVM Classification

One useful property of SVMs is that, unlike other linear classifiers, an SVM places
its decision boundary as far as possible from the closest training instances. This
is called <i>large margin classification</i>. The SVM looks for the widest possible margin (or "street") separating the classes, and the decision boundary runs down the middle of that street.

Adding training instances well off the street does not change the decision boundary,
which is fully determined by the instances on the edge of the street. These instances are called <i>support vectors</i>.
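
As a rough check of this property, the sketch below fits a linear SVM on two linearly separable iris classes, then refits after adding two instances far from the street. The choice of classes, the large `C` standing in for a hard margin, and the two extra points are all assumptions made for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()

# Keep setosa (0) and versicolor (1) only: linearly separable on petal features
mask = iris["target"] < 2
X_sep = iris["data"][mask][:, 2:4]  # petal length, petal width
y_sep = iris["target"][mask]

# Assumption: a large C stands in for a hard margin on this separable data
clf = SVC(kernel="linear", C=1000).fit(X_sep, y_sep)
print("support vectors:\n", clf.support_vectors_)

# Add two made-up instances far from the street, each on its own class's side
X_far = np.array([[1.0, 0.1], [6.0, 2.0]])
y_far = np.array([0, 1])
clf_more = SVC(kernel="linear", C=1000).fit(
    np.vstack([X_sep, X_far]), np.concatenate([y_sep, y_far])
)

# The boundary parameters should match up to solver tolerance
print("before:", clf.coef_, clf.intercept_)
print("after: ", clf_more.coef_, clf_more.intercept_)
```

Only a handful of the 100 instances end up in `support_vectors_`, and the fitted coefficients should be essentially unchanged by the extra points.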

SVMs are also sensitive to the scale of the features used to train the model, so it is usually worth scaling them first (for example with Scikit-Learn's `StandardScaler`).

### Soft Margin Classification

<i>Hard margin classification</i> requires every instance to be off the street and on the correct side of it. This has two drawbacks: it only works when the data is linearly separable, and it is sensitive to outliers.

Another way to train the model is to aim for the widest street possible while limiting the number of <i>margin violations</i> (instances that end up on the street or on the wrong side). This method is called <i>soft margin classification</i>. In Scikit-Learn, you can tune how lenient the model is with margin violations using the $C$ hyperparameter. Large values of $C$ lead to fewer margin violations and a narrower street, whereas small values of $C$ lead to more margin violations and a wider street.

In [9]:
# The following code uses Scikit-Learn to classify the iris dataset using
# an SVM

import warnings
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

warnings.filterwarnings("ignore")

iris = datasets.load_iris()
X = iris['data'][:, (2, 3)]  # petal length, petal width
y = (iris['target'] == 2).astype(np.float64)  # 1.0 for Iris virginica, 0.0 otherwise

svm_clf = Pipeline([
  ('scaler', StandardScaler()),
  ('linear_svc', LinearSVC(C=1, loss='hinge'))
])

svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])

In [10]:
svm_clf.predict([[5.5, 1.7]])

array([1.])
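
To make the effect of $C$ concrete, here is a minimal sketch that fits the same data with several values of $C$ and reports the street width ($2 / \|\mathbf{w}\|$ in the scaled feature space) and the number of margin violations. It uses `SVC(kernel='linear')` rather than `LinearSVC` so the solver converges cleanly; the helper `street_stats`, the values of `C`, and the numerical tolerance are assumptions for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_iris = iris["data"][:, 2:4]               # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)  # 1 for Iris virginica
t = 2 * y_iris - 1                          # same labels encoded as -1 / +1


def street_stats(C):
    """Hypothetical helper: return (street width, margin violations) for a given C."""
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    model.fit(X_iris, y_iris)
    w = model.named_steps["svc"].coef_[0]
    width = 2 / np.linalg.norm(w)
    # An instance violates the margin when t * decision_function < 1;
    # the 1e-2 slack absorbs solver tolerance for instances on the edge.
    violations = int(np.sum(t * model.decision_function(X_iris) < 1 - 1e-2))
    return width, violations


for C in (0.01, 1, 100):
    width, violations = street_stats(C)
    print(f"C={C}: street width={width:.2f}, margin violations={violations}")
```

The width should shrink as $C$ grows, with fewer instances left on the street.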

You can also use the `SVC` class with `SVC(kernel='linear', C=1)`, but it is much slower than `LinearSVC`, especially on large training sets. Another option is the `SGDClassifier` class with `SGDClassifier(loss='hinge', alpha=1/(m*C))`, where `m` is the number of training instances; this trains a linear SVM classifier with Stochastic Gradient Descent.
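
The three options can be compared side by side. The sketch below is one way to do it, assuming the same iris features as above; `max_iter` and `random_state` are illustrative choices rather than recommended settings, and the three models are not expected to produce identical boundaries.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, 2:4]               # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)  # 1 for Iris virginica

C = 1
m = len(X_iris)  # number of training instances

candidates = {
    "LinearSVC": LinearSVC(C=C, loss="hinge", max_iter=10000, random_state=42),
    "SVC": SVC(kernel="linear", C=C),
    "SGDClassifier": SGDClassifier(loss="hinge", alpha=1 / (m * C), random_state=42),
}

accuracies = {}
for name, estimator in candidates.items():
    # Scale the features first, since SVMs are sensitive to feature scale
    model = make_pipeline(StandardScaler(), estimator)
    model.fit(X_iris, y_iris)
    accuracies[name] = model.score(X_iris, y_iris)
    print(f"{name}: training accuracy = {accuracies[name]:.3f}")
```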

## Nonlinear SVM Classification

Often datasets are not linearly separable. One thing you can do is add polynomial features to the dataset to make it linearly separable. An example of this is shown below.

In [13]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

X, y = make_moons()

poly_svm_clf = Pipeline([
  ('poly_features', PolynomialFeatures(degree=3)),
  ('scaler', StandardScaler()),
  ('svm_clf', LinearSVC(C=10, loss='hinge')),
])

poly_svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])

### Polynomial Kernel

Choosing the right polynomial degree before fitting can be difficult: a low degree cannot handle complex datasets, while a high degree creates a huge number of features and makes the model slow. With the <i>kernel trick</i>, an SVM gets the same result as if you had added the polynomial features, without actually adding them. The example below does this with Scikit-Learn, using a polynomial kernel of degree 3.

In [15]:
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
  ('scaler', StandardScaler()),
  ('svm_clf', SVC(kernel='poly', degree=3, coef0=0.1, C=5))
])

poly_kernel_svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=5, cache_size=200, class_weight=None, coef0=0.1,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False))])

If the model is underfitting, you can increase the degree; if it is overfitting, lower it. You can also tune the `coef0` hyperparameter, which controls how much the model is influenced by high-degree terms versus low-degree terms.
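
One way to tune `degree` and `coef0` together is a small cross-validated grid search. The sketch below assumes a noisy moons dataset (`noise=0.15`, `random_state=42`) and an arbitrary grid of values; both are illustrative choices, not taken from the example above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: a noisy, reproducible moons dataset for the search
X_moons, y_moons = make_moons(n_samples=200, noise=0.15, random_state=42)

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", C=5)),
])

# Hypothetical grid: step names are prefixed with "svm_clf__" to reach the SVC
param_grid = {
    "svm_clf__degree": [2, 3, 4],
    "svm_clf__coef0": [0.0, 0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_moons, y_moons)

print("best parameters:", search.best_params_)
print(f"best cross-validated accuracy: {search.best_score_:.3f}")
```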

### Adding Similarity Features

Another way to solve nonlinear problems is to add features computed using a <i>similarity function</i> that measures how much each instance resembles a <i>landmark</i>. One example of a similarity function is called the Gaussian <i>Radial Basis Function</i>,

$$ \phi_\gamma(\mathbf{x}, \mathbf{\ell}) =
\exp\left(-\gamma\,||\mathbf{x} - \mathbf{\ell} ||^2\right) $$

which ranges from 0 (very far from the landmark) to 1 (at the landmark). A simple way to choose landmarks is to use every instance in the training set as a landmark and then drop the original features. This turns a training set of $m$ instances into one with $m$ features per instance, so for large datasets it can greatly increase training time.
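
The similarity features can be computed directly with NumPy. The following is a minimal sketch of the formula above; the function name `gaussian_rbf`, the one-dimensional sample data, the two landmarks, and $\gamma = 0.3$ are all assumptions chosen for illustration.

```python
import numpy as np


def gaussian_rbf(X, landmarks, gamma):
    """Return an (n_instances, n_landmarks) array of Gaussian RBF similarities."""
    # Squared Euclidean distance between every instance and every landmark
    diffs = X[:, np.newaxis, :] - landmarks[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)


X1D = np.array([[-1.0], [0.0], [2.0]])  # three 1D instances
landmarks = np.array([[-2.0], [1.0]])   # two hand-picked landmarks

features = gaussian_rbf(X1D, landmarks, gamma=0.3)
print(features)  # first row is approximately [0.74, 0.30]

# Using every training instance as a landmark gives an m x m feature matrix
all_landmarks = gaussian_rbf(X1D, X1D, gamma=0.3)
print(all_landmarks.shape)  # (3, 3)
```

The instance at $x = -1$ is at distance 1 from the first landmark and 2 from the second, so its new features are $\exp(-0.3 \times 1^2) \approx 0.74$ and $\exp(-0.3 \times 2^2) \approx 0.30$.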