# Chapter 5. Support Vector Machines

- The SVM is a powerful and versatile ML model, capable of performing classification, regression, and even outlier detection.
- SVMs are particularly suited for complex small-to-medium sized datasets.
- This chapter will explain the core concepts of SVMs, how they work, and how to use them.

## Linear SVM Classification

- The fundamental idea behind SVMs is best explained with the following picture:

<div style="text-align:center;"><img style="width:66%;" src="static/imgs/SVM_example.png" /></div>

- The two classes can be separated easily by a straight line (linearly separable).
- The left plot shows the decision boundaries of three possible linear classifiers.
- The dashed-line model is so bad that it doesn't even separate the two classes properly.
- The other two models work perfectly on the plotted training set.
    - But their boundaries are so close to the training data points that they'll probably not perform well on unseen data.
- In contrast, the model on the right not only separates the two classes, it also stays as far as possible from the closest training instances of both classes.
    - Thus, it will likely perform well on unseen data.
- You can think of an SVM as fitting the widest possible street (represented by the dashed lines) between the classes.
    - This is called **Large Margin Classification**.
- Notice that adding more training points off the street **won't affect the decision boundary at all**.
- It's fully determined by the data points located at the edge of the street.
- These instances are called **support vectors**.
- SVMs are also sensitive to feature scales, so it's a good idea to standardize the features (e.g., with `StandardScaler`) before training.
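- The claim that only the support vectors determine the boundary can be checked with a small experiment. This is a minimal sketch, not part of the original notebook: it assumes the Setosa and Versicolor classes of the iris dataset (which are linearly separable on the petal features), uses `SVC(kernel="linear")` with a large `C` to approximate a hard margin, and adds one hypothetical instance far from the street.

```python
import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]          # petal length, petal width
y = iris["target"]

# Keep only Setosa (0) and Versicolor (1): linearly separable on these features.
mask = y < 2
X, y = X[mask], y[mask]

# A large C approximates a hard margin on this separable subset (assumption).
clf = SVC(kernel="linear", C=100, tol=1e-5).fit(X, y)
w_before, b_before = clf.coef_[0].copy(), clf.intercept_[0]
n_support = len(clf.support_vectors_)

# Add one Setosa-like instance far away from the street (hypothetical data point).
X_more = np.vstack([X, [[1.0, 0.1]]])
y_more = np.append(y, 0)
clf_more = SVC(kernel="linear", C=100, tol=1e-5).fit(X_more, y_more)
w_after, b_after = clf_more.coef_[0], clf_more.intercept_[0]

print("support vectors:", n_support, "out of", len(X), "instances")
print("boundary before:", w_before, b_before)
print("boundary after: ", w_after, b_after)
```

- The weights and intercept should match up to solver tolerance, and only a handful of instances end up in `support_vectors_`.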

### Soft Margin Classification

- If we strictly require all training instances to be off the street and on the correct side, this is called **Hard Margin Classification**.
    - It only works if the data is linearly separable, and it is very sensitive to outliers.
- The following are two examples of how outliers can mess up hard margin classifiers:

<div style="text-align:center;"><img style="width:66%;" src="static/imgs/Hard_Margin_Classifier.png" /></div>

- To fix the issue, try to balance finding a wide street with limiting the number of violations.
    - This is called **Soft Margin Classification**.
- This can be controlled in scikit-learn by the `C` hyper-parameter:

<div style="text-align:center;"><img style="width:66%;" src="static/imgs/Soft_to_Hard_Street.png" /></div>

- By increasing `C`, we penalize margin violations on the training set more heavily, so the street gets narrower.
    - Meaning, if you're overfitting, try reducing the value of the `C` hyper-parameter.
- Let's use scikit-learn's SVMs:

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

In [2]:
iris = datasets.load_iris()

In [3]:
iris['data'].shape

(150, 4)

In [4]:
X = iris['data'][:, [2,3]]  # Petal Length, Petal Width
y = (iris['target'] == 2).astype(np.float64)  # Iris Virginica
X.shape, y.shape

((150, 2), (150,))

In [5]:
svm_clf = Pipeline([
    ('Scaler', StandardScaler()),
    ('Linear_svc', LinearSVC(C=1, loss='hinge'))
])

In [6]:
svm_clf.fit(X, y)

Pipeline(memory=None,
         steps=[('Scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('Linear_svc',
                 LinearSVC(C=1, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='hinge', max_iter=1000, multi_class='ovr',
                           penalty='l2', random_state=None, tol=0.0001,
                           verbose=0))],
         verbose=False)

In [7]:
svm_clf.predict([[5.5, 1.7]])

array([1.])
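- To get a feel for the `C` hyper-parameter, the following sketch fits the same Iris Virginica problem with three values of `C` and reports the street width and the number of margin violations. It is an illustration under assumptions: `SVC(kernel="linear")` is swapped in for `LinearSVC` because it exposes the same linear model and converges reliably for large `C`, the helper `street_stats` is hypothetical, and the width is measured in the scaled feature space.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]                      # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)     # Iris Virginica vs. the rest

def street_stats(C):
    """Fit a linear SVM and return (street width, number of margin violations)."""
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    model.fit(X, y)
    svc = model.named_steps["svc"]
    width = 2 / np.linalg.norm(svc.coef_)        # distance between the two street edges
    signed = np.where(y == 1, 1, -1) * model.decision_function(X)
    violations = int(np.sum(signed < 1))         # inside the street or on the wrong side
    return width, violations

for C in (0.01, 1, 100):
    width, violations = street_stats(C)
    print(f"C={C:<6} street width={width:.2f}  margin violations={violations}")
```

- The street should get narrower as `C` grows. The violation count is approximate, since instances sitting exactly on the street edge can fall on either side of the threshold numerically.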

- Unlike Logistic Regression Models (with their sigmoid functions), SVMs do not output probabilities for each class.
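- What the model does expose is a signed score from `decision_function`: positive on one side of the boundary, negative on the other. The sketch below refits the pipeline from above and inspects that score. The explicit `dual=True`, `max_iter`, and `random_state` arguments are assumptions added here for reproducibility across scikit-learn versions.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]                      # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)     # Iris Virginica vs. the rest

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    # dual, max_iter and random_state are set explicitly (assumptions, see above)
    ("linear_svc", LinearSVC(C=1, loss="hinge", dual=True,
                             max_iter=10_000, random_state=42)),
])
svm_clf.fit(X, y)

x_new = [[5.5, 1.7]]
score = svm_clf.decision_function(x_new)[0]      # signed score, not a probability
label = svm_clf.predict(x_new)[0]                # 1.0 when the score is positive

print("score:", score, "label:", label)
print("has predict_proba:", hasattr(svm_clf, "predict_proba"))
```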

## NonLinear SVM Classification

- Many datasets are not even close to being linearly separable.
- One approach to handling nonlinear datasets is to add more features, such as polynomial features.
    - In some cases this can result in a linearly separable dataset.
- The following is an example of an original dataset with only one feature $x_{1}$ that is not linearly separable (on the left), and an augmented, linearly separable dataset with an added feature $x_{2}=x_{1}^{2}$ (on the right):

<div style="text-align:center;"><img style="width:66%;" src="static/imgs/nonlinear_to_linear.png" /></div>
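- The same idea can be checked numerically. The sketch below builds a hypothetical 1D dataset (the values are an assumption, chosen to mimic the kind of layout described above) and shows that no threshold on $x_{1}$ separates the classes, while a threshold on $x_{2}=x_{1}^{2}$ does.

```python
import numpy as np

# Hypothetical 1D dataset: class 1 in the middle, class 0 on both sides.
x1 = np.linspace(-4, 4, 9)                    # -4, -3, ..., 4
y = (np.abs(x1) <= 2).astype(int)

# Try every midpoint between neighbours as a threshold on x1, in both directions.
thresholds = (x1[:-1] + x1[1:]) / 2
separable_1d = any(
    np.array_equal(x1 > t, y == 1) or np.array_equal(x1 < t, y == 1)
    for t in thresholds
)

# After adding x2 = x1**2, a horizontal line (a threshold on x2) does the job.
x2 = x1 ** 2
separable_2d = np.array_equal(x2 < 6.5, y == 1)

print("separable with x1 only:", separable_1d)       # False
print("separable with x2 = x1^2:", separable_2d)     # True
```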

- Let's implement this idea using scikit-learn:

In [8]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

In [9]:
X, y = make_moons(n_samples=100, noise=0.15)

In [10]:
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])

In [11]:
polynomial_svm_clf.fit(X, y)



Pipeline(memory=None,
         steps=[('poly_features',
                 PolynomialFeatures(degree=3, include_bias=True,
                                    interaction_only=False, order='C')),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 LinearSVC(C=10, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='hinge', max_iter=1000, multi_class='ovr',
                           penalty='l2', random_state=None, tol=0.0001,
                           verbose=0))],
         verbose=False)

In [12]:
polynomial_svm_clf.score(X, y)

0.99

- The following shows the decision boundary of the model. Because we added polynomial features, the boundary is now nonlinear when projected back onto the original feature space:

<div style="text-align:center;"><img style="width:50%;" src="static/imgs/polynomial_svms.png" /></div>
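- To see what the `PolynomialFeatures` step does to the two input features, the following sketch counts the columns it generates. `X_demo` is a placeholder array (only its number of columns matters), and the degree-10 case is included just to show how quickly the count grows.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.zeros((1, 2))   # placeholder input with 2 features

for degree in (3, 10):
    poly = PolynomialFeatures(degree=degree).fit(X_demo)
    print(f"degree={degree:>2}: {poly.n_output_features_} features")

# degree= 3: 10 features
# degree=10: 66 features
```

- The counts include the bias column, since `include_bias=True` is the default.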

### Polynomial Kernels

- With a low polynomial degree, adding features cannot deal with very complex datasets.
- With a high polynomial degree, we end up adding a huge number of features, resulting in a very complex and slow model.
- Fortunately, when using SVMs you can apply an almost miraculous mathematical technique called the kernel trick.
    - The kernel trick makes it possible to have the same result as if you added many polynomial features without actually adding them.
- Let's test it on the moon dataset:

In [13]:
from sklearn.svm import SVC

In [14]:
poly_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='poly', degree=3, coef0=1, C=5))
])

In [15]:
poly_kernel_svm_clf.fit(X, y)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=5, cache_size=200, class_weight=None, coef0=1,
                     decision_function_shape='ovr', degree=3,
                     gamma='auto_deprecated', kernel='poly', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)

- This code trains an SVM classifier using a third-degree polynomial kernel.
- If your model is overfitting, you might want to decrease the polynomial degree, and if it's underfitting, it might be a good idea to increase it.
    - `coef0` controls how much the model is influenced by high-degree polynomials vs. low-degree polynomials.
- The following showcases the previously trained model (on the left) vs. a more complex model of kernel degree 10:

<div style="text-align:center;"><img style="width:66%;" src="static/imgs/kernel_trick.png" /></div>
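- The kernel trick can be verified by hand in a small case. The sketch below assumes a second-degree polynomial kernel of the form $(a \cdot b + 1)^{2}$ on 2D vectors (that is, `coef0=1` and a scaling factor `gamma` of 1). The functions `phi` and `poly_kernel` are illustrative helpers, not scikit-learn APIs. It compares the dot product in the explicitly transformed space with the kernel computed directly on the original vectors.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2D vector (one possible choice)."""
    x1, x2 = x
    s = np.sqrt(2)
    return np.array([1.0, s * x1, s * x2, x1 ** 2, s * x1 * x2, x2 ** 2])

def poly_kernel(a, b, degree=2, coef0=1.0):
    """Polynomial kernel (a . b + coef0) ** degree, computed in the original space."""
    return (np.dot(a, b) + coef0) ** degree

a = np.array([0.5, -1.0])
b = np.array([2.0, 3.0])

explicit = np.dot(phi(a), phi(b))   # dot product in the 6D transformed space
implicit = poly_kernel(a, b)        # same value, without building the 6D vectors

print(explicit, implicit)
```

- Both numbers agree (up to floating-point rounding), which is why the SVM can behave as if the polynomial features were there without ever computing them.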

### Similarity Features

- Another technique to tackle nonlinear problems is to add features computed using a **similarity function**.
    - A similarity function measures how much each instance resembles a particular landmark.
- For example, let's take the 1D dataset discussed earlier and add two landmarks to it at $x_{1}=-2$ and $x_{1}=1$, as showcased in the left plot of:

<div style="text-align:center;"><img style="width:66%;" src="static/imgs/similarity_measures.png" /></div>

- We defined the similarity function to be the **Gaussian Radial Basis Function (RBF)** with $\gamma = 0.3$:

