# Support Vector Machines

"SVMs are particularly well suited for classification of complex but small- or medium-sized datasets."

The idea behind SVMs can be thought of as "fitting the widest possible street... between the classes."

If you train a linear classification model on two classes, it may separate the training set accurately yet pass so close to some instances that it misclassifies new instances it hasn't been trained on. Fitting the widest possible street keeps the decision boundary as far as possible from the closest training instances, which tends to generalize better.

This process is called **large margin classification**.

Now if we add more training instances that are clearly off the street, on the side of one class or the other, it will not change the slope or location of the road at all. This is because the decision boundary is "fully determined (or 'supported') by the instances located on the edge of the street." These instances are called **support vectors**.
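To make the role of the support vectors concrete, here is a minimal sketch on made-up toy data (the points and the very large `C`, used to approximate a hard margin, are my own assumptions, not from the book). It fits a linear `SVC`, then refits after adding instances far from the street and compares the two boundaries.

```python
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters (hypothetical toy data).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no room for margin violations.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Add instances that sit well away from the street, deep inside each class.
X_more = np.vstack([X, [[0.0, 0.0], [-1.0, 0.5], [6.0, 6.0], [7.0, 5.5]]])
y_more = np.concatenate([y, [0, 0, 1, 1]])
clf_more = SVC(kernel="linear", C=1e6).fit(X_more, y_more)

print("support vectors:\n", clf.support_vectors_)
print("before:", clf.coef_, clf.intercept_)
print("after: ", clf_more.coef_, clf_more.intercept_)
```

The weights and intercept should agree up to solver tolerance, since none of the added instances lies on the edge of the street.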

#### Soft Margin Classification
Strictly requiring every instance to be off the street and on the correct side is called **hard margin classification**. This approach has two problems: it only works if the data is linearly separable, and it is very sensitive to outliers.

A better solution is **soft margin classification**. Here, "the objective is to find a good balance between keeping the street as large as possible and limiting the **margin violations** (i.e., instances that end up in the middle of the street or even on the wrong side)."

We can control this balance with the `C` hyperparameter of the SVM classes in Scikit-learn. A small `C` value gives a wider street with more margin violations; a large `C` value gives a narrower street with fewer violations.

If your SVM model is overfitting, you can regularize it by reducing C.
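To see the effect of `C` on the street, here is a minimal sketch that fits the same kind of pipeline with three values of `C` and reports the street width and the number of margin violations. The helper name `margin_violations`, the chosen `C` values, and the raised `max_iter` are my own assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]              # petal length, petal width
y = (iris["target"] == 2).astype(int)    # 1 if Iris virginica, else 0

def margin_violations(C):
    """Fit a linear SVM; return (number of margin violations, street width)."""
    model = make_pipeline(
        StandardScaler(),
        LinearSVC(C=C, loss="hinge", max_iter=100_000, random_state=42),
    )
    model.fit(X, y)
    scores = model.decision_function(X)
    # Signed score below 1 means the instance is inside the street
    # or on the wrong side of it.
    signed = np.where(y == 1, 1, -1) * scores
    width = 2 / np.linalg.norm(model[-1].coef_)   # width in scaled feature space
    return int(np.sum(signed < 1)), width

results = {C: margin_violations(C) for C in (0.01, 1, 100)}
for C, (n_viol, width) in results.items():
    print(f"C={C}: {n_viol} margin violations, street width {width:.2f}")
```

The smallest `C` should produce the widest street and the most violations.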

In [2]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
x = iris['data'][:,(2,3)] # just petal length and petal width
y = (iris['target'] == 2).astype(np.float64) # Iris-Virginica

svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('linear_svc', LinearSVC(C=1, loss='hinge'))
])

svm_clf.fit(x, y)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('linear_svc',
                 LinearSVC(C=1, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='hinge', max_iter=1000, multi_class='ovr',
                           penalty='l2', random_state=None, tol=0.0001,
                           verbose=0))],
         verbose=False)

In [6]:
svm_clf.predict([[5.5,1.7]])

array([1.])

Unfortunately, unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.
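What a linear SVM does give you is a signed score per instance. The sketch below rebuilds the pipeline in a self-contained form (the second sample and the `max_iter` value are my own choices) and shows that `predict` is just the sign of `decision_function`.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]              # petal length, petal width
y = (iris["target"] == 2).astype(int)    # 1 if Iris virginica, else 0

clf = make_pipeline(
    StandardScaler(),
    LinearSVC(C=1, loss="hinge", max_iter=100_000, random_state=42),
)
clf.fit(X, y)

samples = np.array([[5.5, 1.7], [1.4, 0.2]])
scores = clf.decision_function(samples)   # signed scores, not probabilities
labels = clf.predict(samples)             # 1 where the score is positive
print(scores, labels)
```

If you do need probability estimates, the `SVC` class accepts `probability=True`, which calibrates the scores with extra cross-validation at training time; treat those estimates as approximate.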

#### Nonlinear SVM Classification

Most datasets you deal with will not be linearly separable. How do we deal with this?

Well, just as we discussed in chapter 4, we can use the `PolynomialFeatures` class of Scikit-learn to add features that may make a dataset linearly separable. If you have a bunch of instances with only one feature x1, it's likely they are not linearly separable along the x1 axis. However, if you define a second feature x2 = x1^2, then in the (x1, x2) plane they may be linearly separable.
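A minimal sketch of that idea, on made-up one-dimensional data where the inner points belong to one class and the outer points to the other (the data, labels, and `C` value are assumptions for illustration):

```python
import numpy as np
from sklearn.svm import SVC

x1 = np.arange(-4, 5, dtype=float)       # -4, -3, ..., 4
y = (np.abs(x1) <= 2).astype(int)        # inner points are class 1

X_1d = x1.reshape(-1, 1)                 # original single feature
X_2d = np.c_[x1, x1 ** 2]                # add the second feature x2 = x1^2

# Training accuracy of a linear SVM in each feature space.
acc_1d = SVC(kernel="linear", C=100).fit(X_1d, y).score(X_1d, y)
acc_2d = SVC(kernel="linear", C=100).fit(X_2d, y).score(X_2d, y)
print(f"1 feature: {acc_1d:.2f}, with x1^2 added: {acc_2d:.2f}")
```

No single threshold on x1 can separate the classes, but a horizontal line in the (x1, x2) plane can.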

In [3]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

polynomial_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge'))
])
x, y = make_moons()
polynomial_svm_clf.fit(x, y)



Pipeline(memory=None,
         steps=[('poly_features',
                 PolynomialFeatures(degree=3, include_bias=True,
                                    interaction_only=False, order='C')),
                ('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 LinearSVC(C=10, class_weight=None, dual=True,
                           fit_intercept=True, intercept_scaling=1,
                           loss='hinge', max_iter=1000, multi_class='ovr',
                           penalty='l2', random_state=None, tol=0.0001,
                           verbose=0))],
         verbose=False)

The issue with polynomial features is that a low polynomial degree cannot deal with very complex datasets (the model underfits), while a high degree creates a huge number of features and makes the model very slow. Luckily, with SVMs, we can simply use a polynomial kernel.
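To get a feel for how quickly the feature count grows, this sketch asks `PolynomialFeatures` how many output features it would create for a dummy instance with 10 features (the dummy data and the chosen degrees are my own). With the bias column included, the count is (n + d)! / (d! n!) for n features and degree d.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.zeros((1, 10))   # one dummy instance with 10 features

counts = {}
for degree in (2, 3, 5, 10):
    poly = PolynomialFeatures(degree=degree).fit(X)
    counts[degree] = poly.n_output_features_
    print(f"degree={degree}: {counts[degree]} features")
# degree 2 -> 66, degree 3 -> 286, degree 5 -> 3003, degree 10 -> 184756
```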

#### Polynomial Kernels

The 'kernel trick' is a "miraculous mathematical technique... that makes it possible to get the same result as if you added many polynomial features, even with very high degree polynomials, without actually having to add them. So there is no combinatorial explosion of the number of features since you don't actually add any features." Let's see this in action in Scikit-learn.

In [4]:
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='poly', degree=3, coef0=1, C=5))
])

poly_kernel_svm_clf.fit(x,y)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=5, cache_size=200, class_weight=None, coef0=1,
                     decision_function_shape='ovr', degree=3,
                     gamma='auto_deprecated', kernel='poly', max_iter=-1,
                     probability=False, random_state=None, shrinking=True,
                     tol=0.001, verbose=False))],
         verbose=False)
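The kernel trick can be checked numerically in the simplest case. This sketch uses the homogeneous second-degree kernel (a · b)^2 on 2D vectors, which is simpler than the degree-3 kernel with `coef0=1` fitted above; the mapping `phi` and the two vectors are chosen for illustration.

```python
import numpy as np

def phi(x):
    """Explicit second-degree polynomial mapping of a 2D vector into 3D."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)   # dot product in the transformed 3D space
kernel = (a @ b) ** 2        # same quantity computed in the original 2D space
print(explicit, kernel)
```

Both values are equal, so an algorithm that only needs dot products between instances never has to compute `phi` at all.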

#### Adding Similarity Features
"Another technique to tackle nonlinear problems is to add features computed using a **similarity function** that measures how much each instance resembles a particular **landmark**." We can define a few instances to be a landmark, then choose a similarity function like the *Gaussian Radial Basis Function* to compare every instance to it.

Then, for each landmark, we add a new feature to every instance in the training set: the value of the similarity function evaluated on that landmark and that instance. This is easier to follow with pictures, which I can't really add here, so check the book if you need.

Anyways, the transformed dataset is more likely to be linearly separable. It may be difficult to choose the landmarks, and the simplest solution is to use every instance as a landmark. "The downside is that a training set with m instances and n features gets transformed into a training set with m instances and m features (assuming you drop the original features). If your training set is very large, you end up with an equally large number of features."
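As a rough sketch of similarity features, assume the Gaussian RBF takes the form exp(-gamma * ||x - landmark||^2). The one-dimensional data, the two landmarks at -2 and 1, and gamma = 0.3 are illustrative values, and `gaussian_rbf` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_rbf(X, landmark, gamma):
    """Similarity of each row of X to one landmark, in the range (0, 1]."""
    return np.exp(-gamma * np.sum((X - landmark) ** 2, axis=1))

X = np.arange(-4, 5, dtype=float).reshape(-1, 1)   # 9 instances, 1 feature
landmarks = np.array([[-2.0], [1.0]])
gamma = 0.3

# One new feature per landmark: shape (9, 2).
X_sim = np.column_stack([gaussian_rbf(X, lm, gamma) for lm in landmarks])
print(X_sim[3])   # features for the instance x1 = -1

# Using every instance as a landmark gives m instances and m features.
X_all = rbf_kernel(X, X, gamma=gamma)
print(X_all.shape)
```

The instance x1 = -1 is at distance 1 from the first landmark and 2 from the second, so its new features are exp(-0.3) and exp(-1.2), roughly 0.74 and 0.30.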

#### Gaussian RBF Kernel

Just as the polynomial kernel solved the computational issues of the `PolynomialFeatures` class, we can use a kernel in place of explicit similarity features. "It makes it possible to obtain a similar result as if you had added many similarity features, without actually having to add them. Let's try the Gaussian RBF kernel using the SVC class:"

In [5]:
rbf_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001))
])

rbf_kernel_svm_clf.fit(x,y)

Pipeline(memory=None,
         steps=[('scaler',
                 StandardScaler(copy=True, with_mean=True, with_std=True)),
                ('svm_clf',
                 SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
                     decision_function_shape='ovr', degree=3, gamma=5,
                     kernel='rbf', max_iter=-1, probability=False,
                     random_state=None, shrinking=True, tol=0.001,
                     verbose=False))],
         verbose=False)

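Gamma works as a regularization hyperparameter, much like C. Increasing it narrows each instance's range of influence, so the decision boundary becomes more irregular and wiggles around individual instances. Decreasing it widens that range and gives a smoother boundary. If your model is overfitting, try decreasing gamma; if it is underfitting, try increasing it.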
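A minimal sketch of this effect on a noisy moons dataset, comparing training and test accuracy for three gamma values. The sample size, noise level, split, `C=10`, and the gamma values are all my own assumptions.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

results = {}
for gamma in (0.1, 5, 100):
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma=gamma, C=10))
    clf.fit(X_train, y_train)
    # Store (training accuracy, test accuracy) for this gamma.
    results[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"gamma={gamma}: train={results[gamma][0]:.3f}, test={results[gamma][1]:.3f}")
```

A large gap between training and test accuracy at high gamma is the usual sign that the boundary is wrapping around individual instances.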

#### Computational Complexity

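The `LinearSVC` class is based on the liblinear library, which implements an algorithm optimized for linear SVMs. It scales almost linearly with the number of training instances and features, but it does not support the kernel trick. For kernels, you need the `SVC` class, which is based on libsvm. Its training time usually grows between quadratically and cubically with the number of instances, so it is a good fit for complex small or medium-sized datasets but gets very slow on large ones (e.g., hundreds of thousands of instances).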

### SVM Regression

Believe it or not, the SVM algorithm can also handle regression problems. "The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM Regression tries to fit as many instances as possible *on* the street... The width of the street is controlled by a hyperparameter epsilon."

"You can use Scikit-Learn's `LinearSVR` class to perform linear SVM Regression."

In [7]:
from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(x,y)

LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
          intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
          random_state=None, tol=0.0001, verbose=0)
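To see what epsilon does on a genuine regression target, here is a minimal sketch on synthetic noisy linear data (not the `x, y` used in the cell above). The data generator, the helper `fraction_inside`, and the two epsilon values are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVR

rng = np.random.default_rng(42)
X = 2 * rng.random((200, 1))
y = 4 + 3 * X[:, 0] + rng.normal(size=200)   # linear trend plus unit noise

def fraction_inside(epsilon):
    """Fit LinearSVR and return the share of instances on the street."""
    reg = LinearSVR(epsilon=epsilon, max_iter=100_000, random_state=42).fit(X, y)
    residuals = np.abs(y - reg.predict(X))
    return float(np.mean(residuals <= epsilon))

inside_narrow = fraction_inside(0.5)
inside_wide = fraction_inside(1.5)
print(f"epsilon=0.5: {inside_narrow:.0%} on the street")
print(f"epsilon=1.5: {inside_wide:.0%} on the street")
```

Instances inside the street contribute no loss, so a wider street tolerates more noise before the fit reacts to it.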

We can also use the kernel trick on SVM regression with the `SVR` class.

In [9]:
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel='poly', degree=3, C=100, epsilon=0.1)
svm_poly_reg.fit(x,y)



SVR(C=100, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
    gamma='auto_deprecated', kernel='poly', max_iter=-1, shrinking=True,
    tol=0.001, verbose=False)
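A kernelized SVR is easier to judge on data that really is nonlinear. This sketch fits a second-degree polynomial kernel to synthetic quadratic data; the data generator, `degree=2`, and `coef0=1` are my own choices rather than the settings in the cell above.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)
X = 6 * rng.random((200, 1)) - 3             # 200 points in [-3, 3)
y = (0.5 * X[:, 0] ** 2 + 0.1 * X[:, 0] + 0.2
     + rng.normal(scale=0.1, size=200))      # quadratic trend plus noise

reg = SVR(kernel="poly", degree=2, coef0=1, C=100, epsilon=0.1).fit(X, y)

r2 = reg.score(X, y)                          # R^2 on the training data
pred = reg.predict([[-3.0], [0.0], [3.0]])    # both ends and the middle
print(f"R^2 = {r2:.3f}, predictions = {pred}")
```

The predictions at both ends should be well above the one at zero, matching the parabola that generated the data.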