# Chapter 5
## Linear SVM Classification
Using the Iris dataset, you can easily separate the data into two classes with ordinary linear classifiers. However, their decision boundaries tend to pass very close to the training instances, so they may fit new data poorly. On the other hand, the decision boundary of an SVM classifier stays as far away from the closest training instances as possible. This is called _large margin classification_.

SVMs are sensitive to feature scales, so scale the features (for example with `StandardScaler`) before training!
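
As a rough illustration of what scaling does, the sketch below standardizes the two Iris petal features and prints their mean and standard deviation before and after. The variable names (`X_petal`, `X_scaled`) are illustrative and not part of the original notes.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_petal = iris["data"][:, [2, 3]]  # petal length, petal width (cm)

# The raw features have different means and spreads.
print("raw mean:", X_petal.mean(axis=0), "raw std:", X_petal.std(axis=0))

# StandardScaler subtracts each feature's mean and divides by its standard deviation.
X_scaled = StandardScaler().fit_transform(X_petal)
print("scaled mean:", X_scaled.mean(axis=0), "scaled std:", X_scaled.std(axis=0))
```

After scaling, both features have zero mean and unit variance, so neither one dominates the width of the street.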

### Soft Margin Classification
If we strictly require that all instances be off the street and on the right side, this is called _hard margin classification_. It has two main problems: it only works if the data is linearly separable, and it is quite sensitive to outliers.

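To avoid these issues, we find a balance between keeping the street as large as possible and limiting _margin violations_ (instances that end up in the middle of the street or even on the wrong side). This is called _soft margin classification_.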

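You control this balance using the `C` hyperparameter: a smaller `C` gives a wider street but more margin violations, while a larger `C` gives fewer violations but a narrower street. If your SVM model is overfitting, you can try regularizing it by reducing `C`.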
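
To get a feel for the effect of `C`, here is a minimal sketch that counts support vectors (instances on or inside the street) for a small and a large `C`. It swaps in `SVC(kernel="linear")` instead of `LinearSVC` because `SVC` exposes the `n_support_` attribute; the two `C` values are arbitrary picks for illustration.

```python
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X_c = iris["data"][:, [2, 3]]             # petal length, petal width
y_c = (iris["target"] == 2).astype(int)   # Iris-Virginica vs. the rest

n_support = {}
for C in (0.01, 100):
    clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svc", SVC(kernel="linear", C=C)),
    ])
    clf.fit(X_c, y_c)
    # n_support_ holds the number of support vectors for each class.
    n_support[C] = int(clf.named_steps["svc"].n_support_.sum())
    print(f"C={C}: {n_support[C]} support vectors")
```

A smaller `C` should leave more instances on or inside the street, which shows up as a larger support vector count.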

Here's a linear SVM using `LinearSVC` (C=1 and the _hinge loss_ function) on Iris, trained to detect Iris-Virginica flowers.

In [1]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(np.float64)  # Iris-Virginica

svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge"))
])

svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])

Then make a prediction:

In [2]:
svm_clf.predict([[5.5, 1.7]])

array([1.])

Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.
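
What a linear SVM does expose is a signed score per instance through `decision_function`. The sketch below retrains the same kind of pipeline (with `max_iter` and `random_state` added as assumptions to keep the run stable) and checks that the predicted class is simply the sign of that score. The second test point is a made-up small flower.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X_iris = iris["data"][:, [2, 3]]             # petal length, petal width
y_iris = (iris["target"] == 2).astype(int)   # Iris-Virginica vs. the rest

clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge", max_iter=10000, random_state=42)),
])
clf.fit(X_iris, y_iris)

X_new = np.array([[5.5, 1.7], [1.5, 0.3]])
scores = clf.decision_function(X_new)  # signed scores, not probabilities
labels = clf.predict(X_new)            # class 1 where the score is positive
print(scores, labels)
```

The scores are unbounded, so they should not be read as probabilities.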

## Nonlinear SVM Classification
Even though linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close to being linearly separable. One approach is to add polynomial features (as we have done before).

To implement this using Scikit-Learn, do the following.

In [3]:
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

# Generate the moons dataset (sample count, noise, and seed are assumed values)
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge"))
])

polynomial_svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])

### Polynomial Kernel
Adding polynomial features is relatively simple to implement, but a low polynomial degree cannot deal with very complex datasets, while a high degree creates a huge number of features, making the model too slow.
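
To see how quickly the feature count grows, this small sketch counts the columns produced by `PolynomialFeatures` for two degrees and compares them with the closed-form count C(n + d, d) (bias column included). The random 2-feature input `X_demo` is just placeholder data.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_demo = np.random.RandomState(0).rand(5, 2)  # 5 instances, 2 features
n_features = X_demo.shape[1]

counts = {}
for degree in (3, 10):
    counts[degree] = PolynomialFeatures(degree=degree).fit_transform(X_demo).shape[1]
    # With the bias column included, the count is C(n + d, d).
    print(degree, counts[degree], comb(n_features + degree, degree))
```

With only 2 input features the growth is still manageable, but the same formula gives 286 columns for 10 features at degree 3.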

However, with SVMs, you can use a technique called the _kernel trick_. It gives the same result as if you had added many polynomial features, even of a very high degree, without actually having to add them! We'll try this out on the moons dataset.

In [4]:
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=5, cache_size=200, class_weight=None, coef0=1,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

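This trains an SVM classifier using a 3rd-degree polynomial kernel! The `coef0` hyperparameter controls how much the model is influenced by high-degree versus low-degree terms.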
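
The idea behind the kernel trick can be checked by hand in a tiny case. The sketch below uses a 2nd-degree polynomial kernel on 2D vectors (not the 3rd-degree kernel trained above) and an explicit feature map `phi`, a hypothetical helper written for this illustration. It shows that the kernel value equals the dot product in the expanded feature space.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel

def phi(x):
    """Explicit 2nd-degree feature map for a 2D vector (illustrative helper)."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

a = np.array([2.0, 3.0])
b = np.array([1.0, -1.0])

# Dot product computed in the expanded 3D feature space.
explicit = phi(a) @ phi(b)

# Same value computed directly from the original 2D vectors: (a . b) ** 2
kernel = polynomial_kernel(a.reshape(1, -1), b.reshape(1, -1),
                           degree=2, gamma=1, coef0=0)[0, 0]
print(explicit, kernel)
```

The kernel never builds `phi(a)` or `phi(b)`, which is why high degrees stay cheap.
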
### Similarity Features
Another way to tackle nonlinear problems is to add features computed using a _similarity function_ that measures how much each instance resembles a particular _landmark_. One such similarity function is the Gaussian _Radial Basis Function_ (RBF).

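How do you choose landmarks? The easiest way is to create a landmark at the location of every instance in the dataset. The downside is that a training set with m instances and n features is transformed into one with m instances and m features, which doesn't scale well to large training sets.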
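
A minimal sketch of similarity features, assuming a 1D toy dataset, two hand-picked landmarks, and `gamma = 0.3` (all illustrative values). The helper `gaussian_rbf` is hypothetical and computes exp(-gamma * ||x - landmark||^2).

```python
import numpy as np

def gaussian_rbf(x, landmark, gamma):
    """Similarity exp(-gamma * ||x - landmark||^2), a value between 0 and 1."""
    return np.exp(-gamma * np.sum((x - landmark) ** 2, axis=-1))

X1D = np.linspace(-4, 4, 9).reshape(-1, 1)   # 9 instances, 1 feature
landmarks = np.array([[-2.0], [1.0]])        # illustrative landmarks
gamma = 0.3

# One new feature per landmark.
X_sim = np.column_stack([gaussian_rbf(X1D, l, gamma) for l in landmarks])
print(X_sim.shape)

# Using every instance as a landmark yields an m x m feature matrix.
X_all = np.column_stack([gaussian_rbf(X1D, l, gamma) for l in X1D])
print(X_all.shape)
```

With 9 instances the second matrix is only 9 x 9, but it grows quadratically with the size of the training set.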

### Gaussian RBF Kernel
Just like with the polynomial kernel, the kernel trick makes it possible to obtain a similar result, as if you had added many similarity features, without actually having to add them! Here's an example.

In [5]:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(X, y)

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))])

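Increasing `gamma` makes the bell-shaped curve narrower, so each instance's range of influence is smaller and the decision boundary becomes more irregular. Reducing `gamma` does the opposite, so it acts like a regularization hyperparameter: if the model is overfitting, reduce it. Look at the docs for more info.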
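
The narrowing can be seen numerically. This sketch evaluates the RBF kernel between two fixed points at distance 1 for a few `gamma` values (chosen arbitrarily for illustration).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

origin = np.array([[0.0, 0.0]])
neighbor = np.array([[1.0, 0.0]])  # a point at distance 1 from the origin

similarities = {}
for gamma in (0.1, 1, 5):
    # rbf_kernel computes exp(-gamma * ||a - b||^2)
    similarities[gamma] = rbf_kernel(origin, neighbor, gamma=gamma)[0, 0]
    print(f"gamma={gamma}: similarity={similarities[gamma]:.4f}")
```

The similarity at the same distance drops as `gamma` grows, which is the narrower bell curve in action.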

Other kernels exist but are used much less often. For example, _string kernels_ are used for text documents or DNA sequences.

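How do I decide what to use? Try the linear kernel first (using `LinearSVC`), especially if the training set is very large. If the training set is not too large, try the Gaussian RBF kernel next, as it does pretty well in most cases. Otherwise, look for a specialized kernel.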
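
One way to act on that advice is to compare candidates with cross-validation. The sketch below does this on a moons dataset; the sample count, noise level, seed, and hyperparameters are all assumptions made for the illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Assumed toy data; sample count, noise, and seed are illustrative.
X_m, y_m = make_moons(n_samples=200, noise=0.15, random_state=42)

candidates = {
    "linear": LinearSVC(C=1, max_iter=10000, random_state=42),
    "rbf": SVC(kernel="rbf", gamma="scale", C=1),
}

cv_scores = {}
for name, model in candidates.items():
    pipe = Pipeline([("scaler", StandardScaler()), ("svm", model)])
    # Mean accuracy over 5 cross-validation folds.
    cv_scores[name] = cross_val_score(pipe, X_m, y_m, cv=5).mean()
    print(f"{name}: mean CV accuracy = {cv_scores[name]:.3f}")
```

Whichever model scores better on held-out folds is the more reasonable starting point for further tuning.
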
### Computational Complexity
Please refer to the book for computational complexity.
## SVM Regression
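SVM Regression reverses the objective of the classification version. Instead of fitting the largest possible street between two classes, it tries to fit as many instances as possible on the street while limiting margin violations (instances off the street). The width of the street is controlled by the `epsilon` hyperparameter.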

Adding more training instances within the margin does not affect the model's predictions. Hence the model is said to be _ε-insensitive_.
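
The ε-insensitive idea can be written down as a loss function: errors smaller than ε cost nothing, and larger errors cost only the part that exceeds ε. The helper below is a hypothetical NumPy sketch of that loss, not scikit-learn's internal implementation, and the sample values are made up.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """max(0, |y - y_hat| - epsilon): zero inside the street, linear outside."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([3.0, 3.0, 3.0])
y_pred = np.array([3.5, 4.5, 6.0])   # absolute errors of 0.5, 1.5, and 3.0

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=1.5)
print(losses)
```

Only the last prediction falls outside the street, so it is the only one that contributes to the loss.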

In [7]:
from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=None, tol=0.0001, verbose=0)

Or, you can use the kernelized version, `SVR`:

In [8]:
from sklearn.svm import SVR

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

SVR(C=100, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

Refer to the book for an "under the hood" look at support vector machines.