An SVM is a very powerful and versatile model, capable of linear or nonlinear classification, regression, and even outlier detection. It is very popular and a must-have in any ML toolbox.

SVMs are particularly well suited to classifying complex but small or medium-sized datasets.

In [2]:
import numpy as np

from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Concepts not sure about:
# Pipeline - ?
# Feature scaling - standard scaler, etc
# Polynomial features

# Polynomial Regression

#### Polynomial Regression:

Linear Regression - y = b0 + b1x1 <br>
Multiple Linear Regression - y = b0 + b1x1 + b2x2 + ... + bnxn <br>
Polynomial Linear Regression - y = b0 + b1x1 + b2x1^2 + b3x1^3 + ... + bnx1^n <br>

Instead of x1 through xn we have x1 raised to the powers 1 through n: only one variable, but with different powers. It is used whenever the data cannot be described by a straight line but can be described by a simple curve. The model is still called linear because linearity refers to the coefficients, not to x: y is a linear combination of b0 ... bn.
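
To make the "still linear" point concrete, here is a minimal sketch that fits a quadratic curve with an ordinary linear model. The synthetic data and the variable names (`x`, `y_quad`) are assumptions made up for illustration; they are not part of the notes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data (made up for illustration): y = 0.5*x^2 + x + 2 plus noise
rng = np.random.default_rng(42)
x = 6 * rng.random((200, 1)) - 3
y_quad = 0.5 * x[:, 0] ** 2 + x[:, 0] + 2 + rng.normal(0, 0.5, size=200)

# Expand the single feature x into the two columns [x, x^2]
poly = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly.fit_transform(x)

# A plain linear model on the expanded features: linear in b0, b1, b2
lin_reg = LinearRegression()
lin_reg.fit(x_poly, y_quad)

print(lin_reg.intercept_, lin_reg.coef_)  # should land near 2 and [1, 0.5]
```

The curve comes entirely from the extra x^2 column; the estimator itself never changes.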

# Linear SVM Classification

Other models might fit the decision boundary right next to the training data and therefore fail to generalize well.

An SVM classifier not only separates the two classes but also stays as far away from the closest training instances as possible. It tries to fit the widest possible street between the classes, which is called large margin classification. We want the street to be free of instances and as wide as possible.

Adding instances 'off the street' or elsewhere won't affect the decision boundary - it's determined (supported) by the instances located on the edge of the street - called the support vectors. 
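
A small sketch of this behaviour, assuming a made-up six-point toy set (`X_toy`, `y_toy`) and using `SVC(kernel='linear')` only because it exposes `support_vectors_`: fitting again after adding an instance far from the street should leave the boundary essentially unchanged.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy set (made up for illustration)
X_toy = np.array([[0, 0], [1, 0], [0, 1], [3, 3], [4, 3], [3, 4]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf_before = SVC(kernel='linear', C=1).fit(X_toy, y_toy)

# Add one more class-1 instance far away from the street
X_more = np.vstack([X_toy, [[10.0, 10.0]]])
y_more = np.append(y_toy, 1)
clf_after = SVC(kernel='linear', C=1).fit(X_more, y_more)

print(clf_before.support_vectors_)            # instances on the edge of the street
print(clf_before.coef_, clf_before.intercept_)
print(clf_after.coef_, clf_after.intercept_)  # expected to match up to solver tolerance
```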

Again, feature scaling is very important - SVMs are very sensitive to scales. 


## Soft Margin Classification

Hard margin classification - we strictly impose that all instances be off the street and on the correct side. 

This only works if the data is linearly separable, and is very sensitive to outliers. Hard margin classification does not generalize well and becomes completely messed up with outliers.

Soft margin classification, on the other hand, is much more flexible, and aims to find a good balance between keeping the street as large as possible, and limiting the margin violations (instances that end up in the middle or on the wrong side). 

The C hyperparameter in Scikit-Learn controls this balance: a smaller C value leads to a wider street but more margin violations, and vice versa. A wider street often seems preferable because it tends to generalize better, as long as most margin violations are still on the correct side of the decision boundary. If the SVM model is overfitting, reducing C might help. 

The LinearSVC class regularizes the bias term, so the training set should be centered first by subtracting its mean; StandardScaler does this for you. The loss hyperparameter should also be set to 'hinge', since it is not the default. Finally, setting dual to False is usually faster unless there are more features than training instances, but Scikit-Learn only supports dual=False with the default 'squared_hinge' loss, so the loss='hinge' setup used here keeps the dual formulation. 
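
The effect of C on the street can be checked with a rough sketch like the one below. It assumes the street width is measured as 2 / ||w|| in the scaled feature space; the helper `street_width` and the two C values are hypothetical choices for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris_demo = datasets.load_iris()
X_iris = iris_demo['data'][:, (2, 3)]            # petal length, petal width
y_iris = (iris_demo['target'] == 2).astype(int)  # 1 if Iris-Virginica

def street_width(C):
    """Fit a scaled linear SVM and return the street width 2 / ||w|| (scaled units)."""
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('linear_svc', LinearSVC(C=C, loss='hinge', max_iter=100_000, random_state=42)),
    ])
    model.fit(X_iris, y_iris)
    w = model.named_steps['linear_svc'].coef_[0]
    return 2 / np.linalg.norm(w)

width_small_C = street_width(0.01)
width_large_C = street_width(100)
print(width_small_C, width_large_C)  # the small-C street should be the wider one
```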

We continue to work on the Iris Dataset. 

In [3]:
iris = datasets.load_iris()
X = iris['data'][:, (2, 3)] # petal length, petal width
y = (iris['target'] == 2).astype(np.float64) # Iris-Virginica

svm_clf = Pipeline([
        ('scaler', StandardScaler()),
        ('linear_svc', LinearSVC(C=1, loss='hinge'))
])

svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('linear_svc', LinearSVC(C=1, loss='hinge'))])

In [4]:
svm_clf.predict([[5.5, 1.7]]) # Returns a class, not a probability, unlike Logistic Regression classifiers

array([1.])

Another approach is to use the SVC class with a linear kernel, SVC(kernel='linear', C=1), but it is much slower, especially on large training sets, so it is not recommended. 

The SGDClassifier class also works: SGDClassifier(loss='hinge', alpha=1/(m*C)), where m is the number of training instances, applies regular Stochastic Gradient Descent to train a linear SVM classifier. It does not converge as fast as LinearSVC, but it can be used for huge datasets or online classification tasks. 
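
As a rough side-by-side of the three options, the sketch below trains each on the same scaled Iris features and reports training accuracy. The variable names and the `random_state` / `max_iter` settings are assumptions added for reproducibility, not values from the notes.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

iris_alt = datasets.load_iris()
X_alt = StandardScaler().fit_transform(iris_alt['data'][:, (2, 3)])
y_alt = (iris_alt['target'] == 2).astype(int)

C = 1
m = len(X_alt)  # number of training instances

models = {
    'LinearSVC': LinearSVC(C=C, loss='hinge', max_iter=10_000, random_state=42),
    'SVC(linear)': SVC(kernel='linear', C=C),
    'SGDClassifier': SGDClassifier(loss='hinge', alpha=1 / (m * C), random_state=42),
}

train_accuracy = {}
for name, model in models.items():
    model.fit(X_alt, y_alt)
    train_accuracy[name] = model.score(X_alt, y_alt)

print(train_accuracy)
```

The three models optimize closely related objectives, so their boundaries should be similar but not identical.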

## Nonlinear SVM Classification

To handle nonlinear datasets, one approach is to add more features like polynomial features (Chapter 4).

This can be implemented by creating a Pipeline containing a PolynomialFeatures transformer (p. 130 in the book), followed by a StandardScaler and a LinearSVC. 

We are using the moons dataset - a toy dataset for binary classification where the data points are shaped as two interleaving half circles. 

In [5]:
from sklearn.datasets import make_moons
from sklearn.preprocessing import PolynomialFeatures

# Generate the moons dataset described above (sample size and noise level are illustrative choices)
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

polynomial_svm_clf = Pipeline([
    ('poly_features', PolynomialFeatures(degree=3)),
    ('scaler', StandardScaler()),
    ('svm_clf', LinearSVC(C=10, loss='hinge'))
])

polynomial_svm_clf.fit(X, y)

Pipeline(steps=[('poly_features', PolynomialFeatures(degree=3)),
                ('scaler', StandardScaler()),
                ('svm_clf', LinearSVC(C=10, loss='hinge'))])

Adding polynomial features is simple and works well with all sorts of algorithms. However, a low polynomial degree cannot deal with very complex datasets, while a high degree generates a huge number of features (they are created combinatorially), making the model too slow. 
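
The combinatorial growth is easy to see by counting the output columns. In this sketch the helper `n_poly_features` is a hypothetical name; the closed form C(n + d, d) counts all monomials of total degree at most d, bias column included.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

def n_poly_features(n_features, degree):
    """Count the columns PolynomialFeatures produces (bias column included)."""
    dummy = np.zeros((1, n_features))
    return PolynomialFeatures(degree=degree).fit(dummy).n_output_features_

for n, d in [(2, 3), (10, 3), (10, 5)]:
    # Closed form: C(n + d, d) monomials of total degree <= d
    print(n, d, n_poly_features(n, d), comb(n + d, d))
```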

Luckily, a mathematical technique called the kernel trick makes it possible to get the same result as if many polynomial features had been added, without actually adding them, so there is no combinatorial explosion in the number of features. 

Decrease the polynomial degree when the model is overfitting; conversely, increase it when underfitting. (Increasing the degree lets the decision boundary fit the training data more closely (low bias), but runs the risk of overfitting (high variance).)

A grid search is common to find the right hyperparameter values. 

In [7]:
from sklearn.svm import SVC

poly_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='poly', degree=3, coef0=1, C=5))
])

poly_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=5, coef0=1, kernel='poly'))])
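
A grid search over the polynomial kernel's hyperparameters might look like the sketch below. The grid values, the 5-fold setting, and the dataset size are arbitrary assumptions for illustration; coef0 controls how much the model is influenced by high-degree versus low-degree terms.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_gs, y_gs = make_moons(n_samples=200, noise=0.15, random_state=42)

pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='poly')),
])

# Step name + double underscore addresses a hyperparameter inside the pipeline
param_grid = {
    'svm_clf__degree': [2, 3, 4],
    'svm_clf__coef0': [0, 1],
    'svm_clf__C': [0.1, 1, 10],
}

grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_gs, y_gs)

print(grid_search.best_params_)
print(grid_search.best_score_)
```

A coarse grid first, followed by a finer one around the best values, is usually cheaper than one large grid.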

### Similarity Features

For nonlinear problems, you can also add features computed using a similarity function, measuring how much each instance resembles a particular landmark. 

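To select the landmarks, the simplest approach is to create one at the location of every instance in the dataset. This creates many dimensions and so increases the chance that the transformed training set will be linearly separable. The downside is that a training set with m instances ends up with m similarity features, which can be a very large number. 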
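
Here is a minimal sketch of computing similarity features by hand. It assumes the similarity function is the Gaussian RBF, exp(-gamma * ||x - landmark||^2); the function name `gaussian_rbf_features`, the 1D data, the two landmarks and gamma=0.3 are all made up for illustration.

```python
import numpy as np

def gaussian_rbf_features(X, landmarks, gamma):
    """Return one similarity feature per landmark: exp(-gamma * ||x - l||^2)."""
    # Pairwise squared distances, shape (n_instances, n_landmarks)
    sq_dists = ((X[:, np.newaxis, :] - landmarks[np.newaxis, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

X_1d = np.array([[-4.0], [-1.0], [0.0], [3.0]])
landmarks = np.array([[-2.0], [1.0]])

features = gaussian_rbf_features(X_1d, landmarks, gamma=0.3)
print(features.round(2))

# Using every training instance as a landmark turns m instances into m features
all_landmarks = gaussian_rbf_features(X_1d, X_1d, gamma=0.3)
print(all_landmarks.shape)
```

Each feature is 1 at its landmark and decays toward 0 with distance, which is what makes the transformed data easier to separate.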

### Gaussian RBF Kernel

The similarity features method can be useful with any ML algorithm, but computing all the additional features can be expensive, especially on large training sets. The kernel trick, again, lets you obtain a similar result without having to actually add the many similarity features. 

In the example below, increasing gamma makes the bell-shaped curve narrower, so each instance's range of influence is smaller: the decision boundary becomes more irregular and hugs individual instances. A small gamma value widens the bell-shaped curve, giving instances a larger range of influence and a smoother decision boundary. 

γ (gamma) acts like a regularization hyperparameter: reduce it if the model is overfitting, increase it if underfitting (similar to C).

##### How to choose which kernel to use?
Two kernels have been shown - others do exist but are used less widely. <br>
A rule of thumb - start with the linear kernel (LinearSVC is much faster than SVC(kernel='linear')), especially if the training set is very large. If the training set is not too large, also try the Gaussian RBF kernel, which tends to work well. With spare time and computing power, you can experiment with other kernels using cross-validation and grid search, especially if there are kernels specialized for the training set's data structure. 

In [9]:
rbf_kernel_svm_clf = Pipeline([
    ('scaler', StandardScaler()),
    ('svm_clf', SVC(kernel='rbf', gamma=5, C=0.001))
])

rbf_kernel_svm_clf.fit(X, y)

Pipeline(steps=[('scaler', StandardScaler()),
                ('svm_clf', SVC(C=0.001, gamma=5))])
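
The effect of gamma described above can be probed with a sketch like this one. The three gamma values, C=1, and the moons settings are assumptions chosen for illustration; training accuracy alone says nothing about generalization, so a rising score here is only a hint that the boundary is hugging the training data more tightly.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_g, y_g = make_moons(n_samples=200, noise=0.15, random_state=42)

train_acc = {}
n_support = {}
for gamma in (0.01, 1, 100):
    model = Pipeline([
        ('scaler', StandardScaler()),
        ('svm_clf', SVC(kernel='rbf', gamma=gamma, C=1)),
    ])
    model.fit(X_g, y_g)
    train_acc[gamma] = model.score(X_g, y_g)
    # Total number of support vectors across both classes
    n_support[gamma] = int(model.named_steps['svm_clf'].n_support_.sum())

print(train_acc)
print(n_support)
```

Comparing these scores against cross-validated scores would show where the higher gamma starts to overfit.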

### Computational Complexity

LinearSVC doesn't support the kernel trick, but its training time scales almost linearly with the number of training instances and the number of features. The tolerance hyperparameter (tol) controls the precision. 

The SVC class supports the kernel trick, but it becomes dreadfully slow on large training sets, so it is best suited to complex but small or medium-sized training sets. 

# SVM Regression

SVM supports classification. It also supports regression. VERY VERSATILE. 

Instead of trying to fit the largest possible street between two classes while limiting margin violations, SVM regression tries to fit as many instances as possible on the street while limiting margin violations (here, instances off the street). The epsilon hyperparameter (ε) controls the width of the street. 

In [10]:
# Linear regression
from sklearn.svm import LinearSVR

svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

# Nonlinear regression
from sklearn.svm import SVR # SVC equivalent in regression

svm_poly_reg = SVR(kernel='poly', degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

# LinearSVR and SVR are the regression counterparts of LinearSVC and SVC.

SVR(C=100, degree=2, kernel='poly')
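
To see how epsilon sets the width of the street, the sketch below fits LinearSVR on made-up noisy linear data and measures the share of training instances that end up within epsilon of the prediction. The data, the helper `fraction_inside_street`, and the two epsilon values are assumptions for illustration.

```python
import numpy as np
from sklearn.svm import LinearSVR

# Synthetic data (made up for illustration): y = 4 + 3x plus Gaussian noise
rng = np.random.default_rng(42)
X_reg = 2 * rng.random((100, 1))
y_reg = 4 + 3 * X_reg[:, 0] + rng.normal(0, 1, size=100)

def fraction_inside_street(epsilon):
    """Fit a LinearSVR and return the share of instances within epsilon of the prediction."""
    reg = LinearSVR(epsilon=epsilon, max_iter=10_000, random_state=42)
    reg.fit(X_reg, y_reg)
    residuals = np.abs(y_reg - reg.predict(X_reg))
    return np.mean(residuals <= epsilon)

inside_narrow = fraction_inside_street(0.5)
inside_wide = fraction_inside_street(1.5)
print(inside_narrow, inside_wide)  # the wider street should contain more instances
```

Instances inside the street contribute nothing to the loss, which is why adding more of them there does not change the model's predictions.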

# Under the Hood

Will come back later. 