# Support Vector Machines (SVM)

A classification algorithm that seeks the decision boundary with the widest possible margin between the classes.

<img src='SVMOverview.png'>

$$ Margin = \frac{2}{|W|}$$
$$ Error = |W|^{2} $$
Example:

$$ W = (3,4), b=1;$$
$$w_{1}x_{1}+w_{2}x_{2}+b=0 \rightarrow 3x_{1}+4x_{2}+1=0 $$

$$ Error = |W|^{2} = 3^{2}+4^{2} = 25 $$

$$Margin = \frac{2}{|W|} = \frac{2}{\sqrt{25}} = \frac{2}{5} = 0.4 $$

<img src='SVMMarginErrorExample.png'>
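
The arithmetic in this example can be checked numerically. The snippet below is a minimal sketch using NumPy; the variable names are illustrative and not part of the original lesson.

```python
import numpy as np

# Weights and bias from the example: 3*x1 + 4*x2 + 1 = 0
W = np.array([3.0, 4.0])
b = 1.0

# Error term: squared norm of W (the bias is not included)
error = np.dot(W, W)

# Margin: distance between the lines W.x + b = 1 and W.x + b = -1
margin = 2 / np.linalg.norm(W)

print(error)   # 25.0
print(margin)  # 0.4
```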

This is the same error term used in L2 regularization: the sum of the squared coefficients.

Recall:
> ## L1 Regularization
>Add the absolute value of the coefficients:
$$
2x_{1}^{3}-2x_{1}^{2}x_{2} - 4x_{2}^{3} + 3x_{1}^{2} + 6x_{1}x_{2} + 4x_{2}^{2} + 5 = 0
$$

$$
Error = |2|+|-2|+|-4|+|3|+|6|+|4|= 21
$$

> ## L2 Regularization
> Add the squares of the coefficients:
$$
2x_{1}^{3}-2x_{1}^{2}x_{2} - 4x_{2}^{3} + 3x_{1}^{2} + 6x_{1}x_{2} + 4x_{2}^{2} + 5 = 0
$$

$$
Error = 2^{2}+(-2)^{2}+(-4)^{2}+3^{2}+6^{2}+4^{2}= 85
$$
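
Both regularization errors can be reproduced in a few lines. This is a small sketch, assuming the coefficients are collected by hand into a NumPy array; as in the formulas above, the constant term 5 is left out.

```python
import numpy as np

# Coefficients of the polynomial above, excluding the constant term 5
coefficients = np.array([2, -2, -4, 3, 6, 4])

# L1: sum of absolute values
l1_error = np.abs(coefficients).sum()

# L2: sum of squares
l2_error = (coefficients ** 2).sum()

print(l1_error)  # 21
print(l2_error)  # 85
```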

# Kernel Trick

Transform the points with a function so that they become easier to separate than with a linear boundary alone. In this case, the points form a bullseye pattern and are easily separable with a circle, i.e. a boundary based on $x^{2}+y^{2}$.

<img src='KernelTrickGrid.png'>
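
The bullseye idea can be sketched in code. The example below generates hypothetical synthetic data (an inner disc and an outer ring, not the data in the figure), then compares a linear SVM on the raw coordinates with a linear SVM on the single transformed feature $x^{2}+y^{2}$. The transformation is applied explicitly here for illustration; a kernel such as `'poly'` or `'rbf'` achieves a similar effect without building the features by hand.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical bullseye data: class 0 inside radius 1, class 1 between radius 2 and 3
rng = np.random.default_rng(0)
n = 100
angles = rng.uniform(0, 2 * np.pi, size=2 * n)
radii = np.concatenate([rng.uniform(0, 1, n), rng.uniform(2, 3, n)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

# A linear boundary on the raw (x, y) coordinates cannot separate the classes
linear_acc = SVC(kernel='linear').fit(X, y).score(X, y)

# Transform each point to the single feature x^2 + y^2
Z = (X ** 2).sum(axis=1, keepdims=True)

# In the transformed space a linear boundary (a threshold) is enough
transformed_acc = SVC(kernel='linear').fit(Z, y).score(Z, y)

print(linear_acc, transformed_acc)
```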

# SVMs with SKLearn

For your support vector machine model, you'll be using scikit-learn's `SVC` class. This class provides the tools required to define the model and fit it to your data. Its most commonly used hyperparameters are:

```C```: The C parameter, which controls how heavily misclassified points are penalized.

```kernel```: The kernel. The most common ones are 'linear', 'poly', and 'rbf'.

```degree```: If the kernel is polynomial, this is the maximum degree of the monomials in the kernel.

```gamma```: If the kernel is rbf, this is the gamma parameter.

For example, here we define a model with a polynomial kernel of degree 4, and a C parameter of 0.1.

```>>> model = SVC(kernel='poly', degree=4, C=0.1)```

```python
# Import statements
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np

# Read the data.
data = np.asarray(pd.read_csv('SVMData.csv', header=None))
# Assign the features to the variable X, and the labels to the variable y.
X = data[:,0:2]
y = data[:,2]

# TODO: Create the model and assign it to the variable model.
# Find the right parameters for this model to achieve 100% accuracy on the dataset.
model = SVC(kernel='rbf', gamma=28)

# TODO: Fit the model.
model.fit(X,y)

# TODO: Make predictions. Store them in the variable y_pred.
y_pred = model.predict(X)

# TODO: Calculate the accuracy and assign it to the variable acc.
acc = accuracy_score(y, y_pred)
acc
```

```
1.0
```

# Recap

In this lesson, you learned about Support Vector Machines (or SVMs). SVMs are a popular algorithm used for classification problems. You saw three different ways that SVMs can be implemented:

> 1. Maximum Margin Classifier
> 2. Classification with Inseparable Classes
> 3. Kernel Methods

**Maximum Margin Classifier**

When your data can be completely separated, the linear version of SVMs attempts to maximize the distance from the linear boundary to the closest points (called the support vectors). For this reason, we saw that in the picture below, the boundary on the left is better than the one on the right.
<img src='Recap1.png'>
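
To make the maximum margin idea concrete, the sketch below fits a linear `SVC` to a tiny, hypothetical separable data set (made up for illustration) and reads back the support vectors and the margin width $\frac{2}{|W|}$. A very large `C` is used here as an assumption to approximate a hard margin that tolerates no errors.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin classifier
model = SVC(kernel='linear', C=1e6).fit(X, y)

# coef_ holds W for a linear kernel; the margin width is 2 / |W|
w = model.coef_[0]
margin = 2 / np.linalg.norm(w)
train_acc = model.score(X, y)

# The points closest to the boundary, which determine the margin
print(model.support_vectors_)
# For this data the widest possible margin is 5 / sqrt(2), about 3.54
print(margin)
```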

**Classification with Inseparable Classes**

Unfortunately, data in the real world is rarely completely separable as shown in the above images. For this reason, we introduced a new hyper-parameter called **C**. The **C** hyper-parameter determines how flexible we are willing to be with the points that fall on the wrong side of our dividing boundary. The value of **C** ranges between 0 and infinity. When **C** is large, you are forcing your boundary to have fewer errors than when it is a small value.

**Note: when C is too large for a particular set of data, training may converge very slowly or fail to converge within the iteration limit, because the data cannot be separated with the small number of errors that such a large value of C allows.**
<img src='Recap2.png'>
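
The effect of **C** can be observed on overlapping data. The sketch below uses hypothetical synthetic data (two overlapping Gaussian clusters) and two arbitrarily chosen values of `C`; it compares the resulting margin width $\frac{2}{|W|}$ and the training accuracy. A small `C` tolerates more violations and yields a wider margin, while a large `C` narrows the margin to reduce errors.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical overlapping classes: two Gaussian clusters centred at (-1,-1) and (1,1)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

margins = {}
for C in (0.01, 100):
    model = SVC(kernel='linear', C=C).fit(X, y)
    # Margin width for a linear kernel is 2 / |W|
    margins[C] = 2 / np.linalg.norm(model.coef_[0])
    print(C, margins[C], model.score(X, y))
```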

**Kernels**

Finally, we looked at what makes SVMs truly powerful: kernels. Kernels let SVMs separate data when the boundary between the classes is nonlinear. Specifically, you saw two types of kernels:

<ul>
    <li>polynomial</li>
    <li>rbf</li>
</ul>
    
By far the most popular kernel is the **rbf** kernel (which stands for radial basis function). The rbf kernel lets you classify points that seem hard to separate in any space. It is a density-based approach that looks at how close points are to one another, and it introduces another hyper-parameter, **gamma**. When **gamma** is large, the outcome is similar to having a large value of **C**: the algorithm attempts to classify every training point correctly. Small values of **gamma**, by contrast, produce a smoother, more general boundary that makes more mistakes on the training data but may perform better on new data.
<img src='Recap3.png'>
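
The trade-off controlled by **gamma** can be sketched as follows. The data is hypothetical (two overlapping Gaussian clusters, with a separate test sample drawn from the same distribution), and the two `gamma` values are arbitrary choices for illustration. A large `gamma` tends to fit the training points very closely, while a small `gamma` gives a smoother boundary that may do better on the unseen sample.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def sample(n_per_class):
    """Draw two overlapping Gaussian clusters centred at (-1,-1) and (1,1)."""
    X = np.vstack([rng.normal(-1.0, 1.0, size=(n_per_class, 2)),
                   rng.normal(1.0, 1.0, size=(n_per_class, 2))])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X_train, y_train = sample(50)
X_test, y_test = sample(200)

train_acc, test_acc = {}, {}
for gamma in (0.1, 1000):
    model = SVC(kernel='rbf', gamma=gamma).fit(X_train, y_train)
    train_acc[gamma] = model.score(X_train, y_train)
    test_acc[gamma] = model.score(X_test, y_test)
    print(gamma, train_acc[gamma], test_acc[gamma])
```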

# Resources:
**Support Vector Machines are described in Introduction to Statistical Learning starting on page 337**

**The Wikipedia page related to SVMs**

**[The derivation of SVMs from Stanford's CS229 notes](http://cs229.stanford.edu/notes/cs229-notes3.pdf)**

