# Support Vector Machines
This lab introduces Support Vector Machines (SVMs).


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
import warnings
from sklearn.exceptions import ConvergenceWarning

warnings.filterwarnings(action="ignore", category=ConvergenceWarning)

## Data Preparation
Let's begin by preparing a dataset for this lab:

In [None]:
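# Load the red wine quality data and shuffle the rows
df = (
    pd.read_csv("../datasets/winequality-red.csv", sep=";")
    .sample(frac=1)
    .reset_index(drop=True)
)

# Features: every column except the last; target: the last column
X_train, X_test, y_train, y_test = train_test_split(
    df.iloc[:, :-1].values, df.iloc[:, -1], test_size=0.2
)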

## What is a Support Vector Machine? 



Another classification algorithm that can be optimized with gradient descent is the Support Vector Machine (SVM).

As with the Perceptron, the SVM is a linear classifier, attempting to find a hyperplane that separates the classes in feature space (i.e., the space in which the features exist).

Unlike our Perceptron, however, the SVM attempts to _maximize_ the separation of the classes. It does this by optimizing the amount of space between the classes in feature space:

<img src="../images/support_vectors.jpeg" width=600 align="left" />


<span style="font-size:10px"><a href="https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47">Image Source</a></span>

Also similar to Perceptrons, SVMs can be optimized using gradient descent to minimize a cost function, but in an SVM that cost function is built to maximize the distance between classes rather than only classify each point correctly as in a Perceptron.

As you can see in the figure above, the cost function measures the space between the classes in feature space using the support vectors, the training points that lie closest to the separating hyperplane.
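As a rough illustration of this kind of cost function, the sketch below minimizes a soft-margin hinge loss with plain (sub)gradient descent on a tiny hand-made dataset. The function names `svm_cost` and `svm_gradient`, the toy data, the learning rate, and the use of labels in {-1, +1} are all assumptions made for this example; `sklearn`'s `SVC` uses a different solver internally.

```python
import numpy as np


def svm_cost(w, b, X, y, C=1.0):
    """Soft-margin cost: a small ||w|| means a wide margin; the hinge term penalizes violations."""
    margins = y * (X @ w + b)
    return 0.5 * np.dot(w, w) + C * np.mean(np.maximum(0.0, 1.0 - margins))


def svm_gradient(w, b, X, y, C=1.0):
    """Subgradient of svm_cost with respect to w and b."""
    margins = y * (X @ w + b)
    active = margins < 1.0  # points inside the margin or misclassified
    grad_w = w - C * (X[active].T @ y[active]) / len(y)
    grad_b = -C * np.sum(y[active]) / len(y)
    return grad_w, grad_b


# Toy, linearly separable data with labels in {-1, +1} (hypothetical values)
X_toy = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y_toy = np.array([1.0, 1.0, -1.0, -1.0])

w, b = np.zeros(2), 0.0
initial_cost = svm_cost(w, b, X_toy, y_toy)

for _ in range(200):
    grad_w, grad_b = svm_gradient(w, b, X_toy, y_toy)
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

final_cost = svm_cost(w, b, X_toy, y_toy)
predictions = np.sign(X_toy @ w + b)
print(initial_cost, final_cost, predictions)
```

Only the points with a margin below 1 contribute to the gradient of the hinge term, which is why the points closest to the boundary end up determining where the hyperplane sits.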

## The Kernel Trick

<img src="../images/kernel_trick.png" width=700 align="left" />



<span style="font-size:10px"><a href="https://towardsdatascience.com/the-kernel-trick-c98cdbcaeb3f">Image Source</a></span>

The SVM, like the perceptron, is attempting to find a hyperplane that can separate the classes in our data. Without any improvement beyond what we have discussed, the SVM would be restricted, like the perceptron, to only being applicable to linearly separable problems.

However, recall how linear models can utilize a non-linear function, $f$. This non-linearity was, in essence, mapping our features into a higher-dimensional space, where previously inseparable classes become linearly separable.

We can follow this same process in our SVM. First, we can apply a transformation to our data to map it to a linearly separable feature space. We can then operate in this higher-dimensional space, where the classes are linearly separable, to determine our decision boundaries between classes, and then map back down to the original feature space.

The problem is that if we have many data points, and we map all of the data points' features into higher dimensions and then operate there, the memory and processing requirements will scale to unrealistic levels. As it turns out, we don't need to actually apply this transformation. A kernel function can compute the pairwise similarity (the dot product) that two data points would have in the higher-dimensional space while working only with their original features, and these pairwise similarities are sufficient to find decision boundaries. This process is called the **kernel trick**.
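To make this concrete, here is a minimal sketch comparing an explicit mapping with a kernel for a degree-2 polynomial on 2-D inputs. The names `phi` and `poly_kernel` and the two sample points are assumptions chosen for illustration.

```python
import numpy as np


def phi(x):
    """Explicit degree-2 feature map for a 2-D input (x1, x2) into 3-D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])


def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2."""
    return np.dot(x, z) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, 4.0])

explicit = np.dot(phi(x), phi(z))  # map both points to 3-D, then take the dot product
implicit = poly_kernel(x, z)       # same similarity, computed entirely in 2-D

print(explicit, implicit)
```

Both values agree (up to floating-point rounding), but the kernel never builds the higher-dimensional vectors. For higher degrees or the RBF kernel, the explicit feature space becomes very large or infinite, so this saving is what makes the approach practical.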

## Multi-class Classification with SVM

While the kernel trick allows the SVM to work on problems that are not linearly separable, in its naive form it is still restricted to two classes.

Because an SVM maximizes the margin between two classes, with the support vectors being the training points that lie closest to the boundary, it can only handle two classes at a time.

We can accomplish multi-class classification with SVMs using a _one versus the rest_, or OVR, scheme. In this scheme, we train _n_classes_ SVMs, each one trained as a two-class SVM, with the positive class being one of the classes in our dataset and the negative class being all other classes.

So in a three class problem with classes A, B, C, the OVR scheme is 3 SVMs:

- class A vs classes B & C
- class B vs classes A & C
- class C vs classes A & B

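This means that for every class we add, we are adding another model to be trained and tested. Thankfully, `sklearn` abstracts this process for us, and we only need to give it our data and tell it to use an `ovr` scheme. (Note that `SVC` itself trains one-vs-one classifiers internally; `decision_function_shape="ovr"` controls the shape of the decision function it returns.)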
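As a small sketch of the OVR scheme, the example below wraps a binary `SVC` in `sklearn`'s `OneVsRestClassifier` and confirms that one model is trained per class. The synthetic three-class data from `make_blobs` and the variable names are assumptions for illustration, not part of the lab's dataset.

```python
from sklearn.datasets import make_blobs
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic three-class data (stand-in for classes A, B, and C)
X_demo, y_demo = make_blobs(n_samples=150, centers=3, random_state=0)

# Each underlying SVC learns "this class vs. all other classes"
ovr = OneVsRestClassifier(SVC(kernel="linear"))
ovr.fit(X_demo, y_demo)

n_models = len(ovr.estimators_)
print(n_models, ovr.classes_)
```

At prediction time, each of the binary models scores the sample and the class whose model is most confident wins.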


## Using an SVM

Now that we understand what an SVM is, how it is trained, and how it handles problems that are not linearly separable, let's build one. The class in `sklearn` is the Support Vector Classifier, or `SVC`. We'll parameterize our SVM with a polynomial kernel of degree 3 (a cubic function) and a maximum of 2500 iterations:


In [5]:
svm = SVC(
    kernel="poly", degree=3, verbose=False, max_iter=2500, decision_function_shape="ovr"
)

Now let's train it:


In [None]:
svm.fit(X_train, y_train)

Now, let's score it against the test data:


In [None]:
svm.score(X_test, y_test)

SVMs, like all machine learning models, are sensitive to their hyperparameters. Let's look at how the choice of kernel and the regularization parameter `C` affects performance:

In [None]:
svm = SVC(kernel="rbf", C=10)
svm.fit(X_train, y_train)
svm.score(X_test, y_test)

In [None]:
svm = SVC(kernel="poly", degree=5, C=10)
svm.fit(X_train, y_train)
svm.score(X_test, y_test)
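Rather than editing one cell at a time, comparisons like these can also be run in a loop. The sketch below does so on synthetic data from `make_classification`; the dataset, the variable names, and the particular kernel and `C` values are assumptions for illustration, and the scores will differ from those on the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in dataset so the example is self-contained
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=4, random_state=0
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Train one SVC per (kernel, C) combination and record its test accuracy
scores = {}
for kernel in ("linear", "rbf"):
    for C in (0.1, 10):
        model = SVC(kernel=kernel, C=C)
        model.fit(Xd_train, yd_train)
        scores[(kernel, C)] = model.score(Xd_test, yd_test)

for (kernel, C), accuracy in scores.items():
    print(f"kernel={kernel:<6} C={C:<4} accuracy={accuracy:.3f}")
```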

## <span style="background:yellow">Your Turn</span>

Train and test an `SVC` model with a polynomial kernel of degree 2:

In [10]:
# <-- Your Code Here -->