## Support Vector Machines (SVM)

SVM stands for support vector machine. SVMs are typically used for classification tasks, similar to what we did with K Nearest Neighbors. They work very well for high-dimensional data and allow us to classify data that is not linearly separable, such as the data set below.

https://techwithtim.net/wp-content/uploads/2019/01/svm-data-768x578.png

Attempting to use K Nearest Neighbors to do this would likely give us a very low accuracy score and is not favorable. This is where SVMs are useful.

### Importing Modules


Before we start we need to import a few things from sklearn.



In [17]:
import sklearn
from sklearn import svm
from sklearn import datasets
from sklearn import model_selection  # makes sklearn.model_selection.train_test_split available

### Loading Data

In previous tutorials we did quite a bit of work to load our data sets from places like the UCI Machine Learning Repository. That is a very useful skill and something you will often have to do when applying these algorithms to your own data. However, now that we have learned this, we will use the data sets that come with sklearn. These are much nicer to work with and have some convenient methods that make loading data very quick.

For this tutorial we will be using a breast cancer data set. It consists of many features describing a tumor and classifies each tumor as either cancerous (malignant) or non-cancerous (benign).

To load our data we will simply do the following.

In [18]:
cancer = datasets.load_breast_cancer()


To see a list of the features in the data set we can do:



In [19]:
print("Features: ", cancer.feature_names)

Features:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


Similarly for the labels.



In [20]:
print("Labels: ", cancer.target_names)

Labels:  ['malignant' 'benign']
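Before going further it can help to check how big the data set is and how the two labels are distributed. The snippet below is a minimal sketch of one way to do that; the variable name `counts` is just an illustration and is not used elsewhere in this tutorial.

```python
import numpy as np
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# Number of samples (rows) and features (columns)
print("Shape:", cancer.data.shape)

# Count how many samples carry each label (0 = malignant, 1 = benign)
counts = np.bincount(cancer.target)
for name, count in zip(cancer.target_names, counts):
    print(name, count)
```

The labels are not perfectly balanced, which is worth keeping in mind when we look at accuracy scores later.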


### Splitting Data

Now that we have loaded our data set, it is time to split it into training and testing data. We will do this the same way as in previous tutorials.

In [21]:
x = cancer.data  # All of the features
y = cancer.target  # All of the labels
print(type(x))
print(type(y))
print(x)
print(y)
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(x, y, test_size=0.2)

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>
[[1.799e+01 1.038e+01 1.228e+02 ... 2.654e-01 4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 ... 1.860e-01 2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 ... 2.430e-01 3.613e-01 8.758e-02]
 ...
 [1.660e+01 2.808e+01 1.083e+02 ... 1.418e-01 2.218e-01 7.820e-02]
 [2.060e+01 2.933e+01 1.401e+02 ... 2.650e-01 4.087e-01 1.240e-01]
 [7.760e+00 2.454e+01 4.792e+01 ... 0.000e+00 2.871e-01 7.039e-02]]
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 
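Because `train_test_split` shuffles the data randomly, the accuracy values later in this tutorial will change a little from run to run. If you want repeatable results, a sketch like the one below may help. The values `random_state=42` and `stratify=cancer.target` are my own assumptions for illustration and are not part of the original split.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()

# random_state fixes the shuffle so every run produces the same split.
# stratify keeps the malignant/benign ratio similar in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data,
    cancer.target,
    test_size=0.2,
    random_state=42,
    stratify=cancer.target,
)

print(x_train.shape, x_test.shape)
```

With `test_size=0.2`, roughly 20% of the samples end up in the test set and the rest are used for training.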

If we want to have a look at our data we can print the first few instances.


In [16]:
print(x_train[:5], y_train[:5])
classes = ['malignant', 'benign']

[[2.051e+01 2.781e+01 1.344e+02 1.319e+03 9.159e-02 1.074e-01 1.554e-01
  8.340e-02 1.448e-01 5.592e-02 5.240e-01 1.189e+00 3.767e+00 7.001e+01
  5.020e-03 2.062e-02 3.457e-02 1.091e-02 1.298e-02 2.887e-03 2.447e+01
  3.738e+01 1.627e+02 1.872e+03 1.223e-01 2.761e-01 4.146e-01 1.563e-01
  2.437e-01 8.328e-02]
 [1.349e+01 2.230e+01 8.691e+01 5.610e+02 8.752e-02 7.698e-02 4.751e-02
  3.384e-02 1.809e-01 5.718e-02 2.338e-01 1.353e+00 1.735e+00 2.020e+01
  4.455e-03 1.382e-02 2.095e-02 1.184e-02 1.641e-02 1.956e-03 1.515e+01
  3.182e+01 9.900e+01 6.988e+02 1.162e-01 1.711e-01 2.282e-01 1.282e-01
  2.871e-01 6.917e-02]
 [1.359e+01 1.784e+01 8.624e+01 5.723e+02 7.948e-02 4.052e-02 1.997e-02
  1.238e-02 1.573e-01 5.520e-02 2.580e-01 1.166e+00 1.683e+00 2.222e+01
  3.741e-03 5.274e-03 1.065e-02 5.044e-03 1.344e-02 1.126e-03 1.550e+01
  2.610e+01 9.891e+01 7.391e+02 1.050e-01 7.622e-02 1.060e-01 5.185e-02
  2.335e-01 6.263e-02]
 [1.189e+01 2.117e+01 7.639e+01 4.338e+02 9.773e-02 8.120e-02 2.555
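The labels are stored as 0 and 1, which is not very readable. One way to turn them back into names is to index `cancer.target_names` with the label array, as sketched below. The names `first_labels` and `names` are only for illustration.

```python
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# Take the first five labels of the unshuffled data set
first_labels = cancer.target[:5]

# Indexing the name array with integer labels maps 0 -> malignant, 1 -> benign
names = cancer.target_names[first_labels]

print(first_labels)
print(names)
```

The same indexing works on `y_train`, `y_test`, or an array of predictions.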

The next tutorial will explain how an SVM works and the math behind it. Following that I will go over implementing the algorithm.

## What an SVM Does

An SVM has a long list of uses. However, in machine learning it is typically used for classification. It is a powerful tool and a good choice for classifying complicated data with a large number of dimensions (features). Note that K-Nearest Neighbors does not perform well on high-dimensional data.

## How A Support Vector Machine Works

In short, a support vector machine works by dividing data into classes using something called a hyper-plane. A hyper-plane is a flat boundary that splits the space our data points live in into two sides. In 2D space a hyper-plane is simply a line, and in 3D space it is a plane. In any space higher than 3D it is simply called a hyper-plane.

Here’s an example of a hyper-plane for the data points on the 2D graph below.

https://techwithtim.net/wp-content/uploads/2019/01/svm2.png
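To make the idea of "which side of the hyper-plane" concrete, here is a minimal sketch. A hyper-plane can be written as w . x + b = 0, and the sign of w . x + b tells us which side a point falls on. The values of `w` and `b` and the helper `classify` are made up for illustration; they are not taken from the image above.

```python
import numpy as np

# Hypothetical 2D hyper-plane (a line): w . x + b = 0.
# With these values it is the vertical line x1 = 3.5.
w = np.array([1.0, 0.0])
b = -3.5

def classify(point):
    """Return 'green' for points on the positive side of the line, 'red' otherwise."""
    return "green" if np.dot(w, point) + b > 0 else "red"

print(classify(np.array([1.0, 2.0])))  # a point to the left of the line
print(classify(np.array([6.0, 1.0])))  # a point to the right of the line
```

Points to the left of the line come out as red and points to the right as green.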

### Hyper-Planes

When we create a hyper-plane we need to do the following. We must pick two points that are known as our support vectors. These points must be the two closest points to the hyper-plane, and their distance from the hyper-plane must be identical. In the example above, the two circled points are our support vectors: they are the closest points to the hyper-plane and are the same distance from it. Following this rule we can actually create infinitely many hyper-planes (see below).

https://techwithtim.net/wp-content/uploads/2019/01/svm4.png

https://techwithtim.net/wp-content/uploads/2019/01/svm3.png

All of the images above show valid hyper-planes.


### Picking a Hyper Plane

Once we create a hyper-plane we are going to use it to classify our data. If a test point is on the left side of the plane we would classify it as red (in our examples above) and if it is on the right we would classify it as green. So how can we pick a hyper-plane that will give us the best classification predictions?

Have a look at the hyper-planes above and determine which you think would give the best classification for a mystery test point. What do you notice about that hyper-plane?

Well, the best possible hyper-plane is the one in the first image on this page. Notice that the distance between the support vectors and the hyper-plane is far greater than for the other hyper-planes.

When we pick a hyper-plane we want to pick one that has the greatest possible margin.

### Margin

The margin is the distance between the hyper-plane and the closest points on either side of it (the support vectors). The blue lines below show the margin for this particular data and hyper-plane. Typically, the greater our margin, the better our classification will be.

https://techwithtim.net/wp-content/uploads/2019/01/svm5.png
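The idea of comparing hyper-planes by their margin can be sketched in a few lines. The points and the two candidate lines below are invented for illustration, and `margin` is a hypothetical helper that measures the distance from the hyper-plane to its closest point.

```python
import numpy as np

def margin(w, b, points):
    """Smallest perpendicular distance from any point to the hyper-plane w . x + b = 0."""
    distances = np.abs(points @ w + b) / np.linalg.norm(w)
    return distances.min()

# Made-up 2D points: the first three are "red", the last three are "green"
points = np.array([
    [1.0, 1.0], [2.0, 3.0], [1.0, 4.0],
    [5.0, 2.0], [6.0, 4.0], [7.0, 1.0],
])

w = np.array([1.0, 0.0])
margin_a = margin(w, -3.5, points)  # candidate line x1 = 3.5
margin_b = margin(w, -2.5, points)  # candidate line x1 = 2.5

print(margin_a, margin_b)
```

Both lines separate the two groups, but the first one keeps more distance to its closest points, so it is the one we would prefer.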

### Kernels

So you now have a very basic understanding of how an SVM works. It seems pretty simple in theory, but in practice we can run into a lot of issues.

Let’s say our data isn’t as pretty and we have some points that look like this:

https://techwithtim.net/wp-content/uploads/2019/01/svm6.png

Can you determine which hyper-plane would be the best for this data? Even if you could it would make a horrible classifier. This is where we introduce something called kernels.

Kernels provide a way for us to create a hyper-plane for data like the points shown above. We use a kernel to bring our data up to a higher dimension (in this case from 2D to 3D). We hope that by doing this our points will be arranged in a way that lets us divide them using a hyper-plane.

By applying a kernel to our data above we hope to get something that looks like the following:

https://techwithtim.net/wp-content/uploads/2019/01/svm8.png

You can see that we can now divide our points with a plane in 3D. By applying the kernel our data has become separable.

### What Is A Kernel?

A kernel is simply a function that takes our features as input (x1, x2 in our example) and returns a value equal to the third-dimensional coordinate (x3). An example of a kernel could be the equation:

(x1)^2 + (x2)^2 = x3

https://techwithtim.net/wp-content/uploads/2019/01/svm7.png
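The equation above can be tried out on a small made-up example. The sketch below builds two rings of points (an assumption chosen for illustration, not the data from the images) and adds x3 = (x1)^2 + (x2)^2 as a third coordinate. The helper name `lift` is hypothetical.

```python
import numpy as np

# Eight evenly spaced points on a circle of radius 1 and on a circle of radius 3.
# In 2D no straight line can separate the inner ring from the outer ring.
angles = np.linspace(0, 2 * np.pi, 8, endpoint=False)
unit_circle = np.column_stack([np.cos(angles), np.sin(angles)])
inner = 1.0 * unit_circle
outer = 3.0 * unit_circle

def lift(points):
    """Append x3 = x1^2 + x2^2 as a third coordinate."""
    x3 = (points ** 2).sum(axis=1)
    return np.column_stack([points, x3])

inner_3d = lift(inner)
outer_3d = lift(outer)

# Largest x3 of the inner ring versus smallest x3 of the outer ring
print(inner_3d[:, 2].max(), outer_3d[:, 2].min())
```

After the lift, the inner ring sits at a height of about 1 and the outer ring at about 9, so a flat plane such as x3 = 5 now separates them.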

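Typically when we use a kernel we use a pre-existing one. There is much debate about which kernel is the best, but popular choices include linear, polynomial, radial basis function (RBF), and sigmoid, all of which sklearn's SVC supports through its kernel parameter.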

### Soft & Hard Margin

The last topic to touch on is soft and hard margins. A hard margin is precisely what you’ve learned already: no points may exist inside the margin. However, if we have outlier points, we sometimes want to allow them to exist inside the margin and use points that are not the closest to the hyper-plane as our support vectors. Doing this is called using a soft margin.

https://techwithtim.net/wp-content/uploads/2019/01/svm9.png

You can see in the example above that there is a point that exists inside the margin. If we had not allowed this, not only would it be difficult to create a hyper-plane, but our classifier would also perform poorly.

The number of points we allow to exist inside the margin is something we can control with a hyper-parameter.

### Implementing a SVM


Implementing the SVM is actually fairly easy. We can simply create a new model and call .fit() on our training data.

In [22]:
from sklearn import svm
clf = svm.SVC()
clf.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

To score our model we will use a useful tool from the sklearn module.

In [23]:
from sklearn import metrics
y_pred = clf.predict(x_test) # Predict values for our test data

acc = metrics.accuracy_score(y_test, y_pred) # Test them against our correct values

In [26]:
print(y_pred)
print(y_test)

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1]
[1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1
 1 1 1 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 0 0 1 1 1 0
 1 1 0 1 0 1 0 1 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 1 1 1 1 0 1 1 1
 0 0 0]


And that is all we need to do to implement our SVM, now we can run the program and take note of our amazing accuracy!

In [25]:
print(acc)

0.6403508771929824


Wait... Our accuracy is only about 64%, and that is horrible! Notice that the model predicted 1 (benign) for every single test sample. Looks like we need to add something else.
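The predictions printed above are all 1, so the accuracy is simply the share of test samples whose true label is 1. A confusion matrix makes this kind of failure easy to spot. The sketch below uses a tiny made-up label array (`y_true`) rather than our real test data, purely to show the pattern.

```python
import numpy as np
from sklearn import metrics

# Made-up true labels: three samples of class 1, two of class 0
y_true = np.array([1, 1, 0, 1, 0])

# A "model" that predicts class 1 for everything
y_all_ones = np.ones_like(y_true)

acc_example = metrics.accuracy_score(y_true, y_all_ones)
cm = metrics.confusion_matrix(y_true, y_all_ones)

print(acc_example)
print(cm)  # rows are true labels, columns are predicted labels
```

The first column of the matrix is all zeros, which tells us that class 0 was never predicted. Running `metrics.confusion_matrix(y_test, y_pred)` on our own results shows the same pattern.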

### Adding a Kernel

SVC uses the rbf kernel by default, as the output above shows. For this data set we will switch to a linear kernel.

In [30]:
clf = svm.SVC(kernel="linear")
clf.fit(x_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

After running this we receive a much better accuracy of about 97%.

In [31]:
print(clf)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [32]:
from sklearn import metrics
y_pred = clf.predict(x_test) # Predict values for our test data

acc = metrics.accuracy_score(y_test, y_pred) # Test them against our correct values

In [34]:
for i in range(len(y_pred)):
    print("Y_PRED :", y_pred[i]," ","Y_TEST:",y_test[i])
    

Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 0   Y_TEST: 0
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 0   Y_TEST: 0
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 0   Y_TEST: 0
Y_PRED : 0   Y_TEST: 0
Y_PRED : 0   Y_TEST: 0
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 0   Y_TEST: 0
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 0   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 0   Y_TEST: 0
Y_PRED : 0   Y_TEST: 0
Y_PRED : 0   Y_TEST: 0
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1   Y_TEST: 1
Y_PRED : 0   Y_TEST: 0
Y_PRED : 0   Y_TEST: 0
Y_PRED : 1   Y_TEST: 1
Y_PRED : 1 

In [35]:
print(acc)

0.9736842105263158


### Changing the Margin

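By default our SVC uses a soft margin with a value of 1. This parameter is known as C. In sklearn, C controls how heavily points inside the margin or on the wrong side of the hyper-plane are penalized: decreasing C gives a softer margin (more violations are tolerated), while increasing C pushes the model toward a hard margin. C must be strictly positive, so it cannot be set to 0. Playing with this value should alter your results slightly.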

In [36]:
from sklearn import svm
clf = svm.SVC(kernel="linear", C=2)
clf.fit(x_train, y_train)

SVC(C=2, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [37]:
print(clf)

SVC(C=2, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)


In [38]:
from sklearn import metrics
y_pred = clf.predict(x_test) # Predict values for our test data

acc = metrics.accuracy_score(y_test, y_pred) # Test them against our correct values

In [39]:
print(acc)

0.9824561403508771
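To get a feel for how C changes the model, one option is to loop over a few values and record the accuracy together with the number of support vectors. This is only a sketch: the list of C values and `random_state=0` are my own choices, and your numbers will differ from the ones shown earlier because the split is different.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=0
)

results = {}
for c in [0.01, 0.1, 1]:
    clf = svm.SVC(kernel="linear", C=c)
    clf.fit(x_train, y_train)
    acc = metrics.accuracy_score(y_test, clf.predict(x_test))
    # n_support_ holds the number of support vectors for each class
    n_support = int(clf.n_support_.sum())
    results[c] = (acc, n_support)
    print("C =", c, "accuracy =", acc, "support vectors =", n_support)
```

Comparing the rows gives a rough idea of how sensitive this data set is to the choice of C.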


If you want to play around with some other parameters have a look here.
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html

## Comparing to KNearestNeighbors

If we want to see how this algorithm runs in comparison to KNN we can run the KNN classifier on this data-set and compare our accuracy values.

Switching to the KNN classifier is quite simple.

In [40]:
from sklearn.neighbors import KNeighborsClassifier

clf = KNeighborsClassifier(n_neighbors=11)
# Simply change clf to what is above

Note that KNN still does well on this data set but hovers around the 90% mark.

In [41]:
clf.fit(x_train, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=11, p=2,
           weights='uniform')

In [42]:
y_pred = clf.predict(x_test)


In [43]:
acc = metrics.accuracy_score(y_test, y_pred)

In [44]:
print(acc)

0.9210526315789473
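For a fair comparison both models should be trained and tested on exactly the same split. The sketch below does this in one loop. The dictionary names `models` and `scores` and the choice of `random_state=0` are assumptions for illustration, so the exact numbers will not match the outputs above.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

cancer = datasets.load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=0
)

# The two classifiers used in this tutorial, with the same settings as above
models = {
    "SVM (linear)": svm.SVC(kernel="linear"),
    "KNN (k=11)": KNeighborsClassifier(n_neighbors=11),
}

scores = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    scores[name] = metrics.accuracy_score(y_test, model.predict(x_test))
    print(name, scores[name])
```

A single split can favor either model by chance, so it is worth repeating this with a few different values of `random_state` before drawing conclusions.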
