# Support Vector Machines 

## Simple SVM

For linearly separable data in two dimensions, as shown in Fig. 1, a typical machine learning algorithm tries to find a boundary that divides the data so that the misclassification error is minimized. If you look closely at Fig. 1, several boundaries divide the data points correctly: the two dashed lines and the solid line all classify the data correctly.
![](https://s3.amazonaws.com/stackabuse/media/implementing-svm-kernel-svm-python-scikit-learn-1.jpg)

SVM differs from other classification algorithms in that it chooses the decision boundary that maximizes the distance to the nearest data points of all the classes. An SVM doesn't merely find a decision boundary; it finds the optimal one.

The optimal decision boundary is the one with the maximum margin from the nearest points of all the classes. The points nearest to the decision boundary, which determine the margin, are called support vectors, as seen in Fig. 2. In support vector machines, this decision boundary is called the maximum-margin hyperplane.
![](https://s3.amazonaws.com/stackabuse/media/implementing-svm-kernel-svm-python-scikit-learn-2.jpg)
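To make the margin and support vectors concrete, here is a minimal sketch using scikit-learn's `SVC` with a linear kernel. The six points are made up for illustration, and the very large `C` is an assumption used to approximate a hard margin on separable data.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (invented for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# A very large C leaves almost no tolerance, approximating a hard margin.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

# For a linear kernel the boundary is w . x + b = 0,
# and the margin width is 2 / ||w||.
w = clf.coef_[0]
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)

print("support vectors:\n", clf.support_vectors_)
print("w =", w, "b =", b)
print("margin width =", margin_width)
```

Only the points listed in `support_vectors_` determine the boundary; moving any other point (without crossing the margin) should leave the fitted hyperplane unchanged.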

## Non-Linearly Separable Data 
For non-linearly separable data, the simple SVM algorithm cannot be used. Instead, a modified version of SVM, called Kernel SVM, is used.
![](https://s3.amazonaws.com/stackabuse/media/implementing-svm-kernel-svm-python-scikit-learn-3.jpg)

The kernel SVM projects the non-linearly separable data from its lower-dimensional space into a higher-dimensional space, in such a way that the data points belonging to different classes become linearly separable there.

Below, the data points are plotted on the x-axis and z-axis, where z is the sum of the squares of x and y: z = x^2 + y^2. Now you can easily segregate these points using linear separation.

![](http://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1526288453/index_bnr4rx.png)
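The lifting idea can be sketched in a few lines of NumPy. This example assumes synthetic data (an inner disc and an outer ring, generated by a hypothetical helper `sample_ring`) and uses the extra coordinate z = x^2 + y^2; it is an illustration, not the data behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ring(n, r_min, r_max):
    """Sample n points whose distance from the origin lies in [r_min, r_max)."""
    radius = rng.uniform(r_min, r_max, size=n)
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    return np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

# Inner class: inside radius 1. Outer class: ring between radius 2 and 3.
# No straight line in the (x, y) plane separates a disc from a surrounding ring.
inner = sample_ring(50, 0.0, 1.0)
outer = sample_ring(50, 2.0, 3.0)

# Lift each point to the extra dimension z = x^2 + y^2.
z_inner = (inner ** 2).sum(axis=1)
z_outer = (outer ** 2).sum(axis=1)

# In z, a single threshold (a flat plane in x-y-z space) separates the classes.
threshold = 2.5
print("inner max z:", z_inner.max())
print("outer min z:", z_outer.min())
print("separable by z threshold:", z_inner.max() < threshold < z_outer.min())
```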


### Linear Kernel
A linear kernel works much like what we saw in linear regression: take the dot product of any two given observations. The dot product of two vectors is the sum of the products of each pair of input values.

**K(x, xi) = sum(x * xi)** 

### Polynomial Kernel 
The polynomial kernel can distinguish curved or nonlinear input space. Here d is the degree of the polynomial; d=1 is similar to the linear kernel. The degree needs to be manually specified in the learning algorithm.

**K(x,xi) = (1 + sum(x * xi))^d**

### RBF - Radial Basis Function Kernel 
The radial basis function kernel is a popular kernel function commonly used in support vector machine classification. RBF can map an input space into an infinite-dimensional space. Here **gamma** is a positive parameter, often chosen between 0 and 1. A higher value of gamma fits the training dataset more closely, which can cause over-fitting. Gamma=0.1 is often cited as a reasonable starting value. The value of gamma is a hyper-parameter of the learning algorithm and usually needs tuning.

**K(x,xi) = exp(-gamma * sum((x - xi)^2))**
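As a sanity check, the three kernels can be written directly in NumPy and compared against scikit-learn's pairwise helpers. This sketch assumes the standard forms x . xi, (1 + x . xi)^d and exp(-gamma * ||x - xi||^2); the function names `linear`, `polynomial` and `rbf` and the two sample vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

def linear(x, xi):
    # Dot product: sum of element-wise products.
    return np.dot(x, xi)

def polynomial(x, xi, d=3):
    # (1 + x . xi)^d
    return (1.0 + np.dot(x, xi)) ** d

def rbf(x, xi, gamma=0.1):
    # exp(-gamma * squared Euclidean distance)
    return np.exp(-gamma * np.sum((x - xi) ** 2))

x = np.array([1.0, 2.0])
xi = np.array([3.0, 0.5])

print(linear(x, xi))           # 4.0
print(polynomial(x, xi, d=3))  # 125.0
print(rbf(x, xi, gamma=0.1))

# scikit-learn's helpers expect 2D arrays of shape (n_samples, n_features).
X, Xi = x.reshape(1, -1), xi.reshape(1, -1)
assert np.isclose(linear(x, xi), linear_kernel(X, Xi)[0, 0])
assert np.isclose(polynomial(x, xi, d=3),
                  polynomial_kernel(X, Xi, degree=3, gamma=1.0, coef0=1.0)[0, 0])
assert np.isclose(rbf(x, xi, gamma=0.1), rbf_kernel(X, Xi, gamma=0.1)[0, 0])
```

Note that scikit-learn's polynomial kernel also scales the dot product by its own `gamma`; setting `gamma=1.0` and `coef0=1.0` recovers the form used here.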

## Soft Margin and Kernel Tricks 
SVM addresses non-linearly separable cases by introducing two concepts: Soft Margin and Kernel Tricks.

![](https://miro.medium.com/max/1400/1*vwRojapdm0po85w8XnyWRQ.png)

**2 Solutions:** 

1. Soft Margin: try to find a line to separate the classes, but tolerate one or a few misclassified dots (e.g. the dots circled in red)

2. Kernel Trick: try to find a non-linear decision boundary

### Soft Margin 

Under **soft margin**, SVM tolerates two types of errors:
1. The dot is on the correct side of the decision boundary but inside the margin, i.e. on the wrong side of the margin (shown on the left)

2. The dot is on the wrong side of the decision boundary and on the wrong side of the margin (shown on the right)

![](https://miro.medium.com/max/1400/1*pNJ7IaXvSvxjpyUw3KKVzA.png)

Applying Soft Margin, SVM tolerates a few misclassified dots and tries to balance the trade-off between maximizing the margin and minimizing the misclassification.

How much tolerance (softness) we want to give when finding the decision boundary is an important hyper-parameter for the SVM (both linear and nonlinear solutions). In Sklearn, it is represented as the penalty term 'C'. The bigger the C, the more penalty SVM gets when it makes a misclassification. Therefore, the narrower the margin is, and the fewer support vectors the decision boundary will depend on.

![](https://miro.medium.com/max/1400/1*0vOVPBmYCkw-sUt77HtyGA.png)
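The effect of C can be checked on a small experiment. The sketch below assumes two overlapping synthetic clusters from `make_blobs` and a linear kernel, and reports the margin width (2 / ||w||) and the number of support vectors for a few values of C chosen arbitrarily for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters (synthetic, for illustration only).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.5, random_state=0)

results = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Margin width of a linear SVM is 2 / ||w||.
    margin_width = 2.0 / np.linalg.norm(clf.coef_[0])
    n_support = int(clf.n_support_.sum())
    results[C] = (margin_width, n_support)
    print(f"C={C:<6} margin width={margin_width:.3f} support vectors={n_support}")
```

A small C should give a wide margin that many points fall inside (so many support vectors); a large C should shrink the margin, and typically the support-vector count with it.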

### Kernel Trick 
What the Kernel Trick does is utilize existing features, apply some transformations, and create new features. Those new features are the key for SVM to find the nonlinear decision boundary.
In Sklearn's svm.SVC(), we can choose 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable as our kernel/transformation. I will give examples of the two most popular kernels: Polynomial and Radial Basis Function (RBF).
![](https://miro.medium.com/max/1400/1*Ha7EfcfB5mY2RIKsXaTRkA.png)

#### Polynomial Kernel
Think of the polynomial kernel as a transformer/processor that generates new features by applying polynomial combinations of all the existing features.
![](https://miro.medium.com/max/1400/1*gIHnZCcl4Q9fFx2AZsJ7pw.png)

- Existing Feature: X = np.array([-2,-1,0, 1,2])
- Label: Y = np.array([1,1,0,1,1])
- It's impossible for us to find a line to separate the yellow (1) and purple (0) dots (shown on the left).

But, if we apply the transformation X² to get:
- New Feature: X_new = np.array([4,1,0, 1,4])
- By combining the existing and new features, we can certainly draw a line to separate the yellow and purple dots (shown on the right).

Support vector machine with a polynomial kernel can generate a non-linear decision boundary using those polynomial features.
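The toy example can be reproduced with a linear SVM, first on the single feature X and then on X together with X². This is a minimal sketch: the large `C` and the variable names `X_1d`, `X_2d`, `acc_1d` and `acc_2d` are choices made here for illustration.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([-2, -1, 0, 1, 2], dtype=float)
Y = np.array([1, 1, 0, 1, 1])

# Original 1D feature: a linear boundary is a single threshold on X,
# which cannot isolate the middle point from both sides.
X_1d = X.reshape(-1, 1)
acc_1d = SVC(kernel="linear", C=1e3).fit(X_1d, Y).score(X_1d, Y)

# Add the polynomial feature X^2: now a horizontal line
# between X^2 = 0 and X^2 = 1 separates the classes.
X_2d = np.column_stack([X, X ** 2])
acc_2d = SVC(kernel="linear", C=1e3).fit(X_2d, Y).score(X_2d, Y)

print("training accuracy with X only:  ", acc_1d)
print("training accuracy with X and X^2:", acc_2d)
```

Using `SVC(kernel='poly', degree=2)` on the original feature applies the same idea implicitly, without building the X² column by hand.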

#### Radial Basis Function (RBF) kernel

Think of the Radial Basis Function kernel as a transformer/processor that generates new features by measuring the distance between all other dots and a specific dot/dots (the centers). The most popular/basic RBF kernel is the **Gaussian Radial Basis Function:**
![](https://miro.medium.com/max/1260/1*izqr1xGcP89B7Xs1EluIQQ.png)

**gamma (γ)** controls the influence of the new features Φ(x, center) on the decision boundary. The higher the gamma, the more influence the features will have on the decision boundary, and the more wiggly the boundary will be.
![](https://miro.medium.com/max/1400/1*M9spISHtIR_wOXKtmFTFvg.png)

- Existing Feature: X = np.array([-2,-1,0, 1,2])
- Label: Y = np.array([1,1,0,1,1])

Again, it's impossible for us to find a line to separate the dots (on the left). But, if we apply the Gaussian RBF transformation using two centers (-1,0) and (2,0) to get new features, we will then be able to draw a line to separate the yellow and purple dots (on the right):

- New Feature 1: X_new1 = array([1.01, 1.00, 1.01, 1.04, 1.09])
- New Feature 2: X_new2 = array([1.09, 1.04, 1.01, 1.00, 1.01])

By combining the soft margin (tolerance of misclassifications) and the kernel trick, SVMs are able to structure the decision boundary for non-linearly separable cases.

**Hyper-parameters like C or Gamma control how wiggly the SVM decision boundary can be.**
- the higher the C, the more penalty SVM is given when it misclassifies, so it fits the training points more tightly and the more wiggly the decision boundary will be

- the higher the gamma, the more influence individual data points will have on the decision boundary, and thereby the more wiggly the boundary will be
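One way to see the effect of gamma is to compare training and test accuracy of an RBF SVM as gamma grows. The sketch below assumes the synthetic `make_moons` dataset and three arbitrary gamma values; the exact numbers will depend on the data and the split.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-class data that is not linearly separable (synthetic).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

scores = {}
for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    scores[gamma] = (train_acc, test_acc)
    print(f"gamma={gamma:<6} train acc={train_acc:.2f} test acc={test_acc:.2f}")
```

A very high gamma tends to push training accuracy up while the gap to test accuracy widens, which is the over-fitting behaviour associated with a very wiggly boundary.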


In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

In [2]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"

colnames = ['sepal-length', 'sepal-width', 'petal-length',
            'petal-width', 'Class']

iris = pd.read_csv(url, names=colnames)


In [3]:
X = iris.drop('Class', axis=1)
y = iris['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

In [4]:
# poly kernel: degree sets the polynomial order (sklearn's default is 3)
clf = SVC(kernel='poly', degree=3)
clf.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='poly', max_iter=-1, probability=False, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [5]:
y_pred = clf.predict(X_test)

In [6]:
from sklearn.metrics import classification_report, confusion_matrix
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[13  0  0]
 [ 0  8  0]
 [ 0  1  8]]
                 precision    recall  f1-score   support

    Iris-setosa       1.00      1.00      1.00        13
Iris-versicolor       0.89      1.00      0.94         8
 Iris-virginica       1.00      0.89      0.94         9

       accuracy                           0.97        30
      macro avg       0.96      0.96      0.96        30
   weighted avg       0.97      0.97      0.97        30



In [10]:
# sigmoid kernel
clf = SVC(kernel='sigmoid', probability=True)
clf.fit(X_train, y_train)



SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='sigmoid', max_iter=-1, probability=True, random_state=None,
    shrinking=True, tol=0.001, verbose=False)

In [11]:
y_pred = clf.predict(X_test)

In [12]:
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[ 0 13  0]
 [ 0  8  0]
 [ 0  9  0]]
                 precision    recall  f1-score   support

    Iris-setosa       0.00      0.00      0.00        13
Iris-versicolor       0.27      1.00      0.42         8
 Iris-virginica       0.00      0.00      0.00         9

       accuracy                           0.27        30
      macro avg       0.09      0.33      0.14        30
   weighted avg       0.07      0.27      0.11        30





In [None]:
#try rbf kernel

## Pros 
- Good for datasets with more variables than observations
- Robust against outliers
- Good performance
- Good off-the-shelf model in general for several scenarios
- Can approximate complex non-linear functions

## Cons 
- Long training time required, especially on large datasets
- Tuning required to determine optimal kernel for non-linear SVMs

## Requirements
- Scaled features
- Null values filled
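
The requirements and the tuning step can be combined in one scikit-learn pipeline. The following is a sketch under stated assumptions: it uses `load_iris`, knocks out a few values artificially to mimic missing data, fills them with the column mean, and searches a small hand-picked grid; none of these choices come from the notebook above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = X.copy()

# Artificially remove ten values to mimic null entries (illustration only).
rng = np.random.default_rng(0)
rows = rng.choice(X.shape[0], size=10, replace=False)
cols = rng.integers(0, X.shape[1], size=10)
X[rows, cols] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Fill nulls, scale features, then fit the SVM. Doing this inside a pipeline
# means the imputer and scaler are fit on the training folds only.
pipe = make_pipeline(SimpleImputer(strategy="mean"), StandardScaler(), SVC())

# Small, hand-picked grid over kernel, C and gamma.
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": ["scale", 0.1, 1],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)

test_accuracy = search.score(X_test, y_test)
print("best parameters:", search.best_params_)
print("test accuracy:", test_accuracy)
```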