# Support Vector Machines

In this tutorial, you'll learn about **Support Vector Machines**, one of the most popular and widely used supervised machine learning algorithms.

**SVM** often achieves high accuracy compared with classifiers such as logistic regression and decision trees. It is known for the kernel trick, which lets it handle nonlinear input spaces. It is used in a variety of applications such as face detection, intrusion detection, classification of emails, news articles and web pages, classification of genes, and handwriting recognition.

SVM is an exciting algorithm and the concepts are relatively simple. The classifier separates data points using a hyperplane with the largest possible margin. That's why an SVM classifier is also known as a discriminative classifier. SVM finds an optimal hyperplane, which helps in classifying new data points.

## Support Vector Machines

Generally, **Support Vector Machines** are considered a classification approach, but they can be employed in both classification and regression problems. They can easily handle multiple continuous and categorical variables.

SVM constructs a hyperplane in multidimensional space to separate different classes. It searches for the optimal hyperplane iteratively, minimizing an error term. The core idea of SVM is to find a `maximum marginal hyperplane` (MMH) that best divides the dataset into classes.

### Support Vectors

_Support vectors_ are the data points that are closest to the hyperplane. These points define the separating line, because the margins are calculated from them. This makes them more relevant to the construction of the classifier than the other points.

### Hyperplane

A _hyperplane_ is a decision plane that separates sets of objects having different class memberships.
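As a minimal sketch of this idea, a linear hyperplane can be written as `w . x + b = 0`, and the sign of `w . x + b` tells you which side a point falls on. The toy data below is made up for illustration, and `w` and `b` are just local names for scikit-learn's `coef_` and `intercept_` attributes of a linear `SVC`.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D data: two well-separated groups (values invented for illustration)
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X, y)

w = clf.coef_[0]        # normal vector of the hyperplane
b = clf.intercept_[0]   # offset of the hyperplane

# Signed score of each point relative to the hyperplane: w . x + b
scores = X @ w + b

# The score matches decision_function, and its sign decides the class
same_scores = np.allclose(scores, clf.decision_function(X))
predicted = (scores > 0).astype(int)

print("matches decision_function:", same_scores)
print("predicted classes:", predicted)
```

Points with a positive score land on the side of class `1`, and points with a negative score on the side of class `0`.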

### Margin

A _margin_ is the gap between the two lines on the closest class points. It is calculated as the perpendicular distance from the line to the support vectors or closest points. A larger margin between the classes is considered a good margin; a smaller margin is a bad margin.
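The following sketch shows one way to inspect the support vectors and the margin of a fitted linear classifier. It assumes the standard result that, for a linear SVM, the margin width equals `2 / ||w||`. The four points are invented so that the classes sit on the lines `x = 0` and `x = 2`, which means the computed width should come out close to 2. The large `C` value is an assumption used to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Two classes sitting on the vertical lines x = 0 and x = 2 (toy data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM (no tolerated errors)
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print("support vectors:\n", clf.support_vectors_)
print("margin width:", margin_width)
```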

## How Does SVM Work?

The main objective is to segregate the given dataset in the best possible way. The distance between the nearest points of either class is known as the margin. SVM selects the hyperplane with the maximum possible margin between support vectors in the given dataset. SVM searches for the maximum marginal hyperplane in the following steps:
1. Generate hyperplanes that segregate the classes.
2. Select the hyperplane with the maximum segregation from the nearest data points of either class.

### Dealing with Non-Linear and Inseparable Planes

Some problems can't be solved using a linear hyperplane. In such situations, SVM uses a kernel trick to transform the input space to a higher dimensional space. In the transformed space, the data points are plotted on the x-axis and z-axis.

`z` is the sum of the squares of `x` and `y`: $z = x^2 + y^2$.
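To make the transformation concrete, here is a small sketch that adds `z = x^2 + y^2` as an explicit third feature. It uses scikit-learn's `make_circles` to generate two concentric rings, which is an assumed stand-in for the kind of data the text describes. A linear SVM struggles on the original two features but separates the classes once `z` is available.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: not separable by a straight line in (x, y)
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Linear SVM in the original 2-D space
acc_2d = SVC(kernel="linear").fit(X, y).score(X, y)

# Add the new dimension z = x^2 + y^2 (squared distance from the origin)
z = (X ** 2).sum(axis=1)
X_lifted = np.column_stack([X, z])

# Linear SVM in the lifted 3-D space
acc_3d = SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y)

print("training accuracy in 2-D:", acc_2d)
print("training accuracy in 3-D:", acc_3d)
```

A kernel achieves the same effect without building the extra feature by hand.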

## SVM Kernels

The SVM algorithm is implemented in practice using a kernel. A kernel transforms an input data space into the required form. SVM uses a technique called the kernel trick. Here, the kernel takes a low-dimensional input space and transforms it into a higher dimensional space.

In other words, it converts an inseparable problem into a separable one by adding more dimensions to it. It is most useful in non-linear separation problems. The kernel trick helps you build a more accurate classifier.

- **Linear Kernel**. A linear kernel is the ordinary dot product of any two given observations. The product between two vectors is the sum of the multiplication of each pair of input values, where $j$ indexes the features.
$$K(x, x_i) = \sum_{j} x_j \, x_{i,j}$$

- **Polynomial Kernel**. A polynomial kernel is a more generalized form of the linear kernel. The polynomial kernel can distinguish curved or nonlinear input space.
$$K(x, x_i) = \left(1 + \sum_{j} x_j \, x_{i,j}\right)^d$$
> ... where `d` is the degree of the polynomial. `d` equal to 1 is similar to the linear transformation. The degree needs to be manually specified in the learning algorithm.

- **Radial Basis Function Kernel**. The radial basis function (RBF) kernel is a popular kernel function commonly used in support vector machine classification. RBF can map an input space into an infinite dimensional space.
$$K(x, x_i) = \exp\left(-\gamma \sum_{j} (x_j - x_{i,j})^2\right)$$
> ... here, `gamma` is a positive parameter, typically chosen between 0 and 1. A higher value of `gamma` fits the training dataset more closely, which can cause over-fitting. `gamma` equal to 0.1 is considered to be a good default value. The value of `gamma` needs to be manually specified in the learning algorithm.
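The three kernels above can be sketched directly in NumPy. The two vectors and the values of `d` and `gamma` are arbitrary choices for illustration. The results are checked against the helper functions in `sklearn.metrics.pairwise`, with `gamma=1.0` and `coef0=1.0` passed to `polynomial_kernel` so that it matches the `(1 + sum)^d` form used here.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

# Two example observations (arbitrary values)
x = np.array([1.0, 2.0, 3.0])
xi = np.array([0.5, -1.0, 2.0])

d = 2        # polynomial degree (assumed for this example)
gamma = 0.1  # RBF width parameter (assumed for this example)

# Hand-written versions of the three formulas
k_linear = np.sum(x * xi)                        # 4.5
k_poly = (1 + np.sum(x * xi)) ** d               # 30.25
k_rbf = np.exp(-gamma * np.sum((x - xi) ** 2))   # exp(-1.025)

# The same values from scikit-learn, which expects 2-D inputs
X, Xi = x.reshape(1, -1), xi.reshape(1, -1)
sk_linear = linear_kernel(X, Xi)[0, 0]
sk_poly = polynomial_kernel(X, Xi, degree=d, gamma=1.0, coef0=1.0)[0, 0]
sk_rbf = rbf_kernel(X, Xi, gamma=gamma)[0, 0]

print(k_linear, sk_linear)
print(k_poly, sk_poly)
print(k_rbf, sk_rbf)
```

Note that `SVC(kernel='poly')` computes `(gamma * <x, xi> + coef0)^degree` with `coef0` defaulting to 0, so its defaults differ slightly from the formula above.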

## Classifier Building in Scikit-Learn

Until now, you have learned about the theoretical background of SVM. Now you will learn about its implementation in Python using scikit-learn.

To build the model, you can use the breast cancer dataset, a well-known binary classification problem. Its features are computed from a digitized image of a `fine needle aspirate` (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image.

The dataset comprises 30 features and a target (type of cancer).

The data has two classes of cancer: malignant and benign. Here, you can build a model to classify the type of cancer. The dataset is available in the scikit-learn library, or you can download it from the UCI Machine Learning Repository.

### Loading Data

In [2]:
# import scikit-learn dataset library
from sklearn import datasets

In [3]:
# load dataset
cancer = datasets.load_breast_cancer()

### Exploring Data

In [5]:
# print the names of the 30 features
print('Features: ', cancer.feature_names)

# print the label type of cancer (malignant, benign)
print('Labels:', cancer.target_names)

# print data shape
cancer.data.shape

Features:  ['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']
Labels: ['malignant' 'benign']


(569, 30)

In [6]:
# print the cancer data features (top 5 records)
print(cancer.data[0:5])

[[1.799e+01 1.038e+01 1.228e+02 1.001e+03 1.184e-01 2.776e-01 3.001e-01
  1.471e-01 2.419e-01 7.871e-02 1.095e+00 9.053e-01 8.589e+00 1.534e+02
  6.399e-03 4.904e-02 5.373e-02 1.587e-02 3.003e-02 6.193e-03 2.538e+01
  1.733e+01 1.846e+02 2.019e+03 1.622e-01 6.656e-01 7.119e-01 2.654e-01
  4.601e-01 1.189e-01]
 [2.057e+01 1.777e+01 1.329e+02 1.326e+03 8.474e-02 7.864e-02 8.690e-02
  7.017e-02 1.812e-01 5.667e-02 5.435e-01 7.339e-01 3.398e+00 7.408e+01
  5.225e-03 1.308e-02 1.860e-02 1.340e-02 1.389e-02 3.532e-03 2.499e+01
  2.341e+01 1.588e+02 1.956e+03 1.238e-01 1.866e-01 2.416e-01 1.860e-01
  2.750e-01 8.902e-02]
 [1.969e+01 2.125e+01 1.300e+02 1.203e+03 1.096e-01 1.599e-01 1.974e-01
  1.279e-01 2.069e-01 5.999e-02 7.456e-01 7.869e-01 4.585e+00 9.403e+01
  6.150e-03 4.006e-02 3.832e-02 2.058e-02 2.250e-02 4.571e-03 2.357e+01
  2.553e+01 1.525e+02 1.709e+03 1.444e-01 4.245e-01 4.504e-01 2.430e-01
  3.613e-01 8.758e-02]
 [1.142e+01 2.038e+01 7.758e+01 3.861e+02 1.425e-01 2.839e-01 2.414

In [7]:
# print the cancer labels (0: malignant, 1: benign)
print(cancer.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 1 0 0 1 1 1 1 0 1 0 0 1 1 1 1 0 1 0 0
 1 0 1 0 0 1 1 1 0 0 1 0 0 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 1 1 1 0 1 1 0 1 1
 1 1 1 1 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 1
 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 1 1 0 0 1 1 0 0 1 1 1 1 0 1 1 0 0 0 1 0
 1 0 1 1 1 0 1 1 0 0 1 0 0 0 0 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 1 0 0 1 1
 1 0 1 1 1 1 1 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 1 1 0 1 0 0 1 1 1 1 1 1 1 1 1 1 1 1
 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 0 1 1 1 1 0 0 0 1 1
 1 1 0 1 0 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0
 0 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1
 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 0 1 1 1 1 1 0 1 1
 0 1 0 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1
 1 1 1 1 1 1 0 1 0 1 1 0 

### Splitting Data

To understand model performance, dividing the dataset into a training set and a test set is a good strategy.

In [8]:
from sklearn.model_selection import train_test_split

In [9]:
# split dataset
xtrain, xtest, ytrain, ytest = train_test_split(cancer.data, cancer.target, test_size=0.3)
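Because the split above is random, the scores reported later in this tutorial will vary slightly from run to run. A hedged variant, shown below, fixes the seed and stratifies by class. The seed value `42` is an arbitrary assumption, and `stratify` keeps the malignant/benign ratio the same in both sets.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()

# 70% training and 30% test, reproducible and class-balanced
xtrain, xtest, ytrain, ytest = train_test_split(
    cancer.data,
    cancer.target,
    test_size=0.3,
    random_state=42,          # arbitrary seed, chosen only for repeatability
    stratify=cancer.target,   # preserve the class ratio in both sets
)

print(xtrain.shape, xtest.shape)
```

With a different seed, or none at all, the accuracy, precision, and recall values will not match the outputs shown in this tutorial exactly.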

### Generating Model

First, import the SVM module and create a support vector classifier object by passing `kernel='linear'` to the `SVC()` function.

In [13]:
# import SVM model
from sklearn import svm

# create an SVM classifier
clf = svm.SVC(kernel='linear')

# train the model using the training sets
clf.fit(xtrain, ytrain)

# predict the response for test dataset
ypred = clf.predict(xtest)

### Evaluating the Model

Let's estimate how accurately the classifier can predict the type of breast cancer. _Accuracy_ can be computed by comparing actual test set values and predicted values.

In [14]:
from sklearn import metrics

In [15]:
# how often is the classifier correct?
print('Accuracy:', metrics.accuracy_score(ytest, ypred))

Accuracy: 0.9824561403508771


For further evaluation, you can also check the precision and recall of the model.

In [16]:
# what percentage of predicted positives are actually positive?
print('Precision:', metrics.precision_score(ytest, ypred))

# what percentage of actual positives are correctly labeled?
print('Recall:', metrics.recall_score(ytest, ypred))

Precision: 0.9908256880733946
Recall: 0.9818181818181818
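To see where precision and recall come from, it can help to look at the confusion matrix. The sketch below repeats the loading, splitting, and training steps so that it runs on its own. The seed `0` is an assumption, so the numbers will differ from the outputs above. Note that `precision_score` and `recall_score` treat label `1` (benign) as the positive class by default.

```python
from sklearn import datasets, metrics, svm
from sklearn.model_selection import train_test_split

cancer = datasets.load_breast_cancer()
xtrain, xtest, ytrain, ytest = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=0)

clf = svm.SVC(kernel="linear").fit(xtrain, ytrain)
ypred = clf.predict(xtest)

# Rows are actual classes, columns are predicted classes
cm = metrics.confusion_matrix(ytest, ypred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(cm)
print("Precision:", precision)
print("Recall:", recall)

# Per-class summary with readable label names
print(metrics.classification_report(ytest, ypred,
                                    target_names=cancer.target_names))
```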


## Tuning Hyperparameters

**Kernel**. The main function of the kernel is to transform the given input data into the required form. There are various types of kernel functions, such as `linear`, `polynomial`, and `radial basis function` (RBF).

`polynomial` and `RBF` kernels compute the separation line in the higher dimension. In some of the applications, it is suggested to use a more complex kernel to separate the classes that are curved or nonlinear. This transformation can lead to more accurate classifiers.

**Regularization**. The `C` parameter controls the strength of regularization. Here, `C` is the penalty parameter applied to misclassification, or the error term.

The misclassification term tells the SVM optimization how much error is bearable. This is how you can control the trade-off between the decision boundary and the misclassification term. A smaller value of `C` creates a larger-margin hyperplane that tolerates more misclassified points, and a larger value of `C` creates a smaller-margin hyperplane.
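The effect of `C` on the margin can be observed directly. This sketch uses two overlapping clusters generated with `make_blobs` (an assumed toy dataset, not the cancer data) and reports the margin width `2 / ||w||` and the number of support vectors for a few values of `C`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some misclassification is unavoidable
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

margins = {}
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Margin width of a linear SVM is 2 / ||w||
    margins[C] = 2.0 / np.linalg.norm(clf.coef_[0])
    print(f"C={C:<6} margin width={margins[C]:.3f} "
          f"support vectors={clf.n_support_.sum()}")
```

The smallest `C` should give the widest margin, since errors are penalized the least.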

**Gamma**. A lower value of `gamma` will loosely fit the training dataset, whereas a higher value of `gamma` will closely fit the training dataset, which causes over-fitting.

In other words, a low `gamma` lets points far from the separation line influence its calculation, while a high `gamma` considers only nearby points.
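One common way to choose these hyperparameters is a cross-validated grid search. The sketch below is only an illustration: the grid values, the seed, and the 5-fold setting are assumptions, and the `StandardScaler` step is an addition not used elsewhere in this tutorial (SVMs are generally sensitive to feature scale). The `svc__` prefix comes from the step name that `make_pipeline` assigns to `SVC`.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

cancer = datasets.load_breast_cancer()
xtrain, xtest, ytrain, ytest = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=0)

# Scale the features, then fit an SVM
pipe = make_pipeline(StandardScaler(), SVC())

# Candidate values for the three hyperparameters discussed above
param_grid = {
    "svc__kernel": ["linear", "rbf"],
    "svc__C": [0.1, 1, 10],
    "svc__gamma": [0.001, 0.01, 0.1],   # ignored by the linear kernel
}

# 5-fold cross-validation on the training set only
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(xtrain, ytrain)

test_accuracy = search.score(xtest, ytest)
print("Best parameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
```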

## Advantages

**SVM** classifiers offer good accuracy and make faster predictions than the **Naive Bayes** algorithm. They also use less memory because they use only a subset of the training points in the decision phase. SVM works well with a clear margin of separation and with high dimensional space.

## Disadvantages

**SVM** is not suitable for large datasets because of its high training time; it takes longer to train than **Naive Bayes**. It works poorly with overlapping classes and is also sensitive to the type of kernel used.