# Support Vector Machines (SVM)

```{warning}
This is a heavy-math topic! See Vorontsov's [slides](http://machinelearning.ru/wiki/images/archive/a/a0/20150316112120!Voron-ML-Lin-SVM.pdf) for details
```

# Introduction

The support vector machine (SVM) is a supervised learning algorithm used for classification and regression.
It is based on the idea of finding a hyperplane that best separates the data points into different classes.
The hyperplane is chosen so that it maximizes the margin between the two classes. The support vector machine is a generalization of the maximal margin classifier and the support vector classifier.
Therefore, before starting, we need to briefly cover the maximal margin classifier and the support vector classifier.


#### Maximal Margin Classifier

The maximal margin classifier is a simple linear classifier that separates the data points into different classes by finding a hyperplane that maximizes the margin between the two classes.
The margin is defined as the distance between the hyperplane and the closest data points from each class.
The maximal margin classifier is a special case of the support vector classifier.

The maximal margin classifier is designed specifically for linearly separable data, that is, data whose classes can be separated by a hyperplane without any errors.
However, this classifier has some drawbacks. It relies heavily on the support vectors and changes whenever they change, which makes it prone to overfitting.
It also can’t be used for data that isn’t linearly separable, which makes it unsuitable for most real-world data.

<img src="./svm/mmc.png" alt="Maximal margin classifier" />
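To make the role of the support vectors concrete, here is a minimal sketch in one dimension, where the "hyperplane" is just a threshold on a single feature. The data values are made up for illustration, and the code assumes the classes are separable with every negative point to the left of every positive point.

```python
# Toy 1-D data (made up for illustration): one feature per point.
negatives = [1.0, 2.0, 2.5]
positives = [4.5, 5.0, 7.0]

# Only the closest point from each class (the support vectors)
# determines the boundary; the other points have no influence.
sv_neg = max(negatives)  # negative point closest to the positives
sv_pos = min(positives)  # positive point closest to the negatives

threshold = (sv_neg + sv_pos) / 2  # boundary halfway between them
margin = (sv_pos - sv_neg) / 2     # distance from the boundary to each support vector


def predict(x):
    """Return +1 on the positive side of the threshold, -1 otherwise."""
    return 1 if x > threshold else -1


print(threshold, margin, predict(6.0), predict(0.0))
```

Moving either support vector shifts the threshold, while moving any other point (without crossing the support vectors) leaves it unchanged. This is the sensitivity described above.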

#### Support Vector Classifier

The support vector classifier is a linear classifier that also separates the classes with a hyperplane and a margin, but it allows some data points to fall inside the margin or even on the wrong side of the hyperplane. The support vector classifier is a special case of the support vector machine.

The support vector classifier is an extension of the maximal margin classifier and is less sensitive to individual data points. Since it allows certain data points to be misclassified, it’s also known as the “Soft Margin Classifier”. It sets a budget that limits how much misclassification is allowed. This classifier addresses the drawbacks of the maximal margin classifier by tolerating some misclassification and relying less on individual support vectors.

<img src="./svm/svc.png">
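The misclassification budget can be made concrete with slack values. This is a minimal sketch, assuming the standard soft-margin definition $\xi_i = \max(0, 1 - y_i(\omega \cdot x_i + b))$, which the text above does not spell out. The hyperplane (`w`, `b`) and the four points are hand-picked for illustration, not fitted by any solver.

```python
import numpy as np

# Hypothetical hyperplane w . x + b = 0 (hand-picked, not learned).
w = np.array([1.0, 1.0])
b = -3.0

# Hypothetical labelled points.
X = np.array([
    [3.0, 2.0],  # positive, well outside the margin
    [2.0, 1.5],  # positive, correct side but inside the margin
    [1.0, 1.0],  # negative, exactly on the margin
    [2.5, 1.5],  # negative, on the wrong side of the hyperplane
])
y = np.array([1, 1, -1, -1])

scores = X @ w + b

# Slack is 0 for points on or outside the margin, between 0 and 1 for
# points inside the margin, and above 1 for misclassified points.
slack = np.maximum(0.0, 1.0 - y * scores)
misclassified = slack > 1.0

print(slack, slack.sum(), misclassified)
```

In this formulation, the sum of the slack values is the quantity that the budget limits: a larger budget tolerates more margin violations.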

#### Hyperplanes

Hyperplanes are decision boundaries that help classify the data points. Data points falling on either side of the hyperplane can be attributed to different classes. The dimension of the hyperplane depends on the number of features: with n input features, the hyperplane has n − 1 dimensions. If the number of input features is 2, then the hyperplane is just a line. If the number of input features is 3, then the hyperplane becomes a two-dimensional plane. It becomes difficult to imagine when the number of features exceeds 3.

# Understanding the mathematics behind SVM

Many people skip the math intuition behind this algorithm because it is pretty hard to digest. Here in this section, we’ll try to understand each and every step working under the hood.

#### Understanding Dot-Product

<img src="./svm/math_1.png" alt="Dot product of vectors a and b" />

Here a and b are two vectors. To find the dot product between them, we first find the magnitude of each vector (using the Pythagorean theorem or the distance formula). We then multiply the two magnitudes by the cosine of the angle between the vectors. Mathematically it can be written as:

$$\overrightarrow{a} \cdot \overrightarrow{b} = \vert a \vert \cos\theta \cdot \vert b \vert$$

Now in SVM we only need the projection of a onto b, not the magnitude of b; we’ll see why in the next section. To get just the projection, we replace b with its unit vector. The equation then becomes:

$$\overrightarrow{a} \cdot \frac{\overrightarrow{b}}{\vert b \vert} = \vert a \vert \cos\theta$$
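The two formulas can be checked numerically. The following is a small sketch using NumPy; the vectors are arbitrary values chosen so the arithmetic is easy to follow.

```python
import numpy as np

# Arbitrary example vectors.
a = np.array([3.0, 4.0])
b = np.array([2.0, 0.0])

norm_a = np.linalg.norm(a)  # magnitude of a
norm_b = np.linalg.norm(b)  # magnitude of b

dot = np.dot(a, b)                   # a . b
cos_theta = dot / (norm_a * norm_b)  # cosine of the angle between a and b

# Dotting a with the unit vector of b leaves only |a| cos(theta),
# i.e. the length of the projection of a onto b.
b_unit = b / norm_b
projection = np.dot(a, b_unit)

print(dot, cos_theta, projection)
```

Here the projection equals the first coordinate of `a`, because `b` points along the first axis.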

#### Use of Dot Product in SVM

Consider an arbitrary point X. We want to know whether it lies on the right side of the hyperplane or the left side (positive or negative).

<img src="./svm/math_2.png" alt="A point X and the decision boundary" />

To find this, we first treat the point as a vector (X) and then construct a vector (w) that is perpendicular to the hyperplane. Let’s say the distance from the origin to the decision boundary, measured along w, is ‘c’. Now we take the projection of the X vector onto w.

<img src="./svm/math_3.png" alt="Projection of X onto w" />

We already know that the projection of one vector onto another is given by the dot product with that vector’s unit vector. Hence, taking w to have unit length, we take the dot product of the x and w vectors. If the dot product is greater than ‘c’ then we can say that the point lies on the right side. If the dot product is less than ‘c’ then the point is on the left side, and if the dot product is equal to ‘c’ then the point lies on the decision boundary.

$\overrightarrow{x} \cdot \overrightarrow{w} = c$ *(the point lies on the decision boundary)*<br/>
$\overrightarrow{x} \cdot \overrightarrow{w} > c$ *(positive samples)*<br/>
$\overrightarrow{x} \cdot \overrightarrow{w} < c$ *(negative samples)*
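The decision rule above can be sketched in a few lines. The normal vector `w`, the offset `c`, and the test points are hypothetical values; `w` is chosen with unit length so that the dot product is exactly the projection of the point onto `w`.

```python
import numpy as np

# Hypothetical unit-length normal vector and offset of the boundary.
w = np.array([0.6, 0.8])
c = 2.0


def side(x, w, c):
    """Return +1 (positive side), -1 (negative side) or 0 (on the boundary)."""
    projection = np.dot(x, w)
    if np.isclose(projection, c):
        return 0
    return 1 if projection > c else -1


print(side(np.array([5.0, 5.0]), w, c))  # projection 7.0, beyond the boundary
print(side(np.array([1.0, 1.0]), w, c))  # projection 1.4, before the boundary
print(side(np.array([2.0, 1.0]), w, c))  # projection 2.0, on the boundary
```

The comparison with `np.isclose` is a design choice for floating-point inputs; an exact `==` check would rarely report a point as lying on the boundary.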

#### Equation of the Hyperplane, Margin

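The condition above was illustrated with two-dimensional vectors, but it works for any number of dimensions. Moving $c$ to the left-hand side and writing $b = -c$ (with $\omega$ denoting the normal vector $w$) gives the equation of the hyperplane: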

$$\omega \cdot x + b = 0$$

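Distance from the decision boundary to the closest data points (the support vectors), when $\omega$ and $b$ are scaled so that $\vert \omega \cdot x + b \vert = 1$ at those points:<br/>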
$$\text{Margin} = \frac{1}{\Vert \omega \Vert}$$
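The margin formula follows from the general point-to-hyperplane distance $\vert \omega \cdot x + b \vert / \Vert \omega \Vert$. The sketch below checks this numerically, assuming the usual scaling in which the closest points satisfy $\vert \omega \cdot x + b \vert = 1$. The values of `w`, `b`, and the point `x_sv` are hypothetical.

```python
import numpy as np

# Hypothetical hyperplane w . x + b = 0.
w = np.array([3.0, 4.0])
b = -10.0


def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from point x to the hyperplane w . x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)


# A point chosen so that w . x + b = 1, i.e. it sits exactly on the margin.
x_sv = np.array([3.0, 0.5])

margin = 1.0 / np.linalg.norm(w)

print(distance_to_hyperplane(x_sv, w, b), margin)
```

Because the numerator is fixed at 1 for the closest points, making the margin larger is the same as making $\Vert \omega \Vert$ smaller.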

# Kernels in Support Vector Machine

The most interesting feature of SVM is that it can even work with a non-linear dataset. For this, we use the “Kernel Trick”, which makes it easier to classify the points. Suppose we have a dataset like this:

<img src="./svm/kernel_1.png" alt="A dataset that is not linearly separable" />

Here we see that we cannot draw a single line, or hyperplane, that classifies the points correctly. So we map this lower-dimensional space to a higher-dimensional space using a non-linear function (for example, a quadratic one), which allows us to find a decision boundary that clearly divides the data points. The functions that help us do this are called kernels, and which kernel to use is typically decided by hyperparameter tuning.

<img src="./svm/kernel_2.png" alt="The data mapped to a higher-dimensional space" />
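As a rough illustration of the idea, the sketch below uses one particular quadratic feature map `phi` from 2-D to 3-D together with the matching kernel $K(x, z) = (x \cdot z)^2$. Both are illustrative choices rather than the specific kernel behind the figures, and the ring-shaped data is made up. The sketch shows two things: the kernel gives the same value as a dot product in the higher-dimensional space without ever computing `phi`, and data that no line separates in 2-D becomes separable by a hyperplane after the mapping.

```python
import numpy as np


def phi(x):
    """Explicit quadratic feature map from 2-D to 3-D (illustrative choice)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])


def quadratic_kernel(x, z):
    """Kernel computed in the original 2-D space: K(x, z) = (x . z)^2."""
    return np.dot(x, z) ** 2


x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])
explicit = np.dot(phi(x), phi(z))  # dot product in the 3-D feature space
implicit = quadratic_kernel(x, z)  # same value, computed in 2-D

# Hypothetical ring-shaped data: one class near the origin, one farther out.
inner = np.array([[0.5, 0.0], [0.0, -0.5]])
outer = np.array([[2.0, 0.0], [0.0, -2.0], [1.5, 1.5]])

# In feature space, w_feat . phi(x) = x1^2 + x2^2, so a flat hyperplane
# there corresponds to a circle in the original space.
w_feat = np.array([1.0, 0.0, 1.0])
threshold = 1.0
inner_scores = np.array([np.dot(w_feat, phi(p)) for p in inner])
outer_scores = np.array([np.dot(w_feat, phi(p)) for p in outer])

print(explicit, implicit)
print(inner_scores < threshold, outer_scores > threshold)
```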