# Support Vector Machines

Support Vector Machines, or SVMs, are powerful tools in supervised machine learning, used for classification and prediction. They are especially handy when dealing with complicated data that doesn't neatly split into different groups. They've become popular because they cope well with data that has many features and can be applied to all sorts of real-world problems, like figuring out what's in an image or sorting through text.

## Understanding the Basics

Support Vector Machines (SVMs) are all about finding the best way to separate different groups of data. Imagine you have a bunch of points scattered in space, and you want to draw a line between them so that the gap between the groups is as big as possible. That line is what we call a hyperplane. SVMs sit between two approaches in machine learning: some methods rely heavily on a fixed set of rules (parametric), while others are more flexible and learn directly from the data (nonparametric). On their own, SVMs assume the data can be separated by a straight line (or, with more features, a flat hyperplane). When things get more complicated, SVMs use special tools called kernels, which work like an upgrade to a higher-dimensional world where separation is easier.

### Support Vector Points:

Support vector points are the MVPs in SVMs. They're the ones closest to the dividing line, influencing where it's placed and how it tilts. Only these special points really matter in determining the line; all the others don't make a difference. So, if we move or remove any of the other points (without pushing them into the margin), it won't change the result. But if we mess with the support vectors, it will affect where the line ends up, possibly changing how things get classified.

### Hyperplane Dimension:

The hyperplane's dimension depends on how many features we have in our data. If we're working with just two features, the hyperplane is a simple line. Add one more feature, and suddenly it's a flat surface. But if we have lots of features, visualizing the hyperplane becomes a challenge. Despite this, the idea of dividing the space between different groups remains the same, whether it's a line, a plane, or something even more complex.

### Margin:

The margin is like a safety buffer between the dividing line and the nearest points from each group. A wider buffer leaves more room for new points to land on the correct side, which reduces the chances of mistakes in classification. Ideally, we want to maximize this margin, making sure there's plenty of space between groups. When all the support vectors have the same distance from the line, we call it a "Good Margin." But if the distances vary a lot, leading to inconsistent results, it's a "Bad Margin."


## Mathematics Behind SVMs

### 1. Linear SVMs:

In the case of linearly separable data, the decision boundary can be represented by the equation of a hyperplane:

$ w \cdot x + b = 0 $

where:
- $ w $ is the weight vector perpendicular to the hyperplane,
- $ x $ is the input feature vector,
- $ b $ is the bias term.

The goal of the SVM algorithm is to find the optimal values for $ w $ and $ b $ that maximize the margin between the classes while minimizing the classification error. This can be formulated as an optimization problem.
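As a small illustration of the hyperplane equation, the sketch below evaluates $ w \cdot x + b $ for a few points and uses its sign as the predicted class. The weights, bias, and points are made-up values, and the helper names `decision_value` and `predict` are hypothetical, not taken from any library.

```python
def decision_value(w, x, b):
    """Return w . x + b, the signed score of x relative to the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, x, b):
    """Assign +1 on the non-negative side of the hyperplane and -1 otherwise."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical 2-D hyperplane: x1 + x2 - 3 = 0
w = [1.0, 1.0]
b = -3.0

for point in ([3.0, 2.0], [0.5, 1.0], [1.0, 2.0]):
    print(point, decision_value(w, point, b), predict(w, point, b))
```

The last point lies exactly on the hyperplane; assigning it to class +1 is an arbitrary tie-breaking choice made for this sketch.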

### Optimization Objective:

We are given a training dataset $ \{(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)\} $, where $ x_i $ is the feature vector and $ y_i $ is the class label of the $ i $-th example.

<center><img src="./imgs/objective.png"/></center>

All data points $ x_i $ belonging to class A have label $ y = 1 $.  
All data points $ x_i $ belonging to class B have label $ y = -1 $.

Hence, 

$$
y_i = 
\begin{cases} 
-1, & \text{if } (w \cdot x_i + b) \leq -1 \\ 
+1, & \text{if } (w \cdot x_i + b) \geq +1
\end{cases} 
$$

We establish the optimization objective for our hyperplane:

1. Predicts a value greater than or equal to 1 for all data points in Class A (when $ y = 1 $).
2. Predicts a value less than or equal to -1 for all data points in Class B.

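The margin is the distance between the two parallel hyperplanes $ w \cdot x + b = +1 $ and $ w \cdot x + b = -1 $ on which the support vectors lie. Taking a support vector $ x_+ $ from class A and a support vector $ x_- $ from class B, and projecting their difference onto the unit normal $ w / \|w\| $, gives:

$$
\text{margin} = \frac{w \cdot (x_+ - x_-)}{\|w\|} = \frac{(1 - b) - (-1 - b)}{\|w\|} = \frac{1 - (-1)}{\|w\|} = \frac{2}{\|w\|}
$$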

The best hyperplane is the one with the largest margin. Therefore, to maximize the margin, we minimize $ \|w\| $. 
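To make the link between $ \|w\| $ and the margin concrete, the short sketch below computes $ 2 / \|w\| $ for two made-up weight vectors. The function name `margin_width` is a hypothetical helper chosen for this example.

```python
import math


def margin_width(w):
    """Width of the margin, 2 / ||w||, for a weight vector w."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return 2.0 / norm


w_small_norm = [0.5, 0.5]  # made-up weights with a small norm
w_large_norm = [3.0, 4.0]  # made-up weights with norm 5

print(margin_width(w_small_norm))
print(margin_width(w_large_norm))
```

The weight vector with the smaller norm gives the wider margin, which is why the optimization works on $ \|w\| $.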

To facilitate solving this minimization problem, we take the square of this value and halve it, resulting in the following minimization problem with the aforementioned condition:

$$
\text{minimize} \frac{1}{2} \| w \|^2 
$$

subject to the constraints:

$$
y_i(w \cdot x_i + b) \geq 1 \quad \text{for all } i = 1, 2, ..., m 
$$

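This is a non-linear programming problem (more precisely, a convex quadratic program) that can be solved with Lagrange multipliers and the Karush-Kuhn-Tucker (KKT) conditions. Applying them gives:

$$w = \sum^{m}_{i=1}\lambda_i y_i x_i  \quad \text{and} \quad \sum^{m}_{i=1}\lambda_i y_i = 0$$

where $m$ is the number of training examples and $\lambda_i \geq 0$ is the Lagrange multiplier of the $i$-th constraint. Only support vectors can have $\lambda_i > 0$, which is why the other points do not affect the hyperplane.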
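The two conditions can be checked by hand on a tiny problem. The sketch below uses a made-up dataset of two points, one per class, and multipliers of 0.25 that were worked out by hand for this specific toy case (an assumption of the example, not the output of a solver). The bias is recovered from a support vector using $ y_k (w \cdot x_k + b) = 1 $.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Made-up toy problem: one point per class, so both are support vectors
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
lam = [0.25, 0.25]  # hand-derived multipliers for this toy case (assumption)

# First condition: w = sum_i lambda_i * y_i * x_i
w = [sum(lam[i] * y[i] * X[i][j] for i in range(len(X))) for j in range(len(X[0]))]

# Second condition: sum_i lambda_i * y_i must equal 0
balance = sum(l * yi for l, yi in zip(lam, y))

# For a support vector, y_k * (w . x_k + b) = 1, so b = y_k - w . x_k
b = y[0] - dot(w, X[0])

print("w =", w, "b =", b, "sum(lambda * y) =", balance)
print([yi * (dot(w, xi) + b) for xi, yi in zip(X, y)])
```

Both points sit exactly on the margin, so each constraint holds with equality.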

### Hard Margin SVM:
In a hard margin SVM, the algorithm aims to find a hyperplane that perfectly separates the two classes in the feature space, with no misclassifications. This works well when the data is linearly separable. However, in real-world scenarios, data is often noisy or contains outliers, making it difficult to find a hyperplane that perfectly separates the classes without errors.

### Soft Margin SVM:
In contrast, a soft margin SVM allows for some misclassifications by introducing a margin of tolerance, represented by the parameter $C$. This parameter controls the trade-off between maximizing the margin and minimizing the classification error. A smaller value of $C$ allows for a wider margin and permits more misclassifications, while a larger value of $C$ results in a narrower margin with fewer misclassifications.

To allow for misclassifications, we make the following adjustment to the SVM objective function:

$$
\text{minimize} \quad \frac{1}{2} \| w \|^2 + C \sum^{m}_{i=1}\xi_i
$$

subject to the constraints:

$$
y_i(w \cdot x_i + b) \geq 1 -\xi_i \quad \text{and} \quad \xi_i \geq 0 \quad \text{for all } i = 1, 2, ..., m 
$$

$ \xi_i $ (xi) is known as the slack variable or penalty. It measures how far a point lies on the wrong side of its class's margin, and it is zero for points that respect the margin.

<center><img src="./imgs/soft-margin.png"/></center>

Since we are multiplying $\xi$ by $C$, lower values of $C$ give lower penalties to misclassified data points. The objective function puts less emphasis on minimising those penalties, so the SVM won't try as hard to separate the data and therefore produces a more generalisable model, which reduces overfitting. With larger values of $C$, the penalties for misclassified data points are bigger, so the SVM tries hard not to make mistakes and separates the data more strictly in order to minimise the objective function. This may lead to overfitting, as shown in the non-linear SVM above.

By allowing for a soft margin, the SVM can generalize better to unseen data and is more robust to noisy or overlapping classes. The soft margin SVM strikes a balance between finding a hyperplane with a large margin and minimizing classification errors, making it suitable for real-world datasets where perfect separation may not be feasible.
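The effect of $ C $ can be seen by evaluating the soft margin objective for two candidate hyperplanes on a small dataset with one outlier. This is only a sketch: the data and both candidates are made up, neither candidate is claimed to be the optimum, and the slack is computed as $ \xi_i = \max(0, 1 - y_i(w \cdot x_i + b)) $, which is the smallest value that satisfies the constraint.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def slack(w, b, x, y):
    """Smallest slack that satisfies y * (w . x + b) >= 1 - slack."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of slacks)."""
    total_slack = sum(slack(w, b, xi, yi) for xi, yi in zip(X, y))
    return 0.5 * dot(w, w) + C * total_slack


# Made-up data: the last point is a class -1 outlier close to class +1
X = [[2.0, 2.0], [1.0, 1.0], [-2.0, -2.0], [0.5, 0.5]]
y = [1, 1, -1, -1]

wide = ([0.5, 0.5], 0.0)     # wide margin, but the outlier needs slack
narrow = ([2.0, 2.0], -3.0)  # narrow margin, no slack needed

for C in (0.1, 10.0):
    obj_wide = soft_margin_objective(wide[0], wide[1], X, y, C)
    obj_narrow = soft_margin_objective(narrow[0], narrow[1], X, y, C)
    print(C, obj_wide, obj_narrow)
```

With the small $ C $ the wide-margin candidate has the lower objective despite its misclassified outlier, while with the large $ C $ the narrow-margin candidate wins.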

## Hinge Loss

The hinge loss is the loss function used to train "maximum-margin" classifiers, most notably support vector machines (SVMs).

It applies to binary classification tasks where the labels are either +1 or -1, and it measures how far a predicted score falls short of the margin on the correct side of the decision boundary. The score for an input $ x $ is the raw output of the decision function, before it is turned into a class label:

$$
f(x) = w \cdot x + b
$$

Formally, the hinge loss $ \mathcal{L}(y, f(x)) $ for a sample with true label $ y $ and predicted score $ f(x) $ is:

$$ \mathcal{L}(y, f(x)) = \max(0, 1 - y \cdot f(x)) $$

Here, $ y $ is the true label (+1 or -1) and $ f(x) $ is the score produced by the decision function for input $ x $. The loss is zero for points that are correctly classified with enough margin, and it grows linearly the further a point lies on the wrong side of that margin.

The term $ y \cdot f(x) $ measures how correct and how confident the prediction is. If the product is at least 1, the point is correctly classified and lies on or outside the margin, so the loss is zero. If the product is between 0 and 1, the point is still correctly classified but falls inside the margin, and it incurs a loss between 0 and 1. If the product is negative, the point is misclassified and the loss exceeds 1, growing linearly with the distance from the margin.
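The behaviour of the loss for different values of $ y \cdot f(x) $ can be checked with a few lines of code. The sketch below is a direct transcription of the formula; the function name `hinge_loss` and the sample scores are chosen for illustration.

```python
def hinge_loss(y, score):
    """Hinge loss max(0, 1 - y * score) for a label y in {+1, -1}."""
    return max(0.0, 1.0 - y * score)


# Scores for a sample whose true label is +1
for score in (2.0, 1.0, 0.5, 0.0, -1.0):
    print(score, hinge_loss(+1, score))
```

The loss is 0 for the scores 2.0 and 1.0, then 0.5, 1.0 and 2.0 as the score moves into the margin, onto the boundary, and onto the wrong side.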

Hinge loss encourages wide margins, which makes the model more robust to outliers and helps it generalize to unseen data. This is why it is so common in SVMs and similar algorithms for binary classification. The hinge loss is also convex, which makes the training problem easier to optimize.
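Because the loss is convex, a linear SVM can be trained with simple gradient-style updates. The sketch below is one possible way to do that, using stochastic subgradient descent on $ \frac{1}{2}\|w\|^2 + C \cdot \text{hinge} $ applied one sample at a time. It is not how any particular library trains SVMs, and the learning rate, epoch count, function names, and toy data are all assumptions made for this example.

```python
def train_linear_svm(X, y, C=1.0, lr=0.01, epochs=200):
    """Stochastic subgradient descent on 0.5*||w||^2 + C*hinge, per sample."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:
                # Inside the margin or misclassified: the hinge term is active,
                # so its subgradient (-yi * xi for w, -yi for b) is included.
                w = [wj - lr * (wj - C * yi * xj) for wj, xj in zip(w, xi)]
                b = b + lr * C * yi
            else:
                # Margin satisfied: only the regularizer 0.5*||w||^2 contributes.
                w = [wj - lr * wj for wj in w]
    return w, b


def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


# Made-up, linearly separable toy data
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print("w =", w, "b =", b)
print([predict(w, b, xi) for xi in X])
```

The learned weights are approximate and depend on the step size, but on this toy data the sign of the score matches every training label.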

<center><img src="./imgs/hinge.png"/></center>

  
In the plot above, the x-axis shows the distance of each instance from the boundary, while the y-axis shows the loss incurred at that distance. A few observations:

- The dashed line at $ x = 1 $ marks the margin. Instances at or beyond it incur zero loss.

- Instances exactly on the decision boundary, at a distance of 0, incur a loss of 1, a penalty for being on the verge of misclassification.

- Correctly classified points incur little or no loss, while misclassified points incur large penalties.

- Instances at a negative distance are on the wrong side of the boundary and incur a loss greater than 1.

- Instances at a positive distance incur a loss below 1, and the loss shrinks as the distance grows until it reaches zero at the margin.


### 2. Non-linear SVMs:

For data that are not linearly separable, Support Vector Machines (SVMs) employ a technique known as the **kernel trick**. This technique enables SVMs to implicitly map the input features into a higher-dimensional space, where the data becomes separable by a hyperplane. Common kernel functions used for this purpose include linear, polynomial, and radial basis function (RBF) kernels.  


### The Polynomial Kernel

The polynomial kernel enriches the feature space by adding dimensions that are built from the existing features. When the observations cannot be separated in their original space, applying polynomial transformations extends the feature set into a higher-dimensional space where a separating hyperplane may exist.  

<center><img src="./imgs/poly-1.png"/></center>
<center><img src="./imgs/poly-2.png"/></center>  


The Polynomial kernel, represented as $ (a \cdot b + r)^d $, introduces additional dimensions to the feature space, where:

- $ a $ and $ b $ denote distinct observations within the dataset.
- $ r $ determines the coefficient of the polynomial.
- $ d $ establishes the degree of the polynomial.

Setting $ r = \frac{1}{2} $ and $ d = 2 $, we obtain:

$$ (a \cdot b + \frac{1}{2})^2 = (a \cdot b + \frac{1}{2})(a \cdot b + \frac{1}{2}) = ab + a^2b^2 + \frac{1}{4} = (a, a^2, \frac{1}{2}) \cdot (b, b^2, \frac{1}{2}) $$

Here, $ a $ and $ b $ are one-dimensional observations, and the result is a dot product between two three-dimensional vectors. The first term corresponds to the x-coordinates, the second to the y-coordinates, and the third to the z-coordinates.

Alternatively, considering:

$$ (a \cdot b + 1)^2 = (a \cdot b + 1)(a \cdot b + 1) = 2ab + a^2b^2 + 1 = (\sqrt{2}a, a^2, 1) \cdot (\sqrt{2}b, b^2, 1) $$

In this case, the x-coordinates are scaled by a factor of $ \sqrt{2} $, the y-coordinates are squared, and the z-coordinates stay constant. Either way, computing the high-dimensional relationship only takes a dot product. No explicit data transformation is required: the original values are plugged directly into the kernel to obtain the relationship.
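Both expansions can be verified numerically. The sketch below compares the kernel value $ (a \cdot b + r)^d $ with the dot product of the explicit three-dimensional features for $ r = \frac{1}{2} $ and $ r = 1 $, with $ d = 2 $ in both cases. The observations 3 and 4 are arbitrary, and the function names are hypothetical helpers.

```python
import math


def poly_kernel(a, b, r, d):
    """Polynomial kernel (a * b + r) ** d for one-dimensional observations."""
    return (a * b + r) ** d


def features_r_half(a):
    """Explicit features matching r = 1/2, d = 2."""
    return (a, a * a, 0.5)


def features_r_one(a):
    """Explicit features matching r = 1, d = 2."""
    return (math.sqrt(2) * a, a * a, 1.0)


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


a, b = 3.0, 4.0  # arbitrary one-dimensional observations

print(poly_kernel(a, b, 0.5, 2), dot(features_r_half(a), features_r_half(b)))
print(poly_kernel(a, b, 1.0, 2), dot(features_r_one(a), features_r_one(b)))
```

In each line the two numbers agree up to floating-point rounding, even though the kernel never builds the three-dimensional vectors.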


### Radial basis function (RBF) kernels

The RBF kernel is known for its versatility and effectiveness in capturing non-linear relationships within data. At its core, the RBF kernel computes a similarity, or "kernel" value, between two data points $ x $ and $ x' $ based on their Euclidean distance in feature space. Mathematically, the RBF kernel is defined as:

$$ K(x, x') = \exp \left( -\frac{\|x - x'\|^2}{2\sigma^2} \right) $$

Here, $ \| \cdot \| $ represents the Euclidean norm, $ \sigma^2 $ denotes the kernel width parameter (also known as the bandwidth or spread), and $ K(x, x') $ signifies the computed kernel value.

The RBF kernel's functionality stems from its ability to assign high similarity values to data points that are close to each other in feature space and lower values to those farther apart. The exponential decay in similarity as the distance increases reflects the kernel's characteristic "locality" principle: points nearby contribute significantly to the decision process, while distant points have diminishing influence.

The flexibility of the RBF kernel lies in its capacity to capture complex relationships and patterns within data, irrespective of linearity. By appropriately tuning the kernel width parameter $ \sigma^2 $, practitioners can adjust the smoothness of the decision boundary. A smaller $ \sigma^2 $ value yields a more complex decision boundary, potentially leading to overfitting, while a larger value results in a smoother boundary, possibly underfitting the data.
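The locality of the kernel and the role of the width parameter can be seen by evaluating the formula directly. The sketch below computes the kernel value between a reference point and a nearby or distant point for several values of $ \sigma $. The points and the $ \sigma $ values are made up for illustration.

```python
import math


def rbf_kernel(x, x_prime, sigma):
    """RBF kernel exp(-||x - x'||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


x = [0.0, 0.0]
near = [0.5, 0.0]
far = [3.0, 0.0]

for sigma in (0.5, 1.0, 3.0):
    print(sigma, rbf_kernel(x, near, sigma), rbf_kernel(x, far, sigma))
```

For every $ \sigma $ the nearby point gets the higher similarity, and as $ \sigma $ grows the distant point keeps more influence, which corresponds to the smoother decision boundary described above.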

Furthermore, the RBF kernel's innate ability to implicitly map data into a higher-dimensional feature space allows it to handle non-linearly separable data without the need for explicit feature engineering. This characteristic renders it invaluable in scenarios where the underlying relationships are intricate and multidimensional.

The radial basis function (RBF) kernel is also known as the **Gaussian kernel**. The term "radial basis function" refers to the fact that the kernel value depends solely on the distance between two data points $x$ and $x'$, with the Gaussian distribution determining the weights.

The Gaussian kernel is often easier to understand visually than through equations. Let's consider data points in a one-dimensional space (1-D):

<center><img src="./imgs/rbf-1.png"/></center>

No single dividing line separates the two classes of points here. With the Gaussian kernel, however, we can distinguish between the classes by using two Gaussian curves, which divide them at their intersection point.
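A rough numerical version of this idea is sketched below. It places one Gaussian bump on each training point and sums them with the sign of the label, which is the kernelized decision function $ f(x) = \sum_i \lambda_i y_i K(x_i, x) + b $ obtained by replacing the dot product with the kernel. The 1-D data is made up, and the multipliers $ \lambda_i = 1 $ and bias $ b = 0 $ are hand-picked for illustration rather than the result of training.

```python
import math


def rbf_1d(x, x_prime, sigma=1.0):
    """Gaussian kernel between two 1-D points."""
    return math.exp(-((x - x_prime) ** 2) / (2.0 * sigma ** 2))


# Made-up 1-D data: class +1 in the middle, class -1 on both sides
points = [-3.0, -0.5, 0.5, 3.0]
labels = [-1, 1, 1, -1]
lam = [1.0, 1.0, 1.0, 1.0]  # hand-picked multipliers, not trained (assumption)
b = 0.0                     # hand-picked bias (assumption)


def decision(x):
    """Sum of label-weighted Gaussian bumps centred on the training points."""
    return sum(l * y * rbf_1d(p, x) for l, y, p in zip(lam, labels, points)) + b


for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(x, decision(x), 1 if decision(x) >= 0 else -1)
```

The score is positive around the middle points and negative near both outer points, so the classes are separated even though no single threshold on the line could do it.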

<center><img src="./imgs/rbf-2.png"/></center>

Now, we see the possibility of distinguishing between the groups. But what happens when we move into a multidimensional space?

<center><img src="./imgs/rbf-3.png"/></center>

The Gaussian Kernel constructs numerous Gaussian curves within the multidimensional space to establish a robust model. When visualizing the Gaussian kernel in a multidimensional space, it appears as follows:

<center><img src="./imgs/rbf-4.png"/></center>

As a result, our data points will now be separable:

<center><img src="./imgs/rbf-5.png"/></center>

If the data is more complex, the Gaussian kernel extends to N-dimensional space in the same way. The kernel width is then tuned so that the model differentiates effectively between the classes of observations.