<a href="https://colab.research.google.com/github/MaralAminpour/ML-BME-Course-UofA-Fall-2023/blob/main/SVM_kernel.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **1. I support vector machines and so should you**

"Alright, what’s a support vector machine good for? They are a type of machine learning that allows a computer to take a set of data and classify it into two groups (or more by comparing any one group to all the rest). We can use them to classify basically anything from taking financial information to decide if someone is likely to default on a loan to image information so a computer can decide if something is a dog or a cat. They have a benefit of maximizing the margin (or space) between the two groups to allow for new observation that the computer has not seen before to be better classified. They also can take advantage of something called the kernel trick which I will explain shortly.

Basically the computer uses lots of math that smart people on youtube can explain to you to draw a line between groups of already labeled data and then uses the line to predict what class or group new observations fall under. If you think about it, it is rather impressive. We can naturally look at the graph below and realize where best to draw a line separating the blue from red X’s, but there are actually a lot of different possibilities and finding smart ways to get a computer to recognize the best one is big business.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-3-Classification-models/imgs/gif1.gif' width=300px>


Mathematically the computer draws a line between the data, moves other lines away from it in both directions, and rotates these until it hits observations on either side.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-3-Classification-models/imgs/gif2.gif' width=300px>


It gets its name from these vectors that are ultimately supporting the model. These vectors are also the only important information the computer needs to hold onto, as all other points have no effect on the model or predictions.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-3-Classification-models/imgs/gif3.gif' width=300px>
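The "rotate a line until the gap is as wide as possible" idea can be sketched in a few lines. This is a brute-force illustration on made-up points, not how real SVM solvers work (they solve an optimisation problem rather than trying angles one by one); the data and the helper names `margin_for_angle` and `proj` are assumptions for this example.

```python
import math

# Toy 2D data (made up for illustration): two linearly separable groups.
red = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
blue = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]


def margin_for_angle(deg):
    """Half-width of the gap between the groups along direction `deg`.

    Returns None when the groups are not separated along that direction.
    """
    theta = math.radians(deg)
    nx, ny = math.cos(theta), math.sin(theta)  # unit normal of the candidate line
    red_proj = [nx * x + ny * y for x, y in red]
    blue_proj = [nx * x + ny * y for x, y in blue]
    gap = min(blue_proj) - max(red_proj)
    return gap / 2 if gap > 0 else None


# Rotate the candidate line one degree at a time and keep the widest margin.
best_deg, best_margin = None, 0.0
for deg in range(360):
    m = margin_for_angle(deg)
    if m is not None and m > best_margin:
        best_deg, best_margin = deg, m

# Points that sit exactly on the edges of the gap are the support vectors.
theta = math.radians(best_deg)
nx, ny = math.cos(theta), math.sin(theta)


def proj(p):
    return nx * p[0] + ny * p[1]


red_edge = max(proj(p) for p in red)
blue_edge = min(proj(p) for p in blue)
support_vectors = [p for p in red if abs(proj(p) - red_edge) < 1e-9]
support_vectors += [p for p in blue if abs(proj(p) - blue_edge) < 1e-9]

print(best_deg, round(best_margin, 4))  # 45 1.7678
print(support_vectors)                  # [(1.0, 0.0), (0.0, 1.0), (3.0, 3.0)]
```

Deleting `(0.0, 0.0)` or `(4.0, 3.0)` from the lists leaves the printed result unchanged, which is the sense in which only the support vectors matter.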

All this seems pretty simple right? It is! Though technically we have only built a Maximum Margin Classifier. If we are classifying groups of things that have either some overlap or extreme outliers this method will fail. Imagine you are trying to tell the difference between cats and dogs and someone throws a chihuahua into the mix! Or, god forbid, a shih tzu. No computer could tell those apart, right? Wrong. We just tell our computer to ignore them! If only people were so easy to program…


In the below GIF, you can first see a mixed group where a computer could not mathematically draw a line between them. By ignoring two points it is able to find the best support vectors and draw our line. In the second half of the GIF, one extreme outlier (the infamous shih tzu) would otherwise PUSH our line far too close to the red class and cause our model to misclassify future predictions. Teaching a computer that some observations simply do not fit what we normally expect is very important in model building and prevents overfitting to whatever training data we give it.

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-3-Classification-models/imgs/gif4.gif' width=300px>
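One common way to make "ignore a few points" precise is the standard soft-margin objective, which the text above does not spell out: minimise `0.5 * w^2 + C * (sum of hinge losses)`, where `C` sets how expensive a violated margin is. The sketch below compares two hand-picked candidate boundaries on made-up 1D data rather than running an optimiser; the data, the candidates, and the values of `C` are all assumptions for illustration.

```python
# 1D toy data (made up): label -1 = cat, +1 = dog.
# The point at x = 4 is the outlier: a "cat" sitting right next to the dogs.
X = [0.0, 1.0, 4.0, 5.0, 6.0]
y = [-1, -1, -1, 1, 1]


def soft_margin_objective(w, b, C):
    """Standard soft-margin SVM cost: 0.5 * w^2 + C * sum of hinge losses."""
    hinge = sum(max(0.0, 1.0 - yi * (w * xi + b)) for xi, yi in zip(X, y))
    return 0.5 * w * w + C * hinge


# Candidate A ignores the outlier: boundary at x = 3, margin edges at 1 and 5.
wide = (0.5, -1.5)
# Candidate B obeys the outlier: boundary at x = 4.5, margin edges at 4 and 5.
narrow = (2.0, -9.0)

for C in (0.1, 10.0):
    cost_wide = soft_margin_objective(*wide, C)
    cost_narrow = soft_margin_objective(*narrow, C)
    winner = "wide (ignore outlier)" if cost_wide < cost_narrow else "narrow (fit outlier)"
    print(f"C={C}: wide={cost_wide:.3f}, narrow={cost_narrow:.3f} -> {winner}")
# C=0.1: wide=0.275, narrow=2.000 -> wide (ignore outlier)
# C=10.0: wide=15.125, narrow=2.000 -> narrow (fit outlier)
```

With a small `C` the wide margin wins even though it misplaces the outlier; with a large `C` the model bends to fit every training point, which is the overfitting risk described above.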


OK. Now all these visualizations have shown 2D representations of data. If we are classifying dogs and cats, it would be as if we are only looking at two pieces of information about them. Maybe the x-axis represents their weight and the y-axis represents, ya know, innate evil or something. All of this works in 3 dimensions and beyond. Instead of lines we are drawing planes (or hyperplanes in 4D and beyond), but everything still works out just the same. Look, here is a picture of the same magic in 3D:

<img src='https://raw.githubusercontent.com/MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-3-Classification-models/imgs/gif5.webp' width=300px>

So far so good, yeah? Now we get into the real magic: the kernel trick. So, I am sure you can imagine lots of different datasets or graphs where you simply cannot draw a line/plane/whatever through them. A cluster of red surrounded by a sea of blue would be impossible to draw a line through. Math comes to the rescue again. In the below example we have 1D data and no matter where we draw a line, we won’t be able to divide our groups up. The kernel trick in this example is just adding another dimension with the square of the first dimension. If we are just talking about the weight of cats and dogs, it is like we decide also to tell the computer the square of their weight too. To us, that makes no sense, but watch.

<img src='https://raw.githubusercontent.com//MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-3-Classification-models/imgs/gif6.gif' width=300px>

Holy crap, am I right? Now watch the same thing from 2D to 3D!

<img src='https://raw.githubusercontent.com//MaralAminpour/ML-BME-Course-UofA-Fall-2023/main/Week-3-Classification-models/imgs/gif7.gif' width=300px>
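The 1D case is easy to check by hand. The sketch below uses made-up points and a hypothetical helper, `separable_by_threshold`, to confirm that no single cut works on the raw values but one does on the squares. Strictly speaking this adds the new feature explicitly; the kernel trick proper gets the same effect without ever computing the extra coordinate.

```python
# 1D toy data (made up): the red group sits between two blue groups.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
labels = ["blue", "blue", "red", "red", "red", "blue", "blue"]


def separable_by_threshold(values, labels):
    """True if a single cut point puts each colour entirely on its own side."""
    red = [v for v, lab in zip(values, labels) if lab == "red"]
    blue = [v for v, lab in zip(values, labels) if lab == "blue"]
    return max(red) < min(blue) or max(blue) < min(red)


# In the original dimension no single cut works.
print(separable_by_threshold(xs, labels))       # False

# Add the square as a second feature: each point becomes (x, x**2).
lifted = [(x, x * x) for x in xs]

# Along the new axis one horizontal line (for example x**2 = 2.5) splits the groups.
squares = [x2 for _, x2 in lifted]
print(separable_by_threshold(squares, labels))  # True
```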


That was just a super simple kernel trick. People have been coming up with some fancy ones for support vector machines for a while now and that allows them to tackle some pretty complex stuff. AND if you are still having trouble after adding one dimension, there is nothing saying you can’t add more! Crazy.
If you want to learn more about kernels and some of the more effective ones, check out this article written by Eric Kim. It is pretty approachable as those things go.
And if you are interested in the math behind the other aspects of support vector machines, I recommend these two videos on youtube:

Udi Aharoni:

https://www.youtube.com/watch?v=3liCbRZPrZA

Temel Bilisim:

https://www.youtube.com/watch?v=5zRmhOUjjGY

I hope that helped to demystify support vector machines and teach you the basics of what is going on under the hood when you start modeling with them."

[Source](https://towardsdatascience.com/i-support-vector-machines-and-so-should-you-7af122b6748)










## 2. Unveiling the Magic: The Untold Power of Kernels in Machine Learning and SVM

A kernel function offers a practical approach to certain calculations. Instead of performing computations directly in higher-dimensional spaces, using a kernel often proves to be a faster and more efficient choice.

### **Mathematical Definition**:

The kernel function is given by:

$$ K(x, y) = \langle f(x), f(y) \rangle $$

Where:

- $K$ is the kernel function.
- $x$ and $y$ are our n-dimensional input vectors.
- $f$ is a function that maps these n-dimensional vectors into an m-dimensional space.
- $\langle x, y \rangle$ represents the dot product of two vectors.

Typically, $m$ (the dimensions of the transformed space) is much larger than $n$ (the dimensions of our original inputs).

### **Intuition**:

To determine $\langle f(x), f(y) \rangle$, you might think that you first need to calculate $f(x)$ and $f(y)$ and then compute their dot product. This process can be computationally intensive, especially as it involves operations in an m-dimensional space, which might be considerably large. However, after all the effort in this vast m-dimensional space, the final outcome is just a scalar. This raises a question: Is it necessary to go through the extensive process in the m-dimensional space just for a single scalar result? If we utilize an appropriate kernel function, the answer is "no".



### **Simple Example with Kernel Trick**:

Given two vectors:

$$ x = (x_1, x_2, x_3) $$
$$ y = (y_1, y_2, y_3) $$

Consider the function:

$$ f(x) = (x_1x_1, x_1x_2, x_1x_3, x_2x_1, x_2x_2, x_2x_3, x_3x_1, x_3x_2, x_3x_3) $$

With this, our kernel is defined as:

$$ K(x, y) = \langle x, y \rangle^2 $$

To elucidate further, let's use specific values for our vectors:

$$ x = (1, 2, 3) $$
$$ y = (4, 5, 6) $$

Using our function $f$, we can derive:

$$ f(x) = (1*1, 1*2, 1*3, 2*1, 2*2, 2*3, 3*1, 3*2, 3*3) $$

This yields:

$$ f(x) = (1, 2, 3, 2, 4, 6, 3, 6, 9) $$

Similarly:

$$ f(y) = (4*4, 4*5, 4*6, 5*4, 5*5, 5*6, 6*4, 6*5, 6*6) $$

Which gives:

$$ f(y) = (16, 20, 24, 20, 25, 30, 24, 30, 36) $$

Taking the dot product of $f(x)$ and $f(y)$, we have:

$$ \langle f(x), f(y) \rangle = 16 + 40 + 72 + 40 + 100 + 180 + 72 + 180 + 324 = 1024 $$

This computation might seem convoluted, mainly because the function $f$ transforms our original 3-dimensional vectors into a 9-dimensional space.

Now, observe the efficiency of the kernel trick:
Using the kernel, we calculate:

$$ K(x, y) = (1*4 + 2*5 + 3*6)^2 = 32^2 = 1024 $$

Remarkably, both methods yield the same result of 1024. However, employing the kernel trick has simplified our calculations considerably.
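The arithmetic above can be reproduced directly. In this small sketch the function names `f`, `dot`, and `K` simply mirror the symbols used in the example; it is a check of the worked numbers, not a general kernel library.

```python
from itertools import product


def f(v):
    """Explicit feature map: all pairwise products v_i * v_j (n^2 entries)."""
    return [a * b for a, b in product(v, repeat=2)]


def dot(u, v):
    """Ordinary dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def K(x, y):
    """Kernel shortcut: square of the dot product in the original space."""
    return dot(x, y) ** 2


x = (1, 2, 3)
y = (4, 5, 6)

print(f(x))             # [1, 2, 3, 2, 4, 6, 3, 6, 9]
print(dot(f(x), f(y)))  # 1024, computed in 9 dimensions
print(K(x, y))          # 1024, computed in 3 dimensions
```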



### **The Additional Sophistication of Kernels**:

Kernels are particularly powerful because they let us work implicitly in very high-dimensional, even infinite-dimensional, feature spaces. Moving to such a space explicitly is not just computationally taxing; it can be outright impossible, because a vector with infinitely many coordinates cannot be written down. For example, $f(x)$ could map from an $n$-dimensional space to an infinite-dimensional one, as the feature map behind the Gaussian (RBF) kernel does. Here, the kernel offers a remarkable shortcut: it returns the inner product directly, without ever constructing $f(x)$.
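As one concrete case, take the Gaussian (RBF) kernel $\exp(-\gamma (x - y)^2)$ for scalar inputs. It is not named in this section, so treat it as an illustrative choice. Expanding $\exp(2\gamma xy)$ as a power series shows that its feature map has infinitely many coordinates, with coordinate $k$ equal to $e^{-\gamma x^2}\sqrt{(2\gamma)^k / k!}\,x^k$. The sketch below truncates that map and compares it with the one-line kernel; the inputs and `gamma=0.5` are arbitrary.

```python
import math


def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel for scalars: exp(-gamma * (x - y)^2)."""
    return math.exp(-gamma * (x - y) ** 2)


def truncated_features(x, n_terms, gamma=0.5):
    """First n_terms coordinates of the infinite feature map behind the RBF kernel.

    Coordinate k is exp(-gamma * x^2) * sqrt((2 * gamma)^k / k!) * x^k.
    """
    scale = math.exp(-gamma * x * x)
    return [scale * math.sqrt((2 * gamma) ** k / math.factorial(k)) * x ** k
            for k in range(n_terms)]


x, y = 0.5, 1.2
exact = rbf_kernel(x, y)
errors = {}
for n_terms in (1, 2, 5, 20):
    fx = truncated_features(x, n_terms)
    fy = truncated_features(y, n_terms)
    approx = sum(a * b for a, b in zip(fx, fy))
    errors[n_terms] = abs(approx - exact)
    print(n_terms, approx, errors[n_terms])
```

The error shrinks as more coordinates are kept, but the explicit dot product only matches the kernel exactly in the limit, whereas `rbf_kernel` gets there with a single exponential.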

### **Relation to Support Vector Machines (SVM)**:

How do kernels integrate into the SVM framework?

The basic SVM decision rule is

$$y = w^\top \phi(x) + b,$$

where $w$ represents the weights, $\phi(x)$ is the feature vector, and $b$ is the bias. If $y > 0$, we categorize the data point as belonging to class 1; otherwise, it falls into class 0.

The goal in SVM is to find weights and a bias that maximize the margin between the classes. While it is commonly stated that kernels make data linearly separable in SVM, a more nuanced explanation is that the feature vector $\phi(x)$ makes the data linearly separable. The kernel function simply streamlines the calculation, especially when $\phi$ maps to a high-dimensional space, like

$$\phi(x) = (x_1, x_2, \ldots, x_D, x_1^2, x_2^2, \ldots, x_D^2).$$
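To see how a rule that is linear in $\phi(x)$ can draw a curved boundary in the original space, here is a minimal sketch for two inputs. The weights are hand-picked for illustration (hypothetical, not learned by an SVM) so that the score equals $x_1^2 + x_2^2 - 1$.

```python
def phi(x):
    """Feature vector: the inputs followed by their squares."""
    return list(x) + [xi ** 2 for xi in x]


def decide(x, w, b):
    """Linear rule in feature space: class 1 if w . phi(x) + b > 0, else class 0."""
    score = sum(wi * fi for wi, fi in zip(w, phi(x))) + b
    return 1 if score > 0 else 0


# Hypothetical weights (hand-picked, not learned): score = x1^2 + x2^2 - 1,
# so the boundary is the unit circle in the original 2D space.
w = [0.0, 0.0, 1.0, 1.0]
b = -1.0

print(decide((0.2, 0.1), w, b))   # 0, inside the circle
print(decide((1.5, 0.0), w, b))   # 1, outside the circle
print(decide((-0.9, 0.9), w, b))  # 1, outside the circle
```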

### **Kernel as a Measure of Similarity**:

When considering SVM and feature vectors, the kernel definition $\langle f(x), f(y) \rangle$ transforms into $\langle \phi(x), \phi(y) \rangle$.

This inner product effectively measures how much $\phi(x)$ projects onto $\phi(y)$, or in simpler terms, the extent to which $x$ and $y$ overlap in their feature space. This degree of overlap serves as an indicator of their similarity.

[Source](https://colab.research.google.com/drive/1tBNGOT-ubxqGswskHLQ36FsSbejeImE7#scrollTo=8x-S-qBwBotf&line=89&uniqifier=1)


## **3. Another Stab at the Kernel Trick (Why?)**

"There are a lot of great tutorials on Support Vector Machines, and how to apply them in binary classification. But I wanted to dive deeper specifically into the kernel trick and why it’s used.

The Support Vector Machine relies on enlarging the feature space to a dimension that allows the two classes to be separable by a linear hyperplane. The first way this can be achieved, a simpler approach, is to transform your input data. This means to take your input features and square, cube or otherwise transform them to produce new additional features. The other method is to use kernels.

A kernel is a function that represents the similarity between two observations in a desired dimension (mathematically defined by Mercer’s Theorem, have a google). For a Gaussian (radial) kernel, for example, the similarity between two observations with the same coordinates is 1, and it tends to 0 as the Euclidean distance between them increases.
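The "1 for identical points, tending to 0 with distance" behaviour is what a Gaussian (RBF) kernel produces. A small sketch, with arbitrary example points and `gamma=1.0` chosen only for illustration:

```python
import math


def rbf_similarity(a, b, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


def linear_similarity(a, b):
    """Linear kernel: the plain inner product."""
    return sum(ai * bi for ai, bi in zip(a, b))


p = (1.0, 2.0)
print(rbf_similarity(p, (1.0, 2.0)))  # 1.0, identical coordinates
print(rbf_similarity(p, (1.5, 2.0)))  # about 0.78, close by
print(rbf_similarity(p, (4.0, 6.0)))  # about 1.4e-11, far apart
print(linear_similarity(p, p))        # 5.0
```

The last line shows that not every kernel is scaled this way: a linear kernel scores a point against itself as its squared length, not 1.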

Without getting too mathematical, the only calculation on the observations that is required to solve the SVM optimisation problem and find the hyperplane’s position is the inner product between pairs of observations. The output of an inner product is a scalar, meaning the output we are interested in exists in one dimension, not in whatever crazy dimensional space we visit to separate the data. The plain inner product is equivalent to a linear kernel function and represents similarity in a linear space.

The reason this is important is that the values of the observations in the higher dimension are irrelevant to us; we only care about the inner product result. This inner product can be calculated in two steps, as in method one (transform the input data, then calculate the inner product), or in a single step by applying a kernel to the input data. The benefit of method two is computational efficiency.

The computational efficiency arises from the fact that the kernel calculation is faster than the calculation that would be needed to transform the input data to the higher dimension that would be required to linearly separate the data.

You can intuitively think about why this would be faster as instead of first needing to perform a data transform and then perform an inner product calculation, you just apply a kernel function in one step and have the inner product result, voila!
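A rough way to see the saving is to count features. For a degree-`d` polynomial map built from all ordered products of the inputs, `n` inputs become `n ** d` features, while the matching kernel `(x . y) ** d` needs only the `n` original multiplications plus one power. The sketch below uses made-up integer vectors and hypothetical helper names to confirm that the two routes agree.

```python
from itertools import product
from math import prod


def explicit_map(v, degree):
    """All ordered degree-`degree` monomials of v: len(v) ** degree features."""
    return [prod(combo) for combo in product(v, repeat=degree)]


def poly_kernel(x, y, degree):
    """Same inner product in one step: (x . y) ** degree."""
    return sum(a * b for a, b in zip(x, y)) ** degree


x = [1, 2, 3, 4, 5, 6]
y = [6, 5, 4, 3, 2, 1]
degree = 3

# Two-step route: transform both vectors, then take the inner product.
fx, fy = explicit_map(x, degree), explicit_map(y, degree)
two_step = sum(a * b for a, b in zip(fx, fy))

# One-step route: apply the kernel to the original vectors.
one_step = poly_kernel(x, y, degree)

print(len(fx))             # 216 features after the transform
print(two_step, one_step)  # 175616 175616
```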

There is little benefit to using the kernel trick when the problem is quite simple and only transforming the input data quadratically or cubically suffices to allow for the data to be linearly separable.

But imagine you had a radial boundary between the two classes such as in the below figure. To figure out how to transform the input data to be able to linearly separate the below classes is as mind-boggling as it is computationally expensive. This is where kernels (radial/Gaussian in this case) really come into their own, and the similarity computation between observations drastically improves on transforming the input data.
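A radial boundary can be tried out without any libraries. To keep the sketch dependency-free it swaps the SVM solver for a kernel perceptron, a simpler algorithm that also touches the data only through kernel evaluations; the ring data, `gamma=1.0`, and the epoch cap are made up for illustration.

```python
import math

# Toy data (made up): 8 points on a circle of radius 1 (class -1)
# surrounded by 8 points on a circle of radius 3 (class +1).
angles = [2 * math.pi * k / 8 for k in range(8)]
X = [(r * math.cos(a), r * math.sin(a)) for r in (1.0, 3.0) for a in angles]
y = [-1] * 8 + [1] * 8


def linear_kernel(a, b):
    return a[0] * b[0] + a[1] * b[1] + 1.0  # the +1 acts as a bias term


def rbf_kernel(a, b, gamma=1.0):
    return math.exp(-gamma * ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2))


def train_kernel_perceptron(kernel, max_epochs=100):
    """Kernel perceptron: alpha[i] counts the mistakes made on point i."""
    alpha = [0] * len(X)
    for _ in range(max_epochs):
        mistakes = 0
        for i, xi in enumerate(X):
            score = sum(alpha[j] * y[j] * kernel(X[j], xi) for j in range(len(X)))
            if y[i] * score <= 0:  # wrong (or undecided): remember this point
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:  # a full clean pass means the data is separated
            break
    return alpha


def predict(alpha, kernel, point):
    score = sum(alpha[j] * y[j] * kernel(X[j], point) for j in range(len(X)))
    return 1 if score > 0 else -1


def accuracy(alpha, kernel):
    return sum(predict(alpha, kernel, xi) == yi for xi, yi in zip(X, y)) / len(X)


for name, kernel in (("linear", linear_kernel), ("rbf", rbf_kernel)):
    alpha = train_kernel_perceptron(kernel)
    print(name, accuracy(alpha, kernel))
```

The linear kernel can never reach an accuracy of 1.0 here, because the inner ring lies inside the outer one, while the RBF version separates the training points after a few passes.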

This was just a quick story (and my first) about why the kernel trick is widely used in difficult binary classification problems. I hope to follow up soon with how to use the kernel trick, including parameter tuning."
[Source](https://towardsdatascience.com/another-stab-at-the-kernel-trick-why-f73c70ce98dd)