# The Mathematics behind Support Vector Machines

It's a big topic, so we will cover the essentials. A lot of the math is skipped, but resources that explain everything fully are given along the way. We'll start with the prerequisites:  

* Linear Algebra
* Calculus  

You will need a college-level understanding of both.

You should understand these concepts:
- Vectors & Matrices
- Derivatives + partial derivatives
- Distances in n-dimensional space
- Function optimization with/without constraints (Lagrange multipliers) - *some resources will be provided*

Remember that the basic idea of SVM classifiers is to maximize the margin (the gap around the hyperplane, set by its distance to the nearest data points) between the two classes. Let's think about how this is calculated.  

Let's say we have a hyperplane. We calculate the distance to the closest data point (because that datapoint does in fact represent a sort of barrier we cannot cross - at least while we're talking about hard-margin classifiers). If we double that distance, we have our margin.  
Of course, we want to maximize our margin, which means that our goal is to find a middle ground between the two classes (which leads to the maximum possible margin).  
To do that, we introduce a few ideas:  
- each point in n-space is essentially a vector.
- our hyperplane is an (n-1)-dimensional space, but it can be described by a single vector in n-space: its normal vector (together with an intercept term).  

So we have 2 vectors: a point A, and our normal vector W (which describes the hyperplane; remember that this vector is always perpendicular to the hyperplane when we plot it). To get the distance between the hyperplane and the point A, we have to *project* A onto W, then calculate the norm of that projection (this assumes the hyperplane passes through the origin; the intercept comes back into play later). Double that, and we have our margin calculated. The steps are relatively simple, and the math is easy to apply once we get the hang of it.  

Formula for projection of A onto W (let **U** be the unit vector of W, *aka* the direction of W; let **P** be the projection):
$$
U = \frac{W}{||W||}
$$
$$
||P|| = U \cdot A
$$  
$$
P = (U \cdot A) \, U
$$


Formula for norm:
$$
\| \vec{A} \| = \sqrt{\vec{A} \cdot \vec{A}}
$$

Note that we would only be interested in the norm of the projection, not the projection itself. We would simply calculate $||P|| = U \cdot A$, and double it (because this value is only one side of the hyperplane, and we want to take into account both sides). That would be our margin.
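
As a quick numeric check of the formulas above, here is a minimal sketch in plain Python. The vectors `W` and `A` are made-up values, the helper names `dot` and `norm` are just for illustration, and the hyperplane is assumed to pass through the origin (no intercept). `abs` is used because the dot product comes out negative for points on the other side of the hyperplane.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """Euclidean norm: sqrt(a . a)."""
    return math.sqrt(dot(a, a))

W = [3.0, 4.0]  # normal vector of the hyperplane (made-up values)
A = [2.0, 1.0]  # closest datapoint (made-up values)

U = [w / norm(W) for w in W]     # unit vector in the direction of W
proj_len = abs(dot(U, A))        # ||P||: distance from A to the hyperplane
P = [dot(U, A) * u for u in U]   # the projection vector itself
margin = 2 * proj_len            # count both sides of the hyperplane

print(U, proj_len, P, margin)
```

Here $||W|| = 5$, so the projection has length 2 (up to floating-point rounding) and the margin is 4.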

We now know how to calculate the margin for one hyperplane. Let's see how we can find the optimal hyperplane, with the biggest margin.  

Since our goal is to find the biggest value for the margin, this means we find the biggest norm for $P$, our projection vector. As we now know, $||P|| = U \cdot A$.  
$U = \frac{W}{||W||}$, so we can substitute that into the equation for $||P||$. 
$$
||P|| = \frac{W}{||W||} \cdot A
$$
$$
||P|| = \frac{W \cdot A}{||W||}
$$


The math does involve some more steps, but I'm going to simplify it a bit.  
If you want to see each step of the way, I would recommend checking out this [article](https://www.svm-tutorial.com/2014/11/svm-understanding-math-part-2/). You might want to read every part; it's worth it.  

All right, the simplification is as follows: to get $||P||$ as big as possible, we look at our variables. We have $W$, our hyperplane vector. If we change it, its norm changes too. There are many ways in which we could change it, but which one is the best? Let's go to $A$. This is a datapoint, so we know for sure we're not modifying anything about that. Lastly, we have $||W||$, which depends on $W$. What's different now is that instead of having so many directions in which we could change $W$, we only have 1 for $||W||$: make it smaller. This only works once the numerator is pinned to a fixed value for the closest points (which is what the constraints further down do); otherwise scaling $W$ would change the numerator and the denominator by the same factor. With that in place, to maximize $||P||$, we want to make $||W||$ as small as possible. And that gives us a target:  

#### Minimize the norm of our weights vector W

I will skip the math in the middle, but eventually the equations lead to this expression for the margin $m$:

$$
m = \frac{2}{||W||}
$$

If we were looking at the full mathematics, we would eventually reach this result. So now we have a target. This is an optimization problem. Since we have no other conditions so far, we call this an **unconstrained optimization problem**. Things are a bit more complicated going forward.  
In order to understand what follows, at least a basic understanding of derivatives and Lagrange multipliers is required.  
Here is a link to a playlist [about Lagrange multipliers](https://www.youtube.com/playlist?list=PLCg2-CTYVrQvNGLbd-FN70UxWZSeKP4wV) that I found helpful.
Derivatives (and gradients) would be covered in a calculus course, so here is [another playlist covering many aspects of calculus](https://www.youtube.com/watch?v=WUvTyaaNkzM&list=PL0-GT3co4r2wlh6UHTUeQsrf3mlS2lk6x).
It's hard to understand everything without a lot of practice, so more work outside of this notebook is required.
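
To make the formula $m = \frac{2}{||W||}$ concrete, here is a small sketch that evaluates it for a few made-up weight vectors that all point in the same direction. The function name `margin` is just for illustration, and the formula is assumed to apply to a $W$ that has already been scaled so the closest points sit exactly on the dotted lines.

```python
import math

def margin(W):
    """Margin m = 2 / ||W|| for a weight vector W."""
    return 2.0 / math.sqrt(sum(w * w for w in W))

# Same direction, shrinking norm: the margin grows as ||W|| gets smaller.
for W in ([3.0, 4.0], [1.5, 2.0], [0.6, 0.8]):
    print(W, margin(W))
```

Halving the norm of $W$ doubles the margin, which is why minimizing $||W||$ is the target.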

Up until this point, we've focused on the aspect of maximizing the margin. The next point is to make sure that we don't cross the margin. This is where the constraints come in. Remember those dotted lines we plotted along with the hyperplane? Those are (essentially) constraints. We want to make sure that no datapoint ends up between those dotted lines, and that every datapoint is on the correct side. The mathematical way of saying that is:  

Take a hyperplane with weights $W$ and intercept $b$, defined by the following equation:
$$
W \cdot X + b = 0
$$
This means that every point $X$ on the hyperplane satisfies this equation. For other points (such as those from our dataset), the left-hand side will be $>0$ or $<0$, depending on which side of the hyperplane they lie on. We want to make sure that the points from our dataset are on the correct side of the hyperplane.
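
A minimal sketch of that idea, with a made-up 2D hyperplane and made-up points (the function name `decision_value` is just for illustration):

```python
def decision_value(W, b, X):
    """Left-hand side of the hyperplane equation: W . X + b."""
    return sum(w * x for w, x in zip(W, X)) + b

W, b = [1.0, 1.0], -3.0  # the line x1 + x2 - 3 = 0 (made-up values)

points = {
    "on the hyperplane": [1.0, 2.0],
    "positive side": [3.0, 2.0],
    "negative side": [0.0, 1.0],
}
for label, X in points.items():
    print(label, decision_value(W, b, X))
```

The sign of the value tells us which side of the hyperplane a point is on, and a value of exactly 0 means the point lies on the hyperplane itself.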

We can write equations for 2 other hyperplanes that are parallel to our main hyperplane and equidistant from it:
$$
W \cdot X + b = \delta
$$
$$
W \cdot X + b = -\delta
$$
where $\delta > 0$ controls how far the dotted lines are from the main hyperplane: the actual distance is $\frac{\delta}{||W||}$, which is half our margin.

In practice, we set $\delta = 1$ to simplify calculations. We could pick any positive value, because dividing $W$ and $b$ by $\delta$ describes exactly the same hyperplanes, so the choice doesn't change the solution.  
Our 2 hyperplane equations now look like this:
$$
W \cdot X + b = 1
$$
$$
W \cdot X + b = -1
$$
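
To see why rescaling changes nothing geometrically, here is a small sketch with made-up numbers. It uses the standard point-to-hyperplane distance $\frac{|W \cdot X + b|}{||W||}$ (the function name is just for illustration) and shows that dividing $W$ and $b$ by the same constant leaves that distance untouched.

```python
import math

def distance_to_hyperplane(W, b, X):
    """Geometric distance |W . X + b| / ||W|| from X to the hyperplane."""
    value = sum(w * x for w, x in zip(W, X)) + b
    return abs(value) / math.sqrt(sum(w * w for w in W))

W, b = [2.0, 0.0], -4.0  # W . X + b = 0 is the vertical line x1 = 2
X = [5.0, 1.0]           # this point gives W . X + b = 6
scale = 6.0

# Divide W and b by 6: the same point now gives W . X + b = 1 instead.
W_scaled = [w / scale for w in W]
b_scaled = b / scale

print(distance_to_hyperplane(W, b, X))
print(distance_to_hyperplane(W_scaled, b_scaled, X))
```

Both calls report a distance of 3 (up to rounding): the point is 3 units to the right of the line $x_1 = 2$ no matter how $W$ and $b$ are scaled.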
We want all of the points from our dataset to be on the correct side of these hyperplanes. This means that we want all of the points from our dataset to satisfy the following inequalities:
$$
W \cdot X + b \geq 1, \text{ if } y = 1
$$
$$
W \cdot X + b \leq -1, \text{ if } y = -1
$$
Where $y$ is the correct class. Since $y$ is either $1$ or $-1$, we can multiply each inequality by $y$ and combine the two into one:
$$
y(W \cdot X + b) \geq 1
$$
Which then becomes:
$$
y(W \cdot X + b) - 1 \geq 0
$$
And this is our constraint. We want to make sure that this constraint is satisfied for all of our datapoints.  
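
A minimal sketch of checking this constraint over a small made-up dataset, for a hyperplane chosen by hand (the function name `satisfies_margin` and all the numbers are assumptions for illustration):

```python
def satisfies_margin(W, b, X, y):
    """True if y * (W . X + b) - 1 >= 0 for a single datapoint."""
    return y * (sum(w * x for w, x in zip(W, X)) + b) - 1 >= 0

W, b = [1.0, 1.0], -3.0  # hand-picked hyperplane (made-up values)

dataset = [
    ([3.0, 2.0], 1),   # correct side, outside the dotted lines
    ([0.0, 1.0], -1),  # correct side, outside the dotted lines
    ([2.0, 1.5], 1),   # correct side, but between the dotted lines
    ([3.0, 3.0], -1),  # wrong side of the hyperplane
]

results = [satisfies_margin(W, b, X, y) for X, y in dataset]
print(results)       # [True, True, False, False]
print(all(results))  # False: this hyperplane is not a valid hard-margin solution
```

Note that the third point is classified correctly but still violates the constraint, because it sits inside the margin.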

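Let's take a step back. We now have an optimization problem with constraints. Instead of $||W||$ we minimize $\frac{1}{2} ||W||^2$, which has the same minimizer but is easier to differentiate. Our problem looks like this: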
$$
\underset{W}{\text{minimize}} \frac{1}{2} ||W||^2
$$
$$
\text{subject to} \quad y(W \cdot X + b) - 1 \geq 0
$$
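
For intuition, here is a sketch of the one case that can be solved by hand: a dataset with exactly two points, one per class. In that special case the best hyperplane is the perpendicular bisector of the segment between the two points, so $W$ and $b$ have a closed form. This is not a general solver, and the points are made-up values.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

x_pos = [3.0, 3.0]  # the only point with y = +1 (made-up)
x_neg = [1.0, 1.0]  # the only point with y = -1 (made-up)

# W points from the negative point to the positive one, scaled so that
# both constraints hold with equality.
d = [p - n for p, n in zip(x_pos, x_neg)]
W = [2.0 * di / dot(d, d) for di in d]

# The hyperplane passes through the midpoint of the two points.
midpoint = [(p + n) / 2.0 for p, n in zip(x_pos, x_neg)]
b = -dot(W, midpoint)

margin = 2.0 / math.sqrt(dot(W, W))

print(W, b)
print(dot(W, x_pos) + b, dot(W, x_neg) + b)  # the two constraint values
print(margin, math.sqrt(dot(d, d)))          # margin vs. distance between the points
```

Both points land exactly on the dotted lines, and the margin equals the distance between the two points, which is the widest gap that could possibly separate them.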
We can now use the Lagrange multiplier method to solve this problem. The full method is outside the scope of this notebook; refer to the videos above and [this article](https://www.svm-tutorial.com/2016/09/unconstrained-minimization/) for more information.  

What happens is that we consider the following functions:
$$
\begin{aligned}
f(W) &= \frac{1}{2} ||W||^2 \\
g(W,b) &= y(W \cdot X + b) - 1
\end{aligned}
$$

--------------
#### NOTE
In the case of soft-margin classifiers, we allow some room for error. That is where the $C$ parameter comes in. Formulas change a bit. Our $f$ function would look like this:
$$
f(W) = \frac{1}{2}||W||^2 + C \sum_{i=1}^{n} \xi_i
$$
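Where $\xi_i$ is the error (slack) for the $i^{th}$ datapoint, meaning how far it violates the margin, and $C$ controls how heavily those errors are penalized.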
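
Here is a small sketch of evaluating this objective for a hand-picked hyperplane. It assumes the usual way of measuring the error, $\xi_i = \max(0, 1 - y_i(W \cdot X_i + b))$, which is not spelled out above; the function name and the data are made up for illustration.

```python
def soft_margin_objective(W, b, dataset, C):
    """Return (0.5 * ||W||^2 + C * sum of slacks, list of slacks).

    Assumes slack xi_i = max(0, 1 - y_i * (W . X_i + b)).
    """
    slacks = [
        max(0.0, 1.0 - y * (sum(w * x for w, x in zip(W, X)) + b))
        for X, y in dataset
    ]
    objective = 0.5 * sum(w * w for w in W) + C * sum(slacks)
    return objective, slacks

W, b = [1.0, 1.0], -3.0  # hand-picked hyperplane (made-up values)
dataset = [
    ([3.0, 2.0], 1),   # outside the margin: no error
    ([0.0, 1.0], -1),  # outside the margin: no error
    ([2.0, 1.5], 1),   # inside the margin: small error
    ([3.0, 3.0], -1),  # misclassified: large error
]

for C in (1.0, 10.0):
    print(C, soft_margin_objective(W, b, dataset, C))
```

With a larger $C$ the same violations cost more, so the optimizer is pushed towards fewer errors at the price of a smaller margin.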

-----------

We then consider their gradients, and have the following condition:
$$
\nabla f(W) - \lambda \nabla g(W,b) = 0
$$
Where $\lambda$ is the Lagrange multiplier. This means we want the gradient of $f(W)$ to be equal to the gradient of $g(W,b)$, multiplied by the Lagrange multiplier. Note that + or - means the same thing here, since the multiplier can absorb the sign. You will also see people writing the equation with a + sign, but it's the same thing.  

The method then defines a new function:
$$
L(W,b,\lambda) = f(W) - \lambda g(W,b)
$$
And takes its gradient:
$$
\nabla L(W,b,\lambda) = \nabla f(W) - \lambda \nabla g(W,b)
$$
If we set it to 0, we find points where the gradients of $f$ and $g$ are parallel (which can mean a lot of things, depending on how you interpret it). What matters is that solving this last equation gives us the results we want.  
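
For reference, here is a sketch of what setting the gradient to 0 gives when there is one constraint per datapoint. The index $i$ and the per-point multipliers $\lambda_i$ are notation added here (the formulas above are written for a single datapoint), following the standard textbook treatment:

$$
L(W,b,\lambda) = \frac{1}{2} ||W||^2 - \sum_{i=1}^{n} \lambda_i \left[ y_i (W \cdot X_i + b) - 1 \right]
$$
$$
\frac{\partial L}{\partial W} = W - \sum_{i=1}^{n} \lambda_i y_i X_i = 0 \quad \Rightarrow \quad W = \sum_{i=1}^{n} \lambda_i y_i X_i
$$
$$
\frac{\partial L}{\partial b} = -\sum_{i=1}^{n} \lambda_i y_i = 0
$$

So the optimal $W$ is a weighted sum of the datapoints. Because the constraints are inequalities, the multipliers must also satisfy $\lambda_i \geq 0$, and only points sitting exactly on the dotted lines (the support vectors) can have $\lambda_i > 0$.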

Again, a lot of math has been skipped, and I don't even claim to have done a good job with these last few paragraphs. I just wanted to give you an idea of what steps are involved, so you know what you can search for. Going forward in this course, a lot more math will be skipped, claiming previous knowledge (or just focusing on what I believe matters more). Instead of being able to spew the equations from memory, I want you to know how to explain the concepts at a high level, and eventually how to find equations (either by yourself or by searching on the internet), if needed.

### What comes next

Next, we'll be looking at Decision Trees. They are a very popular algorithm, when combined with some other techniques we will also be talking about. You can find the notebook here: [Decision Trees](decisionTrees.ipynb)