# The Mathematics behind Support Vector Machines

It's a big topic, so get comfortable. We'll start with the prerequisites:  

* Linear Algebra
* Calculus  

You'll need a college-level understanding of both.

You should understand these concepts:
- Vectors & Matrices
- Derivatives + partial derivatives
- Distances in n-dimensional space
- Function optimization with/without constraints (Lagrange multipliers)

Remember that the basic idea of SVM classifiers is to maximize the margin (distance from the hyperplane to the nearest data points) between the two classes. Let's think about how this is calculated.  

Let's say we have a hyperplane. We calculate the distance to the closest data point (because that datapoint does in fact represent a sort of barrier we cannot cross - at least while we're talking about hard-margin classifiers). If we double that distance, we have our margin.  
Of course, we want to maximize our margin, which means that our goal is to find a middle ground between the two classes (which leads to the maximum possible margin).  
To do that, we introduce a few ideas:  
- each point in n-space is essentially a vector.
- our hyperplane is an (n-1)-dimensional space, which means that it can be described by a single vector in n-space: its normal vector (together with an intercept term).  

So we have 2 vectors: a point A, and our normal vector W (which describes the hyper-plane; remember that this vector will always be normal to the actual hyperplane representation in a graph). To get the distance between the hyperplane and the point A, we have to *project* A onto W, then calculate the norm of that projection. Double that, and we have our margin calculated. The steps are relatively simple, and the math is easy to apply, once we get the hang of it.  

Formula for projection of A onto W (let **U** be the unit vector of W, *aka* the direction of W; let **P** be the projection):
$$
U = \frac{W}{||W||}
$$
$$
||P|| = U \cdot A
$$
$$
P = (U \cdot A) \, U
$$
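
To make the projection formulas concrete, here is a minimal sketch in plain Python. The helper names (`dot`, `norm`, `project`) and the example vectors are assumptions made up for illustration; they are not part of any library.

```python
import math

def dot(a, b):
    # Dot product of two equal-length vectors stored as lists.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean norm: square root of the vector dotted with itself.
    return math.sqrt(dot(a, a))

def project(a, w):
    # U is the unit vector (direction) of W.
    w_norm = norm(w)
    u = [x / w_norm for x in w]
    # U . A is the signed length of the projection of A onto W.
    length = dot(u, a)
    # P is that length multiplied by the direction U.
    p = [length * x for x in u]
    return length, p

# Hypothetical example vectors, chosen for easy arithmetic.
A = [3.0, 4.0]
W = [2.0, 1.0]
length, P = project(A, W)
print(length, P)
```

With these numbers the projection comes out at roughly `[4.0, 2.0]`, which points in the same direction as `W`, as expected.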


Formula for norm:
$$
\| \vec{A} \| = \sqrt{\vec{A} \cdot \vec{A}}
$$

Note that we would only be interested in the norm of the projection, not the projection itself. We would simply calculate $||P|| = U \cdot A$, and double it (because this value is only one side of the hyperplane, and we want to take into account both sides). That would be our margin.
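
The margin calculation can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions: the hyperplane passes through the origin (no intercept term), the points are made up, and `margin` is a hypothetical helper. The absolute value is taken so that points on either side of the hyperplane give a positive distance.

```python
import math

def dot(a, b):
    # Dot product of two equal-length vectors stored as lists.
    return sum(x * y for x, y in zip(a, b))

def margin(w, points):
    # Distance from each point to the hyperplane: |U . A| = |W . A| / ||W||.
    w_norm = math.sqrt(dot(w, w))
    distances = [abs(dot(w, a)) / w_norm for a in points]
    # The closest point sets the limit; doubling covers both sides.
    return 2 * min(distances)

# Hypothetical data: two points on each side of the hyperplane.
W = [2.0, 1.0]
points = [[3.0, 4.0], [1.0, 1.0], [-2.0, -1.0], [-4.0, 0.0]]
print(margin(W, points))
```

Here the closest point is `[1.0, 1.0]`, so the margin is about 2.68.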

We now know how to calculate the margin for one hyperplane. Let's see how we can find the optimal hyperplane, with the biggest margin.  

Since our goal is the biggest possible margin, we need the biggest possible norm for $P$, our projection vector. As we now know, $||P|| = U \cdot A$.  
$U = \frac{W}{||W||}$, so we can substitute that into the equation for $||P||$. 
$$
||P|| = \frac{W}{||W||} \cdot A
$$
$$
||P|| = \frac{W \cdot A}{||W||}
$$
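
As a quick sanity check, here is a worked example with made-up numbers (the values of $W$ and $A$ are assumptions chosen for easy arithmetic, not data from a real problem). Take $W = (2, 1)$ and $A = (3, 4)$:
$$
||W|| = \sqrt{2^2 + 1^2} = \sqrt{5}
$$
$$
W \cdot A = 2 \cdot 3 + 1 \cdot 4 = 10
$$
$$
||P|| = \frac{W \cdot A}{||W||} = \frac{10}{\sqrt{5}} = 2\sqrt{5} \approx 4.47
$$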


The math does involve some more steps, but I'm going to simplify it a bit.  
If you want to see each step of the way, I would recommend checking out this [article](https://www.svm-tutorial.com/2014/11/svm-understanding-math-part-2/). You might want to read every part; it's worth it.  

All right, the simplification is as follows: to get $||P||$ as big as possible, we look at our variables. We have $W$, our hyperplane vector. If we change it, its norm changes too. There are many ways in which we could change it, but which one is the best? Let's go to $A$. This is a data point, so we know for sure we're not modifying anything about it. Lastly, we have $||W||$, which depends on $W$. The full derivation (see the article above) fixes the scale of $W$ so that the numerator stays constant for the closest data points, which leaves $||W||$ as the only quantity that matters. What's different now is that instead of having so many directions in which we could change $W$, we only have one for $||W||$: make it smaller. To maximize $||P||$, we want to make $||W||$ as small as possible. And that gives us a target:  

#### Minimize the norm of our weights vector W
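
To see why a smaller norm goes hand in hand with a bigger margin, here is a rough sketch in plain Python. It rests on several assumptions that are not spelled out above: the hyperplane passes through the origin (no intercept term), each candidate $W$ is rescaled so that the closest data point satisfies $|W \cdot A| = 1$, and under that scaling the margin equals $2 / ||W||$. The helper `canonical_margin` and the data points are hypothetical.

```python
import math

def dot(a, b):
    # Dot product of two equal-length vectors stored as lists.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean norm of a vector.
    return math.sqrt(dot(a, a))

def canonical_margin(w, points):
    # Rescale W so that the closest point satisfies |W . A| = 1.
    closest = min(abs(dot(w, a)) for a in points)
    w_scaled = [c / closest for c in w]
    # With that scaling, the margin is 2 / ||W||.
    return w_scaled, 2.0 / norm(w_scaled)

# Hypothetical data: two points on each side of the hyperplane.
points = [[3.0, 4.0], [1.0, 1.0], [-2.0, -1.0], [-4.0, 0.0]]

# Two candidate hyperplanes, both separating the points.
w1, m1 = canonical_margin([2.0, 1.0], points)
w2, m2 = canonical_margin([1.0, 1.0], points)

print(norm(w1), m1)
print(norm(w2), m2)
```

After rescaling, the second candidate has the smaller norm and the larger margin, which is the relationship the minimization target relies on.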

