# 3) Support vector machines

* Supervised learning (briefly)
* Max margin intuition 
* Functional & geometric margins

### Supervised learning

In the Supervised Learning setting we are given a dataset $D = \{(x_1,y_1),...,(x_n,y_n)\}$ to learn from. The way we "learn" is by using D to search a hypothesis space H for some "best" hypothesis $g\in H$. By "best", we mean that we want g to be as close as possible to the __unknown target function f__, i.e.: 

$$g \approx f$$

We most often find g by minimizing an in-sample error function $E_{in}(h)$. The process of finding g is what we call "training" and when we have found it, we can use it to predict on new inputs.

Within supervised learning we distinguish between __two subcategories__. If the output y is a real number we call it __regression__. If the output y is a member of a discrete set - e.g. $y \in \{red,blue,green\}$ - then we call it __classification__.

In Supervised Learning we try to minimize $E_{in}(g)$, but what we _really_ care about in the end is having a low out-of-sample error $E_{out}(g)$, that is, in the end we only care about how g performs on _new_ data. Since we cannot measure $E_{out}(g)$ during training, we hope that we can generalize such that $E_{in}(g)$ - which we _can_ measure - is close to $E_{out}(g)$. 

### Support Vector Machines

#### Max margin intuition

When doing binary classification, we end up with some final hypothesis $g \in H$ which splits the input space in two. For linear models, this decision boundary is a separating hyperplane. If the data is linearly separable, the perceptron learning algorithm (PLA) can be used to find such a hyperplane, and is guaranteed to find one that achieves 0 in-sample error. But there is no guarantee which of the infinitely many perfectly separating hyperplanes it picks. In the picture below, there are 3 examples of hyperplanes that PLA could choose. 

<img src="imgs/maxmargin.png" style="width: 400px;"/>

The one on the left lies very close to one of the red points. One can imagine that a slight change to the red point would move it to the other side of the decision boundary, and thus we would predict "blue" while it might very well still belong to class "red". Intuitively it seems that the further a point is from the decision boundary, the more confident we can be that it is classified correctly (this holds true in the case of logistic regression, where we output probabilities of belonging to classes). Thus the hyperplane on the right seems like a better choice than the other two. The __Support Vector Machines__ model tries to find this "maximum margin" decision boundary. 
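The PLA behaviour described above, where it returns whichever separating hyperplane it happens to reach first, can be sketched as follows. This is a minimal illustration and not a reference implementation: the function name `pla`, the toy data, and the rule of always updating on the first misclassified point are assumptions made for the example.

```python
import numpy as np

def pla(X, y, max_iters=10_000):
    """Perceptron learning algorithm (illustrative sketch).

    Returns (w, b) with zero in-sample error, assuming the data is
    linearly separable and max_iters is large enough.
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_iters):
        # A point is misclassified when y_i * (w^T x_i + b) <= 0
        wrong = np.flatnonzero(y * (X @ w + b) <= 0)
        if wrong.size == 0:
            return w, b
        i = wrong[0]          # assumption: update on the first misclassified point
        w = w + y[i] * X[i]
        b = b + y[i]
    raise RuntimeError("no separating hyperplane found within max_iters")

# Toy linearly separable data, made up for illustration
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

# Same data, visited in two different orders
w1, b1 = pla(X, y)
w2, b2 = pla(X[::-1], y[::-1])

err1 = np.mean(np.sign(X @ w1 + b1) != y)
err2 = np.mean(np.sign(X @ w2 + b2) != y)
print("hyperplane 1:", w1, b1, "in-sample error:", err1)
print("hyperplane 2:", w2, b2, "in-sample error:", err2)
```

Both runs separate the data perfectly, but they end at different hyperplanes. PLA stops as soon as the in-sample error is 0 and has no preference for a large margin.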




#### Fat margins: generalization

A "big margin" decision boundary also has advantages in terms of generalization. If we add the constraint that the hypothesis must have at least "some" margin, then there are far fewer valid dichotomies, and thus the growth function is smaller than it would be without the margin requirement, which is good for generalization. 

9 min : https://www.youtube.com/watch?v=eHsErlPJWUU&t=138s
 
 
#### Lagrange multipliers purpose:
 
To find an extremum of a function subject to constraints, we use Lagrange multipliers.

They give us a new expression (the Lagrangian) that we can maximize/minimize without having to handle the original constraints separately.
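As a small worked example of this idea, minimize $f(x,y) = x^2 + y^2$ subject to $x + y = 1$. The function and constraint are chosen purely for illustration and are not part of the SVM derivation. We form the Lagrangian

$$\mathcal{L}(x,y,\lambda) = x^2 + y^2 - \lambda (x + y - 1)$$

and set its partial derivatives to zero:

$$\frac{\partial \mathcal{L}}{\partial x} = 2x - \lambda = 0, \quad \frac{\partial \mathcal{L}}{\partial y} = 2y - \lambda = 0, \quad \frac{\partial \mathcal{L}}{\partial \lambda} = -(x + y - 1) = 0$$

The first two equations give $x = y = \lambda/2$, and the third then gives $x = y = \frac{1}{2}$ with $\lambda = 1$. The constraint reappears as one of the stationarity conditions, so it no longer has to be treated separately.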


### Linearly separable data

__Functional margins__ $\hat{\gamma}_i = y_i(w^Tx_i +b)$: 

For a given hyperplane, the signal $w^T x_i +b$ tells us how confident we are in the prediction. A large positive signal tells us that we are confident in the prediction +1, and a large negative signal tells us that we are confident in the prediction -1. If the signal is close to 0, we are not confident either way. Multiplying by $y_i$ makes the functional margin positive exactly when the point is classified correctly. __These signals order the points by how near they are to the hyperplane, which is useful when comparing different points against one hyperplane. But what we want is to compare different hyperplanes against the same point, and the functional margin changes when we rescale $(w,b)$ even though the hyperplane itself stays the same. So we need the Euclidean distance, which is called the geometric margin.__ 

__Geometric margins__ $\gamma_i = y_i(\frac{w^T}{\|w\|}x_i +\frac{b}{\|w\|})$:


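Since rescaling $(w,b)$ by a positive constant does not change the hyperplane, we are free to fix the scale. So we add the constraint $y_i(w^Tx_i +b) = 1$ for the support vectors, i.e. the points closest to the hyperplane. 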
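The difference between the two margins, and the effect of fixing the scale, can be checked numerically. The sketch below uses a made-up dataset and a hand-picked hyperplane. The helper names `functional_margins` and `geometric_margins` are hypothetical and exist only for this example.

```python
import numpy as np

# Toy data and a hand-picked separating hyperplane (illustrative assumptions)
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

def functional_margins(w, b, X, y):
    # y_i * (w^T x_i + b) for every point
    return y * (X @ w + b)

def geometric_margins(w, b, X, y):
    # Functional margins divided by ||w||, i.e. signed Euclidean distances
    return functional_margins(w, b, X, y) / np.linalg.norm(w)

# Rescaling (w, b) describes the same hyperplane ...
c = 10.0
f_orig = functional_margins(w, b, X, y)
f_scaled = functional_margins(c * w, c * b, X, y)   # grows by the factor c
g_orig = geometric_margins(w, b, X, y)
g_scaled = geometric_margins(c * w, c * b, X, y)    # unchanged

# ... so we may pick the scale where the closest points have functional margin 1
scale = f_orig.min()
w_c, b_c = w / scale, b / scale
f_canonical = functional_margins(w_c, b_c, X, y)
margin = geometric_margins(w_c, b_c, X, y).min()

print("functional:", f_orig, "->", f_scaled)
print("geometric: ", g_orig, "->", g_scaled)
print("canonical functional margins:", f_canonical)
print("margin:", margin, " 1/||w||:", 1 / np.linalg.norm(w_c))
```

After the rescaling, the smallest geometric margin equals $\frac{1}{\|w\|}$, which is the quantity the max-margin hyperplane maximizes.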

http://rinterested.github.io/statistics/svm.html

https://stats.stackexchange.com/questions/267267/what-does-scaling-the-normal-vector-of-a-plane-hyperplane-mean?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa

...

The separating hyperplane can be expressed by the equation $w^Tx +b = 0$. The vector w is normal to the hyperplane, which can be shown by taking two points $x',x''$ that lie on the hyperplane. This means that $w^T x' +b = 0$ and $w^T x'' +b = 0$. Taking the difference between the two equations gives $w^Tx' - w^Tx'' = w^T(x'-x'') = 0$. It follows that w is orthogonal to the vector $(x'-x'')$, and since $x'$ and $x''$ were arbitrary points on the hyperplane, w is orthogonal to every direction within it and is thus a normal vector to the hyperplane.

If we consider the point $x_n$ which is the point closest to the hyperplane (but not on the hyperplane), we want to compute the distance from $x_n$ to the hyperplane, such that we can maximize this "margin". 

If we normalize w such that it has length 1 ($\hat{w} = \frac{w}{\| w \|}$), we can easily compute this distance as a dot product. If we take the difference between $x_n$ and an arbitrary point $x$ on the hyperplane, then we obtain a vector, $(x_n - x)$, going from $x$ to $x_n$. Now, since $\hat{w}$ has length 1, we get the distance from $x_n$ to the hyperplane as $\lvert \hat{w}^T (x_n - x) \rvert$ (the absolute value handles points on the other side of the hyperplane, where the dot product would be negative). 

The distance can be rewritten as follows: 

\begin{equation} 
       \begin{split}
        \text{dist}(\text{hyperplane},x_n) 
        &= \ \lvert \hat{w}^T (x_n - x) \rvert \\\\
        &= \frac{1}{\| w \|}\lvert w^T x_n - w^T x \rvert\\\\
        &=  \frac{1}{\| w \|}\lvert w^T x_n + b - w^T x - b \rvert \\\\
        &=  \frac{1}{\| w \|}\lvert w^T x_n + b \rvert \\\\
        &=  \frac{1}{\| w \|} 
    \end{split}
    \end{equation}

where $-w^T x - b$ disappears because it is 0 by definition of the hyperplane ($x$ lies on it), and $\lvert w^T x_n + b \rvert$ is equal to 1 by our previous constraint $y_n(w^T x_n + b) = 1$, since $y_n = \pm 1$. 
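The derivation above can be sanity-checked numerically. The following is a minimal sketch with a hypothetical hyperplane and point: the values of `w`, `b` and `x_n` are made up, and the two on-plane points are constructed only for the check.

```python
import numpy as np

# Hypothetical hyperplane w^T x + b = 0 and a point x_n off the hyperplane
w = np.array([3.0, 4.0])
b = -5.0
x_n = np.array([3.0, 4.0])

# Two points on the hyperplane: the one closest to the origin, and a second
# one obtained by moving along a direction orthogonal to w
x_a = -b * w / (w @ w)
x_b = x_a + np.array([4.0, -3.0])
on_plane = (w @ x_a + b, w @ x_b + b)   # both should be 0
normal_check = w @ (x_a - x_b)          # 0, so w is normal to the hyperplane

# Distance as a projection onto the unit normal
w_hat = w / np.linalg.norm(w)
dist_projection = abs(w_hat @ (x_n - x_a))

# Distance from the closed-form expression |w^T x_n + b| / ||w||
dist_formula = abs(w @ x_n + b) / np.linalg.norm(w)

# Rescale (w, b) so that |w^T x_n + b| = 1; the distance becomes 1 / ||w||
s = abs(w @ x_n + b)
w_c, b_c = w / s, b / s
dist_canonical = 1 / np.linalg.norm(w_c)

print(dist_projection, dist_formula, dist_canonical)
```

All three values agree, and the choice of the on-plane point does not matter because any two such points differ by a vector orthogonal to $w$.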