# Support Vector Machines

## Margins

For logistic regression, if $\boldsymbol{\theta}^{T}\boldsymbol{x} \gg 0$ then we are very confident that $y=1$. Similarly, if $\boldsymbol{\theta}^{T}\boldsymbol{x} \ll 0$ then we are very confident that $y=0$. So it is natural to seek a $\boldsymbol{\theta}$ such that, for each training example, $\boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)} \gg 0$ whenever $y^{(i)}=1$ and $\boldsymbol{\theta}^{T}\boldsymbol{x}^{(i)} \ll 0$ whenever $y^{(i)}=0$. This idea will be formalized later using functional margins.

Another interpretation is to consider a training set with a decision boundary (the set $\boldsymbol{\theta}^{T}\boldsymbol{x}=0$, also called the separating hyperplane) separating the positive and negative examples. We are confident in our predictions for points far from the decision boundary and less confident for points close to it. Thus, we would prefer every training point to lie a sufficient distance from the decision boundary. This idea will be formalized later using geometric margins.

We will use different notation to talk about support vector machines (SVMs). Consider a linear classifier for a binary classification problem with labels $y$ and features $\boldsymbol{x}$. Our class labels will be $y \in \{-1,1\}$, and our classifier will be parameterized by $\boldsymbol{w},b$ (we also drop the convention of $x_0=1$; the intercept is now carried by $b$). So, our classifier is

$$h_{w,b}(\boldsymbol{x}) = g(\boldsymbol{w}^{T}\boldsymbol{x}+b)$$

Here $g(z)=1$ if $z \geq 0$ and $g(z)=-1$ otherwise. So, we will directly predict $y=1$ or $y=-1$ without estimating $p(y=1)$ as logistic regression does.
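As a minimal sketch of the classifier above, the following plain-Python snippet implements $g$ and $h_{w,b}$. The function names `g` and `predict`, and the numeric values of `w`, `b`, and the inputs, are illustrative assumptions and not part of the notes.

```python
def g(z):
    """Threshold function: +1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1


def predict(w, b, x):
    """Linear classifier h_{w,b}(x) = g(w^T x + b)."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return g(z)


# Hypothetical parameters chosen only for illustration.
w = [2.0, -1.0]
b = -1.0

print(predict(w, b, [3.0, 1.0]))  # w^T x + b = 6 - 1 - 1 = 4, so the label is 1
print(predict(w, b, [0.0, 2.0]))  # w^T x + b = 0 - 2 - 1 = -3, so the label is -1
```

Note that the output is a hard label in $\{-1, 1\}$; no probability is computed anywhere.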

## Functional and Geometric Margins

Given a training example $(\boldsymbol{x}^{(i)},y^{(i)})$, the functional margin of $(\boldsymbol{w},b)$ w.r.t the training example is

$$\hat{\gamma}^{(i)} = y^{(i)}\left(\boldsymbol{w}^{T}\boldsymbol{x}^{(i)}+b\right)$$

If $y^{(i)}\left(\boldsymbol{w}^{T}\boldsymbol{x}^{(i)}+b\right)>0$ then our prediction for the example is correct. A large functional margin means a confident and correct prediction.
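To make the definition concrete, here is a small sketch that computes the functional margin $y(\boldsymbol{w}^{T}\boldsymbol{x}+b)$ of a single example. The helper name `functional_margin` and all numeric values are assumptions made for illustration.

```python
def functional_margin(w, b, x, y):
    """Functional margin of (w, b) with respect to one example (x, y), y in {-1, 1}."""
    return y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


# Hypothetical parameters and example.
w = [2.0, -1.0]
b = -1.0
x = [3.0, 1.0]  # w^T x + b = 6 - 1 - 1 = 4

correct = functional_margin(w, b, x, 1)   # label agrees with the sign: 4.0
wrong = functional_margin(w, b, x, -1)    # label disagrees with the sign: -4.0
print(correct, wrong)
```

A positive value means the sign of $\boldsymbol{w}^{T}\boldsymbol{x}+b$ agrees with the label, and a negative value means the example is misclassified.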

For a linear classifier with our choice of $g$ above, one property of the functional margin makes it a poor measure of confidence. Because $g$, and hence $h_{w,b}$, depends only on the sign and not the magnitude of $\boldsymbol{w}^{T}\boldsymbol{x}+b$, scaling the parameters $(\boldsymbol{w},b)$ by any positive constant does not change $h_{w,b}$. The functional margin, however, is multiplied by that same constant: replacing $(\boldsymbol{w},b)$ with $(2\boldsymbol{w},2b)$ doubles it. Thus, the freedom to scale the parameters allows the functional margin to be made arbitrarily large without changing anything meaningful. One may try to impose a normalization condition such as $||\boldsymbol{w}||_2 = 1$ (where $||\cdot||_2$ is the Euclidean norm). That is, we can replace $(\boldsymbol{w},b)$ with $(\boldsymbol{w}/||\boldsymbol{w}||_2,b/||\boldsymbol{w}||_2)$ and consider the functional margin of this instead. This idea will be revisited later.
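The scaling issue can be checked numerically. The sketch below, under assumed values for `w`, `b`, and a single example, shows that multiplying the parameters by 10 multiplies the functional margin by 10, while dividing by the Euclidean norm of `w` gives a scale-independent value. The helper names `functional_margin` and `scale` are hypothetical.

```python
import math


def functional_margin(w, b, x, y):
    """Functional margin y * (w^T x + b) for one example."""
    return y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


def scale(w, b, c):
    """Return the parameters (c * w, c * b)."""
    return [c * w_j for w_j in w], c * b


# Hypothetical parameters and example; ||w||_2 = 5.
w, b = [3.0, 4.0], -5.0
x, y = [2.0, 1.0], 1

base = functional_margin(w, b, x, y)          # 1 * (6 + 4 - 5) = 5.0

w10, b10 = scale(w, b, 10.0)
scaled = functional_margin(w10, b10, x, y)    # ten times larger, same prediction

norm = math.sqrt(sum(w_j ** 2 for w_j in w))  # Euclidean norm of w
w_unit, b_unit = scale(w, b, 1.0 / norm)
normalized = functional_margin(w_unit, b_unit, x, y)  # approximately 1.0

print(base, scaled, normalized)
```

Applying the same normalization to `(w10, b10)` gives the same normalized value up to floating-point rounding, which is the point of the normalization.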

Given a training set $S = \{(\boldsymbol{x}^{(i)},y^{(i)}) | i=1,\ldots,n\}$ we define the functional margin of $(\boldsymbol{w},b)$ w.r.t $S$ as the smallest of the functional margins of the individual training examples. We denote this as $\hat{\gamma}$ and write

$$\hat{\gamma} = \min_{i=1,\ldots,n} \hat{\gamma}^{(i)}$$
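As a rough illustration, the functional margin of a training set is just the minimum of the per-example margins. The tiny dataset, the parameters, and the function names below are assumptions made for this sketch.

```python
def functional_margin(w, b, x, y):
    """Functional margin y * (w^T x + b) for one example."""
    return y * (sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)


def set_functional_margin(w, b, X, Y):
    """Functional margin of (w, b) with respect to the whole training set."""
    return min(functional_margin(w, b, x, y) for x, y in zip(X, Y))


# Hypothetical parameters and three-example training set.
w, b = [2.0, -1.0], -1.0
X = [[3.0, 1.0], [0.0, 2.0], [1.0, 0.0]]
Y = [1, -1, 1]

margins = [functional_margin(w, b, x, y) for x, y in zip(X, Y)]
print(margins)                             # per-example margins: 4.0, 3.0, 1.0
print(set_functional_margin(w, b, X, Y))   # the smallest of them: 1.0
```

The set margin is determined by the example the classifier is least confident about, here the third one.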

Next we will talk about geometric margins.