## Linear Supervised Learning Series

# Part 4: Support vector machines

Here we discuss an often-used variation of the original perceptron, called the margin perceptron, which is once again based on analyzing the geometry of the classification problem, where a line (or hyperplane in higher dimensions) is used to separate two classes of data. We then build on this fundamental concept to derive *support vector machines*, a popular method used for linear classification.  While this approach provides interesting insight into the two-class classification process, we will eventually see that support vector machines do not differ fundamentally from logistic regression (or, likewise, the perceptron).

In [1]:
# imports from custom library
import sys
sys.path.append('../../')
import matplotlib.pyplot as plt
from mlrefined_libraries import superlearn_library as superlearn
import autograd.numpy as np
import pandas as pd
%matplotlib notebook

# this is needed to compensate for %matplotlib notebook's tendency to blow up images when plotted inline
from matplotlib import rcParams
rcParams['figure.autolayout'] = True

%load_ext autoreload
%autoreload 2

# 1.  Support vector machines

In this Section we derive support vector machines based on the notion of a *margin-perceptron*, a close relative of the perceptron approach described in the previous post (it is in fact derived in almost exactly the same manner).  For perfectly linearly separable datasets the support vector machine provides a different separating hyperplane than that of logistic regression / the perceptron.  Nonetheless, because perfectly linearly separable datasets are the exception rather than the rule, we will see that support vector machines in general do not provide a considerably different decision boundary than the one provided by logistic regression / the perceptron.

## 1.1 The margin-perceptron

Suppose once again that we have a two-class classification training dataset of $P$ points $\left\{ \left(\mathbf{x}_{p},y_{p}\right)\right\} _{p=1}^{P}$, where the labels $y_p \in \{\pm 1\}$.  Suppose for the time being that this dataset is perfectly linearly separable with a known linear decision boundary $\mathbf{x}^{T}\mathbf{w}^{\,}=0$ passing evenly between the two classes, as illustrated in the figure below.

<figure>
  <img src= '../../mlrefined_images/superlearn_images/Fig_4_4.png' width="50%" height="50%" alt=""/>
  <figcaption>   
<strong>Figure 1:</strong> <em> For linearly separable data the width of the buffer zone confined between two evenly spaced translates of a separating hyperplane, each just touching its respective class, defines the margin of that separating hyperplane.</em>  </figcaption> 
</figure>

This separating hyperplane creates a buffer between the two classes confined between two evenly shifted versions of itself: one version that lies above the separator and just touches the class having labels $y_{p}=+1$ taking the form $\mathbf{x}^{T}\mathbf{w}^{\,}=+1$, and one lying below it just touching the class with labels $y_{p}=-1$ taking the form $\mathbf{x}^{T}\mathbf{w}^{\,}=-1$.  The translations above and below the separating hyperplane are more generally defined as $\mathbf{x}^{T}\mathbf{w}^{\,}=+\beta^{\,}$ and $\mathbf{x}^{T}\mathbf{w}^{\,}=-\beta^{\,}$ respectively, where $\beta>0$. However by dividing off $\beta$ in both equations and reassigning the variables as $\mathbf{w}\longleftarrow\frac{\mathbf{w}}{\beta}$, we can leave out the redundant parameter $\beta$ and have the two translations as stated $\mathbf{x}^{T}\mathbf{w}^{\,}=\pm1^{\,}$. 

The fact that all points in the $+1$ class lie on or above $\mathbf{x}^{T}\mathbf{w}^{\,}=+1$, and all points in the $-1$ class lie on or below $\mathbf{x}^{T}\mathbf{w}^{\,}=-1$, can be written formally as the following conditions

\begin{equation}
\begin{array}{cc}
\mathbf{x}_{p}^{T}\mathbf{w}^{\,}\geq1 & \,\,\,\,\text{if} \,\,\, y_{p}=+1\\
\mathbf{x}_{p}^{T}\mathbf{w}^{\,}\leq-1 & \,\,\,\,\text{if} \,\,\, y_{p}=-1
\end{array}
\end{equation}

We can combine these conditions into a single statement by multiplying each by their respective label values, giving the single inequality $y_{p}^{\,}\left(\mathbf{x}_{p}^{T}\mathbf{w}^{\,}\right)\geq1$ which can be equivalently written as 

\begin{equation}
\mbox{max}\left(0,\,1-y_{p}^{\,}\mathbf{x}_{p}^{T}\mathbf{w}_{\,}^{\,}\right)=0.
\end{equation}

Again, this value is precisely equal to zero if the point lies on the correct side of its margin hyperplane; otherwise it is positive, so the expression is always nonnegative.  Summing the expression over all $P$ points gives the *margin-perceptron* cost (so called because of its close resemblance to the basic perceptron cost described in the previous post).

\begin{equation}
g\left(\mathbf{w}\right)=\underset{p=1}{\overset{P}{\sum}}\text{max}\left(0,\,1-y_{p}\mathbf{x}_{p}^{T}\mathbf{w}_{\,}^{\,}\right)
\end{equation}

Notice the striking similarity between the original perceptron cost from the previous post and the margin perceptron cost above: naively we have just 'added a $1$' to the non-zero input of the 'max' function in each summand. However this additional '$1$' prevents the issue of a trivial zero solution with the original perceptron discussed previously, which simply does not arise here. 
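
The difference between the two costs at $\mathbf{w}=\mathbf{0}$ can be checked numerically. The following is a minimal sketch and not code from this series' library: the toy dataset and the function names are hypothetical, and each input is assumed to carry a leading $1$ so that `X @ w` evaluates $\mathbf{x}_p^T\mathbf{w}$ for every point at once.

```python
import numpy as np

# Hypothetical toy dataset: each row is [1, x_1, x_2]
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 0.0, 0.0],
              [1.0, -1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def perceptron_cost(w, X, y):
    # original perceptron: sum over p of max(0, -y_p x_p^T w)
    return np.sum(np.maximum(0, -y * (X @ w)))

def margin_perceptron_cost(w, X, y):
    # margin-perceptron: sum over p of max(0, 1 - y_p x_p^T w)
    return np.sum(np.maximum(0, 1 - y * (X @ w)))

w_zero = np.zeros(X.shape[1])
perceptron_at_zero = perceptron_cost(w_zero, X, y)
margin_at_zero = margin_perceptron_cost(w_zero, X, y)
print(perceptron_at_zero, margin_at_zero)
```

At the zero vector every summand of the original perceptron vanishes, while every summand of the margin-perceptron equals $1$, so the second value equals the number of points $P$.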

If the data is indeed linearly separable, any hyperplane passing between the two classes will have (suitably scaled) parameters $\mathbf{w}$ for which $g\left(\mathbf{w}\right)=0$. However the margin perceptron is still a valid cost function even if the data is not linearly separable.  The only difference is that with such a dataset we cannot make the criteria above hold for all points in the dataset. Thus a violation for the $p^{\textrm{th}}$ point adds the positive value of $1-y_{p}^{\,}\mathbf{x}_{p}^{T}\mathbf{w}^{\,}$ to the cost function.

Like the perceptron cost, this cost function goes by several names, including the *rectified linear unit* cost (or *relu cost* for short) and the *hinge cost* (since when plotted a relu function looks like a 'hinge').   This cost function is *always convex* but has only a single (discontinuous) derivative in each input dimension.  This means that we can only use gradient descent to minimize it (Newton's method requires a function to have a second derivative as well).  Unlike the original perceptron, however, the margin-perceptron does *not* have a trivial solution at $\mathbf{w} = \mathbf{0}$: there every summand equals $\text{max}\left(0,\,1\right) = 1$, so $g\left(\mathbf{0}\right) = P$ rather than zero.

>  The margin-perceptron cost is *always convex* but has only a single (discontinuous) derivative in each input dimension, so we can only use gradient descent to minimize it.  Unlike the original perceptron it has no trivial solution at $\mathbf{w} = \mathbf{0}$, since $g\left(\mathbf{0}\right) = P$.

#### <span style="color:#a50e3e;">Example 1: </span> Using gradient descent to minimize the margin-perceptron cost

In this example we use (unnormalized) gradient descent to minimize the margin-perceptron cost function.  Note however that in examining a partial derivative of just one summand $\mbox{max}\left(0,\,1-y_{p}\mathbf{x}_{p}^{T}\mathbf{w}_{\,}^{\,}\right)$

\begin{equation}
\frac{\partial}{\partial w_n}  \mbox{max}\left(0,\,1-y_{p}^{\,}\mathbf{x}_{p}^{T}\mathbf{w}_{\,}^{\,}\right) = \begin{cases} -y_p x_{p,n} \,\,\,\,\text{if} \,\,\, 1-y_{p}^{\,}\mathbf{x}_{p}^{T}\mathbf{w}^{\,} > 0 \\ 0 \,\,\,\, \text{else} \end{cases}
\end{equation}

we can conclude that the magnitude of the full cost function's gradient will not necessarily diminish to zero as we approach a minimum; instead it is determined by which points of the dataset currently violate the margin (this was also an issue with the standard perceptron cost).  A fixed steplength value $\alpha$ is therefore a poor choice here, as it can lead to oscillating 'zig-zag' behavior that never settles at a minimum (as detailed in our series on *mathematical optimization*).  Instead we use e.g., a diminishing steplength value in order to force the length of each step to eventually vanish, which provides guaranteed convergence (to a global minimum).  

In the next Python cell we load in the first dataset originally introduced in Example 5 of our post on logistic regression.

In [2]:
# load in dataset
data = np.loadtxt('../../mlrefined_datasets/superlearn_datasets/3d_classification_data_v0.csv',delimiter = ',')

In [3]:
# define the input and output of our dataset - assuming arbitrary N > 2 here
x = data[:,:-1]
y = data[:,-1]
y.shape = (len(y),1)

Next we implement the relu cost function in Python.

In [4]:
# the relu cost function
def margin_perceptron(w):
    cost = 0
    for p in range(0,len(y)):
        x_p = x[p]
        y_p = y[p]
        a_p = w[0] + sum([a*b for a,b in zip(w[1:],x_p)])
        cost += np.maximum(0,1-y_p*a_p)
    return cost

Then we make a gradient descent run of 50 steps, randomly initialized, with a diminishing steplength value $\alpha = \frac{1}{k}$ where $k$ is the step number.

In [5]:
# declare an instance of our optimizers
opt = superlearn.optimimzers.MyOptimizers()

# run desired algo with initial point, max number of iterations, etc.,
w_hist = opt.gradient_descent(g = margin_perceptron,w = np.random.randn(np.shape(x)[1]+1,1),max_its = 50,alpha = 10**-2,steplength_rule = 'diminishing')
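
For readers without the custom library, the same kind of run can be sketched in plain NumPy. This is an illustrative sketch under stated assumptions rather than the implementation behind `opt.gradient_descent`: the toy dataset, the function names, the steplength $\alpha = \frac{1}{k}$, the zero initialization, and the choice to keep the best weights seen so far are all made here for illustration. The gradient used is the piecewise formula given earlier, summed over the points that violate the margin.

```python
import numpy as np

# Hypothetical toy dataset: each row is [1, x_1, x_2]
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 0.0, 0.0],
              [1.0, -1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def cost(w):
    # margin-perceptron cost: sum over p of max(0, 1 - y_p x_p^T w)
    return np.sum(np.maximum(0, 1 - y * (X @ w)))

def gradient(w):
    # only points with 1 - y_p x_p^T w > 0 contribute -y_p x_p
    active = (1 - y * (X @ w)) > 0
    return -(X[active] * y[active, None]).sum(axis=0)

def descend(w_init, max_its=50):
    w = w_init.copy()
    best_w, best_cost = w.copy(), cost(w)
    for k in range(1, max_its + 1):
        w = w - (1.0 / k) * gradient(w)   # diminishing steplength 1/k
        if cost(w) < best_cost:           # steps need not decrease the cost
            best_w, best_cost = w.copy(), cost(w)
    return best_w, best_cost

initial_cost = cost(np.zeros(X.shape[1]))
best_w, best_cost = descend(np.zeros(X.shape[1]))
print(initial_cost, best_cost, best_w)
```

Tracking the best weights is a common safeguard for this kind of cost, since an individual step is not guaranteed to decrease it.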

Finally we plot the dataset and learned decision boundary $\mathbf{x}^{T}\mathbf{w}^{\star} = 0$ (right panel) below.  Although we focused exclusively on learning the decision boundary with the perceptron perspective, the perceptron indirectly learns a logistic surface given by $\text{tanh}\left(\mathbf{x}^{T}\mathbf{w}^{\star}\right)$, just as logistic regression 'indirectly' learns a decision boundary on its way to fitting the logistic surface properly.  We plot this learned surface along with the dataset in 3-dimensions in the left panel below.

In [6]:
# create instance of 3d demos
demo5 = superlearn.classification_3d_demos.Visualizer(data)

# draw the final results
demo5.static_fig(w_hist,view = [15,-140])


## 1.2 A quest for the hyperplane with maximum margin

When two classes of data are linearly separable infinitely many hyperplanes could be drawn to separate the data. In the figure below we display two such hyperplanes for a given synthetic dataset. Given that both classifiers (as well as any other decision boundary perfectly separating the data) would perfectly classify the data, is there one that we can say is the 'best' of all possible separating hyperplanes?  

One reasonable standard for judging the quality of these hyperplanes is via their margin lengths, that is, the distance between the evenly spaced translates that just touch each class. Intuitively, the larger this distance is, the better the associated hyperplane separates the entire space given the particular distribution of the data. This idea is illustrated pictorially in the figure below.

<figure>
  <img src= '../../mlrefined_images/superlearn_images/Fig_4_13.png' width="30%" height="30%" alt=""/>
  <figcaption>   
<strong>Figure 2:</strong> <em> Of the infinitely many hyperplanes that exist between two classes of linearly separable data the one with maximum margin does an intuitively better job than the rest at distinguishing between classes because it more equitably partitions the entire space based on how the data is distributed. In this illustration two separators are shown along with their respective margins. While both perfectly distinguish between the two classes the green separator (with smaller margin) divides up the space in a rather awkward fashion given how the data is distributed, and will therefore tend to more easily misclassify future datapoints. On the other hand, the black separator (having a larger margin) divides up the space more evenly with respect to the given data, and will tend to classify future points more accurately.</em>  </figcaption> 
</figure>

To find the separating hyperplane with maximum margin, remember that the margin of a particular hyperplane $\mathbf{x}^{T}\mathbf{w}^{\,}=0$ is the width of the buffer zone confined between two symmetric translations of itself, written conveniently as $\mathbf{x}^{T}\mathbf{w}^{\,}=\pm1$, each just touching one of the two classes.  

Recall the notation we are using to compactly describe the decision boundary $\mathbf{x}^T\mathbf{w}^{\,} = 0$, where

\begin{equation}
\mathbf{w} = \begin{bmatrix}  w_0 \\ w_1 \\ \, \vdots \\ w_N \end{bmatrix}
\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,
\mathbf{x} = \begin{bmatrix}  1 \\ x_1 \\ \, \vdots \\ x_N \end{bmatrix}
\end{equation}

Because we must isolate the *normal vector* to these hyperplanes, we will use the notation $\mathbf{x}_{1:N}$ and $\mathbf{w}_{1:N}$ to denote the vector of the last $N$ elements of each, i.e., 

\begin{equation}
\mathbf{w}_{1:N} = \begin{bmatrix}  w_1 \\ \, \vdots \\ w_N \end{bmatrix}
\,\,\,\,\,\,\,\,\,\,\,\,\,\,\,
\mathbf{x}_{1:N} = \begin{bmatrix}  x_1 \\ \, \vdots \\ x_N \end{bmatrix}
\end{equation}

With this notation we can express our decision boundary as 

\begin{equation}
\mathbf{x}^T\mathbf{w}^{\,} = w_0^{\,} + \mathbf{x}_{1:N}^T\mathbf{w}_{1:N}^{\,} = 0
\end{equation}

where $\mathbf{w}_{1:N}$ is the normal vector to the hyperplane.

As shown in the figure below the margin can be determined by calculating the distance between any two points (one from each translated hyperplane) both lying on the normal vector $\mathbf{w}_{1:N}$. Denoting by $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ the points on the normal vector belonging to the *upper* and *lower* translated hyperplanes, respectively, the margin is computed simply as the length of the line segment connecting them, i.e., $\left\Vert \mathbf{x}_{1,1:N}-\mathbf{x}_{2,1:N}\right\Vert _{2}$. 

<figure>
  <img src= '../../mlrefined_images/superlearn_images/Fig_4_14.png' width="50%" height="50%" alt=""/>
  <figcaption>   
<strong>Figure 3:</strong> <em> The margin of a separating hyperplane can be calculated by measuring the distance between the two points of intersection of the normal vector $\mathbf{w}_{1:N}$ and the two equidistant translations of the hyperplane. This distance can be shown to have the value of $\frac{2}{\left\Vert \mathbf{w}_{1:N}\right\Vert _{2}}$ (see text for further details). </em>  </figcaption> 
</figure>

The margin can be written much more conveniently by taking the difference of the two translates evaluated at $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$ respectively, as 

\begin{equation}
\left(w_0^{\,} + \mathbf{x}_{1,1:N}^T\mathbf{w}_{1:N}^{\,}\right) - \left(w_0^{\,} + \mathbf{x}_{2,1:N}^T\mathbf{w}_{1:N}^{\,}\right) =\left(\mathbf{x}_{1,1:N}^{\,}-\mathbf{x}_{2,1:N}^{\,}\right)^{T}\mathbf{w}_{1:N}^{\,}=2.
\end{equation}

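Using the inner product rule (see Appendix A) and the fact that the two vectors $\mathbf{x}_{1,1:N}-\mathbf{x}_{2,1:N}$ and $\mathbf{w}_{1:N}$ are parallel to each other, so that their inner product equals the product of their lengths $\left\Vert \mathbf{x}_{1,1:N}-\mathbf{x}_{2,1:N} \right\Vert _2 \left\Vert \mathbf{w}_{1:N}\right\Vert _{2}$, we can solve for the margin directly in terms of $\mathbf{w}_{1:N}$, as 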

\begin{equation}
\left\Vert \mathbf{x}_{1,1:N}-\mathbf{x}_{2,1:N} \right\Vert _2=\frac{2}{\left\Vert \mathbf{w}_{1:N}\right\Vert _{2}}.
\end{equation}

Therefore finding the separating hyperplane with maximum margin is equivalent to finding the one whose normal vector $\mathbf{w}_{1:N}$ has the smallest possible length. 
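
The margin formula can be checked numerically. The sketch below uses a hypothetical weight vector chosen purely for illustration: it places one point on each translate along the line through the origin in the direction of the normal vector, then compares their distance with $\frac{2}{\left\Vert \mathbf{w}_{1:N}\right\Vert _{2}}$.

```python
import numpy as np

# Hypothetical weights [w_0, w_1, w_2], chosen only for illustration
w = np.array([-1.0, 0.5, 0.5])
w0, normal = w[0], w[1:]

# A point t * normal lies on the translate w_0 + x^T normal = c
# exactly when t = (c - w_0) / ||normal||^2
t_upper = (1.0 - w0) / (normal @ normal)
t_lower = (-1.0 - w0) / (normal @ normal)
x1 = t_upper * normal   # on the upper translate (value +1)
x2 = t_lower * normal   # on the lower translate (value -1)

margin_direct = np.linalg.norm(x1 - x2)
margin_formula = 2.0 / np.linalg.norm(normal)
print(margin_direct, margin_formula)
```

Both quantities agree, and scaling `normal` up by any factor shrinks the margin by the same factor, which is why a shorter normal vector corresponds to a wider margin.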

## 1.3  The hard-margin SVM problem

In order to find a separating hyperplane for the data with minimum length normal vector we can simply minimize $\left\Vert \mathbf{w}_{1:N}\right\Vert _{2}^{2}$ subject to the constraint that the hyperplane perfectly separates the data (given by the margin criterion described above). This gives the so-called *hard-margin support vector machine* problem

\begin{equation}
\begin{aligned}\underset{\mathbf{w}}{\mbox{minimize}} & \,\,\left\Vert \mathbf{w}_{1:N}\right\Vert _{2}^{2}\\
\mbox{subject to} & \,\,\,\text{max}\left(0,\,1-y_{p}^{\,}\mathbf{x}_{p}^{T}\mathbf{w}^{\,} \right)=0,\,\,\,\,p=1,...,P.
\end{aligned}
\end{equation}

Unlike the minimization problems we have seen so far, here we have a set of constraints on the permissible values of $\mathbf{w}$ that guarantee that the hyperplane we recover separates the data perfectly.  Problems of this sort can be directly solved using a variety of optimization techniques that we do not discuss here. 
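
As one illustration of such a technique, the constrained problem can be handed to a general-purpose solver. The sketch below is an assumption of this example and not the method used elsewhere in this series: it uses SciPy's `minimize` with the SLSQP method on a hypothetical toy dataset, writing each constraint in the equivalent form $y_{p}\mathbf{x}_{p}^{T}\mathbf{w} - 1 \geq 0$.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy dataset: each row is [1, x_1, x_2]
X = np.array([[1.0, 2.0, 2.0],
              [1.0, 3.0, 3.0],
              [1.0, 0.0, 0.0],
              [1.0, -1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(w):
    # squared length of the normal vector only; the bias w_0 is not penalized
    return w[1:] @ w[1:]

def margin_constraints(w):
    # SLSQP reads 'ineq' constraints as fun(w) >= 0, i.e. y_p x_p^T w >= 1
    return y * (X @ w) - 1.0

result = minimize(objective, x0=np.zeros(X.shape[1]), method='SLSQP',
                  constraints=[{'type': 'ineq', 'fun': margin_constraints}])
w_star = result.x
margin = 2.0 / np.linalg.norm(w_star[1:])
print(w_star, margin)
```

For this dataset the maximum-margin solution can be worked out by hand as $w_0=-1,\, w_1=w_2=\frac{1}{2}$, with the points $(2,2)$ and $(0,0)$ lying exactly on the two translates, so the solver's output can be compared against it.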

#### <span style="color:#a50e3e;">Example 2: </span>  Approximately solving the hard-margin problem

The output of the next Python cell shows the SVM hyperplane learned for a toy dataset along with the buffer zone confined between the separating hyperplane's translates. The points from each class lying on either boundary of the buffer zone are called support vectors, hence the name 'support vector machines', and are highlighted in green.