# Support Vector Machine

## Maximal Margin Classifier

### Hyperplane

In a p-dimensional space, a hyperplane is a flat affine subspace of
dimension p − 1. For instance, in two dimensions, a hyperplane is a flat
one-dimensional subspace—in other words, a line. In three dimensions, a
hyperplane is a flat two-dimensional subspace, that is, a plane. In p dimensions, a hyperplane is defined by the equation:
$$
\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p = 0
$$

We can also think of the
hyperplane as dividing p-dimensional space into two halves. One can easily
determine on which side of the hyperplane a point lies by simply calculating
the sign of the left-hand side of the above equation. A hyperplane in two dimensions is shown in the figure below.

![](images/09_01.png)
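
The sign test described above can be sketched in a few lines of Python. The function names and the coefficients below are hypothetical, chosen only to illustrate the idea.

```python
def hyperplane_value(beta0, beta, x):
    """Return beta0 + beta_1*x_1 + ... + beta_p*x_p for one point x."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def side_of_hyperplane(beta0, beta, x):
    """Return +1 or -1 for the two sides, or 0 if x lies on the hyperplane."""
    value = hyperplane_value(beta0, beta, x)
    return (value > 0) - (value < 0)


# Hypothetical hyperplane in two dimensions: 1 + 2*X1 + 3*X2 = 0
beta0, beta = 1.0, [2.0, 3.0]
print(side_of_hyperplane(beta0, beta, [1.0, 1.0]))    # 1 + 2 + 3 = 6 > 0, prints 1
print(side_of_hyperplane(beta0, beta, [-1.0, -1.0]))  # 1 - 2 - 3 = -4 < 0, prints -1
```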

### Classification Using a Separating Hyperplane

Suppose that we have an $n\times p$ data matrix X that consists of n training
observations in p-dimensional space. These observations fall into two classes, labeled 1 and -1. Suppose that it is possible to construct a hyperplane that separates the
training observations perfectly according to their class labels. That is, we have a hyperplane such that:
$$
\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} > 0 \quad \text{if} \quad y_{i} = 1
$$
and
$$
\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} < 0 \quad \text{if} \quad y_{i} = -1
$$
This is equivalent to:
$$
y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) > 0
$$

If a separating hyperplane exists, we can use it to construct a very natural
classifier: a test observation is assigned a class depending on which side of
the hyperplane it is located. We can even use the magnitude of $f(x) = \beta_0 + \beta_1 x_{1} + \beta_2 x_{2} + \cdots + \beta_p x_{p}$ to determine how "sure" the classifier is about the class of the test observation: the farther $f(x)$ is from zero, the farther the observation lies from the hyperplane, and the more confident we can be about its class assignment.

![](images/09_02.png)
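
As a minimal sketch, the separation condition $y_i f(x_i) > 0$ and the resulting classifier could be written as follows. The toy data, the coefficients, and the function names are assumptions made for illustration, and a point lying exactly on the hyperplane is arbitrarily assigned to class -1 here.

```python
def decision_value(beta0, beta, x):
    """f(x) = beta0 + beta_1*x_1 + ... + beta_p*x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def separates(beta0, beta, X, y):
    """True if y_i * f(x_i) > 0 holds for every training observation."""
    return all(yi * decision_value(beta0, beta, xi) > 0 for xi, yi in zip(X, y))


def classify(beta0, beta, x):
    """Return (label, confidence), where confidence is |f(x)|."""
    value = decision_value(beta0, beta, x)
    label = 1 if value > 0 else -1  # ties (value == 0) go to -1 by convention here
    return label, abs(value)


# Hypothetical training data and hyperplane X1 + X2 = 0
X = [[2.0, 2.0], [3.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
beta0, beta = 0.0, [1.0, 1.0]

print(separates(beta0, beta, X, y))         # True
print(classify(beta0, beta, [0.5, 0.25]))   # (1, 0.75): close to the hyperplane
print(classify(beta0, beta, [-3.0, -3.0]))  # (-1, 6.0): far from the hyperplane
```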

### The Maximal Margin Classifier

In general, if our data can be perfectly separated using a hyperplane, then
there will in fact exist an infinite number of such hyperplanes. In order to construct a classifier based upon a separating
hyperplane, we must have a reasonable way to decide which of the infinite
possible separating hyperplanes to use.

A natural choice is the maximal margin hyperplane, which is the separating hyperplane that
is farthest from the training observations. That is, we can compute the
(perpendicular) distance from each training observation to a given separating hyperplane; the smallest such distance is the minimal distance from the
observations to the hyperplane, and is known as the margin. The maximal
margin hyperplane is the separating hyperplane for which the margin is
largest, that is, the hyperplane that has the largest minimum distance to the training observations. We can then classify a test observation
based on which side of the maximal margin hyperplane it lies. This is known
as the maximal margin classifier.

![](images/09_03.png)

In the figure above, we see that three training observations are equidistant from the maximal margin hyperplane and lie along the dashed lines
indicating the width of the margin. These three observations are known as *support vectors*, since they are vectors in p-dimensional space and they “support” the maximal margin hyperplane in the sense
that if these points were moved slightly then the maximal margin hyperplane would move as well. Interestingly, the maximal margin hyperplane
depends directly on the support vectors, but not on the other observations:
a movement to any of the other observations would not affect the separating
hyperplane, provided that the observation’s movement does not cause it to
cross the boundary set by the margin.
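
To make the definition of the margin concrete, the sketch below computes the perpendicular distance from each observation to a given hyperplane, takes the smallest one as the margin, and reports which observations attain it. It does not search for the maximal margin hyperplane. The data and the hyperplane are hypothetical, picked so that the hyperplane is the perpendicular bisector of the two closest points of opposite classes.

```python
import math


def distance_to_hyperplane(beta0, beta, x):
    """Perpendicular distance |f(x)| / ||beta|| from x to the hyperplane."""
    norm = math.sqrt(sum(b * b for b in beta))
    return abs(beta0 + sum(b * xj for b, xj in zip(beta, x))) / norm


def margin_and_closest(beta0, beta, X, tol=1e-9):
    """Return the margin and the indices of the observations that attain it."""
    distances = [distance_to_hyperplane(beta0, beta, x) for x in X]
    margin = min(distances)
    closest = [i for i, d in enumerate(distances) if d - margin <= tol]
    return margin, closest


# Hypothetical data: the first two points are class +1, the last two class -1
X = [[1.0, 1.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -4.0]]
beta0, beta = 0.0, [1.0, 1.0]  # hyperplane X1 + X2 = 0

margin, closest = margin_and_closest(beta0, beta, X)
print(margin)   # sqrt(2), about 1.414
print(closest)  # [0, 2]: the observations lying on the margin
```

Moving observation 1 or 3 farther away leaves both outputs unchanged, which mirrors the statement that only the support vectors determine the hyperplane.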

#### Construction of the Maximal Margin Classifier

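The maximal margin hyperplane is the solution to the optimization problem:
$$
\text{maximize}_{\beta_0, \beta_1, \cdots, \beta_p, M} \; M
$$
subject to the constraints:
$$
\begin{aligned}
&\sum_{j=1}^p\beta_j^2 = 1,\\
&y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M \quad \forall i=1,2,\dots,n
\end{aligned}
$$
The second constraint requires that each
observation be on the correct side of the hyperplane, with some cushion,
provided that M is positive.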
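
It may help to spell out why M can be read as a distance. The perpendicular distance from the ith observation to the hyperplane is

$$
\frac{|\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}|}{\sqrt{\sum_{j=1}^p \beta_j^2}}
$$

Under the normalization $\sum_{j=1}^p \beta_j^2 = 1$ (the same constraint that appears in the support vector classifier problem), the denominator equals 1. For an observation on the correct side of the hyperplane, the absolute value equals $y_i(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip})$, so the constraint says that every observation lies at least a distance M from the hyperplane.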

## Support Vector Classifiers

In many cases, no hyperplane that separates the training observations exists. Moreover, even when one does exist, the maximal margin hyperplane can be extremely sensitive to a change in a single observation, which
suggests that it may have overfit the training data. In such cases, we might be willing to consider a classifier based on a hyperplane that does not perfectly separate the two classes, in the interest of:
* Greater robustness to individual observations, and
* Better classification of most of the training observations

The support vector classifier, sometimes called a soft margin classifier,
does exactly this. Rather than seeking the largest possible margin so that
every observation is not only on the correct side of the hyperplane but
also on the correct side of the margin, we instead allow some observations
to be on the incorrect side of the margin, or even the incorrect side of
the hyperplane.

### Details of the Support Vector Classifier

The support vector classifier classifies a test observation depending on
which side of a hyperplane it lies. The hyperplane is chosen to correctly separate most of the training observations into the two classes, but may
misclassify a few observations. It is the solution to the optimization problem:
$$
\text{maximize}_{\beta_0, \beta_1, \cdots, \beta_p, \epsilon_1, \cdots, \epsilon_n, M} \; M
$$
subject to the constraints:
$$
\begin{aligned}
&\sum_{j=1}^p\beta_j^2 = 1,\\
&y_i(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip}) \ge M(1-\epsilon_i) \quad \forall i=1,2,\dots,n,\\
&\epsilon_i \ge 0, \quad \sum_{i=1}^n\epsilon_i \le C
\end{aligned}
$$

Here C is a nonnegative tuning parameter. M is the width
of the margin; we seek to make this quantity as large as possible. $\epsilon_1, \cdots, \epsilon_n$ are slack variables that allow individual observations to be on
the wrong side of the margin or the hyperplane.

The slack variable $\epsilon_i$ tells us where the ith observation is located,
relative to the hyperplane and relative to the margin. If $\epsilon_i=0$ then the ith observation is on the correct side of the margin. If $\epsilon_i > 0$ then the ith observation is on the wrong side of the margin, and
we say that the ith observation has violated the margin. If $\epsilon_i > 1$ then it
is on the wrong side of the hyperplane.
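
For a fixed hyperplane with $\sum_j \beta_j^2 = 1$ and a fixed margin M, the smallest slack that satisfies $y_i f(x_i) \ge M(1-\epsilon_i)$ is $\epsilon_i = \max(0, 1 - y_i f(x_i)/M)$. The sketch below uses that expression to place an observation relative to the margin and the hyperplane. The hyperplane, the margin, and the function names are hypothetical.

```python
def slack(beta0, beta, M, x, y):
    """Smallest epsilon >= 0 with y * f(x) >= M * (1 - epsilon).

    Assumes sum(b**2 for b in beta) == 1 and M > 0.
    """
    signed_distance = y * (beta0 + sum(b * xj for b, xj in zip(beta, x)))
    return max(0.0, 1.0 - signed_distance / M)


def describe(eps):
    """Translate a slack value into the observation's position."""
    if eps == 0:
        return "correct side of the margin"
    if eps <= 1:
        return "violates the margin, not on the wrong side of the hyperplane"
    return "wrong side of the hyperplane"


# Hypothetical hyperplane X1 = 0 (unit-norm coefficients) with margin M = 2
beta0, beta, M = 0.0, [1.0, 0.0], 2.0

for x in ([3.0, 0.0], [1.0, 0.0], [-1.0, 0.0]):
    eps = slack(beta0, beta, M, x, y=1)
    print(x, eps, describe(eps))
# slacks are 0.0, 0.5 and 1.5 respectively
```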

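C bounds
the sum of the $\epsilon_i$’s, and so it determines the number and severity of the violations to the margin (and to the hyperplane) that we will tolerate. We can
think of C as a budget for the amount that the margin can be violated
by the n observations. If C = 0 there is no budget for violations, so every $\epsilon_i$ must be zero. For C > 0, no more than C observations can be on the wrong side of the hyperplane, because each such observation has $\epsilon_i > 1$ and the slack values sum to at most C. C also controls the bias-variance trade-off of the support
vector classifier. When the tuning parameter C is large, the margin is
wide, many observations violate the margin, and so there are many support
vectors. In this case, many observations are involved in determining the
hyperplane, which gives a classifier with lower variance but potentially higher bias. When C is small, the margin is narrow and rarely violated, there are fewer support vectors, and the classifier fits the training data more closely, with lower bias but higher variance.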
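
The budget interpretation can be checked numerically. In this sketch the slack values are made up for illustration; the point is only that each observation on the wrong side of the hyperplane consumes more than one unit of the budget, so there can be at most C of them.

```python
def within_budget(slacks, C):
    """True if all slacks are nonnegative and their sum does not exceed C."""
    return all(e >= 0 for e in slacks) and sum(slacks) <= C


def count_wrong_side(slacks):
    """Number of observations on the wrong side of the hyperplane (epsilon > 1)."""
    return sum(1 for e in slacks if e > 1)


# Hypothetical slack values for five observations
slacks = [0.0, 0.5, 1.5, 0.0, 1.25]

print(sum(slacks))               # 3.25
print(within_budget(slacks, 4))  # True
print(within_budget(slacks, 3))  # False
print(count_wrong_side(slacks))  # 2, which is below any feasible budget C >= 3.25
```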

The optimization problem has a very interesting property:
it turns out that only observations that either lie on the margin or that
violate the margin will affect the hyperplane, and hence the classifier obtained. In other words, an observation that lies strictly on the correct side
of the margin does not affect the support vector classifier! Changing the
position of that observation would not change the classifier at all, provided
that its position remains on the correct side of the margin. Observations
that lie directly on the margin, or on the wrong side of the margin for
their class, are known as support vectors. These observations do affect the
support vector classifier.

![](images/09_04.png)

The figure above shows support vector classifiers fit with different values of C. The largest value of C corresponds to the top-left panel.

## Support Vector Machines

The support vector classifier is a natural approach for classification in the
two-class setting, if the boundary between the two classes is linear. However, in practice we are sometimes faced with non-linear class boundaries. A natural solution to this problem is to enlarge the feature space using quadratic or cubic terms. For instance, rather than fitting a
support vector classifier using p features:
$$
X_1, X_2, \cdots, X_p 
$$
we could instead fit a support vector classifier using 2p features:
$$
X_1, X_1^2, X_2, X_2^2, \cdots, X_p, X_p^2
$$
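
A minimal sketch of this enlargement, assuming the data are stored as a list of rows (the function name is hypothetical): each feature is followed by its square, turning p columns into 2p.

```python
def add_squared_terms(X):
    """Map each row (x_1, ..., x_p) to (x_1, x_1^2, ..., x_p, x_p^2)."""
    return [[value for xj in row for value in (xj, xj ** 2)] for row in X]


X = [[1, 2], [3, -1]]
print(add_squared_terms(X))  # [[1, 1, 2, 4], [3, 9, -1, 1]]
```

A boundary that is linear in the enlarged features is generally quadratic in the original ones.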

The support vector machine (SVM) is an extension of the support vector
classifier that results from enlarging the feature space in a specific way,
using kernels. 

### Kernel

A kernel is a
function that quantifies the similarity of two observations. Suppose we have two vectors $x_i, x_{i^{'}}$. The inner product of these two vectors can be computed as
$$
\langle x_i, x_{i^{'}} \rangle = \sum_{j=1}^p x_{ij} x_{{i}^{'}j}
$$ 
The inner product is itself a kernel, known as the linear kernel:
$$
K(x_i, x_{i^{'}}) = \sum_{j=1}^p x_{ij} x_{{i}^{'}j}
$$

We could generalize the linear kernel as:
$$
K(x_i, x_{i^{'}}) = \left(1+\sum_{j=1}^p x_{ij} x_{{i}^{'}j}\right)^d
$$
which is a polynomial kernel of degree d. Using such a kernel with d > 1, instead of the standard linear kernel, in the support vector classifier algorithm leads to a much more
flexible decision boundary. It essentially amounts to fitting a support vector
classifier in a higher-dimensional space involving polynomials of degree d,
rather than in the original feature space.
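
The claim that a polynomial kernel corresponds to an enlarged feature space can be verified by hand for p = 2 and d = 2. In the sketch below, `phi_degree2` is a hypothetical explicit feature map whose ordinary inner product reproduces $(1 + \langle x, z \rangle)^2$; the kernel obtains the same number without ever building the six-dimensional vectors.

```python
import math


def linear_kernel(x, z):
    """Inner product of two observations."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, d):
    """Polynomial kernel of degree d: (1 + <x, z>)^d."""
    return (1 + linear_kernel(x, z)) ** d


def phi_degree2(x):
    """Explicit feature map for p = 2, d = 2 (illustrative only)."""
    x1, x2 = x
    r2 = math.sqrt(2)
    return [1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2]


x, z = [1.0, 2.0], [3.0, -1.0]
via_kernel = polynomial_kernel(x, z, 2)                        # (1 + 1)^2 = 4
via_features = linear_kernel(phi_degree2(x), phi_degree2(z))   # same value, up to rounding
print(via_kernel, via_features)
print(math.isclose(via_kernel, via_features))                  # True
```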

Another popular choice is the Radial Basis Function (RBF) kernel, which takes the form:
$$
K(x_i, x_{i^{'}}) = e^{-\gamma \sum_{j=1}^{p}(x_{ij} - x_{i^{'}j})^2}
$$
where $\gamma$, a positive constant, is a parameter that controls the width of the kernel.

![](images/09_05.png)
Polynomial Kernel and Radial Basis Function (RBF) Kernel
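
As a small illustration of how the RBF kernel behaves, the sketch below evaluates it directly from the formula. The inputs are arbitrary. The kernel equals 1 for identical observations and decays toward 0 as they move apart, and a larger $\gamma$ makes the decay faster.

```python
import math


def rbf_kernel(x, z, gamma):
    """exp(-gamma * squared Euclidean distance between x and z)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


x, z = [0.0, 0.0], [1.0, 1.0]
print(rbf_kernel(x, x, gamma=0.5))  # 1.0 for identical observations
print(rbf_kernel(x, z, gamma=0.5))  # exp(-1), about 0.368
print(rbf_kernel(x, z, gamma=2.0))  # exp(-4), smaller: a narrower kernel
```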

## SVMs with More than Two Classes

It turns out that the concept of separating hyperplanes upon which SVMs
are based does not lend itself naturally to more than two classes. Though
a number of proposals for extending SVMs to the K-class case have been
made, the two most popular are the *one-versus-one* and *one-versus-all*
approaches. 

### One-Versus-One Classification

Suppose that we would like to perform classification using SVMs, and there
are K > 2 classes. A one-versus-one or all-pairs approach constructs $\binom{K}{2} = K(K-1)/2$ SVMs, each of which compares a pair of classes. For example, one such
SVM might compare the kth class, coded as +1, to the $k^{'}$th class, coded
as −1. We classify a test observation using each of the $\binom{K}{2}$ classifiers, and
we tally the number of times that the test observation is assigned to each
of the K classes. The final classification is performed by assigning the test
observation to the class to which it was most frequently assigned in these
$\binom{K}{2}$ pairwise classifications.
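
The voting step can be sketched as follows. The pairwise classifiers here are not fitted SVMs: they are hypothetical stand-ins that pick whichever of two made-up class centers is nearer, so that the tallying logic can be run on its own. There is one classifier per unordered pair of classes, that is, K(K − 1)/2 in total, and ties are broken in favor of the class listed first.

```python
from collections import Counter
from itertools import combinations


def one_versus_one_predict(x, classes, pairwise_classifiers):
    """Tally the winner of every pairwise classifier and return the top class."""
    votes = Counter()
    for pair in combinations(classes, 2):
        votes[pairwise_classifiers[pair](x)] += 1
    # max() returns the first maximal element, so ties go to the earlier class
    return max(classes, key=lambda k: votes[k])


# Hypothetical stand-ins for fitted pairwise SVMs on one-dimensional data
centers = {"a": 0.0, "b": 5.0, "c": 10.0}


def make_pairwise(k, k_prime):
    return lambda x: k if abs(x - centers[k]) <= abs(x - centers[k_prime]) else k_prime


classes = ["a", "b", "c"]
classifiers = {pair: make_pairwise(*pair) for pair in combinations(classes, 2)}

print(len(classifiers))                                      # 3 = K(K-1)/2 for K = 3
print(one_versus_one_predict(6.0, classes, classifiers))     # b (two votes out of three)
```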

### One-Versus-All Classification

The one-versus-all approach is an alternative procedure for applying SVMs in the case of K > 2 classes. We fit K SVMs, each time comparing one of
the K classes to the remaining K − 1 classes. Let $\beta_{0k}, \beta_{1k}, \cdots , \beta_{pk}$ denote
the parameters that result from fitting an SVM comparing the kth class
(coded as +1) to the others (coded as −1). Let $x^*$ denote a test observation.
We assign the observation to the class for which $\beta_{0k} +\beta_{1k}x^*_1 +\beta_{2k}x^*_2 +\cdots+\beta_{pk}x^*_p$ is largest, as this amounts to a high level of confidence that the test
observation belongs to the kth class rather than to any of the other classes.
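
A minimal sketch of the one-versus-all decision rule, assuming the K sets of fitted parameters are already available. The coefficients below are invented for illustration and do not come from a fitted model.

```python
def one_versus_all_predict(x_star, params):
    """Return the class k whose linear score beta_0k + sum_j beta_jk * x*_j is largest.

    params maps each class label k to (beta_0k, [beta_1k, ..., beta_pk]).
    """
    def score(k):
        beta0, beta = params[k]
        return beta0 + sum(b * xj for b, xj in zip(beta, x_star))

    return max(params, key=score)


# Hypothetical parameters for K = 3 classes and p = 2 features
params = {
    "a": (0.0, [1.0, 0.0]),
    "b": (0.0, [0.0, 1.0]),
    "c": (1.0, [-1.0, -1.0]),
}

print(one_versus_all_predict([2.0, 0.5], params))    # a: scores are 2.0, 0.5, -1.5
print(one_versus_all_predict([-1.0, -1.0], params))  # c: scores are -1.0, -1.0, 3.0
```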