# Support Vector Machines

## 9.1 Maximal Margin Classifier

SVMs are intended for binary classification settings.

#### Hyperplanes
A hyperplane in $p$-dimensional space is a flat affine subspace of dimension $p - 1$. In 2D this is a line. In 3D this is a plane. In two dimensions a hyperplane is defined by the formula:
$$\beta_0 + \beta_1X_1 + \beta_2X_2 = 0$$

When we say that the above formula defines the hyperplane, we mean that any $X = (X_1, X_2)^T$ for which the equation holds is a point on the hyperplane. The equation above describes a line, but we can add further coefficients and variables $X_j$ to obtain a hyperplane in $p$ dimensions: $\beta_0 + \beta_1X_1 + \dots + \beta_pX_p = 0$.

What if $\beta_0 + \beta_1X_1 + \beta_2X_2 > 0$? That tells us that $X$ lies on one side of the hyperplane. What if it is less than 0? Then $X$ lies on the other side.
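As a minimal sketch of this idea, the function below evaluates the left-hand side of the hyperplane equation and reports its sign. The name `hyperplane_side` and the coefficients are assumptions chosen purely for illustration.

```python
def hyperplane_side(beta0, beta, x):
    """Return +1, -1, or 0 according to the sign of beta0 + sum_j beta_j * x_j."""
    value = beta0 + sum(b * xj for b, xj in zip(beta, x))
    if value > 0:
        return 1
    if value < 0:
        return -1
    return 0  # the point lies exactly on the hyperplane


# Hypothetical line: 1 + 2*X1 + 3*X2 = 0
beta0, beta = 1.0, [2.0, 3.0]
print(hyperplane_side(beta0, beta, [1.0, 1.0]))    # 1 + 2 + 3 = 6, positive side
print(hyperplane_side(beta0, beta, [-2.0, -1.0]))  # 1 - 4 - 3 = -6, negative side
print(hyperplane_side(beta0, beta, [1.0, -1.0]))   # 1 + 2 - 3 = 0, on the line
```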

#### Classification Using a Separating Hyperplane
Suppose we can construct a hyperplane that separates the training observations perfectly according to their class labels. We classify the $i$th observation based on whether:
$$\beta_0 + \sum_{j = 1}^p\ \beta_j x_{ij} > 0$$

or less than 0. The sign determines whether $y_i = 1$ or $y_i = -1$, so a separating hyperplane satisfies $y_i( \beta_0 + \sum_{j = 1}^p\ \beta_j x_{ij} ) > 0$ for every $i$.
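One way to check whether a candidate hyperplane separates a training set is to test that the label $y_i$ and the value $\beta_0 + \sum_j \beta_j x_{ij}$ have the same sign for every observation. The sketch below does this on a small made-up data set; the helper names and the numbers are assumptions for illustration only.

```python
def decision_value(beta0, beta, x):
    """Evaluate beta0 + sum_j beta_j * x_j for one observation."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def is_separating(beta0, beta, X, y):
    """True if y_i * f(x_i) > 0 for every training observation."""
    return all(yi * decision_value(beta0, beta, xi) > 0 for xi, yi in zip(X, y))


# Made-up training data with labels in {-1, +1}
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]

print(is_separating(-1.0, [1.0, 0.0], X, y))  # hyperplane X1 = 1 separates the classes
print(is_separating(-2.5, [1.0, 0.0], X, y))  # hyperplane X1 = 2.5 misplaces (2, 0)
```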

#### Maximal Margin Classifier
The maximal margin hyperplane is the separating hyperplane that is farthest from the training observations. We can compute the perpendicular distance from each training observation to a given separating hyperplane. The smallest such distance is the minimal distance from the observations to the hyperplane; this is called the margin. The maximal margin hyperplane is the separating hyperplane for which the margin is largest. We can classify a test observation based on which side of the maximal margin hyperplane it lies on. We hope that a classifier with a large margin on the training data will also have a large margin on the test data.

This method can lead to overfitting when $p$ is large. Observations that sit on the margin are known as support vectors because they support the maximal margin hyperplane in the sense that if these points were moved then the hyperplane would move as well. 
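The margin of a given hyperplane can be computed directly from this definition. The sketch below is illustrative only: it assumes the hyperplane passed in already separates the data (it does not check), and the function name, tolerance, and data are made up. It returns the smallest perpendicular distance together with the indices of the observations that attain it.

```python
import math


def margin_and_support(beta0, beta, X, tol=1e-9):
    """Return (margin, indices of the observations at that minimal distance)."""
    norm = math.sqrt(sum(b * b for b in beta))
    # Perpendicular distance from x to the hyperplane is |f(x)| / ||beta||
    dists = [abs(beta0 + sum(b * xj for b, xj in zip(beta, x))) / norm for x in X]
    margin = min(dists)
    support = [i for i, d in enumerate(dists) if d - margin <= tol]
    return margin, support


X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
# Hypothetical hyperplane -2 + 2*X1 = 0, i.e. the vertical line X1 = 1
margin, support = margin_and_support(-2.0, [2.0, 0.0], X)
print(margin, support)
```

Here the points `(2, 0)` and `(0, 0)` both lie at distance 1 from the line, so they are the ones that would move the hyperplane if they were shifted.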

#### Construction of the Maximal Margin Classifier
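The maximal margin hyperplane is the solution to the optimization problem of maximizing $M$ over $\beta_0, \beta_1, \dots, \beta_p$ subject to:
$$\sum_{j = 1}^p\ \beta_j^2 = 1$$

and:
$$y_i( \beta_0 + \sum_{j = 1}^p\ \beta_j x_{ij} ) \ge M \quad \text{for all } i = 1, \dots, n$$

The second constraint guarantees that each point is on the correct side of the hyperplane and at least a distance of $M$ from it; the first ensures that the left-hand side of the second is the perpendicular distance to the hyperplane. Hence $M$ is the margin of our hyperplane and we choose the coefficients to maximize $M$.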
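To make the optimization concrete, here is a brute-force sketch for $p = 2$. It is not how the problem is solved in practice (that is normally done with quadratic programming); it simply scans unit-length directions $\beta = (\cos\theta, \sin\theta)$, which satisfy $\sum_j \beta_j^2 = 1$, and for each direction picks the intercept that gives the largest $M$. The function name, the grid size, and the data are assumptions.

```python
import math


def max_margin_2d(X, y, steps=3600):
    """Grid search over unit directions; returns (M, beta0, beta) with the largest M."""
    best = None
    for k in range(steps):
        theta = 2 * math.pi * k / steps
        beta = (math.cos(theta), math.sin(theta))  # unit length by construction
        proj = [beta[0] * x[0] + beta[1] * x[1] for x in X]
        lo_pos = min(p for p, yi in zip(proj, y) if yi == 1)
        hi_neg = max(p for p, yi in zip(proj, y) if yi == -1)
        # Placing the hyperplane midway between the classes maximizes M for this direction
        M = (lo_pos - hi_neg) / 2   # negative M means this direction does not separate
        beta0 = -(lo_pos + hi_neg) / 2
        if best is None or M > best[0]:
            best = (M, beta0, beta)
    return best


X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]
M, beta0, beta = max_margin_2d(X, y)
print(M, beta0, beta)
```

For this made-up data the closest pair of opposite-class points is `(2, 0)` and `(0, 0)`, so the margin cannot exceed 1, and the search finds the line $X_1 = 1$ that attains it.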

## 9.2 Support Vector Classifiers

#### Overview
The maximal margin hyperplane is not always ideal. Because it must separate the training data perfectly, it can be extremely sensitive to a single observation: one point close to the boundary can shift the hyperplane and leave only a very narrow margin, which suggests it has overfit the training data. It can be better to use a hyperplane that does not perfectly separate the training data, because that:
- Offers greater robustness to individual observations
- Offers better classification of most of the training observations

We call this a soft margin because some observations can violate the margin. 

#### Details
The optimization problem is to maximize the margin $M$ subject to:
$$\sum_{j = 1}^p\ \beta_j^2 = 1$$

and:
$$y_i( \beta_0 + \sum_{j = 1}^p\ \beta_j x_{ij} ) \ge M(1 - \epsilon_i)$$

given that $\epsilon_i \ge 0$ and:
$$\sum_{i = 1}^n\ \epsilon_i \le C$$

where $C$ is a nonnegative tuning parameter. The $\epsilon_i$ are slack variables that allow individual observations to be on the wrong side of the margin or the hyperplane. Each $\epsilon_i$ tells us where the $i$th observation is located relative to the hyperplane and the margin. If $\epsilon_i = 0$ then the $i$th observation is on the correct side of the margin. If $\epsilon_i > 0$ then it is on the wrong side of the margin and the observation has violated the margin. If $\epsilon_i > 1$ then it is on the wrong side of the hyperplane.
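The three cases for $\epsilon_i$ can be illustrated with a short sketch. For a fixed hyperplane with $\sum_j \beta_j^2 = 1$ and margin $M > 0$, the smallest slack compatible with $y_i f(x_i) \ge M(1 - \epsilon_i)$ and $\epsilon_i \ge 0$ is $\max(0, 1 - y_i f(x_i) / M)$. The function names, hyperplane, and points below are assumptions for illustration.

```python
def slack(beta0, beta, M, x, y):
    """Smallest eps >= 0 with y * f(x) >= M * (1 - eps); assumes ||beta|| = 1 and M > 0."""
    f = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return max(0.0, 1.0 - y * f / M)


def describe(eps):
    if eps == 0:
        return "correct side of the margin"
    if eps <= 1:
        return "violates the margin"  # eps == 1 means exactly on the hyperplane
    return "wrong side of the hyperplane"


# Hypothetical hyperplane X1 = 1 with margin M = 1; all three points have label +1
beta0, beta, M = -1.0, [1.0, 0.0], 1.0
points = [[3.0, 0.0], [1.5, 0.0], [0.5, 0.0]]
eps = [slack(beta0, beta, M, x, 1) for x in points]
for x, e in zip(points, eps):
    print(x, e, describe(e))
print("total slack:", sum(eps))  # must not exceed the budget C
```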

$C$ bounds the sum of the $\epsilon_i$ and so determines the number and severity of the violations of the margin (and of the hyperplane) that will be tolerated. $C$ is the budget for the total amount by which the margin can be violated by the $n$ observations. If $C = 0$ no violations are allowed, which recovers the maximal margin classifier. If $C > 0$ then no more than $C$ observations can be on the wrong side of the hyperplane, since each such observation has $\epsilon_i > 1$. As $C$ increases we become more tolerant of violations and the margin widens. $C$ is usually chosen via cross-validation and controls the bias-variance trade-off. Small $C$ means a narrow margin that is rarely violated: low bias but high variance. Large $C$ means a wide margin with more violations: more bias but lower variance.

Points that lie on the margin or on the wrong side of the margin are called support vectors; only these observations affect the classifier. So larger $C$ means a wider margin, which means more points are support vectors and influence the hyperplane.

## 9.3 Support Vector Machines

#### Classification with Non-Linear Decision Boundaries
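The support vector classifier can be given a non-linear decision boundary by enlarging the feature space, for example with the quadratic terms $X_1, X_1^2, \dots, X_p, X_p^2$. The problem becomes: maximize $M$ subject to:
$$y_i ( \beta_0 + \sum_{j = 1}^p\ \beta_{j1}x_{ij} + \sum_{j = 1}^p\ \beta_{j2}x_{ij}^2 ) \ge M(1 - \epsilon_i)$$

and:
$$\sum_{i = 1}^n\ \epsilon_i \le C$$

given $\epsilon_i \ge 0$, and:
$$\sum_{j = 1}^p\ \sum_{k = 1}^2\ \beta_{jk}^2 = 1$$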
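A small sketch may help show why squared features give a non-linear boundary. With $p = 1$, a boundary that is linear in $(x, x^2)$ can label both tails of the real line as one class, which no single threshold on $x$ can do. The coefficients and function names below are hypothetical, not fitted values.

```python
def expand_quadratic(x):
    """Map (x_1, ..., x_p) to (x_1, ..., x_p, x_1^2, ..., x_p^2)."""
    return list(x) + [xj * xj for xj in x]


def classify(beta0, beta, x):
    """Sign of the linear function evaluated in the enlarged feature space."""
    f = beta0 + sum(b * z for b, z in zip(beta, expand_quadratic(x)))
    return 1 if f > 0 else -1


# p = 1 with hypothetical coefficients: f(x) = -1 + 0*x + 1*x^2
beta0, beta = -1.0, [0.0, 1.0]
X = [[-2.0], [-0.5], [0.5], [2.0]]
labels = [classify(beta0, beta, x) for x in X]
print(labels)  # points with |x| > 1 get +1, the others get -1
```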

#### Support Vector Machine
We can enlarge the feature space using kernels. This allows us to fit a non-linear boundary between the classes. The inner product of two $r$-vectors $a$ and $b$ is defined as:
$$\langle a, b \rangle = \sum_{i = 1}^r\ a_i b_i$$

so the inner product of two observations $x_i, x_{i'}$ is given by:
$$\langle x_i, x_{i'} \rangle = \sum_{j = 1}^p\ x_{ij}x_{i'j}$$

It can be shown that:
- The linear support vector classifier can be represented as $f(x) = \beta_0 + \sum_{i = 1}^n\ \alpha_i\ \langle x, x_i \rangle$ where there are $n$ parameters $\alpha_i$, one per training observation.
- To estimate the parameters $\alpha_1, \dots, \alpha_n$ and $\beta_0$ all we need are the $\binom{n}{2} = \frac{n(n - 1)}{2}$ inner products between all pairs of training observations.

In order to evaluate $f(x)$ we need to compute the inner product between the new point $x$ and each of the training points $x_i$. However, $\alpha_i$ is nonzero only for the support vectors: if a training observation is not a support vector then its $\alpha_i$ is 0. With $S$ the collection of indices of the support vectors, we can rewrite the function in a more efficient form:
$$f(x) = \beta_0 + \sum_{i \in S}\ \alpha_i\ \langle x, x_i \rangle$$
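The saving from summing only over $S$ can be seen in a short sketch. The $\alpha_i$ and $\beta_0$ below are hypothetical values chosen by hand, not the output of a fitting routine; the point is only that the full sum and the sum over the support set agree.

```python
def inner(a, b):
    """Inner product of two vectors of equal length."""
    return sum(ai * bi for ai, bi in zip(a, b))


def f_all(x, beta0, alpha, X_train):
    """Decision function using every training observation."""
    return beta0 + sum(a * inner(x, xi) for a, xi in zip(alpha, X_train))


def f_support(x, beta0, alpha, X_train):
    """Same function, but summing only over indices with nonzero alpha."""
    S = [i for i, a in enumerate(alpha) if a != 0]
    return beta0 + sum(alpha[i] * inner(x, X_train[i]) for i in S)


X_train = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
alpha = [0.5, 0.0, -0.5, 0.0]  # hypothetical: only observations 0 and 2 are support vectors
beta0 = -1.0
x_new = [4.0, 2.0]
print(f_all(x_new, beta0, alpha, X_train), f_support(x_new, beta0, alpha, X_train))
```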

A kernel is a function that quantifies the similarity of two observations; it generalizes the inner product. One choice is the following kernel:
$$K(x_i, x_{i'}) = (1 + \sum_{j = 1}^p\ x_{ij}x_{i'j})^d$$

This is known as the polynomial kernel of degree $d$, where $d$ is a positive integer. Using this kernel with $d > 1$ in the support vector classifier, instead of the plain inner product, leads to a much more flexible decision boundary. When the support vector classifier is combined with a non-linear kernel, the resulting classifier is known as a support vector machine, with $f(x) = \beta_0 + \sum_{i \in S}\ \alpha_i\ K(x, x_i)$.
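The sketch below illustrates why a polynomial kernel corresponds to working in an enlarged feature space. For $p = 1$ and $d = 2$, $(1 + xz)^2 = 1 + 2xz + x^2z^2$, which is the inner product of the explicit features $(1, \sqrt{2}x, x^2)$ and $(1, \sqrt{2}z, z^2)$. The function names and the test values are assumptions for illustration.

```python
import math


def polynomial_kernel(a, b, d):
    """K(a, b) = (1 + <a, b>)^d"""
    return (1 + sum(ai * bi for ai, bi in zip(a, b))) ** d


def phi(x):
    """Explicit feature map matching the kernel for p = 1, d = 2."""
    return [1.0, math.sqrt(2) * x, x * x]


x, z = 3.0, -2.0
k = polynomial_kernel([x], [z], 2)                      # (1 + 3 * (-2))^2
explicit = sum(u * v for u, v in zip(phi(x), phi(z)))   # same value, up to rounding
print(k, explicit)
```

The kernel computes the same quantity without ever building the enlarged feature vectors, which is what makes high-degree or high-dimensional expansions tractable.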

Another popular choice is the radial kernel:
$$K(x_i, x_{i'}) = \exp( -\gamma\ \sum_{j = 1}^p\ (x_{ij} - x_{i'j})^2 )$$

where $\gamma$ is a positive constant. The radial kernel behaves very locally. If a test observation $x^*$ is far from a training observation $x_i$ in terms of Euclidean distance, then $\sum_{j = 1}^p\ (x^*_j - x_{ij})^2$ will be large and $K(x^*, x_i)$ will be tiny. This means $x_i$ will play virtually no role in $f(x^*)$.
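A quick numerical sketch shows how fast the radial kernel decays with distance. The value $\gamma = 1$ and the points are arbitrary choices for illustration.

```python
import math


def radial_kernel(a, b, gamma):
    """K(a, b) = exp(-gamma * squared Euclidean distance between a and b)."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)


x_train = [0.0, 0.0]
near = radial_kernel([0.1, 0.0], x_train, gamma=1.0)  # squared distance 0.01
far = radial_kernel([5.0, 5.0], x_train, gamma=1.0)   # squared distance 50
print(near, far)
```

The nearby point gets a kernel value close to 1, while the distant point gets a value that is effectively zero, so it contributes almost nothing to the sum defining $f$.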

## 9.4 SVMs with More than Two Classes

SVMs do not extend naturally to more than two classes. The two most popular approaches for extending SVMs to $K$ classes are one-versus-one and one-versus-all.

#### One-Versus-One
When the number of classes $K$ is greater than 2, we fit an SVM for each of the $\binom{K}{2}$ pairs of classes. We classify a test observation using each of these classifiers and tally the number of times the test observation is assigned to each of the $K$ classes. The final classification assigns the test observation to the class to which it was most frequently assigned in the pairwise classifications.
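The voting step can be sketched as follows. The three pairwise classifiers are hand-written stand-ins for fitted SVMs on a one-dimensional feature, and the convention that a positive value means the first class of the pair is an assumption of this sketch.

```python
from collections import Counter
from itertools import combinations


def one_versus_one(x, classes, pairwise):
    """Tally pairwise votes and return (winning class, vote counts)."""
    votes = Counter()
    for a, b in combinations(classes, 2):
        f = pairwise[(a, b)](x)  # positive value votes for class a, otherwise class b
        votes[a if f > 0 else b] += 1
    # Ties, if any, are broken arbitrarily by most_common
    return votes.most_common(1)[0][0], votes


# Hypothetical 1-D classifiers for classes centred near 0, 5 and 10
classes = ["A", "B", "C"]
pairwise = {
    ("A", "B"): lambda x: 2.5 - x,
    ("A", "C"): lambda x: 5.0 - x,
    ("B", "C"): lambda x: 7.5 - x,
}
label, votes = one_versus_one(6.0, classes, pairwise)
print(label, dict(votes))
```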

#### One-Versus-All 
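Fit $K$ SVMs, each time comparing one of the $K$ classes to the remaining $K - 1$ classes. A test observation is assigned to the class whose classifier gives the largest value of $f_k(x)$, since a large value indicates high confidence that the observation belongs to class $k$ rather than to any of the others.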
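A common way to turn the $K$ fitted classifiers into a prediction is to pick the class whose classifier returns the largest decision value. The sketch below assumes that rule; the three decision functions are hand-written placeholders for fitted SVMs, not real models.

```python
def one_versus_all(x, classifiers):
    """classifiers maps each class label to its decision function f_k."""
    scores = {label: f(x) for label, f in classifiers.items()}
    return max(scores, key=scores.get), scores


# Hypothetical decision functions on a 1-D feature
classifiers = {
    "A": lambda x: 2.5 - x,            # A versus the rest
    "B": lambda x: 2.5 - abs(x - 5),   # B versus the rest (non-linear stand-in)
    "C": lambda x: x - 7.5,            # C versus the rest
}
label, scores = one_versus_all(6.0, classifiers)
print(label, scores)
```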

## 9.5 Relationship to Logistic Regression

Rewrite the support vector classifier as:
$$\underset{\beta}{\text{minimize}} \left\{ \sum_{i = 1}^n\ \max[ 0, 1 - y_i\ f(x_i) ] + \lambda\ \sum_{j = 1}^p\ \beta_j^2 \right\}$$

where $\lambda$ is a nonnegative tuning parameter. When $\lambda$ is large the coefficients $\beta_1, \dots, \beta_p$ are small, more violations of the margin are tolerated, and a low-variance but high-bias classifier results. When $\lambda$ is small, few violations of the margin occur, which gives a high-variance, low-bias classifier.

###### Hinge Loss
$$L(X, y, \beta) = \sum_{i = 1}^n\ \max[ 0, 1 - y_i( \beta_0 + \sum_{j = 1}^p\ \beta_{j}x_{ij} ) ]$$

When $y_i( \beta_0 + \sum_{j = 1}^p\ \beta_{j}x_{ij} )$ is at least 1, the hinge loss for that observation is exactly 0, since it lies on the correct side of the margin. This differs from the loss function for logistic regression, which is never exactly zero but becomes very small for observations that are far from the decision boundary.
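The contrast between the two losses can be checked numerically. The sketch below evaluates both as functions of $y_i f(x_i)$; the logistic loss is written in the form $\log(1 + e^{-y_i f(x_i)})$, which assumes labels coded as $\pm 1$, and the function names are illustrative.

```python
import math


def hinge_loss(margin_value):
    """max[0, 1 - y*f(x)] for a single observation."""
    return max(0.0, 1.0 - margin_value)


def logistic_loss(margin_value):
    """log(1 + exp(-y*f(x))), assuming labels y in {-1, +1}."""
    return math.log(1.0 + math.exp(-margin_value))


for v in [-1.0, 0.0, 1.0, 3.0]:
    print(v, hinge_loss(v), round(logistic_loss(v), 4))
```

The hinge loss reaches 0 once $y_i f(x_i) \ge 1$ and stays there, whereas the logistic loss keeps shrinking but remains positive.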

SVMs tend to be preferred when the classes are well separated, while logistic regression is often preferred when the classes overlap.