# Maximal Margin Classifier


## What Is a Hyperplane?

- In 2 dimensions, a hyperplane is a flat 1-dimensional subspace, that is, a line, defined by $β_0 + β_1X_1 + β_2X_2 = 0$.
- In 3 dimensions, a hyperplane is a flat 2-dimensional subspace, that is, a plane.
- In p > 3 dimensions, a hyperplane is a flat affine subspace of dimension p − 1, defined by $β_0 + β_1X_1 + β_2X_2 + ... + β_pX_p = 0$.


We can think of the hyperplane as dividing p-dimensional space **into two halves**. One can easily determine on which side of the hyperplane a point lies:

<img src="./images/93.png" width="600">
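
As a minimal sketch of this side test, the snippet below evaluates $f(x) = β_0 + β_1x_1 + ... + β_px_p$ and reports its sign. The function names and the example coefficients are illustrative assumptions, not part of the original notes.

```python
def hyperplane_value(beta0, beta, x):
    """Evaluate f(x) = beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def side_of_hyperplane(beta0, beta, x):
    """Return +1 or -1 for the two half-spaces, 0 if x lies on the hyperplane."""
    value = hyperplane_value(beta0, beta, x)
    return (value > 0) - (value < 0)


# Hypothetical hyperplane in 2 dimensions: 1 + 2*X1 + 3*X2 = 0
beta0, beta = 1.0, [2.0, 3.0]

print(side_of_hyperplane(beta0, beta, [1.0, 1.0]))    # f = 6, positive side
print(side_of_hyperplane(beta0, beta, [-1.0, -1.0]))  # f = -4, negative side
print(side_of_hyperplane(beta0, beta, [1.0, -1.0]))   # f = 0, on the hyperplane
```
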

## Classification Using a Separating Hyperplane

A test observation is assigned a class **depending on which side of the hyperplane it is located**.


<img src="./images/94.png" width="600">

We can label the observations from the blue class as **yi = 1** and those from the purple class as **yi = −1**. Then a separating hyperplane has the property that

\begin{align}
β_0 +β_1x_{i1} +β_2x_{i2} +...+β_px_{ip} > 0  &&&& \text{if }y_i = 1, \\
β_0 +β_1x_{i1} +β_2x_{i2} +...+β_px_{ip} < 0  &&&& \text{if }y_i = -1
\end{align}

Equivalently, a separating hyperplane has the property that

\begin{align}
y_i(β_0 +β_1x_{i1} +β_2x_{i2} +...+β_px_{ip}) > 0 &&&& \text{for all i = 1,...,n}
\end{align}

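We can also make use of the **magnitude of f(x∗)**, where $f(x^*) = β_0 + β_1x^*_1 + β_2x^*_2 + ... + β_px^*_p$ is the left-hand side of the hyperplane equation evaluated at a test observation x∗. If f(x∗) is far from zero, then x∗ lies far from the hyperplane, and so we can be confident about our class assignment for x∗. If f(x∗) is close to zero, then x∗ lies near the hyperplane, and we are less certain about the class assignment.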
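
The separating property and the use of the magnitude can be sketched as follows. The toy data, the candidate hyperplanes, and the helper names are assumptions chosen for illustration only.

```python
def f(beta0, beta, x):
    """Evaluate f(x) = beta0 + beta1*x1 + ... + betap*xp."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def is_separating(beta0, beta, X, y):
    """True if y_i * f(x_i) > 0 for every training observation."""
    return all(yi * f(beta0, beta, xi) > 0 for xi, yi in zip(X, y))


def classify(beta0, beta, x_star):
    """Return (label, |f(x*)|). The magnitude is proportional to the distance
    from the hyperplane, so a larger value means a more confident assignment."""
    value = f(beta0, beta, x_star)
    return (1 if value > 0 else -1), abs(value)


# Toy training data (assumed): two observations per class
X = [[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]]
y = [1, 1, -1, -1]

print(is_separating(0.0, [1.0, 1.0], X, y))   # X1 + X2 = 0 separates the classes
print(is_separating(0.0, [1.0, -1.0], X, y))  # X1 - X2 = 0 does not

print(classify(0.0, [1.0, 1.0], [0.1, 0.0]))  # close to the hyperplane
print(classify(0.0, [1.0, 1.0], [4.0, 3.0]))  # far from the hyperplane
```
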

## The Maximal Margin Classifier

In general, if our data can be perfectly separated using a hyperplane, then there will in fact exist an infinite number of such hyperplanes. In order to construct a classifier based upon a separating hyperplane, we must have a reasonable way to **decide which** of the infinite possible separating hyperplanes to use.

A natural choice is the **maximal margin hyperplane** (also known as the optimal separating hyperplane), which is the separating hyperplane that is farthest from the training observations. 


**Steps**


- We can compute the **(perpendicular) distance** from each training observation to a given separating hyperplane; the smallest such distance is **the minimal distance from the observations to the hyperplane**, and is known as the **margin**. 
- The maximal margin hyperplane is the separating hyperplane for which **the margin is largest**.
- We can then classify a test observation based on which side of the maximal margin hyperplane it lies. We hope that a classifier that has a large margin on the training data will also have a large margin on the **test data**, and hence will classify the test observations correctly.

**In a sense, the maximal margin hyperplane represents the mid-line of the widest “slab” that we can insert between the two classes.**

<img src="./images/95.png" width="600">
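
The steps above can be sketched in a few lines: compute each perpendicular distance, take the smallest as the margin, and prefer the hyperplane with the larger margin. The data and the two candidate hyperplanes are assumed for illustration, and both candidates are taken to separate the classes already.

```python
import math


def distance_to_hyperplane(beta0, beta, x):
    """Perpendicular distance from x to the hyperplane beta0 + beta . x = 0."""
    value = beta0 + sum(b * xj for b, xj in zip(beta, x))
    norm = math.sqrt(sum(b * b for b in beta))  # the intercept is not part of the norm
    return abs(value) / norm


def margin(beta0, beta, X):
    """Smallest distance from any training observation to the hyperplane."""
    return min(distance_to_hyperplane(beta0, beta, x) for x in X)


# Toy data (assumed): the first two points are one class, the last two the other
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]

# Two hypothetical separating hyperplanes, given as (beta0, [beta1, beta2])
candidates = {
    "X1 - 1 = 0": (-1.0, [1.0, 0.0]),
    "X1 - 0.5 = 0": (-0.5, [1.0, 0.0]),
}

margins = {name: margin(b0, b, X) for name, (b0, b) in candidates.items()}
best = max(margins, key=margins.get)
print(margins)
print("larger margin:", best)
```
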


**Support vectors**

Support vectors are the data points that lie **closest to** the hyperplane and **directly influence the position and orientation** of the hyperplane: if one of them moved, the maximal margin hyperplane would move as well. Using these support vectors, we maximize the margin of the classifier.

## Construction of the Maximal Margin Classifier


Briefly, the maximal margin hyperplane is the solution to the following optimization problem over $β_0, β_1, ..., β_p$ and M:

**1. maximize M**

**2. subject to $\sum_{j=1}^p \beta_j^2 = 1$**
> If $β_0 + β_1x_{i1} + β_2x_{i2} + ... + β_px_{ip} = 0$ defines a hyperplane, then so does $k(β_0 + β_1x_{i1} + β_2x_{i2} + ... + β_px_{ip}) = 0$ for any $k \neq 0$. Without this constraint, every rescaling of $β_0, β_1, ..., β_p$ would describe the same hyperplane, so the optimization would not have a single well-defined solution. Fixing $\sum_{j=1}^p \beta_j^2 = 1$ removes this scaling ambiguity.
 
**3. $y_i(β_0 +β_1x_{i1} +β_2x_{i2} +...+β_px_{ip}) \geq M$ for all $i = 1, ..., n$**

> Recall that the perpendicular distance from a point $x_i = (x_{i1}, ..., x_{ip})$ to the hyperplane $β_0 + β_1x_1 + β_2x_2 + ... + β_px_p = 0$ is $\frac{|β_0 + β_1x_{i1} + β_2x_{i2} + ... + β_px_{ip}|}{\sqrt{β_1^2+β_2^2+...+β_p^2}}$. The intercept $β_0$ does not appear in the denominator.

> Under constraint 2 the denominator equals 1, so requiring every observation to lie at least a distance M from the hyperplane gives $|β_0 + β_1x_{i1} + β_2x_{i2} + ... + β_px_{ip}| \geq M$. For a correctly classified observation, $y_i$ has the same sign as $β_0 + β_1x_{i1} + ... + β_px_{ip}$, so the absolute value equals $y_i(β_0 + β_1x_{i1} + ... + β_px_{ip})$ and the requirement becomes $y_i(β_0 +β_1x_{i1} +β_2x_{i2} +...+β_px_{ip}) \geq M$.

> This guarantees that each observation will be on the correct side of the hyperplane, provided that M is positive.
 
Hence, M represents the margin of our hyperplane, and the optimization problem chooses $β_0, β_1, ..., β_p$ to maximize M.
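
To make the optimization problem concrete, here is a brute-force sketch for p = 2 only. It is not how real solvers work (they use quadratic programming); it simply tries many unit-length directions $(β_1, β_2) = (\cos t, \sin t)$, which satisfy constraint 2 by construction, and for each direction places the hyperplane midway between the two classes. The data, function names, and grid resolution are assumptions.

```python
import math


def best_margin_for_direction(w, X, y):
    """For a unit vector w, place the hyperplane midway between the classes.

    Returns (M, beta0). M is negative when w does not separate the classes.
    """
    proj_pos = [sum(wj * xj for wj, xj in zip(w, x)) for x, yi in zip(X, y) if yi == 1]
    proj_neg = [sum(wj * xj for wj, xj in zip(w, x)) for x, yi in zip(X, y) if yi == -1]
    lo, hi = max(proj_neg), min(proj_pos)
    return (hi - lo) / 2.0, -(hi + lo) / 2.0


def brute_force_max_margin(X, y, steps=3600):
    """Grid search over unit directions (cos t, sin t) in 2 dimensions."""
    best = None
    for k in range(steps):
        t = 2.0 * math.pi * k / steps
        w = (math.cos(t), math.sin(t))  # beta1^2 + beta2^2 = 1
        M, beta0 = best_margin_for_direction(w, X, y)
        if best is None or M > best[0]:
            best = (M, beta0, w)
    return best


# Toy separable data (assumed)
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0]]
y = [1, 1, -1, -1]

M, beta0, beta = brute_force_max_margin(X, y)
print("M =", M, "beta0 =", beta0, "beta =", beta)

# Constraint 3: every observation satisfies y_i * f(x_i) >= M (up to rounding)
for xi, yi in zip(X, y):
    print(yi * (beta0 + sum(b * xj for b, xj in zip(beta, xi))))
```

For this data the closest pair of opposite-class points is (2, 0) and (0, 0), so the search settles on the vertical line $X_1 = 1$ with margin 1.
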

## The Non-separable Case

In many cases no separating hyperplane exists, and so there is no maximal margin classifier. In this case, the optimization problem has no solution with M > 0.

However, as we will see in the next section, we can extend the concept of a separating hyperplane in order to develop a hyperplane that **almost** separates the classes, using a so-called **soft margin**. The generalization of the maximal margin classifier to the **non-separable** case is known as the **support vector classifier**.

# Support Vector Classifiers / Soft margin classifier

There are instances in which a classifier based on a separating hyperplane might **not be desirable**. A classifier based on a separating hyperplane must perfectly classify all of the training observations; this can lead to **sensitivity to individual observations**.

<img src="./images/95.png" width="600">

- The resulting maximal margin hyperplane has only a tiny margin. This is problematic because **the distance of an observation from the hyperplane** can be seen as a measure of our **confidence** that the observation was correctly classified.
- The fact that the maximal margin hyperplane is **extremely sensitive to a change in a single observation** suggests that it may have **overfit the training data**.


**Support Vector Classifiers / Soft margin classifier**

In this case, it could be worthwhile to use a soft margin, with which we might misclassify a few training observations in order to do a better job of **classifying the remaining observations**. 
- Greater robustness to individual observations
- Better classification of most of the training observations


# Construction of the Support Vector Classifiers


**1. maximize M**

**2. subject to $\sum_{j=1}^p \beta_j^2 = 1$**

**3. $y_i(β_0 +β_1x_{i1} +β_2x_{i2} +...+β_px_{ip}) \geq M(1-\epsilon_i)$**
- The slack variable $\epsilon_i$ tells us where the ith observation is located, relative to the hyperplane and relative to the margin.  
  - If $\epsilon_i = 0$ then the ith observation is on the correct side of the margin. 
  - If $\epsilon_i > 0$ then the ith observation is on the wrong side of the margin, and we say that the ith observation has violated the margin. 
  - If $\epsilon_i > 1$ then it is on the wrong side of the hyperplane.

**4. $\epsilon_i \geq 0, \sum_{i=1}^n \epsilon_i \leq C$**
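
As a rough sketch of what the slack variables measure, the snippet below takes a fixed hyperplane and margin M and computes, for each observation, the smallest $\epsilon_i \geq 0$ that satisfies constraint 3. The hyperplane, the value of M, the data, and the helper names are all assumptions for illustration; in the real problem these quantities are chosen jointly by the optimizer.

```python
def slack(beta0, beta, M, x, yi):
    """Smallest eps >= 0 such that yi * f(x) >= M * (1 - eps)."""
    value = beta0 + sum(b * xj for b, xj in zip(beta, x))
    return max(0.0, 1.0 - yi * value / M)


def describe(eps):
    """Translate a slack value into the position of the observation."""
    if eps == 0:
        return "correct side of the margin"
    if eps <= 1:
        return "violates the margin, not past the hyperplane"
    return "wrong side of the hyperplane"


# Hypothetical unit-norm hyperplane X1 = 0 with margin M = 1
beta0, beta, M = 0.0, [1.0, 0.0], 1.0

X = [[2.0, 0.0], [0.5, 0.0], [-0.5, 0.0], [-3.0, 1.0]]
y = [1, 1, 1, -1]

slacks = [slack(beta0, beta, M, xi, yi) for xi, yi in zip(X, y)]
for xi, eps in zip(X, slacks):
    print(xi, eps, describe(eps))

# Constraint 4: this hyperplane and margin are feasible only if C >= total slack
total_slack = sum(slacks)
print("total slack:", total_slack)
```
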



**Parameter C: controls the bias-variance trade-off of the statistical learning technique.**

In this case, C is a nonnegative tuning parameter. It bounds the sum of the slack variables $\epsilon_i$, so we can think of it as a **budget** for the amount that the margin can be violated by the n observations.
  - If C = 0 then there is no budget for violations to the margin, and it must be the case that ε1 = ... = εn = 0, in which case the problem simply amounts to the maximal margin hyperplane optimization problem.
  - For C > 0 no more than C observations can be on the wrong side of the hyperplane, because each such observation has εi > 1.
  - When C is small, we seek **narrow margins** that are rarely violated; this amounts to a classifier that is highly fit to the data, which may have **low bias but high variance**.
  - When C is larger, the margin is wider and more violations are tolerated; this amounts to a classifier that fits the data less closely, which may have **higher bias but lower variance**.



**Parameter C in other sources (including scikit-learn)**

SVC solves the following primal problem:

<img src="./images/97.png" width="450">

Intuitively, we’re trying to maximize the margin while incurring a **penalty** when a sample is misclassified or lies within the margin boundary. The classifier allows some samples to be at a distance $\zeta_i$ from their correct margin boundary. C controls the **strength of the penalty** and, as a result, acts as an inverse regularization parameter, trading off the slack variable penalty (misclassifications) against the width of the margin. Note that this C works in the opposite direction to the budget C described above: here a small C tolerates more violations.
- Small C makes the constraints easy to ignore, which leads to a large margin and to models with high bias but low variance.
- Large C makes the constraints hard to ignore, which leads to a small margin and to models with low bias but high variance.
- For C = inf, all the constraints are enforced (a hard margin).
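
The effect of this penalty can be sketched by evaluating the objective by hand. The snippet assumes the primal above is the usual soft-margin form with a linear kernel, $\frac{1}{2}w^Tw + C\sum_i \zeta_i$ with $\zeta_i = \max(0, 1 - y_i(w^Tx_i + b))$, and it only compares two hand-picked candidate hyperplanes rather than solving the problem. The data and candidates are assumptions.

```python
def soft_margin_objective(w, b, C, X, y):
    """0.5 * ||w||^2 + C * sum(zeta_i), with zeta_i = max(0, 1 - y_i * (w . x_i + b))."""
    reg = 0.5 * sum(wj * wj for wj in w)
    slacks = [
        max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b))
        for xi, yi in zip(X, y)
    ]
    return reg + C * sum(slacks)


# Toy data (assumed): the last negative point sits close to the positive class
X = [[2.0, 0.0], [3.0, 1.0], [0.0, 0.0], [-1.0, 1.0], [1.0, 0.0]]
y = [1, 1, -1, -1, -1]

wide = ([1.0, 0.0], -1.0)    # wider margin, but the last point violates it
narrow = ([2.0, 0.0], -3.0)  # narrower margin, no violations

for C in (0.1, 10.0):
    obj_wide = soft_margin_objective(*wide, C, X, y)
    obj_narrow = soft_margin_objective(*narrow, C, X, y)
    preferred = "wide" if obj_wide < obj_narrow else "narrow"
    print(f"C={C}: wide={obj_wide:.2f}, narrow={obj_narrow:.2f}, preferred={preferred}")
```

With the small C the wide-margin candidate has the lower objective despite its violation; with the large C the violation becomes too expensive and the narrow-margin candidate wins.
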


It turns out that only observations that either **lie on the margin** or that **violate the margin** will affect the hyperplane, and hence the classifier obtained. These observations are **support vectors**.



**SVC vs Logistic vs LDA**

- SVC: The fact that its decision rule is based only on a potentially small subset of the training observations (the support vectors) means that it is quite robust to the behavior of observations that are far away from the hyperplane.

- Logistic regression also has very low sensitivity to observations far from the decision boundary.

- LDA: The classification rule depends on the mean of all of the observations within each class, as well as the within-class covariance matrix computed using all of the observations.

# Support Vector Machines


- The **support vector classifier** is a natural approach for classification in the **two-class setting**, if the boundary between the two classes is **linear**.


- The **support vector machine (SVM)** is an extension of the support vector classifier that results from **enlarging the feature space in a specific way**, using kernels, to accommodate **non-linear boundaries** between the classes.
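
To make the idea of enlarging the feature space concrete, here is a small sketch in one dimension. The data and the mapping $\phi(x) = (x, x^2)$ are assumed for illustration: no single threshold separates the classes on the line, but after the mapping a hyperplane in 2 dimensions does.

```python
def phi(x):
    """Hypothetical feature map from 1 dimension to 2 dimensions."""
    return (x, x * x)


def separable_by_threshold(xs, ys):
    """In 1 dimension a hyperplane is a single point t: try every gap, both orientations."""
    pts = sorted(xs)
    thresholds = [(a + b) / 2.0 for a, b in zip(pts, pts[1:])]
    for t in thresholds:
        for sign in (1, -1):
            if all(yi * sign * (x - t) > 0 for x, yi in zip(xs, ys)):
                return True
    return False


# Toy data (assumed): the outer points are class +1, the inner points class -1
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [1, 1, -1, -1, -1, 1, 1]

print(separable_by_threshold(xs, ys))  # no separating point exists on the line

# In the enlarged space, the hyperplane -2.5 + 0*z1 + 1*z2 = 0 separates the classes
beta0, beta = -2.5, (0.0, 1.0)
separated_after_mapping = all(
    yi * (beta0 + sum(b * zj for b, zj in zip(beta, phi(x)))) > 0
    for x, yi in zip(xs, ys)
)
print(separated_after_mapping)
```

The hyperplane is linear in the enlarged space, but in terms of the original feature it corresponds to the non-linear rule $x^2 > 2.5$.
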


<img src="./images/100.jpg" width="600">

> When the training data are not linearly separable, we can try to **map the data into a higher-dimensional space** where we can find a **hyperplane** that separates the samples. For example, if the input observations lie in a 2D space, we can look for a mapping function that transforms them into a 3D space.


# Kernel function

https://towardsdatascience.com/understanding-support-vector-machine-part-2-kernel-trick-mercers-theorem-e1e6848c6c4d


It can be shown that the boundary optimization depends on the data only through the **dot products** between pairs of observations, and that the resulting decision rule depends only on dot products with the support vectors. That means that if we use a mapping function that maps our data into a **higher dimensional space**, then the optimization and the decision rule depend on the dot products between the mapped samples, which can be expensive to calculate explicitly.

Therefore, we use **kernel functions**, which let us operate in a high-dimensional feature space by computing the inner products between pairs of mapped observations directly from the original inputs, without ever computing the mapping itself. A kernel function **defines the inner product in the transformed space** and has great computational advantages.
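
A small check of this idea, taking the degree-2 polynomial kernel $K(x, z) = (1 + x \cdot z)^2$ on 2-dimensional inputs as an assumed example: the kernel value computed in the original space matches the dot product of an explicit 6-dimensional feature map. The function names and test points are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def phi(x):
    """Explicit degree-2 feature map for a 2-dimensional input."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return (1.0, r2 * x1, r2 * x2, x1 * x1, x2 * x2, r2 * x1 * x2)


def poly_kernel(x, z, d=2):
    """Polynomial kernel computed directly from the original inputs."""
    return (1.0 + dot(x, z)) ** d


x, z = (1.0, 2.0), (3.0, -1.0)

explicit = dot(phi(x), phi(z))  # inner product in the 6-dimensional space
via_kernel = poly_kernel(x, z)  # same quantity, without building phi

print(explicit, via_kernel)
```

The two values agree up to floating-point rounding, but the kernel needs only one 2-dimensional dot product, whereas the explicit route grows quickly with the input dimension and the degree.
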


Let's look at some of the most commonly used kernel functions.

<img src="./images/98.png" width="800">


## Polynomial kernel

<img src="./images/99.png" width="300">

This is known as a polynomial kernel of degree d, which leads to a much more **flexible decision boundary**. It essentially amounts to fitting a support vector classifier in a higher-dimensional space involving polynomials of degree d, rather than in the original feature space. Note that in this case the (non-linear) function has the form

<img src="./images/101.png" width="300">
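
Assuming the form shown above is the usual kernel expansion $f(x) = β_0 + \sum_{i \in S} α_i K(x, x_i)$ over the support vectors, the sketch below evaluates it with a degree-2 polynomial kernel. The support vectors, the coefficients $α_i$, and $β_0$ are hypothetical values picked by hand, not the result of fitting anything.

```python
def poly_kernel(x, z, d=2):
    """Polynomial kernel of degree d: (1 + x . z)^d."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** d


def decision_function(x, beta0, alphas, support_vectors, kernel):
    """f(x) = beta0 + sum_i alpha_i * K(x, x_i) over the support vectors."""
    return beta0 + sum(a * kernel(x, sv) for a, sv in zip(alphas, support_vectors))


# Hypothetical fitted quantities (assumed for illustration)
support_vectors = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.0)]
alphas = [1.0, 1.0, -2.0]
beta0 = -1.0

for x in [(2.0, 0.0), (0.0, 0.0), (-2.0, 5.0)]:
    value = decision_function(x, beta0, alphas, support_vectors, poly_kernel)
    print(x, value, "class +1" if value > 0 else "class -1")
```

With these values the function simplifies to $2x_1^2 - 1$, so points far to the left and far to the right fall in the same class while points near $x_1 = 0$ fall in the other, which no linear boundary could achieve.
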


## Radial kernel

<img src="./images/102.png" width="300"> where  γ is a positive constant.

If a given test observation $x^* = (x^*_1, ..., x^*_p)$ is far from a training observation $x_i$ in terms of Euclidean distance, then $\sum_{j=1}^p (x^*_j - x_{ij})^2$ will be large, and so the kernel function will be very small.
- In this case, $x_i$ will play virtually no role in $f(x^*)$.
- In other words, training observations that are far from the test observation will play essentially no role in the predicted class label for the test data.
- The radial kernel has very **local** behavior, in the sense that only **nearby** training observations have an effect on the class label of a test observation.
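
The local behavior is easy to see numerically. This sketch assumes the radial kernel has the standard form $K(x, z) = \exp(-γ \sum_j (x_j - z_j)^2)$; the points and the value of γ are illustrative.

```python
import math


def radial_kernel(x, z, gamma=1.0):
    """Radial kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


x_star = (0.0, 0.0)  # test observation
near = (0.1, 0.1)    # nearby training observation
far = (3.0, 4.0)     # distant training observation (Euclidean distance 5)

print(radial_kernel(x_star, near))  # close to 1: strong influence
print(radial_kernel(x_star, far))   # close to 0: essentially no influence
```

A larger γ makes the kernel decay faster with distance, so the fit becomes even more local.
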


<img src="./images/103.png" width="600">

# SVMs with More than Two Classes

## One-Versus-One Classification

A one-versus-one approach **constructs $\begin{pmatrix} K \\2 \end{pmatrix} = K(K-1)/2$ SVMs**, each of which compares a pair of classes.
- For example, one such SVM might compare the kth class, coded as +1, to the k′th class, coded as −1. 
- We classify a test observation **using each of the $\begin{pmatrix} K \\2 \end{pmatrix}$ classifiers**, and we **tally the number of times** that the test observation is assigned to each of the K classes. 
- The final classification is performed by assigning the test observation to the class to which it was most frequently assigned in these $\begin{pmatrix} K \\2 \end{pmatrix}$ pairwise classifications.
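
The voting scheme can be sketched as follows. The three pairwise classifiers here are hypothetical one-dimensional decision functions written by hand in place of fitted SVMs, and the function and variable names are assumptions.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise):
    """pairwise[(a, b)](x) > 0 votes for class a, otherwise for class b.

    Ties are broken by whichever class was counted first.
    """
    votes = Counter()
    for a, b in combinations(classes, 2):
        winner = a if pairwise[(a, b)](x) > 0 else b
        votes[winner] += 1
    return votes.most_common(1)[0][0], votes


classes = ["A", "B", "C"]

# Hypothetical pairwise decision functions for classes centered near 0, 5 and 10
pairwise = {
    ("A", "B"): lambda x: 2.5 - x[0],
    ("A", "C"): lambda x: 5.0 - x[0],
    ("B", "C"): lambda x: 7.5 - x[0],
}

label, votes = one_vs_one_predict([6.0], classes, pairwise)
print(label, dict(votes))
```

With K = 3 classes there are K(K − 1)/2 = 3 pairwise classifiers, and the test point at 6 receives two votes for class B and one for class C.
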


## One-Versus-All Classification

- Fit K SVMs, each time comparing one of the K classes to the **remaining K − 1 classes**. Let $β_{0k}, β_{1k}, ..., β_{pk}$ denote the parameters that result from fitting an SVM comparing the kth class (coded as +1) to the others (coded as −1).
- Let $x^*$ denote a test observation. We assign the observation to the class for which $β_{0k} + β_{1k}x^*_1 + β_{2k}x^*_2 + ... + β_{pk}x^*_p$ is **largest**, as this amounts to a **high level of confidence** that the test observation belongs to the kth class rather than to any of the other classes.
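
A minimal sketch of this rule, with hypothetical per-class coefficients standing in for the K fitted SVMs:

```python
def one_vs_all_predict(x, params):
    """params maps class label -> (beta0k, [beta1k, ..., betapk]).

    Returns the class whose linear score is largest, together with all scores.
    """
    scores = {
        k: b0 + sum(b * xj for b, xj in zip(beta, x))
        for k, (b0, beta) in params.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical coefficients for K = 3 classes in 2 dimensions (assumed)
params = {
    "A": (0.0, [1.0, 0.0]),
    "B": (0.0, [0.0, 1.0]),
    "C": (0.5, [-1.0, -1.0]),
}

label, scores = one_vs_all_predict([2.0, 0.5], params)
print(label, scores)
```
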

# SVMs vs Logistic

Due to the similarities between their loss functions, logistic regression and the support vector classifier often give very **similar results**. When the classes are **well separated**, SVMs tend to behave better than logistic regression.
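
The similarity between the loss functions can be seen by evaluating both at a few values of $y_i f(x_i)$. This sketch assumes the standard forms: the hinge loss $\max(0, 1 - y f)$ for the support vector classifier and the logistic loss $\log(1 + e^{-y f})$ for logistic regression.

```python
import math


def hinge_loss(y, f):
    """Hinge loss: exactly zero once y * f >= 1."""
    return max(0.0, 1.0 - y * f)


def logistic_loss(y, f):
    """Logistic loss: small but never exactly zero."""
    return math.log(1.0 + math.exp(-y * f))


for f in (-2.0, 0.0, 1.0, 3.0):
    print(f"y*f = {f:+.1f}: hinge = {hinge_loss(1, f):.3f}, logistic = {logistic_loss(1, f):.3f}")
```

Both losses grow roughly linearly for badly misclassified points and shrink as $y_i f(x_i)$ increases. The hinge loss is exactly zero beyond the margin, while the logistic loss only approaches zero.
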

# Support vector regression

Whereas least squares regression seeks coefficients that minimize the sum of squared residuals, support vector regression instead seeks coefficients that minimize a different type of loss, where only residuals larger in absolute value than some positive constant contribute to the loss function. This is an extension of the margin used in support vector classifiers to the regression setting.
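
A sketch of such a loss, assuming the common linear ε-insensitive form $\max(0, |r| - ε)$, where r is the residual and ε is the positive constant. The residuals and the value of ε are illustrative.

```python
def epsilon_insensitive_loss(residual, eps):
    """Zero inside the band |residual| <= eps, linear outside it."""
    return max(0.0, abs(residual) - eps)


residuals = [0.05, -0.2, 0.5, -1.5]
eps = 0.25  # assumed width of the insensitive band

losses = [epsilon_insensitive_loss(r, eps) for r in residuals]
print(losses)
```

The first two residuals fall inside the band and contribute nothing, which parallels how observations on the correct side of the margin do not affect the support vector classifier.
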