# Support Vector Machines

## 1. SVM overview

### 1.1 What does SVM learn from the training dataset and labeled data?
- A linear model: a line, or a hyperplane when there are multiple variables;
- The equation that represents this 'line' is:
$$ y = w^{T}x + w_0 $$
- We use an algorithm to determine the values of $w$ and $w_0$ that give the 'best' line separating the data;
- SVM is one of the algorithms that help determine these two parameters.
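
As a minimal sketch of how such a linear model is used once $w$ and $w_0$ are known, the snippet below evaluates $w^Tx + w_0$ and classifies a point by the sign of the result. The values of `w` and `w0` and the helper names `decision_value` and `classify` are made up for illustration; they are not learned from data.

```python
def decision_value(w, w0, x):
    """Return w^T x + w_0 for a single point x (plain Python lists)."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0


def classify(w, w0, x):
    """Label a point +1 or -1 by which side of the hyperplane it lies on."""
    return 1 if decision_value(w, w0, x) >= 0 else -1


# Hypothetical parameters: the line x1 + x2 - 3 = 0 in two dimensions.
w, w0 = [1.0, 1.0], -3.0

print(classify(w, w0, [3.0, 2.0]))  # 3 + 2 - 3 = 2, positive side
print(classify(w, w0, [0.5, 1.0]))  # 0.5 + 1 - 3 = -1.5, negative side
```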

### 1.2 Some background knowledge about SVM
- SVMs include SVM (for classification) and SVR (for regression);
- Four different SVMs:
  - The original one: the Maximal Margin Classifier,
  - The kernelized version using the Kernel Trick,
  - The soft-margin version,
  - The soft-margin kernelized version (which combines 1, 2 and 3).

## 2. Understanding the Math of SVM

### 2.1 The Margin (concept)
1. ```The goal of SVM:```
The goal of a support vector machine is to find the optimal separating hyperplane, which maximizes the margin of the training data.
2. ```Optimal separating hyperplane:``` The fact that you can find a separating hyperplane does not mean it is the best one!
<img src=https://www.svm-tutorial.com/wp-content/uploads/2014/11/01_svm-dataset1-separated-2.png width="300" height="300" alt="SVM" align=center>

So we will try to select a hyperplane that is as far as possible from the data points of each category. The optimal separating hyperplane should:
- correctly classify the training data;
- generalize better to unseen data.
3. ```Margin:``` the optimal hyperplane is the one with the biggest margin.

### 2.2 Margin Calculation
<img src=images/Lab2/SVM1.jpg width="200" height="200" alt="SVM1" align=center>
<img src=images/Lab2/SVM2.jpg width="200" height="200" alt="SVM2" align=center>

- The distance from one training point $x$ to the hyperplane is $c = |b-a|$. The margin is the distance from the closest training point to the hyperplane, i.e. $\min_i c_i$;

- Step 1: calculating b
  - $z$ (a vector) has magnitude $b$ and points in the same direction as $w$, so its unit direction is $\frac{w}{||w||}$;
  - That is, $z = b\frac{w}{||w||}$;
  - $z$ lies on the hyperplane, so we have $ w^Tz + w_0 = 0$;
  $$ w^T \frac{bw}{||w||} + w_0 = 0 $$
  $$ b||w|| + w_0 = 0 $$
  $$ b = - \frac{w_0}{||w||} $$
  Note: $||w|| = \sqrt{w^Tw}$
- Step 2: calculating a
  - $a$ is the magnitude of $x$'s projection onto $w$;
  - that is, 
  $$a = \frac {w^Tx}{||w||}$$
- Step 3: calculating c
  - $ c = |b-a| = |\frac{w_0}{||w||} + \frac {w^Tx}{||w||}| $
  - that is, the distance from one training point to the hyperplane is
  $$ c = \frac{1}{||w||}|w^Tx + w_0| $$
- Step 4: the margin
  - therefore, the margin is 
  $$ \min_{i} \frac{1}{||w||}|w^Tx_i + w_0| $$
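
The four steps above can be checked numerically. The sketch below computes the distance $c = \frac{1}{||w||}|w^Tx + w_0|$ for each point and takes the minimum as the margin. The hyperplane, the three points, and the helper names `norm`, `distance` and `margin` are illustrative assumptions, not part of the lab material.

```python
import math


def norm(w):
    """Euclidean norm ||w|| = sqrt(w^T w)."""
    return math.sqrt(sum(wj * wj for wj in w))


def distance(w, w0, x):
    """Distance c = |w^T x + w_0| / ||w|| from point x to the hyperplane."""
    return abs(sum(wj * xj for wj, xj in zip(w, x)) + w0) / norm(w)


def margin(w, w0, points):
    """Margin: distance from the closest training point to the hyperplane."""
    return min(distance(w, w0, x) for x in points)


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0 (so ||w|| = 5) and made-up points.
w, w0 = [3.0, 4.0], -5.0
points = [[3.0, 4.0], [0.0, 0.0], [5.0, 0.0]]

for x in points:
    print(x, distance(w, w0, x))
print("margin:", margin(w, w0, points))
```

Here the three distances are 4, 1 and 2, so the margin is set by the origin, the point closest to the hyperplane.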

### 2.3 The SVM optimisation problem
- Scaling: $(w, w_0)$ and $(kw, kw_0)$ define the same hyperplane for any constant $k > 0$;
- This is because $kw^Tx + kw_0 \geq 0$ is equivalent to $w^Tx + w_0 \geq 0$;
- We can therefore fix the scale by putting a constraint on $(w, w_0)$:
$$ \min_{i} |w^Tx_i + w_0| = 1 $$
- Now the margin will always be $\frac{1}{||w||}$;
- We want a hyperplane that will maximize the margin:
$$ \max_w \frac{1}{||w||} $$
subject to: 
$$w^Tx_i + w_0 \geq 1 $$, for all i with $y_i = 1$; 
$$w^Tx_i + w_0 \leq -1 $$, for all i with $y_i = -1$; 
$$\min_{i} |w^Tx_i + w_0| = 1$$
- After dropping the third constraint, which is redundant because it holds automatically at the optimum, and merging the first two, we have:
$$ \max_w \frac{1}{||w||} $$
subject to:
$$ y_i(w^Tx_i + w_0) \geq 1 $$, for all i
- Maximizing $\frac{1}{||w||}$ is the same as minimizing $||w||$, and squaring does not change the minimizer, so the above optimization is equivalent to:
$$ \min_w ||w||^2 $$
subject to:
$$ y_i(w^Tx_i + w_0) \geq 1 $$, for all i
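
To make the scaling argument concrete, the following sketch rescales a hyperplane so that the closest point has $|w^Tx_i + w_0| = 1$, then checks that the margin equals $\frac{1}{||w||}$ and that every constraint $y_i(w^Tx_i + w_0) \geq 1$ holds. The data, labels, and the helper names `score` and `canonical_form` are assumptions chosen for illustration. The snippet only rescales a given separating hyperplane; it does not solve the optimisation problem.

```python
import math


def score(w, w0, x):
    """Return w^T x + w_0."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w0


def canonical_form(w, w0, points):
    """Rescale (w, w0) so that min_i |w^T x_i + w_0| = 1."""
    s = min(abs(score(w, w0, x)) for x in points)
    return [wj / s for wj in w], w0 / s


# Made-up, linearly separable data and a hyperplane that separates it.
points = [[3.0, 4.0], [0.0, 0.0], [5.0, 0.0]]
labels = [1, -1, 1]
w, w0 = [3.0, 4.0], -5.0

w_c, w0_c = canonical_form(w, w0, points)
norm_c = math.sqrt(sum(wj * wj for wj in w_c))

# After rescaling, the margin is 1 / ||w||.
print("rescaled w, w0:", w_c, w0_c)
print("margin 1/||w||:", 1.0 / norm_c)

# Every training point satisfies y_i (w^T x_i + w_0) >= 1 (up to rounding).
print(all(y * score(w_c, w0_c, x) >= 1 - 1e-12 for x, y in zip(points, labels)))
```

Rescaling changes $(w, w_0)$ but not the hyperplane itself, so the margin stays the same as the one measured with the original parameters.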

### 2.4 The solution for the optimal parameters
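- ```Compute w:``` with ${\alpha}_i \geq 0$ the Lagrange multipliers of the constraints (non-zero only for the support vectors),
$$ w = \sum_i {\alpha}_i{y_i}{x_i} $$
- ```Compute w_0:```
  - for any support vector $x_i$ the constraint holds with equality, so we can use it to calculate $w_0$:
$$ y_i(w^Tx_i + w_0) = 1 $$
  - multiply both sides by $y_i$ (note ${y_i}^2 = 1$),
$$ w^Tx_i + w_0 = y_i $$
$$ w_0 = y_i - w^Tx_i $$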
- ```Hypothesis function:```
  - therefore, the prediction for a new data point $x$ is:
$$ f(x) = sign(w^Tx + w_0) $$
$$ = sign(\sum_{i=1}^n {\alpha}_i{y_i}({x_i^T}x) + w_0)$$
- The formulation above is the hard-margin SVM. It cannot work when the data is not linearly separable, because then no $(w, w_0)$ satisfies all the constraints.
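
The formulas for $w$, $w_0$ and $f(x)$ in this section can be tried on a tiny, linearly separable example. In the sketch below, the two training points and the multipliers `alpha` are assumptions: the values ${\alpha}_1 = {\alpha}_2 = 0.25$ were worked out by hand for this toy set and do not come from a solver. The helper names `dot` and `predict` are likewise illustrative.

```python
def dot(u, v):
    """Inner product u^T v of two plain Python lists."""
    return sum(a * b for a, b in zip(u, v))


# Toy training set: one point per class, both are support vectors.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
# Assumed multipliers, derived by hand for this toy set (not from a solver).
alpha = [0.25, 0.25]

# w = sum_i alpha_i * y_i * x_i
w = [sum(a * yi * xi[j] for a, yi, xi in zip(alpha, y, X)) for j in range(2)]

# w_0 = y_i - w^T x_i, using any support vector (here the first point).
w0 = y[0] - dot(w, X[0])


def predict(x):
    """sign(sum_i alpha_i y_i (x_i^T x) + w_0), with sign(0) taken as +1."""
    s = sum(a * yi * dot(xi, x) for a, yi, xi in zip(alpha, y, X)) + w0
    return 1 if s >= 0 else -1


print("w =", w, "w0 =", w0)
print(predict([2.0, 0.5]), predict([-3.0, 1.0]))
```

Note that `predict` uses only inner products between the training points and the new point, which is the form that the kernelized versions listed in section 1.2 build on.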

## 3. Soft Margin SVM

### 3.1 When and how does soft margin SVM help?
<img src="images/Lab2/Soft SVM1.jpg" width="200" height="200" alt="SVM3" align=center>
<img src="images/Lab2/Soft SVM2.jpg" width="200" height="200" alt="SVM4" align=center>