# Support Vector Machines

## 1. SVMs overview:

### 1.1 What does SVM learn from the training dataset and labeled data?
- A linear model: a line, or a hyperplane when there are several variables;
- The equation that represents the 'line' is:
$$ y = \mathbf{w^{T}x} + w_0 $$
- We use an algorithm to determine the values of $\mathbf{w}$ and $w_0$ that give the 'best' line separating the data;
- SVM is one of the algorithms that determine these two parameters.

### 1.2 Some background knowledge about SVM
- SVMs include SVM (for classification) and SVR (for regression);
- Four variants of SVM:
  - The original one: the Maximal Margin Classifier,
  - The kernelized version using the Kernel Trick,
  - The soft-margin version,
  - The soft-margin kernelized version (which combines the three above)

### 1.3 Comparison of SVMs and logistic regression
- Logistic regression optimisation problem:
$$ \underset{\theta}{\text{min}} \frac{1}{m} [\sum_{i=1}^m y^{(i)}(-\text{log}h_{\theta}(x^{(i)}))+(1-y^{(i)})(-\text{log}(1-h_{\theta}(x^{(i)})))]+\frac{\lambda}{2m}\sum_{j=1}^n{\theta_j}^2 $$
- Support vector machine:
$$ \underset{\theta}{\text{min}} C\sum_{i=1}^m [y^{(i)}\text{cost}_1({\theta}^Tx^{(i)})+(1-y^{(i)})\text{cost}_0({\theta}^Tx^{(i)})]+\frac{1}{2}\sum_{j=1}^n{\theta_j}^2 $$
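
The two objectives differ mainly in the per-example cost. The notes do not define $\text{cost}_1$ and $\text{cost}_0$, so the sketch below assumes the usual hinge-style forms, $\text{cost}_1(z) = \max(0, 1 - z)$ and $\text{cost}_0(z) = \max(0, 1 + z)$, and compares them with the logistic costs. All function names are illustrative.

```python
import numpy as np

def logistic_cost_1(z):
    # -log(h(z)) with h the sigmoid: cost of a positive example (y = 1)
    return np.log1p(np.exp(-z))

def logistic_cost_0(z):
    # -log(1 - h(z)): cost of a negative example (y = 0)
    return np.log1p(np.exp(z))

def svm_cost_1(z):
    # Assumed hinge form: exactly zero once z >= 1
    return np.maximum(0.0, 1.0 - z)

def svm_cost_0(z):
    # Assumed hinge form: exactly zero once z <= -1
    return np.maximum(0.0, 1.0 + z)

z = np.array([-2.0, 0.0, 1.0, 3.0])   # z stands for theta^T x
print(svm_cost_1(z))        # zero for every z >= 1
print(logistic_cost_1(z))   # positive everywhere, only approaches zero
```

Under these assumed definitions the SVM cost ignores examples that are already classified with enough confidence, while the logistic cost keeps a small penalty on every example.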

## 2. Understanding the Math of SVM

### 2.1 The Margin (concept)
1. ```The goal of SVM:```
The goal of a support vector machine is to find the optimal separating hyperplane, the one that maximizes the margin of the training data.
2. ```Optimal separating hyperplane:``` The fact that you can find a separating hyperplane does not mean it is the best one!
<img src=https://www.svm-tutorial.com/wp-content/uploads/2014/11/01_svm-dataset1-separated-2.png width="300" height="300" alt="SVM" align=center>

So we try to select a hyperplane that is as far as possible from the data points of each category. The optimal separating hyperplane should:
- correctly classify the training data;
- generalize better to unseen data.
3. ```Margin:``` the optimal hyperplane is the one with the biggest margin.

### 2.2 Margin Calculation
<img src=images/Lab2/SVM1.jpg width="200" height="200" alt="SVM1" align=center>
<img src=images/Lab2/SVM2.jpg width="200" height="200" alt="SVM2" align=center>

- The distance from one training point $x$ to the hyperplane is $c = |b-a|$; the margin is the distance from the hyperplane to the closest training point, i.e. the minimum of $c$ over all training points;

- Step 1: calculating b
  - $z$ is the point where the hyperplane meets the line through the origin in the direction of $\mathbf{w}$; it has (signed) magnitude $b$ and unit direction $\frac{\mathbf{w}}{||\mathbf{w}||}$;
  - That is, $z = b\frac{\mathbf{w}}{||\mathbf{w}||}$;
  - $z$ lies on the hyperplane, so we have $ \mathbf{w^Tz} + w_0 = 0$;
  $$ \mathbf{w^T} \frac{b\mathbf{w}}{||\mathbf{w}||} + w_0 = 0 $$
  $$ b||\mathbf{w}|| + w_0 = 0 $$
  $$ b = - \frac{w_0}{||\mathbf{w}||} $$
  Note: $||\mathbf{w}|| = \sqrt{\mathbf{w^Tw}}$
- Step 2: calculating a
  - $a$ is the magnitude of $x$'s projection onto $w$;
  - that is, 
  $$a = \frac {\mathbf{w^Tx}}{||\mathbf{w}||}$$
- Step 3: calculating c
  - $ c = |b-a| = |\frac{w_0}{||\mathbf{w}||} + \frac {\mathbf{w^Tx}}{||\mathbf{w}||}| $
  - that is, the distance from one training point to the hyperplane is
  $$ c = \frac{1}{||\mathbf{w}||}|\mathbf{w^Tx} + w_0| $$
- Step 4: the margin
  - therefore, the margin is the smallest of these distances:
  $$ \underset{i}{\text{min}} \frac{1}{||\mathbf{w}||}|\mathbf{w^Tx_i} + w_0| $$
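
The four steps can be checked numerically. This is a minimal sketch with made-up values for $\mathbf{w}$, $w_0$ and the training points; the function names are illustrative, not from any library.

```python
import numpy as np

def distance_to_hyperplane(w, w0, x):
    norm = np.linalg.norm(w)
    b = -w0 / norm             # Step 1: signed magnitude of z
    a = np.dot(w, x) / norm    # Step 2: projection of x onto w
    return abs(b - a)          # Step 3: c = |b - a|

def margin(w, w0, X):
    # Step 4: distance from the hyperplane to the closest training point
    return min(distance_to_hyperplane(w, w0, x) for x in X)

w = np.array([3.0, 4.0])       # ||w|| = 5
w0 = -5.0
X = np.array([[3.0, 4.0], [0.0, 0.0], [5.0, 5.0]])

print(distance_to_hyperplane(w, w0, X[0]))   # 4.0
print(margin(w, w0, X))                      # 1.0
```

The result agrees with the closed form $\frac{1}{||\mathbf{w}||}|\mathbf{w^Tx} + w_0|$: for the first point, $|9 + 16 - 5| / 5 = 4$.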

### 2.3 The SVM optimisation problem
- Scaling: for any $k > 0$, $(\mathbf{w}, w_0)$ and $(k\mathbf{w}, kw_0)$ define the same hyperplane;
- This is because $k\mathbf{w^Tx} + kw_0 \geq 0$ is equivalent to $\mathbf{w^Tx} + w_0 \geq 0$;
- We can therefore fix the scale by putting a constraint on $(\mathbf{w}, w_0)$:
$$ \underset{i}{\text{min}} |\mathbf{w^Tx_i} + w_0| = 1 $$
- Now the margin will always be $\frac{1}{||\mathbf{w}||}$;
- We want a hyperplane that maximizes the margin:
$$ \underset{\mathbf{w}, w_0}{\text{max}} \frac{1}{||\mathbf{w}||} $$
subject to: 
$$\mathbf{w^Tx_i} + w_0 \geq 1 $$, for all i with $y_i = 1$; 
$$\mathbf{w^Tx_i} + w_0 \leq -1 $$, for all i with $y_i = -1$; 
$$ \underset{i}{\text{min}} |\mathbf{w^Tx_i} + w_0| = 1 $$
- The third constraint is redundant (at the optimum some point always attains equality), and the first two can be merged into one, so we have:
$$ \underset{\mathbf{w}, w_0}{\text{max}} \frac{1}{||\mathbf{w}||} $$
subject to:
$$ y_i(\mathbf{w^Tx_i} + w_0) \geq 1 $$, for all i
- Maximizing $\frac{1}{||\mathbf{w}||}$ is equivalent to minimizing $\frac{1}{2}||\mathbf{w}||^2$ (the factor $\frac{1}{2}$ does not change the minimizer), so the problem becomes:
$$ \underset{\mathbf{w}, w_0}{\text{min}} \frac{1}{2}||\mathbf{w}||^2 $$
subject to:
$$ y_i(\mathbf{w^Tx_i} + w_0) \geq 1 $$, for all i
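
The scaling argument can be illustrated on a toy dataset. The sketch below takes an arbitrary separating hyperplane (not necessarily the optimal one), rescales it so that the closest point satisfies $|\mathbf{w^Tx_i} + w_0| = 1$, and checks that the margin then equals $\frac{1}{||\mathbf{w}||}$. The data and the helper `canonicalize` are made up for illustration.

```python
import numpy as np

def canonicalize(w, w0, X):
    # Rescale (w, w0) so that the closest point satisfies |w^T x + w0| = 1
    scale = np.min(np.abs(X @ w + w0))
    return w / scale, w0 / scale

X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
w, w0 = np.array([2.0, 2.0]), -4.0           # the line x1 + x2 = 2

w_c, w0_c = canonicalize(w, w0, X)
margin_before = np.min(np.abs(X @ w + w0)) / np.linalg.norm(w)
margin_after = 1.0 / np.linalg.norm(w_c)

print(w_c, w0_c)                             # same hyperplane, different scale
print(np.isclose(margin_before, margin_after))
print(np.all(y * (X @ w_c + w0_c) >= 1))     # the merged constraint holds
```

Rescaling changes neither the hyperplane nor the margin; it only removes the freedom in choosing the scale of $(\mathbf{w}, w_0)$.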

### 2.4 The solution for the optimal parameters
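- ```Compute w:``` solving the problem with Lagrange multipliers ${\alpha}_i \geq 0$ (one per training point) gives
$$ \mathbf{w} = \sum_i {\alpha}_i{y_i}{x_i} $$
- ```Compute w_0:```
  - for a support vector (a point with ${\alpha}_i > 0$) the constraint holds with equality, so we can use it to calculate $w_0$:
$$ y_i(\mathbf{w^Tx_i} + w_0) = 1 $$
  - multiply both sides by $y_i$ (note ${y_i}^2 = 1$),
$$ \mathbf{w^Tx_i} + w_0 = y_i $$
$$ w_0 = y_i - \mathbf{w^Tx_i} $$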
- ```Hypothesis function:```
  - therefore, the prediction for a new data point $x$ is:
$$ f(x) = \text{sign}(\mathbf{w^Tx} + w_0) $$
$$ = \text{sign}(\sum_{i=1}^n {\alpha}_i{y_i}({x_i^T}x) + w_0)$$
- The formulation above is the hard-margin SVM. It does not work when the data is not linearly separable.
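
The formulas for $\mathbf{w}$, $w_0$ and the hypothesis function can be traced on a two-point toy problem. The multipliers `alpha` below are assumed values worked out by hand for this dataset, not the output of a solver; the code only applies the formulas to them.

```python
import numpy as np

X = np.array([[1.0, 1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0])
alpha = np.array([0.25, 0.25])    # assumed multipliers for this toy problem

# w = sum_i alpha_i y_i x_i
w = (alpha * y) @ X

# w_0 = y_i - w^T x_i, using any support vector (alpha_i > 0)
sv = np.argmax(alpha > 0)
w0 = y[sv] - w @ X[sv]

def predict(x):
    # Primal form: sign(w^T x + w_0)
    return np.sign(w @ x + w0)

def predict_dual(x):
    # Dual form: sign(sum_i alpha_i y_i x_i^T x + w_0)
    return np.sign(np.sum(alpha * y * (X @ x)) + w0)

x_new = np.array([2.0, 0.5])
print(w, w0)                               # [0.5 0.5] 0.0
print(predict(x_new), predict_dual(x_new)) # 1.0 1.0
```

Both training points satisfy $y_i(\mathbf{w^Tx_i} + w_0) = 1$ here, so both are support vectors.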

## 3. Soft Margin SVM

### 3.1 When and how does soft margin SVM help?
<p align="center">
<img src=images/Lab2/Soft_SVM1.jpg width="200" height="200" alt="SVM3" align=center>
<img src=images/Lab2/Soft_SVM2.jpg width="200" height="200" alt="SVM4" align=center>

- An outlier can reduce the margin, or it can break linear separability altogether;
- Soft margin SVM allows mistakes, but should make as few mistakes as possible.

### 3.2 $\zeta$:
- Therefore, we modify the constraints of the optimization problem by introducing a slack variable $\zeta_i \geq 0$ for each training point, going from
$$ y_i(\mathbf{w^Tx_i} + w_0) \geq 1 $$, for all i

to
$$ y_i(\mathbf{w^Tx_i} + w_0) \geq 1 - \zeta_i $$, for all i
- However, if we choose very large $\zeta_i$, the constraints can be satisfied quite easily. To keep the mistakes as few as possible, we modify the objective function to penalize the choice of $\zeta_i$. That is,
$$ \underset{\mathbf{w},w_0,\zeta}{\text{minimize}} \frac{1}{2} ||\mathbf{w}||^2 + C\sum_{i=1}^m \zeta_i $$

$$ \text{subject to} \quad y_i(\mathbf{w^Tx_i} + w_0) \geq 1 - \zeta_i \quad \text{where} \quad \zeta_i \geq 0 \quad \text{for any i=1,...,m} $$
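
For a fixed $(\mathbf{w}, w_0)$, the smallest slack that satisfies both constraints is $\zeta_i = \max(0, 1 - y_i(\mathbf{w^Tx_i} + w_0))$. The sketch below evaluates the slacks and the soft-margin objective on made-up data; the function names are illustrative.

```python
import numpy as np

def slack(w, w0, X, y):
    # Smallest zeta_i >= 0 with y_i (w^T x_i + w_0) >= 1 - zeta_i
    return np.maximum(0.0, 1.0 - y * (X @ w + w0))

def soft_margin_objective(w, w0, X, y, C):
    # 1/2 ||w||^2 + C * sum_i zeta_i
    return 0.5 * np.dot(w, w) + C * np.sum(slack(w, w0, X, y))

X = np.array([[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, w0 = np.array([1.0, 0.0]), 0.0

print(slack(w, w0, X, y))                          # slacks 0, 0.5, 0 and 2
print(soft_margin_objective(w, w0, X, y, C=1.0))   # 3.0
```

A slack of 0 means the point is on the correct side and outside the margin, a value between 0 and 1 means it is inside the margin but still correctly classified, and a value above 1 means it is misclassified.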

### 3.3 $C$:
Generally speaking, the parameter $C$ determines how heavily the slack variables $\zeta_i$ are penalized.
<p align="center">
<img src=images/Lab2/Soft_margin1.jpg width="500" height="200" alt="SVM5" align=centering>
<p align="center">
<img src=images/Lab2/Soft_margin2.jpg width="500" height="200" alt="SVM6" align=centering>
<p align="center">
<img src=images/Lab2/Soft_margin3.jpg width="500" height="200" alt="SVM7" align=centering>

  - 1) small C --> wider margin --> at the cost of some misclassifications;
  - 2) big C --> close to a hard margin --> little tolerance of misclassifications;
  - 3) no magic value for C --> select C by grid search with cross-validation (note: the best C is specific to the data we are using)
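
One way to run such a grid search is with scikit-learn's `GridSearchCV` and `SVC`. This is a minimal sketch on synthetic data; the candidate values for `C`, the linear kernel and the 5-fold split are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two overlapping Gaussian blobs (synthetic data, for illustration only)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Candidate values of C, spaced on a log scale
param_grid = {"C": [0.01, 0.1, 1, 10, 100]}
search = GridSearchCV(SVC(kernel="linear"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the C with the best mean cross-validated accuracy
print(search.best_score_)
```

The selected value depends on the data and on the grid, so it should be re-tuned for each new problem.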

### 3.4 2-Norm soft margin (L2 regularized)
$$ \underset{\mathbf{w},w_0,\zeta}{\text{minimize}} \frac{1}{2} ||\mathbf{w}||^2 + C\sum_{i=1}^m \zeta_i^2 $$

$$ \text{subject to} \quad y_i(\mathbf{w^Tx_i} + w_0) \geq 1 - \zeta_i \quad \text{where} \quad \zeta_i \geq 0 \quad \text{for any i=1,...,m} $$

## 4. Kernels

### 4.1 Feature mapping
- In some cases, the data is not linearly separable, such as:
<p align="center">
<img src=images/Lab2/kernal1.jpg width="250" height="200" alt="SVM8" align=centering>

- Just as in polynomial regression, we can apply a polynomial mapping:
$$ \phi : \mathbb{R}^2 \rightarrow \mathbb{R}^3 $$
defined by,
$$ \phi(x_1, x_2) = (x_1^2, \sqrt{2}x_1x_2, x_2^2) $$
- Now, the graph becomes:
<p align="center">
<img src=images/Lab2/kernal2.jpg width="250" height="200" alt="SVM8" align=centering>
<p align="center">
<img src=images/Lab2/kernal3.jpg width="250" height="200" alt="SVM8" align=centering>

- However, which transformation to apply depends on the data we have, so we have to experiment. [sklearn dataset transformation](https://scikit-learn.org/stable/data_transforms.html)
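
To see why this mapping helps, the sketch below assumes the data forms two concentric rings (a made-up dataset, not the one in the figures). After applying $\phi$, the first and third coordinates sum to the squared radius, so a plane separates the rings.

```python
import numpy as np

def phi(X):
    # Map each row (x1, x2) to (x1^2, sqrt(2) x1 x2, x2^2)
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([x1**2, np.sqrt(2) * x1 * x2, x2**2])

angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
ring = np.column_stack([np.cos(angles), np.sin(angles)])
inner = 1.0 * ring    # class A: radius 1
outer = 3.0 * ring    # class B: radius 3

# In the mapped space z1 + z3 = x1^2 + x2^2 = r^2,
# so the plane z1 + z3 = 5 lies between the two classes
w, w0 = np.array([1.0, 0.0, 1.0]), -5.0
print(phi(inner) @ w + w0)   # all negative
print(phi(outer) @ w + w0)   # all positive
```

No line in the original two-dimensional space separates the rings, but a linear model in the three-dimensional space does.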

### 4.2 What is a kernel and why do we need one?
- Why: feature mapping has to transform every example explicitly, which takes a huge amount of time when we have millions of examples. A kernel does not need to transform the examples at all; compare the following two functions:

```python
# Transform a two-dimensional vector x into a three-dimensional vector
import numpy as np

def transform(x):
    return [x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2]

# Compute the same dot product directly from the two-dimensional vectors
def polynomial_kernel(a, b):
    return a[0]**2 * b[0]**2 + 2*a[0]*b[0]*a[1]*b[1] + a[1]**2 * b[1]**2

x1 = [3, 6]
x2 = [10, 10]
x1_3d = transform(x1)
x2_3d = transform(x2)
print(np.dot(x1_3d, x2_3d))
print(polynomial_kernel(x1, x2))
```

```
8100.0
8100
```


```python
def polynomial_kernel(a, b, degree, constant=0):
    # (a . b + constant)^degree, computed from the original vectors
    result = sum([a[i] * b[i] for i in range(len(a))]) + constant
    return pow(result, degree)

# We do not transform the data
print(polynomial_kernel(x1, x2, degree=2))
```

As we can see, the kernel gives the same result as the explicit transformation without ever computing the transformed vectors, which is what allows us to deal with a large dataset.
- What is a kernel?
  - A kernel is defined by:
  $$ K(\mathbf{x_i}, \mathbf{x_j}) = \phi(\mathbf{x_i})^T\phi(\mathbf{x_j}) $$
  - A kernel is a function that returns the result of a dot product performed in another space.
  - For example, `polynomial_kernel` computes the dot product of two vectors as if they had been transformed into vectors belonging to $\mathbb{R}^3$.
- Kernel trick:
  - The original hypothesis function: 
$$ f(x) = \text{sign}(\mathbf{w^Tx} + w_0) $$
$$ = \text{sign}(\sum_{i=1}^n {\alpha}_i{y_i}({x_i^T}x) + w_0)$$
  - With a kernel (note that an SVM is a sparse kernel machine: only the support vectors have ${\alpha}_j > 0$):
$$ h(\mathbf{x_i}) = \text{sign}(\sum_{j=1}^n {\alpha}_j{y_j}K(\mathbf{x_j}, \mathbf{x_i}) + w_0)$$
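
The kernelized hypothesis function can be tried on an XOR-style dataset, which no line separates in the original space. The kernel is the degree-2 polynomial kernel $(\mathbf{a}^T\mathbf{b})^2$; the multipliers `alpha` and `w0` are assumed values worked out by hand for these four points, not the output of a solver, and the code checks that every training point lands exactly on the margin.

```python
import numpy as np

def poly2_kernel(a, b):
    # Homogeneous degree-2 polynomial kernel: (a . b)^2
    return np.dot(a, b) ** 2

def decision(x, X_train, y_train, alpha, w0, kernel):
    # sum_j alpha_j y_j K(x_j, x) + w_0, without ever computing phi(x)
    return sum(a * yj * kernel(xj, x)
               for a, yj, xj in zip(alpha, y_train, X_train)) + w0

X_train = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y_train = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.full(4, 0.125)    # assumed multipliers for this toy set
w0 = 0.0

for x, label in zip(X_train, y_train):
    # y_i * f(x_i) equals 1.0 for every training point
    print(label * decision(x, X_train, y_train, alpha, w0, poly2_kernel))

x_new = np.array([2.0, 3.0])
print(np.sign(decision(x_new, X_train, y_train, alpha, w0, poly2_kernel)))  # 1.0
```

For this particular dataset the decision value works out to $x_1x_2$, so the classifier predicts the sign of the product of the two coordinates.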

### 4.3 Kernel types
- General form:
$$ K(\mathbf{x}, \mathbf{x'}) = \langle{\phi(\mathbf{x}), \phi(\mathbf{x'})}\rangle_\mathcal{V} $$
- General rule:
Try an RBF kernel first, because it usually works well.
- Linear kernel
$$ K(\mathbf{x}, \mathbf{x'}) = \mathbf{x}^T\mathbf{x'} $$
- Polynomial kernel
$$ K(\mathbf{x}, \mathbf{x'}) = (\mathbf{x}^T\mathbf{x'} + c)^d $$
(note: $c$ is a constant and $d$ is the degree; using a high-degree polynomial kernel will often lead to overfitting)
<p align="center">
<img src=images/Lab2/kernal4.jpg width="500" height="200" alt="SVM9" align=centering>
- RBF or Gaussian kernel
$$ K(\mathbf{x}, \mathbf{x'}) = \text{exp}(-\gamma||\mathbf{x} - \mathbf{x'}||^2) $$
or, with $\gamma = \frac{1}{2{\sigma}^2}$,
$$ K(\mathbf{x}, \mathbf{x'}) = \text{exp}(-\frac{||\mathbf{x} - \mathbf{x'}||^2}{2{\sigma}^2}) $$
- the RBF (Radial Basis Function) kernel returns the result of a dot product performed in an infinite-dimensional space, $\mathbb{R}^{\infty} $
- [How to choose gamma with sklearn](https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html)
<p align="center">
<img src=images/Lab2/kernal6.jpg width="300" height="200" alt="SVM11" align=centering>
<p align="center">
<img src=images/Lab2/kernal5.jpg width="500" height="200" alt="SVM10" align=centering>
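
The two forms of the RBF kernel agree when $\gamma = \frac{1}{2{\sigma}^2}$. The sketch below checks this and shows how $\gamma$ controls how quickly the similarity decays with distance; the function names and sample points are made up for illustration.

```python
import numpy as np

def rbf_gamma(x, x_prime, gamma):
    # exp(-gamma * ||x - x'||^2)
    diff = np.asarray(x) - np.asarray(x_prime)
    return np.exp(-gamma * np.dot(diff, diff))

def rbf_sigma(x, x_prime, sigma):
    # exp(-||x - x'||^2 / (2 sigma^2))
    diff = np.asarray(x) - np.asarray(x_prime)
    return np.exp(-np.dot(diff, diff) / (2.0 * sigma**2))

x, x_prime = [1.0, 2.0], [2.0, 4.0]
sigma = 1.5
gamma = 1.0 / (2.0 * sigma**2)

print(np.isclose(rbf_gamma(x, x_prime, gamma), rbf_sigma(x, x_prime, sigma)))  # True
print(rbf_gamma(x, x, gamma))   # 1.0: a point is maximally similar to itself

for g in (0.01, 1.0, 100.0):
    # The larger gamma is, the faster the similarity drops towards zero
    print(g, rbf_gamma(x, x_prime, g))
```

A large $\gamma$ makes each training point influence only its immediate neighbourhood, which tends towards overfitting, while a small $\gamma$ gives a smoother decision boundary.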
