## Programming Exercise 6: Support Vector Machines
#### Author - Rishabh Jain

```python
import warnings
warnings.simplefilter('ignore')

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline

from scipy.io import loadmat
```

#### Learning Resources
1. [SVM Video Lecture (MIT)](https://www.youtube.com/watch?v=_PwhiWxHK8o)
2. [29 to 33 SVM Video Lectures (University of Buffalo)](https://www.youtube.com/watch?v=N4pai7eZW_o&list=PLhuJd8bFXYJsSXPMrGlueK6TMPdHubICv&index=29)
3. [Support Vector Machine Succinctly (PDF)](./Lectures/SVM_succinctly.pdf)
4. [An Idiot’s guide to Support vector machines](./Lectures/SVM_notes.pdf)

### 0&nbsp;&nbsp;&nbsp;&nbsp;Maths Behind SVM (Maximum Margin Classifier)

For a two-class, linearly separable dataset such as the one shown below, there are many possible linear separators. Intuitively, a decision boundary drawn in the middle of the gap between the two classes seems better than one that passes very close to examples of one or both classes. While some learning methods, such as logistic regression, find just any linear separator, **the SVM in particular looks for the decision surface that is MAXIMALLY far away from any data point**. The distance from the decision surface to the closest data point determines the margin of the classifier.

<img src="images/svm1.PNG" width="380"/>

Let's imagine a vector $\vec{w}$ perpendicular to the margin (that is, normal to the decision boundary) and an unknown data point $\vec{u}$, which can be on either side of the margin. To find out whether $\vec{u}$ is on the right or left side, we project $\vec{u}$ onto $\vec{w}$ (dot product) and compare the result with some constant $c$.

$$\vec{w}.\vec{u}\geq c$$
$$\boxed{\vec{w}.\vec{u}+b\geq 0}\;\;(1)$$ 

If the projection of $\vec{u}$ onto $\vec{w}$ plus some constant $b$ (where $b=-c$) is greater than or equal to zero, then it's a positive sample; otherwise it's a negative sample. **Eq. (1) is our DECISION RULE**. The problem is that we don't yet know which $\vec{w}$ and $b$ to use.  
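
As a minimal sketch of decision rule (1), the snippet below classifies two points with NumPy. The function name `classify` and the values of `w` and `b` are hypothetical, chosen only for illustration; they are not learned from any data in this exercise.

```python
import numpy as np

def classify(u, w, b):
    """Apply decision rule (1): +1 if w.u + b >= 0, otherwise -1."""
    return 1 if np.dot(w, u) + b >= 0 else -1

# Hypothetical parameters (assumed for illustration, not fitted).
w = np.array([1.0, 1.0])
b = -3.0

print(classify(np.array([3.0, 2.0]), w, b))  # w.u + b = 2  -> positive side
print(classify(np.array([1.0, 1.0]), w, b))  # w.u + b = -1 -> negative side
```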

**An unknown sample may be located anywhere inside or outside the margin (i.e. >0 or <0), but for a known positive sample $\vec{x_{+}}$ the SVM decision rule should insist that the dot product plus the constant $b$ be greater than or equal to 1.** Likewise, for a negative sample $\vec{x_{-}}$, the dot product plus the constant $b$ should be less than or equal to -1. Hence:

$\vec{w}.\vec{x_{+}}+b\geq 1 $   
$\vec{w}.\vec{x_{-}}+b\leq -1 $ 

Introducing a variable $y_i$ such that :  

$$\begin{equation}
  y_{i}=\begin{cases}
    +1 & \text{for +ve samples}\\
    -1 & \text{for -ve samples}
  \end{cases}
\end{equation}$$

Multiplying the above two inequalities by $y_i$:

For +ve sample : $y_{i}(\vec{w}.\vec{x_{i}}+b) \geq 1$  
For -ve sample : $y_{i}(\vec{w}.\vec{x_{i}}+b) \geq 1$

###### Note : Sign changed from $\leq$ to $\geq$ because $y_i$ is -1 in case of -ve samples
Since both inequalities are the same, we can rewrite them as :

$$\boxed{y_{i}(\vec{w}.\vec{x_{i}}+b)\geq 1}\;\;(2)$$

$$\boxed{y_{i}(\vec{w}.\vec{x_{i}}+b)-1= 0}\;\;(3)\;\;\text{For samples on margin}$$

Eq.(2) is basically a **constraint** for our margin, which means that **all the training samples should be on the correct side OR on the margin** (i.e. +ve samples on the right and -ve samples on the left side of the margin) and **NO training sample should be inside the margin at all, meaning ZERO TRAINING ERROR.** 
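
The margin constraint can be checked numerically. The sketch below assumes a hypothetical toy dataset `X`, `y` and hand-picked parameters `w`, `b` (none of them come from the exercise data); it evaluates $y_{i}(\vec{w}.\vec{x_{i}}+b)$ for every sample and flags the ones that sit exactly on the margin.

```python
import numpy as np

# Hypothetical toy data and parameters, assumed for illustration only.
X = np.array([[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]])
y = np.array([1, 1, -1, -1])
w = np.array([0.5, 0.5])
b = -1.0

# Left-hand side of the constraint y_i (w.x_i + b) >= 1 for each sample.
margins = y * (X @ w + b)

# True when every training sample is on the correct side or on the margin.
satisfies_constraint = bool(np.all(margins >= 1))

# Samples for which y_i (w.x_i + b) - 1 = 0, i.e. samples on the margin.
on_margin = np.isclose(margins, 1.0)

print(margins, satisfies_constraint, on_margin)
```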

###### Let's calculate the width of the margin.

<img src="images/svm2.PNG" width="400"/>

Let's imagine two vectors $\vec{x_+}$ and $\vec{x_-}$, a known +ve and a known -ve sample respectively, both lying on the margin (so Eq. (3) holds for each of them). The difference of these two vectors is a resultant vector called $\vec{R}$ where :

$$\vec{R}=\vec{x_+}-\vec{x_-}$$

All we need is a unit vector $\hat{u}$ perpendicular to the margin, **so that the WIDTH of the margin will be the projection of $\vec{R}$ onto $\hat{u}$**. From the first image, we already know a vector $\vec{w}$ in that direction.

$$\hat{u}=\frac{\vec{w}}{||w||}$$

**WIDTH** $=\vec{R}.\hat{u} $  

$\;\;\;\;\;\;\;\;\;\;=(\vec{x_+}-\vec{x_-}).\frac{\vec{w}}{||w||}$  
$\;\;\;\;\;\;\;\;\;\;=\frac{(\vec{x_+}.\vec{w}-\vec{x_-}.\vec{w})}{||w||}$

Using Eq. (3) with $y_i=+1$ and $y_i=-1$, we have $\vec{x_+}.\vec{w}=1-b$ and $\vec{x_-}.\vec{w}=-1-b$, so we get

$\;\;\;\;\;\;\;\;\;\;=\frac{(1-b+1+b)}{||w||}$
$$\boxed{\text{WIDTH}=\frac{2}{||w||}}\;\;(4)$$
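
To make Eq. (4) concrete, the following sketch computes the width in two ways: by projecting $\vec{x_+}-\vec{x_-}$ onto the unit vector along $\vec{w}$, and directly as $2/||w||$. The vector `w`, the offset `b`, and the two margin points are hypothetical values assumed for illustration.

```python
import numpy as np

# Hypothetical parameters and margin points, assumed for illustration.
w = np.array([0.5, 0.5])
b = -1.0
x_pos = np.array([2.0, 2.0])   # w.x + b = +1, a +ve sample on the margin
x_neg = np.array([0.0, 0.0])   # w.x + b = -1, a -ve sample on the margin

u_hat = w / np.linalg.norm(w)                   # unit vector along w
width_projection = np.dot(x_pos - x_neg, u_hat)  # R projected onto u_hat
width_formula = 2 / np.linalg.norm(w)            # Eq. (4)

print(width_projection, width_formula)
```

Both numbers agree, which is what the derivation predicts for any pair of points lying on opposite margin lines.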

Now, we want to maximize the margin while incurring zero training error.

max $\frac{2}{||w||}$ with 0 loss OR (Flipping for mathematical convenience)

min $\frac{||w||}{2}\;$ with 0 loss OR (Squaring the numerator for mathematical convenience)

min $\frac{||w||^2}{2}$ with 0 loss **(NO LONGER AN UNCONSTRAINED OPTIMIZATION)**

##### SVM Optimization Formulation

> minimize $\;\;\frac{||w||^2}{2}$  
> subject to $\;\;y_{i}(w^{T}x_{i}+b)\geq 1\;\;,i=1,2...N$

In order to solve a constrained optimization problem, Lagrange multipliers are used.  
Note: Lagrange Multipliers are explained [**here**](#Understanding-Lagrange-Multipliers).
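
Before working through the multipliers by hand, it can help to see the formulation above handed to a generic constrained solver. This is only a sketch under stated assumptions: the toy dataset is hypothetical, and SciPy's general-purpose `SLSQP` method stands in for a dedicated SVM solver.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical, linearly separable toy data (assumed for illustration).
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

def objective(params):
    """||w||^2 / 2, where params = [w_1, ..., w_d, b]."""
    w = params[:-1]
    return 0.5 * np.dot(w, w)

def margin_constraints(params):
    """y_i (w.x_i + b) - 1, which must be >= 0 for every sample."""
    w, b = params[:-1], params[-1]
    return y * (X @ w + b) - 1.0

result = minimize(
    objective,
    x0=np.zeros(X.shape[1] + 1),
    method="SLSQP",
    constraints=[{"type": "ineq", "fun": margin_constraints}],  # 'ineq' means fun >= 0
)
w_opt, b_opt = result.x[:-1], result.x[-1]
print(w_opt, b_opt, 2 / np.linalg.norm(w_opt))
```

For this symmetric toy set the closest points are $(1,1)$ and $(-1,-1)$, so the solver should place the boundary midway between them.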

### Understanding Lagrange Multipliers
The method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to **equality** constraints. Let's try to solve a constrained optimization problem :

#### Example 1 (Equality Constraints) :

maximize $\;\;f(x,y)=2-x^2-2y^2$  
subject to $\;\;h(x,y)=x+y-1=0$

**We introduce a new variable ($\beta$) called a Lagrange multiplier and study the Lagrange function defined by:**

$$\boxed{L(x,y,\beta)=f(x,y)-\beta h(x,y)}$$

$L(x,y,\beta)=(2-x^2-2y^2)-\beta(x+y-1)$

Now we treat the Lagrange function like an unconstrained optimization problem: take the partial derivatives w.r.t. $x$, $y$ and $\beta$, set them equal to zero, and solve for $x$, $y$ and $\beta$.

$\frac{\partial{L}}{\partial{x}}=0\;\;=>\;\;-2x-\beta=0\;\;=>\;\;x=\frac{-\beta}{2}$

$\frac{\partial{L}}{\partial{y}}=0\;\;=>\;\;-4y-\beta=0\;\;=>\;\;y=\frac{-\beta}{4}$

$\frac{\partial{L}}{\partial{\beta}}=0\;\;=>\;\;x+y-1=0\;\;=>\;\;\beta=\frac{-4}{3}$

$\boxed{x=\frac{4}{6}=\frac{2}{3},\;y=\frac{4}{12}=\frac{1}{3},\;\beta=\frac{-4}{3}}$
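
Because the three stationarity equations of Example 1 are linear in $x$, $y$ and $\beta$, they can be cross-checked with a linear solve. The matrix `A` and vector `rhs` below are simply those equations rearranged; the variable names are assumptions made for this sketch.

```python
import numpy as np

# Rows encode: -2x - beta = 0,  -4y - beta = 0,  x + y = 1
A = np.array([
    [-2.0,  0.0, -1.0],
    [ 0.0, -4.0, -1.0],
    [ 1.0,  1.0,  0.0],
])
rhs = np.array([0.0, 0.0, 1.0])

x, y, beta = np.linalg.solve(A, rhs)
f_value = 2 - x**2 - 2 * y**2   # objective at the stationary point

print(x, y, beta, f_value)
```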

#### Example 2 (Inequality Constraints / Karush-Kuhn-Tucker (KKT) conditions)

maximize $\;\;f(x,y)=3x+4y$  
subject to $\;\;x^2+y^2\leq4$  
$\;\;\;\;\;\;\;\;\;\;\;\;\;\;\;x\geq1$

**Note: Inequality constraints should be written in the form $h(x,y)\leq0$, so here $h_{1}(x,y)=x^2+y^2-4\leq0$ and $h_{2}(x,y)=-x+1\leq0$.**

$$\boxed{L(x,y,\alpha_1,\alpha_2)=f(x,y)-\alpha_1 h_{1}(x,y)-\alpha_2 h_{2}(x,y)\\\;\;\text{s.t. }\alpha_1,\alpha_2\geq0}$$

$L(x,y,\alpha_1,\alpha_2)=3x+4y-\alpha_1(x^2+y^2-4)-\alpha_2(-x+1)$  

**KKT Conditions :**

1. $\frac{\partial{L}}{\partial{x}}=3-2\alpha_1x+\alpha_2=0$

2. $\frac{\partial{L}}{\partial{y}}=4-2\alpha_1y=0$

3. $\alpha_1(x^2+y^2-4)=0$

4. $\alpha_2(-x+1)=0$

5. $\alpha_1,\alpha_2\geq0$ 

A constraint is considered to be binding (active) if changing it also changes the optimal solution. Less severe constraints that do not affect the optimal solution are non-binding (non active). For 2 constraints possible combinations are :

- Both constraints are binding
- Constraint 1 binding, Constraint 2 not binding
- Constraint 2 binding, Constraint 1 not binding
- Both constraints are not binding

**POSSIBILITY 1 : Both constraints are binding**

$-x+1=0\;\text{and}\;\alpha_2>0\;\;=>\;\;x=1$  
$x^2+y^2-4=0\;\text{and}\;\alpha_1>0\;\;=>\;\;x^2+y^2=4\;\;=>\;\;1+y^2=4\;\;=>\;\;y=\pm\sqrt{3}$  

(a) For $y=+\sqrt{3}$ 

>Condition 2 becomes:  
>$4-2\sqrt{3}\alpha_1=0\;\;=>\;\;\alpha_1=\frac{2}{\sqrt{3}}>0$  
>Condition 1 becomes:  
>$3-2\alpha_1+\alpha_2=0\;\;=>\;\;3-\frac{4}{\sqrt{3}}+\alpha_2=0\;\;=>\;\;\alpha_2=\frac{4}{\sqrt{3}}-3<0$ (KKT condition fails)

(b) For $y=-\sqrt{3}$  

>Condition 2 becomes:  
>$4+2\sqrt{3}\alpha_1=0\;\;=>\;\;\alpha_1=\frac{-2}{\sqrt{3}}<0$ (KKT condition fails)    
>Condition 1 becomes:  
>$3-2\alpha_1+\alpha_2=0\;\;=>\;\;3+\frac{4}{\sqrt{3}}+\alpha_2=0\;\;=>\;\;\alpha_2=\frac{-4}{\sqrt{3}}-3<0$ (KKT condition fails)

**POSSIBILITY 2 : Constraint 1 binding , Constraint 2 not binding**

$x>1\;\text{and}\;\boxed{\alpha_2=0}$  
$x^2+y^2=4\;\text{and}\;\alpha_1>0\;\;=>\;\;x=+\sqrt{4-y^{2}}$  

>Condition 1 becomes:  
>$3-2\alpha_1x=0\;\;=>\;\;x=\frac{3}{2\alpha_1}\;\;=>\;\;3-2\alpha_1\sqrt{4-y^{2}}=0\;\;=>\;\;\alpha_1=\frac{3}{2\sqrt{4-y^{2}}}$  
>Condition 2 becomes:  
>$4-2\alpha_1y=0\;\;=>\;\;4-\frac{3y}{\sqrt{4-y^{2}}}=0\;\;=>\;\;4\sqrt{4-y^{2}}=3y\;\;=>\;\;16(4-y^2)=9y^2\;\;=>\;\;64-16y^2=9y^2\;\;=>\;\;64=25y^2\;\;=>\;\;y=\frac{8}{5}$ (the root $y=-\frac{8}{5}$ is rejected because $4\sqrt{4-y^{2}}=3y$ requires $y\geq0$)

$\boxed{\alpha_1=\frac{3}{2\sqrt{4-\frac{64}{25}}}=\frac{3}{2(\frac{6}{5})}=\frac{5}{4}>0}$  
$x=+\sqrt{4-y^{2}}\;\;=>\;\;x=\frac{6}{5}>1$ (so constraint 2 is indeed not binding)

1 candidate point: $\boxed{(x,y)=(\frac{6}{5},\frac{8}{5})}$
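
The five KKT conditions can be evaluated mechanically for any proposed point. The helper `kkt_report` below is a hypothetical name introduced for this sketch; it checks the candidate $(\frac{6}{5},\frac{8}{5})$ with $\alpha_1=\frac{5}{4},\alpha_2=0$ and, for contrast, the point from Possibility 1(a) that was rejected for having a negative multiplier.

```python
import math

def kkt_report(x, y, a1, a2, tol=1e-9):
    """Evaluate the KKT conditions of Example 2 at (x, y, a1, a2)."""
    h1 = x**2 + y**2 - 4   # constraint 1 in the form h1 <= 0
    h2 = -x + 1            # constraint 2 in the form h2 <= 0
    return {
        "stationarity_x": abs(3 - 2 * a1 * x + a2) <= tol,
        "stationarity_y": abs(4 - 2 * a1 * y) <= tol,
        "complementary_slackness": abs(a1 * h1) <= tol and abs(a2 * h2) <= tol,
        "dual_feasibility": a1 >= 0 and a2 >= 0,
        "primal_feasibility": h1 <= tol and h2 <= tol,
    }

candidate = kkt_report(6 / 5, 8 / 5, 5 / 4, 0.0)
rejected = kkt_report(1.0, math.sqrt(3), 2 / math.sqrt(3), 4 / math.sqrt(3) - 3)

print(candidate)
print(rejected)
```

The candidate passes every check, while the Possibility 1(a) point satisfies stationarity but fails dual feasibility because its $\alpha_2$ is negative.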

**POSSIBILITY 3 : Constraint 2 binding , Constraint 1 not binding**

$x=1\;\text{and}\;\alpha_2>0$  
$x^2+y^2<4\;\text{and}\;\alpha_1=0$  

>Condition 2 becomes:  
>$4-2\alpha_1y=0\;\;=>\;\;4=0$ (Contradiction, no candidate points)  

**POSSIBILITY 4 : Both constraints are not binding**

$x>1\;\text{and}\;\alpha_2=0$  
$x^2+y^2<4\;\text{and}\;\alpha_1=0$  

>Condition 2 becomes:  
>$4-2\alpha_1y=0\;\;=>\;\;4=0$ (Contradiction, no candidate points)  

**Check maximality of the candidate point :**

$f(\frac{6}{5},\frac{8}{5})=3(\frac{6}{5})+4(\frac{8}{5})=\frac{18}{5}+\frac{32}{5}=10$

Optimal Solution : $\boxed{x=\frac{6}{5},y=\frac{8}{5},\alpha_1=\frac{5}{4},\alpha_2=0}$
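
As an independent sanity check on Example 2, a brute-force search over a grid of feasible points should land on the same optimum. This is a rough sketch, not a substitute for the KKT analysis: the grid resolution and the small tolerance on the circle constraint are assumptions chosen so that the point $(1.2, 1.6)$ lies on the grid.

```python
import numpy as np

# Grid over the bounding box of the feasible region (step 0.002 in x and y).
xs = np.linspace(1.0, 2.0, 501)
ys = np.linspace(-2.0, 2.0, 2001)
xx, yy = np.meshgrid(xs, ys)

# Feasible region: x^2 + y^2 <= 4 and x >= 1 (tiny tolerance for rounding).
feasible = (xx**2 + yy**2 <= 4.0 + 1e-9) & (xx >= 1.0)

# Objective on feasible points; infeasible points can never win the argmax.
f = np.where(feasible, 3 * xx + 4 * yy, -np.inf)
idx = np.unravel_index(np.argmax(f), f.shape)
best_x, best_y, best_f = xx[idx], yy[idx], f[idx]

print(best_x, best_y, best_f)
```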

### Primal and Dual Formulations