# Support Vector Machine (SVM)
***

1. The objective of a support vector machine is to find a hyperplane in an N-dimensional space that separates two classes. Like linear regression, an SVM is parameterized by a weight vector and a bias.
1. To find the parameters, we first assume the training instances are linearly separable. A convex optimization problem is then solved to find the weights and bias such that the hyperplane has the maximum distance to the support vectors, which are the training instances closest to the hyperplane.
1. If the training set contains noisy points that make it linearly non-separable, we can add a slack variable for each training instance to the constraints of the optimization problem, which permits some training instances to lie on the wrong side of the hyperplane. Larger slack variables allow more margin violations, so their sum is added to the objective function to be minimized.
1. A hyperparameter C determines how heavily the slack variables are penalized. A very large C (C going to infinity) means that we want the SVM to separate the two classes in the training set perfectly, while a smaller value allows some errors on the training set.
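
As a minimal sketch of the first point, the snippet below classifies points by the side of the hyperplane they fall on. The helper names `dot` and `predict` and the parameter values are hypothetical, chosen only for illustration.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def predict(w, b, x):
    """Classify x by the side of the hyperplane w.x + b = 0 it falls on."""
    return 1 if dot(w, x) + b >= 0 else -1


# Hypothetical parameters: the hyperplane x1 + x2 = 0 in two dimensions.
w, b = [0.5, 0.5], 0.0
points = [[1.0, 1.0], [-1.0, -1.0], [2.0, -0.5]]
labels = [predict(w, b, x) for x in points]
```

Training an SVM amounts to choosing `w` and `b`; prediction only needs the sign of `w.x + b`.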

## Preliminary
***
(lagrangian)=
#### The Lagrangian dual problem (Duality)

1. Given a minimization primal problem:
    $$
    \begin{alignat}{2}
    \min_{x} \quad & f(x) \\
    \text{s.t. } \quad & h_{i}(x) \leq 0, \quad i = 1, \dots, n \\
    \quad & l_{j}(x) = 0, \quad j = 1, \dots, m \\
    \end{alignat}
    $$
1. The Lagrangian of the primal problem is defined as:
    $$ L(x, u, v) = f(x) + \sum_{i=1}^{n} u_{i}h_{i}(x) + \sum_{j=1}^{m} v_{j}l_{j}(x) $$
    where $u_{i}$ and $v_{j}$ are new variables called Lagrange multipliers.
1. The Lagrange dual function is:
    $$ g(u, v) = \min_{x} L(x, u, v) $$
1. The Lagrange dual problem is:
    $$
    \begin{alignat}{2}
    \max_{u, v} \quad & g(u, v) \\
    \text{s.t. } \quad & u \geq 0 \\
    \end{alignat}
    $$
1. The properties of dual problem:
    1. The dual problem is always a convex optimization problem (it maximizes a concave function), even if the primal problem is not convex.
    1. For any primal problem and its dual problem, weak duality always holds: the optimal value of the primal problem is always greater than or equal to the optimal value of the dual problem.
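
A small numerical illustration of weak duality, on a toy problem that is an assumption of this sketch and not part of the SVM derivation: minimize $x^{2}$ subject to $1 - x \leq 0$. Its Lagrangian is $x^{2} + u(1 - x)$, which is minimized over $x$ at $x = u / 2$.

```python
def dual_function(u):
    """g(u) = min_x x^2 + u * (1 - x); the minimizer is x = u / 2."""
    x = u / 2.0
    return x * x + u * (1.0 - x)


# f(x) = x^2 restricted to x >= 1 is minimized at x = 1.
primal_optimum = 1.0

# Evaluate g on a grid of feasible multipliers u >= 0.
grid = [k / 100.0 for k in range(0, 501)]
values = [dual_function(u) for u in grid]

# Weak duality: every dual value is a lower bound on the primal optimum.
weak_duality_holds = all(g <= primal_optimum + 1e-12 for g in values)

# Best lower bound found on the grid, together with the multiplier attaining it.
best_g, best_u = max(zip(values, grid))
```

On this convex toy problem the best lower bound meets the primal optimum, so the duality gap is zero; in general weak duality only guarantees the inequality.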

#### Karush-Kuhn-Tucker (KKT) conditions

1. Given the primal problem and its Lagrangian stated above, the KKT conditions are:
    1. Stationarity condition: 
        $$ 0 \in \partial \left( f(x) + \sum_{i=1}^{n} u_{i} h_{i}(x) + \sum_{j=1}^{m} v_{j}l_{j}(x) \right) $$
    1. Complementary slackness:
        $$ u_{i}h_{i}(x) = 0, \quad i = 1, \dots, n $$
    1. Primal feasibility:
        $$ h_{i}(x) \leq 0, \quad i = 1, \dots, n $$
        $$ l_{j}(x) = 0,  \quad j = 1, \dots, m $$
    1. Dual feasibility:
        $$ u_{i} \geq 0, \quad i = 1, \dots, n $$
1. If strong duality holds (the primal and dual optimal objective values are equal), then $x^{*}$ and $u^{*}, v^{*}$ are primal and dual optimal solutions if and only if they satisfy the KKT conditions.
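
The four conditions can be checked mechanically. The sketch below does so for a toy problem assumed here for illustration, minimize $x^{2}$ subject to $1 - x \leq 0$, with a single inequality multiplier $u$; the function names are hypothetical.

```python
def kkt_residuals(x, u):
    """Return the KKT quantities for: min x^2  s.t.  h(x) = 1 - x <= 0."""
    h = 1.0 - x
    return {
        "stationarity": 2.0 * x - u,  # d/dx of x^2 + u * (1 - x)
        "complementary_slackness": u * h,
        "primal_feasible": h <= 0.0,
        "dual_feasible": u >= 0.0,
    }


def satisfies_kkt(x, u, tol=1e-9):
    """True when all four KKT conditions hold up to the tolerance."""
    r = kkt_residuals(x, u)
    return (
        abs(r["stationarity"]) <= tol
        and abs(r["complementary_slackness"]) <= tol
        and r["primal_feasible"]
        and r["dual_feasible"]
    )


at_optimum = satisfies_kkt(1.0, 2.0)  # candidate x* = 1, u* = 2
at_other = satisfies_kkt(2.0, 4.0)    # stationary and feasible, but u * h != 0
```

The second candidate shows why complementary slackness matters: it is stationary and feasible, yet it is not optimal because the constraint is inactive while its multiplier is nonzero.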

## SVM formulation
***

#### SVM without slacks (hard margin SVM)

Given a dataset with $n$ instances $x_{i} \in R^{d}$ and $n$ labels $y_{i} \in \{-1, 1\}$, a hard margin SVM model is a linear function (hyperplane) defined by a set of weights $w \in R^{d}$ and a bias $b \in R$ that has the largest distance to the support vectors. The hyperplane is obtained by solving the following optimization problem:
$$
\begin{alignat}{2}
\min_{w, b} \quad & \frac{1}{2} \lVert w \rVert^{2} \\
\text{s.t. } \quad & y_{i}(w x_{i} + b) \geq 1, \quad i = 1, \dots, n \\
\end{alignat}
$$
Solving the above optimization problem gives two parallel hyperplanes ($w x + b = 1$ and $w x + b = -1$) that separate the positive and negative training instances with the maximum gap in between. The decision boundary $w x + b = 0$ lies midway between them.

1. The objective maximizes the distance between the two parallel hyperplanes. Because this distance is $2 / \lVert w \rVert$, maximizing it is equivalent to minimizing $\frac{1}{2} \lVert w \rVert^{2}$. The distance between the hyperplanes is given by
    $$ \frac{\lvert b_{2} - b_{1} \rvert}{\lVert w \rVert} = \frac{\lvert (b + 1) - (b - 1) \rvert}{\lVert w \rVert} = \frac{2}{\lVert w \rVert} $$
1. The constraints specify that the instances must be on the correct side of the two hyperplanes:
    $$ w x_{i} + b \geq 1 \quad \text{if } y_{i} = 1 $$
    $$ w x_{i} + b \leq -1 \quad \text{if } y_{i} = -1 $$
    and $y_{i}(w x_{i} + b) \geq 1$ summarizes the above two conditions.
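
The margin width and the constraints can be evaluated directly for a candidate hyperplane. This is a sketch on a hypothetical four-point dataset with hand-picked `w` and `b`; it checks feasibility only and does not solve the optimization problem.

```python
import math


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def margin_width(w):
    """Distance 2 / ||w|| between the hyperplanes w.x + b = 1 and w.x + b = -1."""
    return 2.0 / math.sqrt(dot(w, w))


def functional_margins(w, b, X, y):
    """The quantities y_i (w.x_i + b) that the constraints require to be >= 1."""
    return [y_i * (dot(w, x_i) + b) for x_i, y_i in zip(X, y)]


# Hypothetical toy data and a hand-picked candidate (w, b).
X = [[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-3.0, 0.0]]
y = [1, 1, -1, -1]
w, b = [0.5, 0.5], 0.0

margins = functional_margins(w, b, X, y)
feasible = all(m >= 1.0 for m in margins)
# Instances whose constraint holds with equality lie exactly on a margin hyperplane.
support_vectors = [x_i for x_i, m in zip(X, margins) if abs(m - 1.0) < 1e-9]
width = margin_width(w)
```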
            
#### SVM with slacks (soft margin SVM)

When the instances cannot be linearly separated, we can add slack variables to the formulation to tolerate a small number of training instances that violate the margin.
$$
\begin{alignat}{2}
\min_{w, b, \xi} \quad & \frac{1}{2} \lVert w \rVert^{2} + C \sum_{i}^{n} \xi_{i} \\
\text{s.t. } \quad & y_{i}(w x_{i} + b) \geq 1 - \xi_{i}, \quad i = 1, \dots, n \\
\quad & \xi_{i} \geq 0, \quad i = 1, \dots, n \\
\end{alignat}
$$
where $\xi_{i}$ is the slack variable for the instance $x_{i}$ and $C$ is a hyperparameter that controls how heavily margin violations are penalized.

1. If $\xi_{i}$ is nonzero for $x_{i}$, then $x_{i}$ lies on the wrong side of its margin hyperplane $w x + b = 1$ (or $w x + b = -1$), and its distance to that hyperplane is $\xi_{i} / \lVert w \rVert$. The instance is actually misclassified only when $\xi_{i} > 1$.
1. If $C = 0$, the slack is not penalized, so $\xi_{i}$ can be arbitrarily large for each $x_{i}$. If $C \to \infty$, the problem is the same as hard margin SVM because any margin violation incurs infinite loss.
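
For a fixed $w$ and $b$, the smallest slack that satisfies both constraints is $\xi_{i} = \max(0, 1 - y_{i}(w x_{i} + b))$. The sketch below computes these values and the resulting objective for hypothetical data and parameters, to show how $C$ scales the penalty.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def slacks(w, b, X, y):
    """Smallest feasible slack per instance: max(0, 1 - y_i (w.x_i + b))."""
    return [max(0.0, 1.0 - y_i * (dot(w, x_i) + b)) for x_i, y_i in zip(X, y)]


def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of slacks, evaluated at the smallest feasible slacks."""
    return 0.5 * dot(w, w) + C * sum(slacks(w, b, X, y))


# Hypothetical data: the last two positive instances violate the margin.
X = [[1.0, 1.0], [-1.0, -1.0], [0.5, 0.0], [-1.0, -2.0]]
y = [1, -1, 1, 1]
w, b = [0.5, 0.5], 0.0

slack_values = slacks(w, b, X, y)
# A slack above 1 means the instance is on the wrong side of w.x + b = 0.
misclassified = [s > 1.0 for s in slack_values]
objective_small_c = soft_margin_objective(w, b, X, y, C=1.0)
objective_large_c = soft_margin_objective(w, b, X, y, C=10.0)
```

The same violations cost ten times more under the larger `C`, which is why a large `C` pushes the solution toward the hard margin behaviour.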

## Solving SVM
***

### Solving hard margin SVM
1. Rewrite the primal program for easier Lagrangian computation below:
    $$
    \begin{alignat}{2}
    \min \quad & \frac{1}{2} ww \\
    \text{s.t. } \quad & -(y_{i}(w x_{i} + b) - 1) \leq 0, \quad i = 1, \dots, n \\
    \end{alignat}
    $$
1. We can derive the Lagrangian from the primal program:
    $$
    \begin{alignat}{2}
    L(w, b, \alpha) & = f(w, b) + \sum_{i}^{n} \alpha_{i} h_{i}(w, b) \\
    & = \frac{1}{2} ww - \sum_{i}^{n} \alpha_{i}(y_{i}(w x_{i} + b) - 1) \\
    \end{alignat}
    $$
    where $\alpha_{i} \geq 0$ are new variables called Lagrange multipliers, one for each constraint.
1. Then we can write and solve Lagrangian dual function:
    $$ 
    \begin{alignat}{2}
    g(\alpha) & = \min_{w, b} L(w, b, \alpha) \\
    & = \min_{w, b} \frac{1}{2} ww - \sum_{i}^{n} \alpha_{i}(y_{i}(w x_{i} + b) - 1) \\
    \end{alignat}
    $$
    Taking the derivative of $L(w, b, \alpha)$ with respect to $w$ and setting it to zero:
    $$ 
    \begin{alignat}{2}
    \frac{\partial L}{\partial w} & = 0 \\
    w - \sum_{i}^{n} \alpha_{i}y_{i}x_{i} & = 0 \\
    w & = \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \\
    \end{alignat}
    $$
    Taking the derivative of $L(w, b, \alpha)$ with respect to $b$ and setting it to zero:
    $$
    \begin{alignat}{2}
    \frac{\partial L}{\partial b} & = 0 \\
    \sum_{i}^{n} \alpha_{i}y_{i} & = 0 \\
    \end{alignat}
    $$
    Substituting $w = \sum_{i}^{n} \alpha_{i}y_{i}x_{i}$ back into $g(\alpha)$:
    $$ 
    \begin{alignat}{2}
    g(\alpha) 
    & = \min_{w, b} \frac{1}{2} ww - \sum_{i}^{n} \alpha_{i}(y_{i}(w x_{i} + b) - 1) \\
    & = \frac{1}{2} \left( \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \right) \left( \sum_{j}^{n} \alpha_{j}y_{j}x_{j} \right) 
        - \sum_{i}^{n} \alpha_{i} \left( y_{i} \left( \left( \sum_{j}^{n} \alpha_{j}y_{j}x_{j} \right) x_{i} + b \right) - 1 \right) \\
    & = \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) 
        - \sum_{i}^{n} \alpha_{i}y_{i}\left( \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) x_{i} + b \right) + \sum_{i}^{n}\alpha_{i} \\
    & = \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) 
        - \sum_{i}^{n} \alpha_{i}y_{i} \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) x_{i} 
        - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\
    & = \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) 
        - \left( \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \right) \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) 
        - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\
    & = \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) 
        - \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j})
        - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\
    & = - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) 
        - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\
    \end{alignat}
    $$
    Since $\sum_{i}^{n} \alpha_{i}y_{i} = 0$, the term containing $b$ vanishes, and thus the final Lagrange dual function is:
    $$ g(\alpha) = \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) $$
1. The Lagrange dual problem is written as:
    $$
    \begin{alignat}{2}
    \max_{\alpha} \quad & \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) \\
    \text{s.t. } \quad & \alpha_{i} \geq 0, \quad i = 1, \dots, n \\
    \quad & \sum_{i}^{n} \alpha_{i}y_{i} = 0 \\
    \end{alignat}
    $$
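    Notice that $\sum_{i}^{n} \alpha_{i}y_{i} = 0$ is added as a constraint: the Lagrangian is linear in $b$, so without this condition the minimum over $b$ would be $-\infty$.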
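1. Since strong duality holds for both hard margin and soft margin SVM, the dual problem attains the same optimal value as the primal problem, and the primal solution can be recovered from $w = \sum_{i}^{n} \alpha_{i}y_{i}x_{i}$. The benefits of solving the dual problem are:
    1. The dual problem has only the $n$ variables $\alpha_{i}$ with simple constraints, whereas the primal problem has $d + 1$ variables ($w$ and $b$) and $n$ linear inequality constraints. The dual is therefore attractive when $d$ is large relative to $n$.
    1. The dual problem depends on the training instances only through the inner products $x_{i}x_{j}$, which allows the kernel trick to be applied; the primal problem in the form above does not.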
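
The dual can be solved by hand for a two-point dataset, which makes a convenient check of the derivation. In this sketch (data and helper names are hypothetical) the equality constraint forces $\alpha_{1} = \alpha_{2} = a$, the dual objective becomes $2a - 4a^{2}$, and its maximizer is $a = 1/4$. The code then recovers $w$ and $b$ from $\alpha$.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def dual_objective(alpha, X, y):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j (x_i . x_j)."""
    n = len(X)
    quad = sum(
        alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
        for i in range(n)
        for j in range(n)
    )
    return sum(alpha) - 0.5 * quad


def recover_primal(alpha, X, y, tol=1e-9):
    """Recover w = sum_i alpha_i y_i x_i, then b from any support vector."""
    d = len(X[0])
    w = [sum(alpha[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(d)]
    # A support vector (alpha_s > 0) satisfies y_s (w.x_s + b) = 1, so b = y_s - w.x_s.
    s = next(i for i, a in enumerate(alpha) if a > tol)
    b = y[s] - dot(w, X[s])
    return w, b


X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]  # hand-derived maximizer for this toy dataset

w, b = recover_primal(alpha, X, y)
dual_value = dual_objective(alpha, X, y)
primal_value = 0.5 * dot(w, w)
```

The dual and primal objective values coincide here, as strong duality predicts.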

### Solving soft margin SVM
1. Similar to hard margin SVM, we can write the Lagrangian dual function as:
    $$
    \begin{alignat}{2}
    g(\alpha, \beta) & = \min_{w, b, \xi} \frac{1}{2} ww + C\sum_{i}^{n}\xi_{i}
        - \sum_{i}^{n} \alpha_{i}\left( y_{i}(w x_{i} + b) - 1 + \xi_{i} \right) - \sum_{i}^{n}\beta_{i}\xi_{i} \\
    \end{alignat}
    $$
    where a new Lagrange multiplier $\beta_{i}$ is introduced for each constraint $\xi_{i} \geq 0$.
1. Similar to hard margin SVM, we can solve the Lagrangian dual function by taking the derivatives with respect to $w$, $b$, and $\xi_i$:
    $$ 
    \begin{alignat}{2}
    \frac{\partial L}{\partial w} = 0 & \Rightarrow w - \sum_{i}^{n} \alpha_{i}y_{i}x_{i} = 0 \Rightarrow w = \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \\
    \frac{\partial L}{\partial b} = 0 & \Rightarrow \sum_{i}^{n} \alpha_{i}y_{i} = 0 \\
    \frac{\partial L}{\partial \xi_{i}} = 0 & \Rightarrow C - \alpha_{i} - \beta_{i} = 0 \Rightarrow C = \alpha_{i} + \beta_{i} \\
    \end{alignat}
    $$
    and substitute $w = \sum_{i}^{n} \alpha_{i}y_{i}x_{i}$ and $C = \alpha_{i} + \beta_{i}$ back into $g(\alpha, \beta)$.
    $$ 
    \begin{alignat}{2}
    g(\alpha, \beta) 
    & = \min_{w, b, \xi} \frac{1}{2} ww + C\sum_{i}^{n}\xi_{i} - \sum_{i}^{n} \alpha_{i}\left( y_{i}(w x_{i} + b) - 1 + \xi_{i} \right) - \sum_{i}^{n}\beta_{i}\xi_{i} \\
    & = \frac{1}{2} \left( \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \right) \left( \sum_{j}^{n} \alpha_{j}y_{j}x_{j} \right) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i}
        - \sum_{i}^{n} \alpha_{i} \left( y_{i} \left( \left( \sum_{j}^{n} \alpha_{j}y_{j}x_{j} \right) x_{i} + b \right) - 1 + \xi_{i} \right) - \sum_{i}^{n} \beta_{i} \xi_{i} \\
    & = \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i}
        - \sum_{i}^{n} \alpha_{i}y_{i}\left( \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) x_{i} + b \right) 
        + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\
    & = \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i}
        - \sum_{i}^{n} \alpha_{i}y_{i} \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) x_{i} 
        - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\
    & = \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i}
        - \left( \sum_{i}^{n} \alpha_{i}y_{i}x_{i} \right) \left( \sum_{j}^{n} \alpha_{j} y_{j} x_{j} \right) 
        - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\
    & = \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i}
        - \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j})
        - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\
    & = - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n}(\alpha_{i} + \beta_{i})\xi_{i}
        - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\
    & = - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) + \sum_{i}^{n} \alpha_{i}\xi_{i} + \sum_{i}^{n} \beta_{i}\xi_{i}
        - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} - \sum_{i}^{n} \alpha_{i} \xi_{i} - \sum_{i}^{n} \beta_{i} \xi_{i} \\
    & = - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) - b\sum_{i}^{n} \alpha_{i}y_{i} + \sum_{i}^{n}\alpha_{i} \\
    & = \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) \\
    \end{alignat}
    $$ 
    which has exactly the same form as Lagrangian dual function of hard margin SVM. 
1. The Lagrange dual problem is written as:
    $$
    \begin{alignat}{2}
    \max_{\alpha, \beta} \quad & \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) \\
    \text{s.t. } \quad & \alpha_{i} \geq 0, \quad i = 1, \dots, n \\
    \quad & \beta_{i} \geq 0, \quad i = 1, \dots, n \\
    \quad & \alpha_{i} + \beta_{i} = C, \quad i = 1, \dots, n \\
    \quad & \sum_{i}^{n} \alpha_{i}y_{i} = 0 \\
    \end{alignat}
    $$
    Since $C = \alpha_{i} + \beta_{i} \Rightarrow \beta_{i} = C - \alpha_{i}$, the constraint $\beta_{i} \geq 0$ is equivalent to $\alpha_{i} \leq C$, so $\beta_{i}$ can be eliminated by merging this bound with $\alpha_{i} \geq 0$:
    $$
    \begin{alignat}{2}
    \max_{\alpha} \quad & \sum_{i}^{n}\alpha_{i} - \frac{1}{2} \sum_{i}^{n}\sum_{j}^{n} \alpha_{i}\alpha_{j}y_{i}y_{j}(x_{i}x_{j}) \\
    \text{s.t. } \quad & C \geq \alpha_{i} \geq 0, \quad i = 1, \dots, n \\
    \quad & \sum_{i}^{n} \alpha_{i}y_{i} = 0 \\
    \end{alignat}
    $$
    The only difference from the Lagrange dual problem of hard margin SVM is the additional upper bound $C \geq \alpha_{i}$.
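
The constraints of the soft margin dual are easy to check for a candidate $\alpha$. The sketch below also groups the multipliers into the three cases that follow from $C = \alpha_{i} + \beta_{i}$ and complementary slackness: $\alpha_{i} = 0$ (not a support vector), $0 < \alpha_{i} < C$ (on the margin with $\xi_{i} = 0$), and $\alpha_{i} = C$ (slack may be nonzero). The function names, labels, and numbers are hypothetical.

```python
def is_dual_feasible(alpha, y, C, tol=1e-9):
    """Check 0 <= alpha_i <= C for all i and sum_i alpha_i y_i = 0."""
    in_box = all(-tol <= a <= C + tol for a in alpha)
    balanced = abs(sum(a * y_i for a, y_i in zip(alpha, y))) <= tol
    return in_box and balanced


def categorize(alpha, C, tol=1e-9):
    """Label each multiplier by where it sits in the interval [0, C]."""
    labels = []
    for a in alpha:
        if a <= tol:
            labels.append("non-support")  # alpha_i = 0
        elif a >= C - tol:
            labels.append("bound")        # alpha_i = C, slack may be positive
        else:
            labels.append("free")         # 0 < alpha_i < C, on the margin
    return labels


# Hypothetical multipliers for four instances with C = 1.
y = [1, 1, -1, 1]
alpha = [0.0, 0.4, 1.0, 0.6]

feasible = is_dual_feasible(alpha, y, C=1.0)
kinds = categorize(alpha, C=1.0)
# The same multipliers violate the box constraint under a smaller C.
feasible_small_c = is_dual_feasible(alpha, y, C=0.5)
```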

## Kernel trick
***

TODO

## Reference
***

1. https://shuzhanfan.github.io/2018/05/understanding-mathematics-behind-support-vector-machines/
2. https://cse.iitkgp.ac.in/~dsamanta/courses/da/resources/slides/10SupportVectorMachine.pdf