# SVM

When the training examples are linearly separable, we can maximize the margin by minimizing the regularization term

$$
 \min_{\mathbf{w}} \frac{1}{2} ||\mathbf{w}||^2 = \frac{1}{2} \sum_{i=1}^d w^2_i
$$

subject to the classification constraint, which requires that every example is classified correctly and that the closest points keep a minimum margin from the hyperplane.

$$
s.t. \quad y_i [\mathbf{x}_i^T \mathbf{w}] - 1 \geq 0, i=1,...,n
$$

The solution depends only on a subset of the examples, the support vectors, for which the constraint is active ($y_i [\mathbf{x}^T_i \mathbf{w}] - 1 = 0$).
The support vectors are the datapoints $ \mathbf{x}_i $ which lie on the margin. 
The margin is thus defined by these support vectors.
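
The support-vector condition can be checked numerically. The following is a minimal sketch on a hypothetical four-point toy dataset; the data and the weight vector `w = (0.5, 0.5)` are assumptions chosen for illustration (this `w` satisfies every constraint for this data).

```python
# Toy data (hypothetical): X holds the points x_i, y holds labels in {-1, +1}.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]

# Assumed weight vector; it satisfies y_i * x_i^T w >= 1 for all points above.
w = (0.5, 0.5)


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Value of y_i * x_i^T w for every training point.
margins = [yi * dot(xi, w) for xi, yi in zip(X, y)]

# Support vectors are the points whose constraint is active: y_i * x_i^T w - 1 = 0.
support = [i for i, m in enumerate(margins) if abs(m - 1.0) < 1e-9]

print(margins)
print(support)
```

Only the points listed in `support` touch the margin; the others could be moved slightly without changing which constraints are active.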

When the training examples are not linearly separable, we add a penalty for violating the classification constraint

$$
\min_{\mathbf{w}, \xi} \frac{1}{2} ||\mathbf{w}||^2 + C\sum_{i=1}^n \xi_i
$$

subject to the relaxed constraint

$$
s.t. \quad y_i [\mathbf{x}^T_i \mathbf{w}] - 1 + \xi_i \geq 0, \quad \xi_i \geq 0, \quad i=1,...,n
$$

The $\xi_i$ are called slack variables. Each $\xi_i$ acts as a shift that would move the point $\mathbf{x}_i$ back to the correct side of the margin.
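
As an illustration, the smallest non-negative slack that satisfies the relaxed constraint for a given $\mathbf{w}$ can be computed per point. This is a sketch under assumptions: the toy data, the weight vector, and the helper name `slack` are hypothetical.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def slack(x, label, w):
    """Smallest xi >= 0 such that label * x^T w - 1 + xi >= 0."""
    return max(0.0, 1.0 - label * dot(x, w))


# Hypothetical data: the second point lies inside the margin,
# the third point is on the wrong side of the hyperplane.
X = [(1.0, 1.0), (0.5, 0.5), (1.0, 1.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
w = (0.5, 0.5)  # assumed weight vector

slacks = [slack(xi, yi, w) for xi, yi in zip(X, y)]
print(slacks)
```

Points that already satisfy the original constraint need no slack, so they contribute nothing to the penalty term.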

We can rewrite the non-separable case as

$$
C \sum_{i=1}^n (1 - y_i[\mathbf{x}_i^T \mathbf{w}])^+ + \frac{1}{2} ||\mathbf{w}||^2
$$

where $z^+ = z$ if $z \geq 0$ else $0$.
This is equivalent to regularized empirical loss minimization, with the same squared-norm penalty that ridge regression uses.

$$
\underbrace{\frac{1}{n} \sum_{i=1}^n (1 - y_i [\mathbf{x}_i^T \mathbf{w}])^+}_{R_{emp} \ (\text{Loss function})} + \lambda ||\mathbf{w}||^2, \quad \lambda = \frac{1}{2nC}
$$
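
The relation $\lambda = \frac{1}{2nC}$ can be verified numerically: dividing the $C$-form objective by $nC$ yields the regularized empirical loss. The sketch below assumes hypothetical toy data and function names.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def hinge(z):
    """(1 - z)^+ : positive part of 1 - z."""
    return max(0.0, 1.0 - z)


def soft_margin_objective(w, X, y, C):
    """C * sum_i (1 - y_i x_i^T w)^+ + 0.5 * ||w||^2."""
    return C * sum(hinge(yi * dot(xi, w)) for xi, yi in zip(X, y)) + 0.5 * dot(w, w)


def regularized_risk(w, X, y, lam):
    """(1/n) * sum_i (1 - y_i x_i^T w)^+ + lam * ||w||^2."""
    n = len(X)
    return sum(hinge(yi * dot(xi, w)) for xi, yi in zip(X, y)) / n + lam * dot(w, w)


# Hypothetical data, weight vector and penalty constant.
X = [(1.0, 1.0), (0.5, 0.5), (1.0, 1.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
w = (0.5, 0.5)
C = 2.0
n = len(X)
lam = 1.0 / (2.0 * n * C)

lhs = soft_margin_objective(w, X, y, C)
rhs = n * C * regularized_risk(w, X, y, lam)
print(lhs, rhs, math.isclose(lhs, rhs))
```

Because the two objectives differ only by the positive factor $nC$, they have the same minimizer.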

The SVM is also very similar to logistic regression (LOGREG).

$$
\begin{align*}
    \text{SVM} &: \frac{1}{n} \sum_{i=1}^n (1 - y_i [\mathbf{x}_i^T \mathbf{w}])^+ + \lambda ||\mathbf{w}||^2 \\
    \text{LOGREG} &: \frac{1}{n} \sum_{i=1}^n - \log \underbrace{\sigma(y_i [\mathbf{x}_i^T \mathbf{w}])}_{\mathbb{P}(y_i | \mathbf{x}_i, \mathbf{w})} + \lambda ||\mathbf{w}||^2
\end{align*}
$$

where $\sigma(z) = (1 + e^{-z})^{-1}$ is the logistic function.

They differ only in the loss function applied to $z = y_i [\mathbf{x}_i^T \mathbf{w}]$.
While the SVM uses the hinge loss $(1 - z)^+$, LOGREG uses the logistic loss $-\log \sigma(z) = \log(1 + \exp(-z))$.
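
To make the difference concrete, both losses can be evaluated at a few values of $z$. This is a small sketch; the function names are chosen here for illustration.

```python
import math


def hinge_loss(z):
    """SVM loss (1 - z)^+."""
    return max(0.0, 1.0 - z)


def logistic_loss(z):
    """LOGREG loss log(1 + exp(-z)) = -log(sigma(z))."""
    return math.log1p(math.exp(-z))


for z in (-2.0, 0.0, 1.0, 3.0):
    print(z, hinge_loss(z), round(logistic_loss(z), 4))
```

The hinge loss is exactly zero once $z \geq 1$, whereas the logistic loss stays strictly positive and only approaches zero as $z$ grows.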

# Solution to SVM

We want to solve

$$
\min_{\mathbf{w}} \ \frac{1}{2} ||\mathbf{w}||^2 \quad s.t. \quad y_i[\mathbf{x}_i^T \mathbf{w}] - 1 \geq 0, i = 1, ..., n
$$

We rewrite each constraint with the help of a Lagrange multiplier $\alpha_i \geq 0$ as

$$
\sup_{\alpha_i \geq 0} \alpha_i (1 - y_i [\mathbf{x}_i^T \mathbf{w}]) = 
\begin{cases}
    0, & \text{if} \ y_i [\mathbf{x}_i^T \mathbf{w}] - 1 \geq 0 \\
    \infty, & \text{otherwise}
\end{cases}
$$

Adding these terms to the objective turns the constrained minimization problem into an unconstrained one, which we can write in terms of the Lagrangian function

$$
\min_{\mathbf{w}} \frac{1}{2} ||\mathbf{w}||^2 + \sum_{i=1}^n \sup_{\alpha_i \geq 0} \alpha_i(1 - y_i[\mathbf{x}^T_i \mathbf{w}]) \\
\Rightarrow 
\min_{\mathbf{w}} \sup_{\alpha_i \geq 0} \left( \frac{1}{2} ||\mathbf{w}||^2 + \sum_{i=1}^n \alpha_i(1 - y_i[\mathbf{x}^T_i \mathbf{w}]) \right)
$$

Because the problem is convex and Slater's condition holds, we can swap the min and the max

$$
\max_{\alpha \geq 0} \left( \min_{\mathbf{w}} \left( \frac{1}{2} ||\mathbf{w}||^2 + \sum_{i=1}^n \alpha_i(1 - y_i[\mathbf{x}^T_i \mathbf{w}]) \right) \right)
$$

Because the inner term (:= $J(\mathbf{w})$) is now a convex function of $\mathbf{w}$, we can take its gradient and set it to zero, which gives us

$$
\frac{\partial J(\mathbf{w})}{\partial \mathbf{w}} = \mathbf{w} - \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \overset{!}{=} 0 \\
\Rightarrow \mathbf{\hat{w}} =  \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i
$$

Plugging this back into our problem gives us

$$
\max_{\alpha_i \geq 0} \left( \frac{1}{2} ||\hat{\mathbf{w}}||^2 + \sum_{i=1}^n \alpha_i(1 - y_i[\mathbf{x}^T_i \hat{\mathbf{w}}]) \right) \\
= \max_{\alpha_i \geq 0} \left( \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j \mathbf{x}_i^T \mathbf{x}_j \right)
$$
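
The step between the two expressions can be spelled out using only $\hat{\mathbf{w}} = \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i$. Both the squared norm and the constraint term reduce to the same double sum over $i$ and $j$:

$$
\begin{align*}
    ||\hat{\mathbf{w}}||^2 &= \Big( \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i \Big)^T \Big( \sum_{j=1}^n \alpha_j y_j \mathbf{x}_j \Big) = \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j \mathbf{x}_i^T \mathbf{x}_j \\
    \sum_{i=1}^n \alpha_i y_i [\mathbf{x}_i^T \hat{\mathbf{w}}] &= \sum_{i=1}^n \alpha_i y_i \mathbf{x}_i^T \sum_{j=1}^n \alpha_j y_j \mathbf{x}_j = \sum_{i=1}^n \sum_{j=1}^n y_i y_j \alpha_i \alpha_j \mathbf{x}_i^T \mathbf{x}_j
\end{align*}
$$

Writing $S$ for this double sum, the objective becomes $\frac{1}{2} S + \sum_{i=1}^n \alpha_i - S = \sum_{i=1}^n \alpha_i - \frac{1}{2} S$.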

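Here the objective is a concave function of $\alpha$, so every local maximum is also a global one. 
Because of the constraints $\alpha_i \geq 0$, setting the derivative to zero is only valid for the multipliers that end up strictly positive; in general this is a quadratic program that is solved numerically.
This maximization problem is the so-called dual. 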
Only $\hat{\alpha}_i$'s corresponding to the support vectors will be non-zero.
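
One simple way to handle the constraint $\alpha_i \geq 0$ numerically is projected gradient ascent: take a gradient step on the dual objective and clip negative multipliers to zero. This technique is swapped in here purely for illustration and is not prescribed by the derivation above; the toy data, step size, and iteration count are assumptions.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


# Hypothetical linearly separable toy data.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
n = len(X)

# Q[i][j] = y_i * y_j * x_i^T x_j, the matrix of the quadratic term in the dual.
Q = [[y[i] * y[j] * dot(X[i], X[j]) for j in range(n)] for i in range(n)]


def dual_objective(alpha):
    """sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j Q_ij."""
    quad = sum(alpha[i] * alpha[j] * Q[i][j] for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad


alpha = [0.0] * n
step = 0.05  # assumed step size, small enough for this toy problem
for _ in range(1000):
    # Gradient of the dual objective: 1 - Q alpha.
    grad = [1.0 - dot(Q[i], alpha) for i in range(n)]
    # Ascent step followed by projection onto alpha_i >= 0.
    alpha = [max(0.0, a + step * g) for a, g in zip(alpha, grad)]

# Recover the primal solution w = sum_i alpha_i y_i x_i.
w_hat = [sum(alpha[i] * y[i] * X[i][k] for i in range(n)) for k in range(2)]

print([round(a, 4) for a in alpha])
print([round(v, 4) for v in w_hat])
```

On this data only the first and third multipliers stay positive, matching the statement that non-zero $\hat{\alpha}_i$ belong to support vectors.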

To make a prediction for a new point $\mathbf{x}$ we compute

$$
\text{sign}(\mathbf{x}^T \hat{\mathbf{w}}) = \text{sign}\Big(\mathbf{x}^T \sum_{i=1}^n \hat{\alpha}_i y_i \mathbf{x}_i\Big) = \text{sign}\Big(\sum_{i \in SV} \hat{\alpha}_i y_i \mathbf{x}_i^T \mathbf{x}\Big)
$$

The value of the function thus depends on the input vectors only via the dot product of the new data point with each of the support vectors.
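
The equivalence of the two prediction rules can be checked on a small example. The sketch below assumes hypothetical toy data and multipliers `alpha_hat` (with two non-zero entries), and compares the prediction through $\hat{\mathbf{w}}$ with the one that only uses dot products with the support vectors.

```python
def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def sign(v):
    """Return +1 for non-negative values and -1 otherwise."""
    return 1 if v >= 0 else -1


# Hypothetical training data and assumed dual multipliers.
X = [(1.0, 1.0), (2.0, 2.0), (-1.0, -1.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]
alpha_hat = [0.25, 0.0, 0.25, 0.0]

# Indices of the support vectors: multipliers that are non-zero.
sv = [i for i, a in enumerate(alpha_hat) if a > 0]

# Primal weight vector w = sum_i alpha_i y_i x_i.
w_hat = [sum(alpha_hat[i] * y[i] * X[i][k] for i in range(len(X))) for k in range(2)]


def predict_primal(x):
    """Prediction through the explicit weight vector."""
    return sign(dot(x, w_hat))


def predict_dual(x):
    """Prediction using only dot products with the support vectors."""
    return sign(sum(alpha_hat[i] * y[i] * dot(X[i], x) for i in sv))


new_points = [(3.0, -1.0), (-1.0, 0.0)]
primal = [predict_primal(x) for x in new_points]
dual = [predict_dual(x) for x in new_points]
print(primal, dual)
```

The dual form never needs $\hat{\mathbf{w}}$ explicitly, only dot products between the new point and the support vectors.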

# Kernel function

A kernel function is a real-valued function of two arguments, $k(\mathbf{x}, \mathbf{x}') \in \mathbb{R}$ for $ \mathbf{x}, \mathbf{x}' \in \mathcal{X} $, where $ \mathcal{X}$ is the input space.
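
As a minimal sketch of this definition, two well-known kernels are shown below: the linear kernel, which is just the dot product used in the SVM prediction, and the Gaussian (RBF) kernel. The function names and the `gamma` default are assumptions for illustration, not taken from the text above.

```python
import math


def linear_kernel(x, x_prime):
    """k(x, x') = x^T x'."""
    return sum(a * b for a, b in zip(x, x_prime))


def rbf_kernel(x, x_prime, gamma=0.5):
    """k(x, x') = exp(-gamma * ||x - x'||^2); gamma is an assumed default."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, x_prime))
    return math.exp(-gamma * sq_dist)


def gram_matrix(kernel, points):
    """Matrix of all pairwise kernel values K[i][j] = k(x_i, x_j)."""
    return [[kernel(p, q) for q in points] for p in points]


points = [(1.0, 2.0), (3.0, 4.0), (0.0, -1.0)]
K = gram_matrix(linear_kernel, points)
print(K)
print(rbf_kernel(points[0], points[1]))
```

Both functions take two points from the input space and return a single real number, as the definition requires.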