# SVM

Given training vectors $x_i \in \mathbb{R}^p$, $i=1, ..., n$, in two classes, and a label vector $y \in \{1, -1\}^n$, our goal is to find $w \in \mathbb{R}^p$ and $b \in \mathbb{R}$ such that the prediction given by $\text{sign} (w^T\phi(x) + b)$ is correct for most samples. Here $\phi$ is a feature map (the identity in the linear case).

SVC solves the following **primal problem**:

$$
\min_ {w, b, \zeta} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i
$$

$$
\text{subject to }  y_i (w^T \phi (x_i) + b) \geq 1 - \zeta_i,\\  \zeta_i \geq 0, i=1, ..., n
$$

Intuitively, we're trying to maximize the margin (by minimizing $||w||^2 = w^Tw$), while incurring a penalty when a sample is misclassified or within the margin boundary.

Ideally, the value $y_i(w^T \phi (x_i) + b)$ would be $\geq 1$ for all samples, which indicates a perfect prediction. 

But problems are usually not perfectly separable with a hyperplane, so we allow some samples to be at a distance $\zeta_i$ from their correct margin boundary.

The parameter $C$ controls the strength of this penalty and, as a result, acts as an inverse regularization parameter: a larger $C$ penalizes margin violations more heavily, which corresponds to weaker regularization.
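To make the slack variables and the role of $C$ concrete, here is a minimal pure-Python sketch that evaluates the primal objective for a fixed, hand-picked $(w, b)$. It assumes the identity feature map $\phi(x) = x$; the function names and the toy data are illustrative only, and the chosen $(w, b)$ is not claimed to be the optimum.

```python
def decision_values(w, b, X):
    """Return w^T x + b for each row x (identity feature map assumed)."""
    return [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X]


def slacks(w, b, X, y):
    """Smallest zeta_i satisfying both constraints: max(0, 1 - y_i (w^T x_i + b))."""
    return [max(0.0, 1.0 - yi * fi) for yi, fi in zip(y, decision_values(w, b, X))]


def primal_objective(w, b, X, y, C):
    """Evaluate 0.5 * w^T w + C * sum_i zeta_i."""
    return 0.5 * sum(wj * wj for wj in w) + C * sum(slacks(w, b, X, y))


# Hypothetical toy data: the last sample sits on the wrong side of the hyperplane.
X = [[2.0, 0.0], [0.5, 0.0], [-2.0, 0.0], [0.5, 1.0]]
y = [1, 1, -1, -1]
w, b = [1.0, 0.0], 0.0  # hand-picked, not optimized

zeta = slacks(w, b, X, y)
print(zeta)                                  # [0.0, 0.5, 0.0, 1.5]
print(primal_objective(w, b, X, y, C=1.0))   # 2.5
print(primal_objective(w, b, X, y, C=10.0))  # 20.5
```

The second sample is classified correctly but lies inside the margin ($0 < \zeta_i < 1$), while the last one is misclassified ($\zeta_i > 1$). Raising $C$ leaves the margin term unchanged and scales only the cost of these violations.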

The **dual problem** to the primal is


$$
\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha
$$
$$
   \textrm {subject to }  y^T \alpha = 0\\
    0 \leq \alpha_i \leq C, i=1, ..., n
$$

where $e$ is the vector of all ones, $Q$ is an $n$ by $n$ positive semidefinite matrix with $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, and $K(x_i, x_j) = \phi (x_i)^T \phi (x_j)$ is the kernel.

The terms $\alpha_i$ are dual coefficients, upper-bounded by $C$.

This dual representation highlights the fact that training vectors are implicitly mapped into a higher (possibly infinite) dimensional space by the function $\phi$: see [kernel trick](https://en.wikipedia.org/wiki/Kernel_method).
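As an illustration of the dual quantities, the following sketch builds $Q$ from an RBF kernel with NumPy and evaluates the dual objective at one feasible $\alpha$. The RBF kernel, the value of `gamma`, the toy data, and the particular `alpha` are all assumptions chosen for the example. Nothing is optimized here; the code only shows how the pieces fit together without ever computing $\phi$ explicitly.

```python
import numpy as np


def rbf_kernel(A, B, gamma=0.5):
    """K(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)


# Hypothetical toy data with labels in {1, -1}.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1.0

K = rbf_kernel(X, X)
Q = (y[:, None] * y[None, :]) * K  # Q_ij = y_i y_j K(x_i, x_j)


def dual_objective(alpha):
    """Evaluate 0.5 * alpha^T Q alpha - e^T alpha."""
    e = np.ones_like(alpha)
    return 0.5 * alpha @ Q @ alpha - e @ alpha


# A feasible (not necessarily optimal) point: y^T alpha = 0 and 0 <= alpha_i <= C.
alpha = np.full(4, 0.5)
min_eig = np.linalg.eigvalsh(Q).min()  # non-negative up to rounding, since Q is PSD

print("y^T alpha =", y @ alpha)
print("smallest eigenvalue of Q =", min_eig)
print("dual objective =", dual_objective(alpha))
```

Only kernel evaluations $K(x_i, x_j)$ appear in the computation, which is what lets the feature space be high or even infinite dimensional.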


Once the optimization problem is solved, the predicted class for a given sample $x$ is the sign of the decision function output:

$$\text{sign}\left[\sum_{i\in SV} y_i \alpha_i K(x_i, x) + b\right]$$


We only need to sum over the support vectors (i.e. the samples that lie on or within the margin) because the dual coefficients $\alpha_i$ are zero for the other samples.
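As a sketch of why the decision function takes this form, here is the standard Lagrangian argument. The multipliers $\mu_i \geq 0$ for the constraints $\zeta_i \geq 0$ are notation introduced here, not used elsewhere in these notes. Setting the derivatives of the primal Lagrangian with respect to $w$, $b$ and $\zeta_i$ to zero gives

$$
w = \sum_{i=1}^{n} \alpha_i y_i \phi(x_i), \qquad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad \alpha_i + \mu_i = C
$$

Substituting the first condition into $w^T \phi(x) + b$ yields

$$
w^T \phi(x) + b = \sum_{i=1}^{n} y_i \alpha_i \phi(x_i)^T \phi(x) + b = \sum_{i\in SV} y_i \alpha_i K(x_i, x) + b
$$

where the last step drops the terms with $\alpha_i = 0$. The second condition is the dual constraint $y^T \alpha = 0$, and the third, together with $\mu_i \geq 0$, gives the upper bound $\alpha_i \leq C$.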

These parameters can be accessed through the attributes ``dual_coef_``, which holds the product $y_i \alpha_i$, ``support_vectors_``, which holds the support vectors, and ``intercept_``, which holds the independent term $b$.
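The following sketch, assuming scikit-learn's `SVC` with an RBF kernel, rebuilds the decision function by hand from these three attributes and compares it with `decision_function`. The synthetic dataset and the values of `C` and `gamma` are arbitrary choices for illustration. Note that `make_classification` produces labels in $\{0, 1\}$; scikit-learn maps them internally, with `classes_[1]` playing the role of $y = +1$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Synthetic binary problem (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
C, gamma = 1.0, 0.1
clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)

X_new = X[:10]

# K(x_i, x) between every support vector x_i and every query point x.
K = rbf_kernel(clf.support_vectors_, X_new, gamma=gamma)  # shape (n_SV, 10)

# sum_{i in SV} (y_i alpha_i) K(x_i, x) + b
manual = (clf.dual_coef_ @ K + clf.intercept_).ravel()

# The sign of the decision value selects the class.
manual_pred = np.where(manual > 0, clf.classes_[1], clf.classes_[0])

print(np.allclose(manual, clf.decision_function(X_new)))
print(np.array_equal(manual_pred, clf.predict(X_new)))

# The dual constraints are visible too: |y_i alpha_i| <= C and sum_i y_i alpha_i = 0.
print(np.abs(clf.dual_coef_).max() <= C + 1e-8)
print(abs(clf.dual_coef_.sum()) < 1e-6)
```

Only the rows in `support_vectors_` enter the sum, so prediction cost scales with the number of support vectors rather than with the size of the training set.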
