# Linear Soft Margin SVM
SVM is a classification algorithm that aims at finding the decision surface which maximizes the margin between itself and the class samples. <br>
In the context of SVM, we define *margin* as the distance between the hyperplane and the sample which is the closest to it. <br>
The linear function, wrt $\mathbf{x}$:
$$
f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b
$$
represents the equation of the hyperplane, if put equal to zero:
$$
f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b = 0
$$
Recalling that the perpendicular distance of a point $\mathbf{x}$ from a hyperplane $f(\mathbf{x}) = 0$ is $\mid f(\mathbf{x}) \mid / \|\mathbf{w}\|$, we can write:
$$
d(\mathbf{x}_i) = \frac{\mid \mathbf{w}^T \mathbf{x}_i + b \mid}{\|\mathbf{w}\|}
$$
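As a quick numerical check of the distance formula, here is a minimal sketch in plain Python. The helper name `distance_to_hyperplane` and the values chosen for `w`, `b` and the sample point are illustrative assumptions, not part of the derivation.

```python
import math

def distance_to_hyperplane(x, w, b):
    """Perpendicular distance of point x from the hyperplane w^T x + b = 0."""
    f = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b  # f(x) = w^T x + b
    norm_w = math.sqrt(sum(w_j ** 2 for w_j in w))    # ||w||
    return abs(f) / norm_w

# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0
w, b = [3.0, 4.0], -5.0

d_far = distance_to_hyperplane([3.0, 4.0], w, b)  # f = 20, ||w|| = 5
d_on = distance_to_hyperplane([1.0, 0.5], w, b)   # this point lies on the hyperplane
```

Here `d_far` evaluates to `4.0` and `d_on` to `0.0`, as expected from $\mid f(\mathbf{x}) \mid / \|\mathbf{w}\|$.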
where $\mathbf{w}$ is the weight vector, which is orthogonal to the hyperplane. <br>
We can introduce the same substitution for the classes we used when computing the average Loss Function for L.R.:
$$
z_i = 2c_i - 1 \implies
\begin{cases}
    z_i = 2 \cdot 0 - 1 = -1 & \text{if } c_i = 0 \\
    z_i = 2 \cdot 1 - 1 = 1  & \text{if } c_i = 1
\end{cases}
$$
Since the numerator is an absolute value and $\mid z_i \mid = 1$, the distance doesn't change if we write:
$$
d(\mathbf{x}_i) = \frac{\mid z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \mid}{\|\mathbf{w}\|}
$$
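The label substitution and its effect on the distance can be checked with a small sketch. The names `to_signed_label` and `f`, as well as the parameters and samples, are hypothetical and chosen only for illustration.

```python
import math

def to_signed_label(c):
    """Map a class label c in {0, 1} to z in {-1, +1} via z = 2c - 1."""
    return 2 * c - 1

def f(x, w, b):
    """Linear function f(x) = w^T x + b."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b

# Hypothetical parameters and labelled samples (x_i, c_i)
w, b = [1.0, -1.0], 0.5
samples = [([2.0, 0.0], 1), ([0.0, 2.0], 0)]
norm_w = math.sqrt(sum(w_j ** 2 for w_j in w))

results = []
for x, c in samples:
    z = to_signed_label(c)
    d_plain = abs(f(x, w, b)) / norm_w       # distance without the label
    d_signed = abs(z * f(x, w, b)) / norm_w  # distance with z_i inside |.|
    results.append((z, d_plain, d_signed))
```

Because $z_i$ is either $-1$ or $+1$, the two distances stored in each tuple of `results` coincide.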
If we just want to consider solutions which **correctly classify all samples**, we can thus maximize, wrt the model parameters $(\mathbf{w}, b)$, the objective function:
$$
\operatorname*{argmax}_{\mathbf{w}, b} \left\{ \operatorname*{min}_{i} \left\{\frac{\mid z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \mid}{\|\mathbf{w}\|} \right\} \right\} \\[1em]
\text{subject to: } z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) > 0
$$
So, in other words, we want to find the hyperplane, described by $(\mathbf{w}, b)$, which has the maximum distance to its closest point, i.e. the largest minimum distance from the samples. The constraint ensures that we only select hyperplanes which classify every sample correctly. <br>
We can write this in a more compact form, and meanwhile also drop the constraint and the absolute value at the numerator, observing that all solutions which correctly classify all samples meet the constraint for each sample $\mathbf{x}_i$ and so for them we will always have $\operatorname*{min}_{i} \left\{ z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \right\} > 0$, so:
$$
\operatorname*{argmax}_{\mathbf{w}, b} \left\{ \frac{1}{\|\mathbf{w}\|} \operatorname*{min}_{i} \left\{ z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \right\} \right\}
$$
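The quantity being maximized can be evaluated directly for any candidate hyperplane. The sketch below assumes a tiny hand-made dataset and two hypothetical candidates; `geometric_margin` is an illustrative helper name.

```python
import math

def geometric_margin(X, z, w, b):
    """(1 / ||w||) * min_i z_i (w^T x_i + b).

    The value is positive only if every sample is correctly classified.
    """
    norm_w = math.sqrt(sum(w_j ** 2 for w_j in w))
    functional = [
        z_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        for x_i, z_i in zip(X, z)
    ]
    return min(functional) / norm_w

# Hypothetical samples with labels z_i in {-1, +1}
X = [[2.0, 2.0], [3.0, 3.0], [0.0, 0.0], [-1.0, 0.0]]
z = [1, 1, -1, -1]

margin_a = geometric_margin(X, z, [1.0, 1.0], -2.0)  # candidate A
margin_b = geometric_margin(X, z, [1.0, 0.0], -1.0)  # candidate B
```

Both candidates separate the samples, but candidate A leaves a larger distance to its closest sample (`margin_a > margin_b`), so it is preferred by the objective.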
Now, we can exploit the property of this objective function being **invariant to rescaling**, i.e. given the rescaling factor $\phi > 0$ we know that both functions:
$$
\frac{1}{\|\mathbf{w}\|} \operatorname*{min}_{i} \left\{ z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \right\} \\[1em]
\frac{1}{\|\phi \mathbf{w}\|} \operatorname*{min}_{i} \left\{ z_i \left( \phi \mathbf{w}^T \mathbf{x}_i + \phi b \right) \right\} \\[1em]
$$
have the same value, so if $(\mathbf{w}^*, b^*)$ is an optimal solution, then $(\phi \mathbf{w}^*, \phi b^*)$ is optimal too. In other words, the collection of parameters $(\phi \mathbf{w}^*, \phi b^*)\mid_{\phi > 0}$ forms an **equivalence class** of equivalent solutions. <br>
Since we can choose any solution among them, to simplify the objective function we choose the one whose scale makes the smallest value of $z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right)$ equal to 1, i.e. the one for which we have:
$$
\operatorname*{min}_{i} \left\{ z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \right\} = 1
$$
The objective function thus becomes just:
$$
\frac{1}{\|\mathbf{w}\|} \\[1em]
\text{subject to: } 
\begin{cases}
\operatorname*{min}_{i} \left\{ z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \right\} = 1 \\
z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \geq 1 \space \forall i
\end{cases}
$$
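The rescaling argument can be verified numerically. The following sketch uses a hypothetical two-sample dataset and an arbitrary factor `phi`; the helper names are assumptions made for the example.

```python
import math

def norm(w):
    return math.sqrt(sum(w_j ** 2 for w_j in w))

def functional_margins(X, z, w, b):
    """List of z_i (w^T x_i + b) for every sample."""
    return [
        z_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b)
        for x_i, z_i in zip(X, z)
    ]

# Hypothetical separable samples and a separating hyperplane
X = [[2.0, 2.0], [0.0, 0.0]]
z = [1, -1]
w, b = [1.0, 1.0], -2.0

margin = min(functional_margins(X, z, w, b)) / norm(w)

# Rescaling (w, b) by any phi > 0 leaves the objective unchanged
phi = 7.5
w_s, b_s = [phi * w_j for w_j in w], phi * b
margin_scaled = min(functional_margins(X, z, w_s, b_s)) / norm(w_s)

# Canonical representative: divide by the smallest functional margin
# (assumed positive, i.e. all samples are correctly classified)
k = min(functional_margins(X, z, w, b))
w_c, b_c = [w_j / k for w_j in w], b / k
canonical_min = min(functional_margins(X, z, w_c, b_c))  # equals 1
margin_canonical = 1.0 / norm(w_c)                       # objective 1 / ||w||
```

All three margins agree, and for the canonical representative the objective reduces to $1 / \|\mathbf{w}\|$.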
Then, in order to link the SVM objective to the L.R. one, we can apply some transformations which don't change the optimal solution: maximizing $1 / \|\mathbf{w}\|$ is equivalent to minimizing $\|\mathbf{w}\|$, and hence to minimizing $\frac{1}{2} \|\mathbf{w}\|^2$. We can also drop the first constraint: since we're now **minimizing** the objective function (because we put $\mathbf{w}$ at the numerator thanks to these transformations), at the optimum at least one sample satisfies $z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) = 1$, otherwise we could shrink $\mathbf{w}$ and $b$ further without violating the second constraint. So we can write the **Primal Formulation of the Hard-Margin SVM Problem**:
$$
\operatorname*{argmin}_{\mathbf{w}, b} \frac{1}{2} \|\mathbf{w}\|^2 \\[1em]
\text{subject to: } z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \geq 1 \space \forall i
$$
It's called **Hard Margin** because it is based on the assumption that the optimal solution correctly classifies all points (this is expressed by the constraint). <br>
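To make the hard-margin primal concrete, the sketch below checks feasibility and evaluates the objective $\frac{1}{2} \|\mathbf{w}\|^2$ for a few hand-picked candidates. It does not solve the optimization problem; the dataset, the candidates and the helper names are illustrative assumptions.

```python
def is_feasible(X, z, w, b, tol=1e-12):
    """True if z_i (w^T x_i + b) >= 1 holds for every sample."""
    return all(
        z_i * (sum(w_j * x_j for w_j, x_j in zip(w, x_i)) + b) >= 1 - tol
        for x_i, z_i in zip(X, z)
    )

def objective(w):
    """Hard-margin primal objective (1/2) ||w||^2."""
    return 0.5 * sum(w_j ** 2 for w_j in w)

# Hypothetical separable samples with labels z_i in {-1, +1}
X = [[2.0, 2.0], [0.0, 0.0]]
z = [1, -1]

# Hypothetical candidates (w, b), all describing the same hyperplane
candidates = {
    "tight": ([0.5, 0.5], -1.0),       # constraints hold with equality
    "loose": ([1.0, 1.0], -2.0),       # feasible, but ||w|| is larger
    "too_small": ([0.25, 0.25], -0.5), # violates the constraints
}

report = {
    name: (is_feasible(X, z, w, b), objective(w))
    for name, (w, b) in candidates.items()
}
```

Among the feasible candidates, `"tight"` has the smallest objective; `"too_small"` has an even smaller objective but is rejected because it breaks the constraint.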

Then, the steps to recover the **Soft Margin** version of the primal formulation of the problem are pretty quick.<br>
If classes are **not linearly separable**, we won't be able to find a solution which satisfies the primal constraint $z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \geq 1 \space \forall i$. We can make a trade-off and accept some samples which sit **inside the margin region** (for them we would have $0 \leq z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \lt 1$) or are even **misclassified** (i.e. they end up on the wrong side of the decision boundary, so for them we would have $z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \lt 0$), while trying to achieve the largest margin among all the correctly classified samples. In practice, we introduce the so-called **slack variables** $\xi_i$ in the primal constraint: 
$$
z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \geq 1 - \xi_i \\
\xi_i \geq 0
$$
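Keep in mind that $\xi_i \geq 0$ is an additional primal constraint related to the Soft Margin formulation of the problem. So, taking each $\xi_i$ as small as the two constraints allow, i.e. $\xi_i = \max \left( 0, 1 - z_i \left( \mathbf{w}^T \mathbf{x}_i + b \right) \right)$:
- correctly classified points not inside the margin will have $\xi_i = 0$
- correctly classified points which sit inside the margin will have $0 \lt \xi_i \leq 1$ (with $\xi_i = 1$ exactly on the decision boundary)
- misclassified points will have $\xi_i \gt 1$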
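The cases above can be reproduced by computing the smallest slack that satisfies both constraints, $\xi_i = \max(0, 1 - z_i(\mathbf{w}^T \mathbf{x}_i + b))$. The hyperplane, the samples and the helper names in this sketch are hypothetical.

```python
def slack(x, z, w, b):
    """Smallest xi >= 0 such that z (w^T x + b) >= 1 - xi."""
    f = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return max(0.0, 1.0 - z * f)

def describe(xi):
    """Qualitative position of a sample given its slack value."""
    if xi == 0:
        return "on or outside the margin"
    if xi < 1:
        return "inside the margin, correctly classified"
    if xi == 1:
        return "on the decision boundary"
    return "misclassified"

# Hypothetical hyperplane x1 = 0, with margin boundaries at x1 = -1 and x1 = +1
w, b = [1.0, 0.0], 0.0
samples = [
    ([2.0, 0.0], 1),   # far on the correct side
    ([0.5, 0.0], 1),   # correct side, but inside the margin
    ([-1.0, 0.0], 1),  # wrong side of the decision boundary
    ([-3.0, 0.0], -1), # far on the correct side
]

slacks = [slack(x, z, w, b) for x, z in samples]
labels = [describe(xi) for xi in slacks]
```

For these samples the slack values are `0.0`, `0.5`, `2.0` and `0.0` respectively.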

The functional:
$$
\Phi(\xi) = \sum_{i = 1}^{n} (\xi_i)^{\sigma}
$$
- for small values of $\sigma$ is approximately equal to the number of points violating the hard margin constraint
- for $\sigma = 1$ represents **an upper bound** on the number of misclassified samples, since each of them contributes at least 1 to the sum
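The behaviour of $\Phi(\xi)$ for different exponents can be seen on a small example. The slack values below are made up for illustration, and `slack_penalty` is a hypothetical helper name.

```python
def slack_penalty(xis, sigma):
    """Phi(xi) = sum_i xi_i ** sigma, for sigma > 0."""
    return sum(xi ** sigma for xi in xis)

# Hypothetical slack values: two samples satisfy the hard-margin constraint,
# two sit inside the margin, one is misclassified
xis = [0.0, 0.0, 0.5, 0.3, 2.0]

n_violations = sum(1 for xi in xis if xi > 0)     # samples with xi_i > 0
n_misclassified = sum(1 for xi in xis if xi > 1)  # samples with xi_i > 1

phi_small = slack_penalty(xis, 0.01)  # close to n_violations
phi_one = slack_penalty(xis, 1.0)     # plain sum of the slacks
```

With a small exponent each positive slack contributes roughly 1, so `phi_small` is close to `n_violations`. With $\sigma = 1$ the functional is simply the sum of the slacks, which here is at least `n_misclassified`.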

Since using small values of $\sigma$ makes the problem not convex anymore, we set $\sigma = 1$ and can finally write the **Primal Formulation of the Soft-Margin SVM Problem**:
