In [1]:
# TODO: The relationship between the math formulas and the author's implementation of
#  support vector machines is not clear. If interested, spend more time on this in the next iteration.

# TODO: In some cases, once the support vectors are computed in my code, some data points are
#  not within the margin. There might be a small difference between
#  how the support vectors are computed and how the upper and lower boundaries are
#  computed in my plotting function. It usually happens when the two classes overlap
#  heavily in soft margin classification. This should be investigated.
#    --> I checked this with LLMs and it seems that this behavior is normal when there are outliers.

## Decision function
The decision function for a support vector classifier (SVC) can be derived from the linear equation that defines the 
hyperplane separating the two classes in the feature space. For a linear SVC, the decision function is given by:

$f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$

Here:

- $\mathbf{w}$ is the weight vector (coefficients) of the hyperplane.
- $\mathbf{x}$ is the input feature vector.
- $b$ is the bias term (intercept).

The decision function outputs a real-valued number that is proportional to the signed distance of the point $\mathbf{x}$ from the hyperplane; the geometric distance itself is $f(\mathbf{x}) / \lVert \mathbf{w} \rVert$. The sign of the decision function determines the class label:

- If $f(\mathbf{x}) > 0$, the point is classified as belonging to the positive class.
- If $f(\mathbf{x}) < 0$, the point is classified as belonging to the negative class.
- If $f(\mathbf{x}) = 0$, the point lies exactly on the decision boundary (hyperplane).
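
A minimal sketch of the decision function in plain Python. The values of `w` and `b` are hypothetical, chosen only for illustration rather than learned from data, and the helper names are not from any library. Dividing $f(\mathbf{x})$ by $\lVert \mathbf{w} \rVert$ gives the geometric signed distance to the hyperplane.

```python
import math


def decision_function(w, x, b):
    """Return f(x) = w^T x + b for plain Python sequences."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def signed_distance(w, x, b):
    """Geometric signed distance of x from the hyperplane: f(x) / ||w||."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return decision_function(w, x, b) / norm


def predict(w, x, b):
    """Return 1, -1, or 0 (exactly on the boundary) from the sign of f(x)."""
    f = decision_function(w, x, b)
    return (f > 0) - (f < 0)


# Hypothetical parameters (assumption: not taken from a trained model).
w = [3.0, 4.0]
b = -5.0

points = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.5]]
scores = [decision_function(w, x, b) for x in points]     # [20.0, -5.0, 0.0]
distances = [signed_distance(w, x, b) for x in points]    # [4.0, -1.0, 0.0]
labels = [predict(w, x, b) for x in points]               # [1, -1, 0]
```

The third point has $f(\mathbf{x}) = 0$, so it lies exactly on the decision boundary.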

## Label transformation
In Support Vector Classifiers (SVC), the labels of the classes are typically transformed to -1 and 1 to simplify the optimization problem. This transformation is particularly useful because it allows the classifier to express the decision boundary in a standardized way, leveraging the mathematical properties of these binary labels.

### Transformation Formula

Assume you have a dataset with binary class labels, typically denoted as 0 and 1. To transform these labels into -1 and 1 for use in a Support Vector Classifier, you can use the following formula:

$y' = 2y - 1$

Where:
- $y$ is the original label, either 0 or 1.
- $y'$ is the transformed label, either -1 or 1.

### Example

If $y = 0$:

$y' = 2(0) - 1 = -1$

If $y = 1$:

$y' = 2(1) - 1 = 1$
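
The same mapping can be sketched in a few lines of Python. The function names are hypothetical, and the inverse mapping $y = (y' + 1) / 2$ is included only as a convenience for converting predictions back to the original labels.

```python
def to_signed_labels(labels):
    """Map binary labels {0, 1} to {-1, 1} using y' = 2y - 1."""
    for y in labels:
        if y not in (0, 1):
            raise ValueError(f"expected 0 or 1, got {y!r}")
    return [2 * y - 1 for y in labels]


def to_binary_labels(signed_labels):
    """Inverse mapping y = (y' + 1) / 2, from {-1, 1} back to {0, 1}."""
    return [(y + 1) // 2 for y in signed_labels]


y = [0, 1, 1, 0]
y_signed = to_signed_labels(y)            # [-1, 1, 1, -1]
y_restored = to_binary_labels(y_signed)   # [0, 1, 1, 0]
```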

### Why This Transformation?

The transformation to -1 and 1 is advantageous in the context of SVC because:

1. **Simplifies the Optimization Problem**: The optimization problem in SVMs involves maximizing the margin between the classes. With labels -1 and 1, the two hard margin constraints, $\mathbf{w}^T \mathbf{x} + b \geq 1$ for the positive class and $\mathbf{w}^T \mathbf{x} + b \leq -1$ for the negative class, collapse into the single constraint $y' \cdot f(\mathbf{x}) \geq 1$, where $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x} + b$.

2. **Mathematical Properties**: The product $y' \cdot f(\mathbf{x})$ is positive exactly when a point is classified correctly, whichever class it belongs to, so a single expression covers both classes throughout the optimization.
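
As a small illustration of the points above, the sketch below evaluates $y' \cdot f(\mathbf{x})$ for a few samples and checks the constraint $y' \cdot f(\mathbf{x}) \geq 1$. The parameters `w` and `b`, the sample data, and the function name are all hypothetical.

```python
def functional_margin(w, b, x, y_signed):
    """Return y' * f(x), where f(x) = w^T x + b and y' is -1 or 1."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + b
    return y_signed * f


# Hypothetical hyperplane parameters (assumption: not learned from data).
w = [1.0, 1.0]
b = -3.0

# Each sample is (feature vector, label in {-1, 1}).
samples = [
    ([3.0, 2.0], 1),    # f = 2.0, correct and outside the margin
    ([1.0, 0.0], -1),   # f = -2.0, correct and outside the margin
    ([2.0, 1.5], 1),    # f = 0.5, correct but inside the margin
    ([2.0, 2.0], -1),   # f = 1.0, wrong side of the boundary
]

margins = [functional_margin(w, b, x, y) for x, y in samples]
satisfies_constraint = [m >= 1 for m in margins]
correctly_classified = [m > 0 for m in margins]
```

With these values, `margins` is `[2.0, 2.0, 0.5, -1.0]`: only the first two samples satisfy the constraint, while the third is still classified correctly because its product is positive.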

This transformation is standard practice in the implementation of binary SVMs. For multi-class SVMs, similar principles apply, but the transformation and decision process involve handling multiple binary classifiers.

## Margin
In a classification task, the support vectors are the data points that lie on the edge of the margin or inside it. The aim is to
make the margin as wide as possible while keeping the number of margin violations (points inside the margin or on the wrong side of it) as low as possible.
The model parameters depend only on the support vectors. Data points that are not support vectors do not affect the solution.

In soft margin classification, some data points are allowed to be within the margin or even misclassified.
  
In hard margin classification, no data points are allowed inside the margin, so the support vectors lie exactly on the edges of the margin.
Hard margin classification only works when the data are linearly separable.
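
The sketch below illustrates these ideas for a fixed linear classifier. It assumes labels in {-1, 1} and margin edges at $f(\mathbf{x}) = \pm 1$, which gives a margin width of $2 / \lVert \mathbf{w} \rVert$. The parameters `w` and `b` are hypothetical rather than the result of training, so the points flagged here are only candidates for support vectors; the helper names are made up for this example.

```python
import math


def margin_width(w):
    """Distance between the margin edges f(x) = -1 and f(x) = +1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def categorize(w, b, x, y_signed, tol=1e-9):
    """Place a labeled point relative to the margin using m = y' * f(x)."""
    m = y_signed * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    if m > 1 + tol:
        return "outside margin"
    if abs(m - 1) <= tol:
        return "on margin edge"
    if m > 0:
        return "inside margin"
    return "misclassified"  # also covers points exactly on the boundary


# Hypothetical parameters: boundary at x2 = 2, margin edges at x2 = 1.5 and 2.5.
w = [0.0, 2.0]
b = -4.0

samples = [
    ([0.0, 3.0], 1),
    ([1.0, 2.5], 1),
    ([2.0, 1.5], -1),
    ([1.0, 2.25], 1),
    ([0.0, 2.5], -1),
    ([3.0, 0.0], -1),
]

width = margin_width(w)
categories = [categorize(w, b, x, y) for x, y in samples]

# Points on the edge of the margin or violating it are support vector candidates.
support_candidates = [s for s, c in zip(samples, categories) if c != "outside margin"]

# A hard margin requires every point to be on an edge or outside the margin.
hard_margin_ok = all(c in ("outside margin", "on margin edge") for c in categories)
```

Here the fourth and fifth samples violate the margin, so this hyperplane would only be acceptable under soft margin classification.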

In a regression task the objective is the opposite: you want to fit as many data points as possible inside the margin.
This time, however, the support vectors are the data points that are not strictly inside the margin, that is, those on its edge or outside it.
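
A minimal sketch of the regression case, assuming a one-dimensional linear model $f(x) = wx + b$ and a margin of half-width `epsilon` around it. The name `epsilon`, the parameter values, and the data are assumptions made for illustration; nothing here is fitted.

```python
def inside_margin(w, b, x, y, epsilon):
    """True if the target y lies strictly within epsilon of the prediction w*x + b."""
    residual = y - (w * x + b)
    return abs(residual) < epsilon


# Hypothetical model and margin half-width (assumption: not learned from data).
w = 2.0
b = 1.0
epsilon = 0.5

# Each data point is (x, y).
data = [(0.0, 1.2), (1.0, 3.0), (2.0, 6.0), (3.0, 6.25), (4.0, 9.25)]

inside = [inside_margin(w, b, x, y, epsilon) for x, y in data]

# Points on the edge of the margin or outside it are support vector candidates.
support_candidates = [p for p, ok in zip(data, inside) if not ok]
```

Three of the five points fall inside the margin; the remaining two, `(2.0, 6.0)` and `(3.0, 6.25)`, are the ones that would act as support vectors.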

## Performance
SVMs scale poorly to very large datasets, so they are best suited to small and medium-sized ones.