# Support Vector Machine (SVM)

[sklearn: SVM](https://scikit-learn.org/stable/modules/svm.html)

[Video Resource: SVM algorithm (mathematical approach)](https://youtu.be/vgBEMuQa4XU)

**Support Vector Machines (SVMs)** are a versatile set of **supervised learning** methods used for **classification, regression**, and **outlier detection**.

## 1- Linear Classification with SVM:

![support-vector-machine-algorithm.png](attachment:support-vector-machine-algorithm.png)

![1_kzdqdDUTwNsAkVZNLQAPvQ.png](attachment:1_kzdqdDUTwNsAkVZNLQAPvQ.png)

## Support Vector Machine (SVM) Algorithm

![1_DECyqDH_OCzHnh8jc7W4Bg.png](attachment:1_DECyqDH_OCzHnh8jc7W4Bg.png)


## Objective

SVM aims to find the hyperplane $$w \cdot x + b = 0$$ that separates the classes with the largest possible margin, where \(w\) is the weight vector, \(x\) is the input feature vector, and \(b\) is the bias term.

## Decision Function

The decision function for a data point \(x_i\) is $$f(x_i) = w \cdot x_i + b$$. The predicted class is the sign of \(f(x_i)\): positive on one side of the hyperplane, negative on the other.
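
As a rough illustration of the decision function, the sketch below fits scikit-learn's `SVC` with a linear kernel on a made-up four-point dataset and checks that computing `w · x + b` by hand matches `decision_function`. The data, the query point `x_new`, and the variable names are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

# Tiny, linearly separable toy data (made up for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_[0]        # weight vector w
b = clf.intercept_[0]   # bias term b

x_new = np.array([[2.5, 2.0]])                # hypothetical query point
manual_score = x_new @ w + b                  # f(x) = w . x + b
library_score = clf.decision_function(x_new)  # same quantity from the library

# The sign of f(x) gives the predicted class.
predicted = np.where(manual_score >= 0, 1, -1)
print(manual_score, library_score, predicted)
```

The hand-computed score and the library score should agree up to floating-point error, and the sign of the score should agree with `clf.predict(x_new)`.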

## Margin

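The margin (\(M\)) is the distance from the hyperplane to the nearest data point. For a linearly separable case, with \(w\) and \(b\) scaled so that the nearest points satisfy \(|w \cdot x_i + b| = 1\), $$M = \frac{1}{\|w\|}$$. The full gap between the two classes is therefore \(\frac{2}{\|w\|}\).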
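
The margin formula can be checked numerically. The sketch below assumes a made-up separable dataset and uses a very large `C` to approximate the hard-margin case (scikit-learn's `SVC` has no exact hard-margin mode, so this is an approximation). It compares `1 / ||w||` with the smallest geometric distance from a training point to the hyperplane.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (made up for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1, -1, 1, 1])

# A very large C approximates the hard-margin (linearly separable) case.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Margin from the formula M = 1 / ||w||.
margin = 1.0 / np.linalg.norm(w)

# Geometric distance of every training point to the hyperplane.
distances = np.abs(X @ w + b) / np.linalg.norm(w)

print("margin from formula:", margin)
print("closest point distance:", distances.min())
```

Both printed values should be approximately equal, since the closest points are exactly the ones that sit on the margin.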

## Optimization Objective

Maximize \(M\) by minimizing $$\frac{1}{2}\|w\|^2$$ subject to the constraints $$y_i(w \cdot x_i + b) \geq 1$$ for all training samples \((x_i, y_i)\).
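
One way to see this optimization problem concretely is to hand it to a general-purpose constrained solver. The sketch below is only illustrative: SciPy's `minimize` with the `SLSQP` method is swapped in for the specialized solvers that SVM libraries actually use, and the toy data and the feasible starting point `x0` are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Toy separable data (made up for illustration).
X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(params):
    """Hard-margin objective 0.5 * ||w||^2 (params = [w_1, w_2, b])."""
    w = params[:-1]
    return 0.5 * np.dot(w, w)

# One inequality constraint per sample: y_i (w . x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq",
     "fun": lambda p, xi=xi, yi=yi: yi * (np.dot(p[:-1], xi) + p[-1]) - 1.0}
    for xi, yi in zip(X, y)
]

# Hand-picked starting point that already satisfies every constraint.
x0 = np.array([1.0, 1.0, -4.0])

result = minimize(objective, x0, method="SLSQP", constraints=constraints)
w_opt, b_opt = result.x[:-1], result.x[-1]

print("w:", w_opt, "b:", b_opt)
print("functional margins:", y * (X @ w_opt + b_opt))
```

At the solution, the samples closest to the boundary should have a functional margin of about 1, and every other sample a value above 1.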

## Soft Margin SVM (for non-linearly separable cases)

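Introduce a slack variable (\(\xi_i \geq 0\)) for each data point to allow some margin violations and misclassifications. The optimization objective becomes minimizing $$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} \xi_i$$ subject to $$y_i(w \cdot x_i + b) \geq 1 - \xi_i$$, where \(C > 0\) controls the trade-off between margin maximization and classification error: a larger \(C\) penalizes violations more heavily, while a smaller \(C\) favors a wider margin.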
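
The effect of the trade-off parameter can be observed by fitting the same overlapping data with a small and a large `C`. In this sketch the dataset (generated with `make_blobs`), the two `C` values, and the `results` dictionary are all assumptions for illustration. The slack of each point is recovered from the fitted model as `max(0, 1 - y_i f(x_i))`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so some slack is unavoidable.
X, y01 = make_blobs(n_samples=100, centers=[[0, 0], [2, 2]],
                    cluster_std=1.2, random_state=0)
y = 2 * y01 - 1  # relabel {0, 1} as {-1, +1}

results = {}
for C in (0.001, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    # Slack of each sample: how far it falls short of functional margin 1.
    slack = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
    results[C] = {
        "w_norm": np.linalg.norm(clf.coef_[0]),
        "total_slack": slack.sum(),
        "n_support": int(clf.n_support_.sum()),
    }

for C, stats in results.items():
    print(C, stats)
```

With the small `C`, the model should show a smaller `||w||` (a wider margin) and more total slack; with the large `C`, the margin narrows and the total slack drops.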

## Kernel Trick

Introduce a kernel function $$K(x_i, x_j)$$ to implicitly map input features into a higher-dimensional space. The decision function becomes $$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b$$, where \(\alpha_i\) are Lagrange multipliers.
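
The kernelized decision function can be rebuilt by hand from a fitted model. The sketch below assumes scikit-learn's binary `SVC`, where `dual_coef_` stores the products \(\alpha_i y_i\) for the support vectors and `intercept_` stores \(b\). The dataset and the `gamma` value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

gamma = 0.5  # arbitrary RBF width for this example
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

X_new = X[:5]

# K(x, x_i) between each query point and each support vector.
K = rbf_kernel(X_new, clf.support_vectors_, gamma=gamma)

# f(x) = sum_i (alpha_i * y_i) * K(x, x_i) + b
manual = K @ clf.dual_coef_[0] + clf.intercept_[0]
library = clf.decision_function(X_new)

print(manual)
print(library)
```

Only the support vectors enter the sum, which is why the model never needs the remaining training points at prediction time.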

## Support Vectors

Support vectors are the training samples that have non-zero Lagrange multipliers (\(\alpha_i > 0\)). They lie on or inside the margin, and they alone determine the decision function.
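
A quick way to see that only the support vectors matter is to refit the model on them alone. This sketch uses a made-up six-point dataset and scikit-learn's `support_` and `dual_coef_` attributes. The refit step is an illustrative check, not part of the SVM algorithm itself.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data (made up for illustration).
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=10.0).fit(X, y)

sv_idx = clf.support_               # indices of the support vectors
alphas = np.abs(clf.dual_coef_[0])  # dual_coef_ stores alpha_i * y_i

print("support vector indices:", sv_idx)
print("alphas:", alphas)

# Refit using only the support vectors: the hyperplane should not change.
clf_sv = SVC(kernel="linear", C=10.0).fit(X[sv_idx], y[sv_idx])
print(clf.coef_, clf.intercept_)
print(clf_sv.coef_, clf_sv.intercept_)
```

The two fitted hyperplanes should agree up to solver tolerance, even though the second model saw fewer points.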

## Dual Form

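The optimization problem is often expressed in its dual form, involving the maximization of $$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j)$$ subject to $$0 \leq \alpha_i \leq C$$ and $$\sum_{i=1}^{N} \alpha_i y_i = 0$$ (in the hard-margin case there is no upper bound \(C\)).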
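
The dual quantities can be inspected on a fitted model. The sketch below assumes a linear kernel, labels in \(\{-1, +1\}\), and scikit-learn's convention that `dual_coef_` holds \(\alpha_i y_i\) for the support vectors. It evaluates the dual objective and compares it with the soft-margin primal objective, which should nearly coincide at the optimum. The dataset and the tightened `tol` are choices made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y01 = make_blobs(n_samples=60, centers=[[0, 0], [2, 2]],
                    cluster_std=1.0, random_state=1)
y = 2 * y01 - 1  # relabel {0, 1} as {-1, +1}
C = 1.0

# A tight solver tolerance keeps the primal/dual gap small.
clf = SVC(kernel="linear", C=C, tol=1e-6).fit(X, y)

# Expand alpha_i * y_i to all samples (zero for non-support vectors).
alpha_y = np.zeros(len(X))
alpha_y[clf.support_] = clf.dual_coef_[0]
alpha = alpha_y * y  # recover alpha_i, since y_i^2 = 1

K = X @ X.T  # linear kernel matrix K(x_i, x_j) = x_i . x_j
dual_objective = alpha.sum() - 0.5 * alpha_y @ K @ alpha_y

w = clf.coef_[0]
slack = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
primal_objective = 0.5 * w @ w + C * slack.sum()

print("sum of alpha_i * y_i:", alpha_y.sum())
print("dual:", dual_objective, "primal:", primal_objective)
```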


## When to Use 2D SVM

  - **Data is easily separable by a straight line.**
  - **Features have simple relationships.**
  - **Computational efficiency matters.**
  - **Visualization of the decision boundary is crucial.**
  - **A simpler model is adequate.**
  - **Only two relevant features are available.**

Consider using higher-dimensional SVM with kernels for non-linear data or more complex relationships. The choice depends on your data characteristics and the problem at hand.


![Screenshot%202023-12-27%20at%2017.19.05.png](attachment:Screenshot%202023-12-27%20at%2017.19.05.png)

## 2- Non-Linear Classification with SVM:

![0_l7Tg9hZq-617K11S.png](attachment:0_l7Tg9hZq-617K11S.png)

### SVM for Non-Linear Classification with Kernel Trick

SVM, originally designed for linear separation, can handle non-linear classification using the "kernel trick." When data is not linearly separable, a kernel function is applied to implicitly map the feature space into a higher-dimensional one.
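
To make "implicitly map" concrete, the sketch below compares an explicit feature map with the corresponding kernel for one small case: the degree-2 polynomial kernel \((x \cdot z + 1)^2\) on 2-D inputs. The function names `phi` and `poly_kernel` and the two sample points are hypothetical, introduced only for this illustration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input (6-D output)."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

def poly_kernel(x, z):
    """Polynomial kernel (x . z + 1)^2, evaluated in the original 2-D space."""
    return (np.dot(x, z) + 1.0) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = np.dot(phi(x), phi(z))  # inner product after mapping to 6-D
implicit = poly_kernel(x, z)       # same value, without building phi

print(explicit, implicit)  # both are 4.0 up to floating-point rounding
```

The kernel gives the higher-dimensional inner product directly from the original features, so the mapped vectors never have to be stored.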

### Kernel Functions:

Common kernel functions include:
- **Polynomial Kernel:** $$K(x, x') = (x \cdot x' + c)^d$$
- **RBF (Gaussian) Kernel:** $$K(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$
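- **Sigmoid Kernel:** $$K(x, x') = \tanh(\gamma \, x \cdot x' + c)$$, where \(\gamma\) is a slope parameter (not to be confused with the Lagrange multipliers \(\alpha_i\))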
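
The three kernels listed above can be written directly in NumPy and compared with the helpers in `sklearn.metrics.pairwise`. This is a sketch under a few assumptions: the random inputs and parameter values are arbitrary, scikit-learn's RBF `gamma` is taken to correspond to \(1 / (2\sigma^2)\), and the slope coefficient of the sigmoid kernel is passed as scikit-learn's `gamma`.

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel, sigmoid_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))  # 4 points with 3 features
B = rng.normal(size=(5, 3))  # 5 points with 3 features

c, d = 1.0, 3   # polynomial / sigmoid offset and polynomial degree
sigma = 1.5     # RBF width
slope = 0.1     # slope coefficient of the sigmoid kernel

dot = A @ B.T                                                 # pairwise x . x'
sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)  # pairwise ||x - x'||^2

K_poly = (dot + c) ** d
K_rbf = np.exp(-sq_dist / (2.0 * sigma**2))
K_sig = np.tanh(slope * dot + c)

# Compare with scikit-learn's implementations.
gamma_rbf = 1.0 / (2.0 * sigma**2)
poly_ok = np.allclose(K_poly, polynomial_kernel(A, B, degree=d, gamma=1.0, coef0=c))
rbf_ok = np.allclose(K_rbf, rbf_kernel(A, B, gamma=gamma_rbf))
sig_ok = np.allclose(K_sig, sigmoid_kernel(A, B, gamma=slope, coef0=c))

print(poly_ok, rbf_ok, sig_ok)
```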

### Decision Function:

In the transformed space, the decision function is given by:
$$f(x) = \sum_{i=1}^{N} \alpha_i y_i K(x, x_i) + b$$

### When to Use Mapping in 3D:

- **Data Non-Linearity:** Apply when the original feature space lacks a linear separation.
- **Complex Relationships:** For intricate feature relationships that linear models can't capture.
- **Improved Classification:** Especially when linear models perform poorly.
- **Computational Efficiency:** The kernel trick computes inner products in the higher-dimensional space without ever constructing the mapped features, which is cheaper than an explicit mapping.

By leveraging the kernel trick, SVM becomes a powerful tool for handling non-linear classification problems efficiently.
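
A small experiment illustrates the difference between a linear and a kernelized SVM on data that no straight line can separate. The sketch below assumes a synthetic concentric-circles dataset from `make_circles`; the sample size, noise level, and split are arbitrary choices.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: not separable by any straight line.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear_acc = SVC(kernel="linear").fit(X_train, y_train).score(X_test, y_test)
rbf_acc = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train).score(X_test, y_test)

print("linear kernel accuracy:", linear_acc)
print("RBF kernel accuracy:", rbf_acc)
```

On this kind of data the linear kernel is expected to score close to chance, while the RBF kernel should separate the rings almost perfectly.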


# Advantages vs Disadvantages of SVMs:

| Advantages of SVMs                                       | Disadvantages of SVMs                                       |
|----------------------------------------------------------|------------------------------------------------------------|
| **Effective in high-dimensional spaces:** SVMs perform well in scenarios with a large number of dimensions. | **Risk of overfitting:** If the number of features significantly exceeds the number of samples, careful selection of the kernel function and regularization term is crucial to avoid overfitting. |
| **Still effective when dimensions exceed samples:** They remain effective even when the number of dimensions is greater than the number of samples. | **No direct probability estimates:** SVMs do not provide direct probability estimates. In scikit-learn, these are calculated using an expensive five-fold cross-validation. |
| **Memory efficient:** SVMs use a subset of training points (support vectors) in the decision function, making them memory-efficient. | |
| **Versatile:** Different Kernel functions can be specified for the decision function. Common kernels are provided, and custom kernels can also be defined. | |

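Since an SVM outputs a signed score rather than a probability, probabilities have to come from a separate calibration step. The sketch below shows one common approach, swapped in here for illustration: wrapping `SVC` in scikit-learn's `CalibratedClassifierCV` with sigmoid (Platt) scaling and five-fold cross-validation. The synthetic dataset is an assumption. `SVC` has also offered a `probability=True` option, whose availability may depend on the installed scikit-learn version.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Raw SVM output: signed scores, not probabilities.
scores = SVC(kernel="rbf").fit(X, y).decision_function(X[:3])

# Calibrated probabilities: five-fold cross-validation with sigmoid scaling.
calibrated = CalibratedClassifierCV(SVC(kernel="rbf"), method="sigmoid", cv=5)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X[:3])

print("decision scores:", scores)
print("calibrated probabilities:\n", proba)
```

The extra cross-validation is what makes probability estimates expensive: the SVM is refit several times on top of the original fit.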