## SVM Types
The main types of SVM we encounter frequently are:
* C-SVM or $\epsilon$-SVM
* NuSVM ($\nu$SVM)

### C-SVM or $\epsilon$-SVM

*C-SVM* is named after its regularization hyperparameter $C$. $C$ must be strictly greater than zero and has no upper bound. It is the most commonly used SVM in practice and includes slack variables to allow misclassifications when the data is not linearly separable.

**Interpretation of C**
- Must be strictly greater than 0
- **High 𝐶**: Penalizes misclassification heavily and aims for fewer training errors, at a higher risk of overfitting.
- **Low 𝐶**: Tolerates more margin violations, which gives a wider margin and potentially better generalization, but risks underfitting.
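
As a rough illustration of the points above, the sketch below fits scikit-learn's `SVC` with a small and a large `C` and records the number of support vectors and the training accuracy for each. The synthetic dataset, the linear kernel and the two `C` values are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic, slightly noisy two-class data (illustrative assumption).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.05, random_state=0)

results = {}
for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    results[C] = {
        # n_support_ holds the number of support vectors per class.
        "n_support_vectors": int(clf.n_support_.sum()),
        "train_accuracy": clf.score(X, y),
    }
    print(f"C={C}: {results[C]}")
```

A small `C` typically leaves more points on or inside the margin (more support vectors), while a large `C` tends to fit the training data more tightly; the exact numbers depend on the data.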

### NuSVM or $\nu$SVM

NuSVM offers more intuitive control over the number of support vectors and margin errors. It replaces the parameter $C$ with $\nu$, which takes a value in the interval $(0, 1]$.

$\nu$ is interpreted as an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors.
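
The bounds described above can be checked empirically. The sketch below fits scikit-learn's `NuSVC` and compares `nu` with the observed fraction of support vectors and of training errors (training errors are a subset of margin errors). The dataset and the value `nu = 0.2` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

# Roughly balanced synthetic data (illustrative assumption).
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)

nu = 0.2
clf = NuSVC(nu=nu, kernel="rbf").fit(X, y)

# Fraction of training points that ended up as support vectors.
frac_support_vectors = len(clf.support_) / len(X)
# Fraction of misclassified training points.
frac_training_errors = float(np.mean(clf.predict(X) != y))

print(f"nu={nu}, support vectors={frac_support_vectors:.3f}, "
      f"training errors={frac_training_errors:.3f}")
```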

*Benefits of $\nu$SVM*
- Better control over model complexity
- Useful when the number of support vectors or training errors needs to be explicitly bounded.
- Works well in imbalanced or noisy datasets

Although both variants of SVM—C-SVM and Nu-SVM—often produce similar results, parameter 𝜈 directly relates to the ratio of support vectors and the ratio of training errors, providing more interpretable and intuitive control over the model's complexity and tolerance for misclassification.

If you wish to learn more on NuSVM, please consult [this link](http://is.tuebingen.mpg.de/fileadmin/user_upload/files/publications/pdf3353.pdf).

We will use the C-SVM variant to demonstrate the applications of the SVM model.

## SVM as Classifier

So far we have studied SVM as a model for binary classification, i.e. classifying data into only two categories. However, many real-life scenarios require us to classify data points into three or more categories. Hence, SVM classifiers are divided into two types:
- Binary Classifier
- Multiclass Classifier

### Binary Classifier

Here's a quick recap of what we have learnt so far.

The Support Vector Machine classifies data points either linearly or non-linearly using hyperplanes, kernels and slack variables. The examples so far classify the data into only two categories, which is called binary classification.

In binary classification, we simply label one class as $1$ and the other as $0$. For example, in a cats vs. dogs classification, we can assign cats the label $0$ and dogs the label $1$.

Suppose $w$ is the learned weight vector (with any bias term absorbed into it) and $x$ is the data point to be classified. The prediction $\hat{y}$ is then:
$$
\hat{y} = \begin{cases}1, & \text{if } w^T x \geq 0\\
0, & \text{otherwise}
\end{cases}
$$
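
The decision rule above can be reproduced by hand from a fitted linear model. The sketch below is a minimal example, assuming scikit-learn's `SVC` with a linear kernel; unlike the formula, scikit-learn keeps the bias as a separate `intercept_` rather than absorbing it into $w$, and it uses a strict `> 0` threshold (the two only differ for points lying exactly on the boundary). The dataset and the cat/dog labels are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Label 0 plays the role of "cat", label 1 the role of "dog" (assumption).
X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=1)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

w = clf.coef_.ravel()        # learned weight vector
b = clf.intercept_[0]        # learned bias
scores = X @ w + b           # w^T x + b for every sample
manual_pred = (scores > 0).astype(int)

print("manual rule matches predict():",
      np.array_equal(manual_pred, clf.predict(X)))
```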


### MultiClass Classifier

Multi class classification is the task of assigning each object to one of three or more classes.

For example: Classification between cats and dogs is a binary classification problem. Classification between cats, dogs and mice is a multiclass classification problem.

There are a number of methods which we can apply for the multi class classification:
* One vs. All for multi class classification
* One vs. One for multi class classification

a. **One vs. All Multi Class Classification:**  
 One vs. All (also called One vs. Rest) is a multi class classification technique in which we split the multi class dataset into multiple binary classification problems.

Suppose we have to classify three labels: **Cats**, **Dogs** and **Humans**. We then have three models that classify as follows:
 * Model 1: Cats vs. [Dogs, Humans]  
 * Model 2: Dogs vs. [Cats, Humans]  
 * Model 3: Humans vs. [Dogs, Cats]  

Here, *Model 1* treats Cats as positive samples and the rest (Dogs and Humans) as negative samples. The same applies to Model 2 and Model 3.  
The main idea is to create one model per label, each of which is good at recognizing only one class. At prediction time, the class whose model gives the highest score is chosen. The number of models required is equal to the number of classes.
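
A minimal sketch of the one vs. all scheme, assuming scikit-learn's `OneVsRestClassifier` wrapper around a linear `SVC`. The three iris classes stand in for cats, dogs and humans; that substitution is purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Three-class dataset bundled with scikit-learn (stand-in for cats/dogs/humans).
X, y = load_iris(return_X_y=True)

ovr = OneVsRestClassifier(SVC(kernel="linear", C=1.0)).fit(X, y)

# One binary model is trained per class.
n_models = len(ovr.estimators_)
print("number of binary models:", n_models)
print("first predictions:", ovr.predict(X[:5]))
```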

b. **One vs. One Multi Class Classification:**  
As in the "*One vs. All*" method, *One vs. One Multi Class Classification* builds multiple binary classifiers. The difference is that the dataset is split into one binary dataset for each pair of classes, so each model separates one class from exactly one other class.  
To make this clear, consider the same case as before. To classify cats, dogs and humans, we divide the problem into these binary classifications:
 * Model 1: Cats vs. Dogs
 * Model 2: Cats vs. Humans
 * Model 3: Dogs vs. Humans  

The number of binary datasets, and in turn models, required for one vs. one multi class classification with $K$ classes is:
$$\frac{K \times (K - 1)}{2}$$
In the above example, we had $3$ classes. Hence we need
$\frac{3 \times (3-1)}{2} = 3$ models and binary datasets.
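
The pairwise count can be checked with a short sketch, assuming scikit-learn's `OneVsOneClassifier` wrapper. The helper `n_pairwise_models` and the four-class synthetic dataset are hypothetical and introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

def n_pairwise_models(n_classes: int) -> int:
    """Number of binary models needed by one vs. one: K * (K - 1) / 2."""
    return n_classes * (n_classes - 1) // 2

# Synthetic four-class problem (illustrative assumption).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_redundant=0, n_classes=4,
                           n_clusters_per_class=1, random_state=0)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

print("formula for 3 classes:", n_pairwise_models(3))
print("formula for 4 classes:", n_pairwise_models(4))
print("models actually trained:", len(ovo.estimators_))
```

scikit-learn's `SVC` already applies a one vs. one scheme internally for multi class problems; the explicit wrapper is used here only to expose the individual pairwise models.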


## One Class SVM:
One class SVM is the most common type of SVM used for anomaly detection. It can be considered as a type of binary classification that classifies data points either into *outliers* or *inliers*.

If the prediction from the algorithm is *positive*, the data is normal; if the prediction is *negative*, the data is an outlier or anomaly, i.e.  

**Positive Case**: *Normal Data*  
**Negative Case**: *Outlier/Anomaly Data*

In One-Class SVM, the model is trained using only the "normal" data, without any class labels. Because it does not rely on labeled anomalies, One-Class SVM is an unsupervised learning algorithm.

**How it works:**

- The model learns the boundary that encloses the majority of the normal data points in the feature space.
- At prediction time, it checks whether a new data point falls inside or outside this learned boundary.
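
The two steps above can be sketched with scikit-learn's `OneClassSVM`. The Gaussian training cloud, the two test points and the values of `nu` and `gamma` are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Training set contains only "normal" points; no labels are passed to fit().
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)

X_new = np.array([[0.1, -0.2],    # close to the training cloud
                  [10.0, 10.0]])  # far away from the training cloud
pred = ocsvm.predict(X_new)       # +1 means inlier, -1 means outlier
print(pred)
```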




## SVM as a Regression Algorithm:
So far we have learnt SVM as a classification model that categorizes data points using hyperplanes.
In this section, we discuss how SVM can also be used for regression problems, where it is called Support Vector Regression (SVR).

The principle of support vector regression is the same as that of support vector classifiers. However, instead of predicting class labels, SVR predicts continuous numerical values.

For the hyperplane and its margins, the best fit is the hyperplane whose margin (a tube of half-width $\epsilon$ around it) contains the maximum number of points.

Given training vectors $x_i \in X$ with targets $y_i \in Y$, $i = 1, \dots, n$, let $w$ be the learned weights and $b$ the bias. Training a support vector regressor can then be formulated as:

$$\min_ {w, b} \frac{1}{2} w^T w + C \sum_{i=1}^{n}\max(0, |y_i - (w^T \phi(x_i) + b)| - \epsilon)$$
Here, $C$ is the regularization parameter, $b$ is the bias, $\phi$ is the feature map induced by the kernel, and $\epsilon$ is the half-width of the tube inside which errors are ignored.

Hence, the objective of SVM regression training is to maximize the margin and minimize the error.

**Goal of SVR:**
- Minimize model complexity (flatness of the function)
- Ignore small errors (within 𝜖)
- Penalize larger errors, scaled by 𝐶
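
The "ignore small errors, penalize larger ones" behaviour can be written down directly. The sketch below defines a hypothetical helper `epsilon_insensitive_loss` (not part of scikit-learn) and applies it to a tiny hand-made example and to the residuals of a fitted `SVR`. The noisy sine data and the values of `C` and `epsilon` are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVR

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """max(0, |y - f(x)| - epsilon): errors inside the tube cost nothing."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

# Hand-made check: errors of 0.05, 0.5 and 2.0 with epsilon = 0.1.
toy_loss = epsilon_insensitive_loss(np.array([1.0, 2.0, 3.0]),
                                    np.array([1.05, 2.5, 1.0]),
                                    epsilon=0.1)
print(toy_loss)

# Noisy sine curve as a toy regression problem.
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
train_loss = epsilon_insensitive_loss(y, svr.predict(X), svr.epsilon)
print("points with zero loss (inside the tube):", int(np.sum(train_loss == 0)))
print("support vectors:", len(svr.support_))
```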

## SVM Under the Hood

The Support Vector Machine (SVM) models we use in scikit-learn are powered by two popular open-source libraries: *libsvm* and *liblinear*. Each is used under the hood for different SVM variants and optimizes the model using different algorithms.  

The support vector classifier (*SVC*) implemented in sklearn uses *libsvm*, whereas *LinearSVC* uses *liblinear*.
- SVC is ideal for non-linear classification and supports kernel functions.
- LinearSVC is optimized for large-scale linear problems and does not support kernels.
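
A minimal sketch of the two estimators side by side. The dataset and hyperparameter values are illustrative assumptions; the check on `get_params()` simply shows that `LinearSVC` exposes no `kernel` parameter.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

kernel_clf = SVC(kernel="rbf", C=1.0).fit(X, y)            # backed by libsvm
linear_clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)    # backed by liblinear

has_kernel_param = {
    "SVC": "kernel" in kernel_clf.get_params(),
    "LinearSVC": "kernel" in linear_clf.get_params(),
}
print(has_kernel_param)
print("SVC train accuracy:", kernel_clf.score(X, y))
print("LinearSVC train accuracy:", linear_clf.score(X, y))
```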

At its core, SVM solves an optimization problem to find the best decision boundary that separates classes with the maximum margin. In its dual form, the SVM optimization problem is:
$$\max_\alpha \sum_{i=1}^n \alpha_i - \frac{1}{2} \sum_{i=1}^n \sum_{j=1}^n y_i y_j K(x_i, x_j) \alpha_i \alpha_j$$
subject to:
$$0 \leq \alpha_i \leq C \quad \text{for } i = 1, 2, \cdots, n$$  
$$\sum _{i=1} ^n y_i \alpha_i = 0$$

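where,

- $\alpha_i$: Lagrange multipliers
- $y_i \in \{-1, +1\}$: class labels (the dual is written with signed labels rather than $0$ and $1$)
- $C$: Regularization parameter (controls soft margin)
- $K(x_i, x_j)$: Kernel function (e.g., linear, polynomial, RBF)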

For the optimization, *libsvm* uses *Sequential Minimal Optimization (SMO)*, whereas *liblinear* uses a [Coordinate Descent](https://en.wikipedia.org/wiki/Coordinate_descent) algorithm.
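
The dual constraints can be inspected on a fitted model. The sketch below relies on the fact that scikit-learn's `SVC.dual_coef_` stores $y_i \alpha_i$ for the support vectors (all other $\alpha_i$ are zero); the dataset and the values of `C` and `gamma` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)

C, gamma = 1.0, 0.5
clf = SVC(kernel="rbf", C=C, gamma=gamma).fit(X, y)

signed_alpha = clf.dual_coef_.ravel()   # y_i * alpha_i for support vectors
alpha = np.abs(signed_alpha)            # alpha_i itself

# Box constraint: 0 <= alpha_i <= C.
box_ok = bool(np.all(alpha <= C + 1e-8))
# Equality constraint: sum_i y_i * alpha_i = 0 (up to rounding).
balance = float(signed_alpha.sum())

# Dual objective, evaluated on the support vectors only.
K = rbf_kernel(clf.support_vectors_, gamma=gamma)
dual_objective = float(alpha.sum() - 0.5 * signed_alpha @ K @ signed_alpha)

print(box_ok, balance, dual_objective)
```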

<!-- To put things simply, SMO algorithm is dedicated to heuristics for choosing which $\alpha_i$ and $\alpha_j$ to
optimize so as to maximize the objective function as much as possible. For large data sets, this
is critical for the speed of the algorithm, since there are $m(m − 1)$ possible choices for $\alpha_i$ and
$\alpha_j$ , and some will result in much less improvement than others.
However, for our simplified version of SMO, we employ a much simpler heuristic. We simply
iterate over all $\alpha_i$, $i = 1, \cdots m$. If $\alpha_i$ does not fulfill the KKT conditions to within some numerical
tolerance, we select $\alpha_j$ at random from the remaining $m − 1$ $\alpha$’s and attempt to jointly optimize
$\alpha_i$ and $\alpha_j$ . If none of the $\alpha$’s are changed after a few iteration over all the $\alpha_i$’s, then the algorithm
terminates. It is important to realize that by employing this simplification, the algorithm is no
longer guaranteed to converge to the global optimum (since we are not attempting to optimize
all possible $\alpha_i$
, $\alpha_j$ pairs, there exists the possibility that some pair could be optimized which we
do not consider) -->



## SMO: Sequential Minimal Optimization

*Sequential Minimal Optimization* is an iterative optimization algorithm. It is used by libsvm and solves the dual optimization problem above efficiently by breaking it down into the smallest possible subproblems, each involving only two Lagrange multipliers.

*Working Algorithm*

1. Select one Lagrange multiplier $𝛼_1$ that violates the KKT conditions (i.e., it does not satisfy the optimality constraints).

2. Select a second multiplier $𝛼_2$ and solve the subproblem involving only $𝛼_1$ and $𝛼_2$ while keeping others fixed.

3. Update the model and repeat the process until all multipliers satisfy the KKT conditions (within a numerical tolerance), at which point the algorithm has converged.

Because each two-multiplier subproblem can be solved analytically, this pairwise approach avoids general-purpose numerical solvers and keeps every step cheap, even on large datasets.
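
The steps above can be sketched in a few lines of NumPy. This is a simplified teaching variant, not libsvm's implementation: it assumes a linear kernel and labels in $\{-1, +1\}$, and it picks the second multiplier at random instead of using libsvm's working-set selection, so it is not guaranteed to reach the global optimum. The function name `simplified_smo` and all parameter defaults are hypothetical.

```python
import numpy as np

def simplified_smo(X, y, C=1.0, tol=1e-3, max_passes=5, max_sweeps=1000, seed=0):
    """Simplified SMO for a linear soft-margin SVM; y must contain -1/+1."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    K = X @ X.T                                  # linear kernel matrix
    alpha, b = np.zeros(n), 0.0
    passes = sweeps = 0
    while passes < max_passes and sweeps < max_sweeps:
        sweeps += 1
        changed = 0
        for i in range(n):
            E_i = (alpha * y) @ K[:, i] + b - y[i]
            # Step 1: skip alpha_i unless it violates the KKT conditions.
            if not ((y[i] * E_i < -tol and alpha[i] < C) or
                    (y[i] * E_i > tol and alpha[i] > 0)):
                continue
            # Step 2: pick a second index j != i at random.
            j = int(rng.integers(n - 1))
            j += j >= i
            E_j = (alpha * y) @ K[:, j] + b - y[j]
            a_i_old, a_j_old = alpha[i], alpha[j]
            # Bounds that keep both multipliers inside [0, C].
            if y[i] != y[j]:
                L, H = max(0.0, a_j_old - a_i_old), min(C, C + a_j_old - a_i_old)
            else:
                L, H = max(0.0, a_i_old + a_j_old - C), min(C, a_i_old + a_j_old)
            eta = 2 * K[i, j] - K[i, i] - K[j, j]
            if L == H or eta >= 0:
                continue
            # Closed-form solution of the two-variable subproblem, then clip.
            a_j = float(np.clip(a_j_old - y[j] * (E_i - E_j) / eta, L, H))
            if abs(a_j - a_j_old) < 1e-5:
                continue
            a_i = a_i_old + y[i] * y[j] * (a_j_old - a_j)
            alpha[i], alpha[j] = a_i, a_j
            # Step 3: update the bias so the KKT conditions hold for i or j.
            b1 = (b - E_i - y[i] * (a_i - a_i_old) * K[i, i]
                  - y[j] * (a_j - a_j_old) * K[i, j])
            b2 = (b - E_j - y[i] * (a_i - a_i_old) * K[i, j]
                  - y[j] * (a_j - a_j_old) * K[j, j])
            b = b1 if 0 < a_i < C else b2 if 0 < a_j < C else (b1 + b2) / 2
            changed += 1
        # Stop after max_passes consecutive sweeps without any update.
        passes = passes + 1 if changed == 0 else 0
    return alpha, b

# Two well-separated blobs as toy data (illustrative assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, size=(20, 2)),
               rng.normal(-2.0, 0.5, size=(20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

C = 1.0
alpha, b = simplified_smo(X, y, C=C)
w = (alpha * y) @ X                              # primal weights (linear kernel)
train_accuracy = float(np.mean(np.sign(X @ w + b) == y))
print("non-zero multipliers:", int(np.sum(alpha > 1e-8)))
print("training accuracy:", train_accuracy)
```

Each update changes exactly two multipliers and preserves both dual constraints, $0 \leq \alpha_i \leq C$ and $\sum_i y_i \alpha_i = 0$, which is why the pair is the smallest subproblem that can be optimized.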


## Application of SVM
Support Vector Machines (SVMs) are powerful and versatile models used across various domains for both classification and regression tasks, due to their ability to handle high-dimensional data and non-linear decision boundaries (via kernels).

1. **Text Classification & Natural Language Processing (NLP)**: Spam detection (email classification), Sentiment analysis (positive vs negative reviews), Topic classification (news categorization)

2. **Image Classification**: Handwritten digit recognition (e.g., MNIST), Face detection and recognition, Object detection

3. **Bioinformatics**: Gene expression classification, Protein structure prediction, Disease diagnosis (e.g., cancer detection using microarray data)

4. **Financial Applications**: Credit risk assessment, Stock price prediction (in limited, feature-engineered scenarios), Fraud detection

5. **Anomaly Detection**: Network intrusion detection, Fraudulent transaction detection, Industrial defect detection

6. **Computer Vision in Real-Time Systems**: License plate recognition, Gesture recognition, Medical imaging classification

## Advantages and disadvantages of SVM
### Advantages:
 - **Effective in high-dimensional spaces**: SVM works well when the number of features is very large, such as in text classification or gene expression data.
 - **Works well with a clear margin of separation**: SVM is ideal when the classes are clearly separable with a large margin.
 - **Memory efficient**: Only the support vectors are used in the decision function, which reduces the model size.
 - **Versatile with kernels**: Through the kernel trick, SVM can model complex non-linear decision boundaries without explicitly transforming the data.
 - **Robust to overfitting (with the right C value)**: Especially in high-dimensional spaces and with appropriate regularization, SVMs generalize well.

### Disadvantages
 - **Not suitable for large datasets**: Training time and memory usage scale poorly with the number of samples, especially in non-linear SVMs.
 - **Less effective when classes overlap**: SVM performs poorly when there is significant noise and class overlap, unless carefully tuned.
 - **Difficult to tune**: Choosing the right kernel, regularization parameter 𝐶, and kernel-specific parameters (like 𝛾 in RBF) can be complex and time-consuming.
 - **No direct probabilistic output**: Unlike models like logistic regression, SVM doesn’t naturally provide class probabilities (though they can be approximated).
 - **Performance degrades with imbalanced data**: Without adjustment (e.g. class weights), SVM may be biased toward the majority class.

## Tips on Practical Use
### 1. Kernel cache size:
For SVC, SVR, NuSVC and NuSVR, the size of the kernel cache has a strong impact on run times for larger problems. If you have enough RAM available, it is recommended to set cache_size to a higher value than the default of 200(MB), such as 500(MB) or 1000(MB).
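
As a small sketch of this tip, the kernel cache is set through the `cache_size` constructor argument (in MB). The value of 1000 below is only an example and should be chosen according to the available RAM.

```python
from sklearn.svm import SVC

default_clf = SVC(kernel="rbf")                  # uses the default cache size
big_cache_clf = SVC(kernel="rbf", cache_size=1000)

print("default cache_size (MB):", default_clf.cache_size)
print("custom cache_size (MB):", big_cache_clf.cache_size)
```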

### 2. Randomness of the underlying implementations:
The underlying implementations of SVC and NuSVC use a random number generator only to shuffle the data for probability estimation (when probability is set to True). This randomness can be controlled with the random_state parameter. If probability is set to False these estimators are not random and random_state has no effect on the results. The underlying OneClassSVM implementation is similar to the ones of SVC and NuSVC. As no probability estimation is provided for OneClassSVM, it is not random.

The underlying LinearSVC implementation uses a random number generator to select features when fitting the model with a dual coordinate descent (i.e. when dual is set to True). It is thus not uncommon to have slightly different results for the same input data. If that happens, try with a smaller tol parameter. This randomness can also be controlled with the random_state parameter. When dual is set to False the underlying implementation of LinearSVC is not random and random_state has no effect on the results.
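
A minimal sketch of controlling this randomness, assuming `LinearSVC` fitted with `dual=True`. The helper `fit_coef`, the dataset and the seed values are hypothetical and used only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def fit_coef(seed):
    """Fit LinearSVC with dual coordinate descent and return its weights."""
    clf = LinearSVC(dual=True, random_state=seed, max_iter=10000)
    return clf.fit(X, y).coef_

# Same seed: the fit is reproducible.
same_seed_match = bool(np.allclose(fit_coef(0), fit_coef(0)))
# Different seeds: results may differ slightly, depending on tol.
max_diff = float(np.max(np.abs(fit_coef(0) - fit_coef(1))))

print("same seed gives same weights:", same_seed_match)
print("largest weight difference across seeds:", max_diff)
```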

### 3. Parameter nu:
The parameter nu in NuSVC/OneClassSVM/NuSVR approximates the fraction of training errors and support vectors.

### 4. Unbalanced data:
In SVC, if the data is unbalanced (e.g. many positive and few negative), set class_weight='balanced' and/or try different penalty parameters C.
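
A short sketch of the class weighting tip, assuming a synthetic dataset with roughly a 90/10 class split. With `class_weight='balanced'`, `SVC` exposes the per-class multipliers of `C` in its `class_weight_` attribute, and the rarer class receives the larger one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Roughly 90% of samples in class 0 and 10% in class 1 (assumption).
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)

clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)

counts = np.bincount(y)
minority, majority = int(np.argmin(counts)), int(np.argmax(counts))

print("class counts:", counts)
print("per-class multipliers of C:", clf.class_weight_)
```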



<!-- ## Implementation of SVM in sklearn

We may take sklearn for granted. Sklearn library is what we use to make our prediction for our machine learnign model. For the implementation of the SVM, sklearn uses two library to handle all the computations. they are **libsvm** and **liblinear**. Internally both the library(libsvm and liblinear) uses *C* and *Cython* for the computation. Let's discuss **libsvm** and **liblinear** briefly.

**LIBSVM** is a library explicitly designed for the support vecor machine. It is capable fo performing classificationa nd regression task. in addition to that, LIBSVM is also capable of other variants of the svm such as nuSVM, linear svm etc.  
The fact that makes LIBSVM so popular and versatile is that it makes the use of **SMO** algorithm -->
