# Support Vector Machine

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.<br>
The advantages of support vector machines are:

* Effective in high dimensional spaces.
* Still effective in cases where the number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Versatile: different kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

* If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial.
* SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).
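
As a rough illustration of the last point, the sketch below assumes scikit-learn's `SVC` and its `probability=True` option. The toy data and variable names are invented for this example. Signed decision scores are always available, while probability estimates have to be requested at fit time, which is where the extra cross-validation cost is paid.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical toy data: two Gaussian blobs with 40 samples each.
X = np.vstack([rng.normal(loc=-2.0, size=(40, 2)),
               rng.normal(loc=2.0, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

# decision_function is always available and returns signed scores.
clf = SVC(kernel="rbf").fit(X, y)
scores = clf.decision_function(X[:3])

# probability=True requests probability estimates; fitting becomes slower
# because of the additional internal cross-validation.
prob_clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X, y)
proba = prob_clf.predict_proba(X[:3])

print("decision scores:", scores)
print("class probabilities:", proba)
```

If only a ranking or a hard label is needed, the decision scores are usually enough and the slower fit can be avoided.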





###  Understanding the Terminology

* **Samples (or Data Points, $n$):** This is the number of rows in your dataset. If you have 100 customer records, you have 100 samples.
* **Dimensions (or Features, $p$):** This is the number of columns in your dataset (the variables used to describe each sample). If you record a customer's age, income, and purchase frequency, you have 3 dimensions/features.

###  What is the Problem When $p > n$? (More Dimensions than Samples)

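In many traditional statistical models and machine learning algorithms, having **many more features ($p$) than samples ($n$)** creates difficulties commonly associated with the **"curse of dimensionality"**. The fitting problem can also become **ill-posed**, because there are more unknowns to estimate than observations to estimate them from.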

* **Overfitting:** With too many features, the model can essentially memorize the few training samples perfectly (including their noise), leading to extremely poor performance on new, unseen data.
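* **Computational Cost:** Training and prediction cost grows with the number of features, and for methods that need to cover the feature space densely the amount of data and computation required can grow exponentially with $p$.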
* **Statistical Instability:** The model often becomes unstable because there's not enough data to reliably estimate the relationships for all those features.
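
The overfitting point can be made concrete with a small sketch. It uses ordinary least squares (not an SVM) on purely random data, chosen here only because it shows the effect in a few lines. The sizes `n = 20` and `p = 100` are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100  # fewer samples than features

# Features and targets are independent noise, so there is nothing to learn.
X_train = rng.normal(size=(n, p))
y_train = rng.normal(size=n)
X_test = rng.normal(size=(n, p))
y_test = rng.normal(size=n)

# Minimum-norm least-squares solution. With p > n and a full-rank X_train,
# it reproduces the training targets exactly.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

train_mse = np.mean((X_train @ w - y_train) ** 2)
test_mse = np.mean((X_test @ w - y_test) ** 2)
rank = np.linalg.matrix_rank(X_train)

print("rank of X_train:", rank)
print("train MSE:", train_mse)
print("test MSE:", test_mse)
```

The training error is numerically zero even though the targets are noise, while the error on fresh data stays large. This is the memorization behaviour described above.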

###  Why are SVMs Still Effective When $p > n$?

This is where the unique mathematical formulation of the SVM comes into play. 

#### 1. Focus on the Boundary

* **SVM's Goal:** An SVM does not try to model the entire data distribution. Its primary goal is only to find the optimal **separating hyperplane**: the decision boundary that maximizes the distance (margin) to the nearest data points of each class.
* **The Key:** The hyperplane is defined *only* by a subset of the training points called the **Support Vectors** (the samples that lie on the margin or violate it). This subset is often small compared with the full training set.
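
A minimal sketch of this idea, assuming scikit-learn's `SVC` with a linear kernel and invented toy data: after fitting, the decision function can be rebuilt from the support vectors alone, using the fitted attributes `support_vectors_`, `dual_coef_` and `intercept_`.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical toy data: two well-separated Gaussian blobs.
X = np.vstack([rng.normal(loc=-3.0, size=(50, 2)),
               rng.normal(loc=3.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
n_sv = clf.support_vectors_.shape[0]
print("training samples:", len(X), "support vectors:", n_sv)

# Rebuild the decision values from the support vectors only:
# f(x) = sum_i dual_coef_i * <sv_i, x> + intercept
manual = (clf.dual_coef_ @ (clf.support_vectors_ @ X.T) + clf.intercept_).ravel()
matches = np.allclose(manual, clf.decision_function(X), atol=1e-6)
print("matches decision_function:", matches)
```

Training points that are not support vectors do not appear in the sum, so removing them would leave the fitted boundary unchanged.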

#### 2. The Power of the Kernel Trick

* **Implicit Mapping:** The **Kernel Trick** allows the SVM to operate in a very high-dimensional feature space (where the data might be separable) **without ever explicitly calculating the coordinates** of the data in that space. It only calculates the *similarity* (dot product) between pairs of data points using the kernel function.
* **Complexity:** The complexity of the SVM solution depends more on the **number of support vectors** than on the total number of features. Since the number of support vectors is often small (and cannot exceed $n$), the algorithm remains computationally feasible and statistically robust even when $p$ is very large, or when the implicit feature space is infinite-dimensional, as with the RBF (Gaussian) kernel.
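
The implicit mapping can be checked by hand for a small case. The sketch below uses the homogeneous degree-2 polynomial kernel $k(x, z) = (x \cdot z)^2$ on 2-D inputs, picked here only as an illustrative assumption because its feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$ is short enough to write out. The helper names `phi` and `poly2_kernel` are hypothetical.

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def poly2_kernel(x, z):
    """Kernel value computed in the original 2-D space, without calling phi."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

explicit = float(np.dot(phi(x), phi(z)))  # dot product in the 3-D feature space
implicit = poly2_kernel(x, z)             # same value via the kernel

print("explicit:", explicit, "implicit:", implicit)
```

Both routes give the same number, but the kernel never constructs the higher-dimensional vectors. For kernels such as RBF, the explicit route is not available at all, which is why the kernel form matters.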



## Classification

`SVC`, `NuSVC` and `LinearSVC` are classes capable of performing binary and multi-class classification on a dataset.<br>
`SVC` and `NuSVC` are similar methods, but accept slightly different sets of parameters and have different mathematical formulations (see section Mathematical formulation). `LinearSVC`, on the other hand, is another (faster) implementation of Support Vector Classification for the case of a linear kernel. It also lacks some of the attributes of `SVC` and `NuSVC`, like `support_`. `LinearSVC` uses the `squared_hinge` loss and, due to its implementation in `liblinear`, it also regularizes the intercept, if considered. This effect can, however, be reduced by carefully tuning its `intercept_scaling` parameter, which allows the intercept term to have a different regularization behavior compared to the other features. The classification results and score can therefore differ from the other two classifiers.
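
A short side-by-side sketch of the three classes, assuming scikit-learn and invented toy data. The parameter values (`C=1.0`, `nu=0.5`, `intercept_scaling=1.0`) are illustrative choices, not recommendations.

```python
import numpy as np
from sklearn.svm import SVC, NuSVC, LinearSVC

rng = np.random.default_rng(0)
# Hypothetical toy data: two Gaussian blobs with 40 samples each.
X = np.vstack([rng.normal(loc=-2.0, size=(40, 2)),
               rng.normal(loc=2.0, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)

models = {
    "SVC": SVC(kernel="linear", C=1.0),
    "NuSVC": NuSVC(kernel="linear", nu=0.5),
    "LinearSVC": LinearSVC(C=1.0, loss="squared_hinge", intercept_scaling=1.0),
}

results = {}
for name, model in models.items():
    model.fit(X, y)
    # Record training accuracy and whether the fitted model exposes support_.
    results[name] = (model.score(X, y), hasattr(model, "support_"))
    print(name, results[name])
```

On data this simple the three classifiers should score similarly. The practical differences are the parameterization (`C` versus `nu`) and the fact that `LinearSVC` does not expose `support_`.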