<h1 id="Contents">Contents<a href="#Contents"></a></h1>
        <ol>
        <li><a class="" href="#Support-Vector-Machine">Support Vector Machine</a></li>
<ol><li><a class="" href="#Mathematical-Formulation">Mathematical Formulation</a></li>
<ol><li><a class="" href="#SVC">SVC</a></li>
<li><a class="" href="#LinearSVC">LinearSVC</a></li>
<li><a class="" href="#SVR">SVR</a></li>
<li><a class="" href="#LinearSVR">LinearSVR</a></li>
</ol><li><a class="" href="#Classifier">Classifier</a></li>
<ol><li><a class="" href="#Multi-class-classification">Multi-class classification</a></li>
</ol><li><a class="" href="#Regression">Regression</a></li>
<ol><li><a class="" href="#Complexity">Complexity</a></li>
</ol><li><a class="" href="#Kernel">Kernel</a></li>
<ol><li><a class="" href="#Parameters-of-Kernel">Parameters of Kernel</a></li>
<li><a class="" href="#Using-Python-functions-as-kernels">Using Python functions as kernels</a></li>
</ol><li><a class="" href="#SVC">SVC</a></li>
<ol><li><a class="" href="#Parameters-of-SVC">Parameters of SVC</a></li>
<li><a class="" href="#Attributes-of-SVC">Attributes of SVC</a></li>
</ol><li><a class="" href="#SVR">SVR</a></li>
<ol><li><a class="" href="#Parameters-of-SVR">Parameters of SVR</a></li>
<li><a class="" href="#Attributes-of-SVR">Attributes of SVR</a></li>
</ol><li><a class="" href="#LinearSVC">LinearSVC</a></li>
<ol><li><a class="" href="#Attributes-of-LinearSVC">Attributes of LinearSVC</a></li>
</ol><li><a class="" href="#LinearSVR">LinearSVR</a></li>
<ol><li><a class="" href="#Parameters-of-LinearSVR">Parameters of LinearSVR</a></li>
<li><a class="" href="#Attributes-of-LinearSVR">Attributes of LinearSVR</a></li>
</ol><li><a class="" href="#NuSVC">NuSVC</a></li>
<ol><li><a class="" href="#Parameters-of-NuSVC">Parameters of NuSVC</a></li>
<li><a class="" href="#Attributes-of-NuSVC">Attributes of NuSVC</a></li>
</ol><li><a class="" href="#NuSVR">NuSVR</a></li>
<ol><li><a class="" href="#Parameters-of-NuSVR">Parameters of NuSVR</a></li>
<li><a class="" href="#Attributes-of-NuSVR">Attributes of NuSVR</a></li>
</ol>

# Support Vector Machine

Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.

The advantages of support vector machines are:

* Effective in high dimensional spaces.

* Still effective in cases where number of dimensions is greater than the number of samples.

* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.

* Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

* If the number of features is much greater than the number of samples, avoiding over-fitting when choosing the kernel function and regularization term is crucial.

* SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation (see the `probability` parameter of `SVC`, below).

## Mathematical Formulation

A support vector machine constructs a hyper-plane or set of hyper-planes in a high or infinite dimensional space, which can be used for classification, regression or other tasks. Intuitively, a good separation is achieved by the hyper-plane that has the largest distance to the nearest training data points of any class (so-called functional margin), since in general the larger the margin the lower the generalization error of the classifier. The figure below shows the decision function for a linearly separable problem, with three samples on the margin boundaries, called “support vectors”.

![](https://scikit-learn.org/stable/_images/sphx_glr_plot_separating_hyperplane_001.png)

>In general, when the problem isn’t linearly separable, the support vectors are the samples within the margin boundaries.

### SVC

Given training vectors $x_i \in \mathbb{R}^p$, $i=1, \ldots, n$, in two classes, and a vector $y \in \{1, -1\}^n$, our goal is to find $w \in \mathbb{R}^p$ and $b \in \mathbb{R}$ such that the prediction given by $\text{sign} (w^T\phi(x) + b)$ is correct for most samples.

SVC solves the following primal problem:
$$
\begin{align*}\begin{aligned}\min_ {w, b, \zeta} \frac{1}{2} w^T w + C \sum_{i=1}^{n} \zeta_i\\\begin{split}\textrm {subject to } & y_i (w^T \phi (x_i) + b) \geq 1 - \zeta_i,\\
& \zeta_i \geq 0, i=1, ..., n\end{split}\end{aligned}\end{align*}
$$

Intuitively, we’re trying to maximize the margin (by minimizing $||w||^2 = w^Tw$), while incurring a penalty when a sample is misclassified or within the margin boundary. Ideally, the value $y_i (w^T \phi (x_i) + b)$ would be $\ge 1$ for all samples, which indicates a perfect prediction. But problems are not always perfectly separable with a hyperplane, so we allow some samples to be at a distance $\zeta_i$ from their correct margin boundary. The penalty term `C` controls the strength of this penalty, and as a result, acts as an inverse regularization parameter.

The same problem can also be written as:
$$\begin{align*}\begin{aligned}\min_{\alpha} \frac{1}{2} \alpha^T Q \alpha - e^T \alpha\\\begin{split}
\textrm {subject to } & y^T \alpha = 0\\
& 0 \leq \alpha_i \leq C, i=1, ..., n\end{split}\end{aligned}\end{align*}$$

where $e$ is the vector of all ones, and $Q$ is an $n$ by $n$ positive semidefinite matrix, $Q_{ij} \equiv y_i y_j K(x_i, x_j)$, where $K(x_i, x_j) = \phi (x_i)^T \phi (x_j)$ is the kernel. The terms $\alpha_i$ are called the dual coefficients, and they are upper-bounded by `C`. This dual representation highlights the fact that training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.

Once the optimization problem is solved, the output of decision_function for a given sample $x$ becomes:
$$\sum_{i\in SV} y_i \alpha_i K(x_i, x) + b,$$
and the predicted class corresponds to its sign. We only need to sum over the support vectors (i.e. the samples that lie within the margin) because the dual coefficients $\alpha_i$ are zero for the other samples.

These parameters can be accessed through the attributes `dual_coef_`, which holds the product $y_i \alpha_i$, `support_vectors_`, which holds the support vectors, and `intercept_`, which holds the independent term $b$.
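
As a rough check of the formula above, the decision function can be rebuilt by hand from the fitted attributes. This is a minimal sketch for a binary problem; the synthetic dataset and the fixed `gamma` value are assumptions chosen only so the kernel can be recomputed explicitly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Illustrative binary dataset (not from the original text)
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

gamma = 0.5  # fixed so that the kernel can be recomputed by hand
clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X, y)

# K[i, j] = K(support_vector_i, x_j)
K = rbf_kernel(clf.support_vectors_, X, gamma=gamma)

# dual_coef_ has shape (1, n_SV) for a binary problem and stores y_i * alpha_i
manual_decision = (clf.dual_coef_ @ K).ravel() + clf.intercept_[0]

print(np.allclose(manual_decision, clf.decision_function(X)))
```

Only the support vectors enter the sum, which is why the model can discard the rest of the training set after fitting.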
 

### LinearSVC

The primal problem can be equivalently formulated as
$$\min_ {w, b} \frac{1}{2} w^T w + C \sum_{i=1}^{n}\max(0, 1 - y_i (w^T \phi(x_i) + b)),$$
where we make use of the hinge loss. This is the form that is directly optimized by `LinearSVC`, but unlike the dual form, this one does not involve inner products between samples, so the famous kernel trick cannot be applied. This is why only the linear kernel is supported by `LinearSVC`.
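
To make the hinge-loss form concrete, the objective can be evaluated directly with NumPy. This is only a sketch: the helper name `hinge_objective`, the toy data and the candidate `(w, b)` are all hypothetical, and $\phi$ is taken to be the identity (linear case).

```python
import numpy as np


def hinge_objective(w, b, X, y, C=1.0):
    """Evaluate 0.5 * w^T w + C * sum(max(0, 1 - y_i (w^T x_i + b)))."""
    margins = y * (X @ w + b)
    hinge = np.maximum(0.0, 1.0 - margins)  # equals the slack zeta_i at the optimum
    return 0.5 * (w @ w) + C * hinge.sum()


# Toy data: labels must be in {-1, +1}
X = np.array([[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0], [0.5, 1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 0.0])
b = 0.0

objective = hinge_objective(w, b, X, y, C=1.0)
print(objective)  # margins are [2, 0.5, 1, -0.5], so the hinge terms are [0, 0.5, 0, 1.5]
```

Samples with margin at least 1 contribute nothing; the second sample is inside the margin and the fourth is misclassified, so both are penalized.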

### SVR

Given training vectors $x_i \in \mathbb{R}^p$, $i=1, \ldots, n$, and a vector $y \in \mathbb{R}^n$, $\varepsilon$-SVR solves the following primal problem:
$$
\begin{align*}\begin{aligned}\min_ {w, b, \zeta, \zeta^*} \frac{1}{2} w^T w + C \sum_{i=1}^{n} (\zeta_i + \zeta_i^*)\\\begin{split}\textrm {subject to } & y_i - w^T \phi (x_i) - b \leq \varepsilon + \zeta_i,\\
                      & w^T \phi (x_i) + b - y_i \leq \varepsilon + \zeta_i^*,\\
                      & \zeta_i, \zeta_i^* \geq 0, i=1, ..., n\end{split}\end{aligned}\end{align*}
$$

Here, we are penalizing samples whose prediction is at least $\varepsilon$ away from their true target. These samples penalize the objective by $\zeta_i$ or $\zeta_i^*$, depending on whether their predictions lie above or below the $\varepsilon$ tube.

The dual problem is:
$$
\begin{align*}\begin{aligned}\min_{\alpha, \alpha^*} \frac{1}{2} (\alpha - \alpha^*)^T Q (\alpha - \alpha^*) + \varepsilon e^T (\alpha + \alpha^*) - y^T (\alpha - \alpha^*)\\\begin{split}
\textrm {subject to } & e^T (\alpha - \alpha^*) = 0\\
& 0 \leq \alpha_i, \alpha_i^* \leq C, i=1, ..., n\end{split}\end{aligned}\end{align*}
$$

where $e$ is the vector of all ones, $Q$ is an $n$ by $n$ positive semidefinite matrix, $Q_{ij} \equiv K(x_i, x_j) = \phi (x_i)^T \phi (x_j)$ is the kernel. Here training vectors are implicitly mapped into a higher (maybe infinite) dimensional space by the function $\phi$.

The prediction is:
$$
\sum_{i \in SV}(\alpha_i - \alpha_i^*) K(x_i, x) + b
$$
These parameters can be accessed through the attributes `dual_coef_`, which holds the difference $\alpha_i - \alpha_i^*$, `support_vectors_`, which holds the support vectors, and `intercept_`, which holds the independent term $b$.
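
The same reconstruction can be sketched for regression. The noisy sine data and the hyperparameter values below are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVR

# Illustrative 1-D regression data (noisy sine curve)
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(60, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(60)

gamma = 0.5  # fixed so that the kernel can be recomputed by hand
reg = SVR(kernel="rbf", gamma=gamma, C=1.0, epsilon=0.1).fit(X, y)

# dual_coef_ stores alpha_i - alpha_i^* for the support vectors only
K = rbf_kernel(reg.support_vectors_, X, gamma=gamma)
manual_prediction = (reg.dual_coef_ @ K).ravel() + reg.intercept_[0]

print(np.allclose(manual_prediction, reg.predict(X)))
```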

### LinearSVR

The primal problem can equivalently be written as:
$$\min_ {w, b} \frac{1}{2} w^T w + C \sum_{i=1}^{n}\max(0, |y_i - (w^T \phi(x_i) + b)| - \varepsilon),$$
where we make use of the epsilon-insensitive loss, i.e. errors of less than $\varepsilon$ are ignored. This is the form that is directly optimized by `LinearSVR`.
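
A small NumPy sketch of the epsilon-insensitive loss may help here. The helper name `epsilon_insensitive` and the sample values are hypothetical.

```python
import numpy as np


def epsilon_insensitive(y_true, y_pred, epsilon):
    """Per-sample loss max(0, |y - f(x)| - epsilon)."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)


y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.05, 2.5, 2.0])

losses = epsilon_insensitive(y_true, y_pred, epsilon=0.1)
print(losses)  # the first error (0.05) is inside the tube and costs nothing
```

The absolute errors are 0.05, 0.5 and 1.0, so with $\varepsilon = 0.1$ the losses are 0, 0.4 and 0.9.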

## Classifier

`SVC`, `NuSVC` and `LinearSVC` are classes capable of performing binary and multi-class classification on a dataset. `SVC` and `NuSVC` are similar methods, but accept slightly different sets of parameters and have different mathematical formulations (see section Mathematical formulation). On the other hand, `LinearSVC` is another (faster) implementation of Support Vector Classification for the case of a linear kernel.

### Multi-class classification

`SVC` and `NuSVC` implement the “one-versus-one” approach for multi-class classification. In total, `n_classes * (n_classes - 1) / 2` classifiers are constructed, and each one is trained on data from two classes. To provide a consistent interface with other classifiers, the `decision_function_shape` option allows you to monotonically transform the results of the “one-versus-one” classifiers to a “one-vs-rest” decision function of shape `(n_samples, n_classes)`.

On the other hand, `LinearSVC` implements “one-vs-the-rest” multi-class strategy, thus training n_classes models.
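
The difference between the two strategies shows up in the output shapes. The sketch below uses a synthetic four-class dataset (an assumption, chosen because with three classes both shapes would coincide).

```python
from sklearn.datasets import make_blobs
from sklearn.svm import SVC, LinearSVC

# Four classes, so one-vs-one needs 4 * 3 / 2 = 6 binary classifiers
X, y = make_blobs(n_samples=200, centers=4, n_features=2, random_state=0)

ovo = SVC(kernel="linear", decision_function_shape="ovo").fit(X, y)
ovr = SVC(kernel="linear", decision_function_shape="ovr").fit(X, y)
lin = LinearSVC(dual=False).fit(X, y)  # one-vs-the-rest: one model per class

ovo_shape = ovo.decision_function(X).shape
ovr_shape = ovr.decision_function(X).shape
lin_shape = lin.decision_function(X).shape

print(ovo_shape, ovr_shape, lin_shape)
```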

## Regression

The method of Support Vector Classification can be extended to solve regression problems. This method is called Support Vector Regression.

The model produced by support vector classification (as described above) depends only on a subset of the training data, because the cost function for building the model does not care about training points that lie beyond the margin. Analogously, the model produced by Support Vector Regression depends only on a subset of the training data, because the cost function ignores samples whose prediction is close to their target.

There are three different implementations of Support Vector Regression: `SVR`, `NuSVR` and `LinearSVR`. `LinearSVR` provides a faster implementation than `SVR` but only considers the linear kernel, while `NuSVR` implements a slightly different formulation than `SVR` and `LinearSVR`.

### Complexity

Support Vector Machines are powerful tools, but their compute and storage requirements increase rapidly with the number of training vectors. The core of an SVM is a quadratic programming problem (QP), separating support vectors from the rest of the training data. The QP solver used by the libsvm-based implementation scales between $O(n_{features} \times n_{samples}^2)$ and $O(n_{features} \times n_{samples}^3)$ depending on how efficiently the libsvm cache is used in practice (dataset dependent). If the data is very sparse, $n_{features}$ should be replaced by the average number of non-zero features in a sample vector.

For the linear case, the algorithm used in `LinearSVC` by the liblinear implementation is much more efficient than its libsvm-based `SVC` counterpart and can scale almost linearly to millions of samples and/or features.

## Kernel

Kernels are used to transform the training vectors into a higher dimensional space. The kernel function is used to compute the dot product between two vectors. Here are some of the kernels implemented in scikit-learn:
* Linear: $K(x, y) = x^T y$
* Polynomial: $K(x, y) = (\gamma x^T y + r)^{d}$
* RBF: $K(x, y) = \exp(-\gamma \|x - y\|^2)$
* Sigmoid: $K(x, y) = \tanh(\gamma x^T y + r)$

Here, $d$ is specified by the `degree` parameter, $\gamma$ is specified by the `gamma` parameter, and $r$ is specified by the `coef0` parameter.
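
As a sanity check, the linear, polynomial and RBF formulas can be written out in NumPy and compared against the helpers in `sklearn.metrics.pairwise`. The random inputs and the values of `gamma`, `r` and `d` are arbitrary assumptions.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

rng = np.random.RandomState(0)
X = rng.randn(5, 3)
Y = rng.randn(4, 3)
gamma, r, d = 0.5, 1.0, 3  # map to the `gamma`, `coef0` and `degree` parameters

# Hand-written kernels, following the formulas term by term
K_linear = X @ Y.T
K_poly = (gamma * (X @ Y.T) + r) ** d
sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
K_rbf = np.exp(-gamma * sq_dists)

print(np.allclose(K_linear, linear_kernel(X, Y)))
print(np.allclose(K_poly, polynomial_kernel(X, Y, degree=d, gamma=gamma, coef0=r)))
print(np.allclose(K_rbf, rbf_kernel(X, Y, gamma=gamma)))
```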

### Parameters of Kernel

The parameter `C`, common to all SVM kernels, trades off misclassification of training examples against simplicity of the decision surface. A low `C` makes the decision surface smooth, while a high `C` aims at classifying all training examples correctly. `gamma` defines how much influence a single training example has. The larger `gamma` is, the closer other examples must be to be affected.

Proper choice of `C` and `gamma` is critical to the SVM’s performance.
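
One common way to pick `C` and `gamma` is a cross-validated grid search. The following is a minimal sketch: the dataset and the grid values are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Illustrative non-linearly-separable dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=0)

# Logarithmically spaced candidates are a typical starting point
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

If the best values sit on the edge of the grid, it is usually worth extending the grid in that direction and searching again.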

### Using Python functions as kernels

You can define your own kernel by passing a function to the `kernel` parameter.

Your kernel must take as arguments two matrices of shape `(n_samples_1, n_features)`, `(n_samples_2, n_features)` and return a kernel matrix of shape `(n_samples_1, n_samples_2)`.

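```python
import numpy as np
from sklearn import svm


def my_kernel(X, Y):
    # Linear kernel: returns the Gram matrix of shape (n_samples_1, n_samples_2)
    return np.dot(X, Y.T)


# Pass the callable instead of a kernel name
clf = svm.SVC(kernel=my_kernel)
```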
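
Since this particular callable computes a plain dot product, a classifier using it should behave like one fitted with `kernel="linear"`. The sketch below checks this on the iris dataset (the dataset choice is an assumption).

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris


def my_kernel(X, Y):
    # Same linear kernel as above: Gram matrix of shape (n_samples_1, n_samples_2)
    return np.dot(X, Y.T)


X, y = load_iris(return_X_y=True)

custom = svm.SVC(kernel=my_kernel).fit(X, y)
builtin = svm.SVC(kernel="linear").fit(X, y)

# Fraction of samples on which the two models predict the same class
agreement = np.mean(custom.predict(X) == builtin.predict(X))
print(agreement)
```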

## SVC

### Parameters of SVC

Here are some of the parameters of the `SVC` class:
* **C**: float, optional (default=1.0)
    
    Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
* **kernel**: {‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’} or callable, (default='rbf')
    
    Specifies the kernel type to be used in the algorithm. It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to precompute the kernel matrix.
* **degree**: int, optional (default=3)
  
    Degree of the polynomial kernel function (‘poly’). Ignored by all other kernels.
* **gamma**: {‘scale’, ‘auto’} or float, optional (default=’scale’)
    
    Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. 
    * if ‘scale’ (default), uses 1 / (n_features * X.var()),

    * if ‘auto’, uses 1 / n_features.
* **coef0**: float, optional (default=0.0)
    
    Independent term in kernel function. It is only significant in ‘poly’ and ‘sigmoid’.
* **probability**: boolean, optional (default=False)
  
    Whether to enable probability estimates. This must be enabled prior to calling `fit`. It slows down `fit` because it internally uses 5-fold cross-validation, and `predict_proba` may be inconsistent with `predict`.
* **tol**: float, optional (default=1e-3)
    
    Tolerance for stopping criterion.
* **cache_size**: float, optional (default=200.0)

    Specify the size of the kernel cache (in MB). 
* **class_weight**: {dict, ‘balanced’}, optional (default=None)
    
    Set the parameter C of class i to class_weight[i]*C for SVC. If not given, all classes are supposed to have weight one. The ‘balanced’ mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as `n_samples / (n_classes * np.bincount(y))`  
* **verbose**: bool, optional (default=False)
  
    Enable verbose output. Note that this setting takes advantage of a per-process runtime setting in libsvm that, if enabled, may not work properly in a multithreaded context.
* **max_iter**: int, optional (default=-1)
  
    Hard limit on iterations within solver, or -1 for no limit.
* **decision_function_shape**: {‘ovo’, ‘ovr’}, optional (default=’ovr’)
    
    Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) or one-vs-one (‘ovo’) decision function of shape (n_samples, n_classes * (n_classes - 1) / 2).

    However, note that internally, one-vs-one (‘ovo’) is always used as a multi-class strategy to train models; an ovr matrix is only constructed from the ovo matrix. The parameter is ignored for binary classification.
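
The two string options for `gamma` can be reproduced by hand, which is a quick way to see what they mean. This sketch assumes a dense input array and a scikit-learn version in which `'scale'` uses `X.var()`; the dataset is synthetic.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
n_features = X.shape[1]

gamma_scale = 1.0 / (n_features * X.var())  # what gamma='scale' computes
gamma_auto = 1.0 / n_features               # what gamma='auto' computes

by_name = SVC(kernel="rbf", gamma="scale").fit(X, y)
by_value = SVC(kernel="rbf", gamma=gamma_scale).fit(X, y)

# Both models should produce the same decision values
same_scale = np.allclose(by_name.decision_function(X), by_value.decision_function(X))

auto_name = SVC(kernel="rbf", gamma="auto").fit(X, y)
auto_value = SVC(kernel="rbf", gamma=gamma_auto).fit(X, y)
same_auto = np.allclose(auto_name.decision_function(X), auto_value.decision_function(X))

print(same_scale, same_auto)
```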

### Attributes of SVC

Here are some of the attributes of the `SVC` class:
* **support_**: array-like, shape = [n_SV]
    
    Indices of support vectors.
* **support_vectors_**: array-like, shape = [n_SV, n_features]
  
    Support vectors.
* **class_weight_**: array-like, shape = [n_classes]
    
    Multipliers of parameter C for each class. Computed based on the `class_weight` parameter.
* **classes_**: array-like, shape = [n_classes]
    
    The class labels.
* **coef_**: ndarray of shape (n_classes * (n_classes - 1) / 2, n_features)
    
    Weights assigned to the features when `kernel="linear"`.

* **dual_coef_**: ndarray of shape (n_classes -1, n_SV)
    
    Coefficients of the support vectors in the decision function.
* **intercept_**: ndarray of shape (n_classes * (n_classes - 1) / 2,)
    
    Constants in decision function.
* **probA_**: ndarray of shape (n_classes * (n_classes - 1) / 2)
    
    Parameter learned in Platt scaling when probability=True.
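
The shapes listed above can be inspected on a fitted model. This is a sketch on a synthetic four-class problem (an assumption), using a linear kernel so that `coef_` is available; note that `support_` holds indices into the training set.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Four classes and three features, so n_classes * (n_classes - 1) / 2 = 6
X, y = make_blobs(n_samples=120, centers=4, n_features=3, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

n_sv = clf.support_.shape[0]

print("support_        ", clf.support_.shape)          # (n_SV,)
print("support_vectors_", clf.support_vectors_.shape)  # (n_SV, n_features)
print("classes_        ", clf.classes_.shape)          # (n_classes,)
print("coef_           ", clf.coef_.shape)             # (6, n_features)
print("dual_coef_      ", clf.dual_coef_.shape)        # (n_classes - 1, n_SV)
print("intercept_      ", clf.intercept_.shape)        # (6,)

# The support vectors are the training rows selected by support_
rows_match = np.allclose(clf.support_vectors_, X[clf.support_])
```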

## SVR

### Parameters of SVR

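The `SVR` class shares the parameters `C`, `kernel`, `degree`, `gamma`, `coef0`, `tol`, `cache_size`, `verbose` and `max_iter` with the `SVC` class. It adds `epsilon` (float, default=0.1), the width of the epsilon-tube within which no penalty is applied, and it does not have the classification-specific parameters `probability`, `class_weight` and `decision_function_shape`.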

### Attributes of SVR

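The `SVR` class exposes the same model attributes as the `SVC` class: `support_`, `support_vectors_`, `dual_coef_`, `intercept_`, and `coef_` for a linear kernel. The classification-specific attributes `classes_` and `probA_` do not apply.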

## LinearSVC

Here are some of the parameters of the `LinearSVC` class:
* **penalty**: {‘l1’, ‘l2’}, optional (default=’l2’)
    
    Specifies the norm used in the penalization. The ‘l2’ penalty is the standard used in SVC. The ‘l1’ leads to coef_ vectors that are sparse.
* **loss**: {‘hinge’, ‘squared_hinge’}, optional (default=’squared_hinge’)
  
    Specifies the loss function. ‘hinge’ is the standard SVM loss (used e.g. by the SVC class) while ‘squared_hinge’ is the square of the hinge loss. The combination of `penalty='l1'` and `loss='hinge'` is not supported.
* **dual**: boolean, optional (default=True)
    
    Whether to solve the dual or the primal optimization problem. Prefer dual=False when n_samples > n_features.

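The remaining parameters `C`, `tol`, `class_weight`, `verbose` and `max_iter` have the same meaning as for the `SVC` class, although some defaults differ. The kernel-related parameters (`kernel`, `degree`, `gamma`, `coef0`) do not apply.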
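
The effect of the `penalty` choice can be seen by counting exact zeros in `coef_`. This is a sketch on synthetic data where most features are uninformative; the dataset and the value of `C` are assumptions. Note that `penalty='l1'` requires `dual=False`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# 30 features, only 5 of which carry signal
X, y = make_classification(
    n_samples=200, n_features=30, n_informative=5, n_redundant=0, random_state=0
)

common = dict(loss="squared_hinge", dual=False, C=0.05, max_iter=10000)
l1 = LinearSVC(penalty="l1", **common).fit(X, y)
l2 = LinearSVC(penalty="l2", **common).fit(X, y)

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))

print("zero coefficients with l1:", n_zero_l1)
print("zero coefficients with l2:", n_zero_l2)
```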

### Attributes of LinearSVC

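`LinearSVC` exposes `coef_`, `intercept_` and `classes_`, with the same meaning as for the `SVC` class, except that `coef_` and `intercept_` hold one row per class (one-vs-rest), or a single row for binary problems. It does not expose `support_`, `support_vectors_` or `dual_coef_`.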

## LinearSVR

### Parameters of LinearSVR

The `LinearSVR` and the `SVR` classes have a lot of parameters in common. There are, however, some differences:
* **epsilon**: float, optional (default=0.0)
    
    Epsilon parameter in the epsilon-insensitive loss function.
* **loss**: {‘epsilon_insensitive’, ‘squared_epsilon_insensitive’}, optional (default=’epsilon_insensitive’)
  
    Specifies the loss function. The epsilon-insensitive loss (standard SVR) is the L1 loss, while the squared epsilon-insensitive loss (‘squared_epsilon_insensitive’) is the L2 loss.
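
A short sketch of `LinearSVR` on synthetic linear data follows. The true weights, noise level and hyperparameters are assumptions chosen for illustration; `dual=True` is passed explicitly because the plain epsilon-insensitive loss is solved in the dual.

```python
import numpy as np
from sklearn.svm import LinearSVR

# Synthetic data generated from a known linear model plus small noise
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + 0.3 + 0.1 * rng.randn(200)

reg = LinearSVR(
    epsilon=0.05,
    C=10.0,
    loss="epsilon_insensitive",
    dual=True,
    max_iter=10000,
    random_state=0,
).fit(X, y)

print(reg.coef_, reg.intercept_)
```

With little noise and a large `C`, the fitted `coef_` should land close to `true_w`.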

### Attributes of LinearSVR

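`LinearSVR` exposes `coef_` and `intercept_`, with the same meaning as for the `SVR` class. It does not expose `support_`, `support_vectors_` or `dual_coef_`.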

## NuSVC

### Parameters of NuSVC

Same as `SVC`, except that the `C` parameter is replaced by:
* **nu**: float, optional (default=0.5)
    
    An upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. Should be in the interval (0, 1].
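
The lower bound on the fraction of support vectors can be observed empirically. The sketch below fits `NuSVC` for a few values of `nu` on a synthetic dataset (an assumption) and records the fraction of training samples kept as support vectors.

```python
from sklearn.datasets import make_classification
from sklearn.svm import NuSVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

sv_fraction = {}
for nu in (0.1, 0.3, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma="scale").fit(X, y)
    # support_ holds the indices of the support vectors
    sv_fraction[nu] = clf.support_.size / X.shape[0]
    print(f"nu={nu}: fraction of support vectors = {sv_fraction[nu]:.2f}")
```

Larger values of `nu` force more samples to become support vectors, which plays a role comparable to lowering `C` in `SVC`.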

### Attributes of NuSVC

Same as for the `SVC` class.

## NuSVR

### Parameters of NuSVR

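Same as `SVR`, except that `epsilon` is replaced by `nu` (float, default=0.5): an upper bound on the fraction of training errors and a lower bound on the fraction of support vectors. Should be in the interval (0, 1].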

### Attributes of NuSVR

Same as for the `SVR` class.