A **Support Vector Machine** (SVM) is a very powerful and versatile Machine Learning model, capable of
performing **linear** or **nonlinear** **classification**, **regression**, and even **outlier detection**. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.

**Large margin classification**: an SVM classifier separates the classes with the widest possible margin. You can think of it as fitting the widest possible street between the classes. Adding more training instances off the street does not affect the decision boundary at all: it is fully determined (or "supported") by the instances located on the edge of the street. These instances are called **support vectors**.

SVMs are sensitive to feature scaling, so scale the features (for example with StandardScaler) before training.
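
To make the idea of support vectors concrete, here is a minimal sketch that fits a linear SVM on the scaled iris petal features and counts how many instances end up as support vectors. It uses SVC(kernel="linear") rather than LinearSVC only because SVC exposes the `support_` attribute; the variable names are illustrative assumptions.

```python
import numpy as np
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

# Scale first: SVMs are sensitive to feature scales.
X_scaled = StandardScaler().fit_transform(X)
clf = SVC(kernel="linear", C=1).fit(X_scaled, y)

# Boolean mask marking the support vectors.
is_sv = np.zeros(len(X), dtype=bool)
is_sv[clf.support_] = True
n_support = int(is_sv.sum())
print(f"{n_support} of {len(X)} instances are support vectors")

# Signed margin: >= 1 means the instance is off the street, on the correct side.
signed_margin = (2 * y - 1) * clf.decision_function(X_scaled)
print("smallest margin among non-support vectors:", signed_margin[~is_sv].min())
```

Every instance that is not a support vector should sit off the street (signed margin of about 1 or more), which is why removing it would not move the decision boundary.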

### Soft Margin Classification
The objective of soft margin classification is to find a good balance between keeping the street as large as possible and limiting the margin violations (i.e., instances that end up in the middle of the street or even on the wrong side).
In Scikit-Learn's SVM classes, this balance is controlled by the hyperparameter C.
A smaller C leads to a wider street but more margin violations. A larger C leads to a narrower street but fewer margin violations. The wider-margin model (smaller C) is often the one that generalizes better to new data, even though it makes more margin violations on the training set.

*Reducing C is a form of regularization that helps avoid overfitting.*
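
The effect of C can be checked numerically. The sketch below is an illustration, not part of the original notes: it trains two linear SVMs on the iris petal features and compares the street width, computed as 2 / ‖w‖ in the scaled feature space. It uses SVC(kernel="linear") because its solver is deterministic; the values 0.01 and 100 are arbitrary choices.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

street_width = {}
violations = {}
for C in (0.01, 100):
    model = make_pipeline(StandardScaler(), SVC(kernel="linear", C=C))
    model.fit(X, y)
    w = model.named_steps["svc"].coef_[0]
    street_width[C] = 2 / np.linalg.norm(w)  # width of the street
    # Instances clearly inside the street or on the wrong side of it.
    signed_margin = (2 * y - 1) * model.decision_function(X)
    violations[C] = int(np.sum(signed_margin < 1 - 1e-2))
    print(f"C={C}: street width={street_width[C]:.2f}, "
          f"margin violations={violations[C]}")
```

The small-C model should report the wider street, at the cost of more instances inside it.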

The following Scikit-Learn code loads the iris dataset, scales the features, and then trains a linear SVM
model (using the LinearSVC class with C = 1 and the hinge loss function, described shortly) to detect
Iris-Virginica flowers.

In [4]:
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris["data"][:, (2, 3)] # petal length, petal width
y = (iris["target"] == 2).astype(np.float64) # Iris-Virginica
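# Pass the steps as a list: recent Scikit-Learn versions expect a list of (name, estimator) pairs.
svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("linear_svc", LinearSVC(C=1, loss="hinge")),
])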

In [9]:
svm_clf.fit(X, y)  # the pipeline scales X itself, so pass the raw features
svm_clf.predict([[5.5, 1.7]])
# Unlike Logistic Regression classifiers, SVM classifiers do not output probabilities for each class.

array([ 1.])

Alternatively, you could use the **SVC class**, using **SVC(kernel="linear", C=1)**, but it is **much slower**,
especially with large training sets, so it is not recommended.
Another option is to use the **SGDClassifier class**, with **SGDClassifier(loss="hinge", alpha=1/(m\*C))**, where m is the number of training instances. This applies regular **Stochastic Gradient Descent** to train a linear SVM classifier. It does not converge as fast as the **LinearSVC class**, but it can be useful to handle huge datasets that do not fit in memory (out-of-core training), or to handle **online classification** tasks.
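
As a rough sketch of the SGDClassifier option, the code below trains a linear SVM with Stochastic Gradient Descent on the same iris task. The values of `max_iter`, `tol`, and `random_state` are assumptions chosen for reproducibility, not settings taken from the notes.

```python
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

m = len(X)  # number of training instances
C = 1       # same regularization strength as the LinearSVC example

sgd_clf = make_pipeline(
    StandardScaler(),
    # alpha = 1 / (m * C) makes the regularization comparable to an SVM with this C.
    SGDClassifier(loss="hinge", alpha=1 / (m * C),
                  max_iter=1000, tol=1e-3, random_state=42),
)
sgd_clf.fit(X, y)

train_accuracy = sgd_clf.score(X, y)
prediction = sgd_clf.predict([[5.5, 1.7]])
print("training accuracy:", train_accuracy)
print("prediction for [5.5, 1.7]:", prediction)
```

For out-of-core or online training you would call `partial_fit` on successive mini-batches instead of `fit`, in which case the scaler also has to be fitted incrementally or ahead of time.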

The **LinearSVC** class regularizes the bias term, so you should center the training set first by subtracting its mean. This is
automatic if you scale the data using the StandardScaler. Moreover, make sure you set the **loss hyperparameter to "hinge"**, as
it is not the default value. Finally, for better performance you should set the **dual hyperparameter** to **False**, unless there are more features than training instances. Note that Scikit-Learn does not support combining loss="hinge" with dual=False, so this last tip only applies when you keep the default squared hinge loss.

### Nonlinear SVM Classification
Although linear SVM classifiers are efficient and work surprisingly well in many cases, many datasets are not even close 
to being linearly separable. One approach to handling nonlinear datasets is to **add more features**, such as **polynomial features** ; in some cases this can result in a linearly separable dataset.
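
As a toy illustration of this idea (the dataset below is made up for the purpose), consider nine points on a line where the positive class sits in the middle. No single threshold separates them, but adding the squared feature makes them linearly separable. A large C is used so the SVM behaves almost like a hard margin classifier.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# One feature, positive class in the middle: not linearly separable in 1D.
X1D = np.arange(-4, 5, dtype=float).reshape(-1, 1)
y1D = np.array([0, 0, 1, 1, 1, 1, 1, 0, 0])

linear_acc = SVC(kernel="linear", C=1000).fit(X1D, y1D).score(X1D, y1D)

# Add the squared feature: columns are now x and x^2.
X2D = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X1D)
poly_acc = SVC(kernel="linear", C=1000).fit(X2D, y1D).score(X2D, y1D)

print("accuracy with x only:    ", linear_acc)
print("accuracy with x and x^2: ", poly_acc)
```

In the (x, x²) plane a horizontal line between x² = 4 and x² = 9 separates the two classes perfectly.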

In [10]:
# Let's add polynomial features to our model.

from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
polynomial_svm_clf = Pipeline([
    ("poly_features", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svm_clf", LinearSVC(C=10, loss="hinge")),
])

polynomial_svm_clf.fit(X, y)

Pipeline(steps=(('poly_features', PolynomialFeatures(degree=3, include_bias=True, interaction_only=False)), ('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', LinearSVC(C=10, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))))

### Polynomial Kernel
When using **SVMs** you can apply an almost miraculous mathematical technique called the **kernel trick**. It makes it possible to get the same result as if you had added many polynomial features, even with very high-degree polynomials, without actually having to add them. So there is no combinatorial explosion of the number of features, since you don't actually add any features.
This trick is implemented by the SVC class.
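
To see why the trick works, here is a small numeric check for the simplest case: a 2nd-degree polynomial mapping of 2D vectors. The mapping `phi` and the two vectors are illustrative choices. The dot product of the transformed vectors equals the squared dot product of the original vectors, so the transformation never has to be computed.

```python
import numpy as np

def phi(x):
    """Explicit 2nd-degree polynomial mapping of a 2D vector into 3D."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = phi(a) @ phi(b)  # dot product computed in the transformed space
kernel = (a @ b) ** 2       # same quantity computed in the original space

print("explicit mapping:", explicit)
print("kernel shortcut: ", kernel)
```

Both values agree up to floating-point rounding. SVC(kernel="poly") relies on the same kind of identity, with `gamma` and `coef0` generalizing the kernel to (gamma · aᵀb + coef0)^degree.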

In [12]:
from sklearn.svm import SVC
poly_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="poly", degree=3, coef0=1, C=5)),
])
poly_kernel_svm_clf.fit(X, y)

Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=5, cache_size=200, class_weight=None, coef0=1,
  decision_function_shape=None, degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))))

If your model is **overfitting**, you might want to **reduce the polynomial degree**. Conversely, if it is **underfitting**, you can try increasing it. The hyperparameter **coef0** controls how much the model is influenced by high-degree polynomials versus low-degree polynomials.

### Gaussian RBF Kernel
Another technique for nonlinear problems is to add similarity features, which measure how much each instance resembles a particular landmark (for example with a Gaussian Radial Basis Function).
Just like the polynomial features method, the similarity features method can be useful with any Machine
Learning algorithm, but it may be computationally expensive to compute all the additional features,
especially on large training sets. However, once again the kernel trick does its SVM magic: it makes it
possible to obtain a similar result as if you had added many similarity features, without actually having to
add them. Let's try the Gaussian RBF kernel using the SVC class.

In [15]:
rbf_kernel_svm_clf = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001)),
])
rbf_kernel_svm_clf.fit(X, y)

Pipeline(steps=(('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('svm_clf', SVC(C=0.001, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=5, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False))))

Increasing gamma makes the bell-shaped curve narrower, and as a result each instance's range of influence is smaller: the decision boundary ends up being more irregular, wiggling around individual instances. Conversely, a
small gamma value makes the bell-shaped curve wider, so instances have a larger range of influence, and the decision boundary ends up smoother. So gamma (γ) acts like a regularization hyperparameter: if your model is overfitting, you should reduce it, and if it is underfitting, you should increase it (similar to the C hyperparameter).
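
The bell-shaped curve mentioned above is the Gaussian RBF similarity, exp(−γ‖x − ℓ‖²), where ℓ is a landmark. The sketch below evaluates it for one instance and one landmark (both values are arbitrary illustrations) with a small and a large gamma, and cross-checks the result against Scikit-Learn's `rbf_kernel` helper.

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def gaussian_rbf(x, landmark, gamma):
    """Similarity between x and a landmark: 1 at the landmark, decaying to 0 far away."""
    return np.exp(-gamma * np.sum((x - landmark) ** 2))

x = np.array([-1.0])
landmark = np.array([-2.0])  # one unit away from x

similarities = {gamma: gaussian_rbf(x, landmark, gamma) for gamma in (0.3, 5.0)}
for gamma, value in similarities.items():
    print(f"gamma={gamma}: similarity={value:.4f}")

# Same computation with Scikit-Learn's pairwise helper (expects 2D arrays).
sk_value = rbf_kernel(x.reshape(1, -1), landmark.reshape(1, -1), gamma=0.3)[0, 0]
```

At the same distance, the larger gamma gives a much lower similarity: the bell is narrower, so each instance influences a smaller region.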

#### TIP

With so many kernels to choose from, how can you decide which one to use? As a rule of thumb, you should always try the linear
kernel first (remember that LinearSVC is much faster than SVC(kernel="linear")), especially if the training set is very large or
if it has plenty of features. If the training set is not too large, you should try the Gaussian RBF kernel as well; it works well in
most cases. Then, if you have spare time and computing power, you can also experiment with a few other kernels using cross-validation
and grid search, especially if there are kernels specialized for your training set's data structure.
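
A minimal sketch of that last step, assuming the iris task used throughout these notes: a grid search with cross-validation over three kernels. The grid values are arbitrary starting points, not recommendations from the source.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = datasets.load_iris()
X = iris["data"][:, [2, 3]]            # petal length, petal width
y = (iris["target"] == 2).astype(int)  # 1 if Iris-Virginica, else 0

pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("svm_clf", SVC()),
])

# One dictionary per kernel, so each kernel is only combined with its own hyperparameters.
param_grid = [
    {"svm_clf__kernel": ["linear"], "svm_clf__C": [0.1, 1, 10]},
    {"svm_clf__kernel": ["rbf"], "svm_clf__C": [0.1, 1, 10],
     "svm_clf__gamma": [0.1, 1, 5]},
    {"svm_clf__kernel": ["poly"], "svm_clf__C": [0.1, 1, 10],
     "svm_clf__degree": [2, 3], "svm_clf__coef0": [0, 1]},
]

grid = GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X, y)

print("best parameters:", grid.best_params_)
print("best cross-validated accuracy:", grid.best_score_)
```

Keeping the scaler inside the pipeline means it is refitted on each training fold, so no information leaks from the validation folds.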

### Computational Complexity
The **LinearSVC** class is based on the **liblinear** library, which implements an optimized algorithm for
linear SVMs. It does not support the **kernel trick**, but it scales almost linearly with the number of training instances and the number of features: its training time complexity is roughly O(m × n). The algorithm takes longer if you require a very high precision. This is controlled by the tolerance hyperparameter ϵ (called tol in Scikit-Learn). In most classification tasks, the default tolerance is fine.

The **SVC** class is based on the **libsvm** library, which implements an algorithm that supports the kernel trick. The training time complexity is usually between O(m² × n) and O(m³ × n). Unfortunately, this means that it gets dreadfully slow when the number of training instances gets large (e.g., hundreds of thousands of instances). This algorithm is perfect for complex but small or medium training sets. However, it scales well with the number of features, especially with sparse features (i.e., when each instance has few nonzero features). In this case, the algorithm scales roughly with the average number of nonzero features per instance.


### SVM Regression
As we mentioned earlier, the SVM algorithm is quite versatile: not only does it support linear and nonlinear classification, but it also supports linear and nonlinear regression. The trick is to reverse the objective: instead of trying to fit the largest possible street between two classes while limiting margin violations, **SVM Regression** tries to fit as many instances as possible on the street while limiting margin violations (i.e., instances off the street). The width of the street is controlled by a hyperparameter ϵ.

Adding more training instances within the margin does not affect the model’s predictions; thus, the model
is said to be ϵ-insensitive.
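
The ϵ-insensitivity comes from the loss function: errors smaller than ϵ cost nothing. Below is a small sketch of that loss, max(0, |y − ŷ| − ϵ), written as a hypothetical helper on made-up values.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Zero inside the street of half-width epsilon, linear outside it."""
    return np.maximum(0.0, np.abs(y_true - y_pred) - epsilon)

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.2, 2.0, 4.0, 1.5])

losses = epsilon_insensitive_loss(y_true, y_pred, epsilon=0.5)
print(losses)  # the first two instances are inside the street, so their loss is zero
```

Because instances inside the street contribute zero loss, adding more of them leaves the optimum unchanged.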

You can use Scikit-Learn’s LinearSVR class to perform linear SVM Regression.

In [16]:
from sklearn.svm import LinearSVR
svm_reg = LinearSVR(epsilon=1.5)
svm_reg.fit(X, y)

LinearSVR(C=1.0, dual=True, epsilon=1.5, fit_intercept=True,
     intercept_scaling=1.0, loss='epsilon_insensitive', max_iter=1000,
     random_state=None, tol=0.0001, verbose=0)

To tackle nonlinear regression tasks, you can use a kernelized SVM model, such as the SVR class with a polynomial kernel:

In [17]:
from sklearn.svm import SVR
svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X, y)

SVR(C=100, cache_size=200, coef0=0.0, degree=2, epsilon=0.1, gamma='auto',
  kernel='poly', max_iter=-1, shrinking=True, tol=0.001, verbose=False)
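
The cells above fit the regressors on the iris classification data, which makes the result hard to interpret. As a hedged illustration, the sketch below fits the same SVR configuration on a synthetic quadratic dataset and predicts a few new values. The data-generating formula, the noise level, and the seed are assumptions made up for this example.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data: y = 0.2 + 0.5 * x^2 plus a little Gaussian noise.
rng = np.random.default_rng(42)
X_quad = 2 * rng.random((100, 1)) - 1  # 100 points in [-1, 1)
y_quad = 0.2 + 0.5 * X_quad[:, 0] ** 2 + rng.normal(scale=0.05, size=100)

svm_poly_reg = SVR(kernel="poly", degree=2, C=100, epsilon=0.1)
svm_poly_reg.fit(X_quad, y_quad)

X_new = np.array([[-0.5], [0.0], [0.5]])
y_new = svm_poly_reg.predict(X_new)
mean_abs_error = np.mean(np.abs(svm_poly_reg.predict(X_quad) - y_quad))

print("predictions at -0.5, 0, 0.5:", y_new)
print("mean absolute training error:", mean_abs_error)
```

With a large C the model tolerates few instances outside the street, so the fitted curve should follow the parabola closely. A smaller C would regularize more strongly and flatten the curve.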