# Support Vector Machines

<p style="font-family:verdana; font-size:15px" id="svm"><b> Support Vector Machines</b> (SVMs) are supervised learning models that can be used for both classification and regression. They are among the most widely used supervised learning algorithms: they remain effective in high-dimensional spaces, and they are memory efficient because the decision function depends only on a subset of the training points.

Consider a binary classification problem, where the task is to assign one of two labels to a given input. We plot each data item as a point in n-dimensional space as follows:</p>

![title](./images/svm1.png)

<p style="font-family:verdana; font-size:15px">We can perform classification by finding a hyperplane that separates the two classes well. As you can see in the above image, many such hyperplanes can be drawn. How do we find the best one? We find the optimal hyperplane by maximizing the <b> margin </b>.</p>
![title](./images/svm2.png)

<p style="font-family:verdana; font-size:15px">We define the margin as twice the distance between the hyperplane and the sample points nearest to it. These points are known as <b>support vectors</b>, because they hold up the optimal hyperplane. In the above figure, support vectors are drawn with filled color.

Consider the first hyperplane in figure-1, which touches two sample points (red). Although it classifies all the examples correctly, it lies close to many sample points, so new red examples might fall on the other side of it. This problem can be solved by choosing the hyperplane that is farthest from the sample points. It turns out that this type of model generalizes very well. This optimal hyperplane is also known as the <b> maximum margin separator</b>.</p>

<p style="font-family:verdana; font-size:15px">We know that we want the hyperplane with the maximum margin, and we have discussed why. Now, let us learn how to find this optimal hyperplane. Before that, please note that in the case of SVMs we represent class labels with +1 and -1 instead of 0 and 1 (binary valued labels). Here, we represent the optimal hyperplane, the negative hyperplane and the positive hyperplane (dashed lines) with the linear equations w<sup>T</sup>x + b = 0, w<sup>T</sup>x + b = -1 and w<sup>T</sup>x + b = +1 respectively. The leftmost dashed line is the negative hyperplane. We represent red points with x<sub>-</sub> and blue points with x<sub>+</sub>. To derive the equation for the margin, take a point x<sub>-</sub> on the negative hyperplane and a point x<sub>+</sub> on the positive hyperplane, and subtract the equation of the negative hyperplane from that of the positive hyperplane.</p>

<p style="font-family:verdana; font-size:15px">\begin{align}
w^T(x_+ - x_-) = 2
\end{align}<br><br>

Dividing both sides by the length of the vector w to normalize this,<br>

\begin{align}
\frac{w^T(x_+ - x_-)}{||w||} = \frac{2}{||w||} 
\end{align}
<br><br>
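where <b> 2/||w||</b> is the margin. The left-hand side is the projection of x<sub>+</sub> - x<sub>-</sub> onto the unit normal w/||w||, that is, the distance between the two dashed hyperplanes.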

Now the objective of the SVM becomes maximization of the margin under the constraint that samples are classified correctly.
\begin{align}
w^T x^{(i)} + b \geq +1  \hspace{1cm} \text{if} \hspace{1cm} y^{(i)} = +1 \newline
w^T x^{(i)} + b \leq -1  \hspace{1cm} \text{if} \hspace{1cm} y^{(i)} = -1
\end{align}
<br><br>
This can also be written more compactly as 
\begin{align}
y^{(i)} ( w^T x^{(i)} + b ) \geq 1 \hspace{1cm} \forall i
\end{align}</p>

<p style="font-family:verdana; font-size:15px">In practice, instead of maximizing 2/||w||, it is easier to minimize the equivalent term \begin{align} \frac{1}{2} ||w||^2 \end{align} subject to the same constraints. This is a quadratic programming problem with linear constraints.</p>
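The margin formula and the constraints can be checked numerically. The following is a minimal sketch, not part of the original notebook: the six toy points are made up for illustration, and a very large `C` is assumed to make scikit-learn's soft-margin `SVC` behave approximately like the hard-margin problem described above.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (made up for illustration); labels are -1 / +1.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
              [3.0, 3.0], [4.0, 3.0], [3.0, 4.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# Assumption: a very large C makes violations so costly that the solver
# approximates the hard-margin solution.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]        # weight vector w
b = clf.intercept_[0]   # bias b
margin = 2.0 / np.linalg.norm(w)

# Every sample should satisfy y_i * (w^T x_i + b) >= 1, up to solver tolerance.
functional_margins = y * (X @ w + b)

print("w =", w, "b =", b)
print("margin 2/||w|| =", margin)
print("y * (w^T x + b) =", functional_margins)
print("support vectors:\n", clf.support_vectors_)
```

For this data the nearest approach between the two classes is from the point (3, 3) to the line through (1, 0) and (0, 1), so the margin should come out close to 5/√2 ≈ 3.54, and the samples with a functional margin of about 1 are the support vectors.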

<p style="font-family:verdana; font-size:15px">In the case of inherently noisy data, we may not want a linear hyperplane in a high-dimensional space. Rather, we would like a decision surface in a lower-dimensional space that does not cleanly separate the classes, but reflects the reality of the noisy data. That is possible with the <b> soft margin classifier</b>, which allows examples to fall on the wrong side of the decision boundary, but assigns them a penalty proportional to the distance required to move them back to the correct side. In the soft margin classifier, we add one non-negative slack variable per sample to the linear constraints.<br><br>
\begin{align}
y^{(i)} (w^T x^{(i)} + b ) \geq 1 - \xi^{(i)}, \hspace{0.5cm} \xi^{(i)} \geq 0 \hspace{1cm} \text{for} \hspace{1cm} i = 1, \dots, N
\end{align}

The objective to minimize now becomes<br><br>
\begin{align}
\frac{1}{2} ||w||^2 + C (\sum_{i} \xi^{(i)})
\end{align}

C is the <b> regularization </b> parameter. A small C lets the constraints be violated cheaply and results in a large margin, whereas a large C makes violations costly and results in a narrow margin. This is still a quadratic optimization problem, and it has a unique minimum. 

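Now let us train an SVM classifier in Python using sklearn. We will use the iris dataset. Note that SVC() called with no arguments uses the RBF kernel rather than a linear one; pass kernel="linear" to get a linear SVM.</p>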

In [2]:
#import the dependencies
from sklearn.datasets import load_iris
from sklearn.svm import SVC
#load dataset
dataset = load_iris()
data = dataset.data
target = dataset.target

In machine learning, we usually need some preprocessing to make the dataset suitable for the learning algorithm. I will introduce a few preprocessing techniques as we go through various algorithms. Here, we perform feature scaling, which SVMs typically need for good performance because the margin is measured in distances between points. Feature scaling standardizes the range of the features; StandardScaler rescales each feature to zero mean and unit variance.

In [3]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
data_scaled = sc.fit_transform(data)  # returns a standardized copy; `data` itself is not modified
data_scaled  # check out the preprocessing module of sklearn to learn more about preprocessing in ML

array([[ -9.00681170e-01,   1.03205722e+00,  -1.34127240e+00,
         -1.31297673e+00],
       [ -1.14301691e+00,  -1.24957601e-01,  -1.34127240e+00,
         -1.31297673e+00],
       [ -1.38535265e+00,   3.37848329e-01,  -1.39813811e+00,
         -1.31297673e+00],
       [ -1.50652052e+00,   1.06445364e-01,  -1.28440670e+00,
         -1.31297673e+00],
       [ -1.02184904e+00,   1.26346019e+00,  -1.34127240e+00,
         -1.31297673e+00],
       [ -5.37177559e-01,   1.95766909e+00,  -1.17067529e+00,
         -1.05003079e+00],
       [ -1.50652052e+00,   8.00654259e-01,  -1.34127240e+00,
         -1.18150376e+00],
       [ -1.02184904e+00,   8.00654259e-01,  -1.28440670e+00,
         -1.31297673e+00],
       [ -1.74885626e+00,  -3.56360566e-01,  -1.34127240e+00,
         -1.31297673e+00],
       [ -1.14301691e+00,   1.06445364e-01,  -1.28440670e+00,
         -1.44444970e+00],
       [ -5.37177559e-01,   1.49486315e+00,  -1.28440670e+00,
         -1.31297673e+00],
       [ -1.26418478e
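As a quick check of what the scaler does, the sketch below compares its output with the standardization formula applied by hand. The variable names `scaled` and `manual` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

data = load_iris().data

scaled = StandardScaler().fit_transform(data)

# The same result computed by hand: subtract each column's mean and divide
# by its population standard deviation (NumPy's default, ddof=0).
manual = (data - data.mean(axis=0)) / data.std(axis=0)

print(np.allclose(scaled, manual))    # do the two computations agree?
print(scaled.mean(axis=0).round(6))   # per-feature mean after scaling
print(scaled.std(axis=0).round(6))    # per-feature standard deviation after scaling
```

After scaling, every feature has mean 0 and standard deviation 1, so no single feature dominates the distance computations inside the SVM.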

In [4]:
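#now let us divide data into training and testing sets
#note: this splits the original, unscaled `data`; the standardized array from the previous cell is not used here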
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(data, target, random_state=42, test_size=0.3)

In [5]:
#train a model
model = SVC()
model.fit(X_train,y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [6]:
from sklearn.metrics import accuracy_score

In [7]:
accuracy_score(y_test, model.predict(X_test))

1.0
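The model above is trained on the unscaled features. As a hedged sketch, one way to combine standardization with the classifier is scikit-learn's `make_pipeline`, which fits the scaler on the training split only and reuses it on the test split. The name `scaled_model` is illustrative and not part of the original notebook, and the resulting accuracy may vary with the library version.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42, test_size=0.3
)

# The pipeline fits the scaler on the training split only, then applies the
# same transformation to the test split before the SVC sees it.
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(X_train, y_train)

test_accuracy = scaled_model.score(X_test, y_test)
print("test accuracy with scaling:", test_accuracy)
```

Fitting the scaler inside the pipeline avoids leaking statistics of the test set into training, which would happen if the whole dataset were scaled before splitting.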

In [8]:
model.support_vectors_.shape

(39, 4)

In [9]:
model.support_vectors_

array([[ 5.1,  3.8,  1.9,  0.4],
       [ 4.8,  3.4,  1.9,  0.2],
       [ 5.5,  4.2,  1.4,  0.2],
       [ 4.5,  2.3,  1.3,  0.3],
       [ 5.8,  4. ,  1.2,  0.2],
       [ 5.6,  3. ,  4.5,  1.5],
       [ 5. ,  2. ,  3.5,  1. ],
       [ 5.4,  3. ,  4.5,  1.5],
       [ 6.7,  3. ,  5. ,  1.7],
       [ 5.9,  3.2,  4.8,  1.8],
       [ 5.1,  2.5,  3. ,  1.1],
       [ 6. ,  2.7,  5.1,  1.6],
       [ 6.3,  2.5,  4.9,  1.5],
       [ 6.1,  2.9,  4.7,  1.4],
       [ 6.5,  2.8,  4.6,  1.5],
       [ 7. ,  3.2,  4.7,  1.4],
       [ 6.1,  3. ,  4.6,  1.4],
       [ 5.5,  2.6,  4.4,  1.2],
       [ 4.9,  2.4,  3.3,  1. ],
       [ 6.9,  3.1,  4.9,  1.5],
       [ 6.3,  2.3,  4.4,  1.3],
       [ 6.3,  2.8,  5.1,  1.5],
       [ 7.7,  2.8,  6.7,  2. ],
       [ 6.3,  2.7,  4.9,  1.8],
       [ 7.7,  3.8,  6.7,  2.2],
       [ 5.7,  2.5,  5. ,  2. ],
       [ 6. ,  3. ,  4.8,  1.8],
       [ 5.8,  2.7,  5.1,  1.9],
       [ 6.2,  3.4,  5.4,  2.3],
       [ 6.1,  2.6,  5.6,  1.4],
       [ 6
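The support vectors listed above come from a model trained with the default C=1.0. To relate them back to the regularization parameter, here is a small sketch on synthetic data (two overlapping Gaussian blobs, made up for illustration) that fits a linear SVM for several values of C and reports the margin 2/||w|| and the number of support vectors. The dictionaries `margins` and `n_support` are illustrative names.

```python
import numpy as np
from sklearn.svm import SVC

# Two overlapping Gaussian blobs (synthetic data, made up for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(50, 2)),
    rng.normal(loc=[1.0, 1.0], scale=1.0, size=(50, 2)),
])
y = np.array([-1] * 50 + [1] * 50)

margins = {}
n_support = {}
for C in [0.01, 1.0, 100.0]:
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margins[C] = 2.0 / np.linalg.norm(clf.coef_[0])   # margin width 2/||w||
    n_support[C] = int(clf.n_support_.sum())          # support vectors over both classes
    print(f"C={C:<6} margin={margins[C]:.3f} support vectors={n_support[C]}")
```

A small C should give a wide margin, with many points inside it acting as support vectors, while a large C should shrink the margin because every violation is penalized heavily.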

<p style="font-family:verdana; font-size:15px">So far, we have seen problems where the input data can be separated by a linear hyperplane. But what if the data points are not linearly separable, as shown below?</p>
![title](./images/svm5.png)


<p style="font-family:verdana; font-size:15px">To solve this type of problem, where the data cannot be separated linearly, we add a new feature. For example, let us add the new feature z = x<sup>2</sup> + y<sup>2</sup>. Now, if we plot the data points on the x and z axes, we get:</p>
![title](./images/svm3.png)
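The effect of the added feature can be reproduced in a few lines. This is a minimal sketch that assumes synthetic concentric circles from `make_circles` as a stand-in for the data in the figures; the names `raw_acc` and `xz_acc` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic concentric circles: not linearly separable in (x, y).
points, labels = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)
x, y = points[:, 0], points[:, 1]

# Linear SVM on the raw coordinates (x, y).
raw_acc = SVC(kernel="linear").fit(points, labels).score(points, labels)

# Add the feature z = x^2 + y^2 and fit a linear SVM on (x, z).
z = x**2 + y**2
xz = np.column_stack([x, z])
xz_acc = SVC(kernel="linear").fit(xz, labels).score(xz, labels)

print("training accuracy on (x, y):", raw_acc)
print("training accuracy on (x, z):", xz_acc)
```

Because z is the squared distance from the origin, the inner and outer circles end up at different heights on the z axis, and a single straight line in the (x, z) plane can separate them.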


<p style="font-family:verdana; font-size:15px"> As you can see, a linear hyperplane can now separate the data points very well. Do we need to add this additional feature manually? The answer is no. We use a technique called the <b> Kernel Trick</b>. A kernel is a function that computes the inner product of two points as if they had been mapped from the low-dimensional input space into a higher-dimensional space, where the data points may be linearly separable, without ever computing that mapping explicitly. Widely used kernels are the Radial Basis Function (RBF) kernel, the polynomial kernel, the sigmoid kernel, etc.<br><br>

Let us implement this in sklearn.</p>

In [10]:
#we have already imported libs and dataset
model2 = SVC(kernel="rbf", gamma=0.2)
model2.fit(X_train, y_train)
model2.score(X_test, y_test)

1.0
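To make the kernel idea concrete, the sketch below uses the homogeneous polynomial kernel of degree 2 on 2-D inputs, for which the explicit feature map is small enough to write down. The helper names `phi` and `poly2_kernel` are hypothetical and chosen for illustration; this is not how scikit-learn computes kernels internally.

```python
import numpy as np

def phi(v):
    """Explicit degree-2 feature map for a 2-D vector (illustrative helper)."""
    v1, v2 = v
    return np.array([v1**2, np.sqrt(2) * v1 * v2, v2**2])

def poly2_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: K(a, b) = (a . b)^2."""
    return np.dot(a, b) ** 2

a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])

explicit = np.dot(phi(a), phi(b))   # inner product in the 3-D feature space
implicit = poly2_kernel(a, b)       # same value, computed in the 2-D input space
print(explicit, implicit)
```

Both numbers agree, which is the point of the kernel trick: the SVM only ever needs inner products between samples, so it can work in the higher-dimensional space by evaluating the kernel on the original inputs.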

<p style="font-family:verdana; font-size:15px"> We get different decision boundaries for different kernels and gamma values. Here is a screenshot from the scikit-learn website.</p>
![title](./images/svm6.png)
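The influence of gamma can also be explored directly on the iris split used in this notebook. The following is a rough sketch; the list of gamma values is an arbitrary choice for illustration, and the `results` dictionary is not part of the original notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=42, test_size=0.3
)

# Map each gamma to (training accuracy, test accuracy) for an RBF-kernel SVC.
results = {}
for gamma in [0.01, 0.2, 10.0, 100.0]:
    clf = SVC(kernel="rbf", gamma=gamma).fit(X_train, y_train)
    results[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))
    print(f"gamma={gamma:<6} train={results[gamma][0]:.3f} test={results[gamma][1]:.3f}")
```

A larger gamma makes each training point's influence more local, so the decision boundary bends more tightly around the training data. Comparing training and test accuracy across the values is a simple way to spot when that starts to hurt generalization.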
