<p style="font-family: Arial; font-size:3em;color:navy; font-style:bold"><br>
Lecture 6: Further Examples of Classifiers
<br><br></p>

This week we're discussing more classifiers and their applications. 

<p style="font-family: Arial; font-size:2.5em;color:purple; font-style:bold"><br>
Support Vector Machine
<br><br></p>

A Support Vector Machine (SVM) is another classification method, known for being memory-friendly. A good way to think about the machine learning algorithms we have seen so far is as a set of tools: each has its own advantages and disadvantages, and they can often be used in conjunction with each other. An SVM is almost like a scalpel: it can navigate complex relationships in high dimensions, but it is best used on smaller datasets.

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Basics
<br><br></p>

We start by assigning each data point a set of "coordinates" called its features. As a result, we can represent the data as a set of points in some n-dimensional feature space. The basic goal is to find a hyperplane, that is, an (n-1)-dimensional plane, that best separates the data points into their classes. The support vectors are the position vectors of the points closest to this separating hyperplane.

![image](http://dni-institute.in/blogs/wp-content/uploads/2015/09/SVM-Planes.png)


Consider, for example, a 2-dimensional case. This means we have represented the data in terms of two features. What we aim to find is a line that separates the two classes.

A general line is of the form $y = mx + b$, so we want to tweak $m$ and $b$ to find the best line of separation.

![image](https://www.analyticsvidhya.com/wp-content/uploads/2015/10/SVM_21.png)

It is pretty intuitive to see that in the above case, line B is the best choice. The following case is a bit trickier, but we can still see that line C appears to be the best.

![image](https://www.analyticsvidhya.com/wp-content/uploads/2015/10/SVM_4.png)

This is because it maximizes the **margin**, which is the distance between the hyperplane and the closest points (the support vectors).

The support vector machine does this process for us: it finds the best hyperplane to separate the data points.

![image](https://media.licdn.com/mpr/mpr/AAEAAQAAAAAAAAuSAAAAJDlhYzcwMzhlLTA0MjYtNDEyYS1hMWM4LTE3Zjk5NDlhNzVkMQ.png)
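
To connect this back to the line $y = mx + b$, here is a minimal sketch that fits a linear SVM to a small 2-dimensional dataset and reads the separating line back out of the fitted model. The data points are made up for illustration, and the large `C` value is an assumption chosen so that the fit behaves like a strict separator on this cleanly separable data.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data: two well-separated clusters in 2D
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
              [4.0, 4.0], [4.5, 5.0], [5.0, 4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit a linear SVM; a large C leaves little room for misclassified points
model = svm.SVC(kernel='linear', C=1000)
model.fit(X, y)

# The fitted hyperplane satisfies w[0]*x1 + w[1]*x2 + b = 0
w = model.coef_[0]
b = model.intercept_[0]

# Rewrite it in slope-intercept form x2 = m*x1 + k (valid when w[1] != 0)
m = -w[0] / w[1]
k = -b / w[1]

print("slope m:", m)
print("intercept k:", k)
print("training accuracy:", model.score(X, y))
```

Here `coef_` and `intercept_` hold the normal vector and offset of the hyperplane, so the slope and intercept are just a rearrangement of the same line.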



Pros:
* Effective in high dimensions
* Uses only a subset of the training points (the support vectors) to make predictions, so it is memory efficient

Cons:
* Slow on large training sets, due to long training times
* Sensitive to noise
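
The memory-efficiency point can be checked directly. The sketch below is only an illustration: it fits a linear SVM on the iris dataset bundled with scikit-learn (using all four features, which is a choice made here) and counts how many training points the fitted model keeps as support vectors.

```python
from sklearn import svm, datasets

# Load the iris dataset bundled with scikit-learn (all four features)
iris = datasets.load_iris()
X = iris.data
y = iris.target

model = svm.SVC(kernel='linear', C=1)
model.fit(X, y)

# Only the support vectors are stored for making predictions
n_total = X.shape[0]
n_sv = model.support_vectors_.shape[0]

print("training points:", n_total)
print("support vectors kept:", n_sv)
print("support vectors per class:", model.n_support_)
```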

<p style="font-family: Arial; font-size:2em;color:indigo; font-style:bold"><br>
Example 1: Classify Iris Species
<br><br></p>

We'll use a support vector machine to predict the species of an iris flower from its first two measurements (sepal length and sepal width), using the iris dataset that ships with scikit-learn.

In [19]:
# import necessary packages
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import svm, datasets

In [20]:
# here we use the iris dataset
iris = datasets.load_iris()
X = iris.data[:, :2] 
Y = iris.target

In [21]:
# we use the svm.SVC class to create the classifier
# (gamma is ignored by the linear kernel)
model = svm.SVC(kernel='linear', C=1, gamma=1)

#Here we split the training and testing data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)

#Here we fit the model to the training data
model.fit(X_train, Y_train)


SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=1, kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [22]:
#Here we predict on the testing data
model.predict(X_test)

array([0, 1, 2, 2, 1, 2, 1, 2, 0, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 0, 0, 2,
       2, 2, 0, 1, 2, 1, 0, 2, 1, 0, 2, 2, 1, 2, 1, 2, 0, 1, 2, 1, 2, 2])

In [23]:
#We test the score of the model
model.score(X_test,Y_test)

0.77777777777777779
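
The score above comes from a single random split, so it will change from run to run. One way to get a steadier estimate is k-fold cross-validation. The sketch below uses `cross_val_score` from `sklearn.model_selection` with 5 folds; the number of folds is an arbitrary choice for illustration.

```python
from sklearn import svm, datasets
from sklearn.model_selection import cross_val_score

# Same setup as the example: first two iris features, linear kernel
iris = datasets.load_iris()
X = iris.data[:, :2]
Y = iris.target

model = svm.SVC(kernel='linear', C=1)

# Fit and score the model on 5 different train/test partitions
scores = cross_val_score(model, X, Y, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```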

<p style="font-family: Arial; font-size:2.5em;color:purple; font-style:bold"><br>
Margins
<br><br></p>
Consider the following dataset:

<img src = "http://www.eric-kim.net/eric-kim-net/posts/1/imgs/dataset_linsep.png" width = 500>

How do we separate the blue dots from the red dots? One way is to draw a straight line between the two classes. However, there are infinitely many straight lines we could draw on this graph, so which one should we pick?

One reasonable choice would be the line that gives the largest separation between the two classes. This separation is called the **margin**, and in **maximal margin classifiers**, we want to maximize the margin. In other words, we want to maximize the distance between the hyperplane (a line in our case) and the closest point(s) in each class (the filled-in points in the graph below).

![image](https://qph.ec.quoracdn.net/main-qimg-312b52d9d056499a3c8893e22f6fee5e)

These points are called **support vectors** and they help us find where our hyperplane lies. 
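
As a rough illustration, the fitted scikit-learn model exposes its support vectors, and the margin can be computed from the normal vector $w$ as $1/\lVert w \rVert$. The data below is made up, and the large `C` is an assumption used to approximate a maximal margin classifier on separable data.

```python
import numpy as np
from sklearn import svm

# Hypothetical, linearly separable 2D data
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

model = svm.SVC(kernel='linear', C=1000)
model.fit(X, y)

w = model.coef_[0]
b = model.intercept_[0]

# Distance from every training point to the hyperplane w.x + b = 0
distances = np.abs(X @ w + b) / np.linalg.norm(w)

# Margin: distance from the hyperplane to the closest points
margin = 1.0 / np.linalg.norm(w)

print("support vectors:")
print(model.support_vectors_)
print("margin:", margin)
print("smallest point-to-hyperplane distance:", distances.min())
```

The smallest distance should agree with the margin up to solver tolerance, since the closest points are exactly the support vectors.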

### Hard Margins
Our hyperplane may not always be perfect, especially when our dataset is not linearly separable. In such cases, we introduce a **cost function**:

$$-\log\left(1-\frac{1}{1+e^{cx}}\right)$$

Its curve resembles an exponential, and the coefficient $c$ controls how fast the value increases. This function dictates how heavily points are penalized for being mislabeled. If the penalty value (namely $c$) is high, then the SVM is a **hard margin** SVM.
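
To get a feel for the role of $c$, the sketch below evaluates the cost function above with NumPy for a few values of $c$. It treats $x$ simply as a real-valued input, and the grid of $x$ values is arbitrary. It also checks the algebraically equivalent form $\log(1 + e^{-cx})$.

```python
import numpy as np

def cost(x, c):
    """Cost function from the text: -log(1 - 1 / (1 + exp(c * x)))."""
    return -np.log(1.0 - 1.0 / (1.0 + np.exp(c * x)))

def cost_simplified(x, c):
    """Algebraically equivalent form: log(1 + exp(-c * x))."""
    return np.log1p(np.exp(-c * x))

# Arbitrary grid of inputs for illustration
x = np.linspace(-2.0, 2.0, 9)

for c in (0.5, 1.0, 5.0):
    print("c =", c, np.round(cost(x, c), 3))
```

For negative $x$, a larger $c$ gives a larger cost, which matches the idea that a high penalty value makes the margin harder.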

Let's look at a specific case. The graph below shows the hyperplane between two classes, blue and red. As you can see, the existence of one (red) outlier drastically changes the hyperplane.

<img src = "http://yaroslavvb.com/upload/save/so-svm.png" width = 250/>

If we were to use a **hard margin** SVM, we would have a high penalty value and the resulting hyperplane would be dictated by that one red outlier. 

### Soft Margins
You might have realized that, in the example above, it is better to allow one outlier to fall on the wrong side rather than letting that one outlier dictate the position of our hyperplane. By using a low $c$ value, we can achieve a "more balanced" hyperplane that lies more or less right in the middle of the two classes. This is called a **soft margin support vector machine**.
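
The effect of the penalty can be sketched with scikit-learn, where the `C` parameter of `svm.SVC` plays the role of the penalty value. The dataset, including the single outlier, and the two `C` values are made up for illustration.

```python
import numpy as np
from sklearn import svm

# Hypothetical data: two clusters plus one class-1 outlier near class 0
X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0],
              [2.5, 2.5]])  # the outlier
y = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])

margins = {}
accuracy = {}
for C in (0.01, 1000.0):
    model = svm.SVC(kernel='linear', C=C).fit(X, y)
    # Margin computed as 1 / ||w|| from the fitted normal vector
    margins[C] = 1.0 / np.linalg.norm(model.coef_[0])
    accuracy[C] = model.score(X, y)
    print("C =", C, "margin:", margins[C], "training accuracy:", accuracy[C])
```

With the high penalty, the hyperplane is squeezed between the outlier and the nearby cluster, so the margin is narrow. With the low penalty, the margin is much wider and the outlier has less influence on where the hyperplane lies.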

Especially for datasets that are not linearly separable, it is generally better to use a soft margin SVM. In this case, we introduce the **hinge loss function**:

![image](https://wikimedia.org/api/rest_v1/media/math/render/svg/53b729df53f32c7fbf933b1b034a8e368037d9b5)

where $y_i$ is either 1 or -1, indicating the class to which $x_i$ belongs, and $w$ is the vector normal to the hyperplane.

<img src = "https://upload.wikimedia.org/wikipedia/commons/2/2a/Svm_max_sep_hyperplane_with_margin.png" width = 300>
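
A small NumPy sketch of the hinge loss may help. It assumes the common form $\max(0, 1 - y_i(w \cdot x_i - b))$, where the offset $b$ and its sign convention are assumptions, and it uses a made-up hyperplane and points.

```python
import numpy as np

def hinge_loss(w, b, X, y):
    """Per-point hinge loss max(0, 1 - y_i * (w . x_i - b)), with y_i in {-1, +1}."""
    scores = X @ w - b
    return np.maximum(0.0, 1.0 - y * scores)

# Hypothetical hyperplane: the vertical line x1 = 0
w = np.array([1.0, 0.0])
b = 0.0

# Three points that all belong to class +1
X = np.array([[2.0, 0.0],    # correct side, outside the margin
              [0.5, 1.0],    # correct side, but inside the margin
              [-1.0, 0.0]])  # wrong side of the hyperplane
y = np.array([1, 1, 1])

losses = hinge_loss(w, b, X, y)
print(losses)  # [0.  0.5 2. ]
```

Points that are on the correct side and outside the margin contribute nothing, while points inside the margin or on the wrong side are penalized in proportion to how far they are from where they should be.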

The regularization constant controls how "soft" the margins are. 
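
One common way to write the soft margin objective is sketched below. The symbol $\lambda$ for the regularization constant, the offset $b$, and the number of training points $n$ are notational assumptions here.

$$\min_{w,\,b}\ \lambda \lVert w \rVert^2 + \frac{1}{n}\sum_{i=1}^{n} \max\left(0,\ 1 - y_i(w \cdot x_i - b)\right)$$

In this form, a small $\lambda$ puts most of the weight on the hinge loss, so the classifier behaves more like a hard margin SVM, while a larger $\lambda$ favors a wider margin at the cost of some misclassified points. This is the reverse of the penalty value $c$, where a high value gives the hard margin.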