# Linear Separators

We'll talk about the following models in this notebook:

* Perceptrons
* Support vector machines

You'll find that each one is just a variation of the other.

A **linear separator** $f : X \to \{-1, 1\}$ is a function that
classifies points based upon the following criterion:
$$f(\vec{x}) = \text{sign}(\vec{w} \cdot \vec{x} + b) \qquad (1)$$
where
$$
\text{sign}(a) = \left\{
\begin{array}{ll}
1 & \text{ if } a \geq 0 \\
-1 & \text{ otherwise}
\end{array}
\right.
$$

![](images/linear.jpg)
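As a minimal sketch of definition (1) in NumPy: the weights, bias, and sample points below are made up for illustration. Note that `np.sign` returns 0 at exactly 0, so the sketch uses `np.where` to match the convention above, where a score of 0 maps to $+1$.

```python
import numpy as np

def sign(a):
    """Return 1 where a >= 0 and -1 otherwise, matching the definition above."""
    return np.where(a >= 0, 1, -1)

def linear_separator(X, w, b):
    """Classify each row of X with f(x) = sign(w . x + b)."""
    return sign(X @ w + b)

# Hypothetical weights, bias, and points, chosen only for illustration.
w = np.array([1.0, -1.0])
b = 0.5
X_demo = np.array([[2.0, 0.0],    # w . x + b =  2.5 -> +1
                   [0.0, 2.0],    # w . x + b = -1.5 -> -1
                   [0.0, 0.5]])   # w . x + b =  0.0 -> +1 (boundary counts as +1)

labels = linear_separator(X_demo, w, b)
print(labels)
```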

## More notation: Canonical representations

For the rest of this presentation, I'd like to use the **canonical** representation of an input sample $\vec{x}$:

$$\vec{x}' = \begin{bmatrix} \vec{x} \\ 1 \end{bmatrix}$$

And with that, I'll also create $\vec{w}'$:

$$\vec{w}' = \begin{bmatrix} \vec{w} \\ b \end{bmatrix}$$

With this representation, $(1)$ becomes

$$f(\vec{x}') = \text{sign}(\vec{w}' \cdot \vec{x}') \qquad (2)$$

which cleans up some of the notation. I'll also let $\vec{w} \leftarrow \vec{w}'$.
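To check that (1) and (2) give the same scores, here is a small sketch. The helper name `augment` and the numbers are assumptions made for this example, not part of the original notebook.

```python
import numpy as np

def augment(X):
    """Append a constant 1 to every sample (each row of X)."""
    return np.hstack([X, np.ones((X.shape[0], 1))])

# Hypothetical weights, bias, and samples for illustration.
w = np.array([1.0, -1.0])
b = 0.5
X_demo = np.array([[2.0, 0.0], [0.0, 2.0], [1.0, 3.0]])

w_aug = np.append(w, b)      # w' = [w, b]
X_aug = augment(X_demo)      # each row is x' = [x, 1]

scores_explicit = X_demo @ w + b     # w . x + b, as in (1)
scores_augmented = X_aug @ w_aug     # w' . x', as in (2)
print(np.allclose(scores_explicit, scores_augmented))
```

The bias simply becomes one more weight, which is why the rest of the notebook can drop $b$ from the formulas.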

## Part 1: The perceptron

The perceptron (Rosenblatt, 1957) is probably one of the simplest machine learning models you'll find, and its implementations follow the gradient descent framework.

**Objective**: Minimize the following cost per sample:

$$c(\vec{x_i}, y_i) = \max(0, -y_i (\vec{w} \cdot \vec{x_i})) \qquad(3)$$

which is equivalent to

$$c(\vec{x_i}, y_i) = \left\{
\begin{array}{ll}
-y_i (\vec{w} \cdot \vec{x_i}) & \text{if } y_i (\vec{w} \cdot \vec{x_i}) \leq 0 \qquad\text{misclassification} \\
                             0 & \text{otherwise}
\end{array}
\right.$$
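A small numeric sketch of the per-sample cost (3), assuming samples are already in the canonical form (last entry equal to 1). The weights and samples are invented for illustration.

```python
import numpy as np

def perceptron_cost(w, x, y):
    """Per-sample cost from (3): max(0, -y * (w . x))."""
    return max(0.0, -y * np.dot(w, x))

# Hypothetical weights (last entry plays the role of the bias) and samples.
w = np.array([1.0, -1.0, 0.5])
x_pos = np.array([2.0, 0.0, 1.0])   # w . x =  2.5
x_neg = np.array([0.0, 2.0, 1.0])   # w . x = -1.5

print(perceptron_cost(w, x_pos, +1))   # correctly classified: cost is 0.0
print(perceptron_cost(w, x_pos, -1))   # misclassified: cost is 2.5
print(perceptron_cost(w, x_neg, +1))   # misclassified: cost is 1.5
```

The cost is zero for correctly classified samples and grows linearly with how far a misclassified sample sits on the wrong side.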

We'll use stochastic gradient descent (SGD) to update our weight vector $\vec{w}$. As a refresher, here is the stochastic gradient descent algorithm:

![](images/sgd-algo2.png)

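It turns out that calculating $\nabla_{\vec{w}} c(\vec{x}, y)$ is extremely easy! Recall that the gradient collects the partial derivatives of the cost with respect to each weight (including the bias $b$, which is now part of $\vec{w}$):
$$\nabla_{\vec{w}} c = \Big(\frac{\partial c}{\partial w_1}, \dots, \frac{\partial c}{\partial w_d}, \frac{\partial c}{\partial b} \Big)$$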

<!--
Therefore, if we focus on just one of these $\frac{\partial f}{\partial w_i}$, we have

$$
\frac{\partial f}{\partial w_i} \Big(c(y, f(\vec{x})) \Big) \Big\vert_{\vec{x} = \vec{x_i}, y = y_i} = \left\{
\begin{array}{ll}
-y_i \vec{x}_i & \text{if } y_i (\vec{w} \cdot \vec{x}) \leq 0 \qquad\text{misclassification}  \\
0 & \text{otherwise } \\
\end{array}
\right.
$$-->

Therefore,
$$\nabla_{\vec{w}} c(\vec{x_i}, y_i) = \left\{
\begin{array}{ll}
-y_i \vec{x_i} & \text{if } y_i (\vec{w} \cdot \vec{x_i}) \leq 0 \qquad\text{misclassification} \\
                             0 & \text{otherwise}
\end{array}
\right.
$$

which yields the update rule

$$\vec{w}^{(t + 1)} \leftarrow \vec{w}^{(t)} + \eta^{(t)} 
\left\{
\begin{array}{ll}
y_i \vec{x_i} & \text{if } y_i (\vec{w}^{(t)} \cdot \vec{x_i}) \leq 0 \qquad\text{misclassification} \\
                             0 & \text{otherwise}
\end{array}
\right.
$$

![](images/perceptron.png)
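The update rule can be sketched from scratch in a few lines of NumPy. This is an illustrative version under some assumptions (constant learning rate, a fixed number of epochs, labels in $\{-1, 1\}$, samples in canonical form), not the scikit-learn implementation; the function name `train_perceptron` and the toy data are invented for the example.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=50, seed=0):
    """Train a perceptron with SGD.

    X is expected in canonical form (last column all ones), y in {-1, +1}.
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        # Visit the samples in a fresh random order each epoch
        for i in rng.permutation(len(X)):
            # Misclassified (or on the boundary): move w toward y_i * x_i
            if y[i] * np.dot(w, X[i]) <= 0:
                w += eta * y[i] * X[i]
    return w

# Toy, well-separated data: two Gaussian blobs (illustrative only)
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
X_toy = np.vstack([X_pos, X_neg])
X_toy = np.hstack([X_toy, np.ones((100, 1))])   # canonical form: append 1
y_toy = np.array([1] * 50 + [-1] * 50)

w = train_perceptron(X_toy, y_toy)
pred = np.where(X_toy @ w >= 0, 1, -1)
print("training accuracy:", np.mean(pred == y_toy))
```

Correctly classified samples leave $\vec{w}$ untouched, so once every sample is on the right side the weights stop changing.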

In [None]:
%matplotlib inline

In [None]:
# Source: http://stats.stackexchange.com/questions/71335/decision-boundary-plot-for-a-perceptron

import numpy as np
from sklearn.linear_model import Perceptron
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0, random_state=0)
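# Recent scikit-learn releases use max_iter instead of n_iter;
# tol=None disables early stopping so all 100 passes are run
clf = Perceptron(max_iter=100, tol=None).fit(X, y)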

In [None]:
h = .02  # step size in the mesh

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# Plot the decision boundary. For that, we will assign a color to each
# point in the mesh [x_min, x_max]x[y_min, y_max].
fig, ax = plt.subplots()
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

# Put the result into a color plot
Z = Z.reshape(xx.shape)
ax.contourf(xx, yy, Z, cmap=plt.cm.Paired)
ax.axis('off')

# Plot also the training points
ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)

ax.set_title('Perceptron')

## More resources

* [Scikit-learn implementation](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html)
* Scikit-learn's [SGDClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html) implements a perceptron when the "perceptron loss" option is used
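As a sketch of the second option: the scikit-learn documentation describes `Perceptron` as equivalent to `SGDClassifier` with the perceptron loss, a constant learning rate of 1, and no penalty. The dataset and the `max_iter`, `tol`, and `random_state` values below are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X_demo, y_demo = make_classification(n_samples=200, n_features=2,
                                     n_redundant=0, random_state=0)

# Perceptron loss, constant learning rate of 1, no regularization penalty
sgd_perceptron = SGDClassifier(loss="perceptron", learning_rate="constant",
                               eta0=1.0, penalty=None, max_iter=100,
                               tol=None, random_state=0)
sgd_perceptron.fit(X_demo, y_demo)

print("weights:", sgd_perceptron.coef_, "bias:", sgd_perceptron.intercept_)
print("training accuracy:", sgd_perceptron.score(X_demo, y_demo))
```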

## Part 2: Support Vector Machines

In spirit, support vector machines are not that different from perceptrons. They are also linear classifiers, just trained with a different loss function. Since their introduction by Vapnik in 1963, SVMs have gone through multiple incarnations of research.

![](images/svm.png)

Which of these hyperplanes is the best?

Intuition of the **hard SVM**:
* Support vector machines try to find a linear separator (hyperplane) that maximizes the distance (the margin) between the two classes
* The points nearest to the hyperplane are known as **support vectors**

This formulation is not ideal, though: you usually won't get a clean separation between two classes. This is where the **soft SVM** comes in.

![](images/soft.png)

**Objective**: Find the maximum-margin hyperplane, but allow samples to violate the margin at a cost that is added to the objective.

In other words, we can write
$$
\begin{align}
\min_{\vec{w}, \xi} \qquad& \frac{1}{2}\| \vec{w} \|^2 + C\sum_i \xi_i \\
\text{s.t.} \qquad& y_i (\vec{w}\cdot \vec{x_i}) \geq 1 - \xi_i \quad \forall i \\
& \xi_i \geq 0 \quad \forall i
\end{align}
$$

In English,

* The $\frac{1}{2} \| \vec{w} \|^2$ term controls the margin. The margin is inversely proportional to $\| \vec{w} \|$, so minimizing this term maximizes the margin
* $\xi_i$ is the **slack** variable for sample $\vec{x_i}$. It measures how far the sample falls short of the margin, and a penalty is incurred for each $\vec{x_i}$ with $\xi_i > 0$
* $C$ is a **hyperparameter** (set before training) that determines how much penalty a sample incurs for crossing the margin
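One way to see the link between $\| \vec{w} \|$ and the margin is the following short sketch, which writes the bias $b$ explicitly for the moment so that $\vec{w}$ holds only the feature weights. Take a point $\vec{x}_+$ on the margin boundary $\vec{w} \cdot \vec{x} + b = 1$ and a point $\vec{x}_-$ on the opposite boundary $\vec{w} \cdot \vec{x} + b = -1$. Subtracting the two equations gives

$$\vec{w} \cdot (\vec{x}_+ - \vec{x}_-) = 2$$

The margin width is the length of $\vec{x}_+ - \vec{x}_-$ projected onto the unit normal of the hyperplane, $\vec{w} / \| \vec{w} \|$:

$$\text{margin} = \frac{\vec{w}}{\| \vec{w} \|} \cdot (\vec{x}_+ - \vec{x}_-) = \frac{2}{\| \vec{w} \|}$$

So maximizing the margin is the same as minimizing $\| \vec{w} \|$, or equivalently $\frac{1}{2} \| \vec{w} \|^2$, which is easier to differentiate.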

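Solving the constraints for $\xi_i$ shows that the smallest feasible slack is $\xi_i = \max(0, 1 - y_i (\vec{w} \cdot \vec{x_i}))$, so minimizing the objective above is equivalent to minimizing the following unconstrained objective:

$$\min_{\vec{w}} \Bigg( \frac{1}{2} \| \vec{w} \|^2 + C\sum_i \max(0, 1 - y_i (\vec{w} \cdot \vec{x_i})) \Bigg)$$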

Once again, you can use SGD to minimize this loss function.
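Here is a rough sketch of what that could look like, using a stochastic subgradient of the hinge-loss objective. It assumes a constant learning rate, splits the $\frac{1}{2} \| \vec{w} \|^2$ term evenly across the $n$ samples, and keeps the bias inside $\vec{w}$ (so the bias is regularized too, following the canonical notation). The function names and toy data are invented for the example.

```python
import numpy as np

def svm_objective(w, X, y, C):
    """0.5 * ||w||^2 + C * sum_i max(0, 1 - y_i * (w . x_i))."""
    hinge = np.maximum(0.0, 1.0 - y * (X @ w))
    return 0.5 * np.dot(w, w) + C * hinge.sum()

def train_soft_svm(X, y, C=1.0, eta=0.01, n_epochs=50, seed=0):
    """Minimize the soft-SVM objective with stochastic subgradient steps."""
    rng = np.random.default_rng(seed)
    n = len(X)
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(n):
            grad = w / n                        # this sample's share of the regularizer
            if y[i] * np.dot(w, X[i]) < 1:      # inside the margin or misclassified
                grad = grad - C * y[i] * X[i]   # subgradient of the hinge term
            w -= eta * grad
    return w

# Toy, overlapping data so that some slack is unavoidable (illustrative only)
rng = np.random.default_rng(0)
X_soft = np.vstack([rng.normal(loc=[1.0, 1.0], scale=1.0, size=(50, 2)),
                    rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(50, 2))])
X_soft = np.hstack([X_soft, np.ones((100, 1))])   # canonical form: append 1
y_soft = np.array([1] * 50 + [-1] * 50)

w_soft = train_soft_svm(X_soft, y_soft, C=1.0)
pred = np.where(X_soft @ w_soft >= 0, 1, -1)
print("objective at w = 0:    ", svm_objective(np.zeros(3), X_soft, y_soft, C=1.0))
print("objective after SGD:   ", svm_objective(w_soft, X_soft, y_soft, C=1.0))
print("training accuracy:     ", np.mean(pred == y_soft))
```

Compared with the perceptron update, the only differences are the threshold (a margin of 1 instead of 0) and the extra shrinkage of $\vec{w}$ from the regularizer.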

## Kernels

Once in a while, you might hear someone mention kernel SVMs. Without getting too deep into it, kernels allow a linear separator to take samples from an original space that is not linearly separable, and embed them into a space where the samples are linearly separable.
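To make the embedding idea concrete, here is a sketch with two concentric rings, which no straight line in the plane can separate. The feature map below is an explicit, hand-picked embedding chosen for illustration; a kernel SVM would get a similar effect implicitly through its kernel function rather than by building the new features.

```python
import numpy as np

rng = np.random.default_rng(0)

def ring(radius, n):
    """Sample n points at roughly the given radius around the origin."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=n)
    radii = radius + rng.normal(scale=0.1, size=n)
    return np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])

X_rings = np.vstack([ring(1.0, 100),    # inner ring: class -1
                     ring(3.0, 100)])   # outer ring: class +1
y_rings = np.array([-1] * 100 + [1] * 100)

def feature_map(X):
    """Hypothetical explicit embedding: (x1, x2) -> (x1, x2, x1^2 + x2^2)."""
    return np.column_stack([X, (X ** 2).sum(axis=1)])

Z = feature_map(X_rings)

# In the embedded space, the plane z3 = 4 (radius 2 in the original space)
# separates the rings: w = (0, 0, 1), b = -4.
w = np.array([0.0, 0.0, 1.0])
b = -4.0
pred = np.where(Z @ w + b >= 0, 1, -1)
print("accuracy in embedded space:", np.mean(pred == y_rings))
```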

## Example with Scikit-Learn

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets

# import some data to play with
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features. We could
                      # avoid this ugly slicing by using a two-dim dataset
y = iris.target

In [None]:
h = .02  # step size in the mesh

# we create an instance of SVM and fit our data. We do not scale our
# data since we want to plot the support vectors
C = 1.0  # SVM regularization parameter
svc = svm.SVC(kernel='linear', C=C).fit(X, y)
rbf_svc = svm.SVC(kernel='rbf', gamma=0.7, C=C).fit(X, y)
poly_svc = svm.SVC(kernel='poly', degree=3, C=C).fit(X, y)
lin_svc = svm.LinearSVC(C=C).fit(X, y)

# create a mesh to plot in
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

# title for the plots
titles = ['SVC with linear kernel',
          'LinearSVC (linear kernel)',
          'SVC with RBF kernel',
          'SVC with polynomial (degree 3) kernel']

In [None]:
# Way to change sizes of plots
# http://stackoverflow.com/a/332311/2014591
from matplotlib import rcParams
rcParams['figure.figsize'] = 10, 10

In [None]:

# Map each class label (0, 1, 2) to a plotting color, one entry per sample.
# Indexing by y works regardless of how the samples are ordered.
colors = np.array(["b", "r", "y"])[y]

for i, clf in enumerate((svc, lin_svc, rbf_svc, poly_svc)):
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    plt.subplot(2, 2, i + 1)
    plt.subplots_adjust(wspace=0.4, hspace=0.4)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=plt.cm.Paired, alpha=0.8)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=colors)  # explicit colors, so no colormap is needed
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xticks(())
    plt.yticks(())
    plt.title(titles[i])

plt.show()
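The support vectors mentioned earlier can be inspected on a fitted `SVC` through its `support_vectors_`, `n_support_`, and `support_` attributes. The sketch below refits a linear-kernel model on the same two iris features so that it runs on its own; the variable names are chosen for this example.

```python
from sklearn import svm, datasets

iris = datasets.load_iris()
X_iris = iris.data[:, :2]   # same two features as above
y_iris = iris.target

svc_demo = svm.SVC(kernel='linear', C=1.0).fit(X_iris, y_iris)

print(svc_demo.support_vectors_.shape)   # (number of support vectors, 2)
print(svc_demo.n_support_)               # number of support vectors per class
print(svc_demo.support_[:5])             # indices of support vectors in X_iris
```

Only these samples determine the fitted hyperplanes, so they could be highlighted on the plots above by scattering `svc_demo.support_vectors_` with a distinct marker.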