# Introduction
In this notebook we implement the Perceptron Learning Algorithm.

- Algorithms: Perceptron Learning Algorithm (PLA) and the Pocket extension
- DataSets:   "perceptron" and "pocket"
- Video: To Be Added


# Perceptron
The hypothesis set for the Perceptron is

$$ H=\{h(x)=\text{sign}(w^T x) \mid w\in R^{d+1}\}$$

Given data $D=\{(x_1, y_1), ..., (x_N, y_N)\}$ we want to minimize the following error function  

$$E_{in}(h)=\frac{1}{N}\sum_{i=1}^N \mathbb{1}_{h(x_i)\neq y_i} $$

Here $\mathbb{1}_{h(x_i)\neq y_i}$ is the indicator function 

$$\mathbb{1}_{h(x_i)\neq y_i} = \begin{cases}
1 &\text{if }h(x_i) \neq y_i\\
0 &\text{else}
\end{cases}$$

Notice that any hypothesis $h(x)=\text{sign}(w^Tx)\in H$ is described exactly by its weights $w\in R^{d+1}$. Our goal is to find weights $w$ such that we minimize the in-sample error $E_{in}$. 
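
As a small illustration of these definitions, the sketch below evaluates a hypothesis on four hand-made points. The data, the weights `w`, and the helper names `h` and `misclassified_count` are made up for this example and are not part of the notebook's code. Each row of `X` starts with a constant 1 so that `w[0]` acts as the bias, which is one common way to end up with $d+1$ weights. The sketch reports both the number of misclassified points and that number divided by $N$.

```python
import numpy as np

# Four hand-made points; the leading 1 in every row is the bias coordinate.
X = np.array([[1.0,  2.0,  1.0],
              [1.0,  1.0,  3.0],
              [1.0, -1.0, -2.0],
              [1.0, -2.0,  1.0]])
y = np.array([1, 1, -1, -1])

# Hypothetical weights, chosen by hand for illustration.
w = np.array([0.0, 1.0, -1.0])

def h(X, w):
    """Evaluate h(x) = sign(w^T x) for every row x of X."""
    return np.sign(X @ w)

def misclassified_count(X, y, w):
    """Count the points where h(x_i) differs from y_i."""
    return int(np.sum(h(X, w) != y))

predictions = h(X, w)
count = misclassified_count(X, y, w)
print("predictions:", predictions)
print("misclassified:", count, "fraction:", count / len(y))
```

Different weights `w` give different hypotheses, so searching for a good hypothesis amounts to searching for good weights.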

Run the following code. It will visualize a set of data.

In [None]:
import ipython_notebook_importer
import DataSet as ds

data = ds.DataSet("perceptron", 100)
data.plot() 

Notice that the two groups of points can be separated by a line. We say that the data is *linearly separable*. The following algorithm is called the Perceptron Learning Algorithm. If the data is linearly separable, it will find a separating line. The perceptron then classifies all points correctly and thus gets $E_{in}(h)=0$!

       Perceptron Learning Algorithm
       w = initialize (e.g. all zeros)
       while there is a misclassified point in D
          pick a misclassified point x with label y
          update weights w = w + learning_rate * x * y
      
It can be proved that 

$$\text{D is linearly separable}\quad\Rightarrow \quad\text{The PLA algorithm finds a separating hyperplane in a finite number of steps}$$

Several proofs [1,2,3] are available online (the proof is not a part of the AU ML 2017 curriculum). 
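
Without going into the proof, one way to see why the update rule is sensible is the following short calculation. Suppose $(x, y)$ is misclassified, so $y\,w^Tx \le 0$, and write $\eta$ for the learning rate (a symbol introduced only for this remark). The updated weights $w' = w + \eta\, y\, x$ satisfy

$$ y\,w'^T x = y\,(w + \eta\, y\, x)^T x = y\,w^T x + \eta\, y^2 \|x\|^2 = y\,w^T x + \eta\, \|x\|^2 $$

since $y^2 = 1$. Each update therefore increases $y\,w^Tx$ by $\eta\|x\|^2$, which is positive whenever $x \neq 0$. This moves the point towards the correct side of the line, although a single step need not fix the point and may cause other points to become misclassified.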

# Code: Perceptron 
The following code implements the class `Perceptron`. For simplicity all visualization code has been moved to `hide_visualization_code`. 

In [None]:
import ipython_notebook_importer
import DataSet as ds
import hide_visualization_code
import matplotlib.pyplot as plt
import numpy as np
import time

class Perceptron:

    def __init__(self, learning_rate=0.01, visualize=False, sleep=0.0):
        self.w = None
        self.learning_rate = learning_rate
        self.visualize = visualize
        self.sleep = sleep
        
        if self.visualize: 
            hide_visualization_code.init_perceptron(self)
            
    def fit(self, X, y):
        """ Train the Perceptron on data X with labels y. 
        
        Parameters
        ----------
        X:    Matrix with shape (n, d) with data point x_i as the i'th row.   
        y:    Array with shape (n, ) with label y_i on the i'th entry. 
        """
        n, d = X.shape
        
        # Initialize weights 
        self.w = np.zeros(d)
        
        # Get a misclassified point. The first return value is True if all 
        # points are classified correctly. 
        all_classified_corectly, misclassified_point, misclassified_label = self.misclassified(X, y)

        while not all_classified_corectly:
            # Visualize if enabled
            if self.visualize: self.visualize_step(X, y)
            
            # Update weights
            self.w += self.learning_rate * misclassified_point * misclassified_label
        
            # Get a new misclassified point. 
            all_classified_corectly, misclassified_point, misclassified_label = self.misclassified(X, y)
            
        # Visualize the last round if enabled
        if self.visualize: self.visualize_step(X, y)

    def misclassified(self, X, y):
        """ Finds a misclassified point and its label. 
        
        Parameters
        ----------
        X:    Matrix with shape (n, d) with data point x_i as the i'th row.   
        y:    Array with shape (n, ) with label y_i on the i'th entry. 
        
        Returns
        -------
        all_correct:            True if all points are classified correctly, otherwise False. 
        misclassified_point:    A misclassified point, or None if there is no such point. 
        misclassified_label:    The label of that point, or None if there is no such point. 
        """
        # Predict the class of each data point. 
        predictions = self.predict(X)
        
        # Count the misclassified points. 
        misclassified_count = np.sum(predictions != y)
        
        # Return (True, None, None) if all points are correctly classified. 
        if misclassified_count == 0: return True, None, None
        
        # Filter out the points where predictions disagree with labels. 
        misclassified_points = X[predictions != y]
        misclassified_labels = y[predictions != y]
        
        # Return the first misclassified point and its label. 
        return False, misclassified_points[0], misclassified_labels[0]
        
    def error(self, X, y):
        """ Compute the in-sample error: the fraction of misclassified points. """
        return np.mean(self.predict(X) != y) 
    
    def predict(self, X):
        """ Predicts the class of X given the trained weights. """
        return np.sign(X @ self.w)
    
    def visualize_step(self, X, y, subclass=False): 
        hide_visualization_code.visualize_perceptron(self, X, y, subclass)

Let's try to run the `Perceptron` algorithm on the dataset! I have written code that will visualize each step of the algorithm.

In [None]:
data = ds.DataSet("perceptron")       

perceptron = Perceptron(learning_rate=1, visualize=True, sleep=0.78)
perceptron.fit(data.X, data.y)

This will eventually find a line separating the data; however, it might take a while. To visualize the algorithm I added `sleep=0.78` in the code above. This forces the algorithm to pause for 0.78 seconds in each iteration so we can see what happens. If you want to see the algorithm finish, you could try removing this (it might take between 200 and 300 iterations to find the line).  
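
If the `DataSet` helper is not at hand, a similar experiment can be approximated on synthetic data. The sketch below is self-contained and is not the notebook's own implementation: `make_separable_data`, `pla`, the target line `true_w`, and the `margin` parameter are all made up for illustration. It counts how many weight updates are needed before every point is classified correctly.

```python
import numpy as np

def make_separable_data(n=100, margin=0.1, seed=0):
    """Sample 2D points labelled by a fixed line; drop points closer than `margin` to it."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(-1, 1, size=(n, 2))
    X = np.hstack([np.ones((n, 1)), points])   # prepend the bias coordinate
    true_w = np.array([0.1, 1.0, -1.0])        # hypothetical target line
    scores = X @ true_w
    keep = np.abs(scores) > margin
    return X[keep], np.sign(scores[keep])

def pla(X, y, learning_rate=1.0, max_updates=10_000):
    """Run the perceptron updates; return the weights and the number of updates used."""
    w = np.zeros(X.shape[1])
    for updates in range(max_updates):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:
            return w, updates
        i = wrong[0]                           # take the first misclassified point
        w += learning_rate * y[i] * X[i]
    raise RuntimeError("no separating line found within max_updates")

X, y = make_separable_data()
w, updates = pla(X, y)
print("updates:", updates, "misclassified:", int(np.sum(np.sign(X @ w) != y)))
```

Dropping points that lie very close to the target line is a design choice of this sketch: data with a larger gap between the classes tends to need fewer updates, so changing `margin` is an easy way to see how the running time varies.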

# Pocket
Unfortunately, it is very rare that our data is linearly separable. Try to run the following code; it will visualize data that is not linearly separable.

In [None]:
data = ds.DataSet("pocket")
data.plot()

If we ran the `Perceptron` on this dataset, it would run forever. How could we fix this? We change the *while* loop to a *for* loop with $T$ iterations and return the best weights, that is, the weights with the smallest in-sample error $E_{in}$ among those seen during the $T$ iterations.   

       Pocket Learning Algorithm
       w = initialize (e.g. all zeros)
       for i=1,...,T
          pick a misclassified point x with label y (if none, stop)
          update weights w = w + learning_rate * x * y
          compute in-sample error of w 
          if the error is lower than the best so far, save w 
       return best w of all T iterations
      
You can think of this as storing the current best weights in your pocket.
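
The pseudocode above can be sketched as a single self-contained function. This is an illustrative version rather than the notebook's implementation: the function name `pocket` and the four XOR-style points are made up for the example. No line classifies all four of these points correctly, so the error cannot drop below 0.25, and plain PLA would never stop on them.

```python
import numpy as np

def pocket(X, y, T, learning_rate=1.0):
    """Run at most T perceptron updates; return the best weights seen and their error."""
    w = np.zeros(X.shape[1])
    best_w, best_error = w.copy(), np.mean(np.sign(X @ w) != y)
    for _ in range(T):
        wrong = np.flatnonzero(np.sign(X @ w) != y)
        if wrong.size == 0:
            break                                  # nothing left to fix
        i = wrong[0]                               # take the first misclassified point
        w = w + learning_rate * y[i] * X[i]
        error = np.mean(np.sign(X @ w) != y)
        if error < best_error:                     # keep the better weights in the pocket
            best_w, best_error = w.copy(), error
    return best_w, best_error

# XOR-style labels (bias coordinate first): not linearly separable.
X = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 1.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0]])
y = np.array([-1, -1, 1, 1])

best_w, best_error = pocket(X, y, T=50)
print("best weights:", best_w, "best error:", best_error)
```

The current weights keep changing for all 50 iterations, while the pocket only changes when a strictly lower error is found. That is why the returned weights can be better than the weights of the final iteration.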

# Code: The Pocket Algorithm

The following code implements the class `Pocket`. It is a simple modification of the class `Perceptron`. It runs $T$ iterations and returns the best of the $T$ hypotheses. 

The `Pocket` class inherits the functions of the `Perceptron` class, so you need to run the code above.

In [None]:
class Pocket(Perceptron): 
    
    def fit(self, X, y, T):
        """ Train the Perceptron on data X with labels y. At each iteration evaluate the performance
        of the current weights. Save the weights with the best performance and return these weights
        after T iterations. """
        n, d = X.shape
        
        # Initialize weights 
        self.w = np.zeros(d)
        
        # Get a misclassified point. The first return value is True if all 
        # points are classified correctly. 
        all_classified_corectly, misclassified_point, misclassified_label = self.misclassified(X, y)
        
        # Initialize best error to the worst
        best_error = 1.0
        self.best_w = np.zeros(d)

        for i in range(T): 
            # Visualize if enabled
            if self.visualize: self.visualize_step(X, y)
            
            # Update weights
            self.w += self.learning_rate * misclassified_point * misclassified_label
        
            # If current error is better than previous update best error and 
            # best weights. 
            current_error = self.error(X, y)
            if current_error < best_error: 
                best_error = current_error
                self.best_w = np.copy(self.w) # copy and save the best weights.
                # If we get zero in-sample error we are done. 
                if best_error == 0:
                    return
                
            # Get a new misclassified point. 
            all_classified_corectly, misclassified_point, misclassified_label = self.misclassified(X, y)
            
        # Visualize the last round if enabled
        if self.visualize: self.visualize_step(X, y)
            
        # Set the final weights to the best weights
        self.w = self.best_w
            
    def visualize_step(self, X, y):
        # Let Perceptron draw the step as usual, then draw the pocket weights afterwards. 
        super().visualize_step(X, y, subclass=True)
        hide_visualization_code.visualize_pocket(self, X, y)

Let's run the Pocket algorithm on some data! The visualization plots the best hypothesis so far with dashed lines. 

In [None]:
data = ds.DataSet("pocket")

pocket = Pocket(learning_rate = 1, visualize = True, sleep=0.1)
pocket.fit(data.X, data.y, 50)

# Experiment: Breast Cancer
So far we have run `Perceptron` and `Pocket` on the "perceptron" and "pocket" datasets. I generated these datasets myself to allow a nice visualization of the algorithms. In this section we will explore running `Pocket` on real data instead of artificially generated data.

The dataset we are going to use is called `breast_cancer` [4]. The input and output of our algorithm will be 

\begin{align}
X:& \quad\text{characteristics of the breast cell nuclei}\\
y:& \quad\text{malignant or benign (cancer/not cancer)}
\end{align}

The original dataset measures 30 different characteristics of breast cell nuclei. To simplify matters and allow us to visualize our algorithm, I reduced the 30 dimensions to 2 in a "smart way" [5]. What I mean by "smart way" will be covered later on. For now you should just think of the breast cancer data as being 2-dimensional.  

Let's check out the dataset. 

In [None]:
import ipython_notebook_importer
import DataSet as ds

data = ds.DataSet("breast_cancer_2d")
data.plot()

Let's try to run the `Pocket` algorithm on the dataset for $100$ iterations! You can speed up the time the algorithm takes by disabling visualization (`visualize=False`). 

In [None]:
data = ds.DataSet("breast_cancer_2d")

iterations = 100
p = Pocket(visualize=True, learning_rate=1)
p.fit(data.X, data.y, iterations)

print("Error of best weights: ", p.error(data.X, data.y))

This should give an in-sample error of around 0.15. You might try to increase `iterations` and see if this gives you a better in-sample error. It probably won't. The data isn't linearly separable. Furthermore, many of the blue and green points lie on top of each other. 

There is one thing we can do to improve performance slightly. Remember that the data originally had 30 dimensions. Learning directly on the full data will take more time, but we will probably get a better in-sample error! To reduce the time (and because the data isn't 2-dimensional) we will turn visualization off (`visualize=False`).  

In [None]:
data = ds.DataSet("breast_cancer")

pocket = Pocket(visualize=False)
pocket.fit(data.X, data.y, 100)
pocket.error(data.X, data.y)

So, that's it. Our first algorithm classifies roughly $90$% of the breast cancer dataset correctly in-sample! The next notebooks explore different linear models that will get better in-sample error. That said, there are limitations to linear models; we will later see how introducing non-linearities significantly improves our error. The breast cancer dataset is also fairly easy; we will later introduce more difficult datasets. 

# References
[1] http://www.cs.columbia.edu/~mcollins/courses/6998-2012/notes/perc.converge.pdf

[2] https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-867-machine-learning-fall-2006/lecture-notes/lec2.pdf

[3] http://www.cems.uvm.edu/~rsnapp/teaching/cs295ml/notes/perceptron.pdf

[4] http://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+%28diagnostic%29

[5] I used a technique called Principal Component Analysis (PCA). This will be covered in later iPython Notebooks.

# Errors, Suggestions and Hall Of Fame
If you find any mistakes or have suggestions for improvements reach me at alexmath@cs.au.dk. Any help is very much appreciated, I'll even add your name below for super-awesome everlasting fame!

- ...
