# Perceptron
* Let's now look at the famous perceptron algorithm
* It is seen as the precursor to neural networks
* It is a **linear binary classifier**

# History
* Invented in 1957 by Frank Rosenblatt
* Originally designed for image recognition
* Led to a great deal of excitement for AI
* Famous for not being able to solve XOR, since the XOR classes are not linearly separable
* This limitation caused a significant decline in interest

# Theory - Limitations of our problem
* Setup:
* Perceptrons only handle binary classification
* Instead of using targets = {0, 1}, we will use targets = {-1, +1}
* This is very convenient, as we will see in training
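
* If the labels arrive in the {0, 1} convention, they can be remapped before training. A minimal sketch, assuming NumPy; the array `targets_01` is a made-up example and not part of the original notes:

```python
import numpy as np

# Hypothetical labels in the {0, 1} convention.
targets_01 = np.array([0, 1, 1, 0, 1])

# Map 0 -> -1 and 1 -> +1.
targets_pm1 = 2 * targets_01 - 1
```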

# Prediction
* Prediction with a perceptron is very simple! It is just like any other linear classifier
* We simply take the input x, compute the dot product of it with the weights w, and add the bias b
### $$w^Tx+b$$
* if $w^Tx+b = 0$ we fall directly on the line/hyperplane
* if $w^Tx+b > 0$ we predict +1
* if $w^Tx+b < 0$ we predict -1
* in other words, our prediction is:
### $$sign(w^Tx+b)$$
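
* The prediction rule can be sketched in a few lines of NumPy. The function name `predict` and the sample values are assumptions for illustration. Points that fall exactly on the hyperplane are assigned to +1 here, which is an arbitrary choice.

```python
import numpy as np

def predict(X, w, b=0.0):
    """Return +1 or -1 for each row of X using the sign of w^T x + b."""
    scores = X @ w + b
    # np.sign would return 0 on the boundary; we arbitrarily map that case to +1.
    return np.where(scores >= 0, 1, -1)

# Hypothetical weights and inputs.
w = np.array([1.0, -1.0])
X = np.array([[2.0, 1.0],    # score =  1.0 -> +1
              [0.0, 3.0]])   # score = -3.0 -> -1
predictions = predict(X, w)
```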

# Perceptron Training
* Training the perceptron is the interesting part
* It is an iterative procedure: we go through a for loop a certain number of times, and at each iteration, called an "epoch", the classification rate should go up on average as we converge to the final solution
* Let's look at the pseudocode

![perceptron%20pseudo%20code.png](attachment:perceptron%20pseudo%20code.png)

* The first thing we need to do is randomly initialize the weights
* It is common to make w Gaussian distributed, and to set b = 0 to start
* We won't consider the bias right now, since we will soon see that it can be absorbed into w (the weights)
* Then we loop up to the maximum number of iterations. We won't necessarily go through every iteration: if we reach the point where we classify every sample correctly, we can just break out of the loop
* Inside the loop, the first thing we do is retrieve all of the currently misclassified examples
* Of course, this means we need to do a prediction
* We pick a misclassified sample at random and update w ($\eta$ is known as the learning rate)
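
* The steps above can be sketched in NumPy as follows. This is a sketch under stated assumptions, not the exact pseudocode from the image: the function name `fit_perceptron`, the defaults for `eta`, `max_epochs` and `seed`, and the toy data are all made up for illustration. The bias is updated explicitly here, which matches what the update rule does to the weight of a constant input of 1.

```python
import numpy as np

def fit_perceptron(X, y, eta=1.0, max_epochs=1000, seed=0):
    """Train a perceptron on targets in {-1, +1}; returns (w, b)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])  # Gaussian-distributed initial weights
    b = 0.0                              # bias starts at zero

    for _ in range(max_epochs):
        # Predict with the current weights and collect the misclassified indices.
        preds = np.where(X @ w + b >= 0, 1, -1)
        misclassified = np.flatnonzero(preds != y)
        if misclassified.size == 0:
            break  # every sample is classified correctly

        # Pick one misclassified sample at random and apply the update rule.
        i = rng.choice(misclassified)
        w = w + eta * y[i] * X[i]
        b = b + eta * y[i]  # same rule applied to a constant input of 1
    return w, b

# Hypothetical linearly separable toy data.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = fit_perceptron(X, y)
train_preds = np.where(X @ w + b >= 0, 1, -1)
```

* On linearly separable data like this the loop stops early. On data that is not separable, it simply runs until `max_epochs` is reached.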

# How does this process help get the optimal weights? 
* How does:
### $$w \leftarrow w + \eta\, y\, x$$
* help move w in the right direction?
* Well, first we need to understand the geometry behind planes and lines
* Recall that a vector perpendicular to a line can be used to define the line
* We call this the normal vector
* For a line written as $ax + by + c = 0$, the normal vector is n = (a, b)

![normal%20vector.png](attachment:normal%20vector.png)

## Lines
* we refer to this normal vector as w

![normal%20vector%202.png](attachment:normal%20vector%202.png)

* The bias term determines where the line intersects the x2 (aka vertical/y) axis: setting $x_1 = 0$ in $w_1x_1 + w_2x_2 + b = 0$ gives $x_2 = -b/w_2$
* However, we will ignore it for now, since the bias term can also be absorbed into w
* How? By assuming we have another column of x that is always equal to 1
* Initial model:
### $$y = w_0+w_1x_1+w_2x_2$$
* New model:
### $$y = w_0x_0+w_1x_1+w_2x_2, x_0 = 1$$
* Hence, any model where we do not consider the bias explicitly can be assumed to contain a bias term anyway
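
* A quick numerical check of this trick, assuming NumPy. The names `X_aug` and `w_aug` and all of the values are hypothetical:

```python
import numpy as np

# Hypothetical inputs, weights and bias.
X = np.array([[2.0, 1.0], [0.0, 3.0]])
w = np.array([1.0, -1.0])
b = 0.5

# Prepend a column x_0 = 1 to the inputs and w_0 = b to the weights.
X_aug = np.hstack([np.ones((X.shape[0], 1)), X])
w_aug = np.concatenate([[b], w])

explicit_bias = X @ w + b      # w^T x + b
absorbed_bias = X_aug @ w_aug  # the same scores with no separate bias term
```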

## Training - case 1
* So currently, we have some w pointing in some direction which is not yet correct
* The line (the classifying line) and its corresponding w are both shown in black

![perceptron%20training.png](attachment:perceptron%20training.png)

* We find an x which is not classified correctly, which is shown in red
* Suppose w and x are both on the same side of the line
* Then the dot product $w^Tx$ is greater than 0, because the angle between them is less than 90 degrees
* That means x is mistakenly being classified as +1, when it should be classified as -1
* So we update w. With a learning rate of 1, this is equivalent to subtracting x from w, since y (the target) is -1
    * w = w + (y*x) = w + (-1*x) = w - x
* This shifts the line so that it either classifies x correctly, or is at least closer to doing so, in which case a later update may classify x correctly
* in this particular picture, the new line in green classifies x correctly
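
* Case 1 can be checked numerically. The numbers below are made up for illustration (they are not read off the figure), and the learning rate is taken to be 1:

```python
import numpy as np

# Hypothetical numbers chosen only to illustrate case 1.
w = np.array([1.0, 1.0])   # current (incorrect) weights
x = np.array([2.0, 0.5])   # misclassified sample
y = -1                     # its true target
eta = 1.0                  # learning rate of 1

score_before = w @ x       # 2.5 > 0, so we wrongly predict +1
w_new = w + eta * y * x    # same as w - x, giving [-1.0, 0.5]
score_after = w_new @ x    # -1.75 < 0, so we now predict -1
```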

## Training - case 2
* Now let's consider the case where x is on the other side of the line from w

![perceptron%20training%202.png](attachment:perceptron%20training%202.png)

* The black vector w and the black line correspond to our incorrect setting of w
* The angle between x and w is greater than 90 degrees, so the dot product is negative
* This means we predict -1, but the target is +1
* When we update, this is equivalent to adding x to w (again with a learning rate of 1), since y is now +1
* Notice how this shifts the line so that we either classify x correctly, or we are closer to being able to do so
* in this example, shown in green, we are now classifying x correctly
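
* Both cases can be summarized with one line of algebra (a short derivation added for illustration, writing $w'$ for the updated weights). A sample is classified correctly when $y\,w^Tx > 0$. After the update $w' = w + \eta y x$:
### $$y\,w'^Tx = y\,(w + \eta y x)^Tx = y\,w^Tx + \eta y^2 \|x\|^2 = y\,w^Tx + \eta \|x\|^2$$
* The last step uses $y^2 = 1$, which is where the {-1, +1} targets are convenient
* So every update increases $y\,w^Tx$ by $\eta \|x\|^2$, pushing it toward the positive (correct) side. One update may not be enough to make it positive, which is why the line is only guaranteed to move closer to classifying x correctly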

# Summary
* We have introduced an iterative algorithm that can train a linear binary classifier
* We have seen how the update rule adjusts w so that it better predicts currently misclassified samples, and we have seen geometrically why it works