# The Perceptron

A perceptron is a simple model that, given enough data, can help us solve binary decision problems like the one in the case study below. **The perceptron can be defined as an algorithm for supervised binary classification.**

### What's on the agenda: 

- We will provide a case study on the perceptron.
- We will perform inference with the perceptron.
- We will train the perceptron with data.
- We will scale the perceptron up to a Multi-Layer Perceptron (MLP).
- We will finish with an in-class assignment in which you build your own MLP model on data!

### Case Study: Mbali's Gym Patterns

Let's imagine we have to predict whether your friend, Mbali, will go for a workout or not. We take the following factors into account:

- Weather (sunny or rainy)?
- Time of day (morning or evening)?
- Energy level (energized or not)?

<img src="resources/perceptron_cartoon.png" style="max-width:100%; width: 70%; max-width: none">

Perceptrons are made up of three components, namely:

   - The input ($x_k$) : For example, the current status of the weather, Mbali's energy level, the time of day.
   - The input weights ($w_k$) : For example, how to weight the contribution of Mbali's energy towards the final decision (Gym or No Gym)
   - The activation function ($\sigma$) : For example, a decision function which integrates the weighted inputs to produce the final decision (Gym or No Gym)
    
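Mathematically, this translates to:

$$
y = \sigma\left( \sum_{k=1}^{N} w_k x_k + b \right)
$$

where $N$ is the number of inputs and $b$ is the bias, a constant offset added to the weighted sum.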

### Let's dig into building this perceptron.

#### Prelim: Let's make sure our packages work!

In [None]:
import numpy as np

#### To make the outputs deterministic, we seed the random number generator with a constant. This will guarantee that every time you run the code, you will get the same random distribution:

In [8]:
np.random.seed(1)

### Let's start building a perceptron!

Let's define a framework (class) to create perceptrons. With this class, we can create however many perceptrons we want!

We define a perceptron class which will compute the **dot-product** $w \cdot x$ and add the **bias** $b$ to it. Subsequently, the activation function will convert this **dot-product** to the **activation value** $a$ of the unit:

In [96]:
class Perceptron:
    
    def __init__(self, weights, bias, activation):
        self.weights = weights
        self.bias = bias
        self.activation = activation
        
    def predict(self, x):
        pre_activation = self.weights.dot(x) + self.bias
        return self.activation(pre_activation)

Here, we define our example data **input** $x$ 

In [90]:
weather = 0
evening = 1
energized = 1
x = np.array([weather, evening, energized])

Let's define the corresponding **weights** $w$, and **bias** $b$ of our perceptron $p$:

In [91]:
weights = np.array([0.5, 0.5, 0])
bias = 0

Furthermore, let's now define our *activation* function. We choose the binary activation function below, which outputs $1$ only when the pre-activation is exactly $0.5$, but we are free to choose any other activation function:

In [89]:
def binary_activation_fn(z):
    if z == 0.5:
        return 1
    else: return 0

Here, we create our first perceptron object with our corresponding **weights**, **bias** and **activation function**:

In [92]:
perceptron = Perceptron(weights, bias, binary_activation_fn)

Let's make our first prediction with $x$:

In [94]:
prediction = perceptron.predict(x)

In [95]:
print("Activation Value of Perceptron (a):", prediction)

Activation Value of Perceptron (a): 1
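
To see where this value comes from, the prediction can be worked out by hand from the inputs $x = [0, 1, 1]$, the weights $w = [0.5, 0.5, 0]$ and the bias $b = 0$ defined earlier:

$$
w \cdot x + b = (0.5)(0) + (0.5)(1) + (0)(1) + 0 = 0.5
$$

Since the pre-activation equals $0.5$, `binary_activation_fn` returns $1$.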


### Mini-Assignments

Provide an instance $x$ to the perceptron $p$ for the case where your friend, Mbali, does not feel energized, it is the morning, and it is raining.

Create an activation function with a threshold of $0.5$, and provide an input to the perceptron for the case where Mbali feels energized while it is raining outside in the evening.

$$
y = \begin{cases}
    \ 0 & \quad \text{if } w \cdot x + b \leq 0.5\\
    \ 1 & \quad \text{if } w \cdot x + b > 0.5
  \end{cases}
$$

Now with the same function, play with bias values = $[-1, -0.5, 0, 0.5, 1]$ and observe how that changes the output of the perceptron.

## Case Study: Logic Gates (AND, OR) with Perceptrons

#### Logic gates are one of the first applications of the perceptrons and have been heavily explored in early literature. Here we get the opportunity to explore some of the early research questions in this field.

Logic gates are boolean functions $f:\{0,1\}^k \rightarrow \{0,1\}$, that have applications in electronic gates and mathematical logic.

Perceptrons are thought to be good function approximators for Logic Gates. In this mini-tutorial, we will fit logic gates, AND & OR, with a perceptron!

| x1 | x2  | AND | OR |
| --- | ------ |:------:| ---- |
| 0 | 0  | 0 | 0 |
| 0 | 1  | 0 | 1 |
| 1 | 0  | 0 | 1 |
| 1 | 1  | 1 | 1 |

The task is to implement a simple **perceptron** to compute logical operations like AND, and OR.

- Input: $x_1$ and $x_2$
- Bias: $b = -1$ for AND; $b = 0$ for OR
- Weights: $w = [1, 1]$

with the following activation function:

$$
y = \begin{cases}
    \ 0 & \quad \text{if } w \cdot x + b \leq 0\\
    \ 1 & \quad \text{if } w \cdot x + b > 0
  \end{cases}
$$

Let's define our X:

In [97]:
X = np.array([[0,0]
    ,[1,0]
    ,[0,1]
    ,[1,1]])

We can define this threshold function in Python as:

In [98]:
def activation(z):
    if z > 0:
        return 1
    return 0

For the AND operation, we implement the perceptron as:

In [100]:
w = np.array([1, 1])
b = -1

perceptron = Perceptron(w, b, activation)

for x in X:
    print(f'{x[0]} AND {x[1]}: {perceptron.predict(x)}')

0 AND 0: 0
1 AND 0: 0
0 AND 1: 0
1 AND 1: 1


For the OR operation, we implement the perceptron as:

In [101]:
w = np.array([1, 1])
b = 0

perceptron = Perceptron(w, b, activation)

for x in X:
    print(f'{x[0]} OR {x[1]}: {perceptron.predict(x)}')

0 OR 0: 0
1 OR 0: 1
0 OR 1: 1
1 OR 1: 1
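
As a quick sanity check, the two gates can also be evaluated for all four inputs at once and compared with NumPy's own bitwise operators. This is only a sketch: the names `step`, `X_gate` and `w_gate` are introduced here for illustration and are not part of the `Perceptron` class above.

```python
import numpy as np

def step(z):
    """Threshold activation: 1 if z > 0, else 0 (applied element-wise)."""
    return (z > 0).astype(int)

# All four input combinations, one per row.
X_gate = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
w_gate = np.array([1, 1])

# Biases taken from the text: -1 for AND, 0 for OR.
and_out = step(X_gate @ w_gate + (-1))
or_out = step(X_gate @ w_gate + 0)

# Compare against NumPy's bitwise operations on the two input columns.
print("AND matches:", np.array_equal(and_out, X_gate[:, 0] & X_gate[:, 1]))
print("OR matches: ", np.array_equal(or_out, X_gate[:, 0] | X_gate[:, 1]))
```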


## In-Class Exercise

Can you implement a perceptron that computes the NOR operation for two inputs?

| x1 | x2  | NOR |
| --- | ------ |:------:|
| 0 | 0  | 1 |
| 0 | 1  | 0 |
| 1 | 0  | 0 |
| 1 | 1  | 0 |

# Training Perceptrons: Gradient Descent and Backpropagation

### Loss Functions

To teach the perceptron to recognize pictures of cats ***(for example)***, we show it a bunch of pictures and tell it whether each picture is a cat or not. The perceptron tries to guess whether each picture is a cat or not based on the numbers it receives, but it doesn't always get it right.

***The "loss function" is like a scorekeeper that tells you how well the perceptron is doing at recognizing cats.*** Every time the perceptron makes a mistake, the scorekeeper gives it a "point" (or a "loss"). Your goal is to make the scorekeeper's score as low as possible by adjusting the perceptron's "weights". The weights are like the settings inside the perceptron that determine how it works. By minimizing the scorekeeper's score, you can teach the perceptron to become better at recognizing cats.
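
To make the scorekeeper idea concrete, here is a small sketch with made-up labels and outputs (the values are hypothetical). It counts mistakes as "points" and also computes the squared error, a loss that additionally rewards outputs that are close to the target.

```python
import numpy as np

# Hypothetical labels (1 = cat, 0 = not a cat) and perceptron outputs.
targets = np.array([1, 0, 1, 1])
outputs = np.array([0.9, 0.2, 0.4, 0.8])

# Zero-one loss: one "point" per wrong guess after thresholding at 0.5.
guesses = (outputs > 0.5).astype(int)
zero_one_loss = np.sum(guesses != targets)

# Squared-error loss: each output is penalised by its squared distance to the target.
squared_error = np.sum((targets - outputs) ** 2)

print("mistakes:", zero_one_loss)
print("squared error:", squared_error)
```

Only the third example is misclassified here, so the mistake count is 1, while the squared error also registers how far each of the other outputs is from its target.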

### Gradient Descent

To update the weights of a perceptron $p$ given some data $x$ and loss function $L$, we use ***Gradient Descent***.

***Gradient Descent*** is a first-order iterative optimization algorithm. The weights are first initialized randomly; the algorithm then takes repeated steps in the direction opposite to the gradient of the loss until a stopping condition is met (for example, the loss stops decreasing or a maximum number of steps is reached).

![SGD](resources/sgd.gif)

The gradient descent update can be written as:

$$
w_{t+1} = w_{t} - \alpha \nabla_{w} L(w_{t})
$$

where $w_t$ denotes the weights after $t$ update steps, $L$ is the loss function and $\alpha$ is the learning rate.
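
The update rule can be tried out on a toy problem before applying it to a perceptron. The sketch below minimises a one-dimensional quadratic; the loss, the starting point and the learning rate are all assumptions chosen for illustration.

```python
def toy_loss(w):
    """Toy loss with its minimum at w = 3."""
    return (w - 3.0) ** 2

def toy_loss_grad(w):
    """Derivative of the toy loss with respect to w."""
    return 2.0 * (w - 3.0)

alpha = 0.1    # learning rate (assumed value)
w_toy = 0.0    # starting point (assumed value)

for _ in range(100):
    # Step in the direction opposite to the gradient.
    w_toy = w_toy - alpha * toy_loss_grad(w_toy)

print(w_toy, toy_loss(w_toy))
```

Each step shrinks the distance to the minimum by a constant factor, so `w_toy` ends up very close to 3.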

***Backpropagation*** computes the gradient of the loss function with respect to the weights of the network, $\nabla_{w} L$, for a single input-output example. It does so efficiently by computing the gradient one layer at a time, iterating backward from the last layer so that intermediate terms of the chain rule are not recomputed.

Let's first define a simple loss function $L(w)$ for a perceptron $p$:

$$
L(w) = (target - y)^2
$$

$$
y = \sigma\left( \sum_{k=1}^{N} w_k x_k + b \right)
$$

Writing $z = \sum_{k=1}^{N} w_k x_k + b$ for the pre-activation, so that $y = \sigma(z)$, the chain rule gives the gradient for a perceptron:

$$
\frac{\partial L}{\partial w_i} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial z} \frac{\partial z}{\partial w_i}
$$
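
A chain-rule gradient like this one can be sanity-checked numerically. The sketch below assumes a sigmoid activation and the squared-error loss defined above, and compares the analytic gradient with a central finite-difference estimate. All example values and the `_chk` names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(w, b, x, target):
    """Squared error of a single sigmoid perceptron on one example."""
    y = sigmoid(np.dot(w, x) + b)
    return (target - y) ** 2

def analytic_grad(w, b, x, target):
    """Chain rule: dL/dy * dy/dz * dz/dw, with z = w . x + b."""
    y = sigmoid(np.dot(w, x) + b)
    return -2.0 * (target - y) * y * (1.0 - y) * x

# Hypothetical example values.
w_chk = np.array([0.2, -0.4, 0.1])
x_chk = np.array([1.0, 0.0, 1.0])
b_chk, target_chk, eps = 0.05, 1.0, 1e-6

# Central finite differences, one weight at a time.
numeric_grad = np.zeros_like(w_chk)
for i in range(len(w_chk)):
    shift = np.zeros_like(w_chk)
    shift[i] = eps
    numeric_grad[i] = (loss_fn(w_chk + shift, b_chk, x_chk, target_chk)
                       - loss_fn(w_chk - shift, b_chk, x_chk, target_chk)) / (2 * eps)

print(analytic_grad(w_chk, b_chk, x_chk, target_chk))
print(numeric_grad)
```

The two printed vectors should agree to several decimal places. The second entry is zero in both because the corresponding input is zero.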

In this example, we fit a simple logic problem. We want to make the following predictions from the input:

| x1 | x2 | x3 | Output |
| - | - | - |:------:|
| 0 | 0 | 1  | 0      |
| 1 | 1 | 1  | 1      |
| 1 | 0 | 1  | 1      |
| 0 | 1 | 1  | 0      |

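Up until now, we have only used threshold (step) activation functions. These are not differentiable at the threshold and have zero slope everywhere else, so they give gradient descent nothing to work with. To approximate more interesting functions and to train with gradients, we need other non-linear activation functions, namely the Sigmoid and ReLU activation functions.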

We will use the *[Sigmoid](http://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#sigmoid)* activation function:

In [8]:
def sigmoid(z):
    """The sigmoid activation function."""
    return 1 / (1 + np.exp(-z))

We could use the [ReLU](http://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#activation-relu) activation function instead:

In [4]:
def relu(z):
    """The ReLU activation function (works element-wise on scalars and arrays)."""
    return np.maximum(0, z)

The [Sigmoid](http://ml-cheatsheet.readthedocs.io/en/latest/activation_functions.html#sigmoid) activation function introduces non-linearity to the computation. It maps the input value to an output value between $0$ and $1$.

<img src="resources/SigmoidFunction1.png" style="max-width:100%; width: 30%; max-width: none">

The derivative of the sigmoid function is maximal at $x=0$ and approaches $0$ as $x$ becomes very negative or very positive:

<img src="resources/sigmoid_prime.png" style="max-width:100%; width: 25%; max-width: none">

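The *sigmoid_prime* function returns the derivative of the sigmoid, which is the slope of the sigmoid function at a given point. In terms of the sigmoid output $a = \sigma(z)$, the derivative is $\sigma'(z) = a (1 - a)$. The function below therefore expects the already-activated value $\sigma(z)$ as its argument, not the raw pre-activation:

In [9]:
def sigmoid_prime(z):
    """The derivative of the sigmoid, where z is the sigmoid output."""
    return z * (1 - z)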
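
The closed form can be checked against a numerical slope estimate. This sketch redefines both functions so that it runs on its own. The test points and the parameter name `a` (standing for the sigmoid output) are choices made here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_prime(a):
    """Derivative of the sigmoid, given the sigmoid output a = sigmoid(z)."""
    return a * (1 - a)

z_vals = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
eps = 1e-6

# Slope from the closed form, evaluated on the sigmoid output.
closed_form = sigmoid_prime(sigmoid(z_vals))
# Slope estimated with central finite differences.
finite_diff = (sigmoid(z_vals + eps) - sigmoid(z_vals - eps)) / (2 * eps)

print(closed_form)   # the largest entry is the one for z = 0
print(finite_diff)
```

The slope peaks at $0.25$ for $z = 0$ and falls off symmetrically on both sides.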

### In-Class Exercise

What is the derivative of the ReLU activation function? Build a function for the ReLU derivative.

We define the inputs as rows in *X*. There are three input nodes (three columns in $X$). Each row is one training example:

In [6]:
X = np.array([ [ 0, 0, 1 ],
               [ 0, 1, 1 ],
               [ 1, 0, 1 ],
               [ 1, 1, 1 ] ])
print(X)

[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]


The outputs are stored in *y*, where each row represents the output for the corresponding input vector (row) in *X*. The array is created as a single row with four columns and then transposed (using the `.T` attribute) into a column vector with four rows:

In [7]:
y = np.array([[0,0,1,1]]).T
print(y)

[[0]
 [0]
 [1]
 [1]]


For our perceptron, we create a weight matrix ($Wo$) with randomly initialized weights:

In [79]:
n_inputs = 3
n_outputs = 1
Wo = np.random.random( (n_inputs, n_outputs) ) * np.sqrt(2.0/n_inputs)
print(Wo)

[[0.43532763]
 [0.5649153 ]
 [0.25761743]]


The output weight matrix ($Wo$) has 3 rows and 1 column because it holds the weights of the connections from the three input neurons to the single output neuron. Note that `np.random.random` draws values uniformly from $[0, 1)$, so after scaling by $\sqrt{2/n_{inputs}}$ all initial weights here are non-negative rather than centred on $0$. A zero-mean initialization is usually preferred, and there is a good reason for that choice. For details, see the section on Weight Initialization in the [Stanford course CS231n on Convolutional Neural Networks for Visual Recognition](https://cs231n.github.io/neural-networks-2/#init).
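
For comparison, here is a sketch of a zero-centred variant next to the recipe used in this notebook. It uses a local generator from `np.random.default_rng` with an arbitrary seed so that it does not disturb the global seed set earlier. The names `w_positive` and `w_centred` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)   # local generator; the seed is an arbitrary choice
n_in, n_out = 3, 1
scale = np.sqrt(2.0 / n_in)

# Same recipe as in the notebook: uniform on [0, 1), then scaled.
w_positive = rng.random((n_in, n_out)) * scale

# Zero-centred alternative: shift the uniform samples to [-1, 1) before scaling.
w_centred = (2 * rng.random((n_in, n_out)) - 1) * scale

print(w_positive.min() >= 0)               # True: no negative weights
print(np.abs(w_centred).max() <= scale)    # True: values lie within [-scale, scale]
```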


The core representation of this network is basically the weight matrix *Wo*. The rest, input matrix, output vector and so on are components that we need for learning and evaluation. The learning result is stored in the *Wo* weight matrix.

We run the optimization and learning cycle 10,000 times. In the *forward propagation* line we process the entire input matrix at once, which is called **full batch** training. I do not use a separate variable name to represent the input layer; instead I use the input matrix $X$ directly. Think of this as computing the outputs for all training examples in one step. In principle the training data could contain many more examples and the code would stay the same.
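
The following sketch illustrates what "full batch" means in code: a single matrix product gives the same result as looping over the examples one at a time. The weights in `W_demo` are hypothetical values chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X_demo = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
W_demo = np.array([[0.4], [0.6], [0.3]])   # hypothetical weights

# Full batch: one matrix product handles every training example.
batch_out = sigmoid(np.dot(X_demo, W_demo))

# Same computation, one example (row) at a time.
row_out = np.array([sigmoid(np.dot(row, W_demo)) for row in X_demo])

print(batch_out.shape, np.allclose(batch_out, row_out))
```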

This is the result of the perceptron without training:

In [80]:
sigmoid(np.dot(X, Wo))

array([[0.56405051],
       [0.6947737 ],
       [0.66662175],
       [0.77865756]])

Recall that the loss function is the squared error. Applying the perceptron gradient equation to it, the backpropagation equation becomes:

$$
\frac{\partial L}{\partial w_i} = \left[ -2\,(target - y) \right] \cdot \left[ \sigma'\left(\sum_{k=1}^{N} w_k x_k + b\right) \right] \cdot \left[ x_i \right]
$$

After training with this gradient update equation, the error of our perceptron model should decrease.

Observe the error of our initialized perceptron model:

In [81]:
(y - sigmoid(np.dot(X, Wo)))**2

array([[0.31815298],
       [0.4827105 ],
       [0.11114106],
       [0.04899248]])

In [82]:
for n in range(10000):

    # forward propagation
    prediction = sigmoid(np.dot(X, Wo))

    # gradient of the squared-error loss with respect to the prediction
    loss_gradient = -2 * (y - prediction)

    # multiply by the slope of the sigmoid at the prediction (backpropagation step)
    l1_delta = sigmoid_prime(prediction) * loss_gradient
    gradients = np.dot(X.T, l1_delta)

    # update weights (gradient descent with a learning rate of 1)
    Wo -= gradients
    


Now observe the error of our trained perceptron model:

In [83]:
(y - sigmoid(np.dot(X, Wo)))**2

array([[4.61951671e-05],
       [3.06376089e-05],
       [2.03748845e-05],
       [3.07354915e-05]])

This is the result of the perceptron with training:

In [84]:
sigmoid(np.dot(X, Wo))

array([[0.0067967 ],
       [0.00553513],
       [0.99548615],
       [0.99445604]])
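
The training loop can also be wrapped in a function with an explicit learning rate. The sketch below is one possible arrangement, not the notebook's own code: it starts from zero weights (an assumption made here so the result is deterministic) and thresholds the final outputs at $0.5$ to obtain class labels.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_perceptron(X, y, learning_rate=1.0, n_steps=10000):
    """Full-batch gradient descent on squared error for one sigmoid unit."""
    W = np.zeros((X.shape[1], 1))             # zero init, an assumption of this sketch
    for _ in range(n_steps):
        pred = sigmoid(X @ W)                 # forward pass
        dL_dpred = -2 * (y - pred)            # derivative of the squared error
        delta = dL_dpred * pred * (1 - pred)  # times the sigmoid slope
        W -= learning_rate * (X.T @ delta)    # gradient descent step
    return W

X_train = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y_train = np.array([[0, 0, 1, 1]]).T

W_trained = train_perceptron(X_train, y_train)
labels = (sigmoid(X_train @ W_trained) > 0.5).astype(int)
print(labels.ravel())
```

Lowering `learning_rate` makes each step smaller, so more steps are needed to reach the same error.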

# Limitations of the Perceptron: The Famous XOR Problem

The power of neural units comes from combining them into larger networks. Minsky and Papert (1969) showed that a single neural unit cannot compute the simple logical function XOR.

Under this narrow definition of a perceptron, it is not possible to implement XOR with a single unit. The restriction is that the decision function is a binary threshold applied to a linear combination of the inputs, so the two classes must be separable by a straight line, and the XOR outputs are not.
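
The limitation can be illustrated with a small brute-force search. The sketch below tries every combination of two weights and a bias on a coarse grid (an arbitrary choice) with the threshold activation used for the logic gates earlier, and counts how many settings reproduce each gate. A grid search is not a proof, but it shows the pattern.

```python
import itertools
import numpy as np

X_gate = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
targets = {"AND": [0, 0, 0, 1], "OR": [0, 1, 1, 1], "XOR": [0, 1, 1, 0]}

# Candidate values for w1, w2 and b: -2.0, -1.5, ..., 2.0 (an arbitrary grid).
grid = np.arange(-2.0, 2.5, 0.5)

solutions = {name: 0 for name in targets}
for w1, w2, b in itertools.product(grid, repeat=3):
    # Threshold unit: output 1 where w . x + b > 0.
    out = (X_gate @ np.array([w1, w2]) + b > 0).astype(int)
    for name, target in targets.items():
        if out.tolist() == target:
            solutions[name] += 1

print(solutions)
```

Several settings reproduce AND and OR, while none reproduces XOR.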

There is, however, a way to fit the XOR logic gate with a single unit if we drop this restriction and use the following activation function:

$$
y = \begin{cases}
    \ 0 & \quad \text{if } w \cdot x + b \neq 0.5\\
    \ 1 & \quad \text{if } w \cdot x + b = 0.5
  \end{cases}
$$

#### The only problem is that it is non-differentiable, so we can't use gradient descent.

We illustrate this below.

In [86]:
def bactivation(z):
    if z == 0.5:
        return 1
    else: return 0

If we assume the weights to be set to 0.5 and the bias to 0, one unit can handle the XOR logic:

- Input: $x_1$ and $x_2$
- Bias: $b = 0$ for XOR
- Weights: $w = [0.5, 0.5]$

In [88]:
w = np.array([0.5, 0.5])
b = 0
x = np.array([0, 0])
print("0 XOR 0:", bactivation(w.dot(x) + b))
x = np.array([1, 0])
print("1 XOR 0:", bactivation(w.dot(x) + b))
x = np.array([0, 1])
print("0 XOR 1:", bactivation(w.dot(x) + b))
x = np.array([1, 1])
print("1 XOR 1:", bactivation(w.dot(x) + b))

0 XOR 0: 0
1 XOR 0: 1
0 XOR 1: 1
1 XOR 1: 0


#### Interesting background knowledge on the XOR problem plus a fun fact!

This particular activation function is of course not differentiable, and it remains to be shown that the weights can be learned, but nevertheless, a single unit can be identified that solves the XOR problem.

The difference between Minsky and Papert's (1969) definition of a perceptron and this unit is that a perceptron is defined to have a decision function that is binary and piecewise linear. The unit that solves the XOR problem above is therefore not a perceptron in the sense of Minsky and Papert (1969) (p.c. Julia Hockenmaier).

Minsky and Papert's (1969) claims about the XOR problem were one of the contributing factors to the famous AI Winter, a period of reduced funding for Artificial Intelligence research.

Also, Seymour Papert is from South Africa!

## Solving the XOR Problem

There is a proposed solution in [Goodfellow et al. (2016)](https://www.deeplearningbook.org/) for the XOR problem, using a network with two layers of ReLU-based units.

![XOR Network](resources/XOR_Network.png)

This two-layer network of three perceptrons solves the problem.
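
A network of this kind can be written down with hand-picked weights. The values below are the ones I recall from the worked XOR example in Goodfellow et al. (2016), with two hidden ReLU units feeding a linear output unit. Treat the exact numbers and all variable names as assumptions of this sketch rather than as a quotation of the book.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hand-picked parameters (assumed values, see the note above).
W_hidden = np.array([[1, 1],
                     [1, 1]])      # input -> two hidden ReLU units
c_hidden = np.array([0, -1])       # hidden biases
w_out = np.array([1, -2])          # hidden -> output weights
b_out = 0                          # output bias

hidden = relu(X_xor @ W_hidden + c_hidden)
output = hidden @ w_out + b_out
print(output)
```

The hidden layer maps the inputs `[0, 1]` and `[1, 0]` to the same point, which is what lets a single output unit separate the two classes.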

Next we will show why the multi-layer perceptron allows us to solve more complex problems like the XOR problem!

# Multi-Layer Perceptron

**A Multi-Layer Perceptron is a model which consists of at least three layers of perceptrons: input layer, hidden layer, and output layer.**

![MLP](resources/mlp_hidden.jpeg)

Multi-layer perceptrons can fit data that single-layer perceptrons cannot: with enough hidden units and a non-linear activation function, an MLP is a universal function approximator.

How would backpropagation work in such a model?

![MLP_backprop](resources/mlp_backprop.png)

Consider the following dataset:

| Input  | Output |
| ------ |:------:|
| 0 0 1  | 0      |
| 0 1 1  | 1      |
| 1 0 1  | 1      |
| 1 1 1  | 0      |

The pattern here is an XOR pattern: if there is a $1$ in either column $1$ or column $2$, but not in both, the output is $1$ (XOR over columns $1$ and $2$).

To solve this problem, we need a network with an additional layer (an MLP): a *hidden layer* that combines and transforms the input, followed by an output layer that maps the result to the output. We will add a hidden layer with randomized weights and then train all the weights so that the outputs match the table above.

We will define a new $X$ input matrix that reflects the above table:

In [3]:
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])
print(X)

[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]


We also define a new output matrix $y$:

In [4]:
y = np.array([[ 0, 1, 1, 0]]).T
print(y)

[[0]
 [1]
 [1]
 [0]]


We initialize the random number generator with a constant again:

In [5]:
np.random.seed(1)

Assuming that our 3 inputs are mapped to 4 hidden-layer neurons, we initialize the hidden-layer weights ($Wh$) as a 3 by 4 matrix. The output layer ($Wo$) is a single neuron connected to the hidden layer, so its weights form a 4 by 1 matrix:

In [38]:
n_inputs = 3
n_hidden_neurons = 4
n_output_neurons = 1
Wh = np.random.random( (n_inputs, n_hidden_neurons) )  * np.sqrt(2.0/n_inputs)
Wo = np.random.random( (n_hidden_neurons, n_output_neurons) )  * np.sqrt(2.0/n_hidden_neurons)
print("Wh:\n", Wh)
print("Wo:\n", Wo)

Wh:
 [[0.54642637 0.21630591 0.05416217 0.30217248]
 [0.51416219 0.17160636 0.61462234 0.05432681]
 [0.21254639 0.65707935 0.15793843 0.52211762]]
Wo:
 [[0.37099793]
 [0.65393799]
 [0.18617893]
 [0.04664153]]
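
It can help to trace the array shapes through one forward pass. The sketch below uses its own randomly drawn weights (arbitrary seed, hypothetical `_demo` names) so that it leaves `Wh` and `Wo` untouched.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)          # arbitrary seed for this sketch
X_demo = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
Wh_demo = rng.random((3, 4))            # inputs -> hidden
Wo_demo = rng.random((4, 1))            # hidden -> output

hidden = sigmoid(X_demo @ Wh_demo)      # (4 examples, 3) @ (3, 4) -> (4, 4)
output = sigmoid(hidden @ Wo_demo)      # (4, 4) @ (4, 1) -> (4, 1)

print(hidden.shape, output.shape)
```

Each row of `hidden` holds the four hidden activations for one training example, and each row of `output` holds the single prediction for that example.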


This is the current initial prediction of the Multi-Layer Perceptron (MLP):

In [39]:
sigmoid(np.dot(sigmoid(np.dot(X, Wh)), Wo))

array([[0.68255195],
       [0.70318206],
       [0.7004341 ],
       [0.71831433]])

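We now loop 100,000 times to optimize the weights. The errors printed below already start out small because this cell was run again on weights that had been trained before: the first value equals the mean absolute error of the trained prediction shown further down.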

In [72]:
for i in range(100000):
    # forward propagation
    l1 = sigmoid(np.dot(X, Wh))
    l2 = sigmoid(np.dot(l1, Wo))

    # difference between the targets and the output-layer predictions
    l2_error = y - l2

    if (i % 10000) == 0:
        print("Error:", np.mean(np.abs(l2_error)))

    # output-layer delta: error scaled by the sigmoid slope at l2
    l2_delta = l2_error * sigmoid_prime(l2)

    # propagate the output delta back through Wo to the hidden layer
    l1_error = l2_delta.dot(Wo.T)

    # hidden-layer delta: backpropagated error scaled by the sigmoid slope at l1
    l1_delta = l1_error * sigmoid_prime(l1)

    # gradient descent step; l2_error = y - l2 already carries the minus sign
    Wo += np.dot(l1.T, l2_delta)
    Wh += np.dot(X.T, l1_delta)

Error: 0.003887504994617163
Error: 0.0036937834970836095
Error: 0.003525709960695171
Error: 0.003378132302446337
Error: 0.003247233371050477
Error: 0.003130117979719035
Error: 0.003024545801488532
Error: 0.0029287529741954285
Error: 0.0028413296155250386
Error: 0.0027611336786237985
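
The backpropagation steps in the loop above can be checked against finite differences. The sketch below assumes the loss is half the sum of squared errors (the scaling under which the loop's updates are exactly the negative gradient) and uses freshly drawn weights with an arbitrary seed. All `_chk` names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def half_squared_error(Wh, Wo, X, y):
    """0.5 * sum of squared errors of the two-layer sigmoid network."""
    l2 = sigmoid(sigmoid(X @ Wh) @ Wo)
    return 0.5 * np.sum((y - l2) ** 2)

rng = np.random.default_rng(0)  # arbitrary seed
X_chk = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y_chk = np.array([[0, 1, 1, 0]], dtype=float).T
Wh_chk, Wo_chk = rng.random((3, 4)), rng.random((4, 1))

# Backpropagation, written as in the loop above.
l1 = sigmoid(X_chk @ Wh_chk)
l2 = sigmoid(l1 @ Wo_chk)
l2_delta = (y_chk - l2) * l2 * (1 - l2)
l1_delta = l2_delta.dot(Wo_chk.T) * l1 * (1 - l1)
step_Wh = X_chk.T @ l1_delta            # the amount the loop adds to Wh

# Finite-difference estimate of the gradient with respect to Wh.
eps = 1e-6
numeric = np.zeros_like(Wh_chk)
for idx in np.ndindex(*Wh_chk.shape):
    shift = np.zeros_like(Wh_chk)
    shift[idx] = eps
    numeric[idx] = (half_squared_error(Wh_chk + shift, Wo_chk, X_chk, y_chk)
                    - half_squared_error(Wh_chk - shift, Wo_chk, X_chk, y_chk)) / (2 * eps)

# The update should equal the negative gradient of the loss.
print(np.allclose(step_Wh, -numeric, atol=1e-6))
```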


This is the trained prediction of the Multi-Layer Perceptron (MLP):

In [41]:
sigmoid(np.dot(sigmoid(np.dot(X, Wh)), Wo))

array([[9.54277709e-04],
       [9.95491443e-01],
       [9.95479782e-01],
       [5.56696699e-03]])

# Mini-Project Idea:

This is the original July 13, 1958 New York Times article describing Rosenblatt's perceptron.

As a mini-project, you could consider reproducing the "square-locating" task!

<img src="resources/Times_July_13_Rosenblatt_Perceptron.png" style="max-width:100%; width: 70%; max-width: none">