# The Perceptron
The perceptron is based on a slightly different artificial neuron called a <mark>*Threshold Logic Unit* (TLU)</mark>, or sometimes a <mark>*Linear Threshold Unit* (LTU)</mark>. The inputs and outputs are numbers. The TLU computes a weighted sum of its inputs ($z=w_1x_1+w_2x_2+\dots+w_nx_n = x^Tw$), then applies a <mark>*step function*</mark> to that sum and outputs the result: $h_w(x) = \text{step}(z)$ where $z = x^Tw$.

The most common step function used in perceptrons is the **Heaviside step function**; the sign function is sometimes used instead.
$$\text{heaviside}\,(z)\> = \> \begin{cases} 0 & \text{if } z<0 \\ 1 & \text{if } z \geq 0 \end{cases}  \qquad \text{sgn}\, (z)\> = \> \begin{cases} -1 & \text{if } z<0 \\ 0 & \text{if } z=0 \\ +1 & \text{if } z>0 \end{cases} $$
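
As a minimal sketch of the definitions above, a single TLU can be written in a few lines of NumPy. The function names (`heaviside`, `sgn`, `tlu`) and the weights are illustrative assumptions, not part of any library; the constant first input of 1 is one way to give the unit a threshold.

```python
import numpy as np

def heaviside(z):
    """Heaviside step: 0 where z < 0, 1 where z >= 0."""
    return (np.asarray(z) >= 0).astype(int)

def sgn(z):
    """Sign function: -1 where z < 0, 0 where z == 0, +1 where z > 0."""
    return np.sign(z).astype(int)

def tlu(x, w, step=heaviside):
    """Threshold logic unit: apply a step function to the weighted sum x^T w."""
    z = np.dot(x, w)
    return step(z)

# Hypothetical weights: the first input is fixed to 1, so w[0] acts as a threshold.
# With these values the unit fires only when x1 + x2 >= 1.5 (a logical AND).
w = np.array([-1.5, 1.0, 1.0])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), tlu(np.array([1, x1, x2]), w))
```

Swapping `step=sgn` into `tlu` gives outputs in {-1, 0, +1} instead of {0, 1}.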

A perceptron is simply composed of a single layer of TLUs.<mark> When all the neurons in a layer are connected to every neuron in the <b>previous layer</b>, the layer is called a **fully connected layer**</mark> or a *dense layer*. The inputs of the perceptron are fed to special passthrough neurons called *input neurons*, and together these input neurons form the *input layer*. An extra bias feature is generally added ($x_0 = 1$), represented by a *bias neuron*, which outputs 1 all the time.

Computing the output of a fully connected layer:
- $h_{W,\, b}(X) = \Phi\>(XW\, + \, b)$

where,
- $X$ represents the matrix of input features. <mark>One row per instance</mark> and <mark>one column per feature.</mark>
- Weight matrix $W$ contains all the connection weights except for the ones from the bias neuron. <mark>One row per input neuron</mark> and <mark>one column per artificial neuron in the layer.</mark>
- Bias vector $b$ contains all the connection weights between the bias neuron and the artificial neurons. <mark>One bias term per artificial neuron.</mark>
- Activation function $\Phi$: when the artificial neurons are TLUs, it is a *step function*.
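
The formula can be checked on a toy example. The sketch below assumes a layer of two TLUs with a Heaviside activation; `dense_step_layer` and the weight values are hypothetical choices made only to show the shapes of $X$, $W$ and $b$.

```python
import numpy as np

def dense_step_layer(X, W, b):
    """Compute step(XW + b) for a fully connected layer of TLUs."""
    Z = X @ W + b                  # shape: (n_instances, n_neurons)
    return (Z >= 0).astype(int)    # Heaviside step applied element-wise

X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])         # 4 instances (rows), 2 features (columns)
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])         # 2 input neurons (rows), 2 TLUs (columns)
b = np.array([-0.5, -1.5])         # one bias term per TLU

outputs = dense_step_layer(X, W, b)
print(outputs)
```

With these values the first column behaves like a logical OR of the two inputs and the second like a logical AND, so the output has one row per instance and one column per neuron.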

### Training Algorithm 
*Hebb's rule*: "Cells that fire together, wire together"; the connection weight between two neurons tends to increase when they fire simultaneously.

Perceptrons are trained using a variant of this rule that takes into account the error made by the network when it makes a prediction. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction.

Perceptron learning rule (weight update):
- $w_{i,\,j}^{(\text{next}\;\text{step})}\> = \> w_{i,\,j} \> +\> \eta\big(y_j\, -\,\hat{y_j}\big)x_i$

where,

- $w_{i,\,j}$ is the <mark>connection weight between the $i^{\text{th}}$ input neuron and  $j^{\text{th}}$ output neuron.</mark>
- $\eta$ is the <mark>learning rate.</mark>
- $x_i$ is  <mark>$i^{\text{th}}$ input value of the current training instance.</mark>
- $y_j$ is the <mark>target output of the  <b>$j^{\text{th}}$ output neuron</b></mark> for the current training instance.
- $\hat{y_j}$ is the <mark>output of the  $j^{\text{th}}$ output neuron</mark> for the current training instance.
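
The update rule can be sketched from scratch for the simplest case. The code below assumes a single output neuron (so the index $j$ is dropped), a Heaviside step, and a bias kept as a separate term `b`, which is equivalent to a weight on the constant input $x_0 = 1$. The function name `train_perceptron`, the AND dataset and the hyperparameter values are illustrative assumptions.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=10):
    """Train one TLU with the perceptron learning rule, one instance at a time."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, target in zip(X, y):
            y_hat = 1 if x_i @ w + b >= 0 else 0   # current prediction
            error = target - y_hat                  # (y - y_hat): 0 when correct
            w += eta * error * x_i                  # w_i <- w_i + eta * (y - y_hat) * x_i
            b += eta * error                        # same rule with x_0 = 1
    return w, b

# Logical AND is linearly separable, so the rule converges on it.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
preds = (X @ w + b >= 0).astype(int)
print(preds)  # → [0 0 0 1]
```

Weights only change when a prediction is wrong, and only along the inputs that were active for that instance, which is the reinforcement described above.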


```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0

per_clf = Perceptron()
per_clf.fit(X, y)

# Predict the class of a flower with petal length 2 and petal width 0.5
y_pred = per_clf.predict([[2, 0.5]])
y_pred
```

```
array([0])
```

Perceptrons do not output a class probability; rather, they make predictions based on a hard threshold. They are also incapable of solving some trivial problems, such as the *Exclusive OR* (XOR) classification problem, which makes them less promising. Some of these limitations can be eliminated by stacking multiple perceptrons. The resulting network is called a <mark>*Multilayer Perceptron* (MLP)</mark>.
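
To make the XOR claim concrete, here is a small sketch of two stacked layers of TLUs with hand-picked weights (an assumption for illustration; these weights are not learned). The hidden layer computes OR and AND of the inputs, and the output neuron fires when OR is on but AND is off.

```python
import numpy as np

def step(z):
    """Heaviside step applied element-wise."""
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer (hypothetical weights): column 0 acts like OR, column 1 like AND.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-0.5, -1.5])

# Output neuron: +1 weight from OR, -1 weight from AND.
W_out = np.array([[1.0],
                  [-1.0]])
b_out = np.array([-0.5])

hidden = step(X @ W_hidden + b_hidden)
xor = step(hidden @ W_out + b_out).ravel()
print(xor)  # → [0 1 1 0]
```

No single TLU can produce this output, because the two classes of XOR cannot be separated by one straight line in the input plane.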

# The Multilayer Perceptron and Backpropagation
An MLP is composed of the following layers (the last two terms describe a layer's position in the stack):
- ***input layer***: one passthrough layer.
- ***hidden layer***: one or more layers of TLUs, called hidden because their "values" are not given in the data.
- ***output layer***: the final layer of TLUs.
- ***lower layers***: layers closer to the input layer.
- ***upper layers***: layers closer to the output layer.

Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

> The signal flows only in one direction, so this architecture is an example of a <mark>*feedforward neural network* (FNN)</mark>.

When an ANN contains a deep stack of hidden layers, it is referred to as a <mark>*deep neural network* (DNN)</mark>.

# Backpropagation
It is the most popular algorithm for training MLPs and other DNNs. In short, it is Gradient Descent using an efficient technique for computing the gradients automatically.

> Automatically computing gradients is called <mark>*automatic differentiation*</mark> or <mark>*autodiff*</mark>. The one used by backpropagation is called <mark>*reverse mode autodiff*</mark>. It is fast and well suited when the function to differentiate has many variables (connection weights) and few outputs (one loss).

**Let's go through each step**:
- It handles one mini-batch at a time (for example, containing 32 instances each) and goes through the full training set multiple times. Each pass is called an <mark>*epoch*</mark>.
- Each mini-batch is passed to the network's input layer. <mark>*Forward pass*</mark>: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
- <mark>Next, the algorithm measures the network's output error.</mark>
- <mark>Compute how much each output connection contributed to the error, by applying the *chain rule*.</mark>
- <mark>Then measure how much of these error contributions came from each connection in the layer below, again using the chain rule,</mark> working backward until the algorithm reaches the input layer. This reverse pass efficiently measures the error gradient across all the connection weights in the network by <mark>propagating the error gradient backward through the network</mark> (hence the name).
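
The steps above can be sketched in plain NumPy for a tiny network. This is a minimal illustration under several assumptions that are not in the text: one hidden layer of 4 sigmoid units, a mean squared error loss, the XOR data used as a single mini-batch, and hand-derived gradients in place of an autodiff library. All function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """Forward pass: keep the hidden activations for the backward pass."""
    W1, b1, W2, b2 = params
    A1 = sigmoid(X @ W1 + b1)      # hidden layer output
    A2 = sigmoid(A1 @ W2 + b2)     # network output
    return A1, A2

def mse_loss(params, X, y):
    """Output error: half the squared error, averaged over the batch."""
    _, A2 = forward(params, X)
    return 0.5 * np.mean(np.sum((A2 - y) ** 2, axis=1))

def backward(params, X, y):
    """Reverse pass: apply the chain rule from the output back to the inputs."""
    W1, b1, W2, b2 = params
    A1, A2 = forward(params, X)
    n = X.shape[0]
    dZ2 = (A2 - y) * A2 * (1 - A2) / n       # error at the output layer
    dW2, db2 = A1.T @ dZ2, dZ2.sum(axis=0)
    dZ1 = (dZ2 @ W2.T) * A1 * (1 - A1)       # error pushed back to the hidden layer
    dW1, db1 = X.T @ dZ1, dZ1.sum(axis=0)
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Random initial connection weights, zero biases.
params = [rng.normal(size=(2, 4)), np.zeros(4),
          rng.normal(size=(4, 1)), np.zeros(1)]

initial_loss = mse_loss(params, X, y)
eta = 0.5
for epoch in range(2000):                     # each pass over the data is one epoch
    grads = backward(params, X, y)
    params = [p - eta * g for p, g in zip(params, grads)]   # Gradient Descent step
final_loss = mse_loss(params, X, y)
print(initial_loss, final_loss)
```

In a framework with reverse mode autodiff, the `backward` function would not be written by hand; it is spelled out here only to show where each chain rule step happens.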

> ⚠ It is important to initialize all the hidden layers' connection weights randomly, or else the training will fail.

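A key change was also needed for this algorithm to work: replacing the step function with the logistic (sigmoid) function, $\sigma(z) = 1 / (1 + e^{-z})$. The step function contains only flat segments, so there is no gradient for Gradient Descent to work with, whereas the sigmoid function has a well-defined nonzero derivative everywhere. Backpropagation also works well with other activation functions:
- <mark>*Hyperbolic Tangent function*</mark>: $\tanh\,(z) \> = \> 2\sigma(2z)\,-\,1$.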
    
  The output value ranges from -1 to 1. That range tends to make each layer's output more or less centered around 0 at the beginning of training, which often helps speed up convergence.

- <mark>*Rectified Linear Unit Function*</mark>: $\text{ReLU}\,(z) \> = \> \max(0, z)$

    Continuous but unfortunately not differentiable at $z = 0$ (the slope changes abruptly, which can make Gradient Descent bounce around), and its derivative is 0 for $z < 0$. In practice, however, it works very well and is fast to compute. Also, the fact that it does not have a maximum output value helps reduce some issues during Gradient Descent.
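
The activation functions above are short enough to write out and compare. The sketch below checks the $\tanh$ identity numerically and shows ReLU with its derivative; the helper names are illustrative, and setting the ReLU derivative to 0 at $z = 0$ is a convention assumed here, since the derivative is undefined at that point.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def relu_derivative(z):
    # Assumed convention: use 0 at z = 0, where the true derivative is undefined.
    return (z > 0).astype(float)

z = np.linspace(-3, 3, 7)   # -3, -2, ..., 3

# tanh(z) = 2 * sigmoid(2z) - 1
identity_holds = np.allclose(np.tanh(z), 2 * sigmoid(2 * z) - 1)
print(identity_holds)       # → True

print(relu(z))              # zero for negative inputs, identity otherwise
print(relu_derivative(z))   # 0 for z <= 0, 1 for z > 0
```

Unlike the step function, all three functions have a nonzero derivative over at least part of their domain, which is what Gradient Descent needs to make progress.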
