### The Perceptron

The Perceptron is based on a slightly different artificial neuron called a Threshold Logic Unit (TLU), or sometimes a Linear Threshold Unit (LTU). The inputs and output are numbers, and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs, then applies a step function to that sum and outputs the result.

The most common step function is the Heaviside step function; sometimes the sign function is used instead.
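
As a minimal sketch of the computation described above (not code from any library), the following NumPy snippet implements a TLU with either step function. The names `heaviside`, `sign_step`, and `tlu` and the input values are illustrative assumptions, and returning 1 at z = 0 for the Heaviside function is one common convention.

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 0 if z < 0, otherwise 1."""
    return (np.asarray(z) >= 0).astype(int)

def sign_step(z):
    """Sign function: -1 if z < 0, 0 if z == 0, +1 if z > 0."""
    return np.sign(z).astype(int)

def tlu(x, w, step=heaviside):
    """Compute the weighted sum of the inputs, then apply a step function."""
    z = np.dot(x, w)
    return step(z)

x = np.array([1.0, 2.0, 3.0])    # input values (illustrative)
w = np.array([0.5, -1.0, 0.25])  # one weight per input connection

# Weighted sum: 0.5 - 2.0 + 0.75 = -0.75
print(tlu(x, w))                  # 0
print(tlu(x, w, step=sign_step))  # -1
```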


A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs. When all the neurons in a layer are connected to every neuron in the previous layer, it is called a fully connected layer or a dense layer. An extra bias feature is generally added, and it is represented using a special type of neuron called a bias neuron, which always outputs 1.

Computing the outputs of a fully connected layer:

$$h(X) = f(XW + b)$$

X represents the matrix of input features. It has one row per instance and one column per feature.

The weight matrix W contains all the connection weights except for the ones from the bias neuron. It has one row per input neuron and one column per artificial neuron in the layer.

The bias vector b contains all the connection weights between the bias neuron and the artificial neurons. It has one bias term per artificial neuron.

The function f is called the activation function: when the artificial neurons are TLUs, it is a step function.
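
The shapes described above can be checked with a short NumPy sketch. The function name `dense_layer` and all numeric values are assumptions chosen for illustration; the activation is a Heaviside step, as for a layer of TLUs.

```python
import numpy as np

def step(z):
    """Heaviside step function applied element-wise."""
    return (z >= 0).astype(int)

def dense_layer(X, W, b, f=step):
    """Outputs of a fully connected layer: f(XW + b)."""
    return f(X @ W + b)

# 3 instances (rows), 2 features (columns)
X = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [3.0, 0.5]])
# 2 input neurons (rows), 3 artificial neurons (columns)
W = np.array([[1.0, -1.0, 0.5],
              [-0.5, 1.0, 0.0]])
# One bias term per artificial neuron
b = np.array([0.5, -0.5, -1.0])

H = dense_layer(X, W, b)
print(H.shape)  # (3, 3): one row per instance, one column per neuron
print(H)
```

The bias vector is added to every row of `XW` through NumPy broadcasting, which is why `b` needs only one entry per neuron.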



Perceptron learning rule (weight update):

$$w_{i,j}^{(\text{next step})} = w_{i,j} + \eta \, (y_j - \hat{y}_j) \, x_i$$

$w_{i,j}$ = connection weight between the ith input neuron and the jth output neuron.

$x_i$ = ith input value of the current training instance.

$\hat{y}_j$ = output of the jth output neuron for the current training instance.

$y_j$ = target output of the jth output neuron for the current training instance.

$\eta$ = learning rate.


The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns. However, if the training instances are linearly separable, this algorithm will converge to a solution. This is called the Perceptron convergence theorem.
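
The weight update rule can be sketched from scratch in NumPy. This is an illustrative implementation, not the one used by scikit-learn: the function name `train_perceptron`, the zero initialization, the learning rate of 1, and the epoch count are all assumptions. The bias is updated with the same rule, treating it as the weight of a connection whose input is always 1. The training set is the logical AND function, which is linearly separable.

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 0 if z < 0, otherwise 1."""
    return (z >= 0).astype(int)

def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Train a single layer of TLUs, one training instance at a time."""
    W = np.zeros((X.shape[1], y.shape[1]))  # one row per input, one column per output
    b = np.zeros(y.shape[1])                # one bias term per output neuron
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = heaviside(x_i @ W + b)   # current output of each neuron
            error = y_i - y_hat              # target minus output
            W += eta * np.outer(x_i, error)  # w_ij += eta * (y_j - y_hat_j) * x_i
            b += eta * error                 # same rule with a constant input of 1
    return W, b

# Logical AND: linearly separable, so the rule converges
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([[0], [0], [0], [1]])

W, b = train_perceptron(X_and, y_and)
print(heaviside(X_and @ W + b).ravel())  # [0 0 0 1]
```

Weights change only when a prediction is wrong, so once every instance is classified correctly the remaining epochs leave the parameters untouched.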

In [6]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0

In [8]:
per_clf = Perceptron()
per_clf.fit(X, y)

Perceptron()

In [9]:
y_pred = per_clf.predict([[2, 0.5]])

In [10]:
y_pred

array([0])

### Multi-Layer Perceptron and Backpropagation

An MLP is composed of one input layer, one or more layers of TLUs, called hidden layers, and one final layer of TLUs called the output layer. The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer. 

Because the signal flows in only one direction (from the inputs to the outputs), this architecture is an example of a feedforward neural network (FNN).

When an ANN contains a deep stack of hidden layers, it is called a deep neural network.


In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper introducing the backpropagation training algorithm. It trains a neural network by using the chain rule to compute the gradient of the network's error with regard to every parameter, then adjusting the parameters with a Gradient Descent step.

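First, we initialize the weights and bias terms with random values. Each mini-batch is then passed to the network's input layer, and the algorithm computes the output of all the neurons in each layer in turn until it reaches the output layer. This is called the forward pass. Next, the algorithm measures the network's output error. It then goes backward through the layers, using the chain rule to compute how much each weight and bias term contributed to that error, and finally updates the weights and bias terms with a Gradient Descent step.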
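
The steps above can be sketched for a tiny network in NumPy. Everything specific here is an assumption made for illustration: one hidden layer of 3 neurons, the logistic activation in both layers, the mean squared error as the error measure, a learning rate of 0.5, and the XOR data used as a single batch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mse(y_pred, y):
    return np.mean((y_pred - y) ** 2)

def forward(X, params):
    """Forward pass: compute each layer's output in turn."""
    W1, b1, W2, b2 = params
    A1 = sigmoid(X @ W1 + b1)   # hidden layer output
    A2 = sigmoid(A1 @ W2 + b2)  # network output
    return A1, A2

def backward(X, y, params):
    """Backward pass: chain rule, from the output error back to each parameter."""
    W1, b1, W2, b2 = params
    A1, A2 = forward(X, params)
    # d(MSE)/d(A2) times the sigmoid derivative a * (1 - a)
    dZ2 = 2.0 * (A2 - y) / y.size * A2 * (1.0 - A2)
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0)
    # Propagate the error signal to the hidden layer
    dZ1 = (dZ2 @ W2.T) * A1 * (1.0 - A1)
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return [dW1, db1, dW2, db2]

rng = np.random.default_rng(42)
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_xor = np.array([[0], [1], [1], [0]], dtype=float)

# Random initialization of weights and bias terms
params = [rng.normal(size=(2, 3)), rng.normal(size=3),
          rng.normal(size=(3, 1)), rng.normal(size=1)]

initial_loss = mse(forward(X_xor, params)[1], y_xor)
eta = 0.5  # learning rate (assumed value)
for _ in range(2000):
    grads = backward(X_xor, y_xor, params)
    params = [p - eta * g for p, g in zip(params, grads)]  # Gradient Descent step
final_loss = mse(forward(X_xor, params)[1], y_xor)

print(initial_loss, final_loss)
```

In practice a framework computes these gradients automatically; writing them by hand is only meant to show where the chain rule enters.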


For this algorithm to work properly, the step function is replaced by the logistic function. The step function contains only flat segments, so there is no gradient to work with, while the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step.

The hyperbolic tangent function tanh(z), like the logistic function, is S-shaped, continuous, and differentiable, but its output value ranges from -1 to 1 (instead of 0 to 1), which tends to make each layer's output more or less centered around 0 at the beginning of training. This often helps speed up convergence.

The Rectified Linear Unit function, ReLU(z) = max(0, z), is continuous but not differentiable at z = 0, and its derivative is 0 for z < 0. It is fast to compute.
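
For comparison, here is a small NumPy sketch of the three activation functions and their derivatives. The function names are illustrative, and returning 0 for the ReLU derivative at z = 0 is an arbitrary choice made for this sketch, since the derivative is undefined there.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_grad(z):
    s = logistic(z)
    return s * (1.0 - s)  # nonzero everywhere

def tanh_grad(z):
    return 1.0 - np.tanh(z) ** 2  # derivative of tanh, output of tanh is in (-1, 1)

def relu(z):
    return np.maximum(0.0, z)

def relu_grad(z):
    # Undefined at z = 0; this sketch returns 0 there by convention.
    return (z > 0).astype(float)

z = np.array([-2.0, 0.0, 2.0])
print(logistic(z))   # values between 0 and 1, 0.5 at z = 0
print(np.tanh(z))    # values between -1 and 1, 0 at z = 0
print(relu(z))       # [0. 0. 2.]
print(relu_grad(z))  # [0. 0. 1.]
```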


#### Regression MLPs
We don't need to use any activation function for the output neurons, so they are free to output any range of values.

The loss function is typically the mean squared error, but if there are a lot of outliers in the training set, we can use the mean absolute error instead. Alternatively, we can use the Huber loss, which is a combination of both.

The Huber loss is quadratic when the error is smaller than a threshold δ, but linear when the error is larger than δ. This makes it less sensitive to outliers than the mean squared error, and it is often more precise and converges faster than the mean absolute error.
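
A short NumPy sketch makes the comparison concrete. It assumes the usual form of the Huber loss, 0.5·e² for |e| ≤ δ and δ·(|e| − 0.5·δ) otherwise, with δ = 1 as an assumed default; the function name `huber` and the sample values are illustrative.

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.5, 14.0])  # the last prediction is an outlier

print(np.mean((y_true - y_pred) ** 2))   # MSE:   25.125
print(np.mean(np.abs(y_true - y_pred)))  # MAE:   2.75
print(huber(y_true, y_pred))             # Huber: 2.4375
```

The single outlier dominates the mean squared error, while the Huber loss grows only linearly with it.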

#### Classification MLPs
For a binary classification problem, we need a single output neuron using the logistic activation function: the output will be a number between 0 and 1. For multiclass classification, we should use the softmax activation function for the whole output layer.

The loss function used is the cross-entropy (log loss).
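
As a rough sketch of the multiclass case, the snippet below applies softmax to the raw outputs of a hypothetical output layer and computes the cross-entropy against integer class labels. The function names, the sample values, and the max-subtraction trick for numerical stability are assumptions added for illustration.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(probas, labels):
    """Mean negative log-probability assigned to the true class."""
    n = len(labels)
    return -np.mean(np.log(probas[np.arange(n), labels]))

# Raw outputs for 2 instances and 3 classes (illustrative values)
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
labels = np.array([0, 1])  # true class index of each instance

probas = softmax(logits)
print(probas.sum(axis=1))             # each row sums to 1
print(cross_entropy(probas, labels))  # lower when the true class gets high probability
```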