# Introduction to Artificial Neural Networks with Keras
   
## From Biological to Artificial Neurons

Artificial Neural Networks (ANNs) are inspired by biological neurons. They were first introduced back in 1943, but then entered a long winter as other methods worked better. We are now witnessing another wave of interest in ANNs. This time, there are a few good reasons to believe that this wave is different and that it will have a much more profound impact on our lives:

- There is now a huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.
- The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time. This is in part due to Moore's Law, but also thanks to the gaming industry, which has produced powerful GPU cards by the millions.
- The training algorithms have been improved. To be fair, they are only slightly different from the ones used in the 1990s, but these relatively small tweaks have a huge positive impact.
- Some theoretical limitations of ANNs have turned out to be benign in practice. For example, many people thought that ANN training algos were doomed because they were likely to get stuck in local optima, but it turns out that this is rather rare in practice (or when it is the case, they are usually fairly close to the global optimum).
- ANNs seem to have entered a virtuous circle of funding and progress. Amazing products based on ANNs regularly make the headline news, which pulls more and more attention and funding towards them, resulting in more and more progress, and even more amazing products.

## Biological Neurons

Before discussing artificial neurons, let's take a quick look at a biological neuron. It is an unusual-looking cell mostly found in animal cerebral cortexes, composed of a *cell body* containing the nucleus and most of the cell's complex components, and many branching extensions called *dendrites*, plus one very long extension called the *axon*. The axon's length may be just a few times longer than the cell body, or up to tens of thousands of times longer. Near its extremity the axon splits off into many branches called *telodendria*, and at the tip of these branches are minuscule structures called *synaptic terminals* (or simply *synapses*), which are connected to the dendrites (or directly to the cell body) of other neurons. Biological neurons receive short electrical impulses called *signals* from other neurons via these synapses. When a neuron receives a sufficient number of signals from other neurons within a few milliseconds, it fires its own signals.

![alt text](neuron.PNG "bio neuron")

Thus, individual biological neurons seem to behave in a rather simple way, but they are organized in a vast network of billions of neurons, each neuron typically connected to thousands of other neurons. Highly complex computations can be performed by a vast network of fairly simple neurons, much like a complex anthill can emerge from the combined efforts of simple ants. The architecture of biological neural networks is still the subject of active research, but some parts of the brain have been mapped, and it seems that neurons are often organized in consecutive layers.

## Logical Computations with Neurons

McCulloch and Pitts proposed a very simple model of the biological neuron, which later became known as an *artificial neuron*: it has one or more binary (on/off) inputs and one binary output. The artificial neuron simply activates its output when more than a certain number of its inputs are active. They showed that even with such a simplified model it is possible to build a network of artificial neurons that computes any logical proposition you want.
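To make this concrete, here is a minimal sketch of such a neuron in plain Python. The function names and the threshold values are illustrative assumptions, not part of the original model description; the neuron fires when more than `threshold` of its binary inputs are active.

```python
def mp_neuron(inputs, threshold):
    """Return 1 if more than `threshold` of the binary inputs are active, else 0."""
    return int(sum(inputs) > threshold)


def logical_and(a, b):
    # Fires only when both inputs are active (more than 1 active input).
    return mp_neuron([a, b], threshold=1)


def logical_or(a, b):
    # Fires when at least one input is active (more than 0 active inputs).
    return mp_neuron([a, b], threshold=0)


# Print the truth table: a, b, AND, OR
for a in (0, 1):
    for b in (0, 1):
        print(a, b, logical_and(a, b), logical_or(a, b))
```

Chaining such neurons (feeding the output of one into another) is what allows more elaborate logical propositions to be built.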

## The Perceptron

The *Perceptron* is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron called a *threshold logic unit* (TLU), or sometimes a *linear threshold unit* (LTU): the inputs and output are now numbers rather than binary on/off values, and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs ($z = w_1 x_1 + w_2 x_2 + \dots + w_n x_n = x^\top w$), then applies a *step function* to that sum and outputs the result: $h_w(x) = \text{step}(z)$, where $z = x^\top w$.

![alt text](TLU.PNG "TLU")


A single TLU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM). 
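As a rough illustration, a single TLU can be sketched in a few lines of NumPy. The weights below are hypothetical, and the step function is assumed to be the Heaviside function with the convention that it outputs 1 at $z = 0$ (other conventions exist).

```python
import numpy as np


def step(z):
    """Heaviside step: 1 where z >= 0, else 0 (the value at z == 0 is a convention)."""
    return (np.asarray(z) >= 0).astype(int)


def tlu(x, w):
    """Compute the weighted sum z = x . w, then apply the step function."""
    z = np.dot(x, w)
    return step(z)


w = np.array([1.0, -2.0])  # hypothetical weights, chosen for illustration

print(tlu(np.array([3.0, 1.0]), w))  # z = 3 - 2 = 1   -> 1 (positive class)
print(tlu(np.array([1.0, 1.0]), w))  # z = 1 - 2 = -1  -> 0 (negative class)
```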

A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs. When all the neurons in a layer are connected to every neuron in the previous layer (i.e., its input neurons), it is called a *fully connected layer* or a *dense layer*. To represent the fact that each input is sent to every TLU, it is common to draw special passthrough neurons called *input neurons*: they just output whatever input they are fed. All the input neurons form the *input layer*. Moreover, an extra bias feature is generally added ($x_0 = 1$): it is typically represented using a special type of neuron called a *bias neuron*, which just outputs 1 all the time.
A Perceptron with two inputs and three outputs is represented below.

![alt text](perceptron.PNG "perceptron")

This Perceptron can classify instances simultaneously into three different binary classes, which makes it a multioutput classifier.

Thanks to the magic of linear algebra, it is possible to efficiently compute the outputs of a layer of artificial neurons for several instances at once by using the equation below:

$$ h_{W,b}(X) = \phi(XW + b)$$

- As always, X represents the matrix of input features. It has one row per instance, one column per feature.
- The weight matrix W contains all the connection weights except for the ones from the bias neuron. It has one row per input neuron and one column per artificial neuron in the layer.
- The bias vector b contains all the connection weights between the bias neuron and the artificial neurons. It has one bias term per artificial neuron.
- The function $\phi$ is called the *activation function*: when the artificial neurons are TLUs, it is a step function (but we will discuss other activation functions shortly).
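The shapes involved are easier to see in code. The sketch below evaluates $\phi(XW + b)$ for a layer with two inputs and three TLUs on a batch of two instances. All numeric values are made up for illustration, and $\phi$ is assumed to be a Heaviside step that outputs 1 at zero.

```python
import numpy as np


def step(z):
    """Heaviside step used as the activation function phi (1 where z >= 0)."""
    return (z >= 0).astype(int)


# Two instances (rows), two features (columns)
X = np.array([[1.0, 2.0],
              [-1.0, 0.5]])

# One row per input neuron, one column per TLU (hypothetical weights)
W = np.array([[0.5, -1.0, 0.0],
              [0.25, 0.5, -1.0]])

# One bias term per TLU (hypothetical values)
b = np.array([-0.5, 0.1, 0.75])

outputs = step(X @ W + b)  # shape: (n_instances, n_neurons)
print(outputs.shape)  # (2, 3)
print(outputs)
# [[1 1 0]
#  [0 1 1]]
```

Note that `b` is broadcast across the rows, so every instance gets the same bias terms.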

So how is a Perceptron trained? The Perceptron training algorithm proposed by Rosenblatt was largely inspired by *Hebb's rule*: when a biological neuron often triggers another neuron, the connection between these two neurons grows stronger ("Cells that fire together, wire together"). In other words, the connection weight between two neurons tends to increase when they fire at the same time. Perceptrons are trained using a variant of this rule that takes into account the error made by the network; it reinforces connections that help reduce the error. More specifically, the Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction. The rule is shown in the equation below:

$$ w_{i,j}^{(\text{next step})} = w_{i,j} + \eta (y_j - \hat{y}_j) x_i $$

where:

- $w_{i,j}$ is the connection weight between the $i^{th}$ input neuron and the $j^{th}$ output neuron.
- $x_i$ is the $i^{th}$ input value of the current training instance.
- $\hat{y}_j$ is the output of the $j^{th}$ output neuron for the current training instance.
- $y_j$ is the target output of the $j^{th}$ output neuron for the current training instance.
- $\eta$ is the learning rate.

The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns (just like Logistic Regression classifiers). However, if the training instances are linearly separable, it was demonstrated that this algo would converge to a solution. This is called the *Perceptron convergence theorem*.
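The update rule can be sketched directly in NumPy. This is a minimal illustration rather than a reference implementation: the function name `train_perceptron`, the zero initialization, the stopping criterion (a full pass with no errors), and the toy dataset (the logical AND function, which is linearly separable) are all assumptions made for the example.

```python
import numpy as np


def step(z):
    """Heaviside step (1 where z >= 0, else 0)."""
    return (z >= 0).astype(int)


def train_perceptron(X, Y, eta=1.0, max_epochs=100):
    """Apply w_ij <- w_ij + eta * (y_j - y_hat_j) * x_i, one instance at a time."""
    n_inputs, n_outputs = X.shape[1], Y.shape[1]
    W = np.zeros((n_inputs, n_outputs))
    b = np.zeros(n_outputs)  # bias weights (connections from the bias neuron)
    for _ in range(max_epochs):
        n_errors = 0
        for x, y in zip(X, Y):
            y_hat = step(x @ W + b)
            error = y - y_hat  # zero for output neurons that were correct
            W += eta * np.outer(x, error)
            b += eta * error
            n_errors += int(np.any(error != 0))
        if n_errors == 0:  # a full pass without mistakes: converged
            break
    return W, b


# Linearly separable toy problem: logical AND
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y = np.array([[0], [0], [0], [1]])

W, b = train_perceptron(X, Y)
predictions = step(X @ W + b)
print(predictions.ravel())  # [0 0 0 1]
```

Because the AND data is linearly separable, the loop stops after a handful of passes, as the convergence theorem predicts.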

Scikit-Learn provides a `Perceptron` class that implements a single-TLU network. It can be used pretty much as you would expect.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
# np.int was removed from recent NumPy versions; the built-in int works everywhere
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0

per_clf = Perceptron()
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])
print(y_pred)
```

Output:

```
[1]
```

You may have noticed that the Perceptron learning algorithm strongly resembles Stochastic Gradient Descent. In fact, Scikit-Learn's `Perceptron` class is equivalent to using an `SGDClassifier` with the following hyperparameters: `loss="perceptron"`, `learning_rate="constant"`, `eta0=1` (the learning rate), and `penalty=None` (no regularization).
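One way to check this equivalence empirically is to train both estimators on the same data with the same seed and compare their predictions. The sketch below does this on the iris petal features used earlier; the `random_state=42` value is an arbitrary choice made here so that both models shuffle the data identically.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron, SGDClassifier

iris = load_iris()
X = iris.data[:, (2, 3)]  # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris setosa, else 0

# Same seed for both models so the instance shuffling is identical
per_clf = Perceptron(random_state=42).fit(X, y)
sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1, penalty=None, random_state=42).fit(X, y)

same_predictions = np.array_equal(per_clf.predict(X), sgd_clf.predict(X))
print(same_predictions)
```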

Note that contrary to Logistic Regression classifiers, Perceptrons do not output a class probability; rather, they just make predictions based on a hard threshold. This is one of the good reasons to prefer Logistic Regression over Perceptrons.

In their 1969 monograph titled *Perceptrons*, Marvin Minsky and Seymour Papert highlighted a number of serious weaknesses of Perceptrons, in particular the fact that they are incapable of solving some trivial problems (e.g., the *Exclusive OR (XOR)* classification problem).

However, it turns out that some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons. The resulting ANN is called a *Multi-Layer Perceptron* (MLP). 
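For instance, XOR can be computed by stacking two layers of TLUs. The sketch below uses hand-picked (not learned) weights, which are one possible choice among many, and assumes a Heaviside step that outputs 1 at zero. The first hidden neuron acts like AND, the second like OR, and the output neuron fires when OR is active but AND is not.

```python
import numpy as np


def step(z):
    """Heaviside step (1 where z >= 0, else 0)."""
    return (z >= 0).astype(int)


X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: two TLUs, both summing x1 + x2 with different thresholds
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-1.5, -0.5])  # first neuron ~ AND, second neuron ~ OR

# Output layer: one TLU computing (OR and not AND)
W_out = np.array([[-1.0],
                  [1.0]])
b_out = np.array([-0.5])

hidden = step(X @ W_hidden + b_hidden)
xor = step(hidden @ W_out + b_out).ravel()
print(xor)  # [0 1 1 0]
```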

## Multi-Layer Perceptron and Backpropagation

An MLP is composed of one (passthrough) *input layer*, one or more layers of TLUs called *hidden layers*, and one final layer of TLUs called the *output layer* (see figure below). The layers close to the input layer are usually called the lower layers, and the ones close to the outputs are usually called the upper layers. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

![alt text](MLP.PNG "MLP")

**Note**:

The signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a *feedforward neural network* (FNN).

When an ANN contains a deep stack of hidden layers, it is called a *deep neural network* (DNN). The field of Deep Learning studies DNNs, and more generally models containing deep stacks of computations. However, many people talk about Deep Learning whenever neural networks are involved (even shallow ones).

For many years researchers struggled to find a way to train MLPs, without success. In 1986 a groundbreaking paper was published introducing the *backpropagation* training algorithm, which is still used today. In short, it is simply Gradient Descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter. In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error. Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.

**Note**:

Automatically computing gradients is called *automatic differentiation*, or autodiff. There are various autodiff techniques, with different pros and cons. The one used by backpropagation is called *reverse-mode autodiff*. It is fast and precise, and is well suited when the function to differentiate has many variables (e.g. connection weights) and few outputs (e.g. one loss).

Let's run through this algorithm in a bit more detail:

- It handles one mini-batch at a time (for example, containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an *epoch*.
- Each mini-batch is passed to the network's input layer, which just sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the *forward pass*: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
- Next, the algorithm measures the network's output error (i.e., it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
- Then it computes how much each output connection contributed to the error. This is done analytically by simply applying the *chain rule*, which makes this step fast and precise.
- The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, and so on until it reaches the input layer. As we explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network (hence the name of the algorithm).
- Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.

This algorithm is so important, it's worth summarizing again: for each training instance the backpropagation algorithm first makes a prediction (forward pass), measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally slightly tweaks the connection weights to reduce the error (Gradient Descent step).
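The steps above can be sketched in NumPy for a tiny network with one hidden layer. Several choices here are assumptions made for the example and are not prescribed by the text: the step function is replaced by a sigmoid activation (the chain rule needs an activation with a usable derivative), the loss is the mean squared error, the four XOR instances are treated as a single mini-batch, and the layer sizes, learning rate, and seed are arbitrary.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(X, params):
    """Forward pass: keep the hidden activations, they are needed going backward."""
    W1, b1, W2, b2 = params
    A1 = sigmoid(X @ W1 + b1)
    Y_hat = sigmoid(A1 @ W2 + b2)
    return A1, Y_hat


def mse(Y_hat, Y):
    return np.mean((Y_hat - Y) ** 2)


def backward(X, Y, params, A1, Y_hat):
    """Reverse pass: apply the chain rule layer by layer, from output to input."""
    W1, b1, W2, b2 = params
    dY_hat = 2.0 * (Y_hat - Y) / Y.size      # d(loss)/d(output)
    dZ2 = dY_hat * Y_hat * (1.0 - Y_hat)     # through the output sigmoid
    dW2 = A1.T @ dZ2
    db2 = dZ2.sum(axis=0)
    dA1 = dZ2 @ W2.T                         # error sent to the hidden layer
    dZ1 = dA1 * A1 * (1.0 - A1)              # through the hidden sigmoid
    dW1 = X.T @ dZ1
    db1 = dZ1.sum(axis=0)
    return [dW1, db1, dW2, db2]


def train_step(X, Y, params, eta):
    """One forward pass, one reverse pass, one Gradient Descent step."""
    A1, Y_hat = forward(X, params)
    grads = backward(X, Y, params, A1, Y_hat)
    new_params = [p - eta * g for p, g in zip(params, grads)]
    return new_params, mse(Y_hat, Y)


rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)

# Random initial weights, zero biases: 2 inputs -> 3 hidden neurons -> 1 output
params = [rng.normal(size=(2, 3)), np.zeros(3),
          rng.normal(size=(3, 1)), np.zeros(1)]

losses = []
for epoch in range(2000):
    params, loss = train_step(X, Y, params, eta=0.5)
    losses.append(loss)

print(f"loss before training: {losses[0]:.4f}, after: {losses[-1]:.4f}")
```

A useful sanity check for any hand-written reverse pass is to compare one of its gradients against a finite-difference estimate of the same derivative.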

**Caution**:




