# Introduction to Artificial Neural Networks with Keras

*Artificial Neural Networks* (ANNs) are Machine Learning models inspired by the networks of biological neurons found in our brains. ANNs are at the very core of Deep Learning. They are versatile, powerful, and scalable, making them ideal for tackling large and highly complex Machine Learning tasks such as classifying billions of images, powering speech recognition services, recommending the best videos to watch to hundreds of millions of users every day, or learning to beat the world champion at the game of Go.

The first part of this chapter introduces artificial neural networks.  In the second part, we will look at how to implement neural networks using the popular Keras API.

## From Biological to Artificial Neurons

The early successes of ANNs led to the widespread belief that we would soon be conversing with truly intelligent machines. When it became clear in the 1960s that this promise would go unfulfilled, funding flowed elsewhere, and ANNs entered a long winter. In the early 1980s, new architectures were invented and better training techniques were developed, sparking a revival of interest in *connectionism* (the study of neural networks). But progress was slow, and by the 1990s other powerful Machine Learning techniques had been invented, such as Support Vector Machines. These techniques seemed to offer better results and stronger theoretical foundations than ANNs, so once again the study of neural networks was put on hold.

We are now witnessing yet another wave of interest in ANNs, and there are good reasons to believe that this time is different:
* There is now a huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.
* The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time.
* The training algorithms have been improved.
* ANNs seem to have entered a virtuous circle of funding and progress.

## The Perceptron

The *Perceptron* is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a *threshold logic unit* (TLU). For a TLU, the inputs and output are numbers, and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs, then applies a *step function* to that sum and outputs the result.
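The TLU computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the Heaviside step function (1 when the sum is at least 0, otherwise 0), the explicit bias term `b`, and the hand-picked weights that make the unit behave like a logical AND are all assumptions chosen for the example.

```python
import numpy as np

def heaviside(z):
    """Step function (assumed here): 1 where z >= 0, else 0."""
    return (np.asarray(z) >= 0).astype(int)

def tlu(x, w, b):
    """Threshold logic unit: weighted sum of the inputs, then a step function."""
    z = np.dot(x, w) + b
    return heaviside(z)

# Hypothetical weights and bias, picked by hand so the unit acts as a logical AND.
w = np.array([1.0, 1.0])
b = -1.5

inputs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
outputs = tlu(inputs, w, b)
print(outputs)  # one output per input row
```

Only the input `[1, 1]` produces a weighted sum above the threshold, so only that row outputs 1.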

A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs.  When all the neurons in a layer are connected to every neuron in the previous layer, the layer is called a *fully connected layer*, or a *dense layer*.

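Thanks to the magic of linear algebra, Equation 10-2 makes it possible to efficiently compute the outputs of a layer of artificial neurons for several instances at once. In this equation, $X$ is the matrix of input features (one row per instance, one column per feature), $W$ is the matrix of connection weights (one row per input, one column per neuron in the layer), $b$ is the vector of bias terms (one per neuron), and $\phi$ is the activation function (a step function when the neurons are TLUs).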

<c> Equation 10-2: Computing the outputs of a fully connected layer </c>
$$ h_{W, b}(X) = \phi(XW + b) $$
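As a rough sketch of Equation 10-2, the following NumPy snippet computes the outputs of one fully connected layer for a small batch. The function name `dense_layer`, the step activation, and all numeric values are hypothetical and only serve the illustration.

```python
import numpy as np

def step(z):
    """Step activation (assumed): 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

def dense_layer(X, W, b, phi):
    """Outputs of a fully connected layer: phi(XW + b)."""
    return phi(X @ W + b)

# Hypothetical batch of 3 instances with 2 features each.
X = np.array([[1.0, 2.0],
              [0.0, -1.0],
              [-2.0, 0.5]])
# Hypothetical layer with 3 neurons: one row per input, one column per neuron.
W = np.array([[1.0, -1.0, 0.5],
              [0.5, 1.0, -1.0]])
b = np.array([0.0, -0.5, 1.0])

H = dense_layer(X, W, b, step)
print(H)  # one row per instance, one column per neuron
```

A single matrix product handles the whole batch, and the bias vector is broadcast across the rows.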

So how is a Perceptron trained? The training algorithm is inspired by Hebb's rule, often summarized as "Cells that fire together, wire together." Perceptrons are trained using a variant of this rule that takes into account the error made by the network when it makes a prediction; the Perceptron learning rule reinforces connections that help reduce the error. More specifically, the Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction.
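The learning rule described above is commonly written as $w_i \leftarrow w_i + \eta\,(y - \hat{y})\,x_i$. Here is a minimal sketch of it for a single output neuron, assuming a step activation, a learning rate `eta` of 1, and a tiny hand-made dataset (the logical OR function). The function name `train_perceptron` and its parameters are hypothetical.

```python
import numpy as np

def step(z):
    """Step activation (assumed): 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

def train_perceptron(X, y, eta=1.0, n_epochs=20):
    """Train a single TLU with the Perceptron learning rule."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        n_errors = 0
        for x_i, y_i in zip(X, y):           # one training instance at a time
            y_hat = int(np.dot(x_i, w) + b >= 0)
            error = y_i - y_hat               # 0 if correct, +1 or -1 otherwise
            if error != 0:
                w += eta * error * x_i        # reinforce the helpful connections
                b += eta * error
                n_errors += 1
        if n_errors == 0:                     # every instance classified correctly
            break
    return w, b

# Toy, linearly separable dataset: logical OR.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 1])

w, b = train_perceptron(X, y)
y_pred = step(X @ w + b)
print(w, b, y_pred)
```

Because the OR problem is linearly separable, the loop stops after a few epochs with weights that classify all four instances correctly.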

**The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns (just like Logistic Regression classifiers).**

Scikit-Learn provides a Perceptron class that implements a single-TLU network.

### Example 1: Scikit-Learn's Perceptron

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, [2, 3]]  # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris Setosa, else 0

per_clf = Perceptron()
per_clf.fit(X, y)

# Predict the class of a flower with petal length 2 and petal width 0.5
y_pred = per_clf.predict([[2, 0.5]])
y_pred
```

```
array([0])
```

You may have noticed that the Perceptron learning algorithm strongly resembles Stochastic Gradient Descent.  In fact, Scikit-Learn's Perceptron class is equivalent to using an SGDClassifier with the following hyperparameters: loss = 'perceptron', learning_rate = 'constant', eta0 = 1 (the learning rate), and penalty = None (no regularization).
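The equivalence stated above can be checked informally with a sketch like the one below. It assumes that fixing the same `random_state` (42 is an arbitrary choice) on both estimators makes them shuffle the data identically; the comparison on the training set is only a sanity check, not a proof.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron, SGDClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]            # petal length, petal width
y = (iris.target == 0).astype(int)  # 1 if Iris Setosa, else 0

# Same seed for both models so that shuffling is comparable (an assumption).
per_clf = Perceptron(random_state=42)
sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1.0, penalty=None, random_state=42)
per_clf.fit(X, y)
sgd_clf.fit(X, y)

# Compare the two models' predictions on the training set.
same_predictions = np.array_equal(per_clf.predict(X), sgd_clf.predict(X))
print(same_predictions)
```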

Note that contrary to Logistic Regression classifiers, Perceptrons do not output a class probability.  This is one reason to prefer Logistic Regression over Perceptrons.

Perceptrons have a number of significant weaknesses; in particular, they are incapable of solving some trivial problems, such as the exclusive OR (XOR) classification problem. It turns out that some of these limitations can be eliminated by stacking multiple Perceptrons. The resulting ANN is called a *Multilayer Perceptron* (MLP).

## The Multilayer Perceptron and Backpropagation

An MLP is composed of one (passthrough) *input layer*, one or more layers of TLUs, called *hidden layers*, and one final layer of TLUs, called the *output layer*. The layers close to the input layer are usually called the *lower layers*, and the ones close to the outputs are usually called the *upper layers*. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

The signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a *feedforward neural network* (FNN).
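To make the feedforward idea concrete, here is a small sketch of an MLP with one hidden layer of two TLUs and one output TLU. The weights are picked by hand (not learned) so that the network computes the exclusive OR (XOR) of its two inputs, a classic problem that a single TLU cannot solve. All names and values are illustrative assumptions.

```python
import numpy as np

def step(z):
    """Step activation (assumed): 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

# Hand-picked hidden layer: neuron 0 acts as AND, neuron 1 acts as OR.
W_hidden = np.array([[1.0, 1.0],
                     [1.0, 1.0]])
b_hidden = np.array([-1.5, -0.5])

# Hand-picked output neuron: fires when OR is on but AND is off.
W_out = np.array([[-1.0],
                  [1.0]])
b_out = np.array([-0.5])

def xor_mlp(X):
    """Forward pass: inputs -> hidden layer -> output layer."""
    hidden = step(X @ W_hidden + b_hidden)
    return step(hidden @ W_out + b_out).ravel()

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = xor_mlp(X)
print(y_xor)
```

The signal only moves forward: each layer's output becomes the next layer's input, with no loops.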

When an ANN contains a deep stack of hidden layers (dozens or hundreds), it is called a *deep neural network* (DNN).  For many years researchers struggled to find a way to train MLPs, without success, until 1986 when David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper that introduced the *backpropagation* training algorithm, which is still used today.

In short, it is Gradient Descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter.  In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error.  Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.

Automatically computing gradients is called *automatic differentiation*, or *autodiff*. There are various autodiff techniques, with different pros and cons. The one used by backpropagation is called *reverse-mode autodiff*. It is fast and precise, and is well suited when the function to differentiate has many variables and few outputs.
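The idea behind reverse-mode autodiff can be sketched by hand on a tiny function. The example below uses the hypothetical function $f(x, y) = x^2 y + y + 2$: one forward pass stores the intermediate values, and one backward pass applies the chain rule from the output back to the inputs. A finite-difference estimate is included only as a sanity check. Real autodiff libraries build and traverse the computation graph automatically; this is a manual illustration.

```python
def f(x, y):
    return x * x * y + y + 2.0

def f_with_grads(x, y):
    """Return f(x, y) and its partial derivatives via a manual reverse pass."""
    # Forward pass: compute and keep every intermediate result.
    a = x * x          # a = x^2
    b = a * y          # b = x^2 * y
    c = b + y          # c = x^2 * y + y
    out = c + 2.0
    # Backward pass: propagate d(out)/d(node) from the output to the inputs.
    d_c = 1.0                  # out = c + 2
    d_b = d_c                  # c = b + y
    d_y = d_c                  # direct contribution of y through c
    d_a = d_b * y              # b = a * y
    d_y += d_b * a             # contribution of y through b
    d_x = d_a * 2.0 * x        # a = x^2
    return out, d_x, d_y

def numerical_grads(x, y, eps=1e-6):
    """Central finite differences, used only to check the result."""
    d_x = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    d_y = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return d_x, d_y

value, grad_x, grad_y = f_with_grads(3.0, 4.0)
num_x, num_y = numerical_grads(3.0, 4.0)
print(value, grad_x, grad_y)
```

Both partial derivatives come out of a single backward sweep, which is why this approach scales well when there are many inputs and few outputs.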

Let's run through the algorithm in a bit more detail:
* It handles one mini-batch at a time (for example, containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an *epoch*.
* Each mini-batch is passed to the network's input layer, which sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the *forward pass*: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
* Next, the algorithm measures the network's output error (i.e. it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
* Then it computes how much each output connection contributed to the error. This is done analytically by applying the *chain rule* (perhaps the most fundamental rule in calculus), which makes this step fast and precise.
* The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, working backward until the algorithm reaches the input layer.  As explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network.
* Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
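The steps above can be sketched from scratch in NumPy for a network with one hidden layer. This is a simplified illustration under several assumptions that are not in the text: sigmoid activations in both layers, a mean squared error loss, a learning rate of 0.5, 8 hidden neurons, and a synthetic toy dataset. All variable names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(42)

# Hypothetical toy data: 256 points in 2D, labeled 1 when x0 + x1 > 0.
X = rng.uniform(-1, 1, size=(256, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# Random initialization of a 2 -> 8 -> 1 network.
W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)

def forward(X_batch):
    """Forward pass; the hidden activations are kept for the backward pass."""
    hidden = sigmoid(X_batch @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output

def mse(X_all, y_all):
    return float(np.mean((forward(X_all)[1] - y_all) ** 2))

eta, batch_size, n_epochs = 0.5, 32, 100
initial_loss = mse(X, y)

for epoch in range(n_epochs):                      # one epoch = one full pass
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):     # one mini-batch at a time
        batch = order[start:start + batch_size]
        X_b, y_b = X[batch], y[batch]
        hidden, output = forward(X_b)              # forward pass
        # Gradient of the MSE loss with respect to the network output.
        d_output = 2.0 * (output - y_b) / len(X_b)
        # Backward pass through the output layer (chain rule).
        d_z2 = d_output * output * (1.0 - output)
        grad_W2, grad_b2 = hidden.T @ d_z2, d_z2.sum(axis=0)
        # Backward pass through the hidden layer (chain rule again).
        d_hidden = d_z2 @ W2.T
        d_z1 = d_hidden * hidden * (1.0 - hidden)
        grad_W1, grad_b1 = X_b.T @ d_z1, d_z1.sum(axis=0)
        # Gradient Descent step on every weight and bias.
        W1 -= eta * grad_W1
        b1 -= eta * grad_b1
        W2 -= eta * grad_W2
        b2 -= eta * grad_b2

final_loss = mse(X, y)
print(initial_loss, final_loss)
```

Comparing the loss before and after training gives a quick check that the gradients point in a useful direction.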

**This algorithm is so important that it's worth summarizing it again: for each training instance, the backpropagation algorithm first makes a prediction (forward pass) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally tweaks the connection weights to reduce the error (Gradient Descent step).**

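***It is important to initialize all the hidden layers' connection weights randomly, or else training will fail. For example, if all the weights and biases start at zero, every neuron in a given layer is identical, backpropagation updates them identically, and they stay identical, so the layer behaves as if it had a single neuron.***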
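The following sketch illustrates why random initialization matters. It assumes a one-hidden-layer network with sigmoid activations, a mean squared error loss, and no bias terms, and compares the gradient of the hidden-layer weights under a constant initialization and under a random one. The helper `hidden_weight_grad` and the data are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_weight_grad(W1, W2, X, y):
    """Gradient of the MSE loss w.r.t. W1 for a one-hidden-layer MLP (no biases)."""
    hidden = sigmoid(X @ W1)
    output = sigmoid(hidden @ W2)
    d_z2 = 2.0 * (output - y) / len(X) * output * (1.0 - output)
    d_z1 = (d_z2 @ W2.T) * hidden * (1.0 - hidden)
    return X.T @ d_z1  # one column per hidden neuron

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 3))                         # hypothetical inputs
y = rng.integers(0, 2, size=(16, 1)).astype(float)   # hypothetical labels

# Constant initialization: all 4 hidden neurons start out identical.
grad_const = hidden_weight_grad(np.full((3, 4), 0.5), np.full((4, 1), 0.5), X, y)
identical_columns = np.allclose(grad_const, grad_const[:, [0]])

# Random initialization: each hidden neuron gets its own gradient.
grad_rand = hidden_weight_grad(rng.normal(size=(3, 4)), rng.normal(size=(4, 1)), X, y)
distinct_columns = not np.allclose(grad_rand, grad_rand[:, [0]])

print(identical_columns, distinct_columns)
```

With constant weights, every hidden neuron receives exactly the same update, so the neurons can never learn different features.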

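In order for this algorithm to work properly, its authors made a key change to the MLP's architecture: they replaced the step function with the logistic (sigmoid) function. This was essential because the step function contains only flat segments, so there is no gradient to work with, whereas the logistic function has a well-defined nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step: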
$$ \sigma(z) = \frac{1}{1 + \exp(-z)} $$

The backpropagation algorithm works well with many other activation functions.  Here are two other popular choices:
1. **The hyperbolic tangent function**: $\tanh(z) = 2\sigma(2z) - 1$
<br>Just like the logistic function, this activation function is S-shaped, continuous, and differentiable, but its output value ranges from -1 to 1 (instead of 0 to 1 in the case of the logistic function). That range tends to make each layer's output more or less centered around 0 at the beginning of training, which often helps speed up convergence.
2. **The Rectified Linear Unit function**: $\mathrm{ReLU}(z) = \max(0, z)$
<br>The ReLU function is continuous but unfortunately not differentiable at z = 0 (the slope changes abruptly, which can make Gradient Descent bounce around), and its derivative is 0 for z < 0. In practice, however, it works very well and has the advantage of being fast to compute, so it has become the default. Most importantly, the fact that it does not have a maximum output value helps reduce some issues during Gradient Descent.
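The three activation functions discussed above are easy to write in NumPy. The sketch below also checks numerically that $2\sigma(2z) - 1$ agrees with NumPy's built-in `np.tanh`. The sample points in `z` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    """Hyperbolic tangent written in terms of the logistic function."""
    return 2.0 * sigmoid(2.0 * z) - 1.0

def relu(z):
    """Rectified Linear Unit: 0 for negative inputs, identity otherwise."""
    return np.maximum(0.0, z)

z = np.linspace(-5, 5, 11)  # -5, -4, ..., 5

tanh_matches_numpy = np.allclose(tanh(z), np.tanh(z))
print(tanh_matches_numpy)
print(sigmoid(z).min(), sigmoid(z).max())  # stays strictly between 0 and 1
print(tanh(z).min(), tanh(z).max())        # stays strictly between -1 and 1
print(relu(z))                             # no upper bound on the output
```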

Why do we need activation functions in the first place? Well, if you chain several linear transformations, all you get is a linear transformation. So if you don't have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer, and you can't solve very complex problems with that. Conversely, a large enough DNN with nonlinear activations can theoretically approximate any continuous function.
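The claim that chained linear transformations collapse into a single one can be verified with a short sketch. Two layers without activation, $(XW_1 + b_1)W_2 + b_2$, equal one layer with weights $W_1 W_2$ and bias $b_1 W_2 + b_2$. The shapes and random values below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                          # 5 instances, 3 features
W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two stacked layers with no activation function in between.
two_layers = (X @ W1 + b1) @ W2 + b2

# The equivalent single linear layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W + b
collapses = np.allclose(two_layers, one_layer)

# With a ReLU between the layers, the equivalence no longer holds.
with_relu = np.maximum(0.0, X @ W1 + b1) @ W2 + b2
differs = not np.allclose(with_relu, one_layer)

print(collapses, differs)
```

The nonlinearity is what prevents the stack from reducing to a single linear map.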

## Regression MLPs