# Introduction to Artificial Neural Networks with Keras

*Artificial Neural Networks* (ANNs) are Machine Learning models inspired by the networks of biological neurons found in our brains. ANNs are at the very core of Deep Learning. They are versatile, powerful, and scalable, making them ideal for tackling large and highly complex Machine Learning tasks such as classifying billions of images, powering speech recognition services, recommending the best videos to watch to hundreds of millions of users every day, or learning to beat the world champion at the game of Go.

The first part of this chapter introduces artificial neural networks.  In the second part, we will look at how to implement neural networks using the popular Keras API.

## From Biological to Artificial Neurons

The early successes of ANNs led to the widespread belief that we would soon be conversing with truly intelligent machines. When it became clear in the 1960s that this promise would go unfulfilled, funding flew elsewhere, and ANNs entered a long winter. In the early 1980s, new architectures were invented and better training techniques were developed, sparking a revival of interest in *connectionism* (the study of neural networks). But progress was slow, and by the 1990s other powerful Machine Learning techniques had been invented, such as Support Vector Machines. These techniques seemed to offer better results and stronger theoretical foundations than ANNs, so once again the study of neural networks was put on hold.

We are now witnessing yet another wave of interest in ANNs, and this time is different.
* There is now a huge quantity of data available to train neural networks, and ANNs frequently outperform other ML techniques on very large and complex problems.
* The tremendous increase in computing power since the 1990s now makes it possible to train large neural networks in a reasonable amount of time.
* The training algorithms have been improved.
* ANNs seem to have entered a virtuous circle of funding and progress.

## The Perceptron

The *Perceptron* is one of the simplest ANN architectures, invented in 1957 by Frank Rosenblatt. It is based on a *threshold logic unit* (TLU). For a TLU, the inputs and outputs are numbers, and each input connection is associated with a weight. The TLU computes a weighted sum of its inputs, then applies a *step function* to that sum and outputs the result.

A Perceptron is simply composed of a single layer of TLUs, with each TLU connected to all the inputs.  When all the neurons in a layer are connected to every neuron in the previous layer, the layer is called a *fully connected layer*, or a *dense layer*.

Thanks to the magic of linear algebra, Equation 10-2 makes it possible to efficiently compute the outputs of a layer of artificial neurons for several instances at once.

<c> Equation 10-2: Computing the outputs of a fully connected layer </c>
$$ h_{W, b}(X) = \phi(XW + b) $$
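
To make Equation 10-2 concrete, here is a minimal NumPy sketch. It assumes the usual reading of the symbols: `X` holds one instance per row, `W` holds one column of connection weights per neuron, `b` holds one bias per neuron, and the activation function is the step function of a TLU. The function names and the example values are hypothetical, chosen only for illustration.

```python
import numpy as np

def step(z):
    """Heaviside step function: 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

def dense_layer(X, W, b, activation=step):
    """Compute activation(XW + b) for a whole batch of instances at once."""
    return activation(X @ W + b)

# Hypothetical example: 3 instances with 2 features, feeding a layer of 2 TLUs.
X = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [1.0, 1.0]])
W = np.array([[1.0, 1.0],    # weights from input 1 to each TLU
              [1.0, 1.0]])   # weights from input 2 to each TLU
b = np.array([-0.5, -1.5])   # one bias per TLU

outputs = dense_layer(X, W, b)
print(outputs)
```

With these particular weights, the first TLU behaves like a logical OR of the two inputs and the second like a logical AND. A single matrix product handles all three instances at once.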

So how is a Perceptron trained? The training algorithm is inspired by Hebb's rule, often summarized as "Cells that fire together, wire together." Perceptrons are trained using a variant of this rule that takes into account the error made by the network when it makes a prediction; the Perceptron learning rule reinforces connections that help reduce the error. More specifically, the Perceptron is fed one training instance at a time, and for each instance it makes its predictions. For every output neuron that produced a wrong prediction, it reinforces the connection weights from the inputs that would have contributed to the correct prediction.
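
The learning rule described above can be sketched in a few lines of NumPy. This is a minimal illustration for a single TLU, not Scikit-Learn's implementation; the function name `train_perceptron`, the learning rate `eta`, and the toy AND dataset are assumptions made for the example.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=10):
    """Train a single TLU with the Perceptron learning rule.

    For each instance: w <- w + eta * (y - y_hat) * x
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = int(w @ x_i + b >= 0)   # step function
            error = y_i - y_hat             # 0 if correct, +1 or -1 if wrong
            w += eta * error * x_i          # adjust only the active inputs
            b += eta * error
    return w, b

# Toy, linearly separable data: the logical AND of two binary inputs.
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_and = np.array([0, 0, 0, 1])

w, b = train_perceptron(X_and, y_and)
predictions = (X_and @ w + b >= 0).astype(int)
print(predictions)
```

When the prediction is correct the error term is zero, so the weights are left untouched; only wrong predictions change the connections.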

**The decision boundary of each output neuron is linear, so Perceptrons are incapable of learning complex patterns (just like Logistic Regression classifiers).**

Scikit-Learn provides a Perceptron class that implements a single-TLU network.

#### Example 1: Scikit-Learn's Perceptron

In [8]:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2, 3)] # petal length, petal width
y = (iris.target == 0).astype(int) # Iris Setosa

per_clf = Perceptron()
per_clf.fit(X, y)

y_pred = per_clf.predict([[2, 0.5]])
y_pred

array([0])

You may have noticed that the Perceptron learning algorithm strongly resembles Stochastic Gradient Descent.  In fact, Scikit-Learn's Perceptron class is equivalent to using an SGDClassifier with the following hyperparameters: loss = 'perceptron', learning_rate = 'constant', eta0 = 1 (the learning rate), and penalty = None (no regularization).
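
The equivalence can be checked with a short experiment. The sketch below fits both estimators on the same iris subset used earlier; fixing `random_state` on both is an assumption added here so that the two models shuffle the data identically.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron, SGDClassifier

iris = load_iris()
X = iris.data[:, (2, 3)]             # petal length, petal width
y = (iris.target == 0).astype(int)   # Iris Setosa

# Use the same seed for both estimators so the shuffling is identical.
per_clf = Perceptron(random_state=42)
sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant",
                        eta0=1.0, penalty=None, random_state=42)
per_clf.fit(X, y)
sgd_clf.fit(X, y)

# Compare the predictions of the two models on the training data.
same_predictions = np.array_equal(per_clf.predict(X), sgd_clf.predict(X))
print(same_predictions)
```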

Note that contrary to Logistic Regression classifiers, Perceptrons do not output a class probability.  This is one reason to prefer Logistic Regression over Perceptrons.

Perceptrons have a number of significant weaknesses, in particular that they are incapable of solving some trivial problems. It turns out that some of the limitations of Perceptrons can be eliminated by stacking multiple Perceptrons. The resulting ANN is called a *Multilayer Perceptron* (MLP).
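
A classic example of such a trivial problem (not named in the text above) is the exclusive OR (XOR) of two binary inputs, which no single linear decision boundary can separate. The sketch below shows how stacking TLUs gets around this. The weights are set by hand for illustration, not learned.

```python
import numpy as np

def step(z):
    """Heaviside step function: 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: two TLUs computing AND and OR of the inputs.
W_hidden = np.array([[1, 1],
                     [1, 1]])
b_hidden = np.array([-1.5, -0.5])

# Output TLU: fires when OR is on but AND is off.
W_out = np.array([[-1], [1]])
b_out = np.array([-0.5])

hidden = step(X_xor @ W_hidden + b_hidden)
xor_pred = step(hidden @ W_out + b_out).ravel()
print(xor_pred)
```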

## The Multilayer Perceptron and Backpropagation

An MLP is composed of one (passthrough) *input layer*, one or more layers of TLUs, called *hidden layers*, and one final layer of TLUs, called the *output layer*. The layers close to the input layer are usually called the *lower layers* and the ones close to the outputs are usually called the *upper layers*. Every layer except the output layer includes a bias neuron and is fully connected to the next layer.

The signal flows only in one direction (from the inputs to the outputs), so this architecture is an example of a *feedforward neural network* (FNN).

When an ANN contains a deep stack of hidden layers (dozens or hundreds), it is called a *deep neural network* (DNN).  For many years researchers struggled to find a way to train MLPs, without success, until 1986 when David Rumelhart, Geoffrey Hinton, and Ronald Williams published a groundbreaking paper that introduced the *backpropagation* training algorithm, which is still used today.

In short, it is Gradient Descent using an efficient technique for computing the gradients automatically: in just two passes through the network (one forward, one backward), the backpropagation algorithm is able to compute the gradient of the network's error with regard to every single model parameter.  In other words, it can find out how each connection weight and each bias term should be tweaked in order to reduce the error.  Once it has these gradients, it just performs a regular Gradient Descent step, and the whole process is repeated until the network converges to the solution.

Automatically computing gradients is called *automatic differentiation* or *autodiff*. There are various autodiff techniques, with different pros and cons. The one used by backpropagation is called *reverse-mode autodiff*. It is fast and precise, and is well suited when the function to differentiate has many variables and few outputs.

Let's run through the algorithm in a bit more detail:
* It handles one mini-batch at a time (for example, containing 32 instances each), and it goes through the full training set multiple times. Each pass is called an *epoch*.
* Each mini-batch is passed to the network's input layer, which sends it to the first hidden layer. The algorithm then computes the output of all the neurons in this layer (for every instance in the mini-batch). The result is passed on to the next layer, its output is computed and passed to the next layer, and so on until we get the output of the last layer, the output layer. This is the *forward pass*: it is exactly like making predictions, except all intermediate results are preserved since they are needed for the backward pass.
* Next, the algorithm measures the network's output error (i.e. it uses a loss function that compares the desired output and the actual output of the network, and returns some measure of the error).
* Then it computes how much each output connection contributed to the error. This is done analytically by applying the *chain rule* (perhaps the most fundamental rule in calculus), which makes this step fast and precise.
* The algorithm then measures how much of these error contributions came from each connection in the layer below, again using the chain rule, working backward until the algorithm reaches the input layer.  As explained earlier, this reverse pass efficiently measures the error gradient across all the connection weights in the network by propagating the error gradient backward through the network.
* Finally, the algorithm performs a Gradient Descent step to tweak all the connection weights in the network, using the error gradients it just computed.
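
The steps above can be sketched by hand in NumPy for a tiny MLP with one hidden layer. This is an illustration under stated assumptions (synthetic data, sigmoid hidden units, a linear output, mean squared error, a fixed learning rate), not how Keras implements training internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical toy regression data: 64 instances, 3 features, 1 target.
X = rng.normal(size=(64, 3))
y = np.sin(X.sum(axis=1, keepdims=True))

# One hidden layer with 8 sigmoid neurons, randomly initialized.
W1 = rng.normal(scale=0.5, size=(3, 8))
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 1))
b2 = np.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

eta, batch_size, losses = 0.1, 32, []
for epoch in range(200):
    for start in range(0, len(X), batch_size):
        Xb = X[start:start + batch_size]
        yb = y[start:start + batch_size]

        # Forward pass: keep intermediate results for the backward pass.
        h = sigmoid(Xb @ W1 + b1)
        y_hat = h @ W2 + b2

        # Output error (mean squared error over the mini-batch).
        err = y_hat - yb

        # Backward pass: chain rule, from the output layer to the input layer.
        grad_y_hat = 2 * err / len(Xb)
        grad_W2 = h.T @ grad_y_hat
        grad_b2 = grad_y_hat.sum(axis=0)
        grad_h = grad_y_hat @ W2.T
        grad_z1 = grad_h * h * (1 - h)   # derivative of the sigmoid
        grad_W1 = Xb.T @ grad_z1
        grad_b1 = grad_z1.sum(axis=0)

        # Gradient Descent step.
        W1 -= eta * grad_W1
        b1 -= eta * grad_b1
        W2 -= eta * grad_W2
        b2 -= eta * grad_b2

    # Track the loss on the full dataset once per epoch.
    full_pred = sigmoid(X @ W1 + b1) @ W2 + b2
    losses.append(float(np.mean((full_pred - y) ** 2)))

print(losses[0], losses[-1])
```

Note how the hidden activations `h` computed in the forward pass are reused in the backward pass, which is why the forward pass must preserve its intermediate results.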

**This algorithm is so important that it's worth summarizing it again: for each training instance, the backpropagation algorithm first makes a prediction (forward pass) and measures the error, then goes through each layer in reverse to measure the error contribution from each connection (reverse pass), and finally tweaks the connection weights to reduce the error (Gradient Descent step).**

***It is important to initialize all the hidden layers' connection weights randomly, or else training will fail.***

In order for this algorithm to work properly, its authors made a key change to the MLP's architecture: they replaced the step function with the logistic (sigmoid) function:
$$ \sigma(z) = \frac{1}{1 + \exp(-z)} $$

The backpropagation algorithm works well with many other activation functions.  Here are two other popular choices:
1. **The hyperbolic tangent function**: $\tanh(z) = 2\sigma(2z) - 1$
<br>Just like the logistic function, this activation function is S-shaped, continuous, and differentiable, but its output value ranges from -1 to 1 (instead of 0 to 1 in the case of the logistic function). That range tends to make each layer's output more or less centered around 0 at the beginning of training, which often helps speed up convergence.
2. **The Rectified Linear Unit function**: $\operatorname{ReLU}(z) = \max(0, z)$
<br>The ReLU function is continuous but unfortunately not differentiable at z=0 (the slope changes abruptly, which can make Gradient Descent bounce around), and its derivative is 0 for z < 0. In practice, however, it works very well and has the advantage of being fast to compute, so it has become the default. Most importantly, the fact that it does not have a maximum output value helps reduce some issues during Gradient Descent.
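
These activation functions are easy to write down in NumPy. The sketch below also checks numerically that the hyperbolic tangent can be expressed through the logistic function; the choice of 0 for the ReLU derivative at z = 0 is an arbitrary convention assumed here.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified Linear Unit: max(0, z), applied elementwise."""
    return np.maximum(0.0, z)

z = np.linspace(-5, 5, 11)

# tanh written in terms of the logistic function.
tanh_from_sigmoid = 2 * sigmoid(2 * z) - 1
identity_holds = np.allclose(tanh_from_sigmoid, np.tanh(z))

# ReLU's derivative is 0 for z < 0 and 1 for z > 0. It is undefined at
# z = 0; this sketch arbitrarily uses 0 there.
relu_grad = (z > 0).astype(float)

print(identity_holds)
print(relu(z))
print(relu_grad)
```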

Why do we need activation functions in the first place? Well, if you chain several linear transformations, all you get is a linear transformation. So if you don't have some nonlinearity between layers, then even a deep stack of layers is equivalent to a single layer, and you can't solve very complex problems with that. Conversely, a large enough DNN with nonlinear activations can theoretically approximate any continuous function.
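
The claim that chained linear transformations collapse into a single one can be verified numerically. In this sketch the layer sizes and random weights are arbitrary assumptions; the point is only that two linear layers equal one layer with weights `W1 @ W2` and bias `b1 @ W2 + b2`, and that inserting a ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))

W1, b1 = rng.normal(size=(4, 6)), rng.normal(size=6)
W2, b2 = rng.normal(size=(6, 3)), rng.normal(size=3)

# Two stacked layers with no activation function in between.
two_layers = (X @ W1 + b1) @ W2 + b2

# The same mapping expressed as a single linear layer.
W_combined = W1 @ W2
b_combined = b1 @ W2 + b2
one_layer = X @ W_combined + b_combined
collapses = np.allclose(two_layers, one_layer)

# With a ReLU between the layers, the result is no longer that linear map.
with_relu = np.maximum(0, X @ W1 + b1) @ W2 + b2
differs = not np.allclose(with_relu, one_layer)

print(collapses, differs)
```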

## Regression MLPs

First, MLPs can be used for regression tasks. If you want to predict a single value (like the value of a house), then you just need a single output neuron: its output is the predicted value. For multivariate regression (predicting multiple values at once), you need one output neuron per output dimension. For example, to locate the center of an object in an image, you need to predict 2D coordinates, so you need two output neurons.

**In general, when building an MLP for regression, you do not want to use any activation function for the output neurons, so they are free to output any range of values.** If you want to guarantee that the output will always be positive, then you can use the ReLU activation function in the output layer. Alternatively, you can use the *softplus* activation function, which is a smooth variant of ReLU: $\operatorname{softplus}(z) = \log(1 + \exp(z))$

The loss function to use during training is typically the mean squared error, but if you have a lot of outliers in the training set, you may prefer to use the mean absolute error instead. Alternatively, you can use the Huber loss, which is a combination of both.

The Huber loss is quadratic when the error is smaller than a threshold $\delta$ (typically 1) but linear when the error is larger than $\delta$. The linear part makes it less sensitive to outliers than the mean squared error, and the quadratic part allows it to converge faster and be more precise than the mean absolute error.
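
As a rough illustration, the Huber loss can be written in NumPy as follows. This sketch assumes one common definition, with a factor of 0.5 on the quadratic part and a linear part chosen so the two pieces join smoothly at the threshold; the sample values are made up to include one outlier.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small errors, linear for large ones."""
    error = y_true - y_pred
    abs_error = np.abs(error)
    quadratic = 0.5 * error ** 2
    linear = delta * (abs_error - 0.5 * delta)
    return np.mean(np.where(abs_error <= delta, quadratic, linear))

# Hypothetical targets and predictions; the last target is an outlier.
y_true = np.array([1.0, 2.0, 3.0, 100.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.0])

mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
huber = huber_loss(y_true, y_pred)
print(mse, mae, huber)
```

On this data the single outlier dominates the mean squared error, while the Huber loss stays on the same scale as the mean absolute error.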

In [14]:
import pandas as pd

# Typical regression MLP architecture
pd.DataFrame.from_dict(data={
    'Input Neurons': 'One per input feature',
    'Hidden Layers': 'Depends on the problem, typically 1 to 5',
    'Neurons per Layer': 'Depends on the problem, typically 10 to 100',
    'Output Neurons': '1 per prediction dimension',
    'Hidden Activation': 'ReLU or SELU',
    'Output Activation': 'None, or ReLU/softplus. Generally tailored for desired output.',
    'Loss Function': 'MSE, MAE or Huber if outliers'
}, orient='index').rename(columns={0: 'Typical Value'})

| | Typical Value |
|---|---|
| Input Neurons | One per input feature |
| Hidden Layers | Depends on the problem, typically 1 to 5 |
| Neurons per Layer | Depends on the problem, typically 10 to 100 |
| Output Neurons | 1 per prediction dimension |
| Hidden Activation | ReLU or SELU |
| Output Activation | None, or ReLU/softplus. Generally tailored for... |
| Loss Function | MSE, MAE or Huber if outliers |


## Classification MLPs

MLPs can also be used for classification tasks. For a binary classification problem, you just need a single output neuron using the logistic activation function: the output will be a number between 0 and 1, which you can interpret as the estimated probability of the positive class.

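MLPs can also easily handle multilabel binary classification tasks. For example, an email classifier could predict both whether a message is spam and whether it is urgent, using two output neurons that each use the logistic activation function. More generally, you would dedicate one output neuron for each positive class. Note that the output probabilities do not necessarily add up to 1.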

If each instance can belong only to a single class, out of three or more possible classes (e.g. classes 0 through 9 for digit image classification), then you need to have one output neuron per class, and you should use the softmax activation function for the whole output layer. The softmax function will ensure that all the estimated probabilities are between 0 and 1 and that they add up to 1 (which is required if the classes are exclusive). This is called multiclass classification.

Regarding the loss function, since we are predicting probability distributions, the cross-entropy loss (also called the log loss) is generally a good choice.
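
To see how softmax and the cross-entropy loss fit together, here is a minimal NumPy sketch. The logits and target indices are invented for illustration, and subtracting the row maximum before exponentiating is a standard numerical-stability trick added here.

```python
import numpy as np

def softmax(logits):
    """Turn each row of raw scores into probabilities that sum to 1."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def sparse_cross_entropy(probas, class_indices):
    """Mean negative log-probability assigned to the true class."""
    picked = probas[np.arange(len(class_indices)), class_indices]
    return -np.mean(np.log(picked))

# Hypothetical output-layer scores for 2 instances and 3 exclusive classes.
logits = np.array([[2.0, 1.0, 0.1],
                   [0.5, 2.5, 0.3]])
targets = np.array([0, 1])

probas = softmax(logits)
loss = sparse_cross_entropy(probas, targets)
print(probas, loss)
```

The loss shrinks toward 0 as the probability assigned to the true class approaches 1, and grows without bound as it approaches 0.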

In [17]:
# Typical classification MLP architecture
pd.DataFrame.from_dict({
    'Input and Hidden Layers': ['Same as regression', 'Same as regression', 'Same as regression'],
    'Number of Output Neurons': ['1', '1 per label', '1 per class'],
    'Output Layer Activation': ['Logistic', 'Logistic', 'Softmax'],
    'Loss Function': ['Cross-Entropy', 'Cross-Entropy', 'Cross-Entropy']
}, orient='index').rename(columns={0: 'Binary Classification', 1: 'Multilabel Binary Classification', 2: 'Multiclass Classification'})

| | Binary Classification | Multilabel Binary Classification | Multiclass Classification |
|---|---|---|---|
| Input and Hidden Layers | Same as regression | Same as regression | Same as regression |
| Number of Output Neurons | 1 | 1 per label | 1 per class |
| Output Layer Activation | Logistic | Logistic | Softmax |
| Loss Function | Cross-Entropy | Cross-Entropy | Cross-Entropy |


## Implementing MLPs with Keras

The reference Keras implementation relies on a computation backend. At present, you can choose from three popular open source Deep Learning libraries:
1. TensorFlow
2. Microsoft Cognitive Toolkit
3. Theano

Since late 2016, other implementations have been released, so Keras can also run on:
1. Apache MXNet
2. Apple's Core ML
3. JavaScript
4. TypeScript
5. PlaidML

Moreover, TensorFlow itself now comes bundled with its own Keras implementation, tf.keras. It only supports TensorFlow as the backend, but it has the advantage of offering some very useful extra features, such as TensorFlow's Data API, which makes it easy to load and preprocess data efficiently.

**The most popular Deep Learning library after Keras and TensorFlow is Facebook's PyTorch. The good news is that its API is quite similar to Keras's, so once you know Keras, it is not difficult to switch to PyTorch, if you ever want to.** PyTorch's popularity grew exponentially in 2018, largely thanks to its simplicity and excellent documentation, which were not TensorFlow 1.x's main strengths. However, TensorFlow 2 is arguably just as simple as PyTorch, as it has adopted Keras as its official high-level API and its developers have greatly simplified and cleaned up the rest of the API. Similarly, PyTorch's main weaknesses (e.g. limited portability and no computation graph analysis) have been largely addressed in PyTorch 1.0. Healthy competition is beneficial to everyone!

## Installing TensorFlow 2

In [21]:
# pip install -U tensorflow

In [22]:
import tensorflow as tf
from tensorflow import keras
tf.__version__, keras.__version__

('2.8.0', '2.8.0')

For GPU support, at the time of this writing you need to install tensorflow-gpu instead of tensorflow, but the TensorFlow team is working on having a single library that will support both CPU-only and GPU-equipped systems. You will need to install extra libraries for GPU support (see http://tensorflow.org/install for more details).

## Building an Image Classifier Using the Sequential API

### Using Keras to load the dataset

In [38]:
# Load the Keras dataset
fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

# Check dimensions and datatypes
print(X_train_full.shape, X_train_full.dtype)

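# Split the full training set into a validation set and a training set,
# scaling pixel intensities from the 0-255 range down to the 0-1 range
X_valid, X_train = X_train_full[:5000] / 255.0, X_train_full[5000:] / 255.0
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
# Scale the test set the same way so it matches the training data
X_test = X_test / 255.0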

# Define the class names
class_names = ['T-shirt/top', 'Trouser', 'Pullover', 'Dress', 'Coat', 'Sandal', 'Shirt', 'Sneaker', 'Bag', 'Ankle Boot']
class_names[y_train[0]]

(60000, 28, 28) uint8


'Coat'

### Creating the model using the Sequential API

In [200]:
model = keras.models.Sequential([
    keras.layers.Flatten(input_shape=[28, 28]),
    keras.layers.Dense(500, activation='relu'),
    keras.layers.Dense(500, activation='relu'),
    keras.layers.Dense(500, activation='relu'),
    keras.layers.Dense(500, activation='relu'),
    keras.layers.Dense(10, activation='softmax')
])

model.summary()

Model: "sequential_5"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 flatten_5 (Flatten)         (None, 784)               0         
                                                                 
 dense_17 (Dense)            (None, 500)               392500    
                                                                 
 dense_18 (Dense)            (None, 500)               250500    
                                                                 
 dense_19 (Dense)            (None, 500)               250500    
                                                                 
 dense_20 (Dense)            (None, 500)               250500    
                                                                 
 dense_21 (Dense)            (None, 10)                5010      
                                                                 
Total params: 1,149,010
Trainable params: 1,149,010
Non-trainable params: 0

Let's go through this code line by line:
1. The first line creates a Sequential model. This is the simplest kind of Keras model for neural networks that are just composed of a single stack of layers connected sequentially. This is called the Sequential API.
2. Next, we build the first layer and add it to the model. It is a Flatten layer whose role is to convert each input image into a 1D array: if it receives input data X, it computes X.reshape(-1, 28*28). This layer does not have any parameters; it is just there to do some simple preprocessing. Since it is the first layer in the model, you should specify the input_shape, which doesn't include the batch size, only the shape of the instances. Alternatively, you could add a keras.layers.InputLayer as the first layer, setting input_shape=[28, 28].
3. Next we add a Dense hidden layer with 500 neurons. It will use the ReLU activation function. **Each Dense layer manages its own weight matrix, containing all the connection weights between the neurons and their inputs.** It also manages a vector of bias terms (one per neuron). When it receives some input data, it computes Equation 10-2.
4. Then we add three more Dense hidden layers with 500 neurons each, also using the ReLU activation function.
5. Finally, we add a Dense output layer with 10 neurons (one per class), using the softmax activation function (because the classes are exclusive).

**Note that Dense layers often have a *lot* of parameters. This gives the model quite a lot of flexibility to fit the training data, but it also means that the model runs the risk of overfitting, especially when you do not have a lot of training data.**
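
The parameter counts in the summary above can be reproduced by hand: a Dense layer has one weight per (input, neuron) pair plus one bias per neuron. The helper name `dense_param_count` below is hypothetical, used only for this check.

```python
def dense_param_count(n_inputs, n_neurons):
    """Number of weights plus biases in a fully connected layer."""
    return n_inputs * n_neurons + n_neurons

# Flattened 28x28 input, four hidden layers of 500 neurons, 10 outputs.
layer_sizes = [28 * 28, 500, 500, 500, 500, 10]

counts = [
    dense_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(counts)
print(counts, total)
```

The first hidden layer alone accounts for 784 × 500 + 500 parameters, roughly a third of the whole model.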

You can easily get a model's list of layers, then fetch a layer by its index, or fetch it by name. All the parameters of a layer can be accessed using its get_weights() and set_weights() methods. For a Dense layer, this includes both the connection weights and the bias terms.

In [207]:
model.layers, model.layers[1].name, model.get_layer('dense_17') is model.layers[1]

([<keras.layers.core.flatten.Flatten at 0x260302df310>,
  <keras.layers.core.dense.Dense at 0x260302dff70>,
  <keras.layers.core.dense.Dense at 0x260302f74f0>,
  <keras.layers.core.dense.Dense at 0x260302f7eb0>,
  <keras.layers.core.dense.Dense at 0x2602d35c1f0>,
  <keras.layers.core.dense.Dense at 0x2602d3bc3d0>],
 'dense_17',
 True)

In [209]:
weights, biases = model.layers[1].get_weights()
weights.shape, biases.shape, weights

((784, 500),
 (500,),
 array([[-0.02918262,  0.02894263,  0.05410552, ...,  0.02060419,
         -0.01661743, -0.03945428],
        [ 0.03310787, -0.02599242, -0.05342195, ..., -0.02981296,
          0.03183615, -0.01892894],
        [ 0.04201207, -0.03836633,  0.05245259, ..., -0.04116692,
          0.0665056 ,  0.00915689],
        ...,
        [-0.0621868 , -0.00052582,  0.02977216, ..., -0.02170236,
          0.06045925,  0.03534312],
        [-0.00262833, -0.0545004 , -0.06271002, ...,  0.01006421,
          0.03277665,  0.04097281],
        [-0.05442855,  0.03501238,  0.00235659, ...,  0.01903313,
         -0.00377643,  0.02144849]], dtype=float32))

The shape of the weight matrix depends on the number of inputs. This is why it is recommended to specify the input_shape when creating the first layer in a Sequential model. **Until the model is really built, the layers will not have any weights, and you will not be able to do certain things such as print the model summary or save the model.** So if you know the input shape when creating the model, it is best to specify it.

### Compiling the model

After a model is created, you must call its compile() method to specify the loss function and the optimizer to use. Optionally, you can specify a list of extra metrics to compute during training and evaluation.

In [213]:
model.compile(
    loss='sparse_categorical_crossentropy',
    optimizer=keras.optimizers.SGD(learning_rate=0.01),
    metrics=['accuracy']
)

This code requires some explanation. **First, we use the 'sparse_categorical_crossentropy' loss because we have sparse labels (i.e. for each instance, there is just a target class index, from 0 to 9 in this case), and the classes are exclusive. If instead we had one target probability per class for each instance (such as one-hot vectors, e.g. [0, 0, 1, 0] to represent class 2), then we would need to use the 'categorical_crossentropy' loss instead.**

If we were doing binary classification (with one or more binary labels), then we would use the 'sigmoid' (i.e. logistic) activation function in the output layer instead of the 'softmax' activation function, and we would use the 'binary_crossentropy' loss.

If you want to convert sparse labels (i.e. class indices) to one-hot vector labels, use the keras.utils.to_categorical() function. To go the other way around, use the np.argmax() function with axis=1.
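
The two label formats can be illustrated without Keras. The sketch below is a NumPy stand-in that mimics the conversion from class indices to one-hot vectors; it is not the keras.utils.to_categorical() function itself, and the helper name `to_one_hot` is hypothetical.

```python
import numpy as np

def to_one_hot(class_indices, num_classes):
    """Pick rows of the identity matrix to build one-hot vectors."""
    return np.eye(num_classes, dtype="float32")[class_indices]

# Hypothetical sparse labels for three instances and ten classes.
sparse_labels = np.array([2, 0, 9])

one_hot = to_one_hot(sparse_labels, num_classes=10)

# Going the other way around: the index of the 1 in each row.
recovered = np.argmax(one_hot, axis=1)

print(one_hot)
print(recovered)
```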

When using the SGD optimizer, it is important to tune the learning rate. So, you will generally want to use optimizer=keras.optimizers.SGD(learning_rate=???) to set the learning rate, rather than optimizer='sgd', which defaults to learning_rate=0.01.

### Training and evaluating the model

Now the model is ready to be trained. For this we simply need to call its fit() method. We pass it the input features (X_train) and the target classes (y_train), as well as the number of epochs to train. We also pass a validation set (this is optional). Keras will measure the loss and the extra metrics on this set at the end of each epoch. If the performance on the training set is much better than on the validation set, your model is probably overfitting the training set (or there is a bug such as a data mismatch between the training set and the validation set).

Instead of passing a validation set using the validation_data argument, you could set validation_split to the ratio of the training set that you want Keras to use for validation. For example, validation_split=0.1 tells Keras to use the last 10% of the data (before shuffling) for validation.

**If the training set was very skewed, with some classes being overrepresented and others underrepresented, it would be useful to set the class_weight argument when calling the fit() method, which would give a larger weight to underrepresented classes and a lower weight to overrepresented classes.**

If you need per-instance weights, set the sample_weight argument (it supersedes class_weight). Per-instance weights could be useful if some instances were labeled by experts while others were labeled using a crowdsourcing platform: you might want to give more weight to the former.
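
One way to choose class weights is to make them inversely proportional to class frequency. The sketch below assumes that heuristic (it is one common choice, not the only one) and uses made-up labels; the helper name `balanced_class_weights` is hypothetical. The result is a dictionary mapping class indices to weights, the format the class_weight argument accepts.

```python
import numpy as np

def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Hypothetical skewed labels: 90 instances of class 0, 10 of class 1.
y_skewed = np.array([0] * 90 + [1] * 10)

class_weight = balanced_class_weights(y_skewed)
print(class_weight)

# Hypothetical usage with a compiled Keras model:
# model.fit(X_train, y_train, epochs=30, class_weight=class_weight)
```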

The fit() method returns a History object containing the training parameters (history.params), the list of epochs it went through (history.epoch), and most importantly a dictionary (history.history) containing the loss and extra metrics it measured at the end of each epoch on the training set and on the validation set (if any). If you use this dictionary to create a pandas DataFrame and call its plot() method, you get the learning curves for the model.

**If your training or validation data does not match the expected shape, you will get an exception. This is perhaps the most common error, so you should get familiar with the error message. The message is actually quite clear: for example, if you try to train this model with an array containing flattened images, it will raise an exception reporting the mismatch between the expected and the actual input shape.**

If the validation curves are close to the training curves, it means there is not too much overfitting. In this particular case, the model looks like it performed better on the validation set than on the training set at the beginning of training. But that's not the case: indeed, the validation error is computed at the *end* of each epoch, while the training error is computed using a running mean *during* each epoch. So the training curve should be shifted by half an epoch to the left. If you do that, you will see that the training and validation curves overlap almost perfectly at the beginning of training.

**When plotting the training curve, it should be shifted by half an epoch to the left.**

In [None]:
history = model.fit(
    X_train,
    y_train,
    epochs=30,
    validation_data=(X_valid, y_valid)
)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30

In [None]:
# Create DataFrame from model history
learning_curves = pd.DataFrame.from_dict(history.history, orient='columns')

# Shift the training loss back by one epoch for better alignment.
# This is only an approximation: because the training loss is a running mean
# computed during the epoch, the exact correction would be half an epoch to the left.
learning_curves.loc[:, 'loss'] = learning_curves.loc[:, 'loss'].shift(-1)

# Plot
learning_curves.plot(ylim=(0,1), grid=True)
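
For an exact half-epoch shift, one option is to move the x-coordinates of the training curves rather than the values themselves. The sketch below is one way to do that with pandas; the helper name `half_epoch_aligned` and the history values are hypothetical, and it assumes validation metrics are the columns prefixed with `val_`.

```python
import pandas as pd

def half_epoch_aligned(history_dict):
    """Split a Keras-style history dict into training and validation curves,
    placing training metrics at mid-epoch and validation metrics at epoch end."""
    df = pd.DataFrame(history_dict)
    df.index = df.index + 1.0                  # epoch k ends at x = k
    train_cols = [c for c in df.columns if not c.startswith("val_")]
    val_cols = [c for c in df.columns if c.startswith("val_")]
    train = df[train_cols].copy()
    train.index = train.index - 0.5            # running mean ~ mid-epoch
    return train, df[val_cols]

# Hypothetical history values, for illustration only.
fake_history = {"loss": [0.9, 0.6, 0.5], "val_loss": [0.7, 0.55, 0.5]}
train_curves, val_curves = half_epoch_aligned(fake_history)
print(train_curves)
print(val_curves)

# Hypothetical plotting usage:
# ax = train_curves.plot(grid=True)
# val_curves.plot(ax=ax)
```

Unlike shifting the values by a whole row, this keeps every measurement and only changes where it is drawn on the epoch axis.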

The training set performance ends up beating the validation performance, as is generally the case when you train for long enough. **You can tell that the model has not quite converged yet, as the validation loss is still going down, so you should probably continue training.** It's as simple as calling the fit() method again, since Keras just continues training where you left off.

If you are not satisfied with the performance of your model, you should go back and tune the hyperparameters. The first one to check is the learning rate. If that doesn't help, try another optimizer (and always retune the learning rate after changing any hyperparameter). If the performance is still not great, then try tuning the model hyperparameters such as the number of layers, the number of neurons per layer, and the types of activation functions to use for each hidden layer. You can also try tuning other hyperparameters, such as the batch size.

Once you are satisfied with your model's validation accuracy, you should evaluate it on the test set to estimate the generalization error before you deploy the model to production. You can easily do this by using the evaluate() method. It is common to get slightly lower performance on the test set than on the validation set, because the hyperparameters are tuned on the validation set, not the test set.