# The chicken or the beef? A justification for artificial neural networks

## Overview

The purpose of this section is to cover the foundations of Artificial Neural Networks (ANNs), as well as their relation to Deep Learning. We will first give an overview of various neural net architectures and then implement some key models using Keras and TensorFlow. We will also look at how to assess model performance and how to persist these models for future use.

Specifically, we cover here:

* A conceptual introduction to Artificial Neural Networks (ANNs)
* TensorFlow and Keras overview
* Defining the base input
* Building models by adding defined layers
* Loss functions, optimizers, and metrics
* Training models 
* Performance assessment
* Improving performance by increasing network depth and width, and decreasing overfitting by introducing network drop-out
* Persisting Keras models for future use

## Libraries

In [1]:
# Standard libraries for manipulation and plotting 
import os
import pathlib
import sys
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt

In [2]:
# TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras


In [2]:
## Enable inline plotting for graphics
%matplotlib inline
## Set larger default figure size
matplotlib.rcParams['figure.figsize'] = [10.0,6.0]
## Multiple outputs from cells
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [4]:
## Get Version information
print(sys.version)
print("Pandas version: {0}".format(pd.__version__))
print("Matplotlib version: {0}".format(matplotlib.__version__))
print("Numpy version: {0}".format(np.__version__))
print("Tensorflow version: {0}".format(tf.__version__))

3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) 
[GCC 7.2.0]
Pandas version: 0.23.0
Matplotlib version: 2.2.2
Numpy version: 1.14.3
Tensorflow version: 1.10.0


## The most basic artificial neuron: the Perceptron

Let's say you have an important decision to make. You are hungry and wish to be sated. However, you are on an airplane, so your options are 'chicken' or 'beef'. Your target variable is 'satisfaction,' which will not be achieved if you choose neither, or both, options. You must make this decision while you have the flight attendant's attention. Finally, the attendant is new and very busy, and may not recall your first choice, so you will have to be explicit about what you _do not_ wish, as well as what you do (or perhaps they ran out of one option and another attendant has to provide this option later, without information on the first meal option).

This contrived scenario is simple by design. Yet how would we go about modeling this decision?

We could approach this as a linear regression problem: $f(x) = w_1x_1 + w_2x_2 + b$, where the $w_i$ are the weights of the inputs $x_i$, and minimize the least squares cost function, $J(x) = \sum_{r} (f'(x) - f(x))^2$, where the $r$ run over all the input values.

However, consider an intrinsic problem here. The final decision depends wholly on the interaction between the two input variables, 'chicken' (say, $x_1$) or 'beef' ($x_2$):

![Decision plane for meals... on a plane](images/chicken_or_beef.png)
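
To make the difficulty concrete, the following minimal sketch fits the linear model above by ordinary least squares to the four corners of the plane. The 0/1 encoding of the corners and of 'satisfaction', and the variable names, are assumptions made for illustration.

```python
import numpy as np

# The four corners of the chicken-beef plane and the target:
# 'sated' (1) only when exactly one meal is chosen.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Append a column of ones so the intercept b is fitted along with w1 and w2.
A = np.hstack([X, np.ones((4, 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w1, w2, b = coef

predictions = A @ coef
print("w1, w2, b:", w1, w2, b)
print("predictions:", predictions)
```

The fitted weights come out as zero and the intercept as 0.5, so the linear model predicts 0.5 at every corner and cannot tell the two classes apart.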

If $x_1$ is maximal, we wish the value of $x_2$ to be minimal, and vice-versa. But we can't just change the coefficients from a regression at will! In other words, there is no single line that separates the Chicken-Beef ($x_1-x_2$) plane to define a distinct _decision boundary_ between the two classes, 'sated' and 'not sated'. However, we may do this with _two_ lines (and logical comparisons):

![Division of decision plane using two linear regions](images/chicken_or_beef_twoLines.png)

The lower (green) line separates the origin, (0, 0), from the other three points. The upper (blue) line defines a region that contains only the (1, 1) point.

In other words, the regions defined by:

$x_2 \geq -x_1 + 0.5$

$x_2 \geq -x_1 + 1.2$

each return 1 if the condition is met. This may be re-cast:

$x_1 + x_2 \geq 0.5 \rightarrow H(x_1 + x_2 - 0.5)$

$x_1 + x_2 \geq 1.2 \rightarrow H(x_1 + x_2 - 1.2)$

The simplest function to return a 1 or 0 depending on a fixed criterion is the Heaviside step function, $H(x)$, with the following properties:

$H(x) = \begin{cases} 1 & \textrm{for } x > 0 \\ 0 & \textrm{for } x \leq 0 \end{cases}$

A schematic that lets these two lines interact to perform this comparison may look like:

![Schematic of algorithm determining membership of classification regions ('sated' and 'not sated')](images/Simple_MLP_figure.png)

The last calculation is a simple logical comparison of the regions: a point is 'sated' if it lies in the first region `and` not in the second.

This construction, which is really the XOR function, captures the essential idea behind the most fundamental neural nets: each thresholded line is a _perceptron_, and combining several of them produces a network.
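
The construction above can be sketched in a few lines of NumPy. The output unit's weights (1, -1) and threshold 0.5, which encode 'inside the first region but not the second', are assumptions chosen for illustration rather than values taken from the schematic.

```python
import numpy as np

def heaviside(z):
    """Step function: 1 where z > 0, otherwise 0."""
    return (np.asarray(z) > 0).astype(int)

def sated(x1, x2):
    # First unit: is the point above the lower line, x1 + x2 = 0.5?
    h1 = heaviside(x1 + x2 - 0.5)
    # Second unit: is the point above the upper line, x1 + x2 = 1.2?
    h2 = heaviside(x1 + x2 - 1.2)
    # Output unit: h1 AND NOT h2, itself written as a thresholded sum
    # (assumed weights 1 and -1, assumed threshold 0.5).
    return heaviside(h1 - h2 - 0.5)

corners = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
results = sated(corners[:, 0], corners[:, 1])
print(results)  # [0 1 1 0]
```

Only the two single-meal corners return 1, which is exactly the XOR truth table.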

## Artificial Neural Networks (ANNs)

Artificial Neural Networks (ANNs; often further abbreviated to NNs) are inspired directly by our understanding of brain neurophysiology. An individual _neuron_ is the basic cell unit of our complex central nervous system. Each neuron takes inputs, in the form of electrical signals, and performs a number of simple transforms on these inputs, resulting in a simple output. These outputs are in turn fed as inputs to other neurons.

![Image of a neuron (public domain: https://commons.wikimedia.org/wiki/File:Neuron.jpg)](images/Neuron.jpg)

Artificial neurons are simplified analogs of these biological units, taking in a limited number of signals, performing simple operations on them, before emitting a limited number of output signals. The astounding computational capabilities of this class of algorithm arise from the networks built up using these simple units.

### Anatomy of a Neural Network 

An ANN is composed of _neurons_ (_nodes_) and _layers_. Each node performs the atomic operations of the network, defined by _activation functions_. Groups of nodes may form a layer, a distinct structure representing a stage of the network. Each layer acts like a filter, or function. At least two layers are defined: the _input layer_ and the _output layer_. In addition, there may be one or more layers that are neither input nor output; these are referred to as _hidden layers_:



![Architecture of a typical Artificial Neural Network (ANN). The number of layers determines the _depth_ of the model; the number of neurons in each layer determines the _width_ of the model.](images/ANN_architecture_intro.png)

The purpose of ANNs is to approximate any arbitrary function, say, $f'(x)$. Each layer can be thought of as a successive function $f_i()$ acting on the previous layers. The particular composition of layers and nodes of a neural network is known as the _net architecture_.

In this framework, for the chicken-beef calculation we performed above, each linear comparison was performed within a node (neuron), after being fed inputs $x_1$ and $x_2$. The outputs were fed into the final, output, node. This architecture is a Multi-Layer Perceptron (MLP). It is a particular type of _feed-forward network_, because there are no layers that make use of feedback.

The activation function we chose (fairly organically!) was the Heaviside step function, $H(x)$.

### Activation functions

The purpose of an activation function is to polarize the network (_i.e._ provide directionality), as well as condition the signals propagated throughout (very often regulated to have a limited output range). The most common activation functions are:

 *  **Perceptron** (Heaviside): $\sigma(z) = \begin{cases} 1 & \textrm{for } z > 0 \\ 0 & \textrm{for } z \leq 0 \end{cases}$

 *  **Sigmoid** (logistic): $\sigma(z) = \frac{1}{1 + \exp(-z)}$
 
 * **ReLU** (Rectified Linear Unit): $\sigma(z) = \max{(0, z)}$

 *  **Softmax**: $\sigma(z)_j = \frac{\exp{(z_j)}}{\sum_{k=1}^{K} \exp{(z_k)}}$
 
Their response functions look like the following:

![Response functions of three of the most popular activation functions for neurons in ANNs.](images/activation_function_comparison.png)

The sigmoid (or logistic) activation function is a smoothed version of the step function, and so has nicer analytic properties (it is differentiable everywhere). However, it can be somewhat computationally expensive for large numbers of nodes and layers. The ReLU (Rectified Linear Unit) is a simpler function. Being piece-wise linear, it is still technically non-linear, so it provides network polarity, yet it retains many of the properties of linear functions that make them convenient for approximating other functions. ReLU-based neurons are also much 'faster' to train because of their computational simplicity.

The softmax activation function is an ensemble function, often used for an aggregation step (typically the output layer of a classifier). It also has nice analytic properties: each output is normalized by the sum over the whole ensemble, so the outputs are positive and sum to one. This often favors a 'winner-takes-all' condition.
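
As a rough illustration, the four activation functions listed above can be written directly in NumPy. The function names and the sample input are assumptions for this sketch; subtracting the maximum inside `softmax` is a common numerical safeguard that does not change the result.

```python
import numpy as np

def step(z):
    """Perceptron (Heaviside) activation."""
    return np.where(z > 0, 1.0, 0.0)

def sigmoid(z):
    """Logistic activation, bounded between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """Rectified Linear Unit: passes positive values, zeroes the rest."""
    return np.maximum(0.0, z)

def softmax(z):
    """Normalized exponentials over the whole vector z."""
    shifted = z - np.max(z)  # avoids overflow; the ratio is unchanged
    e = np.exp(shifted)
    return e / np.sum(e)

z_demo = np.array([-2.0, 0.0, 1.0, 3.0])
print(step(z_demo))
print(sigmoid(z_demo))
print(relu(z_demo))
print(softmax(z_demo))
```

The softmax output sums to one, with most of the weight on the largest input.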

So much for the theory! How do we code these things?

### Introduction to Keras and TensorFlow


**TensorFlow**

TensorFlow was developed by the Google Brain team and released as open source, under the Apache 2.0 license, in late 2015. It is a symbolic, high-performance math library with specialized and generalized math objects, particularly _tensors_, a generalization of vectors and matrices to higher rank (hence the name). The mental model for TensorFlow computations is a computational graph, defined by tensors. It is designed to be seamlessly applied to a range of hardware types (including GPGPUs and a specialized ASIC, the TPU, or _Tensor Processing Unit_).

**TensorFlow documentation:** https://www.tensorflow.org/

---
**Keras**

Keras is a high-level API to the neural network libraries CNTK, Theano and TensorFlow. Its high level of abstraction allows rapid prototyping of neural networks, with both convolutional and recurrent network architectures. Its guiding principles are user-friendliness, modularity and easy extensibility. Because it's written in Python, configuration and extension of functionality are relatively seamless within the Python eco-system.

Keras is Greek for 'horn,' a reference to the vision-inducing spirits in the _Odyssey_.

**Keras documentation:** https://keras.io/

### Tensors and TensorFlow

The fundamental unit of computation within TensorFlow is the _tensor_. Tensors are generalizations of vector arithmetic and calculus, allowing linear operations on higher rank objects. These are used to partially define a computation, in the form of a data-flow graph, that will, when executed, produce an output value. TensorFlow constructs a graph based on tensor objects (`tf.Tensor`). This graph is then executed within a TensorFlow session (`tf.Session()`) instance.

We can simply generate a tensor object using `tf.Variable`:

In [5]:
odd_nums = tf.Variable([1, 3, 5, 7, 9, 11])  # Rank 1 tensor is a vector

It has the usual `.dtype` and `.shape` attributes: 

In [6]:
odd_nums.dtype
odd_nums.shape

tf.int32_ref

TensorShape([Dimension(6)])

In [7]:
weird_hypercube = tf.Variable([ [ [43], [121.234] ], [ [987], [2134] ] ], dtype=tf.float64)

However, as mentioned, tensors in TensorFlow are _partially_ computed graph objects:

In [8]:
rank1 = tf.rank(odd_nums)
rank1

<tf.Tensor 'Rank:0' shape=() dtype=int32>

In [9]:
rank2 = tf.rank(weird_hypercube)
rank2

<tf.Tensor 'Rank_1:0' shape=() dtype=int32>

In [10]:
weird_hypercube

<tf.Variable 'Variable_1:0' shape=(2, 2, 1) dtype=float64_ref>

Operations are not run until we have specified that our computation graph is complete. Note that the `tf.rank()` operations did not return an actual rank value. To execute operations on the tensors we just produced, we require a `tf.Session()` connection:

In [12]:
with tf.Session() as sess:
    sess.run(rank1)
    sess.run(rank2)

1

3
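
For contrast, the sketch below builds the same two objects in NumPy, which evaluates each expression immediately instead of adding it to a graph. The `_np` names are assumptions introduced so that the TensorFlow variables above are left untouched.

```python
import numpy as np

odd_nums_np = np.array([1, 3, 5, 7, 9, 11])
weird_hypercube_np = np.array([[[43], [121.234]], [[987], [2134]]],
                              dtype=np.float64)

# .ndim is NumPy's name for the tensor rank; no session is required.
rank1_np = odd_nums_np.ndim
rank2_np = weird_hypercube_np.ndim
print(rank1_np, rank2_np)        # 1 3
print(weird_hypercube_np.shape)  # (2, 2, 1)
```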

Defining the details of your computation this way, and generating the necessary graphs and sessions, requires quite some thought and a lot of boiler-plate code (which is not particularly Pythonic!).

There are so many commonly used operations and architectures that a higher-level API would be very useful! This is where Keras comes in.

## Generating a Multi-Layer Perceptron (MLP) model in Keras

We will now generate a simple Multi-Layer Perceptron (MLP) model to build a classifier of an arbitrary, but complicated, function:

![Figure of the nasty function we wish to approximate with a feed-forward network algorithm (Multi-Layer Perceptron; MLP)](images/Simple_Perceptron_comparison.png)

This somewhat nasty function (OK, it's not that bad in the scheme of things) was generated from the following code:

In [13]:
x = np.arange(0, 100, 0.01)

L = len(x)
y1 = np.zeros(L)
y2 = np.zeros(L)
y3 = np.zeros(L)

# Produce piece-wise indices
x1_Idx = np.where(x < 40)
x2_Idx = np.where((x > 30) & (x < 80))
x3_Idx = np.where(x > 50)

# Generate arbitrary piece-wise functions (vectorized over each index set)
y1[x1_Idx] = 5
y2[x2_Idx] = (x[x2_Idx] - 55)**2
y3[x3_Idx] = (x[x3_Idx] - 75)**4

y2 = 3*y2/np.max(y2)
y3 = 3*y3/np.max(y3)

# Add some noise
n = np.random.randn(L)
n = n/np.max(n)

y_tot = y1+y2+y3+n
y_thresh = np.median(y_tot)
y_out = np.where(y_tot > y_thresh, 1, 0)

We wish to classify this function based on its three components, $y_1$, $y_2$ and $y_3$ (plus the noise term, $n$), comparing their sum to a threshold (here, the median of the total).


We will do this using an MLP feed-forward network in Keras. First, we will have to perform some additional imports of modules.

In [19]:
# Some additional imports
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import RMSprop

In [14]:
batch_size = 128 
num_classes = 2  # Binary classifier
epochs = 7  # Number of rounds of training 

Now we will split the data into training and test sets:

In [15]:
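# Caution: np.random.choice samples WITH replacement by default,
# so some rows are drawn more than once
split_Idx = np.random.choice(range(x.shape[0]), int(0.75*x.shape[0]))

x = np.vstack([y1, y2, y3, n]).T  # Make each variable a feature vector 

x_train = x[split_Idx]
# Caution: ~ on integer indices gives -(i + 1), i.e. negative indices,
# not the complement of split_Idx; the "test" rows can overlap the training rows
x_test = x[~split_Idx]
y_train = y_out[split_Idx]
y_test = y_out[~split_Idx]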

In [16]:
x_train.shape
y_train.shape
x_test.shape
y_test.shape

(7500, 4)

(7500,)

(7500, 4)

(7500,)
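
Note that `np.random.choice` samples with replacement by default, and that `~` applied to an integer index array yields `-(i + 1)` rather than the complementary rows. The two sets above are therefore not guaranteed to be disjoint, which is consistent with both having 7500 rows. A minimal sketch of a disjoint 75/25 split follows; the stand-in data, the fixed seed and the `_alt` names are assumptions for illustration, and the results reported in the rest of this section were obtained with the split above.

```python
import numpy as np

rng_split = np.random.RandomState(0)  # arbitrary fixed seed, for repeatability
n_samples = 10000                     # same length as np.arange(0, 100, 0.01)

# What ~ actually does to integer indices:
negated = ~np.array([0, 1, 2])        # array([-1, -2, -3])

# Shuffle every row index once, then cut the shuffled list in two.
shuffled = rng_split.permutation(n_samples)
n_train = int(0.75 * n_samples)
train_idx = shuffled[:n_train]
test_idx = shuffled[n_train:]

# Stand-in features and labels with the same shapes as in the text.
features = rng_split.randn(n_samples, 4)
labels = (features.sum(axis=1) > 0).astype(int)

x_train_alt, x_test_alt = features[train_idx], features[test_idx]
y_train_alt, y_test_alt = labels[train_idx], labels[test_idx]
print(x_train_alt.shape, x_test_alt.shape)  # (7500, 4) (2500, 4)
```

With this split, no row appears in both sets, so the test score measures performance on genuinely unseen data.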

In [17]:
# convert class vectors to binary class matrices
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)
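
To see what this conversion produces, the following sketch reproduces the one-hot encoding for a handful of labels with plain NumPy. It is an illustrative equivalent, not the Keras implementation, and the `demo_` names are assumptions.

```python
import numpy as np

demo_labels = np.array([0, 1, 1, 0])
n_classes_demo = 2

# Row i of the identity matrix is the one-hot vector for class i.
demo_one_hot = np.eye(n_classes_demo)[demo_labels]
print(demo_one_hot)
# [[1. 0.]
#  [0. 1.]
#  [0. 1.]
#  [1. 0.]]
```

Each label becomes a row with a single 1 in the column of its class, which is the form expected by a two-unit softmax output layer.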

Keras has two types of model: `Sequential` and the functional `Model`. 
Each model type shares the following attributes:

* `.layers`: the layers of the model
* `.inputs`: the input tensors of the model
* `.outputs`: the output tensors

They also have a `.summary()` method, giving a summary of the model.

Time to call our first model! Instantiate an instance of the `Sequential()` model class:

In [20]:
model0 = Sequential()

Define the initial (input) layer:

In [21]:
model0.add(Dense(16, activation='relu', input_shape=(4,)))

Define the final (output) layer:

In [22]:
model0.add(Dense(num_classes, activation='softmax'))

In [23]:
model0.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                80        
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 34        
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________
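
The parameter counts in the summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. The helper name `dense_params` below is hypothetical, used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

first_layer = dense_params(4, 16)   # 4*16 + 16 = 80
output_layer = dense_params(16, 2)  # 16*2 + 2  = 34
total_params = first_layer + output_layer
print(first_layer, output_layer, total_params)  # 80 34 114
```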


We have thus defined our model's architecture. We must now define _how_ the model determines it has found a suitable approximation.

### Cost and loss functions

The important objective functions for ANNs are referred to as _cost_ or _loss_ functions. These functional measures of error are the metrics with which we determine the success of our algorithm. The goal of machine learning algorithms is to optimize such functions. The former term (_cost function_) is often reserved for the error over the entire training set, in which case the _loss function_ is defined as the error for a single training example.

We are already familiar with one of the most common cost functions, namely the Mean Squared Error (MSE). This is the routine used to obtain an Ordinary Least Squares (OLS) linear regression fit. Explicitly, we minimize the function:

$J(x) = \frac{1}{2n}\sum_{j} [f'_j(x) - f_j(x)]^2$

Where $f'(x)$ denotes the function to be approximated and $f(x)$ represents the functional form used by the algorithm (here, the activation function used in each layer). Here, the sum is taken over each node, labeled with $j$. 

Another cost function, popularly used for binary classification, is the _cross-entropy_ (or Bernoulli negative log-likelihood or Binary Cross-Entropy):

$J(x) = -\sum_{j}[f'_j(x)\log_e{(f_j(x))} + (1 - f'_j(x))\log_e{(1 - f_j(x))}]$

One problem with using the MSE cost function with a sigmoid activation function is that learning can be very slow when the errors are large, because the gradient of the sigmoid vanishes where it saturates. This is not the case for cross-entropy, hence it is widely used.

There are a number of other cost functions, such as the Mean Absolute Error (MAE):

$J(x) = \frac{1}{n}\sum_{j} |f'_j(x) - f_j(x)|$
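
The three cost functions can be sketched in NumPy as follows, with `target` playing the role of $f'$ and `predicted` the role of $f$, and with the logarithm taken of the model output. The function names, the clipping constant `eps` and the sample values are assumptions for illustration.

```python
import numpy as np

def mse(target, predicted):
    """Mean squared error with the 1/(2n) convention used above."""
    return np.sum((target - predicted) ** 2) / (2 * len(target))

def mae(target, predicted):
    """Mean absolute error."""
    return np.mean(np.abs(target - predicted))

def binary_cross_entropy(target, predicted, eps=1e-12):
    """Bernoulli negative log-likelihood, summed over the samples."""
    p = np.clip(predicted, eps, 1 - eps)  # keeps log() finite at 0 and 1
    return -np.sum(target * np.log(p) + (1 - target) * np.log(1 - p))

target = np.array([1.0, 0.0, 1.0, 0.0])
good_guess = np.array([0.9, 0.1, 0.8, 0.2])
bad_guess = np.array([0.1, 0.9, 0.2, 0.8])

print(mse(target, good_guess), mae(target, good_guess))
print(binary_cross_entropy(target, good_guess))
print(binary_cross_entropy(target, bad_guess))
```

The cross-entropy penalizes the confidently wrong predictions far more heavily than the nearly correct ones.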

### Optimizers

In practice, we do not usually have access to the analytic form of the cost or loss functions, and hence do not have an explicit expression for the optimal parameter values. We then have to rely on iterative optimization schemes. Probably the best known are the gradient-based methods, which 'descend' to the optimal point by repeatedly stepping against the gradient (this is known as _gradient descent_); the Newton-Raphson family refines this idea by also using second-derivative information.

Specifically, perhaps the most widely used optimizers for ANNs are based on _Stochastic Gradient Descent_ (SGD). The goal here is to minimize the cost function, $J(x)$; in other words, to find the point where $\nabla J(x) = 0$.

(Note that here we have adopted the gradient operator, $\nabla$ ('nabla'); although we have written the above as a function of a single variable, for which $\nabla f(x) := \partial f(x)/\partial x$, in general $\nabla$ collects the partial derivatives with respect to all variables in the space.)

As an iterative process, we update the weights $w^{(r)}$, beginning from the initial $w^{(0)}$. We wish to find $\nabla J(x)$; this may be well approximated by computing the gradient over a small random sample of the training inputs (a _mini-batch_), $j \in m$. In other words, we assume $\nabla J(x) \approx \nabla J_m(x)$.

The $r$th iteration of a weight is hence updated according to: $w^{(r)} = w^{(r-1)} - \eta\nabla J_m(w)$. The constant parameter, $\eta > 0 $, is known as the _Learning Rate_, as it determines the rate at which the weights are updated.
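
The update rule can be sketched for the simplest possible model, a linear fit with a squared-error cost. The synthetic data, the learning rate, the mini-batch size and all variable names are assumptions chosen for illustration; this is not how Keras implements its optimizers internally.

```python
import numpy as np

rng_sgd = np.random.RandomState(42)

# Synthetic data: target = 3*x1 - 2*x2 + 1, plus a little noise.
X_sgd = rng_sgd.randn(1000, 2)
true_w = np.array([3.0, -2.0])
true_b = 1.0
y_sgd = X_sgd @ true_w + true_b + 0.01 * rng_sgd.randn(1000)

w = np.zeros(2)  # w^(0)
b = 0.0
eta = 0.1        # learning rate
m = 32           # mini-batch size

for r in range(500):
    # Draw a random mini-batch of training inputs.
    batch = rng_sgd.choice(len(X_sgd), size=m, replace=False)
    residual = X_sgd[batch] @ w + b - y_sgd[batch]
    # Gradient of the mini-batch cost (1/2m) * sum(residual^2).
    grad_w = X_sgd[batch].T @ residual / m
    grad_b = residual.mean()
    # w^(r) = w^(r-1) - eta * grad J_m(w)
    w = w - eta * grad_w
    b = b - eta * grad_b

print(w, b)
```

After a few hundred updates the weights settle close to the values used to generate the data, even though each step saw only 32 of the 1000 samples.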

To help with numerical stability, the concept of _momentum_ has been introduced. This reduces oscillations, and overshoot of the minimum, by adding a term proportional to the previous weight update. Another useful parameter within enhancements to SGD optimization is the _decay_ parameter, which gradually reduces the learning rate as training proceeds.

Even more sophisticated variants of SGD automatically adapt the learning rate for each parameter, depending on the history of its gradients (Adagrad, Adadelta). We will use here an optimizer often recommended as a good all-purpose default, `RMSprop()`, which is closely related to Adadelta. In effect, this means that the learning rate needs far less manual tuning, since the step size for each parameter is adjusted automatically.

For more details on variants of optimizers used in ANNs, I recommend Sebastian Ruder's _An overview of gradient descent optimization algorithms_:
 
http://ruder.io/optimizing-gradient-descent/index.html

In [34]:
keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False)

<tensorflow.python.keras.optimizers.SGD at 0x7f89dcca9c18>

### Tracking model performance: metrics

It is vital to quantify how our models perform. Keras makes it simple to track a number of off-the-shelf metrics, which are not used to update or train the model, but may elucidate its behavior. These range from the simple `accuracy` (the fraction of predictions that match the actual 'ground truth' labels) to the mean absolute error (`mae`) or `categorical_accuracy`.

Within Keras, it is a simple matter to define the loss and optimizer functions, and performance metric to track for our MLP model. These are specified at the `compile` stage of the computation: 

In [24]:
model0.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

We can now train the model:

In [25]:
history0 = model0.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

Train on 7500 samples, validate on 7500 samples
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7


Evaluate this model:

In [26]:
score0 = model0.evaluate(x_test, y_test, verbose=0)
print('Test loss: {0},      Test accuracy: {1}'.format(score0[0], score0[1]))

Test loss: 0.09144360537131628,      Test accuracy: 0.9846666666666667


We have trained a model on an arbitrary function, with an apparent accuracy of 98.5%!

### Adding hidden layers

We may add a hidden layer by simply repeating most of the initial part:

In [27]:
model1 = Sequential()

# 1st (Input) layer
model1.add(Dense(16, activation='relu', input_shape=(4,)))

Add the 2nd (hidden) layer. This is an internalization of features that are not seen externally to the model:

In [28]:
model1.add(Dense(16, activation='relu'))

And retain the remainder (the output layer):

In [29]:
# 3rd (Output) layer
model1.add(Dense(num_classes, activation='softmax'))

model1.summary()

model1.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history1 = model1.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

score1 = model1.evaluate(x_test, y_test, verbose=0)

print('Test loss: {0},      Test accuracy: {1}'.format(score1[0], score1[1]))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 16)                80        
_________________________________________________________________
dense_3 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 34        
Total params: 386
Trainable params: 386
Non-trainable params: 0
_________________________________________________________________
Train on 7500 samples, validate on 7500 samples
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Test loss: 0.04642097859084606,      Test accuracy: 0.9972


We can see that adding a layer decreased the validation (test) loss dramatically, and improved the accuracy.

Adding layers increases the network _depth_. As the number of hidden layers increases, the network becomes deeper; this is what is referred to as _Deep Learning_.

### Network dropout

Overfitting is an ever-present issue with machine learning models. One means of reducing overfitting is to induce _network dropout_. This involves setting a randomly selected fraction of a layer's inputs to zero at each update during training. This is simply done in Keras, by setting the `rate` parameter:

In [28]:
keras.layers.Dropout(0.2)  # Induces a 20% drop-out rate

<tensorflow.python.keras.layers.core.Dropout at 0x7f89f158de80>

We can add drop-out to the first two layers of our MLP:

In [30]:
model2 = Sequential()
# 1st (Input) layer
model2.add(Dense(16, activation='relu', input_shape=(4,)))
model2.add(Dropout(0.1))

# 2nd (Hidden) layer
model2.add(Dense(16, activation='relu'))
model2.add(Dropout(0.1))

# 3rd (Output) layer
model2.add(Dense(num_classes, activation='softmax'))

model2.summary()

model2.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history2 = model2.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

score2 = model2.evaluate(x_test, y_test, verbose=0)

print('Test loss: {0},      Test accuracy: {1}'.format(score2[0], score2[1]))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_5 (Dense)              (None, 16)                80        
_________________________________________________________________
dropout (Dropout)            (None, 16)                0         
_________________________________________________________________
dense_6 (Dense)              (None, 16)                272       
_________________________________________________________________
dropout_1 (Dropout)          (None, 16)                0         
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 34        
Total params: 386
Trainable params: 386
Non-trainable params: 0
_________________________________________________________________
Train on 7500 samples, validate on 7500 samples
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 7/7
Test loss: 0.0496397633542
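
What a dropout layer does to its inputs during training can be sketched in NumPy. This is an illustrative version of the usual 'inverted dropout' scheme, in which the surviving values are rescaled so that the expected sum is unchanged; the function name, seed and array sizes are assumptions.

```python
import numpy as np

def dropout_sketch(activations, rate, rng):
    """Zero a random fraction `rate` of the values; rescale the rest."""
    keep_prob = 1.0 - rate
    mask = rng.uniform(size=activations.shape) < keep_prob
    return activations * mask / keep_prob

rng_drop = np.random.RandomState(1)
activations = np.ones((1000, 16))  # stand-in for a layer's outputs
dropped = dropout_sketch(activations, rate=0.2, rng=rng_drop)

zero_fraction = np.mean(dropped == 0)
print(zero_fraction)   # close to 0.2
print(dropped.mean())  # close to 1.0, thanks to the rescaling
```

At evaluation time no values are dropped, which is why dropout affects only the training phase.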

### Increasing the width of the model

Note the width on the above models is 16 neurons. This is not many! Let's increase this to 512 for the non-output layers:

In [31]:
# Base model
model3 = Sequential()
# 1st (Input) layer
model3.add(Dense(512, activation='relu', input_shape=(4,)))
model3.add(Dropout(0.2))  # Increased the drop-out rate

# 2nd (Hidden) layer
model3.add(Dense(512, activation='relu'))
model3.add(Dropout(0.2))

# 3rd (Output) layer
model3.add(Dense(num_classes, activation='softmax'))

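# Note: summary() prints the table itself and returns None,
# so this line also prints "Model 3 summary: None"
print("Model 3 summary: {0}".format(model3.summary()))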

model3.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history3 = model3.fit(x_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))

score3 = model3.evaluate(x_test, y_test, verbose=0)

print('Test loss: {0},      Test accuracy: {1}'.format(score3[0], score3[1]))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_8 (Dense)              (None, 512)               2560      
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 512)               262656    
_________________________________________________________________
dropout_3 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 1026      
Total params: 266,242
Trainable params: 266,242
Non-trainable params: 0
_________________________________________________________________
Model 3 summary: None
Train on 7500 samples, validate on 7500 samples
Epoch 1/7
Epoch 2/7
Epoch 3/7
Epoch 4/7
Epoch 5/7
Epoch 6/7
Epoch 

So increasing the model's depth improved the validation accuracy and reduced the error.

It is interesting to see how the training and validation errors of each of these models improve with each epoch:

![Comparison of the four models as a function of epoch. Note the decrease in error rate with epoch is rapid in the early stages of training.](images/Perceptron_training_comparisons.png)

### An overview of network architectures

A great pictorial summary of the various architectures may be found in Fjodor van Veen's article, "The Neural Network Zoo" (http://www.asimovinstitute.org/neural-network-zoo/). I reproduce it here without permission; however I highly recommend reading the whole article:

![The Neural Network Zoo (credit: Fjodor van Veen)](images/Asimov_neuralnetworks_architectures.png)

### Persisting Keras models for future use

Once you have a model you are satisfied with, you may save and distribute it by serialization. The Keras documentation **does not recommend pickling models**. However there are a number of other methods for common file formats.

We can save the entire model (_i.e._ architecture and weights) to Hierarchical Data Format (HDF5):

In [32]:
model0.save('model0_saved.h5')
del model0
model0.summary()

NameError: name 'model0' is not defined

In [33]:
# Check the files were created
print(list(pathlib.Path().glob('*.h5')))

[PosixPath('model0_saved.h5')]


Re-constituting the model is simple:

In [34]:
from tensorflow.keras.models import load_model

loaded_model0 = load_model('model0_saved.h5')
loaded_model0.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 16)                80        
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 34        
Total params: 114
Trainable params: 114
Non-trainable params: 0
_________________________________________________________________


Serialization of the model architecture to JSON:

In [35]:
model1_json = model1.to_json()
with open("model1.json", "w") as model_file:
    model_file.write(model1_json)

1700

In [50]:
# Check the files were created
print(list(pathlib.Path().glob('*.json')))

[PosixPath('model0.json'), PosixPath('model1.json')]


Serialization of the weights to Hierarchical Data Format (HDF5):

In [36]:
model1.save_weights("model1_weights.h5")

In [49]:
# Check the files were created
print(list(pathlib.Path().glob('*.h5')))

[PosixPath('model0_saved.h5'), PosixPath('model1_weights.h5')]


Loading the model from the resulting JSON file:

In [37]:
with open('model1.json', 'r') as model_file:
    model1_loaded = model_file.read()
    
from tensorflow.keras.models import model_from_json
loaded_model = model_from_json(model1_loaded)
loaded_model.summary()
loaded_model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])  # Compile the graph, without training the model 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 16)                80        
_________________________________________________________________
dense_3 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 34        
Total params: 386
Trainable params: 386
Non-trainable params: 0
_________________________________________________________________


Load the weights from HDF5:

In [51]:
loaded_model.load_weights("model1_weights.h5")  
# This restores the weights of the trained model into the re-created architecture

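Now we can evaluate the persisted model. Note that the execution counters suggest the cell below (`In [38]`) was run before the weights were loaded (`In [51]`), which would explain the near-chance accuracy shown; evaluated after `load_weights`, the model should reproduce the performance of the trained `model1`: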

In [38]:
loaded_model.summary()
score = loaded_model.evaluate(x_test, y_test, verbose=0)
print('Test loss: {0},      Test accuracy: {1}'.format(score[0], score[1]))

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 16)                80        
_________________________________________________________________
dense_3 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 34        
Total params: 386
Trainable params: 386
Non-trainable params: 0
_________________________________________________________________
Test loss: 1.4468912439982096,      Test accuracy: 0.4953333333174388


### Comparison between TensorFlow+Keras and PyTorch

PyTorch is a similar platform to TensorFlow. Developed at Facebook, it began as a Python API to the Lua 'Torch' library (which is no longer developed). It has gained in popularity to compete with TensorFlow/Keras.

Although it is similarly based on a tensor representation of a computational network graph, a major difference is that the architecture is inherently _dynamic_, _i.e._ the graph architecture may be altered during training (contrast this with the Keras computational graph, which is statically compiled). This can be great for Recurrent Neural Net (RNN) architectures that have a variable output shape (_e.g._ text generation, where word lengths vary). However, this does mean that the library is lower level than Keras.

One objection to PyTorch, compared to TensorFlow/Keras, was the relative difficulty of deploying the former in production. However, this is currently being improved upon. There is much overlap in functionality and performance between the two frameworks.

The PyTorch project may be found here: https://pytorch.org/

### Appropriate applications of Artificial Neural Networks

Because ANNs are 'universal approximators,' as well as based on a large number of small, simple, units, they are great where:
 * The relationships between variables are poorly understood or analytically complex
 * There is a lot of data
 
They are not so great because:
 * Principal features are not explicitly apparent; decisions are opaque for deep networks
 * They can be slow to train, requiring a number of epochs 

### Additional resources


**Websites:** 

  * Michael A. Nielsen's _Neural Networks and Deep Learning_: http://neuralnetworksanddeeplearning.com/
  * Ian Goodfellow, Yoshua Bengio and Aaron Courville's _Deep Learning_: http://www.deeplearningbook.org

**YouTube channels:**

 * Andrew Ng's _Machine Learning_: https://www.youtube.com/watch?v=PPLop4L2eGk&list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN
 * Sentdex's _Practical Machine Learning with Python_: https://www.youtube.com/watch?v=OGxgnH8y2NM&index=1&list=PLQVvvaa0QuDfKTOs3Keq_kaG2P55YRn5v

**Platforms:** 

 * Kaggle: https://www.kaggle.com/
 * Coursera (Andrew Ng again): https://www.coursera.org/learn/machine-learning

## Conclusion

This was a very brief introduction to the field of artificial neural networks (ANNs) and Deep Learning! 

We have examined the theoretical justification for ANNs, demonstrating that they are great 'universal approximators'. We also covered their use-cases and some of their pitfalls.

We also had a brief introduction to TensorFlow and Keras. We built a feed-forward network (a Multi-Layer Perceptron; MLP) to approximate complex functions. We 'tweaked' this model, improving the output, and evaluated its performance. In order to determine this, we also covered the concepts of appropriate activation functions, optimizers and cost functions.