# Chapter 10: Introduction to Artificial Neural Networks

Artificial Neural Networks (ANNs) are a family of machine learning models inspired by the networks of neurons in our own brains. They are useful for tackling complex machine learning tasks.

## From Biological to Artificial Neurons

The first artificial neural network was proposed in 1943 by Warren McCulloch and Walter Pitts. In their paper ["A logical calculus of the ideas immanent in nervous activity"](https://link.springer.com/article/10.1007/BF02478259), they presented a computational model loosely based on our own brain. Until the 1980s, there was not much follow-up work on neural networks. Large neural networks only became practical to train once computers, particularly GPUs, had become powerful enough.

### Biological Neurons

Biological neurons are cells with many branching extensions called <i>dendrites</i> and one very long extension called an <i>axon</i>. The axon splits off into many branches called <i>telodendria</i>, whose tips have small structures called <i>synaptic terminals</i>. These synaptic terminals connect to other neurons, and short electrical impulses called <i>signals</i> are passed across them. When a neuron receives enough signals within a few milliseconds, it fires its own signals.

### Logical Computations with Neurons

Warren McCulloch and Walter Pitts proposed a simple model of the biological neuron, which became known as an <i>artificial neuron</i>. Each artificial neuron has one or more binary inputs and one binary output, which is either inactive or active. If enough of a neuron's inputs are active, it activates its output. Below is a graph representation of some examples.

<img src="https://drive.google.com/uc?export=view&id=1hGflagjs6QtHxt87AMyEfPq0COIXlzhG" width="500px">

### The Perceptron

<i>The Perceptron</i> is one of the simplest ANN architectures. It is based on a slightly different type of artificial neuron called a <i>linear threshold unit</i> (LTU). The inputs and output are numbers (instead of binary on/off values), and each input is associated with a weight. The LTU computes the weighted sum of its inputs, i.e.,

$$ z = w_1\,x_1 + w_2\,x_2\; + \;...\; + \;w_n\,x_n = \mathbf{w}^{\,T} \cdot \mathbf{x}.$$

The LTU then applies a <i>step function</i> to that sum and outputs the result. The most common step function is the <i>Heaviside step function</i>, given by

$$ \text{heaviside}\,(z) = \left\{ \begin{matrix}
0 && \text{if}\; z < 0 \\
1 && \text{if}\; z \geq 0
\end{matrix} \right. $$

Sometimes the sign function is used, given by

$$ \text{sgn}\,(z) = \left\{ \begin{matrix}
-1 && \text{if}\; z < 0 \\
0 && \text{if}\; z = 0 \\
+1 && \text{if}\; z > 0
\end{matrix} \right. $$

A single LTU can be used for linear binary classification, just like a Logistic Regression classifier. Training a single LTU means finding the optimal weight vector, $\mathbf{w}$. A Perceptron is a single layer of LTUs, with each LTU connected to every input node, and one input node for each feature.
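The weighted sum and the two step functions defined above can be sketched in a few lines of NumPy. This is only an illustration: the weights and inputs below are made-up values, and `ltu_output` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def heaviside(z):
    """Heaviside step function: 0 where z < 0, 1 where z >= 0."""
    return np.where(np.asarray(z) >= 0, 1, 0)

def ltu_output(w, x, step=heaviside):
    """Compute the weighted sum z = w^T x, then apply a step function."""
    z = np.dot(w, x)
    return step(z)

w = np.array([0.5, -1.0, 2.0])   # hypothetical weights
x = np.array([1.0, 2.0, 0.5])    # hypothetical input values

z = np.dot(w, x)                 # 0.5 - 2.0 + 1.0 = -0.5
print(ltu_output(w, x))          # Heaviside step of -0.5 -> 0
print(np.sign(z))                # sign function of -0.5 -> -1.0
```

Passing `step=np.sign` to `ltu_output` switches the unit to the sign function.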

Perceptrons are trained using a variant of Hebb's rule (or <i>Hebbian learning</i>) that takes the network's error into account: it strengthens connections which lead to correct predictions and reduces the influence of connections which lead to incorrect predictions. The weights are initialized at zero, then for each training instance, the weights are updated using the rule

$$ w_{i,\,j}^{(\text{next step})} = w_{i,\,j} + \eta \left( y_j - \hat{y}_j \right) x_i $$

where $w_{i,\,j}$ is the weight of the connection between the i<sup>th</sup> input neuron and the j<sup>th</sup> output neuron, $x_i$ is the i<sup>th</sup> input value of the current training instance, $\hat{y}_j$ is the output of the j<sup>th</sup> output neuron and $y_j$ is the target output of the j<sup>th</sup> output neuron for the current training instance, and $\eta$ is the learning rate.

Each output neuron of a Perceptron has a linear decision boundary, so a Perceptron cannot learn patterns that are not linearly separable. However, if the training instances are linearly separable, Frank Rosenblatt showed that this algorithm converges to a solution. This is known as the <i>Perceptron convergence theorem</i>.
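As a minimal sketch of the update rule above, the following NumPy code trains a single LTU on the logical AND function, a toy dataset chosen here because it is linearly separable. The function name `train_perceptron` and the separate bias term `b` are assumptions made for this illustration (the bias plays the role of the weight of a constant input equal to 1).

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, n_epochs=10):
    """Apply w <- w + eta * (y - y_hat) * x, one training instance at a time."""
    n_features = X.shape[1]
    w = np.zeros(n_features)   # weights are initialized at zero
    b = 0.0                    # bias: weight of a constant input equal to 1
    for _ in range(n_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if np.dot(w, x_i) + b >= 0 else 0   # Heaviside step
            error = y_i - y_hat                           # 0 when the prediction is correct
            w += eta * error * x_i
            b += eta * error
    return w, b

# Toy linearly separable data: the logical AND of two binary inputs.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])

w, b = train_perceptron(X, y)
predictions = (X @ w + b >= 0).astype(int)
print(predictions)   # [0 0 0 1]
```

Weights only change when a prediction is wrong, which is why training stops moving once every instance is classified correctly.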

In [0]:
# Scikit-Learn has its own Perceptron implementation.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron

iris = load_iris()
X = iris.data[:, (2,3)] # Petal length, petal width
y = (iris.target == 0) # Iris Setosa?

per_clf = Perceptron(max_iter=10, tol=1e-3)
per_clf.fit(X, y)

per_clf.score(X, y)

1.0

Training a single perceptron is very similar to Stochastic Gradient Descent. In fact, using Scikit-Learn's `Perceptron` is equivalent to using the `SGDClassifier` with the hyperparameters `loss='perceptron'`, `learning_rate='constant'`, `eta0=1` (the learning rate), and `penalty=None` (no regularization).
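The equivalence can be checked with a short comparison on the same Iris features used earlier. This is a sketch under a few assumptions: both models are given the same `random_state` (42 is an arbitrary choice) so that they shuffle the training instances identically, and `max_iter=1000` is used instead of the smaller value in the cell above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Perceptron, SGDClassifier

iris = load_iris()
X = iris.data[:, [2, 3]]   # Petal length, petal width
y = (iris.target == 0)     # Iris Setosa?

per_clf = Perceptron(max_iter=1000, tol=1e-3, random_state=42)

# SGDClassifier configured to behave like a Perceptron.
sgd_clf = SGDClassifier(loss="perceptron", learning_rate="constant", eta0=1.0,
                        penalty=None, max_iter=1000, tol=1e-3, random_state=42)

per_clf.fit(X, y)
sgd_clf.fit(X, y)

# Both classifiers should make the same predictions on the training set.
same_predictions = np.array_equal(per_clf.predict(X), sgd_clf.predict(X))
print(same_predictions)
```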

Unlike Logistic Regression classifiers, Perceptrons do not output the probability that an instance belongs to a class; they only output the predicted class, based on a hard threshold. Perceptrons also cannot correctly classify datasets that are not linearly separable. To handle such datasets, you need an ANN architecture called a <i>Multi-Layer Perceptron</i> (MLP).

### Multi-Layer Perceptron and Backpropagation

An MLP is composed of one input layer, one or more layers of LTUs called <i>hidden layers</i>, and one final layer of LTUs called the <i>output layer</i>. Every layer except the output layer includes a bias neuron. An ANN with two or more hidden layers is called a <i>deep neural network</i>.

In 1986, D. E. Rumelhart et al. published an article introducing the [backpropagation](https://apps.dtic.mil/docs/citations/ADA164453) training algorithm, which today would be described as Gradient Descent using reverse-mode autodiff. For each training instance, backpropagation first computes the output of every layer in the MLP (the forward pass). Next, it measures the network's output error and works backward through the network, finding the error contribution from the last hidden layer, then from the previous layer, and so on. This reverse pass computes the error gradient with respect to every connection weight (hence the name). Finally, the gradients are used in a Gradient Descent step to update the weights.
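The forward pass, reverse pass, and Gradient Descent step can be sketched for a tiny network in NumPy. Everything specific here is an assumption chosen for illustration rather than the setup of the original article: one hidden layer of 3 neurons, logistic activations, a half squared error loss, the XOR dataset, and the helper names `forward`, `loss_fn`, and `backward`.

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, params):
    """Forward pass: compute the output of every layer."""
    W1, b1, W2, b2 = params
    h = logistic(X @ W1 + b1)        # hidden layer outputs
    y_hat = logistic(h @ W2 + b2)    # output layer outputs
    return h, y_hat

def loss_fn(y_hat, y):
    """Network error: half the squared error, averaged over the instances."""
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def backward(X, y, params):
    """Reverse pass: propagate the error gradient from the output layer backward."""
    W1, b1, W2, b2 = params
    m = X.shape[0]
    h, y_hat = forward(X, params)
    # Error signal at the output layer; the logistic derivative is s * (1 - s).
    delta_out = (y_hat - y) * y_hat * (1 - y_hat) / m
    # Error signal at the hidden layer, derived from the layer after it.
    delta_hidden = (delta_out @ W2.T) * h * (1 - h)
    # Gradients in the same order as params: W1, b1, W2, b2.
    return [X.T @ delta_hidden, delta_hidden.sum(axis=0),
            h.T @ delta_out, delta_out.sum(axis=0)]

# XOR: a toy dataset that is not linearly separable.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(42)
params = [rng.normal(size=(2, 3)), np.zeros(3),
          rng.normal(size=(3, 1)), np.zeros(1)]

initial_loss = loss_fn(forward(X, params)[1], y)
eta = 0.5   # learning rate
for _ in range(200):
    grads = backward(X, y, params)
    params = [p - eta * g for p, g in zip(params, grads)]   # Gradient Descent step
final_loss = loss_fn(forward(X, params)[1], y)
print(initial_loss, final_loss)
```

A useful sanity check for any hand-written reverse pass is to compare one of its gradients against a finite-difference estimate of the loss.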

In order to use Gradient Descent, we have to replace the step function in the LTUs with an activation function which is differentiable. The authors replaced the step function with the logistic function, given by

$$ \sigma(z) = \left( 1 + \exp(-z) \right)^{-1} $$

Another common activation function is the hyperbolic tangent function, given by

$$ \tanh(z) = 2\sigma(2z) - 1 $$

which outputs values in the range $(-1, 1)$, so its outputs are centered around 0, which often helps speed up convergence. A third common activation function is the ReLU function, given by

$$ \text{ReLU}\,(z) = \max(0, z). $$

The ReLU function is not differentiable at $z=0$, but it still works well in practice.
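The three activation functions are short enough to write directly in NumPy. The sketch below also checks numerically that the hyperbolic tangent is a rescaled logistic function, $\tanh(z) = 2\sigma(2z) - 1$; the function names and the sample points are choices made for this illustration.

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function, with outputs in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    """ReLU: 0 for negative inputs, identity for positive inputs."""
    return np.maximum(0, z)

z = np.linspace(-5, 5, 11)   # sample points from -5 to 5

# tanh is a shifted and rescaled logistic function.
tanh_identity_holds = np.allclose(np.tanh(z), 2 * logistic(2 * z) - 1)
print(tanh_identity_holds)                  # True

print(relu(np.array([-1.0, 0.0, 2.0])))     # [0. 0. 2.]
```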

An MLP is often used for classification, with each output corresponding to a different binary class. For multiclass classification, where the classes are mutually exclusive, the output layer is modified by replacing the individual activation functions with a shared <i>softmax</i> function, given by

$$ \sigma(\mathbf{z})_i = \frac{e^{z_i}}{\sum\limits_{j\,=\,1}^K e^{z_j}} $$

where $K$ is the dimension of $\mathbf{z}$. In this case, each output neuron outputs the estimated probability that the instance belongs to a particular class. This architecture is an example of a <i>feedforward neural network</i> (FNN).
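A minimal NumPy sketch of the softmax function follows. Subtracting the maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because the same factor cancels in the numerator and denominator. The logits are made-up values.

```python
import numpy as np

def softmax(z):
    """Softmax over a 1D vector of scores."""
    shifted = z - np.max(z)      # avoids overflow in exp; does not change the result
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

logits = np.array([2.0, 1.0, 0.1])   # hypothetical output-layer scores
probs = softmax(logits)

print(probs)                 # one estimated probability per class
print(np.argmax(probs))      # index of the most likely class -> 0
```

The outputs are all positive and sum to 1, which is what lets them be read as class probabilities.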

## Training an MLP with TensorFlow's High-Level API

This section trains a DNN to classify the MNIST dataset. It is based on the code example [here](https://github.com/ageron/handson-ml/blob/master/10_introduction_to_artificial_neural_networks.ipynb).

In [0]:
# Importing and preparing the data.

import tensorflow as tf

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.mnist.load_data()

X_train = X_train.astype(np.float32).reshape(-1, 28*28) / 255.0
X_test = X_test.astype(np.float32).reshape(-1, 28*28) / 255.0
y_train = y_train.astype(np.int32)
y_test = y_test.astype(np.int32)
X_valid, X_train = X_train[:5000], X_train[5000:]
y_valid, y_train = y_train[:5000], y_train[5000:]

In [0]:
# Training the DNNClassifier.

feature_cols = [tf.feature_column.numeric_column('X', shape=[28 * 28])]
dnn_clf = tf.estimator.DNNClassifier(
  hidden_units=[300, 100],
  feature_columns=feature_cols,
  n_classes=10)

input_fn = tf.estimator.inputs.numpy_input_fn(
  x={'X': X_train},
  y=y_train,
  num_epochs=40,
  batch_size=50,
  shuffle=True)
dnn_clf.train(input_fn=input_fn)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_model_dir': '/tmp/tmptk8yrurf', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': None, '_save_checkpoints_secs': 600, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 100, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fc636d376a0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
To construct input pipelines, use the `tf.data` module.

<tensorflow_estimator.python.estimator.canned.dnn.DNNClassifier at 0x7fc636d37048>

In [0]:
# Evaluating the results on the test set.

test_input_fn = tf.estimator.inputs.numpy_input_fn(
  x={'X': X_test},
  y=y_test,
  shuffle=False)
eval_results = dnn_clf.evaluate(input_fn=test_input_fn)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-05-07T00:28:41Z
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmptk8yrurf/model.ckpt-44000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-05-07-00:28:42
INFO:tensorflow:Saving dict for global step 44000: accuracy = 0.9795, average_loss = 0.09805168, global_step = 44000, loss = 12.411606
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 44000: /tmp/tmptk8yrurf/model.ckpt-44000


In [0]:
# Here we see the model achieves nearly 98% accuracy on the test set.
eval_results

{'accuracy': 0.9795,
 'average_loss': 0.09805168,
 'global_step': 44000,
 'loss': 12.411606}

In [0]:
# Examining an instance of a prediction made by the DNN model.

y_pred = list(dnn_clf.predict(input_fn=test_input_fn))
y_pred[0]

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmptk8yrurf/model.ckpt-44000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


{'class_ids': array([7]),
 'classes': array([b'7'], dtype=object),
 'logits': array([ -2.171634  ,  -2.2013094 ,   2.377666  ,   5.0798316 ,
         -7.3936715 ,  -7.928346  , -19.821218  ,  20.246193  ,
          0.13360949,   5.228965  ], dtype=float32),
 'probabilities': array([1.8367960e-10, 1.7830908e-10, 1.7369899e-08, 2.5901980e-07,
        9.9119562e-13, 5.8070118e-13, 3.9713967e-18, 9.9999940e-01,
        1.8416872e-09, 3.0067719e-07], dtype=float32)}

Under the hood, the `DNNClassifier` builds the hidden layers from neurons that use the ReLU activation function and adds a softmax output layer. It minimizes the cross entropy cost function, given by

$$ J(\Theta) = - \frac{1}{m} \sum\limits_{i=1}^m \sum\limits_{k=1}^K \, y_k^{(i)} \log\left( \hat{p}_k^{(i)} \right) $$

where $y_k^{(i)}$ is 1 if the i<sup>th</sup> training instance belongs to the k<sup>th</sup> class and 0 otherwise, and $\hat{p}^{(i)}_k$ is the model's estimated probability that the i<sup>th</sup> training instance belongs to the k<sup>th</sup> class.
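As a worked example of the cost function, the sketch below evaluates it for two made-up training instances and three classes. The one-hot targets and predicted probabilities are hypothetical values chosen for the illustration, and `cross_entropy` is not a library function.

```python
import numpy as np

def cross_entropy(y_onehot, p_hat):
    """Mean cross entropy: -(1/m) * sum_i sum_k y_k^(i) * log(p_hat_k^(i))."""
    m = y_onehot.shape[0]
    return -np.sum(y_onehot * np.log(p_hat)) / m

# Two instances, three classes. Row i of y_onehot marks the true class of instance i.
y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0]])
p_hat = np.array([[0.7, 0.2, 0.1],     # hypothetical predicted probabilities
                  [0.1, 0.8, 0.1]])

loss = cross_entropy(y_onehot, p_hat)
print(loss)   # -(log(0.7) + log(0.8)) / 2, roughly 0.29
```

Only the probability assigned to the true class of each instance contributes to the sum, so the cost is small when those probabilities are close to 1.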

## Training a DNN Using Plain TensorFlow

In [0]:
# Specifying the number of inputs and the number of neurons in each layer.

n_inputs = 28 ** 2
n_hidden1 = 300
n_hidden2 = 100
n_outputs = 10

In [0]:
# Defining placeholder nodes to input the training set.

X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')

In [0]:
# Create a function for defining each layer of the neural network. Each layer
# starts with an n_inputs * n_neurons weight matrix drawn from a truncated
# normal distribution with a standard deviation of
# 2 / sqrt(n_inputs + n_neurons). Using this initialization helps the
# algorithm converge faster.

def neuron_layer(X, n_neurons, name, activation=lambda x: x):
  with tf.name_scope(name):
    n_inputs = int(X.get_shape()[1])
    stddev = 2 / np.sqrt(n_inputs + n_neurons)
    init = tf.truncated_normal((n_inputs, n_neurons), stddev=stddev)
    W = tf.Variable(init, name='kernel') # Weight matrix, i.e. the kernel
    b = tf.Variable(tf.zeros([n_neurons]), name='bias')
    Z = tf.matmul(X, W) + b
    return activation(Z)

In [0]:
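# Now create the neural network. It initializes 2 hidden layers of neurons
# and an output layer. It holds off on applying softmax to the output layer:
# the raw logits are passed to the loss function, which handles softmax itself.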

with tf.name_scope('dnn'):
  hidden1 = neuron_layer(X, n_hidden1, name='hidden1', activation=tf.nn.relu)
  hidden2 = neuron_layer(hidden1, n_hidden2, name='hidden2',
                         activation=tf.nn.relu)
  logits = neuron_layer(hidden2, n_outputs, name='outputs')

In [0]:
# Redefining the same neural network using TensorFlow's built-in
# tf.layers.dense function instead of the custom neuron_layer function.

with tf.name_scope('dnn'):
  hidden1 = tf.layers.dense(X, n_hidden1, name='hidden1',
                            activation=tf.nn.relu)
  hidden2 = tf.layers.dense(hidden1, n_hidden2, name='hidden2',
                            activation=tf.nn.relu)
  logits = tf.layers.dense(hidden2, n_outputs, name='outputs')

In [0]:
# Resetting the graph.

tf.reset_default_graph()
X = tf.placeholder(tf.float32, shape=(None, n_inputs), name='X')
y = tf.placeholder(tf.int32, shape=(None), name='y')
with tf.name_scope('dnn'):
  hidden1 = neuron_layer(X, n_hidden1, name='hidden1', activation=tf.nn.relu)
  hidden2 = neuron_layer(hidden1, n_hidden2, name='hidden2',
                         activation=tf.nn.relu)
  logits = neuron_layer(hidden2, n_outputs, name='outputs')

In [0]:
# Defining the cost function; in this case it is the cross-entropy function.

with tf.name_scope('loss'):
  xentropy = tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y,
                                                            logits=logits)
  loss = tf.reduce_mean(xentropy, name='loss')

In [0]:
# Defining the training operation.

learning_rate = 0.01

with tf.name_scope('train'):
  optimizer = tf.train.GradientDescentOptimizer(learning_rate)
  training_op = optimizer.minimize(loss)

In [0]:
# Defining an eval step to measure the model accuracy to evaluate the
# model performance.

with tf.name_scope('eval'):
  correct = tf.nn.in_top_k(logits, y, 1)
  accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

In [0]:
# Creating a saver to save the model, as well as logging metrics for
# TensorBoard. Also creating the initializer.

from datetime import datetime

init = tf.global_variables_initializer()

now = datetime.utcnow().strftime('%Y%m%d%H%M%S')
root_logdir = 'tf_logs'
logdir = '{}/run-{}/'.format(root_logdir, now)

with tf.name_scope('saver'):
  saver = tf.train.Saver()
  loss_summary = tf.summary.scalar('Loss', loss)
  accuracy_summary = tf.summary.scalar('Accuracy', accuracy)
  file_writer = tf.summary.FileWriter(logdir, tf.get_default_graph())

### Execution Phase

In [0]:
# First load the MNIST dataset.

from tensorflow.examples.tutorials.mnist import input_data

mnist = input_data.read_data_sets('/tmp/data/')

Extracting /tmp/data/train-images-idx3-ubyte.gz
Extracting /tmp/data/train-labels-idx1-ubyte.gz
Extracting /tmp/data/t10k-images-idx3-ubyte.gz
Extracting /tmp/data/t10k-labels-idx1-ubyte.gz


In [0]:
# Define the number of epochs and the batch sizes.

n_epochs = 40
batch_size = 50

In [0]:
# Training the model.

with tf.Session() as sess:
  init.run()
  for epoch in range(n_epochs):
    for it in range(mnist.train.num_examples // batch_size):
      step = (epoch * (mnist.train.num_examples // batch_size)) + it
      X_batch, y_batch = mnist.train.next_batch(batch_size)
      sess.run(training_op, feed_dict={X: X_batch, y: y_batch})
      loss_summary_str = loss_summary.eval(feed_dict={X: X_batch, y: y_batch})
      acc_summary_str = \
        accuracy_summary.eval(feed_dict={X: X_batch, y: y_batch})
      file_writer.add_summary(loss_summary_str, step)
      file_writer.add_summary(acc_summary_str, step)
  save_path = saver.save(sess, './my_model_final.ckpt')

In [0]:
# Closing the file writer so that all pending summaries are written to disk.
file_writer.close()

In [0]:
# Downloading ngrok, used here to reach TensorBoard from a remote notebook.
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

In [0]:
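# Starting TensorBoard in the background on port 6006, then opening an ngrok
# tunnel to that port.
get_ipython().system_raw(
  'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'.format(root_logdir))
get_ipython().system_raw('./ngrok http 6006 &')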

In [0]:
# Querying the local ngrok API for the public URL of the tunnel.
! curl -s http://localhost:4040/api/tunnels | python3 -c \
  "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

https://0a46a8bd.ngrok.io


### Using the Neural Network

In [0]:
# Restoring the model, evaluating its accuracy on the test set, and printing
# its predictions

with tf.Session() as sess:
  saver.restore(sess, './my_model_final.ckpt')
  print('Test Set Accuracy:', accuracy.eval(feed_dict={X: X_test, y: y_test}))
  X_sample = X_test[:20]
  Z = logits.eval(feed_dict={X: X_sample})
  y_pred = np.argmax(Z, axis=1)
  print('Actual:   ', y_test[:20])
  print('Predicted:', y_pred)

INFO:tensorflow:Restoring parameters from ./my_model_final.ckpt
Test Set Accuracy: 0.9773
Actual:    [7 2 1 0 4 1 4 9 5 9 0 6 9 0 1 5 9 7 3 4]
Predicted: [7 2 1 0 4 1 4 9 6 9 0 6 9 0 1 5 9 7 3 4]


## Fine-Tuning Neural Network Hyperparameters

Neural networks are very flexible models, but this flexibility is also a drawback: there are many hyperparameters to tweak. In addition to being able to use any <i>network topology</i> (how neurons are connected), you can also tweak the number of layers and the number of neurons in each layer.

Because there are so many hyperparameters, Grid Search is rarely a feasible way to select them when training models on large datasets, since it can only explore a small part of the hyperparameter space in a reasonable amount of time. [Randomized search](http://www.jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf) is more reasonable. Another option is to use a tool like [Oscar](http://oscar.calldesk.ai).
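To make the difference in cost concrete, here is a small sketch that compares the size of a full grid with a fixed budget of randomly sampled configurations. The search space, its hyperparameter names, and the helpers `sample_configs` and `grid_size` are all hypothetical; in a real search, each sampled configuration would be used to train a model and score it on a validation set.

```python
import random

# Hypothetical search space; the names and candidate values are illustrative only.
search_space = {
    "n_hidden_layers": [1, 2, 3, 4, 5],
    "n_neurons": [10, 30, 100, 300],
    "learning_rate": [0.001, 0.003, 0.01, 0.03, 0.1],
    "batch_size": [10, 50, 100, 500],
}

def sample_configs(space, n_iter, seed=42):
    """Draw n_iter configurations, picking each hyperparameter value at random."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_iter)]

def grid_size(space):
    """Number of configurations an exhaustive grid search would have to train."""
    size = 1
    for values in space.values():
        size *= len(values)
    return size

configs = sample_configs(search_space, n_iter=10)
print(grid_size(search_space))   # 5 * 4 * 5 * 4 = 400
print(len(configs))              # 10
```

The training budget of a randomized search is set by `n_iter` alone, whereas the grid grows multiplicatively with every hyperparameter added.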

### Number of Hidden Layers

It is actually possible to model even complex, nonlinear functions using an MLP with a single hidden layer, given enough neurons. However, deeper neural networks are able to model the same functions with fewer neurons. Also, deeper neural networks are better at recognizing hierarchical structures.

You can normally start with just a few hidden layers and then add layers until you start to overfit the dataset. For complex tasks, other neural network architectures allow you to use more hidden layers, or you can reuse parts of an already trained neural network.

### Number of Neurons per Hidden Layer

The number of input and output nodes is defined by the problem, but you have flexibility in choosing how many neurons are in each hidden layer. A common practice is to shape the network like a funnel, where hidden layers closer to the input have a higher number of neurons and later layers have gradually fewer and fewer. This lets the network learn many low-level features and gradually combine them into fewer high-level features.

In practice, people often use the same number of neurons in every hidden layer, which reduces the search space for hyperparameter tuning. You can increase the number of neurons per layer until you start overfitting. In general, you get more performance by adding more hidden layers than by adding neurons to each layer.

### Activation Functions

In most cases, the ReLU function is a good default activation function for the hidden layers. It is fast to compute, and unlike the hyperbolic tangent function, it does not saturate for large input values.

For the output layer of a classifier, it is generally best to use softmax (when the classes are mutually exclusive), which gives the estimated probability that the instance belongs to each class. For regression, the output layer does not use an activation function.