# Artificial Neural Networks

As we have seen, Perceptrons are only capable of solving *linearly separable* problems.
To overcome this limitation we can connect Perceptrons together into a network,
an approach popularised by Rumelhart, McClelland & Hinton in the 1980s.
Each Perceptron becomes a *Node* in the network, and the nodes are organised into *Layers*.
In a standard Artificial Neural Network (ANN) architecture there is one *input* layer, one *output* layer and one or more *hidden* layers.
The name *input* layer is a bit misleading: it doesn't actually do any computation, it simply holds the inputs to the network.

![ANN](resources/ann.png "ANN Image")

The outputs of each hidden layer become the inputs to the next hidden layer, or to the final output layer.
Hidden nodes tend to learn different aspects of the problem space,
building up more complex decision boundaries, so the network is able to solve more complex problems.

**Note:** The number of nodes in the input layer *must* equal the number of inputs/features in the data.
The number of output nodes usually equals the number of labels/classes in the data,
although a two-class problem (such as XOR below) needs only a single output node.
The number of hidden layers, and the number of nodes in each, is a free choice,
and selecting this architecture is part of building an ANN.
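
As a rough illustration of the note above, the sketch below builds randomly initialised weights for a hypothetical architecture with 4 input features, one hidden layer of 5 nodes and 3 output classes. The helper `init_weights` and the layer sizes are assumptions made for this example only; they are not part of the network used later in this notebook.

```python
import numpy as np

def init_weights(layer_sizes, rng):
    """Return one (weights, biases) pair for each layer after the input layer."""
    params = []
    # Pair each layer size with the size of the layer that follows it
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        weights = rng.uniform(-0.5, 0.5, size=(n_in, n_out))
        biases = rng.uniform(-0.5, 0.5, size=n_out)
        params.append((weights, biases))
    return params

# Hypothetical architecture: 4 features -> 5 hidden nodes -> 3 classes
rng = np.random.default_rng(0)
params = init_weights([4, 5, 3], rng)
shapes = [w.shape for w, _ in params]
print(shapes)  # [(4, 5), (5, 3)]
```

The input layer has no weights of its own, which is why a three-layer description produces only two weight matrices.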

### Differences Between Perceptrons and ANN

Before we look at the algorithm for ANN we need to understand two key differences.

#### 1. Activation Function

Each node needs to output a *real number*, so the step function we used before (which outputs 0 or 1) will not work.
Instead we use a *non-linear* function, like Sigmoid, which 'squashes' the output into a real number between 0 and 1.

**Note:** Other activation functions are also used, such as Tanh and ReLU, but we will stick to Sigmoid.

![Activation-Functions](resources/activation_functions.png "Activation-Functions Image")

We need an activation function that outputs real numbers because:
1. For output nodes, real numbers between 0 and 1 can be considered a **probability** of an input example belonging
to a particular class.
2. Hidden layer nodes need to produce *some* output, even if it is very small,
so that we can calculate the error and update weights using **Backpropagation**.
3. For Backpropagation the activation function needs to be differentiable, so we can calculate the gradient
of the error with respect to the weights for **Gradient Descent**.
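
To make the three points above concrete, here is a small sketch comparing the step function with Sigmoid and its derivative. The function names and sample values are chosen for illustration. Note that `sigmoid_derivative` here takes the weighted sum as its argument, whereas the `sigmoid_deriv` method used later in this notebook takes a value that has already been passed through Sigmoid.

```python
import numpy as np

def step(x):
    # Outputs exactly 0 or 1; its gradient is 0 everywhere except at the jump
    return np.where(x >= 0, 1, 0)

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

sums = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print("step:      ", step(sums))
print("sigmoid:   ", np.round(sigmoid(sums), 3))
print("derivative:", np.round(sigmoid_derivative(sums), 3))
```

The Sigmoid outputs are never exactly 0 or 1, and the derivative is largest at a weighted sum of 0 and shrinks towards 0 as the sum moves away from it.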

#### 2. Backpropagation and Gradient Descent

A Perceptron has only one layer, so we can calculate the error directly from its output and use that to
update the weight values.
But now that we have multiple layers, what should the output of a hidden node be?
What is its error, and how much should we change its weights?

The answer is to *share out the error* from the output nodes to the hidden nodes,
and we do this in *proportion to the 'strength' of the output* that each node produced - hence why we need *some* output.
So we are *propagating* the error from the output nodes back up the network.
This is achieved by calculating the derivative of the error passed back from the following layer with respect to the weights.
We then use a weight update function similar to the one we used with Perceptrons:

$change \, in \, weight = derivative \times input \times learning \, rate$
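
For a single output node, this update can be derived with the chain rule. The derivation below is a sketch that assumes a squared error with a factor of one half (the notebook does not state its error function explicitly), and writes $\sigma'$ for the derivative of Sigmoid:

$E = \frac{1}{2}(target - activation)^2, \quad activation = \sigma(sum), \quad sum = \sum\limits_{i} w_i \times x_i$

$\frac{\partial E}{\partial w_i} = \frac{\partial E}{\partial activation} \times \frac{\partial activation}{\partial sum} \times \frac{\partial sum}{\partial w_i} = -(target - activation) \times \sigma'(sum) \times x_i$

$change \, in \, w_i = -learning \, rate \times \frac{\partial E}{\partial w_i} = learning \, rate \times x_i \times (target - activation) \times \sigma'(sum)$

The minus sign appears because we move *against* the gradient, towards lower error.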

Why do we calculate the derivative of the error function? Because it is what **Gradient Descent** needs, and updating the weights after every single training example, as we do here, is called **Stochastic Gradient Descent**.
We want to *minimise* the error produced by a weight.
By calculating the derivative we get the *gradient*, or the 'steepness', of the error curve at that point (weight value).
On a curve like the one shown below, the larger the gradient the further we are from the minimum error (where the gradient is 0).
Again, the learning rate controls how large a step we take towards the minimum error.
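
As a minimal sketch of the idea, the code below runs gradient descent on a made-up one-weight error curve, $(w - 3)^2$, whose minimum is at $w = 3$. The curve, starting weight and learning rate are assumptions picked for illustration; a real network has one such curve per weight and they all interact.

```python
def error(w):
    # Toy error curve with its minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the toy error curve with respect to w
    return 2.0 * (w - 3.0)

learning_rate = 0.1
w = 0.0
history = [w]
for _ in range(50):
    # Step against the gradient, scaled by the learning rate
    w -= learning_rate * gradient(w)
    history.append(w)

print("final weight:", w)
print("final error: ", error(w))
```

The steps are large at first, where the curve is steep, and become smaller as the weight approaches the minimum and the gradient approaches 0.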

![Gradient-Descent](resources/gradient_descent.png "Gradient-Descent Image")

### ANN - Algorithm

Similar to Perceptrons, ANNs are trained in two 'phases':
the forward pass, where data is input into the network to produce an output,
and the backward pass, where the error in the output is used to update the weights using Backpropagation and Gradient Descent.

1. Set weights to random small values, for example in range [-0.5, 0.5]

2. Set learning rate to a small value, usually less than 0.5

3. For each training example in the dataset (one full pass over the dataset is one 'epoch'):

    // Forward Propagation
    
    A. For each node in the layer and each layer in turn:
    
    Sum inputs multiplied by weights
        
    $sum = \sum\limits_{i=0}^{n} w_i \times x_i$

    Calculate Sigmoid (activation) of the sum

    $activation = \sigma(sum)$
    
    // Backpropagation
    
    B. For each node in the layer and each layer in turn **going backwards**:
        
    Calculate the error and derivative, first the output layer then hidden.
        
    $output \, \epsilon = target \, output - activation$
    
    $output \, \delta = output \, \epsilon \times sigmoid \, derivative(activation)$
    
    $hidden \, layer \, \epsilon = output \, \delta \times output \, weights$
    
    $hidden \, layer \, \delta = hidden \, layer \, \epsilon \times sigmoid \, derivative(hidden \, layer \, activation)$
        
    C. Update all the weights **at the same time**, with learning rate, inputs and gradients:
    
    $change \, in \, weight = learning \, rate \times input \times \delta$
    
4. Repeat from step 3 until the error is as small as possible, or (more likely) for a set number of training epochs.
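
Steps A to C above can be sketched in vectorised form with NumPy. This is an illustrative sketch for a network with one hidden layer, not the implementation used in the rest of this notebook; the function name `train_step`, the array shapes and the initial values are assumptions made for the example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_step(x, target, w_hidden, b_hidden, w_out, b_out, learning_rate):
    """Run one forward and backward pass for one example, updating the arrays in place."""
    # A. Forward propagation: weighted sum, then Sigmoid, layer by layer
    hidden = sigmoid(x @ w_hidden + b_hidden)
    output = sigmoid(hidden @ w_out + b_out)

    # B. Backpropagation: output layer first, then the hidden layer
    output_error = target - output
    output_delta = output_error * output * (1.0 - output)
    hidden_error = output_delta @ w_out.T
    hidden_delta = hidden_error * hidden * (1.0 - hidden)

    # C. Update all weights only after every delta has been calculated
    w_out += learning_rate * np.outer(hidden, output_delta)
    b_out += learning_rate * output_delta
    w_hidden += learning_rate * np.outer(x, hidden_delta)
    b_hidden += learning_rate * hidden_delta
    return output_error

# Hypothetical 2-2-1 network with small random starting weights
rng = np.random.default_rng(0)
w_hidden = rng.uniform(-0.5, 0.5, size=(2, 2))
b_hidden = rng.uniform(-0.5, 0.5, size=2)
w_out = rng.uniform(-0.5, 0.5, size=(2, 1))
b_out = rng.uniform(-0.5, 0.5, size=1)

# Repeatedly train on a single made-up example and record the error
x, target = np.array([1.0, 0.0]), np.array([1.0])
errors = [train_step(x, target, w_hidden, b_hidden, w_out, b_out, 0.5)[0]
          for _ in range(200)]
print("first error:", abs(errors[0]), "last error:", abs(errors[-1]))
```

Calculating every delta before touching any weight is what 'at the same time' means in step C: the hidden layer error must be computed from the output weights as they were during the forward pass.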

### ANN - Solving XOR

As an introduction to the ANN algorithm, and to give you an intuition for how different nodes and layers in the network
learn different aspects of the problem space, we are going to look at how a small network can solve the XOR problem.
Take a look at the following diagram.
The hidden nodes each learn a different logical function (for example AND and OR), and the output node learns to combine them
(for example, firing when OR is true but AND is not), so in combination they have solved XOR!

![ANN-XOR](resources/ann_xor.png "ANN-XOR Image")
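
To see why a combination of simple nodes is enough, the sketch below wires up a tiny network by hand using step-function nodes. The weights are hand-picked for illustration (one hidden node computes OR, the other AND, and the output node fires when OR is on but AND is off); a trained network may well settle on a different, equally valid combination.

```python
import numpy as np

def step(weighted_sum):
    return 1 if weighted_sum >= 0 else 0

def node(inputs, weights, bias):
    # A single Perceptron-style node: weighted sum followed by a step function
    return step(np.dot(inputs, weights) + bias)

def xor_network(x1, x2):
    h_or = node([x1, x2], [1, 1], -0.5)    # fires if at least one input is 1
    h_and = node([x1, x2], [1, 1], -1.5)   # fires only if both inputs are 1
    # Output fires when OR is on and AND is off
    return node([h_or, h_and], [1, -1], -0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_network(x1, x2))
```

Each node on its own draws only a straight line, but the output node draws its line in the space of the hidden nodes' outputs, where XOR has become linearly separable.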

First we will define a NeuralNetwork class that holds the weight variables and functions like train, predict and sigmoid.
Then the training data is defined and we can call the train function, which returns the trained weights
along with the decision boundaries recorded during training.

As it trains you should see the error *decrease* and the accuracy *increase*.
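
The error and accuracy figures are calculated roughly as in the sketch below, which uses made-up raw outputs rather than a trained network. The values in `raw_outputs` are hypothetical.

```python
import numpy as np

targets = np.array([0, 1, 1, 0])               # XOR labels
raw_outputs = np.array([0.1, 0.8, 0.4, 0.2])   # hypothetical Sigmoid outputs

# Mean squared error over the four examples
mean_squared_error = np.mean((targets - raw_outputs) ** 2)

# Round each raw output to the nearest class (0 or 1) and compare with the labels
predicted_classes = np.rint(raw_outputs).astype(int)
accuracy = 100.0 * np.mean(predicted_classes == targets)

print("Error:", round(mean_squared_error, 5), "Accuracy:", accuracy, "%")
```

In this made-up case the third example is rounded to the wrong class, so three of the four predictions are correct.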

In [None]:
# Import some needed modules
from IPython.display import HTML, display
import numpy as np
import pandas as pd
import seaborn as sns;sns.set()
import matplotlib.pyplot as plt
import matplotlib.animation as animation
%matplotlib inline
np.random.seed(3)
 
def generate_decision_boundary(x, pred_func, model):
    """ Generates predictions for each point of a grid. 
    This function has nothing to do with neural networks."""
    # Set min and max values and give it some padding
    x_min, x_max = x[:, 0].min() - .5, x[:, 0].max() + .5
    y_min, y_max = x[:, 1].min() - .5, x[:, 1].max() + .5
    h = 0.01
    # Generate a grid of points with distance h between them
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

    # Predict the function value for the whole grid
    z = pred_func(np.c_[xx.ravel(), yy.ravel()], model)
    z = z.reshape(xx.shape)
    return xx, yy, z

class NeuralNetwork:
    def __init__(self):

        # Set the weights to small random values in the range -1 to 1
        self.hidden1_w1 = np.random.uniform(-1, 1)
        self.hidden1_w2 = np.random.uniform(-1, 1)
        self.hidden1_bw = np.random.uniform(-1, 1)
    
        self.hidden2_w1 = np.random.uniform(-1, 1)
        self.hidden2_w2 = np.random.uniform(-1, 1)
        self.hidden2_bw = np.random.uniform(-1, 1)
    
        self.out_w1 = np.random.uniform(-1, 1)
        self.out_w2 = np.random.uniform(-1, 1)
        self.out_bw = np.random.uniform(-1, 1)
        
        self.model = {'hidden1': [self.hidden1_w1, self.hidden1_w2, self.hidden1_bw],
                 'hidden2': [self.hidden2_w1, self.hidden2_w2, self.hidden2_bw],
                 'out': [self.out_w1, self.out_w2, self.out_bw]}
        
    @staticmethod
    def sigmoid(x):
        return 1 / (1 + np.exp(-x))
    
    @staticmethod
    def sigmoid_deriv(x):
        # x is expected to be a sigmoid *output*, so the derivative is simply x * (1 - x)
        return x * (1 - x)
    
    def train(self, inputs, target_outputs, training_epochs, learning_rate):
        
        # Array to store the decision boundaries as the model trains
        decision_boundary = []
        
        # Each epoch will loop over the training data once
        for epoch in range(training_epochs + 1):
            epoch_error = []
            
            # Loop over all of the input examples
            for i in range(len(inputs)):
                
                """ Forward Pass - propagates input data through the network. """
                # Input layer is just the input data
                input_layer = inputs
                
                # Hidden layer sigmoid(W * X + b)
                hidden1_sum = (input_layer[i][0] * self.hidden1_w1) + (input_layer[i][1] * self.hidden1_w2) + self.hidden1_bw
                hidden1_output = self.sigmoid(hidden1_sum)
                
                hidden2_sum = (input_layer[i][0] * self.hidden2_w1) + (input_layer[i][1] * self.hidden2_w2) + self.hidden2_bw
                hidden2_output = self.sigmoid(hidden2_sum)  
                
                # Output layer sigmoid(W * X + b)
                out_sum = (hidden1_output * self.out_w1) + (hidden2_output * self.out_w2) + self.out_bw
                output = self.sigmoid(out_sum)
                
                """ Backpropagation - propagates the error backwards through the network. """
                # Calculate output error (target output - actual output)
                error = target_outputs[i] - output
                epoch_error.append(error) # Also keep track of total error for this epoch
    
                # Calculate the output node's gradient: error * sigmoid derivative of its output
                out_delta = error * self.sigmoid_deriv(output)
                # The bias input is always 1, so the bias uses the same gradient as the node
                out_bias_delta = out_delta
    
                # Calculate hidden layer errors (from the output layer's weights and gradient)
                hidden1_error = out_delta * self.out_w1
                hidden2_error = out_delta * self.out_w2
    
                # Calculate each hidden node's gradient: error * sigmoid derivative of its output.
                # The weights and bias of a node share this gradient; they differ only in their input.
                hidden1_w1_delta = hidden1_error * self.sigmoid_deriv(hidden1_output)
                hidden1_w2_delta = hidden1_w1_delta
                hidden1_bw_delta = hidden1_w1_delta
    
                hidden2_w1_delta = hidden2_error * self.sigmoid_deriv(hidden2_output)
                hidden2_w2_delta = hidden2_w1_delta
                hidden2_bw_delta = hidden2_w1_delta
                
                """ Update the Weights - update the weights using the error gradients, input and learning rate."""
                # Change in weight = learning rate * layers input * layers gradient
                self.out_w1 += learning_rate * hidden1_output * out_delta
                self.out_w2 += learning_rate * hidden2_output * out_delta
                self.out_bw += learning_rate * out_bias_delta
    
                self.hidden1_w1 += learning_rate * input_layer[i][0] * hidden1_w1_delta
                self.hidden1_w2 += learning_rate * input_layer[i][1] * hidden1_w2_delta
                self.hidden1_bw += learning_rate * hidden1_bw_delta
    
                self.hidden2_w1 += learning_rate * input_layer[i][0] * hidden2_w1_delta
                self.hidden2_w2 += learning_rate * input_layer[i][1] * hidden2_w2_delta
                self.hidden2_bw += learning_rate * hidden2_bw_delta
            
            # Update the model with the weights learned during this epoch,
            # so that the metrics below are calculated from the current weights
            self.model = {'hidden1': [self.hidden1_w1, self.hidden1_w2, self.hidden1_bw],
                          'hidden2': [self.hidden2_w1, self.hidden2_w2, self.hidden2_bw],
                          'out': [self.out_w1, self.out_w2, self.out_bw]}

            # Every 100 epochs, calculate error and accuracy
            if epoch % 100 == 0:
                # Calculate the mean squared error
                mean_error = round(np.square(epoch_error).mean(), 5)

                # Make predictions on the data
                predictions = self.predict(inputs, self.model)
                # Count the number of correct predictions
                correct_predictions = np.count_nonzero(target_outputs == np.rint(predictions))

                # Calculate the accuracy
                accuracy = (100 / len(inputs)) * correct_predictions
                print("Epoch: " + str(epoch) + " Error: " + str(mean_error) + " Accuracy: " + str(accuracy) + "%")

                # Calculate and store decision boundary
                _, _, boundary = generate_decision_boundary(inputs, self.predict, self.model)
                decision_boundary.append({'boundary': boundary, 'epoch': epoch, 'error': mean_error, 'accuracy': accuracy})
    
        return self.model, decision_boundary
    
    def predict(self, x, model):
        """ Generates predictions for the whole network. """
        predictions = []
        
        # Loop over all of the input examples
        for i in range(len(x)):
            # Calculate output
            hidden1_sum = (x[i][0] * model['hidden1'][0]) + (x[i][1] * model['hidden1'][1]) + model['hidden1'][2]
            hidden1_output = self.sigmoid(hidden1_sum)
    
            hidden2_sum = (x[i][0] * model['hidden2'][0]) + (x[i][1] * model['hidden2'][1]) + model['hidden2'][2]
            hidden2_output = self.sigmoid(hidden2_sum)  
    
            out_sum = (hidden1_output * model['out'][0]) + (hidden2_output * model['out'][1]) + model['out'][2]
            output = self.sigmoid(out_sum)
            
            # Store predictions in an array
            predictions.append(output)
        return np.array(predictions)
    
    def node_predict(self, x, node):
        """ Generates predictions for a single node. """
        predictions = []
        
        # Loop over all of the input examples
        for i in range(len(x)):
            # Calculate output
            weight_sum = (x[i][0] * node[0]) + (x[i][1] * node[1]) + node[2]
            # output = self.sigmoid(weight_sum)
            output = 0 if weight_sum < 0 else 1  # Using step function here to make graphs easier to read
            
            # Store predictions in an array
            predictions.append(output)
        return np.array(predictions)

# Training data
train_x = np.array([[0, 0],
                   [0, 1],
                   [1, 0],
                   [1, 1]])

train_y = np.array([0, 1, 1, 0]) # XOR

# Set the learning rate
learning_rate = 1  # The learning rate usually SHOULD NOT be this high! (Why do you think it is?)
# Set the number of training epochs
num_epochs = 1500

# Create ann and call train method
ann = NeuralNetwork()
trained_model, decision_boundaries = ann.train(train_x, train_y, num_epochs, learning_rate)

### ANN - Decision Boundary

Once the model is trained we can use the weights (and a little bit of trickery) to plot the decision boundary for each
node. You should see that each node has learned a different function, or a different aspect of the problem space,
as was shown in the diagram above.
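
For a single node, the boundary is the straight line along which its weighted sum is exactly 0. The sketch below recovers that line from a node's weights. The helper `boundary_line` and the example weights are hypothetical, the `[w1, w2, bias]` ordering follows the model dictionary used above, and the calculation assumes `w2` is not 0.

```python
import numpy as np

def boundary_line(node_weights, x_values):
    """Return the y values where w1 * x + w2 * y + bias = 0 (assumes w2 != 0)."""
    w1, w2, bias = node_weights
    return -(w1 * x_values + bias) / w2

# Hypothetical node weights in [w1, w2, bias] order
example_node = [1.0, 1.0, -0.5]
xs = np.array([0.0, 0.5])
ys = boundary_line(example_node, xs)
print(ys)

# Every point on the line gives a weighted sum of 0
weighted_sums = example_node[0] * xs + example_node[1] * ys + example_node[2]
print(weighted_sums)
```

Points on one side of this line give a positive weighted sum and points on the other side give a negative one, which is what the step function in `node_predict` turns into the two coloured regions.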

In [None]:
# Create decision boundaries for each node in the network
x_points, y_points, h1_pred = generate_decision_boundary(train_x, ann.node_predict, trained_model['hidden1'])
_, _, h2_pred = generate_decision_boundary(train_x, ann.node_predict, trained_model['hidden2'])
_, _, ann_pred = generate_decision_boundary(train_x, ann.predict, trained_model)

# Plot the decision boundaries
figure, ax = plt.subplots(2, 2, figsize=(16, 12))
for axi in ax.ravel():
    axi.set_axis_off()
ax[0, 0].contourf(x_points, y_points, h1_pred, alpha = 0.6, cmap='Spectral')
ax[0, 0].scatter(train_x[:, 0], train_x[:, 1], c=train_y.ravel(), s=50, cmap='RdYlGn')
ax[0, 0].title.set_text('Hidden Node 1')
ax[0, 1].contourf(x_points, y_points, h2_pred, alpha = 0.6, cmap='Spectral')
ax[0, 1].scatter(train_x[:, 0], train_x[:, 1], c=train_y.ravel(), s=50, cmap='RdYlGn')
ax[0, 1].title.set_text('Hidden Node 2')
ax[1, 0].contourf(x_points, y_points, ann_pred, alpha = 0.6, cmap='Spectral')
ax[1, 0].scatter(train_x[:, 0], train_x[:, 1], c=train_y.ravel(), s=50, cmap='RdYlGn')
ax[1, 0].title.set_text('Output Node')

# This function animates the decision boundaries that were saved as the model trained
def animate(i):
    ax[1, 1].clear()
    contour = ax[1, 1].contourf(x_points, y_points, decision_boundaries[i]['boundary'], alpha = 0.6, cmap='Spectral')
    scatter = ax[1, 1].scatter(train_x[:, 0], train_x[:, 1], c=train_y.ravel(), s=50, cmap='RdYlGn')
    epoch, error, acc = decision_boundaries[i]['epoch'], decision_boundaries[i]['error'], decision_boundaries[i]['accuracy']
    ax[1, 1].title.set_text('Epoch: {} Error: {:.4f} Accuracy: {}%'.format(epoch, error, acc))
    ax[1, 1].axis('off')
    return contour, scatter
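# Use a separate name so that the matplotlib.animation module is not overwritten
anim = animation.FuncAnimation(figure, animate, interval=250, repeat_delay=1000, frames=len(decision_boundaries))
plt.tight_layout()
# To save the animation, add `import os` to the imports and uncomment the next line (assumes an 'output' directory exists)
# anim.save(os.path.join('output','ann_xor_decision_boundary.gif'), writer=animation.PillowWriter(fps=5), dpi='figure')
plt.close()
display(HTML(anim.to_jshtml()))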

# Create a table that shows the inputs and outputs of each node
table = pd.DataFrame({'x1': train_x[:, 0], 'x2': train_x[:, 1], 'XOR': train_y,
                   'Hidden 1': ann.node_predict(train_x, trained_model['hidden1']),
                   'Hidden 2': ann.node_predict(train_x, trained_model['hidden2']),
                   'Output': np.rint(ann.predict(train_x, trained_model)).astype(int),
                   'Output (Raw)': ann.predict(train_x, trained_model)})

table

