# Implementation of a simple Artificial Neural Network

- Network of 3 layers, with 2 neurons in input layer, 3 neurons in hidden layer, 2 neurons in output layer
- Implementation of Backpropagation algorithm with Gradient Descent
- Learning optimisation by using Matrices and Vectors to represent data in the network

## Optimisation using Matrices and Vectors
Instead of training the network by updating each individual weight and bias seperately, we can optimise the learning time required by representing our data in matrices and vectors. This will allow the network to simultaneously update the weights and biases in each layer. As the number of layers and neurons increases, the optimisation of the training time becomes more evident when using matrices and vectors

- In Matrix/Vector form, the sum of the weighted input Z<sup>l</sup> becomes:

### Z<sup>l</sup> = W<sup>l</sup>a<sup>l-1</sup> + b<sup>l</sup>

<u>Weight Matrix:</u> <b>W<sup>l</sup> between layer l and l-1 contains a ROW for each neuron in layer l-1, and a COLUMN for each neuron in layer l</b>

For example, our network is (2,3,2) - weight matrix for connections between hidden layer and input layer will have 2 rows and 3 columns (2x3)

weight matrix for connections between output layer and hidden layer will have 3 rows and 2 columns (3x2)


<u>Activation Vector:</u> <b>a<sup>l-1</sup> is a vector which contains one entry for each neuron in layer l-1. These entries are the activation of the neuron

<u>Bias Vector:</u> <b>b<sup>l</sup> is a vector which contains one entry for each neuron in layer l. These entries are the bias for that neuron

<u>Sum of the Weighted Input Vector Z<sup>l</sup>:</u> Z<sup>l</sup> is a vector which contains one entry for each neuron in layer l. These entries are the sum of the weighted inputs, plus the bias

- The Activation a<sup>l</sup> is then an activation function applied to EVERY ENTRY in the vector Z<sup>l</sup>:

### a<sup>l</sup> =  σ(z<sup>l</sup>)

## Why initialise NN weights as random numbers?

During forward propagation, each unit in the hidden layer gets the sum of the input multipled by the corresponding weight. If all of these weights were initialised to the same value (e.g. zero or one) then each neuron would get exactly the same signal. 

For example, if all weights were initialized to 1, each neuron gets a signal equal to sum of inputs (and outputs sigmoid(sum(inputs))). If all weights are 0, which is even worse, every neuron will get a zero signal. No matter what was the input - if all weights are the same, all neurons in the hidden layer will be the same too.

This is the main issue with symmetry and reason why you should initialize weights randomly (or, at least, with different values). 

<b>If all weights start with equal values and if the solution requires that unequal weights be developed, the system can never learn.</b>

This is because error is propagated back through the weights in proportion to the values of the weights. This means that all hidden units connected directly to the output units will get identical error signals, and, since the weight changes depend on the error signals, the weights from those units to the output units must always be the same. The system is starting out at a kind of unstable equilibrium point that keeps the weights equal, but it is higher than some neighboring points on the error surface, and once it moves away to one of these points, it will never return. We counteract this problem by starting the system with small random weights. Under these conditions symmetry problems of this kind do not arise.

As to why between 0 and 1. The inputs to our neural network need to be normalized to a common range so they can be compared sensibly.  Consider the confusion if one input ranged between 300 and 700 and another ranged between 200 and 85000, and the value of both was 450. If we don't normalize the two ranges, it's difficult to determine which input is more severe.

For weight initialisations, its common to use small random Gaussians with mean 0 and variance 1

## Python implementation of NN

- Sigmoid Activation func.
- SSE for Cost function
- Learning Rate of 0.5
- Random initial weights

Forwards Pass
- Evaluate the performance of the network with its current values for Weights/Biases and input/output train patterns

Backwards Pass
- Find derivative of Cost function with respect to Weights/Biases and update values to reduce cost (Grad. Descent)
- Superscript 'L' denotes OUTPUT LAYER

$$COST(y, a^L) = \sum_{i=1}^n \frac{1}{2}(y - a^L)^2$$


OUTPUT LAYER
\begin{equation*}
\frac{\partial COST(y, a^L)}{\partial W^L} =  \frac{\partial COST(y, a^L)}{\partial a^L} * \frac{\partial a^L}{\partial z^L} * \frac{\partial z^L}{\partial W^L} 
\end{equation*}

\begin{equation*}
= (a^L - y) * a^{L}(1-a^{L}) * a^{L-1}
\end{equation*}




HIDDEN LAYER
\begin{equation*}
\frac{\partial COST(y, a^L)}{\partial W^{L-1}} =  \frac{\partial COST(y, a^L)}{\partial a^L} * \frac{\partial a^L}{\partial z^L} * \frac{\partial z^L}{\partial a^{L-1}}  * \frac{\partial a^{L-1}}{\partial z^{L-1}} * \frac{\partial z^{L-1}}{\partial W^{L-1}} 
\end{equation*}


\begin{equation*}
= (a^L - y) * a^{L}(1-a^{L}) * W^L * a^{L-1}(1-a^{L-1}) * a^{L-2}
\end{equation*}

In [25]:
import numpy as np

def sigmoid(x):
    return 1.0/(1+ np.exp(-x))

def sigmoid_derivative(x):
    return x * (1.0 - x)


class NeuralNetwork:
    def __init__(self, x, y):
        self.input = x
        self.y = y
    
        self.lRate = 0.5
        
        # Initialise weight matrices np.random.rand chooses random numbers between 0 and 1
        self.weights1 = np.random.rand(self.input.shape[1],3)
        self.bias1 = [0,0,0]
        
        self.weights2 = np.random.rand(3,2)
        self.bias2 = [0,0]
        
        # Print out initial weights and biases
        print("initial weights between input and hidden:" + "\n")
        print(str(self.weights1) + "\n")
        print("initial biases for hidden layer:" + "\n")
        print(str(self.bias1) + "\n")
        
        
        
        print("initial weights between hidden and output:" + "\n")
        print(str(self.weights2) + "\n")
        print("initial biases for output layer:" + "\n")
        print(str(self.bias2) + "\n")
   
    
        # Network output should have same shape as our 'y' output so we can calc. cost
        self.output = np.zeros(y.shape)
        
        
               
        
    # FORWARD PASS: Evaluating network by calculating activations at each layer (using SIGMOID activation function)
    def feedforward(self):
        
        self.layer1 = sigmoid(np.dot(self.input, self.weights1)+self.bias1)
        self.output = sigmoid(np.dot(self.layer1, self.weights2)+self.bias2)
        
        
    # BACKWARD PASS: Updating values of Weights and Biases by using Gradient Descent techniques
    def backpropagation(self):
            
        # chain rule to find derivative of the cost function with respect to weights2/bias2 and weights1/bias1
        d_weights2 = np.dot(self.layer1.T, ((self.output - self.y) * sigmoid_derivative(self.output)))
        d_bias2 =((self.output - self.y) * sigmoid_derivative(self.output))
        
        d_weights1 = np.dot(self.input.T,  (np.dot((self.output - self.y) * sigmoid_derivative(self.output), self.weights2.T) * sigmoid_derivative(self.layer1)))
        d_bias1 = (np.dot((self.output - self.y) * sigmoid_derivative(self.output), self.weights2.T) * sigmoid_derivative(self.layer1))
            
        # update weights
        self.weights1 -= self.lRate * d_weights1
        self.weights2 -= self.lRate * d_weights2
            
        # update biases
        self.bias1 = self.bias1 - (self.lRate * d_bias1)
        self.bias2 = self.bias2 - (self.lRate * d_bias2)
            
            
if __name__ == "__main__":
    X = np.array([[0.05,0.1],
                  [0.07,0.11],
                  [0.02,0.2],
                  [0.01,0.15]])
    
    y = np.array([[0.01,0.99],[0.03, 0.8],[0.04, 0.7],[0.06, 0.75]])
    nn = NeuralNetwork(X,y)
    iterations = 150000

    for i in range(iterations):
        nn.feedforward()
        nn.backpropagation()

    print('---------- Network Train Performance ----------')  
    print("Target output: " + "\n" + str(y) + "\n\n" + "Network output: " + "\n" + str(nn.output))
    

initial weights between input and hidden:

[[ 0.65466267  0.34384017  0.08110898]
 [ 0.52952613  0.18642881  0.48497102]]

initial biases for hidden layer:

[0, 0, 0]

initial weights between hidden and output:

[[ 0.03518685  0.29357741]
 [ 0.56166218  0.04892028]
 [ 0.04586889  0.24855736]]

initial biases for output layer:

[0, 0]

---------- Network Train Performance ----------
Target output: 
[[ 0.01  0.99]
 [ 0.03  0.8 ]
 [ 0.04  0.7 ]
 [ 0.06  0.75]]

Network output: 
[[ 0.00999997  0.98999995]
 [ 0.03        0.8       ]
 [ 0.04        0.7       ]
 [ 0.06        0.75      ]]
