# Comparing a Simple Neural Network Made from Scratch with One Constructed with Keras

In this notebook we build a neural network from scratch and train it on a dataset of only four samples. The small dataset makes it easy to break down the structure and inner workings of the model, and to see how forward propagation and backpropagation are implemented in practice. We then create a similar neural network using Keras for comparison.

In [33]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense 
from keras import optimizers

Create some training inputs, X, and a target output, y. X is an array containing four training samples, each with three features. y is an array containing the desired output for each of the four training samples.

In [34]:
# create input and desired output (target)
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
y = np.array([[0], [1], [1], [0]])

In this neural network we will employ the sigmoid activation function, which will be used directly in forward propagation. The derivative of the sigmoid activation is required for backpropagation.

In [35]:
# define the sigmoid function
def sigmoid(x):
    return 1/(1 + np.exp(-x))

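# define the derivative of the sigmoid function
# note: x must be the sigmoid output a = sigmoid(z), not z itself,
# because sigma'(z) = sigma(z)*(1 - sigma(z)) = a*(1 - a)
def sigmoid_derivative(x):
    return x*(1 - x)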
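
As a quick sanity check, the sketch below compares the analytic derivative with a central finite-difference approximation. The grid `z`, the step `eps` and the name `max_diff` are illustrative assumptions rather than part of the original notebook; the point is only that `sigmoid_derivative` gives the right answer when it is fed the activation $\sigma(z)$ rather than $z$.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(a):
    # expects the activation a = sigmoid(z), not z itself
    return a * (1 - a)

# illustrative grid of pre-activations (an assumption, not from the notebook)
z = np.linspace(-3.0, 3.0, 7)
a = sigmoid(z)

# central finite-difference approximation of d(sigmoid)/dz
eps = 1e-6
numerical = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid_derivative(a)

# largest disagreement between the two; should be tiny
max_diff = np.max(np.abs(numerical - analytic))
print(max_diff)
```

Calling `sigmoid_derivative(z)` directly on the pre-activations would give a different, incorrect result, which is why the backpropagation code passes the layer outputs to it.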

In forward propagation the hidden layer activations are given by $\mathbf{a}^2 = \sigma(w^2\mathbf{x} + \mathbf{b}^2) = \sigma(\mathbf{z}^2)$, where $\mathbf{a}^2$ is the vector of hidden layer activations, $w^2$ is the matrix of weights applied to the inputs, and $\mathbf{b}^2$ is the vector of biases used to compute the hidden layer activations. Similarly, the output activations are given by $\mathbf{a}^3 = \sigma(w^3\mathbf{a}^2 + \mathbf{b}^3) = \sigma(\mathbf{z}^3)$. In this neural network we will not use biases, i.e., $\mathbf{b}^2 = \mathbf{b}^3 = 0$. In the code the samples are stacked as the rows of X, so the same computation is written as np.dot(X, weights2), where the array weights2 plays the role of the transpose of $w^2$ (and likewise weights3 for $w^3$).

In [36]:
# define the feedforward function
def feedforward(X, weights2, weights3):
    # multiply the inputs by their weights, sum together and pass to the sigmoid function
    layer2 = sigmoid(np.dot(X, weights2))
    # multiply the hidden layer activation by their weights, sum together and pass to the sigmoid function
    output = sigmoid(np.dot(layer2, weights3))
    return layer2, output    
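
The sketch below shows that the batch (row-wise) form used in `feedforward` agrees with the column-vector form $\sigma(w^2\mathbf{x})$ of the equation for a single sample. The seeded generator and the names `w2`, `batch_hidden` and `single_hidden` are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])

# assumption: a fixed seed so the example is reproducible
rng = np.random.default_rng(0)
weights2 = rng.random((3, 4))

# batch form used in the code: one row of activations per sample
batch_hidden = sigmoid(np.dot(X, weights2))

# column-vector form of the equation for the fourth sample, with w2 = weights2.T
x = X[3]
w2 = weights2.T
single_hidden = sigmoid(np.dot(w2, x))

print(np.allclose(batch_hidden[3], single_hidden))  # True
```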

Now we randomly initialise the weights matrices, ensuring that their dimensions permit matrix multiplication with the inputs and hidden layer, respectively.

In [37]:
# initialise the weights matrices between the input and hidden layers, and the hidden and output layers 
weights2 = np.random.rand(X.shape[1], 4)
weights3 = np.random.rand(4, 1)
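
The dimension requirement can be checked directly. The following is a minimal sketch in which `n_hidden` and `bad_weights` are hypothetical names introduced for illustration; it confirms the shapes produced at each layer and shows that NumPy rejects a weights matrix whose first dimension does not match the number of input features.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
n_hidden = 4  # hypothetical name for the hidden layer width

weights2 = np.random.rand(X.shape[1], n_hidden)  # (3, 4): features x hidden neurons
weights3 = np.random.rand(n_hidden, 1)           # (4, 1): hidden neurons x outputs

hidden_shape = np.dot(X, weights2).shape
output_shape = np.dot(np.dot(X, weights2), weights3).shape
print(hidden_shape, output_shape)  # (4, 4) (4, 1)

# a weights matrix with the wrong first dimension cannot be multiplied with X
bad_weights = np.random.rand(4, 4)
try:
    np.dot(X, bad_weights)
    mismatch_detected = False
except ValueError:
    mismatch_detected = True
print(mismatch_detected)  # True
```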

Let's see exactly how the forward propagation function works:

In [38]:
print(X)
print('\n')
print(weights2)
print('\n')
print(np.dot(X, weights2))
print('\n')
print(sigmoid(np.dot(X, weights2)))

[[0 0 1]
 [0 1 1]
 [1 0 1]
 [1 1 1]]


[[0.34194905 0.24479932 0.04305208 0.6387264 ]
 [0.8247908  0.28574849 0.61266972 0.8960801 ]
 [0.70629426 0.92819626 0.89986665 0.93751877]]


[[0.70629426 0.92819626 0.89986665 0.93751877]
 [1.53108506 1.21394475 1.51253637 1.83359887]
 [1.04824331 1.17299558 0.94291872 1.57624517]
 [1.87303412 1.45874407 1.55558845 2.47232527]]


[[0.66958181 0.7167092  0.7109221  0.71859819]
 [0.82216502 0.77099618 0.81943679 0.8621899 ]
 [0.74043742 0.76368605 0.71968885 0.82867209]
 [0.86680896 0.81134051 0.82571941 0.9221788 ]]


Each column of the weights matrix contains the weights feeding one hidden neuron, and each row corresponds to one input feature. For example, the first column contains the three weights that connect the three input features to the first hidden neuron, the second column contains the weights for the second hidden neuron, and so on.

The matrix np.dot(X, weights2) has as its rows the $z^2_i$ corresponding to the activations $a^2_i$ for each sample. For example, the first row has as its elements $z^2_1$, $z^2_2$, $z^2_3$ and $z^2_4$ for the first sample in X. The second row has as its elements $z^2_1$, $z^2_2$, $z^2_3$ and $z^2_4$ for the second sample in X, and so on.

sigmoid(np.dot(X, weights2)) has as its rows the hidden layer activations $a^2_1$, $a^2_2$, $a^2_3$ and $a^2_4$ for each training sample. For example, the first row contains the four hidden layer activations corresponding to the first training sample, the second row contains the four hidden layer activations corresponding to the second training sample in X, and so on.
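
A single entry of the product can be reproduced by hand. The sketch below copies the weights from the printed output above and recomputes $z^2_1$ for the second sample; the variable names `sample`, `neuron` and `z` are illustrative.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]])

# weights copied from the printed output above
weights2 = np.array([[0.34194905, 0.24479932, 0.04305208, 0.6387264],
                     [0.8247908, 0.28574849, 0.61266972, 0.8960801],
                     [0.70629426, 0.92819626, 0.89986665, 0.93751877]])

sample = X[1]            # second training sample: [0, 1, 1]
neuron = weights2[:, 0]  # first column: weights into the first hidden neuron

# weighted sum of the sample's features for that neuron
z = np.sum(sample * neuron)
print(z)  # approximately 1.53108506, the first entry of the second row printed above

# the same value appears in the full matrix product
print(np.isclose(z, np.dot(X, weights2)[1, 0]))  # True
```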

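For a single training sample we can define a squared-error cost function, proportional to the Mean Squared Error (MSE), as $C = \frac{1}{2}[y - \sigma(\mathbf{z}^3)]^2$. The code sums this quantity over the four training samples.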

In order to undertake gradient descent, we need to calculate the partial derivatives of $C$ with respect to $w^2$ and $w^3$. Applying the chain rule at the output layer:
$$\frac{\partial C}{\partial w^3} = \frac{\partial C}{\partial \sigma}\frac{\partial \sigma}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial w^3} = -[y - \sigma(\mathbf{z}^3)]\sigma'(\mathbf{z}^3)\mathbf{a}^2$$
The minus sign comes from $\partial C/\partial \sigma = -[y - \sigma(\mathbf{z}^3)]$.

To compute $\partial C/\partial w^2$ we extend the chain rule back through the hidden layer:
$$\frac{\partial C}{\partial w^2} = \frac{\partial C}{\partial \sigma}\frac{\partial \sigma}{\partial \mathbf{z}^3}\frac{\partial \mathbf{z}^3}{\partial \mathbf{a}^2}\frac{\partial \mathbf{a}^2}{\partial \mathbf{z}^2}\frac{\partial \mathbf{z}^2}{\partial w^2}$$
where $\partial \mathbf{z}^3/\partial \mathbf{a}^2 = w^3$, $\partial \mathbf{a}^2/\partial \mathbf{z}^2 = \sigma'(\mathbf{z}^2)$ and $\partial \mathbf{z}^2/\partial w^2 = \mathbf{a}^1$, with $\mathbf{a}^1 = \mathbf{x}$ the input.
Therefore, $$\frac{\partial C}{\partial w^2} = -\mathbf{a}^1[y -\sigma(\mathbf{z}^3)]\sigma'(\mathbf{z}^3)w^3\sigma'(\mathbf{z}^2)$$
Note that the code drops the leading minus sign: it computes $-\partial C/\partial w^3$ and $-\partial C/\partial w^2$, so adding these quantities to the weights is a gradient descent step.

This process of finding the partial derivatives of $C$ is called backpropagation.

In [39]:
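def backpropagation(X, y, layer2, output, weights2, weights3):
    # weights2 is not needed here; it is kept so the call signature stays unchanged
    # error signal at the output layer: (y - output)*sigma'(z3)
    # note: the quantities returned are the negative gradients -dC/dw
    delta3 = (y - output)*sigmoid_derivative(output)
    # (negative) derivative with respect to the hidden-to-output weights
    d_weights3 = np.dot(layer2.T, delta3)
    # (negative) derivative with respect to the input-to-hidden weights
    d_weights2 = np.dot(X.T, np.dot(delta3, weights3.T)*sigmoid_derivative(layer2))
    return d_weights2, d_weights3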
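
One common way to gain confidence in a backpropagation routine is a numerical gradient check. The sketch below is an illustration under stated assumptions, not part of the original notebook: the helpers `cost` and `numerical_gradient`, the seeded weights `w2` and `w3`, and the step `eps` are all hypothetical. It takes the cost to be $\frac{1}{2}\sum[y - \sigma(\mathbf{z}^3)]^2$ summed over the four samples, and confirms that the expressions used in `backpropagation` equal the negative of the finite-difference gradient.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def cost(X, y, w2, w3):
    # assumption: half the squared error, summed over all samples
    output = sigmoid(sigmoid(X @ w2) @ w3)
    return 0.5 * np.sum((y - output) ** 2)

def numerical_gradient(f, w, eps=1e-6):
    # central differences, perturbing one entry of w at a time (in place)
    grad = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        original = w[idx]
        w[idx] = original + eps
        plus = f()
        w[idx] = original - eps
        minus = f()
        w[idx] = original
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
rng = np.random.default_rng(0)  # assumption: fixed seed for reproducibility
w2 = rng.random((3, 4))
w3 = rng.random((4, 1))

# same expressions as in the backpropagation function
layer2 = sigmoid(X @ w2)
output = sigmoid(layer2 @ w3)
delta3 = (y - output) * output * (1 - output)
d_w3 = layer2.T @ delta3
d_w2 = X.T @ ((delta3 @ w3.T) * layer2 * (1 - layer2))

num_w3 = numerical_gradient(lambda: cost(X, y, w2, w3), w3)
num_w2 = numerical_gradient(lambda: cost(X, y, w2, w3), w2)

# backpropagation returns the NEGATIVE gradient, hence the sign flip
print(np.allclose(d_w3, -num_w3, atol=1e-7))  # True
print(np.allclose(d_w2, -num_w2, atol=1e-7))  # True
```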

Once we have calculated the partial derivatives $\partial C/\partial w^2$ and $\partial C/\partial w^3$ we simultaneously update $w^2$ and $w^3$ according to:
$$w^i \rightarrow w^i - \eta\frac{\partial C}{\partial w^i},\;\;\;\;\;\;\mbox{for}\;\; i = 2, 3$$
where $\eta$ is the learning rate (below we will use $\eta = 1$).

We begin training the neural network by forward propagating the inputs through the network to generate the output. Then we backpropagate to find the partial derivatives of the cost function and use these to update the weights. This process is repeated many times as the weights are optimised via gradient descent.

In [40]:
# run feedforward and backpropagation over 2000 iterations
for i in range(2000):
    layer2, output = feedforward(X, weights2, weights3)
    d_weights2, d_weights3 = backpropagation(X, y, layer2, output, weights2, weights3)
    
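    # gradient descent with learning rate 1: backpropagation returns the negative
    # gradients (built from y - output), so the update adds them to the weights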
    weights2 += d_weights2
    weights3 += d_weights3

print(output)

[[0.0130081 ]
 [0.96259263]
 [0.96336681]
 [0.04577575]]
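
To confirm that the loop is actually descending the cost, one could record $C$ at every iteration. The following is a self-contained sketch with an explicit learning rate `eta`; the seeded initial weights and the `costs` list are assumptions added for illustration, so the numbers will differ from the run shown above.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)  # assumption: fixed seed for reproducibility
weights2 = rng.random((3, 4))
weights3 = rng.random((4, 1))
eta = 1.0  # learning rate, as in the text

costs = []
for i in range(2000):
    # forward pass
    layer2 = sigmoid(np.dot(X, weights2))
    output = sigmoid(np.dot(layer2, weights3))
    # cost summed over the four samples, recorded before the update
    costs.append(0.5 * np.sum((y - output) ** 2))

    # negative gradients, as in the backpropagation function
    delta3 = (y - output) * output * (1 - output)
    d_weights3 = np.dot(layer2.T, delta3)
    d_weights2 = np.dot(X.T, np.dot(delta3, weights3.T) * layer2 * (1 - layer2))

    # w -> w - eta * dC/dw, written as an addition of the negative gradient
    weights2 += eta * d_weights2
    weights3 += eta * d_weights3

print(costs[0], costs[-1])
```

Plotting `costs` against the iteration number would show how quickly the network settles for a given initialisation.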


We can easily create a similar neural network using Keras. To build a linear stack of layers, we instantiate a Sequential model and add layers to it. The first layer we add (the hidden layer) has four neurons and takes the three input features of each sample. The second layer is the output layer, with a single neuron. A dense layer is a fully-connected layer, i.e., every neuron in the layer is connected to every neuron in the previous layer. Unlike the network built from scratch above, Keras Dense layers include a bias term by default, so this model has a few more trainable parameters.

In [41]:
model = Sequential()
# construct hidden layer of four neurons
model.add(Dense(units = 4, activation = 'sigmoid', input_dim = 3))
# construct output layer
model.add(Dense(units = 1, activation = 'sigmoid'))
print(model.summary())

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_7 (Dense)              (None, 4)                 16        
_________________________________________________________________
dense_8 (Dense)              (None, 1)                 5         
Total params: 21
Trainable params: 21
Non-trainable params: 0
_________________________________________________________________
None


The model summary shows that the first layer contains 16 parameters for training (12 weights and 4 biases); for the $j$-th neuron in the hidden layer we have $$z^2_j = w^2_{j1}a^1_1 + w^2_{j2}a^1_2 + w^2_{j3}a^1_3 + b^2_j,\;\;\;\;\;\;\mbox{for}\;\; j = 1,\;2,\;3,\;4$$
The output layer contains five parameters (four weights and one bias):
$$z^3_1 = w^3_{11}a^2_1 + w^3_{12}a^2_2 + w^3_{13}a^2_3 + w^3_{14}a^2_4 + b^3_1$$
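
The parameter counts can be reproduced with a few lines of arithmetic. The helper `dense_param_count` below is a hypothetical function written for illustration (it is not a Keras API); it counts one weight per input per neuron, plus one bias per neuron when biases are used.

```python
def dense_param_count(n_inputs, n_units, use_bias=True):
    # one weight for every (input, neuron) pair, plus an optional bias per neuron
    weights = n_inputs * n_units
    biases = n_units if use_bias else 0
    return weights + biases

hidden = dense_param_count(3, 4)   # 3*4 weights + 4 biases = 16
output = dense_param_count(4, 1)   # 4*1 weights + 1 bias = 5
total = hidden + output            # 21, matching the model summary

# the from-scratch network has no biases, so it trains fewer parameters
no_bias_total = (dense_param_count(3, 4, use_bias=False)
                 + dense_param_count(4, 1, use_bias=False))  # 12 + 4 = 16

print(hidden, output, total, no_bias_total)  # 16 5 21 16
```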

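We will utilise the SGD optimiser, which trains the model by gradient descent, with a learning rate of one. Plain gradient descent is adequate for a small, shallow network such as this one. We pass the optimiser to the compile method along with the cost function we wish to use. Since the default batch size (32) exceeds our four samples, each epoch performs a single full-batch update, as in the from-scratch training loop. Note that the Keras mean squared error averages over the samples and carries no factor of $\frac{1}{2}$, so its gradients are scaled differently from those of the cost $C$ above.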

In [42]:
# learning_rate is the current argument name; the older lr spelling is deprecated
sgd = optimizers.SGD(learning_rate = 1)
model.compile(loss = 'mean_squared_error', optimizer = sgd)
model.fit(X, y, epochs = 2000, verbose = False)
print(model.predict(X))

[[0.04577421]
 [0.95596975]
 [0.95841074]
 [0.03405814]]
