# Building a Neural Network from Scratch

https://towardsdatascience.com/math-neural-network-from-scratch-in-python-d6da9f29ce65 

### Abstract Base Class: Layer
The abstract class `Layer`, which all other layers inherit from, holds the properties every layer shares: an input, an output, and a forward and a backward method.

In [2]:

# Base class
class Layer:
    def __init__(self):
        self.input = None
        self.output = None

    # computes the output Y of a layer for a given input X
    def forward_propagation(self, input):
        raise NotImplementedError

    # computes dE/dX for a given dE/dY (and update parameters if any)
    def backward_propagation(self, output_error, learning_rate):
        raise NotImplementedError

In the abstract class above, `backward_propagation` takes an extra parameter, `learning_rate`, which controls how far each gradient descent step moves the parameters.

### Backward Propagation
Suppose we have a matrix containing the derivative of the error with respect to the layer’s output: $\frac{\partial E}{\partial Y}$

We need:
- The derivative of the error with respect to the parameters ($\frac{\partial E}{\partial W}$, $\frac{\partial E}{\partial B}$)
- The derivative of the error with respect to the input ($\frac{\partial E}{\partial X}$)

Let's calculate $\frac{\partial E}{\partial W}$. This matrix should be the same size as $W$ itself:

$i \times j$, where $i$ is the number of input neurons and $j$ the number of output neurons. We need one gradient for every weight.
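
As a sketch of where this leads, assume the layer computes $Y = XW + B$ for a single sample $X$ of shape $1 \times i$, so that $y_j = \sum_i x_i w_{ij} + b_j$. Under that assumption, the chain rule gives one gradient per weight:

$$\frac{\partial E}{\partial w_{ij}} = \sum_k \frac{\partial E}{\partial y_k}\frac{\partial y_k}{\partial w_{ij}} = x_i \frac{\partial E}{\partial y_j} \quad\Rightarrow\quad \frac{\partial E}{\partial W} = X^{T}\frac{\partial E}{\partial Y}$$

The same reasoning applied to the bias and to the input yields:

$$\frac{\partial E}{\partial B} = \frac{\partial E}{\partial Y}, \qquad \frac{\partial E}{\partial X} = \frac{\partial E}{\partial Y}\,W^{T}$$

The shapes are consistent with the requirement above: $X^{T}$ is $i \times 1$ and $\frac{\partial E}{\partial Y}$ is $1 \times j$, so their product is $i \times j$.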

### Coding the Fully Connected Layer

In [3]:
#from layer import Layer
import numpy as np

# inherit from base class Layer
class FCLayer(Layer):
    # input_size = number of input neurons
    # output_size = number of output neurons
    def __init__(self, input_size, output_size):
        super().__init__()
        # initialise weights and biases uniformly in [-0.5, 0.5)
        self.weights = np.random.rand(input_size, output_size) - 0.5
        self.bias = np.random.rand(1, output_size) - 0.5

    # returns output for a given input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = np.dot(self.input, self.weights) + self.bias
        return self.output

    # computes dE/dW, dE/dB for a given output_error=dE/dY. Returns input_error=dE/dX.
    def backward_propagation(self, output_error, learning_rate):
        input_error = np.dot(output_error, self.weights.T)
        weights_error = np.dot(self.input.T, output_error)
        # dBias = output_error

        # update parameters
        self.weights -= learning_rate * weights_error
        self.bias -= learning_rate * output_error
        return input_error
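
One way to gain confidence in these gradient formulas is to compare them with finite differences. The sketch below is not part of the original tutorial: it rebuilds a single linear layer with plain NumPy arrays (the sizes, the random seed, and the helper `numerical_grad` are assumptions made for illustration) and checks the three analytic gradients against numerical ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# One sample with 3 inputs feeding a layer with 2 outputs (sizes are arbitrary)
x = rng.normal(size=(1, 3))
w = rng.normal(size=(3, 2))
b = rng.normal(size=(1, 2))
target = rng.normal(size=(1, 2))

def loss():
    # Mean squared error of the layer output y = xW + b against the target
    y = x @ w + b
    return np.mean((target - y) ** 2)

# Analytic gradients, using the same formulas as backward_propagation
y = x @ w + b
output_error = 2 * (y - target) / target.size   # dE/dY for MSE
weights_error = x.T @ output_error              # dE/dW
bias_error = output_error                       # dE/dB
input_error = output_error @ w.T                # dE/dX

def numerical_grad(f, arr, eps=1e-6):
    # Central finite differences, perturbing one entry of arr at a time
    grad = np.zeros_like(arr)
    for idx in np.ndindex(arr.shape):
        old = arr[idx]
        arr[idx] = old + eps
        plus = f()
        arr[idx] = old - eps
        minus = f()
        arr[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

num_w = numerical_grad(loss, w)
num_b = numerical_grad(loss, b)
num_x = numerical_grad(loss, x)

print(np.allclose(weights_error, num_w, atol=1e-6))
print(np.allclose(bias_error, num_b, atol=1e-6))
print(np.allclose(input_error, num_x, atol=1e-6))
```

If any of the three checks printed `False`, the corresponding line in `backward_propagation` would be the first place to look.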

### Activation Layer
All the calculations we have done so far are completely linear, and a purely linear model cannot learn much: stacking linear layers still yields a linear function. We need to add non-linearity to the model by applying non-linear functions to the output of some layers.

Now we need to redo the whole process for this new type of layer!

In [4]:
#from layer import Layer

# inherit from base class Layer
class ActivationLayer(Layer):
    def __init__(self, activation, activation_prime):
        super().__init__()
        self.activation = activation
        self.activation_prime = activation_prime

    # returns the activated input
    def forward_propagation(self, input_data):
        self.input = input_data
        self.output = self.activation(self.input)
        return self.output

    # Returns input_error=dE/dX for a given output_error=dE/dY.
    # learning_rate is not used because there are no "learnable" parameters.
    def backward_propagation(self, output_error, learning_rate):
        return self.activation_prime(self.input) * output_error

You can also write some activation functions and their derivatives in a separate file. These will be used later to create an ActivationLayer.

In [5]:
import numpy as np

# activation function and its derivative
def tanh(x):
    return np.tanh(x)

def tanh_prime(x):
    return 1 - np.tanh(x)**2
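
Other activation pairs can follow the same pattern. As a sketch that is not part of the original tutorial, the `sigmoid` and `relu` functions below (names chosen here for illustration) each come with a derivative, so either pair could be passed to `ActivationLayer` in place of `tanh` and `tanh_prime`.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    # derivative of the sigmoid expressed through the sigmoid itself
    s = sigmoid(x)
    return s * (1.0 - s)

def relu(x):
    return np.maximum(0.0, x)

def relu_prime(x):
    # 1 where x > 0 and 0 elsewhere (the value at x == 0 is a convention)
    return (np.asarray(x) > 0).astype(float)
```

Whether these train better than tanh on the problems below is not something this notebook tests.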

### Loss Function
Until now, for a given layer, we assumed that ∂E/∂Y was given (by the next layer). But what about the last layer? How does it get ∂E/∂Y? We supply it ourselves, and its value depends on how we define the error.
The error of the network, which measures how well or badly the network did for a given input, is defined by you.

There are many ways to define the error; one of the best known is the Mean Squared Error (MSE).

In [6]:

import numpy as np

# loss function and its derivative
def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size
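
To make the two functions concrete: for $n$ outputs with targets $y^*_i$ and predictions $y_i$, MSE is $E = \frac{1}{n}\sum_i (y^*_i - y_i)^2$, and its derivative with respect to each prediction is $\frac{2}{n}(y_i - y^*_i)$. The following hand-checkable sketch repeats the two definitions so it runs on its own; the sample values are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean(np.power(y_true - y_pred, 2))

def mse_prime(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size

y_true = np.array([[0.0, 1.0]])
y_pred = np.array([[0.5, 0.5]])

# Both squared differences are 0.25, so their mean is 0.25
print(mse(y_true, y_pred))

# 2 * (0.5 - 0) / 2 = 0.5 and 2 * (0.5 - 1) / 2 = -0.5
print(mse_prime(y_true, y_pred))
```

The sign of each derivative shows the direction in which the prediction overshoots its target: positive for the first output (too high), negative for the second (too low).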

### Network Class
Almost done! We are going to make a Network class to create neural networks very easily using the building blocks we have prepared so far.


In [7]:
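# example of a function for calculating softmax for a list of numbers
import numpy as np

# calculate the softmax of a vector
def softmax(vector):
    # subtract the maximum before exponentiating to avoid overflow;
    # the result is unchanged because softmax is shift-invariant
    v = np.asarray(vector, dtype=float)
    e = np.exp(v - np.max(v))
    return e / e.sum()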

In [8]:
class Network:
    def __init__(self):
        self.layers = []
        self.loss = None
        self.loss_prime = None

    # add layer to network
    def add(self, layer):
        self.layers.append(layer)

    # set loss to use
    def use(self, loss, loss_prime):
        self.loss = loss
        self.loss_prime = loss_prime

        
    # predict output for given input
    def predict(self, input_data):
        # sample dimension first
        samples = len(input_data)
        result = []

        # run network over all samples
        for i in range(samples):
            # forward propagation
            output = input_data[i]
            for layer in self.layers:
                output = layer.forward_propagation(output)
            result.append(output)

        return result

    # train the network
    def fit(self, x_train, y_train, epochs, learning_rate):
        '''
        Fit function does the training. 
        Training data is passed 1-by-1 through the network layers during forward propagation.
        Loss (error) is calculated for each input and back propagation is performed via partial 
        derivatives on each layer.
        '''
        # sample dimension first
        samples = len(x_train)

        # training loop
        for i in range(epochs):
            err = 0
            for j in range(samples):
                # forward propagation
                output = x_train[j]
                for layer in self.layers:
                    output = layer.forward_propagation(output)

                # compute loss (for display purpose only)
                err += self.loss(y_train[j], output)

                # backward propagation
                error = self.loss_prime(y_train[j], output)
                for layer in reversed(self.layers):
                    error = layer.backward_propagation(error, learning_rate)

            # calculate average error on all samples
            err /= samples
            print('epoch %d/%d   error=%f' % (i+1, epochs, err))

### Building Neural Networks
Finally! We can use our class to create a neural network with as many layers as we want! We are going to build two neural networks: a simple XOR solver and an MNIST solver.


### Solve XOR
Starting with XOR is a good first step, as it’s a simple way to tell whether the network is learning anything at all.

In [9]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

# training data
x_train = np.array([[[0,0]], [[0,1]], [[1,0]], [[1,1]]])
y_train = np.array([[[0]], [[1]], [[1]], [[0]]])

# network
net = Network()
net.add(FCLayer(2, 3))
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(3, 1))
net.add(ActivationLayer(tanh, tanh_prime))

# train
net.use(mse, mse_prime)
net.fit(x_train, y_train, epochs=1000, learning_rate=0.1)

# test
out = net.predict(x_train)
print(out)

epoch 1/1000   error=0.518158
epoch 2/1000   error=0.342602
epoch 3/1000   error=0.303756
epoch 4/1000   error=0.292410
epoch 5/1000   error=0.287782
epoch 6/1000   error=0.285355
epoch 7/1000   error=0.283832
epoch 8/1000   error=0.282747
epoch 9/1000   error=0.281902
epoch 10/1000   error=0.281199
epoch 11/1000   error=0.280580
epoch 12/1000   error=0.280010
epoch 13/1000   error=0.279467
epoch 14/1000   error=0.278933
epoch 15/1000   error=0.278395
epoch 16/1000   error=0.277844
epoch 17/1000   error=0.277271
epoch 18/1000   error=0.276670
epoch 19/1000   error=0.276036
epoch 20/1000   error=0.275365
epoch 21/1000   error=0.274652
epoch 22/1000   error=0.273894
epoch 23/1000   error=0.273089
epoch 24/1000   error=0.272235
epoch 25/1000   error=0.271330
epoch 26/1000   error=0.270372
epoch 27/1000   error=0.269361
epoch 28/1000   error=0.268296
epoch 29/1000   error=0.267177
epoch 30/1000   error=0.266005
epoch 31/1000   error=0.264779
epoch 32/1000   error=0.263502
epoch 33/1000   e

### Solve MNIST
We didn’t implement a convolutional layer, but this is not a problem.
All we need to do is reshape our data so that it fits into a fully connected layer.
The MNIST dataset consists of images of digits from 0 to 9, each of shape 28x28x1.
The goal is to predict which digit is drawn in a picture.

In [None]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

from keras.datasets import mnist
from keras.utils import to_categorical  # np_utils is unavailable in recent Keras releases

# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# training data : 60000 samples
# reshape and normalize input data
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')
x_train /= 255
# encode output which is a number in range [0,9] into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train)

# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')
x_test /= 255
y_test = to_categorical(y_test)

# Network
net = Network()
net.add(FCLayer(28*28, 100))                # input_shape=(1, 28*28)    ;   output_shape=(1, 100)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 50))                   # input_shape=(1, 100)      ;   output_shape=(1, 50)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 10))                    # input_shape=(1, 50)       ;   output_shape=(1, 10)
net.add(ActivationLayer(tanh, tanh_prime))

# train on 1000 samples
# as we didn't implement mini-batch GD, training would be pretty slow if we updated after each of the 60000 samples...
net.use(mse, mse_prime)
net.fit(x_train[0:1000], y_train[0:1000], epochs=35, learning_rate=0.1)

# test on 3 samples
out = net.predict(x_test[0:3])
# print("\n")
# print("predicted values : ")
# print(out, end="\n")
# print("true values : ")
# print(y_test[0:3])

### Evaluation Function

In [27]:

def my_evaluation(y_true, y_pred):
    # expects flat sequences of binary labels (0 or 1)
    TP, TN, FP, FN = 0, 0, 0, 0
    # check each (true, predicted) pair and count the matching outcome
    for yt, yp in zip(y_true, y_pred):
        if yp == 1 and yt == 1:
            TP += 1
        elif yp == 0 and yt == 0:
            TN += 1
        elif yp == 1 and yt == 0:
            FP += 1
        else:
            FN += 1
    # insert the counts into a 2D array
    cm = np.array([[TN, FP],
                   [FN, TP]])
    accuracy = (TP + TN) / len(y_true)
    # return nan instead of dividing by zero when a denominator is empty
    precision = TP / (TP + FP) if (TP + FP) else float('nan')
    recall = TP / (TP + FN) if (TP + FN) else float('nan')

    return f'Accuracy = {accuracy}, precision = {precision}, recall = {recall}, confusion matrix = {cm} '
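
The function above assumes flat binary labels, whereas the MNIST targets are one-hot vectors of size 10 and `net.predict` returns a list of `(1, 10)` arrays. As a sketch (the helpers `to_labels`, `multiclass_accuracy`, and `confusion_matrix` are introduced here and are not part of the original notebook), both can be reduced to class indices with `argmax` before they are compared.

```python
import numpy as np

def to_labels(outputs):
    # Accepts one-hot rows of shape (n, k) or a list of (1, k) network outputs
    # and returns the index of the largest entry for each sample.
    arr = np.asarray(outputs)
    arr = arr.reshape(arr.shape[0], -1)
    return np.argmax(arr, axis=1)

def multiclass_accuracy(y_true, y_pred):
    # fraction of samples whose predicted class matches the true class
    return float(np.mean(to_labels(y_true) == to_labels(y_pred)))

def confusion_matrix(y_true, y_pred, num_classes):
    # rows are true classes, columns are predicted classes
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(to_labels(y_true), to_labels(y_pred)):
        cm[t, p] += 1
    return cm

# Tiny illustrative example with 3 classes (values are made up)
y_true = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
y_pred = [np.array([[0.9, 0.1, -0.2]]),
          np.array([[0.2, 0.1, 0.7]]),
          np.array([[-0.1, 0.3, 0.8]])]

acc = multiclass_accuracy(y_true, y_pred)   # 2 of the 3 samples match
cm = confusion_matrix(y_true, y_pred, 3)
print(acc)
print(cm)
```

With the notebook's variables, this would be called as `multiclass_accuracy(y_test, y_pred)`.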




Question: What can go wrong if our input/output data spans a wide range of values and we feed it to the neural network without any pre-processing?

Answer: First, gradient descent converges much faster when the dataset is scaled using some form of normalization or standardization.
Second, if the model depends on distance measures, a feature with a larger range/scale than the others can dominate the computation of the loss function.
Scaling also helps with feature selection: not every feature can be fed to the algorithm, because each feature adds a whole new dimension, and we don't want to use features that have no correlation with the target or contribute nothing to training.
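
One concrete mechanism behind the first point, for the tanh network used here: with raw pixel values in [0, 255], the pre-activations of the first layer become very large in magnitude, tanh saturates, and its derivative (which scales every gradient passed backwards) gets close to zero. The sketch below illustrates this with synthetic data; the random "pixels", the seed, and the layer size are assumptions for illustration, not MNIST itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "image": 784 pixel intensities in [0, 255]
raw = rng.integers(0, 256, size=(1, 784)).astype(float)
scaled = raw / 255.0

# Weights initialised like FCLayer: uniform in [-0.5, 0.5)
weights = rng.random((784, 100)) - 0.5

def tanh_prime(x):
    return 1 - np.tanh(x) ** 2

# Average derivative of tanh at the first layer's pre-activations
grad_raw = tanh_prime(raw @ weights).mean()
grad_scaled = tanh_prime(scaled @ weights).mean()

print("mean tanh' with raw inputs:   ", grad_raw)
print("mean tanh' with scaled inputs:", grad_scaled)
```

The smaller the average derivative, the less error signal reaches the first layer's weights on each update.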

### Training and testing without pre-processing the data

In [28]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

from keras.datasets import mnist
from keras.utils import to_categorical  # np_utils is unavailable in recent Keras releases

# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# training data : 60000 samples
# reshape input data (normalization is deliberately skipped here)
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')
# x_train /= 255
# encode output which is a number in range [0,9] into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train)

# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')
# x_test /= 255
y_test = to_categorical(y_test)

# Network
net = Network()
net.add(FCLayer(28*28, 100))                # input_shape=(1, 28*28)    ;   output_shape=(1, 100)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 50))                   # input_shape=(1, 100)      ;   output_shape=(1, 50)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 10))                    # input_shape=(1, 50)       ;   output_shape=(1, 10)
net.add(ActivationLayer(tanh, tanh_prime))

# train on 1000 samples
# as we didn't implement mini-batch GD, training would be pretty slow if we updated after each of the 60000 samples...
net.use(mse, mse_prime)
net.fit(x_train[0:1000], y_train[0:1000], epochs=35, learning_rate=0.1)

# test on 3 samples
out = net.predict(x_test[0:3])
# print("\n")
# print("predicted values : ")
# print(out, end="\n")
# print("true values : ")
# print(y_test[0:3])

y_pred= net.predict(x_test)
# my_evaluation(y_test,y_pred)

epoch 1/35   error=0.257486
epoch 2/35   error=0.135108
epoch 3/35   error=0.131550
epoch 4/35   error=0.128017
epoch 5/35   error=0.125565
epoch 6/35   error=0.122146
epoch 7/35   error=0.122626
epoch 8/35   error=0.123109
epoch 9/35   error=0.120658
epoch 10/35   error=0.119352
epoch 11/35   error=0.118153
epoch 12/35   error=0.116941
epoch 13/35   error=0.115591
epoch 14/35   error=0.119216
epoch 15/35   error=0.117787
epoch 16/35   error=0.117178
epoch 17/35   error=0.116985
epoch 18/35   error=0.116795
epoch 19/35   error=0.115152
epoch 20/35   error=0.113121
epoch 21/35   error=0.112390
epoch 22/35   error=0.111542
epoch 23/35   error=0.109991
epoch 24/35   error=0.108382
epoch 25/35   error=0.107133
epoch 26/35   error=0.106278
epoch 27/35   error=0.105530
epoch 28/35   error=0.106209
epoch 29/35   error=0.106662
epoch 30/35   error=0.106181
epoch 31/35   error=0.105549
epoch 32/35   error=0.104921
epoch 33/35   error=0.104193
epoch 34/35   error=0.103698
epoch 35/35   error=0.1

'Accuracy = 0.0, precession = 0.0, recall = nan, confusion matrix = [[    0 10000]\n [    0     0]] '

### Normalization and standardization functions
The normalization function: rescales the dataset using its minimum and maximum values, so that the values lie between 0 (the minimum) and 1 (the maximum).

The standardization function: also called Z-score normalization, it uses the mean and standard deviation of the data so that the result has zero mean and unit variance. As a result, variances can be compared between features.

In [17]:
import numpy as np

def normalize(dataset):
    normalized_dataset = (dataset - np.min(dataset)) / (np.max(dataset) - np.min(dataset))
    return normalized_dataset

In [18]:
def standarize(dataset):
    standardized_dataset = (dataset - np.average(dataset)) / (np.std(dataset))
    return standardized_dataset
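
Both functions above compute their statistics from whatever array they receive, so calling them separately on the training and test sets scales the two with different constants. A common alternative is to compute the statistics on the training data only and reuse them for the test data. The sketch below assumes a single global mean and standard deviation are wanted, and the helper names `fit_standardizer` and `apply_standardizer` are hypothetical.

```python
import numpy as np

def fit_standardizer(train):
    # Global statistics over the whole training array
    mean = float(np.mean(train))
    std = float(np.std(train))
    if std == 0.0:
        std = 1.0  # avoid division by zero for constant data
    return mean, std

def apply_standardizer(dataset, mean, std):
    # Scale any array with statistics that were computed elsewhere
    return (dataset - mean) / std

# Made-up values: the training mean is 3 and the variance is 5
train = np.array([[0.0, 2.0], [4.0, 6.0]])
test = np.array([[3.0, 5.0]])

mean, std = fit_standardizer(train)
train_std = apply_standardizer(train, mean, std)
test_std = apply_standardizer(test, mean, std)

print(train_std.mean(), train_std.std())  # approximately 0 and 1
print(test_std)
```

The test array generally does not end up with exactly zero mean and unit variance under this scheme, which is expected: it is measured on the training set's scale.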

### Training and testing after normalizing the dataset

In [31]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

from keras.datasets import mnist
from keras.utils import to_categorical  # np_utils is unavailable in recent Keras releases

# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# training data : 60000 samples
# reshape and normalize input data
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')
x_train = normalize(x_train)
# x_train /= 255
# encode output which is a number in range [0,9] into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train)

# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')
x_test = normalize(x_test)
# x_test /= 255
y_test = to_categorical(y_test)

# Network
net = Network()
net.add(FCLayer(28*28, 100))                # input_shape=(1, 28*28)    ;   output_shape=(1, 100)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 50))                   # input_shape=(1, 100)      ;   output_shape=(1, 50)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 10))                    # input_shape=(1, 50)       ;   output_shape=(1, 10)
net.add(ActivationLayer(tanh, tanh_prime))

# train on 1000 samples
# as we didn't implement mini-batch GD, training would be pretty slow if we updated after each of the 60000 samples...
net.use(mse, mse_prime)
net.fit(x_train[0:1000], y_train[0:1000], epochs=35, learning_rate=0.1)

# test on 3 samples
# out = net.predict(x_test[0:3])
# print("\n")
# print("predicted values : ")
# print(out, end="\n")
# print("true values : ")
# print(y_test[0:3])



y_pred = net.predict(x_test)
# my_evaluation(y_test,y_pred)

epoch 1/35   error=0.242480
epoch 2/35   error=0.092763
epoch 3/35   error=0.073513
epoch 4/35   error=0.062377
epoch 5/35   error=0.053966
epoch 6/35   error=0.047466
epoch 7/35   error=0.042118
epoch 8/35   error=0.037618
epoch 9/35   error=0.034026
epoch 10/35   error=0.031124
epoch 11/35   error=0.028588
epoch 12/35   error=0.026346
epoch 13/35   error=0.024377
epoch 14/35   error=0.022712
epoch 15/35   error=0.021202
epoch 16/35   error=0.019830
epoch 17/35   error=0.018526
epoch 18/35   error=0.017340
epoch 19/35   error=0.016273
epoch 20/35   error=0.015278
epoch 21/35   error=0.014451
epoch 22/35   error=0.013709
epoch 23/35   error=0.013045
epoch 24/35   error=0.012419
epoch 25/35   error=0.011859
epoch 26/35   error=0.011343
epoch 27/35   error=0.010867
epoch 28/35   error=0.010418
epoch 29/35   error=0.010033
epoch 30/35   error=0.009688
epoch 31/35   error=0.009359
epoch 32/35   error=0.009029
epoch 33/35   error=0.008706
epoch 34/35   error=0.008310
epoch 35/35   error=0.0

### Training and testing after standardizing the dataset

In [30]:
import numpy as np

#from network import Network
#from fc_layer import FCLayer
#from activation_layer import ActivationLayer
#from activations import tanh, tanh_prime
#from losses import mse, mse_prime

from keras.datasets import mnist
from keras.utils import to_categorical  # np_utils is unavailable in recent Keras releases

# load MNIST from server
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# training data : 60000 samples
# reshape and standardize input data
x_train = x_train.reshape(x_train.shape[0], 1, 28*28)
x_train = x_train.astype('float32')
x_train = standarize(x_train)
# x_train /= 255
# encode output which is a number in range [0,9] into a vector of size 10
# e.g. number 3 will become [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
y_train = to_categorical(y_train)

# same for test data : 10000 samples
x_test = x_test.reshape(x_test.shape[0], 1, 28*28)
x_test = x_test.astype('float32')
x_test = standarize(x_test)
# x_test /= 255
y_test = to_categorical(y_test)

# Network
net = Network()
net.add(FCLayer(28*28, 100))                # input_shape=(1, 28*28)    ;   output_shape=(1, 100)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(100, 50))                   # input_shape=(1, 100)      ;   output_shape=(1, 50)
net.add(ActivationLayer(tanh, tanh_prime))
net.add(FCLayer(50, 10))                    # input_shape=(1, 50)       ;   output_shape=(1, 10)
net.add(ActivationLayer(tanh, tanh_prime))

# train on 1000 samples
# as we didn't implement mini-batch GD, training would be pretty slow if we updated after each of the 60000 samples...
net.use(mse, mse_prime)
net.fit(x_train[0:1000], y_train[0:1000], epochs=35, learning_rate=0.1)

# test on 3 samples
# out = net.predict(x_test[0:3])
# print("\n")
# print("predicted values : ")
# print(out, end="\n")
# print("true values : ")
# print(y_test[0:3])



y_pred = net.predict(x_test)
# my_evaluation(y_test,y_pred)

epoch 1/35   error=0.259714
epoch 2/35   error=0.101194
epoch 3/35   error=0.078838
epoch 4/35   error=0.064737
epoch 5/35   error=0.055467
epoch 6/35   error=0.048928
epoch 7/35   error=0.043808
epoch 8/35   error=0.039842
epoch 9/35   error=0.036473
epoch 10/35   error=0.033588
epoch 11/35   error=0.031302
epoch 12/35   error=0.029384
epoch 13/35   error=0.027792
epoch 14/35   error=0.026198
epoch 15/35   error=0.024595
epoch 16/35   error=0.023367
epoch 17/35   error=0.022204
epoch 18/35   error=0.021045
epoch 19/35   error=0.020000
epoch 20/35   error=0.019171
epoch 21/35   error=0.018418
epoch 22/35   error=0.017624
epoch 23/35   error=0.017071
epoch 24/35   error=0.016588
epoch 25/35   error=0.016181
epoch 26/35   error=0.015763
epoch 27/35   error=0.015318
epoch 28/35   error=0.014958
epoch 29/35   error=0.014516
epoch 30/35   error=0.014178
epoch 31/35   error=0.013744
epoch 32/35   error=0.013421
epoch 33/35   error=0.013069
epoch 34/35   error=0.012871
epoch 35/35   error=0.0

In [29]:
print(y_pred)

[array([[-0.33321012,  0.05929038, -0.03778846,  0.12749001, -0.16952753,
        -0.30670006,  0.45117648,  0.88880122, -0.03625764,  0.05586455]]), array([[-0.08210128,  0.17579204,  0.37887544,  0.44688495,  0.05169365,
         0.08830161,  0.28437007, -0.16521875,  0.12847945, -0.10492135]]), array([[ 0.10038238,  0.88675315, -0.18452187,  0.24146474, -0.11433342,
        -0.12273227,  0.5301651 ,  0.03314136, -0.30903072, -0.02164572]]), array([[ 0.08281389,  0.01659541,  0.03478818,  0.41234681, -0.03653436,
        -0.1838995 ,  0.51236214,  0.10023936,  0.21778047, -0.03393552]]), array([[-0.22532927,  0.01385501, -0.25347874,  0.18635665,  0.25775579,
         0.03819208,  0.2989952 ,  0.76694017,  0.07505861, -0.07805226]]), array([[ 0.02626743,  0.93607624, -0.23400012,  0.21710829, -0.25545092,
        -0.02706136,  0.48721109, -0.02880154, -0.28343426, -0.08593294]]), array([[-0.06724034,  0.01600208, -0.02799283,  0.31895815,  0.05386573,
         0.10407063,  0.3659472 