Now we are going to code up a neural network (NN) from scratch. That means we are not going to use PyTorch; instead, we'll use pure Python and NumPy.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import sklearn.datasets 

In [None]:
x,y = sklearn.datasets.make_moons(200, noise=.15)

Note: if we press Tab after typing ***sklearn.datasets.***, we'll see all the ready-made datasets that sklearn provides.

In [None]:
x

In [None]:
y

In [None]:
# plot the data, colored by class label
plt.scatter(x[:, 0], x[:, 1], s=20, c=y)

In [None]:
x.shape

In [None]:
input_neurons = 2
output_neurons = 2
samples = x.shape[0]
learning_rate = 0.001
# Here, we'll use regularization from scratch as well.
lamb = 0.01

#### This is the neural network that we are going to build, with 2 input neurons, 3 hidden neurons, and 2 output neurons.

![NN Architecture](NN_from_scratch_architecture.png)


In summary, our neural network has the following characteristics:

**Architecture:** Two input neurons, three hidden neurons, and two output neurons.

**Weight Updates:** We plan to update all the weights simultaneously using vectorized operations. Instead of updating weights one at a time, we use matrices to operate on many weights at once, so the gradients for all weights are calculated together, which is much faster.

**Matrix Multiplication:** We'll use matrices to represent the three terms of the chain rule (derivative of error with respect to output, output with respect to input, and input with respect to weight).

**Gradient Calculation:** Multiplying these matrices together will yield the gradient of the error with respect to the weights.

**Weight Matrix:** Weights will also be represented as matrices.

**Weight Update:** To update the weights, we'll subtract the gradients, scaled by the learning rate, from the weight matrices.

This approach is commonly used when implementing neural networks with libraries like NumPy in Python. It speeds up training because each matrix operation runs in NumPy's optimized compiled code rather than in Python loops, which matters more as the dataset grows.
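To make the vectorization point concrete, here is a minimal sketch comparing a weight gradient accumulated one element at a time with the same gradient from a single matrix product. The arrays `a1` and `delta3` are small random stand-ins chosen for illustration (5 samples, 3 hidden neurons, 2 output neurons), not values from the actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
a1 = rng.normal(size=(5, 3))      # stand-in hidden activations: 5 samples, 3 neurons
delta3 = rng.normal(size=(5, 2))  # stand-in output errors: 5 samples, 2 neurons

# One weight at a time: accumulate each entry of the (3, 2) gradient in Python loops
dw2_loop = np.zeros((3, 2))
for n in range(a1.shape[0]):
    for i in range(3):
        for j in range(2):
            dw2_loop[i, j] += a1[n, i] * delta3[n, j]

# All weights at once: a single matrix product gives the same (3, 2) gradient
dw2_vec = a1.T @ delta3

print(np.allclose(dw2_loop, dw2_vec))  # True
```

Both versions compute the same numbers; the matrix product simply moves the three nested loops out of Python.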

![NN Architecture2](NN_from_scratch_architecture_pic2.png)

This is what we have for the first layer.
The input matrix has shape (200 * 2): 200 samples, each with 2 features. For the weight matrix, the number of rows is the number of input neurons and the number of columns is the number of output neurons of that layer. Therefore, the hidden layer's weight matrix has shape (2 * 3), meaning each input connects to 3 hidden neurons. We then take the dot product of these two matrices, and the resulting matrix has shape (200 * 3).

Now, let's move on to the next layer.

![NN Architecture3](NN_from_scratch_architecture_pic3.png)

The (200 * 3) matrix that we got from the hidden layer becomes the input to the next layer. For the weights, we have 3 neurons on the input side and 2 neurons on the output side, so the weight matrix is (3 * 2). The output layer therefore produces a (200 * 2) matrix.
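The shape bookkeeping above can be checked quickly with placeholder arrays. This is only a sketch: the `*_demo` names and the zero-filled arrays are hypothetical, and biases and activations are left out because they do not change the shapes.

```python
import numpy as np

x_demo = np.zeros((200, 2))   # 200 samples, 2 features
w1_demo = np.zeros((2, 3))    # 2 input neurons -> 3 hidden neurons
w2_demo = np.zeros((3, 2))    # 3 hidden neurons -> 2 output neurons

hidden = x_demo.dot(w1_demo)  # (200, 2) . (2, 3) -> (200, 3)
output = hidden.dot(w2_demo)  # (200, 3) . (3, 2) -> (200, 2)

print(hidden.shape, output.shape)  # (200, 3) (200, 2)
```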

In [None]:
# forward propagation function and weight retrieval function
# first, we're going to represent the model as a dictionary of weights and biases
# model_dict = {'w1' : w1, 'b1' : b1, 'w2' : w2, 'b2' : b2}

In [None]:
def retreive(model_dict):
    # retrieves the weights and biases from the model dictionary
    w1 = model_dict["w1"]
    b1 = model_dict['b1']
    w2 = model_dict["w2"]
    b2 = model_dict['b2']
    
    return w1, b1, w2, b2

In [None]:
# forward propagation function
def forward(x, model_dict):
    w1, b1, w2, b2 = retreive(model_dict)
    z1 = x.dot(w1) + b1  # net inputs of the hidden layer, shape (200, 3)
    a1 = np.tanh(z1)  # activation function
    z2 = a1.dot(w2) + b2  # net inputs of the output layer, shape (200, 2)
    # now we calculate softmax directly on the output scores z2 (no tanh here,
    # so that the gradient probs - one_hot(y) used in backpropagation is correct);
    # we can then easily draw an inference by taking the maximum as our prediction
    exp_scores = np.exp(z2)
    # axis=1: rows (axis 0) are the samples, columns (axis 1) are the outputs of the output neurons
    softmax = exp_scores / np.sum(exp_scores, axis=1, keepdims=True)
    return z1, a1, softmax

Here, the softmax output has shape (200, 2): 200 rows, where each row is the probability distribution over the two possible classes (class 0 and class 1).

To find the true class for each row, we look at the corresponding value in the true-label vector, denoted "y". If, for instance, the 5th value in "y" is 1, the true class of that example is class 1. Conversely, if it is 0, the true class is class 0.

When calculating the cross-entropy loss, only the probability assigned to the true class contributes. So, if the 5th value in "y" is 1, we take the probability assigned to class 1 in the 5th row of the softmax output and compute the loss as the negative logarithm of that probability. Conversely, if the 5th value in "y" is 0, we use the probability assigned to class 0 in the 5th row.

In summary, the true class labels in "y" tell us which class is correct for each example, and the cross-entropy loss is computed from the probability the network assigned to that class.
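As a small worked example of this selection step, the sketch below uses three made-up probability rows (`probs_demo`) and labels (`y_demo`); both are hypothetical values chosen for illustration. NumPy's integer-array indexing picks, for each row, the probability of the true class.

```python
import numpy as np

probs_demo = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4]])  # each row sums to 1
y_demo = np.array([0, 1, 1])         # true class of each row

# Probability assigned to the true class of each row: rows 0, 1, 2 and columns y_demo
picked = probs_demo[np.arange(len(y_demo)), y_demo]
print(picked)  # [0.9 0.8 0.4]

# Cross-entropy: negative log of those probabilities, averaged over the samples
cross_entropy = np.mean(-np.log(picked))
print(cross_entropy)
```

The third row contributes the most to the loss because the true class (1) only received probability 0.4.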

In [None]:
# loss calculation (cross entropy loss)
def loss(softmax, y, model_dict):  # softmax=predicted probabilities, y=actual labels
    # first retrieve the weights, as we'll apply the L2 regularization technique
    w1, b1, w2, b2 = retreive(model_dict)

    # m[i] holds the probability the network assigned to the true class of sample i
    m = np.zeros(y.shape[0])
    for i, correct_index in enumerate(y):
        predicted = softmax[i][correct_index]
        m[i] = predicted

    log_prob = -np.log(m)  # use every sample's probability, not only the last one
    softmax_loss = np.sum(log_prob)
    reg_loss = lamb / 2 * (np.sum(np.square(w1)) + np.sum(np.square(w2)))
    loss = softmax_loss + reg_loss
    return float(loss / y.shape[0])  # to normalize by the number of samples

To gain a deeper understanding of how the softmax equation within the 'forward()' function operates and how the 'for' loop within the 'loss()' function extracts the predicted probability of the expected class, please refer to [this notebook](https://github.com/Ifthekher237/DeepLearning-Projects/blob/main/Understanding%20Softmax%20and%20Loss%20Calculation%20in%20Neural%20Networks.ipynb).

Before we write the function for backward propagation, we're going to write the function for prediction. We'll use this function at test time: during inference, it predicts the output class for each sample.

We do this by taking the maximum output of the softmax function: the class with the highest probability is the prediction.
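A tiny illustration of that rule, using made-up probabilities and labels (the names `probs_demo`, `y_demo`, `preds`, and `accuracy` are hypothetical): `np.argmax` along axis 1 returns the most probable class per row, and comparing against the labels gives an accuracy.

```python
import numpy as np

probs_demo = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4]])
y_demo = np.array([0, 1, 1])

preds = np.argmax(probs_demo, axis=1)  # column index of the largest probability in each row
print(preds)  # [0 1 0]

accuracy = np.mean(preds == y_demo)    # fraction of rows where the prediction matches the label
print(accuracy)
```

Here two of the three rows are predicted correctly; the third is labeled 1 but gets predicted as class 0.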

In [None]:
# predict reuses the forward propagation function and changes only the return value:
# instead of the class probabilities, it returns the most probable class for each sample
def predict(x, model_dict):
    _, _, softmax = forward(x, model_dict)
    return np.argmax(softmax, axis=1)  # returns the index where the maximum occurs

In [None]:
def backpropagation(x, y, model_dict, epochs):
    for i in range(epochs):
        w1, b1, w2, b2 = retreive(model_dict)
        z1, a1, probs = forward(x, model_dict)

        # start applying backpropagation
        delta3 = np.copy(probs)
        # subtract 1 from the probability of the true class in every row:
        # delta3 = probs - one_hot(y), shape (200, 2)
        delta3[range(x.shape[0]), y] -= 1
        dw2 = (a1.T).dot(delta3)  # gradient of loss w.r.t. w2; a1: (200, 3), delta3: (200, 2) --> (3, 2)
        db2 = np.sum(delta3, axis=0, keepdims=True)  # (1, 2), one bias gradient per output neuron
        delta2 = delta3.dot(w2.T) * (1 - np.power(np.tanh(z1), 2))  # tanh'(z1) = 1 - tanh(z1)^2, shape (200, 3)
        dw1 = np.dot(x.T, delta2)  # (2, 3)
        db1 = np.sum(delta2, axis=0, keepdims=True)  # (1, 3), one bias gradient per hidden neuron

        # add regularization terms: the gradient of lamb/2 * sum(w**2) w.r.t. w is lamb * w
        # (see the image attached below to understand why)
        dw2 += lamb * w2
        dw1 += lamb * w1

        # update the weights (W <== W + (-lr*gradient))
        w1 += -learning_rate * dw1
        b1 += -learning_rate * db1
        w2 += -learning_rate * dw2
        b2 += -learning_rate * db2

        # update the model dictionary
        model_dict = {'w1' : w1, 'b1' : b1, 'w2' : w2, 'b2' : b2}

        # print the loss (every 50 epochs)
        if i % 50 == 0:
            print("loss at epoch {} is : {:.3f}".format(i, loss(probs, y, model_dict)))

    return model_dict  # updated model_dict
            

![regularization](NN_from_scratch_architecture_pic4_regularization.png)
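For reference, the gradient of the L2 penalty can be derived directly from the `reg_loss` term in `loss()`. This is a sketch based on that code rather than a transcription of the figure; here $\lambda$ stands for `lamb`, and $W_1$, $W_2$ for `w1`, `w2`.

$$
L_{\text{reg}} = \frac{\lambda}{2}\left(\sum_{i,j} (W_1)_{ij}^2 + \sum_{i,j} (W_2)_{ij}^2\right)
$$

Each weight appears in exactly one squared term, so differentiating with respect to a single entry gives

$$
\frac{\partial L_{\text{reg}}}{\partial (W_2)_{ij}} = \frac{\lambda}{2}\cdot 2\,(W_2)_{ij} = \lambda\,(W_2)_{ij}
\quad\Longrightarrow\quad
\frac{\partial L_{\text{reg}}}{\partial W_2} = \lambda W_2 ,
$$

and likewise $\partial L_{\text{reg}} / \partial W_1 = \lambda W_1$. The regularization term added to each weight gradient is therefore the weight matrix itself scaled by $\lambda$, with the same shape as the gradient it is added to.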

In [None]:
# initialize weights (Xavier-style initialization: scale random normal values by 1/sqrt(number of inputs))
def init_network(input_dim, hidden_dim, output_dim):
    model = {}
    w1 = np.random.randn(input_dim, hidden_dim) / np.sqrt(input_dim)
    # biases are initialized as zeros rather than from a normal distribution; training updates them later
    b1 = np.zeros((1, hidden_dim))  # one bias for each neuron of the hidden layer
    w2 = np.random.randn(hidden_dim, output_dim) / np.sqrt(hidden_dim)
    b2 = np.zeros((1, output_dim))  # one bias for each neuron of the output layer
    
    model['w1'] = w1
    model['b1'] = b1
    model['w2'] = w2
    model['b2'] = b2
    
    return model

In [None]:
model_dict = init_network(input_dim = input_neurons, hidden_dim = 3, output_dim = output_neurons)
model = backpropagation(x, y, model_dict, 1500)
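A common way to sanity-check hand-written gradients like these is a finite-difference gradient check. The sketch below is self-contained and makes a few assumptions: softmax is applied directly to the output-layer scores, the penalty is `lamb / 2 * sum(w**2)`, and the loss is the un-normalized sum over samples. The helper names `forward_loss`, `analytic_grads`, and `numeric_grad` are hypothetical and not part of the notebook's code.

```python
import numpy as np

def forward_loss(x, y, w1, b1, w2, b2, lamb):
    # Forward pass plus summed cross-entropy and L2 penalty
    a1 = np.tanh(x @ w1 + b1)
    z2 = a1 @ w2 + b2
    z2 = z2 - z2.max(axis=1, keepdims=True)  # shift scores for numerical stability
    exp_scores = np.exp(z2)
    probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
    data_loss = -np.log(probs[np.arange(len(y)), y]).sum()
    reg_loss = lamb / 2 * (np.sum(w1 ** 2) + np.sum(w2 ** 2))
    return data_loss + reg_loss, a1, probs

def analytic_grads(x, y, w1, b1, w2, b2, lamb):
    # Backpropagation formulas for the weight matrices
    _, a1, probs = forward_loss(x, y, w1, b1, w2, b2, lamb)
    delta3 = probs.copy()
    delta3[np.arange(len(y)), y] -= 1            # probs - one_hot(y)
    dw2 = a1.T @ delta3 + lamb * w2
    delta2 = (delta3 @ w2.T) * (1 - a1 ** 2)     # tanh'(z1) = 1 - a1^2
    dw1 = x.T @ delta2 + lamb * w1
    return dw1, dw2

def numeric_grad(f, w, eps=1e-5):
    # Central differences: perturb one weight at a time (in place) and re-evaluate f
    grad = np.zeros_like(w)
    for idx in np.ndindex(*w.shape):
        old = w[idx]
        w[idx] = old + eps
        plus = f()
        w[idx] = old - eps
        minus = f()
        w[idx] = old
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(5, 2))
y_demo = rng.integers(0, 2, size=5)
w1 = rng.normal(size=(2, 3)); b1 = np.zeros((1, 3))
w2 = rng.normal(size=(3, 2)); b2 = np.zeros((1, 2))
lamb_demo = 0.01

loss_fn = lambda: forward_loss(x_demo, y_demo, w1, b1, w2, b2, lamb_demo)[0]
dw1, dw2 = analytic_grads(x_demo, y_demo, w1, b1, w2, b2, lamb_demo)
max_err = max(np.abs(dw1 - numeric_grad(loss_fn, w1)).max(),
              np.abs(dw2 - numeric_grad(loss_fn, w2)).max())
print("largest difference between analytic and numeric gradients:", max_err)
```

If the analytic formulas match the loss, the reported difference should be tiny compared with the gradient values themselves; a large difference usually points to a mismatch between the forward pass and the backward formulas.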