# Neural Networks from scratch in Python

Neural networks are powerful machine learning models that have revolutionized the field of artificial intelligence.
A neural network can be thought of as the functional unit of deep learning, loosely modeled on the way neurons in the human brain work together to solve complex data-driven problems. Neural networks can be used to solve a wide range of problems, including image and speech recognition, natural language processing, and predictive modeling.
But how do neural networks actually work? Do they really mimic our brain neurons? How?

In this article we will walk through the basic construction of a neural network and see how mathematics turns this idea into reality.

## What is an Artificial Neural Network?
An artificial neural network is a machine learning algorithm inspired by the human brain that learns from data by processing information through layers of interconnected neurons. This process is similar to how humans learn, where mistakes are made and corrected through training to improve performance. Mathematically, a neural network can be represented as a complex function that maps input variables to output variables, with its parameters optimized through training to minimize a loss function.

## Intuition
Let's say that you are learning a new sport. First you take beginners' training and get familiar with the basic rules of the sport. Similarly, a neural network starts from an initial set of parameters and uses them to turn the dataset it is given into predictions. This is called Feed Forward. Once your training is done, your coach makes you play a game. During your first game you understand how the game should be played by observing other players and yourself. You evaluate your play by spotting the differences between your actions and those of players who are skilled at the sport. Similarly, our neural network calculates how far its prediction is from the actual result and tries to correct it by adjusting the parameters. This is called Back Propagation, where performance evaluation and error correction are done. Lastly, you will become a pro player only by practising the game. Our neural network also repeats this iterative process of feed forward and back propagation until its predictions are close enough to the correct output. So the ultimate goal is to minimize the error and increase the accuracy by training repeatedly.

Just like a beginner athlete who practices and repeats their movements until they become automatic, a neural network needs to be trained on a large dataset repeatedly to learn patterns and relationships between inputs and outputs.

## The Network
A neural network is composed of multiple layers of interconnected neurons. By stacking multiple layers of neurons together, a neural network can learn to model complex functions that map input data features to output predictions.
Let's try to predict the output of an XOR gate. 
<br />
<center>
<img src="2-layer-ann.png" width=400 />
</center>

Consider the neural network below. It contains an input layer, one hidden layer and an output layer. The input layer contains 3 neurons, the hidden layer contains 4 and the final output layer contains a single neuron.

<br />
<center>
<img src="ann-1-iniput.png" width=400 />
</center>

## 1. Neurons
Every neuron is a mathematical function that accepts inputs and produces an output that is passed on to other neurons in the next layer. The output is produced based on certain parameters called weights and biases. 
The input to each neuron is the weighted sum of the data points plus the bias, $Wx + b$.
Let's define a class with these basic components.

## 2. Weights and Biases
Weights define the importance or contribution of a particular feature to the output, while a bias is a constant offset added to the weighted sum. We have two sets of weights, one between the input layer and the hidden layer and another between the hidden layer and the output layer.

In [2]:
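import numpy as np

class NeuralNetwork:
    def __init__(self, x, y):
        self.input      = x
        # weights between the input layer and the 4-neuron hidden layer
        self.weights1   = np.random.rand(self.input.shape[1], 4)
        # weights between the hidden layer and the single output neuron
        self.weights2   = np.random.rand(4, 1)
        self.y          = y
        self.output     = np.zeros(y.shape)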
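
As a minimal sketch of what a single neuron computes, the snippet below forms the weighted sum $Wx + b$ by hand. The input values, weights and bias are made-up numbers chosen only for illustration; they are not taken from the network above.

```python
import numpy as np

# Hypothetical values, chosen only to illustrate the arithmetic
x = np.array([0.0, 1.0, 1.0])    # one data point with three features
w = np.array([0.5, -0.2, 0.1])   # one weight per feature
b = 0.3                          # bias of this neuron

# Weighted sum of the inputs plus the bias: Wx + b
weighted_sum = np.dot(w, x) + b
print(weighted_sum)
```

Here the sum works out to roughly 0.2, since the first feature is zero and contributes nothing regardless of its weight.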

## 3. Activation Function
An activation function is a mathematical function that is applied to the output of a neuron in a neural network. It introduces the non-linearity that lets the network model problems such as XOR, which no purely linear model can solve. There are several activation functions, such as sigmoid, hyperbolic tangent and ReLU, and the choice depends on the nature of the problem. We will use the *sigmoid* function in our example. It squashes any real input into the range (0, 1) and is often described as a rough analogue of how a biological neuron's firing saturates.
The sigmoid function is defined as
$$ \sigma(x) = \dfrac{1} {1+e^{-x}}$$

In [3]:
def sigmoid(x):
    return 1.0/(1+ np.exp(-x))
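
A quick sketch of how the sigmoid behaves on a few sample inputs. The inputs are arbitrary; the point is that large negative values map close to 0, large positive values map close to 1, and 0 maps to exactly 0.5.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Arbitrary sample inputs, from strongly negative to strongly positive
inputs = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
outputs = sigmoid(inputs)

for value, result in zip(inputs, outputs):
    print(f"sigmoid({value:+.1f}) = {result:.4f}")
```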

## 4. Feedforward
If the input is $x$, a simple 2-layer neural network like the one above computes its output $\hat{y}$ in two steps, one per layer:

$${y_1} = \sigma(W_1x + b_1)$$

$$\hat{y} = \sigma\;(W_2{y_1} + b_2)$$

Here, $y_1$ is the output of the first layer, which is passed as the input to the second layer.
**W1 and W2** are the weights and **b1 and b2** are the biases.
Combining the two equations above:

$$\hat{y} = \sigma\;(W_2\sigma(W_1x + b_1) + b_2)$$

For simplicity we will ignore the biases. So we are now left with

$$\hat{y} = \sigma\;(W_2\sigma(W_1x))$$

This is the initial coaching of our neural network. Now it's time for error correction.

In [4]:
def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))
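
To see the shapes involved, here is a standalone sketch of the same two-step forward pass outside the class. It assumes four observations with three inputs each (the XOR-style rows used later in this article) and a fixed random seed, which is an addition for repeatability. Note that the code multiplies the inputs on the left, `np.dot(X, weights1)`, which is the batched, row-vector form of $W_1x$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)     # fixed seed: an assumption, for repeatability
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]])          # 4 observations, 3 inputs each
weights1 = rng.random((3, 4))      # input layer -> hidden layer
weights2 = rng.random((4, 1))      # hidden layer -> output layer

layer1 = sigmoid(np.dot(X, weights1))       # hidden activations, shape (4, 4)
output = sigmoid(np.dot(layer1, weights2))  # predictions, shape (4, 1)
print(layer1.shape, output.shape)           # (4, 4) (4, 1)
```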


## 5. Loss Function
The loss function, also called the cost function, is a mathematical function that measures the difference between the predicted output of a neural network and the actual output, also known as the ground truth. It evaluates the performance of the model and indicates the adjustments to be made to the weights and biases to increase the accuracy.
The goal is to find the optimum set of weights and biases in order to minimize the loss function. The choice of loss function depends on the category of the problem. Here, we’ll use a simple **sum-of-squares** error as our loss function. The sum-of-squares error is the sum of the squared differences between each predicted observation $\hat{y}$ and the actual observation $y$. Squaring makes every term non-negative, so positive and negative errors cannot cancel each other out, and it penalises large errors more heavily:

$$\text{sum-of-squares-error (L)} = \sum_{j=\text{observations}} (y_j - \hat{y_j})^2$$
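
As a small worked example, the sketch below evaluates the sum-of-squares error for the four XOR targets against a set of made-up predictions. The prediction values are hypothetical and chosen so the arithmetic is easy to follow: $0.1^2 + 0.2^2 + 0.3^2 + 0.4^2 = 0.30$.

```python
import numpy as np

y = np.array([[0], [1], [1], [0]])               # XOR targets
y_hat = np.array([[0.1], [0.8], [0.7], [0.4]])   # hypothetical predictions

squared_errors = (y - y_hat) ** 2   # one squared error per observation
loss = np.sum(squared_errors)       # a single number for the whole dataset
print(loss)
```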


## 6. Backpropagation
Here comes the important step! Now that we have a measure of how good our predictions are, namely the loss function, we update the weights and biases with the goal of minimising the error. To do this we take the derivative of the loss function with respect to each weight and bias and use it in a procedure called gradient descent.
The derivative gives the slope of the loss at the current parameter values: its sign tells us in which direction the loss increases, and its magnitude how steeply. To reduce the loss we therefore step in the opposite direction.
So we will update our weights (and biases) thusly:

$$W \rightarrow W - \frac{\partial L}{\partial W}$$

$$b \rightarrow b - \frac{\partial L}{\partial b}$$
where $\partial$ denotes the partial derivative and $\text{L} = \text{Loss}(y, \hat{y})$. In practice the gradient is usually scaled by a small learning rate; this article uses a step size of 1.
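
To make the idea concrete, here is a one-dimensional sketch of gradient descent on a toy loss $(w - 3)^2$, whose minimum is at $w = 3$. Each step moves against the slope. The toy loss, the starting point and the learning rate of 0.1 are all assumptions made for illustration and are not part of the network in this article.

```python
def loss(w):
    # Toy loss with a single minimum at w = 3
    return (w - 3.0) ** 2

def loss_gradient(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

w = 0.0                # arbitrary starting point
learning_rate = 0.1    # assumed step size

for step in range(100):
    # Step against the slope: this is what reduces the loss
    w = w - learning_rate * loss_gradient(w)

print(w, loss(w))
```

After 100 steps `w` sits very close to 3 and the loss is close to 0. Flipping the sign of the update would move `w` away from the minimum instead.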

## 7. The Math!
Since we have *two* sets of weights, we need *two* derivatives of $\text{Loss}(y, \hat{y})$:

$$\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial W_2}$$

$$\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial W_1}$$

The loss function is written in terms of $y$ and $\hat{y}$, not the weights. But $\hat{y}$ is itself a function of the weights (the target $y$ is fixed), so by the chain rule we can rewrite the derivatives as:

$$\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial W_2} = 
\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial \hat{y}} * \frac{\partial\hat{y}}{\partial W_2} $$

$$\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial W_1} = 
\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial \hat{y}}* \frac{\partial\hat{y}}{\partial z} * \frac{\partial z}{\partial W_1} $$

Given that,
$$z = \sigma(W_1 x)$$

$$\hat{y} = \sigma(W_2 z)$$

where $z$ is the hidden-layer output, called $y_1$ in the Feedforward section.

Let's solve for the common term first. Differentiating $(y - \hat{y})^2$ with respect to $\hat{y}$ brings out a factor of $-1$ from the inner term:

$$ \frac{\partial \;\text{Loss}(y, \hat{y})}{\partial \hat{y}} 
= \frac{\partial}{\partial \hat{y}} \sum_{i=1}^{n} (y - \hat{y})^2 
= -\sum_i \; 2(y - \hat{y})$$

Substituting this and the equations for $z$ and $\hat{y}$ we get:


$$\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial W_2} = 
-\sum_i \; 2(y - \hat{y}) * \frac{\partial \;\sigma(W_2 z)}{\partial W_2}$$

$$\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial W_1} = 
-\sum_i \; 2(y - \hat{y}) * \frac{\partial\;\sigma(W_2 z)}{\partial z} * \frac{\partial \;\sigma(W_1 x)}{\partial W_1}$$


In order to take the remaining derivatives we will apply the chain rule.

The chain rule is stated as: $\dfrac{d}{dx}\left[f(g(x))\right] = f'(g(x)) * g'(x)$

So with the change of variable $W_2 \rightarrow q$:

$$\frac{\partial \;\sigma(W_2 z)}{\partial W_2} = \frac{\partial \;\sigma(zq)}{\partial q} = 
\sigma'(zq) * \frac{\partial \;(zq)}{\partial q} = \sigma'(zq) * z$$

$$\frac{\partial \;\sigma(W_2 z)}{\partial z} = W_2 * \sigma'(W_2 z)$$

So the first equation becomes:

$$\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial W_2} = 
-\sum_i \; 2(y - \hat{y}) * \sigma'(W_2 z) * z$$

Similarly, changing variable $W_1 \rightarrow q$,

$$\frac{\partial \;\sigma(W_1 x)}{\partial W_1} = \frac{\partial \;\sigma(xq)}{\partial q} = 
\sigma'(xq) * \frac{\partial \;(xq)}{\partial q} = \sigma'(xq) * x$$

And the second equation:

$$\frac{\partial \;\text{Loss}(y, \hat{y})}{\partial W_1} = 
-\sum_i \; 2(y - \hat{y}) * \sigma'(W_2 z) * W_2 * \sigma'(W_1 x) * x$$

Gradient descent subtracts these gradients from the weights. Because each gradient carries a leading minus sign, subtracting it is the same as *adding* the same expression without the minus sign, which is the form the code uses.

Now that we have that, let’s add the backpropagation function into our Python code.

In [5]:
# function to calculate the derivative of the sigmoid activation function
def sigmoid_derivative(x):
    return sigmoid(x) * (1.0 - sigmoid(x))

def backpropagation(self):
        # application of the chain rule to find the derivative of the loss function with respect to weights2 and weights1
        sigmoid_derivative_1 = sigmoid_derivative(np.dot(self.input, self.weights1)) #sigma'(W1 x)
        sigmoid_derivative_2 = sigmoid_derivative(np.dot(self.layer1, self.weights2)) #sigma'(W2 z)
        # 2*(y - output) is the negative of dLoss/d(output), so d_weights2 and
        # d_weights1 below are the negative gradients of the loss
        d_weights2 = np.dot(self.layer1.T, 
                            (2*(self.y - self.output) * sigmoid_derivative_2))
        d_weights1 = np.dot(self.input.T,  
                            np.dot(2*(self.y - self.output) * sigmoid_derivative_2, self.weights2.T) * 
                            sigmoid_derivative_1)

        # gradient descent step (step size 1): adding the negative gradient moves the weights downhill
        self.weights1 += d_weights1
        self.weights2 += d_weights2
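
One way to gain confidence in a derivation like this is a gradient check: compare the analytic gradients with a finite-difference estimate of the same loss. The sketch below is not part of the network itself. The helper names `analytic_gradients` and `numerical_gradient`, the fixed seed and the step `eps` are assumptions made for illustration. The analytic gradient here includes the minus sign that comes from differentiating $(y - \hat{y})^2$ with respect to $\hat{y}$.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    return sigmoid(x) * (1.0 - sigmoid(x))

def loss(X, y, W1, W2):
    # Sum-of-squares error of the 2-layer network (no biases)
    z = sigmoid(X @ W1)
    y_hat = sigmoid(z @ W2)
    return np.sum((y - y_hat) ** 2)

def analytic_gradients(X, y, W1, W2):
    # Chain rule, written in the batched (row-vector) form used by the code
    a1 = X @ W1
    z = sigmoid(a1)
    a2 = z @ W2
    y_hat = sigmoid(a2)
    delta2 = -2.0 * (y - y_hat) * sigmoid_derivative(a2)   # dLoss/d(a2)
    grad_W2 = z.T @ delta2
    delta1 = (delta2 @ W2.T) * sigmoid_derivative(a1)      # dLoss/d(a1)
    grad_W1 = X.T @ delta1
    return grad_W1, grad_W2

def numerical_gradient(f, W, eps=1e-6):
    # Central finite differences, perturbing one weight at a time in place
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + eps
        plus = f()
        W[idx] = original - eps
        minus = f()
        W[idx] = original            # restore the weight
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

rng = np.random.default_rng(0)       # fixed seed: an assumption, for repeatability
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
W1 = rng.random((3, 4))
W2 = rng.random((4, 1))

grad_W1, grad_W2 = analytic_gradients(X, y, W1, W2)
num_W1 = numerical_gradient(lambda: loss(X, y, W1, W2), W1)
num_W2 = numerical_gradient(lambda: loss(X, y, W1, W2), W2)

# Both differences should be tiny if the derivation is right
print(np.max(np.abs(grad_W1 - num_W1)), np.max(np.abs(grad_W2 - num_W2)))
```

The quantities `d_weights1` and `d_weights2` in the article's code correspond to `-grad_W1` and `-grad_W2` here, which is why the code adds them to the weights.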

## 8. Training
Let's put everything together into a class and train our neural network. The class relies on the `sigmoid` and `sigmoid_derivative` functions defined earlier.

In [19]:
import numpy as np
import matplotlib.pyplot as plt
class NeuralNetwork:
    def __init__(self, x, y):
        self.input      = x
        self.weights1   = np.random.rand(self.input.shape[1],4) 
        self.weights2   = np.random.rand(4,1)                 
        self.y          = y
        self.output     = np.zeros(self.y.shape)

    def feedforward(self):
        self.layer1 = sigmoid(np.dot(self.input, self.weights1))
        self.output = sigmoid(np.dot(self.layer1, self.weights2))
        return self.calculate_loss()
        
    def reload(self, x):
        self.input = x
        
    def predict(self):
        return self.output
    
    def calculate_loss(self):
        # sum-of-squares error over all observations (a single number per iteration)
        return np.sum((self.y - self.output) ** 2)
    
    def plot_loss(self,loss):
        '''
        Plots the loss curve
        '''
        plt.plot(loss)
        plt.xlabel("Iteration")
        plt.ylabel("Sum-of-squares loss")
        plt.title("Loss curve for training")
        plt.show()  

    def backprop(self):
        # application of the chain rule to find the derivative of the loss function with respect to weights2 and weights1
        sigmoid_derivative_1 = sigmoid_derivative(np.dot(self.input, self.weights1)) #sigma'(W1 x)
        sigmoid_derivative_2 = sigmoid_derivative(np.dot(self.layer1, self.weights2)) #sigma'(W2 z)
        # 2*(y - output) is the negative of dLoss/d(output), so d_weights2 and
        # d_weights1 below are the negative gradients of the loss
        d_weights2 = np.dot(self.layer1.T, 
                            (2*(self.y - self.output) * sigmoid_derivative_2))
        d_weights1 = np.dot(self.input.T,  
                            np.dot(2*(self.y - self.output) * sigmoid_derivative_2, self.weights2.T) * 
                            sigmoid_derivative_1)

        # gradient descent step (step size 1): adding the negative gradient moves the weights downhill
        self.weights1 += d_weights1
        self.weights2 += d_weights2

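We are training on the XOR dataset. The first two columns of X are the XOR inputs; the third column is a constant 1, which effectively gives the hidden layer a bias term even though we dropped the explicit biases. So our X and y will be as below: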

In [20]:
X = np.array([[0,0,1],
              [0,1,1],
              [1,0,1],
              [1,1,1]])
y = np.array([[0],[1],[1],[0]])
X.shape, y.shape

((4, 3), (4, 1))

Let's train!!

In [25]:
nn = NeuralNetwork(X,y)
loss = []
for i in range(1500):
    loss.append(nn.feedforward())
    nn.backprop()
print(nn.output)

[[0.00854594]
 [0.96724274]
 [0.96728283]
 [0.03872196]]
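
The network outputs values between 0 and 1 rather than hard class labels. A minimal sketch of turning them into labels follows, using the outputs printed above. The threshold of 0.5 is an assumption; it is a common choice for sigmoid outputs but is not specified anywhere in this article.

```python
import numpy as np

# Outputs copied from the training run shown above
output = np.array([[0.00854594],
                   [0.96724274],
                   [0.96728283],
                   [0.03872196]])
y = np.array([[0], [1], [1], [0]])   # XOR targets

# Assumed decision threshold of 0.5
predicted_labels = (output > 0.5).astype(int)
accuracy = np.mean(predicted_labels == y)

print(predicted_labels.ravel(), accuracy)   # [0 1 1 0] 1.0
```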


## Results

If we plot the Loss at each iteration, we find:

<br />
<center>
<img src="ann-1-loss.png" width=600 />
</center>

Let’s look at the final predictions (output) from the neural network after 1500 training iterations.

<br />
<center>
<img src="ann-1-predictions.png" width=200 />
</center>

Congratulations, you now have a fully functional, 2-layer neural network for a binary classification task.