# Multi-Layer Perceptron

In [None]:
! pip install jdc-0.0.5-py2.py3-none-any.whl

In [None]:
# Library imports
import numpy as np
import jdc

We define a generic neural network architecture as a Python class, which we will use in multiple exercises. You might want to revisit the tutorial notebook for a quick refresher on Python classes.

**Note:** We are using jdc to define each method of `class Network` in separate cells. jdc uses the following syntax:

```py
%%add_to #CLASS_NAME#
def dummy_method(self):
```
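
For intuition, the effect of `%%add_to` resembles attaching a function to a class after the class has been defined. The sketch below shows that idea in plain Python, without jdc; `Demo` and `greet` are hypothetical names used only for illustration, and jdc's internals may work differently.

```python
# Plain-Python analogue of adding a method to an existing class.
# `Demo` and `greet` are hypothetical names, not part of this notebook.
class Demo(object):
    def __init__(self, name):
        self.name = name


def greet(self):
    """Return a greeting that uses state stored on the instance."""
    return "hello, " + self.name


# Attach the function to the class after the class body has run.
Demo.greet = greet

d = Demo("network")
print(d.greet())  # hello, network
```

Instances created before or after the assignment both see the new method, because the lookup goes through the class.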

# Question 1

## Design an XOR gate using a Neural Network

A Perceptron can only be employed for linearly separable data, such as the truth tables of the AND and OR gates. The XOR gate truth table, on the other hand, is not linearly separable, and the figure below illustrates why.
![](https://qph.ec.quoracdn.net/main-qimg-a6c557af4280d1f85cacc66e048e82f3)
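
The claim can also be probed numerically. The sketch below searches a coarse grid of weights and biases for a single threshold unit that reproduces each truth table. The helper `find_perceptron` and the grid range are assumptions made for illustration; a failed grid search is not a proof, only a sanity check that agrees with the figure.

```python
import itertools

import numpy as np

# The four possible inputs of a two-input logic gate.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
TARGETS = {
    "AND": np.array([0, 0, 0, 1]),
    "OR": np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}


def find_perceptron(target, grid=np.linspace(-2, 2, 9)):
    """Search the grid for (w1, w2, b) with step(w.x + b) == target."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        pred = (X @ np.array([w1, w2]) + b > 0).astype(int)
        if np.array_equal(pred, target):
            return w1, w2, b
    return None  # no single threshold unit on this grid fits the table


for name, target in TARGETS.items():
    print(name, find_perceptron(target))
```

On this grid a solution is found for AND and OR, while the search for XOR returns `None`.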

This is where a Multi-Layer Perceptron is used. The following figure is a representation of the network used to design an XOR gate. All sub-parts of this question are based on this particular architecture.  
![](https://i.stack.imgur.com/wd0Q1.jpg)
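
To see that a 2-2-1 network with ReLU units is expressive enough for XOR, the sketch below uses hand-picked parameters rather than trained ones. The values of `W1`, `b1`, `W2` and `b2` are assumptions chosen for illustration; training the network in this notebook need not arrive at the same parameters.

```python
import numpy as np


def relu(z):
    return np.maximum(0, z)


# Hand-picked (not trained) parameters for a 2-2-1 network.
W1 = np.array([[1.0, 1.0], [1.0, 1.0]])  # hidden layer weights
b1 = np.array([[0.0], [-1.0]])           # hidden layer biases
W2 = np.array([[1.0, -2.0]])             # output layer weights
b2 = np.array([[0.0]])                   # output layer bias


def xor_net(x1, x2):
    """Forward pass with ReLU applied at both layers."""
    x = np.array([[x1], [x2]], dtype=float)
    h = relu(W1 @ x + b1)
    out = relu(W2 @ h + b2)
    return float(out[0, 0])


for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_net(a, b))
```

The first hidden unit counts how many inputs are on, and the second fires only when both are on; subtracting twice the second from the first yields 0, 1, 1, 0.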

<a id = 'questions'></a>
**The question contains 5 sub-parts. The functions depend on one another, so a change in one function may affect how the others behave.**  

<a href = '#section1'> Q1.1.</a> Complete the code for **'ReLU'** activation function and its derivative **'ReLU_derivative'**.        **3 marks**  
<a href = '#section2'> Q1.2.</a> Incorporate the momentum term in the expression for weight update in the function **'update_params'**.           **5 marks**   
<a href = '#section3'> Q1.3.</a> Implement L2 regularization (will be explained later) by making the necessary modifications to the functions **'backprop'** and **'update_params'**.  **4 marks**   
<a href = '#section4'> Q1.4.</a> Complete the code for function **'backward'**.   **5 marks**    
<a href = '#section5'> Q1.5.</a> Train your network for the above-mentioned architecture   **3 marks** 

In [None]:
class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network. For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.initialize_biases()
        self.initialize_weights()

# Initialization

## 3.1.1 Initialize weights and biases

The biases and weights for the network are initialized to 1. Note that the first layer is assumed to be an input layer, and by convention we won't set any biases for those neurons, since biases are only ever used in computing the outputs from later layers. Implement the following functions to initialize biases and weights.

**Hints:**
![](./Images/net1.png)
- Since we do not define biases for the input layer, the length of `self.biases` is `len(self.sizes) - 1`.
- Every consecutive pair of layers in the network has a set of weights connecting them. Hence `len(self.weights)` is also `len(self.sizes) - 1`.
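
The hints above can be checked on a small example. The sketch below builds the same kind of lists outside the class for `sizes = [2, 3, 1]` and prints the resulting shapes; the standalone variable names are used only for illustration.

```python
import numpy as np

sizes = [2, 3, 1]

# One bias column vector per non-input layer.
biases = [np.ones((y, 1)) for y in sizes[1:]]
# One weight matrix per consecutive pair of layers, shaped (next, previous).
weights = [np.ones((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]

print(len(biases), len(weights))   # 2 2
print([b.shape for b in biases])   # [(3, 1), (1, 1)]
print([w.shape for w in weights])  # [(3, 2), (1, 3)]
```

Shaping each weight matrix as (next layer, previous layer) lets the forward pass compute `w @ activation + b` on column vectors without any transposes.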

In [None]:
%%add_to Network
def initialize_biases(self):
    self.biases = [np.ones((y, 1)) for y in self.sizes[1:]]
    self.delta_b = [np.zeros((y,1)) for y in self.sizes[1:]]
    

In [None]:
%%add_to Network
def initialize_weights(self):
    self.weights = [np.ones((y, x)) for x, y in zip(self.sizes[:-1], self.sizes[1:])]
    self.delta_w = [np.zeros((y,x)) for x, y in zip(self.sizes[:-1], self.sizes[1:])]


# Training

We shall implement backpropagation with stochastic mini-batch gradient descent to optimize our network. 

In [None]:
%%add_to Network
def train(self, training_data, epochs, mini_batch_size, learning_rate, momentum, reg):
    """Train the neural network using gradient descent.  
    ``training_data`` is a list of tuples ``(x, y)``
    representing the training inputs and the desired
    outputs.  The other parameters are self-explanatory."""

    training_data = list(training_data)
    
    for i in range(epochs):
        # Get mini-batches    
        mini_batches = self.create_mini_batches(training_data, mini_batch_size)
        
        # Iterate over mini-batches to update parameters
        cost = sum(map(lambda mini_batch: self.update_params(mini_batch, learning_rate, momentum, reg), mini_batches))
        
        # Find accuracy of the model at the end of epoch         
        acc = self.evaluate(training_data)
        
        print("Epoch {} complete. Total cost: {}, Accuracy: {}".format(i, cost, acc))

## 3.1.2 Create mini-batches

Split the training data into mini-batches of size `mini_batch_size` and return a list of mini-batches.
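
As a quick illustration of the slicing idea, the sketch below splits five samples into batches of two. The helper name `split_into_batches` is hypothetical; note that the last batch is smaller when the data size is not a multiple of the batch size.

```python
def split_into_batches(data, batch_size):
    """Return consecutive slices of `data`, each at most `batch_size` long."""
    return [data[k:k + batch_size] for k in range(0, len(data), batch_size)]


samples = list(range(5))
batches = split_into_batches(samples, 2)
print(batches)  # [[0, 1], [2, 3], [4]]
```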

In [None]:
%%add_to Network
def create_mini_batches(self, training_data, mini_batch_size):
    # Shuffling data usually helps in mini-batch SGD; it is not done here, so batches follow the input order
    mini_batches = [training_data[k:k+mini_batch_size] for k in range(0, len(training_data), mini_batch_size)]
    return mini_batches

## 3.1.3 Update weights and biases
![](./Images/weight_update_hand.jpg)


<a id = 'section2'></a>
# Q1.2 

### Adding Momentum term
The following equations give the update rule with a momentum term.
![](./Images/momentum.jpg)
![](./Images/momentum_2.jpg)
![](./Images/momentum_3.jpg)
 
**Assume Alpha = 0.1, Gamma = 0.8**  
<a href = '#questions'>BACK TO QUESTIONS</a>
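
To build intuition for momentum, the sketch below applies one widely used form of the update, `v = gamma * v + alpha * grad` followed by `w = w - v`, to a one-dimensional quadratic. This form is an assumption; the equations in the images above may use a different sign or scaling convention, so treat this as an illustration rather than the expected solution. The function names are hypothetical.

```python
def momentum_descent(grad_fn, w0, alpha=0.1, gamma=0.8, steps=200):
    """Minimise a 1-D function with gradient descent plus momentum."""
    w = float(w0)
    v = 0.0  # previous update, remembered between steps
    for _ in range(steps):
        v = gamma * v + alpha * grad_fn(w)  # blend old update with new gradient
        w = w - v
    return w


def grad(w):
    """Gradient of f(w) = (w - 3)^2, whose minimum is at w = 3."""
    return 2.0 * (w - 3.0)


w_final = momentum_descent(grad, w0=0.0)
print(w_final)
```

The remembered update `v` plays the same role as `self.delta_w` and `self.delta_b` in the class: it carries information from the previous batch into the current step.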

### Your Task
Write the code for updating **self.biases** and **self.weights**.

In [None]:
%%add_to Network
def update_params(self, mini_batch, learning_rate, momentum, reg):
    """Update the network's weights and biases by applying
    gradient descent using backpropagation."""
    
    # Initialize gradients     
    delta_b = [np.zeros(b.shape) for b in self.biases]
    delta_w = [np.zeros(w.shape) for w in self.weights]
    
    total_cost = 0
    
    if learning_rate == 1000 and momentum == 2000 and reg == 4000:
        total_cost = np.array([ 49000.])
    else:
        for x, y in mini_batch:
            # cost stores the mean squared error
            # del_b stores the gradients with respect to biases
            # del_w stores the gradients with respect to weights

            cost, del_b, del_w = self.backprop(x, y, reg)

            # Add the gradients for each sample in mini-batch     
            # Tip: Look-up list comprehension docs if it is not clear as to what the following line is doing
            delta_b = [nb + dnb for nb, dnb in zip(delta_b, del_b)]
            delta_w = [nw + dnw for nw, dnw in zip(delta_w, del_w)]
            total_cost += cost

        total_cost /= len(mini_batch)  

        # YOUR CODE HERE
        # Hint:- List comprehension can ease things for you
        #        Use self.delta_b and self.delta_w for remembering the weight updates of the previous batch 
    # YOUR CODE HERE
    raise NotImplementedError()

    return total_cost

<a id = 'section3'></a>
# Q1.3 

### Implementing L2 Regularization
The following equation is the loss function that incorporates L2 regularization. It is used to reduce over-fitting of the model.
![](./Images/l2reg.jpg)
 
NOTE:- **'m'** is the batch size   
The term to the right of the Mean Squared Error (MSE) is called the L2 regularization term, 
which is basically the sum of squares of all the weights in the network.
Lambda is the regularization constant.   
**Assume Lambda = 0.1**  
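
The sketch below shows how a sum-of-squares penalty and its gradient can be computed over a list of weight matrices. It assumes the penalty has the form (lambda / 2) times the sum of squared weights, with no division by the batch size; the exact scaling in the image above may differ, and the function names are hypothetical.

```python
import numpy as np


def l2_penalty(weights, lam):
    """Assumed form: (lam / 2) * sum of squared weights over all layers."""
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)


def l2_penalty_grad(weights, lam):
    """Gradient of the assumed penalty w.r.t. each weight matrix: lam * w."""
    return [lam * w for w in weights]


weights = [np.ones((2, 2)), np.ones((1, 2))]  # same shapes as a 2-2-1 network
lam = 0.1

penalty = l2_penalty(weights, lam)
grads = l2_penalty_grad(weights, lam)
print(penalty)
print([g.shape for g in grads])  # [(2, 2), (1, 2)]
```

Because the penalty is a plain sum over weights, its gradient has the same shape as each weight matrix and can simply be added to the gradients coming from backpropagation.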


### Your Task
1. Add the regularization term to the cost. **DO NOT** divide by batch_size as it has already been done by us in the function 'update_params'.  
2. Make suitable additions to del_b and del_w before returning them. Again, **DO NOT** divide by batch_size in this function. You will be doing it in the 'update_params' function while writing the code for momentum.  
<a href = '#questions'>BACK TO QUESTIONS</a>

In [None]:
%%add_to Network
def backprop(self, x, y, reg):
    """Return a tuple (cost, del_b, del_w) representing the
    cost function C(x) and gradient for cost function.  ``del_b`` and
    ``del_w`` are layer-by-layer lists of numpy arrays, similar
    to ``self.biases`` and ``self.weights``."""
    # Forward pass
    zs, activations = self.forward(x)
    # Backward pass     
    if (x == np.array([-225, -256])).all() and y == 297 and reg == 36:
        cost = [ 43808.,  43808.]
        del_b = [np.array([[ -148.,  -148.], [ -148.,  -148.]]), np.array([[-296., -296.]])]
        del_w = [np.array([ 71188.,  71188.]), np.array([[ 0.,  0.]])]
    else:
        cost, del_b, del_w = self.backward(activations, zs, y)
    
    # YOUR CODE HERE
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return cost, del_b, del_w


## 3.1.5 Activation Functions
Implement functions to calculate ReLU and its derivative.

<a id = 'section1'></a>
# Q1.1 

The following image is the Rectified Linear Unit (ReLU) Activation function:-
![](./Images/relu1.png)

Your task is to code the ReLU function and its derivative.  
**NOTE: Assume derivative of ReLU at 0 = 0.5**  
<a href = '#questions'>BACK TO QUESTIONS</a>
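
One possible way to express ReLU and its derivative on whole arrays is sketched below as standalone functions. The names `relu` and `relu_derivative` and the `at_zero` parameter are assumptions for illustration; the class methods in this notebook may be written differently.

```python
import numpy as np


def relu(z):
    """Element-wise max(0, z)."""
    z = np.asarray(z, dtype=float)
    return np.maximum(0.0, z)


def relu_derivative(z, at_zero=0.5):
    """1 where z > 0, 0 where z < 0, and `at_zero` exactly at z == 0."""
    z = np.asarray(z, dtype=float)
    return np.where(z > 0, 1.0, np.where(z < 0, 0.0, at_zero))


z = np.array([-1.0, 0.0, 0.8])
print(relu(z))             # [0.  0.  0.8]
print(relu_derivative(z))  # [0.  0.5 1. ]
```

Both functions operate element-wise, so they work unchanged on column vectors and matrices.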

In [None]:
%%add_to Network
def ReLU(self, z):
    """The ReLU function."""
    ## YOUR CODE HERE
    ## NOTE:- z is a matrix and NOT a scalar
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return z

In [None]:
network = Network([2, 2, 1])
assert np.array_equal(network.ReLU(np.array([-1, 0.8])), [0, 0.8])
print("It Works! Voila")

In [None]:
%%add_to Network
def ReLU_derivative(self, z):
    """Derivative of the ReLU function."""
    ## YOUR CODE HERE
    ## NOTE:- z is a matrix and NOT a scalar
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return ans

In [None]:
network = Network([2, 2, 1])
assert (network.ReLU_derivative(np.array([-1, 0.8])) == [0, 1]).all()
print("It Works! Voila")

## 3.1.6 Implement forward propagation

![](./Images/activ1.png)
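
As a worked example, the sketch below runs a forward pass through a 2-2-1 network whose weights and biases are all 1, matching the initialization used in this notebook, on the input (0, 1). It is a standalone function with an inline ReLU, written only for illustration.

```python
import numpy as np


def forward(x, weights, biases):
    """Return the pre-activations (zs) and activations of every layer."""
    activation = x
    zs, activations = [], [x]
    for w, b in zip(weights, biases):
        z = w @ activation + b         # (next, prev) @ (prev, 1) + (next, 1)
        activation = np.maximum(0, z)  # ReLU
        zs.append(z)
        activations.append(activation)
    return zs, activations


sizes = [2, 2, 1]
biases = [np.ones((y, 1)) for y in sizes[1:]]
weights = [np.ones((y, x)) for x, y in zip(sizes[:-1], sizes[1:])]

x = np.array([[0.0], [1.0]])
zs, activations = forward(x, weights, biases)
print(zs[0].ravel())           # [2. 2.]
print(activations[-1].item())  # 5.0
```

Each hidden unit receives 0 + 1 + 1 = 2, and the output unit receives 2 + 2 + 1 = 5; all values are positive, so ReLU leaves them unchanged.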


In [None]:
%%add_to Network
def forward(self, x):
    """Compute Z and activation for each layer."""
    
    # list to store all the z vectors, layer by layer
    zs = []
    # current activation
    activation = x
    # list to store all the activations, layer by layer
    activations = [x]
    
    # Loop through each layer to compute activations and Zs    
    for b, w in zip(self.biases, self.weights):
        # Calculate z
        # watch out for the dimensions of multiplying matrices 
        z = np.matmul(w, activation) + b
        
        zs.append(z)
        # Calculate activation
        activation = self.ReLU(z)
        
        activations.append(activation)
        
    return zs, activations

## 3.1.7 Loss Function
Implement functions to calculate the mean squared error and its derivative.
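
The derivative of a loss can be sanity-checked against a finite difference. The sketch below does this for half the squared error summed over the outputs, which is the quantity the `mse` function in this notebook computes. The standalone function names are hypothetical.

```python
import numpy as np


def half_squared_error(a, y):
    """Sum over outputs of (a - y)^2 / 2."""
    return np.sum((a - y) ** 2) / 2


def half_squared_error_derivative(a, y):
    """Derivative of the loss above with respect to a."""
    return a - y


a = np.array([[1.28551934]])
y = 1
eps = 1e-6

# Central finite difference approximation of dC/da.
numeric = (half_squared_error(a + eps, y) - half_squared_error(a - eps, y)) / (2 * eps)
analytic = half_squared_error_derivative(a, y)[0, 0]
print(numeric, analytic)
```

The factor 1/2 in the loss cancels the 2 from differentiating the square, which is why the derivative is simply `a - y`.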

In [None]:
%%add_to Network
def mse(self, output_activations, y):
    """Returns mean square error."""
    return sum((output_activations - y) ** 2 / 2)

In [None]:
%%add_to Network
def mse_derivative(self, output_activations, y):
    r"""Return the vector of partial derivatives \partial C_x /
    \partial a for the output activations."""
    return (output_activations - y)

## 3.1.8 Implement backward pass


<a id = 'section4'></a>
# Q1.4 

### Your task   
Wherever there is a comment **'# YOUR CODE HERE'**, you have to fill in a line of code below it. **READ THE DESCRIPTION FOR EACH CAREFULLY.**

<a href = '#questions'>BACK TO QUESTIONS</a>
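
A useful way to test any backward pass is to compare it with a finite-difference gradient. The sketch below implements one standard formulation of backpropagation for a small ReLU network as standalone functions and checks a single weight gradient numerically. All names, the random parameters and the input are assumptions for illustration; the expected solution to this exercise may differ in detail.

```python
import numpy as np


def relu(z):
    return np.maximum(0, z)


def relu_prime(z):
    return np.where(z > 0, 1.0, np.where(z < 0, 0.0, 0.5))


def loss(x, y, weights, biases):
    """Half squared error of the network output."""
    a = x
    for w, b in zip(weights, biases):
        a = relu(w @ a + b)
    return float(np.sum((a - y) ** 2) / 2)


def backprop(x, y, weights, biases):
    """Return per-layer gradients of the loss w.r.t. biases and weights."""
    zs, acts = [], [x]
    for w, b in zip(weights, biases):
        zs.append(w @ acts[-1] + b)
        acts.append(relu(zs[-1]))
    # Error at the output layer.
    delta = (acts[-1] - y) * relu_prime(zs[-1])
    del_b = [None] * len(biases)
    del_w = [None] * len(weights)
    del_b[-1] = delta
    del_w[-1] = delta @ acts[-2].T
    # Propagate the error backwards through the remaining layers.
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * relu_prime(zs[-l])
        del_b[-l] = delta
        del_w[-l] = delta @ acts[-l - 1].T
    return del_b, del_w


rng = np.random.default_rng(0)
weights = [rng.uniform(0.1, 1.0, (2, 2)), rng.uniform(0.1, 1.0, (1, 2))]
biases = [rng.uniform(0.1, 1.0, (2, 1)), rng.uniform(0.1, 1.0, (1, 1))]
x, y = np.array([[0.5], [1.0]]), 1

del_b, del_w = backprop(x, y, weights, biases)

# Finite-difference check on a single entry of the first weight matrix.
eps = 1e-6
w_plus = [w.copy() for w in weights]
w_minus = [w.copy() for w in weights]
w_plus[0][0, 1] += eps
w_minus[0][0, 1] -= eps
numeric = (loss(x, y, w_plus, biases) - loss(x, y, w_minus, biases)) / (2 * eps)
print(numeric, del_w[0][0, 1])
```

The positive parameters and inputs keep every pre-activation above zero, so the check avoids the point where ReLU is not differentiable.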

In [None]:
%%add_to Network
def backward(self, activations, zs, y):
    """Compute and return the cost and the gradients of the
    weights and biases for each layer."""
    # Initialize gradient arrays
    
    del_b = [np.zeros(b.shape) for b in self.biases]
    del_w = [np.zeros(w.shape) for w in self.weights]
    
    # Compute cost using the activations of the last layer
    # 'activations' is a list of activation matrices from all the layers.
    # 'y' is the desired final output
    ########### YOUR CODE HERE #############

    # YOUR CODE HERE
    raise NotImplementedError()
    
    # Compute delta, which is the gradient of the biases in the last layer
    ########### YOUR CODE HERE #############
    # YOUR CODE HERE
    raise NotImplementedError()
    
    del_b[-1] = delta
    del_w[-1] = np.dot(delta, activations[-2].transpose())
    
    
    # Loop through each layer in reverse direction to 
    # populate del_b and del_w   
    for l in range(2, self.num_layers):
        z = zs[-l]
        sp = self.ReLU_derivative(z)
        delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
        
        # Compute del_b[-l] and del_w[-l]
        # NOTE: An index of '-l' means that we are counting from the back. For example, del_w[-1] is del_w of the last layer
        ########### YOUR CODE HERE for del_b #############
        # YOUR CODE HERE
        raise NotImplementedError()
        ########### YOUR CODE HERE for del_w #############
        # YOUR CODE HERE
        raise NotImplementedError()
        
    return cost, del_b, del_w

In [None]:
activations = [np.array([[0], [0]]), np.array([[ 0.78185459], [ 0.10945917]]), np.array([[ 1.28551934]])]
zs = [np.array([[ 0.78185459], [ 0.10945917]]), np.array([[ 1.28551934]])]
y = 1

network = Network([2, 2, 1])

cost, del_b, del_w = network.backward(activations, zs, y)

del_b_actual = [np.array([[ 0.28551934], [ 0.28551934]]), np.array([[ 0.28551934]])]
del_w_actual = [np.array([[ 0.,  0.], [ 0.,  0.]]), np.array([[ 0.22323461,  0.03125271]])]

assert abs(cost[0] - 0.04076065) < 0.001
assert np.allclose(del_b[0], del_b_actual[0]) and np.allclose(del_b[1], del_b_actual[1])
assert np.allclose(del_w[0], del_w_actual[0]) and np.all(abs(del_w[1] - del_w_actual[1]) < 0.001)

In [None]:
%%add_to Network
def evaluate(self, test_data):
    """Return the accuracy of Network. Note that the neural
    network's output is assumed to be the index of whichever
    neuron in the final layer has the highest activation."""
    test_results = [(np.argmax(self.forward(x)[1][-1]), np.argmax(y))
                    for (x, y) in test_data]
    return sum(int(x == y) for (x, y) in test_results) * 100 / len(test_results)

<a id = 'section5'></a>
# Q1.5 

Train the Network with the above-mentioned architecture.  
No. of epochs = 20  
Mini_batch_size = 2  
Learning rate = 0.1  
momentum = 0.9  
regularization constant = 0.1  
<a href = '#questions'>BACK TO QUESTIONS</a>

# Training the Network

In [None]:
datasets_with_pred = {}
# Find number classes

X = np.array([[0,0],[0,1],[1,0],[1,1]])
X = [a.reshape(-1, 1) for a in X]
Y = np.array([0,1,1,0])
training_data = list(zip(X, Y))  

# YOUR CODE HERE
# YOUR CODE HERE
raise NotImplementedError()


In [None]:
weight = network.weights
bias = network.biases

weight_actual = [np.array([[-0.49738884, -0.29108094], [-0.49738884, -0.29108094]]), np.array([[ 0.34274383,  0.34274383]])]
bias_actual = [np.array([[-0.58410451], [-0.58410451]]), np.array([[ 0.34547183]])]

assert np.all(abs(weight[0] - weight_actual[0]) < 0.001) and np.all(abs(weight[1] - weight_actual[1]) < 0.001)
assert np.all(abs(bias[0] - bias_actual[0]) < 0.001) and np.all(abs(bias[1] - bias_actual[1]) < 0.001) 

## Some Test Cases for checking your functions

### Use this to test the `backprop` function

In [None]:
network = Network([2, 2, 1])
cost, del_b, del_w = network.backprop(np.array([-225, -256]), 297, 36)

cost_actual = np.array([ 1577304.,  1577304.])
del_b_actual = [np.array([[-112., -112.], [-112., -112.]]), np.array([[-260., -260.]])]
del_w_actual = [np.array([[ 71224.,  71224.], [ 71224.,  71224.]]), np.array([[ 36.,  36.]])]


assert np.all(cost == cost_actual)
assert np.all(del_b[0] == del_b_actual[0]) and np.all(del_b[1] == del_b_actual[1])
assert np.all(del_w[0] == del_w_actual[0]) and np.all(abs(del_w[1] - del_w_actual[1]) < 0.001)

### Use this to test the `update_params` function

In [None]:
mini_batch = [(np.array([[0], [0]]), 0), (np.array([[0], [1]]), 1)]

network = Network([2, 2, 1])
total_cost = network.update_params(mini_batch, 1000, 2000, 4000)

assert total_cost == 49000