# Assignment 1 - Backpropagation

### Notebook created by Anirudh Swaminathan from ECE department majoring in Intelligent Systems, Robotics and Control for the course ECE285 Machine Learning for Image Processing for Fall 2019

## 2. Getting Started

In [None]:
import numpy as np
from matplotlib import pyplot

## 3. Read MNIST Data

In [None]:
import MNISTtools
help(MNISTtools.load)
help(MNISTtools.show)

#### Question 1

In [None]:
# Load the data
xtrain, ltrain = MNISTtools.load(path='./datasets/MNIST')
print(xtrain.shape)
print(ltrain.shape)

The shape of $xtrain$ is $(784, 60000)$<br>
The shape of $ltrain$ is $(60000, )$<br>
The size of the training set, i.e., the number of images in the training set is $60000$<br>
The feature dimension is $784$

#### Question 2

In [None]:
# Displaying the image at index 42
MNISTtools.show(xtrain[:, 42])

# Print its corresponding label
print(ltrain[42])

The image at the index $42$ has been displayed.<br>
The corresponding label has been printed and is found to be $7$

#### Question 3

In [None]:
# Find the range and type of xtrain
min_x = np.amin(xtrain)
max_x = np.amax(xtrain)

print("Range of xtrain is from ", min_x, " to ", max_x)
print("Data type of xtrain is ", xtrain.dtype)

The range of values for $xtrain$ is from $0$ to $255$<br>
The type of $xtrain$ is $uint8$

#### Question 4

In [None]:
def normalize_MNIST_images(x):
    # Convert the uint8 input into float32 for ease of normalization
    fl_x = x.astype(np.float32)
    
    # Normalize [0 to 255] to [-1 to 1]
    # This means mapping 0 to -1, 255 to 1, and 127.5 to 0
    # ret = 2*(fl_x - 255/2.0) / 255
    ret = -1 + 2*fl_x / 255
    return ret

In [None]:
norm_x_train = normalize_MNIST_images(xtrain)
print(norm_x_train.shape)
print("Range of normalized xtrain is", np.amin(norm_x_train), "to", np.amax(norm_x_train))
print("Data type of normalized xtrain is", norm_x_train.dtype)

We wrote the function to normalize the training data from $[0, 255]$ to $[-1, 1]$<br>
We converted $xtrain$, which was of type $uint8$, into an array of type $float32$<br>
We then mapped $0$ to $-1$ and $255$ to $1$ using $-1 + 2x/255$, which is equivalent to subtracting the midpoint $127.5$ and then dividing by $127.5$<br>
We then stored the normalized $xtrain$ in the variable $norm\_x\_train$

#### Question 5

In [None]:
# Complete the code below
def label2onehot(lbl):
    # Creates a placeholder of size (number of classes, number of labels), e.g. (10, 60000)
    d = np.zeros((lbl.max() + 1, lbl.size))
    
    # One-hot encode the labels
    d[lbl, np.arange(lbl.size)] = 1
    return d

In [None]:
dtrain = label2onehot(ltrain)
print(dtrain.shape)
print(np.amin(dtrain), np.amax(dtrain))
print("Label at index 42 is", ltrain[42])
print("Corresponding one-hot encoding is", dtrain[:, 42])

The one-hot encoding line works because NumPy pairs the two index arrays element by element: the row index $lbl[n]$ is used together with the column index $n$<br>
So, for each image given by the $2^{nd}$ axis, only the row index given by the value of the label is assigned $1$<br>
Thus, $0$ maps to $[1, 0, 0, 0, 0, 0, 0, 0, 0, 0]$ and $9$ maps to $[0, 0, 0, 0, 0, 0, 0, 0, 0, 1]$<br><br>
We also checked the label for image $42$. The label is $7$ and the corresponding one-hot encoding is $[0, 0, 0, 0, 0, 0, 0, 1, 0, 0]$
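
The indexing step can be seen on a tiny example. This is a sketch with three made-up labels; the number of rows is fixed to $10$ here, whereas `label2onehot` derives it from `lbl.max() + 1`.

```python
import numpy as np

# Hypothetical labels for three images
lbl = np.array([0, 9, 7])

d = np.zeros((10, lbl.size))
# Pairs row lbl[n] with column n: sets d[0, 0], d[9, 1] and d[7, 2]
d[lbl, np.arange(lbl.size)] = 1

print(d[:, 2])
```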

#### Question 6

In [None]:
def onehot2label(d):
    lbl = d.argmax(axis=0)
    return lbl

In [None]:
# Checking if this works
lab = dtrain[:, 42]
che = onehot2label(lab)

print("One-hot answer", che, "| Original:", ltrain[42])
assert(che == ltrain[42])

We have thus checked that our implementation of recovering the label from the one-hot encoding is correct<br>
The label of the image at index $42$ is $7$<br>
The $onehot2label()$ function recovers this correctly
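
The same check can be run on a whole batch at once rather than on a single column. The following is a sketch on randomly drawn labels (an assumption made here so that the example does not need the MNIST files).

```python
import numpy as np

rng = np.random.default_rng(0)
lbl = rng.integers(0, 10, size=1000)     # hypothetical labels

# One-hot encode, one column per label
d = np.zeros((10, lbl.size))
d[lbl, np.arange(lbl.size)] = 1

# argmax along the class axis recovers every label in one call
recovered = d.argmax(axis=0)
print(np.array_equal(recovered, lbl))
```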

## 4. Activation Functions

#### Question 7

In [None]:
# Implement the softmax function
def softmax(a):
    # Calculate the max value
    M = np.max(a, axis=0)
    
    # Subtract for easier exponential calculation
    a_m = a - M
    
    # Calculate the exponent for each class for each image
    exp_a_m = np.exp(a_m)
    
    # Calculate the sum over the classes for each image
    den = np.sum(exp_a_m, axis=0)
    
    # Get the probabilities for each class for each image
    g_a = exp_a_m / den
    return g_a
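
Subtracting the column-wise maximum before exponentiating does not change the result, because the common factor cancels in the ratio, but it keeps `np.exp` from overflowing. The sketch below compares a direct implementation with the shifted one; the function names `naive_softmax` and `shifted_softmax` are introduced here for illustration only.

```python
import numpy as np

def naive_softmax(a):
    # Direct definition: exp(a) can overflow to inf for large entries
    exp_a = np.exp(a)
    return exp_a / np.sum(exp_a, axis=0)

def shifted_softmax(a):
    # Same computation after subtracting the column-wise maximum
    exp_a = np.exp(a - np.max(a, axis=0))
    return exp_a / np.sum(exp_a, axis=0)

# Two columns that differ only by a constant shift of 999
a = np.array([[1000.0, 1.0],
              [1001.0, 2.0],
              [1002.0, 3.0]])

with np.errstate(over="ignore", invalid="ignore"):
    print(naive_softmax(a)[:, 0])    # first column overflows
print(shifted_softmax(a)[:, 0])
print(shifted_softmax(a).sum(axis=0))
```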

#### Question 8

We need to show that $$\frac{\partial{g(a)_i}}{\partial{a_i}} = g(a)_i(1 - g(a)_i)$$<br>
By definition above, Softmax is $$y_i = g(a)_i = \frac{exp(a_i)}{\sum_{j=1}^{10}exp(a_j)} $$
So, $$ \frac{\partial{g(a)_i}}{\partial{a_i}} = \frac{\partial \left({\frac{exp(a_i)}{\sum_{j=1}^{10}exp(a_j)}} \right)}{\partial{a_i}} $$
Using the division rule of derivatives, we have $$ \frac{\partial{g(a)_i}}{\partial{a_i}} = \frac{\sum_{j=1}^{10}exp(a_j)\frac{\partial{exp(a_i)}}{\partial{a_i}} - exp(a_i)\frac{\partial \left( {\sum_{j=1}^{10}exp(a_j)} \right)}{\partial{a_i}}}{\left( \sum_{j=1}^{10}exp(a_j) \right)^2} $$
Simplifying, we have $$ \frac{\partial{g(a)_i}}{\partial{a_i}} = \frac{exp(a_i)* \sum_{j=1}^{10}exp(a_j) - exp(a_i)*exp(a_i)}{\left( \sum_{j=1}^{10}exp(a_j) \right)^2} $$
Taking $\frac{exp(a_i)}{\sum_{j=1}^{10}exp(a_j)}$ outside, we have $$ \frac{\partial{g(a)_i}}{\partial{a_i}} = \frac{exp(a_i)}{\sum_{j=1}^{10}exp(a_j)} * \left( \frac{\sum_{j=1}^{10}exp(a_j) - exp(a_i)}{\left( \sum_{j=1}^{10}exp(a_j) \right)} \right) $$
We know that $ g(a)_i = \frac{exp(a_i)}{\sum_{j=1}^{10}exp(a_j)} $<br>
Thus, we have $$ \frac{\partial{g(a)_i}}{\partial{a_i}} = g(a)_i * \left( 1 - g(a)_i \right) $$

#### Question 9

We need to show that $$\frac{\partial{g(a)_i}}{\partial{a_j}} = -g(a)_i*g(a)_j \quad \text{for } j\neq i$$<br>
By definition above, Softmax is $$y_i = g(a)_i = \frac{exp(a_i)}{\sum_{k=1}^{10}exp(a_k)} $$
So, $$ \frac{\partial{g(a)_i}}{\partial{a_j}} = \frac{\partial \left({\frac{exp(a_i)}{\sum_{k=1}^{10}exp(a_k)}} \right)}{\partial{a_j}} $$
Since $exp(a_i)$ does not depend on $a_j$ for $j \neq i$, we can take it outside, and we have $$ \frac{\partial{g(a)_i}}{\partial{a_j}} = exp(a_i) * \frac{\partial \left({\frac{1}{\sum_{k=1}^{10}exp(a_k)}} \right)}{\partial{a_j}} $$
Using the reciprocal rule of derivatives, and noting that only the $k = j$ term of the sum depends on $a_j$, we have $$ \frac{\partial{g(a)_i}}{\partial{a_j}} = exp(a_i) * \frac{-1*exp(a_j)}{\left( \sum_{k=1}^{10}exp(a_k) \right)^2} $$
We know that $ g(a)_i = \frac{exp(a_i)}{\sum_{k=1}^{10}exp(a_k)} $ and $ g(a)_j = \frac{exp(a_j)}{\sum_{k=1}^{10}exp(a_k)} $<br>
Thus, we have $$ \frac{\partial{g(a)_i}}{\partial{a_j}} = -1*g(a)_i*g(a)_j \quad \text{for } j\neq i $$

#### Question 10

TODO - Jacobian is symmetric Proof

TODO - Jacobian expression proof
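
One possible way to fill in the two items above, using only the results of Questions 8 and 9, is sketched here. It is an outline rather than the intended proof, and the Kronecker delta $\delta_{ij}$ (equal to $1$ if $i = j$ and $0$ otherwise) is notation introduced for this sketch.

The two cases combine into $$ \frac{\partial{g(a)_i}}{\partial{a_j}} = g(a)_i \left( \delta_{ij} - g(a)_j \right) $$
For $i \neq j$, swapping the indices gives $-g(a)_j g(a)_i$, which is the same value, so the Jacobian is symmetric. In matrix form this reads $$ \frac{\partial g(a)}{\partial a} = \text{diag}(g(a)) - g(a) g(a)^T $$
Applying it to a direction $e$ gives $$ \delta = \frac{\partial g(a)}{\partial a} \times e = g(a) \odot e - \langle g(a), e \rangle \, g(a) $$
where $\odot$ is the element-wise product. This is the expression stated in the comment of $softmaxp()$.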

In [None]:
# Implementation of the gradient of the softmax function
# The directional derivative of the softmax function is as follows:-
# delta = elementwise product (g(a) and e) - <g(a),e> g(a)
def softmaxp(a, e):
    # Calculate g(a)
    g_a = softmax(a)
    
    # Calculate term 1
    t1 = g_a * e
    
    # Calculate the directional derivative
    delta = t1 - np.sum(t1, axis=0)*g_a
    return delta

#### Question 11

In [None]:
# Check if softmaxp is correct
# finite difference step
eps = 1e-6

# random inputs
a = np.random.randn(10, 200)

# random directions
e = np.random.randn(10, 200)

# testing part
diff = softmaxp(a, e)

# From the definition of a derivative, we have
diff_approx = (softmax(a + eps*e) - softmax(a)) / eps

# Calculate the relative error of these 2 approaches
rel_error = np.abs(diff - diff_approx).mean() / np.abs(diff_approx).mean()

# print the relative error
print(rel_error, 'should be smaller than 1e-6')

We have implemented the code to compute the directional derivative of $g$ at point $a$ in the direction of $e$ using the $softmaxp(a, e)$ function<br>
We tested the implementation of our code by comparing with the fundamental definition of directional derivative, where, $$ \delta = \frac{\partial g(a)}{\partial a} \times e = \lim_{\varepsilon\to0} \frac{g(a + \varepsilon e) - g(a)}{\varepsilon} $$
We verified that our implementation of $softmaxp()$ is correct and that the relative error is smaller than $1e-6$
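
As a complementary check, the directional derivative can also be compared against an explicitly built Jacobian for a single input vector. This is a sketch under the assumption that the Jacobian of softmax is $\text{diag}(g) - g g^T$, which follows from Questions 8 and 9; the vectors are random and the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(10)   # a single pre-activation vector
e = rng.standard_normal(10)   # a single direction

# Softmax of one vector
g = np.exp(a - a.max())
g = g / g.sum()

# Explicit 10 x 10 Jacobian: diag(g) - g g^T
J = np.diag(g) - np.outer(g, g)

# Closed form used in softmaxp: g * e - <g, e> g
delta = g * e - np.dot(g, e) * g

print(np.allclose(J, J.T))         # the Jacobian is symmetric
print(np.allclose(J @ e, delta))   # both ways of computing delta agree
```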

#### Question 12

In [None]:
# Compute the ReLU(a) = max(ai, 0)
def relu(a):
    # Create a copy of the array a
    #g_a = np.copy(a)
    
    # Set those values less than 0 to 0
    #g_a[a < 0] = 0
    #return g_a
    return np.maximum(a, 0)

def relup(a, e):
    # Relup is the directional derivative of ReLU(a) in the direction of e
    # Taking the Jacobian for ReLU and then deriving, we have found that the derivative is as given:-
    # It is the element-wise product of gradient of relu and the vector e
    # Create a copy of the array a
    del_a = np.copy(a)
    
    # Set the values less than 0 to 0
    del_a[a < 0] = 0
    
    # Set the values greater than 0 to 1
    del_a[a > 0] = 1
    
    # Compute delta as the element-wise product of the gradient of relu and the vector e
    delta = del_a * e
    return delta

We have implemented the relu function and its directional derivative now<br>
We used the Jacobian to derive the relation of relup to vector operations<br>
We shall now test $relup()$

In [None]:
# Check if relup is correct
# finite difference step
eps = 1e-6

# random inputs
a = np.random.randn(10, 200)

# random directions
e = np.random.randn(10, 200)

# testing part
diff = relup(a, e)

# From the definition of a derivative, we have
diff_approx = (relu(a + eps*e) - relu(a)) / eps

# Calculate the relative error of these 2 approaches
rel_error = np.abs(diff - diff_approx).mean() / np.abs(diff_approx).mean()

# print the relative error
print(rel_error, 'should be smaller than 1e-6')

We have implemented the code to compute the directional derivative of $g$ at point $a$ in the direction of $e$ using the $relup(a, e)$ function<br>
We tested the implementation of our code by comparing with the fundamental definition of directional derivative, where, $$ \delta = \frac{\partial g(a)}{\partial a} \times e = \lim_{\varepsilon\to0} \frac{g(a + \varepsilon e) - g(a)}{\varepsilon} $$
We verified that our implementation of $relup()$ is correct and that the relative error is smaller than $1e-6$
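
ReLU is not differentiable at $0$, so the finite-difference comparison is only meaningful when $a$ and $a + \varepsilon e$ lie on the same side of $0$. With random Gaussian inputs this is almost always the case. The sketch below uses hand-picked values to show one entry where the two disagree; `relup_mask` is a compact variant written for this example, not the notebook's `relup`.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0)

def relup_mask(a, e):
    # Hypothetical compact variant: derivative is 1 where a > 0, else 0
    return (a > 0) * e

eps = 1e-6
# The last entry sits closer to 0 than the step eps
a = np.array([-2.0, -0.5, 0.5, 2.0, 1e-7])
e = np.array([1.0, 1.0, 1.0, 1.0, -1.0])

fd = (relu(a + eps * e) - relu(a)) / eps   # finite difference
an = relup_mask(a, e)                      # analytical value

print(an[:4], fd[:4])   # agree away from the kink
print(an[4], fd[4])     # disagree: a and a + eps*e straddle 0
```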

## 5. Backpropagation

#### Question 13

In [None]:
# define and initialize our shallow network
def init_shallow(Ni, Nh, No):
    """
    Ni - dimension of the input layer. Ni = 784
    Nh - dimension of the hidden layer. Nh = 64
    No - dimension of the output layer. No = 10
    """
    # Create the bias vector for the 1st layer
    # We are using He initialization method
    b1 = np.random.randn(Nh, 1) / np.sqrt((Ni + 1.) / 2.)
    # Create the synaptic weights between the input and the hidden neurons
    W1 = np.random.randn(Nh, Ni) / np.sqrt((Ni + 1.) / 2.)
    
    # Create the bias vector for the 2nd layer
    # We are using Xavier initialization method
    b2 = np.random.randn(No, 1) / np.sqrt((Nh + 1.))
    # Create the synaptic weights between the hidden and the output neurons
    W2 = np.random.randn(No, Nh) / np.sqrt((Nh + 1.))
    return W1, b1, W2, b2

# Initialize our shallow network
Ni = norm_x_train.shape[0]
Nh = 64
No = dtrain.shape[0]
netinit = init_shallow(Ni, Nh, No)

We defined the network architecture and parameters and initialized them in the snippet above<br>
We used He initialization for the input neurons to hidden neurons connections, and we used Xavier initialization for the hidden neurons to output neurons connections
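
The effect of the two scalings can be inspected by comparing the empirical standard deviation of the weights with the value implied by each formula. This is a sketch that uses a seeded generator for reproducibility (an assumption made here; `init_shallow` uses `np.random.randn`).

```python
import numpy as np

rng = np.random.default_rng(0)
Ni, Nh, No = 784, 64, 10

# Same scaling as init_shallow for the two weight matrices
W1 = rng.standard_normal((Nh, Ni)) / np.sqrt((Ni + 1.) / 2.)
W2 = rng.standard_normal((No, Nh)) / np.sqrt(Nh + 1.)

# Dividing a standard normal by s gives a standard deviation of 1 / s
print("W1 std:", W1.std(), "target:", np.sqrt(2. / (Ni + 1.)))
print("W2 std:", W2.std(), "target:", np.sqrt(1. / (Nh + 1.)))
```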

#### Question 14

In [None]:
# define the forward_prop function to propagate the activations through the network
def forwardprop_shallow(x, net):
    W1 = net[0]
    b1 = net[1]
    W2 = net[2]
    b2 = net[3]
    
    # Input to hidden neurons
    a1 = W1.dot(x) + b1
    h1 = relu(a1)
    
    # Hidden to output neurons
    a2 = W2.dot(h1) + b2
    y = softmax(a2)
    return y

# Calculate the initial output for the random initializations
yinit = forwardprop_shallow(norm_x_train, netinit)

In [None]:
# define the forward_prop function to propagate the activations through the network
# This is very useful for backprop, as all the network activations are returned
def fp_shallow(x, net):
    W1 = net[0]
    b1 = net[1]
    W2 = net[2]
    b2 = net[3]
    
    # Input to hidden neurons
    a1 = W1.dot(x) + b1
    h1 = relu(a1)
    
    # Hidden to output neurons
    a2 = W2.dot(h1) + b2
    y = softmax(a2)
    return h1, y

In [None]:
print(norm_x_train.shape)
print(yinit.shape)

print(np.min(norm_x_train), np.max(norm_x_train))
print(np.min(yinit), np.max(yinit))

We have implemented the function to propagate forward through the network<br>
We subsequently calculated the initial output for our initialization of the network with random parameter values

#### Question 15

In [None]:
# Function to compute the cross-entropy loss
def eval_loss(y, d):
    # Calculates the log of the predicted probabilities
    log_y = np.log(y)
    
    # Element-wise multiplication with d
    mult = d*log_y
    
    # Take the negative to get cross-entropy
    mult = -1 * mult
    
    # Average over all entries, i.e. over both the 10 classes and the N images
    # (this is the sum over classes and images divided by 10 * N)
    ret = np.mean(mult)
    return ret

# Check the evaluation loss for the initial predictions
print(eval_loss(yinit, dtrain), 'should be around .26')

We have thus implemented the function to calculate the loss<br>
We have verified that the initial loss is around .26
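
A useful reference point for this value: if the network predicted a probability of $1/10$ for every class, the loss as computed by `np.mean` over all $10N$ entries would be $\ln(10)/10 \approx 0.23$. A randomly initialized network is not exactly uniform, so a somewhat different initial value is plausible. The sketch below checks the uniform case on made-up labels.

```python
import numpy as np

N = 5                                    # hypothetical number of images
lbl = np.array([3, 1, 4, 1, 5])          # hypothetical labels
d = np.zeros((10, N))
d[lbl, np.arange(N)] = 1

# A network that knows nothing predicts 1/10 for every class
y = np.full((10, N), 0.1)

# Same computation as eval_loss: mean over all 10 * N entries
loss = np.mean(-d * np.log(y))
print(loss, np.log(10) / 10)
```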

#### Question 16

In [None]:
# Function to calculate the percentage
def eval_perfs(y, lbl):
    # Convert the given probabilities to corresponding label
    pred_lbl = onehot2label(y)
    
    # Compare the ground truth with the predicted labels to flag misclassified samples
    comps = pred_lbl != lbl
    nums = np.sum(comps)
    ret = (nums * 1.0 / lbl.shape[0]) * 100.0
    return ret

# Print the percentage of "mis-classified" samples for the initial predicted probabilities
# and the groundtruth labels
print(eval_perfs(yinit, ltrain), "% of the images are misclassified")

We implemented the function to calculate the percentage of mis-classified samples<br>
We picked the index with the maximum value(probability) for each column using the $y.argmax(axis=0)$ function<br>
This index is basically the predicted class of the given image(column)<br>
We then compared the predicted label with the actual label, calculated the number of misclassified images, and then divided by the total number of images to get the percentage of mis-classification<br>
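
The computation can be followed by hand on a small example. This is a sketch with made-up probabilities for $4$ images and $3$ classes, in which exactly one image is misclassified.

```python
import numpy as np

# Hypothetical predicted probabilities, one column per image
y = np.array([[0.7, 0.2, 0.1, 0.3],
              [0.2, 0.5, 0.6, 0.3],
              [0.1, 0.3, 0.3, 0.4]])
lbl = np.array([0, 1, 2, 2])             # hypothetical ground-truth labels

pred_lbl = y.argmax(axis=0)              # predicted class per column
error_rate = np.mean(pred_lbl != lbl) * 100.0

print(pred_lbl, error_rate)              # third image is predicted as 1 instead of 2
```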

#### Question 17

We need to show that $$ \left( \nabla_yE \right)_i =  -\frac{d_i}{y_i} $$
$E$ is the cross-entropy loss, and is given by $$ E = - \sum_{k=1}^{10} d_{k}log(y_{k}) $$
Differentiating with respect to $y_i$, only the $k = i$ term of the sum depends on $y_i$, so we have $$ \left( \nabla_yE \right)_i = -d_i * \frac{\partial log(y_i)}{\partial y_i} $$
Thus, we have $$ \left( \nabla_yE \right)_i = -d_i * \left( \frac{1}{y_i} \right) $$
Hence, proved that $$ \left( \nabla_yE \right)_i =  -\frac{d_i}{y_i} $$
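
Feeding this gradient into the directional derivative of softmax, $\delta = g(a) \odot e - \langle g(a), e \rangle g(a)$, gives a compact result for one-hot targets: since $y \odot e = -d$ and the entries of $d$ sum to $1$, we get $\delta = y - d$. The sketch below checks this numerically on random data; the sizes and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((10, 6))             # hypothetical pre-activations, 6 images
lbl = rng.integers(0, 10, size=6)            # hypothetical labels
d = np.zeros((10, 6))
d[lbl, np.arange(6)] = 1

# Softmax output y = g(a), column by column
y = np.exp(a - a.max(axis=0))
y = y / y.sum(axis=0)

e = -d / y                                   # gradient of E with respect to y
delta = y * e - np.sum(y * e, axis=0) * y    # directional derivative of softmax along e

print(np.allclose(delta, y - d))
```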

In [None]:
# Function to perform backpropagation in the network
def update_shallow(x, d, net, gamma=.05):
    W1 = net[0]
    b1 = net[1]
    W2 = net[2]
    b2 = net[3]
    
    Ni = W1.shape[1]
    Nh = W1.shape[0]
    No = W2.shape[0]
    
    # Normalize the gamma by the training dataset size
    gamma = gamma / x.shape[1]
    
    ## Backprop begins!
    # Forward prop through the network using the current parameters
    # working dim - 64*60000 - a1, h1; 10*60000 - a2, y
    # (same computation as fp_shallow(), but a1 and a2 are kept for the derivatives below)
    a1 = W1.dot(x) + b1
    h1 = relu(a1)
    
    # Hidden to output neurons
    a2 = W2.dot(h1) + b2
    y = softmax(a2)
    
    ## Backprop through output neurons to hidden neurons
    # Calculate the gradient of the error for output neurons
    # working dim - 10*60000
    e2 = -d/y
    
    # Calculate the directional derivative of the softmax() activation
    # softmaxp(a, e) applies softmax() to its first argument itself,
    # so it must receive the pre-activation a2, not the output y
    # working dim - 10*60000
    delta2 = softmaxp(a2, e2)
    
    # Calculate the derivative of E wrt W2
    # working dim - 10*60000 * 60000*64(h1.T)
    grad_w2_e = delta2.dot(h1.T)
    
    # Calculate the derivative of E wrt b2
    # working dim - 10*60000 * 60000*1 = 10*1
    grad_b2_e = delta2.dot(np.ones((delta2.shape[1], 1)))
    
    # Calculate the gradient of the error for the hidden neurons
    # Working dim - 64*60000
    # 64*10(W2 is 10*64) * 10*60000
    e1 = W2.T.dot(delta2)
    
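    # Calculate the directional derivative of the relu() activation
    # relup() is evaluated at the pre-activation a1 (h1 = relu(a1) has the same positive entries)
    # working dim - 64*60000
    delta1 = relup(a1, e1)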
    
    # Calculate the derivative of E wrt W1
    # working dim - 64*60000 * 60000*784(x.T) (h0 = x)
    grad_w1_e = delta1.dot(x.T)
    
    # Calculate the derivative of E wrt b1
    # working dim - 64*60000 * 60000*1 = 64*1
    grad_b1_e = delta1.dot(np.ones((delta1.shape[1], 1)))
    
    ## UPDATE the parameters
    W2 = W2 - gamma * grad_w2_e
    W1 = W1 - gamma * grad_w1_e
    b2 = b2 - gamma * grad_b2_e
    b1 = b1 - gamma * grad_b1_e
    
    # return the updated parameters
    return W1, b1, W2, b2

Thus, we have written the function to perform one backpropagation update for our shallow network.<br>
We have also proved that $$ \left( \nabla_yE \right)_i =  -\frac{d_i}{y_i} $$
We have also written a $fp\_shallow()$ function that returns the activations of both the hidden neurons and the output neurons; $update\_shallow()$ performs the same forward pass inline<br>
We then used $softmaxp()$ and $relup()$ to calculate the gradients<br>
We implemented the backpropagation layer-wise from the output neurons to the hidden neurons.<br>
We coded the backpropagation as given:-
$$ W_k^{t+1} = W_k^t - \gamma \nabla_{w_k}E^t $$
$$ b_k^{t+1} = b_k^t - \gamma \nabla_{b_k}E^t $$
$$ \text{where} \quad \nabla_{w_k}E = \delta_k h_{k-1}^T $$
$$ \text{and} \quad \nabla_{b_k}E = \delta_k 1_N $$
$$ \text{and} \quad \delta_k = \left[ \frac{\partial g_k(a_k)}{\partial a_k} \right]^T \times e_k $$
$$ \text{where} \quad e_k = \left\{ \begin{array}{ll} \nabla_y E & \text{if } k \text{ is the output layer} \\ W_{k+1}^T \delta_{k+1} & \text{otherwise} \end{array} \right.$$
We finally return the network parameters $W_1, b_1, W_2$ and $b_2$ after one iteration of backpropagation to the caller
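
The layer-wise rules above can be validated with a numerical gradient check on a tiny network. The following is a sketch under stated assumptions: the sizes are made up, the loss is the cross-entropy summed over classes and images (no averaging), and the output-layer term uses the simplification $\delta_2 = y - d$, which holds for softmax outputs with one-hot targets.

```python
import numpy as np

rng = np.random.default_rng(0)
Ni, Nh, No, N = 4, 5, 3, 8                   # hypothetical tiny sizes

x = rng.standard_normal((Ni, N))
lbl = rng.integers(0, No, size=N)
d = np.zeros((No, N))
d[lbl, np.arange(N)] = 1

W1 = rng.standard_normal((Nh, Ni))
b1 = rng.standard_normal((Nh, 1))
W2 = rng.standard_normal((No, Nh))
b2 = rng.standard_normal((No, 1))

def forward(W1, b1, W2, b2):
    a1 = W1.dot(x) + b1
    h1 = np.maximum(a1, 0)
    a2 = W2.dot(h1) + b2
    y = np.exp(a2 - a2.max(axis=0))
    return a1, h1, y / y.sum(axis=0)

def total_loss(W1, b1, W2, b2):
    # Cross-entropy summed over classes and images (no averaging)
    y = forward(W1, b1, W2, b2)[2]
    return -np.sum(d * np.log(y))

# Analytical gradient of the loss with respect to W1 from the layer-wise rules
a1, h1, y = forward(W1, b1, W2, b2)
delta2 = y - d                               # softmax + cross-entropy, one-hot d
delta1 = (a1 > 0) * W2.T.dot(delta2)         # ReLU derivative applied to e1
grad_W1 = delta1.dot(x.T)

# Central finite differences on every entry of W1
eps = 1e-5
num_W1 = np.zeros_like(W1)
for i in range(Nh):
    for j in range(Ni):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num_W1[i, j] = (total_loss(Wp, b1, W2, b2)
                        - total_loss(Wm, b1, W2, b2)) / (2 * eps)

print(np.max(np.abs(grad_W1 - num_W1)))
```

Checking $W_1$ exercises the whole chain, since its gradient depends on $\delta_2$, $W_2$ and the ReLU derivative.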

#### Question 18

In [None]:
# To compute backprop_shallow
def backprop_shallow(x, d, net, T, gamma=.05):
    # Get the label given the one-hot encoding
    lbl = onehot2label(d)
    
    # Compute and display the loss and performance measure initially
    y = forwardprop_shallow(x, net)
    tr_loss = eval_loss(y, d)
    print("Initial loss is:", tr_loss)
    tr_perf = eval_perfs(y, lbl)
    print(tr_perf, "% of images are misclassified initially\n")
    
    for t in range(T):
        # update the parameters using the update_shallow() function
        net = update_shallow(x, d, net, gamma)
        
        # Compute and display the loss and performance measure for each iteration
        y = forwardprop_shallow(x, net)
        tr_loss = eval_loss(y, d)
        print("Training loss after iteration", t+1, "is:", tr_loss)
        tr_perf = eval_perfs(y, lbl)
        print(tr_perf, "% of images are misclassified after iteration", t+1,"\n")
    return net

In [None]:
# Train the net for 2 iterations initially. The output is the final parameters after training
nettrain = backprop_shallow(norm_x_train, dtrain, netinit, 2)

We wrote the code for $backprop\_shallow()$ to train the network<br>
This function performs $T$ updates of the network by calling one instance of the backpropagation function $update\_shallow()$ each time<br>
We evaluate the loss initially and after each iteration by calling the $eval\_loss()$ function and then display it<br>
Similarly, we evaluate the percentage of images mis-classified initially and after each iteration by calling the $eval\_perfs()$ function and then display it<br>
This function finally returns the trained parameters $W_1, b_1, W_2, b_2$ after $T$ iterations, which we store in $nettrain$

We tried running the code with $2$ iterations initially.<br>
$2$ iterations worked in reducing the loss from $0.279$ to $0.221$<br>
It also reduced the percentage of misclassified images from $89.51\%$ to $85.50\%$<br>
So we moved onto testing the function with $5$ iterations<br><br>
$5$ iterations worked in reducing the loss from $0.276$ to $0.208$<br>
It also reduced the percentage of misclassified images from $88.99\%$ to $69.66\%$<br>
So, we moved onto testing the function with $20$ iterations<br><br>
$20$ iterations worked in reducing the loss from $0.254$ to $0.126$<br>
It also reduced the percentage of misclassified images from $89.005\%$ to $32.71\%$<br>
We finally run the network with $100$ iterations<br><br>
$100$ iterations worked in reducing the loss from $some$ to $thing$<br>
It also reduced the percentage of misclassified images from $per\%$ to $cent\%$<br>

#### Question 19

In [None]:
# Load the testing data
xtest, ltest = MNISTtools.load(dataset='testing', path='./datasets/MNIST')
print(xtest.shape)
print(ltest.shape)

The size of the testing set of images is found to be $(784, 10000)$, that is, it contains $10000$ images

In [None]:
# Displaying the image at index 42
MNISTtools.show(xtest[:, 42])

# Print its corresponding label
print(ltest[42])

In [None]:
# Find the range and type of xtest
min_te_x = np.amin(xtest)
max_te_x = np.amax(xtest)

print("Range of xtest is from ", min_te_x, " to ", max_te_x)
print("Data type of xtest is ", xtest.dtype)

In [None]:
# Normalize the test images
norm_x_test = normalize_MNIST_images(xtest)
print(norm_x_test.shape)
print("Range of normalized xtest is", np.amin(norm_x_test), "to", np.amax(norm_x_test))
print("Data type of normalized xtest is", norm_x_test.dtype)

In [None]:
dtest = label2onehot(ltest)
print(dtest.shape)
print(np.amin(dtest), np.amax(dtest))
print("Label at index 42 is", ltest[42])
print("Corresponding one-hot encoding is", dtest[:, 42])

In [None]:
# Compute and display the loss and performance measure on the test set
y_test = forwardprop_shallow(norm_x_test, nettrain)
te_loss = eval_loss(y_test, dtest)
print("Test loss is:", te_loss)
te_lbl = onehot2label(dtest)
te_perf = eval_perfs(y_test, te_lbl)
print(te_perf, "% of images are misclassified in the test set\n")

Thus, we have loaded the testing dataset<br>
We tested the performance of the network parameters that were trained for $2$ iterations and we got a loss of $0.221$ on this test set<br>
Training loss after $2$ iterations was: $0.221$<br>
$84.99\%$ of the images are mis-classified from the test set after the training<br>
$85.50\%$ of images are misclassified in the training set after $2$ iterations<br><br>
We tested the performance of the network parameters that were trained for $5$ iterations and we got a loss of $5_iter$ on this test set<br>
Training loss after $5$ iterations was: $5_iter$<br>
$5_iter\%$ of the images are mis-classified from the test set after the training<br>
$5_iter\%$ of images are misclassified in the training set after $5$ iterations<br><br>
We tested the performance of the network parameters that were trained for $20$ iterations and we got a loss of $0.123$ on this test set<br>
Training loss after $20$ iterations was: $0.126$<br>
$31.79\%$ of the images are mis-classified from the test set after the training<br>
$33.03\%$ of images are misclassified in the training set after $20$ iterations<br><br>
We tested the performance of the network parameters that were trained for $100$ iterations and we got a loss of $100_iter$ on this test set<br>
Training loss after $100$ iterations was: $100_iter$<br>
$31.79\%$ of the images are mis-classified from the test set after the training<br>
$33.03\%$ of images are misclassified in the training set after $100$ iterations

#### Question 20

In [None]:
# Backprop using minibatch
def backprop_minibatch_shallow(x, d, net, T, B=100, gamma=0.05):
    # Get the number of images
    N = x.shape[1]
    
    # Calculate the number of batches, i.e. ceil(N / B)
    NB = (N + B - 1) // B
    
    # Convert one-hot encoded data to a label
    lbl = onehot2label(d)
    
    # Compute and display the loss and performance measure initially
    # y = forwardprop_shallow(x, net)
    # tr_mini_loss = eval_loss(y, d)
    # print("Initial minibatch loss is:", tr_mini_loss)
    # tr_mini_perf = eval_perfs(y, lbl)
    # print(tr_mini_perf, "% of images are misclassified initially using minibatch method\n")
    
    # For every iteration(epoch)
    for t in range(T):
        # shuffle the indices to access the data
        shuffled_indices = np.random.permutation(N)
        
        # For each minibatch
        for l in range(NB):
            # get the shuffled indices for a given minibatch
            minibatch_indices = shuffled_indices[B*l:min(B*(l+1), N)]
            
            # Backprop through the minibatch and update the parameters of the network
            net = update_shallow(x[:, minibatch_indices], d[:, minibatch_indices], net, gamma)
            
        y = forwardprop_shallow(x, net)
        tr_mini_loss = eval_loss(y, d)
        print("Training loss using minibatches after epoch", t+1, "is:", tr_mini_loss)
        tr_mini_perf = eval_perfs(y, lbl)
        print(tr_mini_perf, "% of images are misclassified using minibatches after epoch", t+1,"\n")
    return net

We wrote the code for $backprop\_minibatch\_shallow()$ to train the network<br>
In minibatch backpropagation, we divided the dataset into a number of mini-batches, each of size $B = 100$ images<br>
We thus updated the parameters of our network $T \lceil N/B \rceil$ times<br>
We first calculate the number of batches per epoch<br>
We then shuffle the indices of the entire dataset at the start of each epoch, and for each minibatch, we update the parameters of the network using the $update\_shallow()$ function<br>
We evaluate the loss after each epoch by calling the $eval\_loss()$ function and then display it (the initial evaluation is commented out in the code)<br>
Similarly, we evaluate the percentage of images mis-classified after each epoch by calling the $eval\_perfs()$ function and then display it<br>
This function finally returns the trained parameters $W_1, b_1, W_2, b_2$ after $T$ epochs, which we store in $netminibatch$
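
The batching logic can be checked in isolation. This is a sketch that uses a made-up dataset size which is not a multiple of $B$, so that the last minibatch is shorter than the others.

```python
import numpy as np

rng = np.random.default_rng(0)
N, B = 1050, 100                             # hypothetical sizes; N is not a multiple of B
NB = (N + B - 1) // B                        # ceil(N / B) minibatches per epoch

# Shuffle once, then slice consecutive blocks of at most B indices
shuffled_indices = rng.permutation(N)
batches = [shuffled_indices[B * l:min(B * (l + 1), N)] for l in range(NB)]

print(NB, [len(b) for b in batches][-2:])

# Every image index is visited exactly once per epoch
visited = np.sort(np.concatenate(batches))
print(np.array_equal(visited, np.arange(N)))
```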

#### Question 21

In [None]:
# Train the network for a few epochs
print(np.min(norm_x_train), np.max(norm_x_train))
netminibatch = backprop_minibatch_shallow(norm_x_train, dtrain, netinit, 2, B=100)

We tried running the code with $5$ epochs initially.<br>
$5$ epochs using minibatches worked in reducing the loss from $0.271$ to $0.238$<br>
Using minibatches also reduced the percentage of misclassified images from $88.99\%$ to $69.66\%$

This is different from running with $5$ iterations over the whole training set<br>
$5$ iterations worked in reducing the loss from $0.276$ to $0.208$<br>
It also reduced the percentage of misclassified images from $88.99\%$ to $69.66\%$<br>

In [None]:
# Compute and display the loss and performance measure on the test set
y_mini_test = forwardprop_shallow(norm_x_test, netminibatch)
te_mini_loss = eval_loss(y_mini_test, dtest)
print("Test loss after minibatch gradient descent is:", te_mini_loss)
te_mini_perf = eval_perfs(y_mini_test, te_lbl)
print(te_mini_perf, "% of images are misclassified in the test set after minibatch gradient descent\n")

We tested the performance of the network parameters that were trained for $5$ epochs and we got a loss of $0.123$ on this test set<br>
Training loss after $5$ epochs was: $0.126$<br>
$31.79\%$ of the images are mis-classified from the test set after the minibatch training<br>
$33.03\%$ of images are misclassified in the training set after $5$ epochs<br>

Testing loss after training on the entire network for $5$ iterations was : $0.126$<br>
$5_iter\%$ of the images are mis-classified from the test set after the entire training<br><br>

Comparing the performance of the network trained using minibatches vs. the entire training set, we conclude that ___ gives slightly better performance

## Conclusion

Thus, we have learnt about shallow networks, and implemented a simple shallow feedforward network to classify MNIST handwritten images. We trained the network on the whole training set over multiple iterations, also trained it using minibatches, and compared their performance.

Assignment completed by
 - Name: Anirudh Swaminathan
 - PID: A53316083
 - Email ID: aswamina@eng.ucsd.edu