# HaMLeT

## Session 6: Backpropagation
by Leon Weninger and Raphael Kolk

### Goal of this Session

In this session you will implement the backpropagation algorithm yourself, step by step, without using any deep learning libraries. You should already be familiar with Python and with NumPy (a package for scientific computing with Python).

### Given code

**Task 0:** Briefly familiarize yourself with the given code. Pay particular attention to the `Layer` and `Cost` classes, from which you will derive the classes you implement, and to the `Sigmoid` layer, which is predefined as an example. You will also need to execute the cells in this section once.

The following code loads the data and trains the network in a similar fashion to the last session.

In [7]:
import numpy as np
from tqdm import tqdm
from load_mnist import MNIST


def vectorize(j):
    label_vector = np.zeros((1, 10))
    label_vector[0, int(j)] = 1.0
    return label_vector


def load_data():
    mnist = MNIST()
    images, labels = mnist.data, mnist.target

    image_size = images.shape[1]
    label_size = labels.shape[1]

    random_permutation = np.random.permutation(images.shape[0])
    images = images[random_permutation, :]
    labels = labels[random_permutation, :]
    
    images = (images - np.mean(images))/np.std(images)

    return images, labels, image_size, label_size


def train(net, cost_function, number_epochs, batch_size, learning_rate):
    images, labels, image_size, label_size = load_data()
    training_images, validation_images = images[:50000], images[50000:]
    training_labels, validation_labels = labels[:50000], labels[50000:]

    for e in range(number_epochs):
        cost = train_epoch(e, net, training_images, training_labels, cost_function, batch_size, learning_rate)
        accuracy = validate_epoch(e, net, validation_images, validation_labels, batch_size)
        print('cost=%5.6f, accuracy=%2.6f' % (cost, accuracy), flush=True)


def train_epoch(e, net, images, labels, cost_function, batch_size, learning_rate):
    epoch_cost = 0

    for i in tqdm(range(0, len(images), batch_size), ascii=False, desc='training,   e=%i' % e):
        batch_images = images[i:min(i + batch_size, len(images)), :]
        batch_labels = labels[i:min(i + batch_size, len(labels)), :]

        # zero the gradients
        net.zero_gradients()

        # forward pass
        prediction = net.forward(batch_images)
        cost = cost_function.estimate(batch_labels, prediction)

        # backward pass
        dprediction = cost_function.gradient(cost)
        net.backward(dprediction)

        # update the parameters using the computed gradients via stochastic gradient descent.
        net.update_parameters(learning_rate)

        epoch_cost += np.mean(cost)

    return epoch_cost


def validate_epoch(e, net, images, labels, batch_size):
    n_correct = 0
    n_total = 0

    for i in tqdm(range(0, len(images), batch_size), ascii=False, desc='validation, e=%i' % e):
        batch_images = images[i:min(i + batch_size, len(images)), :]
        batch_labels = labels[i:min(i + batch_size, len(labels)), :]

        # compute predicted probabilities.
        predictions = net.forward(batch_images)

        # find the most probable class label.
        n_correct += sum(np.argmax(batch_labels, axis=1) == np.argmax(predictions, axis=1))
        n_total += batch_labels.shape[0]

    return n_correct / n_total


Remember the sigmoid function and its derivative you implemented in the previous session.

In [8]:
def sigmoid_function(var):
    return 1.0 / (1.0 + np.exp(-var))


def sigmoid_derivative(z):
    return sigmoid_function(z) * (1 - sigmoid_function(z))
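
If you want to convince yourself that the derivative above is right, a finite-difference comparison is a quick sanity check. The following is only a sketch: `central_difference` and the sample points are assumptions made for this illustration, and the two sigmoid functions are repeated so that the snippet runs on its own.

```python
import numpy as np


def sigmoid_function(var):
    return 1.0 / (1.0 + np.exp(-var))


def sigmoid_derivative(z):
    return sigmoid_function(z) * (1 - sigmoid_function(z))


def central_difference(f, z, h=1e-5):
    # Hypothetical helper: numerical derivative via a symmetric difference quotient.
    return (f(z + h) - f(z - h)) / (2.0 * h)


# Compare the analytic and the numerical derivative on a few sample points.
z = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid_derivative(z)
numeric = central_difference(sigmoid_function, z)
max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```

The same idea (perturb an input slightly and compare with the analytic gradient) can be reused later to test the backward passes you implement.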

The following abstract classes should serve as parent classes for all the different layers and cost functions which you will implement. 

In [9]:
class Layer:
    def __init__(self):
        # Initialize all member variables of the layer.
        pass

    def forward(self, x_in):
        # Implements the forward pass of the layer and returns x_out.
        pass

    def backward(self, d_out):
        # Implements the backward pass of the layer and returns d_in.
        pass

    def zero_gradients(self):
        # Sets all gradients of the layer to zero.
        pass

    def update_parameters(self, learning_rate):
        # Update the parameters of the layer with the help of the gradients stored during the backward pass.
        pass


class Cost:
    def __init__(self):
        # Initialize all member variables of the cost function.
        pass

    def estimate(self, target, prediction):
        # Estimates and returns the cost of the predicted label with respect to the given target label.
        pass

    def gradient(self, cost):
        # Calculates and returns the gradient of the cost with respect to the prediction.
        pass

The following class, derived from the `Layer` class, implements the forward and backward pass of the sigmoid activation function already known from the previous session and serves as an example for you. Since it does not have learnable parameters, its `update_parameters` and `zero_gradients` functions do nothing.

In [10]:
class Sigmoid(Layer):
    def __init__(self):
        self.x_in = None
    
    def forward(self, x_in):
        self.x_in = x_in
        x_out = sigmoid_function(x_in)
        return x_out
    
    def backward(self, d_out):
        d_in = d_out * sigmoid_derivative(self.x_in)
        return d_in
    
    def zero_gradients(self):
        pass
    
    def update_parameters(self, learning_rate):
        pass

### Theoretical Foundation 

**Task 1a:** Take a piece of paper and a pencil, and use your knowledge from the preparation material and the introduction slides to fill in the gaps in the preparation material. Once you have written the formulas down, please check with the tutor that they are correct.

Before starting to work on the code, think about batched forward and backward passes. We keep to the convention of having the batch as the first dimension of all tensors. This is consistent with modern deep learning frameworks, which you will get to know in the next session. Keep in mind, however, that with this convention some tensors need to be transposed before they can be multiplied or added.
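
To make the batch convention concrete, here is a small shape sketch. The weight layout `(n_out, n_in)` and the bias layout `(1, n_out)` follow the `Linear` skeleton further down; the sizes and the random generator seed are arbitrary assumptions for this illustration.

```python
import numpy as np

batch_size, n_in, n_out = 4, 3, 2
rng = np.random.default_rng(0)

x = rng.standard_normal((batch_size, n_in))  # one sample per row
w = rng.standard_normal((n_out, n_in))       # same layout as Linear.w
b = np.zeros((1, n_out))                     # broadcast over the batch dimension

# Because w is stored as (n_out, n_in), it has to be transposed so that
# every row of x is mapped to a row of n_out values.
x_out = x @ w.T + b
print(x_out.shape)

# The batched product gives the same result as handling each sample separately.
per_sample = np.stack([w @ x[i] + b[0] for i in range(batch_size)])
print(np.allclose(x_out, per_sample))
```

Writing down the expected shape of every tensor before coding a layer usually tells you where a transpose is needed.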

### Practical Implementation

**Task 2a:** Implement the `forward` function for the `Linear` layer. Remember to store the `x_in` for use in the backward pass.

**Task 2b:** Implement the `estimate` function for the `MeanSquareError` cost, which estimates the cost of a prediction with respect to a given ground truth target. Remember to store the `prediction` and the `target` for calculating the gradient.

**Task 2c:** Implement the `gradient` function for the `MeanSquareError` cost, which calculates the gradient of the cost with respect to the prediction. Use the `prediction` stored during the forward pass.
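
As a reference point for Tasks 2b and 2c, the following sketch shows one possible normalization of the mean square error together with a finite-difference check of its gradient. The standalone functions `mse_cost` and `mse_gradient` are hypothetical names, and the choice to average over both the batch and the output dimension is an assumption; your preparation material may use a different scaling.

```python
import numpy as np


def mse_cost(target, prediction):
    # Per-sample cost: squared error averaged over the output dimension.
    return np.mean((prediction - target) ** 2, axis=1)


def mse_gradient(target, prediction):
    # Gradient of the batch-averaged cost np.mean(mse_cost(...))
    # with respect to every entry of the prediction.
    return 2.0 * (prediction - target) / prediction.size


rng = np.random.default_rng(0)
prediction = rng.random((5, 3))
target = rng.random((5, 3))
gradient = mse_gradient(target, prediction)

# Finite-difference check for a single entry of the prediction.
h = 1e-6
bumped = prediction.copy()
bumped[2, 1] += h
numeric = (np.mean(mse_cost(target, bumped)) - np.mean(mse_cost(target, prediction))) / h
print(gradient[2, 1], numeric)
```

Whatever scaling you choose, the gradient has the same shape as the prediction, and the cost and its gradient have to use the same normalization.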

**Task 2d:** Implement the `backward` function for the `Linear` layer. The function should also calculate and accumulate the gradient of `w` and `b` with regard to the error.
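
The shapes involved in the backward pass of a linear layer can be checked with a small standalone sketch. `linear_forward` and `linear_backward` are hypothetical helper functions, not part of the given code, and they assume the forward pass `x_out = x_in @ w.T + b` with `w` stored as `(n_out, n_in)`. In the class you would add `dw` and `db` to the stored gradients instead of returning them.

```python
import numpy as np


def linear_forward(x_in, w, b):
    return x_in @ w.T + b


def linear_backward(x_in, w, d_out):
    dw = d_out.T @ x_in                        # (n_out, n_in), summed over the batch
    db = np.sum(d_out, axis=0, keepdims=True)  # (1, n_out), summed over the batch
    d_in = d_out @ w                           # (batch, n_in), passed to the previous layer
    return d_in, dw, db


rng = np.random.default_rng(0)
x_in = rng.standard_normal((4, 3))
w = rng.standard_normal((2, 3))
b = rng.standard_normal((1, 2))
d_out = rng.standard_normal((4, 2))

d_in, dw, db = linear_backward(x_in, w, d_out)


def toy_loss(w_):
    # Scalar loss whose gradient with respect to x_out is exactly d_out.
    return np.sum(d_out * linear_forward(x_in, w_, b))


# Finite-difference check for a single weight.
h = 1e-6
w_bumped = w.copy()
w_bumped[1, 2] += h
numeric_dw = (toy_loss(w_bumped) - toy_loss(w)) / h
print(dw[1, 2], numeric_dw)
```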

**Task 2e:** Implement the `update_parameters` function for the `Linear` layer, i.e., use the gradients `dw` and `db` together with a given `learning_rate` to update the parameters `w` and `b` accordingly.
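
The update rule itself is a single gradient descent step. The sketch below applies it to a toy quadratic cost rather than to the `Linear` layer; `toy_cost`, `w_target` and the chosen learning rate are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3))
w_target = np.zeros((2, 3))
learning_rate = 0.1


def toy_cost(w_):
    # Simple quadratic bowl with its minimum at w_target.
    return 0.5 * np.sum((w_ - w_target) ** 2)


cost_before = toy_cost(w)
dw = w - w_target            # gradient of toy_cost with respect to w
w = w - learning_rate * dw   # step against the gradient
cost_after = toy_cost(w)
print(cost_before, cost_after)
```

For this particular cost every step shrinks the distance to the minimum by the factor `1 - learning_rate`, so the cost decreases as long as the learning rate is small enough.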

**Task 2f:** Test your implementation by propagating random input through a linear layer followed by a sigmoid layer, estimating the mean square error to a random target, calculating the gradient and propagating it back through the sigmoid and linear layer. Afterwards update the parameters of the linear layer using the function you implemented.

**Task 3a:** Implement the `Network` class which can encapsulate multiple layers. It offers the same interface as a layer and is therefore derived from the `Layer` parent as well. Make sure to implement all member functions needed. The `forward` function propagates a given input through all encapsulated layers and returns the final prediction of the network, whereas the `backward` function propagates a given gradient through all layers in reversed order. `zero_gradients` and `update_parameters` invoke the respective functions of the encapsulated layers.
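
The control flow described in Task 3a can be tried out on toy layers first. In the sketch below, `Scale` and `Chain` are hypothetical classes that only mimic the `Layer` interface; they are not the `Network` class you are asked to write, but they show the forward order, the reversed backward order and the delegation of the remaining calls.

```python
class Scale:
    """Toy layer y = factor * x that records the order in which it is called."""

    def __init__(self, factor, log):
        self.factor = factor
        self.log = log

    def forward(self, x_in):
        self.log.append(('forward', self.factor))
        return self.factor * x_in

    def backward(self, d_out):
        self.log.append(('backward', self.factor))
        return self.factor * d_out

    def zero_gradients(self):
        pass

    def update_parameters(self, learning_rate):
        pass


class Chain:
    """Container that offers the same interface as a single layer."""

    def __init__(self, layers):
        self.layers = layers

    def forward(self, x_in):
        x = x_in
        for layer in self.layers:
            x = layer.forward(x)
        return x

    def backward(self, d_out):
        d = d_out
        for layer in reversed(self.layers):
            d = layer.backward(d)
        return d

    def zero_gradients(self):
        for layer in self.layers:
            layer.zero_gradients()

    def update_parameters(self, learning_rate):
        for layer in self.layers:
            layer.update_parameters(learning_rate)


log = []
chain = Chain([Scale(2.0, log), Scale(3.0, log)])
y = chain.forward(1.0)
d = chain.backward(1.0)
print(y, d, log)
```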

**Task 3b:** Test your implementation analogously to Task 2f, but use the `Network` class to encapsulate the linear and sigmoid layers.

**Task 4:** Train the network you just implemented using the data loader and `train` function given above and the hyperparameters given below.

**Task 5:** Come up with a more sophisticated network structure and adjust the hyperparameters in order to increase the accuracy.

In [11]:
class Linear(Layer):
    def __init__(self, n_in, n_out, initial_sigma=0.1):
        self.n_in = n_in
        self.n_out = n_out

        self.w = initial_sigma * np.random.randn(n_out, n_in)
        self.b = np.zeros((1, n_out))

        self.zero_gradients()

        self.x_in = None

    def forward(self, x_in):
        # ----- Add code for task 2a between comments -----
        # -------------------------------------------------
        return x_out

    def backward(self, d_out):
        # ----- Add code for task 2d between comments -----
        # -------------------------------------------------
        return self.d_in

    def zero_gradients(self):
        self.dw = np.zeros((self.n_out, self.n_in))
        self.db = np.zeros((1, self.n_out))
        self.dx = np.empty((0, self.n_in))

    def update_parameters(self, learning_rate):
        # ----- Add code for task 2e between comments -----
        # -------------------------------------------------


class MeanSquareError(Cost):
    def __init__(self):
        self.prediction = None
        self.target = None

    def estimate(self, target, prediction):
        # ----- add code for task 2b between comments -----
        # -------------------------------------------------
        return cost

    def gradient(self, cost):
        # ----- add code for task 2c between comments -----
        # -------------------------------------------------
        return gradient


class Network(Layer):
    def __init__(self, layers):
        self.layers = layers

    # ----- add code for task 3a between comments -----
    # -------------------------------------------------

In [None]:
# define hyperparameters
input_size = 28**2
label_size = 10
batch_size = 600
learning_rate = 0.0001
number_epochs = 100

random_input = np.random.rand(batch_size, input_size)
random_label = np.random.rand(batch_size, label_size)

linear_layer = Linear(input_size, label_size)
sigmoid_layer = Sigmoid()
cost_function = MeanSquareError()

# ----- add code for task 2f between comments -----
# -------------------------------------------------

# ----- add code for task 3b between comments -----
# -------------------------------------------------

# ----- add code for task 4 between comments ------
# -------------------------------------------------

# ----- add code for task 5 between comments ------
# -------------------------------------------------

training,   e=0: 100%|██████████| 84/84 [00:00<00:00, 259.58it/s]
validation, e=0: 100%|██████████| 34/34 [00:00<00:00, 134.09it/s]

cost=4.257650, accuracy=0.480400



training,   e=1: 100%|██████████| 84/84 [00:00<00:00, 211.01it/s]
validation, e=1: 100%|██████████| 34/34 [00:00<00:00, 189.29it/s]

cost=2.631974, accuracy=0.585800



training,   e=2: 100%|██████████| 84/84 [00:00<00:00, 226.68it/s]
validation, e=2: 100%|██████████| 34/34 [00:00<00:00, 204.06it/s]

cost=2.234602, accuracy=0.665700



training,   e=3: 100%|██████████| 84/84 [00:00<00:00, 198.43it/s]
validation, e=3: 100%|██████████| 34/34 [00:00<00:00, 217.92it/s]

cost=1.939012, accuracy=0.722750



training,   e=4: 100%|██████████| 84/84 [00:00<00:00, 195.44it/s]
validation, e=4: 100%|██████████| 34/34 [00:00<00:00, 110.96it/s]

cost=1.633267, accuracy=0.790350



training,   e=5: 100%|██████████| 84/84 [00:00<00:00, 210.31it/s]
validation, e=5: 100%|██████████| 34/34 [00:00<00:00, 177.73it/s]

cost=1.408454, accuracy=0.820550



training,   e=6: 100%|██████████| 84/84 [00:00<00:00, 139.58it/s]
validation, e=6: 100%|██████████| 34/34 [00:00<00:00, 184.26it/s]

cost=1.281400, accuracy=0.836050



training,   e=7: 100%|██████████| 84/84 [00:00<00:00, 176.91it/s]
validation, e=7: 100%|██████████| 34/34 [00:00<00:00, 188.99it/s]

cost=1.197316, accuracy=0.847050



training,   e=8: 100%|██████████| 84/84 [00:00<00:00, 206.90it/s]
validation, e=8: 100%|██████████| 34/34 [00:00<00:00, 192.46it/s]

cost=1.136904, accuracy=0.856950



training,   e=9: 100%|██████████| 84/84 [00:00<00:00, 210.77it/s]
validation, e=9: 100%|██████████| 34/34 [00:00<00:00, 208.34it/s]

cost=1.091267, accuracy=0.862400



training,   e=10: 100%|██████████| 84/84 [00:00<00:00, 255.39it/s]
validation, e=10: 100%|██████████| 34/34 [00:00<00:00, 334.74it/s]

cost=1.055388, accuracy=0.866200



training,   e=11: 100%|██████████| 84/84 [00:00<00:00, 317.86it/s]
validation, e=11: 100%|██████████| 34/34 [00:00<00:00, 155.11it/s]

cost=1.026254, accuracy=0.869450



training,   e=12: 100%|██████████| 84/84 [00:00<00:00, 247.46it/s]
validation, e=12: 100%|██████████| 34/34 [00:00<00:00, 306.60it/s]

cost=1.001979, accuracy=0.872650



training,   e=13: 100%|██████████| 84/84 [00:00<00:00, 298.34it/s]
validation, e=13: 100%|██████████| 34/34 [00:00<00:00, 303.30it/s]

cost=0.981329, accuracy=0.875000



training,   e=14: 100%|██████████| 84/84 [00:00<00:00, 224.28it/s]
validation, e=14: 100%|██████████| 34/34 [00:00<00:00, 246.33it/s]

cost=0.963468, accuracy=0.876950



training,   e=15: 100%|██████████| 84/84 [00:00<00:00, 285.76it/s]
validation, e=15: 100%|██████████| 34/34 [00:00<00:00, 357.73it/s]

cost=0.947810, accuracy=0.879550



training,   e=16: 100%|██████████| 84/84 [00:00<00:00, 208.94it/s]
validation, e=16: 100%|██████████| 34/34 [00:00<00:00, 302.23it/s]

cost=0.933927, accuracy=0.881200



training,   e=17: 100%|██████████| 84/84 [00:00<00:00, 282.93it/s]
validation, e=17: 100%|██████████| 34/34 [00:00<00:00, 284.90it/s]

cost=0.921503, accuracy=0.882500



training,   e=18: 100%|██████████| 84/84 [00:00<00:00, 281.95it/s]
validation, e=18: 100%|██████████| 34/34 [00:00<00:00, 324.48it/s]

cost=0.910295, accuracy=0.883700



training,   e=19: 100%|██████████| 84/84 [00:00<00:00, 254.47it/s]
validation, e=19: 100%|██████████| 34/34 [00:00<00:00, 409.09it/s]

cost=0.900114, accuracy=0.884850



training,   e=20: 100%|██████████| 84/84 [00:00<00:00, 313.83it/s]
validation, e=20: 100%|██████████| 34/34 [00:00<00:00, 395.49it/s]

cost=0.890810, accuracy=0.886000



training,   e=21: 100%|██████████| 84/84 [00:00<00:00, 347.27it/s]
validation, e=21: 100%|██████████| 34/34 [00:00<00:00, 413.78it/s]

cost=0.882263, accuracy=0.886800



training,   e=22: 100%|██████████| 84/84 [00:00<00:00, 340.55it/s]
validation, e=22: 100%|██████████| 34/34 [00:00<00:00, 411.21it/s]

cost=0.874375, accuracy=0.887700



training,   e=23: 100%|██████████| 84/84 [00:00<00:00, 347.19it/s]
validation, e=23: 100%|██████████| 34/34 [00:00<00:00, 409.14it/s]

cost=0.867065, accuracy=0.888550



training,   e=24: 100%|██████████| 84/84 [00:00<00:00, 340.70it/s]
validation, e=24: 100%|██████████| 34/34 [00:00<00:00, 388.14it/s]

cost=0.860264, accuracy=0.889550



training,   e=25: 100%|██████████| 84/84 [00:00<00:00, 331.76it/s]
validation, e=25: 100%|██████████| 34/34 [00:00<00:00, 400.23it/s]

cost=0.853916, accuracy=0.890450



training,   e=26: 100%|██████████| 84/84 [00:00<00:00, 342.52it/s]
validation, e=26: 100%|██████████| 34/34 [00:00<00:00, 382.69it/s]

cost=0.847972, accuracy=0.891050



training,   e=27: 100%|██████████| 84/84 [00:00<00:00, 344.88it/s]
validation, e=27: 100%|██████████| 34/34 [00:00<00:00, 415.39it/s]

cost=0.842390, accuracy=0.891750



training,   e=28: 100%|██████████| 84/84 [00:00<00:00, 345.72it/s]
validation, e=28: 100%|██████████| 34/34 [00:00<00:00, 403.48it/s]

cost=0.837135, accuracy=0.892450



training,   e=29: 100%|██████████| 84/84 [00:00<00:00, 335.04it/s]
validation, e=29: 100%|██████████| 34/34 [00:00<00:00, 412.85it/s]

cost=0.832175, accuracy=0.892750



training,   e=30: 100%|██████████| 84/84 [00:00<00:00, 340.11it/s]
validation, e=30: 100%|██████████| 34/34 [00:00<00:00, 410.49it/s]

cost=0.827484, accuracy=0.893300



training,   e=31: 100%|██████████| 84/84 [00:00<00:00, 338.77it/s]
validation, e=31: 100%|██████████| 34/34 [00:00<00:00, 404.34it/s]

cost=0.823037, accuracy=0.893700



training,   e=32: 100%|██████████| 84/84 [00:00<00:00, 341.22it/s]
validation, e=32: 100%|██████████| 34/34 [00:00<00:00, 367.96it/s]

cost=0.818813, accuracy=0.894050



training,   e=33: 100%|██████████| 84/84 [00:00<00:00, 312.82it/s]
validation, e=33: 100%|██████████| 34/34 [00:00<00:00, 356.72it/s]

cost=0.814793, accuracy=0.894300



training,   e=34: 100%|██████████| 84/84 [00:00<00:00, 337.78it/s]
validation, e=34: 100%|██████████| 34/34 [00:00<00:00, 217.95it/s]

cost=0.810962, accuracy=0.894550



training,   e=35: 100%|██████████| 84/84 [00:00<00:00, 208.75it/s]
validation, e=35: 100%|██████████| 34/34 [00:00<00:00, 376.32it/s]

cost=0.807304, accuracy=0.894900



training,   e=36: 100%|██████████| 84/84 [00:00<00:00, 196.86it/s]
validation, e=36: 100%|██████████| 34/34 [00:00<00:00, 380.13it/s]

cost=0.803807, accuracy=0.895200



training,   e=37: 100%|██████████| 84/84 [00:00<00:00, 288.58it/s]
validation, e=37: 100%|██████████| 34/34 [00:00<00:00, 340.66it/s]

cost=0.800457, accuracy=0.895750



training,   e=38: 100%|██████████| 84/84 [00:00<00:00, 220.33it/s]
validation, e=38: 100%|██████████| 34/34 [00:00<00:00, 232.24it/s]

cost=0.797246, accuracy=0.896300



training,   e=39: 100%|██████████| 84/84 [00:00<00:00, 258.30it/s]
validation, e=39: 100%|██████████| 34/34 [00:00<00:00, 247.93it/s]

cost=0.794162, accuracy=0.896550



training,   e=40: 100%|██████████| 84/84 [00:00<00:00, 244.52it/s]
validation, e=40: 100%|██████████| 34/34 [00:00<00:00, 243.48it/s]

cost=0.791198, accuracy=0.896650



training,   e=41: 100%|██████████| 84/84 [00:00<00:00, 278.09it/s]
validation, e=41: 100%|██████████| 34/34 [00:00<00:00, 283.95it/s]

cost=0.788346, accuracy=0.896850



training,   e=42: 100%|██████████| 84/84 [00:00<00:00, 288.15it/s]
validation, e=42: 100%|██████████| 34/34 [00:00<00:00, 322.50it/s]

cost=0.785598, accuracy=0.897150



training,   e=43: 100%|██████████| 84/84 [00:00<00:00, 279.22it/s]
validation, e=43: 100%|██████████| 34/34 [00:00<00:00, 393.57it/s]

cost=0.782949, accuracy=0.897450



training,   e=44: 100%|██████████| 84/84 [00:00<00:00, 235.39it/s]
validation, e=44: 100%|██████████| 34/34 [00:00<00:00, 391.83it/s]

cost=0.780391, accuracy=0.897750



training,   e=45: 100%|██████████| 84/84 [00:00<00:00, 217.77it/s]
validation, e=45: 100%|██████████| 34/34 [00:00<00:00, 250.76it/s]

cost=0.777920, accuracy=0.898250



training,   e=46: 100%|██████████| 84/84 [00:00<00:00, 216.22it/s]
validation, e=46: 100%|██████████| 34/34 [00:00<00:00, 237.28it/s]

cost=0.775531, accuracy=0.898550



training,   e=47: 100%|██████████| 84/84 [00:00<00:00, 308.18it/s]
validation, e=47: 100%|██████████| 34/34 [00:00<00:00, 375.39it/s]

cost=0.773219, accuracy=0.898600



training,   e=48: 100%|██████████| 84/84 [00:00<00:00, 317.09it/s]
validation, e=48: 100%|██████████| 34/34 [00:00<00:00, 362.52it/s]

cost=0.770980, accuracy=0.898850



training,   e=49: 100%|██████████| 84/84 [00:00<00:00, 347.86it/s]
validation, e=49: 100%|██████████| 34/34 [00:00<00:00, 391.20it/s]

cost=0.768810, accuracy=0.898950



training,   e=50: 100%|██████████| 84/84 [00:00<00:00, 339.65it/s]
validation, e=50: 100%|██████████| 34/34 [00:00<00:00, 397.13it/s]

cost=0.766706, accuracy=0.899450



training,   e=51: 100%|██████████| 84/84 [00:00<00:00, 339.84it/s]
validation, e=51: 100%|██████████| 34/34 [00:00<00:00, 409.16it/s]

cost=0.764663, accuracy=0.900100



training,   e=52: 100%|██████████| 84/84 [00:00<00:00, 346.37it/s]
validation, e=52: 100%|██████████| 34/34 [00:00<00:00, 384.84it/s]

cost=0.762680, accuracy=0.900250



training,   e=53: 100%|██████████| 84/84 [00:00<00:00, 342.74it/s]
validation, e=53: 100%|██████████| 34/34 [00:00<00:00, 382.77it/s]

cost=0.760752, accuracy=0.900750



training,   e=54: 100%|██████████| 84/84 [00:00<00:00, 341.43it/s]
validation, e=54: 100%|██████████| 34/34 [00:00<00:00, 385.73it/s]

cost=0.758878, accuracy=0.900900



training,   e=55: 100%|██████████| 84/84 [00:00<00:00, 339.53it/s]
validation, e=55: 100%|██████████| 34/34 [00:00<00:00, 391.27it/s]

cost=0.757055, accuracy=0.901200



training,   e=56: 100%|██████████| 84/84 [00:00<00:00, 335.65it/s]
validation, e=56: 100%|██████████| 34/34 [00:00<00:00, 408.18it/s]

cost=0.755280, accuracy=0.901550



training,   e=57: 100%|██████████| 84/84 [00:00<00:00, 342.94it/s]
validation, e=57: 100%|██████████| 34/34 [00:00<00:00, 415.24it/s]

cost=0.753551, accuracy=0.901950



training,   e=58: 100%|██████████| 84/84 [00:00<00:00, 330.21it/s]
validation, e=58: 100%|██████████| 34/34 [00:00<00:00, 410.09it/s]

cost=0.751867, accuracy=0.902300



training,   e=59: 100%|██████████| 84/84 [00:00<00:00, 345.35it/s]
validation, e=59: 100%|██████████| 34/34 [00:00<00:00, 403.69it/s]

cost=0.750225, accuracy=0.902700



training,   e=60: 100%|██████████| 84/84 [00:00<00:00, 340.63it/s]
validation, e=60: 100%|██████████| 34/34 [00:00<00:00, 382.41it/s]

cost=0.748623, accuracy=0.902450



training,   e=61: 100%|██████████| 84/84 [00:00<00:00, 343.31it/s]
validation, e=61: 100%|██████████| 34/34 [00:00<00:00, 406.93it/s]

cost=0.747060, accuracy=0.902900



training,   e=62: 100%|██████████| 84/84 [00:00<00:00, 346.13it/s]
validation, e=62: 100%|██████████| 34/34 [00:00<00:00, 405.23it/s]

cost=0.745534, accuracy=0.902950



training,   e=63: 100%|██████████| 84/84 [00:00<00:00, 346.66it/s]
validation, e=63: 100%|██████████| 34/34 [00:00<00:00, 382.74it/s]

cost=0.744044, accuracy=0.903000



training,   e=64: 100%|██████████| 84/84 [00:00<00:00, 340.36it/s]
validation, e=64: 100%|██████████| 34/34 [00:00<00:00, 408.20it/s]

cost=0.742588, accuracy=0.903050



training,   e=65: 100%|██████████| 84/84 [00:00<00:00, 340.22it/s]
validation, e=65: 100%|██████████| 34/34 [00:00<00:00, 390.06it/s]

cost=0.741164, accuracy=0.903450



training,   e=66: 100%|██████████| 84/84 [00:00<00:00, 343.71it/s]
validation, e=66: 100%|██████████| 34/34 [00:00<00:00, 409.99it/s]

cost=0.739772, accuracy=0.903600



training,   e=67: 100%|██████████| 84/84 [00:00<00:00, 334.27it/s]
validation, e=67: 100%|██████████| 34/34 [00:00<00:00, 408.92it/s]

cost=0.738410, accuracy=0.903650



training,   e=68: 100%|██████████| 84/84 [00:00<00:00, 340.13it/s]
validation, e=68: 100%|██████████| 34/34 [00:00<00:00, 388.35it/s]

cost=0.737077, accuracy=0.903850



training,   e=69: 100%|██████████| 84/84 [00:00<00:00, 342.29it/s]
validation, e=69: 100%|██████████| 34/34 [00:00<00:00, 408.40it/s]

cost=0.735773, accuracy=0.903850



training,   e=70: 100%|██████████| 84/84 [00:00<00:00, 339.43it/s]
validation, e=70: 100%|██████████| 34/34 [00:00<00:00, 410.71it/s]

cost=0.734495, accuracy=0.904000



training,   e=71: 100%|██████████| 84/84 [00:00<00:00, 334.82it/s]
validation, e=71: 100%|██████████| 34/34 [00:00<00:00, 387.67it/s]

cost=0.733243, accuracy=0.904200



training,   e=72: 100%|██████████| 84/84 [00:00<00:00, 343.46it/s]
validation, e=72: 100%|██████████| 34/34 [00:00<00:00, 386.99it/s]

cost=0.732016, accuracy=0.904450



training,   e=73: 100%|██████████| 84/84 [00:00<00:00, 339.91it/s]
validation, e=73: 100%|██████████| 34/34 [00:00<00:00, 406.75it/s]

cost=0.730814, accuracy=0.904700



training,   e=74: 100%|██████████| 84/84 [00:00<00:00, 339.85it/s]
validation, e=74: 100%|██████████| 34/34 [00:00<00:00, 403.85it/s]

cost=0.729634, accuracy=0.904900



training,   e=75: 100%|██████████| 84/84 [00:00<00:00, 342.19it/s]
validation, e=75: 100%|██████████| 34/34 [00:00<00:00, 408.05it/s]

cost=0.728477, accuracy=0.905100



training,   e=76: 100%|██████████| 84/84 [00:00<00:00, 329.96it/s]
validation, e=76: 100%|██████████| 34/34 [00:00<00:00, 409.62it/s]

cost=0.727342, accuracy=0.905200



training,   e=77: 100%|██████████| 84/84 [00:00<00:00, 338.82it/s]
validation, e=77: 100%|██████████| 34/34 [00:00<00:00, 407.81it/s]

cost=0.726228, accuracy=0.905550



training,   e=78: 100%|██████████| 84/84 [00:00<00:00, 346.30it/s]
validation, e=78: 100%|██████████| 34/34 [00:00<00:00, 400.74it/s]

cost=0.725134, accuracy=0.905800



training,   e=79: 100%|██████████| 84/84 [00:00<00:00, 311.70it/s]
validation, e=79: 100%|██████████| 34/34 [00:00<00:00, 381.14it/s]

cost=0.724060, accuracy=0.906150



training,   e=80: 100%|██████████| 84/84 [00:00<00:00, 337.24it/s]
validation, e=80: 100%|██████████| 34/34 [00:00<00:00, 404.73it/s]

cost=0.723005, accuracy=0.906050



training,   e=81: 100%|██████████| 84/84 [00:00<00:00, 338.54it/s]
validation, e=81: 100%|██████████| 34/34 [00:00<00:00, 405.82it/s]

cost=0.721968, accuracy=0.906350



training,   e=82: 100%|██████████| 84/84 [00:00<00:00, 331.95it/s]
validation, e=82: 100%|██████████| 34/34 [00:00<00:00, 383.78it/s]

cost=0.720949, accuracy=0.906500



training,   e=83: 100%|██████████| 84/84 [00:00<00:00, 339.56it/s]
validation, e=83: 100%|██████████| 34/34 [00:00<00:00, 379.78it/s]

cost=0.719947, accuracy=0.906400



training,   e=84: 100%|██████████| 84/84 [00:00<00:00, 340.06it/s]
validation, e=84: 100%|██████████| 34/34 [00:00<00:00, 410.43it/s]

cost=0.718962, accuracy=0.906550



training,   e=85: 100%|██████████| 84/84 [00:00<00:00, 339.43it/s]
validation, e=85: 100%|██████████| 34/34 [00:00<00:00, 403.98it/s]

cost=0.717993, accuracy=0.906700



training,   e=86: 100%|██████████| 84/84 [00:00<00:00, 307.17it/s]
validation, e=86: 100%|██████████| 34/34 [00:00<00:00, 385.12it/s]

cost=0.717040, accuracy=0.906850



training,   e=87: 100%|██████████| 84/84 [00:00<00:00, 336.13it/s]
validation, e=87: 100%|██████████| 34/34 [00:00<00:00, 381.38it/s]

cost=0.716102, accuracy=0.906900



training,   e=88: 100%|██████████| 84/84 [00:00<00:00, 345.43it/s]
validation, e=88: 100%|██████████| 34/34 [00:00<00:00, 404.03it/s]

cost=0.715179, accuracy=0.907050



training,   e=89: 100%|██████████| 84/84 [00:00<00:00, 341.53it/s]
validation, e=89: 100%|██████████| 34/34 [00:00<00:00, 409.32it/s]

cost=0.714270, accuracy=0.907100



training,   e=90: 100%|██████████| 84/84 [00:00<00:00, 340.06it/s]
validation, e=90: 100%|██████████| 34/34 [00:00<00:00, 409.87it/s]

cost=0.713376, accuracy=0.907150



training,   e=91: 100%|██████████| 84/84 [00:00<00:00, 339.31it/s]
validation, e=91: 100%|██████████| 34/34 [00:00<00:00, 412.61it/s]

cost=0.712495, accuracy=0.907050



training,   e=92: 100%|██████████| 84/84 [00:00<00:00, 340.28it/s]
validation, e=92: 100%|██████████| 34/34 [00:00<00:00, 385.45it/s]

cost=0.711627, accuracy=0.907100



training,   e=93: 100%|██████████| 84/84 [00:00<00:00, 342.40it/s]
validation, e=93: 100%|██████████| 34/34 [00:00<00:00, 400.78it/s]

cost=0.710772, accuracy=0.907250



training,   e=94: 100%|██████████| 84/84 [00:00<00:00, 343.07it/s]
validation, e=94: 100%|██████████| 34/34 [00:00<00:00, 398.37it/s]

cost=0.709930, accuracy=0.907450



training,   e=95: 100%|██████████| 84/84 [00:00<00:00, 338.04it/s]
validation, e=95: 100%|██████████| 34/34 [00:00<00:00, 397.67it/s]

cost=0.709099, accuracy=0.907500



training,   e=96: 100%|██████████| 84/84 [00:00<00:00, 342.02it/s]
validation, e=96: 100%|██████████| 34/34 [00:00<00:00, 386.17it/s]

cost=0.708281, accuracy=0.907550



training,   e=97: 100%|██████████| 84/84 [00:00<00:00, 343.31it/s]
validation, e=97: 100%|██████████| 34/34 [00:00<00:00, 402.60it/s]

cost=0.707474, accuracy=0.907950



training,   e=98: 100%|██████████| 84/84 [00:00<00:00, 342.30it/s]
validation, e=98: 100%|██████████| 34/34 [00:00<00:00, 384.01it/s]

cost=0.706679, accuracy=0.908100



training,   e=99: 100%|██████████| 84/84 [00:00<00:00, 342.88it/s]
validation, e=99: 100%|██████████| 34/34 [00:00<00:00, 385.82it/s]

cost=0.705894, accuracy=0.908200



training,   e=0: 100%|██████████| 84/84 [00:01<00:00, 55.66it/s]
validation, e=0: 100%|██████████| 34/34 [00:00<00:00, 96.12it/s]

cost=3.940944, accuracy=0.416250



training,   e=1: 100%|██████████| 84/84 [00:02<00:00, 40.53it/s]
validation, e=1: 100%|██████████| 34/34 [00:00<00:00, 88.37it/s]

cost=3.151746, accuracy=0.577250



training,   e=2: 100%|██████████| 84/84 [00:01<00:00, 45.41it/s]
validation, e=2: 100%|██████████| 34/34 [00:00<00:00, 154.45it/s]

cost=2.725913, accuracy=0.664250



training,   e=3: 100%|██████████| 84/84 [00:01<00:00, 61.58it/s]
validation, e=3: 100%|██████████| 34/34 [00:00<00:00, 151.86it/s]

cost=2.399525, accuracy=0.726800



training,   e=4: 100%|██████████| 84/84 [00:01<00:00, 63.00it/s]
validation, e=4: 100%|██████████| 34/34 [00:00<00:00, 151.08it/s]

cost=2.156184, accuracy=0.770650



training,   e=5: 100%|██████████| 84/84 [00:01<00:00, 62.10it/s]
validation, e=5: 100%|██████████| 34/34 [00:00<00:00, 152.91it/s]

cost=1.970701, accuracy=0.799300



training,   e=6: 100%|██████████| 84/84 [00:01<00:00, 61.73it/s]
validation, e=6: 100%|██████████| 34/34 [00:00<00:00, 149.29it/s]

cost=1.826022, accuracy=0.817200



training,   e=7: 100%|██████████| 84/84 [00:01<00:00, 61.59it/s]
validation, e=7: 100%|██████████| 34/34 [00:00<00:00, 150.47it/s]

cost=1.710642, accuracy=0.828200



training,   e=8:  65%|██████▌   | 55/84 [00:00<00:00, 62.66it/s]

### Feedback

Aaaaaand we're done 👏🏼🍻

If you have any suggestions on how we could improve this session, please let us know in the following cell. What did you particularly like or dislike? Did you miss any contents?

### Additional Tasks

**Task 6a:** Implement the `forward` function for the `SoftMax` layer.

**Task 6b:** Implement the `estimate` function for the `CrossEntropy` cost.

**Task 6c:** Implement the `gradient` function for the `CrossEntropy` cost.

**Task 6d:** Implement the `backward` function for the `SoftMax` layer.

**Task 6e:** Test your implementation by setting up a network that uses the `SoftMax` layer in combination with the `CrossEntropy` cost.
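
Two details are worth checking before you start on these tasks: the numerical stability of the softmax, and the gradient of softmax and cross entropy taken together. The sketch below uses hypothetical standalone functions `softmax` and `cross_entropy` and assumes one-hot targets and a cost that is summed (not averaged) over the batch; how you split the gradient between the `SoftMax` layer and the `CrossEntropy` cost in your own classes is up to you.

```python
import numpy as np


def softmax(x_in):
    # Subtracting the row-wise maximum avoids overflow in exp
    # and does not change the result.
    shifted = x_in - np.max(x_in, axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=1, keepdims=True)


def cross_entropy(target, probabilities, eps=1e-12):
    # Per-sample cost; eps guards against log(0).
    return -np.sum(target * np.log(probabilities + eps), axis=1)


rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 10))
target = np.eye(10)[[3, 1, 4, 1]]  # one-hot rows for the labels 3, 1, 4, 1

probabilities = softmax(logits)
cost = cross_entropy(target, probabilities)

# For one-hot targets, the gradient of the summed cost with respect to the
# softmax input simplifies to (probabilities - target).
combined_gradient = probabilities - target

# Finite-difference check for a single input entry.
h = 1e-6
bumped = logits.copy()
bumped[0, 3] += h
numeric = (np.sum(cross_entropy(target, softmax(bumped))) - np.sum(cost)) / h
print(combined_gradient[0, 3], numeric)
```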

In [None]:
class SoftMax(Layer):
    def __init__(self):
        self.x_out = None

    def forward(self, x_in):
        # ----- add code for task 6a between comments -----
        # -------------------------------------------------
        return self.x_out

    def backward(self, d_out):
        # ----- add code for task 6d between comments -----
        # -------------------------------------------------
        return d_in


class CrossEntropy(Cost):
    def __init__(self):
        self.x_in = None
        self.target = None
        self.eps = 1e-12

    def estimate(self, target, x_in):
        # ----- add code for task 6b between comments -----
        # -------------------------------------------------
        return cost

    def gradient(self, d_out):
        # ----- add code for task 6c between comments -----
        # -------------------------------------------------
        return gradient


In [None]:
# define hyperparameters
input_size = 28**2
label_size = 10
batch_size = 600
learning_rate = 0.00001
number_epochs = 100

# ----- add code for task 6e between comments -----
# -------------------------------------------------