**TASK:**

Neural Networks and Deep Learning
Cracow University of Technology

Lab Assignment 5:

The purpose of this laboratory is to implement a neural network for a classification task:

1.   The network is trained using minibatch stochastic gradient descent.
2.   Given images of handwritten digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/), train the network to predict the digit shown in each image.

Network specification:

1.   Input layer - one hidden layer - output layer
2.   Activation functions: ReLU for the hidden layer and softmax for the output layer
3.   Loss function: categorical cross-entropy
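
Written out for a single input column $x \in \mathbb{R}^{784}$ with a one-hot target $y$, the specification above corresponds to the equations below. The symbols $W_1, b_1, W_2, b_2, z_1, a_1, z_2, a_2$ mirror the variable names used in the code of this notebook; the class index $k$ is notation introduced here for illustration.

```latex
\begin{aligned}
z_1 &= W_1 x + b_1, & a_1 &= \operatorname{ReLU}(z_1) = \max(0, z_1), \\
z_2 &= W_2 a_1 + b_2, & a_2 &= \operatorname{softmax}(z_2), \quad
(a_2)_k = \frac{e^{(z_2)_k}}{\sum_{j=1}^{10} e^{(z_2)_j}}, \\
L &= -\sum_{k=1}^{10} y_k \log (a_2)_k .
\end{aligned}
```

For a minibatch, the loss is the mean of $L$ over the samples in the batch.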



# Data preparation

In [1]:
import os
import gzip
import numpy as np

In [2]:
dataset_path = 'mnist'

test_images_path = os.path.join(dataset_path, 't10k-images-idx3-ubyte.gz')
test_labels_path = os.path.join(dataset_path, 't10k-labels-idx1-ubyte.gz')
train_images_path = os.path.join(dataset_path, 'train-images-idx3-ubyte.gz')
train_labels_path = os.path.join(dataset_path, 'train-labels-idx1-ubyte.gz')

In [3]:
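def read_dataset(path):
    # Image files: 16-byte header, then 28x28 uint8 pixels per item.
    # Label files: 8-byte header, then one uint8 label per item.
    if 'images' in path:
        elem_size, header_bytes = 28, 16
        type_ = np.float32
    else:
        elem_size, header_bytes = 1, 8
        type_ = np.uint8

    # The test split holds 10,000 items, the training split 60,000.
    num = 10000 if 't10k' in path else 60000
    shape = 1 if elem_size == 1 else (elem_size, elem_size)

    # A context manager makes sure the file is closed after reading.
    with gzip.open(path, 'rb') as f:
        f.read(header_bytes)  # skip the header
        return np.array([
            np.frombuffer(f.read(elem_size*elem_size), dtype=np.uint8).
                          astype(type_).
                          reshape(shape)
            for _ in range(num)
        ])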

def labels_to_one_hot(array):
    n = array.shape[0]
    res = np.zeros((n, 10))
    res[np.arange(n), array.ravel()] = 1
    return res
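
The header sizes and item counts above are hard-coded. As an alternative sketch (not the approach used in this notebook), the same information can be read from the IDX header itself, which stores a big-endian magic number followed by one 32-bit size per dimension. The function name `read_idx` and the tiny in-memory demo file are assumptions made for illustration.

```python
import gzip
import io
import struct

import numpy as np


def read_idx(fileobj):
    """Read an unsigned-byte IDX array (the MNIST file format) from a binary file object."""
    # Magic number: two zero bytes, a dtype code (0x08 = unsigned byte),
    # and the number of dimensions.
    zero, dtype_code, ndim = struct.unpack('>HBB', fileobj.read(4))
    if zero != 0 or dtype_code != 0x08:
        raise ValueError('not an unsigned-byte IDX file')
    # One big-endian 32-bit size per dimension.
    shape = struct.unpack('>' + 'I' * ndim, fileobj.read(4 * ndim))
    data = np.frombuffer(fileobj.read(), dtype=np.uint8)
    return data.reshape(shape)


# Self-check on a tiny in-memory gzip file: 2 "images" of 2x3 pixels.
pixels = np.arange(12, dtype=np.uint8)
raw = struct.pack('>HBBIII', 0, 0x08, 3, 2, 2, 3) + pixels.tobytes()
buf = io.BytesIO()
with gzip.GzipFile(fileobj=buf, mode='wb') as gz:
    gz.write(raw)
buf.seek(0)
with gzip.GzipFile(fileobj=buf, mode='rb') as gz:
    demo = read_idx(gz)
print(demo.shape)
```

With the real files, the same function could be called as `read_idx(gzip.open(train_images_path, 'rb'))`, which would return the images with their shape taken from the file rather than from constants.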

In [4]:
X_train = read_dataset(train_images_path).reshape(-1, 784) / 255.
y_train = labels_to_one_hot(read_dataset(train_labels_path))
X_test = read_dataset(test_images_path).reshape(-1, 784) / 255.
y_test = labels_to_one_hot(read_dataset(test_labels_path))

print(f'{X_train.shape = }')
print(f'{y_train.shape = }')
print(f'{X_test.shape = }')
print(f'{y_test.shape = }')

X_train.shape = (60000, 784)
y_train.shape = (60000, 10)
X_test.shape = (10000, 784)
y_test.shape = (10000, 10)


# Network preparation

In [5]:
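def softmax(x):
    # Columns are samples. Subtracting the per-column maximum avoids overflow
    # in np.exp and leaves the result unchanged.
    exp_ = np.exp(x - np.max(x, axis=0, keepdims=True))
    return exp_ / np.sum(exp_, axis=0, keepdims=True)

def d_softmax(x):
    # Softmax Jacobian for every sample, shape (10, 10, n_samples):
    # result[i, j, k] = y[j, k] * (delta_ij - y[i, k]).
    y = softmax(x)
    tiled = np.tile(y, (10, 1, 1))
    return tiled * (np.diag([1]*10)[..., np.newaxis]-tiled.transpose(1, 0, 2))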

def relu(x):
    return np.where(x > 0.0, x, 0.0)

def D_relu(x):
    return np.where(x > 0.0, 1.0, 0.0)

def loss(predicted, target):
    return -np.mean(np.sum(target*np.log(predicted), axis=0))

def d_loss(predicted, target):
    return - target / predicted / predicted.shape[1]
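
For softmax followed by categorical cross-entropy, chaining the softmax Jacobian with the loss derivative collapses to a closed form: the gradient with respect to the pre-activations is `(softmax(z) - target) / n`. The sketch below compares both routes on random data. It redefines `softmax` so that it runs on its own, and the names `grad_via_jacobian` and `grad_closed_form` are hypothetical helpers, not part of the notebook.

```python
import numpy as np


def softmax(x):
    # Columns are samples; shift by the column max for numerical stability.
    e = np.exp(x - np.max(x, axis=0, keepdims=True))
    return e / np.sum(e, axis=0, keepdims=True)


def grad_via_jacobian(z, target):
    """dL/dz built from the full softmax Jacobian, one sample at a time."""
    y = softmax(z)
    n = z.shape[1]
    d_loss = -target / y / n              # dL/dy, including the 1/n of the mean
    grad = np.empty_like(z)
    for k in range(n):
        jac = np.diag(y[:, k]) - np.outer(y[:, k], y[:, k])  # dy_i/dz_j
        grad[:, k] = jac.T @ d_loss[:, k]
    return grad


def grad_closed_form(z, target):
    """The same gradient via the softmax/cross-entropy shortcut."""
    return (softmax(z) - target) / z.shape[1]


rng = np.random.default_rng(0)
z = rng.normal(size=(10, 5))                      # 10 classes, 5 samples
target = np.eye(10)[:, rng.integers(0, 10, 5)]    # one-hot columns
max_diff = np.max(np.abs(grad_via_jacobian(z, target) - grad_closed_form(z, target)))
print(max_diff < 1e-12)
```

The closed form avoids building a `(10, 10, n)` tensor per batch and never divides by the predicted probabilities, so it is also less sensitive to predictions close to zero.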

**TASK:**

Your code should consist of at least five functions:

* Network initialization
* Forward pass
* Backward pass
* Train 
* Evaluate

You are free to add more functions to organize your code better.

In [6]:
class NetworkWeights:
    def __init__(self, W1, b1, W2, b2):
        self.W1 = W1
        self.b1 = b1
        self.W2 = W2
        self.b2 = b2

In [7]:
def initialize_network(n_hidden_units):
    W1 = np.random.normal(loc=0.0,
                          scale=np.sqrt(2.0/784),
                          size=(n_hidden_units, 784))
    b1 = np.zeros((n_hidden_units, 1))
    W2 = np.random.normal(loc=0.0,
                          scale=np.sqrt(2.0/(n_hidden_units+10)),
                          size=(10, n_hidden_units))
    b2 = np.zeros((10, 1))
    
    return NetworkWeights(W1, b1, W2, b2)


def forward_pass(weights, X):
    z1 = np.matmul(weights.W1, X.T) + weights.b1
    a1 = relu(z1)
    z2 = np.matmul(weights.W2, a1) + weights.b2
    a2 = softmax(z2)
    return z1, a1, z2, a2


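def backward_pass(weights, X, y):
    z1, a1, z2, a2 = forward_pass(weights, X)
    
    L = loss(a2, y.T)
    # Gradient of the loss w.r.t. z2: per-sample softmax Jacobian times dL/da2.
    # d_loss already divides by the batch size, so dL carries the 1/n factor.
    dL = np.matmul(d_softmax(z2).T, d_loss(a2, y.T)[np.newaxis, ...].T).T[0, ...]
    # Backpropagate through the hidden layer (ReLU derivative evaluated at z1).
    dz1 = np.matmul(weights.W2.T, dL) * D_relu(z1)
    grad_W2 = np.matmul(dL, a1.T)
    # Sum (not mean) over the batch: the 1/n factor is already inside dL.
    grad_b2 = np.sum(dL, axis=1, keepdims=True)
    grad_W1 = np.matmul(dz1, X)
    grad_b1 = np.sum(dz1, axis=1, keepdims=True)
    
    return NetworkWeights(grad_W1, grad_b1, grad_W2, grad_b2), L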
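
A finite-difference check is a common way to validate a hand-written backward pass. The sketch below is self-contained and uses a small stand-in network (4 inputs, 5 hidden units, 3 classes) rather than the notebook's functions; `forward`, `analytic_grads` and `numeric_grads` are hypothetical names. It assumes the mean cross-entropy loss, so the `1/n` factor lives in `dz2` and both weight and bias gradients sum over the batch.

```python
import numpy as np


def forward(params, X, Y):
    """Mean cross-entropy of a ReLU/softmax network; X is (n, d), Y is (n, c)."""
    W1, b1, W2, b2 = params
    z1 = W1 @ X.T + b1
    a1 = np.maximum(z1, 0.0)
    z2 = W2 @ a1 + b2
    e = np.exp(z2 - z2.max(axis=0, keepdims=True))
    a2 = e / e.sum(axis=0, keepdims=True)
    loss = -np.mean(np.sum(Y.T * np.log(a2), axis=0))
    return loss, (z1, a1, a2)


def analytic_grads(params, X, Y):
    W1, b1, W2, b2 = params
    _, (z1, a1, a2) = forward(params, X, Y)
    dz2 = (a2 - Y.T) / X.shape[0]          # carries the 1/n of the mean loss
    dz1 = (W2.T @ dz2) * (z1 > 0)
    # Weights and biases both sum over the batch axis; 1/n is already in dz2.
    return [dz1 @ X, dz1.sum(axis=1, keepdims=True),
            dz2 @ a1.T, dz2.sum(axis=1, keepdims=True)]


def numeric_grads(params, X, Y, h=1e-6):
    """Central differences, perturbing one parameter entry at a time."""
    grads = []
    for p in params:
        g = np.zeros_like(p)
        for idx in np.ndindex(*p.shape):
            old = p[idx]
            p[idx] = old + h
            up, _ = forward(params, X, Y)
            p[idx] = old - h
            down, _ = forward(params, X, Y)
            p[idx] = old
            g[idx] = (up - down) / (2 * h)
        grads.append(g)
    return grads


rng = np.random.default_rng(1)
X = rng.normal(size=(6, 4))                        # 6 samples, 4 features
Y = np.eye(3)[rng.integers(0, 3, 6)]               # one-hot rows, 3 classes
params = [rng.normal(size=(5, 4)), rng.normal(size=(5, 1)),
          rng.normal(size=(3, 5)), rng.normal(size=(3, 1))]
errors = [np.max(np.abs(a - n))
          for a, n in zip(analytic_grads(params, X, Y), numeric_grads(params, X, Y))]
print(all(e < 1e-5 for e in errors))
```

The same idea could be applied to the notebook's `backward_pass` on a handful of MNIST rows, comparing each entry of the returned gradients against the change in `loss` under a small perturbation.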

In [8]:
def evaluate(X, y, weights):
    _, _, _, a2 = forward_pass(weights, X)
    L = loss(a2, y.T)
    a2_ = np.argmax(a2.T, axis=1)
    y_ = np.argmax(y, axis=1)
    print(f'Evaluation:\tLoss: {L:.6f}\tAccuracy: {np.mean(a2_ == y_):.6f}')
    

def train_one_batch(X, y, weights, l_rate):
    grads, L = backward_pass(weights, X, y)
    
    weights.W2 -= grads.W2 * l_rate
    weights.b2 -= grads.b2 * l_rate
    weights.W1 -= grads.W1 * l_rate
    weights.b1 -= grads.b1 * l_rate

    return L

def train(X, y, weights, l_rate, epochs, batch_size):
    n_total = X.shape[0]
    n_batches = n_total // batch_size + (1 if n_total % batch_size != 0 else 0)
    for epoch in range(1, epochs+1):
        L = 0.0
        for n_batch in range(1, n_batches+1):
            batch_X = X[(n_batch-1)*batch_size : n_batch*batch_size, :]
            batch_y = y[(n_batch-1)*batch_size : n_batch*batch_size, :]
            L += train_one_batch(batch_X, batch_y, weights, l_rate)
            print('\rEpoch: {}\tBatch: {}/{}\tLoss: {:.6f}\t'.format(epoch, n_batch, n_batches, L/n_batch), end='')
        print('\rEpoch: {}\t'.format(epoch), end='')
        evaluate(X, y, weights)
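
The loop above visits the batches in the same order in every epoch. A common variant of minibatch SGD reshuffles the samples at the start of each epoch. The generator below is a sketch of that idea; `iterate_minibatches` is a hypothetical helper that is not part of the original code, and the toy arrays exist only to make the example runnable.

```python
import numpy as np


def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs covering the data once, in random order."""
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]


rng = np.random.default_rng(0)
X_demo = np.arange(10, dtype=float).reshape(10, 1)   # toy inputs 0..9
y_demo = np.arange(10).reshape(10, 1)                # labels equal to inputs
batches = list(iterate_minibatches(X_demo, y_demo, batch_size=4, rng=rng))
print([len(bx) for bx, _ in batches])                # batch sizes: 4, 4, 2
```

Because inputs and labels are indexed with the same permutation, each sample stays paired with its label, and the last batch is simply shorter when the dataset size is not a multiple of the batch size.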

**TASK:**

Tune your network by changing its hyperparameters:
* Number of epochs
* Number of neurons in hidden layer
* Different learning rates
* Different minibatch sizes

Also, try the following changes to the network:
* Apply different optimization algorithms: Momentum, Adagrad, RMSprop, and ADAM
* Apply L2 regularization techniques to the loss function
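
One way to enumerate the combinations of the hyperparameters listed above is `itertools.product`, which yields the same combinations as four nested loops. The sketch below uses the values tried in this notebook; the names `grid` and `configs` are hypothetical.

```python
from itertools import product

# Hyperparameter values tried in this notebook.
grid = {
    'n_hidden_units': [150, 300],
    'epochs': [5, 15],
    'l_rate': [0.3, 0.01],
    'batch_size': [16, 32],
}

# One dict per combination, in the same order as four nested for-loops.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))
print(configs[0])
```

Each entry could then be passed to a training function whose parameters have matching names, for example `prepare_and_train(**config)`.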

In [9]:
def prepare_and_train(n_hidden_units, epochs, l_rate, batch_size):
    weights = initialize_network(n_hidden_units)
    train(X_train, y_train, weights, l_rate, epochs, batch_size)
    evaluate(X_test, y_test, weights)

# Testing

In [10]:
for n_hidden_units in [150, 300]:
    for epochs in [5, 15]:
        for l_rate in [0.3, 0.01]:
            for batch_size in [16, 32]:
                print(f'===========  {n_hidden_units = }\t{epochs = }\t{l_rate = }\t{batch_size = }  ===========')
                prepare_and_train(n_hidden_units, epochs, l_rate, batch_size)

Epoch: 1	Batch: 3750/3750	Loss: 0.204243	Evaluation:	Loss: 0.117537	Accuracy: 0.962967
Epoch: 2	Batch: 3750/3750	Loss: 0.099515	Evaluation:	Loss: 0.087846	Accuracy: 0.971400
Epoch: 3	Batch: 3750/3750	Loss: 0.069591	Evaluation:	Loss: 0.083425	Accuracy: 0.972850
Epoch: 4	Batch: 3750/3750	Loss: 0.055790	Evaluation:	Loss: 0.059070	Accuracy: 0.981467
Epoch: 5	Batch: 3750/3750	Loss: 0.044740	Evaluation:	Loss: 0.059667	Accuracy: 0.980950
Evaluation:	Loss: 0.128008	Accuracy: 0.966800
Epoch: 1	Batch: 1875/1875	Loss: 0.214377	Evaluation:	Loss: 0.119084	Accuracy: 0.963750
Epoch: 2	Batch: 1875/1875	Loss: 0.091966	Evaluation:	Loss: 0.072200	Accuracy: 0.977733
Epoch: 3	Batch: 1875/1875	Loss: 0.062882	Evaluation:	Loss: 0.057859	Accuracy: 0.980967
Epoch: 4	Batch: 1875/1875	Loss: 0.045759	Evaluation:	Loss: 0.043199	Accuracy: 0.986417
Epoch: 5	Batch: 1875/1875	Loss: 0.032834	Evaluation:	Loss: 0.037282	Accuracy: 0.987433
Evaluation:	Loss: 0.087663	Accuracy: 0.975300
Epoch: 1	Batch: 3750/3750	Loss: 0.5029

Best model so far: n_hidden_units = 300, epochs = 15, l_rate = 0.3, batch_size = 32; Accuracy: 0.981800.

# Different optimizers

In [11]:
class SGDMomentum:
    def __init__(self, l_rate=0.01, momentum=0.9):
        self.l_rate = l_rate
        self.momentum = momentum
        self.m = None
    
    def apply_grads(self, grads, network):
        if self.m is None:
            W1 = np.zeros_like(grads.W1)
            b1 = np.zeros_like(grads.b1)
            W2 = np.zeros_like(grads.W2)
            b2 = np.zeros_like(grads.b2)
            self.m = NetworkWeights(W1, b1, W2, b2)
            
        self.m.W1 = self.momentum*self.m.W1 - self.l_rate*grads.W1
        self.m.b1 = self.momentum*self.m.b1 - self.l_rate*grads.b1
        self.m.W2 = self.momentum*self.m.W2 - self.l_rate*grads.W2
        self.m.b2 = self.momentum*self.m.b2 - self.l_rate*grads.b2
        
        network.W1 += self.m.W1
        network.b1 += self.m.b1
        network.W2 += self.m.W2
        network.b2 += self.m.b2


class AdaGrad:
    def __init__(self, l_rate=0.01):
        self.l_rate = l_rate
        self.m = None
        self.eps = 1e-10
    
    def apply_grads(self, grads, network):
        if self.m is None:
            W1 = np.zeros_like(grads.W1)
            b1 = np.zeros_like(grads.b1)
            W2 = np.zeros_like(grads.W2)
            b2 = np.zeros_like(grads.b2)
            self.m = NetworkWeights(W1, b1, W2, b2)
            
        self.m.W1 += grads.W1*grads.W1
        self.m.b1 += grads.b1*grads.b1
        self.m.W2 += grads.W2*grads.W2
        self.m.b2 += grads.b2*grads.b2
        
        network.W1 -= self.l_rate*grads.W1 / np.sqrt(self.m.W1 + self.eps)
        network.b1 -= self.l_rate*grads.b1 / np.sqrt(self.m.b1 + self.eps)
        network.W2 -= self.l_rate*grads.W2 / np.sqrt(self.m.W2 + self.eps)
        network.b2 -= self.l_rate*grads.b2 / np.sqrt(self.m.b2 + self.eps)


class RMSProp:
    def __init__(self, l_rate=0.01, momentum=0.9):
        self.l_rate = l_rate
        self.momentum = momentum
        self.m = None
        self.eps = 1e-10
    
    def apply_grads(self, grads, network):
        if self.m is None:
            W1 = np.zeros_like(grads.W1)
            b1 = np.zeros_like(grads.b1)
            W2 = np.zeros_like(grads.W2)
            b2 = np.zeros_like(grads.b2)
            self.m = NetworkWeights(W1, b1, W2, b2)
            
        self.m.W1 = self.momentum*self.m.W1 + (1.0 - self.momentum)*grads.W1*grads.W1
        self.m.b1 = self.momentum*self.m.b1 + (1.0 - self.momentum)*grads.b1*grads.b1
        self.m.W2 = self.momentum*self.m.W2 + (1.0 - self.momentum)*grads.W2*grads.W2
        self.m.b2 = self.momentum*self.m.b2 + (1.0 - self.momentum)*grads.b2*grads.b2
        
        network.W1 -= self.l_rate*grads.W1 / np.sqrt(self.m.W1 + self.eps)
        network.b1 -= self.l_rate*grads.b1 / np.sqrt(self.m.b1 + self.eps)
        network.W2 -= self.l_rate*grads.W2 / np.sqrt(self.m.W2 + self.eps)
        network.b2 -= self.l_rate*grads.b2 / np.sqrt(self.m.b2 + self.eps)


class Adam:
    def __init__(self, l_rate=0.01, beta1=0.9, beta2=0.99):
        self.l_rate = l_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.m = None
        self.s = None
        self.eps = 1e-10
        self.t = 1
    
    def apply_grads(self, grads, network):
        if self.m is None:
            W1 = np.zeros_like(grads.W1)
            b1 = np.zeros_like(grads.b1)
            W2 = np.zeros_like(grads.W2)
            b2 = np.zeros_like(grads.b2)
            self.m = NetworkWeights(W1, b1, W2, b2)
            self.s = NetworkWeights(W1.copy(), b1.copy(), W2.copy(), b2.copy())
            
        # Note: m accumulates the negated gradient average, which is why the
        # parameter update at the end of this method uses '+='.
        self.m.W1 = self.beta1*self.m.W1 - (1.0 - self.beta1)*grads.W1
        self.m.b1 = self.beta1*self.m.b1 - (1.0 - self.beta1)*grads.b1
        self.m.W2 = self.beta1*self.m.W2 - (1.0 - self.beta1)*grads.W2
        self.m.b2 = self.beta1*self.m.b2 - (1.0 - self.beta1)*grads.b2
        
        self.s.W1 = self.beta2*self.s.W1 + (1.0 - self.beta2)*grads.W1*grads.W1
        self.s.b1 = self.beta2*self.s.b1 + (1.0 - self.beta2)*grads.b1*grads.b1
        self.s.W2 = self.beta2*self.s.W2 + (1.0 - self.beta2)*grads.W2*grads.W2
        self.s.b2 = self.beta2*self.s.b2 + (1.0 - self.beta2)*grads.b2*grads.b2
        
        mW1 = self.m.W1 / (1.0 - self.beta1**self.t)
        mb1 = self.m.b1 / (1.0 - self.beta1**self.t)
        mW2 = self.m.W2 / (1.0 - self.beta1**self.t)
        mb2 = self.m.b2 / (1.0 - self.beta1**self.t)
        
        sW1 = self.s.W1 / (1.0 - self.beta2**self.t)
        sb1 = self.s.b1 / (1.0 - self.beta2**self.t)
        sW2 = self.s.W2 / (1.0 - self.beta2**self.t)
        sb2 = self.s.b2 / (1.0 - self.beta2**self.t)
        
        network.W1 += self.l_rate*mW1 / np.sqrt(sW1 + self.eps)
        network.b1 += self.l_rate*mb1 / np.sqrt(sb1 + self.eps)
        network.W2 += self.l_rate*mW2 / np.sqrt(sW2 + self.eps)
        network.b2 += self.l_rate*mb2 / np.sqrt(sb2 + self.eps)
        
        self.t += 1
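
For reference, the update rules can be transcribed from the four classes above as follows. Here $\theta$ stands for any of `W1`, `b1`, `W2`, `b2`, $g$ for its gradient, $\eta$ for `l_rate`, $\mu$ for `momentum`, $\epsilon$ for `eps`, and $\odot$ for the element-wise product; these symbols are notation introduced here. Two details follow the code rather than the usual textbook presentation: $\epsilon$ sits inside the square root, and Adam's first moment accumulates the negated gradient, so its parameter update adds instead of subtracts.

```latex
\begin{aligned}
\textbf{Momentum:}\quad & v \leftarrow \mu v - \eta g, &
  \theta &\leftarrow \theta + v \\
\textbf{AdaGrad:}\quad & s \leftarrow s + g \odot g, &
  \theta &\leftarrow \theta - \eta \frac{g}{\sqrt{s + \epsilon}} \\
\textbf{RMSProp:}\quad & s \leftarrow \mu s + (1 - \mu)\, g \odot g, &
  \theta &\leftarrow \theta - \eta \frac{g}{\sqrt{s + \epsilon}} \\
\textbf{Adam:}\quad & m \leftarrow \beta_1 m - (1 - \beta_1)\, g, \quad
  s \leftarrow \beta_2 s + (1 - \beta_2)\, g \odot g, \\
 & \hat{m} = \frac{m}{1 - \beta_1^{t}}, \quad
  \hat{s} = \frac{s}{1 - \beta_2^{t}}, &
  \theta &\leftarrow \theta + \eta \frac{\hat{m}}{\sqrt{\hat{s} + \epsilon}}
\end{aligned}
```

The step counter $t$ starts at 1 and is incremented after every update, and all accumulators ($v$, $s$, $m$) start at zero.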

In [12]:
def train_one_batch(X, y, weights, optimizer):
    grads, L = backward_pass(weights, X, y)
    
    optimizer.apply_grads(grads, weights)

    return L

def train(X, y, weights, optimizer, epochs, batch_size):
    n_total = X.shape[0]
    n_batches = n_total // batch_size + (1 if n_total % batch_size != 0 else 0)
    for epoch in range(1, epochs+1):
        L = 0.0
        for n_batch in range(1, n_batches+1):
            batch_X = X[(n_batch-1)*batch_size : n_batch*batch_size, :]
            batch_y = y[(n_batch-1)*batch_size : n_batch*batch_size, :]
            L += train_one_batch(batch_X, batch_y, weights, optimizer)
            print('\rEpoch: {}\tBatch: {}/{}\tLoss: {:.6f}\t'.format(epoch, n_batch, n_batches, L/n_batch), end='')
        print('\rEpoch: {}\t'.format(epoch), end='')
        evaluate(X, y, weights)

def prepare_and_train(n_hidden_units, optimizer, epochs, batch_size):
    weights = initialize_network(n_hidden_units)
    train(X_train, y_train, weights, optimizer, epochs, batch_size)
    evaluate(X_test, y_test, weights)

In [13]:
momentum = SGDMomentum(l_rate=0.03)
prepare_and_train(300, momentum, 15, 32)

Epoch: 1	Batch: 1875/1875	Loss: 0.223107	Evaluation:	Loss: 0.140579	Accuracy: 0.956350
Epoch: 2	Batch: 1875/1875	Loss: 0.095570	Evaluation:	Loss: 0.084861	Accuracy: 0.972567
Epoch: 3	Batch: 1875/1875	Loss: 0.061452	Evaluation:	Loss: 0.052525	Accuracy: 0.982983
Epoch: 4	Batch: 1875/1875	Loss: 0.041809	Evaluation:	Loss: 0.037520	Accuracy: 0.987750
Epoch: 5	Batch: 1875/1875	Loss: 0.028741	Evaluation:	Loss: 0.032990	Accuracy: 0.988933
Epoch: 6	Batch: 1875/1875	Loss: 0.019630	Evaluation:	Loss: 0.023432	Accuracy: 0.992400
Epoch: 7	Batch: 1875/1875	Loss: 0.013384	Evaluation:	Loss: 0.015290	Accuracy: 0.995367
Epoch: 8	Batch: 1875/1875	Loss: 0.009523	Evaluation:	Loss: 0.012608	Accuracy: 0.996000
Epoch: 9	Batch: 1875/1875	Loss: 0.006746	Evaluation:	Loss: 0.009360	Accuracy: 0.997450
Epoch: 10	Batch: 1875/1875	Loss: 0.004908	Evaluation:	Loss: 0.006260	Accuracy: 0.998683
Epoch: 11	Batch: 1875/1875	Loss: 0.003400	Evaluation:	Loss: 0.005119	Accuracy: 0.998933
Epoch: 12	Batch: 1875/1875	Loss: 0.002427

In [14]:
adagrad = AdaGrad(l_rate=0.03)
prepare_and_train(300, adagrad, 15, 32)

Epoch: 1	Batch: 1875/1875	Loss: 0.191799	Evaluation:	Loss: 0.100492	Accuracy: 0.970867
Epoch: 2	Batch: 1875/1875	Loss: 0.085568	Evaluation:	Loss: 0.068238	Accuracy: 0.980450
Epoch: 3	Batch: 1875/1875	Loss: 0.060908	Evaluation:	Loss: 0.051108	Accuracy: 0.985833
Epoch: 4	Batch: 1875/1875	Loss: 0.047301	Evaluation:	Loss: 0.041262	Accuracy: 0.989017
Epoch: 5	Batch: 1875/1875	Loss: 0.038246	Evaluation:	Loss: 0.034530	Accuracy: 0.991167
Epoch: 6	Batch: 1875/1875	Loss: 0.031656	Evaluation:	Loss: 0.029239	Accuracy: 0.992983
Epoch: 7	Batch: 1875/1875	Loss: 0.026592	Evaluation:	Loss: 0.025139	Accuracy: 0.994533
Epoch: 8	Batch: 1875/1875	Loss: 0.022716	Evaluation:	Loss: 0.021847	Accuracy: 0.995567
Epoch: 9	Batch: 1875/1875	Loss: 0.019624	Evaluation:	Loss: 0.019204	Accuracy: 0.996117
Epoch: 10	Batch: 1875/1875	Loss: 0.017117	Evaluation:	Loss: 0.016755	Accuracy: 0.996900
Epoch: 11	Batch: 1875/1875	Loss: 0.015020	Evaluation:	Loss: 0.014788	Accuracy: 0.997433
Epoch: 12	Batch: 1875/1875	Loss: 0.013293

In [15]:
rmsprop = RMSProp(l_rate=0.003)
prepare_and_train(300, rmsprop, 15, 32)

Epoch: 1	Batch: 1875/1875	Loss: 0.193305	Evaluation:	Loss: 0.128659	Accuracy: 0.964233
Epoch: 2	Batch: 1875/1875	Loss: 0.106336	Evaluation:	Loss: 0.110544	Accuracy: 0.972983
Epoch: 3	Batch: 1875/1875	Loss: 0.083849	Evaluation:	Loss: 0.109912	Accuracy: 0.975050
Epoch: 4	Batch: 1875/1875	Loss: 0.070485	Evaluation:	Loss: 0.077022	Accuracy: 0.983033
Epoch: 5	Batch: 1875/1875	Loss: 0.058701	Evaluation:	Loss: 0.071224	Accuracy: 0.983483
Epoch: 6	Batch: 1875/1875	Loss: 0.049542	Evaluation:	Loss: 0.059207	Accuracy: 0.986667
Epoch: 7	Batch: 1875/1875	Loss: 0.041054	Evaluation:	Loss: 0.059214	Accuracy: 0.987967
Epoch: 8	Batch: 1875/1875	Loss: 0.038832	Evaluation:	Loss: 0.050407	Accuracy: 0.989467
Epoch: 9	Batch: 1875/1875	Loss: 0.030383	Evaluation:	Loss: 0.035719	Accuracy: 0.992300
Epoch: 10	Batch: 1875/1875	Loss: 0.028149	Evaluation:	Loss: 0.051083	Accuracy: 0.990267
Epoch: 11	Batch: 1875/1875	Loss: 0.025770	Evaluation:	Loss: 0.044923	Accuracy: 0.990833
Epoch: 12	Batch: 1875/1875	Loss: 0.021242

In [16]:
adam = Adam(l_rate=0.003)
prepare_and_train(300, adam, 15, 32)

Epoch: 1	Batch: 1875/1875	Loss: 0.200224	Evaluation:	Loss: 0.152519	Accuracy: 0.955783
Epoch: 2	Batch: 1875/1875	Loss: 0.097603	Evaluation:	Loss: 0.106865	Accuracy: 0.967367
Epoch: 3	Batch: 1875/1875	Loss: 0.072807	Evaluation:	Loss: 0.105009	Accuracy: 0.970500
Epoch: 4	Batch: 1875/1875	Loss: 0.057607	Evaluation:	Loss: 0.052663	Accuracy: 0.984983
Epoch: 5	Batch: 1875/1875	Loss: 0.048261	Evaluation:	Loss: 0.063359	Accuracy: 0.981733
Epoch: 6	Batch: 1875/1875	Loss: 0.042291	Evaluation:	Loss: 0.140686	Accuracy: 0.969283
Epoch: 7	Batch: 1875/1875	Loss: 0.036080	Evaluation:	Loss: 0.063690	Accuracy: 0.983917
Epoch: 8	Batch: 1875/1875	Loss: 0.037277	Evaluation:	Loss: 0.045237	Accuracy: 0.988783
Epoch: 9	Batch: 1875/1875	Loss: 0.031606	Evaluation:	Loss: 0.055943	Accuracy: 0.987117
Epoch: 10	Batch: 1875/1875	Loss: 0.028309	Evaluation:	Loss: 0.054930	Accuracy: 0.987717
Epoch: 11	Batch: 1875/1875	Loss: 0.028585	Evaluation:	Loss: 0.037205	Accuracy: 0.991067
Epoch: 12	Batch: 1875/1875	Loss: 0.024791

Of the four optimizers implemented and tested above, the best model is the one trained with momentum (coefficient 0.9). Its accuracy is 0.983000, which is better than the result obtained with the optimizer without momentum.

# L2 regularization

In [17]:
def backward_pass(weights, X, y, lambda_):
    z1, a1, z2, a2 = forward_pass(weights, X)
    
    # Cross-entropy plus the L2 penalty on both weight matrices (biases are not penalized).
    L = loss(a2, y.T) + lambda_/2.0 * (np.linalg.norm(weights.W1)**2 + np.linalg.norm(weights.W2)**2)
    # d_loss already divides by the batch size, so dL carries the 1/n factor.
    dL = np.matmul(d_softmax(z2).T, d_loss(a2, y.T)[np.newaxis, ...].T).T[0, ...]
    dz1 = np.matmul(weights.W2.T, dL) * D_relu(z1)
    # The penalty adds lambda_ * W to each weight gradient.
    grad_W2 = np.matmul(dL, a1.T) + lambda_*weights.W2
    # Sum (not mean) over the batch: the 1/n factor is already inside dL.
    grad_b2 = np.sum(dL, axis=1, keepdims=True)
    grad_W1 = np.matmul(dz1, X) + lambda_*weights.W1
    grad_b1 = np.sum(dz1, axis=1, keepdims=True)
    
    return NetworkWeights(grad_W1, grad_b1, grad_W2, grad_b2), L


def train_one_batch(X, y, weights, optimizer, lambda_):
    grads, L = backward_pass(weights, X, y, lambda_)
    
    optimizer.apply_grads(grads, weights)

    return L

def train(X, y, weights, optimizer, epochs, batch_size, lambda_):
    n_total = X.shape[0]
    n_batches = n_total // batch_size + (1 if n_total % batch_size != 0 else 0)
    for epoch in range(1, epochs+1):
        L = 0.0
        for n_batch in range(1, n_batches+1):
            batch_X = X[(n_batch-1)*batch_size : n_batch*batch_size, :]
            batch_y = y[(n_batch-1)*batch_size : n_batch*batch_size, :]
            L += train_one_batch(batch_X, batch_y, weights, optimizer, lambda_)
            print('\rEpoch: {}\tBatch: {}/{}\tLoss: {:.6f}\t'.format(epoch, n_batch, n_batches, L/n_batch), end='')
        print('\rEpoch: {}\t'.format(epoch), end='')
        evaluate(X, y, weights)

def prepare_and_train(n_hidden_units, optimizer, epochs, batch_size, lambda_):
    weights = initialize_network(n_hidden_units)
    train(X_train, y_train, weights, optimizer, epochs, batch_size, lambda_)
    evaluate(X_test, y_test, weights)
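
The penalty term `lambda_/2 * ||W||^2` contributes `lambda_ * W` to the gradient of each weight matrix. The sketch below checks this relation with central differences on a small random matrix; `l2_penalty` and `l2_penalty_grad` are hypothetical helper names, and the matrix size and `lambda_` value are arbitrary choices for the demonstration.

```python
import numpy as np


def l2_penalty(W, lambda_):
    # np.linalg.norm on a matrix defaults to the Frobenius norm.
    return lambda_ / 2.0 * np.linalg.norm(W) ** 2


def l2_penalty_grad(W, lambda_):
    return lambda_ * W


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
lambda_ = 0.002
h = 1e-6

# Central difference for every entry of W.
numeric = np.zeros_like(W)
for idx in np.ndindex(*W.shape):
    step = np.zeros_like(W)
    step[idx] = h
    numeric[idx] = (l2_penalty(W + step, lambda_) - l2_penalty(W - step, lambda_)) / (2 * h)

max_err = np.max(np.abs(numeric - l2_penalty_grad(W, lambda_)))
print(max_err < 1e-8)
```

Since the penalty is quadratic in `W`, the central difference matches the analytic gradient up to floating-point rounding.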

In [18]:
# Create a fresh optimizer for every run so that Adam's moment estimates
# and step counter are not carried over from one model to the next.
for lambda_ in [0.0002, 0.002, 0.02]:
    prepare_and_train(300, Adam(l_rate=0.03), 15, 32, lambda_)

Epoch: 1	Batch: 1875/1875	Loss: 0.656800	Evaluation:	Loss: 0.609535	Accuracy: 0.848800
Epoch: 2	Batch: 1875/1875	Loss: 0.564667	Evaluation:	Loss: 0.470109	Accuracy: 0.881217
Epoch: 3	Batch: 1875/1875	Loss: 0.553430	Evaluation:	Loss: 0.458701	Accuracy: 0.881517
Epoch: 4	Batch: 1875/1875	Loss: 0.546633	Evaluation:	Loss: 0.496962	Accuracy: 0.872133
Epoch: 5	Batch: 1875/1875	Loss: 0.547818	Evaluation:	Loss: 0.504405	Accuracy: 0.870450
Epoch: 6	Batch: 1875/1875	Loss: 0.549929	Evaluation:	Loss: 0.649045	Accuracy: 0.822633
Epoch: 7	Batch: 1875/1875	Loss: 0.550542	Evaluation:	Loss: 0.501558	Accuracy: 0.862450
Epoch: 8	Batch: 1875/1875	Loss: 0.541836	Evaluation:	Loss: 0.590317	Accuracy: 0.844150
Epoch: 9	Batch: 1875/1875	Loss: 0.539647	Evaluation:	Loss: 0.737761	Accuracy: 0.805917
Epoch: 10	Batch: 1875/1875	Loss: 0.526080	Evaluation:	Loss: 0.640187	Accuracy: 0.827500
Epoch: 11	Batch: 1875/1875	Loss: 0.536138	Evaluation:	Loss: 0.608654	Accuracy: 0.825133
Epoch: 12	Batch: 1875/1875	Loss: 0.534711

**TASK:**

Please submit your code together with a report on the error rate. You can also compare your results with the performance results listed on the MNIST website.
Please also report the effect of the different changes you made to the network.

**Report:** In total, 16 models were tested with the basic SGD optimizer and no L2 penalty. Hyperparameters were taken from the following grid:
- number of hidden units: \[150, 300\],
- number of epochs: \[5, 15\],
- learning rate: \[0.3, 0.01\],
- batch size: \[16, 32\].

The best of those 16 models was the one with 300 hidden units, trained for 15 epochs with a learning rate of 0.3 and a batch size of 32.

Adding momentum with a coefficient of 0.9 produced a better model with an error rate of 1.7%, which is a decent result compared to the numbers presented on the MNIST website.

Applying the other advanced optimizers (AdaGrad, RMSProp and Adam) did not help the model achieve better results or converge faster. L2 regularization did not lead to a model with a lower error rate either.
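
The error rates quoted in the report follow directly from the accuracies, since the error rate is one minus the accuracy. A minimal sketch of that conversion, using the two accuracies stated in this notebook (`error_rate` and `results` are hypothetical names):

```python
def error_rate(accuracy):
    """Error rate in percent for an accuracy given as a fraction."""
    return 100.0 * (1.0 - accuracy)


# Accuracies quoted in this notebook.
results = {
    'SGD': 0.981800,
    'SGD with momentum': 0.983000,
}
for name, acc in results.items():
    print(f'{name}: {error_rate(acc):.2f}%')
```

This prints 1.82% for plain SGD and 1.70% for SGD with momentum.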