**TASK:**

Neural Networks and Deep Learning
Cracow University of Technology

Lab Assignment 5:

The purpose of this laboratory is to implement a neural network for a classification task:



1.   The network is trained using minibatch stochastic gradient descent.
2.   You have images of handwritten digits from the [MNIST dataset](http://yann.lecun.com/exdb/mnist/); train the network to predict the digit shown in each image.

Network specification:

1.   Input layer - one hidden layer - output layer
2.   Activation functions: for hidden layer "ReLU" and for output layer "softmax"
3.   Loss function: categorical cross-entropy



# Network preparation

In [4]:
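import numpy as np

def softmax(x):
    # Column-wise softmax; subtracting the column maximum leaves the result
    # unchanged but prevents overflow in np.exp for large inputs.
    exp_ = np.exp(x - np.max(x, axis=0, keepdims=True))
    return exp_ / np.sum(exp_, axis=0)

def d_softmax(x):
    # Per-sample Jacobian of shape (10, 10, n): entry [i, j, k] = y[j, k] * (delta_ij - y[i, k]).
    y = softmax(x)
    tiled = np.tile(y, (10, 1, 1))
    return tiled * (np.diag([1]*10)[..., np.newaxis]-tiled.transpose(1, 0, 2))

def relu(x):
    return np.where(x > 0.0, x, 0.0)

def D_relu(x):
    # Derivative of ReLU (taken as 0 at x == 0).
    return np.where(x > 0.0, 1.0, 0.0)

def loss(predicted, target):
    # Mean categorical cross-entropy; columns are samples.
    return -np.mean(np.sum(target*np.log(predicted), axis=0))

def d_loss(predicted, target):
    # Gradient of the mean loss w.r.t. the predictions (includes the 1/n factor).
    return - target / predicted / predicted.shape[1]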
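
A quick sanity check for the helpers above, written as a minimal self-contained sketch. It uses a `softmax` that subtracts the column maximum before exponentiating, which leaves the result unchanged but avoids overflow. The toy logits and targets are made up for illustration. With all logits equal, every class gets probability 0.1, so the cross-entropy should equal ln(10) ≈ 2.3026, which is roughly the loss to expect from an untrained 10-class network.

```python
import numpy as np

def softmax(x):
    # Column-wise softmax; the shift by the column maximum avoids overflow in exp.
    shifted = x - np.max(x, axis=0, keepdims=True)
    exp_ = np.exp(shifted)
    return exp_ / np.sum(exp_, axis=0, keepdims=True)

def loss(predicted, target):
    # Mean categorical cross-entropy over the columns (samples).
    return -np.mean(np.sum(target * np.log(predicted), axis=0))

# Hypothetical toy batch: 10 classes, 4 samples, all logits equal.
logits = np.zeros((10, 4))
target = np.eye(10)[:, :4]          # one-hot columns for classes 0..3
probs = softmax(logits)
uniform_loss = loss(probs, target)

# Very large logits still give finite probabilities thanks to the shift.
big = softmax(np.array([[1000.0], [1001.0]]))

print(uniform_loss, np.log(10))
print(big.ravel())
```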

Your code should consist of at least five functions:

* Network initialization
* Forward pass
* Backward pass
* Train 
* Evaluate

You are free to add more functions to organize your code better.

Tune your network by changing its hyperparameters:
* Number of epochs
* Number of neurons in hidden layer
* Different learning rates
* Different minibatch sizes

Also, try the following changes to the network:
* Apply different optimization algorithms: Momentum, Adagrad, RMSprop, and ADAM
* Apply L2 regularization techniques to the loss function

Please submit your code together with a report on the error rate. You can also compare your results with the performance results listed on the MNIST website.
Please also report the effect of the different changes you made to the network.

## 0. Read data

In [5]:
import numpy as np
import os
import gzip

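def read_data(path):
    # Images: 28x28 pixels after a 16-byte header; labels: 1 byte each after an 8-byte header.
    if 'images' in path:
        elem_size, header_bytes = 28, 16
        type_ = np.float32
    else:
        elem_size, header_bytes = 1, 8
        type_ = np.uint8
    # The test files ("t10k") hold 10,000 items, the training files 60,000.
    num = 10000 if 't10k' in path else 60000
    shape = 1 if elem_size == 1 else (elem_size, elem_size)
    with gzip.open(path, 'rb') as f:  # the file is closed even if reading fails
        f.read(header_bytes)  # skip the header
        return np.array([
            np.frombuffer(
                f.read(elem_size*elem_size),
                dtype=np.uint8
            ).astype(type_).reshape(shape)
            for _ in range(num)
        ])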

def labels_to_one_hot(array):
    n = array.shape[0]
    res = np.zeros((n, 10))
    res[np.arange(n), array.ravel()] = 1
    return res

X_train = read_data("dataSet/train-images-idx3-ubyte.gz").reshape(-1, 784) / 255.
X_test = read_data("dataSet/t10k-images-idx3-ubyte.gz").reshape(-1, 784) / 255.
y_train = labels_to_one_hot(read_data("dataSet/train-labels-idx1-ubyte.gz"))
y_test = labels_to_one_hot(read_data("dataSet/t10k-labels-idx1-ubyte.gz"))

print("X_train:", X_train.shape)
print("X_test:", X_test.shape)
print("y_train:", y_train.shape)
print("y_test:", y_test.shape)

X_train: (60000, 784)
X_test: (10000, 784)
y_train: (60000, 10)
y_test: (10000, 10)
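
`read_data` above decides the item count and header size from the file name. As a hedged alternative, the shape can be read from the file header itself. The sketch below assumes the IDX layout used by the MNIST files: a 4-byte magic number whose last byte is the number of dimensions, followed by one big-endian 32-bit size per dimension. This matches the 16-byte and 8-byte headers skipped above (4 + 3·4 and 4 + 1·4). The name `read_idx` and the tiny in-memory file are hypothetical and only there to make the example runnable without the dataset.

```python
import gzip
import io
import struct
import numpy as np

def read_idx(fileobj):
    # Hypothetical alternative reader: take the array shape from the IDX header.
    magic = fileobj.read(4)
    n_dims = magic[3]                                   # 4th byte = number of dimensions
    shape = struct.unpack('>' + 'I' * n_dims, fileobj.read(4 * n_dims))
    data = np.frombuffer(fileobj.read(), dtype=np.uint8)
    return data.reshape(shape)

# Build a tiny in-memory IDX file (2 images of 3x3 pixels) to exercise the reader.
pixels = np.arange(18, dtype=np.uint8)
raw = bytes([0, 0, 8, 3]) + struct.pack('>III', 2, 3, 3) + pixels.tobytes()
buffer = io.BytesIO(gzip.compress(raw))
with gzip.open(buffer, 'rb') as f:
    images = read_idx(f)

print(images.shape)
```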


## 1.1 Network initialization

In [6]:
class NetworkWeights:
    def __init__(self, W1, b1, W2, b2):
        self.W1 = W1
        self.b1 = b1
        self.W2 = W2
        self.b2 = b2
        
def network_initialization(n_hidden_units):
    Weight1 = np.random.normal(scale=np.sqrt(2.0/784),loc=0.0,size=(n_hidden_units, 784))
    Weight2 = np.random.normal(scale=np.sqrt(2.0/(n_hidden_units+10)),loc=0.0,size=(10, n_hidden_units))
    bias1 = np.zeros((n_hidden_units, 1))
    bias2 = np.zeros((10, 1))
    return NetworkWeights(Weight1, bias1, Weight2, bias2)

## 1.2 Forward pass

In [7]:
def forward_pass(weights, X):
    z1 = np.matmul(weights.W1, X.T) + weights.b1
    a1 = relu(z1)
    z2 = np.matmul(weights.W2, a1) + weights.b2
    a2 = softmax(z2)
    return z1, a1, z2, a2
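
The forward pass keeps one sample per row in `X` but one sample per column in every intermediate result. The following sketch traces the shapes on a small random batch. The batch size of 5 and the random inputs are arbitrary stand-ins for MNIST data, and the layer code is restated inline so the example runs on its own.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_inputs, n_hidden, n_classes = 5, 784, 100, 10

X = rng.random((n_samples, n_inputs))                 # one sample per row
W1 = rng.normal(0.0, np.sqrt(2.0 / n_inputs), (n_hidden, n_inputs))
b1 = np.zeros((n_hidden, 1))
W2 = rng.normal(0.0, np.sqrt(2.0 / (n_hidden + n_classes)), (n_classes, n_hidden))
b2 = np.zeros((n_classes, 1))

z1 = W1 @ X.T + b1            # (n_hidden, n_samples); b1 broadcasts over columns
a1 = np.maximum(z1, 0.0)      # ReLU
z2 = W2 @ a1 + b2             # (n_classes, n_samples)
e = np.exp(z2 - z2.max(axis=0, keepdims=True))
a2 = e / e.sum(axis=0, keepdims=True)   # softmax per column

shapes = (z1.shape, a1.shape, z2.shape, a2.shape)
print(shapes)
```

Each column of `a2` is a probability distribution over the ten digits, which is why `evaluate` takes the `argmax` over `a2.T` along axis 1.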

## 1.3 Backward pass

In [8]:
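def backward_pass(weights, X, y):
    z1, a1, z2, a2 = forward_pass(weights, X)
    
    L = loss(a2, y.T)
    # Gradient w.r.t. z2: the softmax Jacobian applied to the loss gradient, per sample.
    dL = np.matmul(d_softmax(z2).T, d_loss(a2, y.T)[np.newaxis, ...].T).T[0, ...]
    # Gradient w.r.t. z1; D_relu(a1) equals D_relu(z1) because a1 > 0 exactly where z1 > 0.
    d_hidden = np.matmul(weights.W2.T, dL) * D_relu(a1)
    grad_W2 = np.matmul(dL, a1.T)
    # NOTE: dL already carries the 1/batch_size factor from d_loss, so np.mean scales the
    # bias gradients down by the batch size once more; np.sum would give the unscaled gradient.
    grad_b2 = np.mean(dL, axis=1, keepdims=True)
    grad_W1 = np.matmul(d_hidden, X)
    grad_b1 = np.mean(d_hidden, axis=1, keepdims=True)
    
    return NetworkWeights(grad_W1, grad_b1, grad_W2, grad_b2), L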
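
`backward_pass` obtains the gradient with respect to `z2` by multiplying the full softmax Jacobian with the cross-entropy gradient. For softmax followed by categorical cross-entropy this product reduces to `(a2 - y) / n`. The sketch below checks that numerically on random toy logits, restating the helper functions so it runs on its own. The toy data and variable names are illustrative only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def d_softmax(x):
    # Jacobian per sample: result[i, j, k] = y[j, k] * (delta_ij - y[i, k])
    y = softmax(x)
    tiled = np.tile(y, (10, 1, 1))
    return tiled * (np.eye(10)[..., np.newaxis] - tiled.transpose(1, 0, 2))

def d_loss(predicted, target):
    return -target / predicted / predicted.shape[1]

rng = np.random.default_rng(0)
z2 = rng.normal(size=(10, 6))                       # toy logits, 6 samples
y = np.eye(10)[:, rng.integers(0, 10, size=6)]      # one-hot targets as columns
a2 = softmax(z2)

# Chain rule through the full Jacobian, as in backward_pass.
dL_full = np.matmul(d_softmax(z2).T, d_loss(a2, y)[np.newaxis, ...].T).T[0, ...]
# Closed form for softmax followed by cross-entropy.
dL_short = (a2 - y) / z2.shape[1]
max_diff = np.max(np.abs(dL_full - dL_short))
print(max_diff)
```

The closed form avoids building a `(10, 10, n)` array per batch and never divides by the predicted probabilities, so it is both cheaper and safer when a prediction gets very close to zero.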

## 1.4 Evaluate

In [9]:
def evaluate(X, y, weights):
    _, _, _, a2 = forward_pass(weights, X)
    L = loss(a2, y.T)
    a2_ = np.argmax(a2.T, axis=1)
    y_ = np.argmax(y, axis=1)
    print(f'Evaluation:\tLoss: {L:.6f}\tAccuracy: {np.mean(a2_ == y_):.6f}') 

## 1.5 Train

In [10]:
def train_one_batch(X, y, weights, l_rate):
    grads, L = backward_pass(weights, X, y)  
    weights.W2 -= grads.W2 * l_rate
    weights.b2 -= grads.b2 * l_rate
    weights.W1 -= grads.W1 * l_rate
    weights.b1 -= grads.b1 * l_rate
    return L

def train(X, y, weights, l_rate, epochs, batch_size):
    n_total = X.shape[0]
    n_batches = n_total // batch_size + (1 if n_total % batch_size != 0 else 0)
    for epoch in range(1, epochs+1):
        L = 0.0
        for n_batch in range(1, n_batches+1):
            batch_X = X[(n_batch-1)*batch_size : n_batch*batch_size, :]
            batch_y = y[(n_batch-1)*batch_size : n_batch*batch_size, :]
            L += train_one_batch(batch_X, batch_y, weights, l_rate)
            print('\rEpoch: {}\tBatch: {}/{}\tLoss: {:.6f}\t'.format(epoch, n_batch, n_batches, L/n_batch), end='')
        print('\rEpoch: {}\t'.format(epoch), end='')
        evaluate(X, y, weights)

In [11]:
def prepare_and_train(n_hidden_units, epochs, l_rate, batch_size):
    weights = network_initialization(n_hidden_units)
    train(X_train, y_train, weights, l_rate, epochs, batch_size)
    evaluate(X_test, y_test, weights)
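
The `train` loop above visits the samples in the same fixed order in every epoch. A common variation of minibatch SGD reshuffles the order each epoch. A minimal sketch of such a batch iterator follows. The name `iterate_minibatches` and the toy arrays are hypothetical, and this is not what produced the results reported in this notebook.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    # Hypothetical helper: yield (batch_X, batch_y) pairs in a fresh random order.
    order = rng.permutation(X.shape[0])
    for start in range(0, X.shape[0], batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X_toy = np.arange(10, dtype=float).reshape(10, 1)   # 10 samples, 1 feature
y_toy = np.arange(10).reshape(10, 1)                # label equals the feature
batches = list(iterate_minibatches(X_toy, y_toy, 4, rng))
batch_sizes = [bx.shape[0] for bx, _ in batches]
print(batch_sizes)
```

Slicing with `start:start + batch_size` handles the last, smaller batch without the explicit remainder arithmetic used for `n_batches`.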

# 2. Testing

In [13]:
for n_hidden_units in [100, 200]:
    for epochs in [5, 10]:
        for l_rate in [0.2, 0.01]:
            for batch_size in [16, 32]:
                print(f'===========  {n_hidden_units = }\t{epochs = }\t{l_rate = }\t{batch_size = }  ===========')
                prepare_and_train(n_hidden_units, epochs, l_rate, batch_size)

Epoch: 1	Batch: 3750/3750	Loss: 0.209165	Evaluation:	Loss: 0.129576	Accuracy: 0.959933
Epoch: 2	Batch: 3750/3750	Loss: 0.101224	Evaluation:	Loss: 0.101794	Accuracy: 0.967150
Epoch: 3	Batch: 3750/3750	Loss: 0.070838	Evaluation:	Loss: 0.084868	Accuracy: 0.971483
Epoch: 4	Batch: 3750/3750	Loss: 0.054391	Evaluation:	Loss: 0.073139	Accuracy: 0.976367
Epoch: 5	Batch: 3750/3750	Loss: 0.042083	Evaluation:	Loss: 0.054448	Accuracy: 0.982050
Evaluation:	Loss: 0.113353	Accuracy: 0.969800
Epoch: 1	Batch: 1875/1875	Loss: 0.240942	Evaluation:	Loss: 0.142644	Accuracy: 0.955850
Epoch: 2	Batch: 1875/1875	Loss: 0.111585	Evaluation:	Loss: 0.091079	Accuracy: 0.971650
Epoch: 3	Batch: 1875/1875	Loss: 0.079606	Evaluation:	Loss: 0.068745	Accuracy: 0.978567
Epoch: 4	Batch: 1875/1875	Loss: 0.061621	Evaluation:	Loss: 0.056409	Accuracy: 0.982000
Epoch: 5	Batch: 1875/1875	Loss: 0.049375	Evaluation:	Loss: 0.046722	Accuracy: 0.985350
Evaluation:	Loss: 0.087270	Accuracy: 0.972800
Epoch: 1	Batch: 3750/3750	Loss: 0.5059

Epoch: 2	Batch: 1875/1875	Loss: 0.335306	Evaluation:	Loss: 0.304957	Accuracy: 0.914050
Epoch: 3	Batch: 1875/1875	Loss: 0.285208	Evaluation:	Loss: 0.267547	Accuracy: 0.924283
Epoch: 4	Batch: 1875/1875	Loss: 0.253924	Evaluation:	Loss: 0.240717	Accuracy: 0.932367
Epoch: 5	Batch: 1875/1875	Loss: 0.230113	Evaluation:	Loss: 0.219299	Accuracy: 0.938500
Evaluation:	Loss: 0.214124	Accuracy: 0.939500
Epoch: 1	Batch: 3750/3750	Loss: 0.195486	Evaluation:	Loss: 0.113273	Accuracy: 0.964200
Epoch: 2	Batch: 3750/3750	Loss: 0.086662	Evaluation:	Loss: 0.086911	Accuracy: 0.971083
Epoch: 3	Batch: 3750/3750	Loss: 0.057494	Evaluation:	Loss: 0.063052	Accuracy: 0.979250
Epoch: 4	Batch: 3750/3750	Loss: 0.039881	Evaluation:	Loss: 0.041248	Accuracy: 0.986250
Epoch: 5	Batch: 3750/3750	Loss: 0.028469	Evaluation:	Loss: 0.034779	Accuracy: 0.988083
Epoch: 6	Batch: 3750/3750	Loss: 0.021597	Evaluation:	Loss: 0.033141	Accuracy: 0.988667
Epoch: 7	Batch: 3750/3750	Loss: 0.016763	Evaluation:	Loss: 0.027524	Accuracy: 0.9905

Best model for:
* n_hidden_units = 200 
* epochs = 10 
* l_rate = 0.2
* batch_size = 32 
#### Accuracy = 0.9784.
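
The four nested loops above enumerate 2 × 2 × 2 × 2 = 16 configurations. The same search can be sketched with `itertools.product`, keeping track of the best configuration automatically. `grid_search` and `toy_score` are hypothetical names. The score function here is an arbitrary stand-in, not a measurement; in the notebook it would train a network and return its test accuracy, which would require `evaluate` to return the accuracy instead of only printing it.

```python
import itertools

def grid_search(param_grid, score_fn):
    # Hypothetical helper: try every combination and keep the best-scoring one.
    names = list(param_grid)
    best_params, best_score = None, float('-inf')
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

param_grid = {
    'n_hidden_units': [100, 200],
    'epochs': [5, 10],
    'l_rate': [0.2, 0.01],
    'batch_size': [16, 32],
}
n_configs = len(list(itertools.product(*param_grid.values())))

# Arbitrary placeholder score so the sketch runs on its own (not a real result).
def toy_score(n_hidden_units, epochs, l_rate, batch_size):
    return n_hidden_units * epochs * l_rate * batch_size

best_params, best_score = grid_search(param_grid, toy_score)
print(n_configs, best_params)
```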

# 3. Different optimization algorithms

## 3.1 Momentum

In [None]:
class SGDMomentum:
    def __init__(self, l_rate=0.01, momentum=0.9):
        self.l_rate = l_rate
        self.momentum = momentum
        self.m = None
    
    def apply_grads(self, grads, network):
        if self.m is None:
            W1 = np.zeros_like(grads.W1)
            b1 = np.zeros_like(grads.b1)
            W2 = np.zeros_like(grads.W2)
            b2 = np.zeros_like(grads.b2)
            self.m = NetworkWeights(W1, b1, W2, b2)
            
        self.m.W1 = self.momentum*self.m.W1 - self.l_rate*grads.W1
        self.m.b1 = self.momentum*self.m.b1 - self.l_rate*grads.b1
        self.m.W2 = self.momentum*self.m.W2 - self.l_rate*grads.W2
        self.m.b2 = self.momentum*self.m.b2 - self.l_rate*grads.b2
        
        network.W1 += self.m.W1
        network.b1 += self.m.b1
        network.W2 += self.m.W2
        network.b2 += self.m.b2

## 3.2 Adagrad

In [14]:
class AdaGrad:
    def __init__(self, l_rate=0.01):
        self.l_rate = l_rate
        self.m = None
        self.eps = 1e-10
    
    def apply_grads(self, grads, network):
        if self.m is None:
            W1 = np.zeros_like(grads.W1)
            b1 = np.zeros_like(grads.b1)
            W2 = np.zeros_like(grads.W2)
            b2 = np.zeros_like(grads.b2)
            self.m = NetworkWeights(W1, b1, W2, b2)
            
        self.m.W1 += grads.W1*grads.W1
        self.m.b1 += grads.b1*grads.b1
        self.m.W2 += grads.W2*grads.W2
        self.m.b2 += grads.b2*grads.b2
        
        network.W1 -= self.l_rate*grads.W1 / np.sqrt(self.m.W1 + self.eps)
        network.b1 -= self.l_rate*grads.b1 / np.sqrt(self.m.b1 + self.eps)
        network.W2 -= self.l_rate*grads.W2 / np.sqrt(self.m.W2 + self.eps)
        network.b2 -= self.l_rate*grads.b2 / np.sqrt(self.m.b2 + self.eps)

## 3.3 RMSprop

In [None]:
class RMSProp:
    def __init__(self, l_rate=0.01, momentum=0.9):
        self.l_rate = l_rate
        self.momentum = momentum
        self.m = None
        self.eps = 1e-10
    
    def apply_grads(self, grads, network):
        if self.m is None:
            W1 = np.zeros_like(grads.W1)
            b1 = np.zeros_like(grads.b1)
            W2 = np.zeros_like(grads.W2)
            b2 = np.zeros_like(grads.b2)
            self.m = NetworkWeights(W1, b1, W2, b2)
            
        self.m.W1 = self.momentum*self.m.W1 + (1.0 - self.momentum)*grads.W1*grads.W1
        self.m.b1 = self.momentum*self.m.b1 + (1.0 - self.momentum)*grads.b1*grads.b1
        self.m.W2 = self.momentum*self.m.W2 + (1.0 - self.momentum)*grads.W2*grads.W2
        self.m.b2 = self.momentum*self.m.b2 + (1.0 - self.momentum)*grads.b2*grads.b2
        
        network.W1 -= self.l_rate*grads.W1 / np.sqrt(self.m.W1 + self.eps)
        network.b1 -= self.l_rate*grads.b1 / np.sqrt(self.m.b1 + self.eps)
        network.W2 -= self.l_rate*grads.W2 / np.sqrt(self.m.W2 + self.eps)
        network.b2 -= self.l_rate*grads.b2 / np.sqrt(self.m.b2 + self.eps)

## 3.4 ADAM

In [None]:
class Adam:
    def __init__(self, l_rate=0.01, beta1=0.9, beta2=0.99):
        self.l_rate = l_rate
        self.beta1 = beta1
        self.beta2 = beta2
        self.m = None
        self.s = None
        self.eps = 1e-10
        self.t = 1
    
    def apply_grads(self, grads, network):
        if self.m is None:
            W1 = np.zeros_like(grads.W1)
            b1 = np.zeros_like(grads.b1)
            W2 = np.zeros_like(grads.W2)
            b2 = np.zeros_like(grads.b2)
            self.m = NetworkWeights(W1, b1, W2, b2)
            self.s = NetworkWeights(W1.copy(), b1.copy(), W2.copy(), b2.copy())
            
        self.m.W1 = self.beta1*self.m.W1 - (1.0 - self.beta1)*grads.W1
        self.m.b1 = self.beta1*self.m.b1 - (1.0 - self.beta1)*grads.b1
        self.m.W2 = self.beta1*self.m.W2 - (1.0 - self.beta1)*grads.W2
        self.m.b2 = self.beta1*self.m.b2 - (1.0 - self.beta1)*grads.b2
        
        self.s.W1 = self.beta2*self.s.W1 + (1.0 - self.beta2)*grads.W1*grads.W1
        self.s.b1 = self.beta2*self.s.b1 + (1.0 - self.beta2)*grads.b1*grads.b1
        self.s.W2 = self.beta2*self.s.W2 + (1.0 - self.beta2)*grads.W2*grads.W2
        self.s.b2 = self.beta2*self.s.b2 + (1.0 - self.beta2)*grads.b2*grads.b2
        
        mW1 = self.m.W1 / (1.0 - self.beta1**self.t)
        mb1 = self.m.b1 / (1.0 - self.beta1**self.t)
        mW2 = self.m.W2 / (1.0 - self.beta1**self.t)
        mb2 = self.m.b2 / (1.0 - self.beta1**self.t)
        
        sW1 = self.s.W1 / (1.0 - self.beta2**self.t)
        sb1 = self.s.b1 / (1.0 - self.beta2**self.t)
        sW2 = self.s.W2 / (1.0 - self.beta2**self.t)
        sb2 = self.s.b2 / (1.0 - self.beta2**self.t)
        
        network.W1 += self.l_rate*mW1 / np.sqrt(sW1 + self.eps)
        network.b1 += self.l_rate*mb1 / np.sqrt(sb1 + self.eps)
        network.W2 += self.l_rate*mW2 / np.sqrt(sW2 + self.eps)
        network.b2 += self.l_rate*mb2 / np.sqrt(sb2 + self.eps)
        
        self.t += 1
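
The four optimizer classes above repeat every update once per parameter (`W1`, `b1`, `W2`, `b2`). One way to cut the repetition is to store parameters in a dictionary and loop over it. The sketch below does this for Adam in its usual form, where the first moment accumulates the gradient itself and the step is subtracted; this is equivalent to the sign convention used above. `AdamSketch` and the quadratic toy problem are hypothetical and only illustrate the structure.

```python
import numpy as np

class AdamSketch:
    # Hypothetical compact variant: parameters and gradients are dicts of arrays.
    def __init__(self, l_rate=0.01, beta1=0.9, beta2=0.99, eps=1e-10):
        self.l_rate, self.beta1, self.beta2, self.eps = l_rate, beta1, beta2, eps
        self.m, self.s, self.t = {}, {}, 0

    def apply_grads(self, grads, params):
        self.t += 1
        for name, g in grads.items():
            m = self.m.get(name, np.zeros_like(g))
            s = self.s.get(name, np.zeros_like(g))
            m = self.beta1 * m + (1.0 - self.beta1) * g        # first moment
            s = self.beta2 * s + (1.0 - self.beta2) * g * g    # second moment
            self.m[name], self.s[name] = m, s
            m_hat = m / (1.0 - self.beta1 ** self.t)           # bias correction
            s_hat = s / (1.0 - self.beta2 ** self.t)
            params[name] -= self.l_rate * m_hat / np.sqrt(s_hat + self.eps)

# Toy problem: minimise f(w) = sum(w**2), whose gradient is 2*w.
params = {'w': np.array([3.0, -2.0])}
opt = AdamSketch(l_rate=0.1)
start_value = float(np.sum(params['w'] ** 2))
for _ in range(200):
    opt.apply_grads({'w': 2.0 * params['w']}, params)
end_value = float(np.sum(params['w'] ** 2))
print(start_value, end_value)
```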

## 3.5 Testing optimization algorithms

In [15]:
def train_one_batch(X, y, weights, optimizer):
    grads, L = backward_pass(weights, X, y)
    
    optimizer.apply_grads(grads, weights)

    return L

def train(X, y, weights, optimizer, epochs, batch_size):
    n_total = X.shape[0]
    n_batches = n_total // batch_size + (1 if n_total % batch_size != 0 else 0)
    for epoch in range(1, epochs+1):
        L = 0.0
        for n_batch in range(1, n_batches+1):
            batch_X = X[(n_batch-1)*batch_size : n_batch*batch_size, :]
            batch_y = y[(n_batch-1)*batch_size : n_batch*batch_size, :]
            L += train_one_batch(batch_X, batch_y, weights, optimizer)
            print('\rEpoch: {}\tBatch: {}/{}\tLoss: {:.6f}\t'.format(epoch, n_batch, n_batches, L/n_batch), end='')
        print('\rEpoch: {}\t'.format(epoch), end='')
        evaluate(X, y, weights)

def prepare_and_train(n_hidden_units, optimizer, epochs, batch_size):
    weights = network_initialization(n_hidden_units)
    train(X_train, y_train, weights, optimizer, epochs, batch_size)
    evaluate(X_test, y_test, weights)

In [16]:
momentum = SGDMomentum(l_rate=0.03)
prepare_and_train(200, momentum, 10, 32)

Epoch: 1	Batch: 1875/1875	Loss: 0.230360	Evaluation:	Loss: 0.143024	Accuracy: 0.954900
Epoch: 2	Batch: 1875/1875	Loss: 0.101358	Evaluation:	Loss: 0.096667	Accuracy: 0.968433
Epoch: 3	Batch: 1875/1875	Loss: 0.065895	Evaluation:	Loss: 0.065215	Accuracy: 0.978433
Epoch: 4	Batch: 1875/1875	Loss: 0.046793	Evaluation:	Loss: 0.053981	Accuracy: 0.981733
Epoch: 5	Batch: 1875/1875	Loss: 0.033830	Evaluation:	Loss: 0.040977	Accuracy: 0.986133
Epoch: 6	Batch: 1875/1875	Loss: 0.024025	Evaluation:	Loss: 0.026080	Accuracy: 0.991183
Epoch: 7	Batch: 1875/1875	Loss: 0.017561	Evaluation:	Loss: 0.023640	Accuracy: 0.991767
Epoch: 8	Batch: 1875/1875	Loss: 0.013079	Evaluation:	Loss: 0.017215	Accuracy: 0.994283
Epoch: 9	Batch: 1875/1875	Loss: 0.009483	Evaluation:	Loss: 0.016914	Accuracy: 0.994367
Epoch: 10	Batch: 1875/1875	Loss: 0.006888	Evaluation:	Loss: 0.009679	Accuracy: 0.997017
Evaluation:	Loss: 0.080080	Accuracy: 0.978400


In [17]:
adagrad = AdaGrad(l_rate=0.03)
prepare_and_train(200, adagrad, 10, 32)

Epoch: 1	Batch: 1875/1875	Loss: 0.192988	Evaluation:	Loss: 0.104383	Accuracy: 0.969667
Epoch: 2	Batch: 1875/1875	Loss: 0.090914	Evaluation:	Loss: 0.071938	Accuracy: 0.979250
Epoch: 3	Batch: 1875/1875	Loss: 0.066541	Evaluation:	Loss: 0.056337	Accuracy: 0.983867
Epoch: 4	Batch: 1875/1875	Loss: 0.052779	Evaluation:	Loss: 0.046688	Accuracy: 0.987083
Epoch: 5	Batch: 1875/1875	Loss: 0.043470	Evaluation:	Loss: 0.039510	Accuracy: 0.989633
Epoch: 6	Batch: 1875/1875	Loss: 0.036671	Evaluation:	Loss: 0.034056	Accuracy: 0.991367
Epoch: 7	Batch: 1875/1875	Loss: 0.031504	Evaluation:	Loss: 0.029889	Accuracy: 0.992633
Epoch: 8	Batch: 1875/1875	Loss: 0.027388	Evaluation:	Loss: 0.026469	Accuracy: 0.993933
Epoch: 9	Batch: 1875/1875	Loss: 0.024058	Evaluation:	Loss: 0.023542	Accuracy: 0.994667
Epoch: 10	Batch: 1875/1875	Loss: 0.021332	Evaluation:	Loss: 0.021059	Accuracy: 0.995550
Evaluation:	Loss: 0.067995	Accuracy: 0.978800


In [18]:
rmsprop = RMSProp(l_rate=0.003)
prepare_and_train(200, rmsprop, 10, 32)

Epoch: 1	Batch: 1875/1875	Loss: 0.194916	Evaluation:	Loss: 0.116074	Accuracy: 0.965533
Epoch: 2	Batch: 1875/1875	Loss: 0.103709	Evaluation:	Loss: 0.100284	Accuracy: 0.974183
Epoch: 3	Batch: 1875/1875	Loss: 0.081401	Evaluation:	Loss: 0.080509	Accuracy: 0.979717
Epoch: 4	Batch: 1875/1875	Loss: 0.068313	Evaluation:	Loss: 0.077731	Accuracy: 0.981850
Epoch: 5	Batch: 1875/1875	Loss: 0.057669	Evaluation:	Loss: 0.096365	Accuracy: 0.978133
Epoch: 6	Batch: 1875/1875	Loss: 0.049626	Evaluation:	Loss: 0.070319	Accuracy: 0.984517
Epoch: 7	Batch: 1875/1875	Loss: 0.049863	Evaluation:	Loss: 0.053614	Accuracy: 0.988383
Epoch: 8	Batch: 1875/1875	Loss: 0.042183	Evaluation:	Loss: 0.055991	Accuracy: 0.989150
Epoch: 9	Batch: 1875/1875	Loss: 0.033964	Evaluation:	Loss: 0.059295	Accuracy: 0.988117
Epoch: 10	Batch: 1875/1875	Loss: 0.029653	Evaluation:	Loss: 0.049443	Accuracy: 0.990033
Evaluation:	Loss: 0.231316	Accuracy: 0.972100


In [19]:
adam = Adam(l_rate=0.003)
prepare_and_train(200, adam, 10, 32)

Epoch: 1	Batch: 1875/1875	Loss: 0.204840	Evaluation:	Loss: 0.181572	Accuracy: 0.946433
Epoch: 2	Batch: 1875/1875	Loss: 0.095863	Evaluation:	Loss: 0.120989	Accuracy: 0.963550
Epoch: 3	Batch: 1875/1875	Loss: 0.069214	Evaluation:	Loss: 0.077782	Accuracy: 0.976650
Epoch: 4	Batch: 1875/1875	Loss: 0.055853	Evaluation:	Loss: 0.052027	Accuracy: 0.983800
Epoch: 5	Batch: 1875/1875	Loss: 0.045032	Evaluation:	Loss: 0.049440	Accuracy: 0.985700
Epoch: 6	Batch: 1875/1875	Loss: 0.036714	Evaluation:	Loss: 0.056177	Accuracy: 0.984950
Epoch: 7	Batch: 1875/1875	Loss: 0.035412	Evaluation:	Loss: 0.039219	Accuracy: 0.988917
Epoch: 8	Batch: 1875/1875	Loss: 0.029363	Evaluation:	Loss: 0.043676	Accuracy: 0.987667
Epoch: 9	Batch: 1875/1875	Loss: 0.027796	Evaluation:	Loss: 0.027048	Accuracy: 0.993033
Epoch: 10	Batch: 1875/1875	Loss: 0.026497	Evaluation:	Loss: 0.040849	Accuracy: 0.990017
Evaluation:	Loss: 0.168664	Accuracy: 0.977300


Using Momentum did not lead to better results or faster convergence: its test accuracy (0.9784) equals that of the best plain SGD model.
Moreover, RMSProp (0.9721) and Adam (0.9773) gave lower test accuracy than that model.
AdaGrad gave the highest test accuracy (0.9788), an improvement of only 0.0004.
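
The comparison can be tabulated with a few lines of Python. The accuracies below are the final test-set values copied from the evaluation lines above, with the plain SGD value taken from the best model of section 2. The error rate is simply `1 - accuracy`.

```python
# Final test-set accuracies copied from the evaluation lines above.
test_accuracy = {
    'SGD (l_rate=0.2)': 0.9784,
    'Momentum': 0.9784,
    'AdaGrad': 0.9788,
    'RMSProp': 0.9721,
    'Adam': 0.9773,
}
baseline = test_accuracy['SGD (l_rate=0.2)']
deltas = {name: acc - baseline for name, acc in test_accuracy.items()}
error_rates = {name: 100.0 * (1.0 - acc) for name, acc in test_accuracy.items()}

for name, acc in test_accuracy.items():
    print(f'{name:<18} accuracy={acc:.4f}  error={error_rates[name]:.2f}%  delta={deltas[name]:+.4f}')
```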

# 4. L2 regularization

In [23]:
def backward_pass(weights, X, y, lambda_):
    z1, a1, z2, a2 = forward_pass(weights, X)
    
    # Cross-entropy plus the L2 penalty on the weight matrices (biases are not regularized).
    L = loss(a2, y.T) + lambda_/2.0 * (np.linalg.norm(weights.W1)**2 + np.linalg.norm(weights.W2)**2)
    dL = np.matmul(d_softmax(z2).T, d_loss(a2, y.T)[np.newaxis, ...].T).T[0, ...]
    d_hidden = np.matmul(weights.W2.T, dL) * D_relu(a1)
    # The penalty contributes lambda_ * W to each weight gradient.
    grad_W2 = np.matmul(dL, a1.T) + lambda_*weights.W2
    grad_b2 = np.mean(dL, axis=1, keepdims=True)
    grad_W1 = np.matmul(d_hidden, X) + lambda_*weights.W1
    grad_b1 = np.mean(d_hidden, axis=1, keepdims=True)
    return NetworkWeights(grad_W1, grad_b1, grad_W2, grad_b2), L

def train_one_batch(X, y, weights, optimizer, lambda_):
    grads, L = backward_pass(weights, X, y, lambda_)   
    optimizer.apply_grads(grads, weights)
    return L

def train(X, y, weights, optimizer, epochs, batch_size, lambda_):
    n_total = X.shape[0]
    n_batches = n_total // batch_size + (1 if n_total % batch_size != 0 else 0)
    for epoch in range(1, epochs+1):
        L = 0.0
        for n_batch in range(1, n_batches+1):
            batch_X = X[(n_batch-1)*batch_size : n_batch*batch_size, :]
            batch_y = y[(n_batch-1)*batch_size : n_batch*batch_size, :]
            L += train_one_batch(batch_X, batch_y, weights, optimizer, lambda_)
            print('\rEpoch: {}\tBatch: {}/{}\tLoss: {:.6f}\t'.format(epoch, n_batch, n_batches, L/n_batch), end='')
        print('\rEpoch: {}\t'.format(epoch), end='')
        evaluate(X, y, weights)

def prepare_and_train_2(n_hidden_units, optimizer, epochs, batch_size, lambda_):
    weights = network_initialization(n_hidden_units)
    train(X_train, y_train, weights, optimizer, epochs, batch_size, lambda_)
    evaluate(X_test, y_test, weights)
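
The `lambda_*weights.W1` and `lambda_*weights.W2` terms in `backward_pass` are the gradient of the penalty `lambda_/2 * ||W||²`. The sketch below verifies this with central finite differences on a small random matrix. The matrix size and the helper name `l2_penalty` are arbitrary choices for illustration.

```python
import numpy as np

def l2_penalty(W, lambda_):
    return lambda_ / 2.0 * np.linalg.norm(W) ** 2   # Frobenius norm for 2-D arrays

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
lambda_ = 0.002
analytic = lambda_ * W

# Central finite differences, one entry at a time.
numeric = np.zeros_like(W)
h = 1e-6
for idx in np.ndindex(W.shape):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[idx] += h
    W_minus[idx] -= h
    numeric[idx] = (l2_penalty(W_plus, lambda_) - l2_penalty(W_minus, lambda_)) / (2 * h)

max_abs_error = np.max(np.abs(analytic - numeric))
print(max_abs_error)
```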

In [24]:
# Create a fresh Adam instance per run so the moment estimates and the step
# counter do not carry over from one network to the next.
for lambda_ in [0.0002, 0.002, 0.02]:
    prepare_and_train_2(200, Adam(l_rate=0.03), 10, 32, lambda_)

Epoch: 1	Batch: 1875/1875	Loss: 0.637059	Evaluation:	Loss: 0.590668	Accuracy: 0.848767
Epoch: 2	Batch: 1875/1875	Loss: 0.567517	Evaluation:	Loss: 0.469577	Accuracy: 0.882400
Epoch: 3	Batch: 1875/1875	Loss: 0.546045	Evaluation:	Loss: 0.551237	Accuracy: 0.839400
Epoch: 4	Batch: 1875/1875	Loss: 0.554611	Evaluation:	Loss: 0.510123	Accuracy: 0.866933
Epoch: 5	Batch: 1875/1875	Loss: 0.556291	Evaluation:	Loss: 0.681219	Accuracy: 0.818867
Epoch: 6	Batch: 1875/1875	Loss: 0.535381	Evaluation:	Loss: 0.572802	Accuracy: 0.828133
Epoch: 7	Batch: 1875/1875	Loss: 0.536547	Evaluation:	Loss: 0.704173	Accuracy: 0.809067
Epoch: 8	Batch: 1875/1875	Loss: 0.542912	Evaluation:	Loss: 0.532787	Accuracy: 0.851467
Epoch: 9	Batch: 1875/1875	Loss: 0.539565	Evaluation:	Loss: 0.574547	Accuracy: 0.841150
Epoch: 10	Batch: 1875/1875	Loss: 0.536208	Evaluation:	Loss: 0.597057	Accuracy: 0.833717
Evaluation:	Loss: 0.589680	Accuracy: 0.833800
Epoch: 1	Batch: 1875/1875	Loss: 0.882030	Evaluation:	Loss: 0.953037	Accuracy: 0.758

# 5. Conclusion

The network was tested with various **hyperparameters**, namely
* Number of neurons in hidden layer = [100, 200]
* Number of epochs = [5, 10]
* Different learning rates = [0.2, 0.01]
* Different minibatch sizes = [16, 32]
 
The best result was obtained for
* n_hidden_units = 200
* epochs = 10
* l_rate = 0.2
* batch_size = 32
#### with accuracy = 0.9784

Using **Momentum** did not lead to better results or faster convergence: its test accuracy (0.9784) equals that of the best plain SGD model. \
Moreover, **RMSProp and Adam** gave lower test accuracy (0.9721 and 0.9773). \
**AdaGrad** gave the highest test accuracy (0.9788), an improvement of only 0.0004.

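None of the tested **L2 regularization** strengths (0.0002, 0.002, 0.02) resulted in a model with a lower error rate. Note that these runs used Adam with l_rate = 0.03, ten times the rate used in the optimizer comparison, so the lower accuracy cannot be attributed to the regularization alone.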