# D7046E Exercise 1 (ANN1)

This exercise has three taks where you will deepen your understanding of how artificial neural networks (ANNs) are implemented and trained. First, you will represent digits on an eight-segment display as vectors and hard-code perceptrons that classifies the digits. The purpose of this task is to better understand the basic computational units in ANNs and how inputs can be represented as feature vectors. Secondly, you will implement and train neural networks using [pytorch](https://pytorch.org/) on the seven-segment display data and the [MNIST](http://yann.lecun.com/exdb/mnist/) dataset which you are familiar with from Exercise 0. Finally, you will implement a neural network including the forward (inference) pass and the backward (learning) pass from scratch using numpy. After completing these steps you will know the central building blocks of ANNs.

## Literature
Before starting with the implementation you should familiarize yourself with two additional chapters in the [deep learning book](https://www.deeplearningbook.org/). This will help you understand the theory behind neural networks and what mathematical formulas are important for the task. The lectures has touched on most of these concepts. Below is a list of recommended sections from the book. If you feel familiar with the contents of these sections, feel free to skip it.

* Chapter 6 - Deep feedforward networks
    - Section 6.0 - Discusses what do we mean by feedfoward networks and terminology such as input layer, output layer and hidden layer.
    - Section 6.2 - Discusses what gradient based learning is and what cost functions are.
    - Section 6.5 - Explains back propagation. Important here are the formulas 6.49 - 6.52.
* Chapter 8 - Optimization for Training Deep Models
    - Section 8.1.3 - Presents differences between batch (deterministic) and mini-batch (stochastic) algorithms.
    
## Libraries

Before starting with the implementations you need to import the following libraries.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
from torch.utils.data import DataLoader, random_split
from torchvision import datasets, transforms

import matplotlib.pyplot as plt
import numpy
import copy

In [None]:
## First we need to define all the vectors corresponding to the various digits and add them to a list for easy access
# Please finish the list of digit vectors

x = [
    numpy.array([1,1,1,1,1,1,0]), # 0
    numpy.array([0,1,1,0,0,0,0]), # 1
    numpy.array([1,1,0,1,1,0,1]), # 2
    numpy.array([1,1,1,1,0,0,1]), # 3
    numpy.array([0,1,1,0,0,1,1]), # 4
    numpy.array([1,0,1,1,0,1,1]), # 5
    numpy.array([1,0,1,1,1,1,1]), # 6
    numpy.array([1,1,1,0,0,0,0]), # 7
    numpy.array([1,1,1,1,1,1,1]), # 8
    numpy.array([1,1,1,1,0,1,1]), # 9
]

# And we print one of the vectors to show you how to get a specific vector
print(f'Digit 5 corresponds to the vector {x[5]}')

### Task 1.3: Check the solution

Execute the cell below to see whether the network managed to learn to make the correct predictions.
Can you figure out what the learned weights and biases are, and how similar they are to your hardcoded solutions in the first task?

In [None]:
# Define the mini-batch size
batch_size = 1000

# Download the dataset and create the dataloaders
mnist_train = datasets.MNIST("./", train=True, download=True, transform=transforms.ToTensor())

# Dataset is split 8:2
train_size = int(0.8 * len(mnist_train))
val_size = len(mnist_train) - train_size
mnist_train, mnist_val = random_split(mnist_train, [train_size, val_size])

train_loader = torch.utils.data.DataLoader(mnist_train, batch_size=batch_size, shuffle=False)
mnist_test = datasets.MNIST("./", train=False, download=True, transform=transforms.ToTensor())
test_loader = torch.utils.data.DataLoader(mnist_test, batch_size=batch_size, shuffle=False)
val_loader = torch.utils.data.DataLoader(mnist_val, batch_size=batch_size, shuffle=False)

to_onehot = nn.Embedding(10, 10)
to_onehot.weight.data = torch.eye(10)

def plot_digit(data):
    data = data.view(28, 28)
    plt.imshow(data, cmap="gray")

    plt.show()

images, labels = next(iter(train_loader))
plot_digit(images[0])


### Task 2.1

Implement a 2-layer neural network using pytorch as well as a procedure for training and testing it. The training protocol should include both training and validation. Thus you need to split the training data into a training set (for which the error is backpropagated to update the parameters) and a validation set (which will not be used to directly update the model parameters, and instead be used to keep track of how good the model is at unseen data). 

The weights of the model which performs the best on the validation data should be stored and then be used for the final check on the test set. Validation sets are often created by taking a fraction of the training data (often, but not always, around 20%) at random. In Pytorch you might want to use [random_split](https://pytorch.org/docs/stable/data.html#torch.utils.data.random_split) for this. Using random split would require you to edit the way the Dataloaders are created.

You are free to choose any optimizer and loss function. Just note that some loss functions require the labels to be 1-hot encoded. As you will not use convolutional layers for this exercise (will be introduced later in the course), the inputs need to be transformed to 1d tensors (see [view](https://pytorch.org/docs/stable/tensors.html?highlight=view#torch.Tensor.view)).

**GOAL:** You should evaluate the network from the epoch with best validation score (early stopping) on the test set aiming to reach at least 85% accuracy.

**Remember** to run all your code before grading so that the teacher doesn't have to wait around for long training runs. Plot the training and validation losses for each epoch.

*Hint:* Validation and Testing loops are very similar to training except they don't use backpropagation. Additionally testing should only be performed once, while validation should be performed continually to make sure training is proceeding as intended and to save the parameters of the best epoch.

*Hint:* Storing the best model is more difficult than just assigning it to a variable as this only means you have two variables referencing the same network instance in memory (not a copy of the best betwork and one containing the current network). Instead you ned to make a copy of the network which can be achived with [deepcopy](https://docs.python.org/3/library/copy.html). Other ways to store models include saving them as a file which can be done with [torch.save](https://pytorch.org/tutorials/beginner/saving_loading_models.html).

*Hint:* Everytime you train a network with random parameter initialization and random batches you get networks with different performance. Sometimes just running the training again can be enough to get a better result. However, if you do this too many times you run the risk of training (overfitting) on the test set.

In [None]:
# This code initializes the neural network
from copy import deepcopy

correct_test = 0
total_test = 0
test_accuracy = 0
one_d_tensor = torch.DoubleTensor

# nn.Sequential can be given a list of neural network modules
network = nn.Sequential(
    nn.Linear(784, 100),  # First layer of the network takes the entire image and reduces it to 100 dimensions
    nn.ReLU(),
    nn.Linear(100, 10)  # The second layer takes those 100 dimensions and reduces them into estimated values for each digit
)

# Initialize the optimizer
optimizer = optim.SGD(network.parameters(), lr=0.1)

# Initialize the loss function
loss_function = nn.CrossEntropyLoss()

# Decide the number of epochs to train for (one epoch is one optimization iteration on the entire dataset)
epochs = 15
best_val_loss = float('inf')
best_model = None

# For each epoch
for epoch in range(epochs):
    network.train()
    total_loss = 0.0

    # For each batch of data (since the dataset is too large to run all data through the network at once)
    for batch_nr, (images, labels) in enumerate(train_loader):
        # Reshape the images to a single vector (28*28 = 784)
        images = images.view(-1, 784)
        # Predict for each digit in the batch what class they belong to
        prediction = network(images)
        # Calculate the loss of the prediction by comparing to the expected output
        loss = loss_function(prediction, labels)
        # Backpropagate the loss through the network to find the gradients of all parameters
        optimizer.zero_grad()
        loss.backward()
        # Update the parameters along their gradients
        optimizer.step()
        # Clear stored gradient values
        optimizer.zero_grad()
        total_loss += loss.item()
    average_loss = total_loss / len(train_loader)

    # Validation loop
    network.eval()
    val_loss = 0.0


    with torch.no_grad():
        for val_images, val_labels in val_loader:
            val_images = val_images.view(-1, 784)
            val_prediction = network(val_images)
            val_loss += loss_function(val_prediction, val_labels).item()
            _, predicted = torch.max(val_prediction, 1)
            total_test += val_labels.size(0)
            correct_test += (predicted == val_labels).sum().item()
            test_accuracy = correct_test / total_test
        if test_accuracy >= 0.85:
            print(f'Validation Accuracy: {test_accuracy:.2%}')
            best_model = deepcopy(network.state_dict())
            break

    average_val_loss = val_loss / len(val_loader)

    # Print training and validation loss
    print(f'Epoch [{epoch + 1}/{epochs}], Training Loss: {average_loss:.4f}, Validation Loss: {average_val_loss:.4f}')



In [13]:
import numpy as np
import matplotlib.pyplot as plt
import sys
import math

epochs = 25  # Set the number of epochs to train for
D_in = 784   # Input size, images are 28x28 = 784 element vectors
D_out = 10   # Output size, 10 digit classes
H1 = 100     # Hidden layer size
gamma = 1e-5 # Learning rate
batch_size = 250
# Define network with one hidden layer, random initial weights
w1 = np.random.randn(D_in, H1)
w2 = np.random.randn(H1, D_out)

# Training iterations

# Train for a number of epochs
for epoch in range(epochs):
    # Training by looping over training set
    for inputs, labels in train_loader:
        
        inputs = inputs.numpy()
        labels = labels.numpy()
        
        for i in range(batch_size):
            # iterate through the mini-batch and perform forward pass and backward pass
            x = inputs[i].reshape((1, D_in))
            y = np.eye(10)[labels[i]]    # 1-hot encoding

            # Forward pass
            h = np.dot(x, w1)
            h_relu = np.maximum(0, h)
            y_pred = np.dot(h_relu, w2)

            # Compute loss function, squared error
            squared_error = (y_pred - y) ** 2
            loss = np.sum(squared_error) / 2
            # sum_squared_error = np.sum(squared_error)
            # loss = sum_squared_error / y.size
            
            # Compute gradients of square-error loss with respect to w1 and w2 using backpropagation
            #dL_dy_pred = -(y - y_pred)
            #dRelu = 1 if x >= 1 else 0
            grad_y_pred = y_pred - y
            grad_w2 = np.dot(h_relu.T, grad_y_pred)
            grad_h_relu = np.dot(grad_y_pred, w2.T)
            grad_h = h.copy()
            grad_h[h < 0] = 0
            grad_w1 = np.dot(x.T, grad_h)
                    # Update weights (stochastic gradient 
            w1 -= gamma * grad_w1
            w2 -= gamma * grad_w2

        # Print loss at the end of each epoch
    print(f'Epoch {epoch + 1}/{epochs}, Loss: {loss}')

    #validate the model
    total_val = 0
    correct_val = 0

    for val_inputs, val_labels in val_loader:
        for i in range(batch_size):
            x_val = val_inputs[i].view(1, -1).numpy()
            y_val = np.eye(10)[val_labels[i]]    # 1-hot encoding

            # Forward pass on validation set
            h_val = np.dot(x_val, w1)
            h_relu_val = np.maximum(0, h_val)
            y_pred_val = np.dot(h_relu_val, w2)

            # Compute validation accuracy
            total_val += 1
            correct_val += np.argmax(y_pred_val) == np.argmax(y_val)

val_accuracy = correct_val / total_val
print(f'Epoch [{epoch + 1}/{epochs}], Validation Accuracy: {val_accuracy:.2%}')
print("hej")
            
#t training and validation loss


total = 0
correct = 0

for test_inputs, test_labels in test_loader:
    for i in range(batch_size):
        x_test = test_inputs[i].reshape((1, D_in)).numpy()
        y_test = np.eye(10)[test_labels[i]]    # 1-hot encoding

        # Forward pass on test set
        h_test = np.dot(x_test, w1)
        h_relu_test = np.maximum(0, h_test)
        y_pred_test = np.dot(h_relu_test, w2)

        # Compute test accuracy
        total_test += 1
        correct_test += np.argmax(y_pred_test) == np.argmax(y_test)

test_accuracy = correct_test / total_test
print(f'Test Accuracy: {test_accuracy:.2%}')           

Epoch 1/25, Loss: 189.85707447559895
Epoch 2/25, Loss: 78.92555122692234
Epoch 3/25, Loss: 43.549837555039275
Epoch 4/25, Loss: 29.05472949062305
Epoch 5/25, Loss: 19.968473441754373
Epoch 6/25, Loss: 14.575056776546084
Epoch 7/25, Loss: 11.05615583273368
Epoch 8/25, Loss: 7.584834397876871
Epoch 9/25, Loss: 5.2754942043243425
Epoch 10/25, Loss: 3.663674112280544
Epoch 11/25, Loss: 2.536832771815574
Epoch 12/25, Loss: 1.828554286143176
Epoch 13/25, Loss: 1.253232685645703
Epoch 14/25, Loss: 0.8430116012466082
Epoch 15/25, Loss: 0.559396854219661
Epoch 16/25, Loss: 0.3739471618762161
Epoch 17/25, Loss: 0.26532834548000717
Epoch 18/25, Loss: 0.21725520265596243
Epoch 19/25, Loss: 0.21711750241482533
Epoch 20/25, Loss: 0.20714639461384934
Epoch 21/25, Loss: 0.2022158339925319
Epoch 22/25, Loss: 0.20852119458875162
Epoch 23/25, Loss: 0.20612269629048643
Epoch 24/25, Loss: 0.21280928538182176
Epoch 25/25, Loss: 0.22983628055542965
Epoch [25/25], Validation Accuracy: 12.20%
hej
Test Accuracy

## End

You have now reached the end of ANN1. When you have completed and understood the task above please make sure that all results inluding plots have been computed and then schedule a meeting with a teacher. The teacher will then assess orally that you (the lab group) has completed the exercise and that you understand its essental elements.