# Fourth Coursework (Group X)
### Theoretical Foundations and Research Topics in Machine Learning

Follow the instructions in this notebook. Please remember to upload the filled-in Jupyter notebook as part of your final submission, together with the PDF of the other tasks. It is a good idea to also upload a PDF or HTML version of the notebook, as this ensures that nothing gets lost during upload.

**IMPORTANT:** You are not allowed to use additional imports, i.e., you should implement all functionalities using NumPy only.

In [1]:
# Load packages
import random

import matplotlib.pyplot as plt
import numpy as np
import sklearn as sk
import sklearn.datasets as dt

from sklearn.model_selection import train_test_split

# Display figure in the notebook
%matplotlib inline

#### Activation Functions & Loss (0.5 points)

In this section, you will have to implement three different activation functions (_Sigmoid_, _Tanh_, and _ReLU_). Please note that the method _forward()_ is the basic function, while the method _backward()_ should be used for the derivative of the function.

Additionally, you will have to implement the _Mean Squared Error_ loss function, which is defined as follows:

$$ MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 $$

You can use any functionality that is part of NumPy.

In [2]:
class Sigmoid():
    def forward(self, x):
        # Implement the sigmoid function
        ##### YOUR CODE HERE #####
        
        return 1 / (1 + np.exp(-x))
        

    def backward(self, x):
        # Implement the derivative of the sigmoid function
        ##### YOUR CODE HERE #####
        
        # derivative is df = f * (1 - f), with f the sigmoid evaluated once
        f = 1 / (1 + np.exp(-x))
        return f * (1 - f)

In [3]:
class TanH():
    def forward(self, x):
        # Implement the tanh function
        ##### YOUR CODE HERE #####
        
        return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

    def backward(self, x):
        # Implement the derivative of the tanh function
        ##### YOUR CODE HERE #####
        
        tanh = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
        return 1 - tanh**2
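
A note on numerical range: the explicit exponential form of tanh overflows for inputs of large magnitude, whereas NumPy's built-in `np.tanh` does not. The following is a minimal sketch with made-up inputs; `tanh_explicit` is a hypothetical helper that only mirrors the formula, not part of the assignment.

```python
import numpy as np

def tanh_explicit(x):
    # Same explicit exponential formula for tanh
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

x = np.array([-1000.0, -1.0, 0.0, 1.0, 1000.0])

# Silence the overflow/invalid warnings that the explicit form triggers
with np.errstate(over="ignore", invalid="ignore"):
    explicit = tanh_explicit(x)

builtin = np.tanh(x)

print(explicit)  # the two extreme entries are nan (inf / inf)
print(builtin)   # saturates cleanly at -1 and 1
```

Since any NumPy functionality is allowed here, `np.tanh` is one way to avoid the problem if very large pre-activations are expected.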

In [4]:
class ReLU():
    def forward(self, x):
        # Implement the relu function
        ##### YOUR CODE HERE #####
        
        return x * (x > 0)

    def backward(self, x):
        # Implement the derivative of the relu function
        ##### YOUR CODE HERE #####
    
        return 1. * (x > 0)

In [5]:
class MSE():
    def forward(self, y_pred, y_true):
        # Implement the mse function
        ##### YOUR CODE HERE #####
        
        return (np.square(y_pred - y_true)).mean()

    def backward(self, y_pred, y_true):
        # Implement the derivative of the mse function
        ##### YOUR CODE HERE #####
        
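        # element-wise gradient w.r.t. y_pred (one entry per prediction);
        # the 1/n average over the batch is applied in MLP.backward
        return 2 * (y_pred - y_true)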

#### Multi-Layer Perceptron (2 points)

In this section, you will have to implement your very own _Multi-Layer Perceptron_. For the network architecture, we will consider only one hidden layer and no activation function for the output neuron (i.e. apply the activation function to the hidden layer but not to the output layer).

_Hint: You will need to compute the derivatives $\frac{\partial L}{\partial W_o}, \frac{\partial L}{\partial b_o}, \frac{\partial L}{\partial W_h}$ and $\frac{\partial L}{\partial b_h}$ (using the chain rule) as they are required for the backward pass._
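
The hint can be made concrete with a short derivation. This is a sketch under stated assumptions: inputs are column vectors, weight matrices act from the left, $\phi$ denotes the hidden activation (a symbol introduced here for illustration), and $\delta_o = \partial L / \partial \hat{y}$ is the gradient delivered by the loss.

$$ z_h = W_h x + b_h, \qquad h = \phi(z_h), \qquad \hat{y} = W_o h + b_o $$

$$ \frac{\partial L}{\partial W_o} = \delta_o h^\top, \qquad \frac{\partial L}{\partial b_o} = \delta_o $$

$$ \delta_h = (W_o^\top \delta_o) \odot \phi'(z_h), \qquad \frac{\partial L}{\partial W_h} = \delta_h x^\top, \qquad \frac{\partial L}{\partial b_h} = \delta_h $$

For a batch of $n$ samples stored as columns, the same expressions are summed over the columns and divided by $n$. With the transposed layout (row-vector inputs), all products appear transposed.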

In [132]:
class MLP():

    def __init__(self, input_size, hidden_size, output_size, activation_func = TanH()):
        """
        Parameters
        ----------
        W_h: 
            weight matrix from input layer to hidden layer with size (hidden_size, input_size)
        b_h: 
            bias vector for the hidden layer with size (hidden_size, 1)
        W_o: 
            weight matrix from hidden layer to output layer with size (output_size, hidden_size)
        b_o: 
            bias vector for the output layer with size (output_size, 1)
        activation_func: 
            activation function of your choice
        """
        ##### YOUR CODE HERE #####
        
        self.W_h = np.random.random((hidden_size, input_size))
        self.b_h = np.random.random((hidden_size, 1))
        
        self.W_o = np.random.random((output_size, hidden_size))
        self.b_o = np.random.random((output_size, 1))
        
        self.activation_func = activation_func
        
    def forward(self, x):
        """
        forward pass of the MLP
        
        Parameters
        ----------
        x:
            input vector of size (input_size)
            
        Returns
        -------
        y:
            output vector of size (output_size)
        """
        ##### YOUR CODE HERE #####
        
        # first weights layer (hidden) (multiply + sum)
        f = np.dot(self.W_h, x) + self.b_h
        
        # activation function for the hidden layer
        f = self.activation_func.forward(f)
        
        # output weights layer (multiply + sum)
        y = np.dot(self.W_o, f) + self.b_o
        
        return y
        
    def forward_(self, x):
        """
        forward pass of the MLP with additional return values
        
        Parameters
        ----------
        x:
            input vector of size (input_size)
            
        Returns
        -------
        y:
            output vector of size (output_size)
        h:
            activation of the hidden layer of size (hidden_size)
        z_h:
            pre-activation of the hidden layer of size (hidden_size)
            i.e., the input vector to the activation function
        """
        ##### YOUR CODE HERE #####
        
        # pre-activation of the hidden layer
        z_h = np.dot(self.W_h, x) + self.b_h    
        
        # activation of the hidden layer
        h = self.activation_func.forward(z_h)
                
        # output vector
        y = np.dot(self.W_o, h) + self.b_o
        
        return y, h, z_h
    
    def backward(self, x, h, z_h, dloss):
        """
        backward pass of the MLP
        
        Parameters
        ----------
        x:
            input vector of size (input_size)
        h:
            activation of the hidden layer of size (hidden_size)
        z_h:
            pre-activation of the hidden layer of size (hidden_size)
            i.e., the input vector to the activation function
        dloss:
            gradient of the loss function with respect to y_pred
            
        Returns
        -------
        grads:
            dictionary containing the elements
            - W_h: gradients for W_h
            - b_h: gradients for b_h
            - W_o: gradients for W_o
            - b_o: gradients for b_o
        """
        ##### YOUR CODE HERE #####
        
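        n = x.shape[1]  # number of samples (columns) in the batch

        # output layer: y = W_o h + b_o
        dW_o = 1./n * np.dot(dloss, h.T)
        # sum over samples only, so the result keeps shape (output_size, 1)
        db_o = 1./n * np.sum(np.atleast_2d(dloss), axis=1, keepdims=True)

        # chain rule through the hidden activation; use the configured
        # activation's derivative instead of hard-coding the tanh one
        dloss1 = np.dot(self.W_o.T, dloss) * self.activation_func.backward(z_h)
        
        dW_h = 1./n * np.dot(dloss1, x.T)
        # one bias gradient per hidden unit, shape (hidden_size, 1)
        db_h = 1./n * np.sum(dloss1, axis=1, keepdims=True)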
        
        grads = {"W_h": dW_h,
                "b_h": db_h,
                "W_o": dW_o,
                "b_o": db_o}
        
        return grads
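
A common way to validate a backward pass is to compare it against central finite differences. The sketch below is self-contained and does not use the `MLP` class: the sizes, the random data, and the helper names `loss`, `analytic_grads`, and `numeric_grads` are all assumptions made for illustration, with a tanh hidden layer and a squared-error loss averaged over the batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny problem: 3 inputs, 4 hidden units, 1 output, 5 samples
n_in, n_hid, n_out, n = 3, 4, 1, 5
x = rng.normal(size=(n_in, n))
y = rng.normal(size=(n_out, n))
params = {
    "W_h": rng.normal(size=(n_hid, n_in)),
    "b_h": rng.normal(size=(n_hid, 1)),
    "W_o": rng.normal(size=(n_out, n_hid)),
    "b_o": rng.normal(size=(n_out, 1)),
}

def loss(p):
    # Squared error per sample, averaged over the batch
    h = np.tanh(p["W_h"] @ x + p["b_h"])
    y_pred = p["W_o"] @ h + p["b_o"]
    return np.mean(np.sum((y_pred - y) ** 2, axis=0))

def analytic_grads(p):
    # Backward pass via the chain rule
    h = np.tanh(p["W_h"] @ x + p["b_h"])
    y_pred = p["W_o"] @ h + p["b_o"]
    dloss = 2 * (y_pred - y)                     # per-sample dL/dy_pred
    dz_h = (p["W_o"].T @ dloss) * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    return {
        "W_o": dloss @ h.T / n,
        "b_o": dloss.sum(axis=1, keepdims=True) / n,
        "W_h": dz_h @ x.T / n,
        "b_h": dz_h.sum(axis=1, keepdims=True) / n,
    }

def numeric_grads(p, eps=1e-6):
    # Central finite differences, one parameter entry at a time
    grads = {}
    for name, value in p.items():
        g = np.zeros_like(value)
        for idx in np.ndindex(value.shape):
            old = value[idx]
            value[idx] = old + eps
            plus = loss(p)
            value[idx] = old - eps
            minus = loss(p)
            value[idx] = old  # restore the original entry
            g[idx] = (plus - minus) / (2 * eps)
        grads[name] = g
    return grads

analytic = analytic_grads(params)
numeric = numeric_grads(params)
max_err = max(np.max(np.abs(analytic[k] - numeric[k])) for k in params)
print("largest absolute difference:", max_err)
```

If the two sets of gradients disagree by much more than the finite-difference error, the usual suspects are a missing activation derivative or a bias gradient summed over the wrong axis.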

#### Gradient Descent (2 points)

In this section, you will have to implement the training algorithm using _Gradient Descent_. 

While we provide you with the wrapper function, you need to implement the methods _evaluate()_ and _update()_, where the computation of the gradients and the weight update should be performed as part of the _update()_ method.
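
To illustrate the structure of one update step (forward pass, loss, gradient, parameter update) outside the MLP setting, here is a minimal sketch of full-batch gradient descent on a one-dimensional linear model. The data, the learning rate, and the number of epochs are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noiseless data generated by y = 3x - 1
x = rng.uniform(-1, 1, size=100)
y = 3 * x - 1

w, b = 0.0, 0.0        # parameters to learn
learning_rate = 0.1    # assumed step size

history = []
for epoch in range(300):
    y_pred = w * x + b                            # forward pass
    history.append(np.mean((y_pred - y) ** 2))    # loss before the update
    # Gradients of the mean squared error w.r.t. w and b
    grad_w = np.mean(2 * (y_pred - y) * x)
    grad_b = np.mean(2 * (y_pred - y))
    # Step against the gradient
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(history[0], history[-1], w, b)
```

The key point is that the update uses the gradient of the loss, not the loss value itself; a training loss that grows steadily from epoch to epoch usually indicates a wrong gradient or a step size that is too large.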

In [133]:
def evaluate(data, model, loss_func):
    """
    function to evaluate the test data
    i.e., just forward pass and loss computation
    
    Parameters
    ----------
    data:
        input data containing X and y
    model:
        the initialized MLP model
    loss_func:
        loss function of your choice
    
    Returns
    -------
    losses:
        array containing all individual losses
        i.e., for each data sample
    """
    ##### YOUR CODE HERE #####
        
    X = np.array((list(zip(*data)))[0])
    y = np.array((list(zip(*data)))[1])
    
    fwd = model.forward_(X.T)
    
    
    losses = loss_func.forward(fwd[0], y)
    
    return losses
    
def update(data, model, loss_func, learning_rate):
    """
    function to calculate gradients and perform weight updates
    i.e., forward pass + loss computation + backward pass + weight update
    
    Parameters
    ----------
    data:
        input data containing X and y
    model:
        the initialized MLP model
    loss_func:
        loss function of your choice
    learning_rate:
        float value defining the learning rate
    
    Returns
    -------
    losses:
        array containing all individual losses
        i.e., for each data sample
    """
    ##### YOUR CODE HERE #####
            
    X = np.array((list(zip(*data)))[0])
    y = np.array((list(zip(*data)))[1])
    
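    # match the (output_size, n) layout of the model output; a column
    # vector here would broadcast against it to an (n, n) matrix
    y = y.reshape((1, -1))
    
    # forward pass
    y_pred, h, z_h = model.forward_(X.T)
    
    # compute losses
    losses = loss_func.forward(y_pred, y)
    
    # backward pass: propagate the gradient of the loss w.r.t. y_pred,
    # not the loss value itself
    dloss = loss_func.backward(y_pred, y)
    gradients = model.backward(X.T, h, z_h, dloss)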
    
    # weight updates
    model.W_h = model.W_h - gradients["W_h"]*learning_rate
    model.b_h = model.b_h - gradients["b_h"]*learning_rate
    model.W_o = model.W_o - gradients["W_o"]*learning_rate
    model.b_o = model.b_o - gradients["b_o"]*learning_rate
    
    return losses
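
Because the model returns predictions of shape (output_size, n), the layout of the target array decides what the subtraction inside the loss actually computes. A small sketch with made-up arrays shows the effect of NumPy broadcasting:

```python
import numpy as np

n = 4
y_pred = np.arange(n, dtype=float).reshape(1, n)   # shape (1, n), like a model output
y_true = np.arange(n, dtype=float)                 # shape (n,), as unpacked from the data list

# Row minus column broadcasts to an (n, n) matrix of all pairwise differences
pairwise = y_pred - y_true.reshape(n, 1)
# Row minus row keeps exactly one difference per sample
aligned = y_pred - y_true.reshape(1, n)

print(pairwise.shape, aligned.shape)
print(np.mean(pairwise ** 2), np.mean(aligned ** 2))
```

Here the predictions equal the targets, so the aligned error is zero, while the pairwise version reports a nonzero loss for a perfect fit.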

In [134]:
# perform gradient descent, no ToDo for you
def gradient_descent(train_data, test_data, model, loss_func, epochs, learning_rate):
    valid_losses = evaluate(test_data, model, loss_func)
    print("Initial Validation: " + str(np.mean(valid_losses)))
    
    for epoch in range(epochs):
        train_losses = update(train_data, model, loss_func, learning_rate)
        valid_losses = evaluate(test_data, model, loss_func)
        print("Epoch " + str(epoch) + ": " + str(np.mean(train_losses)) + " Train Loss, " + str(np.mean(valid_losses)) + " Valid Loss")  

#### Train your model (0.5 points)

In this section, you will have to initialise your MLP (using defined hyperparameters) and train it on the provided data. For the training, you can, of course, simply use the _gradient\_descent()_ method.

In [135]:
# Generating toy data, no ToDo for you
X, y = dt.make_regression(n_samples = 1000, n_features = 20)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

train_data = list(zip(X_train, y_train))
test_data = list(zip(X_test, y_test))

In [140]:
# Initialize and train the model

model = MLP(20, 40, 1)
loss_func = MSE()
epochs = 20
learning_rate = 0.001

gradient_descent(train_data, test_data, model, loss_func, epochs, learning_rate)

Initial Validation: 17586.958341370075
Epoch 0: 26066.110447719722 Train Loss, 19212.142701386183 Valid Loss
Epoch 1: 25944.23526597317 Train Loss, 19255.52878878995 Valid Loss
Epoch 2: 25994.85008302487 Train Loss, 19302.545768502536 Valid Loss
Epoch 3: 26049.109894972153 Train Loss, 19353.221715849595 Valid Loss
Epoch 4: 26107.043792729743 Train Loss, 19407.586808232954 Valid Loss
Epoch 5: 26168.68297739408 Train Loss, 19465.673360693265 Valid Loss
Epoch 6: 26234.060796393835 Train Loss, 19527.515864152778 Valid Loss
Epoch 7: 26303.21278233067 Train Loss, 19593.15102641595 Valid Loss
Epoch 8: 26376.176694588827 Train Loss, 19662.617816011865 Valid Loss
Epoch 9: 26452.99256379787 Train Loss, 19735.957508968604 Valid Loss
Epoch 10: 26533.7027392399 Train Loss, 19813.213738616418 Valid Loss
Epoch 11: 26618.35193929883 Train Loss, 19894.432548523637 Valid Loss
Epoch 12: 26706.98730505633 Train Loss, 19979.662448676376 Valid Loss
Epoch 13: 26799.65845714651 Train Loss, 20068.95447502074 V