# Coursework 1: ML basics and fully-connected networks

#### Instructions

Please submit a version of this notebook containing your answers on CATe as *CW1*. Write your answers in the cells below each question.

We recommend that you work on the Ubuntu workstations in the lab. This assignment and all code were only tested to work on these machines. In particular, we cannot guarantee compatibility with Windows machines and cannot promise support if you choose to work on a Windows machine.

You can work from home and use the lab workstations via ssh (for list of machines: https://www.doc.ic.ac.uk/csg/facilities/lab/workstations). 

Once logged in, run the following commands in the terminal to set up a Python environment with all the packages you will need.

    export PYTHONUSERBASE=/vol/bitbucket/nuric/pypi
    export PATH=/vol/bitbucket/nuric/pypi/bin:$PATH

Add the above lines to your `.bashrc` to have these enviroment variables set automatically each time you open your bash terminal.

Any code that you submit will be expected to run in this environment. Marks will be deducted for code that fails to run.

Run `jupyter-notebook` in the coursework directory to launch Jupyter notebook in your default browser.

DO NOT attempt to create a virtualenv in your home folder as you will likely exceed your file quota.

**DEADLINE: 7pm, Tuesday 5th February, 2019**

## Part 1

1. Describe two practical methods used to estimate a supervised learning model's performance on unseen data. Which strategy is most commonly used in most deep learning applications, and why?
2. Suppose that you have reason to believe that your multi-layer fully-connected neural network is overfitting. List four things that you could try to improve generalization performance.

## Part 1(Answer) add description

1. 1 Estimate performance on unseen data


*   Holdout. We split up our training dataset into a ‘train’ and ‘test’ set. The training set is what the model is trained on, and the test set is used to see how well that model performs on unseen data. A common split when using the hold-out method is using 80% of data for training and the remaining 20% of the data for testing.
*   Cross validation, estimate the performance using validation set. In K Fold cross validation, we split our trainging dataset into k subsets. Now the holdout method is repeated k times, such that each time, one of the k subsets is used as the  validation set and the other k-1 subsets are put together to form a training set. The error estimation is averaged over all k trials to get total effectiveness of our model. 


1. 2 Most Commonly used


*   Cross validation. Because it can effectively avoid overfitting, and can effectively use the data,  as can be seen, every data point gets to be in a validation set exactly once, and gets to be in a training set k-1 times. This significantly reduces bias as we are using most of the data for fitting, and also significantly reduces variance as most of the data is also being used in validation set. Interchanging the training and test sets also adds to the effectiveness of this method.  Although when the dataset is large or when the NN is deep, it's expensive to use this method, but in practice, data is also valuable. So we usually can't get a very large dataset.


2. How to avoid overfitting


*   Add regularizer to the loss function. Regularizer is seen to be a penalty term in the loss funtion, penalise the increasing of the parameter number.
*   Dropout. Randomly dropout some neurons during the traing process can effectively avoid the overfitting.
*   Train with more data(Data augmentation). Using more training data can, in most cases, lead to a better result, and can also mitigate the overfitting problem if we don't add more parameters again.
*   Early stopping. Terminate before reaching convergence, the effects is similar to decay regularization.











## Part 2

1. Why can gradient-based learning be difficult when using the sigmoid or hyperbolic tangent functions as hidden unit activation functions in deep, fully-connected neural networks?
2. Why is the issue that arises in the previous question less of an issue when using such functions as output unit activation functions, provided that an appropriate loss function is used?
3. What would happen if you initialize all the weights to zero in a multi-layer fully-connected neural network and attempt to train your model using gradient descent? What would happen if you did the same thing for a logistic regression model?

## Part 2(Answers)

1. Disadvantages of sigmoid and hyperbolic tangent funtions in deep neural networks.


*   Because It suffers from vaninshing or exploding gradients issue. This means that large values snap to 1.0 and small values snap to -1 or 0 for tanh and sigmoid respectively. Then if we use BP in a deep NN, layers deep in large networks using these nonlinear activation functions fail to receive useful gradient information.The amount of error decreases dramatically with each additional layer through which it is propagated, given the derivative of the chosen activation function.


2. Effect of the appropriate loss funtion


*   If we use an appropriate loss function, say, cross entropy, then due to the chain rule, when we backpropagate the gradient, the gradient of the activation function is multiplied by the  gradient of loss function,  then for cross entropy, we can get the derivative of loss function: pi-yi, which can obviously solve the influence of the gradient vanishing and gradient exploding.Cross entropy indicates the distance between what the model believes the output distribution should be, and what the original distribution really is. 


3. Initialization issue


*  First, neural networks tend to get stuck in local minima, so it's a good idea to give them many different starting values. You can't do that if they all start at zero.

*  Second, if the neurons start with the same weights, then all the neurons will follow the same gradient, and will always end up doing the same thing as one another.

## Part 3

In this part, you will use PyTorch to implement and train a multinomial logistic regression model to classify MNIST digits.

Restrictions:
* You must use (but not modify) the code provided in `utils.py`. **This file is deliberately not documented**; read it carefully as you will need to understand what it does to complete the tasks.
* You are NOT allowed to use the `torch.nn` module.

Please insert your solutions to the following tasks in the cells below:
1. Complete the `MultinomialLogisticRegressionClassifier` class below by filling in the missing parts (expected behaviour is prescribed in the documentation):
    * The constructor
    * `forward`
    * `parameters`
    * `l1_weight_penalty`
    * `l2_weight_penalty`

2. The default hyperparameters for `MultilayerClassifier` and `run_experiment` have been deliberately chosen to produce poor results. Experiment with different hyperparameters until you are able to get a test set accuracy above 92% after a maximum of 10 epochs of training. However, DO NOT use the test set accuracy to tune your hyperparameters; use the validation loss / accuracy. You can use any optimizer in `torch.optim`.


In [0]:
from utils import *

In [0]:
# *CODE FOR PART 3.1 IN THIS CELL*
import numpy as np
import torch
from torch.distributions import normal

class MultinomialLogisticRegressionClassifier:
    def __init__(self, weight_init_sd=100.0):
        """
        Initializes model parameters to values drawn from the Normal
        distribution with mean 0 and standard deviation `weight_init_sd`.
        """
        self.weight_init_sd = weight_init_sd

        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        In = 784 # 28 * 28
        Out = 10
        m = normal.Normal(torch.Tensor([0.0]), torch.Tensor([weight_init_sd]))
        self.w = m.sample(sample_shape = torch.Size([In, Out])).view(In, Out)
        self.w.requires_grad_()
        self.b = torch.zeros((1, 10))
        self.b.requires_grad_()
        #self.theta = torch.cat((self.w,self.b),0)
        #self.theta.requires_grad_()  
        #######################################################################
        #                       ** END OF YOUR CODE **
        #######################################################################

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, inputs):
        """
        Performs the forward pass through the model.
        
        Expects `inputs` to be a Tensor of shape (batch_size, 1, 28, 28) containing
        minibatch of MNIST images.
        
        Inputs should be flattened into a Tensor of shape (batch_size, 784),
        before being fed into the model.
        
        Should return a Tensor of logits of shape (batch_size, 10).
        """
        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        # flatten the data
        flat_X = inputs.reshape([-1,784])
        #flat_X = torch.cat((flat_X, self.b), 0)
        h = flat_X.matmul(self.w) + self.b
        h_row_max = torch.max(h, dim=1, keepdim = True)[0]
        h = h - h_row_max
        logits = h - torch.log(torch.sum(torch.exp(h)))
        # h = self.b + w_X
        # print(h.shape)
        # logits = torch.log(torch.exp(h) / torch.sum(torch.exp(h)))

        return logits
        #######################################################################
        #                       ** END OF YOUR CODE **
        #######################################################################

    def parameters(self):
        """
        Should return an iterable of all the model parameter Tensors.
        """
        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        self.para = torch.cat((self.w,self.b),0)
        
        return [self.w, self.b]
        #######################################################################
        #                       ** END OF YOUR CODE **
        #######################################################################
        
    def l1_weight_penalty(self):
        """
        Computes and returns the L1 norm of the model's weight vector (i.e. sum
        of absolute values of all model parameters).
        """
        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        #torch.Tensor(self.para)
        l1 = self.para.abs().sum()
        return l1
        #######################################################################
        #                       ** END OF YOUR CODE **
        #######################################################################

    def l2_weight_penalty(self):
        """
        Computes and returns the L2 weight penalty (i.e. 
        sum of squared values of all model parameters).
        """
        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        l2 = self.para.pow(2).sum()
        return l2
        #######################################################################
        #                       ** END OF YOUR CODE **
        #######################################################################


In [0]:
# *CODE FOR PART 3.2 IN THIS CELL - EXAMPLE WITH DEFAULT PARAMETERS PROVIDED *
model = MultinomialLogisticRegressionClassifier(weight_init_sd=0.002)
res = run_experiment(
    model,
    optimizer=optim.SGD(model.parameters(), lr = 0.005, momentum = 0.8),
    train_loader=train_loader_1,
    val_loader=val_loader_1,
    test_loader=test_loader_1,
    n_epochs=10,
    l1_penalty_coef=0,
    l2_penalty_coef=0.01,
    suppress_output=False
)

Epoch 0: training...
Train set:	Average loss: 4.5922, Accuracy: 0.8911
Validation set:	Average loss: 7.2672, Accuracy: 0.9032

Epoch 1: training...
Train set:	Average loss: 4.4921, Accuracy: 0.9119
Validation set:	Average loss: 7.2511, Accuracy: 0.9110

Epoch 2: training...
Train set:	Average loss: 4.4750, Accuracy: 0.9166
Validation set:	Average loss: 7.2388, Accuracy: 0.9122

Epoch 3: training...
Train set:	Average loss: 4.4667, Accuracy: 0.9198
Validation set:	Average loss: 7.2363, Accuracy: 0.9140

Epoch 4: training...
Train set:	Average loss: 4.4601, Accuracy: 0.9221
Validation set:	Average loss: 7.2344, Accuracy: 0.9138

Epoch 5: training...
Train set:	Average loss: 4.4553, Accuracy: 0.9239
Validation set:	Average loss: 7.2325, Accuracy: 0.9157

Epoch 6: training...
Train set:	Average loss: 4.4524, Accuracy: 0.9243
Validation set:	Average loss: 7.2333, Accuracy: 0.9127

Epoch 7: training...
Train set:	Average loss: 4.4482, Accuracy: 0.9254
Validation set:	Average loss: 7.2285, Ac

## Part 4

In this part, you will use PyTorch to implement and train a multi-layer fully-connected neural network to classify MNIST digits.

Your network must have three hidden layers with 128, 64, and 32 hidden units respectively.

The same restrictions as in Part 3 apply.

Please insert your solutions to the following tasks in the cells below:
1. Complete the `MultilayerClassifier` class below by filling in the missing parts of the following methods (expected behaviour is prescribed in the documentation):

    * The constructor
    * `forward`
    * `parameters`
    * `l1_weight_penalty`
    * `l2_weight_penalty`

2. The default hyperparameters for `MultilayerClassifier` and `run_experiment` have been deliberately chosen to produce poor results. Experiment with different hyperparameters until you are able to get a test set accuracy above 97% after a maximum of 10 epochs of training. However, DO NOT use the test set accuracy to tune your hyperparameters; use the validation loss / accuracy. You can use any optimizer in `torch.optim`.

3. Describe an alternative strategy for initializing weights that may perform better than the strategy we have used here.

In [0]:
# *CODE FOR PART 4.1 IN THIS CELL*

class MultilayerClassifier:
    def __init__(self, activation_fun="sigmoid", weight_init_sd=1.0):
        """
        Initializes model parameters to values drawn from the Normal
        distribution with mean 0 and standard deviation `weight_init_sd`.
        """
        super().__init__()
        self.activation_fun = activation_fun
        self.weight_init_sd = weight_init_sd

        if self.activation_fun == "relu":
            self.activation = F.relu
        elif self.activation_fun == "sigmoid":
            self.activation = torch.sigmoid
        elif self.activation_fun == "tanh":
            self.activation = torch.tanh
        else:
            raise NotImplementedError()

        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        In = 784 # 28 * 28
        Hidden1 = 128
        Hidden2 = 64
        Hidden3 = 32
        Out = 10
        m = normal.Normal(torch.Tensor([0.0]), torch.Tensor([weight_init_sd]))
        self.w1 = m.sample(sample_shape = torch.Size([In, Hidden1])).view(In, Hidden1)
        self.w1.requires_grad_()
        self.b1 = torch.zeros((1, Hidden1))
        self.b1.requires_grad_()
        
        self.w2 = m.sample(sample_shape = torch.Size([Hidden1, Hidden2])).view(Hidden1, Hidden2)
        self.w2.requires_grad_()
        self.b2 = torch.zeros((1, Hidden2))
        self.b2.requires_grad_()
        
        self.w3 = m.sample(sample_shape = torch.Size([Hidden2, Hidden3])).view(Hidden2, Hidden3)
        self.w3.requires_grad_()
        self.b3 = torch.zeros((1, Hidden3))
        self.b3.requires_grad_()
        
        self.w4 = m.sample(sample_shape = torch.Size([Hidden3, Out])).view(Hidden3, Out)
        self.w4.requires_grad_()
        self.b4 = torch.zeros((1, Out))
        self.b4.requires_grad_()
        #######################################################################
        #                       ** END OF YOUR CODE **
        #######################################################################

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)
     

    def forward(self, inputs):
        """
        Performs the forward pass through the model.
        
        Expects `inputs` to be Tensor of shape (batch_size, 1, 28, 28) containing
        minibatch of MNIST images.
        
        Inputs should be flattened into a Tensor of shape (batch_size, 784),
        before being fed into the model.
        
        Should return a Tensor of logits of shape (batch_size, 10).
        """
        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        flat_X = inputs.reshape([-1,784])
        
        h1 = flat_X.matmul(self.w1) + self.b1
        h1 = self.activation(h1)
        
        h2 = h1.matmul(self.w2) + self.b2
        h2 = self.activation(h2)
        
        h3 = h2.matmul(self.w3) + self.b3
        h3 = self.activation(h3)
        
        h = h3.matmul(self.w4) + self.b4
        
        h_row_max = torch.max(h, dim=1, keepdim = True)[0]
        h = h - h_row_max
        logits = h - torch.log(torch.sum(torch.exp(h))) 
        
        '''
                logits = torch.log(torch.exp(h) / torch.sum(torch.exp(h)))
       
        '''

    
        return logits
        #######################################################################
        #                       *self.w,self.b* END OF YOUR CODE **
        #######################################################################

    def parameters(self):
        """
        Should return an iterable of all the model parameter Tensors.
        """
        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        return [self.w1, self.b1, self.w2, self.b2,self.w3, self.b3, self.w4, self.b4]
        #######################################################################
        #                       ** END OF YOUR CODE **
        #######################################################################
        
    
    def l1_weight_penalty(self):
        """
        Computes and returns the L1 norm of the model's weight vector (i.e. sum
        of absolute values of all model parameters).
        """
        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        l1 = 0
        for para in self.parameters():
          l1 += torch.sum(torch.abs(para))
          
        return l1
        #######################################################################
        #                       ** END OF YOUR CODE **
        #######################################################################

    def l2_weight_penalty(self):
        """
        Computes and returns the L2 weight penalty (i.e. 
        sum of squared values of all model parameters).
        """
        #######################################################################
        #                       ** START OF YOUR CODE **
        #######################################################################
        l2 = 0
        for para in self.parameters():
          l2 += torch.sum(para ** 2)
          
        return l2
        #######################################################################
        #                       ** END OF YOUR CODE **
        #######################################################################


In [0]:
# *CODE FOR PART 4.2 IN THIS CELL - EXAMPLE WITH DEFAULT PARAMETERS PROVIDED *

model = MultilayerClassifier(activation_fun='relu', weight_init_sd=0.03)
res = run_experiment(
    model,
    optimizer=optim.SGD(model.parameters(),lr = 0.015, momentum = 0.8),
    train_loader=train_loader_1,
    val_loader=val_loader_1,
    test_loader=test_loader_1,
    n_epochs=10,
    l1_penalty_coef=0,
    l2_penalty_coef=0.001,
    suppress_output=False
)

Epoch 0: training...
Train set:	Average loss: 5.2343, Accuracy: 0.6425
Validation set:	Average loss: 7.2316, Accuracy: 0.9123

Epoch 1: training...
Train set:	Average loss: 4.3813, Accuracy: 0.9407
Validation set:	Average loss: 7.1135, Accuracy: 0.9430

Epoch 2: training...
Train set:	Average loss: 4.3034, Accuracy: 0.9617
Validation set:	Average loss: 7.0679, Accuracy: 0.9567

Epoch 3: training...
Train set:	Average loss: 4.2704, Accuracy: 0.9690
Validation set:	Average loss: 7.0375, Accuracy: 0.9613

Epoch 4: training...
Train set:	Average loss: 4.2522, Accuracy: 0.9740
Validation set:	Average loss: 7.0460, Accuracy: 0.9590

Epoch 5: training...
Train set:	Average loss: 4.2387, Accuracy: 0.9777
Validation set:	Average loss: 7.0341, Accuracy: 0.9652

Epoch 6: training...
Train set:	Average loss: 4.2315, Accuracy: 0.9801
Validation set:	Average loss: 7.0391, Accuracy: 0.9638

Epoch 7: training...
Train set:	Average loss: 4.2256, Accuracy: 0.9815
Validation set:	Average loss: 7.0197, Ac

4.3 In the above case, we initialise all weights using same distributions, to get a better result:

*   We can intialise parameters of each layer seperately.
*   More concretely,  we can use He-et-al Initialization, the weights are initialized keeping in mind the size of the previous layer which helps in attaining a global minimum of the cost function faster and more efficiently.

