## Introduction

In this workshop, we will build a simple MultiLayer Perceptron (MLP) neural network. An MLP is a simple network consisting of a few fully connected linear layers separated by nonlinear components (also called *activation functions*). The nonlinear component between the linear layers is essential: without it, a composition of linear components would be just another linear component, so a multilayer network would be equivalent to a single-layer network. It is also the nonlinear component that enables a neural network to express nonlinear functions.
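
The claim about composing linear layers can be checked numerically. The snippet below is a minimal sketch, not part of the workshop code: the layer sizes (4, 8, 3) and the names `first`, `second` and `merged` are arbitrary choices made for illustration. It builds a single linear layer with weights `W2 @ W1` and bias `W2 @ b1 + b2` and compares it with the two stacked layers, with and without a sigmoid in between.

```python
import torch

torch.manual_seed(0)

# Two linear layers stacked directly, with no activation in between
first = torch.nn.Linear(4, 8)
second = torch.nn.Linear(8, 3)

# The equivalent single layer: W = W2 @ W1, b = W2 @ b1 + b2
merged = torch.nn.Linear(4, 3)
with torch.no_grad():
    merged.weight.copy_(second.weight @ first.weight)
    merged.bias.copy_(second.weight @ first.bias + second.bias)

x = torch.randn(5, 4)
stacked_out = second(first(x))
merged_out = merged(x)
same_without_activation = torch.allclose(stacked_out, merged_out, atol=1e-5)

# With a sigmoid between the layers, the merged layer no longer reproduces the output
nonlinear_out = second(torch.sigmoid(first(x)))
same_with_activation = torch.allclose(nonlinear_out, merged_out, atol=1e-5)

print(same_without_activation, same_with_activation)
```

The first comparison should succeed up to floating point error, while the second should fail, which is why the activation functions cannot be dropped.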

In this workshop, we will construct an MLP network designed for a specific task: classifying the MNIST dataset, a set of images of handwritten digits 0-9. You can read more about this dataset here: https://colah.github.io/posts/2014-10-Visualizing-MNIST/#MNIST

MNIST stands for Modified National Institute of Standards and Technology database.

In [81]:
import torch
import torchvision

### Reading MNIST data set

In [82]:
trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=None)

In [83]:
train_image, train_target = trainset[0]    #let us examine the 0-th sample
train_image.show()

In [84]:
trainset.data[0]     #each 28-pixel row is printed across two lines, so a human has a hard time classifying it

tensor([[  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
           0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   3,  18,
          18,  18, 126, 136, 175,  26, 166, 255, 247, 127,   0,   0,   0,   0],
        [  0,   0,   0,   0,   0,   0,   0,   

In [85]:
train_target    #check if you classified it correctly in your mind

5

### Your task #1

Examine a few more samples from the MNIST dataset. Try to guess the correct class of each image using your own human classification skills. Try to estimate your accuracy, i.e. the percentage of your classifications that are correct, treating the training set that we are examining now as a test set for you. This is sound because you have not trained on that set before attempting the classification.

### Your task #2

Try to convert the dataset into a numpy `ndarray`. Then estimate the mean and standard deviation of the MNIST dataset. Please remember that it is customary to first divide each value in the MNIST dataset by 255, to normalize the initial grayscale pixel values 0-255 into the [0, 1] range.

*Tips:* 
- to convert MNIST dataset to numpy, use `trainset.data.numpy()`
- in numpy, there are methods `mean()` and `std()` to calculate statistics of an array.

In [86]:
(trainset.data.numpy().mean()/255.0, trainset.data.numpy().std()/255.0)   #MNIST pixels are grayscale integers 0-255

(0.1306604762738429, 0.30810780385646264)

Now, we will reread the dataset (**train** and **test** parts) and transform it (standardize it) so it will be zero-mean and unit-std. 
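
What standardization does can be sketched on synthetic data. The array `fake_images` below is a hypothetical stand-in for `trainset.data.numpy()`, so the snippet runs without downloading anything; the constants 0.1307 and 0.3081 are the rounded MNIST statistics computed above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for trainset.data.numpy(): 100 grayscale 28x28 images
fake_images = rng.integers(0, 256, size=(100, 28, 28), dtype=np.uint8)

scaled = fake_images / 255.0             # pixel values now lie in [0, 1]
mean, std = scaled.mean(), scaled.std()

# Subtract the mean and divide by the standard deviation, pixel by pixel
standardized = (scaled - mean) / std
print(standardized.mean(), standardized.std())

# With the MNIST statistics, a black background pixel (0.0) maps to about -0.4242
background = (0.0 - 0.1307) / 0.3081
print(round(background, 4))
```

After the transformation, the array has a mean of about zero and a standard deviation of about one, which is the property the `Normalize` transform is meant to give the real dataset.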

In [87]:
transform = torchvision.transforms.Compose(
    [ torchvision.transforms.ToTensor(), #Converts a PIL Image or numpy.ndarray (H x W x C) in the range [0, 255] to a torch.FloatTensor of shape (C x H x W) in the range [0.0, 1.0]
      torchvision.transforms.Normalize((0.1307,), (0.3081,))])   #one-element tuples: one mean and one std for the single grayscale channel

trainset = torchvision.datasets.MNIST(root='./data', train=True, download=True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=2048, shuffle=False)   #shuffle=False keeps the sample order fixed in every epoch; set shuffle=True to reshuffle the samples in each training epoch

testset = torchvision.datasets.MNIST(root='./data', train=False, download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=1, shuffle=False)

Let us visualise the training labels

In [88]:
for i, data in enumerate(trainloader):
        batch_inputs, batch_labels = data

        if i<5:
            print(i, "-th batch labels :", batch_labels)

0 -th batch labels : tensor([5, 0, 4,  ..., 1, 4, 1])
1 -th batch labels : tensor([7, 5, 4,  ..., 9, 3, 9])
2 -th batch labels : tensor([2, 4, 9,  ..., 7, 1, 2])
3 -th batch labels : tensor([2, 9, 0,  ..., 1, 5, 6])
4 -th batch labels : tensor([3, 0, 1,  ..., 0, 3, 1])


### Your task #3

Labels are entities of order zero (constants), but batched labels are of order one. The first (and only) index is a sample index within a batch. 

Your task is to visualise the data in `batch_inputs` and inspect its number of orders (tensor dimensions).

In [89]:
for i, data in enumerate(trainloader):
        batch_inputs, batch_labels = data

        if i==0:
            print(i, "-th batch inputs :", batch_inputs)

0 -th batch inputs : tensor([[[[-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          ...,
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242]]],


        [[[-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          ...,
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
          [-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242]]],


        [[[-0.4242, -0.4242, -0.4242,  ..., -0.4242, -0.4242, -0.4242],
     

OK, so each image was a two-dimensional array when we first saw it, but now the batches have order 4. The first index is the sample index within a batch, and the second index is always 0. That second index is the channel number, inserted by the `ToTensor()` transformation; MNIST images are grayscale, so there is only one channel. As this order has size one, we can get rid of it later, in training, with a `Flatten()` layer or by calling `squeeze()` on the tensor.
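
The effect of both operations on the shape can be checked on a dummy tensor. This is a small sketch: `batch` is a hypothetical tensor of zeros with the same layout as `batch_inputs`, but with only 8 samples to keep it light.

```python
import torch

# Hypothetical batch with the same layout as batch_inputs: (samples, channels, height, width)
batch = torch.zeros(8, 1, 28, 28)
print(batch.dim(), batch.shape)

squeezed = batch.squeeze(1)              # drops the size-one channel order
print(squeezed.shape)

flattened = torch.nn.Flatten()(batch)    # merges every order except the first one
print(flattened.shape)
```

Note that `squeeze()` returns a new tensor and leaves `batch` itself unchanged, so its result has to be assigned to a variable to have any effect.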

### MLP

Now, a definition of a simple MLP network.

In [90]:
class MLP(torch.nn.Module):
    def __init__(self):
        super(MLP, self).__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Flatten(),   #merges the last three orders of the data (of sizes 1, 28 and 28 respectively) into one order of size 1*28*28
            torch.nn.Linear(1*28*28, 1024),
            torch.nn.Sigmoid(),
            torch.nn.Linear(1024, 2048),
            torch.nn.Sigmoid(),
            torch.nn.Linear(2048, 256),
            torch.nn.Sigmoid(),
            torch.nn.Linear(256, 10),
        )
    def forward(self, x):
        out = self.mlp(x)
        return out
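
Before training, it can be useful to check the output shape and the size of such a network. The sketch below rebuilds the same stack of layers as a plain `torch.nn.Sequential` so that it runs on its own; the names `layers` and `dummy_batch` are illustrative and not part of the workshop code.

```python
import torch

# The same stack of layers as in the MLP class, rebuilt here so the snippet is self-contained
layers = torch.nn.Sequential(
    torch.nn.Flatten(),
    torch.nn.Linear(1 * 28 * 28, 1024),
    torch.nn.Sigmoid(),
    torch.nn.Linear(1024, 2048),
    torch.nn.Sigmoid(),
    torch.nn.Linear(2048, 256),
    torch.nn.Sigmoid(),
    torch.nn.Linear(256, 10),
)

dummy_batch = torch.zeros(4, 1, 28, 28)   # four fake images
output = layers(dummy_batch)
print(output.shape)                       # one row of 10 class scores per image

# Each Linear layer holds in_features * out_features weights plus out_features biases
n_parameters = sum(p.numel() for p in layers.parameters())
print(n_parameters)
```

Most of the parameters sit in the 1024-to-2048 layer, which is worth keeping in mind when experimenting with other layer sizes.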
    


### Training

Training consists of
- an initialization of a network
- a definition of an optimizer. The optimizer performs gradient descent using the gradients computed in the `backward()` step on the loss.
- running through multiple epochs and updating the network weights

In [93]:
net = MLP()
optimizer = torch.optim.Adam(net.parameters(), 0.001)   #initial and fixed learning rate of 0.001

net.train()
for epoch in range(8):  #  an epoch is a training run through the whole data set

    loss = 0.0
    for batch, data in enumerate(trainloader):
        batch_inputs, batch_labels = data
        #batch_inputs = batch_inputs.squeeze(1)     #alternatively, if not for the Flatten layer, squeeze() could be used to remove the second order of the tensor, the Channel, which has size one (this index can be equal to 0 only)
        
        optimizer.zero_grad()

        batch_outputs = net(batch_inputs)
        loss = torch.nn.functional.cross_entropy(batch_outputs, batch_labels, reduction = "mean")
        print("epoch:", epoch, "batch:", batch, "current batch loss:", loss.item()) 
        loss.backward()       #this computes gradients as we have seen in previous workshops
        optimizer.step()     #but this line in fact updates our neural network. 
                                ####You can experiment - comment this line and check, that the loss DOES NOT improve, meaning that the network doesn't update


epoch: 0 batch: 0 current batch loss: 2.371558666229248
epoch: 0 batch: 1 current batch loss: 2.38930344581604
epoch: 0 batch: 2 current batch loss: 2.368551015853882
epoch: 0 batch: 3 current batch loss: 2.3257222175598145
epoch: 0 batch: 4 current batch loss: 2.2956409454345703
epoch: 0 batch: 5 current batch loss: 2.286395311355591
epoch: 0 batch: 6 current batch loss: 2.280097246170044
epoch: 0 batch: 7 current batch loss: 2.2642595767974854
epoch: 0 batch: 8 current batch loss: 2.2483108043670654
epoch: 0 batch: 9 current batch loss: 2.204660415649414
epoch: 0 batch: 10 current batch loss: 2.1567130088806152
epoch: 0 batch: 11 current batch loss: 2.109605550765991
epoch: 0 batch: 12 current batch loss: 2.0402870178222656
epoch: 0 batch: 13 current batch loss: 1.9680569171905518
epoch: 0 batch: 14 current batch loss: 1.9065264463424683
epoch: 0 batch: 15 current batch loss: 1.8569889068603516
epoch: 0 batch: 16 current batch loss: 1.727373719215393
epoch: 0 batch: 17 current batch 

### Your task #4

Comment out the line `optimizer.step()` above and rerun the code. Note that the loss is NOT constant, as the comment in the code seems to promise, but the loss doesn't improve, either. Please explain why the loss is not constant, and why it doesn't improve.

### Answers to task #4

The loss is not constant because our code prints the loss per batch, and each batch is a different data sample. A loss value calculated on a different data sample may be different. Note that the training loader was created with `shuffle=False`, so every epoch runs through the same batches in the same order, and with unchanged weights the sequence of batch losses essentially repeats from epoch to epoch. With `shuffle=True`, the batches would also differ between epochs, so the loss values would differ there, too.

But overall, the loss doesn't improve because the weights in the network do not change. The line responsible for changing the weights in the network is commented out.
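
This division of labour between `backward()` and `optimizer.step()` can be verified directly. The snippet below is a minimal sketch on a hypothetical tiny model with random data (`model`, `inputs` and `labels` are made up for the illustration): it compares the weights before and after each of the two calls.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(3, 2)                 # hypothetical tiny model
optimizer = torch.optim.Adam(model.parameters(), 0.001)

inputs = torch.randn(16, 3)                   # random stand-in data
labels = torch.randint(0, 2, (16,))

before = model.weight.detach().clone()

optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(model(inputs), labels)
loss.backward()                               # fills .grad, but leaves the weights alone

has_gradient = model.weight.grad is not None
unchanged_after_backward = torch.equal(model.weight.detach(), before)

optimizer.step()                              # only this call modifies the weights
changed_after_step = not torch.equal(model.weight.detach(), before)

print(has_gradient, unchanged_after_backward, changed_after_step)
```

So with `optimizer.step()` commented out, the gradients are still computed in every iteration, but nothing ever applies them.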

### Training - second approach


Sometimes during training the loss stabilizes and doesn't improve anymore. It is not the case here (yet, but we have only run 8 epochs), but it is a real problem in practice.
We can include a new tool called a **scheduler** that updates the *learning rate* in the optimizer after each epoch. This often helps the training. Let us reformulate the training so it consists of
- an initialization of a network
- a definition of an optimizer. The optimizer performs gradient descent using the gradients computed in the `backward()` step on the loss.
- a definition of a scheduler to update the learning rate in the optimizer
- running through multiple epochs and updating the network weights
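
To see what a scheduler does to the learning rate, it can be run on its own, without any real training. The sketch below uses a hypothetical `dummy_parameter` in place of a network and the `StepLR` scheduler with `step_size=1` and `gamma=0.9`, and records the learning rate used in each of 8 epochs.

```python
import torch

# A single dummy parameter stands in for a network; no real training happens here
dummy_parameter = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([dummy_parameter], 0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)

learning_rates = []
for epoch in range(8):
    learning_rates.append(scheduler.get_last_lr()[0])   # learning rate used in this epoch
    optimizer.step()      # called before scheduler.step() to avoid a PyTorch ordering warning
    scheduler.step()      # multiplies the learning rate by gamma

for epoch, lr in enumerate(learning_rates):
    print(epoch, lr)
```

With these settings the learning rate in epoch `n` is `0.001 * 0.9**n`, so by the last of the 8 epochs it has dropped to a little under half of its initial value.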

In [94]:
net_with_scheduler = MLP()
optimizer = torch.optim.Adam(net_with_scheduler.parameters(), 0.001)   #initial learning rate of 0.001. 
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.9)    #updates the learning rate after each epoch. There are many ways to do that: StepLR multiplies the learning rate by gamma every step_size epochs

net_with_scheduler.train()
for epoch in range(8):  #  an epoch is a training run through the whole data set

    loss = 0.0
    for batch, data in enumerate(trainloader):
        batch_inputs, batch_labels = data
        #batch_inputs = batch_inputs.squeeze(1)     #alternatively, if not for the Flatten layer, squeeze() could be used to remove the second order of the tensor, the Channel, which has size one (this index can be equal to 0 only)
        
        optimizer.zero_grad()

        batch_outputs = net_with_scheduler(batch_inputs)
        loss = torch.nn.functional.cross_entropy(batch_outputs, batch_labels, reduction = "mean")
        print("epoch:", epoch, "batch:", batch, "current batch loss:", loss.item(), "current lr:", scheduler.get_last_lr()[0]) 
        loss.backward()       #this computes gradients as we have seen in previous workshops
        optimizer.step()     #but this line in fact updates our neural network. 
                                
    scheduler.step()

epoch: 0 batch: 0 current batch loss: 2.3727755546569824 current lr: 0.001
epoch: 0 batch: 1 current batch loss: 2.371875524520874 current lr: 0.001
epoch: 0 batch: 2 current batch loss: 2.3651115894317627 current lr: 0.001
epoch: 0 batch: 3 current batch loss: 2.326876401901245 current lr: 0.001
epoch: 0 batch: 4 current batch loss: 2.302621841430664 current lr: 0.001
epoch: 0 batch: 5 current batch loss: 2.2808690071105957 current lr: 0.001
epoch: 0 batch: 6 current batch loss: 2.265956163406372 current lr: 0.001
epoch: 0 batch: 7 current batch loss: 2.266913890838623 current lr: 0.001
epoch: 0 batch: 8 current batch loss: 2.243178129196167 current lr: 0.001
epoch: 0 batch: 9 current batch loss: 2.2029662132263184 current lr: 0.001
epoch: 0 batch: 10 current batch loss: 2.1540374755859375 current lr: 0.001
epoch: 0 batch: 11 current batch loss: 2.0983917713165283 current lr: 0.001
epoch: 0 batch: 12 current batch loss: 2.0276150703430176 current lr: 0.001
epoch: 0 batch: 13 current b

### Your task #5

Well, it seems that we were able to get the loss down to 0.06 without a scheduler. Can you bring it under 0.05? Maybe the proposed gamma was too low (0.9 only)? Please experiment with different settings for the optimizer learning rate and different scheduler settings. Ask the workshop trainer: there are other schedulers you can experiment with, too. Please verify what would happen if the nets were allowed to train for more training epochs.

### Your task #6

Please explain: what are the dangers of bringing the loss too low? What is an *overtrained* neural network? How can one prevent it?

### Testing

Now we will test those two nets - the one without and the one with the scheduler.

In [95]:
good = 0
wrong = 0

net.eval()   #switches layers such as dropout or batch normalization to evaluation mode (this net has none); it does not disable gradient computation, which would require torch.no_grad()
#batches in test are of size 1
for batch, data in enumerate(testloader):
    datapoint, label = data
    
    prediction = net(datapoint)                  #prediction has values representing the "prevalence" of the corresponding class
    classification = torch.argmax(prediction)    #the class is the index of maximal "prevalence"
    
    if classification.item() == label.item():
        good += 1
    else:
        wrong += 1
        
print("accuracy = ", good/(good+wrong))

accuracy =  0.9659


In [96]:
good = 0
wrong = 0

net_with_scheduler.eval()   #switches layers such as dropout or batch normalization to evaluation mode (this net has none); it does not disable gradient computation, which would require torch.no_grad()
#batches in test are of size 1
for batch, data in enumerate(testloader):
    datapoint, label = data
    
    prediction = net_with_scheduler(datapoint)                  #prediction has values representing the "prevalence" of the corresponding class
    classification = torch.argmax(prediction)                   #the class is the index of maximal "prevalence"
    
    if classification.item() == label.item():
        good += 1
    else:
        wrong += 1
        
print("accuracy = ", good/(good+wrong))

accuracy =  0.9599


Well, not bad. Now it is your turn to experiment - change the layers in a neural network, change the activation function, play with the learning rate and the optimizer and the scheduler.
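
When comparing many variants, it may help to fold the two evaluation loops above into one helper. The function below is a sketch under some assumptions: `evaluate_accuracy` is a name introduced here, it accepts batches of any size, and it wraps the loop in `torch.no_grad()` so that no gradients are tracked. The demo runs it on random stand-in data and an untrained toy model, so the printed number itself is meaningless.

```python
import torch

def evaluate_accuracy(model, loader):
    """Return the fraction of samples in `loader` that `model` classifies correctly."""
    model.eval()                              # evaluation mode for layers such as dropout
    correct, total = 0, 0
    with torch.no_grad():                     # no gradients are needed for testing
        for inputs, labels in loader:
            predictions = model(inputs).argmax(dim=1)   # index of the highest score per sample
            correct += (predictions == labels).sum().item()
            total += labels.numel()
    return correct / total

# Demo on random stand-in data with an untrained toy model
torch.manual_seed(0)
fake_inputs = torch.randn(64, 1, 28, 28)
fake_labels = torch.randint(0, 10, (64,))
fake_dataset = torch.utils.data.TensorDataset(fake_inputs, fake_labels)
fake_loader = torch.utils.data.DataLoader(fake_dataset, batch_size=16, shuffle=False)

toy_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))
accuracy = evaluate_accuracy(toy_model, fake_loader)
print("accuracy = ", accuracy)
```

With the objects from this workshop, the call would look like `evaluate_accuracy(net, testloader)`; a larger test batch size than 1 should make the evaluation noticeably faster.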