# Deep Learning &mdash; Assignment 4

Fourth assignment for the 2020 Deep Learning course (NWI-IMC058) at Radboud University.

_Twan van Laarhoven (tvanlaarhoven@cs.ru.nl) and Gijs van Tulder (g.vantulder@cs.ru.nl)_

_September 2020_

-----

**Names:**

**Group:**

-----

**Instructions:**
* Fill in your names and the name of your group.
* Answer the questions and complete the code where necessary.
* Re-run the whole notebook before you submit your work.
* Save the notebook as a PDF and submit that in Brightspace together with the `.ipynb` notebook file.
* The easiest way to make a PDF of your notebook is via File > Print Preview and then use your browser's print option to print to PDF.

## Objectives

In this assignment you will
1. Train and modify a transformer network
2. Experiment with a translation dataset


## Required software

If you haven't done so already, you will need to install the following additional libraries:
* `torch` for PyTorch,
* `d2l`, the library that comes with the [Dive into Deep Learning](https://d2l.ai) book.

All libraries can be installed with `pip install`.
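
To check whether the libraries listed above are already available, a small sketch like the following may help. The helper name `missing_packages` is made up for this example; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(["torch", "d2l"])
if missing:
    print("Missing packages, install with: pip install " + " ".join(missing))
else:
    print("All required packages are available.")
```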

In [None]:
from d2l import torch as d2l
import math
import numpy as np
import torch
from torch import nn

## 4.1 Transformer

There is a detailed description of the transformer model in chapter 10.3 of the d2l book. In this exercise we will do experiments with variations on this model.

**Run the code from that chapter to train a transformer model on an English->French toy translation dataset.**  
Note: Make sure that you use the PyTorch version.

In [None]:
# TODO: your code here


The example in the book uses a function `d2l.load_data_nmt` to load an English->French translation dataset. This function is implemented in chapter 9.5. This implementation produces only a single iterator over batches of data.

**Modify this function to randomly split the data into a training and test set.**
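
As a rough illustration of the general idea (not a drop-in solution), the sketch below splits several equally long toy NumPy arrays with one shared random permutation, so that rows belonging to the same example stay together. The function name `random_split_arrays` and the toy data are assumptions made for this example. The arrays produced by `d2l.build_array` appear to be PyTorch tensors, so the index arrays may need converting, for example with `torch.as_tensor`.

```python
import numpy as np

def random_split_arrays(arrays, train_fraction=0.8):
    """Split equally long arrays into train/test parts using one shared permutation."""
    num_examples = len(arrays[0])
    perm = np.random.permutation(num_examples)      # random order of example indices
    num_train = int(train_fraction * num_examples)
    train_idx, test_idx = perm[:num_train], perm[num_train:]
    train = tuple(a[train_idx] for a in arrays)     # same indices for every array
    test = tuple(a[test_idx] for a in arrays)
    return train, test

# Toy data: row i of `features` belongs to entry i of `lengths`.
features = np.arange(20).reshape(10, 2)
lengths = np.arange(10)
(train_x, train_len), (test_x, test_len) = random_split_arrays((features, lengths))
print(train_x.shape, test_x.shape)
```

Using the same permutation for every array is the important design choice here: permuting each array separately would break the pairing between sentences and their valid lengths.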

In [None]:
def load_data_nmt(batch_size, num_steps, train_fraction=0.8, num_examples=1000):
    text = d2l.preprocess_nmt(d2l.read_data_nmt())
    source, target = d2l.tokenize_nmt(text, num_examples)
    src_vocab = d2l.Vocab(source, min_freq=3, reserved_tokens=['<pad>', '<bos>', '<eos>'])
    tgt_vocab = d2l.Vocab(target, min_freq=3, reserved_tokens=['<pad>', '<bos>', '<eos>'])
    src_array, src_valid_len = d2l.build_array(source, src_vocab, num_steps, True)
    tgt_array, tgt_valid_len = d2l.build_array(target, tgt_vocab, num_steps, False)
    # TODO: modify this code to produce a training and test set
    # Hint: use np.random.permutation
    data_arrays = (src_array, src_valid_len, tgt_array, tgt_valid_len)
    data_iter = d2l.load_array(data_arrays, batch_size)
    return src_vocab, tgt_vocab, data_iter


With a test set in hand, we can make more informed decisions when comparing different models. The simplest metric to implement is test set loss. Just like in previous weeks, it would be nice to plot the test metrics during training. To do that we will need to modify the `d2l.train_s2s_ch9` function, which is defined in chapter 9.7.
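
The general pattern for measuring a loss on held-out data can be sketched as below. This is a generic example on a toy classifier, not the sequence-to-sequence setting of this assignment: the function name `average_loss`, the toy model, and the random batches are all assumptions for illustration. The points it shows are switching to evaluation mode, disabling gradient tracking, and normalising the summed loss by the number of items.

```python
import torch
from torch import nn

def average_loss(model, loss_fn, data_iter, device):
    """Average a summed loss over all examples without updating the model."""
    was_training = model.training
    model.eval()                           # e.g. disables dropout
    total_loss, total_count = 0.0, 0
    with torch.no_grad():                  # no gradients needed for evaluation
        for X, y in data_iter:
            X, y = X.to(device), y.to(device)
            total_loss += loss_fn(model(X), y).item()
            total_count += y.numel()
    model.train(was_training)              # restore the previous mode
    return total_loss / max(total_count, 1)

# Toy usage: two random batches for a 3-class linear classifier.
toy_model = nn.Linear(4, 3)
toy_batches = [(torch.randn(5, 4), torch.randint(0, 3, (5,))) for _ in range(2)]
toy_loss = average_loss(toy_model, nn.CrossEntropyLoss(reduction='sum'),
                        toy_batches, torch.device('cpu'))
print(f'average loss per example: {toy_loss:.3f}')
```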

**Complete the implementation below**

In [None]:
def train_s2s(model, train_iter, test_iter, lr, num_epochs, device):
    def xavier_init_weights(m):
        if type(m) == nn.Linear:
            torch.nn.init.xavier_uniform_(m.weight)
        if type(m) == nn.LSTM:
            for param in m._flat_weights_names:
                if "weight" in param:
                    torch.nn.init.xavier_uniform_(m._parameters[param])
    model.apply(xavier_init_weights)
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss = d2l.MaskedSoftmaxCELoss()
    model.train()
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            legend=['train loss', 'test loss'],
                            xlim=[1, num_epochs], ylim=[0, 0.25])
    test_loss = float('nan')  # stays NaN if no evaluation runs (num_epochs < 10)
    for epoch in range(1, num_epochs + 1):
        model.train()  # the test evaluation may have switched the model to eval mode
        timer = d2l.Timer()
        metric = d2l.Accumulator(2)  # loss_sum, num_tokens
        for batch in train_iter:
            optimizer.zero_grad()  # clear gradients left over from the previous batch
            X, X_vlen, Y, Y_vlen = [x.to(device) for x in batch]
            Y_input, Y_label, Y_vlen = Y[:, :-1], Y[:, 1:], Y_vlen-1
            Y_hat, _ = model(X, Y_input, X_vlen, Y_vlen)
            l = loss(Y_hat, Y_label, Y_vlen)
            l.sum().backward() # Making the loss scalar for backward()
            d2l.grad_clipping(model, 1)
            num_tokens = Y_vlen.sum()
            optimizer.step()
            with torch.no_grad():
                metric.add(l.sum(), num_tokens)
        if epoch % 10 == 0:
            animator.add(epoch, (metric[0]/metric[1], None))
            test_loss = calculate_test_loss(model, loss, test_iter, device)
            animator.add(epoch, (None, test_loss))
    print(f'train loss {metric[0] / metric[1]:.3f}, '
          f'test loss {test_loss:.3f}, '
          f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')

def calculate_test_loss(model, loss, test_iter, device):
    # TODO: your code here
    # Hint: look at the training code
    pass


**Re-train the transformer model, this time showing test set loss. How does this compare to training set loss?**

In [None]:
# TODO: your code here


## 4.3 Data size

The model is only trained on 1000 sentence pairs. You can change this with the `num_examples` parameter of `load_data_nmt`.
When you do this, note that the code in d2l chapter 10.3 has a bug: it uses the size of the *source* vocabulary (English in this case) for both the encoder and the decoder. You will run into this when using different amounts of data.
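
To see why the vocabulary size matters, consider the minimal sketch below. It uses plain `nn.Embedding` layers rather than the d2l encoder and decoder classes, and the sizes are made-up toy values: an embedding table sized with the (smaller) source vocabulary cannot look up target token indices that exceed its range.

```python
import torch
from torch import nn

# Hypothetical toy sizes: the target vocabulary is larger than the source one.
src_vocab_size, tgt_vocab_size, num_hiddens = 5, 8, 4

wrong_embedding = nn.Embedding(src_vocab_size, num_hiddens)  # sized with the source vocab
right_embedding = nn.Embedding(tgt_vocab_size, num_hiddens)  # sized with the target vocab

target_tokens = torch.tensor([[1, 7, 2]])  # index 7 only exists in the target vocab

print(right_embedding(target_tokens).shape)  # works: one vector per token

lookup_failed = False
try:
    wrong_embedding(target_tokens)           # index 7 is out of range for 5 rows
except IndexError as err:
    lookup_failed = True
    print('IndexError:', err)
```

When the source vocabulary happens to be at least as large as the target vocabulary, the same mistake does not raise an error, which may explain why it can go unnoticed on the small default dataset.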

**Train with a larger dataset**

In [None]:
# TODO: your code here

By taking only the first 1000 samples we have limited ourselves to very simple sentences (see `data/fra.txt`). Later sentences in the dataset are longer.

**Will the code need to be modified to correctly handle these longer sentences?**

TODO: your answer here

## 4.4 Variations

**Does dropout improve the test set performance?**

TODO: your answer here

**Change the number of heads in the encoder and/or decoder. Do you see any difference in the results?**

In [None]:
# TODO: your code here

TODO: your answer here

**Look at the `MultiHeadAttention` module. Does the number of trainable parameters change with the number of heads? And if so, how?**
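
One way to investigate this empirically is to count the trainable parameters of a module. The helper below is a small sketch (the name `count_parameters` is an assumption for this example); it is shown on a plain `nn.Linear` layer and can be applied in the same way to attention modules built with different numbers of heads.

```python
from torch import nn

def count_parameters(module):
    """Count the trainable scalar parameters of a PyTorch module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

# A linear layer with 4 inputs and 3 outputs has 4*3 weights and 3 biases.
linear_params = count_parameters(nn.Linear(4, 3))
print(linear_params)  # 15
```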

TODO: your answer here

**What happens if you don't use any positional encoding? Can you explain why?**

In [None]:
# TODO: your code here

TODO: your answer here

**What happens if you change only one of the `key_size`, `query_size` or `value_size`? Can you explain why?**

TODO: your answer here

**Compare the results of the transformer with the LSTM network from d2l chapter 9.7. Discuss the differences.**

In [None]:
# TODO: your code here

TODO: your answer here

## The end

Well done! Please double check the instructions at the top before you submit your results.