# CPSC 477/577 Spring 2022, HW4 Part 1: RNN for machine translation

Due: April 27th, 11:59pm

*TF: Malcolm Sailor*

## Please write your name and NetID below

Name: Qiuyu (Olina) Zhu 

NetID: qz258 

In [28]:
!pip install torchtext==0.11.2 



## Instructions for completing this assignment

All the portions of this assignment for you to complete are indicated with `# STUDENT TODO` comments.

If you want to help yourself remember what you have completed and what remains to be done, you can delete the `TODO` portion of the comment (though you don't have to). But **do not delete the comment `STUDENT`**. Searching for this string is how we will grade your work.

The next cell defines an object that we use as a placeholder in a few places (e.g., in function arguments). It allows us to define syntactically correct code for you to complete. Thus anywhere you see `TO.DO` in the notebook, you need to replace it with your own code.

In [29]:
class Placeholder:
    @property
    def DO(self):
        raise NotImplementedError("You haven't yet implemented this part of the assignment yet")

TO = Placeholder()

- Part 1 15 points
    - 1A - why would reversing input help? (3 points)
    - 1B - implement data process fn (5 points)
    - 1C - collate_fn (5 points)
    - 1D - run training (1 point)
    - 1E - run test (1 point)
- Part 2 39 points
    - 2A - data process fn (1 point)
    - 2B - collate_fn (1 point)
    - 2C - explain lookahead mask (3 points)
    - 2D - attention (10 points)
    - 2E - split/merge heads (5 points)
    - 2F - Encoder forward pass (5 points)
    - 2G - Decoder forward pass (5 points)
    - 2H - Transformer forward pass (5 points)
    - 2I - run training (1 point)
    - 2J - greedy vs beam search (3 points)
    - Bonus: Beam Search (3 points)
- Part 3 21 points
    - 3A - pipeline for text-generation (3 points)
    - 3B - download raw datasets (2 points)
    - 3C - create validation split (3 points)
    - 3D - explain tokenizer issue (3 points)
    - 3E - preprocess function (4 points)
    - 3F - trainer arguments (2 points)
    - 3G - training set size vs. number of epochs (3 points)
    - 3H - run evaluation (1 point)
- Part 4 5 points
    - 4A - delete API key (1 point)
    - 4B - zero-shot translation (2 point)
    - 4C - find a task GPT-3 is good at (1 point)
    - 4D - find a task GPT-3 is bad at (1 point)

Total points: 80

## Environment Setup

For this assignment, we will use Google Colab. It is a free cloud service based on Jupyter Notebooks that lets you use a free GPU. Basically, you can quickly create, upload, share, and even edit togther Jupyter notebooks.

Please refer to [tutorial 1](https://course.fast.ai/start_colab.html) or [tutorial 2](https://towardsdatascience.com/getting-started-with-google-colab-f2fff97f594c) for additional instructions on how to use Google Colab.

#### Using GPU in Colab
PyTorch and other deep learning libraries are much faster using GPU acceleration. For training and evaluating the models in this assignment, you should always a GPU:

1. Go to __Runtime__ option on the top left
2. Click __Change runtime type__
3. Select "Python 3" for __Runtime type__ and "GPU" for __Hardward accelerator__
4. Click __SAVE__ button

However, GPU limits the amount of time that you can use a free GPU. So you may wish to implement much of the assignment without the GPU. But note that you will have to run all cells again once you change the runtime type.

Colab has popular libraries already installed such as Pytorch, TensorFlow, OpenCV and Keras. Let's get started and verify this:

In [30]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Pytorch version is: ", torch.__version__)
print("You are using: ", device)

Pytorch version is:  1.10.2+cu102
You are using:  cuda


If you enabled the GPU, then you should be using CUDA with Pytorch >= 1.10.0


 **Note**: You need to finish this assignment on Google Colab  because you will share your Google Colab notebook with TAs for grading.



## Part 1: Recurrent Sequence to Sequence Model

Cześć! Hello! Bonjour! In this assignment we are going to build machine learning models for translating from one language to another using PyTorch. We will be working with German to English translation, but the methods and implementation generalize to any language for which there exists an equivalent tokenizer. Also, the sequence-to-sequence models which will be introduced are used in tasks beyond translation, including summarization, dialogue systems, and others. 

## Introduction
Sequence-to-Sequence (seq2seq) models were first introduced in the [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) paper. These models do exactly what you'd think they do; they take one sequence and output another sequence (see [this link](http://jalammar.github.io/images/seq2seq_2.mp4)). The most common seq2seq models are **encoder-decoder** models, in which a  model such as *recurrent neural network* (RNN) *encodes* the input (also called source) sentence into a single vector, often called the context vector. The context vector is a representation of the entire input sentence, and this vector is then *decoded* by a second model, often another RNN, which learns to output the *target* sentence by generating one word at a time. 

Important types of encoder-decoder models include *recurrent neural networks* (RNNs) and *transformers*. In this notebook, we'll use RNNs, while in the next notebook, we'll use transformers.

In an encoder-decoder model implemented with RNNs, we first use an RNN to *encode* the source (input) sentence into a single vector. We'll refer to this single vector as a *context vector*. You can think of the context vector as being an abstract representation of the entire input sentence. This vector is then *decoded* by a second RNN which learns to output the target (output) sentence by generating it one word at a time.


<center><img src="https://raw.githubusercontent.com/bentrevett/pytorch-seq2seq/master/assets/seq2seq1.png" alt="mlp" align="middle"></center>

This image shows an example translation using recurrent neural networks. The source sentence, "guten morgen" is input into the encoder (green) one word at a time. We also append a *start of sequence* (`<sos>`) and *end of sequence* (`<eos>`) token to the start and end of sentence, respectively, to give the model an idea of when encoding and decoding start and end. At each time-step, the input to the encoder RNN is both the current word, $x_t$, as well as the hidden state from the previous time-step, $h_{t-1}$, and the encoder RNN outputs a new hidden state $h_t$. You can think of the hidden state as a vector representation of the sentence so far. The RNN can be represented as a function of both of $x_t$ and $h_{t-1}$:

$$h_t = \text{EncoderRNN}(x_t, h_{t-1})$$

We're using the term RNN generally here, it could be any recurrent architecture, such as an *LSTM* (Long Short-Term Memory) or a *GRU* (Gated Recurrent Unit).

Here, we have $X = \{x_1, x_2, ..., x_T\}$, where $x_1 = \text{<sos>}, x_2 = \text{guten}$, etc. The initial hidden state, $h_0$, is usually initialized to zeros.

Once the final word, $x_T$, has been passed into the RNN, we use the final hidden state, $h_T$, as the context vector, i.e. $h_T = z$. This is a vector representation of the entire source sentence.

Now that we have our context vector, $z$, we can start decoding it to get the target sentence, "good morning." Again, we append start- and end-of-sequence tokens to the target sentence. At each time-step, the input to the decoder RNN (blue) is the current word, $y_t$, as well as the hidden state from the previous time-step, $s_{t-1}$, where the initial decoder hidden state, $s_0$, is the context vector, $s_0 = z = h_T$, i.e. the initial decoder hidden state is the final encoder hidden state. Thus, similar to the encoder, we can represent the decoder as:

$$s_t = \text{DecoderRNN}(y_t, s_{t-1})$$

In the decoder, we need to go from the hidden state to an actual word, therefore at each time-step we use $s_t$ to predict (by passing it through a `Linear` layer, shown in purple which projects the hidden state onto the output vocabulary space, to which softmax is applied to pick the most likely word) what we think is the next word in the sequence, $\hat{y}_t$. 

$$\hat{y}_t = f(s_t)$$

We always use `<sos>` for the first input to the decoder, $y_1$, but for subsequent inputs, $y_{t>1}$, we will sometimes use the actual, ground truth next word in the sequence, $y_t$ and sometimes use the word predicted by our decoder, $\hat{y}_{t-1}$. Using the actual next word is called *teacher forcing*, and you can read about it more [here](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/).

When training/testing our model, we always know how many words are in our target sentence, so we stop generating words once we hit that many. During inference (i.e. real world usage where we do not have access the target translations) it is common to keep generating words until the model outputs an `<eos>` token or after a certain amount of words have been generated.

Once we have our predicted target sentence, $\hat{Y} = \{ \hat{y}_1, \hat{y}_2, ..., \hat{y}_T \}$, we compare it against our actual target sentence, $Y = \{ y_1, y_2, ..., y_T \}$, to calculate our loss. We then use this loss to update all of the parameters in our model.



### Preparing Data

We'll be coding up the models in PyTorch and using TorchText to help us do all of the pre-processing required.

In [31]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.datasets import Multi30k

# import spacy

import random
import math
import os
import time

# We'll set the random seeds for deterministic results.
SEED = 1

random.seed(SEED)
torch.manual_seed(SEED)
torch.backends.cudnn.enabled = False 
torch.backends.cudnn.deterministic = True

We'll use spaCy to tokenize the data. [SpaCy](https://spacy.io/usage/spacy-101) is a very useful tool for paring and iterating over text in the preprocessing phase. Torchtext allows us to retrieve spaCy tokenizers using the `get_tokenizer` function. However, we first have to download the spaCy tokenization models for German and English.

In [32]:
%%capture
! python -m spacy download en
! python -m spacy download de
from torchtext.data.utils import get_tokenizer
de_tokenizer = get_tokenizer('spacy', language='de')
en_tokenizer = get_tokenizer('spacy', language='en')

Next, we download and load the train, validation and test data.

The dataset we'll be using is the [Multi30k dataset](https://github.com/multi30k/dataset). This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence.

NB: When we call `Multi30k`, it returns an iterator (or a tuple of iterators). Each time we retrieve an item from one of these iterators, it is consumed. Thus each time we need to use one of these splits for a different purpose, we need to call `Multi30k` again. Fortunately, however, the implementation is smart enough to only download the dataset once (i.e., the data is cached), no matter how many times we call `Multi30k`.

In [33]:
# The default src language is German
# The default target language is English
# Multi30k returns all splits by default
# we can get a specific split with e.g., Multi30k(split="train")
train_data, valid_data, test_data = Multi30k()

We can double check that we've loaded about 30,000 examples:

In [34]:
print(f"Number of training examples: {len(train_data)}")
print(f"Number of validation examples: {len(valid_data)}")
print(f"Number of testing examples: {len(test_data)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


We can now print out an example. (It is always a good idea to actually look at your dataset!) There are two things we should take careful note of:

1. The sentences end with a newline, and our tokenizer doesn't strip these newlines out for us. We'll need to remove those ourselves.
2. The tokenizer is case-sensitive. Below, we will reduce our vocabulary size by casting all characters to lower case.

In [35]:
example_src, example_trg = next(Multi30k(split="train"))
print(f"Original: {example_src}", f"Tokenized: {de_tokenizer(example_src)}")
print(f"Original: {example_trg}", f"Tokenized: {en_tokenizer(example_trg)}")

Original: Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
 Tokenized: ['Zwei', 'junge', 'weiße', 'Männer', 'sind', 'im', 'Freien', 'in', 'der', 'Nähe', 'vieler', 'Büsche', '.', '\n']
Original: Two young, White males are outside near many bushes.
 Tokenized: ['Two', 'young', ',', 'White', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.', '\n']


Next, we'll build the *vocabulary* for the source and target languages. The vocabulary is used to associate each unique token with an index (an integer) and this is used to build a one-hot encoding for each token (a vector of all zeros except for the position represented by the index, which is 1). The vocabularies of the source and target languages are distinct.

Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an `<unk>` (unknown) token.

It is important to note that your vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into your model, giving you artifically inflated validation/test scores.

In [36]:
from torchtext.vocab import build_vocab_from_iterator

de_generator = (de_tokenizer(pair[0].strip().lower()) for pair in Multi30k(split="train"))
specials = ["<unk>", "<pad>", "<bos>", "<eos>"]
de_vocab = build_vocab_from_iterator(de_generator, specials=specials, min_freq=2)
en_generator = (en_tokenizer(pair[1].strip().lower()) for pair in Multi30k(split="train"))
en_vocab = build_vocab_from_iterator(en_generator, specials=specials, min_freq=2)

for vocab in (de_vocab, en_vocab):
    vocab.set_default_index(vocab["<unk>"])

We can see the vocabulary size. (As an aside---i.e., you don't need to submit an answer, but it's interesting to ponder---what features of German do you think might make us expect that the German vocabulary would be substantially bigger than the English vocabulary?)

In [37]:
print(f"Unique tokens in source (de) vocabulary: {len(de_vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(en_vocab)}")

Unique tokens in source (de) vocabulary: 7855
Unique tokens in target (en) vocabulary: 5894


<!-- The final step of preparing the data is to create the iterators. These can be iterated on to return a batch of data which will have a `src` attribute (the PyTorch tensors containing a batch of numericalized source sentences) and a `trg` attribute (the PyTorch tensors containing a batch of numericalized target sentences). Numericalized is just a fancy way of saying they have been converted from a sequence of readable tokens to a sequence of corresponding indexes, using the vocabulary. 

We also need to define a `torch.device`. This is used to tell TorchText to put the tensors on the GPU or not. We use the `torch.cuda.is_available()` function, which will return `True` if a GPU is detected on our computer. We pass this `device` to the iterator.

When we get a batch of examples using an iterator we need to make sure that all of the source sentences are padded to the same length, the same with the target sentences. Luckily, TorchText iterators handle this for us! 

We use a `BucketIterator` instead of the standard `Iterator` as it creates batches in such a way that it minimizes the amount of padding in both the source and target sentences.  -->

The next step is to define a data preprocessing pipeline.

- We already saw above that we need to lower case the sentences and strip newlines from them. 
- In [Sutskever et al. 2014](https://arxiv.org/abs/1409.3215) (a seminal paper introducing RNNs for machine translation), they found it beneficial to reverse the order of the input when feeding it to the model. We will do likewise, so we also need to reverse the German sentences. 
- Then, we want to add the `<bos>` and `<eos>` tokens.
- Finally, we need to encode the tokens as integers using the vocabularies above, and cast the result to a tensor.


**In this cell, write why you think reversing the order of the input might help. (You are welcome to consult the original paper, of course.)** # STUDENT 
<br> **Response:** It is benefitial to reverse the order of the input sentence because it introduces many short-term dependencies between words that simplify optimization and increases accuracy on longer sentences. 

In [69]:
BOS_IDX = de_vocab['<bos>']
EOS_IDX = de_vocab['<eos>']

from typing import List, Tuple
from torch import Tensor

def data_process(raw_dataset) -> List[Tuple[Tensor, Tensor]]:
    # STUDENT 
    ret = []
    for pair in raw_dataset: 
      # lower case and strip both German and English tokens in each sentence pair 
      d_tokens = de_tokenizer(pair[0].strip().lower()) 
      e_tokens = de_tokenizer(pair[1].strip().lower()) 

      # reverse sentence order of German tokens 
      d_tokens = d_tokens[::-1]

      # add <bos> and <eos> tokens to both German and English sentences 
      d_tokens.insert(0, '<bos>')
      d_tokens.append('<eos>')
      e_tokens.insert(0, '<bos>')
      e_tokens.append('<eos>') 

      # get encoded tensor tuple from vocabs  
      d_tens = torch.tensor([de_vocab[token] for token in d_tokens], dtype=torch.long)
      e_tens = torch.tensor([en_vocab[token] for token in e_tokens], dtype=torch.long)
      tup = (d_tens, e_tens)

      # add tensor tuple to list 
      ret.append(tup) 

    return ret 

train_data, valid_data, test_data = Multi30k()
train_data_processed = data_process(train_data)
valid_data_processed = data_process(valid_data)
test_data_processed = data_process(test_data) 

Now we can try decoding the first example. Does it look right? Is the German example reversed?

In [70]:
de_itos = de_vocab.get_itos()
en_itos = en_vocab.get_itos()
de_encoded, en_encoded = train_data_processed[0]
print(" ".join([de_itos[item] for item in de_encoded]))
print(" ".join([en_itos[item] for item in en_encoded]))

<bos> . büsche vieler nähe der in freien im sind männer weiße junge zwei <eos>
<bos> two young , white males are outside near many bushes . <eos>


As a final preprocessing step, we will use the `Dataloader` class provided by Pytorch, which is convenient for batching samples.

The most important argument we provide to `Dataloader` is `collate_fn`. This function receives a list of examples, and is in charge of collating them into a single batch.

The most important step in our `collate_fn` is to call `pad_sequence`, which ensures that all sentences are the same length by adding "pad" symbols to the ends of all but the longest sentence in each batch. Pass `PAD_IDX` to `pad_sequence` to ensure that it uses the expected padding value.

We also need to provide a list of the lengths of each sentence, which is used by the `pack_padded_sequences` function which we will use below.

In [71]:
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

BATCH_SIZE = 128
PAD_IDX = de_vocab['<pad>']

def collate_fn(data_batch) -> Tuple[Tensor, List[int], Tensor]:
    # STUDENT 
    # initialize lists to be added to returned tuple 
    de_batch = [] 
    en_batch = [] 
    sent_lens = [] 
    for pair in data_batch: 
      de = pair[0]
      en = pair[1] 
      # add indices/lengths to corresponding batches/list
      # de_tens = torch.cat([torch.tensor([BOS_IDX]), de, torch.tensor([EOS_IDX])], 0)
      # en_tens = torch.cat([torch.tensor([BOS_IDX]), en, torch.tensor([EOS_IDX])], 0)
      de_batch.append(de)
      en_batch.append(en)
      sent_lens.append(len(de))
    # pad sequences 
    # send to gpu 
    de_batch = torch.tensor(pad_sequence(de_batch, padding_value=PAD_IDX)).to(device) 
    en_batch = torch.tensor(pad_sequence(en_batch, padding_value=PAD_IDX)).to(device) 
    ret = (de_batch, sent_lens, en_batch)

    return ret 

train_dl = DataLoader(
    train_data_processed,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn,
)
valid_dl = DataLoader(
    valid_data_processed,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn,
)
test_dl = DataLoader(
    test_data_processed,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=collate_fn,
)

In [72]:
# testing collate_fn 
indices = [0, 1, 2, 3] 
collate_fn([train_data_processed[i] for i in indices])



(tensor([[   2,    2,    2,    2],
         [   4,    4,    4,    4],
         [3171,    0,  499,  248],
         [7649,    5,   56,    5],
         [ 110, 2069, 7316,  681],
         [  15,  831,    5,   10],
         [   7,   11,    7,  535],
         [  88,   30,  217,   14],
         [  20,   76,   25,   12],
         [  84,    3,   66,   29],
         [  30,    1,    5,   40],
         [ 253,    1,    3,   46],
         [  26,    1,    1,    6],
         [  18,    1,    1,    7],
         [   3,    1,    1,   13],
         [   1,    1,    1,    5],
         [   1,    1,    1,    3]], device='cuda:0'),
 [15, 10, 12, 17],
 tensor([[   2,    2,    2,    2],
         [  16,  113,    4,    4],
         [  24,   30,   53,    9],
         [  15,    6,   33,    6],
         [  25,  325,  230,    4],
         [ 778,  279,   69,   29],
         [  17,   17,    4,   23],
         [  57, 1200,  248,   10],
         [  80,    4, 4286,   36],
         [ 202,  715,    5,    8],
         [1312, 4

### Building the Seq2Seq RNN Model

This model is in three parts. The encoder, the decoder and a seq2seq model that encapsulates the encoder and decoder and will provide a way to interface with each.

The implementation is provided for you. The architecture of the Encoder and Decoder are very similar to that of the LSTMClassifier that we implmeneted in assignment 3. Nevertheless, please read through the directions and the code in order to understand what is going on.

### Encoder

First, the encoder, a single layer GRU. The paper we are implementing uses a **4-layer** LSTM, but in the interest of training time and simplicity we cut this down to a single layer GRU.

$$\begin{align*}
h_t &= \text{GRU}(x_t, h_{t-1})\\
\end{align*}$$


We will make an `Encoder` module inheriting from `torch.nn.Module` and use the `super().__init__()` as some boilerplate code. The encoder takes the following arguments:
- `input_dim` is the size/dimensionality of the one-hot vectors that will be input to the encoder. This is equal to the input (source) vocabulary size.
- `emb_dim` is the dimensionality of the embedding layer. This layer converts the one-hot vectors into dense vectors with `emb_dim` dimensions. 
- `hid_dim` is the dimensionality of the hidden states.
- `dec_hid_dim` is the dimensionality of the hidden states in the decoder, which may differ from the hid_dim of the encoder.
- `n_layers` is the number of layers in the RNN.
- `dropout` is the amount of dropout to use. This is a regularization parameter to prevent overfitting. Check out [this](https://www.coursera.org/lecture/deep-neural-network/understanding-dropout-YaGbR) for more details about dropout.

<!-- If you would like to know about word embeddings, here are a few relevant articles: [1](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/), [2](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html), [3](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), [4](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/).  -->

The embedding layer is created using `nn.Embedding`, the GRU using `nn.GRU` and a dropout layer using `nn.Dropout`. Check the PyTorch [documentation](https://pytorch.org/docs/stable/nn.html) for more about these layers.

One thing to note is that the `dropout` argument to the GRU determines how much dropout to apply between the layers of a multi-layer RNN, i.e. between the hidden states output from layer $l$ and those same hidden states being used for the input of layer $l+1$.

In the `forward` method, we pass in the source sentence, $X$, which is converted into dense vectors using the `embedding` layer, and then dropout is applied. These embeddings are then passed into the RNN. As we pass a whole sequence to the RNN, it will automatically do the recurrent calculation of the hidden states over the whole sequence for us! You may notice that we do not pass an initial hidden or cell state to the RNN. This is because, as noted in the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM), that if no hidden/cell state is passed to the RNN, it will automatically create an initial hidden/cell state as a tensor of all zeros. 

The RNN returns: `outputs` (the top-layer hidden state for each time-step) and `hidden` (the final hidden state for each layer, $h_T$, stacked on top of each other).

<!-- For part 2, when we also need the hidden states for all encoder steps, we also return `outputs`, in addition to `hidden`. -->

The size of each of the tensors is left as a comment in the code.

In [73]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, enc_hid_dim, dec_hid_dim, dropout):
        super().__init__()
        
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.enc_hid_dim = enc_hid_dim
        self.dec_hid_dim = dec_hid_dim
        self.dropout = dropout
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, enc_hid_dim, bidirectional=True)
        self.fc = nn.Linear(enc_hid_dim * 2, dec_hid_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src, src_len):
        
        #src = [src sent len, batch size]
        #src_len = [src sent len]
  
        embedded = self.dropout(self.embedding(src))
        
        #embedded = [src sent len, batch size, emb dim]

        packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, src_len, enforce_sorted=False)
        packed_outputs, hidden = self.rnn(packed_embedded)       
        outputs, _ = nn.utils.rnn.pad_packed_sequence(packed_outputs) 
            
        #outputs = [sent len, batch size, dec_hid_dim * num directions]
        #hidden = [n layers * num directions, batch size, dec_hid_dim]
        
        #hidden is stacked [forward_1, backward_1, forward_2, backward_2, ...]
        #outputs are always from the last layer
        
        #hidden [-2, :, : ] is the last of the forwards RNN 
        #hidden [-1, :, : ] is the last of the backwards RNN
        
        #initial decoder hidden is final hidden state of the forwards and 
        # backwards encoder RNNs concatenated and fed through a linear layer

        hidden = self.fc(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim=1))
        
        #outputs = [sent len, batch size, enc hid dim * 2]
        #hidden = [batch size, dec_hid_dim]
        
        return outputs, hidden

### Decoder

Next, we'll build our decoder, which will be another 1-layer GRU. The `Decoder` class does a single step of decoding. The equation is very similar to that for the encoder.

$$\begin{align*}
s_t = \text{DecoderGRU}(y_t, s_{t-1})
\end{align*}$$

Remember that the initial hidden and cell states to our decoder are our context vectors, which are the final hidden and cell states of our encoder, i.e. $s_0=z=h_T$.

We then pass the hidden state from the RNN, $s_t$, through a linear layer, $f$, to make a prediction of what the next token in the target (output) sequence should be, $\hat{y}_{t+1}$. 

$$\hat{y}_{t+1} = f(s_t)$$

The arguments and initialization are similar to the `Encoder` class, except we now have an `output_dim` which is the size of the one-hot vectors that will be input to the decoder. These are equal to the vocabulary size of the output/target. There is also the addition of the `Linear` layer, used to make the predictions from the hidden state.

Within the `forward` method, we accept a batch of input tokens, previous hidden states and previous cell states. We `unsqueeze` the input tokens to add a sentence length dimension of 1. Then, similar to the encoder, we pass through an embedding layer and apply dropout. This batch of embedded tokens is then passed into the RNN with the previous hidden and cell states. This produces an `output` (hidden state from the RNN), a new `hidden` state (one for each layer–though here there is only one layer—stacked on top of each other) and a new `cell` state (also one per layer, stacked on top of each other). We then pass the `output` (after getting rid of the sentence length dimension) through the linear layer to receive our `prediction`. We then return the `prediction`, the new `hidden` state and the new `cell` state.

In [74]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers):
        super().__init__()

        self.emb_dim = emb_dim
        self.hid_dim = hid_dim
        self.output_dim = output_dim
        self.n_layers = n_layers

        self.embedding = nn.Embedding(output_dim, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, n_layers)
        self.out = nn.Linear(hid_dim, output_dim)
        
        
    def forward(self, input, hidden):
        
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #n directions in the decoder will both always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #context = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        #input = [1, batch size]
        
        embedded = self.embedding(input)
        
        #embedded = [1, batch size, emb dim]

        output, hidden = self.rnn(embedded, hidden)
        
        #output = [sent len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #sent len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden

### Seq2Seq

For the final part of the implementation, we'll implement the seq2seq model. This will handle: 
- receiving the input/source sentence
- using the encoder to produce the context vectors 
- using the decoder to produce the predicted output/target sentence


The `Seq2Seq` model takes in an `Encoder`, `Decoder`, and a `device` (used to place tensors on the GPU, if it exists).

<!-- For this implementation, we have to ensure that **the number of layers** and **the hidden (and cell) dimensions** are equal in the `Encoder` and `Decoder`. This is not always the case, you do not necessarily need the same number of layers or the same hidden dimension sizes in a sequence-to-sequence model. However, if you do something like having a different number of layers you will need to make decisions about how this is handled. For example, if your encoder has 2 layers and your decoder only has 1, how is this handled? Do you average the two context vectors output by the decoder? Do you pass both through a linear layer? Do you only use the context vector from the highest layer? Etc. -->

Our `forward` method takes the source sentence, target sentence and a teacher-forcing ratio. The teacher forcing ratio is used when training our model. When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded, $\hat{y}_{t+1}=f(s_t^L)$. With probability equal to the teaching forcing ratio (`teacher_forcing_ratio`) we will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. However, with probability `1 - teacher_forcing_ratio`, we will use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence.  

The first thing we do in the `forward` method is to create an `outputs` tensor that will store all of our predictions, $\hat{Y}$.

We then feed the input/source sentence, $X$/`src`, into the encoder and receive out final hidden and cell states.

The first input to the decoder is the start of sequence (`<sos>`) token. As our `trg` tensor already has the `<sos>` token appended (we added it in preprocessing) we get our $y_1$ by slicing into it. We know how long our target sentences should be (`max_len`), so we loop that many times. During each iteration of the loop, we:
- pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
- receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
- place our prediction, $\hat{y}_{t+1}$[=`output`] in our tensor of predictions, $\hat{Y}$[=`outputs`]
- decide if we are going to "teacher force" or not
    - if we do, the next `input` is the ground-truth next token in the sequence, $y_{t+1}$[=`trg[t]`]
    - if we don't, the next `input` is the predicted next token in the sequence, $\hat{y}_{t+1}$[=`top1`]
    
Once we've made all of our predictions, we return our tensor full of predictions, $\hat{Y}$[=`outputs`].

In [75]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.enc_hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
#         assert encoder.n_layers == decoder.n_layers, \
#             "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, src_len, trg, teacher_forcing_ratio = 0.5):
        
        #src = [src sent len, batch size]
        #trg = [trg sent len, batch size]
        #teacher_forcing_ratio is probability to use teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth inputs 75% of the time
        
        batch_size = trg.shape[1]
        max_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(max_len, batch_size, trg_vocab_size).to(self.device)
        
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        _, hidden = self.encoder(src, src_len)
        hidden = hidden.unsqueeze(0)
        #first input to the decoder is the <sos> tokens
        input = trg[0,:]
        
        for t in range(1, max_len):
            
            output, hidden = self.decoder(input, hidden)
            outputs[t] = output
            teacher_force = random.random() < teacher_forcing_ratio
            top1 = output.argmax(1)
            input = (trg[t] if teacher_force else top1)
        
        return outputs

### Training the Seq2Seq Model

Now we have our model implemented, we can begin training it. 

First, we'll initialize our model. As mentioned before, the input and output dimensions are defined by the size of the vocabulary. The embedding dimensions and dropout for the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same. 

We then define the encoder, decoder and then our Seq2Seq model, which we place on the `device`.

In [76]:
INPUT_DIM = len(de_vocab)
OUTPUT_DIM = len(en_vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 1
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, HID_DIM, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS)

model = Seq2Seq(enc, dec, device).to(device)

We also define a function that will calculate the number of trainable parameters in the model.

In [77]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 10,616,326 trainable parameters


We define our optimizer, which we use to update our parameters in the training loop. Check out [this](http://ruder.io/optimizing-gradient-descent/) post for information about different optimizers. Here, we'll use Adam.

In [78]:
optimizer = optim.Adam(model.parameters())

Next, we define our loss function. The `CrossEntropyLoss` function calculates both the log softmax as well as the negative log-likelihood of our predictions. 

Our loss function calculates the average loss per token, however by passing the index of the `<pad>` token as the `ignore_index` argument we ignore the loss whenever the target token is a padding token. 

In [79]:
PAD_IDX = en_vocab['<pad>']
assert PAD_IDX == de_vocab['<pad>']

criterion = nn.CrossEntropyLoss(ignore_index = PAD_IDX)

Next, we'll define our training loop. 

First, we'll set the model into "training mode" with `model.train()`. This will turn on dropout (and batch normalization, which we aren't using) and then iterate through our data iterator.

At each iteration:
- get the source and target sentences from the batch, $X$ and $Y$
- zero the gradients calculated from the last batch
- feed the source and target into the model to get the output, $\hat{Y}$
- as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with `.view`
    - we also don't want to measure the loss of the `<sos>` token, hence we slice off the first column of the output and target tensors
- calculate the gradients with `loss.backward()`
- clip the gradients to prevent them from exploding (a common issue in RNNs)
- update the parameters of our model by doing an optimizer step
- sum the loss value to a running total

Finally, we return the loss that is averaged over all batches.

In [80]:
def train(model, train_dl, optimizer, criterion, clip):
    
    model.train() # set the model into "training mode"
    
    epoch_loss = 0
    
    for i, (src, src_len, trg) in enumerate(train_dl):
        
        optimizer.zero_grad() # zero the gradients calculated from the last batch
        output = model(src, src_len, trg)
        
        #trg = [trg sent len, batch size]
        #output = [trg sent len, batch size, output dim]
        
        output = output[1:].view(-1, output.shape[-1])
        # view() Returns a new tensor with the same data as the self tensor but of a different shape.
        
        trg = trg[1:].view(-1)
        
        #trg = [(trg sent len - 1) * batch size]
        #output = [(trg sent len - 1) * batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # CLIP = 1
        # parameters (Iterable[Tensor] or Tensor): an iterable of Tensors or a
        #     single Tensor that will have gradients normalized
        # max_norm (float or int): max norm of the gradients
        # norm_type (float or int): type of the used p-norm. Can be ``'inf'`` for infinity norm.
        
        # forcing the gradients to be reasonably small, 
        # which means that the parameter updates will not push the parameters too far from 
        # their previous values
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
    return epoch_loss / len(train_dl)

Our evaluation loop is similar to our training loop, however as we aren't updating any parameters we don't need to pass an optimizer or a clip value.

We must remember to set the model to evaluation mode with `model.eval()`. This will turn off dropout (and batch normalization, if used).

We use the `with torch.no_grad()` block to ensure no gradients are calculated within the block. This reduces memory consumption and speeds things up. 

The iteration loop is similar (without the parameter updates), however we must ensure we turn teacher forcing off for evaluation. This will cause the model to only use its own predictions to make further predictions within a sentence, which mirrors how it would be used in deployment.

In [81]:
def evaluate(model, val_dl, criterion):
    model.eval()
    epoch_loss = 0
    with torch.no_grad():
    
        for i, (src, src_len, trg) in enumerate(val_dl):

            output = model(src, src_len, trg, 0) #turn off teacher forcing

            #trg = [trg sent len, batch size]
            #output = [trg sent len, batch size, output dim]

            output = output[1:].view(-1, output.shape[-1])
            trg = trg[1:].view(-1)

            #trg = [(trg sent len - 1) * batch size]
            #output = [(trg sent len - 1) * batch size, output dim]

            loss = criterion(output, trg)
            epoch_loss += loss.item()
        
    return epoch_loss / len(val_dl)

Next, we'll create a function that we'll use to tell us how long an epoch takes.

In [82]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

We can finally start training our model!

At each epoch, we'll be checking if our model has achieved the best validation loss so far. If it has, we'll update our best validation loss and save the parameters of our model (called `state_dict` in PyTorch). Then, when we come to test our model, we'll use the saved parameters used to achieve the best validation loss. 

We'll be printing out both the loss and the perplexity at each epoch. It is easier to see a change in perplexity than a change in loss as the numbers are much bigger.

**Run the following cell and see if the training and validation loss/perplexity seem to be improving.** # STUDENT TODO 1D

In [67]:
N_EPOCHS = 5
CLIP = 1
SAVE_DIR = 'models'
MODEL_SAVE_PATH = os.path.join(SAVE_DIR, 'cpsc477_hw4_rnn.pt')

best_valid_loss = float('inf')

if not os.path.isdir(f'{SAVE_DIR}'):
    os.makedirs(f'{SAVE_DIR}')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()

    train_loss = train(model, train_dl, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_dl, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), MODEL_SAVE_PATH)
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')



Epoch: 01 | Time: 0m 58s
	Train Loss: 4.356 | Train PPL:  77.970
	 Val. Loss: 4.124 |  Val. PPL:  61.789
Epoch: 02 | Time: 0m 59s
	Train Loss: 3.458 | Train PPL:  31.750
	 Val. Loss: 3.905 |  Val. PPL:  49.670
Epoch: 03 | Time: 0m 58s
	Train Loss: 3.114 | Train PPL:  22.509
	 Val. Loss: 3.734 |  Val. PPL:  41.844
Epoch: 04 | Time: 0m 58s
	Train Loss: 2.822 | Train PPL:  16.817
	 Val. Loss: 3.730 |  Val. PPL:  41.669
Epoch: 05 | Time: 0m 57s
	Train Loss: 2.603 | Train PPL:  13.502
	 Val. Loss: 3.726 |  Val. PPL:  41.507


We'll load the parameters (`state_dict`) that gave our model the best validation loss and run this model on the test set.

**Run the following cell to see how the model does on the test set.** # STUDENT TODO 1E

In [68]:
model.load_state_dict(torch.load(MODEL_SAVE_PATH))

test_loss = evaluate(model, test_dl, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')



| Test Loss: 3.719 | Test PPL:  41.206 |


## Submission

Once you have completed the assignment, follow the steps below for all 3 notebooks in order to submit your assignment:
1. Click __Runtime__ > __Restart runtime__ and then __Runtime__  > __Run all__ to cleanly generate the output for all cells in the notebook.
2. Save the notebook with the output from all the cells in the notebook by click __File__ > __Download .ipynb__.
<!-- 3. Copy model train and test prints, answers to all short questions, and the shareable line of this notebook to a `README.txt` file. -->
3. Put the .ipynb file <!-- and `README.txt` --> under your hidden directory on the Zoo server `~/hidden/<YOUR_PIN>/Homework5/`.
4. As a final step (after doing the above steps for all 3 notebooks), run a script that will set up the permissions to your homework files, so we can access and run your code to grade it. Make sure the command runs without errors, and do not make any changes or run the code again. If you do run the code again or make any changes, you need to run the permissions script again. Submissions without the correct permissions may incur some grading penalty.
`/home/classes/cs477/bash_files/hw5_set_permissions.sh <YOUR_PIN>`