Made by [Julia Ive](mailto:j.ive@imperial.ac.uk), [Zhenhao Li](mailto:zhenhao.li18@imperial.ac.uk), and [Nihir](mailto:nv419@ic.ac.uk).

 [The AI Core](https://theaicore.com)

# Contents
- [RNNs Recap and Primer](#RNNs-Recap-and-Primer)
 - [RNNs for text classification](#RNNs-for-Text-Classification-Task)
 - [Bi-directional RNNs](#BiDirectional-RNNs)
- [RNNs for Language Modelling](#RNNs-for-Language-Modelling)
 - [Evaluation of Language Models](#Evaluation-of-Language-Models)
 - [Long short term memory architectures LSTMs vs. RNNs](#Long-short-term-memory-architectures-LSTMs-vs.-RNNs)
- [Sequence to sequence model](#Sequence-to-sequence-model)
 - [BLEU Score](#BLEU-Score)

# RNNs Recap and Primer


## RNNs Recap

RNNs are designed to make use of sequential data, where the current step is related to the previous steps. This makes them well suited to applications with a time component (audio, time-series data) and to natural language. In an RNN, the value of a unit depends on its own earlier output, which is fed back to it as input.

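An input vector representing the current input element $x_t$ is multiplied by a weight matrix $W$, added to the previous hidden state $h_{t-1}$ multiplied by a weight matrix $U$, and then passed through an activation function $g$ to compute the activation values of a layer of hidden units, $h_t$. This hidden layer is, in turn, multiplied by a weight matrix $V$ and passed through an output activation $a$ to calculate the corresponding output, $y_t$.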

$h_t = g(Uh_{t-1}+Wx_t)$

$y_t = a(Vh_t)$

The hidden layer from the previous time step, $h_{t-1}$, provides a form of memory, or context, that encodes earlier processing and informs the decisions to be made at later points in time. $U$ determines how the network makes use of this past context. RNNs do not impose any fixed limit on this prior context: in principle it includes information dating back to the beginning of the sequence. The three weight matrices $W$, $U$ and $V$ are shared across all timesteps and are updated during training.
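The two equations above can be traced by hand with a few lines of NumPy. This is a minimal sketch, not the PyTorch implementation used later: it assumes $g$ is tanh and $a$ is softmax, and the sizes and random weights are made up for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rnn_step(x_t, h_prev, W, U, V):
    """One timestep: h_t = tanh(U h_{t-1} + W x_t), y_t = softmax(V h_t)."""
    h_t = np.tanh(U @ h_prev + W @ x_t)
    y_t = softmax(V @ h_t)
    return h_t, y_t

rng = np.random.default_rng(0)
E, D, O = 4, 3, 2                      # illustrative input, hidden and output sizes
W = rng.normal(size=(D, E))
U = rng.normal(size=(D, D))
V = rng.normal(size=(O, D))

h = np.zeros(D)                        # initial hidden state h_0
sequence = rng.normal(size=(5, E))     # 5 timesteps of made-up inputs
for x_t in sequence:
    h, y = rnn_step(x_t, h, W, U, V)   # the same W, U, V are reused at every step

print("final hidden state:", h)
print("final output distribution:", y)
```

Note that the loop carries `h` forward from one step to the next, which is exactly the "memory" described above.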


![rnn_cell](images/rnn_cell.png)


## RNNs for Text Classification Task

There are many variations of RNNs, such as Many-One, Many-Many, etc. We will work with a popular classification task, sentiment analysis: the extraction of the sentiment that a writer expresses toward something they describe. In our case we aim to classify the input text into a positive, negative or neutral class. For example, "I loved the movie" is positive, "I hated the movie" is negative and "The movie was about Australia" is neutral. A Many-One RNN suits this task, as only one value needs to be output to determine the polarity of the review. Usually the last RNN hidden state is used for this, as it summarises the whole sequence.


Q: Why are RNNs better than FFNNs?

FFNNs do not take context into account. Each word is represented by its embedding, independently of the other words. RNNs encode each new word (token) while considering the previous words. The meaning of a word can change depending on the context. For example, compare the meanings of the word "mean" in these two sentences: "I compute the mean" and "His behaviour was mean".


![rnn_classification](images/rnn_classification.png)

In [1]:
import torch
from torchtext import data
# We will work with a dataset from the torchtext package, which consists of data processing utilities and popular datasets for NLP
from torchtext import datasets
import random
import torch.nn as nn
import time
import math

# We fix the seeds to get consistent results for each training.

SEED = 1234

torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Helper function to print time between epochs
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [2]:
# With TorchText Field we define how our data will be processed: here we will use Spacy for tokenisation

TEXT = data.Field(tokenize = 'spacy', lower=True)
LABEL = data.LabelField(dtype = torch.long)

# We will experiment with the widely used Stanford Sentiment Treebank (SST) dataset and will predict the sentiment of movie reviews
# Our data will be classified in three labels: positive, negative and neutral
# We take the standard split

train_data, valid_data, test_data = datasets.SST.splits(
            TEXT, LABEL)

# Print stat over the data

print('train.fields:', train_data.fields)
print('len(train):', len(train_data))
print('vars(train[0]):', vars(train_data[0]))

downloading trainDevTestTrees_PTB.zip


.data\sst\trainDevTestTrees_PTB.zip: 100%|███████████████████████████████████████████| 790k/790k [00:02<00:00, 266kB/s]


extracting
train.fields: {'text': <torchtext.data.field.Field object at 0x0000024AE8CE0508>, 'label': <torchtext.data.field.LabelField object at 0x0000024AE94C05C8>}
len(train): 8544
vars(train[0]): {'text': ['the', 'rock', 'is', 'destined', 'to', 'be', 'the', '21st', 'century', "'s", 'new', '`', '`', 'conan', "''", 'and', 'that', 'he', "'s", 'going', 'to', 'make', 'a', 'splash', 'even', 'greater', 'than', 'arnold', 'schwarzenegger', ',', 'jean', '-', 'claud', 'van', 'damme', 'or', 'steven', 'segal', '.'], 'label': 'positive'}


In [4]:
# Now we build a vocabulary out of tokens available from the pre-trained embedding list and the vocabulary of labels.

TEXT.build_vocab(train_data, vectors="glove.6B.50d")
LABEL.build_vocab(train_data)

print('Text Vocabulary Length', len(TEXT.vocab))
print ("Label Vocabulary Length: ", len(LABEL.vocab))

#We can display the most common words in the vocabulary and their frequencies

print(TEXT.vocab.freqs.most_common(20))

#We can also see the vocabulary directly using the stoi (string to int)

print(LABEL.vocab.stoi)

Text Vocabulary Length 15490
Label Vocabulary Length:  3
[('.', 8041), ('the', 7353), (',', 7131), ('a', 5305), ('and', 4516), ('of', 4456), ('to', 3050), ('-', 2737), ('is', 2565), ("'s", 2544), ('it', 2428), ('that', 1955), ('in', 1916), ('as', 1299), ('but', 1172), ('film', 1166), ('with', 1139), ('for', 1037), ('movie', 1016), ('this', 998)]
defaultdict(None, {'positive': 0, 'negative': 1, 'neutral': 2})


In [5]:
BATCH_SIZE = 64

# place the tensors on the GPU if available

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# BucketIterator is an iterator that will return a batch of examples of similar lengths, minimizing the amount of padding per example.
# Padding refers to fixing the length of inputs (adding a reserved token as many times as needed to reach a certain length), usually the max length within a batch. For example:
# i         like  this  movie <pad>
# the       movie is    very  good
# excellent !     <pad> <pad> <pad>

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
            (train_data, valid_data, test_data), 
            batch_size = BATCH_SIZE,
            sort_within_batch = True,
            device = device)
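The padding described in the comments of the cell above can be sketched in plain Python. This is only an illustration of the idea; `pad_batch` is a hypothetical helper, not part of torchtext, which also converts the tokens to indices.

```python
def pad_batch(sentences, pad_token="<pad>"):
    """Pad every tokenised sentence to the length of the longest one in the batch."""
    max_len = max(len(s) for s in sentences)
    return [s + [pad_token] * (max_len - len(s)) for s in sentences]

batch = [
    ["i", "like", "this", "movie"],
    ["the", "movie", "is", "very", "good"],
    ["excellent", "!"],
]

padded = pad_batch(batch)
for row in padded:
    print(row)
```

Grouping sentences of similar length into the same batch, as `BucketIterator` does, keeps the number of `<pad>` tokens small.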

In [6]:
print(train_iterator)

for batch in train_iterator:
    demo_batch = batch
    break
    
print(demo_batch)

print()

# Note that demo_batch.text has a shape of [sentence length x batch size]
print("Demo batch `text` shape:", demo_batch.text.shape)
# We can simply reshape this into the more familiar [batch size x sentence length]
print("Demo batch `text` transpose shape:", demo_batch.text.T.shape)
print("Demo batch `text` sample: \n", demo_batch.text.T[:3, :])

print()

print("Demo batch `label` shape:", demo_batch.label.shape)
# shape(demo_batch.label) = [batch size]
print("Demo batch `label` sample:", demo_batch.label[:3])

<torchtext.data.iterator.BucketIterator object at 0x0000024AF6815AC8>

[torchtext.data.batch.Batch of size 64 from SST]
	[.text]:[torch.LongTensor of size 22x64]
	[.label]:[torch.LongTensor of size 64]

Demo batch `text` shape: torch.Size([22, 64])
Demo batch `text` transpose shape: torch.Size([64, 22])
Demo batch `text` sample: 
 tensor([[   83,   262,     4,   106,   134,   259,     4,    10,  1060,     4,
          1172,     6,  1544,    39,   729,     5,  1224,    14,     3,   215,
           295,     2],
        [   29,    79,  1023,    69,  1429,   102,    41,     3,   710,  2023,
             4,  1447,     9, 15275, 12243,  4485,    28,   167, 10162,     6,
          3081,     2],
        [    3,  4230,  1861,    13, 10847,  6334,    11,  1373,    94,    10,
          3711,    14,   414,     8,  2165,    60,   115,   143,   757,     6,
          1119,     2]])

Demo batch `label` shape: torch.Size([64])
Demo batch `label` sample: tensor([0, 2, 1])


#### The RNN class

Within the constructor we define the layers:
 - An **embedding layer** which acts as a lookup table to map our tokens to their vectors
 - An **RNN** 
 - A **linear layer**. This layer receives the last hidden state from the RNN and outputs logits of `output dim` dimensionality 

All the parameters are initialized to random values by default unless explicitly specified; here the embedding layer is instead initialized from pre-trained GloVe vectors.


In [9]:
class RNN(nn.Module):

    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim,
                 bidirectional, dropout, pad_idx):

        super().__init__()

        self.bidirectional = bidirectional
        self.hidden_dim = hidden_dim
        
        # An embedding layer (look-up layer) transforms word indices into word embeddings. 
        # Here, we initialize our model with pre-trained embeddings (50D pre-trained GloVe embeddings in our case).
        # Note: `from_pretrained` freezes these embeddings by default (freeze=True); pass freeze=False to fine-tune them on this dataset.
        self.embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors)
        # We can also train the embeddings from scratch:
        #self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_idx)

        # An RNN layer. we specify that the batch dimension goes first
        # We have a bidirectional flag which indicates whether the model is unidirectional or bidirectional
        # RNNs can be stacked - i.e. have multiple layers. Here, we will only look at the 1 layer case.
        self.rnn = nn.RNN(embedding_dim,
                          hidden_dim,
                          batch_first=True,
                          bidirectional=bidirectional,
                          num_layers=1)

        # The linear layer takes the final hidden state and feeds it through a fully connected layer.
          # The dimensionality of the output is equal to the output class count.
          # For classification in a bidirectional RNN we concatenate:
            #  - The last hidden state from the forward RNN (obtained from final word of the sentence)
            #  - The last hidden state from the backward RNN (obtained from the first word of the sentence)
          # Due to the concatenation, our hidden size is doubled.
        
        if self.bidirectional:
            linear_hidden_in = hidden_dim * 2
        else:
            linear_hidden_in = hidden_dim

        # The classification (linear) layer
        self.fc = nn.Linear(linear_hidden_in, output_dim)
        

        # We apply dropout technique that sets a random set of activations of a layer to zero.
          # This prevents the network from learning to rely on specific weights and helps to prevent overfitting. 
          # Note that the dropout layer is only used during training, and not during test time.
        self.dropout = nn.Dropout(dropout)

    def forward(self, text):

        # ACRONYMS:
          # B = Batch size
          # T = Max sentence length
          # E = Embedding dimension
          # D = Hidden dimension
          # O = Output dimension

        # shape(text) = [B, T]

        embedded = self.dropout(self.embedding(text))
        # shape(embedded) = [B, T, E]
        
        # An RNN in PyTorch returns two values:
        # (1) All hidden states of the last RNN layer
        # (2) Hidden state of the last timestep for every layer
          # Note: we are only using 1 layer
        all_hidden, last_hidden = self.rnn(embedded)
        # shape(all_hidden) = [B, T, D*num_directions]
        # shape(last_hidden) = [num_layers*num_directions, B, D].  num_layers = 1
        # NOTE. If we were to NOT use the `batch_first` flag, shape of all_hidden would be [T, B, D*num_directions]
        
        if self.bidirectional:
            # Concat the final forward (last_hidden[0,:,:]) and backward (last_hidden[1,:,:]) hidden states
            last_hidden = torch.cat((last_hidden[0, :, :], last_hidden[1, :, :]), dim=-1)
            # shape(last_hidden) = [B, D*2]

        else:
            last_hidden = last_hidden.squeeze(0)
            # shape(last_hidden) = [B, D]

        # Our predictions.
        logits = self.fc(self.dropout(last_hidden))
        # shape(logits) = [B, O]
        
        return logits



In [10]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 50
HIDDEN_DIM = 128
OUTPUT_DIM = len(LABEL.vocab)
BIDIRECTIONAL = False
DROPOUT = 0.3
# get our pad token index from the vocabulary
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

Let's initialise the RNN now

In [11]:
model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            BIDIRECTIONAL, 
            DROPOUT, 
            PAD_IDX)

Q: How is the model evaluated?

The natural choice is accuracy: the percentage of all the examples that our model labelled correctly. However, it is misleading for unbalanced datasets. Imagine a dataset with 999,900 positive examples and 100 negative examples. A very bad classifier can assign the positive class to all the examples. This classifier would have 999,900 true positives and only 100 false positives, and an accuracy of 999,900/1,000,000 or 99.99%!

Other metrics that are more useful for such datasets are precision, recall and F-measure. Precision measures the percentage of the items that the system labelled as positive that are in fact positive according to the gold labels. Recall measures the percentage of all the gold positive items that the system labelled as positive. The F-score is the weighted harmonic mean of precision and recall.
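To make the contrast with accuracy concrete, here is a small sketch in plain Python. It uses a scaled-down version of the unbalanced dataset (9,990 positive and 10 negative examples, an assumption made to keep the lists short) and computes the metrics with respect to the minority class. `precision_recall_f1` is a hypothetical helper written for this illustration, and it returns the balanced F1 variant of the F-score.

```python
def precision_recall_f1(gold, pred, positive):
    """Precision, recall and F1 for the class named `positive`."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Unbalanced toy data and a classifier that always predicts "pos"
gold = ["pos"] * 9990 + ["neg"] * 10
pred = ["pos"] * 10000

acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
neg_precision, neg_recall, neg_f1 = precision_recall_f1(gold, pred, positive="neg")

print(f"accuracy: {acc:.3f}")
print(f"'neg' precision: {neg_precision}, recall: {neg_recall}, F1: {neg_f1}")
```

Accuracy looks excellent, while recall on the minority class shows that the classifier never finds a single negative example.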

In [12]:
def accuracy(preds, y):
    """
    returns accuracy per batch
    """

    class_preds = nn.Softmax(dim=-1)(preds)
    class_preds = class_preds.max(-1)[1]
    correct = (class_preds == y).float() # convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

In [13]:
import torch.optim as optim

optimizer = optim.Adam(model.parameters())

# we use the cross-entropy loss
criterion = nn.CrossEntropyLoss()

model = model.to(device)
criterion = criterion.to(device)

In [17]:
def train(model, train_iterator, valid_iterator, optimizer, criterion, N_EPOCHS=15):
    
    for epoch in range(N_EPOCHS):
    
        start_time = time.time()

        # To ensure the dropout is "turned on" while training
          # (good practice to include in your projects even if it is not used)
        model.train()
        
        epoch_loss = 0
        epoch_acc = 0
    
        # `batch` exposes the two fields as attributes: `batch.text` and `batch.label`
        for batch in train_iterator:
                        
            # Zero the gradients
            optimizer.zero_grad()

            text = batch.text
            labels = batch.label
            # shape(text) = [T, B]
            # shape(label) = [B]
            
            # We reshape text to [B, T]. 
            # This is purely so we can think about the shapes of the Tensors more consistently
            text = text.T
            
            predictions = model(text)
            
            # compute the loss
            loss = criterion(predictions, labels)
        
            # compute training accuracy
            acc = accuracy(predictions, labels)
              
            # calculate the gradient of each parameter
            loss.backward()
        
            # update the parameters using the gradients and optimizer algorithm 
            optimizer.step()
            
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            
        average_epoch_loss = epoch_loss / len(train_iterator)
        average_epoch_acc = epoch_acc / len(train_iterator)
        
        end_time = time.time()
        
        
        epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
        average_epoch_valid_loss, average_epoch_valid_acc = evaluate(model, valid_iterator, criterion)

        print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
        print(f'\tTrain Loss: {average_epoch_loss:.3f} | Train Acc: {average_epoch_acc*100:.2f}%')
        print(f'\t Val. Loss: {average_epoch_valid_loss:.3f} |  Val. Acc: {average_epoch_valid_acc*100:.2f}%')
        

In [15]:
def evaluate(model, iterator, criterion):

    epoch_loss = 0
    epoch_acc = 0

    # Turn on evaluate mode.
      # De-activates dropout and batch normalization (which we will cover in a future session)
    model.eval()

    # We do not compute gradients within this block, i.e. no training
    # https://discuss.pytorch.org/t/model-eval-vs-with-torch-no-grad/19615/2
    with torch.no_grad():

        for batch in iterator:

            text = batch.text
            labels = batch.label

            text = text.T
            
            predictions = model(text)
            loss = criterion(predictions, labels)
            acc = accuracy(predictions, labels)

            epoch_loss += loss.item()
            epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)


In [18]:
train(model, train_iterator, valid_iterator, optimizer, criterion)

Epoch: 01 | Epoch Time: 0m 1s
	Train Loss: 1.046 | Train Acc: 45.45%
	 Val. Loss: 0.994 |  Val. Acc: 53.00%
Epoch: 02 | Epoch Time: 0m 1s
	Train Loss: 1.001 | Train Acc: 52.96%
	 Val. Loss: 1.003 |  Val. Acc: 52.98%
Epoch: 03 | Epoch Time: 0m 1s
	Train Loss: 0.986 | Train Acc: 54.26%
	 Val. Loss: 0.958 |  Val. Acc: 54.98%
Epoch: 04 | Epoch Time: 0m 1s
	Train Loss: 0.983 | Train Acc: 54.47%
	 Val. Loss: 0.953 |  Val. Acc: 56.98%
Epoch: 05 | Epoch Time: 0m 1s
	Train Loss: 0.976 | Train Acc: 55.32%
	 Val. Loss: 1.009 |  Val. Acc: 50.21%
Epoch: 06 | Epoch Time: 0m 1s
	Train Loss: 0.975 | Train Acc: 54.83%
	 Val. Loss: 0.962 |  Val. Acc: 55.41%
Epoch: 07 | Epoch Time: 0m 1s
	Train Loss: 0.981 | Train Acc: 54.79%
	 Val. Loss: 0.967 |  Val. Acc: 56.64%
Epoch: 08 | Epoch Time: 0m 1s
	Train Loss: 0.974 | Train Acc: 54.92%
	 Val. Loss: 0.957 |  Val. Acc: 56.73%
Epoch: 09 | Epoch Time: 0m 1s
	Train Loss: 0.966 | Train Acc: 55.99%
	 Val. Loss: 0.993 |  Val. Acc: 52.98%
Epoch: 10 | Epoch Time: 0m 1

In [None]:
test_loss, test_acc = evaluate(model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

## BiDirectional RNNs

In problems where all timesteps of the input sequence are available, bidirectional RNNs train two RNNs instead of one on the input sequence: the first on the input sequence as-is and the second on a reversed copy of the input sequence. The outputs at the same step are then usually concatenated. This can provide additional useful context to the model.
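The idea can be sketched with two small NumPy RNNs, one per direction. This is an illustrative sketch under assumed sizes and random weights, not what `nn.RNN(bidirectional=True)` does internally; `run_rnn` is a hypothetical helper.

```python
import numpy as np

def run_rnn(inputs, W, U):
    """Run a tanh RNN over `inputs` ([T, E]) and return all hidden states ([T, D])."""
    h = np.zeros(U.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(U @ h + W @ x_t)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, E, D = 6, 4, 3                                   # illustrative sizes
inputs = rng.normal(size=(T, E))

# Each direction has its own parameters
W_f, U_f = rng.normal(size=(D, E)), rng.normal(size=(D, D))
W_b, U_b = rng.normal(size=(D, E)), rng.normal(size=(D, D))

forward = run_rnn(inputs, W_f, U_f)                 # reads left to right
backward = run_rnn(inputs[::-1], W_b, U_b)[::-1]    # reads right to left, then re-aligned

# Per-timestep outputs: concatenate the two directions at the same position
per_step = np.concatenate([forward, backward], axis=-1)    # [T, 2*D]

# Sentence summary for classification: last forward state + last backward state
summary = np.concatenate([forward[-1], backward[0]])       # [2*D]

print(per_step.shape, summary.shape)
```

After re-alignment, `backward[0]` is the state the backward RNN reaches once it has read the whole sentence, which is why the summary uses position 0 for that direction.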

Q: Why is a bi-directional RNN better than a single-direction one?

Imagine that you see only the left context: "We went to ..." This context is very general and many different words could follow: nouns (London, work, cinema, doctor), verbs (join, support), etc. When we see both the left and right contexts, the word "sleep" becomes evident: "We went to ... early but still could not wake up on time."


![bi_rnn_classification](images/bi_rnn_classification.png)

In [None]:
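BIDIRECTIONAL = True

# One possible completion: reuse the RNN class with the bidirectional flag switched on
bidirectional_model = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM,
                          BIDIRECTIONAL, DROPOUT, PAD_IDX).to(device)

In [None]:
# The new model needs its own optimizer; the earlier one tracks the parameters of `model`
optimizer = optim.Adam(bidirectional_model.parameters())
train(bidirectional_model, train_iterator, valid_iterator, optimizer, criterion)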

In [None]:
test_loss, test_acc = evaluate(bidirectional_model, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

# RNNs for Language Modelling

A language model represents text in a form that a machine can work with. Language Modelling (LM) is at the core of Natural Language Processing (NLP) and underlies many NLP tasks: Machine Translation, Spell Correction, Speech Recognition, Summarization, Question Answering, Sentiment Analysis, etc.

Language is a sequence of letters, words, sentences, paragraphs, etc. These units are not independent. When we comprehend and produce spoken language, we are processing continuous input streams of indefinite length. Even when dealing with written text, we normally read it sequentially. RNNs are therefore a natural fit for modelling language data, because they can encode a language sequence of any length into a fixed-size vector.
 
Q: What is the difference between word embeddings and language modelling?

The main difference is that word embeddings do not take word order into account, whereas language models do. Word order is important. If you ignore it, the representations of the following two sentences will be the same: "It was really not good, on the contrary quite bad." and "It was really not bad, on the contrary quite good." However, the meanings of these two sentences are different.

![rnn_lm](images/rnn_lm.png)


You may have heard about BERT. BERT is a general-purpose pre-trained language model. It is pre-trained on a large amount of text from the Internet to get a better "grasp" of language. It is bidirectional, which gives it a deeper sense of language context and flow than single-direction language models. You can download it and fine-tune it for your NLP problem.

In [None]:
# With TorchText Field we define how our data will be processed
TEXT = data.Field(tokenize = 'spacy', init_token = '<sos>')

# We will be using the WikiText-2 corpus, which is a popular LM dataset.
# The WikiText language modeling dataset is a collection of texts extracted 
  # from good and featured articles on Wikipedia.
# It contains about 2 million words 
train_data, valid_data, test_data = datasets.WikiText2.splits(TEXT)

# Data stats
print('train.fields', train_data.fields)
print('len(train)', len(train_data))

# Build a vocabulary from the training tokens and attach the pre-trained GloVe vectors to it
TEXT.build_vocab(train_data, vectors="glove.6B.100d")
print('Text Vocabulary Length', len(TEXT.vocab))

BATCH_SIZE = 64

# place the tensors on the GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Backpropagation Through Time (BPTT) is the application of the backpropagation training algorithm to RNNs applied to sequence data. Each timestep of the unrolled recurrent neural network may be seen as an additional layer, given the order dependence of the problem, and the internal state from the previous timestep is taken as an input to the subsequent timestep.
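In practice the corpus is cut into fixed-length chunks so that the network is only unrolled for a limited number of timesteps at a time. The sketch below shows the chunking for a single stream of tokens in plain Python; it is a simplification that ignores the batch dimension, and `bptt_chunks` is a hypothetical helper, not the torchtext iterator.

```python
def bptt_chunks(tokens, bptt_len):
    """Yield (text, target) pairs where target is text shifted by one token."""
    last_input = len(tokens) - 1              # the final token can only be a target
    for start in range(0, last_input, bptt_len):
        end = min(start + bptt_len, last_input)
        yield tokens[start:end], tokens[start + 1:end + 1]

corpus = "the cat sat on the mat and looked at the dog".split()
chunks = list(bptt_chunks(corpus, bptt_len=4))

for text, target in chunks:
    print(text, "->", target)
```

Each target sequence is the input sequence offset by one position, so at every timestep the model is asked to predict the next word.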


In [None]:
# BPTTIterator (Backpropagation Through Time Iterator)
# divides the corpus into batches of shape [bptt_len, batch size]

train_iterator, valid_iterator, test_iterator = data.BPTTIterator.splits(
            (train_data, valid_data, test_data), 
                batch_size = BATCH_SIZE, bptt_len=30,
                device = device, repeat=False)

for batch in train_iterator:
    demo_batch = batch
    break
    
print(demo_batch)

print()

# Note that the first dimension is the sequence, and the next is the batch.
  # We can reshape this to [batch size, sentence length] as we did earlier with a transpose.
# The target is the original text offset by 1
print("Demo batch `text`:\n", demo_batch.text[:5, :3])
print("Demo batch `target`:\n", demo_batch.target[:5, :3])

In [None]:
class RNN(nn.Module):
    
    # variant is a flag which is either: "rnn", "lstm", "manual_lstm"
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, dropout, pad_idx, variant):
        
        super().__init__()
        
        self.variant = variant
        
        self.embedding = nn.Embedding.from_pretrained(TEXT.vocab.vectors)

        # UNIDIRECTIONAL RNN layer: for language modelling we do not have access to the right context
        
        if variant == "rnn":
            self.rnn = nn.RNN(embedding_dim, 
                              hidden_dim, 
                              batch_first=True)
        elif variant == "lstm":
            self.rnn = nn.LSTM(embedding_dim, 
                               hidden_dim, 
                               batch_first=True)
        elif variant == "manual_lstm":
            self.rnn = Manual_LSTM(embedding_dim, hidden_dim)
        else:
            raise ValueError("Expected `variant` to be one of 'rnn', 'lstm', or 'manual_lstm'")
            
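        # Linear layer mapping each hidden state to a score for every word in the vocabulary
        self.fc = nn.Linear(hidden_dim, output_dim)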
        
        self.dropout = nn.Dropout(dropout)

       
    def forward(self, text, prev_hidden):
         
        # shape(text) = [B, T]
        
        # If vanilla RNN:
            # shape(prev_hidden) = [1, B, D] where 1 = num_layers*num_directions
        # If LSTM:
            # prev_hidden is a tuple of previous hidden states and cell states: (ALL_HIDDEN_STATES, ALL_CELL_STATES)
            # shape(ALL_HIDDEN_STATES)=shape(ALL_CELL_STATES) = [1, B, D] where 1 = num_layers*num_directions
            
        embedded = self.dropout(self.embedding(text))
        # shape(embedded) = [B, T, E]
        
        all_hidden, last_hidden = self.rnn(embedded, prev_hidden)        
        # shape(all_hidden) = [B, T, D]
        # shape(last_hidden) = [num_layers, B, D]
        
        # Take all hidden states to produce an output word per time step
        logits = self.fc(self.dropout(all_hidden))
        # shape(logits) = [B, T, O]
            
        return logits, last_hidden

In [None]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = len(TEXT.vocab)
DROPOUT = 0.5
# get our pad token index from the vocabulary
PAD_IDX = TEXT.vocab.stoi[TEXT.pad_token]

In [None]:
rnn_model = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM, 
            DROPOUT, 
            PAD_IDX,
            variant="rnn")

In [None]:
import torch.optim as optim

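# Use the Adam optimizer
optimizer = optim.Adam(rnn_model.parameters())

# we use the cross-entropy loss
# reduction='none' keeps one loss value per token; this is an assumption inferred from
# the training loop below, which calls loss.mean() and loss.numel()
criterion = nn.CrossEntropyLoss(reduction='none')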

rnn_model = rnn_model.to(device)
criterion = criterion.to(device)

We need to detach the hidden state between batches; otherwise the model will try to backpropagate all the way to the beginning of the dataset, which requires a lot of memory.
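The effect of detaching can be checked on a tiny example. This is a minimal sketch with a made-up `nn.RNN` and random inputs, separate from the language model in this notebook: the hidden state produced by one chunk is carried into the next chunk as a value, but the gradient no longer flows back through it.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=3, batch_first=True)

chunk_1 = torch.randn(2, 5, 4)     # [B, T, E], made-up data
chunk_2 = torch.randn(2, 5, 4)

_, hidden = rnn(chunk_1)           # shape(hidden) = [1, B, D]
attached = hidden.requires_grad    # True: still linked to the graph of chunk_1

hidden = hidden.detach()           # same values, but cut off from chunk_1's graph
detached = hidden.requires_grad    # False: treated as a constant from now on

out, _ = rnn(chunk_2, hidden)
out.sum().backward()               # backpropagates through chunk_2 only

print(attached, detached)
```

Without the call to `detach()`, every backward pass would also have to traverse the computation graphs of all earlier chunks.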

In [None]:
def save_hidden(hidden):
  """Wraps hidden states in new Tensors, to declare it not to need gradients. So that the initial hidden state for this batch is constant and doesn’t depend on anything."""

  if isinstance(hidden, torch.Tensor):
    return hidden.detach()
  else:
    return tuple(save_hidden(v) for v in hidden)


## Evaluation of Language Models

Language is very difficult to evaluate since there is no single gold truth: one meaning could be expressed in many valid ways.


### Human Evaluation

Human evaluation is costly, slow and subjective, but reliable. Human evaluation of a language model may involve judging how well a hypothesis satisfies the grammatical and lexical norms of a language.


### Perplexity

Does the model prefer real (frequently observed) sentences to ungrammatical or gibberish (rarely observed) ones? 
Remember that entropy is the average number of bits (or nats, when the natural logarithm is used) needed to encode the information contained in a random variable. The exponentiation of the entropy (perplexity, $e^{H}$ with $H$ in nats) can therefore be read as the weighted average number of choices the random variable has. We evaluate our predicted distribution $Q$ against samples drawn from the true distribution $P$: $PPL = e^{CrossEntropy}$.

We measure perplexity on an unseen (test) corpus and generally compare a range of models using this score. The best LM is the one with the lowest perplexity on the test corpus.
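A small worked example helps to read perplexity as an "average number of choices". The sketch below assumes we already have the probability the model assigned to each observed token; the probabilities are made up, and `perplexity_from_probs` is a hypothetical helper.

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity = exp(average negative log-probability of the observed tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that is always torn between 4 equally likely words
uniform_ppl = perplexity_from_probs([0.25] * 8)

# A model that assigns high probability to the words that actually occur
confident_ppl = perplexity_from_probs([0.9, 0.8, 0.95, 0.85])

print(f"uniform over 4 choices: {uniform_ppl:.3f}")
print(f"confident model:        {confident_ppl:.3f}")
```

The uniform model has a perplexity of about 4, matching its four equally likely choices, while the confident model scores much closer to the minimum of 1.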


In [None]:
def perplexity(loss_per_token):
    return math.exp(loss_per_token)

In [None]:
def train(model, train_iterator, valid_iterator, optimizer, criterion, N_EPOCHS=10, is_lstm=False, force_stop=False):
    
    for epoch in range(N_EPOCHS):
    
        start_time = time.time()
        
        model.train()

        epoch_loss = 0
        epoch_items = 0
        
        # The `1` is the number of layers * number of directions.
        # i.e. we have 1 layer and we are moving in 1 direction
        # More info: https://pytorch.org/docs/stable/nn.html#rnn
        prev_hidden = torch.zeros(1, BATCH_SIZE, HIDDEN_DIM, device=device)
        if is_lstm:
            prev_ht = torch.zeros(1, BATCH_SIZE, HIDDEN_DIM, device=device)
            prev_ct = torch.zeros(1, BATCH_SIZE, HIDDEN_DIM, device=device)
            prev_hidden = (prev_ht, prev_ct)

        
        # `batch` is a tuple of Tensors: (TEXT, TARGET)
        for i, batch in enumerate(train_iterator):
            
            if force_stop:
                print("Currently processing train batch {} of {}".format(i, len(train_iterator)))
                if i % 7 == 0 and i != 0:
                    break
            
            # Zero the gradients
            optimizer.zero_grad()

            text = batch.text
            targets = batch.target
            # shape(text) = [T, B]
            # shape(target) = [T, B]
            
            # We reshape text and target to [B, T]. 
            text = text.T
            targets = targets.T
            # shape(text) = [B, T]
            # shape(target) = [B, T]
            
            # Starting each batch, we detach the hidden state from how it was previously produced.
            # Otherwise the model would backpropagate all the way to beginning of the dataset.
            prev_hidden = save_hidden(prev_hidden)
            
            logits, prev_hidden = model(text, prev_hidden)
            
            # Compute the loss
            # We reshape inputs to eliminate batching
            loss = criterion(logits.view(-1, OUTPUT_DIM), targets.reshape(-1))
        
            # backprop the average loss and update parameters
            # Why average loss?
            loss.mean().backward()
        
            # update the parameters using the gradients and optimizer algorithm 
            optimizer.step()
            
            
            epoch_loss += loss.detach().sum()
            epoch_items += loss.numel()
        
        # We compute loss per token for an epoch
        train_loss_per_token = epoch_loss / epoch_items
        # We compute perplexity
        train_ppl = perplexity(train_loss_per_token)

        end_time = time.time()        
        
        epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
        valid_loss_per_token, valid_ppl = evaluate(model, 
                                                   valid_iterator, 
                                                   criterion,
                                                   is_lstm=is_lstm,
                                                   force_stop=force_stop)

        print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
        print(f'\tTrain Loss: {train_loss_per_token:.3f} | Train Perplexity: {train_ppl:.3f}')
        print(f'\t Val. Loss: {valid_loss_per_token:.3f} |  Val. Perplexity: {valid_ppl:.3f}')
        
        if force_stop:
            break

In [None]:
def evaluate(model, iterator, criterion, is_lstm=False, force_stop=False):
    
    model.eval()
    
    epoch_loss = 0
    epoch_items = 0
    
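    # We initialise the first hidden state with zeros.
    # The `1` is the number of layers * number of directions.
    prev_hidden = torch.zeros(1, BATCH_SIZE, HIDDEN_DIM, device=device)
    if is_lstm:
        prev_ht = torch.zeros(1, BATCH_SIZE, HIDDEN_DIM, device=device)
        prev_ct = torch.zeros(1, BATCH_SIZE, HIDDEN_DIM, device=device)
        prev_hidden = (prev_ht, prev_ct)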

    with torch.no_grad():
        for i, batch in enumerate(iterator):
            
            if force_stop and i % 3 == 0 and i != 0:
                print("Currently processing valid batch {} of {}".format(i, len(iterator)))
                break

            text, target = batch.text, batch.target
            text, target = text.T, target.T
            logits, prev_hidden = model(text, prev_hidden)

            # compute the loss
            loss = criterion(logits.view(-1, OUTPUT_DIM), target.reshape(-1))

            prev_hidden = save_hidden(prev_hidden)

            epoch_loss += loss.detach().sum()
            epoch_items += loss.numel()

        loss_per_token = epoch_loss / epoch_items
        ppl = math.exp(loss_per_token)
            
        
    return loss_per_token, ppl

In [None]:
train(rnn_model, train_iterator, valid_iterator, optimizer, criterion, force_stop=True)


## Long short-term memory architectures LSTMs vs. RNNs

Remember the vanishing/exploding gradient problem? As the gradient signal backpropagates through more timesteps, it gets smaller and smaller (or, in the exploding case, larger and larger). This is caused by the repeated multiplication by the recurrent weight matrix in an RNN. The gradient can be viewed as a measure of the effect of the past on the future: if it becomes vanishingly small over long distances, we cannot correctly capture dependencies on the distant past. For example, in "A patient with a rare sarcoma of soft tissue on the left thigh was presented to the hospital yesterday.", "was presented" depends on "a patient", but they are separated by 11 words!

![full_lstm](images/lstm_full.png)

The key to LSTMs is the cell state $c_t$. It runs straight down the entire chain and allows information to flow along it largely unchanged. An LSTM has two "hidden states": $c_t$ and $h_t$. You can think of $c_t$ as the "internal" hidden state that retains important information over longer spans of timesteps, whereas $h_t$ is the "external" hidden state that exposes that information to the outside world.

The LSTM can remove information from or add information to the cell state. Gates are a way to optionally let information through. They are composed of a sigmoid neural net layer and a pointwise multiplication operation. An LSTM has three of these gates.

The forget gate decides what information we are going to throw away from the cell state. 

$f_t = \sigma(W_{if}x_t + W_{hf}h_{t-1}+b_f)$

$\sigma$ squashes input values between 0 and 1, describing how much of each component should be let through. Zero means "let nothing through", while a value of one means "let everything through".

![lstm_ft](images/lstm_ft.png)

The input gate decides what new information we are going to store in the cell state. 

$i_t = \sigma(W_{ii}x_t + W_{hi}h_{t-1}+b_i)$

Next, a tanh layer creates a vector of new candidate values, $g_t$, that could be added to the state. tanh squashes the output values to be between −1 and 1. 

$g_t = \tanh(W_{ig}x_t + W_{hg}h_{t-1}+b_g)$ (this equation is equivalent to the vanilla RNN update if we remove the gates)

The next step combines these two to create an update to the state. The pointwise multiplication operation (*) with $f_t$ decides which parts of the old cell state we keep, and with $i_t$ which parts of the candidate we add.

$c_t = f_t * c_{t-1} + i_t * g_t$

![lstm_it_cand](images/lstm_it_cand.png)

Finally, the output gate decides how much information goes to the output:

$o_t = \sigma(W_{io}x_t + W_{ho}h_{t-1}+b_o)$

$h_t = o_t * \tanh(c_t)$

![lstm_ot](images/lstm_ot.png)


Q: How does this help with the vanishing gradient problem?

Whereas the RNN computes the new hidden state from scratch based on the previous hidden state and the input, the LSTM computes the new cell state by choosing what to keep from the previous cell state and what to add to it. This additive update lets the gradient bypass the long chain of multiplications by the recurrent weight matrix. Along such an additive path the gradient is much less likely to shrink to ~0.
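As a rough numerical illustration (a toy scalar sketch, not a full Jacobian analysis): in a vanilla RNN the gradient reaching a step $k$ positions back is a product of $k$ per-step factors, whereas along the direct cell-state path of an LSTM the per-step factor is the forget gate $f_t$, which the network can keep close to 1. The factor values 0.5 and 0.99 below are assumptions chosen purely for illustration.

```python
# Toy scalar illustration of gradient flow over many timesteps.
# The per-step factors are illustrative assumptions, not measured values.

def gradient_after(steps, per_step_factor):
    """Multiply the same per-step gradient factor `steps` times."""
    grad = 1.0
    for _ in range(steps):
        grad *= per_step_factor
    return grad

steps = 50
# vanilla RNN: assume each step scales the gradient by 0.5
rnn_grad = gradient_after(steps, 0.5)
# LSTM direct cell-state path: d c_t / d c_{t-1} = f_t, assume f_t = 0.99
lstm_grad = gradient_after(steps, 0.99)

print(f"vanilla RNN, factor 0.5 : {rnn_grad:.3e}")
print(f"LSTM, forget gate 0.99  : {lstm_grad:.3f}")
```

After 50 steps the first product is numerically negligible, while the second still carries a usable signal.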

Q: How are language models trained?

Very often so-called "teacher forcing" is used. It works by feeding the ground-truth output from the training dataset at the current time step t as input at the next time step t+1, rather than the output generated by the network. This makes learning faster and the model more stable, because an early mistake does not propagate: the model is not punished at every subsequent word for an error it made earlier in the sequence. 
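For a language model, teacher forcing amounts to using the same sequence as input and target, shifted by one position. A minimal sketch with a made-up token list (the sentence and the `<sos>`/`<eos>` markers are only for illustration):

```python
# Teacher forcing for a language model: the input at step t+1 is the
# ground-truth token from step t, so inputs and targets are the same
# sequence shifted by one position.
tokens = ["<sos>", "i", "loved", "the", "movie", "<eos>"]

inputs = tokens[:-1]   # what the model reads at each step
targets = tokens[1:]   # what the model must predict at each step

for step, (x, y) in enumerate(zip(inputs, targets)):
    print(f"t={step}: input={x!r:10} target={y!r}")
```

Whatever the model predicts at step t, the next input is still the true token, which is why errors do not compound during training.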

In [None]:
class Manual_LSTM(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()

        self.hidden_size = hidden_size

        self.forget_gate = nn.Sequential(
            nn.Linear(hidden_size+input_size, hidden_size),
            nn.Sigmoid()
        )

        ##self.input_gate =
        ##self.candidate_gate = 
        ##self.output_gate = 
    
    def forward(self, x, prev_hidden):

        # shape(x) = [B, T, input_size]
        # shape(prev_hidden) = ([1, B, hidden_size], [1, B, hidden_size]) where 1 = num_layers * num_directions

        batch_size, sequence_length, _ = x.size()
        
        # At t=0, h_t and c_t will be initialized to a vector of 0s
        h_t = prev_hidden[0].squeeze(0)
        c_t = prev_hidden[1].squeeze(0)

        hidden_states = torch.zeros(batch_size, sequence_length, self.hidden_size).to(device)

        for t in range(sequence_length):

            # shape(x_t) = [B, input_size]
            x_t = x[:, t, :]

            # shape(concat_h_x) = [B, hidden_size+input_size]
            ##concat_h_x = 

            # shape(f_t) = [B, hidden_size]
            f_t = self.forget_gate(concat_h_x)

            # shape(c_prime_t) = [B, hidden_size]
            ##c_prime_t =

            # shape(i_t) = [B, hidden_size]
            # shape(cand_t) = [B, hidden_size]
            i_t = self.input_gate(concat_h_x)
            ##cand_t = 

            # shape(c_t) = [B, hidden_size]
            ##c_t =

            # shape(o_t) = [B, hidden_size]
            # shape(h_t) = [B, hidden_size]
            ##o_t =
            ##h_t =

            hidden_states[:, t, :] = h_t

        h_t, c_t = h_t.unsqueeze(0), c_t.unsqueeze(0)
        return hidden_states, (h_t, c_t)

In [None]:
manual_lstm = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM,
            DROPOUT, 
            PAD_IDX,
            variant="manual_lstm")

lstm = RNN(INPUT_DIM, 
            EMBEDDING_DIM, 
            HIDDEN_DIM, 
            OUTPUT_DIM,
            DROPOUT, 
            PAD_IDX,
            variant="lstm")

In [None]:
optimizer = optim.Adam(manual_lstm.parameters())
train(manual_lstm, train_iterator, valid_iterator, optimizer, criterion, is_lstm=True, force_stop=True)

In [None]:
optimizer = optim.Adam(lstm.parameters())
train(lstm, train_iterator, valid_iterator, optimizer, criterion, is_lstm=True, force_stop=True)

How did the RNN, Manual LSTM and LSTM fare against each other?

![vanilla_rnn_lm](images/vanilla_rnn_lm.png)
![manual_lstm_lm](images/manual_lstm_lm.png)
![lstm_lm](images/lstm_lm.png)

## LSTMs vs. GRUs

A Gated Recurrent Unit (GRU) combines the forget and input gates into a single "update gate" ($z_t$), so it has only two gates: update and reset ($r_t$). It also merges the cell state and hidden state, and makes some other changes. The resulting model is simpler than the standard LSTM. When computing the candidate state $g_t$, the reset gate $r_t$ can suppress the contribution of the previous hidden state $h_{t-1}$. The final state is a convex combination of $g_t$ and $h_{t-1}$, with coefficients $(1 - z_t)$ and $z_t$ respectively.

$r_t = \sigma(W_{ir}x_t + W_{hr}h_{t-1}+b_r)$

$z_t = \sigma(W_{iz}x_t + W_{hz}h_{t-1}+b_z)$

$g_t = \tanh(W_{ig}x_t + r_t * (W_{hg}h_{t-1}+b_g))$

$h_t = (1 - z_t)* g_t + z_t * h_{t-1}$
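The four equations can be written out as a single step function. This is a minimal NumPy sketch rather than the notebook's PyTorch model: the parameter dictionary, the sizes and the random initialisation are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU step following the four equations above."""
    r_t = sigmoid(p["W_ir"] @ x_t + p["W_hr"] @ h_prev + p["b_r"])  # reset gate
    z_t = sigmoid(p["W_iz"] @ x_t + p["W_hz"] @ h_prev + p["b_z"])  # update gate
    # candidate state: the reset gate scales the contribution of h_prev
    g_t = np.tanh(p["W_ig"] @ x_t + r_t * (p["W_hg"] @ h_prev + p["b_g"]))
    # convex combination of the candidate and the previous hidden state
    h_t = (1.0 - z_t) * g_t + z_t * h_prev
    return h_t, g_t, z_t

rng = np.random.default_rng(0)
input_size, hidden_size = 3, 4   # arbitrary toy sizes
params = {}
for gate in ("r", "z", "g"):
    params[f"W_i{gate}"] = rng.normal(size=(hidden_size, input_size))
    params[f"W_h{gate}"] = rng.normal(size=(hidden_size, hidden_size))
    params[f"b_{gate}"] = np.zeros(hidden_size)

x_t = rng.normal(size=input_size)
h_prev = rng.normal(size=hidden_size)
h_t, g_t, z_t = gru_step(x_t, h_prev, params)
print(h_t.shape)
```

Because $z_t$ lies between 0 and 1, every component of `h_t` falls between the corresponding components of `g_t` and `h_prev`.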

In [None]:
# self.gru = nn.GRU(...)

# Sequence-to-sequence model
https://arxiv.org/abs/1409.3215 \
So far we have encountered classification tasks where the inputs are of variable length, and we used recurrent neural networks (RNN/LSTM/GRU) to make predictions. However, when it comes to text generation, the length of the output may vary as well. In this case, we use a sequence-to-sequence model.

![seq2seq](images/seq2seq.png)

A sequence-to-sequence (seq2seq) model consists of two components called the **Encoder** and the **Decoder**. Commonly, two recurrent neural networks are used as the encoder and the decoder. The input is fed into the encoder RNN token by token, producing a fixed-length vector (the final hidden state) that encodes the context of the whole input sequence. We refer to this vector as the **context vector**. The decoder uses this context vector to initialise its first hidden state, takes the $<sos>$ token as its first input, and generates the output token by token.

Seq2seq models are often used in NLP tasks where the lengths of both input and output are not fixed, e.g. machine translation and dialogue systems. In the following part, we are going to build a vanilla seq2seq model with LSTMs as the encoder/decoder modules for the machine translation task.

https://pytorch.org/tutorials/beginner/torchtext_translation_tutorial.html \
https://github.com/spro/practical-pytorch/blob/master/seq2seq-translation/seq2seq-translation-batched.ipynb

In [None]:
# import essential libraries
import torch.nn.functional as F
from tqdm import tqdm

Now we are running a machine translation model on an actual dataset: Multi30k. Multi30k is a dataset for multi-modal machine translation. We'll only use the texts in this dataset and we load the dataset with *Torchtext*, which can help us with all the pre-processing and data loading.

In [None]:
from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

# torchtext will pre-process the data, including tokenization, padding, stoi, etc.
SRC = Field(tokenize = "spacy",
            tokenizer_language="de",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

TRG = Field(tokenize = "spacy",
            tokenizer_language="en",
            init_token = '<sos>',
            eos_token = '<eos>',
            lower = True)

train_data, valid_data, test_data = Multi30k.splits(exts = ('.de', '.en'),
                                                    fields = (SRC, TRG))
# print the number of examples in train/valid/test sets
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

In [None]:
# build a vocab of our training set, ignoring words with frequency less than 2
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

In [None]:
# build train/valid/test iterators, which will batch the data for us
BATCH_SIZE = 128
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data),
    batch_size = BATCH_SIZE,
    device = device)

x = vars(test_data.examples[0])['src']
y = vars(test_data.examples[0])['trg']
print("Source example:", " ".join(x))
print("Target example:", " ".join(y))
print("Padded target:", TRG.pad([y]))
print("Tensorized target:", TRG.process([y]))

# Model
### Encoder
We have three layers in the encoder: an embedding layer (with dropout), an RNN layer, and a linear layer. As we saw in the word representation session, we can use an embedding layer so that the distributed word representations are trained jointly with the model. 

If we want to have a bidirectional encoder to encode both forward and backward contexts in the input, the hidden dimension of the RNN layer is doubled. Therefore, the linear layer is here to keep the same dimensionality between the encoder output and decoder input.
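The dimensionality bookkeeping can be checked on plain arrays. The following NumPy sketch uses random stand-ins for the forward and backward final hidden states and a randomly initialised projection; all names and sizes are assumptions for illustration, not the encoder defined below.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, hidden_dim = 2, 5   # arbitrary toy sizes

# stand-ins for the final hidden states of the two directions
h_forward = rng.normal(size=(batch_size, hidden_dim))
h_backward = rng.normal(size=(batch_size, hidden_dim))

# concatenating both directions doubles the feature dimension: [B, 2*D]
h_cat = np.concatenate([h_forward, h_backward], axis=-1)

# a linear layer followed by tanh maps it back to [B, D] for the decoder
W = rng.normal(size=(2 * hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)
h_proj = np.tanh(h_cat @ W + b)

print(h_cat.shape, h_proj.shape)
```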

In [None]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hidden_dim, layers, PAD_IDX=1, bidirectional=False, dropout=0.1):
        super(Encoder, self).__init__()
        
        self.input_dim = input_dim
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim
        self.layers = layers
        self.bidirectional = bidirectional
        self.PAD_IDX = PAD_IDX
        
        
        # If we use a bidirectional encoder to encode both forward and backward context,
        # the dimension of the hidden state will double
        self.dropout = nn.Dropout(dropout)
        if bidirectional:
            ff_input_dim = 2 * hidden_dim
        else:
            ff_input_dim = hidden_dim
        
        self.embedding = nn.Embedding(input_dim, emb_dim, padding_idx=PAD_IDX)
        self.rnn = nn.LSTM(emb_dim, self.hidden_dim, layers, dropout=dropout, \
                           bidirectional=bidirectional, bias=False)
        self.ff = nn.Sequential(
            nn.Linear(ff_input_dim, hidden_dim),
            nn.Tanh()
        )
        
    def forward(self, x):

        # shape(x) = [T, B]

        x = self.dropout(self.embedding(x))
        # shape(x) = [T, B, E]

        outputs, (h_n, c_n) = self.rnn(x)
        # shape(outputs) = [T, B, D*num_directions]
          # if we used the `batch_first` flag, shape(outputs) would be [B, T, D*num_directions]
        # shape(h_n)=shape(c_n) = [num_layers*num_directions, B, D]
        
        if self.bidirectional:
            # concatenate the forward and backward hidden states
            h_n = torch.cat((h_n[0::2,:,:], h_n[1::2,:,:]), dim = -1)
            c_n = torch.cat((c_n[0::2,:,:], c_n[1::2,:,:]), dim = -1)
        
        h_n = self.ff(h_n)
        c_n = self.ff(c_n)
        
        return outputs, (h_n, c_n)

### Decoder
The decoder has four layers: an embedding layer (with dropout), a unidirectional RNN layer and two linear layers. The decoder is always unidirectional in that we only generate the outputs from left to right. 

Remember that the embedding layer is actually a matrix of all word vectors, so it acts like a linear layer. Assuming the embedding layer and the output linear layer have the same dimensionality, we can tie their weights by sharing all the parameters between the two layers. This reduces the model size and might improve performance.

In [None]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hidden_dim, layers, weight_tying=True, PAD_IDX=1, dropout=0.1):
        super(Decoder, self).__init__()
        self.output_dim = output_dim
        self.emb_dim = emb_dim
        self.hidden_dim = hidden_dim
        self.layers = layers
        
        self.dropout = nn.Dropout(dropout)
        
        '''
        TODO define the embedding layer, rnn layer and the two linear layers, with correct dimensions
        '''
        self.embedding = nn.Embedding()
        self.rnn = nn.LSTM() # we don't set bidirectional here
        
        self.ff = nn.Sequential(
            nn.Linear(),
            nn.Tanh()
        )
        # This linear layer is to ensure the output layer has the same dimensionality with the embedding layer
        self.out = nn.Linear()
        
        
        '''
        TODO apply weight tying by sharing the weights of the embedding and output layer
        '''
        # share the weights for embedding layer and output layer
        if weight_tying:
            self.embedding.weight =

    def forward(self, x, hidden):

        # shape(x) = [B]

        # we expand the dim of sequence length
        x = x.unsqueeze(0)
        # shape(x) = [1, B]
        
        
        '''
        TODO feed the input to the embedding layer and dropout layer
        '''
        embed = 
        # shape(embed) = [1, B, E]

        
        '''
        TODO feed the word vector to the RNN layer, initializing hidden state with the encoder hidden state
        '''
        output, hidden = 
        # shape(output) = [1, B, D]
        # shape(h_n)=shape(c_n) = [num_layers*num_directions, B, D]
        
        
        '''
        TODO pass output through the two linear layers
        '''
        ff_out = 
        # shape(ff_out) = [1, B, E]
        # squeeze the output:
        ff_out =
        # shape(ff_out) = [B, E]
        
        logits = 
        # shape(logits) = [B, O]
        return logits, hidden
    

### Seq2Seq

In [None]:
class Seq2seq(nn.Module):
    
    def __init__(self, encoder, decoder, device='cpu'):
        super(Seq2seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
    # src: [seq_len, batch_size]
    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        batch_size = src.shape[1]
        max_len = trg.shape[0]
        output_dim = self.decoder.output_dim
        
        outputs = torch.zeros(max_len, batch_size, output_dim).to(self.device)
        
        _, hidden = self.encoder(src)
        
        # initialize output sequence with '<sos>'
        dec_output = trg[0,:]
        
        # decode token by token
        for t in range(1, max_len):
            dec_output, hidden = self.decoder(dec_output, hidden)
            outputs[t] = dec_output
            teacher_force = random.random() < teacher_forcing_ratio
            
            pred_next = dec_output.argmax(1)
            
            '''
            TODO use the ground truth token if using teacher forcing
            '''
            dec_output = 
        return outputs

    # greedy search for actual translation
    def greedy_search(self, src, sos_idx, max_len=50):
        src = src.to(self.device)
        batch_size = src.shape[1]
        output_dim = self.decoder.output_dim
        
        outputs = torch.zeros(max_len, batch_size).to(self.device)
        
        _, hidden = self.encoder(src)
        
        
        dec_output = torch.zeros(batch_size, dtype=torch.int64).to(self.device)
        dec_output.fill_(sos_idx)
        
        outputs[0] = dec_output
        
        for t in range(1, max_len):
            dec_output, hidden = self.decoder(dec_output, hidden)
            
            dec_output = dec_output.argmax(1)

            outputs[t] = dec_output
        return outputs

Now that we have finished our seq2seq model, let's build a toy model.

In [None]:
INPUT_DIM=4
OUTPUT_DIM=4
EMB_DIM=10
HIDDEN_DIM=6
LAYERS=2

# define the encoder and decoder, and build the model
'''
TODO define an encoder and a decoder, and then define the model
'''
enc = 
dec = 
model = 
print(model)

### Model, optimizer and criterion

These are our model hyperparameters. In actual training, we might need to tune the hyperparameters on the validation set before evaluating on the test set.

In [None]:
INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
EMB_DIM=256
HIDDEN_DIM=512
LAYERS=1
DROPOUT=0.5
BIDIRECTIONAL=True

In [None]:
'''
TODO get indexes of the padding tokens in source and target vocabulary
'''
# padding token
SRC_PAD = 
TRG_PAD = 

# build model
enc = Encoder(INPUT_DIM, EMB_DIM, HIDDEN_DIM, LAYERS, PAD_IDX=SRC_PAD, bidirectional=BIDIRECTIONAL, dropout=DROPOUT)
dec = Decoder(OUTPUT_DIM, EMB_DIM, HIDDEN_DIM, LAYERS, PAD_IDX=TRG_PAD, dropout=DROPOUT)
model = Seq2seq(enc, dec, device).to(device)

In [None]:
# initialize weights
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.1, 0.1)
        
model.apply(init_weights)
print(model)

def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The optimizer updates the model parameters using the gradients computed each time we back-propagate. We are using Adam as our optimizer.

In [None]:
LR=0.001
# set optimizer and learning rate
optimizer = optim.Adam(model.parameters(), lr=LR)

We use *CrossEntropyLoss* as our loss function, which combines the log softmax and the negative log-likelihood. We pass the index of the padding token in the target vocab to the criterion so that it ignores the loss at padded positions.
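To make the effect of ignoring padding concrete, here is a plain-Python sketch of the same computation: log softmax, negative log-likelihood, and an average taken only over non-padding targets. The function name, the toy logits and the padding index are assumptions for illustration; in the notebook this work is done by the criterion defined below.

```python
import math

def cross_entropy_ignore_pad(logits, targets, pad_idx):
    """Mean negative log-likelihood over non-padding targets.

    logits: list of per-token score lists; targets: list of class indices.
    Assumes at least one target is not padding.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == pad_idx:
            continue  # padded positions contribute nothing to the loss
        # log of the softmax normaliser, computed stably by subtracting the max
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += -(scores[target] - log_z)
        count += 1
    return total / count

PAD = 1  # hypothetical padding index for this toy example
logits = [[2.0, 0.0, 0.0], [0.0, 0.0, 3.0], [5.0, 5.0, 5.0]]
targets = [0, 2, PAD]

loss = cross_entropy_ignore_pad(logits, targets, PAD)
print(f"loss over the 2 real tokens: {loss:.4f}")
```

The third position is padding, so its logits have no influence on the result.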

In [None]:
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD)

Now we can define our training loop.
1. We iterate over the training iterator and get a batch of training examples
2. The input is passed through the model and it returns the predictions
3. We calculate the loss between the model predictions and the ground truths
4. We back-propagate the loss and the optimizer updates the model parameters

To avoid exploding gradients, we clip the gradient norm to a maximum value at every training iteration.
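The training loop calls `torch.nn.utils.clip_grad_norm_`, which rescales all gradients together when their joint norm exceeds the threshold. The idea can be sketched on a flat list of numbers; the helper name and the small constant added to the denominator are assumptions of this sketch.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    # the small constant guards against division by zero
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [3.0, 4.0]   # joint L2 norm is 5.0
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))

print(f"norm before: {norm_before:.3f}, norm after: {norm_after:.3f}")
```

Clipping by norm keeps the direction of the update and only shortens it; gradients whose norm is already below the threshold are left untouched.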

In [None]:
def train(model, iterator, optimizer, criterion, grad_clip, num_epoch):
    model.train()
    
    total_loss = 0

    for i, batch in enumerate(iterator):
        # the Fields were built without batch_first, so batches are already [seq_len, batch_size]
        src = batch.src
        trg = batch.trg

        # set gradients to zero to avoid accumulating the gradients
        optimizer.zero_grad()
        
        outputs = model(src, trg)
        
        # exclude <sos> token
        # outputs: (seq_len * batch_size, output_dim)
        # trg : (seq_len * batch_size)
        outputs = outputs[1:].view(-1, outputs.shape[-1])
        trg = trg[1:].view(-1)

        loss = criterion(outputs, trg)
        
        writer.add_scalar('training loss',
                            loss.item(),
                            num_epoch * len(iterator) + i)
        
        if i % 50 == 0:
            print('Batch:\t {0} / {1},\t loss: {2:2.3f}'.format(i, len(iterator), loss.item()))
        
        loss.backward()
        # clip grad to avoid gradient explosion
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        optimizer.step()
        
        total_loss += loss.item()
    return total_loss / len(iterator)

The evaluating loop is similar to the training loop, except that we don't want to do back-propagation.

In [None]:
def eval(model, iterator, criterion):
    # In eval mode, layers such as Dropout and BatchNorm switch to their inference behaviour
    model.eval()
    
    total_loss = 0
    # this disables gradient tracking, so no back-propagation graph is built
    with torch.no_grad():
        for _, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            
            # at evaluation time the decoder must rely on its own predictions, so we turn off teacher forcing
            outputs = model(src, trg, teacher_forcing_ratio=0)
            
            outputs = outputs[1:].view(-1, outputs.shape[-1])
            trg = trg[1:].view(-1)

            loss = criterion(outputs, trg)
            total_loss += loss.item()
    return total_loss / len(iterator)

In [None]:
# Helper function, converting a batch of tensors to the text form
def get_text_from_tensor(tensor, field, eos='<eos>'):
    batch_output = []
    for i in range(tensor.shape[1]):
        sequence = tensor[:,i]
        words = []
        for tok_idx in sequence:
            tok_idx = int(tok_idx)
            token = field.vocab.itos[tok_idx]

            if token == '<sos>':
                continue
            elif token == '<eos>' or token == '<pad>':
                break
            else:
                words.append(token)
        words = " ".join(words)
        batch_output.append(words)
    return batch_output

## BLEU Score

BLEU is an automatic metric that assumes that the closer a machine translation is to a professional human translation, the better it is. This is a rather strong assumption considering how many different "correct" ways there are to translate a sentence. For example, the French sentence "Courage!" could be translated into English as "Cheer up!", "Go for it!", "Chin up!", etc.
BLEU computes n-gram matching between the system output and one or more reference (human) translations. An n-gram is simply a sequence of n words within a given window, and when computing the n-grams you typically move the window one word forward at a time. Typically, n-gram orders between 1 and 5 are considered. According to the formula $m = N-n+1$, where $N$ is the number of words in the sentence and $m$ the number of n-grams of order $n$, in the sentence "I like chocolate and vanilla ice cream" there are:

- 7 unigrams (1-grams)
- 6 bigrams (2-grams)
- 5 trigrams (3-grams)
- 4 quadrigrams (4-grams)
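These counts can be reproduced with a sliding window. A short sketch (the helper name `ngrams` is our own and splitting on whitespace is a simplification of real tokenization):

```python
def ngrams(tokens, n):
    """Return all n-grams obtained by sliding a window of size n one word at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I like chocolate and vanilla ice cream".split()

for n in range(1, 5):
    print(f"{n}-grams: {len(ngrams(sentence, n))}")

print(ngrams(sentence, 2)[:3])
```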

BLEU rewards the same words in the same order. It is the most widely used metric. The final score ranges from 0 to 100: the higher the score, the closer the translation is to a human translation. BLEU computes a geometric mean of the n-gram precisions and is a document-level metric: if a higher-order n-gram precision (e.g., n = 4) of a sentence is 0, then the BLEU score of the entire sentence is 0, even if some lower-order n-grams are matched:

$BLEU = brevity\_penalty \cdot \exp(\sum^N_{n=1} w_n \log modified\_precision_n)$

The brevity penalty penalizes short translations. The default configuration uses N = 4 and uniform weights $w_n = 1/N$.
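To see how the pieces interact, here is a simplified single-reference, sentence-level sketch in plain Python with N = 4 and uniform weights. It is meant only to illustrate the formula: the function names are our own, there is no smoothing or special tokenization, and it is not a substitute for `sacrebleu`, which aggregates counts over the whole corpus.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(hyp, ref, n):
    """Clipped n-gram matches divided by the number of hypothesis n-grams."""
    hyp_counts, ref_counts = ngram_counts(hyp, n), ngram_counts(ref, n)
    matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matches / total if total else 0.0

def simple_bleu(hyp, ref, max_n=4):
    precisions = [modified_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # the geometric mean is 0 as soon as one precision is 0
    log_mean = sum(math.log(p) for p in precisions) / max_n  # uniform weights 1/N
    # hypotheses shorter than the reference are penalised
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100 * bp * math.exp(log_mean)

ref = "the cat sat on the mat".split()
identical = simple_bleu(ref, ref)
one_word_off = simple_bleu("the cat sat on the rug".split(), ref)
no_4gram = simple_bleu("the cat is on the mat".split(), ref)

print(f"{identical:.2f} {one_word_off:.2f} {no_4gram:.2f}")
```

The last hypothesis shares five of its six words with the reference but has no matching 4-gram, so its score collapses to 0, which is the behaviour described above.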

![bleu1](images/bleu1.png)
![bleu2](images/bleu2.png)





In [None]:
import sacrebleu

def test_bleu(model, iterator, trg_field):
    model.eval()

    ref = []
    hyp = []
    
    with torch.no_grad():
        for _, batch in enumerate(iterator):
            src = batch.src
            trg = batch.trg
            
            outputs = model.greedy_search(src, trg_field.vocab.stoi['<sos>'])
            
            hyp += get_text_from_tensor(outputs, trg_field)
            ref += get_text_from_tensor(trg, trg_field)
            
    # expand dim of reference list
    # sys = ['translation_1', 'translation_2']
    # ref = [['truth_1', 'truth_2'], ['another truth_1', 'another truth_2']]
    ref = [ref]
    return sacrebleu.corpus_bleu(hyp, ref, force=True)

Now let's start our training! We keep the checkpoint with the highest valid BLEU as our best checkpoint.

**Training relies heavily on a GPU and will be very slow on a CPU. You may skip this block and load our pre-trained model.**

In [None]:
EPOCH = 30
CLIP = 1

best_bleu = float('-inf')

for i in range(EPOCH):
    print('Start training Epoch {}:'.format(i+1))
    
    '''
    TODO calculate training loss and valid loss using the function we defined above
    '''
    train_loss = 
    valid_loss = 
    bleu = test_bleu(model, valid_iterator, TRG)
    
    writer.add_scalar('valid loss',
                valid_loss,
                i)
    writer.add_scalar('valid ppl',
                      math.exp(valid_loss),
                     i)
    writer.add_scalar('valid BLEU',
                bleu.score,
                i)
    
    if bleu.score > best_bleu:
        best_bleu = bleu.score
        torch.save(model.state_dict(), 'checkpoint_best-seq2seq.pt')
    
    print('Epoch {0} train loss: {1:.3f} | Train PPL: {2:7.3f}'.format(i+1, train_loss, math.exp(train_loss)))
    print('Epoch {0} valid loss: {1:.3f} | Valid PPL: {2:7.3f}'.format(i+1, valid_loss, math.exp(valid_loss)))
    print('Epoch {0} valid BLEU: {1:3.3f}'.format(i+1, bleu.score))

In [None]:
model.load_state_dict(torch.load('checkpoint_best-seq2seq.pt'))

'''
TODO calculate bleu score on the test set
'''
test_bleu()