# 1 - Sequence to Sequence Learning with Neural Networks

+ In this series we'll be building a machine learning model to go from one sequence to another, using PyTorch and torchtext.
+ This will be done on German to English translations with the [Multi30k dataset*](https://github.com/multi30k/dataset).

+ We use spaCy to tokenize the data. Its English and German models need to be downloaded first:

  ```bash
  python -m spacy download en_core_web_sm
  python -m spacy download de_core_news_sm
  ```

+ Reference: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) paper

+ *Multi30k is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence. 



# Introduction
+ The hidden state $(h_{i})$ as a vector representation of the sentence
+ The context vector $(z)$ as an abstract representation of the entire input sentence

+ **Decoder**, one per time-step.

  In the decoder, we need to go from the hidden state to an actual word, therefore at each time-step we use $s_t$ to predict (by passing it through a `Linear` layer, shown in purple) what we think is the next word in the sequence, $\hat{y}_t$. 

  $$\hat{y}_t = f(s_t)$$

  The words in the decoder are always generated one after another, with one per time-step. 

+ **Teacher Forcing**: ground truth + predicted word    
    
  We always use `<sos>` for the first input to the decoder, $y_1$, but for subsequent inputs, $y_{t>1}$, we will sometimes use the actual, ground truth next word in the sequence, $y_t$, and sometimes use the word predicted by our decoder, $\hat{y}_{t-1}$. This is called *teacher forcing*; for more background, see [Teacher Forcing](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/).

+ **Know the length in advance**

  When training/testing our model, we always know how many words are in our target sentence, so we stop generating words once we hit that many. During inference it is common to keep generating words until the model outputs an `<eos>` token or until a certain number of words have been generated.

+ **Calculate the loss**

  Once we have our predicted target sentence, $\hat{Y} = \{ \hat{y}_1, \hat{y}_2, ..., \hat{y}_T \}$, we compare it against our actual target sentence, $Y = \{ y_{1}, y_{2}, ..., y_{T} \}$, to calculate our loss. We then use this loss to update all of the parameters in our model.
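
The inference-time stopping rule described above (stop at `<eos>` or at a length cap) can be sketched in a few lines. This is only an illustration under stated assumptions: `greedy_decode`, `next_token_fn` and `max_len` are hypothetical names, and the "model" is a fixed lookup table standing in for a trained decoder.

```python
def greedy_decode(next_token_fn, sos="<sos>", eos="<eos>", max_len=10):
    """Generate tokens one per time-step until <eos> or max_len tokens."""
    tokens = []
    prev = sos  # the first decoder input is always <sos>
    for _ in range(max_len):
        nxt = next_token_fn(prev)
        if nxt == eos:
            break  # the model signalled the end of the sentence
        tokens.append(nxt)
        prev = nxt  # feed the prediction back in as the next input
    return tokens


# Stand-in for a trained decoder: a fixed next-token table (assumption).
table = {"<sos>": "two", "two": "men", "men": "walk", "walk": "<eos>"}
print(greedy_decode(table.get))             # stops when <eos> is produced
print(greedy_decode(table.get, max_len=2))  # stops at the length cap
```

A real decoder would also carry its hidden and cell states from one step to the next; they are left out here to keep the stopping logic visible.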

    

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm
!python -m spacy download de_core_news_sm

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

from torchtext.legacy.datasets import Multi30k
from torchtext.legacy.data import Field, BucketIterator

import spacy
import numpy as np

import random
import math
import time


SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True


+ Load the spaCy models
+ Tokenizer Functions
  + reverse the order of the input
  + normal order for the output

  In the paper we are implementing, they find it beneficial to reverse the order of the input which they believe "introduces many short term dependencies in the data that make the optimization problem much easier". We copy this by reversing the German sentence after it has been transformed into a list of tokens.


+ torchtext's Fields 
  
  torchtext's Fields handle how data should be processed.

  + appends the "start of sequence" and "end of sequence" tokens
  + converts all words to lowercase
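
The preprocessing steps listed above can be mimicked without spaCy or torchtext. The sketch below is an approximation, not what the libraries do internally: `preprocess` is a hypothetical helper, whitespace splitting stands in for the spaCy tokenizer, and torchtext applies these steps at different points in its pipeline.

```python
def preprocess(sentence, reverse=False, init_token="<sos>", eos_token="<eos>"):
    """Lowercase, tokenize, optionally reverse, then wrap in <sos>/<eos>."""
    tokens = sentence.lower().split()  # whitespace split stands in for spaCy
    if reverse:
        tokens = tokens[::-1]  # source sentences are reversed, as in the paper
    return [init_token] + tokens + [eos_token]


src = preprocess("Zwei junge Männer sind im Freien .", reverse=True)
trg = preprocess("Two young men are outside .")
print(src)
print(trg)
```

Only the tokens between `<sos>` and `<eos>` are reversed, so the source still starts with `<sos>` and ends with `<eos>`.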


In [2]:
# Load Spacy model
spacy_de = spacy.load('de_core_news_sm')
spacy_en = spacy.load('en_core_web_sm')

# tokenizer functions
# transform a sentence into a list of tokens
def tokenize_de(text):
    """
    Tokenizes German text from a string into a list of strings (tokens) and reverses it
    (SRC-source)
    """
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    """
    Tokenizes English text from a string into a list of strings (tokens)
    (TRG-target)
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]


# torchtext's Field
SRC = Field(tokenize = tokenize_de, 
          init_token = '<sos>', 
          eos_token = '<eos>', 
          lower = True)

TRG = Field(tokenize = tokenize_en, 
            init_token = '<sos>', 
            eos_token = '<eos>', 
            lower = True)

# Download Multi30k dataset
train_data, valid_data, test_data = Multi30k.splits(
    exts = ('.de', '.en'), fields = (SRC, TRG))



In [3]:
# check the data by its length
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of testing examples: {len(test_data.examples)}")

# print out a single example and make sure the source sentence is reversed:
from pprint import pprint
pprint(vars(train_data.examples[0]))

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000
{'src': ['.',
         'büsche',
         'vieler',
         'nähe',
         'der',
         'in',
         'freien',
         'im',
         'sind',
         'männer',
         'weiße',
         'junge',
         'zwei'],
 'trg': ['two',
         'young',
         ',',
         'white',
         'males',
         'are',
         'outside',
         'near',
         'many',
         'bushes',
         '.']}


# Build Vocabulary
Use torchtext's `Field` object to build the vocabulary.

+ Build the vocabulary for the source and target languages. 

  The vocabulary is used to associate each unique token with an index (an integer). The vocabularies of the source and target languages are distinct.

+ Frequency Condition and Unknown Token (`<unk>`)

  Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. Tokens that appear only once are converted into an `<unk>` (unknown) token.

+ Vocabulary is built only from the training set

  It is important to note that our vocabulary should only be built from the training set and not the validation/test set. This prevents "information leakage" into our model, which would give us artificially inflated validation/test scores.
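
To make the `min_freq` rule concrete, here is a minimal sketch of vocabulary building in plain Python. It is not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, and the order of the special tokens is an assumption chosen for illustration.

```python
from collections import Counter


def build_vocab(tokenized_sentences, min_freq=2,
                specials=("<unk>", "<pad>", "<sos>", "<eos>")):
    """Map each token seen at least min_freq times to an integer index."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    itos = list(specials)
    # Most frequent first; ties broken alphabetically for a stable order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}


def numericalize(tokens, stoi):
    """Replace out-of-vocabulary tokens with the <unk> index."""
    return [stoi.get(tok, stoi["<unk>"]) for tok in tokens]


# The vocabulary is built from the training sentences only.
train = [["a", "man", "walks"], ["a", "dog", "runs"], ["a", "man", "runs"]]
stoi = build_vocab(train, min_freq=2)
print(stoi)
print(numericalize(["a", "dog", "walks", "fast"], stoi))
```

In this toy corpus "dog" and "walks" appear only once and "fast" never appears, so all three map to the `<unk>` index.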


In [4]:
# use Field object to build vocabulary
SRC.build_vocab(train_data, min_freq = 2)
TRG.build_vocab(train_data, min_freq = 2)

print(f"Unique tokens in source (de) vocabulary: {len(SRC.vocab)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG.vocab)}")

Unique tokens in source (de) vocabulary: 7855
Unique tokens in target (en) vocabulary: 5893


# Build iterators (BucketIterator*) for DataSet 

#### BucketIterator

In NLP, when we get a batch of examples using an iterator, we need to make sure that all of the source sentences are padded to the same length, and likewise for the target sentences.

We use a `BucketIterator` instead of the standard `Iterator` as it creates batches in such a way that it minimizes the amount of padding in both the source and target sentences. 
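
The effect of grouping sentences of similar length can be shown with a small count of padding tokens. This is a sketch of the idea only, not how `BucketIterator` is implemented; `pad_batch` and `count_padding` are hypothetical helpers.

```python
def pad_batch(batch, pad_token="<pad>"):
    """Pad every sentence in the batch to the length of the longest one."""
    max_len = max(len(sent) for sent in batch)
    return [sent + [pad_token] * (max_len - len(sent)) for sent in batch]


def count_padding(sentences, batch_size):
    """Total number of <pad> tokens needed when batching in the given order."""
    total = 0
    for i in range(0, len(sentences), batch_size):
        batch = sentences[i:i + batch_size]
        max_len = max(len(sent) for sent in batch)
        total += sum(max_len - len(sent) for sent in batch)
    return total


lengths = [3, 12, 4, 11, 5, 10]
sentences = [["w"] * n for n in lengths]

print(pad_batch([["a", "b"], ["c"]]))
print(count_padding(sentences, batch_size=2))                   # arbitrary order
print(count_padding(sorted(sentences, key=len), batch_size=2))  # grouped by length
```

With these toy lengths, batching in the given order needs 21 padding tokens, while batching after sorting by length needs 7.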

In [6]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 128

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE, 
    device = device)

# Building the Seq2Seq Model

## Encoder 

+ 2 layer LSTM 

  (The paper we are implementing uses a 4-layer LSTM, but in the interest of training time we cut this down to 2-layers.)

## Decoder

+ Also a 2-layer (4 in the paper) LSTM

+ Decoding a single token per time-step

  The `Decoder` class performs a single step of decoding, i.e. it outputs a single token per time-step. Because we only decode one token at a time, the input tokens always have a sequence length of 1.

+ Context vectors as the initial state of the decoder

  The initial hidden and cell states of our decoder are our context vectors, which are the final hidden and cell states of our encoder from the same layer, i.e. $(s_0^l,c_0^l)=z^l=(h_T^l,c_T^l)$.
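
The hand-off from encoder to decoder can be checked on tensor shapes alone. The snippet below is a sketch with small, arbitrary dimensions and random tensors in place of real embeddings; it assumes only `torch.nn.LSTM` with its default `[seq len, batch size, features]` layout.

```python
import torch
import torch.nn as nn

emb_dim, hid_dim, n_layers, batch_size, src_len = 8, 16, 2, 4, 5

encoder_rnn = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers)
decoder_rnn = nn.LSTM(emb_dim, hid_dim, num_layers=n_layers)

# Encoder: run over the whole (already embedded) source sequence.
src_embedded = torch.randn(src_len, batch_size, emb_dim)
_, (hidden, cell) = encoder_rnn(src_embedded)
print(hidden.shape, cell.shape)  # one final state per layer

# Decoder: a single time-step, so the sequence length is 1.
# The encoder's final states are passed in as the decoder's initial states.
step_embedded = torch.randn(1, batch_size, emb_dim)
output, (hidden, cell) = decoder_rnn(step_embedded, (hidden, cell))
print(output.shape)
```

Because each layer of the decoder is initialized from the matching layer of the encoder, both LSTMs need the same number of layers and the same hidden size.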


In [59]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(
            num_embeddings=input_dim, embedding_dim=emb_dim
        )
        
        self.rnn = nn.LSTM(
            input_size=emb_dim, hidden_size=hid_dim, 
            num_layers=n_layers, dropout=dropout
        )
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, src):
        '''
        #src = [src len, batch size]
        #embedded = [src len, batch size, emb dim]

        #outputs = [src len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]

        # e.g.
        # hidden = torch.Size([2, 128, 512])
        #   cell = torch.Size([2, 128, 512])
        
        #outputs are always from the top hidden layer
        '''        
        embedded = self.dropout(self.embedding(src))
        outputs, (hidden, cell) = self.rnn(embedded)
              
        return hidden, cell

class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(
            num_embeddings=output_dim, embedding_dim=emb_dim
        )
        
        self.rnn = nn.LSTM(
            input_size=emb_dim, hidden_size=hid_dim, 
            num_layers=n_layers, dropout=dropout
        )
        
        self.fc_out = nn.Linear(
            in_features=hid_dim, out_features=output_dim
        )
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, input, hidden, cell):
        '''
        #input = [batch size]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #In our case, n directions in the decoder will always be 1, therefore:
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]


        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        '''
        # Decoding single token per time-step, 
        # so the input tokens will always have a sequence length of 1
        input = input.unsqueeze(0)

        #input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        #embedded = [1, batch size, emb dim]
                
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
                
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        #prediction = [batch size, output dim]
        
        return prediction, hidden, cell

## Seq2Seq

+ receiving the input/source sentence
+ using the encoder to produce the context vectors
+ using the decoder to produce the predicted output/target sentence


#### Forward Step

+ The forward method takes 
  + the source sentence
  + the target sentence
  + a teacher-forcing ratio, used when training our model.

+ Output storage
  + Create an outputs tensor that will store all of our predictions, $\hat{y}$.

#### Teacher Force

+ When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded. 

+ With probability equal to the `teacher_forcing_ratio` we will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. 

+ However, with probability (`1 - teacher_forcing_ratio`), we will use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence.

+ Note: let $R$ be the `teacher_forcing_ratio` and let $P$ be a number drawn uniformly at random from $[0, 1)$ at each time-step.

  $$ \text{next input} = \begin{cases} y_{t+1} \text{ (ground truth)}, & \text{if } P < R \quad (\text{probability } R) \\ \hat{y}_{t+1} \text{ (predicted, from argmax)}, & \text{if } P \geq R \quad (\text{probability } 1 - R) \end{cases} $$

  + if $P < R$, the next `input` is the ground-truth next token in the sequence, $y_{t+1}$

  + if $P \geq R$, the next `input` is the predicted next token in the sequence, $\hat{y}_{t+1}$, which we get by doing an `argmax` over the output tensor
  
  + e.g. if `teacher_forcing_ratio` is 0.75 we use ground-truth tokens as decoder inputs 75% of the time
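
The coin flip behind teacher forcing can be simulated on its own. In this sketch `choose_next_input` is a hypothetical helper and the strings `"truth"` and `"pred"` stand in for real token tensors.

```python
import random


def choose_next_input(ground_truth, predicted, teacher_forcing_ratio, rng=random):
    """Pick the next decoder input with one uniform draw from [0, 1)."""
    use_ground_truth = rng.random() < teacher_forcing_ratio
    return ground_truth if use_ground_truth else predicted


rng = random.Random(0)  # seeded so the run is repeatable
picks = [choose_next_input("truth", "pred", 0.75, rng) for _ in range(10_000)]
print(picks.count("truth") / len(picks))  # close to 0.75
```

A ratio of 0 always feeds the model its own predictions, which is how the evaluation code turns teacher forcing off, and a ratio of 1 always feeds the ground truth.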


#### Decoder Loop 

+ The decoder loop starts at 1, not 0. 
+ This means the 0th element of our `outputs` tensor remains all zeros. So our `target` and `outputs` will look something like:

$$\begin{align*}
\text{target} = [<sos>, &y_1, y_2, y_3, <eos>]\\
\text{outputs} = [0, &\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}
$$

+ Later on when we calculate the loss, we cut off the first element of each tensor to get:

$$\begin{align*}
\text{target} = [&y_1, y_2, y_3, <eos>]\\
\text{outputs} = [&\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]
\end{align*}
$$
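
The slicing and flattening step can be tried on dummy tensors. This sketch uses small, arbitrary sizes and random values; only the shapes and the alignment between predictions and targets matter.

```python
import torch

trg_len, batch_size, vocab_size = 5, 3, 7

outputs = torch.zeros(trg_len, batch_size, vocab_size)
outputs[1:] = torch.randn(trg_len - 1, batch_size, vocab_size)  # row 0 stays zero
trg = torch.randint(0, vocab_size, (trg_len, batch_size))

# Drop time-step 0 (<sos> in trg, zeros in outputs), then flatten.
flat_outputs = outputs[1:].view(-1, vocab_size)
flat_trg = trg[1:].view(-1)

print(flat_outputs.shape)  # [(trg len - 1) * batch size, vocab size]
print(flat_trg.shape)      # [(trg len - 1) * batch size]
```

After flattening, row `i` of `flat_outputs` and element `i` of `flat_trg` refer to the same time-step and the same sentence in the batch, which is the pairing the loss function expects.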


In [60]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        
        self.device = device
        
        # Equal hidden dims and layer counts are not required in general,
        # but this implementation passes the encoder states straight to the decoder.
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
    def forward(self, src, trg, teacher_forcing_ratio = 0.5):
        '''
        #src = [src len, batch size]
        #trg = [trg len, batch size]
        
        #teacher_forcing_ratio is the probability of using teacher forcing
        #e.g. if teacher_forcing_ratio is 0.75 we use ground-truth tokens as decoder inputs 75% of the time
        
        '''
      
        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim
        
        #tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)
        
        # ---- Encoder ----
        #last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)
        
        # ---- Decoder ----
        #first input to the decoder is the <sos> tokens
        # [<sos>, y1, y2, y3 ]
        # trg = [trg len, batch size], trg[0,:] -> first token (<sos>) of every sentence in the batch
        input = trg[0,:]
        
        #loop will start from 1 
        for t in range(1, trg_len):
            
            #insert input token embedding, previous hidden and previous cell states
            #receive output tensor (predictions) and new hidden and cell states
            output, hidden, cell = self.decoder(input, hidden, cell)
            
            #place predictions in a tensor holding predictions for each token
            outputs[t] = output
            #get the highest predicted token from our predictions
            predicted_top1 = output.argmax(1) 
            
            #decide if we are going to use teacher forcing or not
            #teacher_force = [True, False]
            teacher_force = random.random() < teacher_forcing_ratio
            
            #if teacher forcing, use actual next token as next input
            #if not, use predicted token
            input = trg[t] if teacher_force else predicted_top1
        
        return outputs

# Training

## Optimizer & Loss Function

+  The loss function calculates the average loss per token.
+  By passing the index of the `<pad>` token as the `ignore_index` argument, we ignore the loss whenever the target token is a padding token.
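
The effect of `ignore_index` can be checked on a toy example. The sketch below uses random logits and assumes a padding index of 1 purely for illustration; it compares the loss over a padded target with the loss computed over the non-padding positions only.

```python
import torch
import torch.nn as nn

PAD_IDX = 1  # assumed index of <pad>, for illustration only
torch.manual_seed(0)

logits = torch.randn(4, 6)  # 4 target positions, 6 vocabulary entries
targets = torch.tensor([3, 5, PAD_IDX, PAD_IDX])

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss_with_padding = criterion(logits, targets)

# Same loss computed by hand over the two non-pad positions only.
loss_without_padding = nn.CrossEntropyLoss()(logits[:2], targets[:2])

print(loss_with_padding.item(), loss_without_padding.item())
```

The two values agree because padded positions are excluded both from the sum and from the count used for the average.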

In [61]:
'''Note
# e.g.
# TRG.pad_token = '<pad>'
# TRG_PAD_IDX = 1
'''

INPUT_DIM = len(SRC.vocab)
OUTPUT_DIM = len(TRG.vocab)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5


def init_weights(m):
    '''initialize weights 
    with a uniform distribution(nn.init.uniform_) between -0.08 and +0.08
    '''
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

def count_parameters(model):
    '''calculate the number of trainable parameters in the model
    '''
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# model
enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)
model = Seq2Seq(enc, dec, device).to(device)

model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7855, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

In [62]:
# optimizer
optimizer = optim.Adam(model.parameters())
# loss function
TRG_PAD_IDX = TRG.vocab.stoi[TRG.pad_token]
criterion = nn.CrossEntropyLoss(ignore_index = TRG_PAD_IDX)

In [63]:
print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 13,899,013 trainable parameters


In [65]:
# ------ testing block ------

# for i, batch in enumerate(train_iterator):
#   # print(i,batch)
#   src = batch.src
#   trg = batch.trg

#   # print(src.shape)
#   # print(trg.shape)

#   # print(trg[:,0])
#   # print(trg[0,:])

#   output = model(src, trg)
#   break


# # for i in range(1,4):
# #   print(i)

# Define Train & Evaluate Functions

#### Loss and Perplexity
+ We'll be printing out both the loss and the perplexity at each epoch. 
+ It is easier to see a change in perplexity than a change in loss as the numbers are much bigger.

$$\text{Perplexity} = e^{\text{Loss}}$$
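
One way to read perplexity: since the loss is the average negative log-likelihood per token, its exponential is the effective number of equally likely choices the model is hesitating between. A minimal sketch, where `perplexity` is a hypothetical helper that takes the probabilities the model assigned to the correct tokens:

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-likelihood of the correct tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# A model that always gives the correct token probability 1/4 is as
# uncertain as a uniform choice among 4 tokens.
print(perplexity([0.25, 0.25, 0.25]))
# Higher confidence in the correct tokens gives lower perplexity.
print(perplexity([0.9, 0.8, 0.95]))
```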


#### `math.exp()` vs `numpy.exp()`
+ Exponential function.

+ The `math.exp` works only for scalars, whereas `numpy.exp` will work for arrays.

  ```python
  >>> import math
  >>> import numpy as np
  >>> x = [1.,2.,3.,4.,5.]
  >>> math.exp(x)

  Traceback (most recent call last):
    File "<pyshell#10>", line 1, in <module>
      math.exp(x)
  TypeError: a float is required
  >>> np.exp(x)
  array([   2.71828183,    7.3890561 ,   20.08553692,   54.59815003,
          148.4131591 ])
  ```


In [31]:
def train(model, iterator, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    
    for i, batch in enumerate(iterator):
        
        src = batch.src
        trg = batch.trg #trg = [trg len, batch size]
        

        # forward
        output = model(src, trg) #output = [trg len, batch size, output dim]
        
        # loss function
        # Align the target and the predicted output
        output_dim = output.shape[-1]
        output = output[1:].view(-1, output_dim) #output = [(trg len - 1) * batch size, output dim]
        trg = trg[1:].view(-1) #trg = [(trg len - 1) * batch size]
        
        loss = criterion(output, trg)

        # backward
        optimizer.zero_grad()
        loss.backward()
        
        # clip gradients, then take the optimizer (Adam) step
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()

        # loss computation records
        epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)


def evaluate(model, iterator, criterion):
    
    model.eval()
    
    epoch_loss = 0
    
    with torch.no_grad():
    
        for i, batch in enumerate(iterator):

            src = batch.src
            trg = batch.trg #trg = [trg len, batch size]

            #forward
            #turn off teacher forcing
            output = model(src, trg, 0) #output = [trg len, batch size, output dim]

            # loss function
            # Align the target and the predicted output
            output_dim = output.shape[-1]
            output = output[1:].view(-1, output_dim) #output = [(trg len - 1) * batch size, output dim]
            trg = trg[1:].view(-1) #trg = [(trg len - 1) * batch size]

            loss = criterion(output, trg)
            
            # loss computation records
            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)

In [32]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

N_EPOCHS = 10
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 24s
	Train Loss: 5.038 | Train PPL: 154.161
	 Val. Loss: 5.057 |  Val. PPL: 157.163
Epoch: 02 | Time: 0m 24s
	Train Loss: 4.502 | Train PPL:  90.154
	 Val. Loss: 4.804 |  Val. PPL: 122.017
Epoch: 03 | Time: 0m 24s
	Train Loss: 4.172 | Train PPL:  64.823
	 Val. Loss: 4.648 |  Val. PPL: 104.347
Epoch: 04 | Time: 0m 24s
	Train Loss: 3.936 | Train PPL:  51.203
	 Val. Loss: 4.406 |  Val. PPL:  81.964
Epoch: 05 | Time: 0m 24s
	Train Loss: 3.746 | Train PPL:  42.359
	 Val. Loss: 4.222 |  Val. PPL:  68.164
Epoch: 06 | Time: 0m 24s
	Train Loss: 3.597 | Train PPL:  36.495
	 Val. Loss: 4.220 |  Val. PPL:  68.057
Epoch: 07 | Time: 0m 24s
	Train Loss: 3.473 | Train PPL:  32.234
	 Val. Loss: 4.053 |  Val. PPL:  57.585
Epoch: 08 | Time: 0m 24s
	Train Loss: 3.317 | Train PPL:  27.564
	 Val. Loss: 3.972 |  Val. PPL:  53.065
Epoch: 09 | Time: 0m 24s
	Train Loss: 3.195 | Train PPL:  24.408
	 Val. Loss: 3.967 |  Val. PPL:  52.800
Epoch: 10 | Time: 0m 24s
	Train Loss: 3.082 | Train PPL

# Load Model & Evaluate on Test Set

In [33]:
model.load_state_dict(torch.load('tut1-model.pt'))

test_loss = evaluate(model, test_iterator, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')

| Test Loss: 3.815 | Test PPL:  45.389 |
