# Tutorial 1
This is a code-along of the excellent tutorial series by [bentrevett](https://github.com/bentrevett/pytorch-seq2seq). It is mostly for learning and self-assessment.

The first part of this tutorial series is essentially an implementation of the paper [Sequence to sequence learning with neural networks](https://arxiv.org/abs/1409.3215).

## Introduction
The most common sequence-to-sequence (seq2seq) models use an encoder-decoder architecture, in which both the encoder and the decoder are recurrent neural networks. The encoder takes the source sentence as input and encodes it into a single vector called the _context vector_. The decoder then decodes this vector to generate the output sequence one token at a time.

## Preparing data
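We will use PyTorch for the network architecture and torchtext for pre-processing. The `Field` and `BucketIterator` classes used below belong to the older torchtext API (moved to `torchtext.legacy` in torchtext 0.9 and removed in later releases), so an older torchtext version is needed to run the code as written.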

In [1]:
# Library imports
import torch
import torch.nn as nn
import torch.nn.functional as F

from torchtext.datasets import Multi30k
from torchtext.data import Field, BucketIterator

import spacy
import numpy as np

import random
import time
import math

In [2]:
# Seed everything
SEED = 23
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True
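
As a small, framework-free illustration of why we seed: a pseudo-random generator started from the same seed replays the same sequence of numbers. The helper `sample` below is hypothetical and uses only Python's `random` module; the same idea applies to the NumPy and PyTorch generators seeded above.

```python
import random

def sample(seed, n=3):
    """Draw n pseudo-random floats from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]

first_run = sample(23)
second_run = sample(23)
other_seed = sample(24)

print(first_run == second_run)  # same seed, same sequence
print(first_run == other_seed)  # different seed, different sequence
```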

Next, we create tokenizers for English and German. spaCy provides models for different languages, which we can use to access their tokenizers.

Once downloaded (for example with `python -m spacy download en_core_web_sm` and `python -m spacy download de_core_news_sm`), the models can be loaded using:

In [3]:
spacy_en = spacy.load('en_core_web_sm')
spacy_de = spacy.load('de_core_news_sm')

Now we can create the tokenizer functions. Each takes a sentence as a string and returns it as a list of tokens. Following the paper, the tokenizer for the source language (German) also reverses the order of the tokens.

Quoting the paper: *While the LSTM is capable of solving problems with long term dependencies, we discovered that the LSTM learns much better when the source sentences are reversed (the target sentences are NOT reversed). By doing so, the LSTM’s test perplexity dropped from 5.8 to 4.7, and the test BLEU scores of its decoded translations increased from 25.9 to 30.6.*

In [4]:
def tokenize_de(text):
    """
    Tokenizes German text from a string and returns the list of tokens in reversed order.
    """
    return [tok.text for tok in spacy_de.tokenizer(text)][::-1]

def tokenize_en(text):
    """
    Tokenizes English text from a string and returns the list of tokens.
    """
    return [tok.text for tok in spacy_en.tokenizer(text)]
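
To see the effect of the reversal without needing the spaCy models, here is a minimal sketch that uses a plain whitespace split. `tokenize_whitespace` is a hypothetical stand-in, not the tokenizer used in this tutorial, and the sentences are made up for illustration.

```python
def tokenize_whitespace(text, reverse=False):
    """Hypothetical stand-in for the spaCy tokenizers: lowercase and split on whitespace."""
    tokens = text.lower().split()
    # Only the source sentence is reversed; the target keeps its natural order.
    return tokens[::-1] if reverse else tokens

src_tokens = tokenize_whitespace("Zwei junge Männer sind im Freien .", reverse=True)
trg_tokens = tokenize_whitespace("Two young men are outside .")

print(src_tokens)
print(trg_tokens)
```

After reversal, the first source words end up closest to the start of decoding, which shortens the distance between them and the first target words.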

The entire pre-processing pipeline can be easily implemented using torchtext. Check out the constructor arguments [here](https://github.com/pytorch/text/blob/master/torchtext/data/field.py#L61).

In [5]:
source = Field(tokenize=tokenize_de,
               init_token="<sos>",
               eos_token="<eos>",
               lower=True, 
               include_lengths=True, 
               batch_first=True)
target = Field(tokenize=tokenize_en,
               init_token="<sos>", 
               eos_token="<eos>", 
               lower=True, 
               include_lengths=True, 
               batch_first=True)



Now it is time to download the data and create the training, validation and test sets. The dataset we are using is [Multi30k](https://github.com/multi30k/dataset), which contains about 30,000 parallel English, French and German sentences. It is also available through torchtext.

`exts` specifies which languages to use as the source and target (source goes first) and `fields` specifies which field to use for the source and target.

In [6]:
train_data, valid_data, test_data = Multi30k.splits(exts=('.de', '.en'), 
                                                    fields=(source, target))

downloading training.tar.gz


training.tar.gz: 100%|██████████| 1.21M/1.21M [00:02<00:00, 500kB/s]


downloading validation.tar.gz


validation.tar.gz: 100%|██████████| 46.3k/46.3k [00:00<00:00, 205kB/s]


downloading mmt_task1_test2016.tar.gz


mmt_task1_test2016.tar.gz: 100%|██████████| 66.2k/66.2k [00:00<00:00, 22.3MB/s]


In [7]:
# Quick sanity check of the data
print(f"Number of training examples: {len(train_data.examples)}")
print(f"Number of validation examples: {len(valid_data.examples)}")
print(f"Number of test examples: {len(test_data.examples)}")

Number of training examples: 29000
Number of validation examples: 1014
Number of test examples: 1000


In [8]:
for key, words in vars(train_data.examples[0]).items():
    print(key+":", words)

src: ['.', 'büsche', 'vieler', 'nähe', 'der', 'in', 'freien', 'im', 'sind', 'männer', 'weiße', 'junge', 'zwei']
trg: ['two', 'young', ',', 'white', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']


As we can see, the period in the German sentence is at index 0, which confirms that the source sentences are reversed as intended.

Once we have the data, we can create vocabularies for the source and target languages. torchtext provides utilities for that as well; the available options are documented [here](https://torchtext.readthedocs.io/en/latest/data.html#field).

We only keep tokens that appear at least twice in the training data. Any token that appears only once is mapped to the `<unk>` (unknown) token.

We build the vocabulary from the training set only, to avoid leaking information from the validation and test sets.

In [9]:
source.build_vocab(train_data, min_freq=2)
target.build_vocab(train_data, min_freq=2)

In [10]:
# Check out the vocabularies
print(f"Number of unique tokens in german vocabulary: {len(source.vocab)}")
print(f"Number of unique tokens in english vocabulary: {len(target.vocab)}")

Number of unique tokens in german vocabulary: 7854
Number of unique tokens in english vocabulary: 5893
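
The idea behind `min_freq` can be sketched in plain Python. This is not torchtext's implementation: `build_vocab` and `numericalize` are hypothetical helpers, the toy corpus is made up, and the tie-breaking order is an assumption. The special tokens are listed in the order `<unk>`, `<pad>`, `<sos>`, `<eos>`.

```python
from collections import Counter

SPECIALS = ("<unk>", "<pad>", "<sos>", "<eos>")

def build_vocab(token_lists, min_freq=2, specials=SPECIALS):
    """Keep tokens seen at least `min_freq` times; specials come first."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    itos = list(specials)
    # Most frequent first, ties broken alphabetically (an assumption of this sketch).
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq:
            itos.append(tok)
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    """Map tokens to indices; anything outside the vocabulary becomes <unk>."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in tokens]

corpus = [["a", "man", "walks"], ["a", "dog", "walks"], ["a", "cat", "sleeps"]]
itos, stoi = build_vocab(corpus, min_freq=2)

print(itos)                                                # ['<unk>', '<pad>', '<sos>', '<eos>', 'a', 'walks']
print(numericalize(["a", "dog", "sleeps", "walks"], stoi)) # [4, 0, 0, 5]
```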


In [12]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)
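# get_device_name() raises an error when no GPU is available, so guard the call
if torch.cuda.is_available():
    print(torch.cuda.get_device_name())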

cuda
GeForce GTX 1070


Those of you who are familiar with the PyTorch API know that the next step in the pipeline is to create dataloaders that produce batches of data. In torchtext, this is done using iterators.

When we get a batch of examples from an iterator, all of the source sentences must be padded to the same length, and likewise for the target sentences. Luckily, torchtext iterators handle this for us!

We use a `BucketIterator` instead of the standard `Iterator` because it groups examples of similar length into the same batch, which minimizes the amount of padding in both the source and target sentences.

In [13]:
batch_size = 32
train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train_data, valid_data, test_data), batch_size=batch_size, device=device
)
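
To make the padding argument concrete, here is a rough sketch that counts pad tokens for batches formed in arbitrary order versus batches formed after sorting by length. It only illustrates the bucketing idea; `count_padding` is a hypothetical helper and the sentence lengths are made up. `BucketIterator` itself is more involved (it also shuffles batches, for instance).

```python
def count_padding(lengths, batch_size):
    """Total pad tokens needed when every batch is padded to its longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

# Made-up sentence lengths, in the order they happen to appear in the dataset
lengths = [3, 12, 4, 11, 5, 10, 6, 9]

unsorted_pad = count_padding(lengths, batch_size=2)
bucketed_pad = count_padding(sorted(lengths), batch_size=2)

print(unsorted_pad)  # 24
print(bucketed_pad)  # 4
```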



## Building the Seq2seq Model

The model has three components: an encoder, a decoder, and a seq2seq model that encapsulates both.

### Encoder
The encoder is just a recurrent neural network. In our case, we start with a GRU and then try an LSTM. We will also try other RNN variants, such as bidirectional and multi-layer networks, to increase the model's capacity and, in the bidirectional case, to provide context from both previous and subsequent time steps.

We implement the encoder as an `Encoder` class. Its constructor takes the following arguments:
- `input_size`: number of rows in the embedding matrix, i.e. the size of the `source` vocabulary
- `embed_dim`: embedding dimension, i.e. the number of components of each word vector in the embedding space
- `hidden_size`: size of the hidden state (as well as the cell state in the case of LSTMs)
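
For reference, these are the update equations given in the PyTorch documentation for a single `nn.GRU` layer, where $x_t$ is the embedded input at step $t$ (size `embed_dim`), $h_{t-1}$ is the previous hidden state (size `hidden_size`), $\sigma$ is the sigmoid function and $\odot$ is the element-wise product:

$$
\begin{aligned}
r_t &= \sigma(W_{ir} x_t + b_{ir} + W_{hr} h_{t-1} + b_{hr}) \\
z_t &= \sigma(W_{iz} x_t + b_{iz} + W_{hz} h_{t-1} + b_{hz}) \\
n_t &= \tanh\big(W_{in} x_t + b_{in} + r_t \odot (W_{hn} h_{t-1} + b_{hn})\big) \\
h_t &= (1 - z_t) \odot n_t + z_t \odot h_{t-1}
\end{aligned}
$$

Each of the three gates has an input weight matrix, a hidden weight matrix and two bias vectors, which is where `embed_dim` and `hidden_size` enter the parameter count.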


In [14]:
class Encoder(nn.Module):
    def __init__(self, input_size, embed_dim, hidden_size):
        super(Encoder, self).__init__()
        self.hidden_size = hidden_size
        self.embed = nn.Embedding(input_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(p=0.3)
    
    def forward(self, src):
        # src shape: [batch_size, seq_len]
        embedded = self.dropout(self.embed(src))
        # embedded shape: [batch_size, seq_len, embed_dim]
        
        output, hidden = self.rnn(embedded)
        # output shape: [batch_size, seq_len, hidden_size*n_directions]
        # hidden shape: [n_layers*n_directions, batch_size, hidden_size];
        # batch_size stays in dim=1 of the hidden state even with batch_first=True

        # The encoder's per-step outputs are not needed here, so only the hidden state is returned
        return hidden

### Decoder
Like the encoder, our decoder is also an RNN. The key feature of the decoder is that the final hidden state from the encoder acts as the "context vector" and is used as the decoder's hidden state at the first time step. Unlike the encoder, which reads the whole source sentence in one call, the decoder is run one step at a time: we start by feeding in the `<sos>` token and then feed either the next token of the target sentence or the decoder's own prediction from the previous step. This technique is called "teacher forcing": at each step, the next decoder input is taken from the target sentence (ground truth) with some probability and from the model's prediction otherwise. It is only used during training, where it lets the decoder learn from correct inputs while still being exposed to its own predictions.

Later we will also implement an attention mechanism in the decoder.

In [15]:
class Decoder(nn.Module):
    def __init__(self, output_size, embed_dim, hidden_size):
        super(Decoder, self).__init__()
        self.hidden_size = hidden_size
        self.output_size = output_size
        self.embed = nn.Embedding(output_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_size, batch_first=True)
        self.dropout = nn.Dropout(p=0.3)
        self.linear = nn.Linear(hidden_size, output_size)

    def forward(self, input, hidden):
        # input shape: [batch_size]
        input = input.unsqueeze(-1)
        # input shape: [batch_size, 1]

        embedded = self.dropout(self.embed(input))
        # embedded shape: [batch_size, 1, embed_dim]

        output, hidden = self.rnn(embedded, hidden)
        # output shape: [batch_size, seq_len, n_directions*hidden_size]
        # hidden shape: [n_directions*n_layers, batch_size, hidden_size]
        # seq_len is always 1 in the decoder

        output = self.linear(output.squeeze(1))
        # output shape: [batch_size, output_size]

        output = F.log_softmax(output, dim=1)
        # output shape: [batch_size, output_size]

        return output, hidden
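
The decoder returns log-probabilities rather than raw scores. As a framework-free sketch of what `F.log_softmax` computes for a single row of scores, the hypothetical helper below uses the usual max-subtraction trick for numerical stability; the logits are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    # log(sum(exp(x))) computed as m + log(sum(exp(x - m))) to avoid overflow
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

log_probs = log_softmax([2.0, 1.0, 0.1])
probs = [math.exp(lp) for lp in log_probs]

print(log_probs)   # every entry is <= 0
print(sum(probs))  # approximately 1.0
```

Because the outputs are already log-probabilities, a negative log-likelihood loss can be applied to them directly.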

### Seq2seq
Finally, we encapsulate both the encoder and the decoder in a `Seq2Seq` model class. It is our black box: it receives the input and target tensors, generates the context vector with the encoder, and produces the predicted output from the decoder.

In [16]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device=torch.device("cpu")):
        super(Seq2Seq, self).__init__()
        self.encoder = encoder
        self.decoder = decoder
        self.device = device

    def forward(self, input_tensor, target_tensor, 
                teacher_forcing_ratio=0.5):
        # input_tensor shape: [batch_size, seq_len]
        # target_tensor shape: [batch_size, seq_len]
        batch_size = target_tensor.shape[0]
        target_length = target_tensor.shape[-1]

        output_size = self.decoder.output_size 
        # size of the target vocabulary

        outputs = torch.zeros(
            batch_size, target_length, output_size).to(self.device)
        
        # we don't provide any hidden state to the encoder since pytorch
        # by default initializes it to zeros if not provided
        encoder_hidden = self.encoder(input_tensor)

        # first input to the decoder is just <SOS> token
        input = target_tensor[:, 0]

        # decoder hidden at the first time step is just encoder hidden (context vector)
        decoder_hidden = encoder_hidden

        # iterate through the length of the target tensor starting from the second token
        for ti in range(1, target_length):
            # forward pass through decoder
            decoder_output, decoder_hidden = self.decoder(input, decoder_hidden)

            # place decoder output for the given time step into outputs tensor
            outputs[:, ti, :] = decoder_output

            # decide if we want to use teacher forcing 
            teacher_forcing = random.random() < teacher_forcing_ratio

            # get the token with highest probability from the output
            top1 = decoder_output.argmax(1)

            # if teacher forcing, use actual next token as the decoder input
            # else, use the prediction
            input = (
                target_tensor[:, ti]
                if teacher_forcing
                else top1
            )
        
        return outputs
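
The teacher-forcing decision in the loop above can be isolated into a tiny sketch. `choose_next_input` is a hypothetical helper and the tokens are made up; in the `Seq2Seq` class the same coin flip is made once per time step for the whole batch.

```python
import random

def choose_next_input(target_token, predicted_token, teacher_forcing_ratio, rng=random):
    """Return the ground-truth token with probability `teacher_forcing_ratio`,
    otherwise the model's own prediction."""
    return target_token if rng.random() < teacher_forcing_ratio else predicted_token

rng = random.Random(0)
targets = ["two", "young", "men"]
predictions = ["a", "young", "man"]

# ratio = 1.0: always feed the ground truth; ratio = 0.0: always feed the prediction
always_teacher = [choose_next_input(t, p, 1.0, rng) for t, p in zip(targets, predictions)]
never_teacher = [choose_next_input(t, p, 0.0, rng) for t, p in zip(targets, predictions)]

print(always_teacher)  # ['two', 'young', 'men']
print(never_teacher)   # ['a', 'young', 'man']
```

Passing a ratio of 0 therefore makes the decoder rely entirely on its own predictions, which is the setting one would want when no teacher forcing should take place.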

## Training
With all the model components defined, it is time to train the model. First, we need to instantiate it by defining a few parameters such as `input_size`, `output_size`, and the hidden sizes and embedding dimensions for both the encoder and the decoder.

In [27]:
input_size = len(source.vocab)
output_size = len(target.vocab)

enc_embed_dim = 500
enc_hidden_size = 1024

dec_embed_dim = 500
dec_hidden_size = 1024

encoder = Encoder(input_size, enc_embed_dim, enc_hidden_size)
decoder = Decoder(output_size, dec_embed_dim, dec_hidden_size)

model = Seq2Seq(encoder, decoder, device=device).to(device)
model

Seq2Seq(
  (encoder): Encoder(
    (embed): Embedding(7854, 500)
    (rnn): GRU(500, 1024, batch_first=True)
    (dropout): Dropout(p=0.3, inplace=False)
  )
  (decoder): Decoder(
    (embed): Embedding(5893, 500)
    (rnn): GRU(500, 1024, batch_first=True)
    (dropout): Dropout(p=0.3, inplace=False)
    (linear): Linear(in_features=1024, out_features=5893, bias=True)
  )
)

We can also define a function that reports the number of trainable parameters in the model.

In [28]:
def count_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"The model has a total of {count_params(model):,} trainable parameters.")

The model has a total of 22,289,569 trainable parameters.
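
As a cross-check, the same total can be reproduced by hand from the layer shapes printed above. The helper functions below are hypothetical and assume the default PyTorch layout: a single-layer GRU with three gates, each having input weights, hidden weights and two bias vectors, and a linear layer with a bias.

```python
def embedding_params(vocab_size, embed_dim):
    # one embed_dim-sized vector per vocabulary entry
    return vocab_size * embed_dim

def gru_params(input_size, hidden_size):
    # three gates, each with input weights, hidden weights and two bias vectors
    return 3 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

def linear_params(in_features, out_features):
    # weight matrix plus bias vector
    return in_features * out_features + out_features

src_vocab, trg_vocab = 7854, 5893
embed_dim, hidden_size = 500, 1024

encoder_total = embedding_params(src_vocab, embed_dim) + gru_params(embed_dim, hidden_size)
decoder_total = (
    embedding_params(trg_vocab, embed_dim)
    + gru_params(embed_dim, hidden_size)
    + linear_params(hidden_size, trg_vocab)
)
total = encoder_total + decoder_total

print(f"{total:,}")  # 22,289,569
```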


The last pieces needed before we can write the training loop are the loss function and the optimizer. Since the decoder already returns log-probabilities (`log_softmax`), we use `nn.NLLLoss`. We also don't want to calculate the loss on the `<pad>` token, which can be done with the `ignore_index` argument of the loss function.

In [29]:
# get the string representation of first 4 target tokens
target.vocab.itos[0:4]

['<unk>', '<pad>', '<sos>', '<eos>']

In [30]:
pad_idx = target.vocab.stoi[target.pad_token]
unk_idx = target.vocab.stoi[target.unk_token]

criterion = nn.NLLLoss(ignore_index=pad_idx)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
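
To illustrate what `ignore_index` does, here is a plain-Python sketch of a mean negative log-likelihood that skips padded positions. `nll_loss` is a hypothetical helper that mirrors the averaging over non-ignored targets; the probabilities are made up.

```python
import math

def nll_loss(log_probs, targets, ignore_index):
    """Mean negative log-likelihood over positions whose target is not `ignore_index`."""
    kept = [-row[t] for row, t in zip(log_probs, targets) if t != ignore_index]
    return sum(kept) / len(kept)

PAD_IDX = 1  # index of <pad> in the target vocabulary shown above

# Made-up predicted distributions over a 4-token vocabulary, one row per position
probs = [
    [0.10, 0.10, 0.60, 0.20],
    [0.20, 0.70, 0.05, 0.05],
    [0.25, 0.25, 0.10, 0.40],
]
log_probs = [[math.log(p) for p in row] for row in probs]
targets = [2, PAD_IDX, 3]  # the middle position is padding

loss = nll_loss(log_probs, targets, ignore_index=PAD_IDX)
print(loss)  # average of -log(0.6) and -log(0.4); the padded position is skipped
```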

Alright, now we are in a position to start training. We will write two separate functions: one for training and the other for evaluation.

In [31]:
def train(model, iterator, optimizer, criterion, device, clip):
    model.train()
    epoch_loss = 0.0
    
    for batch in iterator:
        input_tensor = batch.src
        target_tensor = batch.trg

        # not using the batch lengths for each 
        # batch of input and output tensors
        input_tensor = input_tensor[0]
        target_tensor = target_tensor[0]

        # zeroing out any stray grads
        optimizer.zero_grad()

        # forward pass
        output = model(input_tensor, target_tensor)
        # target_tensor = [batch_size, seq_len]
        # output = [batch_size, seq_len, output_size]

        output_size = output.size(-1)

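        # ignoring the <sos> position: the tensors are batch-first, so the
        # time axis is dim 1; reshape (not view) because the slice is not contiguous
        output = output[:, 1:].reshape(-1, output_size)
        target_tensor = target_tensor[:, 1:].reshape(-1)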

        loss = criterion(output, target_tensor)

        loss.backward()

        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)

        optimizer.step()

        epoch_loss += loss.item()
    
    return epoch_loss / len(iterator)

In [32]:
def evaluate(model, iterator, criterion):
    model.eval()
    epoch_loss = 0.0
    with torch.no_grad():
        for batch in iterator:
            input_tensor = batch.src
            target_tensor = batch.trg

            input_tensor = input_tensor[0]
            target_tensor = target_tensor[0]
            
            output = model(input_tensor, target_tensor, 0)
            # target_tensor = [batch_size, seq_len]
            # output = [batch_size, seq_len, output_size]

            output_size = output.size(-1)

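            # ignoring the <sos> position: the tensors are batch-first, so the
            # time axis is dim 1; reshape (not view) because the slice is not contiguous
            output = output[:, 1:].reshape(-1, output_size)
            target_tensor = target_tensor[:, 1:].reshape(-1)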

            loss = criterion(output, target_tensor)

            epoch_loss += loss.item()
        
    return epoch_loss / len(iterator)
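
Since the tensors in this notebook are batch-first (`[batch_size, seq_len, ...]`), the `<sos>` position lives on the time axis, not the batch axis. The nested-list sketch below, with made-up token indices, shows the difference between slicing the two axes; for a tensor `t`, the two cases correspond to `t[1:]` and `t[:, 1:]`.

```python
SOS, EOS, PAD = 2, 3, 1

# Hypothetical batch of two target sequences, shape [batch_size=2, seq_len=4]
batch = [
    [SOS, 7, 8, EOS],
    [SOS, 9, EOS, PAD],
]

drop_first_row = batch[1:]                     # slices the batch axis: loses a whole sentence
drop_first_step = [seq[1:] for seq in batch]   # slices the time axis: drops <sos> everywhere

# Flatten to one target per position, as the loss function expects
flat_targets = [tok for seq in drop_first_step for tok in seq]

print(drop_first_row)   # [[2, 9, 3, 1]]
print(drop_first_step)  # [[7, 8, 3], [9, 3, 1]]
print(flat_targets)     # [7, 8, 3, 9, 3, 1]
```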

In [33]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [34]:
N_EPOCHS = 15
CLIP = 1

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_iterator, optimizer, criterion, 
                       device, CLIP)
    valid_loss = evaluate(model, valid_iterator, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'plain-rnn-model.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 1m 31s
	Train Loss: 3.694 | Train PPL:  40.225
	 Val. Loss: 3.693 |  Val. PPL:  40.160
Epoch: 02 | Time: 1m 32s
	Train Loss: 2.913 | Train PPL:  18.419
	 Val. Loss: 3.529 |  Val. PPL:  34.090
Epoch: 03 | Time: 1m 32s
	Train Loss: 2.539 | Train PPL:  12.663
	 Val. Loss: 3.399 |  Val. PPL:  29.938
Epoch: 04 | Time: 1m 32s
	Train Loss: 2.266 | Train PPL:   9.643
	 Val. Loss: 3.475 |  Val. PPL:  32.310
Epoch: 05 | Time: 1m 32s
	Train Loss: 2.073 | Train PPL:   7.946
	 Val. Loss: 3.485 |  Val. PPL:  32.625
Epoch: 06 | Time: 1m 31s
	Train Loss: 1.926 | Train PPL:   6.860
	 Val. Loss: 3.503 |  Val. PPL:  33.217
Epoch: 07 | Time: 1m 31s
	Train Loss: 1.814 | Train PPL:   6.132
	 Val. Loss: 3.639 |  Val. PPL:  38.044
Epoch: 08 | Time: 1m 30s
	Train Loss: 1.754 | Train PPL:   5.776
	 Val. Loss: 3.657 |  Val. PPL:  38.760
Epoch: 09 | Time: 1m 31s
	Train Loss: 1.695 | Train PPL:   5.447
	 Val. Loss: 3.749 |  Val. PPL:  42.475
Epoch: 10 | Time: 1m 31s
	Train Loss: 1.659 | Train PPL