## from: https://github.com/bentrevett/pytorch-seq2seq/

# 1 - Sequence to Sequence Learning with Neural Networks

In this series we'll be building a machine learning model to go from once sequence to another, using PyTorch and torchtext. This will be done on German to English translations, but the models can be applied to any problem that involves going from one sequence to another, such as summarization, i.e. going from a sequence to a shorter sequence in the same language.

In this first notebook, we'll start simple to understand the general concepts by implementing the model from the [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215) paper. 

## Introduction

The most common sequence-to-sequence (seq2seq) models are *encoder-decoder* models, which commonly use a *recurrent neural network* (RNN) to *encode* the source (input) sentence into a single vector. In this notebook, we'll refer to this single vector as a *context vector*. We can think of the context vector as being an abstract representation of the entire input sentence. This vector is then *decoded* by a second RNN which learns to output the target (output) sentence by generating it one word at a time.

![](assets/seq2seq1.png)

The above image shows an example translation. The input/source sentence, \"guten morgen\", is passed through the embedding layer (yellow) and then input into the encoder (green). We also append a *start of sequence* (`<sos>`) and *end of sequence* (`<eos>`) token to the start and end of sentence, respectively. At each time-step, the input to the encoder RNN is both the embedding, $e$, of the current word, $e(x_t)$, as well as the hidden state from the previous time-step, $h_{t-1}$, and the encoder RNN outputs a new hidden state $h_t$. We can think of the hidden state as a vector representation of the sentence so far. The RNN can be represented as a function of both of $e(x_t)$ and $h_{t-1}$:

$$h_t = \text{EncoderRNN}(e(x_t), h_{t-1})$$

We're using the term RNN generally here, it could be any recurrent architecture, such as an *LSTM* (Long Short-Term Memory) or a *GRU* (Gated Recurrent Unit). 

Here, we have $X = \{x_1, x_2, ..., x_T\}$, where $x_1 = \text{<sos>}, x_2 = \text{guten}$, etc. The initial hidden state, $h_0$, is usually either initialized to zeros or a learned parameter.

Once the final word, $x_T$, has been passed into the RNN via the embedding layer, we use the final hidden state, $h_T$, as the context vector, i.e. $h_T = z$. This is a vector representation of the entire source sentence.

Now we have our context vector, $z$, we can start decoding it to get the output/target sentence, \"good morning\". Again, we append start and end of sequence tokens to the target sentence. At each time-step, the input to the decoder RNN (blue) is the embedding, $d$, of current word, $d(y_t)$, as well as the hidden state from the previous time-step, $s_{t-1}$, where the initial decoder hidden state, $s_0$, is the context vector, $s_0 = z = h_T$, i.e. the initial decoder hidden state is the final encoder hidden state. Thus, similar to the encoder, we can represent the decoder as:

$$s_t = \text{DecoderRNN}(d(y_t), s_{t-1})$$

Although the input/source embedding layer, $e$, and the output/target embedding layer, $d$, are both shown in yellow in the diagram they are two different embedding layers with their own parameters.

In the decoder, we need to go from the hidden state to an actual word, therefore at each time-step we use $s_t$ to predict (by passing it through a `Linear` layer, shown in purple) what we think is the next word in the sequence, $\hat{y}_t$. 

$$\hat{y}_t = f(s_t)$$

The words in the decoder are always generated one after another, with one per time-step. We always use `<sos>` for the first input to the decoder, $y_1$, but for subsequent inputs, $y_{t>1}$, we will sometimes use the actual, ground truth next word in the sequence, $y_t$ and sometimes use the word predicted by our decoder, $\hat{y}_{t-1}$. This is called *teacher forcing*, see a bit more info about it [here](https://machinelearningmastery.com/teacher-forcing-for-recurrent-neural-networks/). 

When training/testing our model, we always know how many words are in our target sentence, so we stop generating words once we hit that many. During inference it is common to keep generating words until the model outputs an `<eos>` token or after a certain amount of words have been generated.

Once we have our predicted target sentence, $\hat{Y} = \{ \hat{y}_1, \hat{y}_2, ..., \hat{y}_T \}$, we compare it against our actual target sentence, $Y = \{ y_1, y_2, ..., y_T \}$, to calculate our loss. We then use this loss to update all of the parameters in our model.

## Preparing Data

We'll be coding up the models in PyTorch and using torchtext to help us do all of the pre-processing required. We'll also be using spaCy to assist in the tokenization of the data.


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

# for pre processing
# pip install torchtext==0.14.0 --no-deps
# https://stackoverflow.com/questions/76053795/install-torchtext-with-pytorch-1-13-1-with-cuda-11-7
from torchtext.data import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import Multi30k

# for tokenization
import spacy

import numpy as np
import random, math, time

2023-09-25 14:29:26.191477: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-09-25 14:29:27.114058: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
2023-09-25 14:29:27.114207: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64


In [2]:
SEED = 1234

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

In [3]:
# python -m spacy download en_core_web_sm
def load_model(model_name):
    try:
        nlp = spacy.load(model_name)
    except OSError:
        print(f"Model '{model_name}' not found. Installing...")
        spacy.cli.download(model_name, "--quiet")
        print(f"Model '{model_name}' installed successfully.")
        nlp = spacy.load(model_name)
    return nlp

# Load English/Dutch tokenizer, tagger, parser and NER
spacy_de = load_model('de_core_news_sm')
spacy_en = load_model('en_core_web_sm')


In [4]:
# create tokenizer function
# input: sentece string
# output: list of tokens

def custom_tokenize_de(text):
    """
    Tokenized German text from a string into a list of strings (tokens) and reverses it
    The authors of the Seq2Seq paper found that introducing these new short-term 
    dependencies by reversing the input sequence can be beneficial for the 
    optimization problem. It helps the model to capture different patterns and 
    dependencies in the data, which can improve the learning process.
    """
    
    return [tok.text.lower() for tok in spacy_de.tokenizer(text)][::-1]

def custom_tokenize_en(text):
    """
    Tokenize English text from string into a list of strings (tokens)
    """
    return [tok.text.lower() for tok in spacy_en.tokenizer(text)]


Tokenizer function has been changed. The tokenizer function handle how data should be processed.
* depricated version was Field class which was doing several functionalities.
* for the sake of simplicity the actions are divided into multiple new Classes.
* that is why, now, we have to implement each functionality independently

We set the argument in get_tokenizer to the correct tokenization function for each. The tokenizer will not append the "start of sequence" and "end of sequence" tokens anymore. We also need to manually converts all words to lowercase.

In [5]:
SRC_tokenizer = get_tokenizer(custom_tokenize_de)
TRG_tokenizer = get_tokenizer(custom_tokenize_en)

Special symbols are symbols with unique meaning. Since the tokenizer will not append the "start of sequence" and "end of sequence". We will need to manually add them in building vocab. Yield_tokens is the help function to split string into words (tokens).

In [6]:
# special symbols added to the list of tokens when data is loaded
special_symbols = ['<unk>', '<pad>', '<sos>', '<eos>']

def yield_tokens(dataset, tokenizer, is_source):
    """
    A generator function that yields tokenized sequences from a dataset.
    the dataset can be a training, validation or test datasets where it has
    input and target

    Args:
        dataset (list): The dataset containing input sequences.
        tokenizer (function): A function that tokenizes a given sequence.
        is_source (bool): A flag indicating whether the we to tokenize the
        input or target from the given dataset.

    Yields:
        list: A tokenized sequence.

    """
    for i in dataset:
        if is_source:
            yield tokenizer(i[0])
        else:
            yield tokenizer(i[1])

Unrelated practice about vocab and how it works. you can ignore it

In [7]:
# url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
# train_urls = ('train.en.gz', 'train.de.gz')
# val_urls = ('val.en.gz', 'val.de.gz')
# test_urls = ('test_2016_flickr.en.gz', 'test_2016_flickr.de.gz')
# from torchtext.utils import download_from_url, extract_archive

# train_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in train_urls]
# val_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in val_urls]
# test_filepaths = [extract_archive(download_from_url(url_base + url))[0] for url in test_urls]

In [8]:
# import io

# # Define a custom data loading function that reads the data with the correct encoding
# def load_custom_multi30k(split, language_pair, root):
#     file_path = f'{root}/{split}.{language_pair[0]}'

#     # Read the data with the detected encoding
#     with codecs.open(file_path, 'r', encoding='utf8') as file:
#         data = [line.strip() for line in file]

#     return data

# # Load the Multi30k dataset using the custom data loading function
# # valid_data = load_custom_multi30k(split='val', language_pair=('de', 'en'), root='.data')
# test_data = load_custom_multi30k(split='test_2016_flickr', language_pair=('de', 'en'), root='.data')


In [9]:
# test_data[0]

In [10]:
# import torchtext
# PAD = "<pad>"
# UNK = "<unk>"
# EOS = "<eos>"
# BOS = "<bos>"
# # vocab = torchtext.vocab.vocab(
# #     ordered_dict={"today": 1, "is": 1, "hot": 2},
# #     min_freq=1,
# #     specials=[PAD, UNK, EOS, BOS],
# #     special_first=True
# # )
# # print(vocab.get_stoi())


# # Example tokenized text data
# text_data = [["I", "love", "Python"], ["Python", "is", "fun"]]

# # Build a vocabulary
# vocab = build_vocab_from_iterator(text_data, min_freq=1,
#     specials=[PAD, UNK, EOS, BOS],
#     special_first=True)

# # Access vocabulary information
# print(vocab.get_stoi())  # Token to ID mapping
# print(vocab.get_itos())  # ID to token mapping

Next, we’ll build the vocabulary for the source and target languages. 
 * The vocabulary is used to associate each unique token with an index (an integer). 
 * The vocabularies of the source and target languages are distinct. 
 * Using the `min_freq` argument, we only allow tokens that appear at least 2 times to appear in our vocabulary. 
 * Tokens that appear only once are converted into an `<unk>` (unknown) token. 
 * `specials` are symbols having unique meanings such as unknown token, beginning of sentence token, etc... 
 * `special_first` appends special symbols at the beginning
 * `build_vocab_from_iterator` iterates through the tokenized text data and counts the frequency of each unique token in the provided iterable.
 *  It assigns a unique ID to each token based on its frequency, with more frequent tokens typically getting lower IDs.
 
*It is important to note that our vocabulary should only be built from the training set and not the validation/test set. 
This prevents “information leakage” into our model, giving us artifically inflated validation/test scores.*

In [11]:
url_base = 'https://raw.githubusercontent.com/multi30k/dataset/master/data/task1/raw/'
train_urls = ('train.de.gz', 'train.en.gz')
val_urls = ('val.de.gz', 'val.en.gz')
test_urls = ('test_2016_flickr.de.gz', 'test_2016_flickr.en.gz')
root = 'data/datasets/Multi30k'

from torchtext.utils import download_from_url, extract_archive

train_filepaths = [extract_archive(download_from_url(url_base + url, root=root))[0] for url in train_urls]
val_filepaths = [extract_archive(download_from_url(url_base + url, root=root))[0] for url in val_urls]
test_filepaths = [extract_archive(download_from_url(url_base + url, root=root))[0] for url in test_urls]

In [12]:
train_filepaths

['/jupyter/notebooks/udacity/RNN/Seq2Seq/data/datasets/Multi30k/train.de',
 '/jupyter/notebooks/udacity/RNN/Seq2Seq/data/datasets/Multi30k/train.en']

In [13]:
from torchdata.datapipes.iter import FileOpener, IterableWrapper
def read_data_as_iter(paths: list):
    assert len(paths) == 2, "path must contain only 2 elements: src and tgt language path respectively"
    url_dp = IterableWrapper([paths[0]])
    src_data_dp = FileOpener(url_dp, encoding="utf-8").readlines(
        return_path=False, strip_newline=True
    )
    url_dp = IterableWrapper([paths[1]])
    tgt_data_dp = FileOpener(url_dp, encoding="utf-8").readlines(
        return_path=False, strip_newline=True
    )

    return src_data_dp.zip(tgt_data_dp).shuffle().set_shuffle(False).sharding_filter()

* The advantage of the following function to read data is that we get a an object that the dataloader can use function len() to get the number of batches in the dataset
* however, # zip returns an iterator object. The first time you convert it to a list, the iterator is consumed so we have to re-create it every time we make use of it.
* the above function on the other hand returns an object that is created from a data pipeline, such as IterableDataset or IterableDatasetWithCollate. 
* These types of datasets do not have a predefined length because they generate data on-the-fly or from an infinite source so we have to count each batch during the training phase to know the number of batches

In [14]:
# import io
# def read_data_as_iter(paths: list):
#     assert len(paths) == 2, "path must contain only 2 elements: src and tgt language path respectively"
    
#     # Read source language data
#     src_data_dp = io.open(paths[0], encoding="utf8")
#     src_data_iter = iter(src_data_dp)
    
#     # Read target language data
#     tgt_data_dp = io.open(paths[1], encoding="utf8")
#     tgt_data_iter = iter(tgt_data_dp)
    
#     # Zip the source and target language data
#     parallel_corpus = zip(src_data_iter, tgt_data_iter)
    
#     # Return the parallel corpus
#     return parallel_corpus

In [15]:
# # Define the file path you want to open
# file_path = './data/datasets/Multi30k/train.de'
# url_dp = IterableWrapper([file_path])

# file_opener = FileOpener(url_dp, encoding="utf-8")

# # Read the content of the file using readlines
# lines = file_opener.readlines(return_path=False, strip_newline=True)

# # Print the lines or work with the file content as needed
# for line in lines:
#     print(line)

In [16]:
# train_set = process_data_set(['./data/datasets/Multi30k/train.de', './data/datasets/Multi30k/train.en'])
train_set = read_data_as_iter(train_filepaths)
train_set

ShardingFilterIterDataPipe

In [17]:
list(train_set)[0]

('Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.',
 'Two young, White males are outside near many bushes.')

In [18]:
train_set = read_data_as_iter(train_filepaths)
SRC_vocab_transform = build_vocab_from_iterator(yield_tokens(train_set, SRC_tokenizer, True), min_freq=2, specials=special_symbols, special_first=True)

# zip returns an iterator object. The first time you convert it to a list, the iterator is consumed
# so we have to re-create it
# train_set = read_data_as_iter(train_filepaths)
TRG_vocab_transform = build_vocab_from_iterator(yield_tokens(train_set, TRG_tokenizer, False), min_freq=2, specials=special_symbols, special_first=True) 

In [19]:
from torchtext.vocab import build_vocab_from_iterator

# Suppose you have two lists of English and German sentences
english_sentences = ["coding I I I love coding coding", "This is a test test test test sentence"]
german_sentences = ["Ich liebe das Programmieren", "Dies ist ein Test-Satz"]

# Define iterators for English and German sentences
english_iterator = (sentence.split() for sentence in english_sentences)
german_iterator = (sentence.split() for sentence in german_sentences)

# Create vocabularies for both languages
english_vocab = build_vocab_from_iterator(english_iterator, min_freq=1)
german_vocab = build_vocab_from_iterator(german_iterator, min_freq=1)

# You can access the vocabularies like this:
english_word_to_index = english_vocab.get_stoi()
german_word_to_index = german_vocab.get_stoi()

# Now, you have separate vocabularies for English and German sentences.
# You can access the word-to-index mappings for each language using the respective vocabularies.
for key, value in sorted(english_word_to_index.items(), key=lambda x: x[1]): 
    print("{} : {}".format(key, value))
# pprint(german_word_to_index)

test : 0
I : 1
coding : 2
This : 3
a : 4
is : 5
love : 6
sentence : 7


In [20]:
# txt = "ABC DEF"
# txt = [iter(TRG_tokenizer(txt))]

# english_iterator = build_vocab_from_iterator(txt, min_freq=1)

# list(english_iterator.get_stoi().items())[:10]


In [21]:
# from torchtext.datasets import multi30k

# # multi30k.URL["train"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/training.tar.gz"
# # multi30k.URL["valid"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/validation.tar.gz"
# # multi30k.URL["test"] = "https://raw.githubusercontent.com/neychev/small_DL_repo/master/datasets/Multi30k/mmt16_task1_test.tar.gz"

# # # multi30k.URL["train"] = "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/training.tar.gz"
# # # multi30k.URL["valid"] = "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/validation.tar.gz"
# # # multi30k.URL["test"] = "http://www.quest.dcs.shef.ac.uk/wmt16_files_mmt/mmt16_task1_test.tar.gz"

# train_set = Multi30k(split='train', language_pair=('de', 'en'), root='data')
# SRC_vocab_transform = build_vocab_from_iterator(yield_tokens(train_set, SRC_tokenizer, True), min_freq=2, specials=special_symbols, special_first=True)
# # # reset train_set
# # train_set = Multi30k(split='train', language_pair=('de', 'en'))
# TRG_vocab_transform = build_vocab_from_iterator(yield_tokens(train_set, TRG_tokenizer, False), min_freq=2, specials=special_symbols, special_first=True) 

Default_index is returned when the token is not in vocabulary. 
 * Set `<unk>` token index to default index to convert words that appear once into `<unk>` token


In [22]:
SRC_vocab_transform.set_default_index(SRC_vocab_transform['<unk>'])
TRG_vocab_transform.set_default_index(TRG_vocab_transform['<unk>'])

In [23]:
list(TRG_vocab_transform.get_stoi().items())[:10]
# sorted(TRG_vocab_transform.get_stoi().items(), key=lambda x: x[1], reverse = True)

[('zune', 5892),
 ('zigzag', 5890),
 ('ymca', 5889),
 ('wrestles', 5885),
 ('wrench', 5884),
 ('wounds', 5882),
 ('worried', 5880),
 ('worn', 5879),
 ('worked', 5878),
 ('wok', 5876)]

In [24]:
print(f"Unique tokens in source (de) vocabulary: {len(SRC_vocab_transform)}")
print(f"Unique tokens in target (en) vocabulary: {len(TRG_vocab_transform)}")

Unique tokens in source (de) vocabulary: 7853
Unique tokens in target (en) vocabulary: 5893


Next, we download and load the train, validation and test data.

The dataset we'll be using is the Multi30k dataset. This is a dataset with ~30,000 parallel English, German and French sentences, each with ~12 words per sentence.

`language_pair` specifies which languages to use as the source and target (source goes first)



In [25]:
train_data = train_set
valid_data = read_data_as_iter(val_filepaths)
test_data = read_data_as_iter(test_filepaths)

In [26]:
# import codecs  # Import the 'codecs' module

# # Define a custom data loading function that reads the data with the correct encoding
# def load_custom_multi30k(split, language_pair, root):
#     file_path = f'{root}/datasets/Multi30k/{split}.{language_pair[0]}'

#     # Read the data with the detected encoding
#     with codecs.open(file_path, 'r', encoding='utf-8') as file:
#         data = [line.strip() for line in file]

#     return data

# # Load the Multi30k dataset using the custom data loading function
# # valid_data = load_custom_multi30k(split='val', language_pair=('de', 'en'), root='.data')
# test_data = load_custom_multi30k(split='test_2016_flickr', language_pair=('de', 'en'), root='data')


In [27]:
print(f"Number of training examples: {len(list(train_data))}")
print(f"Number of validation examples: {len(list(valid_data))}")
print(f"Number of testing examples: {len(list(test_data))}")

Number of training examples: 29000
Number of validation examples: 1014
Number of testing examples: 1000


We can also print out an example. It is one of the original sentence pairs from the dataset.

In [28]:
print(list(train_data)[0][0])
print(list(train_data)[0][1])

Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche.
Two young, White males are outside near many bushes.


#### below is some practices to deepen my understanding. you can ignore it.

In [29]:
# one_example = []
# for i in train_data:
#     one_example = TRG_tokenizer(i[1])
#     print(one_example)
#     print(len(one_example))
#     break

In [30]:
# print(TRG_vocab_transform(one_example))
# print(len(TRG_vocab_transform(one_example)))

In [31]:
# from collections import Counter
# import io
# from torchtext.vocab import vocab


# de_tokenizer = get_tokenizer('spacy', language='de')
# en_tokenizer = get_tokenizer('spacy', language='en')

# def build_vocab(filepath, tokenizer):
#   counter = Counter()
#   with io.open(filepath, encoding="utf8") as f:
#     for string_ in f:
#       counter.update(tokenizer(string_))
#   return vocab(counter, specials=['<unk>', '<pad>', '<bos>', '<eos>'])

# de_vocab = build_vocab(train_filepaths[0], de_tokenizer)
# en_vocab = build_vocab(train_filepaths[1], en_tokenizer)
# # en_vocab = build_vocab('./test.txt', en_tokenizer)

# def data_process(filepaths):
#   raw_de_iter = iter(io.open(filepaths[0], encoding="utf8"))
#   raw_en_iter = iter(io.open(filepaths[1], encoding="utf8"))
#   data = []
#   for (raw_de, raw_en) in zip(raw_de_iter, raw_en_iter):
#     de_tensor_ = torch.tensor([de_vocab[token] for token in de_tokenizer(raw_de)],
#                             dtype=torch.long)
#     en_tensor_ = torch.tensor([en_vocab[token] for token in en_tokenizer(raw_en)],
#                             dtype=torch.long)
#     data.append((de_tensor_, en_tensor_))
#   return data


# de_vocab.set_default_index(de_vocab['<unk>'])
# en_vocab.set_default_index(en_vocab['<unk>'])

# # train_data = data_process(train_filepaths)

In [32]:
# myDict = TRG_vocab_transform.get_stoi()
# keys = list(myDict.keys())
# values = list(myDict.values())
# sorted_value_index = np.argsort(values)
# sorted_dict = {keys[i]: values[i] for i in sorted_value_index}
 
# print({k: sorted_dict[k] for k in list(sorted_dict)[:20]})

In [33]:
# print(en_vocab(['two', 'young', ',', 'a', 'males', 'are', 'outside', 'near', 'many', 'bushes', '.']
# ))

In [34]:
# train_filepaths

In [35]:
# myDict = en_vocab.get_stoi()
# keys = list(myDict.keys())
# values = list(myDict.values())
# sorted_value_index = np.argsort(values)
# sorted_dict = {keys[i]: values[i] for i in sorted_value_index}
 
# print({k: sorted_dict[k] for k in list(sorted_dict)[:20]})

In [36]:
# print(f"Unique tokens in source (de) vocabulary: {len(de_vocab)}, {len(SRC_vocab_transform)}")
# print(f"Unique tokens in target (en) vocabulary: {len(en_vocab)}, {len(TRG_vocab_transform)}")
# difference is due to captial and small letters in the word and infrequent words.

In [37]:
# keys_dict1 = set(en_vocab.get_stoi().keys())
# keys_dict2 = set(TRG_vocab_transform.get_stoi().keys())

# keys_only_in_dict1 = keys_dict1 - keys_dict2
# keys_only_in_dict2 = keys_dict2 - keys_dict1

# print("Keys only in dict1:", list(keys_only_in_dict1)[:20])
# print("Keys only in dict2:", list(keys_only_in_dict2)[:20])


In [38]:
# c = Counter(["a", "a", "b", "b", "b", "c", "c"])
# c.update((["a", "a", "b", "b", "d", "c", "c"]))
# print(c)
# vocab(c).get_stoi()
# # I have learnt that vocab won't sort the dictionary by the counter to give the index to each word.
# # we have to sort by value then pass it to vocab

The final step of preparing the data is to create the iterators. These can be iterated on to return a batch of data which will have a `src` attribute (the PyTorch tensors containing a batch of numericalized source sentences) and a `trg` attribute (the PyTorch tensors containing a batch of numericalized target sentences). Numericalized is just a fancy way of saying they have been converted from a sequence of readable tokens to a sequence of corresponding indexes, using the vocabulary.

We also need to define a `torch.device`. This is used to tell torchText to put the tensors on the GPU or not. We use the `torch.cuda.is_available()` function, which will return `True` if a GPU is detected on our computer. We pass this `device` to the iterator.

When we get a batch of examples using an iterator we need to make sure that all of the source sentences are padded to the same length, the same with the target sentences. New version of torchtext is not handling this for us. This will be done via `pad_sequence` function in `torch.nn.utils.rnn` module.

`DataLoader` in `torch.utils.data` has replaced BucketIterator. Working in conjunction with `collate_fn`, whenever dataloader has created a batch, it will call `collate_fn(batch)`. By implementing `pad_sequence` in `collate_fn`, we will minimize the amount of padding in both the source and target sentences as each batch will have different paddings according to their max sentence length. In addition, we will also reverse sentence and transform sentence into indexes with `<sos>` and `<eos>` token in `collate_fn`.

_you can modify the DataLoader to read a custom number of words as each sample, regardless of the number of lines in the original text. To achieve this, you would need to customize the collate_fn function._

In [39]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [40]:
import logging

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s', handlers=[logging.StreamHandler()])


from torch.utils.data import DataLoader
from torch.nn.utils.rnn import pad_sequence
#function to add '<sos>' and '<eos>'
def add_symbols(sentence, transform):
    sos = torch.tensor([transform['<sos>']])
    eos = torch.tensor([transform['<eos>']])
    return torch.cat((sos,sentence,eos))
def generate_batch(batch):
    src_batch = []
    trg_batch = []
    src_len = []
    i = 0
    # print(type(batch))
    for src, trg in batch:
        i += 1
        #split sentence into tokens
        src_tensor = SRC_tokenizer(src.rstrip("\n"))
        # logging.info(f'iteration {i}:\n {src}'); # why prints 3 times while batch is 1?
        trg_tensor = SRC_tokenizer(trg.rstrip("\n"))
        #convert tokens to index and to tensor and add <sos> and <eos> to each sentence
        src_tensor = add_symbols(torch.tensor(SRC_vocab_transform(src_tensor)), SRC_vocab_transform)
        trg_tensor = add_symbols(torch.tensor(TRG_vocab_transform(trg_tensor)), TRG_vocab_transform)
        src_batch.append(src_tensor)
        #track length of each source sentence, not useful in this model. Will be useful in further models
        src_len.append(len(src_tensor))
        trg_batch.append(trg_tensor)
        # logging.info(f'iteration {i}:\n {(src_tensor)}');
    src_len = torch.tensor(src_len, dtype = torch.int)
    src_batch = pad_sequence(src_batch, padding_value=SRC_vocab_transform['<pad>'])
    trg_batch = pad_sequence(trg_batch, padding_value=SRC_vocab_transform['<pad>'])
    src_len, idx = torch.sort(src_len,descending=True)
    #src_len is not useful in this model
    # logging.error(f'{len(src_batch)}')
    return src_batch, src_len, trg_batch

In [41]:
type(train_data)

torch.utils.data.datapipes.iter.grouping.ShardingFilterIterDataPipe

In [42]:
BATCH_SIZE = 128

train_dataloader = DataLoader(train_data, batch_size=10, collate_fn=generate_batch)
# valid_dataloader = DataLoader(valid_data, batch_size=BATCH_SIZE, collate_fn=generate_batch)

In [43]:
sr, srlen, tg = (next(iter(train_dataloader)))
print(len(srlen))
print(len(list(train_dataloader)))

10
2900


In [44]:
(sr)

tensor([[   2,    2,    2,    2,    2,    2,    2,    2,    2,    2],
        [   4,    4,    4,    4,    4,    4,    4,    4,    4,    4],
        [3171,    0,  499,  248,   28, 1676,   21,    0,  117, 1191],
        [7647,    5,   56,    5,  111,   40, 4838,   34,  563,   12],
        [ 110, 2068, 7314,  681,  400,  132, 4319,   17,    6,  521],
        [  15,  831,    5,   10,   10,   13,   19, 3890,   21,   15],
        [   7,   11,    7,  535, 1338,  101,  169,   99,   60,    7],
        [  88,   30,  217,   14,   55,   15,   13,   35, 1389,  326],
        [  20,   76,   25,   12,   52,   35,    5,  277,   78,  262],
        [  84,    3,   66,   29,   30,    9,    3,   24,   14,   80],
        [  30,    1,    5,   40,   18,  129,    1,   11,   11,    3],
        [ 253,    1,    3,   46,    3,    8,    1,  183,   16,    1],
        [  26,    1,    1,    6,    1,   37,    1,   25,    8,    1],
        [  18,    1,    1,    7,    1,  518,    1, 4037,    3,    1],
        [   3,    1,

In [45]:
(sr[10][0])

tensor(30)

In [46]:
BATCH_SIZE = 128

train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, collate_fn=generate_batch)
valid_dataloader = DataLoader(valid_data, batch_size=BATCH_SIZE, collate_fn=generate_batch)

## Building the Seq2Seq Model

We'll be building our model in three parts. The encoder, the decoder and a seq2seq model that encapsulates the encoder and decoder and will provide a way to interface with each.



### Encoder


First, the encoder, a 2 layer LSTM. The paper we are implementing uses a 4-layer LSTM, but in the interest of training time we cut this down to 2-layers. The concept of multi-layer RNNs is easy to expand from 2 to 4 layers. 

For a multi-layer RNN, the input sentence, $X$, after being embedded goes into the first (bottom) layer of the RNN and hidden states, $H=\{h_1, h_2, ..., h_T\}$, output by this layer are used as inputs to the RNN in the layer above. Thus, representing each layer with a superscript, the hidden states in the first layer are given by:

$$h_t^1 = \text{EncoderRNN}^1(e(x_t), h_{t-1}^1)$$

The hidden states in the second layer are given by:

$$h_t^2 = \text{EncoderRNN}^2(h_t^1, h_{t-1}^2)$$

Using a multi-layer RNN also means we'll also need an initial hidden state as input per layer, $h_0^l$, and we will also output a context vector per layer, $z^l$.

Without going into too much detail about LSTMs (see [this](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) blog post to learn more about them), all we need to know is that they're a type of RNN which instead of just taking in a hidden state and returning a new hidden state per time-step, also take in and return a *cell state*, $c_t$, per time-step.

$$h_t = \text{RNN}(e(x_t), h_{t-1})$$

$$(h_t, c_t) = \text{LSTM}(e(x_t), h_{t-1}, c_{t-1})$$

We can just think of $c_t$ as another type of hidden state. Similar to $h_0^l$, $c_0^l$ will be initialized to a tensor of all zeros. Also, our context vector will now be both the final hidden state and the final cell state, i.e. $z^l = (h_T^l, c_T^l)$.

Extending our multi-layer equations to LSTMs, we get:

$$(h_t^1, c_t^1) = \text{EncoderLSTM}^1(e(x_t), (h_{t-1}^1, c_{t-1}^1))$$

$$(h_t^2, c_t^2) = \text{EncoderLSTM}^2(h_t^1, (h_{t-1}^2, c_{t-1}^2))$$


Note how only our hidden state from the first layer is passed as input to the second layer, and not the cell state.

So our encoder looks something like this: 

![](assets/seq2seq2.png)

We create this in code by making an `Encoder` module, which requires we inherit from `torch.nn.Module` and use the `super().__init__()` as some boilerplate code. The encoder takes the following arguments:
- `input_dim` is the size/dimensionality of the one-hot vectors that will be input to the encoder. This is equal to the input (source) vocabulary size.
- `emb_dim` is the dimensionality of the embedding layer. This layer converts the one-hot vectors into dense vectors with `emb_dim` dimensions. 
- `hid_dim` is the dimensionality of the hidden and cell states.
- `n_layers` is the number of layers in the RNN.
- `dropout` is the amount of dropout to use. This is a regularization parameter to prevent overfitting. Check out [this](https://www.coursera.org/lecture/deep-neural-network/understanding-dropout-YaGbR) for more details about dropout.

We aren't going to discuss the embedding layer in detail during these tutorials. All we need to know is that there is a step before the words - technically, the indexes of the words - are passed into the RNN, where the words are transformed into vectors. To read more about word embeddings, check these articles: [1](https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/), [2](http://p.migdal.pl/2017/01/06/king-man-woman-queen-why.html), [3](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/), [4](http://mccormickml.com/2017/01/11/word2vec-tutorial-part-2-negative-sampling/). 

The embedding layer is created using `nn.Embedding`, the LSTM with `nn.LSTM` and a dropout layer with `nn.Dropout`. Check the PyTorch [documentation](https://pytorch.org/docs/stable/nn.html) for more about these.

One thing to note is that the `dropout` argument to the LSTM is how much dropout to apply between the layers of a multi-layer RNN, i.e. between the hidden states output from layer $l$ and those same hidden states being used for the input of layer $l+1$.

In the `forward` method, we pass in the source sentence, $X$, which is converted into dense vectors using the `embedding` layer, and then dropout is applied. These embeddings are then passed into the RNN. As we pass a whole sequence to the RNN, it will automatically do the recurrent calculation of the hidden states over the whole sequence for us! Notice that we do not pass an initial hidden or cell state to the RNN. This is because, as noted in the [documentation](https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM), that if no hidden/cell state is passed to the RNN, it will automatically create an initial hidden/cell state as a tensor of all zeros. 

The RNN returns: `outputs` (the top-layer hidden state for each time-step), `hidden` (the final hidden state for each layer, $h_T$, stacked on top of each other) and `cell` (the final cell state for each layer, $c_T$, stacked on top of each other).

As we only need the final hidden and cell states (to make our context vector), `forward` only returns `hidden` and `cell`. 

The sizes of each of the tensors is left as comments in the code. In this implementation `n_directions` will always be 1, however note that bidirectional RNNs (covered in tutorial 3) will have `n_directions` as 2.

In [47]:
class Encoder(nn.Module):
    def __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()

        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(input_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout = dropout)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, src):
        # src = [src len, batch size]
        
        embedded = self.dropout(self.embedding(src))
        
        # embedded = [src len, batch size, emb dim]
        
        outputs, (hidden, cell) = self.rnn(embedded)
        
        # outputs = [src len, batch size, hid dim * n directions]
        # hidden = [n layers * n directions, batch size, hid dim]
        # cell = [n layers * n directions, batch size, hid dim]
        
        # outputs are always from the top hidden layer
        
        return hidden, cell

### Decoder


Next, we'll build our decoder, which will also be a 2-layer (4 in the paper) LSTM.

![](assets/seq2seq3.png)

The `Decoder` class does a single step of decoding, i.e. it ouputs single token per time-step. The first layer will receive a hidden and cell state from the previous time-step, $(s_{t-1}^1, c_{t-1}^1)$, and feeds it through the LSTM with the current embedded token, $y_t$, to produce a new hidden and cell state, $(s_t^1, c_t^1)$. The subsequent layers will use the hidden state from the layer below, $s_t^{l-1}$, and the previous hidden and cell states from their layer, $(s_{t-1}^l, c_{t-1}^l)$. This provides equations very similar to those in the encoder.

$$(s_t^1, c_t^1) = \text{DecoderLSTM}^1(d(y_t), (s_{t-1}^1, c_{t-1}^1))$$
$$(s_t^2, c_t^2) = \text{DecoderLSTM}^2(s_t^1, (s_{t-1}^2, c_{t-1}^2))$$

Remember that the initial hidden and cell states to our decoder are our context vectors, which are the final hidden and cell states of our encoder from the same layer, i.e. $(s_0^l,c_0^l)=z^l=(h_T^l,c_T^l)$.

We then pass the hidden state from the top layer of the RNN, $s_t^L$, through a linear layer, $f$, to make a prediction of what the next token in the target (output) sequence should be, $\hat{y}_{t+1}$. 

$$\hat{y}_{t+1} = f(s_t^L)$$

The arguments and initialization are similar to the `Encoder` class, except we now have an `output_dim` which is the size of the vocabulary for the output/target. There is also the addition of the `Linear` layer, used to make the predictions from the top layer hidden state.

Within the `forward` method, we accept a batch of input tokens, previous hidden states and previous cell states. As we are only decoding one token at a time, the input tokens will always have a sequence length of 1. We `unsqueeze` the input tokens to add a sentence length dimension of 1. Then, similar to the encoder, we pass through an embedding layer and apply dropout. This batch of embedded tokens is then passed into the RNN with the previous hidden and cell states. This produces an `output` (hidden state from the top layer of the RNN), a new `hidden` state (one for each layer, stacked on top of each other) and a new `cell` state (also one per layer, stacked on top of each other). We then pass the `output` (after getting rid of the sentence length dimension) through the linear layer to receive our `prediction`. We then return the `prediction`, the new `hidden` state and the new `cell` state.

**Note**: as we always have a sequence length of 1, we could use `nn.LSTMCell`, instead of `nn.LSTM`, as it is designed to handle a batch of inputs that aren't necessarily in a sequence. `nn.LSTMCell` is just a single cell and `nn.LSTM` is a wrapper around potentially multiple cells. Using the `nn.LSTMCell` in this case would mean we don't have to `unsqueeze` to add a fake sequence length dimension, but we would need one `nn.LSTMCell` per layer in the decoder and to ensure each `nn.LSTMCell` receives the correct initial hidden state from the encoder. All of this makes the code less concise - hence the decision to stick with the regular `nn.LSTM`.

In [48]:
class Decoder(nn.Module):
    def __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):
        super().__init__()
        
        self.output_dim = output_dim
        self.hid_dim = hid_dim
        self.n_layers = n_layers
        
        self.embedding = nn.Embedding(output_dim, emb_dim)
        
        self.rnn = nn.LSTM(emb_dim, hid_dim, n_layers, dropout=dropout)
        
        self.fc_out = nn.Linear(hid_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, input, hidden, cell):
        # input = [batch size]
        # hidden = [n layers * n directions, batch size, hid dim]
        # cell = [n layers * n directions, batch size, hid dim]
        
        # n directions in the decoder will both always be 1, therefore:
        # hidden = [n layers, batch size, hid dim]
        # context = [n layers, batch size, hid dim]
        
        input = input.unsqueeze(0)
        
        # input = [1, batch size]
        
        embedded = self.dropout(self.embedding(input))
        
        # embedded = [1, batch size, emb dim]
        
        output, (hidden, cell) = self.rnn(embedded, (hidden, cell))
        
        #output = [seq len, batch size, hid dim * n directions]
        #hidden = [n layers * n directions, batch size, hid dim]
        #cell = [n layers * n directions, batch size, hid dim]
        
        #seq len and n directions will always be 1 in the decoder, therefore:
        #output = [1, batch size, hid dim]
        #hidden = [n layers, batch size, hid dim]
        #cell = [n layers, batch size, hid dim]
        
        prediction = self.fc_out(output.squeeze(0))
        
        # prediction = [batch size, output dim]
        
        return prediction, hidden, cell

### Seq2Seq


For the final part of the implemenetation, we'll implement the seq2seq model. This will handle: 
- receiving the input/source sentence
- using the encoder to produce the context vectors 
- using the decoder to produce the predicted output/target sentence

Our full model will look like this:

![](assets/seq2seq4.png)

The `Seq2Seq` model takes in an `Encoder`, `Decoder`, and a `device` (used to place tensors on the GPU, if it exists).

For this implementation, we have to ensure that the number of layers and the hidden (and cell) dimensions are equal in the `Encoder` and `Decoder`. This is not always the case, we do not necessarily need the same number of layers or the same hidden dimension sizes in a sequence-to-sequence model. However, if we did something like having a different number of layers then we would need to make decisions about how this is handled. For example, if our encoder has 2 layers and our decoder only has 1, how is this handled? Do we average the two context vectors output by the decoder? Do we pass both through a linear layer? Do we only use the context vector from the highest layer? Etc.

Our `forward` method takes the source sentence, target sentence and a teacher-forcing ratio. The teacher forcing ratio is used when training our model. When decoding, at each time-step we will predict what the next token in the target sequence will be from the previous tokens decoded, $\hat{y}_{t+1}=f(s_t^L)$. With probability equal to the teaching forcing ratio (`teacher_forcing_ratio`) we will use the actual ground-truth next token in the sequence as the input to the decoder during the next time-step. However, with probability `1 - teacher_forcing_ratio`, we will use the token that the model predicted as the next input to the model, even if it doesn't match the actual next token in the sequence.  

The first thing we do in the `forward` method is to create an `outputs` tensor that will store all of our predictions, $\hat{Y}$.

We then feed the input/source sentence, `src`, into the encoder and receive out final hidden and cell states.

The first input to the decoder is the start of sequence (`<sos>`) token. As our `trg` tensor already has the `<sos>` token appended (all the way back when we defined the `init_token` in our `TRG` field) we get our $y_1$ by slicing into it. We know how long our target sentences should be (`max_len`), so we loop that many times. The last token input into the decoder is the one **before** the `<eos>` token - the `<eos>` token is never input into the decoder. 

During each iteration of the loop, we:
- pass the input, previous hidden and previous cell states ($y_t, s_{t-1}, c_{t-1}$) into the decoder
- receive a prediction, next hidden state and next cell state ($\hat{y}_{t+1}, s_{t}, c_{t}$) from the decoder
- place our prediction, $\hat{y}_{t+1}$ `output` in our tensor of predictions, $\hat{Y}$`outputs`
- decide if we are going to "teacher force" or not
 - if we do, the next `input` is the ground-truth next token in the sequence, $y_{t+1}$/`trg[t]`
 - if we don't, the next `input` is the predicted next token in the sequence, $\hat{y}_{t+1}$/`top1`, which we get by doing an `argmax` over the output tensor
    
Once we've made all of our predictions, we return our tensor full of predictions, $\hat{Y}$/`outputs`.

**Note**: our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:

$$\text{trg} = [<sos>, y_1, y_2, y_3, <eos>]$$
$$\text{outputs} = [0, \hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]$$

Later on when we calculate the loss, we cut off the first element of each tensor to get:

$$\text{trg} = [y_1, y_2, y_3, <eos>]$$
$$\text{outputs} = [\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]$$

In [49]:
class Seq2Seq(nn.Module):
    def __init__(self, encoder, decoder, device):
        super().__init__()
        
        self.encoder = encoder
        self.decoder = decoder
        self.device = device
        
        assert encoder.hid_dim == decoder.hid_dim, \
            "Hidden dimensions of encoder and decoder must be equal!"
        assert encoder.n_layers == decoder.n_layers, \
            "Encoder and decoder must have equal number of layers!"
        
        # Encoder: __init__(self, input_dim, emb_dim, hid_dim, n_layers, dropout):
        # Decoder: __init__(self, output_dim, emb_dim, hid_dim, n_layers, dropout):

    def forward(self, src, trg, teacher_forcing_ratio=0.5):
        # Encoder: forward(self, src): # as for the encoder, the initial hidden and cell input is not provided and pytorch will pass zeros tensor instead
        # Decoder: forward(self, input, hidden, cell):


        # src = [src len, batch size]
        # trg = [trg len, batch size]
        # teacher_forcing_ratio is the possibility to use teacher forcing
        # e.g. if teacher_forcing_ratio is 0.75 we use ground-truch 75% of the time

        batch_size = trg.shape[1]
        trg_len = trg.shape[0]
        trg_vocab_size = self.decoder.output_dim

        # tensor to store decoder outputs
        outputs = torch.zeros(trg_len, batch_size, trg_vocab_size).to(self.device)

        # last hidden state of the encoder is used as the initial hidden state of the decoder
        hidden, cell = self.encoder(src)

        # first input to the decoder is the <sos> tokens
        input = trg[0, :]
        
        for t in range(1, trg_len):

            # insert input token embedding, previous hidden and previous cell states
            # receive output tensor (predictions) and new hidden and ccell states
            output, hidden, cell = self.decoder(input, hidden, cell)

            # place predicitons in a tensor holding predicitons for each token
            outputs[t] = output

            # decide if we are going to use teacher forcing or not
            teacher_force = random.random() < teacher_forcing_ratio

            # get the highest predicted token from our predictions
            top1 = output.argmax(1)

            # if teacher forcing, use actual next token as next input
            # if not, use predicted token
            input = trg[t] if teacher_force else top1

        return outputs
            

# Training the Seq2Seq Model

Now we have our model implemented, we can begin training it. 

First, we'll initialize our model. As mentioned before, the input and output dimensions are defined by the size of the vocabulary. The embedding dimesions and dropout for the encoder and decoder can be different, but the number of layers and the size of the hidden/cell states must be the same. 

We then define the encoder, decoder and then our Seq2Seq model, which we place on the `device`.

In [50]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
INPUT_DIM = len(SRC_vocab_transform)
OUTPUT_DIM = len(TRG_vocab_transform)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)

Next up is initializing the weights of our model. In the paper they state they initialize all weights from a uniform distribution between -0.08 and +0.08, i.e. $\\mathcal{U}(-0.08, 0.08).$

We initialize weights in PyTorch by creating a function which we `apply` to our model. When using `apply`, the `init_weights` function will be called on every module and sub-module within our model. For each module we loop through all of the parameters and sample them from a uniform distribution with `nn.init.uniform_`.

In [51]:
def init_weights(m):
    for name, param in m.named_parameters():
        nn.init.uniform_(param.data, -0.08, 0.08)

model.apply(init_weights)

Seq2Seq(
  (encoder): Encoder(
    (embedding): Embedding(7853, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (dropout): Dropout(p=0.5, inplace=False)
  )
  (decoder): Decoder(
    (embedding): Embedding(5893, 256)
    (rnn): LSTM(256, 512, num_layers=2, dropout=0.5)
    (fc_out): Linear(in_features=512, out_features=5893, bias=True)
    (dropout): Dropout(p=0.5, inplace=False)
  )
)

We also define a function that will calculate the number of trainable parameters in the model

In [52]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 13,898,501 trainable parameters


We define our optimizer, which we use to update our parameters in the training loop. Check out [this](http://ruder.io/optimizing-gradient-descent/) post for information about different optimizers. Here, we'll use Adam.

In [53]:
optimizer = optim.Adam(model.parameters())

Next, we define our loss function. The `CrossEntropyLoss` function calculates both the log softmax as well as the negative log-likelihood of our predictions.

Our loss function calculates the average loss per token, however by passing the index of the `<pad>` token as the `ignore_index` argument we ignore the loss whenever the target token is a padding token.

In [54]:
TRG_PAD_IDX = TRG_vocab_transform['<pad>']
criterion = nn.CrossEntropyLoss(ignore_index=TRG_PAD_IDX)

Next, we'll define our training loop.

First, we'll set the model into "training mode" with `model.train()`. This will turn on dropout (and batch normalization, which we aren't using) and then iterate through our data iterator.

As stated before, our decoder loop starts at 1, not 0. This means the 0th element of our `outputs` tensor remains all zeros. So our `trg` and `outputs` look something like:


$$\text{trg} = [<sos>, y_1, y_2, y_3, <eos>]$$
$$\text{outputs} = [0, \hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]$$


Here, when we calculate the loss, we cut off the first element of each tensor to get:


$$\text{trg} = [y_1, y_2, y_3, <eos>]$$
$$\text{outputs} = [\hat{y}_1, \hat{y}_2, \hat{y}_3, <eos>]$$


At each iteration:
- get the source and target sentences from the batch, $X$ and $Y$
- zero the gradients calculated from the last batch
- feed the source and target into the model to get the output, $\hat{Y}$
- as the loss function only works on 2d inputs with 1d targets we need to flatten each of them with `.view`
    - we slice off the first column of the output and target tensors as mentioned above
- calculate the gradients with `loss.backward()`
- clip the gradients to prevent them from exploding (a common issue in RNNs)
- update the parameters of our model by doing an optimizer step
- sum the loss value to a running total

Finally, we return the loss that is averaged over all batches.

In [56]:
def train(model, batch, optimizer, criterion, clip):
    
    model.train()
    
    epoch_loss = 0
    iteration = 0
    # print(len((batch))) # only for zip type
    for src, _, trg in batch:
        
        src = src.to(device)
        trg = trg.to(device)
        
        optimizer.zero_grad()
        
        output = model(src, trg, 0.7)
        
        # trg = [trg len, batch size]
        # ouput = [trg len, batch size, output dim]
        
        output_dim = output.shape[-1]
        
        output = output[1:].view(-1, output_dim)
        trg = trg[1:].view(-1)
        
        # trg = [(trg len - 1) * batch size]
        # output = [(trg len - 1) batch size, output dim]
        
        loss = criterion(output, trg)
        
        loss.backward()
        
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
        iteration += 1
        
        
    return epoch_loss / iteration

Our evaluation loop is similar to our training loop, however as we aren't updating any parameters we don't need to pass an optimizer or a clip value.

We must remember to set the model to evaluation mode with `model.eval()`. This will turn off dropout (and batch normalization, if used).\n

We use the `with torch.no_grad()` block to ensure no gradients are calculated within the block. This reduces memory consumption and speeds things up.

The iteration loop is similar (without the parameter updates), however we must ensure we turn teacher forcing off for evaluation. This will cause the model to only use it's own predictions to make further predictions within a sentence, which mirrors how it would be used in deployment.

In [57]:
def evaluate(model, batch, criterion):
    
    model.eval()
    
    epoch_loss = 0
    iteration = 0
    with torch.no_grad():
        for src, _, trg in batch:
            src = src.to(device)
            trg = trg.to(device)
            
            output = model(src, trg, 0) # turn off teach forcing
            
            # trg = [trg len, batch size]
            # output = [trg_len, batch size, output dim]
            
            output_dim = output.shape[-1]
            
            output = output[1:].view(-1, output_dim)
            trg = trg[1:].view(-1)
            
            # trg = [(trg len -1)*batch size]
            # output = [(trg len - 1) * batch size, output dim]
            
            loss = criterion(output, trg)
            
            epoch_loss = loss.item()
            iteration += 1
            
        return epoch_loss/iteration

Next, we'll create a function that we'll use to tell us how long an epoch takes.

In [58]:
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time/60)
    elapsed_secs = int(elapsed_time - (elapsed_mins*60))
    
    return elapsed_mins, elapsed_secs

We can finally start training our model!

At each epoch, we'll be checking if our model has achieved the best validation loss so far. If it has, we'll update our best validation loss and save the parameters of our model (called `state_dict` in PyTorch). Then, when we come to test our model, we'll use the saved parameters used to achieve the best validation loss. 

We'll be printing out both the loss and the perplexity at each epoch. It is easier to see a change in perplexity than a change in loss as the numbers are much bigger.

In [66]:
N_EPOCHS = 40
CLIP = 1

BATCH_SIZE = 128
# train_set, valid_set = Multi30k(split=('train','valid'), language_pair=('de', 'en'))
train_dataloader = DataLoader(train_data, batch_size=BATCH_SIZE, collate_fn=generate_batch)
# valid_dataloader = DataLoader(valid_data, batch_size=BATCH_SIZE, collate_fn=generate_batch)

best_valid_loss = float('inf')
# print(len(train_dataloader)) # fails because the method used to load the dataset doesn't implement len
for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss = train(model, train_dataloader, optimizer, criterion, CLIP)
    valid_loss = evaluate(model, valid_dataloader, criterion)
    
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
    
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'tut1-model_moreTrain.pt')
    
    print(f'Epoch: {epoch+1:02} | Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train PPL: {math.exp(train_loss):7.3f}')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. PPL: {math.exp(valid_loss):7.3f}')

Epoch: 01 | Time: 0m 50s
	Train Loss: 1.839 | Train PPL:   6.290
	 Val. Loss: 0.544 |  Val. PPL:   1.722
Epoch: 02 | Time: 0m 48s
	Train Loss: 1.831 | Train PPL:   6.242
	 Val. Loss: 0.544 |  Val. PPL:   1.722
Epoch: 03 | Time: 0m 46s
	Train Loss: 1.838 | Train PPL:   6.286
	 Val. Loss: 0.544 |  Val. PPL:   1.722
Epoch: 04 | Time: 0m 46s
	Train Loss: 1.824 | Train PPL:   6.196
	 Val. Loss: 0.544 |  Val. PPL:   1.722
Epoch: 05 | Time: 0m 46s
	Train Loss: 1.828 | Train PPL:   6.221
	 Val. Loss: 0.544 |  Val. PPL:   1.722
Epoch: 06 | Time: 0m 47s
	Train Loss: 1.838 | Train PPL:   6.282
	 Val. Loss: 0.544 |  Val. PPL:   1.722
Epoch: 07 | Time: 0m 46s
	Train Loss: 1.822 | Train PPL:   6.186
	 Val. Loss: 0.544 |  Val. PPL:   1.722
Epoch: 08 | Time: 0m 46s
	Train Loss: 1.835 | Train PPL:   6.264
	 Val. Loss: 0.544 |  Val. PPL:   1.722
Epoch: 09 | Time: 0m 47s
	Train Loss: 1.838 | Train PPL:   6.281
	 Val. Loss: 0.544 |  Val. PPL:   1.722
Epoch: 10 | Time: 0m 47s
	Train Loss: 1.823 | Train PPL

We'll load the parameters `(state_dict)` that gave our model the best validation loss and run it the model on the test set.



In [61]:
model.load_state_dict(torch.load('tut1-model.pt'))
test_loader = DataLoader(test_data, batch_size=1, collate_fn=generate_batch)
test_loss = evaluate(model, test_loader, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')


| Test Loss: 0.004 | Test PPL:   1.004 |


In [62]:
model.load_state_dict(torch.load('tut1-model_moreTrain.pt'))
test_loader = DataLoader(test_data, batch_size=1, collate_fn=generate_batch)
test_loss = evaluate(model, test_loader, criterion)

print(f'| Test Loss: {test_loss:.3f} | Test PPL: {math.exp(test_loss):7.3f} |')


| Test Loss: 0.003 | Test PPL:   1.003 |


In [71]:
import torch
import spacy

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
INPUT_DIM = len(SRC_vocab_transform)
OUTPUT_DIM = len(TRG_vocab_transform)
ENC_EMB_DIM = 256
DEC_EMB_DIM = 256
HID_DIM = 512
N_LAYERS = 2
ENC_DROPOUT = 0.5
DEC_DROPOUT = 0.5

enc = Encoder(INPUT_DIM, ENC_EMB_DIM, HID_DIM, N_LAYERS, ENC_DROPOUT)
dec = Decoder(OUTPUT_DIM, DEC_EMB_DIM, HID_DIM, N_LAYERS, DEC_DROPOUT)

model = Seq2Seq(enc, dec, device).to(device)
model.load_state_dict(torch.load('tut1-model_moreTrain.pt'))

# Load the German and English spaCy tokenizers
spacy_de = spacy.load("de_core_news_sm")
spacy_en = spacy.load("en_core_web_sm")

def translate_sentence(model, sentence, src_tokenizer, trg_tokenizer, max_length=50):
    
    model.eval()
    # Tokenize the input sentence
    tokenized_sentence = [tok.text.lower() for tok in src_tokenizer(sentence)]
    
    # Add start and end tokens and convert to tensor
    src_tensor = add_symbols(torch.tensor(SRC_vocab_transform(tokenized_sentence)), SRC_vocab_transform)
    src_tensor = src_tensor.unsqueeze(1).to(model.device)
    
    # Forward pass through the model
    with torch.no_grad():
        hidden, cell = model.encoder(src_tensor)
    
    # Create a list to store the translated words
    trg_indexes = TRG_vocab_transform(['<sos>'])
    trg_tensor = torch.LongTensor([trg_indexes[-1]]).to(model.device)
    
    # Initialize variables for the decoding loop
    for _ in range(max_length):
        
        
        with torch.no_grad():
            output, hidden, cell = model.decoder(trg_tensor, hidden, cell)
        
        pred_token = output.argmax(1)
        trg_indexes.append(pred_token.item())
        
        trg_tensor = pred_token
        
        if pred_token.item() == TRG_vocab_transform(['<eos>'])[0]:
            break
    # Convert the indices to words
    
    translated_sentence = [TRG_vocab_transform.get_itos()[i] for i in trg_indexes]
    
    # Remove the start and end tokens
    translated_sentence = translated_sentence[1:-1]
    
    return ' '.join(translated_sentence)

# Example usage:
input_sentence = "Jungen tanzen mitten in der Nacht auf Pfosten."
# input_sentence = "Zwei junge weiße Männer sind im Freien in der Nähe vieler Büsche."

translated_sentence = translate_sentence(model, input_sentence, spacy_de, spacy_en)
print(f'Input: {input_sentence}')
print(f'Translation: {translated_sentence}')


Input: Jungen tanzen mitten in der Nacht auf Pfosten.
Translation: night at sidewalk the on on <unk> a
