
# Assignment 1: Neural Machine Translation
Welcome to the first assignment of Course 4. Here, you will build an English-to-German neural machine translation (NMT) model using Long Short-Term Memory (LSTM) networks with attention. Machine translation is an important task in natural language processing: beyond translating from one language to another, it is also useful for word sense disambiguation (e.g. determining whether the word "bank" refers to a financial institution or to the land alongside a river). A Recurrent Neural Network (RNN) with LSTMs alone can work for short to medium-length sentences, but it can suffer from vanishing gradients on very long sequences. To address this, you will add an attention mechanism that lets the decoder access all relevant parts of the input sentence regardless of its length. By completing this assignment, you will:

* learn how to preprocess your training and evaluation data
* implement an encoder-decoder system with attention
* understand how attention works
* build the NMT model from scratch using Trax
* generate translations using greedy and Minimum Bayes Risk (MBR) decoding

In [1]:
# Install Trax before importing it
!pip install trax

from termcolor import colored
import random
import numpy as np
import trax
from trax import layers as tl
from trax.fastmath import numpy as fastnp
from trax.supervised import training

# Confirm which Trax version was installed
!pip list | grep trax


## Part 1: Data Preparation

## 1.1 Importing the Data
The cell above imports the packages we will use in this assignment. As in the previous course of this specialization, we will use the Trax library, created and maintained by the Google Brain team, to do most of the heavy lifting. It provides submodules to fetch and process the datasets, as well as to build and train the model.

Next, we will import the dataset we will use to train the model. To meet the storage constraints of this lab environment, we will use a small dataset from Opus, a growing collection of translated texts from the web. Specifically, we will use an English-to-German translation subset specified as opus/medical, which contains medical texts. If storage is not an issue, you can opt for a larger corpus such as the English-to-German translation dataset from ParaCrawl, a large multilingual translation dataset created by the European Union. Both of these datasets are available via TensorFlow Datasets (TFDS), and you can browse the other available datasets in the TFDS catalog. We have downloaded the data for you in the data/ directory of your workspace. As you'll see below, you can easily access this dataset with trax.data.TFDS. The result is a Python generator function yielding tuples. Use the keys argument to select what appears at which position in the tuple. For example, keys=('en', 'de') below will return pairs as (English sentence, German sentence).

In [2]:
# Get generator function for the training set
train_stream_fn = trax.data.TFDS('opus/medical',
                                 data_dir='./data/',
                                 keys=('en', 'de'),
                                 eval_holdout_size=0.01, # 1% for eval
                                 train=True)

# Get generator function for the eval set
eval_stream_fn = trax.data.TFDS('opus/medical',
                                data_dir='./data/',
                                keys=('en', 'de'),
                                eval_holdout_size=0.01, # 1% for eval
                                train=False)

Downloading and preparing dataset 34.29 MiB (download: 34.29 MiB, generated: 188.85 MiB, total: 223.13 MiB) to ./data/opus/medical/0.1.0...

Dataset opus downloaded and prepared to ./data/opus/medical/0.1.0. Subsequent calls will reuse this data.


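To make the role of the keys argument concrete, here is a minimal sketch in plain Python. It does not call Trax or TFDS: `make_stream_fn` and the two toy sentence pairs are hypothetical stand-ins, used only to show how a generator function can yield tuples whose positions follow the order of the keys, and how calling the function again starts a fresh pass over the data.

```python
# Hypothetical stand-in for a TFDS-backed stream; this is NOT the Trax implementation.
def make_stream_fn(examples, keys):
    """Return a generator function that yields tuples ordered by `keys`."""
    def stream_fn():
        for example in examples:
            yield tuple(example[key] for key in keys)
    return stream_fn

# Toy data invented for illustration (not taken from opus/medical).
toy_examples = [
    {'en': b'Good morning.\n', 'de': b'Guten Morgen.\n'},
    {'en': b'Thank you.\n', 'de': b'Danke.\n'},
]

toy_stream_fn = make_stream_fn(toy_examples, keys=('en', 'de'))

first_pass = toy_stream_fn()    # a fresh generator
second_pass = toy_stream_fn()   # another, independent generator

first_pair = next(first_pass)
restarted_pair = next(second_pass)
print(first_pair)                     # (English sentence, German sentence)
print(restarted_pair == first_pair)   # the second generator starts from the beginning

# Swapping the keys swaps the tuple positions.
swapped_pair = next(make_stream_fn(toy_examples, keys=('de', 'en'))())
print(swapped_pair)
```

The real stream returned by trax.data.TFDS may shuffle or repeat the data, which this sketch does not attempt to reproduce.
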
Notice that TFDS returns a generator function, not a generator. This is because in Python you cannot reset a generator, so you cannot go back to a previously yielded value. During deep learning training, you use Stochastic Gradient Descent and don't actually need to go back, but it is sometimes good to be able to do so, and that's where the functions come in: calling the function again gives you a fresh generator. Lazy iteration of this kind is very common in Python; the built-in zip, for example, returns an iterator that produces its items one at a time. You can read more about Python generators to understand why we use them. Let's print a sample pair from our train and eval data. Notice that the raw output is represented in bytes (denoted by the b' prefix); these will be converted to strings internally in the next steps.

In [6]:
train_stream = train_stream_fn()
print(colored('train data (en, de) tuple:', 'red'), next(train_stream))
print()

eval_stream = eval_stream_fn()
print(colored('eval data (en, de) tuple:', 'red'), next(eval_stream))

train data (en, de) tuple: (b'In the pregnant rat the AUC for calculated free drug at this dose was approximately 18 times the human AUC at a 20 mg dose.\n', b'Bei tr\xc3\xa4chtigen Ratten war die AUC f\xc3\xbcr die berechnete ungebundene Substanz bei dieser Dosis etwa 18-mal h\xc3\xb6her als die AUC beim Menschen bei einer 20 mg Dosis.\n')

eval data (en, de) tuple: (b'Lutropin alfa Subcutaneous use.\n', b'Pulver zur Injektion Lutropin alfa Subkutane Anwendung\n')

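The escape sequences in the German sample (such as \xc3\xa4) are the UTF-8 bytes of accented characters. As a small sketch, the snippet below decodes a shortened prefix of that sample by hand; the variable names are illustrative only, and the later pipeline steps do this conversion for you.

```python
# A shortened prefix of the German training sample printed above.
raw_de = b'Bei tr\xc3\xa4chtigen Ratten'

# Decode the UTF-8 bytes into a Python string.
text_de = raw_de.decode('utf-8')
print(text_de)

# Encoding the string again gives back the original bytes.
round_trip = text_de.encode('utf-8')
print(round_trip == raw_de)

# The byte string is one element longer because the umlaut takes two bytes in UTF-8.
print(len(raw_de), len(text_de))
```
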

## Get all the files from GitHub and arrange the directories

In [3]:
!git clone https://github.com/DrAlexSanz/NLP-SPEC-C4.git
# Make sure the target directory exists, then copy the vocabulary file into it
!mkdir -p "data"
!cp "NLP-SPEC-C4/W1/data/ende_32k.subword" "data"
!rm -rf "NLP-SPEC-C4"


Cloning into 'NLP-SPEC-C4'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 27 (delta 4), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (27/27), done.


## 1.2 Tokenization and Formatting
Now that we have imported our corpus, we will preprocess the sentences into a format that our model can accept. This consists of several steps:

**Tokenizing the sentences using subword representations:** As you've learned in the earlier courses of this specialization, we want to represent each sentence as an array of integers instead of strings. For our application, we will use subword representations to tokenize our sentences. This is a common technique to avoid out-of-vocabulary words by allowing parts of words to be represented separately. For example, instead of having separate entries in your vocabulary for "fear", "fearless", "fearsome", "some", and "less", you can simply store "fear", "some", and "less" and then allow your tokenizer to combine these subwords when needed. This makes the vocabulary more flexible, so you won't have to store uncommon words explicitly (e.g. stylebender, nonce, etc.). Tokenizing is done with trax.data.Tokenize(), and we have provided the combined subword vocabulary for English and German (i.e. ende_32k.subword) in the data directory. Feel free to open this file to see what the subwords look like.

In [7]:
# global variables that state the filename and directory of the vocabulary file
VOCAB_FILE = 'ende_32k.subword'
VOCAB_DIR = 'data/'

tokenized_train_stream = trax.data.Tokenize(vocab_file = VOCAB_FILE, vocab_dir = VOCAB_DIR)(train_stream)
tokenized_eval_stream = trax.data.Tokenize(vocab_file = VOCAB_FILE, vocab_dir = VOCAB_DIR)(eval_stream)
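
To illustrate the subword idea with the "fear" / "some" / "less" example, here is a minimal sketch of a greedy longest-match splitter. The function `greedy_subword_split` and the three-entry toy vocabulary are hypothetical; the real ende_32k.subword vocabulary and the Trax tokenizer differ in their details, so treat this as an illustration of the principle rather than the library's algorithm.

```python
def greedy_subword_split(word, vocab):
    """Split `word` into the longest matching vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining candidate first, then shorter ones.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary piece matches here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

# Toy vocabulary from the example in the text (an assumption, not the real file).
toy_vocab = {'fear', 'some', 'less'}

for word in ['fear', 'fearless', 'fearsome']:
    print(word, '->', greedy_subword_split(word, toy_vocab))
```

With only three stored subwords, the sketch covers all three words without needing separate entries for "fearless" and "fearsome".
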

**Append an end-of-sentence token to each sentence:** We will reserve a token ID (1 in this case) to mark the end of a sentence. This is useful at inference time, since it tells us when the model has completed the translation.

In [8]:
EOS = 1  # end-of-sentence token ID (already an int)

def append_eos(stream):
    """Yield (input, target) pairs with the EOS token appended to each."""
    for inputs, targets in stream:
        input_with_eos = list(inputs) + [EOS]
        target_with_eos = list(targets) + [EOS]
        yield np.array(input_with_eos), np.array(target_with_eos)

In [9]:
tokenized_train_stream = append_eos(tokenized_train_stream)
tokenized_eval_stream = append_eos(tokenized_eval_stream)
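
As a rough illustration of why the end-of-sentence marker helps at prediction time, the sketch below appends the marker to a list of token IDs and then cuts a longer sequence at the first marker. Both helpers (`append_eos_to_ids`, `truncate_at_eos`) are hypothetical names using plain lists, not functions from the notebook or from Trax, and the "model output" is simply made up by appending extra IDs.

```python
EOS = 1  # same end-of-sentence ID as in the notebook

def append_eos_to_ids(token_ids, eos=EOS):
    """Return a new list with the EOS token added at the end."""
    return list(token_ids) + [eos]

def truncate_at_eos(token_ids, eos=EOS):
    """Keep tokens up to (not including) the first EOS; hypothetical helper."""
    result = []
    for token in token_ids:
        if token == eos:
            break
        result.append(token)
    return result

sentence_ids = [16, 6, 4, 904]
with_eos = append_eos_to_ids(sentence_ids)
print(with_eos)

# Pretend a decoder kept emitting tokens after it produced EOS.
decoded = with_eos + [992, 4729]
print(truncate_at_eos(decoded))
```
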

**Filter long sentences:** We will place a limit on the number of tokens per sentence to ensure we won't run out of memory. This is done with the trax.data.FilterByLength() method and you can see its syntax below.

In [11]:
# Filter long sentences to not run out of memory.
# length_keys=[0, 1] means we filter both English and German sentences, so
# both won't be longer than 256 tokens for training / 512 for eval.

filtered_train_stream = trax.data.FilterByLength(max_length = 256, length_keys = [0, 1])(tokenized_train_stream)

filtered_eval_stream = trax.data.FilterByLength(max_length = 512, length_keys = [0, 1])(tokenized_eval_stream)
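
The effect of max_length and length_keys can be sketched in plain Python. The function `filter_by_length` below is a hypothetical stand-in written for illustration, not the Trax implementation; it assumes an example is dropped when any of the selected fields exceeds max_length.

```python
def filter_by_length(stream, max_length, length_keys):
    """Hypothetical stand-in: keep examples whose selected fields fit within max_length."""
    for example in stream:
        # Longest sequence among the tuple positions we were asked to check.
        longest = max(len(example[i]) for i in length_keys)
        if longest <= max_length:
            yield example

# Toy (input, target) pairs of token IDs, invented for illustration.
toy_pairs = [
    ([1, 2, 3], [4, 5, 1]),           # both short
    ([1, 2, 3, 4, 5, 6], [7, 1]),     # input too long
    ([1], [2, 3, 4, 5, 6, 1]),        # target too long
]

# Checking both positions keeps only the first pair.
kept_both = list(filter_by_length(iter(toy_pairs), max_length=4, length_keys=[0, 1]))
print(kept_both)

# Checking only the input (position 0) also keeps the third pair.
kept_input_only = list(filter_by_length(iter(toy_pairs), max_length=4, length_keys=[0]))
print(kept_input_only)
```
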

In [16]:
# print a sample input-target pair of tokenized sentences
train_input, train_target = next(filtered_train_stream)
print(colored('Single tokenized example input:', 'red'), train_input)
print(colored('Single tokenized example target:', 'red'), train_target)

Single tokenized example input: [   16     6     4   904     7     4 20441  4384 18789    72    43     4
 14967 15397 22528  3550 30650  4729   992     1]
Single tokenized example target: [    6    11  3886    38 14327  3694 17461 27177 30650  4729   992     1]


## 1.3 Tokenize & Detokenize Helper Functions
Given any dataset, you have to be able to map words to their indices, and indices to their words. The inputs and outputs of your Trax models are usually tensors of numbers where each number corresponds to a word. If you were to process your data manually, you would have to make use of the following:

* word2Ind: a dictionary mapping the word to its index.
* ind2Word: a dictionary mapping the index to its word.
* word2Count: a dictionary mapping the word to the number of times it appears.
* num_words: total number of words that have appeared.
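
A minimal sketch of how these four structures could be built by hand from a toy corpus. The two sentences are invented for illustration, the indices are assigned in alphabetical order purely to make the result deterministic, and num_words is taken here to mean the number of distinct words, which is an assumption about the intended definition.

```python
from collections import Counter

# Toy corpus invented for illustration.
toy_corpus = ['the patient took the dose', 'the dose was low']

# word2Count: how many times each word appears.
word2Count = Counter(word for sentence in toy_corpus for word in sentence.split())

# word2Ind / ind2Word: assign indices in alphabetical order (an arbitrary but stable choice).
word2Ind = {word: idx for idx, word in enumerate(sorted(word2Count))}
ind2Word = {idx: word for word, idx in word2Ind.items()}

# num_words: assumed here to be the number of distinct words seen.
num_words = len(word2Ind)

print(word2Ind)
print(word2Count['the'], num_words)

# Map a sentence to indices and back again.
indices = [word2Ind[word] for word in 'the dose was low'.split()]
restored = ' '.join(ind2Word[idx] for idx in indices)
print(indices, restored)
```
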

Since you have already implemented these in previous assignments of the specialization, we will provide you with helper functions that will do this for you. Run the cell below to get the following functions:

* tokenize(): converts a text sentence to its corresponding token list (i.e. list of indices). Also converts words to subwords (parts of words).
* detokenize(): converts a token list to its corresponding sentence (i.e. string).