# TensorFlow Tutorial #21
# Machine Translation

by [Magnus Erik Hvass Pedersen](http://www.hvass-labs.org/)
/ [GitHub](https://github.com/Hvass-Labs/TensorFlow-Tutorials) / [Videos on YouTube](https://www.youtube.com/playlist?list=PL9Hr9sNUjfsmEu1ZniY0XpHSzl5uihcXZ)

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
# enter the foldername in your Drive where you have saved the unzipped
# assignment folder, e.g. 'cs231n/assignments/assignment3/'
FOLDERNAME = 'cs231n/assignments/ai'
assert FOLDERNAME is not None, "[!] Enter the foldername."

# now that we've mounted your Drive, this ensures that
# the Python interpreter of the Colab VM can load
# python files within it.
import sys
sys.path.append('/content/drive/My Drive/{}'.format(FOLDERNAME))

# this downloads the CIFAR-10 dataset to your Drive
# if it doesn't already exist
%cd drive/My\ Drive/$FOLDERNAME/datasets/
!python setup.py build_ext --inplace
%cd ../
!ls

## Introduction

Tutorial #20 showed how to use a Recurrent Neural Network (RNN) to do so-called sentiment analysis on texts of movie reviews. This tutorial will extend that idea to do Machine Translation of human languages by combining two RNNs.

You should be familiar with TensorFlow, Keras and the basics of Natural Language Processing, see Tutorials #01, #03-C and #20.

## Flowchart

The following flowchart shows roughly how the neural network is constructed. It is split into two parts: An encoder which maps the source-text to a "thought vector" that summarizes the text's contents, which is then input to the second part of the neural network that decodes the "thought vector" to the destination-text.

The neural network cannot work directly on text so first we need to convert each word to an integer-token using a tokenizer. But the neural network cannot work on integers either, so we use a so-called Embedding Layer to convert each integer-token to a vector of floating-point values. The embedding is trained alongside the rest of the neural network to map words with similar semantic meaning to similar vectors of floating-point values.

For example, consider the Danish text "der var engang" which is the beginning of any fairytale and literally means "there was once" but is commonly translated into English as "once upon a time". We first convert the entire data-set to integer-tokens so the text "der var engang" becomes [12, 54, 1097]. Each of these integer-tokens is then mapped to an embedding-vector with e.g. 128 elements, so the integer-token 12 could for example become [0.12, -0.56, ..., 1.19] and the integer-token 54 could for example become [0.39, 0.09, ..., -0.12]. These embedding-vectors can then be input to the Recurrent Neural Network, which has 3 GRU-layers. See Tutorial #20 for a more detailed explanation.
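As a rough illustration of the two-step conversion described above, the following sketch maps words to integer-tokens and then to embedding-vectors using plain Python. The vocabulary, the `text_to_tokens` helper and the random 4-element vectors are made up for this example; in the tutorial the embedding is trained and uses e.g. 128 elements.

```python
import random

# Hypothetical vocabulary mapping words to integer-tokens.
word_index = {"der": 12, "var": 54, "engang": 1097}

# Hypothetical embedding table with one small random vector per token.
# A trained embedding-layer would learn these values instead.
random.seed(0)
embedding_size = 4
embedding_table = {
    token: [random.uniform(-1.0, 1.0) for _ in range(embedding_size)]
    for token in word_index.values()
}

def text_to_tokens(text):
    """Convert a text to integer-tokens, skipping unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

tokens = text_to_tokens("der var engang")
vectors = [embedding_table[t] for t in tokens]

print(tokens)                         # [12, 54, 1097]
print(len(vectors), len(vectors[0]))  # 3 4
```

Each word is now represented by a short list of floating-point values, which is the form of input the recurrent layers work on.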

The last GRU-layer outputs a single vector - the "thought vector" that summarizes the contents of the source-text - which is then used as the initial state of the GRU-units in the decoder-part.

The destination-text "once upon a time" is padded with special markers "ssss" and "eeee" to indicate its beginning and end, so the sequence of integer-tokens becomes [2, 337, 640, 9, 79, 3]. During training, the decoder will be given this entire sequence as input and the desired output sequence is [337, 640, 9, 79, 3], which is the same sequence but time-shifted one step. We are trying to teach the decoder to map the "thought vector" and the start-token "ssss" (integer 2) to the next word "once" (integer 337), and then map the word "once" to the word "upon" (integer 640), and so forth.
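The time-shift can be sketched with plain Python lists. This sketch follows the slicing used later in this tutorial, where the decoder input drops the last token and the desired output drops the first one, so both lists have the same length.

```python
# Integer-tokens for "ssss once upon a time eeee" from the text above.
tokens = [2, 337, 640, 9, 79, 3]

# The decoder input drops the last token and the desired output drops
# the first token, so the two lists are aligned one time-step apart.
decoder_input = tokens[:-1]
decoder_output = tokens[1:]

# Each pair is (token given to the decoder, token it should predict).
pairs = list(zip(decoder_input, decoder_output))
for given, target in pairs:
    print(given, "->", target)
```

The first pair is (2, 337), meaning that the start-marker "ssss" should be followed by the word "once".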

This flow-chart depicts the main idea but does not show all the necessary details e.g. regarding the loss function which is also somewhat complicated.

![Flowchart](images/21_machine_translation_flowchart.png)

## Imports

In [None]:
# To create a local environment, open a terminal in the folder containing this file
# and run "python3 -m venv env".
%pip install matplotlib
%pip install tensorflow
%pip install numpy
%pip install datasets
%pip install zstandard
%pip install transformers

In [3]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
import math
import os

We need to import several things from Keras.

In [35]:
# from tf.keras.models import Model  # This does not work!
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, GRU, Embedding, Dropout
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, TensorBoard
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

This was developed using Python 3.6 (Anaconda) and package versions:

In [3]:
tf.__version__

'2.8.0'

In [4]:
tf.keras.__version__

'2.8.0'

In [None]:
#!pip install zstandard

## Load Data

We will use the Hugging Face dataset Wikidepia/IndoParaCrawl:

https://huggingface.co/datasets/Wikidepia/IndoParaCrawl

In [5]:
#!pip install datasets
#!pip install zstandard
from datasets import load_dataset

#from datasets.filesystems import S3FileSystem
#dataset = load_dataset("Wikidepia/IndoParaCrawl", data_files = 'IndoParaCrawl-1*.jsonl.zst')
#dataset_small = load_dataset("Wikidepia/IndoParaCrawl", data_files = 'IndoParaCrawl-5*.jsonl.zst')
dataset_tiny = load_dataset("Wikidepia/IndoParaCrawl", data_files = 'IndoParaCrawl-11.jsonl.zst')
#dataset.save_to_disk('/dataset')

#Wikidepia/IndoParaCrawl
#import europarl

Using custom data configuration Wikidepia--IndoParaCrawl-46e41816710df16a
Reusing dataset json (C:\Users\panke\.cache\huggingface\datasets\json\Wikidepia--IndoParaCrawl-46e41816710df16a\0.0.0\ac0ca5f5289a6cf108e706efcf040422dbbfa8e658dee6a819f20d76bb84d26b)


  0%|          | 0/1 [00:00<?, ?it/s]

In [None]:
#books = load_dataset("opus_books", "en-fr")

#books = books["train"].train_test_split(test_size=0.2)
#_dataset = dataset["train"].train_test_split(test_size=0.2)
#dataset_small = _dataset["test"]
#_dataset_small = dataset_small.train_test_split(test_size=0.2)
#dataset_tiny = _dataset_small["test"]
#_dataset_tiny = dataset_tiny.train_test_split(test_size=0.2)


In [6]:
_dataset_tiny = dataset_tiny['train'].train_test_split(test_size=0.1)

In [6]:
print(_dataset_tiny)

DatasetDict({
    train: Dataset({
        features: ['en', 'id'],
        num_rows: 7999898
    })
    test: Dataset({
        features: ['en', 'id'],
        num_rows: 1999975
    })
})


In [7]:
_dataset_sample = _dataset_tiny['test']

In [11]:
dataset_micro = _dataset_tiny["test"].train_test_split(test_size=0.001)['test']

In [12]:

_dataset_micro = dataset_micro.train_test_split(test_size=0.1)

In [13]:
print(_dataset_sample)
print(_dataset_micro)

Dataset({
    features: ['en', 'id'],
    num_rows: 1999975
})
DatasetDict({
    train: Dataset({
        features: ['en', 'id'],
        num_rows: 1800
    })
    test: Dataset({
        features: ['en', 'id'],
        num_rows: 200
    })
})


In order for the decoder to know when to begin and end a sentence, we need to mark the start and end of each sentence with words that most likely don't occur in the data-set.

https://keras.io/api/models/model_training_apis/#fit-method

https://huggingface.co/docs/transformers/tasks/translation

In [7]:
_dataset_run = _dataset_tiny['test']

In [8]:
mark_start = 'ssss '
mark_end = ' eeee'
s_lang = 'id'
d_lang = 'en'
dataset_rnn = _dataset_run
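The markers only work if they do not already occur as words in the texts. The following sketch shows one way to check this before adding them. The two example texts and the helper `count_marker_collisions` are made up for illustration and are not part of the data-set or of any library.

```python
mark_start = 'ssss '
mark_end = ' eeee'

# Hypothetical destination texts standing in for the real data-set.
texts = ["Once upon a time", "Create a backup and label it."]

def count_marker_collisions(texts, markers):
    """Count the texts that already contain one of the marker words."""
    words = [m.strip() for m in markers]
    return sum(
        any(w in text.lower().split() for w in words) for text in texts
    )

collisions = count_marker_collisions(texts, [mark_start, mark_end])

# Add the markers to the beginning and end of every text.
marked = [mark_start + text + mark_end for text in texts]

print(collisions)  # 0
print(marked[0])   # ssss Once upon a time eeee
```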

## Begin Training with Transformers

In [7]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

# Pretrained Indonesian-to-English translation model.
_model_link = "Helsinki-NLP/opus-mt-id-en"
tokenizer = AutoTokenizer.from_pretrained(_model_link)
model = AutoModelForSeq2SeqLM.from_pretrained(_model_link)
# Pads the inputs and labels of each batch to a common length.
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)




In [None]:

import transformers
from transformers import AutoTokenizer
from transformers import DataCollatorForSeq2Seq
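max_input_length = 128
max_target_length = 128


# Unused: the translation model loaded above needs no task prefix.
prefix = 'translate en to id: '
def preprocess(data):
  # Source texts are in s_lang and target texts are in d_lang.
  inputs = [example for example in data[s_lang]]
  targets = [example for example in data[d_lang]]
  model_inputs = tokenizer(inputs, max_length=max_input_length, truncation=True)
  # Tokenize the targets with the target-language settings.
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length=max_target_length, truncation=True)
  model_inputs['labels'] = labels['input_ids']
  return model_inputs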

def space_punctuation(text):
  # Insert a space before each punctuation mark so it becomes its own word.
  for mark in ",;:.?!'()":
    text = text.replace(mark, " " + mark)
  return text

def preprocess_commas(data):
  inputs = [space_punctuation(example) for example in data[s_lang]]
  targets = [space_punctuation(example) for example in data[d_lang]]
  model_inputs = tokenizer(inputs, max_length=128, truncation=True)
  with tokenizer.as_target_tokenizer():
    labels = tokenizer(targets, max_length=128, truncation=True)
  model_inputs['labels'] = labels['input_ids']
  return model_inputs




In [22]:
tokenized_data = _dataset_run.map(preprocess, batched=True, keep_in_memory=True)

  0%|          | 0/1 [00:00<?, ?ba/s]

NameError: name 'tokenizer' is not defined

In [73]:
tokenized_data_2 = _dataset_run.map(preprocess, batched=True, keep_in_memory=True, remove_columns=_dataset_run["train"].column_names)

  0%|          | 0/2 [00:00<?, ?ba/s]

  0%|          | 0/1 [00:00<?, ?ba/s]

In [74]:
print(data_collator([tokenized_data_2["train"][i] for i in range(5)]))

{'input_ids': tensor([[  131, 22873,    14, 35752,     3,    18,  5667,    39, 39204,    18,
         17337,  8881,   223,     3,    18,     3,  1095,   109,     3,  2539,
         15659,     0, 54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795,
         54795, 54795, 54795],
        [ 9248,  1911,   981, 42758,   127,   836,   381, 32498,    19,  5462,
          2539, 31403, 33110,    18,    19,  7895,     2,     0, 54795, 54795,
         54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795,
         54795, 54795, 54795],
        [ 4974,  4083,    21,  5563,  2352,  4055,  5804,  2850, 22169,  2853,
          1716,     9,   520,  1009,   127,  4589, 19506,     9,  1679, 17447,
         28252, 39730, 46373,     2,     0, 54795, 54795, 54795, 54795, 54795,
         54795, 54795, 54795],
        [19556,   142,  4196,    39,  6003, 21234, 12152,     3, 51662,     3,
          8117,    28, 11877,   204,    18,  4906,   246,     0, 54795, 54795,
         54795, 54795, 5

In [75]:
out = data_collator([tokenized_data_2["train"][i] for i in range(5)])
for key in out:
    print(f"{key} shape: {out[key].shape}")

input_ids shape: torch.Size([5, 33])
attention_mask shape: torch.Size([5, 33])
labels shape: torch.Size([5, 35])
decoder_input_ids shape: torch.Size([5, 35])
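The shapes above show that the inputs and the labels are padded separately, each to the longest sequence in its own batch. The following is a simplified sketch of that padding in plain Python, not the library's implementation. The short token lists and the helpers `pad_batch` and `make_attention_mask` are made up for illustration; the pad id 54795 is taken from the printed batch above and -100 is the label padding value that is handled in the metric code further below.

```python
PAD_TOKEN_ID = 54795  # pad id seen in the printed batch above
LABEL_PAD_ID = -100   # padding value for labels, ignored by the loss

def pad_batch(sequences, pad_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]

def make_attention_mask(sequences):
    """1 for real tokens and 0 for padding positions."""
    max_len = max(len(seq) for seq in sequences)
    return [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]

# Hypothetical short examples; the real ones come from the tokenizer.
input_ids = [[131, 22873, 14, 0], [9248, 1911, 0]]
labels = [[1015, 51, 0], [9493, 184, 7042, 338, 0]]

padded_inputs = pad_batch(input_ids, PAD_TOKEN_ID)
padded_labels = pad_batch(labels, LABEL_PAD_ID)
attention_mask = make_attention_mask(input_ids)

print(padded_inputs[1])   # [9248, 1911, 0, 54795]
print(padded_labels[0])   # [1015, 51, 0, -100, -100]
print(attention_mask[1])  # [1, 1, 1, 0]
```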


In [None]:
#tokenized_data_pico_2 = _dataset_pico.map(preprocess, batched=True, remove_columns=_dataset_pico["train"].column_names,)

In [49]:
print(_dataset_micro)
#print(_dataset_pico)

DatasetDict({
    train: Dataset({
        features: ['en', 'id'],
        num_rows: 1800
    })
    test: Dataset({
        features: ['en', 'id'],
        num_rows: 200
    })
})


## Option 1: Fine tune with Trainer

In [53]:
batch = data_collator([tokenized_data_2["train"][i] for i in range(1, 3)])
batch.keys()

dict_keys(['input_ids', 'attention_mask', 'labels', 'decoder_input_ids'])

In [55]:
batch["attention_mask"]

tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

In [56]:
batch["decoder_input_ids"]

tensor([[54795,  1015,    51,  1309,    27,  6008,    84,  1605,  3769,  4674,
             7,  2376,     9,  3324,  4423,   118,  8351,  4538,     0, 54795,
         54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795,
         54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795,
         54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795, 54795],
        [54795,  9493,   184,  7042,   338,  8720,     9,    17,  2012,     9,
             3, 14493, 30182, 33544,   858,  3470,  6981,  1565,    12, 15824,
          2141,  1560,  1648,  1006,    27,   140,   373,    17,   561,  5498,
           125,   157, 21813,    40,   500,    16,     6, 19119,  3072,    12,
             6,  2384,  3978,     3,  3483,    97,  4567, 18553,     2]])

In [57]:
for i in range(1, 3):
    print(tokenized_data_2["train"][i]["labels"])

[1015, 51, 1309, 27, 6008, 84, 1605, 3769, 4674, 7, 2376, 9, 3324, 4423, 118, 8351, 4538, 0]
[9493, 184, 7042, 338, 8720, 9, 17, 2012, 9, 3, 14493, 30182, 33544, 858, 3470, 6981, 1565, 12, 15824, 2141, 1560, 1648, 1006, 27, 140, 373, 17, 561, 5498, 125, 157, 21813, 40, 500, 16, 6, 19119, 3072, 12, 6, 2384, 3978, 3, 3483, 97, 4567, 18553, 2, 0]


In [58]:
import numpy as np
from datasets import load_metric

metric = load_metric("sacrebleu")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # In case the model returns more than the prediction logits
    if isinstance(preds, tuple):
        preds = preds[0]

    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)

    # Replace -100s in the labels as we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    # Some simple post-processing
    decoded_preds = [pred.strip() for pred in decoded_preds]
    decoded_labels = [[label.strip()] for label in decoded_labels]

    result = metric.compute(predictions=decoded_preds, references=decoded_labels)
    return {"bleu": result["score"]}
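Two steps in `compute_metrics` can be sketched without any libraries: replacing the -100 padding values in the labels so they can be decoded, and wrapping each reference text in its own list. The label rows and the decoded texts below are made up for illustration.

```python
PAD_TOKEN_ID = 54795  # pad id that appears in the printed batches above

# Hypothetical label rows where -100 marks the padding positions.
labels = [
    [1015, 51, 1309, 0, -100, -100],
    [9493, 184, 7042, 338, 8720, 0],
]

# Plain-Python equivalent of np.where(labels != -100, labels, pad_id).
restored = [
    [token if token != -100 else PAD_TOKEN_ID for token in row]
    for row in labels
]

# The metric is given one list of references per prediction, which is
# why each decoded label is wrapped in its own list.
decoded_labels = [" once upon a time ", "create a backup"]  # made-up texts
references = [[label.strip()] for label in decoded_labels]

print(restored[0])  # [1015, 51, 1309, 0, 54795, 54795]
print(references)   # [['once upon a time'], ['create a backup']]
```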

In [59]:
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer
#model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
#data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
max_input_length = 128
max_target_length = 128
training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    evaluation_strategy="no",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=3,
    fp16=False,  # keep False on CPU; fp16=True requires a CUDA GPU
)
#print("test2")
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_data_2["train"],
    eval_dataset=tokenized_data_2["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)



In [62]:
trainer.evaluate(max_length=max_target_length)

***** Running Evaluation *****
  Num examples = 200
  Batch size = 4


TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

In [None]:
model.save_pretrained("./Dataset")

In [76]:
trainer.train()

***** Running training *****
  Num examples = 1800
  Num Epochs = 3
  Instantaneous batch size per device = 4
  Total train batch size (w. parallel, distributed & accumulation) = 4
  Gradient Accumulation steps = 1
  Total optimization steps = 1350


  0%|          | 0/1350 [00:00<?, ?it/s]

Saving model checkpoint to ./results\checkpoint-450
Configuration saved in ./results\checkpoint-450\config.json
Model weights saved in ./results\checkpoint-450\pytorch_model.bin
tokenizer config file saved in ./results\checkpoint-450\tokenizer_config.json
Special tokens file saved in ./results\checkpoint-450\special_tokens_map.json


{'loss': 1.4588, 'learning_rate': 1.2592592592592593e-05, 'epoch': 1.11}


Saving model checkpoint to ./results\checkpoint-900
Configuration saved in ./results\checkpoint-900\config.json
Model weights saved in ./results\checkpoint-900\pytorch_model.bin
tokenizer config file saved in ./results\checkpoint-900\tokenizer_config.json
Special tokens file saved in ./results\checkpoint-900\special_tokens_map.json


{'loss': 1.0547, 'learning_rate': 5.185185185185185e-06, 'epoch': 2.22}


Saving model checkpoint to ./results\checkpoint-1350
Configuration saved in ./results\checkpoint-1350\config.json
Model weights saved in ./results\checkpoint-1350\pytorch_model.bin
tokenizer config file saved in ./results\checkpoint-1350\tokenizer_config.json
Special tokens file saved in ./results\checkpoint-1350\special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)




{'train_runtime': 2407.5201, 'train_samples_per_second': 2.243, 'train_steps_per_second': 0.561, 'train_loss': 1.170412168149595, 'epoch': 3.0}


TrainOutput(global_step=1350, training_loss=1.170412168149595, metrics={'train_runtime': 2407.5201, 'train_samples_per_second': 2.243, 'train_steps_per_second': 0.561, 'train_loss': 1.170412168149595, 'epoch': 3.0})
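The numbers in the training log can be reproduced with a little arithmetic. The sketch below assumes that the learning rate decays linearly to zero without warm-up and that the loss is logged every 500 steps. Both are assumptions inferred from the logged values rather than settings shown in the cell above.

```python
import math

num_examples = 1800
batch_size = 4
num_epochs = 3
initial_lr = 2e-5

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step):
    """Learning rate after `step` updates, assuming linear decay to zero."""
    return initial_lr * (1 - step / total_steps)

print(steps_per_epoch)                  # 450
print(total_steps)                      # 1350
print(round(500 / steps_per_epoch, 2))  # 1.11
print(linear_lr(500))
print(linear_lr(1000))
```

Under these assumptions the two printed learning rates agree with the logged values of about 1.259e-05 at epoch 1.11 and 5.185e-06 at epoch 2.22, and the checkpoints at steps 450, 900 and 1350 correspond to the end of each epoch.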

## Option 2: Fine tune with TensorFlow


In [77]:
trainer.evaluate()


***** Running Evaluation *****
  Num examples = 200
  Batch size = 4


TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'

In [65]:
from transformers import AdamWeightDecay, TFAutoModelForSeq2SeqLM

# Keras needs a TensorFlow model and a collator that returns TensorFlow
# tensors. Running this cell with the PyTorch model and collator from
# above failed with:
#   AttributeError: 'torch.Size' object has no attribute 'as_list'
# Add from_pt=True below if the checkpoint has no TensorFlow weights.
tf_model = TFAutoModelForSeq2SeqLM.from_pretrained(_model_link)
tf_data_collator = DataCollatorForSeq2Seq(tokenizer, model=tf_model,
                                          return_tensors="tf")

tf_train_set = tokenized_data_2["train"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=True,
    batch_size=16,
    collate_fn=tf_data_collator,
)

tf_test_set = tokenized_data_2["test"].to_tf_dataset(
    columns=["attention_mask", "input_ids", "labels"],
    shuffle=False,
    batch_size=16,
    collate_fn=tf_data_collator,
)

optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

tf_model.compile(optimizer=optimizer)

tf_model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3)

## End Training with Transformers

## Begin Training with RNN

Load the texts for the source-language, here we use Indonesian.

In [16]:
data_src = dataset_rnn[s_lang]

#europarl.load_data(english=False,
 #                             language_code=language_code)

Load the texts for the destination-language, here we use English.

In [17]:
data_dest = dataset_rnn[d_lang]
for i in range(len(data_dest)):
  data_dest[i] = mark_start + str(data_dest[i]) + mark_end
#data_dest = [mark_start + str(example) for example in _dataset_micro['test'][d_lang] + mark_end]

#europarl.load_data(english=True,
 #                              language_code=language_code,
  #                             start=mark_start,
   #                            end=mark_end)

In [58]:
print(data_src[1])
print("-=-=-=-")
print(data_dest[1])

Tetapi juga peraih medali Marit Bouwmeester, Edith Bosch dan Laura Smulders akan ada di sana.
-=-=-=-
ssss But also medalists Marit Bouwmeester, Edith Bosch and Laura Smulders will be there. eeee


In [18]:
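null_count_src = 0
for i in range(len(data_src)):
    # Some source entries are not strings (e.g. missing values). Replace
    # both sides of the pair with placeholders so the tokenizer only
    # ever sees strings.
    if not isinstance(data_src[i], str):
        data_src[i] = "kosong"  # Indonesian for "empty"
        data_dest[i] = "ssss empty eeee"
        print(i)
        null_count_src += 1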

print(type(data_src[0]) == str)

4996
107840
119047
123116
133128
182855
204968
364687
413849
437767
482977
504479
664885
752570
773447
844543
851807
898074
True


In [19]:
def space_punctuation(text):
    # Insert a space before each punctuation mark so it becomes its own word.
    for mark in ",;:.?!'()":
        text = text.replace(mark, " " + mark)
    return text

data_src = [space_punctuation(example) for example in data_src]
data_dest = [space_punctuation(example) for example in data_dest]

In [22]:
print(data_src[1])
print("-=-=-=-")
print(data_dest[1])

Akhirnya dia mulai mendapatkan suara dan inspirasi yang dia nyatakan sebagai milik kita - saya menulis kepadanya banyak surat peringatan dan penjelasan yang serius tetapi dia menolak untuk mendengarkan , terlalu terikat pada suara palsu dan inspirasi dan , untuk menghindari teguran dan koreksi .  , berhenti menulis atau memberi tahu kami .
-=-=-=-


We will build a model to translate from the source language (Indonesian) to the destination language (English). If you want to make the inverse translation you can merely exchange the source and destination data.

### Example Data

The data is just a list of texts that is ordered so the source and destination texts match. I can confirm that this example is an accurate translation.

In [81]:
idx = 10

In [82]:
data_src[idx]

'Buat cadangan dan beri label .'

In [83]:
data_dest[idx]

'ssss Create a backup and label it . eeee'

## Tokenizer

Neural Networks cannot work directly on text-data. We use a two-step process to convert text into numbers that can be used in a neural network. The first step is to convert text-words into so-called integer-tokens. The second step is to convert integer-tokens into vectors of floating-point numbers using a so-called embedding-layer. See Tutorial #20 for a more detailed explanation.

Set the maximum number of words in our vocabulary. This means that we will only use e.g. the 10000 most frequent words in the data-set. We use the same number for both the source and destination languages, but these could be different.

In [23]:
num_words = 10000
filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

In [24]:
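# Override the Keras default above so that punctuation such as
# , . ; : ? ! ( ) is kept and becomes tokens of its own.
filters='#$%&*+-/<=>@[\\]^_{|}~\t\n'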
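The effect of limiting the vocabulary can be sketched with the standard library alone. This is a simplified stand-in for the Keras Tokenizer, using two made-up texts and a tiny limit. In this sketch the most frequent word gets integer-token 1 and a word is kept only if its integer-token is below `num_words`.

```python
from collections import Counter

# Two made-up texts and a tiny vocabulary limit for illustration.
texts = ["the cat sat on the mat", "the dog sat on the log"]
num_words = 4

# Count the words and give the most frequent ones the lowest tokens.
counts = Counter(word for text in texts for word in text.split())
word_index = {
    word: i + 1 for i, (word, _) in enumerate(counts.most_common())
}

def text_to_tokens(text):
    """Convert a text to tokens, skipping words outside the vocabulary."""
    return [
        word_index[w]
        for w in text.split()
        if w in word_index and word_index[w] < num_words
    ]

tokens = text_to_tokens("the cat sat on the mat")
print(tokens)  # [1, 2, 3, 1]
```

The rare words "cat" and "mat" are skipped, so the token-sequence is shorter than the original text.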

We need a few more functions than provided by Keras' Tokenizer-class so we wrap it.

In [25]:
class TokenizerWrap(Tokenizer):
    """Wrap the Tokenizer-class from Keras with more functionality."""
    
    def __init__(self, texts, padding,
                 reverse=False, num_words=None):
        """
        :param texts: List of strings. This is the data-set.
        :param padding: Either 'post' or 'pre' padding.
        :param reverse: Boolean whether to reverse token-lists.
        :param num_words: Max number of words to use.
        """

        Tokenizer.__init__(self, num_words=num_words, filters=filters)

        # Create the vocabulary from the texts.
        self.fit_on_texts(texts)

        # Create inverse lookup from integer-tokens to words.
        self.index_to_word = dict(zip(self.word_index.values(),
                                      self.word_index.keys()))

        # Convert all texts to lists of integer-tokens.
        # Note that the sequences may have different lengths.
        self.tokens = self.texts_to_sequences(texts)

        if reverse:
            # Reverse the token-sequences.
            self.tokens = [list(reversed(x)) for x in self.tokens]
        
            # Sequences that are too long should now be truncated
            # at the beginning, which corresponds to the end of
            # the original sequences.
            truncating = 'pre'
        else:
            # Sequences that are too long should be truncated
            # at the end.
            truncating = 'post'

        # The number of integer-tokens in each sequence.
        self.num_tokens = [len(x) for x in self.tokens]

        # Max number of tokens to use in all sequences.
        # We will pad / truncate all sequences to this length.
        # This is a compromise so we save a lot of memory and
        # only have to truncate maybe 5% of all the sequences.
        self.max_tokens = np.mean(self.num_tokens) \
                          + 2 * np.std(self.num_tokens)
        self.max_tokens = int(self.max_tokens)

        # Pad / truncate all token-sequences to the given length.
        # This creates a 2-dim numpy matrix that is easier to use.
        self.tokens_padded = pad_sequences(self.tokens,
                                           maxlen=self.max_tokens,
                                           padding=padding,
                                           truncating=truncating)

    def token_to_word(self, token):
        """Lookup a single word from an integer-token."""

        word = " " if token == 0 else self.index_to_word[token]
        return word 

    def tokens_to_string(self, tokens):
        """Convert a list of integer-tokens to a string."""

        # Create a list of the individual words.
        words = [self.index_to_word[token]
                 for token in tokens
                 if token != 0]
        
        # Concatenate the words to a single string
        # with space between all the words.
        text = " ".join(words)

        return text
    
    def text_to_tokens(self, text, reverse=False, padding=False):
        """
        Convert a single text-string to tokens with optional
        reversal and padding.
        """

        # Convert to tokens. Note that we assume there is only
        # a single text-string so we wrap it in a list.
        tokens = self.texts_to_sequences([text])
        tokens = np.array(tokens)

        if reverse:
            # Reverse the tokens.
            tokens = np.flip(tokens, axis=1)

            # Sequences that are too long should now be truncated
            # at the beginning, which corresponds to the end of
            # the original sequences.
            truncating = 'pre'
        else:
            # Sequences that are too long should be truncated
            # at the end.
            truncating = 'post'

        if padding:
            # Pad and truncate sequences to the given length.
            tokens = pad_sequences(tokens,
                                   maxlen=self.max_tokens,
                                   padding='pre',
                                   truncating=truncating)

        return tokens

Now create a tokenizer for the source-language. Note that we pad zeros at the beginning ('pre') of the sequences. We also reverse the sequences of tokens because the research literature suggests that this might improve performance, because the last words seen by the encoder match the first words produced by the decoder, so short-term dependencies are supposedly modelled more accurately.

In [26]:
null_count_dest = 0
for i in range(len(data_dest)):
    if not isinstance(data_dest[i], str):
        print(i)
        null_count_dest += 1

print(type(data_dest[0]) == str)

True


In [27]:
%%time
tokenizer_src = TokenizerWrap(texts=data_src,
                              padding='pre',
                              reverse=True,
                              num_words=num_words)

CPU times: total: 1min 24s
Wall time: 1min 28s
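The reversal, pre-padding and pre-truncation used for the source-language can be sketched in plain Python. The helper `pad_or_truncate` is a made-up stand-in for the padding function used by the tokenizer wrapper and the token values are arbitrary.

```python
def pad_or_truncate(tokens, max_tokens, padding, truncating):
    """Pad with zeros or truncate a token list to exactly max_tokens."""
    if len(tokens) > max_tokens:
        # 'pre' drops tokens from the beginning, 'post' from the end.
        if truncating == 'pre':
            tokens = tokens[-max_tokens:]
        else:
            tokens = tokens[:max_tokens]
    zeros = [0] * (max_tokens - len(tokens))
    return zeros + tokens if padding == 'pre' else tokens + zeros

# A short text: reversed and then padded with zeros at the beginning.
short = list(reversed([5, 6, 7]))
src_short = pad_or_truncate(short, 5, padding='pre', truncating='pre')
print(src_short)  # [0, 0, 7, 6, 5]

# A long text: after reversal, 'pre' truncation removes the tokens
# that came last in the original text.
long = list(reversed([1, 2, 3, 4, 5, 6, 7]))
src_long = pad_or_truncate(long, 5, padding='pre', truncating='pre')
print(src_long)   # [5, 4, 3, 2, 1]
```

In both cases the first word of the original text ends up as the last token seen by the encoder.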


Now create the tokenizer for the destination language. We need a tokenizer for both the source- and destination-languages because their vocabularies are different. Note that this tokenizer does not reverse the sequences and it pads zeros at the end ('post') of the arrays.

In [29]:
%%time
tokenizer_dest = TokenizerWrap(texts=data_dest,
                               padding='post',
                               reverse=False,
                               num_words=num_words)

CPU times: total: 1min 41s
Wall time: 1min 48s


Define convenience variables for the padded token sequences. These are just 2-dimensional numpy arrays of integer-tokens.

Note that the sequence-lengths are different for the source and destination languages. This is because texts with the same meaning may have different numbers of words in the two languages. 

Furthermore, we have made a compromise when tokenizing the original texts in order to save a lot of memory. This means we only truncate about 5% of the texts.

In [30]:
tokens_src = tokenizer_src.tokens_padded
tokens_dest = tokenizer_dest.tokens_padded
print(tokens_src.shape)
print(tokens_dest.shape)

(999988, 41)
(999988, 45)
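The sequence-length compromise is the mean number of tokens plus two standard deviations, as computed in the tokenizer wrapper. The following sketch applies the same formula to a made-up list of sequence-lengths using only the standard library, where `pstdev` is the population standard deviation that `np.std` computes by default.

```python
import statistics

# Hypothetical numbers of tokens in ten texts, one of them very long.
num_tokens = [3, 5, 4, 6, 5, 4, 30, 5, 4, 6]

mean = statistics.mean(num_tokens)
std = statistics.pstdev(num_tokens)

# Same formula as in the tokenizer wrapper: mean + 2 * std.
max_tokens = int(mean + 2 * std)

# Count how many texts are longer than this and would be truncated.
num_truncated = sum(n > max_tokens for n in num_tokens)

print(max_tokens)     # 22
print(num_truncated)  # 1
```

Only the single very long text is truncated, while every padded sequence needs 22 tokens instead of 30.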


This is the integer-token used to mark the beginning of a text in the destination-language.

In [31]:
token_start = tokenizer_dest.word_index[mark_start.strip()]
token_start

1

This is the integer-token used to mark the end of a text in the destination-language.

In [32]:
token_end = tokenizer_dest.word_index[mark_end.strip()]
token_end

2

### Example of Token Sequences

This is the output of the tokenizer. Note how it is padded with zeros at the beginning (pre-padding).

In [33]:
idx = 10

In [34]:
tokens_src[idx]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    2,  494,  730,
        163,    6,  679, 1910,   21,    1,  550, 1029,   48,  111,   18,
        541,  259,   91,    5,  359,  874,  510,   13])

We can reconstruct the original text by converting each integer-token back to its corresponding word:

In [80]:
tokenizer_src.tokens_to_string(tokens_src[idx])

'. label beri dan cadangan buat'

This text is actually reversed, as can be seen when compared to the original text from the data-set:

In [84]:
data_src[idx]

'Buat cadangan dan beri label .'

This is the sequence of integer-tokens for the corresponding text in the destination-language. Note how it is padded with zeros at the end (post-padding).

In [36]:
tokens_dest[idx]

array([   1,    9,  696, 1197,    6, 1131,  113,  909,  581,   11,   46,
          6,    4,  512, 1364,    3,   33,    8, 1577,  364, 1514,    5,
          2,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0])

We can reconstruct the original text by converting each integer-token back to its corresponding word:

In [86]:
tokenizer_dest.tokens_to_string(tokens_dest[idx])

'ssss create a backup and label it . eeee'

Compare this to the original text from the data-set, which is identical except for letter case, because the tokenizer converts all text to lower-case. Note that we only use a vocabulary of the 10000 most frequent words in the data-set, so in other texts any word that is not used frequently enough to be included in the vocabulary is merely skipped.

In [87]:
data_dest[idx]

'ssss Create a backup and label it . eeee'

### Training Data

Now that the data-set has been converted to sequences of integer-tokens that are padded and truncated and saved in numpy arrays, we can easily prepare the data for use in training the neural network.

The input to the encoder is merely the numpy array for the padded and truncated sequences of integer-tokens produced by the tokenizer:

In [37]:
encoder_input_data = tokens_src

The input and output data for the decoder is identical, except shifted one time-step. We can use the same numpy array to save memory by slicing it, which merely creates different 'views' of the same data in memory.

In [38]:
decoder_input_data = tokens_dest[:, :-1]
decoder_input_data.shape

(999988, 44)

In [39]:
decoder_output_data = tokens_dest[:, 1:]
decoder_output_data.shape

(999988, 44)

For example, these token-sequences are identical except they are shifted one time-step.

In [40]:
idx = 2

In [41]:
decoder_input_data[idx]

array([   1,  298,    3,   60,    4, 2038, 3018, 1615,    3,   15,   32,
          4,  899,    6, 2552,   19,  181,    8,  591,    3,   23,   89,
         23,    4,  645,  591,    8,  181,    5,    2,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0])

In [42]:
decoder_output_data[idx]

array([ 298,    3,   60,    4, 2038, 3018, 1615,    3,   15,   32,    4,
        899,    6, 2552,   19,  181,    8,  591,    3,   23,   89,   23,
          4,  645,  591,    8,  181,    5,    2,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0])

If we use the tokenizer to convert these sequences back into text, we see that they are identical except for the first word which is 'ssss' that marks the beginning of a text.

In [43]:
tokenizer_dest.tokens_to_string(decoder_input_data[idx])

'ssss however , when the wild symbols appear , you have the option of winning from right to left , as well as the standard left to right . eeee'

In [44]:
tokenizer_dest.tokens_to_string(decoder_output_data[idx])

'however , when the wild symbols appear , you have the option of winning from right to left , as well as the standard left to right . eeee'

## Create the Neural Network

### Create the Encoder

First we create the encoder-part of the neural network which maps a sequence of integer-tokens to a "thought vector". We will use the so-called functional API of Keras for this, where we first create the objects for all the layers of the neural network and then connect them later. This allows for more flexibility than the so-called sequential API in Keras, which is useful when experimenting with more complicated architectures and ways of connecting the encoder and decoder.

This is the input for the encoder which takes batches of integer-token sequences. The `None` indicates that the sequences can have arbitrary length.

In [45]:
encoder_input = Input(shape=(None, ), name='encoder_input')

This is the length of the vectors output by the embedding-layer, which maps integer-tokens to vectors of values roughly between -1 and 1, so that words that have similar semantic meanings are mapped to vectors that are similar. See Tutorial #20 for a more detailed explanation of this.

In [46]:
embedding_size = 128

This is the embedding-layer.

In [47]:
encoder_embedding = Embedding(input_dim=num_words,
                              output_dim=embedding_size,
                              name='encoder_embedding')
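
Conceptually, an embedding-layer is a lookup table with one row of floating-point values per integer-token. The sketch below mimics this with plain numpy; the tiny sizes and the random table are assumptions for illustration only, and in the real layer the table is learned during training.

```python
import numpy as np

num_words_demo = 10       # hypothetical tiny vocabulary
embedding_size_demo = 4   # hypothetical tiny embedding-size

rng = np.random.default_rng(0)

# One row of floating-point values per integer-token.
embedding_table = rng.uniform(-1.0, 1.0,
                              size=(num_words_demo, embedding_size_demo))

# A batch of 2 sequences with 3 integer-tokens each.
tokens = np.array([[1, 4, 2],
                   [1, 9, 2]])

# Indexing with the token array looks up one row per token.
embedded = embedding_table[tokens]
print(embedded.shape)  # (2, 3, 4)
```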

This is the size of the internal states of the Gated Recurrent Units (GRU). The same size is used in both the encoder and decoder.

In [48]:
state_size = 512

This creates the 3 GRU layers that will map from a sequence of embedding-vectors to a single "thought vector" which summarizes the contents of the input-text. Note that the last GRU-layer does not return a sequence.

In [49]:
encoder_gru1 = GRU(state_size, name='encoder_gru1',
                   return_sequences=True)
encoder_gru2 = GRU(state_size, name='encoder_gru2',
                   return_sequences=True)
encoder_gru3 = GRU(state_size, name='encoder_gru3',
                   return_sequences=False)

This helper-function connects all the layers of the encoder.

In [50]:
def connect_encoder():
    # Start the neural network with its input-layer.
    net = encoder_input
    
    # Connect the embedding-layer.
    net = encoder_embedding(net)

    # Connect all the GRU-layers.
    net = encoder_gru1(net)
    net = encoder_gru2(net)
    net = encoder_gru3(net)

    # This is the output of the encoder.
    encoder_output = net
    
    return encoder_output

Note how the encoder uses the normal output from its last GRU-layer as the "thought vector". Research papers often use the internal state of the encoder's last recurrent layer instead. This makes the implementation more complicated and is not necessary when using the GRU, because a GRU has only one internal state-vector, which is also what it outputs at each time-step. If you were using the LSTM instead, it would be necessary to use the LSTM's internal states as the "thought vector", because the LSTM has two internal state-vectors, both of which would be needed to initialize the decoder's LSTM units.
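
To see why the output of a GRU can serve as its state, here is a minimal numpy sketch of a simplified GRU cell. The weights are random, the biases are omitted and all sizes are made up, so this is not the Keras implementation, only an illustration of the recurrence.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """One step of a simplified GRU cell (biases omitted)."""
    z = sigmoid(x @ p['Wz'] + h @ p['Uz'])             # update gate
    r = sigmoid(x @ p['Wr'] + h @ p['Ur'])             # reset gate
    h_cand = np.tanh(x @ p['Wh'] + (r * h) @ p['Uh'])  # candidate state
    return z * h + (1.0 - z) * h_cand                  # new state

rng = np.random.default_rng(0)
input_size, state_size_demo, num_steps = 3, 5, 4       # made-up sizes

# Random weights: W* map the input, U* map the previous state.
p = {}
for gate in ['z', 'r', 'h']:
    p['W' + gate] = rng.normal(scale=0.5, size=(input_size, state_size_demo))
    p['U' + gate] = rng.normal(scale=0.5, size=(state_size_demo, state_size_demo))

xs = rng.normal(size=(num_steps, input_size))
h = np.zeros(state_size_demo)
outputs = []
for x in xs:
    h = gru_step(x, h, p)
    outputs.append(h)   # the output at each step is the state itself

last_output = outputs[-1]
final_state = h
print(np.array_equal(last_output, final_state))
```

An LSTM cell would additionally carry a separate cell-state that is not part of its output, which is why the output alone would not be enough there.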

We can now use this function to connect all the layers in the encoder so it can be connected to the decoder further below.

In [51]:
encoder_output = connect_encoder()

### Create the Decoder

Create the decoder-part which maps the "thought vector" to a sequence of integer-tokens.

The decoder takes two inputs. First it needs the "thought vector" produced by the encoder which summarizes the contents of the input-text.

In [52]:
decoder_initial_state = Input(shape=(state_size,),
                              name='decoder_initial_state')

The decoder also needs a sequence of integer-tokens as inputs. During training we will supply this with a full sequence of integer-tokens e.g. corresponding to the text "ssss once upon a time eeee". 

During inference when we are translating new input-texts, we will start by feeding a sequence with just one integer-token for "ssss" which marks the beginning of a text, and combined with the "thought vector" from the encoder, the decoder will hopefully be able to produce the correct next word e.g. "once".

In [53]:
decoder_input = Input(shape=(None, ), name='decoder_input')

This is the embedding-layer which converts integer-tokens to vectors of real-valued numbers roughly between -1 and 1. Note that we have different embedding-layers for the encoder and decoder because we have two different vocabularies and two different tokenizers for the source and destination languages.

In [54]:
decoder_embedding = Embedding(input_dim=num_words,
                              output_dim=embedding_size,
                              name='decoder_embedding')

This creates the 3 GRU layers of the decoder. Note that they all return sequences because we ultimately want to output a sequence of integer-tokens that can be converted into a text-sequence.

In [55]:
decoder_gru1 = GRU(state_size, name='decoder_gru1',
                   return_sequences=True)
decoder_gru2 = GRU(state_size, name='decoder_gru2',
                   return_sequences=True)
decoder_gru3 = GRU(state_size, name='decoder_gru3',
                   return_sequences=True)

The GRU layers output a tensor with shape `[batch_size, sequence_length, state_size]`, where each "word" is encoded as a vector of length `state_size`. We need to convert this into sequences of integer-tokens that can be interpreted as words from our vocabulary.

One way of doing this is to convert the GRU output to an array with one element per word in the vocabulary, similar to a one-hot encoded array. This works but it is wasteful, because for a vocabulary of e.g. 10000 words we need a vector with 10000 elements for every time-step. We can then select the index of the highest element to be the integer-token.

In [56]:
decoder_dense = Dense(num_words,
                      activation='softmax',
                      name='decoder_output')
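
The conversion from such an output array back to integer-tokens can be sketched in numpy as follows. The random scores stand in for the dense layer's values before the softmax, and the sizes are made up.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
batch_size, seq_len, vocab = 2, 3, 6   # made-up sizes

# Stand-in for the dense layer's values before the softmax.
scores = rng.normal(size=(batch_size, seq_len, vocab))

probs = softmax(scores)              # shape (2, 3, 6), last axis sums to 1
tokens = np.argmax(probs, axis=-1)   # shape (2, 3), one token per time-step
print(tokens.shape)
```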

The decoder is built using the functional API of Keras, which allows more flexibility in connecting the layers e.g. to route different inputs to the decoder. This is useful because we have to connect the decoder directly to the encoder, but we will also connect the decoder to another input so we can run it separately.

This function connects all the layers of the decoder to some input of the initial-state values for the GRU layers.

In [57]:
def connect_decoder(initial_state):
    # Start the decoder-network with its input-layer.
    net = decoder_input

    # Connect the embedding-layer.
    net = decoder_embedding(net)
    
    # Connect all the GRU-layers.
    net = decoder_gru1(net, initial_state=initial_state)
    net = decoder_gru2(net, initial_state=initial_state)
    net = decoder_gru3(net, initial_state=initial_state)

    # Connect the final dense layer that converts to
    # one-hot encoded arrays.
    decoder_output = decoder_dense(net)
    
    return decoder_output

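This variant of the decoder adds a dropout-layer between the last GRU-layer and the dense output-layer. It is used for the training model further below.

In [68]: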
dropout = tf.keras.layers.Dropout(0.5)

In [69]:
def connect_decoder_dropout(initial_state):
    # Start the decoder-network with its input-layer.
    
    net = decoder_input

    # Connect the embedding-layer.
    net = decoder_embedding(net)
    
    # Connect all the GRU-layers.
    net = decoder_gru1(net, initial_state=initial_state)
    net = decoder_gru2(net, initial_state=initial_state)
    net = decoder_gru3(net, initial_state=initial_state)
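
    # Apply dropout to the output of the last GRU-layer.
    # Note that training=True keeps dropout active on every call,
    # also when Keras evaluates the validation-set during fit().
    net = dropout(net, training=True)
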
    # Connect the final dense layer that converts to
    # one-hot encoded arrays.
    decoder_output = decoder_dense(net)
    
    return decoder_output

### Connect and Create the Models

We can now connect the encoder and decoder in different ways.

First we connect the encoder directly to the decoder so it is one whole model that can be trained end-to-end. This means the initial-state of the decoder's GRU units is set to the output of the encoder. The training model uses the decoder variant with dropout.

In [60]:
class MyModel(Model):

  def __init__(self):
    super().__init__()
    self.dense1 = tf.keras.layers.Dense(4, activation=tf.nn.relu)
    self.dense2 = tf.keras.layers.Dense(5, activation=tf.nn.softmax)
    self.dropout = tf.keras.layers.Dropout(0.5)

  def call(self, inputs, training=False):
    x = self.dense1(inputs)
    if training:
      x = self.dropout(x, training=training)
    return self.dense2(x)

_myModel = MyModel()

In [70]:
decoder_output = connect_decoder(initial_state=encoder_output)
dropout_decoder_output = connect_decoder_dropout(initial_state=encoder_output)
model_train = Model(inputs=[encoder_input, decoder_input],
                    outputs=[dropout_decoder_output])

Then we create a model for just the encoder alone. This is useful for mapping a sequence of integer-tokens to a "thought-vector" summarizing its contents.

In [71]:
model_encoder = Model(inputs=[encoder_input],
                      outputs=[encoder_output])

Then we create a model for just the decoder alone. This allows us to directly input the initial state for the decoder's GRU units.

In [72]:
decoder_output = connect_decoder(initial_state=decoder_initial_state)

model_decoder = Model(inputs=[decoder_input, decoder_initial_state],
                      outputs=[decoder_output])

Note that all these models use the same weights and variables of the encoder and decoder. We are merely changing how they are connected. So once the entire model has been trained, we can run the encoder and decoder models separately with the trained weights.
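
The idea of several models sharing the same layer objects can be sketched without Keras. `TinyDense` below is a hypothetical stand-in for a layer that owns its weights; it is not a Keras class.

```python
import numpy as np

class TinyDense:
    """Hypothetical stand-in for a layer that owns its weights."""
    def __init__(self, n_in, n_out, rng):
        self.w = rng.normal(size=(n_in, n_out))

    def __call__(self, x):
        return x @ self.w

rng = np.random.default_rng(0)
enc_layer = TinyDense(3, 4, rng)
dec_layer = TinyDense(4, 2, rng)

# Three "models" that are just different ways of connecting
# the same two layer objects.
def model_full(x):
    return dec_layer(enc_layer(x))

def model_enc(x):
    return enc_layer(x)

def model_dec(state):
    return dec_layer(state)

x = rng.normal(size=(1, 3))
before = model_dec(model_enc(x))

# Changing a weight in place (as training the full model would) ...
dec_layer.w *= 2.0

# ... is visible through every model that uses that layer.
after_full = model_full(x)
after_split = model_dec(model_enc(x))
print(np.allclose(after_full, after_split))
```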

### Compile the Model

The output of the decoder is a sequence of one-hot encoded arrays. In order to train the decoder we need to supply the one-hot encoded arrays that we desire to see on the decoder's output, and then use a loss-function like cross-entropy to train the decoder to produce this desired output.

However, our data-set contains integer-tokens instead of one-hot encoded arrays. Each one-hot encoded array has 10000 elements so it would be extremely wasteful to convert the entire data-set to one-hot encoded arrays.
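
A rough calculation shows how wasteful this would be. The numbers are taken from the array shapes shown earlier; the 4 bytes per element is an assumption (32-bit floats).

```python
num_sequences = 999988   # from the shape of decoder_output_data
seq_length = 44          # from the shape of decoder_output_data
vocab_size = 10000       # elements per one-hot encoded array
bytes_per_float = 4      # assumption: 32-bit floats

onehot_bytes = num_sequences * seq_length * vocab_size * bytes_per_float
onehot_terabytes = onehot_bytes / 1e12
print(f"{onehot_terabytes:.2f} TB")  # 1.76 TB
```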

A better way is to use a so-called sparse cross-entropy loss-function, which does the conversion internally from integers to one-hot encoded arrays.
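
The following numpy sketch, with made-up probabilities, shows that the two formulations give the same loss values. It illustrates the equivalence only and is not how Keras implements the loss: with integer targets the loss is simply the negative log-probability of the target token, so the one-hot arrays never need to be stored.

```python
import numpy as np

# Made-up predicted probabilities for 3 time-steps
# over a vocabulary of 4 words.
probs = np.array([[0.7, 0.1, 0.1, 0.1],
                  [0.2, 0.5, 0.2, 0.1],
                  [0.1, 0.1, 0.2, 0.6]])
targets = np.array([0, 1, 3])   # desired integer-tokens

# Dense version: build one-hot targets, then sum over the vocabulary.
onehot = np.eye(probs.shape[1])[targets]
loss_dense = -(onehot * np.log(probs)).sum(axis=1)

# Sparse version: pick the probability of the target token directly.
loss_sparse = -np.log(probs[np.arange(len(targets)), targets])

print(np.allclose(loss_dense, loss_sparse))
```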

We have used the Adam optimizer in many of the previous tutorials, but it seems to diverge in some of these experiments with Recurrent Neural Networks. RMSprop seems to work much better for these.

In [73]:
model_train.compile(optimizer=RMSprop(learning_rate=1e-3),
                    loss='sparse_categorical_crossentropy')

### Callback Functions

During training we want to save checkpoints and log the progress to TensorBoard so we create the appropriate callbacks for Keras.

This is the callback for writing checkpoints during training.

In [74]:
path_checkpoint = '21_checkpoint_dropout.keras'
callback_checkpoint = ModelCheckpoint(filepath=path_checkpoint,
                                      monitor='val_loss',
                                      verbose=1,
                                      save_weights_only=True,
                                      save_best_only=True)

This is the callback for stopping the optimization when performance worsens on the validation-set.

In [75]:
callback_early_stopping = EarlyStopping(monitor='val_loss',
                                        patience=3, verbose=1)
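
The role of the `patience` argument can be sketched in plain Python. This is a simplified version of the idea with made-up validation losses, not the actual Keras implementation: training stops once the validation loss has failed to improve for `patience` epochs in a row.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the index of the epoch after which training would stop,
    or None if the patience is never exhausted."""
    best = float('inf')
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # Improvement: remember it and reset the counter.
            best = loss
            wait = 0
        else:
            # No improvement in this epoch.
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses for 6 epochs.
losses = [1.50, 1.30, 1.40, 1.35, 1.31, 1.20]
stop_epoch = early_stopping_epoch(losses, patience=3)
print(stop_epoch)  # 4
```

In this made-up run the loss would have improved again in the following epoch, which is the trade-off of a small patience value.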

This is the callback for writing the TensorBoard log during training.

In [76]:
callback_tensorboard = TensorBoard(log_dir='./21_logs/',
                                   histogram_freq=0,
                                   write_graph=False)

In [77]:
callbacks = [callback_early_stopping,
             callback_checkpoint,
             callback_tensorboard]

### Load Checkpoint

You can reload the last saved checkpoint so you don't have to train the model every time you want to use it.

In [78]:
try:
    model_train.load_weights(path_checkpoint)
except Exception as error:
    print("Error trying to load checkpoint.")
    print(error)

Error trying to load checkpoint.
Unable to open file (file signature not found)


### Train the Model

We wrap the data in named dicts so we are sure the data is assigned correctly to the inputs and outputs of the model.

In [79]:
x_data = \
{
    'encoder_input': encoder_input_data,
    'decoder_input': decoder_input_data
}

In [80]:
y_data = \
{
    'decoder_output': decoder_output_data
}

We want a validation-set of 10000 sequences but Keras needs this number as a fraction.

In [81]:
validation_split = 10000 / len(encoder_input_data)
validation_split

0.010000120001440018

Now we can train the model. One epoch of training took about 1 hour on a GTX 1070 GPU. You probably need to run 10 epochs or more during training. After 10 epochs the loss was about 1.10 on the training-set and about 1.15 on the validation-set.

The batch-size was chosen to keep the GPU running at nearly 100% while being within the memory limits of 8GB for this GPU.
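
As a sanity check, the number of batches per epoch can be worked out from the numbers above. The sketch assumes the training part is simply everything that is not held out for validation; the result agrees with the step count reported by the progress output of the training cell.

```python
import math

num_samples = 999988     # number of sequences in the data-set
num_validation = 10000   # desired size of the validation-set
batch_size = 384         # batch-size passed to fit()

validation_split = num_validation / num_samples

# Assumption: the training part is everything that is not held out.
num_train = num_samples - num_validation
steps_per_epoch = math.ceil(num_train / batch_size)
print(steps_per_epoch)  # 2579
```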

In [82]:
model_train.fit(x=x_data,
                y=y_data,
                batch_size=384,
                epochs=10,
                validation_split=validation_split,
                callbacks=callbacks)

Epoch 1/10
   1/2579 [..............................] - ETA: 29:57:14 - loss: 9.2098

KeyboardInterrupt: 

#-=-=-=-
#End Train on RNN



#-=-=-=-


#Begin Translate (Both RNN and Transformer)

#-=-=-=-

## Translate Texts

This function translates a text from the source-language to the destination-language and optionally prints a true translation.

In [253]:
def translate(input_text, true_output_text=None):
    """Translate a single text-string."""

    # Convert the input-text to integer-tokens.
    # Note the sequence of tokens has to be reversed.
    # Padding is probably not necessary.
    input_tokens = tokenizer_src.text_to_tokens(text=input_text,
                                                reverse=True,
                                                padding=True)
    
    # Get the output of the encoder's GRU which will be
    # used as the initial state in the decoder's GRU.
    # This could also have been the encoder's final state
    # but that is really only necessary if the encoder
    # and decoder use the LSTM instead of GRU because
    # the LSTM has two internal states.
    initial_state = model_encoder.predict(input_tokens)

    # Max number of tokens / words in the output sequence.
    max_tokens = tokenizer_dest.max_tokens

    # Pre-allocate the 2-dim array used as input to the decoder.
    # This holds just a single sequence of integer-tokens,
    # but the decoder-model expects a batch of sequences.
    shape = (1, max_tokens)
    decoder_input_data = np.zeros(shape=shape, dtype=int)

    # The first input-token is the special start-token for 'ssss '.
    token_int = token_start

    # Initialize an empty output-text.
    output_text = ''

    # Initialize the number of tokens we have processed.
    count_tokens = 0

    # While we haven't sampled the special end-token for ' eeee'
    # and we haven't processed the max number of tokens.
    while token_int != token_end and count_tokens < max_tokens:
        # Update the input-sequence to the decoder
        # with the last token that was sampled.
        # In the first iteration this will set the
        # first element to the start-token.
        decoder_input_data[0, count_tokens] = token_int

        # Wrap the input-data in a dict for clarity and safety,
        # so we are sure we input the data in the right order.
        x_data = \
        {
            'decoder_initial_state': initial_state,
            'decoder_input': decoder_input_data
        }

        # Note that we input the entire sequence of tokens
        # to the decoder. This wastes a lot of computation
        # because we are only interested in the last input
        # and output. We could modify the code to return
        # the GRU-states when calling predict() and then
        # feeding these GRU-states as well the next time
        # we call predict(), but it would make the code
        # much more complicated.

        # Input this data to the decoder and get the predicted output.
        decoder_output = model_decoder.predict(x_data)

        # Get the last predicted token as a one-hot encoded array.
        token_onehot = decoder_output[0, count_tokens, :]
        
        # Convert to an integer-token.
        token_int = np.argmax(token_onehot)

        # Lookup the word corresponding to this integer-token.
        sampled_word = tokenizer_dest.token_to_word(token_int)

        # Append the word to the output-text.
        output_text += " " + sampled_word

        # Increment the token-counter.
        count_tokens += 1

    # Sequence of tokens output by the decoder.
    output_tokens = decoder_input_data[0]
    
    # Print the input-text.
    print("Input text:")
    print(input_text)
    print()

    # Print the translated output-text.
    print("Translated text:")
    print(output_text)
    print()

    # Optionally print the true translated text.
    if true_output_text is not None:
        print("True output text:")
        print(true_output_text)
        print()
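
The core of the loop above is greedy decoding: feed back the most likely token until the end-token appears. The sketch below isolates that loop. `fake_decoder` is a hypothetical stand-in for `model_decoder.predict()` and the token values are made up, so it runs without a trained model.

```python
import numpy as np

TOKEN_START, TOKEN_END = 1, 2   # made-up special tokens
MAX_TOKENS, VOCAB = 8, 5        # made-up sizes

# Made-up rule for which token should follow which.
NEXT_TOKEN = {0: 0, 1: 3, 2: 2, 3: 4, 4: 2}

def fake_decoder(decoder_input):
    """Hypothetical stand-in for model_decoder.predict(): returns
    scores of shape [1, MAX_TOKENS, VOCAB] for the whole sequence."""
    scores = np.zeros((1, MAX_TOKENS, VOCAB))
    for t, tok in enumerate(decoder_input[0]):
        scores[0, t, NEXT_TOKEN[int(tok)]] = 1.0
    return scores

decoder_input = np.zeros((1, MAX_TOKENS), dtype=int)
token = TOKEN_START
output_tokens = []
count = 0

while token != TOKEN_END and count < MAX_TOKENS:
    # Feed back the last token that was chosen.
    decoder_input[0, count] = token

    # Run the decoder on the whole sequence so far.
    scores = fake_decoder(decoder_input)

    # Greedy choice: the highest-scoring token at the current step.
    token = int(np.argmax(scores[0, count, :]))
    output_tokens.append(token)
    count += 1

print(output_tokens)  # [3, 4, 2]
```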

### Examples

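Translate a text from the training-data. The translation captures only part of the meaning: it mentions departures, "2 per" and "monthly", but the place names are lost, possibly because they are not in the vocabulary, and "2 trips per day" has become "2 per hour". Note how it is not identical to the translation from the training-data.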

In [323]:
idx = 25
translate(input_text=data_src[idx],
          true_output_text=data_dest[idx])


Input text:
Stena Line memiliki opsi paling banyak untuk keberangkatan dari Harwich ke Hook of Holland, dengan rata-rata 2 perjalanan per hari dan 45 perjalanan bulanan.

Translated text:
 the route has a large selection of departure for the group from with a low week of north camera 2 per hour and a monthly resort eeee

True output text:
ssss Stena Line has the most options for departures from Harwich to Hook of Holland, with an average of 2 trips per day and 45 monthly trips. eeee



Here is another example which is a reasonable translation, although "hex" is missing and "tuning" appears only once, as "adjustment", which is a synonym in this context.

In [324]:
idx = 4
translate(input_text=data_src[idx],
          true_output_text=data_dest[idx])


Input text:
Sistem penyetelan internal dengan kunci penyetelan hex

Translated text:
 internal system with key adjustment eeee

True output text:
ssss Internal tuning system with hex tuning key eeee



In this example we join two texts from the training-set. The model first sends this combined text through the encoder, which produces a "thought-vector" that has to summarize both texts. The translation retains some fragments of the first text (third party, cookies, user, contents) but the second text is largely lost.

In [325]:
idx = 3
translate(input_text=data_src[idx] + data_src[idx+1],
          true_output_text=data_dest[idx] + data_dest[idx+1])


Input text:
COOMPY "Cookies", oleh akunnya sendiri atau pihak ketiga yang dikontrak untuk menyediakan layanan pengukuran dapat menggunakan cookie saat Pengguna menavigasi melalui konten Situs Web.Sistem penyetelan internal dengan kunci penyetelan hex

Translated text:
 the third party by the cookies or the following provide a use of cookies while the user is able to complete the system through the contents of the format system the browser is covered with a digital circuit eeee

True output text:
ssss “Cookies” COOMPY, by own account or third party contracted to provide measurement services may make use of cookies when the User navigates through the contents of the Website. eeeessss Internal tuning system with hex tuning key eeee



If we reverse the order of these two texts then the meaning is not quite so clear for the latter text.

In [276]:
idx = 3
translate(input_text=data_src[idx+1] + data_src[idx],
          true_output_text=data_dest[idx+1] + data_dest[idx])


Input text:
Sistem penyetelan internal dengan kunci penyetelan hexCOOMPY "Cookies", oleh akunnya sendiri atau pihak ketiga yang dikontrak untuk menyediakan layanan pengukuran dapat menggunakan cookie saat Pengguna menavigasi melalui konten Situs Web.

Translated text:
 the use of the connection with the file format or the party for the reason that the use of the services can be changed via the user settings when the user is linked to the website eeee

True output text:
ssss Internal tuning system with hex tuning key eeeessss “Cookies” COOMPY, by own account or third party contracted to provide measurement services may make use of cookies when the User navigates through the contents of the Website. eeee



This is an example I made up. It is a quite broken translation.

In [None]:
translate(input_text="der var engang et land der hed Danmark",
          true_output_text='Once there was a country named Denmark')

This is another example I made up. This is a better translation even though it is perhaps a more complicated text.

In [None]:
translate(input_text="Idag kan man læse i avisen at Danmark er blevet fornuftigt",
          true_output_text="Today you can read in the newspaper that Denmark has become sensible.")

This is a text from a Danish song. It doesn't even make much sense in Danish. However the translation is probably so broken because several of the words are not in the vocabulary.

In [None]:
translate(input_text="Hvem spæner ud af en butik og tygger de stærkeste bolcher?",
          true_output_text="Who runs out of a shop and chews the strongest bon-bons?")

## Conclusion

This tutorial showed the basic idea of using two Recurrent Neural Networks in a so-called encoder/decoder model to do Machine Translation of human languages. It was demonstrated on the very large Europarl data-set from the European Union.

The model could produce reasonable translations for some texts but not for others. It is possible that a better architecture for the neural network and more training epochs could improve performance. There are also more advanced models that are known to improve quality of the translations.

However, it is important to note that these models do not really understand human language. The models have no knowledge of the actual meaning of the words. The models are merely very advanced function approximators that can map between sequences of integer-tokens.

## Exercises

These are a few suggestions for exercises that may help improve your skills with TensorFlow. It is important to get hands-on experience with TensorFlow in order to learn how to use it properly.

You may want to back up this Notebook before making any changes.

* Train for more than 10 epochs. Does it improve the translations?
* Increase the size of the vocabulary. Does it improve the translations? Would it make sense to have different sizes for the vocabularies of the source and destination languages?
* Find another data-set and use it together with Europarl.
* Change the architectures of the neural network, for example change the state-size for the GRU layers, the number of GRU layers, the embedding-size, etc. Does it improve the translations?
* Use hyper-parameter optimization from Tutorial #19 to automatically find the best hyper-parameters.
* When translating texts, instead of using `np.argmax()` to select the next integer-token, could you sample from the decoder's output as if it was a probability distribution instead? Note that the decoder's dense layer in this Notebook uses a softmax activation, so its output at each time-step can already be interpreted as a probability-distribution.
* Can you generate multiple sequences by doing this sampling? Can you find a way to select the best of these different sequences?
* Disable the reversal of words for the source-language. Does it improve the translations?
* What is a Bi-Directional GRU and can you use it here?
* We use the **output** of the encoder's GRU as the initial state of the decoder's GRU. The research literature often uses an LSTM instead of the GRU, so they used the encoder's **state** instead of its output as the initial state of the decoder. Can you rewrite this code to use the encoder's state as the decoder's initial state? Is there a reason to do this, or is the encoder's output sufficient to use as the decoder's initial state?
* Is it possible to connect multiple encoders and decoders in a single neural network, so that you can train it on different languages and allow for direct translation e.g. from Danish to Polish, German and French?
* Explain to a friend how the program works.
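
As a starting point for the sampling exercise, here is a minimal sketch of drawing a token from a probability distribution with numpy. It assumes the decoder's output for one time-step has already been turned into probabilities; the example values are made up and the `temperature` parameter is an illustrative extra that is not part of the tutorial's code.

```python
import numpy as np

def sample_token(probs, temperature=1.0, rng=None):
    """Sample an integer-token from a probability distribution.
    Temperatures below 1 favour the most likely tokens."""
    rng = np.random.default_rng() if rng is None else rng

    # Rescale in log-space, then normalize back to probabilities.
    logits = np.log(probs + 1e-12) / temperature
    scaled = np.exp(logits - logits.max())
    scaled /= scaled.sum()

    return int(rng.choice(len(scaled), p=scaled))

# Made-up decoder output for one time-step over a 5-word vocabulary.
probs = np.array([0.05, 0.10, 0.60, 0.20, 0.05])

rng = np.random.default_rng(42)
samples = [sample_token(probs, rng=rng) for _ in range(1000)]
counts = np.bincount(samples, minlength=len(probs))
print(counts)
```

With a very low temperature the sampling becomes practically the same as `np.argmax()`.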

## License (MIT)

Copyright (c) 2018 by [Magnus Erik Hvass Pedersen](http://www.hvass-labs.org/)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.