# Machine Translation

In this notebook, we translate English phrases into French using a recurrent neural network (RNN).


# Introduction

In this notebook, you will build a deep neural network that functions as part of an end-to-end machine translation pipeline. Your completed pipeline will accept English text as input and return the French translation.


In [14]:
# Now importing modules
import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, Sequential
from keras.layers import (
    GRU,
    Input,
    Dense,
    TimeDistributed,
    Activation,
    RepeatVector,
    Bidirectional,
    Dropout,
    LSTM,
)
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

In [15]:
import tensorflow as tf

# Load Data

The small_vocab_en file contains English sentences, and the small_vocab_fr file contains their French translations, line by line. Load the English and French data from these files by running the cells below.


In [16]:
english_path = (
    "https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_en.txt"
)
french_path = (
    "https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_fr.txt"
)

Load the dataset and split file by lines


In [17]:
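def load_data(path):
    """Read a UTF-8 text file and return its lines as a list of strings."""
    # Explicit encoding so accented French characters load the same on every platform
    with open(path, "r", encoding="utf-8") as f:
        data = f.read()

    return data.split("\n")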

In [18]:
# Using the Keras helper to download and cache the dataset files
english_data = tf.keras.utils.get_file("file1", english_path)
french_data = tf.keras.utils.get_file("file2", french_path)

In [19]:
# Now loading data
english_sentences = load_data(english_data)
french_sentences = load_data(french_data)

# take 10000 samples for training and 1000 samples for test
train_english_sentences = english_sentences[:10000]
train_french_sentences = french_sentences[:10000]

test_english_sentences = english_sentences[10000:11000]
test_french_sentences = french_sentences[10000:11000]

In [20]:
len(french_sentences), len(english_sentences)

(137860, 137860)

# Analysis of Dataset

Let us look at a few examples from the dataset in both languages.


In [21]:
for i in range(3):
    print("Sample :", i)
    print(train_english_sentences[i])
    print(train_french_sentences[i])
    print("-" * 50)

Sample : 0
new jersey is sometimes quiet during autumn , and it is snowy in april .
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
--------------------------------------------------
Sample : 1
the united states is usually chilly during july , and it is usually freezing in november .
les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .
--------------------------------------------------
Sample : 2
california is usually quiet during march , and it is usually hot in june .
california est généralement calme en mars , et il est généralement chaud en juin .
--------------------------------------------------


# Convert to Vocabulary

The complexity of the problem is determined by the complexity of the vocabulary: the more distinct words the model has to handle, the harder the translation task. Let's look at the vocabulary size of the dataset we'll be working with.


In [22]:
import collections

In [23]:
english_words_counter = collections.Counter(
    [word for sentence in train_english_sentences for word in sentence.split()]
)
french_words_counter = collections.Counter(
    [word for sentence in train_french_sentences for word in sentence.split()]
)

print("English Vocab:", len(english_words_counter))
print("French Vocab:", len(french_words_counter))

English Vocab: 226
French Vocab: 329


# Tokenize (IMPLEMENTATION)

For a neural network to predict on text data, it first has to be turned into data it can understand. Text data like "dog" is a sequence of ASCII character encodings. Since a neural network is a series of multiplication and addition operations, the input data needs to be number(s).

We can turn each character into a number or each word into a number. These are called character and word ids, respectively.

- Character ids are used for character level models that generate text predictions for each character.
- A word level model uses word ids that generate text predictions for each word. Word level models tend to learn better, since they are lower in complexity, so we'll use that.
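The difference between the two kinds of ids can be seen on a single sentence. The sketch below is plain Python and only illustrative: the helper `build_ids` is hypothetical, and it numbers tokens in order of first appearance, whereas the Keras Tokenizer used in this notebook numbers words by frequency.

```python
# Toy illustration of word ids versus character ids (plain Python, no Keras).
sentence = "the quick brown fox jumps over the lazy dog"


def build_ids(tokens):
    """Assign ids starting at 1 in order of first appearance (0 is left free for padding)."""
    ids = {}
    for token in tokens:
        if token not in ids:
            ids[token] = len(ids) + 1
    return ids


word_ids = build_ids(sentence.split())  # one id per distinct word
char_ids = build_ids(sentence)          # one id per distinct character (space included)

word_sequence = [word_ids[w] for w in sentence.split()]
char_sequence = [char_ids[c] for c in sentence]

print("word level:", len(word_sequence), "steps ->", word_sequence)
print("char level:", len(char_sequence), "steps")
```

The word-level sequence is much shorter than the character-level one for the same sentence, which is one reason the word-level model has an easier job here.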

**TO_DO:** Turn each sentence into a sequence of word ids using Keras's Tokenizer class. Use the function below to tokenize english_sentences and french_sentences.


In [24]:
def tokenize(x):
    ## TO_DO:
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    return tokenizer

In [25]:
# Tokenize Sample output
text_sentences = [
    "The quick brown fox jumps over the lazy dog .",
    "By Jove , my quick study of lexicography won a prize .",
    "This is a short sentence .",
]

text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
text_tokenized = text_tokenizer.texts_to_sequences(text_sentences)

for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print("Sequence {} in x".format(sample_i + 1))
    print("  Input:  {}".format(sent))
    print("  Output: {}".format(token_sent))

{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}
Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


# Padding (IMPLEMENTATION)

When batching the sequences of word ids together, each sequence needs to be the same length. Since sentences vary in length, we can add padding to the end of the sequences to make them the same length.

Make sure all the English sequences have the same length and all the French sequences have the same length by adding padding to the end of each sequence using Keras's pad_sequences function.
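To make the effect of post-padding concrete, here is a minimal pure-Python sketch. The helper `pad_post` is hypothetical and only mimics the behaviour this notebook relies on: zeros are appended at the end, and sequences longer than the target length lose their leading tokens, which is assumed here to match the `pad_sequences` default.

```python
def pad_post(sequences, length=None, pad_id=0):
    """Pad each sequence with pad_id at the end so all share the same length."""
    if length is None:
        # Default to the longest sequence in the batch
        length = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Keep at most `length` trailing tokens, then fill the rest with pad_id
        kept = list(seq[-length:]) if length > 0 else []
        padded.append(kept + [pad_id] * (length - len(kept)))
    return padded


batch = [[1, 2, 4, 5], [10, 11], [18, 19, 3, 20, 21]]
padded = pad_post(batch)
for row in padded:
    print(row)
```

Every row now has five entries, so the batch can be stacked into a single rectangular array.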


In [26]:
def pad(x, length=None):
    ## TO_DO:
    text_padded = pad_sequences(x, maxlen=length, padding="post")
    return text_padded

In [None]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    x_tk = tokenize(x)
    preprocess_x = x_tk.texts_to_sequences(x)
    y_tk = tokenize(y)
    preprocess_y = y_tk.texts_to_sequences(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    # Expanding dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk


(
    preproc_english_sentences,
    preproc_french_sentences,
    english_tokenizer,
    french_tokenizer,
) = preprocess(train_english_sentences, train_french_sentences)

max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print("Data Preprocessed.")
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Data Preprocessed.
Max English sentence length: 15
Max French sentence length: 21
English vocabulary size: 198
French vocabulary size: 344


# Create Model


The neural network outputs, for each position in the sentence, a score for every French word id, which isn't the final form we want. We want the French translation. The function logits_to_text will bridge the gap between the logits from the neural network and the French translation. You'll be using this function to better understand the output of the neural network.


In [28]:
def logits_to_text(logits, tokenizer):
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = "<PAD>"

    # For each position, pick the id with the highest score (argmax over the vocabulary axis),
    # then map that id back to its word
    return " ".join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])
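The decoding step in logits_to_text can be illustrated without a trained model. The sketch below uses a made-up four-word vocabulary and made-up per-position scores (neither comes from the dataset), and a hypothetical helper `greedy_decode` that performs the same argmax-then-lookup in plain Python.

```python
# Hypothetical toy vocabulary and per-position scores (not real model output).
index_to_word = {0: "<PAD>", 1: "il", 2: "est", 3: "calme"}
scores = [
    [0.10, 0.70, 0.10, 0.10],  # position 0: id 1 scores highest
    [0.10, 0.10, 0.60, 0.20],  # position 1: id 2 scores highest
    [0.20, 0.10, 0.10, 0.60],  # position 2: id 3 scores highest
    [0.90, 0.00, 0.05, 0.05],  # position 3: padding scores highest
]


def greedy_decode(rows, index_to_word):
    """Pick the best-scoring id at each position and join the matching words."""
    ids = [max(range(len(row)), key=row.__getitem__) for row in rows]
    return " ".join(index_to_word[i] for i in ids)


decoded = greedy_decode(scores, index_to_word)
print(decoded)  # il est calme <PAD>
```

Each position is decoded independently of the others, which is why padding tokens can appear in the output and have to be stripped afterwards.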

![Model](https://github.com/tommytracey/AIND-Capstone/raw/8267d4fe72e48c595a0aff46eaf0a805fff0f36d/images/embedding.png)


# Building Model

Here we use an RNN model built from GRU units for translation.
In the code section below, we give a simple model example. You can first run this model and play with it. Then you can change the model architecture by following Exercise 4 to get better results.


In [29]:
def embed_model(
    input_shape, output_sequence_length, english_vocab_size, french_vocab_size
):
    """
    Build an RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """

    ## TO_DO: Improve the layers (See Exercise 4)
    model = Sequential()
    model.add(
        Embedding(
            english_vocab_size,
            256,
            input_length=input_shape[1],
            input_shape=input_shape[1:],
        )
    )
    model.add(GRU(256, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation="relu")))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation="softmax")))

    return model

In [30]:
# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))

Finally calling the model function


In [31]:
# Hyperparameters
learning_rate = 0.005

In [32]:
simple_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index) + 1,
    len(french_tokenizer.word_index) + 1,
)


The model outputs, for every position in the sequence, a probability distribution over the French vocabulary, so the natural training targets would be one-hot encoded arrays. Our dataset contains integer tokens instead. Each one-hot encoded array has as many elements as the vocabulary has words, so it would be extremely wasteful to convert the entire dataset to one-hot encoded arrays. A better way is to use a so-called sparse cross-entropy loss function, which works directly on the integer labels and gives the same result as converting them to one-hot encoded arrays internally.
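A small worked example shows why the integer label is enough. The probabilities and label below are made up for illustration: with a one-hot target every term except the one for the true class is multiplied by zero, so the loss reduces to the negative log-probability of the true class.

```python
import math

# Hypothetical softmax output over a 3-word vocabulary, and the true word id.
probs = [0.1, 0.7, 0.2]
label = 1

# Dense (one-hot) cross-entropy: sum over all classes of -target * log(prob)
one_hot = [1.0 if i == label else 0.0 for i in range(len(probs))]
dense_loss = -sum(t * math.log(p) for t, p in zip(one_hot, probs))

# Sparse cross-entropy: look up the probability of the true class directly
sparse_loss = -math.log(probs[label])

print(dense_loss, sparse_loss)
```

Both values are the same, but the sparse form never materialises a vector of vocabulary size for each target token.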


In [33]:
# Compile model
simple_rnn_model.compile(
    loss=sparse_categorical_crossentropy,
    optimizer=Adam(learning_rate),
    metrics=["accuracy"],
)

In [34]:
simple_rnn_model.summary()
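As a rough cross-check on what the summary reports, the trainable parameter count can be worked out by hand. This sketch rests on two assumptions: the vocabulary sizes printed earlier (198 English and 344 French words, plus one id each for padding), and a GRU layer that keeps two bias vectors per gate, which is assumed here to be the Keras default (`reset_after=True`). If your Keras version differs, the GRU term may not match.

```python
# Assumed sizes, taken from the preprocessing output above plus one padding id.
english_vocab = 198 + 1
french_vocab = 344 + 1
embed_dim = 256
gru_units = 256
dense_units = 1024

# Embedding: one vector of size embed_dim per word id
embedding_params = english_vocab * embed_dim

# GRU: 3 gates, each with input weights, recurrent weights and two bias vectors
gru_params = 3 * (embed_dim * gru_units + gru_units * gru_units + 2 * gru_units)

# Dense layers: weights plus one bias per output unit (TimeDistributed adds none)
dense_params = gru_units * dense_units + dense_units
output_params = dense_units * french_vocab + french_vocab

total_params = embedding_params + gru_params + dense_params + output_params
print(embedding_params, gru_params, dense_params, output_params, total_params)
```

The Dropout layer contributes no parameters, and the output layer is the second-largest block simply because it has one unit per French word id.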

# Training the model

Here we train the model on the padded English sequences (tmp_x) and the preprocessed French sequences, holding out 20% of the samples for validation.


In [35]:
history = simple_rnn_model.fit(
    tmp_x, preproc_french_sentences, batch_size=1024, epochs=10, validation_split=0.2
)

Epoch 1/10



InvalidArgumentError: Graph execution error:

Detected at node sequential_1/gru_1/CudnnRNNV3 defined at (most recent call last):
  File "/home/ids/glorenzo-23/IA327-Generative-Models-for-NLP/venv/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 602, in call

  File "/home/ids/glorenzo-23/IA327-Generative-Models-for-NLP/venv/lib/python3.10/site-packages/keras/src/layers/rnn/rnn.py", line 402, in call

  File "/home/ids/glorenzo-23/IA327-Generative-Models-for-NLP/venv/lib/python3.10/site-packages/keras/src/layers/rnn/gru.py", line 569, in inner_loop

  File "/home/ids/glorenzo-23/IA327-Generative-Models-for-NLP/venv/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 484, in gru

  File "/home/ids/glorenzo-23/IA327-Generative-Models-for-NLP/venv/lib/python3.10/site-packages/keras/src/backend/tensorflow/rnn.py", line 741, in _cudnn_gru

Dnn is not supported
	 [[{{node sequential_1/gru_1/CudnnRNNV3}}]] [Op:__inference_multi_step_on_iterator_3977]

# Arbitrary Predictions

Try arbitrary examples from the corpus to see their translations.


In [None]:
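import re


def final_predictions(text):
    """Translate an English sentence with simple_rnn_model and return the French text."""
    # Map known words to ids; words missing from the training vocabulary are skipped
    # instead of raising a KeyError
    sentence = [
        english_tokenizer.word_index[word]
        for word in text.split()
        if word in english_tokenizer.word_index
    ]
    # Pad to the sequence length the model was built for
    sentence = pad_sequences(
        [sentence], maxlen=preproc_french_sentences.shape[-2], padding="post"
    )
    french_translation = logits_to_text(
        simple_rnn_model.predict(sentence[:1], verbose=0)[0], french_tokenizer
    )
    # Drop everything from the first <PAD> token onwards
    return re.split(r"\s*<PAD>", french_translation, maxsplit=1)[0]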

In [None]:
txt = train_english_sentences[0].lower()
print("English: ", train_english_sentences[0])
print("French: ", final_predictions(re.sub(r"[^\w]", " ", txt)))

English:  new jersey is sometimes quiet during autumn , and it is snowy in april .
French:  new jersey est parfois calme en l' automne et il est neigeux en avril


# Evaluation

In this section, we provide example code for evaluating translations with the BLEU metric.


In [None]:
# useful tokenization
import re
from functools import lru_cache


class BaseTokenizer:
    """A base dummy tokenizer to derive from."""

    def signature(self):
        """
        Returns a signature for the tokenizer.
        :return: signature string
        """
        return "none"

    def __call__(self, line):
        """
        Tokenizes an input line with the tokenizer.
        :param line: a segment to tokenize
        :return: the tokenized line
        """
        return line


class TokenizerRegexp(BaseTokenizer):
    def signature(self):
        return "re"

    def __init__(self):
        self._re = [
            # language-dependent part (assuming Western languages)
            (re.compile(r"([\{-\~\[-\` -\&\(-\+\:-\@\/])"), r" \1 "),
            # tokenize period and comma unless preceded by a digit
            (re.compile(r"([^0-9])([\.,])"), r"\1 \2 "),
            # tokenize period and comma unless followed by a digit
            (re.compile(r"([\.,])([^0-9])"), r" \1 \2"),
            # tokenize dash when preceded by a digit
            (re.compile(r"([0-9])(-)"), r"\1 \2 "),
            # one space only between words
            # NOTE: Doing this in Python (below) is faster
            # (re.compile(r'\s+'), r' '),
        ]

    @lru_cache(maxsize=2**16)
    def __call__(self, line):
        """Common post-processing tokenizer for `13a` and `zh` tokenizers.
        :param line: a segment to tokenize
        :return: the tokenized line
        """
        for _re, repl in self._re:
            line = _re.sub(repl, line)

        # no leading or trailing spaces, single space within words
        # return ' '.join(line.split())
        # This line is changed with regards to the original tokenizer (seen above) to return individual words
        return line.split()


class Tokenizer13a(BaseTokenizer):
    def signature(self):
        return "13a"

    def __init__(self):
        self._post_tokenizer = TokenizerRegexp()

    @lru_cache(maxsize=2**16)
    def __call__(self, line):
        """Tokenizes an input line using a relatively minimal tokenization
        that is however equivalent to mteval-v13a, used by WMT.

        :param line: a segment to tokenize
        :return: the tokenized line
        """

        # language-independent part:
        line = line.replace("<skipped>", "")
        line = line.replace("-\n", "")
        line = line.replace("\n", " ")

        if "&" in line:
            line = line.replace("&quot;", '"')
            line = line.replace("&amp;", "&")
            line = line.replace("&lt;", "<")
            line = line.replace("&gt;", ">")

        return self._post_tokenizer(f" {line} ")

In [None]:
import collections
import math


def get_ngrams(segment, max_order):
    """Extracts all n-grams up to a given maximum order from an input segment.

    Args:
      segment: text segment from which n-grams will be extracted.
      max_order: maximum length in tokens of the n-grams returned by this
          method.

    Returns:
      The Counter containing all n-grams up to max_order in segment
      with a count of how many times each n-gram occurred.
    """
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(0, len(segment) - order + 1):
            ngram = tuple(segment[i : i + order])
            ngram_counts[ngram] += 1
    return ngram_counts


def compute_bleu(reference_corpus, translation_corpus, max_order=4):
    """Computes BLEU score of translated segments against one or more references.

    Args:
      reference_corpus: list of lists of references for each translation. Each
          reference should be tokenized into a list of tokens.
      translation_corpus: list of translations to score. Each translation
          should be tokenized into a list of tokens.
      max_order: Maximum n-gram order to use when computing BLEU score.

    Returns:
      6-tuple with the BLEU score, the n-gram precisions, the brevity penalty,
      the translation/reference length ratio, the translation length and the
      reference length.
    """
    matches_by_order = [0] * max_order
    possible_matches_by_order = [0] * max_order
    reference_length = 0
    translation_length = 0
    for references, translation in zip(reference_corpus, translation_corpus):
        reference_length += min(len(r) for r in references)
        translation_length += len(translation)

        merged_ref_ngram_counts = collections.Counter()
        for reference in references:
            merged_ref_ngram_counts |= get_ngrams(reference, max_order)
        translation_ngram_counts = get_ngrams(translation, max_order)
        overlap = translation_ngram_counts & merged_ref_ngram_counts
        for ngram in overlap:
            matches_by_order[len(ngram) - 1] += overlap[ngram]
        for order in range(1, max_order + 1):
            possible_matches = len(translation) - order + 1
            if possible_matches > 0:
                possible_matches_by_order[order - 1] += possible_matches

    precisions = [0] * max_order
    for i in range(0, max_order):
        if possible_matches_by_order[i] > 0:
            precisions[i] = float(matches_by_order[i]) / possible_matches_by_order[i]
        else:
            precisions[i] = 0.0

    if min(precisions) > 0:
        ## TO_DO: compute the geometric mean of all modified precision scores
        p_log_sum = sum((math.log(p) for p in precisions)) / max_order
        geo_mean = math.exp(p_log_sum)
    else:
        geo_mean = 0

    ## TO_DO: compute the brevity penalty (BP)
    ratio = float(translation_length) / reference_length

    if ratio > 1.0:
        bp = 1.0
    else:
        bp = math.exp(1 - 1.0 / ratio)

    # final bleu score
    bleu = geo_mean * bp

    return (bleu, precisions, bp, ratio, translation_length, reference_length)
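The `&` and `|` operators on Counter objects do the real work in compute_bleu: `&` keeps the minimum count of each n-gram (clipping), and `|` keeps the maximum count across several references. The sketch below shows this for unigrams only, on made-up sentences that are not taken from the dataset.

```python
import collections

# Made-up candidate that repeats "est", plus two made-up references.
candidate = "il est est est calme".split()
reference_a = "il est calme en avril".split()
reference_b = "il est est calme".split()

cand_counts = collections.Counter(candidate)

# Single reference: "est" is credited at most once, as it occurs once in reference_a
overlap_single = cand_counts & collections.Counter(reference_a)
matches_single = sum(overlap_single.values())

# Two references: merge with | so each word may match up to its maximum reference count
merged_refs = collections.Counter(reference_a) | collections.Counter(reference_b)
overlap_merged = cand_counts & merged_refs
matches_merged = sum(overlap_merged.values())

print(matches_single / len(candidate))  # clipped unigram precision, one reference
print(matches_merged / len(candidate))  # clipped unigram precision, two references
```

Without clipping, the candidate would be credited for all three occurrences of "est", which would reward repeating a common word.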

In [None]:
# Evaluation
def compute_bleu_score(predictions, references, tokenizer=Tokenizer13a(), max_order=4):
    # if only one reference is provided make sure we still use list of lists
    if isinstance(references[0], str):
        references = [[ref] for ref in references]

    references = [[tokenizer(r) for r in ref] for ref in references]
    predictions = [tokenizer(p) for p in predictions]
    score = compute_bleu(
        reference_corpus=references, translation_corpus=predictions, max_order=max_order
    )
    (bleu, precisions, bp, ratio, translation_length, reference_length) = score
    return {
        "bleu": bleu,
        "precisions": precisions,
        "brevity_penalty": bp,
        "length_ratio": ratio,
        "translation_length": translation_length,
        "reference_length": reference_length,
    }

A small example of a real evaluation. Feel free to change the final_predictions function to make it more adaptable.


In [None]:
references = french_sentences[:5]
predictions = [
    final_predictions(re.sub(r"[^\w]", " ", txt)) for txt in train_english_sentences[:5]
]
compute_bleu_score(predictions, references, max_order=2)

{'bleu': 0.6045255919897695,
 'precisions': [0.7971014492753623, 0.578125],
 'brevity_penalty': 0.8905268465458593,
 'length_ratio': 0.8961038961038961,
 'translation_length': 69,
 'reference_length': 77}
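The reported score can be recomputed by hand from the other fields, which is a useful sanity check on the formula. The sketch below takes the precisions and lengths from the output above as given and applies the same two steps as compute_bleu: a geometric mean of the precisions and a brevity penalty based on the length ratio.

```python
import math

# Values copied from the evaluation output above.
precisions = [0.7971014492753623, 0.578125]
translation_length = 69
reference_length = 77

# Geometric mean of the n-gram precisions (max_order = 2 here)
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Brevity penalty: 1 if the translation is longer than the reference,
# otherwise exp(1 - reference_length / translation_length)
ratio = translation_length / reference_length
bp = 1.0 if ratio > 1.0 else math.exp(1 - 1.0 / ratio)

bleu = geo_mean * bp
print(round(geo_mean, 4), round(bp, 4), round(bleu, 4))
```

Because the translations are about 10% shorter than the references, the brevity penalty pulls the score below the geometric mean of the precisions.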

## Exercises:

- Please complete the code under **TO_DO**
- Complete the evaluation metrics (BLEU) and evaluate the whole dataset.
- Train with more epochs. Does it improve the translations?
- Change the architecture of the neural network. Does it improve the translations? For example:
  - change the number of GRU layers
  - change embedding-size
  - try Bidirectional-RNN
- Finally, please submit the notebook with the best architecture settings that you found and comment on your results.
