# Machine Translation

In this notebook, we translate English phrases into French using a recurrent neural network (RNN).


# Introduction

In this notebook, you will build a deep neural network that functions as part of an end-to-end machine translation pipeline. Your completed pipeline will accept English text as input and return the French translation.


In [408]:
# Now importing modules
import numpy as np

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model, Sequential
from keras.layers import (
    GRU,
    Input,
    Dense,
    TimeDistributed,
    Activation,
    RepeatVector,
    Bidirectional,
    Dropout,
    LSTM,
)
from keras.layers import Embedding
from keras.optimizers import Adam
from keras.losses import sparse_categorical_crossentropy

In [409]:
import tensorflow as tf

# Load Data

The small_vocab_en file contains English sentences, and the small_vocab_fr file contains their French translations. Load the English and French data from these files by running the cells below.


In [410]:
english_path = (
    "https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_en.txt"
)
french_path = (
    "https://raw.githubusercontent.com/projjal1/datasets/master/small_vocab_fr.txt"
)

Load each file and split it into lines.


In [411]:
import os


def load_data(path):
    input_file = os.path.join(path)
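    # The encoding is given explicitly so accented French characters decode the same way on every platform
    with open(input_file, "r", encoding="utf-8") as f: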
        data = f.read()

    return data.split("\n")

In [412]:
# Using helper to import dataset
english_data = tf.keras.utils.get_file("file1", english_path)
french_data = tf.keras.utils.get_file("file2", french_path)

In [413]:
# Now loading data
english_sentences = load_data(english_data)
french_sentences = load_data(french_data)

num_train_samples = 10000
num_test_samples = 1000

train_english_sentences = english_sentences[:num_train_samples]
train_french_sentences = french_sentences[:num_train_samples]

test_english_sentences = english_sentences[num_train_samples:num_train_samples + num_test_samples]
test_french_sentences = french_sentences[num_train_samples:num_train_samples + num_test_samples]

In [414]:
len(french_sentences), len(english_sentences)

(137860, 137860)

# Analysis of Dataset

Let us look at a few examples from the dataset in both languages.


In [415]:
for i in range(3):
    print("Sample :", i)
    print(train_english_sentences[i])
    print(train_french_sentences[i])
    print("-" * 50)

Sample : 0
new jersey is sometimes quiet during autumn , and it is snowy in april .
new jersey est parfois calme pendant l' automne , et il est neigeux en avril .
--------------------------------------------------
Sample : 1
the united states is usually chilly during july , and it is usually freezing in november .
les états-unis est généralement froid en juillet , et il gèle habituellement en novembre .
--------------------------------------------------
Sample : 2
california is usually quiet during march , and it is usually hot in june .
california est généralement calme en mars , et il est généralement chaud en juin .
--------------------------------------------------


# Convert to Vocabulary

The complexity of the problem is determined by the complexity of the vocabulary. A more complex vocabulary is a more complex problem. Let's look at the complexity of the dataset we'll be working with.


In [416]:
import collections

In [417]:
english_words_counter = collections.Counter(
    [word for sentence in train_english_sentences for word in sentence.split()]
)
french_words_counter = collections.Counter(
    [word for sentence in train_french_sentences for word in sentence.split()]
)

print("English Vocab:", len(english_words_counter))
print("French Vocab:", len(french_words_counter))

English Vocab: 226
French Vocab: 329


# Tokenize (IMPLEMENTATION)

Before a neural network can make predictions on text, the text has to be turned into data the network can process. Text data like "dog" is a sequence of ASCII character encodings. Since a neural network is a series of multiplication and addition operations, the input data needs to be numeric.

We can turn each character into a number or each word into a number. These are called character and word ids, respectively.

- Character ids are used for character level models that generate text predictions for each character.
- A word level model uses word ids that generate text predictions for each word. Word level models tend to learn better, since they are lower in complexity, so we'll use that.
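
To make the two options concrete, here is a minimal sketch in plain Python. The id assignments are illustrative only (first-seen order, starting at 1) and are not what Keras's Tokenizer would produce, since it orders ids by word frequency.

```python
# Toy sentence; ids are assigned in first-seen order, with 0 left free for padding.
sentence = "the quick brown fox jumps over the lazy dog"

# Word-level ids: one id per distinct word.
word_to_id = {}
for word in sentence.split():
    word_to_id.setdefault(word, len(word_to_id) + 1)
word_ids = [word_to_id[word] for word in sentence.split()]

# Character-level ids: one id per distinct character (the space included).
char_to_id = {}
for ch in sentence:
    char_to_id.setdefault(ch, len(char_to_id) + 1)
char_ids = [char_to_id[ch] for ch in sentence]

print("word ids:", word_ids)
print("sequence length (word vs. char):", len(word_ids), len(char_ids))
print("vocabulary size (word vs. char):", len(word_to_id), len(char_to_id))
```

The word-level sequence is much shorter than the character-level one, at the cost of a larger vocabulary.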

**TO_DO:** Turn each sentence into a sequence of word ids using Keras's Tokenizer class. Use this function to tokenize english_sentences and french_sentences in the cell below.


In [418]:
def tokenize(x):
    ## TO_DO:
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(x)
    return tokenizer

In [419]:
# Tokenize Sample output
text_sentences = [
    "The quick brown fox jumps over the lazy dog .",
    "By Jove , my quick study of lexicography won a prize .",
    "This is a short sentence .",
]

text_tokenizer = tokenize(text_sentences)
print(text_tokenizer.word_index)
text_tokenized = text_tokenizer.texts_to_sequences(text_sentences)

for sample_i, (sent, token_sent) in enumerate(zip(text_sentences, text_tokenized)):
    print("Sequence {} in x".format(sample_i + 1))
    print("  Input:  {}".format(sent))
    print("  Output: {}".format(token_sent))

{'the': 1, 'quick': 2, 'a': 3, 'brown': 4, 'fox': 5, 'jumps': 6, 'over': 7, 'lazy': 8, 'dog': 9, 'by': 10, 'jove': 11, 'my': 12, 'study': 13, 'of': 14, 'lexicography': 15, 'won': 16, 'prize': 17, 'this': 18, 'is': 19, 'short': 20, 'sentence': 21}
Sequence 1 in x
  Input:  The quick brown fox jumps over the lazy dog .
  Output: [1, 2, 4, 5, 6, 7, 1, 8, 9]
Sequence 2 in x
  Input:  By Jove , my quick study of lexicography won a prize .
  Output: [10, 11, 12, 2, 13, 14, 15, 16, 3, 17]
Sequence 3 in x
  Input:  This is a short sentence .
  Output: [18, 19, 3, 20, 21]


# Padding (IMPLEMENTATION)

When batching sequences of word ids together, each sequence needs to be the same length. Since sentences vary in length, we can add padding to the end of the sequences to make them the same length.

Make sure all the English sequences have the same length and all the French sequences have the same length by adding padding to the end of each sequence using Keras's pad_sequences function.
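
As a rough illustration of what post-padding does, here is a plain-Python sketch. The helper name `pad_post` is hypothetical and only covers the padding case; it does not truncate sequences that are longer than the target length, which Keras's pad_sequences does.

```python
def pad_post(sequences, length=None, pad_id=0):
    """Append pad_id to each id list until it reaches `length`.

    Hypothetical helper for illustration: sequences longer than `length`
    are returned unchanged rather than truncated.
    """
    if length is None:
        length = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_id] * (length - len(seq)) for seq in sequences]


# The three tokenized sample sentences from the tokenizer output above.
tokenized = [
    [1, 2, 4, 5, 6, 7, 1, 8, 9],
    [10, 11, 12, 2, 13, 14, 15, 16, 3, 17],
    [18, 19, 3, 20, 21],
]

padded = pad_post(tokenized)
for row in padded:
    print(row)
```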


In [420]:
def pad(x, length=None):
    ## TO_DO:
    text_padded = pad_sequences(x, maxlen=length, padding="post")
    return text_padded

In [421]:
def preprocess(x, y):
    """
    Preprocess x and y
    :param x: Feature List of sentences
    :param y: Label List of sentences
    :return: Tuple of (Preprocessed x, Preprocessed y, x tokenizer, y tokenizer)
    """
    x_tk = tokenize(x)
    preprocess_x = x_tk.texts_to_sequences(x)
    y_tk = tokenize(y)
    preprocess_y = y_tk.texts_to_sequences(y)

    preprocess_x = pad(preprocess_x)
    preprocess_y = pad(preprocess_y)

    # Keras's sparse_categorical_crossentropy function requires the labels to be in 3 dimensions
    # Expanding dimensions
    preprocess_y = preprocess_y.reshape(*preprocess_y.shape, 1)

    return preprocess_x, preprocess_y, x_tk, y_tk


(
    preproc_english_sentences,
    preproc_french_sentences,
    english_tokenizer,
    french_tokenizer,
) = preprocess(train_english_sentences, train_french_sentences)

max_english_sequence_length = preproc_english_sentences.shape[1]
max_french_sequence_length = preproc_french_sentences.shape[1]
english_vocab_size = len(english_tokenizer.word_index)
french_vocab_size = len(french_tokenizer.word_index)

print("Data Preprocessed.")
print("Max English sentence length:", max_english_sequence_length)
print("Max French sentence length:", max_french_sequence_length)
print("English vocabulary size:", english_vocab_size)
print("French vocabulary size:", french_vocab_size)

Data Preprocessed.
Max English sentence length: 15
Max French sentence length: 20
English vocabulary size: 198
French vocabulary size: 318


# Create Model


The neural network outputs, for each position in the sentence, a score for every word id in the French vocabulary, which isn't the final form we want. We want the French translation. The function logits_to_text bridges the gap between the logits from the neural network and the French translation. You'll be using this function to better understand the output of the neural network.
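
The decoding step can be sketched without any framework. The vocabulary and the scores below are invented for illustration; the idea is only that each row is reduced to the id with the highest score, and each id is mapped back to a word.

```python
# Hypothetical 4-word French vocabulary; id 0 is reserved for padding.
word_index = {"il": 1, "est": 2, "chaud": 3, "froid": 4}
index_to_word = {idx: word for word, idx in word_index.items()}
index_to_word[0] = "<PAD>"

# One row of scores per time step, one column per vocabulary id (0..4).
scores = [
    [0.05, 0.80, 0.05, 0.05, 0.05],  # highest score at id 1
    [0.05, 0.10, 0.70, 0.10, 0.05],  # highest score at id 2
    [0.05, 0.05, 0.10, 0.60, 0.20],  # highest score at id 3
    [0.90, 0.02, 0.03, 0.03, 0.02],  # highest score at id 0 (padding)
]


def argmax(row):
    """Return the index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


decoded = " ".join(index_to_word[argmax(row)] for row in scores)
print(decoded)  # il est chaud <PAD>
```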


In [422]:
def logits_to_text(logits, tokenizer):
    index_to_words = {id: word for word, id in tokenizer.word_index.items()}
    index_to_words[0] = "<PAD>"

    # For each position, pick the id with the highest score (argmax over the vocabulary axis),
    # then map each id back to its word
    return " ".join([index_to_words[prediction] for prediction in np.argmax(logits, 1)])

![Model](https://github.com/tommytracey/AIND-Capstone/raw/8267d4fe72e48c595a0aff46eaf0a805fff0f36d/images/embedding.png)


# Building Model

Here we use an RNN model built from GRU layers for translation.
In the code section below, we give a simple model example. You can first run this model and play with it. Then you can change the model architecture by following Exercise 4 to get better results.


In [423]:
embedding_size = 256

def embed_model(
    input_shape, output_sequence_length, english_vocab_size, french_vocab_size
):
    """
    Build (but do not train) an RNN model using word embedding on x and y
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param english_vocab_size: Number of unique English words in the dataset
    :param french_vocab_size: Number of unique French words in the dataset
    :return: Keras model built, but not trained
    """

    ## TO_DO: Improve the layers (See Exercise 4)
    model = Sequential()
    model.add(
        Embedding(
            english_vocab_size,
            embedding_size,
            input_length=input_shape[1],
            input_shape=input_shape[1:],
        )
    )
    model.add(GRU(embedding_size, return_sequences=True))
    model.add(GRU(embedding_size, return_sequences=True))
    model.add(GRU(embedding_size, return_sequences=True))
    model.add(TimeDistributed(Dense(1024, activation="relu")))
    model.add(Dropout(0.5))
    model.add(TimeDistributed(Dense(french_vocab_size, activation="softmax")))

    return model

In [424]:
# Reshaping the input to work with a basic RNN
tmp_x = pad(preproc_english_sentences, preproc_french_sentences.shape[1])
tmp_x = tmp_x.reshape((-1, preproc_french_sentences.shape[-2]))

Finally, call the model function.


In [425]:
# Hyperparameters
learning_rate = 0.005

In [426]:
simple_rnn_model = embed_model(
    tmp_x.shape,
    preproc_french_sentences.shape[1],
    len(english_tokenizer.word_index) + 1,
    len(french_tokenizer.word_index) + 1,
)

The model outputs, for each position, a probability distribution over the French vocabulary, which would normally be compared against one-hot encoded targets. Our dataset contains integer tokens instead of one-hot encoded arrays. Each one-hot encoded array has a large number of elements, so it would be extremely wasteful to convert the entire dataset to one-hot encoded arrays. A better way is to use a so-called sparse cross-entropy loss function, which does the conversion internally from integers to one-hot encoded arrays.
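
A small sketch for a single time step shows why the two forms give the same loss value. The function names and the probabilities are made up for this example and are not the Keras implementations.

```python
import math


def dense_cross_entropy(one_hot, probs):
    """Cross-entropy against a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


def sparse_cross_entropy(label, probs):
    """Cross-entropy against an integer label: only the true class matters."""
    return -math.log(probs[label])


probs = [0.1, 0.7, 0.2]    # predicted distribution over a 3-word vocabulary
label = 1                  # integer token, as stored in our dataset
one_hot = [0.0, 1.0, 0.0]  # the same target as a one-hot array

dense_loss = dense_cross_entropy(one_hot, probs)
sparse_loss = sparse_cross_entropy(label, probs)
print(dense_loss, sparse_loss)
```

All the zero entries of the one-hot vector contribute nothing to the sum, so storing only the integer label loses no information.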


In [427]:
# Compile model
simple_rnn_model.compile(
    loss=sparse_categorical_crossentropy,
    optimizer=Adam(learning_rate),
    metrics=["accuracy"],
)

In [428]:
simple_rnn_model.summary()

# Training the model

Here we train the model on the padded English sequences (tmp_x), with the padded French sequences as targets, and hold out 20% of the samples for validation.


In [429]:
history = simple_rnn_model.fit(
    tmp_x, preproc_french_sentences, batch_size=1024, epochs=30, validation_split=0.2
)

Epoch 1/30
8/8 ━━━━━━━━━━━━━━━━━━━━ 7s 313ms/step - accuracy: 0.2758 - loss: 5.0716 - val_accuracy: 0.4222 - val_loss: 2.7846
Epoch 2/30
8/8 ━━━━━━━━━━━━━━━━━━━━ 1s 128ms/step - accuracy: 0.4436 - loss: 2.6795 - val_accuracy: 0.4911 - val_loss: 2.1769
Epoch 3/30
8/8 ━━━━━━━━━━━━━━━━━━━━ 1s 127ms/step - accuracy: 0.5048 - loss: 2.0796 - val_accuracy: 0.5527 - val_loss: 1.7392
Epoch 4/30
8/8 ━━━━━━━━━━━━━━━━━━━━ 1s 127ms/step - accuracy: 0.5702 - loss: 1.7008 - val_accuracy: 0.6270 - val_loss: 1.4501
Epoch 5/30
8/8 ━━━━━━━━━━━━━━━━━━━━ 1s 126ms/step - accuracy: 0.6156 - loss: 1.5052 - val_accuracy: 0.6618 - val_loss: 1.2859
Epoch 6/30
8/8 ━━━━━━━━━━━━━━━━━━━━ 1s 138ms/step - accuracy: 0.6646 - loss: 1.2758 - val_accuracy: 0.7033 - val_loss: 1.0989
Epoch 7/30
8/8 ━━━━━━━━━━━━

# Arbitrary Predictions

Try arbitrary examples from the corpus to see the translation.


In [430]:
import re


def final_predictions(text):
    # Map each known word to its id; words missing from the training vocabulary are skipped
    # (indexing word_index directly would raise a KeyError for them)
    sentence = [
        english_tokenizer.word_index[word]
        for word in text.split()
        if word in english_tokenizer.word_index
    ]
    sentence = pad_sequences(
        [sentence], maxlen=preproc_french_sentences.shape[-2], padding="post"
    )
    french_translation = logits_to_text(
        simple_rnn_model.predict(sentence[:1], verbose=0)[0], french_tokenizer
    )
    # Keep only the text before the first padding token
    return re.split(r"\s*<PAD>", french_translation, maxsplit=1)[0]

In [431]:
txt = train_english_sentences[0].lower()
print("English: ", train_english_sentences[0])
print("French: ", final_predictions(re.sub(r"[^\w]", " ", txt)))

English:  new jersey is sometimes quiet during autumn , and it is snowy in april .
French:  new jersey est parfois calme pendant l' automne et il est neigeux en avril


# Evaluation

In this section, we provide example code for evaluating the translations with the BLEU score.


In [432]:
# useful tokenization
import re
from functools import lru_cache


class BaseTokenizer:
    """A base dummy tokenizer to derive from."""

    def signature(self):
        """
        Returns a signature for the tokenizer.
        :return: signature string
        """
        return "none"

    def __call__(self, line):
        """
        Tokenizes an input line with the tokenizer.
        :param line: a segment to tokenize
        :return: the tokenized line
        """
        return line


class TokenizerRegexp(BaseTokenizer):
    def signature(self):
        return "re"

    def __init__(self):
        self._re = [
            # language-dependent part (assuming Western languages)
            (re.compile(r"([\{-\~\[-\` -\&\(-\+\:-\@\/])"), r" \1 "),
            # tokenize period and comma unless preceded by a digit
            (re.compile(r"([^0-9])([\.,])"), r"\1 \2 "),
            # tokenize period and comma unless followed by a digit
            (re.compile(r"([\.,])([^0-9])"), r" \1 \2"),
            # tokenize dash when preceded by a digit
            (re.compile(r"([0-9])(-)"), r"\1 \2 "),
            # one space only between words
            # NOTE: Doing this in Python (below) is faster
            # (re.compile(r'\s+'), r' '),
        ]

    @lru_cache(maxsize=2**16)
    def __call__(self, line):
        """Common post-processing tokenizer for `13a` and `zh` tokenizers.
        :param line: a segment to tokenize
        :return: the tokenized line
        """
        for _re, repl in self._re:
            line = _re.sub(repl, line)

        # no leading or trailing spaces, single space within words
        # return ' '.join(line.split())
        # This line is changed with regards to the original tokenizer (seen above) to return individual words
        return line.split()


class Tokenizer13a(BaseTokenizer):
    def signature(self):
        return "13a"

    def __init__(self):
        self._post_tokenizer = TokenizerRegexp()

    @lru_cache(maxsize=2**16)
    def __call__(self, line):
        """Tokenizes an input line using a relatively minimal tokenization
        that is however equivalent to mteval-v13a, used by WMT.

        :param line: a segment to tokenize
        :return: the tokenized line
        """

        # language-independent part:
        line = line.replace("<skipped>", "")
        line = line.replace("-\n", "")
        line = line.replace("\n", " ")

        if "&" in line:
            line = line.replace("&quot;", '"')
            line = line.replace("&amp;", "&")
            line = line.replace("&lt;", "<")
            line = line.replace("&gt;", ">")

        return self._post_tokenizer(f" {line} ")

In [433]:
import collections
import math


def get_ngrams(segment, max_order):
    """Extracts all n-grams up to a given maximum order from an input segment.

    Args:
      segment: text segment from which n-grams will be extracted.
      max_order: maximum length in tokens of the n-grams returned by this
          method.

    Returns:
      The Counter containing all n-grams up to max_order in segment
      with a count of how many times each n-gram occurred.
    """
    ngram_counts = collections.Counter()
    for order in range(1, max_order + 1):
        for i in range(0, len(segment) - order + 1):
            ngram = tuple(segment[i : i + order])
            ngram_counts[ngram] += 1
    return ngram_counts


def compute_bleu(reference_corpus, translation_corpus, max_order=4):
    """Computes BLEU score of translated segments against one or more references.

    Args:
      reference_corpus: list of lists of references for each translation. Each
          reference should be tokenized into a list of tokens.
      translation_corpus: list of translations to score. Each translation
          should be tokenized into a list of tokens.
      max_order: Maximum n-gram order to use when computing BLEU score.

    Returns:
      6-tuple with the BLEU score, the n-gram precisions, the brevity penalty,
      the translation/reference length ratio, the translation length and the
      reference length.
    """
    matches_by_order = [0] * max_order
    possible_matches_by_order = [0] * max_order
    reference_length = 0
    translation_length = 0
    for references, translation in zip(reference_corpus, translation_corpus):
        reference_length += min(len(r) for r in references)
        translation_length += len(translation)

        merged_ref_ngram_counts = collections.Counter()
        for reference in references:
            merged_ref_ngram_counts |= get_ngrams(reference, max_order)
        translation_ngram_counts = get_ngrams(translation, max_order)
        overlap = translation_ngram_counts & merged_ref_ngram_counts
        for ngram in overlap:
            matches_by_order[len(ngram) - 1] += overlap[ngram]
        for order in range(1, max_order + 1):
            possible_matches = len(translation) - order + 1
            if possible_matches > 0:
                possible_matches_by_order[order - 1] += possible_matches

    precisions = [0] * max_order
    for i in range(0, max_order):
        if possible_matches_by_order[i] > 0:
            precisions[i] = float(matches_by_order[i]) / possible_matches_by_order[i]
        else:
            precisions[i] = 0.0

    if min(precisions) > 0:
        ## TO_DO: compute the geometric mean of all modified precision scores
        p_log_sum = sum((math.log(p) for p in precisions)) / max_order
        geo_mean = math.exp(p_log_sum)
    else:
        geo_mean = 0

    ## TO_DO: compute the brevity penalty (BP)
    ratio = float(translation_length) / reference_length

    if ratio > 1.0:
        bp = 1.0
    elif ratio == 0:
        bp = 0
    else:
        bp = math.exp(1 - 1.0 / ratio)

    # final bleu score
    bleu = geo_mean * bp

    return (bleu, precisions, bp, ratio, translation_length, reference_length)
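
The `&` and `|` operators on Counter objects do the heavy lifting in compute_bleu. The sketch below, with a hypothetical helper `ngram_counts` and a made-up sentence pair, shows how `&` clips each n-gram count to the number of times it appears in the reference, which is what makes the precision "modified".

```python
from collections import Counter


def ngram_counts(tokens, max_order):
    """Count every n-gram of length 1..max_order (illustrative helper)."""
    counts = Counter()
    for order in range(1, max_order + 1):
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts


reference = "the cat is on the mat".split()
translation = "the the the cat".split()

ref_counts = ngram_counts(reference, 1)
hyp_counts = ngram_counts(translation, 1)

# Counter & Counter keeps the minimum count per key: "the" appears 3 times in
# the translation but only 2 times in the reference, so it is credited twice.
overlap = hyp_counts & ref_counts
matches = sum(overlap.values())
unigram_precision = matches / len(translation)

print(overlap)
print(unigram_precision)  # 0.75
```

Without clipping, repeating a common word would inflate the score; here the third "the" earns no credit.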

In [434]:
# Evaluation
def compute_bleu_score(predictions, references, tokenizer=Tokenizer13a(), max_order=4):
    # if only one reference is provided make sure we still use list of lists
    if isinstance(references[0], str):
        references = [[ref] for ref in references]

    references = [[tokenizer(r) for r in ref] for ref in references]
    predictions = [tokenizer(p) for p in predictions]
    score = compute_bleu(
        reference_corpus=references, translation_corpus=predictions, max_order=max_order
    )
    (bleu, precisions, bp, ratio, translation_length, reference_length) = score
    return {
        "bleu": bleu,
        "precisions": precisions,
        "brevity_penalty": bp,
        "length_ratio": ratio,
        "translation_length": translation_length,
        "reference_length": reference_length,
    }

A small example of a real evaluation. Feel free to change the final_predictions function to make it more adaptable.


In [435]:
references = train_french_sentences[:5]
predictions = [
    final_predictions(re.sub(r"[^\w]", " ", txt)) for txt in train_english_sentences[:5]
]
compute_bleu_score(predictions, references, max_order=2)

{'bleu': 0.631832870017691,
 'precisions': [0.8260869565217391, 0.609375],
 'brevity_penalty': 0.8905268465458593,
 'length_ratio': 0.8961038961038961,
 'translation_length': 69,
 'reference_length': 77}
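
As a sanity check, the score above can be recomputed by hand from the reported precisions and lengths. This sketch only reuses the numbers printed in the output and the same formulas as compute_bleu (geometric mean of the precisions, multiplied by the brevity penalty).

```python
import math

# Values copied from the output above (max_order=2).
precisions = [0.8260869565217391, 0.609375]
translation_length = 69
reference_length = 77

# Geometric mean of the n-gram precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Brevity penalty: translations shorter than the reference are penalised.
ratio = translation_length / reference_length
bp = 1.0 if ratio > 1.0 else math.exp(1 - 1.0 / ratio)

bleu = geo_mean * bp
print(round(geo_mean, 4), round(bp, 4), round(bleu, 4))
```

The predictions are shorter than the references here partly because punctuation is stripped from the input before translation, which is why the brevity penalty is below 1.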

# Exercises:

- Please complete the code under **TO_DO**
- Complete the evaluation metrics (BLEU) and evaluate the whole dataset.
- Train with more epochs. Does it improve the translations?
- Change the architecture of the neural network. Does it improve the translations? For example:
  - change the number of GRU layers
  - change embedding-size
  - try Bidirectional-RNN
- Finally, please submit the notebook with the best architecture settings that you found and comment on your results.


# Results on test set

In [436]:
from tqdm import tqdm

predictions = []
for eng_sentence in tqdm(test_english_sentences):
    translation = final_predictions(re.sub(r"[^\w']", " ", eng_sentence))
    predictions.append(translation)

100%|██████████| 1000/1000 [01:10<00:00, 14.25it/s]


In [437]:
assert len(predictions) == len(test_french_sentences)

In [438]:
bleu_score = compute_bleu_score(predictions, test_french_sentences, max_order=2)["bleu"]
print(bleu_score)

0.6930877784853977


# Analysis

These are the full results obtained with various combinations of model architecture (all the models were trained for 30 epochs):

|embedding_size|n_GRU_layers|bidirectional|bleu_score        |
|--------------|------------|-------------|------------------|
|128           |1           |True         |0.7588480144440823|
|128           |2           |True         |0.7602675090397024|
|128           |3           |True         |0.7521967525891776|
|128           |1           |False        |0.6872406801092699|
|128           |2           |False        |0.6867257598281586|
|128           |3           |False        |0.6861980883307213|
|256           |1           |True         |0.7569880700203272|
|256           |2           |True         |0.7606519587688819|
|256           |3           |True         |0.4371917495200416|
|256           |1           |False        |0.6987764937535867|
|256           |2           |False        |0.6937211622615038|
|256           |3           |False        |0.6930877784853977|

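Looking at the results we can observe that:

- **Bidirectional models outperform unidirectional models in five of the six paired configurations.**  
  - Example: at 128-dim with 2 layers, **bidirectional (0.7603)** vs. **unidirectional (0.6867)**.  
  - The exception is the 256-dim, 3-layer bidirectional model, whose **BLEU drops to 0.4372**, below its unidirectional counterpart (0.6931).
- **2-layer GRUs achieve the best BLEU scores among the bidirectional models.**  
  - Adding a **third layer reduces performance**, slightly at 128-dim (0.7522) and sharply at 256-dim (0.4372).
  - For unidirectional models the number of layers makes little difference, with the 1-layer models marginally ahead.
- **Increasing embedding size (128 → 256) has minimal impact** on BLEU.  
  - Example: **BLEU = 0.7603 (128-dim, 2-layer, bidirectional)** vs. **0.7607 (256-dim, same config).**

The highest score overall is 0.7607 (256-dim, 2-layer, bidirectional), but the **128-dim, 2-layer, bidirectional GRU** (**BLEU = 0.7603**) is within 0.0004 of it with a smaller embedding, so it is the best balance of performance and efficiency.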
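
The comparisons can be double-checked with a few lines of Python. The dictionary below simply copies the table; the variable names are chosen for this sketch.

```python
# (embedding_size, n_GRU_layers, bidirectional) -> BLEU, copied from the table.
results = {
    (128, 1, True): 0.7588480144440823,
    (128, 2, True): 0.7602675090397024,
    (128, 3, True): 0.7521967525891776,
    (128, 1, False): 0.6872406801092699,
    (128, 2, False): 0.6867257598281586,
    (128, 3, False): 0.6861980883307213,
    (256, 1, True): 0.7569880700203272,
    (256, 2, True): 0.7606519587688819,
    (256, 3, True): 0.4371917495200416,
    (256, 1, False): 0.6987764937535867,
    (256, 2, False): 0.6937211622615038,
    (256, 3, False): 0.6930877784853977,
}

# Configuration with the highest BLEU score.
best_config = max(results, key=results.get)

# Number of (embedding, layers) pairs where the bidirectional variant wins.
pairs = [(emb, layers) for emb in (128, 256) for layers in (1, 2, 3)]
bidirectional_wins = sum(
    results[(emb, layers, True)] > results[(emb, layers, False)]
    for emb, layers in pairs
)

print("best configuration:", best_config)
print("bidirectional wins:", bidirectional_wins, "of", len(pairs))
```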
