# Neural Machine Translation

This project focuses on building an English-to-Portuguese Neural Machine Translation (NMT) model using Long Short-Term Memory (LSTM) networks with attention. Machine translation is a significant task in natural language processing, with applications not only in translating languages but also in word sense disambiguation (e.g., determining whether "bank" refers to a financial institution or the side of a river). While Recurrent Neural Networks (RNNs) with LSTMs can handle short to medium-length sentences, they often struggle with very long sequences due to vanishing gradients. To mitigate this, an attention mechanism is introduced, allowing the decoder to access all relevant parts of the input sentence, regardless of its length.

The project involves:

1. Implementing an encoder-decoder system with attention.
2. Building the NMT model from scratch using TensorFlow.
3. Generating translations using greedy and Minimum Bayes Risk (MBR) decoding.

## Table of Contents
- [1 - Data Preparation](#1)
- [2 - NMT model with attention](#2)
- [3 - Training](#3)
- [4 - Using the model for inference](#4)
- [5 - Minimum Bayes-Risk Decoding](#5)


In [1]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3' # Setting this env variable prevents TF warnings from showing up

import numpy as np
import tensorflow as tf
from collections import Counter
from utils import (sentences, train_data, val_data, english_vectorizer, portuguese_vectorizer, 
                   masked_loss, masked_acc, tokens_to_text)

In [3]:
import w1_unittest

<a name="1"></a>
## 1. Data Preparation

The text pre-processing bits have already been taken care of (if you are interested in this be sure to check the `utils.py` file). The steps performed can be summarized as:

- Reading the raw data from the text files
- Cleaning the data (lowercasing, adding spaces around punctuation, trimming whitespace, etc.)
- Splitting it into training and validation sets
- Adding the start-of-sentence and end-of-sentence tokens to every sentence
- Tokenizing the sentences
- Creating a TensorFlow dataset out of the tokenized sentences
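The cleaning and token-adding steps above can be sketched in plain Python. This is not the code from `utils.py`; the `preprocess` helper and the set of punctuation marks it handles are assumptions made for illustration only:

```python
import re

def preprocess(sentence):
    """Lowercase, space out punctuation, trim whitespace, add [SOS]/[EOS]."""
    text = sentence.lower()
    # Put a space on both sides of common punctuation marks
    text = re.sub(r"([.,!?;:])", r" \1 ", text)
    # Collapse runs of whitespace and trim both ends
    text = re.sub(r"\s+", " ", text).strip()
    return f"[SOS] {text} [EOS]"

print(preprocess("I love languages!"))  # [SOS] i love languages ! [EOS]
```

Separating punctuation from words lets the tokenizer treat `.` or `!` as tokens of their own instead of fusing them with the preceding word.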

In [2]:
portuguese_sentences, english_sentences = sentences

print(f"English (to translate) sentence:\n\n{english_sentences[-5]}\n")
print(f"Portuguese (translation) sentence:\n\n{portuguese_sentences[-5]}")

English (to translate) sentence:

No matter how much you try to convince people that chocolate is vanilla, it'll still be chocolate, even though you may manage to convince yourself and a few others that it's vanilla.

Portuguese (translation) sentence:

Não importa o quanto você tenta convencer os outros de que chocolate é baunilha, ele ainda será chocolate, mesmo que você possa convencer a si mesmo e poucos outros de que é baunilha.


The raw sentences are not needed from here on, so delete them to save memory:

In [4]:
del portuguese_sentences
del english_sentences
del sentences

Notice that an `english_vectorizer` and a `portuguese_vectorizer` are imported from `utils.py`. These are created using [tf.keras.layers.TextVectorization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) and they provide useful features such as ways to inspect the vocabulary and to convert text into token ids and vice versa.

In [5]:
print(f"First 10 words of the english vocabulary:\n\n{english_vectorizer.get_vocabulary()[:10]}\n")
print(f"First 10 words of the portuguese vocabulary:\n\n{portuguese_vectorizer.get_vocabulary()[:10]}")

First 10 words of the english vocabulary:

['', '[UNK]', '[SOS]', '[EOS]', '.', 'tom', 'i', 'to', 'you', 'the']

First 10 words of the portuguese vocabulary:

['', '[UNK]', '[SOS]', '[EOS]', '.', 'tom', 'que', 'o', 'nao', 'eu']


Notice that the first 4 entries are reserved for special tokens. In order, these are:

- the empty string
- a special token to represent an unknown word
- a special token to represent the start of a sentence
- a special token to represent the end of a sentence

You can see how many words are in a vocabulary by using the `vocabulary_size` method:

In [6]:
# Size of the vocabulary
vocab_size_por = portuguese_vectorizer.vocabulary_size()
vocab_size_eng = english_vectorizer.vocabulary_size()

print(f"Portuguese vocabulary is made up of {vocab_size_por} words")
print(f"English vocabulary is made up of {vocab_size_eng} words")

Portuguese vocabulary is made up of 12000 words
English vocabulary is made up of 12000 words


The following [tf.keras.layers.StringLookup](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup) objects map words to ids and vice versa. They are built from the Portuguese vocabulary, since this will be useful later on when decoding the predictions from your model:

In [7]:
# This helps you convert from words to ids
word_to_id = tf.keras.layers.StringLookup(
    vocabulary=portuguese_vectorizer.get_vocabulary(), 
    mask_token="", 
    oov_token="[UNK]"
)

# This helps you convert from ids to words
id_to_word = tf.keras.layers.StringLookup(
    vocabulary=portuguese_vectorizer.get_vocabulary(),
    mask_token="",
    oov_token="[UNK]",
    invert=True,
)

In [8]:
unk_id = word_to_id("[UNK]")
sos_id = word_to_id("[SOS]")
eos_id = word_to_id("[EOS]")
baunilha_id = word_to_id("baunilha")

print(f"The id for the [UNK] token is {unk_id}")
print(f"The id for the [SOS] token is {sos_id}")
print(f"The id for the [EOS] token is {eos_id}")
print(f"The id for baunilha (vanilla) is {baunilha_id}")

The id for the [UNK] token is 1
The id for the [SOS] token is 2
The id for the [EOS] token is 3
The id for baunilha (vanilla) is 7079
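To make the lookup behaviour concrete, here is a minimal plain-Python sketch of a word-to-id mapping with an `[UNK]` fallback. It only mimics the idea behind `StringLookup`; the helper names are hypothetical, and the toy vocabulary holds just the first 10 Portuguese entries printed earlier, so `baunilha` falls outside it and maps to `[UNK]` here:

```python
# Toy vocabulary: the first 10 entries of the Portuguese vocabulary shown above
vocab = ["", "[UNK]", "[SOS]", "[EOS]", ".", "tom", "que", "o", "nao", "eu"]

word_to_index = {word: idx for idx, word in enumerate(vocab)}
UNK_ID = word_to_index["[UNK]"]

def words_to_ids(words):
    """Map each word to its id, falling back to the [UNK] id."""
    return [word_to_index.get(word, UNK_ID) for word in words]

def ids_to_words(ids):
    """Map each id back to its word, falling back to the [UNK] token."""
    return [vocab[i] if 0 <= i < len(vocab) else "[UNK]" for i in ids]

ids = words_to_ids(["[SOS]", "eu", "nao", "baunilha", ".", "[EOS]"])
print(ids)                # [2, 9, 8, 1, 4, 3]
print(ids_to_words(ids))
```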


Finally, here is the data that will be fed to the neural network. Both `train_data` and `val_data` are of type `tf.data.Dataset` and are already arranged in batches of 64 examples. To get the first batch out of a tf dataset you can use the `take` method. To get the first example out of the batch, slice the tensor and use the `numpy` method for nicer printing:

In [9]:
for (to_translate, sr_translation), translation in train_data.take(1):
    print(f"Tokenized english sentence:\n{to_translate[0, :].numpy()}\n\n")
    print(f"Tokenized portuguese sentence (shifted to the right):\n{sr_translation[0, :].numpy()}\n\n")
    print(f"Tokenized portuguese sentence:\n{translation[0, :].numpy()}\n\n")

Tokenized english sentence:
[   2  210    9  146  123   38    9 1672    4    3    0    0    0    0]


Tokenized portuguese sentence (shifted to the right):
[   2 1085    7  128   11  389   37 2038    4    0    0    0    0    0
    0]


Tokenized portuguese sentence:
[1085    7  128   11  389   37 2038    4    3    0    0    0    0    0
    0]




- Padding has already been applied to the tensors and the value used for this is 0
- Each example consists of 3 different tensors:
    - The sentence to translate
    - The shifted-to-the-right translation
    - The translation
    
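The first two can be considered the features, while the third one is the target. With this arrangement your model can perform Teacher Forcing as you saw in the lectures: at each step the decoder is fed the ground-truth previous token rather than its own prediction.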
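The relationship between the two Portuguese tensors can be reproduced with a short sketch. The `make_decoder_pair` helper is hypothetical (the real dataset is built in `utils.py`); it assumes the ids 2, 3 and 0 for `[SOS]`, `[EOS]` and padding, as printed above:

```python
SOS_ID, EOS_ID, PAD_ID = 2, 3, 0

def make_decoder_pair(token_ids, length):
    """Build the decoder input (shifted right) and the target, padded to `length`."""
    shifted = [SOS_ID] + token_ids   # what the decoder sees at each step
    target = token_ids + [EOS_ID]    # what the decoder should predict at each step
    pad = [PAD_ID] * (length - len(shifted))
    return shifted + pad, target + pad

# Token ids of the example sentence printed above, without special tokens
tokens = [1085, 7, 128, 11, 389, 37, 2038, 4]
shifted, target = make_decoder_pair(tokens, length=15)
print(shifted)
print(target)
```

At position `i` the decoder receives `shifted[i]` and is trained to output `target[i]`, which is the next token of the reference translation.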


<a name="2"></a>
## 2. NMT Model with Attention

This project involves building an encoder-decoder architecture for a Neural Machine Translation (NMT) model. The model uses a Recurrent Neural Network (RNN) that processes a tokenized sentence in the encoder, which is then passed to the decoder for translation. As discussed in the background, a regular sequence-to-sequence model with LSTMs is effective for short to medium sentences but tends to degrade in performance with longer sentences. This degradation occurs because all the context of the input sentence is compressed into a single vector passed to the decoder block, as illustrated below. For very long sentences (e.g., 100 tokens or more), the early parts of the input have minimal impact on the final vector passed to the decoder, leading to translation issues.

<img src='images/plain_rnn.png'>

To address this issue, an attention layer is added to the model, allowing the decoder to access all parts of the input sentence. Consider a 4-word input sentence as shown below. A hidden state is produced at each timestep of the encoder (represented by the orange rectangles). These hidden states are passed to the attention layer, where each is scored based on the current activation (i.e., hidden state) of the decoder. For instance, after the first prediction, "como," is made, the attention layer receives all the encoder hidden states (orange rectangles) and the decoder hidden state corresponding to "como" (first green rectangle). Based on this, it scores the encoder hidden states to determine which should be emphasized for the next word prediction. Through training, the model may learn to align with the second encoder hidden state, assigning a high probability to the word "você." In greedy decoding, this word is output as the next symbol, and the process repeats until an end-of-sentence prediction is reached.

<img src='images/attention_overview.png'>

This project uses Scaled Dot Product Attention, defined as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

This equation will be explored in more detail later, but for now, it can be understood as computing scores from the queries (Q) and keys (K), normalizing them with a softmax, and using the result to weight the values (V), which yields a context vector at each timestep of the decoder. This context vector is fed into the decoder RNN to generate a set of probabilities for the next predicted word. Dividing by the square root of the keys' dimensionality ($\sqrt{d_k}$) keeps the dot products from growing with the vector size, which would otherwise push the softmax into regions with very small gradients; this will be further discussed later. In this machine translation task, the encoder activations (encoder hidden states) serve as the keys and values, while the decoder activations (decoder hidden states) serve as the queries.
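As an illustration of the formula, the sketch below computes scaled dot-product attention for a single query (one decoder timestep) in plain Python. It is a toy version for intuition, not the Keras implementation; the function names and the example vectors are made up:

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for one query vector."""
    d_k = len(query)
    # One score per key: dot(query, key) / sqrt(d_k)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Context vector: weighted sum of the value vectors
    dim = len(values[0])
    context = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return weights, context

# Three toy encoder hidden states act as both keys and values
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [0.0, 2.0]  # the query
weights, context = attend(decoder_state, encoder_states, encoder_states)
print(weights)
print(context)
```

The encoder states that point in the same direction as the query receive the largest weights, so they dominate the context vector.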

Despite the complexity of this architecture, the implementation can be achieved with just a few lines of code.

Two important global variables are defined:

- The size of the vocabulary.
- The number of units in the LSTM layers (consistent across all LSTM layers).

For this project, the vocabulary sizes for English and Portuguese are the same, so a single constant, `VOCAB_SIZE`, is used throughout. In other settings, vocabulary sizes might differ, but that is not the case here.

In [10]:
VOCAB_SIZE = 12000
UNITS = 256


<a name="ex1"></a>
## Encoder

The first part of the project involves implementing the encoder component of the neural network. The `Encoder` class will be completed by defining the necessary sublayers in the constructor (`__init__` method) and then utilizing these sublayers in the forward pass (`call` method).

The encoder consists of the following layers:

- [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding): This layer requires the definition of appropriate `input_dim` and `output_dim` values. Additionally, it needs to be informed that '0' is used for padding, which can be achieved by setting the `mask_zero` parameter accordingly.

- [Bidirectional](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional) [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM): TensorFlow provides functionality for implementing bidirectional behavior in RNN-like layers. While the bidirectional part is pre-configured, specifying the correct layer type and parameters is necessary. The number of units must be set appropriately, and the LSTM should return the full sequence rather than just the last output, which can be controlled using the `return_sequences` parameter.

The forward pass will be defined using TensorFlow's [functional API](https://www.tensorflow.org/guide/keras/functional_api), which allows for chaining function calls to construct the network, as shown in the following example:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Each layer is called on the output of the previous one
encoder_input = keras.Input(shape=(28, 28, 1), name="original_img")
x = layers.Conv2D(16, 3, activation="relu")(encoder_input)
x = layers.MaxPooling2D(3)(x)
x = layers.Conv2D(16, 3, activation="relu")(x)
encoder_output = layers.GlobalMaxPooling2D()(x)
```

In [11]:
class Encoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super(Encoder, self).__init__()

        self.embedding = tf.keras.layers.Embedding(  
            input_dim=vocab_size,
            output_dim=units,
            mask_zero=True
        )  

        self.rnn = tf.keras.layers.Bidirectional(  
            merge_mode="sum",  
            layer=tf.keras.layers.LSTM(
                units=units,
                return_sequences= True
            ),  
        )  

    def call(self, context):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): The sentence to translate

        Returns:
            tf.Tensor: Encoded sentence to translate
        """

        x = self.embedding(context)

        x = self.rnn(x)

        return x

In [12]:
encoder = Encoder(VOCAB_SIZE, UNITS)

encoder_output = encoder(to_translate)

print(f'Tensor of sentences in english has shape: {to_translate.shape}\n')
print(f'Encoder output has shape: {encoder_output.shape}')

Tensor of sentences in english has shape: (64, 14)

Encoder output has shape: (64, 14, 256)
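The encoder output keeps 256 features per token even though the LSTM is bidirectional, because the `Bidirectional` wrapper above is configured with `merge_mode="sum"`; the Keras default, `"concat"`, would double the feature dimension. The following plain-Python sketch (not Keras code; `merge` is a hypothetical helper) illustrates the difference on toy vectors:

```python
def merge(forward, backward, mode):
    """Combine forward and backward hidden states for one timestep."""
    if mode == "sum":
        return [f + b for f, b in zip(forward, backward)]  # same size as each input
    if mode == "concat":
        return forward + backward                          # twice the size
    raise ValueError(f"unsupported mode: {mode}")

forward_state = [0.1, 0.2, 0.3]
backward_state = [0.4, 0.5, 0.6]
print(len(merge(forward_state, backward_state, "sum")))     # 3
print(len(merge(forward_state, backward_state, "concat")))  # 6
```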



<a name="ex2"></a>
## CrossAttention

The next part of the project involves implementing the layer that performs cross-attention between the original sentences and their translations. The `CrossAttention` class will be completed by defining the necessary sublayers in the constructor (`__init__` method) and then utilizing these sublayers in the forward pass (`call` method). Some elements of this process are already provided.

The cross-attention layer includes the following components:

- [MultiHeadAttention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/MultiHeadAttention): This layer requires defining the appropriate `key_dim`, which is the size of the key and query tensors. The number of heads should be set to 1, since a single attention head between the two tensors is all that is needed here. This layer is preferred over [Attention](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Attention) because it is simpler to call during the forward pass.

Key details to note:
- The output of the attention mechanism needs to be combined with the shifted-to-the-right translation (since this cross-attention occurs on the decoder side). This is done using an [Add](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Add) layer to maintain the original dimensionality, which wouldn't be preserved with a [Concatenate](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Concatenate) layer.

- For enhanced network stability, [LayerNormalization](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LayerNormalization) is also applied.

There is no need to address these final steps, as they have already been implemented.
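For intuition, the add-and-normalize step can be written out for a single timestep in plain Python. This is a simplified sketch: it omits the learned scale and offset that `LayerNormalization` also applies, and the `add_and_norm` name and the `eps` value are choices made for this illustration:

```python
import math

def add_and_norm(target, attn_output, eps=1e-3):
    """Residual addition followed by normalization over the feature axis."""
    summed = [t + a for t, a in zip(target, attn_output)]  # same size as the inputs
    mean = sum(summed) / len(summed)
    var = sum((x - mean) ** 2 for x in summed) / len(summed)
    return [(x - mean) / math.sqrt(var + eps) for x in summed]

out = add_and_norm([1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])
print(out)
```

The result has the same number of features as the inputs, with approximately zero mean and unit variance.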

In [14]:
class CrossAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        """Initializes an instance of this class

        Args:
            units (int): Number of units in the LSTM layer
        """
        super().__init__()


        self.mha = ( 
            tf.keras.layers.MultiHeadAttention(
                key_dim=units,
                num_heads=1
            ) 
        )  

        self.layernorm = tf.keras.layers.LayerNormalization()
        self.add = tf.keras.layers.Add()

    def call(self, context, target):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): Encoded sentence to translate
            target (tf.Tensor): The embedded shifted-to-the-right translation

        Returns:
            tf.Tensor: Cross attention between context and target
        """
    
        # Call the MH attention by passing in the query and value
        # For this case the query should be the translation and the value the encoded sentence to translate
        attn_output = self.mha(
            query=target,
            value=context
        )  


        x = self.add([target, attn_output])

        x = self.layernorm(x)

        return x

In [15]:
attention_layer = CrossAttention(UNITS)

sr_translation_embed = tf.keras.layers.Embedding(VOCAB_SIZE, output_dim=UNITS, mask_zero=True)(sr_translation)

attention_result = attention_layer(encoder_output, sr_translation_embed)

print(f'Tensor of contexts has shape: {encoder_output.shape}')
print(f'Tensor of translations has shape: {sr_translation_embed.shape}')
print(f'Tensor of attention scores has shape: {attention_result.shape}')

Tensor of contexts has shape: (64, 14, 256)
Tensor of translations has shape: (64, 15, 256)
Tensor of attention scores has shape: (64, 15, 256)


##### __Expected Output__

```
Tensor of contexts has shape: (64, 14, 256)
Tensor of translations has shape: (64, 15, 256)
Tensor of attention scores has shape: (64, 15, 256)
```


<a name="ex3"></a>
## Decoder

The next part of the project involves implementing the decoder component of the neural network by completing the `Decoder` class. The constructor (`__init__` method) is used to define all the sublayers of the decoder, and these sublayers are utilized during the forward pass (`call` method).

The decoder consists of the following layers:

- [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding): The `input_dim` and `output_dim` must be defined for this layer, with padding indicated by using '0'. The `mask_zero` parameter can be set accordingly.

- Pre-attention [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM): In contrast to the encoder, which employs a Bidirectional LSTM, the decoder uses a standard LSTM. The appropriate number of units should be set, and the LSTM must return the full sequence, not just the last output, by using the `return_sequences` parameter. The layer must also return the state, which is essential for inference, by setting the `return_state` parameter. The LSTM layers return state as a tuple of two tensors traditionally called `memory_state` and `carry_state`, but here, they are referred to as `hidden_state` and `cell_state` to align with lecture terminology.

- The attention layer: This layer performs cross-attention between the sentence to be translated and the right-shifted translation. The `CrossAttention` layer from the previous exercise is used here.

- Post-attention [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM): Another LSTM layer is included, though it is not required to return the state.

- [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer: This layer should have the same number of units as the vocabulary size, since it produces a score for each possible word in the vocabulary. The `logsoftmax` activation function is applied here, so the outputs are log-probabilities rather than raw logits (the code still refers to them as `logits`). It can be implemented using [tf.nn.log_softmax](https://www.tensorflow.org/api_docs/python/tf/nn/log_softmax).
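What the log-softmax activation of the final layer computes can be sketched in a few lines of plain Python. This is an illustrative version, not the TensorFlow implementation:

```python
import math

def log_softmax(logits):
    """Log of the softmax, computed stably by subtracting the maximum first."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum for x in logits]

log_probs = log_softmax([2.0, 1.0, 0.1])
print(log_probs)
print(sum(math.exp(lp) for lp in log_probs))  # the probabilities sum to 1 (up to rounding)
```

Every output is less than or equal to zero, and the largest input keeps the largest (least negative) value, so taking the argmax of the log-probabilities selects the same token as taking the argmax of the raw scores.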


In [39]:
class Decoder(tf.keras.layers.Layer):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super(Decoder, self).__init__()

        self.embedding = tf.keras.layers.Embedding(
            input_dim=vocab_size,
            output_dim=units,
            mask_zero=True
        )  

        self.pre_attention_rnn = tf.keras.layers.LSTM(
            units=units,
            return_sequences=True,
            return_state=True
        )  

        self.attention = CrossAttention(units)

        self.post_attention_rnn = tf.keras.layers.LSTM(
            units=units,
            return_sequences=True
        )  

        self.output_layer = tf.keras.layers.Dense(
            units=vocab_size,
            activation=tf.nn.log_softmax
        )  


    def call(self, context, target, state=None, return_state=False):
        """Forward pass of this layer

        Args:
            context (tf.Tensor): Encoded sentence to translate
            target (tf.Tensor): The shifted-to-the-right translation
            state (list[tf.Tensor, tf.Tensor], optional): Hidden state of the pre-attention LSTM. Defaults to None.
            return_state (bool, optional): If set to true return the hidden states of the LSTM. Defaults to False.

        Returns:
            tf.Tensor: The log_softmax probabilities of predicting a particular token
        """

        x = self.embedding(target)

        x, hidden_state, cell_state = self.pre_attention_rnn(x, initial_state=state)

        x = self.attention(context, x)

        x = self.post_attention_rnn(x)

        logits = self.output_layer(x)


        if return_state:
            return logits, [hidden_state, cell_state]

        return logits

In [40]:
decoder = Decoder(VOCAB_SIZE, UNITS)

logits = decoder(encoder_output, sr_translation)

print(f'Tensor of contexts has shape: {encoder_output.shape}')
print(f'Tensor of right-shifted translations has shape: {sr_translation.shape}')
print(f'Tensor of logits has shape: {logits.shape}')

Tensor of contexts has shape: (64, 14, 256)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)


##### __Expected Output__

```
Tensor of contexts has shape: (64, 14, 256)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)
```

<a name="ex4"></a>
## Translator

In this section, the layers previously developed are combined into a complete model. This involves completing the `Translator` class. Unlike the `Encoder` and `Decoder` classes, which inherit from `tf.keras.layers.Layer`, the `Translator` class inherits from `tf.keras.Model`.

It is important to note that `train_data` yields a tuple consisting of the sentence to be translated and the shifted-to-the-right translation, which serve as the "features" of the model. Consequently, the network's inputs will be tuples containing both context and targets.

In [44]:
class Translator(tf.keras.Model):
    def __init__(self, vocab_size, units):
        """Initializes an instance of this class

        Args:
            vocab_size (int): Size of the vocabulary
            units (int): Number of units in the LSTM layer
        """
        super().__init__()
        self.encoder = Encoder(vocab_size, units)

        self.decoder = Decoder(vocab_size, units)


    def call(self, inputs):
        """Forward pass of this layer

        Args:
            inputs (tuple(tf.Tensor, tf.Tensor)): Tuple containing the context (sentence to translate) and the target (shifted-to-the-right translation)

        Returns:
            tf.Tensor: The log_softmax probabilities of predicting a particular token
        """

        context, target = inputs

        encoded_context = self.encoder(context)

        logits = self.decoder(encoded_context, target)

        return logits

In [45]:
translator = Translator(VOCAB_SIZE, UNITS)

# Compute the logits for every word in the vocabulary
logits = translator((to_translate, sr_translation))

print(f'Tensor of sentences to translate has shape: {to_translate.shape}')
print(f'Tensor of right-shifted translations has shape: {sr_translation.shape}')
print(f'Tensor of logits has shape: {logits.shape}')

Tensor of sentences to translate has shape: (64, 14)
Tensor of right-shifted translations has shape: (64, 15)
Tensor of logits has shape: (64, 15, 12000)


<a name="3"></a>
## 3. Training

Now that there is an untrained instance of the NMT model, it is time to train it. 

In [50]:
def compile_and_train(model, epochs=20, steps_per_epoch=500):
    model.compile(optimizer="adam", loss=masked_loss, metrics=[masked_acc, masked_loss])

    history = model.fit(
        train_data.repeat(),
        epochs=epochs,
        steps_per_epoch=steps_per_epoch,
        validation_data=val_data,
        validation_steps=50,
        callbacks=[tf.keras.callbacks.EarlyStopping(patience=3)],
    )

    return model, history
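The `masked_loss` and `masked_acc` functions come from `utils.py` and are not shown here. As a rough idea of what a masked loss does, the plain-Python sketch below averages the negative log-probability of the target tokens while skipping padded positions. The `masked_nll` name and its exact behaviour are assumptions, not the course implementation:

```python
import math

def masked_nll(log_probs, targets, pad_id=0):
    """Average negative log-likelihood over non-padding positions."""
    total, count = 0.0, 0
    for step_log_probs, target in zip(log_probs, targets):
        if target == pad_id:
            continue  # padded positions do not contribute to the loss
        total -= step_log_probs[target]
        count += 1
    return total / count if count else 0.0

# Three timesteps over a toy vocabulary of 4 tokens; the last target is padding
log_probs = [
    [math.log(p) for p in (0.1, 0.2, 0.5, 0.2)],
    [math.log(p) for p in (0.25, 0.25, 0.25, 0.25)],
    [math.log(p) for p in (0.7, 0.1, 0.1, 0.1)],
]
targets = [2, 3, 0]
loss = masked_nll(log_probs, targets)
print(f"{loss:.4f}")
```

Without the mask, the many padding positions in each batch would dominate the average and reward the model for predicting padding.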

In [51]:
# Train the translator (this takes some minutes so feel free to take a break)

trained_translator, history = compile_and_train(translator)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20


<a name="4"></a>
## 4. Using the model for inference 


Now that your model is trained, you can use it for inference. To help you with this, the `generate_next_token` function is provided. This function is meant to be used inside a for-loop: you feed it the information from the previous step to generate the information for the next step. In particular, you need to keep track of the state of the pre-attention LSTM in the decoder and of whether the translation is done. A `temperature` argument is also introduced, which determines how the next token is selected given the predicted logits:

In [52]:
def generate_next_token(decoder, context, next_token, done, state, temperature=0.0):
    """Generates the next token in the sequence

    Args:
        decoder (Decoder): The decoder
        context (tf.Tensor): Encoded sentence to translate
        next_token (tf.Tensor): The predicted next token
        done (bool): True if the translation is complete
        state (list[tf.Tensor, tf.Tensor]): Hidden states of the pre-attention LSTM layer
        temperature (float, optional): The temperature that controls the randomness of the predicted tokens. Defaults to 0.0.

    Returns:
        tuple(tf.Tensor, float, list[tf.Tensor, tf.Tensor], bool): The next token, log prob of said token, hidden state of LSTM and if translation is done
    """
    # Get the logits and state from the decoder
    logits, state = decoder(context, next_token, state=state, return_state=True)
    
    # Trim the intermediate dimension 
    logits = logits[:, -1, :]
        
    # If temp is 0 then next_token is the argmax of logits
    if temperature == 0.0:
        next_token = tf.argmax(logits, axis=-1)
        
    # If temp is not 0 then next_token is sampled out of logits
    else:
        logits = logits / temperature
        next_token = tf.random.categorical(logits, num_samples=1)
    
    # Trim dimensions of size 1
    logits = tf.squeeze(logits)
    next_token = tf.squeeze(next_token)
    
    # Get the logit of the selected next_token
    logit = logits[next_token].numpy()
    
    # Reshape to (1,1) since this is the expected shape for text encoded as TF tensors
    next_token = tf.reshape(next_token, shape=(1,1))
    
    # If next_token is End-of-Sentence token you are done
    if next_token == eos_id:
        done = True
    
    return next_token, logit, state, done
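The effect of the temperature can be seen by converting a fixed set of scores into probabilities at several temperatures. The sketch below is plain Python with made-up scores; `softmax_with_temperature` is a hypothetical helper that only mirrors the scaling step used before sampling:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn scores into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability mass on the top-scoring token, so sampling behaves more like the argmax used when the temperature is 0; higher temperatures flatten the distribution and make the output more varied.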

See how it works by running the following cell:

In [54]:
# PROCESS SENTENCE TO TRANSLATE AND ENCODE

# A sentence you wish to translate
eng_sentence = "I love languages"

# Convert it to a tensor
texts = tf.convert_to_tensor(eng_sentence)[tf.newaxis]

# Vectorize it and pass it through the encoder
context = english_vectorizer(texts).to_tensor()
context = encoder(context)

# SET STATE OF THE DECODER

# Next token is Start-of-Sentence since you are starting fresh
next_token = tf.fill((1,1), sos_id)

# Hidden and Cell states of the LSTM can be mocked using uniform samples
state = [tf.random.uniform((1, UNITS)), tf.random.uniform((1, UNITS))]

# You are not done until next token is EOS token
done = False

# Generate next token
next_token, logit, state, done = generate_next_token(decoder, context, next_token, done, state, temperature=0.5)
print(f"Next token: {next_token}\nLogit: {logit:.4f}\nDone? {done}")

Next token: [[10026]]
Logit: -18.7985
Done? False


<a name="ex5"></a>
## Exercise 5 - translate

Now you can put everything together to translate a given sentence. For this, complete the `translate` function below. This function will take care of the following steps:

- Process the sentence to translate and encode it
- Set the initial state of the decoder
- Get predictions of the next token (starting with the `[SOS]` token) for a maximum number of iterations (in case the `[EOS]` token is never returned)
- Return the translated text (as a string), the logit of the last generated token (this helps measure how certain the model was that the sequence was translated in its entirety) and the translation in token format.
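The control flow of these steps can be sketched independently of TensorFlow. In the snippet below, `decode` and `toy_step` are hypothetical: `toy_step` stands in for the decoder and simply replays a fixed script of token ids, using its state as a step counter:

```python
def decode(step_fn, sos_id, eos_id, max_length=50):
    """Call step_fn repeatedly until it emits eos_id or max_length is reached."""
    tokens = []
    token, state = sos_id, None
    for _ in range(max_length):
        token, state = step_fn(token, state)
        if token == eos_id:
            break  # the end-of-sentence token is not kept in the output
        tokens.append(token)
    return tokens

def toy_step(token, state):
    """Stand-in for the decoder: replays a fixed script of token ids."""
    step = 0 if state is None else state
    script = [9, 564, 1032, 3]
    return script[min(step, len(script) - 1)], step + 1

print(decode(toy_step, sos_id=2, eos_id=3))  # [9, 564, 1032]
```

The `max_length` bound is what guarantees termination when the end-of-sentence token is never produced.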


Hints:

- The previous cell provides a lot of insight into how this function should work, so if you get stuck refer to it.
- Some useful docs:
    - [tf.newaxis](https://www.tensorflow.org/api_docs/python/tf#newaxis)
    - [tf.fill](https://www.tensorflow.org/api_docs/python/tf/fill)
    - [tf.zeros](https://www.tensorflow.org/api_docs/python/tf/zeros)


**IMPORTANT NOTE**: Because TensorFlow training and weight initialization involve randomness, the results below may vary a lot, even if you retrain your model in the same session.


In [67]:
# GRADED FUNCTION: translate
def translate(model, text, max_length=50, temperature=0.0):
    """Translate a given sentence from English to Portuguese

    Args:
        model (tf.keras.Model): The trained translator
        text (string): The sentence to translate
        max_length (int, optional): The maximum length of the translation. Defaults to 50.
        temperature (float, optional): The temperature that controls the randomness of the predicted tokens. Defaults to 0.0.

    Returns:
        tuple(str, float, tf.Tensor): The translation, the logit of the last token kept in the translation and the tokenized translation
    """
    # Lists to save tokens and logits
    tokens, logits = [], []

    ### START CODE HERE ###
    
    # PROCESS THE SENTENCE TO TRANSLATE
    
    # Convert the original string into a tensor
    text = tf.convert_to_tensor(text)[tf.newaxis]
    
    # Vectorize the text using the correct vectorizer
    context = english_vectorizer(text).to_tensor()
    
    # Get the encoded context (pass the context through the encoder)
    # Hint: Remember you can get the encoder by using model.encoder
    context = model.encoder(context)
    
    # INITIAL STATE OF THE DECODER
    
    # First token should be SOS token with shape (1,1)
    next_token = tf.fill((1, 1), sos_id)
    
    # Initial hidden and cell states should be tensors of zeros with shape (1, UNITS)
    state = [tf.zeros((1, UNITS)), tf.zeros((1, UNITS))]
    
    # You are done when you draw a EOS token as next token (initial state is False)
    done = False

    # Iterate for max_length iterations
    for i in range(max_length):
        # Generate the next token
        try:
            next_token, logit, state, done = generate_next_token(
                decoder=model.decoder,
                context=context,
                next_token=next_token,
                done=done,
                state=state,
                temperature=temperature
            )
        except Exception as err:
            # Chain the original error so the underlying cause is not lost
            raise RuntimeError("Problem generating the next token") from err
        
        # If done then break out of the loop
        if done:
            break
        
        # Add next_token to the list of tokens
        tokens.append(next_token)
        
        # Add logit to the list of logits
        logits.append(logit)
    
    ### END CODE HERE ###
    
    # Concatenate all tokens into a tensor
    tokens = tf.concat(tokens, axis=-1)
    
    # Convert the translated tokens into text
    translation = tf.squeeze(tokens_to_text(tokens, id_to_word))
    translation = translation.numpy().decode()
    
    return translation, logits[-1], tokens

Try your function with a temperature of 0, which yields a deterministic output and is equivalent to greedy decoding:

In [68]:
# Running this cell multiple times should return the same output since temp is 0

temp = 0.0 
original_sentence = "I love languages"

translation, logit, tokens = translate(trained_translator, original_sentence, temperature=temp)

print(f"Temperature: {temp}\n\nOriginal sentence: {original_sentence}\nTranslation: {translation}\nTranslation tokens:{tokens}\nLogit: {logit:.3f}")

Temperature: 0.0

Original sentence: I love languages
Translation: eu adoro linguas de vista as senhoras .
Translation tokens:[[   9  564 1032   11 1037   38  421    4]]
Logit: -0.389
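
To see why a temperature of 0 behaves like greedy decoding, the sketch below shows one common way to apply temperature to a list of logits. It is an illustration only: `sample_token` is a hypothetical helper written with the standard library, not the `generate_next_token` function called by `translate`, whose internals may differ.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token id from raw logits (hypothetical helper, for illustration)."""
    if temperature == 0.0:
        # Greedy decoding: always take the index of the largest logit
        return max(range(len(logits)), key=lambda i: logits[i])
    # Divide by the temperature, then build softmax weights in a stable way
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(x - top) for x in scaled]
    # Draw one index in proportion to its weight
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.1]  # made-up scores for a 3-token vocabulary

greedy = [sample_token(logits, 0.0, rng) for _ in range(5)]
sampled = [sample_token(logits, 0.7, rng) for _ in range(5)]
print(greedy)   # index 0 every time, since it has the largest logit
print(sampled)  # can include other indices
```

Lower temperatures concentrate the weights on the top-scoring token, while higher temperatures flatten them and make the output more varied.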


Try your function with a temperature of 0.7 (stochastic output):

In [69]:
# Running this cell multiple times should return different outputs since temp is not 0
# You can try different temperatures

temp = 0.7
original_sentence = "I love languages"

translation, logit, tokens = translate(trained_translator, original_sentence, temperature=temp)

print(f"Temperature: {temp}\n\nOriginal sentence: {original_sentence}\nTranslation: {translation}\nTranslation tokens:{tokens}\nLogit: {logit:.3f}")

Temperature: 0.7

Original sentence: I love languages
Translation: eu eu adoro idiomas de senhora .
Translation tokens:[[  9   9 564 850  11 279   4]]
Logit: -0.830


In [70]:
w1_unittest.test_translate(translate, trained_translator)

All tests passed!


<a name="5"></a>
## 5. Minimum Bayes-Risk Decoding

As mentioned in the lectures, picking the most probable token at each step does not necessarily produce the best overall translation. An alternative is Minimum Bayes-Risk (MBR) decoding. The general steps to implement it are:

- Take several random samples
- Score each sample against all other samples
- Select the one with the highest score

You will be building helper functions for these steps in the following sections.
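
Before building those helpers, the three steps can be sketched in plain Python. This is a minimal illustration under stated assumptions: the candidates are hard-coded token lists standing in for sampled translations, and `unigram_overlap` and `pick_consensus` are hypothetical names, not functions from this assignment.

```python
def unigram_overlap(a, b):
    """Toy similarity: shared unique tokens divided by all unique tokens."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def pick_consensus(candidates):
    """Return the index of the candidate most similar, on average, to the rest."""
    scores = []
    for i, cand in enumerate(candidates):
        # Score this candidate against every other candidate
        others = [c for j, c in enumerate(candidates) if j != i]
        scores.append(sum(unigram_overlap(cand, o) for o in others) / len(others))
    # Select the candidate with the highest average score
    best = max(range(len(candidates)), key=lambda i: scores[i])
    return best, scores

# Made-up token ids standing in for several random samples
candidates = [
    [9, 564, 1032, 4],
    [9, 564, 1032, 11, 4],
    [9, 850, 77, 4],
]
best, scores = pick_consensus(candidates)
print(best, [round(s, 3) for s in scores])
```

The candidate that agrees most with the others wins, which is the intuition behind MBR: a translation that many samples resemble is less likely to be an outlier.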

Because different temperature values yield different translations, you can do what you saw in the lectures: generate a batch of candidate translations and then determine which one is the best. You will do this with the provided `generate_samples` function, which returns the requested number of candidate translations along with the log-probability of each one:

In [71]:
def generate_samples(model, text, n_samples=4, temperature=0.6):
    
    samples, log_probs = [], []

    for _ in range(n_samples):
        
        _, logp, sample = translate(model, text, temperature=temperature)
        
        samples.append(np.squeeze(sample.numpy()).tolist())
        
        log_probs.append(logp)
                
    return samples, log_probs

In [72]:
samples, log_probs = generate_samples(trained_translator, 'I love languages')

for s, l in zip(samples, log_probs):
    print(f"Translated tensor: {s} has logit: {l:.3f}")

Translated tensor: [9, 564, 1032, 11, 514, 4] has logit: -1.052
Translated tensor: [9, 564, 1032, 11, 576, 4] has logit: -0.255
Translated tensor: [9, 564, 850, 11, 313, 11, 1037, 4] has logit: -2.890
Translated tensor: [9, 564, 1032, 11, 313, 11, 576, 4] has logit: -0.041


## Comparing Overlaps

After generating multiple translations, the next step is to evaluate the quality of each one. One method to achieve this is by comparing each translation against the others.

Various metrics can be used for this purpose, as discussed in the lectures, and experimentation with these metrics is encouraged. For this assignment, the focus will be on calculating scores for **unigram overlaps**.

A commonly used and straightforward metric is the [Jaccard similarity](https://en.wikipedia.org/wiki/Jaccard_index), which calculates the intersection over the union of two sets. The `jaccard_similarity` function computes this metric for any pair of candidate and reference translations.


In [73]:
def jaccard_similarity(candidate, reference):
        
    candidate_set = set(candidate)
    reference_set = set(reference)
    
    common_tokens = candidate_set.intersection(reference_set)

    all_tokens = candidate_set.union(reference_set)

    overlap = len(common_tokens) / len(all_tokens)
        
    return overlap

In [74]:
l1 = [1, 2, 3]
l2 = [1, 2, 3, 4]

js = jaccard_similarity(l1, l2)

print(f"jaccard similarity between lists: {l1} and {l2} is {js:.3f}")

jaccard similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.750
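
One consequence of working on sets is worth seeing directly: repeated tokens and token order have no effect on the score. The short sketch below re-declares the same computation under the hypothetical name `jaccard` so that it runs on its own; the token lists are illustrative.

```python
def jaccard(candidate, reference):
    """Intersection over union of the unique tokens in each list."""
    candidate_set, reference_set = set(candidate), set(reference)
    return len(candidate_set & reference_set) / len(candidate_set | reference_set)

# A duplicated token does not change the set, so the score stays at 1.0
repeated = jaccard([9, 9, 564, 850], [9, 564, 850])

# Reordering the tokens does not change the set either
shuffled = jaccard([850, 564, 9], [9, 564, 850])

print(repeated, shuffled)  # 1.0 1.0
```

A sampled translation that stutters (for example, one that repeats its first token) is therefore not penalized by this metric.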


<a name="ex6"></a>
## rouge1_similarity

Jaccard similarity is good, but a more commonly used metric in machine translation is the ROUGE score. For unigrams, this is called ROUGE-1, and as shown in the lectures, you can compute both precision and recall when comparing two samples. To get the final score, compute the F1-score, given by:

$$\text{score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$
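
The formula leaves precision and recall implicit. A common unigram definition, stated here as an assumption since the lecture definition is not reproduced in this notebook, counts each token at most as often as it appears in both lists:

$$\text{overlap} = \sum_{t} \min\big(\text{count}_{\text{candidate}}(t),\ \text{count}_{\text{reference}}(t)\big), \qquad \text{precision} = \frac{\text{overlap}}{|\text{candidate}|}, \qquad \text{recall} = \frac{\text{overlap}}{|\text{reference}|}$$

As a worked example, take the candidate $[1, 2, 3]$ and the reference $[1, 2, 3, 4]$. The overlap is 3, so precision is $3/3 = 1$ and recall is $3/4$:

$$\text{score} = 2 \times \frac{1 \times 0.75}{1 + 0.75} = \frac{1.5}{1.75} \approx 0.857$$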

To implement the `rouge1_similarity` function, use the [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) class from the Python standard library:

In [84]:
def rouge1_similarity(candidate, reference):
    """Computes the ROUGE-1 F1 score between two token lists

    Args:
        candidate (list[int]): Tokenized candidate translation
        reference (list[int]): Tokenized reference translation

    Returns:
        float: ROUGE-1 F1 score between the two token lists
    """
    candidate_word_counts = Counter(candidate)
    reference_word_counts = Counter(reference)
    
    overlap = 0
    
    for token in candidate_word_counts.keys():
        
        token_count_candidate = candidate_word_counts[token]
        
        token_count_reference = reference_word_counts[token]
        
        overlap += min(token_count_candidate, token_count_reference)

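    # Empty inputs share no tokens; return early to avoid dividing by zero
    if len(candidate) == 0 or len(reference) == 0:
        return 0.0

    precision = overlap / len(candidate)

    recall = overlap / len(reference)

    if precision + recall != 0:
        # F1 score: harmonic mean of precision and recall
        f1_score = 2 * ((precision * recall) / (precision + recall))

        return f1_score

    return 0.0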

In [85]:
l1 = [1, 2, 3]
l2 = [1, 2, 3, 4]

r1s = rouge1_similarity(l1, l2)

print(f"rouge 1 similarity between lists: {l1} and {l2} is {r1s:.3f}")

rouge 1 similarity between lists: [1, 2, 3] and [1, 2, 3, 4] is 0.857
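
Unlike a set-based metric, ROUGE-1 is sensitive to repeated tokens because it works with counts. The sketch below is an alternative formulation, not the assignment's implementation: it uses the `&` operator on `Counter` objects, which keeps each token with the minimum of its two counts. The name `rouge1_f1` and the token lists are illustrative.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """ROUGE-1 F1 score using Counter intersection for the clipped overlap."""
    # Counter & Counter keeps each token with the minimum of its two counts
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        # Also covers empty inputs, so no division by zero can occur below
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

# The candidate repeats token 9, which the reference contains only once
stutter = rouge1_f1([9, 9, 564, 850], [9, 564, 850])
print(round(stutter, 3))  # 0.857
```

Here the extra `9` counts toward the candidate's length but not toward the overlap, so precision drops to 3/4 and the score to about 0.857. The unique tokens of the two lists are identical, so a purely set-based comparison of the same pair would report a perfect match.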


## Computing the Overall Score


The task now is to develop a function to calculate the overall score for a given sample. As explained in the lectures, this involves comparing each sample with all other samples. For example, if 30 sentences are generated, sentence 1 needs to be compared with sentences 2 through 30. Then, sentence 2 is compared with sentences 1 and 3 through 30, and so on. The average score of all comparisons will provide the overall score for each sample. To illustrate, the following steps outline how to compute the scores for a list of 4 samples:

1. Compute the similarity score between sample 1 and sample 2.
2. Compute the similarity score between sample 1 and sample 3.
3. Compute the similarity score between sample 1 and sample 4.
4. Calculate the average score of these comparisons to determine the overall score for sample 1.
5. Repeat the process for samples 2 through 4 to obtain their overall scores.

The results will be stored in a dictionary for convenient lookups.

<a name="ex7"></a>
## average_overlap

Complete the `average_overlap` function below, which should implement the process described above:

In [87]:
def average_overlap(samples, similarity_fn):
    """Computes the arithmetic mean of the overlap between each candidate and all other samples

    Args:
        samples (list[list[int]]): Tokenized version of translated sentences
        similarity_fn (Function): Similarity function used to compute the overlap

    Returns:
        dict[int, float]: A dictionary mapping the index of each translation to its score
    """
    scores = {}
    
    for index_candidate, candidate in enumerate(samples):    
        
        overlap = 0
        
        for index_sample, sample in enumerate(samples):

            if index_candidate == index_sample:
                continue
                
            sample_overlap = similarity_fn(candidate, sample)
            
            overlap += sample_overlap


        score = overlap / (len(samples) - 1)

        score = round(score, 3)
        
        scores[index_candidate] = score
        
    return scores

In [88]:
l1 = [1, 2, 3]
l2 = [1, 2, 4]
l3 = [1, 2, 4, 5]

avg_ovlp = average_overlap([l1, l2, l3], jaccard_similarity)

print(f"average overlap between lists: {l1}, {l2} and {l3} using Jaccard similarity is:\n\n{avg_ovlp}")

average overlap between lists: [1, 2, 3], [1, 2, 4] and [1, 2, 4, 5] using Jaccard similarity is:

{0: 0.45, 1: 0.625, 2: 0.575}


In [89]:
l1 = [1, 2, 3]
l2 = [1, 4]
l3 = [1, 2, 4, 5]
l4 = [5,6]

avg_ovlp = average_overlap([l1, l2, l3, l4], rouge1_similarity)

print(f"average overlap between lists: {l1}, {l2}, {l3} and {l4} using Rouge1 similarity is:\n\n{avg_ovlp}")

average overlap between lists: [1, 2, 3], [1, 4], [1, 2, 4, 5] and [5, 6] using Rouge1 similarity is:

{0: 0.324, 1: 0.356, 2: 0.524, 3: 0.111}
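
Both Jaccard similarity and an F1-based ROUGE-1 score return the same value when their two arguments are swapped. Under that assumption, each pair only needs to be scored once. The sketch below is one possible variant, not part of the assignment; `average_overlap_symmetric` and `jaccard` are hypothetical names defined here so the block runs on its own.

```python
from itertools import combinations

def jaccard(a, b):
    """Intersection over union of the unique tokens in each list."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def average_overlap_symmetric(samples, similarity_fn):
    """Mean overlap per sample, assuming similarity_fn(a, b) == similarity_fn(b, a)."""
    totals = [0.0] * len(samples)
    # Visit each unordered pair once and credit the score to both members
    for i, j in combinations(range(len(samples)), 2):
        score = similarity_fn(samples[i], samples[j])
        totals[i] += score
        totals[j] += score
    # Every sample was compared with the other len(samples) - 1 samples
    return {i: round(total / (len(samples) - 1), 3) for i, total in enumerate(totals)}

result = average_overlap_symmetric([[1, 2, 3], [1, 2, 4], [1, 2, 4, 5]], jaccard)
print(result)  # {0: 0.45, 1: 0.625, 2: 0.575}
```

This halves the number of similarity calls, which starts to matter as the number of samples grows. It should not be used with a similarity function that treats candidate and reference differently, such as precision alone.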


In practice, it is also common to use a weighted mean instead of the plain arithmetic mean, with each sample's similarity weighted by that sample's probability. This is implemented in the `weighted_avg_overlap` function below:

In [91]:
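def weighted_avg_overlap(samples, log_probs, similarity_fn):
    """Computes the probability-weighted mean overlap between each candidate and all other samples

    Args:
        samples (list[list[int]]): Tokenized version of translated sentences
        log_probs (list[float]): Log-probability of each sample
        similarity_fn (Function): Similarity function used to compute the overlap

    Returns:
        dict[int, float]: A dictionary mapping the index of each translation to its score
    """
    scores = {}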
    
    for index_candidate, candidate in enumerate(samples):    
        
        overlap, weight_sum = 0.0, 0.0
        
        for index_sample, (sample, logp) in enumerate(zip(samples, log_probs)):

            # Skip if the candidate index is the same as the sample index            
            if index_candidate == index_sample:
                continue
                
            # Convert log probability to linear scale
            sample_p = float(np.exp(logp))

            weight_sum += sample_p

            sample_overlap = similarity_fn(candidate, sample)
            
            overlap += sample_p * sample_overlap
            
        score = overlap / weight_sum

        score = round(score, 3)

        scores[index_candidate] = score
    
    return scores

In [92]:
l1 = [1, 2, 3]
l2 = [1, 2, 4]
l3 = [1, 2, 4, 5]
# Illustrative weights only: real log-probabilities are never positive
log_probs = [0.4, 0.2, 0.5]

w_avg_ovlp = weighted_avg_overlap([l1, l2, l3], log_probs, jaccard_similarity)

print(f"weighted average overlap using Jaccard similarity is:\n\n{w_avg_ovlp}")

weighted average overlap using Jaccard similarity is:

{0: 0.443, 1: 0.631, 2: 0.558}
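
Written as a formula, the computation in `weighted_avg_overlap` is the following. The symbols are notation introduced here for illustration: $s_i$ is sample $i$, $\ell_j$ is the log-probability of sample $j$, and $\text{sim}$ is the similarity function.

$$\text{score}_i = \frac{\sum_{j \neq i} p_j \cdot \text{sim}(s_i, s_j)}{\sum_{j \neq i} p_j}, \qquad p_j = e^{\ell_j}$$

For sample 0 in the example above, the Jaccard similarities to samples 1 and 2 are $0.5$ and $0.4$, and the weights are $e^{0.2}$ and $e^{0.5}$:

$$\text{score}_0 = \frac{e^{0.2} \times 0.5 + e^{0.5} \times 0.4}{e^{0.2} + e^{0.5}} \approx \frac{0.611 + 0.659}{1.221 + 1.649} \approx 0.443$$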


## MBR Decode
The final step involves integrating all the components into the `mbr_decode` function. This function serves as a wrapper around the various elements developed previously.

The `mbr_decode` function allows for experimentation with different numbers of samples, temperatures, and similarity functions.

In [None]:
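def mbr_decode(model, text, n_samples=5, temperature=0.6, similarity_fn=jaccard_similarity):
    """Translates text by sampling candidates and selecting the one with the highest weighted overlap

    Returns:
        tuple[str, list[str]]: The selected translation and all decoded candidates
    """
    # Sample candidate translations along with their log-probabilities
    samples, log_probs = generate_samples(model, text, n_samples=n_samples, temperature=temperature)
    
    # Score each candidate against all the others
    scores = weighted_avg_overlap(samples, log_probs, similarity_fn)

    # Convert every candidate from token ids back to text
    decoded_translations = [tokens_to_text(s, id_to_word).numpy().decode('utf-8') for s in samples]
    
    # Index of the candidate with the highest score
    max_score_key = max(scores, key=lambda k: scores[k])
    
    translation = decoded_translations[max_score_key]
    
    return translation, decoded_translations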

In [94]:
english_sentence = "I love languages"

translation, candidates = mbr_decode(trained_translator, english_sentence, n_samples=10, temperature=0.6)

print("Translation candidates:")
for c in candidates:
    print(c)

print(f"\nSelected translation: {translation}")

Translation candidates:
adoro linguas estrangeiras da nova forma .
eu adoro linguas de novo .
eu adoro idiomas de sorte .
eu adoro linguas de vista as senhoras .
adoro linguas em seguranca .
eu adoro linguas de aviao .
eu eu adoro linguas de vista de senhora .
eu adoro idiomas de menos a senhora .
adoro linguas de usuario .
eu adoro idiomas de idade .

Selected translation: eu adoro linguas de novo .
