![BridgingAI Logo](../bridgingai_logo.png)

# Deep Learning - Exercise 4: Recurrent Neural Networks for Language Modeling and Neural Machine Translation
---
1. [Character-Level Language Modeling](#lm)
<br/> &#9; 1.1. [Character-Level Tokenization](#char-tokenization)
<br/> &#9; 1.2. [Training Input and Target](#input_target)

2. [RNN Implementation](#rnn-implementation)
<br/> &#9; 2.1. [RNN](#rnn)
<br/> &#9; 2.2. [LSTM (Optional)](#lstm) 

3. [Experiment: Character-Level Language Modeling](#experiment-lm)

4. [Neural Machine Translation](#neural-machine-translation)
<br/> &#9; 4.1. [Dataset Preparation](#dataset-preparation)
<br/> &#9; 4.2. [Tokenization](#tokenization)
<br/> &#9; 4.3. [Batching and Padding](#batching-and-padding)
<br/> &#9; 4.4. [BLEU](#bleu)

5. [NMT Implementation: Building Seq2Seq](#implementation-nmt)
<br/> &#9; 5.1. [Encoder](#encoder)
<br/> &#9; 5.2. [Decoder](#decoder)

6. [Experiment: Neural Machine Translation](#experiment-nmt)

7. [Questions](#questions)

8. [References](#references)
---



In [None]:
import torch
import torch.nn as nn
import torch.nn.init as init

import math

from tests.lm_sanity_checks import TestRNN, TestLSTM
from tests.nmt_sanity_checks import TestNMTEncoder, TestNMTDecoder
from configs.lm_config import LMExperimentConfig
from configs.nmt_config import NMTExperimentConfig
from trainers.lm_trainer import LMTrainer
from trainers.nmt_trainer import NMTTrainer
from IPython.display import Markdown, display

Recurrent Neural Networks (RNNs) are a foundational architecture for processing sequential data by maintaining a hidden state. Vanilla RNNs and LSTMs have largely given way to attention-based architectures and state-space models, mainly because these newer architectures are easier to parallelize during training and therefore scale better to large models and large datasets. Nevertheless, the core principles of recurrent neural networks, maintaining and updating state information and reusing model weights across the sequence, persist in many modern architectures.

This assignment explores sequence modeling tasks with RNNs:
- Implement a vanilla RNN in PyTorch and train it for character-level language modeling on Shakespeare's texts
- Build an encoder-decoder (Seq2Seq) model for German-to-English translation using the Multi30k dataset
- Work hands-on with core NLP concepts including tokenization strategies, padding mechanisms, and evaluation metrics for sequence tasks

<a id="lm"></a>
# 1. Character-Level Language Modeling

Language modeling is a fundamental task in NLP where we aim to predict the next token in a sequence given the previous tokens. A token might be a single character, a word, a word piece or even a piece of a whole sentence. In this assignment, we'll focus on character-level language modeling, where tokens are individual characters rather than words.

Consider a text sequence "Hello". During training, a character-level language model would:
0. Start with the special "start-of-sentence token" and predict "H".
1. Take "H" and predict "e"
2. Take "He" and predict "l" 
3. Take "Hel" and predict "l"
4. Take "Hell" and predict "o"

More formally, given a sequence of characters $x_1, x_2, ..., x_T$, the goal of a character-level language model is to predict the probability distribution of the next character $x_{T+1}$:

$$P(x_{T+1} | x_1, x_2, ..., x_T)$$
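
To make the "Hello" walk-through concrete, the following minimal sketch enumerates the (context, next character) pairs that such a model would be trained on. The `BOS` marker and the helper name `next_char_examples` are illustrative assumptions and not part of the exercise code.

```python
BOS = "<s>"  # hypothetical start-of-sentence marker


def next_char_examples(text):
    """Return (context, next_char) pairs for every position in `text`."""
    tokens = [BOS] + list(text)
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


for context, target in next_char_examples("Hello"):
    print(context, "->", target)
```

Each pair corresponds to one term $P(x_{t+1} | x_1, ..., x_t)$, so a sequence of $T$ characters yields $T$ prediction problems.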

Now let's visualize the data we'll be working with.

<a id="char-tokenization"></a>
## 1.1. Character-Level Tokenization

**Tokenization** refers to the process of breaking down a text into smaller units, such as words or subwords. 

Before training a language model, we need to convert text into numbers that our neural network can process. In character-level modeling, we assign a unique ID to each character in our vocabulary, including spaces and punctuation. Since we know which characters to expect (our "vocabulary"), it is fairly easy to create the tokenizer: just enumerate the closed set of allowed characters. More sophisticated tokenizers need to be trained to learn a suitable vocabulary from some dataset.
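
As a rough illustration of "enumerating the closed set of allowed characters", a character-level tokenizer can be sketched in a few lines. The class name `ToyCharTokenizer` and its attributes are assumptions made for this example; the tokenizer shipped with the exercise configs may be organized differently.

```python
class ToyCharTokenizer:
    """Hypothetical character-level tokenizer; not the class used by the exercise configs."""

    def __init__(self, allowed_chars):
        # Sort the characters so that the ID assignment is deterministic.
        self.id2char = sorted(set(allowed_chars))
        self.char2id = {ch: i for i, ch in enumerate(self.id2char)}

    def encode(self, text):
        # Raises KeyError for characters outside the vocabulary.
        return [self.char2id[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.id2char[i] for i in ids)


tokenizer = ToyCharTokenizer("abcdefghijklmnopqrstuvwxyz ")
ids = tokenizer.encode("now is")
print(ids)
print(tokenizer.decode(ids))
```

Because the vocabulary is fixed in advance, encoding followed by decoding returns the original text for any string made of allowed characters.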

The pipeline for processing text data consists of the following stages:
$$
\text{Text} \rightarrow 
\text{Tokens} \rightarrow 
\text{Token IDs} \rightarrow 
\text{Vectors} \rightarrow 
\text{\{feed to the model\}}...
$$

Let's look at how the tokenization works with a simple example. Each character gets mapped to an integer ID, and whitespace characters such as the space (' ') are part of the vocabulary as well. The tokenizer can thus convert text to IDs and back.

In [None]:
def display_tokenization(config, text_string):
    """
    Tokenizes the input text and displays character (token) to ID mappings.
    """
    # Tokenize the input text
    token_ids = config.tokenizer.encode(text_string)

    # Display input text and token IDs
    print(f"Input Text: '{text_string}'")
    print(f"Token IDs: {token_ids}\n")

    # Print the token to ID mapping
    print("Token → ID:")
    print("-" * 15)
    for token_id in sorted(set(token_ids)):
        char = config.tokenizer.id2char[token_id]
        print(f"{repr(char):<5} → {token_id}")


# Example usage
text_string = "Now is the winter"
config = LMExperimentConfig("pytorch_rnn")
display_tokenization(config, text_string)

<a id="input_target"></a>
## 1.2. Training Input and Target

For language modeling, each training example consists of an input sequence and a target sequence. The target sequence is simply the input sequence shifted by one position - we want the model to predict the next character at each position.

As shown below, the dataset is organized into batches where:
- Input shape is (batch_size, sequence_length) 
- Target shape is (batch_size, sequence_length)
- Each target sequence is offset by one character from its input sequence

During training, the model processes the input sequence character by character, trying to predict the next character at each step. The loss is calculated by comparing these predictions against the target sequence.
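
The one-position shift between input and target can be sketched as follows. The helper `make_lm_example` is hypothetical, and `ord` is used only as a stand-in for a real tokenizer's ID mapping.

```python
def make_lm_example(token_ids, start, seq_len):
    """Slice one (input, target) pair; the target is the input shifted by one position."""
    window = token_ids[start : start + seq_len + 1]
    return window[:-1], window[1:]


text = "Now is the winter"
token_ids = [ord(ch) for ch in text]  # stand-in IDs; a real tokenizer has its own mapping
inputs, targets = make_lm_example(token_ids, start=0, seq_len=8)
print("".join(chr(i) for i in inputs))
print("".join(chr(i) for i in targets))
```

Both sequences have length `seq_len`, and `targets[t]` is the character that follows `inputs[t]` in the text.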

In [None]:
from exercise_utils.nlp.lm.utils import create_lm_dataloaders


def display_dataloader_sample_with_decoding(config):
    """
    Displays a sample input-target pair from the dataset, batch shapes,
    and includes the decoded text for both input and target sequences.
    """
    torch.manual_seed(0)
    # Create data loaders
    train_loader, _ = create_lm_dataloaders(config, config.seq_len)
    train_dataset = train_loader.dataset

    # Get a sample input-target pair
    sample_input, sample_target = train_dataset[0]
    sample_input = sample_input.tolist()
    sample_target = sample_target.tolist()

    # Decode the sequences back to text
    decoded_input = config.tokenizer.decode(sample_input)
    decoded_target = config.tokenizer.decode(sample_target)

    # Print the sample input, target, and decoded text
    print("Sample Input (IDs): \t", sample_input)
    print("Sample Target (IDs): \t", " " * 3, sample_target)
    print("\nDecoded Input:   ", decoded_input)
    print("Decoded Target:  ", decoded_target)

    # Fetch a batch of data and print batch shapes
    inputs, targets = next(iter(train_loader))
    print("\nBatch Shapes:")
    print(f"Inputs Shape: {inputs.shape}")
    print(f"Targets Shape: {targets.shape}")


batch_size = 32
seq_len = 8
config = LMExperimentConfig("pytorch_rnn", seq_len=seq_len, batch_size=batch_size)
display_dataloader_sample_with_decoding(config)

<a id="rnn-implementation"></a>
# 2. RNN Implementation

In this section, you'll implement a vanilla RNN and, optionally, an LSTM from scratch using PyTorch. You'll then train a character-level language model on a dataset of Shakespeare's writing.

<a id="rnn"></a>
## 2.1. RNN

**TODOs**: Complete the `CustomRNN` class implementation and pass the tests.

- **TODO 1: Initialize Parameters**
   - Create learnable parameters `Wxh`, `Whh`, and `bh` using `nn.Parameter`
   - `Wxh`: Input-to-hidden weights (hidden_size, input_dim)
   - `Whh`: Hidden-to-hidden weights (hidden_size, hidden_size) 
   - `bh`: Hidden bias (hidden_size)
   - Make sure to use exactly these names for the parameters (important for testing)

- **TODO 2: Forward Pass**
   - Implement the RNN update equation: $h_t = \text{tanh}(W_{xh} \cdot x_t + W_{hh} \cdot h_{t-1} + b_h)$

- **TODO 3: Weight Initialization** 
   - Apply Xavier initialization to `Wxh` and `Whh`
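
For background, Xavier (Glorot) uniform initialization with unit gain draws each entry of a weight matrix with $n_{\text{in}}$ inputs and $n_{\text{out}}$ outputs from a symmetric uniform distribution. This is the standard definition; whether the sanity checks expect the uniform or the normal variant is not stated here, so treat it as a reminder rather than a specification:

$$
W_{ij} \sim \mathcal{U}(-a, a), \qquad a = \sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}
$$

For the square matrix `Whh`, for example, $n_{\text{in}} = n_{\text{out}} = \text{hidden\_size}$, so the bound simplifies to $a = \sqrt{3 / \text{hidden\_size}}$.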



In [None]:
class CustomRNN(nn.Module):
    def __init__(self, input_dim, hidden_size):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_size = hidden_size

        self.Wxh = None
        self.Whh = None
        self.bh = None
        # TODO 1: Initialize the weights and biases, the name of the parameters should be Wxh, Whh, and bh (important for testing)
        # YOUR CODE HERE
        raise NotImplementedError()

        # TODO 3: Apply Xavier initialization (crucial for training!)
        # YOUR CODE HERE
        raise NotImplementedError()

    def forward(self, x, h_0):
        """
        Args:
            x: (batch_size, seq_len, input_dim)
            h_0: (1, batch_size, hidden_size)

        Returns:
            out: (batch_size, seq_len, hidden_size)
            h_n: (1, batch_size, hidden_size)
        """
        output, h_t = None, None

        seq_len = x.shape[1]
        h_t = h_0
        output = []

        for t in range(seq_len):
            x_t = x[:, t, :]
            # TODO 2: Implement the forward pass of the RNN and append the hidden states to the output list
            # YOUR CODE HERE
            raise NotImplementedError()
            output.append(h_t.transpose(0, 1))

        output = torch.cat(output, dim=1)
        return output, h_t


TestRNN.test_output_shape(CustomRNN)
TestRNN.test_output_equality(CustomRNN)

<a id="lstm"></a>
## 2.2. LSTM (Optional)

**TODO**: Implement the missing part of the `CustomLSTM` class using the formula below:

$$
\begin{align*}
f_t &= \sigma(W_f \cdot [x_t, h_{t-1}] + b_f) \\
i_t &= \sigma(W_i \cdot [x_t, h_{t-1}] + b_i) \\
\tilde{C}_t &= \tanh(W_c \cdot [x_t, h_{t-1}] + b_c) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma(W_o \cdot [x_t, h_{t-1}] + b_o) \\
h_t &= o_t \odot \tanh(C_t)
\end{align*}
$$

The formulas are similar to what you've seen in the slides. After you implement the forward pass, we will test (1) the output shape and (2) whether the output values are close to those of PyTorch's LSTM implementation. If you get stuck, you can still proceed to the next section.

In [None]:
class CustomLSTM(nn.Module):
    def __init__(self, input_dim, hidden_size):
        super().__init__()
        self.input_dim = input_dim
        self.hidden_size = hidden_size

        # LSTMs have four gates, so map input & hidden state to 4 * hidden_size
        self.Wih = nn.Parameter(torch.randn(4 * hidden_size, input_dim))
        self.Whh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size))
        self.bih = nn.Parameter(torch.zeros(4 * hidden_size))
        self.bhh = nn.Parameter(torch.zeros(4 * hidden_size))

        self._init_weights()

    def _init_weights(self) -> None:
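        # Uniform initialization in [-1/sqrt(hidden_size), 1/sqrt(hidden_size)],
        # matching PyTorch's default for nn.LSTM (this is not Xavier initialization)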
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for param in self.parameters():
            init.uniform_(param, -stdv, stdv)

    def forward(self, x, initial_states):
        """
        Args:
            x: (batch_size, seq_len, input_dim)
            initial_states: (h_0, c_0)
                h_0: (1, batch_size, hidden_size)
                c_0: (1, batch_size, hidden_size)

        Returns:
            out: (batch_size, seq_len, hidden_size)
            (h_n, c_n):
                h_n: (1, batch_size, hidden_size)
                c_n: (1, batch_size, hidden_size)
        """
        seq_len = x.shape[1]
        h_t, c_t = initial_states
        h_t = h_t.squeeze(0)
        c_t = c_t.squeeze(0)

        hidden_seq = []

        # Decompose the parameters into gate-specific weights and biases
        W = torch.cat([self.Wih, self.Whh], dim=1)
        b = self.bih + self.bhh
        Wi, Wf, Wc, Wo = W.chunk(4, dim=0)
        bi, bf, bc, bo = b.chunk(4, dim=0)

        for t in range(seq_len):
            x_t = x[:, t, :]
            # TODO: Implement the forward pass of the LSTM
            # Concatenate input and previous hidden state: (batch_size, input_dim + hidden_size)
            hx_cat = torch.cat([x_t, h_t], dim=1)
            f_t = torch.sigmoid(hx_cat @ Wf.T + bf)  # (batch_size, hidden_size)
            # YOUR CODE HERE
            raise NotImplementedError()
            hidden_seq.append(h_t.unsqueeze(0))

        hidden_seq = torch.cat(hidden_seq, dim=0)
        hidden_seq = hidden_seq.transpose(0, 1).contiguous()
        return hidden_seq, (h_t.unsqueeze(0), c_t.unsqueeze(0))


TestLSTM.test_output_shape(CustomLSTM)
TestLSTM.test_output_equality(CustomLSTM)

<a id="experiment-lm"></a>
# 3. Experiment: Character-Level Language Modeling

After implementing the RNN, you'll train a character-level language model on the Shakespeare dataset.

**TODO**: Open TensorBoard and run the following cells to train the RNN model.

You can use the PyTorch RNN implementation in case you encounter issues with the custom RNN implementation or for comparison.

If implemented correctly, the `custom_rnn` model should achieve a validation loss below 1.9 and the `custom_lstm` model should achieve a validation loss below 1.8. Each model should take less than 10 minutes to train on a GPU, or about 15-20 minutes on CPU.

In [None]:
lm_rnn_config = LMExperimentConfig(rnn_type="rnn")
model = CustomRNN(lm_rnn_config.vocab_size, lm_rnn_config.hidden_size)
# alternatively, you can uncomment the following line to use the PyTorch RNN
# model = nn.RNN(lm_rnn_config.vocab_size, lm_rnn_config.hidden_size, batch_first=True)
rnn_trainer = LMTrainer(model, lm_rnn_config)
rnn_trainer.run_experiment()

In [None]:
lm_lstm_config = LMExperimentConfig(rnn_type="lstm")
model = CustomLSTM(lm_lstm_config.vocab_size, lm_lstm_config.hidden_size)
# alternatively, you can uncomment the following line to use the PyTorch LSTM
# model = nn.LSTM(lm_lstm_config.vocab_size, lm_lstm_config.hidden_size, batch_first=True)
lstm_trainer = LMTrainer(model, lm_lstm_config)
lstm_trainer.run_experiment()

We can generate some text from the trained model to see how well it captures the style of Shakespeare's writing. Keep in mind that this is a very small model, trained on a tiny dataset. Which aspects of the data did the model learn, and where does it still struggle? If you want, you can also experiment with training for more steps or increasing the model size to see whether you can improve on these results.

In [None]:
from exercise_utils.nlp.lm.utils import format_generation_logging, generate_text

prompt = "Now are our brows bound with victorious wreaths;"

# also test with lstm_trainer once you have trained it
text_gen = generate_text(rnn_trainer.model, rnn_trainer.tokenizer, 1024, prompt)
text_gen = format_generation_logging(text_gen, prompt)
display(Markdown(text_gen))

<a id="neural-machine-translation"></a>
# 4. Neural Machine Translation

Neural Machine Translation (NMT) is the task of translating text from one language to another using neural networks. In this assignment, we will use a [Seq2Seq](https://arxiv.org/abs/1409.3215) model to perform NMT on the [Multi30k](https://arxiv.org/abs/1605.00459) dataset, which contains parallel sentences in English and German (and French). Our goal is to train an RNN-based Seq2Seq model to translate German sentences to English.

<a id="dataset-preparation"></a>
## 4.1. Dataset Preparation

The original Multi30k dataset consists of 29,000 training sentences and 1,014 validation sentences. We also augment the dataset with synthetic data generated by translating the original German sentences to English with a pre-trained model, resulting in a total of 60k training sentences.

Let's start by loading the dataset and examining a few examples.

In [None]:
from exercise_utils.nlp.nmt.utils import create_nmt_dataloaders

config = NMTExperimentConfig(batch_size=4, rnn_type="lstm")
train_loader, _ = create_nmt_dataloaders(config)
train_dataset = train_loader.dataset

idx = torch.arange(3)
src_ids = [train_dataset[i][0] for i in idx]
tgt_ids = [train_dataset[i][1] for i in idx]

src_sents = config.src_tokenizer.decode_batch(src_ids)
tgt_sents = config.tgt_tokenizer.decode_batch(tgt_ids)

print("Source sentences:")
for sent_de in src_sents:
    print(sent_de)

print("\nTarget sentences:")
for sent_en in tgt_sents:
    print(sent_en)

<a id="tokenization"></a>
## 4.2. Tokenization

As you saw earlier, tokenization is the process of converting text into numerical tokens that can be fed into a neural network. Splitting a text into individual characters keeps the vocabulary small, but it produces very long token sequences, which makes training and inference slow and forces the model to capture dependencies across many time steps.

Conversely, splitting text based on spaces to extract whole words also has its drawbacks. For instance, certain phrases (like "New York") consist of multiple words, some words might be misspelled, or the text may be in a language that doesn’t use spaces (e.g., Chinese).

To address these limitations, subword tokenization methods like Byte Pair Encoding (BPE) and SentencePiece were developed. These techniques can effectively handle such cases. In this assignment, you won't need to implement these tokenization methods from scratch; instead, we’ve provided a wrapper class that utilizes the SentencePiece tokenizer to process the text data.
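
To give an intuition for how a subword vocabulary can be learned, here is a toy sketch of the core BPE idea: repeatedly merge the most frequent pair of adjacent symbols. The corpus, the helper names, and the number of merges are made up for illustration, and this is not how the provided SentencePiece wrapper is implemented.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs over all words and return the most frequent one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: word (as a tuple of characters) -> frequency.
words = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
for _ in range(3):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair)
```

After a few merges, the frequent suffix "est" becomes a single symbol, so "newest" and "widest" share a subword unit while rare words can still be spelled out from smaller pieces.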

In [None]:
sent_de = "Zwei Männer essen in einer Cafeteria."
result = config.src_tokenizer.tokenizer.encode(sent_de)
print("Text:", sent_de)
print("Tokens:", result.tokens)
print("Token ids:", result.ids)

print("-" * 60)

sent_en = "Two men are eating food in a cafeteria."
result = config.tgt_tokenizer.tokenizer.encode(sent_en)
print("Text:", sent_en)
print("Tokens:", result.tokens)
print("Token ids:", result.ids)

<a id="batching-and-padding"></a>
## 4.3. Batching and Padding

In this section, we visualize how sentences of varying lengths are padded within a batch using the padding token `<pad>`. This is necessary to ensure uniform length for training.

Padding is necessary for batching but introduces complexity: the model must not treat padded positions as real input. In this exercise, you will handle this with `pack_padded_sequence` in the encoder and decoder implementations.

Additionally, each target sentence includes a start token `<s>` at the beginning and an end token `</s>` at the end. These tokens help the model identify the sequence boundaries during training and inference.
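
A minimal sketch of the padding scheme described above, using plain Python lists. The special-token IDs (`PAD = 0`, `BOS = 1`, `EOS = 2`) and the helper names are assumptions for illustration; the actual IDs come from the tokenizers in the config.

```python
PAD, BOS, EOS = 0, 1, 2  # hypothetical special-token IDs


def pad_batch(sequences, pad_id=PAD):
    """Right-pad every sequence to the length of the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def add_boundaries(seq):
    """Wrap a target sequence in start and end tokens."""
    return [BOS] + seq + [EOS]


targets = [add_boundaries(seq) for seq in [[5, 6, 7], [8], [9, 10]]]
batch = pad_batch(targets)
# Number of non-padding tokens per row, i.e. the true sequence lengths.
lengths = [sum(tok != PAD for tok in row) for row in batch]
print(batch)
print(lengths)
```

Counting the non-padding tokens per row recovers the original sequence lengths, which is the information needed later to skip padded positions.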


In [None]:
def format_batch(src_ids, tgt_ids, config):
    print(f"Tensor Shapes: src={src_ids.shape}, tgt={tgt_ids.shape}")
    print("Special Tokens:", config.src_tokenizer.special_tokens, "\n")
    src_ids, tgt_ids = src_ids.tolist(), tgt_ids.tolist()

    for i, (src, tgt) in enumerate(zip(src_ids, tgt_ids), 1):
        src_text = config.src_tokenizer.tokenizer.decode(src, skip_special_tokens=False)
        tgt_text = config.tgt_tokenizer.tokenizer.decode(tgt, skip_special_tokens=False)

        print(f"Pair {i}:")
        print(f"DE: {src_text}")
        print(f"EN: {tgt_text}")
        print()


torch.manual_seed(0)
batch = next(iter(train_loader))
src_ids, tgt_ids = batch
format_batch(src_ids, tgt_ids, config)

<a id="bleu"></a>
## 4.4. BLEU

[BLEU](https://www.aclweb.org/anthology/P02-1040.pdf) is a metric for evaluating the quality of machine translations. It compares the model's translations (hypotheses) against human reference translations by calculating n-gram overlaps between them. The score ranges from 0 to 100, where higher scores indicate better translation quality. 

In this assignment, we use the [sacrebleu library](https://github.com/mjpost/sacrebleu?tab=readme-ov-file) to compute corpus-level BLEU scores by comparing the model's predicted translations against the reference translations from our validation dataset.

Note that BLEU scores are not perfect and may not always correlate with human judgment. However, they provide a useful quantitative measure for comparing different models and hyperparameters.
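
The n-gram overlap at the heart of BLEU can be sketched as a clipped ("modified") n-gram precision. This is only one ingredient: full BLEU combines the precisions for several n-gram orders with a geometric mean and a brevity penalty, and sacrebleu additionally standardizes tokenization. The function names below are illustrative assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(hypothesis, reference, n):
    """Clipped n-gram precision: each hypothesis n-gram counts at most
    as often as it occurs in the reference."""
    hyp_counts = Counter(ngrams(hypothesis, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return overlap / total if total else 0.0


reference = "the cat is on the mat".split()
hypothesis = "the the the the the the the".split()
print(modified_precision(hypothesis, reference, 1))
print(modified_precision(reference, reference, 2))
```

Clipping is what prevents the degenerate hypothesis from scoring well: "the" occurs only twice in the reference, so only two of the seven hypothesis unigrams are counted as matches.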

<a id="implementation-nmt"></a>
# 5. NMT Implementation: Building Seq2Seq

In this section, you'll implement the Encoder and Decoder of the [Seq2Seq](https://arxiv.org/abs/1409.3215) model. You can refer to Figure 1 in the paper for the model architecture.

#### Architecture:
1. **Encoder**: The encoder processes the input sequence token by token, producing a context vector that summarizes the input sequence. It typically uses an RNN (like LSTM or GRU) to generate hidden states for each token in the input. The final hidden state of the encoder is used as the initial hidden state for the decoder.

2. **Decoder**: The decoder generates the target sequence token by token, using the context vector from the encoder as its initial hidden state. At each time step, the decoder takes the previous token (or an embedding of it) as input and predicts the next token in the sequence. The RNN hidden state is updated at each step based on the previous hidden state and the current input token.

#### Operations:
1. **Training Process**: During training, the decoder receives the actual target token from the previous time step (teacher forcing). This helps the model learn the correct mapping from the input sequence to the target sequence.

2. **Inference (Translation)**: During inference, we feed the source tokens into the encoder to get the context vector. The decoder will receive the context vector and a **start token**, then generate the next tokens autoregressively until it predicts an **end token** or reaches the maximum sequence length. 
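
The difference between the two modes of operation can be sketched with a toy stand-in for the decoder. Here `TOY_NEXT_TOKEN` is a hypothetical lookup table that depends only on the previous token, whereas a real decoder also conditions on its hidden state; picking a single next token per step (greedy decoding) is an assumption of this sketch and not necessarily what the provided `Seq2Seq` class does.

```python
BOS, EOS = "<s>", "</s>"

# Hypothetical stand-in for a trained decoder: maps the previous token to the next one.
TOY_NEXT_TOKEN = {BOS: "two", "two": "men", "men": "eat", "eat": EOS}


def teacher_forcing_pairs(target_tokens):
    """Training: the decoder input is the reference without its last token,
    the label is the reference without its first token."""
    return target_tokens[:-1], target_tokens[1:]


def greedy_decode(next_token, max_len=10):
    """Inference: feed the model's own previous prediction back in
    until it emits EOS or the sequence reaches max_len tokens."""
    tokens = [BOS]
    while len(tokens) < max_len:
        prediction = next_token[tokens[-1]]
        tokens.append(prediction)
        if prediction == EOS:
            break
    return tokens


decoder_inputs, labels = teacher_forcing_pairs([BOS, "two", "men", "eat", EOS])
print(decoder_inputs, labels)
print(greedy_decode(TOY_NEXT_TOKEN))
```

With teacher forcing, all decoder inputs are known up front, so the whole target sequence can be processed in one pass; at inference time each step has to wait for the previous prediction.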

<a id="encoder"></a>
## 5.1. Encoder

**TODO**: Complete the `NMTEncoder` class implementation and pass the tests.
- The input is the source token IDs, and the output is the last hidden state of the encoder (for an LSTM, the tuple of hidden and cell state).
- `pad_id` is the ID of the padding token.
- You should use [pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html) to handle variable-length sequences. 
- Forward pass should be: 
    1. Embedding layer
    2. Dropout
    3. Convert to `PackedSequence`
    4. Feed to LSTM and get the output

**Hint**: 
- We have calculated the lengths of the sequences as `source_lengths` for you, which will be used in the `pack_padded_sequence` function.
- Use `enforce_sorted=False` and `batch_first=True` in `pack_padded_sequence`.

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence


class NMTEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embedding = nn.Embedding(config.vocab_size, config.embed_size)

        assert config.rnn_type in ["rnn", "lstm"]
        rnn_module = nn.LSTM if config.rnn_type == "lstm" else nn.RNN

        self.rnn = rnn_module(
            input_size=config.embed_size,
            hidden_size=config.hidden_size,
            num_layers=config.num_layers,
            dropout=config.dropout,
            batch_first=True,
        )
        self.dropout = nn.Dropout(config.dropout)

    def forward(self, src, pad_id):
        """
        Args:
            src: input tokens ids of shape (batch_size, T)
                 where T is the largest sequence length in the batch
            pad_id: padding token id

        Returns:
            last_hidden: last hidden state. Tuple of (hidden, cell)
                hidden: (batch_size, num_layers, hidden_size)
                cell: (batch_size, num_layers, hidden_size)
                where num_layers is the number of LSTM layers
        """
        source_lengths = (src != pad_id).sum(1).tolist()

        # YOUR CODE HERE
        raise NotImplementedError()
        return last_hidden


TestNMTEncoder.test_output_shape(NMTEncoder)

<a id="decoder"></a>
## 5.2. Decoder

**TODO**: Complete the `NMTDecoder` class implementation and pass the tests.
- The input `init_hidden` is the initial hidden state of the decoder, which is the last hidden state of the encoder.
- Forward pass should be:
    1. Embedding layer
    2. Dropout
    3. LSTM
    4. Output projection layer (linear layer)
- Note that this time we need to both pack the sentences before feeding them to the LSTM using [pack_padded_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pack_padded_sequence.html), and unpack the output after the LSTM layer using [pad_packed_sequence](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_packed_sequence.html).

**Hint**: 
- Use `enforce_sorted=False` and `batch_first=True` in `pack_padded_sequence`.

In [None]:
from torch.nn.utils.rnn import pad_packed_sequence


class NMTDecoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embedding = nn.Embedding(config.vocab_size, config.embed_size)

        assert config.rnn_type in ["rnn", "lstm"]
        rnn_module = nn.LSTM if config.rnn_type == "lstm" else nn.RNN

        self.rnn = rnn_module(
            input_size=config.embed_size,
            hidden_size=config.hidden_size,
            num_layers=config.num_layers,
            dropout=config.dropout,
            batch_first=True,
        )
        self.dropout = nn.Dropout(config.dropout)
        self.out_proj = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, tgt, init_hidden, pad_id):
        """
        This can be used to decode a single step or the whole sequence.

        Args:
            tgt: input tokens (ids) of shape (batch_size, T)
                T can be 1 or more.
                When T=1, this decodes a single step.
                Otherwise, this decodes the whole sequence.
            init_hidden: initial hidden state of shape (batch_size, num_layers, hidden_size)
            pad_id: padding token id

        Returns:
            logits: (batch_size, T, vocab_size)
            hidden: (batch_size, num_layers, hidden_size)
        """
        target_lengths = (tgt != pad_id).sum(1).tolist()

        # YOUR CODE HERE
        raise NotImplementedError()
        return logits, hidden


TestNMTDecoder.test_output_shape(NMTDecoder)

<a id="experiment-nmt"></a>
# 6. Experiment: Neural Machine Translation

**TODO**: Open TensorBoard and run the following cells to train the Seq2Seq model (RNN and LSTM) on the Multi30k dataset.

The default training configuration (10k steps, LSTM) takes around 10-15 minutes on a GPU. If you implemented the model correctly, you should see a validation loss of around 2.4 and a BLEU score of around 22.

In [None]:
from models.nmt import Seq2Seq

config = NMTExperimentConfig(rnn_type="lstm")
model = Seq2Seq(NMTEncoder, NMTDecoder, config)
trainer = NMTTrainer(model, config)
trainer.run_experiment()

Here are some examples of translations generated by the model:

In [None]:
def display_translations(trainer):
    start_tag = '<div style="font-size: 14px; line-height: 1.5;">\n'
    body = trainer.get_random_examples()
    end_tag = "\n</div>"
    display(Markdown(start_tag + body + end_tag))


display_translations(trainer)

<a id="questions"></a>
# 7. Questions

Great job on completing the assignment! While we’ve implemented some parts of the code for you, it's important for you to understand the entire process and the reasoning behind it. To wrap things up, take a moment to think about the following questions (you may have to refer to the code to answer them):

1. For the character-level language model, the input characters are tokenized into a sequence of integers, but the RNN requires vectors as input. How did we transform the integer sequence into vectors before passing it to the RNN? (`MiniLM` in `models/mini_lm.py`)

2. Describe the method we use to generate text from the trained language model. How does the text generation process work? (`MiniLM` in `models/mini_lm.py`)

3. Describe the difference between our language model and the translation model in terms of their text sampling strategies. (`MiniLM` in `models/mini_lm.py` and `Seq2Seq` in `models/nmt.py`)

4. In `NMTEncoder`, what potential issues might arise if we don't use `pack_padded_sequence` before feeding the input to the LSTM? 

<a id="references"></a>
# 8. References

- Seq2Seq: [Sequence to Sequence Learning with Neural Networks](https://arxiv.org/abs/1409.3215)
- BLEU: [BLEU: a Method for Automatic Evaluation of Machine Translation](https://www.aclweb.org/anthology/P02-1040.pdf)
- Multi30k: [Multi30K: Multilingual English-German Image Descriptions](https://arxiv.org/abs/1605.00459)