# Translating with Recurrent Neural Networks


<div style="background-color: #f0f8ff; border: 2px solid #4682b4; padding: 10px;">
<a href="https://colab.research.google.com/github/DeepTrackAI/DeepLearningCrashCourse/blob/main/Ch07_RNN/ec07_A_nlp_rnn/nlp_rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
<strong>If using Colab/Kaggle:</strong> You need to uncomment the code in the cell below this one.
You also need to copy the corpus file to the Colab/Kaggle work directory. The code below reads "cmn.txt"; the
<a href="https://github.com/DeepTrackAI/DeepLearningCrashCourse/tree/main/Ch07_RNN/ec07_A_nlp_rnn">notebook folder</a> on GitHub provides "eng-spa.txt".
</div>

In [1]:
# Uncomment if using Colab/Kaggle
!pip install contractions deeplay deeptrack spacy


This notebook provides you with a complete code example that implements a sequence-to-sequence (seq2seq) model for machine translation using recurrent neural networks.

<div style="background-color: #f0f8ff; border: 2px solid #4682b4; padding: 10px;">
<strong>Note:</strong> This notebook contains the Code Example 7-A from the book  

**Deep Learning Crash Course**  
Benjamin Midtvedt, Jesús Pineda, Henrik Klein Moberg, Harshith Bachimanchi, Joana B. Pereira, Carlo Manzo, Giovanni Volpe  
No Starch Press, San Francisco (CA), 2025  
ISBN-13: 9781718503922  

[https://nostarch.com/deep-learning-crash-course](https://nostarch.com/deep-learning-crash-course)

You can find the other notebooks on the [Deep Learning Crash Course GitHub page](https://github.com/DeepTrackAI/DeepLearningCrashCourse).
</div>

## Preparing the Bilingual Dataset

### Tokenizing the Sentences

Implement a function to tokenize a sentence ...

In [2]:
import spacy

tokenizers = {"eng": spacy.blank("en"), "cn": spacy.blank("zh")}

def tokenize(text, lang="eng"):
    """Tokenize text."""
    tokens = tokenizers[lang](text)
    return tokens

In [3]:
print([token.text for token in tokenize("This is a simple example!")])

['This', 'is', 'a', 'simple', 'example', '!']


... then update this function to handle contractions ...

In [4]:
import contractions, spacy

tokenizers = {"eng": spacy.blank("en"), "cn": spacy.blank("zh")}

def tokenize(text, lang="eng"):
    """Tokenize text."""
    text = contractions.fix(text) if lang == "eng" else text
    tokens = tokenizers[lang](text)
    return tokens

In [5]:
print([token.text for token in tokenize("This isn't the same example!")])

['This', 'is', 'not', 'the', 'same', 'example', '!']


... then update this function to remove irrelevant punctuation and non-alphabetical characters ...

In [7]:
import contractions, re, spacy, unicodedata

tokenizers = {"eng": spacy.blank("en"), "cn": spacy.blank("zh")}

regular_expression = r"^[a-zA-Z0-9\u4e00-\u9fff.,!?¡¿/:()]+$"

pattern = re.compile(unicodedata.normalize("NFC", regular_expression))

def tokenize(text, lang="eng"):
    """Tokenize text."""
    swaps = {"’": "'", "‘": "'", "“": '"', "”": '"', "´": "'", "´´": '"'}
    for old, new in swaps.items():
        text = text.replace(old, new)
    text = contractions.fix(text) if lang == "eng" else text
    tokens = tokenizers[lang](text)
    return [token.text for token in tokens if pattern.match(token.text)]

In [8]:
print([token for token in tokenize("Double-check your code!")])

['Double', 'check', 'your', 'code', '!']
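
To see why the hyphen disappears from the output above, the filter can be tried on bare token strings. This is a minimal sketch that reuses the same character class but skips spaCy, so the token list is written by hand (an assumption about how the tokenizer splits the text):

```python
import re

# Same character class as in the cell above: Latin letters, digits,
# CJK unified ideographs, and a small set of punctuation marks.
pattern = re.compile(r"^[a-zA-Z0-9\u4e00-\u9fff.,!?¡¿/:()]+$")

# Hand-written tokens standing in for the tokenizer output (assumption).
candidate_tokens = ["Double", "-", "check", "你", "。", "code", "!"]

# A token survives only if every character belongs to the allowed class.
kept = [token for token in candidate_tokens if pattern.match(token)]
print(kept)
```

The hyphen and the full-width period fall outside the character class, so both are dropped.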


### Implementing a Corpus Iterator

Implement a function to read and tokenize sentences by iterating through a corpus file ...

In [9]:
def corpus_iterator(filename, lang, lang_position):
    """Read and tokenize texts by iterating through a corpus file."""
    with open(filename, "r", encoding="utf-8") as file:
        for line in file:
            sentences = line.strip().split("\t")
            sentence = unicodedata.normalize("NFC", sentences[lang_position])
            yield tokenize(sentence, lang)
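
The iterator expects one sentence pair per line, with the two languages separated by a tab. The sketch below reproduces that reading logic on a temporary file. `read_column` is a hypothetical stand-in that splits on whitespace instead of calling `tokenize`, and the two sample lines are made up for illustration:

```python
import os
import tempfile
import unicodedata

def read_column(filename, position):
    """Yield whitespace-split sentences from one column of a tab-separated file."""
    with open(filename, "r", encoding="utf-8") as file:
        for line in file:
            columns = line.strip().split("\t")
            yield unicodedata.normalize("NFC", columns[position]).split()

# Two made-up sentence pairs in the tab-separated layout (assumption).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("Hi there\t你 好\nThank you\t谢 谢\n")
    path = tmp.name

first_column = list(read_column(path, 0))
second_column = list(read_column(path, 1))
os.remove(path)  # Clean up the temporary file.
print(first_column, second_column)
```

Because the function is a generator, the file is read lazily, one line at a time, which keeps memory use low for large corpora.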

In [10]:
from google.colab import drive
drive.mount('/content/drive')

import os
os.chdir('/content/drive/My Drive/DeepLearningCrashCourse-main/Ch07_RNN/ec07_A_nlp_rnn')

Mounted at /content/drive


... and use it to extract the English and corresponding Chinese tokenized sentences.

In [11]:
for tokens_eng, tokens_cn in zip(
    corpus_iterator(filename="cmn.txt", lang="eng", lang_position=0),
    corpus_iterator(filename="cmn.txt", lang="cn", lang_position=1),
):
    print(f"{tokens_eng} {tokens_cn}")

Streaming output truncated; only the last 5000 lines are shown.
['I', 'was', 'at', 'home', 'most', 'of', 'the', 'day', 'yesterday', '.'] ['我', '昨', '天', '大', '部', '分', '时', '间', '在', '家']
['I', 'was', 'terribly', 'confused', 'by', 'his', 'question', '.'] ['我', '對', '他', '的', '問', '題', '感', '到', '非', '常', '困', '惑']
['I', 'was', 'terribly', 'confused', 'by', 'his', 'question', '.'] ['我', '對', '他', '的', '問', '題', '感', '到', '非', '常', '迷', '惑']
['I', 'was', 'told', 'that', 'I', 'do', 'not', 'need', 'to', 'do', 'that', '.'] ['有', '人', '告', '訴', '我', '我', '不', '用', '做']
['I', 'weighed', 'myself', 'on', 'the', 'bathroom', 'scales', '.'] ['我', '用', '浴', '室', '的', '体', '重', '计', '量', '了', '体', '重']
['I', 'will', 'do', 'the', 'shopping', 'for', 'her', 'birthday', '.'] ['我', '要', '去', '给', '她', '生', '日', '买', '点', '东', '西']
['I', 'will', 'get', 'you', 'a', 'bike', 'for', 'your', 'birthday', '.'] ['你', '生', '日', '的', '时', '候', '我', '送', '你', '一', '辆', '自', '行', '车']
['I', 'wish', 'I', 'could', 'go', 'to', 'the', 'party',

### Building a Vocabulary

Implement a class to represent a vocabulary ...

In [12]:
class Vocab:
    """Vocabulary as callable dictionary."""

    def __init__(self, vocab_dict, unk_token="<unk>"):
        """Initialize vocabulary."""
        self.vocab_dict, self.unk_token = vocab_dict, unk_token
        self.default_index = vocab_dict.get(unk_token, -1)
        self.index_to_token = {idx: token for token, idx in vocab_dict.items()}

    def __call__(self, token_or_tokens):
        """Return the index(es) for given token or list of tokens."""
        if not isinstance(token_or_tokens, list):
            return self.vocab_dict.get(token_or_tokens, self.default_index)
        else:
            return [self.vocab_dict.get(token, self.default_index)
                    for token in token_or_tokens]

    def set_default_index(self, index):
        """Set default index for unknown tokens."""
        self.default_index = index

    def lookup_token(self, index_or_indices):
        """Retrieve token corresponding to given index or list of indices."""
        if not isinstance(index_or_indices, list):
            return self.index_to_token.get(int(index_or_indices),
                                           self.unk_token)
        else:
            return [self.index_to_token.get(int(index), self.unk_token)
                    for index in index_or_indices]

    def get_tokens(self):
        """Return a list of tokens ordered by their index."""
        tokens = [None] * len(self.index_to_token)
        for index, token in self.index_to_token.items():
            tokens[index] = token
        return tokens

    def __iter__(self):
        """Iterate over the tokens in the vocabulary."""
        return iter(self.vocab_dict)

    def __len__(self):
        """Return the number of tokens in the vocabulary."""
        return len(self.vocab_dict)

    def __contains__(self, token):
        """Check if a token is in the vocabulary."""
        return token in self.vocab_dict

... which you can use as shown ...

In [13]:
vocab_dict = {"hello": 0, "world": 1, "<unk>": 2}
vocab = Vocab(vocab_dict)

In [14]:
vocab("hello")

0

In [15]:
vocab("unknown")

2

In [16]:
vocab.lookup_token(1)

'world'

In [17]:
vocab.lookup_token(5)

'<unk>'

... implement a function to build vocabulary from an iterator ...

In [18]:
from collections import Counter

def build_vocab_from_iterator(iterator, specials=None, min_freq=1):
    """Build vocabulary from an iterator over tokenized sentences."""
    token_freq = Counter(token for tokens in iterator for token in tokens)
    vocab, index = {}, 0
    if specials:
        for token in specials:
            vocab[token] = index
            index += 1
    for token, freq in token_freq.items():
        if freq >= min_freq:
            vocab[token] = index
            index += 1
    return vocab

... which you can then use on a list of tokenized sentences ...

In [19]:
tokenized_sentences = [["this", "is", "an", "example"],
                       ["another", "example", "sentence"],
                       ["this", "is", "a", "test"]]
vocab_dict = build_vocab_from_iterator(
    tokenized_sentences, specials=["<unk>", "<pad>"], min_freq=1,
)

In [20]:
print(vocab_dict)

{'<unk>': 0, '<pad>': 1, 'this': 2, 'is': 3, 'an': 4, 'example': 5, 'another': 6, 'sentence': 7, 'a': 8, 'test': 9}
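
The `min_freq` argument controls which tokens enter the vocabulary. As a rough illustration using only `collections.Counter`, the same toy sentences with a threshold of 2 keep just the tokens that occur at least twice (the helper name `frequent_tokens` is hypothetical):

```python
from collections import Counter

def frequent_tokens(sentences, min_freq):
    """Return tokens whose corpus frequency is at least min_freq."""
    counts = Counter(token for tokens in sentences for token in tokens)
    return [token for token, freq in counts.items() if freq >= min_freq]

toy_sentences = [["this", "is", "an", "example"],
                 ["another", "example", "sentence"],
                 ["this", "is", "a", "test"]]

# Only "this", "is", and "example" appear twice in the toy corpus.
print(frequent_tokens(toy_sentences, min_freq=2))
```

Tokens below the threshold are left out of the vocabulary and fall back to the default index when looked up.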


... implement a function to build a vocabulary from a corpus file ...

In [21]:
def build_vocab(filename, lang, lang_position, specials=["<unk>"], min_freq=5):
    """Build vocabulary."""
    vocab_dict = build_vocab_from_iterator(
        corpus_iterator(filename, lang, lang_position), specials, min_freq,
    )
    vocab = Vocab(vocab_dict, unk_token=specials[0])
    vocab.set_default_index(vocab(specials[0]))
    return vocab

... and use this function to create the vocabularies for the input and output languages.

In [22]:
in_lang, out_lang, filename = "eng", "cn", "cmn.txt"
specials = ["<pad>", "<sos>", "<eos>", "<unk>"]

in_vocab = build_vocab(filename, in_lang, lang_position=0, specials=specials)
out_vocab = build_vocab(filename, out_lang, lang_position=1, specials=specials)

## Preprocessing the Data

Implement a function to check if all words in a sentence are present in a vocabulary ...

In [23]:
def all_words_in_vocab(sentence, vocab):
    """Check whether all words in a sentence are present in a vocabulary."""
    return all(word in vocab for word in sentence)

... a function to pad a sequence of tokens ...

In [24]:
def pad(tokens, max_length=10):
    """Pad sequence of tokens."""
    padding_length = max_length - len(tokens)
    return ["<sos>"] + tokens + ["<eos>"] + ["<pad>"] * padding_length
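
Because `<sos>` and `<eos>` are added on top of the padding, every padded sequence ends up with `max_length + 2` tokens. A quick check, repeating the function so the snippet runs on its own:

```python
def pad(tokens, max_length=10):
    """Pad sequence of tokens (same logic as the cell above)."""
    padding_length = max_length - len(tokens)
    return ["<sos>"] + tokens + ["<eos>"] + ["<pad>"] * padding_length

short = pad(["I", "bought", "a", "book", "."])  # 5 tokens, 5 pads.
full = pad(["w"] * 10)                          # 10 tokens, no pads.
print(len(short), len(full))
```

With the default `max_length=10`, both sequences have 12 entries, which is why the tensors printed by the data loader further below have length 12.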

... a function to process the language corpus ...

In [25]:
import numpy as np

def process(filename, in_lang, out_lang, in_vocab, out_vocab, max_length=10):
    """Process language corpus."""
    in_sequences, out_sequences = [], []
    with open(filename, "r", encoding="utf-8") as file:
        for line in file:
            sentences = line.strip().split("\t")
            in_tokens = tokenize(unicodedata.normalize("NFC", sentences[0]),
                                 in_lang)
            out_tokens = tokenize(unicodedata.normalize("NFC", sentences[1]),
                                  out_lang)

            if (all_words_in_vocab(in_tokens, in_vocab)
                and len(in_tokens) <= max_length
                and all_words_in_vocab(out_tokens, out_vocab)
                and len(out_tokens) <= max_length):

                # Pass max_length on so padding matches the length filter.
                padded_in_tokens = pad(in_tokens, max_length)
                in_sequence = in_vocab(padded_in_tokens)
                in_sequences.append(in_sequence)

                padded_out_tokens = pad(out_tokens, max_length)
                out_sequence = out_vocab(padded_out_tokens)
                out_sequences.append(out_sequence)
    return np.array(in_sequences), np.array(out_sequences)

... and build the datasets and data loaders.

In [26]:
import deeplay as dl
import deeptrack as dt
import torch

in_sequences, out_sequences = \
    process(filename, in_lang, out_lang, in_vocab, out_vocab)

sources = dt.sources.Source(inputs=in_sequences, targets=out_sequences)
train_sources, test_sources = dt.sources.random_split(sources, [0.85, 0.15])

inputs_pip = dt.Value(sources.inputs) >> dt.pytorch.ToTensor(dtype=torch.int)
outputs_pip = dt.Value(sources.targets) >> dt.pytorch.ToTensor(dtype=torch.int)

train_dataset = \
    dt.pytorch.Dataset(inputs_pip & outputs_pip, inputs=train_sources)
test_dataset = \
    dt.pytorch.Dataset(inputs_pip & outputs_pip, inputs=test_sources)

train_loader = dl.DataLoader(train_dataset, batch_size=256, shuffle=True)
test_loader = dl.DataLoader(test_dataset, batch_size=256, shuffle=False)




In [27]:
for in_sequences, out_sequences in train_loader:
    print(in_sequences[0], out_sequences[0])
    break

tensor([  1, 355, 413, 497, 332,   5,   2,   0,   0,   0,   0,   0]) tensor([  1, 214, 597, 598, 702,  97,  98,   6,   2,   0,   0,   0])


## Implementing and Training the Sequence-to-Sequence Architecture

Implement the encoder ...

In [28]:
class Seq2SeqEncoder(dl.DeeplayModule):
    """Sequence-to-sequence encoder."""

    def __init__(self, vocab_size, in_feats=300, hidden_feats=128,
                 hidden_layers=1, dropout=0.0):
        """Initialize sequence-to-sequence encoder."""
        super().__init__()
        self.hidden_feats, self.hidden_layers = hidden_feats, hidden_layers

        self.embedding = dl.Layer(torch.nn.Embedding, vocab_size, in_feats)
        self.rnn = dl.Layer(torch.nn.GRU, input_size=in_feats,
                            hidden_size=hidden_feats, num_layers=hidden_layers,
                            dropout=(0 if hidden_layers == 1 else dropout),
                            bidirectional=True, batch_first=True)

    def forward(self, in_sequences, contexts=None):
        """Calculate the encoded sequences and contexts."""
        in_embeddings = self.embedding(in_sequences)
        encoded_sequences, contexts = self.rnn(in_embeddings, contexts)
        encoded_sequences = (encoded_sequences[:, :, :self.hidden_feats]
                             + encoded_sequences[:, :, self.hidden_feats:])
        contexts = contexts[:self.hidden_layers]
        return encoded_sequences, contexts
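
The slicing in `forward` relies on how a bidirectional `torch.nn.GRU` lays out its results: the last dimension of the output holds the forward features followed by the backward features, and the hidden state stacks the forward direction before the backward one. A small shape check with plain PyTorch, where the toy sizes are arbitrary assumptions:

```python
import torch

torch.manual_seed(0)
batch, length, in_feats, hidden_feats = 2, 5, 8, 4

rnn = torch.nn.GRU(input_size=in_feats, hidden_size=hidden_feats,
                   num_layers=1, bidirectional=True, batch_first=True)
inputs = torch.randn(batch, length, in_feats)

with torch.no_grad():
    outputs, contexts = rnn(inputs)

# outputs: (batch, length, 2 * hidden_feats); contexts: (2, batch, hidden_feats).
summed = outputs[:, :, :hidden_feats] + outputs[:, :, hidden_feats:]
forward_context = contexts[:1]  # Same slice as contexts[:hidden_layers].

print(outputs.shape, summed.shape, forward_context.shape)
```

With a single layer, `contexts[:hidden_layers]` therefore keeps the final hidden state of the forward direction only, which is what gets handed to the unidirectional decoder.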

... implement the decoder ...

In [29]:
class Seq2SeqDecoder(dl.DeeplayModule):
    """Sequence-to-sequence decoder."""

    def __init__(self, vocab_size, in_feats=300, hidden_feats=128,
                 hidden_layers=1, dropout=0.0):
        """Initialize sequence-to-sequence decoder."""
        super().__init__()

        self.embedding = dl.Layer(torch.nn.Embedding, vocab_size, in_feats)
        self.rnn = dl.Layer(torch.nn.GRU, input_size=in_feats,
                            hidden_size=hidden_feats, num_layers=hidden_layers,
                            bidirectional=False, batch_first=True,
                            dropout=(0 if hidden_layers == 1 else dropout))
        self.dense = dl.Layer(torch.nn.Linear, hidden_feats, vocab_size)
        self.softmax = dl.Layer(torch.nn.Softmax, dim=-1)

    def forward(self, decoder_in_values, contexts):
        """Calculate the decoder outputs and contexts."""
        out_embeddings = self.embedding(decoder_in_values)
        decoder_outputs, contexts = self.rnn(out_embeddings, contexts)
        decoder_outputs = self.dense(decoder_outputs)
        decoder_outputs = self.softmax(decoder_outputs)
        return decoder_outputs, contexts

... implement the full seq2seq model combining the encoder and decoder ...

In [30]:
class Seq2SeqModel(dl.DeeplayModule):
    """Sequence-to-sequence model with evaluation method."""

    def __init__(self, in_vocab_size=None, out_vocab_size=None, embed_dim=300,
                 hidden_feats=128, hidden_layers=1, dropout=0.0,
                 teacher_prob=1.0):
        """Initialize the sequence-to-sequence model."""
        super().__init__()
        self.in_vocab_size, self.out_vocab_size = in_vocab_size, out_vocab_size
        self.teacher_prob = teacher_prob

        self.encoder = Seq2SeqEncoder(in_vocab_size, embed_dim, hidden_feats,
                                      hidden_layers, dropout)
        self.decoder = Seq2SeqDecoder(out_vocab_size, embed_dim, hidden_feats,
                                      hidden_layers, dropout)

    def forward(self, batch):
        """Calculate the decoder output vectors for the input sequences."""
        in_sequences, out_sequences = batch
        num_sequences, sequence_length = in_sequences.size()
        device = next(self.encoder.parameters()).device

        _, contexts = self.encoder(in_sequences)

        decoder_outputs_vec = torch.zeros(num_sequences, sequence_length,
                                          self.out_vocab_size).to(device)
        decoder_in_values = torch.full(size=(num_sequences, 1),
                                       fill_value=1, device=device)  # <sos>
        for t in range(sequence_length):
            decoder_outputs, contexts = \
                self.decoder(decoder_in_values, contexts)
            decoder_outputs_vec[:, t, :] = decoder_outputs.squeeze(1)

            if (np.random.rand() < self.teacher_prob
                and t < sequence_length - 1):  # Teacher forcing.
                decoder_in_values = \
                    out_sequences[:, t + 1].unsqueeze(-1).to(device)
            else:  # Model prediction.
                _, top_decoder_outputs = decoder_outputs.topk(1)
                decoder_in_values = \
                    top_decoder_outputs.squeeze(-1).detach().to(device)

        return decoder_outputs_vec

    def evaluate(self, in_sequences):
        """Evaluate model."""
        num_sequences, sequence_length = in_sequences.size()
        device = next(self.encoder.parameters()).device

        with torch.no_grad():
            _, contexts = self.encoder(in_sequences)

        pred_sequences = torch.zeros(num_sequences, sequence_length).to(device)
        decoder_in_values = torch.full(size=(num_sequences, 1),
                                       fill_value=1, device=device)  # <sos>
        for t in range(sequence_length):
            with torch.no_grad():
                decoder_outputs, contexts = \
                    self.decoder(decoder_in_values, contexts)
            _, top_decoder_outputs = decoder_outputs.topk(1)
            pred_sequences[:, t] = top_decoder_outputs.squeeze()

            decoder_in_values = top_decoder_outputs.squeeze(-1).detach()

        return pred_sequences
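
The `teacher_prob` parameter decides, at each time step, whether the decoder is fed the ground-truth token or its own prediction. The following is a simplified sketch of that choice in plain Python; `next_decoder_input` is a hypothetical helper, and the tokens are placeholder strings rather than tensors:

```python
import random

def next_decoder_input(target_token, predicted_token, teacher_prob, rng):
    """Pick the next decoder input: ground truth with probability teacher_prob."""
    if rng.random() < teacher_prob:
        return target_token     # Teacher forcing.
    return predicted_token      # Model prediction.

rng = random.Random(0)
choices = [next_decoder_input("truth", "prediction", 0.85, rng)
           for _ in range(1000)]
teacher_fraction = choices.count("truth") / len(choices)
print(f"Fraction of teacher-forced steps: {teacher_fraction:.2f}")
```

In the model above the random draw happens once per time step for the whole batch, so all sequences in a batch share the same choice at a given step.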

... define the loss function ...

In [31]:
def maskedNLL(decoder_outputs, out_sequences, padding=0):
    """Calculate the masked negative log-likelihood (NLL) loss."""
    flat_pred_sequences = decoder_outputs.view(-1, decoder_outputs.shape[-1])
    flat_target_sequences = out_sequences.view(-1, 1)
    pred_probs = torch.gather(flat_pred_sequences, 1, flat_target_sequences)

    nll = - torch.log(pred_probs)

    mask = out_sequences != padding
    masked_nll = nll.masked_select(mask.view(-1, 1))

    return masked_nll.mean()  # Loss.
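
To make the masking concrete, here is the same computation written in plain Python for a single toy sequence. The probabilities are made up for illustration, and index 0 plays the role of `<pad>` as in the vocabularies above:

```python
import math

def masked_nll(probabilities, targets, padding=0):
    """Average -log p(target) over the positions whose target is not padding."""
    losses = [-math.log(step_probs[target])
              for step_probs, target in zip(probabilities, targets)
              if target != padding]
    return sum(losses) / len(losses)

# Three time steps over a four-token vocabulary (index 0 = <pad>).
probabilities = [[0.10, 0.70, 0.10, 0.10],
                 [0.10, 0.10, 0.50, 0.30],
                 [0.90, 0.05, 0.03, 0.02]]
targets = [1, 2, 0]  # The last position is padding and is ignored.

loss = masked_nll(probabilities, targets)
print(round(loss, 4))
```

Only the first two positions contribute, so the loss is the mean of `-log(0.7)` and `-log(0.5)`. Without the mask, the padded positions would dominate the average for short sentences.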

... and implement the sequence-to-sequence application.

In [32]:
class Seq2Seq(dl.Application):
    """Application for the sequence-to-sequence model."""

    def __init__(self, in_vocab, out_vocab, teacher_prob=1.0):
        """Initialize the application."""
        super().__init__(loss=maskedNLL, optimizer=dl.Adam(lr=1e-3))
        self.model = Seq2SeqModel(in_vocab_size=len(in_vocab),
                                  out_vocab_size=len(out_vocab),
                                  teacher_prob=teacher_prob)

    def train_preprocess(self, batch):
        """Adjust the target sequence by shifting it one position backward."""
        in_sequences, out_sequences = batch
        shifted_out_sequences = \
            torch.cat((out_sequences[:, 1:], out_sequences[:, -1:]), dim=1)
        return (in_sequences, out_sequences), shifted_out_sequences

    def forward(self, batch):
        """Perform forward pass."""
        return self.model(batch)
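
The shift in `train_preprocess` aligns each decoder input with the token the model should predict next. A small list-based sketch of the same operation (the helper name `shift_targets` and the token indices are illustrative assumptions):

```python
def shift_targets(sequence):
    """Drop the first token and repeat the last one to keep the length."""
    return sequence[1:] + sequence[-1:]

# Indices follow the specials order used above: <pad>=0, <sos>=1, <eos>=2.
decoder_inputs = [1, 214, 597, 2, 0, 0]
targets = shift_targets(decoder_inputs)
print(targets)
```

Position by position, the input `<sos>` is paired with the first real token as target, the last real token is paired with `<eos>`, and the trailing padding is later ignored by the masked loss.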

## Loading Pretrained Embeddings

Download the GloVe embeddings ...

In [33]:
import os
from torchvision.datasets.utils import download_url, extract_archive

glove_folder = ".glove_cache"
if not os.path.exists(glove_folder):
    os.makedirs(glove_folder, exist_ok=True)
    url = "https://nlp.stanford.edu/data/glove.42B.300d.zip"
    download_url(url, glove_folder)
    zip_filepath = os.path.join(glove_folder, "glove.42B.300d.zip")
    extract_archive(zip_filepath, glove_folder)
    os.remove(zip_filepath)

... implement a function to load the GloVe embeddings ...

In [34]:
def load_glove_embeddings(glove_file):
    """Load GloVe embeddings."""
    glove_embeddings = {}
    with open(glove_file, "r", encoding="utf-8") as file:
        for line in file:
            values = line.split()
            word = values[0]
            glove_embeddings[word] = np.round(
                np.asarray(values[1:], dtype="float32"), decimals=6,
            )
    return glove_embeddings

... implement a function to get GloVe embeddings for a vocabulary ...

In [35]:
def get_glove_embeddings(vocab, glove_embeddings, embed_dim):
    """Get GloVe embeddings for a vocabulary."""
    embeddings = torch.zeros((len(vocab), embed_dim), dtype=torch.float32)
    for i, token in enumerate(vocab):
        embedding = glove_embeddings.get(token)
        if embedding is None:
            embedding = glove_embeddings.get(token.lower())
        if embedding is not None:
            embeddings[i] = torch.tensor(embedding, dtype=torch.float32)
    return embeddings
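
The lookup first tries the token as written, then its lowercase form, and leaves an all-zero row when neither is found. A list-based sketch of that fallback, with a made-up two-dimensional table standing in for the real GloVe file (`lookup_rows` is a hypothetical helper):

```python
def lookup_rows(tokens, table, dim):
    """Return one vector per token, leaving zeros where no entry exists."""
    rows = []
    for token in tokens:
        vector = table.get(token)
        if vector is None:
            vector = table.get(token.lower())  # Retry in lowercase.
        rows.append(list(vector) if vector is not None else [0.0] * dim)
    return rows

# Made-up two-dimensional "embeddings" for illustration only.
toy_table = {"book": [0.1, 0.2], "the": [0.3, 0.4]}
rows = lookup_rows(["The", "book", "書"], toy_table, dim=2)
print(rows)
```

Any token that has no entry in the table, such as the last one here, ends up with a zero vector.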

... load the pretrained GloVe embeddings ...

In [36]:
glove_file = os.path.join(glove_folder, "glove.42B.300d.txt")
glove_embeddings, glove_dim = load_glove_embeddings(glove_file), 300

embeddings_in = get_glove_embeddings(in_vocab.get_tokens(),
                                     glove_embeddings, glove_dim)
embeddings_out = get_glove_embeddings(out_vocab.get_tokens(),
                                      glove_embeddings, glove_dim)

num_specials = len(specials)
embeddings_in[1:num_specials] = torch.rand(num_specials - 1, glove_dim) * 0.01
embeddings_out[1:num_specials] = torch.rand(num_specials - 1, glove_dim) * 0.01

## Training the Sequence-to-Sequence Application

Create the seq2seq model ...

In [37]:
seq2seq = Seq2Seq(in_vocab=in_vocab, out_vocab=out_vocab, teacher_prob=0.85)
seq2seq = seq2seq.create()

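# Load the pretrained vectors and freeze them. Non-special tokens without a
# GloVe entry keep the all-zero rows from get_glove_embeddings(), and freezing
# means those rows are not updated during training.
seq2seq.model.encoder.embedding.weight.data = embeddings_in
seq2seq.model.encoder.embedding.weight.requires_grad = False
seq2seq.model.decoder.embedding.weight.data = embeddings_out
seq2seq.model.decoder.embedding.weight.requires_grad = False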

... and train the model.

In [38]:
trainer = dl.Trainer(max_epochs=25, accelerator="auto")
trainer.fit(seq2seq, train_loader)

/usr/local/lib/python3.11/dist-packages/lightning/pytorch/trainer/configuration_validator.py:70: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
INFO: 
  | Name          | Type             | Params | Mode 
-----------------------------------------------------------
0 | train_metrics | MetricCollection | 0      | train
1 | val_metrics   | MetricCollection | 0      | train
2 | test_metrics  | MetricCollection | 0      | train
3 | model         | Seq2SeqModel     | 2.2 M  | train
4 | optimizer     | Adam             | 0      | train
-----------------------------------------------------------
785 K     Trainable params
1.4 M     Non-trainable params
2.2 M     Total params
8.917     Total estimated model params size (MB)
13        Modules in train mode
0         Modules in eval mode

## Testing the Model Performance

Implement a function to convert numerical sequences into their corresponding text ...

In [39]:
def unprocess(sequences, vocab, specials):
    """Convert numeric sequences to sentences."""
    sentences = []
    for sequence in sequences:
        idxs = sequence[sequence > len(specials) - 1]
        words = [vocab.lookup_token(idx) for idx in idxs]
        sentences.append(" ".join(words))
    return sentences
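
Since the special tokens occupy the first indices of each vocabulary, filtering on `index > len(specials) - 1` removes them in one step. A list-based sketch with a hypothetical four-word vocabulary:

```python
specials = ["<pad>", "<sos>", "<eos>", "<unk>"]
# Hypothetical toy vocabulary; real indices come from the Vocab objects.
index_to_token = {4: "I", 5: "bought", 6: "a", 7: "book"}

def strip_specials(sequence, num_specials):
    """Keep only indices that point past the special tokens."""
    return [index for index in sequence if index > num_specials - 1]

sequence = [1, 4, 5, 6, 7, 2, 0, 0]  # <sos> I bought a book <eos> <pad> <pad>
words = [index_to_token[index]
         for index in strip_specials(sequence, len(specials))]
print(" ".join(words))
```

This works only because the specials were inserted first when the vocabulary was built.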

... a function to translate user-defined sentences ...

In [40]:
def translate(in_sentence, model, in_lang, in_vocab, out_vocab, specials):
    """Translate a sentence."""
    in_sentence = unicodedata.normalize("NFC", in_sentence)
    in_tokens = pad(tokenize(in_sentence, in_lang))
    in_sequence = (torch.tensor(in_vocab(in_tokens), dtype=torch.int)
                   .unsqueeze(0).to(next(model.parameters()).device))
    pred_sequence = model.evaluate(in_sequence)
    pred_sentence = unprocess(pred_sequence, out_vocab, specials)
    print(f"Predicted Translation: {pred_sentence[0]}\n")

... try to translate a simple sentence ...

In [41]:
in_sentence = "I bought a book."
translate(in_sentence, seq2seq.model, in_lang, in_vocab, out_vocab, specials)

Predicted Translation: 我 買 了 本 本 書



... another simple sentence ...

In [42]:
in_sentence = "This book is very interesting."
translate(in_sentence, seq2seq.model, in_lang, in_vocab, out_vocab, specials)

Predicted Translation: 這 本 書 很 很 很



... and a more complex one.

In [43]:
in_sentence = "The book that I bought is very interesting."
translate(in_sentence, seq2seq.model, in_lang, in_vocab, out_vocab, specials)

Predicted Translation: 这 本 書 很 很 很 很 很



## Evaluating the Model with the BLEU Score
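
BLEU is built on clipped n-gram precision: each predicted n-gram counts as a match at most as many times as it occurs in the reference. The sketch below computes only the unigram part for one prediction and reference pair taken from the test output of this notebook. The full score from `torchmetrics` also combines higher-order n-grams and a brevity penalty, so the two numbers are not directly comparable; `clipped_unigram_precision` is a hypothetical helper.

```python
from collections import Counter

def clipped_unigram_precision(prediction, reference):
    """Fraction of predicted tokens matched in the reference, with clipping."""
    pred_counts = Counter(prediction.split())
    ref_counts = Counter(reference.split())
    matched = sum(min(count, ref_counts[token])
                  for token, count in pred_counts.items())
    return matched / sum(pred_counts.values())

precision = clipped_unigram_precision(
    "他 是 在 個 個 學 學",
    "他 是 這 所 大 學 的 學 生",
)
print(f"{precision:.3f}")
```

Four of the seven predicted tokens are matched (他, 是, and 學 twice), giving a precision of 4/7. Clipping keeps a prediction from earning extra credit by repeating a token more often than the reference contains it.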

In [44]:
from torchmetrics.text import BLEUScore

bleu_score = BLEUScore()

device = next(seq2seq.model.parameters()).device
for batch_index, (in_sequences, out_sequences) in enumerate(test_loader):
    in_sentences = unprocess(in_sequences.to(device), in_vocab, specials)
    pred_sequences = seq2seq.model.evaluate(in_sequences.to(device))
    pred_sentences = unprocess(pred_sequences, out_vocab, specials)
    out_sentences = unprocess(out_sequences.to(device), out_vocab, specials)

    bleu_score.update(pred_sentences, [[s] for s in out_sentences])

    print(f"Input Sentence: {in_sentences[0]}\n"
          + f"Predicted Translation: {pred_sentences[0]}\n"
          + f"Actual Translation: {out_sentences[0]}\n")
final_bleu = bleu_score.compute()
print(f"Test BLEU Score: {final_bleu:.3f}")

Input Sentence: He is a student at this college .
Predicted Translation: 他 是 在 個 個 學 學
Actual Translation: 他 是 這 所 大 學 的 學 生

Input Sentence: This is all the money I have on me .
Predicted Translation: 我 是 我 我 我 我 我
Actual Translation: 這 是 我 身 上 所 有 的 錢

Input Sentence: Can I open the window ?
Predicted Translation: 我 能 以 窗 窗 窗 嗎
Actual Translation: 可 以 开 窗 吗

Input Sentence: Love makes the world go round .
Predicted Translation: 每 天 每 去 去 去 了
Actual Translation: 爱 让 世 界 转 动

Input Sentence: Rome was not built in a day .
Predicted Translation: 這 冻 是 天 不 不 天
Actual Translation: 罗 马 不 是 一 天 建 成 的

Input Sentence: I do not know who to consult with .
Predicted Translation: 我 不 知 道 你 們 你
Actual Translation: 我 不 知 道 该 和 谁 商 量 好

Input Sentence: This morning I was still sleepy .
Predicted Translation: 这 是 我 我 是 是 的
Actual Translation: 我 今 天 早 上 还 是 很 困

Input Sentence: I am sorry . I am not from here .
Predicted Translation: 我 不 我 不 不 不 不 不
Actual Translation: 抱 歉 我 不 是 本 地 人

Input Sentence: