## 1. Import Modules and Data
This section covers the following steps:
1. Use tokenizers from `spacy` to tokenize the texts in the dataset.
2. Build the vocabulary, i.e. the dictionary that maps each token to an index. A list of special tokens (e.g. `<eos>`, `<pad>`) is prepended to the vocabulary, so they occupy the first indices.
3. Prepare the test dataset and dataloader.
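
Step 2 can be illustrated with a small, self-contained sketch. It is not the code in `data.py`: the special-token list, the whitespace tokenizer (standing in for `spacy`), and the helper names `build_vocab` and `encode` are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical special tokens; the notebook's real list is defined in `data.py`.
SPECIAL_TOKENS = ["<pad>", "<unk>", "<bos>", "<eos>"]


def build_vocab(sentences, tokenize=str.split, min_freq=1):
    """Map each token to an integer id, with special tokens taking the first ids."""
    counts = Counter(tok for s in sentences for tok in tokenize(s))
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordinary = sorted(
        (t for t, c in counts.items() if c >= min_freq and t not in SPECIAL_TOKENS),
        key=lambda t: (-counts[t], t),
    )
    return {tok: idx for idx, tok in enumerate(SPECIAL_TOKENS + ordinary)}


def encode(sentence, vocab, tokenize=str.split):
    """Wrap a sentence in <bos>/<eos> and map unknown tokens to <unk>."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokenize(sentence)]
    return [vocab["<bos>"], *ids, vocab["<eos>"]]


vocab = build_vocab(["das ist gut", "das ist Architektur"])
encoded = encode("das ist neu", vocab)  # "neu" is out of vocabulary
```

Because the special tokens come first, `<pad>` always has id 0 in this sketch, which makes it easy to pass a fixed padding index to the model regardless of the corpus.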

In [1]:
from data import load_data, PAD_TOKEN
import torch
import config
import os

src_lang = "en"
tgt_lang = "de"

src_tokenizer, tgt_tokenizer, test_loader = load_data(src_lang, tgt_lang, ["test"])
device = torch.device("cuda")
dataset = test_loader.dataset

device  

device(type='cuda', index=3)

## 2. Load Trained Model

In [2]:
from modules import Transformer

model = Transformer(
    src_pad_idx=src_tokenizer.token_to_id(PAD_TOKEN),
    tgt_pad_idx=tgt_tokenizer.token_to_id(PAD_TOKEN),
    src_vocab_size=src_tokenizer.get_vocab_size(),
    tgt_vocab_size=tgt_tokenizer.get_vocab_size(),
    d_model=config.d_model,
    n_head=config.n_head,
    max_len=config.max_len,
    ffn_hidden=config.ffn_hidden,
    n_layer=config.n_layer,
    dropout=config.dropout,
    device=device,
)
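# map_location places the checkpoint tensors on the chosen device,
# even if the checkpoint was saved from a different one.
state_dict = torch.load(config.checkpoint_dir / "en_de_20.pth", map_location=device)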
model.load_state_dict(state_dict["model"])

<All keys matched successfully>
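
The `<All keys matched successfully>` message is the value returned by `load_state_dict` when no parameter name is missing from the checkpoint and none is unexpected. The check it summarizes can be sketched in plain Python as a set comparison. The helper `compare_keys` and the key names below are hypothetical and only illustrate the idea.

```python
def compare_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) parameter names, each sorted.

    missing:    names the model expects but the checkpoint does not contain
    unexpected: names the checkpoint contains but the model does not know
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Hypothetical parameter names, for illustration only.
model_keys = ["encoder.embed.weight", "decoder.embed.weight", "proj.weight"]

ok = compare_keys(model_keys, model_keys)
bad = compare_keys(model_keys, ["encoder.embed.weight", "proj.weight", "proj.bias"])
```

Here `ok` has two empty lists, which corresponds to the message above, while `bad` reports one missing and one unexpected name.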

## 3. Inference
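
The cell in this section compares greedy search with sampling controlled by `top_k`, `top_p`, and `temperature`. The internals of `translate_sentence` are not shown in this notebook, so the following is only a minimal sketch of what these parameters conventionally mean for a single decoding step; the function names are assumptions.

```python
import math
import random


def next_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_id: probability} after temperature, top-k and top-p filtering."""
    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token ids from most to least probable; top_k=0 disables top-k.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]

    # Nucleus (top-p) filtering: keep the smallest prefix whose mass reaches top_p.
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # Renormalize over the surviving tokens.
    return {i: probs[i] / mass for i in kept}


def greedy(logits):
    """Greedy search step: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample(logits, rng=random, **kwargs):
    """Sampling step: draw one token id from the filtered distribution."""
    dist = next_token_distribution(logits, **kwargs)
    ids = list(dist)
    return rng.choices(ids, weights=[dist[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
greedy_id = greedy(logits)
top2 = next_token_distribution(logits, top_k=2)
nucleus = next_token_distribution(logits, top_p=0.5)
sampled_id = sample(logits, rng=random.Random(0), top_k=1)
```

Greedy search is deterministic, whereas sampling can return a different translation on every run, which is why the two methods may disagree on the same source sentence.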


In [3]:
from utils import translate_sentence
import sacrebleu

num_sample = 5
samples = dataset[torch.randint(0, len(dataset), (num_sample,))]

method = {
    "greedy-search": {"num_beams": 1, "do_sample": False},
    "sample": {
        "num_beams": 1,
        "do_sample": True,
        "top_k": config.top_k,
        "top_p": config.top_p,
        "temperature": config.temperature,
    },
}

pred = {
    method_name: translate_sentence(
        samples[src_lang], model, src_tokenizer, tgt_tokenizer, **args
    )
    for method_name, args in method.items()
}

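# sacrebleu expects a list of reference streams, each holding one reference
# per hypothesis; with a single reference per sentence there is one stream.
references = [list(samples[tgt_lang])]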

# Calculate BLEU scores for each method
bleu_scores = {
    method_name: sacrebleu.corpus_bleu(
        pred_list, references
    ).score
    for method_name, pred_list in pred.items()
}

for i in range(num_sample):
    print(f"\033[1mThe {i+1}th source sentence\033[0m: {samples[src_lang][i]}")
    print(f"\033[1mGround Truth\033[0m: {samples[tgt_lang][i]}")
    for method_name in method.keys():
        print(f"\033[1m{method_name}\033[0m: {pred[method_name][i]}")
    print()

# Print BLEU scores
for method_name, score in bleu_scores.items():
    print(f"\033[1mBLEU score for {method_name}\033[0m: {score:.2f}")

[1mThe 1th source sentence[0m: And that's architecture.
[1mGround Truth[0m: Und das ist Architektur.
[1mgreedy-search[0m: Das ist Architektur .
[1msample[0m: Das ist das Museum .

[1mThe 2th source sentence[0m: So what happened on this day?
[1mGround Truth[0m: Was ist an diesem Tag geschehen?
[1mgreedy-search[0m: Was passierte also mit diesem Film ?
[1msample[0m: Was passierte als Nächstes ?

[1mThe 3th source sentence[0m: I also love the fact that if I wanted to walk from my house to New York City, I could.
[1mGround Truth[0m: Ebenso liebe ich die Tatsache, dass ich von meinem Haus nach New York City laufen könnte, wenn ich wollte.
[1mgreedy-search[0m: Ich dachte , ich würde gerne meine eigene Arbeit auf dem Schreibtisch machen , als ich in London reiste .
[1msample[0m: Ich dachte , wenn ich mir meinen Platz überlassen könnte , konnte ich in New York City fahren .

[1mThe 4th source sentence[0m: That's the consequence that we have to face.
[1mGround Truth[0