## Installs and config for Google Colab

Google Colab is missing some of the packages that are necessary for this project, thus they need to be reinstalled for each runtime. <br>

Connection to Google Drive is also used, to avoid reuploading the required files every time.

In [1]:
# Establishes connection to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
pip install sentencepiece

Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m14.0 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: sentencepiece
Successfully installed sentencepiece-0.1.99


In [None]:
pip install translate-toolkit

Collecting translate-toolkit
  Downloading translate_toolkit-3.12.1-py3-none-any.whl (752 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m752.3/752.3 kB[0m [31m10.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: translate-toolkit
Successfully installed translate-toolkit-3.12.1


In [3]:
# install from github used due to the following error which hasn't been updated in the official package yet:
# https://github.com/Unbabel/COMET/issues/186
pip install git+https://github.com/Unbabel/COMET/

Collecting git+https://github.com/Unbabel/COMET/
  Cloning https://github.com/Unbabel/COMET/ to /tmp/pip-req-build-f8m9itdv
  Running command git clone --filter=blob:none --quiet https://github.com/Unbabel/COMET/ /tmp/pip-req-build-f8m9itdv
  Resolved https://github.com/Unbabel/COMET/ to commit 45cb572516398b6994f112ed8ee7058dc5fb84ef
  Installing build dependencies ... [?25l[?25hdone
  Getting requirements to build wheel ... [?25l[?25hdone
  Preparing metadata (pyproject.toml) ... [?25l[?25hdone
Collecting entmax<2.0,>=1.1 (from unbabel-comet==2.2.0)
  Downloading entmax-1.1-py3-none-any.whl (12 kB)
Collecting huggingface-hub<0.17.0,>=0.16.0 (from unbabel-comet==2.2.0)
  Downloading huggingface_hub-0.16.4-py3-none-any.whl (268 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m268.8/268.8 kB[0m [31m6.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting jsonargparse==3.13.1 (from unbabel-comet==2.2.0)
  Downloading jsonargparse-3.13.1-py3-none-any.whl (101 kB)
[2K

## Imports and constants


__PyTorch__ - neural network training, inference and BLEU evaluation <br>
__SentencePiece__ - vocabulary builder and tokenizer <br>
__Translate-toolkit__ - TMX file parsing <br>
__Comet__ - COMET evaluation

In [5]:
import torch
import torch.nn as nn
from torch.nn import functional as F
from torch import Tensor
from torchtext.data.metrics import bleu_score

from translate.storage.tmx import tmxfile
import sentencepiece as spm

import random
import math
from timeit import default_timer as timer
from comet import download_model, load_from_checkpoint

In [6]:
# FILE ACCESS PATH
BASE_PATH = "drive/MyDrive/Transformers/"

# DATA DEFINITIONS
DATA_SAMPLES = 320_000
SRC_VOCAB_SIZE = 32_000
TGT_VOCAB_SIZE = 32_000
VAL_DATA_PERCENT = 90
TEST_DATA_PERCENT = 95

# HYPERPARAMETERS
LEARNING_RATE = 0.0001
EMB_SIZE = 256
NHEAD = 8
FFN_HID_DIM = 256
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
DROPOUT = 0.1

# TRAINING PARAMETERS
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
BATCH_SIZE = 64
NUM_EPOCHS = 10

# SPECIAL TOKEN INDEXES
UNK_IDX = 0
SOS_IDX = 1
EOS_IDX = 2
PAD_IDX = 3

# EVALUATION MODEL
COMET_MODEL = "Unbabel/wmt22-comet-da" # Current default model used, as there are no specific models for English-Latvian translations

## Obtain data

As source file is used "en-lv.tmx" file obtained from OPUS Europarl dataset collection (https://opus.nlpl.eu/Europarl.php).

The file is in TMX format and consists of English-Latvian sentence pairs. The file has 621 325 sentence pairs in total.

Translate Toolkit library was used to parse the tmx file. The shortest 320 000 sentence pairs were extracted from the tmx file and saved to different text files (source_data.txt for English sentences and target_data.txt for Latvian sentences) for further tokenization. <br>

All samples beginning with open parentheses sign '(' were removed, as upon observing the dataset it was found that there are many samples enclosed in parentheses that were in other foreign languages, such as German, Greek or French. These samples were removed to preserve data quality.

In [None]:
def obtain_data(data_samples, filepath):
    with open(filepath, 'rb') as fin:
        tmx_file = tmxfile(fin, 'en', 'lv')

    source_data = []
    target_data = []
    for node in tmx_file.unit_iter():
        source_data.append(node.source)
        target_data.append(node.target)

    zipped_data = zip(source_data, target_data)
    sorted_data = sorted(zipped_data, key=lambda x: len(x[0]))
    filtered_data = [(src, tgt) for src, tgt in sorted_data if not tgt.startswith('(')]
    source_data, target_data = zip(*filtered_data)
    source_data = list(source_data)
    target_data = list(target_data)
    source_data = source_data[:data_samples]
    target_data = target_data[:data_samples]

    source_file_path = BASE_PATH + "source_data.txt"
    with open(source_file_path, 'w', encoding='utf-8') as file:
        for sentence in source_data:
            file.write(sentence + '\n')

    target_file_path = BASE_PATH + "target_data.txt"
    with open(target_file_path, 'w', encoding='utf-8') as file:
        for sentence in target_data:
            file.write(sentence + '\n')

obtain_data(DATA_SAMPLES, BASE_PATH + "en-lv.tmx")

## Data --> Tokens
Sentence Piece is used to train two different models, one for English, one for Latvian language, that create tokens based on the previously created text files. Each of the trained models has a vocabulary of 32 000 different tokens.

The models are then used on the text files to transform the English sentences and Latvian sentences in a list of their token ids.

In [7]:
# Trains tokenizer
spm.SentencePieceTrainer.Train('--input=' + BASE_PATH + 'source_data.txt --model_prefix=english_bpe --vocab_size=' + str(SRC_VOCAB_SIZE) + ' --model_type=bpe --pad_id=' + str(PAD_IDX))
spm.SentencePieceTrainer.Train('--input=' + BASE_PATH + 'target_data.txt --model_prefix=latvian_bpe --vocab_size=' + str(TGT_VOCAB_SIZE) + ' --model_type=bpe --pad_id=' + str(PAD_IDX))

# Load trained tokenizer models
sp_english = spm.SentencePieceProcessor()
sp_english.Load("english_bpe.model")

sp_latvian = spm.SentencePieceProcessor()
sp_latvian.Load("latvian_bpe.model")

# Read text data from previously created files
source_file_path = BASE_PATH + "source_data.txt"
with open(source_file_path, 'r', encoding='utf-8') as file:
    source_sentences = [line.strip() for line in file.readlines()] # Reading the lines into list, stripping new line char

target_file_path = BASE_PATH + "target_data.txt"
with open(target_file_path, 'r', encoding='utf-8') as file:
    target_sentences = [line.strip() for line in file.readlines()] # Reading the lines into list, stripping new line char

# shuffle data
indices = list(range(len(source_sentences)))
random.shuffle(indices)
source_sentences = [source_sentences[i] for i in indices]
target_sentences = [target_sentences[i] for i in indices]

# encode text data as tokens
source_tokens = sp_english.EncodeAsIds(source_sentences)
target_tokens = sp_latvian.EncodeAsIds(target_sentences)

# obtain vocabularies
vocabs_english = [[sp_english.id_to_piece(id), id] for id in range(sp_english.get_piece_size())]
vocabs_latvian = [[sp_latvian.id_to_piece(id), id] for id in range(sp_latvian.get_piece_size())]

In [8]:
# Verify that special tokens are where we want them to be
assert vocabs_latvian[UNK_IDX][0] == '<unk>'
assert vocabs_latvian[SOS_IDX][0] == '<s>'
assert vocabs_latvian[EOS_IDX][0] == '</s>'

## Tokens --> Batched tensors
The token list for each sentence gets prepended by a Start-of-sequence token, appended by an End-of-sequence token and turned into a tensor.

A function is used to turn the obtained sentences into batches. The function breaks down data into a list of tensors where each list has a number of items equivalent to the batch size. Within the list the tensor of the largest size is found and then other tensors are padded with pad token to be the same size as the largest tensor. Finally all tensors within the list are concatenated together to form a single 2D tensor.

Different batch collections are created for training, validation and test.

In [9]:
def simple_padder(pad_data, pad_token=3):
    # Find longest sequence length
    longest_sequence = 0
    for sequence in pad_data:
        if len(sequence) > longest_sequence:
            longest_sequence = len(sequence)

    # Pad other sequences to be the same length as the longest sequence
    for i in range(len(pad_data)):
        sequence = pad_data[i]

        pad_amount = longest_sequence - len(sequence)
        sequence = F.pad(sequence, (0, pad_amount), "constant", pad_token)

        pad_data[i] = sequence

    return pad_data

def batchify_data(batch_data, batch_size=16, batch_padding=True, batch_padding_token=3):

    batches = []
    for idx in range(0, len(batch_data), batch_size):
        if idx + batch_size < len(batch_data):
            batch = batch_data[idx : idx + batch_size]

            if batch_padding:
                batch = simple_padder(batch, batch_padding_token)

            # Turn a list of tensors into a single tensor
            rows = batch_size
            cols = len(batch[0])
            batchDest = torch.Tensor(rows, cols)
            torch.cat(batch, out=batchDest)
            batchDest = batchDest.view(rows, cols).long()
            batchDest = torch.permute(batchDest, (1, 0))

            batches.append(batchDest)

    return batches

In [None]:
source_tensors = []
target_tensors = []

for source_seq, target_seq in zip(source_tokens, target_tokens):

    # <SOS> + Sequence + <EOS>
    source_seq = [SOS_IDX] + source_seq + [EOS_IDX]
    target_seq = [SOS_IDX] + target_seq + [EOS_IDX]

    source_tensors.append(torch.tensor(source_seq))
    target_tensors.append(torch.tensor(target_seq))

val_split_point = int(len(source_tokens) / 100 * VAL_DATA_PERCENT)
test_split_point = int(len(source_tokens) / 100 * TEST_DATA_PERCENT)

train_data_source = source_tensors[:val_split_point]
train_data_target = target_tensors[:val_split_point]

val_data_source = source_tensors[val_split_point:test_split_point]
val_data_target = target_tensors[val_split_point:test_split_point]

test_data_source = source_tensors[test_split_point:]
test_data_target = target_tensors[test_split_point:]

train_dataloader_source = batchify_data(train_data_source, batch_size=BATCH_SIZE, batch_padding=True, batch_padding_token=PAD_IDX)
train_dataloader_target = batchify_data(train_data_target, batch_size=BATCH_SIZE, batch_padding=True, batch_padding_token=PAD_IDX)

val_dataloader_source = batchify_data(val_data_source, batch_size=BATCH_SIZE, batch_padding=True, batch_padding_token=PAD_IDX)
val_dataloader_target = batchify_data(val_data_target, batch_size=BATCH_SIZE, batch_padding=True, batch_padding_token=PAD_IDX)

test_dataloader_source = batchify_data(test_data_source, batch_size=BATCH_SIZE, batch_padding=True, batch_padding_token=PAD_IDX)
test_dataloader_target = batchify_data(test_data_target, batch_size=BATCH_SIZE, batch_padding=True, batch_padding_token=PAD_IDX)

## Transformers model

At the core of transformers model implementation is PyTorch module __torch.nn.transformer__ which implements the transformer model as seen in the paper "Attention is all you need". However the module lacks initial embedding layers and positional encoding layers for the encoder inputs and decoder inputs, as well as the final linear layer that gives the logits.

The embedding layers can be created simply by utilizing nn.Embedding module.

For the positional encoders I utilized an implementation from Annotated Transformers blogpost. Source: https://nlp.seas.harvard.edu/annotated-transformer/

The final linear layer can be created by utilizing nn.Linear module.

Custom functions were used to create masks that mask the subsequent tokens in the decoder input and padding tokens in encoder and decoder inputs.

Cross entropy was used as the loss function.

Adam was used as the optimizer for the training. The betas and epsilon values for the optimizer were taken from the values used in "Attention is all you need" paper.



In [None]:
class PositionalEncoding(nn.Module):
    "Implement the PE function."

    def __init__(self, d_model, dropout, max_len=5000):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Compute the positional encodings once in log space.
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model)
        )
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer("pe", pe)

    def forward(self, x):
        x = x + self.pe[:, : x.size(1)].requires_grad_(False)
        return self.dropout(x)

In [None]:
class Transformer(nn.Module):
    def __init__(self, num_encoder_layers, num_decoder_layers, emb_size,
                 nhead, src_vocab_size, tgt_vocab_size, dim_feedforward,
                 dropout):
        self.d_model = emb_size

        super(Transformer, self).__init__()
        self.transformer = nn.Transformer(d_model=emb_size,
                                       nhead=nhead,
                                       num_encoder_layers=num_encoder_layers,
                                       num_decoder_layers=num_decoder_layers,
                                       dim_feedforward=dim_feedforward,
                                       dropout=dropout)
        self.generator = nn.Linear(emb_size, tgt_vocab_size)

        self.src_tok_emb = nn.Embedding(src_vocab_size, self.d_model)
        self.tgt_tok_emb = nn.Embedding(tgt_vocab_size, self.d_model)
        self.positional_encoding = PositionalEncoding(
            emb_size, dropout=dropout)

    def forward(self, src, trg, src_mask, tgt_mask, src_padding_mask,
                tgt_padding_mask, memory_key_padding_mask):
        src_emb = self.src_tok_emb(src) * math.sqrt(self.d_model)
        tgt_emb = self.tgt_tok_emb(trg) * math.sqrt(self.d_model)
        src_emb = self.positional_encoding(src_emb)
        tgt_emb = self.positional_encoding(tgt_emb)

        outs = self.transformer(src_emb, tgt_emb, src_mask, tgt_mask, None,
                                src_padding_mask, tgt_padding_mask, memory_key_padding_mask)
        return self.generator(outs)

    def encode(self, src, src_mask):
         src_embed = self.src_tok_emb(src)
         src_encoded = self.positional_encoding(src_embed)
         encoder_output = self.transformer.encoder(src_encoded, src_mask)
         return encoder_output

    def decode(self, tgt, memory, tgt_mask):
        tgt_embed = self.tgt_tok_emb(tgt)
        tgt_encoded = self.positional_encoding(tgt_embed)
        decoder_output = self.transformer.decoder(tgt_encoded, memory, tgt_mask)
        return decoder_output

In [13]:
def create_masks(src, tgt):
    src_len = src.shape[0]
    tgt_len = tgt.shape[0]

    src_subsequent_mask = torch.zeros((src_len, src_len), device=DEVICE).type(torch.bool)

    tgt_triangular_mask = torch.triu(torch.ones((tgt_len, tgt_len), device=DEVICE))
    tgt_subsequent_mask = (tgt_triangular_mask == 1).transpose(0, 1).float()
    tgt_subsequent_mask = tgt_subsequent_mask.masked_fill(tgt_subsequent_mask == 0, float('-inf'))
    tgt_subsequent_mask = tgt_subsequent_mask.masked_fill(tgt_subsequent_mask == 1, float(0.0))

    src_padding_mask = (src == PAD_IDX).transpose(0, 1)
    tgt_padding_mask = (tgt == PAD_IDX).transpose(0, 1)

    return src_subsequent_mask, tgt_subsequent_mask, src_padding_mask, tgt_padding_mask

In [None]:
transformer = Transformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                  NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM, DROPOUT)

for p in transformer.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

transformer = transformer.to(DEVICE)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=LEARNING_RATE, betas=(0.9, 0.98), eps=1e-9)

## Training loop

The training function iterates through source and target data. For each source and target tensor pair it generates masks, passes the data and masks to the model to obtain logits and calculates loss.
If the function is in training mode it also performs backpropagation and updates parameters.

As the training function iterates through the entire dataset, a single call to the function constitutes a single training epoch, if the function is called with training mode.

In [17]:
def iterate_epoch(model, dataloader_source, dataloader_target, device, optimizer=None, perform_training=True):
    if perform_training:
      model.train()
    else:
      model.eval()

    losses = 0

    for src, tgt in zip(dataloader_source, dataloader_target):
        src = src.to(device)
        tgt = tgt.to(device)

        tgt_input = tgt[:-1, :]

        src_subsequent_mask, tgt_subsequent_mask, src_padding_mask, tgt_padding_mask = create_masks(src, tgt_input)

        logits = model(src, tgt_input, src_subsequent_mask, tgt_subsequent_mask, src_padding_mask, tgt_padding_mask, src_padding_mask)

        if perform_training:
          optimizer.zero_grad()

        tgt_out = tgt[1:, :]
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        losses += loss.item()

        if perform_training:
          loss.backward()
          optimizer.step()

    return losses / len(list(dataloader_source))

In [None]:
# The actual training
for epoch in range(1, NUM_EPOCHS+1):
    start_time = timer()
    train_loss = iterate_epoch(transformer, train_dataloader_source, train_dataloader_target, DEVICE, optimizer)
    end_time = timer()
    val_loss = iterate_epoch(transformer, val_dataloader_source, val_dataloader_target, DEVICE, perform_training=False)
    epoch_time = end_time - start_time
    print(f"Epoch: {epoch}, Train loss: {train_loss:.3f}, Val loss: {val_loss:.3f}, Epoch time = {epoch_time:.3f}s")



Epoch: 1, Train loss: 6.086, Val loss: 5.188, Epoch time = 293.563s
Epoch: 2, Train loss: 4.897, Val loss: 4.446, Epoch time = 300.484s
Epoch: 3, Train loss: 4.320, Val loss: 4.004, Epoch time = 300.537s
Epoch: 4, Train loss: 3.941, Val loss: 3.714, Epoch time = 300.961s
Epoch: 5, Train loss: 3.672, Val loss: 3.518, Epoch time = 301.009s
Epoch: 6, Train loss: 3.467, Val loss: 3.375, Epoch time = 300.880s
Epoch: 7, Train loss: 3.304, Val loss: 3.265, Epoch time = 301.419s
Epoch: 8, Train loss: 3.177, Val loss: 3.187, Epoch time = 300.905s
Epoch: 9, Train loss: 3.069, Val loss: 3.123, Epoch time = 301.211s
Epoch: 10, Train loss: 2.974, Val loss: 3.067, Epoch time = 301.301s


In [18]:
# Test loss evaluation
test_loss = iterate_epoch(transformer, test_dataloader_source, test_dataloader_target, DEVICE, perform_training=False)
print(f"Test loss: {test_loss:.3f}")



Test loss: 2.609


In [None]:
# Used when GPU RAM gets too overloaded
torch.cuda.empty_cache()

## Inference

For inference beam search decoding algorithm was implemented. It also encompasses greedy search, since search with a single beam is equivalent to greedy search.

In [26]:
def beam_search_decode(model, src, start_symbol, beam_size, max_len=None):
    model.eval()

    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)

    src = src.to(DEVICE)
    src_mask = src_mask.to(DEVICE)
    memory = model.encode(src, src_mask)

    # Initialize beams as a list of tuples (probability, sequence)
    beams = [(0, torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE))]

    completed_hypotheses = []

    if max_len == None:
      max_len = num_tokens + 5

    for _ in range(max_len):
        new_beams = []
        for current_prob, seq in beams:
            if seq[-1] == EOS_IDX:
                completed_hypotheses.append((current_prob, seq))
                continue

            tgt_mask = (torch.triu(torch.ones((seq.size(0), seq.size(0)), device=DEVICE)) == 1).transpose(0, 1)
            tgt_mask = tgt_mask.float().masked_fill(tgt_mask == 0, float('-inf')).masked_fill(tgt_mask == 1, float(0.0))
            tgt_mask = tgt_mask.type(torch.bool).to(DEVICE) #WTF?

            out = model.decode(seq, memory, tgt_mask)

            out = out.transpose(0, 1)
            prob = model.generator(out[:, -1])

            # Consider top N words for the beam
            top_probs, top_words = torch.topk(prob, beam_size, dim=1)

            for i in range(beam_size):
                new_beams.append((current_prob + top_probs[0][i].item(), torch.cat([seq, torch.ones(1, 1).type_as(src.data).fill_(top_words[0][i].item())])))

        # Sort and prune beams
        beams = sorted(new_beams, key=lambda x: x[0], reverse=True)
        beams = beams[:beam_size]

        if not beams:
            break

        if len(completed_hypotheses) >= beam_size:
          break

    # Choose the best sequence
    if completed_hypotheses:
      return max(completed_hypotheses, key=lambda x: x[0])[1]
    else:
      return beams[0][1]

def beam_translate(model, source_string, beam_size):
    sentenceIds = sp_english.EncodeAsIds(source_string)
    src = torch.tensor([[SOS_IDX] + sentenceIds + [EOS_IDX]])
    src = src.view(-1, 1)

    tgt_tokens = beam_search_decode(model, src, SOS_IDX, beam_size).flatten()
    return tgt_tokens

In [27]:
# Inference example

idx = 420
sentence = source_sentences[idx]
print("Source sentence: " + sentence)
target_sentence = target_sentences[idx]
print("Target sentence: " + target_sentence)

BEAM_SIZE = 1
print(f"\n=== BEAM SIZE {BEAM_SIZE} TRANSLATE ===")
beam_translation = beam_translate(transformer, sentence, BEAM_SIZE)
print("Beam Translation: " + sp_latvian.DecodeIds(beam_translation.tolist()))

BEAM_SIZE = 5
print(f"\n=== BEAM SIZE {BEAM_SIZE} TRANSLATE ===")
beam_translation = beam_translate(transformer, sentence, BEAM_SIZE)
print("Beam Translation: " + sp_latvian.DecodeIds(beam_translation.tolist()))

Source sentence: I call on my fellow Members to likewise lend their support with their votes at the plenary sitting.
Target sentence: Es aicinu kolēģus deputātus arī sniegt atbalstu ar savu balsojumu plenārsēdē.

=== BEAM SIZE 1 TRANSLATE ===
Beam Translation: Es aicinu savus kolēģus deputātus atbalstīt plenārsēdē ar viņu balsojumu.

=== BEAM SIZE 5 TRANSLATE ===
Beam Translation: Es arī aicinu savus kolēģus deputātus sniegt savu atbalstu plenārsēdē. Es ar viņu atbalstu.


## Evaluation

For evaluation two metrics were used: BLEU and COMET.

BLEU was used via PyTorch's metrics library while COMET was used through UNBABEL's COMET library.

To obtain the evaluation metrics the entire test dataset of 16 000 examples was used.

In [42]:
def get_evaluation_data(beam_size, range_start, range_end, measure_inference_speed=False):

  predictions_bleu = []
  references_bleu = []
  data_comet = []

  if measure_inference_speed:
    start_time = timer()

  for idx in range(RANGE_START, RANGE_END):
    source = source_sentences[idx]
    reference = target_sentences[idx]

    beam_translation = beam_translate(transformer, source_sentences[idx], BEAM_SIZE)
    prediction = sp_latvian.DecodeIds(beam_translation.tolist())

    references_bleu.append([reference.split(" ")])
    predictions_bleu.append(prediction.split(" "))
    comet_sample = {"src": source, "mt": prediction, "ref": reference}
    data_comet.append(comet_sample)

  if measure_inference_speed:
    end_time = timer()
    elapsed_time = end_time - start_time
    no_test_examples = range_end - range_start
    time_per_example = elapsed_time / no_test_examples
    print(f"Inference performed with beam size {beam_size} on {no_test_examples} test examples." +
          f"\nTotal elapsed time: {elapsed_time:.3f}s" +
          f"\nAverage inference time per example: {time_per_example:.3f}s")

  return predictions_bleu, references_bleu, data_comet


In [47]:
RANGE_START = test_split_point
RANGE_END = test_split_point + 5000

print("====== GREEDY INFERENCE ======")
BEAM_SIZE = 1

predictions_bleu_greedy, references_bleu_greedy, data_comet_greedy = get_evaluation_data(BEAM_SIZE, RANGE_START, RANGE_END, measure_inference_speed=True)

print("\n====== BEAM DECODE INFERENCE ======")
BEAM_SIZE = 5

predictions_bleu_beam, references_bleu_beam, data_comet_beam = get_evaluation_data(BEAM_SIZE, RANGE_START, RANGE_END, measure_inference_speed=True)

Inference performed with beam size 1 on 5000 test examples.
Total elapsed time: 359.999s
Average inference time per example: 0.072s

Inference performed with beam size 5 on 5000 test examples.
Total elapsed time: 2184.308s
Average inference time per example: 0.437s


In [53]:
# Run BLEU evaluation
bleu_score_greedy = bleu_score(predictions_bleu_greedy, references_bleu_greedy)
bleu_score_beam = bleu_score(predictions_bleu_beam, references_bleu_beam)

In [52]:
# Download the COMET model
model_path = download_model(COMET_MODEL)
comet_model = load_from_checkpoint(model_path)

Fetching 5 files:   0%|          | 0/5 [00:00<?, ?it/s]

Downloading LICENSE:   0%|          | 0.00/9.69k [00:00<?, ?B/s]

Downloading README.md:   0%|          | 0.00/3.53k [00:00<?, ?B/s]

Downloading .gitattributes:   0%|          | 0.00/1.48k [00:00<?, ?B/s]

Downloading hparams.yaml:   0%|          | 0.00/567 [00:00<?, ?B/s]

Downloading model.ckpt:   0%|          | 0.00/2.32G [00:00<?, ?B/s]

INFO:pytorch_lightning.utilities.migration.utils:Lightning automatically upgraded your loaded checkpoint from v1.8.3.post1 to v2.1.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../root/.cache/huggingface/hub/models--Unbabel--wmt22-comet-da/snapshots/371e9839ca4e213dde891b066cf3080f75ec7e72/checkpoints/model.ckpt`


Downloading (…)tencepiece.bpe.model:   0%|          | 0.00/5.07M [00:00<?, ?B/s]

Downloading tokenizer.json:   0%|          | 0.00/9.10M [00:00<?, ?B/s]

Downloading config.json:   0%|          | 0.00/616 [00:00<?, ?B/s]

/usr/local/lib/python3.10/dist-packages/pytorch_lightning/core/saving.py:177: Found keys that are not in the model state dict but in the checkpoint: ['encoder.model.embeddings.position_ids']


In [54]:
# Run COMET evaluation
comet_output = comet_model.predict(data_comet_greedy, batch_size=32)
comet_scores_greedy = comet_output.scores
comet_system_score_greedy = comet_output.system_score

comet_output = comet_model.predict(data_comet_beam, batch_size=32)
comet_scores_beam = comet_output.scores
comet_system_score_beam = comet_output.system_score

INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0: 100%|██████████| 157/157 [01:38<00:00,  1.59it/s]
INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Predicting DataLoader 0: 100%|██████████| 157/157 [01:43<00:00,  1.52it/s]


In [55]:
def print_results(bleu_score, comet_system_score, comet_scores, data_comet):
  print(f"BLEU score obtained: {bleu_score:.3f}")
  print(f"COMET score obtained: {comet_system_score:.3f}\n")

  min_score = min(comet_scores)
  min_index = comet_scores.index(min_score)
  max_score = max(comet_scores)
  max_index = comet_scores.index(max_score)

  print(f"The lowest obtained COMET score: {min_score:.3f}")
  print("It was obtained for the following sample:")
  print(f"Source: {data_comet[min_index]['src']}")
  print(f"Reference: {data_comet[min_index]['ref']}")
  print(f"Prediction: {data_comet[min_index]['mt']}\n")

  print(f"The highest obtained COMET score: {max_score:.3f}")
  print("It was obtained for the following sample:")
  print(f"Source: {data_comet[max_index]['src']}")
  print(f"Reference: {data_comet[max_index]['ref']}")
  print(f"Prediction: {data_comet[max_index]['mt']}\n")

In [57]:
print("====== GREEDY DECODE REULTS ======")

print_results(bleu_score_greedy, comet_system_score_greedy, comet_scores_greedy, data_comet_greedy)

print("\n====== BEAM DECODE RESULTS ======")

print_results(bleu_score_beam, comet_system_score_beam, comet_scores_beam, data_comet_beam)

BLEU score obtained: 0.103
COMET score obtained: 0.702

The lowest obtained COMET score: 0.149
It was obtained for the following sample:
Source: Buckfast wine should, apparently, be banned because it contains both alcohol and caffeine.
Reference: Bakfastas vīnu ir paredzēts aizliegt tāpēc, ka tas satur gan alkoholu, gan kofeīnu.
Prediction: Tulffffffffffffffffffffffff

The highest obtained COMET score: 0.996
It was obtained for the following sample:
Source: What do we do?
Reference: Ko mēs darām?
Prediction: Ko mēs darām?


BLEU score obtained: 0.087
COMET score obtained: 0.687

The lowest obtained COMET score: 0.163
It was obtained for the following sample:
Source: Buckfast wine should, apparently, be banned because it contains both alcohol and caffeine.
Reference: Bakfastas vīnu ir paredzēts aizliegt tāpēc, ka tas satur gan alkoholu, gan kofeīnu.
Prediction:  ⁇ fffffffffffffffffffforforf

The highest obtained COMET score: 0.995
It was obtained for the following sample:
Source: In fut

## Saving and loading the model

To allow the usage of the same model between runtimes the model was saved and loaded with th ebelow PyTorch functions.

In [None]:
# Model save
torch.save(transformer.state_dict(), BASE_PATH + 'translation_model2.pth')

In [15]:
# Model load
# Please initialize the model before launching this cell. The initialized model must have the same architecture as the saved model
state_dict = torch.load(BASE_PATH + 'translation_model2.pth')
transformer.load_state_dict(state_dict)

<All keys matched successfully>