# Deep Learning Project - Model definition and training
##### Andrea Gervasio, Matricola Number: 1883259



Information Retrieval (IR) took a new direction with the paper "*Transformer Memory as a Differentiable Search Index*", in which a single encoder-decoder architecture is trained in a multitask fashion to perform both indexing and retrieval. \\
This notebook builds on that idea, experimenting with different architectures and training methods.

*This notebook implements the model definition and its training. The dataset it uses has been prepared by sampling the MS MARCO dataset and preprocessing the subset. [This](https://colab.research.google.com/drive/11F5YUMKzB355OMri66Se9Kswu_l4vSVR?usp=sharing) notebook contains all the code used for the data preprocessing.*
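As a rough illustration of the multitask setup mentioned above, both tasks can be cast as text-to-docid pairs: indexing maps a document's text to its identifier, while retrieval maps a query to the identifier of a relevant document. The sketch below uses made-up documents, queries and ids, and the helper name `build_dsi_examples` is hypothetical; it is not part of this notebook or of the preprocessing notebook.

```python
# Toy sketch: indexing and retrieval expressed as one seq2seq dataset.
# All names and data here are illustrative assumptions.

def build_dsi_examples(docs, queries):
    """docs: {docid: text}; queries: list of (query_text, relevant_docid)."""
    examples = []
    # Indexing task: document text -> docid string
    for docid, text in docs.items():
        examples.append({"task": "indexing", "input": text, "target": str(docid)})
    # Retrieval task: query text -> docid string of a relevant document
    for query, docid in queries:
        examples.append({"task": "retrieval", "input": query, "target": str(docid)})
    return examples

docs = {101: "the eiffel tower is in paris", 202: "python is a programming language"}
queries = [("where is the eiffel tower", 101)]

examples = build_dsi_examples(docs, queries)
for ex in examples:
    print(ex["task"], "|", ex["input"], "->", ex["target"])
```

Because both tasks share the same input/target format, one model and one loss can be used for the two of them.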

# 0. Installing libraries and downloading required files

In [None]:
!pip install --upgrade --no-cache-dir gdown



In [None]:
!pip install lightning transformers wandb


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split, Dataset

import transformers
from transformers import T5Tokenizer, T5ForConditionalGeneration


import lightning.pytorch as pl
from lightning.pytorch.callbacks import ModelCheckpoint

import wandb
import pandas as pd
import numpy as np
import os
import json
import random

In [None]:
if not os.path.exists("/content/DLDataset"):
  !gdown --fuzzy --folder https://drive.google.com/drive/folders/1zWHTnb_5OlTSpRctH0E1av5OfHGfe3LH?usp=sharing

# 1. Parameters definition

Training was run under several configurations, so the parameters below can be chosen by the user.

*training_mode* selects the text used for training: either the body of the document alone (*normal*) or the body preceded by a query that summarizes the document (*query_generation*).

In [None]:
training_mode = "normal" # @param = ["normal", "query_generation"]

*training_data* contains the indexing data for the train, validation and test documents, plus the retrieval data for the training split.

In [None]:
if training_mode == "normal":
  corpus_path = "/content/DLDataset/full_train_corpus.pkl"
  training_data = pd.read_pickle(corpus_path)
else:
  paths = ["/content/DLDataset/query_generation_train_corpus1.pkl",
           "/content/DLDataset/query_generation_train_corpus2.pkl",
           "/content/DLDataset/query_generation_train_corpus3.pkl"]
  df = [pd.read_pickle(path) for path in paths]
  training_data = pd.concat(df, axis=0, ignore_index=True)

The validation and test queries are loaded together with their respective ranked document lists.

In [None]:
val_queries = pd.read_csv("/content/DLDataset/val_queries.csv.zip")
val_top10 = pd.read_pickle("/content/DLDataset/mapped_val_top100.pkl")

test_queries = pd.read_csv("/content/DLDataset/test_queries.csv.zip")
test_top10 = pd.read_pickle("/content/DLDataset/mapped_test_top100.pkl")

*model_name* refers to the model used for the training.

In [None]:
model_name = "google/flan-t5-base" # @param = ["google-t5/t5-base", "google/flan-t5-base"]

*log_wandb* should be set to *True* if the user wants to log on Weights and Biases, or *False* otherwise. \\
*Note that the code will ask the user for their Weights and Biases key to log the results. The key for the Weights and Biases account can be found at [this link](https://wandb.ai/authorize).*

In [None]:
log_wandb = False # @param = ["True", "False"] {type:"raw"}

Other parameters used in the training process: \\
*num_tokens* is the number of tokens to take from each document; \\
*batch_size* is the size of each batch; \\
*learning_rate* is the learning rate used during training.

In [None]:
num_tokens = 32 # @param = ["32", "64"] {type:"raw"}

In [None]:
batch_size = 32 # @param = ["32", "64"] {type:"raw"}

In [None]:
learning_rate = 0.0001 # @param = ["0.0001", "0.0005"]  {type:"raw"}

In [None]:
parameters = {
    "num_tokens": num_tokens,
    "batch_size": batch_size,
    "num_epochs": 3,
    "learning_rate": learning_rate,
    "model_name": model_name,
    "log_wandb": log_wandb,
    "run_name": "T5Flan-32-01", # Name your wandb run if you want to log it
    "checkpoint_dir": ""
}

In [None]:
if parameters["log_wandb"]:
  !wandb login --relogin
  wandb.init(
        settings = wandb.Settings(start_method="fork"),
        project = "DeepLearningNew",
        name = parameters["run_name"],
        config = parameters
    )

# 2. Dataset creation

This is the dataset used during training.

In [None]:
class TrainingDataset(Dataset):
  def __init__(self, dataframe, tokenizer, max_length):
    self.training_data = dataframe
    self.tokenizer = tokenizer
    self.max_length = max_length

  def __getitem__(self, idx):
    sample = self.training_data.iloc[idx]
    doc = sample["doc"]
    id = str(sample["semantic_id"])
    token_doc = self.tokenizer(
        doc,
        max_length = self.max_length,
        padding = "max_length",
        return_tensors = "pt",
        truncation = "only_first"
    )

    input_ids = token_doc.input_ids[0]
    attention_masks = token_doc.attention_mask[0]

    token_labels = self.tokenizer(
        id,
        max_length = 20,
        padding = "max_length",
        return_tensors = "pt"
    ).input_ids[0]

    return input_ids, attention_masks, token_labels

  def __len__(self):
    return len(self.training_data)

def training_collate(batch):
  '''
  Collate function for the training dataset.
  '''
  input_ids, attention_masks, token_labels = zip(*batch)
  input_ids = torch.stack(input_ids)
  attention_masks = torch.stack(attention_masks)
  token_labels = torch.stack(token_labels)

  return input_ids, attention_masks, token_labels
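The labels above are padded to a fixed length of 20 tokens. In Hugging Face's T5 implementation, label positions equal to `-100` are ignored by the cross-entropy loss, so one common variant replaces the padding ids with `-100` before computing the loss. The sketch below shows that idea on plain Python lists; it is an optional variant and not what this notebook does. The helper name `mask_label_padding` and the token ids are made up for illustration (T5 uses `0` as pad id and `1` as end-of-sequence id).

```python
IGNORE_INDEX = -100  # label value skipped by the loss in Hugging Face seq2seq models

def mask_label_padding(label_ids, pad_token_id=0):
    """Replaces padding ids with IGNORE_INDEX so they do not contribute to the loss."""
    return [IGNORE_INDEX if token_id == pad_token_id else token_id
            for token_id in label_ids]

# Hypothetical tokenized docid: two content tokens, EOS (1), then padding (0)
padded_labels = [209, 632, 1, 0, 0, 0]
masked_labels = mask_label_padding(padded_labels)
print(masked_labels)
```

With tensors, the same effect is typically obtained with `token_labels[token_labels == tokenizer.pad_token_id] = -100`.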

In [None]:
tokenizer = T5Tokenizer.from_pretrained(parameters["model_name"])



This is the dataset used during validation and testing.

In [None]:
class ValTestDataset(Dataset):
  def __init__(self, queries, ranks, tokenizer, max_length):
    self.queries = queries
    self.ranks = ranks
    self.tokenizer = tokenizer
    self.max_length = max_length

  def __getitem__(self, idx):
    sample = self.queries.iloc[idx]
    query = sample["query"]
    qid = sample["qid"]

    rank = self.ranks[self.ranks["qid"] == qid]
    doc_ranks = rank["semantic_id"].tolist()

    token_query = self.tokenizer(
        query,
        max_length = self.max_length,
        padding = "max_length",
        return_tensors = "pt",
        truncation = "only_first"
    )

    input_ids = token_query.input_ids[0]
    attention_masks = token_query.attention_mask[0]

    return input_ids, attention_masks, doc_ranks

  def __len__(self):
    return len(self.queries)

def val_test_collate(batch):
  '''
  Collate function for the validation and test datasets.
  '''
  input_ids, attention_masks, doc_ranks = zip(*batch)
  input_ids = torch.stack(input_ids)
  attention_masks = torch.stack(attention_masks)
  return input_ids, attention_masks, doc_ranks

A LightningDataModule wraps the three datasets and their dataloaders together.

In [None]:
class DsiDataset(pl.LightningDataModule):
  def __init__(self, training_data, val_queries, val_ranks, test_queries,
               test_ranks, tokenizer, batch_size, max_length):

    super().__init__()

    self.training_data = training_data
    self.val_queries = val_queries
    self.val_ranks = val_ranks
    self.test_queries = test_queries
    self.test_ranks = test_ranks
    self.tokenizer = tokenizer
    self.batch_size = batch_size
    self.max_length = max_length

  def setup(self, stage: str):
    self.training_dataset = TrainingDataset(self.training_data,
                                            self.tokenizer,
                                            self.max_length)

    self.val_dataset = ValTestDataset(self.val_queries,
                                      self.val_ranks,
                                      self.tokenizer,
                                      self.max_length)

    self.test_dataset = ValTestDataset(self.test_queries,
                                       self.test_ranks,
                                       self.tokenizer,
                                       self.max_length)

  def train_dataloader(self):
    return DataLoader(self.training_dataset,
                      batch_size = self.batch_size,
                      shuffle = True,
                      collate_fn = training_collate,
                      num_workers = 2)

  def val_dataloader(self):
    return DataLoader(self.val_dataset,
                      batch_size = self.batch_size,
                      shuffle = False,
                      collate_fn = val_test_collate,
                      num_workers = 2)

  def test_dataloader(self):
    return DataLoader(self.test_dataset,
                      batch_size = self.batch_size,
                      shuffle = False,
                      collate_fn = val_test_collate,
                      num_workers = 2)

In [None]:
dataset = DsiDataset(training_data, val_queries, val_top10, test_queries,
                     test_top10, tokenizer, parameters["batch_size"],
                     parameters["num_tokens"])

# 3. Model definition

Following the authors of the [paper](https://arxiv.org/pdf/2202.06991), decoding is restricted so that the model can only generate numeric tokens (plus the end-of-sequence token).

In [None]:
SPIECE_UNDERLINE = "▁"
INT_TOKEN_IDS = []

for token, id in tokenizer.get_vocab().items():
  if token[0] == SPIECE_UNDERLINE:
    if token[1:].isdigit():
      INT_TOKEN_IDS.append(id)
  if token == SPIECE_UNDERLINE:
    INT_TOKEN_IDS.append(id)
  elif token.isdigit():
    INT_TOKEN_IDS.append(id)
INT_TOKEN_IDS.append(tokenizer.eos_token_id)

def restrict_decode_vocab(batch_idx, prefix_beam):
  '''
  Custom function to pass to the model.
  '''
  return INT_TOKEN_IDS
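To make the filtering rule concrete, here is a minimal sketch of the same selection applied to a tiny, made-up vocabulary instead of the real T5 one. The function name `numeric_token_ids` and the toy token ids are assumptions for illustration only.

```python
SPIECE_UNDERLINE = "▁"  # SentencePiece word-boundary marker

def numeric_token_ids(vocab, eos_token_id):
    """Returns the ids of tokens the decoder is allowed to emit."""
    allowed = set()
    for token, token_id in vocab.items():
        # Drop the word-boundary marker, if present, before testing for digits
        stripped = token[1:] if token.startswith(SPIECE_UNDERLINE) else token
        # Keep the bare marker and every token made only of digits
        if token == SPIECE_UNDERLINE or stripped.isdigit():
            allowed.add(token_id)
    # The end-of-sequence token must stay available so generation can stop
    allowed.add(eos_token_id)
    return sorted(allowed)

toy_vocab = {"</s>": 1, "▁": 3, "▁the": 5, "▁12": 10, "7": 11, "▁4a": 12, "cat": 13}
allowed_ids = numeric_token_ids(toy_vocab, eos_token_id=1)
print(allowed_ids)  # [1, 3, 10, 11]
```

Tokens such as `▁the`, `▁4a` and `cat` are excluded, so any generated sequence decodes to a string of digits.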

In [None]:
class DsiModel(pl.LightningModule):
  def __init__(self, model_name, learning_rate, tokenizer, decode_vocab,
               log_wandb):

    super().__init__()

    self.model = T5ForConditionalGeneration.from_pretrained(model_name)
    self.lr = learning_rate
    self.tokenizer = tokenizer
    self.decode_vocab = decode_vocab
    self.log_wandb = log_wandb
    self.training_step_logs = {"loss": []}
    self.val_step_logs = {"map": [], "recall": []}
    self.test_step_logs = {"map": [], "recall": []}

  def training_step(self, batch, batch_idx):
    '''
    Computes and logs the training loss.
    '''
    input_ids = batch[0]
    attention_masks = batch[1]
    labels = batch[2]

    output = self.model(input_ids = input_ids,
                        attention_mask = attention_masks,
                        labels = labels)

    loss = output.loss

    self.training_step_logs["loss"].append(loss.item())
    self.log("train/loss", loss.item(), on_step = False, on_epoch = True,
              prog_bar = True)

    return loss

  def validation_step(self, batch, batch_idx):
    '''
    Computes and logs the validation mean avg precision and recall.
    '''
    input_ids = batch[0]
    attention_masks = batch[1]
    labels = batch[2]

    batch_beams = self.model.generate(
        input_ids = input_ids,
        attention_mask = attention_masks,
        max_length = 20,
        num_beams = 10,
        prefix_allowed_tokens_fn = restrict_decode_vocab,
        num_return_sequences = 10,
        early_stopping = True
    ).reshape(input_ids.shape[0], 10, -1)

    precisions = []
    recalls = []

    for beams, label in zip(batch_beams, labels):
      rank_list = self.tokenizer.batch_decode(beams,
                                              skip_special_tokens = True)

      avg_precision = self.compute_avg_precision(rank_list, label)
      precisions.append(avg_precision)

      recall = self.compute_recall_at_10(rank_list, label)
      recalls.append(recall)

    mean_avg_precision = sum(precisions) / len(precisions)
    mean_recall = sum(recalls) / len(recalls)

    self.val_step_logs["map"].append(mean_avg_precision)
    self.val_step_logs["recall"].append(mean_recall)

    self.log("validation/map", mean_avg_precision, on_step = False,
              on_epoch = True, prog_bar = True)
    self.log("validation/recall", mean_recall, on_step = False,
              on_epoch = True, prog_bar = True)

    return mean_avg_precision, mean_recall

  def test_step(self, batch, batch_idx):
    '''
    Computes and logs the test mean avg precision and recall.
    '''
    input_ids = batch[0]
    attention_masks = batch[1]
    labels = batch[2]

    batch_beams = self.model.generate(
        input_ids = input_ids,
        attention_mask = attention_masks,
        max_length = 20,
        num_beams = 10,
        prefix_allowed_tokens_fn = restrict_decode_vocab,
        num_return_sequences = 10,
        early_stopping = True
    ).reshape(input_ids.shape[0], 10, -1)

    precisions = []
    recalls = []

    for beams, label in zip(batch_beams, labels):
      rank_list = self.tokenizer.batch_decode(beams,
                                              skip_special_tokens = True)

      avg_precision = self.compute_avg_precision(rank_list, label)
      precisions.append(avg_precision)

      recall = self.compute_recall_at_10(rank_list, label)
      recalls.append(recall)

    mean_avg_precision = sum(precisions) / len(precisions)
    mean_recall = sum(recalls) / len(recalls)

    self.test_step_logs["map"].append(mean_avg_precision)
    self.test_step_logs["recall"].append(mean_recall)

    return mean_avg_precision, mean_recall

  def compute_recall_at_10(self, rank_list, label):
    '''
    Computes the recall@10 metric.
    '''
    correct_predictions = set(rank_list) & set(label)
    recall = len(correct_predictions) / len(label)
    return recall

  def compute_precision_at_k(self, retrieved, relevant, k):
    '''
    Computes the precision at k metric.
    '''
    relevant = len(set(relevant) & set(retrieved[:k]))
    precision_at_k = relevant / min(k, len(retrieved))

    return precision_at_k

  def compute_avg_precision(self, rank_list, label):
    '''
    Computes the average of precision@k for k = 1..10.
    '''
    k = 10
    precisions = []

    for i in range(k):
      # retrieved list first, relevant list second (see compute_precision_at_k)
      precision = self.compute_precision_at_k(rank_list, label, i+1)
      precisions.append(precision)

    avg_precision = sum(precisions) / k

    return avg_precision

  def on_train_epoch_end(self):
    '''
    Computes and logs average train loss on epoch end.
    '''
    print("Training epoch number", self.current_epoch)

    avg_training_loss = sum(self.training_step_logs["loss"]) / len(self.training_step_logs["loss"])

    print("Average training loss:", avg_training_loss)

    self.log("avg_training_loss", avg_training_loss)

    self.training_step_logs["loss"].clear()

    if self.log_wandb:
      wandb.log({"train/loss": avg_training_loss})

  def on_validation_epoch_end(self):
    '''
    Computes and logs average validation MAP and recall on epoch end.
    '''
    print("Validation epoch number", self.current_epoch)

    avg_map = sum(self.val_step_logs["map"]) / len(self.val_step_logs["map"])
    avg_recall = sum(self.val_step_logs["recall"]) / len(self.val_step_logs["recall"])

    print("Average validation MAP: ", avg_map)
    print("Average validation recall: ", avg_recall)

    self.log("avg_validation_map", avg_map)
    self.log("avg_validation_recall", avg_recall)

    self.val_step_logs["map"].clear()
    self.val_step_logs["recall"].clear()

    if self.log_wandb:
      wandb.log({"validation/map": avg_map})
      wandb.log({"validation/recall": avg_recall})

  def on_test_epoch_end(self):
    '''
    Computes and logs average test MAP and recall on epoch end.
    '''
    print("Test epoch number", self.current_epoch)

    avg_map = sum(self.test_step_logs["map"]) / len(self.test_step_logs["map"])
    avg_recall = sum(self.test_step_logs["recall"]) / len(self.test_step_logs["recall"])

    print("Average test MAP: ", avg_map)
    print("Average test recall: ", avg_recall)

    self.log("avg_test_map", avg_map)
    self.log("avg_test_recall", avg_recall)

    self.test_step_logs["map"].clear()
    self.test_step_logs["recall"].clear()

    if self.log_wandb:
      wandb.log({"test/map": avg_map})
      wandb.log({"test/recall": avg_recall})

  def configure_optimizers(self):
    '''
    Initializes the optimizer.
    '''
    optimizer = torch.optim.AdamW(self.parameters(), self.lr)
    return optimizer
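In `validation_step` and `test_step`, `generate` is called with `num_return_sequences = 10`, so it returns ten sequences per query stacked along the first dimension, with the sequences of each query stored contiguously. The `reshape` call regroups them per query. The sketch below reproduces that regrouping on a plain list of already decoded ids; the helper name `group_beams` and the values are illustrative assumptions.

```python
def group_beams(flat_sequences, num_return_sequences):
    """Splits a flat list of generated sequences into one list per query."""
    if len(flat_sequences) % num_return_sequences != 0:
        raise ValueError("number of sequences must be a multiple of num_return_sequences")
    return [
        flat_sequences[start:start + num_return_sequences]
        for start in range(0, len(flat_sequences), num_return_sequences)
    ]

# Two queries, three returned sequences each (the notebook uses ten)
flat = ["12", "17", "90", "33", "31", "5"]
per_query = group_beams(flat, num_return_sequences=3)
print(per_query)
```

Each inner list is the ranked list that is then compared against the relevant documents of the corresponding query.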

In [None]:
model = DsiModel(parameters["model_name"], parameters["learning_rate"],
                 tokenizer, INT_TOKEN_IDS, parameters["log_wandb"])


Checkpoint callback that saves the model with the best validation MAP.

In [None]:
ckpt_callback = pl.callbacks.ModelCheckpoint(
    monitor = "avg_validation_map",
    mode = "max",
    save_top_k = 1,
    verbose = True,
    dirpath = parameters["checkpoint_dir"],
    filename = "best_checkpoint"
)

# 4. Model training

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [None]:
trainer = pl.Trainer(accelerator = device,
                     max_epochs = parameters["num_epochs"],
                     callbacks = [ckpt_callback])

INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs


In [None]:
trainer.fit(model = model, datamodule = dataset)

/usr/local/lib/python3.10/dist-packages/lightning/pytorch/callbacks/model_checkpoint.py:654: Checkpoint directory  exists and is not empty.
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:lightning.pytorch.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name  | Type                       | Params | Mode
------------------------------------------------------------
0 | model | T5ForConditionalGeneration | 247 M  | eval
------------------------------------------------------------
247 M     Trainable params
0         Non-trainable params
247 M     Total params
990.311   Total estimated model params size (MB)
0         Modules in train mode
565       Modules in eval mode

Sanity Checking: |          | 0/? [00:00<?, ?it/s]

  self.pid = os.fork()
/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/data.py:78: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 32. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.


Validation epoch number 0
Average validation MAP:  0.0
Average validation recall:  0.0


Training: |          | 0/? [00:00<?, ?it/s]

  self.pid = os.fork()


Validation: |          | 0/? [00:00<?, ?it/s]

/usr/local/lib/python3.10/dist-packages/lightning/pytorch/utilities/data.py:78: Trying to infer the `batch_size` from an ambiguous collection. The batch size we found is 16. To avoid any miscalculations, use `self.log(..., batch_size=batch_size)`.
INFO: Epoch 0, global step 3277: 'avg_validation_map' reached 0.00535 (best 0.00535), saving model to 'best_checkpoint.ckpt' as top 1


Validation epoch number 0
Average validation MAP:  0.005350332262534643
Average validation recall:  0.0014583333333333332
Training epoch number 0
Average training loss: 0.6527101767928927


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Epoch 1, global step 6554: 'avg_validation_map' reached 0.01407 (best 0.01407), saving model to 'best_checkpoint.ckpt' as top 1


Validation epoch number 1
Average validation MAP:  0.014069330278407661
Average validation recall:  0.0031249999999999997
Training epoch number 1
Average training loss: 0.4260158114837807


Validation: |          | 0/? [00:00<?, ?it/s]

INFO: Epoch 2, global step 9831: 'avg_validation_map' reached 0.02500 (best 0.02500), saving model to 'best_checkpoint.ckpt' as top 1


Validation epoch number 2
Average validation MAP:  0.02499669312169312
Average validation recall:  0.004558531746031746
Training epoch number 2
Average training loss: 0.36631248151524826


INFO: `Trainer.fit` stopped: `max_epochs=3` reached.


In [None]:
trainer.test(ckpt_path = "best_checkpoint.ckpt", datamodule = dataset)

INFO: Restoring states from the checkpoint path at best_checkpoint.ckpt
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: Loaded model weights from the checkpoint at best_checkpoint.ckpt


Testing: |          | 0/? [00:00<?, ?it/s]

Test epoch number 3
Average test MAP:  0.0190036257558579
Average test recall:  0.0042658730158730155


[{'avg_test_map': 0.019003625959157944,
  'avg_test_recall': 0.004265873227268457}]

In [None]:
if parameters["log_wandb"]:
  wandb.finish()


Run history:
test/map           ▁
test/recall        ▁
train/loss         █▂▁
validation/map     ▁▂▅█
validation/recall  ▁▃▆█

Run summary:
test/map           0.019
test/recall        0.00427
train/loss         0.36631
validation/map     0.025
validation/recall  0.00456


# 5. Baseline: Okapi BM-25

The baseline used to compare the model is the [Okapi BM-25](https://en.wikipedia.org/wiki/Okapi_BM25) function. \\
It is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document.
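As a minimal sketch of what such a scoring function looks like, the code below implements a textbook BM25 variant in plain Python. It is not the `rank_bm25` implementation used later: the IDF formula (with `+ 1` inside the logarithm to keep it positive) and the values `k1 = 1.5`, `b = 0.75` are assumptions chosen for illustration, and the corpus is made up.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Scores every tokenized document in corpus_tokens against query_tokens."""
    n_docs = len(corpus_tokens)
    avg_doc_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: number of documents containing each term
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            # Term frequency saturates with k1 and is normalized by document length via b
            denom = term_freq[term] + k1 * (1 - b + b * len(doc) / avg_doc_len)
            score += idf * term_freq[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

corpus = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "stock markets fell sharply today".split(),
]
scores = bm25_scores("cat mat".split(), corpus)
best = max(range(len(corpus)), key=scores.__getitem__)
print(best, scores)
```

Note that both the documents and the query are lists of tokens: only exact token matches contribute, which is why the second document (`cats`, not `cat`) scores zero here.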

## 5.1 Installing and importing necessary libraries

In [None]:
!pip install rank_bm25



In [None]:
indexing_data = pd.read_pickle('/content/DLDataset/full_train_corpus.pkl')

test_queries = pd.read_csv("/content/DLDataset/test_queries.csv.zip")
test_top10 = pd.read_pickle("/content/DLDataset/mapped_test_top100.pkl")

In [None]:
import spacy
from tqdm import tqdm
from rank_bm25 import BM25Okapi

## 5.2 Tokenize index documents

In [None]:
nlp = spacy.load("en_core_web_sm")
tok_text = []

document_values = indexing_data.doc.str.lower().values
tags_to_disable = ["tagger", "parser","ner"]

loop = tqdm(nlp.pipe(document_values, disable = tags_to_disable))

# tokenize documents using SpaCy
for doc in loop:
  tok = [t.text for t in doc if t.is_alpha]
  tok_text.append(tok)

104863it [06:37, 264.04it/s]


## 5.3 Fit tokenized texts

In [None]:
# initialize bm25model and fit to tokenized text
bm25 = BM25Okapi(tok_text)

## 5.4 Evaluate baseline

The baseline is evaluated with the same metrics as the model: Mean Average Precision and Recall@10.

In [None]:
def compute_recall_at_10(rank_list, label):
  '''
  Computes the recall@10 metric.
  '''
  correct_predictions = set(rank_list) & set(label)
  recall = len(correct_predictions) / len(label)
  return recall

def compute_precision_at_k(retrieved, relevant, k):
  '''
  Computes the precision at k metric.
  '''
  relevant = len(set(relevant) & set(retrieved[:k]))
  precision_at_k = relevant / min(k, len(retrieved))

  return precision_at_k

def compute_avg_precision(rank_list, label):
  '''
  Computes the average precision metric.
  '''
  k = 10
  precisions = []

  for i in range(k):
    precision = compute_precision_at_k(rank_list, label, i+1)
    precisions.append(precision)

  avg_precision = sum(precisions) / k

  return avg_precision
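A small worked example may help to see what these metrics return. The sketch below re-declares compact equivalents of the three functions so it can run on its own, and applies them to a made-up ranking of ten documents where two of the four relevant documents are retrieved (at ranks 1 and 3). The document ids are illustrative.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    hits = len(set(relevant) & set(retrieved[:k]))
    return hits / min(k, len(retrieved))

def averaged_precision(retrieved, relevant, k=10):
    """Mean of precision@1 ... precision@k, as in compute_avg_precision."""
    return sum(precision_at_k(retrieved, relevant, i + 1) for i in range(k)) / k

def recall_at_10(retrieved, relevant):
    """Fraction of the relevant ids that appear among the retrieved ones."""
    return len(set(retrieved) & set(relevant)) / len(relevant)

retrieved = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = ["d1", "d3", "x1", "x2"]  # only d1 and d3 are retrieved

p_at_3 = precision_at_k(retrieved, relevant, 3)  # 2 hits in the top 3
avg_p = averaged_precision(retrieved, relevant)
recall = recall_at_10(retrieved, relevant)       # 2 of 4 relevant ids found
print(p_at_3, avg_p, recall)
```

Since recall divides by the number of relevant documents, a query with more than ten relevant documents can never reach a recall of 1 with only ten predictions.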

Compute the predictions using the baseline.

In [None]:
labels = []
predictions = []
indexing_data_ids = indexing_data.semantic_id.values

loop = tqdm(range(len(test_queries)))

for i in loop:
  sample = test_queries.iloc[i]
  query = sample["query"]
  qid = sample["qid"]

  docs_rank = test_top10[test_top10["qid"] == qid]
  docids_rank = docs_rank["semantic_id"].tolist()

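  # BM25Okapi expects a tokenized query, processed like the indexed documents
  tokenized_query = [t.text for t in nlp.make_doc(query.lower()) if t.is_alpha]
  preds = bm25.get_top_n(tokenized_query, indexing_data_ids, n = 10)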
  labels.append(docids_rank)
  predictions.append(preds)

100%|██████████| 2000/2000 [39:14<00:00,  1.18s/it]


Compute the evaluation metrics.

In [None]:
avg_precisions = []
recalls = []

for prediction, label in zip(predictions, labels):
  avg_precision = compute_avg_precision(prediction, label)
  avg_precisions.append(avg_precision)

  recall = compute_recall_at_10(prediction, label)
  recalls.append(recall)

mean_avg_precision = sum(avg_precisions) / len(avg_precisions)
mean_recall = sum(recalls) / len(recalls)

print(f"BM25 results:\nMean Average Precision: {mean_avg_precision}\nRecall@10: {mean_recall}")

BM25 results:
Mean Average Precision: 0.0001678968253968254
Recall@10: 1e-05


# 6. Results

### Validation MAP during training:

![](https://drive.google.com/uc?export=view&id=1V2c2YxVapW0B4sK0Q7W0Q16Qye2A6i9X)

### Validation Recall@10 during training:

![](https://drive.google.com/uc?export=view&id=1K9ww_Ac3s_mm8Eansde_EhtzEq04GuYH)

### Table of the test MAP and Recall@10:

![](https://drive.google.com/uc?export=view&id=1MnzYRz_NncfFEI2WN6FUm6E2NPQuuovc)