# Retrieval-based Chatbot

This notebook shows the application of a retrieval-based Italian chatbot.

*Authors*: Luca Ragazzi and Gianluca Moro



## Task

*Scenario*: We were commited by a company to create a retrieval-based not-generative chatbot using their internal question-answer documents. Similar to a help-desk application, the system is asked to answer user queries by retrieving the most relevant predefined response.

In contrast to generative chatbots like ChatGPT that soffers from hallucinantions, retrieval-based chatbots have the following pros:
- Certainty of always giving correct and validated answers if the accuracy is considered satisfactory.
- Cost-efficient solution for both hardware and computational time.

Instead, there are two limitations:
- The need to have predefined question-answers, which may be expensive to obtain.
- The difficulty of correctly understanding the syntax and semantics of input queries.

Technically, the task to address is the Semantic Textual Similarity (STS). Given a set of predefined internal questions and answers, and a new user input query Q, the goal is to retrieve the correct answer by looking at the most similar internal question wrt Q.




![STS](https://miro.medium.com/v2/resize:fit:533/0*rmd8PRMRQFzjoOD6.png)



This task is actually complex due to the high similarity of syntax and semantics between different texts.

![STS_PROBLEMS](https://production-media.paperswithcode.com/tasks/Screenshot_2021-04-19_at_16.31.29_hhDSYqr.png)

## Setup

Let's install the required libraries.

**Note**: before running this notebook, be sure to have set the GPU Runtime (Runtime -> Cambia tipo di runtime -> GPU -> Salva).

In [2]:
%%capture
%pip install -U sentence-transformers  # to load the model
%pip install -U datasets  # to load the experimental dataset

## Data Loading and Processing

Since we lack training examples provided by the company and only have question-answering pairs for production, we need to devise a strategy to develop an effective solution.

Indeed, to train a neural model, we need a labeled dataset on a similar application. Therefore, we found the only publicly available Italian dataset related to our task (STS): *stsb_multi_mt*.

Actually, this is a multi-lingual dataset that comprises a subset of Italian examples. Each example of the dataset consists of three elements: sentence_1, sentence_2, and their similarity score ranging between [1,5] (the higher, the more similar).

Load the dataset from HuggingFace Datasets.

In [4]:
from datasets import load_dataset

dataset = load_dataset("stsb_multi_mt", name="it")  # load the dataset from the hub

training_set = dataset["train"]  # get the training set
eval_set = dataset["dev"]  # get the evaluation set
test_set = dataset["test"]  # get the test set

Print statistics of the dataset splits.

In [5]:
print(f"Train set size: {len(training_set)} instances")
print(f"Eval set size: {len(eval_set)} instances")
print(f"Test set size: {len(test_set)} instances")

Train set size: 5749 instances
Eval set size: 1500 instances
Test set size: 1379 instances


Function to show random samples of the dataset.

In [6]:
import datasets
import random
import pandas as pd
from IPython.display import display, HTML

def show_random_samples(dataset, num_examples):
    assert num_examples <= len(dataset), "Can't pick more samples than there are in the dataset."
    picks = []
    for _ in range(num_examples):
        pick = random.randint(0, len(dataset) - 1)
        while pick in picks:
            pick = random.randint(0, len(dataset) - 1)
        picks.append(pick)

    df = pd.DataFrame(dataset[picks])
    for column, typ in dataset.features.items():
        if isinstance(typ, datasets.ClassLabel):
            df[column] = df[column].transform(lambda i: typ.names[i])
    display(HTML(df.to_html()))

Let's try.

In [7]:
show_random_samples(test_set, num_examples=5)

Unnamed: 0,sentence1,sentence2,similarity_score
0,Alcune violenze in Siria equivalgono alla guerra civile - Croce Rossa,Alcune violenze in Siria si qualificano come guerra civile - Croce Rossa,4.4
1,Un uomo in abito nero sta navigando lungo un'onda che si infrange.,Un surfista che indossa una muta da surf nera sta cavalcando un'onda bianca nell'oceano.,4.6
2,Le sue parole chiave sono supportate da portali web in lingua cinese.,"I portali Web in lingua cinese supportano anche 3721 parole chiave, ha detto il sito Web.",3.2
3,Un uccello blu in piedi su un prato.,Uccello blu in piedi sull'erba verde.,4.6
4,Una donna sta affettando l'aglio.,Una donna sta affettando una cipolla.,2.2


Let's analyze whether the dataset is class-balanced, meaning that the various labels are evenly distributed throughout the dataset. This ensures that the models can be trained effectively without biases.

In [8]:
k = 5  # the number of classes

def check_balance(k, data):
    for i in range(k):
        v = sum([x >= i and x<=i+1 for x in data["similarity_score"]])
        print(f"Class {i}-{i+1}: {v}")

print("Training Set Class Balance:")
check_balance(k, training_set)

print("\nValidation Set Class Balance:")
check_balance(k, eval_set)

print("\nTest Set Class Balance:")
check_balance(k, test_set)

Training Set Class Balance:
Class 0-1: 1103
Class 1-2: 1076
Class 2-3: 1297
Class 3-4: 1942
Class 4-5: 1406

Validation Set Class Balance:
Class 0-1: 389
Class 1-2: 303
Class 2-3: 359
Class 3-4: 421
Class 4-5: 264

Test Set Class Balance:
Class 0-1: 308
Class 1-2: 291
Class 2-3: 352
Class 3-4: 442
Class 4-5: 338


The dataset is definitively class-balanced, both intra split and across splits.

## Model Settings

We tested the following models and show the applicaton with the first one, which is the best-performed model.

Pre-trained models (multi-lingual comprising Italian):

- **distiluse-base-multilingual-cased-v1**
- distiluse-base-multilingual-cased-v2
- paraphrase-multilingual-mpnet-base-v2
- paraphrase-multilingual-MiniLM-L12-v2
- clips/mfaq
- aiknowyou/aiky-sentence-bertino
- efederici/sentence-bert-base


Let's load the model with the SentenceTransformer library.

In [9]:
from sentence_transformers import SentenceTransformer, util

# Load the model and put it into the GPU (CUDA)
model = SentenceTransformer("distiluse-base-multilingual-cased-v1", device="cuda")

The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`.
0it [00:00, ?it/s]
modules.json: 100%|██████████| 341/341 [00:00<00:00, 21.8kB/s]
config_sentence_transformers.json: 100%|██████████| 122/122 [00:00<?, ?B/s] 
README.md: 100%|██████████| 2.47k/2.47k [00:00<?, ?B/s]
sentence_bert_config.json: 100%|██████████| 53.0/53.0 [00:00<?, ?B/s]
config.json: 100%|██████████| 556/556 [00:00<?, ?B/s] 
model.safetensors: 100%|██████████| 539M/539M [00:46<00:00, 11.6MB/s] 
tokenizer_config.json: 100%|██████████| 452/452 [00:00<00:00, 1.34MB/s]
vocab.txt: 100%|██████████| 996k/996k [00:00<00:00, 2.21MB/s]
tokenizer.json: 100%|██████████| 1.96M/1.96M [00:00<00:00, 3.95MB/s]
special_tokens_map.json: 100%|██████████| 112/112 [00:00<00:00, 19.0kB/s]
1_Pooling/config.json: 100%|██████████| 190/190 [00:00<00:00, 282kB/s]
2_Dense/c

AssertionError: Torch not compiled with CUDA enabled

Let's write some preliminary functions to encode the texts, compute the similarities, and evaluate the model performance.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error  # get some evaluation metrics
import numpy as np

In [None]:
# Encode the text
def encode_text(model, sent):
    return model.encode(sent, convert_to_tensor=True, device="cuda")

In [None]:
# Compute the semantic similarities between to texts with the cosine similarity
# Remember: 0 = no match, 1 = high match. The higher, the better
def compute_similarities(emb1, emb2):
    return util.cos_sim(emb1, emb2)


# Compute the standard accuracies such as MSE, MAE, and Exact Match
def get_standard_accuracies(gold, pred):
    # MSE (mean squared error) measures the average of the squares of the errors
    # between the gold and prediction. The lower, the better. [0, ∞[
    mse = round(mean_squared_error(gold, pred), 2)
    # MAE (mean absolute error) is the average of all absolute errors between
    # the gold and prediction. The lower, the better. [0, ∞[
    mae = round(mean_absolute_error(gold, pred), 2)
    # EM (exact match) checks if the prediction is exactly as the gold
    em = sum([round(gold[i]) == round(pred[i]) for i in range(len(gold))])
    return mse, mae, em

## Evaluation before fine-tuning

We evaluate the model on the test set without fine-tuning, predicting the label that indicates the similarity between two sentences.

In [None]:
embed_sent1_test = encode_text(model, test_set["sentence1"])  # process all first sentences
embed_sent2_test = encode_text(model, test_set["sentence2"])  # process all second sentences

# For each pair, produce the similarity score
# * 5 is needed to normalize the scores in the range [1,5] required by the dataset
pred_scores_test = [compute_similarities(embed_sent1_test[i], embed_sent2_test[i]).item() * 5 for i in range(len(embed_sent1_test))]

# Compute accuracy metrics
mse, mae, em = get_standard_accuracies(test_set["similarity_score"], pred_scores_test)

print("Get standard evaluation metrics on the test set:\n")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Exact Match (EM): {em}/{len(pred_scores_test)} ({round(em/len(pred_scores_test)*100, 2)}%)")

###**Exercise 1**
Compute the standard accuracy metrics on the validation set. Is the model performance better or worse?

In [None]:
embed_sent1_eval = encode_text(model, eval_set["sentence1"])  # process all first sentences
embed_sent2_eval = encode_text(model, eval_set["sentence2"])  # process all second sentences

# For each pair, produce the similarity score
# * 5 is needed to normalize the scores in the range [1,5] required by the dataset
pred_scores_eval = [compute_similarities(embed_sent1_eval[i], embed_sent2_eval[i]).item() * 5 for i in range(len(embed_sent1_eval))]

# Compute accuracy metrics
mse, mae, em = get_standard_accuracies(eval_set["similarity_score"], pred_scores_eval)

print("Get standard evaluation metrics on the eval set:\n")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Exact Match (EM): {em}/{len(pred_scores_eval)} ({round(em/len(pred_scores_eval)*100, 2)}%)")

Given a query, we evaluate the model on searching the more similar sentence. To do so, starting from the original dataset, we build a new dataset by selecting the pairs with a similarity score > 4. The task is to correctly retrieve the twin sentence.

As evaluation metrics, we use Rank@1, Rank@2, Rank@3, Rank@4, Rank@5, where Rank@n means how much times the correct twin is retrieved by the model in the first "n" more similar sentences retrieved. For example, Rank@3 checks if the retrieved item is in the first 3 positions.

In [None]:
# Retrieve pairs in the dataset having the two sentences with similar meaning (synonyms)
def get_similar_pairs(dataset):
    sent1 = []
    sent2 = []
    # We use idx to know the index of similar instances in the dataset
    idx = []
    for i in range(len(dataset["sentence1"])):
        # We define two sentences as similar if their similarity score is > 4
        if round(dataset["similarity_score"][i]) > 4:
            sent1.append(dataset["sentence1"][i])
            sent2.append(dataset["sentence2"][i])
            idx.append(i)
    return sent1, sent2, idx


# Get the percentage of how many times the correct sentence to be retrieved is
# in the first "n" best retrieved sentences
def get_rank_n(pred_scores, n, idx):
    rank = sum([k in np.argsort(pred_scores[i])[-n:] for i, k in enumerate(idx)]) / len(pred_scores)
    return rank

Let's obtain the datasets with similar samples for each split.

In [None]:
sent1_train, sent2_train, idx_train = get_similar_pairs(training_set)
sent1_eval, sent2_eval, idx_eval = get_similar_pairs(eval_set)
sent1_test, sent2_test, idx_test = get_similar_pairs(test_set)

print(f"Number of train twin samples: {len(sent1_train)}/{len(training_set)}")
print(f"Number of eval twin samples: {len(sent1_eval)}/{len(eval_set)}")
print(f"Number of test twin samples: {len(sent1_test)}/{len(test_set)}")

Get Rank@n scores on the test set.

In [None]:
# Encode all the sentences of the sent1_test dataset
embed_sent1_twins_test = encode_text(model, sent1_test)

# Compute the similarity with ALL the sentences_2 of the entire test set
pred_scores_twins_test = util.cos_sim(embed_sent1_twins_test, embed_sent2_test).cpu().tolist()

print("Get rank-n results on the test set given the sentence-1 set as the query:")
for n in [1, 2, 3, 4, 5]:
    rank = get_rank_n(pred_scores_twins_test, n, idx_test)
    print(f"Rank@{n}: {round(rank*100, 2)}%")

###**Exercise 2**
Swap the twins to search. Thus, we want to search sentence_1 given sentence_2 as input. Is the model performance better or worse?

In [None]:
# Encode all the sentences of the sent1_test dataset
embed_sent1_twins_test = encode_text(model, sent2_test)
embed_sent2_twins_test = encode_text(model, sent1_test)

# Compute the similarity with ALL the sentences_2 of the entire test set
pred_scores_twins_test = util.cos_sim(embed_sent1_twins_test, embed_sent2_test).cpu().tolist()

print("Get rank-n results on the test set given the sentence-1 set as the query:")
for n in [1, 2, 3, 4, 5]:
    rank = get_rank_n(pred_scores_twins_test, n, idx_test)
    print(f"Rank@{n}: {round(rank*100, 2)}%")

###**Exercise 3**
Compute the rank@n metrics on the validation set and, as before, swap also the twin to search. Compared to the test set, is the model performance better or worse?

In [None]:
# Encode all the sentences of the sent1_eval dataset
embed_sent1_twins_eval = encode_text(model, sent1_eval)

# Compute the similarity with ALL the sentences_2 of the entire eval set
pred_scores_twins_eval = util.cos_sim(embed_sent1_twins_eval, embed_sent2_eval).cpu().tolist()

print("Get rank-n results on the eval set given the sentence-1 set as the query:")
for n in [1, 2, 3, 4, 5]:
    rank = get_rank_n(pred_scores_twins_eval, n, idx_eval)
    print(f"Rank@{n}: {round(rank*100, 2)}%")


# Encode all the sentences of the sent1_eval dataset
embed_sent1_twins_eval = encode_text(model, sent2_eval)
embed_sent2_twins_eval = encode_text(model, sent1_eval)

# Compute the similarity with ALL the sentences_2 of the entire eval set
pred_scores_twins_eval = util.cos_sim(embed_sent1_twins_eval, embed_sent2_eval).cpu().tolist()

print("\nGet rank-n results on the eval set given the sentence-1 set as the query:")
for n in [1, 2, 3, 4, 5]:
    rank = get_rank_n(pred_scores_twins_eval, n, idx_eval)
    print(f"Rank@{n}: {round(rank*100, 2)}%")

###**Exercise 4**
Data visualization is important as well!
Summarize the results on the test set using the skeleton below. In the Rank@n evaluation, after the "|" symbol, report the results of searching sentence_1 given sentence_2.

\begin{array}{c}
\text{Test set results} \\
\hline
Model & MSE & MAE & EM & Rank@1 & Rank@2 & Rank@3 & Rank@4 & Rank@5 \\
\hline
\textbf{Untrained} & 1.16 & 0.85 & 35.68\% & 81.41\% | 94.87\% & 89.74\% | 98.72\% & 94.87\% | 100.0\% & 98.08\% | 100.0\% & 99.36\% | 100.0\% \\
\hline
\end{array}

\begin{array}{c}
\text{Validation set results} \\
\hline
Model & MSE & MAE & EM & Rank@1 & Rank@2 & Rank@3 & Rank@4 & Rank@5\\
\hline
\textbf{Untrained} & 0.94 & 0.76 & 38.8 &  &  &  &  &  \\
\hline
\end{array}

## Model fine-tuning on the Semantic Textual Similarity task

The goal is to further improve the performance by training the model on the STS task with the data at hand.

The model fine-tuning takes about 1 minute per epoch. The aim is to better understand the meaning of the sentences.

In [None]:
from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer, losses, InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

In [None]:
train_batch_size = 16  # the number of input-output examples that the model process at the same time
num_epochs = 2  # the number of cycles on the data
model_save_path = "output/chatbot"  # the path to save the fine-tuned model

Let's correctly setup the datasets with a normalized score between [0,1]

In [None]:
def setup_dataset(dataset):
    samples = []
    for example in dataset:
        # Normalize the score in [0, 1]
        score = float(example["similarity_score"]) / 5.0
        inp_example = InputExample(texts=[example["sentence1"], example["sentence2"]], label=score)
        samples.append(inp_example)
    return samples

In [None]:
train_samples = setup_dataset(training_set)
dev_samples = setup_dataset(eval_set)

Let's define the dataloader with which to load the data into the model.

In [None]:
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)

Let's define the loss function, which is required to compute the model error during the training process.

In [None]:
train_loss = losses.CosineSimilarityLoss(model=model)

evaluator = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name="sts-eval", show_progress_bar=False)

Train the model.

In [None]:
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          # The number of steps (it depends on the batch size)
          evaluation_steps=len(train_dataloader),
          # The initial learning rate is very low for n steps as a warmup
          warmup_steps=math.ceil(len(train_dataloader) * num_epochs * 0.1),
          output_path=model_save_path,
          save_best_model=True)

Re-evaluate the model performance on the test set after the fine-tuning process.

In [None]:
embed_sent1 = encode_text(model, test_set["sentence1"])
embed_sent1_twins = encode_text(model, sent1_test)
embed_sent2 = encode_text(model, test_set["sentence2"])

pred_scores = [compute_similarities(embed_sent1[i], embed_sent2[i]).item() * 5 for i in range(len(embed_sent1))]
pred_scores_twins = util.cos_sim(embed_sent1_twins, embed_sent2).cpu().tolist()

mse, mae, em = get_standard_accuracies(test_set["similarity_score"], pred_scores)
print(f"Mean Squared Error (MSE): {mse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"Exact Match (EM): {em}/{len(pred_scores)} ({round(em/len(pred_scores)*100, 2)}%)")

for n in [1, 2, 3, 4, 5]:
    rank = get_rank_n(pred_scores_twins, n, idx_test)
    print(f"Rank@{n}: {round(rank*100, 2)}%")

Summarize the results on the test set after the fine-tuning on the Semantic textual Similarity task (STS). For the Rank@n evaluation, we report only the results on searching the sentence_2 set (the input is sentence_1).

\begin{array}{c}
\text{Test set results} \\ \hline
Model & MSE & MAE & EM & Rank@1 & Rank@2 & Rank@3 & Rank@4 & Rank@5 \\ \hline
\textbf{Untrained} & 1.16 & 0.85 & 35.68\% & 81.41\% & 89.74\% & 94.87\% & 98.08\% & 99.36\% \\
\textbf{Trained (STS)} & 0.77 & 0.67 & 45.76\% & 80.13\% & 89.74\% & 96.15\% & 98.08\% & 98.08\% \\  \hline
\end{array}

###**Exercise 5**

Re-evaluate the model performance on the validation set after the fine-tuning process. Is the model performance higher or lower?
Afterward, summarize the results on the validation set after the fine-tuning on the STS task.

In [None]:
# Encode all the sentences of the sent1_eval dataset
embed_sent1_twins_eval = encode_text(model, sent1_eval)

# Compute the similarity with ALL the sentences_2 of the entire eval set
pred_scores_twins_eval = util.cos_sim(embed_sent1_twins_eval, embed_sent2_eval).cpu().tolist()

print("Get rank-n results on the eval set given the sentence-1 set as the query:")
for n in [1, 2, 3, 4, 5]:
    rank = get_rank_n(pred_scores_twins_eval, n, idx_eval)
    print(f"Rank@{n}: {round(rank*100, 2)}%")


# Encode all the sentences of the sent1_eval dataset
embed_sent1_twins_eval = encode_text(model, sent2_eval)
embed_sent2_twins_eval = encode_text(model, sent1_eval)

# Compute the similarity with ALL the sentences_2 of the entire eval set
pred_scores_twins_eval = util.cos_sim(embed_sent1_twins_eval, embed_sent2_eval).cpu().tolist()

print("\nGet rank-n results on the eval set given the sentence-1 set as the query:")
for n in [1, 2, 3, 4, 5]:
    rank = get_rank_n(pred_scores_twins_eval, n, idx_eval)
    print(f"Rank@{n}: {round(rank*100, 2)}%")

\begin{array}{c}
\text{Validation set results} \\
\hline
Model & MSE & MAE & EM & Rank@1 & Rank@2 & Rank@3 & Rank@4 & Rank@5\\
\hline
\textbf{Untrained} &  &  &  &  &  &  &  &  \\
\textbf{Trained (STS)} &  &  &  &  &  &  &  & \\
\hline
\end{array}

## Metric Learning

We try to perform a metric-learning training to boost the model performance in the Rank@n evaluation.
The aim is to teach the model to bring closer similar sentences in the vector space and bring farther different sentences.

We use the **triplet loss** function, which required to have the training example as follows: <sent_1, sent_2, neg_3>

Here, sent_1 and sent_2 are two similar sentences we eant to bring closer in the vector space (thus we want the model to assign similar emmbeddings).
In contrast, neg_3 is a sentence dissimilar to sent_1 we want to teach the model to assign a different embedding and bring farther in the vector space.

Since we use the triplet loss function, we need to search a negative sample for each similar pairs.
There are several ways to do this. For example, we can pick a random sentence from the dataset. However, it is smarter if we select the **best negative sentence**, namely the sentence that the model mistakenly think is a synonym.

In [None]:
def get_best_negatives(dataset, sent1, idx):
    embed_sent1 = encode_text(model, sent1)
    embed_sent2 = encode_text(model, dataset["sentence2"])

    pred_scores = util.cos_sim(embed_sent1, embed_sent2).cpu().tolist()

    best_negatives = []

    for i, k in enumerate(idx):
        rank = np.argsort(pred_scores[i])
        if k == rank[-1]:
            for j in range(2, len(dataset["sentence2"])):
                sent_negative = dataset["sentence2"][rank[-j]]
                if sent_negative != sent1[i] and sent_negative != dataset["sentence2"][i]:
                    best_negatives.append(sent_negative)
                    break
        else:
            for j in range(1, len(dataset["sentence2"])):
                sent_negative = dataset["sentence2"][rank[-j]]
                if sent_negative != sent1[i] and sent_negative != dataset["sentence2"][i]:
                    best_negatives.append(sent_negative)
                    break

    return best_negatives

Get the set of best negative sentences.

In [None]:
best_negatives_train = get_best_negatives(training_set, sent1_train, idx_train)
best_negatives_eval = get_best_negatives(eval_set, sent1_eval, idx_eval)
best_negatives_test = get_best_negatives(test_set, sent1_test, idx_test)

Fine-tune the model with the triplet loss.

In [None]:
from torch.utils.data import DataLoader
import math
from sentence_transformers import SentenceTransformer, losses, InputExample
from sentence_transformers.evaluation import TripletEvaluator


train_batch_size = 16
num_epochs = 2
model_save_path = "output/chatbot"


# Setup the datasets in the triplet format
def setup_dataset(sent1, sent2, best_negatives):
    samples = []
    for i in range(len(sent1)):
        samples.append(InputExample(texts=[sent1[i], sent2[i], best_negatives[i]]))
    return samples


train_samples = setup_dataset(sent1_train, sent2_train, best_negatives_train)
dev_samples = setup_dataset(sent1_eval, sent2_eval, best_negatives_eval)

train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
# Define the triplet loss function
train_loss = losses.TripletLoss(model=model)

evaluator = TripletEvaluator.from_input_examples(dev_samples, name="eval", show_progress_bar=False)

# Train the model.
model.fit(train_objectives=[(train_dataloader, train_loss)],
          evaluator=evaluator,
          epochs=num_epochs,
          evaluation_steps=len(train_dataloader),
          warmup_steps=math.ceil(len(train_dataloader) * num_epochs * 0.1),
          output_path=model_save_path,
          save_best_model=True)

Re-evaluate the model performance on the test set after the metric learning process.

In [None]:
embed_sent1 = encode_text(model, sent1_test)
embed_sent2 = encode_text(model, test_set["sentence2"])

pred_scores = util.cos_sim(embed_sent1, embed_sent2).cpu().tolist()

for n in [1, 2, 3, 4, 5]:
    rank = get_rank_n(pred_scores, n, idx_test)
    print(f"Rank@{n}: {round(rank*100, 2)}%")

Summarize the results on the test set after the fine-tuning with Metric Learning (ML).

\begin{array}{c}
\text{Test set results} \\ \hline
Model & Rank@1 & Rank@2 & Rank@3 & Rank@4 & Rank@5 \\ \hline
\textbf{Untrained} & 82.05\% & 90.38\% & 94.23\% & 98.08\% & 99.36\% \\
\textbf{Trained (STS)} & 80.13\% & 89.74\% & 96.15\% & 98.08\% & 98.08\% \\
\textbf{Trained (STS + ML)} & 79.49\% & 88.46\% & 95.51\% & 98.08\% & 98.72\% \\
\hline
\end{array}

###**Exercise 6**

Re-evaluate the model performance on the validation set after the metric learning process. Report also the results in the table. Is the model performance better or worse?

In [None]:
embed_sent1 = encode_text(model, sent1_eval)
embed_sent2 = encode_text(model, eval_set["sentence2"])

pred_scores = util.cos_sim(embed_sent1, embed_sent2).cpu().tolist()

for n in [1, 2, 3, 4, 5]:
    rank = get_rank_n(pred_scores, n, idx_eval)
    print(f"Rank@{n}: {round(rank*100, 2)}%")

Summarize the results on the validation set after the fine-tuning with Metric Learning (ML).

\begin{array}{c}
\text{Validation set results} \\
\hline
Model & Rank@1 & Rank@2 & Rank@3 & Rank@4 & Rank@5\\ \hline
\textbf{Untrained} &  &  &  &  &  \\
\textbf{Trained (STS)} &  &  &  &  &  \\
\textbf{Trained (ML)} &  &  &  &  &  \\
\hline
\end{array}

## Time

We analyze the time required for computation because it is a central aspect for real-world applications.

In [None]:
import time

Evaluate the time to create the sentence embeddings.

In a realistic business scenario, it's worth noting that this operation is performed once offline.

In [None]:
start = time.time()
embeddings = encode_text(model, test_set["sentence2"])
end = time.time()

total_time = end - start
num_samples = len(test_set["sentence2"])
print(f"Num samples: {num_samples}")
print(f"Total time elapsed: {round(total_time, 2)} sec")
print(f"Time elapsed per sample: {round(total_time/num_samples, 5)} sec")

### Exercise 7

Evaluate the time to compare the input query with the collection of sentences. Will the time required be more or less than the time to create the embeddings?

In [None]:
query = test_set["sentence1"][0]

start = time.time()
emb_query = model.encode(query, convert_to_tensor=True, device="cuda")
pred_scores = util.cos_sim(emb_query, embeddings).cpu().tolist()
end = time.time()

total_time = end - start
print(f"Num samples: {num_samples}")
print(f"Total time elapsed: {round(total_time, 5)} sec")
print(f"Time elapsed per sample comparison: {round(total_time/num_samples, 5)} sec")