In [7]:
!pip install -U "sentence-transformers[train]"


# _the judge_

This notebook implements the training and evaluation of the SBERT model using the Stanford Question Answering Dataset (SQuAD). The training is intended to bring the embedding of each question closer to the embeddings of sentences that contain its answer, i.e. to increase their similarity. The notebook also looks for a justifiable similarity threshold below which it can be asserted that none of the candidate sentences contains a correct answer to the question.
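
As a rough illustration of the idea, the sketch below ranks candidate sentences by cosine similarity to a question embedding. The toy 3-dimensional vectors and the helper names `cosine_similarity` and `rank_sentences` are assumptions made for illustration only; in this notebook the embeddings come from the SBERT model, and the project's own classes may compute similarity differently.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_sentences(question_vec, sentence_vecs):
    """Return (index, similarity) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(question_vec, s))
              for i, s in enumerate(sentence_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (hypothetical values, not model output).
question = [1.0, 0.0, 1.0]
sentences = [[0.0, 1.0, 0.0], [1.0, 0.1, 0.9], [0.5, 0.5, 0.0]]

ranking = rank_sentences(question, sentences)
best_index, best_score = ranking[0]
```

The sentence at `best_index` is the candidate most likely to contain the answer, and `best_score` is the value that would later be compared against the no answer bound.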

# Imports

In [1]:
from Dataset import SquadDataset_training, SquadDataset_inference
from Model import TrainModel, InferenceModel
from Utils import evaluate_model



# Set the train/test data

In [13]:
# load the train and test datasets
dataset = SquadDataset_training('../data/train-v2.0.json', '../data/test-v2.0.json', train_size=10, test_size=10)
    
all_train_data = dataset._train_data.shuffle()
# for fine tuning, consider only right answers and ignore questions without answers
tuning_data = all_train_data.filter(lambda example: example['no_answer'] == False and example['labels'] == 1.0)
# to calculate the no_answer_bound, use questions that don't have answers
no_answer_train_data = all_train_data.filter(lambda example: example['no_answer'] == True)

# set test data similar to training data
all_test_data = dataset._test_data.shuffle()
test_data = all_test_data.filter(lambda example: example['no_answer'] == False)
no_answer_test_data = all_test_data.filter(lambda example: example['no_answer'] == True)



Filter: 100%|██████████| 10/10 [00:00<00:00, 233.86 examples/s]


Filter: 100%|██████████| 10/10 [00:00<00:00, 359.00 examples/s]


Filter: 100%|██████████| 12/12 [00:00<00:00, 385.17 examples/s]


Filter: 100%|██████████| 12/12 [00:00<00:00, 588.17 examples/s]


# Training

In [14]:
# rename columns to match training params
tuning_data = tuning_data.rename_columns({'questions': 'sentences1', 'sentences': 'sentences2', 'labels' : 'score'})
test_data = test_data.rename_columns({'questions': 'sentences1', 'sentences': 'sentences2', 'labels' : 'score'})

# keep only training columns
columns_to_keep = ['sentences1', 'sentences2', 'score']
tuning_data = tuning_data.select_columns(columns_to_keep)
test_data = test_data.select_columns(columns_to_keep)

In [15]:
# declare the training model
train_model = TrainModel(2, tuning_data, test_data, save_path = 'models')

In [16]:
# train and save the model
train_model._train()
train_model._save_model()

  0%|          | 0/2 [28:40<?, ?it/s]
  0%|          | 0/2 [00:00<?, ?it/s]

OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 1.96 GiB of which 47.19 MiB is free. Including non-PyTorch memory, this process has 1.90 GiB memory in use. Of the allocated memory 1.66 GiB is allocated by PyTorch, and 216.15 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
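
The error message itself points at one mitigation, the `PYTORCH_CUDA_ALLOC_CONF` allocator setting. A minimal sketch of applying it from the notebook follows; it has to run before the first CUDA allocation, so placing it before any `torch` import is the safest option. Only about 216 MiB is reserved but unallocated here, so this setting alone may well not be enough on a 1.96 GiB GPU. A smaller batch size or shorter input sequences are the usual next step, though whether `TrainModel` exposes such options is not shown in this notebook.

```python
import os

# Allocator hint quoted from the error message above.
# Assumption: this cell runs before torch is imported / before any CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```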

# Evaluating

In [2]:
# declare the Squad dataset in the inference mode
inference_dataset = SquadDataset_inference('../data/train-v2.0.json', '../data/test-v2.0.json', train_size=3000, test_size = 500)
# get data to perform tests
inference_data = inference_dataset._test_data.filter(lambda example: example['no_answer'] == False)

Filter: 100%|██████████| 500/500 [00:00<00:00, 9775.16 examples/s]


In [3]:
# load all models to evaluate
base_model = InferenceModel('sentence-transformers/all-mpnet-base-v2')
model_all_answers = InferenceModel('models/trained_all_answers')
model_good_answers = InferenceModel('models/trained_good_answers')

You try to use a model that was created with version 3.2.0, however, your version is 3.1.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.






In [4]:
base_model._test_inference(inference_data)

Model accuracy: 0.73
Mean similarity: 0.63
Pearson corr: 0.12
Spearman corr: 0.13


In [5]:
model_all_answers._test_inference(inference_data)

Model accuracy: 0.83
Mean similarity: 0.72
Pearson corr: 0.12
Spearman corr: 0.12


In [6]:
model_good_answers._test_inference(inference_data)

Model accuracy: 0.73
Mean similarity: 0.63
Pearson corr: 0.12
Spearman corr: 0.13


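Of the two fine-tuned models, only model_all_answers improved on the base model: accuracy rose from 0.73 to 0.83 and the mean similarity between question and answer embeddings from 0.63 to 0.72, while model_good_answers matched the base model on every metric. Therefore, model_all_answers will be the model used for inference.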

# Computing the no answer bound

In [7]:
# only questions without answers
no_answer_data = inference_dataset._test_data.filter(lambda example: example['no_answer'] == True)

Filter: 100%|██████████| 500/500 [00:00<00:00, 5836.63 examples/s]


In [8]:
model_all_answers._test_inference(no_answer_data)

Model accuracy: 0.00
Mean similarity: 0.71
Pearson corr: nan
Spearman corr: nan


  pearson_corr, _ = pearsonr(hits, similarities)
  spearman_corr, _ = spearmanr(hits, similarities)
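
The `nan` correlations are consistent with the accuracy of 0.00: if every entry of `hits` has the same value, its variance is zero and the correlation coefficient is undefined, which is why SciPy's `pearsonr` and `spearmanr` return `nan` for constant input. The sketch below reproduces that behaviour with a hand-written Pearson function; the function name `pearson` and the sample values are hypothetical, and it is assumed (not shown in the notebook) that `hits` holds one 0/1 flag per example.

```python
import math

def pearson(xs, ys):
    """Pearson correlation; returns nan when either input has zero variance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined: the denominator of the formula is zero
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: no prediction counted as a hit, similarities still vary.
hits = [0, 0, 0, 0]
similarities = [0.70, 0.75, 0.68, 0.71]

result = pearson(hits, similarities)
```

With a constant `hits` vector the result is `nan` regardless of the similarity values, so the correlations carry no information for the no-answer subset.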


The mean similarity barely differs between answerable and unanswerable questions (0.72 on valid questions and 0.71 on invalid questions), so the mean does not provide a clear cut-off between the two groups. The no answer bound is therefore set to __0.3__, so that only questions with very little similarity to the given context are treated as having no answer.
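
A minimal sketch of how such a bound could be applied at inference time is shown below. The function name `select_answer` and the choice of `>=` for the comparison are assumptions; the notebook does not show how `InferenceModel` uses the bound.

```python
NO_ANSWER_BOUND = 0.3  # value chosen in this notebook

def select_answer(similarities, bound=NO_ANSWER_BOUND):
    """Return the index of the most similar sentence, or None if it is below the bound."""
    if not similarities:
        return None
    best = max(range(len(similarities)), key=lambda i: similarities[i])
    # Assumption: a similarity equal to the bound still counts as an answer.
    return best if similarities[best] >= bound else None

answer_index = select_answer([0.10, 0.72, 0.40])   # best candidate is above the bound
no_answer = select_answer([0.10, 0.25])            # every candidate is below the bound
```

Here `answer_index` is the position of the best-matching sentence, while `no_answer` is `None` because no candidate reaches the bound.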