In [None]:
# Install PyTorch & other libraries
# References:
# https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/fine-tune-embedding-model-for-rag.ipynb
# https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator
# https://sbert.net/docs/sentence_transformer/dataset_overview.html
!pip install "torch==2.1.2" tensorboard

# Install Hugging Face libraries
!pip install --upgrade \
  "sentence-transformers>=3" \
  "datasets==2.19.1"  \
  "transformers==4.41.2"

Before fine-tuning an embedding model, we need the right libraries:

sentence-transformers: provides model architectures, loss functions, and the training loop for embedding models.

datasets: loads and preprocesses the sentence pairs.

transformers: provides the underlying Hugging Face model APIs.

This step sets up the environment for fine-tuning the embedding model with contrastive learning.



In [None]:
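import os

from huggingface_hub import login

# Read the access token from the HF_TOKEN environment variable instead of
# hardcoding it: a token pasted into a notebook is exposed to every reader.
login(token=os.environ["HF_TOKEN"], add_to_git_credential=True)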

Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Logging in to the Hugging Face Hub is not core to our class topic of contrastive learning, but it allows:

Pulling pretrained models like BAAI/bge-base-en-v1.5

Uploading your own fine-tuned model after training

This ties into model sharing and reproducibility.

# 1. Create & Prepare embedding dataset
An embedding dataset typically consists of text pairs (question, answer/context) or triplets that represent relationships or similarities between sentences. The dataset format you choose or have available will also impact the loss function you can use. Common formats for embedding datasets:

Positive Pair: Text Pairs of related sentences (query, context | query, answer), suitable for tasks like similarity or semantic search, example datasets: sentence-transformers/sentence-compression, sentence-transformers/natural-questions.

Triplets: Text triplets consisting of (anchor, positive, negative), example datasets sentence-transformers/quora-duplicates, nirantk/triplets.

Pair with Similarity Score: Sentence pairs with a similarity score indicating how related they are, example datasets: sentence-transformers/stsb, PhilipMay/stsb_multi_mt

We will use the dataset https://huggingface.co/datasets/philschmid/finanical-rag-embedding-dataset (7000 positive text pairs of questions and corresponding contexts from https://stocklight.com/stocks/us/nasdaq-nvda/nvidia/annual-reports/nasdaq-nvda-2023-10K-23668751.pdf).


### CosineSimilarityLoss → needs scored pairs

Needs: Sentence pairs, each with an explicit similarity score.

Format: [(sentence1, sentence2, label)] where label is a similarity score (e.g., 1.0 for similar, 0.0 for dissimilar).

Use case: When you have graded similarity (not just binary).

Goal: Make the cosine similarity of each pair match its label, so similar pairs score high and dissimilar pairs score low.
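
To make this objective concrete, here is a minimal pure-Python sketch of the idea: compute the cosine similarity of two embeddings and penalize its squared difference from the label. The toy vectors and the helper names (`cosine`, `cosine_similarity_loss`) are illustrative assumptions; the library's loss works on batched tensors produced by the model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(pairs):
    """Mean squared error between predicted cosine similarity and the label."""
    errors = [(cosine(u, v) - label) ** 2 for u, v, label in pairs]
    return sum(errors) / len(errors)


# Toy (hypothetical) embeddings with gold similarity scores
toy_pairs = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),  # same direction, labelled similar
    ([1.0, 0.0], [0.0, 1.0], 0.0),  # orthogonal, labelled dissimilar
]
good_loss = cosine_similarity_loss(toy_pairs)

# Flipping the labels makes the same embeddings maximally wrong
flipped_pairs = [(u, v, 1.0 - label) for u, v, label in toy_pairs]
bad_loss = cosine_similarity_loss(flipped_pairs)

print(good_loss, bad_loss)
```

The loss is zero when the embedding geometry already agrees with the labels and grows as the two disagree, which is what drives the gradient updates.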



### MultipleNegativesRankingLoss → needs positive pairs only
Needs: Only positive pairs.

Format: [(anchor, positive)]

Internally: Uses other positives in the batch as negatives.

Use case: Efficient when you have lots of positives but no explicit negatives.

Goal: Pull anchor and positive closer; push apart from other positives in the batch.

Important: You need large batch sizes for this to work well, since it generates negatives from other samples.
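
The in-batch-negatives idea can be sketched in a few lines of pure Python. For each anchor, every positive in the batch is scored; the anchor's own positive is treated as the correct "class" and the rest as negatives, and a cross-entropy loss is applied. This is a simplified illustration, not the library's code: the function name `mnr_loss`, the toy vectors, and the `scale=20.0` temperature are assumptions made for the example.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def mnr_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    positives[i] is the match for anchors[i]; every positives[j] with j != i
    acts as a negative for anchors[i].
    """
    anchors = [_normalize(a) for a in anchors]
    positives = [_normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * _dot(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy (hypothetical) batch of two pairs
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = mnr_loss(anchors, positives)
misaligned_loss = mnr_loss(anchors, positives[::-1])  # wrong pairing
print(aligned_loss, misaligned_loss)
```

With only two pairs, each anchor sees a single negative. A larger batch supplies more negatives per anchor, which is why batch size matters for this loss.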



### TripletLoss → needs (anchor, positive, negative)


Needs: Triplets → (anchor, positive, negative)

Format: [(anchor, positive, negative)]

Use case: You explicitly know what should be similar and what shouldn't.

Goal: Minimize distance between anchor and positive, while maximizing distance between anchor and negative by a margin.
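
As a small worked example of the margin idea, the sketch below computes a triplet loss with Euclidean distance on toy 2-D points. The `margin=1.0` value, the points, and the helper names are assumptions chosen so the arithmetic is easy to follow.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(d(anchor, positive) - d(anchor, negative) + margin, 0)."""
    gap = euclidean(anchor, positive) - euclidean(anchor, negative) + margin
    return max(gap, 0.0)


anchor = [0.0, 0.0]
positive = [0.0, 1.0]       # distance 1.0 from the anchor
easy_negative = [3.0, 4.0]  # distance 5.0: already far beyond the margin
hard_negative = [0.0, 1.5]  # distance 1.5: inside the margin

easy = triplet_loss(anchor, positive, easy_negative)  # 1.0 - 5.0 + 1.0 < 0, clipped to 0
hard = triplet_loss(anchor, positive, hard_negative)  # 1.0 - 1.5 + 1.0 = 0.5
print(easy, hard)
```

Only the hard negative contributes to the loss: once a negative is farther from the anchor than the positive by at least the margin, the triplet stops producing gradient.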



This step reinforces that data format determines the learning strategy in contrastive training.

For further documentation, see https://sbert.net/docs/package_reference/sentence_transformer/losses.html



In [None]:
from datasets import load_dataset

# Load dataset from the hub
dataset = load_dataset("philschmid/finanical-rag-embedding-dataset", split="train")

# rename columns
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("context", "positive")

# Add an id column to the dataset
dataset = dataset.add_column("id", range(len(dataset)))

# split dataset into a 10% test set
dataset = dataset.train_test_split(test_size=0.1)

# save datasets to disk
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


251043

In [None]:
dataset['train'][0]

{'anchor': 'What kind of issues might be described in Note 14 of a financial statement?',
 'positive': 'Note 14 in a financial statement could describe issues related to legal proceedings.',
 'id': 4815}

Visual check — ensures each sample has an anchor (query) and positive (context). This is needed for models like BGE which are trained/fine-tuned using pairwise contrastive objectives.

In [None]:
dataset['train']['id'][1]

3752

The cell below uses SentenceTransformer to load a pretrained, contrastively trained model.

It also introduces Matryoshka dimensions: embeddings are truncated to several sizes and each size is evaluated separately.

This shows a modern use of contrastive embeddings, and the evaluation logic follows the advice in Chapter 10 of Jay Alammar's book: test using cosine similarity and IR metrics.
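
To illustrate what truncating an embedding means, here is a minimal sketch: keep the first `dim` values, re-normalize, and compute cosine similarity on the shortened vectors. The 8-dimensional toy vectors and the helper names are assumptions made for illustration; real embeddings from this model have 768 dimensions.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` values and rescale to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]


def cosine_at_dim(u, v, dim):
    """Cosine similarity computed on embeddings truncated to `dim` values."""
    u_t = truncate_and_normalize(u, dim)
    v_t = truncate_and_normalize(v, dim)
    return sum(a * b for a, b in zip(u_t, v_t))


# Toy (hypothetical) 8-dimensional embeddings
query = [0.6, 0.3, 0.1, -0.2, 0.05, 0.4, -0.1, 0.2]
doc = [0.5, 0.35, 0.0, -0.1, 0.1, 0.3, -0.2, 0.25]

scores = {dim: cosine_at_dim(query, doc, dim) for dim in (8, 4, 2)}
for dim, score in scores.items():
    print(f"dim={dim}: cosine={score:.4f}")
```

A model that was not trained with a Matryoshka-style objective gives no guarantee that the leading dimensions carry the most information, which is why retrieval quality usually drops as `dim` shrinks.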

In [None]:
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim
from datasets import load_dataset, concatenate_datasets

model_id = "BAAI/bge-base-en-v1.5"  # Hugging Face model ID
matryoshka_dimensions = [768, 512, 256, 128, 64] # Important: large to small

# Load a model
model = SentenceTransformer(
    model_id, device="cuda" if torch.cuda.is_available() else "cpu"
)

# load test dataset
test_dataset = load_dataset("json", data_files="test_dataset.json", split="train")
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])

# Convert the datasets to dictionaries
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)  # Our corpus (cid => document)
queries = dict(
    zip(test_dataset["id"], test_dataset["anchor"])
)  # Our queries (qid => question)

# Create a mapping of relevant documents (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => set of relevant cids)
for q_id in queries:
    relevant_docs[q_id] = {q_id}  # each query's only relevant document shares its id


matryoshka_evaluators = []
# Iterate over the different dimensions
for dim in matryoshka_dimensions:
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,  # Truncate the embeddings to a certain dimension
        score_functions={"cosine": cos_sim},
    )
    matryoshka_evaluators.append(ir_evaluator)

# Create a sequential evaluator
evaluator = SequentialEvaluator(matryoshka_evaluators)


Evaluates how well embeddings perform in semantic retrieval

Metric: nDCG@10 using cosine similarity

This simulates a RAG (retrieval-augmented generation) scenario
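
For intuition about the metric, here is a minimal sketch of nDCG@k with binary relevance for a single query. It is a simplified illustration under that assumption, not the evaluator's code; the function name and document ids are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """nDCG@k with binary relevance for one query."""
    # Discounted gain: a hit at rank r contributes 1 / log2(r + 1)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    # Ideal ranking puts all relevant documents first
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


top_hit = ndcg_at_k(["a", "b", "c"], {"a"})    # relevant doc at rank 1
third_hit = ndcg_at_k(["x", "y", "a"], {"a"})  # relevant doc at rank 3
miss = ndcg_at_k(["x", "y", "z"], {"a"})       # relevant doc not retrieved
print(top_hit, third_hit, miss)
```

In this notebook each query has exactly one relevant document, so the per-query score reduces to 1 / log2(rank + 1) when the document appears in the top 10, and 0 otherwise. The reported number is the average over all test queries.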

In [None]:
# Evaluate the model
results = evaluator(model)

# Uncomment for full results
# print(results)

# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")

dim_768_cosine_ndcg@10: 0.7599280565076842
dim_512_cosine_ndcg@10: 0.756286758957279
dim_256_cosine_ndcg@10: 0.7437256846100088
dim_128_cosine_ndcg@10: 0.7185072307335217
dim_64_cosine_ndcg@10: 0.6594461283257674


### Reload Model with Flash Attention

Not critical to understanding contrastive learning, but teaches:

How to use optimized attention (PyTorch SDPA, which can dispatch to fused FlashAttention-style kernels) for faster training

How to document your model (language, license, etc.) → helpful for publishing

In [None]:
from sentence_transformers import SentenceTransformerModelCardData, SentenceTransformer

# Hugging Face model ID: https://huggingface.co/BAAI/bge-base-en-v1.5
model_id = "BAAI/bge-base-en-v1.5"

# load model with PyTorch SDPA attention, which can use fused Flash Attention kernels when available
model = SentenceTransformer(
    model_id,
    model_kwargs={"attn_implementation": "sdpa"},
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="BGE base Financial Matryoshka",
    ),
)



Below is the core of contrastive fine-tuning:

MultipleNegativesRankingLoss is the actual contrastive loss function.

MatryoshkaLoss wraps it to encourage embeddings to remain effective at smaller dimensions (good for compression + speed).

This step links theory (contrastive training) directly to implementation — students can see how the right loss shapes the embedding space.

In [None]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]  # Important: large to small
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)
# The inner loss (MultipleNegativesRankingLoss):
# - pulls positive pairs together
# - pushes all other samples in the batch away (as negative samples)
# - is the foundation for learning meaningful sentence embeddings
# In this notebook, MultipleNegativesRankingLoss is the primary semantic alignment mechanism.



### Matryoshka = "nested embeddings of decreasing size"
Named after Matryoshka dolls, this trick encourages the model to produce strong embeddings even at lower dimensions.

What it does:

After computing the full embedding (e.g., 768 dims),

It truncates the embedding to smaller sizes (here 512, 256, 128, and 64)

Then applies the same contrastive loss to each truncated version

This teaches the model to retain semantic structure even in the first 128 dimensions

This is super useful when:

You want to compress models for mobile/edge inference

You need faster vector search (lower-dimensional index = faster ANN)
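
The steps above can be sketched as a loop: truncate both sides of each pair to every target size, apply the same inner contrastive loss, and sum the results. This is a simplified pure-Python illustration; the helper names, the toy 4-dimensional vectors, the equal weights, and the `scale=20.0` temperature are all assumptions for the example.

```python
import math


def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    anchors = [_normalize(a) for a in anchors]
    positives = [_normalize(p) for p in positives]
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * sum(x * y for x, y in zip(anchor, p)) for p in positives]
        max_logit = max(logits)
        total += max_logit + math.log(sum(math.exp(l - max_logit) for l in logits)) - logits[i]
    return total / len(anchors)


def matryoshka_loss(anchors, positives, dims, weights=None):
    """Weighted sum of the inner loss applied to each truncated embedding size."""
    weights = weights or [1.0] * len(dims)
    total = 0.0
    for dim, weight in zip(dims, weights):
        anchors_trunc = [a[:dim] for a in anchors]
        positives_trunc = [p[:dim] for p in positives]
        total += weight * in_batch_contrastive_loss(anchors_trunc, positives_trunc)
    return total


# Toy (hypothetical) 4-dimensional embeddings for a batch of two pairs
anchors = [[1.0, 0.2, 0.0, 0.5], [0.1, 1.0, 0.4, 0.0]]
positives = [[0.9, 0.1, 0.1, 0.6], [0.2, 0.8, 0.5, 0.1]]

full = in_batch_contrastive_loss(anchors, positives)
half = in_batch_contrastive_loss([a[:2] for a in anchors], [p[:2] for p in positives])
total = matryoshka_loss(anchors, positives, dims=[4, 2])
print(full, half, total)
```

Because the truncated versions share their leading dimensions with the full embedding, minimizing the summed loss pushes the model to pack the most useful information into the earliest dimensions.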

Below you define how the model trains. Key ideas tied to contrastive learning:

batch_sampler=NO_DUPLICATES: ensures diverse negatives per batch — crucial for effective contrastive training

eval_strategy="epoch": evaluation is based on IR metrics after each epoch — consistent with contrastive evaluation logic

metric_for_best_model="eval_dim_128_cosine_ndcg@10": selects the best checkpoint by nDCG@10 (computed with cosine similarity) on embeddings truncated to 128 dimensions, so the model is chosen for quality at a small embedding size (Matryoshka idea)
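
A quick arithmetic check of the batch settings used in the next cell may help. The sketch assumes a single GPU (`num_devices = 1` is an assumption, not something the notebook states). Note that gradient accumulation enlarges the batch used for each optimizer step, but the in-batch negatives for MultipleNegativesRankingLoss still come from one per-device batch at a time.

```python
per_device_train_batch_size = 32
gradient_accumulation_steps = 16
num_devices = 1  # assumption: training on a single GPU

# Number of examples contributing to each optimizer step
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Negatives each anchor sees in one forward pass: the other positives
# in the same per-device batch
in_batch_negatives_per_anchor = per_device_train_batch_size - 1

# 7000 pairs with a 10% test split
total_examples = 7000
test_examples = total_examples // 10
train_examples = total_examples - test_examples

print(effective_batch_size, in_batch_negatives_per_anchor, train_examples)
```

If more negatives per anchor are wanted, the per-device batch size is the setting to raise (memory permitting), not the accumulation steps.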



In [None]:
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

# load train dataset again
train_dataset = load_dataset("json", data_files="train_dataset.json", split="train")

# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="bge-base-financial-matryoshka", # output directory and hugging face model ID
    num_train_epochs=4,                         # number of epochs
    per_device_train_batch_size=32,             # train batch size
    gradient_accumulation_steps=16,             # for a global batch size of 512
    per_device_eval_batch_size=16,              # evaluation batch size
    warmup_ratio=0.1,                           # warmup ratio
    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                 # use cosine learning rate scheduler
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    tf32=False,                                 # tf32 disabled (can be enabled on Ampere or newer GPUs)
    fp16=True,                                  # use fp16 mixed precision (set bf16=True instead on GPUs that support it)
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_steps=10,                           # log every 10 steps
    save_total_limit=3,                         # save only the last 3 models
    load_best_model_at_end=True,                # load the best model when training ends
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score for the 128 dimension
)

The training loop is abstracted by the SentenceTransformerTrainer, wrapping:

The model

The training dataset (anchor + positive pairs)

The chosen contrastive loss (MatryoshkaLoss → wraps MultipleNegativesRankingLoss)

Evaluator for IR-based semantic retrieval

This connects directly to the textbook chapter’s message: contrastive learning is best evaluated using semantic similarity, not classification metrics.

In [None]:
from sentence_transformers import SentenceTransformerTrainer

trainer = SentenceTransformerTrainer(
    model=model,  # bge-base-en-v1.5
    args=args,  # training arguments
    train_dataset=train_dataset.select_columns(
        ["positive", "anchor"]
    ),  # training dataset
    loss=train_loss,
    evaluator=evaluator,
)

Trains the model with the contrastive loss

Moves each anchor closer to its positive and farther from the other samples in the batch

Saves the updated embedding model

In [None]:
# start training; checkpoints are saved to the output directory after each epoch
trainer.train()

# save the best model
trainer.save_model()

# push model to hub
#trainer.model.push_to_hub("bge-base-financial-matryoshka")
#error below for trying to push this to Hugging Face :)



Epoch,Training Loss,Validation Loss,Dim 768 Cosine Accuracy@1,Dim 768 Cosine Accuracy@3,Dim 768 Cosine Accuracy@5,Dim 768 Cosine Accuracy@10,Dim 768 Cosine Precision@1,Dim 768 Cosine Precision@3,Dim 768 Cosine Precision@5,Dim 768 Cosine Precision@10,Dim 768 Cosine Recall@1,Dim 768 Cosine Recall@3,Dim 768 Cosine Recall@5,Dim 768 Cosine Recall@10,Dim 768 Cosine Ndcg@10,Dim 768 Cosine Mrr@10,Dim 768 Cosine Map@100,Dim 512 Cosine Accuracy@1,Dim 512 Cosine Accuracy@3,Dim 512 Cosine Accuracy@5,Dim 512 Cosine Accuracy@10,Dim 512 Cosine Precision@1,Dim 512 Cosine Precision@3,Dim 512 Cosine Precision@5,Dim 512 Cosine Precision@10,Dim 512 Cosine Recall@1,Dim 512 Cosine Recall@3,Dim 512 Cosine Recall@5,Dim 512 Cosine Recall@10,Dim 512 Cosine Ndcg@10,Dim 512 Cosine Mrr@10,Dim 512 Cosine Map@100,Dim 256 Cosine Accuracy@1,Dim 256 Cosine Accuracy@3,Dim 256 Cosine Accuracy@5,Dim 256 Cosine Accuracy@10,Dim 256 Cosine Precision@1,Dim 256 Cosine Precision@3,Dim 256 Cosine Precision@5,Dim 256 Cosine Precision@10,Dim 256 Cosine Recall@1,Dim 256 Cosine Recall@3,Dim 256 Cosine Recall@5,Dim 256 Cosine Recall@10,Dim 256 Cosine Ndcg@10,Dim 256 Cosine Mrr@10,Dim 256 Cosine Map@100,Dim 128 Cosine Accuracy@1,Dim 128 Cosine Accuracy@3,Dim 128 Cosine Accuracy@5,Dim 128 Cosine Accuracy@10,Dim 128 Cosine Precision@1,Dim 128 Cosine Precision@3,Dim 128 Cosine Precision@5,Dim 128 Cosine Precision@10,Dim 128 Cosine Recall@1,Dim 128 Cosine Recall@3,Dim 128 Cosine Recall@5,Dim 128 Cosine Recall@10,Dim 128 Cosine Ndcg@10,Dim 128 Cosine Mrr@10,Dim 128 Cosine Map@100,Dim 64 Cosine Accuracy@1,Dim 64 Cosine Accuracy@3,Dim 64 Cosine Accuracy@5,Dim 64 Cosine Accuracy@10,Dim 64 Cosine Precision@1,Dim 64 Cosine Precision@3,Dim 64 Cosine Precision@5,Dim 64 Cosine Precision@10,Dim 64 Cosine Recall@1,Dim 64 Cosine Recall@3,Dim 64 Cosine Recall@5,Dim 64 Cosine Recall@10,Dim 64 Cosine Ndcg@10,Dim 64 Cosine Mrr@10,Dim 64 Cosine Map@100,Sequential Score
0,1.6141,No log,0.704286,0.824286,0.857143,0.892857,0.704286,0.274762,0.171429,0.089286,0.704286,0.824286,0.857143,0.892857,0.801962,0.772498,0.776669,0.707143,0.82,0.861429,0.898571,0.707143,0.273333,0.172286,0.089857,0.707143,0.82,0.861429,0.898571,0.804818,0.774724,0.778348,0.691429,0.825714,0.857143,0.888571,0.691429,0.275238,0.171429,0.088857,0.691429,0.825714,0.857143,0.888571,0.793438,0.762574,0.76653,0.692857,0.805714,0.832857,0.877143,0.692857,0.268571,0.166571,0.087714,0.692857,0.805714,0.832857,0.877143,0.784672,0.755213,0.759335,0.647143,0.777143,0.81,0.858571,0.647143,0.259048,0.162,0.085857,0.647143,0.777143,0.81,0.858571,0.753186,0.719456,0.723581,0.723581
1,0.6597,No log,0.708571,0.837143,0.875714,0.901429,0.708571,0.279048,0.175143,0.090143,0.708571,0.837143,0.875714,0.901429,0.810546,0.780804,0.784342,0.714286,0.838571,0.872857,0.901429,0.714286,0.279524,0.174571,0.090143,0.714286,0.838571,0.872857,0.901429,0.811911,0.782728,0.786257,0.711429,0.825714,0.862857,0.897143,0.711429,0.275238,0.172571,0.089714,0.711429,0.825714,0.862857,0.897143,0.806722,0.777533,0.781127,0.701429,0.81,0.844286,0.892857,0.701429,0.27,0.168857,0.089286,0.701429,0.81,0.844286,0.892857,0.795131,0.764134,0.767545,0.662857,0.79,0.828571,0.87,0.662857,0.263333,0.165714,0.087,0.662857,0.79,0.828571,0.87,0.767801,0.734974,0.739091,0.739091
2,0.462,No log,0.707143,0.835714,0.874286,0.904286,0.707143,0.278571,0.174857,0.090429,0.707143,0.835714,0.874286,0.904286,0.81044,0.779842,0.783224,0.714286,0.838571,0.875714,0.904286,0.714286,0.279524,0.175143,0.090429,0.714286,0.838571,0.875714,0.904286,0.812864,0.783133,0.78653,0.71,0.83,0.865714,0.895714,0.71,0.276667,0.173143,0.089571,0.71,0.83,0.865714,0.895714,0.805972,0.776808,0.780686,0.698571,0.811429,0.838571,0.892857,0.698571,0.270476,0.167714,0.089286,0.698571,0.811429,0.838571,0.892857,0.794272,0.762974,0.766464,0.672857,0.785714,0.83,0.87,0.672857,0.261905,0.166,0.087,0.672857,0.785714,0.83,0.87,0.771635,0.740121,0.744523,0.744523
3,0.3926,No log,0.711429,0.835714,0.872857,0.905714,0.711429,0.278571,0.174571,0.090571,0.711429,0.835714,0.872857,0.905714,0.812613,0.782349,0.785637,0.714286,0.838571,0.875714,0.904286,0.714286,0.279524,0.175143,0.090429,0.714286,0.838571,0.875714,0.904286,0.813101,0.783421,0.786845,0.712857,0.828571,0.865714,0.897143,0.712857,0.27619,0.173143,0.089714,0.712857,0.828571,0.865714,0.897143,0.80747,0.778439,0.782248,0.697143,0.811429,0.841429,0.892857,0.697143,0.270476,0.168286,0.089286,0.697143,0.811429,0.841429,0.892857,0.793823,0.762337,0.765875,0.668571,0.787143,0.83,0.87,0.668571,0.262381,0.166,0.087,0.668571,0.787143,0.83,0.87,0.770207,0.738159,0.742589,0.742589


HfHubHTTPError:  (Request ID: Root=1-66fafc39-516aef920ef125bb712044fc;49f4b2a7-457d-4e16-a705-56542c8dfede)

403 Forbidden: You don't have the rights to create a model under the namespace "pamuksuz".
Cannot access content at: https://huggingface.co/api/repos/create.
If you are trying to create or update content, make sure you have a token with the `write` role.

Re-running the evaluator after training to see how well the fine-tuned embeddings perform.

Evaluator uses cosine similarity and reports nDCG@10 for various embedding dimensions (Matryoshka idea again)

Confirms: model has improved in semantic understanding via contrastive training

In [None]:
from sentence_transformers import SentenceTransformer

fine_tuned_model = SentenceTransformer(
    args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu"
)
# Evaluate the model
results = evaluator(fine_tuned_model)

# Uncomment for full results
# print(results)

# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")

dim_768_cosine_ndcg@10: 0.8102025115468267
dim_512_cosine_ndcg@10: 0.8117242397924654
dim_256_cosine_ndcg@10: 0.8070871775080796
dim_128_cosine_ndcg@10: 0.7955951729403435
dim_64_cosine_ndcg@10: 0.7677576145174085
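
To quantify the change, the sketch below compares the nDCG@10 scores printed before and after fine-tuning in this notebook. The numbers are copied from the two outputs above; the variable names are illustrative.

```python
# nDCG@10 of the pretrained model (from the first evaluation output)
baseline = {
    768: 0.7599280565076842,
    512: 0.756286758957279,
    256: 0.7437256846100088,
    128: 0.7185072307335217,
    64: 0.6594461283257674,
}

# nDCG@10 of the fine-tuned model (from the evaluation output above)
fine_tuned = {
    768: 0.8102025115468267,
    512: 0.8117242397924654,
    256: 0.8070871775080796,
    128: 0.7955951729403435,
    64: 0.7677576145174085,
}

gains = {dim: fine_tuned[dim] - baseline[dim] for dim in baseline}
for dim, gain in gains.items():
    relative = 100 * gain / baseline[dim]
    print(f"dim {dim}: +{gain:.4f} absolute ({relative:.1f}% relative)")

largest_gain_dim = max(gains, key=gains.get)
print("largest gain at dim", largest_gain_dim)
```

Two things stand out: the gain grows as the dimension shrinks, and the fine-tuned model at 64 dimensions already scores higher than the pretrained model at 768 dimensions.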


# InformationRetrievalEvaluator Deep Dive (Optional)

This block introduces Information Retrieval evaluation, where:

A query is embedded

A corpus of documents is embedded

The goal is to retrieve the most semantically similar ones

This mirrors the retrieval-based use case that Chapter 10 describes (and later chapters build on with RAG).

In [None]:
#https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator
import random
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from datasets import load_dataset

# Load a model
model = SentenceTransformer('all-mpnet-base-v2')

# Load the Quora IR dataset (https://huggingface.co/datasets/BeIR/quora, https://huggingface.co/datasets/BeIR/quora-qrels)
corpus = load_dataset("BeIR/quora", "corpus", split="corpus")
queries = load_dataset("BeIR/quora", "queries", split="queries")
relevant_docs_data = load_dataset("BeIR/quora-qrels", split="validation")

# Shrink the corpus size heavily to only the relevant documents + 10,000 random documents
# (a set keeps the membership test inside filter() fast)
required_corpus_ids = set(map(str, relevant_docs_data["corpus-id"]))
required_corpus_ids |= set(random.sample(corpus["_id"], k=10_000))
corpus = corpus.filter(lambda x: x["_id"] in required_corpus_ids)

# Convert the datasets to dictionaries
corpus = dict(zip(corpus["_id"], corpus["text"]))  # Our corpus (cid => document)
queries = dict(zip(queries["_id"], queries["text"]))  # Our queries (qid => question)
relevant_docs = {}  # Query ID to relevant documents (qid => set of relevant cids)
for qid, corpus_ids in zip(relevant_docs_data["query-id"], relevant_docs_data["corpus-id"]):
    qid = str(qid)
    corpus_ids = str(corpus_ids)
    if qid not in relevant_docs:
        relevant_docs[qid] = set()
    relevant_docs[qid].add(corpus_ids)

# Given queries, a corpus and a mapping with relevant documents, the InformationRetrievalEvaluator computes different IR metrics.
ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="BeIR-quora-dev",
)
results = ir_evaluator(model)
'''
Information Retrieval Evaluation of the model on the BeIR-quora-dev dataset:
Queries: 5000
Corpus: 17476

Score-Function: cosine
Accuracy@1: 96.26%
Accuracy@3: 99.38%
Accuracy@5: 99.74%
Accuracy@10: 99.94%
Precision@1: 96.26%
Precision@3: 43.01%
Precision@5: 27.66%
Precision@10: 14.58%
Recall@1: 82.93%
Recall@3: 96.28%
Recall@5: 98.38%
Recall@10: 99.55%
MRR@10: 0.9782
NDCG@10: 0.9807
MAP@100: 0.9732
Score-Function: dot
Accuracy@1: 96.26%
Accuracy@3: 99.38%
Accuracy@5: 99.74%
Accuracy@10: 99.94%
Precision@1: 96.26%
Precision@3: 43.01%
Precision@5: 27.66%
Precision@10: 14.58%
Recall@1: 82.93%
Recall@3: 96.28%
Recall@5: 98.38%
Recall@10: 99.55%
MRR@10: 0.9782
NDCG@10: 0.9807
MAP@100: 0.9732
'''
print(ir_evaluator.primary_metric)
# => "BeIR-quora-dev_cosine_map@100"
print(results[ir_evaluator.primary_metric])
# => 0.9732046108457585

BeIR-quora-dev_cosine_map@100
0.9742612491668802


Going beyond our custom dataset, this:

Evaluates the model on a standard IR benchmark (BeIR/quora)

Computes real-world IR metrics like:

Accuracy@K

Precision@K

Recall@K

MRR, nDCG, MAP
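
As a rough illustration of how the first few metrics are defined, here is a per-query sketch in pure Python. It is a simplified version under binary relevance, not the evaluator's implementation; the function name and document ids are hypothetical.

```python
def ir_metrics_at_k(ranked_ids, relevant_ids, k):
    """Accuracy@k, Precision@k, Recall@k and MRR@k for a single query."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]
    # Rank (1-based) of the first relevant document in the top k, if any
    first_hit_rank = next(
        (rank for rank, doc_id in enumerate(top_k, start=1) if doc_id in relevant_ids),
        None,
    )
    return {
        "accuracy": 1.0 if hits else 0.0,        # at least one relevant doc in the top k
        "precision": len(hits) / k,              # share of the top k that is relevant
        "recall": len(hits) / len(relevant_ids), # share of relevant docs retrieved
        "mrr": 1.0 / first_hit_rank if first_hit_rank else 0.0,
    }


ranked = ["d3", "d1", "d7", "d2"]  # system ranking, best first
relevant = {"d1", "d2"}            # ground-truth relevant documents
metrics = ir_metrics_at_k(ranked, relevant, k=3)
print(metrics)
```

The evaluator averages these per-query values over all queries. This also explains the small Precision@10 values in the training table earlier: with one relevant document per query, Precision@10 cannot exceed 0.1.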

This is a direct operationalization of the core concept in Chapter 10:

“The best way to train and evaluate sentence embeddings is by measuring how well they retrieve relevant texts.”