In [None]:
!pip install -U sentence-transformers transformers accelerate datasets huggingface_hub

In [None]:
import torch
from datasets import load_dataset
from huggingface_hub import login
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerModelCardData,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers



# Improving Retrieval Performance in RAG Applications

## Introduction
Embedding models have become a core building block of natural language processing (NLP). They transform high-dimensional data such as text into a lower-dimensional vector space while preserving relevant informational and relational properties, so that items with similar meaning end up close together. This representation supports a range of NLP tasks, including search, recommendation systems, and information retrieval.


### Retrieval-Augmented Generation (RAG)


Retrieval-Augmented Generation (RAG) combines the strengths of retrieval-based methods and generation-based methods to improve the performance of NLP systems. RAG applications retrieve relevant information from a large corpus of documents and use this information to generate more accurate and contextually appropriate responses. This approach has found applications in numerous domains, including chatbots, search engines, recommendation systems, and knowledge management systems.


The effectiveness of RAG applications heavily relies on the quality of the embedding models used for information retrieval. Embedding models must accurately capture the semantic meaning of queries and documents to ensure that the most relevant information is retrieved. However, the reliance on general-purpose embedding models often limits the performance of RAG applications in specific domains.
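
To make the retrieval step concrete, here is a minimal sketch of ranking documents by cosine similarity to a query. The three-dimensional vectors, the document ids, and the helper names `cosine_similarity` and `retrieve` are made up for illustration; a real system would use vectors produced by an embedding model and usually a vector index.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, doc_vecs, top_k=2):
    """Return the ids of the top_k documents most similar to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in scored[:top_k]]

# Toy "embeddings" (hypothetical values, not produced by a real model)
toy_docs = {
    "doc_cardiology": [0.9, 0.1, 0.0],
    "doc_finance": [0.0, 0.2, 0.9],
    "doc_neurology": [0.7, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]
top_docs = retrieve(toy_query, toy_docs, top_k=2)
print(top_docs)
```

If the embedding model places a query far from its truly relevant documents, this ranking step returns the wrong passages, and the generator has nothing useful to work with.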


Embedding models are mostly trained on extensive corpora of general knowledge, such as Wikipedia or Common Crawl. This broad approach can be limiting when applied to specialized domains. For example, models trained on general data may not perform well in technical domains without additional tuning, because general-knowledge embeddings may not capture the nuances and specialized terminology unique to those domains.




Customizing embedding models to capture domain-specific knowledge is crucial for enhancing the performance of RAG applications. Domain-specific embeddings are trained on specialized corpora that reflect the language and terminology used in a particular field. By doing so, these embeddings can better capture the semantic nuances and context-specific meanings that are essential for accurate information retrieval. This customization process involves training or fine-tuning models on specialized datasets, incorporating domain-specific vocabularies, and possibly adjusting model architectures to better handle the characteristics of the data.




### Boosting Retrieval Performance in RAG Applications


Enhancing retrieval performance is crucial for the success of RAG applications, as the quality of retrieved documents significantly impacts the quality of the generated content. Customizing embeddings can lead to more accurate and relevant data retrieval, which in turn improves the overall output of the RAG system. For instance, a RAG application in the medical field, trained with domain-specific embeddings, would be able to retrieve and generate more precise and clinically relevant information than one using a generic embedding model.


Sentence Transformers is a Python library for using and training embedding models for a wide range of applications, such as retrieval-augmented generation, semantic search, semantic textual similarity, and paraphrase mining. Originally developed by UKPLab, it builds on Transformer encoders such as BERT (Bidirectional Encoder Representations from Transformers), with a focus on producing better sentence-level embeddings. A plain BERT model outputs one vector for each token in the input text, whereas a Sentence Transformer model produces a single fixed-size vector for the entire input sentence or paragraph, which is more practical for tasks that require sentence-level comparisons.


The v3.0 update is the largest since the project's inception. It introduces a new trainer that makes it easier to train and fine-tune embedding models, revises components such as datasets and loss functions, and streamlines the training process. In this post, I'll show you how to fine-tune a Sentence Transformer model for a specific task using the Sentence Transformers library.
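
One common way to turn per-token vectors into a single sentence vector is mean pooling over the non-padding tokens. The sketch below illustrates that idea in plain Python; the function name `mean_pool` and the toy values are assumptions for illustration, and a given model may use a different pooling strategy (for example, taking the first token's vector).

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0).

    Assumes at least one position has mask == 1.
    """
    dim = len(token_vectors[0])
    totals = [0.0] * dim
    kept = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            kept += 1
            for i, value in enumerate(vector):
                totals[i] += value
    return [total / kept for total in totals]

# Two real tokens and one padding token (toy values)
toy_tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_tokens, toy_mask)
print(sentence_vector)
```

Whatever the input length, the output has the same dimensionality as one token vector, which is what makes sentence-level comparison straightforward.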


## Training a Sentence Transformer

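Training a Sentence Transformer model involves three to five components: a dataset, a loss function, training arguments (optional), an evaluator (optional), and the trainer.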



<center><figure><img src="../imgs/Sentence Transformer Training.png" alt="drawing" width="1100"/><figcaption>Fig. 1: Sentence Transformer training components</figcaption></figure></center>

### Dataset

You can load a local dataset or one from the Hugging Face Hub using datasets.load_dataset(). One important consideration is that your dataset format must match your loss function. If the loss function requires a label, your dataset must have a column named “label” or “score”. All other columns are treated as inputs, and their number must match the number of inputs your chosen loss expects. The names of these columns are irrelevant; only their order matters. Table 1 shows the requirements for the loss functions in Sentence Transformers v3.0.
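
The column rule above can be sketched as a small check. The helpers `split_columns` and `matches_loss` are hypothetical and not part of the library; they only mirror the rule described here (a “label” or “score” column is the label, everything else is an input).

```python
LABEL_COLUMNS = {"label", "score"}

def split_columns(column_names):
    """Separate label columns from the ordered input columns."""
    labels = [name for name in column_names if name in LABEL_COLUMNS]
    inputs = [name for name in column_names if name not in LABEL_COLUMNS]
    return inputs, labels

def matches_loss(column_names, expected_inputs, needs_label):
    """Check a dataset's columns against what a loss expects."""
    inputs, labels = split_columns(column_names)
    return len(inputs) == expected_inputs and bool(labels) == needs_label

# (anchor, positive) pairs with no label: fits a two-input, label-free loss
pair_ok = matches_loss(["anchor", "positive"], expected_inputs=2, needs_label=False)
# An extra "id" column counts as a third input, so the check fails
extra_id = matches_loss(["id", "anchor", "positive"], expected_inputs=2, needs_label=False)
# Sentence pairs with a similarity score: fits a two-input loss that needs a label
scored_ok = matches_loss(["sentence_A", "sentence_B", "score"], expected_inputs=2, needs_label=True)
print(pair_ok, extra_id, scored_ok)
```

The second case is why helper columns such as ids should be dropped from the training dataset before it is handed to the trainer.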

For this demo, I use the [SepKeyPro/trivia-anchor-positive-10k](https://huggingface.co/datasets/SepKeyPro/trivia-anchor-positive-10k) dataset, which contains (anchor, positive) pairs. We can load it using load_dataset().


In [None]:
corpus_dataset = load_dataset("SepKeyPro/trivia-anchor-positive-10k", split="train")


### Loss Function

The loss function is at the core of training in machine learning algorithms. Unfortunately, there is no single loss function that works best for all use cases. Choose your loss function based on your data, or curate your dataset based on your loss function. You can consult Table 1 for the loss function requirements.


<table border="1">
    <caption>Table 1: Requirements for Loss Functions </caption>
    <tr>
        <th>Inputs</th>
        <th>Labels</th>
        <th>Appropriate Loss Functions</th>
    </tr>
    <tr>
        <td>single sentences</td>
        <td>class</td>
        <td>BatchAllTripletLoss<br> BatchHardSoftMarginTripletLoss<br> BatchHardTripletLoss<br> BatchSemiHardTripletLoss</td>
    </tr>
    <tr>
        <td>single sentences</td>
        <td>none</td>
        <td>ContrastiveTensionLoss<br> DenoisingAutoEncoderLoss</td>
    </tr>
    <tr>
        <td>(anchor, anchor) pairs</td>
        <td>none</td>
        <td>ContrastiveTensionLossInBatchNegatives</td>
    </tr>
    <tr>
        <td>(damaged_sentence, original_sentence) pairs</td>
        <td>none</td>
        <td>DenoisingAutoEncoderLoss</td>
    </tr>
    <tr>
        <td>(sentence_A, sentence_B) pairs</td>
        <td>class</td>
        <td>SoftmaxLoss</td>
    </tr>
    <tr>
        <td>(anchor, positive) pairs</td>
        <td>none</td>
        <td>CachedMultipleNegativesRankingLoss<br> MultipleNegativesRankingLoss<br> MultipleNegativesSymmetricRankingLoss<br> MegaBatchMarginLoss<br> CachedGISTEmbedLoss<br> GISTEmbedLoss</td>
    </tr>
    <tr>
        <td>(anchor, positive/negative) pairs</td>
        <td>1 if positive, 0 if negative</td>
        <td>ContrastiveLoss<br> OnlineContrastiveLoss</td>
    </tr>
    <tr>
        <td>(sentence_A, sentence_B) pairs</td>
        <td>float similarity score</td>
        <td>CoSENTLoss<br> AnglELoss<br> CosineSimilarityLoss</td>
    </tr>
    <tr>
        <td>(anchor, positive, negative) triplets</td>
        <td>none</td>
        <td>CachedMultipleNegativesRankingLoss<br> MultipleNegativesRankingLoss<br> TripletLoss<br>CachedGISTEmbedLoss<br>GISTEmbedLoss</td>
    </tr>
</table>

Given our dataset format and Table 1, I select MultipleNegativesRankingLoss. It is a good choice when we only have positive pairs: for each anchor, the positives of the other samples in the same batch serve as negatives, so a batch of n pairs gives every sample n-1 in-batch negatives. For the model, I use BAAI/bge-base-en-v1.5, an embedding model pre-trained on a large corpus of English text.

In [None]:
model_id = "BAAI/bge-base-en-v1.5"
model = SentenceTransformer(
    model_id,
    model_kwargs={"attn_implementation": "sdpa"},
    device="cuda",
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="bge base trained on trivia anchor-positive",
    )
)

loss = MultipleNegativesRankingLoss(model)
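
To show what in-batch negatives mean in practice, here is a simplified plain-Python sketch of the idea behind this loss: score every anchor against every positive in the batch, then apply cross-entropy with the matching positive as the correct class. The function name `in_batch_negatives_loss` is hypothetical, and the use of cosine similarity with a scale of 20 is my understanding of the library defaults rather than something stated in this post.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_negatives_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct class for anchor i is positive i."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Row i of the similarity matrix: anchor i against all positives in the batch
        scores = [scale * cosine(anchor, positive) for positive in positives]
        # Log-sum-exp with max subtraction for numerical stability
        top = max(scores)
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)

# Each anchor is closest to its own positive: loss is near zero
aligned = in_batch_negatives_loss(
    anchors=[[1.0, 0.0], [0.0, 1.0]],
    positives=[[1.0, 0.0], [0.0, 1.0]],
)
# Each anchor is closest to the other sample's positive: loss is large
swapped = in_batch_negatives_loss(
    anchors=[[1.0, 0.0], [0.0, 1.0]],
    positives=[[0.0, 1.0], [1.0, 0.0]],
)
print(aligned, swapped)
```

Because the negatives come from the same batch, a duplicate positive inside a batch would be treated as a negative for its own anchor, which is the motivation for avoiding duplicates within a batch.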

### Training Arguments

You can specify training arguments to control and improve training. They are optional, but it is worth experimenting with them to see how they affect your results. Table 2 shows some of the training arguments to look at.

<table border="1">
  <caption>Table 2: Training Arguments</caption>
  <tr>
    <th>Training Argument</th>
    <th>Explanation</th>
    <th>Data type</th>
  </tr>
  <tr>
    <td>learning_rate</td>
    <td>The learning rate of the optimizer.</td>
    <td>float</td>
  </tr>
  <tr>
    <td>lr_scheduler_type</td>
    <td>The scheduler type to use. Possible values include: “linear”, “cosine”, “cosine_with_restarts”, “polynomial”, “constant”, “constant_with_warmup”, “inverse_sqrt”. See <a href="https://huggingface.co/docs/transformers/main/en/main_classes/optimizer_schedules#transformers.SchedulerType">SchedulerType<sup>1</sup></a> for more details.</td>
    <td>str</td>
  </tr>
  <tr>
    <td>warmup_ratio</td>
    <td>For schedulers with warmup, the ratio of total training steps used for a linear warmup from 0 to learning_rate.</td>
    <td>float</td>
  </tr>
  <tr>
    <td>num_train_epochs</td>
    <td>Total number of training epochs to perform.</td>
    <td>float</td>
  </tr>
  <tr>
    <td>max_steps</td>
    <td>If set to a positive number, the total number of training steps to perform. Overrides num_train_epochs.</td>
    <td>int</td>
  </tr>
  <tr>
    <td>per_device_train_batch_size</td>
    <td>The batch size per GPU core/CPU for training.</td>
    <td>int</td>
  </tr>
  <tr>
    <td>per_device_eval_batch_size</td>
    <td>The batch size per GPU core/CPU for evaluation.</td>
    <td>int</td>
  </tr>
  <tr>
    <td>auto_find_batch_size</td>
    <td>Whether to find a batch size that will fit into memory automatically through exponential decay, avoiding CUDA Out-of-Memory errors.</td>
    <td>bool</td>
  </tr>
  <tr>
    <td>fp16</td>
    <td>Whether to use fp16.</td>
    <td>bool</td>
  </tr>
  <tr>
    <td>bf16</td>
    <td>Whether to use bf16.</td>
    <td>bool</td>
  </tr>
  <tr>
    <td>gradient_accumulation_steps</td>
    <td>Number of update steps to accumulate the gradients for, before performing a backward/update pass.</td>
    <td>int</td>
  </tr>
  <tr>
    <td>gradient_checkpointing</td>
    <td>If True, use gradient checkpointing to save memory at the expense of slower backward pass.</td>
    <td>bool</td>
  </tr>
  <tr>
    <td>eval_accumulation_steps</td>
    <td>Number of predictions steps to accumulate the output tensors for, before moving the results to the CPU.</td>
    <td>int</td>
  </tr>
  <tr>
    <td>optim</td>
    <td>The optimizer to use. Some of the optimizers are: “adamw_hf”, “sgd”, “adamw_8bit”, “paged_adamw_32bit”, “paged_adamw_8bit”, “adagrad”, “rmsprop”, “rmsprop_bnb_32bit”. For the full list of optimizers available in Training Arguments see <a href="https://github.com/huggingface/transformers/blob/main/src/transformers/training_args.py">Optimizers Names<sup>2</sup></a>.</td>
    <td>str</td>
  </tr>
  <tr>
    <td>eval_strategy</td>
    <td>The evaluation strategy to adopt during training. Possible values are: "no": No evaluation is done during training. "steps": Evaluation is done (and logged) every eval_steps. "epoch": Evaluation is done at the end of each epoch.</td>
    <td>str</td>
  </tr>
  <tr>
    <td>eval_steps</td>
    <td>Number of update steps between two evaluations if eval_strategy="steps".</td>
    <td>int</td>
  </tr>
  <tr>
    <td>report_to</td>
    <td>The integration or list of integrations to report the results and logs to. Possible values include: "azure_ml", "codecarbon", "tensorboard", "wandb".</td>
    <td>str or list of str</td>
  </tr>
</table>

*Check [Training Arguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments) on Hugging Face <img src="../imgs/hf-logo.svg" alt="drawing" width="25"/> for a complete list of training arguments.  
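
As an illustration of how learning_rate, warmup_ratio, and a cosine lr_scheduler_type interact, the sketch below computes the learning rate at a given step: a linear ramp from 0 during warmup, then a half-cosine decay to 0. The function `lr_at_step` is a hypothetical helper written for this post, and it reflects my understanding of the usual cosine-with-warmup schedule rather than the exact library code.

```python
import math

def lr_at_step(step, total_steps, base_lr, warmup_ratio):
    """Learning rate under linear warmup followed by cosine decay."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, from 0 to 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Toy run of 100 steps (assumed value) with a 10% warmup
schedule = [
    lr_at_step(s, total_steps=100, base_lr=2e-5, warmup_ratio=0.1)
    for s in range(101)
]
print(schedule[0], schedule[5], schedule[10], schedule[100])
```

The peak learning rate is reached at the end of warmup and decays smoothly afterwards, so a larger warmup_ratio delays the peak and shortens the decay phase.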

In [None]:
training_args = SentenceTransformerTrainingArguments(
    output_dir="models/bge-base-en-trivia",
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    warmup_ratio=0.1,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    optim="adamw_torch_fused",
    tf32=True,
    bf16=True,
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_steps=10,
    save_total_limit=2,  # save only the last 2 models
    run_name="bge-base-en-trivia",  # Will be used in W&B if `wandb` is installed
)
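
The batch-related arguments above determine how many optimizer steps one epoch takes. The following back-of-the-envelope sketch assumes a single GPU and 9,000 training examples (the 10k dataset with 10% held out for testing, as done later in this post); `training_steps` is a hypothetical helper, and the rounding follows my understanding of how the Hugging Face Trainer counts steps.

```python
import math

def training_steps(num_examples, per_device_batch_size, grad_accum_steps,
                   num_epochs, num_devices=1):
    """Estimate batches and optimizer steps for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # One optimizer update happens every grad_accum_steps batches
    updates_per_epoch = max(1, batches_per_epoch // grad_accum_steps)
    return {
        "effective_batch_size": per_device_batch_size * num_devices * grad_accum_steps,
        "batches_per_epoch": batches_per_epoch,
        "optimizer_steps": math.ceil(num_epochs * updates_per_epoch),
    }

plan = training_steps(
    num_examples=9000,  # assumption: 90% of the 10k dataset
    per_device_batch_size=16,
    grad_accum_steps=4,
    num_epochs=1,
)
warmup_steps = math.ceil(plan["optimizer_steps"] * 0.1)  # warmup_ratio=0.1
print(plan, warmup_steps)
```

Under these assumptions the estimate is 140 optimizer steps, which agrees with the global_step reported in the training output later in this post. Note that gradient accumulation increases the effective batch size for the optimizer, but the in-batch negatives used by MultipleNegativesRankingLoss still come from each batch of 16, so every anchor sees 15 negatives.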

### Evaluator
An Evaluator can be used to assess the model’s performance with useful metrics before, during, or after training. During training, evaluators run according to the eval_strategy and eval_steps training arguments. Table 3 shows some of the evaluators available in Sentence Transformers.

<table border="1">
  <caption>Table 3: Evaluators in Sentence Transformers</caption>
  <tr>
    <th>Evaluator</th>
    <th>Required Data</th>
  </tr>
  <tr>
    <td>BinaryClassificationEvaluator</td>
    <td>Pairs with class labels</td>
  </tr>
  <tr>
    <td>EmbeddingSimilarityEvaluator</td>
    <td>Pairs with similarity scores</td>
  </tr>
  <tr>
    <td>InformationRetrievalEvaluator</td>
    <td>Queries (qid => question), Corpus (cid => document), and relevant documents (qid => set[cid])</td>
  </tr>
  <tr>
    <td>ParaphraseMiningEvaluator</td>
    <td>Mapping of IDs to sentences & pairs with IDs of duplicate sentences.</td>
  </tr>
  <tr>
    <td>RerankingEvaluator</td>
    <td>List of {'query': '...', 'positive': [...], 'negative': [...]} dictionaries.</td>
  </tr>
  <tr>
    <td>TripletEvaluator</td>
    <td>(anchor, positive, negative) triplets.</td>
  </tr>
</table>

Since our focus is on improving information retrieval performance, we can choose the InformationRetrievalEvaluator as our evaluator. Referring to the required data in Table 3, we need three dictionaries for this evaluator: 1) a queries dictionary, which maps query IDs to queries; 2) a corpus dictionary, which maps corpus IDs to documents; and 3) a relevant documents dictionary, which maps each query ID to the corpus IDs of its relevant documents. To build the queries dictionary, we use the (id, anchor) pairs from our test split. The (id, positive) pairs from the whole dataset form the corpus dictionary. Because each anchor has exactly one positive with the same id, the relevant documents dictionary maps each test query ID to that same ID in the corpus. This code snippet shows the process:

In [None]:
corpus_dataset = load_dataset("SepKeyPro/trivia-anchor-positive-10k", split="train")

split_dataset = corpus_dataset.train_test_split(test_size=0.1)
train_dataset = split_dataset["train"]
test_dataset = split_dataset["test"]

# Corpus: every positive passage in the full dataset, keyed by its id
corpus = dict(zip(corpus_dataset["id"], corpus_dataset["positive"]))

# Queries: the anchors of the held-out test split, keyed by the same ids
queries = dict(zip(test_dataset["id"], test_dataset["anchor"]))

# Each query has exactly one relevant document: the positive sharing its id.
# The evaluator expects a mapping of query id -> set of relevant corpus ids.
relevant_docs = {q_id: {q_id} for q_id in queries}


In [None]:
ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="trivia-anchor-positive-dev",
)

ir_evaluator(model)

{'trivia-anchor-positive-dev_cosine_accuracy@1': 0.686,
 'trivia-anchor-positive-dev_cosine_accuracy@3': 0.857,
 'trivia-anchor-positive-dev_cosine_accuracy@5': 0.9,
 'trivia-anchor-positive-dev_cosine_accuracy@10': 0.93,
 'trivia-anchor-positive-dev_cosine_precision@1': 0.686,
 'trivia-anchor-positive-dev_cosine_precision@3': 0.2856666666666666,
 'trivia-anchor-positive-dev_cosine_precision@5': 0.18,
 'trivia-anchor-positive-dev_cosine_precision@10': 0.09300000000000001,
 'trivia-anchor-positive-dev_cosine_recall@1': 0.686,
 'trivia-anchor-positive-dev_cosine_recall@3': 0.857,
 'trivia-anchor-positive-dev_cosine_recall@5': 0.9,
 'trivia-anchor-positive-dev_cosine_recall@10': 0.93,
 'trivia-anchor-positive-dev_cosine_ndcg@10': 0.8160966554498293,
 'trivia-anchor-positive-dev_cosine_mrr@10': 0.7786615079365087,
 'trivia-anchor-positive-dev_cosine_map@100': 0.7808572437466846,
 'trivia-anchor-positive-dev_dot_accuracy@1': 0.686,
 'trivia-anchor-positive-dev_dot_accuracy@3': 0.857,
 'triv
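
To help read these numbers, here is a simplified sketch of how accuracy@k, precision@k, recall@k, and MRR@k can be computed for a single query. The helper `ir_metrics` and the toy ranking are assumptions for illustration; the evaluator's own implementation averages such values over all queries and also reports NDCG and MAP.

```python
def ir_metrics(ranked_ids, relevant_ids, k):
    """Retrieval metrics for one query, given its ranked result ids."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]
    # Reciprocal rank of the first relevant document within the top k
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / rank
            break
    return {
        "accuracy": 1.0 if hits else 0.0,  # at least one relevant doc in top k
        "precision": len(hits) / k,
        "recall": len(hits) / len(relevant_ids),
        "mrr": reciprocal_rank,
    }

# Toy query: the single relevant document (id 3) is ranked second
toy = ir_metrics(ranked_ids=[7, 3, 9], relevant_ids={3}, k=3)
print(toy)
```

Since every query here has exactly one relevant document, accuracy@k equals recall@k and precision@k equals recall@k divided by k, which is the pattern visible in the output above (for example, 0.857 / 3 ≈ 0.2857).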

### Trainer
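The Trainer is where all the previous components come together. We pass it the model, the training arguments (optional), the training dataset, an evaluation dataset (optional), the loss function, and an evaluator (optional), and then start training. Here I keep only the anchor and positive columns of the training dataset so that the number of input columns matches what the loss expects. Since no evaluation dataset is passed, no validation loss is logged.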

In [None]:
trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset.select_columns(["anchor","positive"]),
    loss=loss,
    evaluator=ir_evaluator,
)
trainer.train()

<table border="1">
  <caption>Training log for epoch 0 (training loss: 0.0451, validation loss: no log). Metrics are reported under the name Trivia-anchor-positive-dev.</caption>
  <tr><th>Metric</th><th>Cosine</th><th>Dot</th></tr>
  <tr><td>Accuracy@1</td><td>0.672</td><td>0.672</td></tr>
  <tr><td>Accuracy@3</td><td>0.842</td><td>0.842</td></tr>
  <tr><td>Accuracy@5</td><td>0.877</td><td>0.877</td></tr>
  <tr><td>Accuracy@10</td><td>0.914</td><td>0.914</td></tr>
  <tr><td>Precision@1</td><td>0.672</td><td>0.672</td></tr>
  <tr><td>Precision@3</td><td>0.280667</td><td>0.280667</td></tr>
  <tr><td>Precision@5</td><td>0.1754</td><td>0.1754</td></tr>
  <tr><td>Precision@10</td><td>0.0914</td><td>0.0914</td></tr>
  <tr><td>Recall@1</td><td>0.672</td><td>0.672</td></tr>
  <tr><td>Recall@3</td><td>0.842</td><td>0.842</td></tr>
  <tr><td>Recall@5</td><td>0.877</td><td>0.877</td></tr>
  <tr><td>Recall@10</td><td>0.914</td><td>0.914</td></tr>
  <tr><td>NDCG@10</td><td>0.800503</td><td>0.800503</td></tr>
  <tr><td>MRR@10</td><td>0.763353</td><td>0.763353</td></tr>
  <tr><td>MAP@100</td><td>0.766189</td><td>0.766189</td></tr>
</table>


TrainOutput(global_step=140, training_loss=0.012902945463013436, metrics={'train_runtime': 199.3553, 'train_samples_per_second': 45.146, 'train_steps_per_second': 0.702, 'total_flos': 0.0, 'train_loss': 0.012902945463013436, 'epoch': 0.9946714031971581})

In [None]:
model.save_pretrained("models/bge-base-en-trivia/final")

In [None]:
model.push_to_hub("bge-base-en-trivia-anchor-positive", token="Your Token")


'https://huggingface.co/SepKeyPro/bge-base-en-trivia-anchor-positive/commit/316293dc07e19505049fd6acef5d71abd065c9f0'