## Fine-tune Embedding model for RAG 

In some of my previous projects with LLM/Gen AI, I often have to utlize embbeding models for RAG use case. It is a important component for RAG but they are often trained on general knowledge, which may hinder the performance for domain specific use case. By fine-tuning the embedding model we can boost the retrieval capability of RAG application. So in this notebook, I would like to explore how to fine tune embedding models with Sentence Transformers 3. 

Today, we would like to explore on fine-tuning an embeddingmodel for financial RAG application using synthetic dataset from SEC filling.

In [1]:
from datasets import load_dataset
 
# Load dataset from the hub
dataset = load_dataset("source_data/finanical-rag-embedding-dataset", split="train")
 
# rename columns
dataset = dataset.rename_column("question", "anchor")
dataset = dataset.rename_column("context", "positive")
 
# Add an id column to the dataset
dataset = dataset.add_column("id", range(len(dataset)))
 
# split dataset into a 10% test set
dataset = dataset.train_test_split(test_size=0.1)
 
# save datasets to disk
dataset["train"].to_json("training_data/train_dataset.json", orient="records")
dataset["test"].to_json("training_data/test_dataset.json", orient="records")

Creating json from Arrow format:   0%|          | 0/7 [00:00<?, ?ba/s]

Creating json from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

240104

### Create baseline and evaluate pretrained model

Today, we will use the BAAI/bge-base-en-v1.5 embedding model as our baseline and fine-tuned model. with only 109M parameters and a hidden dimension of 768 it achieves 63.55 on the MTEB Leaderboard.

We are going to use the InformationRetrievalEvaluator to evaluate the performance of our model on a given set of queries and corpus set. It will retrieve for each query the top-k most similar document. It measures Mean Reciprocal Rank (MRR), Recall@k, Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG).

For us the most important metric will be Normalized Discounted Cumulative Gain (NDCG) as it is a measure of ranking quality. It takes into account the position of the relevant document in the ranking and discounts it. The discounted value is logarithmic, which means that relevant documents are more important if they are ranked higher.

In [2]:
import torch
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import (
    InformationRetrievalEvaluator,
    SequentialEvaluator,
)
from sentence_transformers.util import cos_sim
from datasets import load_dataset, concatenate_datasets
 
model_id = "models/bge-base-en-v1.5"  # Hugging Face model ID
matryoshka_dimensions = [768, 512, 256, 128, 64] # Important: large to small
 
# Load a model
model = SentenceTransformer(
    model_id, device="cuda" if torch.cuda.is_available() else "cpu"
)
 
# load test dataset
test_dataset = load_dataset("json", data_files="training_data/test_dataset.json", split="train")
train_dataset = load_dataset("json", data_files="training_data/train_dataset.json", split="train")
corpus_dataset = concatenate_datasets([train_dataset, test_dataset])
 
# Convert the datasets to dictionaries
corpus = dict(
    zip(corpus_dataset["id"], corpus_dataset["positive"])
)  # Our corpus (cid => document)
queries = dict(
    zip(test_dataset["id"], test_dataset["anchor"])
)  # Our queries (qid => question)
 
# Create a mapping of relevant document (1 in our case) for each query
relevant_docs = {}  # Query ID to relevant documents (qid => set([relevant_cids])
for q_id in queries:
    relevant_docs[q_id] = [q_id]
 
 
matryoshka_evaluators = []
# Iterate over the different dimensions
for dim in matryoshka_dimensions:
    ir_evaluator = InformationRetrievalEvaluator(
        queries=queries,
        corpus=corpus,
        relevant_docs=relevant_docs,
        name=f"dim_{dim}",
        truncate_dim=dim,  # Truncate the embeddings to a certain dimension
        score_functions={"cosine": cos_sim},
    )
    matryoshka_evaluators.append(ir_evaluator)
 
# Create a sequential evaluator
evaluator = SequentialEvaluator(matryoshka_evaluators)


Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
binary_path: C:\Users\ms\.conda\envs\playground\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll
CUDA SETUP: Loading binary C:\Users\ms\.conda\envs\playground\lib\site-packages\bitsandbytes\cuda_setup\libbitsandbytes_cuda116.dll...


Generating train split: 0 examples [00:00, ? examples/s]

Generating train split: 0 examples [00:00, ? examples/s]

In [3]:
# Evaluate the model
results = evaluator(model)
 
# # COMMENT IN for full results
# print(results)
 
# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print
    print(f"{key}: {results[key]}")

dim_768_cosine_ndcg@10: 0.7409673938572789
dim_512_cosine_ndcg@10: 0.7402734381724596
dim_256_cosine_ndcg@10: 0.7254025778084061
dim_128_cosine_ndcg@10: 0.7051838302702277
dim_64_cosine_ndcg@10: 0.6375033504202162


Lets reload our model using SDPA or Flash Attention 2 as attn_implementation and define a model card.

In [4]:
from sentence_transformers import SentenceTransformerModelCardData, SentenceTransformer
 
# Hugging Face model ID: https://huggingface.co/BAAI/bge-base-en-v1.5
model_id = "models/bge-base-en-v1.5"
 
# load model with SDPA for using Flash Attention 2
model = SentenceTransformer(
    model_id,
    model_kwargs={"attn_implementation": "sdpa"},
    model_card_data=SentenceTransformerModelCardData(
        language="en",
        license="apache-2.0",
        model_name="BGE base Financial Matryoshka",
    ),
)

Once loaded our model we can initialize our loss function.

In [5]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss
 
matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

### Fine-tune embedding model with SentenceTransformersTrainer

We are now ready to fine-tune our model. We will use the SentenceTransformersTrainer a subclass of the Trainer from the transformers library, which supports all the same features, including logging, evaluation, and checkpointing.

In addition to this there is a SentenceTransformerTrainingArguments class that allows us to specify all the training parameters.

In [6]:
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers
 
# load train dataset again
train_dataset = load_dataset("json", data_files="training_data/train_dataset.json", split="train")
 
# define training arguments
args = SentenceTransformerTrainingArguments(
    output_dir="finetuned_emb", # output directory and hugging face model ID
    num_train_epochs=4,                         # number of epochs
    per_device_train_batch_size=32,             # train batch size
    gradient_accumulation_steps=16,             # for a global batch size of 512
    per_device_eval_batch_size=16,              # evaluation batch size
    warmup_ratio=0.1,                           # warmup ratio
    learning_rate=2e-5,                         # learning rate, 2e-5 is a good value
    lr_scheduler_type="cosine",                 # use constant learning rate scheduler
    optim="adamw_torch_fused",                  # use fused adamw optimizer
    tf32=True,                                  # use tf32 precision
    bf16=True,                                  # use bf16 precision
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
    eval_strategy="epoch",                      # evaluate after each epoch
    save_strategy="epoch",                      # save after each epoch
    logging_steps=10,                           # log every 10 steps
    save_total_limit=3,                         # save only the last 3 models
    load_best_model_at_end=True,                # load the best model when training ends
    metric_for_best_model="eval_dim_128_cosine_ndcg@10",  # Optimizing for the best ndcg@10 score for the 128 dimension
)

We now have every building block we need to create our SentenceTransformersTrainer to start then training our model.

Make sure your library is updated for : accelerate-0.27.2

https://github.com/huggingface/transformers/issues/29216

In [7]:
from sentence_transformers import SentenceTransformerTrainer
 
trainer = SentenceTransformerTrainer(
    model=model, # bg-base-en-v1
    args=args,  # training arguments
    train_dataset=train_dataset.select_columns(
        ["positive", "anchor"]
    ),  # training dataset
    loss=train_loss,
    evaluator=evaluator,
)

In [8]:
# start training, the model will be automatically saved to the hub and the output directory
trainer.train()
 
# save the best model
trainer.save_model()

Epoch,Training Loss,Validation Loss,Dim 768 Cosine Accuracy@1,Dim 768 Cosine Accuracy@3,Dim 768 Cosine Accuracy@5,Dim 768 Cosine Accuracy@10,Dim 768 Cosine Precision@1,Dim 768 Cosine Precision@3,Dim 768 Cosine Precision@5,Dim 768 Cosine Precision@10,Dim 768 Cosine Recall@1,Dim 768 Cosine Recall@3,Dim 768 Cosine Recall@5,Dim 768 Cosine Recall@10,Dim 768 Cosine Ndcg@10,Dim 768 Cosine Mrr@10,Dim 768 Cosine Map@100,Dim 512 Cosine Accuracy@1,Dim 512 Cosine Accuracy@3,Dim 512 Cosine Accuracy@5,Dim 512 Cosine Accuracy@10,Dim 512 Cosine Precision@1,Dim 512 Cosine Precision@3,Dim 512 Cosine Precision@5,Dim 512 Cosine Precision@10,Dim 512 Cosine Recall@1,Dim 512 Cosine Recall@3,Dim 512 Cosine Recall@5,Dim 512 Cosine Recall@10,Dim 512 Cosine Ndcg@10,Dim 512 Cosine Mrr@10,Dim 512 Cosine Map@100,Dim 256 Cosine Accuracy@1,Dim 256 Cosine Accuracy@3,Dim 256 Cosine Accuracy@5,Dim 256 Cosine Accuracy@10,Dim 256 Cosine Precision@1,Dim 256 Cosine Precision@3,Dim 256 Cosine Precision@5,Dim 256 Cosine Precision@10,Dim 256 Cosine Recall@1,Dim 256 Cosine Recall@3,Dim 256 Cosine Recall@5,Dim 256 Cosine Recall@10,Dim 256 Cosine Ndcg@10,Dim 256 Cosine Mrr@10,Dim 256 Cosine Map@100,Dim 128 Cosine Accuracy@1,Dim 128 Cosine Accuracy@3,Dim 128 Cosine Accuracy@5,Dim 128 Cosine Accuracy@10,Dim 128 Cosine Precision@1,Dim 128 Cosine Precision@3,Dim 128 Cosine Precision@5,Dim 128 Cosine Precision@10,Dim 128 Cosine Recall@1,Dim 128 Cosine Recall@3,Dim 128 Cosine Recall@5,Dim 128 Cosine Recall@10,Dim 128 Cosine Ndcg@10,Dim 128 Cosine Mrr@10,Dim 128 Cosine Map@100,Dim 64 Cosine Accuracy@1,Dim 64 Cosine Accuracy@3,Dim 64 Cosine Accuracy@5,Dim 64 Cosine Accuracy@10,Dim 64 Cosine Precision@1,Dim 64 Cosine Precision@3,Dim 64 Cosine Precision@5,Dim 64 Cosine Precision@10,Dim 64 Cosine Recall@1,Dim 64 Cosine Recall@3,Dim 64 Cosine Recall@5,Dim 64 Cosine Recall@10,Dim 64 Cosine Ndcg@10,Dim 64 Cosine Mrr@10,Dim 64 Cosine Map@100,Sequential Score
0,1.5364,No log,0.678571,0.808571,0.844286,0.892857,0.678571,0.269524,0.168857,0.089286,0.678571,0.808571,0.844286,0.892857,0.78606,0.751885,0.756029,0.675714,0.792857,0.85,0.891429,0.675714,0.264286,0.17,0.089143,0.675714,0.792857,0.85,0.891429,0.7832,0.748519,0.752719,0.67,0.798571,0.834286,0.877143,0.67,0.26619,0.166857,0.087714,0.67,0.798571,0.834286,0.877143,0.774768,0.741865,0.746703,0.655714,0.782857,0.812857,0.872857,0.655714,0.260952,0.162571,0.087286,0.655714,0.782857,0.812857,0.872857,0.762759,0.727854,0.732875,0.617143,0.742857,0.79,0.835714,0.617143,0.247619,0.158,0.083571,0.617143,0.742857,0.79,0.835714,0.726284,0.691228,0.696668,0.617143
1,0.6408,No log,0.698571,0.818571,0.851429,0.898571,0.698571,0.272857,0.170286,0.089857,0.698571,0.818571,0.851429,0.898571,0.798531,0.766567,0.770268,0.7,0.812857,0.851429,0.904286,0.7,0.270952,0.170286,0.090429,0.7,0.812857,0.851429,0.904286,0.800399,0.767409,0.770592,0.701429,0.808571,0.844286,0.884286,0.701429,0.269524,0.168857,0.088429,0.701429,0.808571,0.844286,0.884286,0.792541,0.763159,0.767439,0.682857,0.795714,0.835714,0.88,0.682857,0.265238,0.167143,0.088,0.682857,0.795714,0.835714,0.88,0.780539,0.748831,0.753381,0.654286,0.772857,0.805714,0.847143,0.654286,0.257619,0.161143,0.084714,0.654286,0.772857,0.805714,0.847143,0.751339,0.720615,0.726238,0.654286
2,0.4258,No log,0.708571,0.822857,0.857143,0.9,0.708571,0.274286,0.171429,0.09,0.708571,0.822857,0.857143,0.9,0.804594,0.774084,0.777673,0.714286,0.821429,0.858571,0.9,0.714286,0.27381,0.171714,0.09,0.714286,0.821429,0.858571,0.9,0.806049,0.776071,0.779668,0.712857,0.814286,0.855714,0.884286,0.712857,0.271429,0.171143,0.088429,0.712857,0.814286,0.855714,0.884286,0.798566,0.770941,0.775408,0.692857,0.801429,0.844286,0.88,0.692857,0.267143,0.168857,0.088,0.692857,0.801429,0.844286,0.88,0.785004,0.754659,0.759389,0.66,0.775714,0.812857,0.857143,0.66,0.258571,0.162571,0.085714,0.66,0.775714,0.812857,0.857143,0.758444,0.726935,0.732292,0.66
3,0.3719,No log,0.711429,0.824286,0.857143,0.9,0.711429,0.274762,0.171429,0.09,0.711429,0.824286,0.857143,0.9,0.805655,0.775499,0.779089,0.712857,0.82,0.857143,0.898571,0.712857,0.273333,0.171429,0.089857,0.712857,0.82,0.857143,0.898571,0.804871,0.77491,0.778638,0.714286,0.814286,0.855714,0.884286,0.714286,0.271429,0.171143,0.088429,0.714286,0.814286,0.855714,0.884286,0.799296,0.771901,0.776334,0.692857,0.801429,0.844286,0.88,0.692857,0.267143,0.168857,0.088,0.692857,0.801429,0.844286,0.88,0.785433,0.755187,0.75997,0.661429,0.777143,0.818571,0.857143,0.661429,0.259048,0.163714,0.085714,0.661429,0.777143,0.818571,0.857143,0.759557,0.728328,0.73377,0.661429


Computing widget examples:   0%|          | 0/5 [00:00<?, ?example/s]

Computing widget examples:   0%|          | 0/5 [00:00<?, ?example/s]

Computing widget examples:   0%|          | 0/5 [00:00<?, ?example/s]

Computing widget examples:   0%|          | 0/5 [00:00<?, ?example/s]

The training with Flash Attention (SDPA) for 4 epochs on 6.3k samples took less than 4 mins on a 3090. 

### Evaluate fine-tuned model against baseline

We evaluated our model during training, but we also want to evaluate it against our baseline at the end. We use the same InformationRetrievalEvaluator to evaluate the performance of our model on a given set of queries and corpus set.

In [10]:
from sentence_transformers import SentenceTransformer
 
fine_tuned_model = SentenceTransformer(
    args.output_dir, device="cuda" if torch.cuda.is_available() else "cpu"
)

# Evaluate the model
results = evaluator(fine_tuned_model)
 
# Print the main score
for dim in matryoshka_dimensions:
    key = f"dim_{dim}_cosine_ndcg@10"
    print(f"{key}: {results[key]}")

dim_768_cosine_ndcg@10: 0.804975867626354
dim_512_cosine_ndcg@10: 0.8051485851636322
dim_256_cosine_ndcg@10: 0.7983234053795578
dim_128_cosine_ndcg@10: 0.7847019509468028
dim_64_cosine_ndcg@10: 0.7596781274632667


### Conclusion

Embedding models are crucial for successfull RAG applications, since if you don't retrieve the right context you can't generate the right answer. Customizing embedding models for domain-specific data can improve retrieval performance significantly compared to using general knowledge models. Fine-tuning embedding models has become highly accessible, and using synthetic data generated by LLMs, one can easily customize models for specific needs, resulting in substantial improvements.

Our results show that fine-tuning can boost performance by ~7% with only 6.3k sample. The training took 3 minutes on a consumer size GPU and by leveraging modern techniques like Matryoshka Representation Learning we achieved over 99% performance retention with 6x storage reduction and efficiency gains.