# Fine-tuning

Reference: [Fine-Tuning Embeddings for RAG with Synthetic Data](https://medium.com/llamaindex-blog/fine-tuning-embeddings-for-rag-with-synthetic-data-e534409a3971)

Fine-tune an open-source `sentence-transformers` embedding model on our synthetically generated dataset.

## Load pretrained model

In [1]:
from tqdm import tqdm # needed
from sentence_transformers import SentenceTransformer
from sentence_transformers import losses
from sentence_transformers.evaluation import InformationRetrievalEvaluator

import json

from torch.utils.data import DataLoader
from sentence_transformers import InputExample

In [2]:
# To load the model from a folder:
# 1) Download the model: model = SentenceTransformer(model_name)
#    It saves the model in: /home/daniele/.cache/torch/sentence_transformers/
# 2) Save the model in a folder: model.save(path_to_model_folder)

# TOO BIG
# model = SentenceTransformer("aari1995/German_Semantic_STS_V2")
# model.save("model/sentence_transformers/aari1995_German_Semantic_STS_V2")
# model = SentenceTransformer('model/sentence_transformers/aari1995_German_Semantic_STS_V2')

# model = SentenceTransformer("PM-AI/bi-encoder_msmarco_bert-base_german")
# model.save("model/sentence_transformers/PM-AI_bi-encoder_msmarco_bert-base_german")
#
model = SentenceTransformer('model/sentence_transformers/PM-AI_bi-encoder_msmarco_bert-base_german')

In [3]:
model

SentenceTransformer(
  (0): Transformer({'max_seq_length': 350, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

## Dataloader

In [4]:
train_dataset_fpath = 'data_finetuning/one_file/train_val_data/train_dataset.json'
val_dataset_fpath = 'data_finetuning/one_file/train_val_data/val_dataset.json'

batch_size = 10 # This should typically be much larger than 10

In [5]:
with open(train_dataset_fpath, 'r', encoding='utf-8') as f:  # read-only is enough here
    train_dataset = json.load(f)

with open(val_dataset_fpath, 'r', encoding='utf-8') as f:  # read-only is enough here
    val_dataset = json.load(f)

In [6]:
train_dataset.keys()

dict_keys(['queries', 'corpus', 'relevant_docs'])
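
The following cells depend on how these three keys relate to each other: `queries` maps a query ID to its text, `corpus` maps a node ID to a passage, and `relevant_docs` maps a query ID to a list of relevant node IDs. As a rough illustration only, here is a toy dataset with the same layout; every ID and text in it is invented and not taken from the real files.

```python
# Toy data with the same layout as train_dataset (all IDs and texts are made up).
toy_dataset = {
    "queries": {
        "q1": "How do I request vacation days?",
        "q2": "Who approves travel expenses?",
    },
    "corpus": {
        "n1": "Vacation requests are submitted through the HR portal.",
        "n2": "Travel expenses are approved by the team lead.",
    },
    "relevant_docs": {
        "q1": ["n1"],
        "q2": ["n2"],
    },
}

# Pair every query with the text of its first relevant node.
toy_pairs = [
    (query, toy_dataset["corpus"][toy_dataset["relevant_docs"][query_id][0]])
    for query_id, query in toy_dataset["queries"].items()
]

for query, passage in toy_pairs:
    print(query, "->", passage)
```

Each resulting (query, passage) tuple corresponds to one positive training pair.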

In [7]:
dataset = train_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

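# Build one (query, relevant passage) training pair per query.
examples = []
for query_id, query in queries.items():
    # Each query lists its relevant node IDs; only the first one is used here.
    node_id = relevant_docs[query_id][0]
    text = corpus[node_id]
    example = InputExample(texts=[query, text])
    examples.append(example)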

In [8]:
len(examples)

309

In [9]:
loader = DataLoader(
    examples,
    batch_size=batch_size
)

## Loss

`MultipleNegativesRankingLoss` is a good choice when you only have positive pairs, that is, pairs of similar texts such as paraphrases, duplicate questions, (query, response) pairs, or (source_language, target_language) pairs.

It works well for training embeddings in retrieval setups with positive pairs such as (query, relevant_doc): for each query, the documents belonging to the other n-1 pairs in the same batch serve as negatives, so no explicit negative examples are needed.

Performance usually improves with larger batch sizes, because each query is then contrasted against more in-batch negatives.

For more details, see the [loss functions documentation](https://www.sbert.net/docs/package_reference/losses.html).
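
To make the in-batch negatives idea concrete, here is a minimal pure-Python sketch of the computation, not the library's implementation. It assumes cosine similarity multiplied by a scale of 20, which is my understanding of the library defaults; the function names and the 2-dimensional toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally sized vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i
    and all other docs in the batch act as negatives."""
    per_query = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over all docs in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        per_query.append(log_sum_exp - logits[i])
    return sum(per_query) / len(per_query)


# Toy batch of two pairs (assumed embeddings, for illustration only).
toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]   # each query matches its own doc
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]   # each query matches the other doc

aligned_loss = in_batch_negatives_loss(toy_queries, aligned_docs)
swapped_loss = in_batch_negatives_loss(toy_queries, swapped_docs)
print(aligned_loss, swapped_loss)
```

The aligned batch gives a loss close to 0 and the swapped batch a loss close to 20, which is the behavior the training objective rewards: each query should score its own document above every other document in the batch.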

In [10]:
loss = losses.MultipleNegativesRankingLoss(model)

## Evaluator 

We set up an evaluator on the validation split of the dataset to monitor how well the embedding model performs during training.

In [11]:
dataset = val_dataset

corpus = dataset['corpus']
queries = dataset['queries']
relevant_docs = dataset['relevant_docs']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)
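
`InformationRetrievalEvaluator` reports ranking metrics such as accuracy@k, recall@k and MRR@k. As a rough guide to what two of these numbers mean, here is a minimal pure-Python sketch on hand-written rankings. The helper names and toy IDs are hypothetical, and the library's own implementation differs in detail.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant doc appears in the top k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant doc within the top k, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical retrieval results: query ID -> doc IDs sorted by descending similarity.
toy_rankings = {"q1": ["d1", "d3", "d2"], "q2": ["d3", "d1", "d2"]}
toy_relevant = {"q1": ["d1"], "q2": ["d2"]}

top_k = 2
n_queries = len(toy_rankings)
acc = sum(accuracy_at_k(toy_rankings[q], toy_relevant[q], top_k) for q in toy_rankings) / n_queries
mrr = sum(reciprocal_rank_at_k(toy_rankings[q], toy_relevant[q], top_k) for q in toy_rankings) / n_queries
print(acc, mrr)  # 0.5 0.5
```

Here `q1` finds its relevant document at rank 1, while `q2` only finds it at rank 3, outside the top 2, so both averages come out at 0.5.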

## Training

The training loop is straightforward to set up thanks to the high-level model training API of `sentence-transformers`.
All we need to do is plug in the data loader, loss function, and evaluator defined in the previous cells, along with a few additional minor settings.

In [12]:
n_epochs = 2 # This should be higher for better performance.

In [13]:
warmup_steps = int(len(loader) * n_epochs * 0.1)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=n_epochs,
    warmup_steps=warmup_steps,
    output_path='mod_finetuned/bert_base_german_finetuned',
    show_progress_bar=True,
    evaluator=evaluator, 
    evaluation_steps=50,
)

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/31 [00:00<?, ?it/s]

Iteration:   0%|          | 0/31 [00:00<?, ?it/s]
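
As a sanity check on the numbers above: with 309 training examples and a batch size of 10, the loader yields 31 batches per epoch, which matches the `0/31` shown in the iteration progress bar. The warmup covers the first 10% of all training steps. A small sketch of that arithmetic, with the values copied from the earlier cells:

```python
import math

n_examples = 309  # len(examples)
batch_size = 10
n_epochs = 2

# DataLoader keeps the last partial batch by default, so the count rounds up.
steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * n_epochs
warmup_steps = int(total_steps * 0.1)

print(steps_per_epoch, total_steps, warmup_steps)  # 31 62 6
```

With such a small dataset the warmup lasts only 6 steps; more data or more epochs increases it proportionally.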