<a href="https://colab.research.google.com/github/Add-Vishnu/Finetuning-Embedding-models/blob/main/FineTuneEmbeddingModel_With_Adapters_WorkFlow.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Finetuning the embedding model**

## 1) Generating Synthetic Data for Training and Evaluation

**i) Generating corpus**

-> You can load the documents and create nodes using LlamaIndex,

-> or, if your data already has fields such as name_of_document, page_no and content, you can build the corpus dictionary directly.

corpus = { node_id : node_content }

-> If the nodes were created with LlamaIndex, node_id is the node's own id. If you build the corpus from your own data, you have to create a custom node_id, e.g. a combination of name_of_document and page_no.

-> node_content is the node's text if you created the nodes with LlamaIndex. Otherwise it is the content you already have; make sure each piece of content really belongs to the document and page number encoded in its node_id.
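
As a minimal sketch of the second option, assuming your data is a list of records with `name_of_document`, `page_no` and `content` fields. The `build_corpus` helper and the sample records are hypothetical and only illustrate the shape of the dictionary:

```python
def build_corpus(records):
    """Map a custom node_id (document name + page number) to the page content."""
    corpus = {}
    for record in records:
        node_id = f"{record['name_of_document']}_{record['page_no']}"
        corpus[node_id] = record["content"]
    return corpus


# Hypothetical sample records; replace them with your own data.
example_records = [
    {"name_of_document": "handbook.pdf", "page_no": 1, "content": "Leave policy text"},
    {"name_of_document": "handbook.pdf", "page_no": 2, "content": "Travel policy text"},
]
example_corpus = build_corpus(example_records)
```

Combining the document name and page number keeps each node_id unique as long as no page appears twice in the records.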

In [None]:
# your own corpus
corpus = {}

**ii) Generating Synthetic queries**

-> You can use an LLM to generate questions for each text chunk in the corpus and build the `queries` and `relevant_docs` dictionaries.

-> The `queries` dictionary maps a query_id, which can be generated with `uuid`, to the question text.

**`queries[query_id] = question`**

-> The `relevant_docs` dictionary maps each `query_id` to the list of `node_ids` that answer it.

**`relevant_docs[query_id] = [node_id]`**
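
The two dictionaries could be built along these lines. This is only a sketch: `build_query_dicts` and `dummy_question` are hypothetical names, and `make_question` stands in for whatever LLM call you use to turn a chunk into a question.

```python
import uuid


def build_query_dicts(corpus, make_question):
    """Create one synthetic query per chunk.

    `make_question` is a hypothetical callable (for example a wrapper around
    an LLM call) that takes a text chunk and returns a question string.
    """
    queries = {}
    relevant_docs = {}
    for node_id, text in corpus.items():
        query_id = str(uuid.uuid4())
        queries[query_id] = make_question(text)
        relevant_docs[query_id] = [node_id]
    return queries, relevant_docs


# Stand-in generator so the sketch runs without an LLM.
def dummy_question(text):
    return f"What does this passage say? ({text[:20]})"


sample_corpus = {
    "doc_a_1": "Refunds are processed within 5 days.",
    "doc_a_2": "Orders ship from the central warehouse.",
}
sample_queries, sample_relevant_docs = build_query_dicts(sample_corpus, dummy_question)
```

Each query_id appears as a key in both dictionaries, which is what ties a question to its source chunk.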

**Note**:

If you already have questions together with their relevant documents (i.e. document name and page number), you can combine that information to create these dictionaries.

-> Split the data into training and validation sets, then write a function that builds these dictionaries and call it once per split (train, val).

`train_corpus`

`val_corpus`
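
One possible way to produce the two splits is a seeded shuffle of the node ids. The `split_corpus` helper, the 20% validation fraction and the seed are assumptions made for illustration, not values taken from this notebook:

```python
import random


def split_corpus(corpus, val_fraction=0.2, seed=42):
    """Shuffle node ids and split the corpus into train and validation parts."""
    node_ids = sorted(corpus)
    random.Random(seed).shuffle(node_ids)
    n_val = max(1, int(len(node_ids) * val_fraction))
    val_ids = set(node_ids[:n_val])
    train = {nid: corpus[nid] for nid in node_ids if nid not in val_ids}
    val = {nid: corpus[nid] for nid in node_ids if nid in val_ids}
    return train, val


# Toy corpus with ten chunks.
demo_corpus = {f"doc_{i}": f"text {i}" for i in range(10)}
demo_train, demo_val = split_corpus(demo_corpus)
```

Fixing the seed makes the split reproducible, so later evaluation runs compare models on the same validation chunks.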

In [None]:
train_corpus = {}  # replace with the training split of your corpus
val_corpus = {}  # replace with the validation split of your corpus

In [None]:
# Create a function that returns the queries and relevant_docs dictionaries
def generate_queries(faq_corpus):
    queries = {}
    relevant_docs = {}

    # Implement your logic here

    return queries, relevant_docs

In [None]:
train_queries, train_relevant_docs = generate_queries(train_corpus)
val_queries, val_relevant_docs = generate_queries(val_corpus)

**iii) Merge Data**:

-> Create the training and validation datasets from the data you have.

**`train_dataset = {
            'queries' : train_queries ,
            'corpus' : corpus,
            'relevant_docs' : train_relevant_docs,
    }`**


**`val_dataset = {
            'queries' : val_queries ,
            'corpus' : corpus,
            'relevant_docs' : val_relevant_docs,
    }`**

If the train and val splits come from the same data, both datasets can share the same corpus.
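
Before saving, it can help to check that the three dictionaries are consistent with each other. The `check_dataset` helper below is a hypothetical sanity check, not part of LlamaIndex:

```python
def check_dataset(dataset):
    """Return a list of problems found in a queries/corpus/relevant_docs dict."""
    problems = []
    for query_id in dataset["queries"]:
        if query_id not in dataset["relevant_docs"]:
            problems.append(f"query {query_id} has no relevant_docs entry")
    for query_id, node_ids in dataset["relevant_docs"].items():
        if query_id not in dataset["queries"]:
            problems.append(f"relevant_docs key {query_id} is not a known query")
        for node_id in node_ids:
            if node_id not in dataset["corpus"]:
                problems.append(f"query {query_id} points to unknown node {node_id}")
    return problems


# Toy datasets: the second one references a node that is not in the corpus.
good_dataset = {
    "queries": {"q1": "What is the refund window?"},
    "corpus": {"n1": "Refunds are processed within 5 days."},
    "relevant_docs": {"q1": ["n1"]},
}
bad_dataset = {
    "queries": {"q1": "What is the refund window?"},
    "corpus": {"n1": "Refunds are processed within 5 days."},
    "relevant_docs": {"q1": ["n2"]},
}
good_problems = check_dataset(good_dataset)
bad_problems = check_dataset(bad_dataset)
```

An empty list means every query has relevant documents and every referenced node_id exists in the corpus.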

In [None]:
train_dataset = {
    'queries': train_queries,
    'corpus': corpus,
    'relevant_docs': train_relevant_docs,
}

val_dataset = {
    'queries': val_queries,
    'corpus': corpus,
    'relevant_docs': val_relevant_docs,
}

# **2) Finetuning the embedding model using LlamaIndex**

**Note: Save `train_dataset` and `val_dataset` to JSON files first.**

In [None]:
import json

# Save train_dataset and val_dataset as JSON files
with open("train_dataset.json", "w") as train_file:
    json.dump(train_dataset, train_file)

with open("val_dataset.json", "w") as val_file:
    json.dump(val_dataset, val_file)

Load them using `EmbeddingQAFinetuneDataset`.

In [None]:
from llama_index.finetuning import EmbeddingQAFinetuneDataset


train_dataset_ll = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset_ll = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

Create a `SentenceTransformersFinetuneEngine` object and pass it the loaded training dataset, the model id, the model output path and the loaded validation dataset.

In [None]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

# Pass the EmbeddingQAFinetuneDataset objects, not the raw dictionaries
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset_ll,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset_ll,
)

Finetune the model by calling the `finetune` method.

In [None]:
finetune_engine.finetune()

### **Finetuning a linear adapter on top of the finetuned embedding model**


In [None]:
from llama_index.finetuning import EmbeddingAdapterFinetuneEngine
from llama_index.embeddings import resolve_embed_model
import torch

base_embed_model = resolve_embed_model("local:test_model")  # path to the finetuned model saved above


In [None]:
finetune_engine_adapter = EmbeddingAdapterFinetuneEngine(
    train_dataset_ll,
    base_embed_model,
    model_output_path="test_adapter_onFinetuned",
    # bias=True,
    epochs=4,
    verbose=False,
    optimizer_class=torch.optim.SGD,
    optimizer_params={"lr": 0.01}
)

In [None]:
finetune_engine_adapter.finetune()
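
To make the idea of a linear adapter concrete, here is a toy sketch in plain Python. It assumes the adapter is a single linear layer (weight matrix plus optional bias) applied to the query embedding only, while document embeddings stay as the base model produced them. The function names and all numbers are made up for illustration; this is not the LlamaIndex implementation.

```python
def apply_linear_adapter(query_embedding, weight, bias=None):
    """Return weight @ query_embedding (+ bias) using plain Python lists."""
    transformed = [
        sum(w * x for w, x in zip(row, query_embedding)) for row in weight
    ]
    if bias is not None:
        transformed = [t + b for t, b in zip(transformed, bias)]
    return transformed


def dot(a, b):
    """Dot-product similarity between two embeddings."""
    return sum(x * y for x, y in zip(a, b))


# Toy 2-dimensional embeddings with made-up numbers.
query = [1.0, 0.0]
doc_a = [0.0, 1.0]  # the document we want the query to match
doc_b = [1.0, 0.0]

identity = [[1.0, 0.0], [0.0, 1.0]]  # leaves the query unchanged
swap = [[0.0, 1.0], [1.0, 0.0]]  # moves the query toward doc_a

unchanged_query = apply_linear_adapter(query, identity)
adapted_query = apply_linear_adapter(query, swap)

before = (dot(unchanged_query, doc_a), dot(unchanged_query, doc_b))
after = (dot(adapted_query, doc_a), dot(adapted_query, doc_b))
```

With the identity matrix the query is closest to `doc_b`; after the transformation it is closest to `doc_a`, even though neither document embedding changed. Training the adapter amounts to learning such a matrix from the query and relevant-document pairs.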


# **3) Evaluate the finetuned model**

In [None]:
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

Function for computing the hit rate.

It takes a `dataset`, an embedding model and `top_k`, and returns a list of per-query evaluation results, each of the form

`eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_ids,
            "query": query_id,
        }`

In [None]:
def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    # Index every corpus chunk with the embedding model under test (no LLM needed)
    service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)
    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(nodes, service_context=service_context, show_progress=True)
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_ids = relevant_docs[query_id]

        # A hit means at least one expected node appears among the top_k results
        is_hit = len(set(retrieved_ids) & set(expected_ids)) > 0

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_ids,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results
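
The hit-rate logic itself can be checked without any index or model. The sketch below uses made-up retrieval results; `hit_rate` is a hypothetical helper that mirrors the set-intersection test used in `evaluate`:

```python
def hit_rate(retrieved_by_query, expected_by_query):
    """Fraction of queries with at least one expected id among the retrieved ids."""
    hits = [
        bool(set(retrieved_by_query[query_id]) & set(expected_by_query[query_id]))
        for query_id in expected_by_query
    ]
    return sum(hits) / len(hits)


# Made-up top-2 retrieval results for four queries.
toy_retrieved = {
    "q1": ["n1", "n2"],
    "q2": ["n3", "n4"],
    "q3": ["n5", "n6"],
    "q4": ["n7", "n8"],
}
toy_expected = {
    "q1": ["n2"],        # hit
    "q2": ["n9"],        # miss
    "q3": ["n5", "n0"],  # hit: one of two expected ids was retrieved
    "q4": ["n1"],        # miss
}
toy_hit_rate = hit_rate(toy_retrieved, toy_expected)  # 2 hits out of 4 queries
```

Because a query counts as a hit when any expected node is retrieved, the metric does not reward retrieving more than one relevant node per query.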

Function for evaluating with the `InformationRetrievalEvaluator` from sentence-transformers.


In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer


def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name=name)
    model = SentenceTransformer(model_id)
    return evaluator(model, output_path="results/")
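
Unlike the hit rate, the metrics reported by `InformationRetrievalEvaluator` include rank-aware ones such as MRR@k. As a rough illustration of what a rank-aware metric measures, here is a plain-Python sketch of mean reciprocal rank on made-up rankings. `mrr_at_k` is a hypothetical helper and is not how sentence-transformers implements the metric:

```python
def mrr_at_k(ranked_ids_by_query, relevant_ids_by_query, k=10):
    """Mean of 1/rank of the first relevant id within the top k (0 if none)."""
    total = 0.0
    for query_id, ranked_ids in ranked_ids_by_query.items():
        relevant = set(relevant_ids_by_query[query_id])
        for rank, node_id in enumerate(ranked_ids[:k], start=1):
            if node_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_by_query)


# Made-up rankings for three queries.
toy_rankings = {
    "q1": ["n1", "n2", "n3"],  # relevant at rank 1 -> 1.0
    "q2": ["n4", "n5", "n6"],  # relevant at rank 2 -> 0.5
    "q3": ["n7", "n8", "n9"],  # no relevant id    -> 0.0
}
toy_relevant = {"q1": ["n1"], "q2": ["n5"], "q3": ["n0"]}
toy_mrr = mrr_at_k(toy_rankings, toy_relevant, k=3)  # (1.0 + 0.5 + 0.0) / 3
```

Two models with the same hit rate can therefore differ on MRR if one of them places the relevant chunk higher in the ranking.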

## Evaluate the original model with the val_dataset.

Evaluation using hit rate for original bge model

In [None]:
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset_ll, bge)   # hit rate

In [None]:
# create a dataframe
df_bge = pd.DataFrame(bge_val_results)
# Hit rate = mean of is_hit over all validation queries
hit_rate_bge = df_bge['is_hit'].mean()
hit_rate_bge

Evaluation using InformationRetrievalEvaluator for original bge model

In [None]:
evaluate_st(val_dataset_ll, "BAAI/bge-small-en", name='bge')

## Evaluate the Finetuned model with the val_dataset.

Evaluation using hit rate for finetuned model

In [None]:
finetuned = "local:test_model"
val_results_finetuned = evaluate(val_dataset_ll, finetuned) # hit rate

In [None]:
df_finetuned = pd.DataFrame(val_results_finetuned)
hit_rate_finetuned = df_finetuned['is_hit'].mean()
hit_rate_finetuned

Evaluation using InformationRetrievalEvaluator for finetuned bge model

In [None]:
evaluate_st(val_dataset_ll, "test_model", name='finetuned')

## Evaluate the Finetuned adapter with the val_dataset.

Evaluation using hit rate for finetuned adapter

In [None]:
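# The adapter directory holds only the linear layer, so take the wrapped
# embedding model (base model + adapter) from the finetune engine
adapter = finetune_engine_adapter.get_finetuned_model()
val_results_adapter = evaluate(val_dataset_ll, adapter) # hit rate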

In [None]:
df_adapter = pd.DataFrame(val_results_adapter)
hit_rate_adapter = df_adapter['is_hit'].mean()
hit_rate_adapter

Evaluation using InformationRetrievalEvaluator for finetuned adapter on top of finetuned embedding model

In [None]:
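# Assumption to verify: SentenceTransformer() expects a full sentence-transformers
# model directory, so this call only works if "test_adapter_onFinetuned" contains one
evaluate_st(val_dataset_ll, "test_adapter_onFinetuned", name='adapter')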

# Summary of evaluation

Hit rate

In [None]:
df_bge['model'] = 'bge'
df_finetuned['model'] = 'fine_tuned'
df_adapter['model'] = 'fine_tuned_adapter'

In [None]:
df_all = pd.concat([df_bge, df_finetuned, df_adapter])
# Average only the is_hit column per model; the other columns hold lists and ids
df_all.groupby('model')['is_hit'].mean()

InformationRetrievalEvaluator

In [None]:
df_st_bge = pd.read_csv('results/Information-Retrieval_evaluation_bge_results.csv')
df_st_finetuned = pd.read_csv('results/Information-Retrieval_evaluation_finetuned_results.csv')
df_st_adapter = pd.read_csv('results/Information-Retrieval_evaluation_adapter_results.csv')

df_st_bge['model'] = 'bge'
df_st_finetuned['model'] = 'fine_tuned'
df_st_adapter['model'] = 'fine_tuned_adapter'
df_st_all = pd.concat([df_st_bge, df_st_finetuned, df_st_adapter])
df_st_all = df_st_all.set_index('model')
df_st_all
