In [1]:

%load_ext autoreload


%autoreload 2


# Finetune Embeddings

In this notebook, we show users how to finetune their own embedding models.

We go through three main sections:
1. Preparing the data (our `generate_qa_embedding_pairs` function makes this easy)
2. Finetuning the model (using our `SentenceTransformersFinetuneEngine`)
3. Evaluating the model on a validation knowledge corpus

## Generate Corpus

First, we create the corpus of text chunks by using LlamaIndex to load some financial PDFs and parse them into plain-text chunks.

In [2]:
import json
import pandas as pd

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

Download Data

In [3]:
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset


train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")
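
For orientation, here is a minimal sketch of the data layout this cell relies on. It assumes the JSON files hold the three mappings that the evaluation code later reads from the dataset (`queries`, `corpus`, `relevant_docs`); the IDs and texts below are made up, and the helper `build_pairs` is hypothetical, not part of LlamaIndex.

```python
import json
import tempfile
from pathlib import Path

# Toy stand-in for train_dataset.json; IDs and texts are invented.
toy_dataset = {
    "queries": {"q1": "Toy question about tenancy notice periods?"},
    "corpus": {"c1": "Toy chunk about tenancy notice periods."},
    "relevant_docs": {"q1": ["c1"]},
}


def build_pairs(data):
    """Return (query, chunk) pairs; raises KeyError on a dangling ID."""
    pairs = []
    for query_id, doc_ids in data["relevant_docs"].items():
        query = data["queries"][query_id]
        for doc_id in doc_ids:
            pairs.append((query, data["corpus"][doc_id]))
    return pairs


# Round-trip through a JSON file, as the notebook does with its datasets.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_dataset.json"
    path.write_text(json.dumps(toy_dataset))
    loaded = json.loads(path.read_text())

pairs = build_pairs(loaded)
print(len(pairs), "training pair(s)")
```

Each query points at the chunk(s) it was generated from, which is what makes the pairs usable as positive examples for finetuning and as ground truth for evaluation.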

In [4]:
'''from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="Alibaba-NLP/gte-large-en-v1.5",
    model_output_path="gte-large-en-v1.5-ukgov-finetuned",
    val_dataset=val_dataset,
    trust_remote_code=True
)'''


In [5]:
'''from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="sentence-transformers/all-mpnet-base-v2",
    model_output_path="all-mpnet-base-v2-ukgov-finetuned",
    val_dataset=val_dataset,
    trust_remote_code=True
)

finetune_engine.finetune()

embed_model = finetune_engine.get_finetuned_model()
embed_model
'''


In [6]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="sentence-transformers/multi-qa-mpnet-base-dot-v1",
    model_output_path="multi-qa-mpnet-base-dot-v1-ukgov-finetuned",
    val_dataset=val_dataset,
    trust_remote_code=True
)

finetune_engine.finetune()

embed_model = finetune_engine.get_finetuned_model()
embed_model


In [None]:
'''from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="Alibaba-NLP/gte-large-en-v1.5",
    model_output_path="gte-large-en-v1.5-ukgov-finetuned",
    val_dataset=val_dataset,
    trust_remote_code=True
)

finetune_engine.finetune()

embed_model = finetune_engine.get_finetuned_model()
embed_model'''


## Evaluate Finetuned Model

In this section, we evaluate 3 different embedding models:
1. proprietary OpenAI embedding,
2. open source `BAAI/bge-small-en`, and
3. our finetuned embedding model.

We consider 2 evaluation approaches:
1. a simple custom **hit rate** metric
2. the `InformationRetrievalEvaluator` from sentence_transformers

We show that finetuning on a synthetic (LLM-generated) dataset significantly improves upon an open-source embedding model.

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd


### Define eval function

**Option 1**: We use a simple **hit rate** metric for evaluation:
* for each (query, relevant_doc) pair,
* we retrieve top-k documents with the query, and
* it's a **hit** if the results contain the relevant_doc.

This approach is very simple and intuitive, and we can apply it to both the proprietary OpenAI embedding as well as our open source and fine-tuned embedding models.
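
As a minimal illustration of the metric itself, independent of any embedding model, the sketch below computes the hit rate from already-retrieved ID lists. The function name `hit_rate` and the toy IDs are hypothetical, and it assumes one relevant document per query, as the evaluation function in this notebook does.

```python
def hit_rate(retrieved_by_query, expected_by_query, top_k=5):
    """Fraction of queries whose expected doc ID is in the top-k results."""
    if not expected_by_query:
        return 0.0
    hits = 0
    for query_id, expected_id in expected_by_query.items():
        # Only the first top_k retrieved IDs count towards a hit.
        top_ids = retrieved_by_query.get(query_id, [])[:top_k]
        if expected_id in top_ids:
            hits += 1
    return hits / len(expected_by_query)


# Toy retrieval results, ordered best-first (made-up IDs).
retrieved = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c4", "c5", "c6"],
    "q3": ["c9", "c8", "c2"],
    "q4": ["c2", "c4", "c1"],
}
expected = {"q1": "c1", "q2": "c2", "q3": "c9", "q4": "c1"}

print(hit_rate(retrieved, expected, top_k=3))  # 0.75
print(hit_rate(retrieved, expected, top_k=1))  # 0.25
```

Note that the hit rate ignores where in the top-k the relevant document appears, which is why it can only grow as `top_k` increases.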

In [None]:
def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
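):
    """Index the corpus with `embed_model`, retrieve the top-k nodes for each
    query, and record whether the expected document was among them."""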
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results


**Option 2**: We use the `InformationRetrievalEvaluator` from sentence_transformers.

This provides a more comprehensive suite of metrics, but we can only run it against sentence-transformers-compatible models (the open-source model and our finetuned model, *not* the OpenAI embedding model).
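
To give a feel for what rank-aware metrics add over a plain hit rate, here is a small pure-Python sketch of two of the metric families the evaluator reports, MRR@k and recall@k. It is an illustration under simplified assumptions, not the library's implementation; the function names and toy IDs are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_ids, k):
    """1/rank of the first relevant doc within the top-k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Share of the relevant docs that appear within the top-k."""
    if not relevant_ids:
        return 0.0
    found = len(set(ranked_ids[:k]) & set(relevant_ids))
    return found / len(relevant_ids)


def mean_over_queries(metric, rankings, relevant, k):
    """Average a per-query metric over all queries with ground truth."""
    scores = [metric(rankings[q], relevant[q], k) for q in relevant]
    return sum(scores) / len(scores)


# Toy rankings, ordered best-first (made-up IDs).
rankings = {
    "q1": ["c3", "c1", "c7"],
    "q2": ["c4", "c5", "c6"],
    "q3": ["c9", "c8", "c2"],
}
relevant = {"q1": {"c1"}, "q2": {"c2"}, "q3": {"c9", "c2"}}

mrr_at_3 = mean_over_queries(reciprocal_rank, rankings, relevant, k=3)
recall_at_3 = mean_over_queries(recall_at_k, rankings, relevant, k=3)
print(f"MRR@3={mrr_at_3:.3f} recall@3={recall_at_3:.3f}")
```

Unlike the hit rate, MRR rewards placing the relevant document higher in the ranking, and recall handles queries with more than one relevant document.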

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path


def evaluate_st(
    dataset,
    model_id,
    name,
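):
    """Run sentence_transformers' InformationRetrievalEvaluator on `dataset`
    for the model at `model_id` and write the results under results/."""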
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(
        queries, corpus, relevant_docs, name=name
    )
    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)


### Run Evals

#### OpenAI

Note: this might take a few minutes to run, since we have to embed the corpus and the queries.

In [None]:
# use python dotenv to read the API key
import os
from dotenv import load_dotenv

load_dotenv()
openai_api_key = os.getenv("OPEN_AI_API_KEY")
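
Since `os.getenv` silently returns `None` when the variable is missing, it can help to fail early with a clear message instead of hitting an authentication error mid-evaluation. The helper below is a hypothetical sketch (`require_env` is not part of any library); the variable name `OPEN_AI_API_KEY` is the one used in the cell above.

```python
import os


def require_env(name, env=None):
    """Return the stripped value of an environment variable or raise."""
    env = os.environ if env is None else env
    value = (env.get(name) or "").strip()
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; "
            "add it to your .env file before running the OpenAI evals."
        )
    return value


# Demonstration with a fake environment so no real key is needed.
fake_env = {"OPEN_AI_API_KEY": " sk-test "}
print(require_env("OPEN_AI_API_KEY", fake_env))
```

In the notebook this would be called as `require_env("OPEN_AI_API_KEY")` after `load_dotenv()`.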


In [None]:
ada = OpenAIEmbedding(api_key=openai_api_key)
ada_val_results = evaluate(val_dataset, ada)


In [None]:
df_ada = pd.DataFrame(ada_val_results)


In [None]:
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada


#### BAAI/bge-small-en

In [None]:
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)


In [None]:
df_bge = pd.DataFrame(bge_val_results)


In [None]:
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge


In [None]:
evaluate_st(val_dataset, "BAAI/bge-small-en", name="bge")


#### Finetuned

In [None]:
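# load the model saved at the model_output_path used in the finetuning cell
finetuned = "local:multi-qa-mpnet-base-dot-v1-ukgov-finetuned"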
val_results_finetuned = evaluate(val_dataset, finetuned)


In [None]:
df_finetuned = pd.DataFrame(val_results_finetuned)


In [None]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned


In [None]:
evaluate_st(
    val_dataset, "multi-qa-mpnet-base-dot-v1-ukgov-finetuned", name="finetuned"
)


### Summary of Results

#### Hit rate

In [None]:
df_ada["model"] = "ada"
df_bge["model"] = "bge"
df_finetuned["model"] = "fine_tuned"


We can see that fine-tuning our small open-source embedding model drastically improves its retrieval quality (even approaching the quality of the proprietary OpenAI embedding)!

In [None]:
df_all = pd.concat([df_ada, df_bge, df_finetuned])
# average the boolean is_hit column per model to get each hit rate
df_all.groupby("model")["is_hit"].mean()


#### InformationRetrievalEvaluator

In [None]:
df_st_bge = pd.read_csv(
    "results/Information-Retrieval_evaluation_bge_results.csv"
)
df_st_finetuned = pd.read_csv(
    "results/Information-Retrieval_evaluation_finetuned_results.csv"
)


We can see that embedding finetuning improves results consistently across the suite of eval metrics.

In [None]:
df_st_bge["model"] = "bge"
df_st_finetuned["model"] = "fine_tuned"
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index("model")
df_st_all
