<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Embeddings

In this notebook, we show users how to finetune their own embedding models.

We go through three main sections:
1. Preparing the data (our `generate_qa_embedding_pairs` function makes this easy)
2. Finetuning the model (using our `SentenceTransformersFinetuneEngine`)
3. Evaluating the model on a validation knowledge corpus

## Generate Corpus

First, we create the corpus of text chunks by leveraging LlamaIndex to load some financial PDFs, and parsing/chunking into plain text chunks.

In [None]:
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-finetuning
%pip install -U llama-index-readers-file



In [None]:
!pip install pyarrow==15.0.2



In [None]:
import json

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

Set the paths to the training and validation PDFs (the files are expected to already exist under `./dataset/`).

In [None]:
TRAIN_FILES = ["./dataset/data1.pdf", "./dataset/data2.pdf"]
VAL_FILES = ["./dataset/test_data.pdf"]

TRAIN_CORPUS_FPATH = "./dataset/train_corpus.json"
VAL_CORPUS_FPATH = "./dataset/val_corpus.json"

In [None]:
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SentenceSplitter()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes
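
To make the chunking step in `load_corpus` more concrete, here is a minimal sketch of sentence-aware chunking. It is a simplified stand-in and not the actual `SentenceSplitter` algorithm: the function name `split_into_chunks`, the character-based `max_chars` limit, and the regex sentence boundary are all assumptions made for illustration.

```python
import re


def split_into_chunks(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    Hypothetical helper for illustration only; a single sentence longer than
    max_chars is kept whole rather than being cut.
    """
    # Naive sentence boundary: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Toy "page" of text (made up for this example).
page = (
    "Revenue grew in the first quarter. Costs were flat. "
    "The company expects similar results next quarter."
)
chunks = split_into_chunks(page, max_chars=60)
for chunk in chunks:
    print(repr(chunk))
```

Keeping sentences intact means each chunk stays readable on its own, which matters later when an LLM is asked to write questions about a single chunk.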

We do a very naive train/val split: the files in `TRAIN_FILES` form the train dataset, and the file in `VAL_FILES` forms the val dataset.

In [None]:
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)

Loading files ['./dataset/data1.pdf', './dataset/data2.pdf']
Loaded 137 docs


Parsing nodes:   0%|          | 0/137 [00:00<?, ?it/s]

Parsed 138 nodes
Loading files ['./dataset/test_data.pdf']
Loaded 21 docs


Parsing nodes:   0%|          | 0/21 [00:00<?, ?it/s]

Parsed 22 nodes


### Generate synthetic queries

Now, we use an LLM (gpt-4o-mini) to generate questions using each text chunk in the corpus as context.

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
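
As a rough illustration of what such a dataset looks like, the sketch below builds the three mappings that the evaluation code later reads (`queries`, `corpus`, `relevant_docs`) from plain dictionaries. The chunk texts and questions are made up, and the construction is an assumption about the general shape of the data, not the internals of `generate_qa_embedding_pairs`.

```python
import uuid

# Toy chunks standing in for the parsed nodes (hypothetical text).
corpus = {
    "node-1": "Revenue grew 12% year over year.",
    "node-2": "The company operates in 30 countries.",
}

# Hypothetical questions an LLM might generate for each chunk.
generated = {
    "node-1": ["How much did revenue grow?", "What was the yearly revenue change?"],
    "node-2": ["In how many countries does the company operate?"],
}

queries = {}        # query_id -> question text
relevant_docs = {}  # query_id -> list of ids of the chunks that answer it

for node_id, questions in generated.items():
    for question in questions:
        query_id = str(uuid.uuid4())
        queries[query_id] = question
        relevant_docs[query_id] = [node_id]

# Each (question, source chunk) pair is one datapoint.
pairs = [(queries[q], corpus[relevant_docs[q][0]]) for q in queries]
for question, chunk in pairs:
    print(f"{question!r} -> {chunk!r}")
```

Because every question is generated from exactly one chunk, each query has a single relevant document, which is the assumption the hit-rate evaluation relies on.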

In [None]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

In [None]:
import os

from google.colab import userdata

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [None]:
from llama_index.llms.openai import OpenAI


train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-4o-mini"), nodes=train_nodes
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-4o-mini"), nodes=val_nodes
)

train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")

138it [00:00, ?it/s]


Final dataset saved.


138it [00:00, ?it/s]

Final dataset saved.





In [None]:
# [Optional] Load
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

In [None]:
import pyarrow.lib
print('ListViewType' in dir(pyarrow.lib))


False


## Run Embedding Finetuning

In [None]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

In [None]:
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-large-en-v1.5",
    model_output_path="test_model",
    val_dataset=val_dataset,
)

In [None]:
finetune_engine.finetune()

Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100,Dot Accuracy@1,Dot Accuracy@3,Dot Accuracy@5,Dot Accuracy@10,Dot Precision@1,Dot Precision@3,Dot Precision@5,Dot Precision@10,Dot Recall@1,Dot Recall@3,Dot Recall@5,Dot Recall@10,Dot Ndcg@10,Dot Mrr@10,Dot Map@100
28,No log,No log,0.764493,0.916667,0.963768,0.981884,0.764493,0.305556,0.192754,0.098188,0.764493,0.916667,0.963768,0.981884,0.881455,0.848278,0.849657,0.764493,0.916667,0.963768,0.981884,0.764493,0.305556,0.192754,0.098188,0.764493,0.916667,0.963768,0.981884,0.881455,0.848278,0.849657
50,No log,No log,0.789855,0.942029,0.967391,1.0,0.789855,0.31401,0.193478,0.1,0.789855,0.942029,0.967391,1.0,0.901445,0.869052,0.869052,0.789855,0.942029,0.967391,1.0,0.789855,0.31401,0.193478,0.1,0.789855,0.942029,0.967391,1.0,0.901445,0.869052,0.869052
56,No log,No log,0.789855,0.942029,0.967391,1.0,0.789855,0.31401,0.193478,0.1,0.789855,0.942029,0.967391,1.0,0.901445,0.869052,0.869052,0.789855,0.942029,0.967391,1.0,0.789855,0.31401,0.193478,0.1,0.789855,0.942029,0.967391,1.0,0.901445,0.869052,0.869052


Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]

In [None]:
embed_model = finetune_engine.get_finetuned_model()

In [None]:
embed_model

HuggingFaceEmbedding(model_name='test_model', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x785cf2fcbf70>, num_workers=None, max_length=512, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

## Evaluate Finetuned Model

In this section, we evaluate 4 different embedding models:
1. proprietary OpenAI embedding,
2. open source `Snowflake/snowflake-arctic-embed-l`,
3. open source `mixedbread-ai/mxbai-embed-large-v1`, and
4. our finetuned embedding model.

We consider 2 evaluation approaches:
1. a simple custom **hit rate** metric
2. using `InformationRetrievalEvaluator` from sentence_transformers

We show that finetuning on a synthetic (LLM-generated) dataset significantly improves upon an open-source embedding model.

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

### Define eval function

**Option 1**: We use a simple **hit rate** metric for evaluation:
* for each (query, relevant_doc) pair,
* we retrieve top-k documents with the query, and
* it's a **hit** if the results contain the relevant_doc.

This approach is very simple and intuitive, and we can apply it to both the proprietary OpenAI embedding as well as our open source and fine-tuned embedding models.

In [None]:
def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results

**Option 2**: We use the `InformationRetrievalEvaluator` from sentence_transformers.

This provides a more comprehensive suite of metrics, but we can only run it against sentence-transformers-compatible models (the open-source models and our finetuned model, *not* the OpenAI embedding model).

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path


def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(
        queries, corpus, relevant_docs, name=name
    )
    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)
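
To help read the evaluator's output, the sketch below computes accuracy@k, precision@k, recall@k and the mean reciprocal rank on toy rankings in plain Python. It follows the usual textbook definitions as we understand them and is not the `InformationRetrievalEvaluator` source code; the helper names `ir_metrics` and `mean_metrics` and the toy rankings are hypothetical.

```python
def ir_metrics(ranked_ids, relevant_ids, k):
    """Accuracy@k, precision@k, recall@k and reciprocal rank@k for one query."""
    top_k = ranked_ids[:k]
    hits = [doc_id for doc_id in top_k if doc_id in relevant_ids]
    reciprocal_rank = 0.0
    for rank, doc_id in enumerate(top_k, start=1):
        if doc_id in relevant_ids:
            reciprocal_rank = 1.0 / rank  # rank of the first relevant result
            break
    return {
        "accuracy": 1.0 if hits else 0.0,         # same idea as the hit rate
        "precision": len(hits) / k,               # share of the top-k that is relevant
        "recall": len(hits) / len(relevant_ids),  # share of relevant docs retrieved
        "rr": reciprocal_rank,
    }


def mean_metrics(rankings, relevant, k):
    """Average the per-query metrics over all queries."""
    per_query = [ir_metrics(rankings[q], relevant[q], k) for q in rankings]
    return {
        name: sum(m[name] for m in per_query) / len(per_query)
        for name in per_query[0]
    }


# Toy rankings: every query has exactly one relevant document, "a".
rankings = {
    "q1": ["a", "b", "c"],  # relevant doc ranked first
    "q2": ["b", "a", "c"],  # relevant doc ranked second
    "q3": ["c", "b", "d"],  # relevant doc not retrieved
}
relevant = {"q1": {"a"}, "q2": {"a"}, "q3": {"a"}}

summary = mean_metrics(rankings, relevant, k=3)
print(summary)
```

With a single relevant document per query, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k. The reported numbers show the same pattern: precision@3 of about 0.314 is one third of accuracy@3 of about 0.942.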

### Run Evals

#### OpenAI

Note: this might take a few minutes to run, since we have to embed the corpus and the queries.

In [None]:
ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)

Generating embeddings:   0%|          | 0/138 [00:00<?, ?it/s]

  0%|          | 0/276 [00:00<?, ?it/s]

In [None]:
df_ada = pd.DataFrame(ada_val_results)

In [None]:
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada

0.8333333333333334

#### Snowflake Arctic Embed (popular open-source baseline)

In [None]:
Snow = "local:Snowflake/snowflake-arctic-embed-l"
Snow_val_results = evaluate(val_dataset, Snow)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/84.5k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

Generating embeddings:   0%|          | 0/138 [00:00<?, ?it/s]

  0%|          | 0/276 [00:00<?, ?it/s]

In [None]:
df_snow = pd.DataFrame(Snow_val_results)

In [None]:
hit_rate_snow = df_snow["is_hit"].mean()
hit_rate_snow

0.4384057971014493

#### mxbai-embed-large-v1 (strong open-source baseline)

In [None]:
# Off-the-shelf model (not finetuned); distinct names keep these results
# separate from those of the finetuned model evaluated later.
mxbai = "local:mixedbread-ai/mxbai-embed-large-v1"
val_results_mxbai = evaluate(val_dataset, mxbai)

modules.json:   0%|          | 0.00/229 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/171 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/114k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/677 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/670M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

Generating embeddings:   0%|          | 0/138 [00:00<?, ?it/s]

  0%|          | 0/276 [00:00<?, ?it/s]

In [None]:
df_mxbai = pd.DataFrame(val_results_mxbai)

In [None]:
hit_rate_mxbai = df_mxbai["is_hit"].mean()
hit_rate_mxbai

0.8333333333333334

#### Finetuned

In [None]:
finetuned = "local:test_model"
val_results_finetuned = evaluate(val_dataset, finetuned)

Generating embeddings:   0%|          | 0/138 [00:00<?, ?it/s]

  0%|          | 0/276 [00:00<?, ?it/s]

In [None]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [None]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

0.967391304347826

In [None]:
evaluate_st(val_dataset, "test_model", name="finetuned")

{'finetuned_cosine_accuracy@1': 0.7898550724637681,
 'finetuned_cosine_accuracy@3': 0.9420289855072463,
 'finetuned_cosine_accuracy@5': 0.967391304347826,
 'finetuned_cosine_accuracy@10': 1.0,
 'finetuned_cosine_precision@1': 0.7898550724637681,
 'finetuned_cosine_precision@3': 0.31400966183574874,
 'finetuned_cosine_precision@5': 0.19347826086956518,
 'finetuned_cosine_precision@10': 0.09999999999999998,
 'finetuned_cosine_recall@1': 0.7898550724637681,
 'finetuned_cosine_recall@3': 0.9420289855072463,
 'finetuned_cosine_recall@5': 0.967391304347826,
 'finetuned_cosine_recall@10': 1.0,
 'finetuned_cosine_ndcg@10': 0.9014446110917275,
 'finetuned_cosine_mrr@10': 0.8690519323671496,
 'finetuned_cosine_map@100': 0.8690519323671497,
 'finetuned_dot_accuracy@1': 0.7898550724637681,
 'finetuned_dot_accuracy@3': 0.9420289855072463,
 'finetuned_dot_accuracy@5': 0.967391304347826,
 'finetuned_dot_accuracy@10': 1.0,
 'finetuned_dot_precision@1': 0.7898550724637681,
 'finetuned_dot_precision@3':

### Summary of Results

#### Hit rate

In [None]:
df_ada["model"] = "openai"
df_snow["model"] = "snow"
df_finetuned["model"] = "fine_tuned"

We can see that our fine-tuned open-source embedding model achieves the highest hit rate on the validation set, surpassing even the proprietary OpenAI embedding!

In [None]:
df_all = pd.concat([df_ada, df_snow, df_finetuned])
df_all.groupby("model")[["is_hit"]].mean()

model,is_hit
fine_tuned,0.967391
openai,0.833333
snow,0.438406


In [None]:
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import login

from google.colab import userdata


# Log in to the Hugging Face Hub (token stored in Colab secrets)
login(token=userdata.get('HF_TOKEN'))

# Directory written by the finetune engine (`model_output_path` above)
model_dir = "test_model"

# Load the tokenizer and the finetuned transformer weights.
# Note: AutoModel loads only the underlying transformer; the
# sentence-transformers files in `model_dir` (e.g. the pooling config)
# are not part of what gets pushed to the Hub in the next cell.
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained(model_dir)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
from transformers import AutoModel, AutoTokenizer

# Push the model to Hugging Face
model.push_to_hub("CamiloGC93/bge-large-en-v1.5-etical")  # Use your Hugging Face username and a valid repo name
tokenizer.push_to_hub("CamiloGC93/bge-large-en-v1.5-etical")

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

No files have been modified since last commit. Skipping to prevent empty commit.


No files have been modified since last commit. Skipping to prevent empty commit.


CommitInfo(commit_url='https://huggingface.co/CamiloGC93/bge-large-en-v1.5-etical/commit/c73712dfb016b3eda07d05771b8f92ffa10490c5', commit_message='Upload tokenizer', commit_description='', oid='c73712dfb016b3eda07d05771b8f92ffa10490c5', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
from sentence_transformers import SentenceTransformer

In [None]:
model_from_huggingface = "local:CamiloGC93/bge-large-en-v1.5-etical"
ftmodel__val_results = evaluate(val_dataset, model_from_huggingface)



No sentence-transformers model found with name CamiloGC93/bge-large-en-v1.5-etical. Creating a new one with mean pooling.


config.json:   0%|          | 0.00/730 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.24k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

Generating embeddings:   0%|          | 0/138 [00:00<?, ?it/s]

  0%|          | 0/276 [00:00<?, ?it/s]

In [None]:
df_ft = pd.DataFrame(ftmodel__val_results)

In [None]:
hit_rate_finetuned = df_ft["is_hit"].mean()
hit_rate_finetuned

0.9601449275362319
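
The hit rate of the model reloaded from the Hub (0.960) is slightly lower than that of the local `test_model` (0.967). One plausible explanation, offered as an assumption and not verified here, is the warning printed above: the Hub repo contains no sentence-transformers configuration, so a new model with mean pooling is created, whereas the locally saved model may use a different pooling mode (BGE models are commonly configured with CLS pooling). The toy sketch below, with made-up token vectors, shows how the two pooling modes produce different sentence embeddings from the same token embeddings.

```python
def cls_pool(token_vectors):
    """Use the first token's vector (the [CLS] position) as the sentence embedding."""
    return list(token_vectors[0])


def mean_pool(token_vectors):
    """Average all token vectors dimension by dimension."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


# Made-up 2-dimensional token embeddings for a 3-token input.
token_vectors = [
    [1.0, 0.0],  # [CLS]
    [0.0, 1.0],
    [0.0, 3.0],
]

cls_embedding = cls_pool(token_vectors)
mean_embedding = mean_pool(token_vectors)
print("CLS pooling: ", cls_embedding)
print("Mean pooling:", mean_embedding)
```

Since the model was finetuned with one pooling mode, embedding with another can shift retrieval results slightly even though the transformer weights are identical.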