<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Embeddings

In this notebook, we show users how to finetune their own embedding models.

We go through three main sections:
1. Preparing the data (our `generate_qa_embedding_pairs` function makes this easy)
2. Finetuning the model (using our `SentenceTransformersFinetuneEngine`)
3. Evaluating the model on a validation knowledge corpus

## Generate Corpus

First, we create the corpus of text chunks by using LlamaIndex to load some financial PDFs and to parse them into plain-text chunks.

In [1]:
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-finetuning
%pip install llama-index-readers-file
%pip install datasets
%pip install llama-index-embeddings-huggingface


In [2]:
import json

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

Download Data

In [None]:
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'

In [None]:
TRAIN_FILES = ["./data/10k/lyft_2021.pdf"]
VAL_FILES = ["./data/10k/uber_2021.pdf"]

TRAIN_CORPUS_FPATH = "./data/train_corpus.json"
VAL_CORPUS_FPATH = "./data/val_corpus.json"

In [None]:
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SentenceSplitter()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes

We do a very naive train/val split by having the Lyft corpus as the train dataset, and the Uber corpus as the val dataset.

In [None]:
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)

Loading files ['./data/10k/lyft_2021.pdf']
Loaded 238 docs
Parsed 344 nodes
Loading files ['./data/10k/uber_2021.pdf']
Loaded 307 docs
Parsed 410 nodes


### Generate synthetic queries

Now, we use an LLM (gpt-3.5-turbo) to generate questions using each text chunk in the corpus as context.

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
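As a rough illustration of what such a dataset contains, the sketch below builds by hand the three mappings that the evaluation code later in this notebook reads from the dataset (`queries`, `corpus`, `relevant_docs`). The IDs and texts are invented, plain dictionaries stand in for `EmbeddingQAFinetuneDataset`, and `iter_pairs` is a hypothetical helper rather than part of the library.

```python
# Hypothetical, hand-written stand-in for a generated QA dataset.
# Real datasets come from generate_qa_embedding_pairs; all IDs and texts here are invented.
corpus = {
    "node-1": "Revenue grew year over year, driven by higher ride volume.",
    "node-2": "Operating expenses include insurance and driver incentives.",
}
queries = {
    "q-1": "What drove revenue growth?",
    "q-2": "Which costs are part of operating expenses?",
}
# Each query ID maps to the list of chunk IDs that answer it.
relevant_docs = {"q-1": ["node-1"], "q-2": ["node-2"]}


def iter_pairs(queries, corpus, relevant_docs):
    """Yield one (question, chunk text) pair per relevant chunk."""
    for query_id, question in queries.items():
        for node_id in relevant_docs[query_id]:
            yield question, corpus[node_id]


pairs = list(iter_pairs(queries, corpus, relevant_docs))
for question, chunk in pairs:
    print(f"{question!r} -> {chunk[:40]!r}")
```

Each pair printed here corresponds to one datapoint: the question is the query side and the chunk is the document the model should retrieve for it.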

In [3]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

In [None]:
import os

OPENAI_API_KEY = "sk-"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

In [None]:
from llama_index.llms.openai import OpenAI


train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"),
    nodes=train_nodes,
    output_path="train_dataset.json",
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"),
    nodes=val_nodes,
    output_path="val_dataset.json",
)

100%|██████████| 344/344 [12:51<00:00,  2.24s/it]
100%|██████████| 410/410 [16:07<00:00,  2.36s/it]


In [4]:
# [Optional] Load previously saved datasets from disk (adjust the paths to your own files)
train_dataset = EmbeddingQAFinetuneDataset.from_json("qa_train_5k.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("qa_val_5k.json")

## Run Embedding Finetuning

In [5]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

In [17]:
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-zh-v1.5",
    model_output_path="test_model",
    val_dataset=val_dataset,
)


In [18]:
finetune_engine.finetune()


Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
50,No log,No log,0.8987,0.968039,0.978332,0.988082,0.8987,0.32268,0.195666,0.098808,0.8987,0.968039,0.978332,0.988082,0.948159,0.934867,0.935289
100,No log,No log,0.891658,0.967497,0.98104,0.984832,0.891658,0.322499,0.196208,0.098483,0.891658,0.967497,0.98104,0.984832,0.944195,0.930473,0.931202
150,No log,No log,0.884615,0.96533,0.978332,0.985915,0.884615,0.321777,0.195666,0.098592,0.884615,0.96533,0.978332,0.985915,0.941049,0.926003,0.926648
200,No log,No log,0.9052,0.970748,0.98104,0.988082,0.9052,0.323583,0.196208,0.098808,0.9052,0.970748,0.98104,0.988082,0.951446,0.939192,0.939811
250,No log,No log,0.896533,0.969664,0.97779,0.986457,0.896533,0.323221,0.195558,0.098646,0.896533,0.969664,0.97779,0.986457,0.946483,0.933116,0.933837
300,No log,No log,0.902492,0.971831,0.979957,0.988082,0.902492,0.323944,0.195991,0.098808,0.902492,0.971831,0.979957,0.988082,0.950267,0.937592,0.938204
350,No log,No log,0.904659,0.970748,0.980498,0.990249,0.904659,0.323583,0.1961,0.099025,0.904659,0.970748,0.980498,0.990249,0.951634,0.938806,0.939254
400,No log,No log,0.903575,0.972914,0.982124,0.989707,0.903575,0.324305,0.196425,0.098971,0.903575,0.972914,0.982124,0.989707,0.951674,0.938929,0.939392
450,No log,No log,0.905742,0.973456,0.981582,0.989707,0.905742,0.324485,0.196316,0.098971,0.905742,0.973456,0.981582,0.989707,0.952354,0.939892,0.940421
500,0.033300,No log,0.906284,0.97454,0.983207,0.991874,0.906284,0.324847,0.196641,0.099187,0.906284,0.97454,0.983207,0.991874,0.954122,0.941511,0.941864


In [19]:
embed_model = finetune_engine.get_finetuned_model()

In [20]:
embed_model

HuggingFaceEmbedding(model_name='test_model', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7841f40dad90>, num_workers=None, max_length=512, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

## Evaluate Finetuned Model

In this section, we evaluate 3 different embedding models:
1. open source `BAAI/bge-m3`,
2. open source `BAAI/bge-small-zh-v1.5` (the base model we finetuned), and
3. our finetuned embedding model.

We consider 2 evaluation approaches:
1. a simple custom **hit rate** metric
2. using `InformationRetrievalEvaluator` from sentence_transformers

We show that finetuning on a synthetic (LLM-generated) dataset significantly improves upon an open-source embedding model.

In [6]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

### Define eval function

**Option 1**: We use a simple **hit rate** metric for evaluation:
* for each (query, relevant_doc) pair,
* we retrieve top-k documents with the query, and
* it's a **hit** if the results contain the relevant_doc.

This approach is very simple and intuitive, and we can apply it to the proprietary OpenAI embedding as well as to our open source and fine-tuned embedding models.
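To make the metric concrete before wiring it to a retriever, here is a minimal sketch that computes the hit rate from hand-written rankings. The function name `hit_rate`, the document IDs, and the rankings are all made up for illustration; no embedding model is involved.

```python
def hit_rate(retrieved_by_query, expected_by_query, top_k=5):
    """Fraction of queries whose expected doc appears in the top-k retrieved IDs."""
    hits = 0
    for query_id, expected_id in expected_by_query.items():
        top_ids = retrieved_by_query[query_id][:top_k]
        hits += expected_id in top_ids  # True counts as 1, False as 0
    return hits / len(expected_by_query)


# Toy rankings (best match first); assumes exactly one relevant doc per query.
retrieved = {
    "q1": ["a", "b", "c"],
    "q2": ["c", "a", "b"],
    "q3": ["b", "c", "a"],
}
expected = {"q1": "a", "q2": "b", "q3": "d"}

rate_top2 = hit_rate(retrieved, expected, top_k=2)
rate_top3 = hit_rate(retrieved, expected, top_k=3)
print(rate_top2, rate_top3)
```

Raising `top_k` can only keep the hit rate the same or increase it, so hit rates are only comparable between models when the same `top_k` is used.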

In [7]:
def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results

**Option 2**: We use the `InformationRetrievalEvaluator` from sentence_transformers.

This provides a more comprehensive suite of metrics, but we can only run it against sentence-transformers-compatible models (the open source models and our finetuned model, *not* the OpenAI embedding model).

In [8]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path


def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(
        queries, corpus, relevant_docs, name=name
    )
    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)
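For intuition about the metric names the evaluator reports, the following is a simplified sketch of accuracy@k, precision@k, recall@k and MRR@k using their usual information-retrieval definitions. It is not the library's implementation; `ir_metrics_at_k` and the toy rankings are hypothetical.

```python
def ir_metrics_at_k(rankings, relevant, k):
    """Average accuracy, precision, recall and MRR over all queries at cutoff k."""
    n = len(rankings)
    acc = prec = rec = mrr = 0.0
    for query_id, ranked_ids in rankings.items():
        top = ranked_ids[:k]
        rel = relevant[query_id]
        n_hits = sum(1 for doc_id in top if doc_id in rel)
        acc += 1.0 if n_hits > 0 else 0.0  # at least one relevant doc in the top k
        prec += n_hits / k                 # share of the top k that is relevant
        rec += n_hits / len(rel)           # share of relevant docs that were found
        for rank, doc_id in enumerate(top, start=1):
            if doc_id in rel:
                mrr += 1.0 / rank          # reciprocal rank of the first relevant doc
                break
    return {
        "accuracy": acc / n,
        "precision": prec / n,
        "recall": rec / n,
        "mrr": mrr / n,
    }


# Toy data: one relevant doc ("a") per query, ranked 1st, 2nd and 3rd.
rankings = {"q1": ["a", "b", "c"], "q2": ["b", "a", "c"], "q3": ["c", "b", "a"]}
relevant = {"q1": {"a"}, "q2": {"a"}, "q3": {"a"}}

metrics_at_1 = ir_metrics_at_k(rankings, relevant, k=1)
metrics_at_3 = ir_metrics_at_k(rankings, relevant, k=3)
print(metrics_at_1)
print(metrics_at_3)
```

With a single relevant document per query, accuracy@k equals recall@k and precision@k equals accuracy@k divided by k. The numbers reported in this notebook follow that pattern, for example a precision@3 of about 0.316 next to an accuracy@3 of about 0.948.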

### Run Evals

### BAAI/bge-m3

In [9]:
bge_m3 = "local:BAAI/bge-m3"
bge_m3_val_results = evaluate(val_dataset, bge_m3)


In [10]:
df_bge_m3 = pd.DataFrame(bge_m3_val_results)
# Use a dedicated name so this value is not confused with the bge-small hit rate below
hit_rate_bge_m3 = df_bge_m3["is_hit"].mean()
hit_rate_bge_m3

0.9761646803900325

### BAAI/bge-small-zh-v1.5

In [12]:
bge = "local:BAAI/bge-small-zh-v1.5"
bge_val_results = evaluate(val_dataset, bge)


In [13]:
df_bge = pd.DataFrame(bge_val_results)

In [14]:
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge

0.9664138678223185

In [22]:
evaluate_st(val_dataset, "BAAI/bge-small-zh-v1.5", name="bge")

{'bge_cosine_accuracy@1': 0.8504875406283857,
 'bge_cosine_accuracy@3': 0.9479956663055255,
 'bge_cosine_accuracy@5': 0.9631635969664138,
 'bge_cosine_accuracy@10': 0.9777898158179849,
 'bge_cosine_precision@1': 0.8504875406283857,
 'bge_cosine_precision@3': 0.31599855543517513,
 'bge_cosine_precision@5': 0.19263271939328278,
 'bge_cosine_precision@10': 0.0977789815817985,
 'bge_cosine_recall@1': 0.8504875406283857,
 'bge_cosine_recall@3': 0.9479956663055255,
 'bge_cosine_recall@5': 0.9631635969664138,
 'bge_cosine_recall@10': 0.9777898158179849,
 'bge_cosine_ndcg@10': 0.9198788561595,
 'bge_cosine_mrr@10': 0.9006687561265023,
 'bge_cosine_map@100': 0.9014981288555106}

### Finetuned

In [21]:
finetuned = "local:test_model"
val_results_finetuned = evaluate(val_dataset, finetuned)


In [23]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [24]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

0.9880823401950163

In [25]:
evaluate_st(val_dataset, "test_model", name="finetuned")

{'finetuned_cosine_accuracy@1': 0.9160346695557963,
 'finetuned_cosine_accuracy@3': 0.9794149512459371,
 'finetuned_cosine_accuracy@5': 0.9880823401950163,
 'finetuned_cosine_accuracy@10': 0.9929577464788732,
 'finetuned_cosine_precision@1': 0.9160346695557963,
 'finetuned_cosine_precision@3': 0.32647165041531234,
 'finetuned_cosine_precision@5': 0.19761646803900326,
 'finetuned_cosine_precision@10': 0.09929577464788733,
 'finetuned_cosine_recall@1': 0.9160346695557963,
 'finetuned_cosine_recall@3': 0.9794149512459371,
 'finetuned_cosine_recall@5': 0.9880823401950163,
 'finetuned_cosine_recall@10': 0.9929577464788732,
 'finetuned_cosine_ndcg@10': 0.959032698909591,
 'finetuned_cosine_mrr@10': 0.9476055048238143,
 'finetuned_cosine_map@100': 0.9479040413204217}

### Summary of Results

#### Hit rate

In [35]:
df_bge["model"] = "bge-small-zh-v1.5"
df_finetuned["model"] = "fine_tuned bge-small-zh-v1.5"

We can see that fine-tuning our small open-source embedding model improves its retrieval quality: the hit rate rises from about 0.966 to about 0.988, which is even higher than the 0.976 we measured for the much larger `BAAI/bge-m3`.

In [36]:
df_all = pd.concat([df_bge, df_finetuned])
# Average only the boolean "is_hit" column per model
df_all.groupby("model")[["is_hit"]].mean()

model,is_hit
bge-small-zh-v1.5,0.966414
fine_tuned bge-small-zh-v1.5,0.988082


#### InformationRetrievalEvaluator

In [37]:
df_st_bge = pd.read_csv(
    "results/Information-Retrieval_evaluation_bge_results.csv"
)
df_st_finetuned = pd.read_csv(
    "results/Information-Retrieval_evaluation_finetuned_results.csv"
)

We can see that embedding finetuning improves results consistently across the suite of eval metrics.

In [39]:
df_st_bge["model"] = "bge"
df_st_finetuned["model"] = "fine_tuned"
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index("model")
df_st_all

model,epoch,steps,cosine-Accuracy@1,cosine-Accuracy@3,cosine-Accuracy@5,cosine-Accuracy@10,cosine-Precision@1,cosine-Recall@1,cosine-Precision@3,cosine-Recall@3,cosine-Precision@5,cosine-Recall@5,cosine-Precision@10,cosine-Recall@10,cosine-MRR@10,cosine-NDCG@10,cosine-MAP@100
bge,-1,-1,0.198808,0.315818,0.37974,0.459913,0.198808,0.198808,0.105273,0.315818,0.075948,0.37974,0.045991,0.459913,0.275749,0.319429,0.286431
bge,-1,-1,0.850488,0.947996,0.963164,0.97779,0.850488,0.850488,0.315999,0.947996,0.192633,0.963164,0.097779,0.97779,0.900669,0.919879,0.901498
fine_tuned,-1,-1,0.916035,0.979415,0.988082,0.992958,0.916035,0.916035,0.326472,0.979415,0.197616,0.988082,0.099296,0.992958,0.947606,0.959033,0.947904
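As a quick way to read the comparison, the sketch below computes the absolute gain per metric from a handful of the values reported in this notebook (the `bge` row that matches the `evaluate_st` output for `BAAI/bge-small-zh-v1.5`, and the `fine_tuned` row). The dictionary keys are shortened names chosen here for illustration.

```python
# Values copied from the evaluation results reported in this notebook.
baseline = {
    "accuracy@1": 0.850488,
    "accuracy@5": 0.963164,
    "mrr@10": 0.900669,
    "ndcg@10": 0.919879,
    "map@100": 0.901498,
}
finetuned = {
    "accuracy@1": 0.916035,
    "accuracy@5": 0.988082,
    "mrr@10": 0.947606,
    "ndcg@10": 0.959033,
    "map@100": 0.947904,
}

# Absolute improvement of the finetuned model over the base model, per metric.
deltas = {name: finetuned[name] - baseline[name] for name in baseline}

for name, delta in deltas.items():
    print(f"{name}: {baseline[name]:.4f} -> {finetuned[name]:.4f} ({delta:+.4f})")
```

The largest gains are on the rank-sensitive metrics (accuracy@1, MRR@10, MAP@100), which suggests that finetuning mainly moves the correct chunk closer to the top of the ranking.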


In [32]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [34]:
!zip -r results.zip results

  adding: results/ (stored 0%)
  adding: results/Information-Retrieval_evaluation_finetuned_results.csv (deflated 62%)
  adding: results/Information-Retrieval_evaluation_bge_results.csv (deflated 64%)
