<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Embeddings

In this notebook, we show users how to finetune their own embedding models.

We go through three main sections:
1. Preparing the data (our `generate_qa_embedding_pairs` function makes this easy)
2. Finetuning the model (using our `SentenceTransformersFinetuneEngine`)
3. Evaluating the model on a validation knowledge corpus

## Generate Corpus

First, we create the corpus of text chunks by using LlamaIndex to load a collection of Medium articles (exported from a CSV file to plain-text files) and parsing them into plain text chunks.

In [1]:
%pip install -q llama-index-llms-openai
%pip install -q llama-index-embeddings-openai
%pip install -q llama-index-finetuning
%pip install -q llama-index-readers-file
%pip install -q llama-index-llms-huggingface
%pip install -q llama-index-embeddings-huggingface


In [4]:
import json

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

### Download Data

In [5]:
!mkdir medium data

In [6]:
import pandas as pd

In [8]:
articles = pd.read_csv('./medium.csv').values

In [9]:
for n, i in enumerate(articles):
    # Write the first two CSV columns of each row to its own text file
    with open(f'./medium/{n}.txt', 'w') as f:
        f.write(i[0] + '\n')
        f.write(i[1])

In [10]:
TRAIN_CORPUS_FPATH = "./data/train_corpus.json"
VAL_CORPUS_FPATH = "./data/val_corpus.json"

In [11]:
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SentenceSplitter()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes
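
To make the chunking step more concrete, here is a minimal sketch of sentence-based chunking in plain Python. It is a simplified illustration only, not the algorithm `SentenceSplitter` uses: the helper `chunk_by_sentences` and its `max_words` parameter are hypothetical, and it counts words rather than tokens and applies no overlap.

```python
import re


def chunk_by_sentences(text, max_words=50):
    """Greedily pack whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own (oversized) chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

    chunks, current, count = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget
        if current and count + n_words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n_words

    if current:
        chunks.append(" ".join(current))
    return chunks


sample = "One two three. Four five six. Seven eight."
for chunk in chunk_by_sentences(sample, max_words=6):
    print(repr(chunk))
```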

We do a very naive train/val split: roughly 90% of the article files form the train dataset, and the remaining ~10% form the val dataset.
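
As a point of comparison, a shuffled split over a list of file paths could be sketched as follows. The helper `split_paths` is hypothetical and is not what the cells in this notebook run; it sorts the paths first because the order returned by `os.listdir` is not guaranteed, then shuffles with a fixed seed.

```python
import random


def split_paths(paths, train_frac=0.9, seed=42):
    """Shuffle file paths deterministically and split them into train/val lists."""
    rng = random.Random(seed)  # local RNG so the global random state is untouched
    shuffled = sorted(paths)   # sort first for a reproducible starting order
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    return shuffled[:n_train], shuffled[n_train:]


example_paths = [f"./medium/{i}.txt" for i in range(10)]
train_paths, val_paths = split_paths(example_paths)
print(len(train_paths), len(val_paths))
```

Every path lands in exactly one of the two lists, so no article appears in both the train and val sets.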

In [12]:
import os
ld = os.listdir('./medium') #list of articles
ld = ['./medium/'+i for i in ld] #list of full path articles


In [13]:
import numpy as np
import random
random.seed(42)  # Set a random seed for reproducibility

n_train = len(ld) // 10 * 9  # number of files assigned to the train split (~90%)
n_test = len(ld) - n_train  # the remaining ~10% go to the val split

# Take the first ~90% of the directory listing, in shuffled order
train_sampled_indices = random.sample(range(n_train), n_train)
train_articles = np.array(ld)[train_sampled_indices]
# Take the last ~10% of the directory listing, in shuffled order
test_sampled_indices = random.sample(range(n_test), n_test)
test_articles = np.array(ld)[(np.array(test_sampled_indices) + n_train).tolist()]

In [14]:
train_nodes = load_corpus(train_articles.tolist(), verbose=True)
val_nodes = load_corpus(test_articles.tolist(), verbose=True)

Loading files ['./medium/385.txt', './medium/66.txt', './medium/286.txt', './medium/11.txt', './medium/576.txt', './medium/300.txt', './medium/819.txt', './medium/92.txt', './medium/540.txt', './medium/723.txt', './medium/1328.txt', './medium/1125.txt', './medium/272.txt', './medium/1150.txt', './medium/502.txt', './medium/127.txt', './medium/198.txt', './medium/841.txt', './medium/524.txt', './medium/252.txt', './medium/1058.txt', './medium/822.txt', './medium/33.txt', './medium/1113.txt', './medium/850.txt', './medium/388.txt', './medium/48.txt', './medium/831.txt', './medium/1247.txt', './medium/1206.txt', './medium/1011.txt', './medium/106.txt', './medium/999.txt', './medium/1319.txt', './medium/480.txt', './medium/1026.txt', './medium/512.txt', './medium/1378.txt', './medium/312.txt', './medium/562.txt', './medium/1296.txt', './medium/305.txt', './medium/598.txt', './medium/793.txt', './medium/615.txt', './medium/802.txt', './medium/254.txt', './medium/763.txt', './medium/909.txt'

Parsing nodes:   0%|          | 0/1251 [00:00<?, ?it/s]

Parsed 2370 nodes
Loading files ['./medium/1181.txt', './medium/245.txt', './medium/447.txt', './medium/125.txt', './medium/827.txt', './medium/110.txt', './medium/1361.txt', './medium/362.txt', './medium/1248.txt', './medium/493.txt', './medium/1001.txt', './medium/1010.txt', './medium/1261.txt', './medium/62.txt', './medium/846.txt', './medium/446.txt', './medium/625.txt', './medium/444.txt', './medium/1110.txt', './medium/1325.txt', './medium/445.txt', './medium/1033.txt', './medium/1051.txt', './medium/1200.txt', './medium/1197.txt', './medium/591.txt', './medium/415.txt', './medium/53.txt', './medium/1323.txt', './medium/772.txt', './medium/372.txt', './medium/1235.txt', './medium/235.txt', './medium/592.txt', './medium/1284.txt', './medium/1111.txt', './medium/464.txt', './medium/1341.txt', './medium/905.txt', './medium/429.txt', './medium/505.txt', './medium/1085.txt', './medium/1103.txt', './medium/754.txt', './medium/93.txt', './medium/1104.txt', './medium/893.txt', './medium/

Parsing nodes:   0%|          | 0/140 [00:00<?, ?it/s]

Parsed 235 nodes


### Generate synthetic queries

Now, we use an LLM (gpt-3.5-turbo) to generate questions using each text chunk in the corpus as context.

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
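
The shape of such a dataset can be sketched in plain Python. The three keys below mirror the `queries`, `corpus`, and `relevant_docs` attributes that the evaluation code later in this notebook reads from the dataset. The helper `build_qa_pairs` is hypothetical, and hand-written questions stand in for LLM output.

```python
import uuid


def build_qa_pairs(chunks, questions_per_chunk):
    """Pair each question with the chunk it was generated from.

    chunks: {chunk_id: chunk_text}
    questions_per_chunk: {chunk_id: [question, ...]}
    """
    queries, relevant_docs = {}, {}
    for chunk_id, questions in questions_per_chunk.items():
        for question in questions:
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # Each query has exactly one relevant chunk: its source chunk
            relevant_docs[query_id] = [chunk_id]
    return {
        "queries": queries,
        "corpus": dict(chunks),
        "relevant_docs": relevant_docs,
    }


chunks = {"chunk-1": "Embeddings map text to vectors.", "chunk-2": "Fine-tuning adapts a model."}
questions = {
    "chunk-1": ["What do embeddings map text to?", "What is an embedding?"],
    "chunk-2": ["What does fine-tuning do?"],
}
dataset = build_qa_pairs(chunks, questions)
print(len(dataset["queries"]), "queries over", len(dataset["corpus"]), "chunks")
```

In the evaluation runs later in this notebook, the progress bars report 470 queries for the 235 validation chunks, i.e. two questions per chunk.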

In [15]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

In [16]:
import os

# Never commit a real API key to a notebook; substitute your own key here
OPENAI_API_TOKEN = "sk-..."
os.environ["OPENAI_API_KEY"] = OPENAI_API_TOKEN

In [17]:
from llama_index.llms.openai import OpenAI


train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=train_nodes
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-3.5-turbo"), nodes=val_nodes
)

train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")


100%|██████████| 2370/2370 [1:10:25<00:00,  1.78s/it]
100%|██████████| 235/235 [06:36<00:00,  1.69s/it]


In [19]:
# [Optional] Load
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

## Run Embedding Finetuning

In [20]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

In [21]:
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/90.8k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [22]:
finetune_engine.finetune()

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/475 [00:00<?, ?it/s]

Iteration:   0%|          | 0/475 [00:00<?, ?it/s]

In [25]:
embed_model = finetune_engine.get_finetuned_model()

## Evaluate Finetuned Model

In this section, we evaluate 3 different embedding models:
1. proprietary OpenAI embedding,
2. open source `BAAI/bge-small-en`, and
3. our finetuned embedding model.

We consider 2 evaluation approaches:
1. a simple custom **hit rate** metric
2. using `InformationRetrievalEvaluator` from sentence_transformers

We show that finetuning on a synthetic (LLM-generated) dataset improves upon an open-source embedding model on most retrieval metrics.

In [27]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

### Define eval function

**Option 1**: We use a simple **hit rate** metric for evaluation:
* for each (query, relevant_doc) pair,
* we retrieve top-k documents with the query,  and
* it's a **hit** if the results contain the relevant_doc.

This approach is very simple and intuitive, and we can apply it to both the proprietary OpenAI embedding as well as our open source and fine-tuned embedding models.
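
Stripped of the retrieval step, the metric itself is a few lines of Python. The sketch below assumes the retrieved IDs have already been computed for each query; `hit_rate` is a hypothetical helper, separate from the `evaluate` function used in this notebook.

```python
def hit_rate(retrieved_by_query, relevant_docs, top_k=5):
    """Fraction of queries whose expected doc appears in the top_k retrieved IDs.

    retrieved_by_query: {query_id: [doc_id, ...]} ranked best-first
    relevant_docs: {query_id: [doc_id]} with one relevant doc per query
    """
    if not retrieved_by_query:
        return 0.0
    hits = 0
    for query_id, retrieved_ids in retrieved_by_query.items():
        expected_id = relevant_docs[query_id][0]  # assume 1 relevant doc
        if expected_id in retrieved_ids[:top_k]:
            hits += 1
    return hits / len(retrieved_by_query)


retrieved = {"q1": ["a", "b"], "q2": ["c", "d"], "q3": ["e"]}
relevant = {"q1": ["b"], "q2": ["x"], "q3": ["e"]}
print(hit_rate(retrieved, relevant))  # q1 and q3 are hits, q2 is a miss
```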

In [28]:
def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results

**Option 2**: We use the `InformationRetrievalEvaluator` from sentence_transformers.

This provides a more comprehensive suite of metrics, but we can only run it against sentence-transformers-compatible models (the open source model and our finetuned model, *not* the OpenAI embedding model).
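
The results table at the end of this notebook includes columns such as Accuracy@k, Precision@k, and MRR@k. As a rough guide to what these mean, the sketch below implements the textbook per-query definitions for a single ranked list. It is an illustration under those standard definitions, not the library's own code, and the function names are hypothetical.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if at least one relevant doc appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Number of relevant docs in the top k, divided by k."""
    return sum(doc_id in relevant_ids for doc_id in ranked_ids[:k]) / k


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant doc within the top k, or 0.0 if none."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


ranked = ["d3", "d1", "d2"]  # retrieval results, best first
relevant = {"d1"}            # one relevant doc, as in this notebook's dataset

print(accuracy_at_k(ranked, relevant, k=1))
print(accuracy_at_k(ranked, relevant, k=3))
print(precision_at_k(ranked, relevant, k=3))
print(reciprocal_rank_at_k(ranked, relevant, k=10))
```

Averaging each per-query value over all queries gives the dataset-level metric (MRR@k is the mean of the reciprocal ranks). With exactly one relevant doc per query, Precision@k equals Accuracy@k divided by k.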

In [29]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path


def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(
        queries, corpus, relevant_docs, name=name
    )
    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)

### Run Evals

#### OpenAI

Note: this might take a few minutes to run, since we have to embed the corpus and queries.

In [30]:
ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)

Generating embeddings:   0%|          | 0/235 [00:00<?, ?it/s]

  0%|          | 0/470 [00:00<?, ?it/s]

In [31]:
df_ada = pd.DataFrame(ada_val_results)

In [32]:
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada

0.9829787234042553

#### BAAI/bge-small-en

In [33]:
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)

Generating embeddings:   0%|          | 0/235 [00:00<?, ?it/s]

  0%|          | 0/470 [00:00<?, ?it/s]

In [34]:
df_bge = pd.DataFrame(bge_val_results)

In [35]:
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge

0.9702127659574468

In [36]:
evaluate_st(val_dataset, "BAAI/bge-small-en", name="bge")

0.8592190712112324

#### Finetuned

In [37]:
finetuned = "local:test_model"
val_results_finetuned = evaluate(val_dataset, finetuned)

Generating embeddings:   0%|          | 0/235 [00:00<?, ?it/s]

  0%|          | 0/470 [00:00<?, ?it/s]

In [38]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [39]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

0.9702127659574468

In [40]:
evaluate_st(val_dataset, "test_model", name="finetuned")

0.8924530109370535

### Summary of Results

#### Hit rate

In [41]:
df_ada["model"] = "ada"
df_bge["model"] = "bge"
df_finetuned["model"] = "fine_tuned"

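In this run, the fine-tuned model's top-5 hit rate matches that of the base `BAAI/bge-small-en` model (0.970) and is close to that of the proprietary OpenAI embedding (0.983). The gain from fine-tuning shows up in the rank-sensitive metrics reported by `InformationRetrievalEvaluator` in the next subsection.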

In [42]:
df_all = pd.concat([df_ada, df_bge, df_finetuned])
df_all.groupby("model").mean("is_hit")

| model | is_hit |
|---|---|
| ada | 0.982979 |
| bge | 0.970213 |
| fine_tuned | 0.970213 |


#### InformationRetrievalEvaluator

In [43]:
df_st_bge = pd.read_csv(
    "results/Information-Retrieval_evaluation_bge_results.csv"
)
df_st_finetuned = pd.read_csv(
    "results/Information-Retrieval_evaluation_finetuned_results.csv"
)

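We can see that embedding finetuning improves most metrics in the suite (for example, Accuracy@1 rises from 0.777 to 0.828 and MAP@100 from 0.859 to 0.892), although the metrics at k=10 are marginally lower for the fine-tuned model.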

In [44]:
df_st_bge["model"] = "bge"
df_st_finetuned["model"] = "fine_tuned"
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index("model")
df_st_all

| model | epoch | steps | cos_sim-Accuracy@1 | cos_sim-Accuracy@3 | cos_sim-Accuracy@5 | cos_sim-Accuracy@10 | cos_sim-Precision@1 | cos_sim-Recall@1 | cos_sim-Precision@3 | cos_sim-Recall@3 | ... | dot_score-Recall@1 | dot_score-Precision@3 | dot_score-Recall@3 | dot_score-Precision@5 | dot_score-Recall@5 | dot_score-Precision@10 | dot_score-Recall@10 | dot_score-MRR@10 | dot_score-NDCG@10 | dot_score-MAP@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| bge | -1 | -1 | 0.776596 | 0.92766 | 0.961702 | 0.987234 | 0.776596 | 0.776596 | 0.30922 | 0.92766 | ... | 0.776596 | 0.30922 | 0.92766 | 0.19234 | 0.961702 | 0.098723 | 0.987234 | 0.85858 | 0.890512 | 0.859219 |
| fine_tuned | -1 | -1 | 0.82766 | 0.955319 | 0.970213 | 0.982979 | 0.82766 | 0.82766 | 0.31844 | 0.955319 | ... | 0.82766 | 0.31844 | 0.955319 | 0.194043 | 0.970213 | 0.098298 | 0.982979 | 0.891407 | 0.914416 | 0.892453 |
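
To see at a glance which metrics moved in which direction, the per-metric differences can be computed directly. The sketch below copies a subset of the values from the table above into plain dictionaries. The variable names are illustrative, and the same comparison could be done on the DataFrame itself.

```python
# Values copied from the results table above (subset of columns)
bge = {
    "cos_sim-Accuracy@1": 0.776596,
    "cos_sim-Accuracy@3": 0.92766,
    "cos_sim-Accuracy@5": 0.961702,
    "cos_sim-Accuracy@10": 0.987234,
    "dot_score-MRR@10": 0.85858,
    "dot_score-NDCG@10": 0.890512,
    "dot_score-MAP@100": 0.859219,
}
fine_tuned = {
    "cos_sim-Accuracy@1": 0.82766,
    "cos_sim-Accuracy@3": 0.955319,
    "cos_sim-Accuracy@5": 0.970213,
    "cos_sim-Accuracy@10": 0.982979,
    "dot_score-MRR@10": 0.891407,
    "dot_score-NDCG@10": 0.914416,
    "dot_score-MAP@100": 0.892453,
}

# Positive delta means the fine-tuned model scored higher
deltas = {metric: round(fine_tuned[metric] - bge[metric], 6) for metric in bge}
improved = [metric for metric, delta in deltas.items() if delta > 0]
regressed = [metric for metric, delta in deltas.items() if delta < 0]

for metric, delta in deltas.items():
    print(f"{metric:>22}: {delta:+.6f}")
print("improved:", improved)
print("regressed:", regressed)
```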


In [45]:
from huggingface_hub import notebook_login

notebook_login()
