<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Embeddings

In this notebook, we show users how to finetune their own embedding models.

We go through three main sections:
1. Preparing the data (our `generate_qa_embedding_pairs` function makes this easy)
2. Finetuning the model (using our `SentenceTransformersFinetuneEngine`)
3. Evaluating the model on a validation knowledge corpus

## Generate Corpus

First, we build the corpus of text chunks: we use LlamaIndex to load some financial PDFs, then parse and chunk them into plain-text chunks.
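
To make the chunking step concrete, here is a minimal, standard-library sketch that greedily packs whole sentences into chunks up to a character budget. It is not the algorithm used by `SentenceSplitter`; the function name `chunk_by_sentences` and the `max_chars` parameter are hypothetical and exist only for illustration.

```python
import re


def chunk_by_sentences(text, max_chars=200):
    """Greedily pack whole sentences into chunks of at most max_chars.

    Illustrative only: a single sentence longer than max_chars is kept
    whole rather than being split further.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [
        s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()
    ]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the budget: close the chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Revenue grew. Costs fell. Margins improved."
sample_chunks = chunk_by_sentences(sample, max_chars=30)
print(sample_chunks)
```

Each resulting chunk plays the role of one node in the corpus that the rest of the notebook works with.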

In [1]:
# %pip install llama-index-llms-openai
# %pip install llama-index-embeddings-openai
# %pip install llama-index-finetuning

In [2]:
import json
import os

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

Download Data

In [3]:
# !mkdir -p 'data/10k/'
# !wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
# !wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'

In [4]:


# EY_TRAINING_DATA_DIR = './EYData/train'
# EY_VALIDATION_DATA_DIR = './EYData/validate'

# # Get a list of all files in the directory
# train_files = os.listdir(EY_TRAINING_DATA_DIR)
# validate_files = os.listdir(EY_VALIDATION_DATA_DIR)

# # print(train_files)

# # Filter files to only include PDFs
# train_pdf_files = [f for f in train_files if f.endswith('.pdf')]
# validate_pdf_files = [f for f in validate_files if f.endswith('.pdf')]

# print(train_pdf_files)

# # Split files into training and validation sets
# TRAIN_FILES = [os.path.join(EY_TRAINING_DATA_DIR, f) for f in train_pdf_files[:len(train_pdf_files)]]
# VAL_FILES = [os.path.join(EY_VALIDATION_DATA_DIR, f) for f in validate_pdf_files[:len(validate_pdf_files)]]

# print(len(TRAIN_FILES))
# print(len(VAL_FILES))

In [5]:
# print(TRAIN_FILES)
# print(VAL_FILES)

In [6]:
# TRAIN_FILES = ["./data/10k/lyft_2021.pdf"]
# VAL_FILES = ["./data/10k/uber_2021.pdf"]

# TRAIN_CORPUS_FPATH = "./EYData/json/train_corpus.json"
# VAL_CORPUS_FPATH = "./EYData/json/val_corpus.json"

In [7]:
# def load_corpus(files, verbose=False):
#     if verbose:
#         print(f"Loading files {files}")

#     reader = SimpleDirectoryReader(input_files=files)
#     docs = reader.load_data()
#     if verbose:
#         print(f"Loaded {len(docs)} docs")

#     parser = SentenceSplitter()
#     nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

#     if verbose:
#         print(f"Parsed {len(nodes)} nodes")

#     return nodes

In [8]:
# %pip install -U llama-index-readers-file

We do a very naive train/val split by having the Lyft corpus as the train dataset, and the Uber corpus as the val dataset.

In [9]:
# train_nodes = load_corpus(TRAIN_FILES, verbose=True)
# val_nodes = load_corpus(VAL_FILES, verbose=True)

In [10]:
# train_nodes

### Generate synthetic queries

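Now, we use an LLM to generate questions, using each text chunk in the corpus as context. The cells below show two options: OpenAI's `gpt-4o`, or a locally hosted model served through an OpenAI-compatible API.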

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
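
The sketch below illustrates the shape of such a dataset using plain dictionaries: `queries`, `corpus`, and `relevant_docs`, the three fields that the evaluation code later in this notebook reads. The questions here are hard-coded stand-ins for LLM output, and the helper `build_qa_pairs` is a hypothetical name, not part of LlamaIndex.

```python
import uuid


def build_qa_pairs(chunks, questions_per_chunk):
    """Pair each chunk with its questions and return the three mappings.

    chunks: list of chunk texts.
    questions_per_chunk: list of lists; questions_per_chunk[i] holds the
        questions generated from chunks[i] (stand-ins for LLM output here).
    """
    queries, corpus, relevant_docs = {}, {}, {}
    for chunk_text, questions in zip(chunks, questions_per_chunk):
        node_id = str(uuid.uuid4())
        corpus[node_id] = chunk_text
        for question in questions:
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # Each question has exactly one relevant chunk: its source.
            relevant_docs[query_id] = [node_id]
    return {
        "queries": queries,
        "corpus": corpus,
        "relevant_docs": relevant_docs,
    }


toy_dataset = build_qa_pairs(
    chunks=["Chunk about revenue.", "Chunk about risk factors."],
    questions_per_chunk=[
        ["What drove revenue?", "How did revenue change?"],
        ["Which risks are listed?"],
    ],
)
print(len(toy_dataset["queries"]), "queries over", len(toy_dataset["corpus"]), "chunks")
```

Because every question maps back to the single chunk it was generated from, the later evaluation can treat that chunk as the one correct retrieval result.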

In [11]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

In [12]:
# import os
# import getpass

# if "OPENAI_API_KEY" not in os.environ:
#     os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API key: ")

In [13]:
# from llama_index.llms.openai import OpenAI


# train_dataset = generate_qa_embedding_pairs(
#     llm=OpenAI(model="gpt-4o"),
#     nodes=train_nodes,
#     output_path="train_dataset.json",
# )
# val_dataset = generate_qa_embedding_pairs(
#     llm=OpenAI(model="gpt-4o"),
#     nodes=val_nodes,
#     output_path="val_dataset.json",
# )

In [14]:
# from llama_index.llms.openai_like import OpenAILike

# # llm = OpenAILike(model="neuralmagic/Llama-3.2-3B-Instruct-FP8", api_base="http://localhost:8000/v1", api_key="NOKEY")
# llm = OpenAILike(model="neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8", api_base="http://43.230.201.125:60100/v1", api_key="NOKEY")
# # neuralmagic/Meta-Llama-3.1-8B-Instruct-FP8
# response = llm.complete("Hello World!")

# print(str(response))

In [15]:
# train_dataset = generate_qa_embedding_pairs(
#     llm=llm,
#     nodes=train_nodes,
#     output_path="train_dataset.json",
# )
# val_dataset = generate_qa_embedding_pairs(
#     llm=llm,
#     nodes=val_nodes,
#     output_path="val_dataset.json",
# )

In [16]:
# Load the previously generated datasets from disk
# (the generation cells above are commented out)
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

## Run Embedding Finetuning

In [17]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

In [18]:
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-m3",
    model_output_path="FineTuned_Model",
    val_dataset=val_dataset,
)

INFO:datasets:PyTorch version 2.5.1 available.
INFO:sentence_transformers.SentenceTransformer:Use pytorch device_name: cpu
INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-m3

In [19]:
# %pip install datasets

In [20]:
from datasets import load_dataset

In [21]:
# %pip install transformers[torch]

In [None]:
finetune_engine.finetune()

In [None]:
embed_model = finetune_engine.get_finetuned_model()

In [None]:
embed_model
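
The notebook does not state which objective the finetuning step optimizes. To our knowledge, `SentenceTransformersFinetuneEngine` defaults to an in-batch-negatives ranking loss from sentence-transformers, but treat that as an assumption and check the library source. For intuition only, here is a pure-Python sketch of that family of loss: each query should score higher against its own chunk than against the other chunks in the batch. The function names and the `scale=20.0` value are illustrative, not taken from the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_negatives_loss(query_vecs, chunk_vecs, scale=20.0):
    """Mean cross-entropy where chunk_vecs[i] is the positive for query_vecs[i].

    All other chunks in the batch act as negatives for that query.
    """
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cosine(query, chunk) for chunk in chunk_vecs]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(x - max_logit) for x in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_negatives_loss(toy_queries, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_negatives_loss(toy_queries, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned: {aligned_loss:.6f}, swapped: {swapped_loss:.6f}")
```

The loss is near zero when each query embedding points at its own chunk and large when the pairs are swapped, which is the signal that pulls (question, chunk) pairs together during training.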

## Evaluate Finetuned Model

In this section, we evaluate 3 different embedding models:
1. proprietary OpenAI embedding,
2. open source `BAAI/bge-small-en`, and
3. our finetuned embedding model (finetuned from `BAAI/bge-m3` above).

We consider 2 evaluation approaches:
1. a simple custom **hit rate** metric
2. using `InformationRetrievalEvaluator` from sentence_transformers

We show that finetuning on a synthetic (LLM-generated) dataset can significantly improve upon an open-source embedding model.

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

### Define eval function

**Option 1**: We use a simple **hit rate** metric for evaluation:
* for each (query, relevant_doc) pair,
* we retrieve top-k documents with the query, and
* it's a **hit** if the results contain the relevant_doc.

This approach is very simple and intuitive, and we can apply it to both the proprietary OpenAI embedding as well as our open source and fine-tuned embedding models.

In [None]:
def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results
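
As a small worked example of the metric itself, independent of any embedding model, the sketch below scores hand-written retrieval results and reduces them to a single hit rate. The helper names `score_retrievals` and `hit_rate` and the toy IDs are hypothetical; the result dictionaries mirror the ones built by `evaluate` above.

```python
def score_retrievals(retrieved_by_query, relevant_docs):
    """Build one result record per query, assuming one relevant doc each."""
    results = []
    for query_id, retrieved_ids in retrieved_by_query.items():
        expected_id = relevant_docs[query_id][0]
        results.append(
            {
                "is_hit": expected_id in retrieved_ids,
                "retrieved": retrieved_ids,
                "expected": expected_id,
                "query": query_id,
            }
        )
    return results


def hit_rate(eval_results):
    """Fraction of queries whose expected doc appears in the retrieved list."""
    if not eval_results:
        return 0.0
    return sum(1 for r in eval_results if r["is_hit"]) / len(eval_results)


toy_relevant = {"q1": ["doc_a"], "q2": ["doc_d"]}
toy_retrieved = {"q1": ["doc_a", "doc_b"], "q2": ["doc_c", "doc_b"]}
toy_results = score_retrievals(toy_retrieved, toy_relevant)
print(hit_rate(toy_results))
```

Here `q1` is a hit and `q2` is a miss, so the hit rate is one half. Note that hit rate ignores where in the top-k list the relevant document appears.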

**Option 2**: We use the `InformationRetrievalEvaluator` from sentence_transformers.

This provides a more comprehensive suite of metrics, but we can only run it against sentence-transformers-compatible models (the open-source model and our finetuned model, *not* the OpenAI embedding model).

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path


def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(
        queries, corpus, relevant_docs, name=name
    )
    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)
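
Rank-aware metrics such as mean reciprocal rank (MRR) are among those the evaluator reports. To show how such a metric differs from hit rate, here is a minimal pure-Python sketch of MRR@k over hand-written rankings. It is an illustration of the definition, not the evaluator's implementation, and the name `mrr_at_k` and the toy IDs are hypothetical.

```python
def mrr_at_k(ranked_ids_by_query, relevant_docs, k=10):
    """Mean reciprocal rank of the first relevant doc within the top k.

    A query contributes 0 if no relevant doc appears in its top k.
    """
    if not ranked_ids_by_query:
        return 0.0
    total = 0.0
    for query_id, ranked_ids in ranked_ids_by_query.items():
        relevant = set(relevant_docs[query_id])
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids_by_query)


toy_rankings = {
    "q1": ["doc_x", "doc_a"],  # relevant doc at rank 2
    "q2": ["doc_b", "doc_y"],  # relevant doc at rank 1
}
toy_relevant_docs = {"q1": ["doc_a"], "q2": ["doc_b"]}
toy_mrr = mrr_at_k(toy_rankings, toy_relevant_docs, k=5)
print(toy_mrr)
```

Both queries would count as hits at k=5, but MRR rewards `q2` more because its relevant document is ranked first.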

### Run Evals

#### OpenAI

Note: this might take a few minutes to run, since we have to embed the corpus and queries.

In [None]:
ada = OpenAIEmbedding()
ada_val_results = evaluate(val_dataset, ada)

In [None]:
df_ada = pd.DataFrame(ada_val_results)

In [None]:
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada

#### BAAI/bge-small-en

In [None]:
bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate(val_dataset, bge)

In [None]:
df_bge = pd.DataFrame(bge_val_results)

In [None]:
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge

In [None]:
evaluate_st(val_dataset, "BAAI/bge-small-en", name="bge")

#### Finetuned

In [None]:
# Must match the model_output_path passed to SentenceTransformersFinetuneEngine
finetuned = "local:FineTuned_Model"
val_results_finetuned = evaluate(val_dataset, finetuned)

In [None]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [None]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

In [None]:
evaluate_st(val_dataset, "FineTuned_Model", name="finetuned")

### Summary of Results

#### Hit rate

In [None]:
df_ada["model"] = "ada"
df_bge["model"] = "bge"
df_finetuned["model"] = "fine_tuned"

We can see that fine-tuning an open-source embedding model drastically improves its retrieval quality (even approaching the quality of the proprietary OpenAI embedding)!

In [None]:
df_all = pd.concat([df_ada, df_bge, df_finetuned])
# Average the boolean is_hit column per model to get each model's hit rate
df_all.groupby("model")["is_hit"].mean()

#### InformationRetrievalEvaluator

In [None]:
df_st_bge = pd.read_csv(
    "results/Information-Retrieval_evaluation_bge_results.csv"
)
df_st_finetuned = pd.read_csv(
    "results/Information-Retrieval_evaluation_finetuned_results.csv"
)

We can see that embedding finetuning improves results consistently across the suite of eval metrics.

In [None]:
df_st_bge["model"] = "bge"
df_st_finetuned["model"] = "fine_tuned"
df_st_all = pd.concat([df_st_bge, df_st_finetuned])
df_st_all = df_st_all.set_index("model")
df_st_all