<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Embeddings

In this notebook, we show users how to finetune their own embedding models.

We go through three main sections:
1. Preparing the data (our `generate_qa_embedding_pairs` function makes this easy)
2. Finetuning the embedding model (using our `SentenceTransformersFinetuneEngine`)
3. Evaluating the embedding model on a validation knowledge corpus

<b>If you run into errors while running this notebook, run the code from the link below in Google Colab.</b>

https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/

In [None]:
!pip install openai
!pip install llama_index
!pip install llama-index-finetuning
!pip install llama-index-embeddings-huggingface

In [None]:
# NOTE: Use this block when running the code on your local machine.
# import os
# from dotenv import load_dotenv, find_dotenv
# load_dotenv('D:/Learning/Gen AI/Building production ready RAG systems using LlamaIndex/API Keys/.env')
# OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

# NOTE: Use this block when running the code on Google Colab (add the API key in the Secrets tab on the left).
from google.colab import userdata
import openai
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
openai.api_key = OPENAI_API_KEY

## Generate Corpus

First, we create the corpus of text chunks by using LlamaIndex to load some financial PDFs and parse them into plain-text chunks.
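
To make the idea of chunking concrete, here is a minimal sketch that splits text into overlapping windows of words. It is a simplified stand-in, not how `SimpleNodeParser` works internally; the helper name `chunk_text` and its `chunk_size` and `overlap` parameters are hypothetical and chosen only for illustration.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (hypothetical helper)."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Toy input: 100 placeholder words.
sample = " ".join(f"w{i}" for i in range(100))
chunks = chunk_text(sample, chunk_size=40, overlap=10)
print(len(chunks))
```

The overlap means neighbouring chunks share some words, so a sentence that straddles a boundary still appears whole in at least one chunk.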

In [None]:
import json

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.schema import MetadataMode

## Download Data

In [None]:
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/9607a05a923ddf07deee86a56d386b42943ce381/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/9607a05a923ddf07deee86a56d386b42943ce381/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'

In [None]:
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes

We do a very naive train/val split: the Lyft corpus is the training dataset, and the Uber corpus is the validation dataset.

In [None]:
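# NOTE: these short-version PDFs are not created by the download cell above.
# Either place them in the working directory, or point these paths at
# "data/10k/lyft_2021.pdf" and "data/10k/uber_2021.pdf" instead.
TRAIN_FILES = ["./lyft_2021_short_version.pdf"]
VAL_FILES = ["./uber_2021_short_version.pdf"]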

TRAIN_CORPUS_FPATH = "./train_corpus.json"
VAL_CORPUS_FPATH = "./val_corpus.json"

train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)

### Generate synthetic queries

Now, we use an LLM (gpt-3.5-turbo) to generate questions using each text chunk in the corpus as context.

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
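
As a rough illustration of what such a dataset holds, the sketch below builds the three mappings that the evaluation code later reads (`corpus`, `queries`, and `relevant_docs`) from toy data. The chunk texts, the questions, and the `node-…`/`query-…` ID scheme are made up for illustration; in the notebook the questions come from the LLM, and the IDs are assigned by the library.

```python
# Toy stand-ins for parsed text chunks (hypothetical content).
chunks = [
    "Lyft reported revenue growth in 2021.",
    "Driver supply is a key risk factor.",
]

# Hypothetical stand-in for LLM output: generated questions per chunk index.
generated = {
    0: ["How did Lyft's revenue change in 2021?"],
    1: ["What is a key risk factor?"],
}

corpus = {}         # node_id  -> chunk text
queries = {}        # query_id -> generated question
relevant_docs = {}  # query_id -> list of node_ids that answer it

for i, text in enumerate(chunks):
    node_id = f"node-{i}"
    corpus[node_id] = text
    for j, question in enumerate(generated[i]):
        query_id = f"query-{i}-{j}"
        queries[query_id] = question
        # Each question is paired with the chunk it was generated from.
        relevant_docs[query_id] = [node_id]

print(relevant_docs)
```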

In [None]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-3.5-turbo")

In [None]:
from llama_index.finetuning import generate_qa_embedding_pairs

train_dataset = generate_qa_embedding_pairs(train_nodes, llm=llm)
val_dataset = generate_qa_embedding_pairs(val_nodes, llm=llm)

In [None]:
train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")

In [None]:
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

In [None]:
list(train_dataset.queries.values())[1]

## Run Embedding Finetuning

In [None]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset,
)

In [None]:
finetune_engine.finetune()

In [None]:
finetuned_embed_model = finetune_engine.get_finetuned_model()

In [None]:
finetuned_embed_model

## Evaluate Finetuned Model

In this section, we evaluate 3 different embedding models:
1. proprietary OpenAI embedding,
2. open source `BAAI/bge-small-en`, and
3. our finetuned embedding model.

We evaluate the models using the **hit rate** metric.

We show that finetuning on a synthetic (LLM-generated) dataset significantly improves upon an open-source embedding model.

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

### Define eval function

We use a simple **hit rate** metric for evaluation:
* for each (query, relevant_doc) pair,
* we retrieve the top-k documents for the query, and
* it's a **hit** if the results contain the relevant_doc.

This approach is very simple and intuitive, and we can apply it to both the proprietary OpenAI embedding as well as our open source and fine-tuned embedding models.
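
The metric can be sketched without any embedding model by using hand-written toy vectors. In the sketch below, the vectors, the IDs, and the helper names (`cosine`, `top_k_ids`, `hit_rate`) are all hypothetical, and ranking by cosine similarity is an assumption made for illustration rather than a statement about what the vector index does internally.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_ids(query_vec, doc_vecs, k):
    """Return the IDs of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


def hit_rate(query_vecs, doc_vecs, relevant_docs, k):
    """Fraction of queries whose expected document appears in the top k."""
    hits = [
        relevant_docs[query_id][0] in top_k_ids(vec, doc_vecs, k)
        for query_id, vec in query_vecs.items()
    ]
    return sum(hits) / len(hits)


# Toy 2-D "embeddings" (made up for illustration).
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
query_vecs = {"q1": [0.9, 0.1], "q2": [0.1, 0.9]}
relevant_docs = {"q1": ["d1"], "q2": ["d3"]}

print(hit_rate(query_vecs, doc_vecs, relevant_docs, k=1))
print(hit_rate(query_vecs, doc_vecs, relevant_docs, k=2))
```

With these toy vectors, the expected document for `q2` ranks second, so it counts as a miss at `k=1` and a hit at `k=2`. This is why the reported hit rate depends on the chosen `top_k`.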

In [None]:
def evaluate_embed_model(dataset, embed_model, top_k=5, verbose=False):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    # Index every corpus chunk with the embedding model under test.
    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    vector_index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = vector_index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_results.append(
            {
                "is_hit": is_hit,
                "retrieved": retrieved_ids,
                "expected": expected_id,
                "query": query_id,
            }
        )

    return eval_results

### Run Evals

#### OpenAI

Note: this might take a few minutes to run, since we have to embed the corpus and queries.

In [None]:
embed_model_open_ai = OpenAIEmbedding(model='text-embedding-3-small')
val_results = evaluate_embed_model(val_dataset, embed_model_open_ai)

In [None]:
df_opanai = pd.DataFrame(val_results)

In [None]:
hit_rate = df_opanai["is_hit"].mean()
hit_rate

#### BAAI/bge-small-en

In [None]:
embed_model_bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate_embed_model(val_dataset, embed_model_bge)

In [None]:
df_bge = pd.DataFrame(bge_val_results)

In [None]:
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge

#### Finetuned

In [None]:
val_results_finetuned = evaluate_embed_model(val_dataset, finetuned_embed_model)

In [None]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [None]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

### Summary of Results

#### Hit rate

In [None]:
df_opanai["model"] = "text-embedding-3-small"
df_bge["model"] = "bge"
df_finetuned["model"] = "fine_tuned"

We can see that fine-tuning our small open-source embedding model improves its retrieval quality (even approaching the quality of the proprietary OpenAI embedding)!

In [None]:
df_all = pd.concat([df_opanai, df_bge, df_finetuned])
# Select the boolean "is_hit" column explicitly, then average it per model.
df_all.groupby("model")["is_hit"].mean()

In [None]:
def build_nodes(filepath):

    reader = SimpleDirectoryReader(input_files=[filepath])
    docs = reader.load_data()

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=True)

    return nodes

In [None]:
# Build nodes and indices for all three embedding models (openai, bge, finetuned_bge):
nodes = build_nodes("./uber_2021_short_version.pdf")

finetuned_embed_index = VectorStoreIndex(nodes, embed_model=finetuned_embed_model, show_progress=True)
base_embed_index = VectorStoreIndex(nodes, embed_model=embed_model_bge, show_progress=True)
openai_embed_index = VectorStoreIndex(nodes, embed_model=embed_model_open_ai, show_progress=True)

In [None]:
# Build query engines for all three embedding models (openai, bge, finetuned_bge):
finetuned_embed_qe = finetuned_embed_index.as_query_engine(similarity_top_k=2)
base_embed_qe = base_embed_index.as_query_engine(similarity_top_k=2)
openai_embed_qe = openai_embed_index.as_query_engine(similarity_top_k=2)

In [None]:
query = "what are risks related to uber?"
response1 = finetuned_embed_qe.query(query)
response2 = base_embed_qe.query(query)
response3 = openai_embed_qe.query(query)

In [None]:
print(response1)

In [None]:
print(response2)

In [None]:
print(response3)

In [None]:
for node in response1.source_nodes:
    print("NODE")
    print(node.get_text())
    print("-----")

In [None]:
for node in response2.source_nodes:
    print("NODE")
    print(node.get_text())
    print("-----")

In [None]:
for node in response3.source_nodes:
    print("NODE")
    print(node.get_text())
    print("-----")