<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Embeddings

In this notebook, we show users how to finetune their own embedding models.

We go through three main sections:
1. Preparing the data (our `generate_qa_embedding_pairs` function makes this easy)
2. Finetuning the embedding model (using our `SentenceTransformersFinetuneEngine`)
3. Evaluating the embedding model on a validation knowledge corpus

<b>If you encounter any errors while running this notebook, run the code from the link below in Google Colab.</b>

https://docs.llamaindex.ai/en/stable/examples/finetuning/embeddings/finetune_embedding/


In [2]:
!pip install openai
!pip install llama_index
!pip install llama-index-finetuning
!pip install llama-index-embeddings-huggingface



In [3]:
# ## ------NOTE: Use this piece of code when you are running the code on your local machine##-------
# import os
# from dotenv import load_dotenv, find_dotenv
# load_dotenv('D:/Learning/Gen AI/Building production ready RAG systems using LlamaIndex/API Keys/.env')
# OPENAI_API_KEY = os.environ['OPENAI_API_KEY']

## ------NOTE: Use this piece of code when you are running the code on Google colab (Assign the API key in the secrets tab on the left)##-------
from google.colab import userdata
import openai
OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
openai.api_key = OPENAI_API_KEY
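
If the same notebook has to run both in Colab and on a local machine, the two options above can be folded into one helper. This is only a sketch: `load_api_key` is a hypothetical name, and it assumes that a local run already has the key exported as an environment variable (for example after calling `load_dotenv`).

```python
import os


def load_api_key(name="OPENAI_API_KEY"):
    """Return the key from Colab secrets when available, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata

        return userdata.get(name)
    except Exception:
        # Outside Colab the import fails; inside Colab a missing or
        # inaccessible secret also raises. Fall back to the environment.
        return os.environ.get(name)


# Hypothetical variable name and placeholder value, for illustration only.
os.environ["DEMO_API_KEY"] = "sk-demo"
demo_key = load_api_key("DEMO_API_KEY")
print(demo_key)
```

The helper returns `None` when the key is not set anywhere, so it is worth checking the result before assigning it to `openai.api_key`.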

## Generate Corpus

First, we create the corpus of text chunks by using LlamaIndex to load some financial PDFs and parse them into plain-text chunks.

In [4]:
import json

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.schema import MetadataMode

## Download Data

In [5]:
!mkdir -p 'data/10k/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/9607a05a923ddf07deee86a56d386b42943ce381/docs/docs/examples/data/10k/uber_2021.pdf' -O 'data/10k/uber_2021.pdf'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/9607a05a923ddf07deee86a56d386b42943ce381/docs/docs/examples/data/10k/lyft_2021.pdf' -O 'data/10k/lyft_2021.pdf'

--2024-05-24 11:00:13--  https://raw.githubusercontent.com/run-llama/llama_index/9607a05a923ddf07deee86a56d386b42943ce381/docs/docs/examples/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1880483 (1.8M) [application/octet-stream]
Saving to: ‘data/10k/uber_2021.pdf’


2024-05-24 11:00:14 (42.6 MB/s) - ‘data/10k/uber_2021.pdf’ saved [1880483/1880483]

--2024-05-24 11:00:14--  https://raw.githubusercontent.com/run-llama/llama_index/9607a05a923ddf07deee86a56d386b42943ce381/docs/docs/examples/data/10k/lyft_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTT

In [6]:
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes

We do a very naive train/val split: the Lyft corpus is the training dataset and the Uber corpus is the validation dataset.

In [7]:
TRAIN_FILES = ["./lyft_2021_short_version.pdf"]
VAL_FILES = ["./uber_2021_short_version.pdf"]

TRAIN_CORPUS_FPATH = "./train_corpus.json"
VAL_CORPUS_FPATH = "./val_corpus.json"

train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)

Loading files ['./lyft_2021_short_version.pdf']
Loaded 50 docs


Parsing nodes:   0%|          | 0/50 [00:00<?, ?it/s]

Parsed 91 nodes
Loading files ['./uber_2021_short_version.pdf']
Loaded 52 docs


Parsing nodes:   0%|          | 0/52 [00:00<?, ?it/s]

Parsed 98 nodes


### Generate synthetic queries

Now, we use an LLM (gpt-3.5-turbo) to generate questions using each text chunk in the corpus as context.

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
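
To make the shape of such a dataset concrete, here is a minimal sketch in plain Python. The three field names (`queries`, `corpus`, `relevant_docs`) are the ones the evaluation function in this notebook reads from the dataset. The helper `build_pairs`, the chunk texts, and the hand-written questions are hypothetical stand-ins for what the LLM would generate.

```python
import uuid


def build_pairs(chunks, questions_per_chunk):
    """Pair every question with the id of the chunk it was generated from.

    chunks: {node_id: chunk_text}
    questions_per_chunk: {node_id: [question, ...]}
    """
    queries = {}
    relevant_docs = {}
    for node_id, questions in questions_per_chunk.items():
        for question in questions:
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            # Each question has exactly one relevant chunk: its source.
            relevant_docs[query_id] = [node_id]
    return {
        "queries": queries,
        "corpus": dict(chunks),
        "relevant_docs": relevant_docs,
    }


# Toy inputs; in the notebook the questions come from the LLM.
chunks = {
    "node-1": "Revenue grew year over year.",
    "node-2": "The company faces regulatory risk.",
}
questions_per_chunk = {
    "node-1": ["How did revenue change?", "What happened to revenue?"],
    "node-2": ["What regulatory risks exist?", "Which risks are regulatory?"],
}
dataset = build_pairs(chunks, questions_per_chunk)
print(len(dataset["queries"]), "queries over", len(dataset["corpus"]), "chunks")
```

With two questions per chunk, the number of queries is twice the number of chunks, which matches the 98 chunks and 196 queries that appear in the validation runs later in this notebook.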

In [8]:
from llama_index.llms.openai import OpenAI
llm = OpenAI(model="gpt-3.5-turbo")

In [9]:
from llama_index.finetuning import generate_qa_embedding_pairs

train_dataset = generate_qa_embedding_pairs(train_nodes, llm=llm)
val_dataset = generate_qa_embedding_pairs(val_nodes, llm=llm)

100%|██████████| 91/91 [02:18<00:00,  1.52s/it]
100%|██████████| 98/98 [02:27<00:00,  1.51s/it]


In [10]:
train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")

In [11]:
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

In [12]:
list(train_dataset.queries.values())[1]

"Can you explain the significance of the Registrant's classification as a well-known seasoned issuer and how it impacts their reporting requirements under the Securities Exchange Act of 1934?"

## Run Embedding Finetuning

In [13]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-small-en",
    model_output_path="test_model",
    val_dataset=val_dataset,
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/90.8k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [14]:
finetune_engine.finetune()

Epoch:   0%|          | 0/2 [00:00<?, ?it/s]

Iteration:   0%|          | 0/19 [00:00<?, ?it/s]

Iteration:   0%|          | 0/19 [00:00<?, ?it/s]

In [38]:
finetuned_embed_model = finetune_engine.get_finetuned_model()

In [39]:
finetuned_embed_model

HuggingFaceEmbedding(model_name='test_model', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7d56c71b2800>, num_workers=None, max_length=512, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

In [40]:
finetuned

'local:test_model'

## Evaluate Finetuned Model

In this section, we evaluate 3 different embedding models:
1. proprietary OpenAI embedding,
2. open source `BAAI/bge-small-en`, and
3. our finetuned embedding model.

We evaluate the models using the **hit rate** metric.

We show that finetuning on a synthetic (LLM-generated) dataset improves upon the base open-source embedding model.

In [17]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

### Define eval function

**Option 1**: We use a simple **hit rate** metric for evaluation:
* for each (query, relevant_doc) pair,
* we retrieve top-k documents with the query, and
* it's a **hit** if the results contain the relevant_doc.

This approach is very simple and intuitive, and we can apply it to both the proprietary OpenAI embedding as well as our open source and fine-tuned embedding models.
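
The metric itself can be sketched without any retrieval library. The example below uses hypothetical query and document ids, and assumes one relevant document per query.

```python
def hit_rate(retrieved_by_query, relevant_docs):
    """Fraction of queries whose relevant doc appears in the retrieved top-k."""
    hits = [
        relevant_docs[query_id][0] in retrieved_ids
        for query_id, retrieved_ids in retrieved_by_query.items()
    ]
    return sum(hits) / len(hits)


# Toy top-2 retrieval results (ids are made up for illustration).
retrieved_by_query = {
    "q1": ["d3", "d1"],
    "q2": ["d2", "d5"],
    "q3": ["d4", "d6"],
    "q4": ["d7", "d8"],
}
relevant_docs = {
    "q1": ["d1"],  # hit
    "q2": ["d9"],  # miss
    "q3": ["d4"],  # hit
    "q4": ["d8"],  # hit
}
rate = hit_rate(retrieved_by_query, relevant_docs)
print(rate)  # 3 hits out of 4 queries -> 0.75
```

Note that hit rate ignores the rank of the relevant document within the top-k: a hit in first place counts the same as a hit in last place.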

In [19]:
def evaluate_embed_model(dataset, embed_model, top_k=5, verbose=False):
    """Return one hit/miss record per query in `dataset` for `embed_model`."""
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    # Index every corpus chunk with the embedding model under evaluation.
    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    vector_index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = vector_index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)

    return eval_results

### Run Evals

#### OpenAI

Note: this might take a few minutes to run, since we have to embed the corpus and queries.

In [20]:
embed_model_open_ai = OpenAIEmbedding(model='text-embedding-3-small')
val_results = evaluate_embed_model(val_dataset, embed_model_open_ai)

Generating embeddings:   0%|          | 0/98 [00:00<?, ?it/s]

  0%|          | 0/196 [00:00<?, ?it/s]

In [21]:
df_opanai = pd.DataFrame(val_results)

In [22]:
hit_rate = df_opanai["is_hit"].mean()
hit_rate

0.9132653061224489

#### BAAI/bge-small-en

In [23]:
embed_model_bge = "local:BAAI/bge-small-en"
bge_val_results = evaluate_embed_model(val_dataset, embed_model_bge)

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/90.8k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/52.0 [00:00<?, ?B/s]



config.json:   0%|          | 0.00/684 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/133M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/366 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/711k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/125 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Generating embeddings:   0%|          | 0/98 [00:00<?, ?it/s]

  0%|          | 0/196 [00:00<?, ?it/s]

In [24]:
df_bge = pd.DataFrame(bge_val_results)

In [25]:
hit_rate_bge = df_bge["is_hit"].mean()
hit_rate_bge

0.8520408163265306

#### Finetuned

In [41]:
val_results_finetuned = evaluate_embed_model(val_dataset, finetuned_embed_model)

Generating embeddings:   0%|          | 0/98 [00:00<?, ?it/s]

  0%|          | 0/196 [00:00<?, ?it/s]

In [43]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [44]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

0.8775510204081632

### Summary of Results

#### Hit rate

In [31]:
df_opanai["model"] = "text-embedding-3-small"
df_bge["model"] = "bge"
df_finetuned["model"] = "fine_tuned"

We can see that fine-tuning our small open-source embedding model improves its hit rate from 0.852 to 0.878, narrowing the gap to the proprietary OpenAI embedding (0.913).

In [35]:
df_all = pd.concat([df_opanai, df_bge, df_finetuned])
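# Select the column explicitly; the first positional argument of mean() is numeric_only, not a column name.
df_all.groupby("model")[["is_hit"]].mean()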

| model | is_hit |
|---|---|
| bge | 0.852041 |
| fine_tuned | 0.877551 |
| text-embedding-3-small | 0.913265 |
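
To put these hit rates in perspective, the short calculation below converts them back into hit counts and measures how much of the gap between the base model and the OpenAI model the fine-tuned model closes. It is a sketch that assumes all three rates were computed over the 196 validation queries shown in the progress bars above.

```python
NUM_QUERIES = 196  # validation queries, per the progress bars in this notebook

hit_rates = {
    "bge": 0.852041,
    "fine_tuned": 0.877551,
    "text-embedding-3-small": 0.913265,
}

# Recover the integer number of hits behind each rounded rate.
hits = {model: round(rate * NUM_QUERIES) for model, rate in hit_rates.items()}

extra_hits = hits["fine_tuned"] - hits["bge"]
gap = hits["text-embedding-3-small"] - hits["bge"]
gap_closed = extra_hits / gap

print(hits)
print(f"fine-tuning adds {extra_hits} hits and closes {gap_closed:.0%} of the gap")
```

Under that assumption, fine-tuning turns 5 additional queries into hits (167 to 172), against 179 for the OpenAI model, so it closes a little over 40% of the gap on this small validation set.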


In [36]:
def build_nodes(filepath):

    reader = SimpleDirectoryReader(input_files=[filepath])
    docs = reader.load_data()

    parser = SimpleNodeParser.from_defaults()
    nodes = parser.get_nodes_from_documents(docs, show_progress=True)

    return nodes

In [47]:
# Build nodes and indices for all three embedding models (openai, bge, finetuned_bge):
nodes = build_nodes("./uber_2021_short_version.pdf")

finetuned_embed_index = VectorStoreIndex(nodes, embed_model=finetuned_embed_model, show_progress=True)
base_embed_index = VectorStoreIndex(nodes, embed_model=embed_model_bge, show_progress=True)
openai_embed_index = VectorStoreIndex(nodes, embed_model=embed_model_open_ai, show_progress=True)

Parsing nodes:   0%|          | 0/52 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/98 [00:00<?, ?it/s]



Generating embeddings:   0%|          | 0/98 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/98 [00:00<?, ?it/s]

In [48]:
# Build query engines for all three embedding models (openai, bge, finetuned_bge):
finetuned_embed_qe = finetuned_embed_index.as_query_engine(similarity_top_k=2)
base_embed_qe = base_embed_index.as_query_engine(similarity_top_k=2)
openai_embed_qe = openai_embed_index.as_query_engine(similarity_top_k=2)

In [56]:
query = "what are risks related to uber?"
response1 = finetuned_embed_qe.query(query)
response2 = base_embed_qe.query(query)
response3 = openai_embed_qe.query(query)

In [57]:
print(response1)

The risks related to Uber include regulatory concerns regarding the use of cash for ridesharing, safety and security risks for drivers and riders, compliance risks with anti-money laundering laws, potential loss of credit card acceptance privileges, facing litigation related to claims by drivers, inherent dangers of operating motor vehicles, insurance coverage limitations, liability exposure leading to negative publicity and increased operating costs, and the uncertainty of realizing expected benefits from substantial investments in new offerings and technologies.


In [58]:
print(response2)

Risks related to Uber include regulatory concerns regarding the use of cash for ridesharing, safety and security risks for drivers and riders when cash is involved, potential reputational harm due to safety incidents, challenges in collecting service fees for cash-based trips, compliance risks with anti-money laundering laws, and the potential adverse effects of losing credit card acceptance privileges. Additionally, risks involve maintaining and enhancing brand reputation, addressing operational and cultural challenges, managing growth effectively, dealing with safety incidents that may harm user retention, making risky investments in new offerings and technologies, and facing negative impacts on operations in metropolitan areas due to various conditions like economic, social, weather, and regulatory factors.


In [59]:
print(response3)

The risks related to Uber include potential liabilities from traffic accidents, injuries, or incidents involving drivers on their platform, the need to post collateral for insurance claims impacting cash reserves, challenges with insurance coverage adequacy, potential regulatory actions related to insurance requirements and pricing regulations, as well as risks associated with maintaining a critical mass of drivers and platform users.


In [60]:
for node in response1.source_nodes:
    print("NODE")
    print(node.get_text())
    print("-----")

NODE
Gross  Bookings.  This  percentage  may  increase  in  the  future,  particularly  in  the  markets  in  which  Careem  operates.  The  use  of  cash  in  connection  with  ourtechnology
 raises numerous regulatory, operational, and safety concerns. For example, many jurisdictions have specific regulations regarding the use of cash forridesharing
 and certain jurisdictions prohibit the use of cash for ridesharing. Failure to comply with these regulations could result in the imposition of significantfines
 and penalties and could result in a regulator requiring that we suspend operations in those jurisdictions. In addition to these regulatory concerns, the use ofcash
 with our Mobility products and Delivery offering can increase safety and security risks for Drivers and riders, including potential robbery, assault, violent orfatal attacks, and other
 criminal acts. In certain jurisdictions such as Brazil, serious safety incidents resulting in robberies and violent, fatal attacks on

In [61]:
for node in response2.source_nodes:
    print("NODE")
    print(node.get_text())
    print("-----")

NODE
Gross  Bookings.  This  percentage  may  increase  in  the  future,  particularly  in  the  markets  in  which  Careem  operates.  The  use  of  cash  in  connection  with  ourtechnology
 raises numerous regulatory, operational, and safety concerns. For example, many jurisdictions have specific regulations regarding the use of cash forridesharing
 and certain jurisdictions prohibit the use of cash for ridesharing. Failure to comply with these regulations could result in the imposition of significantfines
 and penalties and could result in a regulator requiring that we suspend operations in those jurisdictions. In addition to these regulatory concerns, the use ofcash
 with our Mobility products and Delivery offering can increase safety and security risks for Drivers and riders, including potential robbery, assault, violent orfatal attacks, and other
 criminal acts. In certain jurisdictions such as Brazil, serious safety incidents resulting in robberies and violent, fatal attacks on

In [62]:
for node in response3.source_nodes:
    print("NODE")
    print(node.get_text())
    print("-----")

NODE
transfer a significant portion of the risk from the insurance provider to us or our captive insurance subsidiary, which could require us to pay out material amountsthat
 may be in excess of our insurance reserves, resulting in harm to our financial condition. Our insurance reserves account for unpaid losses and loss adjustmentexpenses
 for risks retained by us through our captive insurance subsidiary and other risk retention mechanisms. Such amounts are based on actuarial estimates,historical claim information, and industry data.
 While management believes that these reserve amounts are adequate, the ultimate liability could be in excess of ourreserves.
 We also have requirements to post collateral for current and future claim settlement obligations with certain of our insurance carriers, which may have asignificant impact on ou
r unrestricted cash and cash equivalents available for general business purposes.We
 may be subject to claims of significant liability based on traffic ac