<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/finetuning/embeddings/finetune_embedding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Finetune Embeddings

In this notebook, we finetune our own embedding model.


## Generate Corpus

First, we create the corpus of text chunks by leveraging LlamaIndex to load some financial PDFs, and parsing/chunking into plain text chunks.

In [1]:
%pip install llama-index-llms-openai
%pip install llama-index-embeddings-openai
%pip install llama-index-finetuning==0.2.0
%pip install -U llama-index-readers-file


In [None]:
!pip install pyarrow==15.0.2



In [None]:
#!pip install sentence-transformers



In [None]:
import json

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import MetadataMode

Load Data

In [None]:
TRAIN_FILES = ["./dataset/doc1.pdf","./dataset/doc2.pdf","./dataset/doc3.pdf","./dataset/doc5.pdf"]
VAL_FILES = ["./dataset/doc4.pdf"]

TRAIN_CORPUS_FPATH = "./dataset/train_corpus.json"
VAL_CORPUS_FPATH = "./dataset/val_corpus.json"

In [None]:
def load_corpus(files, verbose=False):
    if verbose:
        print(f"Loading files {files}")

    reader = SimpleDirectoryReader(input_files=files)
    docs = reader.load_data()
    if verbose:
        print(f"Loaded {len(docs)} docs")

    parser = SentenceSplitter()
    nodes = parser.get_nodes_from_documents(docs, show_progress=verbose)

    if verbose:
        print(f"Parsed {len(nodes)} nodes")

    return nodes

We do a naive train/val split at the file level: four PDFs for training and one for validation.

In [None]:
train_nodes = load_corpus(TRAIN_FILES, verbose=True)
val_nodes = load_corpus(VAL_FILES, verbose=True)

Loading files ['./dataset/doc1.pdf', './dataset/doc2.pdf', './dataset/doc3.pdf', './dataset/doc5.pdf']
Loaded 360 docs


Parsing nodes:   0%|          | 0/360 [00:00<?, ?it/s]

Parsed 360 nodes
Loading files ['./dataset/doc4.pdf']
Loaded 73 docs


Parsing nodes:   0%|          | 0/73 [00:00<?, ?it/s]

Parsed 73 nodes
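
Because the split is made per file, no validation chunk comes from a training document. As a minimal sketch, assuming we wanted to pick the held-out files reproducibly instead of by hand, a helper like the hypothetical `split_files` below could do it (the function name, `val_fraction`, and `seed` are illustrative and not part of LlamaIndex):

```python
import random


def split_files(files, val_fraction=0.2, seed=42):
    """Split a list of file paths into (train, val) at the file level."""
    if not 0.0 < val_fraction < 1.0:
        raise ValueError("val_fraction must be strictly between 0 and 1")
    shuffled = sorted(files)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))  # keep at least one validation file
    return shuffled[n_val:], shuffled[:n_val]


all_files = [f"./dataset/doc{i}.pdf" for i in range(1, 6)]
train_files, val_files = split_files(all_files)
print(len(train_files), len(val_files))  # 4 1
```

The same seed always yields the same split, which keeps evaluation numbers comparable across runs.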


### Generate synthetic queries

Now, we use an LLM (gpt-4o-mini) to generate questions using each text chunk in the corpus as context.

Each pair of (generated question, text chunk used as context) becomes a datapoint in the finetuning dataset (either for training or evaluation).
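
To make the datapoint structure concrete, here is a minimal sketch of how (question, chunk) pairs can be assembled into the three mappings a retrieval dataset needs. The `fake_question_generator` function is a hypothetical stand-in for the LLM call, and `build_pairs` is an illustrative name; neither is the LlamaIndex implementation.

```python
import uuid


def fake_question_generator(chunk_text, num_questions=2):
    """Hypothetical stand-in for an LLM: returns placeholder questions for a chunk."""
    first_words = " ".join(chunk_text.split()[:4])
    return [f"Question {i + 1} about: {first_words}?" for i in range(num_questions)]


def build_pairs(chunks, num_questions=2):
    """Map each generated question to the chunk it was generated from."""
    corpus = {str(uuid.uuid4()): text for text in chunks}
    queries, relevant_docs = {}, {}
    for node_id, text in corpus.items():
        for question in fake_question_generator(text, num_questions):
            query_id = str(uuid.uuid4())
            queries[query_id] = question
            relevant_docs[query_id] = [node_id]  # exactly one relevant chunk per question
    return queries, corpus, relevant_docs


chunks = ["Active listening means giving full attention.", "Feedback should be specific."]
queries, corpus, relevant_docs = build_pairs(chunks)
print(len(queries), len(corpus))  # 4 2
```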

In [None]:
from llama_index.finetuning import generate_qa_embedding_pairs
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

In [None]:
import os

from google.colab import userdata

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

# Finetuning Embeddings for Soft Skills Coaching Topics
The main idea is to improve retrieval by exposing the embedding model to data from the target domain, so that it builds a richer contextual understanding of the topic. We first split the source documents into chunks and use an LLM to generate questions for each chunk. The resulting (question, chunk) pairs are then used to finetune the embedding model for the specific context it will operate in, which should improve its performance in the RAG system.

In [None]:
from llama_index.llms.openai import OpenAI


train_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-4o-mini"), nodes=train_nodes
)
val_dataset = generate_qa_embedding_pairs(
    llm=OpenAI(model="gpt-4o-mini"), nodes=val_nodes
)

train_dataset.save_json("train_dataset.json")
val_dataset.save_json("val_dataset.json")

100%|██████████| 360/360 [09:58<00:00,  1.66s/it]


Final dataset saved.


360it [00:00, ?it/s]

Final dataset saved.





In [None]:
# [Optional] Load
train_dataset = EmbeddingQAFinetuneDataset.from_json("train_dataset.json")
val_dataset = EmbeddingQAFinetuneDataset.from_json("val_dataset.json")

In [None]:
train_dataset.model_fields

{'queries': FieldInfo(annotation=Dict[str, str], required=True),
 'corpus': FieldInfo(annotation=Dict[str, str], required=True),
 'relevant_docs': FieldInfo(annotation=Dict[str, List[str]], required=True),
 'mode': FieldInfo(annotation=str, required=False, default='text')}
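
Since the dataset is essentially these three mappings, a quick consistency check before training can catch broken pairs. The sketch below works on plain dictionaries shaped like the fields listed above; `check_dataset` is a hypothetical helper, not a LlamaIndex function.

```python
def check_dataset(queries, corpus, relevant_docs):
    """Return a list of human-readable problems; an empty list means the data is consistent."""
    problems = []
    for query_id in queries:
        doc_ids = relevant_docs.get(query_id)
        if not doc_ids:
            problems.append(f"query {query_id} has no relevant docs")
            continue
        for doc_id in doc_ids:
            if doc_id not in corpus:
                problems.append(f"query {query_id} points to missing doc {doc_id}")
    for query_id in relevant_docs:
        if query_id not in queries:
            problems.append(f"relevant_docs has unknown query {query_id}")
    return problems


toy_queries = {"q1": "What is active listening?", "q2": "How to give feedback?"}
toy_corpus = {"d1": "Active listening means ...", "d2": "Feedback should be ..."}
toy_relevant = {"q1": ["d1"], "q2": ["d3"]}  # d3 is deliberately missing
print(check_dataset(toy_queries, toy_corpus, toy_relevant))
```

With the real data, the same check could be called as `check_dataset(train_dataset.queries, train_dataset.corpus, train_dataset.relevant_docs)`.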

## Run Embedding Finetuning

In [None]:
from llama_index.finetuning import SentenceTransformersFinetuneEngine

In [None]:
finetune_engine = SentenceTransformersFinetuneEngine(
    train_dataset,
    model_id="BAAI/bge-large-en-v1.5",
    model_output_path="test_model",
    val_dataset=val_dataset,
    show_progress_bar=True,
    epochs=3
)
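
To our knowledge, when no loss is passed, the engine trains with the sentence-transformers `MultipleNegativesRankingLoss`; treat that as an assumption and check the version you have installed. The idea is that, within a batch, each query's own chunk is the positive and every other chunk in the batch acts as a negative. The plain-Python sketch below illustrates that objective on toy 2-D vectors. It is not the library code, and the `scale` of 20 is an illustrative choice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def in_batch_negatives_loss(query_vecs, chunk_vecs, scale=20.0):
    """Mean cross-entropy where query i should rank chunk i above all other chunks."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, c) for c in chunk_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


toy_queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each query is close to its own chunk
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each query is close to the wrong chunk
loss_aligned = in_batch_negatives_loss(toy_queries, aligned)
loss_swapped = in_batch_negatives_loss(toy_queries, swapped)
print(loss_aligned < loss_swapped)  # True
```

The loss is near zero when every query is most similar to its own chunk and grows when another chunk in the batch scores higher, which is why larger batches give the model more negatives to learn from.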

In [None]:
finetune_engine.finetune()

Epoch:   0%|          | 0/3 [00:00<?, ?it/s]

Iteration:   0%|          | 0/72 [00:00<?, ?it/s]

Iteration:   0%|          | 0/72 [00:00<?, ?it/s]

Iteration:   0%|          | 0/72 [00:00<?, ?it/s]
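
The `0/72` in the progress bars is consistent with simple arithmetic, under two assumptions that the notebook does not state: that two questions were generated per chunk and that the engine used a batch size of 10 (we believe both are the library defaults). With 360 training chunks, that gives 720 pairs and 72 batches per epoch:

```python
import math

num_chunks = 360          # "Parsed 360 nodes" for the training corpus
questions_per_chunk = 2   # assumed default of generate_qa_embedding_pairs
batch_size = 10           # assumed default of the finetune engine
epochs = 3                # value passed to the engine

num_pairs = num_chunks * questions_per_chunk
steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * epochs
print(num_pairs, steps_per_epoch, total_steps)  # 720 72 216
```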

In [None]:
embed_model = finetune_engine.get_finetuned_model()

In [None]:
embed_model

HuggingFaceEmbedding(model_name='test_model', embed_batch_size=10, callback_manager=<llama_index.core.callbacks.base.CallbackManager object at 0x7c96cc38c8e0>, num_workers=None, max_length=512, normalize=True, query_instruction=None, text_instruction=None, cache_folder=None)

## Evaluate Finetuned Model

In [None]:
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from tqdm.notebook import tqdm
import pandas as pd

### Define eval function

In [None]:
def evaluate(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    nodes = [TextNode(id_=id_, text=text) for id_, text in corpus.items()]
    index = VectorStoreIndex(
        nodes, embed_model=embed_model, show_progress=True
    )
    retriever = index.as_retriever(similarity_top_k=top_k)

    eval_results = []
    for query_id, query in tqdm(queries.items()):
        retrieved_nodes = retriever.retrieve(query)
        retrieved_ids = [node.node.node_id for node in retrieved_nodes]
        expected_id = relevant_docs[query_id][0]
        is_hit = expected_id in retrieved_ids  # assume 1 relevant doc

        eval_result = {
            "is_hit": is_hit,
            "retrieved": retrieved_ids,
            "expected": expected_id,
            "query": query_id,
        }
        eval_results.append(eval_result)
    return eval_results
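
The `is_hit` flag is the building block of the hit-rate metric used in the rest of this notebook: the fraction of queries whose expected chunk appears among the top-k retrieved ids. A minimal sketch on toy data (the ids are made up for illustration, and `hit_rate` is a hypothetical helper):

```python
def hit_rate(eval_results):
    """Fraction of queries whose expected id is among the retrieved ids."""
    if not eval_results:
        raise ValueError("eval_results is empty")
    hits = sum(1 for r in eval_results if r["expected"] in r["retrieved"])
    return hits / len(eval_results)


toy_results = [
    {"query": "q1", "expected": "d1", "retrieved": ["d1", "d7"]},
    {"query": "q2", "expected": "d2", "retrieved": ["d5", "d2"]},
    {"query": "q3", "expected": "d3", "retrieved": ["d8", "d9"]},
    {"query": "q4", "expected": "d4", "retrieved": ["d4", "d1"]},
]
print(hit_rate(toy_results))  # 0.75
```

Hit rate ignores the position of the match within the top-k, so it says nothing about ranking quality beyond "found or not found".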

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator
from sentence_transformers import SentenceTransformer
from pathlib import Path


def evaluate_st(
    dataset,
    model_id,
    name,
):
    corpus = dataset.corpus
    queries = dataset.queries
    relevant_docs = dataset.relevant_docs

    evaluator = InformationRetrievalEvaluator(
        queries, corpus, relevant_docs, name=name
    )
    model = SentenceTransformer(model_id)
    output_path = "results/"
    Path(output_path).mkdir(exist_ok=True, parents=True)
    return evaluator(model, output_path=output_path)

### Run Evals

#### OpenAI

Note: this might take a few minutes to run, since we have to embed the corpus and the queries.

In [None]:
ada = OpenAIEmbedding(model="text-embedding-3-small")
ada_val_results = evaluate(val_dataset, ada)

Generating embeddings:   0%|          | 0/360 [00:00<?, ?it/s]



Retrying llama_index.embeddings.openai.base.get_embeddings in 0.43917967805353264 seconds as it raised APIConnectionError: Connection error..


APIConnectionError: Connection error.

In [None]:
df_ada = pd.DataFrame(ada_val_results)

In [None]:
hit_rate_ada = df_ada["is_hit"].mean()
hit_rate_ada

0.8333333333333334

### Comparison with a popular open-source embedding model

In [None]:
Snow = "local:Snowflake/snowflake-arctic-embed-l"
Snow_val_results = evaluate(val_dataset, Snow)


Generating embeddings:   0%|          | 0/360 [00:00<?, ?it/s]

  0%|          | 0/720 [00:00<?, ?it/s]

In [None]:
df_snow = pd.DataFrame(Snow_val_results)

In [None]:
hit_rate_snow = df_snow["is_hit"].mean()
hit_rate_snow

0.2388888888888889

### Comparison with a strong baseline model

In [None]:
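# Off-the-shelf baseline, not our finetuned model: the `finetuned` variable names
# are reused here and overwritten in the "Finetuned" section
finetuned = "local:mixedbread-ai/mxbai-embed-large-v1"
val_results_finetuned = evaluate(val_dataset, finetuned)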


Generating embeddings:   0%|          | 0/360 [00:00<?, ?it/s]

  0%|          | 0/720 [00:00<?, ?it/s]

In [None]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [None]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

0.7986111111111112

### Finetuned

In [None]:
finetuned = "local:test_model"
val_results_finetuned = evaluate(val_dataset, finetuned)

Generating embeddings:   0%|          | 0/360 [00:00<?, ?it/s]

  0%|          | 0/720 [00:00<?, ?it/s]

In [None]:
df_finetuned = pd.DataFrame(val_results_finetuned)

In [None]:
hit_rate_finetuned = df_finetuned["is_hit"].mean()
hit_rate_finetuned

0.9736111111111111
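
Collecting the hit rates reported in this notebook side by side makes the comparison easier to read. The numbers below are copied from the cell outputs; the dictionary and its labels are just an illustrative way to rank them.

```python
hit_rates = {
    "OpenAI text-embedding-3-small": 0.8333333333333334,
    "Snowflake/snowflake-arctic-embed-l": 0.2388888888888889,
    "mixedbread-ai/mxbai-embed-large-v1": 0.7986111111111112,
    "test_model (finetuned bge-large-en-v1.5)": 0.9736111111111111,
}

# Print the models from best to worst hit rate
for name, rate in sorted(hit_rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:45s} {rate:.3f}")

best = max(hit_rates, key=hit_rates.get)
```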

In [None]:
evaluate_st(val_dataset, "test_model", name="finetuned")

0.926170650337317

In [None]:
from transformers import AutoModel, AutoTokenizer
from huggingface_hub import login

from google.colab import userdata


# Log in to the Hugging Face Hub with a token stored in Colab secrets
login(token=userdata.get('HF_TOKEN'))

# Directory written by the finetune engine (see model_output_path)
model_dir = "test_model"

# Load the base model's tokenizer and the finetuned weights
tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-large-en-v1.5")
model = AutoModel.from_pretrained(model_dir)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: fineGrained).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [None]:
from transformers import AutoModel, AutoTokenizer

# Push the model to Hugging Face
model.push_to_hub("CamiloGC93/bge-large-en-v1.5-soft-skills")  # Use your Hugging Face username and a valid repo name
tokenizer.push_to_hub("CamiloGC93/bge-large-en-v1.5-soft-skills")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/CamiloGC93/bge-large-en-v1.5-soft-skills/commit/3cd7defcff369c465778ca9ddff6a6b43c9ef02c', commit_message='Upload tokenizer', commit_description='', oid='3cd7defcff369c465778ca9ddff6a6b43c9ef02c', pr_url=None, pr_revision=None, pr_num=None)

In [2]:
from sentence_transformers import SentenceTransformer

In [3]:
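# Requires `evaluate` and `val_dataset` from the earlier cells;
# re-run those cells first if the runtime was restarted
model_from_huggingface = "local:CamiloGC93/bge-large-en-v1.5-soft-skills"
ftmodel__val_results = evaluate(val_dataset, model_from_huggingface)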

NameError: name 'evaluate' is not defined

In [None]:
df_ft = pd.DataFrame(ftmodel__val_results)

In [None]:
hit_rate_finetuned = df_ft["is_hit"].mean()
hit_rate_finetuned

0.9736111111111111

In [1]:
!pip install "optimum[exporters]"



In [4]:
!optimum-cli export onnx --model CamiloGC93/bge-large-en-v1.5-soft-skills --task feature-extraction bge-large-en-v1.5-soft-skills_onnx/
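
A feature-extraction export typically returns per-token hidden states rather than a single sentence vector, so a pooling step is still needed after running the ONNX model. As a minimal sketch, assuming CLS pooling followed by L2 normalization (the setup commonly used with bge models; verify against the model card), and using nested Python lists in place of real model output:

```python
import math


def cls_pool_and_normalize(last_hidden_state):
    """Take the first-token vector of each sequence and scale it to unit length."""
    embeddings = []
    for token_vectors in last_hidden_state:  # one entry per input sequence
        cls_vector = token_vectors[0]        # the first token is [CLS]
        norm = math.sqrt(sum(x * x for x in cls_vector))
        if norm == 0.0:
            raise ValueError("cannot normalize a zero vector")
        embeddings.append([x / norm for x in cls_vector])
    return embeddings


# Toy stand-in for model output with shape (batch=2, tokens=3, hidden=2)
toy_output = [
    [[3.0, 4.0], [1.0, 1.0], [0.0, 2.0]],
    [[0.0, 5.0], [2.0, 2.0], [1.0, 0.0]],
]
pooled = cls_pool_and_normalize(toy_output)
print(pooled)  # [[0.6, 0.8], [0.0, 1.0]]
```

With unit-length vectors, the dot product of two embeddings equals their cosine similarity, which matches the `normalize=True` setting shown for the finetuned model earlier.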

