# Fine-tuning Embeddings for RAG on Specific Data

As we start our "fine-tuning" week, we'll begin with the lowest-hanging improvement you can make to a RAG pipeline:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
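
As a rough illustration of "moving embeddings closer together", here is a toy sketch with made-up 2-dimensional vectors. The helper names (`cosine_similarity`, `nudge_toward`) and the numbers are assumptions for illustration only; real fine-tuning updates the model's weights through a loss function rather than editing vectors directly.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nudge_toward(vector, target, rate):
    """Move `vector` a fraction `rate` of the way toward `target`."""
    return [v + rate * (t - v) for v, t in zip(vector, target)]

# Toy embeddings (hypothetical values, not produced by any real model)
question = [1.0, 0.0]
document = [0.6, 0.8]

before = cosine_similarity(question, document)
after = cosine_similarity(nudge_toward(question, document, rate=0.5), document)
print(f"similarity before: {before:.3f}, after: {after:.3f}")
```

The similarity after the nudge is higher than before, which is the effect we want training to have on question/document pairs.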

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are training our embedding model for the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [4]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [5]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [6]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

### Provide OpenAI API Key

In [7]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll prepare our data by downloading the webpages we'll be using today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [8]:
!mkdir -p data


In [9]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31392    0 31392    0     0  88217      0 --:--:-- --:--:-- --:--:-- 88179


In [10]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70292    0 70292    0     0   733k      0 --:--:-- --:--:-- --:--:--  738k


In [11]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [12]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)
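
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window chunking. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter tries a list of separators recursively); `chunk_text` is a hypothetical helper that only illustrates what the two parameters control.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split `text` into windows of `chunk_size` characters that share
    `chunk_overlap` characters with the previous window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks

# Each chunk repeats the final character of the previous one
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

A small overlap like the `20` used above keeps sentences that straddle a boundary partially visible in both neighbouring chunks.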

Next we can load/split these documents as follows.

>> NOTE: You may need to run this cell twice to get it to work.

In [13]:
training_documents = text_splitter.split_documents(text_loader.load())

In [14]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [15]:
import uuid

id_set = set()

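for document in training_documents:
  # Draw a fresh UUID string, retrying on the (extremely unlikely) collision
  doc_id = str(uuid.uuid4())
  while doc_id in id_set:
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id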

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [16]:
# Hold out the last 24 chunks: 12 for validation, 12 for test
n_docs = len(training_documents)
training_split_documents = training_documents[:n_docs - 24]
val_split_documents = training_documents[n_docs - 24:n_docs - 12]
test_split_documents = training_documents[n_docs - 12:]

In [17]:
training_split_documents[:2]

[Document(metadata={'source': 'data/2024_llms.html', 'title': 'Things we learned about LLMs in 2024', 'id': '337b5e1f-3f4d-4f1f-a405-69a2892ae78d'}, page_content='Things we learned about LLMs in 2024\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSimon Willison’s Weblog\nSubscribe\n\n\n\n\n\n\nThings we learned about LLMs in 2024\n31st December 2024\nA lot has happened in the world of Large Language Models over the course of 2024. Here’s a review of things we figured out about the field in the past twelve months, plus my attempt at identifying key themes and pivotal moments.\nThis is a sequel to my review of 2023.\nIn this article:'),
 Document(metadata={'source': 'data/2024_llms.html', 'title': 'Things we learned about LLMs in 2024', 'id': 'f9108a61-f686-4efe-afaa-76402dd049da'}, page_content='The GPT-4 barrier was comprehensively broken\nSome of those GPT-4 models run on my laptop\nLLM prices crashed, thanks to competition and increased efficiency\nMultimodal vision is common, audio and 

In [18]:
type(training_split_documents)

list

## Task 3: Constructing a Fine-tuning Dataset

Using the chunks we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [today](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [19]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [20]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [19]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object mapping each question `id` to its question text (`str`).
- An object mapping each `question_id` to a `List[str]` of associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [86]:
import tqdm

async def create_questions(documents, n_questions):
  questions = {}  # Stores question_id -> question text
  relevant_docs = {}  # Stores question_id -> List of context_id

  for doc in tqdm.tqdm(documents, desc="Processing Documents"):
      doc_id = doc.metadata.get("id")  # Extracting 'id' from metadata

      if not doc_id:
          continue  # Skip if no valid document ID

      # Build n_questions placeholder questions per document from its title
      # (note: this does not call question_generation_chain)
      doc_questions = [
          (str(uuid.uuid4()), f"Sample question {i+1} related to {doc.metadata.get('title', 'Unknown Title')}")
          for i in range(n_questions)
      ]

      for question_id, question_text in doc_questions:
          questions[question_id] = question_text
          if question_id not in relevant_docs:
              relevant_docs[question_id] = []
          relevant_docs[question_id].append(doc_id)  # Associate question with document ID

  return questions, relevant_docs

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)
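
Note that `create_questions` above fills in placeholder text built from the document title rather than calling `question_generation_chain`. Below is a minimal sketch of how LLM-generated questions could be parsed and associated with documents instead. `parse_numbered_questions`, `create_questions_with_llm`, and the `generate` callable are all hypothetical names introduced for illustration; `generate` is assumed to be an async function that returns the raw numbered-list text from the model.

```python
import asyncio
import re
import uuid

def parse_numbered_questions(text):
    """Extract questions from lines shaped like '1. What is X?'."""
    questions = []
    for line in text.splitlines():
        match = re.match(r"^\s*\d+[.)]\s*(.+\S)\s*$", line)
        if match:
            questions.append(match.group(1))
    return questions

async def create_questions_with_llm(documents, n_questions, generate):
    """`generate` is a hypothetical async callable:
    (context: str, n_questions: int) -> raw LLM response text."""
    questions = {}      # question_id -> question text
    relevant_docs = {}  # question_id -> list of context ids
    for doc in documents:
        raw = await generate(doc.page_content, n_questions)
        for question in parse_numbered_questions(raw)[:n_questions]:
            question_id = str(uuid.uuid4())
            questions[question_id] = question
            relevant_docs[question_id] = [doc.metadata["id"]]
    return questions, relevant_docs

# In this notebook, `generate` could wrap the chain defined earlier
# (an assumption about how you might wire it up, left commented out):
# async def generate(context, n):
#     response = await question_generation_chain.ainvoke(
#         {"context": context, "n_questions": n}
#     )
#     return response.content
```

Passing the generator in as an argument keeps the parsing and bookkeeping testable without an API key.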

In [87]:
training_split_documents[1]

Document(metadata={'source': 'data/2024_llms.html', 'title': 'Things we learned about LLMs in 2024', 'id': 'f9108a61-f686-4efe-afaa-76402dd049da'}, page_content='The GPT-4 barrier was comprehensively broken\nSome of those GPT-4 models run on my laptop\nLLM prices crashed, thanks to competition and increased efficiency\nMultimodal vision is common, audio and video are starting to emerge\nVoice and live camera mode are science fiction come to life\nPrompt driven app generation is a commodity already\nUniversal access to the best models lasted for just a few short months\n“Agents” still haven’t really happened yet\nEvals really matter\nApple Intelligence is bad, Apple’s MLX library is excellent\nThe rise of inference-scaling “reasoning” models\nWas the best currently available LLM trained in China for less than $6m?\nThe environmental impact got better\nThe environmental impact got much, much worse')

In [88]:
import asyncio
training_questions, training_relevant_contexts = asyncio.run(create_questions(training_split_documents, 2))

Processing Documents: 100%|██████████| 78/78 [00:00<00:00, 46026.41it/s]


We'll use the function to generate training, validation, and test data.

In [89]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Processing Documents: 100%|██████████| 12/12 [00:00<00:00, 17439.93it/s]


In [90]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Processing Documents: 100%|██████████| 12/12 [00:00<00:00, 29502.72it/s]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

In [91]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [92]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [93]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)
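
Because `json.dump` writes each dataset as a single JSON object, the saved files are read back with `json.load` rather than line by line, despite the `.jsonl` extension. Here is a small round-trip sketch using a toy dataset; the file name and contents are placeholders, not the notebook's real data.

```python
import json
import os
import tempfile

# Toy dataset in the same three-key layout used above
dataset = {
    "questions": {"q1": "Who drives the bus?"},
    "relevant_contexts": {"q1": ["d1"]},
    "corpus": {"d1": "The bus was driven by Kyle, the Bus Driver."},
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.json")  # placeholder file name
    with open(path, "w") as f:
        json.dump(dataset, f)
    with open(path) as f:
        reloaded = json.load(f)

# Follow question id -> context id -> chunk text
for question_id, question in reloaded["questions"].items():
    doc_id = reloaded["relevant_contexts"][question_id][0]
    print(question, "->", reloaded["corpus"][doc_id])
```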

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well-performing embeddings model by itself, but there's a lot of very specific domain terminology and vocabulary in our corpus, so let's fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [95]:
!pip uninstall -y pyarrow

Found existing installation: pyarrow 18.0.0
Uninstalling pyarrow-18.0.0:
  Successfully uninstalled pyarrow-18.0.0


In [1]:
!pip install pyarrow==18.0.0



In [1]:
!pip install -U sentence_transformers datasets tqdm



In [2]:
import pyarrow as pa
print("pyarrow version:", pa.__version__)
print("Has Decimal32Type:", hasattr(pa.lib, "Decimal32Type"))

pyarrow version: 18.0.0
Has Decimal32Type: False


In [29]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [94]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [95]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [96]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [97]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [98]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

**MatryoshkaLoss**

**Purpose:**
MatryoshkaLoss is designed as a loss modifier that enables a model to be trained at multiple embedding dimensions simultaneously. This approach is particularly useful when you want your model to support various embedding sizes at inference time, allowing users to choose lower-dimensional embeddings (for speed and reduced storage) while still benefiting from a model trained on higher-dimensional features.

**How It Works:**

- **Multiple Dimensionalities:** It takes a base loss function (such as MultipleNegativesRankingLoss) and applies it over a range of embedding dimensions. For example, you might train using dimensions like [768, 512, 256, 128, 64]. Each of these dimensions is assigned a weight so that the final loss is a weighted sum of the losses computed at each dimension.
- **Caching and Truncation:** To avoid recomputing embeddings multiple times, a ForwardDecorator is used to cache the output of the SentenceTransformer’s forward pass. Then, using the helper function shrink, the cached embeddings are truncated (i.e., only the first n dimensions are retained) and normalized. This means the model’s expensive forward computation happens only once per batch, and the embeddings are reused for loss computations at multiple truncated sizes.
- **Handling Cached Losses:** If the base loss is one of the Cached... variants (e.g., CachedMultipleNegativesRankingLoss), a CachedLossDecorator is applied. This decorator handles the backward pass carefully by detaching and then reattaching gradients across the multiple truncated embeddings.

**In Summary:**
MatryoshkaLoss lets you train a SentenceTransformer model so that its sentence embeddings can be “shrunk” to various dimensions, each contributing to the overall loss. This multi-scale approach helps the model remain effective even if the user later decides to trade off accuracy for speed with a smaller embedding size.
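
The truncate-normalize-and-sum idea described above can be sketched in plain Python. This is a simplified illustration, not the `sentence-transformers` implementation: `toy_base_loss` is a hypothetical stand-in for the wrapped loss, and the vectors are made up.

```python
import math

def shrink(vector, dim):
    """Keep the first `dim` values and rescale the result to unit length."""
    truncated = vector[:dim]
    norm = math.sqrt(sum(x * x for x in truncated))
    return [x / norm for x in truncated] if norm > 0 else truncated

def toy_base_loss(anchors, positives):
    """Stand-in base loss: mean (1 - dot product) over matching pairs."""
    gaps = [
        1.0 - sum(a * p for a, p in zip(anchor, positive))
        for anchor, positive in zip(anchors, positives)
    ]
    return sum(gaps) / len(gaps)

def matryoshka_loss(base_loss, anchors, positives, dims, weights=None):
    """Weighted sum of `base_loss` evaluated on truncated embeddings."""
    weights = weights if weights is not None else [1.0] * len(dims)
    total = 0.0
    for dim, weight in zip(dims, weights):
        small_anchors = [shrink(a, dim) for a in anchors]
        small_positives = [shrink(p, dim) for p in positives]
        total += weight * base_loss(small_anchors, small_positives)
    return total

# One toy pair that agrees on the first two dimensions but not the third
anchors = [[3.0, 4.0, 12.0]]
positives = [[3.0, 4.0, 0.0]]
total = matryoshka_loss(toy_base_loss, anchors, positives, dims=[3, 2])
print(total)
```

In this toy pair, the 2-dimensional truncation contributes almost nothing to the total because the vectors match exactly there; the mismatch only shows up at the full 3 dimensions.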

**MultipleNegativesRankingLoss**

**Purpose:**
MultipleNegativesRankingLoss is a ranking loss tailored for training retrieval-style models where you have paired positive examples and many in-batch negatives. Its goal is to ensure that for each given anchor sentence, the model assigns a higher similarity score to its positive pair than to any other sentence in the batch.

**How It Works:**

- **In-Batch Negatives:** The loss assumes that your batch consists of positive pairs: each pair contains an anchor and a positive candidate (e.g., a query and its relevant document). For each anchor, all other positives in the batch are treated as negatives. This creates many negative examples without needing to sample them separately.
- **Similarity and Scaling:** It computes the cosine similarity (or another similarity measure) between the anchor embeddings and the candidate embeddings. The similarity scores are then scaled by a constant factor (by default, 20.0) to adjust the range of values before applying the loss.
- **Cross-Entropy Loss:** A cross-entropy loss is applied to these similarity scores. For each anchor, the correct positive candidate is expected to have the highest score. The ground-truth label for each anchor is simply its index in the batch, and the loss encourages the model to assign the highest similarity to the correct positive while lowering the scores for the negatives.

**In Summary:**
MultipleNegativesRankingLoss is used to train models to distinguish the true positive pair from many negatives in a batch. It is highly effective for tasks like information retrieval, where the model needs to rank relevant documents or responses higher than irrelevant ones, all while benefiting from efficient in-batch negative sampling.
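
The in-batch-negatives idea can also be sketched without any ML library. This is a minimal pure-Python illustration under toy assumptions (hand-written 2-dimensional vectors, cosine similarity, the default scale of 20.0 mentioned above), not the library's actual code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where anchor i's correct 'class' is positive i
    and every other positive in the batch acts as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        scores = [scale * cosine(anchor, positive) for positive in positives]
        # Numerically stable log-sum-exp over all candidates
        top = max(scores)
        log_sum_exp = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(anchors)

anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = multiple_negatives_ranking_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = multiple_negatives_ranking_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs: {aligned:.4f}, swapped pairs: {swapped:.4f}")
```

When each anchor sits next to its own positive the loss is close to zero; when the positives are swapped, the loss is large, which is the signal that pushes matching pairs together during training.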

Now we can set up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [101]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 10 epochs, though you could increase this number if you had significantly more data.

In [103]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!
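
As a worked example of the warm-up arithmetic, the sketch below recomputes the number of steps from figures that appear earlier in this notebook (78 training chunks, 2 questions each, `BATCH_SIZE = 10`, `EPOCHS = 10`). It assumes the `DataLoader` keeps the final partial batch, which is its default behaviour.

```python
import math

n_documents = 78            # training chunks, from the progress bar above
questions_per_document = 2
batch_size = 10
epochs = 10

n_examples = n_documents * questions_per_document      # 156 question/chunk pairs
steps_per_epoch = math.ceil(n_examples / batch_size)   # batches per epoch
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.1)                  # first 10% of training

print(steps_per_epoch, total_steps, warmup_steps)
```

With these numbers, the warm-up covers roughly the first epoch of training.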

In [37]:
import wandb
wandb.init(mode="disabled")

In [38]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
16,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
32,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
48,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
50,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
64,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
80,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
96,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
100,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
112,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
128,No log,No log,0.083333,0.25,0.416667,0.833333,0.083333,0.083333,0.083333,0.083333,0.083333,0.25,0.416667,0.833333,0.37863,0.244081,0.258601
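
Two of the columns reported above, Accuracy@k and MRR@10, can be sketched as follows. The functions and toy rankings are illustrative assumptions rather than the evaluator's own code: Accuracy@k checks whether any relevant chunk appears in the top k results, and MRR averages the reciprocal rank of the first relevant chunk.

```python
def accuracy_at_k(retrieved_ids, relevant_ids, k):
    """1.0 if any relevant id appears in the top-k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in retrieved_ids[:k]) else 0.0

def reciprocal_rank(retrieved_ids, relevant_ids, k=10):
    """1 / rank of the first relevant id within the top k, or 0.0 if absent."""
    for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean(values):
    return sum(values) / len(values)

# Toy rankings for three queries (hypothetical ids)
runs = [
    (["d1", "d2", "d3"], {"d1"}),  # relevant chunk ranked 1st
    (["d4", "d5", "d6"], {"d5"}),  # relevant chunk ranked 2nd
    (["d7", "d8", "d9"], {"d0"}),  # relevant chunk not retrieved
]

mrr = mean([reciprocal_rank(retrieved, relevant) for retrieved, relevant in runs])
acc_at_1 = mean([accuracy_at_k(retrieved, relevant, 1) for retrieved, relevant in runs])
print(mrr, acc_at_1)
```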


In [39]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [40]:
hf_username = "krishanusinha20"

In [41]:
# model.push_to_hub(f"{hf_username}/legal-ft-v0")

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [42]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

All that's left to do is evaluate. We'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

In [43]:
from itertools import islice

for k,v in test_dataset.items():
  print(k,v)

questions {'44966156-d634-48f1-90b4-1c6c862ef2a3': 'Sample question 1 related to Stuff we figured out about AI in 2023', '940bba3f-a042-41bd-a964-2ba341215429': 'Sample question 2 related to Stuff we figured out about AI in 2023', '9e4e908c-4440-4a63-8882-bb19edaed7bc': 'Sample question 1 related to Stuff we figured out about AI in 2023', '8dca6de0-3db0-4af4-90f2-183f8a90ae80': 'Sample question 2 related to Stuff we figured out about AI in 2023', '0dc66444-2e1b-4c4e-8cee-709169e6f0da': 'Sample question 1 related to Stuff we figured out about AI in 2023', 'cb98ed09-c587-431e-a7fb-401ef51699bd': 'Sample question 2 related to Stuff we figured out about AI in 2023', '2ae5b75e-1d71-466d-b914-519926c2b1b3': 'Sample question 1 related to Stuff we figured out about AI in 2023', '6d15270d-f5de-45af-a32b-2448725fb22f': 'Sample question 2 related to Stuff we figured out about AI in 2023', '95c30a95-696e-4655-b6c0-5a7166e46b21': 'Sample question 1 related to Stuff we figured out about AI in 2023',

### `text-embedding-3-small`

In [44]:
from tqdm import tqdm  # Import the function directly

def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
    corpus = dataset['corpus']
    questions = dataset['questions']
    relevant_docs = dataset['relevant_contexts']
    documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
    vectorstore = FAISS.from_documents(documents, embed_model)

    retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

    eval_results = []
    for id, question in tqdm(questions.items()):
        retrieved_nodes = retriever.invoke(question)
        retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
        expected_id = relevant_docs[id][0]
        is_hit = expected_id in retrieved_ids
        eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

    return eval_results

In [45]:
from tqdm import tqdm

te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 24/24 [00:32<00:00,  1.36s/it]


In [46]:
te3_results_df = pd.DataFrame(te3_results)

In [47]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

0.4166666666666667
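
The hit rate is simply the fraction of questions whose expected chunk shows up in the top-k retrieved results. With 24 test questions, a value of 0.4167 corresponds to 10 hits. A minimal sketch, using a hypothetical `hit_rate` helper and toy results in the same shape that `evaluate_openai` returns:

```python
def hit_rate(results):
    """Fraction of result rows whose `is_hit` flag is true."""
    if not results:
        return 0.0
    return sum(1 for row in results if row["is_hit"]) / len(results)

# 24 toy rows, the first 10 of which are hits
toy_results = [{"is_hit": index < 10} for index in range(24)]
print(hit_rate(toy_results))
```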

### `Snowflake/snowflake-arctic-embed-l` (base)

In [48]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_l_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 24/24 [00:00<00:00, 47.37it/s]


In [49]:
arctic_embed_l_results_df = pd.DataFrame(arctic_embed_l_results)

In [50]:
arctic_embed_l_hit_rate = arctic_embed_l_results_df["is_hit"].mean()
arctic_embed_l_hit_rate

0.4166666666666667

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [51]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 24/24 [00:00<00:00, 47.53it/s]


In [52]:
finetune_results_df = pd.DataFrame(finetune_results)

In [53]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

0.4166666666666667

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [54]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the base (non-fine-tuned) retrieval model.

#### R - Retrieval

In [55]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [56]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [57]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [58]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
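
To make the data flow of the chain above easier to follow, here is a plain-Python sketch of the same three stages: retrieve, fill the prompt, generate. It is not LCEL; `run_rag` and the three fake callables are hypothetical stand-ins for the retriever, prompt template, and LLM.

```python
def run_rag(question, retrieve, render_prompt, generate):
    """Retrieve context, fill the prompt, generate, and return both
    the response and the context that was used."""
    context = retrieve(question)
    prompt = render_prompt(context=context, question=question)
    return {"response": generate(prompt), "context": context}

# Fake components so the sketch runs without a vector store or an API key
def fake_retrieve(question):
    return ["Kyle drove the bus."]

def fake_render_prompt(context, question):
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"

def fake_generate(prompt):
    return "Kyle" if "Kyle" in prompt else "I do not know."

result = run_rag("Who drives the bus?", fake_retrieve, fake_render_prompt, fake_generate)
print(result["response"])
```

Returning the context alongside the response, as the chain above does, is what later lets an evaluation step inspect which chunks each answer was based on.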

In [59]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'An agent, in the context of AI, is a term that refers to AI systems that can act on your behalf. However, the term is considered vague and lacks a single, clear definition. There are different interpretations, such as AI systems that autonomously perform tasks (like a travel agent) or LLMs (large language models) that utilize tools to solve problems. The concept of "agents" is still evolving, and there is skepticism about their utility due to challenges like gullibility, where these systems may struggle to distinguish truth from fiction.'

In [60]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [61]:
base_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'I do not know.'

In [62]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [63]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [64]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [65]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An "Agent" is a term that lacks a single, clear, and widely understood meaning in the context of AI. It generally refers to AI systems that are intended to act on behalf of users, but the specifics can vary widely. Some people think of agents as systems that can autonomously perform tasks, similar to a travel agent, while others view them as LLMs (Large Language Models) that utilize tools to solve problems. The term is often used without a clear definition, leading to confusion and skepticism about their utility, particularly due to issues like gullibility in AI systems.'

In [66]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [67]:
finetune_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'The context suggests that December might be considered the laziest month for AI, as it mentions the possibility that ChatGPT gets lazy in December due to its hidden system prompt including the current date and the observation that people provide less useful answers coming up to the holidays.'

In [68]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

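The `finetune_rag_chain` answered the questions better. Both chains gave equivalent answers to the first two questions, and neither could answer the question about Simon's phone, but only the fine-tuned chain answered the domain-specific question "What is the laziest month for AI?" (the base chain returned 'I do not know.'). The improvement comes from retrieval: we fine-tuned the embedding model on question/document pairs from this corpus using a Matryoshka loss function, which trains the model so that the leading dimensions of each embedding carry the most important information. This kind of fine-tuning is very useful for domain-specific retrieval.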
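To make the Matryoshka idea a bit more concrete, here is a minimal sketch (not the training code used in this notebook) of what the loss is meant to enable: cutting an embedding down to its leading dimensions, re-normalizing it, and still ranking documents sensibly. The vectors are made-up toy values rather than model output, and `truncate_and_normalize`, `cosine_at_dim`, and `best_match` are hypothetical helper names.

```python
import math

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values of `vector` and rescale them to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

def cosine_at_dim(a, b, dim):
    """Cosine similarity between `a` and `b` using only their first `dim` values."""
    a_t = truncate_and_normalize(a, dim)
    b_t = truncate_and_normalize(b, dim)
    return sum(x * y for x, y in zip(a_t, b_t))

def best_match(query, docs, dim):
    """Index of the document most similar to `query` at the given dimension."""
    scores = [cosine_at_dim(query, doc, dim) for doc in docs]
    return scores.index(max(scores))

# Toy 8-dimensional "embeddings" (made-up values for illustration only).
query = [0.9, 0.1, 0.3, 0.0, 0.1, 0.0, 0.2, 0.1]
docs = [
    [0.8, 0.2, 0.4, 0.1, 0.0, 0.1, 0.1, 0.0],  # meant to be the relevant document
    [0.1, 0.9, 0.0, 0.3, 0.2, 0.0, 0.1, 0.4],  # meant to be an unrelated document
]

# Compare the top-ranked document at full size and at truncated sizes.
for dim in (8, 4, 2):
    print(dim, best_match(query, docs, dim))
```

In these toy vectors the ranking stays stable under truncation by construction. A model trained with a Matryoshka-style objective is encouraged to behave this way on real data, which is what allows smaller, cheaper embeddings at query time.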

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight into how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. This includes the Synthetic Data Generation steps.
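One way to keep the baseline and fine-tuned runs comparable is to push both chains through the same record-building function. The sketch below assumes each chain's `.invoke()` returns a dict with `response` and `context` keys, as the chains above do. `build_eval_records` and `FakeChain` are hypothetical names used only for illustration; `FakeChain` stands in for a real chain so the snippet runs on its own.

```python
from types import SimpleNamespace

def build_eval_records(chain, questions, references):
    """Run `chain` on each question and collect RAGAS-style records."""
    records = []
    for question, reference in zip(questions, references):
        output = chain.invoke({"question": question})
        contexts = [
            # LangChain Documents expose `page_content`; fall back to str().
            getattr(doc, "page_content", str(doc))
            for doc in output.get("context", [])
        ]
        records.append({
            "user_input": question,
            "reference": reference,
            "retrieved_contexts": contexts,
            "response": output.get("response", ""),
        })
    return records

class FakeChain:
    """Stand-in for a RAG chain (hypothetical, for illustration only)."""

    def __init__(self, label):
        self.label = label

    def invoke(self, inputs):
        text = f"{self.label} context for: {inputs['question']}"
        doc = SimpleNamespace(page_content=text)
        return {"response": f"{self.label} answer", "context": [doc]}

questions = ["What is an Agent?"]
references = ["placeholder reference text"]  # placeholder, not real ground truth

# The same function is applied to both chains so the records are comparable.
baseline_records = build_eval_records(FakeChain("baseline"), questions, references)
finetuned_records = build_eval_records(FakeChain("finetuned"), questions, references)
print(baseline_records[0]["retrieved_contexts"])
print(finetuned_records[0]["response"])
```

With the real chains, the same call would be made once with `base_rag_chain` and once with `finetune_rag_chain`, so that any difference in scores comes from the retriever rather than from how the records were assembled.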

In [69]:
### YOUR CODE HERE
!pip install -qU ragas==0.2.10

In [150]:
import pandas as pd
from ragas import EvaluationDataset, evaluate, RunConfig
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity


In [134]:
# Build records: for each question, create an evaluation instance.
records = []
for qid, question in test_dataset["questions"].items():
    context_ids = test_dataset["relevant_contexts"].get(qid, [])
    ref_id = context_ids[0] if context_ids else None
    reference_text = test_dataset["corpus"].get(ref_id, "") if ref_id else ""

    records.append({
         "id": qid,
         "question": question,            # Input question
         "reference": reference_text,        # Expected context (ground truth)
         "retrieved_contexts": [],           # To be filled by the chain
         "response": ""                      # To be filled by the chain
    })

# Convert records to a DataFrame.
df = pd.DataFrame(records)
print("DataFrame:")
print(df)

DataFrame:
                                      id  \
0   dc63318e-e94d-4b13-b24d-80b92806425f   
1   96401b36-5881-41e8-9d53-dce58ff417e3   
2   c2a9cf20-150c-498b-aea6-6e3c6ade1fb8   
3   4d4c66ca-8f13-406e-9e93-1d71a74dd99e   
4   e7d8f869-2449-4be4-971b-b620a26dc390   
5   b8223543-f174-4804-aa65-5f71eb328ff1   
6   5b30709c-61f8-4424-9205-43cb616b0b88   
7   c500a394-57b1-470c-abae-6f961bcda8e8   
8   991032a7-bd9a-4ba2-a5a7-99d9982a2adc   
9   cd3ed50c-2c32-41e0-a426-0a7eef323327   
10  3477139b-4d18-4303-a9ce-3279458867a2   
11  da8ed472-c9f8-47eb-86a5-ad18b9ebafbd   
12  8ab817b8-3186-4c34-8ac0-b57777366842   
13  78db9829-598a-4f9b-9522-ce1962164aa0   
14  1e268502-7b59-4a7f-aafc-aa187f87188d   
15  d1422cf9-a1ec-451b-843b-8e2300f01b71   
16  e025bd80-699a-495c-95c8-2cdecef51f3f   
17  7ecc4a8e-8ddb-444f-8cbb-77fbffd0184d   
18  814ce667-931c-4693-9f25-927d1ecb1cd6   
19  aaf8e814-f746-4eff-bcee-904d7d7044e5   
20  d5515c7d-a821-44cd-9059-765df2823a5a   
21  3918e5ee-18f0-421

In [148]:
generated_records = []
for idx, row in df.iterrows():
    # Convert the row to a dict.
    input_item = row.to_dict()
    # Ensure the input has the key "question"
    if "question" not in input_item and "user_input" in input_item:
        input_item["question"] = input_item["user_input"]

    # Run your chain using the .invoke() method.
    output = finetune_rag_chain.invoke(input_item)

    # Update the record with generated outputs.
    input_item["response"] = output.get("response", "")

    # Process the "context" output to ensure we have a list of strings.
    ctx = output.get("context", "")
    if isinstance(ctx, list):
        # If it's a list, convert each Document to its text if necessary.
        processed_ctx = []
        for item in ctx:
            if hasattr(item, "page_content"):
                processed_ctx.append(item.page_content)
            else:
                processed_ctx.append(str(item))
    else:
        # If a single value, convert it appropriately.
        if hasattr(ctx, "page_content"):
            processed_ctx = [ctx.page_content]
        else:
            processed_ctx = [str(ctx)]

    input_item["retrieved_contexts"] = processed_ctx
    generated_records.append(input_item)

# Create a new DataFrame with the generated outputs.
df_generated = pd.DataFrame(generated_records)
print("DataFrame with Generated Outputs:")
print(df_generated)

DataFrame with Generated Outputs:
                                      id  \
0   dc63318e-e94d-4b13-b24d-80b92806425f   
1   96401b36-5881-41e8-9d53-dce58ff417e3   
2   c2a9cf20-150c-498b-aea6-6e3c6ade1fb8   
3   4d4c66ca-8f13-406e-9e93-1d71a74dd99e   
4   e7d8f869-2449-4be4-971b-b620a26dc390   
5   b8223543-f174-4804-aa65-5f71eb328ff1   
6   5b30709c-61f8-4424-9205-43cb616b0b88   
7   c500a394-57b1-470c-abae-6f961bcda8e8   
8   991032a7-bd9a-4ba2-a5a7-99d9982a2adc   
9   cd3ed50c-2c32-41e0-a426-0a7eef323327   
10  3477139b-4d18-4303-a9ce-3279458867a2   
11  da8ed472-c9f8-47eb-86a5-ad18b9ebafbd   
12  8ab817b8-3186-4c34-8ac0-b57777366842   
13  78db9829-598a-4f9b-9522-ce1962164aa0   
14  1e268502-7b59-4a7f-aafc-aa187f87188d   
15  d1422cf9-a1ec-451b-843b-8e2300f01b71   
16  e025bd80-699a-495c-95c8-2cdecef51f3f   
17  7ecc4a8e-8ddb-444f-8cbb-77fbffd0184d   
18  814ce667-931c-4693-9f25-927d1ecb1cd6   
19  aaf8e814-f746-4eff-bcee-904d7d7044e5   
20  d5515c7d-a821-44cd-9059-765df2823a5a  

In [149]:
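from ragas.llms import LangchainLLMWrapper

class LLMAdapter:
    """Thin wrapper that forwards calls to the wrapped LLM."""

    def __init__(self, llm):
        self.llm = llm

    def __call__(self, input_data):
        return self.llm(input_data)

    def set_run_config(self, run_config):
        # No-op: the run_config is ignored and the adapter itself is returned.
        return self

# Wrap the evaluator LLM.
# Note: adapter_llm is not used in the cells below; evaluation uses evaluator_llm.
adapter_llm = LLMAdapter(LangchainLLMWrapper(ChatOpenAI(model="gpt-4o")))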

In [154]:
from ragas import evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper

# Patch a no-op set_run_config onto LangchainLLMWrapper.
def set_run_config(self, run_config):
    return self

LangchainLLMWrapper.set_run_config = set_run_config

# Create the evaluator LLM (once, after the patch has been applied).
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

# Set up the run config and the metrics to compute.
# LLMContextRecall is imported above but not included in this list.
custom_run_config = RunConfig(timeout=360)
metrics = [
    Faithfulness(),
    FactualCorrectness(),
    ResponseRelevancy(),
    ContextEntityRecall(),
    NoiseSensitivity()
]

In [156]:
df_generated = df_generated.rename(columns={"question": "user_input"})

# Optionally, verify the columns:
print("Columns in df_generated:", df_generated.columns)

# Convert df_generated (a DataFrame) into an EvaluationDataset
evaluation_dataset = EvaluationDataset.from_pandas(df_generated)

# Now call evaluate() with the evaluation_dataset, not the DataFrame.
result = evaluate(
    dataset=evaluation_dataset,
    metrics=metrics,
    llm=evaluator_llm,  # your patched/wrapped LLM
    run_config=custom_run_config
)

print("Evaluation Results:")
print(result)

Columns in df_generated: Index(['id', 'user_input', 'reference', 'retrieved_contexts', 'response'], dtype='object')


Evaluating:   0%|          | 0/120 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[16]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
ERROR:ragas.executor:Exception raised in Job[6]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
ERROR:ragas.executor:Exception raised in Job[26]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
ERROR:ragas.executor:Exception raised in Job[36]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')
ERROR:ragas.executor:Exception raised in Job[46]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be

Evaluation Results:
{'faithfulness': 0.4815, 'factual_correctness': 0.1438, 'answer_relevancy': 0.3828, 'context_entity_recall': 0.1535, 'noise_sensitivity_relevant': 0.2001}


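Overall, the `finetune_rag_chain` did not score well on this evaluation. Faithfulness (0.4815) is the highest of the five metrics. Faithfulness measures how well the response is supported by the retrieved context, so a higher score indicates fewer hallucinated claims. This suggests that when the chain answers a domain-related question, it stays grounded in what it retrieved more often than it gets the other aspects right. Two caveats apply: several evaluation jobs raised a `TypeError`, so the affected scores were likely computed on fewer samples, and these numbers only become meaningful next to a baseline run of `base_rag_chain` evaluated in the same way.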
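Once a baseline run exists, the two sets of scores can be compared metric by metric. The sketch below is only an illustration: `compare_scores` is a hypothetical helper, the fine-tuned scores are copied from the output above, and the baseline values are placeholder zeros because no baseline evaluation is shown in this section.

```python
def compare_scores(baseline, finetuned):
    """Return {metric: finetuned - baseline} for metrics present in both dicts."""
    return {
        name: round(finetuned[name] - baseline[name], 4)
        for name in finetuned
        if name in baseline
    }

# Fine-tuned scores copied from the evaluation output above.
finetuned_scores = {
    "faithfulness": 0.4815,
    "factual_correctness": 0.1438,
    "answer_relevancy": 0.3828,
    "context_entity_recall": 0.1535,
    "noise_sensitivity_relevant": 0.2001,
}

# PLACEHOLDER: zeros stand in for the baseline until base_rag_chain is evaluated.
# These are not measured results and must be replaced with real scores.
baseline_scores = {name: 0.0 for name in finetuned_scores}

# Print the per-metric difference (fine-tuned minus baseline).
for metric, delta in compare_scores(baseline_scores, finetuned_scores).items():
    print(f"{metric}: {delta:+.4f}")
```

A positive difference means the fine-tuned chain scored higher on that metric. Whether higher is better depends on the metric, so each difference should be read against that metric's definition.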