<a href="https://colab.research.google.com/github/Keatnuxsuo/AIE6JIA/blob/main/09_Finetuning_Embeddings/Fine_tuning_Embedding_Models_for_RAG_Notebook.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tuning Embeddings for RAG on Specific Data

As we start our "fine-tuning" week, we'll begin with the lowest-hanging improvement you can make to a RAG system, which is:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
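
To make "closer together" concrete, here is a minimal sketch that measures closeness with cosine similarity. The 3-dimensional vectors and variable names are invented for illustration; real embeddings come from the model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings (made up for illustration only).
question = [1.0, 0.0, 0.0]         # "Who drives the bus?"
document_before = [0.0, 1.0, 0.0]  # document embedding before fine-tuning
document_after = [0.9, 0.1, 0.0]   # document embedding after fine-tuning

sim_before = cosine_similarity(question, document_before)
sim_after = cosine_similarity(question, document_after)
print(f"before: {sim_before:.3f}, after: {sim_after:.3f}")
```

A higher cosine similarity after training is what lets the retriever rank the right document above the others.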

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

Fine-tuning helps the model better capture the semantic relationships within our specific domain, which should improve retrieval accuracy and, in turn, the quality of responses. However, Q&D pairs and inter-document pairs teach the model different things.

- **Q&D pairs** pull user queries closer to the right context. The model learns to match the intent of a question to the content of its paired document, which suits retrieval and ranking in QA use cases.
- **Inter-document/related-sentence pairs** teach the model semantic similarity between two pieces of text (e.g. two documents, or two sentences drawn from the same document). This captures content-level similarity independent of any external query. It improves clustering, topic grouping, and finding related documents even when no explicit question is asked, and it suits jargon, domain-specific vocabulary, and cross-referencing.

Caveats and limitations
- A model fine-tuned only on Q -> D pairs may excel at retrieval but lose nuance in purely semantic tasks. Conversely, a model trained only on doc -> doc pairs might not learn the notion of intent that questions convey. Either way, the result is a mismatch between training and the task.
- We need hard negatives, especially for Q -> D pairs, so the model learns to separate near-miss documents from the correct one.
- If all questions follow a narrow template (“What is …?”, “How do I …?”), the model may struggle with more conversational or elliptical queries. This can be mitigated by varying the question styles, for example with the three query types in the RAGAS query distribution.





## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [2]:
!pip install -qU "langchain_openai>=0.3.4" "langchain_huggingface" "langchain_core>=0.3.34" "langchain>=0.3.18" "langchain_community>=0.3.17" "langchain-text-splitters>=0.3.6" "datasets>=3.2.0"

In [3]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

### Provide OpenAI API Key

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll prepare our data by downloading the webpages we'll be using today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir -p data

In [6]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31528    0 31528    0     0  36702      0 --:--:-- --:--:-- --:--:-- 36703


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70695    0 70695    0     0  58610      0 --:--:--  0:00:01 --:--:-- 58619


In [8]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.
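
As a rough picture of what `chunk_size` and `chunk_overlap` control, here is a simplified sliding-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers to break on separators such as paragraphs and sentences); the `chunk_text` helper is hypothetical and only illustrates the two parameters.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk when it is short enough.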

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

Next we can load/split these documents as follows.

> NOTE: You may need to run this cell twice to get it to work.

In [10]:
training_documents = text_splitter.split_documents(text_loader.load())

In [11]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [12]:
import uuid

id_set = set()

for document in training_documents:
  # Avoid shadowing the built-in `id`, and keep every id a string.
  doc_id = str(uuid.uuid4())
  while doc_id in id_set:
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [13]:
n_docs = len(training_documents)

# Last 24 chunks are held out: 12 for validation, 12 for test.
training_split_documents = training_documents[:n_docs - 24]
val_split_documents = training_documents[n_docs - 24:n_docs - 12]
test_split_documents = training_documents[n_docs - 12:]

## Task 3: Constructing a Fine-tuning Dataset

Using the document chunks we created above, we can finally start constructing a fine-tuning dataset using OpenAI's `gpt-4.1-mini`.

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [14]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4.1-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4.1-mini` to generate questions for each context.

In [15]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [16]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions.
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.
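
The steps above can be sketched without calling an LLM at all. The example below parses a numbered list (the format the prompt asks for) and links each question to its source document. The helper names, the sample output, and the `doc-123` id are all hypothetical.

```python
import re
import uuid

def parse_numbered_questions(llm_text):
    """Extract questions from lines shaped like '1. Some question?'."""
    questions = []
    for line in llm_text.splitlines():
        match = re.match(r"^\s*\d+\.\s*(.+)$", line)
        if match:
            questions.append(match.group(1).strip())
    return questions

def link_questions_to_document(llm_text, document_id):
    """Build question_id -> question and question_id -> [document_id]."""
    questions, relevant_docs = {}, {}
    for question in parse_numbered_questions(llm_text):
        question_id = str(uuid.uuid4())
        questions[question_id] = question
        relevant_docs[question_id] = [document_id]
    return questions, relevant_docs

# Hypothetical LLM output and document id, made up for illustration.
fake_llm_output = "1. What is GPT-4.1 used for?\n\n2. Who wrote the post?"
questions, relevant_docs = link_questions_to_document(fake_llm_output, "doc-123")
print(questions)
print(relevant_docs)
```

Matching only the leading number keeps periods inside the question intact and skips blank lines, two things that are easy to get wrong with plain string splitting.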

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object whose keys are question `id`s and whose values are the `str` questions.
- An object whose keys are `question_id`s and whose values are `List[str]`: the list of associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [17]:
import tqdm
import asyncio

"""
Sample Usage of TQDM:

for i in tqdm.tqdm(range(10)):
  time.sleep(1)
"""

async def process_document(document, n_questions):
    questions_generated = await question_generation_chain.ainvoke({"context": document.page_content, "n_questions": n_questions})

    doc_questions = {}
    doc_relevant_docs = {}

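    for line in questions_generated.content.split("\n"):
        line = line.strip()
        if not line:
            continue  # skip blank lines in the LLM output
        question_id = str(uuid.uuid4())
        # Drop only the leading "1." numbering; keep periods inside the question.
        doc_questions[question_id] = line.split(".", 1)[-1].strip()
        doc_relevant_docs[question_id] = [document.metadata["id"]]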

    return doc_questions, doc_relevant_docs

async def create_questions(documents, n_questions):
    tasks = [process_document(doc, n_questions) for doc in documents]

    questions = {}
    relevant_docs = {}

    for task in tqdm.tqdm(asyncio.as_completed(tasks), total=len(documents), desc="Processing documents"):
        doc_questions, doc_relevant_docs = await task
        questions.update(doc_questions)
        relevant_docs.update(doc_relevant_docs)

    return questions, relevant_docs

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [18]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Processing documents: 100%|██████████| 78/78 [00:06<00:00, 11.30it/s]


We'll use the function to generate training, validation, and test data.

In [19]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Processing documents: 100%|██████████| 12/12 [00:02<00:00,  5.18it/s]


In [20]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Processing documents: 100%|██████████| 12/12 [00:01<00:00,  6.09it/s]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

In [21]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [22]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [23]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)
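
Before training on a saved dataset, it can be worth checking that it is internally consistent. The sketch below is one possible check, assuming the same three-key layout (`questions`, `relevant_contexts`, `corpus`) used above; the `check_dataset` helper and the toy data are hypothetical.

```python
import json
import os
import tempfile

def check_dataset(dataset):
    """Return a list of problems found in a questions/relevant_contexts/corpus dataset."""
    problems = []
    for question_id, question in dataset["questions"].items():
        if not question.strip():
            problems.append(f"empty question: {question_id}")
        if question_id not in dataset["relevant_contexts"]:
            problems.append(f"no relevant context listed for {question_id}")
        for context_id in dataset["relevant_contexts"].get(question_id, []):
            if context_id not in dataset["corpus"]:
                problems.append(f"missing context {context_id} for {question_id}")
    return problems

# Hypothetical miniature dataset in the same shape as the saved ones.
toy_dataset = {
    "questions": {"q1": "Who drives the bus?", "q2": ""},
    "relevant_contexts": {"q1": ["d1"], "q2": ["d9"]},
    "corpus": {"d1": "The bus was driven by Kyle, the Bus Driver."},
}

# Round-trip through a temporary file to mirror the save/load cycle.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_dataset.json")
    with open(path, "w") as f:
        json.dump(toy_dataset, f)
    with open(path) as f:
        loaded = json.load(f)

problems = check_dataset(loaded)
print(problems)
```

Empty questions are a realistic failure mode here, since they can appear whenever the LLM output contains a line that does not follow the numbered format.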

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well-performing embeddings model by itself, but there are a lot of very specific domain terms and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

> NOTE: Skip installing dependencies if you are running this notebook locally.

In [24]:
!pip install -qU sentence_transformers pyarrow

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 25.2.1 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 20.0.0 which is incompatible.
pylibcudf-cu12 25.2.1 requires pyarrow<20.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 20.0.0 which is incompatible.

In [25]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [26]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [27]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [28]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [29]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [30]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!



**MultipleNegativesRankingLoss**
- This loss function teaches the model that each “anchor” text (think of this as your question or prompt) should be closest to its matching “positive” text (the document that answers it) and farther away from everything else in the batch. How it works:
  1. The model converts each piece of text (the questions, their matching documents, and any extra negatives) into a list of numbers (an “embedding”) using `SentenceTransformer`.
  2. For each question, it measures how close its embedding is to every document's embedding (using `cos_sim`).
  3. It learns from mistakes: the similarity scores are passed to `CrossEntropyLoss` with the matching document as the correct class, which nudges the right document closer and the wrong ones farther away.
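
A minimal pure-Python sketch of this idea, using in-batch negatives only, is shown below. It is an illustration rather than the library's implementation; the `scale` value of 20.0 and the 2-dimensional toy embeddings are assumptions made for the example.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where row i's correct 'class' is positive i."""
    losses = []
    for i, anchor in enumerate(anchors):
        # Scaled similarity of this anchor to every positive in the batch.
        scores = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the scores.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        # Cross-entropy for the matching pair: -log softmax(scores)[i].
        losses.append(log_sum_exp - scores[i])
    return sum(losses) / len(losses)

# Hypothetical 2-D embeddings for a batch of two (question, document) pairs.
questions = [[1.0, 0.0], [0.0, 1.0]]
well_aligned_docs = [[0.9, 0.1], [0.1, 0.9]]
swapped_docs = [[0.1, 0.9], [0.9, 0.1]]

good_loss = multiple_negatives_ranking_loss(questions, well_aligned_docs)
bad_loss = multiple_negatives_ranking_loss(questions, swapped_docs)
print(f"aligned: {good_loss:.4f}, swapped: {bad_loss:.4f}")
```

Every other document in the batch acts as a negative for a given question, which is why larger batches give this loss more to learn from.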

**MatryoshkaLoss**
- This is a wrapper that takes a full-length embedding (say 768 numbers), truncates it to a shorter length (say 256 numbers), and re-normalises it so the shorter vector is still usable as an embedding. Naively, we'd run the text through the model separately for every embedding size we want, but that's slow. Instead, the wrapper intercepts the model’s “forward” call and caches its output the first time. Each additional size re-uses that cached result and shrinks it, rather than running the full model again.
- Whatever inner loss we're using, this wrapper loops over each target size (e.g. 768, 512, 256, 128, ...), shrinks the embeddings to that size, computes the loss at that size, multiplies it by the chosen weight, and adds the results up into one final number.


When using `MultipleNegativesRankingLoss` and `MatryoshkaLoss` together, we get in-batch negative training at multiple vector sizes in one go. The model learns to solve the retrieval task not just at 768 dimensions but also at 512, 256, etc. At inference, we can pick whichever embedding size we need, trading (slightly) lower accuracy for speed and memory, without retraining a new model for each size.
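
The truncate, normalise, and sum pattern can be sketched in a few lines. This is a simplified illustration, not the library code: `mean_cosine_distance` is a stand-in inner loss chosen to keep the example short (it is not `MultipleNegativesRankingLoss`), and the 4-dimensional vectors are made up.

```python
import math

def truncate_and_normalize(vector, dim):
    """Keep the first `dim` values and rescale to unit length."""
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head)) or 1.0  # guard against all-zero input
    return [x / norm for x in head]

def mean_cosine_distance(anchors, positives):
    """Stand-in inner loss: average (1 - cosine) over matching unit-length pairs."""
    distances = [1.0 - sum(x * y for x, y in zip(a, p)) for a, p in zip(anchors, positives)]
    return sum(distances) / len(distances)

def matryoshka_loss(anchors, positives, dims, base_loss, weights=None):
    """Weighted sum of `base_loss` evaluated on truncated embeddings."""
    weights = weights or [1.0] * len(dims)
    total = 0.0
    for dim, weight in zip(dims, weights):
        short_anchors = [truncate_and_normalize(a, dim) for a in anchors]
        short_positives = [truncate_and_normalize(p, dim) for p in positives]
        total += weight * base_loss(short_anchors, short_positives)
    return total

# Hypothetical 4-D embeddings for a single (question, document) pair.
anchors = [[1.0, 0.0, 0.0, 0.0]]
positives = [[1.0, 1.0, 0.0, 1.0]]

full_only = matryoshka_loss(anchors, positives, dims=[4], base_loss=mean_cosine_distance)
nested = matryoshka_loss(anchors, positives, dims=[4, 2], base_loss=mean_cosine_distance)
print(f"full only: {full_only:.4f}, full + truncated: {nested:.4f}")
```

Because the truncated sizes contribute to the same total, the optimizer is pushed to pack the most useful information into the leading dimensions.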

Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [31]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 10 epochs, though you could increase this number if you had significantly more data.

In [32]:
EPOCHS = 10

It's training time!
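
The training call in this section sets `warmup_steps` to 10% of the total number of optimizer steps. The sketch below reproduces that arithmetic and shows a linear ramp. It assumes each of the 78 training chunks yielded exactly 2 questions (156 examples), and it only models the ramp itself; what happens after warm-up depends on the scheduler. Both helper names are hypothetical.

```python
import math

def compute_warmup_steps(n_examples, batch_size, epochs, warmup_fraction=0.1):
    """Number of optimizer steps spent ramping the learning rate up."""
    steps_per_epoch = math.ceil(n_examples / batch_size)  # batches per epoch, last one may be partial
    return int(steps_per_epoch * epochs * warmup_fraction)

def warmup_multiplier(step, warmup_steps):
    """Fraction of the target learning rate to use at a given step (linear ramp)."""
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, step / warmup_steps)

# 78 training chunks x 2 questions each = 156 examples, batch size 10, 10 epochs.
warmup_steps = compute_warmup_steps(n_examples=156, batch_size=10, epochs=10)
schedule = [round(warmup_multiplier(step, warmup_steps), 3) for step in (0, 4, 8, 16, 32)]
print(warmup_steps, schedule)
```

Starting with a small learning rate avoids large, noisy updates to the pretrained weights during the first few batches.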

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [33]:
import wandb
wandb.init(mode="disabled")

> NOTE: You may not see direct improvement during the training cycles - this is absolutely expected. We will verify performance later in the notebook.

In [34]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)




| Step | Training Loss | Validation Loss | Cosine Accuracy@1 | Cosine Accuracy@3 | Cosine Accuracy@5 | Cosine Accuracy@10 | Cosine Precision@1 | Cosine Precision@3 | Cosine Precision@5 | Cosine Precision@10 | Cosine Recall@1 | Cosine Recall@3 | Cosine Recall@5 | Cosine Recall@10 | Cosine Ndcg@10 | Cosine Mrr@10 | Cosine Map@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
| 32 | No log | No log | 0.916667 | 1.0 | 1.0 | 1.0 | 0.916667 | 0.333333 | 0.2 | 0.1 | 0.916667 | 1.0 | 1.0 | 1.0 | 0.969244 | 0.958333 | 0.958333 |
| 48 | No log | No log | 0.916667 | 1.0 | 1.0 | 1.0 | 0.916667 | 0.333333 | 0.2 | 0.1 | 0.916667 | 1.0 | 1.0 | 1.0 | 0.969244 | 0.958333 | 0.958333 |
| 50 | No log | No log | 0.875 | 1.0 | 1.0 | 1.0 | 0.875 | 0.333333 | 0.2 | 0.1 | 0.875 | 1.0 | 1.0 | 1.0 | 0.953866 | 0.9375 | 0.9375 |
| 64 | No log | No log | 0.833333 | 1.0 | 1.0 | 1.0 | 0.833333 | 0.333333 | 0.2 | 0.1 | 0.833333 | 1.0 | 1.0 | 1.0 | 0.938488 | 0.916667 | 0.916667 |
| 80 | No log | No log | 0.875 | 1.0 | 1.0 | 1.0 | 0.875 | 0.333333 | 0.2 | 0.1 | 0.875 | 1.0 | 1.0 | 1.0 | 0.953866 | 0.9375 | 0.9375 |
| 96 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
| 100 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
| 112 | No log | No log | 0.958333 | 1.0 | 1.0 | 1.0 | 0.958333 | 0.333333 | 0.2 | 0.1 | 0.958333 | 1.0 | 1.0 | 1.0 | 0.984622 | 0.979167 | 0.979167 |
| 128 | No log | No log | 0.916667 | 1.0 | 1.0 | 1.0 | 0.916667 | 0.333333 | 0.2 | 0.1 | 0.916667 | 1.0 | 1.0 | 1.0 | 0.969244 | 0.958333 | 0.958333 |
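
The metric columns in the training log are easier to read with their definitions in hand. The sketch below gives simplified per-query versions for the case used in this notebook, where each question has exactly one relevant document; the function names and the toy ranking are hypothetical.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    return sum(doc_id in relevant_ids for doc_id in ranked_ids[:k]) / k

def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant document in the top k, or 0.0 if absent."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Hypothetical ranking for one query whose only relevant document is "d2".
ranked = ["d7", "d2", "d5", "d1"]
relevant = {"d2"}

print(accuracy_at_k(ranked, relevant, k=1))
print(accuracy_at_k(ranked, relevant, k=3))
print(precision_at_k(ranked, relevant, k=3))
print(reciprocal_rank_at_k(ranked, relevant, k=10))
```

With a single relevant document per query, Precision@k can never exceed 1/k, which is why the log shows 0.333333, 0.2, and 0.1 even when Accuracy@3 is 1.0. Likewise, an MRR@10 of 0.979167 is consistent with 23 of the 24 validation queries ranking their document first and one ranking it second.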


In [35]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [36]:
hf_username = "DiamondCutter88"

In [None]:
import uuid

model.push_to_hub(f"{hf_username}/legal-ft-{uuid.uuid4()}")

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [37]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [38]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results
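
The `is_hit` records are later averaged into a single hit rate per model. The sketch below shows that aggregation on toy data and how the result depends on `top_k`; the `hit_rate_at_k` helper, the question ids, and the rankings are hypothetical.

```python
def hit_rate_at_k(rankings, expected, k):
    """Share of questions whose expected document id is among the top k retrieved ids."""
    hits = [expected[question_id] in ranked_ids[:k] for question_id, ranked_ids in rankings.items()]
    return sum(hits) / len(hits)

# Hypothetical retrieval results: question id -> ranked document ids.
rankings = {
    "q1": ["d1", "d4", "d3"],
    "q2": ["d5", "d2", "d1"],
    "q3": ["d9", "d8", "d7"],
}
expected = {"q1": "d1", "q2": "d2", "q3": "d3"}

rate_at_1 = hit_rate_at_k(rankings, expected, k=1)
rate_at_3 = hit_rate_at_k(rankings, expected, k=3)
print(f"hit rate @1: {rate_at_1:.3f}, @3: {rate_at_3:.3f}")
```

A larger `top_k` can only keep the hit rate the same or raise it, so models should be compared at the same `top_k`.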

All that's left to do is evaluate. We'll compare our fine-tuned model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [39]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 24/24 [00:11<00:00,  2.05it/s]


In [40]:
te3_results_df = pd.DataFrame(te3_results)

In [None]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

np.float64(1.0)

### `Snowflake/snowflake-arctic-embed-l` (base)

In [41]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 24/24 [00:00<00:00, 41.60it/s]


In [42]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [43]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

np.float64(0.8333333333333334)

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [44]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 24/24 [00:00<00:00, 41.60it/s]


In [45]:
finetune_results_df = pd.DataFrame(finetune_results)

In [46]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

np.float64(0.9583333333333334)

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [47]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the base (non-fine-tuned) embedding model for retrieval.

#### R - Retrieval

In [48]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [49]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [50]:
rag_llm =  ChatOpenAI(
    model="gpt-4.1-nano",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [51]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)
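
For readers new to LCEL, the data flow of the chain can be mirrored with plain functions. This is a sketch of the flow only, not LangChain code: `fake_retriever`, `fake_llm`, and the shortened prompt are hypothetical stand-ins for the FAISS retriever, the chat model, and `RAG_PROMPT`.

```python
def fake_retriever(question):
    """Hypothetical stand-in for a vector-store retriever (keyword match)."""
    corpus = {
        "agent": "The term agent lacks a single clear definition.",
        "phone": "Mistral 7B can run on an iPhone.",
    }
    return [text for key, text in corpus.items() if key in question.lower()]

def fake_llm(prompt):
    """Hypothetical stand-in for a chat model: reports the prompt length."""
    return f"(answer generated from a {len(prompt)}-character prompt)"

PROMPT = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:\n"

def rag_chain(inputs):
    # Step 1: fetch context for the question and carry the question forward.
    state = {"context": fake_retriever(inputs["question"]), "question": inputs["question"]}
    # Step 2: format the prompt, call the model, and return the context with the response.
    prompt = PROMPT.format(context=state["context"], question=state["question"])
    return {"response": fake_llm(prompt), "context": state["context"]}

result = rag_chain({"question": "What is an agent?"})
print(result["response"])
print(result["context"])
```

Returning the retrieved context alongside the response makes it possible to inspect what the retriever found when an answer looks wrong.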

In [52]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'Based on the provided context, an "agent" in the context of AI refers to systems that can act on your behalf, such as travel agents or digital assistants. However, the term is highly vague and lacks a clear, universally accepted definition. Different people interpret "agents" differently—some see them as systems that go and act independently, while others think of them as LLMs with access to tools that can be used in loops to solve problems. The discussion also highlights skepticism about the current utility of such agents, mainly due to issues like gullibility and the difficulty in distinguishing truth from fiction. Overall, "agent" is a loosely defined term that generally refers to AI systems capable of performing tasks or making decisions on behalf of users, but its precise meaning varies and remains somewhat ambiguous.'

In [53]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Several organizations have produced models that are better than GPT-3, including Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, and Baidu.'

In [54]:
base_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'The provided context does not specify or mention the "laziest time of the year for AI."'

In [55]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The provided context does not specify the name "Simon" or details about the largest model he has run on his phone. Therefore, I do not know the answer.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [56]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [57]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [58]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'Based on the provided context, an "agent" is a term that is used in various ways but remains vague and lacks a clear, universally accepted definition. Some interpret agents as systems that can act on your behalf, similar to a travel agent or digital assistant, while others see them as LLMs with access to tools that can operate in loops to solve problems. However, the term "agent" is often associated with concepts like autonomy and acting independently, but without a precise or consistent meaning. The context also suggests skepticism about the current utility of agents, citing issues like gullibility and the difficulty of making meaningful decisions if the system cannot reliably distinguish truth from fiction.'

In [59]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Several organizations have produced models that are better than GPT-3, including Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII (Abu Dhabi), Microsoft Research, xAI, Replit, and Baidu.'

In [60]:
finetune_rag_chain.invoke({"question" : "What is the laziest time of the year for AI?"})["response"]

'I do not know.'

In [61]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The provided context mentions that Simon runs Mistral 7B on his iPhone. There is no information indicating that he has run a larger model than 7B on his phone. Therefore, the largest model Simon has run on his phone is Mistral 7B.'

#### ❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

Overall, the **finetuned RAG chain** gives the stronger answers, especially on domain-specific queries, while the base chain fell back to "I do not know" on a question the corpus can actually answer. Here’s a quick breakdown:

* **“What is an agent?”**
  Both chains give almost identical, thorough definitions, noting vagueness and the spectrum from travel-agent analogies to tool-enabled LLM loops.

* **“Who has produced better models than GPT-3?”**
  Both chains list the same set of organizations. The base chain even adds a parenthetical “Falcon” for TII, but otherwise they match.

* **“What is the laziest time of the year for AI?”**

  * *Finetuned:* “I do not know.”
  * *Base:* “The provided context does not specify or mention…”
    The base chain edges out here with a more informative refusal.

* **“What is the largest model that Simon has run on his phone?”**

  * *Finetuned:* Correctly recalls “Mistral 7B” from context and reports it.
  * *Base:* Claims no mention of Simon in context and says “I do not know.”

Because the finetuned chain retrieved and surfaced the specific detail about **Simon running Mistral 7B**, which the base chain missed entirely, it wins overall. The only component that changed between the two chains is the vectorstore, so the difference comes from retrieval. The takeaway:

* **Base chain** answered "I do not know" on the Simon question because its retriever did not surface the relevant passage, not because the corpus lacks the answer.
* **Finetuned chain** better captures fine-grained, custom facts embedded in your corpus, making it more useful for questions where you *do* have that specialized information.
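
A vibe check like this is easier to read when the answers are collected side by side. The following is a minimal sketch rather than the notebook's own code: `compare_answers` is a hypothetical helper, and the two lambda "chains" are stand-ins so the block runs on its own. In the notebook, each callable would presumably wrap a real chain, for example `lambda q: base_rag_chain.invoke({"question": q})["response"]`.

```python
from typing import Callable, Dict, List


def compare_answers(
    chains: Dict[str, Callable[[str], str]],
    questions: List[str],
) -> List[Dict[str, str]]:
    """Ask every chain every question and return one row per question."""
    rows = []
    for question in questions:
        row = {"question": question}
        for name, ask in chains.items():
            row[name] = ask(question)
        rows.append(row)
    return rows


# Stand-in callables (assumption): replace with wrappers around the real chains.
demo_chains = {
    "base": lambda q: f"base answer to: {q}",
    "finetuned": lambda q: f"finetuned answer to: {q}",
}

questions = [
    "What is an Agent?",
    "What is the largest model that Simon has run on his phone?",
]

comparison = compare_answers(demo_chains, questions)
for row in comparison:
    print(row["question"])
    print("  base:     ", row["base"])
    print("  finetuned:", row["finetuned"])
```

Keeping the rows as plain dictionaries makes it easy to pass them to `pandas.DataFrame` later if a tabular view is preferred.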


## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight into how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. This includes the Synthetic Data Generation steps.
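
The baseline-then-compare workflow can be sketched in plain Python. This is an illustrative sketch under assumptions, not the RAGAS API: `run_chain_on_testset` is a hypothetical helper, test samples are plain dicts rather than RAGAS objects, and the stub chains return their context as a list of strings (the notebook's chains return documents with a `page_content` attribute). The point it illustrates is that each chain fills in its own copy of the test set, so the second run does not overwrite the first.

```python
import copy
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]


def run_chain_on_testset(
    samples: List[Sample],
    invoke: Callable[[str], Dict[str, Any]],
) -> List[Sample]:
    """Return a copy of `samples` with `response` and `retrieved_contexts` filled in."""
    filled = copy.deepcopy(samples)  # leave the shared test set untouched
    for sample in filled:
        output = invoke(sample["user_input"])
        sample["response"] = output["response"]
        sample["retrieved_contexts"] = list(output["context"])
    return filled


# Stub chains (assumption): stand-ins for the base and fine-tuned RAG chains.
def base_stub(question: str) -> Dict[str, Any]:
    return {"response": "I do not know.", "context": ["unrelated passage"]}


def finetuned_stub(question: str) -> Dict[str, Any]:
    return {"response": "Mistral 7B", "context": ["Simon runs Mistral 7B on his iPhone"]}


testset = [
    {
        "user_input": "What is the largest model that Simon has run on his phone?",
        "reference": "placeholder reference answer",
    }
]

base_run = run_chain_on_testset(testset, base_stub)
finetuned_run = run_chain_on_testset(testset, finetuned_stub)
```

Each run can then be converted to an evaluation dataset and scored separately, which keeps the baseline results available for comparison.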

In [63]:
!pip install -qU ragas==0.2.10

In [64]:
!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

In [100]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [95]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [96]:
dataset.to_pandas()

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | OpenAI only one with LLM before? | [We don’t yet know how to build GPT-4 Vibes Ba... | A year ago, the only organization that had rel... | single_hop_specifc_query_synthesizer |
| 1 | How are large language models (LLMs) performin... | [I’m surprised that no-one has beaten the now ... | Over the course of the year, it has become inc... | single_hop_specifc_query_synthesizer |
| 2 | Wut is a Weblog and how duz Simon Willison's W... | [Simon Willison’s Weblog Subscribe Stuff we fi... | Simon Willison’s Weblog is a platform where he... | single_hop_specifc_query_synthesizer |
| 3 | what chatgpt do for me, like how people use it... | [Microsoft over this issue. The 69 page PDF is... | ChatGPT show up a lot in the blog, with 78 pos... | single_hop_specifc_query_synthesizer |
| 4 | why llms so black box and like, why we can’t t... | `[<1-hop>\n\nI’m surprised that no-one has beat...` | llms is like, super black box, nobody really k... | multi_hop_abstract_query_synthesizer |
| 5 | why is it so hard to know what llms do and how... | `[<1-hop>\n\nI’m surprised that no-one has beat...` | it hard to know what llms do and how they work... | multi_hop_abstract_query_synthesizer |
| 6 | Given the ongoing limitations and opacity of l... | `[<1-hop>\n\nI’m surprised that no-one has beat...` | The context highlights that LLMs remain highly... | multi_hop_abstract_query_synthesizer |
| 7 | how u fine-tune big llms if u cant just dump a... | `[<1-hop>\n\nby DeepSeek-R1. Meta’s Llama 3.3 7...` | u fine-tune big llms like Llama 3.3 70B by usi... | multi_hop_abstract_query_synthesizer |
| 8 | How has Apple’s MLX library impacted the abili... | `[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...` | Apple’s MLX library has significantly improved... | multi_hop_specific_query_synthesizer |
| 9 | How has Anthropic contributed to the commoditi... | `[<1-hop>\n\ndid. These abilities are just a fe...` | Anthropic has played a significant role in the... | multi_hop_specific_query_synthesizer |


# Retrieval

In [97]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

In [101]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

74

**Evaluating Base Chain with RAGAS**

In [102]:
for test_row in dataset:
  response = base_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [103]:
dataset.to_pandas()

| | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | OpenAI only one with LLM before? | [Prompt injection explained, with video, slide... | [We don’t yet know how to build GPT-4 Vibes Ba... | Based on the provided context, it was mentione... | A year ago, the only organization that had rel... | single_hop_specifc_query_synthesizer |
| 1 | How are large language models (LLMs) performin... | [Longer inputs dramatically increase the scope... | [I’m surprised that no-one has beaten the now ... | Large language models (LLMs) are quite effecti... | Over the course of the year, it has become inc... | single_hop_specifc_query_synthesizer |
| 2 | Wut is a Weblog and how duz Simon Willison's W... | [Prompt injection explained, with video, slide... | [Simon Willison’s Weblog Subscribe Stuff we fi... | A weblog, or blog, is an online platform where... | Simon Willison’s Weblog is a platform where he... | single_hop_specifc_query_synthesizer |
| 3 | what chatgpt do for me, like how people use it... | [Meanwhile, it’s increasingly common for end u... | [Microsoft over this issue. The 69 page PDF is... | Based on the provided information, ChatGPT is ... | ChatGPT show up a lot in the blog, with 78 pos... | single_hop_specifc_query_synthesizer |
| 4 | why llms so black box and like, why we can’t t... | [I get it. There are plenty of reasons to disl... | `[<1-hop>\n\nI’m surprised that no-one has beat...` | LLMs are considered black boxes because, despi... | llms is like, super black box, nobody really k... | multi_hop_abstract_query_synthesizer |
| 5 | why is it so hard to know what llms do and how... | [Even the openly licensed ones are still the w... | `[<1-hop>\n\nI’m surprised that no-one has beat...` | It is so hard to know what LLMs do and how the... | it hard to know what llms do and how they work... | multi_hop_abstract_query_synthesizer |
| 6 | Given the ongoing limitations and opacity of l... | [Just this week, the New York Times launched a... | `[<1-hop>\n\nI’m surprised that no-one has beat...` | The challenges in evaluating and understanding... | The context highlights that LLMs remain highly... | multi_hop_abstract_query_synthesizer |
| 7 | how u fine-tune big llms if u cant just dump a... | [Another common technique is to use larger mod... | `[<1-hop>\n\nby DeepSeek-R1. Meta’s Llama 3.3 7...` | You fine-tune big LLMs by using carefully desi... | u fine-tune big llms like Llama 3.3 70B by usi... | multi_hop_abstract_query_synthesizer |
| 8 | How has Apple’s MLX library impacted the abili... | [While MLX is a game changer, Apple’s own “App... | `[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...` | Apple’s MLX library has significantly improved... | Apple’s MLX library has significantly improved... | multi_hop_specific_query_synthesizer |
| 9 | How has Anthropic contributed to the commoditi... | `[The GPT-4 barrier was comprehensively broken\...` | `[<1-hop>\n\ndid. These abilities are just a fe...` | Based on the provided context, Anthropic has c... | Anthropic has played a significant role in the... | multi_hop_specific_query_synthesizer |


Converting the dataframe into EvaluationDataset


In [104]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

Using gpt-4o-mini as a judge for the metrics

In [105]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

In [106]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.6165, 'faithfulness': 0.7160, 'factual_correctness': 0.6183, 'answer_relevancy': 0.4517, 'context_entity_recall': 0.2388, 'noise_sensitivity_relevant': 0.1588}

**Evaluating finetuned RAG with RAGAS**

In [107]:
for test_row in dataset:
  response = finetune_rag_chain.invoke({"question": test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [108]:
dataset.to_pandas()

| | user_input | retrieved_contexts | reference_contexts | response | reference | synthesizer_name |
|---|---|---|---|---|---|---|
| 0 | OpenAI only one with LLM before? | [If you can gather the right data, and afford ... | [We don’t yet know how to build GPT-4 Vibes Ba... | Based on the provided context, it was stated t... | A year ago, the only organization that had rel... | single_hop_specifc_query_synthesizer |
| 1 | How are large language models (LLMs) performin... | `[Large Language Models\nThey’re actually quite...` | [I’m surprised that no-one has beaten the now ... | The provided documents indicate that large lan... | Over the course of the year, it has become inc... | single_hop_specifc_query_synthesizer |
| 2 | Wut is a Weblog and how duz Simon Willison's W... | `[Stuff we figured out about AI in 2023\n\n\n\n...` | [Simon Willison’s Weblog Subscribe Stuff we fi... | A weblog, commonly known as a blog, is an onli... | Simon Willison’s Weblog is a platform where he... | single_hop_specifc_query_synthesizer |
| 3 | what chatgpt do for me, like how people use it... | [Except... you can run generated code to see i... | [Microsoft over this issue. The 69 page PDF is... | Based on the provided context, ChatGPT is an a... | ChatGPT show up a lot in the blog, with 78 pos... | single_hop_specifc_query_synthesizer |
| 4 | why llms so black box and like, why we can’t t... | [I get it. There are plenty of reasons to disl... | `[<1-hop>\n\nI’m surprised that no-one has beat...` | The reason LLMs are considered black boxes is ... | llms is like, super black box, nobody really k... | multi_hop_abstract_query_synthesizer |
| 5 | why is it so hard to know what llms do and how... | [Even the openly licensed ones are still the w... | `[<1-hop>\n\nI’m surprised that no-one has beat...` | It is so hard to know what LLMs do and how the... | it hard to know what llms do and how they work... | multi_hop_abstract_query_synthesizer |
| 6 | Given the ongoing limitations and opacity of l... | [Just this week, the New York Times launched a... | `[<1-hop>\n\nI’m surprised that no-one has beat...` | The challenges in evaluating and understanding... | The context highlights that LLMs remain highly... | multi_hop_abstract_query_synthesizer |
| 7 | how u fine-tune big llms if u cant just dump a... | [Another common technique is to use larger mod... | `[<1-hop>\n\nby DeepSeek-R1. Meta’s Llama 3.3 7...` | To fine-tune big LLMs, you don't need to dump ... | u fine-tune big llms like Llama 3.3 70B by usi... | multi_hop_abstract_query_synthesizer |
| 8 | How has Apple’s MLX library impacted the abili... | [Last year it felt like my lack of a Linux/Win... | `[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...` | Apple’s MLX library has significantly improved... | Apple’s MLX library has significantly improved... | multi_hop_specific_query_synthesizer |
| 9 | How has Anthropic contributed to the commoditi... | `[The GPT-4 barrier was comprehensively broken\...` | `[<1-hop>\n\ndid. These abilities are just a fe...` | Anthropic has contributed to the advancement o... | Anthropic has played a significant role in the... | multi_hop_specific_query_synthesizer |


In [110]:
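# Rebuild the evaluation dataset so it reflects the fine-tuned chain's
# responses rather than the base-chain snapshot created earlier.
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result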

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[41]: TimeoutError()


{'context_recall': 0.6611, 'faithfulness': 0.7201, 'factual_correctness': 0.6217, 'answer_relevancy': 0.6055, 'context_entity_recall': 0.3258, 'noise_sensitivity_relevant': 0.1316}

The fine-tuned chain scores better on every metric, most notably answer relevancy and context entity recall. The differences below are absolute changes in score, expressed in percentage points.

Noise sensitivity dropped as well, which suggests more focused responses even when distractors are present.

The uplifts in faithfulness and factual correctness are marginal. With only 10 test samples, and one evaluation job timing out in the fine-tuned run, differences this small should be read as indicative rather than conclusive.

**Context Recall (+4.5 points)**
The fine-tuned retriever returns contexts that cover a larger share of the reference answer, so responses can be better grounded in the source material.

**Faithfulness (+0.4 points) & Factual Correctness (+0.3 points)**
Small gains suggest the fine-tuned chain sticks slightly closer to the retrieved facts and is marginally more accurate.

**Answer Relevancy (+15.4 points)**
This is the biggest jump: answers from the fine-tuned chain align much better with the user’s query, reflecting improved question–document alignment.

**Context Entity Recall (+8.7 points)**
The contexts retrieved by the fine-tuned chain contain more of the key entities that appear in the reference.

**Noise Sensitivity (Relevant) (-2.7 points)**
Lower is better here: the fine-tuned chain is less thrown off by irrelevant or noisy content, which points to better robustness.
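
The per-metric differences quoted above can be reproduced directly from the two result dictionaries. This is a small sketch using the scores printed by the two `evaluate` runs; the variable names `base_scores`, `finetuned_scores`, and `deltas_pp` are illustrative and not part of the notebook.

```python
# Scores copied from the two RAGAS evaluation outputs in this notebook.
base_scores = {
    "context_recall": 0.6165,
    "faithfulness": 0.7160,
    "factual_correctness": 0.6183,
    "answer_relevancy": 0.4517,
    "context_entity_recall": 0.2388,
    "noise_sensitivity_relevant": 0.1588,
}
finetuned_scores = {
    "context_recall": 0.6611,
    "faithfulness": 0.7201,
    "factual_correctness": 0.6217,
    "answer_relevancy": 0.6055,
    "context_entity_recall": 0.3258,
    "noise_sensitivity_relevant": 0.1316,
}

# Absolute change in percentage points (score difference * 100).
deltas_pp = {
    metric: round((finetuned_scores[metric] - base_scores[metric]) * 100, 1)
    for metric in base_scores
}

for metric, delta in deltas_pp.items():
    print(f"{metric:28s} {base_scores[metric]:.4f} -> {finetuned_scores[metric]:.4f} ({delta:+.1f} pp)")
```

Reporting the change in percentage points avoids confusion with relative change: for example, answer relevancy rises by 15.4 points, which is a much larger relative increase over its base value of 0.4517.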