# Fine-tuning Embeddings for RAG on Specific Data

As we start our "fine-tuning" week, we'll start with the lowest hanging improvement one can do for RAG - which is:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.

#####❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model at the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [2]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/54.9 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m54.9/54.9 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/1.0 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m1.0/1.0 MB[0m [31m30.9 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.0/1.0 MB[0m [31m22.6 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/2.5 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m78.8 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.2/1.2 MB[0m [31m69.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [

In [3]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m472.8/472.8 kB[0m [31m8.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m30.7/30.7 MB[0m [31m70.2 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m20.0/20.0 MB[0m [31m97.1 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m165.1/165.1 kB[0m [31m15.8 MB/s[0m eta [36m0:00:00[0m
[?25h

### Provide OpenAI API Key

In [4]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [5]:
!mkdir data

In [6]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 31392    0 31392    0     0   201k      0 --:--:-- --:--:-- --:--:--  201k


In [7]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0100 70292    0 70292    0     0   501k      0 --:--:-- --:--:-- --:--:--  504k


In [8]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

Next we can load/split these documents as follows.

>> NOTE: You may need to run this cell twice to get it to work.

In [10]:
training_documents = text_splitter.split_documents(text_loader.load())

In [11]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [12]:
import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = uuid.uuid4()
  id_set.add(id)
  document.metadata["id"] = id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [13]:
training_split_documents = training_documents[:len(training_documents) - 24]
val_split_documents = training_documents[len(training_documents) - 24:102-12]
test_split_documents = training_documents[102-12:]

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [today](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [14]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [15]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [16]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. We, for each document in our list, generate `n_questions` of questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object with key `id`, which have values `str` questions.
- An object with key `question_id`, which have values `List(str)` which will be a list of associated `context_id`.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [31]:
import uuid
import tqdm

async def create_questions(documents, n_questions):
    questions = {}
    relevant_docs = {}

    for document in tqdm.tqdm(documents, desc="Generating Questions"):
        document_content = {"context": document.page_content, "questions": []}
        questions_generated = question_generation_chain.invoke({"context": document.page_content, "n_questions": n_questions})

        for question in questions_generated.content.split("\n"):
            if question.strip():  # Ensure question is not empty
                question_id = str(uuid.uuid4())
                question_text = question.split(".", 1)[-1].strip()  # Extract text after first period

                if question_text:  # Ensure non-empty question text
                    questions[question_id] = question_text
                    relevant_docs[question_id] = [document.metadata["id"]]
                    print("Generated Questions:", questions)
                    print("Relevant Docs:", relevant_docs)

    return questions, relevant_docs

    await questions, relevant_docs


### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [24]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Generating Questions: 100%|██████████| 78/78 [01:41<00:00,  1.30s/it]

question: 2. How did the construction of railways in the 1800s impact the environment?
relevant_docs: {'f0750ebf-11cb-43b1-a6cd-38342c4ce088': ['c05fa301-c692-495a-919b-5de8d2022b8b'], '33946ef2-7026-4b05-887d-5890d4db7d0d': ['c05fa301-c692-495a-919b-5de8d2022b8b'], '6eb701ce-5cce-4cf4-b02b-5bee4fef2cd7': ['e7645232-3727-4990-92ac-9ea1665522c5'], '644f7855-84ab-4af8-acc9-2cb4842098b3': ['e7645232-3727-4990-92ac-9ea1665522c5'], 'c7b5186e-7a99-4df4-bc49-b03760eda87b': ['5fd31f84-e928-4d4d-b59c-a870920d23ad'], '264f3434-6221-4b39-8f61-1f8e4244902a': ['5fd31f84-e928-4d4d-b59c-a870920d23ad'], '886b2d00-fbbc-478c-8190-85074f15128b': ['2d3eeba6-d3e2-4ccd-a5dd-2cdaadc61b5d'], 'b127ce27-eba3-486a-93d0-aacb958ad6db': ['2d3eeba6-d3e2-4ccd-a5dd-2cdaadc61b5d'], '53b13773-bb68-4fad-b8ae-f3a4fcd38217': ['f5d5cd27-ecc5-45af-a201-e2f0d909696d'], 'ae217771-d61b-4194-8762-f7e0016be5f9': ['f5d5cd27-ecc5-45af-a201-e2f0d909696d'], '62d3594e-c135-4c79-877b-bbf1bcf67dbe': ['6a8baf70-4b5c-40ae-b61a-3b0add80f00




We'll use the function to generate training, validation, and test data.

In [28]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Generating Questions: 100%|██████████| 12/12 [00:12<00:00,  1.04s/it]


In [32]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Generating Questions:   8%|▊         | 1/12 [00:00<00:09,  1.12it/s]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?'}
Relevant Docs: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': ['34b7b7e2-f365-4522-85ad-7326990a24c3']}
Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?'}
Relevant Docs: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': ['34b7b7e2-f365-4522-85ad-7326990a24c3'], '457f76ed-453e-44dc-9f29-c8e157d733b9': ['34b7b7e2-f365-4522-85ad-7326990a24c3']}


Generating Questions:  17%|█▋        | 2/12 [00:01<00:09,  1.01it/s]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?'}
Relevant Docs: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': ['34b7b7e2-f365-4522-85ad-7326990a24c3'], '457f76ed-453e-44dc-9f29-c8e157d733b9': ['34b7b7e2-f365-4522-85ad-7326990a24c3'], '3a0f2180-8e72-4459-bb21-76e919771df4': ['f5d7ea1e-ecf5-4462-8138-69b6a507064d']}
Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live inte

Generating Questions:  25%|██▌       | 3/12 [00:03<00:10,  1.13s/it]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?'}
Relevant Docs: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': ['34b7b7e2-f365-4522-85ad-7326990a24c3'], '457f76ed-453e-44dc-9f29-c8e157d733b9': ['34b7b7e2-f365-4522-85ad-7326990a24c3'], '3a0f2180-8e72-4459-bb21-76e919771df4': ['f5d7ea1e-ecf5-4462-8138-69b6a507064d'], 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': ['f5

Generating Questions:  33%|███▎      | 4/12 [00:04<00:09,  1.20s/it]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?', '7abc8e61-8aac-4bf2-aee1-dd2c80b8f572': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', 'b3427dbd-0bb7-4474-90fd-f1a2e6a832e7': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?"}
R

Generating Questions:  42%|████▏     | 5/12 [00:05<00:07,  1.08s/it]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?', '7abc8e61-8aac-4bf2-aee1-dd2c80b8f572': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', 'b3427dbd-0bb7-4474-90fd-f1a2e6a832e7': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?", '

Generating Questions:  50%|█████     | 6/12 [00:06<00:07,  1.18s/it]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?', '7abc8e61-8aac-4bf2-aee1-dd2c80b8f572': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', 'b3427dbd-0bb7-4474-90fd-f1a2e6a832e7': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?", '

Generating Questions:  58%|█████▊    | 7/12 [00:07<00:05,  1.10s/it]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?', '7abc8e61-8aac-4bf2-aee1-dd2c80b8f572': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', 'b3427dbd-0bb7-4474-90fd-f1a2e6a832e7': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?", '

Generating Questions:  67%|██████▋   | 8/12 [00:08<00:03,  1.01it/s]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?', '7abc8e61-8aac-4bf2-aee1-dd2c80b8f572': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', 'b3427dbd-0bb7-4474-90fd-f1a2e6a832e7': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?", '

Generating Questions:  75%|███████▌  | 9/12 [00:09<00:03,  1.15s/it]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?', '7abc8e61-8aac-4bf2-aee1-dd2c80b8f572': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', 'b3427dbd-0bb7-4474-90fd-f1a2e6a832e7': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?", '

Generating Questions:  83%|████████▎ | 10/12 [00:11<00:02,  1.13s/it]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?', '7abc8e61-8aac-4bf2-aee1-dd2c80b8f572': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', 'b3427dbd-0bb7-4474-90fd-f1a2e6a832e7': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?", '

Generating Questions:  92%|█████████▏| 11/12 [00:12<00:01,  1.12s/it]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?', '7abc8e61-8aac-4bf2-aee1-dd2c80b8f572': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', 'b3427dbd-0bb7-4474-90fd-f1a2e6a832e7': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?", '

Generating Questions: 100%|██████████| 12/12 [00:12<00:00,  1.08s/it]

Generated Questions: {'b2fe15a5-36be-4a81-95be-5b0608c26bda': 'What is the significance of the knowledge gap between those who follow technology closely and the majority of the population?', '457f76ed-453e-44dc-9f29-c8e157d733b9': 'How has the recent introduction of live interfaces impacted the perception of technology among self-certified nerds?', '3a0f2180-8e72-4459-bb21-76e919771df4': 'What are some reasons people dislike LLMs according to the context?', 'cbabf6cb-337b-4b5c-bb34-33a2db7e1efc': 'Why is it important to discuss the criticisms of LLMs?', 'cbe5433f-416b-46f7-827b-84b176783500': 'What are some potential consequences of making decisions based on hype and misinformation?', '7abc8e61-8aac-4bf2-aee1-dd2c80b8f572': 'Why is it important to acknowledge good applications of certain tools before making decisions about their use?', 'b3427dbd-0bb7-4474-90fd-f1a2e6a832e7': "What is the author's perspective on the environmental impact of plagiarism machines in the field discussed?", '




### Reformating and Saving Datasets

Now, we can save our datasets for later use!

In [33]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [34]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [35]:
train_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : train_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well performing embeddings model by itself, but there's a lot of very specific domain terms and vocabulary in our courpus - so lets fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [37]:
!pip install -qU sentence_transformers datasets pyarrow

In [38]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/85.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [39]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [40]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [41]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [42]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [43]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

Why are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!


##### ✅ Answers #2:

### **Loss Function Summaries:**

#### **1. MultipleNegativesRankingLoss (MNRL)**  
This loss is designed for contrastive learning, especially in sentence embedding tasks. It **encourages similar sentences to have high cosine similarity while pushing apart dissimilar ones**.  

- Given a batch, each sample is treated as a positive for itself and as a negative for all other samples.  
- It uses **in-batch negatives**, meaning every other example in the batch serves as a distractor (negative).  
- The loss is minimized when positive pairs are close in embedding space, and negatives are far apart.

✅ **Used for:** Training dense retrieval models, contrastive sentence embedding tasks.

---

#### **2. MatryoshkaLoss (MLoss)**  
Matryoshka Loss extends MNRL by enforcing a hierarchical structure in embeddings. It **trains embeddings at multiple levels of granularity** to improve efficiency and robustness.  

- Instead of just optimizing for a single-dimensional space (e.g., 768-d), it **gradually reduces dimensions** (e.g., 768 → 512 → 256 → ... → 64).  
- This helps in scenarios where lower-dimensional embeddings are required (e.g., memory-constrained environments).  
- It ensures that lower-dimensional representations **retain meaningful structure** from the higher-dimensional ones.

✅ **Used for:** Training embeddings that are **efficient** and **scalable** across multiple dimensionalities.

---

### **Why Use Them Together?**  
MatryoshkaLoss wraps MultipleNegativesRankingLoss, meaning:  
1. MNRL ensures embeddings capture semantic similarity.  
2. MatryoshkaLoss makes these embeddings **efficient at different dimensionalities** without losing too much performance.


Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [44]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 5 epochs, though you could increase this number if we had a significant amount more data.

In [45]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [46]:
import wandb
wandb.init(mode="disabled")

In [47]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
16,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
32,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
48,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
50,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
64,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
80,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
96,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
100,No log,No log,0.958333,1.0,1.0,1.0,0.958333,0.333333,0.2,0.1,0.958333,1.0,1.0,1.0,0.984622,0.979167,0.979167
112,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
128,No log,No log,0.875,1.0,1.0,1.0,0.875,0.333333,0.2,0.1,0.875,1.0,1.0,1.0,0.953866,0.9375,0.9375


In [52]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [59]:
hf_username = "AkshaySandbox"

In [60]:
model.push_to_hub(f"{hf_username}/legal-ft-v0")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/AkshaySandbox/legal-ft-v0/commit/d5ac5b62388722add9cca355b31d87bc32d0604e'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [68]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document
from tqdm.auto import tqdm

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [66]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate, we'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [69]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

  0%|          | 0/24 [00:00<?, ?it/s]

In [70]:
te3_results_df = pd.DataFrame(te3_results)

In [71]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

1.0

### `Snowflake/snowflake-arctic-embed-l` (base)

In [72]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

  0%|          | 0/24 [00:00<?, ?it/s]

In [73]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [74]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

0.9166666666666666

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [75]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/24 [00:00<?, ?it/s]

In [76]:
finetune_results_df = pd.DataFrame(finetune_results)

In [77]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

1.0

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [78]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the untrained retrieval model.

#### R - Retrieval

In [79]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [80]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [81]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [82]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [83]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'An agent, in the context of AI, is an infuriatingly vague term that generally refers to AI systems that can act on your behalf. There are different interpretations of what an agent is, with some viewing them as systems that autonomously perform tasks (like a travel agent), while others see them as LLMs (large language models) that utilize tools to solve problems. However, the term lacks a single, clear definition, leading to confusion about its meaning and utility.'

In [84]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [85]:
base_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'I do not know.'

In [86]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [87]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [88]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [89]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An "agent" is a term that lacks a single, clear, and widely understood meaning in the context of AI. It can refer to AI systems that act on behalf of users, similar to a travel agent model, or to LLMs (large language models) that have access to tools and can run them in a loop to solve problems. However, the term is often used vaguely, and there is skepticism about the utility of such agents, particularly due to issues like gullibility, where these systems may struggle to distinguish truth from fiction.'

In [90]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better models than GPT-3 include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [91]:
finetune_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'The context suggests that December might be considered a lazy month for AI, as it mentions the possibility that ChatGPT gets lazy in December due to its hidden system prompt including the current date and the observation that people provide less useful answers as the holidays approach.'

In [92]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The largest model that Simon has run on his phone is the Mistral 7B model.'

####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?


#### Answer #2:

The fine-tuned model's RAG chain performed better overall, To determine which **LCEL RAG Chain** (Base or Fine-Tuned) answered the questions better, let's analyze their differences based on the following factors:  

### **Key Factors for Evaluation:**  
1. **Relevance of Retrieved Context:** Did the retriever find the most useful context for answering the question?  
2. **Faithfulness & Factual Accuracy:** Did the response correctly represent the information from the retrieved context?  
3. **Fluency & Coherence:** Did the response make sense and provide a clear, well-structured answer?  
4. **Precision & Specificity:** Did the response contain detailed, meaningful insights rather than generic statements?  

---

### **Base RAG Chain vs. Fine-Tuned RAG Chain: Key Differences**  
1. **Retriever Performance:**  
   - The **Base RAG Chain** used the **original retriever** (default embedding model).  
   - The **Fine-Tuned RAG Chain** replaced this with a **fine-tuned FAISS vectorstore** using fine-tuned embeddings.  
   - **Impact:** A fine-tuned embedding model should retrieve **more relevant documents**, leading to better grounding for answers.  

2. **Context & Faithfulness:**  
   - The **Base RAG Chain** had **higher context recall** in previous results, meaning it retrieved a **wider range** of relevant information.  
   - The **Fine-Tuned RAG Chain** had **slightly lower context recall** but improved **faithfulness and factual correctness**.  
   - **Impact:** If fine-tuning improved the embedding model, it should reduce hallucinations and provide **more grounded answers**.  

3. **Answer Quality:**  
   - Fine-tuning should help **reduce generic responses**, providing **more detailed and precise** answers to niche questions.  
   - For **factual questions (e.g., "Who has produced better models than GPT-3?")**, a fine-tuned retriever should perform better if trained properly.  
   - However, for **ambiguous or creative questions (e.g., "What is the laziest month for AI?")**, the fine-tuned model might **not necessarily** improve performance.  

---

### **Final Verdict: Which Performed Better?**  
- If the **fine-tuned retriever** successfully retrieved **higher-quality, more relevant context**, then **Fine-Tuned RAG should perform better** in **factual correctness & faithfulness**.  
- If the **base retriever** already had strong embeddings and retrieved sufficiently good context, the improvement may be **minor** or even negative if the fine-tuned retriever missed important sources.  


## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight info. on how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. The includes the Synthetic Data Generation steps.

In [93]:
!pip install -qU ragas==0.2.10 rapidfuzz==3.12.1

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/175.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m175.7/175.7 kB[0m [31m5.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/3.1 MB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━[0m [32m2.1/3.1 MB[0m [31m63.5 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.1/3.1 MB[0m [31m51.9 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/45.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m4.0 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/71.1 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

In [124]:
!pip install langchain_qdrant langgraph

Collecting langgraph
  Downloading langgraph-0.2.73-py3-none-any.whl.metadata (17 kB)
Collecting langgraph-checkpoint<3.0.0,>=2.0.10 (from langgraph)
  Downloading langgraph_checkpoint-2.0.16-py3-none-any.whl.metadata (4.6 kB)
Collecting langgraph-sdk<0.2.0,>=0.1.42 (from langgraph)
  Downloading langgraph_sdk-0.1.51-py3-none-any.whl.metadata (1.8 kB)
Downloading langgraph-0.2.73-py3-none-any.whl (151 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m151.5/151.5 kB[0m [31m4.5 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading langgraph_checkpoint-2.0.16-py3-none-any.whl (38 kB)
Downloading langgraph_sdk-0.1.51-py3-none-any.whl (44 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m44.7/44.7 kB[0m [31m4.2 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: langgraph-sdk, langgraph-checkpoint, langgraph
Successfully installed langgraph-0.2.73 langgraph-checkpoint-2.0.16 langgraph-sdk-0.1.51


In [113]:
!pip install unstructured

Collecting unstructured
  Downloading unstructured-0.16.21-py3-none-any.whl.metadata (24 kB)
Collecting filetype (from unstructured)
  Downloading filetype-1.2.0-py2.py3-none-any.whl.metadata (6.5 kB)
Collecting python-magic (from unstructured)
  Downloading python_magic-0.4.27-py2.py3-none-any.whl.metadata (5.8 kB)
Collecting emoji (from unstructured)
  Downloading emoji-2.14.1-py3-none-any.whl.metadata (5.7 kB)
Collecting python-iso639 (from unstructured)
  Downloading python_iso639-2025.2.18-py3-none-any.whl.metadata (14 kB)
Collecting langdetect (from unstructured)
  Downloading langdetect-1.0.9.tar.gz (981 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m17.6 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
Collecting backoff (from unstructured)
  Downloading backoff-2.2.1-py3-none-any.whl.metadata (14 kB)
Collecting unstructured-client (from unstructured)
  Downloading unstructured_client-0.30.1-py

In [114]:
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Download all NLTK data
nltk.download('all')

# Now proceed with loading
from langchain_community.document_loaders import DirectoryLoader
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to /root/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to /root/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_r

In [100]:
import os
from getpass import getpass
os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

Please enter your Ragas API key!··········


In [95]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

from ragas.testset import TestsetGenerator
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(training_documents, testset_size=10)

Applying SummaryExtractor:   0%|          | 0/83 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/127 [00:00<?, ?it/s]



Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/337 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

**Convert to pandas dataFrame**

In [96]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What LLMs do in AI?,[Stuff we figured out about AI in 2023\n\n\n\n...,"In 2023, Large Language Models (LLMs) were con...",single_hop_specifc_query_synthesizer
1,What is the current understanding of building ...,[Large Language Models\nThey’re actually quite...,We don’t yet know how to build GPT-4.,single_hop_specifc_query_synthesizer
2,What advancements in Large Language Models wer...,[Here’s the sequel to this post: Things we lea...,"In 2024, it was observed that Large Language M...",single_hop_specifc_query_synthesizer
3,What is the most crucial factor in building ef...,[They’re actually quite easy to build\nThe mos...,The most crucial factor in building effective ...,single_hop_specifc_query_synthesizer
4,What role does Mistral play in the development...,"[If you can gather the right data, and afford ...",Mistral is one of the organizations that have ...,single_hop_specifc_query_synthesizer
5,How has the development of AI technologies fro...,[<1-hop>\n\nJust the other day Google Search w...,"From September 2022 to September 2023, there h...",multi_hop_specific_query_synthesizer
6,How has the development of LLMs like GPT-3 inf...,[<1-hop>\n\nYou can run LLMs on your own devic...,The development of LLMs like GPT-3 initially s...,multi_hop_specific_query_synthesizer
7,What recent development allows running LLMs on...,"[<1-hop>\n\nMore recent articles\n\nLLM 0.22, ...",The recent development that allows running LLM...,multi_hop_specific_query_synthesizer
8,How did the availability of GPT-4o and its cap...,[<1-hop>\n\nFor a few short months this year a...,The availability of GPT-4o marked a significan...,multi_hop_specific_query_synthesizer
9,What are the capabilities of Llama 3.2 Vision ...,[<1-hop>\n\nOctober\n\n1st: OpenAI DevDay 2024...,Llama 3.2 Vision is a multi-modal model releas...,multi_hop_specific_query_synthesizer


In [101]:
dataset.upload()

Testset uploaded! View at https://app.ragas.io/dashboard/alignment/testset/c4c4c0be-7b4e-4671-840d-f3bb30343367


'https://app.ragas.io/dashboard/alignment/testset/c4c4c0be-7b4e-4671-840d-f3bb30343367'

**Create evaluation dataset**

In [106]:
base_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")

In [128]:
ft_embeddings = HuggingFaceEmbeddings(model_name="AkshaySandbox/legal-ft-v0")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/29.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/641 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of BertModel were not initialized from the model checkpoint at AkshaySandbox/legal-ft-v0 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

**Creating Collection**

In [129]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="base_ai_across_years",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

client.create_collection(
    collection_name="ft_ai_across_years",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

base_vector_store = QdrantVectorStore(
    client=client,
    collection_name="base_ai_across_years",
    embedding=base_embeddings,
)

ft_vector_store = QdrantVectorStore(
    client=client,
    collection_name="ft_ai_across_years",
    embedding=ft_embeddings,
)

**Creating Chunks**


In [130]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

74

In [131]:
_ = base_vector_store.add_documents(documents=split_documents)
__ = ft_vector_store.add_documents(documents=split_documents)

In [132]:
base_retriever = base_vector_store.as_retriever(search_kwargs={"k": 5})
ft_retriever = ft_vector_store.as_retriever(search_kwargs={"k": 5})

In [133]:
def base_retrieve(state):
  retrieved_docs = base_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

def ft_retrieve(state):
  retrieved_docs = ft_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

In [134]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""
rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [135]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [136]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

In [137]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class BaseState(TypedDict):
  question: str
  context: List[Document]
  response: str

class FState(TypedDict):
  question: str
  context: List[Document]
  response: str

In [138]:
base_graph_builder = StateGraph(BaseState).add_sequence([base_retrieve, generate])
base_graph_builder.add_edge(START, "base_retrieve")
base_graph = base_graph_builder.compile()

ft_graph_builder = StateGraph(FState).add_sequence([ft_retrieve, generate])
ft_graph_builder.add_edge(START, "ft_retrieve")
ft_graph = ft_graph_builder.compile()


**Invoke Base Graph**

In [127]:
response = base_graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

"LLM agents can be useful in various applications, despite their inherent unreliability. They possess incredible power and can assist users in tasks such as decision making, research, and generating content. However, the key to benefiting from LLMs lies in understanding how to navigate their flaws and intricacies. It's important to help users recognize the good applications of LLMs and provide guidance on how to effectively use them while avoiding common pitfalls. Ultimately, a better-informed user base can leverage LLMs to achieve valuable outcomes, making it crucial to educate and assist those looking to utilize these tools effectively."

Invoke FT Graph

In [139]:
response = ft_graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents are seen as potentially useful in various ways, particularly in automating tasks and improving productivity. While the term "agents" itself is described as infuriatingly vague and lacking a clear definition, there are two primary perspectives on their utility: \n\n1. **Automation**: Some view AI agents as systems that can act on behalf of users, similar to a travel agent. This perspective encompasses AI systems that can perform tasks autonomously or semi-autonomously, which can simplify complex processes for users.\n\n2. **Code Generation**: The context emphasizes that one of the most successful applications of LLMs is in writing code. The grammatical structure of programming languages makes it relatively easier for LLMs to generate accurate and functional code compared to natural languages. This application has proven to be particularly effective and could be among the most beneficial uses of LLMs.\n\nDespite the excitement surrounding LLM agents, there is skepticism regar

RAG chain on Test Data

In [140]:
base_dataset = dataset
ft_dataset = dataset

for base_test_row in base_dataset:
  base_response = base_graph.invoke({"question" : base_test_row.eval_sample.user_input})
  base_test_row.eval_sample.response = base_response["response"]
  base_test_row.eval_sample.retrieved_contexts = [context.page_content for context in base_response["context"]]

for ft_test_row in ft_dataset:
  ft_response = ft_graph.invoke({"question" : ft_test_row.eval_sample.user_input})
  ft_test_row.eval_sample.response = ft_response["response"]
  ft_test_row.eval_sample.retrieved_contexts = [context.page_content for context in ft_response["context"]]

**Base Model DataFrames**

In [141]:
base_dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What LLMs do in AI?,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[Stuff we figured out about AI in 2023\n\n\n\n...,Large Language Models (LLMs) in AI are capable...,"In 2023, Large Language Models (LLMs) were con...",single_hop_specifc_query_synthesizer
1,What is the current understanding of building ...,[This is a huge advantage for open over closed...,[Large Language Models\nThey’re actually quite...,The current understanding of building GPT-4 is...,We don’t yet know how to build GPT-4.,single_hop_specifc_query_synthesizer
2,What advancements in Large Language Models wer...,[Simon Willison’s Weblog\n\nSubscribe\n\nThing...,[Here’s the sequel to this post: Things we lea...,"In 2024, several significant advancements in L...","In 2024, it was observed that Large Language M...",single_hop_specifc_query_synthesizer
3,What is the most crucial factor in building ef...,[A lot of people are yet to be sold on their v...,[They’re actually quite easy to build\nThe mos...,The most crucial factor in building effective ...,The most crucial factor in building effective ...,single_hop_specifc_query_synthesizer
4,What role does Mistral play in the development...,[I wrote about how Large language models are h...,"[If you can gather the right data, and afford ...",Mistral plays a role in the development of lar...,Mistral is one of the organizations that have ...,single_hop_specifc_query_synthesizer
5,How has the development of AI technologies fro...,[These abilities are just a few weeks old at t...,[<1-hop>\n\nJust the other day Google Search w...,"From September 2022 to September 2023, the dev...","From September 2022 to September 2023, there h...",multi_hop_specific_query_synthesizer
6,How has the development of LLMs like GPT-3 inf...,[I wrote about how Large language models are h...,[<1-hop>\n\nYou can run LLMs on your own devic...,The development of large language models (LLMs...,The development of LLMs like GPT-3 initially s...,multi_hop_specific_query_synthesizer
7,What recent development allows running LLMs on...,"[Apple Intelligence is bad, Apple’s MLX librar...","[<1-hop>\n\nMore recent articles\n\nLLM 0.22, ...",The recent development that allows running LLM...,The recent development that allows running LLM...,multi_hop_specific_query_synthesizer
8,How did the availability of GPT-4o and its cap...,[Did you know ChatGPT has two entirely differe...,[<1-hop>\n\nFor a few short months this year a...,The availability of GPT-4o marked a significan...,The availability of GPT-4o marked a significan...,multi_hop_specific_query_synthesizer
9,What are the capabilities of Llama 3.2 Vision ...,[I can now run a GPT-4 class model on my lapto...,[<1-hop>\n\nOctober\n\n1st: OpenAI DevDay 2024...,"Llama 3.2 Vision, while not explicitly mention...",Llama 3.2 Vision is a multi-modal model releas...,multi_hop_specific_query_synthesizer


**FineTune Dataframe**

In [142]:
ft_dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What LLMs do in AI?,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[Stuff we figured out about AI in 2023\n\n\n\n...,Large Language Models (LLMs) in AI are capable...,"In 2023, Large Language Models (LLMs) were con...",single_hop_specifc_query_synthesizer
1,What is the current understanding of building ...,[This is a huge advantage for open over closed...,[Large Language Models\nThey’re actually quite...,The current understanding of building GPT-4 is...,We don’t yet know how to build GPT-4.,single_hop_specifc_query_synthesizer
2,What advancements in Large Language Models wer...,[Simon Willison’s Weblog\n\nSubscribe\n\nThing...,[Here’s the sequel to this post: Things we lea...,"In 2024, several significant advancements in L...","In 2024, it was observed that Large Language M...",single_hop_specifc_query_synthesizer
3,What is the most crucial factor in building ef...,[A lot of people are yet to be sold on their v...,[They’re actually quite easy to build\nThe mos...,The most crucial factor in building effective ...,The most crucial factor in building effective ...,single_hop_specifc_query_synthesizer
4,What role does Mistral play in the development...,[I wrote about how Large language models are h...,"[If you can gather the right data, and afford ...",Mistral plays a role in the development of lar...,Mistral is one of the organizations that have ...,single_hop_specifc_query_synthesizer
5,How has the development of AI technologies fro...,[These abilities are just a few weeks old at t...,[<1-hop>\n\nJust the other day Google Search w...,"From September 2022 to September 2023, the dev...","From September 2022 to September 2023, there h...",multi_hop_specific_query_synthesizer
6,How has the development of LLMs like GPT-3 inf...,[I wrote about how Large language models are h...,[<1-hop>\n\nYou can run LLMs on your own devic...,The development of large language models (LLMs...,The development of LLMs like GPT-3 initially s...,multi_hop_specific_query_synthesizer
7,What recent development allows running LLMs on...,"[Apple Intelligence is bad, Apple’s MLX librar...","[<1-hop>\n\nMore recent articles\n\nLLM 0.22, ...",The recent development that allows running LLM...,The recent development that allows running LLM...,multi_hop_specific_query_synthesizer
8,How did the availability of GPT-4o and its cap...,[Did you know ChatGPT has two entirely differe...,[<1-hop>\n\nFor a few short months this year a...,The availability of GPT-4o marked a significan...,The availability of GPT-4o marked a significan...,multi_hop_specific_query_synthesizer
9,What are the capabilities of Llama 3.2 Vision ...,[I can now run a GPT-4 class model on my lapto...,[<1-hop>\n\nOctober\n\n1st: OpenAI DevDay 2024...,"Llama 3.2 Vision, while not explicitly mention...",Llama 3.2 Vision is a multi-modal model releas...,multi_hop_specific_query_synthesizer


In [143]:
from ragas import EvaluationDataset

base_evaluation_dataset = EvaluationDataset.from_pandas(base_dataset.to_pandas())
ft_evaluation_dataset = EvaluationDataset.from_pandas(ft_dataset.to_pandas())
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

base_result = evaluate(
    dataset=base_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)

ft_result = evaluate(
    dataset=ft_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

**Base Model Evaluation results**

In [144]:
base_result

{'context_recall': 0.6417, 'faithfulness': 0.8398, 'factual_correctness': 0.3390, 'answer_relevancy': 0.7512, 'context_entity_recall': 0.4741, 'noise_sensitivity_relevant': 0.3229}

**FineTune Evaluation results**

In [145]:
ft_result

{'context_recall': 0.5250, 'faithfulness': 0.8460, 'factual_correctness': 0.3930, 'answer_relevancy': 0.7505, 'context_entity_recall': 0.4174, 'noise_sensitivity_relevant': 0.3213}

### **Analysis of Base vs. Fine-Tuned Model Results**  

| Metric                         | Base Model  | Fine-Tuned Model | Change  | Interpretation |
|--------------------------------|------------|-----------------|---------|----------------|
| **Context Recall**             | **0.6417**  | **0.5250**      | 🔻 -11.6%  | The fine-tuned model retrieves **less** context overall, potentially missing relevant information. |
| **Faithfulness**               | **0.8398**  | **0.8460**      | 🔼 +0.7%   | A slight **increase**, meaning the fine-tuned model is more **truthful** in sticking to the provided context. |
| **Factual Correctness**        | **0.3390**  | **0.3930**      | 🔼 +15.9%  | A **significant improvement**, meaning the fine-tuned model produces more **factually correct** outputs. |
| **Answer Relevancy**           | **0.7512**  | **0.7505**      | 🔻 -0.1%  | No meaningful change; answers remain equally relevant. |
| **Context Entity Recall**      | **0.4741**  | **0.4174**      | 🔻 -12.0%  | The fine-tuned model recalls **fewer entities** from the context, possibly reducing specificity. |
| **Noise Sensitivity Relevant** | **0.3229**  | **0.3213**      | 🔻 -0.5%  | Minimal change; both models are similarly sensitive to noise. |

---

### **Key Insights:**
1. **Fine-tuning improved factual correctness (+15.9%) and faithfulness (+0.7%)**, suggesting that it generates more **truthful** and **fact-based responses**.  
2. **Context recall (-11.6%) and entity recall (-12.0%) dropped**, meaning the fine-tuned model is retrieving **less supporting information**—which might contribute to its improved faithfulness.  
3. **Answer relevancy and noise sensitivity remained stable**, indicating that the **overall answer quality did not degrade** despite changes in retrieval behavior.  

---

### **Overall Takeaway:**
The **fine-tuned model is better at generating factually correct and faithful responses**, but it comes at the **cost of lower context recall and entity retrieval**. This trade-off suggests that the model may be **more selective** in using context, which can be good for faithfulness but risky if relevant information is omitted.
