# Fine-tuning Embeddings for RAG on Specific Data

As we start our "fine-tuning" week, we'll start with the lowest hanging improvement one can do for RAG - which is:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.

#####❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model at the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [7]:
#!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [8]:
#!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml 

### Provide OpenAI API Key

In [2]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

## Task 2: Loading Data

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [10]:
!mkdir data

mkdir: cannot create directory ‘data’: File exists


In [11]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31440    0 31440    0     0   124k      0 --:--:-- --:--:-- --:--:--  124k


In [12]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70299    0 70299    0     0   655k      0 --:--:-- --:--:-- --:--:--  660k


In [3]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [4]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

Next we can load/split these documents as follows.

>> NOTE: You may need to run this cell twice to get it to work.

In [5]:
training_documents = text_splitter.split_documents(text_loader.load())

In [6]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [7]:
import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = uuid.uuid4()
  id_set.add(id)
  document.metadata["id"] = id

In [8]:
training_documents

[Document(metadata={'source': 'data/2023_llms.html', 'title': 'Stuff we figured out about AI in 2023', 'id': '0f046e40-6ca1-4088-b332-4f88e3dd3def'}, page_content='Stuff we figured out about AI in 2023\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSimon Willison’s Weblog\nSubscribe\n\n\n\n\n\n\nStuff we figured out about AI in 2023\n31st December 2023\n2023 was the breakthrough year for Large Language Models (LLMs). I think it’s OK to call these AI—they’re the latest and (currently) most interesting development in the academic field of Artificial Intelligence that dates back to the 1950s.\nHere’s my attempt to round up the highlights in one place!'),
 Document(metadata={'source': 'data/2023_llms.html', 'title': 'Stuff we figured out about AI in 2023', 'id': 'be5cfe3a-20dc-46a3-890d-969bd8cf7c5f'}, page_content='Large Language Models\nThey’re actually quite easy to build\nYou can run LLMs on your own devices\nHobbyists can build their own fine-tuned models\nWe don’t yet know how to build G

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [9]:
training_split_documents = training_documents[:len(training_documents) - 24]
val_split_documents = training_documents[len(training_documents) - 24:102-12]
test_split_documents = training_documents[102-12:]

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [today](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [10]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

In [12]:
qa_prompt_template

ChatPromptTemplate(input_variables=['context', 'n_questions'], input_types={}, partial_variables={}, messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'n_questions'], input_types={}, partial_variables={}, template='Given the following context, you must generate questions based on only the provided context.\n\nYou are to generate {n_questions} questions which should be provided in the following format:\n\n1. QUESTION #1\n2. QUESTION #2\n...\n\nContext:\n{context}\n'), additional_kwargs={})])

We'll create a simple chain to query the LLM!

In [13]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. We, for each document in our list, generate `n_questions` of questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object with key `id`, which have values `str` questions.
- An object with key `question_id`, which have values `List(str)` which will be a list of associated `context_id`.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [14]:
import uuid
import re

async def create_questions(documents, n_questions):
    questions = {}
    relevant_docs = {}

    for document in documents:
        # Generate questions using the chat model
        question_response = await qa_prompt_template.ainvoke({"context": document.page_content, "n_questions": n_questions})
        
        # Debug: Print the type and structure of the response
        print(f"Type of response: {type(question_response)}")
        print(f"Dir of response: {dir(question_response)}")
        print(f"Response: {question_response}")
        
        try:
            # Try different ways to get the content
            if hasattr(question_response, 'text'):
                question_text = question_response.text
            elif hasattr(question_response, 'message'):
                question_text = question_response.message.content
            else:
                # If all else fails, try string conversion
                question_text = str(question_response)
            
            # Parse the numbered questions from the response
            parsed_questions = re.findall(r'\d+\.\s+(.*?)(?=\d+\.|$)', question_text, re.DOTALL)
            
            # For each parsed question, create a unique ID and store the question
            for question in parsed_questions:
                question = question.strip()  # Clean up any extra whitespace
                if question:  # Only process non-empty questions
                    question_id = str(uuid.uuid4())
                    questions[question_id] = question
                    relevant_docs[question_id] = [document.metadata["id"]]
                    
        except Exception as e:
            print(f"Error processing document: {e}")
            raise

    return questions, relevant_docs

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [15]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Type of response: <class 'langchain_core.prompt_values.ChatPromptValue'>
Dir of response: ['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__class_vars__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__fields__', '__fields_set__', '__firstlineno__', '__format__', '__ge__', '__get_pydantic_core_schema__', '__get_pydantic_json_schema__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pretty__', '__private_attributes__', '__pydantic_complete__', '__pydantic_computed_fields__', '__pydantic_core_schema__', '__pydantic_custom_init__', '__pydantic_decorators__', '__pydantic_extra__', '__pydantic_fields__', '__pydantic_fields_set__', '__pydantic_generic_metadata__', '__pydantic_init_subclass__', '__pydantic_parent_namespace__', '__pydantic_post_init__', '__pydantic_private__', '__pydantic_root_m

We'll use the function to generate training, validation, and test data.

In [16]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Type of response: <class 'langchain_core.prompt_values.ChatPromptValue'>
Dir of response: ['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__class_vars__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__fields__', '__fields_set__', '__firstlineno__', '__format__', '__ge__', '__get_pydantic_core_schema__', '__get_pydantic_json_schema__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pretty__', '__private_attributes__', '__pydantic_complete__', '__pydantic_computed_fields__', '__pydantic_core_schema__', '__pydantic_custom_init__', '__pydantic_decorators__', '__pydantic_extra__', '__pydantic_fields__', '__pydantic_fields_set__', '__pydantic_generic_metadata__', '__pydantic_init_subclass__', '__pydantic_parent_namespace__', '__pydantic_post_init__', '__pydantic_private__', '__pydantic_root_m

In [17]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Type of response: <class 'langchain_core.prompt_values.ChatPromptValue'>
Dir of response: ['__abstractmethods__', '__annotations__', '__class__', '__class_getitem__', '__class_vars__', '__copy__', '__deepcopy__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__fields__', '__fields_set__', '__firstlineno__', '__format__', '__ge__', '__get_pydantic_core_schema__', '__get_pydantic_json_schema__', '__getattr__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__pretty__', '__private_attributes__', '__pydantic_complete__', '__pydantic_computed_fields__', '__pydantic_core_schema__', '__pydantic_custom_init__', '__pydantic_decorators__', '__pydantic_extra__', '__pydantic_fields__', '__pydantic_fields_set__', '__pydantic_generic_metadata__', '__pydantic_init_subclass__', '__pydantic_parent_namespace__', '__pydantic_post_init__', '__pydantic_private__', '__pydantic_root_m

### Reformating and Saving Datasets

Now, we can save our datasets for later use!

In [18]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [19]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [20]:
train_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : train_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well performing embeddings model by itself, but there's a lot of very specific domain terms and vocabulary in our courpus - so lets fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [31]:
#!pip install -qU sentence_transformers datasets pyarrow

In [21]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [22]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [23]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [24]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [25]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [26]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

Why are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

#### ✅ Answer #2:

```python
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)
```

This code implements a nested (Matryoshka) loss structure:

1. **MultipleNegativesRankingLoss (Inner Loss)**
```python
inner_train_loss = MultipleNegativesRankingLoss(model)
```
- Primary loss function that:
  - Takes a batch of sentence pairs
  - For each anchor sentence (question):
    - One positive pair (correct context)
    - Multiple negative pairs (other contexts in batch)
  - Optimizes to maximize similarity with positive pair
  - Minimizes similarity with negative pairs
  - Uses cross-entropy loss under the hood

2. **MatryoshkaLoss (Outer Loss)**
```python
matryoshka_dimensions = [768, 512, 256, 128, 64]
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)
```
- Wraps the inner loss in a multi-dimensional structure
- Creates nested embeddings of different sizes:
  - Full dimension: 768 (base embedding)
  - Reduced dimensions: 512 → 256 → 128 → 64
- Benefits:
  - Flexibility: Can use different embedding sizes for different tasks
  - Efficiency: Smaller dimensions for resource-constrained environments
  - Performance: Maintains quality across different dimensions

The dimensions work like nested dolls (hence "Matryoshka"):
```
[768] → Contains all information
  [512] → Compressed but still detailed
    [256] → Medium compression
      [128] → Higher compression
        [64] → Most compressed
```

This setup allows:
1. Training one model that can output embeddings of multiple sizes
2. Maintaining performance across different dimensionality requirements
3. Flexibility in deployment (can choose size based on resources/needs)
4. Efficient storage and computation options

This is particularly useful for:
- Resource-constrained environments (can use smaller dimensions)
- Systems requiring different precision levels
- Balancing performance vs computational cost
- Production deployments where flexibility is needed



Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [27]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 5 epochs, though you could increase this number if we had a significant amount more data.

In [28]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [29]:
import wandb
wandb.init(mode="disabled")

In [30]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
17,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128
34,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128
50,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128
51,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128
68,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128
85,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128
100,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128
102,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128
119,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128
136,No log,No log,0.56,0.64,0.72,0.92,0.56,0.213333,0.144,0.092,0.56,0.64,0.72,0.92,0.701742,0.637159,0.644128


In [32]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [33]:
hf_username = "dataera2013"

In [57]:
model.push_to_hub(f"{hf_username}/legal-ft-2")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/dataera2013/legal-ft-2/commit/c0abc78c3ccd6e30c2d427c81e0dc96a6accdcb1'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [34]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document
from tqdm.auto import tqdm

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [35]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate, we'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [36]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

  0%|          | 0/24 [00:00<?, ?it/s]

In [37]:
te3_results_df = pd.DataFrame(te3_results)

In [38]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

np.float64(0.7083333333333334)

### `Snowflake/snowflake-arctic-embed-l` (base)

In [39]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

  0%|          | 0/24 [00:00<?, ?it/s]

In [40]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [41]:
arctic_embed_m_results_df

Unnamed: 0,id,question,expected_id,is_hit
0,3229c051-3f54-4104-9644-3a014d51a0b9,QUESTION #1\n,3f516b02-c270-440c-889e-d755a57b3612,True
1,68ba016c-59cc-45a4-a3e0-aaf416376329,QUESTION #2\n...\n\nContext:\nThe knowledge ga...,3f516b02-c270-440c-889e-d755a57b3612,True
2,7152573e-aeed-46c2-891b-755bb9fb2740,QUESTION #1\n,82ecb6f2-efc0-4661-a417-6208a207da63,True
3,dcec525e-d085-47d2-98d9-bee0ed289045,QUESTION #2\n...\n\nContext:\nA lot of people ...,82ecb6f2-efc0-4661-a417-6208a207da63,True
4,4064a412-791c-4a5f-8696-d66b5f869844,QUESTION #1\n,565ad4c9-5058-4707-ac06-4e745c7b9f65,True
5,e1552679-8e47-4101-9184-331eb722fee8,QUESTION #2\n...\n\nContext:\nI like people wh...,565ad4c9-5058-4707-ac06-4e745c7b9f65,True
6,42bdd36d-867f-401c-a753-b1ab1fbb0b69,QUESTION #1\n,8a5d1b36-a6c0-4a9a-95d2-e833d1f21c44,False
7,a670dfaa-b62e-4bbe-b859-c98a0b19e2a2,QUESTION #2\n...\n\nContext:\nI think telling ...,8a5d1b36-a6c0-4a9a-95d2-e833d1f21c44,True
8,868155ee-e75a-4d65-a493-13023590dec0,QUESTION #1\n,7863fe22-8bcf-47f0-8478-5d925f422c1f,False
9,a3e3d26d-1184-4c6b-bbdd-7687f2317c37,QUESTION #2\n...\n\nContext:\nJanuary\n\n7th: ...,7863fe22-8bcf-47f0-8478-5d925f422c1f,True


In [42]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

np.float64(0.7083333333333334)

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [43]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/24 [00:00<?, ?it/s]

In [44]:
finetune_results_df = pd.DataFrame(finetune_results)

In [45]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

np.float64(0.7083333333333334)

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [46]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the untrained retrieval model.

#### R - Retrieval

In [47]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [49]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [50]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [51]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [52]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'An agent, in the context of AI, is an infuriatingly vague term that generally refers to AI systems that can act on your behalf. There are two main interpretations: one sees agents as systems that go and act for you (like a travel agent), while the other views them as LLMs (large language models) that have access to tools and can run them in a loop to solve problems. However, the term lacks a clear and widely understood definition, leading to confusion about its meaning and utility.'

In [53]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [54]:
base_rag_chain.invoke({"question" : "What is the laziest AI month?"})["response"]

'I do not know.'

In [55]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [56]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [57]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [58]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An "Agent" is a term that lacks a single, clear, and widely understood meaning in the context of AI. It is often used to refer to AI systems that can act on behalf of a user, but there are various interpretations of what this entails. Some people view agents as systems that go away and perform tasks autonomously, while others think of them as LLMs (Large Language Models) that have access to tools and can run processes in a loop to solve problems. The term is often associated with concepts like autonomy, but there is no consensus on its definition or practical implementation, leading to skepticism about their utility.'

In [59]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [60]:
finetune_rag_chain.invoke({"question" : "What is the laziest AI month?"})["response"]

'I do not know.'

In [61]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The largest model that Simon has run on his phone is the Llama 3.2 3B model.'

####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

#### ✅ Answer #2:

Now with the complete responses, I can make a much better comparison. The fine-tuned model's RAG chain performed better overall. Here's the detailed analysis:

1. "What is an Agent?":
   - Fine-tuned model gave a more nuanced and complete answer, including:
     - Explicitly stated the lack of clear definition
     - Covered both interpretations more thoroughly
     - Mentioned the concept of autonomy
     - Addressed the skepticism about utility
   - Base model's answer was more concise but missed some important nuances

2. "Who has produced better models than GPT-3?":
   - Both gave identical answers, listing the same organizations
   - This shows consistency in factual retrieval for straightforward questions

3. "What is the laziest AI month?":
   - Both correctly responded "I do not know"
   - This shows good handling of questions that can't be answered from the context

4. "What is the largest model that Simon has run on his phone?":
   - Fine-tuned model correctly identified "Llama 3.2 3B model"
   - Base model incorrectly responded "I do not know" despite the information being present in the context
   - This is a significant difference showing better retrieval and context understanding by the fine-tuned model

The fine-tuned model's RAG chain performed better because:
1. More detailed and nuanced responses for complex concepts (Agent definition)
2. Better retrieval of specific information (Llama 3.2 3B model)
3. Maintained the same accuracy on factual questions
4. Kept the appropriate "I do not know" responses for unanswerable questions

The fine-tuning appears to have improved the model's ability to:
- Extract relevant information from context more reliably
- Provide more comprehensive answers while maintaining accuracy
- Better handle both conceptual and factual questions

This suggests the fine-tuning process successfully improved the model's retrieval and response generation capabilities while maintaining its ability to acknowledge limitations when appropriate.









## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight info. on how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. The includes the Synthetic Data Generation steps.

In [218]:
# First, install all required NLTK data
import nltk
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

# Download all NLTK data
nltk.download('all')

# Now proceed with loading
from langchain_community.document_loaders import DirectoryLoader
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

[nltk_data] Downloading collection 'all'
[nltk_data]    | 
[nltk_data]    | Downloading package abc to
[nltk_data]    |     /home/nageshbm/nltk_data...
[nltk_data]    |   Package abc is already up-to-date!
[nltk_data]    | Downloading package alpino to
[nltk_data]    |     /home/nageshbm/nltk_data...
[nltk_data]    |   Package alpino is already up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger to
[nltk_data]    |     /home/nageshbm/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger is already up-
[nltk_data]    |       to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_eng to
[nltk_data]    |     /home/nageshbm/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_eng is already
[nltk_data]    |       up-to-date!
[nltk_data]    | Downloading package averaged_perceptron_tagger_ru to
[nltk_data]    |     /home/nageshbm/nltk_data...
[nltk_data]    |   Package averaged_perceptron_tagger_ru is already
[nltk_data]  

In [264]:
# Now proceed with loading the HTML files
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

In [265]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [266]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [267]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Wut iz Meta's role in the development of LLMs?,[Code may be the best application The ethics o...,Meta has contributed to the development of LLM...,single_hop_specifc_query_synthesizer
1,What happen in September with prompt injection?,[Based Development As a computer scientist and...,"In September last year, the term 'prompt injec...",single_hop_specifc_query_synthesizer
2,Who is Simon Willson?,[Simon Willison’s Weblog Subscribe Stuff we fi...,Simon Willison is the author of a weblog that ...,single_hop_specifc_query_synthesizer
3,What insights has Simon Willison shared about ...,[easy to follow. The rest of the document incl...,Simon Willison has discussed the profound impa...,single_hop_specifc_query_synthesizer
4,How has the environmental impact of AI models ...,[<1-hop>\n\nPrompt driven app generation is a ...,The environmental impact of AI models has impr...,multi_hop_abstract_query_synthesizer
5,How did the advancements in GPT-4 and the subs...,[<1-hop>\n\nPrompt driven app generation is a ...,The advancements in GPT-4 and subsequent techn...,multi_hop_abstract_query_synthesizer
6,How have advancements in energy efficiency and...,[<1-hop>\n\nPrompt driven app generation is a ...,Advancements in energy efficiency and the envi...,multi_hop_abstract_query_synthesizer
7,How has the environmental impact of AI models ...,[<1-hop>\n\nPrompt driven app generation is a ...,The environmental impact of AI models has impr...,multi_hop_abstract_query_synthesizer
8,What were some key advancements in large langu...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"In 2023, large language models (LLMs) saw sign...",multi_hop_specific_query_synthesizer
9,How has Google's Gemini model contributed to t...,[<1-hop>\n\ngets you OpenAI’s most expensive m...,"Google's Gemini model, particularly the Gemini...",multi_hop_specific_query_synthesizer


In [268]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

In [269]:
base_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")

In [270]:
ft_embeddings = HuggingFaceEmbeddings(model_name="dataera2013/legal-ft-2")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/281 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/31.3k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/641 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

Some weights of BertModel were not initialized from the model checkpoint at dataera2013/legal-ft-2 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


tokenizer_config.json:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

In [271]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="base_ai_across_years",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

client.create_collection(
    collection_name="ft_ai_across_years",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

base_vector_store = QdrantVectorStore(
    client=client,
    collection_name="base_ai_across_years",
    embedding=base_embeddings,
)

ft_vector_store = QdrantVectorStore(
    client=client,
    collection_name="ft_ai_across_years",
    embedding=ft_embeddings,
)

In [272]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

73

In [273]:
_ = base_vector_store.add_documents(documents=split_documents)

In [274]:
__ = ft_vector_store.add_documents(documents=split_documents)

In [275]:
base_retriever = base_vector_store.as_retriever(search_kwargs={"k": 5})
ft_retriever = ft_vector_store.as_retriever(search_kwargs={"k": 5})

In [276]:
def base_retrieve(state):
  retrieved_docs = base_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

def ft_retrieve(state): 
  retrieved_docs = ft_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

In [277]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [278]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [279]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

In [283]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class BaseState(TypedDict):
  question: str
  context: List[Document]
  response: str

class FState(TypedDict):
  question: str
  context: List[Document]
  response: str


In [284]:
base_graph_builder = StateGraph(BaseState).add_sequence([base_retrieve, generate])
base_graph_builder.add_edge(START, "base_retrieve")
base_graph = base_graph_builder.compile()

ft_graph_builder = StateGraph(FState).add_sequence([ft_retrieve, generate])
ft_graph_builder.add_edge(START, "ft_retrieve")
ft_graph = ft_graph_builder.compile()

In [285]:
response = base_graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents are useful primarily in their capacity to assist with complex tasks like writing code. They excel at understanding the simpler grammar rules associated with programming languages, making them effective tools for coding-related tasks. Additionally, they can aid in generating training data for smaller models, which enhances the development of AI systems.\n\nDespite their flaws and inherent unreliability, LLMs present valuable opportunities when used correctly. The key lies in understanding how to leverage their capabilities while being aware of their limitations. There are good applications for LLMs, provided users acquire the necessary skills to navigate their complexities effectively.'

In [286]:
response = ft_graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents are useful in several ways, as highlighted in the provided context:\n\n1. **Productivity Improvement**: LLMs can enhance personal productivity by performing tasks such as answering questions, summarizing documents, translating languages, extracting information, and writing code. This allows users to accomplish more in less time.\n\n2. **Guidance for Effective Use**: Though LLMs have the potential to create value, users often need guidance to navigate their complexities and avoid pitfalls. This guidance helps individuals to effectively implement LLMs in ways that can improve their quality of life.\n\n3. **Positive Applications**: Despite criticisms, the context indicates that there are genuinely good applications for LLMs. Identifying and promoting these applications can help mitigate the negative aspects associated with LLMs, such as environmental impact and potential misuse.\n\n4. **Aid in Training Data Creation**: Larger LLMs can assist in generating training data for sma

In [287]:
base_dataset = dataset
ft_dataset = dataset

for base_test_row in base_dataset:
  base_response = base_graph.invoke({"question" : base_test_row.eval_sample.user_input})
  base_test_row.eval_sample.response = base_response["response"]
  base_test_row.eval_sample.retrieved_contexts = [context.page_content for context in base_response["context"]]

for ft_test_row in ft_dataset:
  ft_response = ft_graph.invoke({"question" : ft_test_row.eval_sample.user_input})
  ft_test_row.eval_sample.response = ft_response["response"]
  ft_test_row.eval_sample.retrieved_contexts = [context.page_content for context in ft_response["context"]]

In [288]:
base_dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Wut iz Meta's role in the development of LLMs?,[I wrote about how Large language models are h...,[Code may be the best application The ethics o...,Meta's role in the development of Large Langua...,Meta has contributed to the development of LLM...,single_hop_specifc_query_synthesizer
1,What happen in September with prompt injection?,"[But on the other hand, the things you sometim...",[Based Development As a computer scientist and...,"In September of the previous year, the term ""p...","In September last year, the term 'prompt injec...",single_hop_specifc_query_synthesizer
2,Who is Simon Willson?,[Posted \n\n31st December 2023 at 11:59 pm · F...,[Simon Willison’s Weblog Subscribe Stuff we fi...,Simon Willison is a writer and blogger who foc...,Simon Willison is the author of a weblog that ...,single_hop_specifc_query_synthesizer
3,What insights has Simon Willison shared about ...,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[easy to follow. The rest of the document incl...,Simon Willison has shared several insights reg...,Simon Willison has discussed the profound impa...,single_hop_specifc_query_synthesizer
4,How has the environmental impact of AI models ...,[The much bigger problem here is the enormous ...,[<1-hop>\n\nPrompt driven app generation is a ...,The environmental impact of AI models has seen...,The environmental impact of AI models has impr...,multi_hop_abstract_query_synthesizer
5,How did the advancements in GPT-4 and the subs...,[Simon Willison’s Weblog\n\nSubscribe\n\nThing...,[<1-hop>\n\nPrompt driven app generation is a ...,The advancements in GPT-4 and subsequent devel...,The advancements in GPT-4 and subsequent techn...,multi_hop_abstract_query_synthesizer
6,How have advancements in energy efficiency and...,[The much bigger problem here is the enormous ...,[<1-hop>\n\nPrompt driven app generation is a ...,Advancements in energy efficiency and the envi...,Advancements in energy efficiency and the envi...,multi_hop_abstract_query_synthesizer
7,How has the environmental impact of AI models ...,[The much bigger problem here is the enormous ...,[<1-hop>\n\nPrompt driven app generation is a ...,The environmental impact of AI models has impr...,The environmental impact of AI models has impr...,multi_hop_abstract_query_synthesizer
8,What were some key advancements in large langu...,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"In 2023, there were several key advancements i...","In 2023, large language models (LLMs) saw sign...",multi_hop_specific_query_synthesizer
9,How has Google's Gemini model contributed to t...,"[Multimodal vision is common, audio and video ...",[<1-hop>\n\ngets you OpenAI’s most expensive m...,Google's Gemini model has significantly contri...,"Google's Gemini model, particularly the Gemini...",multi_hop_specific_query_synthesizer


In [289]:
ft_dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Wut iz Meta's role in the development of LLMs?,[I wrote about how Large language models are h...,[Code may be the best application The ethics o...,Meta's role in the development of Large Langua...,Meta has contributed to the development of LLM...,single_hop_specifc_query_synthesizer
1,What happen in September with prompt injection?,"[But on the other hand, the things you sometim...",[Based Development As a computer scientist and...,"In September of the previous year, the term ""p...","In September last year, the term 'prompt injec...",single_hop_specifc_query_synthesizer
2,Who is Simon Willson?,[Posted \n\n31st December 2023 at 11:59 pm · F...,[Simon Willison’s Weblog Subscribe Stuff we fi...,Simon Willison is a writer and blogger who foc...,Simon Willison is the author of a weblog that ...,single_hop_specifc_query_synthesizer
3,What insights has Simon Willison shared about ...,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[easy to follow. The rest of the document incl...,Simon Willison has shared several insights reg...,Simon Willison has discussed the profound impa...,single_hop_specifc_query_synthesizer
4,How has the environmental impact of AI models ...,[The much bigger problem here is the enormous ...,[<1-hop>\n\nPrompt driven app generation is a ...,The environmental impact of AI models has seen...,The environmental impact of AI models has impr...,multi_hop_abstract_query_synthesizer
5,How did the advancements in GPT-4 and the subs...,[Simon Willison’s Weblog\n\nSubscribe\n\nThing...,[<1-hop>\n\nPrompt driven app generation is a ...,The advancements in GPT-4 and subsequent devel...,The advancements in GPT-4 and subsequent techn...,multi_hop_abstract_query_synthesizer
6,How have advancements in energy efficiency and...,[The much bigger problem here is the enormous ...,[<1-hop>\n\nPrompt driven app generation is a ...,Advancements in energy efficiency and the envi...,Advancements in energy efficiency and the envi...,multi_hop_abstract_query_synthesizer
7,How has the environmental impact of AI models ...,[The much bigger problem here is the enormous ...,[<1-hop>\n\nPrompt driven app generation is a ...,The environmental impact of AI models has impr...,The environmental impact of AI models has impr...,multi_hop_abstract_query_synthesizer
8,What were some key advancements in large langu...,[Simon Willison’s Weblog\n\nSubscribe\n\nStuff...,[<1-hop>\n\nSimon Willison’s Weblog Subscribe ...,"In 2023, there were several key advancements i...","In 2023, large language models (LLMs) saw sign...",multi_hop_specific_query_synthesizer
9,How has Google's Gemini model contributed to t...,"[Multimodal vision is common, audio and video ...",[<1-hop>\n\ngets you OpenAI’s most expensive m...,Google's Gemini model has significantly contri...,"Google's Gemini model, particularly the Gemini...",multi_hop_specific_query_synthesizer


In [290]:
from ragas import EvaluationDataset

base_evaluation_dataset = EvaluationDataset.from_pandas(base_dataset.to_pandas())
ft_evaluation_dataset = EvaluationDataset.from_pandas(ft_dataset.to_pandas())

In [291]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

In [292]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

base_result = evaluate(
    dataset=base_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)

ft_result = evaluate(
    dataset=ft_evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[41]: TimeoutError()


In [293]:
base_result

{'context_recall': 0.7816, 'faithfulness': 0.9268, 'factual_correctness': 0.6267, 'answer_relevancy': 0.7774, 'context_entity_recall': 0.4477, 'noise_sensitivity_relevant': 0.2554}

In [294]:
ft_result

{'context_recall': 0.7955, 'faithfulness': 0.9335, 'factual_correctness': 0.6333, 'answer_relevancy': 0.8559, 'context_entity_recall': 0.4366, 'noise_sensitivity_relevant': 0.2491}

# Analysis of Base and Fine-tuned Model Results

Here’s a detailed analysis comparing the `base_result` and `ft_result` dictionaries, including percentage changes for each metric.

## 1. Context Recall
- **Base Result:** 0.7816
- **Fine-tuned Result:** 0.7955
- **Percentage Change:** +1.79%
- **Analysis:** The fine-tuned model shows a slight improvement in context recall, indicating that it is better at retrieving relevant contexts related to the questions posed. This suggests that the fine-tuning process has enhanced the model's ability to understand and recall relevant information from the context.

## 2. Faithfulness
- **Base Result:** 0.9268
- **Fine-tuned Result:** 0.9335
- **Percentage Change:** +0.76%
- **Analysis:** The fine-tuned model also demonstrates a higher faithfulness score, which means it is more likely to provide answers that are consistent with the provided context. This improvement indicates that the fine-tuning has helped the model maintain accuracy in its responses, ensuring that the information it provides aligns closely with the context.

## 3. Factual Correctness
- **Base Result:** 0.6267
- **Fine-tuned Result:** 0.6333
- **Percentage Change:** +1.05%
- **Analysis:** There is a marginal increase in factual correctness in the fine-tuned model. While both models have relatively low scores in this area, the fine-tuned model is slightly better at providing factually accurate information. This suggests that the fine-tuning process has had a positive impact on the model's ability to deliver correct information.

## 4. Answer Relevancy
- **Base Result:** 0.7774
- **Fine-tuned Result:** 0.8559
- **Percentage Change:** +10.06%
- **Analysis:** The fine-tuned model shows a significant improvement in answer relevancy. This indicates that the responses generated by the fine-tuned model are more pertinent to the questions asked, suggesting that the fine-tuning has effectively enhanced the model's ability to generate contextually appropriate answers.

## 5. Context Entity Recall
- **Base Result:** 0.4477
- **Fine-tuned Result:** 0.4366
- **Percentage Change:** -2.48%
- **Analysis:** Interestingly, the fine-tuned model shows a slight decrease in context entity recall. This metric measures the model's ability to identify and recall specific entities within the context. The drop may suggest that while the fine-tuned model is better at generating relevant answers, it may not be as effective at recognizing specific entities within the context.

## 6. Noise Sensitivity Relevant
- **Base Result:** 0.2554
- **Fine-tuned Result:** 0.2491
- **Percentage Change:** -2.47%
- **Analysis:** The fine-tuned model has a lower score in noise sensitivity relevant, indicating that it may be slightly less affected by irrelevant or noisy information in the context. This could imply that the fine-tuning has helped the model focus more on relevant information, although the overall scores in this area are low for both models.

## Summary
Overall, the fine-tuned model demonstrates improvements in most metrics, particularly in context recall, faithfulness, and answer relevancy, with percentage changes of +1.79%, +0.76%, and +10.06%, respectively. These enhancements suggest that the fine-tuning process has effectively improved the model's ability to generate accurate and relevant responses based on the provided context. However, the slight decreases in context entity recall and noise sensitivity relevant, with percentage changes of -2.48% and -2.47%, indicate areas where the model may need further refinement.

In conclusion, the fine-tuned model appears to be a better performer overall, especially in generating relevant and contextually accurate answers, which is crucial for applications requiring high-quality information retrieval and response generation.