# Fine-tuning Embeddings for RAG on Specific Data

As we kick off our "fine-tuning" week, we'll begin with the lowest-hanging improvement you can make to a RAG pipeline - which is:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.
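To make "closer together" concrete, here is a minimal sketch using made-up 3-dimensional vectors. Real embeddings have hundreds of dimensions, and real fine-tuning updates model weights rather than the vectors directly; the vectors and the `nudge` helper below are hypothetical and only illustrate the effect on cosine similarity.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nudge(vector, target, step=0.5):
    """Move `vector` a fraction `step` of the way toward `target`."""
    return [v + step * (t - v) for v, t in zip(vector, target)]

# Hypothetical embeddings, chosen by hand for illustration only.
question = [1.0, 0.0, 0.0]   # "Who drives the bus?"
document = [0.6, 0.8, 0.0]   # "The bus was driven by Kyle, the Bus Driver"

before = cosine_similarity(question, document)
after = cosine_similarity(nudge(question, document), document)
print(f"before: {before:.3f}, after: {after:.3f}")
```

After the nudge, the question vector scores a higher cosine similarity against its document, which is the property a retriever relies on.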

##### ❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

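We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model better at the very specific task of relating potential questions to specific documents. Training on inter-document pairs or related sentences would instead teach general topical similarity, which is not the same thing as matching a question to the passage that answers it.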

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [1]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [1]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [2]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

### Provide OpenAI API Key

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll start by downloading the webpages that will serve as our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [34]:
!mkdir data

In [35]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html



In [36]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html



In [37]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [9]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)
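To show what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter first tries to break on separators such as paragraphs and newlines, so its boundaries will differ); the `naive_chunks` helper is hypothetical and only uses fixed-width character windows.

```python
def naive_chunks(text, chunk_size, chunk_overlap):
    """Fixed-width character windows; each window starts
    `chunk_size - chunk_overlap` characters after the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = naive_chunks(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary is less likely to be cut off from its context.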

Next we can load/split these documents as follows.

>> NOTE: You may need to run this cell twice to get it to work.

In [10]:
training_documents = text_splitter.split_documents(text_loader.load())

In [11]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [12]:
import uuid

id_set = set()

for document in training_documents:
  doc_id = str(uuid.uuid4())
  # Regenerate on the (extremely unlikely) chance of a collision.
  while doc_id in id_set:
    doc_id = str(uuid.uuid4())
  id_set.add(doc_id)
  document.metadata["id"] = doc_id

Next, we'll use naive Python slicing to create training, validation, and test sets for the next step.

In [13]:
n_documents = len(training_documents)

# Hold out the last 24 chunks: 12 for validation, 12 for testing.
training_split_documents = training_documents[:n_documents - 24]
val_split_documents = training_documents[n_documents - 24:n_documents - 12]
test_split_documents = training_documents[n_documents - 12:]
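The same slicing can be wrapped in a small helper so the split sizes do not depend on the exact number of chunks. This is only a sketch: `split_documents` is a hypothetical name, and a list of integers stands in for the 102 chunks.

```python
def split_documents(documents, n_val, n_test):
    """Slice a list into train/val/test, taking val and test from the end."""
    n_total = len(documents)
    if n_val + n_test >= n_total:
        raise ValueError("validation and test sets leave no training data")
    train_end = n_total - (n_val + n_test)
    val_end = n_total - n_test
    return documents[:train_end], documents[train_end:val_end], documents[val_end:]

# Stand-in for the 102 chunks produced above.
fake_documents = list(range(102))
train, val, test = split_documents(fake_documents, n_val=12, n_test=12)
print(len(train), len(val), len(test))
```

Because the slices are contiguous and taken in order, the held-out chunks all come from the end of the loaded documents; shuffling before splitting would be one way to avoid that, at the cost of reproducibility unless a seed is fixed.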

## Task 3: Constructing a Fine-tuning Dataset

Using the chunks we created above, we can finally start constructing a fine-tuning dataset with OpenAI's `gpt-4o-mini` (see the [release announcement](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that document

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [14]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [15]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)
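As a rough illustration of what happens to the `{n_questions}` and `{context}` placeholders, plain `str.format` behaves similarly for a template like this one. The shortened template and the sample context below are made up for the example; the real `ChatPromptTemplate` additionally wraps the result in a chat message.

```python
# Same placeholders as `qa_prompt`, shortened for illustration.
template = (
    "You are to generate {n_questions} questions based only on the context.\n\n"
    "Context:\n{context}\n"
)

filled = template.format(
    n_questions=2,
    context="The bus was driven by Kyle, the Bus Driver.",
)
print(filled)
```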

We'll create a simple chain to query the LLM!

In [16]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in the function below - let's take a deeper look:

1. First, we provide a list of documents and a number of questions.
2. For each document in our list, we generate `n_questions` questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object keyed by question `id`, whose values are the `str` questions.
- An object keyed by `question_id`, whose values are `List[str]` of the associated `context_id`s.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [20]:
from tqdm import tqdm
import uuid
import asyncio


async def process_document(document, n_questions):
  questions_generated = await question_generation_chain.ainvoke({"context": document.page_content, "n_questions":n_questions})

  doc_questions = {}
  doc_relevant_docs = {}

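  for line in questions_generated.content.split("\n"):
    # Drop the leading "1." numbering; skip blank or unnumbered lines.
    _, _, question = line.partition(".")
    question = question.strip()
    if not question:
      continue
    question_id = str(uuid.uuid4())
    doc_questions[question_id] = question
    doc_relevant_docs[question_id] = [document.metadata["id"]]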

  return doc_questions, doc_relevant_docs

async def create_questions(documents, n_questions):
  tasks = [process_document(doc, n_questions) for doc in documents]

  questions = {}
  relevant_docs = {}

  for task in tqdm(asyncio.as_completed(tasks), total=len(documents), desc="Processing Document"):
    doc_questions, doc_relevant_docs = await task
    questions.update(doc_questions)
    relevant_docs.update(doc_relevant_docs)

  return questions, relevant_docs
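Parsing the model's numbered list is the most fragile step here, so it can help to test it on its own without calling the LLM. The sketch below is one possible approach, not the notebook's method: `parse_numbered_questions` and the sample output are hypothetical. A regular expression keeps periods inside a question (for example in version numbers) and ignores blank or unnumbered lines.

```python
import re

# A line such as "1. Question text" or "2) Question text".
NUMBERED_LINE = re.compile(r"^\s*\d+[.)]\s*(.+?)\s*$")

def parse_numbered_questions(text):
    """Return the question text from each numbered line, in order."""
    questions = []
    for line in text.splitlines():
        match = NUMBERED_LINE.match(line)
        if match:
            questions.append(match.group(1))
    return questions

# Made-up model output, including a blank line and stray commentary.
llm_output = """1. What does version 2.0 of the tool add?

2. Who maintains the project?
Some trailing commentary the model added."""

parsed = parse_numbered_questions(llm_output)
print(parsed)
```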

In [None]:
# async def create_questions(documents, n_questions):
#   questions = {}
#   relevant_docs = {}

#   for document in documents:
#     questions_generated = await question_generation_chain.ainvoke({"context": document.page_content, "n_questions":n_questions})
#     for question in questions_generated.content.split("\n"):
#       question_id = str(uuid.uuid4())
#       questions[question_id] = "".join(question.split(".")[1:]).strip()
#       relevant_docs[question_id] = [document.metadata["id"]]

#   return questions, relevant_docs

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [21]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)

Processing Document: 100%|██████████| 78/78 [01:33<00:00,  1.20s/it]


We'll use the function to generate training, validation, and test data.

In [23]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

Processing Document: 100%|██████████| 12/12 [00:02<00:00,  5.81it/s]


In [25]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

Processing Document: 100%|██████████| 12/12 [00:01<00:00,  6.87it/s]


### Reformatting and Saving Datasets

Now, we can save our datasets for later use!

In [22]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [24]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [26]:
test_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : test_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)
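Before training on a saved dataset, a quick consistency check can catch broken links between the three keys. The sketch below assumes the same `questions` / `relevant_contexts` / `corpus` layout used above; `check_dataset` and the toy data are hypothetical, and the file is written to a temporary directory rather than the notebook's working directory.

```python
import json
import os
import tempfile

def check_dataset(dataset):
    """Return question ids that are missing from `questions` or that
    point at a context id missing from `corpus`."""
    corpus_ids = set(dataset["corpus"])
    return [
        question_id
        for question_id, context_ids in dataset["relevant_contexts"].items()
        if question_id not in dataset["questions"]
        or any(context_id not in corpus_ids for context_id in context_ids)
    ]

# Tiny hypothetical dataset with the same three keys used above.
toy_dataset = {
    "questions": {"q1": "Who drives the bus?"},
    "relevant_contexts": {"q1": ["d1"]},
    "corpus": {"d1": "The bus was driven by Kyle, the Bus Driver."},
}

path = os.path.join(tempfile.mkdtemp(), "toy_dataset.jsonl")
with open(path, "w") as f:
    json.dump(toy_dataset, f)
with open(path) as f:
    reloaded = json.load(f)

problems = check_dataset(reloaded)
print("problems:", problems)
```

Note that each file written this way holds a single JSON object, so `json.load` reads it back in one call even though the extension is `.jsonl`.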

In [35]:
test_dataset

{'questions': {'f12b320b-e00c-4d0a-8725-82b88a71d1ae': 'What topics were covered in the annotated presentations given in 2023?',
  '708eb80c-5315-420f-a336-87b263f74563': 'Which podcasts featured discussions about Large Language Models?',
  '88ace483-231f-4042-8c5c-b3d39b407f29': 'What are embeddings and why are they considered important in the context of LLMs?',
  'e25b15fb-c5c0-4589-b0fc-77ab975a97eb': 'How does the new llamafile improve the process of running an LLM on a personal computer?',
  '5279719b-36d0-41b9-abe2-e6c901b4d21e': 'What is the numerical value associated with "ai" in the provided context?',
  '4053bf7f-c43e-4721-aa62-2b1325e35703': 'How many times is "llms" mentioned in the context?',
  'f718cf93-b109-4e66-9f29-68f24ec78d44': 'What is the significance of prompt engineering in DALL-E 3 as mentioned in the context?',
  '7829bd29-b771-469d-ba0a-75bced0a6e0a': 'How does the vicuna-7b Large Language Model operate within a web browser?',
  '295dced0-f6aa-4f92-8418-648e15

In [33]:
test_questions

{'f12b320b-e00c-4d0a-8725-82b88a71d1ae': 'What topics were covered in the annotated presentations given in 2023?',
 '708eb80c-5315-420f-a336-87b263f74563': 'Which podcasts featured discussions about Large Language Models?',
 '88ace483-231f-4042-8c5c-b3d39b407f29': 'What are embeddings and why are they considered important in the context of LLMs?',
 'e25b15fb-c5c0-4589-b0fc-77ab975a97eb': 'How does the new llamafile improve the process of running an LLM on a personal computer?',
 '5279719b-36d0-41b9-abe2-e6c901b4d21e': 'What is the numerical value associated with "ai" in the provided context?',
 '4053bf7f-c43e-4721-aa62-2b1325e35703': 'How many times is "llms" mentioned in the context?',
 'f718cf93-b109-4e66-9f29-68f24ec78d44': 'What is the significance of prompt engineering in DALL-E 3 as mentioned in the context?',
 '7829bd29-b771-469d-ba0a-75bced0a6e0a': 'How does the vicuna-7b Large Language Model operate within a web browser?',
 '295dced0-f6aa-4f92-8418-648e1538f507': 'What are som

In [34]:
test_relevant_contexts

{'f12b320b-e00c-4d0a-8725-82b88a71d1ae': ['0cfa06e2-9fc1-4370-882d-82581b82ade6'],
 '708eb80c-5315-420f-a336-87b263f74563': ['0cfa06e2-9fc1-4370-882d-82581b82ade6'],
 '88ace483-231f-4042-8c5c-b3d39b407f29': ['02ac2c76-2396-4a8d-a6a5-4ba94fbecb8a'],
 'e25b15fb-c5c0-4589-b0fc-77ab975a97eb': ['02ac2c76-2396-4a8d-a6a5-4ba94fbecb8a'],
 '5279719b-36d0-41b9-abe2-e6c901b4d21e': ['7a29834d-895c-4b3f-8ad2-7ffbac015006'],
 '4053bf7f-c43e-4721-aa62-2b1325e35703': ['7a29834d-895c-4b3f-8ad2-7ffbac015006'],
 'f718cf93-b109-4e66-9f29-68f24ec78d44': ['fa7cc199-caaf-46bc-bb58-3f07d9733276'],
 '7829bd29-b771-469d-ba0a-75bced0a6e0a': ['fa7cc199-caaf-46bc-bb58-3f07d9733276'],
 '295dced0-f6aa-4f92-8418-648e1538f507': ['6ef5746e-9b7b-4a1f-91e5-3f979fdc1e31'],
 '2bf050e5-ca05-4c96-9be5-9d4c59509d76': ['6ef5746e-9b7b-4a1f-91e5-3f979fdc1e31'],
 '62430514-42aa-47da-9ee5-9026c2e91b78': ['052d4bfb-f842-48c7-a842-95dced8ff4ef'],
 'fdeb20c9-753a-40c3-ac14-55a74e7daf04': ['052d4bfb-f842-48c7-a842-95dced8ff4ef'],
 '0c

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well-performing embeddings model by itself, but there's a lot of very specific domain terminology and vocabulary in our corpus - so let's fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [36]:
!pip install -qU sentence_transformers datasets pyarrow

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 24.12.0 requires pyarrow<19.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 19.0.0 which is incompatible.
cudf-cu12 24.12.0 requires pyarrow<19.0.0a0,>=14.0.0; platform_machine 

In [37]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)


We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [38]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [50]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [40]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

In [44]:
for query_id, query in queries.items():
    print(query)
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    print(text)
    print("****************************")

What role does synthetic data play in the pretraining of models, particularly in the Phi series?
Synthetic data as a substantial component of pretraining is becoming increasingly common, and the Phi series of models has consistently emphasized the importance of synthetic data. Rather than serving as a cheap substitute for organic data, synthetic data has several direct advantages over organic data.
****************************
How does synthetic data compare to organic data in terms of advantages?
Synthetic data as a substantial component of pretraining is becoming increasingly common, and the Phi series of models has consistently emphasized the importance of synthetic data. Rather than serving as a cheap substitute for organic data, synthetic data has several direct advantages over organic data.
****************************
What analogy is used to describe LLMs in the context provided?
A drum I’ve been banging for a while is that LLMs are power-user tools—they’re chainsaws disguised a

Now we can create a `torch` `DataLoader`!

In [45]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [46]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)
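One thing worth checking with this loss: it treats the other documents in a batch as negatives, and our dataset stores the two questions for each chunk back to back. Without shuffling, a batch can therefore contain the same document twice, once as a positive and once as a supposed negative. The sketch below only counts how often that happens; the id list is a hypothetical stand-in for 78 chunks with 2 questions each, and `batches_with_repeated_docs` is an illustrative helper.

```python
import random

def batches_with_repeated_docs(doc_ids, batch_size):
    """Count batches in which some document id appears more than once."""
    count = 0
    for start in range(0, len(doc_ids), batch_size):
        batch = doc_ids[start:start + batch_size]
        if len(set(batch)) < len(batch):
            count += 1
    return count

# 78 hypothetical chunks with 2 questions each, stored back to back.
doc_ids = [doc for doc in range(78) for _ in range(2)]
in_order = batches_with_repeated_docs(doc_ids, batch_size=10)

shuffled = doc_ids[:]
random.Random(0).shuffle(shuffled)
after_shuffle = batches_with_repeated_docs(shuffled, batch_size=10)

print(f"in order: {in_order} batches repeat a document; shuffled: {after_shuffle}")
```

Shuffling (for example `shuffle=True` on the `DataLoader`) usually reduces the overlap but does not guarantee it disappears; whether it matters for a dataset this small is something to verify empirically.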

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

What are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!  

### ANSWER
****
**MultipleNegativesRankingLoss**  


*   Takes a SentenceTransformer model, a scaling factor (scale=20.0), and a similarity function (cosine similarity by default).
*   Uses CrossEntropyLoss to optimize sentence embeddings.
*   Each anchor (query) is paired with a positive example.
*   Other examples in the batch serve as negative samples.
*   Minimizes the negative log-likelihood (cross-entropy) of selecting the correct positive sample.
*   The loss function encourages the model to increase the similarity between a query and its positive example while decreasing the similarity with all other sentences in the batch.  
****
**MatryoshkaLoss**  
* It uses a technique called Matryoshka Representation Learning to train embeddings that stay useful when truncated to progressively smaller dimensions (`matryoshka_dims`), applying the wrapped loss function at each size so the model is optimized at several resolutions.  

* It takes a SentenceTransformer model, a loss function, and a list of embedding dimensions (matryoshka_dims) along with optional weights (matryoshka_weights). The loss is computed for each of the dimensions, and the results are combined to optimize the model.
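The two summaries above can be sketched in plain Python. This is an approximation for intuition, not the `sentence-transformers` implementation: it assumes cosine similarity, a scale of 20, equal weights for every dimension, and hand-made 4-dimensional vectors; `mnr_loss` and `matryoshka_loss` are hypothetical names.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(queries, docs, scale=20.0):
    """Mean cross-entropy where query i's correct 'class' is document i
    and every other document in the batch acts as a negative."""
    total = 0.0
    for i, query in enumerate(queries):
        scores = [scale * cosine(query, doc) for doc in docs]
        log_sum_exp = math.log(sum(math.exp(s) for s in scores))
        total += log_sum_exp - scores[i]
    return total / len(queries)

def matryoshka_loss(queries, docs, dims):
    """Sum the inner loss over embeddings truncated to each size in `dims`."""
    return sum(
        mnr_loss([q[:d] for q in queries], [doc[:d] for doc in docs])
        for d in dims
    )

# Hypothetical 4-dimensional embeddings for two (question, document) pairs.
docs = [[1.0, 0.1, 0.0, 0.2], [0.0, 1.0, 0.3, 0.0]]
aligned = [[0.9, 0.2, 0.1, 0.1], [0.1, 0.9, 0.2, 0.1]]
swapped = aligned[::-1]  # each question now sits next to the wrong document

print(mnr_loss(aligned, docs), mnr_loss(swapped, docs))
print(matryoshka_loss(aligned, docs, dims=[4, 2]))
```

The loss is low when each question is most similar to its own document and high when the pairs are swapped, and the Matryoshka wrapper asks for that to hold for the first 2 dimensions as well as for all 4.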





Now we can set up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [47]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 10 epochs, though you could increase this number if we had significantly more data.

In [51]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [49]:
import wandb
wandb.init(mode="disabled")

In [52]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)
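The warm-up length follows from simple arithmetic. The sketch below reproduces it under the assumption that exactly 2 questions were parsed for each of the 78 training chunks (the real count can differ if the model returned extra or malformed lines).

```python
import math

n_chunks = 78            # training chunks, from the progress bar above
questions_per_chunk = 2  # the `n_questions` passed to `create_questions`
batch_size = 10
epochs = 10

n_examples = n_chunks * questions_per_chunk
# Number of batches per epoch when the last partial batch is kept.
steps_per_epoch = math.ceil(n_examples / batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * 0.1)

print(n_examples, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the first 16 of 160 steps, roughly one epoch, are spent ramping the learning rate up.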




| Step | Training Loss | Validation Loss | Cosine Accuracy@1 | Cosine Accuracy@3 | Cosine Accuracy@5 | Cosine Accuracy@10 | Cosine Precision@1 | Cosine Precision@3 | Cosine Precision@5 | Cosine Precision@10 | Cosine Recall@1 | Cosine Recall@3 | Cosine Recall@5 | Cosine Recall@10 | Cosine Ndcg@10 | Cosine Mrr@10 | Cosine Map@100 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | No log | No log | 0.833333 | 0.916667 | 1.0 | 1.0 | 0.833333 | 0.305556 | 0.2 | 0.1 | 0.833333 | 0.916667 | 1.0 | 1.0 | 0.916345 | 0.888889 | 0.888889 |
| 32 | No log | No log | 0.833333 | 1.0 | 1.0 | 1.0 | 0.833333 | 0.333333 | 0.2 | 0.1 | 0.833333 | 1.0 | 1.0 | 1.0 | 0.933033 | 0.909722 | 0.909722 |
| 48 | No log | No log | 0.833333 | 1.0 | 1.0 | 1.0 | 0.833333 | 0.333333 | 0.2 | 0.1 | 0.833333 | 1.0 | 1.0 | 1.0 | 0.933033 | 0.909722 | 0.909722 |
| 50 | No log | No log | 0.833333 | 1.0 | 1.0 | 1.0 | 0.833333 | 0.333333 | 0.2 | 0.1 | 0.833333 | 1.0 | 1.0 | 1.0 | 0.933033 | 0.909722 | 0.909722 |
| 64 | No log | No log | 0.791667 | 1.0 | 1.0 | 1.0 | 0.791667 | 0.333333 | 0.2 | 0.1 | 0.791667 | 1.0 | 1.0 | 1.0 | 0.906744 | 0.875 | 0.875 |
| 80 | No log | No log | 0.791667 | 1.0 | 1.0 | 1.0 | 0.791667 | 0.333333 | 0.2 | 0.1 | 0.791667 | 1.0 | 1.0 | 1.0 | 0.906744 | 0.875 | 0.875 |
| 96 | No log | No log | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.833333 | 0.319444 | 0.2 | 0.1 | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.924689 | 0.899306 | 0.899306 |
| 100 | No log | No log | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.833333 | 0.319444 | 0.2 | 0.1 | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.924689 | 0.899306 | 0.899306 |
| 112 | No log | No log | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.833333 | 0.319444 | 0.2 | 0.1 | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.924689 | 0.899306 | 0.899306 |
| 128 | No log | No log | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.833333 | 0.319444 | 0.2 | 0.1 | 0.833333 | 0.958333 | 1.0 | 1.0 | 0.922863 | 0.897222 | 0.897222 |
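The `Cosine Precision@5` column reads 0.2 even where `Cosine Accuracy@5` is 1.0, which can look alarming at first. Because each validation question has exactly one relevant chunk, at most 1 of the top 5 results can be relevant. The sketch below shows the per-query arithmetic under that assumption; `metrics_at_k` and the ranking are hypothetical, and the evaluator's own implementation handles the general case of several relevant documents.

```python
def metrics_at_k(ranked_ids, relevant_id, k):
    """Accuracy, precision, and reciprocal rank at k for one query
    that has exactly one relevant document."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    accuracy = 1.0 if hit else 0.0
    precision = (1.0 / k) if hit else 0.0
    reciprocal_rank = 1.0 / (top_k.index(relevant_id) + 1) if hit else 0.0
    return accuracy, precision, reciprocal_rank

# Hypothetical ranking: the relevant chunk "d7" comes back in second place.
ranking = ["d3", "d7", "d1", "d9", "d4"]
for k in (1, 3, 5):
    print(k, metrics_at_k(ranking, "d7", k))
```

Averaging these per-query values over all validation questions gives numbers of the kind reported in the training log.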


In [8]:
from huggingface_hub import notebook_login

notebook_login()


In [59]:
hf_username = "melghorab"

In [60]:
model.push_to_hub(f"{hf_username}/legal-ft-v0")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/melghorab/legal-ft-v0/commit/0de11df6f054b998ae353f0ad2b14b20443a6c32'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [20]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

from tqdm import tqdm

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [21]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate! We'll compare our fine-tuned model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [63]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 24/24 [00:12<00:00,  1.91it/s]


In [64]:
te3_results_df = pd.DataFrame(te3_results)

In [65]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

1.0

### `Snowflake/snowflake-arctic-embed-l` (base)

In [41]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)


100%|██████████| 24/24 [00:11<00:00,  2.03it/s]


In [68]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [69]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

0.875

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [18]:
import json

file_path = '/content/test_dataset.jsonl'

# The file holds a single JSON object, so it can be loaded in one call.
with open(file_path, 'r') as f:
    test_dataset = json.load(f)
type(test_dataset)

dict

In [26]:
type(test_dataset['questions'])

dict

In [31]:
from langchain_huggingface import HuggingFaceEmbeddings
from tqdm import tqdm


# finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_embeddings = HuggingFaceEmbeddings(model_name="melghorab/legal-ft-v0")

finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at melghorab/legal-ft-v0 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 24/24 [00:10<00:00,  2.32it/s]


In [32]:
finetune_results_df = pd.DataFrame(finetune_results)

In [33]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

1.0
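To put the three hit rates side by side, the sketch below converts each one back into a count of missed questions. The rates are the ones reported in this notebook run, and the question count of 24 is taken from the evaluation progress bars; nothing here is re-measured.

```python
n_test_questions = 24  # from the evaluation progress bars in this section

# Hit rates reported by this notebook run (top_k=5).
hit_rates = {
    "text-embedding-3-small": 1.0,
    "snowflake-arctic-embed-l (base)": 0.875,
    "snowflake-arctic-embed-l (fine-tuned)": 1.0,
}

# Convert each rate back into a count of missed questions.
misses_by_model = {
    name: round(n_test_questions * (1 - rate)) for name, rate in hit_rates.items()
}

for name, misses in misses_by_model.items():
    print(f"{name}: {misses} of {n_test_questions} questions missed")
```

With only 24 test questions, the gap between the base and fine-tuned models comes down to 3 questions, so the improvement is encouraging but should be read as suggestive rather than conclusive.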

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [39]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the base (not fine-tuned) retrieval model.

#### R - Retrieval

In [42]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [43]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [45]:
from langchain_openai import ChatOpenAI

rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [46]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [47]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'An agent, in the context of AI, is a term that refers to AI systems that can act on your behalf. However, the term is considered vague and lacks a single, clear definition. Some people view agents as systems that autonomously perform tasks, similar to a travel agent, while others think of them as LLMs (large language models) that utilize tools to solve problems. The concept of autonomy is often included in discussions about agents, but without a clear definition. Overall, there is skepticism about the utility of agents due to challenges such as gullibility, where these systems may struggle to distinguish truth from fiction.'

In [48]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Better-than-GPT-3 class models have been produced by Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several other organizations.'

In [49]:
base_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'I do not know.'

In [50]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [51]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [52]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [53]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An "agent" in the context of AI refers to a system that can act on behalf of a user, but the term is vague and lacks a single, clear definition. There are two main interpretations: one sees agents as entities that perform tasks for users, similar to a travel agent, while the other views them as LLMs (Large Language Models) that utilize tools to solve problems in a loop. The concept of autonomy is often included in discussions about agents, but without a clear definition. Overall, the term remains frustratingly ambiguous, and there is skepticism about the utility of such agents due to challenges like gullibility, where LLMs may believe false information.'

In [54]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [55]:
finetune_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'The laziest month for AI, according to the context, is December.'

In [56]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The largest model that Simon has run on his phone is the Llama 3.2 3B model.'

####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?  

### ANSWER
****
The second RAG chain, which uses the fine-tuned embedder, answered the questions better.  
For the final two questions, the relevant chunks containing the answers were retrieved and passed to the LLM, so it was able to answer them.
  
With the base embedder in the first chain, the right chunks weren't retrieved, so the LLM treated those questions as out of context and responded with "I do not know."
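
The difference comes down to which chunks land in the top-k results. The toy sketch below uses hand-made 2-D vectors (hypothetical, not produced by either embedding model) to show how a question embedding that sits closer to its answer chunk changes what a cosine-similarity retriever returns. As a simplification, the chunk vectors are held fixed and only the question vector moves; in practice the same model embeds both.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return the names of the k chunks most similar to the query."""
    ranked = sorted(
        chunk_vecs,
        key=lambda name: cosine(query_vec, chunk_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical chunk embeddings
chunk_vecs = {
    "answer_chunk": (1.0, 0.0),
    "distractor_a": (0.0, 1.0),
    "distractor_b": (0.3, 1.0),
}

base_query = (0.2, 1.0)   # question embedded far from its answer chunk
tuned_query = (1.0, 0.2)  # same question embedded close to its answer chunk

base_hits = top_k(base_query, chunk_vecs, k=2)
tuned_hits = top_k(tuned_query, chunk_vecs, k=2)
print(base_hits)
print(tuned_hits)
```

When the answer chunk falls outside the top-k, the LLM never sees it, which is consistent with the "I do not know." responses from the base chain.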

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to get more insight into how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. This includes the Synthetic Data Generation steps.

### Install dependencies

In [57]:
!pip install -qU ragas==0.2.10

In [58]:
!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

In [63]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

Please enter your OpenAI API key!··········


### Load Data

In [62]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)
docs = text_loader.load()

### Generate Synthetic Data

In [64]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [65]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [66]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Wht are the key features of GPT-3 that make it...,[Code may be the best application The ethics o...,GPT-3 is a large language model that can answe...,single_hop_specifc_query_synthesizer
1,What significant contributions has Andy Baio m...,[Based Development As a computer scientist and...,"In September last year, Andy Baio and I produc...",single_hop_specifc_query_synthesizer
2,Wut r the key insights about AI that we discov...,[Stuff we figured out about AI in 2023 Simon W...,"In 2023, significant breakthroughs were made i...",single_hop_specifc_query_synthesizer
3,What insights can be drawn about OpenAI's impa...,[easy to follow. The rest of the document incl...,"In 2023, OpenAI's influence was significant, a...",single_hop_specifc_query_synthesizer
4,What factors contributed to the dramatic colla...,[<1-hop>\n\nPrompt driven app generation is a ...,The dramatic collapse in the cost of running p...,multi_hop_abstract_query_synthesizer
5,How does prompt driven app generation relate t...,[<1-hop>\n\nPrompt driven app generation is a ...,Prompt driven app generation has become a comm...,multi_hop_abstract_query_synthesizer
6,How has the universal access to advanced AI mo...,[<1-hop>\n\nPrompt driven app generation is a ...,"The universal access to advanced AI models, su...",multi_hop_abstract_query_synthesizer
7,What are the implications of using agents in A...,[<1-hop>\n\nPrompt driven app generation is a ...,The implications of using agents in AI for dat...,multi_hop_abstract_query_synthesizer
8,What are the implications of the ease of build...,[<1-hop>\n\nCode may be the best application T...,The ease of building LLMs has significant impl...,multi_hop_specific_query_synthesizer
9,What were the key developments in Artificial I...,[<1-hop>\n\nStuff we figured out about AI in 2...,"In 2023, significant breakthroughs in Artifici...",multi_hop_specific_query_synthesizer


In [67]:
dataset.to_pandas().to_csv('ragas_data_test_embedder.csv')

### RAG LangChain with BASE EMBEDDER

In [68]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

79

In [69]:
from langchain_huggingface import HuggingFaceEmbeddings

# Base (not fine-tuned) embedding model used as the baseline retriever
huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")


In [72]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

# set size of vectors to 1024 to match our embedder
client.create_collection(
    collection_name="llms",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="llms",
    embedding=huggingface_embeddings,
)
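
The `size=1024` above has to equal the length of the vectors the embedding model produces, and a mismatch is likely to surface as an error when the store is created or populated. One way to avoid hardcoding the number, sketched below, is to embed a short probe string and measure the result. `embed_query` is the standard LangChain embeddings method; `FakeEmbedder` and `embedding_dimension` are hypothetical names used so the example runs without downloading a model.

```python
class FakeEmbedder:
    """Hypothetical stand-in exposing the same `embed_query` method as LangChain embeddings."""

    def __init__(self, dim):
        self.dim = dim

    def embed_query(self, text):
        # Return a dummy vector of the configured length
        return [0.0] * self.dim


def embedding_dimension(embedder, probe_text="dimension probe"):
    """Infer the vector size by embedding a short probe string."""
    return len(embedder.embed_query(probe_text))


vector_size = embedding_dimension(FakeEmbedder(1024))
print(vector_size)
```

In the notebook, one could pass `huggingface_embeddings` in place of the stand-in and feed the result to `VectorParams(size=...)`.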

In [73]:
_ = vector_store.add_documents(documents=split_documents)
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

In [74]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

In [82]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [83]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

In [84]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Build LangGraph using BASE EMBEDDER

In [85]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

In [86]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

In [87]:
response = graph.invoke({"question" : "Who has produced better models than GPT-3?"})
response["response"]

'GPT-4 has been produced and is considered better than GPT-3. Additionally, the context suggests that other labs are also developing models that may exceed the capabilities of GPT-3 as advancements in AI continue.'

Right answer:  
Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.

### Evaluate with RAGAS

In [88]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [89]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Wht are the key features of GPT-3 that make it...,[That same laptop that could just about run a ...,[Code may be the best application The ethics o...,The context provided does not specifically dis...,GPT-3 is a large language model that can answe...,single_hop_specifc_query_synthesizer
1,What significant contributions has Andy Baio m...,"[That’s clearly not happening. Instead, we are...",[Based Development As a computer scientist and...,I do not know.,"In September last year, Andy Baio and I produc...",single_hop_specifc_query_synthesizer
2,Wut r the key insights about AI that we discov...,"[That’s clearly not happening. Instead, we are...",[Stuff we figured out about AI in 2023 Simon W...,The key insights about AI that were discovered...,"In 2023, significant breakthroughs were made i...",single_hop_specifc_query_synthesizer
3,What insights can be drawn about OpenAI's impa...,"[That’s clearly not happening. Instead, we are...",[easy to follow. The rest of the document incl...,"In 2023, OpenAI has had a significant impact o...","In 2023, OpenAI's influence was significant, a...",single_hop_specifc_query_synthesizer
4,What factors contributed to the dramatic colla...,"[Meanwhile, it’s increasingly common for end u...",[<1-hop>\n\nPrompt driven app generation is a ...,The dramatic collapse in the cost of running p...,The dramatic collapse in the cost of running p...,multi_hop_abstract_query_synthesizer
5,How does prompt driven app generation relate t...,[I get it. There are plenty of reasons to disl...,[<1-hop>\n\nPrompt driven app generation is a ...,The provided context does not explicitly discu...,Prompt driven app generation has become a comm...,multi_hop_abstract_query_synthesizer
6,How has the universal access to advanced AI mo...,"[That’s clearly not happening. Instead, we are...",[<1-hop>\n\nPrompt driven app generation is a ...,The context provided does not specifically add...,"The universal access to advanced AI models, su...",multi_hop_abstract_query_synthesizer
7,What are the implications of using agents in A...,"[That’s clearly not happening. Instead, we are...",[<1-hop>\n\nPrompt driven app generation is a ...,The provided context does not specifically add...,The implications of using agents in AI for dat...,multi_hop_abstract_query_synthesizer
8,What are the implications of the ease of build...,[I get it. There are plenty of reasons to disl...,[<1-hop>\n\nCode may be the best application T...,The ease of building large language models (LL...,The ease of building LLMs has significant impl...,multi_hop_specific_query_synthesizer
9,What were the key developments in Artificial I...,"[That’s clearly not happening. Instead, we are...",[<1-hop>\n\nStuff we figured out about AI in 2...,Some key developments in Artificial Intelligen...,"In 2023, significant breakthroughs in Artifici...",multi_hop_specific_query_synthesizer


In [90]:
dataset.to_pandas().to_csv('ragas_data_test_embedder_with response.csv')

In [91]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluation_dataset

EvaluationDataset(features=['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference'], len=12)

In [92]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

In [93]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.2431, 'faithfulness': 0.5743, 'factual_correctness': 0.1825, 'answer_relevancy': 0.3959, 'context_entity_recall': 0.2655, 'noise_sensitivity_relevant': 0.1492}

### Building RAG LangChain with Tuned Embedder

In [94]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="melghorab/legal-ft-v0")

client = QdrantClient(":memory:")

# set size of vectors to 1024 to match our embedder
client.create_collection(
    collection_name="llms_tuned",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="llms_tuned",
    embedding=finetune_embeddings,
)

Some weights of BertModel were not initialized from the model checkpoint at melghorab/legal-ft-v0 and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [95]:
_ = vector_store.add_documents(documents=split_documents)
retriever = vector_store.as_retriever(search_kwargs={"k": 5})

In [97]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

In [96]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

In [98]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

In [99]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()

In [100]:
response = graph.invoke({"question" : "Who has produced better models than GPT-3?"})
response["response"]

'The context does not specify who has produced better models than GPT-3; it only mentions that 18 organizations now have models that rank higher than the original GPT-4 on the Chatbot Arena Leaderboard. Therefore, I do not know the specific producers of models better than GPT-3.'

### Evaluate new embedder

In [101]:
import numpy as np
for test_row in dataset:
  test_row.eval_sample.response = np.nan
  test_row.eval_sample.retrieved_contexts = np.nan
dataset.to_pandas()

  Expected `list[str]` but got `float` with value `nan` - serialized value may not be as expected
  Expected `str` but got `float` with value `nan` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Wht are the key features of GPT-3 that make it...,,[Code may be the best application The ethics o...,,GPT-3 is a large language model that can answe...,single_hop_specifc_query_synthesizer
1,What significant contributions has Andy Baio m...,,[Based Development As a computer scientist and...,,"In September last year, Andy Baio and I produc...",single_hop_specifc_query_synthesizer
2,Wut r the key insights about AI that we discov...,,[Stuff we figured out about AI in 2023 Simon W...,,"In 2023, significant breakthroughs were made i...",single_hop_specifc_query_synthesizer
3,What insights can be drawn about OpenAI's impa...,,[easy to follow. The rest of the document incl...,,"In 2023, OpenAI's influence was significant, a...",single_hop_specifc_query_synthesizer
4,What factors contributed to the dramatic colla...,,[<1-hop>\n\nPrompt driven app generation is a ...,,The dramatic collapse in the cost of running p...,multi_hop_abstract_query_synthesizer
5,How does prompt driven app generation relate t...,,[<1-hop>\n\nPrompt driven app generation is a ...,,Prompt driven app generation has become a comm...,multi_hop_abstract_query_synthesizer
6,How has the universal access to advanced AI mo...,,[<1-hop>\n\nPrompt driven app generation is a ...,,"The universal access to advanced AI models, su...",multi_hop_abstract_query_synthesizer
7,What are the implications of using agents in A...,,[<1-hop>\n\nPrompt driven app generation is a ...,,The implications of using agents in AI for dat...,multi_hop_abstract_query_synthesizer
8,What are the implications of the ease of build...,,[<1-hop>\n\nCode may be the best application T...,,The ease of building LLMs has significant impl...,multi_hop_specific_query_synthesizer
9,What were the key developments in Artificial I...,,[<1-hop>\n\nStuff we figured out about AI in 2...,,"In 2023, significant breakthroughs in Artifici...",multi_hop_specific_query_synthesizer


In [102]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [103]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,Wht are the key features of GPT-3 that make it...,[Then there’s the rest. If you browse the Chat...,[Code may be the best application The ethics o...,The provided context does not include informat...,GPT-3 is a large language model that can answe...,single_hop_specifc_query_synthesizer
1,What significant contributions has Andy Baio m...,[Law is not ethics. Is it OK to train models o...,[Based Development As a computer scientist and...,Andy Baio made significant contributions in th...,"In September last year, Andy Baio and I produc...",single_hop_specifc_query_synthesizer
2,Wut r the key insights about AI that we discov...,[Stuff we figured out about AI in 2023\n\n\n\n...,[Stuff we figured out about AI in 2023 Simon W...,"In 2023, key insights about AI, particularly L...","In 2023, significant breakthroughs were made i...",single_hop_specifc_query_synthesizer
3,What insights can be drawn about OpenAI's impa...,[Here’s the rest of the transcript. It’s bland...,[easy to follow. The rest of the document incl...,OpenAI significantly impacted the AI landscape...,"In 2023, OpenAI's influence was significant, a...",single_hop_specifc_query_synthesizer
4,What factors contributed to the dramatic colla...,[I’ve been tracking these pricing changes unde...,[<1-hop>\n\nPrompt driven app generation is a ...,The dramatic collapse in the cost of running p...,The dramatic collapse in the cost of running p...,multi_hop_abstract_query_synthesizer
5,How does prompt driven app generation relate t...,[Those US export regulations on GPUs to China ...,[<1-hop>\n\nPrompt driven app generation is a ...,Prompt-driven app generation relates to the en...,Prompt driven app generation has become a comm...,multi_hop_abstract_query_synthesizer
6,How has the universal access to advanced AI mo...,[The GPT-4 barrier was comprehensively broken\...,[<1-hop>\n\nPrompt driven app generation is a ...,The universal access to advanced AI models for...,"The universal access to advanced AI models, su...",multi_hop_abstract_query_synthesizer
7,What are the implications of using agents in A...,[A lot of people are excited about AI agents—a...,[<1-hop>\n\nPrompt driven app generation is a ...,The implications of using agents in AI for dat...,The implications of using agents in AI for dat...,multi_hop_abstract_query_synthesizer
8,What are the implications of the ease of build...,[Law is not ethics. Is it OK to train models o...,[<1-hop>\n\nCode may be the best application T...,The ease of building Large Language Models (LL...,The ease of building LLMs has significant impl...,multi_hop_specific_query_synthesizer
9,What were the key developments in Artificial I...,[Stuff we figured out about AI in 2023\n\n\n\n...,[<1-hop>\n\nStuff we figured out about AI in 2...,"In 2023, the key developments in Artificial In...","In 2023, significant breakthroughs in Artifici...",multi_hop_specific_query_synthesizer


In [104]:
dataset.to_pandas().to_csv('ragas_data_test_embedder_with response_tuned.csv')

In [105]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluation_dataset

EvaluationDataset(features=['user_input', 'retrieved_contexts', 'reference_contexts', 'response', 'reference'], len=12)

In [106]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

In [107]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.5208, 'faithfulness': 0.7607, 'factual_correctness': 0.4442, 'answer_relevancy': 0.7934, 'context_entity_recall': 0.3949, 'noise_sensitivity_relevant': 0.2323}

### Results  

| Metric | Base embedder | Fine-tuned embedder |
| --- | --- | --- |
| context_recall | 0.2431 | 0.5208 |
| faithfulness | 0.5743 | 0.7607 |
| factual_correctness | 0.1825 | 0.4442 |
| answer_relevancy | 0.3959 | 0.7934 |
| context_entity_recall | 0.2655 | 0.3949 |
| noise_sensitivity_relevant | 0.1492 | 0.2323 |
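
To make the comparison easier to read, the sketch below computes the absolute and relative change for each metric. The score values are copied from the two runs reported above; the `compare_scores` helper is a hypothetical name introduced for illustration.

```python
base_scores = {
    "context_recall": 0.2431,
    "faithfulness": 0.5743,
    "factual_correctness": 0.1825,
    "answer_relevancy": 0.3959,
    "context_entity_recall": 0.2655,
    "noise_sensitivity_relevant": 0.1492,
}

tuned_scores = {
    "context_recall": 0.5208,
    "faithfulness": 0.7607,
    "factual_correctness": 0.4442,
    "answer_relevancy": 0.7934,
    "context_entity_recall": 0.3949,
    "noise_sensitivity_relevant": 0.2323,
}


def compare_scores(base, tuned):
    """Return the absolute and relative change per metric (tuned minus base)."""
    rows = {}
    for name in base:
        delta = tuned[name] - base[name]
        rows[name] = {
            "delta": round(delta, 4),
            "relative": round(delta / base[name], 4),
        }
    return rows


comparison = compare_scores(base_scores, tuned_scores)
for name, row in comparison.items():
    print(f"{name:28s} {row['delta']:+.4f} ({row['relative']:+.1%})")
```

Every score goes up with the fine-tuned embedder. Keep in mind that RAGAS documents noise sensitivity as a lower-is-better metric, so its increase should not be read as an improvement, and that a test set of this size gives only a rough signal.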