# Fine-tuning Embeddings for RAG on Specific Data

As we start our "fine-tuning" week, we'll start with the lowest hanging improvement one can do for RAG - which is:

Fine-tuning embeddings!

- 🤝 Breakout Room #1:
  - Task 1: Dependencies and Boilerplate
  - Task 2: Loading Data
  - Task 3: Constructing a Fine-tuning Dataset
  - Task 4: Fine-tuning `snowflake-arctic-embed-l`
  - Task 5: Evaluating our Retriever



#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.

#####❓ Question #1:

Describe the nuance between using Q&D pairs to train the embedding model vs. inter-document pairs/related sentences.

What caveats does this approach have? Are there any special considerations for what kind of Q's we should use?

---

**ANSWER:**

We are specifically relating *the questions* to *the documents*. This means that we are making our embedding model at the very specific task of relating potential questions to specific documents.

There are many caveats, but the main ones are:

- Your Q's should reflect the Q's of your users
- This kind of fine-tuning will (purposefully) "overfit" on your data; this is the desired result in this case.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [None]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies

>> NOTE: You do not need to do these steps if you are running this notebook locally with `uv`.

In [None]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/413.6 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m413.6/413.6 kB[0m [31m24.7 MB/s[0m eta [36m0:00:00[0m
[?25h

In [None]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

### Provide OpenAI API Key

In [None]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [None]:
!mkdir data

mkdir: cannot create directory ‘data’: File exists


In [None]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31392    0 31392    0     0  38859      0 --:--:-- --:--:-- --:--:-- 38851


In [None]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70292    0 70292    0     0  63479      0 --:--:--  0:00:01 --:--:-- 63497


In [None]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import BSHTMLLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.html", loader_cls=BSHTMLLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

Next we can load/split these documents as follows.

>> NOTE: You may need to run this cell twice to get it to work.

In [None]:
training_documents = text_splitter.split_documents(text_loader.load())

In [None]:
len(training_documents)

102

Next, we're going to associate each of our chunks with a unique identifier.

In [None]:
import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = uuid.uuid4()
  id_set.add(id)
  document.metadata["id"] = id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [None]:
training_split_documents = training_documents[:len(training_documents) - 24]
val_split_documents = training_documents[len(training_documents) - 24:102-12]
test_split_documents = training_documents[102-12:]

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [today](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [None]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [None]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. We, for each document in our list, generate `n_questions` of questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object with key `id`, which have values `str` questions.
- An object with key `question_id`, which have values `List(str)` which will be a list of associated `context_id`.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [None]:
import tqdm
import asyncio
from collections import defaultdict

async def create_questions(documents, n_questions):
  questions = {}
  relevant_docs = defaultdict(list)

  tasks = [
      question_generation_chain.ainvoke({"context" : document.page_content, "n_questions": n_questions}) for document in documents
  ]

  results = await asyncio.gather(*tasks)

  for i, questions_generated in enumerate(results):
      for question in questions_generated.content.split("\n"):
          question_id = str(uuid.uuid4())
          questions[question_id] = "".join(question.split(".")[1:]).strip()
          relevant_docs[question_id].append(documents[i].metadata["id"])

  return questions, dict(relevant_docs)

### REMOVE `await` IF NOT USING ASYNC (HINT: Use `async`)

In [None]:
training_questions, training_relevant_contexts = await create_questions(training_split_documents, 2)
print(training_questions)
print(training_relevant_contexts)

{'b07aea58-f39a-4343-b76e-8b286b108e02': 'What significant advancements in AI were made in 2023, particularly regarding Large Language Models (LLMs)?', '36b901a8-e1ac-4147-bb38-b11821a63200': 'How does the development of LLMs in 2023 relate to the historical context of Artificial Intelligence since the 1950s?', 'f4fc0250-352b-42e0-a2f6-e204684b1768': 'What are some potential applications of Large Language Models (LLMs) mentioned in the context?', '074ac00d-ab16-4d34-a7dd-1a045b4be998': 'What is identified as the biggest unsolved problem related to LLMs?', '5225529e-fa6c-4681-adba-fbba529035b2': 'What are some of the capabilities of Large Language Models (LLMs) mentioned in the context?', 'f8a0e3d2-5d10-44a0-bb3f-8841f9dba3d4': 'What potential negative uses of LLMs are highlighted in the provided context?', '55682315-2ff4-47ad-ad7d-48d80ddbe035': 'What are some ways the author has used LLMs to improve productivity and entertainment?', 'c6a3a265-8124-4524-8174-0a8c73a1df42': 'What concer

We'll use the function to generate training, validation, and test data.

In [None]:
val_questions, val_relevant_contexts = await create_questions(val_split_documents, 2)

In [None]:
test_questions, test_relevant_contexts = await create_questions(test_split_documents, 2)

### Reformating and Saving Datasets

Now, we can save our datasets for later use!

In [None]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [None]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [None]:
train_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : train_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

## Task 4: Fine-tuning `snowflake-arctic-embed-l`

Now that we have a dataset, let's grab a `sentence-transformers` Embeddings model!

We'll be using Snowflake's [`snowflake-arctic-embed-l`](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) as a base embeddings model.

It is a well performing embeddings model by itself, but there's a lot of very specific domain terms and vocabulary in our courpus - so lets fine-tune it and see what that can do for us!

>> NOTE: Skip installing dependencies if you are running this notebook locally.

In [None]:
!pip install -qU sentence_transformers datasets pyarrow

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.1/42.1 MB[0m [31m49.9 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cudf-cu12 24.12.0 requires pyarrow<19.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 19.0.1 which is incompatible.
pylibcudf-cu12 24.12.0 requires pyarrow<19.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 19.0.1 which is incompatible.[0m[31m
[0m

In [None]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [None]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [None]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [None]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [None]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [None]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

##### 🏗️ Activity #2:

Both of these losses sound "cool", but what are they - exactly - under the hood?

Why are these losses specifically doing? Please write a short summary of each loss.

> NOTE: This is a course focused on AI Engineering and the application of AI - looking for a hint? Try pasting the code (linked above) into ChatGPT/Claude to write the summary!

#### ANSWER:

A 'loss' is a mathematical model for calculating how accurate and how confident a model's responses are. Models are trained on these functions to improve their accuracy.

MultipleNegativesRankingLoss assumes input data of or pairs of questions and a correct answer, with optional incorrect answers. It will calculate loss by using the correct answer to one question as a negative exapmle for other questions. It improes accuracy at larger scales by providing additional negative answers.

Matryoshka loss is a modifier function that can apply different loss techniques based on scale and context, and can be used to specify other loss functions for particular context cases. It will compute loss with multiple context lengths simultaneously.

Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [None]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 5 epochs, though you could increase this number if we had a significant amount more data.

In [None]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [None]:
import wandb
wandb.init(mode="disabled")

In [None]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
16,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
32,No log,No log,0.875,1.0,1.0,1.0,0.875,0.333333,0.2,0.1,0.875,1.0,1.0,1.0,0.953866,0.9375,0.9375
48,No log,No log,0.875,1.0,1.0,1.0,0.875,0.333333,0.2,0.1,0.875,1.0,1.0,1.0,0.953866,0.9375,0.9375
50,No log,No log,0.875,1.0,1.0,1.0,0.875,0.333333,0.2,0.1,0.875,1.0,1.0,1.0,0.953866,0.9375,0.9375
64,No log,No log,0.875,1.0,1.0,1.0,0.875,0.333333,0.2,0.1,0.875,1.0,1.0,1.0,0.953866,0.9375,0.9375
80,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
96,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
100,No log,No log,0.916667,1.0,1.0,1.0,0.916667,0.333333,0.2,0.1,0.916667,1.0,1.0,1.0,0.969244,0.958333,0.958333
112,No log,No log,0.875,1.0,1.0,1.0,0.875,0.333333,0.2,0.1,0.875,1.0,1.0,1.0,0.953866,0.9375,0.9375
128,No log,No log,0.875,1.0,1.0,1.0,0.875,0.333333,0.2,0.1,0.875,1.0,1.0,1.0,0.953866,0.9375,0.9375


In [None]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [None]:
#hf_username = "Technologic101"

In [None]:
#model.push_to_hub(f"{hf_username}/finetuned_arctic_ft")

HfHubHTTPError: 409 Client Error: Conflict for url: https://huggingface.co/api/repos/create (Request ID: Root=1-67b501ef-6645f0a930e3a58123bad0a3;d9d48e0e-d1f4-47a1-a03b-08a3ee948bb7)

You already created this model repo

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [39]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".

In [40]:
from tqdm.auto import tqdm

def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate, we'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

### `text-embedding-3-small`

In [41]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

  0%|          | 0/24 [00:00<?, ?it/s]

In [42]:
te3_results_df = pd.DataFrame(te3_results)

In [43]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

1.0

### `Snowflake/snowflake-arctic-embed-l` (base)

In [44]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

  0%|          | 0/24 [00:00<?, ?it/s]

In [45]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [46]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

0.9166666666666666

### `Snowflake/snowflake-arctic-embed-l` (fine-tuned)

In [47]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


  0%|          | 0/24 [00:00<?, ?it/s]

In [48]:
finetune_results_df = pd.DataFrame(finetune_results)

In [49]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

1.0

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [50]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap  = 50,
    length_function = len
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the untrained retrieval model.

#### R - Retrieval

In [51]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 6})

#### A - Augmented

In [52]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
Given a provided context and a question, you must answer the question. If you do not know the answer, you must state that you do not know.

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [53]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [54]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [55]:
base_rag_chain.invoke({"question" : "What is an agent?"})["response"]

'An agent, in the context of AI, is a term that refers to AI systems that can act on your behalf. However, the term is considered vague and lacks a single, clear definition. Some people view agents as systems that autonomously perform tasks, while others think of them as LLMs (large language models) that utilize tools to solve problems. The concept of "autonomy" is often included in discussions about agents, but without a clear definition. Overall, the term remains frustratingly ambiguous, and there is skepticism about the utility of agents due to challenges such as gullibility, where AI systems may struggle to distinguish truth from fiction.'

In [56]:
base_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [57]:
base_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'I do not know.'

In [58]:
base_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'I do not know.'

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [59]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 6})

In [60]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [61]:
finetune_rag_chain.invoke({"question" : "What is an Agent?"})["response"]

'An "agent" is a term that lacks a single, clear, and widely understood meaning in the context of AI. It is often used to refer to AI systems that can act on behalf of a user, but the specific definition can vary widely. Some people think of agents as systems that go and act on your behalf, similar to a travel agent, while others consider them as LLMs (large language models) that have access to tools and can run processes in a loop to solve problems. The term "autonomy" is also associated with agents, but again, without a clear definition. Overall, the concept of agents remains vague and is often seen as perpetually "coming soon" in terms of practical implementation.'

In [62]:
finetune_rag_chain.invoke({"question" : "Who has produced better models than GPT-3?"})["response"]

'Organizations that have produced better-than-GPT-3 class models include Anthropic, Mistral, Google, Meta, EleutherAI, Stability AI, TII in Abu Dhabi (Falcon), Microsoft Research, xAI, Replit, Baidu, and several others.'

In [63]:
finetune_rag_chain.invoke({"question" : "What is the laziest month for AI?"})["response"]

'I do not know.'

In [64]:
finetune_rag_chain.invoke({"question" : "What is the largest model that Simon has run on his phone?"})["response"]

'The largest model that Simon has run on his phone is the Llama 3.2 3B model.'

####❓Question #2:

Which LCEL RAG Chain do you think answered the questions better, and why?

#### Answer:

The fine-tuned RAG chain seemed to perform better on first glance, but its responses were not necessarily more correct. The answer to the 'laziest AI' question did not make sense. The answer to the 'largest model' correctly stated that Simon ran Llama 3.2, but the source does not say that it was the largest model he ran.

**UPDATE - on running this notebook again, the fine-tuned embedding model also now replies "I don't know" to the third question, which is an improvement over a nonsense answer! But the last question still inaccurately presents Llama 3.2 as the "largest" model, when the source material doesn't support that claim.

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight info. on how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. The includes the Synthetic Data Generation steps.

In [65]:
!pip install -qU ragas==0.2.10 rapidfuzz

In [66]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from ragas.testset import TestsetGenerator

generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(training_documents, testset_size=10)

dataset.to_pandas()










Applying SummaryExtractor:   0%|          | 0/83 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/127 [00:00<?, ?it/s]



Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/337 [00:00<?, ?it/s]

ERROR:ragas.testset.transforms.engine:unable to apply transformation: Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4o in organization org-ki7oReowfmveoSmMuWrWski0 on tokens per min (TPM): Limit 30000, Used 29872, Requested 430. Please try again in 604ms. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}}


Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

ERROR:ragas.testset.transforms.engine:unable to apply transformation: Node 9c0e9b20-e312-4232-b04b-bb70c0120bfc or 5c1c2158-3f43-4ff2-b17c-016d721b0c07 has no entities


Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/1 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Wht happend in AI in the 1950s?,[Stuff we figured out about AI in 2023\n\n\n\n...,The academic field of Artificial Intelligence ...,single_hop_specifc_query_synthesizer
1,What are the challenges and ethical considerat...,[Large Language Models\nThey’re actually quite...,The challenges associated with developing and ...,single_hop_specifc_query_synthesizer
2,Wht did we lern about LLMs in 2024?,[Here’s the sequel to this post: Things we lea...,"In 2024, we learned that Large Language Models...",single_hop_specifc_query_synthesizer
3,What is the most surprising aspect of building...,[They’re actually quite easy to build\nThe mos...,The most surprising aspect of building LLMs is...,single_hop_specifc_query_synthesizer
4,"What Mistral do in AI, they make LLMs like Ope...","[If you can gather the right data, and afford ...",Mistral is one of the organizations that have ...,single_hop_specifc_query_synthesizer
5,How much did it cost to train Microsoft's Phi-...,[The training cost (hardware and electricity) ...,The training cost for Microsoft's Phi-2 using ...,single_hop_specifc_query_synthesizer
6,How does the complexity of training an LLM com...,[So training an LLM still isn’t something a ho...,The complexity of training an LLM is compared ...,single_hop_specifc_query_synthesizer
7,Wht is the imprtance of GPT-3.5 in the context...,[You can run LLMs on your own devices\nIn Janu...,GPT-3 and 3.5 were initially considered the pr...,single_hop_specifc_query_synthesizer
8,Wht is Mistral 7B and how can it be used on an...,"[This unleashed a whirlwind of innovation, whi...",Mistral 7B is a surprisingly great model that ...,single_hop_specifc_query_synthesizer
9,How can WebAssembly be utilized in browsers?,[You can even run them entirely in your browse...,WebAssembly can be used to run applications en...,single_hop_specifc_query_synthesizer


In [67]:
!pip install -qU qdrant_client langchain_qdrant

In [75]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),
)

vector_store_base = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=huggingface_embeddings
)

vector_store_finetune = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=finetune_embeddings
)

vector_store_base.add_documents(documents=training_documents)
retriever_base = vector_store_base.as_retriever(search_kwargs={"k": 5})

vector_store_finetune.add_documents(documents=training_documents)
retriever_finetune = vector_store_finetune.as_retriever(search_kwargs={"k": 5})




In [78]:
from ragas import EvaluationDataset
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import ContextRecall

# Create evaluation dataset from our test data
test_df = pd.DataFrame({
    "user_input": test_questions.values(),
    "retrieved_contexts": [[train_corpus[id] for id in rel_ids] for rel_ids in test_relevant_contexts.values()],
    "reference": ["" for _ in test_questions], # Empty since we're just evaluating retrieval
})

evaluation_dataset = EvaluationDataset.from_pandas(test_df)

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-3.5-turbo"))

def evaluate_embeddings(retriever, name=""):
    results = []
    for question_id, question in test_questions.items():
        retrieved_docs = retriever.invoke(question)
        retrieved_contexts = [doc.page_content for doc in retrieved_docs]

        expected_ids = test_relevant_contexts[question_id]  # Use correct expected IDs
        expected_contexts = [train_corpus[id] for id in expected_ids if id in train_corpus]

        # Join expected_contexts into a single string
        expected_contexts_str = " ".join(expected_contexts)

        if not retrieved_contexts:
            results.append({
                "user_input": question,
                "retrieved_contexts": [], # Pass empty list if no contexts are retrieved
                "reference": expected_contexts_str,
                "context_recall": 0.0
            })
            continue

        results.append({
            "user_input": question,
            "retrieved_contexts": retrieved_contexts,
            "reference": expected_contexts_str
        })

    eval_df = pd.DataFrame(results)
    eval_dataset = EvaluationDataset.from_pandas(eval_df)

    results = evaluate(
        dataset=eval_dataset,
        metrics=[ContextRecall()],
    )

    print(f"\nRetrieval Results for {name}:")
    print(results)
    return results

# Evaluate base Snowflake model
base_vectorstore = FAISS.from_documents(test_split_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 5})
base_results = evaluate_embeddings(base_retriever, "Base Snowflake Model")

# Evaluate fine-tuned model
finetuned_vectorstore = FAISS.from_documents(test_split_documents, finetune_embeddings)
finetuned_retriever = finetuned_vectorstore.as_retriever(search_kwargs={"k": 5})
finetuned_results = evaluate_embeddings(finetuned_retriever, "Fine-tuned Model")


Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]


Retrieval Results for Base Snowflake Model:
{'context_recall': 0.5124}


Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]


Retrieval Results for Fine-tuned Model:
{'context_recall': 0.6015}


IMPROVEMENT!