<a href="https://colab.research.google.com/github/Deepali-Khalkar/Midterm-streamlit/blob/main/Fine_tuning_Embeddings_Midterm.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Fine tuning the Embeddings**

Swap out your existing embedding model for the new fine-tuned version. Provide a link to your fine-tuned embedding model on the Hugging Face Hub.

How does the performance compare to your original RAG application? Test the fine-tuned embedding model using the RAGAS frameworks to quantify any improvements. Provide results in a table.

#### Basic Overview of Fine-tuning Embeddings

In essence, what we want to do when we fine-tune our embedding models is very simple:

```
Move the embeddings for questions relating to a document
closer together with that document
```

We can think of fine-tuning our embedding models as follows:

1) We have some pair of text items that *should* be closer together
  - `Question`, `Document` pairs
  - EX: `Who drives the bus?`, `The bus was driven by Kyle, the Bus Driver`.

2) We use these pairs as labeled data to fine-tune our embedding model.

The process of training helps the model more accurately associate our questions with the correct documents.

## Task 1: Dependencies and Boilerplate

We'll set up our `nest_asyncio` so we can leverage async loops in our Notebook.

We'll also install the required libraries we'll be using today, and set up our OpenAI API key!

### Nest Asyncio

In [12]:
import nest_asyncio

nest_asyncio.apply()

### Install Dependencies


In [13]:
!pip install -qU langchain_openai langchain_huggingface langchain_core langchain langchain_community langchain-text-splitters

In [14]:
!pip install -qU faiss-cpu python-pptx==1.0.2 nltk==3.9.1 pymupdf beautifulsoup4 lxml

### Provide OpenAI API Key

In [15]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter Your OpenAI API Key: ")

Enter Your OpenAI API Key: ··········


## Task 2: Loading Data


In [16]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

path = "data/"
text_loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)

Next, we'll set up a classic naive chunking strategy as we only care that the documents get parsed into chunks that we can generate synthetic questions about.

In [17]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 750,
    chunk_overlap  = 20,
    length_function = len
)

Next we can load/split these documents as follows.


In [18]:
training_documents = text_splitter.split_documents(text_loader.load())

In [19]:
len(training_documents)

34

Next, we're going to associate each of our chunks with a unique identifier.

In [20]:
import uuid

id_set = set()

for document in training_documents:
  id = str(uuid.uuid4())
  while id in id_set:
    id = uuid.uuid4()
  id_set.add(id)
  document.metadata["id"] = id

Next, we'll simply use naive Python slicing to create a training, test, and validation set to prepare our data for the next step.

In [21]:
training_split_documents = training_documents[:len(training_documents) - 10]
val_split_documents = training_documents[len(training_documents) - 10:34-5]
test_split_documents = training_documents[34-5:]

len(training_split_documents)


24

In [22]:
len(val_split_documents)

5

In [23]:
len(test_split_documents)

5

## Task 3: Constructing a Fine-tuning Dataset

Using the nodes we created above, we can finally start constructing a fine-tuning dataset utilizing OpenAI's `gpt-4o-mini` (released [today](https://openai.com/index/gpt-4o-mini-advancing-cost-efficient-intelligence/)).

The basic idea here is straightforward enough:

1. We look at a document
2. We generate questions that could be answered by that node

This gives us a number of question/context pairs that we can use to fine-tune our Embeddings model.

In [24]:
from langchain_openai import ChatOpenAI

qa_chat_model = ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

We'll create a simple Question Generation prompt to query `gpt-4o-mini` to generate Questions for each retrieved context.

In [25]:
from langchain_core.prompts import ChatPromptTemplate

qa_prompt = """\
Given the following context, you must generate questions based on only the provided context.

You are to generate {n_questions} questions which should be provided in the following format:

1. QUESTION #1
2. QUESTION #2
...

Context:
{context}
"""

qa_prompt_template = ChatPromptTemplate.from_template(qa_prompt)

We'll create a simple chain to query the LLM!

In [26]:
question_generation_chain = qa_prompt_template | qa_chat_model

There's a lot going on in this function - let's take a deeper look:

1. First, we provide a list of documents and a number of questions
2. We, for each document in our list, generate `n_questions` of questions.
3. We then associate those questions and contexts via a `UUID`.

> NOTE: The reason we're doing this `UUID` association is for ease of use later in the notebook.

##### 🏗️ Activity #1:

We have:

- Lists of `Documents` with the `metadata` field `id`.

We need:

- An object with key `id`, which have values `str` questions.
- An object with key `question_id`, which have values `List(str)` which will be a list of associated `context_id`.

An Example:

question_object:
```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': 'What types of accessible formats are available for persons with disabilities?',
'df58ee4f-714c-419e-8324-94e5870574e2': 'How do accessible formats benefit persons with disabilities?',
'505fce8b-0e56-48de-a251-61027e396918': 'What are some of the risks associated with the increasing capabilities of AI systems that generate synthetic content?',
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': 'Why is it important for providers of AI systems to embed technical solutions for marking and detecting synthetic content?'
}
 ```

 context_object:
 ```python
{
'b4b95fb6-f827-4454-aa5b-20e62733f172': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'df58ee4f-714c-419e-8324-94e5870574e2': ['dd75bf94-75f3-4603-8e4b-5522f6925638'],
'505fce8b-0e56-48de-a251-61027e396918': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
'8ff0ab33-60dc-4fee-8958-91bfb686aca8': ['ffe3893f-688c-48e8-90bd-7a9feb953d90'],
}
 ```

 As you can see, a piece of context can be associated with more than 1 question.

 The task is to write the Python function(s) to accomplish this task.

 Your function signature is provided below, along with the desired return values.

 > NOTE: You can make any modifications that you desire - assuming that you have the correct input and outputs.

In [27]:
import tqdm

def create_questions(documents, n_questions):
  questions = {}
  relevant_docs = {}

  ### YOUR CODE HERE

  for document in tqdm.tqdm(documents):
    questions_generated = question_generation_chain.invoke({"context": document.page_content, "n_questions": n_questions})
    for question in questions_generated.content.split("\n"):
      question_id = str(uuid.uuid4())
      questions[question_id] = "".join(question.split(".")[1:]).strip()
      relevant_docs[question_id] = [document.metadata["id"]]

  return questions, relevant_docs

In [28]:
training_questions, training_relevant_contexts =  create_questions(training_split_documents, 2)

100%|██████████| 24/24 [00:22<00:00,  1.07it/s]


We'll use the function to generate training, validation, and test data.

In [29]:
val_questions, val_relevant_contexts = create_questions(val_split_documents, 2)

100%|██████████| 5/5 [00:06<00:00,  1.36s/it]


In [30]:
test_questions, test_relevant_contexts =  create_questions(test_split_documents, 2)

100%|██████████| 5/5 [00:03<00:00,  1.37it/s]


### Reformating and Saving Datasets

Now, we can save our datasets for later use!

In [31]:
import json

training_corpus = {train_item.metadata["id"] : train_item.page_content for train_item in training_split_documents}

train_dataset = {
    "questions" : training_questions,
    "relevant_contexts" : training_relevant_contexts,
    "corpus" : training_corpus
}

with open("training_dataset.jsonl", "w") as f:
  json.dump(train_dataset, f)

In [32]:
val_corpus = {val_item.metadata["id"] : val_item.page_content for val_item in val_split_documents}

val_dataset = {
    "questions" : val_questions,
    "relevant_contexts" : val_relevant_contexts,
    "corpus" : val_corpus
}

with open("val_dataset.jsonl", "w") as f:
  json.dump(val_dataset, f)

In [33]:
train_corpus = {test_item.metadata["id"] : test_item.page_content for test_item in test_split_documents}

test_dataset = {
    "questions" : test_questions,
    "relevant_contexts" : test_relevant_contexts,
    "corpus" : train_corpus
}

with open("test_dataset.jsonl", "w") as f:
  json.dump(test_dataset, f)

## Task 4: Fine-tuning `text-embedding-3-small`


In [34]:
!pip install -qU sentence_transformers datasets pyarrow

[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m485.4/485.4 kB[0m [31m10.0 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.1/42.1 MB[0m [31m49.5 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m116.3/116.3 kB[0m [31m10.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m143.5/143.5 kB[0m [31m13.7 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m194.8/194.8 kB[0m [31m17.8 MB/s[0m eta [36m0:00:00[0m
[?25h[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pylibcudf-cu12 24.12.0 requires pyarrow<19.0.0a0,>=14.0.0; platform_machine == "x86_64", but you have pyarrow 19.0.1 which is incompatible.
cudf-cu12 24.12.0 requires pyarrow<19.0.0a0,>=14.0.0; platform_machine 

In [35]:
from sentence_transformers import SentenceTransformer

model_id = "Snowflake/snowflake-arctic-embed-l"
model = SentenceTransformer(model_id)

Error while fetching `HF_TOKEN` secret value from your vault: 'Requesting secret HF_TOKEN timed out. Secrets can only be fetched when running from the Colab UI.'.
You are not authenticated with the Hugging Face Hub in this notebook.
If the error persists, please let us know by opening an issue on GitHub (https://github.com/huggingface/huggingface_hub/issues/new).


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/85.4k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/107 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/704 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/1.38k [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/712k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/297 [00:00<?, ?B/s]

We'll grab some necessary imports from `sentence_transformers` and `torch`.

> NOTE: PyTorch (`torch`) is a popular machine learning library - while we don't go very deep into PyTorch it's an incredibly powerful and interesting library! Please read more about it [here](https://pytorch.org/tutorials/beginner/basics/intro.html)!

In [36]:
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
from sentence_transformers import InputExample

We're using a toy batch size here to reflect the limited number of examples we have.

> NOTE: It is typical to use a much larger batch size (~64+), hardware permitting.

In [37]:
BATCH_SIZE = 10

Let's move our dataset into the expected format for training.

In [38]:
corpus = train_dataset['corpus']
queries = train_dataset['questions']
relevant_docs = train_dataset['relevant_contexts']

examples = []
for query_id, query in queries.items():
    doc_id = relevant_docs[query_id][0]
    text = corpus[doc_id]
    example = InputExample(texts=[query, text])
    examples.append(example)

Now we can create a `torch` `DataLoader`!

In [39]:
loader = DataLoader(
    examples, batch_size=BATCH_SIZE
)

Next up, we'll prepare our loss function!

Loss is an important part of training, fine-tuning, and more. If you want a deep dive on loss - you can check out our [event on loss!](https://www.youtube.com/watch?v=iB8FWR9aD5Q&t=8s).

The core loss we're using today is called `MultipleNegativesRankingLoss` - you can find more information [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MultipleNegativesRankingLoss.py).

This is "wrapped" in `MatryoshkaLoss`, which you can read the implementation of [here](https://github.com/UKPLab/sentence-transformers/blob/master/sentence_transformers/losses/MatryoshkaLoss.py).

In [40]:
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

matryoshka_dimensions = [768, 512, 256, 128, 64]
inner_train_loss = MultipleNegativesRankingLoss(model)
train_loss = MatryoshkaLoss(
    model, inner_train_loss, matryoshka_dims=matryoshka_dimensions
)

Now we can set-up our evaluator.

> NOTE: Due to the formatting of our dataset - this is all we have to do!

In [41]:
from sentence_transformers.evaluation import InformationRetrievalEvaluator

corpus = val_dataset['corpus']
queries = val_dataset['questions']
relevant_docs = val_dataset['relevant_contexts']

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs)

We'll train this model for 5 epochs, though you could increase this number if we had a significant amount more data.

In [42]:
EPOCHS = 10

It's training time!

> NOTE: We're manually defining a warm-up period here - this is just to provide a smooth ramp into our training!

In [43]:
import wandb
wandb.init(mode="disabled")

In [44]:
warmup_steps = int(len(loader) * EPOCHS * 0.1)

model.fit(
    train_objectives=[(loader, train_loss)],
    epochs=EPOCHS,
    warmup_steps=warmup_steps,
    output_path='finetuned_arctic_ft',
    show_progress_bar=True,
    evaluator=evaluator,
    evaluation_steps=50
)

Computing widget examples:   0%|          | 0/1 [00:00<?, ?example/s]



Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
5,No log,No log,0.9,0.9,1.0,1.0,0.9,0.3,0.2,0.1,0.9,0.9,1.0,1.0,0.943068,0.925,0.925
10,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
15,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Step,Training Loss,Validation Loss,Cosine Accuracy@1,Cosine Accuracy@3,Cosine Accuracy@5,Cosine Accuracy@10,Cosine Precision@1,Cosine Precision@3,Cosine Precision@5,Cosine Precision@10,Cosine Recall@1,Cosine Recall@3,Cosine Recall@5,Cosine Recall@10,Cosine Ndcg@10,Cosine Mrr@10,Cosine Map@100
5,No log,No log,0.9,0.9,1.0,1.0,0.9,0.3,0.2,0.1,0.9,0.9,1.0,1.0,0.943068,0.925,0.925
10,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
15,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
20,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
30,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
35,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
40,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
45,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0
50,No log,No log,1.0,1.0,1.0,1.0,1.0,0.333333,0.2,0.1,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [48]:
from huggingface_hub import notebook_login

notebook_login()

VBox(children=(HTML(value='<center> <img\nsrc=https://huggingface.co/front/assets/huggingface_logo-noborder.sv…

In [46]:
hf_username = "deepali1021"

In [50]:
model.push_to_hub(f"{hf_username}/finetuned_arctic_ft-v2")

model.safetensors:   0%|          | 0.00/1.34G [00:00<?, ?B/s]

'https://huggingface.co/deepali1021/finetuned_arctic_ft-v2/commit/68885892148ba8224bbdce0b3a6d15e9626a3bd1'

## Task 5: Evaluating our Retriever

Now that we have fine-tuned our retriever - let's see if it's worthwhile!

We'll start with some basic imports.

In [51]:
import pandas as pd

from langchain_community.vectorstores import FAISS
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_core.documents import Document

Now we'll define a function that will help us evaluate our retrieval process.

> NOTE: We're assuming 1 correct document in a "hit".


In [52]:
def evaluate_openai(
    dataset,
    embed_model,
    top_k=5,
    verbose=False,
):
  corpus = dataset['corpus']
  questions = dataset['questions']
  relevant_docs = dataset['relevant_contexts']
  documents = [Document(page_content=content, metadata={"id": doc_id}) for doc_id, content in corpus.items()]
  vectorstore = FAISS.from_documents(documents, embed_model)

  retriever = vectorstore.as_retriever(search_kwargs={"k": top_k})

  eval_results = []
  for id, question in tqdm.tqdm(questions.items()):
    retrieved_nodes = retriever.invoke(question)
    retrieved_ids = [node.metadata["id"] for node in retrieved_nodes]
    expected_id = relevant_docs[id][0]
    is_hit = expected_id in retrieved_ids
    eval_results.append({"id": id, "question": question, "expected_id": expected_id, "is_hit": is_hit})

  return eval_results

All that's left to do is evaluate, we'll evaluate our model against:

1. OpenAI's closed source `text-embedding-3-small`
2. The base non-fine-tuned version of `Snowflake/snowflake-arctic-embed-l`.

Let's see how it stacks up!

**text-embedding-3-small**


In [53]:
te3_openai = OpenAIEmbeddings(model="text-embedding-3-small")
te3_results = evaluate_openai(test_dataset, te3_openai)

100%|██████████| 10/10 [00:03<00:00,  2.69it/s]


In [54]:
te3_results_df = pd.DataFrame(te3_results)

In [55]:
te3_hit_rate = te3_results_df["is_hit"].mean()
te3_hit_rate

1.0

**Snowflake/snowflake-arctic-embed-l (base)**


In [56]:
from langchain_huggingface import HuggingFaceEmbeddings

huggingface_embeddings = HuggingFaceEmbeddings(model_name="Snowflake/snowflake-arctic-embed-l")
arctic_embed_m_results = evaluate_openai(test_dataset, huggingface_embeddings)

100%|██████████| 10/10 [00:01<00:00,  9.17it/s]


In [57]:
arctic_embed_m_results_df = pd.DataFrame(arctic_embed_m_results)

In [58]:
arctic_embed_m_hit_rate = arctic_embed_m_results_df["is_hit"].mean()
arctic_embed_m_hit_rate

1.0

**Snowflake/snowflake-arctic-embed-l (fine-tuned)**

In [59]:
finetune_embeddings = HuggingFaceEmbeddings(model_name="finetuned_arctic_ft")
finetune_results = evaluate_openai(test_dataset, finetune_embeddings)

Some weights of BertModel were not initialized from the model checkpoint at finetuned_arctic_ft and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
100%|██████████| 10/10 [00:01<00:00,  8.59it/s]


In [60]:
finetune_results_df = pd.DataFrame(finetune_results)

In [61]:
finetune_hit_rate = finetune_results_df["is_hit"].mean()
finetune_hit_rate

1.0

## Task 1: Vibe Checking the RAG Pipeline

We're going to use our RAG pipeline to vibe check on some common phrases now that we've modified it!

### Creating New Chunks

In order to try and evaluate our system more fairly, let's create new chunks that we will use to create our Vector Store.

In [62]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Increased for better context
    chunk_overlap=200,  # Added overlap for better continuity
)

training_documents = text_splitter.split_documents(text_loader.load())

### Base Chain

We'll start by constructing our base chain, which will use the untrained retrieval model.

#### R - Retrieval

In [63]:
from langchain_community.vectorstores import FAISS

base_vectorstore = FAISS.from_documents(training_documents, huggingface_embeddings)
base_retriever = base_vectorstore.as_retriever(search_kwargs={"k": 5})

#### A - Augmented

In [64]:
from langchain_core.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context.
You must only use the provided context, and cannot use your own knowledge.
If you do not know the answer, or it's not contained in the provided context response with "I don't know"

Context:
{context}

Question:
{question}

Answer:
"""

rag_prompt_template = ChatPromptTemplate.from_template(RAG_PROMPT)

#### G - Generation

In [65]:
rag_llm =  ChatOpenAI(
    model="gpt-4o-mini",
    temperature=0
)

#### RAG - LCEL RAG Pipeline

In [66]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough, RunnableParallel

base_rag_chain = (
    {"context": itemgetter("question") | base_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [67]:
base_rag_chain.invoke({"question" : "How many paid leave an employee can take in a year?"})["response"]

'Employees are entitled to 15 days of paid vacation leave per year.'

In [68]:
base_rag_chain.invoke({"question" : "What is the contact of IT help desk?"})["response"]

'Phone: +1-555-123-4567  \nEmail: itservicedesk@example.com'

In [69]:
base_rag_chain.invoke({"question" : "Do you inspect vehicle periodically for safety of passengers ?"})["response"]

'Yes, all vehicles undergo regular inspections and maintenance to ensure they are in optimal condition. In the past year, 500 vehicle inspections were conducted to identify and address any maintenance issues promptly.'

In [71]:
base_rag_chain.invoke({"question" : "The driver was rude. What is the process to make a complaint"})["response"]

"I don't know"

### Fine-tuned Embedding Model

Now let's rebuild our RAG chain with the Fine-tuned model - the only component we need to change is our `FAISS` vectorstore!

In [73]:
finetune_vectorstore = FAISS.from_documents(training_documents, finetune_embeddings)
finetune_retriever = finetune_vectorstore.as_retriever(search_kwargs={"k": 5})

In [74]:
finetune_rag_chain = (
    {"context": itemgetter("question") | finetune_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt_template | rag_llm | StrOutputParser(), "context": itemgetter("context")}
)

In [75]:
base_rag_chain.invoke({"question" : "How many paid leave an employee can take in a year?"})["response"]

'Employees are entitled to 15 days of paid vacation leave per year.'

In [76]:
base_rag_chain.invoke({"question" : "What is the contact of IT help desk?"})["response"]

'Phone: +1-555-123-4567  \nEmail: itservicedesk@example.com'

In [77]:
base_rag_chain.invoke({"question" : "Do you inspect vehicle periodically for safety of passengers ?"})["response"]

'Yes, all vehicles undergo regular inspections and maintenance to ensure they are in optimal condition. In the past year, 500 vehicle inspections were conducted to identify and address any maintenance issues promptly.'

In [78]:
base_rag_chain.invoke({"question" : "The driver was rude. What is the process to make a complaint"})["response"]

"I don't know"

## Task 2: RAGAS Evaluation

It's great to have some idea of how our system is doing based on vibe-checks, but let's use RAGAS to provide more insight info. on how things are improving!

> NOTE: Please recreate *exactly* the RAGAS process we used to evaluate RAG, baselining with the default retriever, and then comparing the new retriever. The includes the Synthetic Data Generation steps.

**Install dependencies**

In [79]:
!pip install -qU ragas==0.2.10

!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/175.7 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [91m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m[91m╸[0m [32m174.1/175.7 kB[0m [31m7.1 MB/s[0m eta [36m0:00:01[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m175.7/175.7 kB[0m [31m4.7 MB/s[0m eta [36m0:00:00[0m
[?25h[?25l   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/45.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m45.5/45.5 kB[0m [31m3.9 MB/s[0m eta [36m0:00:00[0m
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m71.1/71.1 kB[0m [31m6.2 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m16.3 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m2.5/2.5 MB[0m [31m68.5 MB/s[0m eta [

Input RAGAS API Key

In [80]:
import os
from getpass import getpass

os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

Please enter your Ragas API key!··········


Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG
This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.


In [81]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [82]:
from ragas.testset import TestsetGenerator

docs = training_documents
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying SummaryExtractor:   0%|          | 0/24 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/27 [00:00<?, ?it/s]



Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/78 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

In [83]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,What are the key sections covered in the Trans...,[Transportation Department Policy Manual \n \n...,The Transportation Department Policy Manual in...,single_hop_specifc_query_synthesizer
1,Wht is the role of the Transportashun Departmint?,[services. Please read this manual carefully a...,The Transportation Department plays a critical...,single_hop_specifc_query_synthesizer
2,Wht is the role of the Transportashun Departmint?,[compliance audits to ensure adherence to regu...,The Transportation Department focuses on compl...,single_hop_specifc_query_synthesizer
3,What are the available resources for passenger...,[Our fare collection system ensures fair and c...,"Route information, including maps, schedules, ...",single_hop_specifc_query_synthesizer
4,What is the purpose of the HR Policy Manual in...,[HR Policy Manual \n \nTable of Contents: \n \...,The HR Policy Manual provides important guidel...,single_hop_specifc_query_synthesizer
5,How can employees report IT incidents and what...,[<1-hop>\n\nOutlook: \na. Outlook is our organ...,Employees can report IT incidents by contactin...,multi_hop_specific_query_synthesizer
6,How are updates to the HR Policy Manual commun...,[<1-hop>\n\nHR Policy Manual \n \nTable of Con...,Updates to the HR Policy Manual are communicat...,multi_hop_specific_query_synthesizer
7,What role does the Transportation Department P...,[<1-hop>\n\nTransportation Department Policy M...,The Transportation Department Policy Manual se...,multi_hop_specific_query_synthesizer
8,What role does the IT department play in suppo...,[<1-hop>\n\nIT Department Policy Manual \n \nT...,The IT department plays a crucial role in supp...,multi_hop_specific_query_synthesizer
9,What role does the IT Service Desk play in inc...,[<1-hop>\n\nContinuous learning and profession...,The IT Service Desk plays a crucial role in in...,multi_hop_specific_query_synthesizer


## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated usign SDG above through our application to get context and responses.

In [84]:
for test_row in dataset:
  response = base_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [85]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What are the key sections covered in the Trans...,[Transportation Department Policy Manual \n \n...,[Transportation Department Policy Manual \n \n...,The key sections covered in the Transportation...,The Transportation Department Policy Manual in...,single_hop_specifc_query_synthesizer
1,Wht is the role of the Transportashun Departmint?,[work environment. If you have any questions o...,[services. Please read this manual carefully a...,The Transportation Department plays a critical...,The Transportation Department plays a critical...,single_hop_specifc_query_synthesizer
2,Wht is the role of the Transportashun Departmint?,[work environment. If you have any questions o...,[compliance audits to ensure adherence to regu...,The Transportation Department plays a critical...,The Transportation Department focuses on compl...,single_hop_specifc_query_synthesizer
3,What are the available resources for passenger...,"[any questions or need further information, pl...",[Our fare collection system ensures fair and c...,"Route information, including maps, schedules, ...","Route information, including maps, schedules, ...",single_hop_specifc_query_synthesizer
4,What is the purpose of the HR Policy Manual in...,[work environment. If you have any questions o...,[HR Policy Manual \n \nTable of Contents: \n \...,I don't know.,The HR Policy Manual provides important guidel...,single_hop_specifc_query_synthesizer
5,How can employees report IT incidents and what...,[b. Reach out to the IT Service Desk for assis...,[<1-hop>\n\nOutlook: \na. Outlook is our organ...,Employees should promptly report any IT incide...,Employees can report IT incidents by contactin...,multi_hop_specific_query_synthesizer
6,How are updates to the HR Policy Manual commun...,[work environment. If you have any questions o...,[<1-hop>\n\nHR Policy Manual \n \nTable of Con...,Updates to the HR Policy Manual are communicat...,Updates to the HR Policy Manual are communicat...,multi_hop_specific_query_synthesizer
7,What role does the Transportation Department P...,[Transportation Department Policy Manual \n \n...,[<1-hop>\n\nTransportation Department Policy M...,The Transportation Department Policy Manual se...,The Transportation Department Policy Manual se...,multi_hop_specific_query_synthesizer
8,What role does the IT department play in suppo...,"[any questions or need further information, pl...",[<1-hop>\n\nIT Department Policy Manual \n \nT...,The IT department plays a crucial role in supp...,The IT department plays a crucial role in supp...,multi_hop_specific_query_synthesizer
9,What role does the IT Service Desk play in inc...,[b. Reach out to the IT Service Desk for assis...,[<1-hop>\n\nContinuous learning and profession...,The IT Service Desk provides technical support...,The IT Service Desk plays a crucial role in in...,multi_hop_specific_query_synthesizer


In [86]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [87]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

In [88]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[26]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')


{'context_recall': 0.7833, 'faithfulness': 1.0000, 'factual_correctness': 0.7067, 'answer_relevancy': 0.8480, 'context_entity_recall': 0.5771, 'noise_sensitivity_relevant': 0.1745}

**Test with fine tuned embedding model**

In [89]:
for test_row in dataset:
  response = finetune_rag_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [90]:
dataset.to_pandas()


Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,What are the key sections covered in the Trans...,[Transportation Department Policy Manual \n \n...,[Transportation Department Policy Manual \n \n...,The key sections covered in the Transportation...,The Transportation Department Policy Manual in...,single_hop_specifc_query_synthesizer
1,Wht is the role of the Transportashun Departmint?,[services. Please read this manual carefully a...,[services. Please read this manual carefully a...,The role of the Transportation Department is t...,The Transportation Department plays a critical...,single_hop_specifc_query_synthesizer
2,Wht is the role of the Transportashun Departmint?,[services. Please read this manual carefully a...,[compliance audits to ensure adherence to regu...,The role of the Transportation Department is t...,The Transportation Department focuses on compl...,single_hop_specifc_query_synthesizer
3,What are the available resources for passenger...,[Our fare collection system ensures fair and c...,[Our fare collection system ensures fair and c...,"Route information, including maps, schedules, ...","Route information, including maps, schedules, ...",single_hop_specifc_query_synthesizer
4,What is the purpose of the HR Policy Manual in...,[HR Policy Manual \n \nTable of Contents: \n \...,[HR Policy Manual \n \nTable of Contents: \n \...,I don't know.,The HR Policy Manual provides important guidel...,single_hop_specifc_query_synthesizer
5,How can employees report IT incidents and what...,[Continuous learning and professional developm...,[<1-hop>\n\nOutlook: \na. Outlook is our organ...,Employees should promptly report any IT incide...,Employees can report IT incidents by contactin...,multi_hop_specific_query_synthesizer
6,How are updates to the HR Policy Manual commun...,[regular data security audits to ensure the sa...,[<1-hop>\n\nHR Policy Manual \n \nTable of Con...,Updates to the HR Policy Manual are communicat...,Updates to the HR Policy Manual are communicat...,multi_hop_specific_query_synthesizer
7,What role does the Transportation Department P...,[Transportation Department Policy Manual \n \n...,[<1-hop>\n\nTransportation Department Policy M...,The Transportation Department Policy Manual se...,The Transportation Department Policy Manual se...,multi_hop_specific_query_synthesizer
8,What role does the IT department play in suppo...,[carefully and consult with your supervisor or...,[<1-hop>\n\nIT Department Policy Manual \n \nT...,The IT department plays a crucial role in supp...,The IT department plays a crucial role in supp...,multi_hop_specific_query_synthesizer
9,What role does the IT Service Desk play in inc...,[b. Reach out to the IT Service Desk for assis...,[<1-hop>\n\nContinuous learning and profession...,The IT Service Desk plays a crucial role in in...,The IT Service Desk plays a crucial role in in...,multi_hop_specific_query_synthesizer


In [91]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [93]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/60 [00:00<?, ?it/s]

ERROR:ragas.executor:Exception raised in Job[26]: TypeError(ufunc 'invert' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe'')


{'context_recall': 0.9500, 'faithfulness': 0.9870, 'factual_correctness': 0.7267, 'answer_relevancy': 0.8492, 'context_entity_recall': 0.5092, 'noise_sensitivity_relevant': 0.1389}