<a href="https://colab.research.google.com/github/Emarhnuel/Insurance_Chatbot_evaluation/blob/main/RAG_Evaluation_LangChain_%26_Ragas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG Evaluation with LangChain and RAGAS

In this notebook I will explore the following:

- Creating a simple RAG pipeline with LangChain v0.1.0
- Evaluating our pipeline with the [Ragas](https://github.com/explodinggradients/ragas) library
- Making an adjustment to our RAG pipeline
- Evaluating our adjusted pipeline against our baseline




In [None]:
!pip install -q langchain langchain-openai langchain_core langchain-community langchainhub openai ragas tiktoken cohere faiss_cpu requests==2.31.0 tokenizers==0.19 pypdf2 unstructured langchain_together


In [None]:
%pip install --upgrade --quiet  sentence_transformers  rank_bm25 > /dev/null

In [None]:
import os
import openai
from getpass import getpass


openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key



Please provide your OpenAI Key: ··········


## Building our RAG pipeline

I will:

- Create an Index
- Use an LLM to generate responses based on the retrieved context

Let's get started by creating our index.

#### Loading Data



In [None]:
from langchain_community.document_loaders import UnstructuredMarkdownLoader

# Replace with the actual path to your Markdown file in Colab
markdown_path = "/content/policy-booklet-0923.md"
loader = UnstructuredMarkdownLoader(markdown_path)
documents = loader.load()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


In [None]:
documents[0].metadata

{'source': '/content/policy-booklet-0923.md'}

#### Transforming Data

Now that I have loaded my single document, let's split it into smaller pieces so the retrieval chain can use it more effectively!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)

documents = text_splitter.split_documents(documents)




Let's confirm we've split our document.

In [None]:
len(documents)

136
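To build some intuition for `chunk_size` and `chunk_overlap`, here is a minimal sketch of a fixed-width sliding window in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on natural separators such as paragraphs), and the helper name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-width chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4)
print(chunks)
```

The last four characters of each chunk reappear at the start of the next one, so a sentence that straddles a boundary is still fully contained in at least one chunk.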

#### Loading OpenAI Embeddings Model

We need a way to convert our text into vectors that can be compared with our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task! (Soon we'll be able to use OpenAI's newest embedding model too; it is waiting on an approved PR to be merged as we speak!)

In [None]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)
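To make "comparing vectors" concrete, here is a small sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have far more dimensions), and the vector store may use a different measure such as Euclidean distance; cosine similarity is simply the easiest one to show.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example only.
query_vec = [1.0, 0.0, 1.0]
doc_close = [0.9, 0.1, 0.8]
doc_far = [0.0, 1.0, 0.0]

print(cosine_similarity(query_vec, doc_close))
print(cosine_similarity(query_vec, doc_far))
```

The first document points in almost the same direction as the query and scores close to 1, while the second is orthogonal to it and scores 0.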

#### Creating a FAISS VectorStore

Now that I have my documents, I'll need a place to store them alongside their embeddings.

I will be using Meta's FAISS for this task.

In [None]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)

#### Creating a Retriever

To complete my index, all that's left to do is expose my vector store as a retriever.

In [None]:
retriever = vector_store.as_retriever()
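Conceptually, a vector-store retriever embeds the query and returns the `k` stored chunks whose vectors are nearest to it. The sketch below shows that idea as a brute-force search over a toy index. The ids, vectors and `k=2` are all made up, and FAISS itself uses much more efficient index structures.

```python
def top_k(query: list[float], index: dict[str, list[float]], k: int) -> list[str]:
    """Return the ids of the k vectors closest to the query (squared Euclidean distance)."""
    def sq_dist(vec: list[float]) -> float:
        return sum((q - v) ** 2 for q, v in zip(query, vec))

    return sorted(index, key=lambda doc_id: sq_dist(index[doc_id]))[:k]


# Hypothetical two-dimensional "embeddings" for three chunks.
toy_index = {
    "damage": [0.9, 0.1],
    "theft": [0.2, 0.8],
    "abroad": [0.5, 0.5],
}

nearest = top_k([1.0, 0.0], toy_index, k=2)
print(nearest)
```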

#### Testing the Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [None]:
retrieved_documents = retriever.invoke("How much will you pay if my car is damaged?")

In [None]:
for doc in retrieved_documents:
  print(doc)

page_content="Faqs How Much Will You Pay If My Car Is Damaged?\n\nWhere damage to your car is covered under your policy, we'll pay the cost of repairing or replacing your car up to its UK market value. This is the current value of your car at the time of the claim. It may be different to the amount you paid or any amount you provided when you insured your car with us.\n\nWho Is Covered To Drive Other Cars?\n\nYour certificate of motor insurance will show who has cover to drive other cars. We'll only cover injury to third parties, or damage caused to their property, not to the car being driven. See 'Section 1: Liability' on page 11. Am I covered if I leave my car unlocked or the keys in the car? We won't pay a claim for theft or attempted theft if your car is left:\n\nUnlocked.\n\nWith keys or key fobs in, on, or attached to the car.\n\nWith the engine running.\n\nWith a window or roof open.\n\nWhat's not included in my cover?\n\nWe don't cover things like:\n\nMechanical or electrical f

### Creating a RAG Chain



#### Creating a Prompt Template

There are a few different ways I could create a prompt template - I could write a custom template, as seen in the code further below, or I could simply pull a prompt from the prompt hub! Let's look at an example of the hub option first!

In [None]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [None]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple - but we'll create our own to be a bit more specific!

In [None]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
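To see what the placeholders do, here is a rough approximation using plain `str.format`. `ChatPromptTemplate` does more than this (it wraps the text in a chat message, for example), so treat this as a sketch of the substitution step only. The context sentence is taken from the retriever output shown earlier.

```python
template_text = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

# Fill both placeholders the way the chain will at run time.
filled = template_text.format(
    context="We'll pay the cost of repairing or replacing your car up to its UK market value.",
    question="How much will you pay if my car is damaged?",
)
print(filled)
```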

#### Setting Up the Basic QA Chain

Now we can instantiate the basic RAG chain!

I'll use LCEL directly, just to see an example of it.

I'll also make sure to pass the retrieved context through to the output - this is critical for RAGAS, which needs the contexts to score the pipeline.

In [None]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)
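If the piping syntax feels opaque, the data flow can be sketched with ordinary functions and dictionaries. `fake_retriever` and `fake_llm` are hypothetical stand-ins, not LangChain objects; the point is only to show which keys exist at each step and why `context` survives to the output.

```python
def fake_retriever(question: str) -> list[str]:
    # Hypothetical stand-in for the FAISS retriever.
    return [f"(chunk relevant to: {question})"]


def fake_llm(prompt_text: str) -> str:
    # Hypothetical stand-in for the chat model.
    return f"(answer generated from a {len(prompt_text)}-character prompt)"


def run_chain(inputs: dict) -> dict:
    # Step 1: build {"context", "question"} from the incoming {"question"} dict.
    step1 = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: format the prompt, call the model, and carry the context through.
    prompt_text = f"Context:\n{step1['context']}\n\nQuestion:\n{step1['question']}"
    return {"response": fake_llm(prompt_text), "context": step1["context"]}


sketch_result = run_chain({"question": "Are my charging cables covered?"})
print(sorted(sketch_result))
```

The output dictionary has both a `response` and a `context` key, which mirrors the shape the real chain returns.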

Let's test it out!

In [None]:
question = "How much will you pay if my car is damaged?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Where damage to your car is covered under your policy, we'll pay the cost of repairing or replacing your car up to its UK market value.


In [None]:
question = "Are my electric car’s charging cables covered?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

Yes, your electric car's charging cables are covered under 'Section 2: Fire and theft' or 'Section 4: Accidental damage' of your policy.
[Document(page_content="Can I Use My Car Abroad?\n\nIf you want to use your car abroad, your cover depends on the type of policy you have and where you're driving. You can find full details in 'Where you can drive' on page 31. You may need a Green Card if you're travelling abroad. If you need one, please get in touch before you travel. We also recommend you take a European Accident Statement with you. You can get one at churchill.com/eas-form.pdf\n\nAre My Electric Car'S Charging Cables Covered?\n\nYour home charger and charging cables are considered an accessory to your car. This means they're covered under 'Section 2: Fire and theft' or \n'Section 4: Accidental damage' of your policy. You're also covered for any accidents to others involving your charging cables when they are attached to your car. For example, someone tripping over your cable, as lo

As you can see, there are some improvements I could make here.

For now, let's switch gears to RAGAS to see how I can leverage that tool to provide us insight into how our pipeline is performing!

## Ragas Evaluation

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

I'll be evaluating on every core metric today, but in order to do that - I'll need to create a test set. Luckily for me, Ragas can do that directly!

#### Synthetic Test Set Generation

I will leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate my own synthetic QC pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4o` as the critic

Let's create a new set of documents to ensure I am not accidentally creating a sample test set that favours our base model too much!

In [None]:
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 200
)
documents = text_splitter.split_documents(documents)

In [None]:
len(documents)

136

In [None]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings


# generator with openai models
generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm = ChatOpenAI(model="gpt-4o")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

testset = generator.generate_with_langchain_docs(documents, test_size=33, distributions={simple: 0.3, reasoning: 0.4, multi_context: 0.3}, raise_exceptions=False)

embedding nodes:   0%|          | 0/272 [00:00<?, ?it/s]



Generating:   0%|          | 0/9 [00:00<?, ?it/s]

In [None]:
testset.test_data[0]

DataRow(question='What is the coverage for car keys lost abroad with the Foreign Use Extension?', contexts=["Car Security\n\nWe'll provide cover to reprogram immobilisers, infrared handsets and alarms.\n\nCar Hire\n\nIf you can't drive your car because of lost or damaged car keys and have our Guaranteed Hire Car Plus cover, we'll extend this cover while you're unable to use your car. See 'Section 8: Guaranteed Hire Car Plus' on page 28.\n\nDriving Abroad\n\nWhile driving your car abroad, we'll cover your car keys if they are lost when:\n\nYou have Comprehensive cover and you've added Foreign Use Extension to your cover before you travel (this will be shown on your car insurance details).\n\nYou have Comprehensive Plus cover, where 90 days of Foreign Use Extension is included for each insured period.\n\nYou'll need to replace your car keys and send the receipts to us. We'll then reimburse the costs up to the amounts shown on page 8.\n\nYou'Re Not Covered For\n\n8 We don't cover any redu

In [None]:
test_df = testset.to_pandas()

# Save as a CSV file in the current working directory
test_df.to_csv('testset.csv', index=False)

In [None]:
# Load the test set back from the saved CSV file

import pandas as pd

file_path = '/content/testset.csv'

test_df = pd.read_csv(file_path)
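One thing to be aware of when round-tripping through CSV: list-valued columns such as `contexts` come back as plain strings. The cells below only use `question` and `ground_truth`, which are ordinary strings, so this notebook is unaffected. If you do need the lists again, something like the following sketch (toy data, standard library only) shows the effect and one way to undo it.

```python
import ast
import csv
import io

# Write one row whose "contexts" field is a list, serialised as text.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["question", "contexts"])
writer.writeheader()
writer.writerow({"question": "Toy question?", "contexts": str(["chunk one", "chunk two"])})

# Read it back: the list is now a plain string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["contexts"]).__name__)

# ast.literal_eval safely parses the string back into a Python list.
restored = ast.literal_eval(row["contexts"])
print(restored)
```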


#### Generating Responses with RAG Pipeline

Now that I have some QC pairs and some ground truths, let's evaluate the RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting the questions and ground truths from the created test set.

The test set is already loaded as a Pandas DataFrame, so let's take a look at it first.

In [None]:
test_df

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What is the coverage for car keys lost abroad ...,"[""Car Security\n\nWe'll provide cover to repro...","While driving your car abroad, if you have Com...",simple,[{'source': '/content/policy-booklet-0923.md'}],True
1,What actions will be taken if fraud is discove...,"[""Fraud\n\nYou must be honest in your dealings...",If fraud is discovered in relation to the insu...,simple,[{'source': '/content/policy-booklet-0923.md'}],True
2,How does the total case value impact the decis...,['The difficulty of the case. Cases which are ...,"The total case value, which includes the poten...",reasoning,[{'source': '/content/policy-booklet-0923.md'}],True
3,How does intentional damage affect insurance c...,"[""Deliberate Damage\n\n✘ We won't cover any lo...","Intentional damage, which is deliberate acts b...",reasoning,[{'source': '/content/policy-booklet-0923.md'}],True
4,What conditions are required for insurance to ...,"[""What We'Ll Do\n\nWe'll replace your car with...",The conditions required for insurance to repla...,reasoning,[{'source': '/content/policy-booklet-0923.md'}],True
5,What steps should be taken at an accident scen...,"[""Safety Comes First\n\nStop at the scene of t...","Stop at the scene of the accident, call the po...",reasoning,[{'source': '/content/policy-booklet-0923.md'}],True
6,What number and info are needed for insurance ...,"[""Making A Claim\n\nIf you need to claim These...","For insurance policy inquiries, you will need ...",multi_context,[{'source': '/content/policy-booklet-0923.md'}],True
7,What areas does the 'Liability for automated c...,"[""It also covers journeys between these places...",The 'Liability for automated cars in Great Bri...,multi_context,[{'source': '/content/policy-booklet-0923.md'}...,True
8,What support is provided for personal accident...,"[""The driver's details, if possible.\n\nThe na...",We'll help if you or your partner are accident...,multi_context,[{'source': '/content/policy-booklet-0923.md'}...,True
9,What laws apply to the contract between the po...,"[""This policy is evidence of the contract betw...",You and we may choose which law will apply to ...,simple,[{'source': '/content/policy-booklet-0923.md'}],True


In [None]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now I will run the generated questions through the RAG pipeline to produce responses - I'll also need to collect the retrieved contexts for each question.

I'll do this in a simple loop to see exactly what's happening!

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now I can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [None]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [None]:
response_dataset[6]

{'question': 'What number and info are needed for insurance policy inquiries?',
 'answer': 'The number needed for insurance policy inquiries is 0345 877 6680. The information needed includes personal details, policy number, car registration number, description of loss or damage, and details of the other driver if in an accident.',
 'contexts': ["Making A Claim\n\nIf you need to claim These steps will help you and enable us to process your claim quickly.\n\nHere are some important numbers you'll need if you have an accident\n\nNeed To Claim? 0345 878 6261 Windscreen Claims 0800 328 9150\n\nIf you have Essentials, Comprehensive or Comprehensive Plus cover\n\nMotor Legal Helpline 0345 246 2408\n\nIf you have Motor Legal Cover\n\nHelp With Anything Else 0345 877 6680\n\nStore these numbers in your phone so you have them available if needed. Even if you don't make a claim on your car, it's important to let us know about the accident as quickly as possible. This will enable us to contact the

#### Evaluating with Ragas

Now that I have the response dataset - we can finally get into the evaluation part of the Ragas workflow!

First, I'll import the desired metrics, then I can use them to evaluate my created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
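As a rough mental model (my reading of the Ragas docs, not its implementation), several of these metrics boil down to a ratio of "yes" verdicts. In Ragas the verdicts come from an LLM judge; in this sketch they are hand-written booleans and the helper name `ratio` is hypothetical.

```python
def ratio(verdicts: list[bool]) -> float:
    """Fraction of True verdicts in a list of yes/no judgements."""
    return sum(verdicts) / len(verdicts)


# Faithfulness-style check: is each claim in the answer supported by the retrieved context?
answer_claims_supported = [True, True, False, True]
# Context-recall-style check: can each ground-truth statement be attributed to the retrieved context?
ground_truth_statements_found = [True, True, True]

print(ratio(answer_claims_supported))        # 0.75
print(ratio(ground_truth_statements_found))  # 1.0
```

So one unsupported claim out of four gives a faithfulness-style score of 0.75, while a context that covers every ground-truth statement gives a recall-style score of 1.0.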

In [None]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [None]:
results = evaluate(response_dataset, metrics, raise_exceptions=False)

Evaluating:   0%|          | 0/155 [00:00<?, ?it/s]

ERROR:ragas.executor:Runner in Executor raised an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 78, in _aresults
    r = await future
  File "/usr/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 37, in sema_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 111, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ragas/metrics/base.py", line 125, in ascore
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ragas/metrics/base.py", line 121, in ascore
    score = await self._ascore(row=row, callbacks=group_cm, is_async=is_async)
  File "/usr/local/lib/python3.10/dist-packages/ragas/metrics/_context_recall.py", line 169, in _ascore
    results = await sel

In [None]:
results

{'faithfulness': 0.8811, 'answer_relevancy': 0.9378, 'context_recall': 0.9087, 'context_precision': 0.8046, 'answer_correctness': 0.8744}

In [None]:
results_df = results.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What is the coverage for car keys lost abroad ...,The coverage for car keys lost abroad with the...,[You have Comprehensive cover and you've added...,"While driving your car abroad, if you have Com...",,,,,
1,What actions will be taken if fraud is discove...,If fraud is discovered in relation to the insu...,[Fraud\n\nYou must be honest in your dealings ...,If fraud is discovered in relation to the insu...,,,,,
2,How does the total case value impact the decis...,The total case value impacts the decision to p...,[The difficulty of the case. Cases which are m...,"The total case value, which includes the poten...",,,,,1.0
3,How does intentional damage affect insurance c...,Intentional damage caused by insured individua...,[Deliberate Damage\n\n✘ We won't cover any los...,"Intentional damage, which is deliberate acts b...",0.986342,1.0,1.0,0.608746,1.0
4,What conditions are required for insurance to ...,The conditions required for insurance to repla...,[What We'Ll Do\n\nWe'll replace your car with ...,The conditions required for insurance to repla...,0.98208,1.0,0.805556,0.548485,0.857143
5,What steps should be taken at an accident scen...,"Stop at the scene of the accident, call the po...",[Safety Comes First\n\nStop at the scene of th...,"Stop at the scene of the accident, call the po...",0.912098,1.0,1.0,0.617023,1.0
6,What number and info are needed for insurance ...,The number needed for insurance policy inquiri...,[Making A Claim\n\nIf you need to claim These ...,"For insurance policy inquiries, you will need ...",0.86921,1.0,1.0,0.699986,1.0
7,What areas does the 'Liability for automated c...,The 'Liability for automated cars in Great Bri...,[Liability For Automated Cars In Great Britain...,The 'Liability for automated cars in Great Bri...,0.997535,1.0,1.0,0.622987,1.0
8,What support is provided for personal accident...,The support provided for personal accidents in...,[Section 6: Personal Benefits\n\nPersonal bene...,We'll help if you or your partner are accident...,0.920306,1.0,1.0,0.808368,0.833333
9,What laws apply to the contract between the po...,English law applies to the contract between th...,[This policy is evidence of the contract betwe...,You and we may choose which law will apply to ...,0.963727,1.0,1.0,0.728863,1.0
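Rows where a metric call failed (see the traceback above) show up as empty cells, that is, NaN. If you summarise per-row scores yourself, it helps to decide explicitly how to treat them. Here is a minimal sketch with made-up scores that skips NaN values and reports how many were skipped. Whether Ragas aggregates in exactly this way is not something this notebook verifies.

```python
import math


def mean_ignoring_nan(scores: list[float]) -> float:
    """Average the valid scores; return NaN if every score is missing."""
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else math.nan


# Toy per-row scores: two failed rows, two successful ones.
toy_scores = [math.nan, math.nan, 1.0, 0.5]

print(mean_ignoring_nan(toy_scores))           # 0.75
print(sum(math.isnan(s) for s in toy_scores))  # 2
```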


## Testing a More Performant Retriever

Now that I have established a baseline - I can see how any changes impact my pipeline's performance!

Let's modify the retriever and see how that impacts our Ragas metrics!

In [None]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
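The idea behind multi-query retrieval is to ask an LLM for several rephrasings of the question, run each one through the base retriever, and keep the unique union of the results. The sketch below imitates that with hypothetical stand-ins (`fake_query_variants`, `fake_search`) and invented chunk ids; it is an illustration of the idea, not LangChain's implementation.

```python
def fake_query_variants(question: str) -> list[str]:
    # Hypothetical stand-in for the LLM call that rephrases the question.
    return [question, question.lower(), question.replace("car", "vehicle")]


def fake_search(query: str) -> list[str]:
    # Hypothetical stand-in for the base retriever: returns chunk ids.
    toy_results = {
        "What if my car is stolen?": ["theft-1", "theft-2"],
        "what if my car is stolen?": ["theft-2", "keys-1"],
        "What if my vehicle is stolen?": ["theft-1", "claims-1"],
    }
    return toy_results.get(query, [])


def multi_query_retrieve(question: str) -> list[str]:
    """Run every query variant and keep the unique results in first-seen order."""
    seen, merged = set(), []
    for query in fake_query_variants(question):
        for chunk_id in fake_search(query):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


merged_ids = multi_query_retrieve("What if my car is stolen?")
print(merged_ids)
```

Because each rephrasing can surface chunks the original wording missed, the merged list is usually longer than a single query's results, which also means more text ends up in the prompt.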

I'll also re-create the RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, I will create a chain to "stuff" the documents into my context!

In [None]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)
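"Stuffing" just means concatenating every retrieved chunk into a single context string and dropping it into the prompt. Here is a minimal sketch with toy chunks. The helper name `stuff_documents`, the blank-line separator and the shortened prompt text are assumptions made for illustration.

```python
def stuff_documents(chunks: list[str], separator: str = "\n\n") -> str:
    """Concatenate every retrieved chunk into one context string."""
    return separator.join(chunks)


# Toy chunks, invented for this example.
toy_chunks = ["Chunk about hire cars.", "Chunk about approved repairers."]
prompt_template = "Answer based solely on the context below:\n\n<context>\n{context}\n</context>"

stuffed_prompt = prompt_template.format(context=stuff_documents(toy_chunks))
print(stuffed_prompt)
```

This is simple and works well while the combined chunks fit in the model's context window.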

Next, we'll create the retrieval chain!

In [None]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [None]:
response = retrieval_chain.invoke({"input": "If my car can be repaired, and is driveable, what will happen?"})

In [None]:
print(response["answer"])

If your car can be repaired and is driveable, the insurance company will provide you with a hire car from the point your car goes in for repair. If you use their approved repairer, you will have the hire car until they have repaired your car. If you choose to use your own repairer, you will have the hire car for up to 21 days in a row while they are repairing your car.


Well, just from that response this chain *feels* better - but let's see how it performs on the eval!

I will follow the same process as before to collect the pipeline's contexts and answers.

In [None]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now I can convert this into a dataset, just like we did before.

In [None]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [None]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics, raise_exceptions=False)

Evaluating:   0%|          | 0/155 [00:00<?, ?it/s]

ERROR:ragas.executor:Runner in Executor raised an exception
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 78, in _aresults
    r = await future
  File "/usr/lib/python3.10/asyncio/tasks.py", line 571, in _wait_for_one
    return f.result()  # May raise f.exception().
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 37, in sema_coro
    return await coro
  File "/usr/local/lib/python3.10/dist-packages/ragas/executor.py", line 111, in wrapped_callable_async
    return counter, await callable(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/ragas/metrics/base.py", line 125, in ascore
    raise e
  File "/usr/local/lib/python3.10/dist-packages/ragas/metrics/base.py", line 121, in ascore
    score = await self._ascore(row=row, callbacks=group_cm, is_async=is_async)
  File "/usr/local/lib/python3.10/dist-packages/ragas/metrics/_faithfulness.py", line 266, in _ascore
    nli_result = await se

### Comparing Results

Now I can compare the results and see what directional changes occurred!

Let's refresh with the initial metrics.

In [None]:
results

{'faithfulness': 0.8811, 'answer_relevancy': 0.9378, 'context_recall': 0.9087, 'context_precision': 0.8046, 'answer_correctness': 0.8744}

And see how the advanced retrieval changed our scores!

In [None]:
advanced_retrieval_results

{'faithfulness': 0.8053, 'answer_relevancy': 0.8226, 'context_recall': 0.9388, 'context_precision': 0.8830, 'answer_correctness': 0.8726}

In [None]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

Unnamed: 0,Metric,Baseline,MultiQueryRetriever with Document Stuffing,Delta
0,faithfulness,0.881068,0.805291,-0.075776
1,answer_relevancy,0.937818,0.822632,-0.115186
2,context_recall,0.908736,0.938822,0.030086
3,context_precision,0.804623,0.882976,0.078353
4,answer_correctness,0.874418,0.872634,-0.001784
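To read the table at a glance, the same deltas can be recomputed from the two result dictionaries printed above and split by direction. This is a plain-Python sketch using the rounded values shown in this notebook.

```python
baseline = {"faithfulness": 0.8811, "answer_relevancy": 0.9378, "context_recall": 0.9087,
            "context_precision": 0.8046, "answer_correctness": 0.8744}
advanced = {"faithfulness": 0.8053, "answer_relevancy": 0.8226, "context_recall": 0.9388,
            "context_precision": 0.8830, "answer_correctness": 0.8726}

# Positive delta: the adjusted pipeline scored higher than the baseline.
deltas = {metric: round(advanced[metric] - baseline[metric], 4) for metric in baseline}
improved = [m for m, d in deltas.items() if d > 0]
regressed = [m for m, d in deltas.items() if d < 0]

print(improved)
print(regressed)
```

The two retrieval-side metrics (context recall and context precision) went up, while faithfulness and answer relevancy went down and answer correctness stayed roughly flat. Keep in mind that the second pipeline also swapped in the hub prompt, so the differences cannot be attributed to the retriever alone.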
