# Evaluation of RAG Using Ragas

In this notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". It provides both component-wise metrics and end-to-end metrics for the performance of our RAG pipelines.

We'll complete the following tasks:

- 🤝 Breakout Room Part #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Create a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room Part #2:
  1. Synthetic Dataset Generation for Evaluation using [Ragas](https://github.com/explodinggradients/ragas)
  2. Evaluating our pipeline with Ragas
  3. Making Adjustments to our RAG Pipeline
  4. Evaluating our Adjusted pipeline against our baseline
  5. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai

We'll also get the "star of the show" today, which is Ragas!

In [2]:
!pip install -qU ragas

We'll be leveraging [Qdrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [3]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [4]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use an LLM to generate responses based on the retrieved context
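
To make those three steps concrete before we bring in real components, here's a minimal toy sketch in plain Python. Everything in it is a stand-in: `build_index`, `retrieve`, and `generate` are hypothetical helpers (not LangChain APIs), the "index" is just word sets, and "generation" only assembles the prompt an LLM would receive.

```python
# Toy RAG loop: every component here is a stand-in, not a LangChain API.

def build_index(chunks):
    """'Index' each chunk by storing it next to its set of lowercase words."""
    return [(chunk, set(chunk.lower().split())) for chunk in chunks]

def retrieve(index, query, k=2):
    """Return the k chunks that share the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(index, key=lambda item: len(item[1] & query_words), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def generate(query, context):
    """Stand-in for the LLM call: build the prompt we would send."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion:\n{query}"

index = build_index([
    "the feud started at an awards show",
    "the album was released in the fall",
    "a phone call about the song was recorded",
])
top_chunks = retrieve(index, "when did the feud start", k=1)
print(generate("when did the feud start", top_chunks))
```

In the real pipeline below, word overlap is replaced by embedding similarity and the prompt is actually sent to a chat model.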

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.1.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [None]:
!git clone https://github.com/AI-Maker-Space/DataRepository

In [5]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "DataRepository/tswift_fued.pdf",
)

documents = loader.load()
documents

[Document(page_content="4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n1/22\nPHOTO: KEVIN MAZUR/MTV1415/WIREIMAGE\nA Timeline of Taylor Swift and Kim\nKardashian's Feud\nTake a look back at the drama between Taylor Swift and Kim Kardashian\nthrough the years\nBy  \n |  Updated on April 23, 2024 04:46PM EDT\nENTERTAINMENT\nMUSIC\nKelsie Gibson\nAdvertisement\nAd\ni\nThe Rewind: '13 Going 30'\nC LO S E\xa0\nSUBSCRIBE\nSKIP TO CONTENT\n", metadata={'source': 'DataRepository/tswift_fued.pdf', 'file_path': 'DataRepository/tswift_fued.pdf', 'page': 0, 'total_pages': 22, 'format': 'PDF 1.4', 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud", 'author': '', 'subject': '', 'keywords': '', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36', 'producer': 'Skia/PDF m123', 'creationDate': "D:20240423220523+00'00'", '

In [6]:
documents[0].metadata

{'source': 'DataRepository/tswift_fued.pdf',
 'file_path': 'DataRepository/tswift_fued.pdf',
 'page': 0,
 'total_pages': 22,
 'format': 'PDF 1.4',
 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud",
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
 'producer': 'Skia/PDF m123',
 'creationDate': "D:20240423220523+00'00'",
 'modDate': "D:20240423220523+00'00'",
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [7]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [8]:
len(documents)

177
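
To build some intuition for what `chunk_size` and `chunk_overlap` control, here's a simplified sketch of fixed-size chunking with overlap. Note that this is not how `RecursiveCharacterTextSplitter` works internally (it also tries to split on separators such as paragraphs and newlines first); `chunk_with_overlap` is a hypothetical helper for illustration only.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Slide a window of chunk_size characters, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks

chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, which is the same idea as our 50-character overlap: text near a chunk boundary still appears with some surrounding context.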

#### Loading OpenAI Embeddings Model

We'll need a process by which we can convert our text into vectors that allow us to compare to our query vector.
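
As a rough illustration of what "comparing to our query vector" means, here's a small sketch that scores hand-made 2-dimensional vectors with cosine similarity. The vectors and names are invented for the example, and the similarity measure used by a given vector store may differ depending on how it's configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings": real ones have hundreds or thousands of dimensions.
query_vector = [1.0, 0.0]
chunk_vectors = {
    "chunk_a": [0.9, 0.1],  # points in nearly the same direction as the query
    "chunk_b": [0.0, 1.0],  # orthogonal to the query
}

scores = {name: cosine_similarity(query_vector, vec) for name, vec in chunk_vectors.items()}
best = max(scores, key=scores.get)
print(best, scores)
```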

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [9]:
from dotenv import load_dotenv
load_dotenv()

True

In [10]:
from langchain_openai import OpenAIEmbeddings, AzureOpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)
#embeddings = AzureOpenAIEmbeddings(
#    azure_deployment="text-embedding-3-small"
#)
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x11b9f2a90>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x11c220fd0>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

#### Creating a Qdrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [11]:
from langchain_community.vectorstores import Qdrant

qdrant_vector_store = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="Taylor Swift - Fued - ADA",
)

#### ❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

> NOTE: Check the [documentation](https://qdrant.tech/documentation/overview/) for more information about Qdrant!

Some of the techniques Qdrant uses to stay efficient are vector indexing with HNSW (a graph-based index that enables fast approximate nearest-neighbor search), parallel processing via multithreading, and careful memory management.

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we did in previous versions of LangChain!

In [12]:
retriever = qdrant_vector_store.as_retriever()
retriever

VectorStoreRetriever(tags=['Qdrant', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.qdrant.Qdrant object at 0x129096350>)

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [13]:
retrieved_documents = retriever.invoke("Who is Taylor Swift fueding with?")

In [14]:
for doc in retrieved_documents:
  print(doc)

page_content="4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n19/22\nMUSIC" metadata={'source': 'DataRepository/tswift_fued.pdf', 'file_path': 'DataRepository/tswift_fued.pdf', 'page': 18, 'total_pages': 22, 'format': 'PDF 1.4', 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud", 'author': '', 'subject': '', 'keywords': '', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36', 'producer': 'Skia/PDF m123', 'creationDate': "D:20240423220523+00'00'", 'modDate': "D:20240423220523+00'00'", 'trapped': '', '_id': '16a29e3eff68428ca2973fc1126c5fe4', '_collection_name': 'Taylor Swift - Fued - ADA'}
page_content="4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n21/22\nMUSIC" metadata={'source': 'DataRepository/tswift_fue

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [15]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [16]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [17]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
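
If it helps to see what the template turns into, here's a plain-Python sketch of the substitution using `str.format` on a shortened copy of the template. This only approximates what `ChatPromptTemplate` does (the real object produces chat messages rather than a bare string), and the context and question values are invented.

```python
# Shortened stand-in for the template above, with the same two placeholders.
toy_template = "Context:\n{context}\n\nQuestion:\n{question}\n"

filled = toy_template.format(
    context="Chunk one.\nChunk two.",  # retrieved chunks, joined into one string
    question="Who is feuding?",
)
print(filled)
```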

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through to the output - which is critical for Ragas.

In [18]:
from operator import itemgetter

from langchain_openai import ChatOpenAI, AzureChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
#primary_qa_llm = AzureChatOpenAI(azure_deployment="gpt4-32k",
#                                openai_api_version=os.environ['AZURE_OPENAI_API_VERSION'])
primary_qa_llm

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x128d663d0>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x12feb94d0>, temperature=0.0, openai_api_key=SecretStr('**********'), openai_proxy='')

In [19]:
retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : re-assigned unchanged via RunnablePassthrough.assign, which passes every key from the
    #              previous step through to the next one
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

#### 🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

**ANSWER:**
- We use LCEL to build a QA chain. First, the question is passed to the retriever to fetch the context. `RunnablePassthrough.assign` then carries that context forward unchanged. Finally, the context and question are used to fill in our prompt, which is passed to the LLM to get the response, and the retrieved context is returned alongside it.
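
The same data flow can be sketched in plain Python with stubbed-out components. This is only an illustration of which keys exist at each step: `fake_retriever`, `fake_llm`, and `run_chain` are hypothetical stand-ins, and in the real chain the `"response"` value is a chat message object (hence `.content`) rather than a string.

```python
def fake_retriever(question):
    """Stand-in for the retriever: always returns the same two chunks."""
    return ["chunk about the feud", "chunk about the phone call"]

def fake_llm(prompt_text):
    """Stand-in for the chat model."""
    return f"(answer based on a prompt of {len(prompt_text)} characters)"

def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming question.
    step1 = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: pass everything through, re-assigning "context" unchanged.
    step2 = {**step1, "context": step1["context"]}
    # Step 3: format the prompt, call the "LLM", and keep the context in the output.
    prompt_text = f"Context:\n{step2['context']}\n\nQuestion:\n{step2['question']}"
    return {"response": fake_llm(prompt_text), "context": step2["context"]}

result = run_chain({"question": "Who is feuding?"})
print(sorted(result))  # ['context', 'response']
```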

Let's test it out!

In [20]:
question = "Who is Taylor Swift fueding with?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Kim Kardashian


In [21]:
question = "Why are they fueding?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print("****")
print(result["context"][0].page_content)

I don't know.
****
discussed the longstanding feud between West and Swift in great detail, as
well as West's controversial rant during his appearance on Saturday Night


We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to provide us insight into how our pipeline is performing!

# 🤝 Breakout Room Part #2

## Task 1: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic QC pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process is intended to use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic, although the generation cell in this notebook sets the critic to `gpt-3.5-turbo`. If you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

In [22]:
loader = PyMuPDFLoader(
    "DataRepository/tswift_fued.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

**ANSWER**:
To make sure the contexts, questions, and answers in our evaluation set differ from the chunks in our original index. This makes the performance evaluation meaningful and also helps with validation.

In [23]:
len(eval_documents)

56

> NOTE: This cell will take roughly 5-6 minutes to run. If you run into any rate-limit issues - please use GPT-3.5-Turbo as your `critic_llm`. If you see any fields marked `nan` - this is a product of rate-limiting issues, and you can safely ignore them for now.

In [24]:
embeddings

OpenAIEmbeddings(client=<openai.resources.embeddings.Embeddings object at 0x11b9f2a90>, async_client=<openai.resources.embeddings.AsyncEmbeddings object at 0x11c220fd0>, model='text-embedding-ada-002', dimensions=None, deployment='text-embedding-ada-002', openai_api_version='', openai_api_base=None, openai_api_type='', openai_proxy='', embedding_ctx_length=8191, openai_api_key=SecretStr('**********'), openai_organization=None, allowed_special=set(), disallowed_special='all', chunk_size=1000, max_retries=2, request_timeout=None, headers=None, tiktoken_enabled=True, tiktoken_model_name=None, show_progress_bar=False, model_kwargs={}, skip_empty=False, default_headers=None, default_query=None, retry_min_seconds=4, retry_max_seconds=20, http_client=None, http_async_client=None, check_embedding_ctx_length=True)

In [25]:
from ragas.testset.generator import TestsetGenerator
import os
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings, AzureChatOpenAI, AzureOpenAIEmbeddings

generator_llm_o = ChatOpenAI(model="gpt-3.5-turbo-16k")
critic_llm_o = ChatOpenAI(model="gpt-3.5-turbo")
#critic_llm = ChatOpenAI(model="gpt-4-turbo")
embeddings_o = OpenAIEmbeddings()

#generator_llm = AzureChatOpenAI(azure_deployment="gpt4-32k",
#                                openai_api_version=os.environ['AZURE_OPENAI_API_VERSION'])
#critic_llm = AzureChatOpenAI(azure_deployment=os.environ['AZURE_OPENAI_CHAT_DEPLOYMENT_NAME'], openai_api_version=os.environ['AZURE_OPENAI_API_VERSION']) 
#embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-small",openai_api_version=os.environ['AZURE_OPENAI_API_VERSION'])

generator_test = TestsetGenerator.from_langchain(
    generator_llm_o,
    critic_llm_o,
    embeddings_o
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

testset = generator_test.generate_with_langchain_docs(eval_documents, 12, distributions, is_async = False)
testset.to_pandas()

embedding nodes:   0%|          | 0/112 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/12 [00:00<?, ?it/s]

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What is Taylor Swift's feud with Kanye West an...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
1,What was the significance of the secretly reco...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",The significance of the secretly recorded phon...,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
2,What did Taylor Swift's spokesperson say about...,"[called to get it approved.""\nJune 2016: Taylo...",Taylor Swift's spokesperson said that much of ...,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
3,What evidence suggests that Kim Kardashian is ...,"[me now?"" Taylor sings in the new track, also ...",,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
4,What was the significance of Taylor Swift and ...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
5,Who are the inspirations for Taylor Swift's 'S...,[Who Are Taylor Swift's 'Speak Now' Songs Abou...,,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
6,What did Kim Kardashian say about Taylor Swift...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",Kim Kardashian said that Taylor Swift lied abo...,multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
7,How did Kim Kardashian address her conflict wi...,"[January 14, 2019: Kim Kardashian claims\nther...",Kardashian cleared the air during an appearanc...,multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
8,What was Kardashian's reaction to Swift's new ...,"[me now?"" Taylor sings in the new track, also ...","Following the release of Swift's new track, a ...",multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
9,What happened with Kim Kardashian and Taylor S...,[look in my phone to get a name [of an album]....,Kim Kardashian reshares a Taylor Swift song on...,multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True


#### ❓ Question #3:

`{simple: 0.5, multi_context: 0.4, reasoning: 0.1}`

What exactly does this mapping refer to?

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

**ANSWER**:
- This mapping defines the proportion of each question type (each with different characteristics) to produce when generating the synthetic QA dataset. This helps the test set cover the performance of the pipeline's various components.
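
As a quick back-of-the-envelope check, here's how the proportions passed to the generator earlier (0.5 / 0.4 / 0.1 with a test size of 12) would translate into question counts. Rounding each share is an assumption made for this sketch, Ragas may allocate counts differently, and string keys stand in for the evolution objects.

```python
test_size = 12
distribution = {"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}

# The shares should describe the whole test set.
assert abs(sum(distribution.values()) - 1.0) < 1e-9

# Assumed allocation rule: round each share of the test size to a whole number.
target_counts = {name: round(share * test_size) for name, share in distribution.items()}
print(target_counts)  # {'simple': 6, 'multi_context': 5, 'reasoning': 1}
```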

In [26]:
testset.test_data

[DataRow(question="What is Taylor Swift's feud with Kanye West and Kim Kardashian?", contexts=["4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n19/22\nMUSIC\nTaylor Swift Says Feud with Kanye West and Kim\nKardashian Felt Like 'Career Death': 'I Went Down\nReally, Really Hard'\nBy Sadie Bell\nMUSIC\nTaylor Swift's Friends Say She's Always a 'Shoulder\nto Lean on' in Life (and During Breakups)\n(Exclusive)\nBy Jeff Nelson\nCOUNTRY\nOn First Tour, Alana Springsteen Gives Thanks to\nHer Hero: 'I Wouldn't Be the Artist I Am Without\nTaylor' (Exclusive)\xa0\nBy Nancy Kruh\nMUSIC"], ground_truth='nan', evolution_type='simple', metadata=[{'source': 'DataRepository/tswift_fued.pdf', 'file_path': 'DataRepository/tswift_fued.pdf', 'page': 18, 'total_pages': 22, 'format': 'PDF 1.4', 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud", 'author': '', 'subject': '', 'keywords': '', 'creator'

Let's look at the output and see what we can learn about it!

In [27]:
print(testset.test_data[0])

question="What is Taylor Swift's feud with Kanye West and Kim Kardashian?" contexts=["4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n19/22\nMUSIC\nTaylor Swift Says Feud with Kanye West and Kim\nKardashian Felt Like 'Career Death': 'I Went Down\nReally, Really Hard'\nBy Sadie Bell\nMUSIC\nTaylor Swift's Friends Say She's Always a 'Shoulder\nto Lean on' in Life (and During Breakups)\n(Exclusive)\nBy Jeff Nelson\nCOUNTRY\nOn First Tour, Alana Springsteen Gives Thanks to\nHer Hero: 'I Wouldn't Be the Artist I Am Without\nTaylor' (Exclusive)\xa0\nBy Nancy Kruh\nMUSIC"] ground_truth='nan' evolution_type='simple' metadata=[{'source': 'DataRepository/tswift_fued.pdf', 'file_path': 'DataRepository/tswift_fued.pdf', 'page': 18, 'total_pages': 22, 'format': 'PDF 1.4', 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud", 'author': '', 'subject': '', 'keywords': '', 'creator': 'Mozilla/5.

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [28]:
test_df = testset.to_pandas()

In [29]:
test_df

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What is Taylor Swift's feud with Kanye West an...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
1,What was the significance of the secretly reco...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",The significance of the secretly recorded phon...,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
2,What did Taylor Swift's spokesperson say about...,"[called to get it approved.""\nJune 2016: Taylo...",Taylor Swift's spokesperson said that much of ...,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
3,What evidence suggests that Kim Kardashian is ...,"[me now?"" Taylor sings in the new track, also ...",,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
4,What was the significance of Taylor Swift and ...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
5,Who are the inspirations for Taylor Swift's 'S...,[Who Are Taylor Swift's 'Speak Now' Songs Abou...,,simple,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
6,What did Kim Kardashian say about Taylor Swift...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",Kim Kardashian said that Taylor Swift lied abo...,multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
7,How did Kim Kardashian address her conflict wi...,"[January 14, 2019: Kim Kardashian claims\nther...",Kardashian cleared the air during an appearanc...,multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
8,What was Kardashian's reaction to Swift's new ...,"[me now?"" Taylor sings in the new track, also ...","Following the release of Swift's new track, a ...",multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
9,What happened with Kim Kardashian and Taylor S...,[look in my phone to get a name [of an album]....,Kim Kardashian reshares a Taylor Swift song on...,multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True


In [30]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses with our RAG pipeline for the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [31]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])
  print('Question: ' + question)
  print('Answer: ' + response["response"].content)
  print("****")

Question: What is Taylor Swift's feud with Kanye West and Kim Kardashian?
Answer: I don't know.
****
Question: What was the significance of the secretly recorded phone call between Kanye West and Taylor Swift in their feud?
Answer: The secretly recorded phone call between Kanye West and Taylor Swift was significant because it was later leaked online, causing controversy and fueling their feud.
****
Question: What did Taylor Swift's spokesperson say about Kim Kardashian's claims regarding Kanye West's "Famous" song?
Answer: Taylor Swift's spokesperson said, "Taylor was never made aware of the actual lyrics, 'I made that bitch famous.'"
****
Question: What evidence suggests that Kim Kardashian is a fan of Taylor Swift's music?
Answer: Kim Kardashian stated that she likes all of Taylor Swift's songs and finds them cute and catchy.
****
Question: What was the significance of Taylor Swift and Kim Kardashian posing together at the MTV EMAs in 2012?
Answer: I don't know.
****
Question: Who ar

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [31]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [32]:
response_dataset[0]

{'question': 'What are some key moments in the feud between Taylor Swift and Kim Kardashian?',
 'answer': "I don't know.",
 'contexts': ["4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n1/22\nPHOTO: KEVIN MAZUR/MTV1415/WIREIMAGE",
  "4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n5/22\nPHOTO: KEVIN MAZUR/MTV1415/WIREIMAGE",
  "4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n11/22\nPHOTO: KEVIN WINTER/GETTY",
  "4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n19/22\nMUSIC"],
 'ground_truth': 'nan'}

## Task 2: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations about each of the metrics!
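
To give a feel for what one of these scores measures, here's a toy sketch of the idea behind faithfulness: the fraction of claims in the answer that are supported by the retrieved context. In Ragas an LLM extracts and judges the claims; here the verdicts are hand-labelled and `toy_faithfulness` is a hypothetical helper. Returning `nan` when there are no claims is an assumption of this sketch.

```python
import math

def toy_faithfulness(claim_verdicts):
    """Fraction of answer claims judged as supported by the context.

    claim_verdicts: list of booleans, one per claim found in the answer.
    """
    if not claim_verdicts:
        # No claims to judge, so there is nothing to score.
        return float("nan")
    return sum(claim_verdicts) / len(claim_verdicts)

score = toy_faithfulness([True, True, False])  # two of three claims supported
empty = toy_faithfulness([])                   # answer with no extractable claims
print(score, empty)
```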

In [33]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

In [34]:
response_dataset.to_pandas()

Unnamed: 0,question,answer,contexts,ground_truth
0,What are some key moments in the feud between ...,I don't know.,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",
1,How does Taylor Swift address the issue of cha...,Taylor Swift addresses the issue of character ...,[Taylor Swift - Look What You Made Me Do\nTayl...,Taylor Swift addresses the issue of character ...
2,What is Taylor Swift's relationship with Kim K...,I don't know.,"[April 19, 2024: Taylor Swift seemingly sings\...",
3,How did Taylor Swift reference the drama of be...,Taylor Swift referenced the drama of being sta...,[appeared to be her way of reclaiming the snak...,Taylor Swift referenced the drama of being sta...
4,What did Kim Kardashian say about Taylor Swift...,Kim Kardashian claimed that Taylor Swift knew ...,"[“Famous” lyric all along \n""She totally appro...",Kim Kardashian claimed that Taylor Swift knew ...
5,What did Kim Kardashian say about Taylor Swift...,Kim Kardashian claimed that Taylor Swift knew ...,"[“Famous” lyric all along \n""She totally appro...",Kim Kardashian claimed that Taylor Swift knew ...


All that's left to do is call "evaluate" and away we go!

In [35]:
results = evaluate(response_dataset, metrics)

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

No statements were generated from the answer.
No statements were generated from the answer.


In [36]:
results

{'faithfulness': 1.0000, 'answer_relevancy': 0.6382, 'context_recall': 0.6250, 'context_precision': 0.6667, 'answer_correctness': 0.4813}

In [37]:
results_df = results.to_pandas()
results_df


Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What are some key moments in the feud between ...,I don't know.,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",,,0.0,0.0,0.0,0.198202
1,How does Taylor Swift address the issue of cha...,Taylor Swift addresses the issue of character ...,[Taylor Swift - Look What You Made Me Do\nTayl...,Taylor Swift addresses the issue of character ...,1.0,0.999657,1.0,1.0,0.535389
2,What is Taylor Swift's relationship with Kim K...,I don't know.,"[April 19, 2024: Taylor Swift seemingly sings\...",,,0.0,0.75,0.0,0.198202
3,How did Taylor Swift reference the drama of be...,Taylor Swift referenced the drama of being sta...,[appeared to be her way of reclaiming the snak...,Taylor Swift referenced the drama of being sta...,1.0,0.914142,0.0,1.0,0.723496
4,What did Kim Kardashian say about Taylor Swift...,Kim Kardashian claimed that Taylor Swift knew ...,"[“Famous” lyric all along \n""She totally appro...",Kim Kardashian claimed that Taylor Swift knew ...,1.0,0.956046,1.0,1.0,0.616543
5,What did Kim Kardashian say about Taylor Swift...,Kim Kardashian claimed that Taylor Swift knew ...,"[“Famous” lyric all along \n""She totally appro...",Kim Kardashian claimed that Taylor Swift knew ...,1.0,0.959215,1.0,1.0,0.615848
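
The aggregate scores printed by `results` line up with per-question means that skip missing values: the two "I don't know." answers produced no statements, so their `faithfulness` cell is empty. Here is a minimal sketch that reproduces the aggregates from the per-row values in the table above. It assumes missing rows are simply ignored, which is an assumption about how Ragas aggregates that happens to match these numbers, and `nan_mean` is a hypothetical helper, not part of Ragas.

```python
import math


def nan_mean(values):
    """Mean of the values that are not NaN (hypothetical helper, not a Ragas API)."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept)


nan = float("nan")

# Per-question scores copied from the results_df output above.
per_row = {
    "faithfulness": [nan, 1.0, nan, 1.0, 1.0, 1.0],
    "answer_relevancy": [0.0, 0.999657, 0.0, 0.914142, 0.956046, 0.959215],
    "context_recall": [0.0, 1.0, 0.75, 0.0, 1.0, 1.0],
    "context_precision": [0.0, 1.0, 0.0, 1.0, 1.0, 1.0],
    "answer_correctness": [0.198202, 0.535389, 0.198202, 0.723496, 0.616543, 0.615848],
}

aggregates = {name: round(nan_mean(scores), 4) for name, scores in per_row.items()}
print(aggregates)
```

Read this way, the `faithfulness` of 1.0 reflects only the four questions that received a real answer, while the two "I don't know." rows pull `answer_relevancy` down to 0.6382.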


## Task 3: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

> NOTE: MultiQueryRetriever is expanded on [here](https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever) but for now, the implementation is not important to our lesson!
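
For intuition only, the idea behind a multi-query retriever can be sketched in plain Python: ask an LLM for several rewordings of the question, retrieve for each one, and keep the unique union of the results. This is not LangChain's implementation. `generate_variants`, `retrieve`, the rewordings, and the document names below are hypothetical stand-ins so the sketch runs without an LLM or a vector store.

```python
def multi_query_retrieve(question, generate_variants, retrieve):
    """Retrieve documents for several rewordings of a question and merge them.

    generate_variants and retrieve are hypothetical callables standing in for
    the LLM call and the underlying vector-store retriever.
    """
    seen = set()
    merged = []
    for query in generate_variants(question):
        for doc in retrieve(query):
            if doc not in seen:  # keep the first occurrence, preserve order
                seen.add(doc)
                merged.append(doc)
    return merged


# Toy index: each reworded query maps to the documents it would retrieve.
fake_index = {
    "Who is Taylor Swift feuding with?": ["doc_timeline", "doc_famous_lyric"],
    "Which celebrity has a feud with Taylor Swift?": ["doc_timeline", "doc_phone_call"],
    "Who has Taylor Swift had public disputes with?": ["doc_snl_rant", "doc_famous_lyric"],
}

merged = multi_query_retrieve(
    "Who is Taylor Swift feuding with?",
    generate_variants=lambda question: list(fake_index),  # pretend LLM rewordings
    retrieve=lambda query: fake_index[query],             # pretend vector search
)
print(merged)
```

The merged list is longer than any single query's result, which is why this kind of retriever can raise recall while also letting less relevant chunks into the context.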

In [38]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
primary_qa_llm

ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x10faaae10>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x10fcf08d0>, temperature=0.0, openai_api_key=SecretStr('**********'), openai_proxy='')

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [39]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)
retrieval_qa_prompt

ChatPromptTemplate(input_variables=['context', 'input'], input_types={'chat_history': typing.List[typing.Union[langchain_core.messages.ai.AIMessage, langchain_core.messages.human.HumanMessage, langchain_core.messages.chat.ChatMessage, langchain_core.messages.system.SystemMessage, langchain_core.messages.function.FunctionMessage, langchain_core.messages.tool.ToolMessage]]}, metadata={'lc_hub_owner': 'langchain-ai', 'lc_hub_repo': 'retrieval-qa-chat', 'lc_hub_commit_hash': 'b60afb6297176b022244feb83066e10ecadcda7b90423654c4a9d45e7a73cebc'}, messages=[SystemMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context'], template='Answer any use questions based solely on the context below:\n\n<context>\n{context}\n</context>')), MessagesPlaceholder(variable_name='chat_history', optional=True), HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['input'], template='{input}'))])
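
To make "stuffing" concrete: every retrieved chunk is joined into one string and substituted for `{context}` in the system prompt printed above. The sketch below mimics that with plain strings. It assumes a blank-line separator between chunks, and `stuff_documents` and `build_messages` are hypothetical helpers rather than LangChain functions.

```python
# System prompt text copied verbatim from the hub prompt printed above.
SYSTEM_TEMPLATE = (
    "Answer any use questions based solely on the context below:\n\n"
    "<context>\n{context}\n</context>"
)


def stuff_documents(chunks, separator="\n\n"):
    """Join every retrieved chunk into a single context string."""
    return separator.join(chunks)


def build_messages(question, chunks):
    """Return (role, text) pairs: the stuffed system prompt, then the question."""
    context = stuff_documents(chunks)
    return [
        ("system", SYSTEM_TEMPLATE.format(context=context)),
        ("human", question),
    ]


# Toy chunks; in the notebook these come from the retriever.
messages = build_messages(
    "Who is Taylor Swift feuding with?",
    ["chunk one about the feud timeline", "chunk two about the song lyrics"],
)
print(messages[0][1])
```

Because all chunks land in a single prompt, a retriever that returns more chunks directly increases the prompt length.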

Next, we'll create the retrieval chain!

In [40]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)
retrieval_chain

RunnableBinding(bound=RunnableAssign(mapper={
  context: RunnableBinding(bound=RunnableLambda(lambda x: x['input'])
           | MultiQueryRetriever(retriever=VectorStoreRetriever(tags=['Qdrant', 'OpenAIEmbeddings'], vectorstore=<langchain_community.vectorstores.qdrant.Qdrant object at 0x10fe1ead0>), llm_chain=LLMChain(prompt=PromptTemplate(input_variables=['question'], template='You are an AI language model assistant. Your task is \n    to generate 3 different versions of the given user \n    question to retrieve relevant documents from a vector  database. \n    By generating multiple perspectives on the user question, \n    your goal is to help the user overcome some of the limitations \n    of distance-based similarity search. Provide these alternative \n    questions separated by newlines. Original question: {question}'), llm=ChatOpenAI(client=<openai.resources.chat.completions.Completions object at 0x10faaae10>, async_client=<openai.resources.chat.completions.AsyncCompletions obje

In [41]:
response = retrieval_chain.invoke({"input": "Who is Taylor Swift feuding with?"})

In [42]:
print(response["answer"])

Taylor Swift is feuding with Kim Kardashian.


In [43]:
response = retrieval_chain.invoke({"input": "Why are they feuding?"})

In [44]:
print(response["answer"])

Taylor Swift and Kim Kardashian have been feuding over leaked phone calls and other incidents that have caused tension between them. The feud escalated with Kanye West's controversial rant during his appearance on Saturday Night Live, adding to the longstanding conflict between Swift and West.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [45]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [46]:
answers

['Some key moments in the feud between Taylor Swift and Kim Kardashian include:\n- Taylor Swift\'s public disagreement with Kanye West over his song "Famous" in 2016.\n- Kim Kardashian releasing a phone call between Taylor Swift and Kanye West in 2016 to prove Swift approved the lyrics.\n- Taylor Swift addressing the feud in her music, such as in her song "Look What You Made Me Do."\n- Various public statements and social media posts from both parties discussing the feud.',
 'In her song "Look What You Made Me Do," Taylor Swift addresses the issue of character assassination by reclaiming the snake emoji that was posted in her Instagram comments section, implying that she was a liar. This appeared to be her way of responding to the negative portrayal of her character.',
 'Taylor Swift and Kim Kardashian have had a complicated relationship with many ups and downs over the years. They have been seen posing together at events like the MTV EMAs and sharing hugs at the Grammys and MTV VMAs. 

Now we can convert this into a dataset, just like we did before.

In [47]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
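
Before building the dataset it can help to sanity-check the columns, since they must all have the same length and `contexts` holds one list of strings per question. The sketch below does that in plain Python. `check_columns` and `missing_ground_truth` are hypothetical helpers, and the toy values only imitate the shape of the notebook's data.

```python
def check_columns(columns):
    """Verify equal column lengths and list-of-strings contexts; return the row count."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"column lengths differ: {lengths}")
    for i, ctx in enumerate(columns["contexts"]):
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise TypeError(f"contexts[{i}] must be a list of strings")
    return next(iter(lengths.values()))


def missing_ground_truth(columns):
    """Indices of rows whose ground_truth is not a non-empty string."""
    return [
        i
        for i, truth in enumerate(columns["ground_truth"])
        if not isinstance(truth, str) or not truth.strip()
    ]


# Toy columns shaped like the ones passed to Dataset.from_dict above.
toy_columns = {
    "question": ["q0", "q1"],
    "answer": ["I don't know.", "a1"],
    "contexts": [["chunk a", "chunk b"], ["chunk c"]],
    "ground_truth": [float("nan"), "reference answer"],
}

n_rows = check_columns(toy_columns)
missing = missing_ground_truth(toy_columns)
print(n_rows, missing)
```

Rows with a missing `ground_truth` are worth a look, because reference-based metrics such as `context_recall` and `answer_correctness` have nothing meaningful to compare against for those rows.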

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [48]:
#advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics, llm=generator_llm, embeddings=embeddings)
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

In [49]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df



Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What are some key moments in the feud between ...,Some key moments in the feud between Taylor Sw...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",,1.0,1.0,0.0,0.0,0.173487
1,How does Taylor Swift address the issue of cha...,"In her song ""Look What You Made Me Do,"" Taylor...",[Taylor Swift - Look What You Made Me Do\nTayl...,Taylor Swift addresses the issue of character ...,1.0,0.947917,1.0,1.0,0.606499
2,What is Taylor Swift's relationship with Kim K...,Taylor Swift and Kim Kardashian have had a com...,"[April 19, 2024: Taylor Swift seemingly sings\...",,1.0,0.963775,1.0,0.0,0.177307
3,How did Taylor Swift reference the drama of be...,Taylor Swift referenced the drama of being sta...,[Taylor Swift - Look What You Made Me Do\nTayl...,Taylor Swift referenced the drama of being sta...,1.0,0.952909,0.0,0.916667,0.22961
4,What did Kim Kardashian say about Taylor Swift...,Kim Kardashian claimed that Taylor Swift knew ...,"[“Famous” lyric all along \n""She totally appro...",Kim Kardashian claimed that Taylor Swift knew ...,1.0,0.928671,1.0,0.770833,0.617426
5,What did Kim Kardashian say about Taylor Swift...,Kim Kardashian claimed that Taylor Swift knew ...,"[“Famous” lyric all along \n""She totally appro...",Kim Kardashian claimed that Taylor Swift knew ...,1.0,0.93938,1.0,0.770833,0.616999


## Task 4: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [50]:
results

{'faithfulness': 1.0000, 'answer_relevancy': 0.6382, 'context_recall': 0.6250, 'context_precision': 0.6667, 'answer_correctness': 0.4813}

And see how our advanced retrieval modified our chain!

In [51]:
advanced_retrieval_results

{'faithfulness': 1.0000, 'answer_relevancy': 0.9554, 'context_recall': 0.6667, 'context_precision': 0.5764, 'answer_correctness': 0.4036}

In [52]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

Unnamed: 0,Metric,Baseline,MultiQueryRetriever with Document Stuffing,Delta
0,faithfulness,1.0,1.0,0.0
1,answer_relevancy,0.638177,0.955442,0.317265
2,context_recall,0.625,0.666667,0.041667
3,context_precision,0.666667,0.576389,-0.090278
4,answer_correctness,0.48128,0.403555,-0.077725


## Task 5: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #2:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

### Here we specify which embedding model to use

In [53]:
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
#new_embeddings = AzureOpenAIEmbeddings(
#    model="text-embedding-3-large"
#)

### We create a Qdrant vector store, which we will use to store the document embeddings

In [54]:
vector_store = qdrant = Qdrant.from_documents(
    documents,
    new_embeddings,
    location=":memory:",
    collection_name="Taylor Swift - Fued - MQR",
)

### We expose the vector store created in the step above as a retriever, which fetches context related to a user query

In [55]:
new_retriever = vector_store.as_retriever()
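
As a rough picture of what a vector-store retriever does, the sketch below ranks toy document vectors by cosine similarity to a query vector and keeps the top `k`. It assumes cosine similarity as the distance measure. The three-dimensional vectors and document names are made up for illustration, and real embeddings from `text-embedding-3-small` have far more dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, doc_vecs, k=2):
    """Names of the k documents most similar to the query, best first."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical embeddings; a real store computes these with the embedding model.
doc_vecs = {
    "feud_timeline": [0.9, 0.1, 0.0],
    "song_lyrics": [0.1, 0.9, 0.0],
    "tour_dates": [0.0, 0.1, 0.9],
}
query_vec = [0.8, 0.3, 0.0]

print(top_k(query_vec, doc_vecs, k=2))
```

Swapping the embedding model changes the vectors and therefore the ranking, which is the effect the rest of this task measures.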

### Here we initialize a MultiQueryRetriever from LangChain

In [56]:
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)

### In this step we create the retrieval chain using LangChain's chain wrapper directly; under the hood it uses LCEL to build the chain

In [57]:
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

### Now we pass our synthetically generated questions to the chain to get answers and their contexts

In [58]:
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])
  print(question)
  print(response["answer"])
  print("***")

What are some key moments in the feud between Taylor Swift and Kim Kardashian?
Some key moments in the feud between Taylor Swift and Kim Kardashian include:

1. November 11, 2012: The feud between Taylor Swift and Kim Kardashian began.
2. There is a photo of Kevin Mazur/MTV1415/WireImage included in the timeline.
3. There is a photo of Kevin Winter/Getty included in the timeline.
***
How does Taylor Swift address the issue of character assassination in her song "Look What You Made Me Do"?
In her song "Look What You Made Me Do," Taylor Swift references the Kanye West and Kim Kardashian drama by including the lyrics, "I made that bitch famous." This can be seen as Swift addressing the issue of character assassination that she felt was directed towards her by Kanye West in the past.
***
What is Taylor Swift's relationship with Kim Kardashian?
Taylor Swift and Kim Kardashian have had a feud, as indicated by the article titled "A Timeline of Taylor Swift and Kim Kardashian's Feud."
***
How 

### We convert our questions, answers, contexts, and ground truths into the Hugging Face dataset format

In [59]:
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

### Finally, we run the Ragas evaluation on the new_response_dataset_advanced_retrieval dataset using several Ragas evaluation metrics

In [60]:
#new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics, llm=generator_llm, embeddings=embeddings)
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/30 [00:00<?, ?it/s]

### These are the evaluation results for each metric

In [61]:
new_advanced_retrieval_results

{'faithfulness': 0.9444, 'answer_relevancy': 0.9394, 'context_recall': 0.6667, 'context_precision': 0.6667, 'answer_correctness': 0.4713}

### Finally, we compare the per-metric deltas before and after changing our embedding model

In [62]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['Delta - TE3 -> ADA'] = df_merged['Text Embedding 3'] - df_merged['ADA']
df_merged['Delta - TE3 -> Baseline'] = df_merged['Text Embedding 3'] - df_merged['Baseline']

df_merged

Unnamed: 0,Metric,Baseline,ADA,Text Embedding 3,Delta - TE3 -> ADA,Delta - TE3 -> Baseline
0,faithfulness,1.0,1.0,0.944444,-0.055556,-0.05555556
1,answer_relevancy,0.638177,0.955442,0.939433,-0.016009,0.3012567
2,context_recall,0.625,0.666667,0.666667,0.0,0.04166667
3,context_precision,0.666667,0.576389,0.666667,0.090278,7.222223e-12
4,answer_correctness,0.48128,0.403555,0.471275,0.06772,-0.01000521


#### ❓ Question #4:

In your opinion, is `text-embedding-3-small` significantly better than `ada`?

**ANSWER**:
- On this six-question test set, `text-embedding-3-small` improves `context_precision` (+0.090) and `answer_correctness` (+0.068) over `ada`, but `faithfulness` (-0.056) and `answer_relevancy` (-0.016) drop slightly and `context_recall` is unchanged. Based on this notebook's results, I don't see a significant improvement on the other metrics.

## BONUS ACTIVITY: Showcase Multi-Context Performance Changes

Now that we've looked at a number of different examples - showcase the difference on the multi-context *specific* questions that were synthetically generated.

> NOTE: You have all the data you'll need already in the notebook if you made it to this step!

In [68]:
# filtering testset for only multi_context questions
test_set_multi_context = testset.to_pandas()
test_set_multi_context = test_set_multi_context[test_set_multi_context['evolution_type'] == 'multi_context']
test_set_multi_context.reset_index(drop=True, inplace=True)

q_list = list(test_set_multi_context['question'])
test_set_multi_context

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What did Kim Kardashian say about Taylor Swift...,[will be the greatest feeling in the world. Th...,Kim Kardashian claimed that Taylor Swift knew ...,multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True
1,What did Kim Kardashian say about Taylor Swift...,[will be the greatest feeling in the world. Th...,Kim Kardashian claimed that Taylor Swift knew ...,multi_context,"[{'source': 'DataRepository/tswift_fued.pdf', ...",True


In [69]:
# filtering results for different RAG pipelines for only multi_context questions
results_mc = results.to_pandas()
results_mc = results_mc[results_mc['question'].isin(q_list)]
advanced_retrieval_results_mc = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_mc = advanced_retrieval_results_mc[advanced_retrieval_results_mc['question'].isin(q_list)]
new_advanced_retrieval_results_mc = new_advanced_retrieval_results.to_pandas()
new_advanced_retrieval_results_mc = new_advanced_retrieval_results_mc[new_advanced_retrieval_results_mc['question'].isin(q_list)]



In [70]:
# calculating mean for all the metrics
results_mc = results_mc.describe().loc['mean'].reset_index()
results_mc.columns = ['Metric', 'Baseline']
advanced_retrieval_results_mc = advanced_retrieval_results_mc.describe().loc['mean'].reset_index()
advanced_retrieval_results_mc.columns = ['Metric', 'ada']
new_advanced_retrieval_results_mc = new_advanced_retrieval_results_mc.describe().loc['mean'].reset_index()
new_advanced_retrieval_results_mc.columns = ['Metric', 'embedding_small']

In [71]:
# merging all dataframes
df_merged_mc = pd.merge(advanced_retrieval_results_mc, new_advanced_retrieval_results_mc, on='Metric')
df_merged_mc = pd.merge(results_mc, df_merged_mc, on="Metric")
df_merged_mc

Unnamed: 0,Metric,Baseline,ada,embedding_small
0,faithfulness,1.0,1.0,1.0
1,answer_relevancy,0.95763,0.934025,0.942493
2,context_recall,1.0,1.0,1.0
3,context_precision,1.0,0.770833,1.0
4,answer_correctness,0.616196,0.617213,0.603063


In [72]:
# calculating delta for all metrics for different RAG pipelines
df_merged_mc['Delta - embedding_small -> ada'] = df_merged_mc['embedding_small'] - df_merged_mc['ada']
df_merged_mc['Delta - ada -> Baseline'] = df_merged_mc['ada'] - df_merged_mc['Baseline']

df_merged_mc

Unnamed: 0,Metric,Baseline,ada,embedding_small,Delta - embedding_small -> ada,Delta - ada -> Baseline
0,faithfulness,1.0,1.0,1.0,0.0,0.0
1,answer_relevancy,0.95763,0.934025,0.942493,0.008468,-0.023605
2,context_recall,1.0,1.0,1.0,0.0,0.0
3,context_precision,1.0,0.770833,1.0,0.229167,-0.229167
4,answer_correctness,0.616196,0.617213,0.603063,-0.01415,0.001017
