# Evaluation of RAG Using Ragas

In this notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". It gives us both component-wise metrics and end-to-end metrics for the performance of our RAG pipelines.

We'll complete the following tasks:

- 🤝 Breakout Room Part #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Create a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room Part #2:
  1. Synthetic Dataset Generation for Evaluation using [Ragas](https://github.com/explodinggradients/ragas)
  2. Evaluating our pipeline with Ragas
  3. Making Adjustments to our RAG Pipeline
  4. Evaluating our Adjusted pipeline against our baseline
  5. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!
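
To put the quoted benchmark numbers in perspective, here is a minimal sketch that computes the absolute and relative gains. The scores are taken from the passage above; the `gains` helper is just an illustrative name.

```python
# Benchmark averages quoted in the OpenAI blog passage above (percent).
scores = {
    "MIRACL": {"ada-002": 31.4, "3-small": 44.0},
    "MTEB": {"ada-002": 61.0, "3-small": 62.3},
}


def gains(old: float, new: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new - old
    relative = 100.0 * absolute / old
    return absolute, relative


for benchmark, s in scores.items():
    absolute, relative = gains(s["ada-002"], s["3-small"])
    print(f"{benchmark}: +{absolute:.1f} points ({relative:.1f}% relative)")
```

The multi-language gain is much larger than the English one, so on an English-only document like ours we should expect a modest difference at best.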

# 🤝 Breakout Room Part #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai

We'll also get the "star of the show" today, which is Ragas!

In [2]:
!pip install -qU ragas

We'll be leveraging [Qdrant](https://qdrant.tech/) again as our LangChain `VectorStore`.

We'll also install `pymupdf` and its dependencies which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package!

In [3]:
!pip install -qU qdrant-client pymupdf pandas

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [4]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Please provide your OpenAI Key: ··········
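
If you re-run the notebook often, it can be convenient to prompt only when the key isn't already set. The following is a small sketch of that idea; `ensure_api_key` is a hypothetical helper (not part of `openai` or LangChain), and the `env` and `prompt_fn` parameters exist only to make it easy to test.

```python
import os
from getpass import getpass
from typing import Callable, MutableMapping


def ensure_api_key(
    env: MutableMapping[str, str] = os.environ,
    var_name: str = "OPENAI_API_KEY",
    prompt_fn: Callable[[str], str] = getpass,
) -> str:
    """Return the key stored under var_name, prompting only if it is missing."""
    key = env.get(var_name, "").strip()
    if not key:
        # Nothing usable in the environment: ask once and remember the answer.
        key = prompt_fn(f"Please provide your {var_name}: ").strip()
        env[var_name] = key
    return key
```

Calling `ensure_api_key()` with no arguments would read from (and write to) `os.environ`, which is where `langchain-openai` looks for the key by default.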


## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use an LLM to generate responses based on the retrieved context
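
The three steps above can be sketched with nothing but the standard library. This is a toy illustration, not what LangChain or Qdrant do internally: the "embedding" is a simple word count, the chunks are made-up sentences rather than quotes from our PDF, and the final step only builds the prompt that a real pipeline would send to an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# 1. Create an index: store each chunk alongside its vector.
chunks = [
    "Kim Kardashian said she was over the feud.",
    "The album was released in April.",
    "Taylor Swift referenced the feud in a music video.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]


# 2. Retrieve the k chunks most similar to the query.
def retrieve(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# 3. Generate: a real pipeline would send this prompt to an LLM.
def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion:\n{query}"


print(build_prompt("What did Kim Kardashian say about the feud?"))
```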

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.1.0 changes that keep the base (`langchain-core`) package lightweight while still providing access to some of the more powerful community integrations.

In [5]:
!git clone https://github.com/AI-Maker-Space/DataRepository

Cloning into 'DataRepository'...
remote: Enumerating objects: 68, done.
remote: Counting objects: 100% (60/60), done.
remote: Compressing objects: 100% (47/47), done.
remote: Total 68 (delta 18), reused 28 (delta 8), pack-reused 8
Receiving objects: 100% (68/68), 69.00 MiB | 38.36 MiB/s, done.
Resolving deltas: 100% (18/18), done.


In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "DataRepository/tswift_fued.pdf",
)

documents = loader.load()

In [7]:
documents[0].metadata

{'source': 'DataRepository/tswift_fued.pdf',
 'file_path': 'DataRepository/tswift_fued.pdf',
 'page': 0,
 'total_pages': 22,
 'format': 'PDF 1.4',
 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud",
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36',
 'producer': 'Skia/PDF m123',
 'creationDate': "D:20240423220523+00'00'",
 'modDate': "D:20240423220523+00'00'",
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 200,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [9]:
len(documents)

177
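
To build some intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that cuts text into fixed-size character windows. It is a plain sliding window, not the algorithm `RecursiveCharacterTextSplitter` uses (which also tries to break on separators such as paragraphs and spaces), and `sliding_window_chunks` is a hypothetical name.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Simplified stand-in: it ignores separators and cuts purely by position.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in sliding_window_chunks(sample, chunk_size=10, chunk_overlap=4):
    print(chunk)
```

Each chunk repeats the last `chunk_overlap` characters of the previous one, which helps keep sentences that straddle a boundary retrievable from either side.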

#### Loading OpenAI Embeddings Model

We'll need a way to convert our text into vectors so that we can compare each chunk against our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [10]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)

#### Creating a Qdrant VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [11]:
from langchain_community.vectorstores import Qdrant

qdrant_vector_store = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="Taylor Swift - Fued - ADA",
)

####❓ Question #1:

List out a few of the techniques that Qdrant uses that make it performant.

> NOTE: Check the [documentation](https://qdrant.tech/documentation/overview/) for more information about Qdrant!

#### Answer #1: Qdrant supports several distance metrics for similarity search, including:
- Cosine Similarity
- Dot Product
- Euclidean Distance

- It also uses a graph-based index (HNSW) to find the nearest neighbours, so it doesn't need to calculate the distance to every object in the database, only to a subset of candidates.
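
The three metrics named in the answer above are easy to compute by hand for small vectors. The sketch below uses plain Python lists and made-up two-dimensional vectors; it illustrates the formulas only and says nothing about how Qdrant implements them.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    """Dot product: larger when vectors point the same way and are long."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the normalised vectors: depends on direction only."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    """Straight-line distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction = [3.0, 0.0]  # same direction as the query, but longer
orthogonal = [0.0, 2.0]      # perpendicular to the query

for name, vec in [("same_direction", same_direction), ("orthogonal", orthogonal)]:
    print(
        name,
        f"cosine={cosine_similarity(query, vec):.2f}",
        f"dot={dot(query, vec):.2f}",
        f"euclidean={euclidean_distance(query, vec):.2f}",
    )
```

Note that cosine similarity ignores vector length, while the dot product and Euclidean distance do not, so the metrics can rank the same candidates differently.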

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [12]:
retriever = qdrant_vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [13]:
retrieved_documents = retriever.invoke("Who is Taylor Swift fueding with?")

In [14]:
for doc in retrieved_documents:
  print(doc)

page_content="4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n19/22\nMUSIC" metadata={'source': 'DataRepository/tswift_fued.pdf', 'file_path': 'DataRepository/tswift_fued.pdf', 'page': 18, 'total_pages': 22, 'format': 'PDF 1.4', 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud", 'author': '', 'subject': '', 'keywords': '', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36', 'producer': 'Skia/PDF m123', 'creationDate': "D:20240423220523+00'00'", 'modDate': "D:20240423220523+00'00'", 'trapped': '', '_id': '58ba936860bf45bf8c154bf19f8bf648', '_collection_name': 'Taylor Swift - Fued - ADA'}
page_content="4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian's Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n21/22\nMUSIC" metadata={'source': 'DataRepository/tswift_fue

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [15]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [16]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [17]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our retrieved context through to the output - which is critical for Ragas.

In [20]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

####🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

The chain first sends the question to the retriever to fetch the associated context, while keeping the original question so it can be passed along. In the graph below, the question is fed to the retriever in the left branch and passed through as-is in the right branch. `RunnablePassthrough.assign` then carries the retrieved context forward alongside the question so that both are available for response generation. In the final step, the question and context populate the prompt template (`ChatPromptTemplate`) and the formatted prompt is sent to `ChatOpenAI` in the left branch, while the right branch passes the context through. The chain ends with the generated response and the retrieved context.
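
The same data flow can be mimicked with ordinary Python functions, which may make the three stages easier to see. This is only an analogy for the LCEL chain above: `make_chain`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins, and the stubs return plain strings where the real chain returns `Document` objects and a chat message.

```python
from typing import Callable


def make_chain(
    retrieve: Callable[[str], list[str]],
    generate: Callable[[str], str],
) -> Callable[[dict], dict]:
    """Return a function that mirrors the three stages of the LCEL chain."""

    def chain(inputs: dict) -> dict:
        # Stage 1: build {"context", "question"} from the incoming question.
        question = inputs["question"]
        state = {"context": retrieve(question), "question": question}
        # Stage 2: the pass-through step leaves both keys unchanged.
        state = {**state, "context": state["context"]}
        # Stage 3: format the prompt, call the model, and keep the context.
        prompt = f"Context:\n{state['context']}\n\nQuestion:\n{state['question']}"
        return {"response": generate(prompt), "context": state["context"]}

    return chain


# Stub components so the sketch runs without any API calls.
def fake_retrieve(question: str) -> list[str]:
    return ["doc about " + question]


def fake_generate(prompt: str) -> str:
    return f"answer based on {len(prompt)} prompt characters"


chain = make_chain(fake_retrieve, fake_generate)
result = chain({"question": "the feud"})
print(sorted(result))
```

The output dictionary has exactly the two keys the real chain produces, `response` and `context`, which is what lets us hand the retrieved context to the evaluator later.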





In [22]:
!pip install -q grandalf

In [23]:
print(retrieval_augmented_qa_chain.get_graph().draw_ascii())

                       +---------------------------------+                         
                       | Parallel<context,question>Input |                         
                       +---------------------------------+                         
                           *****                   ****                            
                        ***                            ****                        
                     ***                                   ****                    
+--------------------------------+                             **                  
| Lambda(itemgetter('question')) |                              *                  
+--------------------------------+                              *                  
                 *                                              *                  
                 *                                              *                  
                 *                                              *           

Let's test it out!

In [24]:
question = "Who is Taylor Swift fueding with?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Kim Kardashian


In [25]:
question = "Why are they fueding?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

I don't know.
[Document(page_content="discussed the longstanding feud between West and Swift in great detail, as\nwell as West's controversial rant during his appearance on Saturday Night", metadata={'source': 'DataRepository/tswift_fued.pdf', 'file_path': 'DataRepository/tswift_fued.pdf', 'page': 6, 'total_pages': 22, 'format': 'PDF 1.4', 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud", 'author': '', 'subject': '', 'keywords': '', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36', 'producer': 'Skia/PDF m123', 'creationDate': "D:20240423220523+00'00'", 'modDate': "D:20240423220523+00'00'", 'trapped': '', '_id': '3cbe28b8d1f24d4085034769258743f7', '_collection_name': 'Taylor Swift - Fued - ADA'}), Document(page_content='That Beef!" segment. When asked about if her feud with Swift over the\nleaked phone calls was still ongoing, she said she was "over it."', metadata={'source': 'DataRepository/tswif

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to provide us insight into how our pipeline is performing!

# 🤝 Breakout Room Part #2

## Task 1: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that - we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic question/context (QC) pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4-turbo-preview` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

In [26]:
loader = PyMuPDFLoader(
    "DataRepository/tswift_fued.pdf",
)

eval_documents = loader.load()

text_splitter_eval = RecursiveCharacterTextSplitter(
    chunk_size = 600,
    chunk_overlap = 50
)

eval_documents = text_splitter_eval.split_documents(eval_documents)

####❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

#### Answer #2: Splitting our documents with different parameters gives us a more diverse set of contexts to evaluate our pipeline on. If the evaluator used the same `chunk_size` as our pipeline, this could introduce a bias: the evaluator could end up with exactly the same chunks, and therefore the same answers, as the pipeline it is supposed to judge. Using different parameters adds some variance, so the QA pairs and ground truths are created from slightly different contexts and the resulting evaluation is more reliable.

In [27]:
len(eval_documents)

56

> NOTE: This cell will take ~5-6 minutes to run. If you run into any rate-limit issues - please use GPT-3.5-Turbo as your `critic_llm`. If you see any fields marked `nan` - this is a product of rate-limiting issues, and you can safely ignore them for now.

In [28]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

generator_llm = ChatOpenAI(model="gpt-3.5-turbo-16k")
# critic_llm = ChatOpenAI(model="gpt-3.5-turbo") <--- If you don't have GPT-4 access, or run into rate-limit, or `nan` issues.
critic_llm = ChatOpenAI(model="gpt-4-turbo-preview")
embeddings = OpenAIEmbeddings()

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

testset = generator.generate_with_langchain_docs(eval_documents, 10, distributions, is_async = False)




####❓ Question #3:

`{simple: 0.5, reasoning: 0.25, multi_context: 0.25}`

What exactly does this mapping refer to?

> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

#### Answer #3: This mapping defines the distribution of question types in the generated test set that we'll use for evaluation. In this case, 50% of the questions will be simple, 25% will be reasoning questions, and 25% will be multi-context questions. (The cell above uses 0.5 / 0.4 / 0.1 for simple / multi-context / reasoning instead.) The right mix depends on what we want to evaluate: will the user questions mostly be simple, multi-context, or whys and hows?
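
As a rough illustration of how such a mapping translates into question counts, the sketch below splits a requested test size across the types using largest-remainder rounding. This is an assumption for illustration, not Ragas' internal allocation logic, and it uses string keys rather than the Ragas evolution objects. The number of rows actually returned can also be lower than requested, as we'll see below.

```python
def expected_counts(distribution: dict[str, float], test_size: int) -> dict[str, int]:
    """Split test_size across question types in proportion to their weights."""
    total = sum(distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights should sum to 1")
    # Start from the rounded-down share of each type...
    raw = {name: weight * test_size for name, weight in distribution.items()}
    counts = {name: int(value) for name, value in raw.items()}
    # ...then hand the leftover questions to the largest fractional parts.
    remainder = test_size - sum(counts.values())
    by_fraction = sorted(raw, key=lambda name: raw[name] - counts[name], reverse=True)
    for name in by_fraction[:remainder]:
        counts[name] += 1
    return counts


# The weights used in the generation cell above, for a test size of 10.
print(expected_counts({"simple": 0.5, "multi_context": 0.4, "reasoning": 0.1}, 10))
```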

Let's look at the output and see what we can learn about it!

In [29]:

len(testset.test_data)

8

In [30]:
testset.test_data[0]

DataRow(question="What is the timeline of Taylor Swift and Kim Kardashian's feud?", contexts=['4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian\'s Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n6/22\n"I want to say to all the young women out there, there are going to be\npeople along the way who will try to undercut your success or take credit\nfor your accomplishments or your fame," she said on stage. "If you just\nfocus on the work, and you don\'t let those people sidetrack you, someday\nwhen you get where you\'re going, you\'ll look around and you will know\nthat it was you and the people who love you who put you there, and that'], ground_truth='nan', evolution_type='simple', metadata=[{'source': 'DataRepository/tswift_fued.pdf', 'file_path': 'DataRepository/tswift_fued.pdf', 'page': 5, 'total_pages': 22, 'format': 'PDF 1.4', 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud", 'author': '', 'subject': '', 'keywords': '', '

In [32]:
reasoning_data = next((data for data in testset.test_data if data.evolution_type == "reasoning"), None)
reasoning_data

DataRow(question='What did Kim Kardashian say about her favorite Taylor Swift album during her interview on the Honestly podcast with host Bari Weiss?', contexts=['call in her statements, replied this time, re-sharing her very first statement\non the matter, and asking, "P.S. who did you guys piss off to leak that\nvideo?"\nDecember 16, 2021: Kim Kardashian says\nshe likes all of Taylor Swift’s songs\nDuring an interview on the Honestly podcast with host Bari Weiss,\nKardashian was asked about Swift during a lightning question round.\nWhen asked what her favorite Swift album was, Kardashian responded, “I\nreally like a lot of her songs. They\'re all super cute and catchy. I\'d have to\nlook in my phone to get a name [of an album].”'], ground_truth="Kim Kardashian said that she really likes a lot of Taylor Swift's songs and that they're all super cute and catchy. She mentioned that she would have to look in her phone to get the name of an album.", evolution_type='reasoning', metadata=[{

In [34]:
multi_context_data = next((data for data in testset.test_data if data.evolution_type == "multi_context"), None)
multi_context_data

DataRow(question='What caused the feud between Taylor Swift and Kim Kardashian, and what happened as a result?', contexts=['4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian\'s Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n14/22\n"To be clear, the only issue I ever had around the situation was that Taylor\nlied through her publicist who stated that \'Kanye never called to ask for\npermission…\' They clearly spoke so I let you all see that. Nobody ever\ndenied the word \'bitch\' was used without her permission," tweeted\nKardashian.\nSwift\'s publicist, Tree Paine, who never denied the existence of the phone', '4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian\'s Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n7/22\nwas on vacation with her family in January of 2016 and they have never\nspoken since," the statement continued. "Taylor has never denied that\nconversation took place. It was on tha

In [35]:
simple_data = next((data for data in testset.test_data if data.evolution_type == "simple"), None)
simple_data

DataRow(question="What is the timeline of Taylor Swift and Kim Kardashian's feud?", contexts=['4/23/24, 6:05 PM\nA Timeline of Taylor Swift and Kim Kardashian\'s Feud\nhttps://people.com/taylor-swift-and-kim-kardashian-feud-timeline-8412119\n6/22\n"I want to say to all the young women out there, there are going to be\npeople along the way who will try to undercut your success or take credit\nfor your accomplishments or your fame," she said on stage. "If you just\nfocus on the work, and you don\'t let those people sidetrack you, someday\nwhen you get where you\'re going, you\'ll look around and you will know\nthat it was you and the people who love you who put you there, and that'], ground_truth='nan', evolution_type='simple', metadata=[{'source': 'DataRepository/tswift_fued.pdf', 'file_path': 'DataRepository/tswift_fued.pdf', 'page': 5, 'total_pages': 22, 'format': 'PDF 1.4', 'title': "A Timeline of Taylor Swift and Kim Kardashian's Feud", 'author': '', 'subject': '', 'keywords': '', '

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from the test set we created.

We can start by converting our test dataset into a Pandas DataFrame.

In [37]:
test_df = testset.to_pandas()

In [38]:
test_df

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 0 | What is the timeline of Taylor Swift and Kim K... | [4/23/24, 6:05 PM\nA Timeline of Taylor Swift ... | | simple | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 1 | What did Kim Kardashian say about her feud wit... | [January 14, 2019: Kim Kardashian claims\nther... | Kardashian said she was "over it" and that the... | simple | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 2 | What did Kim Kardashian say about Taylor Swift... | [from The Tortured Poets Department: The Antho... | Kim Kardashian said that she loves Taylor Swif... | simple | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 3 | What was the outcome of the feud between Kim K... | [January 14, 2019: Kim Kardashian claims\nther... | The feud between Kim Kardashian and Taylor Swi... | simple | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 4 | What was one of Taylor Swift's first public in... | [4/23/24, 6:05 PM\nA Timeline of Taylor Swift ... | In 2012, Taylor Swift and Kim Kardashian had o... | simple | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 5 | What caused the feud between Taylor Swift and ... | [4/23/24, 6:05 PM\nA Timeline of Taylor Swift ... | The feud between Taylor Swift and Kim Kardashi... | multi_context | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 6 | What did Kim Kardashian say about her feud wit... | [January 14, 2019: Kim Kardashian claims\nther... | Kardashian said that she was 'over' her feud w... | multi_context | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 7 | What did Kim Kardashian say about her favorite... | [call in her statements, replied this time, re... | Kim Kardashian said that she really likes a lo... | reasoning | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |


In [41]:
# Drop the nan
test_df = test_df[test_df['ground_truth'] != 'nan']
test_df

| | question | contexts | ground_truth | evolution_type | metadata | episode_done |
|---|---|---|---|---|---|---|
| 1 | What did Kim Kardashian say about her feud wit... | [January 14, 2019: Kim Kardashian claims\nther... | Kardashian said she was "over it" and that the... | simple | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 2 | What did Kim Kardashian say about Taylor Swift... | [from The Tortured Poets Department: The Antho... | Kim Kardashian said that she loves Taylor Swif... | simple | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 3 | What was the outcome of the feud between Kim K... | [January 14, 2019: Kim Kardashian claims\nther... | The feud between Kim Kardashian and Taylor Swi... | simple | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 4 | What was one of Taylor Swift's first public in... | [4/23/24, 6:05 PM\nA Timeline of Taylor Swift ... | In 2012, Taylor Swift and Kim Kardashian had o... | simple | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 5 | What caused the feud between Taylor Swift and ... | [4/23/24, 6:05 PM\nA Timeline of Taylor Swift ... | The feud between Taylor Swift and Kim Kardashi... | multi_context | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 6 | What did Kim Kardashian say about her feud wit... | [January 14, 2019: Kim Kardashian claims\nther... | Kardashian said that she was 'over' her feud w... | multi_context | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |
| 7 | What did Kim Kardashian say about her favorite... | [call in her statements, replied this time, re... | Kim Kardashian said that she really likes a lo... | reasoning | [{'source': 'DataRepository/tswift_fued.pdf', ... | True |


In [42]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll run the questions we've generated through our RAG pipeline to produce responses - we'll also need to collect the retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [43]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [44]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})
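
Because the four columns are built from separate lists, it's easy to end up with mismatched lengths or a badly shaped `contexts` column. The following is a small sanity check you could run on the dictionary before wrapping it in a dataset; `check_eval_columns` is a hypothetical helper, not part of Ragas or `datasets`.

```python
def check_eval_columns(columns: dict) -> int:
    """Validate the column layout used for evaluation and return the row count."""
    required = {"question", "answer", "contexts", "ground_truth"}
    missing = required - columns.keys()
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Every column must describe the same number of rows.
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns differ in length: {lengths}")
    # Each row's contexts should be a list of strings, one per retrieved chunk.
    for row_contexts in columns["contexts"]:
        if not isinstance(row_contexts, list) or not all(
            isinstance(c, str) for c in row_contexts
        ):
            raise ValueError("each 'contexts' entry must be a list of strings")
    return next(iter(lengths.values()))


example = {
    "question": ["q1", "q2"],
    "answer": ["a1", "a2"],
    "contexts": [["chunk 1", "chunk 2"], ["chunk 3"]],
    "ground_truth": ["g1", "g2"],
}
print(check_eval_columns(example))
```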

Let's take a peek and see what that looks like!

In [45]:
response_dataset[0]

{'question': 'What did Kim Kardashian say about her feud with Taylor Swift during her appearance on Watch What Happens Live?',
 'answer': "I don't know.",
 'contexts': ['on the matter, and asking, "P.S. who did you guys piss off to leak that\nvideo?"\nDecember 16, 2021: Kim Kardashian says\nshe likes all of Taylor Swift’s songs',
  'then Kim posts it on the Internet."\nAugust 27\n, 2017: Taylor Swift references\nKanye West and Kim Kardashian drama in\n“Look What You Made Me Do” music video\xa0\nTaylor Swift - Look What You Made Me Do',
  'taking videos on her phone, saying, “I’m going to edit this later.”\xa0\nJanuary 14, 2019: Kim Kardashian claims\nthere’s no more “Bad Blood” with Taylor\nSwift',
  "April 23: A source tells PEOPLE Kim\nKardashian is 'over' the feud\nFollowing the release of Swift's new track, a source gave PEOPLE insight"],
 'ground_truth': 'Kardashian said she was "over it" and that they had all moved on from the feud during her appearance on Watch What Happens Live

## Task 2: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations of each metric!

In [46]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [47]:
results = evaluate(response_dataset, metrics)




In [48]:
results

{'faithfulness': 1.0000, 'answer_relevancy': 0.3987, 'context_recall': 0.6429, 'context_precision': 0.4246, 'answer_correctness': 0.4415}

In [49]:
results_df = results.to_pandas()
results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What did Kim Kardashian say about her feud wit...,I don't know.,"[on the matter, and asking, ""P.S. who did you ...","Kardashian said she was ""over it"" and that the...",,0.0,0.0,0.25,0.185406
1,What did Kim Kardashian say about Taylor Swift...,"Kim Kardashian said she ""loves"" Taylor Swift i...","[years, including what they’ve said about the ...",Kim Kardashian said that she loves Taylor Swif...,1.0,1.0,0.5,0.0,0.731974
2,What was the outcome of the feud between Kim K...,Kim Kardashian is 'over' the feud.,[April 23: A source tells PEOPLE Kim\nKardashi...,The feud between Kim Kardashian and Taylor Swi...,1.0,0.895872,0.0,0.0,0.22507
3,What was one of Taylor Swift's first public in...,One of Taylor Swift's first public interaction...,"[JAY-Z and West for a photo, she was seen givi...","In 2012, Taylor Swift and Kim Kardashian had o...",1.0,0.89524,1.0,0.583333,0.737441
4,What caused the feud between Taylor Swift and ...,I don't know.,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",The feud between Taylor Swift and Kim Kardashi...,,0.0,1.0,1.0,0.179626
5,What did Kim Kardashian say about her feud wit...,I don't know.,"[on the matter, and asking, ""P.S. who did you ...",Kardashian said that she was 'over' her feud w...,,0.0,1.0,0.638889,0.183174
6,What did Kim Kardashian say about her favorite...,Kim Kardashian responded that she really likes...,[she likes all of Taylor Swift’s songs\nDuring...,Kim Kardashian said that she really likes a lo...,1.0,0.0,1.0,0.5,0.847899
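
Notice the empty `faithfulness` cells for the "I don't know." answers. The aggregate scores printed earlier are consistent with a mean that skips those missing values, which we can check by hand. This is a sketch that copies the per-question scores from the table above; whether Ragas computes the aggregate exactly this way is an assumption.

```python
import math

# Per-question scores copied from the table above; NaN marks the empty faithfulness cells.
nan = float("nan")
faithfulness_scores = [nan, 1.0, 1.0, 1.0, nan, nan, 1.0]
answer_relevancy_scores = [0.0, 1.0, 0.895872, 0.89524, 0.0, 0.0, 0.0]


def nan_skipping_mean(values):
    """Average the values that are not NaN; return NaN if none remain."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")


faithfulness_mean = nan_skipping_mean(faithfulness_scores)
answer_relevancy_mean = nan_skipping_mean(answer_relevancy_scores)
scored_rows = sum(not math.isnan(v) for v in faithfulness_scores)

print(f"faithfulness: {faithfulness_mean:.4f} over {scored_rows} of {len(faithfulness_scores)} rows")
print(f"answer_relevancy: {answer_relevancy_mean:.4f}")
```

So the perfect `faithfulness` aggregate only reflects the four questions that received a score at all, which is worth keeping in mind when comparing pipelines.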


## Task 3: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

> NOTE: MultiQueryRetriever is expanded on [here](https://python.langchain.com/docs/modules/data_connection/retrievers/MultiQueryRetriever) but for now, the implementation is not important to our lesson!

In [50]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
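
To build some intuition for what changes, here is a toy sketch of the multi-query idea: retrieve for several rephrasings of the question and keep the unique union of the results. Everything here is hypothetical - `toy_retrieve` is a keyword matcher standing in for our vector store, and the query variants are hard-coded where the real retriever would ask the LLM to write them.

```python
def toy_retrieve(query: str, corpus: dict[str, str], k: int = 2) -> list[str]:
    """Hypothetical keyword retriever: rank chunk ids by shared lowercase words."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(text.lower().split())), chunk_id)
        for chunk_id, text in corpus.items()
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for score, chunk_id in ranked[:k] if score > 0]


def multi_query_retrieve(queries: list[str], corpus: dict[str, str], k: int = 2) -> list[str]:
    """Retrieve for every query variant and keep the unique union, in first-seen order."""
    seen: dict[str, None] = {}
    for query in queries:
        for chunk_id in toy_retrieve(query, corpus, k):
            seen.setdefault(chunk_id, None)
    return list(seen)


corpus = {
    "c1": "kim kardashian says the feud is over",
    "c2": "taylor swift releases a new music video",
    "c3": "leaked phone call between swift and west",
}
# Hard-coded rephrasings; the real retriever would generate these with an LLM.
variants = ["why is there a feud", "what happened on the phone call"]

single = toy_retrieve(variants[0], corpus)
combined = multi_query_retrieve(variants, corpus)
print(single)    # chunks found by one phrasing
print(combined)  # unique union across all phrasings
```

The union can surface chunks that a single phrasing misses, but it can also pull in extra chunks, so both context recall and context precision may move in either direction.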

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [51]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [52]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [53]:
response = retrieval_chain.invoke({"input": "Who is Taylor Swift feuding with?"})

In [54]:
print(response["answer"])

Taylor Swift is feuding with Kim Kardashian.


In [55]:
response = retrieval_chain.invoke({"input": "Why are they feuding?"})

In [56]:
print(response["answer"])

They are feuding due to a longstanding feud between West and Swift, as well as West's controversial rant during his appearance on Saturday Night Live. Additionally, there was a feud over leaked phone calls between Swift and Kardashian.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [57]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [58]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [59]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)




In [60]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,What did Kim Kardashian say about her feud wit...,Kim Kardashian claimed during her appearance o...,"[on the matter, and asking, ""P.S. who did you ...","Kardashian said she was ""over it"" and that the...",1.0,0.942819,0.0,0.413492,0.722381
1,What did Kim Kardashian say about Taylor Swift...,"Kim Kardashian said she ""loves"" Taylor Swift i...","[years, including what they’ve said about the ...",Kim Kardashian said that she loves Taylor Swif...,1.0,1.0,0.5,0.0,0.731965
2,What was the outcome of the feud between Kim K...,The context does not provide a specific outcom...,"[Kardashian.\nSwift's publicist, Tree Paine, w...",The feud between Kim Kardashian and Taylor Swi...,,0.0,0.0,0.402778,0.223795
3,What was one of Taylor Swift's first public in...,One of Taylor Swift's first public interaction...,"[3/22\nNovember 11, 2012: Taylor Swift and Kim...","In 2012, Taylor Swift and Kim Kardashian had o...",1.0,0.947575,1.0,0.873413,0.736885
4,What caused the feud between Taylor Swift and ...,The context provided does not specify the exac...,"[4/23/24, 6:05 PM\nA Timeline of Taylor Swift ...",The feud between Taylor Swift and Kim Kardashi...,,0.0,0.666667,1.0,0.222807
5,What did Kim Kardashian say about her feud wit...,Kim Kardashian expressed that she likes all of...,"[on the matter, and asking, ""P.S. who did you ...",Kardashian said that she was 'over' her feud w...,0.5,0.908142,0.5,0.525,0.724608
6,What did Kim Kardashian say about her favorite...,Kim Kardashian responded that she really likes...,[she likes all of Taylor Swift’s songs\nDuring...,Kim Kardashian said that she really likes a lo...,1.0,0.900244,0.5,0.333333,0.848161


## Task 4: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [61]:
results

{'faithfulness': 1.0000, 'answer_relevancy': 0.3987, 'context_recall': 0.6429, 'context_precision': 0.4246, 'answer_correctness': 0.4415}

And see how our advanced retrieval modified our chain!

In [62]:
advanced_retrieval_results

{'faithfulness': 0.9000, 'answer_relevancy': 0.6713, 'context_recall': 0.4524, 'context_precision': 0.5069, 'answer_correctness': 0.6015}

In [63]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

Unnamed: 0,Metric,Baseline,MultiQueryRetriever with Document Stuffing,Delta
0,faithfulness,1.0,0.9,-0.1
1,answer_relevancy,0.39873,0.671254,0.272524
2,context_recall,0.642857,0.452381,-0.190476
3,context_precision,0.424603,0.506859,0.082256
4,answer_correctness,0.441513,0.601515,0.160002
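
Aggregate deltas can hide where a change actually comes from, so it can help to look at the same metric question by question. The sketch below copies the `answer_correctness` column from the two per-question tables above; the 0.1 threshold for a "large" move is an arbitrary choice for illustration.

```python
# answer_correctness per question, copied from the two per-question tables above.
baseline_ac = [0.185406, 0.731974, 0.22507, 0.737441, 0.179626, 0.183174, 0.847899]
advanced_ac = [0.722381, 0.731965, 0.223795, 0.736885, 0.222807, 0.724608, 0.848161]

per_question_delta = [adv - base for base, adv in zip(baseline_ac, advanced_ac)]
mean_delta = sum(per_question_delta) / len(per_question_delta)

# Questions whose score moved by more than the (arbitrary) 0.1 threshold.
large_movers = [i for i, delta in enumerate(per_question_delta) if abs(delta) > 0.1]

for i, delta in enumerate(per_question_delta):
    print(f"question {i}: {delta:+.6f}")
print(f"mean delta: {mean_delta:+.6f}")
print(f"large movers: {large_movers}")
```

Almost all of the improvement comes from questions 0 and 5, where the baseline answered "I don't know." and the adjusted pipeline produced an actual answer; the remaining questions barely move.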


## Task 5: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #2:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

#### Embedding model comparison with Ragas

We want to compare how the pipeline performs with `text-embedding-ada-002` versus `text-embedding-3-small`, so we create a pipeline with all the same parameters as before, changing only the embedding model.

In [64]:
# Instantiating a new embedding model
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [65]:
# Creating a new vector store with the new embedding model.
vector_store = qdrant = Qdrant.from_documents(
    documents,
    new_embeddings,
    location=":memory:",
    collection_name="Taylor Swift - Fued - MQR",
)

In [66]:
# Creating a retriever from the new vector store
new_retriever = vector_store.as_retriever()

In [67]:
# Wrapping the new retriever in a MultiQueryRetriever, with the same QA LLM as before.
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)

In [68]:
# Creating the new retrieval chain
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

In [69]:
# Collecting the pipeline answers and context
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [70]:
# Creating a HF dataset for evaluation
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [71]:
# Running the RAGAS evaluation
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)




In [72]:
new_advanced_retrieval_results

{'faithfulness': 0.8000, 'answer_relevancy': 0.6821, 'context_recall': 0.5238, 'context_precision': 0.6676, 'answer_correctness': 0.5273}

In [73]:
# Displaying the evaluation deltas between the pipeline running with text-embedding-ada-002 and with text-embedding-3-small
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['Delta - TE3 -> ADA'] = df_merged['Text Embedding 3'] - df_merged['ADA']
df_merged['Delta - TE3 -> Baseline'] = df_merged['Text Embedding 3'] - df_merged['Baseline']

df_merged

Unnamed: 0,Metric,Baseline,ADA,Text Embedding 3,Delta - TE3 -> ADA,Delta - TE3 -> Baseline
0,faithfulness,1.0,0.9,0.8,-0.1,-0.2
1,answer_relevancy,0.39873,0.671254,0.682065,0.010811,0.283334
2,context_recall,0.642857,0.452381,0.52381,0.071429,-0.119048
3,context_precision,0.424603,0.506859,0.667622,0.160762,0.243019
4,answer_correctness,0.441513,0.601515,0.527333,-0.074182,0.08582


#### ❓ Question #4:

In your opinion, is `text-embedding-3-small` significantly better than `ada`?

#### Answer #4

Considering that `text-embedding-ada-002` and `text-embedding-3-small` use the same dimension size, yes, `text-embedding-3-small` is more efficient. In our pipeline it improved context precision (+0.16) and context recall (+0.07), and nudged the overall answer_relevancy up slightly (+0.01), although answer_correctness dropped (-0.07). I'm reluctant to judge based on faithfulness, because it seems to handle the "I don't know" answers oddly.
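
One way to probe the word "significantly" is a paired bootstrap over per-question scores, sketched below. `paired_bootstrap_ci` is a hypothetical helper, not a Ragas function. The per-question scores for the `text-embedding-3-small` run are not printed in this notebook, so the demo reuses the baseline and MultiQueryRetriever `answer_correctness` columns from earlier purely as stand-in data.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-question difference (b - a)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("Both runs must score the same questions")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    rng = random.Random(seed)
    # Resample the per-question differences with replacement and record each mean.
    means = sorted(
        sum(rng.choices(diffs, k=len(diffs))) / len(diffs)
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Stand-in data: answer_correctness for the baseline and MultiQueryRetriever runs.
run_a = [0.185406, 0.731974, 0.22507, 0.737441, 0.179626, 0.183174, 0.847899]
run_b = [0.722381, 0.731965, 0.223795, 0.736885, 0.222807, 0.724608, 0.848161]

lower, upper = paired_bootstrap_ci(run_a, run_b)
observed = sum(b - a for a, b in zip(run_a, run_b)) / len(run_a)
print(f"observed mean delta: {observed:+.4f}")
print(f"95% bootstrap interval: [{lower:+.4f}, {upper:+.4f}]")
```

With only seven questions the interval will be wide, so small aggregate deltas between the two embedding models should be read as directional hints rather than firm conclusions.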

## BONUS ACTIVITY: Showcase Multi-Context Performance Changes

Now that we've looked at a number of different examples - showcase the difference on the multi-context *specific* questions that were synthetically generated.

> NOTE: You have all the data you'll need already in the notebook if you made it to this step!

In [None]:
### YOUR CODE HERE
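
As a possible starting point, here is a sketch that groups per-question deltas by question type. It only uses the three test rows whose type is visible in the test set preview earlier (two `multi_context`, one `reasoning`), paired by position with their `answer_correctness` scores from the baseline and MultiQueryRetriever tables. `mean_delta_by_type` is a hypothetical helper; in the notebook you would build `rows` from the question-type column of `test_df` and the per-question results DataFrames instead.

```python
from collections import defaultdict

# (question type, baseline answer_correctness, MultiQueryRetriever answer_correctness)
# Only the rows whose type is visible earlier in the notebook are listed here.
rows = [
    ("multi_context", 0.179626, 0.222807),
    ("multi_context", 0.183174, 0.724608),
    ("reasoning", 0.847899, 0.848161),
]


def mean_delta_by_type(rows):
    """Average (adjusted - baseline) within each question type."""
    grouped = defaultdict(list)
    for question_type, baseline, adjusted in rows:
        grouped[question_type].append(adjusted - baseline)
    return {name: sum(deltas) / len(deltas) for name, deltas in grouped.items()}


delta_by_type = mean_delta_by_type(rows)
for name, delta in delta_by_type.items():
    print(f"{name}: {delta:+.6f}")
```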