# Evaluation of RAG Using Ragas

In this notebook we'll explore how to evaluate RAG pipelines using a powerful open-source tool called "Ragas". It lets us measure component-wise metrics, as well as end-to-end metrics, for our RAG pipelines.

We'll complete the following tasks:

- 🤝 Breakout Room #1:
  1. Install required libraries
  2. Set Environment Variables
  3. Creating a simple RAG pipeline with [LangChain v0.1.0](https://blog.langchain.dev/langchain-v0-1-0/)
  

- 🤝 Breakout Room #2:
  1. Synthetic Dataset Generation for Evaluation using [Ragas](https://github.com/explodinggradients/ragas)
  2. Evaluating our pipeline with Ragas
  3. Making Adjustments to our RAG Pipeline
  4. Evaluating our Adjusted pipeline against our baseline
  5. Testing OpenAI's Claim

The only way to get started is to get started - so let's grab our dependencies for the day!

## Motivation

A claim, made by OpenAI, is that their `text-embedding-3-small` is better (generally) than their `text-embedding-ada-002` model.

Here are some passages from their [blog](https://openai.com/blog/new-embedding-models-and-api-updates) about the `text-embedding-3` release:

> `text-embedding-3-small` is our new highly efficient embedding model and provides a significant upgrade over its predecessor, the `text-embedding-ada-002` model...

> **Stronger performance.** Comparing `text-embedding-ada-002` to `text-embedding-3-small`, the average score on a commonly used benchmark for multi-language retrieval ([MIRACL](https://github.com/project-miracl/miracl)) has increased from 31.4% to 44.0%, while the average score on a commonly used benchmark for English tasks ([MTEB](https://github.com/embeddings-benchmark/mteb)) has increased from 61.0% to 62.3%.

Well, with a library like Ragas - we can put that claim to the test!

If what they claim is true - we should see an increase on related metrics by using the new embedding model!

# 🤝 Breakout Room #1

## Task 1: Installing Required Libraries

A reminder that one of the [key features](https://blog.langchain.dev/langchain-v0-1-0/) of LangChain v0.1.0 is the compartmentalization of the various LangChain ecosystem packages!

So let's begin grabbing all of our LangChain related packages!

In [1]:
!pip install -U -q langchain langchain-openai langchain_core langchain-community langchainhub openai


We'll also get the "star of the show" today, which is Ragas!

In [2]:
!pip install -qU ragas


As well, instead of the remote hosted solution that we used last week (Pinecone), we'll be leveraging Meta's [FAISS](https://github.com/facebookresearch/faiss) as the backend for our LangChain `VectorStore`.

We'll also install `pymupdf`, which will allow us to load PDFs using the `PyMuPDFLoader` in the `langchain-community` package, along with `pandas` for working with our results!

In [3]:
!pip install -qU faiss_cpu pymupdf pandas

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.22.0 requires pandas<2.1.4,>=1.5.0, but you have pandas 2.2.1 which is incompatible.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.2.1 which is incompatible.

## Task 2: Set Environment Variables

Let's set up our OpenAI API key so we can leverage their API later on.

In [4]:
import os
import openai
from getpass import getpass

openai.api_key = getpass("Please provide your OpenAI Key: ")
os.environ["OPENAI_API_KEY"] = openai.api_key

Please provide your OpenAI Key: ··········


## Task 3: Creating a Simple RAG Pipeline with LangChain v0.1.0

Building on what we learned last week, we'll be leveraging LangChain v0.1.0 and LCEL to build a simple RAG pipeline that we can baseline with Ragas.

## Building our RAG pipeline

Let's review the basic steps of RAG again:

- Create an Index
- Use retrieval to obtain pieces of context from our Index that are similar to our query
- Use an LLM to generate responses based on the retrieved context
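
To make those three steps concrete before we reach for LangChain, here's a minimal sketch in plain Python. Everything in it is an assumption for illustration: the example chunks are made up, the "similarity" is simple word overlap rather than a real embedding model, and the generation step only builds the prompt an LLM would receive.

```python
# Made-up chunks standing in for a real document.
chunks = [
    "the plaintiff filed a complaint in california",
    "faiss is a library for similarity search",
    "the defendants owe fiduciary duties to the plaintiff",
]

# 1. Create an index: store each chunk next to its (toy) representation.
index = [(chunk, set(chunk.split())) for chunk in chunks]

# 2. Retrieve the chunks that share the most words with the query.
def retrieve(query, k=2):
    query_words = set(query.lower().split())
    ranked = sorted(index, key=lambda item: len(query_words & item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# 3. Generate: build the prompt that would be sent to an LLM.
def build_prompt(query):
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion:\n{query}"

prompt_text = build_prompt("who is the plaintiff")
print(prompt_text)
```

The rest of the notebook swaps each toy piece for a real one: OpenAI embeddings for the word sets, FAISS for the list, and a chat model for the final step.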

Let's get started by creating our index.

> NOTE: We're going to start leaning on the term "index" to refer to our `VectorStore`, `VectorDatabase`, etc. We can think of "index" as the catch-all term, whereas `VectorStore` and the like relate to the specific technologies used to create, store, and interact with the index.

### Creating an Index

You'll notice that the largest changes (outside of some import changes) are that our old favourite chains are back to being bundled in an easily usable abstraction.

We can still create custom chains using LCEL - but we can also be more confident that our pre-packaged chains are created using LCEL under the hood.

#### Loading Data

Let's start by loading some data!

> NOTE: You'll notice that we're using a document loader from the community package of LangChain. This is part of the v0.1.0 changes that make the base (`langchain-core`) package remain lightweight while still providing access to some of the more powerful community integrations.

In [5]:
!git clone https://github.com/AI-Maker-Space/DataRepository

Cloning into 'DataRepository'...
remote: Enumerating objects: 54, done.
remote: Counting objects: 100% (46/46), done.
remote: Compressing objects: 100% (34/34), done.
remote: Total 54 (delta 15), reused 20 (delta 7), pack-reused 8
Receiving objects: 100% (54/54), 51.28 MiB | 26.15 MiB/s, done.
Resolving deltas: 100% (15/15), done.


In [6]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(
    "DataRepository/MuskComplaint.pdf",
)

documents = loader.load()

In [7]:
documents[0].metadata

{'source': 'DataRepository/MuskComplaint.pdf',
 'file_path': 'DataRepository/MuskComplaint.pdf',
 'page': 0,
 'total_pages': 46,
 'format': 'PDF 1.7',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'creator': '',
 'producer': '',
 'creationDate': '',
 'modDate': '',
 'trapped': ''}

#### Transforming Data

Now that we've got our single document - let's split it into smaller pieces so we can more effectively leverage it with our retrieval chain!

We'll start with the classic: `RecursiveCharacterTextSplitter`.

In [8]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 700,
    chunk_overlap = 50
)

documents = text_splitter.split_documents(documents)

Let's confirm we've split our document.

In [9]:
len(documents)

159
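
To build some intuition for `chunk_size` and `chunk_overlap`, here's a simplified fixed-window chunker. It is only a sketch: `RecursiveCharacterTextSplitter` splits on separators (paragraphs, lines, words) rather than at fixed character offsets, and the `chunk_text` helper is hypothetical. It does show how the overlap lets neighbouring chunks share some text.

```python
def chunk_text(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)  # ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last 3 characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk more often than it would with no overlap.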

#### Loading OpenAI Embeddings Model

We'll need a way to convert our text into vectors that we can compare against our query vector.

Let's use OpenAI's `text-embedding-ada-002` for this task!

In [10]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-ada-002"
)
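
"Comparing" vectors usually means computing a similarity or distance score. Here's a small sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the actual metric depends on how the vector store is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {"doc_a": [1.0, 0.0, 0.9], "doc_b": [0.0, 1.0, 0.0]}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
best = max(scores, key=scores.get)
print(best)  # doc_a
```

For vectors normalised to unit length, ranking by cosine similarity and ranking by Euclidean distance give the same order, so either can be used for retrieval in that case.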

#### Creating a FAISS VectorStore

Now that we have documents - we'll need a place to store them alongside their embeddings.

In [11]:
from langchain_community.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)

#### ❓ Question #1:

List out a few of the techniques that FAISS uses that make it performant.

> NOTE: Check the [repository](https://github.com/facebookresearch/faiss) for more information about FAISS!

Ans.
1. FAISS offers numerous index structures that can be used to speed up search, including LSH, IVF, and PQ.
2. It also includes GPU support, which enables further search acceleration.
3. FAISS also offers approximate nearest neighbour search, which trades a small amount of accuracy for much faster search.
4. FAISS searches only from RAM, as disk databases are orders of magnitude slower, even with SSDs.
5. Query batches: FAISS is optimized for working with batches of samples, rather than processing samples one by one. Internally, FAISS parallelizes over the batch elements in a way that is more efficient than if parallelization was performed by the caller.

#### Creating a Retriever

To complete our index, all that's left to do is expose our vectorstore as a retriever - which we can do the same way we would in previous versions of LangChain!

In [12]:
retriever = vector_store.as_retriever()

#### Testing our Retriever

Now that we've gone through the trouble of creating our retriever - let's see it in action!

In [13]:
retrieved_documents = retriever.invoke("Who is the plantiff?")

In [14]:
for doc in retrieved_documents:
  print(doc)

page_content='would be owned by the foundation and used ‘for the good of the world’[.]” Plaintiff \nreplied: “Agree on all.” Ex. 2 at 1.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 27, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='property and derivative works funded by those monies, Plaintiff is presently unable to ascertain his \ninterest in or the use, allocation, or distribution of assets without an accounting. Plaintiff is therefore \nentitled to an accounting.' metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 32, 'total_pages': 46, 'format': 'PDF 1.7', 'title': '', 'author': '', 'subject': '', 'keywords': '', 'creator': '', 'producer': '', 'creationDate': '', 'modDate': '', 'trapped': ''}
page_content='1

### Creating a RAG Chain

Now that we have the "R" in RAG taken care of - let's look at creating the "AG"!

#### Creating a Prompt Template

There are a few different ways we could create our prompt template - we could create a custom template, as seen in the code below, or we could simply pull a prompt from the prompt hub! Let's look at an example of that!

In [15]:
from langchain import hub

retrieval_qa_prompt = hub.pull("langchain-ai/retrieval-qa-chat")

In [16]:
print(retrieval_qa_prompt.messages[0].prompt.template)

Answer any use questions based solely on the context below:

<context>
{context}
</context>


As you can see - the prompt template is simple (and has a small error) - so we'll create our own to be a bit more specific!

In [17]:
from langchain.prompts import ChatPromptTemplate

template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':

Context:
{context}

Question:
{question}
"""

prompt = ChatPromptTemplate.from_template(template)

#### Setting Up our Basic QA Chain

Now we can instantiate our basic RAG chain!

We'll use LCEL directly just to see an example of it - but you could just as easily use an abstraction here to achieve the same goal!

We'll also make sure to pass our context through - which is critical for Ragas.

In [18]:
from operator import itemgetter

from langchain_openai import ChatOpenAI
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into the retriever
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    # "context"  : re-assigned to itself via RunnablePassthrough.assign, so both "context" and
    #              "question" are passed through unchanged to the next step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object, which is
    #              piped into the LLM; the LLM output is stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
)

#### 🏗️ Activity #1:

Describe the pipeline shown above in simple terms. You can include a diagram if desired.

Ans:

1. The pipeline starts by invoking the chain with the question that the user inputs.
2. This question is used as input to the retriever.
3. The retriever fetches the relevant context.
4. `RunnablePassthrough` then passes the context on to the next step. (It acts as a placeholder that moves the context through without manipulating the data or breaking the chain.)
5. The prompt is created from the custom `ChatPromptTemplate` we designed, with the question and the context passed to it.
6. The prompt is passed to the LLM (`gpt-3.5-turbo` with `temperature=0`, which keeps the output as deterministic as possible) to generate a response to the user's question.
7. The chain returns the response together with the retrieved context.
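
The same flow can be sketched as an ordinary Python function, which can make the LCEL version easier to read. This is an illustration only: `run_rag_chain`, `fake_retriever`, and `fake_llm` are hypothetical stand-ins (so it runs without an API key), and the retrieved sentence is made up rather than quoted from the complaint.

```python
def run_rag_chain(inputs, retriever_fn, llm_fn, template):
    """Plain-Python equivalent of the three LCEL steps."""
    question = inputs["question"]
    # Step 1: build {"context": ..., "question": ...} by running the retriever on the question.
    state = {"context": retriever_fn(question), "question": question}
    # Step 2: the assign step re-assigns "context" to itself, so nothing changes.
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the LLM, and return the context next to the response.
    prompt_text = template.format(context=state["context"], question=state["question"])
    return {"response": llm_fn(prompt_text), "context": state["context"]}

# Hypothetical stand-ins for the real retriever and LLM.
def fake_retriever(question):
    return ["Plaintiff Elon Musk brings this complaint."]

def fake_llm(prompt_text):
    return "Elon Musk" if "Elon Musk" in prompt_text else "I don't know"

template = "Context:\n{context}\n\nQuestion:\n{question}\n"
result = run_rag_chain({"question": "Who is the plaintiff?"}, fake_retriever, fake_llm, template)
print(result["response"])  # Elon Musk
```

Returning the context alongside the response is the design choice that matters for evaluation: without it we'd have no record of what the retriever actually handed to the LLM.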


Let's test it out!

In [19]:
question = "Who is the plantiff?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)

Elon Musk


In [20]:
question = "What does this complaint pertain to?"

result = retrieval_augmented_qa_chain.invoke({"question" : question})

print(result["response"].content)
print(result["context"])

The complaint pertains to breach of fiduciary duty, unfair business practices, accounting, and a demand for a jury trial.
[Document(page_content='1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 31 – \nCOMPLAINT \n \nTHIRD CAUSE OF ACTION \nBreach of Fiduciary Duty  \nAgainst All Defendants \n133. \nPlaintiff realleges and incorporates by reference only paragraphs of this Complaint \nnecessary for his claim of Breach of Fiduciary Duty. \n134. \nUnder California law, Defendants owe fiduciary duties to Plaintiff, including a duty \nto use Plaintiff’s contributions for the purposes for which they were made. E.g., Cal. Bus. & Prof. \nCode § 17510.8. Defendants have repeatedly breached their fiduciary duties to Plaintiff, including \nby:', metadata={'source': 'DataRepository/MuskComplaint.pdf', 'file_path': 'DataRepository/MuskComplaint.pdf', 'page': 30, 'total_pages': 46, 'format': 'PDF 1.7', 'title':

We can already see that there are some improvements we could make here.

For now, let's switch gears to Ragas to see how we can leverage that tool to provide us insight into how our pipeline is performing!

# 🤝 Breakout Room #2

## Task 1: Synthetic Dataset Generation for Evaluation using Ragas

Ragas is a powerful library that lets us evaluate our RAG pipeline by collecting input/output/context triplets and obtaining metrics relating to a number of different aspects of our RAG pipeline.

We'll be evaluating on every core metric today, but in order to do that we'll need to create a test set. Luckily for us, Ragas can do that directly!

### Synthetic Test Set Generation

We can leverage Ragas' [`Synthetic Test Data generation`](https://docs.ragas.io/en/stable/concepts/testset_generation.html) functionality to generate our own synthetic QC pairs - as well as a synthetic ground truth - quite easily!

> NOTE: This process will use `gpt-3.5-turbo-16k` as the base generator and `gpt-4` as the critic - if you're attempting to create a lot of samples please be aware of cost, as well as rate limits.

In [21]:
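# NOTE: `documents` was already split into 700-character chunks in an earlier cell,
# so re-splitting with a larger chunk_size leaves those chunks unchanged
# (which is why the count below is still 159). To get genuinely different
# chunks, re-run `loader.load()` and split the fresh pages instead.
eval_documents = documents

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 400
)

eval_documents = text_splitter.split_documents(eval_documents)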

#### ❓ Question #2:

Why is it important to split our documents using different parameters when creating our synthetic data?

Ans.
1. Reduce overfitting and bias:  By generating synthetic data that represents different subsets of the population or different contexts, we can reduce the risk of perpetuating biases present in the original data.
2. Diversity and Generalization: We ensure that the synthetic data captures a diverse range of characteristics present in the original documents. It can help generalize well to unseen data and various scenarios.
3. Data Augmentation: By generating synthetic data that pertains to different aspects or variations of the original data, we can improve the robustness of the RAG.

In [22]:
len(eval_documents)

159

In [23]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()

testset = generator.generate_with_langchain_docs(eval_documents, test_size=10, distributions={simple: 0.25, reasoning: 0.25, multi_context: 0.5})


#### ❓ Question #3:

`{simple: 0.5, reasoning: 0.25, multi_context: 0.25}`

What exactly does this mapping refer to?

Ans: It is the distribution of question types in the generated test set. For a test set of 10 questions, the mapping above asks for roughly:

- Simple questions = 5
- Reasoning questions = 2 or 3
- Multi-context questions = 2 or 3

Note that the generation cell above actually passes `{simple: 0.25, reasoning: 0.25, multi_context: 0.5}`, so our test set leans towards multi-context questions instead.

LLMs can generate questions ranging from simple to complex. To generate medium to hard samples from the provided documents, we can use reasoning, conditioning, and multi-context:

- Reasoning: rewriting the question in a way that enhances the need for reasoning to answer it effectively.
- Conditioning: modifying the question to introduce a conditional element, which adds complexity to the question.
- Multi-context: rephrasing the question in a manner that necessitates information from multiple related sections or chunks to formulate an answer.
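
The arithmetic behind a distribution is easy to check. The sketch below uses plain string keys instead of the Ragas evolution objects, and the `expected_counts` helper is hypothetical; how Ragas rounds fractional counts internally is not something this sketch tries to reproduce.

```python
def expected_counts(distribution, test_size):
    """Expected number of questions per evolution type."""
    total = sum(distribution.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("distribution weights should sum to 1")
    return {name: weight * test_size for name, weight in distribution.items()}

# The mapping from the question, and the mapping used in the generation cell.
from_question = expected_counts({"simple": 0.5, "reasoning": 0.25, "multi_context": 0.25}, 10)
from_cell = expected_counts({"simple": 0.25, "reasoning": 0.25, "multi_context": 0.5}, 10)

print(from_question)  # {'simple': 5.0, 'reasoning': 2.5, 'multi_context': 2.5}
print(from_cell)      # {'simple': 2.5, 'reasoning': 2.5, 'multi_context': 5.0}
```

A count of 2.5 can't be realised exactly, which is why the answer above says "2 or 3".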


> NOTE: Check out the Ragas documentation on this generation process [here](https://docs.ragas.io/en/stable/concepts/testset_generation.html).

Let's look at the output and see what we can learn about it!

In [24]:
testset.test_data[0]

DataRow(question="How did the publication of OpenAI's models contribute to the development of future models?", contexts=['1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 19 – \nCOMPLAINT \n \n82. \nTheir publication did prove to be useful to the developers of future, powerful models. \nEntire communities sprung up to enhance and extend the models released by OpenAI. These \ncommunities spread to open-source, grass-roots efforts and commercial entities alike. \n83. \nIn 2020, OpenAI announced a third version of its model, GPT-3. It used “175 billion \nparameters, 10x more than any previous non-sparse language model.” Again, OpenAI announced \nthe development of this model with the publication of a research paper describing its complete'], ground_truth='Their publication did prove to be useful to the developers of future, powerful models.', evolution_type='simple')

### Generating Responses with RAG Pipeline

Now that we have some QC pairs, and some ground truths, let's evaluate our RAG pipeline using Ragas.

The process is, again, quite straightforward - thanks to Ragas and LangChain!

Let's start by extracting our questions and ground truths from our created test set.

We can start by converting our test dataset into a Pandas DataFrame.

In [25]:
test_df = testset.to_pandas()

In [26]:
test_df

| | question | contexts | ground_truth | evolution_type | episode_done |
|---|---|---|---|---|---|
| 0 | How did the publication of OpenAI's models con... | [1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \... | Their publication did prove to be useful to th... | simple | True |
| 1 | What strategy video game did OpenAI compete in? | [77. \nInitial work at OpenAI followed much in... | OpenAI competed in Dota 2, a strategy video game. | simple | True |
| 2 | What was Mr. Page's response to Mr. Musk's con... | [Page, then-CEO of Google’s parent company Alp... | Mr. Page responded that would merely “be the n... | reasoning | True |
| 3 | How did OpenAI use reinforcement learning in t... | [77. \nInitial work at OpenAI followed much in... | OpenAI used reinforcement learning to compete ... | reasoning | True |
| 4 | "What strategy video game did OpenAI excel in,... | [77. \nInitial work at OpenAI followed much in... | OpenAI excelled in the strategy video game Dot... | multi_context | True |
| 5 | "What game did OpenAI use reinforcement learni... | [77. \nInitial work at OpenAI followed much in... | OpenAI used reinforcement learning in the game... | multi_context | True |
| 6 | How does the use of IP assets affect the propo... | [business model were valid, it would radically... | The use of IP assets in the proposed business ... | multi_context | True |
| 7 | What breach of the Founding Agreement has occu... | [1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \... | Licensing GPT-4 exclusively to Microsoft | multi_context | True |
| 8 | What makes AGI in the wrong hands dangerous an... | [18. \nMr. Musk has long recognized that AGI p... | The advancement of AI in the wrong hands is da... | multi_context | True |
| 9 | How did OpenAI use a deep neural network in th... | [those connections to the target language. \n7... | By using the first half of Google's Transforme... | simple | True |


In [27]:
test_questions = test_df["question"].values.tolist()
test_groundtruths = test_df["ground_truth"].values.tolist()

Now we'll generate responses using our RAG pipeline using the questions we've generated - we'll also need to collect our retrieved contexts for each question.

We'll do this in a simple loop to see exactly what's happening!

In [28]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"].content)
  contexts.append([context.page_content for context in response["context"]])

Now we can wrap our information in a Hugging Face dataset for use in the Ragas library.

In [29]:
from datasets import Dataset

response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's take a peek and see what that looks like!

In [30]:
response_dataset[0]

{'question': "How did the publication of OpenAI's models contribute to the development of future models?",
 'answer': "The publication of OpenAI's models contributed to the development of future models by inspiring entire communities to enhance and extend the models released by OpenAI.",
 'contexts': ['challenging.” At the time, OpenAI stated that it was releasing the full, open version with the hope \nthat it “will be useful to developers of future powerful models.” This release was accompanied by a \ndetailed paper co-authored by OpenAI scientists as well as independent social and technical \nscientists. This paper explained just some of the many benefits that came from releasing models \npublically as opposed to keeping them closed.',
  '1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \n12 \n13 \n14 \n15 \n16 \n17 \n18 \n19 \n20 \n21 \n22 \n23 \n24 \n25 \n26 \n27 \n28 \n \n \n– 19 – \nCOMPLAINT \n \n82. \nTheir publication did prove to be useful to the developers of future, powerful mod

## Task 2: Evaluating our Pipeline with Ragas

Now that we have our response dataset - we can finally get into the "meat" of Ragas - evaluation!

First, we'll import the desired metrics, then we can use them to evaluate our created dataset!

Check out the specific metrics we'll be using in the Ragas documentation:

- [Faithfulness](https://docs.ragas.io/en/stable/concepts/metrics/faithfulness.html)
- [Answer Relevancy](https://docs.ragas.io/en/stable/concepts/metrics/answer_relevance.html)
- [Context Precision](https://docs.ragas.io/en/stable/concepts/metrics/context_precision.html)
- [Context Recall](https://docs.ragas.io/en/stable/concepts/metrics/context_recall.html)
- [Answer Correctness](https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html)

See the accompanying presentation for more in-depth explanations of each of the metrics!
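
As a rough mental model, two of these metrics boil down to simple ratios once the judgments have been made. The sketch below assumes the verdicts already exist as booleans. In Ragas those verdicts come from an LLM judge, so this illustrates only the final arithmetic, not the library's implementation, and the verdict lists are made up.

```python
def ratio(verdicts):
    """Fraction of True verdicts; NaN when there is nothing to judge."""
    return sum(verdicts) / len(verdicts) if verdicts else float("nan")

def faithfulness_score(claim_supported):
    """Share of claims in the answer that the retrieved context supports."""
    return ratio(claim_supported)

def context_recall_score(truth_sentence_attributed):
    """Share of ground-truth sentences that can be attributed to the retrieved context."""
    return ratio(truth_sentence_attributed)

# Made-up verdicts for a single question.
answer_claims = [True, False]                 # 1 of 2 claims supported
truth_sentences = [True, True, True, False]   # 3 of 4 sentences attributable

f_score = faithfulness_score(answer_claims)
r_score = context_recall_score(truth_sentences)
print(f_score, r_score)  # 0.5 0.75
```

Faithfulness looks at the generated answer, while context recall looks at the ground truth, so the first mostly reflects the generator and the second mostly reflects the retriever.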

In [31]:
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    answer_correctness,
    context_recall,
    context_precision,
)

metrics = [
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
    answer_correctness,
]

All that's left to do is call "evaluate" and away we go!

In [32]:
results = evaluate(response_dataset, metrics)


In [33]:
results

{'faithfulness': 0.8333, 'answer_relevancy': 0.9389, 'context_recall': 0.9500, 'context_precision': 0.8944, 'answer_correctness': 0.7384}

In [34]:
results_df = results.to_pandas()
results_df

| | question | answer | contexts | ground_truth | faithfulness | answer_relevancy | context_recall | context_precision | answer_correctness |
|---|---|---|---|---|---|---|---|---|---|
| 0 | How did the publication of OpenAI's models con... | The publication of OpenAI's models contributed... | [challenging.” At the time, OpenAI stated that... | Their publication did prove to be useful to th... | 1.0 | 1.0 | 1.0 | 1.0 | 0.591555 |
| 1 | What strategy video game did OpenAI compete in? | Dota 2 | [77. \nInitial work at OpenAI followed much in... | OpenAI competed in Dota 2, a strategy video game. | 1.0 | 0.95385 | 1.0 | 1.0 | 0.716374 |
| 2 | What was Mr. Page's response to Mr. Musk's con... | Mr. Page responded that the potential replacem... | [Page, then-CEO of Google’s parent company Alp... | Mr. Page responded that would merely “be the n... | 0.5 | 0.959527 | 1.0 | 0.833333 | 0.837155 |
| 3 | How did OpenAI use reinforcement learning in t... | OpenAI used reinforcement learning to compete ... | [77. \nInitial work at OpenAI followed much in... | OpenAI used reinforcement learning to compete ... | 1.0 | 0.903244 | 1.0 | 1.0 | 1.0 |
| 4 | "What strategy video game did OpenAI excel in,... | Dota 2 | [a superhuman level of play in the games of ch... | OpenAI excelled in the strategy video game Dot... | 0.0 | 0.888023 | 1.0 | 0.5 | 0.968325 |
| 5 | "What game did OpenAI use reinforcement learni... | OpenAI used reinforcement learning to compete ... | [77. \nInitial work at OpenAI followed much in... | OpenAI used reinforcement learning in the game... | 1.0 | 0.93077 | 0.5 | 1.0 | 0.539319 |
| 6 | How does the use of IP assets affect the propo... | The use of IP assets in the proposed business ... | [business model were valid, it would radically... | The use of IP assets in the proposed business ... | NaN | 0.935539 | 1.0 | 1.0 | 0.743719 |
| 7 | What breach of the Founding Agreement has occu... | The breach of the Founding Agreement that has ... | [1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \... | Licensing GPT-4 exclusively to Microsoft | 1.0 | 0.889089 | 1.0 | 0.805556 | 0.719172 |
| 8 | What makes AGI in the wrong hands dangerous an... | AGI in the wrong hands is dangerous and an exi... | [18. \nMr. Musk has long recognized that AGI p... | The advancement of AI in the wrong hands is da... | 1.0 | 0.987338 | 1.0 | 0.805556 | 0.529234 |
| 9 | How did OpenAI use a deep neural network in th... | OpenAI used the first half of Google's Transfo... | [those connections to the target language. \n7... | By using the first half of Google's Transforme... | 1.0 | 0.941486 | 1.0 | 1.0 | 0.73871 |
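
It's worth checking how the per-question scores relate to the aggregate scores reported earlier. The sketch below copies the `faithfulness` and `context_recall` columns from the table above and averages them, skipping the missing value in row 6. That the aggregate is a mean that ignores missing values is an inference from the numbers matching, not something taken from the Ragas source.

```python
import math

def mean_ignoring_nan(values):
    """Average the values, skipping any NaN entries."""
    kept = [v for v in values if not math.isnan(v)]
    return sum(kept) / len(kept) if kept else float("nan")

# Columns copied from the per-question results table; row 6 has no faithfulness score.
faithfulness_rows = [1.0, 1.0, 0.5, 1.0, 0.0, 1.0, float("nan"), 1.0, 1.0, 1.0]
context_recall_rows = [1.0, 1.0, 1.0, 1.0, 1.0, 0.5, 1.0, 1.0, 1.0, 1.0]

print(round(mean_ignoring_nan(faithfulness_rows), 4))    # 0.8333
print(round(mean_ignoring_nan(context_recall_rows), 4))  # 0.95
```

Both values agree with the aggregate scores, which also tells us that the faithfulness average is based on 9 questions rather than 10. With a test set this small, a single 0.0 or missing score moves the aggregate a lot.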


## Task 3: Making Adjustments to our RAG Pipeline

Now that we have established a baseline - we can see how any changes impact our pipeline's performance!

Let's modify our retriever and see how that impacts our Ragas metrics!

In [35]:
from langchain.retrievers import MultiQueryRetriever

advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)
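
Conceptually, a multi-query retriever asks an LLM for several rephrasings of the question, retrieves documents for each rephrasing, and returns the unique union of the results. Here's a minimal sketch of that idea in plain Python. The `multi_query_retrieve` function, the hard-coded query variants, and the `fake_index` lookup are all hypothetical stand-ins, not LangChain's implementation.

```python
def multi_query_retrieve(question, generate_variants, retrieve):
    """Retrieve for each query variant and return the de-duplicated union, in first-seen order."""
    seen = set()
    merged = []
    for query in generate_variants(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical stand-in for the LLM that rewrites the question.
def fake_variants(question):
    return ["Which party filed the complaint?", "Who brought this lawsuit?"]

# Hypothetical stand-in for the base retriever.
fake_index = {
    "Which party filed the complaint?": ["chunk B", "chunk C"],
    "Who brought this lawsuit?": ["chunk A", "chunk D"],
}

docs = multi_query_retrieve("Who is the plaintiff?", fake_variants, lambda q: fake_index[q])
print(docs)  # ['chunk B', 'chunk C', 'chunk A', 'chunk D']
```

The trade-off is more LLM and retriever calls per question in exchange for a wider net of context, which is something to keep in mind when reading the context precision and context recall numbers.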

We'll also re-create our RAG pipeline using the abstractions that come packaged with LangChain v0.1.0!

First, let's create a chain to "stuff" our documents into our context!

In [36]:
from langchain.chains.combine_documents import create_stuff_documents_chain

document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

Next, we'll create the retrieval chain!

In [37]:
from langchain.chains import create_retrieval_chain

retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [38]:
response = retrieval_chain.invoke({"input": "Who is the plantiff?"})

In [39]:
print(response["answer"])

The plaintiff is Elon Musk.


In [40]:
response = retrieval_chain.invoke({"input": "What does this complaint pertain to?"})

In [41]:
print(response["answer"])

The complaint pertains to a legal case involving Plaintiff Elon Musk alleging breach of fiduciary duty, unfair business practices, and seeking restitution, disgorgement of funds, prejudgment interest, an injunction against future activities, specific performance, and an accounting. The complaint also includes a demand for a jury trial.


Well, just from those responses this chain *feels* better - but let's see how it performs on our eval!

Let's do the same process we did before to collect our pipeline's contexts and answers.

In [42]:
answers = []
contexts = []

for question in test_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

Now we can convert this into a dataset, just like we did before.

In [43]:
response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

Let's evaluate on the same metrics we did for the first pipeline and see how it does!

In [44]:
advanced_retrieval_results = evaluate(response_dataset_advanced_retrieval, metrics)


In [45]:
advanced_retrieval_results_df = advanced_retrieval_results.to_pandas()
advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,How did the publication of OpenAI's models con...,The publication of OpenAI's models contributed...,"[challenging.” At the time, OpenAI stated that...",Their publication did prove to be useful to th...,1.0,0.935184,1.0,0.755556,0.508656
1,What strategy video game did OpenAI compete in?,"OpenAI competed in Dota 2, a strategy video ga...",[77. \nInitial work at OpenAI followed much in...,"OpenAI competed in Dota 2, a strategy video game.",1.0,1.0,1.0,1.0,0.741034
2,What was Mr. Page's response to Mr. Musk's con...,Mr. Page responded to Mr. Musk's concerns abou...,"[Page, then-CEO of Google’s parent company Alp...",Mr. Page responded that would merely “be the n...,1.0,0.97789,1.0,0.833333,0.832839
3,How did OpenAI use reinforcement learning in t...,OpenAI used reinforcement learning to compete ...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning to compete ...,1.0,0.936765,1.0,1.0,0.538131
4,"""What strategy video game did OpenAI excel in,...","OpenAI excelled in Dota 2, a strategy video ga...",[77. \nInitial work at OpenAI followed much in...,OpenAI excelled in the strategy video game Dot...,1.0,0.909699,1.0,1.0,0.74022
5,"""What game did OpenAI use reinforcement learni...",OpenAI used reinforcement learning in the stra...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning in the game...,1.0,0.92915,0.666667,1.0,0.535368
6,How does the use of IP assets affect the propo...,The use of IP assets in the proposed business ...,"[business model were valid, it would radically...",The use of IP assets in the proposed business ...,0.0,0.946253,1.0,0.7,0.735254
7,What breach of the Founding Agreement has occu...,The breach of the Founding Agreement that occu...,[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...,Licensing GPT-4 exclusively to Microsoft,1.0,0.994584,1.0,0.8875,0.207065
8,What makes AGI in the wrong hands dangerous an...,AGI in the wrong hands is considered dangerous...,[18. \nMr. Musk has long recognized that AGI p...,The advancement of AI in the wrong hands is da...,1.0,0.957879,1.0,0.7,0.872316
9,How did OpenAI use a deep neural network in th...,OpenAI used a deep neural network by pre-train...,[those connections to the target language. \n7...,By using the first half of Google's Transforme...,1.0,0.896986,1.0,1.0,0.834551
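
The aggregate scores reported for a run appear to be plain column means of this per-question table. That is an inference from the numbers in this notebook rather than a statement about Ragas internals; the sketch below checks the arithmetic for three of the columns and finds the weakest question.

```python
from statistics import mean

# Per-question scores copied from rows 0-9 of the table above.
per_question = {
    "faithfulness": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0],
    "context_recall": [1.0, 1.0, 1.0, 1.0, 1.0, 0.666667, 1.0, 1.0, 1.0, 1.0],
    "answer_correctness": [
        0.508656, 0.741034, 0.832839, 0.538131, 0.74022,
        0.535368, 0.735254, 0.207065, 0.872316, 0.834551,
    ],
}

# Column means, rounded to the four decimals used in the summary dict.
aggregates = {name: round(mean(scores), 4) for name, scores in per_question.items()}
print(aggregates)

# Index of the question with the lowest answer_correctness.
scores = per_question["answer_correctness"]
worst_row = min(range(len(scores)), key=scores.__getitem__)
print("lowest answer_correctness at row", worst_row)
```

A single low row (here the Founding Agreement question) moves a ten-question mean noticeably, which is worth keeping in mind when reading the aggregate comparison.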


## Task 4: Evaluating our Adjusted Pipeline Against Our Baseline

Now we can compare our results and see what directional changes occurred!

Let's refresh with our initial metrics.

In [46]:
results

{'faithfulness': 0.8333, 'answer_relevancy': 0.9389, 'context_recall': 0.9500, 'context_precision': 0.8944, 'answer_correctness': 0.7384}

And see how our advanced retrieval modified our chain!

In [47]:
advanced_retrieval_results

{'faithfulness': 0.9000, 'answer_relevancy': 0.9484, 'context_recall': 0.9667, 'context_precision': 0.8876, 'answer_correctness': 0.6545}

In [48]:
import pandas as pd

df_original = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_comparison = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'MultiQueryRetriever with Document Stuffing'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')

df_merged['Delta'] = df_merged['MultiQueryRetriever with Document Stuffing'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | MultiQueryRetriever with Document Stuffing | Delta |
|---|---|---|---|---|
| 0 | faithfulness | 0.833333 | 0.9 | 0.066667 |
| 1 | answer_relevancy | 0.938887 | 0.948439 | 0.009552 |
| 2 | context_recall | 0.95 | 0.966667 | 0.016667 |
| 3 | context_precision | 0.894444 | 0.887639 | -0.006806 |
| 4 | answer_correctness | 0.738356 | 0.654543 | -0.083813 |
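
Since the same compare-and-delta pattern is needed again for the embedding comparison, it may help to see it as a reusable function. This is a minimal sketch in plain Python; `compare_runs` and the run label `"MultiQuery"` are hypothetical names, and the scores are the four-decimal values printed above, so the last digit of a delta can differ slightly from the table.

```python
from typing import Dict, List


def compare_runs(runs: Dict[str, Dict[str, float]], baseline: str) -> List[Dict[str, object]]:
    """Build one row per metric with each run's score and its delta vs. the baseline."""
    base_scores = runs[baseline]
    rows: List[Dict[str, object]] = []
    for metric, base_score in base_scores.items():
        row: Dict[str, object] = {"Metric": metric, baseline: base_score}
        for name, scores in runs.items():
            if name == baseline:
                continue
            row[name] = scores[metric]
            row[f"Delta ({name} - {baseline})"] = round(scores[metric] - base_score, 4)
        rows.append(row)
    return rows


runs = {
    "Baseline": {
        "faithfulness": 0.8333, "answer_relevancy": 0.9389, "context_recall": 0.9500,
        "context_precision": 0.8944, "answer_correctness": 0.7384,
    },
    "MultiQuery": {
        "faithfulness": 0.9000, "answer_relevancy": 0.9484, "context_recall": 0.9667,
        "context_precision": 0.8876, "answer_correctness": 0.6545,
    },
}

rows = compare_runs(runs, baseline="Baseline")
for row in rows:
    print(row)
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)` for display, and adding a third run is just another entry in `runs`.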


## Task 5: Testing OpenAI's Claim

Now that we've seen how our retriever can impact the performance of our RAG pipeline - let's see how changing our embedding model impacts performance.

#### 🏗️ Activity #2:

Please provide markdown, or code comments, to explain what each of the following steps is doing!

In [49]:
# Use OpenAIEmbeddings to access OpenAI's text-embedding-3-small model for generating embeddings
new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

In [50]:
# Initialize a FAISS vector store with a list of documents and their corresponding embeddings.
vector_store = FAISS.from_documents(documents, new_embeddings)

In [51]:
# Create a retriever from the FAISS vector store for semantic document retrieval.
new_retriever = vector_store.as_retriever()

In [52]:
# Enhance the retriever with a MultiQueryRetriever, combining it with the initialized language model.
# This creates an advanced retriever that leverages the language model for improved query understanding and document retrieval.
new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)
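
Conceptually, a multi-query retriever asks the LLM for several rephrasings of the question, runs each one through the base retriever, and returns the de-duplicated union of the results. The toy sketch below illustrates only that idea. It is not LangChain's implementation; `multi_query_retrieve`, `fake_variants`, and `keyword_retrieve` are hypothetical stand-ins for the LLM and the vector store.

```python
from typing import Callable, List

# Tiny illustrative corpus; real chunks would come from the vector store.
corpus = [
    "plaintiff filed a complaint",
    "complaint alleges breach of contract",
    "dota 2 is a strategy video game",
]


def keyword_retrieve(query: str) -> List[str]:
    """Stand-in retriever: return chunks sharing at least one word with the query."""
    words = set(query.lower().split())
    return [chunk for chunk in corpus if words & set(chunk.split())]


def fake_variants(question: str) -> List[str]:
    """Stand-in for the LLM step that rephrases the question."""
    return [question, "complaint"]


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Retrieve for every query variant and keep the first occurrence of each chunk."""
    seen = set()
    unique_chunks: List[str] = []
    for query in generate_variants(question):
        for chunk in retrieve(query):
            if chunk not in seen:
                seen.add(chunk)
                unique_chunks.append(chunk)
    return unique_chunks


retrieved = multi_query_retrieve("plaintiff", fake_variants, keyword_retrieve)
print(retrieved)
```

The second variant surfaces a chunk the original query alone would have missed, which is the mechanism by which this kind of retriever can raise context recall.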


In [53]:
new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)

In [54]:
answers = []
contexts = []

for question in test_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])

In [55]:
new_response_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

In [56]:
new_advanced_retrieval_results = evaluate(new_response_dataset_advanced_retrieval, metrics)

Evaluating:   0%|          | 0/50 [00:00<?, ?it/s]

In [57]:
new_advanced_retrieval_results

{'faithfulness': 1.0000, 'answer_relevancy': 0.9157, 'context_recall': 0.9750, 'context_precision': 0.9533, 'answer_correctness': 0.7023}

In [58]:
df_baseline = pd.DataFrame(list(results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['Delta - TE3 -> ADA'] = df_merged['Text Embedding 3'] - df_merged['ADA']
df_merged['Delta - TE3 -> Baseline'] = df_merged['Text Embedding 3'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | ADA | Text Embedding 3 | Delta - TE3 -> ADA | Delta - TE3 -> Baseline |
|---|---|---|---|---|---|---|
| 0 | faithfulness | 0.833333 | 0.9 | 1.0 | 0.1 | 0.166667 |
| 1 | answer_relevancy | 0.938887 | 0.948439 | 0.915722 | -0.032717 | -0.023165 |
| 2 | context_recall | 0.95 | 0.966667 | 0.975 | 0.008333 | 0.025 |
| 3 | context_precision | 0.894444 | 0.887639 | 0.953333 | 0.065694 | 0.058889 |
| 4 | answer_correctness | 0.738356 | 0.654543 | 0.702321 | 0.047777 | -0.036036 |


#### ❓ Question #4:

In your opinion, is `text-embedding-3-small` significantly better than `ada`?

Ans. No, not quite. Although faithfulness, context_recall, and context_precision seem to have improved, the results are erratic across different test datasets. If cost is the deciding factor, which in most cases it is, then `text-embedding-3-small` should be used; otherwise either model can be used.

## BONUS ACTIVITY: Showcase Multi-Context Performance Changes

Now that we've looked at a number of different examples - showcase the difference on the multi-context *specific* questions that were synthetically generated.

> NOTE: You have all the data you'll need already in the notebook if you made it to this step!

In [None]:
#Creating a dataframe that contains only the multi_context specific questions

In [59]:
multi_context_df = test_df.iloc[4:9,[0,2,3]]
multi_context_df

Unnamed: 0,question,ground_truth,evolution_type
4,"""What strategy video game did OpenAI excel in,...",OpenAI excelled in the strategy video game Dot...,multi_context
5,"""What game did OpenAI use reinforcement learni...",OpenAI used reinforcement learning in the game...,multi_context
6,How does the use of IP assets affect the propo...,The use of IP assets in the proposed business ...,multi_context
7,What breach of the Founding Agreement has occu...,Licensing GPT-4 exclusively to Microsoft,multi_context
8,What makes AGI in the wrong hands dangerous an...,The advancement of AI in the wrong hands is da...,multi_context


In [60]:
multi_context_questions = multi_context_df['question']
multi_context_questions

4    "What strategy video game did OpenAI excel in,...
5    "What game did OpenAI use reinforcement learni...
6    How does the use of IP assets affect the propo...
7    What breach of the Founding Agreement has occu...
8    What makes AGI in the wrong hands dangerous an...
Name: question, dtype: object

In [61]:
multi_context_groundtruths = multi_context_df['ground_truth']
multi_context_groundtruths

4    OpenAI excelled in the strategy video game Dot...
5    OpenAI used reinforcement learning in the game...
6    The use of IP assets in the proposed business ...
7             Licensing GPT-4 exclusively to Microsoft
8    The advancement of AI in the wrong hands is da...
Name: ground_truth, dtype: object

In [None]:
# This code block was already run above, so it is not re-run here; the existing objects are reused
## This is for text-embedding-ada-002

# Loading original document
#loader = PyMuPDFLoader(
#    "DataRepository/MuskComplaint.pdf",
#)
#documents = loader.load()

# splitting it using RecursiveCharacterTextSplitter
#text_splitter = RecursiveCharacterTextSplitter(
#    chunk_size = 700,
#    chunk_overlap = 50
#)
#documents = text_splitter.split_documents(documents)

# We are using a chat model from OpenAI. This is "gpt-3.5-turbo" model and we are making it deterministic by setting the temperature parameter to 0
#primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Use OpenAIEmbeddings to access OpenAI's text-embedding-ada-002 model for generating embeddings
#new_embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Initialize a FAISS vector store with a list of documents and their corresponding embeddings.
#vector_store = FAISS.from_documents(documents, new_embeddings)

#Create a retriever from the FAISS vector store for semantic document retrieval.
#retriever = vector_store.as_retriever()

# We created our own template
#template = """Answer the question based only on the following context. If you cannot answer the question with the context, please respond with 'I don't know':
#Context:
#{context}
#Question:
#{question}
#"""
#prompt = ChatPromptTemplate.from_template(template)

# Created the retrieval chain using LCEL
#retrieval_augmented_qa_chain = (
#    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
#    # "question" : populated by getting the value of the "question" key
#    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
#    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
#    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
#    #              by getting the value of the "context" key from the previous step
#    | RunnablePassthrough.assign(context=itemgetter("context"))
#    # "response" : the "context" and "question" values are used to format our prompt object and then piped
#    #              into the LLM and stored in a key called "response"
#    # "context"  : populated by getting the value of the "context" key from the previous step
#    | {"response": prompt | primary_qa_llm, "context": itemgetter("context")}
#)

In [67]:
# generating answers from questions using our new retrieval chain
#appending the answers and the context which were used to generate those answers in separate lists
answers = []
contexts = []

for question in multi_context_questions:
  response = retrieval_augmented_qa_chain.invoke({"question" : question})
  answers.append(response["response"])
  contexts.append([context.page_content for context in response["context"]])



In [75]:
multi_context_questions

4    "What strategy video game did OpenAI excel in,...
5    "What game did OpenAI use reinforcement learni...
6    How does the use of IP assets affect the propo...
7    What breach of the Founding Agreement has occu...
8    What makes AGI in the wrong hands dangerous an...
Name: question, dtype: object

In [68]:
answers

[AIMessage(content='Dota 2'),
 AIMessage(content='OpenAI used reinforcement learning to compete in Dota 2, a strategy video game, and they quickly achieved a superhuman level of play.'),
 AIMessage(content='The use of IP assets in the proposed business model allows investors to enrich themselves and their profit-maximizing corporate partners once the technology has been developed and proven. This enables investors to receive the same "for profit" upside as those who invest in conventional for-profit corporations, while also benefiting from reduced income taxes through pre-tax donations to fund research and development.'),
 AIMessage(content="The breach of the Founding Agreement that has occurred is licensing OpenAI's latest technology exclusively to Microsoft."),
 AIMessage(content='AGI in the wrong hands is dangerous and an existential threat because it can become more economically useful than humans, potentially leading to a future where "the future doesn\'t need us."')]

In [69]:
contexts

[['a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and \nconvincingly defeated a world-champion program in each case.” \n22. \nWith the DeepMind team, Google immediately catapulted to the front of the race for \nAGI. Mr. Musk was deeply troubled by this development. He believed (and still does) that in the \nhands of a closed, for-profit company like Google, AGI poses a particularly acute and noxious \ndanger to humanity. In 2014, it was already difficult enough to compete with Google in its core \nbusinesses. Google had collected a uniquely large set of data from our searches, our emails, and',
  '77. \nInitial work at OpenAI followed much in the footsteps of DeepMind. OpenAI used \nreinforcement learning to play a game. Instead of playing chess, however, OpenAI competed in \nDota 2, a strategy video game with far more moving pieces than chess. OpenAI’s team quickly',
  'learns to play chess by playing itself with different versions of the so

In [76]:
# Creating a dataset dictionary of our questions, answers, contexts and groundtruths
# This dataset is multi_context_response_dataset_with_ada
multi_context_dataset_retrieval = Dataset.from_dict({
    "question" : multi_context_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : multi_context_groundtruths
})



In [77]:
# Evaluating our new responses using Ragas evaluate against metrics like faithfulness, answer_relevancy, context_recall, context_precision, answer_correctness
multi_context_dataset_results = evaluate(multi_context_dataset_retrieval, metrics)

# checking metrics
multi_context_dataset_results

Evaluating:   0%|          | 0/25 [00:00<?, ?it/s]

{'faithfulness': 1.0000, 'answer_relevancy': 0.9477, 'context_recall': 0.9000, 'context_precision': 0.9442, 'answer_correctness': 0.7226}

In [78]:
multi_context_dataset_results_df = multi_context_dataset_results.to_pandas()
multi_context_dataset_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,"""What strategy video game did OpenAI excel in,...","OpenAI excelled in Dota 2, a strategy video ga...",[77. \nInitial work at OpenAI followed much in...,OpenAI excelled in the strategy video game Dot...,1.0,0.917673,1.0,1.0,0.74022
1,"""What game did OpenAI use reinforcement learni...",OpenAI used reinforcement learning in the game...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning in the game...,1.0,0.93784,0.5,1.0,0.53476
2,How does the use of IP assets affect the propo...,The use of IP assets in the proposed business ...,"[business model were valid, it would radically...",The use of IP assets in the proposed business ...,1.0,0.93064,1.0,1.0,0.862018
3,What breach of the Founding Agreement has occu...,The breach of the Founding Agreement that occu...,[an algorithm that is outside the scope of Mic...,Licensing GPT-4 exclusively to Microsoft,1.0,0.994584,1.0,0.8875,0.580688
4,What makes AGI in the wrong hands dangerous an...,AGI in the wrong hands is considered dangerous...,[18. \nMr. Musk has long recognized that AGI p...,The advancement of AI in the wrong hands is da...,1.0,0.957879,1.0,0.833333,0.895238


In [None]:
# This code block was already run above, so it is not re-run here; the existing objects are reused
## This is for text-embedding-ada-002 with MultiQueryRetriever

# Loading original document
#loader = PyMuPDFLoader(
#    "DataRepository/MuskComplaint.pdf",
#)
#documents = loader.load()

# splitting it using RecursiveCharacterTextSplitter
#text_splitter = RecursiveCharacterTextSplitter(
#    chunk_size = 700,
#    chunk_overlap = 50
#)
#documents = text_splitter.split_documents(documents)

# We are using a chat model from OpenAI. This is "gpt-3.5-turbo" model and we are making it deterministic by setting the temperature parameter to 0
#primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Use OpenAIEmbeddings to access OpenAI's text-embedding-ada-002 model for generating embeddings
#new_embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Initialize a FAISS vector store with a list of documents and their corresponding embeddings.
#vector_store = FAISS.from_documents(documents, new_embeddings)

#Create a retriever from the FAISS vector store for semantic document retrieval.
#retriever = vector_store.as_retriever()

# Enhance the retriever with a MultiQueryRetriever, combining it with the initialized language model.
# This creates an advanced retriever that leverages the language model for improved query understanding and document retrieval.
#advanced_retriever = MultiQueryRetriever.from_llm(retriever=retriever, llm=primary_qa_llm)

# This chain takes a list of documents and formats them all into a prompt, then passes that prompt to an LLM.
# It passes ALL documents, so you should make sure it fits within the context window the LLM you are using.
#from langchain.chains.combine_documents import create_stuff_documents_chain
#document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

#creating a chain using langchain's inbuilt methods
#retrieval_chain = create_retrieval_chain(advanced_retriever, document_chain)

In [70]:
# generating answers from questions using our new retrieval chain
#appending the answers and the context which were used to generate those answers in separate lists
answers = []
contexts = []

for question in multi_context_questions:
  response = retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])


# Creating a dataset dictionary of our questions, answers, contexts and groundtruths
# This dataset holds the multi-context responses for text-embedding-ada-002 with MultiQueryRetriever
multi_context_dataset_advanced_retrieval = Dataset.from_dict({
    "question" : multi_context_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : multi_context_groundtruths
})

# Evaluating our new responses using Ragas evaluate against metrics like faithfulness, answer_relevancy, context_recall, context_precision, answer_correctness
multi_context_dataset_advanced_retrieval_results = evaluate(multi_context_dataset_advanced_retrieval, metrics)

# checking metrics
multi_context_dataset_advanced_retrieval_results

Evaluating:   0%|          | 0/25 [00:00<?, ?it/s]

{'faithfulness': 0.9600, 'answer_relevancy': 0.9476, 'context_recall': 0.9500, 'context_precision': 0.9233, 'answer_correctness': 0.7407}

In [73]:
multi_context_dataset_advanced_retrieval_results_df = multi_context_dataset_advanced_retrieval_results.to_pandas()
multi_context_dataset_advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,"""What strategy video game did OpenAI excel in,...","OpenAI excelled in Dota 2, a strategy video ga...",[77. \nInitial work at OpenAI followed much in...,OpenAI excelled in the strategy video game Dot...,1.0,0.909699,1.0,1.0,0.74022
1,"""What game did OpenAI use reinforcement learni...",OpenAI used reinforcement learning in the game...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning in the game...,1.0,0.93784,0.75,1.0,0.535255
2,How does the use of IP assets affect the propo...,The use of IP assets in the proposed business ...,"[business model were valid, it would radically...",The use of IP assets in the proposed business ...,0.8,0.929807,1.0,1.0,0.835702
3,What breach of the Founding Agreement has occu...,The breach of the Founding Agreement that occu...,[1 \n2 \n3 \n4 \n5 \n6 \n7 \n8 \n9 \n10 \n11 \...,Licensing GPT-4 exclusively to Microsoft,1.0,0.994584,1.0,0.916667,0.707065
4,What makes AGI in the wrong hands dangerous an...,AGI in the wrong hands is considered dangerous...,[18. \nMr. Musk has long recognized that AGI p...,The advancement of AI in the wrong hands is da...,1.0,0.96624,1.0,0.7,0.885045


In [132]:
# This code block was already run above, so it is not re-run here; the existing objects are reused
## This is for text-embedding-3-small with MultiQueryRetriever

# Loading original document
#loader = PyMuPDFLoader(
#    "DataRepository/MuskComplaint.pdf",
#)
#documents = loader.load()

# splitting it using RecursiveCharacterTextSplitter
#text_splitter = RecursiveCharacterTextSplitter(
#    chunk_size = 700,
#    chunk_overlap = 50
#)
#documents = text_splitter.split_documents(documents)

# We are using a chat model from OpenAI. This is "gpt-3.5-turbo" model and we are making it deterministic by setting the temperature parameter to 0
#primary_qa_llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

# Using OpenAIEmbeddings function to access the open AI embeddings model text-embedding-3-small to generate embeddings
#new_embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Initialize a FAISS vector store with a list of documents and their corresponding embeddings.
#vector_store = FAISS.from_documents(documents, new_embeddings)

#Create a retriever from the FAISS vector store for semantic document retrieval.
#new_retriever = vector_store.as_retriever()

# Enhance the retriever with a MultiQueryRetriever, combining it with the initialized language model.
# This creates an advanced retriever that leverages the language model for improved query understanding and document retrieval.
#new_advanced_retriever = MultiQueryRetriever.from_llm(retriever=new_retriever, llm=primary_qa_llm)

# This chain takes a list of documents and formats them all into a prompt, then passes that prompt to an LLM.
# It passes ALL documents, so you should make sure it fits within the context window the LLM you are using.
#from langchain.chains.combine_documents import create_stuff_documents_chain
#document_chain = create_stuff_documents_chain(primary_qa_llm, retrieval_qa_prompt)

#creating a chain using langchain's inbuilt methods
#new_retrieval_chain = create_retrieval_chain(new_advanced_retriever, document_chain)


In [72]:
# generating answers from questions using our new retrieval chain
#appending the answers and the context which were used to generate those answers in separate lists
answers = []
contexts = []

for question in multi_context_questions:
  response = new_retrieval_chain.invoke({"input" : question})
  answers.append(response["answer"])
  contexts.append([context.page_content for context in response["context"]])


# Creating a dataset dictionary of our questions, answers, contexts and groundtruths
multi_context_dataset_new_advanced_retrieval = Dataset.from_dict({
    "question" : multi_context_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : multi_context_groundtruths
})

# Evaluating our new responses using Ragas evaluate against metrics like faithfulness, answer_relevancy, context_recall, context_precision, answer_correctness
multi_context_dataset_new_advanced_retrieval_results = evaluate(multi_context_dataset_new_advanced_retrieval, metrics)

# checking metrics
multi_context_dataset_new_advanced_retrieval_results


Evaluating:   0%|          | 0/25 [00:00<?, ?it/s]

{'faithfulness': 1.0000, 'answer_relevancy': 0.9498, 'context_recall': 0.9000, 'context_precision': 0.9442, 'answer_correctness': 0.7226}

In [74]:
multi_context_dataset_new_advanced_retrieval_results_df = multi_context_dataset_new_advanced_retrieval_results.to_pandas()
multi_context_dataset_new_advanced_retrieval_results_df

Unnamed: 0,question,answer,contexts,ground_truth,faithfulness,answer_relevancy,context_recall,context_precision,answer_correctness
0,"""What strategy video game did OpenAI excel in,...","OpenAI excelled in Dota 2, a strategy video ga...",[77. \nInitial work at OpenAI followed much in...,OpenAI excelled in the strategy video game Dot...,1.0,0.927023,1.0,1.0,0.74022
1,"""What game did OpenAI use reinforcement learni...",OpenAI used reinforcement learning in the game...,[77. \nInitial work at OpenAI followed much in...,OpenAI used reinforcement learning in the game...,1.0,0.93784,0.5,1.0,0.53476
2,How does the use of IP assets affect the propo...,The use of IP assets in the proposed business ...,"[business model were valid, it would radically...",The use of IP assets in the proposed business ...,1.0,0.931506,1.0,1.0,0.862021
3,What breach of the Founding Agreement has occu...,The breach of the Founding Agreement that occu...,[an algorithm that is outside the scope of Mic...,Licensing GPT-4 exclusively to Microsoft,1.0,0.994584,1.0,0.8875,0.580687
4,What makes AGI in the wrong hands dangerous an...,AGI in the wrong hands is considered dangerous...,[18. \nMr. Musk has long recognized that AGI p...,The advancement of AI in the wrong hands is da...,1.0,0.957879,1.0,0.833333,0.89527


In [79]:
# Comparing results for multi context questions across Baseline, Using Multiquery with ADA, Using Multiquery with Text Embedding small
df_baseline = pd.DataFrame(list(multi_context_dataset_results.items()), columns=['Metric', 'Baseline'])
df_original = pd.DataFrame(list(multi_context_dataset_advanced_retrieval_results.items()), columns=['Metric', 'ADA'])
df_comparison = pd.DataFrame(list(multi_context_dataset_new_advanced_retrieval_results.items()), columns=['Metric', 'Text Embedding 3'])

df_merged = pd.merge(df_original, df_comparison, on='Metric')
df_merged = pd.merge(df_baseline, df_merged, on="Metric")

df_merged['Delta - TE3 -> ADA'] = df_merged['Text Embedding 3'] - df_merged['ADA']
df_merged['Delta - TE3 -> Baseline'] = df_merged['Text Embedding 3'] - df_merged['Baseline']

df_merged

| | Metric | Baseline | ADA | Text Embedding 3 | Delta - TE3 -> ADA | Delta - TE3 -> Baseline |
|---|---|---|---|---|---|---|
| 0 | faithfulness | 1.0 | 0.96 | 1.0 | 0.04 | 0.0 |
| 1 | answer_relevancy | 0.947723 | 0.947634 | 0.949767 | 0.002133 | 0.002043 |
| 2 | context_recall | 0.9 | 0.95 | 0.9 | -0.05 | 0.0 |
| 3 | context_precision | 0.944167 | 0.923333 | 0.944167 | 0.020833 | 0.0 |
| 4 | answer_correctness | 0.722585 | 0.740657 | 0.722591 | -0.018066 | 6e-06 |
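
With only five multi-context questions, a single question can dominate a delta. One rough way to gauge that uncertainty is a paired bootstrap over the per-question scores. The sketch below is illustrative rather than a rigorous significance test; `paired_bootstrap_ci` is a hypothetical helper, and the inputs are the `answer_correctness` columns from the two MultiQueryRetriever tables above.

```python
import random
from statistics import mean

# answer_correctness per multi-context question, copied from the tables above.
ada = [0.74022, 0.535255, 0.835702, 0.707065, 0.885045]
te3 = [0.74022, 0.53476, 0.862021, 0.580687, 0.89527]


def paired_bootstrap_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of the paired differences b - a."""
    rng = random.Random(seed)
    diffs = [y - x for x, y in zip(a, b)]
    # Resample the paired differences with replacement and record each mean.
    resampled_means = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper = resampled_means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


observed, lower, upper = paired_bootstrap_ci(ada, te3)
print(f"mean delta = {observed:.6f}, 95% interval = [{lower:.4f}, {upper:.4f}]")
```

If the interval spans zero, this sample does not distinguish the two embedding models on that metric, and a larger test set would be needed before drawing a conclusion.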
