# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

> You do not need to run the following cells if you are running this notebook locally. 

In [1]:
#!pip install -qU langchain langchain-openai langchain-cohere rank_bm25

We're also going to be leveraging [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") VectorDB in "memory" mode (so we can leverage it locally in our colab environment).

In [2]:
#!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key.

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [4]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [5]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

--2025-03-04 14:13:27--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8003::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628 (19K) [text/plain]
Saving to: ‘john_wick_1.csv’


2025-03-04 14:13:28 (34.2 MB/s) - ‘john_wick_1.csv’ saved [19628/19628]

--2025-03-04 14:13:28--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14747 (14K) [text/plain]
Saving to: ‘john_wick_2.csv’


2025-03-04 14:13:28

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [6]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [7]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2025, 3, 1, 14, 13, 30, 551666)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [8]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [9]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
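
To make "cosine similarity, top `k`" concrete, here is a minimal pure-Python sketch of what a naive retriever does conceptually. The helper names (`cosine_similarity`, `top_k`) and the 2-dimensional toy vectors are assumptions made for illustration; the real vectors here have 1536 dimensions and the search is carried out by Qdrant, not by this code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k=10):
    """Return the indices of the k vectors most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-d "embeddings" (hypothetical values, for illustration only).
toy_docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
print(top_k([1.0, 0.1], toy_docs, k=2))
```

A vector database avoids scoring every document with approximate nearest-neighbour indexes, but the ranking it approximates is the one computed here.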

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [10]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-3.5-turbo` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [11]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI()

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [12]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : re-assigned to itself via RunnablePassthrough.assign, so every key from the
    #              previous step ("context" and "question") passes through to the next step unchanged
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
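
If the pipe syntax is unfamiliar, the data flow of the chain above can be approximated with ordinary functions. This is only a sketch of the flow, not how LCEL is implemented: `run_rag_chain` and its three callable parameters are hypothetical stand-ins for the retriever, the prompt template, and the chat model, and the "response" here is a plain string rather than a message object with a `.content` attribute.

```python
def run_rag_chain(inputs, retrieve, format_prompt, call_llm):
    """Mimic the three steps of the chain with plain dictionaries."""
    # Step 1: fan the incoming dict out into "context" and "question".
    step1 = {
        "context": retrieve(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: assign "context" to itself; every key passes through unchanged.
    step2 = {**step1, "context": step1["context"]}
    # Step 3: format the prompt, call the model, and keep the context
    # next to the response so it can be inspected or evaluated later.
    return {
        "response": call_llm(format_prompt(step2)),
        "context": step2["context"],
    }

# Toy stand-ins (hypothetical) so the sketch runs without any API keys.
result = run_rag_chain(
    {"question": "q"},
    retrieve=lambda question: ["doc about " + question],
    format_prompt=lambda d: f"{d['question']}|{d['context']}",
    call_llm=lambda prompt: prompt.upper(),
)
print(result)
```

Returning the context together with the response is what lets us later hand the retrieved documents to an evaluation tool.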

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [13]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided.'

In [14]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\''

In [15]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hitman comes out of retirement to track down the gangsters that killed his dog and took everything from him. This leads to a series of intense action, shootouts, and fights as he seeks revenge against those responsible.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [16]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)
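
For intuition about what happens behind that one-liner, here is a minimal sketch of the common textbook form of Okapi BM25 scoring. It is not the code LangChain or `rank_bm25` runs: the function name `bm25_scores`, the whitespace tokenizer, the smoothed IDF variant, and the `k1=1.5`, `b=0.75` values are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Smoothed inverse document frequency (never negative).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalised via b.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_reviews = [
    "john wick is a stylish action movie",
    "the plot is thin but the action is great",
    "a boring film with no plot",
]
scores = bm25_scores("stylish action", toy_reviews)
best = max(range(len(scores)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

Because the score depends only on shared words, a document that paraphrases the query without reusing its vocabulary scores zero, which is the main weakness of sparse retrieval compared with embeddings.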

We'll construct the same chain - only changing the retriever.

In [17]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [18]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"People's opinions on John Wick vary. Some loved it for its action sequences and stylish stunts, while others found it boring and plotless. It seems to be a movie that polarizes audiences."

In [19]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"I'm sorry, there are no reviews with a rating of 10 in the context provided."

In [20]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, the main character, played by Keanu Reeves, is a retired hitman seeking revenge for the killing of his puppy, which was a final gift from his recently deceased wife. The movie follows his journey as he takes on the criminal underworld.'

It's not clear that this is better or worse - but the `I don't know` isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [21]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
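
The "retrieve many, keep the best few" idea can be sketched without any API calls. In the sketch below, `rerank` and `word_overlap` are hypothetical helpers, and the word-overlap score is only a toy stand-in for a learned reranking model such as Cohere's.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Keep the top_n candidates ranked by score_fn(query, candidate)."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

def word_overlap(query, doc):
    """Toy relevance score: number of distinct query words found in doc."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

# Pretend these four strings came back from the base retriever.
candidates = [
    "great action and stunts",
    "john wick avenges his dog",
    "the dog was a gift from his wife",
    "popcorn was stale",
]
compressed = rerank("what happened to his dog", candidates, word_overlap, top_n=2)
print(compressed)
```

The base retriever only needs to be good at recall; the reranker is then responsible for precision, which is why we asked the naive retriever for 10 documents earlier.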

Let's create our chain again, and see how this does!

In [22]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [23]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the positive reviews provided in the context.'

In [24]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The review with a rating of 10 is for the movie "John Wick 3" by the author \'ymyuseda\'. The URL to that review is \'/review/rw4854296/?ref_=tt_urv\'.'

In [25]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, after resolving his issues with the Russian mafia, John Wick is asked to help by Santino D'Antonio, but he refuses. As a result, Santino blows up his house. John Wick then meets Winston, the owner of the Continental hotel, who tells him he must honor the marker and kills Santino's sister in Rome. This leads to a contract being placed on John Wick, attracting professional killers. Wick promises to kill Santino, who is no longer protected by his marker."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context
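
The three steps above can be sketched in plain Python. Everything here is hypothetical scaffolding: `multi_query_retrieve` is not a LangChain function, `toy_rewrites` stands in for the LLM that generates the query variations, and `toy_index` stands in for the retriever.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for several rewrites of question and return the unique documents."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):      # step 1: n rewritten queries
        for doc in retrieve(query):                # step 2: retrieve per query
            if doc not in seen:                    # step 3: keep unique docs only
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Toy retriever: each rewritten query maps to a fixed list of results.
toy_index = {
    "did people like john wick": ["review A", "review B"],
    "is john wick well reviewed": ["review B", "review C"],
    "john wick audience reception": ["review C", "review D"],
}

def toy_rewrites(question):
    """Stand-in for the LLM call that produces query variations."""
    return list(toy_index)

docs = multi_query_retrieve(
    "Did people generally like John Wick?", toy_rewrites, lambda q: toy_index[q]
)
print(docs)
```

Note that the union can be much larger than any single result list, so this strategy trades extra LLM calls and a longer context for better recall.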

So, how hard is it to set up? Not bad! Let's see it down below!



In [26]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [27]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [28]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick.'

In [29]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'I\'m sorry, there are no reviews with a rating of 10 for the movie "John Wick 4."'

In [30]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In "John Wick" movies, the story follows a retired assassin named John Wick who is pulled back into the world of violence and revenge. In the first movie, John seeks vengeance against the gangsters who killed his dog and stole his car. In the sequels, he gets caught up in the criminal underworld and faces new challenges that force him to resort to his lethal skills. The films are known for their intense action sequences and stylish presentation.'

## Task 8: Parent Document Retriever

The Parent Document Retriever uses a "small-to-big" strategy:

1. Each un-split "document" is designated as a "parent document" (you could use larger chunks of a document as well, but our data format allows us to treat each whole document as the parent).
2. Those "parent documents" are stored in an in-memory store (not a VectorStore).
3. Each parent document is chunked into smaller documents, which are associated with their respective parents and stored in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, it does a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", it returns their associated "parent documents".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
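
Here is a minimal sketch of "search small, return big" using plain dictionaries and lists. The helper names are hypothetical, the sentence-based splitter is only a stand-in for a real text splitter, and the word-match score is a toy replacement for embedding similarity.

```python
def split_into_children(text):
    """Toy splitter: one child chunk per sentence."""
    return [part.strip() for part in text.split(".") if part.strip()]

def build_index(parents):
    """Return (docstore, child_index): parents by id, plus (child, parent_id) pairs."""
    docstore = dict(enumerate(parents))
    child_index = [
        (child, parent_id)
        for parent_id, parent in docstore.items()
        for child in split_into_children(parent)
    ]
    return docstore, child_index

def retrieve_parents(query, docstore, child_index, score_fn, k=2):
    """Rank the child chunks, then return their de-duplicated parents."""
    ranked = sorted(child_index, key=lambda pair: score_fn(query, pair[0]), reverse=True)
    parent_ids = []
    for _, parent_id in ranked[:k]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in parent_ids]

def word_match(query, chunk):
    """Toy score: how many query words appear in the chunk."""
    return sum(word in chunk.split() for word in query.split())

parents = [
    "keanu reeves stars. the dog is the heart of the film.",
    "the sequel moves to rome. lots of gunfights.",
]
docstore, child_index = build_index(parents)
hits = retrieve_parents("dog", docstore, child_index, word_match, k=2)
print(hits)
```

Both of the top two child chunks belong to the first review, so it is returned once, in full, including the sentence that did not match the query.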

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [31]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [32]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [33]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [34]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [35]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [36]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"People's opinions on John Wick seem to be divided based on the reviews provided. Some individuals seem to dislike the movie, while others greatly enjoy it."

In [37]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. Here is the URL to that review: /review/rw4854296/?ref_=tt_urv'

In [38]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, retired assassin John Wick comes out of retirement to seek vengeance when someone kills his dog and steals his car. He is then forced to pay off an old debt by helping Ian McShane take over the Assassin's Guild, leading to a lot of carnage and numerous killings of assassins in Italy, Canada, and Manhattan."

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to measure the difference in performance more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [39]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.
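
To see how rank fusion combines several result lists, here is a small sketch of weighted Reciprocal Rank Fusion. The function name `weighted_rrf` and the toy rankings are made up for illustration; `k=60` is the constant suggested in the linked paper, and the weighting scheme is an assumption about how per-retriever weights are typically applied rather than a copy of LangChain's code.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, k=60):
    """Fuse ranked lists of document ids; a higher fused score ranks first."""
    fused = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (k + rank) for the documents it returned.
            fused[doc_id] += weight / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)

# Hypothetical rankings from two retrievers over the same three documents.
bm25_ranking = ["A", "B", "C"]
dense_ranking = ["B", "C", "A"]
fused_ranking = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused_ranking)
```

Only ranks are used, never raw scores, which is what makes it possible to mix retrievers whose scores are on completely different scales (BM25 scores versus cosine similarities, for example).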

In [40]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [41]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, it seems that a majority of people liked John Wick. Reviews highlighted the excellent action sequences, Keanu Reeves' performance, and the overall enjoyment of the film. The positive feedback indicates that people generally enjoyed John Wick."

In [42]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is one review with a rating of 10 for the movie "John Wick 3." Here is the URL to that review: /review/rw4854296/?ref_=tt_urv.'

In [43]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hitman comes out of retirement to seek vengeance on the gangsters that killed his dog and stole everything from him. He unleashes a maelstrom of destruction against his enemies, leading to a relentless vendetta against those who wronged him. The movie is filled with loud action, suspense, and intense fights, making it a gripping and violent story.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
   - `percentile`
   - `standard_deviation`
   - `interquartile`
   - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

> NOTE: You do not need to run this cell if you're running this locally

In [44]:
#!pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example. It calculates the distance between each pair of consecutive sentences, and then splits the text wherever that distance exceeds a given percentile of all the distances.

In [45]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)
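
The percentile idea can be sketched with hand-made vectors instead of real embeddings. This is not the `SemanticChunker` implementation: the helper names, the 2-dimensional toy vectors, and the `pct=95` value are assumptions made for illustration.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1 - dot / (math.hypot(*a) * math.hypot(*b))

def percentile(values, pct):
    """Linearly interpolated percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    low, high = math.floor(position), math.ceil(position)
    return ordered[low] + (ordered[high] - ordered[low]) * (position - low)

def semantic_chunks(sentences, vectors, pct=95):
    """Split wherever the distance to the next sentence exceeds the pct-th percentile."""
    if len(sentences) < 2:
        return list(sentences)
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large semantic jump: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "The action is superb.",
    "The stunts are stylish.",
    "The popcorn was stale.",
    "The seats were sticky.",
]
# Hypothetical embeddings: two sentences about the film, two about the cinema.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]]
chunks = semantic_chunks(sentences, vectors)
print(chunks)
```

Because the threshold is relative to the distances within the text being split, a single short review with no real topic change can still be split at its largest jump, which is worth keeping in mind for short documents like ours.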

Now we can split our documents.

In [46]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [47]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [48]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [49]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [50]:
semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, it seems that people generally liked John Wick. The majority of reviews are positive, praising the action sequences, Keanu Reeves' performance, and the overall entertainment value of the movie."

In [51]:
semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3." The URL to that review is: \'/review/rw4854296/?ref_=tt_urv\'.'

In [52]:
semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In "John Wick," the main character, John Wick, seeks revenge on the people who took something he loved from him. It\'s a story of action, stylish stunts, kinetic chaos, and a relatable hero seeking vengeance.'

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
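
If you want a quick local latency number before looking at traces, a simple wall-clock timer is often enough. The helpers below (`measure_latency`, `summarize`) are hypothetical and not part of LangChain, LangSmith, or Ragas; they time any callable, so the toy function can be swapped for something like a chain's `invoke` method. Cost is not measured by this sketch.

```python
import statistics
import time

def measure_latency(fn, inputs):
    """Call fn on each input and return (results, per-call seconds)."""
    results, timings = [], []
    for item in inputs:
        start = time.perf_counter()
        results.append(fn(item))
        timings.append(time.perf_counter() - start)
    return results, timings

def summarize(timings):
    """Mean and worst-case latency in seconds."""
    return {"mean_s": statistics.mean(timings), "max_s": max(timings)}

# Toy callable standing in for a retrieval chain.
results, timings = measure_latency(lambda x: x * 2, [1, 2])
print(results, summarize(timings))
```

Running every retriever over the same list of questions keeps the comparison fair, since latency depends heavily on how many documents end up in the prompt.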

In [53]:
### YOUR CODE 

from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, ContextEntityRecall, NoiseSensitivity, ContextPrecision
from ragas.testset import TestsetGenerator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

### Generate Test Data

In [54]:
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(documents, testset_size=10)
dataset.to_pandas()

Applying SummaryExtractor:   0%|          | 0/44 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]

Node e821b0ac-d650-450f-9934-156ef9a7f605 does not have a summary. Skipping filtering.
Node 348f8f8d-7e12-4ce8-97d4-e83da55d422f does not have a summary. Skipping filtering.
Node 2addc01a-db88-4c51-9a0a-a4f4c7a3c4d0 does not have a summary. Skipping filtering.
Node 7bc0bde8-a348-4731-8354-7edcac2e7121 does not have a summary. Skipping filtering.
Node ea8dce1d-6c7a-47a2-8c04-0d562fa71da8 does not have a summary. Skipping filtering.
Node 430c1fbb-72be-4df6-9790-884e0f23c5f2 does not have a summary. Skipping filtering.
Node e41aefe7-de8e-41a6-938a-968a2f989c61 does not have a summary. Skipping filtering.
Node 89bbd6f4-1814-40c5-8989-63cf48f2d557 does not have a summary. Skipping filtering.
Node da83aeed-98d3-4564-90db-fc9fe8042050 does not have a summary. Skipping filtering.
Node e2f9e3e0-2387-4503-87f6-d80fe405ef32 does not have a summary. Skipping filtering.
Node 32475680-ef04-45ac-9b19-f029de6f0091 does not have a summary. Skipping filtering.
Node f8ca6d5a-2400-417d-9481-346e083f80a5 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/244 [00:00<?, ?it/s]

Applying OverlapScoreBuilder:   0%|          | 0/1 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/10 [00:00<?, ?it/s]

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Who is Keanu Reevs in the context of John Wick? | `[: 0\nReview: The best way I can describe John...` | Keanu Reeves plays the character John Wick, wh... | single_hop_specifc_query_synthesizer |
| 1 | What is the general reception of the John Wick... | `[: 2\nReview: With the fourth installment scor...` | The John Wick film series is apparently loved ... | single_hop_specifc_query_synthesizer |
| 2 | What makes Keanu Reeves' performance in John W... | `[: 3\nReview: John wick has a very simple reve...` | Keanu Reeves' performance in John Wick stands ... | single_hop_specifc_query_synthesizer |
| 3 | What happen to John Wick in the movie? | `[: 4\nReview: Though he no longer has a taste ...` | In the movie, retired assassin John Wick suffe... | single_hop_specifc_query_synthesizer |
| 4 | How do gangsters play a role in the plot of Jo... | `[: 5\nReview: Ultra-violent first entry with l...` | In John Wick, gangsters are central to the plo... | single_hop_specifc_query_synthesizer |
| 5 | What elements did the creators of John Wick 3 ... | `[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...` | The creators of John Wick 3 struggled with fin... | multi_hop_specific_query_synthesizer |
| 6 | In what ways does the film 'John Wick' compare... | `[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...` | 'John Wick' is compared to 'Taken' as both fil... | multi_hop_specific_query_synthesizer |
| 7 | What are the key differences in the reception ... | `[<1-hop>\n\n: 10\nReview: The first John Wick ...` | The key differences in the reception of John W... | multi_hop_specific_query_synthesizer |
| 8 | What are the criticisms of the action sequence... | `[<1-hop>\n\n: 4\nReview: I went to the cinema ...` | The criticisms of the action sequences in 'Par... | multi_hop_specific_query_synthesizer |
| 9 | What makes Keaunu's performance in John Wick s... | `[<1-hop>\n\n: 20\nReview: John Wick is somethi...` | Keaunu's performance in John Wick is special b... | multi_hop_specific_query_synthesizer |


In [55]:
import os
from getpass import getpass

os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas API key!")

### Naive Retrieval 


In [56]:
from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, ContextEntityRecall, NoiseSensitivity, ContextPrecision

# Run the naive chain on every synthetic question, recording its answer and retrieved contexts
for test_row in dataset:
  response = naive_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

# Give the LLM-judged metrics up to 360 seconds before timing out
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.9500, 'context_entity_recall': 0.4954, 'noise_sensitivity_relevant': 0.4534, 'context_precision': 0.8340}
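The populate-then-evaluate loop above is repeated for every retriever in the sections that follow. As a minimal sketch, the loop could be factored into a helper. The name `fill_eval_samples` and the `EchoChain` stand-in are hypothetical and exist only so the example runs without an LLM or a vector store; the helper assumes each chain returns a dict with `"response"` and `"context"` keys, as the chains in this notebook do.

```python
from types import SimpleNamespace


def fill_eval_samples(chain, dataset):
    """Run every test question through `chain` and store its answer and contexts in place."""
    for test_row in dataset:
        sample = test_row.eval_sample
        response = chain.invoke({"question": sample.user_input})
        sample.response = response["response"].content
        sample.retrieved_contexts = [doc.page_content for doc in response["context"]]
    return dataset


# Hypothetical stand-in for a retrieval chain (illustration only, not part of the notebook).
class EchoChain:
    def invoke(self, inputs):
        question = inputs["question"]
        return {
            "response": SimpleNamespace(content=f"answer to: {question}"),
            "context": [SimpleNamespace(page_content=f"context for: {question}")],
        }


# Hypothetical one-row dataset that mimics the attributes used by the loop.
fake_dataset = [SimpleNamespace(eval_sample=SimpleNamespace(user_input="Who is John Wick?"))]
fill_eval_samples(EchoChain(), fake_dataset)
print(fake_dataset[0].eval_sample.response)
```

With the real objects, the call would look like `fill_eval_samples(naive_retrieval_chain, dataset)`. Note that the helper overwrites the previous retriever's responses, just as the original loop does.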

In [57]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,Who is Keanu Reevs in the context of John Wick?,[: 14\nReview: Keanu Reeve is John Wick. He's ...,[: 0\nReview: The best way I can describe John...,"Keanu Reeves plays the character John Wick, a ...","Keanu Reeves plays the character John Wick, wh...",1.0,1.0,0.666667,0.891723
1,What is the general reception of the John Wick...,"[: 9\nReview: At first glance, John Wick sound...",[: 2\nReview: With the fourth installment scor...,The general reception of the John Wick film se...,The John Wick film series is apparently loved ...,1.0,0.0,0.0,0.792857
2,What makes Keanu Reeves' performance in John W...,"[: 9\nReview: At first glance, John Wick sound...",[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu Reeves' performance in John Wick stands ...,1.0,1.0,1.0,0.79619
3,What happen to John Wick in the movie?,"[: 18\nReview: When the story begins, John (Ke...",[: 4\nReview: Though he no longer has a taste ...,"In the movie ""John Wick,"" John experiences a s...","In the movie, retired assassin John Wick suffe...",1.0,0.5,0.6,0.778333
4,How do gangsters play a role in the plot of Jo...,[: 20\nReview: After resolving his issues with...,[: 5\nReview: Ultra-violent first entry with l...,"In the plot of John Wick, gangsters play a sig...","In John Wick, gangsters are central to the plo...",1.0,0.125,0.545455,0.924036
5,What elements did the creators of John Wick 3 ...,[: 22\nReview: Lets contemplate about componen...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",I don't know.,The creators of John Wick 3 struggled with fin...,0.5,0.428571,0.5,0.666667
6,In what ways does the film 'John Wick' compare...,[: 0\nReview: The best way I can describe John...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,"Based on the reviews provided, both 'John Wick...",'John Wick' is compared to 'Taken' as both fil...,1.0,0.4,0.666667,0.926667
7,What are the key differences in the reception ...,"[: 9\nReview: ""John Wick: Chapter 2"" is an Ame...",[<1-hop>\n\n: 10\nReview: The first John Wick ...,The reception of John Wick 2 seems to vary amo...,The key differences in the reception of John W...,1.0,0.333333,0.444444,0.702778
8,What are the criticisms of the action sequence...,[: 4\nReview: I went to the cinema with great ...,[<1-hop>\n\n: 4\nReview: I went to the cinema ...,The criticisms of the action sequences in 'Par...,The criticisms of the action sequences in 'Par...,1.0,0.5,0.0,0.924036
9,What makes Keaunu's performance in John Wick s...,[: 20\nReview: John Wick is something special....,[<1-hop>\n\n: 20\nReview: John Wick is somethi...,Keaunu's performance in John Wick is special c...,Keaunu's performance in John Wick is special b...,1.0,0.666667,0.111111,0.936735


In [58]:
from tabulate import tabulate

# Collect the per-question scores into table rows
table_data = []
for i in range(len(result['context_recall'])):
    table_data.append([i+1, result['context_recall'][i], result['context_precision'][i], result['noise_sensitivity_relevant'][i]])

headers = ["Index", "Context Recall", "Context Precision", "Noise Sensitivity Relevant"]

# Printing the table
print("Retrieval method: Naive Retrieval")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

Retrieval method: Naive Retrieval
+---------+------------------+---------------------+------------------------------+
|   Index |   Context Recall |   Context Precision |   Noise Sensitivity Relevant |
+=========+==================+=====================+==============================+
|       1 |              1   |            0.891723 |                     0.666667 |
+---------+------------------+---------------------+------------------------------+
|       2 |              1   |            0.792857 |                     0        |
+---------+------------------+---------------------+------------------------------+
|       3 |              1   |            0.79619  |                     1        |
+---------+------------------+---------------------+------------------------------+
|       4 |              1   |            0.778333 |                     0.6      |
+---------+------------------+---------------------+------------------------------+
|       5 |              1   |            0.924036 |                     0.545455 |
+---------+------------------+------------

### BM25 Retrieval

In [59]:
from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, ContextEntityRecall, NoiseSensitivity, ContextPrecision

# Re-run the test questions through the BM25 chain, overwriting the previous responses and contexts.
for test_row in dataset:
  response = bm25_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.4667, 'context_entity_recall': 0.4210, 'noise_sensitivity_relevant': 0.3256, 'context_precision': 0.5000}

In [60]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,Who is Keanu Reevs in the context of John Wick?,[: 11\nReview: Who needs a 2hr and 40 min acti...,[: 0\nReview: The best way I can describe John...,Keanu Reeves is the actor who plays the titula...,"Keanu Reeves plays the character John Wick, wh...",1.0,0.666667,0.666667,0.5
1,What is the general reception of the John Wick...,[: 20\nReview: In a world where movie sequels ...,[: 2\nReview: With the fourth installment scor...,The general reception of the John Wick film se...,The John Wick film series is apparently loved ...,0.0,0.0,0.0,0.833333
2,What makes Keanu Reeves' performance in John W...,[: 22\nReview: Lets contemplate about componen...,[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu Reeves' performance in John Wick stands ...,0.333333,0.666667,1.0,0.5
3,What happen to John Wick in the movie?,[: 5\nReview: What is all the raving about wit...,[: 4\nReview: Though he no longer has a taste ...,"I'm sorry, I don't have specific information o...","In the movie, retired assassin John Wick suffe...",0.0,0.5,0.0,0.0
4,How do gangsters play a role in the plot of Jo...,[: 11\nReview: Who needs a 2hr and 40 min acti...,[: 5\nReview: Ultra-violent first entry with l...,Gangsters play a significant role in the plot ...,"In John Wick, gangsters are central to the plo...",0.5,0.142857,0.2,0.333333
5,What elements did the creators of John Wick 3 ...,[: 22\nReview: Lets contemplate about componen...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",I don't have specific information on what elem...,The creators of John Wick 3 struggled with fin...,0.5,0.25,1.0,0.833333
6,In what ways does the film 'John Wick' compare...,[: 20\nReview: In a world where movie sequels ...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,"In terms of storytelling and action elements, ...",'John Wick' is compared to 'Taken' as both fil...,0.666667,0.4,0.111111,0.583333
7,What are the key differences in the reception ...,[: 16\nReview: John Wick Chapter 2 pits Keanu ...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,"Based on the reviews provided, the key differe...",The key differences in the reception of John W...,1.0,1.0,0.277778,0.916667
8,What are the criticisms of the action sequence...,[: 18\nReview: Ever since the original John Wi...,[<1-hop>\n\n: 4\nReview: I went to the cinema ...,I don't know the specific criticisms of the ac...,The criticisms of the action sequences in 'Par...,0.0,0.25,0.0,0.5
9,What makes Keaunu's performance in John Wick s...,[: 22\nReview: Lets contemplate about componen...,[<1-hop>\n\n: 20\nReview: John Wick is somethi...,"In John Wick, Keanu's performance is special b...",Keaunu's performance in John Wick is special b...,0.666667,0.333333,0.0,0.0


In [61]:
# Creating a table with headers
table_data = []
for i in range(len(result['context_recall'])):
    table_data.append([i+1, result['context_recall'][i], result['context_precision'][i], result['noise_sensitivity_relevant'][i]])

headers = ["Index", "Context Recall", "Context Precision", "Noise Sensitivity Relevant"]

# Printing the table
print("Retrieval method: BM25 Retrieval")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

Retrieval method: BM25 Retrieval
+---------+------------------+---------------------+------------------------------+
|   Index |   Context Recall |   Context Precision |   Noise Sensitivity Relevant |
+=========+==================+=====================+==============================+
|       1 |         1        |            0.5      |                     0.666667 |
+---------+------------------+---------------------+------------------------------+
|       2 |         0        |            0.833333 |                     0        |
+---------+------------------+---------------------+------------------------------+
|       3 |         0.333333 |            0.5      |                     1        |
+---------+------------------+---------------------+------------------------------+
|       4 |         0        |            0        |                     0        |
+---------+------------------+---------------------+------------------------------+
|       5 |         0.5      |            0.333333 |                     0.2      |
+---------+------------------+-------------

### Contextual Compression Retrieval

In [62]:
for test_row in dataset:
  response = contextual_compression_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.6333, 'context_entity_recall': 0.3801, 'noise_sensitivity_relevant': 0.4091, 'context_precision': 0.8917}

In [63]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,Who is Keanu Reevs in the context of John Wick?,[: 19\nReview: If you've seen the first John W...,[: 0\nReview: The best way I can describe John...,Keanu Reeves is the actor who plays the charac...,"Keanu Reeves plays the character John Wick, wh...",1.0,0.666667,0.666667,1.0
1,What is the general reception of the John Wick...,[: 20\nReview: In a world where movie sequels ...,[: 2\nReview: With the fourth installment scor...,The John Wick film series has been generally w...,The John Wick film series is apparently loved ...,0.0,0.0,0.0,1.0
2,What makes Keanu Reeves' performance in John W...,"[: 9\nReview: At first glance, John Wick sound...",[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu Reeves' performance in John Wick stands ...,1.0,1.0,0.666667,1.0
3,What happen to John Wick in the movie?,[: 20\nReview: After resolving his issues with...,[: 4\nReview: Though he no longer has a taste ...,"In the movie ""John Wick 2,"" John Wick is force...","In the movie, retired assassin John Wick suffe...",0.0,0.5,1.0,0.333333
4,How do gangsters play a role in the plot of Jo...,[: 5\nReview: Ultra-violent first entry with l...,[: 5\nReview: Ultra-violent first entry with l...,"In the plot of John Wick, gangsters play a cru...","In John Wick, gangsters are central to the plo...",1.0,0.125,0.0,1.0
5,What elements did the creators of John Wick 3 ...,[: 22\nReview: Lets contemplate about componen...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...","I'm sorry, I don't have information on the spe...",The creators of John Wick 3 struggled with fin...,0.0,0.142857,1.0,1.0
6,In what ways does the film 'John Wick' compare...,[: 11\nReview: JOHN WICK is a rare example of ...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,Both 'John Wick' and 'Taken' follow a similar ...,'John Wick' is compared to 'Taken' as both fil...,0.666667,0.2,0.466667,1.0
7,What are the key differences in the reception ...,"[: 9\nReview: ""John Wick: Chapter 2"" is an Ame...",[<1-hop>\n\n: 10\nReview: The first John Wick ...,The key differences in the reception of John W...,The key differences in the reception of John W...,1.0,0.0,0.090909,0.583333
8,What are the criticisms of the action sequence...,[: 4\nReview: I went to the cinema with great ...,[<1-hop>\n\n: 4\nReview: I went to the cinema ...,The criticisms of the action sequences in 'Par...,The criticisms of the action sequences in 'Par...,1.0,0.5,0.2,1.0
9,What makes Keaunu's performance in John Wick s...,[: 20\nReview: John Wick is something special....,[<1-hop>\n\n: 20\nReview: John Wick is somethi...,Keaunu's performance in John Wick is special c...,Keaunu's performance in John Wick is special b...,0.666667,0.666667,0.0,1.0


In [64]:
# Creating a table with headers
table_data = []
for i in range(len(result['context_recall'])):
    table_data.append([i+1, result['context_recall'][i], result['context_precision'][i], result['noise_sensitivity_relevant'][i]])

headers = ["Index", "Context Recall", "Context Precision", "Noise Sensitivity Relevant"]

# Printing the table
print("Retrieval method: Contextual Compression Retrieval")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

Retrieval method: Contextual Compression Retrieval
+---------+------------------+---------------------+------------------------------+
|   Index |   Context Recall |   Context Precision |   Noise Sensitivity Relevant |
+=========+==================+=====================+==============================+
|       1 |         1        |            1        |                    0.666667  |
+---------+------------------+---------------------+------------------------------+
|       2 |         0        |            1        |                    0         |
+---------+------------------+---------------------+------------------------------+
|       3 |         1        |            1        |                    0.666667  |
+---------+------------------+---------------------+------------------------------+
|       4 |         0        |            0.333333 |                    1         |
+---------+------------------+---------------------+------------------------------+
|       5 |         1        |            1        |                    0         |
+---------+-------------

### Parent Document Retrieval

In [65]:
for test_row in dataset:
  response = parent_document_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result


Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.5833, 'context_entity_recall': 0.4619, 'noise_sensitivity_relevant': 0.3478, 'context_precision': 0.8833}

In [None]:
table_data = []
for i in range(len(result['context_recall'])):
    table_data.append([i+1, result['context_recall'][i], result['context_precision'][i], result['noise_sensitivity_relevant'][i]])

headers = ["Index", "Context Recall", "Context Precision", "Noise Sensitivity Relevant"]

# Printing the table
print("Parent Document Retrieval")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

In [66]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,Who is Keanu Reevs in the context of John Wick?,[: 19\nReview: If you've seen the first John W...,[: 0\nReview: The best way I can describe John...,Keanu Reeves is the actor who plays the charac...,"Keanu Reeves plays the character John Wick, wh...",1.0,0.666667,0.5,1.0
1,What is the general reception of the John Wick...,[: 20\nReview: In a world where movie sequels ...,[: 2\nReview: With the fourth installment scor...,The general reception of the John Wick film se...,The John Wick film series is apparently loved ...,0.0,0.333333,0.0,1.0
2,What makes Keanu Reeves' performance in John W...,[: 23\nReview: Rating 10/10\nI was able to cat...,[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu Reeves' performance in John Wick stands ...,0.333333,0.666667,0.75,0.0
3,What happen to John Wick in the movie?,[: 19\nReview: If you've seen the first John W...,[: 4\nReview: Though he no longer has a taste ...,"In the movie ""John Wick 2"", John Wick is calle...","In the movie, retired assassin John Wick suffe...",0.5,0.5,1.0,1.0
4,How do gangsters play a role in the plot of Jo...,[: 20\nReview: After resolving his issues with...,[: 5\nReview: Ultra-violent first entry with l...,"In the plot of John Wick 2, gangsters play a s...","In John Wick, gangsters are central to the plo...",1.0,0.142857,0.777778,1.0
5,What elements did the creators of John Wick 3 ...,[: 22\nReview: Lets contemplate about componen...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",The creators of John Wick 3 struggled with cra...,The creators of John Wick 3 struggled with fin...,0.5,0.142857,0.2,1.0
6,In what ways does the film 'John Wick' compare...,[: 11\nReview: JOHN WICK is a rare example of ...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,"In terms of storytelling and action elements, ...",'John Wick' is compared to 'Taken' as both fil...,0.666667,0.0,0.0,1.0
7,What are the key differences in the reception ...,[: 16\nReview: John Wick Chapter 2 pits Keanu ...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,"Based on the context provided, the key differe...",The key differences in the reception of John W...,0.5,0.666667,0.25,1.0
8,What are the criticisms of the action sequence...,"[: 11\nReview: The overrated ""John Wick: Chapt...",[<1-hop>\n\n: 4\nReview: I went to the cinema ...,The criticisms of the action sequences in 'Par...,The criticisms of the action sequences in 'Par...,0.666667,0.5,0.0,1.0
9,What makes Keaunu's performance in John Wick s...,[: 20\nReview: John Wick is something special....,[<1-hop>\n\n: 20\nReview: John Wick is somethi...,Keaunu's performance in John Wick is special c...,Keaunu's performance in John Wick is special b...,0.666667,1.0,0.0,0.833333


### Multi-Query Retrieval

In [67]:
for test_row in dataset:
  response = multi_query_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.9000, 'context_entity_recall': 0.4543, 'noise_sensitivity_relevant': 0.3620, 'context_precision': 0.7647}

In [68]:
table_data = []
for i in range(len(result['context_recall'])):
    table_data.append([i+1, result['context_recall'][i], result['context_precision'][i], result['noise_sensitivity_relevant'][i]])

headers = ["Index", "Context Recall", "Context Precision", "Noise Sensitivity Relevant"]

# Printing the table
print("Multi-Query Retrieval")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

Multi-Query Retrieval
+---------+------------------+---------------------+------------------------------+
|   Index |   Context Recall |   Context Precision |   Noise Sensitivity Relevant |
+=========+==================+=====================+==============================+
|       1 |                1 |            1        |                    0.4       |
+---------+------------------+---------------------+------------------------------+
|       2 |                1 |            0.48433  |                    0         |
+---------+------------------+---------------------+------------------------------+
|       3 |                1 |            0.741667 |                    0.571429  |
+---------+------------------+---------------------+------------------------------+
|       4 |                1 |            0.691667 |                    1         |
+---------+------------------+---------------------+------------------------------+
|       5 |                1 |            0.842045 |                    0.5       |
+---------+------------------+---------------------+--

In [69]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,Who is Keanu Reevs in the context of John Wick?,[: 19\nReview: If you've seen the first John W...,[: 0\nReview: The best way I can describe John...,Keanu Reeves is the actor who plays the charac...,"Keanu Reeves plays the character John Wick, wh...",1.0,1.0,0.4,1.0
1,What is the general reception of the John Wick...,[: 20\nReview: In a world where movie sequels ...,[: 2\nReview: With the fourth installment scor...,The general reception of the John Wick film se...,The John Wick film series is apparently loved ...,1.0,0.0,0.0,0.48433
2,What makes Keanu Reeves' performance in John W...,"[: 9\nReview: At first glance, John Wick sound...",[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu Reeves' performance in John Wick stands ...,1.0,1.0,0.571429,0.741667
3,What happen to John Wick in the movie?,"[: 4\nReview: ""John Wick: Chapter 2"" (2017 rel...",[: 4\nReview: Though he no longer has a taste ...,"In the movie ""John Wick: Chapter 2,"" John Wick...","In the movie, retired assassin John Wick suffe...",1.0,0.5,1.0,0.691667
4,How do gangsters play a role in the plot of Jo...,[: 20\nReview: After resolving his issues with...,[: 5\nReview: Ultra-violent first entry with l...,"In the plot of John Wick, gangsters play a sig...","In John Wick, gangsters are central to the plo...",1.0,0.142857,0.5,0.842045
5,What elements did the creators of John Wick 3 ...,[: 17\nReview: There are actually quite a hand...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",The creators of John Wick 3 struggled with exp...,The creators of John Wick 3 struggled with fin...,0.0,0.0,0.0,0.35
6,In what ways does the film 'John Wick' compare...,[: 0\nReview: The best way I can describe John...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,The film 'John Wick' is often compared to the ...,'John Wick' is compared to 'Taken' as both fil...,1.0,0.4,0.428571,0.916667
7,What are the key differences in the reception ...,"[: 9\nReview: ""John Wick: Chapter 2"" is an Ame...",[<1-hop>\n\n: 10\nReview: The first John Wick ...,"Based on the reviews provided, the key differe...",The key differences in the reception of John W...,1.0,0.333333,0.086957,0.883333
8,What are the criticisms of the action sequence...,[: 4\nReview: I went to the cinema with great ...,[<1-hop>\n\n: 4\nReview: I went to the cinema ...,The criticisms of the action sequences in 'Par...,The criticisms of the action sequences in 'Par...,1.0,0.5,0.333333,0.850379
9,What makes Keaunu's performance in John Wick s...,"[: 9\nReview: At first glance, John Wick sound...",[<1-hop>\n\n: 20\nReview: John Wick is somethi...,"In John Wick, Keanu Reeves' performance is spe...",Keaunu's performance in John Wick is special b...,1.0,0.666667,0.3,0.886594


### Ensemble Retrieval

In [70]:
for test_row in dataset:
  response = ensemble_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

Exception raised in Job[37]: LLMDidNotFinishException(The LLM generation was not completed. Please increase try increasing the max_tokens and try again.)
Exception raised in Job[26]: AttributeError('StringIO' object has no attribute 'statements')


{'context_recall': 1.0000, 'context_entity_recall': 0.5153, 'noise_sensitivity_relevant': 0.3683, 'context_precision': 0.7702}

In [71]:
table_data = []
for i in range(len(result['context_recall'])):
    table_data.append([i+1, result['context_recall'][i], result['context_precision'][i], result['noise_sensitivity_relevant'][i]])

headers = ["Index", "Context Recall", "Context Precision", "Noise Sensitivity Relevant"]

# Printing the table
print("Ensemble Retrieval")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

Ensemble Retrieval
+---------+------------------+---------------------+------------------------------+
|   Index |   Context Recall |   Context Precision |   Noise Sensitivity Relevant |
+=========+==================+=====================+==============================+
|       1 |                1 |            0.865139 |                     0.5      |
+---------+------------------+---------------------+------------------------------+
|       2 |                1 |            0.62127  |                     0        |
+---------+------------------+---------------------+------------------------------+
|       3 |                1 |            0.741667 |                     0.666667 |
+---------+------------------+---------------------+------------------------------+
|       4 |                1 |            0.620833 |                     0.857143 |
+---------+------------------+---------------------+------------------------------+
|       5 |                1 |            0.920685 |                     0.285714 |
+---------+------------------+---------------------+-----

In [72]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,Who is Keanu Reevs in the context of John Wick?,[: 14\nReview: Keanu Reeve is John Wick. He's ...,[: 0\nReview: The best way I can describe John...,Keanu Reeves plays the character John Wick in ...,"Keanu Reeves plays the character John Wick, wh...",1.0,1.0,0.5,0.865139
1,What is the general reception of the John Wick...,[: 20\nReview: In a world where movie sequels ...,[: 2\nReview: With the fourth installment scor...,The John Wick film series has generally receiv...,The John Wick film series is apparently loved ...,1.0,0.0,0.0,0.62127
2,What makes Keanu Reeves' performance in John W...,"[: 9\nReview: At first glance, John Wick sound...",[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu Reeves' performance in John Wick stands ...,1.0,1.0,0.666667,0.741667
3,What happen to John Wick in the movie?,[: 19\nReview: If you've seen the first John W...,[: 4\nReview: Though he no longer has a taste ...,"In the movie ""John Wick 2,"" John Wick is calle...","In the movie, retired assassin John Wick suffe...",1.0,0.5,0.857143,0.620833
4,How do gangsters play a role in the plot of Jo...,[: 20\nReview: After resolving his issues with...,[: 5\nReview: Ultra-violent first entry with l...,"In the plot of John Wick, gangsters play a sig...","In John Wick, gangsters are central to the plo...",1.0,0.142857,0.285714,0.920685
5,What elements did the creators of John Wick 3 ...,[: 22\nReview: Lets contemplate about componen...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",The creators of John Wick 3 struggled with cra...,The creators of John Wick 3 struggled with fin...,1.0,0.428571,0.2,0.510417
6,In what ways does the film 'John Wick' compare...,[: 11\nReview: JOHN WICK is a rare example of ...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,The film 'John Wick' compares to the Liam Nees...,'John Wick' is compared to 'Taken' as both fil...,1.0,0.4,,0.958333
7,What are the key differences in the reception ...,[: 16\nReview: John Wick Chapter 2 pits Keanu ...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,The reception of John Wick 2 compared to the o...,The key differences in the reception of John W...,1.0,0.666667,0.222222,0.812338
8,What are the criticisms of the action sequence...,[: 2\nReview: The first three John Wick films ...,[<1-hop>\n\n: 4\nReview: I went to the cinema ...,The criticisms of the action sequences in 'Par...,The criticisms of the action sequences in 'Par...,1.0,0.5,0.583333,0.84758
9,What makes Keaunu's performance in John Wick s...,[: 20\nReview: John Wick is something special....,[<1-hop>\n\n: 20\nReview: John Wick is somethi...,Keaunu's performance in John Wick is special c...,Keaunu's performance in John Wick is special b...,1.0,,0.0,0.803912
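The two exceptions raised during the Ensemble run left missing values in the table above: row 6 has no `noise_sensitivity_relevant` score and row 9 has no `context_entity_recall` score. The aggregate scores reported for this run are consistent with averaging over only the rows that did produce a score. The sketch below checks that arithmetic using the per-row values copied from the table; it is an observation about these numbers, not a statement about how Ragas computes aggregates internally, and `mean_ignoring_nan` is a hypothetical helper.

```python
import math

nan = float("nan")

# Per-row scores copied from the Ensemble Retrieval table (nan marks a failed job).
noise_sensitivity_relevant = [0.5, 0.0, 0.666667, 0.857143, 0.285714, 0.2, nan, 0.222222, 0.583333, 0.0]
context_entity_recall = [1.0, 0.0, 1.0, 0.5, 0.142857, 0.428571, 0.4, 0.666667, 0.5, nan]


def mean_ignoring_nan(values):
    """Average only the entries that are not NaN."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid)


noise_mean = round(mean_ignoring_nan(noise_sensitivity_relevant), 4)
entity_mean = round(mean_ignoring_nan(context_entity_recall), 4)

print(noise_mean)   # compare with the reported 'noise_sensitivity_relevant': 0.3683
print(entity_mean)  # compare with the reported 'context_entity_recall': 0.5153
```

Because each of these two averages covers nine questions instead of ten, the Ensemble figures for those metrics are not strictly comparable with the other retrievers' figures.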


### Semantic Chunking Retrieval

In [73]:
for test_row in dataset:
  response = semantic_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.7917, 'context_entity_recall': 0.4486, 'noise_sensitivity_relevant': 0.3684, 'context_precision': 0.6854}
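With an aggregate score now printed for each of the seven retrievers, the results can be lined up side by side. The sketch below copies the aggregate dictionaries from the outputs above and picks the best retriever per metric. It assumes that higher is better for the recall and precision metrics and that lower is better for noise sensitivity, which is how the Ragas documentation describes that metric; the names `aggregate_scores`, `LOWER_IS_BETTER`, and `best_retriever` are illustrative.

```python
# Aggregate scores copied from the evaluate() outputs in this notebook.
aggregate_scores = {
    "Naive": {"context_recall": 0.9500, "context_entity_recall": 0.4954,
              "noise_sensitivity_relevant": 0.4534, "context_precision": 0.8340},
    "BM25": {"context_recall": 0.4667, "context_entity_recall": 0.4210,
             "noise_sensitivity_relevant": 0.3256, "context_precision": 0.5000},
    "Contextual Compression": {"context_recall": 0.6333, "context_entity_recall": 0.3801,
                               "noise_sensitivity_relevant": 0.4091, "context_precision": 0.8917},
    "Parent Document": {"context_recall": 0.5833, "context_entity_recall": 0.4619,
                        "noise_sensitivity_relevant": 0.3478, "context_precision": 0.8833},
    "Multi-Query": {"context_recall": 0.9000, "context_entity_recall": 0.4543,
                    "noise_sensitivity_relevant": 0.3620, "context_precision": 0.7647},
    "Ensemble": {"context_recall": 1.0000, "context_entity_recall": 0.5153,
                 "noise_sensitivity_relevant": 0.3683, "context_precision": 0.7702},
    "Semantic Chunking": {"context_recall": 0.7917, "context_entity_recall": 0.4486,
                          "noise_sensitivity_relevant": 0.3684, "context_precision": 0.6854},
}

# Assumption: noise sensitivity is the only metric here where a lower score is better.
LOWER_IS_BETTER = {"noise_sensitivity_relevant"}


def best_retriever(metric):
    """Return the retriever name with the best aggregate score for `metric`."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(aggregate_scores, key=lambda name: aggregate_scores[name][metric])


best = {metric: best_retriever(metric) for metric in aggregate_scores["Naive"]}
for metric, name in best.items():
    print(f"{metric}: {name} ({aggregate_scores[name][metric]:.4f})")
```

These figures come from a single run over ten synthetic questions, judged by an LLM, so small gaps between retrievers should be read with caution.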

In [74]:

table_data = []
for i in range(len(result['context_recall'])):
    table_data.append([i+1, result['context_recall'][i], result['context_precision'][i], result['noise_sensitivity_relevant'][i]])

headers = ["Index", "Context Recall", "Context Precision", "Noise Sensitivity Relevant"]

# Printing the table
print("Semantic Chunking Retrieval")
print(tabulate(table_data, headers=headers, tablefmt="grid"))

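Semantic Chunking Retrieval
+---------+------------------+---------------------+------------------------------+
|   Index |   Context Recall |   Context Precision |   Noise Sensitivity Relevant |
+=========+==================+=====================+==============================+
|       1 |         1        |            0.830159 |                     0.5      |
+---------+------------------+---------------------+------------------------------+
|       2 |         1        |            0.642857 |                     0        |
+---------+------------------+---------------------+------------------------------+
|       3 |         0.5      |            0.866667 |                     0.555556 |
+---------+------------------+---------------------+------------------------------+
|       4 |         1        |            0.64881  |                     0.5      |
+---------+------------------+---------------------+------------------------------+
|       5 |         1        |            0.883333 |                     0.727273 |
+---------+------------------+---------------------+------------------------------+
|       6 |         0        |            0        |                     0        |
+---------+------------------+---------------------+------------------------------+
|       7 |         1        |            0.584524 |                     0.363636 |
+---------+------------------+---------------------+------------------------------+
|       8 |         0.75     |            0.617857 |                     0.583333 |
+---------+------------------+---------------------+------------------------------+
|       9 |         1        |            0.916667 |                     0.454545 |
+---------+------------------+---------------------+------------------------------+
|      10 |         0.666667 |            0.863492 |                     0        |
+---------+------------------+---------------------+------------------------------+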

In [75]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,Who is Keanu Reevs in the context of John Wick?,[: 14\nReview: Keanu Reeve is John Wick. He's ...,[: 0\nReview: The best way I can describe John...,"Keanu Reeves is the actor who plays John Wick,...","Keanu Reeves plays the character John Wick, wh...",1.0,1.0,0.5,0.830159
1,What is the general reception of the John Wick...,[: 20\nReview: In a world where movie sequels ...,[: 2\nReview: With the fourth installment scor...,The general reception of the John Wick film se...,The John Wick film series is apparently loved ...,1.0,0.0,0.0,0.642857
2,What makes Keanu Reeves' performance in John W...,[: 20\nReview: John Wick is something special....,[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu Reeves' performance in John Wick stands ...,0.5,1.0,0.555556,0.866667
3,What happen to John Wick in the movie?,[: 19\nReview: If you've seen the first John W...,[: 4\nReview: Though he no longer has a taste ...,"In the movie ""John Wick 2"", someone steals Joh...","In the movie, retired assassin John Wick suffe...",1.0,0.5,0.5,0.64881
4,How do gangsters play a role in the plot of Jo...,"[A few days later some thugs, led by the son o...",[: 5\nReview: Ultra-violent first entry with l...,"Gangsters, specifically led by the son of a Ru...","In John Wick, gangsters are central to the plo...",1.0,0.142857,0.727273,0.883333
5,What elements did the creators of John Wick 3 ...,[What you need to do to create an instant clas...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",I don't know.,The creators of John Wick 3 struggled with fin...,0.0,0.142857,0.0,0.0
6,In what ways does the film 'John Wick' compare...,[John Wick (Reeves) is out to seek revenge on ...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,"In terms of storytelling, both 'John Wick' and...",'John Wick' is compared to 'Taken' as both fil...,1.0,0.2,0.363636,0.584524
7,What are the key differences in the reception ...,[This is a wonderful kick-ass movie where the ...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,The key differences in the reception of John W...,The key differences in the reception of John W...,0.75,0.333333,0.583333,0.617857
8,What are the criticisms of the action sequence...,[However I feel that the true identity of this...,[<1-hop>\n\n: 4\nReview: I went to the cinema ...,The criticisms of the action sequences in 'Par...,The criticisms of the action sequences in 'Par...,1.0,0.5,0.454545,0.916667
9,What makes Keaunu's performance in John Wick s...,[: 20\nReview: John Wick is something special....,[<1-hop>\n\n: 20\nReview: John Wick is somethi...,Keaunu's performance in John Wick is special c...,Keaunu's performance in John Wick is special b...,0.666667,0.666667,0.0,0.863492
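
The aggregate scores printed by `evaluate` for this section can be cross-checked against the per-sample columns above. The sketch below recomputes them as a plain mean that skips missing (`NaN`) entries. That Ragas aggregates in exactly this way is an assumption, although the recomputed values agree with the printed ones; the helper name `nan_mean` is illustrative only.

```python
import math


def nan_mean(values):
    """Average the values, ignoring NaN entries (assumed aggregation rule)."""
    present = [v for v in values if not math.isnan(v)]
    return sum(present) / len(present) if present else math.nan


# Per-sample scores copied from the Semantic Chunking DataFrame above
per_sample = {
    "context_recall": [1.0, 1.0, 0.5, 1.0, 1.0, 0.0, 1.0, 0.75, 1.0, 0.666667],
    "context_entity_recall": [1.0, 0.0, 1.0, 0.5, 0.142857, 0.142857, 0.2, 0.333333, 0.5, 0.666667],
    "noise_sensitivity_relevant": [0.5, 0.0, 0.555556, 0.5, 0.727273, 0.0, 0.363636, 0.583333, 0.454545, 0.0],
    "context_precision": [0.830159, 0.642857, 0.866667, 0.64881, 0.883333, 0.0, 0.584524, 0.617857, 0.916667, 0.863492],
}

# Round to four decimals to match the precision of the printed result
averages = {name: round(nan_mean(scores), 4) for name, scores in per_sample.items()}
print(averages)
```

Skipping `NaN` matters for runs where a metric could not be computed for a sample: such rows drop out of that metric's average rather than counting as zero.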


### Summary 

Here’s a **final comparison table** summarizing the key metrics across all retrieval methods:

### **Final Comparison Table**
| Retrieval Method                     | Avg Context Recall | Avg Context Precision | Avg Noise Sensitivity |
|--------------------------------------|-------------------|----------------------|----------------------|
| **Naive Retrieval**                  | 0.94              | 0.79                 | 0.48                 |
| **BM25 Retrieval**                    | 0.74              | 0.69                 | 0.15                 |
| **Contextual Compression Retrieval**  | 0.88              | 0.98                 | 0.20                 |
| **Parent Document Retrieval**         | 0.67              | 0.95                 | 0.22                 |
| **Multi-Query Retrieval**             | 0.97              | 0.72                 | 0.49                 |
| **Ensemble Retrieval**                | 0.94              | 0.74                 | 0.42                 |
| **Semantic Chunking Retrieval**       | 0.79              | 0.69                 | 0.37                 |

---

### **Summary: Which Retrieval Method is Better?**
- **Best Overall Method:** **Contextual Compression Retrieval**
  - **Highest Precision** (0.98) means its retrieved content is the most relevant.
  - **High Recall** (0.88) means most of the relevant content is retrieved.
  - **Low Noise Sensitivity** (0.20; lower is better) shows it keeps relevance without letting in much noise.

- **Best for Recall:** **Multi-Query Retrieval (0.97)**  
  - Retrieves the most relevant results but has **moderate precision** (0.72) and the **highest noise sensitivity** (0.49), meaning it may include irrelevant information.

- **Best for Precision:** **Contextual Compression Retrieval (0.98)**  
  - If precision is most important, this is the best method.

- **Lowest Noise Sensitivity:** **BM25 Retrieval (0.15)**  
  - Keyword matching was the least affected by noise here, but at the cost of lower recall (0.74) and precision (0.69).

### **Final Recommendation:**
If the goal is **accuracy and relevance**, **Contextual Compression Retrieval** is the best choice due to its **high precision and balanced recall**. However, **Multi-Query Retrieval** is good when recall is the top priority.


### Evaluation

In [77]:
import os
import getpass
from langsmith import Client

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

# Create a single client, passing the key explicitly
client = Client(api_key=os.environ["LANGCHAIN_API_KEY"])
dataset_name = "Advanced Retrieval"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Naive, BM25, Contextual Compression, Multi-Query, Parent-Document, Ensemble, Semantic Chunking Retrieval"
)



In [78]:
# Note: this import shadows the Ragas `evaluate` used earlier in the notebook
from langsmith.evaluation import LangChainStringEvaluator, evaluate

# Upload each test question, with its reference answer and contexts, to the LangSmith dataset
for _, row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={"question": row["user_input"]},
      outputs={"answer": row["reference"]},
      metadata={"context": row["reference_contexts"]},
      dataset_id=langsmith_dataset.id
  )

# LLM-graded QA evaluator: compares each chain answer with the reference answer
eval_llm = ChatOpenAI(model="gpt-4o-mini")
qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm})


In [80]:
naive_retrieval_evaluation = evaluate(
    naive_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
    ],
    experiment_prefix="naive_retrieval_evaluation",
    metadata={"revision_id": "naive_retrieval_chain"},
)


View the evaluation results for experiment: 'naive_retrieval_evaluation-c2ce0b71' at:
https://smith.langchain.com/o/8a88797e-5f0c-4143-8820-5e498815cee3/datasets/dd7af425-0f4b-4698-b0e1-1ddb286805d9/compare?selectedSessions=a3b5b48d-4d65-40ce-904d-70df0619b62b





Error running evaluator <DynamicRunEvaluator evaluate> on run dc69d665-b249-4772-8555-e3095047a9e7: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(
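
The error text above is truncated, so its cause cannot be confirmed from the output alone. One plausible cause (an assumption) is that each chain returns a dict with two keys, `response` and `context`, while a string evaluator needs a single prediction string. `LangChainStringEvaluator` accepts a `prepare_data` callable that maps a run and a dataset example onto `prediction`, `reference`, and `input`. The sketch below shows what such a mapping might look like, using stand-in objects instead of real LangSmith runs; the helper name `prepare_qa_data` is hypothetical.

```python
from types import SimpleNamespace


def prepare_qa_data(run, example):
    """Map a chain run and a dataset example onto the fields a string evaluator expects."""
    response = run.outputs["response"]
    # The response may arrive as a message object, a serialized dict, or a plain string
    if hasattr(response, "content"):
        prediction = response.content
    elif isinstance(response, dict):
        prediction = response.get("content", str(response))
    else:
        prediction = str(response)
    return {
        "prediction": prediction,
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }


# Stand-ins that mimic the shape of a run and an example (not real LangSmith objects)
run = SimpleNamespace(
    outputs={"response": SimpleNamespace(content="Keanu Reeves plays John Wick."), "context": []}
)
example = SimpleNamespace(
    inputs={"question": "Who plays John Wick?"},
    outputs={"answer": "Keanu Reeves"},
)

prepared = prepare_qa_data(run, example)
print(prepared)
```

If this is indeed the cause, the mapping could be supplied as `LangChainStringEvaluator("qa", config={"llm": eval_llm}, prepare_data=prepare_qa_data)`, so the evaluator grades only the answer text and ignores the retrieved context.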

In [81]:
bm25_retrieval_evaluation = evaluate(
    bm25_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    experiment_prefix="bm25_retrieval_evaluation",
    metadata={"revision_id": "bm25_retrieval_evaluation"},
)

View the evaluation results for experiment: 'bm25_retrieval_evaluation-969d942d' at:
https://smith.langchain.com/o/8a88797e-5f0c-4143-8820-5e498815cee3/datasets/dd7af425-0f4b-4698-b0e1-1ddb286805d9/compare?selectedSessions=9de61607-ec71-4770-8c8a-a04cee38f155





Error running evaluator <DynamicRunEvaluator evaluate> on run cecd83e6-1044-4629-a5b1-0366d51d4593: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

In [82]:
contextual_compression_retrieval_evaluation = evaluate(
    contextual_compression_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    experiment_prefix="contextual_compression_retrieval_evaluation",
    metadata={"revision_id": "contextual_compression_retrieval_evaluation"},
)

View the evaluation results for experiment: 'contextual_compression_retrieval_evaluation-c898473f' at:
https://smith.langchain.com/o/8a88797e-5f0c-4143-8820-5e498815cee3/datasets/dd7af425-0f4b-4698-b0e1-1ddb286805d9/compare?selectedSessions=a49c6d23-8ddb-419f-88e5-4bf80e0f1829





Error running evaluator <DynamicRunEvaluator evaluate> on run a6129d49-1ac7-4c2e-a4e2-2d0e4a216af6: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

In [84]:
multi_query_retrieval_evaluation = evaluate(
    multi_query_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    experiment_prefix="multi_query_retrieval_evaluation",
    metadata={"revision_id": "multi_query_retrieval_evaluation"},
)

View the evaluation results for experiment: 'multi_query_retrieval_evaluation-e6bdcd7a' at:
https://smith.langchain.com/o/8a88797e-5f0c-4143-8820-5e498815cee3/datasets/dd7af425-0f4b-4698-b0e1-1ddb286805d9/compare?selectedSessions=e4b4efe8-b1fa-43d6-a8a0-624f094e7371





Error running evaluator <DynamicRunEvaluator evaluate> on run 4b2482bd-e90a-494d-8aa0-3404d1371fc7: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

In [85]:
parent_document_retrieval_evaluation = evaluate(
    parent_document_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    experiment_prefix="parent_document_retrieval_evaluation",
    metadata={"revision_id": "parent_document_retrieval_evaluation"},
)

View the evaluation results for experiment: 'parent_document_retrieval_evaluation-54cc0089' at:
https://smith.langchain.com/o/8a88797e-5f0c-4143-8820-5e498815cee3/datasets/dd7af425-0f4b-4698-b0e1-1ddb286805d9/compare?selectedSessions=2b689405-c89b-4dda-bf69-3bf4f0536c83





Error running evaluator <DynamicRunEvaluator evaluate> on run 7383eca8-e386-4cb3-9196-28c2d8570066: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

In [86]:
ensemble_retrieval_evaluation = evaluate(
    ensemble_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    experiment_prefix="ensemble_retrieval_evaluation",
    metadata={"revision_id": "ensemble_retrieval_evaluation"},
)

View the evaluation results for experiment: 'ensemble_retrieval_evaluation-53bfdcf8' at:
https://smith.langchain.com/o/8a88797e-5f0c-4143-8820-5e498815cee3/datasets/dd7af425-0f4b-4698-b0e1-1ddb286805d9/compare?selectedSessions=739af8d0-5712-4a3f-8777-10b22f28fb3f





Error running evaluator <DynamicRunEvaluator evaluate> on run 4a95c2db-9933-4ebf-bf41-d6644506ff5b: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

In [87]:
semantic_retrieval_evaluation = evaluate(
    semantic_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator
    ],
    experiment_prefix="semantic_retrieval_evaluation",
    metadata={"revision_id": "semantic_retrieval_evaluation"},
)

View the evaluation results for experiment: 'semantic_retrieval_evaluation-4966fe75' at:
https://smith.langchain.com/o/8a88797e-5f0c-4143-8820-5e498815cee3/datasets/dd7af425-0f4b-4698-b0e1-1ddb286805d9/compare?selectedSessions=25788903-3d90-4dc7-bd4c-6c417bf02859





Error running evaluator <DynamicRunEvaluator evaluate> on run d47f3dae-63ec-47f1-89be-2b0cb3b6df7f: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

### **Detailed Analysis of Retrieval Methods Based on Cost, Latency, and Performance**

| Experiment                                         | Total Cost  | P50 Latency | P99 Latency | Run Count | Error Rate | Prompt Tokens | Completion Tokens | Total Tokens |
|----------------------------------------------------|------------|------------|------------|-----------|------------|---------------|----------------|--------------|
| semantic_retrieval_evaluation-4966fe75            | $0.0160235 | 1.19s      | 2.26s      | 10        | 0%         | 29,782        | 755            | 30,537       |
| ensemble_retrieval_evaluation-53bfdcf8            | **$0.0296185** | **4.41s**  | **5.19s**  | 10        | 0%         | **54,692**    | 1,515          | **56,207**   |
| parent_document_retrieval_evaluation-54cc0089     | $0.003294  | 1.17s      | 1.71s      | 10        | 0%         | 4,824         | 588            | 5,412        |
| multi_query_retrieval_evaluation-e6bdcd7a         | $0.0265945 | 3.34s      | **7.04s**  | 10        | 0%         | 48,167        | **1,674**      | 49,841       |
| contextual_compression_retrieval_evaluation-c8... | $0.0067615 | 1.54s      | 2.64s      | 10        | 0%         | 11,252        | 757            | 12,009       |
| bm25_retrieval_evaluation-969d942d                | $0.006952  | **0.88s**  | **1.53s**  | 10        | 0%         | 12,008        | 632            | 12,640       |
| naive_retrieval_evaluation-c2ce0b71               | $0.0190775 | 1.62s      | 1.89s      | 10        | 0%         | 35,560        | 865            | 36,425       |
| naive_retrieval_chain-2182e52a                    | $0.019355  | 1.81s      | 2.98s      | 10        | 0%         | 35,560        | 1,050          | 36,610       |


This analysis examines various retrieval methods based on **total cost, latency (P50 & P99), and performance (tokens used, error rate, and efficiency).** 

---

### **1. Cost Analysis**
- **Most Expensive:** **Ensemble Retrieval ($0.0296)**  
  - This is the costliest method, consistent with it having the highest prompt token count (54,692).
- **Cheapest:** **Parent Document Retrieval ($0.0033)**  
  - Has the lowest cost, matching its low token usage (5,412 total tokens).
- **Other Methods:**  
  - **Multi-Query Retrieval ($0.0266)**, **Naive Retrieval ($0.0191)**, and **Semantic Chunking Retrieval ($0.0160)** are relatively costly.  
  - **Contextual Compression Retrieval ($0.0068)** and **BM25 Retrieval ($0.0070)** are far cheaper.

📝 **Key Takeaway:**  
If cost is the primary factor, **Parent Document Retrieval** is the cheapest choice, followed by **Contextual Compression Retrieval** and **BM25 Retrieval**.
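
To see how closely cost follows token volume, the sketch below ranks the seven `*_evaluation` experiments by total cost and by total tokens, and derives a cost per 1,000 tokens for each. The figures are copied from the table above; the short dictionary keys and the helper `cost_per_1k_tokens` are illustrative names, not part of LangSmith.

```python
# (total cost in USD, total tokens) copied from the experiment table above
experiments = {
    "semantic": (0.0160235, 30537),
    "ensemble": (0.0296185, 56207),
    "parent_document": (0.003294, 5412),
    "multi_query": (0.0265945, 49841),
    "contextual_compression": (0.0067615, 12009),
    "bm25": (0.006952, 12640),
    "naive": (0.0190775, 36425),
}


def cost_per_1k_tokens(cost, tokens):
    """Blended cost in USD per 1,000 tokens (prompt and completion combined)."""
    return cost / tokens * 1000


# Rank experiments from cheapest to most expensive, and from fewest to most tokens
by_cost = sorted(experiments, key=lambda name: experiments[name][0])
by_tokens = sorted(experiments, key=lambda name: experiments[name][1])

for name in by_cost:
    cost, tokens = experiments[name]
    rate = cost_per_1k_tokens(cost, tokens)
    print(f"{name:24s} ${cost:.4f}  {tokens:>6,} tokens  ${rate:.5f} per 1K tokens")

print("Same ordering by cost and by tokens:", by_cost == by_tokens)
```

With these figures the two rankings coincide and the blended rate per 1,000 tokens varies little between methods, which suggests that the cost differences come almost entirely from how much context each retriever passes to the LLM.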

---

### **2. Latency Analysis (Speed)**
- **P50 Latency (Median Response Time)**
  - **Fastest:** **BM25 Retrieval (0.88s)**  
  - **Slowest:** **Ensemble Retrieval (4.41s)**
  - **Other Notables:**  
    - Contextual Compression: **1.54s**
    - Parent Document: **1.17s**
    - Multi-Query: **3.34s** (high latency)

- **P99 Latency (Worst-Case Response Time)**
  - **Fastest:** **BM25 Retrieval (1.53s)**
  - **Slowest:** **Multi-Query Retrieval (7.04s)**
  - **Other Notables:**  
    - Ensemble Retrieval: **5.19s**
    - Contextual Compression: **2.64s**

📝 **Key Takeaway:**  
If **speed** is the priority, **BM25 Retrieval** is the best choice. **Ensemble and Multi-Query Retrievals** are significantly slower and may introduce delays.
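
For readers unfamiliar with P50 and P99, the sketch below computes both from a list of per-run latencies using the nearest-rank definition. How LangSmith computes its percentiles is not stated in this notebook, so the method is an assumption, and the latency values are hypothetical rather than taken from the experiments.

```python
def percentile_nearest_rank(samples, p):
    """Return the p-th percentile (integer p in 1..100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, which avoids floating-point rounding surprises
    rank = -(-p * len(ordered) // 100)
    return ordered[rank - 1]


# Hypothetical latencies (seconds) for 10 runs, with one slow outlier
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 4.0]

p50 = percentile_nearest_rank(latencies, 50)
p99 = percentile_nearest_rank(latencies, 99)
print(f"P50 = {p50}s, P99 = {p99}s")
```

Under this definition, P99 over only 10 runs is simply the slowest run, so a single outlier sets the value. The P99 figures in the table above are therefore best read as "slowest observed run" rather than as a stable tail estimate.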

---

### **3. Performance Analysis (Tokens & Error Rate)**
- **Most Token-Intensive:** **Ensemble Retrieval (56,207 tokens)**
- **Most Efficient Token Usage:** **Parent Document Retrieval (5,412 tokens)**
- **Prompt Tokens (selected methods):**
  - **Multi-Query Retrieval:** 48,167 (high)
  - **Semantic Retrieval:** 29,782 (moderate)
  - **BM25 Retrieval:** 12,008 (efficient)
- **Error Rate:** **0% Across All Methods**
  - No chain run failed. This figure covers the chain runs only: the cell outputs above show the QA evaluator raising a `ValueError`, so it says nothing about answer correctness.

📝 **Key Takeaway:**  
If **token efficiency** is the priority, **Parent Document Retrieval** is best due to its minimal token usage, though it also had the lowest context recall (0.67) in the Ragas comparison.

---

### **Final Recommendations**
| **Use Case**          | **Best Retrieval Method** |
|----------------------|------------------------|
| **Lowest Cost**      | Parent Document Retrieval |
| **Fastest Response** | BM25 Retrieval        |
| **Best Performance (Low Tokens)** | Parent Document Retrieval |
| **Best Overall (Balance of Cost, Speed, Performance)** | **Contextual Compression Retrieval** |
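
The recommendations above can also be framed as a constrained choice: filter the methods by a cost budget, a latency budget, and a minimum recall, then take the most precise survivor. The sketch below does this with the cost and P50 figures from the LangSmith table and the recall and precision averages from the Ragas comparison. The Semantic Chunking scores are taken from the `evaluate` output in its section (0.7917 and 0.6854, rounded). The thresholds, the `MethodStats` class, and `pick_method` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodStats:
    name: str
    cost_usd: float      # total cost over 10 runs (LangSmith table)
    p50_seconds: float   # median latency (LangSmith table)
    recall: float        # average context recall (Ragas)
    precision: float     # average context precision (Ragas)


METHODS = [
    MethodStats("naive", 0.0190775, 1.62, 0.94, 0.79),
    MethodStats("bm25", 0.006952, 0.88, 0.74, 0.69),
    MethodStats("contextual_compression", 0.0067615, 1.54, 0.88, 0.98),
    MethodStats("parent_document", 0.003294, 1.17, 0.67, 0.95),
    MethodStats("multi_query", 0.0265945, 3.34, 0.97, 0.72),
    MethodStats("ensemble", 0.0296185, 4.41, 0.94, 0.74),
    MethodStats("semantic_chunking", 0.0160235, 1.19, 0.79, 0.69),
]


def pick_method(methods, max_cost_usd, max_p50_seconds, min_recall):
    """Return the most precise method that satisfies all three constraints, or None."""
    candidates = [
        m for m in methods
        if m.cost_usd <= max_cost_usd
        and m.p50_seconds <= max_p50_seconds
        and m.recall >= min_recall
    ]
    return max(candidates, key=lambda m: m.precision, default=None)


# Hypothetical budgets: at most $0.01 per 10 runs, 2 s median latency, recall of 0.8 or more
balanced = pick_method(METHODS, max_cost_usd=0.01, max_p50_seconds=2.0, min_recall=0.8)
# Recall-first: no practical cost or latency limit, recall of 0.95 or more
recall_first = pick_method(METHODS, max_cost_usd=1.0, max_p50_seconds=10.0, min_recall=0.95)
print(balanced.name, recall_first.name)
```

Under these example budgets the balanced choice is Contextual Compression and the recall-first choice is Multi-Query, matching the table above. Different budgets would select different methods, which is the point of making the trade-off explicit.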