# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

> You do not need to run the following cells if you are running this notebook locally. 

In [None]:
#!pip install -qU langchain langchain-openai langchain-cohere rank_bm25

We're also going to use [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") vector database in in-memory mode, so we can run it locally in our Colab environment.

In [None]:
#!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [3]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

--2025-03-02 13:16:04--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628 (19K) [text/plain]
Saving to: ‘john_wick_1.csv’


2025-03-02 13:16:04 (1.15 MB/s) - ‘john_wick_1.csv’ saved [19628/19628]

--2025-03-02 13:16:04--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14747 (14K) [text/plain]
Saving to: ‘john_wick_2.csv’


2025-03-02 13:16:05 (1.35 MB/s) - ‘john_wick_2.csv’

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [4]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [5]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2025, 2, 27, 13, 16, 20, 268454)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [6]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever treats each review as a document and uses cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [7]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
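
To make "cosine similarity, top `k`" concrete, here is a minimal pure-Python sketch of the idea. The three-dimensional vectors and the names `toy_doc_vecs`, `toy_query`, and `top_k` are hypothetical and only for illustration; a real vector store works on model embeddings and typically uses an optimized index rather than a full sort.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy "embeddings" (made-up values, not real model output)
toy_doc_vecs = {
    "review_a": [1.0, 0.0, 0.0],
    "review_b": [0.7, 0.7, 0.0],
    "review_c": [0.0, 0.0, 1.0],
}
toy_query = [1.0, 0.1, 0.0]

nearest = top_k(toy_query, toy_doc_vecs, k=2)
print(nearest)
```

Here `review_a` points in almost the same direction as the query, so it ranks first, while `review_c` is orthogonal to it and is dropped.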

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [8]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-3.5-turbo` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [9]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI()

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [10]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # passes the dict through unchanged and re-assigns "context" to the value it already holds
    # (effectively a no-op for this chain)
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object, which is then
    #              piped into the LLM; the model output is stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
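
To make the data flow through this chain concrete, here is a rough plain-Python imitation of its three steps. `fake_retriever`, `fake_llm`, `run_chain`, and the shortened `TOY_TEMPLATE` are hypothetical stand-ins, not LangChain objects; the real chain returns `Document` objects and a chat message rather than plain strings.

```python
TOY_TEMPLATE = "Query:\n{question}\n\nContext:\n{context}\n"

def fake_retriever(question):
    # Hypothetical stand-in for naive_retriever: returns canned "documents"
    return ["Review: stylish action.", "Review: loved the dog."]

def fake_llm(prompt):
    # Hypothetical stand-in for chat_model: reports what it received
    return f"(model saw a {len(prompt)}-character prompt)"

def run_chain(inputs, retriever, llm):
    # Step 1: build "context" and "question" from the incoming "question"
    state = {"context": retriever(inputs["question"]), "question": inputs["question"]}
    # Step 2: keep every key and set "context" again to the value it already holds
    state = {**state, "context": state["context"]}
    # Step 3: format the prompt, call the model, and return the context alongside the response
    prompt = TOY_TEMPLATE.format(question=state["question"], context=state["context"])
    return {"response": llm(prompt), "context": state["context"]}

toy_result = run_chain({"question": "Did people like John Wick?"}, fake_retriever, fake_llm)
print(toy_result)
```

The output dictionary carries both keys, which is why the cells in this notebook index into `["response"]` and why the retrieved `["context"]` remains available for evaluation later.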

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry-picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [11]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the positive reviews provided.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. Here is the URL to that review:\n\n- [Review by ymyuseda](/review/rw4854296/?ref_=tt_urv)'

In [13]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hitman comes out of retirement seeking vengeance after the gangsters kill his dog and take everything from him. To exact his revenge, he unleashes a maelstrom of destruction against those who cross him, leading to intense action and thrilling fights.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model), a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [14]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)
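
For intuition about what happens behind `BM25Retriever`, here is a small self-contained sketch of Okapi BM25 scoring. It assumes whitespace tokenization, the common "+1" variant of the IDF term, and typical defaults `k1=1.5`, `b=0.75`; the `rank_bm25` package installed earlier may differ in these details, so treat the numbers as illustrative only. `bm25_scores` and `toy_corpus` are hypothetical names.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Number of documents that contain each term at least once
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_counts = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_counts[term]
            if tf == 0:
                continue  # terms absent from the document contribute nothing
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_corpus = [
    "john wick avenges his dog",
    "a slow romantic drama",
    "wick fights assassins in rome",
]
toy_scores = bm25_scores("john wick dog", toy_corpus)
print(toy_scores)
```

Because the score depends only on shared words, a document with no query terms scores exactly zero, which helps explain why keyword retrieval can miss reviews that answer a question in different wording.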

We'll construct the same chain - only changing the retriever.

In [15]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [16]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Opinions about John Wick seem to vary. Some people really enjoyed the action and style of the movie, while others found it lacking in plot and substance. It seems like it's a matter of personal preference."

In [17]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"I don't know."

In [18]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, the main character, played by Keanu Reeves, experiences emotional setup and beautifully choreographed action sequences. It is a highly recommended movie, especially for those who enjoy action films.'

It's not clear that this is better or worse - but the `I don't know` isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [19]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
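
The retrieve-then-rerank pattern itself is simple enough to sketch without any services. In the sketch below, `word_overlap` is a deliberately crude, hypothetical stand-in for a learned reranking model such as Cohere's; only the two-stage shape (many candidates in, a few best out) is the point.

```python
def word_overlap(query, doc):
    """Toy relevance score: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

def rerank(query, candidates, score_fn, top_n):
    """Keep the top_n candidates according to a (query, document) scoring function."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

# Pretend these came back from a first-stage retriever with a generous k
first_stage = [
    "great action and stunts",
    "the dog scene was sad",
    "keanu reeves is great in this action movie",
    "too long for me",
]

compressed = rerank("great action movie", first_stage, word_overlap, top_n=2)
print(compressed)
```

This is also why the naive retriever above uses `k=10`: the first stage casts a wide net, and the second stage decides what actually reaches the prompt.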

Let's create our chain again, and see how this does!

In [20]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick based on the reviews provided in the context.'

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: /review/rw4854296/?ref_=tt_urv.'

In [23]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, the main character, John Wick, is forced back into the world of crime when a mobster asks him to carry out a hit. When he completes the task, a contract is put on him, leading to chaos and intense action.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how hard is it to set up? Not bad! Let's see it down below!



In [24]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
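
The three steps listed above (generate queries, retrieve for each, keep the unique union) can be sketched in plain Python. `fake_generate_queries`, `fake_index`, and `fake_retrieve` are hypothetical stand-ins for the LLM and the base retriever; the sketch makes no claim about the exact prompt or number of queries `MultiQueryRetriever` uses.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every generated query and return the order-preserving unique union."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

def fake_generate_queries(question):
    # Hypothetical stand-in for the LLM that rewrites the user question
    return [f"{question} reviews", f"{question} audience reaction", f"{question} ratings"]

# Hypothetical stand-in for a retriever: canned results per query
fake_index = {
    "john wick reviews": ["doc_1", "doc_2"],
    "john wick audience reaction": ["doc_2", "doc_3"],
    "john wick ratings": ["doc_1", "doc_4"],
}

def fake_retrieve(query):
    return fake_index.get(query, [])

merged = multi_query_retrieve("john wick", fake_generate_queries, fake_retrieve)
print(merged)
```

Note that the union can be noticeably larger than a single retrieval, so the prompt grows and so do token costs.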

In [25]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [26]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, it seems that people generally liked John Wick. The reviews praised the action sequences, Keanu Reeves' performance, and the overall entertainment value of the movie."

In [27]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'I\'m sorry, there are no reviews with a rating of 10 for the movie "John Wick 4."'

In [28]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick: Chapter 2, former hitman John Wick is forced back into action when an Italian crime lord calls in a favor. Wick must carry out an assignment to kill the crime lord's sister in Rome in order to fulfill this favor. However, after completing the task, the crime lord puts a bounty on Wick's head, leading to a series of intense action sequences as Wick is hunted by professional killers. The movie is filled with action, suspense, and thrilling fight scenes."

## Task 8: Parent Document Retriever

The Parent Document Retriever follows a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
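
As a rough illustration of "search small, return big", here is a toy version that uses plain dictionaries and lists. The fixed-width splitter, the substring-based `toy_similarity`, and all names are hypothetical simplifications; the real retriever searches embedded child chunks in a vector store and looks parents up in a docstore.

```python
parent_store = {}  # parent_id -> full text (plays the role of the docstore)
child_index = []   # (chunk_text, parent_id) pairs (plays the role of the vector store)

def split_into_chunks(text, chunk_size):
    """Naive fixed-width splitter, used only for this illustration."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def add_parent(parent_id, text, chunk_size=20):
    parent_store[parent_id] = text
    for chunk in split_into_chunks(text, chunk_size):
        child_index.append((chunk, parent_id))

def toy_similarity(query, chunk):
    """Count the query words that occur in the chunk (stand-in for vector similarity)."""
    return sum(1 for word in query.lower().split() if word in chunk.lower())

def retrieve_parents(query, top_children=2):
    """Rank child chunks, then return the de-duplicated parents of the best ones."""
    ranked = sorted(child_index, key=lambda item: toy_similarity(query, item[0]), reverse=True)
    parent_ids = []
    for _chunk, parent_id in ranked[:top_children]:
        if parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [parent_store[parent_id] for parent_id in parent_ids]

add_parent("review_1", "Keanu Reeves is superb. The dog subplot gives the story real heart.")
add_parent("review_2", "Too much gunfire for my taste. The pacing drags in the middle.")

results = retrieve_parents("dog subplot", top_children=1)
print(results)
```

The match happens on a 20-character chunk, but the caller receives the whole review, including the surrounding sentences the chunk alone would have lost.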

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [29]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [30]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [31]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [32]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [33]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [34]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, opinions on John Wick seem to be divided. Some people really enjoy the movie and find it to be a wild ride, while others are critical of its plot and fight scenes.'

In [35]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". The URL to that review is: /review/rw4854296/?ref_=tt_urv'

In [36]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, a retired assassin named John Wick comes out of retirement when someone kills his dog. In the sequel, John Wick 2, he is forced back into the world of assassins to pay off an old debt and ends up killing many assassins in Italy, Canada, and Manhattan.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [37]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.
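
To show what rank fusion does with those weights, here is a minimal sketch of weighted Reciprocal Rank Fusion. Each document earns `weight / (rank + c)` from every ranking it appears in, with `c = 60` following the constant suggested in the linked paper. The function name `weighted_rrf` and the toy rankings are hypothetical, and the sketch is not taken from LangChain's source.

```python
from collections import defaultdict

def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several rankings; ranks start at 1, higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Toy rankings from two hypothetical retrievers
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = weighted_rrf([bm25_ranking, dense_ranking], weights=[0.5, 0.5])
print(fused)
```

Only ranks are used, never raw scores, so retrievers with incomparable scoring scales (BM25 scores vs. cosine similarities) can be combined without any normalization.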

In [38]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [39]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'People generally liked John Wick based on the positive reviews given by critics.'

In [40]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review for "John Wick 3" with a rating of 10. Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\''

In [41]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hitman comes out of retirement to seek revenge on the gangsters that killed his dog and took everything from him, ultimately leading to a series of violent and action-packed confrontations as he faces off against numerous enemies seeking to take him down.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
   - `percentile`
   - `standard_deviation`
   - `interquartile`
   - `gradient`
3. Keeping each sequence of related sentences as a document!

Let's see how to implement this!

> NOTE: You do not need to run this cell if you're running this locally

In [None]:
#!pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example, which will calculate the distances between consecutive sentences and then break apart sequences of sentences wherever the distance exceeds a given percentile among all distances.

In [42]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)
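
The percentile rule can be sketched without any embeddings by supplying the sentence-to-sentence distances directly. The distances, the sentences, and the `pct=75` setting below are made up for illustration, and `percentile` and `chunk_by_percentile` are hypothetical helpers rather than part of `SemanticChunker`.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def chunk_by_percentile(sentences, distances, pct):
    """distances[i] is the distance between sentences[i] and sentences[i + 1]."""
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # A large jump in meaning: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

toy_sentences = [
    "The action is superb.",
    "The stunts are inventive.",
    "The soundtrack fits well.",
    "Parking at the cinema was awful.",
    "The queue was long.",
]
toy_distances = [0.10, 0.15, 0.80, 0.20]  # made-up distances between neighbours

toy_chunks = chunk_by_percentile(toy_sentences, toy_distances, pct=75)
print(toy_chunks)
```

Only the jump of 0.80 exceeds the 75th-percentile threshold, so the text splits once, right where the topic changes from the film to the cinema visit.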

Now we can split our documents.

In [43]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [44]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [45]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [46]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [47]:
semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Overall, the reviews for John Wick were mostly positive with many people enjoying it.'

In [48]:
semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10 for the movie "John Wick 3". Here is the URL to that review: \'/review/rw4854296/?ref_=tt_urv\''

In [49]:
semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, the protagonist, John Wick, seeks revenge on the people who took something he loved from him. Initially, it was his dog that was killed, leading him to unleash his lethal capacity against gangsters who also stole his car. The movie focuses on action, stylish stunts, kinetic chaos, and a relatable hero seeking vengeance.'

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
   - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
   - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
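
If you want a quick local latency number alongside whatever tracing tool you use, a small timing helper is one option. This is a sketch under assumptions: `time_calls` is a hypothetical helper, and `fake_chain` stands in for something like a chain's `invoke` so the example runs without API keys.

```python
import statistics
import time

def time_calls(fn, inputs):
    """Call fn once per input and summarize wall-clock latency in seconds."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return {
        "calls": len(latencies),
        "mean_s": statistics.mean(latencies),
        "max_s": max(latencies),
    }

def fake_chain(question):
    # Hypothetical stand-in for a retrieval chain; sleeps to simulate work
    time.sleep(0.01)
    return {"response": "..."}

stats = time_calls(fake_chain, ["question 1", "question 2", "question 3"])
print(stats)
```

Wall-clock timings like these include network variance, so compare retrievers on the same questions and treat small differences with caution.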

# Activity #1 Begins Here

In [108]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [109]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(documents, testset_size=10)

In [110]:
dataset.to_pandas()


| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | How does John Wick's premise contribute to its... | `[: 0\nReview: The best way I can describe John...` | John Wick's premise is beautifully simple, foc... | single_hop_specifc_query_synthesizer |
| 1 | So like, what is the deal with John Wick and w... | `[: 2\nReview: With the fourth installment scor...` | John Wick is a film series that has gained imm... | single_hop_specifc_query_synthesizer |
| 2 | What makes Keanu's performance in John wick st... | `[: 3\nReview: John wick has a very simple reve...` | Keanu's performance in John Wick stands out du... | single_hop_specifc_query_synthesizer |
| 3 | What happen to John Wick in the movie? | `[: 4\nReview: Though he no longer has a taste ...` | John Wick, a retired assassin known as the "Bo... | single_hop_specifc_query_synthesizer |
| 4 | What role does the Russian mob play in the plo... | `[: 5\nReview: Ultra-violent first entry with l...` | In John Wick, the Russian mob plays a signific... | single_hop_specifc_query_synthesizer |
| 5 | In what ways does John Wick's character develo... | `[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...` | In John Wick: Chapter 3 - Parabellum, John’s c... | multi_hop_specific_query_synthesizer |
| 6 | How does the action quality in John Wick 3 com... | `[<1-hop>\n\n: 10\nReview: The first John Wick ...` | John Wick 3 is described as the best action mo... | multi_hop_specific_query_synthesizer |
| 7 | What elements contribute to the uniqueness of ... | `[<1-hop>\n\n: 22\nReview: Lets contemplate abo...` | John Wick 3 is noted for creating something sp... | multi_hop_specific_query_synthesizer |
| 8 | What are the key themes and elements that make... | `[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...` | John Wick: Chapter 3 - Parabellum stands out i... | multi_hop_specific_query_synthesizer |
| 9 | Why was the latest John Wick film a disappoint... | `[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...` | The latest John Wick film was a bitter disappo... | multi_hop_specific_query_synthesizer |


In [111]:
import os
from getpass import getpass

os.environ["RAGAS_APP_TOKEN"] = getpass("Please enter your Ragas app token: ")

#### Naive Retrieval

In [112]:
for test_row in dataset:
  response = naive_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
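The same fill-in loop is repeated for every retriever in the rest of this section. Here is a minimal sketch of factoring it into a helper, assuming each chain returns a dict with a `response` object exposing `.content` and a `context` list of documents exposing `.page_content`, as the loop above does. `fill_eval_samples`, `EchoChain`, and `demo_dataset` are hypothetical names, and the stub chain exists only so the sketch runs without API keys.

```python
from types import SimpleNamespace


def fill_eval_samples(chain, dataset):
    """Run `chain` on every test row and store its answer and retrieved contexts in place."""
    for test_row in dataset:
        response = chain.invoke({"question": test_row.eval_sample.user_input})
        test_row.eval_sample.response = response["response"].content
        test_row.eval_sample.retrieved_contexts = [
            doc.page_content for doc in response["context"]
        ]
    return dataset


class EchoChain:
    """Hypothetical stand-in for a retrieval chain; it only mimics the output shape."""

    def invoke(self, inputs):
        question = inputs["question"]
        return {
            "response": SimpleNamespace(content=f"answer to: {question}"),
            "context": [SimpleNamespace(page_content=f"doc about: {question}")],
        }


# Hypothetical one-row dataset with the same attribute layout as the test rows above.
demo_dataset = [
    SimpleNamespace(
        eval_sample=SimpleNamespace(
            user_input="Who is John Wick?", response=None, retrieved_contexts=None
        )
    )
]

fill_eval_samples(EchoChain(), demo_dataset)
print(demo_dataset[0].eval_sample.response)
```

With a helper like this, each retriever section would reduce to a single call such as `fill_eval_samples(naive_retrieval_chain, dataset)`.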

In [113]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [114]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

In [115]:
from ragas.metrics import LLMContextRecall, ContextEntityRecall, NoiseSensitivity, ContextPrecision
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.9667, 'context_entity_recall': 0.2643, 'noise_sensitivity_relevant': 0.4913, 'context_precision': 0.8654}

In [116]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,How does John Wick's premise contribute to its...,[: 0\nReview: The best way I can describe John...,[: 0\nReview: The best way I can describe John...,"The premise of John Wick, which is a simple ye...","John Wick's premise is beautifully simple, foc...",1.0,0.125,0.142857,1.0
1,"So like, what is the deal with John Wick and w...",[: 2\nReview: With the fourth installment scor...,[: 2\nReview: With the fourth installment scor...,John Wick has become a beloved franchise becau...,John Wick is a film series that has gained imm...,1.0,0.25,0.272727,0.976543
2,What makes Keanu's performance in John wick st...,"[: 9\nReview: At first glance, John Wick sound...",[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu's performance in John Wick stands out du...,1.0,0.666667,0.555556,1.0
3,What happen to John Wick in the movie?,"[: 18\nReview: When the story begins, John (Ke...",[: 4\nReview: Though he no longer has a taste ...,"In the movie, ""John Wick,"" John Wick, played b...","John Wick, a retired assassin known as the ""Bo...",1.0,0.25,0.666667,0.778333
4,What role does the Russian mob play in the plo...,"[: 18\nReview: When the story begins, John (Ke...",[: 5\nReview: Ultra-violent first entry with l...,The Russian mob plays a significant role in th...,"In John Wick, the Russian mob plays a signific...",1.0,0.142857,0.692308,0.841667
5,In what ways does John Wick's character develo...,[: 24\nReview: John Wick: Chapter 3 - Parabell...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,John Wick's character development in Chapter 3...,"In John Wick: Chapter 3 - Parabellum, John’s c...",0.666667,0.0,0.0,0.666667
6,How does the action quality in John Wick 3 com...,[: 1\nReview: I'm a fan of the John Wick films...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,"I'm sorry, but I don't have specific details c...",John Wick 3 is described as the best action mo...,1.0,0.0,1.0,0.74966
7,What elements contribute to the uniqueness of ...,[: 22\nReview: Lets contemplate about componen...,[<1-hop>\n\n: 22\nReview: Lets contemplate abo...,John Wick 3 distinguishes itself from John Wic...,John Wick 3 is noted for creating something sp...,1.0,0.5,0.666667,0.814286
8,What are the key themes and elements that make...,[: 13\nReview: Following on from two delirious...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,John Wick: Chapter 3 - Parabellum is praised f...,John Wick: Chapter 3 - Parabellum stands out i...,1.0,0.375,0.25,0.885714
9,Why was the latest John Wick film a disappoint...,[: 17\nReview: There are actually quite a hand...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...","I'm sorry, I do not have specific information ...",The latest John Wick film was a bitter disappo...,1.0,0.333333,0.666667,0.941518


| Metric                      | Values                                                      |
|-----------------------------|------------------------------------------------------------|
| Retrieval method            | Naive Retrieval                                            |
| context_recall              | [1.0, 1.0, 1.0, 1.0, 1.0, 0.6667, 1.0, 1.0, 1.0, 1.0]       |
| context_entity_recall       | [0.125, 0.25, 0.6667, 0.25, 0.1429, 0.0, 0.0, 0.5, 0.375, 0.3333] |
| noise_sensitivity_relevant  | [0.1429, 0.2727, 0.5556, 0.6667, 0.6923, 0.0, 1.0, 0.6667, 0.25, 0.6667] |
| context_precision           | [1.0, 0.9765, 1.0, 0.7783, 0.8417, 0.6667, 0.7497, 0.8143, 0.8857, 0.9415] |


#### BM25 Retrieval

In [117]:
for test_row in dataset:
  response = bm25_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [118]:
from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import LLMContextRecall, ContextEntityRecall, NoiseSensitivity, ContextPrecision

In [119]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.8083, 'context_entity_recall': 0.2994, 'noise_sensitivity_relevant': 0.0950, 'context_precision': 0.6583}

In [120]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,How does John Wick's premise contribute to its...,[: 11\nReview: Who needs a 2hr and 40 min acti...,[: 0\nReview: The best way I can describe John...,"The premise of John Wick, where the protagonis...","John Wick's premise is beautifully simple, foc...",1.0,0.125,0.0,0.5
1,"So like, what is the deal with John Wick and w...","[: 9\nReview: At first glance, John Wick sound...",[: 2\nReview: With the fourth installment scor...,"John Wick is loved by many for its slickness, ...",John Wick is a film series that has gained imm...,1.0,0.25,0.25,1.0
2,What makes Keanu's performance in John wick st...,[: 3\nReview: John wick has a very simple reve...,[: 3\nReview: John wick has a very simple reve...,Keanu's performance in John Wick stands out in...,Keanu's performance in John Wick stands out du...,1.0,0.666667,0.2,0.75
3,What happen to John Wick in the movie?,[: 5\nReview: What is all the raving about wit...,[: 4\nReview: Though he no longer has a taste ...,"I'm sorry, I don't know what happens to John W...","John Wick, a retired assassin known as the ""Bo...",0.0,0.25,0.0,0.0
4,What role does the Russian mob play in the plo...,[: 14\nReview: Another significant increase in...,[: 5\nReview: Ultra-violent first entry with l...,The Russian mob plays a significant role in th...,"In John Wick, the Russian mob plays a signific...",1.0,0.285714,0.0,0.583333
5,In what ways does John Wick's character develo...,[: 24\nReview: John Wick: Chapter 3 - Parabell...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,John Wick's character development in Chapter 3...,"In John Wick: Chapter 3 - Parabellum, John’s c...",0.666667,0.0,0.4,1.0
6,How does the action quality in John Wick 3 com...,[: 10\nReview: The first John Wick film took m...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,I don't know.,John Wick 3 is described as the best action mo...,0.75,0.25,0.0,0.5
7,What elements contribute to the uniqueness of ...,[: 23\nReview: I love me a bit of the old ultr...,[<1-hop>\n\n: 22\nReview: Lets contemplate abo...,The uniqueness of John Wick 3 compared to John...,John Wick 3 is noted for creating something sp...,1.0,0.166667,0.0,1.0
8,What are the key themes and elements that make...,[: 13\nReview: Following on from two delirious...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,The key themes and elements that make John Wic...,John Wick: Chapter 3 - Parabellum stands out i...,1.0,0.0,0.1,0.916667
9,Why was the latest John Wick film a disappoint...,[: 10\nReview: The first John Wick film took m...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",The latest John Wick film was a disappointment...,The latest John Wick film was a bitter disappo...,0.666667,1.0,0.0,0.333333


In [121]:
print('Retrieval method: BM25 Retrieval')
print('context_recall', result['context_recall'])
print('context_precision', result['context_precision'])
print('noise_sensitivity_relevant', result['noise_sensitivity_relevant'])


Retrieval method: BM25 Retrieval
context_recall [1.0, 1.0, 1.0, 0.0, 1.0, 0.6666666666666666, 0.75, 1.0, 1.0, 0.6666666666666666]
context_precision [0.499999999975, 0.999999999975, 0.7499999999625, 0.0, 0.5833333333041666, 0.9999999999666667, 0.499999999975, 0.99999999995, 0.9166666666361111, 0.3333333333]
noise_sensitivity_relevant [np.float64(0.0), np.float64(0.25), np.float64(0.2), np.float64(0.0), np.float64(0.0), np.float64(0.4), np.float64(0.0), np.float64(0.0), np.float64(0.1), np.float64(0.0)]
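The raw lists printed above mix `np.float64(...)` wrappers with near-1 values such as 0.99999999995, which makes them hard to compare by eye. A small sketch of a formatter that yields the four-decimal lists used in the summary tables follows; `format_scores` is a hypothetical helper, and the sample values are copied from the BM25 `context_precision` output above.

```python
def format_scores(scores, ndigits=4):
    """Convert metric scores to plain Python floats rounded to `ndigits` decimals."""
    return [round(float(score), ndigits) for score in scores]


# Values copied from the BM25 context_precision output above.
bm25_context_precision = [
    0.499999999975, 0.999999999975, 0.7499999999625, 0.0, 0.5833333333041666,
    0.9999999999666667, 0.499999999975, 0.99999999995, 0.9166666666361111, 0.3333333333,
]

print(format_scores(bm25_context_precision))
# [0.5, 1.0, 0.75, 0.0, 0.5833, 1.0, 0.5, 1.0, 0.9167, 0.3333]
```

Calling `float()` first also strips the NumPy scalar wrapper, so the same helper would work on the `noise_sensitivity_relevant` list.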


| Metric                      | Values                                                      |
|-----------------------------|------------------------------------------------------------|
| Retrieval method            | BM25 Retrieval                                             |
| context_recall              | [1.0, 1.0, 1.0, 0.0, 1.0, 0.6667, 0.75, 1.0, 1.0, 0.6667]   |
| context_entity_recall       | [0.125, 0.25, 0.6667, 0.25, 0.2857, 0.0, 0.25, 0.1667, 0.0, 1.0] |
| noise_sensitivity_relevant  | [0.0, 0.25, 0.2, 0.0, 0.0, 0.4, 0.0, 0.0, 0.1, 0.0]         |
| context_precision           | [0.5, 1.0, 0.75, 0.0, 0.5833, 1.0, 0.5, 1.0, 0.9167, 0.3333] |


#### Contextual Compression Retrieval

In [122]:
for test_row in dataset:
  response = contextual_compression_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [123]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.6500, 'context_entity_recall': 0.2476, 'noise_sensitivity_relevant': 0.3823, 'context_precision': 0.9333}

In [124]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,How does John Wick's premise contribute to its...,[: 0\nReview: The best way I can describe John...,[: 0\nReview: The best way I can describe John...,"The premise of John Wick, which involves a man...","John Wick's premise is beautifully simple, foc...",1.0,0.125,0.0,1.0
1,"So like, what is the deal with John Wick and w...",[: 18\nReview: Ever since the original John Wi...,[: 2\nReview: With the fourth installment scor...,John Wick has garnered significant love and at...,John Wick is a film series that has gained imm...,1.0,0.25,0.5,1.0
2,What makes Keanu's performance in John wick st...,"[: 9\nReview: At first glance, John Wick sound...",[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu's performance in John Wick stands out du...,1.0,0.666667,0.454545,1.0
3,What happen to John Wick in the movie?,[: 20\nReview: After resolving his issues with...,[: 4\nReview: Though he no longer has a taste ...,John Wick is targeted by various professional ...,"John Wick, a retired assassin known as the ""Bo...",0.0,0.25,1.0,0.333333
4,What role does the Russian mob play in the plo...,[: 20\nReview: After resolving his issues with...,[: 5\nReview: Ultra-violent first entry with l...,The Russian mob plays a significant role in th...,"In John Wick, the Russian mob plays a signific...",0.666667,0.142857,0.75,1.0
5,In what ways does John Wick's character develo...,[: 24\nReview: John Wick: Chapter 3 - Parabell...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,John Wick's character development in Chapter 3...,"In John Wick: Chapter 3 - Parabellum, John’s c...",0.666667,0.0,0.285714,1.0
6,How does the action quality in John Wick 3 com...,[: 19\nReview: The inevitable third chapter of...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,"Based on the reviews provided, the action qual...",John Wick 3 is described as the best action mo...,0.0,0.25,0.272727,1.0
7,What elements contribute to the uniqueness of ...,[: 22\nReview: Lets contemplate about componen...,[<1-hop>\n\n: 22\nReview: Lets contemplate abo...,"In John Wick 3, the uniqueness compared to Joh...",John Wick 3 is noted for creating something sp...,0.666667,0.333333,0.142857,1.0
8,What are the key themes and elements that make...,[: 24\nReview: John Wick: Chapter 3 - Parabell...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,The key themes and elements that make John Wic...,John Wick: Chapter 3 - Parabellum stands out i...,1.0,0.125,0.416667,1.0
9,Why was the latest John Wick film a disappoint...,[: 14\nReview: By now you know what to expect ...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",The latest John Wick film was not a disappoint...,The latest John Wick film was a bitter disappo...,0.5,0.333333,0.0,1.0


In [125]:
print('Retrieval method: Contextual Compression Retrieval')
print('context_recall', result['context_recall'])
print('context_precision', result['context_precision'])
print('noise_sensitivity_relevant', result['noise_sensitivity_relevant'])


Retrieval method: Contextual Compression Retrieval
context_recall [1.0, 1.0, 1.0, 0.0, 0.6666666666666666, 0.6666666666666666, 0.0, 0.6666666666666666, 1.0, 0.5]
context_precision [0.9999999999666667, 0.9999999999666667, 0.99999999995, 0.3333333333, 0.9999999999666667, 0.9999999999, 0.99999999995, 0.99999999995, 0.9999999999666667, 0.9999999999666667]
noise_sensitivity_relevant [np.float64(0.0), np.float64(0.5), np.float64(0.45454545454545453), np.float64(1.0), np.float64(0.75), np.float64(0.2857142857142857), np.float64(0.2727272727272727), np.float64(0.14285714285714285), np.float64(0.4166666666666667), np.float64(0.0)]


| Metric                      | Values                                                      |
|-----------------------------|------------------------------------------------------------|
| Retrieval method            | Contextual Compression Retrieval                            |
| context_recall              | [1.0, 1.0, 1.0, 0.0, 0.6667, 0.6667, 0.0, 0.6667, 1.0, 0.5] |
| context_entity_recall       | [0.125, 0.25, 0.6667, 0.25, 0.1429, 0.0, 0.25, 0.3333, 0.125, 0.3333] |
| noise_sensitivity_relevant  | [0.0, 0.5, 0.4545, 1.0, 0.75, 0.2857, 0.2727, 0.1429, 0.4167, 0.0] |
| context_precision           | [1.0, 1.0, 1.0, 0.3333, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]       |


#### Multi-Query Retrieval


In [126]:
for test_row in dataset:
  response = multi_query_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [127]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

Exception raised in Job[34]: TimeoutError()


{'context_recall': 0.9167, 'context_entity_recall': 0.2823, 'noise_sensitivity_relevant': 0.5841, 'context_precision': 0.7933}

In [128]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,How does John Wick's premise contribute to its...,"[: 9\nReview: At first glance, John Wick sound...",[: 0\nReview: The best way I can describe John...,"The premise of John Wick, where the main chara...","John Wick's premise is beautifully simple, foc...",1.0,0.125,0.428571,0.876816
1,"So like, what is the deal with John Wick and w...",[: 18\nReview: Ever since the original John Wi...,[: 2\nReview: With the fourth installment scor...,John Wick has gained immense popularity and lo...,John Wick is a film series that has gained imm...,1.0,0.25,0.875,0.989744
2,What makes Keanu's performance in John wick st...,"[: 9\nReview: At first glance, John Wick sound...",[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick stands ...,Keanu's performance in John Wick stands out du...,1.0,0.666667,0.428571,0.881346
3,What happen to John Wick in the movie?,"[: 18\nReview: When the story begins, John (Ke...",[: 4\nReview: Though he no longer has a taste ...,"In the movie ""John Wick,"" John experiences a p...","John Wick, a retired assassin known as the ""Bo...",1.0,0.5,0.625,0.859259
4,What role does the Russian mob play in the plo...,"[: 18\nReview: When the story begins, John (Ke...",[: 5\nReview: Ultra-violent first entry with l...,The Russian mob plays a significant role in th...,"In John Wick, the Russian mob plays a signific...",1.0,0.142857,0.5,0.961735
5,In what ways does John Wick's character develo...,[: 24\nReview: John Wick: Chapter 3 - Parabell...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,"In John Wick: Chapter 3 - Parabellum, the char...","In John Wick: Chapter 3 - Parabellum, John’s c...",0.666667,0.0,0.0,0.619048
6,How does the action quality in John Wick 3 com...,"[: 9\nReview: ""John Wick: Chapter 2"" is an Ame...",[<1-hop>\n\n: 10\nReview: The first John Wick ...,"I am sorry, I do not have information comparin...",John Wick 3 is described as the best action mo...,1.0,0.25,1.0,0.558895
7,What elements contribute to the uniqueness of ...,[: 3\nReview: John wick has a very simple reve...,[<1-hop>\n\n: 22\nReview: Lets contemplate abo...,The elements that contribute to the uniqueness...,John Wick 3 is noted for creating something sp...,1.0,0.0,0.6,0.640415
8,What are the key themes and elements that make...,[: 13\nReview: Following on from two delirious...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,John Wick: Chapter 3 - Parabellum is praised f...,John Wick: Chapter 3 - Parabellum stands out i...,1.0,0.555556,,0.661088
9,Why was the latest John Wick film a disappoint...,[: 17\nReview: There are actually quite a hand...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",The latest John Wick film was considered a dis...,The latest John Wick film was a bitter disappo...,0.5,0.333333,0.8,0.884354
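The `TimeoutError` reported for this run leaves one `noise_sensitivity_relevant` score empty (NaN) in the table above. The aggregate of 0.5841 matches the mean of the nine remaining scores, so the missing sample appears to be skipped rather than counted as zero. A minimal sketch of that NaN-aware mean follows; `nan_mean` and `count_nan` are hypothetical helper names, and the values are copied from the table above.

```python
import math


def nan_mean(scores):
    """Mean of the scores that are not NaN; NaN if every score is missing."""
    valid = [score for score in scores if not math.isnan(score)]
    return sum(valid) / len(valid) if valid else math.nan


def count_nan(scores):
    """Number of samples whose metric could not be computed."""
    return sum(1 for score in scores if math.isnan(score))


# noise_sensitivity_relevant for Multi-Query Retrieval, copied from the table above.
noise = [0.428571, 0.875, 0.428571, 0.625, 0.5, 0.0, 1.0, 0.6, math.nan, 0.8]

print(round(nan_mean(noise), 4))  # 0.5841
print(count_nan(noise))           # 1
```

Checking `count_nan` before comparing retrievers helps flag aggregates that were computed over fewer than ten samples.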


In [129]:
print('Retrieval method:  Multi-Query Retrieval')
print('context_recall', result['context_recall'])
print('context_precision', result['context_precision'])
print('noise_sensitivity_relevant', result['noise_sensitivity_relevant'])


Retrieval method:  Multi-Query Retrieval
context_recall [1.0, 1.0, 1.0, 1.0, 1.0, 0.6666666666666666, 1.0, 1.0, 1.0, 0.5]
context_precision [0.8768161131717784, 0.9897435897359763, 0.8813455988367854, 0.8592592592449383, 0.9617346938638119, 0.6190476190352381, 0.5588945776375914, 0.6404151404059917, 0.661087677746999, 0.8843537414839651]
noise_sensitivity_relevant [np.float64(0.42857142857142855), np.float64(0.875), np.float64(0.42857142857142855), np.float64(0.625), np.float64(0.5), np.float64(0.0), np.float64(1.0), np.float64(0.6), nan, np.float64(0.8)]


| Metric                      | Values                                                      |
|-----------------------------|------------------------------------------------------------|
| Retrieval method            | Multi-Query Retrieval                                       |
| context_recall              | [1.0, 1.0, 1.0, 1.0, 1.0, 0.6667, 1.0, 1.0, 1.0, 0.5]       |
| context_entity_recall       | [0.125, 0.25, 0.6667, 0.5, 0.1429, 0.0, 0.25, 0.0, 0.5556, 0.3333] |
| noise_sensitivity_relevant  | [0.4286, 0.875, 0.4286, 0.625, 0.5, 0.0, 1.0, 0.6, nan, 0.8] |
| context_precision           | [0.8768, 0.9897, 0.8813, 0.8593, 0.9617, 0.6190, 0.5589, 0.6404, 0.6611, 0.8844] |


#### Parent Document Retrieval



In [130]:
for test_row in dataset:
  response = parent_document_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [131]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.4250, 'context_entity_recall': 0.2300, 'noise_sensitivity_relevant': 0.3475, 'context_precision': 0.9417}

In [132]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,How does John Wick's premise contribute to its...,[: 18\nReview: Ever since the original John Wi...,[: 0\nReview: The best way I can describe John...,"The simple premise of John Wick, where the pro...","John Wick's premise is beautifully simple, foc...",1.0,0.125,0.0,1.0
1,"So like, what is the deal with John Wick and w...",[: 2\nReview: With the fourth installment scor...,[: 2\nReview: With the fourth installment scor...,John Wick is loved by many because it features...,John Wick is a film series that has gained imm...,1.0,0.25,0.25,0.833333
2,What makes Keanu's performance in John wick st...,[: 23\nReview: Rating 10/10\nI was able to cat...,[: 3\nReview: John wick has a very simple reve...,Keanu Reeves' performance in John Wick is high...,Keanu's performance in John Wick stands out du...,0.0,0.333333,0.714286,1.0
3,What happen to John Wick in the movie?,[: 19\nReview: If you've seen the first John W...,[: 4\nReview: Though he no longer has a taste ...,"In the movie ""John Wick 2"", John Wick is calle...","John Wick, a retired assassin known as the ""Bo...",0.0,0.25,1.0,1.0
4,What role does the Russian mob play in the plo...,[: 20\nReview: After resolving his issues with...,[: 5\nReview: Ultra-violent first entry with l...,"In the plot of John Wick, the Russian mob play...","In John Wick, the Russian mob plays a signific...",0.666667,0.285714,0.0,1.0
5,In what ways does John Wick's character develo...,[: 24\nReview: John Wick: Chapter 3 - Parabell...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,John Wick: Chapter 3 - Parabellum explores the...,"In John Wick: Chapter 3 - Parabellum, John’s c...",0.666667,0.0,0.0,1.0
6,How does the action quality in John Wick 3 com...,[: 1\nReview: I'm a fan of the John Wick films...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,"Based on the reviews provided, the action qual...",John Wick 3 is described as the best action mo...,0.25,0.5,0.5,1.0
7,What elements contribute to the uniqueness of ...,[: 1\nReview: I'm a fan of the John Wick films...,[<1-hop>\n\n: 22\nReview: Lets contemplate abo...,"In John Wick 3, one element that contributes t...",John Wick 3 is noted for creating something sp...,0.333333,0.0,0.4,1.0
8,What are the key themes and elements that make...,[: 13\nReview: Following on from two delirious...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,The key themes and elements that make John Wic...,John Wick: Chapter 3 - Parabellum stands out i...,0.333333,0.222222,0.166667,1.0
9,Why was the latest John Wick film a disappoint...,[: 14\nReview: By now you know what to expect ...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",The latest John Wick film was a disappointment...,The latest John Wick film was a bitter disappo...,0.0,0.333333,0.444444,0.583333


In [133]:
print('Retrieval method: Parent Document Retrieval')
print('context_recall', result['context_recall'])
print('context_precision', result['context_precision'])
print('noise_sensitivity_relevant', result['noise_sensitivity_relevant'])


Retrieval method: Parent Document Retrieval
context_recall [1.0, 1.0, 0.0, 0.0, 0.6666666666666666, 0.6666666666666666, 0.25, 0.3333333333333333, 0.3333333333333333, 0.0]
context_precision [0.99999999995, 0.8333333332916666, 0.9999999999, 0.9999999999, 0.99999999995, 0.99999999995, 0.99999999995, 0.9999999999, 0.9999999999, 0.5833333333041666]
noise_sensitivity_relevant [np.float64(0.0), np.float64(0.25), np.float64(0.7142857142857143), np.float64(1.0), np.float64(0.0), np.float64(0.0), np.float64(0.5), np.float64(0.4), np.float64(0.16666666666666666), np.float64(0.4444444444444444)]


| Metric                      | Values                                                      |
|-----------------------------|------------------------------------------------------------|
| Retrieval method            | Parent Document Retrieval                                   |
| context_recall              | [1.0, 1.0, 0.0, 0.0, 0.6667, 0.6667, 0.25, 0.3333, 0.3333, 0.0] |
| context_entity_recall       | [0.125, 0.25, 0.3333, 0.25, 0.2857, 0.0, 0.5, 0.0, 0.2222, 0.3333] |
| noise_sensitivity_relevant  | [0.0, 0.25, 0.7143, 1.0, 0.0, 0.0, 0.5, 0.4, 0.1667, 0.4444] |
| context_precision           | [1.0, 0.8333, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.5833]    |


#### Ensemble Retrieval


In [134]:
for test_row in dataset:
  response = ensemble_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [135]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

Exception raised in Job[26]: TimeoutError()
Exception raised in Job[34]: TimeoutError()


{'context_recall': 0.9667, 'context_entity_recall': 0.3167, 'noise_sensitivity_relevant': 0.3934, 'context_precision': 0.7591}

In [136]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,How does John Wick's premise contribute to its...,[: 0\nReview: The best way I can describe John...,[: 0\nReview: The best way I can describe John...,The premise of John Wick contributes to its su...,"John Wick's premise is beautifully simple, foc...",1.0,0.125,0.111111,0.893759
1,"So like, what is the deal with John Wick and w...",[: 18\nReview: Ever since the original John Wi...,[: 2\nReview: With the fourth installment scor...,John Wick has gained immense popularity and lo...,John Wick is a film series that has gained imm...,1.0,0.25,0.625,0.92684
2,What makes Keanu's performance in John wick st...,[: 3\nReview: John wick has a very simple reve...,[: 3\nReview: John wick has a very simple reve...,Keanu's performance in John Wick stands out in...,Keanu's performance in John Wick stands out du...,1.0,0.666667,0.555556,0.855175
3,What happen to John Wick in the movie?,[: 19\nReview: If you've seen the first John W...,[: 4\nReview: Though he no longer has a taste ...,John Wick faces significant challenges in the ...,"John Wick, a retired assassin known as the ""Bo...",1.0,0.5,0.5,0.551474
4,What role does the Russian mob play in the plo...,[: 20\nReview: After resolving his issues with...,[: 5\nReview: Ultra-violent first entry with l...,The Russian mob plays a significant role in th...,"In John Wick, the Russian mob plays a signific...",1.0,0.166667,0.4,0.986111
5,In what ways does John Wick's character develo...,[: 24\nReview: John Wick: Chapter 3 - Parabell...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,"In John Wick: Chapter 3, the character develop...","In John Wick: Chapter 3 - Parabellum, John’s c...",0.666667,0.0,0.4,0.542177
6,How does the action quality in John Wick 3 com...,[: 1\nReview: I'm a fan of the John Wick films...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,"Based on the context provided, it appears that...",John Wick 3 is described as the best action mo...,1.0,0.0,,0.756443
7,What elements contribute to the uniqueness of ...,[: 1\nReview: I'm a fan of the John Wick films...,[<1-hop>\n\n: 22\nReview: Lets contemplate abo...,The unique elements that contribute to the dis...,John Wick 3 is noted for creating something sp...,1.0,0.666667,0.555556,0.705208
8,What are the key themes and elements that make...,[: 13\nReview: Following on from two delirious...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,The key themes and elements that make John Wic...,John Wick: Chapter 3 - Parabellum stands out i...,1.0,0.125,,0.730682
9,Why was the latest John Wick film a disappoint...,[: 14\nReview: By now you know what to expect ...,"[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...",I don't know the specific reasons why the late...,The latest John Wick film was a bitter disappo...,1.0,0.666667,0.0,0.642842


In [137]:
print('Retrieval method: Ensemble Retrieval')
print('context_recall', result['context_recall'])
print('context_precision', result['context_precision'])
print('noise_sensitivity_relevant', result['noise_sensitivity_relevant'])


Retrieval method: Ensemble Retrieval
context_recall [1.0, 1.0, 1.0, 1.0, 1.0, 0.6666666666666666, 1.0, 1.0, 1.0, 1.0]
context_precision [0.8937585671349617, 0.9268402893341103, 0.855175380167606, 0.5514739228946162, 0.9861111110987848, 0.5421768707405539, 0.7564425770213568, 0.7052083333245183, 0.7306818181726847, 0.6428421259702533]
noise_sensitivity_relevant [np.float64(0.1111111111111111), np.float64(0.625), np.float64(0.5555555555555556), np.float64(0.5), np.float64(0.4), np.float64(0.4), nan, np.float64(0.5555555555555556), nan, np.float64(0.0)]


| Metric                      | Values (rounded to 4 decimals, taken from the printed output above)               |
|-----------------------------|------------------------------------------------------------------------------------|
| Retrieval method            | Ensemble Retrieval                                                                 |
| context_recall              | [1.0, 1.0, 1.0, 1.0, 1.0, 0.6667, 1.0, 1.0, 1.0, 1.0]                              |
| context_precision           | [0.8938, 0.9268, 0.8552, 0.5515, 0.9861, 0.5422, 0.7564, 0.7052, 0.7307, 0.6428]   |
| noise_sensitivity_relevant  | [0.1111, 0.625, 0.5556, 0.5, 0.4, 0.4, nan, 0.5556, nan, 0.0]                      |
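
The per-sample `noise_sensitivity_relevant` list printed above contains two `nan` entries, so it is worth checking how a single aggregate score can be derived from it. The sketch below is a minimal illustration, assuming that NaN samples are simply skipped when averaging; the helper name `nan_mean` is hypothetical and is not part of Ragas.

```python
import math

# Per-sample noise_sensitivity_relevant scores for Ensemble Retrieval,
# copied from the printed output above (nan = no score for that sample).
noise_scores = [
    0.1111111111111111, 0.625, 0.5555555555555556, 0.5, 0.4, 0.4,
    float("nan"), 0.5555555555555556, float("nan"), 0.0,
]


def nan_mean(values):
    """Average the values, skipping NaN entries (hypothetical helper)."""
    valid = [v for v in values if not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")


print(round(nan_mean(noise_scores), 4))  # prints 0.3934
```

Under that assumption the result agrees with the 0.3934 reported for Ensemble Retrieval in the final comparison table, which suggests the aggregate is computed over the eight samples that received a score.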


#### Semantic Chunking Retrieval

In [138]:
for test_row in dataset:
  response = semantic_retrieval_chain.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"].content
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [139]:
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())
evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))
custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), ContextEntityRecall(), NoiseSensitivity(), ContextPrecision()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/40 [00:00<?, ?it/s]

{'context_recall': 0.8667, 'context_entity_recall': 0.2375, 'noise_sensitivity_relevant': 0.5083, 'context_precision': 0.8191}

In [140]:
result.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,context_recall,context_entity_recall,noise_sensitivity_relevant,context_precision
0,How does John Wick's premise contribute to its...,[John Wick (Reeves) is out to seek revenge on ...,[: 0\nReview: The best way I can describe John...,"The premise of John Wick, where the main chara...","John Wick's premise is beautifully simple, foc...",1.0,0.125,0.083333,0.962654
1,"So like, what is the deal with John Wick and w...",[: 2\nReview: With the fourth installment scor...,[: 2\nReview: With the fourth installment scor...,The John Wick movies are loved for their exper...,John Wick is a film series that has gained imm...,1.0,0.25,0.285714,1.0
2,What makes Keanu's performance in John wick st...,[this movie so amazing !! One of the best acti...,[: 3\nReview: John wick has a very simple reve...,"Keanu's performance in ""John Wick"" stands out ...",Keanu's performance in John Wick stands out du...,0.333333,0.333333,0.6,0.982143
3,What happen to John Wick in the movie?,[: 19\nReview: If you've seen the first John W...,[: 4\nReview: Though he no longer has a taste ...,"In the movie ""John Wick 2,"" John Wick is invol...","John Wick, a retired assassin known as the ""Bo...",1.0,0.5,0.666667,0.64881
4,What role does the Russian mob play in the plo...,"[A few days later some thugs, led by the son o...",[: 5\nReview: Ultra-violent first entry with l...,The Russian mob plays a significant role in th...,"In John Wick, the Russian mob plays a signific...",1.0,0.166667,0.6,0.961735
5,In what ways does John Wick's character develo...,[: 24\nReview: John Wick: Chapter 3 - Parabell...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,John Wick's character development in Chapter 3...,"In John Wick: Chapter 3 - Parabellum, John’s c...",0.666667,0.0,0.142857,0.75
6,How does the action quality in John Wick 3 com...,[: 19\nReview: The inevitable third chapter of...,[<1-hop>\n\n: 10\nReview: The first John Wick ...,"Based on the context provided, the action qual...",John Wick 3 is described as the best action mo...,1.0,0.0,0.454545,0.855159
7,What elements contribute to the uniqueness of ...,[: 5\nReview: The first John Wick film was spe...,[<1-hop>\n\n: 22\nReview: Lets contemplate abo...,I don't know the specifics of John Wick 3 comp...,John Wick 3 is noted for creating something sp...,0.666667,0.0,1.0,0.468254
8,What are the key themes and elements that make...,[: 13\nReview: Following on from two delirious...,[<1-hop>\n\n: 24\nReview: John Wick: Chapter 3...,John Wick: Chapter 3 - Parabellum is praised f...,John Wick: Chapter 3 - Parabellum stands out i...,1.0,0.0,0.5,0.895139
9,Why was the latest John Wick film a disappoint...,"[But we'll get to that in a bit...! Anyway, I ...","[<1-hop>\n\n: 16\nReview: Ok, so I got back fr...","I don't know, but based on the provided contex...",The latest John Wick film was a bitter disappo...,1.0,1.0,0.75,0.666667


In [141]:
print('Retrieval method: Semantic Chunking Retrieval')
print('context_recall', result['context_recall'])
print('context_precision', result['context_precision'])
print('noise_sensitivity_relevant', result['noise_sensitivity_relevant'])


Retrieval method: Semantic Chunking Retrieval
context_recall [1.0, 1.0, 0.3333333333333333, 1.0, 1.0, 0.6666666666666666, 1.0, 0.6666666666666666, 1.0, 1.0]
context_precision [0.9626543209769582, 0.99999999999, 0.9821428571288265, 0.6488095237933036, 0.9617346938638119, 0.7499999999625, 0.8551587301444776, 0.4682539682422619, 0.8951388888776997, 0.66666666665]
noise_sensitivity_relevant [np.float64(0.08333333333333333), np.float64(0.2857142857142857), np.float64(0.6), np.float64(0.6666666666666666), np.float64(0.6), np.float64(0.14285714285714285), np.float64(0.45454545454545453), np.float64(1.0), np.float64(0.5), np.float64(0.75)]


| Metric                      | Values (rounded to 4 decimals, taken from the printed output above)               |
|-----------------------------|------------------------------------------------------------------------------------|
| Retrieval method            | Semantic Chunking Retrieval                                                        |
| context_recall              | [1.0, 1.0, 0.3333, 1.0, 1.0, 0.6667, 1.0, 0.6667, 1.0, 1.0]                        |
| context_precision           | [0.9627, 1.0, 0.9821, 0.6488, 0.9617, 0.75, 0.8552, 0.4683, 0.8951, 0.6667]        |
| noise_sensitivity_relevant  | [0.0833, 0.2857, 0.6, 0.6667, 0.6, 0.1429, 0.4545, 1.0, 0.5, 0.75]                 |


## Final retriever comparison

| Retrieval Method               | Context Recall | Context Entity Recall | Noise Sensitivity Relevant | Context Precision |
|--------------------------------|----------------|-----------------------|---------------------------|------------------|
| Naive Retrieval               | 0.9667        | 0.2643               | 0.4913                   | 0.8654          |
| BM25 Retrieval                | 0.8083        | 0.2994               | 0.0950                   | 0.6583          |
| Contextual Compression Retrieval | 0.6500        | 0.2476               | 0.3823                   | 0.9333          |
| Multi-Query Retrieval         | 0.9167        | 0.2823               | 0.5841                   | 0.7933          |
| Parent-Document Retrieval     | 0.4250        | 0.2300               | 0.3475                   | 0.9417          |
| Ensemble Retrieval            | 0.9667        | 0.3167               | 0.3934                   | 0.7591          |
| Semantic Chunking Retrieval    | 0.8667        | 0.2375               | 0.5083                   | 0.8191          |


# Comparison summary

Based on the evaluation metrics and the John Wick review dataset, here's an analysis of the best retrieval methods:

The Ensemble Retriever appears to be the most well-rounded choice for this particular dataset. Here's why:

1. **Highest Context Entity Recall (0.3167)**: This is crucial for a movie review dataset where specific entities (characters, actors, locations) are important for providing accurate context. The Ensemble Retriever's ability to capture these entities better than other methods makes it particularly suitable for movie-related queries.

2. **Strong Context Recall (0.9667)**: It ties with Naive Retrieval for the highest recall, meaning it successfully retrieves most relevant information from the reviews. This is essential when users ask broad questions about plot points or general reception.

3. **Balanced Noise Sensitivity (0.3934)**: Lower is better for this metric, and BM25 (0.0950) scores best. The Ensemble Retriever still sits in the middle of the pack while maintaining high recall, suggesting it can handle the varied writing styles and subjective nature of movie reviews.

4. **Reasonable Context Precision (0.7591)**: Though not the highest, it maintains good precision while achieving high recall, striking a practical balance for this use case.

While Parent-Document Retrieval shows the highest precision (0.9417) and Contextual Compression shows excellent precision (0.9333), their significantly lower recall scores (0.4250 and 0.6500 respectively) make them less suitable for a review dataset where comprehensive coverage of opinions and plot points is important. The Ensemble Retriever's ability to combine the strengths of multiple approaches while mitigating their individual weaknesses makes it the most effective choice for handling diverse queries about movie reviews.
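
To make the precision/recall trade-off described above concrete, here is a small sketch that combines the two scores from the final comparison table into a single number. The choice of a harmonic mean (F1-style) is an assumption made for illustration; it is not a Ragas metric and it ignores entity recall and noise sensitivity, which the analysis above also weighs.

```python
# (context_recall, context_precision) pairs copied from the final comparison table.
scores = {
    "Naive Retrieval": (0.9667, 0.8654),
    "BM25 Retrieval": (0.8083, 0.6583),
    "Contextual Compression Retrieval": (0.6500, 0.9333),
    "Multi-Query Retrieval": (0.9167, 0.7933),
    "Parent-Document Retrieval": (0.4250, 0.9417),
    "Ensemble Retrieval": (0.9667, 0.7591),
    "Semantic Chunking Retrieval": (0.8667, 0.8191),
}


def harmonic_mean(recall, precision):
    """F1-style combination: a low value on either side pulls the score down."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Rank the methods from best to worst on the combined score.
ranked = sorted(scores, key=lambda name: harmonic_mean(*scores[name]), reverse=True)

for name in ranked:
    print(f"{name:<34} {harmonic_mean(*scores[name]):.4f}")
```

On this particular combination the high-precision, low-recall methods fall to the bottom, which mirrors the argument above. The ordering at the top depends on which metrics are included, so treat it as one view of the data rather than a verdict.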


#### Evaluations

In [142]:
import os
import getpass
from uuid import uuid4
from langsmith import Client

In [165]:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

In [200]:
client = Client(api_key=os.environ["LANGCHAIN_API_KEY"])
dataset_name = "JW Retrieval Methods"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Evaluating JW docs (Naive, BM25, Contextual Compression, Multi-Query, Parent-Document, Ensemble, Semantic Chunking)"
)

In [201]:
for data_row in dataset.to_pandas().iterrows():
    print(data_row[1]["user_input"])
    print(data_row[1]["reference"])
    print(data_row[1]["reference_contexts"])
    print(langsmith_dataset.id)
    print("--------------------------------")
    break

How does John Wick's premise contribute to its success as an action movie?
John Wick's premise is beautifully simple, focusing on revenge for something the protagonist loves, which allows the film to deliver awesome action, stylish stunts, and kinetic chaos, all tied together by a relatable hero. This simplicity is key to its success, especially compared to more convoluted action films.
[": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity."]
114fe

In [202]:
for data_row in dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row[1]["user_input"]
      },
      outputs={
          "answer": data_row[1]["reference"]
      },
      metadata={
          "context": data_row[1]["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

In [203]:
eval_llm = ChatOpenAI(model="gpt-4o-mini")

In [215]:
# Note: this `evaluate` comes from LangSmith and shadows the Ragas `evaluate` used earlier.
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm": eval_llm})


def prepare_labeled_data(run, example):
    """Map a chain run and its dataset example to the fields the evaluator expects."""
    outputs = run.outputs
    # The retrieval chains return a dict whose "response" entry is a chat message;
    # fall back to the raw outputs if that key is missing.
    response = outputs["response"] if isinstance(outputs, dict) and "response" in outputs else outputs
    prediction = response.content if hasattr(response, "content") else str(response)
    return {
        "prediction": prediction,
        "reference": str(example.outputs["answer"]),
        "input": str(example.inputs["question"]),
    }


def make_criteria_evaluator(name, description):
    """Build a labeled-criteria evaluator for a single named criterion."""
    return LangChainStringEvaluator(
        "labeled_criteria",
        config={"criteria": {name: description}, "llm": eval_llm},
        prepare_data=prepare_labeled_data,
    )


labeled_helpfulness_evaluator = make_criteria_evaluator(
    "helpfulness",
    "Is this submission helpful to the user,"
    " taking into account the correct reference answer?",
)

relevance_evaluator = make_criteria_evaluator(
    "relevance",
    "Does the response directly address the user's question in a relevant manner?",
)

grounded_relevance_evaluator = make_criteria_evaluator(
    "grounded_relevance",
    "Is the response factually accurate and grounded based on the reference answer?",
)

retrieval_quality_evaluator = make_criteria_evaluator(
    "retrieval_quality",
    "How well does the response leverage retrieved documents to answer the question?",
)


In [216]:
naive_retrieval_evaluation = evaluate(
    naive_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        relevance_evaluator,
        grounded_relevance_evaluator,
        retrieval_quality_evaluator 
    ],
    experiment_prefix="naive_retrieval_chain",
    metadata={"revision_id": "naive_retrieval_chain"},
)

View the evaluation results for experiment: 'naive_retrieval_chain-dbee2207' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/114fed35-8bbe-4294-b7e8-7918690cc88a/compare?selectedSessions=72fa1aeb-ac96-4379-981d-db82438072d1




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 5270a1ce-e62b-4a21-b6f1-8868eb493d14: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

In [217]:
bm25_retrieval_evaluation = evaluate(
    bm25_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        relevance_evaluator,
        grounded_relevance_evaluator,
        retrieval_quality_evaluator 
    ],
    experiment_prefix="bm25_retrieval_evaluation",
    metadata={"revision_id": "bm25_retrieval_evaluation"},
)

View the evaluation results for experiment: 'bm25_retrieval_evaluation-f7abf53b' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/114fed35-8bbe-4294-b7e8-7918690cc88a/compare?selectedSessions=5edb61c4-4deb-4367-91ef-74d12bd538b9




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 1fe9f776-b4e3-40e8-9ad6-4fa24372d94e: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

View the evaluation results for experiment: 'crushing-rule-24' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/b80af506-cfd1-4a4e-a21c-a4026acb5854/compare?selectedSessions=264ee670-6bef-427f-a6be-14cf01e00c64

In [218]:
contextual_compression_retrieval_evaluation = evaluate(
    contextual_compression_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        relevance_evaluator,
        grounded_relevance_evaluator,
        retrieval_quality_evaluator
    ],
    experiment_prefix="contextual_compression_retrieval_evaluation",
    metadata={"revision_id": "contextual_compression_retrieval_evaluation"},
)

View the evaluation results for experiment: 'contextual_compression_retrieval_evaluation-3fb2da7e' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/114fed35-8bbe-4294-b7e8-7918690cc88a/compare?selectedSessions=3b8958a9-038e-4393-9a79-2d9046b868d7




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run d321ac49-6031-4187-9186-cd02baf30729: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

View the evaluation results for experiment: 'dependable-watch-31' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/b80af506-cfd1-4a4e-a21c-a4026acb5854/compare?selectedSessions=a29faba0-1232-4dcd-b986-7fdb8145e83b

In [219]:
multi_query_retrieval_evaluation = evaluate(
    multi_query_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        relevance_evaluator,
        grounded_relevance_evaluator,
        retrieval_quality_evaluator
    ],
    experiment_prefix="multi_query_retrieval_evaluation",
    metadata={"revision_id": "multi_query_retrieval_evaluation"},
)

View the evaluation results for experiment: 'multi_query_retrieval_evaluation-67cd3e0d' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/114fed35-8bbe-4294-b7e8-7918690cc88a/compare?selectedSessions=036f077e-f330-4fce-97af-41e9a999a8d9




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run d502ec66-6724-41d1-815d-348e122e1cb3: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

View the evaluation results for experiment: 'proper-competition-5' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/b80af506-cfd1-4a4e-a21c-a4026acb5854/compare?selectedSessions=a718c41c-766f-41f4-95df-8a12d241274d

In [220]:
parent_document_retrieval_evaluation = evaluate(
    parent_document_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        relevance_evaluator,
        grounded_relevance_evaluator,
        retrieval_quality_evaluator
    ],
    experiment_prefix="parent_document_retrieval_evaluation",
    metadata={"revision_id": "parent_document_retrieval_evaluation"},
)

View the evaluation results for experiment: 'parent_document_retrieval_evaluation-54a6308b' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/114fed35-8bbe-4294-b7e8-7918690cc88a/compare?selectedSessions=566f664f-12cc-4692-b80f-7eaf2c7660dc




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run cbd23a1b-ab46-4beb-abd4-51aa0d5273b5: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

Error running evaluator <DynamicRunEvaluator evaluate> on run b330f8b2-3655-4b96-81f3-d4cbbc4101b6: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

View the evaluation results for experiment: 'long-library-86' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/b80af506-cfd1-4a4e-a21c-a4026acb5854/compare?selectedSessions=7323a50a-f3b2-48d5-9c47-6a67b9b832da

In [221]:
ensemble_retrieval_evaluation = evaluate(
    ensemble_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        relevance_evaluator,
        grounded_relevance_evaluator,
        retrieval_quality_evaluator
    ],
    experiment_prefix="ensemble_retrieval_evaluation",
    metadata={"revision_id": "ensemble_retrieval_evaluation"},
)

View the evaluation results for experiment: 'ensemble_retrieval_evaluation-27dbe483' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/114fed35-8bbe-4294-b7e8-7918690cc88a/compare?selectedSessions=c084e2c3-1b5a-455b-87a5-f180f9ddd587




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 05cd2bde-0fcc-41e5-8808-8d8243fa4b8e: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

View the evaluation results for experiment: 'bold-value-12' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/b80af506-cfd1-4a4e-a21c-a4026acb5854/compare?selectedSessions=1d221440-72e3-4489-8f11-fcdd6de957e2

In [222]:
semantic_retrieval_evaluation = evaluate(
    semantic_retrieval_chain.invoke,
    data=dataset_name,
    evaluators=[
        qa_evaluator,
        labeled_helpfulness_evaluator,
        relevance_evaluator,
        grounded_relevance_evaluator,
        retrieval_quality_evaluator
    ],
    experiment_prefix="semantic_retrieval_evaluation",
    metadata={"revision_id": "semantic_retrieval_evaluation"},
)

View the evaluation results for experiment: 'semantic_retrieval_evaluation-7ef7849c' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/114fed35-8bbe-4294-b7e8-7918690cc88a/compare?selectedSessions=58cbfded-55f3-4bfe-b6aa-957a97919059




0it [00:00, ?it/s]

Error running evaluator <DynamicRunEvaluator evaluate> on run 2b7141ef-5da5-47cf-8a78-ddfa42aa51ac: ValueError('Evaluator verbose=False prompt=PromptTemplate(input_variables=[\'answer\', \'query\', \'result\'], input_types={}, partial_variables={}, template="You are a teacher grading a quiz.\\nYou are given a question, the student\'s answer, and the true answer, and are asked to score the student answer as either CORRECT or INCORRECT.\\n\\nExample Format:\\nQUESTION: question here\\nSTUDENT ANSWER: student\'s answer here\\nTRUE ANSWER: true answer here\\nGRADE: CORRECT or INCORRECT here\\n\\nGrade the student answers based ONLY on their factual accuracy. Ignore differences in punctuation and phrasing between the student answer and true answer. It is OK if the student answer contains more information than the true answer, as long as it does not contain any conflicting statements. Begin! \\n\\nQUESTION: {query}\\nSTUDENT ANSWER: {result}\\nTRUE ANSWER: {answer}\\nGRADE:") llm=ChatOpenAI(

View the evaluation results for experiment: 'flowery-dinner-94' at:
https://smith.langchain.com/o/5893d499-6998-4f44-a84d-2fcf6d99ac9b/datasets/b80af506-cfd1-4a4e-a21c-a4026acb5854/compare?selectedSessions=383a032e-b9ca-4d97-a459-719554d1119b

### Results

### Analysis Report: Best Retrieval Method Evaluation

#### Overview
The analysis compares different retrieval methods based on three key factors: **Cost**, **Latency**, and **Performance**. The goal is to identify the best retrieval method for the given data.

#### Evaluation Criteria
1. **Cost**: Measured indirectly by P50 and P99 latencies.
2. **Latency**: Lower latency indicates faster response time.
3. **Performance**: Evaluated through Grounded Relevance, Helpfulness, Relevance, Retrieval Quality, and Final Retriever Comparison metrics.
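
For reference, P50 and P99 are the 50th and 99th percentiles of the per-run latencies. The sketch below shows one common way to compute them, the nearest-rank method. The latency values are made up for illustration, and LangSmith may use a different interpolation rule, so its reported numbers can differ slightly.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical per-run latencies in seconds (not taken from the experiments above).
latencies = [1.2, 0.9, 1.1, 2.4, 1.0, 1.3, 0.8, 1.5, 1.1, 1.9]

p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
print(f"P50 = {p50:.2f}s, P99 = {p99:.2f}s")
```

With only ten runs per experiment, P99 is effectively the slowest single run, so it is sensitive to one-off delays such as a slow API call.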

#### Summary of Results
| Method                         | P50 Latency | P99 Latency | Grounded Relevance | Helpfulness | Relevance | Retrieval Quality | Context Recall | Context Precision | Noise Sensitivity | Final Score |
|--------------------------------|-------------|-------------|-------------------|-------------|-----------|------------------|---------------|----------------|----------------|-------------|
| Naive Retrieval               | 1.96s      | 2.44s      | 5               | 5           | 7        | 3                | **0.9667**   | 0.8654        | 0.4913        | 7.5        |
| BM25 Retrieval                | **1.09s**  | **1.53s**  | 6               | 4           | 5        | 5                | 0.8083       | 0.6583        | **0.0950**    | 6          |
| Contextual Compression        | 2.43s      | 5.21s      | 6               | 4           | 7        | 3                | 0.6500       | 0.9333        | 0.3823        | 6.5        |
| Multi-Query Retrieval         | 3.42s      | 6.35s      | 6               | 4           | 6        | 4                | 0.9167       | 0.7933        | 0.5841        | 6.5        |
| Parent-Document Retrieval     | 2.14s      | 3.01s      | 7               | 3           | 7        | 3                | 0.4250       | **0.9417**    | 0.3475        | 6.25       |
| Ensemble Retrieval            | 6.02s      | 9.84s      | 6               | 4           | 5        | 9                | **0.9667**   | 0.7591        | 0.3934        | **7.75**   |
| Semantic Chunking Retrieval    | 1.92s      | 4.19s      | 8               | 2           | 6        | 4                | 0.8667       | 0.8191        | 0.5083        | 7          |

#### Key Observations
- **Naive Retrieval** ties with Ensemble Retrieval for the best context recall (0.9667) and has the third-highest precision (0.8654), but its noise sensitivity (0.4913) is relatively high.
- **BM25 Retrieval** is the fastest method with the lowest noise sensitivity but sacrifices context recall and precision.
- **Ensemble Retrieval** combines the best context recall with the highest retrieval quality but has the highest latency of all methods (P50 of 6.02s).
- **Semantic Chunking Retrieval** balances performance and latency, making it the most consistent method.
- **Parent-Document Retrieval** (0.9417) and **Contextual Compression** (0.9333) achieve the highest precision, but at the cost of recall (0.4250 and 0.6500 respectively).

#### Conclusion
The **Semantic Chunking Retrieval** method emerges as the best option based on the following:
- Balanced performance across all evaluation metrics.
- Moderate latency and cost.
- Consistent retrieval quality.

However, if **maximum recall** is required and **latency is not a concern**, **Ensemble Retrieval** is the best option.

If **speed and cost-efficiency** are the highest priority, **BM25 Retrieval** is the ideal choice.
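
The conditional recommendations above can be expressed as a simple selection rule: filter by a latency budget, then pick the best remaining method. The sketch below uses the P99 latency and context recall values from the summary table. Using context recall as the deciding metric, and the function name `best_under_budget`, are assumptions for illustration only.

```python
# P99 latency (seconds) and context recall copied from the summary table.
results = {
    "Naive Retrieval": {"p99_s": 2.44, "context_recall": 0.9667},
    "BM25 Retrieval": {"p99_s": 1.53, "context_recall": 0.8083},
    "Contextual Compression": {"p99_s": 5.21, "context_recall": 0.6500},
    "Multi-Query Retrieval": {"p99_s": 6.35, "context_recall": 0.9167},
    "Parent-Document Retrieval": {"p99_s": 3.01, "context_recall": 0.4250},
    "Ensemble Retrieval": {"p99_s": 9.84, "context_recall": 0.9667},
    "Semantic Chunking Retrieval": {"p99_s": 4.19, "context_recall": 0.8667},
}


def best_under_budget(results, max_p99_seconds):
    """Return the method with the highest context recall whose P99 latency fits the budget."""
    candidates = {
        name: row for name, row in results.items() if row["p99_s"] <= max_p99_seconds
    }
    if not candidates:
        return None  # nothing is fast enough for this budget
    return max(candidates, key=lambda name: candidates[name]["context_recall"])


for budget in (2.0, 5.0):
    print(f"P99 budget {budget:.1f}s -> {best_under_budget(results, budget)}")
```

Swapping `context_recall` for another column, or for a weighted combination, changes the answer, so the rule is only as good as the metric chosen to drive it.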

