# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic Chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

> You do not need to run the following cells if you are running this notebook locally. 

In [None]:
#!pip install -qU langchain langchain-openai langchain-cohere rank_bm25


We'll also use [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") vector database in in-memory mode, so we can run it locally in our Colab environment.

In [None]:
#!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

In [3]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")
os.environ["LANGCHAIN_PROJECT"] = "Comparing Retrievals"

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [4]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

--2025-05-16 14:52:34--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628 (19K) [text/plain]
Saving to: ‘john_wick_1.csv’


2025-05-16 14:52:34 (78.0 MB/s) - ‘john_wick_1.csv’ saved [19628/19628]

--2025-05-16 14:52:34--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14747 (14K) [text/plain]
Saving to: ‘john_wick_2.csv’


2025-05-16 14:52:35 (3.07 MB/s) - ‘john_wick_2.csv’

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [5]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [6]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2025, 5, 13, 14, 52, 38, 299252)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [7]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each review as a document, and use cosine similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [8]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
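
To make "fetch the `k` most similar documents" concrete, here is a minimal pure-Python sketch of cosine-similarity ranking. It is not how Qdrant implements search; the function names and the toy 2-dimensional vectors are assumptions made purely for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the indices of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]

# Toy 2-dimensional "embeddings" (hypothetical values, for illustration only).
toy_doc_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]
toy_query_vec = [0.9, 0.1]
toy_top2 = top_k(toy_query_vec, toy_doc_vecs, k=2)
```

In the real retriever the vectors come from `text-embedding-3-small` and have 1536 dimensions, but the ranking idea is the same.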

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [9]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [14]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [15]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : re-assigned to itself via RunnablePassthrough.assign, so the whole dict
    #              (including "question") passes through unchanged to the next step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
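
If the LCEL syntax feels dense, the data flow can be sketched in plain Python. This walk-through does not use LangChain at all: `run_rag_chain` and the lambda stand-ins for the retriever, prompt, and model are hypothetical names used only to show which keys exist at each stage.

```python
def run_rag_chain(inputs, retriever, format_prompt, llm):
    """Plain-Python walk-through of the three chain stages."""
    # Stage 1: build {"context", "question"} from the incoming {"question"} dict.
    state = {"context": retriever(inputs["question"]), "question": inputs["question"]}
    # Stage 2: re-assign "context" to itself; the state passes through unchanged.
    state = {**state, "context": state["context"]}
    # Stage 3: format the prompt, call the model, and keep the context next to the answer.
    return {"response": llm(format_prompt(state)), "context": state["context"]}

# Hypothetical stand-ins for the retriever, prompt template, and chat model.
toy_output = run_rag_chain(
    {"question": "Who stars in John Wick?"},
    retriever=lambda q: ["Keanu Reeves plays John Wick."],
    format_prompt=lambda s: f"Query:\n{s['question']}\n\nContext:\n{s['context']}",
    llm=lambda prompt: f"(model reply to {len(prompt)} prompt characters)",
)
```

Returning the context alongside the response is what lets us inspect (and later evaluate) exactly which documents each retriever supplied.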

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [16]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, people generally liked John Wick. Many reviews are highly positive, praising its action sequences, style, and Keanu Reeves' performance, with ratings like 9 and 10 out of 10. Several reviewers mention that it is a must-see for action fans and highlight its entertainment value. While there are some mixed reviews with lower ratings (such as a 5 or 6), the overall tone suggests that the film was well-received and appreciated by most viewers."

In [17]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there are reviews with a rating of 10. The URLs to those reviews are:\n\n1. [Review URL: /review/rw4854296/?ref_=tt_urv](https://example.com/review/rw4854296/?ref_=tt_urv)\n\nPlease note that the URLs provided are as per the data; you may need to append the base URL of the review site to access them directly.'

In [18]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the John Wick film series, the story centers around John Wick, a retired and highly skilled assassin, who is pulled back into the violent underworld of crime after personal tragedy strikes. The first film depicts Wick seeking vengeance for the killing of his dog and the theft of his car by criminals, which leads him to unleash a brutal and meticulously orchestrated rampage against those who cross him. As the series progresses, he becomes embroiled in complex criminal alliances and conflicts, with themes of revenge, consequence, and the code of the assassin world playing significant roles. The films are known for their stylish action sequences, deep world-building, and Keanu Reeves’ compelling portrayal of Wick.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on the [Bag-of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) model, which is a sparse representation of text.

In essence, it scores how well a document matches a query based on the words they share, giving more weight to rarer words and normalizing for document length.

This retriever is very straightforward to set up! Let's see it happen down below!


In [19]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)
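
To show what happens under the hood, here is a simplified BM25 scorer in pure Python. It is a sketch and not the exact formula used by `rank_bm25`: the whitespace tokenizer, the non-negative IDF variant, and the `k1`/`b` values are assumptions chosen for readability.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with a simplified BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # Longer-than-average documents are penalized via length normalization.
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

toy_corpus = [
    "john wick is a stylish action movie",
    "the dog was the best part of the movie",
    "a slow romantic drama",
]
toy_scores = bm25_scores("stylish action", toy_corpus)
best_match = toy_scores.index(max(toy_scores))
```

Notice that a document sharing no words with the query scores exactly zero. That is the main weakness of sparse retrieval compared to embeddings: it cannot match synonyms or paraphrases.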

We'll construct the same chain - only changing the retriever.

In [20]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [21]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, people\'s opinions on John Wick are mixed. Some reviews highly praise the first film for its stylish action and straightforward plot, suggesting that many viewers liked it. However, other reviews, like the one for John Wick 4, are more critical, describing the latest installment as "almost three hours of nothing" and the weakest in the series. Additionally, the review for John Wick 3 describes it as "boring, dull, and full of stereotypes," indicating some viewers did not enjoy it.\n\nOverall, while many fans appreciate the style, action, and Keanu Reeves\' performance, others feel the series has declined or do not enjoy certain installments. Therefore, people\'s general opinion about John Wick varies, with a significant number of fans liking it, but there are also a notable proportion who are not favorable toward it.'

In [22]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Based on the reviews provided, there are no reviews with a rating of 10.'

In [23]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the John Wick film series, the story revolves around John Wick, a former hitman who is drawn back into the violent assassin world after a series of events. The first movie, "John Wick," features Keanu Reeves as John Wick, who seeks vengeance after criminals steal his car and kill his dog, a gift from his deceased wife. The series showcases intense, highly choreographed action scenes and a complex underworld with its own rules and societies. As the series progresses, Wick navigates deep into the assassin universe, facing various enemies and challenges, all while dealing with themes of revenge, honor, and survival.'

It's not clear that this is better or worse - but the incorrect claim that no reviews have a rating of 10 isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [24]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
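
The retrieve-then-rerank pattern can be sketched without any external service. In the sketch below, `toy_relevance` is a hypothetical word-overlap scorer standing in for a real reranking model like Cohere's, and the candidate strings are made up for illustration.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Keep the `top_n` candidates with the highest relevance score."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

def toy_relevance(query, doc):
    """Hypothetical stand-in for a reranking model: fraction of query words found in the doc."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

# Pretend these came back from a first-stage retriever with a large k.
toy_candidates = [
    "great action and great stunts",
    "the plot is thin",
    "keanu reeves delivers great action",
    "i fell asleep",
]
toy_compressed = rerank("great action keanu", toy_candidates, toy_relevance, top_n=2)
```

The first stage is cheap and broad, the second stage is more expensive and precise, so we only pay the reranking cost on a handful of candidates.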

Let's create our chain again, and see how this does!

In [25]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [26]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, people generally liked John Wick. The reviews are highly positive, with ratings of 9 and 10 out of 10, and enthusiastic descriptions praising its action sequences, style, and Keanu Reeves' performance. However, there is at least one less favorable review with a rating of 5, indicating some viewers' opinions may vary. Overall, the majority of the feedback in the provided context suggests that people generally liked John Wick."

In [27]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there are reviews with a rating of 10. Here are the URLs to those reviews:\n\n1. Review Title: "A Masterpiece & Brilliant Sequel"\n   URL: /review/rw4854296/?ref_=tt_urv\n\n2. Review Title: "Most American action flicks released these days have poor screenplays and overuse computer-generated imagery. The John Wick franchise is one of the few exceptions, along with Mission Impossible. These franchises keep getting better with every entry."\n   URL: /review/rw4860412/?ref_=tt_urv'

In [28]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the John Wick movies, John Wick is a retired hitman who initially comes out of retirement to avenge the killing of his dog and the theft of his car, which were tied to the loss of his wife. He is a highly skilled and lethal assassin who is targeted by various criminal organizations after he reenters the world of violence. Throughout the series, Wick faces challenges from mobsters and hitmen as he seeks to settle old scores and deal with threats to his life. His actions often violate mafia rules, which leads to him being pursued by professional killers, and he must navigate a dangerous world of crime, revenge, and loyalty.'

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` new queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context.

So, how hard is it to set up? Not bad! Let's see it down below!



In [29]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
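
The core of the approach is "retrieve per query, then take the unique union". Here is a minimal sketch of that step. The query generator and the lookup table are hypothetical stand-ins: a real setup would call an LLM to rewrite the question and a vector store to retrieve.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for every query variant and return the unique documents in first-seen order."""
    seen = set()
    unique_docs = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

# Hypothetical rewrites an LLM might produce for the user's question.
def toy_generate_queries(question):
    return [
        "audience opinion of john wick",
        "john wick review scores",
        "is john wick worth watching",
    ]

# Hypothetical retrieval results for each rewrite.
toy_index = {
    "audience opinion of john wick": ["review A", "review B"],
    "john wick review scores": ["review B", "review C"],
    "is john wick worth watching": ["review C", "review D"],
}

toy_docs = multi_query_retrieve(
    "Did people like John Wick?", toy_generate_queries, lambda q: toy_index[q]
)
```

Because overlapping results are deduplicated, the final context can be anywhere between `k` and `n * k` documents, which is worth keeping in mind for prompt length.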

In [30]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [31]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, people generally liked John Wick. Many reviews are highly positive, praising its action sequences, style, and Keanu Reeves\' performance. Some reviews gave high ratings like 9 or 10 out of 10, and phrases like "I cannot recommend this movie enough," "slick, violent fun," and "a must-see for action fans" indicate a strong positive reception. However, there are some mixed or negative opinions as well, with a few reviewers finding the films lacking in plot or criticizing the over-the-top violence. Overall, the general consensus suggests that people tend to like John Wick, especially fans of action movies.'

In [32]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there are reviews with a rating of 10. The URLs to those reviews are:\n\n- /review/rw4854296/?ref_=tt_urv (Review titled "A Masterpiece & Brilliant Sequel" for John Wick 3)\n- /review/rw4862630/?ref_=tt_urv (Review titled "Less is more" for John Wick 3)'

In [33]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the John Wick film series, the story centers around John Wick, a retired hitman who is drawn back into a violent underworld of assassins. The series begins with Wick seeking revenge after thugs steal his car and kill his beloved dog, which his late wife had gifted him, unleashing a relentless pursuit of vengeance against those who wronged him. As the series progresses, Wick becomes embroiled in a complex criminal universe involving a global assassin community, a powerful crime organization called the High Table, and a series of deadly conflicts. Throughout the movies, Wick fights to reclaim his peace while facing escalating threats, relentless assassins, and the consequences of his violent past. The series is known for its stylized action sequences, well-choreographed fight scenes, and exploration of themes like revenge, honor, and consequence.'

## Task 8: Parent Document Retriever

The Parent Document Retriever follows a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent documents".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
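
The "search small, return big" idea can be sketched in a few lines of plain Python. Everything here is a toy assumption: the fixed-width splitter stands in for a real text splitter, and substring matching stands in for vector similarity search.

```python
def split_into_chunks(text, chunk_size):
    """Naive fixed-width splitter (a stand-in for a real text splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_indexes(parent_docs, chunk_size):
    """Return (docstore, child_index): parents by id, and child chunks tagged with a parent id."""
    docstore = dict(enumerate(parent_docs))
    child_index = [
        (chunk, parent_id)
        for parent_id, text in docstore.items()
        for chunk in split_into_chunks(text, chunk_size)
    ]
    return docstore, child_index

def retrieve_parents(query, docstore, child_index):
    """Match the query against child chunks, then return each matching chunk's parent once."""
    parent_ids = []
    for chunk, parent_id in child_index:
        # Substring match is a toy replacement for vector similarity search.
        if query in chunk and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]

toy_parents = [
    "The action is superb. The dog scene is heartbreaking.",
    "A dull sequel with no stakes.",
]
toy_docstore, toy_children = build_indexes(toy_parents, chunk_size=25)
toy_result = retrieve_parents("dog", toy_docstore, toy_children)
```

The query only matches one small chunk, yet the caller receives the full review, including the sentence about the action that the matching chunk alone would have dropped.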

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [34]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [35]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [36]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [37]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [38]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [39]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the provided reviews, people\'s opinions on John Wick vary. Some reviewers, like MrHeraclius, express strong positive feelings and highly recommend the series, suggesting that many people do like the movies. Conversely, there is at least one negative review, such as the one by solidabs, who found John Wick 4 to be "horrible" and criticized various aspects of the film. Overall, the reviews indicate mixed opinions, with some people liking the series and others not enjoying certain installments.'

In [40]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The URL for that review is: /review/rw4854296/?ref_=tt_urv'

In [41]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the John Wick movies, John Wick is a retired assassin who is drawn back into a violent world of killing and revenge. In the first film, he comes out of retirement after a gang steals his car and kills his dog, which were personal losses that ignite his desire for vengeance. He then unleashes a relentless and brutal attack on those responsible. The sequel, John Wick Chapter 2, continues his story as he is pulled back into the dangerous assassin world when an Italian crime boss calls in a favor, leading Wick through a series of intense shootouts and action sequences across various locations. Overall, the series portrays John Wick as a highly skilled and deadly hitman who operates in a gritty, violent universe driven by revenge and consequence.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [42]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.
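
For intuition, here is a minimal sketch of weighted Reciprocal Rank Fusion over lists of document ids. The constant `c=60` follows the value suggested in the RRF paper linked above; whether LangChain's implementation matches this sketch detail-for-detail is an assumption, and the rankings are made up.

```python
def weighted_rrf(ranked_lists, weights, c=60):
    """Fuse several ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    scores = {}
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each appearance contributes more the closer it is to the top of its list.
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

toy_rankings = [
    ["a", "b", "c"],  # e.g. a keyword retriever
    ["b", "a", "d"],  # e.g. a dense retriever
    ["b", "c", "a"],  # e.g. a reranking retriever
]
toy_weights = [1 / 3] * 3
toy_fused = weighted_rrf(toy_rankings, toy_weights)
```

RRF only looks at ranks, never at raw scores, which is why it can combine retrievers whose scores live on completely different scales (BM25 scores vs. cosine similarities, for example).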

In [43]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [44]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews in the context provided, people generally seem to like John Wick. Several reviews rate the film highly, praising its action sequences, style, and entertainment value, with ratings like 9, 8, and 10 out of 10. Many reviewers describe the film as fun, stylish, and a must-see for action fans. However, there are some mixed or negative reviews as well, especially for later installments, with ratings of 1, 2, or 4, criticizing the over-the-top action and perceived lack of plot or realism. Overall, the majority of reviews express a positive reception toward John Wick.'

In [45]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there are reviews with a rating of 10. One such review can be found at the following URL: \n\n/ review / rw4854296 / ?ref_=tt_urv\n\nPlease note that the URL is relative; you may need to add the base website address to access it directly.'

In [46]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'John Wick is about a retired hitman, played by Keanu Reeves, who comes out of retirement to seek revenge after a violent home invasion that results in the death of his dog and the theft of his car. The story highlights his relentless quest for vengeance against those who wronged him, unleashing a series of brutal and highly choreographed action sequences. Throughout the series, Wick navigates a criminal underworld, dealing with various enemies, including gangsters, assassins, and criminal organizations, while highlighting themes of revenge, consequences, and survival.'

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
   - `percentile`
   - `standard_deviation`
   - `interquartile`
   - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

> NOTE: You do not need to run this cell if you're running this locally

In [47]:
#!pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example, which will:

Calculate the distances between consecutive sentences, and then split the text wherever a distance exceeds a given percentile of all distances.

In [48]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)
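
Here is a rough pure-Python sketch of percentile-based splitting. It assumes the distances between consecutive sentence embeddings are already computed; the distance values, the `pct` parameter name, and its default are hypothetical and not taken from `SemanticChunker`.

```python
def percentile(values, pct):
    """Linear-interpolation percentile of a list of numbers."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def chunk_by_percentile(sentences, distances, pct=95):
    """Group sentences, starting a new chunk wherever a gap distance exceeds the percentile."""
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], distances):
        if gap > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

toy_sentences = [
    "The fights are great.",
    "The stunts are great too.",
    "I had popcorn.",
    "It was salty.",
]
# toy_distances[i] is the (hypothetical) embedding distance between sentence i and i + 1.
toy_distances = [0.10, 0.80, 0.15]
toy_chunks = chunk_by_percentile(toy_sentences, toy_distances, pct=50)
```

A higher percentile means fewer, larger chunks, since only the very largest jumps in meaning trigger a split.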

Now we can split our documents.

In [49]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [50]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [51]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [52]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [53]:
semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, people generally liked John Wick. Many reviews are highly positive, praising its action, style, and entertainment value. Ratings often range from 8 to 10 out of 10, indicating a favorable reception. However, there are some less favorable opinions as well, with a few reviews giving low ratings or expressing that the film has lost some of its magic. Overall, the majority of reviewers seem to have liked the franchise.'

In [54]:
semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The URL to that review is: /review/rw4854296/?ref_=tt_urv'

In [55]:
semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In the John Wick movies, the main story follows a retired assassin named John Wick, played by Keanu Reeves, who seeks revenge after a series of tragic events. The original film begins with John Wick living a peaceful life after leaving his violent career. However, his peaceful existence is shattered when a young Russian punk and his accomplices break into his house, beat him up, kill his dog—his last remaining connection to his late wife—and steal his car. Unaware of his identity as a legendary hitman, they trigger John Wick's return to violence. Enraged and driven by revenge, Wick unleashes his lethal skills against those who wronged him, drawing the attention of criminal underworld forces, mobsters, and relentless killers. The story revolves around his quest for justice, revenge, and the consequences of his actions as he battles to protect what remains of his peace."

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

In [56]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/keatnuxsuo/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/keatnuxsuo/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


True

In [57]:
from langchain_community.document_loaders import CSVLoader
from langchain_community.document_loaders import DirectoryLoader
import os
import glob

# Path to your CSV files directory
path = "data/"

# Method 1: Using DirectoryLoader with CSVLoader as the loader_cls
csv_loader = DirectoryLoader(
    path,
    glob="*.csv",  # Only get CSV files
    loader_cls=CSVLoader
)

# Load all CSV files
try:
    movie_docs = csv_loader.load()
    print(f"Loaded {len(movie_docs)} documents from {path}")
except Exception as e:
    print(f"Error with DirectoryLoader: {e}")


Loaded 100 documents from data/


### Generate synthetic data (SDG) for evaluation

In [63]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [64]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(movie_docs, testset_size=10)

Applying SummaryExtractor:   0%|          | 0/84 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]

Node 167161cf-f8e1-4b16-ab62-80e2967b8eac does not have a summary. Skipping filtering.
Node beec6591-9182-4870-a7a5-c4b23b7220d3 does not have a summary. Skipping filtering.
Node d10ca170-2e65-4812-8432-a342ec864872 does not have a summary. Skipping filtering.
Node 3648ad64-eeba-43c2-b8ce-022c84e676ce does not have a summary. Skipping filtering.
Node 4b9f2de0-e3d3-481d-8c3e-38807307bdb7 does not have a summary. Skipping filtering.
Node 98cf99ea-e3a6-46fa-a088-b69fa38dcc1b does not have a summary. Skipping filtering.
Node ddbfe3ed-b47e-4906-a5d7-9c5ffa469f45 does not have a summary. Skipping filtering.
Node f94f9b3c-1adb-4e76-8e9b-05469972bd55 does not have a summary. Skipping filtering.
Node b732795c-3533-46bf-8437-8f297d1d2ad6 does not have a summary. Skipping filtering.
Node 846fb9e4-e984-4605-8214-a67b3ccaa9d4 does not have a summary. Skipping filtering.
Node 399cb61f-66c5-436d-b98e-adf0f5a1d524 does not have a summary. Skipping filtering.
Node dd5b85c9-91fb-4dd8-b196-bf2daa6fbf3c d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/257 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [125]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Liam Neeson in John Wick?,[: 0\nReview_Date: 6 May 2015\nAuthor: lnvicta...,The review compares John Wick to Taken but wit...,single_hop_specifc_query_synthesizer
1,Who is CountJonnie?,[: 1\nReview_Date: 17 January 2015\nAuthor: Co...,CountJonnie is the author of the review dated ...,single_hop_specifc_query_synthesizer
2,Why is John Wick like so popular and why peopl...,[: 2\nReview_Date: 5 May 2023\nAuthor: Coventr...,The review mentions that after the success of ...,single_hop_specifc_query_synthesizer
3,What makes John Wick stand out among action mo...,[: 3\nReview_Date: 28 September 2018\nAuthor: ...,"John Wick has a very simple revenge story, but...",single_hop_specifc_query_synthesizer
4,How do the critiqe of action scenes and backgr...,[<1-hop>\n\n: 3\nReview_Date: 27 May 2019\nAut...,"The critique of action sequences, noting that ...",multi_hop_abstract_query_synthesizer
5,How does the fight choreography in John Wick c...,[<1-hop>\n\n: 3\nReview_Date: 28 September 201...,The fight choreography in John Wick significan...,multi_hop_abstract_query_synthesizer
6,How does the revent thriller nerrative in John...,[<1-hop>\n\n: 21\nReview_Date: 8 February 2017...,"The review describes John Wick as a bloodier, ...",multi_hop_abstract_query_synthesizer
7,How do critics criticize the artistic quality ...,[<1-hop>\n\n: 9\nReview_Date: 30 March 2023\nA...,The reviews indicate that the John Wick movies...,multi_hop_abstract_query_synthesizer
8,John Wick 3 Parabellum and John Wick 4 how the...,[<1-hop>\n\n: 20\nReview_Date: 23 March 2023\n...,"According to the reviews, John Wick: Chapter 3...",multi_hop_specific_query_synthesizer
9,Chapter 3 better than Chapter 2 or not?,[<1-hop>\n\n: 4\nReview_Date: 13 February 2017...,"Based on the reviews, Chapter 2 is described a...",multi_hop_specific_query_synthesizer


### Create Evaluation Dataset (Question/Answer pairs) for each retriever

In [65]:
import pandas as pd
from ragas import EvaluationDataset

def create_eval_dataset(retriever_chain, dataset, name="retrieval"):
    """
    create a Ragas EvaluationDataset
    
    Parameters:
    -----------
    retriever_chain : The retrieval chain to evaluate
    dataset : The dataset containing test samples
    name : Name of the retriever (for logging purposes)
    
    Returns:
    --------
    EvaluationDataset: A Ragas evaluation dataset containing the results
    """
    retrieval_data = []
    
    # Run the evaluation for this retriever
    for test_row in dataset:
        response = retriever_chain.invoke({"question": test_row.eval_sample.user_input})
        
        # Store results in the correct format for Ragas
        retrieval_data.append({
            "user_input": test_row.eval_sample.user_input,
            "retrieved_contexts": [context.page_content for context in response["context"]],
            "reference_contexts": test_row.eval_sample.reference_contexts,
            "response": response["response"].content,
            "reference": test_row.eval_sample.reference
        })
    
    # Create the Ragas EvaluationDataset
    eval_dataset = EvaluationDataset.from_list(retrieval_data)
    
    print(f"Created {name} dataset with {len(retrieval_data)} entries")
    return eval_dataset


In [66]:
eval_naive_retrieval_data = create_eval_dataset(naive_retrieval_chain, dataset, "naive_retrieval")

Created naive_retrieval dataset with 12 entries


In [68]:
eval_naive_retrieval_data = create_eval_dataset(naive_retrieval_chain, dataset, "naive_retrieval")
eval_bm25_retrieval_data = create_eval_dataset(bm25_retrieval_chain, dataset, "bm25_retrieval")
eval_multi_retrieval_data = create_eval_dataset(multi_query_retrieval_chain, dataset, "multi_query_retrieval")
eval_parent_retrieval_data = create_eval_dataset(parent_document_retrieval_chain, dataset, "parent_document_retrieval")
eval_ensemble_retrieval_data = create_eval_dataset(ensemble_retrieval_chain, dataset, "ensemble_retrieval")

Created naive_retrieval dataset with 12 entries
Created bm25_retrieval dataset with 12 entries
Created multi_query_retrieval dataset with 12 entries
Created parent_document_retrieval dataset with 12 entries
Created ensemble_retrieval dataset with 12 entries


In [77]:
# Run separately because of the API rate limit
eval_compression_retrieval_data = create_eval_dataset(contextual_compression_retrieval_chain, dataset, "compression_retrieval")

Created compression_retrieval dataset with 12 entries


### Run RAGAS Evaluation Metrics for all retrievers

In [69]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

def evaluate_retriever_performance(eval_dataset, evaluator_llm, name="retrieval", timeout=360):
    """
    Evaluate a retriever's performance using Ragas metrics
    
    Parameters:
    -----------
    eval_dataset : The Ragas EvaluationDataset containing retrieval results
    evaluator_llm : The LLM to use for evaluation
    name : Name of the retriever (for reporting purposes)
    timeout : Timeout in seconds for the evaluation
    
    Returns:
    --------
    The Ragas evaluation result object (per-sample scores are available via `.scores`)
    """
    # Set up custom run configuration
    custom_run_config = RunConfig(timeout=timeout)
    
    # Run the evaluation
    print(f"Evaluating {name} performance...")
    results = evaluate(
        dataset=eval_dataset,
        metrics=[
            LLMContextRecall(), 
            Faithfulness(), 
            FactualCorrectness(), 
            ResponseRelevancy(), 
            ContextEntityRecall(), 
            NoiseSensitivity()
        ],
        llm=evaluator_llm,
        run_config=custom_run_config
    )
    
    print(f"Evaluation of {name} complete!")
    return results

In [70]:
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

In [71]:
naive_results = evaluate_retriever_performance(eval_naive_retrieval_data, evaluator_llm, "naive_retrieval")

Evaluating naive_retrieval performance...


Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[17]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[65]: TimeoutError()


Evaluation of naive_retrieval complete!


In [72]:
naive_results

{'context_recall': 0.6319, 'faithfulness': 0.8979, 'factual_correctness(mode=f1)': 0.3500, 'answer_relevancy': 0.8556, 'context_entity_recall': 0.5266, 'noise_sensitivity(mode=relevant)': 0.4306}

In [73]:
bm25_results = evaluate_retriever_performance(eval_bm25_retrieval_data, evaluator_llm, "bm25_retrieval")
bm25_results

Evaluating bm25_retrieval performance...


Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[47]: TimeoutError()


Evaluation of bm25_retrieval complete!


{'context_recall': 0.6458, 'faithfulness': 0.8410, 'factual_correctness(mode=f1)': 0.4025, 'answer_relevancy': 0.7922, 'context_entity_recall': 0.5188, 'noise_sensitivity(mode=relevant)': 0.2823}

In [78]:
compression_results = evaluate_retriever_performance(eval_compression_retrieval_data, evaluator_llm, "compression_retrieval")
compression_results

Evaluating compression_retrieval performance...


Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Evaluation of compression_retrieval complete!


{'context_recall': 0.5903, 'faithfulness': 0.8661, 'factual_correctness(mode=f1)': 0.4233, 'answer_relevancy': 0.7900, 'context_entity_recall': 0.5097, 'noise_sensitivity(mode=relevant)': 0.3218}

In [74]:
multi_results = evaluate_retriever_performance(eval_multi_retrieval_data, evaluator_llm, "multi_query_retrieval")
multi_results

Evaluating multi_query_retrieval performance...


Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[17]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[65]: TimeoutError()
Exception raised in Job[71]: TimeoutError()


Evaluation of multi_query_retrieval complete!


{'context_recall': 0.7014, 'faithfulness': 0.9674, 'factual_correctness(mode=f1)': 0.3600, 'answer_relevancy': 0.8702, 'context_entity_recall': 0.4472, 'noise_sensitivity(mode=relevant)': 0.5091}

In [75]:
parent_results = evaluate_retriever_performance(eval_parent_retrieval_data, evaluator_llm, "parent_document_retrieval")
parent_results

Evaluating parent_document_retrieval performance...


Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Evaluation of parent_document_retrieval complete!


{'context_recall': 0.5486, 'faithfulness': 0.6469, 'factual_correctness(mode=f1)': 0.4608, 'answer_relevancy': 0.7883, 'context_entity_recall': 0.5990, 'noise_sensitivity(mode=relevant)': 0.1934}

In [76]:
ensemble_results = evaluate_retriever_performance(eval_ensemble_retrieval_data, evaluator_llm, "ensemble_retrieval")
ensemble_results

Evaluating ensemble_retrieval performance...


Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

Exception raised in Job[17]: TimeoutError()
Exception raised in Job[23]: TimeoutError()
Exception raised in Job[29]: TimeoutError()
Exception raised in Job[35]: TimeoutError()
Exception raised in Job[41]: TimeoutError()
Exception raised in Job[47]: TimeoutError()
Exception raised in Job[53]: TimeoutError()
Exception raised in Job[59]: TimeoutError()
Exception raised in Job[65]: TimeoutError()
Exception raised in Job[71]: TimeoutError()


Evaluation of ensemble_retrieval complete!


{'context_recall': 0.8333, 'faithfulness': 0.9552, 'factual_correctness(mode=f1)': 0.4175, 'answer_relevancy': 0.8654, 'context_entity_recall': 0.4681, 'noise_sensitivity(mode=relevant)': 0.1667}

### Combining all the metric results

In [120]:
import pandas as pd
import numpy as np

# List of all result variables
all_results = [naive_results, bm25_results, compression_results, multi_results, parent_results, ensemble_results]
retriever_names = ['Naive', 'BM25', 'Contextual Compression', 'Multi-Query', 'Parent Document', 'Ensemble']

# Collect all unique metric names across every per-sample score dict
all_metrics = set()
for result in all_results:
    if hasattr(result, 'scores') and isinstance(result.scores, list) and len(result.scores) > 0:
        for score_dict in result.scores:
            if isinstance(score_dict, dict):
                all_metrics.update(score_dict.keys())

# Create DataFrame with results
data = {'Retriever': retriever_names}

# Function to safely compute mean of a metric across all score dictionaries
def compute_metric_mean(result_obj, metric_name):
    if not hasattr(result_obj, 'scores') or not isinstance(result_obj.scores, list):
        return None
    
    values = []
    for score_dict in result_obj.scores:
        if isinstance(score_dict, dict) and metric_name in score_dict:
            value = score_dict[metric_name]
            if value is not None and not (isinstance(value, float) and np.isnan(value)):
                values.append(float(value))
    
    if values:
        return np.mean(values)
    return None

# Add each metric to the DataFrame
for metric in sorted(all_metrics):
    data[metric.capitalize().replace('_', ' ')] = [
        compute_metric_mean(result, metric) for result in all_results
    ]
# Create the comparison DataFrame
comparison_df = pd.DataFrame(data)

# Format numeric values as percentages with 2 decimal places
for col in comparison_df.columns:
    if col != 'Retriever' and col != 'Total Cost ($)':
        comparison_df[col] = comparison_df[col].apply(
            lambda x: f"{x*100:.2f}%" if isinstance(x, (int, float)) else x
        )

Get costs, latency, and tokens from LangSmith

In [109]:
import os
from langsmith import Client
import pandas as pd

# Initialize LangSmith client
client = Client(api_key=os.environ.get("LANGSMITH_API_KEY"))

# List of retrievers and their run IDs
retriever_names = ['Naive', 'BM25', 'Contextual Compression', 'Multi-Query', 'Parent Document', 'Ensemble']
run_ids = [
    "13573b02-7cfa-4879-baab-aebb209863be",  # Naive
    "ff241b88-ffec-4432-9ea5-efd1d451d549",  # BM25
    "94f14b15-e31f-4044-abd2-269445773050",  # Contextual Compression
    "cd407e77-43c1-42f1-b43b-3dd37436a61c",  # Multi-Query
    "f8c90244-388e-41d2-a409-f2d79ea1a8e7",  # Parent Document
    "c83157b1-b035-4f57-bd6b-fc097d88074d"   # Ensemble
]

# Function to get metrics from LangSmith run
def get_run_metrics(run_id):
    try:
        # Get run details from LangSmith
        run = client.read_run(run_id)
        
        # Try to get cost from various attributes
        cost = None
        if hasattr(run, 'total_cost'):
            cost = run.total_cost
        elif hasattr(run, 'prompt_cost') and hasattr(run, 'completion_cost'):
            prompt_cost = run.prompt_cost or 0
            completion_cost = run.completion_cost or 0
            cost = prompt_cost + completion_cost
        
        # Get token counts if available
        tokens = None
        if hasattr(run, 'total_tokens'):
            tokens = run.total_tokens
        
        # Calculate latency from timestamps
        latency = None
        if hasattr(run, 'start_time') and hasattr(run, 'end_time'):
            start_time = run.start_time
            end_time = run.end_time
            if start_time and end_time:
                latency = (end_time - start_time).total_seconds()
        
        return cost, latency, tokens
    except Exception as e:
        print(f"Error getting metrics for run {run_id}: {e}")
        return None, None, None

# Get metrics for each retriever
metrics = []
for i, run_id in enumerate(run_ids):
    cost, latency, tokens = get_run_metrics(run_id)
    metrics.append({
        'Retriever': retriever_names[i],
        'Cost ($)': cost,
        'Latency (s)': latency,
        'Tokens': tokens
    })

# Create DataFrame
metrics_df = pd.DataFrame(metrics)


Combine all metrics, cost, latency and tokens

In [119]:
all_metrics = pd.merge(comparison_df, metrics_df, on='Retriever')
all_metrics

| | Retriever | Answer relevancy | Context entity recall | Context recall | Factual correctness(mode=f1) | Faithfulness | Noise sensitivity(mode=relevant) | Cost ($) | Latency (s) | Tokens |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Naive | 85.56% | 52.66% | 63.19% | 35.00% | 89.79% | 43.06% | 0.5175488 | 454.344248 | 658064 |
| 1 | BM25 | 79.22% | 51.88% | 64.58% | 40.25% | 84.10% | 28.23% | 0.3220576 | 405.710811 | 402238 |
| 2 | Contextual Compression | 79.00% | 50.97% | 59.03% | 42.33% | 86.61% | 32.18% | 0.288856 | 304.538759 | 359449 |
| 3 | Multi-Query | 87.02% | 44.72% | 70.14% | 36.00% | 96.74% | 50.91% | 0.564982 | 509.805455 | 743233 |
| 4 | Parent Document | 78.83% | 59.90% | 54.86% | 46.08% | 64.69% | 19.34% | 0.2515708 | 293.522741 | 310327 |
| 5 | Ensemble | 86.54% | 46.81% | 83.33% | 41.75% | 95.52% | 16.67% | 0.5670152 | 483.225317 | 753854 |


### Analysis

**Answer relevancy**
- Highest: Multi-Query (87.02%), Ensemble (86.54%), Naive (85.56%)
- Lowest: Parent Document (78.83%), Contextual Compression (79.00%), BM25 (79.22%)
- Analysis: 
  - **Multi-Query** and **Ensemble** can synthesize information from multiple sources/queries, raising the odds of finding directly relevant snippets
  - **Naive** is surprisingly high, possibly due to direct matching on clear queries in the dataset (e.g., “Who is CountJonnie?”).
  - **Parent Document** focuses on whole docs, not just direct matches, which can dilute relevancy.

**Context Entity Recall / Context Recall**
- Highest (Entity Recall): Parent Document (59.90%)
- Lowest (Entity Recall): Multi-Query (44.72%)
- Highest (Context Recall): Ensemble (83.33%), Multi-Query (70.14%)
- Lowest (Context Recall): Parent Document (54.86%), Contextual Compression (59.03%)
- Analysis:
  - **Parent Document** grabs more context entities, as it retrieves full docs (risk: more noise).
  - **Ensemble** and **Multi-Query** excel in context recall by aggregating from different strategies, likely capturing more “edges” of the reference contexts.
  - Some strategies **(BM25, Contextual Compression)** retrieve precise snippets, missing broader context.


**Factual Correctness (F1)**
- Highest: Parent Document (46.08%), Contextual Compression (42.33%), Ensemble (41.75%)
- Lowest: Naive (35.00%), Multi-Query (36.00%)
- Analysis:
  - **Parent Document**’s recall of full docs helps surface more facts, but can risk noise/confusion.
  - **Naive** and **Multi-Query** may overfit to “apparent” answers without complete factual support.

**Faithfulness**
- Highest: Multi-Query (96.74%), Ensemble (95.52%), Naive (89.79%)
- Lowest: Parent Document (64.69%)
- Analysis:
  - **Multi-Query** and **Ensemble** provide answers closely supported by retrieved text.
  - **Parent Document**’s low faithfulness is a classic risk: retrieving too much unrelated content.

**Noise Sensitivity**
- Lowest (Best): Ensemble (16.67%), Parent Document (19.34%)
- Highest (Worst): Multi-Query (50.91%), Naive (43.06%)
- Analysis:
  - **Ensemble** and **Parent Document** approaches are robust to irrelevant info—either via ensembling or by diluting noise in larger context.
  - **Multi-Query** and **Naive** may over-retrieve or pick up spurious matches (especially with long or complex questions).


**Cost / Latency / Tokens**
- Lowest Cost & Latency: Parent Document, Contextual Compression
- Highest Cost & Latency: Ensemble, Multi-Query, Naive (due to large context or multiple queries)
- Analysis:
  - **Parent Document** and **Contextual Compression** are resource-efficient because 
    - Parent Document retriever fetches an entire document (or a large chunk), rather than splitting the query into many sub-queries or retrieving lots of small passages.
    - Contextual Compression reranks the retrieved candidates (with the Cohere reranker in this notebook) and passes only the top-ranked ones to the LLM, instead of large documents or the results of multiple sub-queries.
  - **Multi-Query** and **Ensemble** are more expensive because they issue multiple retrieval or LLM calls and merge the results into a larger context.
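
As a rough way to quantify the cost and latency gap, the sketch below expresses each retriever's cost and latency as a multiple of the cheapest and fastest run. The numbers are copied from the combined results table; the dictionary layout and variable names are assumptions made for this illustration, and the figures reflect a single LangSmith run per retriever, so treat the ratios as indicative only.

```python
# (cost in $, latency in s) per retriever, copied from the combined results table.
runs = {
    "Naive": (0.5175488, 454.344248),
    "BM25": (0.3220576, 405.710811),
    "Contextual Compression": (0.288856, 304.538759),
    "Multi-Query": (0.564982, 509.805455),
    "Parent Document": (0.2515708, 293.522741),
    "Ensemble": (0.5670152, 483.225317),
}

# Baselines: the cheapest and the fastest retriever in this run.
cheapest = min(runs, key=lambda name: runs[name][0])
fastest = min(runs, key=lambda name: runs[name][1])
most_expensive = max(runs, key=lambda name: runs[name][0])
slowest = max(runs, key=lambda name: runs[name][1])

# Print every retriever relative to the baselines, cheapest first.
for name, (cost, latency) in sorted(runs.items(), key=lambda kv: kv[1][0]):
    cost_ratio = cost / runs[cheapest][0]
    latency_ratio = latency / runs[fastest][1]
    print(f"{name:<24} cost x{cost_ratio:.2f}  latency x{latency_ratio:.2f}")
```

Relative multiples make it easier to weigh a metric gain (for example, Ensemble's higher context recall) against how much more the run cost than the cheapest option.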


### Why Do Retrievers Perform Differently?

**Retriever Strategies**
- **Naive**: Retrieves the top-k chunks by embedding similarity, with no further processing. Good for clear-cut Q&A, but falls down on complex/nuanced tasks (low F1).

- **BM25**: Classic lexical match; performs reliably on explicit questions but struggles with paraphrasing and implied context (explains middling scores across most metrics).

- **Contextual Compression:** Uses a reranker to keep only the most relevant of the retrieved documents; can miss details (lower recall) but is efficient, which suits focused fact retrieval.

- **Multi-Query:** Uses an LLM to rephrase the query in several ways and gathers context from multiple angles. Great for high coverage and recall (especially for compound/multi-part questions), but at the cost of noise and higher resource use.

- **Parent Document:** Retrieves entire docs—max context entity recall, but faithfulness suffers because not all context is always relevant (can confuse LLMs, but boosts F1 by “accidentally” surfacing more facts).

- **Ensemble**: Combines several approaches, benefiting from their strengths (e.g., higher context recall, high faithfulness, low noise), but at the cost of higher latency and compute.
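
To illustrate how an ensemble can blend rankings from different retrievers, here is a minimal sketch of weighted Reciprocal Rank Fusion, the fusion scheme that LangChain's `EnsembleRetriever` is documented to use. The function name, the constant `c=60`, and the document IDs are assumptions for this example, not values taken from the notebook.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, weights=None, c=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores weight / (rank + c) for every list it appears in,
    so documents ranked highly by several retrievers rise to the top.
    """
    if weights is None:
        weights = [1.0] * len(ranked_lists)
    scores = defaultdict(float)
    for ranking, weight in zip(ranked_lists, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from a lexical (BM25) and a dense (embedding) retriever.
bm25_ranking = ["review_3", "review_1", "review_7"]
dense_ranking = ["review_1", "review_9", "review_3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers return (`review_1` and `review_3` here) accumulate score from each list and outrank documents found by only one retriever, which is one plausible reason the ensemble shows high context recall with low noise sensitivity.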

---
**Testset Impact**
- Short, Direct Queries (e.g., “Who is CountJonnie?”): **Naive** and **BM25** work fine; complex methods don’t add much value.

- Multi-Hop/Compound Questions (e.g., “Why is John Wick so popular and how has it evolved?”): **Multi-Query**, **Ensemble**, and **Contextual Compression** outperform naive approaches by piecing together answers from multiple places.

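- Contextual Nuance (e.g., “How does critique of action scenes affect tone/quality?”): Methods with better context recall (**Ensemble**, **Multi-Query**) capture subtler points, such as snippets or full reviews with relevant critique. **Parent Document** has the lowest context recall here but the highest context entity recall, so it still surfaces many of the named entities involved.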

- Risk of Noise: **Parent Document** can pull in a lot of irrelevant info if the reference doc is long; **Ensemble** mitigates this by blending signals; **Multi-Query** risks high noise if sub-queries aren’t well-formed.

---
**Cost & Latency**
- **Parent Document, Contextual Compression**: Fastest, cheapest—good for scaling, but may lack subtlety.

- **Multi-Query, Ensemble**: Best for nuanced, compound reasoning but expensive and slow—justified for complex research tasks, not for real-time or cost-sensitive use cases.

---
**Conclusion**
- If we want high answer relevancy and faithfulness for complex, multi-faceted movie Q&A, use **Ensemble or Multi-Query** but be aware of cost/latency.

- For simple direct questions, **Naive/BM25** are surprisingly competitive. **BM25** is also considerably cheaper, whereas Naive's cost in this run was close to Multi-Query's.

- If cost/latency matters and you can tolerate lower recall/coverage, **Contextual Compression or Parent Document** suffice.

