# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

> You do not need to run the following cells if you are running this notebook locally. 

In [1]:
#!pip install -qU langchain langchain-openai langchain-cohere rank_bm25

We're also going to be leveraging [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") VectorDB in "memory" mode (so we can leverage it locally in our colab environment).

In [2]:
#!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key.

In [10]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [4]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [5]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

--2025-05-16 18:17:17--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Résolution de raw.githubusercontent.com (raw.githubusercontent.com)… 2606:50c0:8000::154, 2606:50c0:8003::154, 2606:50c0:8002::154, ...
Connexion à raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8000::154|:443… connecté.
requête HTTP transmise, en attente de la réponse… 200 OK
Taille : 19628 (19K) [text/plain]
Sauvegarde en : « john_wick_1.csv »


2025-05-16 18:17:17 (2,48 MB/s) — « john_wick_1.csv » sauvegardé [19628/19628]

--2025-05-16 18:17:17--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Résolution de raw.githubusercontent.com (raw.githubusercontent.com)… 2606:50c0:8003::154, 2606:50c0:8002::154, 2606:50c0:8001::154, ...
Connexion à raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443… connecté.
requête HTTP transmise, en attente de la réponse… 200 OK
Taille : 14747 (14K) [text/plain]
Sauvegarde en : « jo

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [6]:
import pandas as pd

df = pd.read_csv("john_wick_4.csv", index_col= 0)
df


Unnamed: 0,Review_Date,Author,Rating,Review_Title,Review,Review_Url
0,23 May 2023,siderite,4.0,What a pointless film\n,Imagine a video game where you are shooting ba...,/review/rw9073117/?ref_=tt_urv
1,30 March 2023,neil-476,5.0,There is such a thing as too much\n,"The Table, the international crminal brotherho...",/review/rw8960544/?ref_=tt_urv
2,25 March 2023,BA_Harrison,4.0,It got on my wick.\n,The first three John Wick films came in fairly...,/review/rw8950606/?ref_=tt_urv
3,23 May 2023,namob-43673,3.0,I was rolling my eyes the whole time... all 3...,These John Wick movies can be sort of fun in t...,/review/rw9072963/?ref_=tt_urv
4,24 March 2023,fciocca,4.0,John Wick became the parody of himself. The t...,I went to the cinema with great expectations. ...,/review/rw8948738/?ref_=tt_urv
5,2 April 2023,skyhawk747,4.0,Am I missing something here?\n,What is all the raving about with this movie? ...,/review/rw8967740/?ref_=tt_urv
6,24 March 2023,IMDbKeepsDeletingMyReviews,5.0,"""Yeah.""\n",In this fourth installment of 8711's successfu...,/review/rw8947952/?ref_=tt_urv
7,23 April 2023,antti-eskelinen-329-929792,4.0,I don't understand the great scores of this m...,In my opinion this is by far the worst movie o...,/review/rw9011753/?ref_=tt_urv
8,28 May 2023,dstan-71445,2.0,Disappointed.\n,"Very much over rated. Repetitive, tiring and i...",/review/rw9082993/?ref_=tt_urv
9,30 March 2023,drjgardner,2.0,"Ridiculous, boring and pathetic...\n",...all at the same time. This hybrid comic boo...,/review/rw8959398/?ref_=tt_urv


In [7]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [8]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2025, 5, 13, 18, 17, 19, 723180)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [11]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [12]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [13]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [14]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")


### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [15]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [16]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews, people generally liked John Wick. Many reviewers praise its action sequences, style, and Keanu Reeves' performance, with ratings often ranging from 7 to 10. There are some mixed opinions, with a few reviewers giving lower ratings or expressing confusion about its popularity, but overall, the sentiment towards John Wick seems to be positive."

In [17]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})

{'response': AIMessage(content='Based on the reviews provided, people generally liked John Wick. Many reviews gave high ratings such as 9 or 10 out of 10, praising its stylish action sequences, cool character portrayal by Keanu Reeves, and overall entertainment value. While some reviews were more moderate or critical, the overall sentiment from the majority suggests that viewers to some extent enjoyed the film.', additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 74, 'prompt_tokens': 3601, 'total_tokens': 3675, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_eede8f0d45', 'id': 'chatcmpl-BXs6RXD1jM0KijKJg4VwYRZisoIL9', 'service_tier': 'default', 'finish_reason': 'stop', 'logprobs': None}, id='run--d056fab8-6bb2-43e2-84d5-023

In [18]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there are reviews with a rating of 10. The URLs to those reviews are:\n\n1. [Review by ymyuseda](https://example.com/review/rw4854296/?ref_=tt_urv) (Review Title: "A Masterpiece & Brilliant Sequel")\n2. [Review by lovemichaeljordan](https://example.com/review/rw8944843/?ref_=tt_urv) (Review Title: "How Can Anyone Choose to Watch Marvel Over This?")\n\nPlease note that the URLs are based on the review URL paths provided in the data.'

In [19]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the first John Wick movie, the story follows an ex-hitman named John Wick, played by Keanu Reeves, who is mourning the recent loss of his wife. His life takes a dark turn when a young Russian punk and his gang break into his home, beating him senseless, killing his dog (a beloved gift from his wife), and stealing his car. It is revealed that John Wick is a former professional assassin with a lethal reputation. The attack on him and the theft of his car ignite a violent revenge saga as Wick, driven by grief and rage, unleashes a relentless and carefully orchestrated wave of destruction against those who wronged him. He becomes a target for a bounty of killers, including an army of bounty hunters and other deadly enemies, as he seeks retribution for the loss of his dog and the insult to his former life. The film is packed with intense action sequences, stylish stunts, and a noir-like underworld setting.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set-up! Let's see it happen down below!


In [20]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)

We'll construct the same chain - only changing the retriever.

In [21]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [22]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, people's opinions on John Wick vary. Some reviewers, like IceSkateUpHill and lnvicta, highly enjoyed the movies, praising their stylish action, world-building, and entertainment value. However, others, like janetwilkinson and Phil_H, expressed negative opinions, criticizing the later films for being overly violent, plotless, or weaker compared to earlier installments. \n\nOverall, it seems that many people liked the John Wick movies, especially the first one, but there are also negative opinions about some of the later films."

In [23]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Based on the provided reviews, none of them have a rating of 10. The highest ratings mentioned are 9, with one review rating 9 for "John Wick 4" and another for "John Wick 2." Therefore, there are no reviews with a rating of 10, and I cannot provide any URLs for such reviews.'

In [24]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the John Wick series, the main character, John Wick, is a retired hitman who is drawn back into the violent underworld due to personal reasons involving the loss of his dog, which was a gift from his deceased wife. The movies depict his skillful and brutal actions as he seeks revenge against those who have wronged him, leading to intense and meticulously choreographed action scenes. The series is also set in a unique world with its own rules and hierarchies of assassins.'

It's not clear that this is better or worse - but the `I don't know` isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [25]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [26]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [27]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, people generally liked John Wick. The positive reviews describe it as an exciting, stylish, and fun action film, with high ratings such as 9 and 10 out of 10. However, there is a less favorable review with a rating of 5 out of 10 for the third movie, indicating some mixed opinions. Overall, the majority of comments suggest that people enjoyed the film and appreciated its action sequences, style, and Keanu Reeves’ performance.'

In [28]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there are reviews with a rating of 10. Here are the URLs to those reviews:\n\n1. Review titled "A Masterpiece & Brilliant Sequel" - [URL](https://yourdomain.com/review/rw4854296/?ref_=tt_urv)\n2. Review with the title starting with "It\'s got its own action style!" - [URL](https://yourdomain.com/review/rw4860412/?ref_=tt_urv)\n\nPlease note that the URLs are based on the review identifiers provided in the data.'

In [29]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In the John Wick series, the story follows a retired hitman, John Wick, played by Keanu Reeves. In the first film, he comes out of retirement to seek revenge after gangsters steal his car and kill his dog, which was a gift from his deceased wife. This leads to a violent and relentless pursuit of retribution against those who wronged him, unleashing a series of deadly confrontations and consequences. In the second film, after resolving issues with the Russian mafia, Wick is drawn back into the criminal underworld when a mobster named Santino D'Antonio seeks his help. Wick refuses, but Santino blows up his house and later puts a contract on him. Wick then embarks on a mission to recover a stolen marker and ultimately faces a bounty on his head, leading to more action and conflict. The series is characterized by stylized violence, intense action scenes, and a criminal underworld governed by strict rules."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had....more than one query!

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how is it to set-up? Not bad! Let's see it down below!



In [30]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [31]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [32]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the provided reviews, people generally liked John Wick. Many reviews are highly positive, praising its stylish action, choreography, and entertainment value, with some ratings as high as 9 or 10 out of 10. However, there are also some negative opinions; a few reviews gave low ratings or criticized the films for being overly violent, lacking plot, or being unrealistic. Overall, the majority of reviews indicate that people enjoyed the series, though opinions vary.'

In [33]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The URL to that review is: /review/rw4854296/?ref_=tt_urv'

In [34]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the John Wick film series, the story centers around John Wick, a retired hitman who is drawn back into a world of violence and assassination after personal tragedies and assaults on his life. The first movie depicts his quest for revenge after gangsters kill his dog and steal his car, unleashing a brutal and stylized series of action sequences. Subsequent films explore his ongoing conflicts with criminal organizations, obligations to a secret assassin society, and the repercussions of his actions, all marked by high-octane fight scenes, meticulous choreography, and a richly developed underworld.'

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [35]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [36]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

  parent_document_vectorstore = Qdrant(


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [37]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [38]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [39]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [40]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the provided reviews, people\'s opinions on John Wick vary. While some reviews, like the one from MrHeraclius, express high enthusiasm and praise the series, calling it "highly recommend" and praising the action, others are quite negative. For example, the review from solidabs describes John Wick 4 as a "horrible movie" with a lot of criticism about the plot and action scenes, indicating dissatisfaction. Another review from jtindahouse is positive and considers John Wick: Chapter 4 the best in the series.\n\nOverall, viewers\' opinions are mixed, with some liking the series and others criticizing specific installments, especially the fourth movie.'

In [41]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The URL to that review is: /review/rw4854296/?ref_=tt_urv'

In [42]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In the John Wick movies, a retired assassin named John Wick, played by Keanu Reeves, is drawn back into a violent world of hitmen and assassins. The first film begins with Wick seeking revenge after his dog is killed and his car is stolen, which leads him to unleash a relentless and deadly rampage against those who wronged him. The sequel, John Wick 2, involves Wick helping to help the Italian mafia take over the Assassin's Guild and involves traveling to various locations to confront numerous enemies, resulting in a lot of action and carnage. Overall, the series features intense violence, revenge, and a deep dive into the criminal underworld."

Overall, the performance *seems* largely the same. We can leverage a tool like [Ragas]() to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [43]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [44]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [45]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, people generally liked John Wick. Several reviews are highly positive, praising the action sequences, style, and entertainment value, with ratings often above 8/10. However, there are some mixed or negative opinions as well, with a few viewers criticizing the plot, length, or over-the-top violence. Overall, the majority of feedback indicates that viewers tend to enjoy the film series, especially action fans.'

In [46]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The URL to that review is: /review/rw4854296/?ref_=tt_urv'

In [47]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In the John Wick film series, the story centers around John Wick, a retired but legendary assassin who comes out of retirement to seek vengeance after a series of personal acts of violence. The initial film portrays how a gangsters' group kills his dog and steals his car, leading him to unleash a brutal and meticulously choreographed vengeance against them. Throughout the series, Wick faces various criminal factions, including the Russian mafia and an underground assassin's guild, with each film escalating the scope of his conflicts and the consequences of his actions. The movies combine intense action sequences, stylish world-building, and a plot of revenge and honor."

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

> NOTE: You do not need to run this cell if you're running this locally

In [48]:
#!pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between sentences, and then break apart sequences of setences that exceed a given percentile among all distances.

In [49]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [50]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [51]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [52]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [53]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [54]:
semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, people generally liked John Wick. Many reviews are highly positive, praising the action sequences, style, and entertainment value of the films. For example, some reviews give high ratings like 9 or 10 out of 10, and descriptions such as "John Wick was cool," "brilliantly shot," and "slick, violent fun" indicate a favorable reception. While there are some less favorable opinions with lower ratings (e.g., 0, 2, or 5 out of 10), the overall trend suggests that most people generally liked John Wick.'

In [55]:
semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is at least one review with a rating of 10. The URL to that review is:\n\n/review/rw4854296/?ref_=tt_urv'

In [56]:
semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the film "John Wick," the story follows a retired assassin named John Wick, played by Keanu Reeves, who seeks revenge after his beloved dog is killed, his car is stolen, and his privacy is violated by a group of thugs. The thugs, led by the son of a Russian gangster John used to work for, break into his house, beat him, kill his dog, and steal his car, not knowing who he really is. This act of violence awakens John\'s deadly skills and sets him on a path of vengeance against those who wronged him. As he reenters the criminal underworld, he becomes a target for numerous bounty-hunting killers, leading to a series of intense and stylized action sequences. The story explores themes of revenge, consequences, and the dangerous world of professional assassins.'

# 🤝 Breakout Room Part #2
#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against eachother.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparision between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

In [78]:
# Imports standard
import copy
import os
import time
import uuid
import numpy as np
import pandas as pd
from datetime import datetime, timezone

# Import OpenAI
from openai import OpenAI

# Imports LangChain
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate


from langchain.schema.runnable import RunnableConfig
from langchain_core.tracers.context import tracing_v2_enabled
from langsmith import Client
# Imports LangSmith
from langsmith import Client, traceable, RunTree
from langsmith.run_helpers import get_current_run_tree
from langsmith.wrappers import wrap_openai

# Imports RAGAS
from ragas import EvaluationDataset, evaluate, RunConfig
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator
from ragas.metrics import (
    Faithfulness, 
    ContextPrecision
)


In [59]:
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
testset_dataset = generator.generate_with_langchain_docs(documents, testset_size=10)

Applying SummaryExtractor:   0%|          | 0/44 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]

Node 5f4f7d76-6298-4b98-b395-4dcc3c6ef7d1 does not have a summary. Skipping filtering.
Node 38dfeaa0-7fb7-4ced-89ac-5ea4e62598bf does not have a summary. Skipping filtering.
Node c49739a5-5d1b-43b1-9b7c-a92e84c673b3 does not have a summary. Skipping filtering.
Node 3c412b21-bd23-49b6-9239-89dd33d382d4 does not have a summary. Skipping filtering.
Node bd5e1e34-c598-43a6-93ae-4f972cd711e9 does not have a summary. Skipping filtering.
Node 416b0cce-6b25-404d-a28f-8f83dcac0d3b does not have a summary. Skipping filtering.
Node 50d2d0e0-6009-47e8-a7c9-e6f256ec62ec does not have a summary. Skipping filtering.
Node e55d2f04-ea6c-48c3-a100-d9c478a988c5 does not have a summary. Skipping filtering.
Node 31533cea-f8a5-4a3b-ba61-84bf5cc0216c does not have a summary. Skipping filtering.
Node 30652d33-8cd5-41fe-955c-595605b38e25 does not have a summary. Skipping filtering.
Node 0cdcccbd-cde0-4929-89a4-03a9ccee45d3 does not have a summary. Skipping filtering.
Node 0a495906-185b-40b2-ba2a-788516e9ccf6 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/244 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [60]:
testset_df = testset_dataset.to_pandas()
testset_df

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,Could you provide a detailed explanation of wh...,[: 0\nReview: The best way I can describe John...,John Wick stands out as an action movie becaus...,single_hop_specifc_query_synthesizer
1,What is the reviewr's initial impression of th...,[: 2\nReview: With the fourth installment scor...,The reviewer notes that the fourth installment...,single_hop_specifc_query_synthesizer
2,How does Keanu's performance contribute to the...,[: 3\nReview: John wick has a very simple reve...,Keanu's performance in John Wick centers on a ...,single_hop_specifc_query_synthesizer
3,What is the movi John Wick abot and why is it ...,[: 4\nReview: Though he no longer has a taste ...,"John Wick is about a retired assassin who, aft...",single_hop_specifc_query_synthesizer
4,How did Chad Stahelski's involvement influence...,"[<1-hop>\n\n: 9\nReview: ""John Wick: Chapter 2...","Chad Stahelski, returning from the first John ...",multi_hop_abstract_query_synthesizer
5,How do the reviews of John Wick differ in thei...,[<1-hop>\n\n: 18\nReview: And all of this equa...,The reviews of John Wick present contrasting v...,multi_hop_abstract_query_synthesizer
6,How does the portrayal of assassin characters ...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,John Wick is praised as a near-perfect action ...,multi_hop_abstract_query_synthesizer
7,How do the over-the-top martial arts scenes in...,[<1-hop>\n\n: 16\nReview: John Wick Chapter 2 ...,John Wick Chapter 2 is described as a ridiculo...,multi_hop_abstract_query_synthesizer
8,Why people say JOHN WICK is a near-perfect act...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,JOHN WICK is praised as a rare example of Holl...,multi_hop_specific_query_synthesizer
9,How do the reviews reflect Hollywood's approac...,[<1-hop>\n\n: 11\nReview: JOHN WICK is a rare ...,The reviews reflect Hollywood's approach to ac...,multi_hop_specific_query_synthesizer


In [61]:
retrievers = [
    {"name": "Naive Retriever", "retriever": naive_retriever},
    {"name": "BM25 Retriever", "retriever": bm25_retriever},
    {"name": "Contextual Compression", "retriever": compression_retriever},
    {"name": "Multi-Query Retriever", "retriever": multi_query_retriever},
    {"name": "Parent Document Retriever", "retriever": parent_document_retriever},
    {"name": "Ensemble Retriever", "retriever": ensemble_retriever},
    {"name": "Semantic Chunking Retriever", "retriever": semantic_retriever}
]

eval_llm = ChatOpenAI(model="gpt-4.1-mini")


In [62]:
api_key = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_API_KEY"] = api_key
os.environ["LANGCHAIN_TRACING_V2"] = "true"

In [81]:
def evaluate_retrievers_unified(retrievers, testset_dataset, llm_model="gpt-4.1-mini"):
    
    # Cost definitions (adjust according to model)
    GENERATION_INPUT_COST = 0.0000001  # $0.10 per million tokens
    GENERATION_OUTPUT_COST = 0.0000004  # $0.40 per million tokens
    
    # Create LangSmith client
    client = Client(api_key=os.environ.get("LANGSMITH_API_KEY"))
    
    # Ensure tracing is enabled
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGSMITH_TRACING"] = "true"
    
    # Create unique project for entire evaluation
    project_name = f"retriever-eval-unified-{uuid.uuid4().hex[:8]}"
    print(f"Creating unified LangSmith project: {project_name}")
    
    try:
        client.create_project(project_name=project_name)
        print(f"Project '{project_name}' created successfully")
    except Exception as e:
        print(f"Error creating project: {e}")
        project_name = "default"  # Fallback
    
    # Wrapped OpenAI client for automatic tracing
    openai_client = wrap_openai(OpenAI())
    
    # Standard RAG template
    prompt_template = """
    You are a helpful assistant. Use the context provided below to answer the question.
    If you don't know the answer based on the context, just say you don't know.
    
    Context:
    {context}
    
    Question:
    {question}
    """
    
    # Create main run for entire evaluation
    main_run_id = str(uuid.uuid4())
    client.create_run(
        name="Complete Retriever Evaluation",
        run_type="chain",
        project_name=project_name,
        id=main_run_id,
        inputs={"retrievers": [r["name"] for r in retrievers]},
        start_time=datetime.now(timezone.utc),
        tags=["main_evaluation"]
    )
    
    # Main evaluation function, recorded in LangSmith
    @traceable(
        project_name=project_name,
        run_type="chain",
        name="Retriever Evaluation",
        parent_run_id=main_run_id,
        tags=["evaluation_workflow"]
    )
    def evaluate_all_retrievers():
        results = []
        retriever_results = {}
    
        
        # Evaluate each retriever
        for retriever_info in retrievers:
            name = retriever_info["name"]
            retriever = retriever_info["retriever"]
            
            # Evaluation function for specific retriever
            @traceable(
                project_name=project_name,
                run_type="chain",
                name=f"{name} Evaluation",
                tags=[name, "retriever_evaluation"]
            )
            def evaluate_retriever():
                print(f"\n=== Evaluating: {name} ===")
                
                metrics = {
                    "retrieval_times": [],
                    "llm_times": [],
                    "total_times": [],
                    "document_counts": [],
                    "prompt_tokens": [],
                    "completion_tokens": [],
                    "costs": []
                }
                
                # Copy of testset
                testset_copy = copy.deepcopy(testset_dataset)
                
                # Process each question with tracing
                for i, test_row in enumerate(testset_copy):
                    question = test_row.eval_sample.user_input
                    print(f"Processing question {i+1}/{len(testset_copy)}: {question[:50]}...")
                    
                    # Function to process a question
                    @traceable(
                        project_name=project_name,
                        run_type="chain",
                        name=f"Q{i+1}: {question[:30]}...",
                        tags=[name, "question_evaluation"],
                        metadata={"question_number": i+1}
                    )
                    def process_question():
                        # 1. Retrieval with time measurement
                        retrieval_start = time.time()  
                        docs = retriever.invoke(question)
                        retrieval_time = time.time() - retrieval_start
                                                
                        # Retrieval metrics
                        doc_count = len(docs)
                        metrics["retrieval_times"].append(retrieval_time)
                        metrics["document_counts"].append(doc_count)
                        
                        # 2. Context preparation
                        context_text = "\n\n".join([doc.page_content for doc in docs])
                        formatted_prompt = prompt_template.format(context=context_text, question=question)
                        
                        # 3. LLM call with wrapped client and time measurement
                        llm_start = time.time()
                        llm_response = openai_client.chat.completions.create(
                            model=llm_model,
                            messages=[
                                {"role": "system", "content": "You are a helpful assistant."},
                                {"role": "user", "content": formatted_prompt}
                            ]
                            )
                        llm_time = time.time() - llm_start
                        metrics["llm_times"].append(llm_time)
                        
                        # LLM response
                        answer = llm_response.choices[0].message.content
                        
                        # Use tokens counted directly by API
                        prompt_tokens = 0
                        completion_tokens = 0
                        cost = 0
                        
                        if hasattr(llm_response, "usage") and llm_response.usage:
                            prompt_tokens = llm_response.usage.prompt_tokens
                            completion_tokens = llm_response.usage.completion_tokens
                            
                            # Cost calculation based on API tokens
                            cost = (
                                (prompt_tokens * GENERATION_INPUT_COST) + 
                                (completion_tokens * GENERATION_OUTPUT_COST)
                            )
                            
                            # Update metrics
                            metrics["prompt_tokens"].append(prompt_tokens)
                            metrics["completion_tokens"].append(completion_tokens)
                            metrics["costs"].append(cost)
                        
                        # Total time
                        total_time = retrieval_time + llm_time
                        metrics["total_times"].append(total_time)
                        
                        # Update testset
                        test_row.eval_sample.response = answer
                        test_row.eval_sample.retrieved_contexts = [doc.page_content for doc in docs]
                        
                        return {
                            "response": answer,
                            "contexts": [doc.page_content for doc in docs],
                            "metrics": {
                                "retrieval_time": retrieval_time,
                                "llm_time": llm_time,
                                "total_time": total_time,
                                "document_count": doc_count,
                                "prompt_tokens": prompt_tokens,
                                "completion_tokens": completion_tokens,
                                "cost": cost
                            }
                        }
                    
                    # Execute question processing
                    try:
                        result = process_question()
                        # Display metrics per question
                        print(f"  Time: {result['metrics']['total_time']:.2f}s (Retrieval: {result['metrics']['retrieval_time']:.2f}s, LLM: {result['metrics']['llm_time']:.2f}s)")
                        print(f"  Documents: {result['metrics']['document_count']}, Cost: ${result['metrics']['cost']:.6f}")
                    except Exception as e:
                        print(f"  Error processing question {i+1}: {e}")
                    
                    # Small pause to avoid overloading API
                    time.sleep(2)
                
                # Calculate averages
                avg_metrics = {
                    "avg_retrieval_time": np.mean(metrics["retrieval_times"]) if metrics["retrieval_times"] else 0,
                    "avg_llm_time": np.mean(metrics["llm_times"]) if metrics["llm_times"] else 0,
                    "avg_total_time": np.mean(metrics["total_times"]) if metrics["total_times"] else 0,
                    "avg_docs": np.mean(metrics["document_counts"]) if metrics["document_counts"] else 0,
                    "avg_prompt_tokens": np.mean(metrics["prompt_tokens"]) if metrics["prompt_tokens"] else 0,
                    "avg_completion_tokens": np.mean(metrics["completion_tokens"]) if metrics["completion_tokens"] else 0,
                    "avg_cost": np.mean(metrics["costs"]) if metrics["costs"] else 0
                }
                
                # Run RAGAS in same workflow
                print(f"\n=== Running RAGAS evaluation for {name} ===")
                ragas_scores = run_ragas_evaluation(testset_copy, name, llm_model)
                
                # Combine results and return
                return {
                    "name": name,
                    "metrics": avg_metrics,
                    "ragas_scores": ragas_scores,
                    "processed_testset": testset_copy
                }
            
            # Integrated RAGAS evaluation
            @traceable(
                project_name=project_name,
                run_type="chain", 
                name=f"{name} RAGAS Evaluation",
                tags=[name, "ragas"]
            )
            def run_ragas_evaluation(testset, retriever_name, llm_model):
                try:
                    # Use gpt-4.1-mini for RAGAS evaluation
                    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))
                    
                    # Robust configuration
                    run_config = RunConfig(
                        timeout=600,  
                        max_workers=2,
                    )
                    
                    # Limit to most stable metrics to increase chances of success
                    metrics = [
                        Faithfulness(),                      
                        ContextPrecision()          
                    ]

                    # Prepare dataset for RAGAS
                    df = pd.DataFrame([
                        {
                            "question": row.eval_sample.user_input,
                            "answer": row.eval_sample.response,
                            "contexts": row.eval_sample.retrieved_contexts
                        }
                        for row in testset
                    ])
                    
                    # Build RAGAS DataFrame
                    ragas_df = pd.DataFrame()
                    ragas_df['question'] = df['question']
                    ragas_df['response'] = df['answer']
                    ragas_df['contexts'] = df['contexts'].apply(
                        lambda x: x if isinstance(x, list) else [x]
                    )
                    ragas_df['ground_truth'] = [[] for _ in range(len(df))]
                    ragas_df['user_input'] = df['question']
                    ragas_df['retrieved_contexts'] = ragas_df['contexts']
                    ragas_df['reference'] = ["" for _ in range(len(df))]
                    
                    # Convert to RAGAS dataset
                    dataset = EvaluationDataset.from_pandas(ragas_df)
                    
                    # Evaluation with error handling
                    try:
                        # Run RAGAS evaluation
                        result = evaluate(
                            dataset=dataset,
                            metrics=metrics,
                            llm=evaluator_llm,
                            run_config=run_config
                        )
                        
                        # DEBUG: Print returned object type
                        print(f"RAGAS result type: {type(result)}")
                        print(f"RAGAS result dir: {dir(result)}")
                        
                        # Initialize scores
                        ragas_scores = {}
                        
                        # CORRECTION: Access results correctly based on object type
                        # If DataFrame
                        if hasattr(result, 'to_pandas'):
                            df_result = result.to_pandas()
                            print(f"RAGAS result columns: {df_result.columns.tolist()}")
                            
                            # Extract column averages
                            for col in df_result.columns:
                                if col in ['faithfulness', 'context_precision']:
                                    ragas_scores[col] = df_result[col].mean()
                        
                        # If object with direct metric attributes
                        elif hasattr(result, 'faithfulness'):
                            if hasattr(result.faithfulness, 'mean'):
                                ragas_scores['faithfulness'] = result.faithfulness.mean()
                            else:
                                ragas_scores['faithfulness'] = np.mean(result.faithfulness)
                                
                            if hasattr(result, 'answer_relevancy'):
                                if hasattr(result.answer_relevancy, 'mean'):
                                    ragas_scores['response_relevancy'] = result.answer_relevancy.mean()
                                else:
                                    ragas_scores['response_relevancy'] = np.mean(result.answer_relevancy)
                        
                        # If dictionary
                        elif hasattr(result, 'items'):
                            for key, value in result.items():
                                if isinstance(value, list):
                                    ragas_scores[key] = np.mean(value)
                                else:
                                    ragas_scores[key] = value
                        
                        # If none of above methods work, use default values
                        if not ragas_scores:
                            print("Couldn't extract metrics from RAGAS result. Using default scores.")
                            ragas_scores = {
                                'faithfulness': 0.5,
                                'response_relevancy': 0.5
                            }
                        
                        # Calculate overall score
                        ragas_scores['quality_score'] = np.mean([
                                                        ragas_scores.get('faithfulness', 0.5),
                                                        ragas_scores.get('context_precision', 0.5)
                                                    ])
                                                                            
                        # Display scores
                        print(f"RAGAS scores for {retriever_name}:")
                        for key, value in ragas_scores.items():
                            print(f"- {key}: {value:.3f}")
                        
                        return ragas_scores
                            
                    except Exception as e:
                        print(f"RAGAS evaluation failed: {e}")
                        return {
                            "quality_score": 0.5,
                            "faithfulness": 0.5,
                            "response_relevancy": 0.5
                        }
                            
                except Exception as e:
                    print(f"Error in RAGAS evaluation for {retriever_name}: {e}")
                    import traceback
                    traceback.print_exc()
                    return {"quality_score": 0.5}
            
            # Execute retriever evaluation
            retriever_result = evaluate_retriever()
            retriever_results[name] = retriever_result
            
            # Extract metrics
            avg_metrics = retriever_result["metrics"]
            ragas_scores = retriever_result["ragas_scores"]
            
            # Store results for this retriever
            result = {
                "Retriever": name,
                "Avg Time (s)": avg_metrics["avg_total_time"],
                "Retrieval Time (s)": avg_metrics["avg_retrieval_time"],
                "LLM Time (s)": avg_metrics["avg_llm_time"],
                "Avg Docs": avg_metrics["avg_docs"],
                "Avg Prompt Tokens": avg_metrics["avg_prompt_tokens"],
                "Avg Completion Tokens": avg_metrics["avg_completion_tokens"],
                "Avg Cost ($)": avg_metrics["avg_cost"],
                "Quality Score": ragas_scores.get("quality_score", 0.5),
                "Faithfulness": ragas_scores.get("faithfulness", 0),
                "Context Precision": ragas_scores.get("context_precision", 0),
                "Project": project_name
            }
            
            # Calculate efficiency scores
            result["Efficiency Score"] = result["Quality Score"] / max(result["Avg Cost ($)"], 0.0001)
            result["Time-Efficiency"] = result["Quality Score"] / (max(result["Avg Cost ($)"], 0.0001) * result["Avg Time (s)"])
        
            # Add to results
            results.append(result)
            
            # Display results for this retriever
            print(f"\n=== Results for {name} ===")
            print(f"Average time: {avg_metrics['avg_total_time']:.2f}s (Retrieval: {avg_metrics['avg_retrieval_time']:.2f}s, LLM: {avg_metrics['avg_llm_time']:.2f}s)")
            print(f"Average documents: {avg_metrics['avg_docs']:.1f}")
            print(f"Average tokens: {avg_metrics['avg_prompt_tokens']:.1f} in, {avg_metrics['avg_completion_tokens']:.1f} out")
            print(f"Average cost: ${avg_metrics['avg_cost']:.6f}/query")
            print(f"Quality Score: {ragas_scores.get('quality_score', 0.5):.3f}")
        
        # Convert to DataFrame
        results_df = pd.DataFrame(results)
        
        return results_df, retriever_results
    
    # Run complete evaluation
    try:
        results_df, retriever_results = evaluate_all_retrievers()
        
        # Update main run with final results
        client.update_run(
            run_id=main_run_id,
            outputs={
                "results_df": results_df.to_dict(),
                "project_url": f"https://smith.langchain.com/projects/{project_name}"
            },
            end_time=datetime.now(timezone.utc)
        )
        
        # Display final results
        print("\n=== Final Results ===")
        print(results_df[["Retriever", "Avg Time (s)", "Avg Cost ($)", "Quality Score", "Efficiency Score"]])
        
        # Display link to LangSmith dashboard
        print(f"\nView detailed results in LangSmith: https://smith.langchain.com/projects/{project_name}")
        
        return results_df, project_name
        
    except Exception as e:
        print(f"Error in evaluation: {e}")
        import traceback
        traceback.print_exc()
        
        # Update main run with error
        client.update_run(
            run_id=main_run_id,
            error=str(e),
            end_time=datetime.now(timezone.utc)
        )
        
        return pd.DataFrame(), project_name

In [82]:
# Exécuter l'évaluation complète en une seule fois
results_df, project_name = evaluate_retrievers_unified(
    retrievers, 
    testset_dataset,
    llm_model="gpt-4.1-mini"
)

Creating unified LangSmith project: retriever-eval-unified-f58f8f85
Project 'retriever-eval-unified-f58f8f85' created successfully

=== Evaluating: Naive Retriever ===
Processing question 1/12: Could you provide a detailed explanation of what m...
  Time: 10.83s (Retrieval: 0.90s, LLM: 9.93s)
  Documents: 10, Cost: $0.000360
Processing question 2/12: What is the reviewr's initial impression of the Jo...
  Time: 3.73s (Retrieval: 0.96s, LLM: 2.76s)
  Documents: 10, Cost: $0.000230
Processing question 3/12: How does Keanu's performance contribute to the app...
  Time: 3.80s (Retrieval: 0.86s, LLM: 2.93s)
  Documents: 10, Cost: $0.000267
Processing question 4/12: What is the movi John Wick abot and why is it so p...
  Time: 6.94s (Retrieval: 0.59s, LLM: 6.35s)
  Documents: 10, Cost: $0.000362
Processing question 5/12: How did Chad Stahelski's involvement influence the...
  Time: 4.40s (Retrieval: 0.66s, LLM: 3.74s)
  Documents: 10, Cost: $0.000276
Processing question 6/12: How do the revi

Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

RAGAS result type: <class 'ragas.dataset_schema.EvaluationResult'>
RAGAS result dir: ['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__replace__', '__repr__', '__setattr__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__weakref__', '_repr_dict', '_scores_dict', 'binary_columns', 'cost_cb', 'dataset', 'ragas_traces', 'run_id', 'scores', 'to_pandas', 'total_cost', 'total_tokens', 'traces', 'upload']
RAGAS result columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'context_precision']
RAGAS scores for Naive Retriever:
- faithfulness: 0.981
- context_precision: 0.577
- qua

Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

RAGAS result type: <class 'ragas.dataset_schema.EvaluationResult'>
RAGAS result dir: ['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__replace__', '__repr__', '__setattr__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__weakref__', '_repr_dict', '_scores_dict', 'binary_columns', 'cost_cb', 'dataset', 'ragas_traces', 'run_id', 'scores', 'to_pandas', 'total_cost', 'total_tokens', 'traces', 'upload']
RAGAS result columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'context_precision']
RAGAS scores for BM25 Retriever:
- faithfulness: 0.924
- context_precision: 0.403
- qual

Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

RAGAS result type: <class 'ragas.dataset_schema.EvaluationResult'>
RAGAS result dir: ['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__replace__', '__repr__', '__setattr__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__weakref__', '_repr_dict', '_scores_dict', 'binary_columns', 'cost_cb', 'dataset', 'ragas_traces', 'run_id', 'scores', 'to_pandas', 'total_cost', 'total_tokens', 'traces', 'upload']
RAGAS result columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'context_precision']
RAGAS scores for Contextual Compression:
- faithfulness: 0.954
- context_precision: 0.52

Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

RAGAS result type: <class 'ragas.dataset_schema.EvaluationResult'>
RAGAS result dir: ['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__replace__', '__repr__', '__setattr__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__weakref__', '_repr_dict', '_scores_dict', 'binary_columns', 'cost_cb', 'dataset', 'ragas_traces', 'run_id', 'scores', 'to_pandas', 'total_cost', 'total_tokens', 'traces', 'upload']
RAGAS result columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'context_precision']
RAGAS scores for Multi-Query Retriever:
- faithfulness: 0.976
- context_precision: 0.460

Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

RAGAS result type: <class 'ragas.dataset_schema.EvaluationResult'>
RAGAS result dir: ['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__replace__', '__repr__', '__setattr__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__weakref__', '_repr_dict', '_scores_dict', 'binary_columns', 'cost_cb', 'dataset', 'ragas_traces', 'run_id', 'scores', 'to_pandas', 'total_cost', 'total_tokens', 'traces', 'upload']
RAGAS result columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'context_precision']
RAGAS scores for Parent Document Retriever:
- faithfulness: 0.870
- context_precision: 0

Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

RAGAS result type: <class 'ragas.dataset_schema.EvaluationResult'>
RAGAS result dir: ['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__replace__', '__repr__', '__setattr__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__weakref__', '_repr_dict', '_scores_dict', 'binary_columns', 'cost_cb', 'dataset', 'ragas_traces', 'run_id', 'scores', 'to_pandas', 'total_cost', 'total_tokens', 'traces', 'upload']
RAGAS result columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'context_precision']
RAGAS scores for Ensemble Retriever:
- faithfulness: 0.940
- context_precision: 0.531
- 

Evaluating:   0%|          | 0/24 [00:00<?, ?it/s]

RAGAS result type: <class 'ragas.dataset_schema.EvaluationResult'>
RAGAS result dir: ['__annotations__', '__class__', '__dataclass_fields__', '__dataclass_params__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__firstlineno__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__match_args__', '__module__', '__ne__', '__new__', '__post_init__', '__reduce__', '__reduce_ex__', '__replace__', '__repr__', '__setattr__', '__sizeof__', '__static_attributes__', '__str__', '__subclasshook__', '__weakref__', '_repr_dict', '_scores_dict', 'binary_columns', 'cost_cb', 'dataset', 'ragas_traces', 'run_id', 'scores', 'to_pandas', 'total_cost', 'total_tokens', 'traces', 'upload']
RAGAS result columns: ['user_input', 'retrieved_contexts', 'response', 'reference', 'faithfulness', 'context_precision']
RAGAS scores for Semantic Chunking Retriever:
- faithfulness: 0.956
- context_precision:

In [83]:
# Afficher les résultats sous forme de tableau
display(results_df)

Unnamed: 0,Retriever,Avg Time (s),Retrieval Time (s),LLM Time (s),Avg Docs,Avg Prompt Tokens,Avg Completion Tokens,Avg Cost ($),Quality Score,Faithfulness,Context Precision,Project,Efficiency Score,Time-Efficiency
0,Naive Retriever,6.156603,0.732347,5.424256,10.0,2032.166667,268.416667,0.000311,0.778869,0.980846,0.576892,retriever-eval-unified-f58f8f85,2507.762701,407.328928
1,BM25 Retriever,3.965675,0.002064,3.963611,4.0,841.166667,222.083333,0.000173,0.663194,0.92361,0.402778,retriever-eval-unified-f58f8f85,3834.598636,966.947281
2,Contextual Compression,4.933752,0.812529,4.121223,3.0,873.5,230.666667,0.00018,0.741118,0.954458,0.527778,retriever-eval-unified-f58f8f85,4126.108054,836.302359
3,Multi-Query Retriever,9.46922,3.631516,5.837703,14.583333,2694.5,340.416667,0.000406,0.717598,0.975595,0.459601,retriever-eval-unified-f58f8f85,1769.15392,186.832072
4,Parent Document Retriever,4.390347,0.447118,3.943229,2.416667,500.0,184.5,0.000124,0.690949,0.870324,0.511574,retriever-eval-unified-f58f8f85,5581.172688,1271.237224
5,Ensemble Retriever,12.752847,5.26988,7.482967,17.75,3124.416667,337.166667,0.000447,0.735315,0.939503,0.531127,retriever-eval-unified-f58f8f85,1643.866387,128.90191
6,Semantic Chunking Retriever,6.207603,0.512937,5.694666,10.0,1575.333333,283.25,0.000271,0.685199,0.955945,0.414453,retriever-eval-unified-f58f8f85,2529.964521,407.559002


In [84]:
results_df.to_csv("results_df.csv")