# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up Qdrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

> You do not need to run the following cells if you are running this notebook locally. 

In [1]:
#!pip install -qU langchain langchain-community langchain-openai langchain-cohere rank_bm25

We're also going to be leveraging [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") VectorDB in "memory" mode (so we can leverage it locally in our colab environment).

In [2]:
#!pip install -qU qdrant-client

We'll also provide our OpenAI key, as well as our Cohere API key.

In [3]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [4]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [5]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

--2025-05-16 15:19:37--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8002::154, 2606:50c0:8003::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8002::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 19628 (19K) [text/plain]
Saving to: ‘john_wick_1.csv’


2025-05-16 15:19:37 (3.21 MB/s) - ‘john_wick_1.csv’ saved [19628/19628]

--2025-05-16 15:19:37--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8000::154, 2606:50c0:8001::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 14747 (14K) [text/plain]
Saving to: ‘jo

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [6]:
import pandas as pd

df = pd.read_csv("john_wick_4.csv", index_col= 0)
df


| Unnamed: 0 | Review_Date | Author | Rating | Review_Title | Review | Review_Url |
|---|---|---|---|---|---|---|
| 0 | 23 May 2023 | siderite | 4.0 | What a pointless film\n | Imagine a video game where you are shooting ba... | /review/rw9073117/?ref_=tt_urv |
| 1 | 30 March 2023 | neil-476 | 5.0 | There is such a thing as too much\n | The Table, the international crminal brotherho... | /review/rw8960544/?ref_=tt_urv |
| 2 | 25 March 2023 | BA_Harrison | 4.0 | It got on my wick.\n | The first three John Wick films came in fairly... | /review/rw8950606/?ref_=tt_urv |
| 3 | 23 May 2023 | namob-43673 | 3.0 | I was rolling my eyes the whole time... all 3... | These John Wick movies can be sort of fun in t... | /review/rw9072963/?ref_=tt_urv |
| 4 | 24 March 2023 | fciocca | 4.0 | John Wick became the parody of himself. The t... | I went to the cinema with great expectations. ... | /review/rw8948738/?ref_=tt_urv |
| 5 | 2 April 2023 | skyhawk747 | 4.0 | Am I missing something here?\n | What is all the raving about with this movie? ... | /review/rw8967740/?ref_=tt_urv |
| 6 | 24 March 2023 | IMDbKeepsDeletingMyReviews | 5.0 | "Yeah."\n | In this fourth installment of 8711's successfu... | /review/rw8947952/?ref_=tt_urv |
| 7 | 23 April 2023 | antti-eskelinen-329-929792 | 4.0 | I don't understand the great scores of this m... | In my opinion this is by far the worst movie o... | /review/rw9011753/?ref_=tt_urv |
| 8 | 28 May 2023 | dstan-71445 | 2.0 | Disappointed.\n | Very much over rated. Repetitive, tiring and i... | /review/rw9082993/?ref_=tt_urv |
| 9 | 30 March 2023 | drjgardner | 2.0 | Ridiculous, boring and pathetic...\n | ...all at the same time. This hybrid comic boo... | /review/rw8959398/?ref_=tt_urv |


In [7]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"]
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [8]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2025, 5, 13, 15, 19, 38, 842160)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting up Qdrant!

Now that we have our documents, let's create a Qdrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [9]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [10]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})
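
To make "cosine similarity, top `k`" concrete, here's a minimal pure-Python sketch of what the retriever is doing conceptually. The vectors, document ids, and helper names (`cosine_similarity`, `top_k`) are made up for illustration. The real retriever delegates this search to Qdrant over OpenAI embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine_similarity(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-dimensional "embeddings" (invented values, not real model output).
toy_doc_vecs = {
    "review_a": [0.9, 0.1, 0.0],
    "review_b": [0.0, 1.0, 0.0],
    "review_c": [0.7, 0.7, 0.0],
}
toy_query = [1.0, 0.0, 0.0]

top_two = top_k(toy_query, toy_doc_vecs, k=2)
print(top_two)
```

Raising `k` trades precision for recall: more candidates come back, which is what we want when a later step (like reranking) will trim the list down again.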

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [11]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [12]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")


### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [13]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and piping it into naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : re-assigned from the previous step's "context" value; RunnablePassthrough.assign
    #              passes every other key (here, "question") through unchanged
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
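
If the LCEL pipe syntax is new to you, it can help to see the same data flow written as plain Python. This is only a rough sketch of the shape of the chain, not how LangChain executes it: `fake_retriever`, `fake_llm`, and `run_chain` are hypothetical stand-ins that need no API keys.

```python
def fake_retriever(question):
    # Stand-in for naive_retriever: always returns the same two "documents".
    return ["Review: stylish action.", "Review: too long."]

def fake_llm(prompt_text):
    # Stand-in for rag_prompt | chat_model: reports how many reviews it was shown.
    return f"answered using {prompt_text.count('Review:')} reviews"

def run_chain(inputs):
    # Step 1: build {"context", "question"} from the incoming question.
    step1 = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: the assign(...) step sets "context" again and passes other keys through.
    step2 = {**step1, "context": step1["context"]}
    # Step 3: format the prompt, call the model, and keep the context next to the response.
    prompt_text = (
        f"Query:\n{step2['question']}\n\nContext:\n" + "\n".join(step2["context"])
    )
    return {"response": fake_llm(prompt_text), "context": step2["context"]}

result = run_chain({"question": "Did people generally like John Wick?"})
print(result["response"])
```

Returning the context next to the response is a deliberate choice: it lets us inspect (and later evaluate) exactly which documents each retriever fed to the model.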

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [14]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, people generally liked John Wick. Many reviews highlight its stylish action, intense sequences, and entertainment value, with some rating it very highly (e.g., 9 or 10 out of 10). There are a few mixed or less favorable reviews, but overall, the sentiment indicates that people appreciated and enjoyed the film.'

In [15]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})

{'response': AIMessage(content="Based on the reviews provided, people generally liked John Wick. Many reviews give high ratings, praise the film's action sequences, style, and Keanu Reeves' performance, and describe it as fun, slick, and innovative. However, there are some lower-rated reviews and a few critics who are less enthusiastic. Overall, the majority of opinions seem to be positive, indicating that people generally appreciated and enjoyed the film.", additional_kwargs={'refusal': None}, response_metadata={'token_usage': {'completion_tokens': 83, 'prompt_tokens': 3595, 'total_tokens': 3678, 'completion_tokens_details': {'accepted_prediction_tokens': 0, 'audio_tokens': 0, 'reasoning_tokens': 0, 'rejected_prediction_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 3456}}, 'model_name': 'gpt-4.1-nano-2025-04-14', 'system_fingerprint': 'fp_eede8f0d45', 'id': 'chatcmpl-BXpGUUtx0KuWOJJhZKM8v8MBmaOA1', 'service_tier': 'default', 'finish_reason': 'stop', 'logpr

In [16]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The URL to that review is: /review/rw4854296/?ref_=tt_urv'

In [17]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the John Wick film series, the story revolves around John Wick, a retired hitman who seeks vengeance after a series of personal tragedies. In the first movie, John comes out of retirement after gangsters kill his dog and steal his car, which were the last gifts from his deceased wife. His quest for retribution plunges him into the criminal underworld, where he is targeted by numerous assassins due to a bounty placed on his head. Throughout the series, John Wick navigates a dangerous world of criminal organizations, following a strict code of conduct, and facing the consequences of his violent past. The films are known for their stylish action sequences, deep world-building, and the recurring theme that "every action has consequences."'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [18]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)
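
For intuition about what a BM25 score looks like, here's a small self-contained sketch of Okapi BM25 over a toy corpus. It uses whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and a non-negative IDF variant. All of those are assumptions for illustration and may not match what `rank_bm25` computes internally.

```python
import math
from collections import Counter

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for doc in tokenized for term in set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            # Non-negative IDF variant: rarer terms weigh more.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Term-frequency saturation with document-length normalisation.
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

toy_corpus = [
    "keanu reeves is great in this action movie",
    "the dog dies and john wick wants revenge",
    "boring plot and far too long",
]
toy_scores = bm25_scores("john wick revenge", toy_corpus)
best_index = max(range(len(toy_corpus)), key=toy_scores.__getitem__)
print(best_index, toy_scores)
```

Notice that documents sharing no words with the query score exactly zero. That is the sparse, exact-match behaviour that separates BM25 from embedding-based retrieval.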

We'll construct the same chain - only changing the retriever.

In [19]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [20]:
bm25_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the reviews provided, people have mixed opinions about John Wick. Some reviews are highly positive, praising the action, style, and entertainment value, suggesting that many people liked the movies, at least in the earlier installments. For example, one review gave a score of 10 and called it "something special," and another gave an 8, recommending it strongly.\n\nHowever, there are also negative reviews expressing dissatisfaction, especially with later installments like John Wick 3, which received a rating of 1 and was described as mindless and overly violent.\n\nOverall, while many people apparently enjoyed the John Wick movies, especially the first one, there are also notable criticisms, indicating that opinions vary.'

In [21]:
bm25_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'There are no reviews with a rating of 10 in the provided data.'

In [22]:
bm25_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In the John Wick film series, the story revolves around John Wick, a former assassin who is drawn back into the criminal underworld after a series of events. The first movie, John Wick 1, is highly praised for its beautifully choreographed action scenes and emotional depth. It follows Wick as he seeks vengeance after his dog is killed, which is a gift from his deceased wife. The subsequent films, John Wick 2 and John Wick 3, continue to explore his battles against assassins and crime organizations, featuring intense action and elaborate fight scenes. However, the later installment, John Wick 4, received some criticism for repetitive action sequences and implausible plot points, though it still maintains the core theme of Wick's relentless struggle for survival and justice."

It's not clear that this is better or worse overall - but BM25 failing to surface any of the 10-rated reviews isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [23]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
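
The two-stage "retrieve, then compress" pattern can be sketched without any external services. In the sketch below, `toy_overlap_score` is a hypothetical word-overlap scorer standing in for a real reranking model, and `top_n=2` is an arbitrary choice for the example.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Keep the top_n candidates according to a (query, document) relevance score."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_n]]

def toy_overlap_score(query, doc):
    # Hypothetical stand-in for a reranker: fraction of query words found in the doc.
    query_words = set(query.lower().split())
    return len(query_words & set(doc.lower().split())) / len(query_words)

# Pretend these came back from a first-stage retriever with a generous k.
first_stage = [
    "great action and great stunts",
    "the rating is 10 out of 10",
    "i gave this a rating of 3",
    "keanu reeves returns",
]
compressed = rerank("rating of 10", first_stage, toy_overlap_score, top_n=2)
print(compressed)
```

The first stage is cheap and broad and the second is more expensive and precise, which is why we asked the naive retriever for 10 documents earlier.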

Let's create our chain again, and see how this does!

In [24]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [25]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, people generally liked John Wick. The first two reviews are highly positive, praising the film as an exciting, stylish, and fun action movie with high ratings of 9 and 10. These reviews highlight Keanu Reeves' performance, the action sequences, and the overall entertainment value. However, a later review for John Wick 3 rated it lower at 5, indicating some disappointment, but overall, the sentiment from the positive reviews suggests that many viewers liked and appreciated the film."

In [26]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there are reviews with a rating of 10. Here are the URLs to those reviews:\n\n1. [Review titled "A Masterpiece & Brilliant Sequel"](https://yourwebsite.com/review/rw4854296/?ref_=tt_urv) - from john_wick_3.csv\n2. [Review simply titled "10"](https://yourwebsite.com/review/rw8944843/?ref_=tt_urv) - from john_wick_4.csv\n3. [Review titled "It\'s got its own action style!"](https://yourwebsite.com/review/rw4860412/?ref_=tt_urv) - from john_wick_3.csv\n\n(Note: The URLs are based on the provided review URLs; please replace "yourwebsite.com" with the actual domain if needed.)'

In [27]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In the John Wick films, John Wick (played by Keanu Reeves) is a retired hitman who is pulled back into a violent world of crime. The story begins when he resolves issues with the Russian mafia, but his peace is disrupted when a mobster named Santino D'Antonio visits him, demanding help with a favor represented by a marker. Wick refuses, leading Santino to blow up his house. Subsequently, Wick is asked to kill Santino's sister in Rome so he can sit on the criminal organization’s High Table. After completing this task, Santino puts a bounty on Wick’s head, making him the target of numerous killers. Wick then becomes motivated to seek revenge on Santino for the betrayal and the bounty placed on him. The series features intense action, stylish combat, and explores Wick’s struggle to navigate and survive in a dangerous criminal underworld."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how hard is it to set up? Not bad! Let's see it down below!



In [28]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
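
The three steps above boil down to "generate, retrieve, de-duplicate". Here's a minimal sketch of that loop, where `toy_generate_queries` is a hypothetical stand-in for the LLM rewrite step and `toy_index` is a made-up lookup table standing in for the retriever.

```python
def multi_query_retrieve(question, generate_queries, retrieve):
    """Retrieve for each generated query and return the unique documents."""
    unique_docs = []
    seen = set()
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # de-duplicate while preserving first-seen order
                seen.add(doc)
                unique_docs.append(doc)
    return unique_docs

def toy_generate_queries(question):
    # Hypothetical stand-in for the LLM that rewrites the user's question.
    return [
        "Was John Wick well received?",
        "What did reviewers think of John Wick?",
        "Is John Wick a popular movie?",
    ]

# Made-up retrieval results for each rewritten query.
toy_index = {
    "Was John Wick well received?": ["doc_1", "doc_2"],
    "What did reviewers think of John Wick?": ["doc_2", "doc_3"],
    "Is John Wick a popular movie?": ["doc_1", "doc_4"],
}

merged = multi_query_retrieve(
    "Did people generally like John Wick?",
    toy_generate_queries,
    lambda query: toy_index[query],
)
print(merged)
```

Because the union can be larger than any single result list, this strategy tends to send more context to the LLM than the base retriever alone.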

In [29]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [30]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Yes, people generally liked John Wick. The reviews indicate high ratings and positive comments praising its action sequences, style, and overall entertainment value. Many reviewers gave it high scores (such as 9 or 10 out of 10) and described it as a fun, slick, and exciting action film that broke conventional molds.'

In [31]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a ratings of 10. The URL to that review is: /review/rw4854296/?ref_=tt_urv.'

In [32]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In the John Wick series, the story centers around John Wick, an ex-hitman who comes out of retirement to seek vengeance after personal tragedies. The first film starts with John mourning the death of his wife and the killing of his dog, which was a last gift from her. When a gangsters' arresting and theft of his car and dog occurs, Wick unleashes his lethal skills to exact revenge on those responsible, drawing the attention of the criminal underworld and bounty hunters who aim to eliminate him. Throughout the series, Wick is portrayed as a highly skilled and almost legendary assassin entangled in a world of organized crime, with each film exploring the consequences of his violent actions and his struggle for peace and survival amidst a complex network of criminal rules and rivalries."

## Task 8: Parent Document Retriever

The Parent Document Retriever follows a "small-to-big" strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.
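
The "search small, return big" idea can be sketched in a few lines of plain Python. Everything here is a toy assumption: a word-based splitter stands in for a real text splitter, and word overlap stands in for vector similarity.

```python
def split_into_chunks(text, words_per_chunk):
    """Naive word-based splitter (a stand-in for a real text splitter)."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_chunk])
        for i in range(0, len(words), words_per_chunk)
    ]

# The "docstore": full parent documents keyed by id.
parents = {
    "p1": "John Wick is a retired hitman. His dog was a gift from his late wife.",
    "p2": "The sequel sends Wick to Rome. The action is louder but the plot is thinner.",
}

# The "vector store": child chunks, each remembering its parent id.
child_index = [
    {"parent_id": pid, "text": chunk}
    for pid, text in parents.items()
    for chunk in split_into_chunks(text, words_per_chunk=5)
]

def toy_child_search(query, k):
    # Hypothetical stand-in for similarity search: count shared lowercase words.
    query_words = set(query.lower().split())
    def overlap(child):
        return len(query_words & set(child["text"].lower().split()))
    return sorted(child_index, key=overlap, reverse=True)[:k]

def retrieve_parents(query, k=2):
    """Search the small chunks, then return their full parents (deduplicated, in order)."""
    parent_ids = []
    for child in toy_child_search(query, k):
        if child["parent_id"] not in parent_ids:
            parent_ids.append(child["parent_id"])
    return [parents[pid] for pid in parent_ids]

results = retrieve_parents("gift from his wife", k=1)
print(results)
```

The query matches the small chunk about the gift, but what comes back is the whole review, including the sentence about the retired hitman that the chunk alone would have missed.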

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [33]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new Qdrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension (1536, matching the `size` below) - you'll need to change this if you're using a different embedding model.

In [34]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [35]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [36]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [37]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [38]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the provided reviews, people generally liked the John Wick series. Many reviews are positive, praising the action, choreography, and overall entertainment value. However, there is at least one negative review that criticizes John Wick 4 specifically for its plot and action scenes. Overall, the sentiment leans toward liking the series, but opinions on the latest installment vary.'

In [39]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The URL to that review is: /review/rw4854296/?ref_=tt_urv'

In [40]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, the story centers around an ex-hitman named John Wick, played by Keanu Reeves, who comes out of retirement to seek vengeance after a series of personal losses and provocations. In the first film, someone steals his car and kills his dog, which prompts him to go on a violent rampage against those responsible, unleashing a deadly pursuit that draws in many assassins and reveals his lethal skills. The sequel, John Wick Chapter 2, continues his story as he is forced back into the criminal underworld to help a former associate. The plot involves travel to Italy, Canada, and Manhattan, and includes numerous violent encounters as Wick battles against other assassins and figures from his past, all while trying to escape the chaos he unleashes.'

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to answer the performance question more rigorously.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes two or more retrievers and combines their retrieved documents using a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [41]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.
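
To see how rank fusion combines the lists, here's a minimal sketch of weighted Reciprocal Rank Fusion. The constant `c=60` follows the value suggested in the RRF paper linked above. The rankings and document ids are invented, and the exact scoring inside `EnsembleRetriever` may differ in its details.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document ids with weighted Reciprocal Rank Fusion."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes weight / (c + rank): higher-ranked docs get more.
            scores[doc_id] += weight / (c + rank)
    return sorted(scores, key=scores.get, reverse=True)

toy_rankings = [
    ["doc_a", "doc_b", "doc_c"],  # e.g. a keyword retriever
    ["doc_b", "doc_a", "doc_d"],  # e.g. a dense retriever
    ["doc_b", "doc_c", "doc_a"],  # e.g. a reranking retriever
]
toy_weights = [1 / len(toy_rankings)] * len(toy_rankings)

fused = weighted_rrf(toy_rankings, toy_weights)
print(fused)
```

RRF only looks at ranks, never at raw scores, so it can merge retrievers whose scores live on completely different scales (BM25 scores versus cosine similarities, for example).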

In [42]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [43]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, people generally liked John Wick. Many reviews give high ratings (such as 8 or 9 out of 10) and describe the film as stylish, fun, and highly entertaining, especially appreciating its action sequences and Keanu Reeves' performance. However, there are some negative reviews with low ratings, criticizing the films for excess violence, implausible action, or lack of plot. Overall, the majority of the reviews suggest that viewers tend to enjoy John Wick, especially fans of action movies."

In [44]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, there are reviews with a rating of 10. The URLs to those reviews are:\n\n1. [Review for John Wick 3](https://example.com/review/rw4854296/?ref_=tt_urv)\n2. [Review for John Wick 4](https://example.com/review/rw8947764/?ref_=tt_urv)\n\n(Note: The actual review URLs from the provided data are '/review/rw4854296/?ref_=tt_urv' and '/review/rw8947764/?ref_=tt_urv'. If you need the complete link, you may need to append the base URL of the site.)"

In [45]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, a retired hitman named John Wick (played by Keanu Reeves) seeks revenge after a violent home invasion leaves him grieving the loss of his wife and the death of his beloved dog, which was his last remaining connection to her. The story begins with this personal tragedy, and Wick, initially attempting to rebuild his life, is drawn back into the violent underworld of assassins when a group of Russian mobsters, led by a punk who attacks him, steal his car and kill his dog. This act of violence triggers Wick's return to his former lethal skills as he unleashes a ruthless and meticulously orchestrated campaign of revenge against those who wronged him. Throughout the series, Wick is targeted by numerous bounty hunters and assassins, leading to intense action sequences and a complex web of criminal alliances and rules. The films explore themes of vengeance, consequence, and the dangerous world of professional killers."

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
   - `percentile`
   - `standard_deviation`
   - `interquartile`
   - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

> NOTE: You do not need to run this cell if you're running this locally

In [46]:
#!pip install -qU langchain_experimental

We'll use the `percentile` thresholding method for this example, which will:

Calculate the distance between each pair of adjacent sentences, and then break the text apart wherever a distance exceeds a given percentile of all distances.
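To make the percentile rule concrete, here is a minimal pure-Python sketch. It is not LangChain's `SemanticChunker` implementation; the function names, the toy sentences, and the hand-written 2-D "embeddings" are all hypothetical and only illustrate the idea.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def percentile(values, q):
    """Linearly interpolated percentile, with q in [0, 100]."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100.0
    lower, upper = math.floor(position), math.ceil(position)
    weight = position - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


def semantic_chunks(sentences, embeddings, q=95.0):
    """Split wherever the distance to the next sentence exceeds the q-th percentile."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    distances = [
        cosine_distance(embeddings[i], embeddings[i + 1])
        for i in range(len(embeddings) - 1)
    ]
    threshold = percentile(distances, q)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            # Large semantic jump: close the current chunk and start a new one
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data (assumption): two sentences about the plot, two about the action
toy_sentences = [
    "John Wick is a retired hitman.",
    "He returns after his dog is killed.",
    "The fight choreography is superb.",
    "Every action scene is stylish.",
]
toy_embeddings = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]

toy_chunks = semantic_chunks(toy_sentences, toy_embeddings, q=50.0)
print(toy_chunks)
```

With only three adjacent distances, the 50th percentile is used here so that the single large jump between the second and third sentence becomes the breakpoint. On a real corpus a higher percentile is the more usual choice.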

In [47]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [48]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [49]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [50]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [51]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [52]:
semantic_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Based on the reviews provided, people generally liked John Wick. The reviews are mostly positive, highlighting the film's stylish action, choreography, and entertainment value. However, there are some mixed opinions as well. Overall, the majority of comments suggest that John Wick was well-received and appreciated by audiences."

In [53]:
semantic_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The URL to that review is: /review/rw4854296/?ref_=tt_urv'

In [54]:
semantic_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In the John Wick movies, the story revolves around a retired and highly skilled assassin named John Wick. After the death of his wife, he is found to be seeking peace in his life. However, his life takes a violent turn when a young Russian punk, whom Wick declines to sell his classic car to, and his gang surprise him at his home. They beat him up, kill his beloved dog, and steal his car. \n\nThis brutal attack awakens Wick'skiller instincts, revealing his background as a super-assassin. Enraged and seeking revenge, Wick embarks on a relentless quest against the gangsters who wronged him, including their mobster father. The films showcase his swift, stylish, and often brutal action as he fights to reclaim his peace and deal with the consequences of his past. The story emphasizes themes of vengeance, consequence, and the dangerous world of hired killers."

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.
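
If you also want local numbers to cross-check against LangSmith, a small timing helper is one option. This is a sketch only: the `timed` helper and the `time.sleep` calls are hypothetical stand-ins for real retriever and LLM calls.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, timings):
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(label, []).append(time.perf_counter() - start)


timings = {}

for _ in range(3):
    with timed("retrieval", timings):
        time.sleep(0.01)  # stand-in for retriever.invoke(question)
    with timed("llm", timings):
        time.sleep(0.02)  # stand-in for llm.invoke(prompt)

# Average latency per stage, in seconds
averages = {label: sum(values) / len(values) for label, values in timings.items()}
print(averages)
```

`time.perf_counter()` is used because it is a monotonic clock intended for measuring intervals.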

In [55]:
### YOUR CODE HERE

In [57]:
# Standard library
import os
import time
import uuid
from copy import deepcopy
from operator import itemgetter

# Data handling and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import tiktoken
from tqdm.auto import tqdm

# LangChain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.runnables.config import RunnableConfig
from langchain_core.tracers import LangChainTracer
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# LangSmith
from langsmith import Client, traceable
from langsmith.run_helpers import get_current_run_tree

# RAGAS
from ragas import EvaluationDataset, RunConfig, evaluate
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    LLMContextRecall,
    Faithfulness,
    FactualCorrectness,
    ResponseRelevancy,
    ContextEntityRecall,
    NoiseSensitivity
)
from ragas.testset import TestsetGenerator

In [58]:
# Generate SDG
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())
generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
testset_dataset = generator.generate_with_langchain_docs(documents, testset_size=10)


Applying SummaryExtractor:   0%|          | 0/44 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/100 [00:00<?, ?it/s]

Node 9f0de6fb-5076-42c9-a709-602d2a60de6f does not have a summary. Skipping filtering.
Node 8bae1cf3-a528-47a8-b561-a356ffd37da7 does not have a summary. Skipping filtering.
Node 16710e89-fbc1-4587-b8bb-429ce6765d95 does not have a summary. Skipping filtering.
Node 5c4affeb-d363-4f3f-8b1e-94b01a63bf73 does not have a summary. Skipping filtering.
Node bf65a440-0063-4482-bfcb-de85ec1c6ecf does not have a summary. Skipping filtering.
Node 80bb6c44-ce9d-4e83-aaff-dcf3a54c1dc7 does not have a summary. Skipping filtering.
Node e8288e53-4b23-47e4-bb02-c20cc8c15663 does not have a summary. Skipping filtering.
Node 24b2f02f-1c4a-40a4-a4f2-6c9a2d1a800e does not have a summary. Skipping filtering.
Node 3eef9edf-17d5-4f90-982d-309e019e99d1 does not have a summary. Skipping filtering.
Node e8c80ac6-d2e8-42a4-acab-b6fc15423964 does not have a summary. Skipping filtering.
Node 1800b626-ee25-4248-b714-12f0d139ce7d does not have a summary. Skipping filtering.
Node 690efe8b-dc43-41fe-be54-798a6c7696c1 d

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/244 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [59]:
testset_df = testset_dataset.to_pandas()
testset_df

| | user_input | reference_contexts | reference | synthesizer_name |
|---|---|---|---|---|
| 0 | Why John Wick movie so good even tho story sim... | [: 0\nReview: The best way I can describe John... | John Wick is good because it has a simple stor... | single_hop_specifc_query_synthesizer |
| 1 | Could you provide an overview of the "John Wic... | [: 2\nReview: With the fourth installment scor... | The "John Wick" film series has gained signifi... | single_hop_specifc_query_synthesizer |
| 2 | Who is Chad Stahelski and why he important for... | [: 3\nReview: John wick has a very simple reve... | Chad Stahelski is the director of John Wick an... | single_hop_specifc_query_synthesizer |
| 3 | Can you explane the plot and action style of t... | [: 4\nReview: Though he no longer has a taste ... | John Wick follows a retired assassin who has l... | single_hop_specifc_query_synthesizer |
| 4 | How does the escalation of stakes in the John ... | [<1-hop>\n\n: 0\nReview: It is 5 years since t... | The escalation of stakes in the John Wick seri... | multi_hop_abstract_query_synthesizer |
| 5 | Is Keanu Reeves acting in John Wick: Chapter 4... | [<1-hop>\n\n: 23\nReview: Rating 10/10\nI was ... | Keanu Reeves' acting in John Wick: Chapter 4 r... | multi_hop_abstract_query_synthesizer |
| 6 | How does the depiction of katana swords versus... | [<1-hop>\n\n: 13\nReview: From the very beginn... | In the fourth chapter, the depiction of katana... | multi_hop_abstract_query_synthesizer |
| 7 | How do the film reviews reflect the portrayal ... | [<1-hop>\n\n: 20\nReview: John Wick is somethi... | The first review praises John Wick as a specia... | multi_hop_abstract_query_synthesizer |
| 8 | How does Ian McShane's role in John Wick contr... | [<1-hop>\n\n: 9\nReview: At first glance, John... | Ian McShane's role in John Wick is pivotal to ... | multi_hop_specific_query_synthesizer |
| 9 | Why do some action movie fans like Taken but t... | [<1-hop>\n\n: 14\nReview: I absolutely love ac... | Some action movie fans love Taken, but they fi... | multi_hop_specific_query_synthesizer |


In [60]:
retrievers = [
    {"name": "Naive Retriever", "retriever": naive_retriever},
    {"name": "BM25 Retriever", "retriever": bm25_retriever},
    {"name": "Contextual Compression", "retriever": compression_retriever},
    {"name": "Multi-Query Retriever", "retriever": multi_query_retriever},
    {"name": "Parent Document Retriever", "retriever": parent_document_retriever},
    {"name": "Ensemble Retriever", "retriever": ensemble_retriever},
    {"name": "Semantic Chunking Retriever", "retriever": semantic_retriever}
]

eval_llm = ChatOpenAI(model="gpt-4.1-mini")


In [61]:
def run_ragas_evaluation(updated_testsets, results_df, llm):

    print("\nRunning RAGAS evaluation...")
    
    # Create a wrapper for the LLM for RAGAS
    evaluator_llm = LangchainLLMWrapper(llm)
    
    # Configure RAGAS run parameters
    run_config = RunConfig(
        timeout=600,  
        max_workers=2  
    )
    
    # Define RAGAS metrics to evaluate
    metrics = [
        LLMContextRecall(),
        Faithfulness(),
        FactualCorrectness(),
        ResponseRelevancy(),
        ContextEntityRecall()
    ]
    
    # Store RAGAS results for each retriever
    ragas_results = {}
    
    # Get retrievers from results_df
    retriever_names = results_df["Retriever"].tolist()
    
    # Process each retriever separately to get RAGAS scores
    for retriever_name in retriever_names:
        print(f"Running RAGAS evaluation for {retriever_name}...")
        
        try:
            # Get the updated testset for this retriever
            retriever_testset = updated_testsets[retriever_name]
            
            # Convert the testset to a pandas DataFrame for RAGAS
            df = retriever_testset.to_pandas()
            
            # Keep only the columns RAGAS uses for single-turn samples.
            # Assumes the ragas >= 0.2 schema: user_input, response,
            # retrieved_contexts, and reference (the ground-truth answer).
            ragas_df = df[["user_input", "response", "retrieved_contexts", "reference"]].copy()

            # Drop questions that failed during generation and have no response
            ragas_df = ragas_df.dropna(subset=["response", "retrieved_contexts"])

            # Make sure every row's retrieved contexts are a list of strings
            ragas_df["retrieved_contexts"] = ragas_df["retrieved_contexts"].apply(
                lambda x: list(x) if isinstance(x, (list, tuple)) else [x]
            )
            
            # Convert to RAGAS EvaluationDataset
            try:
                evaluation_dataset = EvaluationDataset.from_pandas(ragas_df)
                
                # Run RAGAS evaluation
                ragas_result = evaluate(
                    dataset=evaluation_dataset,
                    metrics=metrics,
                    llm=evaluator_llm,
                    run_config=run_config
                )
                
                # Store the result
                ragas_results[retriever_name] = ragas_result
                
                # Extract mean metric scores. The per-sample scores table is keyed
                # by each metric's `name`, so look columns up from the metric objects
                # instead of hard-coding strings.
                scores_df = ragas_result.to_pandas()
                (
                    llm_context_recall,
                    faithfulness,
                    factual_correctness,
                    response_relevancy,
                    context_entity_recall,
                ) = [
                    float(np.nanmean(scores_df[metric.name])) if metric.name in scores_df.columns else 0.0
                    for metric in metrics
                ]
                
                # Calculate overall quality score (average of all metrics)
                quality_score = np.mean([
                    llm_context_recall,
                    faithfulness,
                    factual_correctness,
                    response_relevancy,
                    context_entity_recall
                ])
                
                # Update the quality score in results_df
                results_df.loc[results_df["Retriever"] == retriever_name, "Quality Score"] = quality_score
                
                # Store individual metrics
                results_df.loc[results_df["Retriever"] == retriever_name, "LLM Context Recall"] = llm_context_recall
                results_df.loc[results_df["Retriever"] == retriever_name, "Faithfulness"] = faithfulness
                results_df.loc[results_df["Retriever"] == retriever_name, "Factual Correctness"] = factual_correctness
                results_df.loc[results_df["Retriever"] == retriever_name, "Response Relevancy"] = response_relevancy
                results_df.loc[results_df["Retriever"] == retriever_name, "Context Entity Recall"] = context_entity_recall
                
                print(f"✓ RAGAS evaluation completed for {retriever_name}: Quality Score = {quality_score:.2f}")
                
            except Exception as e:
                print(f"Error in RAGAS dataset creation for {retriever_name}: {e}")
                results_df.loc[results_df["Retriever"] == retriever_name, "Quality Score"] = 0.5
            
        except Exception as e:
            print(f"Error in RAGAS evaluation for {retriever_name}: {e}")
            # Use a placeholder score in case of errors
            results_df.loc[results_df["Retriever"] == retriever_name, "Quality Score"] = 0.5
    
    # Calculate efficiency metrics using actual costs
    results_df["Efficiency Score"] = results_df["Quality Score"] / results_df["Avg Cost ($)"]
    results_df["Time-Efficiency"] = results_df["Quality Score"] / (results_df["Avg Cost ($)"] * results_df["Avg Time (s)"])
    
    print("RAGAS evaluation completed for all retrievers.")
    
    return results_df, ragas_results
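
The scoring at the end of this function is simple arithmetic: the quality score is the unweighted mean of the metric scores, and the two efficiency figures divide that by cost and by cost times latency. A minimal sketch with made-up numbers (the `score_retriever` helper and every value below are hypothetical, not measured results):

```python
import math


def score_retriever(metric_scores, avg_cost, avg_time):
    """Combine metric scores, cost ($/query), and latency (s) into summary figures."""
    quality = sum(metric_scores.values()) / len(metric_scores)
    # Guard against division by zero when cost or time is missing
    efficiency = quality / avg_cost if avg_cost > 0 else math.nan
    time_efficiency = (
        quality / (avg_cost * avg_time) if avg_cost > 0 and avg_time > 0 else math.nan
    )
    return {
        "Quality Score": quality,
        "Efficiency Score": efficiency,
        "Time-Efficiency": time_efficiency,
    }


# Illustrative values only
example_scores = {
    "LLM Context Recall": 0.8,
    "Faithfulness": 0.9,
    "Factual Correctness": 0.5,
    "Response Relevancy": 0.9,
    "Context Entity Recall": 0.4,
}
summary = score_retriever(example_scores, avg_cost=0.001, avg_time=2.0)
print(summary)
```

Because the mean weights all five metrics equally, a retriever with strong recall but weak factual correctness can score the same as the reverse, so it is worth reading the individual metrics alongside the combined score.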

In [78]:
def evaluate_retrievers_with_langsmith_direct(retrievers, testset_dataset, llm):

    # GPT-4o and GPT-4.1 family models use the o200k_base encoding
    encoding = tiktoken.get_encoding("o200k_base")
    

    # GENERATION COST DEFINITIONS (GPT-4.1-nano)
    GENERATION_INPUT_COST = 0.0000001  # $0.10 per million tokens
    GENERATION_OUTPUT_COST = 0.0000004  # $0.40 per million tokens

    # Create the LangSmith client
    client = Client(api_url="https://api.smith.langchain.com")

    # Build a unique project name
    project_name = f"john-wick-retrieval-{uuid.uuid4().hex[:8]}"
    print(f"LangSmith project name: {project_name}")

    # Create the project
    client.create_project(project_name=project_name)
    print(f"Project '{project_name}' created successfully in LangSmith")

    # Results
    results = []
    latency_details = {}  # Per-retriever latency details
    
    # RAG prompt template
    RAG_TEMPLATE = """\
    You are a helpful and kind assistant. Use the context provided below to answer the question.

    If you do not know the answer, or are unsure, say you don't know.

    Query:
    {question}

    Context:
    {context}
    """

    rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)
    
    # Storage for the updated testsets
    updated_testsets = {}

    # Evaluate each retriever
    for retriever_info in retrievers:
        name = retriever_info["name"]
        retriever = retriever_info["retriever"]

        print(f"\nEvaluating: {name}")

        # Structure for storing the metrics
        latency_details[name] = {
            "retrieval_times": [],
            "llm_times": [],
            "total_times": [],
            "processing_times": [],
            "prompt_tokens": [],
            "completion_tokens": [],
            "costs": []
        }
        
        # Deep copy of the testset
        retriever_testset = deepcopy(testset_dataset)
        document_counts = []
        retrieval_times = []
        llm_times = []
        processing_times = []
        prompt_tokens_list = []
        completion_tokens_list = []
        costs_list = []
        run_ids = []  # Track run IDs explicitly

        # Build the RAG chain
        rag_chain = (
            {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
            | RunnablePassthrough.assign(context=itemgetter("context"))
            | {"response": rag_prompt | llm, "context": itemgetter("context")}
        )
        
        # Test each question
        for i, test_row in enumerate(tqdm(retriever_testset, desc=f"Testing {name}")):
            try:
                # Question
                question = test_row.eval_sample.user_input

                # 1. Retrieval time
                retrieval_start = time.time()
                context_result = retriever.invoke(question)
                retrieval_time = time.time() - retrieval_start
                retrieval_times.append(retrieval_time)
                latency_details[name]["retrieval_times"].append(retrieval_time)
                
                # 2. Intermediate processing time
                processing_start = time.time()
                context_text = "\n\n".join([doc.page_content for doc in context_result])
                formatted_prompt = rag_prompt.format(question=question, context=context_text)
                processing_time = time.time() - processing_start
                processing_times.append(processing_time)
                latency_details[name]["processing_times"].append(processing_time)
                
                # EXPLICIT COUNT OF PROMPT TOKENS SENT FOR GENERATION

                prompt_tokens = len(encoding.encode(formatted_prompt))
                prompt_tokens_list.append(prompt_tokens)
                latency_details[name]["prompt_tokens"].append(prompt_tokens)

                # 3. LLM time
                llm_start = time.time()
                llm_response = llm.invoke(formatted_prompt)
                llm_time = time.time() - llm_start
                llm_times.append(llm_time)
                latency_details[name]["llm_times"].append(llm_time)
                
                # Count completion tokens
             
                completion_tokens = len(encoding.encode(llm_response.content))
                completion_tokens_list.append(completion_tokens)
                latency_details[name]["completion_tokens"].append(completion_tokens)
                
                
                # EXPLICIT GENERATION COST CALCULATION
                # The cost constants are already per-token prices, so no further scaling is needed
                cost = (prompt_tokens * GENERATION_INPUT_COST) + (completion_tokens * GENERATION_OUTPUT_COST)
                costs_list.append(cost)
                latency_details[name]["costs"].append(cost)

                # 4. Total time
                total_time = retrieval_time + processing_time + llm_time
                latency_details[name]["total_times"].append(total_time)
                
                # Number of documents
                doc_count = len(context_result)
                document_counts.append(doc_count)

                # Update the testset
                test_row.eval_sample.response = llm_response.content
                test_row.eval_sample.retrieved_contexts = [
                    doc.page_content for doc in context_result
                ]
                
                # MANUALLY CREATE THE RUN IN LANGSMITH
                run_id = str(uuid.uuid4())
                run_ids.append(run_id)

                try:
                    # Create the run through the client API directly
                    client.create_run(
                        name=f"{name}-question-{i}",
                        inputs={"question": question},
                        outputs={
                            "response": llm_response.content,
                            "documents": [doc.page_content for doc in context_result]
                        },
                        run_type="chain",
                        tags=[name, "retriever_evaluation"],
                        project_name=project_name,
                        id=run_id,
                        start_time=retrieval_start,
                        end_time=llm_start + llm_time,
                        metadata={
                            "retriever": name,
                            "question_id": i,
                            "ls_model_name": "gpt-4.1-nano",  # Explicit generation model
                            "ls_provider": "openai"
                        },
                    )
                    
                    # Update the run with detailed metrics
                    client.update_run(
                        run_id=run_id,
                        metrics={
                            "total_time": total_time,
                            "retrieval_time": retrieval_time,
                            "llm_time": llm_time,
                            "processing_time": processing_time,
                            "document_count": doc_count,
                            "prompt_tokens": prompt_tokens,
                            "completion_tokens": completion_tokens,
                            "total_tokens": prompt_tokens + completion_tokens,
                            "cost": cost
                        }
                    )
                    print(f"Run {run_id} created successfully in LangSmith")
                except Exception as e:
                    print(f"Failed to create run in LangSmith: {e}")
                
                # Short pause
                time.sleep(0.5)
                
            except Exception as e:
                print(f"Error on question {i}: {e}")
        
        # Store the updated testset
        updated_testsets[name] = retriever_testset

        # Wait for LangSmith to process the runs
        print(f"Waiting for LangSmith to process runs for {name}...")
        time.sleep(10)

        # Average our manually tracked metrics
        avg_docs = np.mean(document_counts) if document_counts else 0
        avg_retrieval_time = np.mean(retrieval_times) if retrieval_times else 0
        avg_llm_time = np.mean(llm_times) if llm_times else 0
        avg_processing_time = np.mean(processing_times) if processing_times else 0
        avg_total_time = avg_retrieval_time + avg_llm_time + avg_processing_time
        
        avg_prompt_tokens = np.mean(prompt_tokens_list) if prompt_tokens_list else 0
        avg_completion_tokens = np.mean(completion_tokens_list) if completion_tokens_list else 0
        avg_cost = np.mean(costs_list) if costs_list else 0
        
        # Verify the created runs directly by ID
        print("Verifying runs directly by ID...")
        verified_runs = []
        for run_id in run_ids:
            try:
                run = client.read_run(run_id)
                verified_runs.append(run)
                print(f"Run {run_id} verified in LangSmith")
            except Exception as e:
                print(f"Could not verify run {run_id}: {e}")
        
        # Collect metrics from the verified runs only
        langsmith_times = []
        langsmith_costs = []
        
        for run in verified_runs:
            if hasattr(run, "metrics") and run.metrics:
                metrics = run.metrics
                if "total_time" in metrics:
                    langsmith_times.append(metrics["total_time"])
                if "cost" in metrics:
                    langsmith_costs.append(metrics["cost"])
        
        # Fall back to the manually tracked metrics
        if not langsmith_times:
            print("No LangSmith timing data found, using manually tracked times")
            avg_langsmith_time = avg_total_time
        else:
            avg_langsmith_time = np.mean(langsmith_times)
            print(f"Using LangSmith timing data: {avg_langsmith_time:.2f}s")
        
        if not langsmith_costs:
            print("No LangSmith cost data found, using manually calculated costs")
            avg_langsmith_cost = avg_cost
        else:
            avg_langsmith_cost = np.mean(langsmith_costs)
            print(f"Using LangSmith cost data: ${avg_langsmith_cost:.6f}/query")
        
        # Number of questions
        num_questions = len(testset_dataset)

        # Compute efficiency scores
        quality_score = 0.5  # Placeholder; replace with the RAGAS score when available
        efficiency_score = (quality_score / avg_cost) * 1000 if avg_cost > 0 else 0
        time_efficiency = quality_score / avg_total_time if avg_total_time > 0 else 0

        # Store the results with the simplified metrics
        results.append({
            "Retriever": name,
            "Avg Time (s)": avg_langsmith_time,
            "Retrieval Time (s)": avg_retrieval_time,
            "LLM Time (s)": avg_llm_time,
            "Processing Time (s)": avg_processing_time,
            "Avg Docs": avg_docs,
            "Avg Prompt Tokens": avg_prompt_tokens,
            "Avg Completion Tokens": avg_completion_tokens,
            "Avg Cost ($)": avg_cost,  # Generation cost only
            "Quality Score": quality_score,
            "Efficiency Score": efficiency_score,
            "Time-Efficiency": time_efficiency
        })
        
        print(f"✓ {name}: Total {avg_langsmith_time:.2f}s (Retrieval: {avg_retrieval_time:.2f}s, LLM: {avg_llm_time:.2f}s)")
        print(f"  Docs: {avg_docs:.1f}, Generation Cost: ${avg_cost:.6f}/query")
    
    # Print the link to the LangSmith dashboard
    print(f"\nView detailed results in LangSmith: https://smith.langchain.com/projects/{project_name}")

    # Return the results and the updated testsets
    return pd.DataFrame(results), updated_testsets, project_name, latency_details
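
The per-query generation cost tracked in this function is token counts multiplied by per-token prices. As a worked example, here is a small sketch that keeps prices in dollars per million tokens and converts once. The `generation_cost` helper is hypothetical, and the default prices are simply the GPT-4.1-nano figures quoted in the comments above, which may be out of date.

```python
import math

# Assumed prices in dollars per one million tokens (taken from the cell above)
INPUT_PRICE_PER_MILLION = 0.10
OUTPUT_PRICE_PER_MILLION = 0.40


def generation_cost(
    prompt_tokens,
    completion_tokens,
    input_price=INPUT_PRICE_PER_MILLION,
    output_price=OUTPUT_PRICE_PER_MILLION,
):
    """Return the dollar cost of one generation call."""
    total = prompt_tokens * input_price + completion_tokens * output_price
    return total / 1_000_000


# A query with a 2,000-token prompt and a 300-token answer:
# (2000 * 0.10 + 300 * 0.40) / 1,000,000 dollars
example_cost = generation_cost(2000, 300)
print(f"${example_cost:.6f} per query")
```

Keeping the prices per million tokens and dividing in a single place makes it harder to apply the unit conversion twice.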

In [79]:
def generate_visual_report_with_latency(results_df, latency_details=None):
    """
    Generate visualizations and report including latency breakdowns
    """
    # Make sure our plotting directory exists
    import os
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    os.makedirs("evaluation_plots", exist_ok=True)
    
    # Check if we have RAGAS metrics columns
    has_ragas_metrics = "LLM Context Recall" in results_df.columns
    
    # Create standard visualizations
    # Quality comparison
    plt.figure(figsize=(10, 6))
    quality_plot = sns.barplot(x="Retriever", y="Quality Score", data=results_df)
    plt.title("Retriever Quality Comparison")
    plt.xticks(rotation=45, ha="right")
    for i, bar in enumerate(quality_plot.patches):
        quality_plot.text(
            bar.get_x() + bar.get_width()/2.,
            bar.get_height() + 0.01,
            f'{bar.get_height():.2f}',
            ha='center'
        )
    plt.tight_layout()
    plt.savefig("evaluation_plots/retriever_quality.png")
    plt.close()
    
    # Latency breakdown visualization
    if "Retrieval Time (s)" in results_df.columns and "LLM Time (s)" in results_df.columns:
        # Plot the per-component times as a stacked bar chart, one bar per retriever,
        # so the total-time labels added below sit on top of each stack
        latency_components = ["Retrieval Time (s)", "LLM Time (s)", "Processing Time (s)"]
        latency_data = results_df.set_index("Retriever")[latency_components]
        latency_data.columns.name = "Component"

        fig, ax = plt.subplots(figsize=(12, 6))
        latency_plot = latency_data.plot(kind="bar", stacked=True, ax=ax)
        ax.set_xlabel("Retriever")
        ax.set_ylabel("Time (s)")
        plt.title("Latency Breakdown by Retriever")
        plt.xticks(rotation=45, ha="right")
        plt.legend(title="Component")
        
        # Add total time labels
        for i, retriever in enumerate(results_df["Retriever"]):
            total_time = results_df.loc[results_df["Retriever"] == retriever, "Avg Time (s)"].values[0]
            plt.text(
                i, 
                total_time + 0.1, 
                f'{total_time:.2f}s', 
                ha='center'
            )
            
        plt.tight_layout()
        plt.savefig("evaluation_plots/latency_breakdown.png")
        plt.close()
    
    # If we have RAGAS metrics, create a detailed metrics comparison
    if has_ragas_metrics:
        # Get RAGAS metric columns
        ragas_metrics = [
            "LLM Context Recall", 
            "Faithfulness", 
            "Factual Correctness", 
            "Response Relevancy", 
            "Context Entity Recall"
        ]
        
        # Create a melted dataframe for easier plotting
        plot_data = pd.melt(
            results_df, 
            id_vars=["Retriever"], 
            value_vars=ragas_metrics,
            var_name="Metric", 
            value_name="Score"
        )
        
        # Create RAGAS metrics comparison chart
        plt.figure(figsize=(12, 8))
        ragas_plot = sns.barplot(x="Metric", y="Score", hue="Retriever", data=plot_data)
        plt.title("RAGAS Metrics Comparison by Retriever", fontsize=15)
        plt.xticks(rotation=45, ha="right")
        plt.tight_layout()
        plt.savefig("evaluation_plots/ragas_metrics_comparison.png")
        plt.close()
    
    # Cost comparison
    plt.figure(figsize=(10, 6))
    cost_plot = sns.barplot(x="Retriever", y="Avg Cost ($)", data=results_df)
    plt.title("Retriever Cost Comparison")
    plt.xticks(rotation=45, ha="right")
    for i, bar in enumerate(cost_plot.patches):
        cost_plot.text(
            bar.get_x() + bar.get_width()/2.,
            bar.get_height() + 0.0001,
            f'${bar.get_height():.4f}',
            ha='center'
        )
    plt.tight_layout()
    plt.savefig("evaluation_plots/retriever_cost.png")
    plt.close()
    
    # Efficiency comparison
    plt.figure(figsize=(10, 6))
    eff_plot = sns.barplot(x="Retriever", y="Efficiency Score", data=results_df)
    plt.title("Retriever Efficiency (Quality/Cost)")
    plt.xticks(rotation=45, ha="right")
    for i, bar in enumerate(eff_plot.patches):
        eff_plot.text(
            bar.get_x() + bar.get_width()/2.,
            bar.get_height() + 5,
            f'{bar.get_height():.1f}',
            ha='center'
        )
    plt.tight_layout()
    plt.savefig("evaluation_plots/retriever_efficiency.png")
    plt.close()
    
    # Enhanced visualization: Cost vs. Quality with Latency
    plt.figure(figsize=(12, 8))
    scatter = plt.scatter(
        results_df["Avg Cost ($)"], 
        results_df["Quality Score"],
        s=results_df["Avg Time (s)"] * 50,  # Size based on time
        c=range(len(results_df)),  # Color based on index
        cmap='viridis',
        alpha=0.7
    )
    
    # Add labels for each point
    for i, row in results_df.iterrows():
        plt.annotate(
            row["Retriever"],
            (row["Avg Cost ($)"] + 0.00005, row["Quality Score"] + 0.01),
            fontsize=10,
            ha='center'
        )
    
    # Add efficiency contour lines
    cost_range = np.linspace(
        results_df["Avg Cost ($)"].min() * 0.8 if results_df["Avg Cost ($)"].min() > 0 else 0.00001,
        results_df["Avg Cost ($)"].max() * 1.2,
        100
    )
    
    # Plot efficiency contour lines
    efficiency_levels = [50, 100, 200, 500, 1000]
    for level in efficiency_levels:
        plt.plot(cost_range, level * cost_range, '--', color='gray', alpha=0.5)
        # Label the line
        mid_point = len(cost_range) // 2
        plt.text(
            cost_range[mid_point], 
            level * cost_range[mid_point], 
            f'Efficiency = {level}', 
            color='gray', 
            fontsize=8,
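            # The slope is expressed in data coordinates, so let matplotlib convert
            # the angle to screen space; otherwise every label is nearly vertical
            rotation=np.degrees(np.arctan(level)),
            rotation_mode='anchor',
            transform_rotates_text=True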
        )
    
    plt.xlabel('Cost per Query ($)', fontsize=12)
    plt.ylabel('Quality Score', fontsize=12)
    plt.title('Retriever Comparison: Quality vs Cost (circle size = latency)', fontsize=14)
    plt.grid(True, linestyle='--', alpha=0.6)
    plt.colorbar(scatter, label='Retriever Index')
    
    # Add a legend for the circle sizes
    latency_sizes = [1, 3, 5, 10]
    for size in latency_sizes:
        plt.scatter([], [], s=size * 50, color='gray', alpha=0.5, 
                   label=f'{size}s latency')
    plt.legend(title="Reference", loc="upper right")
    
    # Save enhanced visualization
    plt.tight_layout()
    plt.savefig("evaluation_plots/enhanced_quality_cost_latency.png")
    plt.close()
    
    # Find best retrievers
    best_quality = results_df.loc[results_df["Quality Score"].idxmax()]["Retriever"]
    best_efficiency = results_df.loc[results_df["Efficiency Score"].idxmax()]["Retriever"]
    if "Time-Efficiency" in results_df.columns:
        best_time_efficiency = results_df.loc[results_df["Time-Efficiency"].idxmax()]["Retriever"]
    else:
        best_time_efficiency = "Not calculated"
    lowest_cost = results_df.loc[results_df["Avg Cost ($)"].idxmin()]["Retriever"]
    fastest = results_df.loc[results_df["Avg Time (s)"].idxmin()]["Retriever"]
    
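    # Generate report with detailed latency breakdown.
    # The text is built flush-left because Markdown renders lines indented
    # by four spaces as a code block, which would break the tables.
    report = (
        "# Retriever Evaluation Report\n\n"
        "## Results Summary\n\n"
        "| Retriever | Total Time (s) | Retrieval (s) | LLM (s) | Docs | Quality | Cost ($) | Efficiency |\n"
        "|-----------|---------------|--------------|---------|------|---------|----------|------------|\n"
    )
    
    # Add each row with latency breakdown
    for _, row in results_df.iterrows():
        # Timing columns may be absent; format numbers here and fall back to "-"
        # (applying ":.2f" to the "-" placeholder would raise a ValueError)
        retrieval_time = f"{row['Retrieval Time (s)']:.2f}" if "Retrieval Time (s)" in row else "-"
        llm_time = f"{row['LLM Time (s)']:.2f}" if "LLM Time (s)" in row else "-"
        
        report += f"| {row['Retriever']} | {row['Avg Time (s)']:.2f} | {retrieval_time} | {llm_time} | {row['Avg Docs']:.1f} | {row['Quality Score']:.2f} | ${row['Avg Cost ($)']:.5f} | {row['Efficiency Score']:.1f} |\n"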
    
    # Add RAGAS metrics detail if available
    if has_ragas_metrics:
        # Flush-left text so Markdown renders this as a table, not a code block
        report += (
            "\n## RAGAS Metrics Details\n\n"
            "| Retriever | LLM Context Recall | Faithfulness | Factual Correctness | Response Relevancy | Context Entity Recall |\n"
            "|-----------|-------------------|--------------|---------------------|-------------------|----------------------|\n"
        )
        
        for _, row in results_df.iterrows():
            report += f"| {row['Retriever']} | {row['LLM Context Recall']:.2f} | {row['Faithfulness']:.2f} | {row['Factual Correctness']:.2f} | {row['Response Relevancy']:.2f} | {row['Context Entity Recall']:.2f} |\n"
    
    # Add latency analysis section
    # Flush-left text so Markdown renders this as a table, not a code block
    report += (
        "\n## Latency Analysis\n\n"
        "| Retriever | Total Time (s) | Retrieval % | LLM % | Processing % |\n"
        "|-----------|---------------|------------|-------|--------------|\n"
    )
    
    # Add latency percentage breakdown
    for _, row in results_df.iterrows():
        if "Retrieval Time (s)" in row and "LLM Time (s)" in row and "Processing Time (s)" in row:
            total = row["Avg Time (s)"]
            retrieval_pct = (row["Retrieval Time (s)"] / total) * 100 if total > 0 else 0
            llm_pct = (row["LLM Time (s)"] / total) * 100 if total > 0 else 0
            processing_pct = (row["Processing Time (s)"] / total) * 100 if total > 0 else 0
            
            report += f"| {row['Retriever']} | {total:.2f} | {retrieval_pct:.1f}% | {llm_pct:.1f}% | {processing_pct:.1f}% |\n"
    
    # Add recommendations. The block is dedented so the indented Markdown below
    # is not rendered as a code block in the saved report.
    import textwrap  # local import so this step does not rely on the notebook's import cell
    report += textwrap.dedent(f"""
    ## Best Retrievers by Category

    | Category | Best Retriever |
    |----------|---------------|
    | Overall Quality | {best_quality} |
    | Cost Efficiency | {best_efficiency} |
    | Time Efficiency | {best_time_efficiency} |
    | Lowest Cost | {lowest_cost} |
    | Fastest Response | {fastest} |

    ## Analysis

    Based on our evaluation of retrievers using the John Wick movie reviews dataset:

    1. **{best_quality}** achieved the highest quality results with a score of {results_df.loc[results_df["Retriever"] == best_quality, "Quality Score"].values[0]:.2f}. This retriever is ideal when result accuracy is the top priority.

    2. **{best_efficiency}** offered the best balance of quality and cost, achieving an efficiency score of {results_df.loc[results_df["Retriever"] == best_efficiency, "Efficiency Score"].values[0]:.1f}. This retriever is recommended for production systems where both performance and cost must be optimized.

    3. **{fastest}** had the lowest overall latency at {results_df.loc[results_df["Retriever"] == fastest, "Avg Time (s)"].values[0]:.2f}s, making it suitable for applications where response time is critical.

    4. **{lowest_cost}** had the lowest cost per query at ${results_df.loc[results_df["Retriever"] == lowest_cost, "Avg Cost ($)"].values[0]:.5f}, making it suitable for high-volume applications with tight budget constraints.

    ## Tradeoffs and Recommendations

    For this John Wick dataset, we recommend:
    - **For production use**: {best_efficiency}
    - **For development/testing**: {lowest_cost}
    - **For real-time applications**: {fastest}
    - **For high-stakes applications**: {best_quality}
    """)
    
    # Save report
    with open("retriever_evaluation_report.md", "w") as f:
        f.write(report)
    
    return report
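
The latency percentages and the efficiency score used in the report can be checked by hand on a tiny table. The sketch below uses made-up numbers and two hypothetical helpers (`latency_shares` and `efficiency`, neither of which is part of the notebook), and it assumes the efficiency score is simply quality divided by cost, as the plot title "Quality/Cost" suggests.

```python
import pandas as pd

# Hypothetical timings and scores for illustration only; these are not measured results.
toy_results = pd.DataFrame({
    "Retriever": ["Retriever A", "Retriever B"],
    "Avg Time (s)": [2.0, 1.0],
    "Retrieval Time (s)": [0.5, 0.1],
    "LLM Time (s)": [1.2, 0.8],
    "Processing Time (s)": [0.3, 0.1],
    "Quality Score": [0.8, 0.6],
    "Avg Cost ($)": [0.002, 0.001],
})


def latency_shares(row):
    """Return each latency component as a percentage of the total time."""
    total = row["Avg Time (s)"]
    components = ["Retrieval Time (s)", "LLM Time (s)", "Processing Time (s)"]
    if total <= 0:
        # Guard against division by zero, mirroring the report code
        return {name: 0.0 for name in components}
    return {name: row[name] / total * 100 for name in components}


def efficiency(row):
    """Quality per dollar (assumed definition: Quality Score / Avg Cost)."""
    return row["Quality Score"] / row["Avg Cost ($)"]


for _, row in toy_results.iterrows():
    shares = latency_shares(row)
    summary = ", ".join(f"{name}: {pct:.1f}%" for name, pct in shares.items())
    print(f"{row['Retriever']}: {summary}; efficiency = {efficiency(row):.0f}")
```

Under the same assumption, each dashed contour in the quality-versus-cost scatter plot is the line `quality = level * cost`, so points above a given line have an efficiency higher than that level.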

In [77]:
api_key = getpass.getpass("Enter your LangSmith API key: ")
os.environ["LANGSMITH_API_KEY"] = api_key
os.environ["LANGCHAIN_TRACING_V2"] = "true"

In [None]:
# Run the evaluation with detailed latency capture
results_df, updated_testsets, project_name, latency_details = evaluate_retrievers_with_langsmith_direct(
    retrievers, testset_dataset, eval_llm
)



In [None]:
# Generate the report with latency details
report = generate_visual_report_with_latency(results_df, latency_details)
print("Report created: retriever_evaluation_report.md")

In [None]:
# Display the enhanced plot with the latency data
plt.figure(figsize=(12, 8))
scatter = plt.scatter(
    results_df["Avg Cost ($)"], 
    results_df["Quality Score"],
    s=results_df["Avg Time (s)"] * 50,  # Size based on time
    c=range(len(results_df)),  # Color based on index
    cmap='viridis',
    alpha=0.7
)

# Add a label for each point
for i, row in results_df.iterrows():
    plt.annotate(
        row["Retriever"],
        (row["Avg Cost ($)"] + 0.00005, row["Quality Score"] + 0.01),
        fontsize=10,
        ha='center'
    )

plt.xlabel('Cost per Query ($)', fontsize=12)
plt.ylabel('Quality Score', fontsize=12)
plt.title('Retriever Comparison: Quality vs Cost vs Latency', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.6)
plt.colorbar(scatter, label='Retriever Index')

# Add a legend for the circle sizes
latency_sizes = [1, 3, 5, 10]
for size in latency_sizes:
    plt.scatter([], [], s=size * 50, color='gray', alpha=0.5, 
               label=f'{size}s latency')
plt.legend(title="Reference", loc="upper right")

plt.tight_layout()
plt.savefig("evaluation_plots/interactive_latency_viz.png")
plt.show()


