# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

In [2]:
!pip install -U langchain langchain-openai langchain-cohere rank_bm25



We're also going to be leveraging [Qdrant's](https://qdrant.tech/documentation/frameworks/langchain/) (pronounced "Quadrant") VectorDB in "memory" mode (so we can leverage it locally in our colab environment).

In [3]:
!pip install -U qdrant-client

Collecting qdrant-client
  Using cached qdrant_client-1.11.3-py3-none-any.whl.metadata (10 kB)
Collecting grpcio>=1.41.0 (from qdrant-client)
  Using cached grpcio-1.66.1-cp312-cp312-win_amd64.whl.metadata (4.0 kB)
Collecting grpcio-tools>=1.41.0 (from qdrant-client)
  Using cached grpcio_tools-1.66.1-cp312-cp312-win_amd64.whl.metadata (5.5 kB)
Collecting portalocker<3.0.0,>=2.7.0 (from qdrant-client)
  Using cached portalocker-2.10.1-py3-none-any.whl.metadata (8.5 kB)
Collecting protobuf<6.0dev,>=5.26.1 (from grpcio-tools>=1.41.0->qdrant-client)
  Using cached protobuf-5.28.2-cp310-abi3-win_amd64.whl.metadata (592 bytes)
Collecting setuptools (from grpcio-tools>=1.41.0->qdrant-client)
  Using cached setuptools-75.1.0-py3-none-any.whl.metadata (6.9 kB)
Collecting h2<5,>=3 (from httpx[http2]>=0.20.0->qdrant-client)
  Using cached h2-4.1.0-py3-none-any.whl.metadata (3.6 kB)
Collecting hyperframe<7,>=6.0 (from h2<5,>=3->httpx[http2]>=0.20.0->qdrant-client)
  Using cached hyperframe-6.0.1-

We'll also provide our OpenAI key, as well as our Cohere API key.

In [5]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

Enter your OpenAI API Key:··········


In [6]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

Cohere API Key:··········


## Task 2: Data Collection and Preparation

We'll be using some reviews from the 4 movies in the John Wick franchise today to explore the different retrieval strategies.

These were obtained from IMDB, and are available in the [AIM Data Repository](https://github.com/AI-Maker-Space/DataRepository).

### Data Collection

We can simply `wget` these from GitHub.

You could use any review data you wanted in this step - just be careful to make sure your metadata is aligned with your choice.

In [12]:
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv -O john_wick_1.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv -O john_wick_2.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw3.csv -O john_wick_3.csv
!wget https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw4.csv -O john_wick_4.csv

SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
--2024-09-27 11:39:12--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw1.csv
Resolving raw.githubusercontent.com... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com|185.199.110.133|:443... connected.
OpenSSL: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
Unable to establish SSL connection.
SYSTEM_WGETRC = c:/progra~1/wget/etc/wgetrc
syswgetrc = C:\Program Files (x86)\GnuWin32/etc/wgetrc
--2024-09-27 11:39:13--  https://raw.githubusercontent.com/AI-Maker-Space/DataRepository/main/jw2.csv
Resolving raw.githubusercontent.com... 185.199.110.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com|185.199.110.133|:443... connected.
OpenSSL: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol
Unable to establish SSL connection.
SYSTEM_WGETRC = c:/progra~1/wget/e

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

- Self-Query: Wants as much metadata as we can provide
- Time-weighted: Wants temporal data

> NOTE: While we're creating a temporal relationship based on when these movies came out for illustrative purposes, it needs to be clear that the "time-weighting" in the Time-weighted Retriever is based on when the document was *accessed* last - not when it was created.

In [1]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"],
      encoding="UTF-8"
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

Let's look at an example document to see if everything worked as expected!

In [2]:
documents[0]

Document(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2024, 9, 24, 11, 47, 5, 638957)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved from him. It's a beautifully simple premise for an action movie - when action movies get convoluted, they get bad i.e. A Good Day to Die Hard. John Wick gives the viewers what they want: Awesome action, stylish stunts, kinetic chaos, and a relatable hero to tie it all together. John Wick succeeds in its simplicity.")

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "JohnWick".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [5]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWick"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [6]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [7]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-3.5-turbo` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [8]:
from langchain_openai import AzureChatOpenAI

chat_model = AzureChatOpenAI(azure_deployment="gpt-35-turbo")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [9]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [43]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Overall, the reviews suggest that people generally liked John Wick. Many reviewers praised the film's action sequences, Keanu Reeves' performance as the titular character, and the film's stylish visuals. However, there were a few negative reviews as well. One reviewer did not understand the hype surrounding the film and found it to be a generic action thriller. Another reviewer felt that the magic of the first film was lost in the third installment."

In [44]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, one review has a rating of 10. Here is the URL to that review: '/review/rw4854296/?ref_=tt_urv'."

In [45]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In the original John Wick movie, an ex-hitman seeks revenge against the gangsters who killed his dog and took everything from him. He becomes the target of an army of bounty-hunting killers, and he unleashes a maelstrom of destruction against those who attempt to chase him. In the second movie, John Wick is forced back into the assassin world when he is called on to pay off an old debt. In the third movie, John Wick deals with the consequences of his actions in the previous film and continues to explore the world of assassination.'

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set-up! Let's see it happen down below!


In [46]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(documents)

We'll construct the same chain - only changing the retriever.

In [47]:
naive_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [48]:
naive_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'It is difficult to determine whether people generally liked John Wick based on the given context as there are differing opinions expressed in the reviews provided.'

In [49]:
naive_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, one review has a rating of 10. However, there are no URLs provided for the reviews in the given context.'

In [50]:
naive_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'John Wick is a movie series known for its beautifully choreographed action scenes, emotional setup, and unique characters.'

It's not clear that this is better or worse - but the `I don't know` isn't great!

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [51]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [52]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [53]:
contextual_compression_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'Based on the given context, it appears that people generally liked John Wick. The first two reviews gave it high ratings and praised its action sequences, characters, and overall entertainment value. However, the third review suggests that the magic may have been lost in the third installment.'

In [54]:
contextual_compression_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, one review has a rating of 10. Here is the URL to that review: '/review/rw4854296/?ref_=tt_urv'."

In [55]:
contextual_compression_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick 2, John Wick is asked by Santino D'Antonio to kill his sister Gianna D'Antonio in Rome so that Santino can sit on the High Table of the criminal organizations. After completing the task, Santino puts a seven-million dollar contract on John Wick, attracting professional killers from everywhere. John Wick promises to kill Santino, who is no longer protected by his marker. \n\nIn John Wick 1, an ex-hit-man seeks revenge after an arrogant Russian mob prince and hoodlums steal his car and kill his dog. He finds himself dragged into an impossible task as every killer in the business dreams of cornering the legendary Wick who now has an enormous price on his head. The legendary hitman will be forced to unearth his meticulously concealed identity and to carry out a relentless vendetta."

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had....more than one query!

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how is it to set-up? Not bad! Let's see it down below!



In [56]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [57]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [58]:
multi_query_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"People generally liked John Wick, with many reviewers praising its slick action sequences and Keanu Reeves' performance as the titular character. However, there were a few reviewers who were less impressed and found the film to be generic or lacking in plot. It is important to note that opinions on the film were not unanimous."

In [59]:
multi_query_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, there is a review with a rating of 10 for John Wick 3. The URL for that review is '/review/rw4854296/?ref_=tt_urv'."

In [60]:
multi_query_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, an ex-hit-man comes out of retirement to seek revenge against the gangsters that killed his dog and took everything from him. He unleashes a carefully orchestrated maelstrom of destruction against those attempting to chase him, as he is the target of hit men. In John Wick 2, he is forced back into the assassin world when an Italian baddie calls in a favor and Wick has no choice but to accept. In John Wick 3, the film deals with the fallout of John's actions at the end of the previous film and sends him on an even bigger odyssey of violence."

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [45]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = documents
child_splitter = RecursiveCharacterTextSplitter(chunk_size=200)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [46]:
client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = Qdrant(
    collection_name="full_documents", embeddings=embeddings, client=client
)

  since,


Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [47]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [48]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [49]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [68]:
parent_document_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

'It is unclear from the provided context whether people generally liked John Wick. There are mixed opinions expressed in the reviews.'

In [69]:
parent_document_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is a review with a rating of 10. The URL to that review is /review/rw4854296/?ref_=tt_urv.'

In [70]:
parent_document_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick 1, an ex-hitman seeks revenge when gangsters kill his dog and steal his car. In John Wick 2, John is called on to pay off an old debt by helping Ian McShane take over the Assassin's Guild by flying around to Italy, Canada, and Manhattan and killing what seems like hundreds of assassins."

Overall, the performance *seems* largely the same. We can leverage a tool like [Ragas]() to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [55]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [56]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [73]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"The majority of the reviews suggest that people generally liked John Wick. There are numerous positive reviews that praise the film's action sequences, world-building, and Keanu Reeves' performance. However, there are also a few negative reviews that criticize the film for being too violent or lacking in plot. Overall, it seems that the film has a strong following among action fans."

In [74]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

'Yes, there is one review with a rating of 10 for "John Wick 3". The URL to that review is \'/review/rw4854296/?ref_=tt_urv\'.'

In [75]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

"In John Wick, an ex-hitman comes out of retirement to track down the gangsters that killed his dog and took everything from him. With the untimely death of his beloved wife still bitter in his mouth, he seeks vengeance. In John Wick 2, John Wick is called on to pay off an old debt by helping Ian McShane take over the Assassin's Guild by flying around to Italy, Canada and Manhattan and killing what seems like hundreds of assassins. In John Wick 3, John Wick deals with the consequences of his actions at the end of the previous film and goes on an even bigger odyssey of violence that continues to explore the world of assassination and deliver beautifully clean action sequences. As for John Wick 4, there are mixed reviews, with some finding it disappointing and others enjoying it."

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

In [76]:
!pip install -U langchain_experimental



We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between sentences, and then break apart sequences of setences that exceed a given percentile among all distances.

In [62]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [63]:
semantic_documents = semantic_chunker.split_documents(documents)

Let's create a new vector store.

In [64]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="JohnWickSemantic"
)

We'll use naive retrieval for this example.

In [65]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [66]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [82]:
ensemble_retrieval_chain.invoke({"question" : "Did people generally like John Wick?"})["response"].content

"Overall, the reviews suggest that people generally liked John Wick. Many reviewers praised the action sequences, Keanu Reeves' performance, and the style of the film. However, there were a few negative reviews that criticized the lack of plot or character development."

In [83]:
ensemble_retrieval_chain.invoke({"question" : "Do any reviews have a rating of 10? If so - can I have the URLs to those reviews?"})["response"].content

"Yes, one review has a rating of 10. The URL to that review is '/review/rw4854296/?ref_=tt_urv'."

In [84]:
ensemble_retrieval_chain.invoke({"question" : "What happened in John Wick?"})["response"].content

'In John Wick, an ex-hitman seeks revenge after gangsters kill his dog and steal his car, and he becomes the target of hitmen. In John Wick 2, John is forced back into the assassination world to pay off an old debt and is tasked with killing the sister of a mobster. In John Wick 3, John is on the run after being declared excommunicado and has to fight his way out of New York City. In John Wick 4, the plot is unclear from the reviews provided.'

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against eachother.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparision between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

### Prepare chunks

In [1]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

documents = []

for i in range(1, 5):
  loader = CSVLoader(
      file_path=f"john_wick_{i}.csv",
      metadata_columns=["Review_Date", "Review_Title", "Review_Url", "Author", "Rating"],
      encoding="UTF-8"
  )

  movie_docs = loader.load()
  for doc in movie_docs:

    # Add the "Movie Title" (John Wick 1, 2, ...)
    doc.metadata["Movie_Title"] = f"John Wick {i}"

    # convert "Rating" to an `int`, if no rating is provided - assume 0 rating
    doc.metadata["Rating"] = int(doc.metadata["Rating"]) if doc.metadata["Rating"] else 0

    # newer movies have a more recent "last_accessed_at"
    doc.metadata["last_accessed_at"] = datetime.now() - timedelta(days=4-i)

  documents.extend(movie_docs)

### Create a Golden Test Data Set

In [2]:
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context
from langchain_openai import AzureOpenAIEmbeddings, AzureChatOpenAI

generator_llm = AzureChatOpenAI(azure_deployment="gpt-4", temperature=0)
critic_llm = AzureChatOpenAI(azure_deployment="gpt-4", temperature=0)
embeddings = AzureOpenAIEmbeddings(azure_deployment="text-embedding-3-large")

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

num_qa_pairs = 50 # You can reduce the number of QA pairs to 5 if you're experiencing rate-limiting issues

testset = generator.generate_with_langchain_docs(documents, num_qa_pairs, distributions, raise_exceptions=False)
testset.to_pandas()

embedding nodes:   0%|          | 0/200 [00:00<?, ?it/s]

Filename and doc_id are the same for all nodes.


Generating:   0%|          | 0/50 [00:00<?, ?it/s]

max retries exceeded for SimpleEvolution(generator_llm=LangchainLLMWrapper(run_config=RunConfig(timeout=180, max_retries=15, max_wait=90, max_workers=16, exception_types=<class 'openai.RateLimitError'>, log_tenacity=False, seed=42)), docstore=InMemoryDocumentStore(splitter=<langchain_text_splitters.base.TokenTextSplitter object at 0x000002636B0C0B90>, nodes=[Node(metadata={'source': 'john_wick_1.csv', 'row': 0, 'Review_Date': '6 May 2015', 'Review_Title': ' Kinetic, concise, and stylish; John Wick kicks ass.\n', 'Review_Url': '/review/rw3233896/?ref_=tt_urv', 'Author': 'lnvicta', 'Rating': 8, 'Movie_Title': 'John Wick 1', 'last_accessed_at': datetime.datetime(2024, 9, 28, 20, 16, 35, 840641)}, page_content=": 0\nReview: The best way I can describe John Wick is to picture Taken but instead of Liam Neeson it's Keanu Reeves and instead of his daughter it's his dog. That's essentially the plot of the movie. John Wick (Reeves) is out to seek revenge on the people who took something he loved

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What aspects of the first John Wick film were ...,[: 10\nReview: The first John Wick film took m...,The first John Wick film was surprising to the...,simple,"[{'source': 'john_wick_2.csv', 'row': 10, 'Rev...",True
1,How does Keanu Reeves contribute to the action...,[: 19\nReview: The inevitable third chapter of...,Keanu Reeves contributes to the action sequenc...,simple,"[{'source': 'john_wick_3.csv', 'row': 19, 'Rev...",True
2,How does the review suggest violence in action...,"[: 15\nReview: ""I keel you, I keel all of you....",The review suggests that violence should be ma...,simple,"[{'source': 'john_wick_2.csv', 'row': 15, 'Rev...",True
3,Why might someone think the story is cool if i...,[: 16\nReview: You could have written this on ...,Someone might think the story is cool if it in...,simple,"[{'source': 'john_wick_1.csv', 'row': 16, 'Rev...",True
4,Why might someone think the story is cool if i...,[: 16\nReview: You could have written this on ...,Someone might think the story is cool if it in...,simple,"[{'source': 'john_wick_1.csv', 'row': 16, 'Rev...",True
5,How does Chad Stahelski's direction contribute...,[: 3\nReview: John wick has a very simple reve...,Chad Stahelski's direction contributes to the ...,simple,"[{'source': 'john_wick_1.csv', 'row': 3, 'Revi...",True
6,Why is practical stunt work appreciated in the...,[: 22\nReview: John Wick is one of my favourit...,Practical stunt work is appreciated in the Joh...,simple,"[{'source': 'john_wick_2.csv', 'row': 22, 'Rev...",True
7,Why does Reeves' character have Russian iconic...,[: 22\nReview: All the below are non-creative ...,"Reeves is covered in tats, of course, to show ...",simple,"[{'source': 'john_wick_1.csv', 'row': 22, 'Rev...",True
8,Why is John Wick seeking revenge in the movie?,[: 10\nReview: Wow what a great surprise this ...,John Wick is seeking revenge because some thug...,simple,"[{'source': 'john_wick_1.csv', 'row': 10, 'Rev...",True
9,"Why does the reviewer describe the movie as ""m...",[: 18\nReview: And all of this equals boredom....,The reviewer describes the movie as 'mindless ...,simple,"[{'source': 'john_wick_2.csv', 'row': 18, 'Rev...",True


In [3]:
testset.to_pandas()

Unnamed: 0,question,contexts,ground_truth,evolution_type,metadata,episode_done
0,What aspects of the first John Wick film were ...,[: 10\nReview: The first John Wick film took m...,The first John Wick film was surprising to the...,simple,"[{'source': 'john_wick_2.csv', 'row': 10, 'Rev...",True
1,How does Keanu Reeves contribute to the action...,[: 19\nReview: The inevitable third chapter of...,Keanu Reeves contributes to the action sequenc...,simple,"[{'source': 'john_wick_3.csv', 'row': 19, 'Rev...",True
2,How does the review suggest violence in action...,"[: 15\nReview: ""I keel you, I keel all of you....",The review suggests that violence should be ma...,simple,"[{'source': 'john_wick_2.csv', 'row': 15, 'Rev...",True
3,Why might someone think the story is cool if i...,[: 16\nReview: You could have written this on ...,Someone might think the story is cool if it in...,simple,"[{'source': 'john_wick_1.csv', 'row': 16, 'Rev...",True
4,Why might someone think the story is cool if i...,[: 16\nReview: You could have written this on ...,Someone might think the story is cool if it in...,simple,"[{'source': 'john_wick_1.csv', 'row': 16, 'Rev...",True
5,How does Chad Stahelski's direction contribute...,[: 3\nReview: John wick has a very simple reve...,Chad Stahelski's direction contributes to the ...,simple,"[{'source': 'john_wick_1.csv', 'row': 3, 'Revi...",True
6,Why is practical stunt work appreciated in the...,[: 22\nReview: John Wick is one of my favourit...,Practical stunt work is appreciated in the Joh...,simple,"[{'source': 'john_wick_2.csv', 'row': 22, 'Rev...",True
7,Why does Reeves' character have Russian iconic...,[: 22\nReview: All the below are non-creative ...,"Reeves is covered in tats, of course, to show ...",simple,"[{'source': 'john_wick_1.csv', 'row': 22, 'Rev...",True
8,Why is John Wick seeking revenge in the movie?,[: 10\nReview: Wow what a great surprise this ...,John Wick is seeking revenge because some thug...,simple,"[{'source': 'john_wick_1.csv', 'row': 10, 'Rev...",True
9,"Why does the reviewer describe the movie as ""m...",[: 18\nReview: And all of this equals boredom....,The reviewer describes the movie as 'mindless ...,simple,"[{'source': 'john_wick_2.csv', 'row': 18, 'Rev...",True


In [24]:
test_questions = testset.to_pandas().question.to_list()
test_groundtruths = testset.to_pandas().ground_truth.to_list()

### Naive Retrieval

##### Generate answers

In [26]:
answers = []
contexts = []

# loop through the test questions
for question in test_questions:
  response = naive_retrieval_chain.invoke({"question" : question}) # invoke the retrieval chain with the question
  answers.append(response["response"].content) # append the answer to the answers list
  contexts.append([context.page_content for context in response["context"]]) # append the context to the contexts list

##### Create a dataset with questions, answers, context, and ground truth

In [27]:
from datasets import Dataset

naive_retrieval_chain_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

##### Run evaluation on the naive RAG pipeline

In [30]:
from ragas import evaluate
from ragas.metrics import (
    context_recall,
    context_precision,
)

metrics = [
    context_recall,
    context_precision
]

In [31]:
naive_retrieval_chain_results = evaluate(naive_retrieval_chain_dataset, 
                                               metrics, 
                                               llm=generator_llm, # evaluate with gpt-4o
                                               embeddings=embeddings
                                               )

Evaluating:   0%|          | 0/98 [00:00<?, ?it/s]

In [32]:
naive_retrieval_chain_results

{'context_recall': 0.9422, 'context_precision': 0.7025}

### Best-Matching 25 (BM25) Retriever

##### Generate answers

In [18]:
from langchain_community.retrievers import BM25Retriever
bm25_retriever = BM25Retriever.from_documents(documents)
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [33]:
answers = []
contexts = []

# loop through the test questions
for question in test_questions:
  response = bm25_retrieval_chain.invoke({"question" : question}) # invoke the retrieval chain with the question
  answers.append(response["response"].content) # append the answer to the answers list
  contexts.append([context.page_content for context in response["context"]]) # append the context to the contexts list

##### Create a dataset with questions, answers, context, and ground truth

In [34]:
bm25_retrieval_chain_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

##### Run evaluation on the BM25 RAG pipeline

In [35]:
bm25_retrieval_chain_results = evaluate(bm25_retrieval_chain_dataset, 
                                               metrics, 
                                               llm=generator_llm, 
                                               embeddings=embeddings
                                               )

Evaluating:   0%|          | 0/98 [00:00<?, ?it/s]

In [36]:
bm25_retrieval_chain_results

{'context_recall': 0.6765, 'context_precision': 0.6003}

### Multi-Query Retriever

##### Generate answers

In [40]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [41]:
answers = []
contexts = []

# loop through the test questions
for question in test_questions:
  response = multi_query_retrieval_chain.invoke({"question" : question}) # invoke the retrieval chain with the question
  answers.append(response["response"].content) # append the answer to the answers list
  contexts.append([context.page_content for context in response["context"]]) # append the context to the contexts list

##### Create a dataset with questions, answers, context, and ground truth

In [42]:
multi_query_retrieval_chain_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

##### Run evaluation on the Multi-Query Retriever pipeline

In [43]:
multi_query_retrieval_chain_results = evaluate(multi_query_retrieval_chain_dataset, 
                                               metrics, 
                                               llm=generator_llm, 
                                               embeddings=embeddings
                                               )

Evaluating:   0%|          | 0/98 [00:00<?, ?it/s]

In [44]:
multi_query_retrieval_chain_results

{'context_recall': 0.9320, 'context_precision': 0.6747}

### Parent Document Retriever

In [50]:
answers = []
contexts = []

# loop through the test questions
for question in test_questions:
  response = parent_document_retrieval_chain.invoke({"question" : question}) # invoke the retrieval chain with the question
  answers.append(response["response"].content) # append the answer to the answers list
  contexts.append([context.page_content for context in response["context"]]) # append the context to the contexts list

##### Create a dataset with questions, answers, context, and ground truth

In [51]:
parent_document_retrieval_chain_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

##### Run evaluation on the Multi-Query Retriever pipeline

In [52]:
parent_document_retrieval_chain_results = evaluate(parent_document_retrieval_chain_dataset, 
                                               metrics, 
                                               llm=generator_llm, 
                                               embeddings=embeddings
                                               )

Evaluating:   0%|          | 0/98 [00:00<?, ?it/s]

In [53]:
parent_document_retrieval_chain_results

{'context_recall': 0.6037, 'context_precision': 0.7698}

### Ensemble Retriever

In [57]:
answers = []
contexts = []

# loop through the test questions
for question in test_questions:
  response = ensemble_retrieval_chain.invoke({"question" : question}) # invoke the retrieval chain with the question
  answers.append(response["response"].content) # append the answer to the answers list
  contexts.append([context.page_content for context in response["context"]]) # append the context to the contexts list

##### Create a dataset with questions, answers, context, and ground truth

In [58]:
ensemble_retrieval_chain_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

##### Run evaluation on the Ensemble Retriever pipeline

In [59]:
ensemble_retrieval_chain_results = evaluate(ensemble_retrieval_chain_dataset, 
                                               metrics, 
                                               llm=generator_llm, 
                                               embeddings=embeddings
                                               )

Evaluating:   0%|          | 0/98 [00:00<?, ?it/s]

In [60]:
ensemble_retrieval_chain_results

{'context_recall': 0.9796, 'context_precision': 0.6997}

### Semantic Retriever

In [67]:
answers = []
contexts = []

# loop through the test questions
for question in test_questions:
  response = semantic_retrieval_chain.invoke({"question" : question}) # invoke the retrieval chain with the question
  answers.append(response["response"].content) # append the answer to the answers list
  contexts.append([context.page_content for context in response["context"]]) # append the context to the contexts list

##### Create a dataset with questions, answers, context, and ground truth

In [68]:
semantic_retrieval_chain_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

##### Run evaluation on the Multi-Query Retriever pipeline

In [69]:
semantic_retrieval_chain_results = evaluate(semantic_retrieval_chain_dataset, 
                                               metrics, 
                                               llm=generator_llm, 
                                               embeddings=embeddings
                                               )

Evaluating:   0%|          | 0/98 [00:00<?, ?it/s]

In [70]:
semantic_retrieval_chain_results

{'context_recall': 0.9048, 'context_precision': 0.6114}

In [None]:
test_questions = testset.to_pandas().question.to_list()
test_groundtruths = testset.to_pandas().ground_truth.to_list()

### Baseline Retrieval

##### Generate answers

In [91]:
# update the chain with LLM, prompt, and question variable.
retrieval_baseline_qa_chain = (
    {"context": itemgetter("context"), "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context"), "question": itemgetter("question")}
)

In [92]:
import ast
answers_baseline = []

# loop through the test questions
for question, context in zip(test_questions, testset.to_pandas().contexts.to_list()):
  response = retrieval_baseline_qa_chain.invoke({"question" : question, "context": context}) # invoke the retrieval chain with the question
  answers_baseline.append(response["response"].content) # append the answer to the answers list

##### Create a dataset with questions, answers, context, and ground truth

In [93]:
baseline_response_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers_baseline,
    "contexts" : testset.to_pandas().contexts.to_list(),
    "ground_truth" : test_groundtruths
})

##### Run evaluation on the baseline RAG pipeline

In [97]:
baseline_retrieval_results = evaluate(baseline_response_dataset, 
                                               metrics, 
                                               llm=generator_llm, # evaluate with gpt-4o
                                               embeddings=embeddings
                                               )

Evaluating:   0%|          | 0/98 [00:00<?, ?it/s]

In [98]:
baseline_retrieval_results

{'context_recall': 0.9388, 'context_precision': 0.9388}

### Different retrieval methods comparison

In [99]:
import pandas as pd
df_1 = pd.DataFrame(list(naive_retrieval_chain_results.items()), columns=['Metric', 'Naive'])
df_2 = pd.DataFrame(list(bm25_retrieval_chain_results.items()), columns=['Metric', 'BM25'])
df_3 = pd.DataFrame(list(multi_query_retrieval_chain_results.items()), columns=['Metric', 'MultiQuery'])
df_4 = pd.DataFrame(list(parent_document_retrieval_chain_results.items()), columns=['Metric', 'ParentDocument'])
df_5 = pd.DataFrame(list(ensemble_retrieval_chain_results.items()), columns=['Metric', 'Ensemble'])
df_6 = pd.DataFrame(list(semantic_retrieval_chain_results.items()), columns=['Metric', 'Semantic'])
df_7 = pd.DataFrame(list(baseline_retrieval_results.items()), columns=['Metric', 'Baseline'])

df_merged_mqr = pd.merge(df_1, df_2, on='Metric')
df_merged_mqr = pd.merge(df_merged_mqr, df_3, on='Metric')
df_merged_mqr = pd.merge(df_merged_mqr, df_4, on='Metric')
df_merged_mqr = pd.merge(df_merged_mqr, df_5, on='Metric')
df_merged_mqr = pd.merge(df_merged_mqr, df_6, on='Metric')
df_merged_mqr = pd.merge(df_merged_mqr, df_7, on='Metric')

df_merged_mqr

Unnamed: 0,Metric,Naive,BM25,MultiQuery,ParentDocument,Ensemble,Semantic,Baseline
0,context_recall,0.942177,0.676531,0.931973,0.603741,0.979592,0.904762,0.938776
1,context_precision,0.702502,0.60034,0.674667,0.769841,0.699667,0.611377,0.938776


#### Comments

**Disclaimer**: I use the assumption that to ensure that RAG systems produce accurate and comprehensive answers, **maximizing Context Recall is crucial**. High precision without recall doesn't guarantee completeness. Even if the top-ranked contexts are highly relevant, missing other necessary contexts would still impair the answer's correctness. Therefore, I would personally put context recall metric above the precision metric.

* For the best comparison, I have also included Baseline RAG results. I have created a pipeline that takes the baseline context as in the Golden Dataset and produces a response based on the context. Nothing else was changed. Therefore, it gives us a baseline to compare to. As you can see, the baseline results are not perfect (not 100%).
* More interestingly baseline recall is lower than naive and ensemble.
* However, we should not read too much into the the differences of those several hundredth, though it seems at first sight significant. As you can see below, I have ran multiple statistical tests: Student's, Mann-Whitney (for unequal variances), ANOVA, and Tukey-HSD.
* The statistical test demonstrated with high significance, that there are no differences in context recall results for Naive, Multi-Query, Ensemble, Semantic, and Baseline pipelines! Though they seem different and one would tend to choose the one with the highest result, it is always prudent to run stats over the data to confirm. I should have added visuals, my bad.
* The only two groups that are different: BM25 and Parent Document. They both showed significantly lower context recall means.
* Based that the five groups have the same context recall means, I would choose Naive RAG approach because it runs faster it is cheaper given the same performance as the other retrievers.
* Unfortunately, I was not able to figure out the problem of conflicting packages when both RAGAS and langchain_cohere packages are installed. Therefore, I did not run the compression RAG pipeline.


## Testing to check if group means for context recall are different for various RAG retrievers

### Student's t-tests assumptions

The Student's t-test is a statistical test used to compare the means of two groups. There are several key assumptions that must be met for the results of the t-test to be valid:

1. **Independence**: The observations within each group and between groups should be independent of each other. This means that the data points collected from one subject should not influence the data points collected from another subject.

2. **Normality**: The data in each group should be approximately normally distributed. This assumption is particularly important for small sample sizes. For larger sample sizes, the t-test is fairly robust to deviations from normality due to the Central Limit Theorem.

3. **Homogeneity of Variances (Homoskedasticity)**: The variances of the two groups should be equal. This assumption can be tested using Levene's test or an F-test. If the variances are not equal, a variation of the t-test known as Welch's t-test can be used.

4. **Scale of Measurement**: The dependent variable should be measured at the interval or ratio level, meaning that it should be continuous and have a meaningful zero point.

For a two-sample t-test specifically, these assumptions apply to both groups being compared. If any of these assumptions are violated, the results of the t-test may not be reliable, and alternative statistical methods may need to be considered.

In [74]:
import numpy as np
from scipy import stats


# Normality test using Shapiro-Wilk test
shapiro_test_group1 = stats.shapiro(naive_retrieval_chain_results.to_pandas().context_recall)
shapiro_test_group2 = stats.shapiro(ensemble_retrieval_chain_results.to_pandas().context_recall)

print("Shapiro-Wilk Test for Group 1:")
print(f"Statistic: {shapiro_test_group1.statistic}, p-value: {shapiro_test_group1.pvalue}")

print("Shapiro-Wilk Test for Group 2:")
print(f"Statistic: {shapiro_test_group2.statistic}, p-value: {shapiro_test_group2.pvalue}")

# Homogeneity of variances test using Levene's test
levene_test = stats.levene(naive_retrieval_chain_results.to_pandas().context_recall,
                           ensemble_retrieval_chain_results.to_pandas().context_recall)

print("\nLevene's Test for Homogeneity of Variances:")
print(f"Statistic: {levene_test.statistic}, p-value: {levene_test.pvalue}")

# Interpretation of p-values
alpha = 0.05

if shapiro_test_group1.pvalue > alpha:
    print("Group 1: Data is normally distributed (fail to reject H0)")
else:
    print("Group 1: Data is not normally distributed (reject H0)")

if shapiro_test_group2.pvalue > alpha:
    print("Group 2: Data is normally distributed (fail to reject H0)")
else:
    print("Group 2: Data is not normally distributed (reject H0)")

if levene_test.pvalue > alpha:
    print("Groups have equal variances (fail to reject H0)")
else:
    print("Groups do not have equal variances (reject H0)")

Shapiro-Wilk Test for Group 1:
Statistic: 0.2956230977878055, p-value: 5.761064352267824e-14
Shapiro-Wilk Test for Group 2:
Statistic: 0.12726191214735139, p-value: 1.6541602971846816e-15

Levene's Test for Homogeneity of Variances:
Statistic: 1.0364025695931482, p-value: 0.3112184227222076
Group 1: Data is not normally distributed (reject H0)
Group 2: Data is not normally distributed (reject H0)
Groups have equal variances (fail to reject H0)


### Naive VS Ensemble

In [76]:
t_statistic, p_value = stats.ttest_ind(naive_retrieval_chain_results.to_pandas().context_recall,
                           ensemble_retrieval_chain_results.to_pandas().context_recall)

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: -1.0180385894420427
P-value: 0.3112184227222026


In [77]:
from scipy.stats import mannwhitneyu
# Perform Mann-Whitney U Test
stat, p_value = mannwhitneyu(naive_retrieval_chain_results.to_pandas().context_recall,
                           ensemble_retrieval_chain_results.to_pandas().context_recall)

print(f'Mann-Whitney U Test Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: The distributions of the two samples are different.")
else:
    print("Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.")

Mann-Whitney U Test Statistic: 1128.0
P-Value: 0.17967529746847144
Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.


In [78]:
t_statistic, p_value = stats.ttest_ind(naive_retrieval_chain_results.to_pandas().context_precision,
                           ensemble_retrieval_chain_results.to_pandas().context_precision)

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: 0.049824493885134984
P-value: 0.960365765492277


In [79]:
from scipy.stats import mannwhitneyu
# Perform Mann-Whitney U Test
stat, p_value = mannwhitneyu(naive_retrieval_chain_results.to_pandas().context_precision,
                           ensemble_retrieval_chain_results.to_pandas().context_precision)

print(f'Mann-Whitney U Test Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: The distributions of the two samples are different.")
else:
    print("Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.")

Mann-Whitney U Test Statistic: 1261.0
P-Value: 0.6690052855974007
Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.


### Naive VS Semantic

In [80]:
t_statistic, p_value = stats.ttest_ind(naive_retrieval_chain_results.to_pandas().context_recall,
                           semantic_retrieval_chain_results.to_pandas().context_recall)

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: 0.795932065821667
P-value: 0.4280352844934595


In [81]:
from scipy.stats import mannwhitneyu
# Perform Mann-Whitney U Test
stat, p_value = mannwhitneyu(naive_retrieval_chain_results.to_pandas().context_recall,
                           semantic_retrieval_chain_results.to_pandas().context_recall)

print(f'Mann-Whitney U Test Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: The distributions of the two samples are different.")
else:
    print("Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.")

Mann-Whitney U Test Statistic: 1274.0
P-Value: 0.34379584569900634
Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.


In [84]:
t_statistic, p_value = stats.ttest_ind(naive_retrieval_chain_results.to_pandas().context_precision,
                           semantic_retrieval_chain_results.to_pandas().context_precision)

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: 1.4309444796362407
P-value: 0.15569334358252004


In [85]:
from scipy.stats import mannwhitneyu
# Perform Mann-Whitney U Test
stat, p_value = mannwhitneyu(naive_retrieval_chain_results.to_pandas().context_precision,
                           semantic_retrieval_chain_results.to_pandas().context_precision)

print(f'Mann-Whitney U Test Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: The distributions of the two samples are different.")
else:
    print("Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.")

Mann-Whitney U Test Statistic: 1355.0
P-Value: 0.272664406163859
Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.


### Naive VS BM25

In [86]:
t_statistic, p_value = stats.ttest_ind(naive_retrieval_chain_results.to_pandas().context_recall,
                           bm25_retrieval_chain_results.to_pandas().context_recall)

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: 3.9771950472628186
P-value: 0.00013524022178254053


In [87]:
from scipy.stats import mannwhitneyu
# Perform Mann-Whitney U Test
stat, p_value = mannwhitneyu(naive_retrieval_chain_results.to_pandas().context_recall,
                           bm25_retrieval_chain_results.to_pandas().context_recall)

print(f'Mann-Whitney U Test Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: The distributions of the two samples are different.")
else:
    print("Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.")

Mann-Whitney U Test Statistic: 1622.0
P-Value: 9.020064640570298e-05
Reject the null hypothesis: The distributions of the two samples are different.


In [88]:
t_statistic, p_value = stats.ttest_ind(naive_retrieval_chain_results.to_pandas().context_precision,
                           bm25_retrieval_chain_results.to_pandas().context_precision)

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: 1.3838785319557414
P-value: 0.16960475934865415


In [89]:
from scipy.stats import mannwhitneyu
# Perform Mann-Whitney U Test
stat, p_value = mannwhitneyu(naive_retrieval_chain_results.to_pandas().context_precision,
                           bm25_retrieval_chain_results.to_pandas().context_precision)

print(f'Mann-Whitney U Test Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: The distributions of the two samples are different.")
else:
    print("Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.")

Mann-Whitney U Test Statistic: 1239.5
P-Value: 0.7826124272974455
Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.


### Naive VS Baseline

In [100]:
t_statistic, p_value = stats.ttest_ind(naive_retrieval_chain_results.to_pandas().context_recall,
                           baseline_retrieval_results.to_pandas().context_recall)

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: 0.0736709468683758
P-value: 0.9414255028198807


In [101]:
from scipy.stats import mannwhitneyu
# Perform Mann-Whitney U Test
stat, p_value = mannwhitneyu(naive_retrieval_chain_results.to_pandas().context_recall,
                           baseline_retrieval_results.to_pandas().context_recall)

print(f'Mann-Whitney U Test Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: The distributions of the two samples are different.")
else:
    print("Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.")

Mann-Whitney U Test Statistic: 1179.0
P-Value: 0.7381632503029347
Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.


In [102]:
t_statistic, p_value = stats.ttest_ind(naive_retrieval_chain_results.to_pandas().context_precision,
                           baseline_retrieval_results.to_pandas().context_precision)

# Print the results
print(f"T-statistic: {t_statistic}")
print(f"P-value: {p_value}")

T-statistic: -4.518565791196536
P-value: 1.774247120449696e-05


In [103]:
from scipy.stats import mannwhitneyu
# Perform Mann-Whitney U Test
stat, p_value = mannwhitneyu(naive_retrieval_chain_results.to_pandas().context_precision,
                           baseline_retrieval_results.to_pandas().context_precision)

print(f'Mann-Whitney U Test Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: The distributions of the two samples are different.")
else:
    print("Fail to reject the null hypothesis: The distributions of the two samples are not significantly different.")

Mann-Whitney U Test Statistic: 436.5
P-Value: 1.5310204543770765e-08
Reject the null hypothesis: The distributions of the two samples are different.


### Anova

In [105]:
from scipy.stats import f_oneway

# Perform One-Way ANOVA
stat, p_value = f_oneway(naive_retrieval_chain_results.to_pandas().context_recall,
                            bm25_retrieval_chain_results.to_pandas().context_recall,
                            parent_document_retrieval_chain_results.to_pandas().context_recall,
                            ensemble_retrieval_chain_results.to_pandas().context_recall,
                            semantic_retrieval_chain_results.to_pandas().context_recall,
                            multi_query_retrieval_chain_results.to_pandas().context_recall,
                            baseline_retrieval_results.to_pandas().context_recall)

print(f'ANOVA F-Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: At least one group mean is different.")
else:
    print("Fail to reject the null hypothesis: The group means are not significantly different.")

ANOVA F-Statistic: 12.615357467448298
P-Value: 7.688975621600229e-13
Reject the null hypothesis: At least one group mean is different.


We can see that at least one of the retrivers is differents from the others

In [104]:



# Perform One-Way ANOVA
stat, p_value = f_oneway(naive_retrieval_chain_results.to_pandas().context_recall,
                            ensemble_retrieval_chain_results.to_pandas().context_recall,
                            semantic_retrieval_chain_results.to_pandas().context_recall,
                            multi_query_retrieval_chain_results.to_pandas().context_recall,
                            baseline_retrieval_results.to_pandas().context_recall)

print(f'ANOVA F-Statistic: {stat}')
print(f'P-Value: {p_value}')

if p_value < 0.05:
    print("Reject the null hypothesis: At least one group mean is different.")
else:
    print("Fail to reject the null hypothesis: The group means are not significantly different.")

ANOVA F-Statistic: 0.7124856815578468
P-Value: 0.5841033565094673
Fail to reject the null hypothesis: The group means are not significantly different.


However, just to confirm my hunch, the five retrievers (naive, ensemble, semantic, multi query, and baseline) are not different.

In [108]:
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Example data
data = {
    'score': np.concatenate([naive_retrieval_chain_results.to_pandas().context_recall,
                            bm25_retrieval_chain_results.to_pandas().context_recall,
                            parent_document_retrieval_chain_results.to_pandas().context_recall,
                            ensemble_retrieval_chain_results.to_pandas().context_recall,
                            semantic_retrieval_chain_results.to_pandas().context_recall,
                            multi_query_retrieval_chain_results.to_pandas().context_recall,
                            baseline_retrieval_results.to_pandas().context_recall]),
    'group': ['Naive']*len(naive_retrieval_chain_results.to_pandas().context_recall) +
                ['BM25']*len(bm25_retrieval_chain_results.to_pandas().context_recall) +
                ['ParentDocument']*len(parent_document_retrieval_chain_results.to_pandas().context_recall) +
                ['Ensemble']*len(ensemble_retrieval_chain_results.to_pandas().context_recall) +
                ['Semantic']*len(semantic_retrieval_chain_results.to_pandas().context_recall) +
                ['MultiQuery']*len(multi_query_retrieval_chain_results.to_pandas().context_recall) +
                ['Baseline']*len(baseline_retrieval_results.to_pandas().context_recall)
}

df = pd.DataFrame(data)

# Perform Tukey's HSD test
tukey_result = pairwise_tukeyhsd(df['score'], df['group'], alpha=0.05)
print(tukey_result)

        Multiple Comparison of Means - Tukey HSD, FWER=0.05         
    group1         group2     meandiff p-adj   lower   upper  reject
--------------------------------------------------------------------
          BM25       Baseline   0.2622 0.0003  0.0861  0.4384   True
          BM25       Ensemble   0.3031    0.0  0.1269  0.4792   True
          BM25     MultiQuery   0.2554 0.0004  0.0793  0.4316   True
          BM25          Naive   0.2656 0.0002  0.0895  0.4418   True
          BM25 ParentDocument  -0.0728 0.8839  -0.249  0.1034  False
          BM25       Semantic   0.2282 0.0028   0.052  0.4044   True
      Baseline       Ensemble   0.0408 0.9932 -0.1354   0.217  False
      Baseline     MultiQuery  -0.0068    1.0  -0.183  0.1694  False
      Baseline          Naive   0.0034    1.0 -0.1728  0.1796  False
      Baseline ParentDocument   -0.335    0.0 -0.5112 -0.1588   True
      Baseline       Semantic   -0.034 0.9975 -0.2102  0.1422  False
      Ensemble     MultiQuery  -0.

In [107]:
!pip install statsmodels

Collecting statsmodels
  Downloading statsmodels-0.14.3-cp312-cp312-win_amd64.whl.metadata (9.5 kB)
Collecting patsy>=0.5.6 (from statsmodels)
  Downloading patsy-0.5.6-py2.py3-none-any.whl.metadata (3.5 kB)
Downloading statsmodels-0.14.3-cp312-cp312-win_amd64.whl (9.8 MB)
   ---------------------------------------- 0.0/9.8 MB ? eta -:--:--
   - -------------------------------------- 0.3/9.8 MB ? eta -:--:--
   ------ --------------------------------- 1.6/9.8 MB 6.0 MB/s eta 0:00:02
   ---------- ----------------------------- 2.6/9.8 MB 5.6 MB/s eta 0:00:02
   ---------------- ----------------------- 3.9/9.8 MB 5.7 MB/s eta 0:00:02
   --------------------- ------------------ 5.2/9.8 MB 5.9 MB/s eta 0:00:01
   -------------------------- ------------- 6.6/9.8 MB 6.0 MB/s eta 0:00:01
   -------------------------------- ------- 7.9/9.8 MB 6.0 MB/s eta 0:00:01
   -------------------------------------- - 9.4/9.8 MB 6.2 MB/s eta 0:00:01
   ---------------------------------------- 9.8/9.8 MB 6

### Contextual Compression (Using Reranking)

##### Generate answers

In [39]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-english-v3.0")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

sagemaker.config INFO - Not applying SDK defaults from location: C:\ProgramData\sagemaker\sagemaker\config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: C:\Users\novikova\AppData\Local\sagemaker\sagemaker\config.yaml


RuntimeError: no validator found for <class 'pydantic.types.SecretStr'>, see `arbitrary_types_allowed` in Config

In [38]:
!pip install langchain-cohere

Collecting langchain-cohere
  Using cached langchain_cohere-0.3.0-py3-none-any.whl.metadata (6.7 kB)
Collecting cohere<6.0,>=5.5.6 (from langchain-cohere)
  Downloading cohere-5.11.0-py3-none-any.whl.metadata (3.4 kB)
Collecting langchain-core<0.4,>=0.3.0 (from langchain-cohere)
  Downloading langchain_core-0.3.7-py3-none-any.whl.metadata (6.3 kB)
Collecting langchain-experimental>=0.3.0 (from langchain-cohere)
  Using cached langchain_experimental-0.3.2-py3-none-any.whl.metadata (1.7 kB)
Collecting boto3<2.0.0,>=1.34.0 (from cohere<6.0,>=5.5.6->langchain-cohere)
  Downloading boto3-1.35.30-py3-none-any.whl.metadata (6.6 kB)
Collecting fastavro<2.0.0,>=1.9.4 (from cohere<6.0,>=5.5.6->langchain-cohere)
  Using cached fastavro-1.9.7-cp312-cp312-win_amd64.whl.metadata (5.6 kB)
Collecting httpx-sse==0.4.0 (from cohere<6.0,>=5.5.6->langchain-cohere)
  Using cached httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting parameterized<0.10.0,>=0.9.0 (from cohere<6.0,>=5.5.6->langchain-c

  You can safely remove it manually.
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
datasets 3.0.0 requires dill<0.3.9,>=0.3.0, but you have dill 0.3.9 which is incompatible.
grpcio-tools 1.66.1 requires protobuf<6.0dev,>=5.26.1, but you have protobuf 4.25.5 which is incompatible.
langchain-openai 0.1.23 requires langchain-core<0.3.0,>=0.2.35, but you have langchain-core 0.3.7 which is incompatible.
langchain-qdrant 0.1.3 requires langchain-core<0.3,>=0.1.52, but you have langchain-core 0.3.7 which is incompatible.
langgraph 0.2.14 requires langchain-core<0.3,>=0.2.27, but you have langchain-core 0.3.7 which is incompatible.
langgraph-checkpoint 1.0.8 requires langchain-core<0.3,>=0.2.22, but you have langchain-core 0.3.7 which is incompatible.


In [None]:
answers = []
contexts = []

# loop through the test questions
for question in test_questions:
  response = bm25_retrieval_chain.invoke({"question" : question}) # invoke the retrieval chain with the question
  answers.append(response["response"].content) # append the answer to the answers list
  contexts.append([context.page_content for context in response["context"]]) # append the context to the contexts list

##### Create a dataset with questions, answers, context, and ground truth

In [None]:
bm25_retrieval_chain_dataset = Dataset.from_dict({
    "question" : test_questions,
    "answer" : answers,
    "contexts" : contexts,
    "ground_truth" : test_groundtruths
})

##### Run evaluation on the BM25 RAG pipeline

In [None]:
bm25_retrieval_chain_results = evaluate(bm25_retrieval_chain_dataset, 
                                               metrics, 
                                               llm=generator_llm, 
                                               embeddings=embeddings
                                               )

Evaluating:   0%|          | 0/98 [00:00<?, ?it/s]

In [None]:
bm25_retrieval_chain_results

{'context_recall': 0.9524, 'context_precision': 0.8980}

In [85]:
### YOUR CODE HERE