# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [1]:
import os
import getpass

os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [2]:
os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the strutured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [3]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path=f"./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [4]:
loan_complaint_data[0]

Document(metadata={'source': './data/complaints.csv', 'row': 0, 'Date received': '03/27/25', 'Product': 'Student loan', 'Sub-product': 'Federal student loan servicing', 'Issue': 'Dealing with your lender or servicer', 'Sub-issue': 'Trouble with how payments are being handled', 'Consumer complaint narrative': "The federal student loan COVID-19 forbearance program ended in XX/XX/XXXX. However, payments were not re-amortized on my federal student loans currently serviced by Nelnet until very recently. The new payment amount that is effective starting with the XX/XX/XXXX payment will nearly double my payment from {$180.00} per month to {$360.00} per month. I'm fortunate that my current financial position allows me to be able to handle the increased payment amount, but I am sure there are likely many borrowers who are not in the same position. The re-amortization should have occurred once the forbearance ended to reduce the impact to borrowers.", 'Company public response': 'None', 'Company'

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

We'll leverage OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates) because it's a very powerful (and low-cost) embedding model.

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [5]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply look at each review as a document, and use cosine-similarity to fetch the 10 most relevant documents.

> NOTE: We're choosing `10` as our `k` here to provide enough documents for our reranking process later

In [6]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 10})

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [7]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [8]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [9]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the base_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [10]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issues with loans appear to be related to errors and mismanagement by loan servicers, including:\n\n- Dealing with your lender or servicer (e.g., receiving bad information, errors in loan balances, misapplied payments, wrongful denials of payment plans).\n- Incorrect or inconsistent information on credit reports and account status (e.g., falsely reported delinquencies, improper account transfers, incorrect balances).\n- Trouble with how payments are handled, such as restrictions on applying payments toward principal or paying off loans early.\n- Discrepancies and errors in loan balances, interest calculations, and loan transfer notifications.\n- Challenges with loan forgiveness, cancellation, or discharge, and long-term forbearance issues.\n- Issues related to borrower rights violations, like privacy concerns and improper data handling.\n\nWhile specific problems vary, errors in account information, misapplication of payments, and difficu

In [11]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, yes, some complaints were not handled in a timely manner. Specifically, at least two complaints indicate delays:\n\n1. Complaint from 03/28/25 (Complaint ID: 12709087) submitted to MOHELA, where the response was marked as "No" for timely response, indicating it was not handled promptly.\n\n2. Complaint from 04/24/25 (Complaint ID: 13160766) submitted to Maximus Federal Services, which was marked as "Yes" for timely response, so this one was handled in time.\n\nAdditionally, many complaints mention ongoing issues or delays in resolving their issues, such as complaints from 04/14/25 and 04/18/25 that highlight unresolved problems after extended periods.\n\nHowever, the complaint from 03/28/25 regarding MOHELA stands out as definitely not handled in a timely manner.\n\nIn summary, yes, some complaints, particularly the one from MOHELA, did not get handled in a timely manner.'

In [12]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans primarily due to a combination of factors highlighted in the complaints:\n\n1. **Limited or Misleading Payment Options:** Borrowers were often offered only options like forbearance or deferment, which led to continuous interest accumulation. This increased the total amount owed and extended repayment periods, making it difficult to pay off the loans.\n\n2. **Unawareness and Poor Communication:** Many borrowers were not adequately informed about their loan status, repayment schedules, or changes in loan servicers. For example, some were unaware that their loans had been transferred to new servicers or that their repayment was due earlier than expected.\n\n3. **Interest Accumulation and High Debt:** Continuous interest buildup, especially when payments are lowered or delayed, can cause the debt to grow over time despite ongoing payments. Some borrowers noted that their balances increased despite making payments, due to high interest rates and misman

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set-up! Let's see it happen down below!


In [13]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data, )

We'll construct the same chain - only changing the retriever.

In [14]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [15]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with student loan complaints appears to be dealing with lenders or servicers, specifically issues related to incorrect or unclear information about loans, problems with repayment procedures, and perceived bad practices like overcharging or miscommunication. Multiple complaints highlight difficulties in managing loan payments, understanding loan balances, or resolving issues with loan servicing companies.\n\nTherefore, the most common issue with loans, as reflected in these complaints, is problems related to "Dealing with your lender or servicer."'

In [16]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, all of the complaints mentioned in the context received timely responses from the companies. Specifically, the complaints with IDs 13197090, 12792958, 13160766, and 13410623 are all marked as "Timely response?": "Yes." Therefore, there is no indication that any complaints went unhandled in a timely manner.'

In [17]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including problems with their payment plans, miscommunication or lack of communication from the loan servicers, issues with loan transfers and automatic payments, and difficulties resolving disputes or obtaining forbearance. Some specific issues highlighted include being steered into incorrect repayment options, not receiving notices about loan status changes, technical problems with payments (such as payments being reversed or not processed), and the servicers not responding adequately to payment or deferment requests. These issues can lead to missed payments, increased debt due to interest capitalization, and negative impacts on credit scores.'

It's not clear that this is better or worse, if only we had a way to test this (SPOILERS: We do, the second half of the notebook will cover this)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.


**Answer**


BM25 does absolute exact matches and has low semantic ambiguity. It is best for retrieval of questions with keywords that match exactly in the context, for example questions about technical specifications, legal or medical documents eg what are the onset symptoms of <insert awful disease name>. 

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [18]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)

Let's create our chain again, and see how this does!

In [19]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [20]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided context, the most common issue with loans appears to be problems related to dealing with lenders or servicers, including errors, mishandling, and misinformation. Specific recurring issues include incorrect or inconsistent loan balances, misapplied payments, wrongful denials of payment plans, and lack of proper communication or documentation. Additionally, complaints often involve concerns over privacy violations, unauthorized transfer of loans, and mishandling of personal information.'

In [21]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided information, it appears that at least one complaint (the one regarding a previous complaint about payments not being applied to the account) was addressed with a response marked as "Closed with explanation" and a note indicating the response was timely. \n\nHowever, other complaints, such as the one about the long delay in response to a request for account review and possible violations, indicate that the issue has not been resolved after a long period (nearly 18 months) and no resolution has been reached.\n\nTherefore, yes, some complaints did not get handled in a timely manner.'

In [22]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for several reasons, including:\n\n1. Lack of Understanding: Borrowers were often not fully informed about their repayment obligations or how interest accumulates, leading to confusion and unawareness of the true debt they owed.\n2. Transfer and Management Issues: Loans were transferred between different servicers without borrowers' knowledge or consent, causing confusion and processing issues.\n3. Difficulties with Payment Options: Available options like forbearance or deferment allowed interest to keep accruing, which increased the total debt over time and made repayment more difficult.\n4. Financial Hardship: Borrowers faced financial hardship due to stagnant wages, economic conditions, or personal circumstances, making it hard to keep up with payments or manage increasing interest.\n5. Administrative Problems: Errors such as incorrect account information, failure to notify borrowers of payments due, or miscommunication about repayment schedule

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had....more than one query!

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how is it to set-up? Not bad! Let's see it down below!



In [23]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)

In [24]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [25]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'Based on the provided complaints and context, the most common issues with student loans appear to be:\n\n- Errors and inaccuracies in loan balances or account information.\n- Problems with how payments are being applied or handled, often leading to late payments or negative impact on credit scores.\n- Lack of proper communication, notices, or transparency from loan servicers regarding important changes or errors.\n- Mischandling or mismanagement of loan information, including incorrect reporting to credit bureaus.\n- Issues with loan consolidation, including lack of disclosure and unexpected payment amounts.\n- Difficulty expressing concerns or resolving issues due to poor customer service.\n- Problems with loan forgiveness, discharge, or handling of special cases like bankruptcy or settlement eligibility.\n\nOverall, a prevalent theme is mismanagement and poor communication leading to inaccuracies, unforeseen charges, or adverse credit impacts for borrowers.'

In [26]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Yes, according to the provided complaints, several complaints indicate that issues were not handled in a timely manner. Examples include:\n\n- A complaint from 04/18/25 (Complaint ID: 13062402) about a credit report and investigation still unresolved after over a year.\n- Multiple complaints about unresolved account issues, with delay of over a year or more in responses and resolution.\n- Several complaints mention that the companies (such as Maximus Federal Services and Mohela) either responded with explanations that did not resolve the core issues or closed the cases without resolution, despite the consumer repeatedly following up over extended periods.\n\nOverall, there is evidence that some complaints did not get handled promptly, with delays extending well beyond a typical response timeframe, suggesting that they were not handled in a timely manner.'

In [27]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for several reasons, including:\n\n1. Lack of clear or accurate information from lenders or servicers about repayment options, interest accrual, and consequences of forbearance or deferment.\n2. The accumulation of interest during forbearance or deferment periods, which increased the total amount owed and extended the repayment timeline.\n3. Unmanageable high interest rates and the negative impact of interest capitalization, making it difficult or impossible to pay off the loans.\n4. Financial hardships, such as unemployment, stagnant wages, or other life circumstances, that made consistent repayment difficult.\n5. Systemic issues like mismanagement, errors in loan balances, miscommunication about loan statuses, or improper reporting to credit bureaus.\n6. Coercive or unethical servicing practices, such as forbearance steering, failure to inform borrowers of income-driven repayment options, or improper handling of loans, which contributed to borro

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

**Answer**

Often users vague questions or questions that are not properly worded. LLMs are good at writing prompts for other LLMs. We can leverage this to better word the query, add more keywords, use synonyms, hypernyms or concepts and structured that help with context retrieval

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [28]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [29]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding=OpenAIEmbeddings(model="text-embedding-3-small"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [30]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [31]:
parent_document_retriever.add_documents(parent_docs, ids=None)

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [32]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [33]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided context, appears to be problems related to improper or incorrect handling of loan information, such as errors in loan balances, misreporting or misinformation on credit reports, and issues stemming from the loan servicing process. Specifically, complaints include incorrect information on credit reports, errors in loan balances, misapplied payments, wrongful denials of payment plans, and issues with loan transfer and interest rate increases.'

In [34]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that at least two complaints were not handled in a timely manner. Specifically, the complaints involving MOHELA (Complaint IDs 12709087 and 12935889) mention that no one has reached out to the complainant despite multiple follow-up calls, and the responses were marked as "No" for timeliness. This indicates that these complaints were not addressed within an acceptable or timely timeframe.'

In [35]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

'People failed to pay back their loans for various reasons, including severe financial hardship, misrepresentations by educational institutions about employment prospects and the value of their degrees, and issues with loan servicing. Specifically, some borrowers experienced difficulties due to the inability to secure employment after graduation, which made it challenging to make loan payments. Others faced issues with loan repayment due to administrative problems, such as being incorrectly billed, being reported as delinquent without proper notification, or dealing with the complexities of loan ownership and servicing following the closure of schools or institutions. Additionally, some borrowers had to rely on deferment and forbearance options, which increased the overall debt due to accumulated interest.'

Overall, the performance *seems* largely the same. We can leverage a tool like [Ragas]() to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [36]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)

We'll pack *all* of these retrievers together in an ensemble.

In [37]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [38]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, is dealing with the loan servicer or lender, which includes problems such as receiving bad information about the loan, trouble with how payments are being handled, errors or discrepancies in loan balances, unintended default or delinquency reporting, improper transfer of loans without proper notice, and issues with loan consolidation or adjustments. Many complaints highlight mismanagement, lack of communication, inaccurate reporting, and deceptive practices by loan servicers.'

In [39]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the information provided, yes, there are complaints indicating that some issues were not handled in a timely manner. For example, several complaints noted that responses from the companies were late or that the companies failed to respond within the expected timeframe. Specifically, one complaint from EdFinancial Services mentioned a response that was not timely, and others highlighted delays or lack of response despite multiple follow-ups.'

In [40]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"People failed to pay back their loans for various reasons, including:\n\n1. Lack of Notification and Communication: Many borrowers were not properly notified about payment due dates, loan transfers, or changes in their account status, leading to missed payments and reporting errors on credit reports.\n\n2. Improper Handling and Mismanagement by Servicers: Some borrowers experienced errors in their loan balances, misapplied payments, or wrongful reporting of delinquency due to errors or lack of proper procedures by servicers like Nelnet, Maximus, or MOHELA.\n\n3. Difficulties with Payment Plans and Interest Accumulation: Borrowers faced challenges due to limited options beyond forbearance or deferment, interest continuing to accrue and capitalize during these periods, making loans larger and repayment unmanageable.\n\n4. Lack of Clear or Adequate Information: Many were misled or inadequately informed about repayment options, income-driven repayment plans, consolidation, or forgiveness 

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences based on their semantic similarity based on a number of [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example which will:

Calculate all distances between sentences, and then break apart sequences of setences that exceed a given percentile among all distances.

In [41]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)

Now we can split our documents.

In [42]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [43]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [44]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [45]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [46]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

'The most common issue with loans, based on the provided complaints, appears to be difficulties with loan servicing and management. This includes problems such as struggling to repay loans, improper use or reporting of loan information, issues with payment plans and auto-debit setup, miscommunication about loan status or issuer changes, and violations of privacy or legal protections related to student loans. Many complaints center around errors, lack of transparency, delays, and disputes over loan account information.'

In [47]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

'Based on the provided complaints, it appears that several complaints were marked as "Closed with explanation" and the responses time was indicated as "Yes" for the complaints with response status. This suggests that the complaints were handled within the expected time frame.\n\nHowever, the details also highlight that multiple complaints involved:\n- No response or inadequate responses from the companies.\n- Issues such as unaddressed misconduct, errors, or violations of laws.\n- Complaints where the consumer was still requesting investigations or corrections.\n\nWhile some complaints stated they received timely responses and were "Closed with explanation," there are multiple instances where complaints involved ongoing issues or inadequate handling, which may imply delays or unresolved matters in practice. \n\nTo directly answer your question: **No, according to the data, all complaints marked as "timely response" were handled in an appropriate time frame. However, several complaints 

In [48]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

"Based on the provided context, people failed to pay back their loans for various reasons, including:\n\n- Lack of clear or accurate information from lenders or servicers, leading to confusion and stress (e.g., receiving bad information or lack of transparency).\n- Administrative issues or delays, such as processing payments or documentation, which may have caused missed payments or default notices.\n- Disputes over the legitimacy or status of the loans, including claims of wrongful default, unverified debts, or legal complications.\n- Difficulties in communication or customer service, which hindered resolution or understanding of their repayment obligations.\n- Problems related to changes in payment plans or re-amortization, impacting affordability or payment schedules.\n- Cases of alleged improper reporting or illegal collection practices, leading borrowers to believe their debt records were inaccurate or invalid.\n\nIn summary, failures to pay back loans often stemmed from procedura

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?


**Answer**

If sentences are very similar then the sematic similarity will be consistently high. This could make it difficult to find a threshold to chunkify the documents. If the threshold is too high it might lead to the creation of too many small chunks that are very close in content and will cause context repetition in the vector store. Another possible outcome is that if the threshold is too low then the algorithm will produce few very large chunks and grouping different FAQs together. This would degrade retrieval as one query might return a very large chunk of data with multiple irrelevant FAQs. We could mitigate this by adding a chunk splitting rule based on the structure of the document: eg extracting based on keywords like "Question" and "Answer" or paragraphs or sentences

# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against eachother.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparision between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

### Modify chains for langsmith parsing


In [109]:
naive_retrieval_chain_langsmith = (
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model | StrOutputParser()} 
)

bm25_retrieval_chain_langsmith = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model | StrOutputParser() }
)

contextual_compression_retrieval_chain_langsmith = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model | StrOutputParser()}
)

multi_query_retrieval_chain_langsmith = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model | StrOutputParser()}
)

parent_document_retrieval_chain_langsmith = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model| StrOutputParser()}
)

ensemble_retrieval_chain_langsmith = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model| StrOutputParser()}
)

semantic_retrieval_chain_langsmith = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model| StrOutputParser()}
)

### Load the pdf files

In [49]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader


path = "data/"
loader = DirectoryLoader(path, glob="*.pdf", loader_cls=PyMuPDFLoader)
docs = loader.load()

In [50]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [51]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
ragas_dataset = generator.generate_with_langchain_docs(docs[:20], testset_size=8)

Applying HeadlinesExtractor:   0%|          | 0/17 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/20 [00:00<?, ?it/s]

unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node
unable to apply transformation: 'headlines' property not found in this node


Applying SummaryExtractor:   0%|          | 0/31 [00:00<?, ?it/s]

Property 'summary' already exists in node 'cc2ddb'. Skipping!
Property 'summary' already exists in node '4ae810'. Skipping!
Property 'summary' already exists in node '3d8dd3'. Skipping!
Property 'summary' already exists in node 'b17c47'. Skipping!
Property 'summary' already exists in node 'f63f23'. Skipping!
Property 'summary' already exists in node '14b391'. Skipping!
Property 'summary' already exists in node '5c8bff'. Skipping!
Property 'summary' already exists in node '47347a'. Skipping!
Property 'summary' already exists in node '0df40a'. Skipping!
Property 'summary' already exists in node '6eaece'. Skipping!
Property 'summary' already exists in node '9b3fa7'. Skipping!
Property 'summary' already exists in node 'cf0166'. Skipping!
Property 'summary' already exists in node '8a8e9f'. Skipping!
Property 'summary' already exists in node 'ef3dca'. Skipping!


Applying CustomNodeFilter:   0%|          | 0/6 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/41 [00:00<?, ?it/s]

Property 'summary_embedding' already exists in node 'b17c47'. Skipping!
Property 'summary_embedding' already exists in node '5c8bff'. Skipping!
Property 'summary_embedding' already exists in node 'cc2ddb'. Skipping!
Property 'summary_embedding' already exists in node '6eaece'. Skipping!
Property 'summary_embedding' already exists in node 'cf0166'. Skipping!
Property 'summary_embedding' already exists in node '47347a'. Skipping!
Property 'summary_embedding' already exists in node '8a8e9f'. Skipping!
Property 'summary_embedding' already exists in node 'ef3dca'. Skipping!
Property 'summary_embedding' already exists in node '3d8dd3'. Skipping!
Property 'summary_embedding' already exists in node '4ae810'. Skipping!
Property 'summary_embedding' already exists in node '0df40a'. Skipping!
Property 'summary_embedding' already exists in node '14b391'. Skipping!
Property 'summary_embedding' already exists in node 'f63f23'. Skipping!
Property 'summary_embedding' already exists in node '9b3fa7'. Sk

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/9 [00:00<?, ?it/s]

In [52]:

ragas_dataset.to_pandas()


Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,where i find more about subscription-based pro...,"[non-term (includes clock-hour calendars), or ...",For more detail on subscription-based programs...,single_hop_specifc_query_synthesizer
1,How does clinical work in nursing programs aff...,[Inclusion of Clinical Work in a Standard Term...,Clinical work in nursing programs may be inclu...,single_hop_specifc_query_synthesizer
2,How are Title IV disbursements managed for non...,[Non-Term Characteristics A program that measu...,"Title IV program disbursements, except for the...",single_hop_specifc_query_synthesizer
3,How do the requirements for clinical or practi...,[<1-hop>\n\nInclusion of Clinical Work in a St...,The requirements for clinical or practicum exp...,multi_hop_abstract_query_synthesizer
4,what if a program got self-paced or independen...,[<1-hop>\n\nInclusion of Clinical Work in a St...,If a program has self-paced or independent stu...,multi_hop_abstract_query_synthesizer
5,How do the disbursement requirements for feder...,[<1-hop>\n\nboth the credit or clock hours and...,In clock-hour or non-term credit-hour programs...,multi_hop_abstract_query_synthesizer
6,"How do the requirements outlined in Volume 8, ...",[<1-hop>\n\nnon-term (includes clock-hour cale...,"In subscription-based programs, the administra...",multi_hop_specific_query_synthesizer
7,If a medical program includes clinical work th...,[<1-hop>\n\nInclusion of Clinical Work in a St...,When a medical program includes clinical work ...,multi_hop_specific_query_synthesizer
8,how direct loan program work for student in no...,[<1-hop>\n\nnon-term (includes clock-hour cale...,For students in non-term or subscription-based...,multi_hop_specific_query_synthesizer


### Evaluate Retrievals

In [53]:
type(ragas_dataset)

ragas.testset.synthesizers.testset_schema.Testset

In [None]:
from tqdm.notebook import tqdm as tqdm_notebook


retrieval_strategies = {
    'naive': naive_retrieval_chain, 
    'bm25': bm25_retrieval_chain, 
    'contextual_compression': contextual_compression_retrieval_chain,
    'parent_document': parent_document_retrieval_chain,
    'ensemble': ensemble_retrieval_chain
    }

dataset_with_response = {}
for retrieval, chain in retrieval_strategies.items():
    print(f"Evaluating {retrieval} retrieval strategy")
    for test_row in tqdm_notebook(ragas_dataset, desc="Questions progress"):
        response = chain.invoke({"question" : test_row.eval_sample.user_input})
        test_row.eval_sample.response = response["response"].content
        test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
    dataset_with_response[retrieval] = ragas_dataset

Evaluating naive retrieval strategy


Questions progress:   0%|          | 0/9 [00:00<?, ?it/s]

Evaluating bm25 retrieval strategy


Questions progress:   0%|          | 0/9 [00:00<?, ?it/s]

Evaluating contextual_compression retrieval strategy


Questions progress:   0%|          | 0/9 [00:00<?, ?it/s]

Evaluating parent_document retrieval strategy


Questions progress:   0%|          | 0/9 [00:00<?, ?it/s]

Evaluating ensemble retrieval strategy


Questions progress:   0%|          | 0/9 [00:00<?, ?it/s]

In [74]:
dataset_with_response.keys()

dict_keys(['naive', 'bm25', 'contextual_compression', 'parent_document', 'ensemble'])

In [75]:
dataset_with_response['naive'].samples[0].eval_sample.response

'To learn more about subscription-based programs in Volume 2, Chapter 2, I do not have specific information from the provided context. You might consider checking the table of contents, index, or chapter headings in Volume 2, Chapter 2 of the relevant material. Alternatively, consult the book or document directly for references to subscription-based programs within that chapter. If you have access to a search function, try searching for key terms like "subscription" or "subscription-based programs" in that chapter. If you\'re referring to a specific textbook, report, or manual, please specify its title so I can assist further.'

In [87]:
from ragas import EvaluationDataset, evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano", max_tokens= 32768))

evaluation_dataset = {}
for retrieval, dataset in dataset_with_response.items():
    print(f"Evaluating {retrieval} retrieval strategy")
    evaluation_dataset[retrieval] = EvaluationDataset.from_pandas(dataset.to_pandas())


Evaluating naive retrieval strategy
Evaluating bm25 retrieval strategy
Evaluating contextual_compression retrieval strategy
Evaluating parent_document retrieval strategy
Evaluating ensemble retrieval strategy


In [89]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=600)

evaluation_results = {}
for retrieval, dataset in evaluation_dataset.items():
    print(f"Evaluating {retrieval} retrieval strategy")
    evaluation_results[retrieval] = evaluate(
        dataset=dataset,
        metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
        llm=evaluator_llm,
        run_config=custom_run_config
    )

Evaluating naive retrieval strategy


Evaluating:   0%|          | 0/54 [00:00<?, ?it/s]

Exception raised in Job[4]: TimeoutError()
Exception raised in Job[10]: TimeoutError()
Exception raised in Job[16]: TimeoutError()
Exception raised in Job[22]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[40]: TimeoutError()
Exception raised in Job[46]: TimeoutError()
Exception raised in Job[52]: TimeoutError()


Evaluating bm25 retrieval strategy


Evaluating:   0%|          | 0/54 [00:00<?, ?it/s]

Exception raised in Job[41]: ValueError(setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (15,) + inhomogeneous part.)
Exception raised in Job[4]: TimeoutError()
Exception raised in Job[10]: TimeoutError()
Exception raised in Job[16]: TimeoutError()
Exception raised in Job[22]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[40]: TimeoutError()
Exception raised in Job[46]: TimeoutError()
Exception raised in Job[52]: TimeoutError()


Evaluating contextual_compression retrieval strategy


Evaluating:   0%|          | 0/54 [00:00<?, ?it/s]

Exception raised in Job[4]: TimeoutError()
Exception raised in Job[10]: TimeoutError()
Exception raised in Job[16]: TimeoutError()
Exception raised in Job[22]: TimeoutError()
Exception raised in Job[28]: TimeoutError()
Exception raised in Job[34]: TimeoutError()
Exception raised in Job[40]: TimeoutError()
Exception raised in Job[46]: TimeoutError()
Exception raised in Job[52]: TimeoutError()


Evaluating parent_document retrieval strategy


Evaluating:   0%|          | 0/54 [00:00<?, ?it/s]

Exception raised in Job[4]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[10]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[28]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[22]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[34]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[40]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[46]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[52]:

Evaluating ensemble retrieval strategy


Evaluating:   0%|          | 0/54 [00:00<?, ?it/s]

Exception raised in Job[37]: RateLimitError(Error code: 429 - {'error': {'message': 'Rate limit reached for gpt-4.1-nano in organization org-URUSKKyHzl4ZMdhZ2HYDH466 on tokens per min (TPM): Limit 200000, Used 200000, Requested 8857. Please try again in 2.657s. Visit https://platform.openai.com/account/rate-limits to learn more.', 'type': 'tokens', 'param': None, 'code': 'rate_limit_exceeded'}})
Exception raised in Job[22]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[34]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[28]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[40]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)
Exception raised in Job[46]: LLMDidNotFinishE

In [150]:
print('naive')
evaluation_results['naive']

naive


{'context_recall': 1.0000, 'faithfulness': 0.8833, 'factual_correctness(mode=f1)': 0.8211, 'answer_relevancy': 0.6161, 'context_entity_recall': nan, 'noise_sensitivity(mode=relevant)': 0.0106}

In [151]:
print('bm25')
evaluation_results['bm25']

bm25


{'context_recall': 1.0000, 'faithfulness': 0.7778, 'factual_correctness(mode=f1)': 0.7911, 'answer_relevancy': 0.7154, 'context_entity_recall': nan, 'noise_sensitivity(mode=relevant)': 0.0326}

In [None]:
print('contextual_compression')
evaluation_results['contextual_compression']

{'context_recall': 1.0000, 'faithfulness': 0.9504, 'factual_correctness(mode=f1)': 0.7689, 'answer_relevancy': 0.7118, 'context_entity_recall': nan, 'noise_sensitivity(mode=relevant)': 0.0344}

In [152]:
print('parent_document')
evaluation_results['parent_document']

parent_document


{'context_recall': 1.0000, 'faithfulness': 0.7778, 'factual_correctness(mode=f1)': 0.7978, 'answer_relevancy': 0.7164, 'context_entity_recall': 0.0000, 'noise_sensitivity(mode=relevant)': 0.0339}

In [153]:
print('ensemble')
evaluation_results['ensemble']

ensemble


{'context_recall': 1.0000, 'faithfulness': 0.8750, 'factual_correctness(mode=f1)': 0.8811, 'answer_relevancy': 0.6154, 'context_entity_recall': 0.0000, 'noise_sensitivity(mode=relevant)': 0.0487}

### Langsmith evaluation

In [91]:
import os
import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = getpass.getpass("LangChain API Key:")

In [92]:
from uuid import uuid4

os.environ["LANGCHAIN_PROJECT"] = f"AIM - SDG - {uuid4().hex[0:8]}"

In [94]:
from langsmith import Client

client = Client()

dataset_name = "Assignment 9 - Loan Synthetic Data"

langsmith_dataset = client.create_dataset(
    dataset_name=dataset_name,
    description="Assignment 9 - Loan Synthetic Data"
)

In [95]:
for data_row in ragas_dataset.to_pandas().iterrows():
  client.create_example(
      inputs={
          "question": data_row[1]["user_input"]
      },
      outputs={
          "answer": data_row[1]["reference"]
      },
      metadata={
          "context": data_row[1]["reference_contexts"]
      },
      dataset_id=langsmith_dataset.id
  )

In [96]:
eval_llm = ChatOpenAI(model="gpt-4.1")

In [110]:
from langsmith.evaluation import LangChainStringEvaluator, evaluate

qa_evaluator = LangChainStringEvaluator("qa", config={"llm" : eval_llm})

labeled_helpfulness_evaluator = LangChainStringEvaluator(
    "labeled_criteria",
    config={
        "criteria": {
            "helpfulness": (
                "Is this submission helpful to the user,"
                " taking into account the correct reference answer?"
            )
        },
        "llm" : eval_llm
    },
    prepare_data=lambda run, example: {
        "prediction": run.outputs["response"],
        "reference": example.outputs["answer"],
        "input": example.inputs["question"],
    }
)


In [None]:
retrieval_strategies_langsmith = {
    'naive': naive_retrieval_chain_langsmith, 
    'bm25': bm25_retrieval_chain_langsmith, 
    'contextual_compression': contextual_compression_retrieval_chain_langsmith,
    'parent_document': parent_document_retrieval_chain_langsmith,
    'ensemble': ensemble_retrieval_chain_langsmith
    }

results = {}
for retrieval, chain in retrieval_strategies_langsmith.items():
    print(f"Evaluating {retrieval} retrieval strategy")
    evaluate(
        chain.invoke,
        data=dataset_name,
        evaluators=[
            qa_evaluator,
            labeled_helpfulness_evaluator
        ],
        metadata={"revision_id": "default_chain_init"},
)

Evaluating naive retrieval strategy
View the evaluation results for experiment: 'giving-straw-29' at:
https://smith.langchain.com/o/6ce7a019-3815-4374-be47-38dfdf42ee54/datasets/3a9bf070-f6b1-46d8-9652-fd3405e96087/compare?selectedSessions=7d4719be-0e8c-49b6-acec-d00af7776a23




0it [00:00, ?it/s]

Evaluating bm25 retrieval strategy
View the evaluation results for experiment: 'worthwhile-record-90' at:
https://smith.langchain.com/o/6ce7a019-3815-4374-be47-38dfdf42ee54/datasets/3a9bf070-f6b1-46d8-9652-fd3405e96087/compare?selectedSessions=7c79a3fd-c493-4b6e-9f80-41e987b4297c




0it [00:00, ?it/s]

Evaluating contextual_compression retrieval strategy
View the evaluation results for experiment: 'elderly-part-44' at:
https://smith.langchain.com/o/6ce7a019-3815-4374-be47-38dfdf42ee54/datasets/3a9bf070-f6b1-46d8-9652-fd3405e96087/compare?selectedSessions=0ad724a7-777a-4f88-9bdd-53e21a808c65




0it [00:00, ?it/s]

Evaluating parent_document retrieval strategy
View the evaluation results for experiment: 'upbeat-stove-67' at:
https://smith.langchain.com/o/6ce7a019-3815-4374-be47-38dfdf42ee54/datasets/3a9bf070-f6b1-46d8-9652-fd3405e96087/compare?selectedSessions=dc376ca5-4827-42f1-8464-2ca7f20bfee9




0it [00:00, ?it/s]

Evaluating ensemble retrieval strategy
View the evaluation results for experiment: 'spotless-cod-35' at:
https://smith.langchain.com/o/6ce7a019-3815-4374-be47-38dfdf42ee54/datasets/3a9bf070-f6b1-46d8-9652-fd3405e96087/compare?selectedSessions=bb354749-4ce2-4837-bb1f-35a13a72bef5




0it [00:00, ?it/s]

# RAGAS Evaluation Results Summary

## Performance Metrics Comparison

| Strategy | Context Recall | Faithfulness | Factual Correctness | Answer Relevancy | Context Entity Recall | Noise Sensitivity | Latency (s) | Cost ($) | Tokens |
|----------|----------------|--------------|---------------------|------------------|----------------------|-------------------|-------------|----------|---------|
| **Naive** | 1.0000 | 0.8833 | 0.8211 | 0.6161 | NaN | 0.0106 | 4.321 | 0.082 | 71,926 |
| **BM25** | 1.0000 | 0.7778 | 0.7911 | 0.7154 | NaN | 0.0326 | 2.422 | 0.004 | 31,363 |
| **Contextual Compression** | 1.0000 | 0.9504 | 0.7689 | 0.7118 | NaN | 0.0344 | 3.165 | 0.0034 | 24,503 |
| **Parent Document** | 1.0000 | 0.7778 | 0.7978 | 0.7164 | 0.0000 | 0.0339 | 3.575 | 0.0048 | 39,079 |
| **Ensemble** | 1.0000 | 0.8750 | 0.8811 | 0.6154 | 0.0000 | 0.0487 | 7.791 | 0.0189 | 175,337 |

## Metric Definitions

- **Context Recall**: Measures how well the retrieved context covers the information needed to answer the question (higher is better)
- **Faithfulness**: Evaluates whether the generated answer is faithful to the retrieved context (higher is better)
- **Factual Correctness**: Measures the factual accuracy of the generated answers (higher is better)
- **Answer Relevancy**: Assesses how relevant the generated answer is to the original question (higher is better)
- **Context Entity Recall**: Measures entity-level recall from the retrieved context (higher is better)
- **Noise Sensitivity**: Evaluates robustness to irrelevant information in context (lower is better)

## Analysis: Optimal Strategy Selection

**BM25 emerges as the optimal strategy** for this financial aid documentation dataset, providing the best balance of performance, cost, and efficiency.

**Performance Advantages**: BM25 achieves the highest answer relevancy (0.7154) among all strategies, indicating it generates the most relevant responses to user questions. While it has slightly lower faithfulness (0.7778) compared to contextual compression, it maintains strong factual correctness (0.7911) and perfect context recall (1.0000).

**Cost Efficiency**: BM25 is the most cost-effective option at $0.004, using only 31,363 tokens. This represents a 95% cost reduction compared to the naive approach ($0.082) while maintaining competitive performance metrics.

**Latency Performance**: At 2.422 seconds, BM25 provides the fastest response time among all strategies, making it suitable for real-time applications where speed is crucial.

**Strategic Trade-offs**: While contextual compression achieves the highest faithfulness (0.9504), it comes with higher latency (3.165s) and slightly lower answer relevancy. The ensemble approach, despite having the highest factual correctness (0.8811), is prohibitively expensive ($0.0189) and slow (7.791s), making it impractical for production use.

For this financial aid documentation dataset, BM25's keyword-based retrieval approach effectively captures the specific terminology and concepts users are likely to query, while its computational efficiency makes it suitable for deployment in cost-sensitive environments.