# Advanced Retrieval with LangChain

In the following notebook, we'll explore various methods of advanced retrieval using LangChain!

We'll touch on:

- Naive Retrieval
- Best-Matching 25 (BM25)
- Multi-Query Retrieval
- Parent-Document Retrieval
- Contextual Compression (a.k.a. Rerank)
- Ensemble Retrieval
- Semantic chunking

We'll also discuss how these methods impact performance on our set of documents with a simple RAG chain.

There will be two breakout rooms:

- 🤝 Breakout Room Part #1
  - Task 1: Getting Dependencies!
  - Task 2: Data Collection and Preparation
  - Task 3: Setting Up QDrant!
  - Task 4-10: Retrieval Strategies
- 🤝 Breakout Room Part #2
  - Activity: Evaluate with Ragas

# 🤝 Breakout Room Part #1

## Task 1: Getting Dependencies!

We're going to need a few specific LangChain community packages, like OpenAI (for our [LLM](https://platform.openai.com/docs/models) and [Embedding Model](https://platform.openai.com/docs/guides/embeddings)) and Cohere (for our [Reranker](https://cohere.com/rerank)).

We'll also provide our OpenAI key, as well as our Cohere API key.

In [None]:
### API key management

### Reminder: Place the .env file in the root of the project folder so that load_dotenv, called from inside the notebook, can find the .env file and load it into the notebook environment
### PLEASE ADD THIS `.env` FILE TO YOUR PROJECT'S `.gitignore` file before committing and pushing the changes to your remote repo, as it contains API Keys and Secrets in it

import os
from dotenv import load_dotenv

load_dotenv(dotenv_path="../.env")

print("OPENAI_API_KEY" in os.environ)
print("LANGCHAIN_API_KEY" in os.environ)
print("TAVILY_API_KEY" in os.environ)
print("RAGAS_API_KEY" in os.environ)
print("ANTHROPIC_API_KEY" in os.environ)
print("COHERE_API_KEY" in os.environ)
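
Printing six `True`/`False` values in a row can be hard to read at a glance. As a minimal sketch (the helper name `missing_keys` and the stand-in `example_env` dictionary are assumptions for illustration, not part of the original notebook), the same check can return just the names that are missing or empty. Keep in mind that environment variable names are case-sensitive on Linux and macOS.

```python
import os

def missing_keys(required, env=None):
    """Return the names in `required` that are absent or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Hypothetical stand-in for os.environ, so the sketch runs without real keys.
example_env = {"OPENAI_API_KEY": "sk-placeholder", "COHERE_API_KEY": ""}

print(missing_keys(["OPENAI_API_KEY", "COHERE_API_KEY"], example_env))
```

Calling `missing_keys([...])` without the second argument checks the real environment instead.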


In [None]:
#import os
#import getpass

#os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OpenAI API Key:")

In [None]:
#os.environ["COHERE_API_KEY"] = getpass.getpass("Cohere API Key:")

## Task 2: Data Collection and Preparation

We'll be using our Loan Data once again - this time the structured data available through the CSV!

### Data Preparation

We want to make sure all our documents have the relevant metadata for the various retrieval strategies we're going to be applying today.

In [None]:
from langchain_community.document_loaders.csv_loader import CSVLoader
from datetime import datetime, timedelta

loader = CSVLoader(
    file_path="./data/complaints.csv",
    metadata_columns=[
      "Date received", 
      "Product", 
      "Sub-product", 
      "Issue", 
      "Sub-issue", 
      "Consumer complaint narrative", 
      "Company public response", 
      "Company", 
      "State", 
      "ZIP code", 
      "Tags", 
      "Consumer consent provided?", 
      "Submitted via", 
      "Date sent to company", 
      "Company response to consumer", 
      "Timely response?", 
      "Consumer disputed?", 
      "Complaint ID"
    ]
)

loan_complaint_data = loader.load()

for doc in loan_complaint_data:
    doc.page_content = doc.metadata["Consumer complaint narrative"]

Let's look at an example document to see if everything worked as expected!

In [None]:
loan_complaint_data[0]

## Task 3: Setting up QDrant!

Now that we have our documents, let's create a QDrant VectorStore with the collection name "LoanComplaints".

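The original plan for this notebook was OpenAI's [`text-embedding-3-small`](https://openai.com/blog/new-embedding-models-and-api-updates), a very powerful (and low-cost) embedding model. The cell below uses Cohere's `embed-english-v3.0` instead; the OpenAI line is left commented out as an alternative.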

> NOTE: We'll be creating additional vectorstores where necessary, but this pattern is still extremely useful.

In [None]:
from langchain_community.vectorstores import Qdrant
from langchain_openai import OpenAIEmbeddings
from langchain_cohere import CohereEmbeddings
embeddings = CohereEmbeddings(model="embed-english-v3.0")

#embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Qdrant.from_documents(
    loan_complaint_data,
    embeddings,
    location=":memory:",
    collection_name="LoanComplaints"
)

## Task 4: Naive RAG Chain

Since we're focusing on the "R" in RAG today - we'll create our Retriever first.

### R - Retrieval

This naive retriever will simply treat each complaint as a document, and use cosine similarity to fetch the 5 most relevant documents.

> NOTE: We're choosing `5` as our `k` here so that our reranking process later still has several candidate documents to choose from

In [None]:
naive_retriever = vectorstore.as_retriever(search_kwargs={"k" : 5})
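
To make "use cosine similarity to fetch the `k` most relevant documents" concrete, here is a minimal pure-Python sketch. The three-dimensional toy vectors and document names are made up for illustration; real embeddings have hundreds or thousands of dimensions, and a vector store like Qdrant uses an index rather than the brute-force scan shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical toy "embeddings" for three complaints.
toy_vectors = {
    "late_fee": [0.9, 0.1, 0.0],
    "interest_rate": [0.1, 0.9, 0.1],
    "payment_posting": [0.7, 0.2, 0.1],
}

print(top_k([1.0, 0.0, 0.0], toy_vectors, k=2))
```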

### A - Augmented

We're going to go with a standard prompt for our simple RAG chain today! Nothing fancy here, we want this to mostly be about the Retrieval process.

In [None]:
from langchain_core.prompts import ChatPromptTemplate

RAG_TEMPLATE = """\
You are a helpful and kind assistant. Use the context provided below to answer the question.

If you do not know the answer, or are unsure, say you don't know.

Query:
{question}

Context:
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_TEMPLATE)

### G - Generation

We're going to leverage `gpt-4.1-nano` as our LLM today, as - again - we want this to largely be about the Retrieval process.

In [None]:
from langchain_openai import ChatOpenAI

chat_model = ChatOpenAI(model="gpt-4.1-nano")
#chat_model = ChatAnthropic(model="claude-3-5-sonnet-20240620")

### LCEL RAG Chain

We're going to use LCEL to construct our chain.

> NOTE: This chain will be exactly the same across the various examples with the exception of our Retriever!

In [None]:
from langchain_core.runnables import RunnablePassthrough
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser

naive_retrieval_chain = (
    # INVOKE CHAIN WITH: {"question" : "<<SOME USER QUESTION>>"}
    # "question" : populated by getting the value of the "question" key
    # "context"  : populated by getting the value of the "question" key and chaining it into the naive_retriever
    {"context": itemgetter("question") | naive_retriever, "question": itemgetter("question")}
    # "context"  : is assigned to a RunnablePassthrough object (will not be called or considered in the next step)
    #              by getting the value of the "context" key from the previous step
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : the "context" and "question" values are used to format our prompt object and then piped
    #              into the LLM and stored in a key called "response"
    # "context"  : populated by getting the value of the "context" key from the previous step
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
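
If the LCEL pipe syntax feels opaque, the data flow of the chain above can be sketched in plain Python. Everything here is a hypothetical stand-in: `fake_retriever` and `fake_llm` replace the real retriever and chat model, and documents are plain strings rather than LangChain `Document` objects. The `RunnablePassthrough.assign` step is left out because it assigns `context` to its own value.

```python
def fake_retriever(question):
    # Hypothetical stand-in for naive_retriever: returns plain strings.
    return ["complaint about late fees", "complaint about interest rates"]

def fake_llm(prompt):
    # Hypothetical stand-in for chat_model: reports how much context it saw.
    return f"(answer based on {prompt.count('complaint')} context documents)"

def rag_chain(inputs):
    # Step 1: retrieve on the question and carry the question forward.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Step 2: format the prompt, call the model, and keep the context in the output.
    prompt = "Query:\n{question}\n\nContext:\n{context}".format(
        question=state["question"], context="\n".join(state["context"])
    )
    return {"response": fake_llm(prompt), "context": state["context"]}

result = rag_chain({"question": "What is the most common issue with loans?"})
print(result["response"])
```

The output has the same two keys as the real chain, `response` and `context`, which is what makes it easy to feed into an evaluation step later.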

Let's see how this simple chain does on a few different prompts.

> NOTE: You might think that we've cherry picked prompts that showcase the individual skill of each of the retrieval strategies - you'd be correct!

In [None]:
naive_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

In [None]:
naive_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

In [None]:
naive_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

Overall, this is not bad! Let's see if we can make it better!

## Task 5: Best-Matching 25 (BM25) Retriever

Taking a step back in time - [BM25](https://www.nowpublishers.com/article/Details/INR-019) is based on [Bag-Of-Words](https://en.wikipedia.org/wiki/Bag-of-words_model) which is a sparse representation of text.

In essence, it's a way to compare how similar two pieces of text are based on the words they both contain.

This retriever is very straightforward to set up! Let's see it happen down below!


In [None]:
from langchain_community.retrievers import BM25Retriever

bm25_retriever = BM25Retriever.from_documents(loan_complaint_data)
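
To show what "comparing text based on the words they both contain" looks like in practice, here is a toy BM25 scorer. It is a sketch of one common textbook variant of the formula, with whitespace tokenization and the usual `k1` and `b` defaults; it is not necessarily the exact formula or tokenizer that `BM25Retriever` uses under the hood, and the three documents are made up.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document in `docs` against `query` with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus.
docs = [
    "error code E1234 on my loan statement",
    "my loan payment was late",
    "the servicer never answered my calls",
]
scores = bm25_scores("E1234", docs)
print(scores)
```

Only the document that literally contains the token gets a non-zero score, which is why sparse retrieval tends to do well on exact identifiers.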

We'll construct the same chain - only changing the retriever.

In [None]:
bm25_retrieval_chain = (
    {"context": itemgetter("question") | bm25_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at the responses!

In [None]:
bm25_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

In [None]:
bm25_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

In [None]:
bm25_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

It's not clear whether this is better or worse. If only we had a way to test this! (SPOILERS: We do; the second half of the notebook will cover this.)

#### ❓ Question #1:

Give an example query where BM25 is better than embeddings and justify your answer.

#### ✅ Answer #1:
BM25 would be better when looking for very specific and unique text such as serial numbers, model numbers, error codes, policy document names, etc. This is because BM25 ranks results based on exact matches. Embedding-based retrievers prioritize semantic similarity and may not find unique values, especially if they lack semantic meaning.

## Task 6: Contextual Compression (Using Reranking)

Contextual Compression is a fairly straightforward idea: We want to "compress" our retrieved context into just the most useful bits.

There are a few ways we can achieve this - but we're going to look at a specific example called reranking.

The basic idea here is this:

- We retrieve lots of documents that are very likely related to our query vector
- We "compress" those documents into a smaller set of *more* related documents using a reranking algorithm.

We'll be leveraging Cohere's Rerank model for our reranker today!

All we need to do is the following:

- Create a basic retriever
- Create a compressor (reranker, in this case)

That's it!

Let's see it in the code below!

In [None]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

compressor = CohereRerank(model="rerank-v3.5")
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=naive_retriever
)
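
The "retrieve wide, then compress" idea can be sketched without any API calls. In the sketch below, `word_overlap` is a deliberately crude, hypothetical scoring function standing in for a real reranking model such as Cohere's, and the candidate list stands in for what the base retriever would return.

```python
def word_overlap(query, doc):
    """Hypothetical relevance score: fraction of query words that appear in the document."""
    query_words = set(query.lower().split())
    doc_words = set(doc.lower().split())
    return len(query_words & doc_words) / len(query_words)

def rerank(query, candidates, score_fn, top_n=3):
    """Re-score the candidates against the query and keep only the best top_n."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_n]

# Hypothetical output of a first-stage retriever.
candidates = [
    "my payment was applied late",
    "late fee charged after on-time payment",
    "customer service was rude",
    "interest rate changed without notice",
]

print(rerank("late payment fee", candidates, word_overlap, top_n=2))
```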

Let's create our chain again, and see how this does!

In [None]:
contextual_compression_retrieval_chain = (
    {"context": itemgetter("question") | compression_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [None]:
contextual_compression_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

In [None]:
contextual_compression_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

In [None]:
contextual_compression_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

We'll need to rely on something like Ragas to help us get a better sense of how this is performing overall - but it "feels" better!

## Task 7: Multi-Query Retriever

Typically in RAG we have a single query - the one provided by the user.

What if we had... more than one query?

In essence, a Multi-Query Retriever works by:

1. Taking the original user query and creating `n` number of new user queries using an LLM.
2. Retrieving documents for each query.
3. Using all unique retrieved documents as context

So, how hard is it to set up? Not bad! Let's see it down below!



In [None]:
from langchain.retrievers.multi_query import MultiQueryRetriever

multi_query_retriever = MultiQueryRetriever.from_llm(
    retriever=naive_retriever, llm=chat_model
)
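
The three steps listed above can be sketched in a few lines. This is an illustration under assumptions: the reformulated queries are written by hand instead of generated by an LLM, and `toy_index` is a hypothetical lookup table standing in for a real retriever.

```python
def multi_query_retrieve(queries, retrieve):
    """Run every query through `retrieve` and return the unique documents, in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged

# Hypothetical reformulations of one user question, mapped to what each would retrieve.
toy_index = {
    "why did borrowers default": ["doc_a", "doc_b"],
    "reasons for missed loan payments": ["doc_b", "doc_c"],
    "causes of loan delinquency": ["doc_c", "doc_d"],
}

merged_docs = multi_query_retrieve(list(toy_index), toy_index.get)
print(merged_docs)
```

Each individual query only finds two documents, but the union covers four, which is the recall benefit in miniature.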

In [None]:
multi_query_retrieval_chain = (
    {"context": itemgetter("question") | multi_query_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [None]:
multi_query_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

In [None]:
multi_query_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

In [None]:
multi_query_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

#### ❓ Question #2:

Explain how generating multiple reformulations of a user query can improve recall.

#### 🎥✅ Answer #2:

User queries may not always be well formulated. Often, the hardest part of solving a problem is having a clear problem statement. As an analogy, think of an audience member asking a question of a presenter, and the presenter responding with: "I don't follow your question, could you state that another way, please?" Having an LLM reformulate the question in different ways increases the chances that the retriever will match documents that express the relevant idea with different wording.

## Task 8: Parent Document Retriever

A "small-to-big" strategy - the Parent Document Retriever works based on a simple strategy:

1. Each un-split "document" will be designated as a "parent document" (You could use larger chunks of document as well, but our data format allows us to consider the overall document as the parent chunk)
2. Store those "parent documents" in a memory store (not a VectorStore)
3. We will chunk each of those documents into smaller documents, and associate them with their respective parents, and store those in a VectorStore. We'll call those "child chunks".
4. When we query our Retriever, we will do a similarity search comparing our query vector to the "child chunks".
5. Instead of returning the "child chunks", we'll return their associated "parent chunks".

Okay, maybe that was a few steps - but the basic idea is this:

- Search for small documents
- Return big documents

The intuition is that we're likely to find the most relevant information by limiting the amount of semantic information that is encoded in each embedding vector - but we're likely to miss relevant surrounding context if we only use that information.

Let's start by creating our "parent documents" and defining a `RecursiveCharacterTextSplitter`.

In [None]:
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from qdrant_client import QdrantClient, models

parent_docs = loan_complaint_data
child_splitter = RecursiveCharacterTextSplitter(chunk_size=750)

We'll need to set up a new QDrant vectorstore - and we'll use another useful pattern to do so!

> NOTE: We are manually defining our embedding dimension, you'll need to change this if you're using a different embedding model.

In [None]:
from langchain_qdrant import QdrantVectorStore

client = QdrantClient(location=":memory:")

client.create_collection(
    collection_name="full_documents",
    vectors_config=models.VectorParams(size=1024, distance=models.Distance.COSINE)
)

parent_document_vectorstore = QdrantVectorStore(
    collection_name="full_documents", embedding = CohereEmbeddings(model="embed-english-v3.0"), client=client
)

Now we can create our `InMemoryStore` that will hold our "parent documents" - and build our retriever!

In [None]:
store = InMemoryStore()

parent_document_retriever = ParentDocumentRetriever(
    vectorstore = parent_document_vectorstore,
    docstore=store,
    child_splitter=child_splitter,
)

By default, this is empty as we haven't added any documents - let's add some now!

In [None]:
parent_document_retriever.add_documents(parent_docs, ids=None)
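
The "search small, return big" bookkeeping can be sketched with two plain data structures. This is a simplified illustration, not the library's implementation: fixed-width character chunks stand in for the `RecursiveCharacterTextSplitter`, a substring match stands in for the vector search, and the two complaints are made up.

```python
def split_into_chunks(text, chunk_size):
    """Naive fixed-width splitter, standing in for a real text splitter."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

def build_index(parents, chunk_size):
    """Return (docstore, child_index): full parents by id, plus (chunk, parent_id) pairs."""
    docstore = dict(parents)
    child_index = [
        (chunk, parent_id)
        for parent_id, text in parents.items()
        for chunk in split_into_chunks(text, chunk_size)
    ]
    return docstore, child_index

def retrieve_parents(query, docstore, child_index):
    """Match against child chunks, but return the de-duplicated parent documents."""
    matched_ids = []
    for chunk, parent_id in child_index:
        # Substring match is a hypothetical stand-in for similarity search.
        if query.lower() in chunk.lower() and parent_id not in matched_ids:
            matched_ids.append(parent_id)
    return [docstore[parent_id] for parent_id in matched_ids]

# Hypothetical parent documents.
parents = {
    "c1": "I was charged a late fee. The servicer refused to reverse it even after I sent proof.",
    "c2": "My interest rate went up without any notice from the lender.",
}
docstore, child_index = build_index(parents, chunk_size=30)
found = retrieve_parents("late fee", docstore, child_index)
print(found)
```

The match happens on a 30-character chunk, yet the caller gets the whole complaint back, including the part about the servicer refusing to reverse the fee.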

We'll create the same chain we did before - but substitute our new `parent_document_retriever`.

In [None]:
parent_document_retrieval_chain = (
    {"context": itemgetter("question") | parent_document_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's give it a whirl!

In [None]:
parent_document_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

In [None]:
parent_document_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

In [None]:
parent_document_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

Overall, the performance *seems* largely the same. We can leverage a tool like Ragas to more effectively answer the question about the performance.

## Task 9: Ensemble Retriever

In brief, an Ensemble Retriever simply takes 2, or more, retrievers and combines their retrieved documents based on a rank-fusion algorithm.

In this case - we're using the [Reciprocal Rank Fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) algorithm.

Setting it up is as easy as providing a list of our desired retrievers - and the weights for each retriever.

In [None]:
from langchain.retrievers import EnsembleRetriever

retriever_list = [bm25_retriever, naive_retriever, parent_document_retriever, compression_retriever, multi_query_retriever]
equal_weighting = [1/len(retriever_list)] * len(retriever_list)

ensemble_retriever = EnsembleRetriever(
    retrievers=retriever_list, weights=equal_weighting
)
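
For intuition, here is a minimal sketch of weighted Reciprocal Rank Fusion. Each retriever contributes `weight / (rank + c)` for every document it returns, and documents are sorted by their summed score. The constant `c=60` is the value used in the RRF paper; treating it as the setting in play here is an assumption, and the two rankings are made up.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights, c=60):
    """Fuse several ranked lists of document ids into one list, best first."""
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (rank + c)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of two retrievers for the same query.
rankings = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_c", "doc_a"],
]
fused = weighted_rrf(rankings, weights=[0.5, 0.5])
print(fused)
```

`doc_b` comes out on top because it ranks well in both lists, even though it is first in only one of them.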

We'll pack *all* of these retrievers together in an ensemble.

In [None]:
ensemble_retrieval_chain = (
    {"context": itemgetter("question") | ensemble_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

Let's look at our results!

In [None]:
ensemble_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

In [None]:
ensemble_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

In [None]:
ensemble_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

## Task 10: Semantic Chunking

While this is not a retrieval method - it *is* an effective way of increasing retrieval performance on corpora that have clean semantic breaks in them.

Essentially, Semantic Chunking is implemented by:

1. Embedding all sentences in the corpus.
2. Combining or splitting sequences of sentences according to their semantic similarity, using one of several [possible thresholding methods](https://python.langchain.com/docs/how_to/semantic-chunker/):
  - `percentile`
  - `standard_deviation`
  - `interquartile`
  - `gradient`
3. Each sequence of related sentences is kept as a document!

Let's see how to implement this!

We'll use the `percentile` thresholding method for this example which will:

Calculate the distances between all adjacent sentences, and then break the sequence of sentences wherever a distance exceeds a given percentile of all distances.

In [None]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings,
    breakpoint_threshold_type="percentile"
)
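
Here is a small sketch of the percentile rule itself, separate from the library. The sentences and the adjacent-sentence distances are made-up values (in practice the distances come from the embedding model), and the `percentile` helper uses plain linear interpolation, which is an assumption about the convention rather than a copy of the library's code.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * pct / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (position - lower)

def chunk_by_percentile(sentences, distances, pct):
    """Start a new chunk wherever the distance to the previous sentence exceeds the percentile."""
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    # distances[i] is the distance between sentences[i] and sentences[i + 1].
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sentences = [
    "I took out a student loan in 2015.",
    "The loan was serviced by one company.",
    "Payments were due monthly.",
    "Separately, my credit report shows an error.",
    "The error lists an account I never opened.",
]
distances = [0.10, 0.12, 0.80, 0.11]  # hypothetical adjacent-sentence distances

chunks = chunk_by_percentile(sentences, distances, pct=75)
for chunk in chunks:
    print(chunk)
```

Only the jump between the third and fourth sentence exceeds the 75th-percentile threshold, so the text splits into a loan chunk and a credit-report chunk.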

Now we can split our documents.

In [None]:
semantic_documents = semantic_chunker.split_documents(loan_complaint_data[:20])

Let's create a new vector store.

In [None]:
semantic_vectorstore = Qdrant.from_documents(
    semantic_documents,
    embeddings,
    location=":memory:",
    collection_name="Loan_Complaint_Data_Semantic_Chunks"
)

We'll use naive retrieval for this example.

In [None]:
semantic_retriever = semantic_vectorstore.as_retriever(search_kwargs={"k" : 10})

Finally we can create our classic chain!

In [None]:
semantic_retrieval_chain = (
    {"context": itemgetter("question") | semantic_retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

And view the results!

In [None]:
semantic_retrieval_chain.invoke({"question" : "What is the most common issue with loans?"})["response"].content

In [None]:
semantic_retrieval_chain.invoke({"question" : "Did any complaints not get handled in a timely manner?"})["response"].content

In [None]:
semantic_retrieval_chain.invoke({"question" : "Why did people fail to pay back their loans?"})["response"].content

#### ❓ Question #3:

If sentences are short and highly repetitive (e.g., FAQs), how might semantic chunking behave, and how would you adjust the algorithm?

#### ✅ Answer #3:

The short answer is that semantic chunking in this case would group many of the short, similar sentences together, and this would probably be desirable. But if the bar for similarity is too low, it might also group together sentences that are only superficially similar. To adjust for this, you would make splits trigger more readily: with the `percentile` method, that means lowering the percentile value that triggers a split.

I also wanted to make sure I understood the math and process behind semantic chunking. 


The Process

1. First, split the document into sentences (by default, LangChain's `SemanticChunker` does this with a regular expression on sentence-ending punctuation).
2. Then compute the semantic distance between each pair of adjacent sentences.
3. Then group sentences if they are "close" in semantic meaning, and split to a new chunk when the semantic distance to the previous sentence exceeds the threshold.

The Math

Percentile: 
Split when the semantic distance between two adjacent sentences is bigger than the n-th percentile of all adjacent-pair distances in the document. (Typical starting threshold: **75th or 80th percentile**. Lower the percentile to get smaller chunks, e.g. if the topic shifts frequently, or raise the percentile to get fewer, larger chunks.)

Standard Deviation: 
Split when the semantic distance between two adjacent sentences exceeds the mean by more than n times the standard deviation of all adjacent-pair distances in the document. (Typical value: **n = 1.5**; threshold = **Mean + 1.5 × StdDev**)

Interquartile:
Split when the distance exceeds the third quartile (Q3) by more than some factor, "n", of the interquartile range (IQR), where IQR = Q3 - Q1. (Typical value: **n = 1.5**; threshold = **Q3 + 1.5 × IQR**)

Gradient:
This is kind of like a first-order derivative, compared to the others. It essentially measures the velocity of change between the sentences (as opposed to the distance), triggering a split when there is a sudden change. To do this, after calculating the distances between each pair of adjacent sentences, you then calculate the gradients (the differences between consecutive distances). The rule would be to split whenever the gradient exceeds a threshold, **N**.  
(Typical value: **N = 0.10**)
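
As a quick check on the first three rules, here is a sketch that computes each threshold for one made-up list of adjacent-sentence distances, using the formulas exactly as written in these notes (75th percentile, Mean + 1.5 × StdDev, Q3 + 1.5 × IQR). The distances are hypothetical, the population standard deviation is an assumed choice, and the library's own formulas and defaults may differ, so treat this as a check on the arithmetic only. The gradient rule is left out.

```python
import statistics

# Hypothetical distances; distances[i] is the gap between sentence i and sentence i + 1.
distances = [0.10, 0.12, 0.80, 0.11, 0.15]
n = 1.5

# "inclusive" quartiles interpolate between data points and treat min/max as the 0th/100th percentile.
q1, _median, q3 = statistics.quantiles(distances, n=4, method="inclusive")
mean = statistics.mean(distances)
std_dev = statistics.pstdev(distances)  # population standard deviation (an assumption)

thresholds = {
    "percentile (75th)": q3,
    "standard_deviation": mean + n * std_dev,
    "interquartile": q3 + n * (q3 - q1),
}

# For each rule, list the positions where a split would be triggered.
split_positions = {
    name: [i for i, d in enumerate(distances) if d > threshold]
    for name, threshold in thresholds.items()
}

for name, threshold in thresholds.items():
    print(f"{name}: threshold={threshold:.4f}, split after sentence index {split_positions[name]}")
```

With one obvious outlier in the distances, all three rules agree on the same single split; they start to disagree when the distances are more evenly spread.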

🎥 See example [here](semantic_chunking_examples.md) that I worked out with ChatGPT



# 🤝 Breakout Room Part #2

#### 🏗️ Activity #1

Your task is to evaluate the various Retriever methods against each other.

You are expected to:

1. Create a "golden dataset"
 - Use Synthetic Data Generation (powered by Ragas, or otherwise) to create this dataset
2. Evaluate each retriever with *retriever specific* Ragas metrics
 - Semantic Chunking is not considered a retriever method and will not be required for marks, but you may find it useful to do a "semantic chunking on" vs. "semantic chunking off" comparison between them
3. Compile these in a list and write a small paragraph about which is best for this particular data and why.

Your analysis should factor in:
  - Cost
  - Latency
  - Performance

> NOTE: This is **NOT** required to be completed in class. Please spend time in your breakout rooms creating a plan before moving on to writing code.

##### HINTS:

- LangSmith provides detailed information about latency and cost.

In [None]:
### YOUR CODE HERE







#### Study Note: Documentation
I'm going to document the heck out of this.
If I track every step that I do, I'll know what to say in my Loom video 🤞
First off, I'm eager and nervous to try this.
I'm pretty sure I can find all the code I need in notebooks from session 7 & 8



#### Study Note: Design Idea
Activity 1 instructions seem pretty clear, but I ran it by ChatGPT to confirm.
I need to 
1. Create synth dataset for test case
2. Evaluate 5 retrieval approaches: Naive, BM25, Multi-Query, Parent Doc, Ensemble with Ragas
    a. run each retriever with the golden dataset
    b. evaluate each run with Ragas
3. Then I'll have to do something with LangSmith to get the latency and cost, but I'll think about that later



## Task 1: Create a dataset

I will find the ragas code from notebook 7

First, I'll do the nltk thing. I still don't understand it, but the comment in 7 said it would prevent (Mac-related) OS errors.
It worked there, so I will keep it.
First stumbling block: it failed. But, I figured out why pretty quick. This notebook didn't install nltk

#### Study Note: Environment setup
Till now, I've just been relying on uv sync, and not thinking much about dependencies.
So, I took a look at the pyproject.toml for this notebook vs 7. Lots of stuff in 7 that isn't here, and I think I'm going to need it.
But, I don't want to break anything, so, I'll just add things as I run into needing them (starting with nltk). Also, I'll just install via terminal first, and if nothing breaks, I'll update the toml file later.
Clearly, I'm also going to need ragas
Ran this command:  uv pip install nltk ragas==0.2.10



#### Study Notes: To-dos
Track things here that are pending:
- update toml with nltk and ragas (see above)
- update toml with rapidfuzz (see below)

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

That worked. Baby steps!

I'm going to end up doing something with LangSmith, so I'll steal the next cell from earlier notebooks (plus a print statement)

In [None]:
os.environ["LANGCHAIN_PROJECT"] = "Number9 Evaluation 3"
os.environ["LANGCHAIN_TRACING_V2"] = "true"


Now I'll import stuff and set the llm 


In [None]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-nano"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

Taking the 'abstracted SDG' from HW7
I ran it just as it was in HW7 and got an error because there was no such thing as "docs", so I figured out that I needed to put in "loan_complaint_data".
Pretty happy that I figured that out, with only a slight nudge from ChatGPT
It failed, with a nice clear error message that I needed to install "rapidfuzz".

Don't know why I need rapidfuzz now, when we didn't need it in HW7. 
I suppose it is because of the structure of the data.
ChatGPT gave me some other hogwash that I don't believe. I'm sticking with my guess, and it doesn't really matter.
I installed rapidfuzz (in terminal window; I'll need to add a cell or put it in the toml file later)<< added to toml!

In [None]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
golden_dataset = generator.generate_with_langchain_docs(loan_complaint_data[:20], testset_size=5)

Wow, that worked!
Now, I need to see the dataset, so I'm stealing the next cell from notebook 7

I am gobsmacked that I got this working in under an hour (not counting annotations) Insert excessive emojis here: ✅❤️🥳🎉
Time to (commit the code), take a victory lap, and call it a night.

In [None]:
golden_dataset.to_pandas()

Coming back to work on this after completing HW10, so now I have to reset **my** context.
I went ahead and added nltk and ragas to pyproject.toml, so I can clean up my 'technical debt'.
Reran the notebook, and it all worked.
Now, I ended up with 10 new questions.

Going to stash this dataset into a json, just in case I need it back later

In [None]:
import json

with open("goldendataset.json", "w") as f:
    json.dump([sample.model_dump() for sample in golden_dataset], f, indent=2)


Some code for later, just in case I do need to get the golden data back from the json file

In [None]:
import json

# Load the golden dataset
with open("goldendataset.json", "r") as f:
    golden_dataset = json.load(f)

print(f"✅ Loaded {len(golden_dataset)} test samples")
print(f"📝 First question: {golden_dataset[0]['eval_sample']['user_input']}")

In [None]:
import json

# Load questions for testing your retrievers
with open("goldendataset.json", "r") as f:
    golden_data = json.load(f)

def run_retriever_for_ragas(retriever_chain, golden_data):
    """Run retriever and format for Ragas evaluation"""
    outputs = []
    
    for item in golden_data:
        question = item["eval_sample"]["user_input"]
        reference = item["eval_sample"]["reference"]
        
        # Run your retriever
        result = retriever_chain.invoke({"question": question})
        
        outputs.append({
            "user_input": question,
            "reference": reference,
            "response": result["response"].content,
            "retrieved_contexts": [doc.page_content for doc in result["context"]]
        })
    
    return outputs

# Use with your retrievers
# ragas_data = run_retriever_for_ragas(naive_retrieval_chain, golden_data)

In [None]:
print(golden_dataset)


Reimporting golden data set from json after a restart

In [None]:
import json

# Load questions for testing your retrievers
with open("goldendataset.json", "r") as f:
    golden_data = json.load(f)

def run_retriever_on_dataset(name, retriever_chain, golden_data):
    """Run retriever and format for Ragas evaluation - consistent with your existing function"""
    print(f"Running {name} on golden dataset")
    outputs = []
    
    for item in golden_data:
        question = item["eval_sample"]["user_input"]
        reference = item["eval_sample"]["reference"]
        
        # Run your retriever
        response = retriever_chain.invoke({"question": question})
        
        outputs.append({
            "user_input": question,
            "reference": reference,
            "response": response["response"].content if hasattr(response["response"], "content") else response["response"],
            "retrieved_contexts": [ctx.page_content for ctx in response["context"]],
            "retriever_name": name
        })
    
    return outputs

## Task 2
Evaluate 5 retrieval approaches: Naive, BM25, Multi-Query, Parent Doc, Ensemble with Ragas

### Task 2.1 Run retrievers

Next step: Run a retriever with the golden data set
Starting slowly, with one retriever and saving the outputs

Here are the names of the retrieval chains, from above, for my reference:

- `naive_retrieval_chain`
- `bm25_retrieval_chain`
- `contextual_compression_retrieval_chain`
- `multi_query_retrieval_chain`
- `parent_document_retrieval_chain`
- `ensemble_retrieval_chain`


first, just try one retriever

In [None]:

bm25_outputs = run_retriever_on_dataset("bm25_thing", bm25_retrieval_chain, golden_dataset)

In [None]:
#first I'll try to run, with BM25 (hard-coded)
def run_retriever_on_dataset(retriever_chain, golden_dataset): #since this version is hard-coded, it never actually uses the retriever_chain parameter
    outputs = []

    for test_row in golden_dataset:
        print(test_row)
        response = bm25_retrieval_chain.invoke({
            "question": test_row.eval_sample.user_input})
        outputs.append({
            "user_input": test_row.eval_sample.user_input,
            "reference": test_row.eval_sample.reference,
            "response": response["response"].content if hasattr(response["response"], "content") else response["response"],
            "retrieved_contexts": [ctx.page_content for ctx in response["context"]],
        })

    return outputs

bm25_outputs = run_retriever_on_dataset(bm25_retrieval_chain, golden_dataset)


In [None]:
import pandas as pd

bm25_df = pd.DataFrame(bm25_outputs)
bm25_df


Now, convert output to a pandas.DataFrame first, and then into a Ragas EvaluationDataset:

In [None]:
import pandas as pd
from ragas import EvaluationDataset

# Step 1: Convert to DataFrame
bm25_df = pd.DataFrame(bm25_outputs)

# Step 2: Convert to Ragas-compatible EvaluationDataset
bm25_eval_dataset = EvaluationDataset.from_pandas(bm25_df)


### Evaluate first retriever (BM25)
Borrowing some more code from HW8 for the eval:

In [None]:
from langchain_openai import ChatOpenAI
from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

In [None]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

custom_run_config = RunConfig(timeout=360)

result = evaluate(
    dataset=bm25_eval_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

### Party time!

I ran an evaluation for one retriever; time for another victory lap.
First, stash the output for safekeeping (since the run takes 6 minutes), then figure out how to externalize the retriever name so I can loop through them all.

BM25 output from first run:
{'context_recall': 0.8417, 'faithfulness': 0.7594, 'factual_correctness': 0.5620, 'answer_relevancy': 0.6619, 'context_entity_recall': 0.4776, 'noise_sensitivity_relevant': 0.2619}
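
One way to stash scores like these is a small JSON file that later runs can add to. This is a minimal sketch using only the standard library; `save_scores`, `load_scores`, and the file name are hypothetical helpers (not part of Ragas), and the dictionary is the BM25 output above, copied by hand.

```python
import json
import tempfile
from pathlib import Path


def save_scores(name, scores, path):
    """Merge one retriever's scores into a JSON file, keeping earlier entries."""
    path = Path(path)
    stored = json.loads(path.read_text()) if path.exists() else {}
    stored[name] = scores
    path.write_text(json.dumps(stored, indent=2))
    return stored


def load_scores(path):
    """Read back everything saved so far as {retriever_name: {metric: score}}."""
    return json.loads(Path(path).read_text())


# BM25 scores copied from the printed output above.
bm25_scores = {
    "context_recall": 0.8417,
    "faithfulness": 0.7594,
    "factual_correctness": 0.5620,
    "answer_relevancy": 0.6619,
    "context_entity_recall": 0.4776,
    "noise_sensitivity_relevant": 0.2619,
}

# Hypothetical location: a fresh temp directory so the sketch leaves the project untouched.
scores_path = Path(tempfile.mkdtemp()) / "ragas_scores_sketch.json"
save_scores("bm25", bm25_scores, scores_path)
```

Keying the file by retriever name means that rerunning one retriever overwrites only its own entry.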

Instead of updating the `run_retriever_on_dataset` definition from above, I'll make a newer copy below, to keep tracing my process.

Define retriever names, to pair with the retrieval chains:

In [None]:
retrievers = {
    #"bm25": bm25_retrieval_chain, commented out because it was already run
    "naive": naive_retrieval_chain,
    "multi_query": multi_query_retrieval_chain,
    "parent_doc": parent_document_retrieval_chain,
    "ensemble": ensemble_retrieval_chain,
    "contextual_compression": contextual_compression_retrieval_chain,
}


In [None]:
#updated version of the code, with the retriever name externalized
# import time

def run_retriever_on_dataset(name, retriever_chain, golden_dataset):
    print(f"Running {name} on golden dataset")
    outputs = []

    for test_row in golden_dataset:
        response = retriever_chain.invoke({
            "question": test_row.eval_sample.user_input})
        outputs.append({
            "user_input": test_row.eval_sample.user_input,
            "reference": test_row.eval_sample.reference,
            "response": response["response"].content if hasattr(response["response"], "content") else response["response"],
            "retrieved_contexts": [ctx.page_content for ctx in response["context"]],
            "retriever_name": name #this is the new addition for being able to keep track of the retriever name later
        })


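    #    # Optional delay between requests; to use it, change the loop to
    #    # `for i, test_row in enumerate(golden_dataset):` and move this inside the loop
    #     if i < len(golden_dataset) - 1:  # Don't sleep after the last item
    #         print("  Waiting 2 seconds before next request...")
    #         time.sleep(2)  # Adjust this value as needed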

    return outputs
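
Before spending API calls, the loop can be smoke-tested against a stub. The sketch below is an assumption-laden stand-in: `StubChain`, `run_on_dataset_sketch`, and the fake dataset row are hypothetical, and the function is a trimmed copy of the one above so the block runs on its own.

```python
from types import SimpleNamespace


class StubChain:
    """Hypothetical stand-in for a retrieval chain; makes no API calls."""

    def invoke(self, inputs):
        doc = SimpleNamespace(page_content=f"context for: {inputs['question']}")
        return {"response": SimpleNamespace(content="stub answer"), "context": [doc]}


def run_on_dataset_sketch(name, retriever_chain, golden_dataset):
    """Trimmed copy of run_retriever_on_dataset, kept separate to avoid shadowing it."""
    outputs = []
    for test_row in golden_dataset:
        sample = test_row.eval_sample
        response = retriever_chain.invoke({"question": sample.user_input})
        answer = response["response"]
        outputs.append({
            "user_input": sample.user_input,
            "reference": sample.reference,
            # Chat models return a message object; plain strings pass through unchanged.
            "response": answer.content if hasattr(answer, "content") else answer,
            "retrieved_contexts": [ctx.page_content for ctx in response["context"]],
            "retriever_name": name,
        })
    return outputs


# One placeholder row shaped like the golden dataset rows used above.
fake_dataset = [
    SimpleNamespace(
        eval_sample=SimpleNamespace(
            user_input="What is the most common issue with loans?",
            reference="placeholder reference",
        )
    )
]
stub_outputs = run_on_dataset_sketch("stub", StubChain(), fake_dataset)
```

If the record printed from `stub_outputs[0]` has the five expected keys, the real chains can be swapped in with some confidence that the plumbing is right.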



The next cell looked like a good idea, but it was overengineered and didn't give me the output format I wanted or needed:
something about a dictionary of lists vs. a list of dictionaries.
And anyway, I would have had to split up the output, because I don't want to run 40 minutes' worth of evals in one cell.
This tends to happen when I listen to the AI too much.

In [None]:

# all_outputs = {} #initializing the dictionary to store the outputs

# for name, chain in retrievers.items():
#     print(f"Running: {name}")
#     outputs = run_retriever_on_dataset(name, chain, golden_dataset)
#     all_outputs[name] = outputs #adding the outputs to a single dictionary
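
One plausible reading of the format mismatch: `all_outputs` maps each retriever name to its own list of rows, while `pd.DataFrame(...)` builds one row per dictionary only when it is handed a single flat list. A minimal sketch of moving between the two shapes, with hypothetical helper names and placeholder rows:

```python
def flatten_outputs(all_outputs):
    """Turn {retriever_name: [row, ...]} into one flat list of row dicts."""
    rows = []
    for name, outputs in all_outputs.items():
        for row in outputs:
            # Copy the row and tag it so the retriever can be recovered later.
            rows.append({**row, "retriever_name": name})
    return rows


def group_outputs(rows):
    """Inverse of flatten_outputs: group flat rows back by retriever name."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["retriever_name"], []).append(row)
    return grouped


# Placeholder data, not real retriever output.
demo_all_outputs = {
    "naive": [{"user_input": "q1", "response": "a1"}],
    "bm25": [{"user_input": "q1", "response": "b1"}, {"user_input": "q2", "response": "b2"}],
}
demo_rows = flatten_outputs(demo_all_outputs)
demo_grouped = group_outputs(demo_rows)
```

The flat list is the shape that suits a single combined DataFrame; the grouped form suits one evaluation run per retriever.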



This is the brute-force version, with each retriever run separately. That works for me, especially since I might need to run some of the retrievers individually or multiple times,
which I now need to do, because contextual compression failed with a rate limit error. I will retry with a Cohere production key.
(Production key worked 👍)

In [None]:
# bm25_outputs = run_retriever_on_dataset("bm25", bm25_retrieval_chain, golden_dataset) #this was the first version, hard-coded
# naive_outputs = run_retriever_on_dataset("naive", naive_retrieval_chain, golden_dataset)
multi_query_outputs = run_retriever_on_dataset("multi_query", multi_query_retrieval_chain, golden_dataset)
# parent_doc_outputs = run_retriever_on_dataset("parent_doc", parent_document_retrieval_chain, golden_dataset)
# ensemble_outputs = run_retriever_on_dataset("ensemble", ensemble_retrieval_chain, golden_dataset)
# contextual_compression_outputs = run_retriever_on_dataset("contextual_compression", contextual_compression_retrieval_chain, golden_dataset)
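
For rate limit errors like the one contextual compression hit, an alternative to switching keys is to retry each call with exponential backoff. A minimal sketch, assuming any exception is worth retrying; `invoke_with_retry` and `FlakyChain` are hypothetical, and real code would catch only the provider's rate-limit exception.

```python
import time


def invoke_with_retry(chain, inputs, max_attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call chain.invoke, waiting base_delay * 2**n seconds after the n-th failure."""
    for attempt in range(1, max_attempts + 1):
        try:
            return chain.invoke(inputs)
        except Exception:
            # Assumption: every exception is retryable. Narrow this in practice.
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


class FlakyChain:
    """Hypothetical chain that fails twice, then succeeds."""

    def __init__(self):
        self.calls = 0

    def invoke(self, inputs):
        self.calls += 1
        if self.calls <= 2:
            raise RuntimeError("simulated rate limit")
        return {"response": f"answer to {inputs['question']}"}


# Record the delays instead of sleeping so the demo finishes instantly.
recorded_delays = []
flaky = FlakyChain()
retry_result = invoke_with_retry(flaky, {"question": "demo"}, sleep=recorded_delays.append)
```

Passing `sleep` as a parameter keeps the helper testable; in the notebook the default `time.sleep` would apply.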

I want to do some validation.
The next code cell is straight from ChatGPT.
This is the kind of stuff it is pretty good at, and I don't feel like I need to dive into the details.

In [None]:
expected_keys = {"user_input", "reference", "response", "retrieved_contexts"}

for name, output in [
    ("bm25", bm25_outputs),
    ("naive", naive_outputs),
    ("multi_query", multi_query_outputs),
    ("parent_doc", parent_doc_outputs),
    ("ensemble", ensemble_outputs),
    ("contextual_compression", contextual_compression_outputs),
]:
    if not isinstance(output, list):
        print(f"{name}: ❌ Not a list")
        continue

    if not output:
        print(f"{name}: ⚠️ Empty list")
        continue

    sample = output[0]
    if not isinstance(sample, dict):
        print(f"{name}: ❌ First item is not a dict")
        continue

    missing = expected_keys - set(sample.keys())
    if missing:
        print(f"{name}: ❌ Missing keys: {missing}")
    else:
        print(f"{name}: ✅ Format looks good ({len(output)} samples)")
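
The check above inspects only the first record of each list. A stricter variant can scan every row and also flag rows that retrieved nothing. This is a sketch with a hypothetical helper name and made-up sample rows:

```python
EXPECTED_KEYS = {"user_input", "reference", "response", "retrieved_contexts"}


def find_bad_rows(outputs, expected_keys=EXPECTED_KEYS):
    """Return (index, problem) pairs for every row that fails a check."""
    problems = []
    for i, row in enumerate(outputs):
        if not isinstance(row, dict):
            problems.append((i, "not a dict"))
            continue
        missing = expected_keys - set(row)
        if missing:
            problems.append((i, f"missing keys: {sorted(missing)}"))
        elif not row["retrieved_contexts"]:
            problems.append((i, "no retrieved contexts"))
    return problems


# Made-up rows: one good, one missing a key, one with an empty context list.
sample_rows = [
    {"user_input": "q", "reference": "r", "response": "a", "retrieved_contexts": ["c"]},
    {"user_input": "q", "response": "a", "retrieved_contexts": ["c"]},
    {"user_input": "q", "reference": "r", "response": "a", "retrieved_contexts": []},
]
bad_rows = find_bad_rows(sample_rows)
```

An empty result means every row has the expected keys and at least one retrieved context.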


As before, converting the outputs to evaluation datasets. There's probably a more efficient approach.

In [None]:
import pandas as pd
from ragas import EvaluationDataset

# Step 1: Convert to DataFrame
# bm25_df = pd.DataFrame(bm25_outputs)
naive_df = pd.DataFrame(naive_outputs)
multi_query_df = pd.DataFrame(multi_query_outputs)
parent_doc_df = pd.DataFrame(parent_doc_outputs)
ensemble_df = pd.DataFrame(ensemble_outputs)
contextual_compression_df = pd.DataFrame(contextual_compression_outputs)

# Step 2: Convert to Ragas-compatible EvaluationDataset
# bm25_eval_dataset = EvaluationDataset.from_pandas(bm25_df)
naive_eval_dataset = EvaluationDataset.from_pandas(naive_df)
multi_query_eval_dataset = EvaluationDataset.from_pandas(multi_query_df)
parent_doc_eval_dataset = EvaluationDataset.from_pandas(parent_doc_df)
ensemble_eval_dataset = EvaluationDataset.from_pandas(ensemble_df)
contextual_compression_eval_dataset = EvaluationDataset.from_pandas(contextual_compression_df)
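
A possibly tidier approach is to keep the outputs in a dict keyed by retriever name and build every dataset in one comprehension. The sketch below uses a hypothetical `build_datasets` helper and a stand-in converter so it runs without Ragas installed.

```python
def build_datasets(outputs_by_name, convert):
    """Apply the same conversion to every retriever's list of rows."""
    return {name: convert(rows) for name, rows in outputs_by_name.items()}


# In the notebook, `convert` could compose the same two calls used above:
#   lambda rows: EvaluationDataset.from_pandas(pd.DataFrame(rows))
# Here `len` is a stand-in converter and the rows are placeholders.
demo_outputs_by_name = {
    "naive": [{"user_input": "q1"}],
    "multi_query": [{"user_input": "q1"}, {"user_input": "q2"}],
}
demo_datasets = build_datasets(demo_outputs_by_name, convert=len)
```

The resulting dict can be iterated with `.items()` in the evaluation loop, which removes the need for a hand-written list of name and dataset pairs.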



Putting all the Ragas imports together here:

In [None]:
from langchain_openai import ChatOpenAI
from ragas import RunConfig, evaluate
from ragas.llms import LangchainLLMWrapper
from ragas.metrics import (
    ContextEntityRecall,
    FactualCorrectness,
    Faithfulness,
    LLMContextRecall,
    NoiseSensitivity,
    ResponseRelevancy,
)

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1-mini"))

The next cell is the same as the one I used above for BM25, but now in a loop to run all the evals together.
Hoping that `results[name] = result` gives me a separate, named result set for each retriever 🤞

In [None]:
custom_run_config = RunConfig(timeout=600)
results = {}

datasets = [
    ("bm25", bm25_eval_dataset),
    ("naive", naive_eval_dataset),
    ("multi_query", multi_query_eval_dataset),
    ("parent_doc", parent_doc_eval_dataset),
    ("ensemble", ensemble_eval_dataset),
    ("contextual_compression", contextual_compression_eval_dataset)
]

for name, dataset in datasets:
    print(f"Evaluating: {name}")
    try:
        result = evaluate(
            dataset=dataset,
            metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
            llm=evaluator_llm,
            run_config=custom_run_config
        )
        results[name] = result
    except Exception as e:
        print(f"❌ Error during {name}: {e}")
        results[name] = None  # or skip entirely


### Party time again!
The evals all ran. It took 56 minutes. Maybe I shouldn't have done them all at once?
I had to increase the timeout; I was getting too many errors at 360 seconds, so I upped it to 600 (10 minutes).
Ensemble and multi-query still generated too many timeout errors.

Run the next cell to make sure I really got the results the way I expected.

In [None]:
results["contextual_compression"]


That worked, so let's try looping through them all.

In [None]:
for name in results:
    print(f"\n{name.upper()} RESULTS:")
    print(results[name])


Got all the results; sav

### Success
I got the Ragas results! I saved them to rags_results_1.
I also need cost and latency, which I can get from LangGr

## Step 3: Get latency and cost
Now I need to use LangChain tracing (the traces show up in LangSmith).
Turns out I should have thought of this sooner, because I probably could have used the same retriever runs both to trace and to produce output for Ragas. Oh well, they don't take that long to run.

Main thing is I have to figure out how to configure tracing. I'll root around in homework 7, and then bug one of the AI assistants if that doesn't work
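
While tracing gets sorted out, wall-clock latency can also be measured locally by timing each call. A minimal sketch with hypothetical helpers; the `clock` parameter exists only so the demo can use fake timestamps, and cost is not covered here because it depends on token counts.

```python
import math
import statistics
import time


def time_calls(fn, inputs, clock=time.perf_counter):
    """Call fn on each input and return the elapsed seconds per call."""
    latencies = []
    for item in inputs:
        start = clock()
        fn(item)
        latencies.append(clock() - start)
    return latencies


def summarize(latencies):
    """Mean, nearest-rank 95th percentile, and max of a list of latencies."""
    ordered = sorted(latencies)
    p95_index = max(0, math.ceil(0.95 * len(ordered)) - 1)
    return {
        "mean": statistics.mean(ordered),
        "p95": ordered[p95_index],
        "max": ordered[-1],
    }


# Demo with a fake clock: two calls that "take" 1 and 2 seconds.
ticks = iter([0.0, 1.0, 1.0, 3.0])
demo_latencies = time_calls(lambda q: q.upper(), ["a", "b"], clock=lambda: next(ticks))
demo_summary = summarize(demo_latencies)
```

In the notebook, `fn` could be something like `lambda q: bm25_retrieval_chain.invoke({"question": q})`, with the default clock left in place.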


LangChain setup (I tried to do this above, but let's get everything in one place and get it working):

In [None]:
os.environ["LANGCHAIN_PROJECT"] = "Number9 Evaluation 4"
os.environ["LANGCHAIN_TRACING_V2"] = "true"

In [None]:
eval_llm = ChatOpenAI(model="gpt-4.1")

In [None]:
#magic France test cell
from langchain_openai import ChatOpenAI
# from langchain_anthropic import ChatAnthropic
from langchain_core.tracers import LangChainTracer
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define components
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
#llm = ChatAnthropic(model="claude-3-5-sonnet-20240620", temperature=0)
prompt = ChatPromptTemplate.from_template("What is the capital of France?")
chain = prompt | llm | StrOutputParser()

# Attach a tracer manually (failsafe if env vars don't take)
tracer = LangChainTracer()
chain_with_tracing = chain.with_config({"callbacks": [tracer]})

# Invoke chain
response = chain_with_tracing.invoke({})
print(response)


In [None]:
bm25_outputs_traced = run_retriever_on_dataset("bm25", bm25_retrieval_chain, golden_dataset)


In [None]:
# Step 1: Verify your environment variables are set
import os
from dotenv import load_dotenv

# Make sure your .env file is loaded
load_dotenv(dotenv_path="../.env")

# Check if all required keys are present
required_keys = ["OPENAI_API_KEY", "LANGCHAIN_API_KEY", "LANGCHAIN_TRACING_V2"]
for key in required_keys:
    if key in os.environ:
        print(f"✅ {key}: SET")
    else:
        print(f"❌ {key}: MISSING")

# Set the project name for organizing traces
os.environ["LANGCHAIN_PROJECT"] = "Advanced-Retrieval-Evaluation"
os.environ["LANGCHAIN_TRACING_V2"] = "true"

print(f"\n🎯 LangSmith Project: {os.environ.get('LANGCHAIN_PROJECT')}")
print(f"📊 Tracing Enabled: {os.environ.get('LANGCHAIN_TRACING_V2')}")


In [None]:
# 🔍 Comprehensive LangSmith Diagnostic
import os
from dotenv import load_dotenv

print("🔧 LANGSMITH DIAGNOSTIC REPORT")
print("=" * 50)

# 1. Check environment variables
print("\n1️⃣ ENVIRONMENT VARIABLES:")
load_dotenv(dotenv_path="../.env")

env_vars = {
    "LANGCHAIN_API_KEY": os.environ.get("LANGCHAIN_API_KEY", "NOT SET"),
    "LANGCHAIN_TRACING_V2": os.environ.get("LANGCHAIN_TRACING_V2", "NOT SET"),
    "LANGCHAIN_PROJECT": os.environ.get("LANGCHAIN_PROJECT", "NOT SET"),
    "LANGCHAIN_ENDPOINT": os.environ.get("LANGCHAIN_ENDPOINT", "NOT SET")
}

for key, value in env_vars.items():
    if value == "NOT SET":
        print(f"❌ {key}: {value}")
    elif key == "LANGCHAIN_API_KEY":
        print(f"✅ {key}: {value[:8]}...{value[-4:] if len(value) > 12 else 'TOO SHORT'}")
    else:
        print(f"✅ {key}: {value}")

# 2. Force set the environment variables again
print("\n2️⃣ FORCE SETTING VARIABLES:")
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "Advanced-Retrieval-Debug"
print(f"✅ Set LANGCHAIN_TRACING_V2 = {os.environ['LANGCHAIN_TRACING_V2']}")
print(f"✅ Set LANGCHAIN_PROJECT = {os.environ['LANGCHAIN_PROJECT']}")

# 3. Test LangSmith client directly
print("\n3️⃣ TESTING LANGSMITH CLIENT:")
try:
    from langsmith import Client
    client = Client()
    print(f"✅ LangSmith Client created successfully")
    print(f"✅ API URL: {client.api_url}")
    
    # Try to list projects to test connection
    projects = list(client.list_projects(limit=5))
    print(f"✅ Can connect to LangSmith - found {len(projects)} projects")
    
except Exception as e:
    print(f"❌ LangSmith Client failed: {e}")

print("\n" + "=" * 50)


In [None]:
import os
import time

# Use EXACTLY the same setup as your working France test
os.environ["LANGCHAIN_TRACING_V2"] = "true"

print("🧪 Using EXACT same pattern as working France test...")

# Create components exactly like France test
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

test_llm = ChatOpenAI(model="gpt-4.1-nano", temperature=0)
test_prompt = ChatPromptTemplate.from_template("Answer this question: {question}")
test_chain = test_prompt | test_llm

# Test with your loan question
question = "What is the most common issue with loans?"
print(f"Testing with: {question}")

start_time = time.time()
result = test_chain.invoke({"question": question})
end_time = time.time()

print(f"✅ Test completed in {end_time - start_time:.2f} seconds")
print(f"📝 Result: {result.content}")
print("\n🔗 Check LangSmith - this should appear like France did!")

In [None]:
from uuid import uuid4
import os
from getpass import getpass

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = f"AIE7 - Advanced Retrieval - {uuid4().hex[0:8]}"
os.environ["LANGCHAIN_API_KEY"] = getpass("LangSmith API Key: ")