# Using Ragas to Evaluate a RAG Application built with LangChain and LangGraph

In this notebook, we'll look at how [Ragas](https://github.com/explodinggradients/ragas) can help you evaluate your RAG applications!

While this example is rooted in LangChain/LangGraph - Ragas is framework agnostic (you don't even need to be using a framework!).

- 🤝 Breakout Room #1
  1. Task 1: Installing Required Libraries
  2. Task 2: Set Environment Variables
  3. Task 3: Synthetic Dataset Generation for Evaluation using Ragas
  4. Task 4: Evaluating our Pipeline with Ragas
  5. Task 5: Making Adjustments and Re-Evaluating

But first! Let's set some dependencies!

## Dependencies and API Keys:

> NOTE: Please skip the pip install commands if you are running the notebook locally.

In [1]:
#!pip install -qU ragas==0.2.10

In [2]:
#!pip install -qU langchain-community==0.3.14 langchain-openai==0.2.14 unstructured==0.16.12 langgraph==0.2.61 langchain-qdrant==0.2.0

We'll also need to provide our API keys.

First, OpenAI's for our LLM/embedding model combination!

In [3]:
import os
from getpass import getpass
os.environ["OPENAI_API_KEY"] = getpass("Please enter your OpenAI API key!")

## Generating Synthetic Test Data

We will be using Ragas to build out a set of synthetic test questions, references, and reference contexts. This is useful because it gives us a repeatable way to measure how our system is performing.

> NOTE: Ragas is best suited for finding *directional* changes in your LLM-based systems. The absolute scores aren't comparable in a vacuum.

### Data Preparation

We'll prepare our data - and download our webpages which we'll be using for our data today.

These webpages are from [Simon Willison's](https://simonwillison.net/) yearly "AI learnings".

- [2023 Blog](https://simonwillison.net/2023/Dec/31/ai-in-2023/)
- [2024 Blog](https://simonwillison.net/2024/Dec/31/llms-in-2024/)

Let's start by collecting our data into a useful pile!

In [4]:
!mkdir -p data


In [5]:
!curl https://simonwillison.net/2023/Dec/31/ai-in-2023/ -o data/2023_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 31524    0 31524    0     0  50391      0 --:--:-- --:--:-- --:--:-- 50357


In [6]:
!curl https://simonwillison.net/2024/Dec/31/llms-in-2024/ -o data/2024_llms.html

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 70549    0 70549    0     0  76792      0 --:--:-- --:--:-- --:--:-- 76767


Next, let's load our data into a familiar LangChain format using the `DirectoryLoader`.

In [7]:
from langchain_community.document_loaders import DirectoryLoader

path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

### Knowledge Graph Based Synthetic Generation

Ragas uses a knowledge graph based approach to create data. This is extremely useful as it allows us to create complex queries rather simply. The additional testset complexity allows us to evaluate larger problems more effectively, as systems tend to be very strong on simple evaluation tasks.

Let's start by defining our `generator_llm` (which will generate our questions, summaries, and more), and our `generator_embeddings` which will be useful in building our graph.

### Abstracted SDG

Building the knowledge graph step by step is the full process - but we can shortcut that using the provided abstractions!

This will generate our knowledge graph under the hood, and will - from there - generate our personas and scenarios to construct our queries.



In [8]:
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
generator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4.1"))
generator_embeddings = LangchainEmbeddingsWrapper(OpenAIEmbeddings())

In [9]:
from ragas.testset import TestsetGenerator

generator = TestsetGenerator(llm=generator_llm, embedding_model=generator_embeddings)
dataset = generator.generate_with_langchain_docs(docs, testset_size=10)

Applying HeadlinesExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying HeadlineSplitter:   0%|          | 0/2 [00:00<?, ?it/s]

Applying SummaryExtractor:   0%|          | 0/2 [00:00<?, ?it/s]

Applying CustomNodeFilter:   0%|          | 0/12 [00:00<?, ?it/s]

Applying [EmbeddingExtractor, ThemesExtractor, NERExtractor]:   0%|          | 0/26 [00:00<?, ?it/s]

Applying [CosineSimilarityBuilder, OverlapScoreBuilder]:   0%|          | 0/2 [00:00<?, ?it/s]

Generating personas:   0%|          | 0/2 [00:00<?, ?it/s]

Generating Scenarios:   0%|          | 0/3 [00:00<?, ?it/s]

Generating Samples:   0%|          | 0/12 [00:00<?, ?it/s]

In [10]:
dataset.to_pandas()

Unnamed: 0,user_input,reference_contexts,reference,synthesizer_name
0,"Which organizations, besides OpenAI, have prod...",[We don’t yet know how to build GPT-4 Vibes Ba...,"Yes, Anthropic is listed among the organizatio...",single_hop_specifc_query_synthesizer
1,why openai still got best model and nobody bea...,[I’m surprised that no-one has beaten the now ...,OpenAI clearly have some substantial tricks th...,single_hop_specifc_query_synthesizer
2,wHat is AI acording to the context?,[Simon Willison’s Weblog Subscribe Stuff we fi...,"AI refers to Large Language Models, which are ...",single_hop_specifc_query_synthesizer
3,Wut is Plausible and how was it used in the co...,[Microsoft over this issue. The 69 page PDF is...,Plausible analytics was used to gather data ab...,single_hop_specifc_query_synthesizer
4,What recent advancements in multi-modal AI mod...,[<1-hop>\n\nyou talk to me exclusively in Span...,Recent advancements in multi-modal AI models i...,multi_hop_abstract_query_synthesizer
5,How have recent advancements in multi-modal AI...,[<1-hop>\n\nyou talk to me exclusively in Span...,Recent advancements in multi-modal AI models h...,multi_hop_abstract_query_synthesizer
6,how multi-modal ai models like gemini and chat...,[<1-hop>\n\nyou talk to me exclusively in Span...,multi-modal ai models like gemini and chatgpt ...,multi_hop_abstract_query_synthesizer
7,How has the rise of fine-tuning and customizat...,[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of fine-tuning and customization of l...,multi_hop_abstract_query_synthesizer
8,how chatgpt vision help with blog analytics an...,[<1-hop>\n\nMicrosoft over this issue. The 69 ...,chatgpt vision was used to analyze a screensho...,multi_hop_specific_query_synthesizer
9,How does the training cost and scale of Llama ...,[<1-hop>\n\nmodel available to try out through...,"Llama 3.1 405B, the largest model in Meta’s Ll...",multi_hop_specific_query_synthesizer


## LangChain RAG

Now we'll construct our LangChain RAG, which we will be evaluating using the above created test data!

### R - Retrieval

Let's start with building our retrieval pipeline, which will involve loading the same data we used to create our synthetic test set above.

> NOTE: We need to use the same data - as our test set is specifically designed for this data.

In [11]:
path = "data/"
loader = DirectoryLoader(path, glob="*.html")
docs = loader.load()

Now that we have our data loaded, let's split it into chunks!

In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(docs)
len(split_documents)

75

#### ❓ Question: 

What is the purpose of the `chunk_overlap` parameter in the `RecursiveCharacterTextSplitter`?

#### 😎 ANSWER #2:

The `chunk_overlap` parameter defines the number of characters that overlap between two adjacent chunks when splitting documents.

The main purposes of overlap are:
1. **Preserving contextual continuity**: Overlapping prevents concepts, sentences, or paragraphs from being abruptly cut off, which would fragment related information and impair comprehension.

2. **Reducing information loss at boundaries**: Without overlapping, critical information located at chunk boundaries could be difficult to retrieve because the necessary context would be split between two separate chunks.

3. **Improved retrieval accuracy**: When a query references information that would otherwise be split between two chunks, overlapping increases the likelihood that at least one of the chunks contains sufficient context to be correctly retrieved.

Appropriate overlap represents a balance between chunk granularity (which enables targeted retrieval) and context preservation (which ensures the relevance of answers).
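
To make the overlap idea concrete, here is a minimal sketch of a fixed-window character splitter. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on separators such as paragraphs and sentences); `split_with_overlap` is a hypothetical helper written only for illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = split_with_overlap(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last 4 characters of the previous one, so a phrase that straddles a boundary still appears whole in at least one chunk.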

Next up, we'll need to provide an embedding model that we can use to construct our vector store.

In [13]:
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

Now we can build our in-memory Qdrant vector store.

In [14]:
from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
from qdrant_client.http.models import Distance, VectorParams

client = QdrantClient(":memory:")

client.create_collection(
    collection_name="ai_across_years",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

vector_store = QdrantVectorStore(
    client=client,
    collection_name="ai_across_years",
    embedding=embeddings,
)

We can now add our documents to our vector store.

In [15]:
_ = vector_store.add_documents(documents=split_documents)

Let's define our retriever.

In [16]:
retriever = vector_store.as_retriever(search_kwargs={"k": 5})
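
As a rough picture of what `k=5` with cosine distance means, the sketch below ranks a few toy 3-dimensional vectors against a query and keeps the closest ones. This is only a conceptual illustration: the real embeddings have 1536 dimensions, Qdrant does not necessarily search this way internally, and `cosine_similarity`, `top_k`, and the toy vectors are hypothetical names and values made up for the example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Return the names of the k vectors most similar to the query."""
    ranked = sorted(
        doc_vecs,
        key=lambda name: cosine_similarity(query_vec, doc_vecs[name]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for chunk embeddings (assumed values, not real embeddings).
toy_vectors = {
    "chunk_a": [1.0, 0.0, 0.0],
    "chunk_b": [0.9, 0.1, 0.0],
    "chunk_c": [0.0, 1.0, 0.0],
    "chunk_d": [0.0, 0.0, 1.0],
}
query = [1.0, 0.2, 0.0]
nearest = top_k(query, toy_vectors, k=2)
print(nearest)
```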

Now we can produce a node for retrieval!

In [17]:
def retrieve(state):
  retrieved_docs = retriever.invoke(state["question"])
  return {"context" : retrieved_docs}

### Augmented

Let's create a simple RAG prompt!

In [18]:
from langchain.prompts import ChatPromptTemplate

RAG_PROMPT = """\
You are a helpful assistant who answers questions based on provided context. You must only use the provided context, and cannot use your own knowledge.

### Question
{question}

### Context
{context}
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

### Generation

We'll also need an LLM to generate responses - we'll use `gpt-4o-mini` to avoid using the same model as our judge model.

In [19]:
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini")

Then we can create a `generate` node!

In [20]:
def generate(state):
  docs_content = "\n\n".join(doc.page_content for doc in state["context"])
  messages = rag_prompt.format_messages(question=state["question"], context=docs_content)
  response = llm.invoke(messages)
  return {"response" : response.content}

### Building RAG Graph with LangGraph

Let's create some state for our LangGraph RAG graph!

In [21]:
from langgraph.graph import START, StateGraph
from typing_extensions import List, TypedDict
from langchain_core.documents import Document

class State(TypedDict):
  question: str
  context: List[Document]
  response: str

Now we can build our simple graph!

> NOTE: We're using `add_sequence` since we will always move from retrieval to generation. This is essentially building a chain in LangGraph.

In [22]:
graph_builder = StateGraph(State).add_sequence([retrieve, generate])
graph_builder.add_edge(START, "retrieve")
graph = graph_builder.compile()
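
Conceptually, a sequence of nodes behaves like the loop below: each node receives the current state and returns only the keys it wants to set, which are merged back into the state before the next node runs. This is a simplified sketch, not LangGraph's actual implementation; `run_sequence`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins.

```python
from typing import Callable

Node = Callable[[dict], dict]


def run_sequence(nodes: list[Node], initial_state: dict) -> dict:
    """Run nodes in order, merging each node's partial update into the state."""
    state = dict(initial_state)
    for node in nodes:
        update = node(state)  # a node returns only the keys it changes
        state.update(update)  # later nodes see the merged state
    return state


def fake_retrieve(state: dict) -> dict:
    # Stand-in for the real retriever call.
    return {"context": [f"passage about {state['question']}"]}


def fake_generate(state: dict) -> dict:
    # Stand-in for the real LLM call.
    return {"response": f"Answer built from {len(state['context'])} passage(s)"}


final_state = run_sequence([fake_retrieve, fake_generate], {"question": "LLM agents"})
print(final_state)
```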

Let's do a test to make sure it's doing what we'd expect.

In [23]:
response = graph.invoke({"question" : "How are LLM agents useful?"})

In [24]:
response["response"]

'LLM agents are considered useful for several reasons, despite some skepticism surrounding their utility. They can act on behalf of users, similar to a travel agent, and can solve problems by utilizing tools in a loop. This capability suggests a level of autonomy, although definitions of autonomy can vary. \n\nLLMs are relatively easy to build, requiring only a few hundred lines of code and a substantial amount of high-quality training data. This accessibility has opened up the possibility for more individuals and organizations to create their own LLMs, as opposed to them being the sole domain of wealthy entities.\n\nMoreover, LLMs can be run on personal devices, making them more practical for everyday use. There have been significant advancements in this area, allowing individuals to utilize LLMs without needing expensive hardware.\n\nDespite their effectiveness, LLMs are not without criticism. They can struggle with reliability and ethical considerations, and their tendency to "hallu

## Evaluating the App with Ragas

Now we can finally do our evaluation!

We'll start by running the queries we generated using SDG above through our application to get context and responses.

In [25]:
for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]

In [26]:
dataset.to_pandas()

Unnamed: 0,user_input,retrieved_contexts,reference_contexts,response,reference,synthesizer_name
0,"Which organizations, besides OpenAI, have prod...",[Getting back to models that beat GPT-4: Anthr...,[We don’t yet know how to build GPT-4 Vibes Ba...,"Yes, Anthropic appears among the organizations...","Yes, Anthropic is listed among the organizatio...",single_hop_specifc_query_synthesizer
1,why openai still got best model and nobody bea...,[This is a huge advantage for open over closed...,[I’m surprised that no-one has beaten the now ...,"OpenAI still has the best model, GPT-4, largel...",OpenAI clearly have some substantial tricks th...,single_hop_specifc_query_synthesizer
2,wHat is AI acording to the context?,[The two main categories I see are people who ...,[Simon Willison’s Weblog Subscribe Stuff we fi...,"According to the context, AI is described as a...","AI refers to Large Language Models, which are ...",single_hop_specifc_query_synthesizer
3,Wut is Plausible and how was it used in the co...,"[The top five: ai (342), generativeai (300), l...",[Microsoft over this issue. The 69 page PDF is...,Plausible is a web analytics tool that the aut...,Plausible analytics was used to gather data ab...,single_hop_specifc_query_synthesizer
4,What recent advancements in multi-modal AI mod...,[The rise of inference-scaling “reasoning” mod...,[<1-hop>\n\nyou talk to me exclusively in Span...,Recent advancements in multi-modal AI models h...,Recent advancements in multi-modal AI models i...,multi_hop_abstract_query_synthesizer
5,How have recent advancements in multi-modal AI...,[In October I upgraded my LLM CLI tool to supp...,[<1-hop>\n\nyou talk to me exclusively in Span...,"Recent advancements in multi-modal AI models, ...",Recent advancements in multi-modal AI models h...,multi_hop_abstract_query_synthesizer
6,how multi-modal ai models like gemini and chat...,[In October I upgraded my LLM CLI tool to supp...,[<1-hop>\n\nyou talk to me exclusively in Span...,Multi-modal AI models like Gemini and ChatGPT ...,multi-modal ai models like gemini and chatgpt ...,multi_hop_abstract_query_synthesizer
7,How has the rise of fine-tuning and customizat...,[Another common technique is to use larger mod...,[<1-hop>\n\nWe don’t yet know how to build GPT...,The rise of fine-tuning and customization of l...,The rise of fine-tuning and customization of l...,multi_hop_abstract_query_synthesizer
8,how chatgpt vision help with blog analytics an...,"[The top five: ai (342), generativeai (300), l...",[<1-hop>\n\nMicrosoft over this issue. The 69 ...,ChatGPT Vision can assist with blog analytics ...,chatgpt vision was used to analyze a screensho...,multi_hop_specific_query_synthesizer
9,How does the training cost and scale of Llama ...,[The really impressive thing about DeepSeek v3...,[<1-hop>\n\nmodel available to try out through...,The training cost and scale of Llama 3.1 405B ...,"Llama 3.1 405B, the largest model in Meta’s Ll...",multi_hop_specific_query_synthesizer


Then we can convert that table into an `EvaluationDataset`, which will make the process of evaluation smoother.

In [27]:
from ragas import EvaluationDataset

evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

We'll need to select a judge model - in this case we're using `gpt-4o`. Note that this is a different model from the `gpt-4.1` that generated our synthetic data and from the `gpt-4o-mini` that generates our responses.

In [28]:
# from ragas import evaluate
from ragas.llms import LangchainLLMWrapper

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o"))

Next up - we simply evaluate on our desired metrics!

In [29]:
from ragas.metrics import LLMContextRecall, Faithfulness, FactualCorrectness, ResponseRelevancy, ContextEntityRecall, NoiseSensitivity
from ragas import evaluate, RunConfig

#custom_run_config = RunConfig(timeout=360)

# I made this change because I got too many rate limit errors
custom_run_config = RunConfig(
    timeout=600, 
    max_workers=2  # 2 jobs at a time to avoid rate limiting
)

result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.6000, 'faithfulness': 0.7570, 'factual_correctness': 0.5117, 'answer_relevancy': 0.9221, 'context_entity_recall': 0.4060, 'noise_sensitivity_relevant': 0.1927}

## Making Adjustments and Re-Evaluating

Now that we've got our baseline - let's make a change and see whether the system improves or not!

> NOTE: This will be using Cohere's Rerank model (which was updated fairly [recently](https://docs.cohere.com/v2/changelog/rerank-v3.5)) - please be sure to [sign-up for an API key!](https://docs.cohere.com/reference/about)

In [30]:
os.environ["COHERE_API_KEY"] = getpass("Please enter your Cohere API key!")

In [31]:
#!pip install -qU cohere langchain_cohere


We'll first set our retriever to return more documents, which will allow us to take advantage of the reranking.

In [32]:
retriever = vector_store.as_retriever(search_kwargs={"k": 20})

Reranking (which LangChain exposes as contextual compression) is a technique that uses a reranker model to re-score the retrieved documents against the query and keep only the most relevant ones.

This is essentially a slower, more accurate form of semantic similarity that we use on a smaller subset of our documents.
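
The "retrieve broadly, then rerank" pattern can be sketched with two toy scoring functions. These are deliberately crude stand-ins (word overlap instead of embeddings, and a length-normalized overlap instead of a reranker model) and say nothing about how Cohere's model scores documents; all names below are hypothetical.

```python
def cheap_score(query: str, doc: str) -> int:
    """First-stage stand-in: number of words shared with the query."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def careful_score(query: str, doc: str) -> float:
    """Second-stage stand-in: shared words normalized by document length."""
    words = doc.lower().split()
    if not words:
        return 0.0
    shared = set(query.lower().split()) & set(words)
    return len(shared) / len(words)


def retrieve_then_rerank(query: str, docs: list[str], first_k: int, final_n: int) -> list[str]:
    """Take a broad candidate set with the cheap score, then keep the best after rescoring."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:first_k]
    reranked = sorted(candidates, key=lambda d: careful_score(query, d), reverse=True)
    return reranked[:final_n]


toy_docs = [
    "llm agents use tools in a loop",
    "agents and llm hype with many unrelated words about travel plans and tools",
    "llm tools",
    "cooking recipes for pasta",
]
best = retrieve_then_rerank("llm agents tools", toy_docs, first_k=3, final_n=2)
print(best)
```

The second stage only ever sees the `first_k` candidates, which is why the first-stage `k` is raised before reranking: anything the cheap stage misses cannot be recovered later.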

In [33]:
from langchain.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

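def retrieve_adjusted(state):
  # NOTE: the number of documents kept after reranking is governed by CohereRerank's
  # `top_n` setting; the `search_kwargs` passed below is most likely ignored by
  # ContextualCompressionRetriever, so check how many documents actually come back.
  compressor = CohereRerank(model="rerank-v3.5")
  compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor, base_retriever=retriever, search_kwargs={"k": 5}
  )
  # Retrieve with the base retriever (k=20), then let the reranker filter the results
  retrieved_docs = compression_retriever.invoke(state["question"])
  return {"context" : retrieved_docs}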

We can simply rebuild our graph with the new retriever!

In [34]:
class State(TypedDict):
  question: str
  context: List[Document]
  response: str

graph_builder = StateGraph(State).add_sequence([retrieve_adjusted, generate])
graph_builder.add_edge(START, "retrieve_adjusted")
graph = graph_builder.compile()

In [35]:
response = graph.invoke({"question" : "How are LLM agents useful?"})
response["response"]

'LLM agents are useful primarily in two main categories. First, they can function similarly to traditional agents, acting on behalf of users in tasks such as travel planning. Second, they can utilize tools in a loop to solve specific problems, leveraging their language processing capabilities.\n\nOne of the most notable applications of LLMs is in writing code, as they can effectively understand and generate programming languages, which tend to have simpler grammar rules compared to natural languages. This capability highlights their potential utility in software development.\n\nHowever, there are significant concerns regarding their reliability and ability to distinguish truth from fiction, which raises skepticism about their overall utility. Despite the excitement surrounding AI agents, practical examples of them being deployed in production are still limited, largely due to the challenges posed by their gullibility and the potential consequences of their decisions.'

In [36]:
import time

for test_row in dataset:
  response = graph.invoke({"question" : test_row.eval_sample.user_input})
  test_row.eval_sample.response = response["response"]
  test_row.eval_sample.retrieved_contexts = [context.page_content for context in response["context"]]
  time.sleep(2) # To try to avoid rate limiting.
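
A fixed `time.sleep(2)` is the simplest way to slow down. An alternative (not what this notebook uses) is to retry a failed call with exponentially growing delays. The sketch below assumes a hypothetical `call_with_backoff` helper and catches a generic `Exception` for brevity; in practice you would narrow that to the rate limit error raised by your client library.

```python
import time


def call_with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(), retrying on failure with delays of base_delay * 2**attempt seconds."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: narrow this to your client's rate limit error
            if attempt == max_attempts - 1:
                raise  # out of retries, surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake call that fails twice before succeeding.
attempts = {"count": 0}


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "ok"


delays = []  # record the delays instead of actually sleeping
outcome = call_with_backoff(flaky, sleep=delays.append)
print(outcome, delays)
```

In the loop above, this would mean wrapping the `graph.invoke(...)` call, for example `call_with_backoff(lambda: graph.invoke({"question": q}))`.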

In [37]:
# Rebuild the evaluation dataset so it picks up the responses and contexts from the previous cell
evaluation_dataset = EvaluationDataset.from_pandas(dataset.to_pandas())

In [38]:
result = evaluate(
    dataset=evaluation_dataset,
    metrics=[LLMContextRecall(), Faithfulness(), FactualCorrectness(), ResponseRelevancy(), ContextEntityRecall(), NoiseSensitivity()],
    llm=evaluator_llm,
    run_config=custom_run_config
)
result

Evaluating:   0%|          | 0/72 [00:00<?, ?it/s]

{'context_recall': 0.6870, 'faithfulness': 0.7012, 'factual_correctness': 0.5317, 'answer_relevancy': 0.9405, 'context_entity_recall': 0.4385, 'noise_sensitivity_relevant': 0.2881}
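
Since Ragas scores are most useful as directional signals, it helps to look at the per-metric change between the two runs. The sketch below copies the two result dictionaries printed above and computes the deltas. One assumption is labeled in the code: per the Ragas documentation, noise sensitivity is a lower-is-better metric, so its sign is interpreted in reverse. `compare_runs` is a hypothetical helper.

```python
baseline = {
    "context_recall": 0.6000,
    "faithfulness": 0.7570,
    "factual_correctness": 0.5117,
    "answer_relevancy": 0.9221,
    "context_entity_recall": 0.4060,
    "noise_sensitivity_relevant": 0.1927,
}
reranked = {
    "context_recall": 0.6870,
    "faithfulness": 0.7012,
    "factual_correctness": 0.5317,
    "answer_relevancy": 0.9405,
    "context_entity_recall": 0.4385,
    "noise_sensitivity_relevant": 0.2881,
}

# Assumption: noise sensitivity is the only metric here where lower is better.
LOWER_IS_BETTER = {"noise_sensitivity_relevant"}


def compare_runs(before: dict, after: dict) -> dict:
    """Map each metric to (delta, improved) where delta = after - before."""
    comparison = {}
    for metric, old_score in before.items():
        delta = round(after[metric] - old_score, 4)
        improved = delta < 0 if metric in LOWER_IS_BETTER else delta > 0
        comparison[metric] = (delta, improved)
    return comparison


comparison = compare_runs(baseline, reranked)
for metric, (delta, improved) in comparison.items():
    print(f"{metric:28s} {delta:+.4f} {'better' if improved else 'worse'}")
```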

#### ❓ Question: 

Which system performed better, on what metrics, and why?

#### 😎 ANSWER:

The system with Cohere Rerank scored better on four of the six metrics:
1. **Improved context recall (0.600 → 0.687)**: The reranking system is better at surfacing the passages needed to support the reference answers. This is the largest gain.
Why?
- The baseline system retrieves only 5 documents directly, while the reranking approach first retrieves 20 documents before filtering them. This "broaden then refine" strategy can capture relevant documents that would not have been in the initial top 5 based on vector similarity alone.
- The reranker scores the query and each document together instead of comparing independently computed embeddings, so it can catch subtle semantic matches that simple vector similarity sometimes misses.


2. **Slightly better factual correctness (0.512 → 0.532)**: The generated answers agree a little more with the reference answers, likely due to more relevant context selection.
Why?
- The reranker can better separate documents that are truly relevant from those that only appear relevant on the surface but contain ambiguous or imprecise information.
- Reranking can identify passages that address specific aspects of the question, which improves the factual accuracy of answers.


3. **Slightly better answer relevancy (0.922 → 0.941)**: A small improvement in how well the answers address the questions asked.
Why?
- Reranking models can analyze the interactions between the question and each candidate document, rather than only considering their representations independently.
- By better filtering out irrelevant documents, the context provided is more focused on information directly useful to answer the question.


4. **Improved context entity recall (0.406 → 0.439)**: The retrieved context covers more of the important entities mentioned in the reference.
Why?
- By starting with 20 documents before filtering, the system has a better chance of capturing important entities that might be underrepresented in the initial top 5.
- The reranker may also favor passages that explicitly mention the entities named in the question, although these results alone cannot confirm that.


The baseline system scored better on the remaining two metrics:
1. **Better faithfulness to the provided context (0.757 vs. 0.701)**: With reranked context, the answers deviate slightly more often from what the context supports.
Why?
- Reranking can select more diverse, less homogeneous documents, which creates a more complex context for the LLM to synthesize faithfully.
- Higher faithfulness is not the whole story: a system that simply reproduces what is in the context can do so at the expense of providing an accurate and complete answer to the question.


2. **Lower noise sensitivity (0.193 vs. 0.288)**: For this Ragas metric lower is better, so the higher score after reranking means more incorrect claims made it into the answers, not fewer.
Why?
- The drop in faithfulness points the same way: when the LLM strays from the context more often, more of its claims end up incorrect.
- A more diverse context gives the LLM more material to misread or combine incorrectly.


Overall, reranking helped retrieval quality (context recall, context entity recall) and gave small gains in answer quality, at the cost of lower faithfulness and higher noise sensitivity. With only 10 test questions, the smaller differences should be read as directional rather than conclusive.
