## Evaluation of RAG Systems
### Introduction
Evaluating Retrieval-Augmented Generation (RAG) systems is crucial to ensure they provide accurate, relevant, and reliable responses. Because a RAG system chains a retrieval step to a generation step, a bad answer can originate in either stage, which makes evaluation more complex than for a single-model application. We need to assess both the quality of the retrieved information and the quality of the generated responses.
### Faithfulness
#### What is Faithfulness?
Faithfulness measures whether the generated answer is grounded in and supported by the retrieved context. In other words, is the model making things up (hallucinating) or is it staying true to the information it retrieved?
#### Why it matters:
If your RAG system generates answers that contradict or aren't supported by the retrieved documents, users will receive incorrect information even though you have the right documents in your knowledge base. This defeats the entire purpose of RAG.
#### How to evaluate faithfulness:
You can assess faithfulness by checking if every claim in the generated answer can be traced back to the retrieved context. Think of it like fact-checking with citations. There are several approaches:

- Manual review: Have humans read the context and answer, then verify if the answer is supported
- LLM-as-judge: Use another LLM to compare the generated answer against the retrieved context and score faithfulness
- Statement verification: Break the answer into individual claims and verify each one against the context

##### Example of high faithfulness:
Retrieved context: "The Eiffel Tower was completed in 1889 and stands 330 meters tall."
Generated answer: "The Eiffel Tower, completed in 1889, has a height of 330 meters."

##### Example of low faithfulness:
Retrieved context: "The Eiffel Tower was completed in 1889 and stands 330 meters tall."
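Generated answer: "The Eiffel Tower was built in 1887 and is the tallest structure in Paris."

Here the answer changes the year (1887 instead of 1889) and adds a claim about being the tallest structure in Paris that the context never makes.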
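
The statement-verification approach can be sketched in a few lines. This is a rough illustration, not a production metric: it splits the answer into sentences and counts a sentence as supported when most of its content words appear in the context. The function names, the stopword list, and the 0.75 threshold are all assumptions made for this example; a real pipeline would typically use an LLM or an entailment model for both steps, since word overlap cannot detect paraphrase or negation.

```python
import re

# Small illustrative stopword list (an assumption, not a standard resource)
STOPWORDS = {"the", "a", "an", "is", "was", "and", "of", "in", "has", "it", "to"}


def split_claims(answer: str) -> list[str]:
    """Split an answer into sentence-level claims (crude stand-in for LLM claim extraction)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]


def content_tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def claim_support(claim: str, context: str) -> float:
    """Fraction of the claim's content tokens that also occur in the context."""
    claim_toks = content_tokens(claim)
    if not claim_toks:
        return 1.0
    return len(claim_toks & content_tokens(context)) / len(claim_toks)


def faithfulness_score(answer: str, context: str, threshold: float = 0.75) -> float:
    """Share of claims whose lexical support reaches the threshold."""
    claims = split_claims(answer)
    if not claims:
        return 0.0
    supported = sum(claim_support(c, context) >= threshold for c in claims)
    return supported / len(claims)


context = "The Eiffel Tower was completed in 1889 and stands 330 meters tall."
faithful = "The Eiffel Tower, completed in 1889, has a height of 330 meters."
unfaithful = "The Eiffel Tower was built in 1887 and is the tallest structure in Paris."

print(faithfulness_score(faithful, context))
print(faithfulness_score(unfaithful, context))
```

On the two examples above, the first answer shares six of its seven content words with the context and passes, while the second shares only two of seven and fails.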
### Context Relevance
#### What is Context Relevance?
Context relevance evaluates whether the retrieved documents actually contain information that's useful for answering the user's question. It's possible to retrieve documents that are topically related but don't actually help answer the specific query.
#### Why it matters:
Poor context relevance leads to two problems. First, irrelevant context can confuse the LLM and lead to poor or off-topic answers. Second, you're wasting tokens and processing time on information that doesn't help. In production systems with cost constraints, this inefficiency adds up quickly.
#### How to evaluate context relevance:
The key question is: "Does this retrieved chunk contain information needed to answer the query?" You can measure this through:

- Relevance scoring: Rate each retrieved chunk on a scale (like 1-5) for how relevant it is to the query
- Binary classification: Simple yes/no on whether the chunk is relevant
- LLM evaluation: Have an LLM assess whether the context could help answer the question
- Precision metrics: Calculate what percentage of retrieved chunks are actually relevant
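
Once each chunk has a relevance label, the precision metric is a simple ratio. The sketch below assumes the labels already exist (from humans or a judge LLM); the function names and the cutoff of 4 for turning 1-5 scores into yes/no labels are illustrative choices, not a standard.

```python
def context_precision(relevance_labels: list[bool]) -> float:
    """Fraction of retrieved chunks that were judged relevant."""
    if not relevance_labels:
        return 0.0
    return sum(relevance_labels) / len(relevance_labels)


def scores_to_labels(scores: list[int], min_relevant: int = 4) -> list[bool]:
    """Convert 1-5 relevance scores into binary labels using a cutoff."""
    return [score >= min_relevant for score in scores]


# Four retrieved chunks, scored 1-5 by a reviewer or judge model
chunk_scores = [5, 2, 4, 1]
labels = scores_to_labels(chunk_scores)
print(labels)                     # [True, False, True, False]
print(context_precision(labels))  # 0.5
```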

#### Improving context relevance:
If you find low context relevance scores, you might need to improve your retrieval strategy by refining your embedding model, adjusting chunk sizes, using hybrid search (combining semantic and keyword search), implementing query expansion or rewriting, or adjusting the number of retrieved chunks.
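
As one illustration of hybrid search, the ranked results of a semantic retriever and a keyword retriever can be merged with reciprocal rank fusion (RRF). RRF is only one of several ways to combine rankings and is not prescribed by this section; the document IDs below are made up, and the constant `k=60` is the value commonly used for RRF rather than something tuned here.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    and documents are returned by descending total score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken alphabetically so the output is deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


semantic_hits = ["d1", "d2", "d3"]  # hypothetical embedding-search results
keyword_hits = ["d3", "d1", "d4"]   # hypothetical keyword-search results

print(reciprocal_rank_fusion([semantic_hits, keyword_hits]))
# ['d1', 'd3', 'd2', 'd4']
```

Documents that rank well in both lists (`d1`, `d3`) rise to the top, which is the behavior hybrid search is after.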

##### Example scenario:
Query: "What are the side effects of ibuprofen?"

High relevance context: "Common side effects of ibuprofen include nausea, stomach pain, and dizziness."

Low relevance context: "Ibuprofen is available over-the-counter at most pharmacies and was first developed in the 1960s."
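
The LLM-evaluation approach can be sketched as a function that builds a grading prompt and parses a 1-5 score from the reply. The prompt wording and the `score_relevance` helper are assumptions for illustration. The judge is passed in as a plain callable so the example runs without an API key; the stub used here only checks for a keyword and stands in for a real model.

```python
import re
from typing import Callable

PROMPT_TEMPLATE = (
    "Rate from 1 (irrelevant) to 5 (directly answers) how useful the context is "
    "for answering the question. Reply with a single digit.\n\n"
    "Question: {query}\nContext: {chunk}\nScore:"
)


def score_relevance(query: str, chunk: str, judge: Callable[[str], str]) -> int:
    """Ask the judge for a 1-5 relevance score and parse the first valid digit."""
    reply = judge(PROMPT_TEMPLATE.format(query=query, chunk=chunk))
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"Judge reply contains no score from 1 to 5: {reply!r}")
    return int(match.group())


def stub_judge(prompt: str) -> str:
    """Toy judge: looks for 'side effects' in the context part of the prompt."""
    context_part = prompt.split("Context:")[1].lower()
    return "5" if "side effects" in context_part else "2"


query = "What are the side effects of ibuprofen?"
high = "Common side effects of ibuprofen include nausea, stomach pain, and dizziness."
low = "Ibuprofen is available over-the-counter at most pharmacies and was first developed in the 1960s."

print(score_relevance(query, high, stub_judge))  # 5
print(score_relevance(query, low, stub_judge))   # 2
```

With a LangChain chat model, the judge could be something like `lambda prompt: judge_llm.invoke(prompt).content`, where `judge_llm` is a chat model instance you have already created.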

### Using LangSmith to Evaluate
#### What is LangSmith?
LangSmith is a platform by LangChain for developing, monitoring, and evaluating LLM applications. It provides tools specifically designed for testing and improving RAG systems.
##### Key features for RAG evaluation:
1. Tracing and debugging: LangSmith automatically captures the full execution trace of your RAG pipeline, showing you the query, retrieved documents, prompts sent to the LLM, and final responses. This visibility is invaluable for understanding what's happening inside your system.
2. Dataset creation: You can create test datasets with example queries and expected answers (or just queries if you want to evaluate without ground truth). These datasets can be version-controlled and shared across your team.
3. Evaluation runs: LangSmith lets you run your RAG system against your test dataset and apply evaluators automatically. You can compare different versions of your system side-by-side to see which performs better.
4. Built-in evaluators: LangSmith provides pre-built evaluators for common metrics including faithfulness, relevance, answer correctness, and more. You can also create custom evaluators using LLMs or code.
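
A minimal sketch of the tracing feature, using the `traceable` decorator from the `langsmith` package. The pipeline itself is a toy: `retrieve`, `rag_pipeline`, and the keyword lookup are hypothetical stand-ins for a real retriever and LLM call. Traces are only sent when LangSmith tracing is enabled through environment variables and an API key is configured; otherwise the functions run normally. The `except ImportError` fallback exists only so the example runs without the package installed.

```python
try:
    from langsmith import traceable
except ImportError:
    def traceable(*args, **kwargs):
        """No-op stand-in so the sketch runs without langsmith installed."""
        if len(args) == 1 and callable(args[0]) and not kwargs:
            return args[0]
        return lambda fn: fn

# Toy corpus keyed by topic keyword (illustrative only)
CORPUS = {
    "gdpr": "GDPR penalties can be up to €20 million or 4% of global annual turnover.",
    "soc 2": "SOC 2 is a compliance framework focused on security controls.",
}


@traceable(run_type="retriever", name="retrieve")
def retrieve(query: str) -> list[str]:
    """Return every corpus entry whose topic keyword appears in the query."""
    return [text for topic, text in CORPUS.items() if topic in query.lower()]


@traceable(run_type="chain", name="rag_pipeline")
def rag_pipeline(query: str) -> dict:
    """Retrieve context, then 'generate' by echoing the top chunk (stand-in for an LLM)."""
    context = retrieve(query)
    answer = context[0] if context else "I don't know."
    return {"answer": answer, "context": context}


result = rag_pipeline("What are GDPR penalties?")
print(result["answer"])
```

Because `retrieve` is called inside `rag_pipeline`, it shows up as a nested step in the trace, which is what makes it possible to inspect retrieved documents separately from the final answer.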
#### How to use LangSmith for RAG evaluation:
1. Instrument your code: wrap your RAG application with LangSmith tracing. This usually involves adding a simple decorator or context manager to your code.
2. Create evaluation datasets: collect representative user queries. Include edge cases and challenging questions, not just easy examples.
3. Define your evaluators: choose from built-in evaluators or create custom ones. For RAG, you typically want to evaluate both retrieval quality and generation quality.
4. Run evaluations: execute your RAG system against your dataset while LangSmith collects metrics, then analyze the results in the LangSmith dashboard to identify patterns in failures and compare different system versions.
5. Iterate: make improvements to your system based on insights, then re-run evaluations to measure progress.
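
A custom code evaluator can be a plain Python function. The sketch below checks one narrow aspect of faithfulness: whether every number in the answer also appears in the retrieved context. The `(inputs, outputs)` signature and the returned `key`/`score`/`comment` dictionary follow my understanding of what recent versions of the LangSmith SDK accept for custom evaluators, so check the SDK documentation before relying on it. The field names `question`, `answer`, and `context`, and the dataset name in the commented-out wiring, are assumptions.

```python
import re

NUMBER_PATTERN = r"\d+(?:\.\d+)?"


def grounded_numbers(inputs: dict, outputs: dict) -> dict:
    """Score 1.0 if every number in the answer also occurs in the retrieved context."""
    answer_numbers = set(re.findall(NUMBER_PATTERN, outputs["answer"]))
    context_numbers = set(re.findall(NUMBER_PATTERN, " ".join(outputs["context"])))
    unsupported = sorted(answer_numbers - context_numbers)
    return {
        "key": "grounded_numbers",
        "score": 0.0 if unsupported else 1.0,
        "comment": f"Question: {inputs['question']!r}; unsupported numbers: {unsupported}",
    }


context = ["GDPR penalties can be up to €20 million or 4% of global annual turnover."]
question = {"question": "What are GDPR penalties?"}

good = grounded_numbers(question, {"answer": "Up to €20 million or 4% of turnover.", "context": context})
bad = grounded_numbers(question, {"answer": "Up to €10 million.", "context": context})
print(good["score"], bad["score"])

# Assumed wiring (needs a LangSmith API key and an existing dataset, so it is left commented out):
# from langsmith import Client
# Client().evaluate(my_rag_target, data="my-rag-dataset", evaluators=[grounded_numbers])
```

A check like this is cheap and deterministic, so it pairs well with an LLM-based faithfulness evaluator rather than replacing it.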

In [23]:
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

In [25]:
# Sample documents for RAG evaluation
docs = [
    Document(
        page_content="GDPR penalties can be up to €20 million or 4% of global annual turnover.",
        metadata={"topic": "gdpr"}
    ),
    Document(
        page_content="GDPR defines rights such as access and erasure for individuals.",
        metadata={"topic": "gdpr"}
    ),
    Document(
        page_content="SOC 2 is a compliance framework focused on security controls.",
        metadata={"topic": "soc2"}
    ),
]


In [27]:
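# Load OPENAI_API_KEY (and any LangSmith settings) from a local .env file
load_dotenv(override=True)

# Create embeddings + vector DB
embeddings = OpenAIEmbeddings()
vs = Chroma.from_documents(docs, embedding=embeddings)

# Expose the vector store as a retriever.
# k=4 exceeds the 3 indexed documents, so Chroma falls back to 3 results (see the warning below).
base_retriever = vs.as_retriever(search_kwargs={"k": 4})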


In [29]:
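# temperature=0 reduces run-to-run variation in both answers and judge scores
answer_llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)
judge_llm = ChatOpenAI(model="gpt-4.1-mini", temperature=0)

# eval_set and evaluate_rag_dataset must be defined before this cell;
# their definitions are not shown in this section.
df = evaluate_rag_dataset(
    eval_set=eval_set,
    retriever=base_retriever,
    answer_llm=answer_llm,
    judge_llm=judge_llm,
    k=4
)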

df


Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3
Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


| | question | ground_truth | answer | exact_match | token_f1 | answer_relevance | faithfulness | context_recall | context_precision | used_context_ids | judge_reason_relevance | judge_reason_faithfulness |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | What are GDPR penalties? | GDPR penalties can be up to €20 million or 4% ... | GDPR penalties can be up to €20 million or 4% ... | 0 | 0.903226 | 5 | 5 | 5 | 4 | [0] | The answer directly addresses the question by ... | The answer is fully supported by the retrieved... |
| 1 | What rights does GDPR provide to individuals? | GDPR provides rights such as access and erasur... | GDPR provides rights such as access and erasur... | 0 | 0.8 | 5 | 5 | 5 | 3 | [0] | The answer directly addresses the question by ... | All claims in the answer are supported by the ... |
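
The `exact_match` and `token_f1` columns compare the generated answer with the ground truth at the string level. The code that produced the table is not shown, so the sketch below is only an assumption about how such columns are commonly computed (SQuAD-style normalization followed by token overlap); it is not guaranteed to reproduce the numbers above.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip ASCII punctuation and English articles, then split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction: str, ground_truth: str) -> int:
    """1 if the normalized token sequences are identical, else 0."""
    return int(normalize(prediction) == normalize(ground_truth))


def token_f1(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred, gold = normalize(prediction), normalize(ground_truth)
    if not pred or not gold:
        return float(pred == gold)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


prediction = "GDPR provides rights such as access and erasure"
ground_truth = "GDPR provides rights such as access and erasure for individuals"
print(exact_match(prediction, ground_truth))
print(round(token_f1(prediction, ground_truth), 3))
```

This also explains why an answer can score 0 on `exact_match` while scoring high on `token_f1` and on the judge-based columns: a small wording difference breaks the exact comparison but leaves most tokens shared.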
