# RAG Evaluation

This hands-on tutorial walks participants through building an automated evaluation pipeline for RAG applications. Using real examples, we’ll define key evaluation criteria and implement simple methods to assess LLM output quality—focusing on completeness, relevance, and hallucinations. Presented at DataNights Course.

**This tutorial will cover:**

1. How to choose key evaluation criteria for your use case

2. Selecting the right data and KPIs for metric evaluation

3. Building an LLM-as-a-judge metric for a chosen criterion

4. Using open-source metrics like RAGAS

5. Aggregating metrics into an end-to-end evaluation pipeline


In this tutorial we will use a modified version of the [RAG-12000 dataset](https://huggingface.co/datasets/neural-bridge/rag-dataset-12000).

## Utils

In [1]:
import asyncio
import os
import random
from typing import List, Optional, Type, Union

import dotenv
import pandas as pd
from openai import AsyncAzureOpenAI
from pydantic import BaseModel, Field
dotenv.load_dotenv('/Users/nadav/Desktop/GitRepos/llm/.env')

# Configuration – use environment variables or directly set values
AZURE_OPENAI_ENDPOINT = os.getenv("AZURE_OPENAI_ENDPOINT") or "https://your-resource-name.openai.azure.com/"
AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY") or "your-azure-api-key"
AZURE_DEPLOYMENT_NAME = os.getenv("AZURE_DEPLOYMENT_NAME") or "gpt-4o-mini"
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION") or "2023-05-15"

# Create an Azure OpenAI client
client = AsyncAzureOpenAI(
    api_key=AZURE_OPENAI_API_KEY,
    api_version=AZURE_OPENAI_API_VERSION,
    azure_endpoint=AZURE_OPENAI_ENDPOINT,
)


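async def run_llm_call(system_prompt: str, user_prompt: str,
                       response_model: Optional[Type[BaseModel]] = None,
                       model: str = AZURE_DEPLOYMENT_NAME) -> Union[str, BaseModel, None]:
    """Call the deployment once, retrying on errors.

    Returns the parsed `response_model` instance when one is given, the stripped
    text otherwise, or None if every attempt fails.
    """
    max_retries = 5
    base_delay = 30  # seconds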

    for attempt in range(1, max_retries + 1):
        try:
            if response_model:
                response = await client.beta.chat.completions.parse(
                    model=model,
                    messages=[
                        {"role": "system", "content": system_prompt.strip()},
                        {"role": "user", "content": user_prompt.strip()},
                    ],
                    temperature=0,
                    response_format=response_model
                )
                return response.choices[0].message.parsed
            else:
                response = await client.chat.completions.create(
                    model=model,
                    messages=[
                        {"role": "system", "content": system_prompt.strip()},
                        {"role": "user", "content": user_prompt.strip()},
                    ],
                    temperature=0,
                )
                return response.choices[0].message.content.strip()
        except Exception as e:
            print(f"Attempt {attempt} - Azure OpenAI Error: {e}")
            if attempt == max_retries:
                return None
            # Exponential backoff (60s, 120s, 240s, 480s) plus up to 1s of random jitter
            sleep_time = base_delay * (2 ** (attempt)) + random.uniform(0, 1)
            await asyncio.sleep(sleep_time)

    

async def run_llm_calls(system_prompts: List[str], user_prompts: List[str], 
                        response_model: Optional[Type[BaseModel]] = None,
                        model: str = AZURE_DEPLOYMENT_NAME) -> List[Union[str, BaseModel, None]]:
    tasks = [
        run_llm_call(system_prompt, user_prompt, model=model, 
                     response_model=response_model)
        for system_prompt, user_prompt in zip(system_prompts, user_prompts)
    ]
    responses = await asyncio.gather(*tasks)
    return responses
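
To get a feel for how long `run_llm_call` can block before giving up, the retry delays can be tabulated on their own. The sketch below re-implements only the sleep arithmetic from the function above; `backoff_delays` and its `jitter` parameter are illustrative names, not part of the notebook.

```python
import random


def backoff_delays(max_retries: int = 5, base_delay: float = 30.0, jitter: float = 1.0):
    """Return the sleep time (in seconds) applied after each failed attempt.

    Mirrors run_llm_call: no sleep follows the final attempt, because the
    function returns None at that point instead of retrying again.
    """
    delays = []
    for attempt in range(1, max_retries):
        delays.append(base_delay * (2 ** attempt) + random.uniform(0, jitter))
    return delays


# Jitter is disabled here so the schedule is deterministic
delays = backoff_delays(jitter=0.0)
print(delays)            # [60.0, 120.0, 240.0, 480.0]
print(sum(delays) / 60)  # 15.0 minutes of waiting in the worst case
```

With these defaults a single call that keeps failing waits roughly 15 minutes before returning `None`, which may be worth lowering when iterating interactively.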



# Step 1 - Choosing Evaluation Criteria

Define what “good” means for your use case. We'll identify the key quality dimensions, such as relevance, accuracy, fluency, and helpfulness, that matter for your product goals, using feedback from human annotators as the starting point.


In [2]:
import pandas as pd

df = pd.read_csv('https://figshare.com/ndownloader/files/53919875')
df.head(2)

| Unnamed: 0 | input | information_retrieved | output | annotation | annotation_reasoning |
|---|---|---|---|---|---|
| 0 | When was the original release of Johnny Turbo’... | Johnny Turbo’s Arcade: Express Raider Date And... | Johnny Turbo’s Arcade: Express Raider was orig... | good | |
| 1 | Who is the CEO of Franklin Templeton Investments? | Gregory Johnson\nCEO\nUpdated On : Sep 28, 201... | The CEO of Franklin Templeton Investments is G... | good | |


In [3]:
# We start by manually reviewing some of the annotators' reasons
# to get a sense of the use case and its potential issues

df[df['annotation'] == 'bad']['annotation_reasoning'].sample(5).values.tolist()

["While the response correctly states the main statement of the Equal Rights Amendment, it includes a lot of additional information that is not directly relevant to the customer's question. This verbosity can overwhelm the customer and distract from the clear, concise answer they are seeking. Including details about the history, ratification process, and opposition arguments, although related to the ERA, goes beyond the scope of the question and may confuse or frustrate customers looking for a straightforward explanation. As a customer support manager, I would advise keeping responses focused and succinct to maintain clarity and customer satisfaction.",
 "While the response correctly defines the theory of Panspermia, it omits the important detail of who first proposed it, which is essential to fully answer the customer's question. This missing information could lead to customer dissatisfaction as their query about the originator of the theory remains unanswered. It's important to provi

In [4]:
# We will use an llm to analyze the feedback and summarize the common problems
# in the agent responses.

sys_message_analyze_reasons = """
You are an amazing data analyst analyzing user feedback provided for the responses of a question-answering agent.
Your task is to analyze the user feedback and summarize the common problems in the agent responses.
Return up to 5 common problems.
""".strip()

only_bad = df[df['annotation'] == 'bad'].copy()
responses = await run_llm_calls(
    system_prompts=[sys_message_analyze_reasons],
    user_prompts=['\n'.join(only_bad['annotation_reasoning'].tolist())],
)

print(responses[0])

Based on the user feedback provided, here are five common problems identified in the agent responses:

1. **Introduction of Unsupported Details**: Many responses include additional information or suggestions that are not supported by the original context. This can mislead customers and create confusion, as seen in examples where recommendations or details about events, practices, or features were added without basis.

2. **Verbosity and Lack of Conciseness**: Several responses are overly verbose, including excessive details that distract from the main question. This can overwhelm customers and obscure the key information they are seeking, making it difficult for them to quickly grasp the essential points.

3. **Inaccurate Representation of Context**: Some responses contradict or misinterpret the provided context, leading to inaccuracies in the information conveyed. This includes misrepresenting facts, such as incorrectly stating the roles or opinions of individuals or organizations, wh

Based on the analysis above, the errors can be grouped into four categories:

1. Incomplete – Missing key information or context

2. Not concise – Unnecessarily wordy or repetitive responses

3. Hallucinations – Fabricated or factually incorrect content

4. Contradictions – Statements that conflict with the source or other parts of the response


### Divide and Conquer
Although it's technically possible to evaluate all four criteria in a single LLM-as-a-judge call, RAG evaluation is complex, and both recent studies and the collective experience of many practitioners suggest that breaking the task into smaller, focused components usually delivers better accuracy and insight.

We’ll start by implementing an LLM-as-a-judge method to evaluate *completeness*.
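
As a rough illustration of the divide-and-conquer idea, the sketch below builds one focused judge prompt per criterion instead of a single combined prompt. The `CRITERIA` wording, the template, and `build_judge_prompts` are assumptions made for this example; they are not the prompts used later in the notebook.

```python
# One guiding question per error category identified above (wording is illustrative)
CRITERIA = {
    "completeness": "Is key information needed to answer the question missing?",
    "conciseness": "Is the answer unnecessarily wordy or repetitive?",
    "hallucination": "Does the answer contain content not supported by the context?",
    "contradiction": "Does the answer conflict with the context or with itself?",
}


def build_judge_prompts(criteria: dict) -> dict:
    """Build one focused system prompt per criterion."""
    template = (
        "You are evaluating a question-answering agent on a single criterion: {name}.\n"
        "{question}\n"
        "Ignore every other quality dimension. Give brief reasoning, then a verdict."
    )
    return {name: template.format(name=name, question=q) for name, q in criteria.items()}


judge_prompts = build_judge_prompts(CRITERIA)
for name, prompt in judge_prompts.items():
    print(f"--- {name} ---\n{prompt}\n")
```

Each prompt can then be sent in its own call, so a failure on one criterion is reported separately from the others.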

# Step 2 – Building the Benchmark for Completeness
Why does this deserve its own section? Because the naive approach is tempting—but wrong.

Not every sample labeled as “bad” is bad due to low completeness. If we don’t filter carefully, we risk evaluating against the wrong signals. To build a meaningful benchmark, we need to isolate true negatives—cases that are specifically incomplete, not flawed for other reasons like hallucinations or contradictions.

And since we’re not fans of manual work, we’ll use our BFF ChatGPT to help automate the filtering.

In [5]:
system_prompt_is_complete = """
You are evaluating whether a response from a question-answering agent is incomplete.

Your task: Based on the user feedback, determine if the primary issue with the agent’s answer is a **completeness problem**—i.e., it is missing key information that should have been included. If the issue is due to something else (e.g., hallucination, contradiction, poor phrasing), mark it as not related to completeness.

Respond with a clear yes/no judgment and a brief reasoning.
""".strip()

user_prompt_reasoning_eval = """
Question:
{question}

Wrong Answer:
{output}

Annotation Reasoning:
{reasoning}
""".strip()


class CompletenessRelated(BaseModel):
    reasoning: str
    has_completeness_problem: bool


sys_msgs = [system_prompt_is_complete] * only_bad.shape[0]  # one system prompt per "bad" row
user_msgs = [
    user_prompt_reasoning_eval.format(
        question=row['input'],
        output=row['output'],
        reasoning=row['annotation_reasoning'],
    )
    for _, row in only_bad.iterrows()
]
responses = await run_llm_calls(sys_msgs, user_msgs, response_model=CompletenessRelated)
only_bad['completeness_related'] = [r.has_completeness_problem for r in responses]

only_bad['completeness_related'].value_counts()

completeness_related
False    28
True     12
Name: count, dtype: int64
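
One caveat about the cell above: `run_llm_call` returns `None` once its retries are exhausted, so `r.has_completeness_problem` would raise an `AttributeError` for any failed call. A minimal sketch of one way to guard against that follows; `FakeResult` and `split_failed` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FakeResult:
    """Hypothetical stand-in for a parsed CompletenessRelated object."""
    has_completeness_problem: bool


def split_failed(responses: List[Optional[FakeResult]]) -> Tuple[List[int], List[FakeResult]]:
    """Return the positions of successful calls and the corresponding results."""
    ok_positions = [i for i, r in enumerate(responses) if r is not None]
    return ok_positions, [responses[i] for i in ok_positions]


# The second call is assumed to have failed after all retries
responses = [FakeResult(True), None, FakeResult(False)]
ok_positions, ok_responses = split_failed(responses)
flags = [r.has_completeness_problem for r in ok_responses]

print(ok_positions)  # [0, 2]
print(flags)         # [True, False]
```

In the notebook this would translate to selecting `only_bad.iloc[ok_positions]` before assigning the new column, assuming it is acceptable to drop rows whose call failed.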

In [6]:
# To get a sample that is representative of the data but not overly unbalanced,
# we use 20 examples of the positive class and all 12 completeness-related examples of the negative class.

only_good = df[df['annotation'] == 'good'].copy()
completeness_eval_df = pd.concat([only_good.sample(20, random_state=1), only_bad[only_bad['completeness_related']]])
completeness_eval_df['annotation'] = completeness_eval_df['annotation'].map({'good': 1, 'bad': 0})

print(completeness_eval_df.shape)
completeness_eval_df.head(2)

(32, 6)




| Unnamed: 0 | input | information_retrieved | output | annotation | annotation_reasoning | completeness_related |
|---|---|---|---|---|---|---|
| 2 | Who won the gold medal in the men's 1,500m fin... | +50 points in the past 30 days\nNing is the so... | China's Ning Zhongyan won the gold medal in th... | 1 | | |
| 62 | What are some of the challenges Amy Bloom face... | Steven G. Smith for The Boston Globe\nAuthor A... | Amy Bloom finds getting started on a significa... | 1 | | |


# Step 3 – Completeness Metric

We'll use an LLM to assess whether each output is complete. First, we'll implement a simple baseline approach—then iterate to improve accuracy and reliability.

In [8]:
from enum import Enum
from sklearn.metrics import classification_report

system_prompt_completeness_eval = """
Decide if the answer is missing important information based on the question.

Is the answer complete? Answer complete or incomplete.
""".strip()


user_prompt_answer_eval = """
Question:
{question}

Context:
{context}

Answer:
{conclusions}
""".strip()

class ScoreValue(int, Enum):
    INCOMPLETE = 0
    COMPLETE = 1

class CompletenessScore(BaseModel):
    score: ScoreValue

sys_msgs = [system_prompt_completeness_eval] * completeness_eval_df.shape[0]
user_msgs = [
    user_prompt_answer_eval.format(
        question=row['input'],
        context=row['information_retrieved'],
        conclusions=row['output']
    )
    for _, row in completeness_eval_df.iterrows()
]
responses = await run_llm_calls(sys_msgs, user_msgs, response_model=CompletenessScore)
completeness_eval_df['naive_method'] = [int(x.score.value) for x in responses]

print(completeness_eval_df.naive_method.value_counts())
classification_report(
    completeness_eval_df.annotation,
    completeness_eval_df.naive_method,
    output_dict=True,
    zero_division=0
)

naive_method
1    32
Name: count, dtype: int64


{'0': {'precision': 0.0, 'recall': 0.0, 'f1-score': 0.0, 'support': 12.0},
 '1': {'precision': 0.625,
  'recall': 1.0,
  'f1-score': 0.7692307692307693,
  'support': 20.0},
 'accuracy': 0.625,
 'macro avg': {'precision': 0.3125,
  'recall': 0.5,
  'f1-score': 0.38461538461538464,
  'support': 32.0},
 'weighted avg': {'precision': 0.390625,
  'recall': 0.625,
  'f1-score': 0.4807692307692308,
  'support': 32.0}}

The naive method labels all 32 examples as complete, so it never flags an *incomplete* answer: recall on the negative class is 0, and the 0.625 accuracy simply reflects the share of good examples in the benchmark. The problem is harder than it looks, and a one-line prompt does not solve it. Let’s explore a more structured approach.
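
The numbers in the report above follow directly from the class counts. The sketch below reproduces them in plain Python, assuming only what the output shows: 20 good examples, 12 incomplete ones, and a judge that predicts "complete" for every row. `precision_recall` is an illustrative helper, not a scikit-learn function.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, returning 0.0 when undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    predicted = sum(1 for p in y_pred if p == label)
    actual = sum(1 for t in y_true if t == label)
    precision = tp / predicted if predicted else 0.0  # same convention as zero_division=0
    recall = tp / actual if actual else 0.0
    return precision, recall


y_true = [1] * 20 + [0] * 12  # 20 good answers, 12 incomplete ones
y_pred = [1] * 32             # the naive judge calls everything complete

print(precision_recall(y_true, y_pred, label=1))  # (0.625, 1.0)
print(precision_recall(y_true, y_pred, label=0))  # (0.0, 0.0)
```

A judge that always answers "complete" therefore looks acceptable on accuracy alone, which is why the per-class rows of the report matter here.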

In [9]:
system_prompt_completeness_eval = """
You are an assistant evaluating how complete an answer is, given a question and supporting context.

First, consider what are the key pieces of information a fully complete answer should include based on the question and context.  
Then, check whether the answer contains all of that information.

Provide a short explanation of your reasoning.  
Then assign a score:

0 - Not complete: Key information is missing or major parts of the question are not addressed.  
1 - Partially complete: All parts of the question are touched on, but some are incomplete or only loosely supported by the context.  
2 - Fully complete: The answer thoroughly and directly addresses all aspects of the question, using information clearly supported by the context.
""".strip()

class ScoreValue(int, Enum):
    INCOMPLETE = 0
    PARTIALLY_COMPLETE = 1
    COMPLETE = 2

class CompletenessScore(BaseModel):
    reasoning: str
    score: ScoreValue


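# user_msgs is reused from the naive cell above; only the system prompt and schema change.
# Note: this call passes model="gpt-4.1-mini" instead of the default AZURE_DEPLOYMENT_NAME,
# so if the two differ, the improvement reflects the model as well as the prompt.
sys_msgs = [system_prompt_completeness_eval] * completeness_eval_df.shape[0]
responses = await run_llm_calls(sys_msgs, user_msgs, response_model=CompletenessScore, model="gpt-4.1-mini")
# Only a score of 2 (fully complete) counts as complete; partially complete answers map to 0.
completeness_eval_df['advanced_method'] = [int(x.score > 1) for x in responses]
completeness_eval_df['advanced_method_reasoning'] = [x.reasoning for x in responses]

print(completeness_eval_df.advanced_method.value_counts())
classification_report(
    completeness_eval_df.annotation,
    completeness_eval_df.advanced_method,
    output_dict=True,
    zero_division=0
)

advanced_method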
1    18
0    14
Name: count, dtype: int64


{'0': {'precision': 0.7857142857142857,
  'recall': 0.9166666666666666,
  'f1-score': 0.8461538461538461,
  'support': 12.0},
 '1': {'precision': 0.9444444444444444,
  'recall': 0.85,
  'f1-score': 0.8947368421052632,
  'support': 20.0},
 'accuracy': 0.875,
 'macro avg': {'precision': 0.8650793650793651,
  'recall': 0.8833333333333333,
  'f1-score': 0.8704453441295547,
  'support': 32.0},
 'weighted avg': {'precision': 0.8849206349206349,
  'recall': 0.875,
  'f1-score': 0.8765182186234818,
  'support': 32.0}}

The advanced method performs well on this benchmark: it catches 11 of the 12 incomplete answers (recall 0.92) with an overall accuracy of 0.875. While there’s still some room for tuning, and 32 examples is a small sample, it’s accurate enough for practical use.
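
The cell above turns the three-level score into a binary label with `int(x.score > 1)`, which treats "partially complete" as a failure. The sketch below makes that choice explicit and shows the more lenient alternative; `binarize` and its `min_passing` parameter are illustrative names, not part of the notebook.

```python
from enum import Enum


class ScoreValue(int, Enum):
    INCOMPLETE = 0
    PARTIALLY_COMPLETE = 1
    COMPLETE = 2


def binarize(score: ScoreValue, min_passing: ScoreValue = ScoreValue.COMPLETE) -> int:
    """Return 1 when the score reaches the minimum passing level, else 0."""
    return int(score >= min_passing)


scores = [ScoreValue.INCOMPLETE, ScoreValue.PARTIALLY_COMPLETE, ScoreValue.COMPLETE]
strict = [binarize(s) for s in scores]
lenient = [binarize(s, ScoreValue.PARTIALLY_COMPLETE) for s in scores]

print(strict)   # [0, 0, 1]  same mapping as int(score > 1)
print(lenient)  # [0, 1, 1]  partial answers are accepted
```

The strict mapping favors recall on incomplete answers at the cost of flagging some acceptable ones; which trade-off is right depends on how costly a missed incomplete answer is for the product.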

# Step 4 - Using RAGAS for Hallucination Detection

In [None]:
# First we create an evaluation set, using an approach similar to the one we used for the completeness evaluation

system_prompt_is_factually_correct = """
You are evaluating whether a response from a question-answering agent has a **factual correctness problem**.

Your task: Based on the user feedback, determine if the main issue with the answer is related to **hallucination** (made-up or incorrect information) or **contradiction** (statements that conflict with known facts or the context). If the issue is something else (e.g., incomplete, vague, poorly phrased), mark it as not related to factual correctness.

Respond with a clear yes/no judgment and a brief explanation.
""".strip()


class FactualRelated(BaseModel):
    reasoning: str
    has_factual_correctness_problem: bool


sys_msgs = [system_prompt_is_factually_correct] * only_bad.shape[0]  # one system prompt per "bad" row
user_msgs = [
    user_prompt_reasoning_eval.format(
        question=row['input'],
        output=row['output'],
        reasoning=row['annotation_reasoning'],
    )
    for _, row in only_bad.iterrows()
]
responses = await run_llm_calls(sys_msgs, user_msgs, response_model=FactualRelated)
only_bad['factual_related'] = [r.has_factual_correctness_problem for r in responses]
print(only_bad['factual_related'].value_counts())

factual_related
True     22
False    18
Name: count, dtype: int64


In [11]:
factual_eval_df = pd.concat([only_good.sample(sum(only_bad['factual_related']), random_state=1), 
                             only_bad[only_bad['factual_related']]])
factual_eval_df['annotation'] = factual_eval_df['annotation'].map({'good': 1, 'bad': 0})



In [13]:
# Test RAGAS

import os
from ragas.metrics import faithfulness
from ragas.dataset_schema import SingleTurnSample
from langchain.chat_models import ChatOpenAI
from ragas import evaluate, EvaluationDataset

langchain_client = ChatOpenAI(
    model_name="gpt-4.1",
    openai_api_key=os.environ["OPENAI_API_KEY"],
    temperature=0
)

samples = [SingleTurnSample(
    user_input=row['input'],
    response=row['output'],
    retrieved_contexts=[row['information_retrieved']],
) for _, row in factual_eval_df.iterrows()]

results = evaluate(EvaluationDataset(samples=samples), metrics=[faithfulness], llm=langchain_client)
factual_eval_df['ragas_faithfulness'] = [r['faithfulness'] for r in results.scores]
factual_eval_df['ragas_faithfulness_binary'] = [1 if r['faithfulness'] > 0.5 else 0 for r in results.scores]
print(factual_eval_df.ragas_faithfulness_binary.value_counts())

classification_report(
    factual_eval_df.annotation,
    factual_eval_df.ragas_faithfulness_binary,
    output_dict=True,
)


ragas_faithfulness_binary
1    36
0     8
Name: count, dtype: int64


{'0': {'precision': 0.875,
  'recall': 0.3181818181818182,
  'f1-score': 0.4666666666666667,
  'support': 22.0},
 '1': {'precision': 0.5833333333333334,
  'recall': 0.9545454545454546,
  'f1-score': 0.7241379310344828,
  'support': 22.0},
 'accuracy': 0.6363636363636364,
 'macro avg': {'precision': 0.7291666666666667,
  'recall': 0.6363636363636364,
  'f1-score': 0.5954022988505747,
  'support': 44.0},
 'weighted avg': {'precision': 0.7291666666666667,
  'recall': 0.6363636363636364,
  'f1-score': 0.5954022988505747,
  'support': 44.0}}

Unfortunately, off-the-shelf open-source metrics can fall short on complex tasks. With a 0.5 threshold, RAGAS faithfulness flags only 7 of the 22 factually problematic answers (recall 0.32 on the bad class), even though it is rarely wrong when it does flag one (precision 0.875). Such tools may perform well in general settings, but this does not meet the specific needs of our use case. Closing the gap will require targeted research to develop a factual correctness tool tailored to our domain. One example of a non-LLM-based approach can be found [here](https://arxiv.org/abs/2504.15771).
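
The 0.5 cut-off applied to the faithfulness score is one choice among many, and a quick sweep shows how the trade-off moves with it. The sketch below uses hypothetical scores and labels that are not taken from the notebook's run; `recall_on_bad` and `precision_on_bad` are illustrative helpers.

```python
def recall_on_bad(scores, labels, threshold):
    """Share of bad answers (label 0) whose score is at or below the threshold."""
    bad_scores = [s for s, y in zip(scores, labels) if y == 0]
    flagged = sum(1 for s in bad_scores if s <= threshold)
    return flagged / len(bad_scores)


def precision_on_bad(scores, labels, threshold):
    """Share of flagged answers (score at or below the threshold) that are truly bad."""
    flagged = [y for s, y in zip(scores, labels) if s <= threshold]
    return sum(1 for y in flagged if y == 0) / len(flagged) if flagged else 0.0


# Hypothetical faithfulness scores and human labels (0 = bad, 1 = good)
scores = [0.2, 0.6, 0.7, 0.9, 0.8, 0.95, 1.0, 1.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

for threshold in (0.5, 0.7, 0.9):
    print(
        threshold,
        recall_on_bad(scores, labels, threshold),
        precision_on_bad(scores, labels, threshold),
    )
```

On real data the same loop would run over `factual_eval_df['ragas_faithfulness']` and `factual_eval_df['annotation']`; if no threshold gives an acceptable balance, that supports the case for a domain-specific metric.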

# Step 5 - Aggregating into an End-to-End Evaluation Pipeline

At this stage, we combine all evaluation steps into a full end-to-end pipeline. Each sample (input, context, and LLM output) is judged on whether the output is good enough, from a domain expert’s perspective, to be sent to a client. With a representative test set, this pipeline helps measure the real-world performance of the application and track its actual quality over time.

When aggregating evaluation results, simply averaging scores across criteria can be misleading—some dimensions may mask critical weaknesses in others. Instead of collapsing everything into a single number, it’s often better to treat each criterion independently. The approach we will take here is to define pass/fail thresholds per criterion and only consider an output successful if it meets all of them. Another option is to use logical rules (e.g., must pass factual accuracy and clarity, but fluency can be slightly relaxed) or weighted thresholds tailored to the use case. The key is to reflect real-world standards—especially when outputs are client-facing and failure in one area can undermine the whole result.
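
To make the difference between averaging and per-criterion thresholds concrete, here is a minimal sketch. The scores, the `THRESHOLDS` values, and the helper names are all assumptions chosen for illustration.

```python
# Hypothetical pass marks, one per criterion, on a 0-1 scale
THRESHOLDS = {"factuality": 0.5, "completeness": 0.5}


def passes_all(scores: dict, thresholds: dict) -> bool:
    """An output is successful only if every criterion meets its threshold."""
    return all(scores[name] >= minimum for name, minimum in thresholds.items())


def failed_criteria(scores: dict, thresholds: dict) -> list:
    """List the criteria that fall below their threshold."""
    return [name for name, minimum in thresholds.items() if scores[name] < minimum]


# A fully complete answer that is mostly unsupported by the context
sample = {"factuality": 0.25, "completeness": 1.0}

average = sum(sample.values()) / len(sample)
print(average)                              # 0.625, looks acceptable on its own
print(passes_all(sample, THRESHOLDS))       # False
print(failed_criteria(sample, THRESHOLDS))  # ['factuality']
```

The averaged score hides the factuality failure, while the per-criterion check both rejects the output and names the reason.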

In [None]:
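# In this example we only take completeness and factual correctness into account.
# We use the advanced method for completeness and RAGAS faithfulness for factual correctness.
# evaluate_factuality_via_ragas and evaluate_completeness_via_llm_as_judge are placeholders
# for per-row wrappers around the Step 3 and Step 4 code; they are not defined in this notebook.

FACTUALITY_THRESHOLD = 0.5    # assumption: close to the faithfulness cut-off used in Step 4
COMPLETENESS_THRESHOLD = 2    # assumption: only "fully complete" (score 2) passes, as in Step 3


def get_final_annotation(row):
    factual_score = evaluate_factuality_via_ragas(row)
    completeness_score = evaluate_completeness_via_llm_as_judge(row)

    if factual_score < FACTUALITY_THRESHOLD:
        return "bad", "factuality problem"
    elif completeness_score < COMPLETENESS_THRESHOLD:
        return "bad", "completeness problem"
    return "good", ""


# get_final_annotation returns (label, reason), so expand the tuple into two columns
df[['automated_annotation_pipeline', 'automated_annotation_reason']] = df.apply(
    get_final_annotation, axis=1, result_type='expand'
)
classification_report(
    df.annotation,
    df.automated_annotation_pipeline,
    output_dict=True,
)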

# Final Notes

You’ve now got a working evaluation pipeline for RAG that combines open-source tools with custom LLM-based judgment. The goal here isn’t perfection—it’s iteration. Use this framework as a foundation. Tweak the metrics, expand the dataset, adjust for your domain. The key takeaway is that RAG systems need feedback loops. Without evaluation, you're guessing.

While this notebook focused on RAG, the same concepts apply to any LLM-based application—summarization, question answering, agentic workflows, and beyond. Evaluation isn't just a final step; it’s part of the development cycle. Build it in early, keep it lightweight, and adapt as your system evolves.

## About Me

[Nadav Barak](https://www.linkedin.com/in/nadavbarak/) is Head of AI at [Deepchecks](https://www.deepchecks.com/), a startup building tools to evaluate and monitor Generative AI systems. His work focuses on making LLM-based applications more reliable, measurable, and production-ready, bridging the gap between cutting-edge research and real-world use.
