# RAG pipeline evaluation using DeepEval

[DeepEval](https://www.confident-ai.com/) is a framework to evaluate [Retrieval Augmented Generation](https://www.deepset.ai/blog/llms-retrieval-augmentation) (RAG) pipelines.
It supports metrics such as contextual relevancy, answer relevancy, faithfulness, and more.

For more information about evaluators, supported metrics and usage, check out:

* [DeepEvalEvaluator](https://docs.haystack.deepset.ai/docs/deepevalevaluator)
* [Model based evaluation](https://docs.haystack.deepset.ai/docs/model-based-evaluation)

This notebook shows how to use [DeepEval-Haystack](https://haystack.deepset.ai/integrations/deepeval) integration to evaluate a RAG pipeline against various metrics.

## Prerequisites:

- [OpenAI](https://openai.com/) key
    - **DeepEval** uses OpenAI models to compute some metrics, so we need an OpenAI key.

In [1]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass("Enter OpenAI API key:")
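Re-running the cell above prompts for the key every time. A minimal sketch of a guard that only prompts when the variable is missing follows. The helper name `ensure_env_var` and its `prompt_fn` parameter are hypothetical, introduced here only for illustration.

```python
import os
from getpass import getpass


def ensure_env_var(name, prompt_fn=getpass):
    """Return the value of `name`, prompting only if it is not already set."""
    if not os.environ.get(name):
        os.environ[name] = prompt_fn(f"Enter {name}: ")
    return os.environ[name]


# Usage in the notebook (prompts only on the first run):
# ensure_env_var("OPENAI_API_KEY")
```

Passing `prompt_fn` explicitly makes the helper easy to exercise without an interactive prompt.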

## Install dependencies

In [2]:
!pip install pydantic
!pip install haystack-ai
!pip install datasets
!pip install deepeval-haystack
!pip install --upgrade deepeval


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Collecting deepeval==0.20.57 (from deepeval-haystack)
  Using cached deepeval-0.20.57-py3-none-any.whl.metadata (817 bytes)
Using cached deepeval-0.20.57-py3-none-any.whl (97 kB)
Installing collected packages: deepeval
  Attempting uninstall: deep

## Create a RAG pipeline

We'll first need to create a RAG pipeline. For a detailed walkthrough of building RAG pipelines, see [this tutorial](https://haystack.deepset.ai/tutorials/27_first_rag_pipeline).

In this notebook, we use the [SQuAD v2](https://huggingface.co/datasets/rajpurkar/squad_v2) dataset to get the contexts, questions, and ground truth answers.





**Initialize the document store**



In [3]:
from datasets import load_dataset
from haystack import Document
from haystack.document_stores.in_memory import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

dataset = load_dataset("rajpurkar/squad_v2", split="validation")
documents = list(set(dataset["context"]))
docs = [Document(content=doc) for doc in documents]
document_store.write_documents(docs)



1204
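`set` removes duplicate contexts but does not preserve their order, so the order in which documents are written can differ between runs. If a reproducible order matters, one option is `dict.fromkeys`, which keeps the first occurrence of each item. A small sketch, with made-up strings standing in for `dataset["context"]` and a hypothetical helper name:

```python
def dedupe_keep_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))


# Hypothetical sample standing in for dataset["context"]
sample_contexts = ["ctx A", "ctx B", "ctx A", "ctx C", "ctx B"]
unique_contexts = dedupe_keep_order(sample_contexts)
print(unique_contexts)  # ['ctx A', 'ctx B', 'ctx C']
```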

In [4]:
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers.in_memory import InMemoryBM25Retriever

retriever = InMemoryBM25Retriever(document_store, top_k=3)

template = """
Given the following information, answer the question.

Context:
{% for document in documents %}
    {{ document.content }}
{% endfor %}

Question: {{question}}
Answer:
"""

prompt_builder = PromptBuilder(template=template)
generator = OpenAIGenerator(model="gpt-4o-mini")

**Build the RAG pipeline**

In [5]:
from haystack import Pipeline
from haystack.components.builders.answer_builder import AnswerBuilder

rag_pipeline = Pipeline()
# Add components to your pipeline
rag_pipeline.add_component("retriever", retriever)
rag_pipeline.add_component("prompt_builder", prompt_builder)
rag_pipeline.add_component("llm", generator)
rag_pipeline.add_component(instance=AnswerBuilder(), name="answer_builder")

# Now, connect the components to each other
rag_pipeline.connect("retriever", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")
rag_pipeline.connect("llm.meta", "answer_builder.meta")
rag_pipeline.connect("retriever", "answer_builder.documents")


<haystack.core.pipeline.pipeline.Pipeline object at 0x12d04f7a0>
🚅 Components
  - retriever: InMemoryBM25Retriever
  - prompt_builder: PromptBuilder
  - llm: OpenAIGenerator
  - answer_builder: AnswerBuilder
🛤️ Connections
  - retriever.documents -> prompt_builder.documents (List[Document])
  - retriever.documents -> answer_builder.documents (List[Document])
  - prompt_builder.prompt -> llm.prompt (str)
  - llm.replies -> answer_builder.replies (List[str])
  - llm.meta -> answer_builder.meta (List[Dict[str, Any]])

**Running the pipeline**

In [6]:
question = "In what country is Normandy located?"

response = rag_pipeline.run(
    {"retriever": {"query": question}, "prompt_builder": {"question": question}, "answer_builder": {"query": question}}
)


In [7]:
print(response["answer_builder"]["answers"][0].data)

Normandy is located in France.


We're done building our RAG pipeline. Let's evaluate it now!

## Get questions, contexts, responses and ground truths for evaluation

For computing most metrics, we will need to provide the following to the evaluator:
1. Questions
2. Generated responses
3. Retrieved contexts
4. Ground truths (in this notebook, needed for the `contextual precision` and `contextual recall` metrics)

We'll start with three questions picked at random from the dataset (see below) and get the matching `contexts` and `responses` for them.

### Helper function to get context and responses for our questions


In [8]:
def get_contexts_and_responses(questions, pipeline):
    contexts = []
    responses = []
    for question in questions:
        response = pipeline.run(
            {
                "retriever": {"query": question},
                "prompt_builder": {"question": question},
                "answer_builder": {"query": question},
            }
        )

        contexts.append([d.content for d in response["answer_builder"]["answers"][0].documents])
        responses.append(response["answer_builder"]["answers"][0].data)
    return contexts, responses


In [9]:
question_map = {
    "Which mountain range influenced the split of the regions?": 0,
    "What is the prize offered for finding a solution to P=NP?": 1,
    "Which Californio is located in the upper part?": 2
}
questions = list(question_map.keys())
contexts, responses = get_contexts_and_responses(questions, rag_pipeline)

### Ground truths, review all fields

Now that we have questions, contexts, and responses, we'll also get the matching ground truth answers.

In [10]:
ground_truths = [""] * len(question_map)

for question, index in question_map.items():
    idx = dataset["question"].index(question)
    ground_truths[index] = dataset["answers"][idx]["text"][0]
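SQuAD v2 also contains unanswerable questions, whose `answers["text"]` list is empty, so indexing with `[0]` would raise an `IndexError` for them. The three questions used here all have answers, but a more defensive lookup could look like the sketch below. The helper name `first_answer` and the sample records are illustrative and not part of the dataset API.

```python
def first_answer(answers, default=""):
    """Return the first reference answer, or `default` if there is none."""
    texts = answers.get("text", [])
    return texts[0] if texts else default


# Hypothetical records shaped like entries of dataset["answers"]
answerable = {"text": ["Tehachapis", "Tehachapis"], "answer_start": [10, 10]}
unanswerable = {"text": [], "answer_start": []}

print(first_answer(answerable))          # Tehachapis
print(repr(first_answer(unanswerable)))  # ''
```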

In [11]:
print("Questions:\n")
print("\n".join(questions))

Questions:

Which mountain range influenced the split of the regions?
What is the prize offered for finding a solution to P=NP?
Which Californio is located in the upper part?


In [12]:
print("Contexts:\n")
# Print only the top-ranked context for each question
for c in contexts:
    print(c[0])

Contexts:

The state is most commonly divided and promoted by its regional tourism groups as consisting of northern, central, and southern California regions. The two AAA Auto Clubs of the state, the California State Automobile Association and the Automobile Club of Southern California, choose to simplify matters by dividing the state along the lines where their jurisdictions for membership apply, as either northern or southern California, in contrast to the three-region point of view. Another influence is the geographical phrase South of the Tehachapis, which would split the southern region off at the crest of that transverse range, but in that definition, the desert portions of north Los Angeles County and eastern Kern and San Bernardino Counties would be included in the southern California region due to their remoteness from the central valley and interior desert landscape.
If a problem X is in C and hard for C, then X is said to be complete for C. This means that X is the hardest p

In [13]:
print("Responses:\n")
print("\n".join(responses))

Responses:

The Tehachapi Mountains influenced the split of the regions, as referenced by the geographical phrase "South of the Tehachapis."
The prize offered for finding a solution to the P versus NP problem is US$1,000,000.
The context provided does not mention any Californios or provide information related to California. "Californio" typically refers to a Hispanic person of Californian descent, particularly during the period when California was part of Mexico. If you are looking for a specific Californio associated with a historical or geographical significance, please provide more context or clarify your question.


In [14]:
print("Ground truths:\n")
print("\n".join(ground_truths))

Ground truths:

Tehachapis
$1,000,000
Monterey
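These lists are meant to line up: entry *i* of each list should describe the same question. A quick sanity check before evaluating can catch mismatches early. This is only a sketch; `check_aligned` is a hypothetical helper, and the values below are shortened stand-ins for the real lists.

```python
def check_aligned(**named_lists):
    """Raise ValueError if the given lists do not all have the same length."""
    lengths = {name: len(values) for name, values in named_lists.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Length mismatch: {lengths}")
    return next(iter(lengths.values()), 0)


# Shortened stand-ins for the lists built above
n = check_aligned(
    questions=["q1", "q2", "q3"],
    contexts=[["c1"], ["c2"], ["c3"]],
    responses=["r1", "r2", "r3"],
    ground_truths=["Tehachapis", "$1,000,000", "Monterey"],
)
print(n)  # 3
```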


## Evaluate the RAG pipeline





Now that we have the `questions`, `contexts`, `responses`, and `ground truths`, we can begin evaluating the pipeline and computing the supported metrics.

## Metrics computation

In addition to evaluating the final responses of the LLM, it is important to evaluate the individual components of the RAG pipeline, as they can significantly affect overall performance. There are therefore different metrics for the retriever, the generator, and the overall pipeline. For a full list of available metrics and their expected inputs, check out the [DeepEvalEvaluator docs](https://docs.haystack.deepset.ai/docs/deepevalevaluator).

The [DeepEval documentation](https://docs.confident-ai.com/docs/metrics-introduction) explains each metric with simple examples.

### Contextual Precision

The contextual precision metric measures our RAG pipeline's retriever by evaluating whether items in our contexts that are relevant to the given input are ranked higher than irrelevant ones.
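To make the ranking idea concrete, here is a sketch of the weighted cumulative precision formula as described in the DeepEval documentation: for each relevant node at rank k, take the fraction of relevant nodes within the top k, then average over the relevant nodes. In DeepEval the per-node relevance verdicts come from an LLM; here they are passed in as booleans, and `contextual_precision` is a hypothetical helper name, not the library's API.

```python
def contextual_precision(verdicts):
    """Weighted cumulative precision over ranked relevance verdicts.

    verdicts[k - 1] is True if the node at rank k is relevant.
    """
    relevant_total = sum(verdicts)
    if relevant_total == 0:
        return 0.0
    score = 0.0
    relevant_so_far = 0
    for rank, is_relevant in enumerate(verdicts, start=1):
        if is_relevant:
            relevant_so_far += 1
            score += relevant_so_far / rank
    return score / relevant_total


# One relevant node at rank 2, irrelevant nodes at ranks 1 and 3
print(contextual_precision([False, True, False]))  # 0.5
# The same relevant node retrieved first
print(contextual_precision([True, False, False]))  # 1.0
```

Under this formula, a single relevant node at rank 2 of 3 yields 0.5, and moving it to rank 1 raises the score to 1.0.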

In [15]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_precision_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_PRECISION, metric_params={"model":"gpt-4o-mini"})
context_precision_pipeline.add_component("evaluator", evaluator)


In [16]:
evaluation_results = context_precision_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])


Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 3 test case(s) in parallel: |██████████|100% (3/3) [Time Taken: 00:05,  1.75s/test case]



Metrics Summary

  - ✅ Contextual Precision (score: 0.5, threshold: 0.0, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.50 because the relevant node (rank 2) is correctly placed above the irrelevant nodes (rank 1 and rank 3), which discuss unrelated topics. However, since there are two irrelevant nodes ranked higher than just one relevant node, it affects the overall precision score., error: None)

For test case:

  - input: What is the prize offered for finding a solution to P=NP?
  - actual output: The prize offered for finding a solution to the P versus NP problem is US$1,000,000.
  - expected output: $1,000,000
  - context: None
  - retrieval context: ['If a problem X is in C and hard for C, then X is said to be complete for C. This means that X is the hardest problem in C. (Since many problems could be equally hard, one might say that X is one of the hardest problems in C.) Thus the class of NP-complete problems contains the most difficult problems in NP, i




<class 'tuple'> ('test_results', [TestResult(success=True, metrics_data=[MetricData(name='Contextual Precision', threshold=0.0, success=True, score=0.5, reason='The score is 0.50 because the relevant node (rank 2) is correctly placed above the irrelevant nodes (rank 1 and rank 3), which discuss unrelated topics. However, since there are two irrelevant nodes ranked higher than just one relevant node, it affects the overall precision score.', strict_mode=False, evaluation_model='gpt-4o-mini', error=None, evaluation_cost=0.0003183, verbose_logs='Verdicts:\n[\n    {\n        "verdict": "no",\n        "reason": "The first context discusses NP-completeness and does not mention any prize related to P=NP."\n    },\n    {\n        "verdict": "yes",\n        "reason": "This context states that \'There is a US$1,000,000 prize for resolving the problem,\' which directly answers the input question."\n    },\n    {\n        "verdict": "no",\n        "reason": "The third context talks about intractab

AttributeError: 'tuple' object has no attribute 'metrics'


### Contextual Recall

Contextual recall measures the extent to which the retrieved contexts align with the `ground truth`.
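According to the DeepEval documentation, an LLM decides whether each statement in the expected output can be attributed to the retrieved contexts, and the score is the attributable fraction. The sketch below replaces that LLM judgement with a plain substring check, which is a much cruder test, purely to illustrate the ratio. `naive_contextual_recall` is a hypothetical helper and the contexts are shortened for illustration.

```python
def naive_contextual_recall(expected_statements, contexts):
    """Fraction of expected statements found verbatim in any context."""
    if not expected_statements:
        return 0.0
    joined = " ".join(contexts).lower()
    hits = sum(1 for s in expected_statements if s.lower() in joined)
    return hits / len(expected_statements)


# Shortened, illustrative contexts
print(naive_contextual_recall(["Monterey"], ["In the centre of Basel ..."]))         # 0.0
print(naive_contextual_recall(["Tehachapis"], ["... South of the Tehachapis ..."]))  # 1.0
```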

In [20]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_recall_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RECALL, metric_params={"model":"gpt-4o-mini"})
context_recall_pipeline.add_component("evaluator", evaluator)


In [21]:
evaluation_results = context_recall_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "ground_truths": ground_truths, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])


Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 3 test case(s) in parallel: |██████████|100% (3/3) [Time Taken: 00:06,  2.23s/test case]



Metrics Summary

  - ✅ Contextual Recall (score: 0.0, threshold: 0.0, strict: False, evaluation model: gpt-4o-mini, reason: The score is 0.00 because the term 'Monterey' does not appear in any part of the node(s) in retrieval context, indicating a complete lack of relevant information., error: None)

For test case:

  - input: Which Californio is located in the upper part?
  - actual output: The provided context does not mention anything about Californios or their locations. Therefore, based on the given information, it is not possible to determine which Californio is located in the upper part. If you have more specific information or context about Californios, I would be happy to help!
  - expected output: Monterey
  - context: None
  - retrieval context: ['In the centre of Basel, the first major city in the course of the stream, is located the "Rhine knee"; this is a major bend, where the overall direction of the Rhine changes from West to North. Here the High Rhine ends. Legally, 




AttributeError: 'tuple' object has no attribute 'metrics'

### Contextual Relevancy

The contextual relevancy metric measures the quality of our RAG pipeline's retriever by evaluating the overall relevance of the context for a given question.
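As a rough intuition, the score can be read as the fraction of statements in the retrieved context that an LLM judges relevant to the question. A minimal sketch of that ratio, with the verdicts supplied by hand (the counts are made up for illustration and `relevancy_score` is a hypothetical helper):

```python
def relevancy_score(verdicts):
    """Fraction of context statements judged relevant (True)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Hypothetical verdicts: 1 relevant statement out of 11
verdicts = [True] + [False] * 10
print(round(relevancy_score(verdicts), 4))  # 0.0909
```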

In [None]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

context_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.CONTEXTUAL_RELEVANCE, metric_params={"model":"gpt-4"})
context_relevancy_pipeline.add_component("evaluator", evaluator)


In [None]:
evaluation_results = context_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])



[[{'name': 'contextual_relevance', 'score': 0.09090909090909091, 'explanation': 'The score is 0.09 because the sentences provided do not directly address the influence of a mountain range on the division of regions. They discuss the division of California, temperature data for Victoria, and experiments in structural geology, but do not provide information pertinent to the input question.'}], [{'name': 'contextual_relevance', 'score': 0.5384615384615384, 'explanation': 'The score is 0.54 because the majority of the sentences extracted from the retrieval context, particularly from nodes 2 and 3, focus on explaining the complexities and characteristics of the P=NP problem, rather than directly addressing the specific question about the prize offered for finding a solution to this problem.'}], [{'name': 'contextual_relevance', 'score': 0.0, 'explanation': "The score is 0.00 because none of the sentences in the retrieval context provide any information related to the queried Californio's lo
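The printed result is a list with one entry per question, each holding a list of metric dictionaries with `name`, `score`, and `explanation` keys. A sketch for pulling out the scores and averaging them, using the three scores shown above with the explanations abbreviated to `"..."`:

```python
from statistics import mean

# Same shape as the printed output; explanations abbreviated to "..."
results = [
    [{"name": "contextual_relevance", "score": 0.09090909090909091, "explanation": "..."}],
    [{"name": "contextual_relevance", "score": 0.5384615384615384, "explanation": "..."}],
    [{"name": "contextual_relevance", "score": 0.0, "explanation": "..."}],
]

# Flatten to one score per (question, metric) pair
scores = [metric["score"] for per_question in results for metric in per_question]
print([round(s, 2) for s in scores])  # [0.09, 0.54, 0.0]
print(round(mean(scores), 2))         # 0.21
```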

### Answer Relevancy

The answer relevancy metric measures the quality of our RAG pipeline's response by evaluating how relevant the response is to the provided question.

In [None]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

answer_relevancy_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.ANSWER_RELEVANCY, metric_params={"model":"gpt-4"})
answer_relevancy_pipeline.add_component("evaluator", evaluator)


In [None]:
evaluation_results = answer_relevancy_pipeline.run(
    {"evaluator": {"questions": questions, "responses": responses, "contexts": contexts}}
)
print(evaluation_results["evaluator"]["results"])


[[{'name': 'answer_relevancy', 'score': 0.3333333333333333, 'explanation': "The score is 0.33 because the answer correctly identifies the Tehachapi mountain range as the influence for the split of the regions in California. However, the score is not higher because the majority of the points presented in the answer, including details about AAA Auto Clubs, orogenic wedges, numerical models, and Victoria's temperature records, are irrelevant to the original question."}], [{'name': 'answer_relevancy', 'score': 0.5, 'explanation': 'The score is 0.50 because while the answer did provide the correct information about the prize amount for solving the P=NP problem, it also included unnecessary details about the significance of the P=NP problem itself, which was not asked for in the question.'}], [{'name': 'answer_relevancy', 'score': 0.2, 'explanation': 'The score is 0.20 because while the answer does mention a location in the upper part, it is not related to the original question about a Calif

#### Note

When this notebook was created, version 0.20.57 of [deepeval](https://github.com/confident-ai/deepeval/tree/v0.20.57) required contexts for calculating Answer Relevancy. Future versions will no longer require the context field; specifically, the upcoming release of deepeval-haystack will drop it as a mandatory input.
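Because behaviour depends on the installed versions, it can help to print them before running the evaluators. A sketch using the standard library's `importlib.metadata`; the helper name `installed_version` is hypothetical:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string, or None if not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


for package in ("deepeval", "deepeval-haystack", "haystack-ai"):
    print(package, installed_version(package))
```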

### Faithfulness

The faithfulness metric measures the quality of our RAG pipeline's responses by evaluating whether each response factually aligns with the contents of the contexts we provided.

In [None]:
from haystack import Pipeline
from haystack_integrations.components.evaluators.deepeval import DeepEvalEvaluator, DeepEvalMetric

faithfulness_pipeline = Pipeline()
evaluator = DeepEvalEvaluator(metric=DeepEvalMetric.FAITHFULNESS, metric_params={"model": "gpt-4"})
faithfulness_pipeline.add_component("evaluator", evaluator)


In [None]:
evaluation_results = faithfulness_pipeline.run(
    {"evaluator": {"questions": questions, "contexts": contexts, "responses": responses}}
)
print(evaluation_results["evaluator"]["results"])


[[{'name': 'faithfulness', 'score': 0.2631578947368421, 'explanation': "The score is 0.26 because the actual output persistently mentions the Tehachapi mountain range influencing the split of regions in California, which is not addressed in the context of the discussions on orogenic wedges, numerical models, and their role in mountain building in the second node of the retrieval context. Additionally, the output is also unrelated to the third node of the retrieval context, which discusses about Victoria's warmest regions, the Mallee, and upper Wimmera, and their weather patterns."}], [{'name': 'faithfulness', 'score': 1.0, 'explanation': 'The score is 1.00 because the actual output perfectly aligns with all the nodes in the retrieval context, without any contradictions.'}], [{'name': 'faithfulness', 'score': 0.03571428571428571, 'explanation': "The score is 0.04 because the actual output is significantly unfaithful to the retrieval context. It completely ignores the series of discoveri

**Our pipeline evaluation using DeepEval is now complete!**

**Haystack 2.0 Useful Sources**

* [Docs](https://docs.haystack.deepset.ai/docs/intro)
* [Tutorials](https://haystack.deepset.ai/tutorials)
* [Other Cookbooks](https://github.com/deepset-ai/haystack-cookbook)