# Evaluating AI with Haystack

by Bilge Yucel ([X](https://x.com/bilgeycl), [Linkedin](https://www.linkedin.com/in/bilge-yucel/))

In this cookbook, we walk through the [Evaluators](https://docs.haystack.deepset.ai/docs/evaluators) in Haystack, create an evaluation pipeline, streamline the evaluation with [`EvaluationHarness`](https://github.com/deepset-ai/haystack-experimental/tree/main/haystack_experimental/evaluation/harness), and try different evaluation frameworks like [Ragas](https://haystack.deepset.ai/integrations/ragas) and [FlowJudge](https://haystack.deepset.ai/integrations/flow-judge).

📚 **Useful Resources:**
* [Article: Benchmarking Haystack Pipelines for Optimal Performance](https://haystack.deepset.ai/blog/benchmarking-haystack-pipelines)
* [Evaluation Walkthrough](https://haystack.deepset.ai/tutorials/guide_evaluation)
* [haystack-evaluation repo](https://github.com/deepset-ai/haystack-evaluation/tree/main)
* [EvaluationHarness (haystack-experimental)](https://github.com/deepset-ai/haystack-experimental/tree/main/haystack_experimental/evaluation/harness)
* [Evaluation tutorial](https://haystack.deepset.ai/tutorials/35_evaluating_rag_pipelines)
* [Evaluation Docs](https://docs.haystack.deepset.ai/docs/evaluation)

## 📺 Watch Along

<iframe width="560" height="315" src="https://www.youtube.com/embed/Dy-n_yC3Cto?si=LB0GdFP0VO-nJT-n" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" referrerpolicy="strict-origin-when-cross-origin" allowfullscreen></iframe>

In [1]:
!pip install haystack-ai "sentence-transformers>=3.0.0" pypdf

In [None]:
!pip install ragas-haystack "flow-judge[hf]" # evaluation frameworks

## 1. Building your pipeline

### ARAGOG

This dataset is based on the paper [Advanced Retrieval Augmented Generation Output Grading (ARAGOG)](https://arxiv.org/pdf/2404.01037). It is a
collection of papers from arXiv covering topics around Transformers and Large Language Models, all in PDF format.

The dataset contains:
- 13 PDF papers.
- 107 questions and answers generated with the assistance of GPT-4, and validated/corrected by humans.

We have:
- ground-truth answers
- questions

Get the dataset [here](https://github.com/deepset-ai/haystack-evaluation/blob/main/datasets/README.md)

### Indexing Pipeline

In [None]:
import os

from haystack import Pipeline
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.converters import PyPDFToDocument
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.types import DuplicatePolicy

embedding_model = "sentence-transformers/all-MiniLM-L6-v2"
document_store = InMemoryDocumentStore()

files_path = "/content/papers_for_questions" # <ENTER YOUR PATH HERE>
pipeline = Pipeline()
pipeline.add_component("converter", PyPDFToDocument())
pipeline.add_component("cleaner", DocumentCleaner())
pipeline.add_component("splitter", DocumentSplitter(split_length=250, split_by="word"))  # default splitting by word
pipeline.add_component("writer", DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP))
pipeline.add_component("embedder", SentenceTransformersDocumentEmbedder(embedding_model))
pipeline.connect("converter", "cleaner")
pipeline.connect("cleaner", "splitter")
pipeline.connect("splitter", "embedder")
pipeline.connect("embedder", "writer")
pdf_files = [os.path.join(files_path, f_name) for f_name in os.listdir(files_path)]

pipeline.run({"converter": {"sources": pdf_files}})


In [4]:
document_store.count_documents()

690
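To build intuition for what `split_length=250` with `split_by="word"` means, here is a simplified pure-Python sketch of word-based chunking. It is not Haystack's implementation: `split_by_word` is a hypothetical helper, and it ignores options such as overlap that a real splitter may offer.

```python
from typing import List


def split_by_word(text: str, split_length: int = 250) -> List[str]:
    """Split text into consecutive chunks of at most `split_length` words.

    Hypothetical helper for illustration only.
    """
    words = text.split()
    return [
        " ".join(words[i : i + split_length])
        for i in range(0, len(words), split_length)
    ]


# A synthetic 600-word text yields two full chunks and one shorter chunk.
sample = " ".join(f"w{i}" for i in range(600))
chunks = split_by_word(sample, split_length=250)
print([len(c.split()) for c in chunks])  # [250, 250, 100]
```

Each chunk becomes one document in the store, which is why 13 papers turn into several hundred documents.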

### RAG

In [35]:
import os
from getpass import getpass

os.environ["OPENAI_API_KEY"] = getpass('OPENAI_API_KEY: ')

In [36]:
from haystack import Pipeline
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.retrievers import InMemoryEmbeddingRetriever

template = """
    You have to answer the following question based on the given context information only.
    If the context is empty or just a '\\n' answer with None, example: "None".

    Context:
    {% for document in documents %}
        {{ document.content }}
    {% endfor %}

    Question: {{question}}
    Answer:
    """

basic_rag = Pipeline()
basic_rag.add_component("query_embedder", SentenceTransformersTextEmbedder(
    model=embedding_model, progress_bar=False
))
basic_rag.add_component("retriever", InMemoryEmbeddingRetriever(document_store))
basic_rag.add_component("prompt_builder", PromptBuilder(template=template))
basic_rag.add_component("generator", OpenAIGenerator(model="gpt-4o-mini"))

basic_rag.connect("query_embedder", "retriever.query_embedding")
basic_rag.connect("retriever", "prompt_builder.documents")
basic_rag.connect("prompt_builder", "generator")

<haystack.core.pipeline.pipeline.Pipeline object at 0x7a05787d5cf0>
🚅 Components
  - query_embedder: SentenceTransformersTextEmbedder
  - retriever: InMemoryEmbeddingRetriever
  - prompt_builder: PromptBuilder
  - generator: OpenAIGenerator
🛤️ Connections
  - query_embedder.embedding -> retriever.query_embedding (List[float])
  - retriever.documents -> prompt_builder.documents (List[Document])
  - prompt_builder.prompt -> generator.prompt (str)
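Conceptually, the retriever step ranks stored document embeddings against the query embedding and keeps the best matches. The sketch below illustrates that idea with toy three-dimensional vectors and a plain dot product. The function name, the vectors, and the choice of dot product are assumptions for illustration; the similarity function actually used depends on how the document store is configured.

```python
from typing import List, Sequence, Tuple


def top_k_by_dot_product(
    query: Sequence[float],
    docs: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the texts of the k documents with the highest dot-product score."""
    def score(doc: Tuple[str, Sequence[float]]) -> float:
        return sum(q * d for q, d in zip(query, doc[1]))

    ranked = sorted(docs, key=score, reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy embeddings, invented for this example.
toy_docs = [
    ("BERT pre-training", [0.9, 0.1, 0.0]),
    ("LLaMA training data", [0.1, 0.9, 0.0]),
    ("GLUE benchmark", [0.7, 0.2, 0.1]),
]
query_embedding = [1.0, 0.0, 0.0]
print(top_k_by_dot_product(query_embedding, toy_docs, k=2))
# ['BERT pre-training', 'GLUE benchmark']
```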

## 2. Human Evaluation

In [37]:
from typing import List, Tuple
import json

def read_question_answers() -> Tuple[List[str], List[str]]:
    with open("/content/eval_questions.json", "r") as f:
        data = json.load(f)
        questions = data["questions"]
        answers = data["ground_truths"]
    return questions, answers

all_questions, all_answers = read_question_answers()

In [38]:
print(len(all_questions))
print(len(all_answers))

107
107
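The loader above assumes a JSON file with two parallel lists under the keys `questions` and `ground_truths`. A minimal sketch of that layout, with a length check added, might look like the following. The payload is invented placeholder content and `parse_question_answers` is a hypothetical helper, not part of the dataset tooling.

```python
import json
from typing import List, Tuple


def parse_question_answers(raw: str) -> Tuple[List[str], List[str]]:
    """Parse a JSON string holding parallel "questions" and "ground_truths" lists."""
    data = json.loads(raw)
    questions, answers = data["questions"], data["ground_truths"]
    if len(questions) != len(answers):
        raise ValueError("questions and ground_truths must have the same length")
    return questions, answers


# Placeholder payload that only demonstrates the expected layout.
toy = json.dumps({
    "questions": ["Q1?", "Q2?"],
    "ground_truths": ["A1", "A2"],
})
qs, gts = parse_question_answers(toy)
print(list(zip(qs, gts)))  # [('Q1?', 'A1'), ('Q2?', 'A2')]
```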


In [39]:
questions = all_questions[:15]
answers = all_answers[:15]

In [40]:
index = 5
print(questions[index])
print(answers[index])
question = questions[index]

How were the questions for the multitask test sourced, and what was the criteria for their inclusion?
Questions were manually collected by graduate and undergraduate students from freely available online sources, including practice questions for standardized tests and undergraduate courses, ensuring a wide representation of difficulty levels and subjects.


In [41]:
basic_rag.run({"query_embedder":{"text":question}, "prompt_builder":{"question": question}})

{'generator': {'replies': ['The questions for the multitask test were manually collected by graduate and undergraduate students from freely available sources online. These sources included practice questions for tests such as the Graduate Record Examination and the United States Medical Licensing Examination, as well as questions designed for undergraduate courses and those for readers of Oxford University Press books. The criteria for inclusion were based on ensuring that the questions covered a range of subjects and difficulty levels, including specific tasks like "Elementary," "High School," "College," or "Professional," with each subject containing a minimum of 100 test examples.'],
  'meta': [{'model': 'gpt-4o-mini-2024-07-18',
    'index': 0,
    'finish_reason': 'stop',
    'usage': {'completion_tokens': 110,
     'prompt_tokens': 4559,
     'total_tokens': 4669,
     'completion_tokens_details': CompletionTokensDetails(audio_tokens=None, reasoning_tokens=0),
     'prompt_tokens

## 3. Deciding on Metrics

* **Semantic Answer Similarity**: `SASEvaluator` compares the embedding of a generated answer with the embedding of the ground-truth answer, using the same embedding model for both.
* **Context Relevance**: `ContextRelevanceEvaluator` assesses whether the retrieved context is relevant to the question.
* **Faithfulness**: `FaithfulnessEvaluator` evaluates whether the generated answer can be derived from the retrieved context.
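To make the metrics more concrete, here is a rough sketch of the arithmetic behind two of them. It assumes that semantic similarity is measured as cosine similarity between two embedding vectors, and that faithfulness is the fraction of statements in the answer that the context supports. In the real evaluators, an embedding model and an LLM judge produce these inputs; the vectors and booleans below are hand-written, and both function names are hypothetical.

```python
import math
from statistics import mean
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def faithfulness_score(statement_supported: List[bool]) -> float:
    """Fraction of answer statements that are supported by the context."""
    if not statement_supported:
        return 0.0
    return mean(1.0 if supported else 0.0 for supported in statement_supported)


# Toy inputs: two 2-d "embeddings" and six judged statements.
sas_like = cosine_similarity([1.0, 0.0], [1.0, 1.0])
faith_like = faithfulness_score([True] * 5 + [False])
print(round(sas_like, 4), round(faith_like, 4))  # 0.7071 0.8333
```

Under this reading, a fractional faithfulness score such as 0.833333 corresponds to five supported statements out of six.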


## 4. Building an Evaluation Pipeline

In [42]:
from haystack import Pipeline
from haystack.components.evaluators import ContextRelevanceEvaluator, FaithfulnessEvaluator, SASEvaluator

eval_pipeline = Pipeline()
eval_pipeline.add_component("context_relevance", ContextRelevanceEvaluator(raise_on_failure=False))
eval_pipeline.add_component("faithfulness", FaithfulnessEvaluator(raise_on_failure=False))
eval_pipeline.add_component("sas", SASEvaluator(model=embedding_model))

## 5. Running Evaluation

### Run the RAG Pipeline

In [43]:
predicted_answers = []
retrieved_context = []

for question in questions: # loops over 15 questions
  result = basic_rag.run({"query_embedder":{"text":question}, "prompt_builder":{"question": question}}, include_outputs_from={"retriever"})
  predicted_answers.append(result["generator"]["replies"][0])
  retrieved_context.append(result["retriever"]["documents"])

### Run the Evaluation

In [44]:
eval_pipeline_results = eval_pipeline.run(
    {
        "context_relevance": {"questions": questions, "contexts": retrieved_context},
        "faithfulness": {"questions": questions, "contexts": retrieved_context, "predicted_answers": predicted_answers},
        "sas": {"predicted_answers": predicted_answers, "ground_truth_answers": answers},
    }
)

results = {
    "context_relevance": eval_pipeline_results['context_relevance'],
    "faithfulness": eval_pipeline_results['faithfulness'],
    "sas": eval_pipeline_results['sas']
}

100%|██████████| 15/15 [00:11<00:00,  1.26it/s]
100%|██████████| 15/15 [00:35<00:00,  2.37s/it]


## 6. Analyzing Results

[`EvaluationRunResult`](https://docs.haystack.deepset.ai/reference/evaluation-api#evaluationrunresult) collects the evaluation inputs and results so that they can be reported together.

In [45]:
from haystack.evaluation import EvaluationRunResult

inputs = {
    'questions': questions,
    'contexts': retrieved_context,
    'true_answers': answers,
    'predicted_answers': predicted_answers
}
run_name = "rag_eval"
eval_results = EvaluationRunResult(run_name=run_name, inputs=inputs, results=results)
eval_results.score_report()

|   | metrics | score |
|---|---------|-------|
| 0 | context_relevance | 0.2 |
| 1 | faithfulness | 0.611111 |
| 2 | sas | 0.546086 |
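The aggregate score for each metric can be read as the mean of its per-question scores. The sketch below assumes exactly that, with a hypothetical `aggregate_scores` helper and an invented score list: three relevant contexts out of 15 questions would give the 0.2 reported above, though which questions scored 1 is made up here.

```python
from statistics import mean
from typing import Dict, List


def aggregate_scores(individual_scores: Dict[str, List[float]]) -> Dict[str, float]:
    """Average the per-question scores of each metric."""
    return {
        metric: round(mean(scores), 6)
        for metric, scores in individual_scores.items()
    }


# Invented per-question scores: 12 irrelevant contexts, 3 relevant ones.
toy_scores = {"context_relevance": [0.0] * 12 + [1.0] * 3}
print(aggregate_scores(toy_scores))  # {'context_relevance': 0.2}
```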


In [53]:
eval_results.to_pandas()

Unnamed: 0,questions,contexts,true_answers,predicted_answers,context_relevance,faithfulness,sas
0,What are the two main tasks BERT is pre-traine...,[Document(id=1996eb783b7e2934527de00e3d5f82fb5...,Masked LM (MLM) and Next Sentence Prediction (...,The two main tasks BERT is pre-trained on are ...,0,1.0,0.552495
1,"What model sizes are reported for BERT, and wh...",[Document(id=8906a653a71ec55161d5f8c6203335456...,"BERTBASE (L=12, H=768, A=12, Total Parameters=...",The BERT model sizes reported are:\n\n1. **BER...,0,0.0,0.664142
2,How does BERT's architecture facilitate the us...,[Document(id=320d3c00ef93938ee6cc92f6a742ba1ed...,BERT uses a multi-layer bidirectional Transfor...,BERT's architecture facilitates the use of a u...,0,1.0,0.817575
3,Can you describe the modifications LLaMA makes...,[Document(id=f360dea1ec15f8f778718ae1e13eb855b...,LLaMA incorporates pre-normalization (using R...,,0,0.0,0.015276
4,How does LLaMA's approach to embedding layer o...,[Document(id=f360dea1ec15f8f778718ae1e13eb855b...,LLaMA introduces optimizations in its embeddin...,,0,0.0,0.075397
5,How were the questions for the multitask test ...,[Document(id=9415e713cf73ffea5ca383126c54f7ec4...,Questions were manually collected by graduate ...,The questions for the multitask test were manu...,0,1.0,0.652526
6,How does BERT's performance on the GLUE benchm...,[Document(id=606c67eb5eeb136ad77616d2ef06a580b...,BERT achieved new state-of-the-art on the GLUE...,BERT significantly outperforms all previous st...,0,0.833333,0.857448
7,What significant improvements does BERT bring ...,[Document(id=4ca8419f5c01c094bbda9617b3ce328cb...,"BERT set new records on SQuAD v1.1 and v2.0, s...",BERT brings substantial improvements to the SQ...,0,1.0,0.586361
8,What unique aspect of the LLaMA training datas...,[Document(id=236e5c1e3c782e68912426a7f2543710c...,LLaMA's training dataset is distinctive for b...,The unique aspect of the LLaMA training datase...,0,0.666667,0.962779
9,What detailed methodology does LLaMA utilize t...,[Document(id=9885fbffa74c564acd7a255e8b66a3343...,LLaMA's methodology for ensuring data diversit...,,0,0.0,-0.00547
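In the table above, rows 3, 4, and 9 show no predicted answer and a near-zero SAS, which likely means the model replied "None" as the prompt instructs when the context does not help. It can be useful to separate these cases from answered questions before reading the averages. The sketch below does this for the ten displayed rows only; the `"..."` placeholders stand in for real answers, and `is_unanswered` is a hypothetical helper.

```python
from statistics import mean
from typing import List, Tuple

# (predicted answer placeholder, SAS) for the ten rows displayed above.
# "..." stands in for a real answer; "None" marks rows without a usable answer.
rows: List[Tuple[str, float]] = [
    ("...", 0.552495), ("...", 0.664142), ("...", 0.817575),
    ("None", 0.015276), ("None", 0.075397), ("...", 0.652526),
    ("...", 0.857448), ("...", 0.586361), ("...", 0.962779),
    ("None", -0.00547),
]


def is_unanswered(answer: str) -> bool:
    """Treat empty strings and a literal None reply as 'no answer'."""
    return answer.strip().strip('"').lower() in {"", "none"}


unanswered_idx = [i for i, (ans, _) in enumerate(rows) if is_unanswered(ans)]
sas_answered = mean(s for ans, s in rows if not is_unanswered(ans))
sas_all = mean(s for _, s in rows)

print("Unanswered rows:", unanswered_idx)
print(f"SAS over answered rows: {sas_answered:.3f}, over all rows: {sas_all:.3f}")
```

A large gap between the two averages suggests that missing context, rather than poor answer quality, is pulling the overall score down.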


In [47]:
index = 2
print(eval_pipeline_results['context_relevance']["individual_scores"][index], "\nQuestion:", questions[index],"\nTrue Answer:", answers[index], "\nAnswer:", predicted_answers[index])
print("".join([doc.content for doc in retrieved_context[index]]))

0 
Question: How does BERT's architecture facilitate the use of a unified model across diverse NLP tasks? 
True Answer: BERT uses a multi-layer bidirectional Transformer encoder architecture, allowing for minimal task-specific architecture modifications in fine-tuning. 
Answer: BERT's architecture facilitates the use of a unified model across diverse NLP tasks through its design as a multi-layer bidirectional Transformer encoder. This architecture allows for minimal differences between the pre-trained model and the final downstream model architecture. By using a consistent approach to both pre-training and fine-tuning, BERT can adapt to various tasks with only a simple classification layer added on top. Additionally, BERT's capability to jointly condition on both left and right context in all layers enhances its versatility across different natural language processing tasks, thereby enabling state-of-the-art performances without substantial task-specific modifications.
BERT: Pre-traini

## Evaluation Harness (Steps 4, 5, and 6)

* Runs the RAG pipeline
* Runs the evaluation

> Try `EvaluationHarness` and give us feedback [on GitHub](https://github.com/deepset-ai/haystack-experimental/discussions/74)

In [None]:
from haystack_experimental.evaluation.harness.rag import (
    DefaultRAGArchitecture,
    RAGEvaluationHarness,
    RAGEvaluationMetric,
    RAGEvaluationInput
)

pipeline_eval_harness = RAGEvaluationHarness(
    rag_pipeline=basic_rag,
    rag_components=DefaultRAGArchitecture.GENERATION_WITH_EMBEDDING_RETRIEVAL, # query_embedder, retriever, prompt_builder, generator
    metrics={
        RAGEvaluationMetric.SEMANTIC_ANSWER_SIMILARITY,
        RAGEvaluationMetric.FAITHFULNESS,
        RAGEvaluationMetric.CONTEXT_RELEVANCE,
    }
)

eval_harness_input = RAGEvaluationInput(
    queries=questions,
    ground_truth_answers=answers,
    rag_pipeline_inputs={
        "prompt_builder": {"question": list(questions)},
    },
)

harness_eval_run = pipeline_eval_harness.run(inputs=eval_harness_input, run_name=run_name)

In [49]:
harness_eval_run.results.score_report()

|   | metrics | score |
|---|---------|-------|
| 0 | metric_context_relevance | 0.266667 |
| 1 | metric_sas | 0.537721 |
| 2 | metric_faithfulness | 0.747778 |


Override some parameters of the RAG pipeline, for example the generator model:

In [None]:
from haystack_experimental.evaluation.harness.rag import RAGEvaluationOverrides

overrides = RAGEvaluationOverrides(rag_pipeline={
    "generator": {"model": "gpt-4"},
})

harness_eval_run_gpt4 = pipeline_eval_harness.run(inputs=eval_harness_input, run_name="harness_eval_run_gpt4", overrides=overrides)

In [51]:
harness_eval_run_gpt4.results.score_report()

|   | metrics | score |
|---|---------|-------|
| 0 | metric_context_relevance | 0.266667 |
| 1 | metric_sas | 0.654073 |
| 2 | metric_faithfulness | 0.796429 |


In [52]:
harness_eval_run.results.comparative_individual_scores_report(harness_eval_run_gpt4.results)

Unnamed: 0,questions,contexts,responses,ground_truth_answers,rag_eval_metric_context_relevance,rag_eval_metric_sas,rag_eval_metric_faithfulness,harness_eval_run_gpt4_metric_context_relevance,harness_eval_run_gpt4_metric_sas,harness_eval_run_gpt4_metric_faithfulness
0,What are the two main tasks BERT is pre-traine...,"[pre-trained with Ima-\ngeNet (Deng et al., 20...",The two main tasks BERT is pre-trained on are ...,Masked LM (MLM) and Next Sentence Prediction (...,0,0.593595,1.0,0,0.22082,1.0
1,"What model sizes are reported for BERT, and wh...",[the\ntraining loss for 336M and 752M BERT mod...,The model sizes reported for BERT and their sp...,"BERTBASE (L=12, H=768, A=12, Total Parameters=...",0,0.62648,1.0,0,0.762167,1.0
2,How does BERT's architecture facilitate the us...,[BERT: Pre-training of Deep Bidirectional Tran...,BERT's architecture facilitates the use of a u...,BERT uses a multi-layer bidirectional Transfor...,1,0.878212,1.0,1,0.69725,1.0
3,Can you describe the modifications LLaMA makes...,[to the transformer\narchitecture (Vaswani et ...,,LLaMA incorporates pre-normalization (using R...,0,0.015276,0.0,0,0.563944,0.857143
4,How does LLaMA's approach to embedding layer o...,[to the transformer\narchitecture (Vaswani et ...,,LLaMA introduces optimizations in its embeddin...,0,0.075397,0.0,0,0.626173,1.0
5,How were the questions for the multitask test ...,[of subjects that either do not neatly ﬁt into...,The questions for the multitask test were manu...,Questions were manually collected by graduate ...,0,0.639905,0.8,0,0.611838,1.0
6,How does BERT's performance on the GLUE benchm...,[GLUE provides a lightweight classiﬁcation API...,BERT significantly outperforms previous state-...,BERT achieved new state-of-the-art on the GLUE...,0,0.808857,1.0,0,0.853133,1.0
7,What significant improvements does BERT bring ...,[ﬁne-tuning data shufﬂing and clas-\nsiﬁer lay...,BERT brings significant improvements to the SQ...,"BERT set new records on SQuAD v1.1 and v2.0, s...",0,0.653101,1.0,0,0.662145,0.375
8,What unique aspect of the LLaMA training datas...,"[model, Gopher, has worse\nperformance than Ch...",LLaMA was trained exclusively on publicly avai...,LLaMA's training dataset is distinctive for b...,0,0.894204,1.0,0,0.949199,1.0
9,What detailed methodology does LLaMA utilize t...,[the description and satisﬁes the\ntest cases....,,LLaMA's methodology for ensuring data diversit...,0,-0.00547,0.0,0,0.681471,0.0
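A comparative report is easier to read once the per-question differences are made explicit. The sketch below takes the SAS columns of the ten rows displayed above (the run covers 15 questions, so this is only a partial view) and computes which questions improved with the overridden generator. The variable names are illustrative.

```python
from typing import List

# SAS per question for the ten displayed rows of each run.
sas_baseline: List[float] = [
    0.593595, 0.62648, 0.878212, 0.015276, 0.075397,
    0.639905, 0.808857, 0.653101, 0.894204, -0.00547,
]
sas_gpt4: List[float] = [
    0.22082, 0.762167, 0.69725, 0.563944, 0.626173,
    0.611838, 0.853133, 0.662145, 0.949199, 0.681471,
]

# Positive delta means the overridden run scored higher on that question.
deltas = [round(new - old, 6) for old, new in zip(sas_baseline, sas_gpt4)]
improved = [i for i, d in enumerate(deltas) if d > 0]
biggest_gain = max(range(len(deltas)), key=deltas.__getitem__)
biggest_drop = min(range(len(deltas)), key=deltas.__getitem__)

print("Improved questions:", improved)
print("Biggest gain at row", biggest_gain, "biggest drop at row", biggest_drop)
```

In these rows, the largest gains come from questions 3, 4, and 9, where the baseline run produced no answer.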


In [None]:
# Note: the run is named "topk10", but this override sets the retriever's top_k to 2.
overrides = RAGEvaluationOverrides(rag_pipeline={
    "retriever": {"top_k": 2},
})

harness_eval_run_topk10 = pipeline_eval_harness.run(inputs=eval_harness_input, run_name="harness_eval_run_topk10", overrides=overrides)

Executing RAG pipeline...


100%|██████████| 30/30 [01:50<00:00,  3.67s/it]


Executing evaluation pipeline...


100%|██████████| 30/30 [01:05<00:00,  2.18s/it]
100%|██████████| 30/30 [00:26<00:00,  1.12it/s]


In [None]:
harness_eval_run_topk10.results.score_report()

|   | metrics | score |
|---|---------|-------|
| 0 | metric_sas | 0.574303 |
| 1 | metric_faithfulness | 0.78 |
| 2 | metric_context_relevance | 0.4 |
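With three harness runs available, the aggregate scores can be placed side by side to see which configuration leads on each metric. The sketch below copies the numbers from the three score reports above into a plain dictionary; `best_run_per_metric` is a hypothetical helper, not part of the harness API.

```python
from typing import Dict

# Aggregate scores copied from the three score reports above.
runs: Dict[str, Dict[str, float]] = {
    "rag_eval": {
        "context_relevance": 0.266667, "sas": 0.537721, "faithfulness": 0.747778,
    },
    "harness_eval_run_gpt4": {
        "context_relevance": 0.266667, "sas": 0.654073, "faithfulness": 0.796429,
    },
    "harness_eval_run_topk10": {
        "context_relevance": 0.4, "sas": 0.574303, "faithfulness": 0.78,
    },
}


def best_run_per_metric(runs: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """For each metric, return the name of the run with the highest score."""
    metrics = next(iter(runs.values())).keys()
    return {m: max(runs, key=lambda name: runs[name][m]) for m in metrics}


best = best_run_per_metric(runs)
for metric, run in best.items():
    print(f"{metric}: {run}")
```

Small differences should be read with care: two of these metrics are judged by an LLM, and the same baseline pipeline scored 0.2 on context relevance in the manual run and 0.266667 in the harness run.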


## Evaluation Frameworks

* [RagasEvaluator](https://docs.haystack.deepset.ai/docs/ragasevaluator)
* [FlowJudge](https://haystack.deepset.ai/integrations/flow-judge)

In [None]:
from flow_judge.integrations.haystack import HaystackFlowJudge
from flow_judge.metrics.presets import RESPONSE_FAITHFULNESS_5POINT
from flow_judge import Hf

model = Hf(flash_attn=False)

flow_judge_evaluator = HaystackFlowJudge(
    metric=RESPONSE_FAITHFULNESS_5POINT,
    model=model,
    progress_bar=True,
    raise_on_failure=True,
    save_results=True,
    fail_on_parse_error=False
)

In [None]:
from haystack_integrations.components.evaluators.ragas import RagasEvaluator, RagasMetric

ragas_evaluator = RagasEvaluator(
    metric=RagasMetric.FAITHFULNESS
)

In [28]:
# For FlowJudge: one joined context string per question -> ["...", "...", ...]
str_fj_retrieved_context = []
for context in retrieved_context:
  str_context = [doc.content for doc in context]
  str_fj_retrieved_context.append(" ".join(str_context))

In [29]:
# For Ragas: a list of context strings per question -> [["...", "..."], ...]
str_retrieved_context = []
for context in retrieved_context:
  str_context = [doc.content for doc in context]
  str_retrieved_context.append(str_context)
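The two cells above build different shapes from the same retrieved documents: one joined string per question, and one list of strings per question. The sketch below shows the difference on toy data. `Doc` is a hypothetical stand-in for a retrieved document with a `content` attribute, and the chunk texts are invented.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Doc:
    """Hypothetical stand-in for a retrieved document."""
    content: str


retrieved: List[List[Doc]] = [
    [Doc("chunk A1"), Doc("chunk A2")],  # documents retrieved for question 1
    [Doc("chunk B1")],                   # documents retrieved for question 2
]

# One joined string per question.
joined_contexts = [" ".join(d.content for d in docs) for docs in retrieved]
# One list of strings per question.
listed_contexts = [[d.content for d in docs] for docs in retrieved]

print(joined_contexts)  # ['chunk A1 chunk A2', 'chunk B1']
print(listed_contexts)  # [['chunk A1', 'chunk A2'], ['chunk B1']]
```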

In [31]:
from haystack import Pipeline

integration_eval_pipeline = Pipeline()
integration_eval_pipeline.add_component("ragas_evaluator", ragas_evaluator)
integration_eval_pipeline.add_component("flow_judge_evaluator", flow_judge_evaluator)

eval_framework_pipeline_results = integration_eval_pipeline.run(
    {
        "ragas_evaluator": {"questions": questions, "contexts": str_retrieved_context, "responses":predicted_answers},
        "flow_judge_evaluator": {"query": questions, "context": str_fj_retrieved_context, "response": predicted_answers},
    }
)

Evaluating:   0%|          | 0/10 [00:00<?, ?it/s]

Processing batches: 100%|██████████| 10/10 [03:32<00:00, 21.23s/it]


In [34]:
eval_framework_pipeline_results

{'ragas_evaluator': {'results': [[{'name': 'faithfulness', 'score': 0.5}],
   [{'name': 'faithfulness', 'score': 1.0}],
   [{'name': 'faithfulness', 'score': 1.0}],
   [{'name': 'faithfulness', 'score': nan}],
   [{'name': 'faithfulness', 'score': nan}],
   [{'name': 'faithfulness', 'score': 0.9090909090909091}],
   [{'name': 'faithfulness', 'score': 1.0}],
   [{'name': 'faithfulness', 'score': 1.0}],
   [{'name': 'faithfulness', 'score': 1.0}],
   [{'name': 'faithfulness', 'score': nan}]]},
 'flow_judge_evaluator': {'results': [{'feedback': "The response provided is highly consistent with the given context. The context explicitly mentions that BERT is pre-trained on two tasks: the masked language model (MLM) task and the next sentence prediction (NSP) task. The response accurately identifies these two tasks as the main pre-training tasks for BERT, directly reflecting the information provided in the context. There are no hallucinated or fabricated details in the response, and all the i