# 5a. Evaluation - Testing your RAG application (simple)
<a id="evaluation1"></a>


This section helps you assess the quality and performance of your RAG application through objective, automated testing. Effective evaluation is crucial for ensuring that your application accurately retrieves relevant information and generates helpful responses based on your document collection.

## Evaluation components

The evaluation framework we will build examines several critical aspects:

- **Answer quality**: Assesses the completeness of generated responses against expected content
- **Performance metrics**: Quantifies the system's effectiveness through pass rates and match scores
- **Test dataset**: Provides a diverse set of questions with expected keywords for consistent evaluation

**Note:** This tutorial uses simple keyword matching for clarity and ease of implementation, but your evaluation approach should be tailored to your specific use case. Consider alternatives such as *LLM as a judge*, *BERTScore*, or other domain-specific evaluation criteria for specialised applications.

## What you will learn

1. How to wrap your PhariaKernel Skill in an Intelligence Layer task for consistent testing
2. How to create and register a test dataset with questions and expected keywords
3. How to implement evaluation logic for measuring keyword presence in answers
4. How to create aggregation metrics to analyse overall system performance
5. How to run benchmarks and interpret the results
6. How to identify targeted improvement opportunities based on evaluation insights

## Prerequisites

Before starting, ensure you have the following:

- **Completed Skills**: You have finished the [Skill Setup](4.%20Skill%20Setup%20-%20Customizing%20and%20Developing%20Kernel%20Skills.ipynb#evaluation1) section and have a working RAG Skill deployed in PhariaKernel
- **Intelligence Layer**: Access to the PhariaAI Intelligence Layer SDK for structured evaluation
- **Studio access**: Permissions to create datasets and benchmarks in PhariaStudio
- **Authentication**: A valid API token with permissions to access your deployed skill
- **Document collection**: The same document collection used by your RAG application


## Setting up the evaluation environment

To evaluate your RAG application, we use the PhariaAI Intelligence Layer framework. We are actively working on a more integrated experience, and a new SDK will soon make the following tasks easier.

First, make sure to add the Intelligence Layer and Aleph Alpha client dependencies. Run these commands in your terminal:

```bash
poetry add git+https://github.com/Aleph-Alpha/intelligence-layer-sdk.git --python ">=3.10,<3.13" 
```

```bash
pip install aleph-alpha-client
```

Now import the necessary libraries and set up your environment. We start by importing components from the Intelligence Layer framework that will help us create and run our evaluations:

In [None]:
import os
from collections.abc import Iterable
from statistics import mean
from uuid import uuid4

import requests
from dotenv import load_dotenv
from pydantic import BaseModel

from intelligence_layer.connectors import StudioClient
from intelligence_layer.core import NoOpTracer, Task, TaskSpan
from intelligence_layer.evaluation import (
    AggregationLogic,
    Example,
    SingleOutputEvaluationLogic,
    StudioBenchmarkRepository,
    StudioDatasetRepository,
)

from rag_tutorial.skill.qa import Input, Output

If you have named your project differently, replace the last line with `from <your-application>.skill.qa import Input, Output`.

## Procedure

### 1. Connect to PhariaStudio

First, we need to establish a connection to PhariaStudio, which will be used to store our evaluation datasets, benchmarks, and traces. The StudioClient provides an interface for creating and managing these resources:

In [None]:
load_dotenv("rag_tutorial/skill/.env")
PHARIA_STUDIO_PROJECT_NAME = "rag-tutorial"

studio_client = StudioClient(
    project=PHARIA_STUDIO_PROJECT_NAME,
    studio_url=os.getenv("PHARIA_STUDIO_ADDRESS"),
    auth_token=os.getenv("PHARIA_AI_TOKEN"),
    create_project=True,
)
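
The client above reads its settings from environment variables, and `os.getenv` silently returns `None` when one is missing. The following minimal sketch, using a hypothetical helper `find_missing_env_vars` that is not part of any SDK, shows one way to check for unset variables before connecting. The variable names are the ones used in this tutorial; the example environment is made up for illustration.

```python
import os

# Variable names taken from the code cells in this tutorial
REQUIRED_ENV_VARS = ("PHARIA_STUDIO_ADDRESS", "PHARIA_KERNEL_ADDRESS", "PHARIA_AI_TOKEN")


def find_missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Demonstration with a hypothetical environment instead of the real one
example_env = {"PHARIA_STUDIO_ADDRESS": "https://studio.example.com", "PHARIA_AI_TOKEN": ""}
print(find_missing_env_vars(REQUIRED_ENV_VARS, example_env))
# → ['PHARIA_KERNEL_ADDRESS', 'PHARIA_AI_TOKEN']
```

Calling `find_missing_env_vars(REQUIRED_ENV_VARS)` without a second argument checks the real process environment after `load_dotenv` has run.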

### 2. Create a task wrapper for your RAG Skill

To evaluate our RAG Skill, we need to wrap it in an Intelligence Layer task. This wrapper serves as an adapter between your Skill implementation and the evaluation framework. The task abstraction allows the framework to:

1. Execute your Skill with different inputs
2. Capture outputs systematically
3. Track performance metrics and execution details
4. Integrate with PhariaStudio's visualisation tools

The `QATask` class below implements the Intelligence Layer's task interface, providing a standardised way to invoke our PhariaKernel Skill, as follows:

- It connects to PhariaKernel using a REST API
- It forwards input questions to our deployed Skill
- It transforms API responses into a structured output format

This abstraction allows us to easily test our Skill with different evaluation datasets and compare performance across different configurations.

In [None]:
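class QATask(Task[Input, Output]):
    """Intelligence Layer task that forwards questions to the deployed RAG Skill."""

    def __init__(self) -> None:
        self.token = os.getenv("PHARIA_AI_TOKEN")
        self.kernel_url = os.getenv("PHARIA_KERNEL_ADDRESS")
        self.skill_namespace = "playground"
        self.skill_name = "qa-rag-tutorial"

    def do_run(self, input: Input, task_span: TaskSpan) -> Output:
        headers = {"Authorization": f"Bearer {self.token}"}
        url = f"{self.kernel_url}/v1/skills/{self.skill_namespace}/{self.skill_name}/run"
        try:
            response = requests.post(
                url,
                json=input.model_dump() if isinstance(input, BaseModel) else input,
                headers=headers,
                timeout=120,  # seconds; adjust to how long your Skill usually takes
            )
            # Surface HTTP errors (for example 401 or 404) instead of a confusing KeyError
            response.raise_for_status()
            payload = response.json()
            return Output(answer=payload["answer"], output=payload["output"])
        except Exception as e:
            # Report the failure and return an empty output so a benchmark run can continue
            print(f"Skill request failed: {e}")
            return Output(answer=None, output=None)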

Before proceeding, verify that your task wrapper correctly interfaces with the deployed Skill:

In [None]:
test_input = Input(question="What is a transformer?")

task = QATask()
task.run(test_input, NoOpTracer())
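
If this test call fails, it can help to look at the request the wrapper sends without touching the network. The sketch below uses a hypothetical helper `build_skill_request` that mirrors how `QATask.do_run` assembles the URL and headers. It assumes that `Input` serialises to a single `question` field, and it additionally strips a trailing slash from the Kernel address, which the wrapper itself does not do. The address and token shown are placeholders.

```python
def build_skill_request(kernel_url, namespace, skill_name, token, question):
    """Assemble the URL, headers, and JSON body for a Skill run request."""
    # Extra safety compared to QATask: avoid a double slash in the final URL
    base = kernel_url.rstrip("/")
    url = f"{base}/v1/skills/{namespace}/{skill_name}/run"
    headers = {"Authorization": f"Bearer {token}"}
    payload = {"question": question}  # assumption: Input has only a `question` field
    return url, headers, payload


# Placeholder values for illustration only
url, headers, payload = build_skill_request(
    "https://kernel.example.com/",
    "playground",
    "qa-rag-tutorial",
    "my-token",
    "What is a transformer?",
)
print(url)
# → https://kernel.example.com/v1/skills/playground/qa-rag-tutorial/run
```

Comparing the printed URL with the namespace and Skill name you deployed is a quick way to rule out a typo before investigating authentication or the Skill itself.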

### 3. Create an evaluation dataset

#### 3.1 Create a test dataset



First, we create an example test dataset with questions that cover different topics in our document collection. For each question, we specify keywords that should appear in a well-informed answer.

In [None]:
test_set = [
    {
        "question": "What is mixture-of-experts?",
        "keywords": ["experts", "gating", "combine"],
    },
    {
        "question": "What is a Large Language Model?",
        "keywords": ["corpus", "parameters", "generation"],
    },
    {"question": "What is a Sequence?", "keywords": ["order", "elements", "series"]},
    {
        "question": "What is translation?",
        "keywords": ["language", "meaning", "convert"],
    },
    {
        "question": "What is the difference between GRNN and RNN?",
        "keywords": ["gates", "general", "specific"],
    },
    {"question": "What is LSTM?", "keywords": ["memory", "gates", "vanishing"]},
    {"question": "What are RNNs?", "keywords": ["feedback", "sequential", "state"]},
    {
        "question": "What is self-attention?",
        "keywords": ["positions", "sequence", "relate"],
    },
    {"question": "What is Attention?", "keywords": ["focus", "weighting", "context"]},
    {
        "question": "What is a transformer?",
        "keywords": ["attention", "parallel", "encoder"],
    },
]
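
Hand-written test sets tend to accumulate small mistakes such as duplicate questions or empty keyword lists. The following sketch uses a hypothetical helper `validate_test_set`, which is not part of the Intelligence Layer, to flag such problems before the dataset is uploaded. It assumes the same dictionary layout as `test_set` above.

```python
def validate_test_set(test_set):
    """Return a list of human-readable problems found in the test set."""
    problems = []
    seen_questions = set()
    for index, item in enumerate(test_set):
        question = item.get("question", "").strip()
        keywords = item.get("keywords", [])
        if not question:
            problems.append(f"entry {index}: empty question")
        if question.lower() in seen_questions:
            problems.append(f"entry {index}: duplicate question")
        seen_questions.add(question.lower())
        if not keywords:
            problems.append(f"entry {index}: no keywords")
        if any(keyword != keyword.strip() or not keyword for keyword in keywords):
            problems.append(f"entry {index}: blank or padded keyword")
    return problems


sample = [
    {"question": "What is LSTM?", "keywords": ["memory", "gates", "vanishing"]},
    {"question": "what is lstm?", "keywords": []},
]
print(validate_test_set(sample))
# → ['entry 1: duplicate question', 'entry 1: no keywords']
```

The empty-keyword check matters for this tutorial in particular: the evaluation logic defined later scores an example without keywords as 1.0, so such an entry would always pass.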

#### 3.2 Pydantic model for expected output

Next, we define a Pydantic model that gives the expected output a clear structure. `EvaluationExpectedOutput` defines which keywords we expect in a good answer.

This model provides type safety and ensures consistent evaluation across multiple benchmark runs.

In [None]:
class EvaluationExpectedOutput(BaseModel):
    keywords: list[str]

#### 3.3 Upload the dataset

With our test examples prepared, we now need to register them in PhariaStudio. This step creates a persistent, versioned dataset that can be referenced in benchmarks and used for repeated evaluations.

In [None]:
studio_dataset_repo = StudioDatasetRepository(studio_client=studio_client)

examples = [
    Example(
        input=Input(question=example["question"]),
        expected_output=EvaluationExpectedOutput(keywords=example["keywords"]),
    )
    for example in test_set
]

studio_dataset = studio_dataset_repo.create_dataset(
    examples=examples, dataset_name="demo-dataset"
)

studio_dataset.id

The dataset has now been created and registered in PhariaStudio with the ID shown above. To verify and inspect your dataset:

1. Navigate to your PhariaStudio interface in your web browser
2. Select the "Evaluate" section from the main navigation
3. Choose "Datasets" from the evaluation options
4. Find your "demo-dataset" in the list of available datasets

### 4. Define evaluation logic

Now we implement the core logic for evaluating our RAG system. Our evaluation approach consists of two main components:

1. **EvaluationLogic**: Evaluates individual examples by comparing the system's output with the expected output
2. **AggregationLogic**: Aggregates individual evaluation results into overall metrics

For each of these, we define the expected return values as Pydantic models to ensure type safety and define their logic.

#### 4.1 EvaluationLogic 

First, we set up the evaluation logic that is used for each individual example. Our `QaEvaluationLogic` class implements this assessment strategy by extending the Intelligence Layer's `SingleOutputEvaluationLogic` interface, allowing it to integrate with the broader evaluation framework:

In [None]:
class QaEvaluation(BaseModel):
    matched_keywords: list[str]
    missing_keywords: list[str]
    match_score: float
    passed: bool


class QaEvaluationLogic(
    SingleOutputEvaluationLogic[Input, Output, EvaluationExpectedOutput, QaEvaluation]
):

    def __init__(self) -> None:
        self.threshold = 0.8  # Minimum match score for an example to pass

    def do_evaluate_single_output(
        self, example: Example[Input, EvaluationExpectedOutput], output: Output
    ) -> QaEvaluation:
        required_keywords = example.expected_output.keywords
        # Treat a failed Skill call (answer is None) as an empty answer
        output_text = (output.answer or "").lower()

        matched_keywords = []
        missing_keywords = []

        for keyword in required_keywords:
            if keyword.lower() in output_text:
                matched_keywords.append(keyword)
            else:
                missing_keywords.append(keyword)

        match_score = (
            len(matched_keywords) / len(required_keywords) if required_keywords else 1.0
        )

        passed = match_score >= self.threshold

        return QaEvaluation(
            matched_keywords=matched_keywords,
            missing_keywords=missing_keywords,
            match_score=match_score,
            passed=passed,
        )
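
Because the match score is a simple fraction, the pass threshold interacts with the number of keywords per example. The sketch below uses a hypothetical helper `min_matches_to_pass` to work out how many keywords must be found for a given list length. It applies the same `>=` comparison as `QaEvaluationLogic` and assumes a threshold of at most 1.0.

```python
def min_matches_to_pass(keyword_count, threshold=0.8):
    """Smallest number of matched keywords whose score reaches the threshold."""
    if keyword_count == 0:
        return 0  # the evaluation logic scores an empty keyword list as 1.0
    for matched in range(keyword_count + 1):
        if matched / keyword_count >= threshold:
            return matched
    return None  # only reached if the threshold is above 1.0


for keyword_count in (3, 4, 5, 10):
    print(keyword_count, min_matches_to_pass(keyword_count))
# → 3 3
# → 4 4
# → 5 4
# → 10 8
```

With three keywords per question, as in the test set of this tutorial, a threshold of 0.8 means a single missing keyword fails the example (2/3 ≈ 0.67). If that is stricter than intended, lower the threshold or add more keywords per question.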

To ensure our evaluation logic works correctly, we test it with a sample question and answer about neural network encoders:

In [None]:
input = Input(question="What is an encoder?")
output = Output(
    answer="**1. SUMMARY:** An encoder is a component of a neural network model, specifically composed of a stack of identical layers, each containing a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.\n\n**2. DETAILS:** The encoder is a crucial part of a neural network architecture, particularly in transformer models. It is composed of a stack of N identical layers, where each layer consists of two sub-layers. The first sub-layer is a multi-head self-attention mechanism, which allows the model to attend to different parts of the input sequence simultaneously. The second sub-layer is a simple, position-wise fully connected feed-forward network, which applies a linear transformation to the input. In the context of encoder-decoder attention, the encoder takes in an input sequence and generates a set of output vectors that are used as memory keys and values for the decoder to attend to. This allows the decoder to attend to all positions in the input sequence, enabling the model to generate coherent and contextually relevant outputs.\n\n**3. SOURCES:** The provided context documents do not explicitly define what an encoder is, but rather describe its composition and functionality. However, based on the description, it is clear that the encoder is a key component of the neural network model."
)
example = Example(
    input=input,
    expected_output=EvaluationExpectedOutput(
        keywords=["encoder", "neural network", "transformer"]
    ),
)
evaluation_logic = QaEvaluationLogic()
evaluation = evaluation_logic.do_evaluate_single_output(example, output)
evaluation
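
One caveat of the `in` check used by the evaluation logic is that it matches substrings, not whole words. The sketch below compares it with a word-boundary variant based on Python's `re` module. The helper names are hypothetical, and the word-boundary version is an alternative for illustration, not what the tutorial's evaluation logic does.

```python
import re


def contains_substring(keyword, text):
    """Substring check, equivalent to the comparison in QaEvaluationLogic."""
    return keyword.lower() in text.lower()


def contains_whole_word(keyword, text):
    """Match the keyword only when it appears as a separate word or phrase."""
    pattern = r"\b" + re.escape(keyword.lower()) + r"\b"
    return re.search(pattern, text.lower()) is not None


answer = "The layer aggregates the hidden states."
print(contains_substring("gates", answer))   # → True ("aggregates" contains "gates")
print(contains_whole_word("gates", answer))  # → False
```

Neither variant is strictly better. Substring matching can produce false positives like the one above, while whole-word matching rejects inflected forms, for example the keyword "gate" in an answer that only says "gates".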

#### 4.2 AggregationLogic

To assess overall system performance, we need to aggregate individual evaluation results into meaningful metrics. This is defined in the `QaAggregationLogic` class:


In [None]:
class QaAggregatedEvaluation(BaseModel):
    pass_rate: float
    average_match_score: float


class QaAggregationLogic(
    AggregationLogic[
        QaEvaluation,
        QaAggregatedEvaluation,
    ]
):
    def aggregate(self, evaluations: Iterable[QaEvaluation]) -> QaAggregatedEvaluation:
        evaluation_list = list(evaluations)
        if len(evaluation_list) == 0:
            return QaAggregatedEvaluation(
                pass_rate=0.0,
                average_match_score=0.0,
            )

        passed_count = sum(1 for evaluation in evaluation_list if evaluation.passed)
        pass_rate = passed_count / len(evaluation_list)

        average_match_score = mean(
            evaluation.match_score for evaluation in evaluation_list
        )

        return QaAggregatedEvaluation(
            pass_rate=pass_rate,
            average_match_score=average_match_score,
        )

To verify our aggregation mechanism, we test it by aggregating two examples:

In [None]:
input_1 = Input(question="What is an encoder?")
output_1 = Output(
    answer="**1. SUMMARY:** An encoder is a component of a neural network model, specifically composed of a stack of identical layers, each containing a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.\n\n**2. DETAILS:** The encoder is a crucial part of a neural network architecture, particularly in transformer models. It is composed of a stack of N identical layers, where each layer consists of two sub-layers. The first sub-layer is a multi-head self-attention mechanism, which allows the model to attend to different parts of the input sequence simultaneously. The second sub-layer is a simple, position-wise fully connected feed-forward network, which applies a linear transformation to the input. In the context of encoder-decoder attention, the encoder takes in an input sequence and generates a set of output vectors that are used as memory keys and values for the decoder to attend to. This allows the decoder to attend to all positions in the input sequence, enabling the model to generate coherent and contextually relevant outputs.\n\n**3. SOURCES:** The provided context documents do not explicitly define what an encoder is, but rather describe its composition and functionality. However, based on the description, it is clear that the encoder is a key component of the neural network model."
)
example_1 = Example(
    input=input_1,
    expected_output=EvaluationExpectedOutput(
        keywords=["encoder", "neural network", "transformer"]
    ),
)
input_2 = Input(question="What is a transformer?")
output_2 = Output(
    answer='**1. SUMMARY:** A transformer is a new simple network architecture that uses attention mechanisms, eliminating the need for recurrence and convolutions.\n\n**2. DETAILS:** The transformer is a proposed network architecture that relies solely on attention mechanisms, which allow the model to focus on specific parts of the input data when generating output. This approach is different from traditional recurrent neural networks (RNNs) and convolutional neural networks (CNNs), which use recurrence and convolutions, respectively. The transformer architecture is designed to be more parallelizable and requires less training time, making it a more efficient alternative to traditional models.\n\n**3. SOURCES:** The information about the transformer architecture is based on the provided context document, which states: "We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely."'
)
example_2 = Example(
    input=input_2,
    expected_output=EvaluationExpectedOutput(
        keywords=[
            "transformer",
            "architecture",
            "attention",
            "pickle",
        ]  # "pickle" is deliberately unrelated, to exercise the missing-keyword path
    ),
)

aggregation_logic = QaAggregationLogic()
evaluation_logic = QaEvaluationLogic()

evaluation_1 = evaluation_logic.do_evaluate_single_output(example_1, output_1)
evaluation_2 = evaluation_logic.do_evaluate_single_output(example_2, output_2)
aggregation = aggregation_logic.aggregate([evaluation_1, evaluation_2])

print(evaluation_1)
print(evaluation_2)
print(f"Aggregation: {aggregation}")
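
To check the printed values by hand, the arithmetic can be reproduced with plain numbers. In the first example all three keywords appear in the answer. In the second, three of four appear, because "pickle" is missing. The sketch below repeats the same calculation as the two logic classes, using the default threshold of 0.8.

```python
from statistics import mean

threshold = 0.8
match_scores = [3 / 3, 3 / 4]  # example_1: 3 of 3 keywords, example_2: 3 of 4

passed = [score >= threshold for score in match_scores]
pass_rate = sum(passed) / len(passed)
average_match_score = mean(match_scores)

print(passed, pass_rate, average_match_score)
# → [True, False] 0.5 0.875
```

The second example fails even though most of its keywords match, since 0.75 is below the threshold. This is why the pass rate (0.5) and the average match score (0.875) can differ noticeably.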

### 5. Create and run a benchmark

With our evaluation components ready, we can now create a benchmark in PhariaStudio and run our evaluation on the test dataset. Benchmarks provide a structured way to do the following:

1. Store evaluation datasets and logic for reuse
2. Track performance across multiple runs and configurations
3. Visualise results through the PhariaStudio interface
4. Share insights with team members

First, we create a benchmark repository that stores our evaluation configuration and results. This repository is linked to our PhariaStudio project, making it easy to access evaluation results through the web interface:

In [None]:
benchmark_repository = StudioBenchmarkRepository(studio_client=studio_client)

benchmark = benchmark_repository.create_benchmark(
    dataset_id=studio_dataset.id,
    eval_logic=evaluation_logic,
    aggregation_logic=aggregation_logic,
    name="keyword-matching-benchmark",
    description="This benchmark evaluates the keyword matching between the model's output and the expected output.",
)

benchmark.id

Next, we execute the benchmark. When the benchmark runs, the following process takes place:

1. Each example from our dataset is sent to the RAG task
2. The task generates an answer for each question
3. Our evaluation logic assesses each answer against expected keywords
4. Results are aggregated and stored in PhariaStudio
5. A benchmark execution ID is returned for tracking

In [None]:
benchmark = benchmark_repository.get_benchmark(
    benchmark_id=benchmark.id,
    eval_logic=evaluation_logic,
    aggregation_logic=aggregation_logic,
)

benchmark_execution_id = benchmark.execute(
    task=task,
    name=str(uuid4()),
)
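
Conceptually, the process listed above is a loop over the dataset followed by one aggregation step. The following sketch illustrates that flow with plain functions and dictionaries. It is a simplified stand-in and not the Intelligence Layer's implementation: `run_benchmark`, `fake_task`, and the data are hypothetical, and nothing is stored in PhariaStudio.

```python
from statistics import mean


def run_benchmark(examples, task, evaluate, aggregate):
    """Run the task on every example, evaluate each output, then aggregate."""
    evaluations = []
    for example in examples:
        answer = task(example["question"])
        evaluations.append(evaluate(example["keywords"], answer))
    return aggregate(evaluations)


def fake_task(question):
    # Stand-in for the deployed Skill: always returns the same canned answer
    return "Attention lets a model focus on relevant context."


def evaluate(keywords, answer, threshold=0.8):
    matched = [keyword for keyword in keywords if keyword.lower() in answer.lower()]
    score = len(matched) / len(keywords) if keywords else 1.0
    return {"match_score": score, "passed": score >= threshold}


def aggregate(evaluations):
    return {
        "pass_rate": sum(e["passed"] for e in evaluations) / len(evaluations),
        "average_match_score": mean(e["match_score"] for e in evaluations),
    }


examples = [
    {"question": "What is Attention?", "keywords": ["focus", "context"]},
    {"question": "What is LSTM?", "keywords": ["memory", "gates"]},
]
result = run_benchmark(examples, fake_task, evaluate, aggregate)
print(result)
# → {'pass_rate': 0.5, 'average_match_score': 0.5}
```

Keeping the task, the per-example evaluation, and the aggregation as separate pieces is what allows the same benchmark to be re-run against a modified Skill without changing the evaluation code.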

After the benchmark completes, you can view detailed results in the PhariaStudio interface under Evaluate/Benchmarks. This visualisation helps identify patterns in system performance and can illuminate areas for improvement.


### 6. Improving your RAG application

Based on the evaluation results, you can identify areas for improvement in your RAG application. Common improvements include:

1. **Refining the prompt**: Adjust the prompt to encourage more precise reference citation
2. **Adjusting retrieval parameters**: Modify the number of retrieved documents or relevance thresholds
3. **Enhancing document chunking**: Change how documents are split and indexed
4. **Implementing better ranking**: Add reranking steps to prioritise the most relevant documents
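
One way to decide which of these improvements to try first is to look at which keywords are missed most often across all evaluated examples. The sketch below uses a hypothetical helper `most_missed_keywords` and made-up evaluation results. It assumes each result exposes a `missing_keywords` list, as `QaEvaluation` does in this tutorial, represented here as plain dictionaries.

```python
from collections import Counter


def most_missed_keywords(evaluations, top_n=3):
    """Count how often each keyword is missing and return the most frequent ones."""
    counts = Counter()
    for evaluation in evaluations:
        counts.update(evaluation["missing_keywords"])
    return counts.most_common(top_n)


# Made-up results for illustration
sample_evaluations = [
    {"missing_keywords": ["gates"]},
    {"missing_keywords": ["gates", "vanishing"]},
    {"missing_keywords": []},
]
print(most_missed_keywords(sample_evaluations))
# → [('gates', 2), ('vanishing', 1)]
```

A keyword that is missed across several questions may point to a topic that retrieval does not surface, whereas a keyword missed only once may simply be too specific for the test set.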

To modify your RAG Skill, see the [Skill Setup](4.%20Skill%20Setup%20-%20Customizing%20and%20Developing%20Kernel%20Skills.ipynb#evaluation1) part of this tutorial.

## Summary

In this section, you established a comprehensive evaluation framework for your RAG application:

✅ **Created structured evaluation components** including a task wrapper, evaluation logic, and aggregation metrics

✅ **Built a diverse test dataset** covering key topics in your document collection

✅ **Established objective evaluation criteria** based on keyword presence in generated answers

✅ **Implemented aggregation metrics** to track overall system performance

✅ **Integrated with PhariaStudio's visualisation tools** for intuitive result analysis

✅ **Developed strategies for targeted improvements** based on evaluation insights

This evaluation framework provides you with a systematic way to measure RAG performance and identify areas for improvement. As you refine your application, regular evaluation runs help ensure that changes positively impact system quality.

Remember that different use cases may require different evaluation approaches. Consider expanding your evaluation framework with:

- Human evaluation for subjective aspects of answer quality
- Task-specific metrics for particular domains
- *LLM as a judge* for large-scale semantic tests

In the next section, we explore how to deploy your fully evaluated RAG application for production use.