# 1. Evaluation - Simple implementation
<a id="evaluation1"></a>

This section explains how to use the PhariaStudio SDK to build an evaluation setup that helps improve the quality of your AI logic. It focuses on the setup and is intentionally simpler than a real scenario.

<div class="alert alert-info">
<b>Note:</b> This tutorial runs entirely in this Jupyter notebook.
</div>

## Prerequisites

Before starting, ensure you have completed the [**"0. Getting Started"**](./0.%20Getting%20Started.ipynb) step of this evaluation section. In addition, the QA skill in this folder uses a specific collection; make sure it is available in your testing environment.

In [None]:
# The collection used by the QA skill. If it is not available, edit the qa.py file to use a different collection.
NAMESPACE = "Studio"
COLLECTION = "papers"
INDEX = "asym-64"

Now import the necessary libraries and set up your environment. We start by importing the components from the PhariaInference SDK and the PhariaStudio SDK that we use to create and run our evaluations:

In [None]:
from dotenv import load_dotenv
from os import getenv
from pydantic import BaseModel
from collections.abc import Iterable
from statistics import mean
from uuid import uuid4

from pharia_inference_sdk.core import NoOpTracer, Task, TaskSpan
from pharia_skill.testing import DevCsi
from pharia_studio_sdk import StudioClient
from pharia_studio_sdk.evaluation import (Example,
                                        SingleOutputEvaluationLogic,
                                        StudioBenchmarkRepository,
                                        StudioDatasetRepository,
                                        AggregationLogic)

# Please use the qa.py file provided in the tutorial folder
from qa import Input, Output, custom_rag

## Procedure

### 1. Connect to PhariaStudio

First, we need to establish a connection to PhariaStudio, which will be used to store our evaluation datasets, benchmarks, and traces. The StudioClient provides an interface for creating and managing these resources:

In [None]:
load_dotenv(override=True)

PHARIA_AI_TOKEN = getenv("PHARIA_AI_TOKEN")
PHARIA_STUDIO_PROJECT_NAME = getenv("PHARIA_STUDIO_PROJECT_NAME")
PHARIA_STUDIO_ADDRESS = getenv("PHARIA_STUDIO_ADDRESS")
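
Because `getenv` returns `None` for unset variables, a missing entry in the `.env` file only surfaces later as a connection error. The following is a minimal sketch of an early check; the helper `missing_env_vars` is a hypothetical name introduced here for illustration and is not part of any SDK.

```python
import os

# Names of the variables this tutorial reads from the environment
REQUIRED_VARS = (
    "PHARIA_AI_TOKEN",
    "PHARIA_STUDIO_PROJECT_NAME",
    "PHARIA_STUDIO_ADDRESS",
)


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper for illustration; defaults to the process environment.
    """
    env = os.environ if environ is None else environ
    return [name for name in names if not env.get(name)]


# Example with an explicit mapping instead of the real environment
example_env = {"PHARIA_AI_TOKEN": "secret", "PHARIA_STUDIO_ADDRESS": ""}
print(missing_env_vars(REQUIRED_VARS, example_env))
# ['PHARIA_STUDIO_PROJECT_NAME', 'PHARIA_STUDIO_ADDRESS']
```

In the notebook, calling `missing_env_vars(REQUIRED_VARS)` after `load_dotenv` and raising an error on a non-empty result would make configuration problems visible before any request is sent.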

In [None]:
studio_client = StudioClient(
    project=PHARIA_STUDIO_PROJECT_NAME,
    studio_url=PHARIA_STUDIO_ADDRESS,
    auth_token=PHARIA_AI_TOKEN,
    create_project=True,
)

### 2. Create a task wrapper for your RAG Skill

To evaluate our RAG Skill, we need to wrap it in a PhariaInference SDK task. This wrapper serves as an adapter between your Skill implementation and the evaluation framework. As the evaluation framework runs locally, we can simply use `DevCsi` to produce the output.

In [None]:
from pharia_skill.testing import DevCsi

class QATask(Task[Input, Output]):

    def do_run(self, input: Input, task_span: TaskSpan) -> Output:
        # To enable tracing, uncomment the following line.
        # Note: this causes double tracing when executing benchmarks.
        # csi = DevCsi(project=PHARIA_STUDIO_PROJECT_NAME)
        csi = DevCsi()
        return custom_rag(csi, input)

Before proceeding, verify that your task wrapper correctly interfaces with the deployed Skill:

In [None]:
test_input = Input(question="What is an encoder?")

task = QATask()
task.run(test_input, NoOpTracer())


### 3. Create an evaluation dataset

#### 3.1 Create a test dataset



First, we create an example test dataset with questions that cover different topics in our document collection. For each question, we specify keywords that should appear in a well-informed answer.

In [None]:
test_set = [
    {
        "question": "What is mixture-of-experts?",
        "keywords": ["experts", "gating", "combine"]
    },
    {
        "question": "What is a Large Language Model?",
        "keywords": ["corpus", "parameters", "generation"]
    },
    {   
        "question": "What is a Sequence?", 
        "keywords": ["order", "elements", "series"]
    },
    {
        "question": "What is translation?",
        "keywords": ["language", "meaning", "convert"]
    },
    {
        "question": "What is the difference between GRNN and RNN?",
        "keywords": ["gates", "general", "specific"]
    },
    {
        "question": "What is LSTM?", 
        "keywords": ["memory", "gates", "vanishing"]
    },
    {
        "question": "What are RNNs?",
        "keywords": ["feedback", "sequential", "state"]
    },
    {
        "question": "What is self-attention?",
        "keywords": ["positions", "sequence", "relate"]
    },
    {
        "question": "What is Attention?", 
        "keywords": ["focus", "weighting", "context"]
    },
    {
        "question": "What is a transformer?",
        "keywords": ["attention", "parallel", "encoder"]
    },
]
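
A hand-written test set can easily end up with an empty keyword list or a duplicated question. The evaluation logic defined in section 4.1 scores an example without keywords as 1.0, so such an entry would always pass. The sketch below shows one possible sanity check to run before uploading; `find_dataset_problems` is a hypothetical helper and the sample entries are made up for illustration.

```python
def find_dataset_problems(test_set):
    """Return human-readable problems for a list of {"question", "keywords"} dicts.

    Hypothetical helper for illustration; it only inspects plain dictionaries.
    """
    problems = []
    seen_questions = set()
    for index, entry in enumerate(test_set):
        question = str(entry.get("question", "")).strip()
        keywords = entry.get("keywords") or []
        if not question:
            problems.append(f"entry {index}: empty question")
        elif question.lower() in seen_questions:
            problems.append(f"entry {index}: duplicate question")
        seen_questions.add(question.lower())
        if not keywords:
            problems.append(f"entry {index}: no keywords")
    return problems


# Made-up sample: the second entry repeats the question and has no keywords
sample = [
    {"question": "What is LSTM?", "keywords": ["memory", "gates", "vanishing"]},
    {"question": "what is lstm?", "keywords": []},
]
print(find_dataset_problems(sample))
# ['entry 1: duplicate question', 'entry 1: no keywords']
```

An empty result means the check found nothing to report; it does not say whether the keywords themselves are well chosen.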

#### 3.2 Pydantic model for expected output

Next, we define a Pydantic model that describes the structure of the expected output and ensures type safety.

In [None]:
class EvaluationExpectedOutput(BaseModel):
    keywords: list[str]

#### 3.3 Upload the dataset

In [None]:
studio_dataset_repo = StudioDatasetRepository(studio_client=studio_client)

examples = [
    Example(
        input=Input(question=example["question"]),
        expected_output=EvaluationExpectedOutput(keywords=example["keywords"]),
    )
    for example in test_set
]

studio_dataset = studio_dataset_repo.create_dataset(
    examples=examples, dataset_name="demo-dataset"
)

studio_dataset.id

To access the dataset, follow the tutorial [Store an evaluation dataset in PhariaStudio](https://docs.aleph-alpha.com/products/pharia-ai/pharia-studio/tutorial/store-dataset-in-data-platform/).

### 4. Define evaluation logic

The PhariaStudio SDK requires an `EvaluationLogic`, which evaluates individual examples, and an `AggregationLogic`, which aggregates the individual evaluations into overall metrics.

#### 4.1 EvaluationLogic 

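First, we set up the evaluation logic that is applied to each individual example. Our `QaEvaluationLogic` class checks which expected keywords appear in the answer (case-insensitively), computes the fraction of matched keywords, and marks the example as passed when that fraction reaches a threshold. It extends the `SingleOutputEvaluationLogic` interface from the PhariaStudio SDK, which allows it to integrate with the broader evaluation framework: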

In [None]:
class QaEvaluation(BaseModel):
    matched_keywords: list[str]
    missing_keywords: list[str]
    match_score: float
    passed: bool


class QaEvaluationLogic(
    SingleOutputEvaluationLogic[Input, Output, EvaluationExpectedOutput, QaEvaluation]
):

    def __init__(self) -> None:
        self.threshold = 0.5  # Minimum match score for an evaluation to pass

    def do_evaluate_single_output(
        self, example: Example[Input, EvaluationExpectedOutput], output: Output
    ) -> QaEvaluation:
        required_keywords = example.expected_output.keywords
        # Treat a missing answer as empty text without mutating the output
        output_text = (output.answer or "").lower()

        matched_keywords = []
        missing_keywords = []

        for keyword in required_keywords:
            if keyword.lower() in output_text:
                matched_keywords.append(keyword)
            else:
                missing_keywords.append(keyword)

        match_score = (
            len(matched_keywords) / len(required_keywords) if required_keywords else 1.0
        )

        passed = match_score >= self.threshold

        return QaEvaluation(
            matched_keywords=matched_keywords,
            missing_keywords=missing_keywords,
            match_score=match_score,
            passed=passed,
        )
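
To make the scoring concrete, here is a standalone sketch of the same keyword check on plain strings, without the SDK classes. The function `keyword_score` and its `whole_words` option are assumptions added for illustration. The option shows one caveat of substring matching: a keyword such as "gates" also matches inside "aggregates".

```python
import re


def keyword_score(answer, keywords, whole_words=False):
    """Return (matched, missing, score) for a case-insensitive keyword check.

    Illustrative helper; whole_words=True is an optional stricter variant
    that is not part of the evaluation logic in this tutorial.
    """
    text = (answer or "").lower()
    matched, missing = [], []
    for keyword in keywords:
        needle = keyword.lower()
        if whole_words:
            # Only accept the keyword when it is delimited by word boundaries
            found = re.search(rf"\b{re.escape(needle)}\b", text) is not None
        else:
            # Plain substring search, as in the evaluation logic above
            found = needle in text
        (matched if found else missing).append(keyword)
    score = len(matched) / len(keywords) if keywords else 1.0
    return matched, missing, score


answer = "An LSTM aggregates information in a memory cell."
keywords = ["memory", "gates", "vanishing"]

print(keyword_score(answer, keywords))
# (['memory', 'gates'], ['vanishing'], 0.6666666666666666)
print(keyword_score(answer, keywords, whole_words=True))
# (['memory'], ['gates', 'vanishing'], 0.3333333333333333)
```

With the threshold of 0.5, the substring variant passes this answer while the whole-word variant does not, so the choice of matching rule can change the benchmark result.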

#### 4.2 AggregationLogic

To assess overall system performance, we need to aggregate individual evaluation results into meaningful metrics. This is defined in the `QaAggregationLogic` class:


In [None]:
class QaAggregatedEvaluation(BaseModel):
    pass_rate: float
    average_match_score: float


class QaAggregationLogic(
    AggregationLogic[
        QaEvaluation,
        QaAggregatedEvaluation,
    ]
):
    def aggregate(self, evaluations: Iterable[QaEvaluation]) -> QaAggregatedEvaluation:
        evaluation_list = list(evaluations)
        if len(evaluation_list) == 0:
            return QaAggregatedEvaluation(
                pass_rate=0.0,
                average_match_score=0.0,
            )

        passed_count = sum(1 for evaluation in evaluation_list if evaluation.passed)
        pass_rate = passed_count / len(evaluation_list)

        average_match_score = mean(
            evaluation.match_score for evaluation in evaluation_list
        )

        return QaAggregatedEvaluation(
            pass_rate=pass_rate,
            average_match_score=average_match_score,
        )
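
As a worked example of the aggregation, the sketch below applies the same arithmetic to a plain list of match scores. The function `aggregate_scores` and the sample scores are made up for illustration; the threshold of 0.5 mirrors the one used in `QaEvaluationLogic`.

```python
from statistics import mean


def aggregate_scores(match_scores, threshold=0.5):
    """Compute pass rate and average match score from plain floats.

    Illustrative stand-in for the aggregation above, without SDK types.
    """
    scores = list(match_scores)
    if not scores:
        return {"pass_rate": 0.0, "average_match_score": 0.0}
    passed = sum(1 for score in scores if score >= threshold)
    return {
        "pass_rate": passed / len(scores),
        "average_match_score": mean(scores),
    }


# Four made-up examples: two reach the threshold, two do not
print(aggregate_scores([1.0, 0.5, 0.0, 0.25]))
# {'pass_rate': 0.5, 'average_match_score': 0.4375}
```

The two metrics answer different questions: the pass rate counts examples at or above the threshold, while the average match score also reflects partial matches below it.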

### 5. Create and run a benchmark

With our evaluation components ready, we can now create a benchmark in PhariaStudio and run our evaluation on the test dataset.

In [None]:
benchmark_repository = StudioBenchmarkRepository(studio_client=studio_client)
evaluation_logic = QaEvaluationLogic()
aggregation_logic = QaAggregationLogic()

benchmark = benchmark_repository.create_benchmark(
    dataset_id=studio_dataset.id,
    eval_logic=evaluation_logic,
    aggregation_logic=aggregation_logic,
    name="keyword-matching-benchmark", # Benchmark name needs to be unique
    description="This benchmark evaluates the keyword matching between the model's output and the expected output.",
)

benchmark.id

Next, we trigger the benchmark execution.

In [None]:
benchmark_repository = StudioBenchmarkRepository(studio_client=studio_client)
benchmark = benchmark_repository.get_benchmark(
    benchmark_id=benchmark.id,
    eval_logic=evaluation_logic,
    aggregation_logic=aggregation_logic,
)

benchmark_execution_id = benchmark.execute(
    task=task,
    name=str(uuid4()),
)

After the benchmark completes, you can view detailed results in the PhariaStudio interface under Evaluate/Benchmarks (see [Create and submit evaluations](https://docs.aleph-alpha.com/products/pharia-ai/pharia-studio/tutorial/write-a-simple-evaluation/) for more details).

### 6. Improving your RAG application

Based on the evaluation results, you can identify areas for improvement in your RAG application. Common improvements include:

1. **Refining the prompt**: Adjust the prompt to encourage more precise reference citation
2. **Adjusting retrieval parameters**: Modify the number of retrieved documents or relevance thresholds
3. **Enhancing document chunking**: Change how documents are split and indexed
4. **Implementing better ranking**: Add reranking steps to prioritise the most relevant documents
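
To decide which of these improvements to try first, it can help to see which questions fail and which keywords are missing most often. The sketch below assumes the per-example results have been collected into plain dictionaries with the fields of `QaEvaluation` plus the question; how you obtain them (for example, by running `QaEvaluationLogic` locally) is left open. The sample data is made up and is not the output of an actual benchmark run.

```python
from collections import Counter


def summarize_failures(results):
    """Return (failed_questions, missing_keyword_counts) for per-example results.

    Hypothetical helper; each result is a dict with the keys
    'question', 'missing_keywords', and 'passed'.
    """
    failed_questions = [r["question"] for r in results if not r["passed"]]
    missing_counts = Counter(
        keyword for r in results for keyword in r["missing_keywords"]
    )
    return failed_questions, missing_counts


# Made-up results for illustration only
results = [
    {"question": "Question A", "missing_keywords": ["alpha"], "passed": True},
    {"question": "Question B", "missing_keywords": ["beta", "gamma"], "passed": False},
    {"question": "Question C", "missing_keywords": ["delta", "gamma"], "passed": False},
]

failed, missing = summarize_failures(results)
print(failed)                   # ['Question B', 'Question C']
print(missing.most_common(1))   # [('gamma', 2)]
```

A keyword that is missing across many questions may point to a retrieval or chunking problem, whereas isolated misses may simply indicate that the expected keyword was too specific.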