# 2. Evaluation - Advanced implementation using LLM-as-a-Judge
<a id="evaluation2"></a>

This tutorial shows how to implement a more complex evaluation setup that uses an LLM as a judge. It relies on several helper functions and therefore requires advanced knowledge of Python. If you haven't followed the [Simple Evaluation](./1.%20Evaluation%20%28Simple%29%20-%20Testing%20Your%20RAG%20Application.ipynb) tutorial, we advise working through it first and then returning to this one for the more advanced setup.

<div class="alert alert-info">
<b>Note:</b> This tutorial runs entirely in this Jupyter Notebook.
</div>

## Prerequisites

Before starting, ensure you have completed the [**"0. Getting Started"**](./0.%20Getting%20Started.ipynb) step of this evaluation section. In addition, the QA Skill in this folder uses a specific collection; make sure that it is available in the testing environment.


In [None]:
# The collection used by the QA Skill. If it is not available, edit qa.py to use a different collection.
NAMESPACE = "Studio"
COLLECTION = "papers"
INDEX = "asym-64"

Now import the necessary libraries and set up your environment. We start by importing the components from the PhariaStudio and PhariaInference SDKs that help us create and run our evaluations:

In [None]:
from dotenv import load_dotenv
from os import getenv
from pharia_skill import ChatParams, Message
from pydantic import BaseModel
from collections.abc import Iterable
from statistics import mean
from uuid import uuid4

from pharia_studio_sdk import StudioClient
from pharia_inference_sdk.core import NoOpTracer, Task, TaskSpan
from pharia_studio_sdk.evaluation import (
    Example,
    SingleOutputEvaluationLogic,
    StudioBenchmarkRepository,
    StudioDatasetRepository,
    AggregationLogic,
)

from qa import Input, Output, custom_rag

## Procedure

### 1. Connect to PhariaStudio

First, we need to establish a connection to PhariaStudio, which will be used to store our evaluation datasets, benchmarks, and traces. The StudioClient provides an interface for creating and managing these resources:

In [None]:
load_dotenv(override=True)

PHARIA_AI_TOKEN = getenv("PHARIA_AI_TOKEN")
PHARIA_STUDIO_PROJECT_NAME = getenv("PHARIA_STUDIO_PROJECT_NAME")
PHARIA_STUDIO_ADDRESS = getenv("PHARIA_STUDIO_ADDRESS")
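
Because `getenv` returns `None` for unset variables, a missing entry in your `.env` file only surfaces later as a less obvious connection error. The following is a minimal sketch of an early check; the helper name `missing_env_vars` is hypothetical and not part of any SDK used in this tutorial.

```python
import os

# Names of the variables this tutorial reads from the environment.
REQUIRED_VARS = (
    "PHARIA_AI_TOKEN",
    "PHARIA_STUDIO_PROJECT_NAME",
    "PHARIA_STUDIO_ADDRESS",
)


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in the given mapping.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```

If the check reports missing names, revisit the "0. Getting Started" step before creating the client.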

In [None]:
studio_client = StudioClient(
    project=PHARIA_STUDIO_PROJECT_NAME,
    studio_url=PHARIA_STUDIO_ADDRESS,
    auth_token=PHARIA_AI_TOKEN,
    create_project=True,
)

### 2. Create a task wrapper for your RAG Skill

To evaluate our RAG Skill, we need to wrap it in a PhariaInference SDK task. This wrapper serves as an adapter between the Skill implementation and the evaluation framework. As the evaluation framework runs locally, we can simply use `DevCsi` to produce the output.

In [None]:
from pharia_skill.testing import DevCsi


class QATask(Task[Input, Output]):

    def do_run(self, input: Input, task_span: TaskSpan) -> Output:
        # If you want to enable tracing, uncomment the following line
        # This triggers double tracing when executing benchmarks
        # csi = DevCsi(project=PHARIA_STUDIO_PROJECT_NAME)
        csi = DevCsi()
        return custom_rag(csi, input)

Before proceeding, verify that your task wrapper correctly interfaces with the deployed Skill:

In [None]:
test_input = Input(question="What is a transformer?")

task = QATask()
task.run(test_input, NoOpTracer())

### 3. Create an evaluation dataset

#### 3.1 Create a test dataset



First, we create an example test dataset that pairs questions on different topics with their expected answers. For each question, this effectively gives us a ground truth against which we can judge the LLM's responses.
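
Each record in the test set is a plain dictionary with the keys `question`, `generated_answer`, and `sources`. As a sketch of how the shape of such records could be checked before uploading them, the hypothetical helper below (`find_invalid_records`, not part of any SDK) reports records that are missing a key or have an unexpected type. The sample records use placeholder answers.

```python
# Keys every test record is expected to carry, as used in this tutorial.
REQUIRED_KEYS = {"question", "generated_answer", "sources"}


def find_invalid_records(records):
    """Return (index, problem) pairs for records with an unexpected shape.

    Hypothetical helper for illustration only.
    """
    problems = []
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((index, f"missing keys: {sorted(missing)}"))
        elif not record["question"].strip():
            problems.append((index, "empty question"))
        elif not isinstance(record["sources"], list):
            problems.append((index, "sources must be a list"))
    return problems


# Placeholder records; the second one lacks "generated_answer".
sample = [
    {
        "question": "What is self-attention?",
        "generated_answer": "placeholder answer",
        "sources": ["Attention Is All You Need"],
    },
    {"question": "What is LSTM?", "sources": ["Attention Is All You Need"]},
]
print(find_invalid_records(sample))
```

An empty result means every record has the expected keys and types.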

In [None]:
test_set = [
    {
        "question": "What is mixture-of-experts?",
        "generated_answer": """## SUMMARY: 
Mixture-of-experts (MoE) is a neural network architecture that involves multiple expert networks, each specializing in a different subset of the input data, and a gating network that determines the weights for combining the outputs of these experts. This approach allows for more efficient and flexible modeling of complex data distributions by leveraging the strengths of individual experts.

## DETAILS: 
The mixture-of-experts (MoE) model is designed to improve the performance and efficiency of neural networks, especially in scenarios where the data is complex, heterogeneous, or has varying densities. The core idea behind MoE is to have a collection of "expert" networks, where each expert is a neural network that specializes in a specific part of the input data space. Instead of having a single, large network trying to model the entire data distribution, MoE divides the workload among these experts, allowing each to become highly proficient in its assigned domain.

A critical component of the MoE architecture is the "gating network." The gating network's role is to determine, for each input, which of the expert networks is most likely to produce the correct output. It does this by computing a set of weights, where each weight corresponds to the relevance of an expert to the current input. These weights are then used to combine the outputs of the expert networks, effectively creating a weighted average of their predictions. The gating network learns to assign higher weights to experts that are more likely to be correct for a given input, and lower weights to those that are less relevant.

The MoE model is trained end-to-end, meaning that both the expert networks and the gating network are updated simultaneously during the training process. This allows the model to learn not only the parameters of the expert networks but also how to effectively route inputs to the appropriate experts. The training process typically involves minimizing a loss function that measures the difference between the model's predictions and the true outputs.

One of the key benefits of the MoE approach is its ability to handle complex, heterogeneous data more effectively than traditional neural network architectures. By allowing different parts of the model to specialize in different aspects of the data, MoE can capture a wider range of patterns and relationships. Additionally, MoE can be more computationally efficient than larger, monolithic models, as only the relevant experts need to be activated for a given input, reducing the computational workload.

However, the MoE model also introduces additional complexity, such as the need to determine the optimal number of experts and the architecture of both the expert and gating networks. Furthermore, training MoE models can be challenging due to the complex interactions between the gating network and the expert networks, requiring careful tuning of hyperparameters and training procedures. Despite these challenges, the mixture-of-experts model has shown promising results in various applications, including natural language processing, computer vision, and recommender systems, making it a valuable tool in the deep learning toolkit.
""",
        "sources": ["Attention Is All You Need"],
    },
    {
        "question": "What is a Large Language Model?",
        "generated_answer": """## SUMMARY: 
A Large Language Model (LLM) is a type of artificial intelligence (AI) model designed to process and understand human language, typically trained on vast amounts of text data to learn patterns, relationships, and structures of language. These models are capable of generating coherent and contextually relevant text, answering questions, translating languages, and even creating content, making them a significant advancement in natural language processing (NLP).

## DETAILS: 
Large Language Models are a subset of deep learning models that are specifically designed to handle the complexities and nuances of human language. They are trained on massive datasets of text, which can include books, articles, research papers, websites, and even social media posts. The primary goal of an LLM is to learn the statistical patterns and relationships within language, allowing it to predict the next word in a sequence, given the context of the previous words. This capability enables LLMs to perform a wide range of tasks, including but not limited to:

1. **Text Generation:** LLMs can create text that is often indistinguishable from that written by humans. This capability is used in applications such as content creation, chatbots, and automated writing tools.
2. **Language Translation:** By understanding the patterns and structures of different languages, LLMs can translate text from one language to another with a high degree of accuracy.
3. **Question Answering:** LLMs can be fine-tuned to answer questions based on the information they have been trained on, making them useful for creating virtual assistants and question-answering systems.
4. **Text Summarization:** These models can summarize long pieces of text into shorter, more digestible versions, highlighting the key points and main ideas.
5. **Sentiment Analysis:** LLMs can analyze text to determine the sentiment or emotional tone behind it, which is useful in applications such as customer service and social media monitoring.

The training of LLMs involves several key steps and technologies:

1. **Data Collection:** Gathering a large, diverse dataset of text.
2. **Model Architecture:** Designing the model's architecture, which often involves transformer models due to their effectiveness in handling sequential data like text.
3. **Training:** Using powerful computing resources to train the model on the collected data, which can take weeks, months, or even years.
4. **Fine-Tuning:** After initial training, the model can be fine-tuned for specific tasks by training it on smaller, task-specific datasets.

Examples of Large Language Models include transformer-based models like BERT (Bidirectional Encoder Representations from Transformers), RoBERTa (Robustly optimized BERT approach), and more recently, models like LLaMA (Large Language Model Meta AI) and PaLM (Pathways Language Model). These models have achieved state-of-the-art results in various NLP tasks and have been integrated into numerous applications and services.

However, LLMs also come with challenges and limitations, such as requiring significant computational resources for training and deployment, potential biases inherited from the training data, and ethical considerations regarding privacy, misinformation, and the potential for generating harmful content. Despite these challenges, the development and application of Large Language Models continue to advance, promising to revolutionize how we interact with information and each other through language.
""",
        "sources": ["Attention Is All You Need"],
    },
    {
        "question": "What is a Sequence?",
        "generated_answer": """## SUMMARY: 
According to the paper "Attention Is All You Need" by Vaswani et al., a sequence refers to a list of tokens, such as words or characters, that are processed in a specific order. In the context of the Transformer model introduced in the paper, sequences are the primary input and output data structure, where each sequence represents a sentence, phrase, or document.

## DETAILS: 
In the "Attention Is All You Need" paper, the authors propose the Transformer model, which relies heavily on the concept of sequences to process input data. According to the paper, a sequence is defined as a list of tokens, where each token can be a word, character, or subword (a subunit of a word). The sequence is the fundamental data structure used to represent input and output data in the Transformer model.

The Transformer model is designed to handle sequences of varying lengths, making it particularly well-suited for natural language processing tasks such as machine translation, text classification, and language modeling. The model's architecture is based on self-attention mechanisms, which allow it to weigh the importance of different tokens in the input sequence relative to each other.

In the context of the Transformer model, sequences are processed as follows:

1. **Tokenization:** The input text is broken down into individual tokens, such as words or subwords.
2. **Embedding:** Each token is embedded into a vector space, where semantically similar tokens are mapped to nearby points.
3. **Positional Encoding:** The embedded tokens are augmented with positional encodings, which capture the order and position of each token in the sequence.
4. **Self-Attention:** The model applies self-attention mechanisms to the input sequence, allowing it to attend to different tokens and weigh their importance relative to each other.
5. **Output:** The final output is generated based on the attended tokens, which can be used for tasks such as translation, classification, or language modeling.

The use of sequences as the primary data structure in the Transformer model allows for efficient and parallelizable processing of input data, making it particularly well-suited for large-scale natural language processing tasks. The sequence-based architecture of the Transformer model has been widely adopted and has achieved state-of-the-art results in many NLP tasks.

In the paper, the authors also introduce several key concepts related to sequences, including:

* **Sequence length:** The number of tokens in the input sequence.
* **Token embedding:** The vector representation of each token in the sequence.
* **Positional encoding:** The augmentation of token embeddings with positional information to capture the order and position of each token.
* **Self-attention:** The mechanism by which the model attends to different tokens in the sequence and weighs their importance relative to each other.

Overall, the concept of sequences plays a central role in the "Attention Is All You Need" paper, enabling the development of the
""",
        "sources": ["Attention Is All You Need"],
    },
    {
        "question": "What is translation?",
        "generated_answer": """## SUMMARY: 
In the context of the "Attention Is All You Need" paper and natural language processing (NLP), translation refers to the task of converting text from one language to another, while preserving the original meaning, context, and content. This involves using machine learning models, such as the Transformer, to learn the patterns and relationships between languages and generate accurate translations.

## DETAILS: 
Machine translation is a fundamental task in NLP, aiming to automatically translate text from a source language to a target language. The goal is to produce a translation that is not only grammatically correct but also conveys the same meaning, tone, and nuance as the original text.

In the context of the Transformer model, translation is a sequence-to-sequence task, where the input sequence is the text in the source language, and the output sequence is the translated text in the target language. The model learns to map the input sequence to the output sequence through a process of encoding, attending, and decoding.

The translation process involves several key steps:

1. **Encoding:** The source language text is encoded into a continuous representation, capturing the semantic meaning and context of the input sequence.
2. **Attention:** The model attends to different parts of the input sequence, weighing their importance and relevance to the translation task.
3. **Decoding:** The model generates the translated text, one token at a time, based on the encoded input sequence and the attention weights.
4. **Output:** The final translated text is generated, which can be a sentence, paragraph, or document.

The Transformer model is particularly well-suited for machine translation tasks due to its ability to:

* **Handle long-range dependencies:** The self-attention mechanism allows the model to capture complex relationships between tokens in the input sequence, even if they are far apart.
* **Parallelize computation:** The Transformer model can be parallelized more easily than traditional recurrent neural network (RNN) architectures, making it more efficient for large-scale translation tasks.
* **Learn contextual relationships:** The model can learn to represent words and tokens in context, taking into account the surrounding words and the overall meaning of the sentence.

The "Attention Is All You Need" paper demonstrates the effectiveness of the Transformer model for machine translation tasks, achieving state-of-the-art results on several benchmark datasets, including the WMT 2014 English-to-German and English-to-French translation tasks.

In the context of machine translation, the evaluation metrics used to measure the quality of translations include:

* **BLEU (Bilingual Evaluation Understudy) score:** Measures the similarity between the generated translation and the reference translation.
* **ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score:** Measures the overlap between the generated translation and the reference translation.
* **METEOR (Metric for Evaluation of Translation with Explicit ORdering) score:** Measures the similarity between the generated translation and the reference translation, taking into account the order of the words.

Overall, the translation task in the context of the "Attention Is All You Need" paper and NLP refers to the automatic conversion of text from one language to another, using machine learning models to learn the patterns and relationships between languages and generate accurate translations.
""",
        "sources": ["Attention Is All You Need"],
    },
    {
        "question": "What is the difference between GRNN and RNN?",
        "generated_answer": """## SUMMARY: 
The main difference between Gated Recurrent Neural Networks (GRNNs) and Recurrent Neural Networks (RNNs) lies in their ability to handle the vanishing gradient problem and their capacity to learn long-term dependencies in sequential data. GRNNs, including Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU), use gates to control the flow of information, allowing them to learn more complex patterns and maintain gradients over longer sequences. In contrast, traditional RNNs often struggle with vanishing gradients, making it difficult for them to learn long-term dependencies.

## DETAILS: 
Recurrent Neural Networks (RNNs) are a type of neural network designed to handle sequential data, such as time series data, speech, or text. They have a feedback connection that allows the output from the previous time step to be used as input for the current time step, enabling the network to keep track of information over time.

However, traditional RNNs suffer from the vanishing gradient problem, which occurs when the gradients used to update the network's weights become smaller as they are backpropagated through time. This makes it challenging for RNNs to learn long-term dependencies in data, as the gradients may become too small to be useful.

Gated Recurrent Neural Networks (GRNNs), on the other hand, are designed to address the vanishing gradient problem. They use gates to control the flow of information into and out of the network's memory cells, allowing them to learn more complex patterns and maintain gradients over longer sequences.

The key differences between GRNNs and RNNs are:

1. **Gates:** GRNNs use gates to control the flow of information, whereas RNNs do not. These gates help GRNNs to selectively forget or remember information, allowing them to learn more complex patterns.
2. **Memory Cells:** GRNNs have memory cells that can store information for long periods, whereas RNNs do not. These memory cells help GRNNs to learn long-term dependencies in data.
3. **Vanishing Gradient Problem:** GRNNs are less susceptible to the vanishing gradient problem than RNNs, as the gates help to maintain gradients over longer sequences.
4. **Learning Long-Term Dependencies:** GRNNs are better at learning long-term dependencies in data than RNNs, due to their ability to maintain gradients and store information in memory cells.

Examples of GRNNs include:

* **Long Short-Term Memory (LSTM) Networks:** LSTMs use three gates (input, output, and forget gates) to control the flow of information into and out of the network's memory cells.
* **Gated Recurrent Units (GRUs):** GRUs use two gates (reset and update gates) to control the flow of information into and out of the network's memory cells.

In contrast, traditional RNNs do not use gates or memory cells, and are more prone to the vanishing gradient problem.

In summary, while both RNNs and GRNNs are designed to handle sequential data, GRNNs are better equipped to learn long-term dependencies and handle the vanishing gradient problem, making them a popular choice for many applications, including natural language processing, speech recognition, and time series forecasting.      """,
        "sources": ["Attention Is All You Need"],
    },
    {
        "question": "What is LSTM?",
        "generated_answer": """## SUMMARY: 
Long Short-Term Memory (LSTM) is a type of Recurrent Neural Network (RNN) architecture designed to handle the vanishing gradient problem and learn long-term dependencies in sequential data. LSTMs use memory cells and gates to control the flow of information, allowing them to selectively remember and forget information over time.

## DETAILS: 
LSTMs were introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997 as a solution to the vanishing gradient problem in traditional RNNs. The key components of an LSTM network are:

1. **Memory Cells:** LSTMs have memory cells that can store information for long periods of time. These cells are the core of the LSTM architecture and allow the network to learn long-term dependencies.
2. **Gates:** LSTMs use three types of gates to control the flow of information into and out of the memory cells:
    * **Input Gate:** Controls the amount of new information that is added to the memory cell.
    * **Output Gate:** Controls the amount of information that is output from the memory cell.
    * **Forget Gate:** Controls the amount of information that is discarded from the memory cell.
3. **Cell State:** The cell state is the internal state of the memory cell, which stores the information that is being remembered.

The LSTM architecture works as follows:

1. **Input:** The input is passed through the input gate, which determines how much of the new information is added to the memory cell.
2. **Forget Gate:** The forget gate determines how much of the previous information is discarded from the memory cell.
3. **Cell State Update:** The cell state is updated based on the input and the forget gate.
4. **Output Gate:** The output gate determines how much of the information in the memory cell is output.
5. **Hidden State:** The hidden state is the output of the LSTM cell, which is used as input to the next time step.

LSTMs have several advantages over traditional RNNs, including:

* **Ability to learn long-term dependencies:** LSTMs can learn dependencies that span hundreds or even thousands of time steps.
* **Resistance to vanishing gradients:** LSTMs are less susceptible to the vanishing gradient problem, which makes them more stable and easier to train.
* **Ability to handle variable-length sequences:** LSTMs can handle sequences of varying lengths, making them suitable for applications such as speech recognition and natural language processing.

LSTMs have been widely used in many applications, including:

* **Speech recognition:** LSTMs are used in speech recognition systems to model the temporal dependencies in speech signals.
* **Natural language processing:** LSTMs are used in natural language processing tasks such as language modeling, text classification, and machine translation.
* **Time series forecasting:** LSTMs are used in time series forecasting to model the temporal dependencies in data.

Some of the key variants of LSTMs include:

* **Gated Recurrent Units (GRUs):** GRUs are a simpler variant of LSTMs that use only two gates (reset and update gates) instead of three.
* **Bidirectional LSTMs:** Bidirectional LSTMs are used to model both the forward and backward dependencies in a sequence.
* **Stacked LSTMs:** Stacked LSTMs are used to model complex dependencies in data by stacking multiple LSTM layers on top of each other.
""",
        "sources": ["Attention Is All You Need"],
    },
    {
        "question": "What are RNNs?",
        "generated_answer": """## SUMMARY: 
Recurrent Neural Networks (RNNs) are a type of neural network designed to handle sequential data, such as time series data, speech, or text. RNNs have a feedback connection that allows the output from the previous time step to be used as input for the current time step, enabling the network to keep track of information over time.

## DETAILS: 
RNNs are a fundamental concept in deep learning, and they are particularly useful for modeling sequential data. The key characteristics of RNNs are:

1. **Sequential Input:** RNNs are designed to handle sequential input data, where each sample is dependent on the previous samples.
2. **Feedback Connection:** RNNs have a feedback connection that allows the output from the previous time step to be used as input for the current time step.
3. **Hidden State:** RNNs have a hidden state that captures the information from the previous time steps, allowing the network to keep track of the context.
4. **Recurrent Connections:** RNNs have recurrent connections that allow the network to feedback the output from the previous time step to the current time step.

The RNN architecture works as follows:

1. **Input:** The input is passed through the network at each time step.
2. **Hidden State Update:** The hidden state is updated based on the input and the previous hidden state.
3. **Output:** The output is generated based on the hidden state.
4. **Feedback:** The output is fed back to the network as input for the next time step.

RNNs have several advantages, including:

* **Ability to handle sequential data:** RNNs are designed to handle sequential data, making them suitable for applications such as speech recognition, natural language processing, and time series forecasting.
* **Ability to capture temporal dependencies:** RNNs can capture temporal dependencies in data, allowing them to model complex patterns and relationships.
* **Flexibility:** RNNs can be used for a wide range of applications, including classification, regression, and generation tasks.

However, RNNs also have some limitations, including:

* **Vanishing Gradient Problem:** RNNs can suffer from the vanishing gradient problem, where the gradients used to update the network's weights become smaller as they are backpropagated through time.
* **Exploding Gradient Problem:** RNNs can also suffer from the exploding gradient problem, where the gradients become larger as they are backpropagated through time.
* **Computational Complexity:** RNNs can be computationally expensive to train, especially for long sequences.

Some of the key variants of RNNs include:

* **Simple RNNs:** Simple RNNs are the basic form of RNNs, where the hidden state is updated based on the input and the previous hidden state.
* **LSTMs (Long Short-Term Memory):** LSTMs are a type of RNN that uses memory cells and gates to control the flow of information, allowing them to learn long-term dependencies.
* **GRUs (Gated Recurrent Units):** GRUs are a type of RNN that uses gates to control the flow of information, allowing them to learn long-term dependencies.
* **Bidirectional RNNs:** Bidirectional RNNs are used to model both the forward and backward dependencies in a sequence.

RNNs have been widely used in many applications, including:

* **Speech Recognition:** RNNs are used in speech recognition systems to model the temporal dependencies in speech signals.
* **Natural Language Processing:** RNNs are used in natural language processing tasks such as language modeling, text classification, and machine translation.
* **Time Series Forecasting:** RNNs are used in time series forecasting to model the temporal dependencies in data.
* **Generative Models:** RNNs are used in generative models such as language models and text generators to generate coherent and context-dependent text.

RNNs (Recurrent Neural Networks) are a type of neural network that process sequential data, allowing them to maintain a state that can be updated as new data is processed.
""",
        "sources": ["Attention Is All You Need"],
    },
    {
        "question": "What is self-attention?",
        "generated_answer": """## SUMMARY: 
Self-attention is a mechanism in neural networks that allows the model to attend to different parts of the input sequence simultaneously and weigh their importance. It's a key component of the Transformer architecture, introduced in the paper "Attention Is All You Need" by Vaswani et al. Self-attention enables the model to capture long-range dependencies and contextual relationships in the input data, making it particularly useful for natural language processing tasks.

## DETAILS: 
Self-attention is a type of attention mechanism that allows the model to compute the representation of a sequence by attending to all positions in the sequence and weighing their importance. This is different from traditional recurrent neural networks (RNNs), which process the input sequence one step at a time, using the previous hidden state to inform the next step.

The self-attention mechanism works as follows:

1. **Input Sequence:** The input sequence is first embedded into a vector space, where each token (e.g., word or character) is represented as a vector.
2. **Query, Key, and Value:** The embedded input sequence is then split into three vectors: Query (Q), Key (K), and Value (V).
3. **Attention Weights:** The attention weights are computed by taking the dot product of the Query and Key vectors and applying a softmax function. This produces a set of weights that represent the importance of each token in the sequence relative to the others.
4. **Weighted Sum:** The attention weights are then used to compute a weighted sum of the Value vectors, which produces the final output of the self-attention mechanism.

The self-attention mechanism has several benefits, including:

* **Parallelization:** Self-attention can be parallelized more easily than RNNs, making it faster to train and more efficient for large-scale datasets.
* **Long-range dependencies:** Self-attention can capture long-range dependencies in the input sequence, which is particularly useful for natural language processing tasks where context is important.
* **Flexibility:** Self-attention can be used in a variety of architectures, including the Transformer, which has become a standard model for many NLP tasks.

There are several types of self-attention mechanisms, including:

* **Scaled Dot-Product Attention:** This is the original self-attention mechanism introduced in the Transformer paper, which uses a scaled dot product to compute the attention weights.
* **Multi-Head Attention:** This is a variant of self-attention that uses multiple attention heads to capture different types of relationships in the input sequence.
* **Hierarchical Attention:** This is a variant of self-attention that uses a hierarchical structure to capture relationships at different levels of granularity.

Self-attention has been widely adopted in many NLP tasks, including:

* **Machine Translation:** Self-attention is used in machine translation models to capture the relationships between words in the input and output sequences.
* **Text Classification:** Self-attention is used in text classification models to capture the relationships between words in the input sequence and the class labels.
* **Language Modeling:** Self-attention is used in language models to capture the relationships between words in the input sequence and predict the next word.

Overall, self-attention is a powerful mechanism that has revolutionized the field of NLP and has many potential applications in other areas of AI research.
""",
        "sources": ["Attention Is All You Need"],
    },
    {
        "question": "What is Attention?",
        "generated_answer": """## SUMMARY: 
Attention is a mechanism in neural networks that allows the model to focus on specific parts of the input data that are relevant for the task at hand. It's a way to selectively weigh the importance of different input elements, such as words in a sentence or pixels in an image, and concentrate on the most relevant ones.

## DETAILS: 
Attention is a key component of many neural network architectures, including transformers, recurrent neural networks (RNNs), and convolutional neural networks (CNNs). The basic idea behind attention is to compute a set of weights that represent the importance of each input element, and then use these weights to compute a weighted sum of the input elements.

The attention mechanism typically consists of three components:

1. **Query:** The query is the input to the attention mechanism, which is typically a vector or a matrix.
2. **Key:** The key is a set of vectors or matrices that represent the input data.
3. **Value:** The value is a set of vectors or matrices that represent the output of the attention mechanism.

The attention mechanism computes the weights by taking the dot product of the query and key, and then applying a softmax function to obtain a probability distribution over the input elements. The weights are then used to compute a weighted sum of the value vectors, which represents the output of the attention mechanism.

Attention has several benefits, including:

* **Improved performance:** Attention can improve the performance of neural networks by allowing them to focus on the most relevant input elements.
* **Interpretability:** Attention can provide insights into which input elements are most relevant for the task at hand.
* **Flexibility:** Attention can be used in a variety of neural network architectures and can be applied to different types of input data.

There are several types of attention mechanisms, including:

* **Scaled Dot-Product Attention:** This is a type of attention mechanism that uses a scaled dot product to compute the weights.
* **Multi-Head Attention:** This is a type of attention mechanism that uses multiple attention heads to compute the weights.
* **Hierarchical Attention:** This is a type of attention mechanism that uses a hierarchical structure to compute the weights.

Attention has been widely used in many applications, including:

* **Natural Language Processing (NLP):** Attention has been used in NLP tasks such as machine translation, text summarization, and question answering.
* **Computer Vision:** Attention has been used in computer vision tasks such as image classification, object detection, and image segmentation.
* **Speech Recognition:** Attention has been used in speech recognition tasks such as speech-to-text and voice recognition.

### Attention Mechanism

The attention mechanism can be represented as follows:
```
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d)) * V
```
Where:

* `Q` is the query vector
* `K` is the key vector
* `V` is the value vector
* `d` is the dimensionality of the query and key vectors
* `softmax` is the softmax function
* `*` is the dot product operator
* `^T` is the transpose operator

Note: This is a simplified representation of the attention mechanism. The actual implementation may vary depending on the specific use case and requirements.
""",
        "sources": ["Attention Is All You Need"],
    },
    {
        "question": "What is a transformer?",
        "generated_answer": """## SUMMARY: 
A Transformer is a type of neural network architecture introduced in the paper "Attention Is All You Need" by Vaswani et al. in 2017. It's primarily designed for sequence-to-sequence tasks, such as machine translation, text summarization, and image captioning. The Transformer model relies entirely on self-attention mechanisms to process input sequences, eliminating the need for recurrent neural networks (RNNs) and convolutional neural networks (CNNs).

## DETAILS: 
The Transformer architecture is based on the concept of self-attention, which allows the model to attend to different parts of the input sequence simultaneously and weigh their importance. This is different from traditional RNNs, which process the input sequence one step at a time, using the previous hidden state to inform the next step.

The Transformer model consists of two main components:

1. **Encoder:** The encoder takes in a sequence of tokens (e.g., words or characters) and outputs a sequence of vectors. The encoder is composed of a stack of identical layers, each of which consists of two sub-layers: self-attention and feed-forward neural network (FFNN).
2. **Decoder:** The decoder takes in the output of the encoder and generates a sequence of tokens. The decoder is also composed of a stack of identical layers, each of which consists of three sub-layers: self-attention, encoder-decoder attention, and FFNN.

The Transformer architecture has several key features:

* **Self-Attention:** The Transformer model uses self-attention mechanisms to process the input sequence. This allows the model to attend to different parts of the sequence simultaneously and weigh their importance.
* **Multi-Head Attention:** The Transformer model uses multi-head attention, which allows the model to capture different types of relationships between the input tokens.
* **Positional Encoding:** The Transformer model uses positional encoding to preserve the order of the input sequence.
* **Layer Normalization:** The Transformer model uses layer normalization to normalize the output of each layer.

The Transformer model has several advantages over traditional sequence-to-sequence models:

* **Parallelization:** The Transformer model can be parallelized more easily than RNNs, making it faster to train and more efficient for large-scale datasets.
* **Scalability:** The Transformer model can handle longer input sequences than RNNs, making it more suitable for tasks that require processing long sequences.
* **Performance:** The Transformer model has achieved state-of-the-art results in many sequence-to-sequence tasks, including machine translation, text summarization, and image captioning.

Some of the key applications of the Transformer model include:

* **Machine Translation:** The Transformer model has been widely used for machine translation tasks, including English-to-French, English-to-German, and English-to-Chinese.
* **Text Summarization:** The Transformer model has been used for text summarization tasks, including summarizing news articles and documents.
* **Image Captioning:** The Transformer model has been used for image captioning tasks, including generating captions for images.
* **Chatbots:** The Transformer model has been used for chatbot applications, including generating responses to user input.

Overall, the Transformer model is a powerful and flexible architecture that has revolutionized the field of natural language processing and has many potential applications in other areas of AI research.

### Transformer Architecture

The Transformer architecture can be represented as follows:
```
Encoder:
- Input Embedding
- Positional Encoding
- Layer 1:
    - Self-Attention
    - FFNN
- Layer 2:
    - Self-Attention
    - FFNN
- ...
- Layer N:
    - Self-Attention
    - FFNN

Decoder:
- Input Embedding
- Positional Encoding
- Layer 1:
    - Self-Attention
    - Encoder-Decoder Attention
    - FFNN
- Layer 2:
    - Self-Attention
    - Encoder-Decoder Attention
    - FFNN
- ...
- Layer N:
    - Self-Attention
    - Encoder-Decoder Attention
    - FFNN
```
Note: This is a simplified representation of the Transformer architecture. The actual implementation may vary depending on the specific use case and requirements.
""",
        "sources": ["Attention Is All You Need"],
    },
]

#### 3.2 Pydantic model for expected output

Next, we define a Pydantic model that fixes the structure of the expected output, so that each example's reference answer and sources are validated against their declared types.

In [None]:
class ExpectedOutput(BaseModel):
    answer: str | None
    sources: list[str] | None

#### 3.3 Upload the dataset

In [None]:
studio_dataset_repo = StudioDatasetRepository(studio_client=studio_client)

examples = [
    Example(
        input=Input(question=example["question"]),
        expected_output=ExpectedOutput(
            answer=example["generated_answer"], sources=example["sources"]
        ),
    )
    for example in test_set
]

studio_dataset = studio_dataset_repo.create_dataset(
    examples=examples, dataset_name="demo-dataset"
)

studio_dataset.id

To access the dataset, follow the tutorial [Store an evaluation dataset in PhariaStudio](https://docs.aleph-alpha.com/products/pharia-ai/pharia-studio/tutorial/store-dataset-in-data-platform/).

### 4. Define evaluation logic

The PhariaStudio SDK requires two components: an `EvaluationLogic` to evaluate individual examples, and an `AggregationLogic` to aggregate the individual evaluations into overall metrics.
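
Conceptually, the two components form a map-then-reduce pipeline: each example is scored on its own, then the scores are condensed into overall metrics. The sketch below illustrates that flow in plain Python. `evaluate_one`, `aggregate_all`, the exact-match scoring, and the sample pairs are hypothetical stand-ins for illustration and are not part of the PhariaStudio SDK.

```python
from statistics import mean

# Hypothetical stand-ins for illustration only; the SDK-based
# implementations follow in sections 4.1 and 4.2.


def evaluate_one(expected: str, generated: str) -> float:
    """Per-example step: score a single output (here: naive exact match)."""
    return 1.0 if expected.strip().lower() == generated.strip().lower() else 0.0


def aggregate_all(scores: list[float]) -> float:
    """Aggregation step: reduce per-example scores to one overall metric."""
    return round(mean(scores), 2) if scores else 0.0


# Made-up (expected, generated) pairs.
pairs = [
    ("Paris", "paris"),
    ("Berlin", "Munich"),
]
scores = [evaluate_one(expected, generated) for expected, generated in pairs]
overall = aggregate_all(scores)
print(scores, overall)
```

Keeping the two steps separate means the per-example scoring can change (for instance, from exact match to an LLM judge) without touching how the results are averaged.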

#### 4.1 EvaluationLogic - 'LLM as a judge' implementation

First, we set up the evaluation logic that is used for each individual example. Our advanced evaluation implements an *LLM as a judge* evaluation, which uses LLMs to assess the quality of generated responses across multiple dimensions.

##### Base class

The `Checker` abstract base class provides the foundational infrastructure for all evaluation checkers:

- **Model integration**: Connects to the PhariaAI platform using `DevCsi` and utilises `llama-3.3-70b-instruct` as the evaluation model
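- **Weighted scoring**: Instead of trusting only the single sampled score token, reads the top-10 token-level log probabilities at the score position and returns the probability-weighted average of all candidate digit tokens
- **Robust parsing**: Falls back to the lowest score (1) when the reply is not a valid integer between 0 and 10, and falls back to that parsed score when no usable log probabilities are returned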
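
To make the weighted scoring concrete, here is a minimal standalone sketch of the idea, assuming the model returns the top candidate tokens for the score position together with their log probabilities. The `TokenLogprob` dataclass and the sample values are made up for illustration; they only mimic the `token` and `logprob` fields that the `Checker` code reads.

```python
import math
from dataclasses import dataclass


@dataclass
class TokenLogprob:
    """Hypothetical stand-in for one top-logprob entry (token + logprob)."""

    token: str
    logprob: float


def weighted_score(candidates: list[TokenLogprob], fallback: float) -> float:
    """Probability-weighted average over candidate tokens that are valid scores."""
    probs = {
        float(c.token): math.exp(c.logprob)
        for c in candidates
        if c.token.isdigit() and 0 <= float(c.token) <= 10
    }
    total = sum(probs.values())
    if total == 0:
        # No numeric candidate at all: keep the parsed fallback score.
        return fallback
    # Renormalise so the digit probabilities sum to 1, then average.
    return round(sum(score * p / total for score, p in probs.items()), 1)


# Illustrative values: 60% on "8", 30% on "7", 10% on a non-digit token.
candidates = [
    TokenLogprob("8", math.log(0.6)),
    TokenLogprob("7", math.log(0.3)),
    TokenLogprob(" ", math.log(0.1)),
]
print(weighted_score(candidates, fallback=1.0))  # prints 7.7
```

Reading only the sampled token would give 8 here. The weighted value of 7.7 reflects that the judge was partly undecided between 7 and 8.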

##### Individual checker implementations

From this we derive three specialised checker classes, each designed to assess different aspects of answer quality:

1. **AccuracyChecker**: Focuses on factual correctness by comparing specific claims, data points, and verifiable information between the generated and expected answers. The scoring rubric ranges from 1-2 (major factual errors) to 9-10 (highly accurate with no factual errors).

2. **FactualityChecker**: Specifically designed to detect hallucinations and fabricated content. It evaluates whether the generated answer contains information not present in the reference, helping identify when models add unsupported claims or speculative statements presented as facts.

3. **CompletenessChecker**: Assesses coverage of content by determining how well the generated answer addresses all main topics and key points from the expected answer. This ensures that responses are comprehensive rather than just accurate.



In [None]:
import logging
import math

from pharia_skill import TopLogprobs, ChatResponse
from pharia_skill.testing import DevCsi
from jinja2 import Template
from abc import ABC

logging.basicConfig(
    level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s"
)


class Checker(ABC):

    def __init__(self) -> None:
        self.dev_csi = DevCsi(project=PHARIA_STUDIO_PROJECT_NAME)
        self.evaluation_model = "llama-3.3-70b-instruct"
        self.logger = logging.getLogger(__name__)

    def get_metric(
        self, question: str, expected_answer: str, generated_answer: str
    ) -> float:
        system_prompt = self.system_prompt
        user_prompt = self.user_prompt.format(
            question=question,
            expected_answer=expected_answer,
            generated_answer=generated_answer,
        )

        messages = [
            Message.system(system_prompt),
            Message.user(user_prompt),
            Message.assistant("Score: "),
        ]
        params = ChatParams(max_tokens=10, temperature=0.0, logprobs=TopLogprobs(10))
        response = self.dev_csi.chat(
            model=self.evaluation_model, messages=messages, params=params
        )

        content = response.message.content.strip()
        fallback_score = self.parse_score(content)

        probs = getattr(response, "logprobs", None)
        if (
            not probs
            or not hasattr(probs, "__getitem__")
            or len(probs) < 2
            or not hasattr(probs[-2], "top")
        ):
            self.logger.warning("No logprobs found")
            return fallback_score

        logprobs = probs[-2].top
        return self.compute_weighted_score(logprobs, fallback_score)

    @staticmethod
    def parse_score(score_str: str) -> float:
        """Convert the score string to a float if it is an integer in [0, 10], else return 1.0"""
        return (
            float(score_str)
            if score_str.isdigit() and 0 <= float(score_str) <= 10
            else 1.0
        )

    @staticmethod
    def compute_weighted_score(logprobs, fallback_score: float) -> float:
        """Compute weighted score from token logprobs"""
        digit_probs = {
            float(prob.token): math.exp(prob.logprob)
            for prob in logprobs
            if prob.token.isdigit() and 0 <= float(prob.token) <= 10
        }

        total = sum(digit_probs.values())
        if total == 0:
            return fallback_score

        normalized = {k: v / total for k, v in digit_probs.items()}
        return round(sum(k * v for k, v in normalized.items()), 1)


class AccuracyChecker(Checker):

    def __init__(self) -> None:
        super().__init__()
        self.system_prompt = """You are a highly precise evaluation assistant specialized in assessing factual accuracy and correctness.

Your task is to evaluate how accurately the generated answer reflects the facts and information from the expected answer.

SCORING RUBRIC (1-10):
- 9-10: Highly accurate - No factual errors, all claims correctly supported
- 7-8: Mostly accurate - Minor factual discrepancies or unclear statements
- 5-6: Moderately accurate - Some factual errors but generally correct direction
- 3-4: Low accuracy - Multiple factual errors or significant misrepresentations
- 1-2: Poor accuracy - Major factual errors, contradicts expected information

EVALUATION CRITERIA:
1. Are specific facts, figures, and data points correct?
2. Are the claims and statements supported by the expected information?
3. Are there any contradictions with the expected answer?
4. Is the information presented without distortion or misinterpretation?

Return only a single integer score between 1 and 10. 
"""
        self.user_prompt = """TASK: Evaluate the factual accuracy of the generated answer against the expected reference answer.

QUESTION:
{question}

EXPECTED ANSWER (Reference):
{expected_answer}

GENERATED ANSWER (To Evaluate):
{generated_answer}

EVALUATION STEPS:
1. Identify factual claims, data points, and specific information in both answers
2. Compare the accuracy of facts, figures, dates, and other verifiable information
3. Check for any contradictions or misrepresentations
4. Assess whether claims are properly supported by the reference information

IMPORTANT: Respond with ONLY a single integer from 1 to 10. Do not include any explanation or additional text.
"""


class FactualityChecker(Checker):
    def __init__(self) -> None:
        super().__init__()
        self.system_prompt = """You are a highly precise evaluation assistant specialized in assessing information precision and relevance.

Your task is to evaluate how well the generated answer stays grounded in relevant information without adding hallucinated or irrelevant content.

SCORING RUBRIC (1-10):
- 9-10: Excellent precision - Only relevant information, no hallucinations or fabrications
- 7-8: Good precision - Mostly relevant content, minimal irrelevant information
- 5-6: Fair precision - Some irrelevant details or minor unsupported claims
- 3-4: Poor precision - Significant irrelevant content or unverifiable claims
- 1-2: Very poor precision - Extensive hallucinations or fabricated information

EVALUATION CRITERIA:
1. Does the answer stick to information that can be verified from the expected content?
2. Are there any fabricated details, dates, names, or claims not in the reference?
3. Is all information relevant to the topic and question asked?
4. Are there any speculative statements presented as facts?

Return only a single integer score between 1 and 10. 
"""
        self.user_prompt = """TASK: Evaluate the information precision and relevance of the generated answer, focusing on detecting hallucinations or fabricated content.

QUESTION:
{question}

EXPECTED ANSWER (Reference):
{expected_answer}

GENERATED ANSWER (To Evaluate):
{generated_answer}

EVALUATION STEPS:
1. Compare the generated answer against the reference to identify any added information
2. Check for fabricated details, names, dates, or claims not present in the reference
3. Assess whether all information is relevant to the topic
4. Look for speculative statements presented as definitive facts

IMPORTANT: Respond with ONLY a single integer from 1 to 10. Do not include any explanation or additional text.
"""


class CompletenessChecker(Checker):
    def __init__(self) -> None:
        super().__init__()
        self.system_prompt = """You are a highly precise evaluation assistant specialized in assessing content completeness.

Your task is to evaluate how completely the generated answer covers the key information from the expected answer.

SCORING RUBRIC (1-10):
- 9-10: Comprehensive coverage - All major points and most minor details included
- 7-8: Good coverage - All major points included, some minor details may be missing
- 5-6: Adequate coverage - Most major points included, several details missing
- 3-4: Incomplete coverage - Some major points included, many details missing
- 1-2: Poor coverage - Few or no major points included

EVALUATION CRITERIA:
1. Are all main topics/themes from the expected answer present?
2. Are supporting details and examples adequately covered?
3. Is the depth of information comparable to the expected answer?
4. Are any critical pieces of information missing?

Return only a single integer score between 1 and 10.
"""
        self.user_prompt = """TASK: Evaluate how completely the generated answer covers the content from the expected answer.

QUESTION:
{question}

EXPECTED ANSWER (Reference):
{expected_answer}

GENERATED ANSWER (To Evaluate):
{generated_answer}

EVALUATION STEPS:
1. Identify the main topics and key points in the expected answer
2. Check if each main topic is addressed in the generated answer
3. Assess the depth and detail level compared to the expected answer
4. Consider any missing critical information

IMPORTANT: Respond with ONLY a single integer from 1 to 10. Do not include any explanation or additional text.
"""

##### QaEvaluation data model


As we are now computing more scores, we also need to extend the `QaEvaluation` class to capture all of the different evaluation dimensions:

1. **Completeness score**: Measures how thoroughly the generated answer covers the key information present in the expected reference answer
2. **Accuracy score**: Evaluates the factual correctness of claims and information in the generated response
3. **Factuality score**: Assesses the absence of hallucinations and fabricated content not present in the reference material
4. **Correct sources**: Lists the source documents that were properly cited and match the expected sources
5. **Incorrect sources**: Catalogues any source documents that were cited incorrectly or are not in the expected source list
6. **Source accuracy**: Calculates the precision of source citations (correct sources / total cited sources)
7. **Source recall**: Measures the coverage of expected source documents (found expected sources / total expected sources)

Beyond the definition of the metrics, the `QaEvaluationLogic` class also implements several key functions:

- **`do_evaluate_single_output()`**: The main evaluation method that orchestrates the assessment of a single question-answer pair by calling all three LLM-based checkers and computing source metrics
- **`_check_sources()`**: Performs case-insensitive comparison between expected and generated source citations, returning lists of correctly and incorrectly cited sources
- **`_calculate_source_accuracy()`**: Computes the precision of source citations by dividing the number of correct sources by the total number of cited sources
- **`_calculate_source_recall()`**: Calculates the recall of expected sources by dividing the number of found expected sources by the total number of expected sources
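
As a worked example of the two source metrics, the following simplified sketch computes precision and recall on normalised source titles. It is not the class implementation: the helper names are hypothetical, the edge-case handling is reduced, and "Some Other Paper" is a made-up title.

```python
def normalise(sources: list[str]) -> set[str]:
    """Lower-case and strip each source so the comparison ignores case and whitespace."""
    return {s.lower().strip() for s in sources}


def source_precision_recall(
    expected: list[str], generated: list[str]
) -> tuple[float, float]:
    """Return (precision, recall) of the cited sources against the expected ones."""
    expected_set, generated_set = normalise(expected), normalise(generated)
    hits = expected_set & generated_set
    # Precision: share of cited sources that were expected.
    precision = len(hits) / len(generated_set) if generated_set else 0.0
    # Recall: share of expected sources that were actually cited.
    recall = len(hits) / len(expected_set) if expected_set else 1.0
    return precision, recall


expected = ["Attention Is All You Need"]
generated = ["attention is all you need ", "Some Other Paper"]
precision, recall = source_precision_recall(expected, generated)
print(precision, recall)  # prints 0.5 1.0
```

One of the two cited sources is correct, so precision is 0.5. The only expected source was found, so recall is 1.0.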


In [None]:
class QaEvaluation(BaseModel):
    completeness_score: float = 0.0  # Coverage of expected content
    accuracy_score: float = 0.0  # Factual correctness
    factuality_score: float = 0.0  # Absence of hallucinations
    correct_sources: list[str] = []  # Properly cited sources
    incorrect_sources: list[str] = []  # Incorrectly cited sources
    source_accuracy: float = 0.0  # Precision of source citations
    source_recall: float = 0.0  # Recall of expected sources


class QaEvaluationLogic(
    SingleOutputEvaluationLogic[Input, Output, ExpectedOutput, QaEvaluation]
):

    def __init__(self) -> None:
        super().__init__()
        self.accuracy_checker = AccuracyChecker()
        self.factuality_checker = FactualityChecker()
        self.completeness_checker = CompletenessChecker()

    def do_evaluate_single_output(
        self, example: Example[Input, ExpectedOutput], output: Output
    ) -> QaEvaluation:

        completeness_score = self.completeness_checker.get_metric(
            question=example.input.question,
            expected_answer=example.expected_output.answer,
            generated_answer=output.answer,
        )

        accuracy_score = self.accuracy_checker.get_metric(
            question=example.input.question,
            expected_answer=example.expected_output.answer,
            generated_answer=output.answer,
        )

        factuality_score = self.factuality_checker.get_metric(
            question=example.input.question,
            expected_answer=example.expected_output.answer,
            generated_answer=output.answer,
        )

        correct_sources, incorrect_sources = self._check_sources(
            expected_sources=example.expected_output.sources,
            generated_sources=output.sources,
        )
        return QaEvaluation(
            completeness_score=completeness_score,
            accuracy_score=accuracy_score,
            factuality_score=factuality_score,
            correct_sources=correct_sources,
            incorrect_sources=incorrect_sources,
            source_accuracy=self._calculate_source_accuracy(
                expected_sources=example.expected_output.sources,
                generated_sources=output.sources,
            ),
            source_recall=self._calculate_source_recall(
                expected_sources=example.expected_output.sources,
                generated_sources=output.sources,
            ),
        )

    def _check_sources(
        self, expected_sources: list[str], generated_sources: list[str]
    ) -> tuple[list[str], list[str]]:
        if not generated_sources:
            return [], []

        if not expected_sources:
            return [], generated_sources.copy()

        expected_set = {source.lower().strip() for source in expected_sources}
        generated_set = {source.lower().strip() for source in generated_sources}

        correct_sources_lower = expected_set.intersection(generated_set)

        correct_sources = []
        incorrect_sources = []

        for source in generated_sources:
            if source.lower().strip() in correct_sources_lower:
                correct_sources.append(source)
            else:
                incorrect_sources.append(source)

        return correct_sources, incorrect_sources

    def _calculate_source_accuracy(
        self, expected_sources: list[str], generated_sources: list[str]
    ) -> float:

        if not generated_sources:
            return 0.0 if expected_sources else 1.0

        correct_sources, _ = self._check_sources(expected_sources, generated_sources)
        return len(correct_sources) / len(generated_sources)

    def _calculate_source_recall(
        self, expected_sources: list[str], generated_sources: list[str]
    ) -> float:
        if not expected_sources:
            return 1.0

        if not generated_sources:
            return 0.0

        expected_set = {source.lower().strip() for source in expected_sources}
        generated_set = {source.lower().strip() for source in generated_sources}

        found_expected = expected_set.intersection(generated_set)
        # Divide by the de-duplicated count so repeated expected sources do not lower recall
        return len(found_expected) / len(expected_set)

To ensure our evaluation logic works correctly, we test it with a sample question and answer about neural network encoders:

In [None]:
test_example = test_set[0]

task = QATask()
test_input = Input(question=test_example.get("question"))
output = task.run(test_input, NoOpTracer())

example = Example(
    input=test_input,
    expected_output=ExpectedOutput(
        answer=test_example.get("generated_answer"), sources=test_example.get("sources")
    ),
)

evaluation_logic = QaEvaluationLogic()
evaluation = evaluation_logic.do_evaluate_single_output(example, output)
print(f"Evaluation: {evaluation}")

#### 4.2 AggregationLogic

To assess overall system performance, we need to aggregate individual evaluation results into metrics. The advanced `QaAggregationLogic` class handles the evaluation data from our LLM-based assessment and outputs the rounded averages across all individual test runs:

In [None]:
class QaAggregatedEvaluation(BaseModel):
    average_completeness_score: float
    average_accuracy_score: float
    average_factuality_score: float
    average_source_accuracy: float
    average_source_recall: float


class QaAggregationLogic(
    AggregationLogic[
        QaEvaluation,
        QaAggregatedEvaluation,
    ]
):
    def aggregate(self, evaluations: Iterable[QaEvaluation]) -> QaAggregatedEvaluation:
        evaluation_list = list(evaluations)
        if len(evaluation_list) == 0:
            return QaAggregatedEvaluation(
                average_completeness_score=0.0,
                average_accuracy_score=0.0,
                average_factuality_score=0.0,
                average_source_accuracy=0.0,
                average_source_recall=0.0,
            )

        average_completeness_score = round(
            mean(eval.completeness_score for eval in evaluation_list), 2
        )
        average_accuracy_score = round(
            mean(eval.accuracy_score for eval in evaluation_list), 2
        )
        average_factuality_score = round(
            mean(eval.factuality_score for eval in evaluation_list), 2
        )
        average_source_accuracy = round(
            mean(eval.source_accuracy for eval in evaluation_list), 2
        )
        average_source_recall = round(
            mean(eval.source_recall for eval in evaluation_list), 2
        )

        return QaAggregatedEvaluation(
            average_completeness_score=average_completeness_score,
            average_accuracy_score=average_accuracy_score,
            average_factuality_score=average_factuality_score,
            average_source_accuracy=average_source_accuracy,
            average_source_recall=average_source_recall,
        )

To verify our aggregation mechanism, we test it by aggregating two examples:

In [None]:
example_1 = test_set[0]

input_1 = Input(question=example_1.get("question"))
output_1 = task.run(input_1, NoOpTracer())
example_1 = Example(
    input=input_1,
    expected_output=ExpectedOutput(
        answer=example_1.get("generated_answer"), sources=example_1.get("sources")
    ),
)

example_2 = test_set[1]

input_2 = Input(question=example_2.get("question"))
output_2 = task.run(input_2, NoOpTracer())
example_2 = Example(
    input=input_2,
    expected_output=ExpectedOutput(
        answer=example_2.get("generated_answer"), sources=example_2.get("sources")
    ),
)

aggregation_logic = QaAggregationLogic()
evaluation_logic = QaEvaluationLogic()

evaluation_1 = evaluation_logic.do_evaluate_single_output(example_1, output_1)
evaluation_2 = evaluation_logic.do_evaluate_single_output(example_2, output_2)
aggregation = aggregation_logic.aggregate([evaluation_1, evaluation_2])

print(evaluation_1)
print(evaluation_2)
print(f"Aggregation: {aggregation}")

### 5. Create and run a benchmark

With our evaluation components ready, we can now create a benchmark in PhariaStudio and run our evaluation on the test dataset.

In [None]:
benchmark_repository = StudioBenchmarkRepository(studio_client=studio_client)

benchmark = benchmark_repository.create_benchmark(
    dataset_id=studio_dataset.id,
    eval_logic=evaluation_logic,
    aggregation_logic=aggregation_logic,
    name="LLM-as-a-judge-benchmark",
    description="Evaluates the QA task using LLM-as-a-judge scores and source citation checks.",
)

benchmark.id

Next, we trigger the benchmark execution.

In [None]:
benchmark = benchmark_repository.get_benchmark(
    benchmark_id=benchmark.id,
    eval_logic=evaluation_logic,
    aggregation_logic=aggregation_logic,
)

benchmark_execution_id = benchmark.execute(
    task=task,
    name=str(uuid4()),
)

After the benchmark completes, you can view detailed results in the PhariaStudio interface under Evaluate/Benchmarks (check [Create and submit evaluations](https://docs.aleph-alpha.com/products/pharia-ai/pharia-studio/tutorial/write-a-simple-evaluation/) for more details).

### 6. Improving your RAG application

Based on the evaluation results, you can identify areas for improvement in your RAG application. Common improvements include:

1. **Refining the prompt**: Adjust the prompt to encourage more precise reference citation
2. **Adjusting retrieval parameters**: Modify the number of retrieved documents or relevance thresholds
3. **Enhancing document chunking**: Change how documents are split and indexed
4. **Implementing better ranking**: Add reranking steps to prioritise the most relevant documents
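
To judge whether one of these changes actually helped, compare the aggregated metrics of a benchmark run before and after the change. The sketch below shows one possible way to do that in plain Python. The metric values are made up for illustration, and `metric_deltas` is a hypothetical helper, not an SDK function.

```python
# Made-up aggregated metrics for two benchmark runs (illustration only).
baseline = {"average_accuracy_score": 7.2, "average_source_recall": 0.60}
candidate = {"average_accuracy_score": 7.5, "average_source_recall": 0.85}


def metric_deltas(
    before: dict[str, float], after: dict[str, float]
) -> dict[str, float]:
    """Return the per-metric change for metrics present in both runs."""
    return {
        name: round(after[name] - before[name], 2)
        for name in before
        if name in after
    }


deltas = metric_deltas(baseline, candidate)
for name, delta in deltas.items():
    # A positive delta means the candidate run scored higher on that metric.
    print(f"{name}: {delta:+.2f}")
```

Changing one thing at a time between runs makes it easier to attribute a metric change to a specific improvement.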