## Running the Offline Evaluations for the Report Generation Agent

Offline evaluations are run against a **pre-defined dataset**. They perform **detailed evaluations** of the **outputs** of the agentic system and of the **steps** it has taken to produce those outputs.

This dataset is called the **expected results** or **ground-truth** dataset. In this case, it is a **handcrafted** dataset with **inputs, outputs and trajectories** for a few known use cases.

The evaluations are run by Langfuse and the results are visualized there.

## Setting up

The code below sets the notebook's working directory to the project's root folder, defines the default constants, and checks that the required environment variables are set.

The environment variables can be set in the `.env` file in the root folder of the project.
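As a quick pre-check, the sketch below lists which of the required variables are missing from the current process environment. The variable names come from the assertion messages in the setup cell; the helper `missing_env_vars` is hypothetical and not part of `aieng.agent_evals`. It only inspects `os.environ`, so it assumes the `.env` file has already been loaded into the process.

```python
import os
from collections.abc import Mapping

# Variable names taken from the assertion messages in the setup cell.
REQUIRED_ENV_VARS = (
    "REPORT_GENERATION_DB__DATABASE",
    "LANGFUSE_SECRET_KEY",
    "LANGFUSE_PUBLIC_KEY",
    "LANGFUSE_HOST",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required variable names that are unset or blank in `env`."""
    return [name for name in REQUIRED_ENV_VARS if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Unlike the assertions in the setup cell, this reports every missing variable at once instead of stopping at the first one.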

In [None]:
import json
import os
from pathlib import Path
from pprint import pprint

from aieng.agent_evals.async_client_manager import AsyncClientManager
from aieng.agent_evals.langfuse import upload_dataset_to_langfuse


# Setting the notebook directory to the project's root folder
if Path("").absolute().name == "eval-agents":
    print(f"Notebook path is already the root path: {Path('').absolute()}")
else:
    os.chdir(Path("").absolute().parent.parent)
    print(f"The notebook path has been set to: {Path('').absolute()}")

client_manager = AsyncClientManager.get_instance()
assert client_manager.configs.report_generation_db.database, (
    "[ERROR] The database path is not set! Please configure the REPORT_GENERATION_DB__DATABASE environment variable."
)
assert client_manager.configs.langfuse_secret_key, (
    "[ERROR] The Langfuse secret key is not set! Please configure the LANGFUSE_SECRET_KEY environment variable."
)
assert client_manager.configs.langfuse_public_key, (
    "[ERROR] The Langfuse public key is not set! Please configure the LANGFUSE_PUBLIC_KEY environment variable."
)
assert client_manager.configs.langfuse_host, (
    "[ERROR] The Langfuse base URL is not set! Please configure the LANGFUSE_HOST environment variable."
)

print("All environment variables have been set.")


EVALUATION_DATASET_PATH = "implementations/report_generation/data/OnlineRetailReportEval.json"
LANGFUSE_DATASET_NAME = "OnlineRetailReportEval"

## Taking a Look at the Ground Truth Dataset

The ground-truth dataset is located at `implementations/report_generation/data/OnlineRetailReportEval.json`. The code below will display one of its elements as an example:

In [None]:
with open(EVALUATION_DATASET_PATH) as f:
    ground_truth = json.load(f)

print(f"Ground-truth dataset size: {len(ground_truth)}")
print("First element:")
pprint(ground_truth[0])

Here is an explanation of the data structure of the dataset samples:
```python
{
    'id': str,  # The ID of the sample
    'input': str,  # The input to be used to test the report generation agent
    'expected_output': {  # The expected outputs of the agent
        'final_report': {  # The output data for the final report the agent generates. 
                           # These values match the input the agent sends to the `write_xlsx` function
            'filename': str,  # The name of the report file
            'report_columns': list[str],  # The names of the columns of the report
            'report_data': list[list[Any]],  # A two-dimensional array of values for the rows of the report
        },
        'trajectory': {  # information about the trajectory the agent should take to produce the report
            'actions': list[str],  # A list of the names of the actions the agent should take, in order
            'description': list[str],  # A description of what the parameters that are sent to each one of
                                       # the actions are supposed to be doing
        }
    }
}
```
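To make the structure above concrete, here is a minimal sketch of a check that a sample has the described shape. The function `validate_sample` and the example values (including the `query_database` action name) are hypothetical illustrations, not part of the project code or the real dataset.

```python
from typing import Any


def validate_sample(sample: dict[str, Any]) -> list[str]:
    """Return a list of structural problems found in one sample (empty if none)."""
    problems: list[str] = []
    for key in ("id", "input"):
        if not isinstance(sample.get(key), str):
            problems.append(f"'{key}' must be a string")

    expected = sample.get("expected_output")
    if not isinstance(expected, dict):
        return problems + ["'expected_output' must be a dict"]

    report = expected.get("final_report")
    if not isinstance(report, dict):
        problems.append("'final_report' must be a dict")
    else:
        if not isinstance(report.get("report_columns"), list):
            problems.append("'report_columns' must be a list")
        rows = report.get("report_data")
        if not isinstance(rows, list) or not all(isinstance(r, list) for r in rows):
            problems.append("'report_data' must be a list of lists")

    trajectory = expected.get("trajectory")
    if not isinstance(trajectory, dict):
        problems.append("'trajectory' must be a dict")
    else:
        for key in ("actions", "description"):
            if not isinstance(trajectory.get(key), list):
                problems.append(f"'{key}' must be a list")

    return problems


# Hypothetical sample, for illustration only.
example = {
    "id": "1",
    "input": "Generate a report of total sales per country.",
    "expected_output": {
        "final_report": {
            "filename": "sales_per_country.xlsx",
            "report_columns": ["Country", "TotalSales"],
            "report_data": [["France", 100.0], ["Germany", 250.5]],
        },
        "trajectory": {
            "actions": ["query_database", "final_response"],
            "description": ["Aggregate sales by country", "Return the report link"],
        },
    },
}
print(validate_sample(example))
```

Running it over every element of `ground_truth` is a cheap way to catch malformed samples before uploading them.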

## Uploading the dataset to Langfuse

Use the function below to **upload** the ground truth dataset to Langfuse so it can be used **during the evaluation**:

In [None]:
await upload_dataset_to_langfuse(
    EVALUATION_DATASET_PATH,
    LANGFUSE_DATASET_NAME,
)
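The internals of `upload_dataset_to_langfuse` are not shown in this notebook. As a rough mental model only, assuming each ground-truth sample becomes one Langfuse dataset item, the mapping could look like the sketch below. The helper `to_dataset_item_kwargs` and the choice to store the sample ID as metadata are assumptions made for illustration; the keyword names follow the Langfuse Python SDK's `create_dataset_item` method as I understand it.

```python
from typing import Any


def to_dataset_item_kwargs(sample: dict[str, Any], dataset_name: str) -> dict[str, Any]:
    """Map one ground-truth sample to keyword arguments for a dataset item."""
    return {
        "dataset_name": dataset_name,
        "input": sample["input"],
        "expected_output": sample["expected_output"],
        # Assumption: the sample ID is kept as metadata for traceability.
        "metadata": {"id": sample["id"]},
    }


# Hypothetical sample, for illustration only.
sample = {
    "id": "1",
    "input": "Generate a report of total sales per country.",
    "expected_output": {"final_report": {}, "trajectory": {}},
}
item_kwargs = to_dataset_item_kwargs(sample, "OnlineRetailReportEval")
print(item_kwargs["dataset_name"], item_kwargs["metadata"])

# With a configured Langfuse client, each item could then be created with
# something like `langfuse.create_dataset_item(**item_kwargs)`.
```

The sketch deliberately stops short of calling Langfuse, so it runs without credentials or network access.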

## LLM-as-a-judge Evaluators

Two **LLM-as-a-judge evaluators** are set to run against this dataset and the agent's output:
1. A **Final Result Evaluator**, which evaluates the agent's output against the contents of the `final_report` key
2. A **Trajectory Evaluator**, which evaluates the agent's output against the contents of the `trajectory` key

Here are the instructions for both evaluators (as per `aieng.agent_evals.evaluation.report_generation.prompts`):
```python
TRAJECTORY_EVALUATOR_INSTRUCTIONS = """\
You are evaluating if an agent has followed the correct trajectory to generate a report.\
The agent is a Report Generation Agent that uses the SQLite database tool to generate a report\
and return the report as a downloadable file to the user.\
You will be presented with the "Question" that has been asked to the agent along with two sets of data:\
- The "Expected Trajectory" of the agent, which contains:\
    - A list ids for the actions the agent is expected to perform\
    - A list of rough descriptions of what has been passed as parameters to the actions\
- The "Actual Trajectory" of the agent, which contains:\
    - A list ids for the actions the agent performed\
    - A list of parameters that has been passed to each one of the actions\
It's OK if the agent makes mistakes and performs additional steps, or if the queries do not exactly match\
the description, as long as the queries performed end up satisfying the "Question".\
It is important that the last action to be of type "final_response" and that it produces a link to the report file.
"""

RESULT_EVALUATOR_INSTRUCTIONS = """\
Evaluate whether the "Proposed Answer" to the given "Question" matches the "Ground Truth". \
Disregard the following aspects when comparing the "Proposed Answer" to the "Ground Truth": \
- The order of the items should not matter, unless explicitly specified in the "Question". \
- The formatting of the values should not matter, unless explicitly specified in the "Question". \
- The column and row names have to be similar but not necessarily exact, unless explicitly specified in the "Question". \
- The filename has to be similar by name but not necessarily exact, unless explicitly specified in the "Question". \
- It is ok if the filename is missing. \
- The numerical values should be equal with a tolerance of 0.01. \
- The report data in the "Proposed Answer" should have the same number of rows as in the "Ground Truth". \
- It is OK if the report data in the "Proposed Answer" contains extra columns or if the rows are in a different order, \
unless explicitly specified in the "Question".
"""
```
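The evaluators themselves are LLM judges, so the rules above are applied by a model rather than by code. Still, a few of them are mechanical enough to sketch deterministically, which can help when sanity-checking a judge's verdict. The functions below are hypothetical illustrations, not the project's implementation: they cover the 0.01 numeric tolerance, the equal-row-count rule, order-insensitive rows, tolerated extra columns, and the requirement that the last action is `final_response`. Text cells are compared case-insensitively here, which is a simplification of "similar but not necessarily exact".

```python
from typing import Any

TOLERANCE = 0.01  # Numeric tolerance stated in RESULT_EVALUATOR_INSTRUCTIONS


def values_match(expected: Any, actual: Any) -> bool:
    """Compare two cells, allowing a small tolerance for numbers."""
    numeric = (int, float)
    if isinstance(expected, numeric) and isinstance(actual, numeric):
        return abs(expected - actual) <= TOLERANCE
    # Simplification: non-numeric cells are compared as case-insensitive text.
    return str(expected).strip().lower() == str(actual).strip().lower()


def rows_match(expected_rows: list[list[Any]], actual_rows: list[list[Any]]) -> bool:
    """Order-insensitive comparison that tolerates extra columns in `actual_rows`.

    Uses greedy matching, which is good enough for a sketch but is not a
    full bipartite matching.
    """
    if len(expected_rows) != len(actual_rows):
        return False
    unmatched = list(actual_rows)
    for expected in expected_rows:
        for candidate in unmatched:
            # Every expected value must appear somewhere in the candidate row.
            if all(any(values_match(e, a) for a in candidate) for e in expected):
                unmatched.remove(candidate)
                break
        else:
            return False
    return True


def ends_with_final_response(actions: list[str]) -> bool:
    """Check that the trajectory's last action is `final_response`."""
    return bool(actions) and actions[-1] == "final_response"


expected = [["A", 10.0], ["B", 5.5]]
actual = [["b", 5.504, "extra"], ["a", 10.0, "extra"]]
print(rows_match(expected, actual))
```

A check like this cannot judge whether the queries satisfy the "Question", which is the part that genuinely needs the LLM judge.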

## Running the Evaluations

To run those two evaluators against all of the ground-truth dataset samples, run the command below:

In [None]:
# Running as a CLI command to avoid issues between Langfuse's
# experiment runner and Jupyter
# NOTE: This will take a while to execute in a notebook environment
# It runs faster when executed in a regular console session

!uv run --env-file .env python -m implementations.report_generation.evaluate --max-concurrency 1

## Checking the Results

At the end of the run, you will see a summary in the console.

To see detailed results of the evaluation runs:
1. Go to your project on Langfuse
2. Click on **Datasets**
3. Click on the dataset name
4. Click on one of the runs

You will see a more detailed summary of the experiment run, and you can also see the details of each of the runs, including f