# Evaluation

This notebook shows how to simulate chats and how to run bulk evaluation on the resulting conversations.

 ❗ Before running this notebook, make sure you install the dependencies listed in `src/requirements-eval.txt`.

## Dependencies

The evaluation module depends on other modules from project Acumen, specifically for loading conversation histories.

In [None]:
import sys
import os

SOURCE_DIR = "../../src"
sys.path.append(SOURCE_DIR)

from group_chat import create_group_chat
from dotenv import load_dotenv

load_dotenv(os.path.join(SOURCE_DIR, ".env"))

In [None]:
from config import load_agent_config, setup_logging

setup_logging()

agent_config = load_agent_config("default")

In [None]:
from azure.identity import DefaultAzureCredential
from azure.storage.blob.aio import BlobServiceClient
from data_models.data_access import DataAccess

credential = DefaultAzureCredential()
blob_service_client = BlobServiceClient(
    account_url=os.getenv("APP_BLOB_STORAGE_ENDPOINT"),
    credential=credential,
)
data_access = DataAccess(blob_service_client)

In [None]:
INITIAL_QUERIES_CSV_PATH = "./evaluation_sample_initial_queries.csv"
SIMULATION_OUTPUT_PATH = "../data/simulated_chats/patient_4"
EVALUATION_RESULTS_PATH = os.path.join(SIMULATION_OUTPUT_PATH, "evaluation_results")

PATIENT_TIMELINE_REFERENCE_PATH = "../data/patient_timeline_reference/"

## Simulate chats

Below, we simulate conversations based on queries loaded from a `csv` file. The file must have one column with the **patient ID** (as expected by the agents) and one column with the **initial query**, which serves as the conversation starter. Optionally, it may include an additional column for **follow-up questions**.

Each row in the `csv` contains a single follow-up question. When `group_followups=True` (default), the system will combine all follow-ups with the same patient ID and initial query into a single conversation flow, asking them sequentially.
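
To make the grouping concrete, here is a minimal sketch of how rows that share a patient ID and initial query could be merged into one conversation. It is an illustration only, not the `ChatSimulator` implementation: `group_rows` is a hypothetical helper and the sample rows are made up. Only the column names match the ones used in this notebook.

```python
import csv
import io

# Made-up sample data; the column names match those passed to load_initial_queries.
SAMPLE_CSV = """Patient ID,Initial Query,Possible Follow up
patient_4,Orchestrator: Prepare tumor board for Patient ID: patient_4,What is the current staging?
patient_4,Orchestrator: Prepare tumor board for Patient ID: patient_4,Which trials are relevant?
"""


def group_rows(csv_text: str) -> dict[tuple[str, str], list[str]]:
    """Hypothetical helper: collect follow-ups per (patient ID, initial query)."""
    grouped: dict[tuple[str, str], list[str]] = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Patient ID"], row["Initial Query"])
        # Rows with the same key contribute follow-ups in file order.
        grouped.setdefault(key, []).append(row["Possible Follow up"])
    return grouped


conversations = group_rows(SAMPLE_CSV)
for (patient_id, query), followups in conversations.items():
    print(patient_id, "|", query, "|", followups)
```

In this sketch the two rows collapse into a single conversation with two follow-ups asked in order. Without grouping, each row would presumably become its own conversation with one follow-up.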

In [None]:
from evaluation.chat_simulator import ProceedUser, LLMUser, ChatSimulator

initial_query = "Orchestrator: Prepare tumor board for Patient ID: patient_4"

# user = ProceedUser()
user = LLMUser()

chat_simulator = ChatSimulator(
    simulated_user=user,
    group_chat_kwargs={
        "all_agents_config": agent_config,
        "data_access": data_access,
    },
    trial_count=1,
    max_turns=10,
    output_folder_path=SIMULATION_OUTPUT_PATH,
    save_readable_history=True,
    print_messages=False,
    raise_errors=True,
)

chat_simulator.load_initial_queries(
    csv_file_path=INITIAL_QUERIES_CSV_PATH,
    patients_id_column="Patient ID",
    initial_queries_column="Initial Query",
    followup_column="Possible Follow up",
    group_followups=False,
)

Instead of calling `load_initial_queries`, you may pass the patient IDs, initial queries, and follow-up questions directly to the class constructor:

```python
patient_id = "patient_4"
initial_query = "Orchestrator: Prepare tumor board for Patient ID: patient_4"
# At least an empty string must be given as a followup question
followup_questions = [""]

user = LLMUser()

chat_simulator = ChatSimulator(
    simulated_user=user,
    group_chat_kwargs={
        "all_agents_config": agent_config,
        "data_access": data_access,
    },
    patients_id=[patient_id],
    initial_queries=[initial_query],
    followup_questions=[followup_questions],
)
```

Simulating chats generates a lot of output, which can make this notebook too large to open.

For that reason, please clear the output of at least the next cell.

In [None]:
await chat_simulator.simulate_chats()

For ad-hoc cases, you may also call `chat_simulator.chat` directly:

```python
chat_simulator.chat(
    patients_id=[patient_id],
    initial_queries=[initial_query],
    followup_questions=[followup_questions],
    max_turns=5
)
```

## Evaluation

### Input data
Below, we evaluate the conversations simulated in the previous step. For evaluation, we need the serialized chat context, which in our case is the `json` file generated by the simulation.

The deployed application also stores conversations whenever they are cleared with the message `@Orchestrator clear`. Naturally, that data can also be used for evaluation.

### Reference based metrics
Below you will also notice that some metrics (such as `RougeMetric` and `TBFactMetric`) require ground truth data to generate scores. Internally, these metrics load `.txt` files from a provided folder. **The `txt` file name must be the respective `patient_id`.**

> 💡**Tip**: The chat context `json` includes a top-level key `patient_id` that tracks which patient was the target of the conversation, and is therefore used to match the correct reference data.
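
Before running reference-based metrics, it can help to check that every chat context has matching reference data. The sketch below assumes the chat contexts are `json` files sitting directly in one folder, and relies only on the `patient_id` key and the `<patient_id>.txt` naming rule described above. `find_missing_references` is a hypothetical helper, not part of the evaluation module.

```python
import json
from pathlib import Path


def find_missing_references(chat_dir: str, reference_dir: str) -> list[str]:
    """Hypothetical helper: patient IDs that have no `<patient_id>.txt` reference."""
    missing = set()
    for chat_file in Path(chat_dir).glob("*.json"):
        with chat_file.open(encoding="utf-8") as f:
            chat_context = json.load(f)
        # Skip files that are not chat contexts or that record no patient.
        if not isinstance(chat_context, dict) or not chat_context.get("patient_id"):
            continue
        patient_id = chat_context["patient_id"]
        if not (Path(reference_dir) / f"{patient_id}.txt").is_file():
            missing.add(patient_id)
    return sorted(missing)


# Possible usage with the paths defined earlier in this notebook:
# find_missing_references(SIMULATION_OUTPUT_PATH, PATIENT_TIMELINE_REFERENCE_PATH)
```

An empty list means every conversation found in the folder has a reference file to be scored against.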

In [None]:
from semantic_kernel.connectors.ai.open_ai.services.azure_chat_completion import AzureChatCompletion

from evaluation.evaluator import Evaluator
from evaluation.metrics.agent_selection import AgentSelectionEvaluator
from evaluation.metrics.context_relevancy import ContextRelevancyEvaluator
from evaluation.metrics.info_aggregation import InformationAggregationEvaluator
from evaluation.metrics.rouge import RougeMetric
from evaluation.metrics.intent_resolution import IntentResolutionEvaluator
from evaluation.metrics.factuality import TBFactMetric

In [None]:
llm_service = AzureChatCompletion(
    deployment_name=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    api_version="2024-12-01-preview",
    endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

agent_selection_evaluator = AgentSelectionEvaluator(
    evaluation_llm_service=llm_service,
)

intent_resolution_evaluator = IntentResolutionEvaluator(
    evaluation_llm_service=llm_service,
)

information_aggregation_evaluator = InformationAggregationEvaluator(
    evaluation_llm_service=llm_service,
)

context_relevancy_evaluator = ContextRelevancyEvaluator(
    evaluation_llm_service=llm_service,
    agent_name="PatientHistory",
    context_window=5,
)

rouge_metric = RougeMetric(
    agent_name="PatientHistory",
    reference_dir_path=PATIENT_TIMELINE_REFERENCE_PATH
)

tbfact_metric = TBFactMetric(
    evaluation_llm_service=llm_service,
    agent_name="PatientHistory",
    reference_dir_path=PATIENT_TIMELINE_REFERENCE_PATH,
    context_window=0,
)

In [None]:
evaluator = Evaluator(
    metrics=[
        agent_selection_evaluator,
        # intent_resolution_evaluator,
        # information_aggregation_evaluator,
        # context_relevancy_evaluator,
        # rouge_metric,
        tbfact_metric,
    ],
    output_folder_path=EVALUATION_RESULTS_PATH,
)

evaluator.load_chat_contexts(SIMULATION_OUTPUT_PATH)

Similar to the `ChatSimulator` class, you may skip `load_chat_contexts` by passing the chat contexts directly to the constructor:

```python
from data_models.chat_context import ChatContext

chats_contexts: list[ChatContext]

evaluator = Evaluator(
    chats_contexts=chats_contexts,
    metrics=[
        ...
    ],
    output_folder_path=SIMULATION_OUTPUT_PATH,
)
```

In [None]:
evaluation_results = await evaluator.evaluate()

evaluation_results

For a quick overview, we print a summary of the results below. To better understand the scores and the behaviour of the agents, drill down into the dictionary generated in the previous step (also saved as a `json` in the evaluation output folder). It includes all individual results, explanations, and details specific to each metric.

>💡**Note:** When no reference data is found for a patient, that instance will result in `Error: No reference found for patient ID: `.

In [None]:
for metric_name, metric_result in evaluation_results["metrics"].items():
    # Single quotes inside the f-string keep this valid on Python versions before 3.12.
    print(f"{metric_name}: average_score: {metric_result['average_score']} | num_errors: {metric_result['num_errors']}")
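
As a small step beyond the summary loop, the sketch below ranks metrics so that those with errors or low scores surface first. It relies only on the `metrics`, `average_score`, and `num_errors` keys used above and assumes the scores are numeric. The sample dictionary, including its metric names and values, is made up for illustration, and `rank_metrics` is a hypothetical helper.

```python
def rank_metrics(results: dict) -> list[tuple[str, float, int]]:
    """Hypothetical helper: sort by error count (descending), then score (ascending)."""
    rows = [
        (name, result["average_score"], result["num_errors"])
        for name, result in results["metrics"].items()
    ]
    return sorted(rows, key=lambda row: (-row[2], row[1]))


# Made-up values, shaped like the summary fields printed above.
sample_results = {
    "metrics": {
        "metric_a": {"average_score": 4.5, "num_errors": 0},
        "metric_b": {"average_score": 0.6, "num_errors": 2},
        "metric_c": {"average_score": 3.0, "num_errors": 0},
    }
}

for name, score, errors in rank_metrics(sample_results):
    print(f"{name}: average_score={score} | num_errors={errors}")
```

Passing `evaluation_results` instead of `sample_results` would apply the same ordering to the real output, provided it has the shape assumed here.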