# Evaluation

This notebook shows how to simulate chats and how to run bulk evaluation.

 ❗ Before running this notebook, install the dependencies defined in `src/requirements-eval.txt`.

## Dependencies

The evaluation module depends on other modules from project Acumen, specifically for loading conversation histories.

In [1]:
import sys
import os

SOURCE_DIR = "../../src"
sys.path.append(SOURCE_DIR)

from dotenv import load_dotenv

load_dotenv(os.path.join(SOURCE_DIR, ".env"))

True

In [None]:
from app import create_app_context
from config import setup_logging

setup_logging()
app_ctx = create_app_context()


In [None]:
INITIAL_QUERIES_CSV_PATH = "./evaluation_sample_initial_queries2.csv"
SIMULATION_OUTPUT_PATH = "./simulated_chats/gpt5-opt"
EVALUATION_RESULTS_PATH = os.path.join(SIMULATION_OUTPUT_PATH, "evaluation_results")

PATIENT_TIMELINE_REFERENCE_PATH = "./reference/"


## Simulate chats

Below, we simulate conversations based on queries loaded from a `csv` file. The file must have one column with the **patient ID** (as expected by the agents) and one **initial query** column, which serves as the conversation starter. Optionally, it may include an additional column for **follow-up questions**.

Each row in the `csv` contains a single follow-up question. When `group_followups=True` (default), the system combines all follow-ups that share a patient ID and initial query into a single conversation flow and asks them sequentially.
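
To make the grouping behaviour concrete, the sketch below shows what combining follow-ups per patient ID and initial query amounts to. It is only an illustration built on the standard library, not the `ChatSimulator` implementation: the column names match the ones passed to `load_initial_queries` in this notebook, while the sample rows and the `group_followups` helper are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample data using the column names from this notebook.
SAMPLE_CSV = """Patient ID,Initial Query,Possible Follow up
patient_4,Prepare tumor board,What is the current stage?
patient_4,Prepare tumor board,Which treatments were given?
patient_7,Summarize history,
"""


def group_followups(csv_text: str) -> dict[tuple[str, str], list[str]]:
    """Collect follow-ups that share a patient ID and an initial query."""
    grouped: dict[tuple[str, str], list[str]] = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["Patient ID"], row["Initial Query"])
        grouped[key].append(row["Possible Follow up"])
    return dict(grouped)


conversations = group_followups(SAMPLE_CSV)
for (patient_id, query), followups in conversations.items():
    print(patient_id, "|", query, "|", followups)
```

In this sketch, the two `patient_4` rows end up in one conversation with two follow-ups, while the `patient_7` row keeps a single empty follow-up.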

In [4]:
from evaluation.chat_simulator import ProceedUser, LLMUser, ChatSimulator

initial_query = "Orchestrator: Prepare tumor board for Patient ID: patient_4"

# user = ProceedUser()
user = LLMUser()

chat_simulator = ChatSimulator(
    simulated_user=user,
    group_chat_kwargs={
        "app_ctx": app_ctx,
    },
    trial_count=3,
    max_turns=10,
    output_folder_path=SIMULATION_OUTPUT_PATH,
    save_readable_history=True,
    print_messages=False,
    raise_errors=True,
)

chat_simulator.load_initial_queries(
    csv_file_path=INITIAL_QUERIES_CSV_PATH,
    patients_id_column="Patient ID",
    initial_queries_column="Initial Query",
    followup_column="Possible Follow up",
    group_followups=False,
)


<evaluation.chat_simulator.ChatSimulator at 0x1c0ffec7980>


Instead of calling `load_initial_queries` to load the data needed for simulating chats, you may pass them directly to the class constructor:

```python
patient_id = "patient_4"
initial_query = "Orchestrator: Prepare tumor board for Patient ID: patient_4"
# At least an empty string must be given as a followup question
followup_questions = [""]

user = LLMUser()

chat_simulator = ChatSimulator(
    simulated_user=user,
    group_chat_kwargs={
        "all_agents_config": agent_config,
        "data_access": data_access,
    },
    patients_id=[patient_id],
    initial_queries=[initial_query],
    followup_questions=[followup_questions],
)
```

Simulating chats generates a lot of output, which may make this file too large to open. For that reason, please clear the output of at least the next cell.

In [None]:
await chat_simulator.simulate_chats()

For ad-hoc cases, you may also call `chat_simulator.chat` directly:

```python
chat_simulator.chat(
    patients_id=[patient_id],
    initial_queries=[initial_query],
    followup_questions=[followup_questions],
    max_turns=5
)
```

## Evaluation

### Input data
Below, we evaluate the conversations simulated in the previous step. For evaluation, we need the serialized chat context, which in our case is the `json` file generated by the simulation.

The deployed application also stores conversations whenever they are cleared with the message `@Orchestrator clear`. Naturally, that data can also be used for evaluation.

### Reference based metrics
Below you will also notice that some metrics (such as `RougeMetric` and `TBFactMetric`) require ground truth data to generate scores. Internally, these metrics load `.txt` files from a provided folder. **The `txt` file name must be the respective `patient_id`.**

> 💡**Tip**: The chat context `json` includes a top-level key `patient_id` that tracks which patient was the target of the conversation and is therefore used to match the correct reference data.

In [5]:
from semantic_kernel.connectors.ai.open_ai.services.azure_chat_completion import AzureChatCompletion

from evaluation.evaluator import Evaluator
from evaluation.metrics.agent_selection import AgentSelectionEvaluator
from evaluation.metrics.context_relevancy import ContextRelevancyEvaluator
from evaluation.metrics.info_aggregation import InformationAggregationEvaluator
from evaluation.metrics.rouge import RougeMetric
from evaluation.metrics.intent_resolution import IntentResolutionEvaluator
from evaluation.metrics.factuality import TBFactMetric
from evaluation.metrics.turn_by_turn_agent_selection import TurnByTurnAgentSelectionEvaluator
from evaluation.metrics.turn_by_turn_with_history import TurnByTurnEvaluatorWithContext


In [6]:
llm_service = AzureChatCompletion(
    deployment_name=os.environ["AZURE_OPENAI_DEPLOYMENT_NAME"],
    api_version="2024-12-01-preview",
    endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

#evaluate orchestrator's agent selection by passing full chat history to LLM judge
agent_selection_evaluator = AgentSelectionEvaluator(
    evaluation_llm_service=llm_service,
)

#evaluate orchestrator's agent selection by passing only individual turn to LLM judge
turn_by_turn_evaluator = TurnByTurnAgentSelectionEvaluator(
    evaluation_llm_service=llm_service,
    #scenario="my_scenario", #could be used to differentiate scenarios, leave empty to use default folder name
    #agent_name="agent_name", #specify orchestrator agent name, leave empty to use default agent name from config
)

intent_resolution_evaluator = IntentResolutionEvaluator(
    evaluation_llm_service=llm_service,
)

information_aggregation_evaluator = InformationAggregationEvaluator(
    evaluation_llm_service=llm_service,
)

context_relevancy_evaluator = ContextRelevancyEvaluator(
    evaluation_llm_service=llm_service,
    agent_name="PatientHistory",
    context_window=5,
)

rouge_metric = RougeMetric(
    agent_name="PatientHistory",
    reference_dir_path=PATIENT_TIMELINE_REFERENCE_PATH
)

tbfact_metric = TBFactMetric(
    evaluation_llm_service=llm_service,
    agent_name="PatientHistory",
    reference_dir_path=PATIENT_TIMELINE_REFERENCE_PATH,
    context_window=0,
)



In [7]:
intent_system_prompt = """
You are an expert evaluator of medical AI assistants. Your task is to evaluate whether the AI orchestrator (called "Orchestrator") 
correctly understood and addressed the user's intent in each turn of the conversation.

Rate how well the orchestrator understood and addressed the user's intent on a scale from 1 to 5:
1: Poor - Completely misunderstood or failed to address user's intent
2: Below Average - Partially misunderstood or inadequately addressed intent
3: Average - Basic understanding but could have addressed intent better
4: Good - Clear understanding and appropriate response to intent
5: Excellent - Perfect understanding and optimal response to user's intent

Your response must begin with "Rating: X" where X is your score (1-5), followed by your detailed explanation.
"""
#evaluate the agent-user conversation by passing the full conversation to llm judge
turn_by_turn_intent_evaluator = TurnByTurnAgentSelectionEvaluator(
    evaluation_llm_service=llm_service,
    system_prompt=intent_system_prompt,
    metric_name="turn_by_turn_intent_resolution",
    description="Evaluates intent understanding and resolution for each turn"
)

#evaluate the agent-user conversation by turn, considering the context of previous turns
turn_by_turn_intent_evaluator_with_context = TurnByTurnEvaluatorWithContext(
    evaluation_llm_service=llm_service,
    system_prompt=intent_system_prompt,
    metric_name="turn_by_turn_intent_resolution",
    agent_name="Orchestrator", #could pass any other defined agents
    description="Evaluates intent understanding and resolution for each turn"
)

info_agg_system_prompt = """
You are an expert evaluator of medical AI assistants. Your task is to evaluate a conversation between a user and an AI orchestrator (called "Orchestrator") that coordinates multiple specialized medical agents.

Focus specifically on the orchestrator's ability to INTEGRATE INFORMATION FROM MULTIPLE AGENTS to form comprehensive answers. Consider:
1. Did the orchestrator effectively combine information from different specialized agents?
2. Did it synthesize potentially contradicting information appropriately?
3. Did it create coherent, comprehensive answers that draw on multiple knowledge sources?
4. Did it identify connections between information from different agents?

Rate the orchestrator's information integration ability on a scale from 1 to 5:
1: Poor - Failed to integrate information; simply repeated individual agent outputs or used only single sources
2: Below Average - Minimal integration; mostly relied on individual agents with little synthesis
3: Average - Basic integration of information; combined some facts but missed opportunities for deeper synthesis
4: Good - Strong integration; effectively combined information from multiple agents into coherent responses
5: Excellent - Superior integration; seamlessly synthesized information from multiple agents, creating insights beyond what any single agent provided

Your response must begin with "Rating: X" where X is your score (1-5), followed by your detailed explanation.

IMPORTANT: Some conversations may end abruptly due to turn limits. In these cases, evaluate based on what was accomplished up to that point.
"""

#evaluate the agent-user conversation by passing the full conversation to llm judge
turn_by_turn_info_evaluator = TurnByTurnAgentSelectionEvaluator(
    evaluation_llm_service=llm_service,
    system_prompt=info_agg_system_prompt,
    metric_name="turn_by_turn_information_aggregation",
    description="Evaluates information aggregation for each turn"
)

#evaluate the agent-user conversation by turn, considering the context of previous turns
turn_by_turn_info_evaluator_with_context = TurnByTurnEvaluatorWithContext(
    evaluation_llm_service=llm_service,
    system_prompt=info_agg_system_prompt,
    metric_name="turn_by_turn_information_aggregation",
    agent_name="Orchestrator", #could pass any other defined agents
    description="Evaluates information aggregation for each turn"
)


In [8]:
evaluator = Evaluator(
    metrics=[
        agent_selection_evaluator,
        intent_resolution_evaluator,
        information_aggregation_evaluator,
        # context_relevancy_evaluator,
        # rouge_metric,
        # tbfact_metric,
        #turn_by_turn_intent_evaluator_with_context,
        #turn_by_turn_info_evaluator_with_context,
        #turn_by_turn_evaluator,
    ],
    output_folder_path=EVALUATION_RESULTS_PATH,
)

evaluator.load_chat_contexts(SIMULATION_OUTPUT_PATH)


<evaluation.evaluator.Evaluator at 0x1c0b9d861b0>


Similar to the `ChatSimulator` class, you may skip `load_chat_contexts` by passing the chat contexts directly to the constructor:

```python
from data_models.chat_context import ChatContext

chats_contexts: list[ChatContext]

evaluator = Evaluator(
    chats_contexts=chats_contexts,
    metrics=[
        ...
    ],
    output_folder_path=SIMULATION_OUTPUT_PATH,
)
```

In [9]:
evaluation_results = await evaluator.evaluate()

evaluation_results


{'timestamp': '20250904_151010',
 'metrics': {'agent_selection': {'average_score': 4.2272727272727275,
   'num_evaluations': 66,
   'num_errors': 0,
   'results': [{'id': '21f2e08161ef6e94c4b7835c02b8060bf534659c2109be987f2c113a3f397c11',
     'patient_id': None,
     'result': {'score': 4,
      'explanation': 'Rating: 4\n\nExplanation:\n- Appropriate agent routing for the task:\n  - Started with PatientHistory to pull stage, biomarkers, treatments, and imaging summaries — the correct first step for assembling the “full clinical picture.”\n  - Planned handoff to PatientStatus to synthesize current status, Radiology for imaging context, MedicalResearch for evidence-based prognostic factors and progression patterns, and ReportCreation to assemble the final short report. This sequencing fits the user’s request for prognosis and progression pathways.\n  - When EHR/registry access failed, the orchestrator adapted by requesting a minimal clinical dataset and consulted PatientStatus and Radi


For a quick overview, we print the results below. To better understand the scores and the behaviour of the agents, drill down into the dictionary generated in the previous step (also saved as a `json` file in the evaluation output folder). It includes all individual results, explanations and details specific to each metric.

>💡**Note:** When no reference data is provided, that instance will result in `Error: No reference found for patient ID: `.

In [10]:
for metric_name, metric_result in evaluation_results["metrics"].items():
    # Single quotes inside the f-string keep this line valid on Python versions before 3.12.
    print(f"{metric_name}: average_score: {metric_result['average_score']} | num_errors: {metric_result['num_errors']}")


agent_selection: average_score: 4.2272727272727275 | num_errors: 0
task_completion_and_focus: average_score: 2.4615384615384617 | num_errors: 53
information_integration: average_score: 0 | num_errors: 66
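
To go beyond the averages, the results dictionary can be inspected per conversation. The sketch below runs on a small hypothetical dictionary that mirrors the structure shown in the evaluation output (`metrics` → metric name → `results` → `result` with `score` and `explanation`). The `lowest_scoring` helper is hypothetical, and skipping entries without a numeric score is an assumption about how failed evaluations appear.

```python
# Hypothetical results mirroring the structure of `evaluation_results`.
sample_results = {
    "timestamp": "20250101_000000",
    "metrics": {
        "agent_selection": {
            "average_score": 3.5,
            "num_evaluations": 2,
            "num_errors": 0,
            "results": [
                {"id": "chat-a", "patient_id": "patient_4",
                 "result": {"score": 4, "explanation": "Rating: 4 ..."}},
                {"id": "chat-b", "patient_id": "patient_4",
                 "result": {"score": 3, "explanation": "Rating: 3 ..."}},
            ],
        }
    },
}


def lowest_scoring(results: dict, metric_name: str, count: int = 1) -> list[dict]:
    """Return the `count` entries with the lowest numeric score for a metric."""
    entries = results["metrics"][metric_name]["results"]
    scored = [
        entry for entry in entries
        if isinstance(entry.get("result"), dict)
        and isinstance(entry["result"].get("score"), (int, float))
    ]
    return sorted(scored, key=lambda entry: entry["result"]["score"])[:count]


worst = lowest_scoring(sample_results, "agent_selection")
for entry in worst:
    print(entry["id"], entry["result"]["score"])
```

Reading the `explanation` of the lowest-scoring conversations is usually a faster route to understanding a poor average than the aggregate numbers alone.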

