# Evaluating an LLM Agent Routine Using Meta-Prompting

This notebook systematically **evaluates** an LLM Agent routine using a **test-driven approach**. By simulating real-world interactions between a **user and the agent**, we measure the agent's effectiveness in following its structured **system prompt** and making accurate decisions.

## Evaluation Framework

To assess the agent’s **compliance, reasoning, and accuracy**, we define a controlled testing environment where:

1. **Test Cases (CSV Input)** – A dataset containing multiple test scenarios, each with:
   - **Context**: The operational background or situation.
   - **User Request**: The problem or inquiry initiated by the user.
   - **Expected Function**: The correct function the agent should invoke.
   - **Expected Inputs**: The required parameters for proper execution.

2. **Dual LLM Simulation**
   - **User Simulator**: A simulated user that provides inputs dynamically from the test case.
   - **Agent Simulator**: The LLM agent utilizing the **structured routine** from the meta-prompt.

3. **Conversational Interaction**
   - The **User Simulator** presents queries based on the test case.
   - The **Agent Simulator** processes requests and responds according to its defined **meta-prompt routine**.
   - A simulated **dialogue** unfolds, mimicking a real-world exchange.

4. **Evaluation Metrics**
   - We verify whether the **Agent Simulator** correctly identifies and executes the **Expected Function**.
   - We check if the function was called with the **correct inputs**.
   - The **Accuracy Score** is the fraction of test cases that pass, i.e. where both the function and its inputs match the expectation.

## Objectives

- **Assess the Agent’s Compliance** with structured reasoning and policy adherence.
- **Measure Execution Accuracy** by comparing function calls against expected outputs.
- **Identify System Strengths & Weaknesses** for further refinements.

<div align="center">
  <img src="images/evaluate_ai.png" alt="Meta-prompting Evaluate" width="800">
</div>

## Getting Started

Prerequisites:

- Python 3.11 or 3.12 installed on your system.
- LLM Agentic Tool Mesh library installed. If not, install it using: `pip install 'llmesh[all]'`
- Jupyter kernel installed. If not, install it using: `pip install ipykernel`
- A `.env` file inside the `notebooks/data` folder that sets the `PYTHONPATH` and `OPENAI_API_KEY` environment variables

In [1]:
```python
from dotenv import load_dotenv
import os

# Load environment variables from the .env file located in the data folder
dotenv_path = os.path.join('data', '.env')
load_dotenv(dotenv_path)

# Read environment variables
python_path = os.getenv('PYTHONPATH')
if not python_path:
    raise ValueError("PYTHONPATH is not set in environment variables.")
llm_api_key = os.getenv('OPENAI_API_KEY')
if not llm_api_key:
    raise ValueError("OPENAI_API_KEY is not set in environment variables.")

# Change the working directory to the project root given by PYTHONPATH
os.chdir(python_path)
```

## Loading the Routine

We load the routine generated through meta-prompting. It defines the structured operational workflow the LLM Agent follows, aiming at policy alignment, efficient tool usage, and consistent decision-making.

In [2]:
```python
from IPython.display import Markdown
from notebooks.utils import get_routine

routine = get_routine()
```

2025-02-18 16:20:34,100 - ATHON - INFO - Data loaded from notebooks/files/p5g_routine_0_cached.pkl.


## Loading the Test Dataframe

The test dataframe is loaded to evaluate the LLM Agent's performance and ensure its responses align with the expected outcomes.  

In [3]:
```python
from IPython.display import display
from notebooks.utils import read_test_dataframe

test_dataframe = read_test_dataframe()
display(test_dataframe)
```

| Unnamed: 0 | Context | Request | Network Function | Profile Network Function | SUPI | UUID | PDU Session ID | Metric | Expected Function | Expected Inputs |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | I want to get the status of the N2 interface b... | Get the status of the N2 interface between AMF... | amf | | | | | | get_amf_ngap_status | `{"nf": "amf"}` |
| 1 | I need to check the status of the interface be... | What is the status of the interface between AM... | amf | | | | | | get_amf_ngap_status | `{"nf": "amf"}` |
| 2 | I want to verify the status of the interface b... | Check the interface status between AMF and NGAP | amf | | | | | | get_amf_ngap_status | `{"nf": "amf"}` |
| 3 | I need to confirm the current status of the N2... | Retrieve the status of the N2 interface betwee... | amf | | | | | | get_amf_ngap_status | `{"nf": "amf"}` |
| 4 | I want to monitor the status of the interface ... | Is the interface between AMF and NGAP active? | amf | | | | | | get_amf_ngap_status | `{"nf": "amf"}` |
| 5 | Track latency within the HPE 5G Core | Measure the latency in the HPE 5G Core | kpi | | | | | latency | get_network_performance | `{"nf": "kpi","metric": "latency"}` |
| 6 | Monitor bandwidth usage within the HPE 5G Core | Check the current throughput in the HPE 5G Core | kpi | | | | | throughput | get_network_performance | `{"nf": "kpi","metric": "throughput"}` |
| 7 | Observe signal strength within the HPE 5G Core | Assess the signal strength in the HPE 5G Core | kpi | | | | | signal_strength | get_network_performance | `{"nf": "kpi","metric": "signal_strength"}` |
| 8 | Analyze latency in the HPE 5G Core | What is the current latency in the HPE 5G Core? | kpi | | | | | latency | get_network_performance | `{"nf": "kpi","metric": "latency"}` |
| 9 | Evaluate bandwidth usage in the HPE 5G Core | What is the current bandwidth usage in the HPE... | kpi | | | | | bandwidth | get_network_performance | `{"nf": "kpi","metric": "bandwidth"}` |
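The test table has two details worth handling when the CSV is read: the `Unnamed: 0` column, which pandas typically produces when a CSV saved with its index is read back without `index_col`, and `Expected Inputs`, which is stored as JSON strings. Since `read_test_dataframe` is not shown here, the following is only a sketch of a hypothetical `read_test_cases` helper that addresses both, assuming the file layout matches the table above (the sample row is trimmed to a few columns).

```python
import io
import json

import pandas as pd


def read_test_cases(source) -> pd.DataFrame:
    """Hypothetical loader: use the first CSV column as the index and decode Expected Inputs."""
    # index_col=0 keeps the saved index out of the columns (no 'Unnamed: 0' column)
    df = pd.read_csv(source, index_col=0)
    # Expected Inputs are JSON strings such as '{"nf": "amf"}'; decode them into dicts
    df["Expected Inputs"] = df["Expected Inputs"].apply(json.loads)
    return df


# Usage with a trimmed, in-memory version of row 5 from the table above
sample = (
    ',Context,Request,Network Function,Metric,Expected Function,Expected Inputs\n'
    '5,Track latency within the HPE 5G Core,Measure the latency in the HPE 5G Core,'
    'kpi,latency,get_network_performance,"{""nf"": ""kpi"",""metric"": ""latency""}"\n'
)
tests = read_test_cases(io.StringIO(sample))
print(tests.loc[5, "Expected Inputs"])  # {'nf': 'kpi', 'metric': 'latency'}
```

Decoding once at load time means later comparisons can work on dictionaries rather than on raw JSON text.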


## Evaluating the Routine

For each test case, we instantiate both the **Agent Simulator** and the **User Simulator** (the customer persona) and let them converse in a controlled environment. This structured flow lets us assess whether the Agent interprets the request correctly, applies its reasoning, and calls the expected function with the expected inputs.


In [4]:
```python
import pandas as pd
from notebooks.utils import evaluate_test_case, write_result_dataframe

# Run the simulated conversation for every test case and collect one result record per row
records = []
for row_number, test_row in test_dataframe.iterrows():
    print(f"Processing row: {row_number}")
    record = evaluate_test_case(routine, test_row)
    records.append(record)

# Save results in a file
result_dataframe = pd.DataFrame(records)
write_result_dataframe(result_dataframe)
```

Processing row: 0
2025-02-18 16:20:34,123 - ATHON - DEBUG - Prompt generated from notebooks/prompts//customer_persona.txt with params {'details': {'Context': 'I want to get the status of the N2 interface between AMF and NGAP', 'Request': 'Get the status of the N2 interface between AMF and NGAP', 'Network Function': 'amf'}}
2025-02-18 16:20:34,123 - ATHON - DEBUG - Selected Langchain ChatOpenAI
2025-02-18 16:20:34,207 - ATHON - DEBUG - Returned model 'gpt-4o-mini'
2025-02-18 16:20:34,209 - ATHON - DEBUG - Prompt generated from notebooks/prompts//agent_persona.txt with params {'routine': '1. Identify the type of customer request by parsing the input for keywords corresponding to the following categories:\n   a. If the request mentions monitoring network performance (KPIs such as latency, bandwidth, signal strength), then proceed to step 2.\n   b. If the request is about retrieving the AMF-NGAP status, then proceed to step 3.\n   c. If the request asks for network function logs, then proc

In [5]:
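```python
from IPython.display import HTML, Markdown, display

# Overall accuracy: share of test cases marked as correct, shown as a percentage
total_accuracy = result_dataframe['Is Correct'].mean()
display(Markdown(f"### Accuracy: {round(total_accuracy * 100, 2)}"))

# Render newlines as <br> so multi-line transcripts display correctly in the HTML table
# (DataFrame.map requires pandas >= 2.1; older versions use DataFrame.applymap)
result_dataframe = result_dataframe.map(lambda x: x.replace("\n", "<br>") if isinstance(x, str) else x)
display(HTML(result_dataframe.to_html(escape=False)))
```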

### Accuracy: 66.67

Unnamed: 0,Request,Context,Expected Function,Expected Inputs,Actual Function,Actual Inputs,Is Correct,Transcripts
0,Get the status of the N2 interface between AMF and NGAP,I want to get the status of the N2 interface between AMF and NGAP,get_amf_ngap_status,{'nf': 'amf'},get_amf_ngap_status,{'nf': 'amf'},True,"* User: Get the status of the N2 interface between AMF and NGAP * Assistant: Please provide the network function name where the AMF-NGAP status is required. * User: The network function name is ""amf."" * Assistant: * Tool: {  ""connected_at"": ""2025-02-11T18:25:21Z"",  ""id"": 80,  ""plmn"": ""00101"",  ""tac"": 1,  ""address"": ""10.132.0.10"",  ""ran_node_name"": ""UERANSIM-gnb-1-1-4"" } * Assistant: The status of the N2 interface between AMF and NGAP is as follows: - **Connected At:** 2025-02-11T18:25:21Z - **ID:** 80 - **PLMN:** 00101 - **TAC:** 1 - **Address:** 10.132.0.10 - **RAN Node Name:** UERANSIM-gnb-1-1-4 Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for amf"
1,What is the status of the interface between AMF and NGAP?,I need to check the status of the interface between AMF and NGAP,get_amf_ngap_status,{'nf': 'amf'},get_amf_ngap_status,{'nf': 'amf'},True,"* User: What is the status of the interface between AMF and NGAP? * Assistant: Please provide the network function name where the AMF-NGAP status is required. * User: The network function name is amf. * Assistant: * Tool: {  ""connected_at"": ""2025-02-11T18:25:21Z"",  ""id"": 80,  ""plmn"": ""00101"",  ""tac"": 1,  ""address"": ""10.132.0.10"",  ""ran_node_name"": ""UERANSIM-gnb-1-1-4"" } * Assistant: The status of the interface between AMF and NGAP is as follows: - **Connected At:** 2025-02-11T18:25:21Z - **ID:** 80 - **PLMN:** 00101 - **TAC:** 1 - **Address:** 10.132.0.10 - **RAN Node Name:** UERANSIM-gnb-1-1-4 Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for amf"
2,Check the interface status between AMF and NGAP,I want to verify the status of the interface between AMF and NGAP,get_amf_ngap_status,{'nf': 'amf'},get_amf_ngap_status,{'nf': 'AMF'},False,"* User: Check the interface status between AMF and NGAP * Assistant: Please provide the network function name where the AMF-NGAP status is required. * User: The network function is AMF. * Assistant: * Tool: {  ""connected_at"": ""2025-02-11T18:25:21Z"",  ""id"": 80,  ""plmn"": ""00101"",  ""tac"": 1,  ""address"": ""10.132.0.10"",  ""ran_node_name"": ""UERANSIM-gnb-1-1-4"" } * Assistant: The AMF-NGAP status for the network function AMF is as follows: - **Connected At:** 2025-02-11T18:25:21Z - **ID:** 80 - **PLMN:** 00101 - **TAC:** 1 - **Address:** 10.132.0.10 - **RAN Node Name:** UERANSIM-gnb-1-1-4 Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for AMF"
3,Retrieve the status of the N2 interface between AMF and NGAP,I need to confirm the current status of the N2 interface between AMF and NGAP,get_amf_ngap_status,{'nf': 'amf'},get_amf_ngap_status,{'nf': 'AMF'},False,"* User: Retrieve the status of the N2 interface between AMF and NGAP * Assistant: Please provide the network function name where the AMF-NGAP status is required. * User: The network function is AMF. * Assistant: * Tool: {  ""connected_at"": ""2025-02-11T18:25:21Z"",  ""id"": 80,  ""plmn"": ""00101"",  ""tac"": 1,  ""address"": ""10.132.0.10"",  ""ran_node_name"": ""UERANSIM-gnb-1-1-4"" } * Assistant: The status of the N2 interface between AMF and NGAP is as follows: - **Connected At:** 2025-02-11T18:25:21Z - **ID:** 80 - **PLMN:** 00101 - **TAC:** 1 - **Address:** 10.132.0.10 - **RAN Node Name:** UERANSIM-gnb-1-1-4 Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for AMF"
4,Is the interface between AMF and NGAP active?,I want to monitor the status of the interface between AMF and NGAP,get_amf_ngap_status,{'nf': 'amf'},get_amf_ngap_status,{'nf': 'amf'},True,"* User: Is the interface between AMF and NGAP active? * Assistant: Please provide the network function name where the AMF-NGAP status is required. * User: The network function is amf. * Assistant: * Tool: {  ""connected_at"": ""2025-02-11T18:25:21Z"",  ""id"": 80,  ""plmn"": ""00101"",  ""tac"": 1,  ""address"": ""10.132.0.10"",  ""ran_node_name"": ""UERANSIM-gnb-1-1-4"" } * Assistant: The interface between AMF and NGAP is active. Here are the details: - **Connected At:** 2025-02-11T18:25:21Z - **ID:** 80 - **PLMN:** 00101 - **TAC:** 1 - **Address:** 10.132.0.10 - **RAN Node Name:** UERANSIM-gnb-1-1-4 Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for amf"
5,Measure the latency in the HPE 5G Core,Track latency within the HPE 5G Core,get_network_performance,"{'nf': 'kpi', 'metric': 'latency'}",get_network_performance,"{'nf': 'kpi', 'metric': 'latency'}",True,"* User: Measure the latency in the HPE 5G Core * Assistant: Please provide the network function name and the specific KPI metric you want to track (e.g., latency, bandwidth, signal strength). * User: The network function is ""kpi"" and the metric I want to track is ""latency."" * Assistant: * Tool: {  ""throughput"": {  ""measure"": 5,  ""unit"": ""mbps""  },  ""latency"": {  ""measure"": 3,  ""unit"": ""Mbps""  },  ""signal_strength"": {  ""measure"": 100,  ""unit"": ""dBm""  } } * Assistant: The latency measurement in the HPE 5G Core for the network function ""kpi"" is 3 Mbps. Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for kpi"
6,Check the current throughput in the HPE 5G Core,Monitor bandwidth usage within the HPE 5G Core,get_network_performance,"{'nf': 'kpi', 'metric': 'throughput'}",get_network_performance,"{'nf': 'kpi', 'metric': 'throughput'}",True,"* User: Check the current throughput in the HPE 5G Core * Assistant: Please provide the network function name and the specific KPI metric you want to track (e.g., latency, bandwidth, signal strength). * User: The network function is ""kpi"" and the metric I want to track is ""throughput."" * Assistant: * Tool: {  ""throughput"": {  ""measure"": 5,  ""unit"": ""mbps""  },  ""latency"": {  ""measure"": 3,  ""unit"": ""Mbps""  },  ""signal_strength"": {  ""measure"": 100,  ""unit"": ""dBm""  } } * Assistant: The current throughput in the HPE 5G Core is 5 Mbps. Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for kpi"
7,Assess the signal strength in the HPE 5G Core,Observe signal strength within the HPE 5G Core,get_network_performance,"{'nf': 'kpi', 'metric': 'signal_strength'}",get_network_performance,"{'nf': 'kpi', 'metric': 'signal_strength'}",True,"* User: Assess the signal strength in the HPE 5G Core * Assistant: Please provide the network function name and the specific KPI metric you want to track (e.g., latency, bandwidth, signal strength). * User: The network function is ""kpi"" and the metric is ""signal_strength."" * Assistant: * Tool: {  ""throughput"": {  ""measure"": 5,  ""unit"": ""mbps""  },  ""latency"": {  ""measure"": 3,  ""unit"": ""Mbps""  },  ""signal_strength"": {  ""measure"": 100,  ""unit"": ""dBm""  } } * Assistant: The signal strength for the network function ""kpi"" is measured at 100 dBm. Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for kpi"
8,What is the current latency in the HPE 5G Core?,Analyze latency in the HPE 5G Core,get_network_performance,"{'nf': 'kpi', 'metric': 'latency'}",get_network_performance,"{'nf': 'kpi', 'metric': 'latency'}",True,"* User: What is the current latency in the HPE 5G Core? * Assistant: Please provide the network function name and the specific KPI metric you want to track (e.g., latency, bandwidth, signal strength). * User: The network function is ""kpi"" and the metric is ""latency."" * Assistant: * Tool: {  ""throughput"": {  ""measure"": 5,  ""unit"": ""mbps""  },  ""latency"": {  ""measure"": 3,  ""unit"": ""Mbps""  },  ""signal_strength"": {  ""measure"": 100,  ""unit"": ""dBm""  } } * Assistant: The current latency in the HPE 5G Core for the network function ""kpi"" is 3 Mbps. Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for kpi"
9,What is the current bandwidth usage in the HPE 5G Core?,Evaluate bandwidth usage in the HPE 5G Core,get_network_performance,"{'nf': 'kpi', 'metric': 'bandwidth'}",get_network_performance,"{'nf': 'kpi', 'metric': 'bandwidth'}",True,"* User: What is the current bandwidth usage in the HPE 5G Core? * Assistant: Please provide the network function name and the specific KPI metric you want to track (e.g., latency, bandwidth, signal strength). * User: The network function is ""kpi"" and the metric is ""bandwidth."" * Assistant: * Tool: {  ""throughput"": {  ""measure"": 5,  ""unit"": ""mbps""  },  ""latency"": {  ""measure"": 3,  ""unit"": ""Mbps""  },  ""signal_strength"": {  ""measure"": 100,  ""unit"": ""dBm""  } } * Assistant: The current bandwidth usage in the HPE 5G Core is 5 Mbps. Is there anything more I can assist you with regarding managing the HPE 5G Core? * User: No, that will be all. Thank you for your assistance! * Assistant: * Tool: Close case for kpi"
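To connect these results to the objective of identifying strengths and weaknesses, the pass rate can be broken down per expected function. The sketch below is a hypothetical helper built only on the `Expected Function` and `Is Correct` columns shown above; in the notebook it would be called as `accuracy_by_function(result_dataframe)`. The sample data re-enters those two columns from the table above for illustration.

```python
import pandas as pd


def accuracy_by_function(results: pd.DataFrame) -> pd.DataFrame:
    """Passed count, total count and pass rate per expected function, weakest first."""
    return (
        results.groupby("Expected Function")["Is Correct"]
        .agg(passed="sum", total="count", accuracy="mean")
        .sort_values("accuracy")
    )


# Usage with the Expected Function / Is Correct values from the table above
sample = pd.DataFrame({
    "Expected Function": ["get_amf_ngap_status"] * 5 + ["get_network_performance"] * 5,
    "Is Correct": [True, True, False, False, True] + [True] * 5,
})
print(accuracy_by_function(sample))
```

On the rows shown, this isolates `get_amf_ngap_status` (3 of 5 passing) as the weaker area, and both of its failures come from the agent passing `AMF` instead of `amf` as the `nf` value.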
