<a href="https://www.nvidia.com/dli"> <img src="images/nvidia_header.png" style="margin-left: -30px; width: 300px; float: left;"> </a>

# Evaluating and Profiling AgentIQ Workflows

In this notebook, you will walk through how to set up agent evaluation and observability.

## Evaluating Our Agent

A key capability of AgentIQ is running several well-known evaluations against agentic workflows.

### 1. Why Evaluate?
Proper evaluation helps us:
- **Measure Performance**: Quantify how well our agent performs on specific tasks
- **Identify Weaknesses**: Find edge cases or failure modes
- **Compare Versions**: Track improvements across different iterations
- **Ensure Reliability**: Verify the agent works consistently

### 2. Evaluation Process
1. **Create Test Data**: Define questions with known answers
2. **Configure Evaluators**: Set up metrics to measure performance
3. **Run Evaluation**: Process all test cases
4. **Analyze Results**: Review metrics and identify areas for improvement
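
The four steps above can be sketched in plain Python. This is a minimal illustration, not how AgentIQ implements evaluation: `run_evaluation`, `toy_agent`, and the exact-match scoring rule are all hypothetical stand-ins (the evaluators used later in this notebook rely on a judge LLM rather than string comparison).

```python
from typing import Callable


def run_evaluation(cases: list, agent: Callable[[str], str]) -> dict:
    """Run every test case through `agent` and score it by exact string match."""
    items = []
    for case in cases:  # Step 3: process all test cases
        response = agent(case["question"])
        # Step 2: the "evaluator" here is a simple exact-match metric (assumption)
        score = 1.0 if response.strip() == case["answer"] else 0.0
        items.append({"id": case["id"], "response": response, "score": score})
    average = sum(item["score"] for item in items) / len(items) if items else 0.0
    # Step 4: return per-item and aggregate results for analysis
    return {"average_score": average, "items": items}


# Step 1: test data with known answers
cases = [
    {"id": 1, "question": "What is the square root of 49?", "answer": "7"},
    {"id": 2, "question": "Add 10 to 25", "answer": "35"},
]


def toy_agent(question: str) -> str:
    """Hypothetical agent that only knows the answer to the first question."""
    return {"What is the square root of 49?": "7"}.get(question, "unknown")


report = run_evaluation(cases, toy_agent)
print(report["average_score"])
```

Keeping the per-item scores next to the average makes it easy to see which test case pulled the aggregate down.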

### 3. Available Metrics
- **Answer Accuracy**: How correct are the agent's responses?
- **Context Relevance**: Is the agent using appropriate context?
- **Response Groundedness**: Are responses based on retrieved information?
- **Trajectory Analysis**: Is the agent's reasoning process sound?

This notebook shows how to measure some of these metrics for our math tools agent.

## Evaluation Methods in AgentIQ

AgentIQ provides several built-in evaluators to assess the performance of your workflows:

1. **RAGAS Evaluator**: An open-source evaluation framework for RAG (Retrieval-Augmented Generation) workflows. RAGAS provides metrics like Answer Accuracy, Context Relevance, and Response Groundedness.

2. **Trajectory Evaluator**: Uses the intermediate steps generated by the workflow to evaluate the agent's reasoning process and decision-making path.

3. **SWE-Bench Evaluator**: Specifically designed for software engineering tasks, this evaluator tests if the agent can solve programming problems by running tests on the generated code.

In this notebook, we'll primarily use the **RAGAS Evaluator** and **Trajectory Evaluator** to assess our math tools agent.

Let's create a directory to store evaluation data. This will contain test cases with questions and expected answers.

In [1]:
!mkdir -p workflows/math_tools/data

In the next cell, we will create an evaluation JSON file.

It will include both standard and time-aware test cases.

Note that each test case includes the following:
- id: A unique identifier
- question: The input to send to our agent
- answer: The expected correct response (or "dynamic" for time-based answers)

In [2]:
%%writefile workflows/math_tools/data/comprehensive_eval.json

[
    {
        "id": 1,
        "question": "What is the square root of 49?",
        "answer": "7"
    },
    {
        "id": 2,
        "question": "Add 10 to 25",
        "answer": "35"
    },
    {
        "id": 3,
        "question": "What is the modulus of 100 divided by 3?",
        "answer": "1"
    },
    {
        "id": 4,
        "question": "What is five to the power of three?",
        "answer": "125"
    },
    {
        "id": 5,
        "question": "Is the current hour even?",
        "answer": "dynamic"
    }
]

Writing workflows/math_tools/data/comprehensive_eval.json
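
Before running an evaluation, it can help to sanity-check the dataset. The sketch below assumes only the three fields described above (`id`, `question`, `answer`); `validate_cases` is a hypothetical helper, not part of AgentIQ, and the sample data is deliberately broken to show what it reports.

```python
import json

REQUIRED_KEYS = {"id", "question", "answer"}


def validate_cases(cases: list) -> list:
    """Return human-readable problems; an empty list means the data looks fine."""
    problems = []
    seen_ids = set()
    for index, case in enumerate(cases):
        missing = REQUIRED_KEYS - set(case)
        if missing:
            problems.append(f"case {index}: missing {sorted(missing)}")
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"case {index}: duplicate id {case['id']!r}")
            seen_ids.add(case["id"])
    return problems


# Deliberately faulty sample: the second case reuses id 1 and has no answer.
sample = json.loads("""
[
    {"id": 1, "question": "What is the square root of 49?", "answer": "7"},
    {"id": 1, "question": "Add 10 to 25"}
]
""")
print(validate_cases(sample))
```

To check the real file, load it with `json.load` from `workflows/math_tools/data/comprehensive_eval.json` and pass the result to the same function.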


## Creating a Comprehensive Evaluation Configuration

We will write a single evaluation configuration file that includes all the tools our agent needs to handle both mathematical operations and time-based queries. By now you have seen most of this configuration format. This configuration includes:

1. **General settings**: Where to store results and which dataset to use (under `eval.general`)
2. **Functions**: All the tools our agent will use (math operations and time functions)
3. **LLMs**: Both the agent LLM and a separate evaluation LLM
4. **Evaluators**: The specific metrics we want to measure

The `eval` section is new. In it, you can specify as many evaluators as you want to run, calling either built-in evaluators or your own custom evaluation components.

In [3]:
%%writefile workflows/math_tools/configs/comprehensive_eval_config.yml

general:
  use_uvloop: true

functions:
  calculator_exponent:
    _type: calculator_exponent
  calculator_modulus:
    _type: calculator_modulus
  calculator_square_root:
    _type: calculator_square_root
  calculator_add:
    _type: calculator_add
  current_datetime:
    _type: current_datetime

llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.0
  eval_llm:
    _type: nim
    model_name: meta/llama-3.1-405b-instruct
    temperature: 0.0
    max_tokens: 1024

workflow:
  _type: react_agent
  tool_names:
    - calculator_exponent
    - calculator_modulus
    - calculator_square_root
    - calculator_add
    - current_datetime
  llm_name: nim_llm
  verbose: true

eval:
  general:
    output_dir: ./math_tools_eval/
    dataset:
      _type: json
      file_path: workflows/math_tools/data/comprehensive_eval.json
  evaluators:
    math_accuracy:
      _type: ragas
      metric: AnswerAccuracy
      llm_name: eval_llm
    math_trajectory_accuracy:
      _type: trajectory
      llm_name: eval_llm

Writing workflows/math_tools/configs/comprehensive_eval_config.yml
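
The configuration ties sections together by name: the workflow lists tools defined under `functions`, and both the workflow and the evaluators point at entries under `llms`. A typo in one of those names is easy to miss, so here is a small sketch that checks the cross-references. It works on a hand-copied, trimmed dictionary version of the config above; `find_dangling_references` is a hypothetical helper and not an AgentIQ API.

```python
def find_dangling_references(config: dict) -> list:
    """Return messages for names that are referenced but never defined."""
    problems = []
    functions = set(config.get("functions", {}))
    llms = set(config.get("llms", {}))
    workflow = config.get("workflow", {})

    # Every tool the workflow uses must be declared under `functions`.
    for tool in workflow.get("tool_names", []):
        if tool not in functions:
            problems.append(f"workflow tool {tool!r} is not defined under functions")

    # The workflow's LLM must be declared under `llms`.
    workflow_llm = workflow.get("llm_name")
    if workflow_llm not in llms:
        problems.append(f"workflow llm {workflow_llm!r} is not defined under llms")

    # Each evaluator's judge LLM must be declared under `llms` as well.
    evaluators = config.get("eval", {}).get("evaluators", {})
    for name, evaluator in evaluators.items():
        evaluator_llm = evaluator.get("llm_name")
        if evaluator_llm not in llms:
            problems.append(f"evaluator {name!r} uses undefined llm {evaluator_llm!r}")
    return problems


# Trimmed copy of the YAML above, written as a Python dict for illustration.
config = {
    "functions": {
        "calculator_add": {"_type": "calculator_add"},
        "current_datetime": {"_type": "current_datetime"},
    },
    "llms": {"nim_llm": {"_type": "nim"}, "eval_llm": {"_type": "nim"}},
    "workflow": {
        "_type": "react_agent",
        "tool_names": ["calculator_add", "current_datetime"],
        "llm_name": "nim_llm",
    },
    "eval": {
        "evaluators": {
            "math_accuracy": {"_type": "ragas", "metric": "AnswerAccuracy", "llm_name": "eval_llm"},
            "math_trajectory_accuracy": {"_type": "trajectory", "llm_name": "eval_llm"},
        }
    },
}
print(find_dangling_references(config))
```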


## Running the Comprehensive Evaluation

Now let's run the evaluation using our consolidated configuration. This will:
1. Load our test cases from the JSON file
2. Run each test case through our agent
3. Evaluate the responses using the specified metrics
4. Store the results in the output directory

This single evaluation run will test both standard mathematical operations and time-based queries.
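
The time-based test case stores the placeholder `"dynamic"` instead of a fixed answer. This notebook does not say how the evaluators interpret that placeholder, so the sketch below shows just one hypothetical way to compute a concrete reference answer yourself when reviewing results. `resolve_expected_answer` and its "yes"/"no" wording are assumptions for illustration.

```python
from datetime import datetime


def resolve_expected_answer(case: dict, now: datetime) -> str:
    """Return the stored answer, or compute it for the one dynamic case."""
    if case["answer"] != "dynamic":
        return case["answer"]
    # Hypothetical rule for the only dynamic question in this dataset.
    if case["question"] == "Is the current hour even?":
        return "yes" if now.hour % 2 == 0 else "no"
    raise ValueError(f"no rule for dynamic question: {case['question']!r}")


static_case = {"id": 1, "question": "What is the square root of 49?", "answer": "7"}
dynamic_case = {"id": 5, "question": "Is the current hour even?", "answer": "dynamic"}

print(resolve_expected_answer(static_case, datetime.now()))
print(resolve_expected_answer(dynamic_case, datetime.now()))
```

Passing `now` as a parameter, rather than reading the clock inside the function, keeps the check reproducible when you compare it against a recorded run.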

In [4]:
!aiq eval --config_file=workflows/math_tools/configs/comprehensive_eval_config.yml

2025-07-11 09:05:09,105 - aiq.eval.evaluate - INFO - Starting evaluation run with config file: workflows/math_tools/configs/comprehensive_eval_config.yml
2025-07-11 09:05:12,273 - aiq.agent.react_agent.agent - INFO -  [00:00<?, ?it/s]
------------------------------
[AGENT]
Agent input: Add 10 to 25
Agent's thoughts: 
Thought: I need to add 10 and 25 together.
Action: calculator_add
Action Input: {"text": "10 + 25"}
------------------------------
2025-07-11 09:05:12,278 - aiq.agent.react_agent.agent - INFO - 
------------------------------
[AGENT]
Calling tools: calculator_add
Tool's input: {"text": "10 + 25"}
Tool's response: 
Not implemented
------------------------------
2025-07-11 09:05:12,383 - aiq.agent.react_agent.agent - INFO - 
------------------------------
[AGENT]
Agent input: What is the square root of 49?
Agent's thoughts: 
Thought: To find the square root of 49, I can use the calculator_square_root tool.
Action: calculator_squa

## Examining Evaluation Results

After running the evaluation, AgentIQ stores the results in JSON files in our specified `output_dir`. Let's analyze these JSON results with some helper functions.

In [5]:
import json
import pandas as pd
from IPython.display import display

# Simple function to load JSON files
def load_json(file_path):
    with open(file_path, 'r') as f:
        return json.load(f)

# Simple summary for accuracy evaluations
def get_accuracy_summary(data):
    items = data['eval_output_items']
    summary = []
    for item in items:
        summary.append({
            'Question': item['reasoning']['user_input'],
            'Score': item['score']
        })
    return pd.DataFrame(summary)

# Simple summary for workflow
def get_workflow_summary(data):
    summary = []
    for item in data:
        tools = []
        total_tokens = 0
        steps = len(item['intermediate_steps'])
        
        # Collect the names of invoked components (tools and LLM calls) and sum token usage
        for step in item['intermediate_steps']:
            if 'payload' in step and 'name' in step['payload']:
                tools.append(step['payload']['name'])
            if 'payload' in step and 'usage_info' in step['payload']:
                total_tokens += step['payload']['usage_info']['token_usage']['total_tokens']
        
        summary.append({
            'Question': item['question'],
            'Steps': steps,
            'Tokens': total_tokens,
            'Tools': ', '.join(set(tools))  # Unique names only; order is not guaranteed
        })
    return pd.DataFrame(summary)

# Load the three files
accuracy = load_json('./math_tools_eval/math_accuracy_output.json')
trajectory = load_json('./math_tools_eval/math_trajectory_accuracy_output.json')
workflow = load_json('./math_tools_eval/workflow_output.json')

# Create simple DataFrames
accuracy_df = get_accuracy_summary(accuracy)
workflow_df = get_workflow_summary(workflow)

# Show basic metrics
print("Overall Scores:")
print(f"Math Accuracy Score: {accuracy['average_score']}")
print(f"Trajectory Accuracy Score: {trajectory['average_score']}")
print(f"Average Steps: {workflow_df['Steps'].mean()}")
print(f"Average Tokens: {workflow_df['Tokens'].mean()}")

# Show accuracy results
print("\nMath Accuracy Results:")
display(accuracy_df)

# Show workflow results
print("\nWorkflow Summary:")
display(workflow_df)

# Count tool usage
tools_used = {}
for tools in workflow_df['Tools']:
    for tool in tools.split(', '):
        if tool:  # Skip empty strings
            tools_used[tool] = tools_used.get(tool, 0) + 1

print("\nTools Used:")
display(pd.DataFrame(list(tools_used.items()), columns=['Tool', 'Count']))

Overall Scores:
Math Accuracy Score: 0.65
Trajectory Accuracy Score: 0.8
Average Steps: 9.2
Average Tokens: 129220.6

Math Accuracy Results:


| | Question | Score |
|---|---|---|
| 0 | What is the square root of 49? | 1.0 |
| 1 | Add 10 to 25 | 0.0 |
| 2 | What is the modulus of 100 divided by 3? | 1.0 |
| 3 | What is five to the power of three? | 1.0 |
| 4 | Is the current hour even? | 0.25 |



Workflow Summary:


| | Question | Steps | Tokens | Tools |
|---|---|---|---|---|
| 0 | What is the square root of 49? | 3 | 35420 | meta/llama-3.1-70b-instruct, calculator_square... |
| 1 | Add 10 to 25 | 32 | 453574 | meta/llama-3.1-70b-instruct, calculator_add |
| 2 | What is the modulus of 100 divided by 3? | 3 | 41147 | meta/llama-3.1-70b-instruct, calculator_modulus |
| 3 | What is five to the power of three? | 3 | 38638 | calculator_exponent, meta/llama-3.1-70b-instruct |
| 4 | Is the current hour even? | 5 | 77324 | meta/llama-3.1-70b-instruct, calculator_modulu... |



Tools Used:


| | Tool | Count |
|---|---|---|
| 0 | meta/llama-3.1-70b-instruct | 5 |
| 1 | calculator_square_root | 1 |
| 2 | calculator_add | 1 |
| 3 | calculator_modulus | 2 |
| 4 | calculator_exponent | 1 |
| 5 | current_datetime | 1 |

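
The overall numbers printed above are plain averages over the five test cases, and they can be reproduced by hand from the per-question values. The sketch below copies those values from the output and also flags runs that took far more steps than usual. The "more than twice the median" rule is an arbitrary threshold chosen for illustration, not something AgentIQ reports.

```python
from statistics import mean, median

# Per-question values copied from the result tables above.
rows = [
    {"question": "What is the square root of 49?", "score": 1.0, "steps": 3, "tokens": 35420},
    {"question": "Add 10 to 25", "score": 0.0, "steps": 32, "tokens": 453574},
    {"question": "What is the modulus of 100 divided by 3?", "score": 1.0, "steps": 3, "tokens": 41147},
    {"question": "What is five to the power of three?", "score": 1.0, "steps": 3, "tokens": 38638},
    {"question": "Is the current hour even?", "score": 0.25, "steps": 5, "tokens": 77324},
]

avg_score = mean(row["score"] for row in rows)    # (1 + 0 + 1 + 1 + 0.25) / 5
avg_steps = mean(row["steps"] for row in rows)    # 46 / 5
avg_tokens = mean(row["tokens"] for row in rows)  # 646103 / 5
print(avg_score, avg_steps, avg_tokens)

# Flag runs whose step count is more than twice the median (illustrative threshold).
threshold = 2 * median(row["steps"] for row in rows)
outliers = [row["question"] for row in rows if row["steps"] > threshold]
print(outliers)
```

The single flagged question is the one whose tool call returned `Not implemented` in the run log, which is consistent with its accuracy score of 0.0 and its much higher step and token counts.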

## Setting Up Observability

Now that we can run our agent as a service, we need to monitor its performance and behavior. AgentIQ provides comprehensive observability features that help us understand what's happening inside our agent.

## Observability and Profiling in AgentIQ

AgentIQ offers comprehensive observability and profiling capabilities to monitor and optimize your workflows:

1. **Telemetry Options**:
   - **Logging**: Configure logs to console or file with different verbosity levels
   - **Tracing**: Track the flow of requests through your system
   - **Metrics**: Measure performance characteristics of your workflow

2. **Profiling Tools**:
   - **Token Usage Analysis**: Track and forecast token consumption
   - **Latency Analysis**: Identify performance bottlenecks
   - **Concurrency Analysis**: Understand parallel execution patterns

3. **Tracing Providers**:
   - **Phoenix Profiler**: A visualization tool by Arize AI for tracing and profiling
   - **OpenTelemetry Collector**: Standard collector for observability data
   - **Custom Providers**: Extensible system for custom telemetry exporters

In this notebook, we'll use the **Phoenix Profiler** to visualize the execution of our agent and understand its performance characteristics.

### 1. Understanding Observability in AgentIQ

AgentIQ supports multiple observability options, including:

- **Logging Providers**: Console logging and file-based logging with configurable verbosity levels
- **Tracing Providers**: Phoenix Profiler, OpenTelemetry Collector, and custom providers
- **Metrics Collection**: Performance measurements for optimization

Phoenix is developed by Arize AI (https://github.com/Arize-ai/phoenix) and provides detailed insights into your agent's execution, including:

- Visual representation of the agent's reasoning process
- Timing information for each step and tool call
- Token usage statistics and bottleneck identification
- Hierarchical view of nested function calls

These features help with debugging, performance optimization, and understanding usage patterns.
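
To make the idea of timing information and nested calls concrete, here is a minimal, standard-library sketch of what a trace record might contain. It is not the Phoenix or AgentIQ API: the `span` context manager, the `spans` list, and the field names are hypothetical, and a real tracing backend would receive the records instead of a Python list.

```python
import time
from contextlib import contextmanager
from typing import Optional

spans = []  # collected records; a tracing backend would receive these instead


@contextmanager
def span(name: str, parent: Optional[str] = None):
    """Record how long the wrapped block takes, along with its parent span."""
    start = time.perf_counter()
    try:
        yield
    finally:
        spans.append({
            "name": name,
            "parent": parent,
            "duration_s": time.perf_counter() - start,
        })


# A nested call: the tool span sits inside the agent span.
with span("agent_run"):
    with span("calculator_square_root", parent="agent_run"):
        time.sleep(0.01)  # stand-in for real tool work

for record in spans:
    print(record["name"], record["parent"], round(record["duration_s"], 4))
```

Inner spans finish first, so they are recorded before their parents. The `parent` field is what lets a viewer rebuild the hierarchy.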

### 2. Updating Configuration for Observability

Let's update our configuration file to enable observability features. We'll add a `telemetry` section to the `general` configuration that includes logging and tracing settings.

In [6]:
%%writefile workflows/math_tools/configs/observability_config.yml

general:
  use_uvloop: true
  telemetry:
    logging:
      console:
        _type: console
        level: WARN
    tracing:
      phoenix:
        _type: phoenix
        endpoint: http://phoenix:6006/v1/traces
        project: math_tools_example

functions:
  calculator_exponent:
    _type: calculator_exponent
  calculator_modulus:
    _type: calculator_modulus
  calculator_square_root:
    _type: calculator_square_root
  calculator_add:
    _type: calculator_add
  current_datetime:
    _type: current_datetime

llms:
  nim_llm:
    _type: nim
    model_name: meta/llama-3.1-70b-instruct
    temperature: 0.0

workflow:
  _type: react_agent
  tool_names:
    - calculator_exponent
    - calculator_modulus
    - calculator_square_root
    - calculator_add
    - current_datetime
  llm_name: nim_llm
  verbose: true


Writing workflows/math_tools/configs/observability_config.yml
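
The `level: WARN` setting controls console verbosity. Assuming it behaves like a standard Python `logging` threshold (an assumption, since this notebook does not show how AgentIQ applies it), messages below warning severity are dropped. The sketch below demonstrates that behavior with the standard library only. The logger name and `ListHandler` class are made up for the example.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted messages in a list so they are easy to inspect."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(f"{record.levelname}: {record.getMessage()}")


logger = logging.getLogger("observability_demo")  # hypothetical logger name
logger.propagate = False  # keep the demo output out of the root logger
handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.WARNING)  # same threshold idea as `level: WARN`

logger.info("agent step details")                      # below threshold, dropped
logger.warning("tool returned an unexpected response")
logger.error("tool call failed")

print(handler.messages)
```

Lowering the threshold to `logging.INFO` would also keep the first message, which is useful while debugging but noisy otherwise.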


### 3. Opening Phoenix

Now we can open the Phoenix UI to see the profiling data. The next cell creates a link to it. Feel free to run the query in section 4 below multiple times (or change the query and run it) to see observability data flowing into Phoenix.

In [7]:
%%js
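// Build a link to the Phoenix UI served from the same host as this notebook.
const hostname = window.location.hostname;
const a = document.createElement('a');
a.appendChild(document.createTextNode('Click here to open Phoenix!'));
a.href = "http://" + hostname + "/phoenix";
a.style.color = "navy";
a.target = "_blank";
// `element` is the output area that the notebook provides to %%js cells.
element.append(a);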

<IPython.core.display.Javascript object>

### 4. Running AgentIQ with Observability Enabled

Now let's run our agent with observability enabled. This will generate logs and traces that we can use to monitor and debug our agent.

In [8]:
!aiq run --config_file workflows/math_tools/configs/observability_config.yml --input "What is the square root of the current hour plus 5?"

2025-07-11 09:05:37,961 - aiq.cli.commands.start - INFO - Starting AIQ Toolkit from config file: 'workflows/math_tools/configs/observability_config.yml'
2025-07-11 09:05:37,994 - phoenix.config - INFO - 📋 Ensuring phoenix working directory: /root/.phoenix
2025-07-11 09:05:38,010 - phoenix.inferences.inferences - INFO - Dataset: phoenix_inferences_79e10326-ce33-494c-8e62-55d3692111c7 initialized

Configuration Summary:
--------------------
Workflow Type: react_agent
Number of Functions: 5
Number of LLMs: 1
Number of Embedders: 0
Number of Memory: 0
Number of Retrievers: 0

2025-07-11 09:05:43,797 - aiq.agent.react_agent.agent - INFO - 
------------------------------
[AGENT]
Agent input: What is the square root of the current hour plus 5?
Agent's thoughts: 
Thought: To find the square root of the current hour plus 5, I need to first find the current hour.
Action: current_datetime
Action Input: None
------------------------------
2025-07-11 09:05:43,800 - aiq.agent.react_a

After you have run the above (feel free to modify the query), look at Phoenix and you should see observability data flowing in.
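
When you inspect the trace, it helps to know what answer to expect. The example query is ambiguous, so the sketch below computes both readings for a given time. The function names are made up for illustration, and the hour used here (9) is taken from the log timestamps above; the hour the `current_datetime` tool reports may differ, for example because of time zones.

```python
import math
from datetime import datetime


def sqrt_of_hour_plus_five(now: datetime) -> float:
    """Reading 1: the square root of (current hour + 5)."""
    return math.sqrt(now.hour + 5)


def sqrt_of_hour_then_plus_five(now: datetime) -> float:
    """Reading 2: (the square root of the current hour) + 5."""
    return math.sqrt(now.hour) + 5


moment = datetime(2025, 7, 11, 9, 5)  # hour taken from the log timestamps above
print(round(sqrt_of_hour_plus_five(moment), 4))       # sqrt(14)
print(round(sqrt_of_hour_then_plus_five(moment), 4))  # sqrt(9) + 5
```

Comparing the agent's tool calls in Phoenix against these two readings shows which interpretation it chose.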

## Summary

In this notebook, we've explored how to evaluate and profile our AgentIQ workflow. We've learned how to:

1. Create comprehensive evaluation datasets with test cases for different capabilities
2. Configure and run evaluations using RAGAS and Trajectory evaluators
3. Analyze evaluation results to understand agent performance
4. Handle time-based queries and dynamic responses in evaluations
5. Set up observability features using Phoenix Profiler by Arize AI for monitoring and debugging

These techniques help ensure that our agent performs reliably and efficiently and provide insights for further improvements.