# Agent Evaluation Using Trace ID

This notebook demonstrates how to evaluate agent performance using the **Tool Call Accuracy** metric when you already have a trace ID from a previously executed agent run.

Unlike other notebooks that create and run agents, this notebook assumes you have already executed your agent elsewhere and obtained a trace ID. It focuses solely on fetching the trace and evaluating it using the Flotorch Eval framework.

---

## Key Concepts

* **Trace ID**: A unique identifier for an OpenTelemetry trace generated during agent execution.
* **OpenTelemetry Traces**: Detailed records of the agent's execution steps (spans) used to analyze tool call decisions and accuracy.
* **ToolCallAccuracy**: A Flotorch Eval metric that evaluates the accuracy and appropriateness of tool usage decisions. Its result is reported under the metric name **toolcall_accuracy**.

---

### Architecture Overview

![Workflow Diagram](diagrams/07_AgentEvaluationUsingTraceID_Workflow_Diagram.drawio.png)
*Figure: Detailed workflow diagram showing the step-by-step process of agent evaluation using a trace ID, from trace fetching through metric computation.*

---

## Requirements

* Flotorch account with configured models.
* Valid Flotorch API key and gateway base URL.
* A trace ID from a previously executed agent run with OpenTelemetry tracing enabled.

---


## 1. Setup and Installation

### Purpose
Install the packages needed to run the tool call accuracy evaluation.

### Key Components
- **`flotorch-eval`**: Flotorch evaluation framework with all dependencies for tool call accuracy metrics
- **`flotorch[adk]`**: The Flotorch SDK with its `adk` extra, installed alongside the evaluation framework


In [None]:
# Install the Flotorch Eval framework and the Flotorch SDK with its adk extra.
# The extras specifier is quoted so shells such as zsh do not treat [adk] as a glob pattern.

%pip install flotorch-eval==2.0.0b1 "flotorch[adk]==3.1.0b1"


## 2. Authentication and Credentials

### Purpose
Configure your Flotorch API credentials and gateway URL for authentication.

### Key Components
This cell configures the essential authentication and connection parameters:

**Authentication Parameters**:

| Parameter | Description | Example |
|-----------|-------------|---------|
| `FLOTORCH_API_KEY` | Your API authentication key (found in your Flotorch Console). Entered via `getpass` so it is not displayed in the notebook | `sk_...` |
| `FLOTORCH_BASE_URL` | Your Flotorch gateway endpoint URL | `https://dev-gateway.flotorch.cloud` |

**Note**: Use secure credential management in production environments.
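
One common way to follow the note above is to read credentials from environment variables and prompt only when they are missing. The sketch below is a minimal illustration, not part of the Flotorch SDK: `load_credential` is a hypothetical helper, and treating `FLOTORCH_API_KEY` and `FLOTORCH_BASE_URL` as environment variable names is an assumption made here for the example.

```python
import getpass
import os


def load_credential(name, prompt, secret=True, env=None):
    """Return a credential from the environment, prompting only if it is unset.

    `load_credential` is a hypothetical helper for illustration.
    """
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if value:
        return value
    # Fall back to an interactive prompt; getpass hides what is typed.
    ask = getpass.getpass if secret else input
    return ask(prompt).strip()


# An explicit mapping stands in for os.environ so the example does not prompt.
demo_env = {
    "FLOTORCH_API_KEY": "sk_example",
    "FLOTORCH_BASE_URL": " https://example.invalid ",
}
api_key = load_credential("FLOTORCH_API_KEY", "Paste your API key here: ", env=demo_env)
base_url = load_credential(
    "FLOTORCH_BASE_URL", "Paste your Flotorch Base URL here: ", secret=False, env=demo_env
)
```

Omitting the `env` argument makes the helper read from the real process environment, so the same cell can run unattended in a scheduled job and interactively in a notebook.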


In [None]:
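import getpass  # Prompt for secrets without echoing them in the notebook

# Authentication for Flotorch access.
# getpass does not raise GetPassWarning (it is only issued as a warning),
# so check the entered value instead of wrapping the call in try/except.
FLOTORCH_API_KEY = getpass.getpass("Paste your API key here: ").strip()
if FLOTORCH_API_KEY:
    print("✓ FLOTORCH_API_KEY set successfully")
else:
    print("✗ FLOTORCH_API_KEY not set")

FLOTORCH_BASE_URL = input("Paste your Flotorch Base URL here: ").strip()  # https://dev-gateway.flotorch.cloud
print(f"✓ FLOTORCH_BASE_URL set: {FLOTORCH_BASE_URL}")

# Report success only when both values were actually provided
if FLOTORCH_API_KEY and FLOTORCH_BASE_URL:
    print("✓ All credentials configured successfully!")
else:
    print("✗ Credentials incomplete. Re-run this cell and provide both values.")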


## 3. Global Provider Models and Evaluator Configuration

### Purpose
Define available models from the Flotorch gateway and configure the evaluator model for running the ToolCallAccuracy metric.

### Key Components

**Global Provider Models**: These are the available models from the Flotorch gateway that can be used for evaluation:

| Model Variable | Model Name | Description |
|----------------|------------|-------------|
| `MODEL_CLAUDE_HAIKU` | `flotorch/flotorch-claude-haiku-4-5` | Claude Haiku model via Flotorch gateway |
| `MODEL_CLAUDE_SONNET` | `flotorch/flotorch-claude-sonnet-3-5-v2` | Claude Sonnet model via Flotorch gateway |
| `MODEL_AWS_NOVA_PRO` | `flotorch/flotorch-aws-nova-pro` | AWS Nova Pro model via Flotorch gateway |
| `MODEL_AWS_NOVA_LITE` | `flotorch/flotorch-aws-nova-lite` | AWS Nova Lite model via Flotorch gateway |
| `MODEL_AWS_NOVA_MICRO` | `flotorch/flotorch-aws-nova-micro` | AWS Nova Micro model via Flotorch gateway |

**Evaluator Configuration**:

| Parameter | Description | Example |
|-----------|-------------|---------|
| `default_evaluator` | The LLM model used for evaluation (can use MODEL_* variables above) | `MODEL_CLAUDE_SONNET` or `flotorch/flotorch-model` |


In [None]:
# ============================================================================
# Global Provider Models (Flotorch Gateway Models)
# ============================================================================
# These models are available from the Flotorch gateway and can be used
# for evaluation and other tasks.

MODEL_CLAUDE_HAIKU = "flotorch/flotorch-claude-haiku-4-5"
MODEL_CLAUDE_SONNET = "flotorch/flotorch-claude-sonnet-3-5-v2"
MODEL_AWS_NOVA_PRO = "flotorch/flotorch-aws-nova-pro"
MODEL_AWS_NOVA_LITE = "flotorch/flotorch-aws-nova-lite"
MODEL_AWS_NOVA_MICRO = "flotorch/flotorch-aws-nova-micro"

print("✓ Global provider models defined")

# The LLM model used for evaluation.
# Can be modified to use any MODEL_* constant above (e.g., MODEL_CLAUDE_SONNET, MODEL_AWS_NOVA_PRO)
# You can use your own models from Flotorch Console as well
default_evaluator = MODEL_CLAUDE_HAIKU

print("✓ Evaluator configuration defined")


## 4. Provide Trace ID

### Purpose
Set the trace ID variable from a previously executed agent run. This trace ID is used to fetch the OpenTelemetry trace data for evaluation.

### Key Components

**Trace ID Configuration**:

| Parameter | Description | Example |
|-----------|-------------|---------|
| `TRACE_ID` | The unique identifier of the trace you want to evaluate. This trace ID should come from a previously executed agent run with OpenTelemetry tracing enabled | `abc123def456...` |

### How to Obtain a Trace ID

If you have executed your agent elsewhere:
- The trace ID is usually returned after agent execution
- Check your agent execution logs or console output for trace IDs
- Trace IDs are generated when agents run with OpenTelemetry tracing enabled

**Note**: This notebook assumes you already have a trace ID. If you need to create and run an agent first, please refer to other example notebooks that demonstrate agent creation and execution.
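
Before pasting a value into the next cell, a quick shape check can catch truncated or mistyped IDs. OpenTelemetry defines a trace ID as 16 bytes, conventionally written as 32 hexadecimal characters, with the all-zero value reserved as invalid. The sketch below assumes the Flotorch gateway uses that hex form, which matches the example ID shown in this notebook but is not confirmed here; `looks_like_trace_id` is a hypothetical helper.

```python
import re

# 16 bytes rendered as 32 hexadecimal characters.
_TRACE_ID_RE = re.compile(r"[0-9a-f]{32}")


def looks_like_trace_id(value):
    """Return True if value has the shape of a hex-encoded OpenTelemetry trace ID."""
    if not isinstance(value, str):
        return False
    candidate = value.strip().lower()
    # The all-zero ID is defined as invalid by the OpenTelemetry specification.
    if candidate == "0" * 32:
        return False
    return _TRACE_ID_RE.fullmatch(candidate) is not None


checks = {
    "example": looks_like_trace_id("89c81fa83eb639041529d3b7587b0d37"),
    "placeholder": looks_like_trace_id("<Your_Trace_ID>"),
}
```

A failed check only means the string does not look like a standard trace ID; it says nothing about whether a well-formed ID actually exists on the gateway.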


In [None]:
# ============================================================================
# Trace ID Configuration
# ============================================================================
# Paste your trace ID here
# This trace ID should come from a previously executed agent run

TRACE_ID = "<Your_Trace_ID>"  # Replace with the trace ID from a previously executed agent run, e.g. 89c81fa83eb639041529d3b7587b0d37

print("✓ Trace ID configuration defined")


## 5. Import Required Libraries

### Purpose
Import all required components for evaluating agent tool call accuracy using Flotorch Eval.

### Key Components
- **`AgentEvaluator`**: Core client for agent evaluation orchestration and trace fetching
- **`ToolCallAccuracy`**: Flotorch Eval metric that evaluates the accuracy and appropriateness of tool usage decisions
- **`pandas`**: Data manipulation and display for formatted results tables
- **`display`**: IPython display utility for rendering formatted outputs


In [None]:
# Required imports
# Flotorch Eval components
from flotorch_eval.agent_eval.core.client import AgentEvaluator
from flotorch_eval.agent_eval.metrics.llm_evaluators import ToolCallAccuracy

# Utilities
import pandas as pd
from IPython.display import display

print("✓ Imported necessary libraries successfully")


## 6. Initialize AgentEvaluator

### Purpose
Initialize the `AgentEvaluator` client that will be used to fetch traces and run evaluations.

### Key Components
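1. **AgentEvaluator** (`client`):
   - Connects to the Flotorch gateway using API credentials
   - Configured with a default evaluator model for running LLM-based metrics
   - Provides methods to fetch traces using trace IDs and evaluate them
2. **Metric list** (`metrics`):
   - A list containing a single `ToolCallAccuracy()` instance
   - Passed to `client.evaluate()` in section 8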

The AgentEvaluator is the main interface for all evaluation operations in this notebook.


In [None]:
# Initialize the ToolCallAccuracy metric
metrics = [ToolCallAccuracy()]

# Initialize the AgentEvaluator client
client = AgentEvaluator(
    api_key=FLOTORCH_API_KEY,
    base_url=FLOTORCH_BASE_URL,
    default_evaluator=default_evaluator
)

print("✓ AgentEvaluator initialized successfully")


## 7. Fetch Trace Using Trace ID

### Purpose
Fetch the OpenTelemetry trace data from the Flotorch gateway using the provided trace ID.

### Process
1. **Validate Trace ID**: Ensure a trace ID has been provided
2. **Fetch Trace**: Use the `AgentEvaluator.fetch_traces()` method to retrieve the trace data
3. **Verify Success**: Confirm the trace was successfully fetched

### Key Components
- **`client.fetch_traces(trace_id)`**: Method that retrieves trace data from the Flotorch API
  - Takes a trace ID string as input
  - Returns the complete trace data containing all spans and execution information
  - The trace data includes tool call decisions, parameter details, and execution results

The fetched trace contains detailed information about tool call decisions and execution, which will be analyzed by the ToolCallAccuracy metric to compute the toolcall_accuracy score.
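
If the agent run finished only moments ago, the trace may not be retrievable yet, since OpenTelemetry pipelines commonly export spans in batches. Whether that delay applies to your Flotorch setup is an assumption; the sketch below shows one generic way to retry. `fetch_with_retry` is a hypothetical helper that accepts any callable taking a trace ID, so `client.fetch_traces` could be passed as `fetch_fn`.

```python
import time


def fetch_with_retry(fetch_fn, trace_id, attempts=3, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch_fn(trace_id) until it returns data or the attempts run out.

    Returns the fetched data, or None if every attempt came back empty.
    Re-raises the error from the final attempt if that attempt failed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            data = fetch_fn(trace_id)
        except Exception as exc:  # remember the failure and try again
            last_error = exc
        else:
            last_error = None
            if data:
                return data
        if attempt < attempts:
            sleep(delay_seconds)
    if last_error is not None:
        raise last_error
    return None


# Hypothetical stand-in for a fetch function that succeeds on the third call.
calls = []


def fake_fetch(trace_id):
    calls.append(trace_id)
    return {"trace_id": trace_id} if len(calls) == 3 else None


trace = fetch_with_retry(fake_fetch, "demo-id", sleep=lambda seconds: None)
```

The `sleep` parameter exists only so the delay can be replaced in tests; in normal use the default `time.sleep` applies.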


In [None]:
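# Fetch the trace data from the Flotorch gateway using the trace ID.
# The unedited placeholder from section 4 counts as "no trace ID provided".
if TRACE_ID and TRACE_ID != "<Your_Trace_ID>":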
    try:
        traces = client.fetch_traces(TRACE_ID)
        if traces:
            print(f"✓ Trace fetched successfully for trace ID: {TRACE_ID}")
        else:
            print(f"✗ No trace data found for trace ID: {TRACE_ID}")
            traces = None
    except Exception as e:
        print(f"✗ Error fetching trace: {e}")
        traces = None
else:
    print("✗ No trace ID provided. Please set TRACE_ID variable with your trace ID in section 4.")
    traces = None


## 8. Run Evaluation

### Purpose
Execute the tool call accuracy evaluation by processing the fetched OpenTelemetry trace using the ToolCallAccuracy metric to assess tool usage decisions.

### Process
- Calls `client.evaluate()` with the trace data and ToolCallAccuracy metric
- The evaluator processes the trace to analyze tool call decisions and execution
- Computes the **toolcall_accuracy** metric which includes:
  - Accuracy score (0.0 to 1.0) indicating how appropriate tool usage was
  - Detailed evaluation explanation of tool call decisions
  - Assessment of tool selection, parameter accuracy, and timing
- Returns evaluation results with trajectory ID and metric scores

This step generates the tool call accuracy analysis that will be displayed in the next section.
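
The evaluation cell uses a top-level `await`, which works in notebooks because an event loop is already running. In a plain Python script there is no running loop, so the coroutine has to be driven explicitly. The sketch below shows that pattern with `asyncio.run`; `evaluate_stub` is a hypothetical stand-in for an awaitable such as `client.evaluate`, and `run_evaluation` is an illustrative helper, not part of Flotorch Eval.

```python
import asyncio


async def evaluate_stub(trace, metrics):
    """Hypothetical stand-in for an awaitable evaluation call."""
    await asyncio.sleep(0)  # yield control once, as a real network call would
    return {"trace": trace, "metrics": list(metrics)}


def run_evaluation(evaluate_fn, trace, metrics):
    """Run an async evaluate call from synchronous code.

    Use this only where no event loop is running (for example a script);
    inside a notebook, `await` the call directly instead.
    """
    return asyncio.run(evaluate_fn(trace=trace, metrics=metrics))


result = run_evaluation(evaluate_stub, trace={"spans": []}, metrics=["toolcall_accuracy"])
```

Calling `asyncio.run` from a notebook cell raises `RuntimeError` because a loop is already running in that thread, which is why the notebook cell uses `await` instead.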


In [None]:
if 'traces' in locals() and traces:
    try:
        # Evaluate the trace using the ToolCallAccuracy metric
        results = await client.evaluate(
            trace=traces,
            metrics=metrics
        )

        print("✓ Evaluation completed successfully!")
    except Exception as e:
        print(f"✗ Error during evaluation: {e}")
        results = None
else:
    print("Cannot evaluate: No traces were available.")
    results = None


## 9. Display and Interpret Results

### Purpose
Define a helper function that formats and displays the evaluation output, showing the toolcall_accuracy metric results in a readable table.

### Functionality
The `display_metrics` function:
- Extracts the `toolcall_accuracy` metric from evaluation results
- Formats the accuracy score and evaluation details
- Creates a structured display showing:
  - Tool Call Accuracy Score (0.0 to 1.0)
  - Detailed evaluation explanation
- Uses pandas DataFrame with styled formatting for clean presentation

This function provides a user-friendly way to visualize tool call accuracy metrics.


In [None]:
def display_metrics(result):
    """
    Display tool call accuracy metrics in a formatted table.
    """
    # Find the toolcall_accuracy metric
    metric = next((m for m in result.scores if m.name == "toolcall_accuracy"), None)
    if not metric:
        print("No toolcall_accuracy metric found.")
        return

    # Extract metric details (fall back to an empty dict if none were returned)
    d = metric.details or {}

    # Get the details string (which contains the evaluation explanation)
    details_text = d.get("details", "No details available.")

    # Format the details string with better readability
    details = f"Score: {metric.score:.2f} / 1.0\n\nEvaluation Details:\n{details_text}"

    # Create DataFrame for display
    df = pd.DataFrame([{
        "Metric": metric.name.replace("_", " ").title(),
        "Score": f"{metric.score:.2f}",
        "Details": details
    }])

    # Display DataFrame with multiline support
    display(df.style.set_properties(
        subset=['Details'],
        **{'white-space': 'pre-wrap', 'text-align': 'left'}
    ))

print("✓ Display metrics function defined successfully")


## 10. View Tool Call Accuracy Results

### Purpose
Display the tool call accuracy evaluation results in a formatted table showing the complete assessment.

### Output
The displayed table includes:
- **Metric**: The evaluation metric name (toolcall_accuracy)
- **Score**: The tool call accuracy score (0.0 to 1.0)
- **Details**: Comprehensive evaluation showing:
  - Accuracy score out of 1.0
  - Detailed explanation of tool call decisions
  - Assessment of tool selection, parameter accuracy, and timing appropriateness

This visualization helps identify tool usage issues and optimize the agent's tool call decisions.


In [None]:
if 'results' in locals() and results:
    display_metrics(results)
else:
    print("No results object found. Please run sections 7 and 8 first.")


### Interpreting the Tool Call Accuracy Results

The **toolcall_accuracy** metric summarizes how well the agent used its tools during the evaluated run:

* **Accuracy Score (0.0 to 1.0)**: Indicates how appropriate and accurate the agent's tool usage decisions were:
    * **1.0**: Perfect tool call accuracy - tool selection, parameters, and timing were all appropriate
    * **0.5 to below 1.0**: Good accuracy with minor issues in tool selection or parameter formatting
    * **Below 0.5**: Poor accuracy - incorrect tool selection, wrong parameters, or inappropriate timing
* **Evaluation Details**: Provides a detailed explanation of:
    * **Tool Selection Appropriateness**: Whether the agent selected the correct tool for the task
    * **Parameter Accuracy and Formatting**: Whether tool parameters were correctly formatted and accurate
    * **Timing and Necessity**: Whether the tool call was made at the right time and was necessary for the task
    * **Overall Quality**: Comprehensive assessment of tool usage decision quality

Understanding tool call accuracy helps identify:
- **Tool selection issues**: If the agent uses incorrect tools or fails to use tools when needed
- **Parameter formatting problems**: If parameters are incorrectly extracted or formatted
- **Timing optimization**: If tool calls are made unnecessarily or at inappropriate times
- **Overall reliability**: Monitor accuracy to ensure the agent delivers accurate and reliable results via proper tool integration
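
When checking many runs, it can help to turn the score into a label automatically. The sketch below encodes the bands described above; the exact cut-offs (exactly 1.0 for perfect, 0.5 as the boundary between good and poor) are an assumption made for this example rather than thresholds defined by Flotorch Eval, and `interpret_score` is a hypothetical helper.

```python
def interpret_score(score):
    """Map a tool call accuracy score in [0.0, 1.0] to a coarse label."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0.0 and 1.0, got {score!r}")
    if score == 1.0:
        return "perfect"
    if score >= 0.5:
        return "good, minor issues"
    return "poor"


labels = [interpret_score(s) for s in (1.0, 0.75, 0.5, 0.2)]
```

In the notebook, the score passed in would come from the `toolcall_accuracy` entry in `results.scores`.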


## 11. Summary

This notebook demonstrates how to evaluate agent tool call accuracy using a **trace ID** from a previously executed agent run.

**Use Case**: Evaluate agent performance when you already have a trace ID.

**Evaluation Metric**: toolcall_accuracy

## Core Process

### 1. Trace ID Input
- Provide the trace ID from a previously executed agent run
- The trace ID should correspond to an agent execution with OpenTelemetry tracing enabled

### 2. Trace Fetching
- Use `AgentEvaluator.fetch_traces(trace_id)` to retrieve the complete trace data
- The trace contains detailed information about tool call decisions and execution

### 3. Evaluation
- Use the `AgentEvaluator` client along with the specialized **ToolCallAccuracy** metric from `flotorch-eval`
- The evaluator processes the trace data to compute tool call accuracy statistics using the **toolcall_accuracy** metric

### 4. Analysis
- The notebook displays a thorough tool call accuracy assessment, including:
  - **Accuracy Score** (0.0 to 1.0)
  - **Evaluation Details** explaining tool usage decisions
  - Assessment of tool selection, parameter accuracy, and timing

## Purpose and Benefits

This evaluation workflow is useful when:

- You have executed agents in separate environments and want to evaluate the traces
- You want to re-evaluate existing traces with different metrics or evaluators
- You need to evaluate traces without recreating the agent setup
- You want to analyze historical agent runs using their trace IDs

This approach provides **flexibility and separation of concerns** by allowing evaluation to be performed independently of agent execution.
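
For the historical-analysis use case, the same fetch-then-evaluate steps can be looped over several trace IDs. The sketch below is one possible shape for that loop, not a Flotorch Eval feature: `evaluate_many` is a hypothetical helper, and `fake_fetch` and `fake_evaluate` are stand-ins so the example runs without a gateway connection. It assumes the fetch call is synchronous and the evaluate call is awaitable with `trace` and `metrics` keyword arguments, as used earlier in this notebook.

```python
import asyncio


async def evaluate_many(trace_ids, fetch_fn, evaluate_fn, metrics):
    """Fetch and evaluate each trace ID in turn, recording failures instead of stopping."""
    outcomes = {}
    for trace_id in trace_ids:
        try:
            trace = fetch_fn(trace_id)
            if not trace:
                outcomes[trace_id] = {"ok": False, "error": "no trace data"}
                continue
            result = await evaluate_fn(trace=trace, metrics=metrics)
            outcomes[trace_id] = {"ok": True, "result": result}
        except Exception as exc:  # keep going so one bad trace does not end the batch
            outcomes[trace_id] = {"ok": False, "error": str(exc)}
    return outcomes


# Hypothetical stand-ins for the fetch and evaluate calls.
def fake_fetch(trace_id):
    return {"trace_id": trace_id} if trace_id != "missing" else None


async def fake_evaluate(trace, metrics):
    return f"evaluated {trace['trace_id']}"


outcomes = asyncio.run(
    evaluate_many(["a1", "missing"], fake_fetch, fake_evaluate, metrics=[])
)
```

Inside a notebook, the last statement would instead be `outcomes = await evaluate_many(ids, client.fetch_traces, client.evaluate, metrics)`, since an event loop is already running there.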
