<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Slack Community</a>
    </p>
</center>

# Agent Trajectory Evaluation

This notebook demonstrates how to evaluate whether an agent's tool-calling trajectory matches expected patterns. An agent trajectory is the sequence of actions (tool calls) an agent takes to accomplish a task.

**Why this matters**: Evaluating agent trajectories helps you:
- Understand if your agent follows expected problem-solving paths
- Identify inefficient or incorrect tool usage patterns
- Debug agent behavior


## Setup

Install the dependencies, import the libraries, and set your credentials as environment variables. You'll need an Arize API key, your Arize Space ID, the project name (used as the model ID), and an OpenAI API key for the evaluator model.
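If some of these values are already set in your environment, you may prefer to prompt only for the missing ones. The following is a minimal sketch, assuming a hypothetical helper named `ensure_env` that is not part of the Arize SDK:

```python
import os
from getpass import getpass


def ensure_env(name: str, prompt: str) -> str:
    """Return os.environ[name], prompting (without echo) only if it is unset or empty.

    Hypothetical helper for illustration; not part of any Arize library.
    """
    value = os.environ.get(name)
    if not value:
        value = getpass(prompt)
        os.environ[name] = value
    return value


# Example usage (commented out so the cell never prompts unexpectedly):
# ensure_env("ARIZE_API_KEY", "Enter your Arize API key: ")
```

This keeps secrets out of the notebook source while letting non-interactive runs reuse values that are already exported.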

In [None]:
%pip install "arize>=7.44.0" arize-phoenix arize-phoenix-evals pandas nest_asyncio

In [None]:
from phoenix.evals import (
    llm_classify,
    OpenAIModel # see https://docs.arize.com/phoenix/evaluation/evaluation-models
    # for a full list of supported models
)
import os
import pandas as pd

from datetime import datetime, timedelta

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments



In [None]:
from getpass import getpass

api_key = getpass('Enter your Arize API key: ')
os.environ['ARIZE_API_KEY'] = api_key

space_id = getpass('Enter your Arize Space ID: ')
os.environ['ARIZE_SPACE_ID'] = space_id

model_id = getpass('Enter your Arize Model ID (Project Name): ')
os.environ['ARIZE_MODEL_ID'] = model_id

openai_key = getpass('Enter your OpenAI API key: ')
os.environ['OPENAI_API_KEY'] = openai_key

## Data Extraction

Pull trace data from Arize and prepare it for analysis.

In [None]:
client = ArizeExportClient(api_key=os.environ['ARIZE_API_KEY'])

print('#### Exporting your primary dataset into a dataframe.')


primary_df = client.export_model_to_df(
    space_id=os.environ['ARIZE_SPACE_ID'],
    model_id=os.environ['ARIZE_MODEL_ID'],
    environment=Environments.TRACING,
    start_time=datetime.now() - timedelta(days=7),
    end_time=datetime.now(),
    # Optionally specify columns to improve query performance
    # columns=['context.span_id', 'attributes.llm.input']
)

In [None]:
# Sample data
# data_url = "https://storage.cloud.google.com/arize-assets/tutorials/example/agent_trajectory_sample_data.csv"

## Prompt Template Definition

The evaluation uses a prompt template that tells the LLM how to judge the agent's actual trajectory. Two templates are defined below: one compares the trajectory against a golden (reference) trajectory, and one judges the trajectory on its own. You can customize either template to fit your evaluation criteria.


### Prompt Variables

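| Variable | Description | Source |
|----------|-------------|--------|
| `{reference_outputs}` | The golden (expected) trajectory; used only by the with-reference template | Your reference data |
| `{tool_calls}` | The actual trajectory executed by the agent | Extracted from trace data |
| `{attributes.input.value}` | The user input the agent acted on | Trace data |
| `{attributes.llm.tools}` | The tool definitions available to the agent | Trace data |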
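To make the role of these variables concrete, the sketch below fills placeholders from one row of data. It is an illustration only: `fill_template` is a hypothetical helper, not how Phoenix renders templates internally, and the row values are made up.

```python
import pandas as pd


def fill_template(template: str, row: pd.Series) -> str:
    """Substitute each `{column}` placeholder with that column's value from `row`.

    Hypothetical helper for illustration. Plain `str.format` is avoided because
    it would treat the dots in `{attributes.input.value}` as attribute access.
    """
    filled = template
    for column, value in row.items():
        filled = filled.replace("{" + str(column) + "}", str(value))
    return filled


# One made-up row with two of the columns the templates expect
row = pd.Series(
    {
        "tool_calls": [{"1": [{"name": "get_llm_table_search", "arguments": "{}"}]}],
        "attributes.input.value": "Which table has order data?",
    }
)
template = "Actual Trajectory:\n{tool_calls}\n\nUser Inputs:\n{attributes.input.value}"
filled_prompt = fill_template(template, row)
print(filled_prompt)
```

Each placeholder name must match a column in the dataframe passed to the evaluator, which is why the data preparation step later in this notebook keeps the dotted column names unchanged.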

### Customizing the Prompt

You may want to adjust the evaluation criteria or output format based on your specific use case:

- Add specific criteria relevant to your agent's domain
- Include additional metadata



In [None]:
TRAJECTORY_ACCURACY_PROMPT_WITH_REFERENCE = """
You are a helpful AI bot that checks whether an AI agent’s internal trajectory is accurate and effective.

You will be given:
1. The agent’s actual trajectory of tool calls
2. The user input that the agent used to make its decisions
3. The tool definitions available to the agent when it made its tool calls
4. A golden trajectory that represents the ideal flow in normal use

An accurate trajectory:
- Progresses logically from step to step
- Follows the golden trajectory where reasonable
- Shows a clear path toward completing a goal
- Is reasonably efficient (doesn’t take unnecessary detours)

##

Correct Trajectory:
{reference_outputs}

##

Actual Trajectory:
{tool_calls}

User Inputs:
{attributes.input.value}

Tool Definitions:
{attributes.llm.tools}

##

Compare the actual trajectory to the golden one:
- Highlight any major deviations
- Determine whether the deviations are acceptable or harmful
- Assess if the overall goal is still achieved effectively

Your response must be a single string, either `correct` or `incorrect`, and must not include any additional text.

- Respond with `correct` if the agent’s trajectory adheres to the rubric and accomplishes the task effectively.
- Respond with `incorrect` if the trajectory is confusing, misaligned with the goal, inefficient, or does not accomplish the task.


"""

In [None]:
TRAJECTORY_ACCURACY_PROMPT_WITHOUT_REFERENCE = """
You are a helpful AI bot that checks whether an AI agent’s internal trajectory is accurate and effective.

You will be given:
1. The agent’s actual trajectory of tool calls
2. The user input that the agent used to make its decisions
3. The tool definitions available to the agent when it made its tool calls

An accurate trajectory:
- Progresses logically from step to step
- Shows a clear path toward completing a goal
- Is reasonably efficient (doesn’t take unnecessary detours)

##

Actual Trajectory:
{tool_calls}

User Inputs:
{attributes.input.value}

Tool Definitions:
{attributes.llm.tools}

##


Your response must be a single string, either `correct` or `incorrect`, and must not include any additional text.

- Respond with `correct` if the agent’s trajectory adheres to the rubric and accomplishes the task effectively.
- Respond with `incorrect` if the trajectory is confusing, misaligned with the goal, inefficient, or does not accomplish the task.


"""

## Data Preparation

These functions filter and transform trace data into the format needed for evaluation.

**Core concepts:**
- **Trace filtering**: Selecting which agent executions to evaluate
- **Span filtering**: Selecting which parts of each execution to analyze
- **Tool call extraction**: Identifying the sequence of actions taken

The `filter_spans_by_trace_criteria` function is particularly important as it allows you to:
1. Select relevant traces using trace-level filters (e.g., by user query type, duration)
2. Focus on specific spans within those traces (e.g., only LLM-generated tool calls)

This two-level filtering gives you fine-grained control over your evaluation data.
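The two levels can be sketched on a toy DataFrame with plain pandas. The column names follow the ones used in this notebook; the span names and values are made up for illustration.

```python
import pandas as pd

# Toy spans: two traces, each with a CHAIN span and an LLM span
spans = pd.DataFrame(
    {
        "context.trace_id": ["t1", "t1", "t2", "t2"],
        "name": ["searchrouter", "ChatCompletion", "summarizer", "ChatCompletion"],
        "attributes.openinference.span.kind": ["CHAIN", "LLM", "CHAIN", "LLM"],
    }
)

# Level 1: a trace qualifies if any of its spans matches the trace filter
trace_mask = spans["name"].str.contains("searchrouter", case=False, na=False)
matching_trace_ids = set(spans.loc[trace_mask, "context.trace_id"])

# Level 2: within qualifying traces, keep only spans matching the span filter
in_trace = spans["context.trace_id"].isin(matching_trace_ids)
is_llm = spans["attributes.openinference.span.kind"] == "LLM"
selected = spans[in_trace & is_llm]

print(selected)
```

Note that the span matched at level 1 (the `searchrouter` span) is not itself required to pass the level 2 filter. In the function defined below, when several trace filters are given, a single span must satisfy all of them for its trace to qualify.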

In [None]:
from typing import Dict, Any

def filter_spans_by_trace_criteria(
    df: pd.DataFrame,
    trace_filters: Dict[str, Dict[str, Any]],
    span_filters: Dict[str, Dict[str, Any]]
) -> pd.DataFrame:
    """Filter spans based on trace-level and span-level criteria.

    Args:
        df: DataFrame with trace data
        trace_filters: Dictionary of column names and filtering criteria for traces
                      Format: {"column_name": {"operator": value}}
                      Supported operators: ">=", "<=", "==", "!=", "contains", "notna", "isna"
        span_filters: Dictionary of column names and filtering criteria for spans
                     Format: {"column_name": {"operator": value}}
                     Same supported operators as trace_filters

    Returns:
        DataFrame with filtered spans from traces that match trace_filters
    """
    # Get all unique trace_ids
    all_trace_ids = set(df['context.trace_id'].unique())
    print(f"Total traces: {len(all_trace_ids)}")

    # Start from the full dataframe; each filter below returns a new frame,
    # so the caller's dataframe is never modified
    traces_df = df
    for column, criteria in trace_filters.items():
        if column not in traces_df.columns:
            print(f"Warning: Column '{column}' not found in dataframe")
            continue

        for operator, value in criteria.items():
            if operator == ">=":
                matching_spans = traces_df[traces_df[column] >= value]
            elif operator == "<=":
                matching_spans = traces_df[traces_df[column] <= value]
            elif operator == "==":
                matching_spans = traces_df[traces_df[column] == value]
            elif operator == "!=":
                matching_spans = traces_df[traces_df[column] != value]
            elif operator == "contains":
                matching_spans = traces_df[traces_df[column].str.contains(value, case=False, na=False)]
            elif operator == "isna":
                matching_spans = traces_df[traces_df[column].isna()]
            elif operator == "notna":
                matching_spans = traces_df[traces_df[column].notna()]
            else:
                print(f"Warning: Unsupported operator '{operator}' - skipping")
                continue

            traces_df = matching_spans

    matching_trace_ids = set(traces_df['context.trace_id'].unique())
    print(f"Found {len(matching_trace_ids)} traces matching trace criteria")

    if not matching_trace_ids:
        print("No matching traces found")
        return pd.DataFrame()

    # Filter to keep only rows from matching traces
    result_df = df[df['context.trace_id'].isin(matching_trace_ids)].copy()

    # Apply span filters
    for column, criteria in span_filters.items():
        if column not in result_df.columns:
            print(f"Warning: Column '{column}' not found in dataframe")
            continue

        for operator, value in criteria.items():
            if operator == ">=":
                result_df = result_df[result_df[column] >= value]
            elif operator == "<=":
                result_df = result_df[result_df[column] <= value]
            elif operator == "==":
                result_df = result_df[result_df[column] == value]
            elif operator == "!=":
                result_df = result_df[result_df[column] != value]
            elif operator == "contains":
                result_df = result_df[result_df[column].str.contains(value, case=False, na=False)]
            elif operator == "isna":
                result_df = result_df[result_df[column].isna()]
            elif operator == "notna":
                result_df = result_df[result_df[column].notna()]
            else:
                print(f"Warning: Unsupported operator '{operator}' - skipping")
                continue

    print(f"Final result: {len(result_df)} spans from {len(matching_trace_ids)} traces")
    return result_df

In [None]:
def prepare_trace_data_for_evaluation(
    df,
    group_by_col="context.trace_id",
    extract_cols={"tool_calls": "tool_calls"},
    additional_data=None,
    filter_empty=True,
):
    """
    Prepare trace data for evaluation by grouping, sorting by start_time, and extracting specified columns.

    Args:
        df: DataFrame containing trace data
        group_by_col: Column to group traces by (default: "context.trace_id")
        extract_cols: Dict mapping {output_key: source_column} to extract from each row
                     Can contain multiple columns to extract
        additional_data: Dict of additional data to include with each trace (default: None)
        filter_empty: Whether to filter out empty values (default: True)

    Returns:
        DataFrame with processed trace data ready for evaluation
    """
    # Group by specified column
    grouped = df.groupby(group_by_col)

    # Prepare results list
    results = []

    for group_id, group in grouped:
        # Always sort by start_time to ensure correct order
        group = group.sort_values("start_time")

        # Initialize a dict to store extracted data
        trace_data = {group_by_col: group[group_by_col].iloc[0]}

        # Extract and process each requested column
        for output_key, source_col in extract_cols.items():
            ordered_extracts = []
            # Iterate through rows as dictionaries to handle column names with dots
            for i, (_, row_data) in enumerate(group.reset_index(drop=True).iterrows()):
                # Convert row to dictionary for easier access
                row_dict = row_data.to_dict()
                value = row_dict.get(source_col)
                if not filter_empty or (value is not None and value):
                    ordered_extracts.append({str(i + 1): value})
            trace_data[output_key] = ordered_extracts

        # Add any additional data
        if additional_data:
            trace_data.update(additional_data)

        # Add to results
        results.append(trace_data)

    # Convert to DataFrame
    return pd.DataFrame(results)

In [None]:
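def extract_tool_calls(output_messages):
    """Flatten a span's output messages into a list of tool calls.

    Args:
        output_messages: Message dicts from `attributes.llm.output_messages`;
            may be None, NaN, or empty when the span produced no output.

    Returns:
        List of {"name": ..., "arguments": ...} dicts in call order.
    """
    # NaN is a float and is not iterable; len() also covers array-like values
    if output_messages is None or isinstance(output_messages, float) or len(output_messages) == 0:
        return []

    tool_calls = []
    for message in output_messages:
        # Only messages in which the model invoked tools carry this key
        if "message.tool_calls" in message:
            for tool_call in message["message.tool_calls"]:
                tool_calls.append({
                    "name": tool_call["tool_call.function.name"],
                    "arguments": tool_call["tool_call.function.arguments"]
                })
    return tool_calls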


## Evaluation Configuration



**Reference outputs** define your golden path - what tools *should* be called and in what order. These represent your expectation of the ideal agent behavior for a given task.

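Note: A golden path is only meaningful when the task has a deterministic expected sequence of tool calls. For open-ended tasks, use the template without a reference.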
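For a multi-step task, the golden path might list several tools in order. The sketch below assumes the same step-number-to-tool-name mapping as the example in this notebook; the second tool name, `run_sql_query`, is hypothetical.

```python
import pandas as pd

# Step number -> expected tool name. "get_llm_table_search" comes from this
# notebook's example; "run_sql_query" is a hypothetical second step.
reference_outputs = {"1": "get_llm_table_search", "2": "run_sql_query"}

# Every evaluated trace is compared against the same golden path, so the
# reference is repeated on each row under the `reference_outputs` column
# that the with-reference prompt template reads.
traces = pd.DataFrame({"context.trace_id": ["t1", "t2"]})
traces["reference_outputs"] = [reference_outputs] * len(traces)

print(traces)
```

Keeping the keys as ordered step numbers makes the expected order explicit to the LLM judge, mirroring the numbered steps produced for the actual trajectory.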

In [None]:
# reference_outputs = {"1":"get_llm_table_search"}

### Filter Data

Customize these parameters to match your specific evaluation needs:

| Parameter | Description | Example |
|-----------|-------------|---------|
| reference_outputs | Expected tool calls | `{"1": "get_llm_table_search"}` |
| trace_filters | Criteria for selecting traces | `{"name": {"contains": "searchrouter"}}` |
| span_filters | Criteria for selecting spans within traces | `{"attributes.openinference.span.kind": {"==": "LLM"}}` |

Span filters determine which spans within the matched traces are used for the evaluation. For example, `{"attributes.openinference.span.kind": {"==": "LLM"}}` keeps only the LLM spans within the selected traces.

> **Note**: Update `trace_filters` and `span_filters` to match your specific evaluation criteria.

In [None]:

eval_traces = filter_spans_by_trace_criteria(
    df=primary_df,
    trace_filters={"name": {"contains": "searchrouter"}},
    span_filters={"attributes.openinference.span.kind": {"==": "LLM"}}
)

Next, extract the tool calls from each span's output messages so they can be used in the evaluation.
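For orientation, here is an illustrative shape of one output-messages value and the flattened result. The key names match those read by `extract_tool_calls`; the `message.role` entry and the argument values are assumptions made up for this example.

```python
# Illustrative shape of one `attributes.llm.output_messages` value
example_output_messages = [
    {
        "message.role": "assistant",  # assumed key, shown for context only
        "message.tool_calls": [
            {
                "tool_call.function.name": "get_llm_table_search",
                "tool_call.function.arguments": '{"query": "orders"}',
            }
        ],
    }
]

# The same flattening written inline: one {"name", "arguments"} dict per call
flattened = [
    {
        "name": call["tool_call.function.name"],
        "arguments": call["tool_call.function.arguments"],
    }
    for message in example_output_messages
    for call in message.get("message.tool_calls", [])
]
print(flattened)
```

The arguments stay as the raw JSON string, which is usually enough for an LLM judge to read.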

In [None]:
eval_traces['tool_calls'] = eval_traces['attributes.llm.output_messages'].apply(extract_tool_calls)

eval_traces[['tool_calls']].head()

### Prepare the data for the evaluation
This step groups spans by `context.trace_id`, orders them by start time, extracts the columns that feed the prompt variables, and optionally appends additional data (such as reference outputs) to each row.
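The grouping logic can be sketched on toy data as follows. This is a compact illustration of the same idea, not a replacement for `prepare_trace_data_for_evaluation`; the span values are made up.

```python
import pandas as pd

# Toy spans, deliberately out of order; the middle t1 span made no tool call
spans = pd.DataFrame(
    {
        "context.trace_id": ["t1", "t1", "t1", "t2"],
        "start_time": pd.to_datetime(
            [
                "2025-01-01 00:00:03",
                "2025-01-01 00:00:01",
                "2025-01-01 00:00:02",
                "2025-01-01 00:00:05",
            ]
        ),
        "tool_calls": [[{"name": "b"}], [{"name": "a"}], [], [{"name": "c"}]],
    }
)

rows = []
for trace_id, group in spans.groupby("context.trace_id"):
    ordered = group.sort_values("start_time")["tool_calls"].tolist()
    # Number every span in time order, then drop the empty ones
    steps = [{str(i + 1): value} for i, value in enumerate(ordered) if value]
    rows.append({"context.trace_id": trace_id, "tool_calls": steps})

out = pd.DataFrame(rows)
print(out)
```

Because steps are numbered before empty values are dropped, step numbers can have gaps (here trace `t1` has steps `1` and `3`). The gaps show where a span made no tool call.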

In [None]:
tool_calls_df = prepare_trace_data_for_evaluation(
    df=eval_traces,
    # Map each output column to its source column; add more columns here as needed
    extract_cols={
        "tool_calls": "tool_calls",
        "attributes.llm.tools": "attributes.llm.tools",
        "attributes.input.value": "attributes.input.value",
    },
    # additional_data={"reference_outputs": reference_outputs},
)

In [None]:
tool_calls_df.head()

In [None]:
sample_data = tool_calls_df.head(2)

In [None]:
sample_data.head()

## Running the Evaluation

After preparing your traces and configuring the evaluation parameters, you can execute the LLM-based evaluation:
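The classifier call below restricts the judge's answer to a fixed set of labels, called `rails`. Conceptually, a raw model response is mapped onto one of those labels or treated as unparseable. The sketch below illustrates that idea only; `snap_to_rail` is a hypothetical helper and not Phoenix's implementation.

```python
from typing import List, Optional


def snap_to_rail(raw_output: str, rails: List[str]) -> Optional[str]:
    """Return the rail equal to the cleaned-up output, or None if nothing matches.

    Exact matching matters here: "incorrect" contains "correct", so a naive
    substring check would mislabel responses.
    """
    cleaned = raw_output.strip().strip("`").lower()
    return cleaned if cleaned in rails else None


rails = ["correct", "incorrect"]
print(snap_to_rail(" Incorrect\n", rails))
print(snap_to_rail("The trajectory looks fine.", rails))
```

This is also why the prompt templates insist on a single-word response with no additional text.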


In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:

model = OpenAIModel(
    api_key=os.environ['OPENAI_API_KEY'],
    model="gpt-4o-mini",
    temperature=0.0,
)

In [None]:
rails = ["correct", "incorrect"]
eval_results = llm_classify(
    dataframe=sample_data,
    template=TRAJECTORY_ACCURACY_PROMPT_WITHOUT_REFERENCE,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
    concurrency=20,
)

## Analyzing Results

The evaluation results contain:
- **label**: Overall trajectory assessment (correct/incorrect)
- **explanation**: Detailed reasoning for the assessment
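A quick way to summarize these two columns is to compute the share of each label and pull the explanations for failing trajectories. The sketch below uses made-up results with the same column names.

```python
import pandas as pd

# Made-up results with the same `label` and `explanation` columns
toy_results = pd.DataFrame(
    {
        "label": ["correct", "incorrect", "correct", "correct"],
        "explanation": [
            "Called the search tool once and answered.",
            "Repeated the same tool call three times.",
            "Followed a direct path to the answer.",
            "Used the expected tool with valid arguments.",
        ],
    }
)

# Share of each label across the evaluated traces
label_share = toy_results["label"].value_counts(normalize=True)
print(label_share)

# Explanations for failing trajectories, to review first
failures = toy_results.loc[toy_results["label"] == "incorrect", "explanation"]
print(failures.tolist())
```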
   

In [None]:
eval_results.head()

The evaluation results can then be merged with your original data for analysis or to log back to Arize:
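The two joins involved can be sketched on toy data: first align results with their inputs by index, then attach each trace's root span (the span with no parent). All IDs below are made up.

```python
import pandas as pd

# Inputs and results share the same index, as in this notebook
inputs = pd.DataFrame({"context.trace_id": ["t1", "t2"]}, index=[0, 1])
results = pd.DataFrame({"label": ["correct", "incorrect"]}, index=[0, 1])

# Step 1: align results with their inputs by index
merged = inputs.merge(results, left_index=True, right_index=True)

# Step 2: find each trace's root span and attach its span ID
all_spans = pd.DataFrame(
    {
        "context.trace_id": ["t1", "t1", "t2"],
        "context.span_id": ["s1", "s2", "s3"],
        "parent_id": [None, "s1", None],
    }
)
root_spans = all_spans.loc[
    all_spans["parent_id"].isnull(), ["context.trace_id", "context.span_id"]
]
with_root = merged.merge(
    root_spans, on="context.trace_id", how="left", validate="many_to_one"
)
print(with_root)
```

The `validate="many_to_one"` argument is an optional safeguard added in this sketch: it raises an error if a trace has more than one root span, which would otherwise silently duplicate evaluation rows.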

In [None]:
# Join each evaluated row with its eval result; both dataframes share the same index
merged_df = pd.merge(
    sample_data, eval_results, left_index=True, right_index=True
)


merged_df.rename(columns={
    'label': 'trace_eval.AgentTrajectoryAccuracy.label',
    'explanation': 'trace_eval.AgentTrajectoryAccuracy.explanation'
}, inplace=True)

merged_df.head()

In [None]:
# Get the span_id where parent_id is null for each trace_id
root_spans = primary_df[primary_df['parent_id'].isnull()][['context.trace_id', 'context.span_id']]

# Merge with merged_df to get the root span_id
final_df = pd.merge(
    merged_df,
    root_spans,
    on='context.trace_id',
    how='left'
)



In [None]:
final_eval_df = final_df[['context.trace_id', 'context.span_id', 'trace_eval.AgentTrajectoryAccuracy.label', 'trace_eval.AgentTrajectoryAccuracy.explanation']]

final_eval_df.head()


In [None]:
import os
from arize.pandas.logger import Client



# Initialize the Arize client; the model_id (project name) is passed when logging
arize_client = Client(space_id=os.environ['ARIZE_SPACE_ID'], api_key=os.environ['ARIZE_API_KEY'])


# Index the evaluations by root span ID so Arize can attach them to the right spans
final_eval_df = final_eval_df.set_index(final_df["context.span_id"])

# Use Arize client to log evaluations
response = arize_client.log_evaluations_sync(
    dataframe=final_eval_df,
    model_id=os.environ['ARIZE_MODEL_ID'],
)

See your results in Arize:

<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/agent-trajectory/Screenshot%202025-06-22%20at%208.43.47%E2%80%AFPM.png" width="800"/>