<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluating an Agent</h1>

This notebook is an end-to-end example of how to trace and evaluate an agent, using a "talk-to-your-data" agent as the subject.

The notebook includes:
* Manually instrumenting an agent using Phoenix decorators
* Evaluating function calling accuracy using LLM as a Judge
* Evaluating function calling accuracy by comparing to ground truth
* Evaluating SQL query generation
* Evaluating Python code generation
* Evaluating the path of an agent

## Install Dependencies, Import Libraries, Set API Keys

In [3]:
!pip install -q openai "arize-phoenix>=8.8.0" "arize-phoenix-otel>=0.8.0" openinference-instrumentation-openai python-dotenv duckdb "openinference-instrumentation>=0.1.21" tqdm

In [5]:
import dotenv

dotenv.load_dotenv()

import json
import os
from getpass import getpass

import duckdb
import pandas as pd
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
from opentelemetry.trace import StatusCode
from pydantic import BaseModel, Field
from tqdm import tqdm

from phoenix.otel import register

In [6]:
if os.getenv("OPENAI_API_KEY") is None:
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

client = OpenAI()
model = "gpt-4o-mini"
project_name = "self-improving-agent"

## Enable Phoenix Tracing

Sign up for a free instance of [Phoenix Cloud](https://app.phoenix.arize.com) to get your API key. If you'd prefer, you can instead [self-host Phoenix](https://docs.arize.com/phoenix/deployment).

In [7]:
if os.getenv("PHOENIX_API_KEY") is None:
    os.environ["PHOENIX_API_KEY"] = getpass("Enter your Phoenix API key: ")

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={os.getenv('PHOENIX_API_KEY')}"

In [9]:
tracer_provider = register(
    project_name=project_name,
    auto_instrument=True,
)

tracer = tracer_provider.get_tracer(__name__)

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: self-improving-agent
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****', 'authorization': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



## Prepare dataset

Your agent will interact with a local database. Start by loading in that data:

In [10]:
store_sales_df = pd.read_parquet(
    "https://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/llama-index/Store_Sales_Price_Elasticity_Promotions_Data.parquet"
)
store_sales_df.head()

| | Store_Number | SKU_Coded | Product_Class_Code | Sold_Date | Qty_Sold | Total_Sale_Value | On_Promo |
|---|---|---|---|---|---|---|---|
| 0 | 1320 | 6172800 | 22875 | 2021-11-02 | 3 | 56.849998 | 0 |
| 1 | 2310 | 6172800 | 22875 | 2021-11-03 | 1 | 18.950001 | 0 |
| 2 | 3080 | 6172800 | 22875 | 2021-11-03 | 1 | 18.950001 | 0 |
| 3 | 2310 | 6172800 | 22875 | 2021-11-06 | 1 | 18.950001 | 0 |
| 4 | 4840 | 6172800 | 22875 | 2021-11-07 | 1 | 18.950001 | 0 |


## Define the tools

Now you can define your agent tools.

### Tool 1: Database Lookup

In [11]:
SQL_GENERATION_PROMPT = """
Generate an SQL query based on a prompt. Do not reply with anything besides the SQL query.
The prompt is: {prompt}

The available columns are: {columns}
The table name is: {table_name}
"""


def generate_sql_query(prompt: str, columns: list, table_name: str) -> str:
    """Generate an SQL query based on a prompt"""
    formatted_prompt = SQL_GENERATION_PROMPT.format(
        prompt=prompt, columns=columns, table_name=table_name
    )

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": formatted_prompt}],
    )

    return response.choices[0].message.content


@tracer.tool()
def lookup_sales_data(prompt: str) -> str:
    """Implementation of sales data lookup from parquet file using SQL"""
    try:
        table_name = "sales"
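        # Copy the DataFrame into a DuckDB table (DuckDB resolves store_sales_df by name)
        duckdb.sql(f"CREATE TABLE IF NOT EXISTS {table_name} AS SELECT * FROM store_sales_df")

        print(store_sales_df.columns)
        print(table_name)
        # Pass a plain list so the prompt shows column names without the pandas Index wrapper
        sql_query = generate_sql_query(prompt, list(store_sales_df.columns), table_name)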
        sql_query = sql_query.strip()
        sql_query = sql_query.replace("```sql", "").replace("```", "")

        with tracer.start_as_current_span(
            "execute_sql_query", openinference_span_kind="chain"
        ) as span:
            span.set_input(value=sql_query)

            # Execute the SQL query
            result = duckdb.sql(sql_query).df()
            span.set_output(value=str(result))
            span.set_status(StatusCode.OK)
        return result.to_string()
    except Exception as e:
        return f"Error accessing data: {str(e)}"
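
Even when the prompt asks for the SQL query alone, models often wrap the reply in a Markdown code fence, which is why the tool strips fence markers before executing the query. The sketch below shows one way to isolate that step in a small, testable helper. `strip_sql_fences` is a hypothetical name introduced here for illustration; it is not part of Phoenix or DuckDB.

```python
import re

# Matches a reply that is entirely wrapped in one Markdown fence, with or
# without a language tag such as "sql". Hypothetical helper for illustration.
_FENCE_RE = re.compile(r"^```[a-zA-Z]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def strip_sql_fences(raw: str) -> str:
    """Return the SQL text with any surrounding Markdown code fence removed."""
    text = raw.strip()
    match = _FENCE_RE.match(text)
    if match:
        # Keep only what was inside the fence
        text = match.group(1)
    return text.strip()


print(strip_sql_fences("```sql\nSELECT COUNT(*) FROM sales;\n```"))
print(strip_sql_fences("SELECT COUNT(*) FROM sales;"))
```

Both calls print the same bare query, so the helper is safe to apply whether or not the model added a fence.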

In [None]:
# example_data = lookup_sales_data("Show me all the sales for store 1320 on November 1st, 2021")
# example_data

### Tool 2: Data Visualization

In [13]:
class VisualizationConfig(BaseModel):
    chart_type: str = Field(..., description="Type of chart to generate")
    x_axis: str = Field(..., description="Name of the x-axis column")
    y_axis: str = Field(..., description="Name of the y-axis column")
    title: str = Field(..., description="Title of the chart")


@tracer.chain()
def extract_chart_config(data: str, visualization_goal: str) -> dict:
    """Generate chart visualization configuration

    Args:
        data: String containing the data to visualize
        visualization_goal: Description of what the visualization should show

    Returns:
        Dictionary containing line chart configuration
    """
    prompt = f"""Generate a chart configuration based on this data: {data}
    The goal is to show: {visualization_goal}"""

    response = client.beta.chat.completions.parse(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format=VisualizationConfig,
    )

    try:
        # The parsed VisualizationConfig lives on `.parsed`; `.content` is the raw JSON string
        content = response.choices[0].message.parsed

        # Return structured chart config
        return {
            "chart_type": content.chart_type,
            "x_axis": content.x_axis,
            "y_axis": content.y_axis,
            "title": content.title,
            "data": data,
        }
    except Exception:
        return {
            "chart_type": "line",
            "x_axis": "date",
            "y_axis": "value",
            "title": visualization_goal,
            "data": data,
        }


@tracer.chain()
def create_chart(config: dict) -> str:
    """Create a chart based on the configuration"""
    prompt = f"""Write python code to create a chart based on the following configuration.
    Only return the code, no other text.
    config: {config}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    code = response.choices[0].message.content
    code = code.replace("```python", "").replace("```", "")
    code = code.strip()

    return code


@tracer.tool()
def generate_visualization(data: str, visualization_goal: str) -> str:
    """Generate a visualization based on the data and goal"""
    config = extract_chart_config(data, visualization_goal)
    code = create_chart(config)
    return code

In [10]:
# code = generate_visualization(example_data, "A line chart of sales over each day in november.")

In [14]:
@tracer.tool()
def run_python_code(code: str) -> str:
    """Execute Python code in a restricted environment"""
    # Create restricted globals/locals dictionaries with plotting libraries
    restricted_globals = {
        "__builtins__": {
            "print": print,
            "len": len,
            "range": range,
            "sum": sum,
            "min": min,
            "max": max,
            "int": int,
            "float": float,
            "str": str,
            "list": list,
            "dict": dict,
            "tuple": tuple,
            "set": set,
            "round": round,
            "__import__": __import__,
            "json": __import__("json"),
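        },
        # A non-empty fromlist makes __import__ return the pyplot submodule, not top-level matplotlib
        "plt": __import__("matplotlib.pyplot", fromlist=["pyplot"]),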
        "pd": __import__("pandas"),
        "np": __import__("numpy"),
        "sns": __import__("seaborn"),
    }

    try:
        # Execute code in restricted environment
        exec_locals = {}
        exec(code, restricted_globals, exec_locals)

        # If the executed code bound its own `plt`, report that a plot was created
        if "plt" in exec_locals:
            return "Code executed successfully; a plot was created"

        return "Code executed successfully"

    except Exception as e:
        return f"Error executing code: {str(e)}"
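
If you want the tool to return what the generated code printed, one option is to redirect standard output while the code runs. The following is a minimal sketch under that assumption; `run_and_capture` is a hypothetical helper, and like any `exec`-based approach it is not a security sandbox.

```python
import contextlib
import io


def run_and_capture(code: str) -> str:
    """Execute `code` and return whatever it printed (hypothetical helper)."""
    buffer = io.StringIO()
    namespace = {}  # fresh globals so snippets do not see notebook state
    try:
        # Everything printed inside the block goes to `buffer` instead of the console
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as e:
        return f"Error executing code: {e}"
    return buffer.getvalue()


print(run_and_capture("print(sum(range(5)))"))
```

Returning the captured text as a string keeps the tool result compatible with the `content` field of a tool message.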

### Tool 3: Data Analysis

In [15]:
@tracer.tool()
def analyze_sales_data(prompt: str, data: str) -> str:
    """Implementation of AI-powered sales data analysis"""
    # Construct prompt based on analysis type and data subset
    prompt = f"""Analyze the following data: {data}
    Your job is to answer the following question: {prompt}"""

    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )

    analysis = response.choices[0].message.content
    return analysis if analysis else "No analysis could be generated"

In [13]:
# analysis = analyze_sales_data("What is the most popular product SKU?", example_data)
# analysis


### Tool Schema:

You'll need to pass your tool descriptions into your agent router. The following code allows you to easily do so:

In [16]:
# Define tools/functions that can be called by the model
tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_sales_data",
            "description": "Look up data from Store Sales Price Elasticity Promotions dataset",
            "parameters": {
                "type": "object",
                "properties": {
                    "prompt": {
                        "type": "string",
                        "description": "The unchanged prompt that the user provided.",
                    }
                },
                "required": ["prompt"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "analyze_sales_data",
            "description": "Analyze sales data to extract insights",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "string",
                        "description": "The lookup_sales_data tool's output.",
                    },
                    "prompt": {
                        "type": "string",
                        "description": "The unchanged prompt that the user provided.",
                    },
                },
                "required": ["data", "prompt"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "generate_visualization",
            "description": "Generate Python code to create data visualizations",
            "parameters": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "string",
                        "description": "The lookup_sales_data tool's output.",
                    },
                    "visualization_goal": {
                        "type": "string",
                        "description": "The goal of the visualization.",
                    },
                },
                "required": ["data", "visualization_goal"],
            },
        },
    },
    # {
    #     "type": "function",
    #     "function": {
    #         "name": "run_python_code",
    #         "description": "Run Python code in a restricted environment",
    #         "parameters": {
    #             "type": "object",
    #             "properties": {
    #                 "code": {"type": "string", "description": "The Python code to run."}
    #             },
    #             "required": ["code"]
    #         }
    #     }
    # }
]

# Dictionary mapping function names to their implementations
tool_implementations = {
    "lookup_sales_data": lookup_sales_data,
    "analyze_sales_data": analyze_sales_data,
    "generate_visualization": generate_visualization,
    # "run_python_code": run_python_code
}
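
Because the schema list and the implementation dictionary are maintained separately, they can drift apart, for example when a tool is commented out in one place but not the other. The sketch below is one possible consistency check. `check_tool_registry` and the demo data are assumptions made for illustration, following the same schema layout as `tools` above.

```python
def check_tool_registry(tools: list, implementations: dict) -> list:
    """Return a list of problems; an empty list means the two structures agree."""
    problems = []
    schema_names = {tool["function"]["name"] for tool in tools}

    # Every schema needs a callable, and every callable needs a schema
    for name in sorted(schema_names - set(implementations)):
        problems.append(f"schema '{name}' has no implementation")
    for name in sorted(set(implementations) - schema_names):
        problems.append(f"implementation '{name}' has no schema")

    # Every required parameter must be declared under "properties"
    for tool in tools:
        params = tool["function"]["parameters"]
        for required in params.get("required", []):
            if required not in params.get("properties", {}):
                problems.append(
                    f"{tool['function']['name']}: required '{required}' is not declared"
                )
    return problems


# Illustrative data only: one schema with a matching implementation
demo_tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_sales_data",
            "parameters": {
                "type": "object",
                "properties": {"prompt": {"type": "string"}},
                "required": ["prompt"],
            },
        },
    }
]
demo_impls = {"lookup_sales_data": lambda prompt: "rows"}
print(check_tool_registry(demo_tools, demo_impls))
```

In the notebook you could call it as `check_tool_registry(tools, tool_implementations)` before running the agent.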

## Save Prompts in Phoenix

Saving prompts in Phoenix makes it easy to track prompt versions. Because you'll be optimizing the router prompt in this example, save it as a Prompt in Phoenix.

In [71]:
import phoenix as px
from phoenix.client.types import PromptVersion
from openai.types.chat.completion_create_params import CompletionCreateParamsBase

params = CompletionCreateParamsBase(
    model="gpt-4o-mini",
    tools=tools,
    messages=[
        {"role": "system", "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."},
        {"role": "user", "content": "{user_query}"},
    ],
)

prompt_name = "self-improving-agent-router"
prompt = px.Client().prompts.create(
    name=prompt_name,
    version=PromptVersion.from_openai(params),
)
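
The saved user message contains a `{user_query}` placeholder that must be filled in before the messages are sent to the model. As a rough illustration of that substitution in plain Python (this is not the Phoenix client API, and `fill_messages` is a hypothetical helper), consider:

```python
import copy


def fill_messages(messages: list, variables: dict) -> list:
    """Return a copy of `messages` with {name} placeholders replaced."""
    filled = copy.deepcopy(messages)  # leave the stored template untouched
    for message in filled:
        content = message.get("content")
        if isinstance(content, str):
            for name, value in variables.items():
                # Plain string replacement, so other braces in the text are left alone
                content = content.replace("{" + name + "}", str(value))
            message["content"] = content
    return filled


template = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "{user_query}"},
]
print(fill_messages(template, {"user_query": "Which store had the highest sales volume?"}))
```

Using replacement rather than `str.format` avoids errors when a message contains literal braces, such as JSON snippets.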

## Agent logic

With the tools defined, you're ready to define the main routing and tool call handling steps of your agent.

In [11]:
@tracer.chain()
def handle_tool_calls(tool_calls, messages):
    for tool_call in tool_calls:
        function = tool_implementations[tool_call.function.name]
        function_args = json.loads(tool_call.function.arguments)
        result = function(**function_args)

        messages.append({"role": "tool", "content": result, "tool_call_id": tool_call.id})
    return messages
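
Tool-call arguments arrive as a JSON string produced by the model, so they can occasionally be malformed or name a tool that does not exist. The sketch below shows one defensive variant of the dispatch step that turns those cases into a tool message the router can read instead of raising. `dispatch_tool_call` is a hypothetical helper, and `SimpleNamespace` stands in for the OpenAI tool-call object so the example runs without an API call.

```python
import json
from types import SimpleNamespace


def dispatch_tool_call(tool_call, implementations: dict) -> dict:
    """Run one tool call and always return a well-formed 'tool' message."""
    name = tool_call.function.name
    if name not in implementations:
        result = f"Error: unknown tool {name}"
    else:
        try:
            args = json.loads(tool_call.function.arguments)
        except json.JSONDecodeError as e:
            # Report the parse failure back to the model rather than crashing the run
            result = f"Error: could not parse arguments for {name}: {e}"
        else:
            result = implementations[name](**args)
    return {"role": "tool", "content": str(result), "tool_call_id": tool_call.id}


# Stand-in for a tool call object, for illustration only
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="echo", arguments='{"text": "hi"}'),
)
print(dispatch_tool_call(fake_call, {"echo": lambda text: text.upper()}))
```

Converting the result with `str` also keeps the message valid when a tool returns a non-string value.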

In [12]:
def start_main_span(messages):
    print("Starting main span with messages:", messages)

    with tracer.start_as_current_span("AgentRun", openinference_span_kind="agent") as span:
        span.set_input(value=messages)
        ret = run_agent(messages)
        print("Main span completed with return value:", ret)
        span.set_output(value=ret)
        span.set_status(StatusCode.OK)
        return ret


def run_agent(messages):
    print("Running agent with messages:", messages)
    if isinstance(messages, str):
        messages = [{"role": "user", "content": messages}]
        print("Converted string message to list format")

    # Check and add system prompt if needed
    if not any(
        isinstance(message, dict) and message.get("role") == "system" for message in messages
    ):
        system_prompt = {
            "role": "system",
            "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.",
        }
        messages.append(system_prompt)
        print("Added system prompt to messages")

    while True:
        # Router call span
        print("Starting router call span")
        with tracer.start_as_current_span(
            "router_call",
            openinference_span_kind="chain",
        ) as span:
            span.set_input(value=messages)

            response = client.chat.completions.create(
                model=model,
                messages=messages,
                tools=tools,
            )

            messages.append(response.choices[0].message.model_dump())
            tool_calls = response.choices[0].message.tool_calls
            print("Received response with tool calls:", bool(tool_calls))
            span.set_status(StatusCode.OK)

            if tool_calls:
                # Tool calls span
                print("Processing tool calls")
                messages = handle_tool_calls(tool_calls, messages)
                span.set_output(value=tool_calls)
            else:
                print("No tool calls, returning final response")
                span.set_output(value=response.choices[0].message.content)

                return response.choices[0].message.content
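
The router loop keeps calling the model until it replies without tool calls. To see that control flow in isolation, here is a minimal sketch with a scripted stand-in for the model and an added turn limit. `run_loop`, `call_model`, `call_tool`, and `max_turns` are all names assumed for this illustration; the message dictionaries are simplified and do not mirror the full OpenAI response shape.

```python
def run_loop(messages: list, call_model, call_tool, max_turns: int = 5) -> str:
    """Alternate between the model and tools until the model answers in plain text."""
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls")
        if not tool_calls:
            # No tool requested: this is the final answer
            return reply["content"]
        for call in tool_calls:
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": call_tool(call)}
            )
    raise RuntimeError(f"No final answer after {max_turns} turns")


# Scripted replies: one tool request, then a final answer
scripted = iter(
    [
        {"role": "assistant", "content": None, "tool_calls": [{"id": "call_1", "name": "lookup"}]},
        {"role": "assistant", "content": "Done.", "tool_calls": None},
    ]
)
answer = run_loop(
    [{"role": "user", "content": "hi"}],
    call_model=lambda msgs: next(scripted),
    call_tool=lambda call: "42 rows",
)
print(answer)  # Done.
```

The turn limit is a design choice for this sketch: it converts a router that never stops requesting tools into an explicit error.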

## Run the agent

Your agent is now good to go! Let's try it out with some example questions:

In [13]:
ret = start_main_span([{"role": "user", "content": "Create a line chart showing sales in 2021"}])
print(ret)

Starting main span with messages: [{'role': 'user', 'content': 'Create a line chart showing sales in 2021'}]
Running agent with messages: [{'role': 'user', 'content': 'Create a line chart showing sales in 2021'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span
Received response with tool calls: True
Processing tool calls
Starting router call span
Received response with tool calls: False
No tool calls, returning final response
Main span completed with return value: I've created a line chart showing the sales for the months of November and December in 2021. The chart illustrates the total sales in these months. If you need any further analysis or a different visualization, feel free to ask!
I've created a line chart showing the sales for the m

In [14]:
# Note: this will take ~15 minutes to run

agent_questions = [
    "What was the most popular product SKU?",
    "What was the total revenue across all stores?",
    "Which store had the highest sales volume?",
    "Create a bar chart showing total sales by store",
    "What percentage of items were sold on promotion?",
    "Plot daily sales volume over time",
    "What was the average transaction value?",
    "Create a box plot of transaction values",
    "Which products were frequently purchased together?",
    "Plot a line graph showing the sales trend over time with a 7-day moving average",
]

for question in tqdm(agent_questions, desc="Processing questions"):
    try:
        ret = start_main_span([{"role": "user", "content": question}])
    except Exception as e:
        print(f"Error processing question: {question}")
        print(e)
        continue

Processing questions:   0%|          | 0/10 [00:00<?, ?it/s]

Starting main span with messages: [{'role': 'user', 'content': 'What was the most popular product SKU?'}]
Running agent with messages: [{'role': 'user', 'content': 'What was the most popular product SKU?'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span


Processing questions:  10%|█         | 1/10 [00:03<00:28,  3.20s/it]

Received response with tool calls: False
No tool calls, returning final response
Main span completed with return value: The most popular product SKU was 6200700, with a total quantity sold of 52,262 units.
Starting main span with messages: [{'role': 'user', 'content': 'What was the total revenue across all stores?'}]
Running agent with messages: [{'role': 'user', 'content': 'What was the total revenue across all stores?'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span


Processing questions:  20%|██        | 2/10 [00:06<00:26,  3.30s/it]

Received response with tool calls: False
No tool calls, returning final response
Main span completed with return value: The total revenue across all stores was approximately $13,272,640.
Starting main span with messages: [{'role': 'user', 'content': 'Which store had the highest sales volume?'}]
Running agent with messages: [{'role': 'user', 'content': 'Which store had the highest sales volume?'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span


Processing questions:  30%|███       | 3/10 [00:09<00:23,  3.31s/it]

Received response with tool calls: False
No tool calls, returning final response
Main span completed with return value: The store with the highest sales volume is Store Number 2970, with a total sales volume of 59,322 units.
Starting main span with messages: [{'role': 'user', 'content': 'Create a bar chart showing total sales by store'}]
Running agent with messages: [{'role': 'user', 'content': 'Create a bar chart showing total sales by store'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span
Received response with tool calls: True
Processing tool calls
Starting router call span


Processing questions:  40%|████      | 4/10 [00:48<01:42, 17.11s/it]

Received response with tool calls: False
No tool calls, returning final response
Main span completed with return value: Here is the bar chart showing total sales by store:

```python
import pandas as pd
import matplotlib.pyplot as plt
import io

# Data
data = '''    Store_Number    Total_Sales
0           1650  580443.007953
1            550  229727.498752
2           4180  272208.118542
3           3300  619660.167018
4           2970  836341.327191
5           2530  324046.518720
6           4400   95745.620250
7           3080  495458.238811
8           2090  309996.247965
9           4840  389056.668316
10          4070  322307.968330
11          2640  308990.318559
12          2750  453664.808068
13          1980  242290.828499
14           990  378433.018639
15          3740  359729.808228
16           880  420302.088397
17          1100  497509.528013
18          3190  335035.018792
19          2420  406715.767402
20          2860  132320.519487
21          3410  410567.848126
2

Processing questions:  50%|█████     | 5/10 [00:56<01:09, 13.83s/it]

Received response with tool calls: False
No tool calls, returning final response
Main span completed with return value: Approximately 2.63% of items were sold on promotion.
Starting main span with messages: [{'role': 'user', 'content': 'Plot daily sales volume over time'}]
Running agent with messages: [{'role': 'user', 'content': 'Plot daily sales volume over time'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span
Received response with tool calls: True
Processing tool calls
Starting router call span


Processing questions:  60%|██████    | 6/10 [07:18<09:16, 139.01s/it]

Received response with tool calls: False
No tool calls, returning final response
Main span completed with return value: The daily sales volume has been plotted over time. Here’s the visualization that represents the trend of sales volume across the given dates. You should be able to observe the variations in sales volume, including any notable peaks or declines. 

If you have any further analysis or specific aspects you would like to explore, feel free to ask!
Starting main span with messages: [{'role': 'user', 'content': 'What was the average transaction value?'}]
Running agent with messages: [{'role': 'user', 'content': 'What was the average transaction value?'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span


Processing questions:  70%|███████   | 7/10 [07:21<04:43, 94.52s/it] 

Received response with tool calls: False
No tool calls, returning final response
Main span completed with return value: The average transaction value was approximately $19.02.
Starting main span with messages: [{'role': 'user', 'content': 'Create a box plot of transaction values'}]
Running agent with messages: [{'role': 'user', 'content': 'Create a box plot of transaction values'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span


Processing questions:  80%|████████  | 8/10 [07:47<02:25, 72.68s/it]

Error processing question: Create a box plot of transaction values
Error code: 400 - {'error': {'message': "Invalid 'messages[3].content': string too long. Expected a string with maximum length 1048576, but got a string with length 17447374 instead.", 'type': 'invalid_request_error', 'param': 'messages[3].content', 'code': 'string_above_max_length'}}
Starting main span with messages: [{'role': 'user', 'content': 'Which products were frequently purchased together?'}]
Running agent with messages: [{'role': 'user', 'content': 'Which products were frequently purchased together?'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span


Processing questions:  90%|█████████ | 9/10 [08:04<00:55, 55.55s/it]

Error processing question: Which products were frequently purchased together?
Error code: 400 - {'error': {'message': "Invalid 'messages[3].content': string too long. Expected a string with maximum length 1048576, but got a string with length 11307469 instead.", 'type': 'invalid_request_error', 'param': 'messages[3].content', 'code': 'string_above_max_length'}}
Starting main span with messages: [{'role': 'user', 'content': 'Plot a line graph showing the sales trend over time with a 7-day moving average'}]
Running agent with messages: [{'role': 'user', 'content': 'Plot a line graph showing the sales trend over time with a 7-day moving average'}]
Added system prompt to messages
Starting router call span
Received response with tool calls: True
Processing tool calls
Index(['Store_Number', 'SKU_Coded', 'Product_Class_Code', 'Sold_Date',
       'Qty_Sold', 'Total_Sale_Value', 'On_Promo'],
      dtype='object')
sales
Starting router call span


Processing questions: 100%|██████████| 10/10 [13:25<00:00, 80.58s/it] 

Received response with tool calls: True
Processing tool calls
Error processing question: Plot a line graph showing the sales trend over time with a 7-day moving average
Unterminated string starting at: line 1 column 9 (char 8)
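
Two of the runs above failed with `string_above_max_length` because a tool returned far more text than the API accepts in a single message (the error reports a maximum of 1,048,576 characters). One possible mitigation is to cap tool output before appending it to `messages`. The sketch below assumes a much smaller budget; `truncate_tool_output` and `MAX_TOOL_CHARS` are hypothetical names, and the limit is an arbitrary choice rather than a recommended value.

```python
MAX_TOOL_CHARS = 20_000  # assumed budget, well below the limit reported in the error


def truncate_tool_output(text: str, max_chars: int = MAX_TOOL_CHARS) -> str:
    """Cap a tool result and note how much was dropped."""
    if len(text) <= max_chars:
        return text
    omitted = len(text) - max_chars
    # Tell the model the data is incomplete so it does not treat it as the full result
    return text[:max_chars] + f"\n... [truncated {omitted} characters]"


print(truncate_tool_output("x" * 50, max_chars=10))
```

Truncation loses rows, so for large lookups it may be preferable to have the SQL query aggregate the data instead of returning raw records.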





![Agent Traces](https://storage.googleapis.com/arize-phoenix-assets/assets/images/agent-traces.png)

# Test the Agent in Development

Before deploying your agent, test it on a series of test cases. You'll need to generate or source the initial test cases yourself; in later rounds, this step will be automated.

In [15]:
OpenAIInstrumentor().uninstrument()  # Uninstrument the OpenAI client to avoid capturing LLM as a Judge evaluation calls in your same project.

In [19]:
import nest_asyncio

import phoenix as px
from phoenix.evals import TOOL_CALLING_PROMPT_TEMPLATE, OpenAIModel, llm_classify
from phoenix.experiments import run_experiment
from phoenix.experiments.types import Example
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

nest_asyncio.apply()

In [20]:
px_client = px.Client()
eval_model = OpenAIModel(model="gpt-4o-mini")



### Function Calling Evals using Ground Truth

To test the agent against ground truth data, you can use an Experiment.

Experiments follow a standard step-by-step process in Phoenix:
1. Create a dataset of test cases, and optionally, expected outputs
2. Create a task to run on each test case - usually this is invoking your agent or a specific step of it
3. Create evaluator(s) to run on each output of your task
4. Visualize results in Phoenix
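
The first three steps can be sketched in plain Python to show how the pieces relate. This is a conceptual illustration only and does not use the Phoenix experiments API: the examples are invented, `task` is a keyword rule standing in for the real router, and `matches_expected` is a hypothetical evaluator.

```python
# 1. Dataset: inputs paired with the expected next tool (illustrative values)
examples = [
    {"input": "What were the top selling products last month?", "expected": "lookup_sales_data"},
    {"input": "Create a bar chart of sales by store", "expected": "generate_visualization"},
    {"input": "Draw a histogram of quantities sold", "expected": "generate_visualization"},
]


# 2. Task: a stand-in router using a keyword rule, not the real agent
def task(example: dict) -> str:
    if "chart" in example["input"].lower():
        return "generate_visualization"
    return "lookup_sales_data"


# 3. Evaluator: exact match between the chosen tool and the expected tool
def matches_expected(output: str, expected: str) -> float:
    return 1.0 if output == expected else 0.0


scores = [matches_expected(task(ex), ex["expected"]) for ex in examples]
accuracy = sum(scores) / len(scores)
print(f"Tool selection accuracy: {accuracy:.2f}")
```

The stand-in router misses the histogram request, which is the kind of failure the evaluator is meant to surface.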

In [50]:
import uuid

id = str(uuid.uuid4())

# Create a list of tuples with input_messages and next_tool_call
data = [
    (
        [
            {
                "role": "user",
                "content": "Plot daily sales volume over time"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            },
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_1",
                        "type": "function",
                        "function": {
                            "name": "lookup_sales_data",
                            "arguments": "{\"prompt\":\"Plot daily sales volume over time\"}"
                        }
                    }
                ]
            },
            {
                "role": "tool",
                "tool_call_id": "call_1",
                "content": "     Sold_Date  Daily_Sales_Volume\n0   2021-11-01              1021.0\n1   2021-11-02              1035.0\n2   2021-11-03               900.0"
            }
        ],
        "analyze_sales_data"
    ),
    (
        [
            {
                "role": "user",
                "content": "What were the top selling products last month?"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            }
        ],
        "lookup_sales_data"
    ),
    (
        [
            {
                "role": "user",
                "content": "Show me the relationship between promotions and sales"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            },
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_2",
                        "type": "function",
                        "function": {
                            "name": "lookup_sales_data",
                            "arguments": "{\"prompt\":\"Get promotion and sales data\"}"
                        }
                    }
                ]
            },
            {
                "role": "tool",
                "tool_call_id": "call_2",
                "content": "   On_Promo  Total_Sale_Value\n0         0          1245678.50\n1         1           987654.32"
            }
        ],
        "analyze_sales_data"
    ),
    (
        [
            {
                "role": "user",
                "content": "Calculate the price elasticity for SKU 6172800"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            }
        ],
        "lookup_sales_data"
    ),
    (
        [
            {
                "role": "user",
                "content": "Create a bar chart of sales by store"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            },
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_3",
                        "type": "function",
                        "function": {
                            "name": "lookup_sales_data",
                            "arguments": "{\"prompt\":\"Get sales by store\"}"
                        }
                    }
                ]
            },
            {
                "role": "tool",
                "tool_call_id": "call_3",
                "content": "   Store_Number  Total_Sales\n0          1320      56849.99\n1          2310      37900.00\n2          3080      18950.00"
            }
        ],
        "generate_visualization"
    ),
    (
        [
            {
                "role": "user",
                "content": "Find trends in seasonal sales patterns"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            }
        ],
        "lookup_sales_data"
    ),
    (
        [
            {
                "role": "user",
                "content": "How does product class code affect sales volume?"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            },
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_4",
                        "type": "function",
                        "function": {
                            "name": "lookup_sales_data",
                            "arguments": "{\"prompt\":\"Get sales volume by product class code\"}"
                        }
                    }
                ]
            },
            {
                "role": "tool",
                "tool_call_id": "call_4",
                "content": "   Product_Class_Code  Total_Qty_Sold\n0               22875             7\n1               34567            12\n2               45678            23"
            }
        ],
        "analyze_sales_data"
    ),
    (
        [
            {
                "role": "user",
                "content": "Generate a scatter plot of price vs quantity sold"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            }
        ],
        "lookup_sales_data"
    ),
    (
        [
            {
                "role": "user",
                "content": "Which stores have the highest promotion effectiveness?"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            },
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "id": "call_5",
                        "type": "function",
                        "function": {
                            "name": "lookup_sales_data",
                            "arguments": "{\"prompt\":\"Get promotion and sales data by store\"}"
                        }
                    }
                ]
            },
            {
                "role": "tool",
                "tool_call_id": "call_5",
                "content": "   Store_Number  Promo_Sales  Regular_Sales  Effectiveness\n0          1320      12500.0        10000.0           1.25\n1          2310      15000.0        10000.0           1.50\n2          3080       9000.0        10000.0           0.90"
            }
        ],
        "no tool called"
    ),
    (
        [
            {
                "role": "user",
                "content": "Compare sales performance between 2020 and 2021"
            },
            {
                "role": "system",
                "content": "You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset."
            }
        ],
        "lookup_sales_data"
    ),
]

dataframe = pd.DataFrame(data, columns=["input_messages", "next_tool_call"])

dataset = px_client.upload_dataset(
    dataframe=dataframe,
    dataset_name=f"tool_calling_ground_truth_{id}",
    input_keys=["input_messages"],
    output_keys=["next_tool_call"],
)

For your task, you can run just the router call of your agent:

In [55]:
def run_router_step(example: Example) -> list:
    # Send the stored conversation to the router and record which tools it picks
    input_messages = example.input.get("input_messages")
    response = client.chat.completions.create(
        model=model,
        messages=input_messages,
        tools=tools,
    )

    tool_calls = response.choices[0].message.tool_calls
    if tool_calls is None:
        # Return a one-element list so this case can match the
        # "no tool called" label in the ground-truth dataset
        return ["no tool called"]

    return [tool_call.function.name for tool_call in tool_calls]

Because you have expected outputs, your evaluator can be a few lines of code that compare the tools the router called against the expected tools. If you didn't have expected outputs, you could use an LLM as a Judge here instead:

In [56]:
def tools_match(expected: dict, output: list) -> bool:
    # The task must return a list of tool names; anything else counts as a mismatch
    if not isinstance(output, list):
        return False

    # Multiple expected tools are stored as a comma-separated string
    expected_tools = expected.get("next_tool_call").split(", ")
    expected_set = set(expected_tools)
    output_set = set(output)

    # Return True if the sets are identical (same elements, no extras)
    return expected_set == output_set
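
To make the matching rule concrete, the sketch below reproduces the same set comparison on hand-written values. `same_tools` is a hypothetical stand-alone helper, not part of Phoenix, and the inputs are illustrative. It assumes that multiple expected tools are written as a comma-separated string, as the `split(", ")` call above implies.

```python
def same_tools(expected_label: str, called: list) -> bool:
    """Return True when the called tools are exactly the expected tools."""
    # Multiple expected tools are assumed to be separated by ", "
    expected_set = set(expected_label.split(", "))
    # Sets ignore order and duplicates, so only membership is compared
    return expected_set == set(called)


# Exact match on a single tool
print(same_tools("lookup_sales_data", ["lookup_sales_data"]))
# The order of the calls does not matter
print(
    same_tools(
        "lookup_sales_data, analyze_sales_data",
        ["analyze_sales_data", "lookup_sales_data"],
    )
)
# An extra tool call makes the comparison fail
print(same_tools("lookup_sales_data", ["lookup_sales_data", "generate_visualization"]))
```

Because the comparison uses sets, a router that calls the right tool twice still passes, while one that calls an additional, unexpected tool does not.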

In [57]:
experiment = run_experiment(
    dataset,
    run_router_step,
    evaluators=[tools_match],
    experiment_name="Tool Calling Eval",
    experiment_description="Evaluating the tool calling step of the agent",
)



🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/datasets/RGF0YXNldDo5MQ==/experiments
🔗 View this experiment: https://app.phoenix.arize.com/datasets/RGF0YXNldDo5MQ==/compare?experimentId=RXhwZXJpbWVudDoxNTE=


running tasks |██████████| 10/10 (100.0%) | ⏳ 00:14<00:00 |  1.43s/it


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |██████████| 10/10 (100.0%) | ⏳ 00:02<00:00 |  4.76it/s


🔗 View this experiment: https://app.phoenix.arize.com/datasets/RGF0YXNldDo5MQ==/compare?experimentId=RXhwZXJpbWVudDoxNTE=

Experiment Summary (04/07/25 11:26 AM -0400)
--------------------------------------------
     evaluator   n  n_scores  avg_score  n_labels             top_2_labels
0  tools_match  10        10        0.8        10  {'True': 8, 'False': 2}

Tasks Summary (04/07/25 11:26 AM -0400)
---------------------------------------
   n_examples  n_runs  n_errors
0          10      10         0





## Optimize your Agent in Development

Now you can optimize your agent's routing prompt based on the labeled data you've created so far. To do this in the most automated and flexible way possible, you'll use DSPy.

In [1]:
!pip install -q dspy

In [48]:
import dspy

# Configure DSPy to use OpenAI
dspy_lm = dspy.LM(model="gpt-4o-mini")
dspy.settings.configure(lm=dspy_lm)

# Define the prompt classification task
class RouterPromptSignature(dspy.Signature):
    """Route a user prompt to the correct tool based on the task requirements.
    
    Available tools:
    1. analyze_sales_data: Use for complex analysis of sales data, including trends, patterns, and insights
    2. lookup_sales_data: Use for simple data retrieval or filtering of sales records
    3. generate_visualization: Use when the user needs visual representation of data
    4. no tool called: Use when no tool is needed
    
    The tool selection should be based on:
    - The complexity of the analysis needed
    - Whether raw data or processed insights are required
    - If visualization would help communicate the results
    """

    input_messages = dspy.InputField(desc="The router's input messages. Can include the user's query and any tool calls that have already been made.")
    tool_call = dspy.OutputField(
        desc="A list of tool calls to execute in sequence. Each tool call should include: "
             "1. tool_name: The name of the tool to use "
    )


router = dspy.Predict(RouterPromptSignature)

In [49]:
result = router(input_messages=[{"role": "user", "content": "Which stores had the highest sales volume?"}])
result

Prediction(
    tool_call='{"tool_name": "lookup_sales_data"}'
)

In [54]:
trainset = []

for input_messages, next_tool_call in dataframe.values:
    trainset.append(dspy.Example(input_messages=input_messages, tool_call=next_tool_call).with_inputs("input_messages"))

print(trainset[:3])

[Example({'input_messages': [{'role': 'user', 'content': 'Plot daily sales volume over time'}, {'role': 'system', 'content': 'You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.'}, {'role': 'assistant', 'tool_calls': [{'id': 'call_1', 'type': 'function', 'function': {'name': 'lookup_sales_data', 'arguments': '{"prompt":"Plot daily sales volume over time"}'}}]}, {'role': 'tool', 'tool_call_id': 'call_1', 'content': '     Sold_Date  Daily_Sales_Volume\n0   2021-11-01              1021.0\n1   2021-11-02              1035.0\n2   2021-11-03               900.0'}], 'tool_call': 'analyze_sales_data'}) (input_keys={'input_messages'}), Example({'input_messages': [{'role': 'user', 'content': 'What were the top selling products last month?'}, {'role': 'system', 'content': 'You are a helpful assistant that can answer questions about the Store Sales Price Elasticity Promotions dataset.'}], 'tool_call': 'lookup_sales_data'}) (input_keys={'

In [57]:
# Optimize via BootstrapFewShot.
optimizer = dspy.BootstrapFewShot(metric=(lambda x, y, trace=None: x.tool_call == y.tool_call))
optimized = optimizer.compile(router, trainset=trainset)

 70%|███████   | 7/10 [00:08<00:03,  1.23s/it]

Bootstrapped 4 full traces after 7 examples for up to 1 rounds, amounting to 7 attempts.
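
An exact string comparison can be stricter than intended: the unoptimized router above returned `'{"tool_name": "lookup_sales_data"}'`, while the training labels are bare tool names. The sketch below shows one possible, more lenient metric that normalizes both forms before comparing them. `normalize_tool_call` and `lenient_tool_match` are hypothetical helpers, not DSPy APIs, and `SimpleNamespace` objects stand in for DSPy examples and predictions so the snippet runs on its own.

```python
import json
from types import SimpleNamespace


def normalize_tool_call(value: str) -> str:
    """Reduce a raw tool_call string to a bare tool name."""
    text = value.strip()
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        # Already a bare name such as "lookup_sales_data"
        return text
    # Assumed JSON form: {"tool_name": "<name>"}
    if isinstance(parsed, dict) and "tool_name" in parsed:
        return str(parsed["tool_name"]).strip()
    return text


def lenient_tool_match(example, prediction, trace=None) -> bool:
    """Metric with the same (example, prediction, trace) shape as the lambda above."""
    return normalize_tool_call(example.tool_call) == normalize_tool_call(
        prediction.tool_call
    )


# Stand-ins for a labeled example and a router prediction
label = SimpleNamespace(tool_call="lookup_sales_data")
raw = SimpleNamespace(tool_call='{"tool_name": "lookup_sales_data"}')
print(lenient_tool_match(label, raw))
```

If this behavior suits your data, a function like this could be passed as `metric=` in place of the lambda. Whether it changes which traces get bootstrapped depends on how the model formats its output.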





In [58]:
optimized(input_messages=[{"role": "user", "content": "Which stores had the highest sales volume?"}])

Prediction(
    tool_call='lookup_sales_data'
)

In [66]:
# Get the prompt from the optimized router
print(optimized.signature.instructions)

Route a user prompt to the correct tool based on the task requirements.

Available tools:
1. analyze_sales_data: Use for complex analysis of sales data, including trends, patterns, and insights
2. lookup_sales_data: Use for simple data retrieval or filtering of sales records
3. generate_visualization: Use when the user needs visual representation of data
4. no tool called: Use when no tool is needed

The tool selection should be based on:
- The complexity of the analysis needed
- Whether raw data or processed insights are required
- If visualization would help communicate the results


## Evaluate your Agent in Production

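Now you can evaluate the tool calls your agent makes in production by running an eval over its traces in Phoenix.

This evaluation follows a standard pattern: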
1. Export traces from Phoenix
2. Prepare those exported traces in a dataframe with the correct columns
3. Use `llm_classify` to run a standard template across each row of that dataframe and produce an eval label
4. Upload the results back into Phoenix

In [None]:
query = (
    SpanQuery()
    .where(
        "span_kind == 'LLM'",
    )
    .select(question="input.value", output_messages="llm.output_messages")
)

# The Phoenix Client can take this query and return the dataframe.
tool_calls_df = px.Client().query_spans(query, project_name=project_name, timeout=None)
tool_calls_df.dropna(subset=["output_messages"], inplace=True)


def get_tool_call(outputs):
    # Return the name of the first tool called in the LLM output, if any
    tool_calls = outputs[0].get("message").get("tool_calls")
    if tool_calls:
        return tool_calls[0].get("tool_call").get("function").get("name")
    return "No tool used"


tool_calls_df["tool_call"] = tool_calls_df["output_messages"].apply(get_tool_call)
tool_calls_df.head()
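
The helper above assumes each `output_messages` value is a list of message records with the nested layout shown below. The sample records are made up for illustration, and `first_tool_name` is a hypothetical stand-alone version of the same lookup, so check the layout against your own exported spans before relying on it.

```python
def first_tool_name(outputs: list) -> str:
    """Return the name of the first tool call in an output_messages value."""
    tool_calls = outputs[0].get("message", {}).get("tool_calls")
    if not tool_calls:
        return "No tool used"
    return tool_calls[0]["tool_call"]["function"]["name"]


# Illustrative record for an LLM span that requested a tool
with_tool = [
    {
        "message": {
            "role": "assistant",
            "tool_calls": [
                {"tool_call": {"function": {"name": "lookup_sales_data", "arguments": "{}"}}}
            ],
        }
    }
]

# Illustrative record for an LLM span that answered directly
without_tool = [{"message": {"role": "assistant", "content": "Here is a summary."}}]

print(first_tool_name(with_tool))
print(first_tool_name(without_tool))
```

Only the first tool call is inspected, so a span that requests several tools in one response is labeled by its first one.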

In [None]:
tool_call_eval = llm_classify(
    dataframe=tool_calls_df,
    template=TOOL_CALLING_PROMPT_TEMPLATE.template.replace(
        "{tool_definitions}",
        "generate_visualization, lookup_sales_data, analyze_sales_data, run_python_code",
    ),
    rails=["correct", "incorrect"],
    model=eval_model,
    provide_explanation=True,
)

# Convert the eval label into a numeric score: 1 for "correct", 0 otherwise
tool_call_eval["score"] = (tool_call_eval["label"] == "correct").astype(int)

tool_call_eval.head()

In [None]:
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=tool_call_eval),
)

You should now see eval labels in Phoenix.

![Function Calling Evals](https://storage.googleapis.com/arize-phoenix-assets/assets/images/function-calling-evals.png)