# Monitoring and Evaluating your Agent 📊

This notebook demonstrates how to set up and evaluate an AI agent using tools like Arize Phoenix and OpenTelemetry for tracing and monitoring. It also explores how to integrate external libraries, such as Hugging Face and OpenAI, for building conversational and functional agents.

The notebook is divided into the following sections:
1. **Setup:** Installing and configuring dependencies for tracing and AI agent integration.
2. **Tracing an AI Agent:** Configuring OpenTelemetry and monitoring agent runs.
3. **Building a Functional Agent:** Implementing an agent for a simple snack-ordering system.
4. **Evaluating Tool Calls:** Comparing the tools the agent called against the expected tools.
5. **Conclusion:** Summary of results and next steps.

<p style="background-color:#f7fff8; padding:15px; border-width:3px; border-color:#e0f0e0; border-style:solid; border-radius:6px"> 🚨
&nbsp; <b>Different Run Results:</b> The output generated by AI chat models can vary with each execution due to their dynamic, probabilistic nature. Don't be surprised if your results differ from those shown in the video.</p>

## Setup Tracing

More info on [Phoenix](https://docs.arize.com/phoenix)

In [None]:
# Install required libraries
# Arize Phoenix: For tracing and monitoring
# SmolAgents: For building lightweight AI agents

In [None]:
%pip install arize-phoenix-otel

In [1]:
import os

# Add Phoenix API Key for tracing
PHOENIX_API_KEY = "<your-phoenix-api-key>"  # Placeholder: never commit a real key
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={PHOENIX_API_KEY}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
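Hardcoding credentials in a notebook makes them easy to leak when the file is shared. One hedged alternative is to read the key from the environment and fail early if it is missing. The helper names `require_env` and `phoenix_headers` below are hypothetical and not part of Phoenix; the sketch assumes `PHOENIX_API_KEY` was exported in the shell or loaded from a `.env` file beforehand.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name!r} is not set")
    return value


def phoenix_headers(api_key: str) -> str:
    """Build the header string in the same `api_key=...` format used above."""
    return f"api_key={api_key}"
```

With these helpers, the header could be set as `os.environ["PHOENIX_CLIENT_HEADERS"] = phoenix_headers(require_env("PHOENIX_API_KEY"))`.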

In [None]:
%pip install openinference-instrumentation-smolagents smolagents

In [None]:
# os.environ["HF_TOKEN"] = "<your-hugging-face-token>"  # Placeholder: never commit a real token

In [None]:
# Import necessary libraries
from phoenix.otel import register

# Configure the Phoenix tracer for monitoring agent runs
tracer_provider = register(
  project_name="Customer-Success", # Tracing project name
  auto_instrument=True # Enable automatic instrumentation for dependencies
)

OpenTelemetry Tracing Details
|  Phoenix Project: Customer-Success
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



In [3]:
import os
from dotenv import load_dotenv

load_dotenv()

True

In [4]:
from smolagents import HfApiModel

# Read the OpenAI API key from the environment (loaded from .env above)
openai_api_key = os.environ.get("OPENAI_API_KEY")

# Initialize the model with the OpenAI API key
model = HfApiModel("gpt-3.5-turbo", provider="openai", token=openai_api_key)  # Use a model that supports conversational tasks

response = model([{"role": "user", "content": "Hello!"}])
print(response)

ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content='Hello! How can I assist you today?', tool_calls=None, raw=ChatCompletionOutput(choices=[ChatCompletionOutputComplete(finish_reason='stop', index=0, message=ChatCompletionOutputMessage(role='assistant', content='Hello! How can I assist you today?', tool_call_id=None, tool_calls=None, refusal=None, annotations=[]), logprobs=None)], created=1746019260, id='chatcmpl-BS1eu5Jv2tBJap7j4qJr0butLhVgO', model='gpt-3.5-turbo-0125', system_fingerprint=None, usage=ChatCompletionOutputUsage(completion_tokens=10, prompt_tokens=9, total_tokens=19, prompt_tokens_details={'cached_tokens': 0, 'audio_tokens': 0}, completion_tokens_details={'reasoning_tokens': 0, 'audio_tokens': 0, 'accepted_prediction_tokens': 0, 'rejected_prediction_tokens': 0}), object='chat.completion', service_tier='default'))


## Trace an agent run

In [5]:
from smolagents import HfApiModel, CodeAgent

agent = CodeAgent(model=model, tools=[])

In [6]:
agent.run("What is the 100th Fibonacci number?")

'354224848179261915075'
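Because the agent writes and runs its own code, it is worth checking an answer like this independently. A minimal sketch, assuming the convention F(1) = F(2) = 1 (the function name `fib` is illustrative):

```python
def fib(n: int) -> int:
    """Return the n-th Fibonacci number with F(0) = 0 and F(1) = F(2) = 1."""
    a, b = 0, 1  # F(0), F(1)
    for _ in range(n):
        a, b = b, a + b
    return a


print(fib(100))  # 354224848179261915075
```

Under a different indexing convention the same request can yield `fib(99)`, which is 218922995834555169026, so answers may differ by one index between runs.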

In [7]:
# This is where you can access the display:
#print(os.environ.get('DLAI_LOCAL_URL').format(port='6006'))
print(os.environ["PHOENIX_COLLECTOR_ENDPOINT"])

https://app.phoenix.arize.com


## Building a Snack Ordering System

In this section, we define a simple snack-ordering system that uses one dictionary for menu prices and another for orders keyed by session ID. Two tools are implemented:
1. `place_order`: Places an order for specified quantities of menu items.
2. `get_prices`: Calculates the total price for a given order.

In [None]:
from smolagents import tool
from typing import Dict
# Define menu prices and global order book
menu_prices = {"crepe nutella": 1.50, "vanilla ice cream": 2, "maple pancake": 1.}

ORDER_BOOK = {}
# Define a tool for placing orders
@tool
def place_order(quantities: Dict[str, int], session_id: int) -> None:
    """Places a pre-order of snacks.

    Args:
        quantities: a dictionary with names as keys and quantities as values
        session_id: the id for the client session
    """
    global ORDER_BOOK
    assert isinstance(quantities, dict), "Incorrect type for the input dictionary!"
    # all(...) is required: a bare list is truthy whenever it is non-empty
    assert all(key in menu_prices for key in quantities), f"All food names should be within {list(menu_prices)}"
    ORDER_BOOK[session_id] = quantities

@tool
def get_prices(quantities: Dict[str, int]) -> str:
    """Gets the total price for certain quantities of snacks.

    Args:
        quantities: a dictionary with names as keys and quantities as values
    """
    assert isinstance(quantities, dict), "Incorrect type for the input dictionary!"
    # all(...) is required: a bare list is truthy whenever it is non-empty
    assert all(key in menu_prices for key in quantities), f"All food names should be within {list(menu_prices)}"
    total_price = sum(menu_prices[key] * value for key, value in quantities.items())
    return (
        f"Given the current menu prices:\n{menu_prices}\nThe total price for your order would be: ${total_price}"
    )
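The validation and pricing logic can also be exercised without an agent. The sketch below is a plain-Python stand-in, not the smolagents tool itself; the names `MENU`, `validate_items`, and `total_price` are hypothetical. It collects unknown item names explicitly instead of asserting on a list, since a non-empty list such as `[False]` is always truthy.

```python
from typing import Dict

# Same prices as the notebook's menu, copied here so the sketch is self-contained
MENU = {"crepe nutella": 1.50, "vanilla ice cream": 2, "maple pancake": 1.0}


def validate_items(quantities: Dict[str, int], menu: Dict[str, float]) -> None:
    """Raise ValueError if any requested item is missing from the menu."""
    unknown = [name for name in quantities if name not in menu]
    if unknown:
        raise ValueError(f"Unknown items {unknown}; choose from {list(menu)}")


def total_price(quantities: Dict[str, int], menu: Dict[str, float]) -> float:
    """Sum price * quantity over the order after validating item names."""
    validate_items(quantities, menu)
    return sum(menu[name] * qty for name, qty in quantities.items())


print(total_price({"crepe nutella": 1, "maple pancake": 2}, MENU))  # 3.5
```

Raising `ValueError` rather than using `assert` also keeps the check active when Python runs with assertions disabled.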

In [9]:
order_agent = CodeAgent(
    tools=[place_order, get_prices],
    #model=HfApiModel("Qwen/Qwen2.5-Coder-32B-Instruct", provider="together")
    model = HfApiModel("gpt-3.5-turbo", provider="openai", token=openai_api_key)
)

In [10]:
order_agent.run(
    "Could I come and collect one crepe nutella?",
    additional_args={"session_id": 192}
)

'Yes, you can come and collect one crepe nutella. The order has been successfully placed at the price of $1.5.'

### Try multiple orders

In [11]:
client_requests = [
    ("Could I come and collect one crepe nutella?", "place_order"),
    ("What would be the price for 1 crêpe nutella + 2 pancakes?", "get_prices"),
    ("How did you start your ice-cream business?", None),
    ("What's the weather at the Louvre right now?", None),
    ("I'm not sure if I should order. I want a vanilla ice cream. but if it's more expensive than $1, I don't want it. If it's below, I'll order it, please.", "place_order")
]

In [12]:
for request in client_requests:
    order_agent.run(
        request[0],
        additional_args={"session_id": 0, "menu_prices": menu_prices}
    )

In [13]:
import phoenix as px

spans = px.Client().get_spans_dataframe(project_name="Customer-Success")
spans.head(20)



| context.span_id (index) | name | span_kind | parent_id | start_time | end_time | status_code | status_message | events | context.span_id | context.trace_id | ... | attributes.llm.model_name | attributes.llm.token_count.prompt | attributes.llm.token_count.completion | attributes.input.mime_type | attributes.openinference.span.kind | attributes.llm.tools | attributes.tool.parameters | attributes.tool.name | attributes.tool.description | attributes.smolagents |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 697865a1f202f662 | `HfApiModel.__call__` | LLM | 616f26d5a77528fa | 2025-04-30 08:24:53.302007+00:00 | 2025-04-30 08:25:01.339542+00:00 | OK | | [] | 697865a1f202f662 | 0022446925a64e3e5a9cb73873d69368 | ... | Qwen/Qwen2.5-Coder-32B-Instruct | 2130.0 | 150.0 | application/json | LLM | | | | | |
| 638a6436a7d3ebb5 | `HfApiModel.__call__` | LLM | 0288e82dcf483ef8 | 2025-04-30 08:25:01.867190+00:00 | 2025-04-30 08:25:07.599219+00:00 | OK | | [] | 638a6436a7d3ebb5 | 0022446925a64e3e5a9cb73873d69368 | ... | Qwen/Qwen2.5-Coder-32B-Instruct | 1490.0 | 31.0 | application/json | LLM | `[{'tool.json_schema': '{"type": "function", "f...` | | | | |
| b0101a8469ed5979 | DuckDuckGoSearchTool | TOOL | 0288e82dcf483ef8 | 2025-04-30 08:25:07.915178+00:00 | 2025-04-30 08:25:09.154269+00:00 | OK | | [] | b0101a8469ed5979 | 0022446925a64e3e5a9cb73873d69368 | ... | | | | | TOOL | | `{'query': {'type': 'string', 'description': 'T...` | web_search | Performs a duckduckgo web search based on your... | |
| 0288e82dcf483ef8 | Step 1 | CHAIN | c635535172231c7d | 2025-04-30 08:25:01.867190+00:00 | 2025-04-30 08:25:09.385871+00:00 | OK | | [] | 0288e82dcf483ef8 | 0022446925a64e3e5a9cb73873d69368 | ... | | | | | CHAIN | | | | | |
| 3dbee956a6f475ad | `HfApiModel.__call__` | LLM | fd348193d1178f73 | 2025-04-30 08:25:09.602121+00:00 | 2025-04-30 08:25:14.458031+00:00 | OK | | [] | 3dbee956a6f475ad | 0022446925a64e3e5a9cb73873d69368 | ... | Qwen/Qwen2.5-Coder-32B-Instruct | 2797.0 | 47.0 | application/json | LLM | `[{'tool.json_schema': '{"type": "function", "f...` | | | | |
| 8d42a5d7877c261f | VisitWebpageTool | TOOL | fd348193d1178f73 | 2025-04-30 08:25:14.790169+00:00 | 2025-04-30 08:25:26.166210+00:00 | OK | | [] | 8d42a5d7877c261f | 0022446925a64e3e5a9cb73873d69368 | ... | | | | | TOOL | | `{'url': {'type': 'string', 'description': 'The...` | visit_webpage | Visits a webpage at the given url and reads it... | |
| fd348193d1178f73 | Step 2 | CHAIN | c635535172231c7d | 2025-04-30 08:25:09.602121+00:00 | 2025-04-30 08:25:27.119630+00:00 | OK | | [] | fd348193d1178f73 | 0022446925a64e3e5a9cb73873d69368 | ... | | | | | CHAIN | | | | | |
| cf5abccd7386273b | `HfApiModel.__call__` | LLM | b1e76f6e05180a56 | 2025-04-30 08:25:27.741529+00:00 | 2025-04-30 08:25:29.073278+00:00 | ERROR | HfHubHTTPError: 422 Client Error: Unprocessabl... | `[{'name': 'exception', 'timestamp': '2025-04-3...` | cf5abccd7386273b | 0022446925a64e3e5a9cb73873d69368 | ... | | | | application/json | LLM | | | | | |
| b1e76f6e05180a56 | Step 3 | CHAIN | c635535172231c7d | 2025-04-30 08:25:27.740525+00:00 | 2025-04-30 08:25:30.343446+00:00 | ERROR | AgentParsingError: Error while generating or p... | `[{'name': 'exception', 'timestamp': '2025-04-3...` | b1e76f6e05180a56 | 0022446925a64e3e5a9cb73873d69368 | ... | | | | | CHAIN | | | | | |
| 8dc17c5174bafdf5 | `HfApiModel.__call__` | LLM | b053cb7586473e42 | 2025-04-30 08:25:30.558887+00:00 | 2025-04-30 08:25:31.651969+00:00 | ERROR | HfHubHTTPError: 422 Client Error: Unprocessabl... | `[{'name': 'exception', 'timestamp': '2025-04-3...` | 8dc17c5174bafdf5 | 0022446925a64e3e5a9cb73873d69368 | ... | | | | application/json | LLM | | | | | |
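A spans DataFrame like this one can be filtered with ordinary pandas operations, for example to list failed spans. The sketch below uses a small hand-built DataFrame as a stand-in for `spans`; the rows are illustrative and only mimic a few of the column names shown above.

```python
import pandas as pd

# Illustrative stand-in for the Phoenix spans DataFrame (not real trace data)
demo_spans = pd.DataFrame(
    {
        "name": ["Step 1", "HfApiModel.__call__", "Step 3"],
        "span_kind": ["CHAIN", "LLM", "CHAIN"],
        "status_code": ["OK", "ERROR", "ERROR"],
        "status_message": ["", "HfHubHTTPError", "AgentParsingError"],
    }
)

# Keep only spans that ended in an error
errors = demo_spans[demo_spans["status_code"] == "ERROR"]

# Count errors per span kind to see where failures concentrate
errors_by_kind = errors.groupby("span_kind").size().to_dict()
print(errors_by_kind)
```

The same filter applied to the real `spans` DataFrame would surface rows such as the `HfHubHTTPError` and `AgentParsingError` entries in the output above.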


### Add processing to extract desired information

In [14]:
import pandas as pd
import json

agents = spans[spans['span_kind'] == 'AGENT'].copy()
agents['task'] = agents['attributes.input.value'].apply(
    lambda x: json.loads(x).get('task') if isinstance(x, str) else None
)

tools = spans.loc[
    spans['span_kind'] == 'TOOL',
    ["attributes.tool.name", "attributes.input.value", "context.trace_id"]
].copy()

tools_per_task = agents[
    ["name", "start_time", "task", "context.trace_id"]
].merge(
    tools,
    on="context.trace_id",
    how="left",
)
tools_per_task.head(20)

| | name | start_time | task | context.trace_id | attributes.tool.name | attributes.input.value |
|---|---|---|---|---|---|---|
| 0 | ToolCallingAgent.run | 2025-04-30 08:25:01.867190+00:00 | You're a helpful agent named 'US_GDP_Growth_Ag... | 0022446925a64e3e5a9cb73873d69368 | web_search | `{"args": [], "sanitize_inputs_outputs": true, ...` |
| 1 | ToolCallingAgent.run | 2025-04-30 08:25:01.867190+00:00 | You're a helpful agent named 'US_GDP_Growth_Ag... | 0022446925a64e3e5a9cb73873d69368 | visit_webpage | `{"args": [], "sanitize_inputs_outputs": true, ...` |
| 2 | CodeAgent.run | 2025-04-30 08:24:53.222497+00:00 | If the US keeps its 2024 growth rate, how many... | 0022446925a64e3e5a9cb73873d69368 | web_search | `{"args": [], "sanitize_inputs_outputs": true, ...` |
| 3 | CodeAgent.run | 2025-04-30 08:24:53.222497+00:00 | If the US keeps its 2024 growth rate, how many... | 0022446925a64e3e5a9cb73873d69368 | visit_webpage | `{"args": [], "sanitize_inputs_outputs": true, ...` |
| 4 | ToolCallingAgent.run | 2025-04-30 08:33:22.677324+00:00 | What is the 100th Fibonacci number? | 43303d5d41d9b4f265c11fcd68b85f4d | | |
| 5 | CodeAgent.run | 2025-04-30 08:34:19.627527+00:00 | What is the 100th Fibonacci number? | 1981f2cc078437f141fdec6557be1bac | final_answer | `{"args": [354224848179261915075], "sanitize_in...` |
| 6 | CodeAgent.run | 2025-04-30 08:37:30.342283+00:00 | Could I come and collect one crepe nutella? | 9b502fb667503f99229eb3215f5078bc | | |
| 7 | CodeAgent.run | 2025-04-30 12:23:03.310823+00:00 | What is the 100th Fibonacci number? | f09723bc539e9f9b00bf1d427fdbabfa | final_answer | `{"args": [218922995834555169026], "sanitize_in...` |
| 8 | CodeAgent.run | 2025-04-30 12:42:19.208191+00:00 | Could I come and collect one crepe nutella? | b4e79a295d7f85ea160ac4d76d47cfd8 | place_order | `{"args": [], "sanitize_inputs_outputs": false,...` |
| 9 | CodeAgent.run | 2025-04-30 12:42:19.208191+00:00 | Could I come and collect one crepe nutella? | b4e79a295d7f85ea160ac4d76d47cfd8 | final_answer | `{"args": ["Order successfully placed for one c...` |
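The `how="left"` merge keeps every agent run, including runs that recorded no tool spans; those rows get `NaN` in the tool columns (shown as empty cells in rows 4 and 6 above). A minimal sketch with made-up trace IDs and tasks, where every value is illustrative rather than taken from the trace:

```python
import pandas as pd

# Illustrative agent runs and tool spans keyed by trace id (not real data)
demo_agents = pd.DataFrame(
    {"task": ["order a crepe", "small talk"], "context.trace_id": ["t1", "t2"]}
)
demo_tools = pd.DataFrame(
    {
        "attributes.tool.name": ["place_order", "final_answer"],
        "context.trace_id": ["t1", "t1"],
    }
)

# Left merge: one row per (agent run, tool call); runs without tools get NaN
merged = demo_agents.merge(demo_tools, on="context.trace_id", how="left")

# Drop the NaN placeholders before collecting tool names per task
tools_by_task = (
    merged.dropna(subset=["attributes.tool.name"])
    .groupby("task")["attributes.tool.name"]
    .apply(set)
    .to_dict()
)
print(tools_by_task)
```

Note that after `dropna`, a task with no tool calls disappears from `tools_by_task` entirely, so any downstream scoring has to treat a missing task as an empty set.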


### Now, compare tool calls with expected tool calls

In [15]:
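from typing import Optional

def score_request(expected_tool: Optional[str], tool_calls: set) -> bool:
    """Return True if the recorded tool calls match the expectation."""
    if expected_tool is None:
        # No tool expected: only `final_answer` should have been called
        return tool_calls == {"final_answer"}
    return expected_tool in tool_calls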

results = []
for request, expected_tool in client_requests:
    tool_calls = set(tools_per_task.loc[tools_per_task["task"] == request, "attributes.tool.name"].tolist())
    results.append(
        {
            "request": request,
            "tool_calls_performed": tool_calls,
            "is_correct": score_request(expected_tool, tool_calls)
        }
    )
pd.DataFrame(results)

| | request | tool_calls_performed | is_correct |
|---|---|---|---|
| 0 | Could I come and collect one crepe nutella? | {nan, place_order, get_prices, final_answer} | True |
| 1 | What would be the price for 1 crêpe nutella + ... | {get_prices, final_answer} | True |
| 2 | How did you start your ice-cream business? | {final_answer} | True |
| 3 | What's the weather at the Louvre right now? | {nan, final_answer} | False |
| 4 | I'm not sure if I should order. I want a vanil... | {final_answer} | False |
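In the results above, `nan` entries come from runs with no recorded tool spans, and they make the exact-match check against `{"final_answer"}` fail. One possible refinement, sketched here as an assumption rather than the notebook's method, is to discard non-string entries before scoring. The function name `score_request_clean` is hypothetical.

```python
from typing import Iterable, Optional


def score_request_clean(expected_tool: Optional[str], tool_calls: Iterable) -> bool:
    """Score a request after discarding NaN/None placeholders from tool calls."""
    # Tool names are strings; NaN from a left merge is a float, so it is dropped
    names = {call for call in tool_calls if isinstance(call, str)}
    if expected_tool is None:
        # No tool expected: the agent should only have produced a final answer
        return names == {"final_answer"}
    return expected_tool in names


print(score_request_clean(None, {float("nan"), "final_answer"}))  # True
print(score_request_clean("place_order", {"final_answer"}))  # False
```

Whether a run with missing tool spans should count as correct is a judgment call; this variant simply stops the placeholder from deciding the outcome.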


# Conclusion

In this notebook, we demonstrated:
1. Setting up and configuring tracing for monitoring AI agents using Arize Phoenix and OpenTelemetry.
2. Building a functional agent for a snack-ordering system using SmolAgents.
3. Integrating external libraries like Hugging Face and OpenAI for conversational tasks.
4. Evaluating agent runs by comparing the tools recorded in the traces with the expected tools.

This notebook provides a foundational understanding of monitoring and evaluating AI agents. Future work could involve:
- Expanding the agent's functionality.
- Integrating more complex tracing and monitoring setups.
- Applying the agent to real-world use cases.