<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Slack Community</a>
    </p>
</center>

This tutorial demonstrates how to use AX Datasets & Experiments to systematically evaluate and improve AI agents. You'll learn how to create datasets, define task functions that run your agent on each example, and use both code-based and LLM-as-a-Judge evaluators to measure performance. By the end, you'll be able to run experiments that compare different agent versions and track improvements over time, so that development and deployment decisions rest on measured results.

The notebook covers four main sections. Follow the documentation for the complete tutorial.


*   **Define Agent**: Set up a customer support agent with tools for ticket classification and policy retrieval, using the agno framework
*   **Create a Dataset**: Build a dataset of support ticket queries with ground truth labels, then upload it to Arize AX
*   **Define an Experiment**: Create task functions and evaluators (code-based and LLM judges), then run experiments to measure agent performance and compare different versions
*   **Iterations with Experiments**: Compare different agent versions using experiments to validate improvements before deployment


In [None]:
%pip install agno arize[otel] anthropic openinference-instrumentation-agno openinference-instrumentation-anthropic arize-phoenix-evals

In [None]:
import os

os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"
os.environ["ARIZE_API_KEY"] = "your-arize-api-key"
os.environ["ARIZE_SPACE_ID"] = "your-arize-space-id"
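
Hardcoding keys in a notebook makes them easy to leak when the notebook is shared. As a minimal sketch, assuming the keys are already exported in your shell, you could check for them instead of assigning them. `missing_env_vars` is a hypothetical helper for illustration, not part of the Arize or Anthropic SDKs:

```python
import os

REQUIRED_VARS = ["ANTHROPIC_API_KEY", "ARIZE_API_KEY", "ARIZE_SPACE_ID"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment mapping."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Set these before running the notebook: {', '.join(missing)}")
```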

In [None]:
from arize.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor
from openinference.instrumentation.agno import AgnoInstrumentor

tracer_provider = register(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY"),
    project_name="experiments-tutorial",
)
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)
AgnoInstrumentor().instrument(tracer_provider=tracer_provider)

# Define Support Agent

This agent is a customer support assistant that helps users resolve their issues by classifying tickets and retrieving relevant policies. The agent has two tools: `classify_ticket`, which categorizes support tickets into billing, technical, account, or other categories, and `retrieve_policy`, which fetches the appropriate internal support policy based on the ticket category.

In [None]:
from agno.models.anthropic import Claude
from agno.tools import tool
from anthropic import Anthropic

CATEGORIES = ["billing", "technical", "account", "other"]

anthropic_client = Anthropic()


@tool
def classify_ticket(ticket_text: str) -> str:
    """
    Classify a support ticket into:
    billing, technical, account, or other.
    """

    response = anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=50,
        system=(
            "You classify customer support tickets into one of the "
            "following categories: billing, technical, account, other. "
            "Respond with ONLY the category name."
        ),
        messages=[
            {
                "role": "user",
                "content": ticket_text,
            },
        ],
    )
    label = response.content[0].text.strip().lower()

    if label not in CATEGORIES:
        return "other"

    return label
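
The tool above falls back to `"other"` whenever the model's text is not an exact category name, so a reply such as `"Billing."` would also be treated as `"other"`. A slightly more tolerant normalization might look like the sketch below. `normalize_label` is a hypothetical helper, not something the notebook or agno provides:

```python
import string

CATEGORIES = ["billing", "technical", "account", "other"]


def normalize_label(raw: str) -> str:
    """Map raw model text onto a known category, defaulting to 'other'."""
    # Lowercase, trim whitespace, then trim surrounding punctuation and quotes.
    cleaned = raw.strip().lower().strip(string.punctuation + string.whitespace)
    return cleaned if cleaned in CATEGORIES else "other"


print(normalize_label("Billing."))
print(normalize_label("refund request"))
```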

In [None]:
POLICIES = {
    "billing": "Billing policy: Refunds are issued for duplicate charges within 7 days.",
    "technical": "Technical policy: Troubleshoot login issues, outages, and errors.",
    "account": "Account policy: Users can update email and password in account settings.",
    "other": "General support policy: Route to a human agent.",
}


@tool
def retrieve_policy(category: str) -> str:
    """Retrieve internal support policy."""
    return POLICIES.get(category, POLICIES["other"])

In [None]:
from agno.agent import Agent

support_agent = Agent(
    name="SupportAgent",
    model=Claude(id="claude-sonnet-4-5-20250929"),
    tools=[classify_ticket, retrieve_policy],
    instructions="""
You are a customer support assistant.

Steps:
1. Use classify_ticket to determine the issue category.
2. Use retrieve_policy to fetch the relevant policy.
3. Write a helpful, polite response grounded in the policy.
Do not invent policies.
""",
)

In [None]:
sample_tickets = [
    "I was charged twice for my subscription this month.",
    "My app crashes every time I try to log in.",
    "How do I change the email on my account?",
    "This product is terrible and nothing works.",
]

for ticket in sample_tickets:
    support_agent.run(ticket)

# Section 1: Create a Dataset

In [None]:
import pandas as pd

data = [
    {"query": "I was charged twice for my subscription this month.", "expected_category": "billing"},
    {"query": "My app crashes every time I try to log in.", "expected_category": "technical"},
    {"query": "How do I change the email on my account?", "expected_category": "account"},
    {"query": "I want a refund because I was billed incorrectly.", "expected_category": "billing"},
    {"query": "The website shows a 500 error.", "expected_category": "technical"},
    {"query": "I forgot my password and cannot sign in.", "expected_category": "account"},
    {"query": "I was billed after canceling my subscription.", "expected_category": "billing"},
    {"query": "The app freezes on startup.", "expected_category": "technical"},
    {"query": "How can I update my billing address?", "expected_category": "account"},
    {"query": "Why was my credit card charged twice?", "expected_category": "billing"},
    {"query": "Push notifications are not working.", "expected_category": "technical"},
    {"query": "Can I change my username?", "expected_category": "account"},
    {"query": "I was charged even though my trial should be free.", "expected_category": "billing"},
    {"query": "The page won’t load on mobile.", "expected_category": "technical"},
    {"query": "How do I delete my account?", "expected_category": "account"},
    {"query": "I canceled last week but still see a pending charge and now the app won’t open.", "expected_category": "billing"},
    {"query": "Nothing works anymore and I don’t even know where to start.", "expected_category": "other"},
    {"query": "I updated my email and now I can’t log in — also was billed today.", "expected_category": "account"},
    {"query": "This service is unusable and I want my money back.", "expected_category": "billing"},
    {"query": "I think something is wrong with my account but support never responds.", "expected_category": "account"},
    {"query": "My subscription status looks wrong and the app crashes randomly.", "expected_category": "billing"},
    {"query": "Why am I being charged if I can’t access my account?", "expected_category": "billing"},
    {"query": "The app broke after the last update and now billing looks incorrect.", "expected_category": "technical"},
    {"query": "I’m locked out and still getting charged — please help.", "expected_category": "billing"},
    {"query": "This feels like both a billing and technical issue.", "expected_category": "billing"},
    {"query": "Everything worked yesterday, today nothing does.", "expected_category": "technical"},
    {"query": "I don’t recognize this charge and the app won’t load.", "expected_category": "billing"},
    {"query": "Account settings changed on their own and I was billed.", "expected_category": "account"},
    {"query": "I want to cancel but can’t log in.", "expected_category": "account"},
    {"query": "The system is broken and I’m losing money.", "expected_category": "billing"},
]

# Create DataFrame
dataset_df = pd.DataFrame(data)
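
Before uploading, it can help to confirm that every label is a known category and to see how the examples are distributed. The sketch below uses only the standard library on a few rows copied from the list above. `summarize_labels` is a hypothetical helper:

```python
from collections import Counter

CATEGORIES = ["billing", "technical", "account", "other"]

# A few rows copied from the dataset above, for illustration only.
sample_rows = [
    {"query": "I was charged twice for my subscription this month.", "expected_category": "billing"},
    {"query": "The website shows a 500 error.", "expected_category": "technical"},
    {"query": "Can I change my username?", "expected_category": "account"},
    {"query": "Why was my credit card charged twice?", "expected_category": "billing"},
]


def summarize_labels(rows, allowed=CATEGORIES):
    """Count examples per label and collect any labels outside the allowed set."""
    counts = Counter(row["expected_category"] for row in rows)
    unknown = sorted(label for label in counts if label not in allowed)
    return counts, unknown


counts, unknown = summarize_labels(sample_rows)
print(dict(counts), unknown)
```

To check the full dataset, pass the `data` list defined above in place of `sample_rows`.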

In [None]:
# Upload Dataset

from arize import ArizeClient

client = ArizeClient(api_key=os.getenv("ARIZE_API_KEY"))

dataset = client.datasets.create(
    name="support-ticket-queries",
    space_id=os.getenv("ARIZE_SPACE_ID"),
    examples=dataset_df,
)
dataset_id = dataset.id

# Section 2: Define an Experiment

## Run an Experiment to Check Tool Call Accuracy (Code-Based Evaluator)

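This is the same classification logic as the `classify_ticket` tool above, written as a plain function (no `@tool` decorator) so it can be used directly as an experiment task. It also accepts a dataset row and extracts the `query` field: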

In [None]:
def classify_ticket_fn(ticket_text: str) -> str:
    """
    Classify a support ticket into:
    billing, technical, account, or other.
    """
    if isinstance(ticket_text, dict):
        ticket_text = ticket_text.get("query", str(ticket_text))

    ticket_text = str(ticket_text)

    response = anthropic_client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=50,
        system=(
            "You classify customer support tickets into one of the "
            "following categories: billing, technical, account, other. "
            "Respond with ONLY the category name."
        ),
        messages=[
            {
                "role": "user",
                "content": ticket_text,
            },
        ],
    )
    label = response.content[0].text.strip().lower()

    if label not in CATEGORIES:
        return "other"

    return label

In [None]:
def classify_ticket_task(input):
    """
    Task used specifically for evaluating tool call accuracy.
    """
    query = input.get("query")
    # Call the plain function; classify_ticket is wrapped by @tool for the agent.
    classification = classify_ticket_fn(query)
    return classification

Since our "baseline" examples have a ground truth field, we can use a code-based evaluator to check whether the task output matches what we expect.

In [None]:
# Define Code-Based Evaluator for Tool Call Accuracy
from arize.experiments import EvaluationResult

def tool_call_accuracy(output, dataset_row) -> EvaluationResult:
    """
    Code-based evaluator that checks if the classify_ticket tool output
    matches the expected category from the dataset.
    """

    expected_category = dataset_row.get("expected_category")
    score = output.strip().lower() == expected_category

    return EvaluationResult(
        score=1.0 if score else 0.0,
        label="correct" if score else "incorrect",
        explanation="correct" if score else "incorrect",
    )
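
The comparison logic can be sanity-checked locally before running an experiment. The sketch below mirrors the evaluator using a plain dataclass in place of `EvaluationResult`. `LocalResult` and `check_category` are hypothetical names used only for this illustration and are not part of the Arize SDK:

```python
from dataclasses import dataclass


@dataclass
class LocalResult:
    """Stand-in for an evaluation result; not the Arize class."""

    score: float
    label: str


def check_category(output: str, dataset_row: dict) -> LocalResult:
    """Compare the normalized task output with the row's expected category."""
    expected = dataset_row.get("expected_category")
    matched = output.strip().lower() == expected
    return LocalResult(
        score=1.0 if matched else 0.0,
        label="correct" if matched else "incorrect",
    )


row = {"query": "The app freezes on startup.", "expected_category": "technical"}
print(check_category(" Technical ", row))
print(check_category("billing", row))
```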


In [None]:
experiment, experiment_df = client.experiments.run(
    name="tool call experiment",
    dataset_id=dataset_id,
    task=classify_ticket_fn,
    evaluators=[tool_call_accuracy],  # Pass your evaluator(s) here
)
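
A single accuracy number can hide which categories the classifier struggles with. As a rough sketch, assuming you have extracted (expected, predicted) pairs from the experiment results yourself, a per-category breakdown could look like this. The pairs below are made up for illustration and `accuracy_by_category` is a hypothetical helper:

```python
from collections import defaultdict


def accuracy_by_category(pairs):
    """Return {expected_label: fraction of predictions that matched it}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for expected, predicted in pairs:
        totals[expected] += 1
        if predicted == expected:
            hits[expected] += 1
    return {label: hits[label] / totals[label] for label in totals}


# Made-up (expected, predicted) pairs, for illustration only.
example_pairs = [
    ("billing", "billing"),
    ("billing", "technical"),
    ("account", "account"),
    ("technical", "technical"),
]
print(accuracy_by_category(example_pairs))
```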

## Run an Experiment to Understand Overall Agent Performance (LLM-as-a-Judge Evaluator)

In [None]:
def support_agent_task(dataset_row):
    """
    Task function that will be run on each row of the dataset.
    """
    query = dataset_row.get("query")

    # Call the agent with the query
    response = support_agent.run(query)
    return response.content

In [None]:
# Define LLM Judge Evaluator checking for Actionable Responses
from phoenix.evals import LLM, create_classifier
from arize.experiments import EvaluationResult

# Define Prompt Template
support_response_actionability_judge = """
You are evaluating a customer support agent's response.

Determine whether the response is ACTIONABLE and helps resolve the user's issue.

Mark the response as CORRECT if it:
- Directly addresses the user's specific question
- Provides concrete steps, guidance, or information
- Clearly routes the user toward a solution

Mark the response as INCORRECT if it:
- Is generic, vague, or non-specific
- Avoids answering the question
- Provides no clear next steps
- Deflects with phrases like "contact support" without guidance

User Query:
{input}

Agent Response:
{output}

Return only one label: "correct" or "incorrect".
"""

# Create Evaluator using a different Anthropic model than the agent
actionability_judge = create_classifier(
    name="actionability-judge",
    prompt_template=support_response_actionability_judge,
    llm=LLM(model="claude-3-5-haiku-20241022", provider="anthropic"),
    choices={"correct": 1.0, "incorrect": 0.0},
)


def call_actionability_judge(dataset_row, output):
    """
    Wrapper function for the actionability judge evaluator.
    This is needed because client.experiments.run expects a function, not an evaluator object.
    """
    results = actionability_judge.evaluate({"input": dataset_row.get("query"), "output": output})
    result = results[0]
    return EvaluationResult(score=result.score, label=result.label, explanation=result.explanation)
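
The judge prompt uses `{input}` and `{output}` placeholders, and the wrapper supplies values under the same two keys. As a rough illustration of that mapping, the sketch below fills a shortened copy of the template with Python's `str.format`. Phoenix's own template rendering may work differently, and `render_judge_prompt` is a hypothetical helper:

```python
# Shortened stand-in for the judge prompt above; same placeholder names.
TEMPLATE = """User Query:
{input}

Agent Response:
{output}

Return only one label: "correct" or "incorrect"."""


def render_judge_prompt(query: str, response: str) -> str:
    """Fill the template's {input} and {output} placeholders."""
    return TEMPLATE.format(input=query, output=response)


prompt = render_judge_prompt(
    "How do I change the email on my account?",
    "Go to account settings and update your email there.",
)
print(prompt)
```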

In [None]:
experiment, experiment_df = client.experiments.run(
    name="support agent performance",
    dataset_id=dataset_id,
    task=support_agent_task,
    evaluators=[call_actionability_judge],  # Pass your evaluator(s) here
)

# CI/CD with Experiments

In [None]:
# Improved Agent with Better Actionability
# This version has enhanced instructions to improve actionability scores

improved_support_agent = Agent(
    name="SupportAgent",
    model=Claude(id="claude-sonnet-4-5-20250929"),
    tools=[classify_ticket, retrieve_policy],
    instructions="""
You are a customer support assistant. Your goal is to provide SPECIFIC, ACTIONABLE responses that directly help users resolve their issues.

1. Use classify_ticket to determine the issue category.
2. Use retrieve_policy to fetch the relevant policy.
3. Write a response that:
   - Directly addresses the user's specific question
   - Includes the policy information you retrieved
   - Provides clear, concrete next steps the user can take
   - Uses specific details from the policy (e.g., "within 7 days" not "soon")
   - Avoids vague phrases like "should be able to" or "might be able to"
   - Gives actionable guidance (ex: "Go to Settings > Account > Email" not "check your settings")

Example of GOOD response:
"Based on your billing issue, here's what you can do: Refunds are issued for duplicate charges within 7 days. To request your refund, please [specific action]. You should see the refund processed within 7 business days."

Example of BAD response:
"I understand your concern about billing. Please contact our support team for assistance with this matter."

Do not invent policies. Always use the policy information from retrieve_policy.
""",
)

In [None]:
# Task function using the improved agent
def improved_support_agent_task(dataset_row):
    """
    Task function using the improved agent with better actionability instructions.
    """
    query = dataset_row.get("query")
    response = improved_support_agent.run(query)

    return response.content

In [None]:
# Run experiment with improved agent to compare actionability scores

improved_experiment, improved_experiment_df = client.experiments.run(
    name="support agent performance with improved prompt",
    dataset_id=dataset_id,
    task=improved_support_agent_task,
    evaluators=[call_actionability_judge],  # Pass your evaluator(s) here
)
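
To use this comparison as a deployment gate, one option is to compare the mean evaluator scores of the two runs and fail the pipeline when the candidate regresses. The sketch below assumes you have already extracted per-example scores into plain lists. The numbers are made up, `passes_gate` is a hypothetical helper, and the `min_gain` threshold is an assumption you would tune:

```python
from statistics import mean


def passes_gate(baseline_scores, candidate_scores, min_gain=0.0):
    """Return True when the candidate's mean score is at least min_gain above the baseline's."""
    return mean(candidate_scores) - mean(baseline_scores) >= min_gain


# Made-up per-example scores (1.0 = correct, 0.0 = incorrect), for illustration only.
baseline = [1.0, 0.0, 1.0, 0.0]
candidate = [1.0, 1.0, 1.0, 0.0]

if passes_gate(baseline, candidate):
    print("Candidate is at least as good as the baseline.")
else:
    raise SystemExit("Candidate regressed; blocking deployment.")
```

With a 30-example dataset, small differences in the mean can come from judge noise alone, so a nonzero `min_gain` or repeated runs may be worth considering.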