<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# Phoenix Experiments Tutorial (Python)

This tutorial demonstrates how to use Phoenix Experiments to systematically evaluate and improve AI agents. You'll learn how to create datasets, define task functions that run your agent on each example, and use both code-based and LLM-as-a-Judge evaluators to measure performance. By the end, you'll be able to run experiments that compare different agent versions and track improvements over time, enabling data-driven development and deployment decisions.

The notebook covers four main sections. Follow the documentation for the complete tutorial.

- **Define Agent**: Set up a customer support agent with tools for ticket classification and policy retrieval, using the `agno` framework
- **Create a Dataset**: Build a dataset of support ticket queries with ground truth labels, then upload it to Phoenix
- **Define an Experiment**: Create task functions and evaluators (code-based and LLM judges), then run experiments to measure agent performance and compare different versions
- **Iterations with Experiments**: Compare different agent versions using experiments to validate improvements before deployment

# Install Dependencies and Set API Keys

In [None]:
!pip install agno arize-phoenix openai openinference-instrumentation-agno openinference-instrumentation-openai

In [None]:
import os

os.environ["OPENAI_API_KEY"] = "your-openai-api-key"
os.environ["PHOENIX_API_KEY"] = "your-phoenix-api-key"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "your-phoenix-collector-endpoint"

In [None]:
from phoenix.otel import register

register(project_name="experiments-tutorial", auto_instrument=True)

# Define Support Agent

This agent is a customer support assistant that helps users resolve their issues by classifying tickets and retrieving relevant policies. The agent has two tools: `classify_ticket`, which categorizes support tickets into billing, technical, account, or other categories, and `retrieve_policy`, which fetches the appropriate internal support policy based on the ticket category.

In [None]:
from agno.models.openai import OpenAIChat
from agno.tools import tool
from openai import OpenAI

CATEGORIES = ["billing", "technical", "account", "other"]

openai_client = OpenAI()


@tool
def classify_ticket(ticket_text: str) -> str:
    """
    Classify a support ticket into:
    billing, technical, account, or other.
    """

    messages = [
        {
            "role": "system",
            "content": (
                "You classify customer support tickets into one of the "
                "following categories: billing, technical, account, other. "
                "Respond with ONLY the category name."
            ),
        },
        {
            "role": "user",
            "content": ticket_text,
        },
    ]

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    label = response.choices[0].message.content.strip().lower()

    if label not in CATEGORIES:
        return "other"

    return label
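
The last few lines of `classify_ticket` guard against the model returning anything other than a bare category name. The sketch below isolates that post-processing so it can be tried without an OpenAI call; `normalize_label` is a hypothetical helper name used only for illustration and is not part of the notebook.

```python
CATEGORIES = ["billing", "technical", "account", "other"]


def normalize_label(raw: str) -> str:
    """Hypothetical helper mirroring the post-processing in classify_ticket."""
    # Trim whitespace and lowercase so "  Billing\n" matches "billing".
    label = raw.strip().lower()
    # Anything that is not an exact category name falls back to "other".
    return label if label in CATEGORIES else "other"


print(normalize_label("  Billing\n"))    # billing
print(normalize_label("billing issue"))  # other
print(normalize_label("Technical."))     # other
```

Because the match is exact, a reply with trailing punctuation or extra words is counted as "other", which is why the system prompt asks for ONLY the category name.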

In [None]:
POLICIES = {
    "billing": "Billing policy: Refunds are issued for duplicate charges within 7 days.",
    "technical": "Technical policy: Troubleshoot login issues, outages, and errors.",
    "account": "Account policy: Users can update email and password in account settings.",
    "other": "General support policy: Route to a human agent.",
}


@tool
def retrieve_policy(category: str) -> str:
    """Retrieve internal support policy."""
    return POLICIES.get(category, POLICIES["other"])
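
As a quick illustration of the fallback in `retrieve_policy`, the sketch below runs the same dictionary lookup as a plain function. `lookup_policy` is a hypothetical name, and the dictionary is trimmed to two of the entries above to keep the example short.

```python
POLICIES = {
    "billing": "Billing policy: Refunds are issued for duplicate charges within 7 days.",
    "other": "General support policy: Route to a human agent.",
}


def lookup_policy(category: str) -> str:
    """Hypothetical plain-function version of the retrieve_policy lookup."""
    # dict.get returns the second argument when the key is missing.
    return POLICIES.get(category, POLICIES["other"])


print(lookup_policy("billing"))
print(lookup_policy("shipping"))  # unknown category, falls back to "other"
print(lookup_policy("Billing"))   # keys are case-sensitive, so this also falls back
```

The lookup is case-sensitive, so it relies on the classifier returning lowercase labels.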

In [None]:
from agno.agent import Agent

support_agent = Agent(
    name="SupportAgent",
    model=OpenAIChat(id="gpt-4o-mini"),
    tools=[classify_ticket, retrieve_policy],
    instructions="""
You are a customer support assistant.

Steps:
1. Use classify_ticket to determine the issue category.
2. Use retrieve_policy to fetch the relevant policy.
3. Write a helpful, polite response grounded in the policy.
Do not invent policies.
""",
)

In [None]:
sample_tickets = [
    "I was charged twice for my subscription this month.",
    "My app crashes every time I try to log in.",
    "How do I change the email on my account?",
    "This product is terrible and nothing works.",
]

for ticket in sample_tickets:
    support_agent.run(ticket)

# Section 1: Create a Dataset

In [None]:
import pandas as pd

from phoenix.client import Client

data = [
    {
        "query": "I was charged twice for my subscription this month.",
        "expected_category": "billing",
    },
    {"query": "My app crashes every time I try to log in.", "expected_category": "technical"},
    {"query": "How do I change the email on my account?", "expected_category": "account"},
    {"query": "I want a refund because I was billed incorrectly.", "expected_category": "billing"},
    {"query": "The website shows a 500 error.", "expected_category": "technical"},
    {"query": "I forgot my password and cannot sign in.", "expected_category": "account"},
    {"query": "I was billed after canceling my subscription.", "expected_category": "billing"},
    {"query": "The app freezes on startup.", "expected_category": "technical"},
    {"query": "How can I update my billing address?", "expected_category": "account"},
    {"query": "Why was my credit card charged twice?", "expected_category": "billing"},
    {"query": "Push notifications are not working.", "expected_category": "technical"},
    {"query": "Can I change my username?", "expected_category": "account"},
    {"query": "I was charged even though my trial should be free.", "expected_category": "billing"},
    {"query": "The page won’t load on mobile.", "expected_category": "technical"},
    {"query": "How do I delete my account?", "expected_category": "account"},
    {
        "query": "I canceled last week but still see a pending charge and now the app won’t open.",
        "expected_category": "billing",
    },
    {
        "query": "Nothing works anymore and I don’t even know where to start.",
        "expected_category": "other",
    },
    {
        "query": "I updated my email and now I can’t log in — also was billed today.",
        "expected_category": "account",
    },
    {"query": "This service is unusable and I want my money back.", "expected_category": "billing"},
    {
        "query": "I think something is wrong with my account but support never responds.",
        "expected_category": "account",
    },
    {
        "query": "My subscription status looks wrong and the app crashes randomly.",
        "expected_category": "billing",
    },
    {
        "query": "Why am I being charged if I can’t access my account?",
        "expected_category": "billing",
    },
    {
        "query": "The app broke after the last update and now billing looks incorrect.",
        "expected_category": "technical",
    },
    {
        "query": "I’m locked out and still getting charged — please help.",
        "expected_category": "billing",
    },
    {
        "query": "This feels like both a billing and technical issue.",
        "expected_category": "billing",
    },
    {"query": "Everything worked yesterday, today nothing does.", "expected_category": "technical"},
    {
        "query": "I don’t recognize this charge and the app won’t load.",
        "expected_category": "billing",
    },
    {
        "query": "Account settings changed on their own and I was billed.",
        "expected_category": "account",
    },
    {"query": "I want to cancel but can’t log in.", "expected_category": "account"},
    {"query": "The system is broken and I’m losing money.", "expected_category": "billing"},
]

# Create DataFrame
dataset_df = pd.DataFrame(data)

# -----------------------------
# Upload Dataset
# -----------------------------

px_client = Client()

dataset = px_client.datasets.create_dataset(
    dataframe=dataset_df,
    name="support-ticket-queries",
    input_keys=["query"],
    output_keys=["expected_category"],
)
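
Before uploading, it can help to confirm that every ground-truth label is one of the known categories and to see how the labels are distributed. The sketch below is one way to do that with the standard library; `check_labels` is a hypothetical helper, and `sample_rows` copies four rows from the list above rather than the full dataset.

```python
from collections import Counter

CATEGORIES = ["billing", "technical", "account", "other"]

# Four rows copied from the dataset above, for illustration only.
sample_rows = [
    {"query": "I was charged twice for my subscription this month.", "expected_category": "billing"},
    {"query": "My app crashes every time I try to log in.", "expected_category": "technical"},
    {"query": "How do I change the email on my account?", "expected_category": "account"},
    {"query": "Why was my credit card charged twice?", "expected_category": "billing"},
]


def check_labels(rows):
    """Return per-category counts, raising if a label is not a known category."""
    unknown = [row for row in rows if row["expected_category"] not in CATEGORIES]
    if unknown:
        raise ValueError(f"Unknown labels: {unknown}")
    return Counter(row["expected_category"] for row in rows)


counts = check_labels(sample_rows)
for category in CATEGORIES:
    print(category, counts[category])
```

Passing the full `data` list instead of `sample_rows` would run the same check on every example.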

# Section 2: Define an Experiment

## Run an Experiment to Check Tool Call Accuracy (Code-Based Evaluator)

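This is our tool function from above, redefined here without the `@tool` decorator so that the task function can call it directly as a plain Python function: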

In [None]:
def classify_ticket(ticket_text: str) -> str:
    """
    Classify a support ticket into:
    billing, technical, account, or other.
    """

    messages = [
        {
            "role": "system",
            "content": (
                "You classify customer support tickets into one of the "
                "following categories: billing, technical, account, other. "
                "Respond with ONLY the category name."
            ),
        },
        {
            "role": "user",
            "content": ticket_text,
        },
    ]

    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
    )
    label = response.choices[0].message.content.strip().lower()

    if label not in CATEGORIES:
        return "other"

    return label

In [None]:
def classify_ticket_task(input):
    """
    Task used specifically for evaluating tool call accuracy.
    """
    query = input.get("query")
    classification = classify_ticket(query)
    return classification

Because each example in the dataset has a ground-truth `expected_category` field, we can use a code-based evaluator to check whether the task output matches the expected label.

In [None]:
# Define Code-Based Evaluator for Tool Call Accuracy
from typing import Optional

from phoenix.experiments.evaluators import create_evaluator


@create_evaluator(kind="CODE", name="tool-call-accuracy")
def tool_call_accuracy(output: str, expected: Optional[dict]) -> Optional[bool]:
    """
    Code-based evaluator that checks if the classify_ticket tool output
    matches the expected category from the dataset.
    """
    if expected is None:
        return None
    expected_category = expected.get("expected_category")
    return output.strip().lower() == expected_category.strip().lower()
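
To see what this comparison yields without running an experiment, the sketch below applies the same exact-match logic to a few hand-written pairs and computes an accuracy. The function name `exact_match` and the pairs are made up for illustration and are not results from the agent.

```python
from typing import Optional


def exact_match(output: str, expected: Optional[dict]) -> Optional[bool]:
    """Same comparison as the evaluator, without the Phoenix decorator."""
    if expected is None:
        return None  # nothing to compare against, so the example is not scored
    return output.strip().lower() == expected["expected_category"].strip().lower()


# Hypothetical (task output, expected) pairs, not real experiment output.
pairs = [
    ("billing", {"expected_category": "billing"}),
    ("Technical ", {"expected_category": "technical"}),
    ("other", {"expected_category": "account"}),
    ("billing", None),
]

results = [exact_match(output, expected) for output, expected in pairs]
scored = [result for result in results if result is not None]
accuracy = sum(scored) / len(scored)

print(results)   # [True, True, False, None]
print(accuracy)  # 2 of the 3 scored examples match
```

Examples without an expected value are left out of the denominator here; that is a choice made for this sketch, not a statement about how Phoenix aggregates scores.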

In [None]:
from phoenix.experiments import run_experiment

golden_dataset = px_client.datasets.get_dataset(dataset="support-ticket-queries")

experiment = run_experiment(
    golden_dataset,
    classify_ticket_task,
    evaluators=[tool_call_accuracy],
    experiment_name="tool call experiment",
    experiment_description="Evaluating classify_ticket tool accuracy against ground truth labels using a code-based evaluator",
)

## Run an Experiment to Understand Overall Agent Performance (LLM-as-a-Judge Evaluator)

In [None]:
def my_support_agent_task(input):
    """
    Task function that will be run on each row of the dataset.
    """
    query = input.get("query")

    # Call the agent with the query
    response = support_agent.run(query)
    return response.content

In [None]:
# Define LLM Judge Evaluator checking for Actionable Responses
from phoenix.evals import LLM, create_classifier
from phoenix.experiments.types import EvaluationResult

# Define Prompt Template
support_response_actionability_judge = """
You are evaluating a customer support agent's response.

Determine whether the response is ACTIONABLE and helps resolve the user's issue.

Mark the response as CORRECT if it:
- Directly addresses the user's specific question
- Provides concrete steps, guidance, or information
- Clearly routes the user toward a solution

Mark the response as INCORRECT if it:
- Is generic, vague, or non-specific
- Avoids answering the question
- Provides no clear next steps
- Deflects with phrases like "contact support" without guidance

User Query:
{input.query}

Agent Response:
{output}

Return only one label: "correct" or "incorrect".
"""

# Create Evaluator
actionability_judge = create_classifier(
    name="actionability-judge",
    prompt_template=support_response_actionability_judge,
    llm=LLM(model="gpt-5", provider="openai"),
    choices={"correct": 1.0, "incorrect": 0.0},
)


def call_actionability_judge(input, output):
    """
    Wrapper function for the actionability judge evaluator.
    This is needed because run_experiment expects a function, not an evaluator object.
    """
    results = actionability_judge.evaluate({"input": input, "output": output})
    result = results[0]
    return EvaluationResult(score=result.score, label=result.label, explanation=result.explanation)
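
The `choices` mapping is what turns the judge's label into a numeric score. As a rough sketch of how those scores could be summarized, the code below maps labels to scores and averages them; the labels are placeholders written for this example, not judge output, and `mean_score` is a hypothetical helper.

```python
CHOICES = {"correct": 1.0, "incorrect": 0.0}


def mean_score(labels):
    """Map each label to its score and return the average."""
    scores = [CHOICES[label] for label in labels]
    return sum(scores) / len(scores)


# Placeholder labels for illustration only.
placeholder_labels = ["correct", "incorrect", "correct", "correct"]

print(mean_score(placeholder_labels))  # 0.75
```

With a 1.0/0.0 mapping, the average is simply the fraction of responses the judge labeled "correct".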

In [None]:
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset,
    my_support_agent_task,
    evaluators=[call_actionability_judge],
    experiment_name="support agent",
    experiment_description="Initial support agent evaluation using actionability judge to measure how actionable and helpful the agent's responses are",
)

# Section 3: Iterations with Experiments

In [None]:
# Improved Agent with Better Actionability
# This version has enhanced instructions to improve actionability scores

improved_support_agent = Agent(
    name="SupportAgent",
    model=OpenAIChat(id="gpt-4o-mini"),
    tools=[classify_ticket, retrieve_policy],
    instructions="""
You are a customer support assistant. Your goal is to provide SPECIFIC, ACTIONABLE responses that directly help users resolve their issues.

1. Use classify_ticket to determine the issue category.
2. Use retrieve_policy to fetch the relevant policy.
3. Write a response that:
   - Directly addresses the user's specific question
   - Includes the policy information you retrieved
   - Provides clear, concrete next steps the user can take
   - Uses specific details from the policy (e.g., "within 7 days" not "soon")
   - Avoids vague phrases like "should be able to" or "might be able to"
   - Gives actionable guidance (ex: "Go to Settings > Account > Email" not "check your settings")

Example of GOOD response:
"Based on your billing issue, here's what you can do: Refunds are issued for duplicate charges within 7 days. To request your refund, please [specific action]. You should see the refund processed within 7 business days."

Example of BAD response:
"I understand your concern about billing. Please contact our support team for assistance with this matter."

Do not invent policies. Always use the policy information from retrieve_policy.
""",
)

In [None]:
# Task function using the improved agent
def improved_support_agent_task(input):
    """
    Task function using the improved agent with better actionability instructions.
    """
    query = input.get("query")
    response = improved_support_agent.run(query)

    return response.content

In [None]:
# Run experiment with improved agent to compare actionability scores
from phoenix.experiments import run_experiment

# Reuse the same dataset and judge so the scores are comparable with the first run
improved_experiment = run_experiment(
    dataset,
    improved_support_agent_task,
    evaluators=[call_actionability_judge],
    experiment_name="improved support agent",
    experiment_description="Agent with enhanced instructions to improve actionability - emphasizes specific, concrete responses with clear next steps",
)
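
Once both experiments have run, the comparison comes down to setting the two sets of per-example scores side by side. The sketch below shows one way to do that in plain Python, assuming the scores have already been collected into two lists in the same example order. The numbers are placeholders, not results from this notebook, and `compare_runs` is a hypothetical helper.

```python
def compare_runs(baseline_scores, candidate_scores):
    """Summarize two runs over the same examples, in the same order."""
    baseline = sum(baseline_scores) / len(baseline_scores)
    candidate = sum(candidate_scores) / len(candidate_scores)
    # Indices of examples where the candidate scored lower than the baseline.
    regressions = [
        index
        for index, (before, after) in enumerate(zip(baseline_scores, candidate_scores))
        if after < before
    ]
    return {
        "baseline": baseline,
        "candidate": candidate,
        "delta": candidate - baseline,
        "regressions": regressions,
    }


# Placeholder scores for illustration only.
baseline_scores = [1.0, 0.0, 0.0, 1.0]
candidate_scores = [1.0, 1.0, 0.0, 1.0]

summary = compare_runs(baseline_scores, candidate_scores)
print(summary)
```

Checking for per-example regressions alongside the average is useful because an overall improvement can hide individual examples that got worse.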