<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Scientific Agent Evaluation - Observe Agents Demo</h1>

This tutorial demonstrates a **scientific approach to evaluating and improving AI agents** using Arize Phoenix. Based on the demo presented at **Arize Observe 2025**, this notebook shows how to apply rigorous evaluation methodologies to multi-agent systems.

## 🎯 What You'll Learn

In this tutorial, we'll explore:

1. **🔍 Agent Observability**: How to trace and monitor multi-agent workflows using Phoenix
2. **📊 Data-Driven Evaluation**: Creating systematic evaluation frameworks for agent performance
3. **🧪 Scientific Methodology**: Using human annotations to build ground truth for LLM judges
4. **🔄 Iterative Improvement**: How to use evaluation results to improve agent performance through experimentation
5. **📈 Performance Analysis**: Analyzing and comparing different agent configurations

## 🏗️ The Multi-Agent System

Our demo features a **personal assistant system** with three specialized agents:

- **📅 Calendar Agent**: Manages Google Calendar operations (scheduling, availability, events)
- **📧 Mail Agent**: Handles Gmail operations (sending, reading, extracting event info)
- **🎯 Coordinator Agent**: Orchestrates the workflow using handoffs between specialized agents

**Example workflow**: "Arrange a meeting with Bob next week" → find available slots → draft invite → send email → book calendar

## 🔬 Scientific Evaluation Approach

This tutorial follows a **scientific methodology** for agent evaluation:

1. **🎯 Define Success Metrics**: What makes an agent response "helpful" vs "unhelpful"
2. **📝 Collect Ground Truth**: Use human annotations to establish evaluation standards
3. **🤖 Build LLM Judges**: Create automated evaluators grounded in human-annotated examples
4. **🧪 Run Experiments**: Test different agent configurations systematically
5. **📊 Analyze Results**: Compare performance across different approaches

⚠️ **Prerequisites**: This tutorial requires an OpenAI API key, a running Phoenix instance, and a Google token for the Calendar and Gmail agents.

Let's begin by setting up our environment and running some initial test cases.


## 🛠️ Setup and Installation

First, let's install the required dependencies and set up our environment. This will install all the necessary packages for running our multi-agent system and connecting to Phoenix for observability.


In [None]:
!uv pip install -r requirements.txt

Next:
1. Set your environment variables by copying the `.env.example` file in this directory to `.env` and filling in the values.
2. Generate your Google token file by running `python utils/generate_google_token.py`.
3. Set up the necessary prompts in Phoenix by running `python utils/setup_prompts.py`.
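
Before moving on, it can help to confirm that the environment is actually populated. The following is a minimal sketch, not part of the project: the variable names in `REQUIRED_VARS` are assumptions inferred from this notebook's prerequisites and code (`PHOENIX_BASE_URL` is read later on), and `missing_env_vars` is a hypothetical helper. Check `.env.example` for the authoritative list.

```python
import os

# Assumed names; `.env.example` is the authoritative list for this project.
REQUIRED_VARS = ["OPENAI_API_KEY", "PHOENIX_BASE_URL"]


def missing_env_vars(required, env=None):
    """Return the names in `required` that are unset or blank in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Run this after `dotenv.load_dotenv()` so values from your `.env` file are included in the check.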



## 🎯 Phase 1: Generate Agent Traces

In this phase, we'll run our multi-agent system through a series of test cases to generate traces. Each test case represents a different type of user request that our personal assistant system should handle.

These traces will serve as the foundation for our evaluation framework.


### Configure Environment and Project Settings

In [None]:
# Load environment variables and set up project configuration
import os

import dotenv

dotenv.load_dotenv()

project_name = os.getenv("PHOENIX_PROJECT_NAME", "observe-agents")
print(f"📊 Using Phoenix project: {project_name}")

### Setup Phoenix Tracing

Now we'll set up Phoenix tracing to capture all agent interactions. This includes:
- Registering a tracer provider with Phoenix
- Instrumenting the OpenAI Agents framework to automatically capture spans
- Creating a tracer for manual span creation when needed


In [None]:
# Set up Phoenix tracing and OpenAI Agents instrumentation
import os

from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor

from phoenix.otel import register

tracer_provider = register(project_name=project_name, auto_instrument=False, verbose=False)

OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
tracer = tracer_provider.get_tracer(__name__)

### Run Test Cases Through the Multi-Agent System

Now we'll execute our test cases through the coordinator agent.

We're using a subset of 5 test cases to keep the demo manageable, but this approach scales to hundreds or thousands of test cases.


In [None]:
import coordinator
import pandas as pd
from nest_asyncio import apply
from opentelemetry.trace import Status, StatusCode

apply()

test_cases = pd.read_csv("utils/agent_inputs.csv")[:5]

for index, row in test_cases.iterrows():
    print(f"Running test case {index+1}")
    with tracer.start_as_current_span("run_agent") as span:
        span.set_attribute("input.value", row["input"])
        span.set_attribute("openinference.span.kind", "CHAIN")
        result = coordinator.run_coordinator_agent(row["input"])
        span.set_attribute("output.value", result)
        span.set_status(Status(StatusCode.OK))
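
In the loop above, an exception raised by the agent propagates out of the `with` block and stops the remaining test cases. One way to keep a batch going is to catch the failure per case and record it. The sketch below shows that pattern in isolation, under stated assumptions: `run_cases` and `fake_agent` are hypothetical names, and `fake_agent` stands in for `coordinator.run_coordinator_agent`.

```python
def run_cases(inputs, run_fn):
    """Run every input through run_fn, recording failures instead of raising."""
    results = []
    for text in inputs:
        try:
            results.append({"input": text, "output": run_fn(text), "status": "OK"})
        except Exception as exc:  # keep going so one bad case doesn't end the batch
            results.append({"input": text, "output": repr(exc), "status": "ERROR"})
    return results


# Stand-in for the real agent call, used here only for illustration.
def fake_agent(text):
    if not text:
        raise ValueError("empty request")
    return f"handled: {text}"


for record in run_cases(["Book a meeting", ""], fake_agent):
    print(record["status"], "-", record["output"])
```

In the traced version, the `except` branch is also a natural place to mark the span as failed, for example with `Status(StatusCode.ERROR)`, so that errored runs are easy to filter in Phoenix.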

# ⚠️ Pause - Add Annotations in the Phoenix UI

Now that you have a few traces collected in Phoenix, annotate them before continuing with this notebook. You may need to create a new annotation config to complete this step.

![adding-annotations](https://storage.googleapis.com/arize-phoenix-assets/assets/gifs/annotation-example.gif)

For more on annotations, [check out our docs](https://arize.com/docs/phoenix/tracing/features-tracing/how-to-annotate-traces)

In [None]:
# Pause here until the traces have been annotated in the Phoenix UI
input("Have you added annotations to your traces in the Phoenix UI? Press Enter to continue. ")

## 📊 Phase 2: Collect Human Annotations

In this phase, we'll collect human annotations that will serve as ground truth for our evaluation system. This is a crucial step in our scientific approach.


### Retrieve Traces and Annotations from Phoenix

We'll use the Phoenix client to retrieve our traces and any existing human annotations. This shows how Phoenix serves as both an observability platform and a data repository for evaluation workflows.

In [None]:
import os

from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

phoenix_client = Client(base_url=os.getenv("PHOENIX_BASE_URL"))

# Fetch only the top-level spans created by the test-case loop above
query = SpanQuery().where("name == 'run_agent'")
spans_df = phoenix_client.spans.get_spans_dataframe(project_identifier=project_name, query=query)

# Fetch the human annotations attached to those spans
annotations_df = phoenix_client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df, project_identifier=project_name
)

# Inner join on the shared index, keeping only spans that have annotations
combined_df = annotations_df.join(spans_df, how="inner")

combined_df.head()
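
Because the join is an inner join, spans without annotations drop out of `combined_df` silently. A quick coverage check makes that visible. This is a minimal sketch with illustrative IDs; `annotation_coverage` is a hypothetical helper, and it assumes both dataframes are indexed by span ID, so in the notebook you would pass in their indexes.

```python
def annotation_coverage(span_ids, annotated_span_ids):
    """Return (covered, missing) span IDs, preserving the order of span_ids."""
    annotated = set(annotated_span_ids)
    covered = [sid for sid in span_ids if sid in annotated]
    missing = [sid for sid in span_ids if sid not in annotated]
    return covered, missing


# Illustrative IDs; a span can carry more than one annotation, hence the repeat.
covered, missing = annotation_coverage(["s1", "s2", "s3"], ["s3", "s1", "s1"])
print(f"{len(covered)} annotated, {len(missing)} missing: {missing}")
```

If the missing list is longer than expected, return to the Phoenix UI and annotate the remaining traces before building the judge.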

In [None]:
# Keep three annotated examples to use as few-shot examples in the judge prompt
expamples_df = combined_df[
    ["annotation_name", "result.label", "attributes.input.value", "attributes.output.value"]
].head(3)

## 🤖 Phase 3: Build LLM Judge from Human Feedback

This is where the scientific approach really shines. We'll create an LLM-based evaluator that uses human annotations as few-shot examples, allowing us to scale human judgment to larger datasets.

**What happens next:**
- Extract human-annotated examples to use as few-shot examples
- Design an evaluation prompt that teaches the LLM our quality criteria
- Create a structured evaluation template with clear classification categories
- Test the prompt to ensure it captures human judgment patterns

This approach ensures our automated evaluations are grounded in human expertise rather than arbitrary criteria.
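
One way to test whether a judge captures human judgment is to compare its labels with human labels on the same items. The sketch below computes raw agreement and Cohen's kappa, which corrects for agreement expected by chance. It is an illustration with made-up labels, not part of the project code; the function names are hypothetical.

```python
from collections import Counter


def agreement_rate(human, judge):
    """Fraction of items where both raters chose the same label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohen_kappa(human, judge):
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    observed = agreement_rate(human, judge)
    n = len(human)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Expected agreement if both raters labeled independently at their own rates.
    expected = sum(
        (human_counts[label] / n) * (judge_counts[label] / n)
        for label in set(human) | set(judge)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label
    return (observed - expected) / (1 - expected)


# Made-up labels for four responses.
human = ["helpful", "helpful", "unhelpful", "somewhat_helpful"]
judge = ["helpful", "unhelpful", "unhelpful", "somewhat_helpful"]
print(f"agreement={agreement_rate(human, judge):.2f}, kappa={cohen_kappa(human, judge):.2f}")
```

For a fair check, compute agreement on annotated examples that were not used as few-shot examples in the prompt.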


### Create Few-Shot Evaluation Prompt

We'll build a prompt that incorporates human-annotated examples, teaching the LLM to evaluate agent responses according to our quality standards.

In [None]:
from datetime import datetime

examples_text = "\n\n".join(
    [
        f"Request: {example['attributes.input.value']}\nResponse: {example['attributes.output.value']}\nHuman Rating: {example['result.label']}"
        for example in expamples_df.to_dict(orient="records")
    ]
)

eval_prompt = f"""
You are an expert evaluator of agent responses. You will be given a human request and an agent's response.
The current date is {datetime.now().strftime("%Y-%m-%d")}. Be sure relative dates are correct.
Your job is to evaluate whether the agent's response is helpful and effective by assigning exactly one of three labels:
- helpful: The response effectively addresses the request and provides useful information
- somewhat_helpful: The response partially addresses the request but misses key points
- unhelpful: The response doesn't address the request, is incorrect, or provides no useful information

Here are some examples of responses with their human ratings:
{examples_text}

## Evaluation
Provide your answer in the following format:
Request:
Response:
Explanation:
Classification: helpful, somewhat_helpful, or unhelpful

Request: {{attributes.input.value}}
Response: {{attributes.output.value}}
Explanation:
"""

print(eval_prompt)
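
The doubled braces in the prompt are deliberate. Inside an f-string, single braces are filled immediately, while `{{name}}` is emitted as the literal `{name}`, which leaves a placeholder to be filled per row later. The sketch below shows this behavior in plain Python; `fill` is a toy stand-in for template rendering, not Phoenix's implementation.

```python
examples_text = "Request: hi\nResponse: hello\nHuman Rating: helpful"

# Single braces are filled now by the f-string; doubled braces survive as a
# literal {placeholder} for a later, per-row substitution.
template = f"Examples:\n{examples_text}\n\nRequest: {{attributes.input.value}}"


def fill(template, row):
    """Toy stand-in for template rendering: replace each {column} with its value."""
    for column, value in row.items():
        template = template.replace("{" + column + "}", str(value))
    return template


print(template)
print("---")
print(fill(template, {"attributes.input.value": "Schedule lunch with Bob"}))
```

The same reasoning suggests a caution: if the few-shot examples themselves contain braces, such as JSON in an agent response, a template engine may treat them as placeholders, so it is worth inspecting the printed prompt.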

### Execute LLM-Based Evaluation

Now you'll run our LLM judge across all the traces to generate consistent, scalable evaluations.


In [None]:
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

evals_df = llm_classify(
    data=spans_df,
    model=OpenAIModel(model="gpt-4o"),
    rails=["helpful", "somewhat_helpful", "unhelpful"],
    template=PromptTemplate(
        template=eval_prompt,
    ),
    exit_on_error=False,
    provide_explanation=True,
)

# Assign 1 to helpful, 0.5 to somewhat_helpful, and 0 to unhelpful or missing labels
label_scores = {"helpful": 1, "somewhat_helpful": 0.5}
evals_df["score"] = evals_df["label"].apply(
    lambda x: label_scores.get(x.lower(), 0) if x is not None else 0
)
evals_df[["label", "score", "explanation"]].head()
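
With numeric scores in place, a short summary gives a baseline to compare against later. This is a minimal sketch using only the standard library; `summarize` is a hypothetical helper, and the score mapping mirrors the 1 / 0.5 / 0 scheme used in this notebook.

```python
from collections import Counter
from statistics import mean

SCORES = {"helpful": 1.0, "somewhat_helpful": 0.5, "unhelpful": 0.0}


def summarize(labels):
    """Return (label counts, mean score); unknown or missing labels score 0."""
    normalized = [label.lower() if isinstance(label, str) else "" for label in labels]
    counts = Counter(normalized)
    average = mean(SCORES.get(label, 0.0) for label in normalized)
    return counts, average


# Made-up labels, including a missing one as produced by a failed evaluation.
counts, average = summarize(["helpful", "Helpful", "somewhat_helpful", None])
print(dict(counts), f"mean score = {average:.3f}")
```

In the notebook you could pass `evals_df["label"].tolist()`. A non-trivial count under the empty-string key indicates evaluations that failed or returned a label outside the rails.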

### Store Evaluations in Phoenix

Finally, we'll log our evaluation results back to Phoenix, creating a complete feedback loop between observability and evaluation.


In [None]:
import phoenix as px
from phoenix.trace import SpanEvaluations

px.Client(endpoint=os.getenv("PHOENIX_BASE_URL")).log_evaluations(
    SpanEvaluations(
        dataframe=evals_df,
        eval_name="llm_agent_helpfulness",
    )
)

## 🧪 Phase 4: Experimental Evaluation - Email Extraction Tool

This phase demonstrates the power of systematic experimentation in agent improvement. Rather than making changes based on intuition, we'll use data-driven evaluation to compare different agent configurations.

**What happens next:**
- Run evaluations across multiple experimental configurations
- Compare performance metrics between different agent setups
- Identify which changes actually improve agent performance
- Make evidence-based decisions about agent improvements

This is the essence of **scientific agent development** - hypothesis, experiment, measure, iterate.

For this example, we're going to zoom in on a particular tool and show how you can use experimentation techniques to improve performance on that tool. In this case, the tool we will use is the **email extraction tool**. 

You could apply these same techniques to the agent routing or any other tool used by the agents.
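
When two configurations are run over the same examples, a paired comparison is more informative than comparing averages alone, because it shows how many individual examples improved or regressed. The sketch below uses made-up scores; `compare_runs` is a hypothetical helper, not part of this project.

```python
from statistics import mean


def compare_runs(baseline, candidate):
    """Compare per-example scores from two runs over the same examples."""
    if len(baseline) != len(candidate) or not baseline:
        raise ValueError("runs must be non-empty and cover the same examples")
    deltas = [c - b for b, c in zip(baseline, candidate)]
    return {
        "mean_delta": mean(deltas),
        "improved": sum(d > 0 for d in deltas),
        "regressed": sum(d < 0 for d in deltas),
        "unchanged": sum(d == 0 for d in deltas),
    }


# Made-up scores for five examples under two prompt versions.
baseline = [0.0, 0.5, 1.0, 0.5, 0.0]
candidate = [0.5, 0.5, 1.0, 1.0, 0.0]
print(compare_runs(baseline, candidate))
```

A positive mean delta with zero regressions is a stronger signal than the same delta produced by large gains offset by several regressions. With only a handful of examples, treat either result as tentative.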


# ⚠️ Pause - Run Experiment in Phoenix

While you can run experiments in code, this example uses Phoenix's Prompt Playground to send a set of examples through different prompt versions.

Follow along with the video below to see how:

<video width="100%" controls>
  <source src="https://storage.googleapis.com/arize-phoenix-assets/assets/videos/observe-experiment-example.mp4" type="video/mp4">
  Your browser does not support the video tag.
</video>

### Compare Multiple Agent Configurations

With that experiment run, you can now add different evaluation metrics to the results. These evaluations are computed by `utils/evaluate_experiment.py`, which is included in this project.

In [None]:
from nest_asyncio import apply

from utils.evaluate_experiment import run_evaluation_for_experiment

apply()

run_evaluation_for_experiment(
    experiment_ids=["Add your Experiment IDs here", "Add as many as you want"]
)

You should now see experiment results showing improvement on the comparison-against-ground-truth metric:
![experiment-results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/observe-improvement.png)
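
For an extraction tool, a comparison against ground truth can be as simple as checking how many expected fields were extracted correctly. The sketch below illustrates that idea only. It is not the metric implemented in `utils/evaluate_experiment.py`, and `field_match_rate` and the field names are hypothetical.

```python
def field_match_rate(predicted, expected):
    """Fraction of expected fields whose predicted value matches after
    trimming whitespace and lowercasing."""
    if not expected:
        raise ValueError("expected must contain at least one field")

    def norm(value):
        return str(value).strip().lower()

    hits = sum(
        1
        for key, value in expected.items()
        if key in predicted and norm(predicted[key]) == norm(value)
    )
    return hits / len(expected)


# Hypothetical field names for an event extracted from an email.
expected = {"title": "Team sync", "date": "2025-07-01", "location": "Room 4"}
predicted = {"title": "team sync ", "date": "2025-07-02", "location": "Room 4"}
print(f"{field_match_rate(predicted, expected):.2f}")
```

Exact matching is strict by design. For free-text fields such as descriptions, an LLM judge or a similarity measure may be a better fit.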

## 🎯 Conclusion: The Scientific Agent Development Cycle

Congratulations! You've just completed a full cycle of **scientific agent evaluation** using Arize Phoenix. Here's what you've learned:

### 🔬 Key Takeaways

1. **📊 Observability First**: Comprehensive tracing provides the foundation for understanding agent behavior
2. **👥 Human-Centered Evaluation**: Ground truth from human annotators ensures meaningful quality metrics
3. **🤖 Scalable Automation**: LLM judges calibrated with human-annotated examples enable evaluation at scale
4. **🧪 Systematic Experimentation**: A/B testing different agent configurations drives real improvements
5. **🔄 Continuous Improvement**: The feedback loop enables rapid iteration and problem-solving

### 🚀 Next Steps

To apply this methodology to your own agents:

1. **Set up comprehensive tracing** for your agent system using Phoenix
2. **Collect human annotations** on a representative sample of agent interactions
3. **Build LLM judges** that capture your specific quality criteria
4. **Run systematic experiments** comparing different agent configurations
5. **Iterate continuously** based on evaluation results

### 🌟 The Impact

This scientific approach to agent development enables:
- **Faster debugging** of agent failures
- **Evidence-based improvements** rather than guesswork
- **Scalable quality assurance** as your system grows
- **Measurable progress** toward your quality goals

---

*This tutorial was inspired by the **Arize Observe 2025** presentation on scientific approaches to AI agent evaluation. For more resources on agent observability and evaluation, visit [docs.arize.com/phoenix](https://docs.arize.com/phoenix/).*
