<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# Trace-Level Evals for a Movie Recommendation Agent

This notebook demonstrates how to run trace-level evaluations for a movie recommendation agent. Each trace represents a single user request, so analyzing traces individually shows how well the system performs on a per-interaction basis. Trace-level evaluations are particularly valuable for identifying where the system succeeds or fails end to end.

In this notebook, you will:
- Build and capture interactions (traces) from your movie recommendation agent
- Evaluate each trace across key dimensions such as Recommendation Relevance and Tool Usage
- Format the evaluation outputs to match the Phoenix annotation schema and log them to Phoenix
- Learn a robust pipeline for assessing trace-level performance

✅ You will need a free [Phoenix Cloud account](https://app.arize.com/auth/phoenix/login) and an OpenAI API key to run this notebook.

# Set Up Keys & Dependencies

In [None]:
%pip install openinference-instrumentation-openai openinference-instrumentation-openai-agents openinference-instrumentation arize-phoenix arize-phoenix-otel nest_asyncio openai openai-agents

In [1]:
import os
from getpass import getpass

import nest_asyncio

nest_asyncio.apply()

if not (phoenix_endpoint := os.getenv("PHOENIX_COLLECTOR_ENDPOINT")):
    phoenix_endpoint = getpass("🔑 Enter your Phoenix Collector Endpoint: ")
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = phoenix_endpoint


if not (phoenix_api_key := os.getenv("PHOENIX_API_KEY")):
    phoenix_api_key = getpass("🔑 Enter your Phoenix API key: ")
os.environ["PHOENIX_API_KEY"] = phoenix_api_key

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

# Configure Tracing

In [None]:
from phoenix.otel import register

# configure the Phoenix tracer
tracer_provider = register(project_name="movie-rec-agent", auto_instrument=True)

# Build Movie Recommendation System

First, we need to define the tools that our recommendation system will use. For this example, we will define 3 tools:
1. Movie Selector: Based on the genre the user asks for, choose up to 5 recent movies available for streaming
2. Reviewer: Find reviews for a movie. If given a list of movies, sort movies in order of highest to lowest ratings.
3. Preview Summarizer: For each movie, return a 1-2 sentence description

In the ideal flow, the user simply describes the type of movie they are looking for and receives a list of options with descriptions and reviews.
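The intended order of operations can be sketched with plain functions before any LLM is involved. The stubs below are hypothetical stand-ins for the three tools (the catalog, ratings, and function names are invented for illustration); they show only the sequence of calls: select, then rank by rating, then summarize.

```python
from typing import Dict, List

# Hypothetical, hard-coded data standing in for the LLM-backed tools.
CATALOG: Dict[str, List[str]] = {"comedy": ["Movie A", "Movie B", "Movie C"]}
RATINGS: Dict[str, float] = {"Movie A": 6.5, "Movie B": 8.1, "Movie C": 7.2}


def select_movies(genre: str) -> List[str]:
    """Step 1: pick up to 5 titles for the requested genre."""
    return CATALOG.get(genre.lower(), [])[:5]


def rank_by_rating(movies: List[str]) -> List[str]:
    """Step 2: sort titles from highest to lowest rating."""
    return sorted(movies, key=lambda m: RATINGS.get(m, 0.0), reverse=True)


def summarize(movie: str) -> str:
    """Step 3: return a short description (placeholder text here)."""
    return f"{movie}: a placeholder one-line summary."


def recommend(genre: str) -> List[str]:
    """Chain the three steps in the order the agent is expected to follow."""
    return [summarize(m) for m in rank_by_rating(select_movies(genre))]


recommendations = recommend("comedy")
print(recommendations)
```

In the real agent the model decides which tools to call and in what order, which is exactly what the tool-usage evaluation later checks.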

Let's test our agent and view the traces in Phoenix.

In [12]:
import ast
from typing import List, Union

from agents import Agent, Runner, function_tool
from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# Named `openai_client` so that the Phoenix `client` created later does not shadow it.
openai_client = OpenAI()


@function_tool
def movie_selector_llm(genre: str) -> List[str]:
    """Return up to 5 recent popular streaming movie titles for a genre."""
    prompt = (
        f"List up to 5 recent popular streaming movies in the {genre} genre. "
        "Provide only movie titles as a Python list of strings."
    )
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150,
    )
    content = response.choices[0].message.content or ""
    try:
        movie_list = ast.literal_eval(content)
        if isinstance(movie_list, list):
            return [str(title) for title in movie_list[:5]]
    except (ValueError, SyntaxError):
        pass
    # Fallback: the reply was not a Python list, so treat each non-empty line as a title.
    return [line.strip() for line in content.splitlines() if line.strip()][:5]


@function_tool
def reviewer_llm(movies: Union[str, List[str]]) -> str:
    """Review one movie, or rank a list of movies by rating with short reviews."""
    if isinstance(movies, list):
        movies_str = ", ".join(movies)
        prompt = f"Sort the following movies by rating from highest to lowest and provide a short review for each:\n{movies_str}"
    else:
        prompt = f"Provide a short review and rating for the movie: {movies}"
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=300,
    )
    return response.choices[0].message.content.strip()


@function_tool
def preview_summarizer_llm(movie: str) -> str:
    """Return a 1-2 sentence summary of a movie."""
    prompt = f"Write a 1-2 sentence summary describing the movie '{movie}'."
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=100,
    )
    return response.choices[0].message.content.strip()
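
LLM output rarely arrives as a clean Python literal: it may be wrapped in a Markdown code fence or formatted as a numbered list. A minimal sketch of a more forgiving parser follows. `parse_titles` is a hypothetical helper, not part of the notebook or of any library, and its cleanup rules are assumptions about typical model output.

```python
import ast
import re
from typing import List

FENCE = "`" * 3  # Markdown code-fence marker


def parse_titles(content: str, limit: int = 5) -> List[str]:
    """Best-effort extraction of movie titles from raw LLM text."""
    text = content.strip()
    # Drop a surrounding Markdown code fence (with optional language tag).
    text = re.sub(rf"^{FENCE}[a-zA-Z]*\s*|\s*{FENCE}$", "", text).strip()
    try:
        parsed = ast.literal_eval(text)
        if isinstance(parsed, (list, tuple)):
            return [str(t).strip() for t in parsed][:limit]
    except (ValueError, SyntaxError, TypeError):
        pass
    # Fallback: one title per line, removing bullets, numbering, and quotes.
    titles = []
    for line in text.splitlines():
        cleaned = re.sub(r"^\s*(?:[-*]|\d+[.)])\s*", "", line).strip().strip("\"'")
        if cleaned:
            titles.append(cleaned)
    return titles[:limit]


print(parse_titles("1. Movie A\n2. Movie B"))
```

A helper like this could replace the `try`/`except` inside the selector tool, keeping the tool body focused on the API call.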

In [13]:
agent = Agent(
    name="MovieRecommendationAgentLLM",
    tools=[movie_selector_llm, reviewer_llm, preview_summarizer_llm],
    instructions=(
        "You are a helpful movie recommendation assistant with access to three tools:\n"
        "1. MovieSelector: Given a genre, returns up to 5 recent streaming movies.\n"
        "2. Reviewer: Given one or more movie titles, returns reviews and sorts them by rating.\n"
        "3. PreviewSummarizer: Given a movie title, returns a 1-2 sentence summary.\n\n"
        "Your goal is to provide a helpful, user-friendly response combining relevant information."
    ),
)


async def main():
    user_input = "Which comedy movie should I watch?"
    result = await Runner.run(agent, user_input)
    print(result.final_output)


await main()

Next, we’ll run the agent a few more times to generate additional traces. Feel free to adapt or customize the questions as you see fit.

In [14]:
questions = [
    "Which Batman movie should I watch?",
    "I want to watch a good romcom",
    "What is a very scary horror movie?",
    "Name a feel-good holiday movie",
    "Recommend a musical with great songs",
    "Give me a classic drama from the 90s",
]

for question in questions:
    result = await Runner.run(agent, question)

# Get Span Data from Phoenix

Before running our evaluations, we first retrieve the span data from Phoenix. We then group the spans by trace and separate the input and output values.

In [None]:
from phoenix.client import Client

client = Client()
primary_df = client.spans.get_spans_dataframe(project_identifier="movie-rec-agent")

In [22]:
import pandas as pd

trace_df = primary_df.groupby("context.trace_id").agg(
    {
        "attributes.input.value": "first",
        "attributes.output.value": lambda x: " ".join(x.dropna()),
    }
)

trace_df.head()

| context.trace_id | attributes.input.value | attributes.output.value |
| --- | --- | --- |
| 0037bf5d664fdad1ea584dcd435af0db | `{"include": [], "input": [{"content": "What is...` | `{"id":"resp_68accc26a94481919a9da846941aeaa005...` |
| 00c8fb2d10146b439f6ce7b8e339951c | `{"messages": [{"role": "user", "content": "Wri...` | `{"id":"chatcmpl-C8YQ9LUta7DbUQtjCdkVEujeOpEnc"...` |
| 03710e04ee489b7456aca88c84aa6286 | `{"messages": [{"role": "user", "content": "Wri...` | `{"id":"chatcmpl-C8YNsh328Cmb1iWwfsal9KyRQkkVu"...` |
| 10d940a0a100c3472f2ad3bb0e4a1cd0 | `{"include": [], "input": [{"content": "Name a ...` | `{"id":"resp_68accc3332b881a0876eb000137fb60106...` |
| 12fbc98b124324a47f064a3cf57e7f63 | `{"messages": [{"role": "user", "content": "Wri...` | `{"id":"chatcmpl-C8YPhJiupKCnPkKqZz3GCezvMGw90"...` |
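
Note that the `"first"` aggregation takes whichever span happens to come first in each group, which is not necessarily the root span. One alternative, sketched here on a toy DataFrame with invented values, is to take the input from the span whose `parent_id` is null and to join the outputs in start-time order. The column names mirror those used in the grouping cell; the `start_time` column is assumed to be present in the exported spans.

```python
import pandas as pd

# Toy spans (invented values): two traces, with a child span listed before its root.
spans = pd.DataFrame(
    {
        "context.trace_id": ["t1", "t1", "t2"],
        "parent_id": ["s1", None, None],
        "start_time": pd.to_datetime(
            ["2025-01-01 00:00:02", "2025-01-01 00:00:01", "2025-01-01 00:00:05"]
        ),
        "attributes.input.value": ["tool call input", "user question 1", "user question 2"],
        "attributes.output.value": ["tool result", "final answer 1", None],
    }
)

# Order spans chronologically so concatenated outputs follow execution order.
ordered = spans.sort_values("start_time")

# Trace input: the input of the root span (the one without a parent).
root_inputs = ordered[ordered["parent_id"].isna()].set_index("context.trace_id")[
    "attributes.input.value"
]

# Trace output: every non-null span output in the trace, joined with spaces.
outputs = ordered.groupby("context.trace_id")["attributes.output.value"].agg(
    lambda x: " ".join(x.dropna())
)

toy_trace_df = pd.concat([root_inputs, outputs], axis=1)
print(toy_trace_df)
```

Whether this matters depends on how the spans are ordered in the export; if the root span always comes first, the simpler `"first"` aggregation gives the same input.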


# Define and Run Evaluators

In this tutorial, we will evaluate two aspects: tool usage and relevance. You can add any additional evaluation templates you like. We will then run the evaluations using an LLM as the judge.

In [23]:
TOOL_CALLING_ORDER = """
You are evaluating the correctness of the tool calling order in an LLM application's trace.

You will be given:
1. The user input that initiated the trace
2. The full trace output, including the sequence of tool calls made by the agent

##
User Input:
{attributes.input.value}

Trace Output:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
1. `correct` →
- The tool calls occur in the appropriate order to fulfill the user's request logically and effectively.
- A proper answer involves calls to reviews, summaries, and recommendations where relevant.
2. `incorrect` → The tool calls are out of order, missing, or do not follow a coherent sequence for the given input.
"""

In [24]:
RECOMMENDATION_RELEVANCE = """
You are evaluating the relevance of movie recommendations provided by an LLM application.

You will be given:
1. The user input that initiated the trace
2. The list of movie recommendations output by the system

##
User Input:
{attributes.input.value}

Recommendations:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
1. `correct` →
- All recommended movies match the requested genre or criteria in the user input.
- The recommendations should be relevant to the user's request and shouldn't be repetitive.
2. `incorrect` → One or more recommendations do not match the requested genre or criteria.
"""
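
To make the mechanics concrete: for each row, the `{column.name}` placeholders in a template are replaced by that row's values, and the judge's raw reply is mapped onto one of the allowed labels (the rails). The sketch below imitates both steps in plain Python. `fill_template` and `snap_to_rails` are hypothetical helpers written for illustration; they are not the functions `phoenix.evals` uses internally, and the matching rules there may differ.

```python
import re
from typing import Dict, List, Optional


def fill_template(template: str, row: Dict[str, str]) -> str:
    """Replace each {column.name} placeholder with the row's value."""
    # str.format is not used here: it would read the dots as attribute access.
    return re.sub(
        r"\{([^{}]+)\}",
        lambda m: str(row.get(m.group(1), m.group(0))),
        template,
    )


def snap_to_rails(raw: str, rails: List[str]) -> Optional[str]:
    """Map a raw judge reply to a rail, or None if no rail is found."""
    text = raw.strip().strip("`.").lower()
    # Whole-word matching keeps "correct" from matching inside "incorrect".
    for rail in sorted(rails, key=len, reverse=True):
        if re.search(rf"\b{re.escape(rail)}\b", text):
            return rail
    return None


prompt = fill_template(
    "User Input:\n{attributes.input.value}",
    {"attributes.input.value": "Which comedy movie should I watch?"},
)
print(prompt)
print(snap_to_rails("`Incorrect`", ["correct", "incorrect"]))
```

Replies that match no rail are returned as `None` here so they can be inspected separately instead of being silently counted as one of the labels.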

In [25]:
import os

import nest_asyncio

from phoenix.evals import OpenAIModel, llm_classify

nest_asyncio.apply()

model = OpenAIModel(
    api_key=os.environ["OPENAI_API_KEY"],
    model="gpt-4o-mini",
    temperature=0.0,
)

rails = ["correct", "incorrect"]

tool_eval_results = llm_classify(
    dataframe=trace_df,
    template=TOOL_CALLING_ORDER,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

tool_eval_results

| context.trace_id | label | explanation | exceptions | execution_status | execution_seconds | prompt_tokens | completion_tokens | total_tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0037bf5d664fdad1ea584dcd435af0db | correct | The tool calls occur in the appropriate order ... | [] | COMPLETED | 2.322211 | 1422 | 91 | 1513 |
| 00c8fb2d10146b439f6ce7b8e339951c | correct | The tool calls occur in the appropriate order ... | [] | COMPLETED | 1.93063 | 484 | 44 | 528 |
| 03710e04ee489b7456aca88c84aa6286 | correct | The tool calls occur in the appropriate order ... | [] | COMPLETED | 1.910875 | 467 | 46 | 513 |
| 10d940a0a100c3472f2ad3bb0e4a1cd0 | incorrect | The trace output shows that the only tool call... | [] | COMPLETED | 2.71293 | 1188 | 97 | 1285 |
| 12fbc98b124324a47f064a3cf57e7f63 | correct | The trace output provides a summary of the mov... | [] | COMPLETED | 1.643881 | 479 | 53 | 532 |
| 181ab7e1b0d3c12eaa4c01d66275f57b | incorrect | The tool calls are incorrect because the first... | [] | COMPLETED | 2.611225 | 1271 | 86 | 1357 |
| 184937e48a812ef69c3c1224812afc0b | correct | The tool calls occur in the correct order. Fir... | [] | COMPLETED | 3.161473 | 1607 | 102 | 1709 |
| 1bd179a81506b600f2390cb4927edb8e | correct | The tool calls occur in the appropriate order ... | [] | COMPLETED | 2.682794 | 1800 | 70 | 1870 |
| 212741702ececc445f678332c989ca69 | correct | The trace output provides a direct response to... | [] | COMPLETED | 2.273179 | 475 | 60 | 535 |
| 241ca4e109a3b8af09b5078e5388a2ec | correct | The tool calls occur in the appropriate order ... | [] | COMPLETED | 2.384951 | 455 | 46 | 501 |


In [26]:
relevance_eval_results = llm_classify(
    dataframe=trace_df,
    template=RECOMMENDATION_RELEVANCE,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

relevance_eval_results

| context.trace_id | label | explanation | exceptions | execution_status | execution_seconds | prompt_tokens | completion_tokens | total_tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0037bf5d664fdad1ea584dcd435af0db | correct | All recommended movies, including 'The Invisib... | [] | COMPLETED | 1.685321 | 1399 | 71 | 1470 |
| 00c8fb2d10146b439f6ce7b8e339951c | correct | The user requested a summary of the movie 'The... | [] | COMPLETED | 2.198212 | 461 | 51 | 512 |
| 03710e04ee489b7456aca88c84aa6286 | correct | The user requested a summary of the movie 'Com... | [] | COMPLETED | 1.619496 | 444 | 49 | 493 |
| 10d940a0a100c3472f2ad3bb0e4a1cd0 | incorrect | The output does not include any specific movie... | [] | COMPLETED | 1.594842 | 1165 | 61 | 1226 |
| 12fbc98b124324a47f064a3cf57e7f63 | correct | The user requested a summary of the movie 'Ope... | [] | COMPLETED | 1.619444 | 456 | 55 | 511 |
| 181ab7e1b0d3c12eaa4c01d66275f57b | correct | The recommendations provided by the system inc... | [] | COMPLETED | 2.021733 | 1248 | 59 | 1307 |
| 184937e48a812ef69c3c1224812afc0b | incorrect | The user asked for a Batman movie, but the rec... | [] | COMPLETED | 0.897057 | 1584 | 65 | 1649 |
| 1bd179a81506b600f2390cb4927edb8e | correct | All recommended movies are relevant to the use... | [] | COMPLETED | 1.6253 | 1777 | 51 | 1828 |
| 212741702ececc445f678332c989ca69 | correct | The user requested a summary of the movie 'To ... | [] | COMPLETED | 1.20446 | 452 | 46 | 498 |
| 241ca4e109a3b8af09b5078e5388a2ec | correct | All recommended movies are recent and belong t... | [] | COMPLETED | 1.437561 | 432 | 40 | 472 |


# Log Results Back to Phoenix

The final step is to log our results back to Phoenix. Annotations are attached to spans, so each trace-level result is joined to the root span of its trace before logging. After running the cell below, you’ll be able to view your trace-level evaluations on the platform, complete with labels and explanations.

In [31]:
root_spans = primary_df[primary_df["parent_id"].isna()][["context.trace_id", "context.span_id"]]

tool_eval_results = tool_eval_results[["label", "explanation"]]

# Merge tool correctness eval results with trace_df
tool_correctness_df = pd.merge(
    trace_df, tool_eval_results, left_index=True, right_index=True, how="left"
)

# Merge with root spans to get valid span IDs
tool_correctness_df = pd.merge(
    tool_correctness_df.reset_index(), root_spans, on="context.trace_id", how="left"
).set_index("context.span_id", drop=False)

relevance_eval_results = relevance_eval_results[["label", "explanation"]]

# Merge relevance eval results with trace_df
relevance_df = pd.merge(
    trace_df, relevance_eval_results, left_index=True, right_index=True, how="left"
)

# Merge with root spans to get valid span IDs
relevance_df = pd.merge(
    relevance_df.reset_index(), root_spans, on="context.trace_id", how="left"
).set_index("context.span_id", drop=False)


# Log to Phoenix
client.annotations.log_span_annotations_dataframe(
    dataframe=tool_correctness_df,
    annotation_name="Tool Correctness",
    annotator_kind="LLM",
)
client.annotations.log_span_annotations_dataframe(
    dataframe=relevance_df,
    annotation_name="Recommendation Relevance",
    annotator_kind="LLM",
)
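
The merges in the cell above attach each trace-level label to the root span of its trace. The toy example below (invented IDs and labels) shows the same join on a small scale and also derives a numeric `score` column from the label. The 1/0 mapping is an assumption for illustration; the logging cell itself sends only `label` and `explanation`, so check the Phoenix client documentation for how a score column is handled before relying on it.

```python
import pandas as pd

# Toy span table (invented IDs): root spans have no parent.
toy_spans = pd.DataFrame(
    {
        "context.trace_id": ["t1", "t1", "t2"],
        "context.span_id": ["s1", "s2", "s3"],
        "parent_id": [None, "s1", None],
    }
)

# Toy trace-level eval results, indexed by trace ID like the eval output.
toy_evals = pd.DataFrame(
    {"label": ["correct", "incorrect"], "explanation": ["ok", "wrong genre"]},
    index=pd.Index(["t1", "t2"], name="context.trace_id"),
)

# Keep one row per trace: the root span.
toy_roots = toy_spans[toy_spans["parent_id"].isna()][
    ["context.trace_id", "context.span_id"]
]

# Attach the root span ID to each trace-level result and index by span ID.
annotations = pd.merge(
    toy_evals.reset_index(), toy_roots, on="context.trace_id", how="left"
).set_index("context.span_id")

# Assumed mapping: 1 for "correct", 0 for any other label.
annotations["score"] = (annotations["label"] == "correct").astype(int)
print(annotations)
```

A left join keeps every evaluated trace even if no root span is found for it; such rows end up with a missing span ID and would need to be dropped before logging.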

![Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/trace_level_evals_phoenix.png)