<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Slack Community</a>
    </p>
</center>

# Trace-Level Evals for a Movie Recommendation Agent

This notebook demonstrates how to run trace-level evaluations for a movie recommendation agent. Each trace represents a single user request, so analyzing traces individually shows how well the system performs on a per-interaction basis. Trace-level evaluations are particularly valuable for identifying where the agent succeeds or fails end to end.

In this notebook, you will:
- Build and capture interactions (traces) from your movie recommendation agent
- Evaluate each trace across key dimensions such as Recommendation Relevance and Tool Usage
- Format the evaluation outputs to match Arize’s schema and log them to the platform
- Learn a robust pipeline for assessing trace-level performance

✅ You will need a free Arize AX account and an OpenAI API key to run this notebook.

# Set Up Keys & Dependencies

In [None]:
%pip install -qqqqq openinference-instrumentation-openai openai openinference-instrumentation-openai-agents arize-otel openai-agents arize-phoenix

In [None]:
import os
from getpass import getpass

os.environ["ARIZE_SPACE_ID"] = globals().get("ARIZE_SPACE_ID") or getpass("🔑 Enter your Arize Space ID: ")

os.environ["ARIZE_API_KEY"] = globals().get("ARIZE_API_KEY") or getpass("🔑 Enter your Arize API Key: ")

os.environ["OPENAI_API_KEY"] = globals().get("OPENAI_API_KEY") or getpass("🔑 Enter your OpenAI API Key: ")


# Configure Tracing

In [None]:
from arize.otel import register

model_id = "movie-recommendation-agent"
tracer_provider = register(
    space_id=os.getenv("ARIZE_SPACE_ID"),
    api_key=os.getenv("ARIZE_API_KEY"),
    project_name=model_id,
    set_global_tracer_provider=True 
)

from openinference.instrumentation.openai_agents import OpenAIAgentsInstrumentor
from openinference.instrumentation.openai import OpenAIInstrumentor

OpenAIAgentsInstrumentor().instrument(tracer_provider=tracer_provider)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)


# Build Movie Recommendation System

First, we need to define the tools that our recommendation system will use. For this example, we will define three tools:
1. Movie Selector: Based on the genre the user asks for, choose up to 5 recent movies available for streaming.
2. Reviewer: Find reviews for a movie. If given a list of movies, sort them from highest to lowest rating.
3. Preview Summarizer: For each movie, return a 1-2 sentence description.

In the ideal flow, the user simply tells the system what type of movie they are looking for and gets back a list of options with descriptions and reviews.

Let's test our agent and view the traces in Arize.

In [None]:
from agents import Agent, Runner, function_tool
from typing import List, Union
from openai import OpenAI
import ast
from opentelemetry import trace

client = OpenAI()

@function_tool
def movie_selector_llm(genre: str) -> List[str]:
    prompt = (
        f"List up to 5 recent popular streaming movies in the {genre} genre. "
        "Provide only movie titles as a Python list of strings."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=150,
    )
    content = response.choices[0].message.content
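    try:
        movie_list = ast.literal_eval(content)
        if isinstance(movie_list, list):
            return movie_list[:5]
    except Exception:
        # The model did not return a valid Python literal; use the fallback below.
        pass
    # Fallback: treat each non-empty line as a title so the tool never returns None.
    return [line.strip() for line in content.split("\n") if line.strip()][:5]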

@function_tool
def reviewer_llm(movies: Union[str, List[str]]) -> str:
    if isinstance(movies, list):
        movies_str = ", ".join(movies)
        prompt = f"Sort the following movies by rating from highest to lowest and provide a short review for each:\n{movies_str}"
    else:
        prompt = f"Provide a short review and rating for the movie: {movies}"
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=300,
    )
    return response.choices[0].message.content.strip()

@function_tool
def preview_summarizer_llm(movie: str) -> str:
    prompt = f"Write a 1-2 sentence summary describing the movie '{movie}'."
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=100,
    )
    return response.choices[0].message.content.strip()
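
The parsing step in `movie_selector_llm` matters because a chat model does not always return a clean Python list. A minimal sketch of a more forgiving parser follows; `parse_movie_list` is a hypothetical helper, not part of the agent above, and its handling of bullets and numbering is an assumption about what the model might return.

```python
import ast
import re
from typing import List

# Leading bullets ("- ", "* ") or numbering ("1. ", "2) ") on a line.
_LIST_PREFIX = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def parse_movie_list(content: str, limit: int = 5) -> List[str]:
    """Hypothetical helper: turn raw model text into at most `limit` titles."""
    try:
        parsed = ast.literal_eval(content.strip())
    except (ValueError, SyntaxError):
        parsed = None  # Not a Python literal; use the line-based fallback.

    if isinstance(parsed, list):
        titles = [str(item).strip() for item in parsed]
    else:
        # One title per line, with list markers and surrounding quotes removed.
        titles = [
            _LIST_PREFIX.sub("", line).strip().strip("\"'")
            for line in content.splitlines()
        ]
    return [title for title in titles if title][:limit]


print(parse_movie_list("['Barbie', 'Oppenheimer']"))
print(parse_movie_list("1. Barbie\n2. Oppenheimer"))
```

A helper along these lines could stand in for the `try`/`except` inside the tool so that it always returns a list of titles.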


In [None]:
agent = Agent(
    name="MovieRecommendationAgentLLM",
    tools=[movie_selector_llm, reviewer_llm, preview_summarizer_llm],
    instructions=(
        "You are a helpful movie recommendation assistant with access to three tools:\n"
        "1. MovieSelector: Given a genre, returns up to 5 recent streaming movies.\n"
        "2. Reviewer: Given one or more movie titles, returns reviews and sorts them by rating.\n"
        "3. PreviewSummarizer: Given a movie title, returns a 1-2 sentence summary.\n\n"
        "Your goal is to provide a helpful, user-friendly response combining relevant information."
    ),
)

import asyncio

async def main():
    user_input = "Which comedy movie should I watch?"
    result = await Runner.run(agent, user_input)
    print(result.final_output)

await main()

![Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/trace-level-evals-1.png)

Next, we’ll run the agent a few more times to generate additional traces. Feel free to adapt or customize the questions as you see fit.

In [None]:
questions = [
    "Which Batman movie should I watch?",
    "I want to watch a good romcom",
    "What is a very scary horror movie?",
    "Name a feel-good holiday movie",
    "Recommend a musical with great songs",
    "Give me a classic drama from the 90s"
]

for question in questions:
    # Each run emits one trace; the returned result is not needed here.
    await Runner.run(agent, question)

# Get Span Data from Arize 

Before running our evaluations, we first retrieve the span data from Arize. We then group the spans by trace and separate the input and output values.

In [None]:
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta, timezone

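# Use a distinct name so the OpenAI `client` used by the agent's tools is not overwritten.
export_client = ArizeExportClient(api_key=os.environ["ARIZE_API_KEY"])

primary_df = export_client.export_model_to_df(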
    space_id=os.environ["ARIZE_SPACE_ID"],
    model_id=model_id,
    environment=Environments.TRACING,
    start_time=datetime.now(timezone.utc) - timedelta(days=7),
    end_time=datetime.now(timezone.utc),
)

In [None]:
import pandas as pd

trace_df = (
    primary_df.groupby("context.trace_id")
      .agg({
          "attributes.input.value": "first",
          "attributes.output.value": lambda x: " ".join(x.dropna()),
      })
)

trace_df.head()
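
To make the aggregation above concrete, here is a small sketch that applies the same `groupby` to a toy span table. The trace IDs and values are made up for illustration; only the column names mirror the exported data.

```python
import pandas as pd

# Toy spans for two traces (hypothetical IDs and values).
spans = pd.DataFrame(
    {
        "context.trace_id": ["t1", "t1", "t1", "t2", "t2"],
        "attributes.input.value": [
            "Which comedy?", "comedy", None, "Scary horror?", "horror",
        ],
        "attributes.output.value": [
            "Try Barbie.", None, "['Barbie']", "Try Smile.", "['Smile']",
        ],
    }
)

toy_trace_df = spans.groupby("context.trace_id").agg(
    {
        # First non-null input in row order for each trace.
        "attributes.input.value": "first",
        # All non-null outputs of the trace, joined into one string.
        "attributes.output.value": lambda x: " ".join(x.dropna()),
    }
)

print(toy_trace_df)
```

Note that `"first"` depends on row order: if the exported spans are not ordered with the root span first, the input value may come from a child span, so sorting by start time before grouping may be worth considering.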

# Define and Run Evaluators

In this tutorial, we will evaluate two aspects: tool usage and relevance. You can add any additional evaluation templates you like. We will then run the evaluations using an LLM as the judge.

In [None]:
TOOL_CALLING_ORDER = """
You are evaluating the correctness of the tool calling order in an LLM application's trace.

You will be given:
1. The user input that initiated the trace
2. The full trace output, including the sequence of tool calls made by the agent 

##
User Input:
{attributes.input.value}

Trace Output:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
1. `correct` → 
- The tool calls occur in the appropriate order to fulfill the user's request logically and effectively. 
- A proper answer involves calls to reviews, summaries, and recommendations where relevant.
2. `incorrect` → The tool calls are out of order, missing, or do not follow a coherent sequence for the given input.
"""

In [None]:
RECOMMENDATION_RELEVANCE = """
You are evaluating the relevance of movie recommendations provided by an LLM application.

You will be given:
1. The user input that initiated the trace
2. The list of movie recommendations output by the system

##
User Input:
{attributes.input.value}

Recommendations:
{attributes.output.value}
##

Respond with exactly one word: `correct` or `incorrect`.
1. `correct` → 
- All recommended movies match the requested genre or criteria in the user input. 
- The recommendations should be relevant to the user's request and shouldn't be repetitive.
2. `incorrect` → One or more recommendations do not match the requested genre or criteria.
"""
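
The placeholders in both templates, `{attributes.input.value}` and `{attributes.output.value}`, are named after columns of `trace_df`, which is how each row's values end up in the prompt. The sketch below only illustrates that mapping; `fill_template` is a hypothetical helper and is not how Phoenix implements it. Plain `str.format` is avoided because it would read the dots in these names as attribute access.

```python
import pandas as pd

TOY_TEMPLATE = (
    "User Input:\n{attributes.input.value}\n\n"
    "Trace Output:\n{attributes.output.value}"
)


def fill_template(template: str, row: pd.Series) -> str:
    """Hypothetical helper: replace each {column} placeholder with the row's value."""
    filled = template
    for column, value in row.items():
        filled = filled.replace("{" + str(column) + "}", str(value))
    return filled


toy_row = pd.Series(
    {
        "attributes.input.value": "Which comedy?",
        "attributes.output.value": "Try Barbie.",
    }
)
print(fill_template(TOY_TEMPLATE, toy_row))
```

Rendering one or two rows this way is a cheap check that the judge will see the text you expect before spending tokens on the full run.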

In [None]:
import os

import nest_asyncio
from phoenix.evals import OpenAIModel, llm_classify

nest_asyncio.apply()

model = OpenAIModel(
    api_key = os.environ["OPENAI_API_KEY"],
    model   = "gpt-4o-mini",
    temperature = 0.0,
)

rails = ["correct", "incorrect"]

tool_eval_results = llm_classify(
    dataframe           = trace_df,
    template            = TOOL_CALLING_ORDER,
    model               = model,
    rails               = rails,
    provide_explanation = True,   
    verbose             = False,
)

tool_eval_results

In [None]:
relevance_eval_results = llm_classify(
    dataframe           = trace_df,
    template            = RECOMMENDATION_RELEVANCE,
    model               = model,
    rails               = rails,
    provide_explanation = True,   
    verbose             = False,
)

relevance_eval_results
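
Once the labels are in a DataFrame, a quick distribution is often the first thing worth looking at. The sketch below assumes a results frame with a `label` column like the ones above; `summarize_labels` and the toy data are hypothetical, and anything outside the rails (for instance, a missing value) is counted in an `unparsed` bucket.

```python
import pandas as pd

toy_results = pd.DataFrame(
    {"label": ["correct", "incorrect", "correct", None]},
    index=pd.Index(["t1", "t2", "t3", "t4"], name="context.trace_id"),
)


def summarize_labels(results: pd.DataFrame, rails: list) -> pd.Series:
    """Hypothetical helper: share of rows per rail, plus an 'unparsed' bucket."""
    # Keep labels that are valid rails; map everything else to "unparsed".
    labels = results["label"].where(results["label"].isin(rails), "unparsed")
    shares = labels.value_counts(normalize=True)
    return shares.reindex(rails + ["unparsed"], fill_value=0.0)


print(summarize_labels(toy_results, ["correct", "incorrect"]))
```

The same call could be applied to `tool_eval_results` and `relevance_eval_results` with the `rails` list defined earlier.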

# Log Results Back to Arize

The final step is to log our results back to Arize. After running the cell below, you’ll be able to view your trace-level evaluations on the platform, complete with relevant labels, scores, and explanations.
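
The logging cell below renames the evaluation columns to the form `trace_eval.<EvalName>.label` and `trace_eval.<EvalName>.explanation`, and attaches the `context.span_id` of each trace's root span. A small pre-flight check can catch naming mistakes before anything is sent. This is only a sketch: `check_eval_frame` is a hypothetical helper, and the accepted suffixes (`label`, `score`, `explanation`) are an assumption based on the columns used in this notebook rather than a statement of Arize's full schema.

```python
import re

import pandas as pd

# Assumed naming pattern, inferred from the columns used in this notebook.
EVAL_COLUMN = re.compile(r"^trace_eval\.[A-Za-z0-9_]+\.(label|score|explanation)$")


def check_eval_frame(df: pd.DataFrame) -> list:
    """Hypothetical pre-flight check; returns a list of human-readable problems."""
    problems = []
    if "context.span_id" not in df.columns:
        problems.append("missing context.span_id column")
    elif df["context.span_id"].isna().any():
        problems.append("some rows have no context.span_id")
    if not any(EVAL_COLUMN.match(str(column)) for column in df.columns):
        problems.append("no trace_eval.<name>.(label|score|explanation) columns found")
    return problems


toy_log_df = pd.DataFrame(
    {
        "context.span_id": ["s1", "s2"],
        "trace_eval.ToolEvaluation.label": ["correct", "incorrect"],
    }
)
print(check_eval_frame(toy_log_df))
```

An empty list means the frame passed these checks; running the helper on `log_df` just before logging would surface a failed root-span join as a missing `context.span_id`.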

In [None]:
from arize.pandas.logger import Client

tool_eval_results = tool_eval_results.rename(columns={
    "label": "ToolEvaluation.label",
    "explanation": "ToolEvaluation.explanation",
})[["ToolEvaluation.label", "ToolEvaluation.explanation"]]

relevance_eval_results = relevance_eval_results.rename(columns={
    "label": "RecommendationRelevance.label",
    "explanation": "RecommendationRelevance.explanation",
})[["RecommendationRelevance.label", "RecommendationRelevance.explanation"]]

combined_eval_results = tool_eval_results \
    .join(relevance_eval_results, how="outer")

merged_df = pd.merge(trace_df, combined_eval_results, left_index=True, right_index=True)
merged_df.rename(
    columns={
        "ToolEvaluation.label": "trace_eval.ToolEvaluation.label",
        "ToolEvaluation.explanation": "trace_eval.ToolEvaluation.explanation",
        "RecommendationRelevance.label": "trace_eval.RecommendationRelevance.label",
        "RecommendationRelevance.explanation": "trace_eval.RecommendationRelevance.explanation",
    },
    inplace=True,
)

# Attach each trace's root span ID (the span with no parent) so the evals are logged against it.
root_spans = primary_df[primary_df["parent_id"].isna()][["context.trace_id", "context.span_id"]]
log_df = merged_df.merge(root_spans, on="context.trace_id", how="left")


arize_client = Client(
    space_id = os.environ["ARIZE_SPACE_ID"],
    api_key  = os.environ["ARIZE_API_KEY"],
)
resp = arize_client.log_evaluations_sync(
    dataframe = log_df,
    model_id  = model_id,
)

![Trace Evals in Arize](https://storage.googleapis.com/arize-phoenix-assets/assets/images/trace-level-evals-2.png)