<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# Session Level Evals for an AI Tutor

This tutorial demonstrates how to run session-level evaluations on conversations with an AI tutor. You'll log the results back to Phoenix for further monitoring and analysis. Session-level evaluations are valuable because they provide a holistic view of the entire interaction, enabling you to assess broader patterns and answer high-level questions about user experience and system performance.

In this tutorial, you will:
- Trace and aggregate multi-turn interactions into structured sessions
- Evaluate sessions across multiple dimensions such as Correctness, Goal Completion, and Frustration
- Format the evaluation outputs to match the Phoenix schema and log them to the platform

By the end, you’ll have a robust evaluation pipeline for analyzing and comparing session-level performance.

✅ You’ll need a free [Phoenix Cloud account](https://app.arize.com/auth/phoenix/login) and an Anthropic API key to run this notebook.

# Set up Dependencies & Keys

In [None]:
%pip install openinference-instrumentation-anthropic openinference-instrumentation arize-phoenix arize-phoenix-otel nest_asyncio anthropic

In [None]:
import os
from getpass import getpass

import nest_asyncio

nest_asyncio.apply()

if not (phoenix_endpoint := os.getenv("PHOENIX_COLLECTOR_ENDPOINT")):
    phoenix_endpoint = getpass("🔑 Enter your Phoenix Collector Endpoint: ")
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = phoenix_endpoint


if not (phoenix_api_key := os.getenv("PHOENIX_API_KEY")):
    phoenix_api_key = getpass("🔑 Enter your Phoenix API key: ")
os.environ["PHOENIX_API_KEY"] = phoenix_api_key

if not (anthropic_api_key := os.getenv("ANTHROPIC_API_KEY")):
    anthropic_api_key = getpass("🔑 Enter your Anthropic API key: ")
os.environ["ANTHROPIC_API_KEY"] = anthropic_api_key

# Configure Tracing

In [None]:
from phoenix.otel import register

# configure the Phoenix tracer
tracer_provider = register(project_name="ai-tutor-session", auto_instrument=True)

# Build and Run AI Tutor

In this example, we demonstrate how to evaluate AI tutor sessions. The tutor begins by receiving a user ID, topic, and question. It then explains the topic to the student and engages them with follow-up questions in a multi-turn conversation, continuing until the student ends the session. Our goal is to assess the overall quality of this interaction from start to finish.

In [None]:
import uuid

import anthropic
from openinference.instrumentation import using_attributes

client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))


def run_session(user_id: str, topic: str, question: str):
    session_id = f"tutor-{uuid.uuid4()}"
    # Anthropic takes the system prompt as a top-level `system` argument,
    # not as a message with role "system".
    system_prompt = (
        f"You are a thoughtful AI tutor teaching {topic}. "
        "Ask questions, give hints, and only suggest full answers "
        "when the student shows correct reasoning."
    )
    chat = [{"role": "user", "content": question}]

    while True:
        # Tag every LLM span in this session with the same session and user IDs
        with using_attributes(session_id=session_id, user_id=user_id):
            resp = client.messages.create(
                model="claude-3-5-sonnet-20241022",
                system=system_prompt,
                messages=chat,
                max_tokens=1000,
                temperature=0.5,
            )
        assistant_msg = resp.content[0].text.strip()
        assistant_msg += "\n\n(You can type 'DONE' if you're finished.)"

        chat.append({"role": "assistant", "content": assistant_msg})
        print(f"Tutor: {assistant_msg}")

        student_input = input("> your answer: ")
        if student_input.strip().upper() == "DONE":
            print("✅ Student is DONE — ending session.")
            break

        chat.append({"role": "user", "content": student_input})
    return session_id

In [None]:
# Ask any question to the AI tutor!
run_session(user_id="Sanjana", topic="Science", question="Why is the sky blue?")

# Prepare Spans for Session-Level Evaluation

The following cells prepare the data for session-level evaluation. We start by loading all spans into a DataFrame, then sort them chronologically and group them by session ID. You can also group the spans by user ID.

Next, we separate user inputs from AI responses, and finally, store the structured results in a dataframe. We will use this dataframe to run our evaluations.

In [None]:
from phoenix.client import Client

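# Use a distinct name so the Anthropic `client` used by run_session is not overwritten
phoenix_client = Client()
primary_df = phoenix_client.spans.get_spans_dataframe(project_identifier="ai-tutor-session")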
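
Before working with the real spans, it may help to see the sort-and-group step on a toy frame. The sketch below is illustrative only: the column names match those used later in this notebook, but the rows are made-up data rather than actual Phoenix output.

```python
import pandas as pd

# Made-up spans, deliberately out of chronological order
spans = pd.DataFrame(
    {
        "attributes.session.id": ["s1", "s2", "s1"],
        "start_time": pd.to_datetime(
            ["2024-01-01 10:05", "2024-01-01 10:01", "2024-01-01 10:00"]
        ),
        "attributes.input.value": ["follow-up question", "other question", "first question"],
        "attributes.output.value": ["second reply", "other reply", "first reply"],
    }
)

# Sort chronologically first; groupby preserves row order within each group
ordered = spans.sort_values("start_time")

transcripts = {}
for session_id, group in ordered.groupby("attributes.session.id"):
    turns = []
    for _, row in group.iterrows():
        # Interleave each span's input and output to rebuild the conversation
        turns.append(row["attributes.input.value"])
        turns.append(row["attributes.output.value"])
    transcripts[session_id] = turns
```

Grouping on `attributes.user.id` instead would collect every session for a given user into one transcript, assuming that column is present in your spans.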

Here, we group the spans into a session dataframe. We also include logic to truncate session messages when they exceed character limits, which serve as a rough proxy for token limits. This prevents context window issues in longer sessions.

In [None]:
import pandas as pd


def truncate_text(text, max_chars, strategy="end"):
    """Truncate text to max_chars using the specified strategy."""
    if not text or len(text) <= max_chars:
        return text

    if strategy == "start":
        return "..." + text[-(max_chars - 3) :]
    elif strategy == "middle":
        half = (max_chars - 3) // 2
        return text[:half] + "..." + text[-half:]
    else:  # "end"
        return text[: max_chars - 3] + "..."


def estimate_session_size(messages):
    """Estimate total character count of session content."""
    return sum(len(msg) for msg in messages if isinstance(msg, str))


def prepare_sessions(
    df: pd.DataFrame,
    max_chars_per_value=10000,  # Limit for each individual message
    max_chars_per_session=700000,  # claude-3-7-sonnet-latest has a 200k-token context (~800k chars at ~4 chars/token); 700k leaves headroom
    truncation_strategy="end",
) -> pd.DataFrame:
    """
    Collapse spans into a single row per session with truncation support,
    preserving message order (user/output interleaved).
    """
    sessions = []

    # Sort and group
    grouped = df.sort_values("start_time").groupby("attributes.session.id", as_index=False)

    for session_id, group in grouped:
        # Collect all messages in order
        messages = []
        for _, row in group.iterrows():
            if pd.notna(row.get("attributes.input.value")):
                messages.append(
                    truncate_text(
                        row["attributes.input.value"], max_chars_per_value, truncation_strategy
                    )
                )
            if pd.notna(row.get("attributes.output.value")):
                messages.append(
                    truncate_text(
                        row["attributes.output.value"], max_chars_per_value, truncation_strategy
                    )
                )

        # Estimate total session size
        total_chars = estimate_session_size(messages)

        # Truncate session-level size if needed
        if total_chars > max_chars_per_session:
            print(f"Session {session_id} exceeds {max_chars_per_session} chars. Truncating...")

            # Keep roughly half of the messages, split between the start and end of the session
            keep = max(len(messages) // 2, 1)
            head = keep // 2
            tail = keep - head  # always >= 1, which avoids the messages[-0:] pitfall
            messages = messages[:head] + messages[-tail:]

            # Optional: truncate remaining messages again more aggressively
            total_chars = estimate_session_size(messages)
            if total_chars > max_chars_per_session:
                aggressive_limit = max_chars_per_value // 2
                messages = [
                    truncate_text(m, aggressive_limit, truncation_strategy) for m in messages
                ]

        sessions.append(
            {
                "session_id": session_id,
                "messages": messages,
                "trace_count": group["context.trace_id"].nunique(),
            }
        )

    return pd.DataFrame(sessions)


sessions_df = prepare_sessions(primary_df, truncation_strategy="middle")

In [None]:
sessions_df
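
To see how the three truncation strategies differ, here is a small sketch that applies each one to a short string. It repeats a compact copy of `truncate_text` from the cell above so that it runs on its own; the sample text and the 11-character limit are arbitrary choices for illustration.

```python
def truncate_text(text, max_chars, strategy="end"):
    """Compact copy of the notebook's helper, repeated so this cell is self-contained."""
    if not text or len(text) <= max_chars:
        return text
    if strategy == "start":
        # Drop the beginning, keep the tail
        return "..." + text[-(max_chars - 3):]
    if strategy == "middle":
        # Keep both ends, drop the middle
        half = (max_chars - 3) // 2
        return text[:half] + "..." + text[-half:]
    # "end": keep the beginning, drop the tail
    return text[: max_chars - 3] + "..."


sample = "abcdefghijklmnopqrstuvwxyz"
previews = {s: truncate_text(sample, 11, s) for s in ("start", "middle", "end")}
for strategy, preview in previews.items():
    print(f"{strategy:>6}: {preview}")
```

The `"middle"` strategy used above keeps both ends of each message, which tends to suit logged requests where the opening instructions and the latest turn both matter.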

# Session Correctness Eval

We are ready to begin running our evals. Let's start with an eval that checks whether the AI tutor gives the student factually accurate information:

In [None]:
SESSION_CORRECTNESS_PROMPT = """
You are an expert tutor assistant evaluating the **correctness and educational quality** of an AI tutor's session with a student.

A session consists of multiple traces (interactions) between a student and an AI tutor. Each message includes a role field:
1. If role is user, the message is from the student.
2. If role is assistant, the message is from the AI tutor.
You will be provided with the series of messages that took place, in the order they occurred.

An effective and correct tutoring session should:
- Provide factually and conceptually accurate explanations
- Correctly answer student questions
- Clarify misunderstandings if they occur
- Build upon previous context in a coherent way
- Avoid hallucinations, vague responses, or incorrect reasoning

##
Messages:
{messages}
##

Based on the above, evaluate the session **only for correctness and educational soundness**.

Respond with a single word: `correct` or `incorrect`.

- Respond with `correct` if the AI tutor consistently provides accurate, clear, and educationally sound answers.
- Respond with `incorrect` if the AI tutor gives factually wrong, misleading, or incoherent explanations at any point.
"""

In [None]:
import anthropic
import nest_asyncio

from phoenix.evals import AnthropicModel, llm_classify

nest_asyncio.apply()

# Configure your evaluation model using Claude 3.7 Sonnet
model = AnthropicModel(
    model="claude-3-7-sonnet-latest",
)

# Run the evaluation
rails = ["correct", "incorrect"]
eval_results_correctness = llm_classify(
    data=sessions_df,
    template=SESSION_CORRECTNESS_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

eval_results_correctness
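
The `rails` argument restricts each eval to a fixed set of labels. The sketch below illustrates the general idea with a hypothetical helper, `snap_to_rail`. It is a simplified stand-in, not Phoenix's implementation, and the `NOT_PARSABLE` fallback value is an assumption made for this example.

```python
import string


def snap_to_rail(raw_output: str, rails: list[str], fallback: str = "NOT_PARSABLE") -> str:
    """Normalize a raw model response and map it onto one of the allowed labels."""
    # Strip surrounding whitespace, backticks, and punctuation, then lowercase
    cleaned = raw_output.strip(string.whitespace + string.punctuation).lower()
    # Exact matching matters here: "incorrect" contains "correct" as a substring
    return cleaned if cleaned in rails else fallback


rails = ["correct", "incorrect"]
raw_responses = [" `Correct`.\n", "INCORRECT", "mostly right"]
snapped = [snap_to_rail(r, rails) for r in raw_responses]
print(snapped)
```

Whatever the exact mechanism, it is worth scanning the `label` column for values outside your rails before trusting aggregate numbers.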

# Session Frustration Eval

This evaluation checks whether the student showed signs of frustration with the tutor:

In [None]:
SESSION_FRUSTRATION_PROMPT = """
You are an AI assistant evaluating whether a student became frustrated during a tutoring session with an AI tutor.

A session consists of multiple traces (interactions) between a student and an AI tutor. Each message includes a role field:
1. If role is user, the message is from the student.
2. If role is assistant, the message is from the AI tutor.
You will be provided with the series of messages that took place, in the order they occurred.

Signs of student frustration may include:
- Repeating or rephrasing the same question multiple times
- Expressing confusion ("I don't get it", "This doesn't make sense", etc.)
- Disagreeing with the tutor's responses
- Asking for clarification frequently without resolution
- Expressing annoyance, impatience, or disengagement
- Abruptly ending the session

##
Messages:
{messages}
##


Based on the above, evaluate whether the student showed signs of frustration at any point in the session.

Respond with a single word: `frustrated` or `not_frustrated`.

- Respond with `frustrated` if there is evidence of confusion, dissatisfaction, or emotional frustration.
- Respond with `not_frustrated` if the student appears to stay engaged and satisfied throughout.
"""

In [None]:
# Run the evaluation
rails = ["frustrated", "not_frustrated"]
eval_results_frustration = llm_classify(
    data=sessions_df,
    template=SESSION_FRUSTRATION_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

eval_results_frustration

# Session Goal Achievement Eval

Finally, we evaluate whether the tutor helped the student reach their learning goals:

In [None]:
SESSION_GOAL_ACHIEVEMENT_PROMPT = """
You are an AI assistant evaluating whether the AI tutor successfully helped the student achieve their learning goals during a tutoring session.

A session consists of multiple traces (interactions) between a student and an AI tutor. Each message includes a role field:
1. If role is user, the message is from the student.
2. If role is assistant, the message is from the AI tutor.
You will be provided with the series of messages that took place, in the order they occurred.

To determine if the student’s goals were achieved, consider:
- Whether the AI tutor addressed the student’s questions and requests directly
- Whether the explanations provided resolved the student’s doubts or problems
- Whether the student’s inputs indicate understanding or closure by the end
- Whether the conversation logically progressed toward completing the student’s objectives

##
Messages:
{messages}
##


Evaluate the session and respond with a single word: `achieved` or `not_achieved`.

- Respond with `achieved` if the tutoring session successfully met the student’s learning goals and resolved their questions.
- Respond with `not_achieved` if the session left the student’s questions unanswered or goals unmet.
"""

In [None]:
# Run the evaluation
rails = ["achieved", "not_achieved"]
eval_results_goal_achievement = llm_classify(
    data=sessions_df,
    template=SESSION_GOAL_ACHIEVEMENT_PROMPT,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=False,
)

eval_results_goal_achievement
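
With all three evals complete, one way to look for broader patterns is to line the labels up side by side, one row per session. This is a minimal sketch on made-up labels; the toy frames stand in for the `label` columns of the three result dataframes, and the "healthy session" definition is an assumption for illustration.

```python
import pandas as pd

# Made-up labels standing in for the three eval result frames (one row per session)
correctness = pd.DataFrame({"label": ["correct", "incorrect", "correct"]})
frustration = pd.DataFrame({"label": ["not_frustrated", "frustrated", "not_frustrated"]})
goal = pd.DataFrame({"label": ["achieved", "not_achieved", "achieved"]})

# Align the three label columns on the shared session index
summary = pd.concat(
    {
        "correctness": correctness["label"],
        "frustration": frustration["label"],
        "goal_achievement": goal["label"],
    },
    axis=1,
)

# Hypothetical definition: a session is "healthy" only if it passes all three evals
healthy = (
    summary["correctness"].eq("correct")
    & summary["frustration"].eq("not_frustrated")
    & summary["goal_achievement"].eq("achieved")
)
print(summary.assign(healthy=healthy))
print(f"Healthy sessions: {healthy.sum()} of {len(healthy)}")
```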

# Log Evaluations Back to Phoenix

Finally, we can log the evaluation results back to Phoenix. Each result is attached to the first span of its session, and the results will appear for each session in the Sessions tab of your project.

In [None]:
from phoenix.client import Client

# --- Find the root span for each session ---
root_spans = primary_df.sort_values("start_time").drop_duplicates(
    subset=["attributes.session.id"], keep="first"
)[["attributes.session.id", "context.span_id"]]

# --- Merge Session Correctness eval results with session data ---
eval_results_correctness = eval_results_correctness[["label", "explanation"]]

eval_results_correctness = pd.merge(
    sessions_df, eval_results_correctness, left_index=True, right_index=True
)

correctness_final_df = pd.merge(
    eval_results_correctness,
    root_spans,
    left_on="session_id",
    right_on="attributes.session.id",
    how="left",
)
correctness_final_df = correctness_final_df.set_index("context.span_id", drop=False)

# --- Merge Frustration eval results with session data ---
eval_results_frustration = eval_results_frustration[["label", "explanation"]]

eval_results_frustration = pd.merge(
    sessions_df, eval_results_frustration, left_index=True, right_index=True
)

frustration_final_df = pd.merge(
    eval_results_frustration,
    root_spans,
    left_on="session_id",
    right_on="attributes.session.id",
    how="left",
)
frustration_final_df = frustration_final_df.set_index("context.span_id", drop=False)

# --- Merge Goal Achievement eval results with session data ---
eval_results_goal_achievement = eval_results_goal_achievement[["label", "explanation"]]

eval_results_goal_achievement = pd.merge(
    sessions_df, eval_results_goal_achievement, left_index=True, right_index=True
)

goal_final_df = pd.merge(
    eval_results_goal_achievement,
    root_spans,
    left_on="session_id",
    right_on="attributes.session.id",
    how="left",
)
goal_final_df = goal_final_df.set_index("context.span_id", drop=False)


from phoenix.client import AsyncClient

px_client = AsyncClient()
await px_client.annotations.log_span_annotations_dataframe(
    dataframe=correctness_final_df,
    annotation_name="Session Correctness",
    annotator_kind="LLM",
)
await px_client.annotations.log_span_annotations_dataframe(
    dataframe=frustration_final_df,
    annotation_name="Session Frustration",
    annotator_kind="LLM",
)
await px_client.annotations.log_span_annotations_dataframe(
    dataframe=goal_final_df,
    annotation_name="Session Goal Achievement",
    annotator_kind="LLM",
)
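
The three merge blocks above follow the same pattern, so they could be factored into a single helper. Here is a minimal sketch, assuming a hypothetical function name `attach_root_span` and toy data in place of `sessions_df`, the eval results, and `primary_df`:

```python
import pandas as pd


def attach_root_span(sessions: pd.DataFrame, eval_results: pd.DataFrame, spans: pd.DataFrame) -> pd.DataFrame:
    """Attach each session's eval label to the earliest span in that session."""
    # The earliest span per session serves as the anchor for the annotation
    root_spans = spans.sort_values("start_time").drop_duplicates(
        subset=["attributes.session.id"], keep="first"
    )[["attributes.session.id", "context.span_id"]]

    # Eval results share the row index of the sessions frame
    labeled = sessions.join(eval_results[["label", "explanation"]])

    merged = labeled.merge(
        root_spans, left_on="session_id", right_on="attributes.session.id", how="left"
    )
    return merged.set_index("context.span_id", drop=False)


# Toy data: session s1 has two spans, and "b" started first
toy_spans = pd.DataFrame(
    {
        "attributes.session.id": ["s1", "s1", "s2"],
        "context.span_id": ["a", "b", "c"],
        "start_time": [2, 1, 3],
    }
)
toy_sessions = pd.DataFrame({"session_id": ["s1", "s2"]})
toy_evals = pd.DataFrame(
    {"label": ["correct", "incorrect"], "explanation": ["accurate", "wrong fact"]}
)

result = attach_root_span(toy_sessions, toy_evals, toy_spans)
print(result[["session_id", "label"]])
```

The helper would be called once per eval, and each returned frame could then be passed to `log_span_annotations_dataframe` as in the cell above.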

![Session Eval Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/phoenix-session-level-evals.png)