<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Arize Phoenix Evals 2.0</h1>

Arize Phoenix is a fully open-source AI observability platform. It's designed for experimentation, evaluation, and troubleshooting.

In this notebook, you will learn how to use Evals 2.0 to:

1. Evaluate Phoenix project traces.
2. Improve your custom evaluators using experiments.
3. Iterate on your application and evals using a realistic example.


### Requirements

1. Kaggle API key
2. OpenAI API key
3. A Phoenix instance (cloud or local)


In [1]:
! uv pip install "arize-phoenix-evals>=2.0.0" "arize-phoenix-client>=1.19.0" kagglehub openinference-instrumentation-llama_index llama-index numpy pandas --quiet

# Dataset Preparation and Setup

The dataset has two components:

1. A knowledge base of 20 documents of various lengths and sources.
2. Four question-answer pairs per document:

   a. Two that cannot be answered from the document

   b. Two that require a single passage to answer

First, we need to do some data preparation.


In [2]:
# Download dataset
# Requires a Kaggle API key and username in your environment

import os

import kagglehub

path = kagglehub.dataset_download("samuelmatsuoharris/single-topic-rag-evaluation-dataset")

print("Path to dataset files:", path)
print(os.listdir(path))



Path to dataset files: /Users/elizabethhutton/.cache/kagglehub/datasets/samuelmatsuoharris/single-topic-rag-evaluation-dataset/versions/4
['multi_passage_answer_questions.csv', 'documents.csv', 'single_passage_answer_questions.csv', 'no_answer_questions.csv']


In [3]:
import pandas as pd


def prepare_data(path):
    single_passage_df = pd.read_csv(os.path.join(path, "single_passage_answer_questions.csv"))
    no_answer_df = pd.read_csv(os.path.join(path, "no_answer_questions.csv"))

    # Single-passage questions
    single_passage_processed = pd.DataFrame(
        {
            "document_index": single_passage_df["document_index"],
            "query": single_passage_df["question"],
            "answer": single_passage_df["answer"],
            "query_type": "single_passage",
        }
    )

    # No-answer questions
    no_answer_processed = pd.DataFrame(
        {
            "document_index": no_answer_df["document_index"],
            "query": no_answer_df["question"],
            "answer": "N/A",
            "query_type": "no_answer",
        }
    )

    # Combine all dataframes
    combined_df = pd.concat([single_passage_processed, no_answer_processed], ignore_index=True)

    return combined_df


query_df = prepare_data(path)
query_df.sample(5)

Unnamed: 0,document_index,query,answer,query_type
27,13,How can I freeze a variable during training in...,"To freeze an attribute during training, you ca...",single_passage
69,14,Who is Perry and Jackie's sibling?,,no_answer
20,10,What is the policy on Tai Chi?,In order to calm down the passions and stresse...,single_passage
70,15,What colour is Nan-E?,,no_answer
51,5,What is the advice for data scientists in fina...,,no_answer


### Split data into train/test

Split documents into a 60/40 train/test split.


In [63]:
import numpy as np

unique_docs = query_df["document_index"].unique()
print(f"Total unique documents: {len(unique_docs)}")

np.random.seed(42)
sample_size = int(len(unique_docs) * 0.6)
train_docs = np.random.choice(unique_docs, size=sample_size, replace=False)
print(f"Sampled {len(train_docs)} documents ({len(train_docs) / len(unique_docs) * 100:.1f}%)")

# Split queries based on sampled document indices
all_queries = query_df.copy()
train_queries = query_df[query_df["document_index"].isin(train_docs)]
test_queries = query_df[~query_df["document_index"].isin(train_docs)]
print(f"Train queries: {len(train_queries)}, Test queries: {len(test_queries)}")

Total unique documents: 20
Sampled 12 documents (60.0%)
Train queries: 48, Test queries: 32
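
The split above is made over documents rather than individual queries, so every query about a given document lands in the same split. The sketch below reproduces that idea with the standard library only; the toy `queries` list and the use of `random.Random` in place of NumPy are assumptions made for illustration, so the sampled documents will differ from the notebook's.

```python
import random

# Hypothetical toy data: 20 documents with 4 queries each, mirroring the dataset's shape.
queries = [
    {"document_index": doc, "query": f"question {i} about document {doc}"}
    for doc in range(20)
    for i in range(4)
]

rng = random.Random(42)  # fixed seed so the split is reproducible
doc_ids = sorted({q["document_index"] for q in queries})
train_doc_ids = set(rng.sample(doc_ids, k=int(len(doc_ids) * 0.6)))

# Assign each query to a split based on the document it belongs to.
train = [q for q in queries if q["document_index"] in train_doc_ids]
test = [q for q in queries if q["document_index"] not in train_doc_ids]

# No document contributes queries to both splits.
overlap = {q["document_index"] for q in train} & {q["document_index"] for q in test}
print(len(train), len(test), len(overlap))
```

With 12 of 20 documents in the train split and 4 queries per document, this yields the same 48/32 query counts as the output above.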


### Inspect the knowledge base documents


In [5]:
documents = pd.read_csv(os.path.join(path, "documents.csv"))
documents.head()

Unnamed: 0,index,source_url,text
0,0,https://enterthegungeon.fandom.com/wiki/Bullet...,Bullet Kin\nBullet Kin are one of the most com...
1,1,https://www.dropbox.com/scl/fi/ljtdg6eaucrbf1a...,---The Paths through the Underground/Underdark...
2,2,https://bytes-and-nibbles.web.app/bytes/stici-...,Semantic and Textual Inference Chatbot Interfa...
3,3,https://github.com/llmware-ai/llmware,llmware\n\nBuilding Enterprise RAG Pipelines w...
4,4,https://docs.marimo.io/recipes.html,Recipes\nThis page includes code snippets or “...


### Set Up Phoenix Tracing


In [7]:
# Set up Phoenix Tracing
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

from phoenix.otel import register

project_name = "rag-demo-v1"
tracer_provider = register(project_name=project_name, verbose=False)
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)



In [6]:
import os
from getpass import getpass

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

# Set Up a RAG App using Llama Index


In [65]:
import os

from llama_index.core import (
    Document,
    Settings,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

index_dir = "llamaindex_store"

# --- Set up the LLM and embedding model ---
# Configure these before loading or building the index so that a reloaded index
# is queried with the same embedding model that was used to build it.
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)  # generator
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")  # retriever

# --- Load the index from disk if it exists; otherwise ingest the documents ---
if os.path.exists(index_dir):
    storage_context = StorageContext.from_defaults(persist_dir=index_dir)
    index = load_index_from_storage(storage_context)
else:
    kb_docs = []
    for _, row in documents.iterrows():
        doc = Document(
            text=str(row["text"]),
            metadata={"source_url": row["source_url"], "document_index": row["index"]},
            id_=str(row["index"]),
        )
        kb_docs.append(doc)

    index = VectorStoreIndex.from_documents(kb_docs)

    # Persist to disk so the index can be reused later
    index.storage_context.persist(persist_dir=index_dir)

# Create the query engine
query_engine = index.as_query_engine()
query_engine.query("What is data science?")

Loading llama_index.core.storage.kvstore.simple_kvstore from llamaindex_store/docstore.json.
Loading llama_index.core.storage.kvstore.simple_kvstore from llamaindex_store/index_store.json.


Response(response='Data science is a field that involves using scientific methods, algorithms, and systems to extract insights and knowledge from structured and unstructured data. It combines various disciplines such as statistics, machine learning, data analysis, and domain expertise to analyze and interpret complex data.', source_nodes=[NodeWithScore(node=TextNode(id_='3a08e6e4-f1b5-40b5-bddd-1abbca7f7571', embedding=None, metadata={'source_url': 'https://towardsdatascience.com/gpt-from-scratch-with-mlx-acf2defda30e', 'document_index': 13}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='13', node_type='4', metadata={'source_url': 'https://towardsdatascience.com/gpt-from-scratch-with-mlx-acf2defda30e', 'document_index': 13}, hash='2187a4efee001b52656775153aab90cdc0d580feb56478d77d92d778b432b832'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='2eac43f3-4c62-4aa6-a764-878bf0e694e4', nod

Let's wrap our query engine so it's easier to run on our dataset.


In [None]:
from openinference.instrumentation import using_metadata


async def run_rag_with_metadata(example, rag_engine):
    """Ask a question of the knowledge base."""
    metadata = {
        "expected_answer": example["answer"],
        "query_type": example["query_type"],
        "expected_document_index": example["document_index"],
        "split": "test" if example["document_index"] not in train_docs else "train",
    }
    with using_metadata(metadata):
        rag_engine.query(example["query"])

### Run RAG on Train Set

We use the `AsyncExecutor` to run our RAG app on every query in the training set, with a progress bar and fail-fast error handling.


In [None]:
# Run application on the train set to get a baseline
from functools import partial

from phoenix.evals.executors import AsyncExecutor
from phoenix.evals.utils import get_tqdm_progress_bar_formatter

executor = AsyncExecutor(
    generation_fn=partial(run_rag_with_metadata, rag_engine=query_engine),
    concurrency=10,
    exit_on_error=True,
    tqdm_bar_format=get_tqdm_progress_bar_formatter("Run RAG"),
)

results, execution_details = await executor.execute(
    [row.to_dict() for _, row in train_queries.iterrows()],
)

Run RAG |██████████| 48/48 (100.0%) | ⏳ 01:21<00:00 |  1.71s/it


# Evaluate the Traces

Let's define a few evaluators for our RAG app and run them on our traces.

Steps:

1. Export traces from Phoenix
2. Define evaluators
3. Run evaluators on the trace data
4. Log the evaluation results back up to Phoenix


In [None]:
import json

from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

# Export all the top level spans
query = SpanQuery().where("name == 'RetrieverQueryEngine.query'")
spans_df = Client().spans.get_spans_dataframe(query=query, project_identifier=project_name)
spans_df.dropna(subset=["attributes.metadata"], inplace=True)

# Shape the spans dataframe
spans_df["query"] = spans_df["attributes.input.value"]
spans_df["response"] = spans_df["attributes.output.value"]
spans_df["split"] = spans_df["attributes.metadata"].apply(lambda x: x["split"])
spans_df["expected_document_index"] = spans_df["attributes.metadata"].apply(
    lambda x: x["expected_document_index"]
)
spans_df["expected_answer"] = spans_df["attributes.metadata"].apply(lambda x: x["expected_answer"])

# Export all the retrieval spans to get the retrieved documents
query = SpanQuery().where("name == 'VectorIndexRetriever.retrieve'")
retrieval_spans_df = Client().spans.get_spans_dataframe(
    query=query, project_identifier=project_name
)
# Process the spans and combine with the retrieval spans
retrieval_spans_df["document_content"] = retrieval_spans_df["attributes.retrieval.documents"].apply(
    lambda x: "\n----------------------------------\n".join([doc["document.content"] for doc in x])
)
retrieval_spans_df["retrieved_documents"] = retrieval_spans_df[
    "attributes.retrieval.documents"
].apply(lambda x: [doc["document.metadata"]["document_index"] for doc in x])

# Combine
spans_df = spans_df.merge(
    retrieval_spans_df[["context.trace_id", "document_content", "retrieved_documents"]],
    on="context.trace_id",
    how="left",
)

print(spans_df.shape)
spans_df.head()

(89, 21)


Unnamed: 0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,context.span_id,context.trace_id,...,attributes.openinference.span.kind,attributes.input.value,attributes.metadata,query,response,split,expected_document_index,expected_answer,document_content,retrieved_documents
0,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:18.772057+00:00,2025-09-18 05:39:21.272134+00:00,OK,,[],f14258f30a6b4964,292c99661771aa0144781e838aa6aa07,...,CHAIN,What is the name of the space station?,"{'split': 'train', 'query_type': 'single_passa...",What is the name of the space station?,The name of the space station is not explicitl...,train,15,"The space station is called ""Babystation Beta"".",ERIC: But... you. Oh. We've been waiting for a...,"[15, 15, 15, 13]"
1,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:17.015420+00:00,2025-09-18 05:39:18.710415+00:00,OK,,[],094db9bbb4125f03,69efc197eaefcabe65e8cb9660300225,...,CHAIN,How can I freeze a variable during training in...,"{'split': 'train', 'query_type': 'single_passa...",How can I freeze a variable during training in...,You can freeze a variable during training in M...,train,13,"To freeze an attribute during training, you ca...",2023. Retrieval-generation\nsynergy augmented ...,"[19, 19, 19, 19]"
2,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:15.388146+00:00,2025-09-18 05:39:16.952564+00:00,OK,,[],cc6789e5f644862b,47b0a0c761e6e53b35bd0008198f3daf,...,CHAIN,Why are the top-right values in the raw attent...,"{'split': 'train', 'query_type': 'single_passa...",Why are the top-right values in the raw attent...,The top-right values in the raw attention weig...,train,13,A mask of negative infinity is applied to the ...,"In the naive implementation, we had separate a...","[13, 19, 13, 19]"
3,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:13.477125+00:00,2025-09-18 05:39:15.310057+00:00,OK,,[],2c964580d78e3f3c,a94aba2ff97ebcc01ea77771add8693b,...,CHAIN,For what work did I receive criticism for my r...,"{'split': 'train', 'query_type': 'single_passa...",For what work did I receive criticism for my r...,You received criticism for your reduction of F...,train,11,You received criticism for your research on sp...,Then we perform row-wise softmax to get the fi...,"[13, 13, 13, 13]"
4,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:11.572926+00:00,2025-09-18 05:39:13.418715+00:00,OK,,[],e5f75f7a401ea570,8fad7b22cd8e9fcf20502ad0a1d62eea,...,CHAIN,How much faster is the Tesla A100 compared to ...,"{'split': 'train', 'query_type': 'single_passa...",How much faster is the Tesla A100 compared to ...,"In Gleam, both values must be floats.",train,11,When measured on the SE-ResNeXt101 deep learni...,"Python\nPython tuples are immutable, fixed-siz...","[12, 12, 16, 19]"


## Define Evaluators


### 1. Refusal

Since we know some questions in our dataset are unanswerable, we expect the LLM to respond accordingly. So let's create an LLM judge to check for refusals in the output.


In [69]:
from phoenix.evals import create_classifier
from phoenix.evals.llm import LLM

prompt = """
Given a question and response, determine if the response is a refusal to answer the question.

Question: {query}
Response: {response}

Is the response a refusal or an informative answer to the question?
"""

llm = LLM(model="gpt-4o", provider="openai")
refusal_evaluator = create_classifier(
    llm=llm,
    name="llm_refusal",
    prompt_template=prompt,
    choices={"refusal": 0, "answer": 1},
)
refusal_evaluator.evaluate(spans_df.iloc[0].to_dict())

[Score(name='llm_refusal', score=1, label='answer', explanation='The response indicates that the information needed to answer the question is not available, rather than directly refusing to provide an answer. A refusal would be a deliberate choice not to provide known information, whereas this response implies the information is unavailable.', metadata={'model': 'gpt-4o'}, source='llm', direction='maximize')]
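
To make the two inputs to the classifier concrete, here is a minimal sketch of how a prompt template with `{query}` and `{response}` placeholders can be filled from a record, and how a `choices` dictionary turns the judge's label into a numeric score. This is an illustrative stand-in, not the internals of `create_classifier`; the `record` pairing and the `label_to_score` helper are hypothetical.

```python
# Illustrative stand-in only: this mimics the idea of a template plus a
# label-to-score mapping, not how phoenix.evals implements it.
template = (
    "Question: {query}\n"
    "Response: {response}\n\n"
    "Is the response a refusal or an informative answer to the question?"
)
choices = {"refusal": 0, "answer": 1}

# Hypothetical record; extra keys such as "split" are simply ignored by str.format.
record = {
    "query": "What colour is Nan-E?",
    "response": "I cannot find that information.",
    "split": "train",
}

filled_prompt = template.format(**record)


def label_to_score(label: str) -> int:
    """Map a judge label to its numeric score, rejecting unknown labels."""
    if label not in choices:
        raise ValueError(f"unexpected label: {label!r}")
    return choices[label]


print(filled_prompt)
print(label_to_score("refusal"), label_to_score("answer"))
```

Note the direction of the scores: a refusal maps to 0 and an answer maps to 1.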

### 2. Hallucination

Let's also check to see if our RAG pipeline is producing hallucinations. Phoenix evals has a built-in `HallucinationEvaluator` so we'll use that. First, let's inspect the `input_schema` so we know what it needs to run.


In [70]:
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator

llm = LLM(model="gpt-4o", provider="openai")
hallucination_evaluator = HallucinationEvaluator(llm=llm)
hallucination_evaluator.describe()

{'name': 'hallucination',
 'source': 'llm',
 'direction': 'maximize',
 'input_schema': {'properties': {'input': {'description': 'The input query.',
    'title': 'Input',
    'type': 'string'},
   'output': {'description': 'The response to the query.',
    'title': 'Output',
    'type': 'string'},
   'context': {'description': 'The context or reference text.',
    'title': 'Context',
    'type': 'string'}},
  'required': ['input', 'output', 'context'],
  'title': 'HallucinationInputSchema',
  'type': 'object'}}

Okay, we need to provide an `input_mapping` so it works on our data. Let's bind it to the evaluator so we can reuse it.


In [71]:
hallucination_mapping = {
    "input": "query",
    "output": "response",
    "context": "document_content",
}
hallucination_evaluator.bind(hallucination_mapping)
hallucination_evaluator.evaluate(spans_df.iloc[0].to_dict())

[Score(name='hallucination', score=1.0, label='factual', explanation='The context does not provide the name of the space station, and the response acknowledges this lack of information.', metadata={'model': 'gpt-4o'}, source='llm', direction='maximize')]
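
Conceptually, an input mapping translates the field names an evaluator requires into the keys present in your data. The sketch below shows that idea in plain Python, assuming each mapping value is either a key name or a callable that computes the value from the record. The `apply_input_mapping` helper and the `row` values are hypothetical and are not part of the Phoenix API.

```python
from typing import Any, Callable, Mapping, Union

# A mapping value is either a record key or a function of the whole record.
Extractor = Union[str, Callable[[Mapping[str, Any]], Any]]


def apply_input_mapping(
    record: Mapping[str, Any], mapping: Mapping[str, Extractor]
) -> dict:
    """Build evaluator inputs by reading or computing each required field."""
    inputs = {}
    for field, source in mapping.items():
        inputs[field] = source(record) if callable(source) else record[source]
    return inputs


# Hypothetical row with the same column names as our spans dataframe.
row = {
    "query": "What is the name of the space station?",
    "response": "The name is not stated.",
    "document_content": "ERIC: But... you.",
    "expected_document_index": 15,
}

hallucination_inputs = apply_input_mapping(
    row, {"input": "query", "output": "response", "context": "document_content"}
)
relevant_inputs = apply_input_mapping(
    row, {"relevant_documents": lambda r: [r["expected_document_index"]]}
)
print(hallucination_inputs)
print(relevant_inputs)
```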

### 3. Retrieval Precision

We also want to measure how well the information retrieval component of our system is working. Let's add a precision metric: the fraction of retrieved chunks that come from the target document.


In [72]:
from phoenix.evals import bind_evaluator, create_evaluator


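@create_evaluator(name="precision")
def precision(retrieved_documents: list[int], relevant_documents: list[int]) -> float:
    """Fraction of retrieved chunks that come from a relevant document."""
    if not retrieved_documents:
        return 0.0  # nothing retrieved: avoid dividing by zero
    relevant_set = set(relevant_documents)
    hits = sum(1 for doc in retrieved_documents if doc in relevant_set)
    return hits / len(retrieved_documents)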


precision_mapping = {
    "relevant_documents": lambda x: [x["expected_document_index"]],
}

precision_evaluator = bind_evaluator(precision, precision_mapping)
precision_evaluator.evaluate(spans_df.iloc[0].to_dict())

[Score(name='precision', score=0.75, label=None, explanation=None, metadata={}, source='heuristic', direction='maximize')]
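
As a worked check of the score above, the same arithmetic can be reproduced without the decorator. The first row retrieved chunks from documents `[15, 15, 15, 13]` and its target document is 15, so three of four retrieved chunks are hits. The `precision_at_k` name is hypothetical, and returning 0.0 for an empty retrieval is an assumption of this sketch.

```python
def precision_at_k(retrieved: list[int], relevant: list[int]) -> float:
    """Fraction of retrieved chunks that come from a relevant document."""
    if not retrieved:
        return 0.0  # assumption: treat "nothing retrieved" as zero precision
    relevant_set = set(relevant)
    # Repeated document indices count once per retrieved chunk.
    return sum(doc in relevant_set for doc in retrieved) / len(retrieved)


first_row = precision_at_k([15, 15, 15, 13], [15])   # 3 hits out of 4
second_row = precision_at_k([19, 19, 19, 19], [13])  # 0 hits out of 4
print(first_row, second_row)
```

Because the retriever returns chunks, several retrieved items can share one document index, and each one counts toward the score.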

In [73]:
from phoenix.evals.evaluators import async_evaluate_dataframe

train_df = spans_df[spans_df["split"] == "train"].reset_index(drop=True)
results = await async_evaluate_dataframe(
    train_df,
    [precision_evaluator, hallucination_evaluator, refusal_evaluator],
    concurrency=10,
    tqdm_bar_format=get_tqdm_progress_bar_formatter("Run Evaluation"),
    exit_on_error=True,
)
results.head()

Run Evaluation |██████████| 267/267 (100.0%) | ⏳ 00:42<00:00 |  6.25it/s


Unnamed: 0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,context.span_id,context.trace_id,...,expected_document_index,expected_answer,document_content,retrieved_documents,precision_execution_details,hallucination_execution_details,llm_refusal_execution_details,precision_score,hallucination_score,llm_refusal_score
0,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:18.772057+00:00,2025-09-18 05:39:21.272134+00:00,OK,,[],f14258f30a6b4964,292c99661771aa0144781e838aa6aa07,...,15,"The space station is called ""Babystation Beta"".",ERIC: But... you. Oh. We've been waiting for a...,"[15, 15, 15, 13]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.75, ""metadata...","{""name"": ""hallucination"", ""score"": 1.0, ""label...","{""name"": ""llm_refusal"", ""score"": 0, ""label"": ""..."
1,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:17.015420+00:00,2025-09-18 05:39:18.710415+00:00,OK,,[],094db9bbb4125f03,69efc197eaefcabe65e8cb9660300225,...,13,"To freeze an attribute during training, you ca...",2023. Retrieval-generation\nsynergy augmented ...,"[19, 19, 19, 19]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.0, ""metadata""...","{""name"": ""hallucination"", ""score"": 0.0, ""label...","{""name"": ""llm_refusal"", ""score"": 1, ""label"": ""..."
2,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:15.388146+00:00,2025-09-18 05:39:16.952564+00:00,OK,,[],cc6789e5f644862b,47b0a0c761e6e53b35bd0008198f3daf,...,13,A mask of negative infinity is applied to the ...,"In the naive implementation, we had separate a...","[13, 19, 13, 19]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.5, ""metadata""...","{""name"": ""hallucination"", ""score"": 1.0, ""label...","{""name"": ""llm_refusal"", ""score"": 1, ""label"": ""..."
3,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:13.477125+00:00,2025-09-18 05:39:15.310057+00:00,OK,,[],2c964580d78e3f3c,a94aba2ff97ebcc01ea77771add8693b,...,11,You received criticism for your research on sp...,Then we perform row-wise softmax to get the fi...,"[13, 13, 13, 13]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.0, ""metadata""...","{""name"": ""hallucination"", ""score"": 0.0, ""label...","{""name"": ""llm_refusal"", ""score"": 1, ""label"": ""..."
4,RetrieverQueryEngine.query,CHAIN,,2025-09-18 05:39:11.572926+00:00,2025-09-18 05:39:13.418715+00:00,OK,,[],e5f75f7a401ea570,8fad7b22cd8e9fcf20502ad0a1d62eea,...,11,When measured on the SE-ResNeXt101 deep learni...,"Python\nPython tuples are immutable, fixed-siz...","[12, 12, 16, 19]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.0, ""metadata""...","{""name"": ""hallucination"", ""score"": 0.0, ""label...","{""name"": ""llm_refusal"", ""score"": 0, ""label"": ""..."


In [74]:
from phoenix.client import AsyncClient


def prepare_eval_df_for_logging(results_df, score_column):
    """Helper function to prepare evaluation results for logging to Phoenix."""
    eval_df = results_df[["context.span_id", score_column]].copy()

    # Extract score components
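    # Scores are stored as JSON strings; rows without a score become None
    eval_df[score_column] = eval_df[score_column].apply(
        lambda x: json.loads(x) if isinstance(x, str) else None
    )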
    eval_df["score"] = eval_df[score_column].apply(lambda x: x.get("score", None) if x else None)
    eval_df["label"] = eval_df[score_column].apply(lambda x: x.get("label", None) if x else None)
    eval_df["explanation"] = eval_df[score_column].apply(
        lambda x: x.get("explanation", None) if x else None
    )
    eval_df["metadata"] = eval_df[score_column].apply(
        lambda x: x.get("metadata", None) if x else None
    )

    # Clean up
    eval_df = eval_df.rename(columns={"context.span_id": "span_id"})
    eval_df = eval_df.drop(score_column, axis=1)
    eval_df = eval_df.dropna(subset=["span_id"])
    eval_df = eval_df.reset_index(drop=True)

    return eval_df


async def log_eval_annotations(results_df, score_name, annotation_name, annotator_kind="LLM"):
    eval_df = prepare_eval_df_for_logging(results_df, score_name)
    await AsyncClient().spans.log_span_annotations_dataframe(
        dataframe=eval_df,
        annotation_name=annotation_name,
        annotator_kind=annotator_kind,
    )
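
The helper above relies on each `*_score` cell holding a JSON string. A minimal sketch of that parsing step for a single cell, assuming a value shaped like the truncated strings shown in the results table (the `raw_score` string and the `score_to_annotation` helper are hypothetical):

```python
import json

# Hypothetical cell value, shaped like the `precision_score` strings shown above.
raw_score = '{"name": "precision", "score": 0.75, "metadata": {}}'


def score_to_annotation(span_id, raw):
    """Flatten one JSON score string into the fields logged as an annotation."""
    parsed = json.loads(raw) if isinstance(raw, str) else {}
    return {
        "span_id": span_id,
        "score": parsed.get("score"),
        "label": parsed.get("label"),  # heuristic scores may have no label
        "explanation": parsed.get("explanation"),
        "metadata": parsed.get("metadata"),
    }


annotation = score_to_annotation("f14258f30a6b4964", raw_score)
print(annotation)
```

Missing fields come back as `None`, which matches how the precision score above has no label or explanation.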

In [None]:
await log_eval_annotations(results, "precision_score", "precision", annotator_kind="CODE")
await log_eval_annotations(results, "hallucination_score", "hallucination")
await log_eval_annotations(results, "llm_refusal_score", "llm_refusal")

# Improve Evaluators

Steps:

1. Create a dataset of our annotations for experimentation.
2. Define our LLM judge (refusal) and use it as the experiment "task".
3. Create a simple heuristic experiment evaluator to compare the judge's output with the ground truth.
4. Iterate on the judge prompt until we are happy with the results.


In [26]:
from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

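# This cell expects human "refusal" annotations on the spans in this project.
# If you annotated the traces collected earlier in this notebook, use that
# project name ("rag-demo-v1") instead.
project_name = "rag-demo-baseline"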
# Export all the top level spans
query = SpanQuery().where("name == 'RetrieverQueryEngine.query'")
spans_df = Client().spans.get_spans_dataframe(query=query, project_identifier=project_name)

# Shape the spans dataframe
spans_df["query"] = spans_df["attributes.input.value"]
spans_df["response"] = spans_df["attributes.output.value"]
spans_df.dropna(subset=["attributes.metadata"], inplace=True)
spans_df["split"] = spans_df["attributes.metadata"].apply(lambda x: x["split"])
spans_df["expected_document_index"] = spans_df["attributes.metadata"].apply(
    lambda x: x["expected_document_index"]
)
spans_df["expected_answer"] = spans_df["attributes.metadata"].apply(lambda x: x["expected_answer"])

# Export annotations and add to the spans from earlier
annotations_df = Client().spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df, project_identifier=project_name
)
refusal_ground_truth = annotations_df[
    (annotations_df["annotator_kind"] == "HUMAN") & (annotations_df["annotation_name"] == "refusal")
]
refusal_ground_truth = refusal_ground_truth.rename_axis(index={"span_id": "context.span_id"})
refusal_ground_truth = refusal_ground_truth.rename(columns={"result.score": "refusal_score"})
labeled_spans_df = spans_df.merge(
    refusal_ground_truth[["refusal_score"]], left_index=True, right_index=True, how="left"
)
labeled_spans = labeled_spans_df.dropna(subset=["refusal_score"])
print(labeled_spans["refusal_score"].value_counts())

refusal_score
1.0    36
0.0     9
Name: count, dtype: int64


In [None]:
ds = await AsyncClient().datasets.create_dataset(
    name="annotated-queries-train",
    dataframe=labeled_spans,
    input_keys=["query", "response"],
    output_keys=["refusal_score", "expected_answer"],
)

# get the dataset if already created
# ds = await AsyncClient().datasets.get_dataset(dataset="annotated-queries-train")

In [81]:
from phoenix.evals import create_evaluator


# experiment task
def run_refusal_eval(input, evaluator):
    result = evaluator.evaluate(input)
    return result[0].score


# experiment evaluator
@create_evaluator(name="exact_match")
def exact_match(output, expected):
    return float(output) == float(expected["refusal_score"])
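
Each `exact_match` result is a per-example boolean, so the aggregate across a dataset is simply the agreement rate (accuracy) between the judge and the human labels. A small sketch of that aggregation, using made-up scores rather than results from this notebook (`agreement_rate`, `judge`, and `human` are hypothetical names):

```python
def agreement_rate(judge_scores, human_scores):
    """Share of examples where the judge's score equals the human label."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("need at least one example")
    matches = [float(j) == float(h) for j, h in zip(judge_scores, human_scores)]
    return sum(matches) / len(matches)


# Made-up scores for illustration: 1 = answer, 0 = refusal.
judge = [1, 1, 0, 1, 0]
human = [1.0, 0.0, 0.0, 1.0, 1.0]
rate = agreement_rate(judge, human)  # 3 of 5 examples agree
print(rate)
```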

First, let's test our initial refusal evaluator against our annotations.


In [76]:
from phoenix.evals import create_classifier
from phoenix.evals.llm import LLM

prompt = """
Given a question and response, determine if the response is a refusal to answer the question.

Question: {query}
Response: {response}

Is the response a refusal or an informative response to the question?
"""

llm = LLM(model="gpt-4o", provider="openai")
baseline_refusal = create_classifier(
    llm=llm,
    name="refusal",
    prompt_template=prompt,
    choices={"refusal": 0, "answer": 1},
)

In [None]:
from functools import partial

from phoenix.client import AsyncClient

async_client = AsyncClient()

experiment = await async_client.experiments.run_experiment(
    dataset=ds,
    task=partial(run_refusal_eval, evaluator=baseline_refusal),
    experiment_name="baseline",
    evaluators=[exact_match],
    concurrency=10,
    # dry_run=3,
)

🧪 Experiment started.
🌵️ This is a dry-run for these example IDs:
RGF0YXNldEV4YW1wbGU6MjEzNjY=
RGF0YXNldEV4YW1wbGU6MjEzMzM=
RGF0YXNldEV4YW1wbGU6MjEzMjc=


running tasks |██████████| 3/3 (100.0%) | ⏳ 00:04<00:00 |  1.52s/it


✅ Task runs completed.
🧠 Evaluation started.
🌵️ This is a dry-run evaluation.


running experiment evaluations |██████████| 3/3 (100.0%) | ⏳ 00:01<00:00 |  2.92it/s

Experiment completed: 3 task runs, 1 evaluator runs, 3 evaluations





Now let's tweak our prompt to make the evaluation criteria clearer to the LLM judge by describing exactly what a "refusal" looks like.


In [None]:
from phoenix.evals import create_classifier
from phoenix.evals.llm import LLM

prompt = """
Given a question and response, determine if the response is a refusal to answer the question.
Refusals often contain phrases of uncertainty like 'I don't know' and 'I don't have that information'.
They also often mention that the answer is not provided in the information or context.

If the response contains these phrases, it is a refusal. Even if the response contains other
text indicating an attempt to answer the question, it is still a refusal.

If the response does not contain these "hedging" phrases, it is an informative response. Do not
consider the correctness of the response, only whether it is a refusal or not.

Question: {query}
Response: {response}

Is the response a refusal or an informative answer to the question?
"""

refusal_v2 = create_classifier(
    llm=llm,
    name="refusal",
    prompt_template=prompt,
    choices={"refusal": 0, "answer": 1},
)

In [None]:
experiment = await async_client.experiments.run_experiment(
    dataset=ds,
    task=partial(run_refusal_eval, evaluator=refusal_v2),
    experiment_name="prompt-v2",
    evaluators=[exact_match],
    concurrency=10,
    # dry_run=3,
)

🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoxOA==/experiments
🔗 View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoxOA==/compare?experimentId=RXhwZXJpbWVudDo2Mw==


running tasks |██████████| 45/45 (100.0%) | ⏳ 00:55<00:00 |  1.24s/it


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |██████████| 45/45 (100.0%) | ⏳ 00:04<00:00 | 10.32it/s

Experiment completed: 45 task runs, 1 evaluator runs, 45 evaluations





Let's go look at our experiment in Phoenix.

Our v2 refusal evaluator is greatly improved over the baseline, but it still produces a few false negatives: responses that are not refusals yet get labeled as refusals. (With `answer` scored as 1, these are positives that the judge missed.)
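
With `answer` scored 1 and `refusal` scored 0, a false negative is an answer that the judge labels as a refusal. Accuracy alone hides which direction the errors go, so it can help to break disagreements into confusion counts. The sketch below does that with made-up scores; `confusion_counts` and the sample lists are hypothetical.

```python
from collections import Counter


def confusion_counts(judge_scores, human_scores):
    """Count outcomes, treating `answer` (score 1) as the positive class."""
    counts = Counter()
    for judged, truth in zip(judge_scores, human_scores):
        if truth == 1:
            counts["true_positive" if judged == 1 else "false_negative"] += 1
        else:
            counts["false_positive" if judged == 1 else "true_negative"] += 1
    return counts


# Made-up scores for illustration only.
judge = [1, 0, 0, 1, 0, 1]
human = [1, 1, 0, 1, 0, 0]
counts = confusion_counts(judge, human)
print(dict(counts))
```

A judge that is too eager to call something a refusal shows up as a high `false_negative` count under this encoding.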

Let's tweak the prompt once more to clarify that point.


In [None]:
prompt = """
Given a question and response, determine if the response is a refusal to answer the question.
Refusals often contain phrases of uncertainty like 'I don't know' and 'I can't find that
information'. They also often mention that the answer is not provided in the information or context.

If the response contains these phrases, it is a refusal.

If the response does NOT contain these "hedging" phrases, it is an answer. Do not
consider the correctness or completeness of the response, only whether it contains the
phrases indicating a refusal.

Question: {query}
Response: {response}

Is the response a refusal or an answer to the question?
"""

llm = LLM(model="gpt-4o", provider="openai")
improved_refusal_v3 = create_classifier(
    llm=llm,
    name="refusal",
    prompt_template=prompt,
    choices={"refusal": 0, "answer": 1},
)
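
Because the v3 prompt defines a refusal by the presence of specific phrases, a plain keyword check makes a cheap sanity baseline to compare the LLM judge against. This is a hypothetical sketch, not something the notebook runs; the phrase list is an assumption drawn from the prompt wording and is far from exhaustive.

```python
# Assumed phrase list, taken from the wording of the prompts in this notebook.
REFUSAL_PHRASES = (
    "i don't know",
    "i can't find that information",
    "i cannot find that information",
    "i don't have that information",
)


def keyword_refusal_score(response: str) -> int:
    """Return 0 (refusal) if a known refusal phrase appears, else 1 (answer)."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return 0 if any(phrase in text for phrase in REFUSAL_PHRASES) else 1


print(keyword_refusal_score("I cannot find that information."))
print(keyword_refusal_score("In Gleam, both values must be floats."))
```

A keyword check like this misses paraphrased refusals, which is one reason to prefer the LLM judge for the actual evaluation.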

In [38]:
experiment = await async_client.experiments.run_experiment(
    dataset=ds,
    task=partial(run_refusal_eval, evaluator=improved_refusal_v3),
    experiment_name="prompt-v3",
    evaluators=[exact_match],
    concurrency=10,
    # dry_run=3,
)

🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoxOA==/experiments
🔗 View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoxOA==/compare?experimentId=RXhwZXJpbWVudDo2NA==


running tasks |██████████| 45/45 (100.0%) | ⏳ 00:57<00:00 |  1.28s/it


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |██████████| 45/45 (100.0%) | ⏳ 00:04<00:00 |  9.60it/s

Experiment completed: 45 task runs, 1 evaluator runs, 45 evaluations





# Improve the Application

1. Create a dataset using the test set queries
2. Define our experiment task (running RAG on our dataset)
3. Use our new and improved refusal classifier as the experiment evaluator


In [None]:
ds = await AsyncClient().datasets.create_dataset(
    name="test-queries",
    dataframe=test_queries,
    input_keys=["query"],
)

In [86]:
from phoenix.evals import bind_evaluator


# define experiment task (running the RAG)
async def run_rag_task(input, rag_engine):
    """Ask a question of the knowledge base."""
    response = rag_engine.query(input["query"])
    return response


# update the evaluator to fit our dataset
refusal_evaluator = bind_evaluator(
    improved_refusal_v3, {"query": "input.query", "response": "output"}
)
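
The mapping above uses a dotted path (`"input.query"`) because, in an experiment, the evaluator sees a nested payload rather than a flat row. A minimal sketch of how such a path can be resolved against nested dictionaries; the `resolve_path` helper and the `payload` contents are hypothetical and only illustrate the idea.

```python
from functools import reduce


def resolve_path(payload: dict, path: str):
    """Walk a nested dict by following each dot-separated key in turn."""
    return reduce(lambda node, key: node[key], path.split("."), payload)


# Hypothetical experiment payload: dataset input plus the task's output.
payload = {
    "input": {"query": "What colour is Nan-E?"},
    "output": "I cannot find that information.",
}
mapping = {"query": "input.query", "response": "output"}

evaluator_inputs = {field: resolve_path(payload, path) for field, path in mapping.items()}
print(evaluator_inputs)
```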

### Experiment 1: Baseline RAG System


In [51]:
query_engine_baseline = index.as_query_engine()

In [59]:
baseline_experiment = await AsyncClient().experiments.run_experiment(
    dataset=ds,
    task=partial(run_rag_task, rag_engine=query_engine_baseline),
    experiment_name="baseline",
    evaluators=[refusal_evaluator],
    concurrency=10,
    # dry_run=3,
)

🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoxOQ==/experiments
🔗 View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoxOQ==/compare?experimentId=RXhwZXJpbWVudDo2NQ==


running tasks |██████████| 32/32 (100.0%) | ⏳ 00:58<00:00 |  1.82s/it


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |██████████| 32/32 (100.0%) | ⏳ 00:10<00:00 |  2.93it/s

Experiment completed: 32 task runs, 1 evaluator runs, 32 evaluations





### Experiment 2: RAG with Custom Prompt


In [None]:
from textwrap import dedent

custom_system_prompt = """You are an expert Q&A system that is trusted around the world.
\nAlways answer the query using the provided context information, and not prior knowledge.
\nSome rules to follow:
\n1. Never directly reference the given context in your answer.
\n2. Avoid statements like 'Based on the context, ...' or 'The context information ...'
or anything along those lines.
\n3. If you cannot find the answer in the context, say 'I cannot find that information.' When in
doubt, default to responding 'I cannot find that information.'
"""
custom_query_engine = index.as_query_engine(system_prompt=dedent(custom_system_prompt))

In [87]:
experiment = await AsyncClient().experiments.run_experiment(
    dataset=ds,
    task=partial(run_rag_task, rag_engine=custom_query_engine),
    experiment_name="custom-prompt-2",
    evaluators=[refusal_evaluator],
    concurrency=10,
    dry_run=3,
)

🧪 Experiment started.
🌵️ This is a dry-run for these example IDs:
RGF0YXNldEV4YW1wbGU6MjEzNjY=
RGF0YXNldEV4YW1wbGU6MjEzMzM=
RGF0YXNldEV4YW1wbGU6MjEzMjc=


running tasks |██████████| 3/3 (100.0%) | ⏳ 00:06<00:00 |  2.02s/it


✅ Task runs completed.
🧠 Evaluation started.
🌵️ This is a dry-run evaluation.


running experiment evaluations |██████████| 3/3 (100.0%) | ⏳ 00:05<00:00 |  1.76s/it

Experiment completed: 3 task runs, 1 evaluator runs, 3 evaluations
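
When comparing the two experiments, the overall refusal rate is less informative than the rate per query type: we want refusals on `no_answer` queries and answers on `single_passage` queries. The sketch below computes that breakdown from made-up rows; joining `query_type` onto the experiment results is assumed to have happened already, and `refusal_rate_by_type` is a hypothetical helper.

```python
from collections import defaultdict


def refusal_rate_by_type(rows):
    """Fraction of responses scored as refusals (score 0) for each query type."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for row in rows:
        totals[row["query_type"]] += 1
        refusals[row["query_type"]] += int(row["score"] == 0)
    return {query_type: refusals[query_type] / totals[query_type] for query_type in totals}


# Made-up rows for illustration: score 0 = refusal, 1 = answer.
rows = [
    {"query_type": "no_answer", "score": 0},
    {"query_type": "no_answer", "score": 1},
    {"query_type": "single_passage", "score": 1},
    {"query_type": "single_passage", "score": 1},
]
rates = refusal_rate_by_type(rows)
print(rates)
```

A better prompt should push the `no_answer` rate toward 1.0 without raising the `single_passage` rate.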





# Conclusion


In this notebook, we have covered:

1. How to evaluate traces using different types of evaluators:
   - custom LLM classifiers
   - built-in metrics
   - heuristic functions using the `create_evaluator` decorator
2. How to build and iterate on an LLM evaluator using experiments
3. How to iterate on an application using experiments and evaluators
