<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize-phoenix.readthedocs.io/projects/evals/en/latest/">Evals Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Arize Phoenix Evals 2.0</h1>

Arize Phoenix is a fully open-source AI observability platform. It's designed for experimentation, evaluation, and troubleshooting.

**In this notebook, you will use Evals 2.0 to:**

1. Evaluate Phoenix project traces.
2. Improve your custom evaluators using experiments.
3. Iterate on your application and evals with a realistic example.

<center>
    <h3 align="left">The Evaluation Driven Development Lifecycle</h3>
    <p style="text-align:center">
        <img alt="eval lifecycle" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/evals_lifecycle.png" width="1000"/>
    </p>
</center>


### Requirements

1. Kaggle API key
2. OpenAI API key
3. A Phoenix instance (cloud or local)


In [1]:
! uv pip install "arize-phoenix-evals>=2.0.0" "arize-phoenix-client>=1.19.0" arize-phoenix-otel kagglehub openinference-instrumentation-llama_index llama-index numpy pandas --quiet



# Dataset Preparation and Setup

We are using a public RAG evaluation dataset. It has two components:

1. A knowledge base of 20 documents of various lengths and sources.
2. Four question-answer pairs per document:
   - 2 that cannot be answered from the document
   - 2 that can be answered from a single passage

First, we need to do some data preparation.


In [2]:
# Download dataset
# Requires a Kaggle API key and username in your environment

import os

import kagglehub

path = kagglehub.dataset_download("samuelmatsuoharris/single-topic-rag-evaluation-dataset")

print("Path to dataset files:", path)
print(os.listdir(path))


Path to dataset files: /Users/elizabethhutton/.cache/kagglehub/datasets/samuelmatsuoharris/single-topic-rag-evaluation-dataset/versions/4
['multi_passage_answer_questions.csv', 'documents.csv', 'single_passage_answer_questions.csv', 'no_answer_questions.csv']


In [3]:
import pandas as pd


def prepare_query_data(path: str) -> pd.DataFrame:
    single_passage_df = pd.read_csv(os.path.join(path, "single_passage_answer_questions.csv"))
    no_answer_df = pd.read_csv(os.path.join(path, "no_answer_questions.csv"))

    # Single-passage questions
    single_passage_processed = pd.DataFrame(
        {
            "document_index": single_passage_df["document_index"],
            "query": single_passage_df["question"],
            "answer": single_passage_df["answer"],
            "query_type": "single_passage",
        }
    )

    # No-answer questions
    no_answer_processed = pd.DataFrame(
        {
            "document_index": no_answer_df["document_index"],
            "query": no_answer_df["question"],
            "answer": "N/A",
            "query_type": "no_answer",
        }
    )

    # Combine all dataframes
    combined_df = pd.concat([single_passage_processed, no_answer_processed], ignore_index=True)

    return combined_df


query_df = prepare_query_data(path)
query_df.sample(5)

Unnamed: 0,document_index,query,answer,query_type
72,16,In what version was the velociraptor introduced?,,no_answer
6,3,How do the data storage options compare?,For fast start: use SQLite3 and ChromaDB (File...,single_passage
73,16,What was added in version 1.5.7?,,no_answer
46,3,When was GPT 4 made available,,no_answer
74,17,Why did Saga stab Scratch?,,no_answer


### Split data into train/test

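Split the documents 60/40 into train and test sets. Because the split is by document rather than by query, all questions about a given document land on the same side. We will iterate and experiment on the train set only, leaving the test set for final comparisons.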


In [4]:
import numpy as np

unique_docs = query_df["document_index"].unique()
print(f"Total unique documents: {len(unique_docs)}")

np.random.seed(42)
sample_size = int(len(unique_docs) * 0.6)
train_docs = np.random.choice(unique_docs, size=sample_size, replace=False)
print(f"Sampled {len(train_docs)} documents ({len(train_docs) / len(unique_docs) * 100:.1f}%)")

# Split queries based on sampled document indices
all_queries = query_df.copy()
train_queries = query_df[query_df["document_index"].isin(train_docs)]
test_queries = query_df[~query_df["document_index"].isin(train_docs)]
print(f"Train queries: {len(train_queries)}, Test queries: {len(test_queries)}")

Total unique documents: 20
Sampled 12 documents (60.0%)
Train queries: 48, Test queries: 32
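
To see why the split is done on documents rather than on individual queries, here is a small standalone sketch on toy data. `split_by_group` and the toy rows are hypothetical and use only the standard library; the point is that no document contributes queries to both sides.

```python
import random


def split_by_group(rows, group_key, train_fraction, seed=42):
    """Split rows so that every row sharing a group value lands on the same side."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    n_train = int(len(groups) * train_fraction)
    train_groups = set(rng.sample(groups, n_train))
    train = [row for row in rows if row[group_key] in train_groups]
    test = [row for row in rows if row[group_key] not in train_groups]
    return train, test


# Toy data mirroring the dataset's shape: 20 documents, 4 queries each
rows = [{"document_index": d, "query": f"q{d}-{i}"} for d in range(20) for i in range(4)]
train, test = split_by_group(rows, "document_index", train_fraction=0.6)

train_doc_ids = {row["document_index"] for row in train}
test_doc_ids = {row["document_index"] for row in test}
print(len(train), len(test))  # 48 32
print(train_doc_ids.isdisjoint(test_doc_ids))  # True
```

The same disjointness check can be run against `train_queries` and `test_queries` from the cell above.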


### Inspect the knowledge base documents


In [5]:
documents = pd.read_csv(os.path.join(path, "documents.csv"))
documents.head()

Unnamed: 0,index,source_url,text
0,0,https://enterthegungeon.fandom.com/wiki/Bullet...,Bullet Kin\nBullet Kin are one of the most com...
1,1,https://www.dropbox.com/scl/fi/ljtdg6eaucrbf1a...,---The Paths through the Underground/Underdark...
2,2,https://bytes-and-nibbles.web.app/bytes/stici-...,Semantic and Textual Inference Chatbot Interfa...
3,3,https://github.com/llmware-ai/llmware,llmware\n\nBuilding Enterprise RAG Pipelines w...
4,4,https://docs.marimo.io/recipes.html,Recipes\nThis page includes code snippets or “...


### Set Up Phoenix Tracing

Tracing lets us capture not only our application's runs, but also any evaluations and experiments we run.

You can use either a locally hosted instance of Phoenix or Phoenix Cloud.
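
As a minimal sketch of pointing the notebook at a Phoenix instance, the snippet below sets the collector endpoint through an environment variable before `register` is called. It assumes a local server on the default port; the Cloud URL and API key are placeholders you would replace with your own values.

```python
import os

# Assumption: a local Phoenix server on its default port. For Phoenix Cloud,
# use your own space's URL instead.
os.environ.setdefault("PHOENIX_COLLECTOR_ENDPOINT", "http://localhost:6006")

# Only needed for instances that require authentication (e.g. Phoenix Cloud).
# os.environ["PHOENIX_API_KEY"] = "<your-api-key>"

print(os.environ["PHOENIX_COLLECTOR_ENDPOINT"])
```

`setdefault` leaves any value already present in your environment untouched, so the cell is safe to rerun.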


In [1]:
# Set up Phoenix Tracing
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

from phoenix.otel import register

project_name = "rag-demo"  # project for our application traces
tracer_provider = register(project_name=project_name, verbose=False)
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)



Add your LLM API credentials. Here, we are using OpenAI.


In [2]:
import os
from getpass import getpass

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

# Set Up a RAG App using LlamaIndex

For this demo application, we are building a simple RAG pipeline that has two components:

1. Vector index to retrieve documents
2. LLM to generate responses

For this initial application, let's keep it simple and use the default configuration and prompts from LlamaIndex.


In [None]:
import os

from llama_index.core import (
    Document,
    Settings,
    StorageContext,
    VectorStoreIndex,
    load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

index_dir = "llamaindex_store"

# --- Set up the LLM and embedding model ---
# Configure these before loading or building the index so that queries use the
# same embedding model the index was built with.
Settings.llm = OpenAI(model="gpt-4o-mini", temperature=0)  # generator
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")  # retriever

# --- Ingest documents ---
if os.path.exists(index_dir):
    # Reuse the index persisted by a previous run
    storage_context = StorageContext.from_defaults(persist_dir=index_dir)
    index = load_index_from_storage(storage_context)
else:
    kb_docs = []
    for _, row in documents.iterrows():
        doc = Document(
            text=str(row["text"]),
            metadata={"source_url": row["source_url"], "document_index": row["index"]},
            id_=str(row["index"]),
        )
        kb_docs.append(doc)

    index = VectorStoreIndex.from_documents(kb_docs)

    # Optional: persist to disk so you can reuse later
    index.storage_context.persist(persist_dir=index_dir)

# Create the query engine
query_engine = index.as_query_engine()

Let's test to make sure our RAG system is working:


In [9]:
query_engine.query("What is data science?")

Response(response='Data science is a multidisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.', source_nodes=[NodeWithScore(node=TextNode(id_='9069eda0-f363-47bc-9f89-e0888c6aeffb', embedding=None, metadata={'source_url': 'https://towardsdatascience.com/gpt-from-scratch-with-mlx-acf2defda30e', 'document_index': 13}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='13', node_type='4', metadata={'source_url': 'https://towardsdatascience.com/gpt-from-scratch-with-mlx-acf2defda30e', 'document_index': 13}, hash='2187a4efee001b52656775153aab90cdc0d580feb56478d77d92d778b432b832'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='7d21cbc0-db4a-43f6-adf9-3f5d922207a5', node_type='1', metadata={'source_url': 'https://towardsdatascience.com/gpt-from-scratch-with-mlx-acf2defda30e', 'document_index':

### Run RAG on Train Set


Let's wrap our query engine so it's easier to run on our dataset.


In [10]:
from openinference.instrumentation import using_metadata


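async def run_rag_with_metadata(example, rag_engine):
    """Ask a question of the knowledge base, tagging the trace with dataset metadata."""
    metadata = {
        "expected_answer": example["answer"],
        "query_type": example["query_type"],
        "expected_document_index": example["document_index"],
        "split": "test" if example["document_index"] not in train_docs else "train",
    }
    # Spans created inside this block carry the metadata as span attributes
    with using_metadata(metadata):
        # Note: `query` is synchronous, so it blocks the event loop while it runs
        return rag_engine.query(example["query"])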

We use the `AsyncExecutor` to run our RAG app over the train-set queries; `concurrency` caps how many tasks are in flight at once.


In [11]:
# Run application on the train set to get a baseline
from functools import partial

from phoenix.evals.executors import AsyncExecutor
from phoenix.evals.utils import get_tqdm_progress_bar_formatter

executor = AsyncExecutor(
    generation_fn=partial(run_rag_with_metadata, rag_engine=query_engine),
    concurrency=10,  # adjust this as needed
    exit_on_error=True,
    tqdm_bar_format=get_tqdm_progress_bar_formatter("Run RAG"),
)

results, execution_details = await executor.execute(
    [row.to_dict() for _, row in train_queries.iterrows()],
)

Run RAG |██████████| 48/48 (100.0%) | ⏳ 01:27<00:00 |  1.83s/it
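
To illustrate what a concurrency cap means, here is a standard-library sketch using `asyncio.Semaphore`. It is not how `AsyncExecutor` is implemented; `run_bounded` and `fake_rag` are hypothetical stand-ins. It also shows that tasks only overlap when the wrapped call yields control with `await`.

```python
import asyncio


async def run_bounded(fn, inputs, concurrency):
    """Run fn over inputs with at most `concurrency` calls in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(item):
        async with semaphore:
            return await fn(item)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(guarded(item) for item in inputs))


in_flight = 0
peak_in_flight = 0


async def fake_rag(query):
    """Stand-in for a non-blocking RAG call."""
    global in_flight, peak_in_flight
    in_flight += 1
    peak_in_flight = max(peak_in_flight, in_flight)
    await asyncio.sleep(0.01)  # yields control, so other tasks can start
    in_flight -= 1
    return f"answer to {query}"


# In a notebook, use `await run_bounded(...)` instead of asyncio.run
outputs = asyncio.run(run_bounded(fake_rag, [f"q{i}" for i in range(6)], concurrency=2))
print(outputs[0], "| peak in flight:", peak_in_flight)
```

If `fake_rag` used a blocking call such as `time.sleep` instead of `await asyncio.sleep`, only one task would run at a time regardless of the cap.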


# Evaluate the Traces

First, let's go to Phoenix and look at our application traces. Do we observe any issues?

- Is the RAG agent correctly refusing to answer the unanswerable queries?
- Is it retrieving the correct documents?
- Is it hallucinating?

These are common questions we can turn into repeatable evaluations. So let's create a few evaluators for our RAG app and run them on our traces.

**Steps:**

1. Export traces from Phoenix
2. Define evaluators
3. Run evaluators on the trace data
4. Log the evaluation results back up to Phoenix


In [3]:
from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

# Export all the top level spans
query = SpanQuery().where("name == 'RetrieverQueryEngine.query'")
spans_df = Client().spans.get_spans_dataframe(query=query, project_identifier=project_name)
spans_df.dropna(
    subset=["attributes.metadata"], inplace=True
)  # drop any traces not from our dataset

# Shape the spans dataframe
spans_df["query"] = spans_df["attributes.input.value"]
spans_df["response"] = spans_df["attributes.output.value"]
spans_df["split"] = spans_df["attributes.metadata"].apply(lambda x: x["split"])
spans_df["expected_document_index"] = spans_df["attributes.metadata"].apply(
    lambda x: x["expected_document_index"]
)
spans_df["expected_answer"] = spans_df["attributes.metadata"].apply(lambda x: x["expected_answer"])

# Export and process the retrieval spans to get the retrieved documents
query = SpanQuery().where("name == 'VectorIndexRetriever.retrieve'")
retrieval_spans_df = Client().spans.get_spans_dataframe(
    query=query, project_identifier=project_name
)
retrieval_spans_df["document_content"] = retrieval_spans_df["attributes.retrieval.documents"].apply(
    lambda x: "\n----------------------------------\n".join([doc["document.content"] for doc in x])
)
retrieval_spans_df["retrieved_documents"] = retrieval_spans_df[
    "attributes.retrieval.documents"
].apply(lambda x: [doc["document.metadata"]["document_index"] for doc in x])

# Combine the spans with the retrieval spans
spans_df = spans_df.merge(
    retrieval_spans_df[["context.trace_id", "document_content", "retrieved_documents"]],
    on="context.trace_id",
    how="left",
)

print(spans_df.shape)
spans_df.head()

(48, 21)


Unnamed: 0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,context.span_id,context.trace_id,...,attributes.openinference.span.kind,attributes.input.value,attributes.metadata,query,response,split,expected_document_index,expected_answer,document_content,retrieved_documents
0,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:58.645931+00:00,2025-09-18 21:45:00.263051+00:00,OK,,[],124677f2c156e8f5,3357a7c847550d6e0b0e0831f5455ed4,...,CHAIN,Which book is the best?,"{'split': 'train', 'query_type': 'no_answer', ...",Which book is the best?,I'm unable to provide an answer to that query ...,train,18,,"In the naive implementation, we had separate a...","[13, 16]"
1,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:56.894284+00:00,2025-09-18 21:44:58.576838+00:00,OK,,[],e7fc3976b381d1bb,ae0a44836ad5a4abd288c98234621ff1,...,CHAIN,"In 'he who drowned the world', why did Gong Li...","{'split': 'train', 'query_type': 'no_answer', ...","In 'he who drowned the world', why did Gong Li...",Gong Li sacrificed her brother in 'he who drow...,train,18,,"In the naive implementation, we had separate a...","[13, 13]"
2,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:55.186142+00:00,2025-09-18 21:44:56.819187+00:00,OK,,[],1863d59e8310f155,388332cccc824afe407ec2974271237d,...,CHAIN,What caliber is the bullet of light?,"{'split': 'train', 'query_type': 'no_answer', ...",What caliber is the bullet of light?,The bullet of light does not have a specified ...,train,17,,Fixed farmhand crash while fishing in rare cas...,"[16, 13]"
3,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:53.432784+00:00,2025-09-18 21:44:55.119930+00:00,OK,,[],363c1325fff25453,bedf65b9a9f17baaeaef14a7e3484a7d,...,CHAIN,Why did Saga stab Scratch?,"{'split': 'train', 'query_type': 'no_answer', ...",Why did Saga stab Scratch?,The reason Saga stabbed Scratch was to ensure ...,train,17,,Then we perform row-wise softmax to get the fi...,"[13, 5]"
4,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:51.394301+00:00,2025-09-18 21:44:53.365811+00:00,OK,,[],f3738a9a735840b7,36a20133ae9b0d602c66d6b664d18ec2,...,CHAIN,What was added in version 1.5.7?,"{'split': 'train', 'query_type': 'no_answer', ...",What was added in version 1.5.7?,"In version 1.5.7, the following features were ...",train,16,,I knew that limiting it to running on my M1 Ma...,"[2, 16]"


## Define Evaluators


### 1. Refusal

Since we know some questions in our dataset are unanswerable, we expect the LLM to respond accordingly. So let's create an LLM judge to check for refusals in the output.


In [4]:
from phoenix.evals import create_classifier
from phoenix.evals.llm import LLM

prompt = """
Given a question and response, determine if the response is a refusal to answer the question.

Question: {query}
Response: {response}

Is the response a refusal or an informative answer to the question?
"""

llm = LLM(model="gpt-4o", provider="openai")
refusal_evaluator = create_classifier(
    llm=llm,
    name="llm_refusal",
    prompt_template=prompt,
    choices={"refusal": 0, "answer": 1},
)

# test the evaluator on a single example
refusal_evaluator.evaluate(spans_df.iloc[0].to_dict())

[Score(name='llm_refusal', score=0, label='refusal', explanation='The response explicitly states that an answer is not possible due to the lack of relevant information, indicating a refusal.', metadata={'model': 'gpt-4o'}, source='llm', direction='maximize')]
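
As a rough mental model of the two main inputs to the classifier, the sketch below fills the template placeholders from a record and looks up a score from `choices`. It uses plain `str.format` and a dict lookup for illustration only; it is not Phoenix's internal implementation, and the response text is hypothetical.

```python
prompt_template = """
Given a question and response, determine if the response is a refusal to answer the question.

Question: {query}
Response: {response}
"""

choices = {"refusal": 0, "answer": 1}

record = {
    "query": "Which book is the best?",
    "response": "I cannot answer that from the provided context.",  # hypothetical
    "unused_column": "ignored",
}

# Only the keys named in the template are needed; other columns are ignored
rendered = prompt_template.format(query=record["query"], response=record["response"])

# Suppose the judge picks the label "refusal"; the score is looked up from `choices`
label = "refusal"
score = choices[label]

print(rendered)
print(label, score)
```

This is why the evaluator can be run directly on a span row: the row already has `query` and `response` columns matching the placeholders.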

### 2. Hallucination

Let's also check to see if our RAG pipeline is producing hallucinations. Phoenix evals has a built-in `HallucinationEvaluator` so we'll use that. First, let's inspect the `input_schema` so we know what it needs to run.


In [5]:
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import HallucinationEvaluator

llm = LLM(model="gpt-4o", provider="openai")
hallucination_evaluator = HallucinationEvaluator(llm=llm)
hallucination_evaluator.describe()

{'name': 'hallucination',
 'source': 'llm',
 'direction': 'maximize',
 'input_schema': {'properties': {'input': {'description': 'The input query.',
    'title': 'Input',
    'type': 'string'},
   'output': {'description': 'The response to the query.',
    'title': 'Output',
    'type': 'string'},
   'context': {'description': 'The context or reference text.',
    'title': 'Context',
    'type': 'string'}},
  'required': ['input', 'output', 'context'],
  'title': 'HallucinationInputSchema',
  'type': 'object'}}

Okay, we need to provide an `input_mapping` so it works on our data. Let's bind it to the evaluator so we can reuse it.


In [6]:
hallucination_mapping = {
    "input": "query",
    "output": "response",
    "context": "document_content",
}
hallucination_evaluator.bind(hallucination_mapping)

# test the evaluator on a single example
hallucination_evaluator.evaluate(spans_df.iloc[0].to_dict())

[Score(name='hallucination', score=1.0, label='factual', explanation='The response correctly identifies that the query about "which book is the best" is not related to the given context, which discusses technical details of multi-head attention implementation and patch notes of a video game.', metadata={'model': 'gpt-4o'}, source='llm', direction='maximize')]
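
Conceptually, an input mapping renames or derives fields so a record matches the evaluator's schema. The helper below is a hypothetical illustration of that idea (it is not the Phoenix implementation): each mapping value is either a key in the record or a callable applied to the whole record.

```python
def apply_input_mapping(record, mapping):
    """Build evaluator inputs from a record.

    Each mapping value is either a key to look up in the record or a
    callable that computes the value from the whole record.
    """
    inputs = {}
    for field, source in mapping.items():
        inputs[field] = source(record) if callable(source) else record[source]
    return inputs


record = {
    "query": "What was added in version 1.5.7?",
    "response": "hypothetical response text",
    "document_content": "hypothetical retrieved text",
    "expected_document_index": 16,
}

mapping = {
    "input": "query",
    "output": "response",
    "context": "document_content",
    # a callable can reshape a value, e.g. wrap a single index in a list
    "relevant_documents": lambda r: [r["expected_document_index"]],
}

evaluator_inputs = apply_input_mapping(record, mapping)
print(evaluator_inputs["input"])
print(evaluator_inputs["relevant_documents"])
```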

### 3. Retrieval Precision

We also want to measure how well the retrieval component of our system is working. Let's add a precision metric: the fraction of retrieved chunks that come from the expected document.


In [7]:
from phoenix.evals import bind_evaluator, create_evaluator


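@create_evaluator(name="precision")
def precision(retrieved_documents: list[int], relevant_documents: list[int]) -> float:
    """Fraction of retrieved chunks that come from a relevant document."""
    if not retrieved_documents:
        return 0.0  # nothing retrieved: avoid dividing by zero
    relevant_set = set(relevant_documents)
    hits = sum(1 for doc in retrieved_documents if doc in relevant_set)
    return hits / len(retrieved_documents)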


# our precision evaluator expects a list of relevant documents,
# but our dataset only has one relevant document per query, so we
# wrap the expected document index in a list inside our mapping using a lambda function
precision_mapping = {
    "relevant_documents": lambda x: [x["expected_document_index"]],
}

precision_evaluator = bind_evaluator(precision, precision_mapping)

# test the evaluator on a single example
precision_evaluator.evaluate(spans_df.iloc[0].to_dict())

[Score(name='precision', score=0.0, label=None, explanation=None, metadata={}, source='heuristic', direction='maximize')]
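
To make the arithmetic concrete, here is the same calculation as a plain function without the decorator. `precision_plain` is a hypothetical name; the first two inputs are the `retrieved_documents` and `expected_document_index` values from the exported spans shown earlier, and the third is made up.

```python
def precision_plain(retrieved_documents, relevant_documents):
    """Fraction of retrieved items whose document index is in the relevant set."""
    if not retrieved_documents:
        return 0.0
    relevant_set = set(relevant_documents)
    hits = sum(1 for doc in retrieved_documents if doc in relevant_set)
    return hits / len(retrieved_documents)


# Retrieved [13, 16], expected document 18: no hits
print(precision_plain([13, 16], [18]))  # 0.0

# Retrieved [2, 16], expected document 16: one of two chunks is a hit
print(precision_plain([2, 16], [16]))  # 0.5

# Hypothetical case: both chunks come from the expected document
print(precision_plain([16, 16], [16]))  # 1.0
```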

### Putting it all together

Let's run our 3 evaluators on all of our project traces.


In [None]:
from phoenix.evals import async_evaluate_dataframe

train_spans = spans_df[spans_df["split"] == "train"]
results = await async_evaluate_dataframe(
    train_spans,
    [precision_evaluator, hallucination_evaluator, refusal_evaluator],
    concurrency=10,
    tqdm_bar_format=get_tqdm_progress_bar_formatter("Run Evaluation"),
    exit_on_error=True,
)
results.head()

Unnamed: 0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,context.span_id,context.trace_id,...,expected_document_index,expected_answer,document_content,retrieved_documents,precision_execution_details,hallucination_execution_details,llm_refusal_execution_details,precision_score,hallucination_score,llm_refusal_score
0,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:58.645931+00:00,2025-09-18 21:45:00.263051+00:00,OK,,[],124677f2c156e8f5,3357a7c847550d6e0b0e0831f5455ed4,...,18,,"In the naive implementation, we had separate a...","[13, 16]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.0, ""metadata""...","{""name"": ""hallucination"", ""score"": 1.0, ""label...","{""name"": ""llm_refusal"", ""score"": 0, ""label"": ""..."
1,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:56.894284+00:00,2025-09-18 21:44:58.576838+00:00,OK,,[],e7fc3976b381d1bb,ae0a44836ad5a4abd288c98234621ff1,...,18,,"In the naive implementation, we had separate a...","[13, 13]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.0, ""metadata""...","{""name"": ""hallucination"", ""score"": 0.0, ""label...","{""name"": ""llm_refusal"", ""score"": 1, ""label"": ""..."
2,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:55.186142+00:00,2025-09-18 21:44:56.819187+00:00,OK,,[],1863d59e8310f155,388332cccc824afe407ec2974271237d,...,17,,Fixed farmhand crash while fishing in rare cas...,"[16, 13]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.0, ""metadata""...","{""name"": ""hallucination"", ""score"": 1.0, ""label...","{""name"": ""llm_refusal"", ""score"": 1, ""label"": ""..."
3,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:53.432784+00:00,2025-09-18 21:44:55.119930+00:00,OK,,[],363c1325fff25453,bedf65b9a9f17baaeaef14a7e3484a7d,...,17,,Then we perform row-wise softmax to get the fi...,"[13, 5]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.0, ""metadata""...","{""name"": ""hallucination"", ""score"": 0.0, ""label...","{""name"": ""llm_refusal"", ""score"": 1, ""label"": ""..."
4,RetrieverQueryEngine.query,CHAIN,,2025-09-18 21:44:51.394301+00:00,2025-09-18 21:44:53.365811+00:00,OK,,[],f3738a9a735840b7,36a20133ae9b0d602c66d6b664d18ec2,...,16,,I knew that limiting it to running on my M1 Ma...,"[2, 16]","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""status"": ""COMPLETED"", ""exceptions"": [], ""exe...","{""name"": ""precision"", ""score"": 0.5, ""metadata""...","{""name"": ""hallucination"", ""score"": 1.0, ""label...","{""name"": ""llm_refusal"", ""score"": 1, ""label"": ""..."


### Log trace evaluations back to Phoenix


In [None]:
from phoenix.client import AsyncClient
from phoenix.evals.utils import to_annotation_dataframe

client = AsyncClient()

annotations = to_annotation_dataframe(
    results
)  # can also specify score_names to log only certain scores
await client.spans.log_span_annotations_dataframe(dataframe=annotations)

# Improve Evaluators

Go into Phoenix and look at your project traces now that you've added some eval metrics. Pay attention to the "llm_refusal" metric: is it catching all the refusals?
It is not. Some responses that decline to answer are scored as answers.

Let's see if we can improve our LLM judge so it is better aligned with human judgment.

**Steps:**

1. Manually annotate some traces as "refused" or "responded" inside Phoenix.
2. Export those annotated traces and use them to create a dataset for experimentation.
3. Define an LLM judge (refusal) and use it as the experiment "task".
4. Create a simple heuristic experiment evaluator that checks for an exact match between the judge score and our annotation.
5. Iterate on the judge prompt until we are happy with the results.

<center>
    <h3 align="left">Phoenix Experiments</h3>
    <p style="text-align:center">
        <img alt="phoenix experiments" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/experiment.png" width="1000"/>
    </p>
</center>


After manual annotation, pull down those traces:


In [None]:
from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

# Export all the top level spans
query = SpanQuery().where("name == 'RetrieverQueryEngine.query'")
spans_df = Client().spans.get_spans_dataframe(query=query, project_identifier=project_name)

# Shape the spans dataframe
spans_df["query"] = spans_df["attributes.input.value"]
spans_df["response"] = spans_df["attributes.output.value"]
spans_df.dropna(subset=["attributes.metadata"], inplace=True)
spans_df["expected_answer"] = spans_df["attributes.metadata"].apply(lambda x: x["expected_answer"])

# Export annotations and add to the spans from earlier
annotations_df = Client().spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df, project_identifier=project_name
)
refusal_ground_truth = annotations_df[
    (annotations_df["annotator_kind"] == "HUMAN") & (annotations_df["annotation_name"] == "refusal")
]
refusal_ground_truth = refusal_ground_truth.rename_axis(index={"span_id": "context.span_id"})
refusal_ground_truth = refusal_ground_truth.rename(columns={"result.score": "refusal_score"})
labeled_spans_df = spans_df.merge(
    refusal_ground_truth[["refusal_score"]], left_index=True, right_index=True, how="left"
)
labeled_spans_df = labeled_spans_df[
    ["context.span_id", "query", "response", "refusal_score", "expected_answer"]
]
labeled_spans = labeled_spans_df.dropna(subset=["refusal_score"])
print(labeled_spans["refusal_score"].value_counts())

refusal_score
1.0    20
0.0     8
Name: count, dtype: int64


In [28]:
labeled_spans.head()

context.span_id,context.span_id,query,response,refusal_score,expected_answer
4cdfbaf375a2316b,4cdfbaf375a2316b,Which book is the best?,I'm unable to provide an answer to that query ...,0.0,
7a2bba89818d79b0,7a2bba89818d79b0,"In 'he who drowned the world', why did Gong Li...",Gong Li sacrificed her brother in 'he who drow...,1.0,
2ae8849b57062841,2ae8849b57062841,What caliber is the bullet of light?,The bullet of light does not have a specified ...,0.0,
06c4934f0a9bf16f,06c4934f0a9bf16f,Why did Saga stab Scratch?,The reason Saga stabbed Scratch was to ensure ...,1.0,
e7897fdccc20b7b4,e7897fdccc20b7b4,What was added in version 1.5.7?,"In version 1.5.7, the following features were ...",1.0,


In [None]:
dataset_name = "annotated-queries-train"
ds = await AsyncClient().datasets.create_dataset(
    name=dataset_name,
    dataframe=labeled_spans,
    input_keys=["query", "response"],
    output_keys=["refusal_score", "expected_answer"],
)

# get the dataset if already created
# ds = await AsyncClient().datasets.get_dataset(dataset=dataset_name)

Next, we define:

1. The experiment task (here, our LLM evaluator is the task).
2. The experiment evaluator (here, a simple check for alignment between the judge and human).

**Note: 2.0 evaluators are now drop-in compatible as experiment evaluators!**


In [31]:
from phoenix.evals import create_evaluator


# experiment task
def run_refusal_eval(input, evaluator):
    result = evaluator.evaluate(input)
    return result[0]


# experiment evaluator
@create_evaluator(name="exact_match")
def exact_match(output, expected):
    return float(output["score"]) == float(expected["refusal_score"])
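
Averaged over the dataset, `exact_match` is the agreement rate between judge and human. The standalone sketch below computes that rate and tallies the direction of each disagreement. `agreement_report` and the score lists are hypothetical, with 1 meaning "answer" and 0 meaning "refusal" as in the classifier's `choices`.

```python
from collections import Counter


def agreement_report(judge_scores, human_scores):
    """Return the agreement rate and a tally of (judge, human) disagreements."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("score lists must be non-empty and the same length")
    pairs = list(zip(judge_scores, human_scores))
    matches = sum(1 for judge, human in pairs if float(judge) == float(human))
    disagreements = Counter(
        (judge, human) for judge, human in pairs if float(judge) != float(human)
    )
    return matches / len(pairs), disagreements


# Hypothetical scores: 1 = answer, 0 = refusal
human = [0, 1, 0, 1, 1, 0]
judge = [0, 1, 1, 1, 1, 1]

rate, disagreements = agreement_report(judge, human)
print(f"agreement: {rate:.2f}")
print(disagreements)  # (1, 0) means the judge said "answer" where the human said "refusal"
```

A tally dominated by one direction, as here, points at a systematic bias in the judge prompt rather than random noise.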

First, let's get a baseline by testing out the initial refusal evaluator against our annotations.


In [32]:
prompt = """
Given a question and response, determine if the response is a refusal to answer the question.

Question: {query}
Response: {response}

Is the response a refusal or an informative response to the question?
"""

llm = LLM(model="gpt-4o", provider="openai")
baseline_refusal = create_classifier(
    llm=llm,
    name="refusal",
    prompt_template=prompt,
    choices={"refusal": 0, "answer": 1},
)

In [33]:
from functools import partial

from phoenix.client import AsyncClient

async_client = AsyncClient()

experiment = await async_client.experiments.run_experiment(
    dataset=ds,
    task=partial(run_refusal_eval, evaluator=baseline_refusal),
    experiment_name="baseline",
    evaluators=[exact_match],
    concurrency=10,
    # dry_run=3,
)

🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyNA==/experiments
🔗 View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyNA==/compare?experimentId=RXhwZXJpbWVudDo3OQ==


running tasks |██████████| 28/28 (100.0%) | ⏳ 00:43<00:00 |  1.55s/it


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |██████████| 28/28 (100.0%) | ⏳ 00:03<00:00 |  8.14it/s

Experiment completed: 28 task runs, 1 evaluator runs, 28 evaluations





Go to the experiment and filter by "exact_match==0". These are the examples where our refusal judge disagreed with the human annotations. Look at the RAG responses and the judge explanations. I notice that the LLM judge treats uncertain responses as valid answers rather than refusals.

Based on these observations, let's tweak our prompt to make the evaluation criteria clearer to the LLM judge by describing exactly what a "refusal" looks like.


In [34]:
from phoenix.evals import create_classifier
from phoenix.evals.llm import LLM

prompt = """
Given a question and response, determine if the response is a refusal to answer the question.
Refusals often contain phrases of uncertainty like 'I don't know' and 'I don't have that information'.
They also often mention that the answer is not provided in the information or context.

If the response contains these phrases, it is a refusal. Even if the response contains other
text indicating an attempt to answer the question, it is still a refusal.

If the response does not contain these "hedging" phrases, it is an informative response. Do not
consider the correctness of the response, only whether it is a refusal or not.

Question: {query}
Response: {response}

Is the response a refusal or an informative answer to the question?
"""

refusal_v2 = create_classifier(
    llm=llm,
    name="refusal",
    prompt_template=prompt,
    choices={"refusal": 0, "answer": 1},
)
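
The criteria in this prompt are concrete enough to approximate with a keyword check. The sketch below is a plain heuristic, not the LLM judge; it can serve as a cheap sanity check alongside it. The phrase list and example responses are hypothetical.

```python
# Phrases taken from the criteria in the prompt above (lowercased)
REFUSAL_PHRASES = (
    "i don't know",
    "i don't have that information",
    "not provided in the",
)


def heuristic_refusal_score(response):
    """Return 0 for a refusal and 1 for an answer, matching the judge's choices."""
    text = response.lower()
    return 0 if any(phrase in text for phrase in REFUSAL_PHRASES) else 1


examples = [
    "I don't know. The answer is not provided in the context.",
    "The patch added two new maps and fixed a crash.",
    "It might be the third option, but I don't have that information.",
]
scores = [heuristic_refusal_score(text) for text in examples]
print(scores)  # [0, 1, 0]
```

A fixed phrase list misses paraphrases and typographic apostrophes, which is one reason to keep the LLM judge as the primary metric.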

In [35]:
experiment = await async_client.experiments.run_experiment(
    dataset=ds,
    task=partial(run_refusal_eval, evaluator=refusal_v2),
    experiment_name="prompt-v2",
    evaluators=[exact_match],
    concurrency=10,
    # dry_run=3,
)

🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyNA==/experiments
🔗 View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyNA==/compare?experimentId=RXhwZXJpbWVudDo4MA==


running tasks |██████████| 28/28 (100.0%) | ⏳ 00:42<00:00 |  1.52s/it


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |██████████| 28/28 (100.0%) | ⏳ 00:03<00:00 |  8.15it/s

Experiment completed: 28 task runs, 1 evaluator runs, 28 evaluations





Looking at this experiment in Phoenix, I see that we now have "exact_match == 1.0", indicating 100% agreement between our new judge and the annotations!

Through experimentation, we improved the evaluation metric itself, much as we would iterate on the application.


# Improve the Application

Now that we feel good about our refusal metric, let's see if we can improve our RAG system.

Exactly 50% of the queries in our dataset are unanswerable, so ideally we would like to see the "llm_refusal" score close to 0.5. We don't want the RAG system attempting to answer questions that cannot be answered from the context, because doing so increases the chance of hallucination.
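
To make the 0.5 target concrete: with the `choices` mapping used for `refusal_v2` (`"refusal": 0`, `"answer": 1`), a mean score over the dataset equals the fraction of queries the system answered. The sketch below assumes the aggregate you compare against is that mean. The labels are made up, and `mean_score` is a hypothetical helper rather than a Phoenix function:

```python
# Same label-to-score mapping as the refusal_v2 classifier.
CHOICES = {"refusal": 0, "answer": 1}


def mean_score(labels, choices=CHOICES):
    """Average the numeric scores that correspond to classifier labels."""
    scores = [choices[label] for label in labels]
    return sum(scores) / len(scores)


# Illustrative labels only: 3 answers and 1 refusal.
labels = ["answer", "answer", "refusal", "answer"]
score = mean_score(labels)

# If half of the queries are unanswerable, the ideal mean is 0.5.
target = 0.5
gap = abs(score - target)
print(f"mean score={score:.2f}, distance from target={gap:.2f}")
```

A score above 0.5 in this setup means the system is answering more often than the data supports, while a score below 0.5 means it is refusing too often.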

**Steps:**

1. Create a dataset using the train set queries.
2. Define our experiment task (running RAG on our dataset).
3. Use our new and improved refusal classifier as the experiment evaluator.
4. Iterate on the RAG agent's prompt until we are happy.


In [None]:
dataset_name = "train-queries"
ds = await AsyncClient().datasets.create_dataset(
    name=dataset_name,
    dataframe=train_queries,
    input_keys=["query"],
)

# if already created
# ds = await AsyncClient().datasets.get_dataset(dataset=dataset_name)

In [39]:
from phoenix.evals import bind_evaluator


# define experiment task (running the RAG engine)
async def run_rag_task(input, rag_engine):
    """Ask a question of the knowledge base."""
    # await the async query method so concurrent task runs don't block the event loop
    response = await rag_engine.aquery(input["query"])
    return response


# use an input mapping to fit our dataset to the evaluator we created earlier
refusal_evaluator = bind_evaluator(refusal_v2, {"query": "input.query", "response": "output"})
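
As a rough mental model of what the input mapping does (an illustration only, not Phoenix's actual implementation), each evaluator field is looked up by following a dotted path into the experiment record. The `resolve_path` and `apply_mapping` helpers and the sample record below are hypothetical:

```python
def resolve_path(record, path):
    """Follow a dotted path such as 'input.query' through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]
    return value


def apply_mapping(record, mapping):
    """Build the evaluator's inputs from a {field: dotted_path} mapping."""
    return {field: resolve_path(record, path) for field, path in mapping.items()}


# Hypothetical experiment record: the dataset input plus the task's output.
record = {
    "input": {"query": "What is Phoenix?"},
    "output": "I cannot find that information.",
}
mapping = {"query": "input.query", "response": "output"}

eval_input = apply_mapping(record, mapping)
print(eval_input)
```

The resulting dictionary has exactly the `query` and `response` fields that the classifier's prompt template expects.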

### Experiment 1: Baseline RAG System

Let's rerun our initial RAG system to get a baseline. How do the "out-of-the-box" defaults work?


In [40]:
query_engine_baseline = index.as_query_engine()
baseline_experiment = await AsyncClient().experiments.run_experiment(
    dataset=ds,
    task=partial(run_rag_task, rag_engine=query_engine_baseline),
    experiment_name="baseline",
    evaluators=[refusal_evaluator],
    concurrency=10,
    # dry_run=3,
)

🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyMw==/experiments
🔗 View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyMw==/compare?experimentId=RXhwZXJpbWVudDo4MQ==


running tasks |██████████| 48/48 (100.0%) | ⏳ 01:36<00:00 |  2.01s/it


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |██████████| 48/48 (100.0%) | ⏳ 00:39<00:00 |  1.21it/s

Experiment completed: 48 task runs, 1 evaluator runs, 48 evaluations





### Experiment 2: RAG with Custom Prompt

Go into Phoenix to see the results of our experiment.

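The refusal score is a little high. Since a refusal scores 0 and an answer scores 1, this means the system is answering more than half of the queries. We want to bring the score closer to 0.5, since we know 50% of our queries are unanswerable. Let's see if modifying the system prompt used for the LLM generation component of our RAG system helps.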


In [41]:
# Blank lines separate the rules; the string has no indentation, so no dedent is needed.
custom_system_prompt = """You are an expert at answering questions about a given context.

Always answer the query using the provided context information, and not prior knowledge.

Some rules to follow:

1. Never directly reference the given context in your answer.

2. Avoid statements like 'Based on the context, ...' or 'The context information ...'
or anything along those lines.

3. Do NOT use prior knowledge to answer the question. Only use the context provided.

4. If you cannot find the answer in the context, say 'I cannot find that information.' When in
doubt, default to responding 'I cannot find that information.'
"""
custom_query_engine = index.as_query_engine(system_prompt=custom_system_prompt)

In [42]:
experiment = await AsyncClient().experiments.run_experiment(
    dataset=ds,
    task=partial(run_rag_task, rag_engine=custom_query_engine),
    experiment_name="custom-prompt",
    evaluators=[refusal_evaluator],
    concurrency=10,
    # dry_run=3,
)

🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyMw==/experiments
🔗 View this experiment: https://app.phoenix.arize.com/s/ehutton//datasets/RGF0YXNldDoyMw==/compare?experimentId=RXhwZXJpbWVudDo4Mg==


running tasks |██████████| 48/48 (100.0%) | ⏳ 01:35<00:00 |  1.99s/it


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |██████████| 48/48 (100.0%) | ⏳ 00:17<00:00 |  2.82it/s

Experiment completed: 48 task runs, 1 evaluator runs, 48 evaluations





Check out the results of this experiment in Phoenix.

Nice, we are heading in the right direction! Our refusal score moved down a bit, closer to 0.5, which suggests that our RAG system is now declining more of the queries it cannot answer.
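
An aggregate score near 0.5 does not by itself show that the right queries are being refused. One way to check, sketched below, is to split refusals by whether each query was answerable. The `answerable` field, the sample rows, and the `refusal_rate_by_group` helper are assumptions for illustration; they are not part of the notebook's dataset or of Phoenix:

```python
from collections import defaultdict


def refusal_rate_by_group(rows):
    """Return {answerable_flag: fraction of rows labeled 'refusal'}."""
    totals = defaultdict(int)
    refusals = defaultdict(int)
    for row in rows:
        totals[row["answerable"]] += 1
        if row["label"] == "refusal":
            refusals[row["answerable"]] += 1
    return {group: refusals[group] / totals[group] for group in totals}


# Illustrative rows only; not results from the experiments above.
rows = [
    {"answerable": True, "label": "answer"},
    {"answerable": True, "label": "refusal"},
    {"answerable": False, "label": "refusal"},
    {"answerable": False, "label": "refusal"},
]

rates = refusal_rate_by_group(rows)
# Ideal: close to 0.0 for answerable queries and close to 1.0 for unanswerable ones.
print(rates)
```

If the refusal rate rises for answerable queries as well, the new prompt may be overcorrecting, even when the overall score looks closer to the target.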


# Conclusion


In this notebook, we have covered a lot! Now you know:

1. How to evaluate traces using different types of evaluators:
   - custom LLM classifiers
   - built-in metrics
   - heuristic functions using the `create_evaluator` decorator
2. How to build and iterate on an LLM Evaluator using experiments
3. How to iterate on an application using experiments and evaluators

For more information, check out our [documentation](https://arize-phoenix.readthedocs.io/projects/evals/en/latest/)!
