<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Retrieval Relevance Evals</h1>

Phoenix evals are designed to be robust to many kinds of errors, providing many tools to control error handling and retry behavior, as well as the ability to surface details about what happened during long eval runs.

In this notebook, we'll simulate various kinds of errors that might happen while running evals and show different ways Phoenix evals can work with them.

## Install Dependencies and Import Libraries

In [2]:
N_EVAL_SAMPLE_SIZE = 40

In [None]:
!pip install -qq "arize-phoenix-evals" "openai>=1" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
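As a rough illustration of why re-entrancy matters, the standard-library sketch below tries to start a second event loop from inside a running one, which is roughly the situation inside a notebook cell. It is meant to be run as a standalone script, and the function names `inner` and `outer` are hypothetical. It does not use Phoenix or `nest_asyncio`.

```python
import asyncio


async def inner() -> str:
    return "done"


async def outer() -> str:
    # A loop is already running here, much like inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # tries to start a second loop in the same thread
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "no error"


message = asyncio.run(outer())
print(message)
```

Without patching, the nested call raises a `RuntimeError`; `nest_asyncio.apply()` patches `asyncio` so that nested calls of this kind are allowed.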

In [3]:
import nest_asyncio

nest_asyncio.apply()

In [4]:
import os
from collections import Counter
from getpass import getpass

import pandas as pd
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

pd.set_option("display.max_colwidth", None)

## Download Dataset

In [5]:
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)

## Configure a test LLM

Configure your OpenAI API key.

In [6]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

## Sample Input Dataset
Sample size determines run time. We recommend iterating on a small sample first (around 100 rows) and then increasing to a larger test set. This notebook samples `N_EVAL_SAMPLE_SIZE` rows.

In [7]:
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)
df_sample = df_sample.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

## Run LLM Evals
Run relevance against a subset of the data.
Instantiate the LLM and set parameters.

## Set up test model wrapper

To demonstrate error handling while running evals, we'll first remove some required input data from our sampled dataset.

Second, we'll create a buggy model that inherits from the `OpenAIModel` wrapper to simulate spurious errors that might occur when trying to run evals.

In [8]:
df_sample.loc[28, "reference"] = None
df_sample.loc[37, "input"] = None

In [9]:
import random


class FunnyAIModel(OpenAIModel):
    async def _async_generate(self, *args, **kwargs):
        if random.random() < 0.3:
            raise RuntimeError("What could have possibly happened here?!?!?!")
        return await super()._async_generate(*args, **kwargs)
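
The same fault-injection pattern can be sketched without any API calls. The classes `EchoModel` and `FlakyEchoModel` and the helper `count_failures` below are hypothetical stand-ins, not part of Phoenix. A seeded `random.Random` makes the injected failures repeatable, which the cell above does not attempt.

```python
import asyncio
import random


class EchoModel:
    """Hypothetical stand-in for a real model wrapper."""

    async def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class FlakyEchoModel(EchoModel):
    """Fails a fixed fraction of calls, using a seeded RNG for repeatability."""

    def __init__(self, failure_rate: float, seed: int) -> None:
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)

    async def generate(self, prompt: str) -> str:
        if self._rng.random() < self.failure_rate:
            raise RuntimeError("injected failure")
        return await super().generate(prompt)


async def count_failures(model: EchoModel, n_calls: int) -> int:
    """Call the model n_calls times and count how many calls raise."""
    failures = 0
    for i in range(n_calls):
        try:
            await model.generate(f"prompt {i}")
        except RuntimeError:
            failures += 1
    return failures


failures = asyncio.run(count_failures(FlakyEchoModel(failure_rate=0.3, seed=0), 1000))
print(f"{failures} of 1000 calls failed")
```

With a failure rate of 0.3, roughly 30% of calls should fail, and the same seed should give the same count on every run.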

In [10]:
funny_model = FunnyAIModel(
    model="gpt-4o",
    temperature=0.0,
)

In [11]:
funny_model("Hello world, this is a test if you are working?")

"Hello! Yes, I'm here and working. How can I assist you today?"

## Default Behavior

The default behavior is to retry (with a default maximum of 10 retries) on exceptions raised while running evals. However, if input data is missing and a prompt cannot be generated from the template, that row will fail. `llm_classify` will then return early, and the rows that were not run will not have an eval.

In [12]:
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
default_evals = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=funny_model,
    rails=rails,
    concurrency=3,
)

llm_classify |          | 0/40 (0.0%) | ⏳ 00:00<? | ?it/s

Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 3: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 3: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly 

In [13]:
default_evals

| | label |
|---|---|
| 0 | relevant |
| 1 | relevant |
| 2 | unrelated |
| 3 | unrelated |
| 4 | unrelated |
| 5 | unrelated |
| 6 | relevant |
| 7 | relevant |
| 8 | relevant |
| 9 | unrelated |


## Including exception details

By setting the `include_exceptions` flag to `True` in `llm_classify`, two additional columns will be provided in the output that will show all exceptions that were encountered during execution, as well as a status that summarizes what happened for each row.

In [14]:
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
evals_with_exception_info = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=funny_model,
    rails=rails,
    concurrency=3,
    include_exceptions=True,
)

llm_classify |          | 0/40 (0.0%) | ⏳ 00:00<? | ?it/s

Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 3: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 4: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 5: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 6: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly 

In [15]:
evals_with_exception_info

| | label | exceptions | execution_status |
|---|---|---|---|
| 0 | relevant | [] | COMPLETED |
| 1 | relevant | [] | COMPLETED |
| 2 | unrelated | [] | COMPLETED |
| 3 | unrelated | [] | COMPLETED |
| 4 | unrelated | [RuntimeError('What could have possibly happened here?!?!?!')] | COMPLETED WITH RETRIES |
| 5 | unrelated | [RuntimeError('What could have possibly happened here?!?!?!')] | COMPLETED WITH RETRIES |
| 6 | relevant | [RuntimeError('What could have possibly happened here?!?!?!'), RuntimeError('What could have possibly happened here?!?!?!'), RuntimeError('What could have possibly happened here?!?!?!'), RuntimeError('What could have possibly happened here?!?!?!'), RuntimeError('What could have possibly happened here?!?!?!'), RuntimeError('What could have possibly happened here?!?!?!')] | COMPLETED WITH RETRIES |
| 7 | relevant | [RuntimeError('What could have possibly happened here?!?!?!')] | COMPLETED WITH RETRIES |
| 8 | relevant | [RuntimeError('What could have possibly happened here?!?!?!'), RuntimeError('What could have possibly happened here?!?!?!')] | COMPLETED WITH RETRIES |
| 9 | unrelated | [] | COMPLETED |


Notice that after a terminal error occurs, `llm_classify` stops early and some rows are left in a `DID NOT RUN` state. We can use a `Counter` to show how many evals did not finish or encountered an error.

In [16]:
Counter(evals_with_exception_info["execution_status"])

Counter({'COMPLETED': 16,
         'DID NOT RUN': 13,
         'COMPLETED WITH RETRIES': 10,
         'MISSING INPUT': 1})
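
A quick back-of-the-envelope calculation shows how unlikely it is for a row to exhaust its retries under the injected 30% failure rate. The sketch below assumes that failures are independent and that a row gets one initial attempt plus up to `max_retries` retries. That attempt budget is an assumption, not something confirmed from the Phoenix source. The helper `prob_retries_exhausted` is hypothetical.

```python
def prob_retries_exhausted(failure_rate: float, max_retries: int) -> float:
    """Probability that the initial attempt and every retry all fail."""
    return failure_rate ** (max_retries + 1)


FAILURE_RATE = 0.3  # the rate injected by FunnyAIModel

for max_retries in (10, 2):
    p = prob_retries_exhausted(FAILURE_RATE, max_retries)
    print(f"max_retries={max_retries}: {p:.2e}")
```

Under these assumptions, a row almost never fails outright with 10 retries, so an early exit in this setup is far more likely to come from missing input than from the injected errors. A much smaller retry limit makes exhausted retries a realistic outcome.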

## Configuring Early Exit Behavior

You can also pass `exit_on_error=False` to `llm_classify`, which will skip rows that either are missing inputs or fail during execution instead of stopping the run. This setting can be combined with `max_retries` to fully configure exception handling behavior.

In [17]:
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
all_evals = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=funny_model,
    rails=rails,
    concurrency=3,
    max_retries=2,
    include_exceptions=True,
    exit_on_error=False,
)

llm_classify |          | 0/40 (0.0%) | ⏳ 00:00<? | ?it/s

Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 2: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly happened here?!?!?!')
Requeuing...
Exception in worker on attempt 1: raised RuntimeError('What could have possibly 

In [18]:
all_evals

| | label | exceptions | execution_status |
|---|---|---|---|
| 0 | relevant | [] | COMPLETED |
| 1 | relevant | [RuntimeError('What could have possibly happened here?!?!?!')] | COMPLETED WITH RETRIES |
| 2 | unrelated | [RuntimeError('What could have possibly happened here?!?!?!'), RuntimeError('What could have possibly happened here?!?!?!')] | COMPLETED WITH RETRIES |
| 3 | unrelated | [] | COMPLETED |
| 4 | unrelated | [] | COMPLETED |
| 5 | unrelated | [] | COMPLETED |
| 6 | relevant | [] | COMPLETED |
| 7 | relevant | [] | COMPLETED |
| 8 | relevant | [] | COMPLETED |
| 9 | unrelated | [RuntimeError('What could have possibly happened here?!?!?!'), RuntimeError('What could have possibly happened here?!?!?!')] | COMPLETED WITH RETRIES |


With `exit_on_error=False`, no evals should be left in a `DID NOT RUN` state.

In [19]:
Counter(all_evals["execution_status"])

Counter({'COMPLETED': 21,
         'COMPLETED WITH RETRIES': 16,
         'MISSING INPUT': 2,
         'FAILED': 1})
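
To summarize how the five status values relate to each other, here is a toy, sequential model of the behavior seen in this notebook. It is a simplified sketch and not Phoenix's implementation. The function `simulate_run` and its row format are hypothetical, real runs are concurrent, and the attempt budget (one initial attempt plus `max_retries` retries) is an assumption.

```python
from collections import Counter


def simulate_run(rows, max_retries=10, exit_on_error=True):
    """Toy, sequential model of per-row statuses; not Phoenix's implementation.

    Each row is (input_text, failures_before_success).
    """
    results = []
    stopped = False
    for input_text, failures_before_success in rows:
        if stopped:
            results.append((None, [], "DID NOT RUN"))
            continue
        if input_text is None:
            # No prompt can be built, so the row is never attempted.
            results.append((None, [], "MISSING INPUT"))
            stopped = exit_on_error
            continue
        # Assumed attempt budget: one initial attempt plus max_retries retries.
        attempts_allowed = max_retries + 1
        n_errors = min(failures_before_success, attempts_allowed)
        exceptions = [RuntimeError("simulated failure") for _ in range(n_errors)]
        if failures_before_success >= attempts_allowed:
            results.append((None, exceptions, "FAILED"))
            stopped = exit_on_error
        elif n_errors > 0:
            results.append(("relevant", exceptions, "COMPLETED WITH RETRIES"))
        else:
            results.append(("relevant", exceptions, "COMPLETED"))
    return results


rows = [("q0", 0), ("q1", 2), (None, 0), ("q3", 5), ("q4", 0)]

early_exit = Counter(status for _, _, status in simulate_run(rows, max_retries=2))
run_all = Counter(
    status for _, _, status in simulate_run(rows, max_retries=2, exit_on_error=False)
)
print(early_exit)
print(run_all)
```

In this toy model, the early-exit run leaves every row after the missing input in `DID NOT RUN`, while the run with `exit_on_error=False` attempts every row and reports `FAILED` only for the row that uses up its attempts.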