<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Retrieval Relevance Evals</h1>

Phoenix evals are designed to be robust to many kinds of errors, providing many tools to control error handling and retry behavior, as well as the ability to surface details about what happened during long eval runs.

In this notebook, we'll simulate various kinds of errors that might happen while running evals and show different ways Phoenix evals can work with them.

## Install Dependencies and Import Libraries

In [None]:
N_EVAL_SAMPLE_SIZE = 40

In [None]:
!pip install -qq "arize-phoenix-evals" "openai>=1" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
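
To see why the patch matters, here is a minimal sketch using only the standard library (it is not Phoenix code, and the coroutine names are hypothetical). It shows that an unpatched `asyncio` refuses to start a second event loop from inside one that is already running, which is the situation a notebook kernel is in.

```python
import asyncio


async def fetch_answer():
    # Hypothetical stand-in for an async eval request.
    return 42


async def notebook_cell():
    # Simulates code executing inside an already-running event loop,
    # as happens in Jupyter or Colab.
    coro = fetch_answer()
    try:
        asyncio.run(coro)  # nested loop start: rejected by unpatched asyncio
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return str(exc)
    return "nested run succeeded"


message = asyncio.run(notebook_cell())
print(message)
```

After `nest_asyncio.apply()`, the nested call is allowed, which is what lets async submission work from a notebook cell.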

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
import os
from collections import Counter
from getpass import getpass

import pandas as pd
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

pd.set_option("display.max_colwidth", None)

## Download Dataset

In [None]:
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-train"
)

## Configure a test LLM

Configure your OpenAI API key.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

## Sample Input Dataset
Sample size determines run time. We recommend iterating on a small sample first (around 100 rows), then increasing to a larger test set.

In [None]:
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)
df_sample = df_sample.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

## Run LLM Evals
Run relevance against a subset of the data.
Instantiate the LLM and set parameters.

## Set up test model wrapper

To demonstrate error handling while running evals, we'll first remove some required input data from our sampled dataset.

Second, we'll create a buggy model that inherits from the `OpenAIModel` wrapper to simulate spurious errors that might occur when trying to run evals.
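
Outside of a demonstration like this one, it can help to find rows with missing inputs before starting a run. The following is a minimal sketch, not part of Phoenix: the helper names are hypothetical, and it assumes the template needs the `input` and `reference` columns used in this notebook.

```python
import math

# Assumption: these are the template variables every row must provide.
REQUIRED_COLUMNS = ("input", "reference")


def is_missing(value):
    # Treat both None and float NaN as missing, since pandas may store either.
    return value is None or (isinstance(value, float) and math.isnan(value))


def rows_missing_inputs(rows):
    """Map row index to the list of required columns that are missing."""
    problems = {}
    for index, row in enumerate(rows):
        missing = [col for col in REQUIRED_COLUMNS if is_missing(row.get(col))]
        if missing:
            problems[index] = missing
    return problems


rows = [
    {"input": "who wrote Hamlet?", "reference": "Hamlet is a tragedy by William Shakespeare."},
    {"input": "capital of France?", "reference": None},
    {"input": None, "reference": "Paris is the capital of France."},
]
problems = rows_missing_inputs(rows)
print(problems)
```

For the sampled DataFrame, `df_sample.to_dict("records")` would produce rows in the shape this helper expects.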

In [None]:
df_sample.loc[28, "reference"] = None
df_sample.loc[37, "input"] = None

In [None]:
import random


class FunnyAIModel(OpenAIModel):
    async def _async_generate(self, *args, **kwargs):
        if random.random() < 0.3:
            raise RuntimeError("What could have possibly happened here?!?!?!")
        return await super()._async_generate(*args, **kwargs)

In [None]:
funny_model = FunnyAIModel(
    model="gpt-4o",
    temperature=0.0,
)

In [None]:
funny_model("Hello world, this is a test if you are working?")

## Default Behavior

The default behavior is to retry (with a default maximum of 10) on exceptions while running evals. However, if input data is missing and a prompt cannot be generated from the template, that row will fail. `llm_classify` will then return early, and the rows that were not run will not have an eval.
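
The retry idea can be illustrated with a small standalone sketch. This is not the Phoenix executor: `FlakyError`, `flaky_call`, and `call_with_retries` are hypothetical names, and the sketch assumes that a maximum of 10 retries means one initial attempt plus up to 10 more.

```python
import asyncio
import random


class FlakyError(RuntimeError):
    """Hypothetical transient failure, standing in for a spurious model error."""


async def flaky_call(rng, failure_rate=0.3):
    # Fails with the given probability, otherwise returns a label.
    if rng.random() < failure_rate:
        raise FlakyError("transient failure")
    return "relevant"


async def call_with_retries(rng, max_retries=10):
    # Assumed semantics: one initial attempt plus up to `max_retries` retries.
    exceptions = []
    for _ in range(max_retries + 1):
        try:
            label = await flaky_call(rng)
        except FlakyError as exc:
            exceptions.append(repr(exc))  # record the failure, then try again
            continue
        return label, exceptions
    return None, exceptions  # every attempt failed


rng = random.Random(1)  # seeded so the run is repeatable
label, exceptions = asyncio.run(call_with_retries(rng))
print(label, len(exceptions))
```

Collecting the exceptions alongside the result mirrors the kind of per-row detail that is useful when diagnosing a long eval run.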

In [None]:

rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
default_evals = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=funny_model,
    rails=rails,
    concurrency=3,
)

In [None]:
default_evals

## Including exception details

By setting the `include_exceptions` flag to `True` in `llm_classify`, two additional columns will be provided in the output that will show all exceptions that were encountered during execution, as well as a status that summarizes what happened for each row.

In [None]:
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
evals_with_exception_info = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=funny_model,
    rails=rails,
    concurrency=3,
    include_exceptions=True,
)

In [None]:
evals_with_exception_info

Notice that after a terminal error occurs, `llm_classify` stops early and some rows are left in a `DID NOT RUN` state. We can use a `Counter` to show how many evals did not finish or encountered an error.

In [None]:
Counter(evals_with_exception_info["execution_status"])

## Configuring Early Exit Behavior

You can also pass `exit_on_error=False` to `llm_classify`, so that rows that are missing inputs or that fail during execution are skipped instead of stopping the run. This setting can be combined with `max_retries` to fully configure exception handling behavior.
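
As a rough guide to choosing `max_retries`, the arithmetic below estimates how often a row would exhaust its retries when each attempt fails independently with probability 0.3, as in the simulated `FunnyAIModel`. It is a back-of-the-envelope sketch: the function names are hypothetical, it assumes one initial attempt plus `max_retries` retries, and it ignores rows with missing inputs.

```python
def exhaustion_probability(failure_rate, max_retries):
    # Assumption: a row fails only if the initial attempt and every retry fail.
    attempts = max_retries + 1
    return failure_rate**attempts


def expected_failed_rows(n_rows, failure_rate, max_retries):
    # Expected number of rows that run out of retries.
    return n_rows * exhaustion_probability(failure_rate, max_retries)


for max_retries in (2, 10):
    p = exhaustion_probability(0.3, max_retries)
    expected = expected_failed_rows(40, 0.3, max_retries)
    print(f"max_retries={max_retries}: P(row fails)={p:.2e}, expected failures in 40 rows={expected:.4f}")
```

Under these assumptions, `max_retries=2` gives a per-row failure probability of 0.3³ = 0.027, or about one failed row in a 40-row sample, while the default of 10 makes such failures very unlikely.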

In [None]:
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
all_evals = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=funny_model,
    rails=rails,
    concurrency=3,
    max_retries=2,
    include_exceptions=True,
    exit_on_error=False,
)

In [None]:
all_evals

With `exit_on_error=False`, no evals should be left in a `DID NOT RUN` state.

In [None]:
Counter(all_evals["execution_status"])