<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Toxicity Classification Evals</h1>

Arize provides tooling to evaluate LLM applications, including tools to determine whether a piece of text is toxic. To decide whether each text is toxic, our approach is straightforward: ask an LLM.

The purpose of this notebook is:

- to evaluate the performance of an LLM-assisted approach to toxicity classification against datasets with ground-truth toxicity labels,
- to provide an experimental framework for users to iterate and improve on the default classification template.

## Install Dependencies and Import Libraries

In [11]:
!pip install -qq "arize-phoenix[experimental]==0.0.33rc6" ipython matplotlib openai pycm scikit-learn

In [12]:
import os
from getpass import getpass
import pandas as pd
import matplotlib.pyplot as plt
import openai
from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    TOXIC_PROMPT_TEMPLATE_STR,
    TOXIC_PROMPT_RAILS,  # used as `rails` below; assumed to be exported alongside the template
    OpenAiModel,
    download_benchmark_dataset,
    llm_eval_binary,
)
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

ImportError: cannot import name 'TOXIC_PROMPT_TEMPLATE_STR' from 'phoenix.experimental.evals' (/Users/amankhan/.pyenv/versions/3.10.4/lib/python3.10/site-packages/phoenix/experimental/evals/__init__.py)

## Download Benchmark Dataset

We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against benchmark datasets of text comments with ground-truth toxicity labels. Currently supported datasets include:

- "wiki_toxic"


In [6]:
df = pd.read_parquet('https://huggingface.co/api/datasets/OxAISH-AL-LLM/wiki_toxic/parquet/default/test/0.parquet')
df.head()

|   | id | comment_text | label |
|---|---|---|---|
| 0 | 0001ea8717f6de06 | Thank you for understanding. I think very high... | 0 |
| 1 | 000247e83dcc1211 | :Dear god this site is horrible. | 0 |
| 2 | 0002f87b16116a7f | "::: Somebody will invariably try to add Relig... | 0 |
| 3 | 0003e1cccfd5a40a | " \n\n It says it right there that it IS a typ... | 0 |
| 4 | 00059ace3e3e9a53 | " \n\n == Before adding a new product to the l... | 0 |


## Display Toxicity Classification Template

View the default template used to classify toxicity. You can tweak this template and evaluate its performance relative to the default.

In [13]:
print(TOXIC_PROMPT_TEMPLATE_STR)

NameError: name 'TOXIC_PROMPT_TEMPLATE_STR' is not defined

The template variables are:

- **text:** the text provided by a user
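
As a rough illustration of how a single template variable is used, the sketch below fills a `{text}` placeholder with one comment. The template string, the `{text}` placeholder syntax, and the `render_prompt` helper are all hypothetical; they are not the contents of `TOXIC_PROMPT_TEMPLATE_STR` or part of the Phoenix API.

```python
# Hypothetical template for illustration only; it is NOT the text of
# TOXIC_PROMPT_TEMPLATE_STR.
EXAMPLE_TEMPLATE = (
    "You are examining written text content.\n"
    "[BEGIN DATA]\n"
    "[Text]: {text}\n"
    "[END DATA]\n"
    'Answer with a single word, "toxic" or "non-toxic".'
)


def render_prompt(template: str, text: str) -> str:
    """Substitute one row's text into the template's {text} placeholder."""
    return template.format(text=text)


prompt = render_prompt(EXAMPLE_TEMPLATE, "Thank you for understanding.")
print(prompt)
```

When you write your own template, keeping the variable name identical to the dataframe column (`text`) is what lets each row be substituted into the prompt.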

## Configure the LLM

Configure your OpenAI API key.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

Instantiate the LLM and set parameters.

In [None]:
model = OpenAiModel(
    model_name="gpt-4",
    temperature=0.0,
)

## Run Toxicity Classifications

Run toxicity classifications against a subset of the data.

In [None]:
df = df.sample(n=100).reset_index(drop=True)
df = df.rename(
    columns={
        "comment_text": "text"
    },
)

In [None]:
toxic_classifications = llm_eval_binary(
    dataframe=df,
    template=TOXIC_PROMPT_TEMPLATE_STR,
    model=model,
    rails=TOXIC_PROMPT_RAILS,
)
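
The `rails` argument constrains the classifier's output to a fixed set of labels. The sketch below shows one way raw LLM text could be snapped to such a set; it is an assumption made for illustration, not Phoenix's implementation, and both `EXAMPLE_RAILS` and `snap_to_rails` are hypothetical names.

```python
from typing import List, Optional

# Assumed rail values for illustration; the real values come from TOXIC_PROMPT_RAILS.
EXAMPLE_RAILS = ["toxic", "non-toxic"]


def snap_to_rails(raw_output: str, rails: List[str]) -> Optional[str]:
    """Return the rail that matches the model's raw output, or None if nothing matches."""
    # Drop surrounding whitespace, quotes, and trailing periods, then compare case-insensitively.
    cleaned = raw_output.strip().strip("\"'.").lower()
    for rail in rails:
        if cleaned == rail.lower():
            return rail
    return None


print(snap_to_rails(" Toxic.\n", EXAMPLE_RAILS))
print(snap_to_rails("non-toxic", EXAMPLE_RAILS))
print(snap_to_rails("I cannot decide", EXAMPLE_RAILS))
```

The sketch uses exact matching rather than substring matching because "non-toxic" contains "toxic", so a substring check would mislabel non-toxic answers.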

## Evaluate Classifications

Evaluate the predictions against human-labeled ground-truth toxicity labels.

In [None]:
classes = TOXIC_PROMPT_RAILS
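# Ground truth is stored in the "label" column (1 = toxic, 0 = non-toxic).
# Assumes the toxic rail comes first in TOXIC_PROMPT_RAILS.
true_labels = df["label"].map({1: classes[0], 0: classes[1]}).tolist()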
predicted_labels = toxic_classifications

print(classification_report(true_labels, predicted_labels, labels=classes))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=predicted_labels, classes=classes
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
);
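
To make the per-class numbers in the classification report easier to interpret, here is a minimal pure-Python sketch of precision, recall, and F1 for a single positive class. The label values and the `binary_metrics` helper are hypothetical and the data is made up for illustration; it is not a result from this notebook.

```python
from typing import Dict, List


def binary_metrics(
    true_labels: List[str], predicted_labels: List[str], positive: str
) -> Dict[str, float]:
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only.
example_true = ["toxic", "toxic", "non-toxic", "non-toxic", "toxic"]
example_pred = ["toxic", "non-toxic", "non-toxic", "toxic", "toxic"]
metrics = binary_metrics(example_true, example_pred, positive="toxic")
print(metrics)
```

In this toy example there are two true positives, one false positive, and one false negative, so precision, recall, and F1 for the "toxic" class all equal 2/3.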