<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Toxicity Classification Evals</h1>

Arize provides tooling to evaluate LLM applications, including tools to determine whether a model's generation (or a user's response) is toxic. This detection can flag racist, biased, derogatory, profane, or angry responses.

The purpose of this notebook is:

- to evaluate the performance of LLM-assisted toxicity detection
- to provide an experimental framework for users to iterate and improve on the default classification template.

## Install Dependencies and Import Libraries

In [None]:
#####################
## N_EVAL_SAMPLE_SIZE
#####################
# Eval sample size determines the run time
# 100 samples: GPT-4 ~80 sec / GPT-3.5 ~40 sec
# 1,000 samples: GPT-4 ~15-17 min / GPT-3.5 ~6-7 min (depending on retries)
# 10,000 samples: GPT-4 ~170 min / GPT-3.5 ~70 min
N_EVAL_SAMPLE_SIZE = 100
# Balance the toxicity class data for the test
BALANCE_DATA = True

In [None]:
!pip install -qq "arize-phoenix[experimental]" "openai>=1" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio

## Optional: Patch `asyncio` with `nest_asyncio` 🚀

In a notebook environment (e.g. Jupyter or Google Colab), you can use nest_asyncio to enable async request submission. nest_asyncio globally patches asyncio to enable event loops to be re-entrant. Notebooks run an event loop, and synchronous functions called within them cannot nest an event loop without this patch. This is not required for non-notebook environments.

Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
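The limitation described above can be reproduced with the standard library alone. The sketch below is meant to be run as a plain Python script, outside a notebook; `fetch_label`, `classify_sync`, and `notebook_like_cell` are hypothetical names used only for illustration, and the keyword check stands in for a real LLM call. It shows that a synchronous helper can start its own event loop when none is running, but fails when it is called from inside a loop that is already running, which is the situation in a notebook without the patch.

```python
import asyncio


async def fetch_label(text):
    # Stand-in for an async LLM request (hypothetical, no network call).
    await asyncio.sleep(0)
    return "toxic" if "idiot" in text else "non-toxic"


def classify_sync(text):
    # A synchronous helper that tries to start its own event loop.
    coro = fetch_label(text)
    try:
        return asyncio.run(coro)
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"


async def notebook_like_cell():
    # Simulates code that runs while an event loop is already active.
    return classify_sync("hello")


outside = classify_sync("hello")               # no loop running: works
inside = asyncio.run(notebook_like_cell())     # loop already running: fails
print(outside)
print(inside)
```

With `nest_asyncio.apply()` in effect, the nested call is allowed instead of raising, which is what lets the eval functions submit requests asynchronously from a notebook cell.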

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
import os
from getpass import getpass

import matplotlib.pyplot as plt
import openai
import pandas as pd
from phoenix.experimental.evals import (
    TOXICITY_PROMPT_RAILS_MAP,
    TOXICITY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

pd.set_option("display.max_colwidth", None)

## Download Benchmark Dataset

We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against a benchmark dataset of toxic and non-toxic text with ground-truth labels. Currently supported datasets include:

- "wiki_toxic"


In [None]:
df = download_benchmark_dataset(task="toxicity-classification", dataset_name="wiki_toxic-test")
df.head()

## Display Toxicity Classification Template

View the default template used to classify toxicity. You can tweak this template and evaluate its performance relative to the default.

In [None]:
print(TOXICITY_PROMPT_TEMPLATE)

The template variables are:

- **input:** the text to be classified
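
To make the role of the template variable concrete, here is a minimal sketch of how a row's `input` value could be substituted into a prompt. `EXAMPLE_TEMPLATE` is a hypothetical, simplified template written for this illustration; it is not the text of `TOXICITY_PROMPT_TEMPLATE`, and plain `str.format` is only assumed to resemble what the library does internally.

```python
# Hypothetical, simplified template; NOT the actual TOXICITY_PROMPT_TEMPLATE.
EXAMPLE_TEMPLATE = (
    "Decide whether the text below is toxic.\n"
    "[BEGIN DATA]\n"
    "Text: {input}\n"
    "[END DATA]\n"
    'Answer with a single word, "toxic" or "non-toxic".'
)

# One dataframe row, represented here as a plain dict.
row = {"input": "Have a nice day!"}

# Each row's "input" column fills the {input} placeholder.
prompt = EXAMPLE_TEMPLATE.format(**row)
print(prompt)
```

This is why the dataframe passed to the classifier needs a column named `input`: the column name must match the template variable.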

## Configure the LLM

Configure your OpenAI API key.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

## Benchmark Dataset Sample

Sample size determines run time. We recommend iterating on a small sample first (100 rows), then increasing to a larger test set.

In [None]:
if BALANCE_DATA:
    # The dataset is unbalanced; let's balance it so we can test with smaller sample sizes.
    # At 100 samples, you sometimes get only 6 toxic examples.
    # Split the dataset into two groups: toxic and non-toxic
    toxic_df = df[df["toxic"]]
    non_toxic_df = df[~df["toxic"]]

    # Get the minimum count between the two groups
    min_count = min(len(toxic_df), len(non_toxic_df))

    # Sample the minimum count from each group
    toxic_sample = toxic_df.sample(min_count, random_state=2)
    non_toxic_sample = non_toxic_df.sample(min_count, random_state=2)

    # Concatenate the samples together
    df_sample = pd.concat([toxic_sample, non_toxic_sample], axis=0).sample(
        n=N_EVAL_SAMPLE_SIZE
    )  # The second sample call shuffles the rows and draws the evaluation subset
else:
    df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)
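
Note that drawing `N_EVAL_SAMPLE_SIZE` rows at random from the balanced pool yields a split that is only approximately 50/50. If an exact split matters, one option is to draw half of the sample from each class. The sketch below shows that idea in plain Python on synthetic rows; `balanced_sample` is a hypothetical helper, not part of Phoenix or pandas, and the toy data only mimics the shape of the benchmark.

```python
import random


def balanced_sample(rows, label_key, n, seed=0):
    """Draw n rows with exactly n // 2 from each class (hypothetical helper)."""
    rng = random.Random(seed)
    positives = [r for r in rows if r[label_key]]
    negatives = [r for r in rows if not r[label_key]]
    per_class = n // 2
    if per_class > min(len(positives), len(negatives)):
        raise ValueError("not enough rows in the smaller class")
    sample = rng.sample(positives, per_class) + rng.sample(negatives, per_class)
    rng.shuffle(sample)  # mix the two classes together
    return sample


# Synthetic, unbalanced toy data: 1 in 10 rows is labeled toxic.
rows = [{"text": f"comment {i}", "toxic": i % 10 == 0} for i in range(1000)]

sample = balanced_sample(rows, "toxic", n=100)
toxic_count = sum(r["toxic"] for r in sample)
print(len(sample), toxic_count)
```

The same approach carries over to pandas by sampling `N_EVAL_SAMPLE_SIZE // 2` rows from each of `toxic_df` and `non_toxic_df` before concatenating.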

In [None]:
df_sample = df_sample.rename(
    columns={"text": "input"},
)

Instantiate the LLM and set parameters.

In [None]:
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

In [None]:
model("Hello world, this is a test if you are working?")

## LLM Evals: Toxicity Evals Classifications GPT-4

Run toxicity classifications against a subset of the data using the GPT-4 model instantiated above.

In [None]:
# Rails constrain the output to the specific values defined by the template.
# They strip stray text such as ",,," or "..." and
# ensure the binary value expected from the template is returned.
rails = list(TOXICITY_PROMPT_RAILS_MAP.values())
toxic_classifications = llm_classify(
    dataframe=df_sample,
    template=TOXICITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()
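
As a rough illustration of what rails do, the sketch below maps a raw model response onto one of the allowed labels. It is not Phoenix's implementation: `snap_to_rails` and the `"UNPARSEABLE"` fallback are hypothetical, and the rail values `"toxic"` and `"non-toxic"` are assumed here. Because `"toxic"` is a substring of `"non-toxic"`, the pattern rejects matches that are preceded by a hyphen or a word character.

```python
import re


def snap_to_rails(raw_output, rails, unparseable="UNPARSEABLE"):
    """Return the single rail found in raw_output, else a fallback label."""
    cleaned = raw_output.strip().lower()
    matches = []
    for rail in rails:
        # Require that the rail is not glued to other word characters or
        # hyphens, so "toxic" does not match inside "non-toxic".
        pattern = r"(?<![\w-])" + re.escape(rail.lower()) + r"(?![\w-])"
        if re.search(pattern, cleaned):
            matches.append(rail)
    # Ambiguous or missing answers are reported as unparseable.
    return matches[0] if len(matches) == 1 else unparseable


example_rails = ["toxic", "non-toxic"]  # assumed rail values
print(snap_to_rails("toxic...", example_rails))
print(snap_to_rails(" Non-toxic,,, ", example_rails))
print(snap_to_rails("I cannot decide", example_rails))
```

Returning a distinct fallback label rather than guessing keeps unparseable responses visible in the evaluation instead of silently counting them as one class.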


Evaluate the predictions against human-labeled ground-truth toxicity labels.

In [None]:
true_labels = df_sample["toxic"].map(TOXICITY_PROMPT_RAILS_MAP).tolist()

print(classification_report(y_true=true_labels, y_pred=toxic_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=toxic_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
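
For readers who want to see what the per-class numbers in the classification report mean, the following sketch computes precision, recall, and F1 for one class by hand. The label lists are made-up toy values, not results from this notebook, and `precision_recall_f1` is a hypothetical helper written for this illustration.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Toy labels for illustration only.
toy_true = ["toxic", "toxic", "non-toxic", "non-toxic", "toxic", "non-toxic"]
toy_pred = ["toxic", "non-toxic", "non-toxic", "toxic", "toxic", "non-toxic"]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred, positive="toxic")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of the texts labeled toxic, how many really were?", while recall answers "of the truly toxic texts, how many were caught?"; the normalized confusion matrix shows the same information as row-wise rates.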

## LLM Evals: Toxicity Evals Classifications GPT-3.5
Instantiate the LLM and set parameters.
Run toxicity classifications against a subset of the data.

In [None]:
model = OpenAIModel(model_name="gpt-3.5-turbo", temperature=0.0, request_timeout=20)

In [None]:
rails = list(TOXICITY_PROMPT_RAILS_MAP.values())
toxic_classifications = llm_classify(
    dataframe=df_sample,
    template=TOXICITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()

In [None]:
true_labels = df_sample["toxic"].map(TOXICITY_PROMPT_RAILS_MAP).tolist()

print(classification_report(y_true=true_labels, y_pred=toxic_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=toxic_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)

## LLM Evals: Toxicity Evals Classifications GPT-4 Turbo
Instantiate the LLM and set parameters.
Run toxicity classifications against a subset of the data.

In [None]:
model = OpenAIModel(model_name="gpt-4-1106-preview", temperature=0.0)

In [None]:
rails = list(TOXICITY_PROMPT_RAILS_MAP.values())
toxic_classifications = llm_classify(
    dataframe=df_sample,
    template=TOXICITY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()

In [None]:
true_labels = df_sample["toxic"].map(TOXICITY_PROMPT_RAILS_MAP).tolist()

print(classification_report(y_true=true_labels, y_pred=toxic_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=toxic_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)