<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Q&A Classification Evals</h1>

The purpose of this notebook is:

- to evaluate the performance of an LLM-assisted approach to detecting issues with Q&A systems on retrieved context data
- to provide an experimental framework for users to iterate and improve on the default classification template.

## Install Dependencies and Import Libraries

In [None]:
#####################
## N_EVAL_SAMPLE_SIZE
#####################
# Eval sample size determines the run time
# 100 samples: GPT-4 ~80 sec / GPT-3.5 ~40 sec
# 1,000 samples: GPT-4 ~15-17 min / GPT-3.5 ~6-7 min (depending on retries)
# 10,000 samples: GPT-4 ~170 min / GPT-3.5 ~70 min
N_EVAL_SAMPLE_SIZE = 100

In [None]:
!pip install -qq "arize-phoenix-evals" "openai>=1" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
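
The restriction that `nest_asyncio` lifts can be reproduced with the standard library alone. The sketch below is illustrative only: it does not use `nest_asyncio`, and the function names `inner` and `outer` are made up. It shows that plain `asyncio` refuses to start a second event loop from inside a running one, which is the situation a notebook kernel creates.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    # An event loop is already running here, as it is inside a notebook kernel.
    coro = inner()
    try:
        asyncio.run(coro)  # attempt to start a nested event loop
    except RuntimeError as exc:
        coro.close()  # avoid a "never awaited" warning for the unused coroutine
        return f"RuntimeError: {exc}"
    return "nested loop allowed"


message = asyncio.run(outer())
print(message)
```

Without patching, the nested call fails, so async submission is unavailable inside the notebook's own loop.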

In [None]:
import nest_asyncio

nest_asyncio.apply()

In [None]:
import os
from getpass import getpass

import matplotlib.pyplot as plt
import openai
import pandas as pd

In [None]:
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

from phoenix.evals import (
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

pd.set_option("display.max_colwidth", None)

## Download Benchmark Dataset



- SQuAD 2.0:
The Stanford Question Answering Dataset (SQuAD 2.0) is a large-scale dataset that allows researchers to design AI models for reading comprehension tasks under challenging constraints.
https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/reports/default/15785042.pdf
- Supplemental data to SQuAD 2.0: To test the detection of incorrect answers, we created wrong answers based on the context data. The wrong answers are intermixed with the right answers.
- `sampled_answer` is a column that randomly contains either the original SQuAD 2.0 answer or the incorrect answer.

In [None]:
df = download_benchmark_dataset(task="qa-classification", dataset_name="qa_generated_dataset")

- **question**: The question posed to the Q&A system.
- **sampled_answer**: A random sample of either correct_answer from SQuAD 2.0 or wrong_answer, a made-up incorrect answer. This is the column under test, since it contains both right and wrong answers.
- **answer_true**: True if the sampled answer is correct, False if not. This is the ground truth to test against; the evaluation cells in this notebook read it as `df_sample["answer_true"]`.
- **answers**: The right answer to the question.
- **wrong_answer**: An incorrect answer generated from the context.
- **context**: The context to be used to answer the question. The Q&A Eval must use it to check whether the answer is correct.
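
As a rough illustration of how a mixed answer column and its ground-truth flag relate, here is a minimal sketch on made-up rows. The rows, the field names `right_answer` and `is_correct`, and the 50/50 split are all assumptions for illustration; they are not taken from the benchmark itself.

```python
import random

# Hypothetical toy rows (not real benchmark data).
rows = [
    {"question": "What color is a clear daytime sky?",
     "context": "On a clear day the sky appears blue.",
     "right_answer": "Blue", "wrong_answer": "Green"},
    {"question": "How many legs does a spider have?",
     "context": "Spiders are arachnids and have eight legs.",
     "right_answer": "Eight", "wrong_answer": "Six"},
    {"question": "What do bees collect from flowers?",
     "context": "Bees collect nectar and pollen from flowers.",
     "right_answer": "Nectar and pollen", "wrong_answer": "Honey"},
]

rng = random.Random(0)  # fixed seed so the toy sampling is repeatable
for row in rows:
    use_right = rng.random() < 0.5
    # The answer under test is either the right or the wrong answer...
    row["sampled_answer"] = row["right_answer"] if use_right else row["wrong_answer"]
    # ...and the flag records which one was drawn.
    row["is_correct"] = use_right

# The flag always agrees with which answer ended up in the sampled column.
consistent = all(
    (row["sampled_answer"] == row["right_answer"]) == row["is_correct"] for row in rows
)
print(consistent)
```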



In [None]:
df.head()

## Display Binary Q&A Classification Template

View the default template used to classify Q&A answers as correct or incorrect. You can tweak this template and evaluate its performance relative to the default.
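
When tweaking the template, it helps to see how placeholders get filled from a row. The sketch below uses a made-up template string, `CUSTOM_QA_TEMPLATE`, and plain `str.format`. The placeholder names `{input}`, `{reference}`, and `{output}` are an assumption inferred from the column renaming in the sampling cell later in this notebook, and the "correct"/"incorrect" labels are likewise assumed. It does not reproduce the default template's wording.

```python
# Hypothetical custom template; the real default is QA_PROMPT_TEMPLATE.
CUSTOM_QA_TEMPLATE = """You are given a question, a reference text, and an answer.
[Question]: {input}
[Reference]: {reference}
[Answer]: {output}
Respond with a single word, "correct" or "incorrect"."""

# One made-up row using the assumed placeholder names as keys.
row = {
    "input": "How many legs does a spider have?",
    "reference": "Spiders are arachnids and have eight legs.",
    "output": "Six",
}

# Each placeholder is replaced by the matching value from the row.
prompt = CUSTOM_QA_TEMPLATE.format(**row)
print(prompt)
```

If a placeholder has no matching key in the row, `str.format` raises `KeyError`, which is a quick way to catch a column that was not renamed.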

In [None]:
print(QA_PROMPT_TEMPLATE)

## Configure the API Key

Configure your OpenAI API key.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

## Benchmark Dataset Sample
Sample size determines run time. We recommend iterating on a small sample first (100 rows), then increasing to a larger test set.
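
When iterating on a template, drawing the same sample on every run keeps comparisons fair. The sketch below shows the idea with the standard library on stand-in row indices; the seed values and the helper name `draw` are arbitrary choices for illustration. With pandas, `DataFrame.sample` accepts a `random_state` argument that serves the same purpose.

```python
import random

population = list(range(1_000))  # stand-in for dataset row indices
sample_size = 100


def draw(seed: int) -> list:
    # A dedicated generator seeded per call makes the draw repeatable.
    return random.Random(seed).sample(population, sample_size)


first = draw(42)
second = draw(42)
print(first == second)  # same seed, same rows
```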

In [None]:
df_sample = (
    df.sample(n=N_EVAL_SAMPLE_SIZE)
    .reset_index(drop=True)
    .rename(
        columns={
            "question": "input",
            "context": "reference",
            "sampled_answer": "output",
        }
    )
)

## LLM Evals: Q&A Classifications GPT-4
Run Q&A classifications against a subset of the data.

Instantiate the LLM and set parameters.

In [None]:
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

In [None]:
model("Hello world, this is a test if you are working?")

Run the LLM Eval using the template against the dataset. `llm_classify` is the main Eval function.
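
The `rails` argument used in the next cell constrains each model response to one of the allowed labels. The following is a minimal sketch of that idea, not the Phoenix implementation: the helper `snap_to_rails`, the `UNPARSABLE` placeholder, and the "correct"/"incorrect" label values are all assumptions for illustration.

```python
import re

RAILS = ["correct", "incorrect"]  # assumed label values
UNPARSABLE = "NOT_PARSABLE"  # hypothetical placeholder for unusable output


def snap_to_rails(raw_output: str, rails: list) -> str:
    """Return the single rail found in the output, or a placeholder."""
    text = raw_output.strip().lower()
    # Word boundaries keep "correct" from matching inside "incorrect".
    matches = [
        rail for rail in rails
        if re.search(rf"\b{re.escape(rail.lower())}\b", text)
    ]
    return matches[0] if len(matches) == 1 else UNPARSABLE


print(snap_to_rails("Correct...", RAILS))
print(snap_to_rails(",,,incorrect", RAILS))
print(snap_to_rails("I am not sure", RAILS))
```

Treating ambiguous output (no rail, or more than one) as unparsable keeps noisy responses out of the label counts rather than guessing.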

In [None]:
# The rails constrain the output to the specific values defined by the template.
# Extraneous text such as ",,," or "..." is removed, along with anything else
# that is not one of the binary values the template expects.
rails = list(QA_PROMPT_RAILS_MAP.values())
Q_and_A_classifications = llm_classify(
    dataframe=df_sample,
    template=QA_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()



Evaluate the predictions against human-labeled ground-truth Q&A labels.
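
To make the numbers in the classification report concrete, here is a small standard-library sketch of precision, recall, and F1 for one label. The flags and predictions are made-up toy values, and the `RAILS_MAP` contents are an assumption standing in for `QA_PROMPT_RAILS_MAP`.

```python
RAILS_MAP = {True: "correct", False: "incorrect"}  # assumed mapping

# Toy ground-truth flags and toy predictions (not real results).
flags = [True, True, False, False, True, False]
true_labels = [RAILS_MAP[flag] for flag in flags]
predicted = ["correct", "incorrect", "incorrect", "correct", "correct", "incorrect"]


def precision_recall_f1(true, pred, positive):
    pairs = list(zip(true, pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(true_labels, predicted, "correct")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```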

In [None]:
true_labels = df_sample["answer_true"].map(QA_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, Q_and_A_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=Q_and_A_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)

## LLM Evals: Q&A Classifications GPT-3.5


Repeat the classification with GPT-3.5, then evaluate the predictions against the human-labeled ground-truth Q&A labels.
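
Each evaluation cell in this notebook plots its confusion matrix with `normalized=True`. The sketch below shows what a row-normalized matrix means, using toy labels and the standard library. It assumes normalization is by actual class (each row sums to 1), and the label values are assumed as well.

```python
from collections import Counter

classes = ["correct", "incorrect"]  # assumed label values

# Toy labels (not real results).
actual = ["correct", "correct", "correct", "correct", "incorrect", "incorrect"]
predicted = ["correct", "correct", "correct", "incorrect", "incorrect", "correct"]

# Count (actual, predicted) pairs, then divide each row by its total.
counts = Counter(zip(actual, predicted))
normalized = {}
for a in classes:
    row_total = sum(counts[(a, p)] for p in classes)
    normalized[a] = {
        p: (counts[(a, p)] / row_total if row_total else 0.0) for p in classes
    }

for a in classes:
    print(a, normalized[a])
```

Row normalization makes the diagonal read as per-class recall, which stays comparable across models even when the classes are imbalanced.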

In [None]:
model = OpenAIModel(model_name="gpt-3.5-turbo", temperature=0.0, request_timeout=20)

In [None]:
Q_and_A_classifications = llm_classify(
    dataframe=df_sample,
    template=QA_PROMPT_TEMPLATE,
    model=model,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    concurrency=20,
)["label"].tolist()

In [None]:
true_labels = df_sample["answer_true"].map(QA_PROMPT_RAILS_MAP).tolist()
classes = list(QA_PROMPT_RAILS_MAP.values())

print(classification_report(true_labels, Q_and_A_classifications, labels=classes))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=Q_and_A_classifications, classes=classes
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)

## LLM Evals: Q&A Classifications GPT-4 Turbo


Repeat the classification with GPT-4 Turbo, then evaluate the predictions against the human-labeled ground-truth Q&A labels.
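
The GPT-4, GPT-3.5, and GPT-4 Turbo sections repeat the same classify-then-score steps. One way to factor that out is a loop over named classifiers, sketched below with stub functions in place of real model calls. The helpers `accuracy` and `compare`, the stub classifiers, and the toy inputs are all hypothetical; in the notebook each callable would wrap an `llm_classify` call with a different model.

```python
from typing import Callable, Dict, List

Classifier = Callable[[List[str]], List[str]]


def accuracy(true_labels: List[str], predictions: List[str]) -> float:
    hits = sum(t == p for t, p in zip(true_labels, predictions))
    return hits / len(true_labels) if true_labels else 0.0


def compare(classifiers: Dict[str, Classifier], inputs: List[str],
            true_labels: List[str]) -> Dict[str, float]:
    # Run every classifier on the same inputs and score it the same way.
    return {
        name: accuracy(true_labels, classify(inputs))
        for name, classify in classifiers.items()
    }


# Stub classifiers standing in for real model calls.
def always_correct(inputs: List[str]) -> List[str]:
    return ["correct"] * len(inputs)


def keyword_stub(inputs: List[str]) -> List[str]:
    return ["incorrect" if "wrong" in text else "correct" for text in inputs]


inputs = ["right answer", "wrong answer", "right answer", "wrong answer"]
true_labels = ["correct", "incorrect", "correct", "incorrect"]
scores = compare({"always_correct": always_correct, "keyword_stub": keyword_stub},
                 inputs, true_labels)
print(scores)
```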

In [None]:
model = OpenAIModel(model_name="gpt-4-turbo-preview", temperature=0.0)

In [None]:
Q_and_A_classifications = llm_classify(
    dataframe=df_sample,
    template=QA_PROMPT_TEMPLATE,
    model=model,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    concurrency=20,
)["label"].tolist()

In [None]:
true_labels = df_sample["answer_true"].map(QA_PROMPT_RAILS_MAP).tolist()
classes = list(QA_PROMPT_RAILS_MAP.values())

print(classification_report(true_labels, Q_and_A_classifications, labels=classes))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=Q_and_A_classifications, classes=classes
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)