<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Human Ground Truth Versus AI Evals</h1>

Arize provides tooling to evaluate LLM applications, including tools to determine whether AI answers match human ground-truth answers. In many Q&A systems, it's important to compare the AI answers against human answers prior to deployment. These comparisons help assess how often the AI system generates correct answers.

The purpose of this notebook is:

- to evaluate the performance of an LLM-assisted Eval that compares AI answers against human answers
- to provide an experimental framework for users to iterate and improve on the default classification template.

## Install Dependencies and Import Libraries

In [None]:
# Requires arize-phoenix because the notebook uses the UI / tracing
!pip install -qq "arize-phoenix-evals" "arize-phoenix" "openai>=1" ipython matplotlib pycm scikit-learn tiktoken

In [None]:
import os
from getpass import getpass

import matplotlib.pyplot as plt
import pandas as pd
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

from phoenix.evals import (
    HUMAN_VS_AI_PROMPT_RAILS_MAP,
    HUMAN_VS_AI_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

pd.set_option("display.max_colwidth", None)

## Download the Dataset

We've crafted a dataset of common questions and answers about the Arize platform.

In [None]:
csv_file_path = "https://storage.googleapis.com/arize-phoenix-assets/evals/human_vs_ai/human_vs_ai_classifications.csv"

# Read the CSV file into a DataFrame
df = pd.read_csv(csv_file_path).dropna(subset=["correct_answer"]).reset_index(drop=True)
df.head()

## Visualization of Prompts/Templates Evals in Phoenix (Optional Section)

Visualizing Evals is not required, but it can be helpful to see the actual calls to the LLM.
The cell below starts the Phoenix UI/server and shows a link to Phoenix running locally.

In [None]:
import phoenix as px
from phoenix.trace.openai import OpenAIInstrumentor

px.launch_app().view()

OpenAIInstrumentor().instrument()

## Human vs AI Template

View the default template used to evaluate the AI answers.

In [None]:
print(HUMAN_VS_AI_PROMPT_TEMPLATE)

The template variables are:

- **question:** the question asked by a user
- **correct_answer:** human labeled correct answer 
- **ai_answer:** AI generated answer
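
As a rough illustration of how these variables are used, the sketch below fills a stand-in template from one row of data. The template text, the `example_row` values, and the variable names here are hypothetical and are not the contents of `HUMAN_VS_AI_PROMPT_TEMPLATE`; the only point is that each placeholder is matched to a column of the same name.

```python
# Hypothetical stand-in template: NOT the real HUMAN_VS_AI_PROMPT_TEMPLATE.
example_template = (
    "[Question]: {question}\n"
    "[Human Ground Truth Answer]: {correct_answer}\n"
    "[AI Answer]: {ai_answer}\n"
)

# Hypothetical row; in the notebook these values come from columns of `df`.
example_row = {
    "question": "Which file formats can I upload?",
    "correct_answer": "CSV and Parquet files are supported.",
    "ai_answer": "You can upload CSV or Parquet files.",
}

# Each placeholder is replaced by the value stored under the same name.
filled_prompt = example_template.format(**example_row)
print(filled_prompt)
```

If a column named in the template is missing from the data, the formatting step above raises a `KeyError`, so it is worth checking the column names before running a full evaluation.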

## Configure the LLM

Configure your OpenAI API key.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

## LLM Evals: Human Groundtruth vs AI GPT-4
Run the Human vs AI Eval against the dataset.
Instantiate the LLM and set parameters.

In [None]:
model = OpenAIModel(
    model_name="gpt-4",
    temperature=0.0,
)

In [None]:
model("Hello world, this is a test. Are you working?")

## Classifications with explanations

When evaluating a dataset for correctness, it can be useful to know why the LLM classified an AI answer as correct or incorrect. The following code block runs `llm_classify` with explanations turned on so that we can inspect why the LLM made the classification it did. There is a speed tradeoff since more tokens are generated, but the explanations can be highly informative when troubleshooting.

In [None]:
import nest_asyncio

nest_asyncio.apply()
# The rails constrain the output to the specific values defined by the template.
# They strip stray text such as ",,," or "..."
# and ensure the binary value expected from the template is returned.
rails = list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df,
    template=HUMAN_VS_AI_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    verbose=False,
    provide_explanation=True,
    concurrency=50,
)
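
To make the role of the rails more concrete, here is a minimal sketch of how a raw LLM response could be snapped to one of the allowed values. This is a simplified illustration, not the library's implementation: the function name `snap_to_rail_sketch`, the `NOT_PARSABLE` placeholder, and the rail values `"correct"` and `"incorrect"` are assumptions made for the example.

```python
# Placeholder label for unusable responses (an assumption for this sketch).
NOT_PARSABLE = "NOT_PARSABLE"


def snap_to_rail_sketch(raw_output, rails):
    """Return the single rail found in raw_output, or NOT_PARSABLE."""
    text = raw_output.strip().lower()
    found = []
    # Check longer rails first so "incorrect" is not mistaken for "correct".
    for rail in sorted(rails, key=len, reverse=True):
        if rail.lower() in text:
            found.append(rail)
            # Remove the match so a shorter rail cannot match inside it.
            text = text.replace(rail.lower(), "")
    # Accept the response only when exactly one rail was present.
    return found[0] if len(found) == 1 else NOT_PARSABLE


example_rails = ["correct", "incorrect"]  # assumed rail values
for raw in ["Incorrect...", " correct,,,", "correct or incorrect", "maybe"]:
    print(repr(raw), "->", snap_to_rail_sketch(raw, example_rails))
```

Responses that contain no rail, or more than one, are flagged rather than guessed, which keeps ambiguous outputs visible when the results are reviewed.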

## Evaluate Classifications

Evaluate the predictions against the human-labeled ground-truth labels.

In [None]:
true_labels = df["true_value"].map(HUMAN_VS_AI_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, relevance_classifications["label"], labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=list(true_labels),
    predict_vector=list(relevance_classifications["label"]),
    classes=rails,
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
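
For readers who want to see what the per-class numbers in the report mean, the following sketch computes precision and recall by hand for a small made-up set of labels. The helper name `precision_recall` and the toy label lists are hypothetical, and the label values `"correct"` and `"incorrect"` are assumed; the notebook itself relies on `classification_report` for these figures.

```python
def precision_recall(true_labels, predicted_labels, positive):
    """Compute precision and recall for one class, treated as positive."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Guard against division by zero when a class is never predicted or seen.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data for illustration only; not taken from the dataset.
toy_true = ["correct", "correct", "incorrect", "incorrect", "correct"]
toy_pred = ["correct", "incorrect", "incorrect", "correct", "correct"]

for label in ["correct", "incorrect"]:
    precision, recall = precision_recall(toy_true, toy_pred, label)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Precision answers "when the Eval says this label, how often is it right?", while recall answers "of the rows that truly have this label, how many did the Eval find?".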

## LLM Evals: Human Groundtruth vs AI Classifications GPT-3.5 Turbo
Run the same Eval against the dataset using GPT-3.5. GPT-3.5 can significantly speed up the classification process. However, there are tradeoffs, as we will see below.

In [None]:
model = OpenAIModel(model_name="gpt-3.5-turbo", temperature=0.0, request_timeout=20)

In [None]:
rails = list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values())
relevance_classifications_df = llm_classify(
    dataframe=df,
    template=HUMAN_VS_AI_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=50,
    provide_explanation=True,
)

In [None]:
relevance_classifications_df.head()

In [None]:
relevance_classifications = relevance_classifications_df["label"].tolist()

In [None]:
true_labels = df["true_value"].map(HUMAN_VS_AI_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, relevance_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)

## Preview: Running with GPT-4 Turbo

In [None]:
model = OpenAIModel(model_name="gpt-4-turbo-preview")
relevance_classifications = llm_classify(
    dataframe=df,
    template=HUMAN_VS_AI_PROMPT_TEMPLATE,
    model=model,
    rails=list(HUMAN_VS_AI_PROMPT_RAILS_MAP.values()),
    concurrency=50,
)["label"].tolist()

In [None]:
true_labels = df["true_value"].map(HUMAN_VS_AI_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, relevance_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)