<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Benchmarking Hallucination Evals</h1>

The purpose of this notebook is:

- to benchmark the performance of LLM-assisted approaches to detecting hallucinations,
- to leverage Phoenix experiments to iterate and improve on the evaluation approach.


In [None]:
!pip install arize-phoenix openinference-instrumentation openinference-instrumentation-openai nest-asyncio openai pandas

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.

In [None]:
import nest_asyncio

nest_asyncio.apply()

Set up tracing to log runs to your Phoenix instance. 

In [None]:
from openinference.instrumentation.openai import OpenAIInstrumentor

from phoenix.otel import register

# PHOENIX_COLLECTOR_ENDPOINT and PHOENIX_API_KEY are set in the environment
tracer_provider = register(project_name="hallucination_benchmark", auto_instrument=True, batch=True)

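# Explicitly instrument the OpenAI client. With auto_instrument=True above,
# this call may be redundant when the instrumentor package is already installed.
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)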

In [None]:
import asyncio
import os
from getpass import getpass

import openai
import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

pd.set_option("display.max_colwidth", None)

## Prepare Benchmark Dataset

We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against a benchmark dataset of queries, reference texts, and responses with ground-truth hallucination labels. Currently supported datasets include "halueval_qa_data" from the HaluEval benchmark:

- https://arxiv.org/abs/2305.11747
- https://github.com/RUCAIBox/HaluEval

In [None]:
df = download_benchmark_dataset(
    task="binary-hallucination-classification", dataset_name="halueval_qa_data"
)
print("`halueval_qa_data` dataset has", df.shape[0], "rows")
# Rename columns to the names used by the experiment task and evaluator
df.rename(columns={"reference": "context", "is_hallucination": "expected"}, inplace=True)
# Invert the label so that 1 = no hallucination (higher is better)
df["expected"] = (~df["expected"]).astype(int)
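The inversion above relies on `is_hallucination` being a boolean column: on a boolean pandas Series `~` is a logical NOT, whereas on an integer Series it is a bitwise NOT (so `1` becomes `-2`). The sketch below shows the same flip in plain Python on made-up rows, purely for illustration. The row values and the helper name `to_expected` are hypothetical and are not part of the dataset or of the Phoenix API.

```python
def to_expected(is_hallucination: bool) -> int:
    """Flip the label so that 1 means 'no hallucination' (higher is better)."""
    return int(not is_hallucination)


# Hypothetical rows, not taken from halueval_qa_data.
rows = [
    {"query": "q1", "is_hallucination": True},
    {"query": "q2", "is_hallucination": False},
]
expected = [to_expected(row["is_hallucination"]) for row in rows]
print(expected)  # [0, 1]

# Bitwise NOT on an integer is not a logical flip.
print(~1, ~0)  # -2 -1
```

If the column ever arrives as integers, casting it with `.astype(bool)` before applying `~` is one way to keep the flip logical.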

In [None]:
from phoenix.client import Client

phoenix_client = Client()

dataset = phoenix_client.datasets.create_dataset(
    name="halueval_qa_data",
    dataframe=df,
    input_keys=["context", "query", "response"],
    output_keys=["expected"],
    timeout=30,  # large dataset takes a while to upload
)
# dataset = phoenix_client.datasets.get_dataset(dataset="halueval_qa_data")

## Baseline: Arize's Built-In Binary Hallucination Classification Template


### Configure the LLM

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

View the default template used to classify hallucinations. You can tweak this template and evaluate its performance relative to the default.

In [None]:
print(HALLUCINATION_PROMPT_TEMPLATE)

### Define the experiment task (hallucination classification)

In [None]:
model = OpenAIModel(
    model="gpt-4o-mini",
    temperature=0.0,
)


def sync_classify(input) -> int:
    """
    Runs the llm_classify function on a single input in a sync context.
    """
    rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
    single_df = pd.DataFrame(
        [{"reference": input["context"], "input": input["query"], "output": input["response"]}]
    )
    result = llm_classify(
        data=single_df,
        template=HALLUCINATION_PROMPT_TEMPLATE,
        model=model,
        rails=rails,
        run_sync=True,
        max_retries=3,
    )
    return rails.index(result["label"].iloc[0])  # map label to 0 or 1


async def async_classify(input) -> int:
    """
    Runs the sync_classify function on a single input in an async context.
    """
    loop = asyncio.get_running_loop()  # preferred over get_event_loop() inside a coroutine
    return await loop.run_in_executor(None, sync_classify, input)
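The two ideas in the task above, mapping a label to its rail index and running a blocking call from async code, can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the rail order `["hallucinated", "factual"]` is assumed to mirror `list(HALLUCINATION_PROMPT_RAILS_MAP.values())`, and `label_to_score`, `slow_classify`, and `classify_many` are hypothetical names, not Phoenix functions. Unlike `rails.index(...)` on its own, the helper returns `None` for a label that is not a rail instead of raising `ValueError`.

```python
import asyncio
import time
from typing import List, Optional

# Assumed rail order; index 1 ("factual") lines up with expected == 1.
RAILS = ["hallucinated", "factual"]


def label_to_score(label: str) -> Optional[int]:
    """Return the rail index for a label, or None if the label is not a rail."""
    try:
        return RAILS.index(label)
    except ValueError:
        return None


def slow_classify(label: str) -> Optional[int]:
    """Hypothetical stand-in for a blocking LLM call."""
    time.sleep(0.05)
    return label_to_score(label)


async def classify_many(labels: List[str]) -> List[Optional[int]]:
    """Run the blocking function in the default thread pool, concurrently."""
    loop = asyncio.get_running_loop()
    futures = [loop.run_in_executor(None, slow_classify, label) for label in labels]
    return list(await asyncio.gather(*futures))


scores = asyncio.run(classify_many(["factual", "hallucinated", "NOT_A_RAIL"]))
print(scores)  # [1, 0, None]
```

In a notebook without `nest_asyncio`, `await classify_many(...)` would replace the `asyncio.run(...)` call. Returning `None` is only one possible policy; an evaluator that calls `round()` on the output would need to handle it.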

### Define evaluators

Accuracy: does the eval's output match the expected label?

Note: The accuracy evaluator rounds both the output and the expected value to the nearest integer, so it still works if we move to a continuous score in [0, 1]. We could also add a mean-squared-error (MSE) evaluator for continuous scores later.
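As a rough sketch of the note above, the snippet below shows how rounding treats continuous scores and how a per-example squared error could be averaged into an MSE. The function names and the score pairs are hypothetical, chosen only for illustration. One detail worth keeping in mind is that Python's built-in `round` sends ties to the nearest even integer, so a score of exactly 0.5 rounds to 0.

```python
def rounded_match(output: float, expected: float) -> bool:
    """Accuracy-style check: compare after rounding to the nearest integer."""
    return round(output) == round(expected)


def squared_error(output: float, expected: float) -> float:
    """Per-example squared error for a continuous score."""
    return (output - expected) ** 2


print(rounded_match(0.7, 1))  # True
print(rounded_match(0.3, 1))  # False
print(round(0.5))  # 0 (ties round to the nearest even integer)

# Hypothetical (output, expected) pairs.
pairs = [(1.0, 1), (0.5, 1), (0.5, 0), (0.0, 0)]
mse = sum(squared_error(o, e) for o, e in pairs) / len(pairs)
print(mse)  # 0.125
```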

We should also calculate precision, recall, and F1 at the dataset level to better understand the eval's performance.


In [None]:
from typing import Any


def accuracy(output: float | int, expected: dict[str, Any]) -> bool:
    # rounds to 0 or 1 if score is continuous (e.g. 0.7 -> 1, 0.3 -> 0)
    return round(output) == round(float(expected["expected"]))
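The dataset-level precision, recall, and F1 mentioned above could be computed from the collected outputs and expected labels once a run finishes. The sketch below assumes both are plain lists of 0/1 integers; the function name `precision_recall_f1` and the sample labels are hypothetical, and how the lists are pulled out of an experiment is left open.

```python
from typing import List, Tuple


def precision_recall_f1(
    outputs: List[int], expected: List[int], positive: int = 1
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(outputs, expected))
    tp = sum(1 for o, e in pairs if o == positive and e == positive)
    fp = sum(1 for o, e in pairs if o == positive and e != positive)
    fn = sum(1 for o, e in pairs if o != positive and e == positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = factual, 0 = hallucinated.
outputs = [1, 1, 1, 1, 0, 0, 0, 0]
expected = [1, 1, 0, 0, 1, 1, 0, 0]
print(precision_recall_f1(outputs, expected))  # (0.5, 0.5, 0.5)
```

Because this notebook encodes "no hallucination" as 1, the default `positive=1` scores the factual class. Passing `positive=0` scores hallucination detection instead, which is often the class of interest.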

### Run Experiment

In [None]:
from phoenix.experiments.functions import run_experiment

experiment = run_experiment(
    dataset=dataset,
    task=async_classify,
    evaluators=[accuracy],
    experiment_name="baseline-python",
    experiment_description="Built-in hallucination eval with python SDK",
    experiment_metadata={"sdk": "phoenix", "sdk_type": "python"},
    concurrency=8,
    # dry_run=10,
)