<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Benchmarking Hallucination Evals</h1>

The purpose of this notebook is:

- to benchmark the performance of LLM-assisted approaches to detecting hallucinations,
- to leverage Phoenix experiments to iterate and improve on the evaluation approach.


In [None]:
!uv pip install arize-phoenix openinference-instrumentation openinference-instrumentation-anthropic openinference-instrumentation-openai nest-asyncio openai pandas anthropic

In [None]:
import asyncio
import os
from typing import Any

import pandas as pd

from phoenix.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    AnthropicModel,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)

pd.set_option("display.max_colwidth", None)

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.
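
The re-entrancy limitation can be reproduced with the standard library alone. The sketch below is meant to be run as a plain Python script without `nest_asyncio` applied (inside a notebook the top-level `asyncio.run` call would itself hit a loop that is already running). The function names `inner` and `outer` are illustrative.

```python
import asyncio


async def inner() -> int:
    return 42


async def outer() -> str:
    # A loop is already running here, just as it is inside a notebook cell.
    coro = inner()
    try:
        asyncio.run(coro)  # stock asyncio refuses to start a nested loop
    except RuntimeError as exc:
        coro.close()  # avoid a "coroutine was never awaited" warning
        return f"RuntimeError: {exc}"
    return "nested run succeeded"


message = asyncio.run(outer())
print(message)
```

Once `nest_asyncio.apply()` has patched `asyncio`, the nested call is allowed, which is what lets async eval submission run from a notebook cell.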

In [None]:
import nest_asyncio

nest_asyncio.apply()

Set up tracing to log runs to your Phoenix instance. 

In [None]:
from phoenix.otel import register

# PHOENIX_COLLECTOR_ENDPOINT and PHOENIX_API_KEY are pulled from the environment
tracer_provider = register(project_name="hallucination_benchmark", auto_instrument=True, batch=True)

## Prepare Benchmark Dataset

We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against benchmark datasets of queries, reference contexts, and responses with ground-truth hallucination labels. Currently supported datasets include "halueval_qa_data" from the HaluEval benchmark:

- https://arxiv.org/abs/2305.11747
- https://github.com/RUCAIBox/HaluEval

In [None]:
df = download_benchmark_dataset(
    task="binary-hallucination-classification", dataset_name="halueval_qa_data"
)
print("`halueval_qa_data` dataset has", df.shape[0], "rows")
# rename columns
df.rename(columns={"reference": "context", "is_hallucination": "expected"}, inplace=True)
df["expected"] = (1 - df["expected"]).astype(int)  # flip labels: 1 = factual, 0 = hallucinated (higher is better)
df_subset = df.sample(10, random_state=42)  # increase size for larger experiments

In [None]:
from phoenix.client import Client

phoenix_client = Client()
dataset_name = "halueval_qa_data_subset_xs"
try:
    dataset = phoenix_client.datasets.create_dataset(
        name=dataset_name,
        dataframe=df_subset,
        input_keys=["context", "query", "response"],
        output_keys=["expected"],
        timeout=30,  # large dataset takes a while to upload
    )
    print(f"Dataset {dataset_name} created.")
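except Exception as exc:
    # Assumes the failure means the dataset already exists; the error is printed in case it does not.
    print(f"Could not create dataset {dataset_name} ({exc}). Getting existing dataset.")
    dataset = phoenix_client.datasets.get_dataset(dataset=dataset_name)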

## Baseline: Arize's Built-In Binary Hallucination Classification Template


### Configure the LLM

In [None]:
from getpass import getpass

import openai

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

if not (anthropic_api_key := os.getenv("ANTHROPIC_API_KEY")):
    anthropic_api_key = getpass("🔑 Enter your Anthropic API key: ")

os.environ["ANTHROPIC_API_KEY"] = anthropic_api_key

View the default template used to classify hallucinations. The cell prints the explanation variant, since the task below sets `provide_explanation=True`. You can tweak this template and evaluate its performance relative to the default.

In [None]:
print(HALLUCINATION_PROMPT_TEMPLATE.explanation_template[0].template)
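
A tweaked template can be previewed before it is sent to a model. The sketch below assumes a custom prompt that uses the placeholders `{input}`, `{reference}` and `{output}`, matching the column names passed to `llm_classify` later in this notebook. `CUSTOM_TEMPLATE`, `render_prompt` and the example row are hypothetical and are not part of `phoenix.evals`.

```python
# Hypothetical custom prompt; the wording is illustrative, not Phoenix's default.
CUSTOM_TEMPLATE = """You are checking an answer for hallucinations.

[Question]: {input}
[Reference]: {reference}
[Answer]: {output}

Reply with a single word: "factual" if the answer is supported by the
reference, or "hallucinated" if it is not."""


def render_prompt(template: str, row: dict[str, str]) -> str:
    """Fill the placeholders with one row's values using plain str.format."""
    return template.format(
        input=row["query"], reference=row["context"], output=row["response"]
    )


example_row = {  # made-up row for illustration
    "query": "What is the capital of France?",
    "context": "Paris is the capital and largest city of France.",
    "response": "The capital of France is Lyon.",
}
preview = render_prompt(CUSTOM_TEMPLATE, example_row)
print(preview)
```

To benchmark such a prompt, it would take the place of `HALLUCINATION_PROMPT_TEMPLATE` in the task. Whether a plain string can be combined with `provide_explanation=True` is worth checking against the `phoenix.evals` documentation first.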

### Define the experiment task (hallucination classification)

In [None]:
def sync_classify(input, model_name: str) -> dict[str, Any]:
    """
    Runs the llm_classify function on a single input in a sync context using the specified model.
    """
    if "claude" in model_name:
        model = AnthropicModel(
            model=model_name,
            temperature=0.0,
            initial_rate_limit=100,  # change depending on your rate limit
        )
    else:
        model = OpenAIModel(
            model=model_name,
            temperature=0.0,
            initial_rate_limit=100,  # change depending on your rate limit
        )
    rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())
    label_to_score = {"hallucinated": 0, "factual": 1}  # explicit mapping
    single_df = pd.DataFrame(
        [{"reference": input["context"], "input": input["query"], "output": input["response"]}]
    )
    result = llm_classify(
        data=single_df,
        template=HALLUCINATION_PROMPT_TEMPLATE,
        provide_explanation=True,
        use_function_calling_if_available=True,
        model=model,
        rails=rails,
        run_sync=True,
        max_retries=3,
    )
    score = label_to_score[result["label"].iloc[0]]  # map label to 0 or 1
    return {"hallucination_score": score, "explanation": result["explanation"].iloc[0]}


async def async_classify(input, model_name: str) -> dict[str, Any]:
    """
    Runs the sync_classify function on a single input in an async context.
    """
    loop = asyncio.get_running_loop()  # the loop that is executing this coroutine
    return await loop.run_in_executor(None, sync_classify, input, model_name)
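
The async wrapper above relies on `run_in_executor` to keep a blocking call from stalling the event loop. A minimal standard-library sketch of that pattern follows, with a hypothetical `slow_classify` standing in for the blocking LLM call, so that several calls overlap in worker threads.

```python
import asyncio
import time


def slow_classify(item: str) -> str:
    """Hypothetical stand-in for a blocking call such as sync_classify."""
    time.sleep(0.2)  # simulate network latency
    return item.upper()


async def classify_async(item: str) -> str:
    loop = asyncio.get_running_loop()
    # None selects the loop's default ThreadPoolExecutor.
    return await loop.run_in_executor(None, slow_classify, item)


async def main() -> list[str]:
    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(classify_async(x) for x in ["a", "b", "c"]))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

Because the three calls run in separate threads, the total time should be close to one call's latency, not the sum of all three.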

### Define evaluators

Accuracy: does the eval's label match the expected label?

Note: The accuracy evaluator rounds the output and the expected value to the nearest integer, so it keeps working if we move to a continuous score in [0, 1]. Python's `round` rounds halves to the nearest even integer, so a score of exactly 0.5 counts as 0. We could also add a mean-squared-error (MSE) evaluator for continuous scores down the line.
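
If the task later returns a continuous score, a per-example squared error could sit alongside accuracy. The sketch below is hypothetical: `squared_error` is not part of Phoenix, and it assumes the evaluator receives the same `output` and `expected` dictionaries as the accuracy evaluator and may return a float. Averaging its values over the dataset gives the MSE.

```python
from typing import Any


def squared_error(output: dict[str, Any], expected: dict[str, Any]) -> float:
    """Per-example squared error between the predicted score and the label."""
    diff = float(output["hallucination_score"]) - float(expected["expected"])
    return diff * diff


# Made-up values: a score of 0.5 against a ground-truth label of 1.
error = squared_error({"hallucination_score": 0.5}, {"expected": 1})
print(error)
```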

We should also calculate precision, recall, and F1 at the dataset level to better understand eval performance.


In [None]:
def accuracy(output: dict[str, Any], expected: dict[str, Any]) -> bool:
    # rounds to 0 or 1 if score is continuous (e.g. 0.7 -> 1, 0.3 -> 0)
    return round(output["hallucination_score"]) == round(float(expected["expected"]))
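
Dataset-level precision, recall and F1 can be computed from the per-example scores once they are collected. The sketch below uses plain Python lists and treats 1 (factual) as the positive class, following the label convention used in this notebook. `binary_metrics` and the sample lists are illustrative, and how the scores are pulled out of a finished experiment is left open.

```python
def binary_metrics(predicted: list[int], expected: list[int]) -> dict[str, float]:
    """Precision, recall and F1 with 1 as the positive class."""
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p == 1 and e == 1)
    fp = sum(1 for p, e in pairs if p == 1 and e == 0)
    fn = sum(1 for p, e in pairs if p == 0 and e == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


predicted = [1, 1, 0, 1, 0, 0]  # made-up eval scores
expected = [1, 0, 0, 1, 1, 0]  # made-up ground truth
metrics = binary_metrics(predicted, expected)
print(metrics)
```

To report how well the eval detects hallucinations instead, the same function can be applied after flipping both lists so that 0 becomes the positive class.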

### Run Experiment

In [None]:
# List of models to benchmark as judges
judge_models = [
    # OpenAI models
    "gpt-4o",  # GPT-4 Omni (May 2024)
    "gpt-4o-mini",  # Smaller version of GPT-4o
    "gpt-3.5-turbo",  # GPT-3.5 Turbo (March 2023)
    "o3",  # Successor to o1 with improved reasoning
    "o3-mini",  # Smaller version of o3
    "o4-mini",  # Latest mini reasoning model (April 2025)
    # Claude (Anthropic) models
    "claude-3-opus-20240229",  # Claude 3 Opus
    "claude-3-sonnet-20240229",  # Claude 3 Sonnet
    "claude-3-haiku-20240307",  # Claude 3 Haiku
    "claude-opus-4-20250514",  # Claude Opus 4
    "claude-sonnet-4-20250514",  # Claude Sonnet 4
]

In [None]:
from functools import partial

from phoenix.client import AsyncClient

async_client = AsyncClient()
model_name = "gpt-4o-mini"  # loop through judge_models if you want or just try one for now

experiment = await async_client.experiments.run_experiment(
    dataset=dataset,
    task=partial(async_classify, model_name=model_name),
    evaluators=[accuracy],
    experiment_name=f"baseline-{model_name}",
    experiment_description="Built-in hallucination eval with python SDK",
    experiment_metadata={"sdk": "phoenix", "sdk_type": "python", "model": model_name},
    concurrency=10,
    # dry_run=10,
)
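
To benchmark several judges, the experiment call can be wrapped in a loop. The helper below is a hypothetical sketch: `run_for_models` and `fake_run` are not Phoenix functions. It accepts any async callable that runs one experiment, so a failure for one model (for example, a model your API key cannot access) does not stop the rest.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def run_for_models(
    run_one: Callable[[str], Awaitable[Any]], models: list[str]
) -> dict[str, Any]:
    """Run one experiment per model sequentially, recording failures instead of raising."""
    results: dict[str, Any] = {}
    for name in models:
        try:
            results[name] = await run_one(name)
        except Exception as exc:  # keep going if one model fails
            results[name] = exc
    return results


async def fake_run(model_name: str) -> str:
    """Stand-in for a real experiment run, used only for illustration."""
    if model_name == "unavailable-model":
        raise ValueError("model not accessible")
    return f"baseline-{model_name}"


outcome = asyncio.run(run_for_models(fake_run, ["gpt-4o-mini", "unavailable-model"]))
print(outcome)
```

In this notebook, `run_one` would be a small coroutine that calls `async_client.experiments.run_experiment` with the same arguments as the cell above, and `models` would be `judge_models`.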