<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize-phoenix.readthedocs.io/projects/evals/en/latest/">Evals Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Arize Phoenix Evals 2.0</h1>

We are excited to introduce `arize-phoenix-evals` 2.0, an open-source library providing tools to evaluate AI systems so you can build faster and with more confidence. We have rebuilt the library from the ground up to make evaluation faster, easier, and more powerful.

**In this notebook, you will learn more about:**

1. Our guiding principles
2. The core library abstractions
3. Usage examples
4. What's changed between 2.0 and the previous version

#### Our Guiding Principles

**Fast:** We are optimizing for maximum speed, minimal headache.

**Ergonomic:** It should be user-friendly and easy to pick up.

**Flexible:** We make minimal assumptions about the shape of your data or evals.

**Powerful:** Built with extensibility in mind, the library enables complex evaluation tasks.


In [1]:
! pip install arize-phoenix "arize-phoenix-evals>=2.0.0" openai pandas openinference-instrumentation-openai --quiet

In [None]:
# set up phoenix app and tracing
import phoenix as px
from phoenix.otel import register

px.launch_app()
tracer_provider = register(auto_instrument=True)

## LLM Configuration

**Core Design Principle:** The library should work with any LLM model and provider.

The LLM wrapper unifies generation tasks across model providers by delegating to the most commonly installed client SDKs (OpenAI, LangChain, LiteLLM) via adapters.


In [3]:
import os
from getpass import getpass

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

In [None]:
from phoenix.evals import LLM

llm = LLM(
    provider="openai", model="gpt-4o-mini"
)  # you could also specify the client e.g. "langchain", "litellm", "openai"

## Evaluators and Scores

An evaluation is defined as any process that returns a `Score`.


In [5]:
from phoenix.evals.metrics import (
    HallucinationEvaluator,
)

llm = LLM(provider="openai", model="gpt-4o-mini")
hallucination_evaluator = HallucinationEvaluator(llm=llm)
result = hallucination_evaluator.evaluate(
    {
        "input": "What is the capital of France?",
        "output": "Paris is the capital of France.",
        "context": "Paris is the capital and largest city of France.",
    }
)
print("Hallucination result:")
result[0].pretty_print()

Hallucination result:
{
  "name": "hallucination",
  "score": 1.0,
  "label": "factual",
  "explanation": "The response correctly states that Paris is the capital of France, which is supported by the context provided. Thus, it does not contain any false information or hallucinations.",
  "metadata": {
    "model": "gpt-4o-mini"
  },
  "source": "llm",
  "direction": "maximize"
}


**Core Design Principle:** The output of evaluators should be rich with information.

Evaluators always return a **list** of `Score` objects. Often, this will be a list of length 1, but some evaluators may return multiple scores for a single `eval_input` (e.g. precision/recall or multi-criteria evals).
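
To make the return shape concrete, here is a minimal plain-Python sketch. `SketchScore` and `length_and_words` are hypothetical stand-ins invented for illustration, not the library's classes; the field names simply mirror those printed in the hallucination result above.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SketchScore:
    """Hypothetical stand-in whose fields mirror the printed score above."""

    name: str
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: dict[str, Any] = field(default_factory=dict)
    source: str = "heuristic"
    direction: str = "maximize"


def length_and_words(eval_input: dict[str, Any]) -> list[SketchScore]:
    """Toy evaluator: one input record in, a list of two scores out."""
    text = eval_input["text"]
    return [
        SketchScore(name="char_count", score=float(len(text))),
        SketchScore(name="word_count", score=float(len(text.split()))),
    ]


scores = length_and_words({"text": "Paris is the capital of France."})
for s in scores:
    print(s.name, s.score)
```

Because the result is always a list, calling code can iterate over it in the same way whether an evaluator produces one score or several.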


## Built-In Metrics


### Precision, Recall, F1 (multi-score)

A single evaluator can return multiple scores!


In [6]:
from phoenix.evals.metrics import PrecisionRecallFScore

precision_recall_fscore = PrecisionRecallFScore(positive_label="yes")
result = precision_recall_fscore.evaluate(
    {"output": ["no", "yes", "yes"], "expected": ["yes", "no", "yes"]}
)
print("Results:")
print(result[0])
print(result[1])
print(result[2])

Results:
Score(name='precision', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, source='heuristic', direction='maximize')
Score(name='recall', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, source='heuristic', direction='maximize')
Score(name='f1', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, source='heuristic', direction='maximize')
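
The three values above can be checked by hand. The sketch below recomputes them in plain Python for the same inputs, assuming the scores are computed with respect to the positive label only; it is not the library's implementation.

```python
output = ["no", "yes", "yes"]
expected = ["yes", "no", "yes"]
positive = "yes"

pairs = list(zip(output, expected))
# Count outcomes relative to the positive label
tp = sum(1 for o, e in pairs if o == positive and e == positive)  # predicted yes, was yes
fp = sum(1 for o, e in pairs if o == positive and e != positive)  # predicted yes, was no
fn = sum(1 for o, e in pairs if o != positive and e == positive)  # predicted no, was yes

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(precision, recall, f1)  # 0.5 0.5 0.5
```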


## Custom LLM Classification Evaluators

`ClassificationEvaluator` is similar to `llm_classify`: it is used for LLM-as-a-judge evaluations that output a label and an explanation.


In [None]:
from phoenix.evals import LLM, ClassificationEvaluator

llm = LLM(provider="openai", model="gpt-4o-mini")

evaluator = ClassificationEvaluator(
    name="sentiment",
    llm=llm,
    prompt_template="Classify the sentiment of this text: {text}",
    choices={"positive": 1.0, "negative": 0.0, "neutral": 0.5},  # specify custom score mapping!
)

result = evaluator.evaluate({"text": "I love this!"})
result[0].pretty_print()

{
  "name": "sentiment",
  "score": 1.0,
  "label": "positive",
  "explanation": "The text expresses a strong positive emotion of love, indicating a favorable sentiment towards something.",
  "metadata": {
    "model": "gpt-4o-mini"
  },
  "source": "llm",
  "direction": "maximize"
}


### About the `ClassificationEvaluator`

**New features:**

- Specify scores for each label
- Runs on single records (not just a dataframe)
- Leverages model tool calling / structured output for more reliable output parsing
- There is also a factory function `create_classifier` to create `ClassificationEvaluator` objects.

This abstraction can be easily extended to support multi-criteria evaluations where a judge is asked to evaluate an input across multiple dimensions in one request.
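
As a rough illustration of the per-label score mapping, the sketch below turns a judged label into a score record using a `choices` dictionary like the one in the example above. The helper `to_score_record` is hypothetical and only shows the idea; how the library performs this step internally is not shown in this notebook.

```python
from typing import Any

choices = {"positive": 1.0, "negative": 0.0, "neutral": 0.5}


def to_score_record(
    name: str, label: str, explanation: str, choices: dict[str, float]
) -> dict[str, Any]:
    """Hypothetical helper: look up the numeric score for a judged label."""
    if label not in choices:
        raise ValueError(f"Unexpected label {label!r}; expected one of {sorted(choices)}")
    return {
        "name": name,
        "score": choices[label],
        "label": label,
        "explanation": explanation,
    }


record = to_score_record("sentiment", "neutral", "No strong emotion expressed.", choices)
print(record["score"])  # 0.5
```

Rejecting labels that are not in `choices` is a design choice of this sketch: it surfaces malformed judge output early instead of silently producing a missing score.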


## Input Mapping and Transformation

**Core Design Principle:** The inputs to an evaluator should be well-defined and discoverable.

Every evaluator has an `input_schema` which describes what inputs it expects.


### Use `.describe()` to inspect an `Evaluator`'s input schema

Because pydantic `BaseModel` is used for the `input_schema`, input fields can be annotated with types, descriptions, and even aliases.


In [8]:
hallucination_evaluator.describe()  # requires strings for input, output, and context

{'name': 'hallucination',
 'source': 'llm',
 'direction': 'maximize',
 'input_schema': {'properties': {'input': {'description': 'The input query.',
    'title': 'Input',
    'type': 'string'},
   'output': {'description': 'The response to the query.',
    'title': 'Output',
    'type': 'string'},
   'context': {'description': 'The context or reference text.',
    'title': 'Context',
    'type': 'string'}},
  'required': ['input', 'output', 'context'],
  'title': 'HallucinationInputSchema',
  'type': 'object'}}

In [9]:
from phoenix.evals.metrics import exact_match

exact_match.describe()  # requires string output and expected

{'name': 'exact_match',
 'source': 'heuristic',
 'direction': 'maximize',
 'input_schema': {'properties': {'output': {'title': 'Output',
    'type': 'string'},
   'expected': {'title': 'Expected', 'type': 'string'}},
  'required': ['output', 'expected'],
  'title': 'Exact_matchInput',
  'type': 'object'}}

### Use `input_mapping` to map/transform data into expected `input_schema`

An evaluator's input arguments may not perfectly match the fields in your example or dataset. Or you may want to run multiple evaluators on the same example even though they have different or conflicting `input_schema`s.

You may have noticed that `Evaluators` accept an `eval_input` payload rather than keyword arguments.

**Core Design Principle:** You should not have to modify your data to run evaluations.

To extract values from a nested `eval_input` payload, provide an `input_mapping` that maps each of the evaluator's input fields to a path spec in your original data.

**Possible Mapping Values:**

- top-level keys in your JSON
- a path spec following JSON path syntax
- callable functions


In [10]:
# example nested eval input for a RAG system
eval_input = {
    "input": {"query": "user input query"},
    "output": {
        "responses": ["model answer", "model answer 2"],
        "documents": ["doc A", "doc B"],
    },
    "expected": "correct answer",
}

# in order to run the hallucination evaluator, we need to process the eval_input to fit the input schema
input_mapping = {
    "input": "input.query",  # dot notation to access nested keys
    "output": "output.responses[0]",  # brackets to access list elements
    "context": lambda x: " ".join(
        x["output"]["documents"]
    ),  # lambda function to combine the document chunks
}

# the evaluator uses the input_mapping to transform the eval_input into the expected input schema
result = hallucination_evaluator.evaluate(eval_input, input_mapping)
result[0].pretty_print()

{
  "name": "hallucination",
  "score": 0.0,
  "label": "hallucinated",
  "explanation": "The provided data lacks specific details in the query, context, or response, making it impossible to determine if the response is factual or hallucinated. Therefore, I cannot classify the response as either.",
  "metadata": {
    "model": "gpt-4o-mini"
  },
  "source": "llm",
  "direction": "maximize"
}
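
To see what the evaluator receives after mapping, the following plain-Python sketch resolves the same three kinds of mapping values (top-level or dotted keys, bracketed list indices, and callables). `resolve_path` and `apply_mapping` are hypothetical helpers written for this illustration; they support only simple dot and `[index]` paths and are not the library's resolver.

```python
import re
from typing import Any, Callable, Union

MappingSpec = Union[str, Callable[[dict], Any]]

# Matches either a key segment (e.g. "output") or a list index (e.g. "[0]")
_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")


def resolve_path(data: Any, path: str) -> Any:
    """Walk a simple path such as 'output.responses[0]' through nested data."""
    current = data
    for key, index in _TOKEN.findall(path):
        current = current[int(index)] if index else current[key]
    return current


def apply_mapping(eval_input: dict, mapping: dict[str, MappingSpec]) -> dict[str, Any]:
    """Build the flat payload an evaluator expects from a nested eval_input."""
    return {
        field_name: spec(eval_input) if callable(spec) else resolve_path(eval_input, spec)
        for field_name, spec in mapping.items()
    }


eval_input = {
    "input": {"query": "user input query"},
    "output": {
        "responses": ["model answer", "model answer 2"],
        "documents": ["doc A", "doc B"],
    },
    "expected": "correct answer",
}
input_mapping = {
    "input": "input.query",
    "output": "output.responses[0]",
    "context": lambda x: " ".join(x["output"]["documents"]),
}

resolved = apply_mapping(eval_input, input_mapping)
print(resolved)
```

The resolved payload contains only placeholder strings ("user input query", "model answer", "doc A doc B"), which is consistent with the judge's explanation above that the data lacks specific details.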


### Use `bind_evaluator` to bind an `input_mapping` to an `Evaluator` for reuse

Note: We don't need to remap "expected" for the `exact_match` eval because it already exists as a top-level key in our `eval_input`.


In [11]:
from phoenix.evals import bind_evaluator

# we can bind an input_mapping to an evaluator ahead of call time for easier sequential evals
evaluators = [
    bind_evaluator(hallucination_evaluator, input_mapping),
    bind_evaluator(exact_match, {"output": "output.responses[0]"}),
]
scores = []
for evaluator in evaluators:
    scores.append(evaluator.evaluate(eval_input))  # no need to pass input_mapping each time

[score[0].pretty_print() for score in scores]

{
  "name": "hallucination",
  "score": 0.0,
  "label": "hallucinated",
  "explanation": "The response does not align with the context provided, indicating it may include inaccurate information.",
  "metadata": {
    "model": "gpt-4o-mini"
  },
  "source": "llm",
  "direction": "maximize"
}
{
  "name": "exact_match",
  "score": 0.0,
  "metadata": {},
  "source": "heuristic",
  "direction": "maximize"
}


[None, None]

## A Convenient Decorator

Use the `create_evaluator` decorator to turn any function that returns something "score-like" into an `Evaluator`.


In [12]:
from phoenix.evals import create_evaluator


# heuristic evaluator that returns a tuple of score and label
@create_evaluator(name="text_length")
def text_length_score(text: str) -> tuple[float, str]:
    """Score text based on length (longer = better, up to a point)"""
    length = len(text)
    if length < 10:
        score = 0.0
        label = "too_short"
    elif length < 50:
        score = 0.5
        label = "short"
    elif length < 200:
        score = 1.0
        label = "good_length"
    else:
        score = 0.8
        label = "too_long"

    return (score, label)


text_length_score.evaluate({"text": "This is a test"})

[Score(name='text_length', score=0.5, label='short', explanation=None, metadata={}, source='heuristic', direction='maximize')]
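
The general idea of wrapping a function so that its "score-like" return value becomes a list of score records can be sketched in plain Python. `sketch_evaluator` below is a hypothetical decorator written for this illustration; the normalization rules it uses (numbers become the score, strings become the label) are an assumption, not the documented behavior of `create_evaluator`.

```python
from typing import Any, Callable


def sketch_evaluator(name: str) -> Callable:
    """Hypothetical decorator: attach an .evaluate() that returns score records."""

    def decorator(fn: Callable) -> Callable:
        def evaluate(eval_input: dict[str, Any]) -> list[dict[str, Any]]:
            # Pass payload keys as keyword arguments to the wrapped function
            result = fn(**eval_input)
            items = result if isinstance(result, tuple) else (result,)
            record: dict[str, Any] = {"name": name, "score": None, "label": None}
            for item in items:
                if isinstance(item, (int, float)):
                    record["score"] = float(item)  # assumed rule: numbers are scores
                elif isinstance(item, str):
                    record["label"] = item  # assumed rule: strings are labels
            return [record]

        fn.evaluate = evaluate
        return fn

    return decorator


@sketch_evaluator(name="is_short")
def is_short(text: str) -> tuple[float, str]:
    short = len(text) < 20
    return (1.0 if short else 0.0, "short" if short else "long")


result = is_short.evaluate({"text": "This is a test"})
print(result)
```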

In [13]:
from phoenix.evals import Score, create_evaluator


# heuristic evaluator that returns a Score object with metadata
@create_evaluator(name="keyword_presence", source="heuristic", direction="maximize")
def keyword_presence_score(text: str, keywords: list[str]) -> Score:
    """Score text based on presence of keywords"""
    text_lower = text.lower()
    keyword_list = keywords

    found_keywords = [k for k in keyword_list if k in text_lower]
    score = len(found_keywords) / len(keyword_list) if keyword_list else 0.0

    return Score(
        score=score,
        label=f"found_{len(found_keywords)}_of_{len(keyword_list)}",
        explanation=f"Found keywords: {found_keywords}",
        metadata={"found_keywords": found_keywords, "total_keywords": len(keyword_list)},
    )


keyword_presence_score.describe()  # input schema is inferred from the function signature

{'name': 'keyword_presence',
 'source': 'heuristic',
 'direction': 'maximize',
 'input_schema': {'properties': {'text': {'title': 'Text', 'type': 'string'},
   'keywords': {'items': {'type': 'string'},
    'title': 'Keywords',
    'type': 'array'}},
  'required': ['text', 'keywords'],
  'title': 'Keyword_presenceInput',
  'type': 'object'}}

## Dataframe Evaluation

Run multiple evaluators over a pandas dataframe. The output is an augmented dataframe with one added column per score and one per evaluator:

1. `{score_name}_score` contains the JSON-serialized score (or None if the evaluation failed).
2. `{evaluator_name}_execution_details` contains information about the execution status, duration, and any exceptions that occurred.

**Notes:**

- Use `bind_evaluator` to bind an `input_mapping` to each evaluator so that its input fields match your dataframe columns.
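
Because each score cell holds a JSON string, a small helper can pull out the numeric value for downstream analysis. The sketch below uses only the standard library; the sample cell value and the helper name `score_value` are made up for illustration and merely follow the shape of the serialized scores described above.

```python
import json
from typing import Optional

# Hypothetical cell value shaped like a JSON-serialized score
cell = (
    '{"name": "exact_match", "score": 1.0, "metadata": {}, '
    '"source": "heuristic", "direction": "maximize"}'
)


def score_value(cell: Optional[str]) -> Optional[float]:
    """Return the numeric score from a serialized cell, or None if the eval failed."""
    if cell is None:
        return None
    return json.loads(cell).get("score")


print(score_value(cell))  # 1.0
print(score_value(None))  # None
```

With pandas, the same helper could be applied to a whole score column, for example `result["exact_match_score"].apply(score_value)`.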

### Example 1: Async version with multiple evaluators


In [None]:
import pandas as pd

from phoenix.evals import LLM, async_evaluate_dataframe, bind_evaluator
from phoenix.evals.metrics import HallucinationEvaluator, exact_match

exact_match._input_mapping = {}  # unset the input mapping from earlier

df = pd.DataFrame(
    {
        "output": ["Yes", "Yes", "No"],
        "expected": ["Yes", "No", "No"],
        "context": ["This is a test", "This is another test", "This is a third test"],
        "query": [
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "response": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o-mini")

hallucination_evaluator = bind_evaluator(
    HallucinationEvaluator(llm=llm), {"input": "query", "output": "response"}
)

result = await async_evaluate_dataframe(
    dataframe=df, evaluators=[hallucination_evaluator, exact_match]
)
result.head()

| | output | expected | context | query | response | hallucination_execution_details | exact_match_execution_details | hallucination_score | exact_match_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Yes | Yes | This is a test | What is the name of this test? | First test | `{"status": "COMPLETED", "exceptions": [], "exe...` | `{"status": "COMPLETED", "exceptions": [], "exe...` | `{"name": "hallucination", "score": 0.0, "label...` | `{"name": "exact_match", "score": 1.0, "metadat...` |
| 1 | Yes | No | This is another test | What is the name of this test? | Another test | `{"status": "COMPLETED", "exceptions": [], "exe...` | `{"status": "COMPLETED", "exceptions": [], "exe...` | `{"name": "hallucination", "score": 1.0, "label...` | `{"name": "exact_match", "score": 0.0, "metadat...` |
| 2 | No | No | This is a third test | What is the name of this test? | Third test | `{"status": "COMPLETED", "exceptions": [], "exe...` | `{"status": "COMPLETED", "exceptions": [], "exe...` | `{"name": "hallucination", "score": 1.0, "label...` | `{"name": "exact_match", "score": 1.0, "metadat...` |


### Example 2: Sync version with multi-score evaluator


In [None]:
import pandas as pd

from phoenix.evals import evaluate_dataframe
from phoenix.evals.metrics import PrecisionRecallFScore

precision_recall_fscore = PrecisionRecallFScore(positive_label="Yes")

df = pd.DataFrame(
    {
        "output": [["Yes", "Yes", "No"], ["Yes", "No", "No"]],
        "expected": [["Yes", "No", "No"], ["Yes", "No", "No"]],
    }
)

result = evaluate_dataframe(dataframe=df, evaluators=[precision_recall_fscore])
result.head()

| | output | expected | precision_recall_fscore_execution_details | precision_score | recall_score | f1_score |
|---|---|---|---|---|---|---|
| 0 | [Yes, Yes, No] | [Yes, No, No] | `{"status": "COMPLETED", "exceptions": [], "exe...` | `{"name": "precision", "score": 0.5, "metadata"...` | `{"name": "recall", "score": 1.0, "metadata": {...` | `{"name": "f1", "score": 0.6666666666666666, "m...` |
| 1 | [Yes, No, No] | [Yes, No, No] | `{"status": "COMPLETED", "exceptions": [], "exe...` | `{"name": "precision", "score": 1.0, "metadata"...` | `{"name": "recall", "score": 1.0, "metadata": {...` | `{"name": "f1", "score": 1.0, "metadata": {"bet...` |


# Practice: BYO Judge

**Your task:** Create a custom LLM judge to classify text complexity. Inputs can be classified into one of the following labels: simple, moderate, or complex. For your use case, simple text is better than moderate or complex.

Use the following 3 examples to test your new evaluator:


In [18]:
data = [
    {
        "text": "AI is when computers learn to do things like people, like recognizing faces or playing games."
    },
    {
        "text": "Machine learning is a method in artificial intelligence where systems improve their performance by learning from data, without being explicitly programmed for each task"
    },
    {
        "text": "Artificial intelligence systems employing deep reinforcement learning utilize hierarchical neural architectures to iteratively optimize policy gradients across high-dimensional state-action spaces, converging toward sub-optimal equilibria in stochastic environments via backpropagated reward signals and temporally extended credit assignment mechanisms."
    },
]

In [None]:
# write your judge here

In [None]:
# test your judge on the examples here

# Practice: BYO Heuristic Evaluator

**Your task:** Turn the following function into an Evaluator that calculates the Levenshtein distance between two strings.

Note: Smaller values indicate higher similarity (lower score = better).

Run the Evaluator on the following data:


In [22]:
eval_input = {
    "input": {"query": "What is the capital of France?"},
    "output": {"response": "It is Paris"},
    "expected": "Paris",
}

In [24]:
# turn this function into a heuristic evaluator
def levenshtein_distance(s1: str, s2: str) -> int:
    """
    Compute the Levenshtein distance between two strings s1 and s2.
    """
    m, n = len(s1), len(s2)

    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)

    return dp[m][n]

In [None]:
# test your evaluator on the example above.
# hint: use an input_mapping to map/transform the input to the function's expected arguments.