# Phoenix-Evals 2.0: Preview


In [1]:
! pip install arize-phoenix arize-phoenix-evals==0.28.1 openai openinference-instrumentation-openai


In [3]:
import phoenix as px
from phoenix.otel import register

px.launch_app()
tracer_provider = register(auto_instrument=True)

Existing running Phoenix instance detected! Shutting it down and starting a new instance...
⚠️ PHOENIX_COLLECTOR_ENDPOINT is set to https://app.phoenix.arize.com/s/ehutton.
⚠️ This means that traces will be sent to the collector endpoint and not this app.
⚠️ If you would like to use this app to view traces, please unset this environmentvariable via e.g. `del os.environ['PHOENIX_COLLECTOR_ENDPOINT']` 
⚠️ You will need to restart your notebook to apply this change.
Overriding of current TracerProvider is not allowed
Attempting to instrument while already instrumented


🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://arize.com/docs/phoenix
🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: default
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/s/ehutton/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'authorization': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



## LLM Configuration

**Core Design Principle:** The library should work with any LLM model and provider.

The LLM wrapper unifies generation tasks across model providers by delegating to the most commonly installed client SDKs (OpenAI, LangChain, LiteLLM) via adapters.
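As a rough illustration of the idea, the sketch below checks which client SDKs are importable and picks the first one in a priority order. The helper names `is_installed` and `pick_client` and the priority tuple are hypothetical, chosen for this example; they are not the library's internals.

```python
from importlib.util import find_spec
from typing import Optional, Sequence

# Hypothetical priority order: the package names are real, but this ordering
# and these helpers are illustrative only.
CLIENT_PRIORITY = ("openai", "langchain", "litellm")


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be imported in this environment."""
    return find_spec(module_name) is not None


def pick_client(candidates: Sequence[str] = CLIENT_PRIORITY) -> Optional[str]:
    """Return the first installed client SDK in priority order, or None."""
    for name in candidates:
        if is_installed(name):
            return name
    return None


print(pick_client())  # depends on what is installed in your environment
```

An adapter layer would then translate a common "generate" call into whatever the selected SDK expects, so evaluators never depend on one provider's client directly.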


In [1]:
import os
from getpass import getpass

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

In [2]:
from phoenix.evals.preview.llm import LLM, show_provider_availability

show_provider_availability()  # shows which providers/clients are available based on what's installed in your environment
llm = LLM(
    provider="openai", model="gpt-4o"
)  # you could also specify the client e.g. "langchain" or "openai"




📦 AVAILABLE PROVIDERS (sorted by client priority)
--------------------------------------------------------------------
Provider  | Status      | Client    | Dependencies                  
--------------------------------------------------------------------
openai    | ✓ Available | openai    | openai
anthropic | ✗ Disabled  | langchain | langchain, langchain_anthropic
openai    | ✗ Disabled  | langchain | langchain, langchain_openai
openai    | ✗ Disabled  | litellm   | litellm
anthropic | ✗ Disabled  | litellm   | litellm


## About the `Score` Data Model

An evaluation is defined as any process that returns a `Score`.


In [3]:
from phoenix.evals.preview.metrics import (
    HallucinationEvaluator,
)

llm = LLM(provider="openai", model="gpt-4o-mini")
hallucination_evaluator = HallucinationEvaluator(llm=llm)
result = hallucination_evaluator(
    {
        "input": "What is the capital of France?",
        "output": "Paris is the capital of France.",
        "context": "Paris is the capital and largest city of France.",
    }
)
print("Hallucination result:")
result[0].pretty_print()

Hallucination result:
{
  "name": "hallucination",
  "score": 1.0,
  "label": "factual",
  "explanation": "The response correctly states that Paris is the capital of France, which aligns with the information provided in the context.",
  "metadata": {
    "model": "gpt-4o-mini"
  },
  "source": "llm",
  "direction": "maximize"
}


**Core Design Principle:** The output of evaluators should be rich with information.

All evaluators output a list of `Score` objects with some or all of the following properties:

- **name**: the name of the score
- **score**: numeric score
- **label**: str label for categorical evals
- **explanation**: an explanation for the result
- **direction**: optimization direction, either maximize or minimize
- **source**: source of the eval (llm, heuristic, or human)
- **metadata**: other metadata attached to the score

**Note:** evaluations always return a **list** of `Score` objects. Often, this will be a list of length 1, but some evaluators may return multiple scores for a single `eval_input` (e.g. precision/recall or multi-criteria evals).
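To make the shape of the data model concrete, here is a minimal stand-in built with a dataclass. `SketchScore` is a hypothetical name, and dropping unset fields when serializing is an assumption based on the printed results in this notebook, not a statement about how the library's `Score` is implemented.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SketchScore:
    """Illustrative stand-in for a Score; not the library's class."""

    name: Optional[str] = None
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)
    source: Optional[str] = None  # "llm", "heuristic", or "human"
    direction: str = "maximize"  # or "minimize"

    def to_dict(self) -> Dict[str, Any]:
        """Serialize the score, dropping fields that were never set."""
        return {k: v for k, v in asdict(self).items() if v is not None}


# Evaluations return a list, even when there is only one score.
sketch_scores = [SketchScore(name="exact_match", score=0.0, source="heuristic")]
print(json.dumps(sketch_scores[0].to_dict(), indent=2))
```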


## Built-In Metrics


### Exact Match (heuristic)


In [11]:
from phoenix.evals.preview.metrics import exact_match

result = exact_match({"output": "no", "expected": "yes"})
print("Exact match result:")
result[0].pretty_print()

Exact match result:
{
  "name": "exact_match",
  "score": 0.0,
  "metadata": {},
  "source": "heuristic",
  "direction": "maximize"
}


### Precision, Recall, F1 (multi-score)

A single evaluator can return multiple scores!

Notes:

- Works for binary or multi-class labels, as well as integer values.
- Provide a positive label for best results. For binary labels, 1.0 is presumed positive.
- The default F score is F1, but beta is configurable.
- The default averaging technique is macro, but it is configurable.


In [12]:
from phoenix.evals.preview.metrics import PrecisionRecallFScore

precision_recall_fscore = PrecisionRecallFScore(positive_label="yes")
result = precision_recall_fscore({"output": ["no", "yes", "yes"], "expected": ["yes", "no", "yes"]})
print("Results:")
print(result[0])
print(result[1])
print(result[2])

Results:
Score(name='precision', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, source='heuristic', direction='maximize')
Score(name='recall', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, source='heuristic', direction='maximize')
Score(name='f1', score=0.5, label=None, explanation=None, metadata={'beta': 1.0, 'average': 'macro', 'labels': ['yes', 'no'], 'positive_label': 'yes'}, source='heuristic', direction='maximize')
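The 0.5 values above can be reproduced by hand with the standard definitions of precision, recall, and F-beta for a single positive label. The function below is a sketch of those textbook formulas, with `binary_prf` as a hypothetical name; it is not the library's implementation.

```python
from typing import Hashable, Sequence, Tuple


def binary_prf(
    output: Sequence[Hashable],
    expected: Sequence[Hashable],
    positive_label: Hashable,
    beta: float = 1.0,
) -> Tuple[float, float, float]:
    """Precision, recall, and F-beta with one label treated as the positive class."""
    pairs = list(zip(output, expected))
    tp = sum(o == positive_label and e == positive_label for o, e in pairs)
    fp = sum(o == positive_label and e != positive_label for o, e in pairs)
    fn = sum(o != positive_label and e == positive_label for o, e in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    fscore = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, fscore


predicted = ["no", "yes", "yes"]
actual = ["yes", "no", "yes"]
print(binary_prf(predicted, actual, positive_label="yes"))  # (0.5, 0.5, 0.5)
```

For the label "no" the same function returns zeros, so an unweighted average over both labels would be 0.25. The reported 0.5 therefore appears to be the positive-label value; treat that reading as an inference from this output rather than documented behavior.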


## Custom LLM Classification Evaluators

This is similar to `llm_classify`: it supports LLM-as-a-judge evaluations that output a label and an explanation.


In [13]:
from phoenix.evals.preview import ClassificationEvaluator
from phoenix.evals.preview.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")

evaluator = ClassificationEvaluator(
    name="sentiment",
    llm=llm,
    prompt_template="Classify the sentiment of this text: {text}",
    choices={"positive": 1.0, "negative": 0.0, "neutral": 0.5},  # specify custom score mapping!
)

result = evaluator.evaluate({"text": "I love this!"})
result[0].pretty_print()

{
  "name": "sentiment",
  "score": 1.0,
  "label": "positive",
  "explanation": "The phrase 'I love this!' expresses strong positive feelings, indicating enthusiasm or admiration.",
  "metadata": {
    "model": "gpt-4o-mini"
  },
  "source": "llm",
  "direction": "maximize"
}


### About the `ClassificationEvaluator`

**New features:**

- Specify scores for each label
- Runs on single records (not just a dataframe)
- Leverages model tool calling / structured output for more reliable output parsing

**Notes**

- Allows user to specify labels and custom score mapping via `choices`, with various supported formats (see docs for more info)
- Option to turn off explanations, though they are included by default in accordance with best practices
- Requires the LLM to have some kind of tool calling or structured output ability
- There is also a factory function `create_classifier` to create `ClassificationEvaluator` objects.
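The label-to-score idea can be sketched in a few lines. The two input shapes handled here (a plain list of labels, or a label-to-score dict) are assumptions for illustration, and `normalize_choices` and `score_for_label` are hypothetical helpers; check the docs for the formats `choices` actually supports.

```python
from typing import Dict, Optional, Sequence, Union

Choices = Union[Sequence[str], Dict[str, float]]


def normalize_choices(choices: Choices) -> Dict[str, Optional[float]]:
    """Return a label -> score mapping; bare labels get no numeric score."""
    if isinstance(choices, dict):
        return dict(choices)
    return {label: None for label in choices}


def score_for_label(label: str, choices: Choices) -> Optional[float]:
    """Look up the score for a judge's label, rejecting labels outside the choices."""
    mapping = normalize_choices(choices)
    if label not in mapping:
        raise ValueError(f"Unexpected label {label!r}; expected one of {sorted(mapping)}")
    return mapping[label]


sentiment_choices = {"positive": 1.0, "negative": 0.0, "neutral": 0.5}
print(score_for_label("positive", sentiment_choices))  # 1.0
```

Constraining the judge to a fixed set of labels is what makes tool calling or structured output useful here: the model's answer can be validated against the allowed labels instead of parsed from free text.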

This abstraction can be easily extended to support multi-criteria evaluations where a judge is asked to evaluate an input across multiple dimensions in one request.

For more complex LLM evaluation tasks that don't fit the classification mold, there is an `LLMEvaluator` class with open-ended tool calling support that can be inherited.


## Input Mapping and Transformation

**Core Design Principle:** The inputs to an evaluator should be well-defined and discoverable.

Every evaluator has an `input_schema` which describes what inputs it expects.


### Use `.describe()` to inspect an `Evaluator`'s input schema

Because pydantic `BaseModel` is used for the `input_schema`, input fields can be annotated with types, descriptions, and even aliases.


In [8]:
# describe an evaluator to inspect its input schema
hallucination_evaluator.describe()  # requires strings for input, output, and context

{'name': 'hallucination',
 'source': 'llm',
 'direction': 'maximize',
 'input_schema': {'properties': {'input': {'description': 'The input query.',
    'title': 'Input',
    'type': 'string'},
   'output': {'description': 'The response to the query.',
    'title': 'Output',
    'type': 'string'},
   'context': {'description': 'The context or reference text.',
    'title': 'Context',
    'type': 'string'}},
  'required': ['input', 'output', 'context'],
  'title': 'HallucinationInputSchema',
  'type': 'object'}}

In [14]:
exact_match.describe()  # requires string output and expected

{'name': 'exact_match',
 'source': 'heuristic',
 'direction': 'maximize',
 'input_schema': {'properties': {'output': {'title': 'Output',
    'type': 'string'},
   'expected': {'title': 'Expected', 'type': 'string'}},
  'required': ['output', 'expected'],
  'title': 'Exact_matchInput',
  'type': 'object'}}

### Use `input_mapping` to map/transform data into expected `input_schema`

You may have noticed that `Evaluators` accept an `eval_input` payload rather than keyword arguments.

**Core Design Principle:** You should not have to modify your data to run evaluations.

An evaluator's input arguments may not perfectly match those in your example or dataset. Or, you may want to run multiple evaluators on the same example, but they have different or conflicting `input_schema`s.

To extract the values from a nested `eval_input` payload, provide an `input_mapping` that maps the evaluator's input fields to a path spec in your original data.

Possible Mapping Values:

- top-level keys in your data
- a path spec following JSON path syntax
- callable functions


In [9]:
# example nested eval input for a RAG system
eval_input = {
    "input": {"query": "user input query"},
    "output": {
        "responses": ["model answer", "model answer 2"],
        "documents": ["doc A", "doc B"],
    },
    "expected": "correct answer",
}

# in order to run the hallucination evaluator, we need to process the eval_input to fit the input schema
input_mapping = {
    "input": "input.query",  # dot notation to access nested keys
    "output": "output.responses.0",  # dot notation to access list indices
    "context": lambda x: " ".join(
        x["output"]["documents"]
    ),  # lambda function to combine the document chunks
}

# the evaluator uses the input_mapping to transform the eval_input into the expected input schema
result = hallucination_evaluator.evaluate(eval_input, input_mapping)
result[0].pretty_print()

{
  "name": "hallucination",
  "score": 0.0,
  "label": "hallucinated",
  "explanation": "The context provided (doc A doc B) does not contain enough information to determine if the response (model answer) is based on factual information. Therefore, without specific content in both the context and response, it is not possible to assert the accuracy of the response against the context provided.",
  "metadata": {
    "model": "gpt-4o-mini"
  },
  "source": "llm",
  "direction": "maximize"
}
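The remapping step can be understood without an LLM call. The sketch below resolves dot-separated paths and callables against a nested payload; `get_path` and `remap` are hypothetical helpers written for this example and only cover the simple dotted form used in this notebook, not the full path syntax the library accepts.

```python
from typing import Any, Callable, Dict, Mapping, Union

PathOrCallable = Union[str, Callable[[Mapping[str, Any]], Any]]


def get_path(data: Any, path: str) -> Any:
    """Follow a dot-separated path; numeric segments index into lists."""
    current = data
    for part in path.split("."):
        if isinstance(current, (list, tuple)):
            current = current[int(part)]
        else:
            current = current[part]
    return current


def remap(
    payload: Mapping[str, Any], mapping: Mapping[str, PathOrCallable]
) -> Dict[str, Any]:
    """Build the evaluator's inputs from the payload without mutating it."""
    return {
        name: spec(payload) if callable(spec) else get_path(payload, spec)
        for name, spec in mapping.items()
    }


rag_payload = {
    "input": {"query": "user input query"},
    "output": {"responses": ["model answer", "model answer 2"], "documents": ["doc A", "doc B"]},
    "expected": "correct answer",
}
rag_mapping = {
    "input": "input.query",
    "output": "output.responses.0",
    "context": lambda x: " ".join(x["output"]["documents"]),
}
print(remap(rag_payload, rag_mapping))
```

The flattened result has exactly the three string fields the hallucination evaluator's schema requires, which is why the explanation above mentions "doc A doc B" as the context.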


### Use `bind_evaluator` to bind an `input_mapping` to an `Evaluator` for reuse

Note: We don't need to remap "expected" for the `exact_match` eval because it already exists as a top-level key in our `eval_input`.


In [15]:
from phoenix.evals.preview import bind_evaluator

# we can bind an input_mapping to an evaluator ahead of call time for easier sequential evals
evaluators = [
    bind_evaluator(hallucination_evaluator, input_mapping),
    bind_evaluator(exact_match, {"output": "output.responses.0"}),
]
scores = []
for evaluator in evaluators:
    scores.append(evaluator.evaluate(eval_input))  # no need to pass input_mapping each time

[score[0].pretty_print() for score in scores]

{
  "name": "hallucination",
  "score": 1.0,
  "label": "factual",
  "explanation": "The response is based on the context provided, which states 'doc A doc B.' Since there is no additional information to contradict or verify the response as being incorrect, and assuming the model answer appropriately relates to the input, it is considered factual.",
  "metadata": {
    "model": "gpt-4o-mini"
  },
  "source": "llm",
  "direction": "maximize"
}
{
  "name": "exact_match",
  "score": 0.0,
  "metadata": {},
  "source": "heuristic",
  "direction": "maximize"
}


[None, None]
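Conceptually, binding is a closure that remembers the mapping and applies it before every evaluation. The sketch below uses hypothetical names (`bind`, `toy_exact_match`) and assumes that unmapped fields fall back to top-level keys of the payload, which matches the note above about "expected" but is otherwise an assumption.

```python
from typing import Any, Callable, Dict, List, Mapping, Sequence

EvalFn = Callable[[Mapping[str, Any]], List[Dict[str, Any]]]
Extractor = Callable[[Mapping[str, Any]], Any]


def bind(
    evaluate: EvalFn, required_fields: Sequence[str], mapping: Mapping[str, Extractor]
) -> EvalFn:
    """Return a callable that remaps the payload, then runs the evaluator."""

    def bound(payload: Mapping[str, Any]) -> List[Dict[str, Any]]:
        inputs = {}
        for name in required_fields:
            if name in mapping:
                inputs[name] = mapping[name](payload)
            else:
                inputs[name] = payload[name]  # unmapped fields come from top-level keys
        return evaluate(inputs)

    return bound


def toy_exact_match(inputs: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Toy evaluator: 1.0 when output equals expected, else 0.0."""
    return [{"name": "exact_match", "score": float(inputs["output"] == inputs["expected"])}]


nested_payload = {
    "output": {"responses": ["model answer", "model answer 2"]},
    "expected": "correct answer",
}
bound_exact_match = bind(
    toy_exact_match, ["output", "expected"], {"output": lambda p: p["output"]["responses"][0]}
)
print(bound_exact_match(nested_payload))  # [{'name': 'exact_match', 'score': 0.0}]
```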

## More About the `Evaluator` Abstraction

- sync and async methods for single record evals
- evaluators are directly callable e.g. `evaluator(eval_input)` in addition to `evaluator.evaluate(eval_input)`
- inheritors of the base class only have to implement `_evaluate` and the remaining methods come for free unless explicitly overwritten
- all evaluators have a well-defined `input_schema` that, if not provided at instantiation, is inferred from either the prompt template (for LLM evaluators) or decorated function signature (for heuristic evaluators)
- accept an arbitrary `eval_input` payload, and an optional `input_mapping` to map/transform the `eval_input` to match the `input_schema`. Input remapping is handled by the base `Evaluator` class.
- evaluations always return a **list** of `Score` objects. Often, this will be a list of length 1, but some evaluators may return multiple scores for a single `eval_input` (e.g. precision/recall or multi-criteria evals).
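The points above describe a template-method design: subclasses supply `_evaluate`, and the base class provides everything else. A minimal sketch of that pattern follows; `SketchEvaluator`, `ContainsDigit`, and the `async_evaluate` method name are hypothetical, and the real base class also handles schema validation and input remapping, which are omitted here.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Mapping


class SketchEvaluator(ABC):
    """Illustrative base class; subclasses implement only `_evaluate`."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def _evaluate(self, eval_input: Mapping[str, Any]) -> List[Dict[str, Any]]:
        """Compute scores for inputs that already match the input schema."""

    def evaluate(self, eval_input: Mapping[str, Any]) -> List[Dict[str, Any]]:
        # A fuller base class would validate and remap inputs before this call.
        return self._evaluate(eval_input)

    async def async_evaluate(self, eval_input: Mapping[str, Any]) -> List[Dict[str, Any]]:
        # Default async path: run the sync implementation in a worker thread.
        return await asyncio.to_thread(self.evaluate, eval_input)

    def __call__(self, eval_input: Mapping[str, Any]) -> List[Dict[str, Any]]:
        # Makes `evaluator(eval_input)` equivalent to `evaluator.evaluate(eval_input)`.
        return self.evaluate(eval_input)


class ContainsDigit(SketchEvaluator):
    """Toy heuristic: scores 1.0 when the output contains a digit."""

    def _evaluate(self, eval_input: Mapping[str, Any]) -> List[Dict[str, Any]]:
        has_digit = any(ch.isdigit() for ch in eval_input["output"])
        return [{"name": self.name, "score": float(has_digit)}]


digit_evaluator = ContainsDigit(name="contains_digit")
print(digit_evaluator({"output": "room 42"}))  # [{'name': 'contains_digit', 'score': 1.0}]
```

In a notebook, `await digit_evaluator.async_evaluate({"output": "room 42"})` would exercise the async path, since a notebook cell already runs inside an event loop.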


## About the `create_evaluator` decorator

Turn any function that returns something "score-like" into an `Evaluator`.


In [16]:
from phoenix.evals.preview import create_evaluator


# heuristic evaluator that returns a tuple of score and label
@create_evaluator(name="text_length")
def text_length_score(text: str) -> tuple[float, str]:
    """Score text based on length (longer = better, up to a point)"""
    length = len(text)
    if length < 10:
        score = 0.0
        label = "too_short"
    elif length < 50:
        score = 0.5
        label = "short"
    elif length < 200:
        score = 1.0
        label = "good_length"
    else:
        score = 0.8
        label = "too_long"

    return (score, label)


text_length_score(eval_input={"text": "This is a test"})

[Score(name='text_length', score=0.5, label='short', explanation=None, metadata={}, source='heuristic', direction='maximize')]

In [17]:
from phoenix.evals.preview import Score, create_evaluator


# heuristic evaluator that returns a Score object with metadata
@create_evaluator(name="keyword_presence", source="heuristic", direction="maximize")
def keyword_presence_score(text: str, keywords: list[str]) -> Score:
    """Score text based on presence of keywords"""
    text_lower = text.lower()
    keyword_list = keywords

    found_keywords = [k for k in keyword_list if k in text_lower]
    score = len(found_keywords) / len(keyword_list) if keyword_list else 0.0

    return Score(
        score=score,
        label=f"found_{len(found_keywords)}_of_{len(keyword_list)}",
        explanation=f"Found keywords: {found_keywords}",
        metadata={"found_keywords": found_keywords, "total_keywords": len(keyword_list)},
    )


keyword_presence_score.describe()  # input schema is inferred from the function signature

{'name': 'keyword_presence',
 'source': 'heuristic',
 'direction': 'maximize',
 'input_schema': {'properties': {'text': {'title': 'Text', 'type': 'string'},
   'keywords': {'items': {'type': 'string'},
    'title': 'Keywords',
    'type': 'array'}},
  'required': ['text', 'keywords'],
  'title': 'Keyword_presenceInput',
  'type': 'object'}}
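To show roughly how a decorator can infer input fields from a function signature and wrap "score-like" return values, here is a small sketch. `sketch_evaluator` is a hypothetical decorator, and its conversion rules (numbers become the score, strings become the label) are assumptions made for this example rather than the library's documented behavior.

```python
import inspect
from typing import Any, Callable, Dict, List, Mapping


def sketch_evaluator(name: str, source: str = "heuristic", direction: str = "maximize"):
    """Hypothetical decorator: wrap a plain function so it accepts an eval_input dict."""

    def decorator(fn: Callable[..., Any]):
        # The function's parameter names become the required input fields.
        required = list(inspect.signature(fn).parameters)

        def evaluate(eval_input: Mapping[str, Any]) -> List[Dict[str, Any]]:
            result = fn(**{key: eval_input[key] for key in required})
            score: Dict[str, Any] = {"name": name, "source": source, "direction": direction}
            # Assumed convention: numbers become the score, strings become the label.
            for item in result if isinstance(result, tuple) else (result,):
                if isinstance(item, (int, float)):
                    score["score"] = float(item)
                elif isinstance(item, str):
                    score["label"] = item
            return [score]

        evaluate.required_fields = required
        return evaluate

    return decorator


@sketch_evaluator(name="word_count")
def word_count(text: str) -> tuple:
    """Count words and label the text as short or long."""
    n = len(text.split())
    return (float(n), "short" if n < 5 else "long")


print(word_count.required_fields)  # ['text']
print(word_count({"text": "This is a test"}))
```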

## Dataframe Evaluation

Run multiple evaluators over a pandas dataframe. The output is an augmented dataframe with two added columns per score:

1. `{score_name}_score` contains the JSON serialized score (or None if the evaluation failed)
2. `{evaluator_name}_execution_details` contains information about the execution status, duration, and any exceptions that occurred.

Notes:

- use `bind_evaluator` to bind `input_mappings` to your evaluators so they match your dataframe columns.
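The column layout described above can be sketched without pandas by treating each row as a dict. Everything here is hypothetical: `evaluate_rows`, the status strings, and the detail field names are invented for illustration, and the fallback that writes `None` under `{evaluator_name}_score` on failure assumes the score name equals the evaluator name.

```python
import json
import time
from typing import Any, Callable, Dict, List, Mapping

RowEvalFn = Callable[[Mapping[str, Any]], List[Dict[str, Any]]]


def evaluate_rows(
    rows: List[Dict[str, Any]], evaluators: Dict[str, RowEvalFn]
) -> List[Dict[str, Any]]:
    """Return copies of the rows with score and execution-detail columns added."""
    augmented = []
    for row in rows:
        new_row = dict(row)  # leave the input rows untouched
        for evaluator_name, evaluate in evaluators.items():
            start = time.perf_counter()
            details: Dict[str, Any] = {"status": "COMPLETED", "exceptions": []}
            try:
                for score in evaluate(row):
                    new_row[f"{score['name']}_score"] = json.dumps(score)
            except Exception as exc:  # record the failure instead of aborting the run
                details["status"] = "FAILED"
                details["exceptions"].append(repr(exc))
                new_row.setdefault(f"{evaluator_name}_score", None)
            details["execution_seconds"] = time.perf_counter() - start
            new_row[f"{evaluator_name}_execution_details"] = json.dumps(details)
        augmented.append(new_row)
    return augmented


def row_exact_match(row: Mapping[str, Any]) -> List[Dict[str, Any]]:
    """Toy evaluator: 1.0 when output equals expected, else 0.0."""
    return [{"name": "exact_match", "score": float(row["output"] == row["expected"])}]


sample_rows = [{"output": "Yes", "expected": "Yes"}, {"output": "No"}]  # second row is incomplete
augmented_rows = evaluate_rows(sample_rows, {"exact_match": row_exact_match})
print(augmented_rows[0]["exact_match_score"])
print(augmented_rows[1]["exact_match_execution_details"])
```

Keeping failures in a details column means one bad row does not abort the whole run, and the failed rows can be filtered and retried afterwards.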


In [19]:
import pandas as pd

from phoenix.evals.preview.evaluators import evaluate_dataframe
from phoenix.evals.preview.metrics import PrecisionRecallFScore

precision_recall_fscore = PrecisionRecallFScore(positive_label="Yes")

df = pd.DataFrame(
    {
        "output": [["Yes", "Yes", "No"], ["Yes", "No", "No"]],
        "expected": [["Yes", "No", "No"], ["Yes", "No", "No"]],
    }
)

result = evaluate_dataframe(df, [precision_recall_fscore])
result.head()

ImportError: cannot import name 'evaluate_dataframe' from 'phoenix.evals.preview.evaluators' (/Users/elizabethhutton/Projects/phoenix/.conda/lib/python3.11/site-packages/phoenix/evals/preview/evaluators.py)

In [None]:
from phoenix.evals.preview.evaluators import bind_evaluator
from phoenix.evals.preview.llm import LLM
from phoenix.evals.preview.metrics import HallucinationEvaluator, exact_match

df = pd.DataFrame(
    {
        "output": ["Yes", "Yes", "No"],
        "expected": ["Yes", "No", "No"],
        "context": ["This is a test", "This is another test", "This is a third test"],
        "query": [
            "What is the name of this test?",
            "What is the name of this test?",
            "What is the name of this test?",
        ],
        "response": ["First test", "Another test", "Third test"],
    }
)

llm = LLM(provider="openai", model="gpt-4o")

hallucination_evaluator = bind_evaluator(
    HallucinationEvaluator(llm=llm), {"input": "query", "output": "response"}
)

result = evaluate_dataframe(df, [exact_match, hallucination_evaluator])
result.head()

# Summary: Before and After

**Before:** limited return info -> **After:** rich `Score` objects with directionality and flexible metadata

**Before:** only 0/1 scores allowed -> **After:** custom score mapping

**Before:** evals only on dataframes -> **After:** single-record evals or dataframes

**Before:** running `llm_classify` separately for each eval -> **After:** running multiple evaluators on a dataframe at once

**Before:** adding new dataframe columns for each eval -> **After:** use an `input_mapping` so your data stays untouched

**Before:** single-score evals -> **After:** multi-criteria evals

**Before:** no easy way to construct function/heuristic evals -> **After:** convenient decorator


# Practice: BYO Judge

**Your task:** Create a custom LLM judge to classify text complexity. Inputs can be classified into one of the following labels: simple, moderate, or complex. For your use case, simple text is better than moderate or complex.

Use the following 3 examples to test your new evaluator:


In [18]:
data = [
    {
        "text": "AI is when computers learn to do things like people, like recognizing faces or playing games."
    },
    {
        "text": "Machine learning is a method in artificial intelligence where systems improve their performance by learning from data, without being explicitly programmed for each task"
    },
    {
        "text": "Artificial intelligence systems employing deep reinforcement learning utilize hierarchical neural architectures to iteratively optimize policy gradients across high-dimensional state-action spaces, converging toward sub-optimal equilibria in stochastic environments via backpropagated reward signals and temporally extended credit assignment mechanisms."
    },
]

In [None]:
# write your judge here

In [None]:
# test your judge on the examples here

# Practice: BYO Heuristic Evaluator

**Your task:** Turn the following function into an Evaluator that calculates the Levenshtein distance between two strings.

Note: Smaller values indicate higher similarity (lower score = better).

Run the Evaluator on the following data:


In [22]:
eval_input = {
    "input": {"query": "What is the capital of France?"},
    "output": {"response": "It is Paris"},
    "expected": "Paris",
}

In [24]:
# turn this function into a heuristic evaluator
def levenshtein_distance(s1: str, s2: str) -> int:
    """
    Compute the Levenshtein distance between two strings s1 and s2.
    """
    m, n = len(s1), len(s2)

    dp = [[0] * (n + 1) for _ in range(m + 1)]

    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)

    return dp[m][n]

In [None]:
# test your evaluator on the example above.
# hint: use an input_mapping to map/transform the input to the function's expected arguments.