<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Slack Community</a>
    </p>
</center>

# Using a Benchmark Dataset to Build a Custom LLM as a Judge Evaluator

In this tutorial, you'll learn how to build a custom LLM-as-a-Judge evaluator tailored to your specific use case. While Arize provides several [pre-built evaluators](https://arize.com/docs/ax/evaluate/llm-as-a-judge/arize-evaluators-llm-as-a-judge) that have been tested against benchmark datasets, these may not always cover the nuances of your application.

So how can you achieve the same level of rigor when your use case falls outside the scope of standard evaluators?

We'll walk through how to create your own benchmark dataset from a small set of annotated examples. This dataset lets you build and refine a custom evaluator by revealing failure cases and guiding iteration. The use case we explore is data extraction from an image of a receipt.

To follow along, you'll need:

*   A free [Arize AX](https://app.arize.com/auth/join) account
*   An OpenAI API Key



## Set up Keys and Dependencies

In [None]:
%pip install -qqqq arize arize-otel openinference-instrumentation-openai arize-phoenix-evals

In [None]:
%pip install -qq openai nest_asyncio

In [None]:
import os
import pandas as pd
import nest_asyncio
from getpass import getpass

nest_asyncio.apply()


# Prompt for credentials instead of hard-coding them in the notebook
if "SPACE_ID" not in os.environ:
    os.environ["SPACE_ID"] = getpass("🔑 Enter your Arize Space ID: ")
if "API_KEY" not in os.environ:
    os.environ["API_KEY"] = getpass("🔑 Enter your Arize API Key: ")
if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API Key: ")

# Configure Tracing

In [None]:
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure the Arize tracer; spans are sent to the "receipt-classifications" project
tracer_provider = register(
    space_id=os.environ["SPACE_ID"],
    api_key=os.environ["API_KEY"],
    project_name="receipt-classifications",
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Generate Image Classification Traces

In this tutorial, we'll ask an LLM to generate expense reports from receipt images provided as public URLs. Running the cells below will generate traces, which you can explore and annotate directly in Arize. We'll use GPT-4o mini (`gpt-4o-mini`), which supports image inputs.


Dataset Information:
Jakob (2024). Receipt or Invoice Dataset. Roboflow Universe. CC BY 4.0. Available at: https://universe.roboflow.com/jakob-awn1e/receipt-or-invoice (accessed on 2025-07-29)

In [None]:
import uuid
import pandas as pd
urls = [
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/8M5px2yLoNtZ6gOQ2r1D/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/8EVgYMNObyV6kLqBNeFG/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/86aohWmcEfO0XkflO8AB/original.jpg",
"https://source.roboflow.com/HahhKcbQqdf8YAudM4kU3PuVCS72/1eGPBChz7wvovQROk2l8/original.jpg",
"https://source.roboflow.com/HahhKcbQqdf8YAudM4kU3PuVCS72/0WqR2GSfGmxWB7ozo3Pj/original.jpg",
"https://source.roboflow.com/HahhKcbQqdf8YAudM4kU3PuVCS72/FAEJRtviIboCYSKFZcEZ/original.jpg",
"https://source.roboflow.com/HahhKcbQqdf8YAudM4kU3PuVCS72/0AoEaFy8FAw6DVieWCa8/original.jpg",
"https://source.roboflow.com/HahhKcbQqdf8YAudM4kU3PuVCS72/0Q3hAyNwXNpHTeoWU7fz/original.jpg",
"https://source.roboflow.com/HahhKcbQqdf8YAudM4kU3PuVCS72/2r876u4WpaCYFdMPwieK/original.jpg",
"https://source.roboflow.com/HahhKcbQqdf8YAudM4kU3PuVCS72/2ZWeE0yO0oJUDtpgEAPY/original.jpg",
"https://source.roboflow.com/HahhKcbQqdf8YAudM4kU3PuVCS72/37PF6xfHyuqzIBdO7Kgw/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/6mo4M0nJeKZEsdKrRfsR/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/5ezJ8tUBGbNnt0jZi2JU/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/4BCIWGazhCj03oTMWboO/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/4B8vXJNwJ7ZuHEWyjgAv/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/2EpeKbAqsSwciH2IHGyV/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/2LP3g9rKZrYDkNB3I78c/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/1hT6iLEIAFBw8W70u2FY/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/1zaKpaDhRPxkiIDTvMuc/original.jpg",
"https://source.roboflow.com/Zf1kEIcRTrhHBZ7wgJleS4E92P23/1hF1R2Pt41hnlqhlXLDD/original.jpg"
]

In [None]:
from openai import OpenAI
client = OpenAI()

def extract_receipt_data(image_url):
    """Ask the model for an expense-report summary of the receipt at image_url."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    # Text instruction first, image second: extract_url below reads content[1]
                    {
                        "type": "text",
                        "text": "Analyze this receipt and return a brief summary for an expense report. Only include category of expense, total cost, and summary of items",
                    },
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
        max_tokens=500,
    )
    return response

In [None]:
for url in urls:
  extract_receipt_data(url)

# Create Benchmarked Dataset

After generating traces, open Arize to begin annotating your dataset. In this example, we'll annotate based on "accuracy", but you can choose any evaluation criterion that fits your use case. Just be sure to update the query below to match the annotation key you're using, so that the annotated examples are included in your benchmark dataset. When annotating, use the labels `accurate`, `almost accurate`, and `inaccurate`: the evaluator later in this tutorial compares its output to your annotation by exact string match.

Run the cell below to see annotations in action:

In [None]:
from IPython.display import HTML

video_url = "https://storage.googleapis.com/arize-phoenix-assets/assets/videos/arize-annotation.mp4"

HTML(f"""
<iframe width="1200" height="700" src="{video_url}"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen></iframe>
""")

In [None]:
from datetime import datetime, timedelta

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

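# Use a distinct name so this cell does not overwrite the OpenAI `client` defined earlier
export_client = ArizeExportClient(api_key=os.environ["API_KEY"])

print('#### Exporting your primary dataset into a dataframe.')

# Export the last 50 days of traces from the "receipt-classifications" project
primary_df = export_client.export_model_to_df(
    space_id=os.environ["SPACE_ID"],
    model_id='receipt-classifications',
    environment=Environments.TRACING,
    start_time=datetime.now() - timedelta(days=50),
    end_time=datetime.now(),
)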

In [None]:
filtered_df = primary_df[
    (primary_df['annotation.accuracy.label'].notna())
][[
    'attributes.input.value',
    'attributes.output.value',
    'annotation.accuracy.label',
]].rename(columns={
    'attributes.input.value': 'image',
    'attributes.output.value': 'response',
    'annotation.accuracy.label': 'accuracy'
})

filtered_df
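
Before building the dataset, it can help to check that the annotations use exactly the labels the evaluator will emit later in this tutorial and that every label has at least one example. The following is a minimal sketch using only the standard library; `check_annotation_labels` is a hypothetical helper, and the sample labels are placeholders, not real annotations.

```python
from collections import Counter

# The three labels used by the evaluator later in this tutorial
RAILS = ["accurate", "almost accurate", "inaccurate"]


def check_annotation_labels(labels, rails=RAILS):
    """Return (counts, labels not in rails, rails with no examples)."""
    counts = Counter(labels)
    unexpected = sorted(set(counts) - set(rails))
    missing = [rail for rail in rails if counts[rail] == 0]
    return counts, unexpected, missing


# Placeholder annotations for illustration only
counts, unexpected, missing = check_annotation_labels(
    ["accurate", "inaccurate", "accurate", "Accurate "]
)
print("counts:", dict(counts))
print("unexpected labels:", unexpected)
print("labels with no examples:", missing)
```

With the dataframe from this tutorial, the input would be something like `filtered_df["accuracy"].tolist()`.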

In [None]:
import json

def extract_url(input_value):
    data = json.loads(input_value)
    return data["messages"][0]["content"][1]["image_url"]["url"]

def extract_content(input_value):
    data = json.loads(input_value) 
    return data["choices"][0]["message"]["content"]

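# Work on a copy so the assignments below do not write into a slice of primary_df
filtered_df = filtered_df.copy()
filtered_df['image'] = filtered_df['image'].apply(extract_url)
filtered_df['response'] = filtered_df['response'].apply(extract_content)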


filtered_df
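
The two helpers above assume a particular JSON layout: the span input mirrors the chat-completions request (text part first, image part second) and the span output mirrors the response. The sketch below shows that assumed layout with hand-built placeholder payloads (the URL and report text are made up) and a hypothetical, more defensive variant that returns `None` instead of raising when a span does not match.

```python
import json

# Placeholder payloads shaped like the request/response used in this tutorial
sample_input = json.dumps({
    "messages": [{
        "role": "user",
        "content": [
            {"type": "text", "text": "Analyze this receipt..."},
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
        ],
    }]
})
sample_output = json.dumps({
    "choices": [{"message": {"role": "assistant", "content": "Category: Meals; Total: $12.50"}}]
})


def _dig(payload, path):
    """Parse payload as JSON and follow path; return None on any mismatch."""
    try:
        node = json.loads(payload)
        for key in path:
            node = node[key]
        return node
    except (json.JSONDecodeError, KeyError, IndexError, TypeError):
        return None


def safe_extract_url(input_value):
    return _dig(input_value, ["messages", 0, "content", 1, "image_url", "url"])


def safe_extract_content(output_value):
    return _dig(output_value, ["choices", 0, "message", "content"])


print(safe_extract_url(sample_input))
print(safe_extract_content(sample_output))
print(safe_extract_url("not json"))
```

Rows where either helper returns `None` could then be dropped or inspected by hand before the dataset is created.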

In [None]:
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE

client = ArizeDatasetsClient(api_key=os.environ["API_KEY"])
dataset_id = client.create_dataset(
    space_id=os.environ["SPACE_ID"],
    dataset_name="annotated-receipts",
    dataset_type=GENERATIVE,
    data=filtered_df
)

dataset_id

![Dataset](https://storage.googleapis.com/arize-phoenix-assets/assets/images/eval-tutorial-annotated-dataset.png)

# Create evaluation template

Next, we'll create a baseline evaluation template and define both the task and the evaluation function. Once these are set up, we'll run an experiment to compare the evaluator's performance against our ground truth annotations.

In [None]:
from phoenix.evals.templates import (
    ClassificationTemplate,
    PromptPartTemplate,
    PromptPartContentType,
)

rails = ["accurate", "almost accurate", "inaccurate"]
classification_template = ClassificationTemplate(
    rails=rails,  # Specify the valid output labels
    template=[
        # Prompt part 1: Task description
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template=""" You are an evaluator tasked with assessing the quality of a model-generated expense report based on a receipt.
Below is the model's generated expense report and the input image:
---
MODEL OUTPUT (Expense Report): {output}

---
INPUT RECEIPT: """,
        ),
        # Prompt part 2: Insert the image data
        PromptPartTemplate(
            content_type=PromptPartContentType.IMAGE,
            template="{image}",  # Placeholder for the image URL
        ),
        # Prompt part 3: Define the response format
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template=""" Evaluate the following three aspects and assign one of the following labels for each. Only include the label:
- **"accurate"** – Fully correct
- **"almost accurate"** – Mostly correct
- **"inaccurate"** – Substantially wrong
            """,
        ),
    ],
)

print(classification_template)

In [None]:
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from arize.experimental.datasets.experiments.evaluators.base import EvaluationResult, Evaluator
from typing import Dict, Any



def task_function(dataset_row):
    image_url = dataset_row["image"]
    output = dataset_row["response"]
    response_classification = llm_classify(
        data=pd.DataFrame([{"image": image_url, "output": output}]),
        template=classification_template,
        model=OpenAIModel(model="gpt-4.1"),
        rails=rails,
        provide_explanation=True,
    )
    label = response_classification.iloc[0]["label"]
    return label

class MyEval(Evaluator):
    def evaluate(
        self, *, output: str, dataset_row: Dict[str, Any], **kwargs: Any
    ) -> EvaluationResult:
        expected_label = dataset_row["accuracy"]
        
        # Your evaluation logic here
        if output == expected_label:
            return EvaluationResult(
                explanation="Output matches expected accuracy",
                score=1.0,
                label="correct"
            )
        else:
            return EvaluationResult(
                explanation="Output does not match expected accuracy",
                score=0.0,
                label="incorrect"
            )
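
The evaluator above compares labels by exact string equality, so a human annotation such as `Accurate` or a judge output outside the rails (for example, a sentinel for an unparseable response) counts as a mismatch. One possible way to make the comparison more forgiving is sketched below; `normalize_label` and `labels_match` are hypothetical helpers, not part of the Arize or Phoenix APIs.

```python
RAILS = ["accurate", "almost accurate", "inaccurate"]


def normalize_label(label):
    """Lowercase, trim, and collapse whitespace/underscores; None if not a rail."""
    if not isinstance(label, str):
        return None
    cleaned = " ".join(label.replace("_", " ").strip().lower().split())
    return cleaned if cleaned in RAILS else None


def labels_match(predicted, expected):
    """True only when both labels normalize to the same valid rail."""
    p, e = normalize_label(predicted), normalize_label(expected)
    return p is not None and p == e


print(labels_match("Accurate", "accurate"))
print(labels_match(" Almost_Accurate ", "almost accurate"))
print(labels_match("NOT_PARSABLE", "accurate"))
```

Inside `MyEval.evaluate`, the condition `output == expected_label` could be swapped for `labels_match(output, expected_label)` if this looser matching suits your annotations.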

In [None]:
client.run_experiment(
    space_id=os.environ["SPACE_ID"],
    dataset_id=dataset_id,
    task=task_function,
    evaluators=[MyEval()],
    experiment_name="Initial Experiment",
)

You will see your experiment result in the experiments tab of your dataset: 

![Initial Experiment](https://storage.googleapis.com/arize-phoenix-assets/assets/images/eval-tutorial-first-experiment-arize.png)
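
To make "performance against ground truth" concrete, the comparison boils down to an agreement rate plus a count of each (human label, judge label) pair. The sketch below computes both offline with the standard library; `agreement_report` is a hypothetical helper and the label lists are placeholders for illustration, not results from this tutorial.

```python
from collections import Counter


def agreement_report(human_labels, judge_labels):
    """Return (agreement rate, Counter of (human, judge) label pairs)."""
    if len(human_labels) != len(judge_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    pairs = Counter(zip(human_labels, judge_labels))
    matches = sum(n for (h, j), n in pairs.items() if h == j)
    return matches / len(human_labels), pairs


# Placeholder labels for illustration only
human = ["accurate", "inaccurate", "almost accurate", "accurate"]
judge = ["accurate", "inaccurate", "accurate", "accurate"]

rate, pairs = agreement_report(human, judge)
print(f"Agreement: {rate:.0%}")
for (h, j), n in sorted(pairs.items()):
    print(f"human={h!r:18} judge={j!r:18} count={n}")
```

Off-diagonal pairs (where the two labels differ) show which distinctions the judge struggles with, which is more actionable than the single agreement number.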

# Iteration 1 to improve evaluator prompt template

Next, we'll refine our evaluation prompt template by adding more specific classification rules, based on the gaps we saw in the previous iteration. This additional guidance helps improve accuracy and brings the evaluator's judgments closer to the human annotations.
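
One way to find those gaps is to read through the examples where the judge and the human annotation disagree. A minimal sketch follows, assuming each row carries the benchmark columns from this tutorial (`image`, `response`, `accuracy`) plus a hypothetical `judge` field holding the evaluator's label; the rows are placeholders.

```python
# Placeholder rows for illustration only
rows = [
    {"image": "receipt_1.jpg", "response": "Category: Meals ...",
     "accuracy": "accurate", "judge": "accurate"},
    {"image": "receipt_2.jpg", "response": "Category: Miscellaneous ...",
     "accuracy": "almost accurate", "judge": "accurate"},
    {"image": "receipt_3.jpg", "response": "Category: Fuel ...",
     "accuracy": "inaccurate", "judge": "inaccurate"},
]

# Keep only rows where the judge's label differs from the human annotation
disagreements = [row for row in rows if row["accuracy"] != row["judge"]]

for row in disagreements:
    print(f"{row['image']}: human={row['accuracy']!r}, judge={row['judge']!r}")
    print(f"  report: {row['response']}")
```

Each disagreement is a candidate for a new rule in the prompt; in the placeholder data, the judge accepted a vague category that the annotator marked down.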

In [None]:
classification_template = ClassificationTemplate(
    rails=rails,  # Specify the valid output labels
    template=[
        # Prompt part 1: Task description
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template=""" You are an evaluator tasked with assessing the quality of a model-generated expense report based on a receipt.
Below is the model's generated expense report and the input image:
---
MODEL OUTPUT (Expense Report): {output}

---
INPUT RECEIPT: """,
        ),
        # Prompt part 2: Insert the image data
        PromptPartTemplate(
            content_type=PromptPartContentType.IMAGE,
            template="{image}",  # Placeholder for the image URL
        ),
        # Prompt part 3: Define the response format
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template=""" Evaluate the following and assign one of the following labels for each. Only include the label:
- **"accurate"** – Total price, itemized list, and expense category are all accurate. All three must be correct to get this label.
- **"almost accurate"** – Mostly correct but with small issues. For example, expense category is too vague.
- **"inaccurate"** – Substantially wrong or missing information. For example, incorrect total price.
            """,
        ),
    ],
)

In [None]:
client.run_experiment(
    space_id=os.environ["SPACE_ID"],
    dataset_id=dataset_id,
    task=task_function,
    evaluators=[MyEval()],
    experiment_name="Stronger Prompt Experiment",
)

# Iteration 2 to improve evaluator prompt template

To further improve our evaluator, we'll introduce few-shot examples into the evaluation prompt. These examples help highlight common failure cases and guide the evaluator toward more consistent and generalized judgments.
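
If you prefer to keep few-shot examples as data rather than writing them inline, a small formatter can turn annotated examples into a text block for the prompt. This is a sketch under stated assumptions: `build_few_shot_block` and the `report`/`label`/`reason` keys are hypothetical, and the example itself is a placeholder.

```python
def build_few_shot_block(examples):
    """Format annotated examples as a plain-text block for the judge prompt."""
    lines = ["Examples:"]
    for i, example in enumerate(examples, start=1):
        lines.append(f"Example {i}")
        lines.append(f"Expense report: {example['report']}")
        lines.append(f"Label: {example['label']}")
        lines.append(f"Reason: {example['reason']}")
        lines.append("")  # blank line between examples
    return "\n".join(lines).rstrip()


# Placeholder example for illustration only
examples = [
    {
        "report": "Category: Miscellaneous; Total: $8.00; Items: coffee, bagel",
        "label": "almost accurate",
        "reason": "Total and items are right, but the category is too vague.",
    },
]

block = build_few_shot_block(examples)
print(block)
```

The resulting string could be appended to the final text part of the template. If an example contains curly braces, they may need escaping so they are not read as template variables.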

In [None]:
classification_template = ClassificationTemplate(
    rails=rails,  # Specify the valid output labels
    template=[
        # Prompt part 1: Task description
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template=""" You are an evaluator tasked with assessing the quality of a model-generated expense report based on a receipt.
Below is the model's generated expense report and the input image:
---
MODEL OUTPUT (Expense Report): {output}

---
INPUT RECEIPT: """,
        ),
        # Prompt part 2: Insert the image data
        PromptPartTemplate(
            content_type=PromptPartContentType.IMAGE,
            template="{image}",  # Placeholder for the image URL
        ),
        # Prompt part 3: Define the response format
        PromptPartTemplate(
            content_type=PromptPartContentType.TEXT,
            template=""" Evaluate the following three aspects and assign one of the following labels for each. Only include the label:
- **"accurate"** – Total price, itemized list, and expense category are accurate. All three must be correct to get this label.
  An incorrect category is one that is overly vague (e.g., "Miscellaneous", "Supplies") or does not accurately reflect the itemized list.
  For example, "Dining and Entertainment" should not be grouped together if the itemized list only includes food.
  Reasonable general categories like "Office Supplies" or "Groceries" are acceptable if they align with the listed items.

- **"almost accurate"** – Mostly correct but with small issues. For example, expense category is too vague.
  If a category includes extra fields (e.g., "Dining and Entertainment", but the receipt only includes food), mark this as almost accurate.
- **"inaccurate"** – Substantially wrong or missing. For example, an incorrect total price, or one or more missing items, makes the total result inaccurate.
            """,
        ),
    ],
)


In [None]:
client.run_experiment(
    space_id=os.environ["SPACE_ID"],
    dataset_id=dataset_id,
    task=task_function,
    evaluators=[MyEval()],
    experiment_name="Few Shot Experiment",
)

# Final Results

Once your evaluator reaches a performance level you're satisfied with, it's ready for use. The target score will depend on your benchmark dataset and specific use case. That said, you can continue applying the techniques from this tutorial to refine and iterate until the evaluator meets your desired level of quality.

You can also compare your experiment outcomes to baseline results or previous versions to evaluate progress.

![Final Results](https://storage.googleapis.com/arize-phoenix-assets/assets/images/eval-tutorial-compare-experiment.png)
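
When comparing a new prompt against a previous version, the overall score can hide examples that got worse. The sketch below lists which benchmark examples a candidate run fixed and which it regressed relative to a baseline run; `compare_runs` is a hypothetical helper and all label lists are placeholders, not results from this tutorial.

```python
def compare_runs(human, baseline, candidate):
    """Return (indices fixed by candidate, indices regressed by candidate)."""
    if not (len(human) == len(baseline) == len(candidate)):
        raise ValueError("all label lists must have the same length")
    fixed, regressed = [], []
    for i, (h, b, c) in enumerate(zip(human, baseline, candidate)):
        if b != h and c == h:
            fixed.append(i)       # baseline was wrong, candidate is right
        elif b == h and c != h:
            regressed.append(i)   # baseline was right, candidate is wrong
    return fixed, regressed


# Placeholder labels for illustration only
human = ["accurate", "almost accurate", "inaccurate", "accurate"]
baseline = ["accurate", "accurate", "accurate", "accurate"]
candidate = ["accurate", "almost accurate", "accurate", "inaccurate"]

fixed, regressed = compare_runs(human, baseline, candidate)
print("fixed:", fixed)
print("regressed:", regressed)
```

A prompt change that fixes some examples while regressing others may be overfitting to the cases you just looked at, so both lists are worth reviewing before settling on a final template.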