<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# **Building an Eval-Driven Development Pipeline with a Custom Annotations UI**

In this tutorial, we will use a custom annotation UI for Phoenix, built with [Lovable](https://lovable.dev), to collect human feedback on traces, then use that feedback to run experiments and evaluate your application.

The purpose of a custom annotations UI is to make it easy for anyone to provide structured human feedback on traces, capturing essential details directly in Phoenix. Annotations are vital for collecting feedback during human review, enabling iterative improvement of your LLM applications.

By establishing this feedback loop and an evaluation pipeline, you can effectively monitor and enhance your system’s performance.

You will need a [Phoenix Cloud account](https://app.arize.com/auth/phoenix/signup) and an OpenAI API key to follow along.

## Install Dependencies & Set Up Keys

In [None]:
!pip install -qqq arize-phoenix arize-phoenix-otel openinference-instrumentation-openai

In [None]:
!pip install -qq openai nest_asyncio

In [None]:
import os
from getpass import getpass

import nest_asyncio

nest_asyncio.apply()

if not (phoenix_endpoint := os.getenv("PHOENIX_COLLECTOR_ENDPOINT")):
    phoenix_endpoint = getpass("🔑 Enter your Phoenix Collector Endpoint")
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = phoenix_endpoint


if not (phoenix_api_key := os.getenv("PHOENIX_API_KEY")):
    phoenix_api_key = getpass("🔑 Enter your Phoenix API key")
os.environ["PHOENIX_API_KEY"] = phoenix_api_key

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

# Configure Tracing

In [None]:
from phoenix.otel import register

# configure the Phoenix tracer
tracer_provider = register(
    project_name="my-annotations-app",  # Default is 'default'
    auto_instrument=True,
)

# Generate traces to annotate

We will generate some agent traces and send them to Phoenix. We will then annotate these traces to add labels, scores, or explanations directly onto specific spans. Annotations allow us to enrich our traces with structured feedback, making it easier to filter, track, and improve LLM outputs.

In [None]:
questions = [
    "What is the capital of France?",
    "Who wrote 'Pride and Prejudice'?",
    "What is the boiling point of water in Celsius?",
    "What is the largest planet in our solar system?",
    "Who developed the theory of relativity?",
    "What is the chemical symbol for gold?",
    "In which year did the Apollo 11 mission land on the moon?",
    "What language has the most native speakers worldwide?",
    "Which continent has the most countries?",
    "What is the square root of 144?",
    "What is the largest country in the world by land area?",
    "Why is the sky blue?",
    "Who painted the Mona Lisa?",
    "What is the smallest prime number?",
    "What gas do plants absorb from the atmosphere?",
    "Who was the first President of the United States?",
    "What is the currency of Japan?",
    "How many continents are there on Earth?",
    "What is the tallest mountain in the world?",
    "Who is the author of '1984'?",
]

We deliberately instruct the model, through the system prompt, to produce some bad or nonsensical responses. This gives us a mix of good and bad traces to annotate and experiment with.



In [None]:
from openai import OpenAI

openai_client = OpenAI()

# System prompt
system_prompt = """
You are a question-answering assistant. For each user question, randomly choose an option: NONSENSE or RHYME. If you choose RHYME, answer correctly in the form of a rhyme.

If you choose NONSENSE, do not answer the question at all, and instead respond with nonsense words and random numbers that do not rhyme, ignoring the user’s question completely.
When responding with NONSENSE, include at least five nonsense words and at least five random numbers between 0 and 9999 in your response.

Do not explain your choice.
"""

# Run through the dataset and collect spans
for question in questions:
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )

# Launch Custom Annotation UI

Visit our implementation here: https://phoenix-trace-annotator.lovable.app/

*Note: This annotation UI was built for Phoenix Cloud demo purposes and is not optimized for high-volume trace workflows.*

How to annotate your traces in Lovable:

1. Enter your Phoenix Cloud endpoint, API key, and project name. Optionally, also include an identifier to tie annotations to a specific user.
2. Click Refresh Traces.
3. Select the traces you want to annotate and click Send to Phoenix.
4. See your annotations appear instantly in Phoenix.

Run the cell below to see it in action:

In [None]:
from IPython.display import HTML

video_url = (
    "https://storage.googleapis.com/arize-phoenix-assets/assets/videos/custom-annotations-UI.mp4"
)

HTML(f"""
<iframe width="1200" height="700" src="{video_url}"
frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture"
allowfullscreen></iframe>
""")

# Create a dataset from annotated spans

After you have annotated some spans, save them as a dataset in Phoenix.

In [None]:
import pandas as pd

import phoenix as px
from phoenix.client import Client
from phoenix.client.types import spans

client = Client()
# replace "correctness" if you chose to annotate on different criteria
query = spans.SpanQuery().where("annotations['correctness']")
spans_df = client.spans.get_spans_dataframe(query=query, project_identifier="my-annotations-app")
dataset = px.Client().upload_dataset(
    dataframe=spans_df,
    dataset_name="annotated-rhymes",
    input_keys=["attributes.input.value"],
    output_keys=["attributes.llm.output_messages"],
)

# Build an Eval based on annotations

Next, you will construct an LLM-as-a-Judge template to evaluate your experiments. This evaluator will mark nonsensical outputs as incorrect. As you experiment (by improving your system prompt), you’ll see evaluation results improve. Once your annotated trace dataset shows consistent improvement, you can confidently apply these changes to your production system.

In [None]:
RHYME_PROMPT_TEMPLATE = """
Examine the assistant’s responses in the conversation and determine whether the assistant used rhyme in any of its responses.

Rhyme means that the assistant’s response contains clear end rhymes within or across lines. This should be applicable to the entire response.
There should be no irrelevant phrases or numbers in the response.
Determine whether the rhyme is high quality or forced in addition to checking for the presence of rhyme.
These are the criteria for determining a well-written rhyme.

If none of the assistant's responses contain rhyme, output that the assistant did not rhyme.

[BEGIN DATA]
    ************
    [Question]: {question}
    ************
    [Response]: {answer}
    [END DATA]


Your response must be a single word, either "correct" or "incorrect", and should not contain any text or characters aside from that word.

"correct" means the response contained a well-written rhyme.

"incorrect" means the response did not contain a rhyme.
"""
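
To make the template's mechanics concrete, here is a minimal sketch of the two steps a classification helper has to perform: filling the `{question}` and `{answer}` placeholders for one row, and mapping the judge's raw reply onto the allowed labels (the "rails"). `TEMPLATE`, `fill_template`, and `snap_to_rails` are hypothetical names used for illustration only. `llm_classify` handles both steps internally, and its actual parsing may differ.

```python
from typing import List, Optional

# Hypothetical, shortened stand-in for RHYME_PROMPT_TEMPLATE (illustration only).
TEMPLATE = (
    "[Question]: {question}\n"
    "[Response]: {answer}\n"
    'Reply with a single word: "correct" or "incorrect".'
)

RAILS = ["correct", "incorrect"]


def fill_template(template: str, question: str, answer: str) -> str:
    """Substitute one row's values into the template placeholders."""
    return template.format(question=question, answer=answer)


def snap_to_rails(raw_output: str, rails: List[str]) -> Optional[str]:
    """Map a raw judge reply onto an allowed label, or None if it matches none."""
    # Drop surrounding whitespace, quotes, and trailing periods, then lowercase.
    cleaned = raw_output.strip().strip("\"'.").lower()
    # Exact match only: "incorrect" contains "correct", so substring checks would misfire.
    return cleaned if cleaned in rails else None


prompt = fill_template(TEMPLATE, "What is the capital of France?", "Paris, by the Seine so fair.")
label = snap_to_rails(' "Correct". ', RAILS)
```

The exact-match comparison matters here because one label is a substring of the other.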

# Experiment #1 ~ Changing Models

In [None]:
import json

from phoenix.evals import OpenAIModel, llm_classify
from phoenix.experiments import run_experiment
from phoenix.experiments.types import Example

system_prompt = """
You are a question-answering assistant. For each user question, randomly choose an option: NONSENSE or RHYME. You can favor NONSENSE. If you choose RHYME, answer correctly in the form of a rhyme.

If you choose NONSENSE, do not answer the question at all, and instead respond with nonsense words and random numbers that do not rhyme, ignoring the user’s question completely.
When responding with NONSENSE, include at least five nonsense words and at least five random numbers between 0 and 9999 in your response.

Do not explain your choice or indicate if you are using RHYME or NONSENSE in the response.
"""


def updated_task(example: Example) -> str:
    raw_input_value = example.input["attributes.input.value"]
    data = json.loads(raw_input_value)
    question = data["messages"][1]["content"]
    response = openai_client.chat.completions.create(
        model="gpt-4.1",  # swap model. other examples: gpt-4o-mini, gpt-4.1-mini, gpt-4, etc.
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content
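
The task above relies on `attributes.input.value` being a JSON string whose `messages` list holds the system prompt first and the user question second. A slightly more defensive sketch, assuming that same payload shape, looks up the last message with role `user` instead of indexing position 1. `extract_question` is a hypothetical helper, and the sample payload below is made up to mimic a recorded chat request.

```python
import json


def extract_question(raw_input_value: str) -> str:
    """Return the content of the last user message in a serialized chat request."""
    data = json.loads(raw_input_value)
    # Walk backwards so a multi-turn conversation yields the most recent question.
    for message in reversed(data.get("messages", [])):
        if message.get("role") == "user":
            return message.get("content", "")
    raise ValueError("no user message found in input payload")


# Hypothetical payload (assumed shape, not copied from a real span).
sample = json.dumps(
    {
        "messages": [
            {"role": "system", "content": "Answer in rhyme."},
            {"role": "user", "content": "What is the capital of France?"},
        ],
        "model": "gpt-4o",
    }
)

question = extract_question(sample)
```

Looking up by role keeps the task working if the recorded request omits the system message or contains several turns.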

In [None]:
def evaluate_response(input, output):
    raw_input_value = input["attributes.input.value"]
    data = json.loads(raw_input_value)
    question = data["messages"][1]["content"]
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": question, "answer": output}]),
        template=RHYME_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4-turbo"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    # Return a plain int (1 = correct, 0 = incorrect) instead of a one-row Series
    label = response_classifications["label"].iloc[0]
    return 0 if label == "incorrect" else 1

After running the cell below, you will see your experiment results in the “Experiments” tab. Don’t expect high evaluator performance yet, as we haven’t fixed the root issue—the system prompt. The purpose here is to demonstrate an example of running an experiment.

In [None]:
experiment = run_experiment(
    dataset,
    task=updated_task,
    evaluators=[evaluate_response],
    experiment_name="updated model",
    experiment_description="updated model gpt-4.1",
)

# Experiment #2 ~ Improving the System Prompt

In [None]:
system_prompt = """
You are a question-answering assistant. For each user question, answer correctly in the form of a rhyme.
"""


def updated_task(example: Example) -> str:
    raw_input_value = example.input["attributes.input.value"]
    data = json.loads(raw_input_value)
    question = data["messages"][1]["content"]
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

In [None]:
def evaluate_response(input, output):
    raw_input_value = input["attributes.input.value"]
    data = json.loads(raw_input_value)
    question = data["messages"][1]["content"]
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": question, "answer": output}]),
        template=RHYME_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4.1"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    # Return a plain int (1 = correct, 0 = incorrect) instead of a one-row Series
    label = response_classifications["label"].iloc[0]
    return 0 if label == "incorrect" else 1

Here, we expect to see improvements in our experiment. Now that the system prompt no longer asks for nonsense, the evaluator should mark significantly fewer answers as incorrect.

In [None]:
experiment = run_experiment(
    dataset,
    task=updated_task,
    evaluators=[evaluate_response],
    experiment_name="updated system prompt",
    experiment_description="updated system prompt",
)
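
Phoenix reports evaluator scores for each run in the Experiments tab. If you want to compare two runs by hand, one simple summary is the pass rate, which is the mean of the 0/1 scores. The sketch below uses made-up placeholder scores, not results from this tutorial, and `pass_rate` is a hypothetical helper.

```python
from statistics import mean
from typing import Sequence


def pass_rate(scores: Sequence[int]) -> float:
    """Fraction of examples the evaluator scored 1 (correct)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return float(mean(scores))


# Placeholder values for illustration only; substitute your own evaluator scores.
baseline_scores = [0, 1, 0, 0, 1]
candidate_scores = [1, 1, 1, 0, 1]

improvement = pass_rate(candidate_scores) - pass_rate(baseline_scores)
print(f"baseline={pass_rate(baseline_scores):.2f} "
      f"candidate={pass_rate(candidate_scores):.2f} "
      f"improvement={improvement:+.2f}")
```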

# Apply updates to your application

Now that we’ve completed an experimentation cycle and confirmed our changes on the annotated traces, we can update the application and test the results on the broader dataset. This helps ensure that improvements made during experimentation translate effectively to real-world usage and that your system performs reliably at scale.

In [None]:
questions = [
    "What is the capital of France?",
    "Who wrote 'Pride and Prejudice'?",
    "What is the boiling point of water in Celsius?",
    "What is the largest planet in our solar system?",
    "Who developed the theory of relativity?",
    "What is the chemical symbol for gold?",
    "In which year did the Apollo 11 mission land on the moon?",
    "What language has the most native speakers worldwide?",
    "Which continent has the most countries?",
    "What is the square root of 144?",
    "What is the largest country in the world by land area?",
    "Why is the sky blue?",
    "Who painted the Mona Lisa?",
    "What is the smallest prime number?",
    "What gas do plants absorb from the atmosphere?",
    "Who was the first President of the United States?",
    "What is the currency of Japan?",
    "How many continents are there on Earth?",
    "What is the tallest mountain in the world?",
    "Who is the author of '1984'?",
]
questions_df = pd.DataFrame(questions, columns=["Questions"])
dataset = px.Client().upload_dataset(
    dataframe=questions_df,
    input_keys=["Questions"],
    dataset_name="all-trivia-questions",
)

In [None]:
system_prompt = """
You are a question-answering assistant. For each user question, answer correctly in the form of a rhyme.
"""


# Task: answer one dataset question using the improved system prompt
def complete_task(question) -> str:
    question_str = question["Questions"]
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question_str},
        ],
    )
    return response.choices[0].message.content


def evaluate_all_responses(input, output):
    response_classifications = llm_classify(
        dataframe=pd.DataFrame([{"question": input["Questions"], "answer": output}]),
        template=RHYME_PROMPT_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    # Return a plain int (1 = correct, 0 = incorrect) instead of a one-row Series
    label = response_classifications["label"].iloc[0]
    return 0 if label == "incorrect" else 1


experiment = run_experiment(
    dataset=dataset,
    task=complete_task,
    evaluators=[evaluate_all_responses],
    experiment_name="modified-system-prompt-full-dataset",
)

## Tips for building your custom annotation UI

Here is a sample prompt you can feed into [Lovable](https://lovable.dev) to start building your custom LLM trace annotation UI. Feel free to adjust it to your needs. Note that you will need to implement functionality to fetch spans and send annotations to Phoenix. We’ve also included a brief explanation of how we approached this in our own implementation.

**Prompt for Lovable:**

Build a platform for annotating LLM spans and traces:

1. Connect to Phoenix Cloud by collecting endpoint, API key, and project name from the user.
2. Load traces and spans from Phoenix (via [REST API](https://arize.com/docs/phoenix/sdk-api-reference/spans#get-v1-projects-project_identifier-spans) or [Python SDK](https://arize.com/docs/phoenix/tracing/how-to-tracing/feedback-and-annotations/evaluating-phoenix-traces#download-trace-dataset-from-phoenix)).
3. Display spans grouped by trace_id, with clear visual separation.
4. Allow annotators to assign a label, score, and explanation to each span or entire trace.
5. Support sending annotations back to Phoenix and reloading to see updates.
6. Use a clean, modern design.

**Details on how we built our Annotation UI:**

✅ Frontend (Lovable):

- Built in Lovable for easy UI generation.
- Allows loading LLM traces, displaying spans grouped by trace_id, and annotating spans with label, score, explanation.
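
Grouping spans by `trace_id` is the core data transformation behind that display. Here is a minimal sketch in Python, assuming each span is a dict that carries its trace ID under `context.trace_id` and an ISO 8601 `start_time`. The field names and sample data are assumptions for illustration, so check them against the spans your backend actually returns.

```python
from collections import defaultdict
from typing import Any, Dict, List

Span = Dict[str, Any]


def group_spans_by_trace(spans: List[Span]) -> Dict[str, List[Span]]:
    """Bucket spans by trace ID and order each bucket by start time."""
    grouped: Dict[str, List[Span]] = defaultdict(list)
    for span in spans:
        grouped[span["context"]["trace_id"]].append(span)
    for trace_spans in grouped.values():
        # ISO 8601 timestamps in the same format and timezone sort correctly as strings.
        trace_spans.sort(key=lambda s: s["start_time"])
    return dict(grouped)


# Hypothetical spans (assumed shape, made-up IDs).
sample_spans = [
    {"name": "llm-b", "context": {"trace_id": "t1", "span_id": "s2"}, "start_time": "2025-01-01T00:00:02Z"},
    {"name": "llm-a", "context": {"trace_id": "t1", "span_id": "s1"}, "start_time": "2025-01-01T00:00:01Z"},
    {"name": "llm-c", "context": {"trace_id": "t2", "span_id": "s3"}, "start_time": "2025-01-01T00:00:03Z"},
]

traces = group_spans_by_trace(sample_spans)
```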

✅ Backend (Render, FastAPI):

- Hosted on Render using FastAPI.
- Adds CORS middleware so that the browser-based Lovable frontend is allowed to call the backend.
- Uses two key endpoints:
  1. `GET /v1/projects/{project_identifier}/spans`
  2. `POST /v1/span_annotations`
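
As a rough sketch of what a backend needs to assemble for these two endpoints, the helpers below build the spans URL and a span-annotation request body using only the standard library. The body shape (a `data` list whose entries hold `span_id`, `name`, `annotator_kind`, and a `result` with `label`, `score`, and `explanation`) is an assumption you should verify against the Phoenix API reference. `spans_url`, `annotation_payload`, and the example host are hypothetical, and authentication headers are deliberately left out.

```python
import json
from typing import Any, Dict
from urllib.parse import quote


def spans_url(base_url: str, project_identifier: str) -> str:
    """Build the URL for listing a project's spans."""
    project = quote(project_identifier, safe="")
    return f"{base_url.rstrip('/')}/v1/projects/{project}/spans"


def annotation_payload(
    span_id: str,
    name: str,
    label: str,
    score: float,
    explanation: str,
    annotator_kind: str = "HUMAN",
) -> Dict[str, Any]:
    """Assemble a request body for one span annotation (assumed schema)."""
    return {
        "data": [
            {
                "span_id": span_id,
                "name": name,
                "annotator_kind": annotator_kind,
                "result": {"label": label, "score": score, "explanation": explanation},
            }
        ]
    }


# Hypothetical host and span ID, for illustration only.
url = spans_url("https://example.com/", "my-annotations-app")
body = json.dumps(
    annotation_payload("abc123", "correctness", "incorrect", 0, "Response was nonsense.")
)
```

The annotation name `correctness` matches the filter used earlier when building the dataset from annotated spans.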