<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Arize Phoenix</h1>

Arize Phoenix is a fully open-source AI observability platform. It's designed for experimentation, evaluation, and troubleshooting. It provides:

- [**_Tracing_**](https://arize.com/docs/phoenix/tracing/llm-traces) - Trace your LLM application's runtime using OpenTelemetry-based instrumentation.
- [**_Evaluation_**](https://arize.com/docs/phoenix/evaluation/llm-evals) - Leverage LLMs to benchmark your application's performance using response and retrieval evals.
- [**_Datasets_**](https://arize.com/docs/phoenix/datasets-and-experiments/overview-datasets) - Create versioned datasets of examples for experimentation, evaluation, and fine-tuning.
- [**_Experiments_**](https://arize.com/docs/phoenix/datasets-and-experiments/overview-datasets#experiments) - Track and evaluate changes to prompts, LLMs, and retrieval.
- [**_Playground_**](https://arize.com/docs/phoenix/prompt-engineering/overview-prompts) - Optimize prompts, compare models, adjust parameters, and replay traced LLM calls.
- [**_Prompt Management_**](https://arize.com/docs/phoenix/prompt-engineering/overview-prompts/prompt-management) - Manage and test prompt changes systematically using version control, tagging, and experimentation.

Phoenix is vendor and language agnostic with out-of-the-box support for popular frameworks and AI providers.
<center>
    <p style="text-align:center">
        <img alt="First-class support for various frameworks and ai providers" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/openinference_integrations.jpg" width="1000"/>
    </p>
</center>

Phoenix runs practically anywhere, including your local machine, a Jupyter notebook, a containerized deployment, or in the cloud.

<center>
    <p style="text-align:center">
        <img alt="Phoenix deployment strategies" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/deployment_strategies.png" width="1000"/>
    </p>
</center>

The first question to ask when choosing an observability and evaluation tool is: "Does the tool help me build good and responsible AI systems?" Most modern platforms can do this. What sets Phoenix apart is:

- 🌎 It's fully open-source and its development is driven heavily by developer feedback
- 🔐 It's privacy first, where the data is easily accessible inside your VPC or computer
- 🕊️ It has no feature gates and strives to maximize value for its users
- ⚙️ It's designed to be customizable to your needs through APIs and SDKs
- ✌️ It's built on open standards and protocols like OTEL
- 💸 It's free - because its goal is to be a platform built by developers for developers

# The AI Problem
<p style="text-align:center">
  <img alt="AI dev as scientific method" src="https://storage.googleapis.com/arize-phoenix-assets/assets/gifs/20250524_1125_Forest%20Robots%20Interaction_simple_compose_01jw1n770bep1a829kw3cvvcsc.gif" width="80%" />
</p>
The hard truth: Building great AI-native products requires a rigorous evaluation process.

Talking to an LLM can feel like talking to a new species. We don't think this is an accident. In many ways we are AI scientists observing emergent behavior and the AI development cycle really is the scientific method in disguise. Just as scientists meticulously record experiments and take detailed notes to advance their understanding, AI systems require rigorous observation through tracing, annotations, and experimentation to reach their full potential. The goal of AI-native products is to build tools that empower humans, and it requires careful human judgment to align AI with human preferences and values.

 <p style="text-align:center">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/scientific_method.png" width="80%">
</p>

## 👷‍♀️ Let's build an App

Let's build an App that uses common LLM prompting techniques. Specifically, let's try to get an LLM to produce structured output. Let's tackle a particularly messy problem - getting an LLM to produce SQL.

We are going to build a simple agent that can answer movie trivia. While this can probably be performed by an LLM, we are going to force the LLM to look up the movie trivia from a SQL database. You can imagine this technique could be very useful if you wanted to expose an internal knowledge store to your agent.

<p style="text-align: center">
  <img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/txt_2_sql.png" style="width: 80%" />
</p>

## 🎥 Tracing
Just like scientists, every AI engineer needs a great camera. For this we will use OpenTelemetry, which produces traces of your LLM calls, tools, and more.

OpenTelemetry helps to capture the inputs and outputs to our LLM system. We want to trace enough parts of our system so that we can debug failure modes and perform error analysis.

Let's roll camera.

In [None]:
!pip install -U "arize-phoenix-otel" "arize-phoenix-client>=1.20.0" "arize-phoenix-evals>=2.3.0" openai 'httpx<0.28' duckdb datasets pyarrow "pydantic>=2.0.0" nest_asyncio "openinference-instrumentation>=0.1.38" openinference-instrumentation-openai --quiet

This tutorial assumes you have a locally running Phoenix server. We can think of Phoenix as a video recorder, observing every activity of your AI application.

```shell
phoenix serve
```

In [None]:
from phoenix.otel import register

tracer_provider = register(
    project_name="movie-app",
    endpoint="http://localhost:6006/v1/traces",
    verbose=False,
    auto_instrument=True,  # Start recording traces via OpenAIInstrumentor
)

tracer = tracer_provider.get_tracer(__name__)
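
To build some intuition for what the `tracer` decorators used later in this notebook record, here is a toy, in-memory sketch. It is not the OpenTelemetry or Phoenix API: `traced`, `SPANS`, `lookup`, and `answer` are hypothetical names made up for illustration. The point is only that each call becomes a span with a name, a kind, an input, an output, a duration, and a parent.

```python
import functools
import time

# Toy in-memory "trace": a list of span dicts. This is NOT the OpenTelemetry
# or Phoenix API, only an illustration of what a span records.
SPANS = []
_stack = []  # names of the spans currently open, used to find the parent


def traced(kind):
    """Hypothetical decorator that records one span per function call."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span = {
                "name": fn.__name__,
                "kind": kind,
                "parent": _stack[-1] if _stack else None,
                "input": {"args": args, "kwargs": kwargs},
            }
            _stack.append(fn.__name__)
            start = time.perf_counter()
            try:
                span["output"] = fn(*args, **kwargs)
                return span["output"]
            finally:
                # Spans are appended when they finish, so children come first
                span["duration_s"] = time.perf_counter() - start
                _stack.pop()
                SPANS.append(span)

        return wrapper

    return decorator


@traced("tool")
def lookup(title):
    return {"title": title, "rating": 8.1}  # made-up data


@traced("chain")
def answer(question):
    return lookup("Example Movie")["rating"]


answer("How is Example Movie rated?")
for span in SPANS:
    print(span["kind"], span["name"], "parent:", span["parent"])
```

The nested call shows up as a child span whose parent is the outer function, which is the structure that makes error analysis possible later on.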

Lastly, let's make sure we have our OpenAI API key set up.

In [None]:
import os
from getpass import getpass

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

## 🗄️ Download Movie Data

We are going to use a movie dataset that contains recent titles and their ratings. We will use DuckDB as our SQL database so that we can run the queries directly in the notebook, but you can imagine that this could be a pre-existing SQL database with business-specific data.

In [None]:
import duckdb
from datasets import load_dataset

data = load_dataset("wykonos/movies")["train"]

conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("movies", data.to_pandas())

In [None]:
records = conn.query("SELECT * FROM movies LIMIT 10").to_df().to_dict(orient="records")

for record in records:
    print(record)

## Convert Human Questions -> SQL (text-to-sql)

Let's use an LLM to take human questions and convert them into SQL so we can query the data above. Note that the prompt does a few specific things:

- We need to tell the LLM what our database table looks like. Let's pass it the columns and the column types
- We want the output to be pure SQL (select * from ...). LLMs tend to respond in markdown. Let's try to make sure it doesn't
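
Prompt instructions alone cannot guarantee raw SQL, so a small defensive cleanup step can help. The sketch below is not part of the original pipeline: `strip_sql_formatting` is a hypothetical helper that only handles the two backtick patterns named in the prompt's bad examples.

```python
import re

# Matches a whole string wrapped in triple backticks, with an optional
# language tag on its own first line (e.g. "sql" followed by a newline).
_FENCE = re.compile(r"^```(?:[a-zA-Z]+\n)?\s*(.*?)\s*```$", re.DOTALL)


def strip_sql_formatting(text):
    """Hypothetical helper: remove markdown wrappers around a SQL string."""
    cleaned = text.strip()
    match = _FENCE.match(cleaned)
    if match:
        cleaned = match.group(1).strip()
    # Single backticks around the whole string, e.g. `SELECT ...`
    if len(cleaned) >= 2 and cleaned.startswith("`") and cleaned.endswith("`"):
        cleaned = cleaned.strip("`").strip()
    return cleaned


for raw in [
    "SELECT * FROM movies",
    "`SELECT * FROM movies`",
    "```sql\nSELECT * FROM movies\n```",
    "```SELECT * FROM movies```",
]:
    print(repr(strip_sql_formatting(raw)))
```

This deliberately leaves chatty prefixes such as "here is the sql:" untouched; those are better caught by an evaluator than silently patched.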

In [None]:
import os

import openai

from phoenix.client import AsyncClient
from phoenix.client.types import PromptVersion

px_client = AsyncClient()
client = openai.AsyncClient()

columns = conn.query("DESCRIBE movies").to_df().to_dict(orient="records")

# We will use GPT-4o to start
TASK_MODEL = "gpt-4o"
CONFIG = {"model": TASK_MODEL}

system_prompt = f"""
You are a SQL expert who takes user queries and transforms them into a SQL query to be executed.

You are given a table named `movies` with the following columns and types:

{",".join(column["column_name"] + ": " + column["column_type"] for column in columns)}

Write a raw DuckDB SQL query corresponding to the user's question. Return only a SQL query
with no formatting. The response SHOULD NOT include backticks or markdown formatting.

[BEGIN EXAMPLES]
************
[BAD RESPONSES]
***************
- `SELECT * FROM movies`
- sql```SELECT * FROM movies``
- here is the sql: SELECT * FROM movies
***************
[GOOD RESPONSES]
***************
- SELECT * FROM movies
***************
[END EXAMPLES]
"""

prompt_template = await px_client.prompts.create(
    name="movie-text-to-sql",
    version=PromptVersion(
        [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": "{{question}}",
            },
        ],
        description="Initial prompt for text-to-sql",
        model_name=TASK_MODEL,
    ),
)


@tracer.chain
async def generate_sql(question):
    # Vendor agnostic - can directly use OpenAI
    prompt = prompt_template.format(variables={"question": question}, sdk="openai")
    response = await client.chat.completions.create(
        **prompt,
        temperature=0,
    )
    return response.choices[0].message.content

In [None]:
query = await generate_sql("What is the top grossing movie?")
print(query)

Looks like the LLM is producing SQL. Let's try running the query against the database and see if we get the expected results. Just because the SQL query looks valid doesn't mean it's correct.

Note: we again wrap this function in a decorator and denote that this is a tool that the LLM is using. While not explicitly a tool call, it's largely the same paradigm.

In [None]:
import math


def sanitize_records(record):
    # Replace float NaN values in a single row dict with None
    return {k: None if isinstance(v, float) and math.isnan(v) else v for k, v in record.items()}


@tracer.tool
def execute_sql(query):
    # Run the SQL against DuckDB and return the rows as a list of dicts
    records = conn.query(query).fetchdf().to_dict(orient="records")
    return list(map(sanitize_records, records))

In [None]:
execute_sql(query)
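
Why bother replacing NaN with `None`? One likely reason (an assumption, the notebook does not say) is that NaN is not valid JSON, so rows containing it can cause trouble once they are serialized. The sketch below uses a made-up row and a local copy of the sanitizing logic named `sanitize_record`.

```python
import json
import math


def sanitize_record(record):
    # Same logic as the sanitizer above: float NaN becomes None
    return {k: None if isinstance(v, float) and math.isnan(v) else v for k, v in record.items()}


# Hypothetical row with a missing numeric value, as pandas would represent it
row = {"title": "Example Movie", "revenue": float("nan")}

try:
    # allow_nan=False makes the json module enforce the JSON spec
    json.dumps(row, allow_nan=False)
except ValueError as err:
    print("strict JSON rejects NaN:", err)

clean = sanitize_record(row)
print(json.dumps(clean, allow_nan=False))
```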

Let's put the pieces together and see if we can create a movie agent that feels helpful. Here we are performing very simple RAG where the SQL query results are being passed to an LLM to synthesize a human-friendly answer.

In [None]:
@tracer.chain
async def query_db(question):  # noqa: F811
    sql = await generate_sql(question)
    results = execute_sql(sql)
    return {
        "sql": sql,
        "results": results,
    }

In [None]:
synthesis_system_prompt = """
You are a helpful assistant that can answer questions about movies. You are charming, witty, honest, and interesting.

Answer the question based on the SQL results. Do not rely on your internal knowledge.

Do not use SQL or abbreviations for genres or languages. Use an informative, concise voice.
Your response should be purely in natural language, do not include any SQL or other technical details.

If the SQL results are empty, say you don't know.
"""

synthesis_user_prompt_template = """
Answer the question based on the SQL results.

[BEGIN DATA]
************
[Question]: {{question}}
************
[SQL Results]: {{results}}
************
[END DATA]

Answer:
"""

synthesis_prompt_template = await px_client.prompts.create(
    name="movie-synthesis",
    version=PromptVersion(
        [
            {
                "role": "system",
                "content": synthesis_system_prompt,
            },
            {
                "role": "user",
                "content": synthesis_user_prompt_template,
            },
        ],
        description="Initial prompt for synthesis",
        model_name=TASK_MODEL,
    ),
)


@tracer.agent
async def movie_agent(question):
    sql_response = await query_db(question)
    prompt = synthesis_prompt_template.format(
        variables={"question": question, "results": str(sql_response["results"])}, sdk="openai"
    )
    answer = await client.chat.completions.create(**prompt)
    return answer.choices[0].message.content

In [None]:
await movie_agent("What is the top grossing movie?")

Looks like we have a working movie expert. Or do we? Let's double check. Let's run the agent over some examples we think the agent should be able to answer.

In [None]:
questions = [
    "Which Brad Pitt movie received the highest rating?",
    "What is the top grossing Marvel movie?",
    "What foreign-language fantasy movie was the most popular?",
    "what are the best sci-fi movies of 2017?",
    "What anime topped the box office in the 2010s?",
    "Recommend a romcom that stars Paul Rudd.",
]

Let's run the above questions against our agent and record the traces under a separate project as a "baseline" so we can see if we can improve on it.

In [None]:
from openinference.instrumentation import dangerously_using_project

with dangerously_using_project(project_name="movie-agent-baseline"):
    for question in questions:
        try:
            answer = await movie_agent(question)
            print("Question: ", question)
            print("Answer: ", answer)
            print("\n")
        except Exception as e:
            print(e)

Let's look at the data and annotate it to see what the issues might be. Go to Settings > Annotations and add a correctness annotation config. Configure it as a categorical annotation with two categories, `correct` and `incorrect`. We can now quickly annotate the 6 traces (i.e. the agent spans) above as `correct` or `incorrect`. Once we've annotated some data we can bring it back into the notebook to analyze it.

In [None]:
from phoenix.client import AsyncClient
from phoenix.client.types.spans import SpanQuery

px_client = AsyncClient()
query = SpanQuery().where("name == 'movie_agent'")

spans_df = await px_client.spans.get_spans_dataframe(
    project_identifier="movie-agent-baseline", query=query
)
annotations_df = await px_client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df, project_identifier="movie-agent-baseline"
)

combined_df = annotations_df.join(spans_df, how="inner")

In [None]:
examples_df = combined_df[
    ["annotation_name", "result.label", "attributes.input.value", "attributes.output.value"]
].head()
examples_df

Let's see if we can create an LLM judge that aligns with our human annotations.

In [None]:
example_answers = "\n\n".join(
    [
        f"Question: {example['attributes.input.value']}\nAnswer: {example['attributes.output.value']}\nLabel: {example['result.label']}"
        for example in examples_df.to_dict(orient="records")
    ]
)
eval_prompt = f"""
You are an expert evaluator of question and answer pairs. You will be given a human question and an answer from an AI agent.
Your job is to determine if the answer is "correct" or "incorrect" and to provide a clear reason why the label should be assigned.

Here are some examples of correct and incorrect answers:
<examples>
{example_answers}
</examples>

<data>
<question>
{{attributes.input.value}}
</question>
<answer>
{{attributes.output.value}}
</answer>
</data>
"""

print(eval_prompt)
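
One subtlety in the cell above: `eval_prompt` is an f-string, so the doubled braces are escapes. The examples are filled in right away, while `{{...}}` is printed as a single-braced placeholder that is left for a later formatting pass, presumably filled per row by the evaluator. A minimal sketch of the two stages, using made-up text and `str.format` as a stand-in for that later pass:

```python
# Hypothetical few-shot text standing in for example_answers
example_answers = "Question: Q1\nAnswer: A1\nLabel: correct"

# Stage 1: the f-string fills {example_answers} immediately, while the
# doubled braces are escapes that leave a literal {question} behind.
template = f"<examples>\n{example_answers}\n</examples>\n<question>\n{{question}}\n</question>"
print(template)

# Stage 2: a later formatting pass fills the remaining placeholder.
# str.format stands in here for whatever the evaluator does internally.
filled = template.format(question="What is the top grossing movie?")
print(filled)
```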

In [None]:
spans_df[["attributes.input.value", "attributes.output.value"]].head()

In [None]:
from phoenix.evals.evaluators import create_classifier
from phoenix.evals.llm import LLM

# Define a classification based evaluation
llm_correctness = create_classifier(
    name="llm_correctness",
    llm=LLM(model="gpt-4o", provider="openai"),
    prompt_template=eval_prompt,
    choices={"correct": 1, "incorrect": 0},
)

In [None]:
from phoenix.evals import async_evaluate_dataframe

evals_df = await async_evaluate_dataframe(dataframe=spans_df, evaluators=[llm_correctness])

In [None]:
evals_df.head()

In [None]:
from phoenix.client import AsyncClient
from phoenix.evals.utils import to_annotation_dataframe

px_client = AsyncClient()
await px_client.spans.log_span_annotations_dataframe(
    dataframe=to_annotation_dataframe(evals_df),
)
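
To check how well the judge aligns with our annotations, a simple agreement rate is a reasonable first number to look at. The labels below are made up for illustration; in practice they would come from the human annotation column and the judge's output for the same spans.

```python
from collections import Counter

# Hypothetical labels for the same six spans, in the same order
human_labels = ["correct", "incorrect", "incorrect", "correct", "incorrect", "correct"]
judge_labels = ["correct", "incorrect", "correct", "correct", "incorrect", "correct"]

# Fraction of spans where the judge and the human agree
agreement = sum(h == j for h, j in zip(human_labels, judge_labels)) / len(human_labels)

# Counts of (human, judge) pairs show where the disagreements are
confusion = Counter(zip(human_labels, judge_labels))

print(f"agreement: {agreement:.2f}")
for (human, judge), count in sorted(confusion.items()):
    print(f"human={human:<9} judge={judge:<9} count={count}")
```

With only a handful of annotated traces this number is noisy, so it is best read as a smoke test rather than a measurement.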

## 🧪Experimentation

The velocity of AI application development is bottlenecked by the availability of high-quality evaluations, because engineers are often faced with hard trade-offs: which prompt or LLM best balances performance, latency, and cost? Quality evaluations are critical because they help answer these types of questions with greater confidence.

Evaluation consists of three parts — data, task, and evals. We'll start with data.

<p style="text-align: center">
<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/experiment_analogy.png" width="800">
</p>
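
Stripped of tracing, storage, and concurrency, the three parts fit together roughly like the loop below. This is a conceptual sketch only: `toy_run_experiment`, `toy_task`, and `non_empty` are hypothetical names, and the real experiment runner does considerably more.

```python
def toy_run_experiment(dataset, task, evaluators):
    """Run the task on every example, then score each output with every evaluator."""
    rows = []
    for example in dataset:
        output = task(example)
        scores = {evaluator.__name__: evaluator(output) for evaluator in evaluators}
        rows.append({"input": example, "output": output, "scores": scores})
    return rows


# Data: the inputs we care about (made-up examples)
dataset = [{"question": "known"}, {"question": "unknown"}]


# Task: the system under test, faked here with a lookup
def toy_task(example):
    found = example["question"] == "known"
    return {"results": [{"title": "Example Movie"}] if found else []}


# Eval: a function that turns an output into a score
def non_empty(output):
    return 1.0 if output["results"] else 0.0


rows = toy_run_experiment(dataset, toy_task, [non_empty])
mean_score = sum(row["scores"]["non_empty"] for row in rows) / len(rows)
print("mean non_empty score:", mean_score)
```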

Let's store the movie questions we created above as a versioned dataset in Phoenix.

In [None]:
import pandas as pd

ds = await px_client.datasets.create_dataset(
    name="movie-train",
    dataframe=pd.DataFrame([{"question": question} for question in questions]),
    input_keys=["question"],
    output_keys=[],
)

# If you have already uploaded the dataset, you can fetch it using the following line
# ds = await px_client.datasets.get_dataset(dataset="movie-train")

Next, we'll define the task. The task is to generate SQL queries from natural language questions.

In [None]:
@tracer.chain
async def query_db(question):  # noqa: F811
    query = await generate_sql(question)
    results = execute_sql(query)
    return {
        "query": query,
        "results": results,
    }

In [None]:
res = await query_db("What are the top Sci-Fi movies?")
for row in res["results"]:
    print(row)

Finally, we'll define the evaluators. We'll use the following simple function that produces 1 if we got results and 0 if not.

In [None]:
# Test if the query has results
def has_results(output):
    results = output.get("results")
    has_results = results is not None and len(results) > 0
    return 1.0 if has_results else 0.0
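
Before wiring the evaluator into an experiment, it is worth poking at it by hand. The outputs below are made up; the function is repeated so the snippet runs on its own.

```python
def has_results(output):
    # Same check as above: 1.0 when the task returned at least one row
    results = output.get("results")
    return 1.0 if results is not None and len(results) > 0 else 0.0


# Hypothetical task outputs covering the main cases
with_rows = {"query": "SELECT title FROM movies LIMIT 1", "results": [{"title": "Example Movie"}]}
empty = {"query": "SELECT title FROM movies WHERE 1 = 0", "results": []}
missing_key = {"query": "SELECT 1"}

print(has_results(with_rows), has_results(empty), has_results(missing_key))
```

Note what this evaluator does not check: any non-empty result scores 1.0, even if the rows have nothing to do with the question.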

Now let's run the experiment. To run the experiment, we pass the dataset of examples, the task which runs the SQL generation, and the evals described above.

In [None]:
from phoenix.client.experiments import async_run_experiment


# Define the task to run query_db on the input question
async def task(input):
    return await query_db(input["question"])


experiment = await async_run_experiment(
    dataset=ds,
    task=task,
    evaluators=[has_results],
    experiment_metadata=CONFIG,
    experiment_name="baseline",
    repetitions=3,
)

Ok. Not looking very good. It looks like only 4 out of 6 of our questions are yielding results. Let's dig in to see how we can fix these.


## Interpreting the results

Now that we ran the initial evaluation, it looks like 2 of the results are empty due to getting the genre wrong.

- `Sci-Fi` needs to be queried as `Science Fiction`
- `Anime` needs to be queried as `Animation` + language specification.

These two issues would probably be improved by showing a sample of the data to the model (e.g. few-shot examples), since the data will show the LLM what is queryable.

Let's try to improve the prompt with few-shot examples and see if we can get better results.

In [None]:
samples = conn.query("SELECT * FROM movies LIMIT 5").to_df().to_dict(orient="records")

example_row = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}"
    for column in columns
)

column_header = " | ".join(column["column_name"] for column in columns)

few_shot_examples = "\n".join(
    " | ".join(str(sample[column["column_name"]]) for column in columns) for sample in samples
)

system_prompt = f"""
You are a SQL expert who takes user queries and transforms them into a SQL query to be executed.

You are given a table named `movies` with the following columns:

[BEGIN EXAMPLES]
************
Column | Type | Example
-------|------|--------
{example_row}
************
[Example table rows]
{column_header}
{few_shot_examples}
************
[END EXAMPLES]

Write a raw DuckDB SQL query corresponding to the user's question. Return only the raw SQL query
with no formatting. The response SHOULD NOT include backticks or markdown formatting. Never query for more than 10 rows.

BAD RESPONSES:
- `SELECT * FROM movies`
- sql```SELECT * FROM movies``
- here is the sql: SELECT * FROM movies

GOOD RESPONSES:
- SELECT * FROM movies
"""

prompt_template = await px_client.prompts.create(
    name="movie-text-to-sql",
    version=PromptVersion(
        [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": "{{question}}",
            },
        ],
        description="Add few shot examples to the prompt",
        model_name=TASK_MODEL,
    ),
)
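
To see what the model actually receives, here is the same string-building logic run on a tiny stand-in table. The column names and rows are hypothetical and only meant to show the rendered layout.

```python
# Hypothetical schema and rows standing in for the DESCRIBE output and samples
columns = [
    {"column_name": "title", "column_type": "VARCHAR"},
    {"column_name": "genres", "column_type": "VARCHAR"},
    {"column_name": "vote_average", "column_type": "DOUBLE"},
]
samples = [
    {"title": "Example Movie A", "genres": "Science Fiction-Action", "vote_average": 7.5},
    {"title": "Example Movie B", "genres": "Animation-Family", "vote_average": 6.9},
]

# One line per column: name | type | value from the first sample row
example_row = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}"
    for column in columns
)

# Header plus one pipe-separated line per sample row
column_header = " | ".join(column["column_name"] for column in columns)
few_shot_examples = "\n".join(
    " | ".join(str(sample[column["column_name"]]) for column in columns) for sample in samples
)

print(example_row)
print()
print(column_header)
print(few_shot_examples)
```

Seeing a value like `Science Fiction` spelled out in a row is what lets the model map "Sci-Fi" in a question onto something that actually exists in the table.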

In [None]:
print(await generate_sql("What are the best Sci-Fi movies of 2017?"))

Looking much better! Since the prompt shows that "Sci-Fi" is represented as "Science Fiction", the LLM is able to synthesize the right `WHERE` clause.

Pro-tip: You can try out the prompt in the playground even before the next step!

Let's run the experiment again.

In [None]:
experiment = await async_run_experiment(
    dataset=ds,
    experiment_name="with examples",
    task=task,
    evaluators=[has_results],
    experiment_metadata=CONFIG,
)

Looks much improved. It looks like we're getting data out of our system. But just because we are getting info out of the DB doesn't mean these records are useful. Let's construct an LLM judge to see if the results are relevant to the question.

In [None]:
from phoenix.client.experiments import async_evaluate_experiment
from phoenix.evals import create_classifier

judge_prompt_template = """
You are a judge that determines if a given question can be answered with the SQL results.

Provide the label `useful` if the SQL results contain records that help answer the question.
Provide the label `useless` if the SQL results do not contain records that help answer the question.

<data>
<question>
{input.question}
</question>
<results>
{output.results}
</results>
</data>
"""

usefulness = create_classifier(
    name="usefulness",
    llm=LLM(model="gpt-4o", provider="openai"),
    prompt_template=judge_prompt_template,
    choices={"useful": 1, "useless": 0},
)

await async_evaluate_experiment(experiment=experiment, evaluators=[usefulness])

The LLM judge's scoring closely matches our manual review, demonstrating its effectiveness as an automated evaluation method. This approach is particularly valuable when traditional rule-based scoring functions are difficult to implement.

The LLM judge also shows an advantage in nuanced understanding - for example, it correctly identifies that 'Anime' and 'Animation' are distinct genres, a subtlety our code-based evaluators missed. This highlights why developing custom LLM judges tailored to your specific task requirements is crucial for accurate evaluation.


We now have a simple text-to-sql pipeline that can be used to generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can now use the Phoenix UI to convert the spans that generated successful queries into examples for a **Golden Dataset** to use in regression testing as well.

## Bringing it all together

Now that we've seen the experiment improve our outcome, let's put it to the test using the evals we built out earlier.

In [None]:
from openinference.instrumentation import dangerously_using_project


@tracer.agent
async def movie_agent_improved(question):
    sql_response = await query_db(question)
    prompt = synthesis_prompt_template.format(
        variables={"question": question, "results": str(sql_response["results"])}, sdk="openai"
    )
    answer = await client.chat.completions.create(**prompt)
    return answer.choices[0].message.content


with dangerously_using_project(project_name="movie-agent-improved"):
    for question in questions:
        try:
            answer = await movie_agent_improved(question)
            print("Question: ", question)
            print("Answer: ", answer)
            print("\n")
        except Exception as e:
            print(e)

In [None]:
from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

phoenix_client = Client()
query = SpanQuery().where("name == 'movie_agent_improved'")

spans_df = phoenix_client.spans.get_spans_dataframe(
    project_identifier="movie-agent-improved", query=query
)

spans_df.head()

In [None]:
from phoenix.evals import async_evaluate_dataframe

evals_df = await async_evaluate_dataframe(dataframe=spans_df, evaluators=[llm_correctness])

In [None]:
evals_df.head()

In [None]:
from phoenix.evals.utils import to_annotation_dataframe

px_client = AsyncClient()

await px_client.spans.log_span_annotations_dataframe(
    dataframe=to_annotation_dataframe(evals_df),
)

Our improved agent is now able to answer all 6 questions, but our `llm_correctness` eval was able to spot that the agent responses are not very good:

- querying for `Anime` and responding with `Frozen II` misses the mark on anime being a Japanese form of animation
- the LLM thinks "top" or "best" means rating but doesn't take into account the number of votes.

Our `movie-text-to-sql` prompt still needs more instructions if we want to improve its performance. But we're on the right track and can find more ways to guide the LLM.
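
As an illustration of the second issue, one common way to account for vote counts is a weighted rating that pulls sparsely rated titles toward a global mean. This is not something the notebook implements: the field names `vote_average` and `vote_count`, the movies, and the constants below are all assumptions chosen for the example.

```python
def weighted_rating(rating, votes, prior_mean, min_votes):
    """Blend a movie's own rating with a prior mean, weighted by its vote count."""
    weight = votes / (votes + min_votes)
    return weight * rating + (1 - weight) * prior_mean


# Hypothetical movies: one with a high rating from very few votes
movies = [
    {"title": "Obscure Gem", "vote_average": 9.5, "vote_count": 10},
    {"title": "Widely Seen Hit", "vote_average": 8.4, "vote_count": 20000},
]

PRIOR_MEAN = 7.0  # assumed average rating across the catalog
MIN_VOTES = 1000  # assumed number of votes before a rating is trusted


def score(movie):
    return weighted_rating(movie["vote_average"], movie["vote_count"], PRIOR_MEAN, MIN_VOTES)


ranked = sorted(movies, key=score, reverse=True)
for movie in ranked:
    print(movie["title"], round(score(movie), 2))
```

Sorting by raw rating would put the sparsely rated title first; the weighted score ranks the widely rated one higher. An instruction along these lines in the `movie-text-to-sql` prompt is one direction to experiment with.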

This tutorial demonstrated the core principles of building **evals that work** for AI applications. Here are the key concepts you should take away:

1. **Build & Trace**: Instrument your AI application with tracing from day one
2. **Annotate**: Use human judgment to label traces with simple heuristics like correct/incorrect
3. **Create Evaluators**: Build both simple programmatic evals as well as LLM judges
4. **Experiment**: Run systematic experiments to compare different approaches
5. **Iterate**: Use evaluation results to improve prompts, models, or architecture



# Bibliography

<cite id="yan2025">Yan, Z. (2025). An LLM-as-Judge Won't Save The Product—Fixing Your Process Will. *eugeneyan.com*. https://eugeneyan.com/writing/eval-process/</cite>