<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Evals that Work</h1>

Building great AI-native products requires a rigorous evaluation process. While evaluation-driven development may seem novel to some, it is really the scientific method in disguise. Just as scientists meticulously record experiments and take detailed notes to advance their understanding, AI systems require rigorous observation through tracing, annotations, and experimentation to reach their full potential. The goal of AI-native products is to build tools that empower humans, and aligning AI with human preferences and values requires careful human judgment.

Note: this tutorial is inspired by one originally authored by [Ankur Goyal](https://www.braintrust.dev/docs/cookbook/recipes/Text2SQL-Data).

 <p style="text-align:center">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/scientific_method.png" width="60%" style="float: left">
  <img alt="AI dev as scientific method" src="https://storage.googleapis.com/arize-phoenix-assets/assets/gifs/20250524_1125_Forest%20Robots%20Interaction_simple_compose_01jw1n770bep1a829kw3cvvcsc.gif" width="40%" style="float: right"/>
</p>

In [None]:
!pip install "arize-phoenix>=10.0.0" openai 'httpx<0.28' duckdb datasets pyarrow "pydantic>=2.0.0" nest_asyncio openinference-instrumentation-openai --quiet

This tutorial assumes you have a locally running Phoenix server. We can think of Phoenix as a video recorder that observes every activity of your AI application.

```shell
phoenix serve
```

Let's also set up tracing for OpenAI, as we will be using their API to perform the synthesis.

In [1]:
from phoenix.otel import register

tracer_provider = register(
    project_name="movie-app",
    auto_instrument=True,  # Start recording traces via OpenAIInstrumentor
)

tracer = tracer_provider.get_tracer(__name__)

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: movie-app
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



Let's make sure we can run async code in the notebook.

In [5]:
import nest_asyncio

nest_asyncio.apply()

Lastly, let's make sure we have our OpenAI API key set up.

In [6]:
import os
from getpass import getpass

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

## Download Data

We are going to use a movie dataset that contains recent titles and their ratings. We will use DuckDB as our database so that we can run the queries directly in the notebook.

In [10]:
import duckdb
from datasets import load_dataset

data = load_dataset("wykonos/movies")["train"]

conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("movies", data.to_pandas())

records = conn.query("SELECT * FROM movies LIMIT 5").to_df().to_dict(orient="records")

for record in records:
    print(record)



{'id': 385687, 'title': 'Fast X', 'genres': 'Action-Crime-Thriller', 'original_language': 'en', 'overview': "Over many missions and against impossible odds Dom Toretto and his family have outsmarted out-nerved and outdriven every foe in their path. Now they confront the most lethal opponent they've ever faced: A terrifying threat emerging from the shadows of the past who's fueled by blood revenge and who is determined to shatter this family and destroy everything—and everyone—that Dom loves forever.", 'popularity': 6682.1, 'production_companies': 'Universal Pictures-Original Film-One Race-Perfect Storm Entertainment', 'release_date': '2023-05-17', 'budget': 340000000.0, 'revenue': 686700000.0, 'runtime': 142.0, 'status': 'Released', 'tagline': 'The end of the road begins.', 'vote_average': 7.331, 'vote_count': 1856.0, 'credits': 'Vin Diesel-Michelle Rodriguez-Tyrese Gibson-Ludacris-John Cena-Nathalie Emmanuel-Jordana Brewster-Sung Kang-Jason Momoa-Scott Eastwood-Daniela Melchior-Alan R

## Implement Text2SQL

Let's start by implementing simple text-to-SQL logic. Note that we prompt the LLM to respond with just the SQL so that we can plug it directly into DuckDB.

In [22]:
import os

import openai

from phoenix.client import Client
from phoenix.client.types import PromptVersion

phoenix_client = Client()
client = openai.AsyncClient()

columns = conn.query("DESCRIBE movies").to_df().to_dict(orient="records")

# We will use GPT-4o to start
TASK_MODEL = "gpt-4o"
CONFIG = {"model": TASK_MODEL}

system_prompt = (
    "You are a SQL expert, and you are given a single table named `movies` with the following columns:\n"
    f'{",".join(column["column_name"] + ": " + column["column_type"] for column in columns)}\n'
    "Write a SQL query corresponding to the user's request. Return just the query text, "
    "with no formatting (backticks, markdown, etc.). The response should be pure SQL."
)

prompt_template = phoenix_client.prompts.create(
    name="text2sql",
    version=PromptVersion(
        [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": "{{question}}",
            },
        ],
        description="Initial prompt for text2sql",
        model_name=TASK_MODEL,
    ),
)


@tracer.chain
async def generate_query(question):
    prompt = prompt_template.format(variables={"question": question}, sdk="openai")
    response = await client.chat.completions.create(
        **prompt,
        temperature=0,
    )
    return response.choices[0].message.content

In [27]:
query = await generate_query("What is the top rated movie?")
print(query)

SELECT title FROM movies ORDER BY vote_average DESC LIMIT 1;


Awesome, it looks like the LLM is producing SQL! Let's try running the query and see if we get the expected results.

In [28]:
@tracer.tool
def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")


execute_query(query)

[{'title': 'Inside The Walking Dead Season 11 (Part 1)'}]

Let's put the pieces together and see if we can create a movie agent.

In [30]:
@tracer.chain
async def text2sql(question):  # noqa: F811
    query = await generate_query(question)
    results = None
    error = None
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)

    return {
        "query": query,
        "results": results,
        "error": error,
    }


synthesis_system_prompt = """
You are a helpful assistant that can answer questions about movies. Answer the question based on the sql results.

Do not use sql or abbreviations for genres or languages. Use an informative, concise voice.
Your response should be purely in natural language, do not include any sql or other technical details.

If the sql results are empty, say you don't know.
"""


@tracer.agent
async def movie_agent(question):
    sql_response = await text2sql(question)
    if sql_response["error"]:
        raise Exception(sql_response["error"])
    results = sql_response["results"]
    answer = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": synthesis_system_prompt},
            {
                "role": "user",
                "content": f"The sql results of the query are: {results}. Answer the following question: {question}. Answer:",
            },
        ],
    )
    return answer.choices[0].message.content


await movie_agent("What is the top rated movie?")

'The top rated movie in the provided results is "Inside The Walking Dead Season 11 (Part 1)".'

Let's run the agent over some examples.

In [None]:
from phoenix.trace import using_project

questions = [
    "Which movie received the most votes?",
    "Which movie had the highest rating?",
    "What french film was the most popular in 2015?",
    "what are the best sci-fi movies",
    "How many movies were released around Christmas in 2018?",
]

with using_project(project_name="movie-agent-baseline"):
    for question in questions:
        try:
            answer = await movie_agent(question)
            print(answer)
        except Exception as e:
            print(e)



The movie that received the most votes is "Inception".
I don't know.
Parser Error: syntax error at or near "```"
I don't know.
In 2018, 128 movies were released around Christmas.


Let's look at the traces and annotate some of them to see what the issues might be. Go to Settings > Annotations and add a correctness annotation config. Configure it as a categorical annotation with two categories, `correct` and `incorrect`. We can now quickly annotate the 5 traces above as `correct` or `incorrect`. Once we've annotated some data, we can bring it back into the notebook to analyze it.

In [35]:
from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

phoenix_client = Client()
query = SpanQuery().where("name == 'movie_agent'")

spans_df = phoenix_client.spans.get_spans_dataframe(
    project_identifier="movie-agent-baseline", query=query
)
annotations_df = phoenix_client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df, project_identifier="movie-agent-baseline"
)

combined_df = annotations_df.join(spans_df, how="inner")

combined_df.head()

Unnamed: 0,annotation_name,annotator_kind,metadata,identifier,id,created_at,updated_at,source,user_id,result.label,...,status_code,status_message,events,context.span_id,context.trace_id,attributes.input.mime_type,attributes.input.value,attributes.openinference.span.kind,attributes.output.mime_type,attributes.output.value
f008685457d5886d,correctness,HUMAN,{},,U3BhbkFubm90YXRpb246NA==,2025-06-30T18:51:35+00:00,2025-06-30T18:51:35+00:00,APP,,incorrect,...,OK,,[],f008685457d5886d,381efb4c8f96a70a56626203d21353aa,text/plain,Which movie had the highest rating?,AGENT,text/plain,I don't know.
e6babd5ab11f411c,correctness,HUMAN,{},,U3BhbkFubm90YXRpb246Mw==,2025-06-30T18:50:12+00:00,2025-06-30T18:50:12+00:00,APP,,incorrect,...,ERROR,Exception: Parser Error: syntax error at or ne...,"[{'name': 'exception', 'timestamp': '2025-06-3...",e6babd5ab11f411c,1b80840b5610fb83a545d9ea033523e1,text/plain,What french film was the most popular in 2015?,AGENT,,
32491c653ea4a0bb,correctness,HUMAN,{},,U3BhbkFubm90YXRpb246Mg==,2025-06-30T18:50:07+00:00,2025-06-30T18:50:07+00:00,APP,,incorrect,...,OK,,[],32491c653ea4a0bb,783aec24bdf339496c64061babea0cbd,text/plain,what are the best sci-fi movies in the 2000s?,AGENT,text/plain,I don't know.
a58932aa7c4eb068,correctness,HUMAN,{},,U3BhbkFubm90YXRpb246MQ==,2025-06-30T18:49:59+00:00,2025-06-30T18:49:59+00:00,APP,,correct,...,OK,,[],a58932aa7c4eb068,b87c5e4e72c57b5f96680e63dc523613,text/plain,How many movies were released around Christmas...,AGENT,text/plain,"In 2018, 128 movies were released around Chris..."


In [37]:
expamples_df = combined_df[
    ["annotation_name", "result.label", "attributes.input.value", "attributes.output.value"]
].head()
expamples_df

|                  | annotation_name | result.label | attributes.input.value | attributes.output.value |
|:-----------------|:----------------|:-------------|:-----------------------|:------------------------|
| f008685457d5886d | correctness | incorrect | Which movie had the highest rating? | I don't know. |
| e6babd5ab11f411c | correctness | incorrect | What french film was the most popular in 2015? | |
| 32491c653ea4a0bb | correctness | incorrect | what are the best sci-fi movies in the 2000s? | I don't know. |
| a58932aa7c4eb068 | correctness | correct | How many movies were released around Christmas... | In 2018, 128 movies were released around Chris... |


Let's see if we can create an LLM judge that aligns with our human evaluation.

In [38]:
eval_prompt = f"""
You are an expert evaluator of question and answer pairs. You will be given a human question and an answer from a model.
Your job is to determine if the answer is "correct" or "incorrect".

Here are some examples of correct and incorrect answers:
{'\n\n'.join([f"Question: {example['attributes.input.value']}\nAnswer: {example['attributes.output.value']}\nLabel: {example['result.label']}" for example in expamples_df.to_dict(orient="records")])}

## Evaluation
Provide your answer in the following format:
Question: <question>
Answer: <answer>
Explanation: <explanation>
Label: <correct|incorrect>

Question: {{attributes.input.value}}
Answer: {{attributes.output.value}}
Explanation:
"""

print(eval_prompt)


You are an expert evaluator of question and answer pairs. You will be given a human question and an answer from a model.
Your job is to determine if the answer is "correct" or "incorrect".

Here are some examples of correct and incorrect answers:
Question: Which movie had the highest rating?
Answer: I don't know.
Label: incorrect

Question: What french film was the most popular in 2015?
Answer: None
Label: incorrect

Question: what are the best sci-fi movies in the 2000s?
Answer: I don't know.
Label: incorrect

Question: How many movies were released around Christmas in 2018?
Answer: In 2018, 128 movies were released around Christmas.
Label: correct

## Evaluation
Provide your answer in the following format:
Question: <question>
Answer: <answer>
Explanation: <explanation>
Label: <correct|incorrect>

Question: {attributes.input.value}
Answer: {attributes.output.value}
Explanation:



In [39]:
spans_df[["attributes.input.value", "attributes.output.value"]].head()

| context.span_id  | attributes.input.value | attributes.output.value |
|:-----------------|:-----------------------|:------------------------|
| 506fbb9dfef6a379 | Which movie recieved the most votes? | |
| 49a5cde939cdfef6 | Which movie had the highest rating? | I don't know which movie had the highest ratin... |
| b9d6e10716675c4b | What french film was the most popular in 2015? | |
| 95e94aa35380c6ab | what are the best sci-fi movies in the 2000s? | I don't have information on the best sci-fi mo... |
| 91a4b762a1861cc4 | How many movies were released around Christmas... | A total of 128 movies were released around Chr... |


In [40]:
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

evals_df = llm_classify(
    data=spans_df,
    model=OpenAIModel(model="gpt-4o"),
    rails=["correct", "incorrect"],
    template=PromptTemplate(
        template=eval_prompt,
    ),
    exit_on_error=False,
    provide_explanation=True,
)

# Assign 1 to correct and 0 to incorrect
evals_df["score"] = evals_df["label"].apply(lambda x: 1 if x == "correct" else 0)
evals_df[["label", "score", "explanation"]].head()

llm_classify |          | 0/10 (0.0%) | ⏳ 00:00<? | ?it/s

Retries exhausted after 1 attempts: Missing template variables: attributes.output.value
Retries exhausted after 1 attempts: Missing template variables: attributes.output.value
Retries exhausted after 1 attempts: Missing template variables: attributes.output.value




| context.span_id  | label | score | explanation |
|:-----------------|:------|------:|:------------|
| 506fbb9dfef6a379 | | 0 | |
| 49a5cde939cdfef6 | incorrect | 0 | The answer does not provide any information ab... |
| b9d6e10716675c4b | | 0 | |
| 95e94aa35380c6ab | correct | 1 | The answer provides a list of widely acclaimed... |
| 91a4b762a1861cc4 | correct | 1 | The answer provides a specific number of movie... |


In [41]:
import phoenix as px
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=evals_df,
        eval_name="llm_correctness",
    )
)
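
To check how well the LLM judge aligns with the human annotations, one simple option is to measure how often the two labels agree. The sketch below is library-free and uses plain lists; the helper name `agreement_rate` and the sample labels are hypothetical and are not taken from the runs above.

```python
def agreement_rate(human_labels, llm_labels):
    """Fraction of items where the LLM judge's label equals the human label."""
    pairs = [
        (human, llm)
        for human, llm in zip(human_labels, llm_labels)
        if human is not None and llm is not None  # skip unlabeled or failed rows
    ]
    if not pairs:
        return None  # nothing to compare
    return sum(human == llm for human, llm in pairs) / len(pairs)


# Hypothetical labels, for illustration only
human = ["incorrect", "incorrect", "correct", "correct"]
llm = ["incorrect", "correct", "correct", None]
print(agreement_rate(human, llm))
```

Rows where the judge produced no label (for example, because a template variable was missing) are skipped rather than counted as disagreements; whether that is the right choice depends on how you want failures to affect the score.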



<p style="text-align: center">
<img src="https://eugeneyan.com/assets/ai-monitoring.webp" width="800">
</p>

## Experimentation

<p style="text-align: center">
<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/evaluator.png" width="800">
</p>

The velocity of AI application development is bottlenecked by the availability of high-quality evaluations, because engineers often face hard tradeoffs: which prompt or LLM best balances performance, latency, and cost? Quality evaluations are critical because they help answer these questions with greater confidence.

Evaluation consists of three parts — data, task, and scores. We'll start with data.
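
To make the three parts concrete, here is a minimal, library-free sketch of an experiment loop. It is not how Phoenix implements experiments; the names `toy_dataset`, `toy_task`, `toy_no_error`, and `run_toy_experiment` are hypothetical and exist only for illustration.

```python
from statistics import mean

# Data: the inputs we want to test (hypothetical examples).
toy_dataset = [
    {"question": "capital of France"},
    {"question": "capital of Atlantis"},
]

_FACTS = {"capital of France": "Paris"}


# Task: maps one input to one output (a stand-in for the text2sql step).
def toy_task(example):
    answer = _FACTS.get(example["question"])
    if answer is None:
        return {"results": None, "error": "unknown question"}
    return {"results": [answer], "error": None}


# Score: maps one output to a number.
def toy_no_error(output):
    return 1.0 if output.get("error") is None else 0.0


def run_toy_experiment(dataset, task, evaluators):
    """Run the task on every example and average each evaluator's scores."""
    outputs = [task(example) for example in dataset]
    return {
        evaluator.__name__: mean(evaluator(output) for output in outputs)
        for evaluator in evaluators
    }


summary = run_toy_experiment(toy_dataset, toy_task, [toy_no_error])
print(summary)
```

Keeping the three parts separate is what makes it possible to change one of them (say, the prompt inside the task) while holding the data and the scores fixed.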

<p style="text-align: center">
<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/experiment_analogy.png" width="800">
</p>

In [55]:
questions = [
    "Which movie received the most votes?",
    "Which movie had the highest rating?",
    "What french film was the most popular in 2015?",
    "what are the best sci-fi movies?",
    "How many movies were released around Christmas in 2018?",
]

Let's store the data above as a versioned dataset in Phoenix.

In [56]:
import pandas as pd

import phoenix as px

ds = px.Client().upload_dataset(
    dataset_name="movie-example-questions-2",
    dataframe=pd.DataFrame([{"question": question} for question in questions]),
    input_keys=["question"],
    output_keys=[],
)

# If you have already uploaded the dataset, you can fetch it using the following line
# ds = px.Client().get_dataset(name="movie-example-questions-2")

📤 Uploading dataset...




DatasetUploadError: Dataset with the same name already exists: name='movie-example-questions-2'

Next, we'll define the task. The task is to generate SQL queries from natural language questions.

In [52]:
@tracer.chain
async def text2sql(question):  # noqa: F811
    query = await generate_query(question)
    results = None
    error = None
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)

    return {
        "query": query,
        "results": results,
        "error": error,
    }

Finally, we'll define the scores. We'll use the following simple scoring functions to see if the generated SQL queries are correct.

In [53]:
# Test if there are no SQL execution errors
def no_error(output):
    return 1.0 if output.get("error") is None else 0.0


# Test if the query has results
def has_results(output):
    results = output.get("results")
    has_results = results is not None and len(results) > 0
    return 1.0 if has_results else 0.0
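
Other code-based scorers follow the same shape: a function that takes the task output and returns a number. As one hypothetical example that is not part of the original tutorial, `is_read_only` below checks that the generated query looks like a single `SELECT` (or `WITH`) statement. It is a rough heuristic, not a SQL parser; for instance, a semicolon inside a string literal would be scored as 0.0.

```python
import re


def is_read_only(output):
    """Return 1.0 if the generated SQL looks like a single SELECT/WITH statement."""
    # Drop surrounding whitespace and one trailing semicolon before checking.
    query = (output.get("query") or "").strip().rstrip(";").strip()
    if not query or ";" in query:
        return 0.0  # empty output, or more than one statement
    return 1.0 if re.match(r"(?i)^(select|with)\b", query) else 0.0


print(is_read_only({"query": "SELECT title FROM movies LIMIT 1;"}))
print(is_read_only({"query": "DROP TABLE movies"}))
```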

Now let's run the evaluation experiment.

In [57]:
import phoenix as px
from phoenix.experiments import run_experiment


# Define the task to run text2sql on the input question
def task(input):
    return text2sql(input["question"])


experiment = run_experiment(
    ds,
    task=task,
    evaluators=[no_error, has_results],
    experiment_metadata=CONFIG,
    experiment_name="baseline",
)

🧪 Experiment started.
📺 View dataset experiments: http://127.0.0.1:6006/datasets/RGF0YXNldDoz/experiments
🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDoz/compare?experimentId=RXhwZXJpbWVudDo2


running tasks |          | 0/5 (0.0%) | ⏳ 00:00<? | ?it/s



✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/10 (0.0%) | ⏳ 00:00<? | ?it/s


🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDoz/compare?experimentId=RXhwZXJpbWVudDo2

Experiment Summary (06/30/25 12:19 PM -0700)
--------------------------------------------
| evaluator   |   n |   n_scores |   avg_score |
|:------------|----:|-----------:|------------:|
| has_results |   5 |          5 |         0.6 |
| no_error    |   5 |          5 |         0.8 |

Tasks Summary (06/30/25 12:19 PM -0700)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|            5 |        5 |          0 |


OK, not looking very good. It looks like only three of our questions are yielding results, and one of our queries is leading to an error. Let's dig in to see how we can fix these.


## Interpreting the results

Now that we have run the initial evaluation, it looks like three of the results are valid, one produces a SQL error, and one returns no results.

- The query that returned no results didn't seem to find the right way to filter by genre (e.g. `Sci-Fi` might not be the label used in the data). Showing the model a sample of the data (e.g. a few-shot example) would probably help, since the data might store genres in a different format.
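
The parser error about a triple backtick in the baseline run suggests that the model occasionally wraps its SQL in a markdown code fence despite the instruction not to. Besides improving the prompt, one defensive option is to strip such a fence before executing the query. The helper below is a hypothetical sketch (the name `strip_sql_fences` is not part of the tutorial) and only handles a fence that wraps the whole response.

```python
import re

FENCE = "`" * 3  # a markdown code fence, built this way to keep the example readable


def strip_sql_fences(text):
    """Remove a wrapping markdown fence (optionally tagged, e.g. `sql`) from model output."""
    pattern = rf"^\s*{FENCE}[a-zA-Z]*\s*\n(.*?)\n?\s*{FENCE}\s*$"
    match = re.match(pattern, text, flags=re.DOTALL)
    # Fall back to the original text when no wrapping fence is found.
    return (match.group(1) if match else text).strip()


fenced = FENCE + "sql\nSELECT title FROM movies LIMIT 1;\n" + FENCE
print(strip_sql_fences(fenced))
```

Under that assumption, `generate_query` could pass the model response through this helper before returning it, so that a stray fence no longer surfaces as a database error.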

Let's try to improve the prompt with few-shot examples and see if we can get better results.

In [64]:
samples = conn.query("SELECT * FROM movies LIMIT 1").to_df().to_dict(orient="records")[0]
sample_rows = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[column['column_name']]}"
    for column in columns
)
system_prompt = (
    "You are a SQL expert, and you are given a single table named `movies` with the following columns:\n\n"
    "Column | Type | Example\n"
    "-------|------|--------\n"
    f"{sample_rows}\n"
    "\n"
    "Write a DuckDB SQL query corresponding to the user's request. "
    "Return just the query text, with no formatting (backticks, markdown, etc.)."
)


prompt_template = phoenix_client.prompts.create(
    name="text2sql",
    version=PromptVersion(
        [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": "{{question}}",
            },
        ],
        description="Add few shot examples to the prompt",
        model_name=TASK_MODEL,
    ),
)

print(await generate_query("What is the best Science Fiction movie?"))

SELECT title, MAX(vote_average) AS highest_vote_average
FROM movies
WHERE genres LIKE '%Science Fiction%'
GROUP BY title
ORDER BY highest_vote_average DESC
LIMIT 1;




Looking much better! Let's re-run the experiment with the updated prompt and the same evaluators to see whether the scores improve.




In [None]:
experiment = run_experiment(
    ds,
    experiment_name="with examples",
    task=task,
    evaluators=[has_results, no_error],
    experiment_metadata=CONFIG,
)
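
The two scorers used here only check that a query runs and returns rows, not that the rows are right. If the dataset also stored expected results (this tutorial's dataset has no output keys, so that is an assumption), a scorer could compare them with the query output. A minimal sketch with a hypothetical `results_match` helper; how the expected rows reach the function depends on the experiment framework.

```python
import json


def _canonical(rows):
    """Serialize each row (a dict) and sort, giving an order-insensitive form."""
    return sorted(json.dumps(row, sort_keys=True, default=str) for row in rows)


def results_match(output, expected):
    """Return 1.0 when the query results equal the expected rows, ignoring row order."""
    results = output.get("results")
    if results is None or expected is None:
        return 0.0
    return 1.0 if _canonical(results) == _canonical(expected) else 0.0


# Hypothetical rows, for illustration only
output = {"results": [{"title": "A"}, {"title": "B"}]}
print(results_match(output, [{"title": "B"}, {"title": "A"}]))
```

Ignoring row order is a deliberate simplification; for questions where the ranking matters (for example, a top-N list), comparing the rows in order would be the stricter choice.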

Amazing. It looks like we removed one of the errors and got a result for the query that previously returned nothing. Let's try using an LLM as a judge to see how well it can assess the results.


In [None]:
from phoenix.evals.models import OpenAIModel
from phoenix.experiments import evaluate_experiment
from phoenix.experiments.evaluators.llm_evaluators import LLMCriteriaEvaluator

llm_evaluator = LLMCriteriaEvaluator(
    name="is_sql",
    criteria="is_sql",
    description="the output is a valid SQL query that executes without errors",
    model=OpenAIModel(),
)

evaluate_experiment(experiment, evaluators=[llm_evaluator])

Sure enough, the LLM agrees with our scoring. Pretty neat trick! This can come in handy when it's difficult to define a scoring function.


We now have a simple text2sql pipeline that can generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can use the Phoenix UI to convert the spans that generated successful queries into examples for a **Golden Dataset** for regression testing!
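
As a rough illustration of what regression testing against a golden dataset can look like, the sketch below re-runs a task on stored examples and reports the ones whose output changed. Everything here is hypothetical (the `find_regressions` helper, the `golden` examples, and the stand-in `fake_task`); it is not a Phoenix API.

```python
def find_regressions(golden_examples, task):
    """Return the golden examples whose current output no longer matches the stored one."""
    regressions = []
    for example in golden_examples:
        actual = task(example["input"])
        if actual != example["expected"]:
            regressions.append({**example, "actual": actual})
    return regressions


# Hypothetical golden examples, for illustration only
golden = [
    {
        "input": "top rated",
        "expected": "SELECT title FROM movies ORDER BY vote_average DESC LIMIT 1",
    },
    {"input": "count", "expected": "SELECT COUNT(*) FROM movies"},
]


def fake_task(question):
    # Stand-in for the real pipeline: the second answer has drifted.
    canned = {
        "top rated": golden[0]["expected"],
        "count": "SELECT COUNT(id) FROM movies",
    }
    return canned[question]


regressions = find_regressions(golden, fake_task)
print([r["input"] for r in regressions])
```

Exact string comparison of SQL is brittle, since two different queries can be equally correct; comparing query results, or using an LLM judge as above, is usually a more forgiving check.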

## Bringing it all together

Now that we've seen the experiment improve our outcome, let's put it to the test using the evals we built.

In [None]:
from phoenix.trace import using_project

# Same movie questions as before, so the runs are comparable
questions = [
    "Which movie received the most votes?",
    "Which movie had the highest rating?",
    "What french film was the most popular in 2015?",
    "what are the best sci-fi movies?",
    "How many movies were released around Christmas in 2018?",
]


@tracer.agent
async def movie_agent_improved(question):
    sql_response = await text2sql(question)
    if sql_response["error"]:
        raise Exception(sql_response["error"])
    results = sql_response["results"]
    answer = await client.chat.completions.create(
        model="o3",
        messages=[
            # Reuse the synthesis prompt defined for the baseline movie agent
            {"role": "system", "content": synthesis_system_prompt},
            {
                "role": "user",
                "content": f"The sql results of the query are: {results}. Answer the following question: {question}.",
            },
        ],
    )
    return answer.choices[0].message.content


with using_project(project_name="movie-agent-improved"):
    for question in questions:
        try:
            answer = await movie_agent_improved(question)
            print(answer)
        except Exception as e:
            print(e)

In [None]:
from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

phoenix_client = Client()
query = SpanQuery().where("name == 'movie_agent_improved'")

spans_df = phoenix_client.spans.get_spans_dataframe(
    project_identifier="movie-agent-improved", query=query
)

spans_df.head()

In [None]:
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

evals_df = llm_classify(
    data=spans_df,
    model=OpenAIModel(model="gpt-4o"),
    rails=["correct", "incorrect"],
    template=PromptTemplate(
        template=eval_prompt,
    ),
    exit_on_error=False,
    provide_explanation=True,
)

# Assign 1 to correct and 0 to incorrect
evals_df["score"] = evals_df["label"].apply(lambda x: 1 if x == "correct" else 0)
evals_df[["label", "score", "explanation"]].head()

In [None]:
import phoenix as px
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=evals_df,
        eval_name="llm_correctness",
    )
)

## Sources

 - [An LLM-as-Judge Won't Save The Product—Fixing Your Process Will](https://eugeneyan.com/writing/eval-process/) by Ziyou Yan (April 2025, eugeneyan.com)
- [LLM Eval for TxtToSql](https://www.braintrust.dev/docs/cookbook/recipes/Text2SQL-Data)
- [AI Robotics Ethics Society](https://huggingface.co/AiresPucrs)