<!-- SPDX-License-Identifier: CC-BY-NC-SA-4.0 -->

*This notebook is © [Braintrust Cookbook](https://www.braintrust.dev/docs/cookbook/recipes/Text2SQL-Data) and licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).*  

<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Evals that Work</h1>

Building great AI-native products requires a rigorous evaluation process. While the idea of evaluation-driven development may seem novel to some, it is really the scientific method in disguise. Just as scientists meticulously record experiments and take detailed notes to advance their understanding, AI systems require rigorous observation through tracing, annotations, and experimentation to reach their full potential. The goal of AI-native products is to build tools that empower humans, and it requires careful human judgment to align AI with human preferences and values.

This notebook is inspired by [Eugene Yan's "A Process for LLM Evaluation"](https://eugeneyan.com/writing/eval-process/) and © [Braintrust Cookbook](https://www.braintrust.dev/docs/cookbook/recipes/Text2SQL-Data), licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).

 <p style="text-align:center">
  <img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/scientific_method.png" width="60%" style="float: left">
  <img alt="AI dev as scientific method" src="https://storage.googleapis.com/arize-phoenix-assets/assets/gifs/20250524_1125_Forest%20Robots%20Interaction_simple_compose_01jw1n770bep1a829kw3cvvcsc.gif" width="40%" style="float: right"/>
</p>

In [None]:
!pip install "arize-phoenix>=10.0.0" openai 'httpx<0.28' duckdb datasets pyarrow "pydantic>=2.0.0" nest_asyncio openinference-instrumentation-openai --quiet

This tutorial assumes you have a locally running Phoenix server. You can think of Phoenix as a video recorder that observes every activity of your AI application.

```shell
phoenix serve
```

Let's also set up tracing for OpenAI, as we will be using their API to perform the synthesis.

In [1]:
from phoenix.otel import register

tracer_provider = register(
    project_name="movie-app",
    auto_instrument=True,  # Start recording traces via OpenAIInstrumentor
)

tracer = tracer_provider.get_tracer(__name__)

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: movie-app
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



Let's make sure we can run async code in the notebook.

In [2]:
import nest_asyncio

nest_asyncio.apply()

Lastly, let's make sure we have our OpenAI API key set up.

In [3]:
import os
from getpass import getpass

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

## Download Data

We are going to use a movie dataset that contains recent titles and their ratings. We will use DuckDB as our database so that we can run the queries directly in the notebook, but you can imagine that this could be a pre-existing SQL database with business-specific data.

In [4]:
import duckdb
from datasets import load_dataset

data = load_dataset("wykonos/movies")["train"]

conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("movies", data.to_pandas())

<duckdb.duckdb.DuckDBPyConnection at 0x319c79130>

In [5]:
records = conn.query("SELECT * FROM movies LIMIT 10").to_df().to_dict(orient="records")

for record in records:
    print(record)

{'id': 385687, 'title': 'Fast X', 'genres': 'Action-Crime-Thriller', 'original_language': 'en', 'overview': "Over many missions and against impossible odds Dom Toretto and his family have outsmarted out-nerved and outdriven every foe in their path. Now they confront the most lethal opponent they've ever faced: A terrifying threat emerging from the shadows of the past who's fueled by blood revenge and who is determined to shatter this family and destroy everything—and everyone—that Dom loves forever.", 'popularity': 6682.1, 'production_companies': 'Universal Pictures-Original Film-One Race-Perfect Storm Entertainment', 'release_date': '2023-05-17', 'budget': 340000000.0, 'revenue': 686700000.0, 'runtime': 142.0, 'status': 'Released', 'tagline': 'The end of the road begins.', 'vote_average': 7.331, 'vote_count': 1856.0, 'credits': 'Vin Diesel-Michelle Rodriguez-Tyrese Gibson-Ludacris-John Cena-Nathalie Emmanuel-Jordana Brewster-Sung Kang-Jason Momoa-Scott Eastwood-Daniela Melchior-Alan R

## Implement Text2SQL

Let's start by implementing simple logic that takes a human question and converts it into a SQL query. Note that we prompt the LLM to respond with just the SQL so that we can plug it directly into DuckDB.

In [32]:
import os

import openai

from phoenix.client import Client
from phoenix.client.types import PromptVersion

phoenix_client = Client()
client = openai.AsyncClient()

columns = conn.query("DESCRIBE movies").to_df().to_dict(orient="records")

# We will use GPT-4o to start
TASK_MODEL = "gpt-4o"
CONFIG = {"model": TASK_MODEL}

system_prompt = f"""
You are a SQL expert, and you are given a table named `movies` with the following columns:

{",".join(column["column_name"] + ": " + column["column_type"] for column in columns)}

Write a SQL query corresponding to the user's request. Return just the SQL query
with no formatting (no backticks, no markdown, etc.).
"""

prompt_template = phoenix_client.prompts.create(
    name="text2sql",
    version=PromptVersion(
        [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": "{{question}}",
            },
        ],
        description="Initial prompt for text2sql",
        model_name=TASK_MODEL,
    ),
)


@tracer.chain
async def generate_query(question):
    prompt = prompt_template.format(variables={"question": question}, sdk="openai")
    response = await client.chat.completions.create(
        **prompt,
        temperature=0,
    )
    return response.choices[0].message.content

In [41]:
query = await generate_query("What is the top grossing movie?")
print(query)

SELECT title FROM movies ORDER BY revenue DESC LIMIT 1;


Awesome, it looks like the LLM is producing SQL! Let's try running the query against the database and see if we get the expected results.

In [42]:
@tracer.tool
def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")


execute_query(query)

[{'title': 'Avatar'}]
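
The prompt asks for bare SQL, but a model can still wrap its answer in a markdown code fence, which the database would reject as invalid SQL. A small cleanup step could run before `execute_query` as a safeguard. The sketch below is an assumption on our part and not part of the original pipeline; `strip_sql_fences` is a hypothetical helper name.

```python
import re

# Build the fence marker programmatically so this example stays readable in markdown.
FENCE = "`" * 3
# Matches a whole string of the form: fence + optional language tag, body, closing fence.
_FENCED = re.compile(rf"^{FENCE}[a-zA-Z]*[ \t]*\n(.*?)\n?{FENCE}$", re.DOTALL)


def strip_sql_fences(text: str) -> str:
    """Return the SQL inside a markdown code fence, or the text unchanged."""
    text = text.strip()
    match = _FENCED.match(text)
    return match.group(1).strip() if match else text


fenced = f"{FENCE}sql\nSELECT title FROM movies LIMIT 1;\n{FENCE}"
print(strip_sql_fences(fenced))
print(strip_sql_fences("SELECT title FROM movies LIMIT 1;"))
```

Both calls print the same bare query, so the helper is safe to apply whether or not the model followed the formatting instruction.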

Let's put the pieces together and see if we can create a movie agent. Here we perform a very simple form of RAG, where the SQL query results are passed to an LLM to synthesize a human-friendly answer.

In [49]:
@tracer.chain
async def text2sql(question):  # noqa: F811
    query = await generate_query(question)
    results = None
    error = None
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)

    return {
        "query": query,
        "results": results,
        "error": error,
    }


synthesis_system_prompt = """
You are a helpful assistant that can answer questions about movies.

Answer the question based on the sql results. Do not rely on your internal knowledge.

Do not use sql or abbreviations for genres or languages. Use an informative, concise voice.
Your response should be purely in natural language, do not include any sql or other technical details.

If the sql results are empty, say you don't know.
"""

synthesis_prompt_template = """
Answer the question based on the sql results.

[BEGIN DATA]
************
[Question]: {question}
************
[SQL Results]: {results}
************
[END DATA]

Answer:
"""

@tracer.agent
async def movie_agent(question):
    sql_response = await text2sql(question)
    if sql_response["error"]:
        raise Exception(sql_response["error"])
    results = sql_response["results"]
    answer = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": synthesis_system_prompt},
            {
                "role": "user",
                "content": synthesis_prompt_template.format(question=question, results=results)
            },
        ],
    )
    return answer.choices[0].message.content

In [None]:
await movie_agent("What is the top grossing movie?")

'The top grossing movie is "Avatar."'

Looks like we have a working movie expert. Or do we? Let's double-check by running the agent over some examples.

In [55]:
questions = [
    "Which Brad Pitt movie received the highest rating?",
    "What is the top grossing Marvel movie?",
    "What foreign-language fantasy movie was the most popular?",
    "what are the best sci-fi movies of 2017?",
    "What anime topped the box office in the 2010s?",
    "Recommend a romcom that stars Paul Rudd.",
]

In [59]:
from phoenix.trace import using_project

with using_project(project_name="movie-agent-baseline"):
    for question in questions:
        try:
            answer = await movie_agent(question)
            print("Question: ", question)
            print("Answer: ", answer)
            print("\n")
        except Exception as e:
            print(e)

Question:  Which Brad Pitt movie received the highest rating?
Answer:  The Brad Pitt movie "Voom Portraits" received the highest rating with a vote average of 10.0.


Question:  What is the top grossing Marvel movie?
Answer:  The top grossing Marvel movie is "Avengers: Endgame," with a revenue of approximately $2,799,439,100.


Question:  What foreign-language fantasy movie was the most popular?
Answer:  The foreign-language fantasy movie titled "The Nights Belong to Monsters" was the most popular.


Question:  what are the best sci-fi movies of 2017?
Answer:  I don't know.


Question:  What anime topped the box office in the 2010s?
Answer:  I don't know which anime topped the box office in the 2010s based on the given data.


Question:  Recommend a romcom that stars Paul Rudd.
Answer:  I recommend the romantic comedy "Clueless," which stars Paul Rudd.




Let's look at the data and annotate it to see what the issues might be. Go to Settings > Annotations and add a correctness annotation config. Configure it as a categorical annotation with two categories, `correct` and `incorrect`. We can now quickly annotate the 6 traces above (i.e. the agent spans) as `correct` or `incorrect`. Once we've annotated some data, we can bring it back into the notebook to analyze it.

In [60]:
from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

phoenix_client = Client()
query = SpanQuery().where("name == 'movie_agent'")

spans_df = phoenix_client.spans.get_spans_dataframe(
    project_identifier="movie-agent-baseline", query=query
)
annotations_df = phoenix_client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df, project_identifier="movie-agent-baseline"
)

combined_df = annotations_df.join(spans_df, how="inner")

combined_df.head()

Unnamed: 0,annotation_name,annotator_kind,metadata,identifier,id,created_at,updated_at,source,user_id,result.label,...,status_code,status_message,events,context.span_id,context.trace_id,attributes.openinference.span.kind,attributes.output.mime_type,attributes.output.value,attributes.input.value,attributes.input.mime_type
680faf992ac440b4,correctness,HUMAN,{},,U3BhbkFubm90YXRpb246NA==,2025-07-01T06:12:59+00:00,2025-07-01T06:12:59+00:00,APP,,correct,...,OK,,[],680faf992ac440b4,791bc2011e7c8ebf7ca2ba09f28f424e,AGENT,text/plain,"The top grossing Marvel movie is ""Avengers: En...",What is the top grossing Marvel movie?,text/plain
58dea9e47c4e6b82,correctness,HUMAN,{},,U3BhbkFubm90YXRpb246Mw==,2025-07-01T06:12:34+00:00,2025-07-01T06:12:34+00:00,APP,,incorrect,...,OK,,[],58dea9e47c4e6b82,dff72850f370d70d78e816c4114a45d8,AGENT,text/plain,I don't know.,what are the best sci-fi movies of 2017?,text/plain
acaeaa0e67bcbb72,correctness,HUMAN,{},,U3BhbkFubm90YXRpb246Mg==,2025-07-01T06:12:26+00:00,2025-07-01T06:12:26+00:00,APP,,incorrect,...,OK,,[],acaeaa0e67bcbb72,133bd69fe9339a1b699f5854ee7fd789,AGENT,text/plain,I don't know which anime topped the box office...,What anime topped the box office in the 2010s?,text/plain
cd5563093e84769e,correctness,HUMAN,{},,U3BhbkFubm90YXRpb246MQ==,2025-07-01T06:12:18+00:00,2025-07-01T06:12:18+00:00,APP,,correct,...,OK,,[],cd5563093e84769e,52bc08dc81a94e85d135f7a95fcaa727,AGENT,text/plain,"I recommend the romantic comedy ""Clueless,"" wh...",Recommend a romcom that stars Paul Rudd.,text/plain


In [61]:
examples_df = combined_df[
    ["annotation_name", "result.label", "attributes.input.value", "attributes.output.value"]
].head()
examples_df

Unnamed: 0,annotation_name,result.label,attributes.input.value,attributes.output.value
680faf992ac440b4,correctness,correct,What is the top grossing Marvel movie?,"The top grossing Marvel movie is ""Avengers: En..."
58dea9e47c4e6b82,correctness,incorrect,what are the best sci-fi movies of 2017?,I don't know.
acaeaa0e67bcbb72,correctness,incorrect,What anime topped the box office in the 2010s?,I don't know which anime topped the box office...
cd5563093e84769e,correctness,correct,Recommend a romcom that stars Paul Rudd.,"I recommend the romantic comedy ""Clueless,"" wh..."


Let's see if we can create an LLM judge that aligns with our human evaluation.

In [62]:
# Build the few-shot block outside the f-string: backslashes inside f-string
# expressions are a syntax error before Python 3.12.
labeled_examples = "\n\n".join(
    f"Question: {example['attributes.input.value']}\n"
    f"Answer: {example['attributes.output.value']}\n"
    f"Label: {example['result.label']}"
    for example in examples_df.to_dict(orient="records")
)

eval_prompt = f"""
You are an expert evaluator of question and answer pairs. You will be given a human question and an answer from a model.
Your job is to determine if the answer is "correct" or "incorrect".

Here are some examples of correct and incorrect answers:
{labeled_examples}

## Evaluation
Provide your answer in the following format:
Question: <question>
Answer: <answer>
Explanation: <explanation>
Label: <correct|incorrect>

Question: {{attributes.input.value}}
Answer: {{attributes.output.value}}
Explanation:
"""

print(eval_prompt)


You are an expert evaluator of question and answer pairs. You will be given a human question and an answer from a model.
Your job is to determine if the answer is "correct" or "incorrect".

Here are some examples of correct and incorrect answers:
Question: What is the top grossing Marvel movie?
Answer: The top grossing Marvel movie is "Avengers: Endgame," with a revenue of approximately $2,799,439,100.
Label: correct

Question: what are the best sci-fi movies of 2017?
Answer: I don't know.
Label: incorrect

Question: What anime topped the box office in the 2010s?
Answer: I don't know which anime topped the box office in the 2010s based on the given data.
Label: incorrect

Question: Recommend a romcom that stars Paul Rudd.
Answer: I recommend the romantic comedy "Clueless," which stars Paul Rudd.
Label: correct

## Evaluation
Provide your answer in the following format:
Question: <question>
Answer: <answer>
Explanation: <explanation>
Label: <correct|incorrect>

Question: {attributes.in
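
Note the doubled braces around `attributes.input.value` and `attributes.output.value` in the f-string. Python collapses `{{` and `}}` into single braces, so the printed prompt keeps them as placeholders to be filled in per row later, while the few-shot examples are interpolated immediately. Here is a minimal sketch of that two-stage templating, using plain `str.format` and made-up variable names (`question`, `answer`) rather than Phoenix's template class:

```python
# Stage 1: the f-string fills in the examples now and leaves {question}/{answer} as placeholders.
examples = "Question: What is 2 + 2?\nAnswer: 4\nLabel: correct"
template = f"Examples:\n{examples}\n\nQuestion: {{question}}\nAnswer: {{answer}}\nLabel:"

# Stage 2: each row to be judged fills in the remaining placeholders.
filled = template.format(question="What is 3 + 3?", answer="7")
print(filled)
```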

In [63]:
spans_df[["attributes.input.value", "attributes.output.value"]].head()

Unnamed: 0_level_0,attributes.input.value,attributes.output.value
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1
0dffcd95160b817f,Which Brad Pitt movie received the highest rat...,"The Brad Pitt movie ""Voom Portraits"" received ..."
680faf992ac440b4,What is the top grossing Marvel movie?,"The top grossing Marvel movie is ""Avengers: En..."
1b80a04760108be5,What foreign-language fantasy movie was the mo...,"The foreign-language fantasy movie titled ""The..."
58dea9e47c4e6b82,what are the best sci-fi movies of 2017?,I don't know.
acaeaa0e67bcbb72,What anime topped the box office in the 2010s?,I don't know which anime topped the box office...


In [64]:
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

evals_df = llm_classify(
    data=spans_df,
    model=OpenAIModel(model="gpt-4o"),
    rails=["correct", "incorrect"],
    template=PromptTemplate(
        template=eval_prompt,
    ),
    exit_on_error=False,
    provide_explanation=True,
)

## Assign 1 to correct and 0 to incorrect
evals_df["score"] = evals_df["label"].apply(lambda x: 1 if x == "correct" else 0)
evals_df[["label", "score", "explanation"]].head()

llm_classify |          | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s


Unnamed: 0_level_0,label,score,explanation
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0dffcd95160b817f,incorrect,0,"The answer states that ""Voom Portraits"" receiv..."
680faf992ac440b4,correct,1,The answer correctly identifies 'Avengers: End...
1b80a04760108be5,incorrect,0,"The answer provided, ""The Nights Belong to Mon..."
58dea9e47c4e6b82,incorrect,0,The answer 'I don't know' does not provide any...
acaeaa0e67bcbb72,incorrect,0,The answer does not provide the information re...
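
One way to check alignment is to compare the judge's labels with the human annotations on the spans that have both. The sketch below does this in plain Python using only the labels visible in the two tables above. Only the first five judge rows are displayed there, so just three spans overlap. `agreement_rate` is a hypothetical helper, not a Phoenix function.

```python
# Human annotations from the Phoenix UI (span id -> label), copied from the table above.
human = {
    "680faf992ac440b4": "correct",
    "58dea9e47c4e6b82": "incorrect",
    "acaeaa0e67bcbb72": "incorrect",
    "cd5563093e84769e": "correct",
}
# LLM judge labels for the five rows displayed by evals_df.head().
judge = {
    "0dffcd95160b817f": "incorrect",
    "680faf992ac440b4": "correct",
    "1b80a04760108be5": "incorrect",
    "58dea9e47c4e6b82": "incorrect",
    "acaeaa0e67bcbb72": "incorrect",
}


def agreement_rate(a: dict, b: dict) -> tuple[int, float]:
    """Return (number of shared keys, fraction of shared keys with equal labels)."""
    shared = sorted(a.keys() & b.keys())
    if not shared:
        return 0, float("nan")
    matches = sum(a[key] == b[key] for key in shared)
    return len(shared), matches / len(shared)


n_shared, rate = agreement_rate(human, judge)
print(f"{n_shared} shared spans, agreement = {rate:.2f}")  # 3 shared spans, agreement = 1.00
```

With so few labeled spans this is only a sanity check. A larger annotated sample would be needed before trusting the judge.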


In [65]:
import phoenix as px
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=evals_df,
        eval_name="llm_correctness",
    )
)



<p style="text-align: center">
<img src="https://eugeneyan.com/assets/ai-monitoring.webp" width="800">
<cite data-cite="yan2025">(Yan, 2025)</cite>
</p>

## Experimentation

<p style="text-align: center">
<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/evaluator.png" width="800">
</p>

The velocity of AI application development is bottlenecked by the quality of evaluations, because engineers often face hard tradeoffs: which prompt or LLM best balances performance, latency, and cost? High-quality evaluations are critical because they help answer these questions with greater confidence.

Evaluation consists of three parts: data, task, and scores. We'll start with data.

<p style="text-align: center">
<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/experiment_analogy.png" width="800">
</p>
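
Conceptually, an experiment applies the task to every example in the dataset and then averages each scorer over the outputs. The plain-Python sketch below illustrates that shape only. It is not how Phoenix implements `run_experiment`, and `run_eval`, `toy_task`, and the sample data are made up for illustration.

```python
from statistics import mean


def run_eval(examples, task, scorers):
    """Run `task` on each example, then average every scorer over the outputs."""
    outputs = [task(example) for example in examples]
    return {name: mean(score(output) for output in outputs) for name, score in scorers.items()}


# Data: a toy dataset of inputs.
examples = [{"question": "a"}, {"question": "bb"}, {"question": ""}]


# Task: a stand-in for the system under test (here it just echoes the question).
def toy_task(example):
    return {"answer": example["question"]}


# Scores: functions that map an output to a number.
scorers = {"non_empty": lambda output: 1.0 if output["answer"] else 0.0}

summary = run_eval(examples, toy_task, scorers)
print(summary)
```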

Let's store the movie questions we created above as a versioned dataset in Phoenix.

In [None]:
import pandas as pd

import phoenix as px

ds = px.Client().upload_dataset(
    dataset_name="movie-example-questions",
    dataframe=pd.DataFrame([{"question": question} for question in questions]),
    input_keys=["question"],
    output_keys=[],
)

# If you have already uploaded the dataset, you can fetch it using the following line
# ds = px.Client().get_dataset(name="movie-example-questions")

📤 Uploading dataset...
💾 Examples uploaded: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246OA==




Next, we'll define the task. The task is to generate SQL queries from natural language questions.

In [68]:
@tracer.chain
async def text2sql(question):  # noqa: F811
    query = await generate_query(question)
    results = None
    error = None
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)

    return {
        "query": query,
        "results": results,
        "error": error,
    }

Finally, we'll define the evaluators. We'll use the following simple functions, which return 1 for true and 0 for false, as a first check on the generated SQL: did it execute without error, and did it return any rows?

In [69]:
# Test if there are no sql execution errors
def no_error(output):
    return 1.0 if output.get("error") is None else 0.0


# Test if the query has results
def has_results(output):
    results = output.get("results")
    has_results = results is not None and len(results) > 0
    return 1.0 if has_results else 0.0
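
To see what these two checks do and do not catch, it helps to apply them to a few hand-built outputs. The dictionaries below are made up to mirror the shape returned by `text2sql`, and the evaluator definitions are repeated so the snippet runs on its own.

```python
def no_error(output):
    return 1.0 if output.get("error") is None else 0.0


def has_results(output):
    results = output.get("results")
    return 1.0 if results is not None and len(results) > 0 else 0.0


# Illustrative outputs: one healthy run, one empty result set, one execution failure.
sample_outputs = {
    "rows returned": {"query": "SELECT title FROM movies LIMIT 1", "results": [{"title": "Avatar"}], "error": None},
    "empty result": {"query": "SELECT title FROM movies WHERE 1 = 0", "results": [], "error": None},
    "execution error": {"query": "SELEC title", "results": None, "error": "Parser Error"},
}

scores = {
    name: (no_error(output), has_results(output)) for name, output in sample_outputs.items()
}
for name, (ok, non_empty) in scores.items():
    print(f"{name}: no_error={ok}, has_results={non_empty}")
```

Neither check looks at whether the returned rows actually answer the question, so a query can score 1 on both and still be wrong.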

Now let's run the experiment. To do so, we pass the dataset of examples, the task that runs the SQL generation, and the evaluators described above.

In [70]:
import phoenix as px
from phoenix.experiments import run_experiment


# Define the task to run text2sql on the input question
def task(input):
    return text2sql(input["question"])


experiment = run_experiment(
    ds,
    task=task,
    evaluators=[no_error, has_results],
    experiment_metadata=CONFIG,
    experiment_name="baseline",
)

🧪 Experiment started.
📺 View dataset experiments: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/experiments
🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/compare?experimentId=RXhwZXJpbWVudDoxNg==




running tasks |          | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/12 (0.0%) | ⏳ 00:00<? | ?it/s


🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/compare?experimentId=RXhwZXJpbWVudDoxNg==

Experiment Summary (07/01/25 12:14 AM -0600)
--------------------------------------------
| evaluator   |   n |   n_scores |   avg_score |
|:------------|----:|-----------:|------------:|
| has_results |   6 |          6 |    0.666667 |
| no_error    |   6 |          6 |    1        |

Tasks Summary (07/01/25 12:14 AM -0600)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|            6 |        6 |          0 |


OK, not looking great: only 4 out of 6 of our questions are yielding results. Let's dig in to see how we can fix these.


## Interpreting the results

Now that we ran the initial evaluation, it looks like 2 of the results are empty due to getting the genre wrong.

- `Sci-Fi` needs to be queried as `Science Fiction`
- `Anime` needs to be queried as `Animation` + language specification.

These two issues would probably be improved by showing a sample of the data to the model (e.g. few shot example) since the data will show the LLM what is queryable.
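
A complementary option, not used in this notebook, is to show the model the vocabulary of a column directly. The `genres` values in the sample record are hyphen-delimited (for example `Action-Crime-Thriller`), so splitting on `-` yields the individual genre names, assuming no genre name itself contains a hyphen. The sketch below works on a short hard-coded list. Apart from the first value, the strings are illustrative, and in practice they would come from the `movies` table.

```python
from collections import Counter

# Hyphen-delimited genre strings; the first is from the sample record, the rest are illustrative.
genre_values = [
    "Action-Crime-Thriller",
    "Science Fiction-Action",
    "Animation-Family",
    "Science Fiction",
]

# Count each individual genre token across all rows.
genre_counts = Counter(token for value in genre_values for token in value.split("-"))

# A compact, sorted vocabulary that could be appended to the system prompt.
genre_vocabulary = ", ".join(sorted(genre_counts))
print(genre_vocabulary)
```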

Let's try to improve the prompt with few-shot examples and see if we can get better results.

In [71]:
samples = conn.query("SELECT * FROM movies LIMIT 5").to_df().to_dict(orient="records")

example_row = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[0][column['column_name']]}"
    for column in columns
)

column_header = " | ".join(column["column_name"] for column in columns)

few_shot_examples = "\n".join(
    " | ".join(str(sample[column["column_name"]]) for column in columns) for sample in samples
)

system_prompt = (
    "You are a SQL expert, and you are given a single table named `movies` with the following columns:\n\n"
    "Column | Type | Example\n"
    "-------|------|--------\n"
    f"{example_row}\n"
    "\n"
    "Examples:\n"
    f"{column_header}\n"
    f"{few_shot_examples}\n"
    "\n"
    "Write a DuckDB SQL query corresponding to the user's request. "
    "Return just the query text, with no formatting (backticks, markdown, etc.)."
)


prompt_template = phoenix_client.prompts.create(
    name="text2sql",
    version=PromptVersion(
        [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": "{{question}}",
            },
        ],
        description="Add few shot examples to the prompt",
        model_name=TASK_MODEL,
    ),
)

In [72]:
print(await generate_query("What is the best Sci-Fi movie?"))

SELECT title, MAX(vote_average) AS max_vote_average
FROM movies
WHERE genres LIKE '%Science Fiction%'
GROUP BY title
ORDER BY max_vote_average DESC
LIMIT 1;


Looking much better! Since the prompt shows that "Sci-Fi" is represented as "Science Fiction", the LLM is able to synthesize the right where clause. 
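
The difference is easy to reproduce on a toy table. The sketch below uses Python's built-in `sqlite3` as a stand-in for DuckDB (an assumption made so the snippet has no dependencies) and two made-up rows. A `LIKE '%Sci-Fi%'` filter matches nothing, while `LIKE '%Science Fiction%'` finds the row.

```python
import sqlite3

# In-memory toy table; sqlite3 stands in for DuckDB and the rows are invented.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE movies (title TEXT, genres TEXT)")
demo_conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [
        ("Example A", "Science Fiction-Action"),
        ("Example B", "Animation-Family"),
    ],
)


def count_matches(pattern: str) -> int:
    """Count rows whose genres column matches the given LIKE pattern."""
    row = demo_conn.execute(
        "SELECT COUNT(*) FROM movies WHERE genres LIKE ?", (pattern,)
    ).fetchone()
    return row[0]


sci_fi_hits = count_matches("%Sci-Fi%")
science_fiction_hits = count_matches("%Science Fiction%")
print(sci_fi_hits, science_fiction_hits)  # 0 1
```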

Let's run the experiment again.

In [73]:
experiment = run_experiment(
    ds,
    experiment_name="with examples",
    task=task,
    evaluators=[has_results, no_error],
    experiment_metadata=CONFIG,
)

🧪 Experiment started.
📺 View dataset experiments: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/experiments
🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/compare?experimentId=RXhwZXJpbWVudDoxNw==




running tasks |          | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/12 (0.0%) | ⏳ 00:00<? | ?it/s


🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/compare?experimentId=RXhwZXJpbWVudDoxNw==

Experiment Summary (07/01/25 12:14 AM -0600)
--------------------------------------------
| evaluator   |   n |   n_scores |   avg_score |
|:------------|----:|-----------:|------------:|
| has_results |   6 |          6 |           1 |
| no_error    |   6 |          6 |           1 |

Tasks Summary (07/01/25 12:14 AM -0600)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|            6 |        6 |          0 |


Much improved: every question now returns results, including the two that previously came back empty. But just because we are getting rows out of the database doesn't mean the answers are correct. Let's construct an LLM judge to validate the queries.

In [74]:
import json

from openai import OpenAI

from phoenix.experiments import evaluate_experiment
from phoenix.experiments.evaluators import create_evaluator
from phoenix.experiments.types import EvaluationResult

openai_client = OpenAI()

judge_instructions = """
You are a judge that determines if a given question can be answered with the provided SQL query and results.
Make sure to ensure that the SQL query maps to the question accurately.

Provide the label `correct` if the SQL query and results accurately answer the question.
Provide the label `invalid` if the SQL query does not map to the question or is not valid.
"""


@create_evaluator(name="qa_correctness", kind="llm")
def qa_correctness(input, output):
    question = input.get("question")
    query = output.get("query")
    results = output.get("results")
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": judge_instructions},
            {
                "role": "user",
                "content": f"Question: {question}\nSQL Query: {query}\nSQL Results: {results}",
            },
        ],
        tool_choice="required",
        tools=[
            {
                "type": "function",
                "function": {
                    "name": "qa_correctness",
                    "description": "Determine if the SQL query and results accurately answer the question.",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "explanation": {
                                "type": "string",
                                "description": "Explain why the label is correct or invalid.",
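                            },
                            "label": {"type": "string", "enum": ["correct", "invalid"]},
                        },
                        # Require both fields so the lookups on the parsed arguments below do not fail
                        "required": ["explanation", "label"],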
                    },
                },
            }
        ],
    )
    if response.choices[0].message.tool_calls is None:
        raise ValueError("No tool call found in response")
    args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
    label = args["label"]
    explanation = args["explanation"]
    score = 1 if label == "correct" else 0
    return EvaluationResult(score=score, label=label, explanation=explanation)


evaluate_experiment(experiment, evaluators=[qa_correctness])

🧠 Evaluation started.


running experiment evaluations |          | 0/6 (0.0%) | ⏳ 00:00<? | ?it/s


🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/compare?experimentId=RXhwZXJpbWVudDoxNw==

Experiment Summary (07/01/25 12:14 AM -0600)
--------------------------------------------
| evaluator      |   n |   n_scores |   avg_score |   n_labels | top_2_labels                 |
|:---------------|----:|-----------:|------------:|-----------:|:-----------------------------|
| qa_correctness |   6 |          6 |    0.833333 |          6 | {'correct': 5, 'invalid': 1} |

Experiment Summary (07/01/25 12:14 AM -0600)
--------------------------------------------
| evaluator   |   n |   n_scores |   avg_score |
|:------------|----:|-----------:|------------:|
| has_results |   6 |          6 |           1 |
| no_error    |   6 |          6 |           1 |

Tasks Summary (07/01/25 12:14 AM -0600)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|            6 |        6 |          0 |


RanExperiment(id='RXhwZXJpbWVudDoxNw==', dataset_id='RGF0YXNldDo4', dataset_version_id='RGF0YXNldFZlcnNpb246OA==', repetitions=1)
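
The evaluator above indexes `args["label"]` and `args["explanation"]` directly, so a tool call that omits a field would raise a `KeyError`. A more defensive parsing step might look like the sketch below. `parse_judge_args` is a hypothetical helper, not part of Phoenix or the OpenAI SDK.

```python
import json

VALID_LABELS = {"correct", "invalid"}


def parse_judge_args(raw_arguments: str) -> dict:
    """Parse the JSON arguments of the judge's tool call into score, label, and explanation."""
    args = json.loads(raw_arguments)
    label = args.get("label")
    if label not in VALID_LABELS:
        raise ValueError(f"Unexpected judge label: {label!r}")
    return {
        "score": 1 if label == "correct" else 0,
        "label": label,
        # Fall back to an empty explanation rather than failing the whole evaluation.
        "explanation": args.get("explanation", ""),
    }


parsed = parse_judge_args('{"label": "invalid", "explanation": "Genre filter is wrong."}')
print(parsed)
```

Raising on an unknown label, instead of silently scoring it 0, keeps malformed judge responses from being counted as real failures.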

The LLM judge's scoring closely matches our manual evaluation, demonstrating its effectiveness as an automated evaluation method. This approach is particularly valuable when traditional rule-based scoring functions are difficult to implement. 

The LLM judge also shows an advantage in nuanced understanding: for example, it correctly identifies that 'Anime' and 'Animation' are distinct genres, a subtlety our code-based evaluators missed. This highlights why developing custom LLM judges tailored to your specific task requirements is crucial for accurate evaluation.


We now have a simple text2sql pipeline that can be used to generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can now use the Phoenix UI to convert the spans that generated successful queries into examples for a **golden dataset** to use in regression testing.

## Bringing it all together

Now that the experiment has shown an improvement, let's put the agent to the test using the evals we built earlier.

In [None]:
from phoenix.trace import using_project


@tracer.agent
async def movie_agent_improved(question):
    sql_response = await text2sql(question)
    if sql_response["error"]:
        raise Exception(sql_response["error"])
    results = sql_response["results"]
    answer = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": synthesis_system_prompt},
            {
                "role": "user",
                "content": synthesis_prompt_template.format(question=question, results=results),
            },
        ],
    )
    return answer.choices[0].message.content


with using_project(project_name="movie-agent-improved"):
    for question in questions:
        try:
            answer = await movie_agent_improved(question)
            print("Question: ", question)
            print("Answer: ", answer)
            print("\n")
        except Exception as e:
            print(e)

The Brad Pitt movie that received the highest rating is "Voom Portraits" with a rating of 10.0.
The top-grossing Marvel movie is "Avengers: Endgame," with a revenue of approximately 2.799 billion dollars.
The most popular foreign-language fantasy movie is "The Nights Belong to Monsters" with a popularity score of 742.199.
In 2017, some of the highest-rated science fiction movies, each with a perfect score of 10.0, were "T&A Time Travelers," "H. P. Lovecraft Film Festival Best of 2017," "Anukul," "Satria Heroes: Revenge of Darkness," "Ready Jet Go! Back to Bortron 7," "Electric Sandwich," "Uchuu Sentai Kyuranger: Episode of Stinger," "Navy SEALS v Demons," "Roam - Short Film," "Moon Men," "Toxic Tutu," "The Aalto Natives," "Zombie City," "Blade Of Honor," "The Idol," "I Am the Doorway," "Eureka!!," "X-Manas," "Femaliens: Seduction of the Species," "Lucas Chronicle," and "Among Us - In the Land of Our Shadows." These films stood out in 2017 for their high viewer ratings.
I don't know.
I 

In [79]:
from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery

phoenix_client = Client()
query = SpanQuery().where("name == 'movie_agent_improved'")

spans_df = phoenix_client.spans.get_spans_dataframe(
    project_identifier="movie-agent-improved", query=query
)

spans_df.head()

| context.span_id | name | span_kind | parent_id | start_time | end_time | status_code | status_message | events | context.span_id | context.trace_id | attributes.openinference.span.kind | attributes.output.mime_type | attributes.output.value | attributes.input.value | attributes.input.mime_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9c66b61c2cf52502 | movie_agent_improved | AGENT | | 2025-07-01 06:14:51.803067+00:00 | 2025-07-01 06:14:53.117168+00:00 | OK | | [] | 9c66b61c2cf52502 | ca223f4d90f76412af4a5d6c7f819c57 | AGENT | text/plain | The Brad Pitt movie that received the highest ... | Which Brad Pitt movie received the highest rat... | text/plain |
| 55a8b854d602f733 | movie_agent_improved | AGENT | | 2025-07-01 06:14:53.118755+00:00 | 2025-07-01 06:14:54.688625+00:00 | OK | | [] | 55a8b854d602f733 | e8ce61899e4b12e8ed29b39f4192164e | AGENT | text/plain | The top-grossing Marvel movie is "Avengers: En... | What is the top grossing Marvel movie? | text/plain |
| 2c150fc600402bc7 | movie_agent_improved | AGENT | | 2025-07-01 06:14:54.692647+00:00 | 2025-07-01 06:14:56.342621+00:00 | OK | | [] | 2c150fc600402bc7 | 5640d03ddee4c1a4733affef4e1ebae3 | AGENT | text/plain | The most popular foreign-language fantasy movi... | What foreign-language fantasy movie was the mo... | text/plain |
| 86b36e81f8a12b0e | movie_agent_improved | AGENT | | 2025-07-01 06:14:56.343718+00:00 | 2025-07-01 06:15:02.652487+00:00 | OK | | [] | 86b36e81f8a12b0e | b3b8f774901c166a0b76441d093d7d62 | AGENT | text/plain | In 2017, some of the highest-rated science fic... | what are the best sci-fi movies of 2017? | text/plain |
| 03610eb2d13d7f12 | movie_agent_improved | AGENT | | 2025-07-01 06:15:02.654279+00:00 | 2025-07-01 06:15:04.477173+00:00 | OK | | [] | 03610eb2d13d7f12 | d7eb7b890af9895cc5eccfe166d838fc | AGENT | text/plain | I don't know. | What anime topped the box office in the 2010s? | text/plain |


In [80]:
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

evals_df = llm_classify(
    data=spans_df,
    model=OpenAIModel(model="gpt-4o"),
    rails=["correct", "incorrect"],
    template=PromptTemplate(
        template=eval_prompt,
    ),
    exit_on_error=False,
    provide_explanation=True,
)

# Assign 1 to correct and 0 to incorrect
evals_df["score"] = evals_df["label"].apply(lambda x: 1 if x == "correct" else 0)
evals_df[["label", "score", "explanation"]].head()


| context.span_id | label | score | explanation |
|---|---|---|---|
| 9c66b61c2cf52502 | incorrect | 0 | "Voom Portraits" is not a widely recognized Br... |
| 55a8b854d602f733 | correct | 1 | The answer correctly identifies "Avengers: End... |
| 2c150fc600402bc7 | incorrect | 0 | The answer claims that "The Nights Belong to M... |
| 86b36e81f8a12b0e | incorrect | 0 | The answer lists several movies with perfect s... |
| 03610eb2d13d7f12 | incorrect | 0 | The answer 'I don't know' does not provide any... |
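To turn per-span labels into a single number that can be tracked across runs, the labels can be aggregated. The sketch below is a plain-Python illustration that uses only the five labels displayed above; the full `evals_df` may contain more rows, so the figure it prints describes the displayed rows only.

```python
from collections import Counter

# Labels copied from the five rows displayed above; the full frame may hold more.
labels = ["incorrect", "correct", "incorrect", "incorrect", "incorrect"]

# Same mapping as the notebook cell: 1 for "correct", 0 for anything else.
scores = [1 if label == "correct" else 0 for label in labels]
accuracy = sum(scores) / len(scores)
counts = Counter(labels)

print(dict(counts))
print(f"accuracy on displayed rows: {accuracy:.0%}")
```

With the DataFrame itself, `evals_df["score"].mean()` would give the same kind of aggregate over every evaluated span.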


In [78]:
import phoenix as px
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
        dataframe=evals_df,
        eval_name="llm_correctness",
    )
)



Our improved agent is now able to answer all 6 questions, but our `llm_correctness` eval shows that its responses are still not very good:

- Querying for `Anime` and responding with `Frozen II` misses the point that anime is a Japanese form of animation.
- The LLM interprets "top" or "best" as the highest rating but does not take the number of votes into account.

Our `txt2sql` prompt still needs more instructions if we want to improve its performance. But we're on the right track and can find more ways to guide the LLM.
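To make the vote-count problem concrete, here is a minimal sketch of the kind of query the prompt could steer the model toward. It assumes a hypothetical `movies` table with `title`, `rating`, and `vote_count` columns, uses made-up rows, and picks an arbitrary `MIN_VOTES` threshold; none of these are taken from the notebook's actual schema or data.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, rating REAL, vote_count INTEGER)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [
        ("Obscure Short", 10.0, 1),
        ("Well-Known Film", 8.4, 25000),
        ("Popular Sequel", 7.9, 18000),
    ],
)

# Naive reading of "best": highest rating, regardless of how many people voted.
naive = conn.execute(
    "SELECT title FROM movies ORDER BY rating DESC LIMIT 1"
).fetchone()[0]

# Guarded reading: only consider movies with enough votes to be meaningful.
MIN_VOTES = 1000  # arbitrary threshold chosen for this illustration
guarded = conn.execute(
    "SELECT title FROM movies WHERE vote_count >= ? ORDER BY rating DESC LIMIT 1",
    (MIN_VOTES,),
).fetchone()[0]

print("naive:", naive)
print("guarded:", guarded)
conn.close()
```

A single vote is enough to produce a perfect 10.0, which is consistent with the long list of obscure perfect-score titles in the agent's sci-fi answer. An instruction in the prompt to require a minimum vote count when ranking by rating is one possible way to address this.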

This tutorial demonstrated the core principles of building **evals that work** for AI applications. Here are the key concepts you should take away:

1. **Build & Trace**: Instrument your AI application with tracing from day one
2. **Annotate**: Use human judgment to label traces with simple labels like correct/incorrect
3. **Create Evaluators**: Build both simple programmatic evals and LLM judges
4. **Experiment**: Run systematic experiments to compare different approaches
5. **Iterate**: Use evaluation results to improve prompts, models, or architecture



# Bibliography

<cite id="aires2024">Aires, A. R. (2024). *Movies Dataset*. Hugging Face Datasets. https://huggingface.co/AiresPucrs</cite>

<cite id="goyal2024">Goyal, A. (2024). *LLM Eval for TxtToSql Notebook*. Braintrust Cookbook. https://www.braintrust.dev/docs/cookbook/recipes/Text2SQL-Data</cite>

<cite id="yan2025">Yan, Z. (2025). An LLM-as-Judge Won't Save The Product—Fixing Your Process Will. *eugeneyan.com*. https://eugeneyan.com/writing/eval-process/</cite>