<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Evals that Work</h1>

Building great AI-native products requires a rigorous evaluation process. While the idea of evaluation-driven development may seem novel to some, it is really the scientific method in disguise. Just as scientists meticulously record experiments and take detailed notes to advance their understanding, AI systems require rigorous observation through tracing and annotations to reach their full potential. The goal of AI-native products is to build tools that empower humans, and it takes careful human judgment to align AI with human preferences and values.

 <p style="text-align:center">
    <img alt="AI dev as scientific method" src="https://storage.googleapis.com/arize-phoenix-assets/assets/gifs/20250524_1125_Forest%20Robots%20Interaction_simple_compose_01jw1n770bep1a829kw3cvvcsc.gif" width="800" />
</p>
<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/scientific_method.png">

In [1]:
!pip install "arize-phoenix>=10.0.0" openai 'httpx<0.28' duckdb datasets pyarrow "pydantic>=2.0.0" nest_asyncio openinference-instrumentation-openai --quiet

This tutorial assumes you have a locally running Phoenix server. You can think of Phoenix as a video recorder that observes every activity of your AI application.

```shell
phoenix serve
```

Let's also set up tracing for OpenAI, since we will be using their API to generate SQL and answers.

In [1]:
from phoenix.otel import register

tracer_provider = register(
    project_name="basketball-app",
    auto_instrument=True, # Start recording traces via OpenAIInstrumentor
)

tracer = tracer_provider.get_tracer(__name__)



🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: basketball-app
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



Let's make sure we can run async code in the notebook.

In [2]:
import nest_asyncio

nest_asyncio.apply()

Lastly, let's make sure we have our OpenAI API key set up.

In [3]:
import os
from getpass import getpass

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

## Download Data

We are going to use an NBA dataset that contains game information from 2014 to 2018. We will use DuckDB as our database.

In [4]:
import duckdb
from datasets import load_dataset

data = load_dataset("suzyanil/nba-data")["train"]

conn = duckdb.connect(database=":memory:", read_only=False)
conn.register("nba", data.to_pandas())

conn.query("SELECT * FROM nba LIMIT 5").to_df().to_dict(orient="records")[0]

{'Unnamed: 0': 1,
 'Team': 'ATL',
 'Game': 1,
 'Date': '10/29/14',
 'Home': 'Away',
 'Opponent': 'TOR',
 'WINorLOSS': 'L',
 'TeamPoints': 102,
 'OpponentPoints': 109,
 'FieldGoals': 40,
 'FieldGoalsAttempted': 80,
 'FieldGoals.': 0.5,
 'X3PointShots': 13,
 'X3PointShotsAttempted': 22,
 'X3PointShots.': 0.591,
 'FreeThrows': 9,
 'FreeThrowsAttempted': 17,
 'FreeThrows.': 0.529,
 'OffRebounds': 10,
 'TotalRebounds': 42,
 'Assists': 26,
 'Steals': 6,
 'Blocks': 8,
 'Turnovers': 17,
 'TotalFouls': 24,
 'Opp.FieldGoals': 37,
 'Opp.FieldGoalsAttempted': 90,
 'Opp.FieldGoals.': 0.411,
 'Opp.3PointShots': 8,
 'Opp.3PointShotsAttempted': 26,
 'Opp.3PointShots.': 0.308,
 'Opp.FreeThrows': 27,
 'Opp.FreeThrowsAttempted': 33,
 'Opp.FreeThrows.': 0.818,
 'Opp.OffRebounds': 16,
 'Opp.TotalRebounds': 48,
 'Opp.Assists': 26,
 'Opp.Steals': 13,
 'Opp.Blocks': 9,
 'Opp.Turnovers': 9,
 'Opp.TotalFouls': 22}

## Implement Text2SQL

Let's start by implementing some simple text2sql logic.

In [5]:
import os

import openai

from phoenix.client import Client
from phoenix.client.types import PromptVersion

phoenix_client = Client()
client = openai.AsyncClient()

columns = conn.query("DESCRIBE nba").to_df().to_dict(orient="records")

# We will use GPT-4o to start
TASK_MODEL = "gpt-4o"
CONFIG = {"model": TASK_MODEL}

system_prompt = (
    "You are a SQL expert, and you are given a single table named nba with the following columns:\n"
    f'{",".join(column["column_name"] + ": " + column["column_type"] for column in columns)}\n'
    "Write a SQL query corresponding to the user's request. Return just the query text, "
    "with no formatting (backticks, markdown, etc.)."
)

prompt_template = phoenix_client.prompts.create(
    name="text2sql",
    version=PromptVersion(
        [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": "{{input}}",
            },
        ],
        description="Initial prompt for text2sql",
        model_name=TASK_MODEL,
    ),
)



@tracer.chain
async def generate_query(input):
    prompt = prompt_template.format(variables={"input": input}, sdk="openai")
    response = await client.chat.completions.create(
        **prompt,
        temperature=0,
    )
    return response.choices[0].message.content

In [6]:
query = await generate_query("Who won the most games?")
print(query)

SELECT Team, COUNT(*) AS Wins
FROM nba
WHERE WINorLOSS = 'W'
GROUP BY Team
ORDER BY Wins DESC
LIMIT 1;


Awesome, it looks like the LLM is producing SQL! Let's try running the query and see if we get the expected results.

In [7]:
@tracer.tool
def execute_query(query):
    return conn.query(query).fetchdf().to_dict(orient="records")


execute_query(query)

[{'Team': 'GSW', 'Wins': 265}]
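
The system prompt asks for bare SQL, but a model can still wrap its answer in a Markdown code fence, which would make the query fail to parse. Below is a minimal defensive sketch. It assumes a hypothetical helper named `strip_sql_fences`, which is not part of Phoenix or DuckDB, that could be applied to the model output before it reaches `execute_query`.

```python
import re

FENCE = "`" * 3  # Markdown code fence marker, built this way to keep the example readable

# Matches text that is entirely wrapped in a fence, with an optional language tag.
_FENCED = re.compile(
    r"^" + FENCE + r"[a-zA-Z]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_sql_fences(text):
    """Return the query without a surrounding Markdown fence, if one is present."""
    text = text.strip()
    match = _FENCED.match(text)
    return match.group(1).strip() if match else text


wrapped = FENCE + "sql\nSELECT Team FROM nba LIMIT 1;\n" + FENCE
print(strip_sql_fences(wrapped))
print(strip_sql_fences("SELECT Team FROM nba LIMIT 1;"))
```

Both calls print the same bare query, so the helper is safe to apply whether or not the model followed the formatting instruction.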

Let's put the pieces together and see if we can create a basketball agent.

In [25]:
@tracer.chain
async def text2sql(question):
    query = await generate_query(question)
    results = None
    error = None
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)

    return {
        "query": query,
        "results": results,
        "error": error,
    }

@tracer.agent
async def basketball_agent(question):
    sql_response = await text2sql(question)
    if sql_response["error"]:
        raise Exception(sql_response["error"])
    results = sql_response["results"]
    answer = await client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are a helpful assistant that can answer questions about the NBA. Do not use sql or abbreviations for teams. Use the full team names and use an informative, concise voice. Your response should be purely in natural language, do not include any sql or other technical details."
             },
            {"role": "user",
             "content": f"The sql results of the query are: {results}. Answer the following question: {question}."
             },
        ],
    )
    return answer.choices[0].message.content

await basketball_agent("Who won the most games?")




'The Golden State Warriors won the most games, with a total of 265 wins.'

Let's run the agent over some examples.

In [26]:
from phoenix.trace import using_project

questions = [
    "Which team won the most games?",
    "Which team won the most games in 2015?",
    "Who led the league in 3 point shots?",
    "Which team had the biggest difference in records across two consecutive years?",
    "What is the average number of free throws per year?",
]

with using_project(project_name="basketball-agent-test"):
    for question in questions:
        try:
            answer = await basketball_agent(question)
            print(answer)
        except Exception as e:
            print(e)



The team that won the most games is the Golden State Warriors with 265 wins.
The Golden State Warriors won the most games in the 2015 NBA regular season. They achieved a record of 67 wins and 15 losses, which was the best in the league that year.
The Houston Rockets led the league in three-point shots.
Binder Error: No function matches the given name and argument types 'year(VARCHAR)'. You might need to add explicit type casts.
	Candidate functions:
	year(DATE) -> BIGINT
	year(INTERVAL) -> BIGINT
	year(TIMESTAMP) -> BIGINT
	year(TIMESTAMP WITH TIME ZONE) -> BIGINT

The data provided seems to be structured by specific days rather than years, as each entry represents the average number of free throws for possible dates across a season. Hence, it is not depicting yearly averages directly.

However, if you want to calculate an average for all the dates listed, you would sum the average free throws of each entry and divide by the number of entries. This would give an overall average of free

Let's look at the traces in Phoenix and annotate some of them to see what the issues might be. Once we've annotated some data, we can bring it back into the notebook to analyze it.

In [27]:
from phoenix.client import Client
from phoenix.trace.dsl import SpanQuery

phoenix_client = Client()
query = SpanQuery().where("name == 'basketball_agent'")

spans_df = phoenix_client.spans.get_spans_dataframe(project_identifier="basketball-agent-test", query=query)
annotations_df = phoenix_client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df, project_identifier="basketball-agent-test"
)

combined_df = annotations_df.join(spans_df, how="inner")

combined_df.head()

| | annotation_name | annotator_kind | metadata | identifier | id | created_at | updated_at | source | user_id | result.label | ... | status_code | status_message | events | context.span_id | context.trace_id | attributes.output.value | attributes.input.value | attributes.input.mime_type | attributes.openinference.span.kind | attributes.output.mime_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8bf8a5ab44b7e1ee | correctness | HUMAN | {} | | U3BhbkFubm90YXRpb246NQ== | 2025-05-25T00:19:48+00:00 | 2025-05-25T00:19:48+00:00 | APP | | correct | ... | OK | | [] | 8bf8a5ab44b7e1ee | 3c92fcf788f76805dd0db54dde7aeaae | The team that won the most games is the Golden... | Which team won the most games? | text/plain | AGENT | text/plain |
| f5ab39d3d9358214 | correctness | HUMAN | {} | | U3BhbkFubm90YXRpb246NA== | 2025-05-25T00:19:37+00:00 | 2025-05-25T00:19:37+00:00 | APP | | correct | ... | OK | | [] | f5ab39d3d9358214 | dee64c41776694236ef3c98ff60ce60c | The Golden State Warriors won the most games i... | Which team won the most games in 2015? | text/plain | AGENT | text/plain |
| 5af18c707c039957 | correctness | HUMAN | {} | | U3BhbkFubm90YXRpb246Mw== | 2025-05-25T00:19:32+00:00 | 2025-05-25T00:19:32+00:00 | APP | | correct | ... | OK | | [] | 5af18c707c039957 | fad9986ebb27dcc30be74ef8e26842ce | The Houston Rockets led the league in three-po... | Who led the league in 3 point shots? | text/plain | AGENT | text/plain |
| eb28c25edf7cc607 | correctness | HUMAN | {} | | U3BhbkFubm90YXRpb246Mg== | 2025-05-25T00:19:27+00:00 | 2025-05-25T00:19:27+00:00 | APP | | incorrect | ... | ERROR | Exception: Binder Error: No function matches t... | [{'name': 'exception', 'timestamp': '2025-05-2... | eb28c25edf7cc607 | 82507dc70d42fd51d7f47d79c1797251 | | Which team had the biggest difference in recor... | text/plain | AGENT | |
| 2b020f266ba4294b | correctness | HUMAN | {} | | U3BhbkFubm90YXRpb246MQ== | 2025-05-25T00:19:24+00:00 | 2025-05-25T00:19:24+00:00 | APP | | correct | ... | OK | | [] | 2b020f266ba4294b | 43d3bf15478cf04bbc00f8240461418a | The data provided seems to be structured by sp... | What is the average number of free throws per ... | text/plain | AGENT | text/plain |


In [28]:
expamples_df = combined_df[["annotation_name", "result.label", "attributes.input.value", "attributes.output.value"]].head()
expamples_df

| | annotation_name | result.label | attributes.input.value | attributes.output.value |
|---|---|---|---|---|
| 8bf8a5ab44b7e1ee | correctness | correct | Which team won the most games? | The team that won the most games is the Golden... |
| f5ab39d3d9358214 | correctness | correct | Which team won the most games in 2015? | The Golden State Warriors won the most games i... |
| 5af18c707c039957 | correctness | correct | Who led the league in 3 point shots? | The Houston Rockets led the league in three-po... |
| eb28c25edf7cc607 | correctness | incorrect | Which team had the biggest difference in recor... | |
| 2b020f266ba4294b | correctness | correct | What is the average number of free throws per ... | The data provided seems to be structured by sp... |


Let's see if we can create an LLM judge that aligns with our human evaluation.

In [33]:
eval_prompt = f"""
You are an expert evaluator of question and answer pairs. You will be given a human question and an answer from a model.
Your job is to determine if the answer is "correct" or "incorrect".

Here are some examples of correct and incorrect answers:
{'\n\n'.join([f"Question: {example['attributes.input.value']}\nAnswer: {example['attributes.output.value']}\nLabel: {example['result.label']}" for example in expamples_df.to_dict(orient="records")])}

## Evaluation
Provide your answer in the following format:
Question: <question>
Answer: <answer>
Explanation: <explanation>
Label: <correct|incorrect>

Question: {{attributes.input.value}}
Answer: {{attributes.output.value}}
Explanation:
"""

print(eval_prompt)


You are an expert evaluator of question and answer pairs. You will be given a human question and an answer from a model.
Your job is to determine if the answer is "correct" or "incorrect".

Here are some examples of correct and incorrect answers:
Question: Which team won the most games?
Answer: The team that won the most games is the Golden State Warriors with 265 wins.
Label: correct

Question: Which team won the most games in 2015?
Answer: The Golden State Warriors won the most games in the 2015 NBA regular season. They achieved a record of 67 wins and 15 losses, which was the best in the league that year.
Label: correct

Question: Who led the league in 3 point shots?
Answer: The Houston Rockets led the league in three-point shots.
Label: correct

Question: Which team had the biggest difference in records across two consecutive years?
Answer: None
Label: incorrect

Question: What is the average number of free throws per year?
Answer: The data provided seems to be structured by spec

In [34]:
spans_df[["attributes.input.value", "attributes.output.value"]].head()

| context.span_id | attributes.input.value | attributes.output.value |
|---|---|---|
| 8bf8a5ab44b7e1ee | Which team won the most games? | The team that won the most games is the Golden... |
| f5ab39d3d9358214 | Which team won the most games in 2015? | The Golden State Warriors won the most games i... |
| 5af18c707c039957 | Who led the league in 3 point shots? | The Houston Rockets led the league in three-po... |
| eb28c25edf7cc607 | Which team had the biggest difference in recor... | |
| 2b020f266ba4294b | What is the average number of free throws per ... | The data provided seems to be structured by sp... |


In [35]:
from phoenix.evals import llm_classify
from phoenix.evals.models import OpenAIModel
from phoenix.evals.templates import PromptTemplate

evals_df = llm_classify(
    data=spans_df,
    model=OpenAIModel(model="gpt-4o"),
    rails=["correct", "incorrect"],
    template=PromptTemplate(
        template=eval_prompt,
    ),
    exit_on_error=False,
    provide_explanation=True,
)

## Assign 1 to correct and 0 to incorrect
evals_df["score"] = evals_df["label"].apply(lambda x: 1 if x == "correct" else 0)
evals_df[["label", "score", "explanation"]].head()


                                                                   
llm_classify |██████████| 5/5 (100.0%) | ⏳ 04:55<00:00 | 59.19s/it

Retries exhausted after 1 attempts: Missing template variables: attributes.output.value

| context.span_id | label | score | explanation |
|---|---|---|---|
| 8bf8a5ab44b7e1ee | correct | 1 | The answer provides a specific team, the Golde... |
| f5ab39d3d9358214 | correct | 1 | The answer correctly identifies the Golden Sta... |
| 5af18c707c039957 | correct | 1 | The answer correctly identifies the Houston Ro... |
| eb28c25edf7cc607 | | 0 | |
| 2b020f266ba4294b | correct | 1 | The answer provides a reasonable explanation f... |




In [36]:
import phoenix as px
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(
       dataframe=evals_df,
       eval_name="llm_correctness",
    )
)
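
To check whether the LLM judge actually aligns with the human annotations, the two sets of labels can be compared directly. The sketch below is a plain-Python illustration rather than a Phoenix API: the helper names `to_score` and `agreement_rate` are hypothetical, and the labels are copied by hand from the two tables above.

```python
# Human annotation labels, keyed by span ID (copied from the annotations table).
human_labels = {
    "8bf8a5ab44b7e1ee": "correct",
    "f5ab39d3d9358214": "correct",
    "5af18c707c039957": "correct",
    "eb28c25edf7cc607": "incorrect",
    "2b020f266ba4294b": "correct",
}

# LLM judge labels; the span with no output received no label at all.
llm_labels = {
    "8bf8a5ab44b7e1ee": "correct",
    "f5ab39d3d9358214": "correct",
    "5af18c707c039957": "correct",
    "eb28c25edf7cc607": None,
    "2b020f266ba4294b": "correct",
}


def to_score(label):
    """Same rule as the notebook: 1 for 'correct', 0 for anything else."""
    return 1 if label == "correct" else 0


def agreement_rate(human, llm):
    """Fraction of shared spans where both scorers give the same score."""
    shared = human.keys() & llm.keys()
    if not shared:
        return 0.0
    matches = sum(to_score(human[key]) == to_score(llm[key]) for key in shared)
    return matches / len(shared)


rate = agreement_rate(human_labels, llm_labels)
print(f"agreement: {rate:.0%}")
```

Note that the agreement on the failed span is partly accidental: the judge never ran on it because the output was missing, and the scoring rule maps the missing label to 0. With only five examples this number is a sanity check, not a measurement.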


## Experimentation

<p style="text-align: center">
<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/evaluator.png" width="800">
</p>
Evaluation consists of three parts — data, task, and scores. We'll start with data.

In [37]:
questions = [
    "Which team won the most games?",
    "Which team won the most games in 2015?",
    "Who led the league in 3 point shots?",
    "Which team had the biggest difference in records across two consecutive years?",
    "What is the average number of free throws per year?",
]

Let's store the data above as a versioned dataset in Phoenix.

In [39]:
import phoenix as px
import pandas as pd

ds = px.Client().upload_dataset(
    dataset_name="nba-questions-v4",
    dataframe=pd.DataFrame([{"question": question} for question in questions]),
    input_keys=["question"],
    output_keys=[],
)

# If you have already uploaded the dataset, you can fetch it using the following line
# ds = px.Client().get_dataset(name="nba-questions-v4")

📤 Uploading dataset...
💾 Examples uploaded: http://127.0.0.1:6006/datasets/RGF0YXNldDozMg==/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246MzI=


Next, we'll define the task. The task is to generate SQL queries from natural language questions.

In [40]:
@tracer.chain
async def text2sql(question):
    query = await generate_query(question)
    results = None
    error = None
    try:
        results = execute_query(query)
    except duckdb.Error as e:
        error = str(e)

    return {
        "query": query,
        "results": results,
        "error": error,
    }

Finally, we'll define the scores. We'll use the following simple scoring functions to see if the generated SQL queries are correct.

In [41]:
# Test if there are no sql execution errors

def no_error(output):
    return 1.0 if output.get("error") is None else 0.0


# Test if the query has results
def has_results(output):
    results = output.get("results")
    has_results = results is not None and len(results) > 0
    return 1.0 if has_results else 0.0
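
To see what these two scorers report before wiring them into an experiment, they can be run on a few hand-written task outputs. This is only an illustrative sketch: the functions are repeated so the block runs on its own, and `sample_outputs` is made-up data shaped like the dictionary that `text2sql` returns.

```python
def no_error(output):
    return 1.0 if output.get("error") is None else 0.0


def has_results(output):
    results = output.get("results")
    return 1.0 if results is not None and len(results) > 0 else 0.0


# Hypothetical task outputs: a success, an empty result set, and a SQL error.
sample_outputs = [
    {"query": "SELECT 1 AS x", "results": [{"x": 1}], "error": None},
    {"query": "SELECT 1 AS x WHERE false", "results": [], "error": None},
    {"query": "SELECT year(Date) FROM nba", "results": None, "error": "Binder Error"},
]

for evaluator in (no_error, has_results):
    scores = [evaluator(output) for output in sample_outputs]
    average = sum(scores) / len(scores)
    print(evaluator.__name__, scores, round(average, 3))
```

The second sample shows why both scorers are useful together: a query can run without error and still return nothing, which `no_error` alone would not catch.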

Now let's run the evaluation experiment.

In [42]:
import phoenix as px
from phoenix.experiments import run_experiment


# Define the task to run text2sql on the input question
def task(input):
    return text2sql(input["question"])


experiment = run_experiment(
    ds, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG
)

🧪 Experiment started.
📺 View dataset experiments: http://127.0.0.1:6006/datasets/RGF0YXNldDozMg==/experiments
🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDozMg==/compare?experimentId=RXhwZXJpbWVudDo0OQ==




✅ Task runs completed.
🧠 Evaluation started.



running tasks |██████████| 5/5 (100.0%) | ⏳ 00:03<00:00 |  1.61it/s



🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDozMg==/compare?experimentId=RXhwZXJpbWVudDo0OQ==

Experiment Summary (05/24/25 06:24 PM -0600)
--------------------------------------------
     evaluator  n  n_scores  avg_score
0  has_results  5         5        0.6
1     no_error  5         5        0.8

Tasks Summary (05/24/25 06:24 PM -0600)
---------------------------------------
   n_examples  n_runs  n_errors
0           5       5         0


OK! It looks like 3 of our 5 queries return results, and 4 of 5 run without errors.


## Interpreting the results

Now that we have run the initial evaluation, the summary shows that three of the queries return results, one produces a SQL error, and one runs without error but returns no results.

- The query that returned no results likely didn't get the date format right. Showing the model a sample of the data (e.g. a few-shot example) would probably help.

- There is a binder error, which may also stem from not understanding the data format.
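
Both problems trace back to the `Date` column, which is stored as text such as `'10/29/14'` rather than as a date type. The sketch below shows the same parsing step in plain Python, using a hypothetical helper named `game_year`. DuckDB's `strptime` function takes a similar format string, so an expression along the lines of `year(strptime(Date, '%m/%d/%y'))` may be what the generated SQL needs, though that is an assumption not verified in this notebook.

```python
from datetime import datetime


def game_year(date_text):
    """Parse a 'MM/DD/YY' string, as stored in the Date column, and return the year."""
    return datetime.strptime(date_text, "%m/%d/%y").year


print(game_year("10/29/14"))
```

Calling `year()` directly on the raw string is what triggers the `year(VARCHAR)` binder error shown earlier, because the value has to be converted to a date first.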

Let's try to improve the prompt with few-shot examples and see if we can get better results.

In [45]:
samples = conn.query("SELECT * FROM nba LIMIT 1").to_df().to_dict(orient="records")[0]
sample_rows = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[column['column_name']]}"
    for column in columns
)
system_prompt = (
    "You are a SQL expert, and you are given a single table named nba with the following columns:\n\n"
    "Column | Type | Example\n"
    "-------|------|--------\n"
    f"{sample_rows}\n"
    "\n"
    "Write a DuckDB SQL query corresponding to the user's request. "
    "Return just the query text, with no formatting (backticks, markdown, etc.)."
)


prompt_template = phoenix_client.prompts.create(
    name="text2sql",
    version=PromptVersion(
        [
            {
                "role": "system",
                "content": system_prompt,
            },
            {
                "role": "user",
                "content": "{{input}}",
            },
        ],
        description="Add few shot examples to the prompt",
        model_name=TASK_MODEL,
    ),
)

print(await generate_query("Which team won the most games in 2015?"))

```sql
SELECT Team, COUNT(*) AS Wins
FROM nba
WHERE WINorLOSS = 'W' AND Date LIKE '%/15'
GROUP BY Team
ORDER BY Wins DESC
LIMIT 1;
```




Looking much better! Let's re-run the experiment with the updated prompt, using the same two scoring functions, to see whether the scores improve.




In [46]:
experiment = run_experiment(
    ds, experiment_name="with examples", task=task, evaluators=[has_results, no_error], experiment_metadata=CONFIG
)

🧪 Experiment started.
📺 View dataset experiments: http://127.0.0.1:6006/datasets/RGF0YXNldDozMg==/experiments
🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDozMg==/compare?experimentId=RXhwZXJpbWVudDo1NA==


running experiment evaluations |██████████| 10/10 (100.0%) | ⏳ 01:50<00:00 | 11.03s/it


✅ Task runs completed.
🧠 Evaluation started.





🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDozMg==/compare?experimentId=RXhwZXJpbWVudDo1NA==

Experiment Summary (05/24/25 06:36 PM -0600)
--------------------------------------------
     evaluator  n  n_scores  avg_score
0  has_results  5         5        1.0
1     no_error  5         5        1.0

Tasks Summary (05/24/25 06:36 PM -0600)
---------------------------------------
   n_examples  n_runs  n_errors
0           5       5         0


Amazing. It looks like we removed the SQL error and now get results for every query. Let's try using an LLM as a judge to see how well it can assess the results.


In [47]:
from phoenix.evals.models import OpenAIModel
from phoenix.experiments import evaluate_experiment
from phoenix.experiments.evaluators.llm_evaluators import LLMCriteriaEvaluator

llm_evaluator = LLMCriteriaEvaluator(
    name="is_sql",
    criteria="is_sql",
    description="the output is a valid SQL query and that it executes without errors",
    model=OpenAIModel(),
)

evaluate_experiment(experiment, evaluators=[llm_evaluator])

🧠 Evaluation started.




llm_classify |██████████| 5/5 (100.0%) | ⏳ 22:44<00:00 | 272.93s/it
running tasks |██████████| 5/5 (100.0%) | ⏳ 02:27<00:00 | 29.48s/it
running experiment evaluations |██████████| 10/10 (100.0%) | ⏳ 02:23<00:00 | 14.38s/it
Transient error StatusCode.UNAVAILABLE encountered while exporting traces to localhost:4317, retrying in 1s.


[A[A

[A[A

[A[A

[A[A


🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDozMg==/compare?experimentId=RXhwZXJpbWVudDo1NA==

Experiment Summary (05/24/25 06:38 PM -0600)
--------------------------------------------
  evaluator  n  n_scores  avg_score
0    is_sql  5         5        1.0

Experiment Summary (05/24/25 06:36 PM -0600)
--------------------------------------------
     evaluator  n  n_scores  avg_score
0  has_results  5         5        1.0
1     no_error  5         5        1.0

Tasks Summary (05/24/25 06:36 PM -0600)
---------------------------------------
   n_examples  n_runs  n_errors
0           5       5         0


RanExperiment(id='RXhwZXJpbWVudDo1NA==', dataset_id='RGF0YXNldDozMg==', dataset_version_id='RGF0YXNldFZlcnNpb246MzI=', repetitions=1)

Sure enough, the LLM agrees with our scoring. Pretty neat trick! This can come in handy when it's difficult to define a scoring function.


We now have a simple text2sql pipeline that can be used to generate SQL queries from natural language questions. Since Phoenix has been tracing the entire pipeline, we can now use the Phoenix UI to convert the spans that generated successful queries into examples to use in a **golden dataset** for regression testing!

## Generating more data
Now that we have a basic flow in place, let's generate some data. We're going to use the dataset itself to generate expected queries, and have a model describe the queries. This is a slightly more robust method than having it generate queries, because we'd expect a model to describe a query more accurately than generate one from scratch.




In [146]:
import json
from typing import List

from pydantic import BaseModel


class Question(BaseModel):
    sql: str
    question: str


class Questions(BaseModel):
    questions: List[Question]


sample_rows = "\n".join(
    f"{column['column_name']} | {column['column_type']} | {samples[column['column_name']]}"
    for column in columns
)
synthetic_data_prompt = f"""You are a SQL expert, and you are given a single table named nba with the following columns:

Column | Type | Example
-------|------|--------
{sample_rows}

Generate SQL queries that would be interesting to ask about this table. Return the SQL query as a string, as well as the
question that the query answers."""

response = await client.chat.completions.create(
    model="gpt-4o",
    temperature=0,
    messages=[
        {
            "role": "user",
            "content": synthetic_data_prompt,
        }
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "generate_questions",
                "description": "Generate SQL queries that would be interesting to ask about this table.",
                "parameters": Questions.model_json_schema(),
            },
        }
    ],
    tool_choice={"type": "function", "function": {"name": "generate_questions"}},
)

generated_questions = json.loads(response.choices[0].message.tool_calls[0].function.arguments)[
    "questions"
]
generated_questions[0]

{'sql': "SELECT Team, COUNT(*) AS Wins FROM nba WHERE WINorLOSS = 'W' GROUP BY Team ORDER BY Wins DESC;",
 'question': 'Which team has the most wins?'}

In [147]:
generated_dataset = []
for q in generated_questions:
    try:
        result = execute_query(q["sql"])
        generated_dataset.append(
            {
                "input": q["question"],
                "expected": {
                    "results": result,
                    "error": None,
                    "query": q["sql"],
                },
                "metadata": {
                    "category": "Generated",
                },
            }
        )
    except duckdb.Error as e:
        print(f"Query failed: {q['sql']}", e)
        print("Skipping...")

generated_dataset[0]

Query failed: SELECT Team, AVG(FieldGoals.) AS AvgFieldGoalPercentage FROM nba GROUP BY Team ORDER BY AvgFieldGoalPercentage DESC; Parser Error: syntax error at or near ")"
Skipping...


{'input': 'Which team has the most wins?',
 'expected': {'results': [{'Team': 'GSW', 'Wins': 265},
   {'Team': 'SAS', 'Wins': 230},
   {'Team': 'HOU', 'Wins': 217},
   {'Team': 'TOR', 'Wins': 215},
   {'Team': 'CLE', 'Wins': 211},
   {'Team': 'LAC', 'Wins': 202},
   {'Team': 'BOS', 'Wins': 196},
   {'Team': 'OKC', 'Wins': 195},
   {'Team': 'POR', 'Wins': 185},
   {'Team': 'WAS', 'Wins': 179},
   {'Team': 'UTA', 'Wins': 177},
   {'Team': 'ATL', 'Wins': 175},
   {'Team': 'IND', 'Wins': 173},
   {'Team': 'MIA', 'Wins': 170},
   {'Team': 'MEM', 'Wins': 162},
   {'Team': 'MIL', 'Wins': 160},
   {'Team': 'CHI', 'Wins': 160},
   {'Team': 'NOP', 'Wins': 157},
   {'Team': 'CHO', 'Wins': 153},
   {'Team': 'DET', 'Wins': 152},
   {'Team': 'DAL', 'Wins': 149},
   {'Team': 'DEN', 'Wins': 149},
   {'Team': 'MIN', 'Wins': 123},
   {'Team': 'SAC', 'Wins': 121},
   {'Team': 'ORL', 'Wins': 114},
   {'Team': 'NYK', 'Wins': 109},
   {'Team': 'PHI', 'Wins': 108},
   {'Team': 'PHO', 'Wins': 107},
   {'Team'

Awesome, let's create a dataset with the new synthetic data.




In [148]:
synthetic_dataset = px.Client().upload_dataset(
    dataset_name="nba-golden-synthetic",
    inputs=[{"question": example["input"]} for example in generated_dataset],
    outputs=[example["expected"] for example in generated_dataset],
);


📤 Uploading dataset...
💾 Examples uploaded: http://127.0.0.1:6006/datasets/RGF0YXNldDozMQ==/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246MzE=
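
Because each synthetic example stores the expected query results, a scorer can compare what the task returned against them. The sketch below uses a hypothetical function named `results_match` and assumes that an evaluator with arguments named `output` and `expected` receives the task output and the example's stored output. It compares row values only, ignoring row order and column names, since a generated query may use different aliases than the reference query.

```python
def results_match(output, expected):
    """Score 1.0 when both result sets contain the same rows of values."""
    actual_rows = output.get("results")
    expected_rows = expected.get("results")
    if actual_rows is None or expected_rows is None:
        return 0.0

    def normalize(rows):
        # Keep only the values of each row, as strings, in a stable order.
        return sorted(tuple(str(value) for value in row.values()) for row in rows)

    return 1.0 if normalize(actual_rows) == normalize(expected_rows) else 0.0


expected = {"results": [{"Team": "GSW", "Wins": 265}, {"Team": "SAS", "Wins": 230}]}
same_rows = {"results": [{"team": "SAS", "total": 230}, {"team": "GSW", "total": 265}]}
different_rows = {"results": [{"Team": "GSW", "Wins": 265}]}

print(results_match(same_rows, expected))
print(results_match(different_rows, expected))
```

Ignoring row order is a deliberate trade-off: it avoids penalizing equivalent queries, but it also means a missing `ORDER BY` would go unnoticed.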





In [149]:
run_experiment(
    synthetic_dataset, task=task, evaluators=[no_error, has_results], experiment_metadata=CONFIG
)

🧪 Experiment started.
📺 View dataset experiments: http://127.0.0.1:6006/datasets/RGF0YXNldDozMQ==/experiments
🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDozMQ==/compare?experimentId=RXhwZXJpbWVudDo0NQ==


running tasks |██████████| 9/9 (100.0%) | ⏳ 00:02<00:00 |  4.79it/s

✅ Task runs completed.
🧠 Evaluation started.


running tasks |██████████| 9/9 (100.0%) | ⏳ 00:04<00:00 |  1.94it/s



🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDozMQ==/compare?experimentId=RXhwZXJpbWVudDo0NQ==

Experiment Summary (05/24/25 05:52 PM -0600)
--------------------------------------------
     evaluator  n  n_scores  avg_score
0  has_results  9         9   0.777778
1     no_error  9         9   0.777778

Tasks Summary (05/24/25 05:52 PM -0600)
---------------------------------------
   n_examples  n_runs  n_errors
0           9       9         0


RanExperiment(id='RXhwZXJpbWVudDo0NQ==', dataset_id='RGF0YXNldDozMQ==', dataset_version_id='RGF0YXNldFZlcnNpb246MzE=', repetitions=1)



Amazing! Now we have a rich dataset to work with and some failures to debug. From here, you could investigate whether some of the generated data needs improvement, try tweaking the prompt to improve accuracy, or try something more adventurous, like feeding the errors back to the model and having it iterate on a better query. Most importantly, we have a good workflow in place to iterate on both the application and the dataset.
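
One failure is already visible in the synthetic data step: the skipped query referenced `FieldGoals.` without quotes, and a column name ending in a period has to be written as a double-quoted identifier. A small sketch of the quoting rule follows, using a hypothetical helper named `quote_identifier`. Listing the quoted names in the system prompt is one possible prompt tweak, not something this notebook has tested.

```python
def quote_identifier(name):
    """Wrap a column name in double quotes, doubling any embedded quote characters."""
    return '"' + name.replace('"', '""') + '"'


column = quote_identifier("FieldGoals.")
fixed_query = (
    f"SELECT Team, AVG({column}) AS AvgFieldGoalPercentage "
    "FROM nba GROUP BY Team ORDER BY AvgFieldGoalPercentage DESC;"
)
print(fixed_query)
```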



Let's re-run our basketball agent with our newly improved prompt to see if we see improvements.

## Trying a smaller model
Just for fun, let's wrap things up by trying out GPT-3.5-turbo. All we need to do is switch the model name and run the experiment again.




In [None]:
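TASK_MODEL = "gpt-3.5-turbo"

# The model name is stored on the prompt version, so save a new version that uses the smaller model.
messages = [{"role": "system", "content": system_prompt}, {"role": "user", "content": "{{input}}"}]
prompt_template = phoenix_client.prompts.create(
    name="text2sql",
    version=PromptVersion(messages, description="Try a smaller model", model_name=TASK_MODEL),
)

experiment = run_experiment(
    synthetic_dataset,
    task=task,
    evaluators=[no_error, has_results],
    experiment_metadata={"model": TASK_MODEL},
)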

Interesting! It looks like the smaller model does decently well, but we might want to ensure it follows instructions as well as a larger model does. We can actually grab all the LLM spans from our previous GPT-4o runs and use them to generate an OpenAI fine-tuning JSONL file!

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/fine_tining_nba.png">
<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/openai_ft.png">

## Conclusion

In this example, we walked through the process of building a dataset for a text2sql application. We started with a few handwritten examples, and iterated on the dataset by using an LLM to generate more examples. We used the eval framework to track our progress, and iterated on the model and dataset to improve the results. Finally, we tried out a less powerful model to see if we could save cost or improve latency.

Happy evaluations!