<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# **Generating Synthetic Datasets using LLMs**

Synthetic datasets are a powerful way to test and refine your LLM applications, especially when real-world data is limited, sensitive, or hard to collect. By guiding the model to generate structured examples, you can quickly create datasets that cover common scenarios, complex multi-step cases, and edge cases like typos or out-of-scope queries.

In this notebook, we’ll walk through different strategies for dataset generation and show how they can be used to run experiments and test evaluators.

# **Set up Dependencies and Keys**

In [None]:
%pip install -qq openai arize-phoenix openinference-instrumentation-openai

In [None]:
import os
from getpass import getpass

import nest_asyncio

nest_asyncio.apply()

if not (phoenix_endpoint := os.getenv("PHOENIX_COLLECTOR_ENDPOINT")):
    phoenix_endpoint = getpass("🔑 Enter your Phoenix Collector Endpoint: ")
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = phoenix_endpoint


if not (phoenix_api_key := os.getenv("PHOENIX_API_KEY")):
    phoenix_api_key = getpass("🔑 Enter your Phoenix API key: ")
os.environ["PHOENIX_API_KEY"] = phoenix_api_key

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

In [None]:
from phoenix.otel import register

tracer_provider = register(project_name="generating-datasets", auto_instrument=True)

In [None]:
import pandas as pd
from openai import OpenAI

openai_client = OpenAI()

# **Creating Synthetic Benchmark Datasets to Test Evaluators**

**Goal:**
Create a synthetic dataset that allows you to test the accuracy and coverage of your evaluator.

**Use Case:**
Feed the generated dataset into an LLM-as-a-Judge or other evaluator to ensure it correctly labels intent, identifies errors, and handles a variety of query types including edge cases and noisy inputs.

----

Synthetic data is especially useful when you want to stress-test evaluators such as an LLM-as-a-Judge across a wide range of scenarios. By generating examples systematically, you can cover straightforward cases, tricky edge cases, ambiguous queries, and noisy inputs, ensuring your evaluator captures different angles of behavior.

In this case, a strong synthetic dataset serves as a benchmark: a fixed, labeled reference against which you can evaluate and compare application changes.

In the example below, we generate customer support queries as JSON, each with a user query, a predicted intent, and a label marking that prediction as correct or incorrect. This dataset can then be used to check how well your evaluator judges classification correctness across varied cases.

In [None]:
generate_queries_template = """
Generate 30 synthetic customer support classification examples.
Ensure good coverage across intents (refund, order_status, product_info),
and include both correct and incorrect classifications.
Each entry should follow this JSON schema:

{
  "input": "string (the user query)",
  "output": "refund | order_status | product_info (the predicted intent)",
  "classification": "correct | incorrect"
}
Respond ONLY with a valid JSON array, no code fences, no extra text.
"""

In [None]:
resp = openai_client.chat.completions.create(
    model="gpt-4o-mini", messages=[{"role": "user", "content": generate_queries_template}]
)

In [None]:
import json

support_data = json.loads(resp.choices[0].message.content)
df_support_data = pd.DataFrame(support_data)
df_support_data.head()
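
Models sometimes wrap their reply in a code fence or drop a field despite the prompt's instructions, in which case `json.loads` either fails or yields rows that do not match the schema. The following is a minimal sketch of a more defensive parse step. The names `parse_examples`, `REQUIRED_KEYS`, `ALLOWED_INTENTS`, and `ALLOWED_LABELS` are hypothetical helpers written for this illustration, not part of Phoenix or the OpenAI SDK; the allowed values simply mirror the schema in the prompt above.

```python
import json

# Hypothetical helper names; the allowed values mirror the prompt's schema.
REQUIRED_KEYS = {"input", "output", "classification"}
ALLOWED_INTENTS = {"refund", "order_status", "product_info"}
ALLOWED_LABELS = {"correct", "incorrect"}
FENCE = "`" * 3  # a Markdown code-fence marker


def parse_examples(raw: str) -> list[dict]:
    """Parse a model reply into a list of schema-conforming records."""
    text = raw.strip()
    # Strip a surrounding code fence if the model added one anyway.
    if text.startswith(FENCE):
        lines = text.splitlines()[1:]
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]
        text = "\n".join(lines)
    records = json.loads(text)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array of examples")
    valid = []
    for record in records:
        # Skip rows that are not objects or that lack a required field.
        if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
            continue
        # Keep only rows whose labels fall inside the allowed vocabularies.
        if record["output"] in ALLOWED_INTENTS and record["classification"] in ALLOWED_LABELS:
            valid.append(record)
    return valid
```

Under these assumptions, `parse_examples(resp.choices[0].message.content)` could stand in for the bare `json.loads` call; comparing the number of returned rows with the number requested shows how many examples were discarded.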

## Upload Dataset

In [None]:
import phoenix as px

client = px.Client()

df = client.upload_dataset(
    dataframe=df_support_data,
    dataset_name="customer_support_queries",
    input_keys=["input"],
    output_keys=["output", "classification"],
)

## Example Usage: Test LLM Judge Effectiveness

In [None]:
llm_judge_template = """
You are an evaluator judging whether a model's classification of a customer support query is correct.
The possible classifications are: refund, order_status, product_info

Query: {input}
Model Prediction: {output}

Decide if the model's prediction is correct or incorrect.
Respond ONLY with one of: "correct" or "incorrect".
"""

In [None]:
from phoenix.evals import OpenAIModel, llm_classify


def task_function(input, reference):
    # Run the LLM judge on one dataset example: the query plus the predicted intent.
    response_classification = llm_classify(
        data=pd.DataFrame([{"input": input["input"], "output": reference["output"]}]),
        template=llm_judge_template,
        model=OpenAIModel(model="gpt-4.1"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    # llm_classify returns a DataFrame; take the judge's label for the single row.
    label = response_classification.iloc[0]["label"]
    return label


def evaluate_response(output, reference):
    # Score 1 when the judge's label matches the synthetic ground-truth label, else 0.
    expected_label = reference["classification"]
    predicted_label = output
    return 1 if expected_label == predicted_label else 0

In [None]:
from phoenix.experiments import run_experiment

initial_experiment = run_experiment(
    df, task=task_function, evaluators=[evaluate_response], experiment_name="evaluator performance"
)
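
The evaluator above scores each example as 1 or 0, which yields an overall agreement rate but does not show in which direction the judge errs. The sketch below, using only the standard library, tallies expected and predicted labels into a small confusion table. `summarize_agreement` is a hypothetical helper, and the label lists are illustrative placeholders rather than real experiment results; how you pull the labels out of an experiment depends on your Phoenix version and is not shown here.

```python
from collections import Counter


def summarize_agreement(expected: list[str], predicted: list[str]) -> dict:
    """Compare judge labels against the synthetic ground-truth labels."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    # Count each (expected, predicted) pair: a small confusion table.
    confusion = Counter(zip(expected, predicted))
    matches = sum(n for (exp, pred), n in confusion.items() if exp == pred)
    accuracy = matches / len(expected) if expected else 0.0
    return {"accuracy": accuracy, "confusion": dict(confusion)}


# Illustrative labels only, not real experiment results.
expected_labels = ["correct", "incorrect", "correct", "incorrect"]
judge_labels = ["correct", "correct", "correct", "incorrect"]
summary = summarize_agreement(expected_labels, judge_labels)
print(summary["accuracy"])
print(summary["confusion"])
```

A high count for the pair `("incorrect", "correct")` would suggest the judge is too lenient, while a high count for `("correct", "incorrect")` would suggest it is too strict.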

# **Using Few Shot Examples for Synthetic Dataset Generation**

**Goal:**
Guide the LLM to generate synthetic examples that reflect different types of queries and scenarios while maintaining consistent labeling and structure. This approach allows for more customization and higher-quality examples in your dataset. Your few-shot examples can be real data as well.

-----

Few-shot prompting allows you to guide an LLM by showing a handful of examples, which helps produce more consistent and realistic outputs. In this approach, we provide a few labeled customer queries with their intents and sample responses, and ask the model to generate additional examples in the same format.

This is particularly useful for testing evaluators such as an LLM-as-a-Judge, because it ensures the synthetic dataset reflects patterns, labels, and structures the evaluator is expected to handle. By controlling the examples in the prompt, you can produce a dataset that covers a variety of scenarios, including tricky or ambiguous queries, to check whether your evaluator captures different angles of behavior.
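
Hand-written JSON inside a prompt string is easy to get subtly wrong, for example with an unbalanced quote, and the model may then imitate the mistake. One option is to keep the seed examples as Python dictionaries and render them with `json.dumps`. The sketch below assumes this approach; `SEED_EXAMPLES` and `build_few_shot_prompt` are hypothetical names, and the seed records are shortened placeholders rather than the notebook's own examples.

```python
import json

# Hypothetical seed examples; replace with real or hand-written records.
SEED_EXAMPLES = [
    {
        "user_query": "How do I return these sneakers?",
        "intent": "refund",
        "response": "Happy to help with that return. Could you share your order number?",
        "classification": "correct",
    },
    {
        "user_query": "Where is my package?",
        "intent": "refund",
        "response": "Let me check on that. Can you give me the tracking number?",
        "classification": "incorrect",
    },
]


def build_few_shot_prompt(examples: list[dict], n_new: int) -> str:
    """Render seed examples as valid JSON inside a generation prompt."""
    parts = [
        "Generate synthetic customer support classification examples.",
        "Here are some examples of synthetic customer queries and labels:",
        "",
    ]
    for i, example in enumerate(examples, start=1):
        parts.append(f"Example {i}:")
        # json.dumps guarantees balanced quotes and braces in each example.
        parts.append(json.dumps(example, indent=2, ensure_ascii=False))
        parts.append("")
    parts.append(f"Now generate {n_new} new examples in the same format.")
    parts.append("Respond ONLY with a valid JSON array, no code fences, no extra text.")
    return "\n".join(parts)


prompt = build_few_shot_prompt(SEED_EXAMPLES, n_new=25)
```

Because the examples live in a list, swapping in real data or adding a new scenario only means editing `SEED_EXAMPLES`, not the prompt text.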

In [None]:
few_shot_prompt = """
Generate synthetic customer support classification examples.
Ensure good coverage across intents (refund, order_status, product_info),
and include both correct and incorrect classifications.
Here are some examples of synthetic customer queries and labels:

Example 1:
{
  "user_query": "Ughhh I bought sneakers that squeak louder than a rubber duck... how do I return these?",
  "intent": "refund",
  "response": "Oh no, squeaky shoes aren’t fun! Let’s get that return started. Could you share your order number?",
  "classification": "correct"
}

Example 2:
{
  "user_query": "My package has been saying 'out for delivery' since last Tuesday… did it decide to take a vacation? Is it actually going to show up?",
  "intent": "refund",
  "response": "Looks like your package is taking its sweet time. Let me check where it’s stuck; can you give me the tracking number?",
  "classification": "incorrect"
}


Example 3:
{
  "user_query": "Thinking about upgrading my blender… does your new model actually crush ice?",
  "intent": "product_info",
  "response": "Haha our blender keeps its promises! It can definitely crush ice. Would you like more details on the specs?",
  "classification": "correct"
}

Now generate 25 new examples in the same format, keeping the responses friendly.
Respond ONLY with a valid JSON array, no code fences, no extra text.
"""

In [None]:
resp = openai_client.chat.completions.create(
    model="gpt-4o-mini", messages=[{"role": "user", "content": few_shot_prompt}]
)

In [None]:
few_shot_data = json.loads(resp.choices[0].message.content)
few_shot_df = pd.DataFrame(few_shot_data)
few_shot_df.head()

## Upload Dataset

In [None]:
df = client.upload_dataset(
    dataframe=few_shot_df,
    dataset_name="customer_support_queries_few_shot",
    input_keys=["user_query"],
    output_keys=["intent", "response", "classification"],
)

## Example Usage: Test LLM Judge Effectiveness

In [None]:
llm_judge_template = """
You are an evaluator judging whether a model's classification of a customer support query is correct.
The possible classifications are: refund, order_status, product_info

Query: {query}
Model Prediction: {intent}

Decide if the model's prediction is correct or incorrect.
Respond ONLY with one of: "correct" or "incorrect".
"""

In [None]:


def task_function(input, reference):
    # Run the LLM judge on one dataset example: the user query plus the labeled intent.
    response_classification = llm_classify(
        data=pd.DataFrame([{"query": input["user_query"], "intent": reference["intent"]}]),
        template=llm_judge_template,
        model=OpenAIModel(model="gpt-4.1"),
        rails=["correct", "incorrect"],
        provide_explanation=True,
    )
    # llm_classify returns a DataFrame; take the judge's label for the single row.
    label = response_classification.iloc[0]["label"]
    return label


def evaluate_response(output, reference):
    # Score 1 when the judge's label matches the synthetic ground-truth label, else 0.
    expected_label = reference["classification"]
    predicted_label = output
    return 1 if expected_label == predicted_label else 0

In [None]:
from phoenix.experiments import run_experiment

initial_experiment = run_experiment(
    df, task=task_function, evaluators=[evaluate_response], experiment_name="evaluator performance"
)

# **Creating Synthetic Datasets for Agents**

**Goal:**
Build synthetic test data that captures a wide range of queries to evaluate an agent’s reliability and safety.

**Use Case:**
Test how an agent handles in-scope requests, refuses out-of-scope queries, and manages edge cases, adversarial inputs, and noisy data.

------

When creating synthetic datasets, first define the agent’s capabilities and boundaries (tools, in-scope vs. out-of-scope). Then organize queries into categories to ensure balanced coverage:

1. Happy-path: simple, common requests
2. Complex: multi-step or reasoning-heavy
3. Adversarial / refusal: out-of-scope or unsafe
4. Edge cases: ambiguous or incomplete inputs
5. Noise: typos, slang, multilingual

This structure makes it easier to stress-test the agent across realistic scenarios and confirm it behaves consistently.

**Why This Approach?**

This structure ensures comprehensive evaluation (core tasks, edge conditions, and safety) and systematic coverage (no major scenario overlooked). By simulating a wide range of real-world interactions, you can validate that the agent is reliable, robust, and safe.
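
One way to make the balanced coverage described above explicit is to decide up front how many examples each category should receive, and then state those counts in the generation prompt. The sketch below is an assumption about how you might do that. `allocate_quota` is a hypothetical helper, the total of 12 is an arbitrary illustration, and the category slugs match the schema used in the prompt in this section.

```python
CATEGORIES = ["happy_path", "multi_step", "edge_case", "adversarial", "noise"]


def allocate_quota(total: int, categories: list[str]) -> dict[str, int]:
    """Split `total` examples as evenly as possible across categories."""
    if total < len(categories):
        raise ValueError("total must be at least the number of categories")
    base, remainder = divmod(total, len(categories))
    # The first `remainder` categories receive one extra example.
    return {
        name: base + (1 if index < remainder else 0)
        for index, name in enumerate(categories)
    }


quota = allocate_quota(12, CATEGORIES)
# Render the quota as bullet lines that could be pasted into a prompt.
quota_lines = "\n".join(f"- {name}: {count} examples" for name, count in quota.items())
print(quota_lines)
```

Stating exact per-category counts leaves less to the model's interpretation than a phrase such as "a few from each category", though the model may still deviate, so the output is worth checking.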

In [None]:
AGENT_DATASET_PROMPT = """
You are helping me create a synthetic test dataset for evaluating an AI agent.
The agent has the following capabilities:
- search products, compare items, track orders, answer shipping questions

The dataset should cover a wide variety of use cases, not just the “happy path.”
Generate realistic **user queries**, grouped into categories:

1. **Happy-path**: straightforward, common use cases where the agent should succeed.
2. **Complex / multi-step**: queries requiring reasoning, multiple steps, or tool calls.
3. **Edge cases**: ambiguous requests, incomplete info, or queries with constraints.
4. **Adversarial / refusal**: queries that are out-of-scope or unsafe (where the agent should refuse or fallback).
5. **Noise / robustness**: queries with typos, slang, or in multiple languages.

For each example, return JSON with this schema:
{
  "category": "happy_path | multi_step | edge_case | adversarial | noise",
  "query": "string (the user’s input)",
  "expected_action": "string (the tool, behavior, or refusal the agent should take)",
  "expected_outcome": "string (what a correct response would look like at a high level)"
}

Generate **10 examples total**, ensuring at least one from each category.
The queries should be diverse, realistic, and not repetitive.

Respond ONLY with a valid JSON array, no code fences, no extra text.
"""

In [None]:
resp = openai_client.chat.completions.create(
    model="gpt-4o-mini", messages=[{"role": "user", "content": AGENT_DATASET_PROMPT}]
)

In [None]:
agent_data = json.loads(resp.choices[0].message.content)
agent_data_df = pd.DataFrame(agent_data)
agent_data_df.head()
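
Before uploading, it can help to confirm that the generated rows actually cover every category, since the model may skip one or invent a label outside the schema. The following is a minimal sketch using only the standard library. `coverage_report` and `EXPECTED_CATEGORIES` are hypothetical names, and the sample records are placeholders rather than real model output.

```python
from collections import Counter

# Category slugs taken from the schema in the generation prompt.
EXPECTED_CATEGORIES = {"happy_path", "multi_step", "edge_case", "adversarial", "noise"}


def coverage_report(records: list[dict], minimum: int = 1) -> dict:
    """Count examples per category and flag gaps or unexpected labels."""
    counts = Counter(record.get("category") for record in records)
    # Categories with fewer than `minimum` examples.
    under = sorted(c for c in EXPECTED_CATEGORIES if counts[c] < minimum)
    # Labels the model produced that are not in the schema.
    unknown = sorted(str(c) for c in counts if c not in EXPECTED_CATEGORIES)
    return {"counts": dict(counts), "underrepresented": under, "unknown": unknown}


# Placeholder records for illustration only.
sample = [
    {"category": "happy_path"},
    {"category": "noise"},
    {"category": "noise"},
    {"category": "refusal"},
]
report = coverage_report(sample, minimum=1)
print(report["underrepresented"])
print(report["unknown"])
```

With the notebook's data, `coverage_report(agent_data)` would run the same check on the parsed list; a non-empty `underrepresented` or `unknown` entry is a signal to regenerate or adjust the prompt.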

## Upload Dataset

In [None]:
df = client.upload_dataset(
    dataframe=agent_data_df,
    dataset_name="customer_support_agent",
    input_keys=["category", "query"],
    output_keys=["expected_action", "expected_outcome"],
)