<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Quickstart: Datasets and Experiments</h1>

Phoenix helps you run experiments over your AI and LLM applications to evaluate and iteratively improve their performance. This quickstart walks through uploading a dataset, defining a task and evaluators, running an experiment, and adding more evaluations afterwards.

## Setup

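Install Phoenix and the OpenAI client, for example with `pip install arize-phoenix openai`.

Launch Phoenix, for example by calling `phoenix.launch_app()` from the notebook. The links in the outputs below assume a local server at `http://127.0.0.1:6006`.

Set your OpenAI API key.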

In [1]:
import os
from getpass import getpass

import openai

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

## Datasets

Upload a dataset.

In [10]:
import pandas as pd

import phoenix as px

df = pd.DataFrame(
    [
        {
            "question": "What is Paul Graham known for?",
            "answer": "Co-founding Y Combinator and writing on startups and technology.",
            "metadata": {"topic": "tech"},
        }
    ]
    * 200
)
phoenix_client = px.Client()
dataset = phoenix_client.upload_dataset(
    dataset_name="test-dataset-large",
    dataframe=df,
    input_keys=["question"],
    output_keys=["answer"],
    metadata_keys=["metadata"],
)

📤 Uploading dataset...
💾 Examples uploaded: http://127.0.0.1:6006/datasets/RGF0YXNldDo5/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246MTA=
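
The `input_keys`, `output_keys`, and `metadata_keys` arguments say which DataFrame columns belong to each part of an example. As a rough mental model only, each row can be thought of as being partitioned as in the sketch below. The helper `split_row` is hypothetical and is not part of the Phoenix API.

```python
from typing import Any, Dict, List


def split_row(
    row: Dict[str, Any],
    input_keys: List[str],
    output_keys: List[str],
    metadata_keys: List[str],
) -> Dict[str, Dict[str, Any]]:
    """Hypothetical helper: partition one row into input, output, and metadata."""
    return {
        "input": {key: row[key] for key in input_keys},
        "output": {key: row[key] for key in output_keys},
        "metadata": {key: row[key] for key in metadata_keys},
    }


row = {
    "question": "What is Paul Graham known for?",
    "answer": "Co-founding Y Combinator and writing on startups and technology.",
    "metadata": {"topic": "tech"},
}
example = split_row(row, ["question"], ["answer"], ["metadata"])
print(example["input"])  # {'question': 'What is Paul Graham known for?'}
```

This mirrors how the task later reads `input["question"]` and the evaluators read `expected["answer"]`.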


## Tasks

Create a task to evaluate.

In [11]:
from openai import OpenAI

openai_client = OpenAI()

task_prompt_template = "Answer in a few words: {question}"


def task(input) -> str:
    question = input["question"]
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content
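
A task is an ordinary callable that receives an example's input and returns the output to grade. Before spending API calls, it can help to dry-run the prompt construction with a stand-in for the model. The sketch below is illustrative only: `make_task` and `fake_complete` are hypothetical names, and the stub simply echoes the prompt it receives.

```python
from typing import Any, Callable, Dict

task_prompt_template = "Answer in a few words: {question}"


def make_task(complete: Callable[[str], str]) -> Callable[[Dict[str, Any]], str]:
    """Hypothetical factory: build a task around any prompt -> text function."""

    def task(input: Dict[str, Any]) -> str:
        prompt = task_prompt_template.format(question=input["question"])
        return complete(prompt)

    return task


def fake_complete(prompt: str) -> str:
    # Stand-in for the OpenAI call; returns the prompt so it can be inspected.
    return f"[stub] {prompt}"


dry_run_task = make_task(fake_complete)
print(dry_run_task({"question": "What is Paul Graham known for?"}))
# [stub] Answer in a few words: What is Paul Graham known for?
```

Swapping `fake_complete` for a function that calls the OpenAI client would give a task equivalent to the one above.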

## Evaluators

Use pre-built evaluators to grade task output with code...

In [12]:
from phoenix.experiments.evaluators import ContainsAnyKeyword

contains_keyword = ContainsAnyKeyword(keywords=["Y Combinator", "YC"])

Or use pre-built evaluators that grade with an LLM.

In [13]:
from phoenix.evals.models import OpenAIModel
from phoenix.experiments.evaluators import ConcisenessEvaluator

model = OpenAIModel(model="gpt-4o")
conciseness = ConcisenessEvaluator(model=model)
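
Conceptually, a keyword evaluator checks whether the task output mentions any of the given strings. The function below is a plain-Python sketch of that idea, not the `ContainsAnyKeyword` implementation. The name `contains_any` is hypothetical, and the case-insensitive matching is an assumption.

```python
from typing import Iterable


def contains_any(output: str, keywords: Iterable[str]) -> bool:
    """Return True if any keyword occurs in the output (case-insensitive, assumed)."""
    lowered = output.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


keywords = ["Y Combinator", "YC"]
print(contains_any("Co-founding Y Combinator.", keywords))  # True
print(contains_any("Writing essays about Lisp.", keywords))  # False
```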

Define custom evaluators with code...

In [14]:
from typing import Any, Dict


def jaccard_similarity(output: str, expected: Dict[str, Any]) -> float:
    # https://en.wikipedia.org/wiki/Jaccard_index
    actual_words = set(output.lower().split(" "))
    expected_words = set(expected["answer"].lower().split(" "))
    words_in_common = actual_words.intersection(expected_words)
    all_words = actual_words.union(expected_words)
    return len(words_in_common) / len(all_words)

Or define custom evaluators that call an LLM.

In [7]:
from phoenix.experiments.evaluators import create_evaluator

eval_prompt_template = """
Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.
Output only a single word (accurate or inaccurate).

QUESTION: {question}

REFERENCE_ANSWER: {reference_answer}

ANSWER: {answer}

ACCURACY (accurate / inaccurate):
"""


@create_evaluator(kind="llm")  # without the decorator, the kind defaults to "code"
def accuracy(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:
    message_content = eval_prompt_template.format(
        question=input["question"], reference_answer=expected["answer"], answer=output
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
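    # Normalize the reply so that "Accurate." or " accurate\n" still matches.
    reply = response.choices[0].message.content or ""
    response_message_content = reply.strip().lower().rstrip(".")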
    return 1.0 if response_message_content == "accurate" else 0.0
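
Both custom evaluators reduce to small pure functions, so their scoring logic can be checked offline before running an experiment. The sketch below reuses `jaccard_similarity` as defined above and isolates the label parsing from `accuracy` into a hypothetical helper, `label_to_score`. Stripping a trailing period is an assumption about how the model might format its reply.

```python
from typing import Any, Dict


def jaccard_similarity(output: str, expected: Dict[str, Any]) -> float:
    # Size of the word overlap divided by the size of the combined vocabulary.
    actual_words = set(output.lower().split(" "))
    expected_words = set(expected["answer"].lower().split(" "))
    words_in_common = actual_words.intersection(expected_words)
    all_words = actual_words.union(expected_words)
    return len(words_in_common) / len(all_words)


def label_to_score(label: str) -> float:
    """Hypothetical helper: map the LLM's one-word verdict to a score."""
    cleaned = label.strip().lower().rstrip(".")
    # Compare for equality: "inaccurate" contains "accurate" as a substring.
    return 1.0 if cleaned == "accurate" else 0.0


expected = {"answer": "Co-founding Y Combinator and writing on startups and technology."}
# 3 shared words out of 8 distinct words overall.
print(jaccard_similarity("Co-founding Y Combinator", expected))  # 0.375
print(label_to_score("Accurate."))   # 1.0
print(label_to_score("inaccurate"))  # 0.0
```

Note that splitting on spaces keeps punctuation attached, so `technology.` and `technology` count as different words.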

## Experiments

Run an experiment and evaluate the results.

In [8]:
from phoenix.experiments import run_experiment

experiment = run_experiment(
    dataset,
    task,
    experiment_name="initial-experiment",
    evaluators=[
        jaccard_similarity,
        accuracy,
    ],
)

🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


🧪 Experiment started.
📺 View dataset experiments: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/experiments
🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/compare?experimentId=RXhwZXJpbWVudDo0


running tasks |          | 0/1 (0.0%) | ⏳ 00:00<? | ?it/s


✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s


🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/compare?experimentId=RXhwZXJpbWVudDo0

Experiment Summary (06/08/25 02:52 PM -0700)
--------------------------------------------
| evaluator          |   n |   n_scores |   avg_score |
|:-------------------|----:|-----------:|------------:|
| accuracy           |   1 |          1 |    1        |
| jaccard_similarity |   1 |          1 |    0.294118 |

Tasks Summary (06/08/25 02:52 PM -0700)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|            1 |        1 |          0 |
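
Conceptually, an experiment runs the task on every dataset example, scores each output with every evaluator, and averages the scores per evaluator. The loop below is a simplified sketch of that flow under stated assumptions, not Phoenix's implementation: `run_toy_experiment`, `toy_task`, and `exact_match` are hypothetical, and a real run also records errors and links to the UI, as the output above shows.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

Example = Dict[str, Dict[str, Any]]
Evaluator = Callable[[str, Dict[str, Any]], float]


def run_toy_experiment(
    examples: List[Example],
    task: Callable[[Dict[str, Any]], str],
    evaluators: Dict[str, Evaluator],
) -> Dict[str, float]:
    """Run the task on each example, score each output, and average per evaluator."""
    scores: Dict[str, List[float]] = {name: [] for name in evaluators}
    for example in examples:
        output = task(example["input"])
        for name, evaluate in evaluators.items():
            scores[name].append(evaluate(output, example["expected"]))
    return {name: mean(values) for name, values in scores.items()}


def toy_task(input: Dict[str, Any]) -> str:
    # Stand-in for the LLM call: always gives the same answer.
    return "Co-founding Y Combinator"


def exact_match(output: str, expected: Dict[str, Any]) -> float:
    return 1.0 if output == expected["answer"] else 0.0


examples = [
    {"input": {"question": "What is Paul Graham known for?"},
     "expected": {"answer": "Co-founding Y Combinator"}},
    {"input": {"question": "Who wrote Hackers & Painters?"},
     "expected": {"answer": "Paul Graham"}},
]
summary = run_toy_experiment(examples, toy_task, {"exact_match": exact_match})
print(summary)  # {'exact_match': 0.5}
```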


Run more evaluators after the fact.

In [9]:
from phoenix.experiments import evaluate_experiment

experiment = evaluate_experiment(experiment, evaluators=[contains_keyword, conciseness])

🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


🧠 Evaluation started.


running experiment evaluations |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s


🔗 View this experiment: http://127.0.0.1:6006/datasets/RGF0YXNldDo4/compare?experimentId=RXhwZXJpbWVudDo0

Experiment Summary (06/08/25 02:52 PM -0700)
--------------------------------------------
| evaluator                           |   n |   n_scores |   avg_score |
|:------------------------------------|----:|-----------:|------------:|
| Conciseness                         |   1 |          1 |           1 |
| ContainsAny(['Y Combinator', 'YC']) |   1 |          1 |           1 |

Experiment Summary (06/08/25 02:52 PM -0700)
--------------------------------------------
| evaluator          |   n |   n_scores |   avg_score |
|:-------------------|----:|-----------:|------------:|
| accuracy           |   1 |          1 |    1        |
| jaccard_similarity |   1 |          1 |    0.294118 |

Tasks Summary (06/08/25 02:52 PM -0700)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|            1 |        1 |          0 |

And iterate 🚀