<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Quickstart: Datasets and Experiments</h1>

Phoenix helps you run experiments over your AI and LLM applications to evaluate and iteratively improve their performance. This quickstart walks through uploading a dataset, defining a task and evaluators, and running an experiment.

## Setup

Install Phoenix.

In [None]:
!pip install "arize-phoenix[evals]" openai 'httpx<0.28'

Launch Phoenix.

In [None]:
import phoenix as px

px.launch_app().view()

Set your OpenAI API key.

In [None]:
import os
from getpass import getpass

import openai

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

Initialize a Phoenix client. The client is the main entry point for interacting with the Phoenix API
and can be installed independently of Phoenix itself. This notebook uses the asynchronous client, so its calls are awaited.

In [None]:
from phoenix.client import AsyncClient

px_client = AsyncClient()

## Datasets

Upload a dataset.

In [None]:
dataset = await px_client.datasets.create_dataset(
    name="experiment-quickstart-dataset",
    inputs=[{"question": "What is Paul Graham known for?"}],
    outputs=[{"answer": "Co-founding Y Combinator and writing on startups and technology."}],
    metadata=[{"topic": "tech"}],
)
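The call above takes `inputs`, `outputs`, and `metadata` as separate lists. The following is a minimal sketch of splitting flat records into those three lists, assuming the lists are matched by position as the call suggests. The `records` variable and the `split_records` helper are hypothetical names used only for illustration, not part of the Phoenix API.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical flat records; the single record mirrors the example in this quickstart.
records: List[Dict[str, Any]] = [
    {
        "question": "What is Paul Graham known for?",
        "answer": "Co-founding Y Combinator and writing on startups and technology.",
        "topic": "tech",
    },
]


def split_records(
    rows: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Split flat records into parallel input, output, and metadata lists."""
    inputs = [{"question": row["question"]} for row in rows]
    outputs = [{"answer": row["answer"]} for row in rows]
    metadata = [{"topic": row["topic"]} for row in rows]
    return inputs, outputs, metadata


inputs, outputs, metadata = split_records(records)

# Assumption: entry i of each list describes the same example.
assert len(inputs) == len(outputs) == len(metadata)
```

The resulting lists could then be passed as the `inputs`, `outputs`, and `metadata` arguments shown in the cell above.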

## Tasks

Create a task to evaluate.

In [None]:
from typing import Any

from openai import OpenAI

openai_client = OpenAI()

task_prompt_template = "Answer in a few words: {question}"


def task(input: Any) -> str:
    question = input["question"]
    message_content = task_prompt_template.format(question=question)
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    return response.choices[0].message.content or ""
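
Before spending API calls, the prompt that the task sends can be checked offline. This sketch only renders the template from the cell above against the quickstart's example input; `example_input` is an illustrative name.

```python
task_prompt_template = "Answer in a few words: {question}"

# Illustrative input shaped like the dataset example's input field.
example_input = {"question": "What is Paul Graham known for?"}

rendered = task_prompt_template.format(question=example_input["question"])
print(rendered)
```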

## Evaluators

We can define evaluators as functions. If the function has only one argument, it will be called with the task output. Otherwise, evaluator functions can be defined with any combination of the following arguments:

- `input`: The input field of the dataset example
- `output`: The output of the task
- `expected`: The expected or reference output of the dataset example
- `reference`: An alias for `expected`
- `metadata`: Metadata associated with the dataset example
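
As a rough illustration of the argument names listed above, the sketch below defines two evaluators that each request a different combination of fields. The function names `exact_match` and `mentions_topic` are hypothetical, and the functions are called directly here with hand-written values instead of through an experiment.

```python
from typing import Any, Dict


def exact_match(output: str, expected: Dict[str, Any]) -> float:
    """Score 1.0 when the task output equals the reference answer, ignoring case."""
    return 1.0 if output.strip().lower() == expected["answer"].strip().lower() else 0.0


def mentions_topic(output: str, metadata: Dict[str, Any]) -> float:
    """Score 1.0 when the example's topic appears in the task output."""
    return 1.0 if metadata["topic"].lower() in output.lower() else 0.0


# Direct calls with hand-written values, for illustration only.
print(exact_match("co-founding y combinator", {"answer": "Co-founding Y Combinator"}))
print(mentions_topic("He writes about tech startups.", {"topic": "tech"}))
```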


In [None]:
def contains_keyword(output: str) -> float:
    keywords = ["Y Combinator", "YC"]
    output_lower = output.lower()
    return 1.0 if any(keyword.lower() in output_lower for keyword in keywords) else 0.0

In [None]:
from phoenix.evals.models import OpenAIModel
from phoenix.experiments.evaluators import ConcisenessEvaluator

model = OpenAIModel(model="gpt-4o")
conciseness = ConcisenessEvaluator(model=model)

In [None]:
from typing import Any, Dict


def jaccard_similarity(output: str, expected: Dict[str, Any]) -> float:
    # https://en.wikipedia.org/wiki/Jaccard_index
    actual_words = set(output.lower().split(" "))
    expected_words = set(expected["answer"].lower().split(" "))
    words_in_common = actual_words.intersection(expected_words)
    all_words = actual_words.union(expected_words)
    return len(words_in_common) / len(all_words)
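
To make the Jaccard score concrete, here is a small worked example that uses the same word-splitting logic on two hand-written strings. The `jaccard` helper and the sample `output` are illustrative and are not produced by the task above.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard index over the sets of space-separated, lowercased words."""
    a_words = set(a.lower().split(" "))
    b_words = set(b.lower().split(" "))
    return len(a_words & b_words) / len(a_words | b_words)


output = "Co-founding Y Combinator"
reference = "Co-founding Y Combinator and writing on startups and technology."

score = jaccard(output, reference)
print(score)  # 3 shared words / 8 distinct words = 0.375
```

Punctuation is not stripped, so `technology.` with its trailing period counts as a distinct word from `technology`.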

Evaluators can also use an LLM as the judge.

In [None]:
from phoenix.client.experiments import create_evaluator

eval_prompt_template = """
Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.
Output only a single word (accurate or inaccurate).

QUESTION: {question}

REFERENCE_ANSWER: {reference_answer}

ANSWER: {answer}

ACCURACY (accurate / inaccurate):
"""


@create_evaluator(kind="llm")  # need the decorator or the kind will default to "code"
def accuracy(input: Dict[str, Any], output: str, expected: Dict[str, Any]) -> float:
    message_content = eval_prompt_template.format(
        question=input["question"], reference_answer=expected["answer"], answer=output
    )
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": message_content}]
    )
    # The message content can be None, so fall back to an empty string before normalizing.
    response_message_content = (response.choices[0].message.content or "").lower().strip()
    return 1.0 if response_message_content == "accurate" else 0.0
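
The exact string comparison above scores a reply such as `Accurate.` as 0.0 because of the trailing period. A minimal sketch of a more tolerant parser follows; `parse_verdict` is a hypothetical helper, not part of Phoenix or the OpenAI SDK. It compares the whole first word instead of using a substring check, since "inaccurate" contains "accurate".

```python
import string


def parse_verdict(text: str) -> float:
    """Map an LLM reply to 1.0 (accurate) or 0.0 (anything else)."""
    words = text.lower().split()
    if not words:
        return 0.0
    # Strip surrounding punctuation such as a trailing period or Markdown asterisks.
    first_word = words[0].strip(string.punctuation)
    return 1.0 if first_word == "accurate" else 0.0


print(parse_verdict("Accurate."))
print(parse_verdict("inaccurate"))
```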

## Experiments

Run an experiment and evaluate the results.

In [None]:
experiment = await px_client.experiments.run_experiment(
    dataset=dataset,
    task=task,
    experiment_name="initial-experiment",
    evaluators=[jaccard_similarity, accuracy],
)

Run more evaluators after the fact.

In [None]:
experiment = await px_client.experiments.evaluate_experiment(
    experiment=experiment, evaluators=[contains_keyword, conciseness]
)

In [None]:
from phoenix.experiments.types import EvaluationResult


def show_evaluation_summary(exp: Any):
    metrics = {
        "contains_keyword": [],
        "jaccard_similarity": [],
        "accuracy": [],
        "conciseness": [],
    }

    for run in exp["evaluation_runs"]:
        score = None
        if isinstance(run.result, dict) and "score" in run.result:
            score = run.result["score"]
        elif isinstance(run.result, EvaluationResult):
            score = run.result.score

        metric_name = run.name.lower()

        if metric_name not in metrics:
            metrics[metric_name] = []

        # Skip runs without a score so the averages below do not fail on None.
        if score is not None:
            metrics[metric_name].append(score)

    print("📊 Evaluation Results:")

    if metrics["contains_keyword"]:
        avg = sum(metrics["contains_keyword"]) / len(metrics["contains_keyword"])
        print(f"  Contains Keyword: {avg:.3f} (n={len(metrics['contains_keyword'])})")

    if metrics["accuracy"]:
        avg = sum(metrics["accuracy"]) / len(metrics["accuracy"])
        print(f"  Accuracy: {avg:.3f} (n={len(metrics['accuracy'])})")

    if metrics["jaccard_similarity"]:
        avg = sum(metrics["jaccard_similarity"]) / len(metrics["jaccard_similarity"])
        print(f"  Jaccard Similarity: {avg:.3f} (n={len(metrics['jaccard_similarity'])})")

    if metrics["conciseness"]:
        avg = sum(metrics["conciseness"]) / len(metrics["conciseness"])
        print(f"  Conciseness: {avg:.3f} (n={len(metrics['conciseness'])})")


show_evaluation_summary(experiment)
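
The per-metric averaging above can also be written generically. The sketch below groups scores by evaluator name and skips missing scores. It operates on plain `(name, score)` pairs with made-up values, so `summarize_scores` and its input are illustrative and do not depend on the Phoenix result types.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple


def summarize_scores(
    pairs: Iterable[Tuple[str, Optional[float]]],
) -> Dict[str, Tuple[float, int]]:
    """Return {evaluator name: (mean score, count)}, ignoring missing scores."""
    grouped: Dict[str, List[float]] = defaultdict(list)
    for name, score in pairs:
        if score is not None:
            grouped[name.lower()].append(score)
    return {name: (sum(scores) / len(scores), len(scores)) for name, scores in grouped.items()}


# Made-up scores for illustration only.
summary = summarize_scores([("Accuracy", 1.0), ("accuracy", 0.0), ("conciseness", None)])
for name, (mean, count) in summary.items():
    print(f"  {name}: {mean:.3f} (n={count})")
```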

And iterate 🚀