<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Quickstart: Datasets and Experiments in Deno</h1>

Phoenix helps you run experiments over your AI and LLM applications to evaluate and iteratively improve their performance. This quickstart shows you how to get up and running quickly with the JavaScript SDK in a Deno environment.

## Setup

Let's start by importing the necessary packages.

In [None]:
import { createClient } from "npm:@arizeai/phoenix-client@latest";
import { runExperiment, asEvaluator, evaluateExperiment } from "npm:@arizeai/phoenix-client@latest/experiments";
import { createDataset } from "npm:@arizeai/phoenix-client@latest/datasets";
import { OpenAI } from "npm:openai";

Set up your OpenAI API key.

In [None]:
const openaiApiKey = prompt("Enter your OpenAI API key:");

if (!openaiApiKey) {
  console.error('Please enter your OpenAI API key to continue');
  Deno.exit(1);
}

const openai = new OpenAI({
  apiKey: openaiApiKey,
});

> **Note:** The code below only initializes the Phoenix client. You must have the Phoenix server running separately.
> See the [Docker deployment guide](https://arize.com/docs/phoenix/self-hosting/deployment-options/docker#docker) for information on how to start the Phoenix server.

In [3]:
const client = createClient();
console.log('Phoenix client initialized. Access Phoenix UI at http://localhost:6006');

Phoenix client initialized. Access Phoenix UI at http://localhost:6006


## Creating a Dataset

Let's create examples for our dataset.

In [4]:
console.log('Creating dataset examples...');

// Create examples directly as an array
const { datasetId } = await createDataset({
  name: `quickstart-dataset-${Date.now()}`,
  description: "Dataset for quickstart example",
  examples: [
  {
    id: `example-1`,
    updatedAt: new Date(),
    input: { question: "What is Paul Graham known for?" },
    output: { answer: "Co-founding Y Combinator and writing on startups and techology." },
    metadata: { topic: "tech" }
  },
  {
    id: `example-2`,
    updatedAt: new Date(),
    input: { question: "What companies did Elon Musk found?" },
    output: { answer: "Tesla, SpaceX, Neuralink, The Boring Company, and co-founded PayPal." },
    metadata: { topic: "entrepreneurs" }
  },
  {
    id: `example-3`,
    updatedAt: new Date(),
    input: { question: "What is Moore's Law?" },
    output: { answer: "The observation that the number of transistors in a dense integrated circuit doubles about every two years." },
    metadata: { topic: "computing" }
  }
]})

console.log(`examples created`);

Creating dataset examples...
examples created


## Defining the Task

Define the task function that will be evaluated.

In [5]:
import { type RunExperimentParams } from "npm:@arizeai/phoenix-client/experiments";

const taskPromptTemplate = "Answer in a few words: {question}";

const task: RunExperimentParams["task"] = async (example) => {
  const question = example.input.question || "No question provided";
  const messageContent = taskPromptTemplate.replace('{question}', question);
  
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: messageContent }]
  });
  
  return response.choices[0]?.message?.content || "";
};

## Creating Evaluators

Let's define evaluators for our experiment.

In [6]:
// 1. Code-based evaluator that checks if response contains specific keywords
const containsKeyword = asEvaluator({
  name: "contains_keyword",
  kind: "CODE",
  evaluate: async ({ output }) => {
    const keywords = ["Y Combinator", "YC"];
    const outputStr = String(output).toLowerCase();
    const contains = keywords.some(keyword => 
      outputStr.toLowerCase().includes(keyword.toLowerCase())
    );
    
    return {
      score: contains ? 1.0 : 0.0,
      label: contains ? "contains_keyword" : "missing_keyword",
      metadata: { keywords },
      explanation: contains ? 
        `Output contains one of the keywords: ${keywords.join(", ")}` : 
        `Output does not contain any of the keywords: ${keywords.join(", ")}`
    };
  }
});

console.log("Created 'contains_keyword' evaluator");

Created 'contains_keyword' evaluator


In [21]:
import { createClassificationEvaluator } from "npm:@arizeai/phoenix-evals@latest";
import { openai } from "npm:@ai-sdk/openai@latest";

// 2. LLM-based evaluator for conciseness
const concisenessEval = createClassificationEvaluator({
    model: openai("gpt-4o"),
    promptTemplate: "Rate the following test as concise or verbose {{output}}",
    choices: { concise: 1, verbose: 0 }
});

const conciseness = asEvaluator({
  name: "conciseness",
  kind: "LLM", 
  evaluate: async ({ output }) => {
    const res = await concisenessEval.evaluate({ output });
    
    return {
      ...res,
      metadata: {},
    };
  }
});

console.log("Created 'conciseness' evaluator");

Created 'conciseness' evaluator


In [11]:
// 3. Custom Jaccard similarity evaluator
const jaccardSimilarity = asEvaluator({
  name: "jaccard_similarity",
  kind: "CODE",
  evaluate: async ({ output, expected }) => {
    const actualWords = new Set(String(output).toLowerCase().split(" "));
    const expectedAnswer = expected?.answer || "";
    const expectedWords = new Set(expectedAnswer.toLowerCase().split(" "));
    
    const wordsInCommon = new Set(
      [...actualWords].filter(word => expectedWords.has(word))
    );
    
    const allWords = new Set([...actualWords, ...expectedWords]);
    const score = wordsInCommon.size / allWords.size;
    
    return {
      score,
      label: score > 0.5 ? "similar" : "dissimilar",
      metadata: { 
        actualWordsCount: actualWords.size,
        expectedWordsCount: expectedWords.size,
        commonWordsCount: wordsInCommon.size,
        allWordsCount: allWords.size
      },
      explanation: `Jaccard similarity: ${score}`
    };
  }
});

console.log("Created 'jaccard_similarity' evaluator");

Created 'jaccard_similarity' evaluator


In [18]:
import { createClassificationEvaluator } from "npm:@arizeai/phoenix-evals@latest";
import { openai } from "npm:@ai-sdk/openai@latest";

const accuracyEval = createClassificationEvaluator({
    model: openai("gpt-4o"),
    promptTemplate: "Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate. <question>{{question}}</question><reference>{{referenceAnswer}}</reference><answer>{{referenceAnswer}}</answer>"
});

// 4. LLM-based accuracy evaluator
const accuracy = asEvaluator({
  name: "accuracy",
  kind: "LLM",
  evaluate: async ({ input, output, expected }) => {
    // Safely access question and answer with type assertions and fallbacks
    const question = input.question || "No question provided";
    const referenceAnswer = expected?.answer || "No reference answer provided";
    
    const res = await accuracyEval.evaluate({ question, referenceAnswer, answer: String(output) })
    return {
    ...res,
      metadata: {},
    };
  }
});

console.log("Created 'accuracy' evaluator");

Created 'accuracy' evaluator


## Running the Experiment

Now let's run the experiment with our defined evaluators.

In [19]:
console.log('Running initial experiment...');

// Pass dataset directly as the array of examples
const experiment = await runExperiment({
  client,
  experimentName: "simple-experiment",
  dataset: {
    datasetId,
  },
  task,
  evaluators: [jaccardSimilarity, accuracy],
  logger: console,
});

console.log('Initial experiment completed with ID:', experiment.id);

await Deno.jupyter.md`
### Initial experiment results

Experiment ID: \`${experiment.id}\`
`

Running initial experiment...


Error: @opentelemetry/api: Attempted duplicate registration of API: trace
    at registerGlobal (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/api/1.9.0/build/src/internal/global-utils.js:32:21)
    at TraceAPI.setGlobalTracerProvider (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/api/1.9.0/build/src/api/trace.js:54:59)
    at NodeTracerProvider.register (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/sdk-trace-base/1.30.1/build/src/BasicTracerProvider.js:101:21)
    at NodeTracerProvider.register (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/sdk-trace-node/1.30.1/build/src/NodeTracerProvider.js:43:15)
    at runExperiment (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@arizeai/phoenix-client/4.0.0/dist/esm/experiments/runExperiment.js:105:22)
    at eventLoopTick (ext:core/01_core.js:175:7)
    at async <anonymous>:3:

📊 View dataset: http://localhost:6006/datasets/RGF0YXNldDo3Mg==/examples
📺 View dataset experiments: http://localhost:6006/datasets/RGF0YXNldDo3Mg==/experiments
🔗 View this experiment: http://localhost:6006/datasets/RGF0YXNldDo3Mg==/compare?experimentId=RXhwZXJpbWVudDoxNDM=
🧪 Starting experiment "simple-experiment" on dataset "RGF0YXNldDo3Mg==" with task "task" and 2 evaluators and 5 concurrent runs
🔧 Running task "task" on dataset "RGF0YXNldDo3Mg=="
🔧 Running task "task" on example "RGF0YXNldEV4YW1wbGU6MjE4MDM= of dataset "RGF0YXNldDo3Mg=="
🔧 Running task "task" on example "RGF0YXNldEV4YW1wbGU6MjE4MDQ= of dataset "RGF0YXNldDo3Mg=="
🔧 Running task "task" on example "RGF0YXNldEV4YW1wbGU6MjE4MDU= of dataset "RGF0YXNldDo3Mg=="
✅ Task runs completed


Error: @opentelemetry/api: Attempted duplicate registration of API: trace
    at registerGlobal (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/api/1.9.0/build/src/internal/global-utils.js:32:21)
    at TraceAPI.setGlobalTracerProvider (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/api/1.9.0/build/src/api/trace.js:54:59)
    at NodeTracerProvider.register (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/sdk-trace-base/1.30.1/build/src/BasicTracerProvider.js:101:21)
    at NodeTracerProvider.register (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/sdk-trace-node/1.30.1/build/src/NodeTracerProvider.js:43:15)
    at evaluateExperiment (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@arizeai/phoenix-client/4.0.0/dist/esm/experiments/runExperiment.js:270:22)
    at runExperiment (file:///Users/mikeldking/Library/Caches/deno/npm/

🧠 Evaluating experiment "RXhwZXJpbWVudDoxNDM=" with 2 evaluators
🔗 View experiment evaluation: http://localhost:6006/datasets/RGF0YXNldDo3Mg==/compare?experimentId=RXhwZXJpbWVudDoxNDM=
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY2" with evaluator "jaccard_similarity"
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY3" with evaluator "jaccard_similarity"
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY4" with evaluator "jaccard_similarity"
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY2" with evaluator "accuracy"
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY3" with evaluator "accuracy"
✅ Evaluator "jaccard_similarity" on run "RXhwZXJpbWVudFJ1bjo0ODY2" completed
✅ Evaluator "jaccard_similarity" on run "RXhwZXJpbWVudFJ1bjo0ODY3" completed
✅ Evaluator "jaccard_similarity" on run "RXhwZXJpbWVudFJ1bjo0ODY4" completed


❌ Evaluator "accuracy" on run "RXhwZXJpbWVudFJ1bjo0ODY2" failed: Cannot convert undefined or null to object
❌ Evaluator "accuracy" on run "RXhwZXJpbWVudFJ1bjo0ODY3" failed: Cannot convert undefined or null to object


🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY4" with evaluator "accuracy"


❌ Evaluator "accuracy" on run "RXhwZXJpbWVudFJ1bjo0ODY4" failed: Cannot convert undefined or null to object


✅ Evaluation runs completed
✅ Experiment RXhwZXJpbWVudDoxNDM= completed
🔍 View results: http://localhost:6006/datasets/RGF0YXNldDo3Mg==/compare?experimentId=RXhwZXJpbWVudDoxNDM=
Initial experiment completed with ID: RXhwZXJpbWVudDoxNDM=



### Initial experiment results

Experiment ID: `RXhwZXJpbWVudDoxNDM=`


## Running Additional Evaluators

Let's run more evaluators on the same experiment.

In [22]:
console.log('Running additional evaluators...');

// Use evaluateExperiment to add evaluators to an existing experiment
const evaluation = await evaluateExperiment({
  client,
  experiment, // Use the existing experiment object
  evaluators: [containsKeyword, conciseness],
  logger: console
});

console.log('Additional evaluations completed');
console.log('Evaluation runs:', evaluation.runs.length);

await Deno.jupyter.md`
### Additional Evaluation Results

The experiment has been evaluated with additional evaluators. View the complete results in the Phoenix UI:

http://localhost:6006

Experiment ID: \`${experiment.id}\`
`

Running additional evaluators...


Error: @opentelemetry/api: Attempted duplicate registration of API: trace
    at registerGlobal (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/api/1.9.0/build/src/internal/global-utils.js:32:21)
    at TraceAPI.setGlobalTracerProvider (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/api/1.9.0/build/src/api/trace.js:54:59)
    at NodeTracerProvider.register (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/sdk-trace-base/1.30.1/build/src/BasicTracerProvider.js:101:21)
    at NodeTracerProvider.register (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@opentelemetry/sdk-trace-node/1.30.1/build/src/NodeTracerProvider.js:43:15)
    at evaluateExperiment (file:///Users/mikeldking/Library/Caches/deno/npm/registry.npmjs.org/@arizeai/phoenix-client/4.0.0/dist/esm/experiments/runExperiment.js:270:22)
    at <anonymous>:3:26
Error: @opentelemetry/api: Attempted duplicate 

🧠 Evaluating experiment "RXhwZXJpbWVudDoxNDM=" with 2 evaluators
🔗 View experiment evaluation: http://localhost:6006/datasets/RGF0YXNldDo3Mg==/compare?experimentId=RXhwZXJpbWVudDoxNDM=
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY2" with evaluator "contains_keyword"
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY3" with evaluator "contains_keyword"
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY4" with evaluator "contains_keyword"
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY2" with evaluator "conciseness"
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY3" with evaluator "conciseness"
✅ Evaluator "contains_keyword" on run "RXhwZXJpbWVudFJ1bjo0ODY2" completed
✅ Evaluator "contains_keyword" on run "RXhwZXJpbWVudFJ1bjo0ODY3" completed
✅ Evaluator "contains_keyword" on run "RXhwZXJpbWVudFJ1bjo0ODY4" completed
🧠 Evaluating run "RXhwZXJpbWVudFJ1bjo0ODY4" with evaluator "conciseness"
✅ Evaluator "conciseness" on run "RXhwZXJpbWVudFJ1bjo0ODY4" completed
✅ Evaluator "conciseness" on run "RXhwZXJpbWVudFJ1bjo0ODY3" complet


### Additional Evaluation Results

The experiment has been evaluated with additional evaluators. View the complete results in the Phoenix UI:

http://localhost:6006

Experiment ID: `RXhwZXJpbWVudDoxNDM=`


## Conclusion

You've successfully:
1. Created a dataset with examples
2. Defined a task using the OpenAI API
3. Created multiple evaluators using both code-based and LLM-based approaches
4. Run an experiment and evaluated the results
5. Added additional evaluators to the experiment

You can now explore the results in the Phoenix UI and iterate on your experiments to improve performance.