<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Quickstart: Datasets and Experiments in Deno</h1>

Phoenix helps you run experiments over your AI and LLM applications to evaluate and iteratively improve their performance. This quickstart shows how to create a dataset, define a task and evaluators, and run an experiment using the JavaScript SDK in a Deno environment.

## Setup

Let's start by importing the necessary packages.

In [None]:
import { createClient } from "npm:@arizeai/phoenix-client@latest";
import { runExperiment, asEvaluator } from "npm:@arizeai/phoenix-client@latest/experiments";
import { createDataset } from "npm:@arizeai/phoenix-client@latest/datasets";
import { OpenAI } from "npm:openai";

Set up your OpenAI API key.

In [None]:
const openaiApiKey = prompt("Enter your OpenAI API key:");

if (!openaiApiKey) {
  console.error('Please enter your OpenAI API key to continue');
  Deno.exit(1);
}

const openai = new OpenAI({
  apiKey: openaiApiKey,
});

Initialize the Phoenix client. This quickstart assumes a Phoenix server is already running locally at http://localhost:6006.

In [None]:
const client = createClient();
console.log('Phoenix client initialized. Access Phoenix UI at http://localhost:6006');

## Creating a Dataset

Let's create examples for our dataset.

In [None]:
console.log('Creating dataset examples...');

// Define the examples as an array so later cells can reuse them
const examples = [
  {
    id: `example-1`,
    updatedAt: new Date(),
    input: { question: "What is Paul Graham known for?" },
    output: { answer: "Co-founding Y Combinator and writing on startups and technology." },
    metadata: { topic: "tech" }
  },
  {
    id: `example-2`,
    updatedAt: new Date(),
    input: { question: "What companies did Elon Musk found?" },
    output: { answer: "Tesla, SpaceX, Neuralink, The Boring Company, and co-founded PayPal." },
    metadata: { topic: "entrepreneurs" }
  },
  {
    id: `example-3`,
    updatedAt: new Date(),
    input: { question: "What is Moore's Law?" },
    output: { answer: "The observation that the number of transistors in a dense integrated circuit doubles about every two years." },
    metadata: { topic: "computing" }
  }
];

// Upload the examples to Phoenix as a new dataset and keep its ID
const { datasetId } = await createDataset({
  name: "quickstart-dataset",
  description: "Dataset for quickstart example",
  examples,
});

console.log(`Created ${examples.length} example(s)`);

await Deno.jupyter.md`
### Example dataset entries

\`\`\`json
${JSON.stringify(examples, null, 2)}
\`\`\`
`
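
Before uploading examples, it can help to check that each one carries the fields the task and evaluators read later (`input.question` and `output.answer`). The following is a minimal sketch of such a check. The `QuickstartExample` interface and the `findExampleProblems` helper are hypothetical names introduced for illustration and are not part of the Phoenix client.

```typescript
// Hypothetical shape for this quickstart's examples; not a type exported by the Phoenix client.
interface QuickstartExample {
  input: { question?: unknown };
  output: { answer?: unknown };
  metadata?: Record<string, unknown>;
}

// Returns a list of human-readable problems; an empty list means every example looks usable.
function findExampleProblems(examples: QuickstartExample[]): string[] {
  const problems: string[] = [];
  examples.forEach((example, index) => {
    const question = example.input.question;
    const answer = example.output.answer;
    if (typeof question !== "string" || question.trim() === "") {
      problems.push(`example ${index}: input.question must be a non-empty string`);
    }
    if (typeof answer !== "string" || answer.trim() === "") {
      problems.push(`example ${index}: output.answer must be a non-empty string`);
    }
  });
  return problems;
}

const sampleExamples: QuickstartExample[] = [
  {
    input: { question: "What is Moore's Law?" },
    output: { answer: "Transistor counts double about every two years." },
  },
  // Deliberately broken: the question is empty
  { input: { question: "" }, output: { answer: "An answer without a question." } },
];

console.log(findExampleProblems(sampleExamples));
```

Running a check like this before `createDataset` surfaces malformed examples early, rather than as confusing evaluator scores later.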

## Defining the Task

Define the task function that will be evaluated.

In [None]:
const taskPromptTemplate = "Answer in a few words: {question}";

const task = async (example) => {
  // Fall back to a placeholder if the example has no question
  const question = example.input.question || "No question provided";
  const messageContent = taskPromptTemplate.replace('{question}', question);
  
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: messageContent }]
  });
  
  return response.choices[0]?.message?.content || "";
};

// Test the task on one example
const testResult = await task(examples[0]);
console.log("Test task result:", testResult);
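
The task fills its prompt template with `String.prototype.replace` and a string replacement. When the replacement is a string, JavaScript interprets sequences such as `$&` and `$$` specially, so a question containing a dollar sign could be altered. A minimal sketch of a safer alternative follows: the hypothetical `fillTemplate` helper (not part of any library used here) passes a replacer function, whose return value is inserted literally.

```typescript
// Hypothetical helper: fills {name} placeholders from a map of values.
// A replacer function is used so that "$" sequences in the values are inserted literally.
function fillTemplate(template: string, values: Record<string, string>): string {
  return template.replace(/\{(\w+)\}/g, (placeholder: string, name: string) => {
    const value = values[name];
    // Leave unknown placeholders untouched so missing values are easy to spot
    return value !== undefined ? value : placeholder;
  });
}

const filled = fillTemplate("Answer in a few words: {question}", {
  question: "What does $& mean in a replacement string?",
});
console.log(filled);
```

Because the template is scanned in a single pass, text inserted for one placeholder is never re-scanned for other placeholders.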

## Creating Evaluators

Let's define evaluators for our experiment.

In [None]:
// 1. Code-based evaluator that checks if response contains specific keywords
const containsKeyword = asEvaluator({
  name: "contains_keyword",
  kind: "CODE",
  evaluate: async ({ output }) => {
    const keywords = ["Y Combinator", "YC"];
    // Compare case-insensitively by lowercasing both sides once
    const outputStr = String(output).toLowerCase();
    const contains = keywords.some(keyword =>
      outputStr.includes(keyword.toLowerCase())
    );
    
    return {
      score: contains ? 1.0 : 0.0,
      label: contains ? "contains_keyword" : "missing_keyword",
      metadata: { keywords },
      explanation: contains ? 
        `Output contains one of the keywords: ${keywords.join(", ")}` : 
        `Output does not contain any of the keywords: ${keywords.join(", ")}`
    };
  }
});

console.log("Created 'contains_keyword' evaluator");
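
The keyword check is a case-insensitive substring match, which means a short keyword such as "YC" also matches inside longer words like "NYC". The sketch below reproduces that matching logic outside the evaluator so it can be tried on sample outputs, and contrasts it with a whole-word variant. Both helper functions are illustrative names, not part of the Phoenix client.

```typescript
const sampleKeywords = ["Y Combinator", "YC"];

// Same logic as the evaluator: case-insensitive substring match
function containsAnyKeyword(output: unknown, keywords: string[]): boolean {
  const text = String(output).toLowerCase();
  return keywords.some((keyword) => text.indexOf(keyword.toLowerCase()) !== -1);
}

// Escapes characters that have special meaning in a regular expression
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Stricter variant: the keyword must appear as a whole word
function containsKeywordAsWord(output: unknown, keywords: string[]): boolean {
  const text = String(output);
  return keywords.some((keyword) =>
    new RegExp(`\\b${escapeRegExp(keyword)}\\b`, "i").test(text)
  );
}

console.log(containsAnyKeyword("He co-founded Y Combinator.", sampleKeywords)); // true
console.log(containsAnyKeyword("He lives in NYC.", sampleKeywords)); // true (substring false positive)
console.log(containsKeywordAsWord("He lives in NYC.", sampleKeywords)); // false
```

Whether the stricter variant is preferable depends on the keywords; for this quickstart the substring match is usually sufficient.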

In [None]:
// 2. LLM-based evaluator for conciseness
const conciseness = asEvaluator({
  name: "conciseness",
  kind: "LLM", 
  evaluate: async ({ output }) => {
    const prompt = `
      Rate the following text on a scale of 0.0 to 1.0 for conciseness (where 1.0 is perfectly concise).
      
      TEXT: ${output}
      
      Return only a number between 0.0 and 1.0.
    `;
    
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: prompt }]
    });
    
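    const scoreText = response.choices[0]?.message?.content?.trim() || "0";
    const parsedScore = parseFloat(scoreText);
    // Fall back to a neutral 0.5 when the model's reply is not a number,
    // and use the same value for the score, label, and explanation
    const score = isNaN(parsedScore) ? 0.5 : parsedScore;
    
    return {
      score,
      label: score > 0.7 ? "concise" : "verbose",
      metadata: {},
      explanation: `Conciseness score: ${score}`
    };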
  }
});

console.log("Created 'conciseness' evaluator");
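
LLM replies do not always follow the "return only a number" instruction; a model may answer with "Score: 0.8" or with a value outside the requested range. As a rough sketch under those assumptions, the hypothetical `parseScore` helper below extracts the first number in the reply, clamps it to the 0.0 to 1.0 range, and falls back to a neutral default when no number is present.

```typescript
// Hypothetical helper: pulls the first number out of a model reply and clamps it to [0, 1].
function parseScore(text: string, fallback: number = 0.5): number {
  // Matches forms such as "0.8", "1", ".75", and "-0.2"
  const match = text.match(/-?(?:\d+\.?\d*|\.\d+)/);
  if (!match) return fallback;
  const value = Number.parseFloat(match[0]);
  if (Number.isNaN(value)) return fallback;
  return Math.min(1, Math.max(0, value));
}

console.log(parseScore("0.8"));
console.log(parseScore("Score: 0.65 out of 1"));
console.log(parseScore("very concise"));
console.log(parseScore("7"));
```

This is deliberately simple: a reply that mentions an unrelated number before the score would be misread, so keeping the prompt strict remains the first line of defense.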

In [None]:
// 3. Custom Jaccard similarity evaluator
const jaccardSimilarity = asEvaluator({
  name: "jaccard_similarity",
  kind: "CODE",
  evaluate: async ({ output, expected }) => {
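    // Split on any whitespace and drop empty tokens so blank text yields an empty set
    const tokenize = (text) => new Set(String(text).toLowerCase().split(/\s+/).filter(Boolean));
    const actualWords = tokenize(output);
    const expectedAnswer = expected?.answer || "";
    const expectedWords = tokenize(expectedAnswer);
    
    const wordsInCommon = new Set(
      [...actualWords].filter(word => expectedWords.has(word))
    );
    
    const allWords = new Set([...actualWords, ...expectedWords]);
    // Avoid dividing by zero when both texts are empty
    const score = allWords.size === 0 ? 0 : wordsInCommon.size / allWords.size;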
    
    return {
      score,
      label: score > 0.5 ? "similar" : "dissimilar",
      metadata: { 
        actualWordsCount: actualWords.size,
        expectedWordsCount: expectedWords.size,
        commonWordsCount: wordsInCommon.size,
        allWordsCount: allWords.size
      },
      explanation: `Jaccard similarity: ${score}`
    };
  }
});

console.log("Created 'jaccard_similarity' evaluator");
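
Jaccard similarity is the size of the intersection of two word sets divided by the size of their union. For "the cat sat" and "the cat ran", the shared words are "the" and "cat" (2 words) and the union is "the", "cat", "sat", "ran" (4 words), giving 2 / 4 = 0.5. The standalone sketch below works through that calculation; the `tokenize` and `jaccard` helpers are illustrative and independent of the evaluator above.

```typescript
// Lowercases the text, splits on whitespace, and drops empty tokens
function tokenize(text: string): Set<string> {
  return new Set(text.toLowerCase().split(/\s+/).filter((word) => word.length > 0));
}

// Jaccard similarity: |A intersect B| / |A union B|, defined here as 0 when both sets are empty
function jaccard(a: string, b: string): number {
  const setA = tokenize(a);
  const setB = tokenize(b);
  const union = new Set(Array.from(setA).concat(Array.from(setB)));
  if (union.size === 0) return 0;
  let common = 0;
  setA.forEach((word) => {
    if (setB.has(word)) common += 1;
  });
  return common / union.size;
}

console.log(jaccard("the cat sat", "the cat ran")); // 0.5
console.log(jaccard("Y Combinator", "y combinator")); // 1
```

Because the score depends on exact word overlap, short model answers compared against long reference answers tend to score low even when they are correct, which is one reason to pair this metric with an LLM-based accuracy check.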

In [None]:
// 4. LLM-based accuracy evaluator
const accuracy = asEvaluator({
  name: "accuracy",
  kind: "LLM",
  evaluate: async ({ input, output, expected }) => {
    // Fall back to placeholders if the question or reference answer is missing
    const question = input.question || "No question provided";
    const referenceAnswer = expected?.answer || "No reference answer provided";
    
    const evalPromptTemplate = `
      Given the QUESTION and REFERENCE_ANSWER, determine whether the ANSWER is accurate.
      Output only a single word (accurate or inaccurate).
      
      QUESTION: {question}
      
      REFERENCE_ANSWER: {reference_answer}
      
      ANSWER: {answer}
      
      ACCURACY (accurate / inaccurate):
    `;
    
    const messageContent = evalPromptTemplate
      .replace('{question}', question)
      .replace('{reference_answer}', referenceAnswer)
      .replace('{answer}', String(output));
    
    const response = await openai.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: messageContent }]
    });
    
    const responseContent = response.choices[0]?.message?.content?.toLowerCase().trim() || "";
    const isAccurate = responseContent === "accurate";
    
    return {
      score: isAccurate ? 1.0 : 0.0,
      label: isAccurate ? "accurate" : "inaccurate",
      metadata: {},
      explanation: `LLM determined the answer is ${isAccurate ? "accurate" : "inaccurate"}`
    };
  }
});

console.log("Created 'accuracy' evaluator");
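
The accuracy evaluator compares the model's reply to the exact string "accurate", so a reply such as "Accurate." (with a trailing period) would be scored as inaccurate. One possible way to make the parsing more tolerant is sketched below. The `parseVerdict` helper and the `Verdict` type are hypothetical names introduced for this illustration.

```typescript
type Verdict = "accurate" | "inaccurate" | "unknown";

// Hypothetical helper: reduces a free-form model reply to one of three verdicts.
function parseVerdict(raw: string): Verdict {
  // Keep only alphabetic words so punctuation and casing do not matter
  const words = new Set<string>(raw.toLowerCase().match(/[a-z]+/g) ?? []);
  // Check "inaccurate" first; as a whole word it is distinct from "accurate"
  if (words.has("inaccurate")) return "inaccurate";
  if (words.has("accurate")) return "accurate";
  return "unknown";
}

console.log(parseVerdict("Accurate."));
console.log(parseVerdict("ACCURACY: inaccurate"));
console.log(parseVerdict("I cannot tell."));
```

This sketch does not understand negation (for example, "not accurate" would be read as accurate), so it is only suitable when the prompt constrains the model to a single-word answer, as the prompt above does. Returning "unknown" separately makes it possible to distinguish a malformed reply from a genuine "inaccurate" verdict.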

## Running the Experiment

Now let's run the experiment with our defined evaluators.

In [None]:
console.log('Running initial experiment...');

// Reference the dataset created above by its ID
const experiment = await runExperiment({
  client,
  experimentName: "initial-experiment",
  dataset: {
    datasetId,
  },
  task,
  evaluators: [jaccardSimilarity, accuracy],
  logger: console,
});

console.log('Initial experiment completed with ID:', experiment.id);

await Deno.jupyter.md`
### Initial experiment results

Experiment ID: \`${experiment.id}\`
`
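
Once an experiment has run, a common next step is to summarize each evaluator's scores across all examples. The sketch below computes a mean score per evaluator from a flat list of records. The `EvaluationRecord` shape is an assumption made for illustration; the objects returned by the Phoenix client may be structured differently, so results would need to be mapped into this shape first.

```typescript
// Hypothetical flattened record; not the Phoenix client's actual result type.
interface EvaluationRecord {
  evaluator: string;
  score: number | null; // null when an evaluator produced no score
}

// Averages scores per evaluator name, skipping records without a score
function meanScoreByEvaluator(records: EvaluationRecord[]): Record<string, number> {
  const totals = new Map<string, { sum: number; count: number }>();
  for (const record of records) {
    if (record.score === null) continue;
    const entry = totals.get(record.evaluator) ?? { sum: 0, count: 0 };
    entry.sum += record.score;
    entry.count += 1;
    totals.set(record.evaluator, entry);
  }
  const means: Record<string, number> = {};
  totals.forEach((entry, name) => {
    means[name] = entry.sum / entry.count;
  });
  return means;
}

// Made-up scores, used only to exercise the function
const sampleRecords: EvaluationRecord[] = [
  { evaluator: "jaccard_similarity", score: 0.25 },
  { evaluator: "jaccard_similarity", score: 0.75 },
  { evaluator: "accuracy", score: 1 },
  { evaluator: "accuracy", score: 0 },
  { evaluator: "accuracy", score: null },
];

console.log(meanScoreByEvaluator(sampleRecords));
```

Comparing these per-evaluator means across experiment runs gives a quick signal of whether a prompt or model change helped.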

## Running Additional Evaluators

Let's run more evaluators on the same experiment.

In [None]:
console.log('Running additional evaluators...');

const updatedExperiment = await runExperiment({
  client,
  experimentName: experiment.id, // Name this run after the initial experiment's ID
  dataset: {
    datasetId,
  }, // Reference the same dataset by ID
  task: async () => "", // No-op task since we're just evaluating
  evaluators: [containsKeyword, conciseness],
  logger: console
});

console.log('Additional evaluations completed');
console.log('Experiment ID:', updatedExperiment.id);

await Deno.jupyter.md`
### Final experiment results

The experiment has been updated with additional evaluators. View the complete results in the Phoenix UI:

http://localhost:6006

Experiment ID: \`${updatedExperiment.id}\`
`

## Conclusion

You've successfully:
1. Created a dataset with examples
2. Defined a task using the OpenAI API
3. Created multiple evaluators using both code-based and LLM-based approaches
4. Run an experiment and evaluated the results
5. Added additional evaluators to the experiment

You can now explore the results in the Phoenix UI and iterate on your experiments to improve performance.