<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# Using Arize with Experiments

This guide shows how to use Arize to log and analyze prompt-iteration experiments for an LLM application. We'll build a simple prompt experimentation pipeline for a haiku generator. In this tutorial, you will:

*   Set up an Arize dataset

*   Implement a script that generates LLM outputs

*   Set up a function to evaluate the output using an LLM

*   Log the data in Arize to compare results across prompts

ℹ️ This notebook requires:
- An OpenAI API key
- An Arize Space ID, API key, and Developer Key (explained below)


# Setup Config



Copy your Arize developer key and Space ID from the Datasets page (shown below). The cell after the install step prompts for them, along with your Arize API key.

<center><img src="https://storage.googleapis.com/arize-assets/fixtures/dataset_api_key.png" width="700"></center>


In [None]:
!pip install 'arize[Datasets]' 'arize-phoenix[evals]' openai datasets pyarrow pydantic nest_asyncio pandas --quiet

In [None]:
from getpass import getpass
from uuid import uuid1

# Prompt for credentials so they are not stored in the notebook
api_key = getpass("🔑 Enter your Arize API key: ")
space_id = getpass("🔑 Enter your Arize Space ID: ")
developer_key = getpass("🔑 Enter your Arize Developer Key: ")
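
If you rerun the notebook often, typing each credential every time gets tedious. A minimal sketch of one alternative, assuming you are willing to keep credentials in environment variables: read the variable if it is set, and fall back to a prompt otherwise. The helper `read_secret` and the variable name `ARIZE_SPACE_ID_DEMO` are hypothetical and not part of the Arize client.

```python
import os
from getpass import getpass


def read_secret(env_var: str, prompt: str) -> str:
    """Return the value of `env_var` if it is set and non-empty, else prompt for it."""
    value = os.getenv(env_var)
    if value:
        return value
    return getpass(prompt)


# Demo only: with the variable set, no prompt appears.
# The variable name and value are placeholders, not real Arize settings.
os.environ["ARIZE_SPACE_ID_DEMO"] = "demo-space"
space_id_demo = read_secret("ARIZE_SPACE_ID_DEMO", "🔑 Enter your Arize Space ID: ")
```

The OpenAI key cell later in this guide follows the same check-then-prompt pattern.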

# Upload Dataset

Below, we create a dataframe of examples (here, a single haiku topic) and upload it to Arize as the dataset for your experiments.

In [None]:
# Setup Datasets client
import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE
arize_client = ArizeDatasetsClient(developer_key=developer_key, api_key=api_key)

# Create dataframe to upload
data = [{"topic": "Zebras"}]
df = pd.DataFrame(data)

# Create dataset in Arize
dataset_id = arize_client.create_dataset(
    dataset_name="haiku-topics-"+ str(uuid1())[:5],
    data=df,
    space_id=space_id,
    dataset_type=GENERATIVE
)
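
The dataframe above holds a single topic, which is enough to test the pipeline but not to compare prompts meaningfully. As a sketch, a larger dataframe could be built the same way before calling `create_dataset`. The topics below are made up for illustration; the only requirement taken from this guide is the `topic` column, which `create_haiku` reads later.

```python
import pandas as pd

# Hypothetical topics, chosen only for illustration.
topics = ["Zebras", "Autumn rain", "Mountain trains", "Debugging"]

# One row per dataset example; the "topic" column matches the key the task reads.
df_topics = pd.DataFrame({"topic": topics})

# Strip whitespace, then drop blank or duplicate topics so each example is distinct.
df_topics["topic"] = df_topics["topic"].str.strip()
df_topics = (
    df_topics[df_topics["topic"] != ""]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(df_topics.shape)
```

Passing `df_topics` as `data=` in place of `df` should then upload one example per topic.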

In [None]:
# Get dataset from Arize
dataset = arize_client.get_dataset(
    space_id=space_id,
    dataset_id=dataset_id
)

Let's make sure we can run async code in the notebook.

In [None]:
import nest_asyncio

nest_asyncio.apply()

Lastly, let's make sure our OpenAI API key is set.

In [None]:
import os
from getpass import getpass

if not os.getenv("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")


# Define Task

A **task** is a callable that maps the input of a dataset example to an output by invoking a chain, query engine, or LLM.

In [None]:
import openai

def create_haiku(dataset_row) -> str:
    # Each dataset row is a mapping; read the topic to write about
    topic = dataset_row.get("topic")
    openai_client = openai.OpenAI()
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Write a haiku about {topic}"}],
        max_tokens=20,
    )
    assert response.choices
    return response.choices[0].message.content
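
Because the goal is to compare prompts, it can help to build one task per prompt template instead of hard-coding the prompt. The sketch below is one way to do that. `make_haiku_task` and `fake_complete` are hypothetical names, and `fake_complete` stands in for the OpenAI call so the example runs offline. It assumes a closure whose parameter is named `dataset_row` is accepted as a task in the same way as `create_haiku`.

```python
from typing import Callable, Mapping


def make_haiku_task(
    prompt_template: str, complete: Callable[[str], str]
) -> Callable[[Mapping], str]:
    """Build a task that fills `prompt_template` with the row's topic."""

    def task(dataset_row: Mapping) -> str:
        prompt = prompt_template.format(topic=dataset_row.get("topic"))
        return complete(prompt)

    return task


def fake_complete(prompt: str) -> str:
    # Stand-in for a real LLM call; echoes the prompt it received.
    return f"[completion for: {prompt}]"


plain_task = make_haiku_task("Write a haiku about {topic}", fake_complete)
playful_task = make_haiku_task("Write a playful haiku about {topic}", fake_complete)

print(plain_task({"topic": "Zebras"}))
print(playful_task({"topic": "Zebras"}))
```

Each resulting task could then be run as its own experiment against the same dataset, so that only the prompt differs between runs.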

# Define Evaluators

An **evaluator** grades the task outputs. Here, the function `tone_eval` asks an LLM to classify the tone of each output as positive, neutral, or negative, and returns that label with an explanation.

In [None]:
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)

from arize.experimental.datasets.experiments.evaluators.base import EvaluationResult

CUSTOM_TEMPLATE = """
You are evaluating whether tone is positive, neutral, or negative

[Message]: {output}

Respond with either "positive", "neutral", or "negative"
"""

def tone_eval(output):
    df_in = pd.DataFrame({"output": output}, index=[0])
    eval_df = llm_classify(
        dataframe=df_in,
        template=CUSTOM_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=["positive", "neutral", "negative"],
        provide_explanation=True
    )
    # The score is fixed at 1 here; the label and explanation come from the LLM classifier
    return EvaluationResult(score=1, label=eval_df['label'][0], explanation=eval_df['explanation'][0])
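
Since `tone_eval` returns `score=1` for every output, the score column cannot distinguish tones. A minimal sketch of one alternative is to derive the score from the label. The numeric values below are an assumption made for illustration, not an Arize or Phoenix convention, and `tone_score` is a hypothetical helper.

```python
from typing import Optional

# Assumed mapping; the numeric values are illustrative only.
TONE_SCORES = {"positive": 1.0, "neutral": 0.5, "negative": 0.0}


def tone_score(label: str) -> Optional[float]:
    """Return a numeric score for a tone label, or None if the label is unrecognized."""
    return TONE_SCORES.get(label.strip().lower())


print(tone_score("positive"))
print(tone_score("Neutral"))
print(tone_score("unknown"))
```

The result could be passed as `score=` when building the `EvaluationResult`. Check how the evaluator should handle an unrecognized label before passing `None` through.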

# Run Experiment

Call `run_experiment` below to run your task and evaluator across the whole dataset, then view the results of your experiment in Arize.

In [None]:
experiment_id, experiment_dataframe = arize_client.run_experiment(
    space_id=space_id,
    dataset_id=dataset_id,
    task=create_haiku,
    evaluators=[tone_eval],
    experiment_name=f"haiku-example-{str(uuid1())[:5]}"
)

arize.utils.logging | INFO | 🧪 Experiment started.

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (11/14/24 10:33 PM +0000)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|            1 |        1 |          0 |

arize.utils.logging | INFO | ✅ All evaluators completed.


In [None]:
print(experiment_id)
experiment_dataframe = arize_client.get_experiment(space_id=space_id, experiment_id=experiment_id)
experiment_dataframe

RXhwZXJpbWVudDoyMTQ4Okl3QWs=


|   | id | example_id | result | result.trace.id | result.trace.timestamp | eval.tone_eval.score | eval.tone_eval.label | eval.tone_eval.explanation | eval.tone_eval.trace.id | eval.tone_eval.trace.timestamp |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | EXP_ID_6e8411 | 0329a785-6f8e-4d2f-8cfa-e1fa74248513 | Stripes of black and white, \nGrazing 'neath ... | 4eb1949117b413d46fdd2637cf91992f | 1731623585466 | 1 | positive | The message describes a peaceful and serene sc... | a87c23d9b5e11931d3843b9942aa9dc3 | 1731623587449 |
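
Once an experiment has more than one row, a quick summary of the evaluation labels makes prompts easier to compare. The sketch below uses plain pandas on a small stand-in dataframe. The rows are made up for illustration, and only the column names are taken from the experiment output in this guide.

```python
import pandas as pd

# Made-up rows for illustration; only the column names come from the experiment output.
results = pd.DataFrame(
    {
        "example_id": ["a", "b", "c", "d"],
        "eval.tone_eval.label": ["positive", "neutral", "positive", "negative"],
        "eval.tone_eval.score": [1, 1, 1, 1],
    }
)

# Count how often each tone label appears, and the share of each label.
label_counts = results["eval.tone_eval.label"].value_counts()
label_share = results["eval.tone_eval.label"].value_counts(normalize=True)

print(label_counts.to_dict())
print(label_share.round(2).to_dict())
```

Running the same summary on the dataframe returned for each experiment gives a quick per-prompt view of the tone distribution.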
