<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Slack Community</a>
    </p>
</center>

# <center>Arize Experiments - Run Evals With Different Models</center>

This guide demonstrates how to use Arize to log and analyze experiments that run the same evaluation with different LLMs. We're going to build a RAG application, along with an experimentation pipeline that evaluates its responses for hallucinations using different models. The original queries, documents, and outputs will be logged to an Arize dataset. Arize makes it easy to track and compare results from experiments, allowing you to identify which variations affect performance. You can read more about experiment tracking with Arize [here](https://docs.arize.com/arize/llm-experiments-and-testing/quickstart).

In this tutorial, you will:

*   Create a RAG application using LlamaIndex.

*   Instrument the application to log the queries, documents, and outputs to Arize.

*   Export traces from Arize, and subset into a dataset.

*   Implement a script that runs the hallucination eval with different models, generates outputs using an LLM, and logs all output to the given experiment.

*   Analyze the logged data in Arize to compare results across different models.

By leveraging Arize for experiment tracking, you'll be able to systematically test different models at scale and use the logged data to inform your development process. Let's get started!

ℹ️ This notebook requires:
- An OpenAI API key
- An Arize Space ID, API Key, and Developer Key (explained below)


# Step 1: Set Up Config

* Navigate to Space Settings, and copy your Space ID and API Key.

* Next, make sure a Developer Key is active before running the code below. To retrieve your Developer Key, navigate to the [GraphQL Explorer](https://docs.arize.com/arize/api-reference/graphql-api/getting-started-with-programmatic-access#accessing-the-api-explorer), and select "Get your developer key".



## Install Dependencies

In [None]:
!pip install -q "arize>7.29.0" "arize-phoenix-evals>=0.17.5" openai==1.57.1 arize-otel==0.7.1 gcsfs==2024.10.0 llama-index-llms-openai==0.3.3 llama-index-core==0.12.5 openinference-instrumentation-llama-index==3.0.4 opentelemetry-instrumentation-httpx==0.49b2 opentelemetry-sdk==1.28.2 opentelemetry-exporter-otlp==1.28.2 llama-index-embeddings-openai==0.3.1 llama-index==0.12.5

In [None]:
import os
from getpass import getpass
from uuid import uuid1

import nest_asyncio
import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.experiments.evaluators.base import (
    EvaluationResult,
    Evaluator,
)
from arize.experimental.datasets.utils.constants import GENERATIVE
from phoenix.evals import (
    OpenAIModel,
    llm_classify,
)
from typing import Dict, Any

nest_asyncio.apply()

In [None]:
if not os.environ.get("SPACE_ID"):
    os.environ["SPACE_ID"] = getpass("🔑 Enter your space id: ")

if not os.environ.get("ARIZE_API_KEY"):
    os.environ["ARIZE_API_KEY"] = getpass("🔑 Enter your ARIZE_API_KEY: ")

if not os.environ.get("ARIZE_DEVELOPER_KEY"):
    os.environ["ARIZE_DEVELOPER_KEY"] = getpass("🔑 Enter your developer key: ")

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

SPACE_ID = os.environ.get("SPACE_ID")
API_KEY = os.environ.get("ARIZE_API_KEY")
DEVELOPER_KEY = os.environ.get("ARIZE_DEVELOPER_KEY")
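
A missing credential usually only surfaces later as an authentication error, so it can help to check the configuration up front. The following is a minimal sketch; `missing_env_vars` is a hypothetical helper written for this illustration and is not part of the Arize SDK.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    names: Iterable[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return the names that are unset or empty in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Variable names match the ones set in the cell above.
required = ["SPACE_ID", "ARIZE_API_KEY", "ARIZE_DEVELOPER_KEY", "OPENAI_API_KEY"]
missing = missing_env_vars(required)
if missing:
    print(f"Missing configuration: {', '.join(missing)}")
```

Passing an explicit mapping as `env` makes the helper easy to check without touching the real environment.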

# Step 2. Configure a Tracer

We recommend using the `register` helper method below to configure a tracer.

In [None]:
# Import open-telemetry dependencies
from arize.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

project_name = "model-change-experiment"

# Setup OTEL via our convenience function
tracer_provider = register(
    space_id=SPACE_ID,
    api_key=API_KEY,
    project_name=project_name,
)

Because we're using a tracing integration, the instrumentor below automatically creates the spans for your application.

In [None]:
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

# Step 3. Build Your LlamaIndex RAG Application

In [None]:
from gcsfs import GCSFileSystem
from llama_index.core import (
    Settings,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

Set your OpenAI API key if it is not already set as an environment variable.

In [None]:
openai_api_key = os.getenv("OPENAI_API_KEY")

if not openai_api_key:
    os.environ["OPENAI_API_KEY"] = getpass("🔑 Enter your OpenAI API key: ")

This example uses a `RetrieverQueryEngine` over a pre-built index of the Arize documentation, but you can use whatever LlamaIndex application you like. Download the pre-built index of the Arize docs from cloud storage and instantiate your storage context.

In [None]:
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-phoenix-assets/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
)

We are now ready to instantiate our query engine, which will perform retrieval-augmented generation (RAG). A query engine is a generic interface in LlamaIndex that allows you to ask questions over your data. It takes in a natural language query and returns a rich response. It is built on top of retrievers, and you can compose multiple query engines to achieve more advanced capabilities.

In [None]:
Settings.llm = OpenAI(model="gpt-4o")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
index = load_index_from_storage(
    storage_context,
)
query_engine = index.as_query_engine()

# Step 4. Use Our Instrumented Query Engine

In [None]:
from tqdm import tqdm

queries = [
    "How can I query for a monitor's status using GraphQL?",
    "How do I delete a model?",
    "How much does an enterprise license of Arize cost?",
    "How do I log a prediction using the python SDK?",
]

for query in tqdm(queries):
    response = query_engine.query(query)
    print(f"Query: {query}")
    print(f"Response: {response}")

# Step 5: Log into Arize and explore your application traces 🚀

Log into your Arize account, and look for the project named `model-change-experiment` (the `project_name` set in Step 2, which is also passed as the `model_id` when exporting traces). If this is a brand new project, you may see a page indicating that Arize is still processing your data. Your traces will be available to explore shortly.

# Step 6. Export Traces

It can be helpful to export the trace data from Arize for a variety of reasons. Common use cases include:

* Testing out evaluations with a subset of the data

* Creating a dataset of few-shot examples programmatically

* Augmenting trace data with metadata programmatically

* Fine-tuning a smaller model with production traces from Arize

In [None]:
from datetime import datetime, timedelta

from arize.exporter import ArizeExportClient
from arize.utils.types import Environments

export_client = ArizeExportClient(api_key=DEVELOPER_KEY)

start_time = datetime.now() - timedelta(days=14)  # 14 days ago
end_time = datetime.now()  # Today

print("#### Exporting your primary dataset into a dataframe.")

primary_df = export_client.export_model_to_df(
    space_id=SPACE_ID,
    model_id=project_name,
    environment=Environments.TRACING,
    start_time=start_time,
    end_time=end_time,
)

# Step 7. Subset & Format Data For Dataset

We'll keep only the query engine spans and the retriever spans, since they hold the inputs, outputs, and retrieved documents needed for the experiments below.

In [None]:
df_base_spans = primary_df[
    ["context.trace_id", "attributes.input.value", "attributes.output.value"]
][primary_df["name"] == "BaseQueryEngine.query"].reset_index(drop=True)

In [None]:
df_docs = primary_df[["context.trace_id", "attributes.retrieval.documents"]][
    primary_df["name"] == "BaseRetriever.retrieve"
].reset_index(drop=True)

df_docs

In [None]:
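def format_documents(val) -> str:
    # `val` holds the retrieved documents for one retriever span (a sequence
    # of dicts); concatenate their text into a single reference string.
    return "".join([doc["document.content"] for doc in val])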

In [None]:
df_docs["documents_combined"] = df_docs["attributes.retrieval.documents"].apply(
    format_documents
)
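
To make the shape of the data concrete, here is a small standalone sketch of the same concatenation. The sample payload is hypothetical, and `join_documents` with its `sep` parameter is an illustrative variant rather than the function used in this notebook. It assumes each retrieved document is a dict with a `document.content` field, as `format_documents` does.

```python
from typing import Any, Dict, Iterable


def join_documents(docs: Iterable[Dict[str, Any]], sep: str = "\n\n") -> str:
    """Concatenate the `document.content` field of each retrieved document."""
    return sep.join(str(doc["document.content"]) for doc in docs)


# Hypothetical retrieval payload; real spans carry chunks of the Arize docs.
sample_docs = [
    {"document.content": "First chunk.", "document.score": 0.9},
    {"document.content": "Second chunk.", "document.score": 0.8},
]

print(join_documents(sample_docs))
```

A separator between chunks is a design choice: it keeps document boundaries visible to the eval model, whereas joining with an empty string runs the chunks together.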

# Step 8. Create Your Dataset

You can create many different kinds of datasets. We'll start with the simplest example below. If you'd like to upload datasets with prompt variables, or edit or delete your datasets, [follow this guide](https://docs.arize.com/arize/datasets/how-to-datasets/create-a-dataset-with-code).

In [None]:
df_dataset = pd.merge(
    df_base_spans, df_docs, on="context.trace_id", how="inner"
)

df_dataset

In [None]:
# Set up the Arize Dataset Client to create or update a dataset.
datasets_client = ArizeDatasetsClient(
    developer_key=DEVELOPER_KEY, api_key=API_KEY
)

In [None]:
dataset_id = datasets_client.create_dataset(
    space_id=SPACE_ID,
    dataset_name=f"arize_experiment_{str(uuid1())[:5]}",
    dataset_type=GENERATIVE,
    data=df_dataset,
)

print(dataset_id)

# Step 9. Set Up LLM Attributes As A Task

In [None]:
def task(dataset_row: Dict[str, Any]) -> dict:
    """
    Executes an LLM task based on an input.
    Output must be JSON serialisable.

    Args:
        dataset_row: A dictionary representing a dataset row.

    Returns:
        LLM output as a dictionary.
    """

    lst_cols = [
        "attributes.input.value",
        "documents_combined",
        "attributes.output.value",
    ]

    dict_output_data = {
        attribute: str(dataset_row[attribute]) for attribute in lst_cols
    }

    return dict_output_data
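
Because the task output must be JSON serializable, it can be worth checking the function on a single row before running a full experiment. The sketch below mirrors the task logic in a standalone function (`example_task`, a name chosen for this illustration) and runs it on a hypothetical row, using `json.dumps` from the standard library as the check.

```python
import json
from typing import Any, Dict


def example_task(dataset_row: Dict[str, Any]) -> Dict[str, str]:
    """Mirror of the task above: copy three columns and cast each to str."""
    columns = [
        "attributes.input.value",
        "documents_combined",
        "attributes.output.value",
    ]
    return {column: str(dataset_row[column]) for column in columns}


# Hypothetical dataset row; real rows come from the dataset created in Step 8.
row = {
    "attributes.input.value": "How do I delete a model?",
    "documents_combined": "Placeholder reference text.",
    "attributes.output.value": "Placeholder answer.",
    "context.trace_id": "abc123",
}

output = example_task(row)
# json.dumps raises TypeError if the output is not JSON serializable.
serialized = json.dumps(output)
print(sorted(output))
```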

## Create an OpenAIModel Eval

We'll create an eval model to check for hallucinations, using `gpt-4o`.

In [None]:
LLM_MODEL_NAME = "gpt-4o"
eval_llm = OpenAIModel(model=LLM_MODEL_NAME, temperature=0.0)

# Step 10. Set Up Your Evaluator


To score an experiment, create an evaluator that inherits from the [Evaluator(ABC)](https://github.com/Arize-ai/client_python/blob/8ce56cf603f7e7887efe306fa81aaaa68b068ccd/arize/experimental/datasets/experiments/evaluators/base.py#L20) base class in the Arize Python SDK. The evaluator below receives the task output and the dataset row for a single example, and returns an [EvaluationResult](https://github.com/Arize-ai/client_python/blob/8ce56cf603f7e7887efe306fa81aaaa68b068ccd/arize/experimental/datasets/experiments/types.py#L103) dataclass.

In [None]:
from phoenix.evals import HALLUCINATION_PROMPT_RAILS_MAP

In [None]:
HALLUCINATION_PROMPT_TEMPLATE = """
In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text. Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters. "hallucinated" indicates that the answer
provides factually inaccurate information to the query based on the reference text. "factual"
indicates that the answer to the question is correct relative to the reference text, and does not
contain made up information. Please read the query and reference text carefully before determining
your response.

    # Query: {attributes.input.value}
    # Reference text: {documents_combined}
    # Answer: {attributes.output.value}
    Is the answer above factual or hallucinated based on the query and reference text?
"""


class HallucinationEval(Evaluator):
    """
    Demonstrates using an LLM to judge correctness.
    """

    def evaluate(
        self, *, output: Dict[str, Any], dataset_row: Dict[str, Any], **_: Any
    ) -> EvaluationResult:
        """
        Evaluate the task output with HALLUCINATION_PROMPT_TEMPLATE and determine whether it is hallucinated.

        Args:
            output: The dictionary returned by `task`, keyed by the template variables.
            dataset_row: The dataset row the output was generated from (unused here).
            **_: Additional keyword arguments.

        Returns:
            EvaluationResult: The LLM evaluation result containing the explanation, score, and label.
        """

        # Build a one-row dataframe whose columns match the template variables
        df_input = pd.DataFrame(output, index=[0])

        # Restrict the LLM's answer to the expected labels ("factual" / "hallucinated")
        rails = list(HALLUCINATION_PROMPT_RAILS_MAP.values())

        # Apply the LLM as a judge template to the input
        eval_df = llm_classify(
            dataframe=df_input,
            template=HALLUCINATION_PROMPT_TEMPLATE,
            model=eval_llm,
            rails=rails,
            provide_explanation=True,
        )

        # Build the evaluation result from the first (and only) row
        eval_label = eval_df["label"][0]
        eval_result = EvaluationResult(
            # Provide label, explanation, and score
            label=eval_label,
            score=1 if eval_label == "hallucinated" else 0,
            explanation=eval_df["explanation"][0],
        )

        return eval_result
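
The evaluator only works if every variable in the prompt template has a matching key in the task output. The sketch below shows one way to check this with the standard library. The helper names are hypothetical, the template is a shortened stand-in for `HALLUCINATION_PROMPT_TEMPLATE`, and the regular expression is a simplification that does not handle escaped braces such as `{{` and `}}`.

```python
import re
from typing import Iterable, List, Set


def template_variables(template: str) -> Set[str]:
    """Collect the names that appear between single curly braces."""
    return set(re.findall(r"\{([^{}]+)\}", template))


def missing_variables(template: str, available: Iterable[str]) -> List[str]:
    """Return template variables that have no matching key in `available`."""
    return sorted(template_variables(template) - set(available))


# Shortened stand-in for the prompt template defined above.
example_template = (
    "# Query: {attributes.input.value}\n"
    "# Reference text: {documents_combined}\n"
    "# Answer: {attributes.output.value}\n"
)

# Deliberately incomplete key list to show what a mismatch looks like.
task_output_keys = ["attributes.input.value", "documents_combined"]
print(missing_variables(example_template, task_output_keys))
# → ['attributes.output.value']
```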

# Step 11A. Run Experiment (gpt-4o)

In [None]:
## Run Experiment
datasets_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=task,
    evaluators=[HallucinationEval()],
    experiment_name=f"Hallucination-Experiment-{LLM_MODEL_NAME}-{str(uuid1())[:5]}",
)

# Step 11B. Run Experiment (gpt-4o-mini)

We'll now run the same experiment, but use `gpt-4o-mini` as the evaluation model. This shows whether a different eval model produces different labels for the same dataset rows.

In [None]:
LLM_MODEL_NAME = "gpt-4o-mini"
eval_llm = OpenAIModel(model=LLM_MODEL_NAME, temperature=0.0)

In [None]:
## Run Experiment
datasets_client.run_experiment(
    space_id=SPACE_ID,
    dataset_id=dataset_id,
    task=task,
    evaluators=[HallucinationEval()],
    experiment_name=f"Hallucination-Experiment-{LLM_MODEL_NAME}-{str(uuid1())[:5]}",
)
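
Both experiments are now logged against the same dataset, so they can be compared side by side in Arize. If you also collect the evaluator labels for the same rows into two lists (how you obtain them is left open here), a rough comparison can be sketched as follows. The helper names are hypothetical, and the labels are made up for illustration only; they are not results from this notebook.

```python
from typing import Sequence


def hallucination_rate(labels: Sequence[str]) -> float:
    """Fraction of labels equal to "hallucinated"."""
    if not labels:
        raise ValueError("labels must not be empty")
    return sum(label == "hallucinated" for label in labels) / len(labels)


def agreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of rows on which the two eval models gave the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    if not labels_a:
        raise ValueError("labels must not be empty")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)


# Hypothetical labels for illustration only; NOT results from this notebook.
labels_model_a = ["factual", "factual", "hallucinated", "factual"]
labels_model_b = ["factual", "hallucinated", "hallucinated", "factual"]

print(hallucination_rate(labels_model_a))
print(hallucination_rate(labels_model_b))
print(agreement_rate(labels_model_a, labels_model_b))
```

Rows where the two models disagree are usually the most informative ones to inspect, together with the explanation each evaluator returned.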