# Arize AX: Improving Classification with LLMs using Prompt Learning

<img src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="600">

In this notebook we use the PromptLearningOptimizer, developed at Arize, to improve the accuracy of LLMs on classification tasks. Specifically, we classify support queries into 30 classes, including:

- Account Creation
- Login Issues
- Password Reset
- Two-Factor Authentication
- Profile Updates
- Billing Inquiry
- Refund Request

and 23 more.

You can view the dataset in `datasets/support_queries.csv`.

**Note: This notebook, `arizeax_support_query_classification.ipynb`, complements `support_query_classification.ipynb` by using Arize AX datasets, experiments, and Prompt Hub for Prompt Learning. It is a more end-to-end way to visualize your iterative prompt improvement and see how each prompt performs on the train and test sets, and it uses more advanced features of the Arize SDK.**

In [1]:
%pip install -q arize-phoenix openai pandas arize gql

Note: you may need to restart the kernel to use updated packages.


In [2]:
import getpass
import os
import re
import sys

import nest_asyncio
import openai
import pandas as pd
from openai import AsyncOpenAI

# Allow nested event loops so async code can run inside the notebook's own loop.
nest_asyncio.apply()

In [3]:
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
openai_client = AsyncOpenAI(api_key=os.environ['OPENAI_API_KEY'])
ARIZE_API_KEY = getpass.getpass('ARIZE_API_KEY')
ARIZE_SPACE_ID = getpass.getpass('ARIZE_SPACE_ID')


In [15]:
# Add parent directory to path
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

# **Setup**

In [16]:
from arize.experimental.datasets import ArizeDatasetsClient
import pandas as pd

# add your Arize API key here
client = ArizeDatasetsClient(api_key=ARIZE_API_KEY)

## **Make train/test sets**

We use an 80/20 train/test split to train our prompt. The optimizer will use the training set to visualize and analyze its errors and successes, and make prompt updates based on these results. We will then test on the test set to see how that prompt performs on unseen data. 

We will be exporting these datasets to Arize AX. In Arize you will be able to view the experiments we run on the train/test sets.

In [30]:
import pandas as pd
from uuid import uuid1
from arize.experimental.datasets.utils.constants import GENERATIVE


data = pd.read_csv("../datasets/support_queries.csv")


# For a quick smoke test, subsample the training split instead:
# train_set = data.sample(frac=0.8, random_state=42).sample(frac=0.02)
# Full dataset: the 80/20 split gives 123 training and 31 test examples.
train_set = data.sample(frac=0.8, random_state=42)
test_set = data.drop(train_set.index)

UUID = str(uuid1())[:8] #associate UUID with datasets and prompt name to track across runs
TRAIN_DATASET_NAME = "prompt_optimizer_training_data+" + UUID
TEST_DATASET_NAME = "prompt_optimizer_test_data+" + UUID

train_dataset_id = client.create_dataset(
        space_id=ARIZE_SPACE_ID,
        dataset_name=TRAIN_DATASET_NAME,
        dataset_type=GENERATIVE,
        data=train_set,
    )

test_dataset_id = client.create_dataset(
        space_id=ARIZE_SPACE_ID,
        dataset_name=TEST_DATASET_NAME,
        dataset_type=GENERATIVE,
        data=test_set,
    )

# Get datasets as Dataframe
train_dataset = client.get_dataset(space_id=ARIZE_SPACE_ID, dataset_name=TRAIN_DATASET_NAME)
test_dataset = client.get_dataset(space_id=ARIZE_SPACE_ID, dataset_name=TEST_DATASET_NAME)

print("train dataset id:", train_dataset_id)
print("test dataset id:", test_dataset_id)

train dataset id: RGF0YXNldDozMDQyNDM6dmRqYw==
test dataset id: RGF0YXNldDozMDQyNDQ6U0tTSw==


## **Base Prompt for Optimization**

This is our base prompt - our 0th iteration. This is the prompt we will be optimizing for our task.

We also upload our prompt to Arize AX. Arize's Prompt Hub serves as a repository for your prompts. You will be able to view every iteration of your prompt as it's optimized, along with its test metric.

In [31]:
%pip install -q "arize[PromptHub]"

Note: you may need to restart the kernel to use updated packages.


In [32]:
#from phoenix.client.types import PromptVersion
# Prompt Hub docs: https://arize.com/docs/ax/reference/reference/prompt-hub-api

from arize.experimental.prompt_hub import ArizePromptClient, Prompt, LLMProvider

# Initialize the client with your Arize credentials
prompt_client = ArizePromptClient(
    space_id=ARIZE_SPACE_ID,
    api_key=ARIZE_API_KEY
)


system_prompt = """
support query: {query}
Account Creation
Login Issues
Password Reset
Two-Factor Authentication
Profile Updates
Billing Inquiry
Refund Request
Subscription Upgrade/Downgrade
Payment Method Update
Invoice Request
Order Status
Shipping Delay
Product Return
Warranty Claim
Technical Bug Report
Feature Request
Integration Help
Data Export
Security Concern
Terms of Service Question
Privacy Policy Question
Compliance Inquiry
Accessibility Support
Language Support
Mobile App Issue
Desktop App Issue
Email Notifications
Marketing Preferences
Beta Program Enrollment
General Feedback

Return just the category, no other text.
"""


def add_prompt_version(system_prompt, prompt_name, model_name, test_metric, loop_number):
    """Push `system_prompt` to Prompt Hub as a new version of `prompt_name`."""
    commit_message = f"Loop {loop_number} - Test Metric: {test_metric}"
    messages = [{"role": "system", "content": system_prompt}]
    try:
        # Update the prompt if it already exists in Prompt Hub.
        prompt = prompt_client.pull_prompt(prompt_name=prompt_name)
        prompt.messages = messages
    except Exception:
        # Otherwise create it. Catching Exception instead of using a bare
        # except lets KeyboardInterrupt and SystemExit propagate.
        prompt = Prompt(
            name=prompt_name,
            model_name=model_name,
            messages=messages,
            provider=LLMProvider.OPENAI,
        )
    prompt.commit_message = commit_message
    prompt_client.push_prompt(prompt, commit_message=commit_message)



## **Output Generator**

`generate_task` builds the task that the experiment runner calls on every row of our dataset. The task fills the `{query}` placeholder in the prompt and sends it to OpenAI (`gpt-4o-mini`). Because the task is async, `run_experiment` can issue many calls concurrently (we set `concurrency=10` later on).

Each call returns the model's output for one row, that is, the predicted category for one support query.

In [33]:
def generate_task(system_prompt):

    async def output_task(dataset_row):
        formatted_prompt = system_prompt.replace("{query}", dataset_row.get("query"))
        response = await openai_client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": formatted_prompt}],
        )
        return response.choices[0].message.content
    
    return output_task
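
The task fills the `{query}` placeholder with `str.replace` rather than `str.format`. One plausible reason, offered here as an assumption rather than a documented design decision, is that optimized prompts may contain literal braces (for example a JSON snippet), which `str.format` would try to interpret as fields. A minimal sketch with a made-up prompt:

```python
# Hypothetical prompt containing a literal JSON example alongside the placeholder.
prompt = 'support query: {query}\nAnswer like {"category": "Login Issues"}.'
row = {"query": "I can't sign in"}

# str.replace only touches the exact "{query}" token.
filled = prompt.replace("{query}", row["query"])
print(filled)

# str.format treats every brace pair as a field and fails on the JSON example.
try:
    prompt.format(query=row["query"])
    format_ok = True
except (KeyError, ValueError, IndexError):
    format_ok = False
print("str.format succeeded:", format_ok)
```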

In [24]:
def normalize(label):
    # Trim whitespace and surrounding quotes, then lowercase for comparison.
    return label.strip().strip('"').strip("'").lower()
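
The experiment output logged later in this notebook includes results wrapped in Markdown bold, such as `**Product Return**`. `normalize` only strips whitespace and quotes, so such an output would not exact-match its ground-truth label. The sketch below shows one possible, more lenient variant; `normalize_lenient` is a hypothetical helper and is not used by the notebook.

```python
import re

def normalize(label):
    # Same behaviour as the notebook's helper: trim whitespace and quotes, lowercase.
    return label.strip().strip('"').strip("'").lower()

def normalize_lenient(label):
    # Hypothetical variant: also drop Markdown emphasis markers and backticks.
    cleaned = re.sub(r"[*`]+", "", label)
    return normalize(cleaned)

strict_match = normalize("**Product Return**") == normalize("Product Return")
lenient_match = normalize_lenient("**Product Return**") == normalize_lenient("Product Return")
print(strict_match, lenient_match)
```

Whether formatting slips like this should count as errors is a judgment call: a strict match also penalizes the prompt for not following the "Return just the category" instruction.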

## **Evaluator**

In this section we define our LLM-as-judge eval. 

Prompt Learning works by generating natural language evaluations on your outputs. These evaluations help guide the prompt optimizer towards building an optimized prompt. 

You should spend time thinking about how to write an informative eval. Your eval makes or breaks this prompt optimizer. With helpful feedback, our prompt optimizer will be able to generate a stronger optimized prompt much more effectively than with sparse or unhelpful feedback. 

Below is an example of a strong eval. It returns several fields:
- **correctness**: `correct` or `incorrect`, whether the support query was classified correctly.
- **explanation**: Brief explanation of why the predicted classification is correct or incorrect, referencing the correct label if relevant.
- **confusion_reason**: If incorrect, explains why the model may have made this choice instead of the correct classification. Focuses on likely sources of confusion. If correct, 'no confusion'.
- **error_type**: One of 'broad_vs_specific', 'keyword_bias', 'multi_intent_confusion', 'ambiguous_query', 'off_topic', 'paraphrase_gap', 'other', or 'none' if correct. The definitions of these error types are part of the evaluator's prompt, and the evaluator includes the definition of the type it chooses.
- **evidence_span**: Exact phrase(s) from the query that strongly indicate the correct classification.
- **prompt_fix_suggestion**: One clear instruction to add to the classifier prompt to prevent this error.

**Take a look at `prompts/support_query_classification/evaluator_prompt.txt` for the full prompt!**

Our evaluator is also async, so the experiment runner can build these LLM evals concurrently. We request a JSON response from the model and use an output parser to extract each field from it.

In [34]:
import re
#from phoenix.experiments.types import EvaluationResult
from arize.experimental.datasets.experiments.evaluators.base import EvaluationResult



def find_attributes(output):
    patterns = {
        "correctness": r'"correctness":\s*"([^"]*)"',
        "explanation": r'"explanation":\s*"([^"]*)"',
        "confusion_reason": r'"confusion_reason":\s*"([^"]*)"',
        "error_type": r'"error_type":\s*"([^"]*)"',
        "evidence_span": r'"evidence_span":\s*"([^"]*)"',
        "prompt_fix_suggestion": r'"prompt_fix_suggestion":\s*"([^"]*)"'
    }

    return tuple(
        (match := re.search(pattern, output, re.IGNORECASE)) and match.group(1)
        for pattern in patterns.values()
    )


def eval_parser(response: str) -> dict:
    correctness, explanation, confusion_reason, error_type, evidence_span, prompt_fix_suggestion = find_attributes(response)
    return {
        "correctness": correctness,
        "explanation": explanation,
        "confusion_reason": confusion_reason,
        "error_type": error_type,
        "evidence_span": evidence_span,
        "prompt_fix_suggestion": prompt_fix_suggestion
    }


async def output_evaluator(dataset_row, output):
    with open("../prompts/support_query_classification/evaluator_prompt.txt", "r") as file:
        evaluator_prompt = file.read()

    evaluator_prompt = evaluator_prompt.replace("{query}", dataset_row.get("query"))
    evaluator_prompt = evaluator_prompt.replace("{ground_truth}", dataset_row.get("ground_truth"))
    evaluator_prompt = evaluator_prompt.replace("{output}", output)

    eval_result = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": evaluator_prompt}],
        response_format={"type": "json_object"},
    )

    response = eval_result.choices[0].message.content
    parsed_eval_result = eval_parser(response)
    explanation=f"""correctness: {parsed_eval_result.get("correctness", "")};
        explanation: {parsed_eval_result.get("explanation", "")};
        confusion_reason: {parsed_eval_result.get("confusion_reason", "")};
        error_type: {parsed_eval_result.get("error_type", "")};
        evidence_span: {parsed_eval_result.get("evidence_span", "")};
        prompt_fix_suggestion: {parsed_eval_result.get("prompt_fix_suggestion", "")};"""

    # A field the parser could not find comes back as None; treat that as empty.
    label = parsed_eval_result.get("correctness") or ""
    score = float(label == "correct")

    return EvaluationResult(
        score=score,
        label=label,
        explanation=explanation,
    )

async def test_evaluator(dataset_row, output):
    label=str(normalize(dataset_row.get("ground_truth")) == normalize(output))
    return EvaluationResult(
        label=label,
        score = float(label=="True"),
        explanation="placeholder"
        )
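
Because the evaluator call requests `response_format={"type": "json_object"}`, the response should already be valid JSON, so decoding it with `json.loads` is an alternative to regex extraction. A pattern like `"([^"]*)"` stops at the first quote character, which truncates values that contain escaped quotes. The sketch below is one possible approach; `parse_eval_json` is a hypothetical name and the sample response is made up.

```python
import json

FIELDS = ("correctness", "explanation", "confusion_reason",
          "error_type", "evidence_span", "prompt_fix_suggestion")

def parse_eval_json(response):
    """Hypothetical alternative to eval_parser: decode JSON, default missing keys to ''."""
    try:
        payload = json.loads(response)
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    return {field: str(payload.get(field, "")) for field in FIELDS}

# Made-up evaluator response with quotes inside a value.
sample = json.dumps({
    "correctness": "incorrect",
    "explanation": 'The query says "refund", so Refund Request fits better.',
    "error_type": "keyword_bias",
})
parsed = parse_eval_json(sample)
print(parsed["correctness"])
print(parsed["explanation"])
print(repr(parsed["evidence_span"]))
```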


## Metrics

Below we define the metrics that are computed on each iteration of prompt optimization. They measure how the classifier performs with the current iteration's prompt.

Specifically, we use scikit-learn for precision, recall, F1 score, and simple accuracy.

In [35]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
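def compute_metric(experiment_id, prediction_column_name, scorer="accuracy", average="macro"):
    """
    Compute the requested classification metric from an Arize experiment.

    `prediction_column_name` must hold per-row 0/1 evaluator scores. They are
    compared against an all-ones target, so accuracy is the fraction of rows
    scored 1.
    """
    experiment_df = client.get_experiment(ARIZE_SPACE_ID, experiment_id=experiment_id)

    print(experiment_df.head())  # quick look at the experiment columns

    y_pred = experiment_df[prediction_column_name]
    y_true = [1] * len(experiment_df)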

    if scorer == "accuracy":
        return accuracy_score(y_true, y_pred)
    elif scorer == "f1":
        return f1_score(y_true, y_pred, zero_division=0, average=average)
    elif scorer == "precision":
        return precision_score(y_true, y_pred, zero_division=0, average=average)
    elif scorer == "recall":
        return recall_score(y_true, y_pred, zero_division=0, average=average)
    else:
        raise ValueError(f"Unknown scorer: {scorer}")
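
`compute_metric` compares per-row 0/1 scores against an all-ones target, so its accuracy is the fraction of rows scored 1. With the same construction, precision, recall, and F1 are computed over the two values 0 and 1, not over the 30 categories. If a per-category macro F1 is wanted, it needs the true and predicted labels themselves. The pure-Python sketch below illustrates both ideas on toy data; the function names and labels are illustrative and not part of the notebook.

```python
def accuracy_from_scores(scores):
    # Matches accuracy against an all-ones target for 0/1 scores.
    return sum(1 for s in scores if s == 1) / len(scores)

def macro_f1(y_true, y_pred):
    # Unweighted mean of per-class F1 over classes seen in y_true or y_pred.
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["Login Issues", "Refund Request", "Refund Request", "Order Status"]
y_pred = ["Login Issues", "Billing Inquiry", "Refund Request", "Order Status"]
scores = [float(t == p) for t, p in zip(y_true, y_pred)]

print(accuracy_from_scores(scores))
print(macro_f1(y_true, y_pred))
```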

## Experiment Processor

This function pulls an Arize experiment and loads the data into a pandas dataframe so it can run through the optimizer.

Specifically it:
- Pulls the experiment data from Arize
- Adds the input column to the dataframe
- Adds the evals to the dataframe
- Adds the output to the dataframe
- Returns the dataframe

In [36]:
def process_experiment(arize_client, experiment_id, train_set, input_column_name, output_column_name, feedback_columns=None):
    feedback_columns = feedback_columns or []  # tolerate the None default

    experiment_df = arize_client.get_experiment(ARIZE_SPACE_ID, experiment_id=experiment_id)

    # Join each dataset example to its experiment run.
    train_set_with_experiment_results = pd.merge(train_set, experiment_df, left_on='id', right_on='example_id', how='inner')

    # Reset the feedback columns so values from a previous loop do not linger.
    for column in feedback_columns:
        train_set[column] = [None] * len(train_set)

    for _, row in train_set_with_experiment_results.iterrows():
        index = row["example_id"]
        eval_output = row["eval.output_evaluator.explanation"]
        # The explanation is the "key: value;" string built by output_evaluator.
        for item in eval_output.split(";"):
            # Split on the first colon only so values containing ":" stay intact.
            key, sep, value = item.partition(":")
            key = key.strip()
            if sep and key in feedback_columns:
                train_set.loc[train_set["id"] == index, key] = value.strip()

    # Align outputs by example id rather than by row position.
    outputs_by_id = train_set_with_experiment_results.set_index("example_id")["result"]
    train_set[output_column_name] = train_set["id"].map(outputs_by_id)

    train_set.rename(columns={"query": input_column_name}, inplace=True)

    return train_set
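
The evaluator packs its fields into a single `key: value;` string, and `process_experiment` has to unpack that string again. The sketch below isolates that parsing step, splitting each item on its first colon only so that values containing a colon are kept whole. `parse_feedback` is a hypothetical helper and the sample text is made up; a semicolon inside a value would still break this format.

```python
def parse_feedback(explanation, feedback_columns):
    """Hypothetical helper: turn 'key: value; key: value;' text into a dict."""
    parsed = {}
    for item in explanation.split(";"):
        key, sep, value = item.partition(":")  # split on the first colon only
        key = key.strip()
        if sep and key in feedback_columns:
            parsed[key] = value.strip()
    return parsed

explanation = """correctness: incorrect;
        explanation: Expected label: Refund Request;
        error_type: keyword_bias;"""
columns = ["correctness", "explanation", "error_type"]
feedback = parse_feedback(explanation, columns)
print(feedback)
```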


# Prompt Optimization Loop with Arize Experiments

This code implements an iterative prompt optimization system that uses **Arize AX experiments** to evaluate and improve prompts based on feedback from LLM evaluators.


## Overview

The `optimize_loop` function automates prompt engineering by:

- Evaluating prompts using Arize experiments  
- Collecting detailed feedback from LLM evaluators  
- Optimizing prompts via a learning-based optimizer  
- Iterating until the performance threshold is met or the loop limit is reached  


## Step-by-Step Breakdown

The numbered steps below follow the order of operations in `optimize_loop`.

### 1. Initialization

- Set up tracking variables:
  - `train_metrics`, `test_metrics`, `raw_dfs` for storing evaluation results
- Convert training dataset to a DataFrame for easy updates

### 2. Baseline Evaluation

- Run an initial experiment using the **test set**
- Establish a **baseline metric** (e.g., accuracy, F1) to compare against future improvements

### 3. Early Exit Check

- If the **initial prompt already meets the performance threshold**, skip further optimization to save time and compute

### 4. Main Optimization Loop

For each iteration (up to `loops`):

#### 4a. Run Training Experiment

- Execute the current prompt on the **training set**
- Use LLM evaluators to generate **natural language feedback**

#### 4b. Process Feedback

- Extract structured information from evaluator outputs:
  - Correctness
  - Explanation
  - Confusion reason
  - Error type
  - Evidence span
  - Prompt fix suggestions
- Update the training DataFrame with this feedback

#### 4c. Generate Learning Annotations

- Convert feedback into structured annotations for the optimizer to learn from
- This allows learning from evaluator insights in a consistent format

#### 4d. Optimize the Prompt

- Pass feedback to the **PromptLearningOptimizer**
- Generate an **improved prompt** that attempts to correct issues found in the previous iteration

#### 4e. Evaluate on Test Set

- Evaluate the updated prompt on the **held-out test set**
- Assess **generalization** beyond the training data

#### 4f. Track Metrics

- Log metrics for:
  - Training set performance
  - Test set performance
- Store raw results for further analysis or visualization

#### 4g. Convergence Check

- If the new prompt's test metric **meets or exceeds the threshold**, exit the loop early
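
The steps above can be sketched as a small control-flow skeleton. This is a simplified illustration, not the notebook's implementation: the three callables are hypothetical stand-ins for the Arize experiments and the optimizer, and the toy scoring rule is invented so the example runs without any API calls.

```python
def optimize_loop_sketch(prompt, evaluate_train, evaluate_test, improve,
                         threshold=1.0, loops=5):
    """Control flow only; the three callables are hypothetical stand-ins."""
    test_metrics = [evaluate_test(prompt)]                 # 2. baseline on the test set
    train_metrics, prompts = [], [prompt]                  # 1. tracking variables
    if test_metrics[0] >= threshold:                       # 3. early exit
        return {"train": train_metrics, "test": test_metrics, "prompts": prompts}
    for _ in range(loops):                                 # 4. main loop
        train_metric, feedback = evaluate_train(prompt)    # 4a-4c. run and collect feedback
        train_metrics.append(train_metric)
        prompt = improve(prompt, feedback)                 # 4d. optimize the prompt
        prompts.append(prompt)
        test_metrics.append(evaluate_test(prompt))         # 4e-4f. test and track
        if test_metrics[-1] >= threshold:                  # 4g. convergence check
            break
    return {"train": train_metrics, "test": test_metrics, "prompts": prompts}

# Toy stand-ins: each "improvement" appends a rule, and the score grows with rule count.
def toy_test(prompt):
    return min(1.0, (5 + 2 * prompt.count("rule")) / 10)

def toy_train(prompt):
    return toy_test(prompt), "add a rule"

def toy_improve(prompt, feedback):
    return prompt + " rule"

result = optimize_loop_sketch("classify:", toy_train, toy_test, toy_improve, threshold=0.9)
# Pick the prompt version with the highest test metric.
best = max(range(len(result["test"])), key=result["test"].__getitem__)
print(result["test"])
print(result["prompts"][best])
```

Note that `test` and `prompts` have one more entry than `train`, because the baseline prompt is tested before any training run.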

In [37]:
from optimizer_sdk.prompt_learning_optimizer import PromptLearningOptimizer
import time

import copy

#prompt name that will show up in Arize Prompt Hub
prompt_name = "prompt_optimizer+" + UUID

def optimize_loop(
    train_dataset,
    test_dataset,
    system_prompt,
    feedback_columns,
    threshold=1,
    loops=5,
    scorer="accuracy",
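    prompt_versions=None,
):
    """
    scorer: one of "accuracy", "f1", "precision", "recall"
    """
    # Avoid a shared mutable default: create a fresh list on each call.
    if prompt_versions is None:
        prompt_versions = []
    curr_loop = 1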
    train_metrics = []
    test_metrics = []
    raw_dfs = []
    train_df = train_dataset

    print(f"🚀 Starting prompt optimization with {loops} iterations (scorer: {scorer}, threshold: {threshold})")
    
    print("Initial evaluation:")

    task = generate_task(system_prompt)

    initial_experiment_id, _ = client.run_experiment(
        space_id=ARIZE_SPACE_ID,
        dataset_id=test_dataset_id,
        task=task,
        evaluators=[test_evaluator],
        experiment_name="initial_experiment_1",
        concurrency=10
    )

    time.sleep(3)
    
    initial_metric_value = compute_metric(initial_experiment_id, "eval.test_evaluator.score", scorer)
    print(f"✅ Initial {scorer}: {initial_metric_value}")

    test_metrics.append(initial_metric_value)
    raw_dfs.append(copy.deepcopy(test_set))
    prompt_versions.append(system_prompt)

    add_prompt_version(system_prompt, prompt_name, "gpt-4o-mini-2024-07-18", initial_metric_value, 0)
    if initial_metric_value >= threshold:
        print("🎉 Initial prompt already meets threshold!")
        return {
            "train": train_metrics,
            "test": test_metrics,
            "prompts": prompt_versions,
            "raw": raw_dfs
        }
    
    # Main loop: run on the training set, collect feedback, optimize, then re-test.

    while loops > 0:
        print(f"📊 Loop {curr_loop}: Optimizing prompt...")
        
        task = generate_task(system_prompt)

        train_experiment_id, _ = client.run_experiment(
            space_id=ARIZE_SPACE_ID,
            dataset_id=train_dataset_id,
            task=task,
            evaluators=[output_evaluator, test_evaluator],
            experiment_name=f"train_experiment_{curr_loop}",
            concurrency=10
        )

        time.sleep(3)

        train_metric = compute_metric(train_experiment_id, "eval.output_evaluator.score", scorer)
        train_metrics.append(train_metric)
        train_df = process_experiment(client, train_experiment_id, train_df, "query", "output", feedback_columns)
        print(f"✅ Training {scorer}: {train_metric}")
        
        optimizer = PromptLearningOptimizer(
            prompt=system_prompt,  
            model_choice="gpt-4o",
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )

        with open("../prompts/support_query_classification/annotations_prompt.txt", "r") as file:
            annotations_prompt = file.read()

        annotations = optimizer.create_annotation(
            system_prompt,
            ["query"],
            train_df,
            feedback_columns,
            [annotations_prompt],
            "output",
            "ground_truth"
        )

        system_prompt = optimizer.optimize(
            train_df,
            "output",
            feedback_columns=feedback_columns,
            context_size_k=90000,
            annotations=annotations,
        )
        prompt_versions.append(system_prompt)

        
        test_experiment_id, _ = client.run_experiment(
            space_id=ARIZE_SPACE_ID,
            dataset_id=test_dataset_id,
            task=generate_task(system_prompt),
            evaluators=[test_evaluator],
            experiment_name=f"test_experiment_{curr_loop}",
            concurrency=10
        )

        time.sleep(3)

        test_metric = compute_metric(test_experiment_id, "eval.test_evaluator.score", scorer)
        test_metrics.append(test_metric)

        add_prompt_version(system_prompt, prompt_name, "gpt-4o-mini-2024-07-18", test_metric, curr_loop)
        print(f"✅ Test {scorer}: {test_metric}")

        if test_metric >= threshold:
            print("🎉 Prompt optimization met threshold!")
            break

        loops -= 1
        curr_loop += 1

    return {
        "train": train_metrics,
        "test": test_metrics,
        "prompts": prompt_versions,
        "raw": raw_dfs
    }

# Main execution: optimize_loop is a regular (synchronous) function, so it is called directly.
evaluators = [output_evaluator]
feedback_columns = ["correctness", "explanation", "confusion_reason", "error_type", "evidence_span", "prompt_fix_suggestion"]
result = optimize_loop(train_dataset, test_dataset, system_prompt, feedback_columns, loops=5, scorer="accuracy")

🚀 Starting prompt optimization with 5 iterations (scorer: accuracy, threshold: 1)
Initial evaluation:
arize.utils.logging | INFO | 🧪 Experiment started.


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 |  1.22it/s

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:10 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 |  1.06it/s


arize.utils.logging | INFO | ✅ All evaluators completed.


running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:18<00:00 |  1.63it/s


              id                            example_id            result  \
0  EXP_ID_a571db  e229198e-c98b-4dee-a5b8-a6f3f36d18e4  Account Creation   
1  EXP_ID_97b182  0e780f04-2457-469c-9bc9-954db24b3583   Billing Inquiry   
2  EXP_ID_7325f2  abe65d8c-297a-4759-b187-2f48cd510b3d      Order Status   
3  EXP_ID_96fe1e  8287e65d-8fb9-470f-9bcf-cc68491f230a    Password Reset   
4  EXP_ID_a0ab3d  8fc5bd05-f4fa-41b4-894e-2d76955a5cf1      Login Issues   

                    result.trace.id  result.trace.timestamp  \
0  503b767c1a09da8da1a540e281602b90           1756905015754   
1  a33f37c414b7d8b665e6629a244ee5af           1756905016765   
2  2d9bf1d868203b103ff11241b81ddd0c           1756905017755   
3  029370f301629ef18760545795751f05           1756905018738   
4  5277b47c30c9b29847dd989daef74b29           1756905019707   

   eval.test_evaluator.score eval.test_evaluator.label  \
0                        1.0                      True   
1                        0.0                    

running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 |  2.41it/s

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:12 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0


running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 |  1.51it/s
running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 00:55<00:00 |  1.89s/it

arize.utils.logging | INFO | ✅ All evaluators completed.
              id                            example_id  \
0  EXP_ID_9b66fb  dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe   
1  EXP_ID_41a1ad  c820437c-026e-4cf5-9e0a-1d190fa39fae   
2  EXP_ID_6e739f  787aa17a-da44-4c75-b4fd-cc303be0bd56   
3  EXP_ID_856820  36e38c54-b02b-4219-94d1-8da3bb5e14d9   
4  EXP_ID_35cea8  ce75bd3c-9abe-4e4c-aaea-3f1fa1361886   

                      result                   result.trace.id  \
0    Privacy Policy Question  cef78b4e554c06dc6fcb9759b87f8d0e   
1            Billing Inquiry  1d0143c3ff57677d955029da9dabfb55   
2  Return label prints blank  39fbdb431830a2a9618ad829fd9deae6   
3            Billing Inquiry  f3e0b076a74b2a85cb3e8f9b16aa2ca2   
4           General Feedback  f72a2e1f78425abacd5c196aa7500a87   

   result.trace.timestamp  eval.output_evaluator.score  \
0           1756905074351                          1.0   
1           1756905075341                          1.0   
2        

running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:16<00:00 |  3.22it/s


✅ Training accuracy: 0.5772357723577236
🔍 Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 123 examples in 1 batches
   ✅ Batch 1/1: Optimized
arize.utils.logging | INFO | 🧪 Experiment started.


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 |  1.10s/it

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:15 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 |  1.07it/s


arize.utils.logging | INFO | ✅ All evaluators completed.


running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:17<00:00 |  1.79it/s


              id                            example_id  \
0  EXP_ID_821972  e229198e-c98b-4dee-a5b8-a6f3f36d18e4   
1  EXP_ID_cdfe5b  0e780f04-2457-469c-9bc9-954db24b3583   
2  EXP_ID_18c746  abe65d8c-297a-4759-b187-2f48cd510b3d   
3  EXP_ID_dca0d6  8287e65d-8fb9-470f-9bcf-cc68491f230a   
4  EXP_ID_324a0c  8fc5bd05-f4fa-41b4-894e-2d76955a5cf1   

                           result                   result.trace.id  \
0                    Login Issues  8ef1f6a5aaeab6793cf77ad12c20f09a   
1  Subscription Upgrade/Downgrade  fbf1485120e77c40ae37a2587ff42bdc   
2                     Data Export  d12803c1ec835fb429c15b6ff2ce7cbb   
3                  Password Reset  bcb301b819adc1b749a932e492a3a102   
4                    Login Issues  dda6f07d37517e0b77f234231c78f30d   

   result.trace.timestamp  eval.test_evaluator.score  \
0           1756905288188                        0.0   
1           1756905289239                        0.0   
2           1756905290152                        1.0   


running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 |  2.28it/s

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:17 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0


running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 |  1.51it/s


arize.utils.logging | INFO | ✅ All evaluators completed.
              id                            example_id  \
0  EXP_ID_cc8656  dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe   
1  EXP_ID_a92e3b  c820437c-026e-4cf5-9e0a-1d190fa39fae   
2  EXP_ID_50d9e5  787aa17a-da44-4c75-b4fd-cc303be0bd56   
3  EXP_ID_10e0c2  36e38c54-b02b-4219-94d1-8da3bb5e14d9   
4  EXP_ID_63d2ae  ce75bd3c-9abe-4e4c-aaea-3f1fa1361886   

                    result                   result.trace.id  \
0  Privacy Policy Question  e2fac03a0c9e7641f5fd38322a95fb85   
1          Billing Inquiry  d71732cd94f7e31b93f2735ef6f27339   
2     Technical Bug Report  cb31a4ff728acf07b8947a4df2ec5c60   
3          Billing Inquiry  cf4477d59646e553a639734cafe62393   
4          Feature Request  0b95ab486fb01caa2872551d24f37645   

   result.trace.timestamp  eval.output_evaluator.score  \
0           1756905344864                          1.0   
1           1756905345869                          1.0   
2           175690534

running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:04<00:00 |  3.81it/s


✅ Training accuracy: 0.7723577235772358
🔍 Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 123 examples in 1 batches
   ✅ Batch 1/1: Optimized
arize.utils.logging | INFO | 🧪 Experiment started.


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 |  1.13it/s

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:19 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 |  1.06it/s


arize.utils.logging | INFO | ✅ All evaluators completed.


running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:03<00:00 |  8.68it/s


              id                            example_id  \
0  EXP_ID_2261c4  e229198e-c98b-4dee-a5b8-a6f3f36d18e4   
1  EXP_ID_ab4f1f  0e780f04-2457-469c-9bc9-954db24b3583   
2  EXP_ID_8e6039  abe65d8c-297a-4759-b187-2f48cd510b3d   
3  EXP_ID_fb7e4b  8287e65d-8fb9-470f-9bcf-cc68491f230a   
4  EXP_ID_bd75a0  8fc5bd05-f4fa-41b4-894e-2d76955a5cf1   

                           result                   result.trace.id  \
0                    Login Issues  1a967a4db8659598294fd0104bd0ec28   
1  Subscription Upgrade/Downgrade  29c8ecdc3c251b454c145087fa4546c1   
2                     Data Export  64781d712b854c281d8ca4a0c9684ecd   
3                  Password Reset  b731c5b4068400b50c8d99abcd03ea31   
4                    Login Issues  c4f9ade1c49198f8809f626afc743198   

   result.trace.timestamp  eval.test_evaluator.score  \
0           1756905545045                        0.0   
1           1756905546218                        0.0   
2           1756905547009                        1.0   


running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 |  1.26it/s

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:21 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0


running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 |  1.51it/s


arize.utils.logging | INFO | ✅ All evaluators completed.
              id                            example_id  \
0  EXP_ID_280a6b  dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe   
1  EXP_ID_ea1bb3  c820437c-026e-4cf5-9e0a-1d190fa39fae   
2  EXP_ID_a8d783  787aa17a-da44-4c75-b4fd-cc303be0bd56   
3  EXP_ID_f6c57f  36e38c54-b02b-4219-94d1-8da3bb5e14d9   
4  EXP_ID_a73937  ce75bd3c-9abe-4e4c-aaea-3f1fa1361886   

                    result                   result.trace.id  \
0  Privacy Policy Question  7c504594577d69930848c19ef17cecf2   
1          Billing Inquiry  1738f24c66f0647f8863d31efcea45a3   
2       **Product Return**  fa46c4bb5d6b416228f04664bc8486d7   
3          Billing Inquiry  66c31275028840e540d4977e2702e387   
4          Feature Request  f9ff934cac9192e3d2f99c5db4a937c5   

   result.trace.timestamp  eval.output_evaluator.score  \
0           1756905602392                          1.0   
1           1756905603535                          1.0   
2           175690560

running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:00<00:00 |  4.05it/s


✅ Training accuracy: 0.7967479674796748
🔍 Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 123 examples in 1 batches
   ✅ Batch 1/1: Optimized
[38;21m  arize.utils.logging | INFO | 🧪 Experiment started.[0m


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 |  1.31it/s

[38;21m  arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:23 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0[0m


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 |  1.05it/s


arize.utils.logging | INFO | ✅ All evaluators completed.


running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:03<00:00 |  8.96it/s


              id                            example_id  \
0  EXP_ID_da0286  e229198e-c98b-4dee-a5b8-a6f3f36d18e4   
1  EXP_ID_285fea  0e780f04-2457-469c-9bc9-954db24b3583   
2  EXP_ID_3e8095  abe65d8c-297a-4759-b187-2f48cd510b3d   
3  EXP_ID_3d21a8  8287e65d-8fb9-470f-9bcf-cc68491f230a   
4  EXP_ID_1896a9  8fc5bd05-f4fa-41b4-894e-2d76955a5cf1   

                               result                   result.trace.id  \
0                        Login Issues  9312b7cd948a832424c244402e261dd3   
1  **Subscription Upgrade/Downgrade**  c74967c366f29051aacfcd73579e92f6   
2                         Data Export  20552b811a6cf3980c971ee9a83a4cb4   
3                      Password Reset  13ac8580fe6a1d1a970ca35d70c65688   
4                        Login Issues  62700bb18a0bf97bbab76ab1f8706cab   

   result.trace.timestamp  eval.test_evaluator.score  \
0           1756905795072                        0.0   
1           1756905796068                        0.0   
2           1756905797104       

running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 |  2.32it/s

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:25 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0


running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 |  1.50it/s


arize.utils.logging | INFO | ✅ All evaluators completed.
              id                            example_id  \
0  EXP_ID_c13b46  dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe   
1  EXP_ID_a44f6d  c820437c-026e-4cf5-9e0a-1d190fa39fae   
2  EXP_ID_8be281  787aa17a-da44-4c75-b4fd-cc303be0bd56   
3  EXP_ID_4778d0  36e38c54-b02b-4219-94d1-8da3bb5e14d9   
4  EXP_ID_921271  ce75bd3c-9abe-4e4c-aaea-3f1fa1361886   

                        result                   result.trace.id  \
0  **Privacy Policy Question**  a3f5ca6ead158a43e56fff8bf830971b   
1          **Billing Inquiry**  8945406b1fac08b576bcf6804309e004   
2           **Product Return**  0d7765ab60304021fe36e8b08dca337c   
3          **Billing Inquiry**  b43f27f77812dec7d8932dfc1a533e64   
4          **Feature Request**  bc63716b4fffb357890a15ca9692695b   

   result.trace.timestamp  eval.output_evaluator.score  \
0           1756905852002                          1.0   
1           1756905853128                          1.0 

running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:03<00:00 |  3.88it/s


✅ Training accuracy: 0.8292682926829268
🔍 Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 123 examples in 1 batches
   ✅ Batch 1/1: Optimized
arize.utils.logging | INFO | 🧪 Experiment started.


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:26<00:00 |  1.18it/s

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:27 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:28<00:00 |  1.10it/s


arize.utils.logging | INFO | ✅ All evaluators completed.


running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:16<00:00 |  1.91it/s


              id                            example_id  \
0  EXP_ID_482a23  e229198e-c98b-4dee-a5b8-a6f3f36d18e4   
1  EXP_ID_91969e  0e780f04-2457-469c-9bc9-954db24b3583   
2  EXP_ID_a2c4a3  abe65d8c-297a-4759-b187-2f48cd510b3d   
3  EXP_ID_1abacc  8287e65d-8fb9-470f-9bcf-cc68491f230a   
4  EXP_ID_a51a15  8fc5bd05-f4fa-41b4-894e-2d76955a5cf1   

                           result                   result.trace.id  \
0                    Login Issues  2f97ae33e5b4ffe9f7f50da6c5089f1a   
1  Subscription Upgrade/Downgrade  71d71ce8ed8a3df4696bac06a9457f53   
2                     Data Export  b9c699da6ac2b22c8a3d27ff5376d8ee   
3                  Password Reset  2a22b58c8cbe61e9a0dd4bdddf174037   
4                    Login Issues  d11bd95ffa2564bf782d84aadcb0e171   

   result.trace.timestamp  eval.test_evaluator.score  \
0           1756906050492                        0.0   
1           1756906051558                        0.0   
2           1756906052501                        1.0   


running tasks |██████████| 123/123 (100.0%) | ⏳ 01:20<00:00 |  2.44it/s

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:29 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0         123     123         0


running tasks |██████████| 123/123 (100.0%) | ⏳ 01:21<00:00 |  1.51it/s


arize.utils.logging | INFO | ✅ All evaluators completed.
              id                            example_id  \
0  EXP_ID_c7b2e6  dd17a2a0-2912-4d4c-bf7e-b3b3027b75fe   
1  EXP_ID_456c75  c820437c-026e-4cf5-9e0a-1d190fa39fae   
2  EXP_ID_1529e6  787aa17a-da44-4c75-b4fd-cc303be0bd56   
3  EXP_ID_cfae9d  36e38c54-b02b-4219-94d1-8da3bb5e14d9   
4  EXP_ID_aaa517  ce75bd3c-9abe-4e4c-aaea-3f1fa1361886   

                    result                   result.trace.id  \
0  Privacy Policy Question  91b6ec2bf1eded6fb42939356b792c74   
1          Billing Inquiry  7b65b544896bfc7ab59968096bc4014f   
2       **Product Return**  c9e22a245f9b4e5f8a15e3a5206f4850   
3          Billing Inquiry  39906e374314920e46969e57f89052b3   
4          Feature Request  2cdbd3896a94862ab3b2a70f362fd3a1   

   result.trace.timestamp  eval.output_evaluator.score  \
0           1756906106208                          1.0   
1           1756906107240                          1.0   
2           175690610

running experiment evaluations |██████████| 246/246 (100.0%) | ⏳ 01:05<00:00 |  3.77it/s


✅ Training accuracy: 0.8130081300813008
🔍 Running annotator...
['query', 'ground_truth', 'created_at', 'updated_at', 'id', '__index_level_0__', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 123 examples in 1 batches
   ✅ Batch 1/1: Optimized
arize.utils.logging | INFO | 🧪 Experiment started.


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:27<00:00 |  1.15it/s

arize.utils.logging | INFO | ✅ Task runs completed.
Tasks Summary (09/03/25 06:32 AM -0700)
---------------------------------------
   n_examples  n_runs  n_errors
0          31      31         0


running tasks |██████████| 31/31 (100.0%) | ⏳ 00:29<00:00 |  1.06it/s


arize.utils.logging | INFO | ✅ All evaluators completed.


running experiment evaluations |██████████| 31/31 (100.0%) | ⏳ 00:17<00:00 |  1.81it/s


              id                            example_id  \
0  EXP_ID_83044b  e229198e-c98b-4dee-a5b8-a6f3f36d18e4   
1  EXP_ID_52ef60  0e780f04-2457-469c-9bc9-954db24b3583   
2  EXP_ID_d1e5b6  abe65d8c-297a-4759-b187-2f48cd510b3d   
3  EXP_ID_709e32  8287e65d-8fb9-470f-9bcf-cc68491f230a   
4  EXP_ID_69a659  8fc5bd05-f4fa-41b4-894e-2d76955a5cf1   

                           result                   result.trace.id  \
0                    Login Issues  35bd4b72241ec8aa9db44eb730deda39   
1  Subscription Upgrade/Downgrade  a0bd73ae0f449c51d116700e0dc7c267   
2                     Data Export  a547a17c4d280bb4e518008f0e28a6e4   
3                  Password Reset  e08015917c6acf5199dbe40e55d463c2   
4                    Login Issues  2fa27b32aeee195eb139f26f3b4ee918   

   result.trace.timestamp  eval.test_evaluator.score  \
0           1756906320028                        0.0   
1           1756906321039                        0.0   
2           1756906321997                        1.0   


# Prompt Optimized!

The code below picks the prompt with the highest accuracy on the test set and prints it, along with the initial test accuracy, the optimized test accuracy, and the difference between the two.

In [None]:
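# Find the index of the highest test accuracy.
# Assumption (inferred from the offsets below): result["test"][0] is the baseline
# score for the initial prompt, so entry i (i >= 1) pairs with
# result["prompts"][i - 1] and result["train"][i - 1].
best_idx = max(range(len(result["test"])), key=lambda i: result["test"][i])

# Retrieve values. Guard best_idx == 0: without it, index -1 would silently
# return the last optimized prompt even though the baseline scored highest.
if best_idx == 0:
    best_prompt = "(initial prompt; no optimized prompt beat the baseline)"
    best_train_acc = None
else:
    best_prompt = result["prompts"][best_idx - 1]
    best_train_acc = result["train"][best_idx - 1] if (best_idx - 1) < len(result["train"]) else None
best_test_acc = result["test"][best_idx]
initial_test_acc = result["test"][0]
initial_train_acc = result["train"][0] if result["train"] else None

# Print results
print("\n🔍 Best Prompt Found:")
print(best_prompt)
print(f"🧪 Initial Test Accuracy: {initial_test_acc}")
print(f"🧪 Optimized Test Accuracy: {best_test_acc} (Δ {best_test_acc - initial_test_acc:.4f})")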


🔍 Best Prompt Found:

support query: {query}
Account Creation
Login Issues
Password Reset
Two-Factor Authentication
Profile Updates
Billing Inquiry
Refund Request
Subscription Upgrade/Downgrade
Payment Method Update
Invoice Request
Order Status
Shipping Delay
Product Return
Warranty Claim
Technical Bug Report
Feature Request
Integration Help
Data Export
Security Concern
Terms of Service Question
Privacy Policy Question
Compliance Inquiry
Accessibility Support
Language Support
Mobile App Issue
Desktop App Issue
Email Notifications
Marketing Preferences
Beta Program Enrollment
General Feedback

Return just the category, no other text.

🧪 Initial Test Accuracy: 0.5806451612903226
🧪 Optimized Test Accuracy: 0.7096774193548387 (Δ 0.1290)
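
To read these accuracies as example counts: the test runs above report `n_examples` of 31, so each test example is worth roughly 3.2 percentage points. The following is a minimal sketch of that arithmetic, using only the two accuracies printed above. The variable names are illustrative and not part of the notebook.

```python
# Number of test examples, taken from the "Tasks Summary" output above.
n_test = 31

# Accuracies copied from the printed results.
initial_test_acc = 0.5806451612903226
optimized_test_acc = 0.7096774193548387

# Convert each accuracy back into a count of correctly classified examples.
initial_correct = round(initial_test_acc * n_test)
optimized_correct = round(optimized_test_acc * n_test)

print(f"Initial:   {initial_correct}/{n_test} correct")
print(f"Optimized: {optimized_correct}/{n_test} correct")
print(f"Gain:      {optimized_correct - initial_correct} examples")
```

Under these numbers the reported Δ of 0.1290 corresponds to 4 additional correctly classified test queries (18 of 31 rising to 22 of 31).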

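
The index bookkeeping in the selection cell is easy to get wrong, so here is a minimal sketch on made-up numbers. It assumes, based on the `best_idx - 1` offsets used in that cell, that `result["test"]` holds one more entry than `result["prompts"]`, with the extra entry at index 0 being the baseline score of the initial prompt. The helper `pick_best` and the toy values are hypothetical and are not part of the notebook or the optimizer's API.

```python
from typing import Optional, Tuple


def pick_best(result: dict) -> Tuple[int, Optional[str], float]:
    """Return (best_idx, prompt, test_acc) for the highest test accuracy.

    Assumes result["test"][0] is the baseline and result["test"][i] for
    i >= 1 was measured with result["prompts"][i - 1].
    """
    test = result["test"]
    best_idx = max(range(len(test)), key=lambda i: test[i])
    # Index 0 is the baseline, so there is no optimized prompt to return.
    prompt = None if best_idx == 0 else result["prompts"][best_idx - 1]
    return best_idx, prompt, test[best_idx]


# Toy data (hypothetical): a baseline score plus three optimization loops.
toy = {
    "test": [0.50, 0.60, 0.70, 0.65],
    "prompts": ["prompt v1", "prompt v2", "prompt v3"],
}
print(pick_best(toy))

# If the baseline is never beaten, the prompt comes back as None rather than
# prompts[-1], which a bare `best_idx - 1` lookup would return.
toy_baseline_wins = {"test": [0.80, 0.60, 0.70], "prompts": ["prompt v1", "prompt v2"]}
print(pick_best(toy_baseline_wins))
```

Returning `None` for the baseline case is one possible design choice; it makes the "nothing improved" outcome explicit instead of quietly reporting an unrelated prompt.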