# Improving Classification with LLMs using Prompt Learning

<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/phx.jpeg" width="800">

In this notebook we will leverage the PromptLearningOptimizer developed here at Arize to improve upon the accuracy of LLMs on classification tasks. Specifically we will be classifying support queries into 30 different classes, including

Account Creation

Login Issues

Password Reset

Two-Factor Authentication

Profile Updates

Billing Inquiry

Refund Request

and 24 more. 

You can view the dataset in datasets/support_queries.csv.

**Note: This notebook `phoenix_support_query_classification.ipynb` complements `support_query_classification.ipynb` by using Phoenix datasets, experiments, and prompt management for Prompt Learning. It's a more end to end way for you to visualize your iterative prompt improvement and see how it performs on train/test sets, and also leverages Phoenix methods for advanced features.**

In [None]:
%pip install arize-phoenix openai pandas

In [2]:
import sys, os, getpass
import openai
import pandas as pd
from openai import AsyncOpenAI
import re
import pandas as pd
import nest_asyncio
nest_asyncio.apply()

In [3]:
os.environ['OPENAI_API_KEY'] = getpass.getpass('OpenAI API Key:')
openai_client = AsyncOpenAI(api_key=os.environ['OPENAI_API_KEY'])

In [4]:
# Add parent directory to path
parent_dir = os.path.abspath(os.path.join(os.getcwd(), '..'))
if parent_dir not in sys.path:
    sys.path.append(parent_dir)

# **Setup**

In [5]:
import os
# If you're self-hosting Phoenix, change this value:
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = getpass.getpass('Phoenix Collector Endpoint:')

PHOENIX_API_KEY = getpass.getpass('Phoenix API Key:')
os.environ["PHOENIX_API_KEY"] = PHOENIX_API_KEY

from phoenix.client import Client
phoenix_client = Client()

## **Make train/test sets**

We use an 80/20 train/test split to train our prompt. The optimizer will use the training set to visualize and analyze its errors and successes, and make prompt updates based on these results. We will then test on the test set to see how that prompt performs on unseen data. 

We will be exporting these datasets to Phoenix. In Phoenix you will be able to view the experiments we run on the train/test sets.

In [7]:
data = pd.read_csv("../datasets/support_queries.csv")

train_set = data.sample(frac=0.7, random_state=42)
test_set = data.drop(train_set.index)

train_dataset = phoenix_client.datasets.create_dataset(
        name="training_data_support_query_classification_2",
        dataframe=train_set,
        input_keys=['query'],
        output_keys=['ground_truth'],
    )

test_dataset = phoenix_client.datasets.create_dataset(
        name="test_data_support_query_classification_2",
        dataframe=test_set,
        input_keys=['query'],
        output_keys=['ground_truth'],
    )
        

## **Base Prompt for Optimization**

This is our base prompt - our 0th iteration. This is the prompt we will be optimizing for our task.

We also upload our prompt to Phoenix. Phoenix Prompt Hub serves as a repository for your prompts. You will be able to view all iterations of your prompt as its optimized, along with some metrics.

In [8]:
from phoenix.client.types import PromptVersion

system_prompt = """
Account Creation
Login Issues
Password Reset
Two-Factor Authentication
Profile Updates
Billing Inquiry
Refund Request
Subscription Upgrade/Downgrade
Payment Method Update
Invoice Request
Order Status
Shipping Delay
Product Return
Warranty Claim
Technical Bug Report
Feature Request
Integration Help
Data Export
Security Concern
Terms of Service Question
Privacy Policy Question
Compliance Inquiry
Accessibility Support
Language Support
Mobile App Issue
Desktop App Issue
Email Notifications
Marketing Preferences
Beta Program Enrollment
General Feedback

Return just the category, no other text for the support query.
"""

def upload_prompt_phoenix(system_prompt, name, iteration, prompt_versions, train_metric, test_metric):
    prompt_version = PromptVersion(
        [{"role": "system", "content": system_prompt}],  # System message
        model_name="gpt-4o-mini-2024-07-18",  # Model being used
        description="Prompt for support query classification",
        model_provider="OPENAI"
    )

    # Create prompt in Phoenix
    initial_prompt_version = phoenix_client.prompts.create(
        name=name,
        version=prompt_version,
    )

    prompt_versions.append({
        "iteration": iteration,
        "prompt": system_prompt,
        "phoenix_id": initial_prompt_version.id if hasattr(initial_prompt_version, 'id') else None,
        "train_metric": train_metric,
        "test_metric": test_metric
    })
    return prompt_versions



## **Output Generator**

This function calls OpenAI with our prompt on every row of our dataset to generate outputs. It leverages llm_generate, a Phoenix function, for concurrency in calling LLMs. 

We return the output column, which contains outputs for every row of our dataset, or every support query in our dataset. 

In [9]:
def generate_task(system_prompt):

    async def output_task(input):
        response = await openai_client.chat.completions.create(
            model="gpt-4o-mini-2024-07-18",
            messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": f"query: {input.get('query')}"}],
        )
        return response.choices[0].message.content
    
    return output_task

In [10]:
def normalize(label):
        return label.strip().strip('"').strip("'").lower()

## **Evaluator**

In this section we define our LLM-as-judge eval. 

Prompt Learning works by generating natural language evaluations on your outputs. These evaluations help guide the prompt optimizer towards building an optimized prompt. 

You should spend time thinking about how to write an informative eval. Your eval makes or breaks this prompt optimizer. With helpful feedback, our prompt optimizer will be able to generate a stronger optimized prompt much more effectively than with sparse or unhelpful feedback. 

Below is a great example for building a strong eval. You can see that we return many evaluations, including
- **correctness**: correct/incorrect - whether the support query was classified correctly or incorrectly.

-  **explanation**: Brief explanation of why the predicted classification is correct or incorrect, referencing the correct label if relevant.

-  **confusion_reason**: If incorrect, explains why the model may have made this choice instead of the correct classification. Focuses on likely sources of confusion. If correct, 'no confusion'.

-  **error_type**: One of: 'broad_vs_specific', 'keyword_bias', 'multi_intent_confusion', 'ambiguous_query', 'off_topic', 'paraphrase_gap', 'other'. Use 'none' if correct. Include the definition of the chosen error type, which are passed into the evaluator's prompt. 

-  **evidence_span**: Exact phrase(s) from the query that strongly indicate the correct classification.

-  **prompt_fix_suggestion**: One clear instruction to add to the classifier prompt to prevent this error.

**Take a look at support_query_classification/evaluator_prompt.txt for the full prompt!**

Our evaluator leverages llm_generate once again to build these llm evals with concurrency. We use an output parser to ensure that our eval is returned in proper json format. 

In [None]:
from phoenix.evals import create_evaluator
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o")

SCHEMA = {
    "type": "object",
    "properties": {
        "correctness": {"type": "string", "enum": ["correct", "incorrect"]},
        "explanation": {"type": "string"},
        "confusion_reason": {"type": "string"},
        "error_type": {"type": "string"},
        "evidence_span": {"type": "string"},
        "prompt_fix_suggestion": {"type": "string"},
    },
    "required": [
        "correctness",
        "explanation",
        "confusion_reason",
        "error_type",
        "evidence_span",
        "prompt_fix_suggestion",
    ],
    "additionalProperties": False,
}

@create_evaluator(name="output_evaluator", source="llm")
def output_evaluator(query: str, ground_truth: str, output: str):
    with open("../prompts/support_query_classification/evaluator_prompt.txt", "r") as file:
        template = file.read()

    prompt = (
        template.replace("{query}", query)
            .replace("{ground_truth}", ground_truth)
            .replace("{output}", output)
    )
    obj = llm.generate_object(prompt=prompt, schema=SCHEMA)
    correctness = obj["correctness"]
    score = 1.0 if correctness == "correct" else 0.0
    explanation = (
        f'correctness: {correctness}; '
        f'explanation: {obj.get("explanation","")}; '
        f'confusion_reason: {obj.get("confusion_reason","")}; '
        f'error_type: {obj.get("error_type","")}; '
        f'evidence_span: {obj.get("evidence_span","")}; '
        f'prompt_fix_suggestion: {obj.get("prompt_fix_suggestion","")};'
    )
    return {"score": score, "label": correctness, "explanation": explanation}

async def test_evaluator(expected, output):
    return normalize(expected.get("ground_truth")) == normalize(output)

{'score': 0.0,
 'label': 'incorrect',
 'explanation': "correctness: incorrect; explanation: The predicted classification 'billing inquiry' does not match the correct classification 'refund request'. The query specifically asks about the 'refund policy,' which aligns closely with the intent of seeking information or action related to refunds.; confusion_reason: The model likely associated 'refund policy' with billing due to the financial aspect common to both refunds and billing inquiries, even though the specific intent was about refund processes.; error_type: broad_vs_specific → The model picked a broader category instead of the more specific correct one (or vice versa).; evidence_span: What’s the refund policy?; prompt_fix_suggestion: Emphasize specific refund-related language in the classifier as indicative of 'refund request' rather than grouping it under general billing inquiries.;"}

## Metrics

Below we define some metrics that will compute on each iteration of prompt optimization. It will help us measure how our classifier with the current iteration's prompt performs.

Specifically we use scikit learn for precision, recall, f1 score, and simple accuracy.

In [12]:
from sklearn.metrics import f1_score, precision_score, recall_score, accuracy_score
import requests
def compute_metric(experiment, scorer="accuracy", average="macro"):
    """
    Compute the requested classification metric from a Phoenix experiment.

    Args:
        experiment: an object with an .id field (Phoenix Experiment).
        scorer (str): one of "accuracy", "f1", "precision", "recall".
        average (str): averaging method for multi-class classification.
    
    Returns:
        float: computed metric value.
    """
    print(experiment)
    experiment_id = experiment["experiment_id"]
    url = f"{os.environ['PHOENIX_COLLECTOR_ENDPOINT']}/v1/experiments/{experiment_id}/json"
    headers = {
        "Authorization": f"Bearer {os.environ['PHOENIX_API_KEY']}"
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to fetch experiment data: {response.status_code} {response.text}")

    results = response.json()
    
    y_true = [normalize(entry["reference_output"]["ground_truth"]) for entry in results]
    y_pred = [normalize(entry["output"]) for entry in results]

    if scorer == "accuracy":
        return accuracy_score(y_true, y_pred)
    elif scorer == "f1":
        return f1_score(y_true, y_pred, zero_division=0, average=average)
    elif scorer == "precision":
        return precision_score(y_true, y_pred, zero_division=0, average=average)
    elif scorer == "recall":
        return recall_score(y_true, y_pred, zero_division=0, average=average)
    else:
        raise ValueError(f"Unknown scorer: {scorer}")

## Experiment Processor

This function pulls a Phoenix experiment and loads the data into a pandas dataframe so it can run through the optimizer.

Specifically it:
- Pulls the experiment data from Phoenix
- Adds the input column to the dataframe
- Adds the evals to the dataframe
- Adds the output to the dataframe
- Returns the dataframe

In [13]:
import ast

def process_experiment(experiment, train_set, input_column_name, output_column_name,feedback_columns = None):
    """
    Update existing columns in `train_set` with feedback from experiment annotations.

    Args:
        experiment_json (list): JSON data from experiment.
        train_set (pd.DataFrame): DataFrame that already contains the feedback columns.
        feedback_columns (list): List of feedback field names to update.
    
    Returns:
        pd.DataFrame: Updated DataFrame with values filled in from experiment annotations.
    """

    experiment_id = experiment["experiment_id"]
    url = f"{os.environ['PHOENIX_COLLECTOR_ENDPOINT']}/v1/experiments/{experiment_id}/json"
    headers = {
        "Authorization": f"Bearer {os.environ['PHOENIX_API_KEY']}"
    }

    response = requests.get(url, headers=headers)
    if response.status_code != 200:
        raise RuntimeError(f"Failed to fetch experiment data: {response.status_code} {response.text}")

    results = response.json()

    train_set["ground_truth"] = [None] * len(train_set)
    if feedback_columns:
        for col in feedback_columns:
            train_set[col] = [None] * len(train_set)

    train_set = train_set.reset_index(drop=True)

    for i, entry in enumerate(results):
        eval_output = entry["annotations"][0]["explanation"]
        train_set.loc[i, "ground_truth"] = entry["reference_output"]["ground_truth"]
        if feedback_columns:
            for item in eval_output.split(";"):
                key_value = item.split(":")
                if key_value[0].strip() in feedback_columns:
                    key, value = key_value[0].strip(), key_value[1].strip()
                    train_set.loc[i, key] = value

    if "output" in train_set.columns:
        train_set.rename(columns={"output": "ground_truth"}, inplace=True)

    train_set[output_column_name] = [entry.get("output") for entry in results]

    train_set[input_column_name] = [entry.get("input") for entry in results]
    
    return train_set


# Prompt Optimization Loop with Phoenix Experiments

This code implements an iterative prompt optimization system that uses **Phoenix experiments** to evaluate and improve prompts based on feedback from LLM evaluators.


## Overview

The `optimize_loop` function automates prompt engineering by:

- Evaluating prompts using Phoenix experiments  
- Collecting detailed feedback from LLM evaluators  
- Optimizing prompts via a learning-based optimizer  
- Iterating until the performance threshold is met or the loop limit is reached  


## Step-by-Step Breakdown

Each of these numbers are added as comments in the code.

### 1. Initialization

- Set up tracking variables:
  - `train_metrics`, `test_metrics`, `raw_dfs` for storing evaluation results
- Convert training dataset to a DataFrame for easy updates

### 2. Baseline Evaluation

- Run an initial experiment using the **test set**
- Establish a **baseline metric** (e.g., accuracy, F1) to compare against future improvements

### 3. Early Exit Check

- If the **initial prompt already meets the performance threshold**, skip further optimization to save time and compute

### 4. Main Optimization Loop

For each iteration (up to `loops`):

#### 4a. Run Training Experiment

- Execute the current prompt on the **training set**
- Use LLM evaluators to generate **natural language feedback**

#### 4b. Process Feedback

- Extract structured information from evaluator outputs:
  - Correctness
  - Explanation
  - Confusion reason
  - Error type
  - Prompt fix suggestions
- Update the training DataFrame with this feedback

#### 4c. Generate Learning Annotations

- Convert feedback into structured annotations for the optimizer to learn from
- This allows learning from evaluator insights in a consistent format

#### 4d. Optimize the Prompt

- Pass feedback to the **PromptLearningOptimizer**
- Generate an **improved prompt** that attempts to correct issues found in the previous iteration

#### 4e. Evaluate on Test Set

- Evaluate the updated prompt on the **held-out test set**
- Assess **generalization** beyond the training data

#### 4f. Track Metrics

- Log metrics for:
  - Training set performance
  - Test set performance
- Store raw results for further analysis or visualization

#### 4g. Convergence Check

- If the new prompt's test metric **meets or exceeds the threshold**, exit the loop early

In [None]:
from optimizer_sdk.prompt_learning_optimizer import PromptLearningOptimizer
from phoenix.client.experiments import async_run_experiment
import copy
import asyncio

prompt_name = "support_query_classification"

async def optimize_loop(
    train_dataset,
    test_dataset,
    system_prompt,
    evaluators,
    feedback_columns,
    threshold=1,
    loops=5,
    scorer="accuracy",
    prompt_versions=[],
):
    """
    scorer: one of "accuracy", "f1", "precision", "recall"
    """
    curr_loop = 1
    train_metrics = []
    test_metrics = []
    raw_dfs = []
    train_df = train_dataset.to_dataframe()

    print(f"🚀 Starting prompt optimization with {loops} iterations (scorer: {scorer}, threshold: {threshold})")
    
    print(f"�� Initial evaluation:")

    task = generate_task(system_prompt)

    initial_experiment = await async_run_experiment(
        dataset=test_dataset,
        task=task,
        evaluators=[test_evaluator]
    )

    initial_metric_value = compute_metric(initial_experiment, scorer)
    print(f"✅ Initial {scorer}: {initial_metric_value}")

    test_metrics.append(initial_metric_value)
    raw_dfs.append(copy.deepcopy(test_set))

    if initial_metric_value >= threshold:
        print("🎉 Initial prompt already meets threshold!")
        return {
            "train": train_metrics,
            "test": test_metrics,
            "prompt": prompt_versions,
            "raw": raw_dfs
        }
    
    prompt_versions = upload_prompt_phoenix(system_prompt, prompt_name, 0, [], None, initial_metric_value)

    # Initialize all feedback columns

    while loops > 0:
        print(f"📊 Loop {curr_loop}: Optimizing prompt...")
        
        task = generate_task(system_prompt)

        train_experiment = await async_run_experiment(
            dataset=train_dataset,
            task=task,
            evaluators=evaluators,
            concurrency=10
        )

        train_df = process_experiment(train_experiment, train_set, "query", "output", feedback_columns)

        optimizer = PromptLearningOptimizer(
            prompt=system_prompt,
            model_choice="gpt-4o",
            openai_api_key=os.getenv("OPENAI_API_KEY")
        )

        with open("../prompts/support_query_classification/annotations_prompt.txt", "r") as file:
            annotations_prompt = file.read()

        annotations = optimizer.create_annotation(
            system_prompt,
            ["query"],
            train_df,
            feedback_columns,
            [annotations_prompt],
            "output",
            "ground_truth"
        )

        system_prompt = optimizer.optimize(
            train_df,
            "output",
            feedback_columns=feedback_columns,
            context_size_k=90000,
            annotations=annotations,
        )
        train_metric_post_value = compute_metric(train_experiment, scorer)
        train_metrics.append(train_metric_post_value)

        test_experiment = await async_run_experiment(
            dataset=test_dataset,
            task=generate_task(system_prompt),
            evaluators=[test_evaluator]
        )
        test_metric_post_value = compute_metric(test_experiment, scorer)
        test_metrics.append(test_metric_post_value)

        print(f"✅ Train {scorer}: {train_metric_post_value}")
        print(f"✅ Test {scorer}: {test_metric_post_value}")

        prompt_versions = upload_prompt_phoenix(system_prompt, prompt_name, curr_loop, prompt_versions, train_metric_post_value, test_metric_post_value)

        if test_metric_post_value >= threshold:
            print("🎉 Prompt optimization met threshold!")
            break

        loops -= 1
        curr_loop += 1

    return {
        "train": train_metrics,
        "test": test_metrics,
        "prompt": prompt_versions,
        "raw": raw_dfs
    }

# Main execution - use asyncio.run() to run the async function
evaluators = [output_evaluator]
feedback_columns = ["correctness", "explanation", "confusion_reason", "error_type", "evidence_span", "prompt_fix_suggestion"]
result = asyncio.run(optimize_loop(train_dataset, test_dataset, system_prompt, evaluators, feedback_columns, loops=5, scorer="accuracy"))

🚀 Starting prompt optimization with 5 iterations (scorer: accuracy, threshold: 1)
�� Initial evaluation:
🧪 Experiment started.
📺 View dataset experiments: https://app.phoenix.arize.com/s/prompt-opt//datasets/RGF0YXNldDoxMjU=/experiments
🔗 View this experiment: https://app.phoenix.arize.com/s/prompt-opt//datasets/RGF0YXNldDoxMjU=/compare?experimentId=RXhwZXJpbWVudDozNjM=


running tasks |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 46 task runs and 46 evaluation runs
{'experiment_id': 'RXhwZXJpbWVudDozNjM=', 'dataset_id': 'RGF0YXNldDoxMjU=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI1', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4Nw==', 'output': 'Login Issues', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 32, 26, 436145, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 32, 26, 436018, tzinfo=datetime.timezone.utc), 'trace_id': '0b4874064d2f6418fb0d36cc3a8892e4', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyMzgyNQ==', 'experiment_id': 'RXhwZXJpbWVudDozNjM='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4OA==', 'output': 'Billing Inquiry', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 32, 26, 554584, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 32, 26, 554496, tzinfo=datetime.timezone.utc), 'trace_id': '80a8b9abba5c2f06d8e42a81fce89fd7', 'error': None,

running tasks |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 108 task runs and 108 evaluation runs
🔍 Running annotator...
['query', 'ground_truth', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 108 examples in 1 batches
   ✅ Batch 1/1: Optimized
{'experiment_id': 'RXhwZXJpbWVudDozNjQ=', 'dataset_id': 'RGF0YXNldDoxMjQ=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI0', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk3OQ==', 'output': 'Privacy Policy Question', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 32, 48, 798766, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 32, 48, 798701, tzinfo=datetime.timezone.utc), 'trace_id': 'b65ffc18aff015b2f82bac3f567a0856', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyMzg3NQ==', 'experiment_id': 'RXhwZXJpbWVudDozNjQ='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk4MA==', 'output': 'Billing

running tasks |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 46 task runs and 46 evaluation runs
{'experiment_id': 'RXhwZXJpbWVudDozNjU=', 'dataset_id': 'RGF0YXNldDoxMjU=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI1', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4Nw==', 'output': 'Account Creation', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 36, 48, 844271, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 36, 48, 844190, tzinfo=datetime.timezone.utc), 'trace_id': '334f5335f6b4a27a812b08dba0aa85df', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyMzk4MA==', 'experiment_id': 'RXhwZXJpbWVudDozNjU='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4OA==', 'output': 'Refund Request', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 36, 48, 845949, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 36, 48, 845917, tzinfo=datetime.timezone.utc), 'trace_id': 'c44470ea90ed4931085cf93246b024ee', 'error': No

running tasks |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 108 task runs and 108 evaluation runs
🔍 Running annotator...
['query', 'ground_truth', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 108 examples in 1 batches
   ✅ Batch 1/1: Optimized
{'experiment_id': 'RXhwZXJpbWVudDozNjY=', 'dataset_id': 'RGF0YXNldDoxMjQ=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI0', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk3OQ==', 'output': 'Privacy Policy Question', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 37, 10, 249141, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 37, 10, 249066, tzinfo=datetime.timezone.utc), 'trace_id': '8386c71daf6ead95c62f5b315a65f5a7', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyNDAyOQ==', 'experiment_id': 'RXhwZXJpbWVudDozNjY='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk4MA==', 'output': 'Billing

running tasks |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 46 task runs and 46 evaluation runs
{'experiment_id': 'RXhwZXJpbWVudDozNjc=', 'dataset_id': 'RGF0YXNldDoxMjU=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI1', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4Nw==', 'output': 'Account Creation', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 41, 48, 434898, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 41, 48, 434848, tzinfo=datetime.timezone.utc), 'trace_id': '4c3634513021ffd2ec8c11465f66f9bf', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyNDEzMg==', 'experiment_id': 'RXhwZXJpbWVudDozNjc='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4OA==', 'output': 'Billing Inquiry', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 41, 48, 436072, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 41, 48, 436030, tzinfo=datetime.timezone.utc), 'trace_id': '81f6af087c48f7688cbe5fd7fa424251', 'error': N

running tasks |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

Worker timeout, requeuing


Task exception was never retrieved
future: <Task finished name='Task-2048' coro=<AsyncExperiments._run_evaluations.<locals>.async_evaluate_run() done, defined at /opt/anaconda3/envs/base2/lib/python3.12/site-packages/phoenix/client/resources/experiments/__init__.py:1980> exception=ConnectTimeout('')>
Traceback (most recent call last):
  File "/opt/anaconda3/envs/base2/lib/python3.12/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/opt/anaconda3/envs/base2/lib/python3.12/site-packages/httpx/_transports/default.py", line 394, in handle_async_request
    resp = await self._pool.handle_async_request(req)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/anaconda3/envs/base2/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 256, in handle_async_request
    raise exc from None
  File "/opt/anaconda3/envs/base2/lib/python3.12/site-packages/httpcore/_async/connection_pool.py", line 236, in handle_async_re

Experiment completed with 108 task runs and 108 evaluation runs
🔍 Running annotator...
['query', 'ground_truth', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 108 examples in 1 batches
   ✅ Batch 1/1: Optimized
{'experiment_id': 'RXhwZXJpbWVudDozNjg=', 'dataset_id': 'RGF0YXNldDoxMjQ=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI0', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk3OQ==', 'output': 'Privacy Policy Question', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 42, 10, 823897, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 42, 10, 823852, tzinfo=datetime.timezone.utc), 'trace_id': 'e3d34c4cea4a7534e6ba624b7847006b', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyNDE3OQ==', 'experiment_id': 'RXhwZXJpbWVudDozNjg='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk4MA==', 'output': 'Billing

running tasks |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 46 task runs and 46 evaluation runs
{'experiment_id': 'RXhwZXJpbWVudDozNjk=', 'dataset_id': 'RGF0YXNldDoxMjU=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI1', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4Nw==', 'output': 'Account Creation', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 47, 21, 730846, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 47, 21, 730791, tzinfo=datetime.timezone.utc), 'trace_id': 'f5e9d6e26eaab09ed46000f341882d11', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyNDI4OA==', 'experiment_id': 'RXhwZXJpbWVudDozNjk='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4OA==', 'output': 'Billing Inquiry', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 47, 21, 732103, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 47, 21, 732068, tzinfo=datetime.timezone.utc), 'trace_id': '9eeae743389813d457e8ab25762c0b12', 'error': N

running tasks |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 108 task runs and 108 evaluation runs
🔍 Running annotator...
['query', 'ground_truth', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 108 examples in 1 batches
   ✅ Batch 1/1: Optimized
{'experiment_id': 'RXhwZXJpbWVudDozNzA=', 'dataset_id': 'RGF0YXNldDoxMjQ=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI0', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk3OQ==', 'output': 'Privacy Policy Question', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 47, 43, 902750, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 47, 43, 902673, tzinfo=datetime.timezone.utc), 'trace_id': '43aefeaecdfa10e43bd371619abbfaf5', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyNDM0MA==', 'experiment_id': 'RXhwZXJpbWVudDozNzA='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk4MA==', 'output': 'Billing

running tasks |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 46 task runs and 46 evaluation runs
{'experiment_id': 'RXhwZXJpbWVudDozNzE=', 'dataset_id': 'RGF0YXNldDoxMjU=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI1', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4Nw==', 'output': 'Account Creation', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 52, 19, 318984, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 52, 19, 318932, tzinfo=datetime.timezone.utc), 'trace_id': '107eb905aec55056e23d1d9b0fca0439', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyNDQ0Mg==', 'experiment_id': 'RXhwZXJpbWVudDozNzE='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4OA==', 'output': 'Billing Inquiry', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 52, 19, 320419, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 52, 19, 320374, tzinfo=datetime.timezone.utc), 'trace_id': '03d9ce053ad684eaec2b21477c5fbd38', 'error': N

running tasks |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/108 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 108 task runs and 108 evaluation runs
🔍 Running annotator...
['query', 'ground_truth', 'correctness', 'explanation', 'confusion_reason', 'error_type', 'evidence_span', 'prompt_fix_suggestion', 'output']

🔧 Creating batches with 90,000 token limit
📊 Processing 108 examples in 1 batches
   ✅ Batch 1/1: Optimized
{'experiment_id': 'RXhwZXJpbWVudDozNzI=', 'dataset_id': 'RGF0YXNldDoxMjQ=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI0', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk3OQ==', 'output': 'Privacy Policy Question', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 52, 40, 22992, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 52, 40, 22905, tzinfo=datetime.timezone.utc), 'trace_id': '263fa310f8c72f37bc5fe77028f41cc1', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyNDQ5Mw==', 'experiment_id': 'RXhwZXJpbWVudDozNzI='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6Nzk4MA==', 'output': 'Billing I

running tasks |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

✅ Task runs completed.
🧠 Evaluation started.


running experiment evaluations |          | 0/46 (0.0%) | ⏳ 00:00<? | ?it/s

Experiment completed with 46 task runs and 46 evaluation runs
{'experiment_id': 'RXhwZXJpbWVudDozNzM=', 'dataset_id': 'RGF0YXNldDoxMjU=', 'dataset_version_id': 'RGF0YXNldFZlcnNpb246MTI1', 'task_runs': [{'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4Nw==', 'output': 'Login Issues', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 57, 10, 469135, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 57, 10, 469001, tzinfo=datetime.timezone.utc), 'trace_id': '1fb6d45526d905a3777b20a73b04e805', 'error': None, 'id': 'RXhwZXJpbWVudFJ1bjoyNDU5NQ==', 'experiment_id': 'RXhwZXJpbWVudDozNzM='}, {'dataset_example_id': 'RGF0YXNldEV4YW1wbGU6ODA4OA==', 'output': 'Billing Inquiry', 'repetition_number': 1, 'start_time': datetime.datetime(2025, 8, 25, 20, 57, 10, 471258, tzinfo=datetime.timezone.utc), 'end_time': datetime.datetime(2025, 8, 25, 20, 57, 10, 471226, tzinfo=datetime.timezone.utc), 'trace_id': 'fe799fb34f0104942f6e4d3a2f373b02', 'error': None,

# Prompt Optimized!

The code below picks the prompt with the highest score on the test set, and displays the training/test metrics and delta for that prompt.

In [None]:
# Find the best index based on highest test accuracy
best_idx = max(range(len(result["test"])), key=lambda i: result["test"][i])

# Retrieve values
best_prompt = result["prompt"][best_idx - 1]
best_test_acc = result["test"][best_idx]
best_train_acc = result["train"][best_idx - 1] if (best_idx - 1) < len(result["train"]) else None
initial_test_acc = result["test"][0]
initial_train_acc = result["train"][0] if result["train"] else None

# Print results
print("\n🔍 Best Prompt Found:")
print(best_prompt)
print(f"🧪 Initial Test Accuracy: {initial_test_acc}")
print(f"🧪 Optimized Test Accuracy: {best_test_acc} (Δ {best_test_acc - initial_test_acc:.4f})")


🔍 Best Prompt Found:
{'iteration': 0, 'prompt': '\nsupport query: {query}\nAccount Creation\nLogin Issues\nPassword Reset\nTwo-Factor Authentication\nProfile Updates\nBilling Inquiry\nRefund Request\nSubscription Upgrade/Downgrade\nPayment Method Update\nInvoice Request\nOrder Status\nShipping Delay\nProduct Return\nWarranty Claim\nTechnical Bug Report\nFeature Request\nIntegration Help\nData Export\nSecurity Concern\nTerms of Service Question\nPrivacy Policy Question\nCompliance Inquiry\nAccessibility Support\nLanguage Support\nMobile App Issue\nDesktop App Issue\nEmail Notifications\nMarketing Preferences\nBeta Program Enrollment\nGeneral Feedback\n\nReturn just the category, no other text.\n', 'phoenix_id': 'UHJvbXB0VmVyc2lvbjoxNjU=', 'train_metric': None, 'test_metric': 0.5483870967741935}
🧪 Initial Test Accuracy: 0.5483870967741935
🧪 Optimized Test Accuracy: 0.7096774193548387 (Δ 0.1613)
