# HW 3: LLM-as-Judge for Recipe Bot Evaluation with Arize

## 🎯 Assignment Overview

In this assignment, we'll evaluate our Recipe Bot's adherence to dietary preferences using an LLM-as-Judge approach with Arize for tracing and evaluation.

### Workflow:
1. **📊 Load trace examples** - Choose between provided data or generate new traces
2. **🏷️ Create datasets** - Prepare data for the labeling queue in Arize
3. **🔍 Label traces** - Use Arize UI to manually label examples  
4. **⚖️ Write eval prompt** - Create judge prompt in Arize Playground
5. **📈 Run evaluation experiment** - Execute evaluation via Arize
6. **📊 Calculate metrics** - Export and analyze results

### Core Task: "Adherence to Dietary Preferences"
**Example**: If a user asks for a "vegan" recipe, does the bot provide one that is actually vegan?

Let's get started! 🚀


## 🔧 Setup and Environment Configuration

First, let's import the required libraries and set up our environment.


In [None]:
# Install required packages
import subprocess
import sys


def install_packages():
    packages = [
        "arize-phoenix[evals]",
        "openai",
        "pandas",
        "numpy",
        "scipy",
        "openinference-instrumentation-openai",
        "nest-asyncio",
        "arize[AutoEmbeddings]",  # For ArizeExportClient and ArizeDatasetsClient
        "opentelemetry-api",
        "opentelemetry-sdk",
        "judgy",
    ]

    for package in packages:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])


# Install the packages (comment out this call if they are already installed)
install_packages()

In [69]:
# Setup
import getpass
import os
from datetime import datetime, timedelta
from pathlib import Path

import numpy as np
import openai
import pandas as pd
from arize.experimental.datasets import ArizeDatasetsClient
from arize.experimental.datasets.utils.constants import GENERATIVE
from arize.exporter import ArizeExportClient
from arize.otel import register
from arize.utils.types import Environments
from judgy import estimate_success_rate
from openinference.instrumentation.openai import OpenAIInstrumentor

### 🔑 API Key Configuration



In [2]:
# Prompt for OpenAI API key if not set
if "OPENAI_API_KEY" not in os.environ or not os.environ["OPENAI_API_KEY"]:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OPENAI_API_KEY: ")

# Prompt for Arize API key if not set
if "ARIZE_API_KEY" not in os.environ or not os.environ["ARIZE_API_KEY"]:
    os.environ["ARIZE_API_KEY"] = getpass.getpass("Enter your ARIZE_API_KEY: ")

# Prompt for Arize Space ID if not set
if "ARIZE_SPACE_ID" not in os.environ or not os.environ["ARIZE_SPACE_ID"]:
    os.environ["ARIZE_SPACE_ID"] = getpass.getpass("Enter your ARIZE_SPACE_ID: ")

### Tracing Setup

We'll set up OpenTelemetry tracing to automatically capture LLM interactions and send them to Arize. This enables real-time monitoring and evaluation of our Recipe Bot's performance in production.

In [None]:
# Initialize OpenAI client
client = openai.OpenAI()

print("✅ Setup complete!")

In [22]:
# Set up tracing
tracer_provider = register(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name="RecipeBot",  # name this to whatever you would like
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Overriding of current TracerProvider is not allowed
Attempting to instrument while already instrumented


🔭 OpenTelemetry Tracing Details 🔭
|  Arize Project: RecipeBot
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: otlp.arize.com
|  Transport: gRPC
|  Transport Headers: {'authorization': '****', 'api_key': '****', 'arize-space-id': '****', 'space_id': '****', 'arize-interface': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



## Part 1: Create Arize Dataset for Labeling 

We'll upload a sample of our traces to Arize for manual labeling of ground truth examples. This step is crucial for establishing the "correct" answers that our judge will be evaluated against.


In [3]:
traces_path = Path("homeworks/hw3/data/raw_traces.csv")

traces_df = pd.read_csv(traces_path)

traces_df.head()

Unnamed: 0,query,dietary_restriction,response,success,error,trace_id,query_id
0,I'm vegan but I really want to make something ...,vegan,Certainly! For a vegan yogurt breakfast that m...,True,,1_8,1
1,I'm vegan but I really want to make something ...,vegan,Absolutely! While honey is a popular sweetener...,True,,1_9,1
2,I'm vegan but I really want to make something ...,vegan,Certainly! Since you're vegan and craving a yo...,True,,1_10,1
3,Need a quick gluten-free breakfast. I hate egg...,gluten-free,"Certainly! For a quick, gluten-free breakfast ...",True,,2_7,2
4,I'm vegan but I really want to make something ...,vegan,Absolutely! For a vegan breakfast that mimics ...,True,,1_27,1


In [4]:
# Create a test dataset in Arize from a random sample of 100 traces
datasets_client = ArizeDatasetsClient(api_key=os.environ["ARIZE_API_KEY"])

sample = traces_df.sample(n=100, random_state=42)

dataset_id = datasets_client.create_dataset(
    space_id=os.environ["ARIZE_SPACE_ID"],
    dataset_name="RecipeBot - testing",
    data=sample,
    dataset_type=GENERATIVE,
)



### Alternative Option: Send in Traces
Alternatively, you can send in traces generated from `dietary_queries.csv`.

In [None]:
# from opentelemetry import trace


# # Load dietary queries
# queries_path = Path("homeworks/hw3/data/dietary_queries.csv")
# queries_df = pd.read_csv(queries_path)

# # Example with a single query
# single_query = queries_df['query'].iloc[1]  # Use a different example
# dietary_restriction = queries_df['dietary_restriction'].iloc[1]

# # Make the OpenAI call (which will be auto-instrumented)
# single_response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[
#         {"role": "system", "content": system_prompt},
#         {"role": "user", "content": single_query}
#     ],
#     temperature=0.7
# )

# # Get the current span and add metadata to it
# current_span = trace.get_current_span()
# if current_span:
#     current_span.set_attribute("dietary_restriction", dietary_restriction)
#     current_span.set_attribute("query_id", int(queries_df['id'].iloc[1]))
#     current_span.set_attribute("use_case", "alternative_approach")

# print("Query:", single_query)
# print("Dietary Restriction:", dietary_restriction)
# response_content = single_response.choices[0].message.content
# if response_content:
#     print("Response snippet:", response_content[:200] + "...")
# else:
#     print("No response content available")


## Part 2: Prepare Data for Arize Labeling

Take your traces and prepare them for manual labeling in Arize.


### 📝 Labeling Criteria - Dietary Adherence

**CORRECT**: Recipe correctly follows all specified dietary restrictions  
**INCORRECT**: Recipe violates any specified dietary restrictions

**Examples:**
- ✅ CORRECT: 'vegan pasta' → recipe with nutritional yeast (no dairy)
- ❌ INCORRECT: 'vegan pasta' → recipe suggests honey (not vegan)  
- ✅ CORRECT: 'gluten-free bread' → recipe with almond flour
- ❌ INCORRECT: 'gluten-free bread' → recipe with regular flour


## Part 3: Create LLM as Judge Prompt

🎯 **Complete these steps in Arize:**

1. **🏷️ Label Rows**: Review dataset and annotate rows.

2. **⚖️ Develop Judge Prompt**: Create evaluation prompt for dietary adherence

3. **🧪 Test Evaluation**: Run judge prompt against ground truth labels in the playground

4. **🚀 Review Experiment**: Review evaluation experiment and iterate 

⏳ **Come back here after completing Arize work!**
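As a starting point for step 2, a judge prompt could look something like the sketch below. The wording, the `JUDGE_PROMPT_TEMPLATE` and `build_judge_prompt` names, and the `{query}`, `{dietary_restriction}`, and `{response}` placeholders are assumptions for illustration (they mirror the dataset columns). This is not the prompt that produced the results in this notebook, and the Arize Playground uses its own variable syntax, so adapt the placeholders accordingly.

```python
# Hypothetical judge prompt; the name and wording are illustrative assumptions.
JUDGE_PROMPT_TEMPLATE = """You are a dietary compliance reviewer.

Decide whether the recipe below follows the stated dietary restriction.

User query: {query}
Dietary restriction: {dietary_restriction}
Recipe response: {response}

Label the response "correct" if every ingredient and step is compatible with
the restriction, and "incorrect" if anything violates it.
Explain your reasoning first, then give the label."""


def build_judge_prompt(query: str, dietary_restriction: str, response: str) -> str:
    """Fill the template with one row of the labeled dataset."""
    return JUDGE_PROMPT_TEMPLATE.format(
        query=query, dietary_restriction=dietary_restriction, response=response
    )


example_prompt = build_judge_prompt(
    query="vegan pasta",
    dietary_restriction="vegan",
    response="Toss the pasta with olive oil, garlic, and nutritional yeast.",
)
print(example_prompt)
```

The two labels match the `enum` values in the function definition used for the eval template, so the judge's tool call can be compared directly with the human annotations.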


In [None]:
# Function definition for the eval template
#  [
#   {
#     "type": "function",
#     "function": {
#       "name": "record_response",
#       "description": "A function to record your response.",
#       "parameters": {
#         "type": "object",
#         "properties": {
#           "explanation": {
#             "type": "string",
#             "description": "Explanation of the reasoning for your response."
#           },
#           "response": {
#             "type": "string",
#             "description": "Your response.",
#             "enum": [
#               "correct",
#               "incorrect"
#             ]
#           }
#         },
#         "additionalProperties": false
#       }
#     }
#   }
# ]

### Optional: Programmatic Evaluation with `llm_classify` and Experiments


Instead of building and testing the eval in the Arize UI, you can use the [`llm_classify`](https://arize.com/docs/ax/evaluate/online-evals/log-evaluations-to-arize) function in code. 

You can also run a full evaluation experiment programmatically using the [Arize Experiments API](https://arize.com/docs/ax/develop/datasets-and-experiments/run-experiments), which lets you compare LLM judge results to ground truth and analyze performance—all in code.

This approach is useful if you want to scale up, iterate quickly, or integrate evaluation into your ML pipeline.

See the next code cell for an example of how to use `llm_classify`.


In [None]:
# from phoenix.evals import OpenAIModel, llm_classify

# # Rails constrain the judge's output to a fixed set of allowed labels.
# # llm_classify strips stray text (such as ",,," or "...") around the label,
# # so only the binary value the template asks for is returned.
# rails = ["Correct", "Incorrect"]
# # A multi-class eval would use, e.g., rails = ["irrelevant", "relevant", "semi-relevant"]
# # CATEGORICAL_TEMPLATE is a placeholder for your own judge prompt template
# eval_df = llm_classify(
#     dataframe=<YOUR_DATAFRAME_GOES_HERE>,
#     template=CATEGORICAL_TEMPLATE,
#     model=OpenAIModel('gpt-4o', api_key=''),
#     rails=rails
# )

## Part 4: Load Results & Calculate Metrics

After running your evaluation experiment in Arize, we'll load the results and compute performance metrics.


In [57]:
dataset_id = "RGF0YXNldDozMDI2MDE6S0RCQg=="
dataset_data = datasets_client.get_dataset(os.environ["ARIZE_SPACE_ID"], dataset_id=dataset_id)

dataset_data.head()

Unnamed: 0,query,dietary_restriction,response,success,error,trace_id,query_id,created_at,updated_at,id,userannotation.VXNlcjoxMTE4NzpnUkY0.Correctness.label,userannotation.VXNlcjoxMTE4NzpnUkY0.Correctness.updated_by,userannotation.VXNlcjoxMTE4NzpnUkY0.Correctness.updated_at,latest.userannotation.Correctness.label,latest.userannotation.Correctness.updated_by,latest.userannotation.Correctness.updated_at
0,Gluten-free pizza dough that actually tastes g...,gluten-free,Absolutely! Here's a delicious gluten-free piz...,True,,14_4,14,1754271594950,1754271594950,e4f2de5c-3e80-42f8-bff4-2ee69cffcb2c,Correct,SallyAnn DeLucia,1754335000000.0,Correct,SallyAnn DeLucia,1754335000000.0
1,I want to make a birthday cake but I'm diabeti...,diabetic-friendly,Certainly! Here's a delicious and diabetic-fri...,True,,7_22,7,1754271594950,1754271594950,a3e264e6-05a7-4207-8efa-7338492e3a37,Correct,SallyAnn DeLucia,1754334000000.0,Correct,SallyAnn DeLucia,1754334000000.0
2,Gluten-light recipe - I'm not celiac just sens...,gluten-free,Let's make a delicious **Garlic Herb Shrimp wi...,True,,48_33,48,1754271594950,1754271594950,d27ee86d-ab37-4429-9ebb-71e16ccb3e49,Correct,SallyAnn DeLucia,1754335000000.0,Correct,SallyAnn DeLucia,1754335000000.0
3,Low-carb pasta substitute that my Italian gran...,low-carb,Absolutely! Let me introduce you to a delightf...,True,,25_29,25,1754271594950,1754271594950,75da3ba3-ad5c-4bc8-aed3-7c2b1639109e,Correct,SallyAnn DeLucia,1754277000000.0,Correct,SallyAnn DeLucia,1754277000000.0
4,Something keto-ish but not super strict,keto,Great choice! Let me recommend a delicious **K...,True,,45_10,45,1754271594950,1754271594950,255b853c-f50e-4724-9678-9f430edf0096,Correct,SallyAnn DeLucia,1754277000000.0,Correct,SallyAnn DeLucia,1754277000000.0


In [58]:
experiment_id = "RXhwZXJpbWVudDoyMjU3MTozWjBP"
# Fetch the results of the judge experiment
experiments_data = datasets_client.get_experiment(
    os.environ["ARIZE_SPACE_ID"], experiment_id=experiment_id
)

experiments_data.head()

Unnamed: 0,output,example_id,id,count,template,invocation_parameters,tool_choice,tool_options,model_name,model_provider,eval.Label Match.label,eval.Label Match.score,eval.Label Match.explanation
0,"{""id"":""chatcmpl-C1QbT1DQVEevUDs5qbMjVWGQoVZqP""...",e4f2de5c-3e80-42f8-bff4-2ee69cffcb2c,EXP_ID_2794be,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o,openAI,match,1.0,The Output from the LLM judge states that the ...
1,"{""id"":""chatcmpl-C1QbTCeR1QZVMDg07ryjsEDAUPtXG""...",a3e264e6-05a7-4207-8efa-7338492e3a37,EXP_ID_ba5512,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o,openAI,match,1.0,1. The Output provides an explanation that the...
2,"{""id"":""chatcmpl-C1QbT0C0IVbs8RUquvRBg5MuUQWyr""...",d27ee86d-ab37-4429-9ebb-71e16ccb3e49,EXP_ID_383ffc,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o,openAI,match,1.0,1. The Output provides an explanation that the...
3,"{""id"":""chatcmpl-C1QbTnpRMDkSOiQYDJnIWOob1Pwes""...",75da3ba3-ad5c-4bc8-aed3-7c2b1639109e,EXP_ID_ba41c9,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o,openAI,match,1.0,1. The Output contains a tool call with an arg...
4,"{""id"":""chatcmpl-C1QbTIHfWZgsLFKXBnV7uiu6W8ukq""...",255b853c-f50e-4724-9678-9f430edf0096,EXP_ID_a6c481,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o,openAI,match,1.0,1. The Output provides an explanation that the...


In [59]:
# Join experiments_data (aliased as e) and dataset_data (aliased as d) on e.example_id = d.id
joined_df = experiments_data.merge(
    dataset_data, left_on="example_id", right_on="id", suffixes=("_e", "_d")
)
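Before relying on the joined frame, it can help to confirm that every experiment run found its dataset row. The sketch below shows one way to do that with the `indicator` and `validate` options of `DataFrame.merge`. The `toy_experiments` and `toy_dataset` frames are made-up stand-ins for `experiments_data` and `dataset_data`, not real exports.

```python
import pandas as pd

# Toy stand-ins for experiments_data and dataset_data (values are made up).
toy_experiments = pd.DataFrame(
    {"example_id": ["a", "b", "c"], "id": ["EXP_1", "EXP_2", "EXP_3"]}
)
toy_dataset = pd.DataFrame({"id": ["a", "b", "d"], "query": ["q1", "q2", "q4"]})

checked = toy_experiments.merge(
    toy_dataset,
    left_on="example_id",
    right_on="id",
    how="left",  # keep every experiment run, matched or not
    suffixes=("_e", "_d"),
    indicator=True,  # adds a `_merge` column: "both" or "left_only"
    validate="many_to_one",  # raises if dataset ids are not unique
)

# Rows marked "left_only" are experiment runs with no matching dataset example.
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(checked)} experiment rows have no dataset row")
```

The default inner join used in the notebook silently drops unmatched rows, so a check like this makes any shrinkage visible before metrics are computed.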

In [61]:
import json


def extract_label_from_output(output_str):
    """
    Extract the 'response' field from the tool_calls in the output JSON.
    Returns the first 'response' value found, or None if not found.
    """
    try:
        output_json = json.loads(output_str)
        # Traverse to choices[0].message.tool_calls
        choices = output_json.get("choices", [])
        for choice in choices:
            message = choice.get("message", {})
            tool_calls = message.get("tool_calls", [])
            for tool_call in tool_calls:
                function = tool_call.get("function", {})
                arguments_str = function.get("arguments", "")
                # arguments is a JSON string, so parse it
                try:
                    arguments = json.loads(arguments_str)
                    if "response" in arguments:
                        return arguments["response"]
                except Exception:
                    continue
        return None
    except Exception:
        return None


joined_df["parsed_label"] = joined_df["output"].apply(extract_label_from_output)

final_df = joined_df[
    [
        "parsed_label",
        "eval.Label Match.label",
        "eval.Label Match.score",
        "query",
        "dietary_restriction",
        "response",
        "latest.userannotation.Correctness.label",
    ]
]
final_df.head()

Unnamed: 0,parsed_label,eval.Label Match.label,eval.Label Match.score,query,dietary_restriction,response,latest.userannotation.Correctness.label
0,correct,match,1.0,Gluten-free pizza dough that actually tastes g...,gluten-free,Absolutely! Here's a delicious gluten-free piz...,Correct
1,correct,match,1.0,I want to make a birthday cake but I'm diabeti...,diabetic-friendly,Certainly! Here's a delicious and diabetic-fri...,Correct
2,correct,match,1.0,Gluten-light recipe - I'm not celiac just sens...,gluten-free,Let's make a delicious **Garlic Herb Shrimp wi...,Correct
3,correct,match,1.0,Low-carb pasta substitute that my Italian gran...,low-carb,Absolutely! Let me introduce you to a delightf...,Correct
4,correct,match,1.0,Something keto-ish but not super strict,keto,Great choice! Let me recommend a delicious **K...,Correct


### 📊 Judge Performance Analysis

Let's evaluate how well our LLM judge performed compared to human ground truth labels.


In [62]:
# Calculate judge performance metrics using final_df


def to_binary(label):
    """Convert text labels to binary (1 for correct/match, 0 for incorrect/mismatch)"""
    if pd.isna(label):
        return None
    label_str = str(label).strip().lower()
    # Handle both correctness labels and match/mismatch labels
    if label_str in ["correct", "match"]:
        return 1
    elif label_str in ["incorrect", "mismatch"]:
        return 0
    return None


# Extract ground truth from human annotations
ground_truth_labels = final_df["latest.userannotation.Correctness.label"]
ground_truth = [to_binary(label) for label in ground_truth_labels]

# Extract judge predictions from parsed_label (eval template output)
judge_pred_labels = final_df["parsed_label"]
judge_preds = [to_binary(label) for label in judge_pred_labels]

# Only keep valid pairs where both ground truth and predictions are available
valid = [
    (gt, pred) for gt, pred in zip(ground_truth, judge_preds) if gt is not None and pred is not None
]

if valid:
    gt, pred = zip(*valid)
    tp = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0
    accuracy = (tp + tn) / len(valid)

    print("📊 Judge Performance Metrics:")
    print(f"   True Positive Rate (TPR): {tpr:.3f}")
    print(f"   True Negative Rate (TNR): {tnr:.3f}")
    print(f"   Accuracy: {accuracy:.3f}")
    print(f"   Total number of valid label pairs: {len(valid)}")
    print(f"   True Positives: {tp}, True Negatives: {tn}")
    print(f"   False Positives: {fp}, False Negatives: {fn}")

    metrics = {"tpr": tpr, "tnr": tnr, "accuracy": accuracy}
else:
    print("❌ No valid label pairs found in final_df")
    print("   Ground truth labels:", ground_truth[:5])
    print("   Judge predictions:", judge_preds[:5])

📊 Judge Performance Metrics:
   True Positive Rate (TPR): 0.890
   True Negative Rate (TNR): 0.333
   Accuracy: 0.840
   Total number of valid label pairs: 100
   True Positives: 81, True Negatives: 3
   False Positives: 6, False Negatives: 10
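The printed rates can be re-derived by hand from the four counts, which is a quick way to check the cell above. The sketch below also adds balanced accuracy, which is not part of the original cell and is included only as an illustration. With just 9 human-labeled "incorrect" examples (3 true negatives plus 6 false positives), the TNR rests on very little data.

```python
# Counts copied from the printed output above.
tp, tn, fp, fn = 81, 3, 6, 10

tpr = tp / (tp + fn)  # share of human-"correct" rows the judge also passed
tnr = tn / (tn + fp)  # share of human-"incorrect" rows the judge also failed
accuracy = (tp + tn) / (tp + tn + fp + fn)

# Balanced accuracy averages TPR and TNR so the majority class cannot dominate.
balanced_accuracy = (tpr + tnr) / 2

print(
    f"TPR={tpr:.3f} TNR={tnr:.3f} "
    f"accuracy={accuracy:.3f} balanced={balanced_accuracy:.3f}"
)
```

Accuracy looks healthy mainly because 91 of the 100 labeled rows are "correct"; the low TNR shows the judge rarely catches real violations.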


## Part 5: Evaluate Live Traces

After testing and validating our evaluation template, we're ready to use it in a production setting.
First, set up the online evaluation task in the platform using the template.
Once that's done, you can send in traces, and the evaluation will run automatically on each trace.

In [3]:
# Initialize OpenAI client
client = openai.OpenAI()

print("✅ Setup complete!")

✅ Setup complete!


In [4]:
# Set up tracing
tracer_provider = register(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name="RecipeBot",  # name this to whatever you would like
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

🔭 OpenTelemetry Tracing Details 🔭
|  Arize Project: RecipeBot
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: otlp.arize.com
|  Transport: gRPC
|  Transport Headers: {'authorization': '****', 'api_key': '****', 'arize-space-id': '****', 'space_id': '****', 'arize-interface': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



In [15]:
# Load dietary queries
queries_path = Path("homeworks/hw3/data/dietary_queries.csv")
queries_df = pd.read_csv(queries_path)

queries_df.head()

Unnamed: 0,id,query,dietary_restriction
0,1,I'm vegan but I really want to make something ...,vegan
1,2,Need a quick gluten-free breakfast. I hate egg...,gluten-free
2,3,Keto breakfast that I can meal prep for the week,keto
3,4,I'm dairy-free and also can't stand the taste ...,dairy-free
4,5,Vegetarian pizza but I don't like mushrooms or...,vegetarian


In [7]:
from arize.experimental.prompt_hub import ArizePromptClient

prompt_client = ArizePromptClient(
    space_id=os.environ["ARIZE_SPACE_ID"], api_key=os.environ["ARIZE_API_KEY"]
)

prompt = prompt_client.pull_prompt(prompt_name="RecipeBot System Prompt")

system_prompt = prompt.messages[0]["content"]

print(system_prompt)

# alternatively you can just assign the system prompt here

You are a helpful, accurate, and creative recipe assistant. Your job is to generate easy-to-follow, reliable recipes and cooking advice tailored to the user query below.

Core Responsibilities:
- Always include an ingredient list with precise measurements in standard US or metric units.
- Always include clear, numbered, step-by-step instructions that are logically ordered and easy to follow.
- Always structure your response in Markdown.

Ingredient Guidelines:
- Never suggest rare, expensive, or difficult-to-obtain ingredients without clearly providing readily available substitutions.
- Be specific with ingredients (e.g., “1 cup unsweetened almond milk” instead of “milk”).

Instructional Guidelines:
- Do not skip steps or assume prior knowledge.
- Use direct, instructional language.
- Include preparation and cook time only if reliably known.

Behavior & Ethics:
- Never include unsafe, unethical, or harmful suggestions. Politely decline and explain briefly if a request cannot be fulfill

In [8]:
from opentelemetry import trace

# Example with a single query
single_query = queries_df["query"].iloc[1]  # Use a different example
dietary_restriction = queries_df["dietary_restriction"].iloc[1]

# Make the OpenAI call (which will be auto-instrumented)
single_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": single_query},
    ],
    temperature=0.7,
)

# Get the current span and add metadata to it
current_span = trace.get_current_span()
if current_span:
    current_span.set_attribute("dietary_restriction", dietary_restriction)
    current_span.set_attribute("query_id", int(queries_df["id"].iloc[1]))
    current_span.set_attribute("use_case", "alternative_approach")

print("Query:", single_query)
print("Dietary Restriction:", dietary_restriction)
response_content = single_response.choices[0].message.content
if response_content:
    print("Response snippet:", response_content[:200] + "...")
else:
    print("No response content available")

Query: Need a quick gluten-free breakfast. I hate eggs though.
Dietary Restriction: gluten-free
Response snippet: ## Quick Gluten-Free Banana Oatmeal Pancakes

These fluffy banana oatmeal pancakes are a quick and delicious gluten-free breakfast option that doesn’t require eggs. They’re perfect for busy mornings a...


In [17]:
import asyncio

from opentelemetry import trace

# Create async OpenAI client
async_client = openai.AsyncOpenAI()


async def get_response_with_metadata(i, query, dietary_restriction, query_id):
    """Process a single query with custom metadata"""
    # The OpenAI call will be auto-instrumented
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system_prompt}, {"role": "user", "content": query}],
        temperature=0.7,
    )

    # Get the current auto-instrumented span and add metadata to it
    current_span = trace.get_current_span()
    if current_span:
        current_span.set_attribute("attributes.metadata.dietary_restriction", dietary_restriction)

    return {
        "query": query,
        "dietary_restriction": dietary_restriction,
        "response": response.choices[0].message.content,
    }


# Process all queries with batching to avoid rate limits
responses = []
batch_size = 20  # Process 20 at a time to avoid rate limits

for batch_start in range(0, len(queries_df), batch_size):
    batch_end = min(batch_start + batch_size, len(queries_df))
    batch_tasks = []

    # Create tasks for this batch
    for i in range(batch_start, batch_end):
        query = str(queries_df.iloc[i]["query"])
        dietary_restriction = str(queries_df.iloc[i]["dietary_restriction"])
        query_id = int(queries_df.iloc[i]["id"])

        task = get_response_with_metadata(i, query, dietary_restriction, query_id)
        batch_tasks.append(task)

    # Process this batch concurrently
    batch_responses = await asyncio.gather(*batch_tasks)
    responses.extend(batch_responses)

    print(
        f"✅ Processed {len(responses)}/{queries_df.shape[0]} queries (batch {len(responses) // batch_size})"
    )

print(f"\n🚀 Successfully processed {len(responses)} queries with custom metadata!")
# Note: 50 is a hard-coded guess at the total runtime in seconds, not a measurement
print(f"📊 Average time per query: ~{50 / len(responses):.1f} seconds (estimated)")

# Show a sample of the results
if responses:
    print("\n📝 Sample result:")
    sample_response = responses[0]
    print(f"Query: {sample_response['query'][:100]}...")
    print(f"Dietary Restriction: {sample_response['dietary_restriction']}")
    print(f"Response: {sample_response['response'][:150]}...")

✅ Processed 20/60 queries (batch 1)
✅ Processed 40/60 queries (batch 2)
✅ Processed 60/60 queries (batch 3)

🚀 Successfully processed 60 queries with custom metadata!
📊 Average time per query: ~0.8 seconds (estimated)

📝 Sample result:
Query: I'm vegan but I really want to make something with honey - is there a good substitute? i am craving ...
Dietary Restriction: vegan
Response: ## Vegan Yogurt Breakfast Bowl with Agave Nectar

If you're looking for a sweet and satisfying vegan breakfast, this yogurt bowl topped with fresh fru...


### Monitor Live Evaluation Results

Navigate to the Arize UI and check the traces. You should see your online evaluation task automatically processing the new traces. Look for the evaluation scores and any patterns in the results.

## Part 6: Statistical Analysis with Bias Correction 📊

Now we'll apply statistical bias correction to get a more reliable estimate of the Recipe Bot's true dietary adherence performance. We use the `judgy` library for the correction, fed with data from our Arize workflow.



**What we're doing:**
1. **Export live traces** from Arize that have been automatically evaluated by our judge
2. **Use judge performance** (TPR/TNR) calculated from our labeled dataset above
3. **Apply bias correction** to get a more accurate estimate of true performance
4. **Calculate confidence intervals** to understand the reliability of our estimates

This approach lets us evaluate real production performance using statistical methods to account for judge bias.
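To make step 3 concrete, here is a minimal sketch of a common form of this correction, often called the Rogan-Gladen estimator. It assumes the observed pass rate satisfies `p_obs = theta * TPR + (1 - theta) * (1 - TNR)` and solves for the true rate `theta`. The function name `corrected_success_rate` and the input numbers are illustrative assumptions; `judgy` may differ in its details.

```python
def corrected_success_rate(p_obs: float, tpr: float, tnr: float) -> float:
    """Invert p_obs = theta * TPR + (1 - theta) * (1 - TNR) to recover theta."""
    denom = tpr + tnr - 1.0
    if denom <= 0:
        # The correction is undefined when the judge is no better than chance.
        raise ValueError("TPR + TNR must exceed 1 for the correction to apply.")
    theta = (p_obs + tnr - 1.0) / denom
    # Clip to [0, 1] because sampling noise can push the raw value outside it.
    return min(1.0, max(0.0, theta))


# Illustrative numbers only (not taken from this notebook's results).
estimate = corrected_success_rate(p_obs=0.70, tpr=0.90, tnr=0.80)
print(f"Corrected success rate: {estimate:.3f}")
```

The denominator shrinks as the judge gets closer to chance, so a weak TNR or TPR amplifies noise in the corrected estimate.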



In [76]:
export_client = ArizeExportClient(api_key=os.environ["ARIZE_API_KEY"])

# Set end_time to now and start_time to 24 hours ago
end_time = datetime.now()
start_time = end_time - timedelta(days=1)

new_traces_df = export_client.export_model_to_df(
    space_id=os.environ["ARIZE_SPACE_ID"],
    # api_key=os.environ["ARIZE_API_KEY"],
    model_id="RecipeBot",
    environment=Environments.TRACING,
    start_time=start_time,
    end_time=end_time,
    # Optionally specify columns to improve query performance
    # columns=['context.span_id', 'attributes.llm.input']
)

  arize.utils.logging | INFO | Creating named session as 'python-sdk-arize_python_export_client-0760084f-9d98-4a02-9c44-9cdf77bde278'.
  arize.utils.logging | INFO | Fetching data...
  arize.utils.logging | INFO | Starting exporting...


  exporting 151 rows: 100%|██████████████████████| 151/151 [00:00, 324.04 row/s]


In [77]:
# Get live predictions and judge performance
eval_score_col = "eval.Dietary Restriction Adherence.score"
live_predictions = np.array(new_traces_df[eval_score_col].dropna().tolist())

# Prepare data for judgy.estimate_success_rate
# We need: test_labels, test_preds, unlabeled_preds

# Extract test data from our labeled dataset (final_df)
test_ground_truth = []
test_judge_preds = []

for _, row in final_df.iterrows():
    # Convert ground truth labels to binary
    gt_label = row["latest.userannotation.Correctness.label"]
    if pd.notna(gt_label) and str(gt_label).strip().lower() == "correct":
        test_ground_truth.append(1)
    elif pd.notna(gt_label) and str(gt_label).strip().lower() == "incorrect":
        test_ground_truth.append(0)
    else:
        continue

    # Convert judge predictions to binary
    judge_label = row["parsed_label"]
    if pd.notna(judge_label) and str(judge_label).strip().lower() == "correct":
        test_judge_preds.append(1)
    elif pd.notna(judge_label) and str(judge_label).strip().lower() == "incorrect":
        test_judge_preds.append(0)
    else:
        # If we can't parse judge prediction, remove the corresponding ground truth
        test_ground_truth.pop()

# Convert to numpy arrays
test_labels = np.array(test_ground_truth)
test_preds = np.array(test_judge_preds)
unlabeled_preds = live_predictions  # These are already binary (0/1)

print("📊 Data for judgy:")
print(f"   Test labels: {len(test_labels)} samples")
print(f"   Test predictions: {len(test_preds)} samples")
print(f"   Live predictions: {len(unlabeled_preds)} samples")
print(f"   Test accuracy: {(test_labels == test_preds).mean():.3f}")

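# Bias correction with judgy. Per judgy's README, the call returns the corrected
# estimate with lower and upper confidence bounds; this cell prints only the raw
# pass rate, so inspect `results` to see them.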
results = estimate_success_rate(
    test_labels=test_labels, test_preds=test_preds, unlabeled_preds=unlabeled_preds
)

# Results
print("\n📊 Results:")
print(f"   Raw pass rate: {live_predictions.mean():.3f}")
print("   ✅ Used judgy.estimate_success_rate successfully!")

📊 Data for judgy:
   Test labels: 100 samples
   Test predictions: 100 samples
   Live predictions: 151 samples
   Test accuracy: 0.840

📊 Results:
   Raw pass rate: 0.993
   ✅ Used judgy.estimate_success_rate successfully!
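The cell above prints only the raw pass rate. As a rough illustration of how a corrected estimate and a bootstrap confidence interval could be computed, the sketch below rebuilds the arrays from the counts reported earlier (TP=81, FN=10, FP=6, TN=3, and 150 of 151 live traces passing) and resamples the labeled set. This is a hand-rolled approximation, not `judgy`'s code. The helper `corrected_rate`, the 2,000 resamples, and the seed are all assumptions, and `judgy`'s own numbers may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arrays rebuilt from the counts reported in this notebook's outputs.
test_labels = np.array([1] * 91 + [0] * 9)
test_preds = np.array([1] * 81 + [0] * 10 + [1] * 6 + [0] * 3)
live_preds = np.array([1] * 150 + [0] * 1)


def corrected_rate(labels, preds, live):
    """Corrected pass rate, or None when the judge is not better than chance."""
    pos, neg = labels == 1, labels == 0
    if pos.sum() == 0 or neg.sum() == 0:
        return None
    tpr = preds[pos].mean()
    tnr = 1.0 - preds[neg].mean()
    denom = tpr + tnr - 1.0
    if denom <= 0:
        return None
    return float(np.clip((live.mean() + tnr - 1.0) / denom, 0.0, 1.0))


point = corrected_rate(test_labels, test_preds, live_preds)

# Resample the labeled set with replacement to reflect uncertainty in TPR/TNR.
samples = []
for _ in range(2000):
    idx = rng.integers(0, len(test_labels), len(test_labels))
    est = corrected_rate(test_labels[idx], test_preds[idx], live_preds)
    if est is not None:
        samples.append(est)

lower, upper = np.percentile(samples, [2.5, 97.5])
print(f"corrected={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
print(f"kept {len(samples)} of 2000 resamples")
```

Because the raw pass rate (0.993) is higher than the judge's TPR (0.890), the unclipped estimate exceeds 1 and is clipped. That is a sign the judge passes nearly everything on live traces, so the corrected figure should be read with caution until the judge's TNR improves.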


## 🎉 Assignment Complete!

**What you accomplished:**
- ✅ Prepared trace data for evaluation testing
- ✅ Used Arize UI for manual labeling to establish ground truth
- ✅ Developed and tested LLM judge against human feedback
- ✅ Aligned judge performance with human annotations (TPR, TNR)
- ✅ Applied judge to evaluate "production" traces at scale
- ✅ Applied statistical bias correction to account for judge imperfections
- ✅ Generated comprehensive evaluation report with confidence intervals

**Key insight:** By aligning the LLM judge with human feedback through the Arize UI testing workflow, we can now evaluate dietary adherence at scale while accounting for judge bias through statistical correction.

