# HW 3: LLM-as-Judge for Recipe Bot Evaluation with Arize

## 🎯 Assignment Overview

In this assignment, we'll evaluate our Recipe Bot's adherence to dietary preferences using an LLM-as-Judge approach with Arize for tracing and evaluation.

### Workflow:
1. **📊 Load trace examples** - Choose between provided data or generate new traces
2. **🏷️ Create datasets** - Prepare data for labeling queue in Arize
3. **🔍 Label traces** - Use Arize UI to manually label examples  
4. **⚖️ Write eval prompt** - Create judge prompt in Arize Playground
5. **📈 Run evaluation experiment** - Execute evaluation via Arize
6. **📊 Calculate metrics** - Export and analyze results

### Core Task: "Adherence to Dietary Preferences"
**Example**: If a user asks for a "vegan" recipe, does the bot provide one that is actually vegan?

Let's get started! 🚀


## 🔧 Setup and Environment Configuration

First, let's import the required libraries and set up our environment.


In [None]:
# Install required packages
import subprocess
import sys


def install_packages():
    packages = [
        "arize-phoenix[evals]",
        "openai",
        "pandas",
        "openinference-instrumentation-openai",
        "nest-asyncio",
        "arize-phoenix[evals]",
        "openai", 
        "pandas",
        "numpy",
        "scipy",
        "openinference-instrumentation-openai",
        "nest-asyncio",
        "arize[AutoEmbeddings]",  # For ArizeExportClient and ArizeDatasetsClient
        "opentelemetry-api",
        "opentelemetry-sdk",
    ]

    for package in packages:
        print(f"Installing {package}...")
        subprocess.check_call([sys.executable, "-m", "pip", "install", package])


# Uncomment to install packages
# install_packages()

In [1]:
# Setup
import pandas as pd
import openai
import os
from pathlib import Path
from datetime import datetime
import numpy as np
import getpass
from datetime import datetime
from arize.exporter import ArizeExportClient
from arize.utils.types import Environments
from datetime import datetime, timedelta
from arize.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor
from arize.experimental.datasets import ArizeDatasetsClient
import pandas as pd
from arize.experimental.datasets.utils.constants import GENERATIVE




### 🔑 API Key Configuration

**Important**: For security, never hardcode API keys in notebooks. We'll use environment variables and secure input methods.


In [2]:
Prompt for OpenAI API key if not set
if "OPENAI_API_KEY" not in os.environ or not os.environ["OPENAI_API_KEY"]:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OPENAI_API_KEY: ")

# Prompt for Arize API key if not set
if "ARIZE_API_KEY" not in os.environ or not os.environ["ARIZE_API_KEY"]:
    os.environ["ARIZE_API_KEY"] = getpass.getpass("Enter your ARIZE_API_KEY: ")

# Prompt for Arize Space key if not set
if "ARIZE_SPACE_ID" not in os.environ or not os.environ["ARIZE_SPACE_ID"]:
    os.environ["ARIZE_SPACE_ID"] = getpass.getpass("Enter your ARIZE_SPACE_ID: ")


### Tracing Setup

We'll set up OpenTelemetry tracing to automatically capture LLM interactions and send them to Arize. This enables real-time monitoring and evaluation of our Recipe Bot's performance in production.

In [None]:
# Initialize OpenAI client
client = openai.OpenAI()

print("✅ Setup complete!")


In [22]:


# Set up tracing
tracer_provider = register(
    space_id=os.environ["ARIZE_SPACE_ID"],
    project_name="RecipeBot",  # name this to whatever you would like
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

Overriding of current TracerProvider is not allowed
Attempting to instrument while already instrumented


🔭 OpenTelemetry Tracing Details 🔭
|  Arize Project: RecipeBot
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: otlp.arize.com
|  Transport: gRPC
|  Transport Headers: {'authorization': '****', 'api_key': '****', 'arize-space-id': '****', 'space_id': '****', 'arize-interface': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



## Part 1: Create Arize Dataset for Labeling 

We'll upload our split dataset to Arize for manual labeling of ground truth examples. This step is crucial for establishing the "correct" answers that our judge will be evaluated against.


In [3]:
traces_path = Path("homeworks/hw3/data/raw_traces.csv")

traces_df = pd.read_csv(traces_path)

traces_df.head()


Unnamed: 0,query,dietary_restriction,response,success,error,trace_id,query_id
0,I'm vegan but I really want to make something ...,vegan,Certainly! For a vegan yogurt breakfast that m...,True,,1_8,1
1,I'm vegan but I really want to make something ...,vegan,Absolutely! While honey is a popular sweetener...,True,,1_9,1
2,I'm vegan but I really want to make something ...,vegan,Certainly! Since you're vegan and craving a yo...,True,,1_10,1
3,Need a quick gluten-free breakfast. I hate egg...,gluten-free,"Certainly! For a quick, gluten-free breakfast ...",True,,2_7,2
4,I'm vegan but I really want to make something ...,vegan,Absolutely! For a vegan breakfast that mimics ...,True,,1_27,1


In [4]:

## let's create a test dataset 
datasets_client = ArizeDatasetsClient(api_key=os.environ["ARIZE_API_KEY"])

sample = traces_df.sample(n=100, random_state=42)

# dataset_id = datasets_client.create_dataset(
#     space_id=os.environ["ARIZE_SPACE_ID"], 
#     dataset_name="RecipeBot",
#     data=sample,
#     dataset_type=GENERATIVE
# )

  from .autonotebook import tqdm as notebook_tqdm


### Alternative Option: Send in Traces
Alternatively you can send in traces using `dietary_quereies.csv` 

In [None]:

# from opentelemetry import trace


# # Load dietary queries
# queries_path = Path("homeworks/hw3/data/dietary_queries.csv")
# queries_df = pd.read_csv(queries_path)

# # Example with a single query
# single_query = queries_df['query'].iloc[1]  # Use a different example
# dietary_restriction = queries_df['dietary_restriction'].iloc[1]

# # Make the OpenAI call (which will be auto-instrumented)
# single_response = client.chat.completions.create(
#     model="gpt-4o-mini",
#     messages=[
#         {"role": "system", "content": system_prompt},
#         {"role": "user", "content": single_query}
#     ],
#     temperature=0.7
# )

# # Get the current span and add metadata to it
# current_span = trace.get_current_span()
# if current_span:
#     current_span.set_attribute("dietary_restriction", dietary_restriction)
#     current_span.set_attribute("query_id", int(queries_df['id'].iloc[1]))
#     current_span.set_attribute("use_case", "alternative_approach")

# print("Query:", single_query)
# print("Dietary Restriction:", dietary_restriction)
# response_content = single_response.choices[0].message.content
# if response_content:
#     print("Response snippet:", response_content[:200] + "...")
# else:
#     print("No response content available")


## Part 2: Prepare Data for Arize Labeling

Take your traces and prepare them for manual labeling in Arize.


### 📝 Labeling Criteria - Dietary Adherence

**CORRECT**: Recipe correctly follows all specified dietary restrictions  
**INCORRECT**: Recipe violates any specified dietary restrictions

**Examples:**
- ✅ CORRECT: 'vegan pasta' → recipe with nutritional yeast (no dairy)
- ❌ INCORRECT: 'vegan pasta' → recipe suggests honey (not vegan)  
- ✅ CORRECT: 'gluten-free bread' → recipe with almond flour
- ❌ INCORRECT: 'gluten-free bread' → recipe with regular flour


## Part 3: Create LLM as Judge Prompt

🎯 **Complete these steps in Arize:**

1. **🏷️ Label Rows**: Review dataset and annotate rows.

2. **⚖️ Develop Judge Prompt**: Create evaluation prompt for dietary adherence

3. **🧪 Test Evaluation**: Run judge prompt against ground truth labels in the playground

4. **🚀 Review Experiment**: Review evaluation experiment and iterate 

⏳ **Come back here after completing Arize work!**


### Optional: Programmatic Evaluation with `llm_classify` and Experiments


Instead of building and testing the eval in the Arize UI, you can use the [`llm_classify`](https://arize.com/docs/ax/evaluate/online-evals/log-evaluations-to-arize) function in code. 

You can also run a full evaluation experiment programmatically using the [Arize Experiments API](https://arize.com/docs/ax/develop/datasets-and-experiments/run-experiments), which lets you compare LLM judge results to ground truth and analyze performance—all in code.

This approach is useful if you want to scale up, iterate quickly, or integrate evaluation into your ML pipeline.

See the next code cell for an example of how to use `llm_classify`.


In [None]:
# from phoenix.evals import (
#     llm_classify,
# )

# # The rails are used to hold the output to specific values based on the template
# # It will remove text such as ",,," or "..."
# # Will ensure the binary value expected from the template is returned
# rails = ["Correct", "Incorrect"]
# #MultiClass would be rails = ["irrelevant", "relevant", "semi-relevant"]
# eval_df = llm_classify(
#     dataframe=<YOUR_DATAFRAME_GOES_HERE>,
#     template=CATEGORICAL_TEMPLATE,
#     model=OpenAIModel('gpt-4o', api_key=''),
#     rails=rails
# )

## Part 4: Load Results & Calculate Metrics

After running your evaluation experiment in Arize, we'll load the results and compute performance metrics.


In [45]:
dataset_id = "RGF0YXNldDozMDI0OTM6S3FsaA=="
dataset_data = datasets_client.get_dataset(os.environ["ARIZE_SPACE_ID"], dataset_id=dataset_id)

dataset_data.head()


Unnamed: 0,query,dietary_restriction,response,success,error,trace_id,query_id,created_at,updated_at,id,userannotation.VXNlcjoxMTE4NzpnUkY0.Correctness.label,userannotation.VXNlcjoxMTE4NzpnUkY0.Correctness.updated_by,userannotation.VXNlcjoxMTE4NzpnUkY0.Correctness.updated_at,latest.userannotation.Correctness.label,latest.userannotation.Correctness.updated_by,latest.userannotation.Correctness.updated_at,__index_level_0__
0,I eat pretty clean most of the time,whole30,"Great! Since you enjoy eating clean, I recomme...",True,,51_36,51,1754271594950,1754271594950,43897431-e017-4e28-b316-66c603e5c9b5,Correct,SallyAnn DeLucia,1754334000000.0,Correct,SallyAnn DeLucia,1754334000000.0,0
1,Low-carb dinner that doesn't require any cooki...,low-carb,Certainly! Here's a simple and delicious Low-C...,True,,36_2,36,1754271594950,1754271594950,93a54576-6a78-41ee-add6-95f471b1c81f,Correct,SallyAnn DeLucia,1754335000000.0,Correct,SallyAnn DeLucia,1754335000000.0,98
2,I'm pescatarian but shellfish makes me sick,pescatarian,Great! I recommend trying a **Lemon Garlic Bak...,True,,55_15,55,1754271594950,1754271594950,7611f7d2-91d4-4891-a795-e5e8284ddc2c,Correct,SallyAnn DeLucia,1754335000000.0,Correct,SallyAnn DeLucia,1754335000000.0,99
3,I'm dairy-free but pregnant and craving ice cr...,dairy-free,"Certainly! Here's a delightful dairy-free, pre...",True,,33_9,33,1754271594950,1754271594950,8088910c-5b1d-4b5c-9fbf-59ddd3feaeb6,Correct,SallyAnn DeLucia,1754335000000.0,Correct,SallyAnn DeLucia,1754335000000.0,100
4,Low-carb dinner that doesn't require any cooki...,low-carb,"Certainly! Here's a simple, low-carb, no-cook ...",True,,36_12,36,1754271594950,1754271594950,0674dd25-8429-40ad-9939-a2222fee5c9a,Correct,SallyAnn DeLucia,1754335000000.0,Correct,SallyAnn DeLucia,1754335000000.0,101


In [50]:
experiment_id = "RXhwZXJpbWVudDoyMjUyOToxNnpj"
# example usage
experiments_data = datasets_client.get_experiment(os.environ["ARIZE_SPACE_ID"], experiment_id=experiment_id)

experiments_data.head()

Unnamed: 0,output,example_id,id,count,template,invocation_parameters,tool_choice,tool_options,model_name,model_provider,eval.Label Match .label,eval.Label Match .score,eval.Label Match .explanation
0,"{""id"":""chatcmpl-C1KTrLWpNUAKm8OY7pXRGDvae40zr""...",43897431-e017-4e28-b316-66c603e5c9b5,EXP_ID_1877d7,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o-mini,openAI,match,1.0,1. The Output contains a JSON object with a 'r...
1,"{""id"":""chatcmpl-C1KTr1Gns9nyibZc1r57dGMlXDMKP""...",93a54576-6a78-41ee-add6-95f471b1c81f,EXP_ID_9a6a3d,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o-mini,openAI,match,1.0,The Output contains two tool calls with differ...
2,"{""id"":""chatcmpl-C1KTrQ6ulSr0mu4BaIt6qq8gep20b""...",7611f7d2-91d4-4891-a795-e5e8284ddc2c,EXP_ID_ceeae5,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o-mini,openAI,match,1.0,"1. The Ground Truth label is 'Correct', indica..."
3,"{""id"":""chatcmpl-C1KTrQV036WzL2gFH7rtBGEXRdx5Z""...",8088910c-5b1d-4b5c-9fbf-59ddd3feaeb6,EXP_ID_777a98,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o-mini,openAI,match,1.0,1. The Output from the LLM judge states that t...
4,"{""id"":""chatcmpl-C1KTrGRJrF3O3zN9KM5Ql6zz5EYCT""...",0674dd25-8429-40ad-9939-a2222fee5c9a,EXP_ID_7553cd,1,"[{""role"":""system"",""content"":""You are a dietary...",{},"""required""","[{""type"":""function"",""function"":{""name"":""record...",gpt-4o-mini,openAI,match,1.0,1. The Output provides an explanation that the...


In [51]:
# Join experiments_data (aliased as e) and dataset_data (aliased as d) on e.example_id = d.id
joined_df = experiments_data.merge(dataset_data, left_on='example_id', right_on='id', suffixes=('_e', '_d'))


In [52]:
import json

def extract_label_from_output(output_str):
    """
    Extract the 'response' field from the tool_calls in the output JSON.
    Returns the first 'response' value found, or None if not found.
    """
    try:
        output_json = json.loads(output_str)
        # Traverse to choices[0].message.tool_calls
        choices = output_json.get("choices", [])
        for choice in choices:
            message = choice.get("message", {})
            tool_calls = message.get("tool_calls", [])
            for tool_call in tool_calls:
                function = tool_call.get("function", {})
                arguments_str = function.get("arguments", "")
                # arguments is a JSON string, so parse it
                try:
                    arguments = json.loads(arguments_str)
                    if "response" in arguments:
                        return arguments["response"]
                except Exception:
                    continue
        return None
    except Exception:
        return None

joined_df['parsed_label'] = joined_df['output'].apply(extract_label_from_output)

final_df = joined_df[['parsed_label', 'eval.Label Match .label','eval.Label Match .score','query','dietary_restriction','response','latest.userannotation.Correctness.label']]
final_df.head()


Unnamed: 0,parsed_label,eval.Label Match .label,eval.Label Match .score,query,dietary_restriction,response,latest.userannotation.Correctness.label
0,correct,match,1.0,I eat pretty clean most of the time,whole30,"Great! Since you enjoy eating clean, I recomme...",Correct
1,correct,match,1.0,Low-carb dinner that doesn't require any cooki...,low-carb,Certainly! Here's a simple and delicious Low-C...,Correct
2,correct,match,1.0,I'm pescatarian but shellfish makes me sick,pescatarian,Great! I recommend trying a **Lemon Garlic Bak...,Correct
3,correct,match,1.0,I'm dairy-free but pregnant and craving ice cr...,dairy-free,"Certainly! Here's a delightful dairy-free, pre...",Correct
4,correct,match,1.0,Low-carb dinner that doesn't require any cooki...,low-carb,"Certainly! Here's a simple, low-carb, no-cook ...",Correct


### 📊 Judge Performance Analysis

Let's evaluate how well our LLM judge performed compared to human ground truth labels.


In [53]:
# Calculate judge performance metrics using final_df

def to_binary(label):
    """Convert text labels to binary (1 for correct/match, 0 for incorrect/mismatch)"""
    if pd.isna(label):
        return None
    label_str = str(label).strip().lower()
    # Handle both correctness labels and match/mismatch labels
    if label_str in ['correct', 'match']:
        return 1
    elif label_str in ['incorrect', 'mismatch']:
        return 0
    return None

# Extract ground truth from human annotations
ground_truth_labels = final_df['latest.userannotation.Correctness.label']
ground_truth = [to_binary(label) for label in ground_truth_labels]

# Extract judge predictions from parsed_label (eval template output)
judge_pred_labels = final_df['parsed_label']
judge_preds = [to_binary(label) for label in judge_pred_labels]

# Only keep valid pairs where both ground truth and predictions are available
valid = [(gt, pred) for gt, pred in zip(ground_truth, judge_preds) if gt is not None and pred is not None]

if valid:
    gt, pred = zip(*valid)
    tp = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 0)
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0
    accuracy = (tp + tn) / len(valid)
    
    print("📊 Judge Performance Metrics:")
    print(f"   True Positive Rate (TPR): {tpr:.3f}")
    print(f"   True Negative Rate (TNR): {tnr:.3f}")
    print(f"   Accuracy: {accuracy:.3f}")
    print(f"   Total number of valid label pairs: {len(valid)}")
    print(f"   True Positives: {tp}, True Negatives: {tn}")
    print(f"   False Positives: {fp}, False Negatives: {fn}")
    
    metrics = {'tpr': tpr, 'tnr': tnr, 'accuracy': accuracy}
else:
    print("❌ No valid label pairs found in final_df")
    print("   Ground truth labels:", ground_truth[:5])
    print("   Judge predictions:", judge_preds[:5])


📊 Judge Performance Metrics:
   True Positive Rate (TPR): 0.931
   True Negative Rate (TNR): 0.000
   Accuracy: 0.900
   Total number of valid label pairs: 30
   True Positives: 27, True Negatives: 0
   False Positives: 1, False Negatives: 2


## Part 5: Evaluate Live Traces

After testing and validating our evaluation template, we're ready to use it in a production setting.
First, set up the online evaluation task in the platform using the template.
Once that's done, you can send in traces, and the evaluation will run automatically on each trace.

In [3]:
# Initialize OpenAI client
client = openai.OpenAI()

print("✅ Setup complete!")


✅ Setup complete!


In [4]:
from arize.otel import register

from openinference.instrumentation.openai import OpenAIInstrumentor

# Set up tracing
tracer_provider = register(
    space_id="U3BhY2U6NzE5Mjp4V1Q1",
    project_name="RecipeBot",  # name this to whatever you would like
)
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

🔭 OpenTelemetry Tracing Details 🔭
|  Arize Project: RecipeBot
|  Span Processor: BatchSpanProcessor
|  Collector Endpoint: otlp.arize.com
|  Transport: gRPC
|  Transport Headers: {'authorization': '****', 'api_key': '****', 'arize-space-id': '****', 'space_id': '****', 'arize-interface': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



In [15]:
# Load dietary queries
queries_path = Path("homeworks/hw3/data/dietary_queries.csv")
queries_df = pd.read_csv(queries_path)

queries_df.head()

Unnamed: 0,id,query,dietary_restriction
0,1,I'm vegan but I really want to make something ...,vegan
1,2,Need a quick gluten-free breakfast. I hate egg...,gluten-free
2,3,Keto breakfast that I can meal prep for the week,keto
3,4,I'm dairy-free and also can't stand the taste ...,dairy-free
4,5,Vegetarian pizza but I don't like mushrooms or...,vegetarian


In [7]:
from arize.experimental.prompt_hub import ArizePromptClient

prompt_client = ArizePromptClient(space_id="U3BhY2U6MjI2MjA6ckZlZA==", api_key=os.environ["ARIZE_API_KEY"])

prompt = prompt_client.pull_prompt(
    prompt_name="RecipeBot System Prompt"
)

system_prompt = prompt.messages[0]['content']

print(system_prompt)

#alternatively you can just assign the system prompt here

You are a helpful, accurate, and creative recipe assistant. Your job is to generate easy-to-follow, reliable recipes and cooking advice tailored to the user query below.

Core Responsibilities:
- Always include an ingredient list with precise measurements in standard US or metric units.
- Always include clear, numbered, step-by-step instructions that are logically ordered and easy to follow.
- Always structure your response in Markdown.

Ingredient Guidelines:
- Never suggest rare, expensive, or difficult-to-obtain ingredients without clearly providing readily available substitutions.
- Be specific with ingredients (e.g., “1 cup unsweetened almond milk” instead of “milk”).

Instructional Guidelines:
- Do not skip steps or assume prior knowledge.
- Use direct, instructional language.
- Include preparation and cook time only if reliably known.

Behavior & Ethics:
- Never include unsafe, unethical, or harmful suggestions. Politely decline and explain briefly if a request cannot be fulfill

In [8]:

from opentelemetry import trace

# Example with a single query
single_query = queries_df['query'].iloc[1]  # Use a different example
dietary_restriction = queries_df['dietary_restriction'].iloc[1]

# Make the OpenAI call (which will be auto-instrumented)
single_response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": single_query}
    ],
    temperature=0.7
)

# Get the current span and add metadata to it
current_span = trace.get_current_span()
if current_span:
    current_span.set_attribute("dietary_restriction", dietary_restriction)
    current_span.set_attribute("query_id", int(queries_df['id'].iloc[1]))
    current_span.set_attribute("use_case", "alternative_approach")

print("Query:", single_query)
print("Dietary Restriction:", dietary_restriction)
response_content = single_response.choices[0].message.content
if response_content:
    print("Response snippet:", response_content[:200] + "...")
else:
    print("No response content available")


Query: Need a quick gluten-free breakfast. I hate eggs though.
Dietary Restriction: gluten-free
Response snippet: ## Quick Gluten-Free Banana Oatmeal Pancakes

These fluffy banana oatmeal pancakes are a quick and delicious gluten-free breakfast option that doesn’t require eggs. They’re perfect for busy mornings a...


In [17]:
# Run on a sample of 500 queries with custom metadata - FAST async approach for Jupyter
import asyncio
from opentelemetry import trace

# Create async OpenAI client
async_client = openai.AsyncOpenAI()

async def get_response_with_metadata(i, query, dietary_restriction, query_id):
    """Process a single query with custom metadata"""
    # The OpenAI call will be auto-instrumented
    response = await async_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": query}
        ],
        temperature=0.7
    )
    
    # Get the current auto-instrumented span and add metadata to it
    current_span = trace.get_current_span()
    if current_span:
        current_span.set_attribute("dietary_restriction", dietary_restriction)
        current_span.set_attribute("query_id", query_id)
        current_span.set_attribute("use_case", "batch_processing")
        current_span.set_attribute("batch_index", i)
    
    return {
        "query": query,
        "dietary_restriction": dietary_restriction,
        "query_id": query_id,
        "response": response.choices[0].message.content
    }

# Process all queries with batching to avoid rate limits
responses = []
batch_size = 20  # Process 20 at a time to avoid rate limits

for batch_start in range(0, len(queries_df), batch_size):
    batch_end = min(batch_start + batch_size, len(queries_df))
    batch_tasks = []
    
    # Create tasks for this batch
    for i in range(batch_start, batch_end):
        query = str(queries_df.iloc[i]['query'])
        dietary_restriction = str(queries_df.iloc[i]['dietary_restriction'])
        query_id = int(queries_df.iloc[i]['id'])
        
        task = get_response_with_metadata(i, query, dietary_restriction, query_id)
        batch_tasks.append(task)
    
    # Process this batch concurrently
    batch_responses = await asyncio.gather(*batch_tasks)
    responses.extend(batch_responses)
    
    print(f"✅ Processed {len(responses)}/{queries_df.shape[0]} queries (batch {len(responses)//batch_size})")

print(f"\n🚀 Successfully processed {len(responses)} queries with custom metadata!")
print(f"📊 Average time per query: ~{50/len(responses):.1f} seconds (estimated)")

# Show a sample of the results
if responses:
    print(f"\n📝 Sample result:")
    sample_response = responses[0]
    print(f"Query: {sample_response['query'][:100]}...")
    print(f"Dietary Restriction: {sample_response['dietary_restriction']}")
    print(f"Response: {sample_response['response'][:150]}...")


✅ Processed 20/60 queries (batch 1)
✅ Processed 40/60 queries (batch 2)
✅ Processed 60/60 queries (batch 3)

🚀 Successfully processed 60 queries with custom metadata!
📊 Average time per query: ~0.8 seconds (estimated)

📝 Sample result:
Query: I'm vegan but I really want to make something with honey - is there a good substitute? i am craving ...
Dietary Restriction: vegan
Response: ## Vegan Yogurt Breakfast Bowl with Agave Nectar

If you're looking for a sweet and satisfying vegan breakfast, this yogurt bowl topped with fresh fru...


### Monitor Live Evaluation Results

Navigate to the Arize UI and check the traces. You should see your online evaluation task automatically processing the new traces. Look for the evaluation scores and any patterns in the results.

## Part 6: Statistical Analysis with Bias Correction 📊

Now we'll apply statistical bias correction to get a reliable estimate of the Recipe Bot's true dietary adherence performance. This implements the same methodology as the 'judgy' library but using our Arize workflow.



**What we're doing:**
1. **Export live traces** from Arize that have been automatically evaluated by our judge
2. **Use judge performance** (TPR/TNR) calculated from our labeled dataset above
3. **Apply bias correction** to get a more accurate estimate of true performance
4. **Calculate confidence intervals** to understand the reliability of our estimates

This approach lets us evaluate real production performance using statistical methods to account for judge bias.



In [16]:
export_client = ArizeExportClient(api_key=os.environ["ARIZE_API_KEY"])

# Set end_time to now and start_time to 24 hours ago
end_time = datetime.now()
start_time = end_time - timedelta(days=1)

new_traces_df = export_client.export_model_to_df(
    space_id="U3BhY2U6NzE5Mjp4V1Q1",
    # api_key=os.environ["ARIZE_API_KEY"],
    model_id='RecipeBot',
    environment=Environments.TRACING,
    start_time=start_time,
    end_time=end_time,
    # Optionally specify columns to improve query performance
    # columns=['context.span_id', 'attributes.llm.input']
)

new_traces_df.columns

[38;21m  arize.utils.logging | INFO | Creating named session as 'python-sdk-arize_python_export_client-add6be4b-dc15-4f62-bdaf-4c2e11ab0955'.[0m
[38;21m  arize.utils.logging | INFO | Fetching data...[0m
[38;21m  arize.utils.logging | INFO | Starting exporting...[0m


  exporting 151 rows: 100%|[38;2;0;128;0m██████████████████████[0m| 151/151 [00:00, 295.58 row/s][0m


Index(['attributes.llm.cost.completion_details.reasoning',
       'attributes.llm.token_count.completion',
       'attributes.llm.token_count.prompt_details.audio',
       'attributes.reranker.model_name', 'attributes.exception.message',
       'eval.Dietary Restriction Adherence.score',
       'attributes.llm.prompt_template.variables',
       'attributes.llm.token_count.completion_details.reasoning',
       'attributes.input.mime_type',
       'attributes.llm.token_count.completion_details.output',
       'eval.Dietary Restriction Adherence.label', 'attributes.llm.cost.total',
       'attributes.llm.input_messages', 'end_time', 'name', 'context.trace_id',
       'attributes.llm.provider',
       'attributes.llm.token_count.completion_details.audio', 'time',
       'attributes.exception.type', 'attributes.llm.prompt_template.template',
       'attributes.llm.cost.prompt', 'attributes.llm.output_messages',
       'latency_ms', 'attributes.llm.cost.completion',
       'attributes.llm.to

In [55]:
# First, let's check what evaluation columns are available in new_traces_df
print("📊 Available columns in new_traces_df:")
eval_columns = [col for col in new_traces_df.columns if 'eval' in col.lower()]
print(eval_columns)
print(f"\nTotal traces: {len(new_traces_df)}")

if not eval_columns:
    print("❌ No evaluation columns found in new_traces_df. Make sure your online evaluation is running.")
    print("Available columns:", list(new_traces_df.columns[:10]), "...")
else:
    # Find the evaluation score column
    eval_score_col = None
    for col in eval_columns:
        if 'score' in col.lower():
            eval_score_col = col
            break
    
    if eval_score_col:
        print(f"✅ Using evaluation column: {eval_score_col}")
        
        # Get judge performance metrics from labeled data
        TPR = metrics["tpr"]  # True Positive Rate (sensitivity)
        TNR = metrics["tnr"]  # True Negative Rate (specificity)
        
        print(f"\n📈 Judge Performance (from labeled data):")
        print(f"   TPR (True Positive Rate): {TPR:.3f}")
        print(f"   TNR (True Negative Rate): {TNR:.3f}")
        
        # Get judge predictions on live traces
        live_scores = new_traces_df[eval_score_col].dropna()
        if len(live_scores) == 0:
            print("❌ No evaluation scores found in live traces")
        else:
            # Calculate observed pass rate from live traces
            p_obs = live_scores.mean()
            n = len(live_scores)
            
            print(f"\n📊 Live Traces Analysis:")
            print(f"   Number of evaluated traces: {n}")
            print(f"   Raw pass rate (p_obs): {p_obs:.3f}")
            
            # Check if we can apply bias correction
            denom = TPR + TNR - 1
            
            if abs(denom) < 0.001:  # Close to zero
                print(f"\n⚠️  Warning: Cannot apply bias correction")
                print(f"   Denominator (TPR + TNR - 1) = {denom:.3f} ≈ 0")
                print(f"   This happens when the judge performance is poor or unbalanced")
                print(f"   Using raw pass rate as estimate: {p_obs:.3f}")
                theta_hat = p_obs
                
                # Simple confidence interval for proportion
                se_p = np.sqrt(p_obs * (1 - p_obs) / n)
                z = 1.96
                ci_lower = max(0, p_obs - z * se_p)
                ci_upper = min(1, p_obs + z * se_p)
                
                print(f"   95% CI (no bias correction): [{ci_lower:.3f}, {ci_upper:.3f}]")
                
            else:
                print(f"\n🔧 Applying Bias Correction:")
                print(f"   Denominator (TPR + TNR - 1) = {denom:.3f}")
                
                # Apply bias correction formula
                theta_hat = (p_obs + TNR - 1) / denom
                
                # Calculate confidence interval using delta method
                se_p_obs = np.sqrt(p_obs * (1 - p_obs) / n)
                se_theta = se_p_obs / abs(denom)
                z = 1.96
                ci_lower = max(0, theta_hat - z * se_theta)
                ci_upper = min(1, theta_hat + z * se_theta)
                
                print(f"   Bias-corrected true success rate (θ̂): {theta_hat:.3f}")
                print(f"   95% Confidence Interval: [{ci_lower:.3f}, {ci_upper:.3f}]")
            
            # Interpretation
            if theta_hat > 0.9:
                summary = "Recipe Bot is performing very well in adhering to dietary preferences."
            elif theta_hat > 0.75:
                summary = "Recipe Bot is generally adhering to dietary preferences, but there may be some room for improvement."
            elif theta_hat > 0.5:
                summary = "Recipe Bot shows moderate adherence to dietary preferences with room for improvement."
            else:
                summary = "Recipe Bot may be struggling to consistently adhere to dietary preferences."
            
            print(f"\n🎯 Final Interpretation:")
            print(f"   {summary}")
            print(f"   Estimated true dietary adherence rate: {theta_hat:.1%}")
            print(f"   95% Confidence Interval: {ci_lower:.1%} to {ci_upper:.1%}")
            
            if abs(denom) < 0.001:
                print(f"\n💡 Note: No bias correction applied due to poor judge discriminability.")
                print(f"   Consider improving the judge prompt or getting more diverse labeled data.")
    else:
        print("❌ No evaluation score column found. Available eval columns:", eval_columns)


📊 Available columns in new_traces_df:
['eval.Dietary Restriction Adherence.score', 'eval.Dietary Restriction Adherence.label', 'eval.Dietary Restriction Adherence.explanation']

Total traces: 151
✅ Using evaluation column: eval.Dietary Restriction Adherence.score

📈 Judge Performance (from labeled data):
   TPR (True Positive Rate): 0.931
   TNR (True Negative Rate): 0.000

📊 Live Traces Analysis:
   Number of evaluated traces: 151
   Raw pass rate (p_obs): 0.993

🔧 Applying Bias Correction:
   Denominator (TPR + TNR - 1) = -0.069
   Bias-corrected true success rate (θ̂): 0.096
   95% Confidence Interval: [0.000, 0.284]

🎯 Final Interpretation:
   Recipe Bot may be struggling to consistently adhere to dietary preferences.
   Estimated true dietary adherence rate: 9.6%
   95% Confidence Interval: 0.0% to 28.4%


## 🎉 Assignment Complete!

**What you accomplished:**
- ✅ Prepared trace data for evaluation testing
- ✅ Used Arize UI for manual labeling to establish ground truth
- ✅ Developed and tested LLM judge against human feedback
- ✅ Aligned judge performance with human annotations (TPR, TNR)
- ✅ Applied judge to evaluate "production" traces at scale
- ✅ Applied statistical bias correction to account for judge imperfections
- ✅ Generated comprehensive evaluation report with confidence intervals

**Key insight:** By aligning the LLM judge with human feedback through the Arize UI testing workflow, we can now evaluate dietary adherence at scale while accounting for judge bias through statistical correction.

