# HW 2: Recipe Bot Error Analysis

## 🎯 Assignment Overview

This notebook helps you perform error analysis for your Recipe Bot by:

1. **Part 1: Generate Test Queries** - Create diverse queries using key dimensions
2. **Part 2: Run & Annotate** - Test your bot and identify failure patterns  
3. **Part 3: Create Taxonomy** - Build structured failure mode categories

**Goal:** Systematically identify what goes wrong with your bot and why.


In [1]:
# Import required libraries
import os
import random
import warnings
from pathlib import Path

import openai
import pandas as pd
from dotenv import load_dotenv

import phoenix as px
from phoenix.evals import OpenAIModel, PromptTemplate, llm_generate

warnings.filterwarnings("ignore")

# Load environment variables
load_dotenv()


False

In [2]:
import getpass

if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass.getpass("Enter your OPENAI_API_KEY: ")

In [3]:
# Configuration
MODEL_NAME = "gpt-4o-mini"
OUTPUT_DIR = Path("./data")
OUTPUT_DIR.mkdir(exist_ok=True)

# Set up Phoenix OpenAI model
phoenix_model = OpenAIModel(model=MODEL_NAME, temperature=0.9)

print("✅ Setup complete - Ready for error analysis with Phoenix!")

✅ Setup complete - Ready for error analysis with Phoenix!


# Part 1: Define Dimensions & Generate Initial Queries

## Step 1.1: Identify Key Dimensions

Identify 3-4 key dimensions relevant to your Recipe Bot's functionality. For each dimension, list at least 3 example values.


In [4]:
# Define 4 key dimensions for Recipe Bot testing with specific values

DIMENSIONS = {
    "dietary_restriction": ["vegan", "vegetarian", "gluten-free", "keto", "no restrictions"],
    "cuisine_type": ["Italian", "Asian", "Mexican", "Mediterranean", "American", "any cuisine"],
    "meal_type": ["breakfast", "lunch", "dinner", "snack", "dessert"],
    "skill_level": ["beginner", "intermediate", "advanced"],
}

print("🎯 Defined key dimensions for Recipe Bot testing:")
for dim, values in DIMENSIONS.items():
    print(f"   {dim}: {', '.join(values)}")

print(
    f"\nTotal possible combinations: {len(DIMENSIONS['dietary_restriction']) * len(DIMENSIONS['cuisine_type']) * len(DIMENSIONS['meal_type']) * len(DIMENSIONS['skill_level'])}"
)

🎯 Defined key dimensions for Recipe Bot testing:
   dietary_restriction: vegan, vegetarian, gluten-free, keto, no restrictions
   cuisine_type: Italian, Asian, Mexican, Mediterranean, American, any cuisine
   meal_type: breakfast, lunch, dinner, snack, dessert
   skill_level: beginner, intermediate, advanced

Total possible combinations: 450


## Step 1.2: Generate Unique Combinations (Tuples)

Generate 15-20 unique combinations of these dimension values using programmatic sampling.


In [5]:
# Step 1: Generate diverse dimension tuples programmatically to ensure variety
print("🎯 Generating 25 diverse dimension tuples programmatically...")

# Draw each dimension value independently at random (duplicate tuples are possible)
dimension_tuples = []
random.seed(42)  # For reproducible results

# Generate 25 diverse tuples
for i in range(25):
    tuple_data = {
        "dietary_restriction": random.choice(DIMENSIONS["dietary_restriction"]),
        "cuisine_type": random.choice(DIMENSIONS["cuisine_type"]),
        "meal_type": random.choice(DIMENSIONS["meal_type"]),
        "skill_level": random.choice(DIMENSIONS["skill_level"]),
        "tuple_id": i + 1,
    }
    dimension_tuples.append(tuple_data)

print(f"✅ Generated {len(dimension_tuples)} diverse dimension tuples")

# Step 2: Show some examples to verify diversity
print("\n📋 Sample dimension tuples:")
for i in range(min(5, len(dimension_tuples))):
    tuple_data = dimension_tuples[i]
    print(
        f"\nTuple {i + 1}: {tuple_data['dietary_restriction']}, {tuple_data['cuisine_type']}, {tuple_data['meal_type']}, {tuple_data['skill_level']}"
    )

print(f"\n✅ Successfully created {len(dimension_tuples)} diverse dimension tuples")

🎯 Generating 25 diverse dimension tuples programmatically...
✅ Generated 25 diverse dimension tuples

📋 Sample dimension tuples:

Tuple 1: vegan, Italian, dinner, beginner

Tuple 2: vegetarian, Asian, breakfast, advanced

Tuple 3: no restrictions, Italian, dessert, intermediate

Tuple 4: vegan, Italian, breakfast, beginner

Tuple 5: vegetarian, American, dessert, beginner

✅ Successfully created 25 diverse dimension tuples
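
Because the loop above draws each value independently with `random.choice`, the same combination can appear more than once, so the 25 tuples are not guaranteed to be unique. A minimal sketch of one alternative, assuming uniqueness matters for your test set: enumerate every combination with `itertools.product` and sample without replacement. `sample_unique_tuples` is a hypothetical helper name, and `DIMENSIONS` is repeated here only so the block runs on its own.

```python
import itertools
import random

# Same values as the DIMENSIONS dict defined earlier in this notebook.
DIMENSIONS = {
    "dietary_restriction": ["vegan", "vegetarian", "gluten-free", "keto", "no restrictions"],
    "cuisine_type": ["Italian", "Asian", "Mexican", "Mediterranean", "American", "any cuisine"],
    "meal_type": ["breakfast", "lunch", "dinner", "snack", "dessert"],
    "skill_level": ["beginner", "intermediate", "advanced"],
}


def sample_unique_tuples(dimensions, k, seed=42):
    """Return k distinct combinations, each as a dict keyed by dimension name."""
    names = list(dimensions)
    # Enumerate the full Cartesian product (450 combinations for these dimensions).
    all_combos = list(itertools.product(*(dimensions[name] for name in names)))
    rng = random.Random(seed)  # local generator, so global random state is untouched
    chosen = rng.sample(all_combos, k)  # sampling without replacement
    return [dict(zip(names, combo), tuple_id=i + 1) for i, combo in enumerate(chosen)]


unique_tuples = sample_unique_tuples(DIMENSIONS, 25)
print(len(unique_tuples))
```

Enumerating the full product is cheap at this size. For much larger dimension spaces, rejection sampling into a set would avoid materializing every combination.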


## Step 1.3: Generate Natural Language User Queries

Take 5-7 of the generated tuples and create a natural language user query for your Recipe Bot for each selected tuple. Review these generated queries to ensure they are realistic and representative of how a user might interact with your bot.


In [10]:
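# Step 1: Shuffle the tuples; sampling all of them without replacement keeps every tuple
selected_tuples = random.sample(dimension_tuples, len(dimension_tuples))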

print(f"📝 Selected {len(selected_tuples)} dimension tuples for query generation")

# Step 2: Create dataframe for query generation
query_input = []
for tuple_data in selected_tuples:
    tuple_str = f"dietary_restriction: {tuple_data['dietary_restriction']}, cuisine_type: {tuple_data['cuisine_type']}, meal_type: {tuple_data['meal_type']}, skill_level: {tuple_data['skill_level']}"
    query_input.append(
        {
            # 'tuple_id': tuple_data['tuple_id'],
            "tuple_description": tuple_str,
            "dietary_restriction": tuple_data["dietary_restriction"],
            "cuisine_type": tuple_data["cuisine_type"],
            "meal_type": tuple_data["meal_type"],
            "skill_level": tuple_data["skill_level"],
        }
    )

query_df = pd.DataFrame(query_input)

# Step 3: Template for converting dimension tuples to natural language queries
query_template = PromptTemplate("""
Convert this dimension tuple into a realistic user query for a Recipe Bot:

Dimension tuple: {tuple_description}

Create a natural language query that a real user with these characteristics might ask. Be creative and vary your style significantly.

Vary your vocabulary, sentence structure, and level of detail. Generate 1 unique, realistic query:
""")

print("🎯 Converting dimension tuples to natural language queries...")

# Step 4: Generate the queries with higher temperature for variety
# (`model` replaces the deprecated `model_name` argument)
phoenix_model_creative = OpenAIModel(model=MODEL_NAME, temperature=0.9)

queries_result = llm_generate(
    dataframe=query_df, template=query_template, model=phoenix_model_creative
)

print(f"✅ Generated {len(queries_result)} queries from dimension tuples")

# Step 5: Show examples of tuple → query conversion
print("\n📋 Sample tuple → query conversions:")
for i in range(min(3, len(queries_result))):
    input_row = query_df.iloc[i]
    query_row = queries_result.iloc[i]
    # Clean the query for display too: drop whitespace and any mix of wrapping quotes
    clean_query = query_row["output"].strip(" \t\r\n\"'")
    print(f"\nTuple {i + 1}: {input_row['tuple_description']}")
    print(f"Query: {clean_query}")

# Step 6: Create final dataset with tuple information
final_data = []
for idx in range(len(queries_result)):
    query_row = queries_result.iloc[idx]
    original_input = query_df.iloc[idx]

    # Clean the query: drop whitespace and any mix of wrapping quotes in one pass
    clean_query = query_row["output"].strip(" \t\r\n\"'")

    final_data.append(
        {
            "id": f"SYN{idx + 1:03d}",
            "query": clean_query,
            "dietary_restriction": original_input["dietary_restriction"],
            "cuisine_type": original_input["cuisine_type"],
            "meal_type": original_input["meal_type"],
            "skill_level": original_input["skill_level"],
            "tuple_description": original_input["tuple_description"],
        }
    )

all_queries_df = pd.DataFrame(final_data)
print(f"\n🎯 Created dataset with {len(all_queries_df)} queries ready for testing!")

🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


📝 Selected 25 dimension tuples for query generation
🎯 Converting dimension tuples to natural language queries...
The `model_name` field is deprecated. Use `model` instead.                 This will be removed in a future release.


llm_generate |██████████| 25/25 (100.0%) | ⏳ 00:30<00:00 |  1.21s/it

✅ Generated 25 queries from dimension tuples

📋 Sample tuple → query conversions:

Tuple 1: dietary_restriction: gluten-free, cuisine_type: any cuisine, meal_type: dessert, skill_level: beginner
Query: '"Hey there! I'm on the hunt for an easy gluten-free dessert recipe that doesn't stick to any specific cuisine. I’m a beginner in the kitchen, so something simple and delicious would be perfect. Any suggestions?"'

Tuple 2: dietary_restriction: gluten-free, cuisine_type: any cuisine, meal_type: dessert, skill_level: intermediate
Query: '"Hey there! I'm looking for a delicious dessert recipe that’s gluten-free. I’m open to any type of cuisine, but I want something that will challenge my intermediate cooking skills a bit. Could you whip up an exciting recipe suggestion for me?"'

Tuple 3: dietary_restriction: vegan, cuisine_type: American, meal_type: lunch, skill_level: advanced
Query: 'I'm looking for an advanced vegan lunch recipe that captures the essence of American cuisine. Something 
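
The sample output above shows queries wrapped in nested quotes. A minimal sketch of a cleaning helper, assuming any mix of leading or trailing straight quotes and whitespace should be dropped while apostrophes inside the text are kept. `clean_query_text` is a hypothetical name, not part of Phoenix.

```python
def clean_query_text(raw: str) -> str:
    """Strip whitespace and straight quote characters from both ends of a model response."""
    # str.strip with a character set removes any mix of these characters from
    # each end, so nested wrappers (single quote outside, double quote inside)
    # are handled in one pass. Characters in the middle are left alone.
    return raw.strip(" \t\r\n\"'")


wrapped = "'\"Hey there! I'm a beginner. Any suggestions?\"'"
cleaned = clean_query_text(wrapped)
print(cleaned)
```

One trade-off: a query that legitimately ends with a quotation mark would lose it, which is usually acceptable for synthetic test queries.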




### Quality Review

Review the generated queries to make sure they're diverse and realistic: 

In [11]:
# Display all rows and columns, and show full text in each cell for all_queries_df
with pd.option_context(
    "display.max_rows", None, "display.max_columns", None, "display.max_colwidth", None
):
    display(all_queries_df)

Unnamed: 0,id,query,dietary_restriction,cuisine_type,meal_type,skill_level,tuple_description
0,SYN001,"Hey there! I'm on the hunt for an easy gluten-free dessert recipe that doesn't stick to any specific cuisine. I’m a beginner in the kitchen, so something simple and delicious would be perfect. Any suggestions?",gluten-free,any cuisine,dessert,beginner,"dietary_restriction: gluten-free, cuisine_type: any cuisine, meal_type: dessert, skill_level: beginner"
1,SYN002,"Hey there! I'm looking for a delicious dessert recipe that’s gluten-free. I’m open to any type of cuisine, but I want something that will challenge my intermediate cooking skills a bit. Could you whip up an exciting recipe suggestion for me?",gluten-free,any cuisine,dessert,intermediate,"dietary_restriction: gluten-free, cuisine_type: any cuisine, meal_type: dessert, skill_level: intermediate"
2,SYN003,"I'm looking for an advanced vegan lunch recipe that captures the essence of American cuisine. Something creative and challenging, perhaps a dish that utilizes innovative cooking techniques or unique plant-based ingredients. Any recommendations?",vegan,American,lunch,advanced,"dietary_restriction: vegan, cuisine_type: American, meal_type: lunch, skill_level: advanced"
3,SYN004,"I’m looking for an Italian dessert recipe that’s a bit challenging but not too complicated since I have some cooking experience. I have no dietary restrictions, so feel free to suggest something indulgent. Any recommendations?",no restrictions,Italian,dessert,intermediate,"dietary_restriction: no restrictions, cuisine_type: Italian, meal_type: dessert, skill_level: intermediate"
4,SYN005,"I'm looking for an advanced vegan breakfast recipe that has a Mediterranean flair. I've mastered some complex techniques in the kitchen, so I'm eager for a dish that will really challenge my skills and impress my guests. Any suggestions?",vegan,Mediterranean,breakfast,advanced,"dietary_restriction: vegan, cuisine_type: Mediterranean, meal_type: breakfast, skill_level: advanced"
5,SYN006,Hey there! I'm on the lookout for a vegetarian Asian snack recipe that isn't too beginner-friendly but also not too complicated. I’d love something that strikes a balance and showcases some authentic flavors. Any suggestions?,vegetarian,Asian,snack,intermediate,"dietary_restriction: vegetarian, cuisine_type: Asian, meal_type: snack, skill_level: intermediate"
6,SYN007,"Hey there, Recipe Bot! I'm looking to whip up a simple breakfast with an Asian twist, and I don't have any dietary restrictions. Can you suggest some easy recipes that are perfect for a cooking novice like me? Thanks!",no restrictions,Asian,breakfast,beginner,"dietary_restriction: no restrictions, cuisine_type: Asian, meal_type: breakfast, skill_level: beginner"
7,SYN008,I'm looking for a delicious vegan Mediterranean dinner recipe that's perfect for someone with an intermediate cooking skill level. Can you suggest something that showcases bold flavors and fresh ingredients? Maybe a dish that would impress my friends at a dinner party?,vegan,Mediterranean,dinner,intermediate,"dietary_restriction: vegan, cuisine_type: Mediterranean, meal_type: dinner, skill_level: intermediate"
8,SYN009,I'm looking for an intermediate-level gluten-free Asian lunch recipe that I can try out. Do you have any delicious options that use fresh ingredients and maybe a bit of spice? Thanks!,gluten-free,Asian,lunch,intermediate,"dietary_restriction: gluten-free, cuisine_type: Asian, meal_type: lunch, skill_level: intermediate"
9,SYN010,Hey there! I’m looking for a simple vegan Italian dinner recipe since I’m just starting out in the kitchen. Any recommendations for a dish that won’t overwhelm me but still has that delicious Italian flair? Thanks a bunch!,vegan,Italian,dinner,beginner,"dietary_restriction: vegan, cuisine_type: Italian, meal_type: dinner, skill_level: beginner"


### Save Dataset

Save the dataset for testing:

In [8]:
# Save the dataset to CSV for easy use
output_path = OUTPUT_DIR / "generated_synthetic_queries.csv"
all_queries_df.to_csv(output_path, index=False)

print(f"💾 Saved dataset to: {output_path}")
print(f"📊 Ready for testing with {len(all_queries_df)} queries!")

💾 Saved dataset to: data/generated_synthetic_queries.csv
📊 Ready for testing with 25 queries!


### Upload to Phoenix

You can either:
- **Option A:** Manually upload the CSV file to Phoenix UI
- **Option B:** Use the SDK upload below

In [13]:
# Upload the in-memory DataFrame; assumes a Phoenix server is already running
client = px.Client()
dataset = client.upload_dataset(
    dataframe=all_queries_df,
    dataset_name="recipe-bot-synthetic-queries",
    input_keys=["query"],
)

📤 Uploading dataset...
💾 Examples uploaded: http://127.0.0.1:6006/datasets/RGF0YXNldDox/examples
🗄️ Dataset version ID: RGF0YXNldFZlcnNpb246MQ==


## Part 1 Complete ✅

**What you now have:**
- 25 diverse test queries saved as CSV
- Dataset uploaded to Phoenix (ready for testing)
- Systematic coverage across key user dimensions  

**Next steps:**
1. Go to Phoenix UI
2. Run your Recipe Bot on these queries
3. Annotate problems you find
4. Come back to this notebook for analysis


# Part 2: Initial Error Analysis

## Step 2.1: Run Bot on Synthetic Queries

1. **Upload Dataset**: Load your synthetic queries into the Phoenix playground
2. **Configure Bot**: Import your Recipe Bot prompt 
3. **Run Tests**: Execute all queries through your bot
4. **Record Results**: Save the interaction traces

## Step 2.2: Open Coding

Review the recorded traces and perform open coding to identify themes, patterns, and potential errors in your bot's responses.

**What to look for:**
- Factual errors or incorrect recommendations
- Confusing or unhelpful responses
- Inconsistent behavior across similar queries
- Format and communication issues

**How to annotate:**
- Be specific about what went wrong
- Note why something is problematic for users 




# Part 3: Axial Coding & Taxonomy Definition

## Step 3.1: Export Annotated Traces

Export your annotated traces and annotations from Phoenix.


In [32]:
# Fetch the project's spans as a pandas DataFrame
from phoenix.client import Client

client = Client()

# Query for spans that have notes
# query = SpanQuery().where("annotations['note']")
spans = client.spans.get_spans_dataframe(
    # query=query,
    project_identifier="UHJvamVjdDoy"
)

spans.reset_index(drop=True, inplace=True)

spans.head()

Unnamed: 0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,context.span_id,context.trace_id,...,attributes.llm.token_count.prompt_details.cache_read,attributes.llm.token_count.prompt,attributes.llm.invocation_parameters,attributes.url,attributes.llm.token_count.completion,attributes.input.mime_type,attributes.metadata,attributes.llm.token_count.prompt_details.audio,attributes.output.mime_type,attributes.llm.system
0,ChatCompletion,LLM,,2025-07-30 05:02:54.680996+00:00,2025-07-30 05:03:06.042090+00:00,OK,,[],16e2eccbb4fd49ca,1800389113fe83f890dfc1c81c7994ce,...,0,397,"{""temperature"": 1.0, ""top_p"": 1.0}","{'path': 'chat/completions', 'full': 'https://...",387,application/json,{'phoenix_prompt_id': 'recipe-assistant-v1-tes...,0,text/plain,openai
1,ChatCompletion,LLM,,2025-07-30 05:02:54.681153+00:00,2025-07-30 05:03:08.212350+00:00,OK,,[],7138c4e0fffc1305,bce5739e6fda4b5a8e4828c13e0b0a5c,...,0,388,"{""temperature"": 1.0, ""top_p"": 1.0}","{'path': 'chat/completions', 'full': 'https://...",456,application/json,{'phoenix_prompt_id': 'recipe-assistant-v1-tes...,0,text/plain,openai
2,ChatCompletion,LLM,,2025-07-30 05:02:54.680560+00:00,2025-07-30 05:03:11.375935+00:00,OK,,[],41565909b57f1b2b,32e80f33cf5594698b13c7aa19a30dbb,...,0,407,"{""temperature"": 1.0, ""top_p"": 1.0}","{'path': 'chat/completions', 'full': 'https://...",666,application/json,{'phoenix_prompt_id': 'recipe-assistant-v1-tes...,0,text/plain,openai
3,ChatCompletion,LLM,,2025-07-30 05:03:06.057379+00:00,2025-07-30 05:03:18.969275+00:00,OK,,[],26cf736bceaea3dc,b8cc683e595ef3748ac3b05f93f09fb4,...,0,398,"{""temperature"": 1.0, ""top_p"": 1.0}","{'path': 'chat/completions', 'full': 'https://...",473,application/json,{'phoenix_prompt_id': 'recipe-assistant-v1-tes...,0,text/plain,openai
4,ChatCompletion,LLM,,2025-07-30 05:03:08.213397+00:00,2025-07-30 05:03:19.429567+00:00,OK,,[],ad65c522d55926b7,f3b8ca911c8fa652aaab22fc01676190,...,0,390,"{""temperature"": 1.0, ""top_p"": 1.0}","{'path': 'chat/completions', 'full': 'https://...",468,application/json,{'phoenix_prompt_id': 'recipe-assistant-v1-tes...,0,text/plain,openai


In [33]:
# Then get all annotations (including notes) for these spans
annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans,
    project_identifier="UHJvamVjdDoy",
    exclude_annotation_names=[],  # Include everything
)

# Reset index to make the index a column
annotations_df = annotations_df.reset_index()

annotations_df.head()

Unnamed: 0,span_id,annotation_name,annotator_kind,metadata,identifier,id,created_at,updated_at,source,user_id,result.label,result.score,result.explanation
0,4df612fcfe75b0b6,note,HUMAN,{},px-span-note:2025-07-30T00:20:28.839601,U3BhbkFubm90YXRpb246OQ==,2025-07-30T06:20:28+00:00,2025-07-30T06:20:28+00:00,APP,,,,Ignores dietary restrictions
1,a95eb1dbcd5e729e,note,HUMAN,{},px-span-note:2025-07-29T23:18:36.864250,U3BhbkFubm90YXRpb246Ng==,2025-07-30T05:18:36+00:00,2025-07-30T05:18:36+00:00,APP,,,,Incorrect formatting
2,3ef7a57bae611fbf,note,HUMAN,{},px-span-note:2025-07-29T23:16:40.973211,U3BhbkFubm90YXRpb246NA==,2025-07-30T05:16:40+00:00,2025-07-30T05:16:40+00:00,APP,,,,Recipe provided is a bit complicated for the u...
3,6587e423ae0948ab,note,HUMAN,{},px-span-note:2025-07-29T23:09:10.677786,U3BhbkFubm90YXRpb246Mg==,2025-07-30T05:09:10+00:00,2025-07-30T05:09:10+00:00,APP,,,,Asks for breakfast but a lunch or dinner is su...
4,4e6a458fbd57e2d4,note,HUMAN,{},px-span-note:2025-07-29T23:20:36.659177,U3BhbkFubm90YXRpb246OA==,2025-07-30T05:20:36+00:00,2025-07-30T05:20:36+00:00,APP,,,,Does not include unique ingredients like the u...


In [34]:
combined_df = pd.merge(
    spans,
    annotations_df,
    left_on="context.span_id",
    right_on="span_id",
    how="right",  # Keep every annotation row; spans without annotations are dropped
)[
    [
        "context.trace_id",
        "result.explanation",
        "attributes.llm.input_messages",
        "attributes.llm.output_messages",
    ]
]
combined_df.head()

Unnamed: 0,context.trace_id,result.explanation,attributes.llm.input_messages,attributes.llm.output_messages
0,80cd60cc61bd68c53f44e2986a800da9,Ignores dietary restrictions,"[{'message.role': 'system', 'message.content':...","[{'message.role': 'assistant', 'message.conten..."
1,501cc758aec7b0a713e2fd37a83052c9,Incorrect formatting,"[{'message.role': 'system', 'message.content':...","[{'message.role': 'assistant', 'message.conten..."
2,81a7d4e6c801603fd26e543eaa2507f7,Recipe provided is a bit complicated for the u...,"[{'message.role': 'system', 'message.content':...","[{'message.role': 'assistant', 'message.conten..."
3,8dc30dea4ca3d23a321726afc460923d,Asks for breakfast but a lunch or dinner is su...,"[{'message.role': 'system', 'message.content':...","[{'message.role': 'assistant', 'message.conten..."
4,4e2aa436bf98de2ac8632d2076ae472a,Does not include unique ingredients like the u...,"[{'message.role': 'system', 'message.content':...","[{'message.role': 'assistant', 'message.conten..."


## Step 3.2: Axial Coding & Taxonomy Definition

Group your observations from open coding into broader categories or failure modes. **We'll use an LLM to make this easier!**

**What the LLM will do:**
1. **Find Patterns**: Analyze all your annotations to identify common themes
2. **Create Categories**: Generate 4-6 systematic failure mode labels
3. **Apply Labels**: Classify each trace using the discovered failure modes

**What you'll get:**
- **Clear Title** for each failure mode
- **One-sentence Definition** explaining the failure
- **1-2 Examples** from your actual bot traces
- **Labeled dataset** with each trace classified

**Example failure modes:**
- "Dietary Mismatch" - Bot suggests food that violates stated dietary restrictions
- "Missing Steps" - Recipe instructions are incomplete or unclear
- "Wrong Context" - Bot misunderstands what the user is asking for


In [None]:
prompt = f"""
You are analyzing Recipe Bot failures. Look at these examples where a user queried the bot, the bot responded, and an analyst (me) described what went wrong.

EXAMPLES:
{combined_df.to_json(orient="records", lines=True)}

Based on the patterns you see in the analyst's descriptions of what went wrong, create 4-6 systematic failure mode labels that would be useful for categorizing these types of issues.

Each label should:
- Be short and clear (2 words max)
- Capture a distinct type of failure pattern
- Be applicable to multiple traces

Respond with a list of failure mode labels: ["label1", "label2", "label3", "label4", "label5", "label6"]
"""  # noqa: E501


# Use a separate name so the Phoenix `client` created earlier is not overwritten
openai_client = openai.OpenAI()
response = openai_client.chat.completions.create(
    model=MODEL_NAME,
    messages=[{"role": "user", "content": prompt}],
    temperature=0.3,
    max_tokens=1000,
)

response_content = response.choices[0].message.content

# result = json.loads(response_content)
# failure_modes = result.get('failure_modes', [])
# print(failure_modes)

print(response_content)

["Dietary Ignored", "Formatting Error", "Complexity Mismatch", "Meal Type Mismatch", "Ingredient Omission", "Skill Level Misalignment"]


In [29]:
import ast

failure_mode_labels = ast.literal_eval(response_content)

print(failure_mode_labels)

['Dietary Ignored', 'Formatting Error', 'Complexity Mismatch', 'Meal Type Mismatch', 'Ingredient Omission', 'Skill Level Misalignment']
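
`ast.literal_eval` works when the response is exactly a list literal, as it is above, but it raises if the model adds a sentence around the list. A sketch of a more tolerant parser, assuming the response contains a single JSON-style list of double-quoted strings. `parse_label_list` is a hypothetical helper, and the sample response is made up for illustration.

```python
import json
import re


def parse_label_list(response_text):
    """Extract a JSON list of strings from a model response that may contain extra prose."""
    # Greedy match from the first '[' to the last ']' in the response.
    match = re.search(r"\[.*\]", response_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no list found in model response")
    labels = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    if not isinstance(labels, list) or not all(isinstance(item, str) for item in labels):
        raise ValueError("expected a JSON list of strings")
    return [label.strip() for label in labels]


sample_response = 'Here are the labels: ["Dietary Ignored", "Formatting Error"]'
labels = parse_label_list(sample_response)
print(labels)
```

Raising on malformed output, rather than returning an empty list, makes a bad response visible before the labels are used for classification.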


In [None]:
# Create template for applying labels
classification_template = PromptTemplate(f"""
Look at this Recipe Bot interaction and the analyst's description of what went wrong.
Apply the most appropriate failure mode label(s) from the provided options.

USER QUERY: {{attributes.llm.input_messages}}
BOT RESPONSE: {{attributes.llm.output_messages}}
ANALYST'S ISSUE DESCRIPTION: {{result.explanation}}

AVAILABLE FAILURE MODE LABELS:
{failure_mode_labels}

Based on the analyst's description of the issue, pick the failure mode that best applies to this case.

Respond with just the label name
""")

# Run llm_generate for classification


results = llm_generate(dataframe=combined_df, template=classification_template, model=phoenix_model)

results.head()

🐌!! If running inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.
llm_generate |██████████| 9/9 (100.0%) | ⏳ 00:05<00:00 |  1.75it/s


Unnamed: 0,output
0,Dietary Ignored
1,Formatting Error
2,Complexity Mismatch
3,Meal Type Mismatch
4,Ingredient Omission


In [31]:
# Count the occurrences of each failure mode label in the results
label_counts = results["output"].value_counts()
label_counts

output
Complexity Mismatch         2
Meal Type Mismatch          2
Ingredient Omission         2
Dietary Ignored             1
Formatting Error            1
Skill Level Misalignment    1
Name: count, dtype: int64
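
The counts above are only meaningful if every model output matches one of the allowed labels exactly. A sketch of a validation step, assuming a case-insensitive match after trimming whitespace, quotes, and trailing periods is acceptable. `match_label`, the `"Unmatched"` fallback, and the `raw_outputs` values are illustrative, not taken from the run above.

```python
from collections import Counter

FAILURE_MODE_LABELS = [
    "Dietary Ignored",
    "Formatting Error",
    "Complexity Mismatch",
    "Meal Type Mismatch",
    "Ingredient Omission",
    "Skill Level Misalignment",
]


def match_label(output, allowed, fallback="Unmatched"):
    """Map a raw model output onto an allowed label, or return the fallback."""
    cleaned = output.strip(" \t\r\n\"'.").casefold()
    lookup = {label.casefold(): label for label in allowed}
    return lookup.get(cleaned, fallback)


# Made-up outputs showing the kinds of drift a free-text response can have.
raw_outputs = ["Dietary Ignored", "formatting error.", '"Meal Type Mismatch"', "Something else"]
matched = [match_label(output, FAILURE_MODE_LABELS) for output in raw_outputs]
counts = Counter(matched)
print(counts)
```

Rows that land in the fallback bucket are worth reviewing by hand, since they may point to a missing category in the taxonomy.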

In [24]:
# Join the labels to combined_df on the index and rename 'output' to 'failure_mode'
final_data = combined_df.join(results.rename(columns={"output": "failure_mode"}))
final_data.head()

final_data.to_csv("labeled_synthetic_data.csv", index=False)



# Summary & Expected Outputs

## What You'll Create

**Files you'll generate:**
- `generated_synthetic_queries.csv` - Your test dataset  
- `labeled_synthetic_data.csv` - Your final analysis with failure mode labels

## Steps to Complete

1. **Run Part 1 code** - Generate test queries and upload to Phoenix
2. **Part 2 (Phoenix UI)** - Run your prompt on queries, annotate problems with open coding  
3. **Run Part 3 code** - Export traces, use LLM to discover patterns and create taxonomy

## What Part 3 Creates

The LLM analysis will automatically generate:
- Failure mode categories discovered from your annotations
- Systematic classification of each trace
- Complete taxonomy with definitions and examples
- Analysis spreadsheet with binary failure mode columns
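
The Part 3 code stores a single label per trace. One way to derive binary failure mode columns from such labels is sketched here with plain dictionaries and the standard `csv` module. The sample labels are illustrative, and `to_binary_rows` is a hypothetical helper; with pandas, `pd.get_dummies` on the label column gives a similar result.

```python
import csv
import io

FAILURE_MODES = ["Dietary Ignored", "Formatting Error", "Complexity Mismatch"]


def to_binary_rows(labels, failure_modes):
    """Return one dict per trace with a 0/1 column for each failure mode."""
    return [{mode: int(label == mode) for mode in failure_modes} for label in labels]


# Illustrative labels, one per trace.
sample_labels = ["Dietary Ignored", "Complexity Mismatch", "Complexity Mismatch"]
binary_rows = to_binary_rows(sample_labels, FAILURE_MODES)

# Write to an in-memory buffer; use open(path, "w", newline="") for a real file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FAILURE_MODES)
writer.writeheader()
writer.writerows(binary_rows)
print(buffer.getvalue())
```

Summing each column then gives the number of traces per failure mode, which is a quick way to rank which failures to address first.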
