# Advent of Haystack with Arize Phoenix

Santa collapsed in his chair in a huff. "What's wrong?" asked Mrs Claus.

"There's just too many toys to check and not enough time! Christmas is almost here!"

"Well, can't you just check some of them?"

"I wish it were that easy! But my elves make so many different toys, and we have to make sure every kid gets the right one!"

Elf Jane couldn't help overhearing from the next room. A regular at the local North Pole hackathon, she had learned a lot about evaluation recently and thought she could build an LLM Judge to help.

In [1]:
! pip install -q arize-phoenix==6.1.0 haystack-ai==2.7.0 openinference-instrumentation-haystack==0.1.13

Elf Jane started by checking out the Big Elf Database of Christmas Wishlists (aka the BEDCW).

In [2]:
children = [
    {'name': 'Timmy', 'age': 7, 'likes': 'Lego', 'dislikes': 'Vegetables', 'list': 'nice'},
    {'name': 'Tommy', 'age': 9, 'likes': 'Sports Equipment', 'dislikes': 'Reading', 'list': 'naughty'},
    {'name': 'Tammy', 'age': 8, 'likes': 'Art Supplies', 'dislikes': 'Loud Noises', 'list': 'nice'}, 
    {'name': 'Tina', 'age': 6, 'likes': 'Science Kits', 'dislikes': 'Spicy Food', 'list': 'nice'},
    {'name': 'Toby', 'age': 10, 'likes': 'Video Games', 'dislikes': 'Early Mornings', 'list': 'nice'},
    {'name': 'Tod', 'age': 5, 'likes': 'Musical Instruments', 'dislikes': 'Bath Time', 'list': 'nice'},
    {'name': 'Todd', 'age': 8, 'likes': 'Remote Control Cars', 'dislikes': 'Homework', 'list': 'naughty'},
    {'name': 'Tara', 'age': 7, 'likes': 'Magic Sets', 'dislikes': 'Thunder', 'list': 'nice'},
    {'name': 'Teri', 'age': 9, 'likes': 'Building Blocks', 'dislikes': 'Broccoli', 'list': 'nice'},
    {'name': 'Trey', 'age': 6, 'likes': 'Board Games', 'dislikes': 'Bedtime', 'list': 'nice'},
    {'name': 'Tyler', 'age': 8, 'likes': 'Action Figures', 'dislikes': 'Cleaning', 'list': 'nice'},
    {'name': 'Tracy', 'age': 7, 'likes': 'Dolls', 'dislikes': 'Dark', 'list': 'nice'},
    {'name': 'Tony', 'age': 9, 'likes': 'Chemistry Sets', 'dislikes': 'Dentist', 'list': 'nice'},
    {'name': 'Theo', 'age': 6, 'likes': 'Puzzles', 'dislikes': 'Shots', 'list': 'nice'},
    {'name': 'Terry', 'age': 10, 'likes': 'Model Trains', 'dislikes': 'Chores', 'list': 'naughty'},
    {'name': 'Tessa', 'age': 5, 'likes': 'Stuffed Animals', 'dislikes': 'Time Out', 'list': 'nice'},
    {'name': 'Troy', 'age': 8, 'likes': 'Robots', 'dislikes': 'Naps', 'list': 'nice'},
    {'name': 'Talia', 'age': 7, 'likes': 'Craft Kits', 'dislikes': 'Spinach', 'list': 'nice'},
    {'name': 'Tyson', 'age': 9, 'likes': 'Microscopes', 'dislikes': 'Cold', 'list': 'nice'},
    {'name': 'Tatum', 'age': 6, 'likes': 'Drawing Sets', 'dislikes': 'Shots', 'list': 'nice'},
]
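Before any toys get made, a quick sanity check on the wishlist data can catch missing fields early. The sketch below is not part of Elf Jane's notebook: `REQUIRED_FIELDS`, `summarize_wishlist`, and the three-record `sample_children` excerpt are names made up for illustration.

```python
from collections import Counter

# Fields each wishlist record is expected to carry (assumed from the records above).
REQUIRED_FIELDS = {"name", "age", "likes", "dislikes", "list"}


def summarize_wishlist(records):
    """Check every record has the required fields, then count nice vs naughty."""
    for record in records:
        missing = REQUIRED_FIELDS - set(record)
        if missing:
            raise ValueError(f"{record.get('name', '<unknown>')} is missing {sorted(missing)}")
    return Counter(record["list"] for record in records)


# Three-record excerpt of the BEDCW, used here so the sketch runs on its own.
sample_children = [
    {"name": "Timmy", "age": 7, "likes": "Lego", "dislikes": "Vegetables", "list": "nice"},
    {"name": "Tommy", "age": 9, "likes": "Sports Equipment", "dislikes": "Reading", "list": "naughty"},
    {"name": "Tammy", "age": 8, "likes": "Art Supplies", "dislikes": "Loud Noises", "list": "nice"},
]

counts = summarize_wishlist(sample_children)
print(counts)
```

Calling `summarize_wishlist(children)` on the full list should report 17 nice and 3 naughty children.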

# 1. Adding Tracing

Elf Jane knew that the elves were busy and didn't always log their toy-making process, so she would first need to trace it using Arize Phoenix.

In [3]:
from getpass import getpass
import os

api_key = getpass("Enter your Arize Phoenix API key: ")

from phoenix.otel import register
from openinference.instrumentation.haystack import HaystackInstrumentor

os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com/v1/traces"
os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={api_key}"

tracer_provider = register(project_name="adventofhaystack")
HaystackInstrumentor().instrument(tracer_provider=tracer_provider)


🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: adventofhaystack
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {'authorization': '****', 'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



# 2. Trace Toy Making Process

With tracing in place, Elf Jane had some of her closest elf friends build a batch of toys she could trace.

In [4]:
import os

os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

In [5]:
from haystack.dataclasses import ChatMessage
from haystack.components.generators.chat import OpenAIChatGenerator
from haystack import Pipeline

toy_maker_prompt = """
Create a toy for {name} that they will like. {name} is {age} years old and likes {likes} and dislikes {dislikes}.
If the child is on the naughty list, give them a 'Rabbit R1'. {name} is on the {list} list.
"""


def make_toy(child):
    """Ask the toy maker LLM for a toy suited to one child record."""
    # Fill the prompt with this child's name, age, likes, dislikes, and list.
    message = toy_maker_prompt.format(**child)

    messages = [
        ChatMessage.from_system("You are a toy maker elf. Your job is to make toys for the nice kids on the nice list."),
        ChatMessage.from_user(message),
    ]

    # Wrap the generator in a one-component pipeline and run it.
    chat_generator = OpenAIChatGenerator(model="gpt-4o-mini")
    pipeline = Pipeline()
    pipeline.add_component("chat_generator", chat_generator)

    return pipeline.run({"messages": messages})["chat_generator"]["replies"]

In [6]:
for child in children:
    make_toy(child)
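
To see exactly what the toy maker model receives, the prompt can be rendered for a single child without calling the API. This sketch repeats the template and one record from above so that it runs on its own; `tommy` and `rendered` are illustrative names, not part of the original notebook.

```python
# Same template as in the cell above, repeated so the sketch is self-contained.
toy_maker_prompt = """
Create a toy for {name} that they will like. {name} is {age} years old and likes {likes} and dislikes {dislikes}.
If the child is on the naughty list, give them a 'Rabbit R1'. {name} is on the {list} list.
"""

# One record from the wishlist; the dict keys match the template placeholders.
tommy = {"name": "Tommy", "age": 9, "likes": "Sports Equipment", "dislikes": "Reading", "list": "naughty"}

# ** unpacks the dict into keyword arguments, one per placeholder.
rendered = toy_maker_prompt.format(**tommy)
print(rendered)
```

A record missing any of the five keys would raise a `KeyError` at the `format` call, before any request is sent to the model.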

# 3. Evaluate Toy Correctness

Elf Jane was now ready to evaluate the toys her friends had made. She knew that she could use an LLM Judge to check whether each toy matched the child's wishlist, so she started by building a judge.

In [7]:
llm_judge_prompt = """
Evaluate the toy for this child, based on their likes and dislikes

All children on the naughty list get a 'Rabbit R1'. Any other toy given to a naughty child is incorrect.

Respond with a single word: 'correct' or 'incorrect'. Also include a short explanation for your answer.

Description of the child: {description}
Toy: {toy}

*****
Example output:
label: 'correct'
explanation: 'The toy is a Lego set, which is one of the child's likes.'
*****
"""

In [14]:
import phoenix as px

# Pull every LLM span recorded for the project.
spans_df = px.Client().get_spans_dataframe(
    project_name="adventofhaystack",
    filter_condition="span_kind == 'LLM'",
)

# Name the columns after the placeholders in llm_judge_prompt.
spans_df["description"] = spans_df["attributes.input.value"]
spans_df["toy"] = spans_df["attributes.output.value"]
spans_df.head()

| context.span_id | name | span_kind | parent_id | start_time | end_time | status_code | status_message | events | context.span_id | context.trace_id | ... | attributes.llm.input_messages | attributes.llm.model_name | attributes.llm.output_messages | attributes.input.value | attributes.llm.token_count.completion | attributes.input.mime_type | attributes.llm.token_count.total | attributes.output.mime_type | description | toy |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 86aa734939e598af | OpenAIChatGenerator (chat_generator) | LLM | 631841072259833e | 2024-12-06 02:03:45.274561+00:00 | 2024-12-06 02:03:49.657018+00:00 | OK | | [] | 86aa734939e598af | fe75e92e34351d7c36a7819fd0aafb0d | ... | [{'message.content': 'You are a toy maker elf.... | gpt-4o-mini-2024-07-18 | [{'message.content': 'For Timmy, who loves Leg... | {"messages": ["ChatMessage(content='You are a ... | 258 | application/json | 343 | application/json | {"messages": ["ChatMessage(content='You are a ... | {"replies": ["ChatMessage(content='For Timmy, ... |
| eb8045ca610e80d3 | OpenAIChatGenerator (chat_generator) | LLM | cac8bac73dea2014 | 2024-12-06 02:03:49.672163+00:00 | 2024-12-06 02:03:52.988274+00:00 | OK | | [] | eb8045ca610e80d3 | 7f60a377a0dcc0f5f0e4f1dfe8097139 | ... | [{'message.content': 'You are a toy maker elf.... | gpt-4o-mini-2024-07-18 | [{'message.content': 'Since Tommy is on the na... | {"messages": ["ChatMessage(content='You are a ... | 260 | application/json | 342 | application/json | {"messages": ["ChatMessage(content='You are a ... | {"replies": ["ChatMessage(content=\"Since Tomm... |
| 7fe8ab6a746168e7 | OpenAIChatGenerator (chat_generator) | LLM | 467def3ddccd7179 | 2024-12-06 02:03:53.003857+00:00 | 2024-12-06 02:03:56.797269+00:00 | OK | | [] | 7fe8ab6a746168e7 | 81be5f9f21518134162a062d9bd0bc23 | ... | [{'message.content': 'You are a toy maker elf.... | gpt-4o-mini-2024-07-18 | [{'message.content': 'For Tammy, I’d create a ... | {"messages": ["ChatMessage(content='You are a ... | 267 | application/json | 351 | application/json | {"messages": ["ChatMessage(content='You are a ... | {"replies": ["ChatMessage(content='For Tammy, ... |
| 055c19a0e70830cd | OpenAIChatGenerator (chat_generator) | LLM | 4a5c7f2587b65bc7 | 2024-12-06 02:03:56.814081+00:00 | 2024-12-06 02:04:00.596804+00:00 | OK | | [] | 055c19a0e70830cd | 4da55db438c464e8caf38cf08e79155d | ... | [{'message.content': 'You are a toy maker elf.... | gpt-4o-mini-2024-07-18 | [{'message.content': 'For Tina, who is 6 years... | {"messages": ["ChatMessage(content='You are a ... | 245 | application/json | 329 | application/json | {"messages": ["ChatMessage(content='You are a ... | {"replies": ["ChatMessage(content='For Tina, w... |
| eb993dc4ae7b8a24 | OpenAIChatGenerator (chat_generator) | LLM | c820edf06627537e | 2024-12-06 02:04:00.615559+00:00 | 2024-12-06 02:04:03.812024+00:00 | OK | | [] | eb993dc4ae7b8a24 | 6f8ab6ac17d73e8f7fa984a68ad60da4 | ... | [{'message.content': 'You are a toy maker elf.... | gpt-4o-mini-2024-07-18 | [{'message.content': 'For Toby, I’ve designed ... | {"messages": ["ChatMessage(content='You are a ... | 233 | application/json | 318 | application/json | {"messages": ["ChatMessage(content='You are a ... | {"replies": ["ChatMessage(content='For Toby, I... |
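
Both `description` and `toy` hold raw JSON strings rather than plain text. The judge can still read them, but if a cleaner input is wanted, the first reply can be pulled out with `json.loads`. This is a sketch under assumptions: `first_reply` is a hypothetical helper, and `sample_output` is a made-up value shaped like the truncated `attributes.output.value` cells in the preview.

```python
import json


def first_reply(output_value):
    """Return the first entry of the "replies" list in a span's JSON output."""
    try:
        payload = json.loads(output_value)
    except json.JSONDecodeError:
        # Not valid JSON: hand the raw string back unchanged.
        return output_value
    replies = payload.get("replies", []) if isinstance(payload, dict) else []
    # Fall back to the raw string if the payload has no replies.
    return replies[0] if replies else output_value


# Made-up example in the same shape as the cells in the preview.
sample_output = json.dumps({"replies": ["ChatMessage(content='For Timmy, a Lego castle set.')"]})

print(first_reply(sample_output))
```

Applied to the dataframe, this could look like `spans_df["toy"] = spans_df["attributes.output.value"].apply(first_reply)`.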


In [20]:
from phoenix.evals import (
    llm_classify,
    OpenAIModel
)
import nest_asyncio
nest_asyncio.apply()


eval_results = llm_classify(
    dataframe=spans_df,
    model=OpenAIModel(model="gpt-4o-mini"),
    template=llm_judge_prompt,
    provide_explanation=True,
    rails=["correct", "incorrect"]
)
eval_results["score"] = eval_results["label"].apply(lambda x: 1 if x == "correct" else 0)
eval_results.head()


llm_classify |██████████| 20/20 (100.0%) | ⏳ 00:05<00:00 |  3.66it/s


| context.span_id | label | explanation | exceptions | execution_status | execution_seconds | score |
| --- | --- | --- | --- | --- | --- | --- |
| 86aa734939e598af | correct | The toy is a Lego set, which aligns with Timmy... | [] | COMPLETED | 0.868079 | 1 |
| eb8045ca610e80d3 | correct | The toy is a 'Rabbit R1', which is the correct... | [] | COMPLETED | 1.134264 | 1 |
| 7fe8ab6a746168e7 | correct | The toy is an 'Art Adventure Kit', which align... | [] | COMPLETED | 1.462248 | 1 |
| 055c19a0e70830cd | correct | The toy is a 'Tina's Super Science Adventure K... | [] | COMPLETED | 1.562002 | 1 |
| eb993dc4ae7b8a24 | correct | The toy is a 'Pixel Pal Creator', which aligns... | [] | COMPLETED | 1.616854 | 1 |
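
With twenty toys judged, a one-line summary is handier than scanning rows by eye. The sketch below computes the fraction judged correct and lists the spans that deserve a second look. `summarize_labels` is a hypothetical helper, and `sample_labels` is invented data (including the failing entry) so the example runs without Phoenix.

```python
def summarize_labels(labels_by_span):
    """Return the fraction judged correct and the span ids that need a second look."""
    flagged = [span_id for span_id, label in labels_by_span.items() if label != "correct"]
    total = len(labels_by_span)
    accuracy = (total - len(flagged)) / total if total else 0.0
    return accuracy, flagged


# Hypothetical labels keyed by span id; the second entry is invented to show a failure.
sample_labels = {
    "span-a": "correct",
    "span-b": "incorrect",
    "span-c": "correct",
    "span-d": "correct",
}

accuracy, flagged = summarize_labels(sample_labels)
print(f"{accuracy:.0%} correct; review: {flagged}")
```

Because `eval_results` is indexed by span id, `eval_results["label"].to_dict()` would produce a mapping of this shape, and `eval_results["score"].mean()` gives the same fraction directly.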


In [21]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(SpanEvaluations(
    eval_name="Toy Correctness", 
    dataframe=eval_results,
))


# 4. View the results in the Arize Phoenix UI

And just like that, Elf Jane had saved Santa hours of time and made sure every kid got the right toy!

In Phoenix, she could see "correct" and "incorrect" labels on all the traces, and even see the explanations for each label!

She couldn't wait to show Santa and all her friends at the hackathon.