# Prerequisites

To get the most value from this tutorial, you will need:

**Phoenix Developer Account:**
*  An active Phoenix Developer Edition account and API key
*  Follow the instructions in our [quickstart guide](https://docs.arize.com/phoenix/quickstart) to set up your account

**Technical Knowledge:**

* Basic understanding of Generative AI and LLM concepts
* Experience with Python programming
* Familiarity with using LLM endpoints and integrating LLMs in Python applications

**Python Environment:**
* This notebook is intended to be run as a Google Colab Notebook using Python version 3.11.11


# Introduction

When it comes to designing real-world applications using Large Language Models (LLMs), evaluation has emerged as one of the greatest challenges. With the stochastic, unstructured outputs of generative AI, traditional approaches to AI evaluation simply don't work in these scenarios.

The obvious solution is for humans to evaluate generated responses. However, human evaluation is a long and laborious task, and it rather defeats the purpose of using an AI solution if evaluation must be done for every single generated response.

Many AI application developers have instead turned to automated evaluators, ranging from rules-based systems to model-based approaches to using another LLM as an evaluator. These automated approaches scale up evaluation immensely; however, a key question remains: How can we be sure that we can trust these automated evaluators? This is particularly important for highly complex or high-risk use cases.

An increasingly popular solution to this problem is for expert humans to annotate a small evaluation dataset, known as a **Golden Dataset**. This dataset consists of a carefully selected set of samples that can be reasonably annotated by humans and used for initial evaluation. Once the Golden Dataset is ready, it can be used to develop and validate reliable automated evaluators.

**This notebook provides a tutorial on how to use Arize Phoenix to create and manage Golden Datasets for your GenAI application through annotation.**



## What Will You Learn?
In this tutorial, you'll learn how to:

**Set Up Your Environment:**
* Configure your API credentials
* Connect to Phoenix

**Design Your Evaluation Framework:**
* Create evaluation prompt sets
* Implement a Relationship Extraction application using DSPy
* Track and analyze traces in Phoenix

**Manage Annotations:**
* Use the Phoenix UI for annotation
* Programmatically annotate through the Python API
* Implement user feedback collection

**Create a Golden Dataset:**
* Build a Golden Dataset using the Phoenix UI

**What's Next:**
* Review next steps and advanced features
* Learn about additional evaluation strategies




## The Task: Relationship Extraction

Relationship Extraction is a challenging AI task that often requires complex, multi-step reasoning, making it an increasingly popular use case for LLMs. The objective is to identify relationships between entities of interest within unstructured text. In real-world applications, it's used for tasks such as parsing relationships within lengthy legal documents to automate painstaking business processes.

For this tutorial, we'll focus on a simplified version of the task: identifying person-to-person relationships within short text snippets.

For example, consider this sentence: **"Maria's friend John is her cousin Bert's ex-husband."**

From this sentence, we can extract the following relationships:

          {
              "subject": "Maria",
              "relation": "friend",
              "object": "John"
          }
          {
              "subject": "John",
              "relation": "friend",
              "object": "Maria"
          }
          {
              "subject": "Maria",
              "relation": "cousin",
              "object": "Bert"
          }
          {
              "subject": "Bert",
              "relation": "cousin",
              "object": "Maria"
          }
          {
              "subject": "Bert",
              "relation": "ex-husband",
              "object": "John"
          }
          {
              "subject": "John",
              "relation": "ex-husband",
              "object": "Bert"
          }
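
A minimal sketch of how triples like the ones above might be represented and sanity-checked in Python. The helper name `is_valid_triple` and the constant `REQUIRED_KEYS` are hypothetical and are not part of Phoenix or DSPy.

```python
# Hypothetical helper: checks that a relationship dict has exactly the
# three fields used in the example above, each a non-empty string.
REQUIRED_KEYS = {"subject", "relation", "object"}


def is_valid_triple(triple: dict) -> bool:
    """Return True if `triple` has exactly the expected keys with non-empty string values."""
    return set(triple) == REQUIRED_KEYS and all(
        isinstance(value, str) and bool(value.strip()) for value in triple.values()
    )


example = [
    {"subject": "Maria", "relation": "friend", "object": "John"},
    {"subject": "Maria", "relation": "cousin", "object": "Bert"},
]

print(all(is_valid_triple(t) for t in example))
```

A check like this could be applied to model outputs before annotating them, so that malformed responses are caught early.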
      

# Set Up Your Environment

## Install relevant libraries

For this exercise, we'll use two libraries from our GenAI development tool set:

[**Arize-Phoenix**](https://docs.arize.com/phoenix)

This is the primary tool we'll be using. Phoenix is a developer tool designed to help AI engineers and data scientists run experiments, evaluate, troubleshoot, and improve their AI applications.

[**DSPy**](https://dspy.ai/)

DSPy provides a "prompts as code" library, enabling AI developers to standardize, modularize, and optimize their AI applications.

> Note: Deep DSPy knowledge isn't required for this tutorial




In [None]:
!pip install arize-phoenix "dspy==2.5.43" "openinference-instrumentation-dspy>=0.1.13" openinference-instrumentation-litellm opentelemetry-exporter-otlp 'httpx<0.28'



In [None]:
import os
import pandas as pd

## Model set up
In this tutorial I'll be accessing models through [Mistral](https://mistral.ai/) and through Hugging Face's [Serverless Inference API](https://huggingface.co/docs/api-inference/index), both of which can be used for free, with some limitations. The API calls will be made through **DSPy**, which [integrates with a wide range of model providers](https://dspy.ai/).

Set the model provider API keys as environment variables.

In [None]:
# Comment out if API keys are not saved in your google colab userdata
from google.colab import userdata
os.environ["MISTRAL_API_KEY"] = userdata.get('MISTRAL_API_KEY')
os.environ["HUGGINGFACE_API_KEY"] = userdata.get('HUGGINGFACE_API_KEY')

## Uncomment and add API keys here if they are not saved in your google colab userdata
# os.environ["MISTRAL_API_KEY"] = 'YOUR_MISTRAL_API_KEY'
# os.environ["HUGGINGFACE_API_KEY"] = 'YOUR_HUGGINGFACE_API_KEY'
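
If you are running outside Colab, or would rather not paste keys into the notebook, one option is to prompt for a key only when it is missing. This is a sketch using the standard library's `getpass`; the helper name `ensure_api_key` is hypothetical.

```python
import os
from getpass import getpass


def ensure_api_key(name: str, prompt=getpass) -> str:
    """Return os.environ[name], asking for the value first if it is missing or empty."""
    if not os.environ.get(name):
        # `prompt` defaults to getpass, which hides the typed value.
        os.environ[name] = prompt(f"Enter {name}: ")
    return os.environ[name]


# Example usage (left commented out because it waits for keyboard input):
# ensure_api_key("MISTRAL_API_KEY")
# ensure_api_key("HUGGINGFACE_API_KEY")
```

Passing the prompt function as a parameter keeps the helper easy to exercise without interactive input.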

Access the LLM endpoint with DSPy.

In [None]:
import dspy

lm = dspy.LM('mistral/mistral-small-latest', api_key=os.environ["MISTRAL_API_KEY"])
dspy.configure(lm=lm)

Test the endpoint.


In [None]:

lm(messages=[{"role": "user", "content": "Say this is a test!"}])

["This is a test! How can I assist you today? Let's make sure everything is working perfectly."]

## Connect to Phoenix

Set environment variables as shown in the Phoenix UI. Once you have created an account and logged in, find your API key under **`keys`**.

<img src="https://drive.google.com/uc?id=12NmDf0JAdIyWAqy102AcHA3z3Frbe7to" width="75%">

In [None]:
API_KEY="YOUR_PHOENIX_API_KEY"

os.environ["OTEL_EXPORTER_OTLP_HEADERS"] = "api_key=" + API_KEY
os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + API_KEY
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"

os.environ["PHOENIX_API_KEY"] = API_KEY

Register your Phoenix endpoint as the tracer_provider.

In [None]:
from phoenix.otel import register

tracer_provider = register(
  endpoint="https://app.phoenix.arize.com/v1/traces"
)

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: default
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: https://app.phoenix.arize.com/v1/traces
|  Transport: HTTP + protobuf
|  Transport Headers: {'api_key': '****', 'authorization': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.



### Instrument the application
This enables us to send traces to Phoenix. Since we'll be using DSPy to make calls to models, we'll use the **DSPyInstrumentor**. We'll also use the **LiteLLMInstrumentor**, since DSPy is built on top of LiteLLM. Phoenix also provides integrations with a variety of [common libraries](https://docs.arize.com/phoenix/tracing/integrations-tracing).


In [None]:
from openinference.instrumentation.dspy import DSPyInstrumentor
from openinference.instrumentation.litellm import LiteLLMInstrumentor

DSPyInstrumentor().instrument(tracer_provider=tracer_provider, skip_dep_check=True)

LiteLLMInstrumentor().instrument(tracer_provider=tracer_provider, skip_dep_check=True)

### Test the connection

In [None]:
lm(messages=[{"role": "user", "content": "Say this is a test!"}])

["This is a test! How can I assist you today? Let's make sure everything is working perfectly."]

Now that we've instrumented DSPy, the LLM call we made above should send a trace to Phoenix. We didn't specify a project name when we set up our tracer provider, so the [trace](https://docs.arize.com/phoenix/tracing/llm-traces/what-are-traces) will appear in the default project.

In Phoenix, [**tracing**](https://docs.arize.com/phoenix/tracing/llm-traces) is the process of tracking the individual steps taken through an LLM application as a request is made. This involves tracking individual operations as **spans**, recording inputs, outputs, and additional data at each step.
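
To make the trace/span relationship concrete, here is a toy model in plain Python. It is only an illustration: `ToySpan`, the span names, and the IDs are all invented, and this is not how Phoenix or OpenTelemetry represent spans internally.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToySpan:
    """Illustrative stand-in for a span: one operation within a trace."""
    span_id: str
    name: str
    parent_id: Optional[str] = None  # None marks the root span of the trace
    attributes: dict = field(default_factory=dict)


# One trace: a root operation with two nested steps (all names are made up).
toy_trace = [
    ToySpan("a1", "extract_relationships", attributes={"input.value": "some text"}),
    ToySpan("b2", "build_prompt", parent_id="a1"),
    ToySpan("c3", "llm_call", parent_id="b2"),
]


def depth(spans, span):
    """Count how many ancestors a span has by following parent_id links."""
    by_id = {s.span_id: s for s in spans}
    levels = 0
    while span.parent_id is not None:
        span = by_id[span.parent_id]
        levels += 1
    return levels


# Print the trace as an indented tree, similar in spirit to a trace view.
for s in toy_trace:
    print("  " * depth(toy_trace, s) + s.name)
```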

<img src="https://raw.githubusercontent.com/SarahOstermeier/Tutorial_Figures/refs/heads/main/REC-20250204224037.GIF" />

# Design Your Evaluation Framework

## Prepare a set of evaluation prompts

Here I created a list of prompts to challenge my application in the relationship extraction task. These are synthetic examples generated by Claude 3.5 Sonnet. Ideally, in a real use case, examples should be taken from historic data or prepared with the help of end users or subject matter experts.

In [None]:
relation_dataset = [
    "John, who was previously married to Sarah, now works under her at Tech Corp. Their daughter Emma was recently hired by Sarah's new husband Tom as his research assistant.",

    "Dr. Adams mentors resident Jessica Chen, unaware that she was his wife's daughter from a previous marriage, given up for adoption twenty years ago.",

    "Tom started dating his son's ex-wife Maria six months after their divorce, creating tension when they all attend his daughter's soccer games.",

    "Professor Williams discovered his star student Mark is actually his biological son from a college relationship, while Mark's adoptive mother serves as department chair.",

    "Robert coaches his daughter Emma's basketball team, where she plays alongside his ex-wife's daughter from her second marriage, making them step-sisters and teammates.",

    "Senior partner James dated Rachel's mother in law school before marrying her sister, making family dinners awkward since Rachel became his legal protégé.",

    "Michael hired his former stepbrother David as head chef in his restaurant, unaware that David was now engaged to Michael's ex-wife Jenny.",

    "Amy babysits for the Smiths, whose daughter is actually her half-sister from her father's secret relationship with Mrs. Smith during college.",

    "Daniel discovered his apprentice Steve is actually his half-brother, not his nephew, after finding out his father had a second family.",

    "Lisa's step-sister Maria turned out to be her biological sister, as they shared the same mother who had given Maria up for adoption years before marrying Lisa's father.",

    "Chris manages the IT department where his twin brother Peter works, while hiding that they're both secretly dating their supervisor Sarah.",

    "Emma's piano teacher Ms. Thompson was briefly married to Emma's father before he met her mother, a fact they all pretend to ignore during lessons.",

    "Dr. Brown treats both Mark and his daughter Sophie, unaware that Sophie is actually his biological granddaughter from a teenage pregnancy.",

    "Coach Tim discovered his star player Jack was his biological son from a college relationship, while Jack's cousin (Tim's niece) Amy serves as team manager.",

    "Carol hired her former college roommate Beth, who had previously been engaged to Carol's brother before he married their mutual friend Diana.",

    "Principal Stevens hired his niece Jennifer as art teacher, not knowing she was secretly dating his daughter's ex-husband who teaches music.",

    "Ryan's stepbrother Matt is actually his biological brother, a fact their parents kept secret until they decided to open a gym together.",

    "Detective Wilson investigates a case involving his son-in-law Officer Carter, whose wife (Wilson's daughter) is prosecuting the same case.",

    "Anna's dance instructor Marina had a child with Anna's father before he married her mother, making Anna and her dance partner half-siblings.",

    "Sam manages his sister-in-law Rebecca at the coffee shop, while hiding his past relationship with her from his brother Tom, who does their accounting."
]

## Application design
We'll use DSPy to structure a relationship extraction application with the desired output structure. DSPy helps automate prompt structuring, which makes the application easy to set up.

In [None]:
class ExtractInfo(dspy.Signature):
    """Extract each person-to-person relationship from the text in the structured format.
    For each relationship format the output as:
    subject: str
    relation: str
    object: str

    For example, for the sentence, 'Maria, who was previously married to John before his tragic accident, is now engaged to David, her late husband's best friend from college.',
    the output relationships should be:
    [
        {
            "subject": "Maria",
            "relation": "engaged to",
            "object": "David"
        },
        {
            "subject": "David",
            "relation": "engaged to",
            "object": "Maria"
        },
        {
            "subject": "Maria",
            "relation": "ex-spouse",
            "object": "John"
        },
        {
            "subject": "John",
            "relation": "ex-spouse",
            "object": "Maria"
        },
        {
            "subject": "David",
            "relation": "best friend",
            "object": "John"
        },
        ]

    """
    text: str = dspy.InputField()
    relationships: list[dict[str, str]] = dspy.OutputField(desc="a list of all person-to-person relationships in the text")

module = dspy.Predict(ExtractInfo)


## Track LLM Traces in Phoenix

Let's try submitting a request.  Here's the first prompt in our dataset.

In [None]:
relation_dataset[0]

"John, who was previously married to Sarah, now works under her at Tech Corp. Their daughter Emma was recently hired by Sarah's new husband Tom as his research assistant."

In [None]:
text = relation_dataset[0]
response = module(text=text)

response

Prediction(
    relationships=[{'subject': 'John', 'relation': 'ex-spouse', 'object': 'Sarah'}, {'subject': 'Sarah', 'relation': 'ex-spouse', 'object': 'John'}, {'subject': 'John', 'relation': 'works under', 'object': 'Sarah'}, {'subject': 'Sarah', 'relation': 'boss of', 'object': 'John'}, {'subject': 'Sarah', 'relation': 'mother', 'object': 'Emma'}, {'subject': 'Emma', 'relation': 'daughter', 'object': 'Sarah'}, {'subject': 'Tom', 'relation': 'husband', 'object': 'Sarah'}, {'subject': 'Sarah', 'relation': 'wife', 'object': 'Tom'}, {'subject': 'Tom', 'relation': 'boss of', 'object': 'Emma'}, {'subject': 'Emma', 'relation': 'works for', 'object': 'Tom'}]
)

Let's take a look at the Phoenix UI to see if this trace shows up.

You should see the trace at the top of the list in your default project. Clicking into it reveals the requests that were made under the hood within the `dspy.Predict` function, where we can see the full prompt that was sent to the underlying Mistral LLM, along with the responses at each step.

<img src="https://github.com/SarahOstermeier/Tutorial_Figures/blob/main/REC-20250205005941.GIF?raw=true" />


So far, I've been using a small [Mistral](https://auth.mistral.ai/ui/login?flow=e37561f9-fbd9-4123-b6d0-094233435689) model. Let's see how things look when I try a model that is designed for reasoning. Here I'm using the DeepSeek-R1-distilled Qwen model, accessed through [HuggingFace](https://huggingface.co/docs/api-inference/index).

In [None]:
# Optional Cell - Comment out to Skip
lm2 = dspy.LM('huggingface/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B', api_key=os.environ["HUGGINGFACE_API_KEY"], max_tokens=2000)
dspy.configure(lm=lm2)
module2 = dspy.Predict(ExtractInfo)

text = relation_dataset[0]
response = module2(text=text)

response


Prediction(
    relationships=[{'subject': 'John', 'relation': 'ex-spouse', 'object': 'Sarah'}, {'subject': 'Sarah', 'relation': 'ex-spouse', 'object': 'John'}, {'subject': 'John', 'relation': 'works under', 'object': 'Sarah'}, {'subject': 'Sarah', 'relation': 'employs', 'object': 'John'}, {'subject': 'John', 'relation': 'father of', 'object': 'Emma'}, {'subject': 'Sarah', 'relation': 'mother of', 'object': 'Emma'}, {'subject': 'Emma', 'relation': 'hired by', 'object': 'Tom'}, {'subject': 'Tom', 'relation': 'hires', 'object': 'Emma'}, {'subject': 'Sarah', 'relation': 'husband', 'object': 'Tom'}, {'subject': 'Tom', 'relation': 'wife', 'object': 'Sarah'}]
)
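
The two predictions above use different relation wording, so a direct equality check is not very informative. One rough way to compare them is to look only at the directed (subject, object) pairs each model produced. The sketch below does this on a shortened copy of each output; the helper `entity_pairs` is hypothetical.

```python
def entity_pairs(relationships):
    """Reduce relationship dicts to a set of directed (subject, object) pairs."""
    return {(r["subject"], r["object"]) for r in relationships}


# Shortened copies of the two predictions shown above.
mistral_subset = [
    {"subject": "Sarah", "relation": "mother", "object": "Emma"},
    {"subject": "Emma", "relation": "daughter", "object": "Sarah"},
    {"subject": "Tom", "relation": "husband", "object": "Sarah"},
]
deepseek_subset = [
    {"subject": "John", "relation": "father of", "object": "Emma"},
    {"subject": "Sarah", "relation": "mother of", "object": "Emma"},
    {"subject": "Sarah", "relation": "husband", "object": "Tom"},
]

only_mistral = entity_pairs(mistral_subset) - entity_pairs(deepseek_subset)
only_deepseek = entity_pairs(deepseek_subset) - entity_pairs(mistral_subset)

print("Pairs only in the Mistral subset:", sorted(only_mistral))
print("Pairs only in the DeepSeek subset:", sorted(only_deepseek))
```

This ignores the relation label entirely, so it cannot tell whether a label is correct. It only highlights which entity pairs, and which directions, differ.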

Note that we can view the chain of reasoning that happened under the hood in Phoenix. In this case, the DeepSeek model took longer and went through more reasoning steps, but it still made some mistakes around the directionality of relationships.

<img src="https://github.com/SarahOstermeier/Tutorial_Figures/blob/main/REC-20250205022643%203.GIF?raw=true" />




Let's go back to the faster Mistral model and send the rest of our prompt dataset. You should see each of these appear in the UI.

In [None]:
dspy.configure(lm=lm)
module = dspy.Predict(ExtractInfo)

for text in relation_dataset:
  try:
    response = module(text=text)
  except AttributeError as e:
    print('Error during structuring of sample:', text)
    print(e)
    continue
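
The loop above discards the responses and only prints failures. If you want to keep both, one possible pattern is sketched below. `run_batch` and `fake_extract` are hypothetical names; `fake_extract` is a stand-in so the example runs without calling a model.

```python
def run_batch(extract, texts):
    """Call `extract` on each text and return (results, failures)."""
    results, failures = [], []
    for text in texts:
        try:
            results.append({"text": text, "output": extract(text)})
        except Exception as exc:  # broad on purpose: record any failure and keep going
            failures.append({"text": text, "error": repr(exc)})
    return results, failures


def fake_extract(text):
    """Stand-in for the real extraction call; fails on empty input."""
    if not text:
        raise ValueError("empty input")
    return [{"subject": "A", "relation": "knows", "object": "B"}]


results, failures = run_batch(fake_extract, ["A knows B.", ""])
print(len(results), "succeeded;", len(failures), "failed")
```

In this notebook, the extractor could be something like `lambda t: module(text=t).relationships`, applied to `relation_dataset`.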


# Annotating Traces
Phoenix provides [several options for human annotation](https://docs.arize.com/phoenix/tracing/how-to-tracing/capture-feedback), either through the Phoenix UI or by sending annotations through the Python API.

## Use the Phoenix UI for annotation

The Phoenix UI offers an easy annotation interface. It's a good option when you, the developer, are also the person annotating responses.

<img src="https://github.com/SarahOstermeier/Tutorial_Figures/blob/main/REC-20250205012012.GIF?raw=true" />


## Programmatically annotate through the Python API

To add an annotation through the API, we'll need a span_id to identify which span the annotation should be assigned to.

Let's take a look at the traces we've sent to Phoenix so far.

In [None]:
import phoenix as px

dataset = px.Client().get_trace_dataset(project_name="default")
span_df = dataset.get_spans_dataframe()
span_df = span_df.query("name == 'Predict(ExtractInfo).forward' and status_code == 'OK'")
span_df



| context.span_id | name | span_kind | parent_id | start_time | end_time | status_code | status_message | events | context.span_id | context.trace_id | ... | attributes.llm.invocation_parameters | attributes.output.mime_type | attributes.output.value | attributes.llm.input_messages | attributes.llm.output_messages | attributes.input.mime_type | attributes.llm.token_count.prompt | attributes.llm.token_count.completion | attributes.llm.model_name | attributes.llm.token_count.total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 713037dfe328196f | Predict(ExtractInfo).forward | CHAIN | | 2025-02-05 05:57:33.573063+00:00 | 2025-02-05 05:57:36.298396+00:00 | OK | | [] | 713037dfe328196f | d0995faa081ba065cf4709d5aa1fde83 | ... | | application/json | {"relationships": [{"subject": "John", "relati... | | | application/json | | | | |
| 05cb89b8cd62ccdc | Predict(ExtractInfo).forward | CHAIN | | 2025-02-05 05:58:14.201006+00:00 | 2025-02-05 05:58:14.291518+00:00 | OK | | [] | 05cb89b8cd62ccdc | 5f1b07853997e32f2693df89f9507c56 | ... | | application/json | {"relationships": [{"subject": "John", "relati... | | | application/json | | | | |
| 2f455b2d6541e086 | Predict(ExtractInfo).forward | CHAIN | | 2025-02-05 06:11:00.717673+00:00 | 2025-02-05 06:11:00.836809+00:00 | OK | | [] | 2f455b2d6541e086 | b3fcbdb73d66b9188b90fe56b5f91dac | ... | | application/json | {"relationships": [{"subject": "John", "relati... | | | application/json | | | | |
| 77b644bdff08e7a9 | Predict(ExtractInfo).forward | CHAIN | | 2025-02-05 06:11:00.872298+00:00 | 2025-02-05 06:11:02.151536+00:00 | OK | | [] | 77b644bdff08e7a9 | 6e545e1ff144db3a88ad38fd07dd327e | ... | | application/json | {"relationships": [{"subject": "Dr. Adams", "r... | | | application/json | | | | |
| 6a15b05189e15a1f | Predict(ExtractInfo).forward | CHAIN | | 2025-02-05 06:11:02.187044+00:00 | 2025-02-05 06:11:03.428261+00:00 | OK | | [] | 6a15b05189e15a1f | 77da556b87e5ac0e031a3c2b33aa597d | ... | | application/json | {"relationships": [{"subject": "Tom", "relatio... | | | application/json | | | | |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| ba5cab709c780936 | Predict(ExtractInfo).forward | CHAIN | | 2025-02-05 08:49:25.462735+00:00 | 2025-02-05 08:49:25.577579+00:00 | OK | | [] | ba5cab709c780936 | c6c01256e9c28fa2c1ab3789843ddff0 | ... | | application/json | {"relationships": [{"subject": "Principal Stev... | | | application/json | | | | |
| 4759323ec0d2ba8e | Predict(ExtractInfo).forward | CHAIN | | 2025-02-05 08:49:25.610358+00:00 | 2025-02-05 08:49:25.722303+00:00 | OK | | [] | 4759323ec0d2ba8e | b764354071b073ec2fe657c71bcf8634 | ... | | application/json | {"relationships": [{"subject": "Ryan", "relati... | | | application/json | | | | |
| 39c39a40e2213c35 | Predict(ExtractInfo).forward | CHAIN | | 2025-02-05 08:49:25.755631+00:00 | 2025-02-05 08:49:25.867460+00:00 | OK | | [] | 39c39a40e2213c35 | 1a7b1945b7ef22bd5c30913e8e8db8fe | ... | | application/json | {"relationships": [{"subject": "Detective Wils... | | | application/json | | | | |
| 9d27e6b6d75edc98 | Predict(ExtractInfo).forward | CHAIN | | 2025-02-05 08:49:25.902495+00:00 | 2025-02-05 08:49:26.335770+00:00 | OK | | [] | 9d27e6b6d75edc98 | c2f2e681724026b92f55a85116417b36 | ... | | application/json | {"relationships": [{"subject": "Anna", "relati... | | | application/json | | | | |


To make things easy, we'll add an annotation to the first span on the list.

In [None]:
pd.set_option('display.max_colwidth', None)
span_df.loc[span_df.index[0], ['attributes.input.value', 'attributes.output.value']]

| | 713037dfe328196f |
|---|---|
| attributes.input.value | {"text": "John, who was previously married to Sarah, now works under her at Tech Corp. Their daughter Emma was recently hired by Sarah's new husband Tom as his research assistant."} |
| attributes.output.value | {"relationships": [{"subject": "John", "relation": "ex-spouse", "object": "Sarah"}, {"subject": "Sarah", "relation": "ex-spouse", "object": "John"}, {"subject": "John", "relation": "works under", "object": "Sarah"}, {"subject": "Sarah", "relation": "boss of", "object": "John"}, {"subject": "Sarah", "relation": "mother", "object": "Emma"}, {"subject": "Emma", "relation": "daughter", "object": "Sarah"}, {"subject": "Tom", "relation": "husband", "object": "Sarah"}, {"subject": "Sarah", "relation": "wife", "object": "Tom"}, {"subject": "Tom", "relation": "boss of", "object": "Emma"}, {"subject": "Emma", "relation": "works for", "object": "Tom"}]} |
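
The `attributes.output.value` field shown above is a JSON string, so it needs decoding before the relationships can be inspected individually. A small sketch using the standard `json` module follows; the helper name `parse_relationships` is hypothetical, and the sample string is a shortened copy of the output above.

```python
import json

# Shortened copy of the `attributes.output.value` string shown above.
output_value = (
    '{"relationships": ['
    '{"subject": "John", "relation": "ex-spouse", "object": "Sarah"}, '
    '{"subject": "Sarah", "relation": "boss of", "object": "John"}]}'
)


def parse_relationships(raw: str) -> list:
    """Decode a span's JSON output string and return its list of relationship dicts."""
    return json.loads(raw).get("relationships", [])


for rel in parse_relationships(output_value):
    print(f'{rel["subject"]} --{rel["relation"]}--> {rel["object"]}')
```

Printing one relationship per line can make it easier to review a response before deciding on a score.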


The function below takes a span_id, a score, and feedback text, then posts the annotation to Phoenix. Note the structure of the annotation payload.

In [None]:
import httpx
headers = {'api_key': os.environ["PHOENIX_API_KEY"]}
annotation_endpoint = "https://app.phoenix.arize.com/v1/span_annotations?sync=false"

client = httpx.Client()

def upload_feedback(span_id, score, feedback):
  # format for annotation payload
  annotation_payload = {
    "data": [
        {
            "span_id": span_id,
            "name": "correctness score out of 5",
            "annotator_kind": "HUMAN",
            "result": {
                       "score": score,
                       "explanation": feedback
                       }
        }]}
  client.post(
      annotation_endpoint,
      json=annotation_payload,
      headers=headers
      )
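
Because the payload's `"data"` field is a list, it looks as though several annotations could be sent in one request. That is an assumption based on the payload shape used above, not something confirmed here, so check the Phoenix documentation before relying on it. The sketch below only builds such a payload and does not send anything; `build_annotation_payload` is a hypothetical helper.

```python
def build_annotation_payload(items, name="correctness score out of 5"):
    """Build one annotation payload from (span_id, score, feedback) tuples.

    Mirrors the payload structure used in `upload_feedback`; batching several
    entries in "data" is an assumption based on that field being a list.
    """
    return {
        "data": [
            {
                "span_id": span_id,
                "name": name,
                "annotator_kind": "HUMAN",
                "result": {"score": score, "explanation": feedback},
            }
            for span_id, score, feedback in items
        ]
    }


payload = build_annotation_payload(
    [
        ("713037dfe328196f", 5, "This is completely correct"),
        ("05cb89b8cd62ccdc", 3, "Some relationships are missing"),
    ]
)
print(len(payload["data"]), "annotations prepared")
```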


Send the annotation and check that it appears in the UI.

In [None]:
upload_feedback(span_id=span_df.index[0], score=5, feedback="This is completely correct")

## Capture feedback at inference time
To capture feedback at inference time, we'll need to save the span_id we want to attach the annotation to. To enable this, I set up a wrapper function, `call_llm`, which returns the list of relationships and the span_id.

In [None]:
from opentelemetry import trace
tracer = trace.get_tracer(__name__)

@tracer.start_as_current_span(name="UI Interaction") # decorator to start a new span
def call_llm(prompt):
  """Wrapper function to handle LLM call"""
  span = trace.get_current_span()
  span.set_attribute("openinference.span.kind", "chain")
  span.set_attribute("input.value", prompt)
  span_id = span.get_span_context().span_id.to_bytes(8, "big").hex()
  response = module(text=prompt)
  span.set_attribute("output.value", str(response.relationships))
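  span.set_status(trace.Status(trace.StatusCode.OK))
  # set_status() returns None; the third element is kept so callers can still unpack three values
  return response.relationships, span_id, None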
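
The `span_id` conversion in `call_llm` deserves a quick note. OpenTelemetry exposes the span ID as an integer, while the span IDs shown in the dataframe earlier are 16-character hexadecimal strings. The standalone sketch below shows the conversion on its own; `span_id_to_hex` is a hypothetical helper name.

```python
def span_id_to_hex(span_id: int) -> str:
    """Render a 64-bit integer span ID as a zero-padded, 16-character hex string."""
    return span_id.to_bytes(8, "big").hex()


example_id = 0x713037DFE328196F  # integer form of a span ID shown earlier
print(span_id_to_hex(example_id))  # 713037dfe328196f

# Small IDs are left-padded with zeros so the string is always 16 characters.
print(span_id_to_hex(255))  # 00000000000000ff

# Equivalent formulation using string formatting.
assert span_id_to_hex(example_id) == format(example_id, "016x")
```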


In [None]:
prompt = "Eloise is Bob's sister and she's married to his nemesis George's brother in law."

relationships, span_id, _ = call_llm(prompt)
relationships

[{'subject': 'Eloise', 'relation': 'sister', 'object': 'Bob'},
 {'subject': 'Bob', 'relation': 'sister', 'object': 'Eloise'},
 {'subject': 'Eloise',
  'relation': 'married to',
  'object': "George's brother in law"},
 {'subject': "George's brother in law",
  'relation': 'married to',
  'object': 'Eloise'},
 {'subject': 'Bob', 'relation': 'nemesis', 'object': 'George'},
 {'subject': 'George', 'relation': 'nemesis', 'object': 'Bob'}]

In [None]:
upload_feedback(span_id, score=4, feedback="Mostly correct, except that Bob is Eloise's brother, not her sister")

Here I used the decorator `tracer.start_as_current_span()` to start a new span when the function is called. This enables me to capture the span_id and add attributes before the new span closes. Note that the spans we have been seeing in previous steps will now appear as **child spans** of this new span, titled **UI Interaction**.

<img src="https://github.com/SarahOstermeier/Tutorial_Figures/blob/main/Screenshot_2.jpg?raw=true" width="100%">


## Implement user feedback collection (Optional)

More often than not, the complex AI use cases that require human feedback require expert annotation or feedback from end-users.

This feedback is best collected directly from an interface while the user is interacting with the application. This optional example demonstrates how one might structure such an application to send traces and feedback to Phoenix.

Run this cell and scroll down to the bottom to see the interface. Try creating your own prompts and give feedback on the responses.

In [None]:
# Optional Cell

import ipywidgets as widgets
from IPython.display import display, HTML, clear_output

# Store feedback in a list
feedback_history = []
current_response = None
current_span_id = None

def handle_prompt_submission(button):
    """Handle the initial prompt submission and show the LLM response"""
    global current_response
    global current_span_id

    prompt = prompt_box.value
    if not prompt:
        display(HTML("<p style='color: red'>Please enter a prompt first.</p>"))
        return

    # Get the LLM response and the span_id of the wrapping span
    current_response, current_span_id, _ = call_llm(prompt)

    # Clear output and show response with feedback options
    clear_output()
    setup_response_view(prompt, current_response)

def handle_feedback_submission(button):
    """Handle the feedback submission"""
    global current_response
    global current_span_id

    # Get values from form widgets
    rating = rating_slider.value
    feedback = feedback_box.value

    # if not feedback:
    #     display(HTML("<p style='color: red'>Please provide feedback before submitting.</p>"))
    #     return

    # Save feedback
    feedback_history.append({
        'span_id': current_span_id,
        'prompt': prompt_box.value,
        'response': current_response,
        'rating': rating,
        'feedback': feedback
    })

    # Upload feedback to phoenix
    upload_feedback(current_span_id, rating, feedback)

    # Clear form and show confirmation
    clear_output()
    setup_initial_form()
    display(HTML("<p style='color: green'>Feedback submitted successfully!</p>"))
    display_history()

def get_star_display(rating):
    """Convert numeric rating to star display"""
    return '★' * rating + '☆' * (5 - rating)

def update_star_display(change):
    """Update the star display when slider value changes"""
    rating_display.value = f"Rating: {get_star_display(change.new)}"

def setup_initial_form():
    """Setup the initial prompt input form"""
    global prompt_box, submit_prompt_button

    print("Response & Feedback")
    print("-" * 30)

    # Create form elements
    prompt_box = widgets.Textarea(
        description='Prompt:',
        placeholder='Enter your prompt here...',
        layout={'width': '500px', 'height': '100px'}
    )

    submit_prompt_button = widgets.Button(description='Get Response')
    submit_prompt_button.on_click(handle_prompt_submission)

    # Display initial form
    display(prompt_box)
    display(submit_prompt_button)

def setup_response_view(prompt, response):
    """Setup the response view with feedback options"""
    global prompt_box, rating_slider, rating_display, feedback_box, submit_feedback_button

    print("LLM Interaction & Feedback")
    print("-" * 30)

    # Show original prompt (read-only)
    display(HTML(f"<p><b>Your Prompt:</b><br>{prompt}</p>"))

    # Show LLM response
    display(HTML(f"<p><b>LLM Response:</b><br>{response}</p>"))

    print("\nProvide Feedback:")

    # Create star rating widgets
    rating_slider = widgets.IntSlider(
        value=3,
        min=1,
        max=5,
        step=1,
        description='Move slider:',
        style={'description_width': 'initial'}
    )

    rating_display = widgets.HTML(
        value=f"Rating: {get_star_display(rating_slider.value)}"
    )

    # Connect the slider to the star display
    rating_slider.observe(update_star_display, names='value')

    feedback_box = widgets.Textarea(
        description='Feedback:',
        placeholder='Why did you give this rating?',
        layout={'width': '500px', 'height': '100px'}
    )

    submit_feedback_button = widgets.Button(description='Submit Feedback')
    submit_feedback_button.on_click(handle_feedback_submission)

    # Store original prompt for feedback submission
    prompt_box = widgets.Textarea(
        value=prompt,
        layout={'display': 'none'}
    )

    # Display feedback form
    display(widgets.VBox([rating_slider, rating_display]))
    display(feedback_box)
    display(submit_feedback_button)

    # Add option to start over without submitting feedback
    new_prompt_button = widgets.Button(description='Start New Prompt')
    new_prompt_button.on_click(lambda x: (clear_output(), setup_initial_form()))
    display(new_prompt_button)

def display_history():
    """Display feedback history"""
    if feedback_history:
        print("\nFeedback History:")
        print("-" * 30)
        for i, entry in enumerate(feedback_history, 1):
            print(f"\nEntry {i}:")
            print(f"Prompt: {entry['prompt']}")
            print(f"Response: {entry['response']}")
            print(f"Rating: {get_star_display(entry['rating'])}")
            print(f"Feedback: {entry['feedback']}")
            print("-" * 30)

# Initial setup
setup_initial_form()

Response & Feedback
------------------------------


Textarea(value='', description='Prompt:', layout=Layout(height='100px', width='500px'), placeholder='Enter you…

Button(description='Get Response', style=ButtonStyle())


Feedback History:
------------------------------

Entry 1:
Prompt: Harriet was my first grade teacher, but now she is my dad's wife and my half-sister's mom.
Response: [{'subject': 'Harriet', 'relation': 'teacher', 'object': 'me'}, {'subject': 'me', 'relation': 'student', 'object': 'Harriet'}, {'subject': 'Harriet', 'relation': 'wife', 'object': 'my dad'}, {'subject': 'my dad', 'relation': 'husband', 'object': 'Harriet'}, {'subject': 'Harriet', 'relation': 'mother', 'object': 'my half-sister'}, {'subject': 'my half-sister', 'relation': 'daughter', 'object': 'Harriet'}]
Rating: ★★★☆☆
Feedback: ok
------------------------------


# Create a Golden Dataset in the Phoenix UI

Create a new dataset in the Phoenix UI. This is where you will add samples to build your Golden Dataset.

<img src="https://github.com/SarahOstermeier/Tutorial_Figures/blob/main/REC-20250205025945.GIF?raw=true" />


Back in the default project, select some annotated samples to add to the Golden Dataset. After opening the dataset and selecting a sample, note that each sample still links back to its original trace.

<img src="https://github.com/SarahOstermeier/Tutorial_Figures/blob/main/REC-20250205031001.GIF?raw=true"/>


# Next Steps

Congratulations on completing this tutorial! Here's how you can further leverage Phoenix to improve your GenAI applications:

**Enhance Your Evaluation Pipeline**
* Implement Phoenix [Evals](https://docs.arize.com/phoenix/evaluation/llm-evals) to assess model performance
* Design and deploy [custom evaluation metrics](https://docs.arize.com/phoenix/evaluation/concepts-evals/building-your-own-evals)
* Compare automated evaluation scores with human annotations in the Phoenix UI
* Validate evaluator reliability and accuracy
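
One way to make the comparison between automated evaluators and human annotations concrete is a simple agreement rate. The sketch below is illustrative only: `rating_to_label`, `agreement_rate`, the 4-star threshold, and the sample values are assumptions made for this example, not part of Phoenix or of the code earlier in this notebook.

```python
def rating_to_label(rating, threshold=4):
    """Map a 1-5 star human rating to a binary label.

    The threshold of 4 is an assumption; choose one that fits your use case.
    """
    return "good" if rating >= threshold else "bad"


def agreement_rate(human_labels, evaluator_labels):
    """Return the fraction of samples where both label lists agree."""
    if len(human_labels) != len(evaluator_labels):
        raise ValueError("Both label lists must have the same length.")
    if not human_labels:
        raise ValueError("At least one labeled sample is required.")
    matches = sum(h == e for h, e in zip(human_labels, evaluator_labels))
    return matches / len(human_labels)


# Hypothetical data: star ratings from human annotators and the labels
# an automated evaluator assigned to the same five samples.
human_ratings = [5, 3, 4, 2, 1]
evaluator_labels = ["good", "bad", "bad", "bad", "bad"]

human_labels = [rating_to_label(r) for r in human_ratings]
rate = agreement_rate(human_labels, evaluator_labels)
print(f"Agreement with human annotations: {rate:.0%}")
```

A low agreement rate suggests the evaluator needs revision before it is trusted at scale. For imbalanced labels, a chance-corrected statistic such as Cohen's kappa may be more informative than raw agreement.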

**Experiment with Models and Prompts**

* Use your Golden Dataset to benchmark different models (try benchmarking the Mistral and DeepSeek models used in this tutorial)
* Test various prompt engineering strategies (try leveraging DSPy for systematic prompt optimization)
* Challenge your models with more complex scenarios (start with the harder relation dataset in the cell below)
* Run [Experiments](https://docs.arize.com/phoenix/datasets-and-experiments/how-to-experiments/) on Phoenix to track and analyze performance variations
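
When benchmarking models against a Golden Dataset, the extracted relations can be scored against the human-approved ones. The following is a minimal sketch, assuming responses are lists of `subject`/`relation`/`object` dictionaries as in the feedback history shown earlier, and that exact (case-insensitive) matching is an acceptable criterion. The `score_triples` helper and the sample golden annotation are hypothetical.

```python
def to_triple_set(triples):
    """Normalize a list of relation dicts into a set of lowercase tuples."""
    return {
        (t["subject"].lower(), t["relation"].lower(), t["object"].lower())
        for t in triples
    }


def score_triples(predicted, golden):
    """Compute precision, recall, and F1 using exact triple matching."""
    pred_set = to_triple_set(predicted)
    gold_set = to_triple_set(golden)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical golden annotation for a single prompt.
golden = [
    {"subject": "Harriet", "relation": "wife", "object": "my dad"},
    {"subject": "Harriet", "relation": "mother", "object": "my half-sister"},
]

# Hypothetical model output for the same prompt.
predicted = [
    {"subject": "Harriet", "relation": "wife", "object": "my dad"},
    {"subject": "Harriet", "relation": "teacher", "object": "me"},
]

scores = score_triples(predicted, golden)
print(scores)
```

Exact matching is strict: a model that outputs "mom" instead of "mother" is penalized, so a real benchmark may need relation normalization or an LLM-based judge.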



In [None]:
relation_dataset_hard = [
    "At the prestigious ballet academy, principal dancer Isabella mentors young prodigy Oliver, unaware that he is her biological son whom her twin sister gave up for adoption. Oliver's adoptive mother Margaret is the academy's benefactor and Isabella's former understudy, while his adoptive sister Lily studies under Isabella's daughter from her first marriage, creating an intricate web of personal and professional relationships.",

    "In the competitive world of academia, Dr. Chen mentors both Dr. Williams and Dr. Rodriguez, despite the fact that Dr. Williams was originally Dr. Chen's advisor during her early PhD years. Dr. Rodriguez, who is married to Dr. Chen's former research partner Dr. Kim, frequently collaborates with both of them on groundbreaking papers.",

    "After James discovered he was adopted, he learned that his biological father Thomas was actually his adoptive mother Patricia's first cousin, making his childhood friend Emma - Thomas's daughter from another marriage - his half-sister. Patricia's current husband Robert has always treated James as his own son, even after the revelation.",

    "During the company merger, Alice found herself reporting to Bob, who used to be her intern five years ago. The situation became more complex when Bob's wife Carol, Alice's college roommate, joined the same department as a senior consultant, while their daughter Diana started as a summer intern under Alice's supervision.",

    "At the family reunion, Emily introduced her stepbrother Mike's ex-wife Jessica, who is now dating Emily's biological brother Tom, to her half-sister Sarah's adopted daughter Rachel. Rachel, as it turns out, is the biological niece of Jessica's first husband.",

    "Professor Thompson works closely with his former student Dr. Anderson on quantum physics research, while simultaneously serving as the thesis advisor to Dr. Anderson's wife, Lisa. Their academic collaboration became more intricate when Professor Thompson's daughter Katie joined Dr. Anderson's research team as a postdoctoral fellow.",

    "In the small theater company, director Mark cast his ex-wife Jennifer as the lead, opposite her current fiancé Steve, who happens to be Mark's cousin. The stage manager, Paula, who is Steve's sister and Jennifer's soon-to-be sister-in-law, tries to maintain professional relationships with all parties involved.",

    "Detective Johnson is investigating a complex case where the victim, Richard, was found in the home of his business partner's wife Susan, who is also his ex-fiancée. The prime suspect is Susan's current husband Michael, who recently discovered that his own brother James had been secretly dating Richard's daughter from his first marriage.",

    "At the law firm, senior partner David mentors associate Emma, unaware that she is his biological daughter given up for adoption thirty years ago by his college girlfriend Sarah, who is now the firm's biggest client. Emma's adoptive brother Jack recently joined the firm as a junior partner, reporting directly to Sarah.",

    "In the hospital hierarchy, Dr. Patel supervises Dr. Thompson, whose wife Dr. Chen is actually Dr. Patel's attending physician. The situation becomes more complicated when Dr. Patel's son joins the hospital as a resident under Dr. Chen's supervision, while dating Dr. Thompson's sister, a nurse in the same ward.",

    "Laura discovers that her mother's new husband James is the father of her childhood best friend Alex, making them step-siblings. Meanwhile, Alex's maternal cousin Sophie is engaged to Laura's biological father's son from his second marriage, creating an intricate web of future in-law relationships.",

    "In the startup incubator, mentor Kevin guides entrepreneurs Maya and Raj, unaware that Maya is his half-sister from his father's secret second family. Raj, who is Kevin's former college roommate, is now co-parenting with Maya's cousin Elena after their recent divorce, while Elena works as Kevin's executive assistant.",

    "During the charity gala, board member Victoria introduced her surrogate mother Sarah to her biological mother Rachel, who serves as the organization's legal counsel. Sarah's daughter Emma, who was raised alongside Victoria, recently married Rachel's nephew Nicholas, making the already complex family dynamics even more intertwined.",

    "Coach Anderson trains both the Wilson twins, whose mother Janet used to be his doubles partner before marrying his brother. Janet's new stepdaughter from her second marriage, Melissa, joined the tennis academy as an assistant coach, working directly under Coach Anderson while dating his son.",

    "In the political campaign office, campaign manager Peter works with his former mentor Sandra, who is now his stepdaughter's mother-in-law. Sandra's son Michael, who is married to Peter's stepdaughter, runs the social media team, while his ex-wife Rebecca serves as the candidate's chief strategist.",

    "The documentary follows filmmaker Hannah as she uncovers that her subject, renowned artist Marcus, is actually her biological father's identical twin, making him her uncle. Marcus's protégé Claire, who is engaged to Hannah's half-brother from her mother's second marriage, helps navigate the complex family dynamics during filming.",

    "At the family-owned restaurant, head chef Antonio works alongside his former stepson Marco, whose mother Lisa was Antonio's second wife. Marco's new sous chef is his step-cousin Emma, Antonio's current wife's niece, who is dating Antonio's biological son from his first marriage.",

    "Graduate student Sophia collaborates on research with Professor Yang, unaware that he was her late mother's fiancé before she married Sophia's father. The project team includes Professor Yang's current wife's daughter from her first marriage, who is also Sophia's childhood friend and soon-to-be sister-in-law.",

    "In the symphony orchestra, conductor Richard mentors young violinist Amy, whose mother was Richard's first love before she married his best friend Thomas. Amy's stepbrother from her father's second marriage is the orchestra's new concert master, creating tension as he dates Richard's daughter from his current marriage.",

    "Social worker Helen counsels troubled teen Jake, only to discover he is her adopted brother's biological son with her husband's sister, making him both her nephew and brother-in-law's nephew. Jake's current foster mother is Helen's cousin, who was previously married to Helen's adoptive father, adding another layer to their professional relationship."
]

# LLM Usage Disclosure

The following LLM-based assistants were used in the development of this notebook:

* Claude 3.5 Sonnet
* In-notebook Gemini code assistant

These LLMs contributed to:

* Generation of prompt lists
* Development of the Response & Feedback interface
* Code debugging assistance
* Proofreading, minor text editing and formatting

## Authorship
All core components, concepts, and technical implementation of this notebook were authored by Sarah Ostermeier. LLM assistance was limited to the specific tasks listed above.
