<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://raw.githubusercontent.com/Arize-ai/phoenix-assets/9e6101d95936f4bd4d390efc9ce646dc6937fb2d/images/socal/github-large-banner-phoenix.jpg" width="1000"/>
        <br>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Instrumenting a chatbot with human feedback</h1>

Phoenix provides endpoints to associate user-provided feedback directly with OpenInference spans as annotations.

In this tutorial, we will build a manually instrumented chatbot with user-triggered "👍" and "👎" feedback buttons. Each button triggers a callback that sends the user's feedback to Phoenix, where it is viewable alongside the span. Automatically associating feedback with spans is a powerful way to quickly focus on the traces of your application that are not behaving as expected.

In [None]:
!pip install -q arize-phoenix-otel "arize-phoenix-client>=1.5.0" gradio

In [None]:
import os
from getpass import getpass
from typing import Any, Dict
from uuid import uuid4

import httpx

from phoenix.client import Client
from phoenix.otel import register

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

if not (phoenix_api_key := os.getenv("PHOENIX_API_KEY")):
    phoenix_api_key = getpass("🔑 Enter your Phoenix API key: ")

os.environ["PHOENIX_CLIENT_HEADERS"] = f"api_key={phoenix_api_key}"
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
os.environ["PHOENIX_PROJECT_NAME"] = "Chatbot with Annotations"

## Define endpoints and configure OpenTelemetry tracing

In [None]:
tracer_provider = register()

In [None]:
FEEDBACK_ENDPOINT = f"{os.environ['PHOENIX_COLLECTOR_ENDPOINT']}/span_annotations"
OPENAI_API_URL = "https://api.openai.com/v1/chat/completions"
tracer = tracer_provider.get_tracer(__name__)

## Define and instrument chat service backend

Here we define two functions:

`generate_response` contains the chatbot logic for responding to a user query. It is manually instrumented using the `OpenInference` semantic conventions. More information on how to manually instrument an application can be found [here](https://docs.arize.com/phoenix/tracing/how-to-tracing/manual-instrumentation). `generate_response` also returns the OpenTelemetry span ID, a hex-encoded string that is used to associate feedback with a specific span.

`send_feedback` sends user feedback to Phoenix as a span annotation through the Phoenix client, which posts to the `span_annotations` REST route.
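
To make the span ID encoding concrete, here is a minimal sketch of the conversion on its own. OpenTelemetry stores a span ID as a 64-bit integer, and the annotation call expects the 16-character hex form. The helper name `format_span_id_hex` and the sample ID value are illustrative assumptions, not part of Phoenix.

```python
def format_span_id_hex(span_id: int) -> str:
    """Encode a 64-bit span ID as a 16-character lowercase hex string."""
    # 8 bytes, big-endian, so leading zeros are preserved in the output
    return span_id.to_bytes(8, "big").hex()


# Hypothetical span ID, chosen only for illustration
example_span_id = 0x00F067AA0BA902B7

encoded = format_span_id_hex(example_span_id)
print(encoded)  # 00f067aa0ba902b7

# Zero-padded string formatting yields the same encoding
assert encoded == format(example_span_id, "016x")
```

Keeping the zero padding matters: an ID with leading zero bytes would otherwise produce a shorter string that no longer matches the span stored in Phoenix.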

In [None]:
client = Client()
http_client = httpx.Client()


def generate_response(
    input_text: str, model: str = "gpt-3.5-turbo", temperature: float = 0.1
) -> tuple[Dict[str, Any], str]:
    user_message = {"role": "user", "content": input_text, "uuid": str(uuid4())}
    invocation_parameters = {"temperature": temperature}
    payload = {
        "model": model,
        **invocation_parameters,
        "messages": [user_message],
    }
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai_api_key}",
    }
    with tracer.start_as_current_span("llm_span", openinference_span_kind="llm") as span:
        span.set_input(user_message)

        # get the active hex-encoded spanID
        span_id = span.get_span_context().span_id.to_bytes(8, "big").hex()
        print(span_id)

        response = http_client.post(OPENAI_API_URL, headers=headers, json=payload)

        if not (200 <= response.status_code < 300):
            raise Exception(f"Failed to call OpenAI API: {response.text}")
        response_json = response.json()

        span.set_output(response_json)

        return response_json, span_id


def send_feedback(span_id: str, feedback: int, user_id: str) -> None:
    label = "👍" if feedback == 1 else "👎"
    client.annotations.add_span_annotation(
        span_id=span_id,
        annotation_name="user_feedback",
        label=label,
        score=feedback,
        metadata={"example_key": "123"},
        identifier=user_id,
    )
    print(f"Feedback sent for span_id {span_id}: {label}")

### Define an LLM evaluator to run on incorrect responses

In [None]:
def run_llm_eval(span_id: str, input_text: str, assistant_content: str):
    """
    Evaluates the quality of an LLM response by asking another LLM to classify its correctness.

    Args:
        span_id: The ID of the span to evaluate
        input_text: The original unchanged user query
        assistant_content: The assistant's response to evaluate
    """
    # Create a prompt for the evaluation model
    eval_prompt = f"""
    You are an expert evaluator of AI assistant responses. Please evaluate the following:

    User Query: {input_text}

    Assistant Response: {assistant_content}

    Is this response correct, helpful, and appropriate for the user query?
    Provide a brief analysis and then classify as either "CORRECT" or "INCORRECT".

    Format your response as follows:
    Analysis: [Your analysis here]
    Classification: [CORRECT or INCORRECT]
    """

    # Call the evaluation model using the OpenAI API

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {openai_api_key}",
    }

    payload = {
        "model": "gpt-4o",  # Using a more capable model than the chatbot's for evaluation
        "messages": [{"role": "user", "content": eval_prompt}],
    }

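    # Increased timeout to prevent ReadTimeout errors
    eval_response = http_client.post(OPENAI_API_URL, headers=headers, json=payload, timeout=60.0)
    # Surface API errors here instead of failing later on a missing "choices" key
    eval_response.raise_for_status()
    eval_response = eval_response.json()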
    print(eval_response)
    eval_content = eval_response["choices"][0]["message"]["content"]

    # Store the evaluation as an annotation (score 1 = correct, 0 = incorrect)
    is_incorrect = "Classification: INCORRECT" in eval_content
    client.annotations.add_span_annotation(
        span_id=span_id,
        annotation_name="correctness",
        annotator_kind="LLM",
        label="INCORRECT" if is_incorrect else "CORRECT",
        score=0 if is_incorrect else 1,
        explanation=eval_content,
    )

    print(f"LLM Evaluation for span_id {span_id}:")
    print(eval_content)
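
The evaluator above looks for the exact substring `Classification: INCORRECT` and treats everything else as correct. As a hedged alternative, the sketch below parses the classification line a little more defensively and returns `None` when no verdict is found. The function name `parse_classification` and the regular expression are assumptions made for illustration; they are not part of the Phoenix client.

```python
import re
from typing import Optional

# Matches the "Classification: ..." line requested in the evaluation prompt,
# tolerating brackets, markdown asterisks, and case differences.
_CLASSIFICATION_RE = re.compile(
    r"Classification:[\s*\[]*(INCORRECT|CORRECT)\b", re.IGNORECASE
)


def parse_classification(eval_content: str) -> Optional[str]:
    """Return "CORRECT" or "INCORRECT", or None if no classification is found."""
    match = _CLASSIFICATION_RE.search(eval_content)
    return match.group(1).upper() if match else None


print(parse_classification("Analysis: wrong capital.\nClassification: [INCORRECT]"))
print(parse_classification("The model gave no verdict."))
```

With a `None` result, one option is to skip the annotation or log the raw output for review, rather than recording a label the evaluator never produced.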

## Create Chat Widget

We create a simple chat application using Gradio. Alongside the chatbot responses, we provide feedback buttons that a user can click to rate the latest response. The resulting annotations can be seen inside the Phoenix UI!

In [None]:
def create_gradio_chat():
    import gradio as gr

    def chat_response(message, history, user_id):
        # Send the message to the OpenAI API and get the response
        response_data, span_id = generate_response(message)
        assistant_content = response_data["choices"][0]["message"]["content"]

        # Return the span_id so that feedback can be attached to this span
        return assistant_content, span_id

    def submit_feedback(feedback_type, span_id, message, response, user_id):
        if feedback_type == "positive":
            send_feedback(span_id, 1, user_id)
            return "Thanks for your positive feedback! We'll use it to improve our assistant."
        else:  # negative feedback
            send_feedback(span_id, 0, user_id)
            run_llm_eval(span_id, message, response)
            return "Thanks for your feedback. We'll work on improving this type of response."

    with gr.Blocks() as demo:
        gr.HTML("<h3>Encyclopedia Chatbot</h3>")
        gr.HTML(
            "<p>Welcome to the Encyclopedia Chatbot. Ask any question about the world, and provide feedback to help us improve!</p>"
        )

        user_id = gr.Dropdown(
            choices=["user1", "user2", "user3", "user4", "user5"], value="user1", label="User ID"
        )

        chatbot = gr.Chatbot(height=400)
        msg = gr.Textbox(placeholder="Type your message here...")

        # Hidden state to store the current span_id
        current_span_id = gr.State("")
        feedback_message = gr.Markdown("")

        def respond(message, chat_history, user_id):
            # Get bot response
            bot_response, span_id = chat_response(message, chat_history, user_id)

            # Update chat history
            chat_history.append((message, bot_response))

            return "", chat_history, span_id

        # Send button
        msg.submit(respond, [msg, chatbot, user_id], [msg, chatbot, current_span_id])

        with gr.Row():
            thumbs_up = gr.Button("👍", scale=1)
            thumbs_down = gr.Button("👎", scale=1)

        # Feedback handlers
        def handle_positive_feedback(span_id, chat_history, user_id):
            if not chat_history:
                return "No message to provide feedback on."

            last_user_msg, last_bot_msg = chat_history[-1]
            return submit_feedback("positive", span_id, last_user_msg, last_bot_msg, user_id)

        def handle_negative_feedback(span_id, chat_history, user_id):
            if not chat_history:
                return "No message to provide feedback on."

            last_user_msg, last_bot_msg = chat_history[-1]
            return submit_feedback("negative", span_id, last_user_msg, last_bot_msg, user_id)

        thumbs_up.click(
            handle_positive_feedback, [current_span_id, chatbot, user_id], feedback_message
        )

        thumbs_down.click(
            handle_negative_feedback, [current_span_id, chatbot, user_id], feedback_message
        )

    return demo


# Create and display the Gradio interface
demo = create_gradio_chat()
demo.launch(inline=True, share=False)

## Analyze feedback using the Phoenix Client

We can use the Phoenix client to pull the annotated spans. By combining `get_spans_dataframe`
and `get_span_annotations_dataframe` we can create a dataframe of all annotations alongside
span data for analysis!

In [None]:
spans_df = client.spans.get_spans_dataframe(project_identifier=os.environ["PHOENIX_PROJECT_NAME"])
annotations_df = client.spans.get_span_annotations_dataframe(
    spans_dataframe=spans_df, project_identifier=os.environ["PHOENIX_PROJECT_NAME"]
)

In [None]:
annotations_df.join(spans_df, how="inner")
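
To illustrate what the inner join above produces, here is a small self-contained sketch using toy dataframes in place of the Phoenix results. The column names (`name`, `annotation_name`, `score`) and span IDs are hypothetical stand-ins; the real dataframes returned by the client may use different columns.

```python
import pandas as pd

# Toy stand-in for the spans dataframe, indexed by span ID
spans = pd.DataFrame(
    {"name": ["llm_span", "llm_span", "llm_span"]},
    index=pd.Index(["aaa1", "bbb2", "ccc3"], name="span_id"),
)

# Toy stand-in for the annotations dataframe; a span may carry several annotations
annotations = pd.DataFrame(
    {
        "annotation_name": ["user_feedback", "user_feedback", "correctness"],
        "score": [1, 0, 0],
    },
    index=pd.Index(["aaa1", "bbb2", "bbb2"], name="span_id"),
)

# Inner join on the shared index: one row per annotation, unannotated spans dropped
joined = annotations.join(spans, how="inner")

# Share of thumbs-up feedback among spans that received user feedback
feedback = joined[joined["annotation_name"] == "user_feedback"]
thumbs_up_rate = feedback["score"].mean()
print(len(joined), thumbs_up_rate)
```

Because the join is driven by the annotations, a span with both user feedback and an LLM evaluation appears twice in the result, once per annotation.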

In [None]:
client.spans.get_span_annotations(
    span_ids=spans_df.index, project_identifier=os.environ["PHOENIX_PROJECT_NAME"]
)