# Homework 5 - Evals for Failure Analysis with Phoenix

<center>
    <p style="text-align:left">
        <img alt="phoenix logo" src="https://repository-images.githubusercontent.com/564072810/f3666cdf-cb3e-4056-8a25-27cb3e6b5848" width="600"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>

## Launch Phoenix

First, let's set up Phoenix on our local machine. You should run these commands within your terminal in your chosen environment.

(If you have already done this in a previous HW assignment, you are good to go.)

**Install Phoenix**

```pip install arize-phoenix```

**Boot up Phoenix on localhost**

```phoenix serve```

Leave the server running while you work through this notebook. By default, the Phoenix UI is served at `http://localhost:6006`.

In [None]:
import getpass
import os
from typing import List

import matplotlib.pyplot as plt
import pandas as pd

import phoenix as px
from phoenix.client import AsyncClient
from phoenix.client.types.spans import SpanQuery

os.environ["OPENAI_API_KEY"] = os.getenv("OPENAI_API_KEY") or getpass.getpass(
    "Enter your OpenAI API key: "
)

## Pipeline States

Below are the states of the recipe agent pipeline we are simulating, in order.

**ParseRequest** - LLM interprets and analyzes the user's query to understand what they're asking for

**PlanToolCalls** - LLM decides which tools to invoke and in what order based on the parsed request

**GenRecipeArgs** - LLM constructs JSON arguments for the recipe database search based on customer profile

**GetRecipes** - Executes the recipe-search tool to find relevant recipes matching the criteria

**GenWebArgs** - LLM constructs JSON arguments for web search to find additional cooking tips/information

**GetWebInfo** - Executes the web-search tool to retrieve supplementary cooking information

**ComposeResponse** - LLM drafts the final answer combining recipes and web information

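**DeliverResponse** - Agent sends the composed response to the user

Only the first seven states have evaluators in this notebook. `DeliverResponse` is not evaluated, so it is left out of `PIPELINE_STATES` below.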

In [None]:
PIPELINE_STATES: List[str] = [
    "ParseRequest",
    "PlanToolCalls",
    "GenRecipeArgs",
    "GetRecipes",
    "GenWebArgs",
    "GetWebInfo",
    "ComposeResponse",
]
STATE_INDEX = {s: i for i, s in enumerate(PIPELINE_STATES)}
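
`STATE_INDEX` gives each state a position in the pipeline, which is handy for attributing a trace's failure to the earliest state that went wrong. The sketch below is a hypothetical helper (`first_failing_state` is not part of the homework code) and assumes evaluator labels are the strings `"pass"` and `"fail"`.

```python
from typing import Dict, List, Optional

# Same ordering as the PIPELINE_STATES cell, repeated so the sketch runs on its own.
PIPELINE_STATES: List[str] = [
    "ParseRequest",
    "PlanToolCalls",
    "GenRecipeArgs",
    "GetRecipes",
    "GenWebArgs",
    "GetWebInfo",
    "ComposeResponse",
]
STATE_INDEX = {s: i for i, s in enumerate(PIPELINE_STATES)}


def first_failing_state(labels: Dict[str, str]) -> Optional[str]:
    """Return the earliest pipeline state labeled "fail", or None if none failed.

    `labels` maps state name -> evaluator label (assumed to be "pass" or "fail").
    """
    failed = [s for s, label in labels.items() if label == "fail" and s in STATE_INDEX]
    if not failed:
        return None
    # The smallest index is the state that runs first in the pipeline.
    return min(failed, key=STATE_INDEX.__getitem__)


# Made-up labels for a single trace, for illustration only.
example = {"ParseRequest": "pass", "GetWebInfo": "fail", "GenRecipeArgs": "fail"}
print(first_failing_state(example))  # GenRecipeArgs
```

Attributing each trace to its first failing state can help separate root causes from downstream side effects.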

## Generate Phoenix Traces

Here we send 100 requests to the recipe bot and collect the resulting traces in Phoenix.

See `generate_traces_phoenix.py` for details on how this is implemented.

In [None]:
%run generate_traces_phoenix.py

## Evals

Now we run the evals. There are 7 evaluators, each with its own prompt and each designed to evaluate one of the 7 states in `PIPELINE_STATES`. You can find the evaluator prompts in the `evaluators` directory.

We use a Phoenix `SpanQuery` to load the spans for each of the 7 states.

We use the Phoenix function `llm_generate` to run the evals. `llm_generate` has built-in concurrency, which makes the LLM calls for your evals much quicker.

Finally, we use `log_span_annotations_dataframe` to log the evals back to their spans in Phoenix.

In [None]:
# Evals
import os
import re

import nest_asyncio

from phoenix.evals import OpenAIModel, llm_generate

nest_asyncio.apply()

eval_to_path = {
    "ParseRequest": "evaluators/parse_request_eval.txt",
    "PlanToolCalls": "evaluators/plan_tool_calls_eval.txt",
    "GenRecipeArgs": "evaluators/gen_recipe_args_eval.txt",
    "GetRecipes": "evaluators/get_recipes_eval.txt",
    "GenWebArgs": "evaluators/gen_web_args_eval.txt",
    "GetWebInfo": "evaluators/get_web_info_eval.txt",
    "ComposeResponse": "evaluators/compose_response_eval.txt",
}


async def load_spans(name: str) -> pd.DataFrame:
    query = SpanQuery().where(f"name == '{name}'")
    px_client = AsyncClient()
    spans_df = await px_client.spans.get_spans_dataframe(query=query, project_identifier="recipe-agent-hw5")
    print(f"Successfully loaded {len(spans_df)} {name} spans from Phoenix")
    return spans_df


annotated_spans = []


async def eval_spans(spans_df: pd.DataFrame, eval_prompt: str) -> pd.DataFrame:
    def parser(response: str, row_index: int) -> dict:
        """Extract the label and explanation from the evaluator's JSON response."""
        # (?:[^"\\]|\\.)* also matches escaped quotes (\") inside a JSON string,
        # so explanations that quote text are not cut short.
        label = r'"label":\s*"((?:[^"\\]|\\.)*)"'
        explanation = r'"explanation":\s*"((?:[^"\\]|\\.)*)"'
        label_match = re.search(label, response, re.IGNORECASE)
        explanation_match = re.search(explanation, response, re.IGNORECASE)
        if label_match and explanation_match:
            return {"label": label_match.group(1), "explanation": explanation_match.group(1)}
        return {"label": "UNKNOWN", "explanation": "Failed to parse response"}

    eval_model = OpenAIModel(
        model="gpt-4o", model_kwargs={"response_format": {"type": "json_object"}, "temperature": 0}
    )

    # Generate evaluations using llm_generate
    failure_analysis = llm_generate(
        dataframe=spans_df,
        template=eval_prompt,
        model=eval_model,
        output_parser=parser,
        concurrency=10,
    )

    failure_analysis["context.trace_id"] = spans_df["context.trace_id"]
    failure_analysis["attributes.input.value"] = spans_df["attributes.input.value"]
    failure_analysis["attributes.output.value"] = spans_df["attributes.output.value"]
    annotated_spans.append(failure_analysis)

    # AsyncClient is already imported at the top of the notebook
    px_client = AsyncClient()
    await px_client.annotations.log_span_annotations_dataframe(
        dataframe=failure_analysis,
        annotation_name="Eval",
        annotator_kind="LLM",
    )
    
    return failure_analysis


for eval_name, eval_path in eval_to_path.items():
    spans_df = await load_spans(eval_name)
    # Use a context manager so the prompt file is closed after reading
    with open(eval_path, "r") as f:
        eval_prompt = f.read()
    await eval_spans(spans_df, eval_prompt)
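
Because the eval model is configured with `response_format={"type": "json_object"}`, the judge's reply should already be valid JSON, so `json.loads` is one alternative to regular expressions for parsing it. The sketch below is a hypothetical drop-in parser (`parse_eval_response` is an illustrative name, not part of the homework code) that keeps the same `(response, row_index)` signature and the same `UNKNOWN` fallback as the parser in the cell above.

```python
import json


def parse_eval_response(response: str, row_index: int) -> dict:
    """Parse a judge reply that is expected to be a JSON object."""
    fallback = {"label": "UNKNOWN", "explanation": "Failed to parse response"}
    try:
        data = json.loads(response)
    except (json.JSONDecodeError, TypeError):
        return fallback
    # Require a JSON object that carries both expected keys.
    if not isinstance(data, dict) or "label" not in data or "explanation" not in data:
        return fallback
    return {"label": str(data["label"]), "explanation": str(data["explanation"])}


# Escaped quotes inside the explanation are decoded by the JSON parser.
reply = '{"label": "fail", "explanation": "Query was \\"unrelated topic\\", not recipes"}'
print(parse_eval_response(reply, 0))
print(parse_eval_response("not json", 1))
```

If this approach suits your prompts, the function could be passed as `output_parser` in place of the regex-based parser.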

## Attach Evals up to Root Trace

This code propagates our evals back to the root trace, so each root trace contains information on whether the 7 individual states within passed or failed.

In [None]:
# Track which traces have already been processed to avoid duplicates
traces = {}

for idx, span_df in enumerate(annotated_spans):
    failure_state_name = PIPELINE_STATES[idx]
    for span in span_df.iterrows():
        trace_id = span[1]["context.trace_id"]
        trace_query = SpanQuery().where(
            f"context.trace_id == '{trace_id}' and span_kind == 'AGENT'"
        )
        px_client = AsyncClient()
        trace_df = await px_client.spans.get_spans_dataframe(query=trace_query, project_identifier="recipe-agent-hw5")
        trace_df["label"] = [span[1]["label"]]
        trace_df["explanation"] = [span[1]["explanation"]]
        await px_client.annotations.log_span_annotations_dataframe(
            dataframe=trace_df,
            annotation_name=failure_state_name,
            annotator_kind="LLM",
        )
        trace = trace_df.iloc[0]
        del trace["label"]
        if trace_id not in traces:
            traces[trace_id] = trace
            traces[trace_id][failure_state_name] = span[1]["label"]
            # f-string so each state gets its own "<State>_explanation" column
            traces[trace_id][f"{failure_state_name}_explanation"] = span[1]["explanation"]
        else:
            trace = traces[trace_id]
            trace[failure_state_name] = span[1]["label"]
            trace[f"{failure_state_name}_explanation"] = span[1]["explanation"]

In [None]:
import pandas as pd

# Convert traces dict to DataFrame
df = pd.DataFrame.from_dict(traces, orient="index")

# Count failures for each state
failure_counts = (df[PIPELINE_STATES] == "fail").sum()

all_pass_count = (df[PIPELINE_STATES] != "fail").all(axis=1).sum()
failure_counts["all_pass"] = all_pass_count

# Plot
plt.figure(figsize=(8, 6))
failure_counts.plot(kind="bar")

plt.title("Failures per Pipeline State (with count of all-pass traces)")
plt.xlabel("Pipeline State")
plt.ylabel("Number of Traces")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

### Analysis of System Reliability and Failure Distribution

1. **Overall Reliability**
   Out of approximately 100 traces, about **one-third (32–33)** completed successfully without any failures. The remaining **two-thirds experienced at least one failure** along the pipeline. This indicates that while the system is not universally fragile, it is also not yet stable enough for production use.

2. **Where Failures Occur**
   Failures are not evenly distributed across the pipeline. Instead, they cluster around a few key stages:

* **GetWebInfo (33 failures):** The largest single source of errors, suggesting issues with external dependencies such as API reliability, query quality, or rate limits.
* **ParseRequest (18) and PlanToolCalls (17):** Together these account for 35 failures, making input interpretation and tool planning the second biggest weakness.
* **GenRecipeArgs, GetRecipes, GenWebArgs (8–11 each):** Moderate failure levels, likely reflecting occasional misalignments when constructing arguments or retrieving results.
* **ComposeResponse (1):** Very low failure rate, showing the model can reliably synthesize answers when given valid inputs. Errors here are likely side effects of upstream problems.

3. **Key Insights**
   The system demonstrates a **bimodal pattern**:

* Either a trace runs flawlessly end-to-end,
* Or it fails in predictable spots (primarily GetWebInfo, ParseRequest, or PlanToolCalls).

This clustering of root causes means that **targeted improvements in just a few stages** could significantly increase the overall success rate and bring the pipeline closer to production stability.
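
One way to check the bimodal reading is to count how many states failed in each trace and look at that distribution. The sketch below uses a handful of made-up traces purely for illustration (they are not results from this run); with real data, the rows of the `df` built in the plotting cell would play the role of `toy_traces`. `failures_per_trace` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List


def failures_per_trace(traces: List[Dict[str, str]]) -> Counter:
    """Map 'number of failed states in a trace' -> 'number of traces'."""
    return Counter(
        sum(1 for label in trace.values() if label == "fail") for trace in traces
    )


# Made-up labels, for illustration only.
toy_traces = [
    {"ParseRequest": "pass", "GetWebInfo": "pass"},
    {"ParseRequest": "pass", "GetWebInfo": "fail"},
    {"ParseRequest": "fail", "GetWebInfo": "fail"},
    {"ParseRequest": "pass", "GetWebInfo": "pass"},
]

histogram = failures_per_trace(toy_traces)
for n_failed in sorted(histogram):
    print(f"{n_failed} failed state(s): {histogram[n_failed]} trace(s)")
```

A large count at zero together with a cluster at one or two failures would be consistent with the pattern described above, while a long tail of traces with many failed states would point to cascading failures instead.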

## Completing the Feedback Loop

Evals are powerful because they tell us not only where something went wrong, but also why. This information is stored in the eval explanations.

You can feed those explanations to an LLM to generate concrete improvement strategies for your application.

![feedback loop](https://storage.googleapis.com/arize-phoenix-assets/assets/images/Screenshot%202025-08-19%20at%208.38.11%E2%80%AFPM.png)

In [None]:
GetWebInfo_code = """
def gen_web_args(recipes: str, failure: int) -> str:
    '''Simulate GenWebArgs state - LLM constructs arguments for web search.'''
    if failure != 5:
        prompt = '''
        Generate web_search arguments to find cooking tips based on these recipes.

        Recipes: {{recipes}}

        Return just the query, which is a string containing a web search query for additional information, cooking tips, special ingredients, etc.
        '''
    response = chat_completion([{"role": "user", "content": prompt}])

@tracer.tool(name="GetWebInfo", description="Search the web for recipe information and tips")
def get_web_info(search_query: str, failure: int) -> str:
    '''GetWebInfo state - Executes Tavily web search for recipe information.'''

    # Initialize Tavily client
    tavily_client = TavilyClient(api_key=os.getenv("TAVILY_API_KEY"))

    if failure != 6:
        # Normal search for recipe information
        response = tavily_client.search(search_query)
        return json.dumps(response)
    else:
        # Search for off-topic information for testing
        off_topic_query = "unrelated topic information"
        response = tavily_client.search(off_topic_query)
        return json.dumps(response)

"""

In [None]:
from openai import OpenAI

PROMPT = """You are auditing failures for the state: {failure_state_name}.

You will receive many short “explanations” describing material defects detected by an evaluator. Your tasks:
1) Synthesize recurring failure patterns.
2) Propose concrete, testable fixes that reduce these failures at the source state.
3) Provide code for unit tests that should fail now and pass after the fixes.

Inputs, Outputs, Labels, and Explanations:
"""

per_class_prompts = {}

for idx, span_df in enumerate(annotated_spans):
    failure_state_name = PIPELINE_STATES[idx]
    df = span_df.reset_index(drop=True).fillna("")

    cases = []
    for i in range(len(df)):
        r = df.iloc[i]
        cases.append(
            f"Case {i + 1}:\n"
            f"  Label: {r['label']}\n"
            f"  Explanation: {r['explanation']}\n"
            f"  Input: {r['attributes.input.value']}\n"
            f"  Output: {r['attributes.output.value']}"
        )

    prompt = PROMPT.format(failure_state_name=failure_state_name) + "\n" + "\n\n".join(cases)
    per_class_prompts[failure_state_name] = prompt
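
The loop above sends every case, passing or failing, to the model. Since the prompt asks about defects, one option is to include only failing cases and to cap long inputs and outputs so the prompt stays small. The sketch below is a hypothetical variant (`format_failure_cases`, `_clip`, and the default `max_chars` of 500 are illustrative choices) that works on plain dictionaries with the same keys as the DataFrame columns.

```python
from typing import Dict, List


def _clip(text: str, max_chars: int) -> str:
    """Truncate long text and mark the cut with an ellipsis."""
    text = str(text)
    return text if len(text) <= max_chars else text[:max_chars] + "..."


def format_failure_cases(rows: List[Dict[str, str]], max_chars: int = 500) -> str:
    """Format only the rows labeled "fail", numbering them from 1."""
    failing = [r for r in rows if r.get("label") == "fail"]
    cases = []
    for i, r in enumerate(failing, start=1):
        cases.append(
            f"Case {i}:\n"
            f"  Label: {r['label']}\n"
            f"  Explanation: {r['explanation']}\n"
            f"  Input: {_clip(r['attributes.input.value'], max_chars)}\n"
            f"  Output: {_clip(r['attributes.output.value'], max_chars)}"
        )
    return "\n\n".join(cases)


# Made-up rows for illustration; real rows could come from span_df.to_dict("records").
rows = [
    {
        "label": "pass",
        "explanation": "Results match the query.",
        "attributes.input.value": "pasta cooking tips",
        "attributes.output.value": "{...}",
    },
    {
        "label": "fail",
        "explanation": "Search results are unrelated to the recipes.",
        "attributes.input.value": "unrelated topic information",
        "attributes.output.value": "x" * 1000,
    },
]
print(format_failure_cases(rows, max_chars=40))
```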

In [None]:
from IPython.display import Markdown, display

label = "GetWebInfo"
prompt = per_class_prompts[label]
# Separate the source code from the last case with blank lines
prompt += f"\n\nCode for {label}:\n{GetWebInfo_code}"

openai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
response = openai_client.responses.create(
    model="gpt-4o",
    input=prompt,
)

display(Markdown(response.output_text))

# All done!

## We learned how to:

- Focus each eval on a single pipeline state
- Build effective eval prompts
- Judge our application based on our evals
- Use eval explanations to generate targeted feedback