<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg?__hstc=259489365.a667dfafcfa0169c8aee4178d115dc81.1733501603539.1733501603539.1733501603539.1&__hssc=259489365.1.1733501603539&__hsfp=3822854628&submissionGuid=381a0676-8f38-437b-96f2-fc10875658df#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Tracing and Evaluating an Amazon Bedrock Agent</h1>

In this tutorial, you will:

- Build an Amazon Bedrock agent
- Instrument and trace the agent with Phoenix
- Add evaluations to your agent traces

## Background

[Amazon Bedrock Agents](https://aws.amazon.com/bedrock/agents/) is a fully managed capability in Amazon Bedrock that allows you to build AI agents that can complete tasks by interacting with enterprise systems, data sources, and APIs. These agents can understand user requests in natural language, break down complex tasks into steps, retrieve relevant information, and take actions to fulfill user requests. With Bedrock Agents, you can create conversational assistants that can answer questions, provide recommendations, and perform actions on behalf of users, all while maintaining context throughout the conversation.

In this tutorial, we'll use Phoenix to trace and evaluate an Amazon Bedrock Agent, providing visibility into how the agent processes requests, makes decisions, and interacts with various systems to complete tasks.

Let's get started!

ℹ️ This notebook requires an AWS account with access to Bedrock.

### Install Dependencies

In [None]:
!pip install -q uv
!uv pip install -q arize-phoenix-otel boto3 anthropic openinference-instrumentation-bedrock nest_asyncio

In [None]:
import os
import time
from getpass import getpass

import boto3
import nest_asyncio

from phoenix.otel import register

nest_asyncio.apply()

### Set Phoenix Environment Variables

This example uses [Phoenix Cloud](https://app.phoenix.arize.com), our free online hosted version of Phoenix. If you'd prefer, you can [self-host Phoenix](https://arize.com/docs/phoenix/self-hosting) instead.

In [None]:
os.environ["PHOENIX_COLLECTOR_ENDPOINT"] = "https://app.phoenix.arize.com"
if not os.environ.get("PHOENIX_CLIENT_HEADERS"):
    os.environ["PHOENIX_CLIENT_HEADERS"] = "api_key=" + getpass("Enter your Phoenix API key: ")

### Connect to Phoenix
Now you can connect your notebook to a Phoenix instance.

The `auto_instrument` flag below searches your environment for installed `openinference-instrumentation` packages and activates each one it finds. Because you installed the `openinference-instrumentation-bedrock` library, any calls you make to Bedrock or Bedrock Agents are automatically instrumented and sent to Phoenix.

In [None]:
project_name = "Amazon Bedrock Agent Example"

tracer_provider = register(project_name=project_name, auto_instrument=True)
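
To check which instrumentation packages `auto_instrument` is likely to find, you can list the installed distributions whose names start with `openinference-instrumentation`. This is a minimal sketch using only the standard library. The helper names and the prefix match are illustrative assumptions, not Phoenix's actual discovery logic.

```python
from importlib import metadata

PREFIX = "openinference-instrumentation"  # assumed naming convention


def filter_instrumentation_packages(names):
    """Return the sorted, de-duplicated names that start with PREFIX."""
    normalized = {name.lower().replace("_", "-") for name in names if name}
    return sorted(name for name in normalized if name.startswith(PREFIX))


def installed_instrumentation_packages():
    """List installed distributions that look like OpenInference instrumentors."""
    names = [dist.metadata["Name"] for dist in metadata.distributions()]
    return filter_instrumentation_packages(names)


print(installed_instrumentation_packages())
```

If the Bedrock package is missing from the printed list, the install cell probably did not complete in the current environment.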

## Configure your agent in Bedrock Agents

Within [Bedrock Agents](https://us-east-2.console.aws.amazon.com/bedrock/home?region=us-east-2#/overview), create a new agent and configure it however you'd like. This example uses:
1. A knowledge base created using the web scraper tool.
2. A set of action group functions that retrieve information about Phoenix.

Bedrock Agents also supports Guardrails, Prompts, and more, all of which will be traced by Phoenix.

## Connect your notebook to AWS

Next, create an AWS SSO profile that can connect to Bedrock Agents. You can do this from the CLI with `aws configure sso`. Once you've completed that setup and created your agent in Bedrock Agents, fill in the variables below:

In [None]:
# SSO Profile Configuration
PROFILE_NAME = "phoenix"  # The name of the AWS SSO profile you created
REGION = "us-east-2"  # The region where your Bedrock agent is deployed
SERVICE_NAME = "bedrock-agent-runtime"  # The boto3 service name used to invoke Bedrock agents

# Bedrock Agent Configuration
AGENT_ID = ""  # The ID of your Bedrock agent, found in the Bedrock Agents console
AGENT_ALIAS_ID = ""  # The alias ID of your Bedrock agent, found in the Bedrock Agents console

In [None]:
session = boto3.Session(profile_name=PROFILE_NAME)
bedrock_agent_runtime = session.client(SERVICE_NAME, region_name=REGION)

## Run your Agent
You're now ready to run your Bedrock Agent.

In [None]:
def run(input_text):
    session_id = f"default-session1_{int(time.time())}"

    attributes = dict(
        inputText=input_text,
        agentId=AGENT_ID,
        agentAliasId=AGENT_ALIAS_ID,
        sessionId=session_id,
        enableTrace=True,
    )
    response = bedrock_agent_runtime.invoke_agent(**attributes)

    # Stream the response: print each decoded text chunk and each trace event
    for event in response["completion"]:
        if "chunk" in event:
            chunk_data = event["chunk"]
            if "bytes" in chunk_data:
                output_text = chunk_data["bytes"].decode("utf8")
                print(output_text)
        elif "trace" in event:
            print(event["trace"])
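
If you would rather return the agent's answer than print it, the same loop can accumulate the decoded chunks. The sketch below assumes events shaped like the ones handled in `run` (a `chunk` with `bytes`, or a `trace`). The `collect_completion` helper and the sample events are hypothetical and exist only for illustration.

```python
def collect_completion(events):
    """Join the decoded text of every `chunk` event and gather `trace` events."""
    text_parts = []
    traces = []
    for event in events:
        chunk = event.get("chunk")
        if chunk and "bytes" in chunk:
            text_parts.append(chunk["bytes"].decode("utf-8"))
        elif "trace" in event:
            traces.append(event["trace"])
    return "".join(text_parts), traces


# Hypothetical events shaped like the stream handled in `run`
sample_events = [
    {"trace": {"note": "made-up trace payload"}},
    {"chunk": {"bytes": b"Phoenix supports "}},
    {"chunk": {"bytes": b"LLM evals."}},
]
answer, traces = collect_completion(sample_events)
print(answer)
```

Inside `run`, you could pass `response["completion"]` to this helper and return the resulting text.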

In [None]:
run("Tell me about my recent Phoenix traces")

In [None]:
run("How do I run evaluations in Arize Phoenix?")

In [None]:
run("Tell me about my recent Phoenix experiments")

## View your Traces in Phoenix

You should now be able to see traces in your Phoenix dashboard:

![phoenix-bedrock-agent-traces-1](https://storage.googleapis.com/arize-phoenix-assets/assets/images/bedrock-agent-traces-1.png)
![phoenix-bedrock-agent-traces-2](https://storage.googleapis.com/arize-phoenix-assets/assets/images/bedrock-agent-traces-2.png)

## Evaluating your Agent

Phoenix also includes built-in LLM evaluations and code-based experiment testing. In this section, you'll add agent tool-calling evaluations to your traces.

Until now, you've only used the lighter-weight Phoenix OTEL tracing library. To run evals, you'll need to install the full library.

In [None]:
!pip install -q arize-phoenix

In [None]:
import json

import phoenix as px
from phoenix.evals import (
    TOOL_CALLING_PROMPT_RAILS_MAP,
    TOOL_CALLING_PROMPT_TEMPLATE,
    BedrockModel,
    llm_classify,
)
from phoenix.trace import SpanEvaluations
from phoenix.trace.dsl import SpanQuery

In [None]:
query = (
    SpanQuery()
    .where(
        # Filter for the `LLM` span kind.
        # The filter condition is a string containing a valid Python boolean expression.
        "span_kind == 'LLM'",
    )
    .select(
        question="input.value",
        outputs="output.value",
    )
)
trace_df = px.Client().query_spans(query, project_name=project_name)

In [None]:
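# Parse the JSON in the `question` column (selected from `input.value`) and
# keep the content of the first message; non-string values pass through unchanged
trace_df["question"] = trace_df["question"].apply(
    lambda x: (json.loads(x).get("messages") or [{}])[0].get("content", "")
    if isinstance(x, str)
    else x
)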

In [None]:
# Function to extract tool call names from the output
def extract_tool_calls(output_value):
    tool_calls = []
    try:
        o = json.loads(output_value)

        # Check if the output has 'content' which is a list of message components
        if "content" in o and isinstance(o["content"], list):
            for item in o["content"]:
                # Check if this item is a tool_use type
                if isinstance(item, dict) and item.get("type") == "tool_use":
                    # Extract the name of the tool being called
                    tool_name = item.get("name")
                    if tool_name:
                        tool_calls.append(tool_name)
    except (json.JSONDecodeError, TypeError, AttributeError):
        pass

    return tool_calls


# Apply the function to each row of the `outputs` column (selected from `output.value`)
trace_df["tool_call"] = trace_df["outputs"].apply(
    lambda x: extract_tool_calls(x) if isinstance(x, str) else []
)

# Display the tool calls found
print("Tool calls found in traces:", trace_df["tool_call"].sum())
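
To make the expected payload shape concrete, here is a small worked example of the same extraction idea. The sample JSON is hypothetical and only mirrors the structure the function above checks for (a `content` list containing `tool_use` items). Real span outputs may carry additional fields, and `tool_names` is an illustrative, condensed stand-in for `extract_tool_calls`.

```python
import json


def tool_names(output_value):
    """Return the names of `tool_use` items in a JSON-encoded LLM output."""
    try:
        content = json.loads(output_value).get("content", [])
    except (json.JSONDecodeError, TypeError, AttributeError):
        return []
    if not isinstance(content, list):
        return []
    return [
        item["name"]
        for item in content
        if isinstance(item, dict) and item.get("type") == "tool_use" and item.get("name")
    ]


# Hypothetical payload; real span outputs may carry additional fields.
sample_output = json.dumps(
    {
        "content": [
            {"type": "text", "text": "Let me look that up."},
            {"type": "tool_use", "name": "phoenix-traces", "input": {}},
        ]
    }
)
print(tool_names(sample_output))  # ['phoenix-traces']
print(tool_names("not json"))  # []
```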

In [None]:
# Keep only rows where `tool_call` is not empty (has at least one tool call).
# Copy so that later column assignments do not modify a slice of the original frame.
trace_df = trace_df[trace_df["tool_call"].apply(lambda x: len(x) > 0)].copy()

trace_df.head()

In [None]:
trace_df["tool_definitions"] = (
    "phoenix-traces retrieves the latest trace information from Phoenix, phoenix-experiments retrieves the latest experiment information from Phoenix, phoenix-datasets retrieves the latest dataset information from Phoenix"
)
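
As the number of tools grows, a hard-coded string becomes awkward to maintain. One option, sketched here as an assumption rather than a Phoenix requirement, is to keep the descriptions in a dictionary and render the string from it. `TOOLS` and `format_tool_definitions` are hypothetical names; the descriptions are copied from the string above.

```python
# Tool names and descriptions copied from the string above
TOOLS = {
    "phoenix-traces": "retrieves the latest trace information from Phoenix",
    "phoenix-experiments": "retrieves the latest experiment information from Phoenix",
    "phoenix-datasets": "retrieves the latest dataset information from Phoenix",
}


def format_tool_definitions(tools):
    """Render a name -> description mapping as one comma-separated string."""
    return ", ".join(f"{name} {description}" for name, description in tools.items())


tool_definitions = format_tool_definitions(TOOLS)
print(tool_definitions)
```

You could then assign `tool_definitions` to the `tool_definitions` column in place of the literal string.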

In [None]:
rails = list(TOOL_CALLING_PROMPT_RAILS_MAP.values())

eval_model = BedrockModel(session=session, model_id="anthropic.claude-3-5-haiku-20241022-v1:0")

response_classifications = llm_classify(
    data=trace_df,
    template=TOOL_CALLING_PROMPT_TEMPLATE,
    model=eval_model,
    rails=rails,
    provide_explanation=True,
)
response_classifications["score"] = response_classifications.apply(
    lambda x: 1 if x["label"] == "correct" else 0, axis=1
)
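
Before logging the results, it can help to glance at the overall label distribution. The sketch below uses only the standard library and made-up labels. `summarize_labels` is a hypothetical helper, and it assumes the positive label is `"correct"`, matching the score mapping above.

```python
from collections import Counter


def summarize_labels(labels, positive_label="correct"):
    """Return per-label counts and the share of labels equal to `positive_label`."""
    counts = Counter(labels)
    total = sum(counts.values())
    accuracy = counts[positive_label] / total if total else 0.0
    return dict(counts), accuracy


# Made-up labels for illustration; in the notebook you could pass
# response_classifications["label"].tolist() instead.
counts, accuracy = summarize_labels(["correct", "incorrect", "correct", "correct"])
print(counts, accuracy)
```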

In [None]:
px.Client().log_evaluations(
    SpanEvaluations(eval_name="Tool Calling Eval", dataframe=response_classifications),
)

You should now see your evaluation labels in Phoenix!

![bedrock-agent-evals-1](https://storage.googleapis.com/arize-phoenix-assets/assets/images/bedrock-agent-evals-1.png)
![bedrock-agent-evals-2](https://storage.googleapis.com/arize-phoenix-assets/assets/images/bedrock-agent-evals-2.png)

## Next Steps

From here, you could expand your agent's capabilities in Bedrock by attaching it to your own tools and AWS Lambda functions. Or you could expand your testing and experiment flows in Phoenix by checking out [Experiments](https://arize.com/docs/phoenix/datasets-and-experiments/how-to-experiments/run-experiments) or [Prompts](https://arize.com/docs/phoenix/prompt-engineering/overview-prompts).

And for more on [Agents](https://arize.com/ai-agents/) and [Evaluation](https://arize.com/llm-evaluation), check out Arize's [website](https://arize.com).

We can't wait to see what you'll build!