<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>

# Instrumenting AWS Bedrock client with OpenInference and Phoenix

In this tutorial we will trace model calls to AWS Bedrock using OpenInference. The OpenInference Bedrock tracer instruments the Python `boto3` library, so all `invoke_model` calls will automatically generate traces that can be sent to Phoenix.

ℹ️ This notebook requires a valid AWS configuration, access to AWS Bedrock and Anthropic's `claude-v2` model, and an OpenAI API key for the LLM-as-a-judge evaluation.

## 1. Install dependencies and set up OpenTelemetry tracer

First, install the dependencies.

In [1]:
%pip install arize-phoenix boto3 openinference-instrumentation-bedrock openai nest_asyncio

Note: you may need to restart the kernel to use updated packages.


Import libraries

In [2]:
import json
from urllib.parse import urljoin

import boto3
from openinference.instrumentation.bedrock import BedrockInstrumentor
from opentelemetry.sdk.trace.export import ConsoleSpanExporter

import phoenix as px
from phoenix.otel import SimpleSpanProcessor, register



Start a Phoenix server to collect traces. Open Phoenix in your browser to watch traces appear as they are collected.

In [None]:
px.launch_app().view()
session_url = px.active_session().url

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
📺 Opening a view to the Phoenix app. The app is running at http://localhost:6006/


Here we're configuring the OpenTelemetry tracer by adding two SpanProcessors. The first SpanProcessor will simply print all traces received from OpenInference instrumentation to the console. The second will export traces to Phoenix so they can be collected and viewed.

In [4]:
phoenix_otlp_endpoint = urljoin(session_url, "v1/traces")
tracer_provider = register()
tracer_provider.add_span_processor(SimpleSpanProcessor(span_exporter=ConsoleSpanExporter()))
tracer_provider.add_span_processor(SimpleSpanProcessor(endpoint=phoenix_otlp_endpoint))

🔭 OpenTelemetry Tracing Details 🔭
|  Phoenix Project: default
|  Span Processor: SimpleSpanProcessor
|  Collector Endpoint: localhost:4317
|  Transport: gRPC
|  Transport Headers: {'user-agent': '****'}
|  
|  Using a default SpanProcessor. `add_span_processor` will overwrite this default.
|  
|  `register` has set this TracerProvider as the global OpenTelemetry default.
|  To disable this behavior, call `register` with `set_global_tracer_provider=False`.
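
The collector endpoint in the cell above is built with `urljoin`, whose result depends on whether the base URL ends with a slash. The sketch below shows this with the standard library only. The helper name `traces_endpoint` and the `example.com` URLs are hypothetical and stand in for a Phoenix deployment served under a path prefix.

```python
from urllib.parse import urljoin


def traces_endpoint(base_url: str) -> str:
    """Join a Phoenix base URL with the relative traces path, as the notebook does."""
    return urljoin(base_url, "v1/traces")


# Local session URL as printed by px.launch_app() in this notebook.
print(traces_endpoint("http://localhost:6006/"))

# Hypothetical deployment under a path prefix: without a trailing slash,
# urljoin replaces the last path segment instead of appending to it.
print(traces_endpoint("http://example.com/phoenix"))
print(traces_endpoint("http://example.com/phoenix/"))
```

If Phoenix is served under a path prefix, ending the base URL with `/` keeps the prefix in the resulting endpoint.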



## 2. Instrumenting Bedrock clients

Now, let's create a `boto3` session. This initiates a configured environment for interacting with AWS services. If you haven't yet configured `boto3` to use your credentials, please refer to the [official documentation](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html). Or, if you have the AWS CLI, run `aws configure` from your terminal.

In [25]:
session = boto3.session.Session()

Clients created using this session configuration are currently uninstrumented. We'll make one for comparison.

In [6]:
uninstrumented_client = session.client("bedrock-runtime", region_name="us-west-2")

Now we instrument Bedrock with our OpenInference instrumentor. All Bedrock clients created after this call will automatically produce traces when calling `invoke_model`.

In [8]:
BedrockInstrumentor().instrument(skip_dep_check=True)
instrumented_client = session.client("bedrock-runtime", region_name="us-west-2")
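
To build intuition for why only clients created after `instrument()` are traced, here is a toy sketch. It assumes the instrumentor works by patching client creation. Every name in it (`ToyClient`, `create_client`, `instrument`, `recorded_spans`) is hypothetical. It is not the OpenInference implementation and does not touch `boto3`.

```python
import functools
import time

recorded_spans = []  # hypothetical stand-in for a span exporter


class ToyClient:
    """Hypothetical stand-in for a Bedrock runtime client."""

    def invoke_model(self, body):
        return {"completion": f"echo: {body}"}


def create_client():
    """Hypothetical client factory, playing the role of session.client(...)."""
    return ToyClient()


def instrument():
    """Patch the factory so that every client created afterwards records spans."""
    global create_client
    original_factory = create_client

    @functools.wraps(original_factory)
    def traced_factory():
        client = original_factory()
        original_invoke = client.invoke_model

        @functools.wraps(original_invoke)
        def traced_invoke(body):
            start = time.perf_counter()
            result = original_invoke(body)
            recorded_spans.append(
                {"name": "invoke_model", "input": body, "seconds": time.perf_counter() - start}
            )
            return result

        client.invoke_model = traced_invoke  # patches this instance only
        return client

    create_client = traced_factory


client_before = create_client()  # created before instrument(): left untouched
instrument()
client_after = create_client()  # created after instrument(): traced

client_before.invoke_model("hello")
client_after.invoke_model("hello")
print(len(recorded_spans))  # only the second call was recorded
```

Under this assumption, `uninstrumented_client` stays silent because it was created before the patch was applied.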

## 3. Calling the LLM and viewing OpenInference traces

Calling `invoke_model` using the `uninstrumented_client` will produce no traces, but will show the output from the LLM.

In [29]:
prompt = b'''{"prompt": "Human: What is the 3rd month of the year in alphabetical order? Assistant:", "max_tokens_to_sample": 1024}'''
response = uninstrumented_client.invoke_model(modelId="anthropic.claude-v2:1", body=prompt)
response_body = json.loads(response.get("body").read())
print(response_body["completion"])

 * The months of the year in alphabetical order are:
* April
* August 
* December
* February
* January
* July
* June
* March
* May
* November
* October
* September

So the 3rd month in alphabetical order is March.
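
This question has a deterministic answer, so the completion can be checked without a model. A minimal sketch using only built-in sorting follows. The month list is hard-coded rather than read from the `calendar` module, to avoid locale differences.

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

# Sort the names alphabetically and take the third entry (index 2).
alphabetical = sorted(MONTHS)
third_alphabetical = alphabetical[2]

print(alphabetical[:3])
print(third_alphabetical)
```

The model's own list is sorted correctly, which puts December in third place, yet its final sentence names March, the third month in calendar order. Mistakes like this are what the evaluation step later in the notebook is meant to surface.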


LLM calls using the `instrumented_client` will print traces to the console! By configuring the `SpanProcessor` to export to a different OpenTelemetry collector, your OpenInference spans can be collected and analyzed to better understand the behavior of your LLM application.
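
Each exported span carries flat attributes such as `llm.token_count.prompt` and `llm.invocation_parameters`. The sketch below shows one way to read them, using values copied from the console span that the next cell prints. The helper `summarize_span` is hypothetical and not part of Phoenix or OpenInference.

```python
import json

# Attribute values copied from the console span printed in this notebook.
span_attributes = {
    "input.value": "Human: What is the 3rd month of the year in alphabetical order? Assistant:",
    "llm.invocation_parameters": "{\"max_tokens_to_sample\": 1024}",
    "llm.token_count.prompt": 22,
    "llm.token_count.completion": 50,
    "llm.token_count.total": 72,
    "llm.model_name": "anthropic.claude-v2:1",
}


def summarize_span(attributes):
    """Build a small summary from flat span attributes (hypothetical helper)."""
    # Invocation parameters are stored as a JSON string, so decode them first.
    params = json.loads(attributes["llm.invocation_parameters"])
    prompt_tokens = attributes["llm.token_count.prompt"]
    completion_tokens = attributes["llm.token_count.completion"]
    return {
        "model": attributes["llm.model_name"],
        "max_tokens": params.get("max_tokens_to_sample"),
        "tokens_add_up": prompt_tokens + completion_tokens == attributes["llm.token_count.total"],
    }


summary = summarize_span(span_attributes)
print(summary)
```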

In [30]:
response = instrumented_client.invoke_model(modelId="anthropic.claude-v2:1", body=prompt)
response_body = json.loads(response.get("body").read())
print(response_body["completion"])

{
    "name": "bedrock.invoke_model",
    "context": {
        "trace_id": "0x3659224e5acc91cb4d27b5ec2cb172d1",
        "span_id": "0x610edd29b195ce57",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2025-04-18T06:40:47.571786Z",
    "end_time": "2025-04-18T06:40:49.139912Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "input.value": "Human: What is the 3rd month of the year in alphabetical order? Assistant:",
        "llm.invocation_parameters": "{\"max_tokens_to_sample\": 1024}",
        "llm.token_count.prompt": 22,
        "llm.token_count.completion": 50,
        "llm.token_count.total": 72,
        "llm.model_name": "anthropic.claude-v2:1",
        "output.value": " * The months of the year in alphabetical order are:\nApril, August, December, February, January, July, June, March, May, November, October, September\n* The 3rd month alphabetically is March.",
        "openinference.

## 4. Collect all your Traces & Data

Use the `instrumented_client` to generate a batch of traces. This example uses a set of trivia questions.

In [31]:
trivia_questions = [
    "What is the only U.S. state that starts with two vowels?",
    "What is the 3rd month of the year in alphabetical order?",
    "What is the capital of Mongolia?",
    "How many minutes are there in a leap year?",
    "If a train leaves New York at 3 PM traveling west at 60 mph, and another leaves Chicago at 4 PM traveling east at 80 mph, at what time will they meet?",
    "Which element has the chemical symbol 'Fe'?",
    "What five-letter word becomes shorter when you add two letters to it?",
    "What country has won the most FIFA World Cups?",
    "If today is Wednesday, what day of the week will it be 100 days from now?",
    "A farmer has 17 sheep and all but 9 run away. How many does he have left?",
]

for i, question in enumerate(trivia_questions, start=1):
    # Serialize with json.dumps so quotes inside a question cannot break the request body.
    body = json.dumps({"prompt": f"Human: {question} Assistant:", "max_tokens_to_sample": 300})
    response = instrumented_client.invoke_model(
        modelId="anthropic.claude-v2:1",
        body=body.encode("utf-8"),
        contentType="application/json",
        accept="application/json",
    )

    response_body = json.loads(response.get("body").read())
    print(f"Q{i}: {question}")
    print(f"A{i}: {response_body['completion'].strip()}\n{'-'*60}")

{
    "name": "bedrock.invoke_model",
    "context": {
        "trace_id": "0xc1df3e4a316dcf2af30e65b965a07856",
        "span_id": "0xf1b9c797c212fe0e",
        "trace_state": "[]"
    },
    "kind": "SpanKind.INTERNAL",
    "parent_id": null,
    "start_time": "2025-04-18T06:40:52.295027Z",
    "end_time": "2025-04-18T06:40:55.746751Z",
    "status": {
        "status_code": "UNSET"
    },
    "attributes": {
        "input.value": "Human: What is the only U.S. state that starts with two vowels? Assistant:",
        "llm.invocation_parameters": "{\"max_tokens_to_sample\": 300}",
        "llm.token_count.prompt": 24,
        "llm.token_count.completion": 165,
        "llm.token_count.total": 189,
        "llm.model_name": "anthropic.claude-v2:1",
        "output.value": " The only U.S. state that starts with two vowels is Hawaii.\n\nThe 50 U.S. states are:\n\nAlabama, Alaska, Arizona, Arkansas, California, Colorado, Connecticut, Delaware, Florida, Georgia, Hawaii, Idaho, Illinois, Ind

## 5. Set Up & Run your Eval

After importing your traces as a dataframe, rename the columns to match the variables in your eval template. Then run `llm_classify()` to classify each row of the dataframe using an LLM.

In [35]:
qa_template = """You are given a question and an answer. You must determine whether the
given answer correctly answers the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {Question}
    ************
    [Answer]: {Answer}
    [END DATA]
Your response must be a single number, either 0 or 1,
and should not contain any text or characters or numbers aside from that digit.
"1" means that the question is correctly and fully answered by the answer.
"0" means that the question is not correct or only partially answered by the
answer."""
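
The `{Question}` and `{Answer}` placeholders must match column names in the dataframe passed to `llm_classify`. The sketch below mimics that substitution with plain `str.format` on a shortened stand-in template. The helpers `template_variables` and `render_prompt` are hypothetical, and this is not how Phoenix implements templating internally.

```python
from string import Formatter

# Shortened stand-in for qa_template; only the placeholders matter here.
toy_template = "[Question]: {Question}\n[Answer]: {Answer}"


def template_variables(template):
    """Return the placeholder names that appear in a format-style template."""
    return [field for _, field, _, _ in Formatter().parse(template) if field]


def render_prompt(template, row):
    """Fill the template from a row, failing loudly if a column is missing."""
    missing = [name for name in template_variables(template) if name not in row]
    if missing:
        raise KeyError(f"row is missing columns required by the template: {missing}")
    return template.format(**row)


row = {
    "Question": "What is the capital of Mongolia?",
    "Answer": "The capital of Mongolia is Ulaanbaatar.",
}
print(template_variables(toy_template))
print(render_prompt(toy_template, row))
```

This is why the span columns are renamed to `Question` and `Answer` before the eval runs.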

In [32]:
import phoenix as px

spans_df = px.Client().get_spans_dataframe(project_name="default")
spans_df.head()

| context.span_id | name | span_kind | parent_id | start_time | end_time | status_code | status_message | events | context.span_id | context.trace_id | attributes.openinference.span.kind | attributes.llm.token_count.prompt | attributes.output.value | attributes.llm.token_count.completion | attributes.input.value | attributes.llm.invocation_parameters | attributes.llm.model_name | attributes.llm.token_count.total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0ebeeae9cd4f2505 | bedrock.invoke_model | LLM | | 2025-04-18 06:35:25.828994+00:00 | 2025-04-18 06:35:28.002651+00:00 | UNSET | | [] | 0ebeeae9cd4f2505 | b6661ab715d1ab884ae0b497c6abd751 | LLM | 22 | \* The months of the year in alphabetical orde... | 69 | Human: What is the 3rd month of the year in al... | {"max_tokens_to_sample": 1024} | anthropic.claude-v2:1 | 91 |
| fda9c42a464aaad7 | bedrock.invoke_model | LLM | | 2025-04-18 06:35:35.606493+00:00 | 2025-04-18 06:35:39.109414+00:00 | UNSET | | [] | fda9c42a464aaad7 | f2cf0d882d8f9c005613cb78cf43676c | LLM | 24 | The only U.S. state that starts with two vowe... | 169 | Human: What is the only U.S. state that starts... | {"max_tokens_to_sample": 300} | anthropic.claude-v2:1 | 193 |
| 42a3fc7e693f8c4e | bedrock.invoke_model | LLM | | 2025-04-18 06:35:39.118771+00:00 | 2025-04-18 06:35:41.090864+00:00 | UNSET | | [] | 42a3fc7e693f8c4e | 72bacb025840060e99a883db20be2a29 | LLM | 22 | \* The months of the year in alphabetical orde... | 69 | Human: What is the 3rd month of the year in al... | {"max_tokens_to_sample": 300} | anthropic.claude-v2:1 | 91 |
| c154f84e0cb1e197 | bedrock.invoke_model | LLM | | 2025-04-18 06:35:41.098117+00:00 | 2025-04-18 06:35:41.725908+00:00 | UNSET | | [] | c154f84e0cb1e197 | 249e702817aa3b3af188bafffda427ed | LLM | 17 | The capital of Mongolia is Ulaanbaatar. | 16 | Human: What is the capital of Mongolia? Assist... | {"max_tokens_to_sample": 300} | anthropic.claude-v2:1 | 33 |
| 4f35be6b25a4147c | bedrock.invoke_model | LLM | | 2025-04-18 06:35:41.733760+00:00 | 2025-04-18 06:35:45.172950+00:00 | UNSET | | [] | 4f35be6b25a4147c | 92841e57de900e5844ffcc56fd6395ce | LLM | 19 | Okay, let's break this down step-by-step:\n\n... | 102 | Human: How many minutes are there in a leap ye... | {"max_tokens_to_sample": 300} | anthropic.claude-v2:1 | 121 |


In [42]:
eval_df = spans_df[["context.span_id", "attributes.input.value", "attributes.output.value"]].copy()
eval_df.set_index("context.span_id", inplace=True)
eval_df.head()

| context.span_id | attributes.input.value | attributes.output.value |
| --- | --- | --- |
| 0ebeeae9cd4f2505 | Human: What is the 3rd month of the year in al... | \* The months of the year in alphabetical orde... |
| fda9c42a464aaad7 | Human: What is the only U.S. state that starts... | The only U.S. state that starts with two vowe... |
| 42a3fc7e693f8c4e | Human: What is the 3rd month of the year in al... | \* The months of the year in alphabetical orde... |
| c154f84e0cb1e197 | Human: What is the capital of Mongolia? Assist... | The capital of Mongolia is Ulaanbaatar. |
| 4f35be6b25a4147c | Human: How many minutes are there in a leap ye... | Okay, let's break this down step-by-step:\n\n... |


In [34]:
evals_copy = eval_df.copy()
evals_copy["attributes.input.value"] = (
    evals_copy["attributes.input.value"]
    .str.replace(r"^Human: ", "", regex=True)
    .str.replace(r"Assistant:$", "", regex=True)
)

evals_copy = evals_copy.rename(columns={"attributes.input.value": "Question", 
                                        "attributes.output.value": "Answer"})
evals_copy.head()

| context.span_id | Question | Answer |
| --- | --- | --- |
| 0ebeeae9cd4f2505 | What is the 3rd month of the year in alphabeti... | \* The months of the year in alphabetical orde... |
| fda9c42a464aaad7 | What is the only U.S. state that starts with t... | The only U.S. state that starts with two vowe... |
| 42a3fc7e693f8c4e | What is the 3rd month of the year in alphabeti... | \* The months of the year in alphabetical orde... |
| c154f84e0cb1e197 | What is the capital of Mongolia? | The capital of Mongolia is Ulaanbaatar. |
| 4f35be6b25a4147c | How many minutes are there in a leap year? | Okay, let's break this down step-by-step:\n\n... |
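
The two regular expressions used above can be tried on a single string with the standard `re` module. The sketch below applies the same patterns and shows that a trailing space is left behind unless the result is stripped. The helper `extract_question` is a hypothetical name.

```python
import re


def extract_question(prompt: str) -> str:
    """Strip the 'Human: ' prefix and 'Assistant:' suffix from a prompt string."""
    without_prefix = re.sub(r"^Human: ", "", prompt)
    without_suffix = re.sub(r"Assistant:$", "", without_prefix)
    # The space that preceded 'Assistant:' is still there, so strip it.
    return without_suffix.strip()


raw = "Human: What is the capital of Mongolia? Assistant:"

# Same two substitutions as the notebook cell, without the final strip.
unstripped = re.sub(r"Assistant:$", "", re.sub(r"^Human: ", "", raw))
print(repr(unstripped))
print(repr(extract_question(raw)))
```

In pandas, chaining `.str.strip()` after the two `.str.replace` calls would have the same effect on the whole column.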


In [36]:
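import os

import nest_asyncio
from phoenix.evals import OpenAIModel, llm_classify

# Allow nested event loops so llm_classify can run inside the notebook's running loop.
nest_asyncio.apply()

# OpenAIModel reads the API key from the OPENAI_API_KEY environment variable.
if not os.environ.get("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running the evaluation.")

model = OpenAIModel(model="gpt-4", temperature=0.0)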

Q_and_A_classifications = llm_classify(
    data=evals_copy,
    template=qa_template,
    model=model,
    rails=["0", "1"],
    provide_explanation=True
)

llm_classify |██████████| 21/21 (100.0%) | ⏳ 00:07<00:00 |  2.87it/s


## 6. Log your evaluations to Phoenix

In [40]:
eval_results = Q_and_A_classifications[['label', 'explanation']]
evals_copy["score"] = eval_results["label"].astype(int)
evals_copy["explanation"] = eval_results["explanation"].astype(str)
evals_copy["label"] = evals_copy["score"].map({1: "correct", 0: "incorrect"})
evals_copy.head()

| context.span_id | Question | Answer | score | explanation | label |
| --- | --- | --- | --- | --- | --- |
| 0ebeeae9cd4f2505 | What is the 3rd month of the year in alphabeti... | \* The months of the year in alphabetical orde... | 1 | The answer correctly lists the months of the y... | correct |
| fda9c42a464aaad7 | What is the only U.S. state that starts with t... | The only U.S. state that starts with two vowe... | 1 | The answer correctly identifies Hawaii as the ... | correct |
| 42a3fc7e693f8c4e | What is the 3rd month of the year in alphabeti... | \* The months of the year in alphabetical orde... | 1 | The answer correctly lists the months of the y... | correct |
| c154f84e0cb1e197 | What is the capital of Mongolia? | The capital of Mongolia is Ulaanbaatar. | 1 | The answer correctly identifies Ulaanbaatar as... | correct |
| 4f35be6b25a4147c | How many minutes are there in a leap year? | Okay, let's break this down step-by-step:\n\n... | 0 | The answer is incorrect. The calculation is wr... | incorrect |
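
The cell above converts labels with `astype(int)`, which raises an error if any label is not a digit. The sketch below shows a more defensive conversion in plain Python. It assumes that an unparseable judge response comes back as a non-numeric placeholder string, written here as `"NOT_PARSABLE"`. The helper names `label_to_score` and `score_to_label` are hypothetical.

```python
def label_to_score(label):
    """Convert a rail label ("0" or "1") to an int, or None for anything else."""
    text = str(label).strip()
    return int(text) if text in {"0", "1"} else None


def score_to_label(score):
    """Map a numeric score to the human-readable label used in the notebook."""
    return {1: "correct", 0: "incorrect"}.get(score, "unknown")


# "NOT_PARSABLE" is an assumed placeholder for a response that matched no rail.
raw_labels = ["1", "0", "NOT_PARSABLE", None]
scores = [label_to_score(label) for label in raw_labels]
labels = [score_to_label(score) for score in scores]

print(scores)
print(labels)
```

Rows that map to `None` can then be inspected or dropped before the results are logged.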


In [38]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(SpanEvaluations(eval_name="Q&A Correctness", dataframe=evals_copy))

More information about OpenInference and our instrumentation integrations can be found in our [documentation](https://docs.arize.com/phoenix/telemetry/instrumentation).