<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Tracing, Evaluation, and Analysis of an LLM Application using Phoenix</h1>

In this tutorial we will learn how to build, observe, evaluate, and analyze an LLM-powered application. The application is an LLM-powered chat-on-docs app that answers questions about <a href="https://docs.arize.com/arize/">Arize</a> using its product documentation.

Key Concepts:
1. LLM traces are a category of telemetry data used to understand the execution of LLMs and the surrounding application context (such as retrieval from vector stores, usage of external tools, etc.).

2. Traces are made up of a sequence of `spans`. A span represents a unit of work or operation (think of a span of time).

3. LLM evaluations provide visibility into the performance of the application.
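
To make the trace and span vocabulary concrete, here is a minimal, illustrative sketch in plain Python. The `Span` class, its field names, and the `children` helper are hypothetical and are not Phoenix's actual data model; the only point is that a trace is a collection of timed spans linked by parent IDs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Span:
    """One unit of work inside a trace (hypothetical fields, for illustration)."""

    span_id: str
    name: str
    span_kind: str  # example values only, e.g. "CHAIN", "RETRIEVER", "LLM"
    start: float  # seconds since the trace began
    end: float
    parent_id: Optional[str] = None

    @property
    def duration(self) -> float:
        return self.end - self.start


def children(spans: List[Span], parent_id: str) -> List[Span]:
    """Return the spans directly nested under the given parent span."""
    return [s for s in spans if s.parent_id == parent_id]


# A single query traced as a root span with two child operations.
trace = [
    Span("a", "query", "CHAIN", 0.0, 2.0),
    Span("b", "retrieve", "RETRIEVER", 0.1, 0.5, parent_id="a"),
    Span("c", "llm", "LLM", 0.6, 1.9, parent_id="a"),
]

for span in children(trace, "a"):
    print(f"{span.name}: {span.duration:.1f}s")
```

This snippet is standalone and is not one of the tutorial's cells; it does not need Phoenix installed.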

Run the two code cells in the next section to launch Phoenix.

## Launch Phoenix to visualize the app

In [7]:
!pip install -qq "arize-phoenix[experimental,llama-index]" "openai>=1" gcsfs nest_asyncio

# Import Statements
import phoenix as px
import pandas as pd

import os
import requests
from typing import cast, List
from getpass import getpass
from gcsfs import GCSFileSystem
from tqdm import tqdm

from llama_index import (
    ServiceContext,
    StorageContext,
    load_index_from_storage,
)
from llama_index.embeddings import OpenAIEmbedding
from llama_index.graph_stores.simple import SimpleGraphStore
from llama_index.llms import OpenAI
from llama_index.callbacks import CallbackManager

from phoenix.trace.llama_index import (
    OpenInferenceTraceCallbackHandler,
)
from phoenix.trace.utils import json_lines_to_df
from phoenix import TraceDataset
from phoenix.trace import SpanEvaluations
from phoenix.trace import DocumentEvaluations



In [8]:
# Run to visualize the app using Phoenix Tracing & Evals

trace_jsonl_url = "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/context-retrieval/trace.jsonl"
hallucination_eval_url = "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/context-retrieval/hallucination_eval.parquet"
qa_correctness_eval_url = "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/context-retrieval/qa_correctness_eval.parquet"
retrieved_documents_eval_url = "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/context-retrieval/retrieved_documents_eval.parquet"

response = requests.get(trace_jsonl_url)

if response.status_code == 200:
    with open("trace.jsonl", "wb") as f:
        f.write(response.content)
    with open("trace.jsonl", "r") as f:
        json_lines = cast(List[str], f.readlines())
    trace_ds = TraceDataset(json_lines_to_df(json_lines))
    # Start the Phoenix session that the evaluations below are logged to
    px.launch_app(trace=trace_ds)
else:
    # Without the trace file there is nothing to visualize, so stop here
    raise RuntimeError(f"Failed to download the file. Status code: {response.status_code}")

hallucination_eval_df = pd.read_parquet(hallucination_eval_url)
qa_correctness_eval_df = pd.read_parquet(qa_correctness_eval_url)
retrieved_documents_eval_df = pd.read_parquet(retrieved_documents_eval_url)

px.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
)
px.log_evaluations(DocumentEvaluations(eval_name="Relevance", dataframe=retrieved_documents_eval_df))

px.active_session().view()


Sending Evaluations: 100%|██████████| 8/8 [00:00<00:00, 78.54it/s]
Sending Evaluations: 100%|██████████| 8/8 [00:00<00:00, 78.62it/s]


📺 Opening a view to the Phoenix app. The app is running at https://zbhi7ykgl2a22-496ff2e9c6d22116-6006-colab.googleusercontent.com/
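
The trace file downloaded above is in JSON Lines format: one JSON object per line. As a rough sketch of the general idea of turning such a file into a dataframe, the snippet below uses only the standard `json` module and pandas. The two records and their fields are invented for illustration, and this is not how `json_lines_to_df` is implemented; that helper may do additional, Phoenix-specific work.

```python
import json

import pandas as pd

# Two made-up span records, one JSON object per line (the JSON Lines format).
raw_lines = [
    '{"name": "query", "span_kind": "CHAIN", "context": {"span_id": "a"}}',
    '{"name": "retrieve", "span_kind": "RETRIEVER", "context": {"span_id": "b"}}',
]

# Parse each non-empty line independently, then flatten nested keys
# into dotted column names such as "context.span_id".
records = [json.loads(line) for line in raw_lines if line.strip()]
spans_df = pd.json_normalize(records)

print(spans_df[["name", "span_kind", "context.span_id"]])
```

Flattening nested keys with a dot separator yields column names in the same dotted style as the `context.span_id` index that appears in the evaluation tables later in this tutorial.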


## Learn how to Build the application and add Tracing

You can see the application's traces and evals above in Phoenix. The sections below show how the app was built with LlamaIndex and OpenAI; running them requires an OpenAI API key.

In [19]:
if (
    input("You can see the application's traces & evals above with Phoenix.\nIf you'd like to see how the app was built using LlamaIndex & OpenAI, type Y to continue. \nContinue [Y/n]?")
    .lower()
    .startswith("n")
):
    assert False, "notebook stopped"

KeyboardInterrupt: ignored

In [4]:
# Uses your OpenAI API Key
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

🔑 Enter your OpenAI API key: ··········


In [6]:
# Pulls down the Arize Product Documentation Knowledge Base
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
    graph_store=SimpleGraphStore(),  # prevents unauthorized request to GCS
)

# Initialize the LlamaIndex app and callback handler
callback_handler = OpenInferenceTraceCallbackHandler()

service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.0),
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002"),
    callback_manager=CallbackManager(handlers=[callback_handler])
)
index = load_index_from_storage(
    storage_context,
    service_context=service_context,
)
query_engine = index.as_query_engine()

# Asking the Application questions about the Arize product
queries = [
    "How can I query for a monitor's status using GraphQL?",
    "How do I delete a model?",
    "How much does an enterprise license of Arize cost?",
    "How do I log a prediction using the python SDK?",
]

for query in tqdm(queries):
    response = query_engine.query(query)
    print(f"Query: {query}")
    print(f"Response: {response}")

# Launch Phoenix with the spans collected by the callback handler
ds = TraceDataset.from_spans(list(callback_handler.get_spans()))
px.launch_app(trace=ds)

ERROR:gcsfs:_request non-retriable exception: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., 401
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gcsfs/retry.py", line 114, in retry_request
    return await func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/gcsfs/core.py", line 423, in _request
    validate_response(status, contents, path, args)
  File "/usr/local/lib/python3.10/dist-packages/gcsfs/retry.py", line 101, in validate_response
    raise HttpError(error)
gcsfs.retry.HttpError: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 'storage.objects.list' denied on resource (or it may not exist)., 401
ERROR:gcsfs:_request non-retriable exception: Anonymous caller does not have storage.objects.list access to the Google Cloud Storage bucket. Permission 

Query: How can I query for a monitor's status using GraphQL?
Response: You can query for a monitor's status using GraphQL by including the "status" field in your query.


 50%|█████     | 2/4 [00:03<00:03,  1.95s/it]ERROR:phoenix.trace.exporter:HTTPConnectionPool(host='127.0.0.1', port=6006): Max retries exceeded with url: /v1/spans (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5517cd9c60>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 791, in 

Query: How do I delete a model?
Response: To delete a model, you would need to access the model management or administration section of the platform or application you are using. Look for options or settings related to managing models, and from there, you should be able to find the option to delete a specific model. The exact steps may vary depending on the platform or application you are using.


 75%|███████▌  | 3/4 [00:05<00:02,  2.05s/it]ERROR:phoenix.trace.exporter:HTTPConnectionPool(host='127.0.0.1', port=6006): Max retries exceeded with url: /v1/spans (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f5517cd84c0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 791, in 

Query: How much does an enterprise license of Arize cost?
Response: I'm sorry, but I don't have access to pricing information for Arize. For detailed pricing information, I recommend reaching out to Arize's sales team at contacts@arize.com. They will be able to provide you with the most accurate and up-to-date information regarding the cost of an enterprise license.


100%|██████████| 4/4 [00:08<00:00,  2.24s/it]ERROR:phoenix.trace.exporter:HTTPConnectionPool(host='127.0.0.1', port=6006): Max retries exceeded with url: /v1/spans (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f550d7c2470>: Failed to establish a new connection: [Errno 111] Connection refused'))
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.10/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/urllib3/connectionpool.py", line 791, in 

Query: How do I log a prediction using the python SDK?
Response: To log a prediction using the Python SDK, you can use the `arize.log()` function. You need to provide the necessary parameters such as `prediction_id`, `model_id`, `model_type`, `environment`, `model_version`, `prediction_timestamp`, `features`, `prediction_label`, and `tags`. Additionally, you can also log the actual label by using the `arize.log()` function again and providing the `actual_label` parameter.
🌍 To view the Phoenix app in your browser, visit https://zbhi7ykgl2a21-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x7f5517ca23e0>

## Learn how to Evaluate the application using Phoenix LLM Evals

We now have visibility into the inner workings of our application. Next, let's take a look at how to use LLM evals to evaluate our application.

We will go through a few common evaluation metrics:

1. Hallucination Eval: Checks whether the application's response was a hallucination

2. Q&A Eval: Checks whether the application answered the question correctly

3. Document Relevance Eval: Grades whether the retrieved documents/chunks were actually relevant to answering the query
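
Each of these evals produces a text label per row, and the cells below turn that label into a 0/1 score so it can be aggregated. The following is a small sketch of that conversion on made-up data, assuming a `label` column and a `context.span_id` index like the ones in the results shown later. Rows without a label are left unscored rather than counted as failures.

```python
import pandas as pd

# Made-up classifier output; None marks a row with no usable label.
evals = pd.DataFrame(
    {"label": ["factual", "hallucinated", None, "factual"]},
    index=pd.Index(["s1", "s2", "s3", "s4"], name="context.span_id"),
)

# Score only the labeled rows. Assigning the shorter Series back aligns on
# the index, so the unlabeled row ends up as NaN instead of 0.
labeled = evals["label"].dropna()
evals["score"] = (labeled == "factual").astype(int)

print(evals)
print("Share of factual answers:", evals["score"].mean())
```

Because the mean skips missing values, the aggregate only reflects rows that were actually graded.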

In [9]:
# Convert traces into workable datasets

spans_df = px.active_session().get_spans_dataframe()
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()

from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.active_session())
queries_df = get_qa_with_reference(px.active_session())

In [10]:
# Generating the Hallucination & Q&A Eval

import nest_asyncio
from phoenix.experimental.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

nest_asyncio.apply()  # Allows nested event loops so llm_classify can run concurrently in a notebook

# Creating Hallucination Eval which checks if the application hallucinated
hallucination_eval = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel("gpt-4", temperature=0.0),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # Makes the LLM explain its reasoning
    concurrency=4,
)
hallucination_eval["score"] = (
    hallucination_eval.label[~hallucination_eval.label.isna()] == "factual"
).astype(int)

# Creating Q&A Eval which checks if the application answered the question correctly
qa_correctness_eval = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel("gpt-4", temperature=0.0),
    template=QA_PROMPT_TEMPLATE,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,  # Makes the LLM explain its reasoning
    concurrency=4,
)

qa_correctness_eval["score"] = (
    qa_correctness_eval.label[~qa_correctness_eval.label.isna()] == "correct"
).astype(int)

# Logs the Evaluations to Phoenix
px.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval),
)

llm_classify |          | 0/4 (0.0%) | ⏳ 00:00<? | ?it/s

llm_classify |          | 0/4 (0.0%) | ⏳ 00:00<? | ?it/s

In [11]:
hallucination_eval.head()

| context.span_id | label | explanation | score |
| --- | --- | --- | --- |
| c430c6cc-1027-4a8f-a31b-280e0b285dbc | factual | The answer provided is in line with the inform... | 1 |
| 23a8a063-124f-4f13-a030-bb61d9d7f8f7 | factual | The query asks for the cost of an enterprise l... | 1 |
| f31c335f-9b7f-4ed7-96eb-ace33a4df064 | hallucinated | The reference text provides information about ... | 0 |
| bce5b9ae-4587-4ead-9ccc-de3fe29257bc | factual | The reference text provides a GraphQL query ex... | 1 |


In [12]:
qa_correctness_eval.head()

| context.span_id | label | explanation | score |
| --- | --- | --- | --- |
| c430c6cc-1027-4a8f-a31b-280e0b285dbc | correct | The answer correctly explains how to log a pre... | 1 |
| 23a8a063-124f-4f13-a030-bb61d9d7f8f7 | correct | The question asks for the cost of an enterpris... | 1 |
| f31c335f-9b7f-4ed7-96eb-ace33a4df064 | incorrect | The reference text does not provide any inform... | 0 |
| bce5b9ae-4587-4ead-9ccc-de3fe29257bc | correct | The reference text provides a GraphQL query ex... | 1 |


As you can see from the results, one of the queries was flagged as a hallucination.
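
With only four queries the flagged row is easy to spot by eye, but on larger runs it helps to combine the evals programmatically. The sketch below uses made-up span IDs and assumes both eval dataframes share a `context.span_id` index and have `label` and `score` columns, as in the results above. It joins them and keeps the spans that failed at least one eval.

```python
import pandas as pd

# Made-up eval results keyed by span ID.
index = pd.Index(["s1", "s2", "s3"], name="context.span_id")
hallucination = pd.DataFrame(
    {"label": ["factual", "hallucinated", "factual"], "score": [1, 0, 1]},
    index=index,
)
qa_correctness = pd.DataFrame(
    {"label": ["correct", "incorrect", "correct"], "score": [1, 0, 1]},
    index=index,
)

# Join on the shared span ID; the suffixes keep the overlapping columns apart.
combined = hallucination.join(
    qa_correctness, lsuffix="_hallucination", rsuffix="_qa"
)

# Keep spans that failed at least one eval.
flagged = combined[
    (combined["score_hallucination"] == 0) | (combined["score_qa"] == 0)
]
print(flagged)
```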

We can use Retrieval Relevance Evals to identify if these issues are caused by the retrieval process for RAG. We are going to use an LLM to grade whether or not the chunks retrieved are relevant to the query.

In [15]:
# Generating Retrieval Relevance Eval

from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

retrieved_documents_eval = llm_classify(
    dataframe=retrieved_documents_df,
    model=OpenAIModel("gpt-4", temperature=0.0),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)

retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()] == "relevant"
).astype(int)

px.log_evaluations(DocumentEvaluations(eval_name="Relevance", dataframe=retrieved_documents_eval))

llm_classify |          | 0/8 (0.0%) | ⏳ 00:00<? | ?it/s

In [16]:
retrieved_documents_eval.head()

| context.span_id | document_position | label | explanation | score |
| --- | --- | --- | --- | --- |
| 39fbed72-75ac-4f5f-a64b-e90e1dbee201 | 0 | relevant | The question asks about how to log a predictio... | 1 |
| 39fbed72-75ac-4f5f-a64b-e90e1dbee201 | 1 | relevant | The question asks about how to log a predictio... | 1 |
| 6839a9f7-7c99-497e-a3f0-bb1e33c451c1 | 0 | irrelevant | The question is asking for the cost of an ente... | 0 |
| 6839a9f7-7c99-497e-a3f0-bb1e33c451c1 | 1 | irrelevant | The question is asking for the cost of an ente... | 0 |
| a71236bc-6f1f-46c6-b799-4163048c8c51 | 0 | irrelevant | The question is asking for instructions on how... | 0 |


Looks like we are getting a lot of irrelevant chunks of text that might be polluting the prompt sent to the LLM.

If we visit the UI again, we will see that Phoenix has aggregated retrieval metrics (`precision`, `ndcg`, and `hit`). We see that our hallucinations and incorrect answers directly correlate with bad retrieval!
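
For intuition about what these three metrics measure, here is a sketch using common textbook definitions, applied to the 0/1 relevance scores of one query's retrieved documents in rank order. The function names are illustrative, and Phoenix's exact formulas (for example its choice of `k`, or how it builds the ideal ranking) may differ. In this sketch the ideal ranking is built only from the retrieved documents.

```python
import math
from typing import List


def precision_at_k(relevance: List[int], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0


def hit(relevance: List[int], k: int) -> int:
    """1 if at least one of the top-k documents is relevant, else 0."""
    return int(any(relevance[:k]))


def ndcg_at_k(relevance: List[int], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""

    def dcg(scores: List[int]) -> float:
        # Documents further down the ranking are discounted logarithmically.
        return sum(s / math.log2(rank + 2) for rank, s in enumerate(scores))

    ideal = dcg(sorted(relevance, reverse=True)[:k])
    return dcg(relevance[:k]) / ideal if ideal > 0 else 0.0


# Relevance scores in retrieval order (document_position 0, 1, ...).
ranked = [0, 1]
print(precision_at_k(ranked, 2), hit(ranked, 2), round(ndcg_at_k(ranked, 2), 3))
```

A query whose retrieved chunks are all irrelevant scores 0 on all three metrics, which is the pattern to look for next to a hallucinated or incorrect answer.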

In [18]:
print("The Phoenix UI:", px.active_session().url)

The Phoenix UI: https://zbhi7ykgl2a16-496ff2e9c6d22116-6006-colab.googleusercontent.com/
