<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Tracing and Evaluating a LlamaIndex Application</h1>

LlamaIndex provides high-level APIs that enable users to build powerful applications in a few lines of code. However, it can be challenging to understand what is going on under the hood and to pinpoint the cause of issues. Phoenix makes your LLM applications *observable* by visualizing the underlying structure of each call to your query engine and surfacing problematic "spans" of execution based on latency, token count, or other evaluation metrics.

In this tutorial, you will:
- Build a simple query engine using LlamaIndex that uses retrieval-augmented generation to answer questions over the Arize documentation,
- Record trace data in OpenInference format,
- Inspect the traces and spans of your application to identify sources of latency and cost,
- Export your trace data as a pandas dataframe and run an LLM-assisted evaluation to measure the precision@k of your retrieval step.

ℹ️ This notebook requires an OpenAI API key.

## 1. Install Dependencies and Import Libraries

Install Phoenix, LlamaIndex, and OpenAI.

In [None]:
!pip install -qq "arize-phoenix[experimental]==0.0.33rc9" gcsfs llama-index openai

Import libraries.

In [None]:
import json
import os
from getpass import getpass
from urllib.request import urlopen

import openai
import pandas as pd
import phoenix as px
from gcsfs import GCSFileSystem
from llama_index import ServiceContext, StorageContext, load_index_from_storage
from llama_index.callbacks import CallbackManager
from llama_index.embeddings import OpenAIEmbedding
from llama_index.graph_stores.simple import SimpleGraphStore
from llama_index.llms import OpenAI
from phoenix.experimental.callbacks.llama_index_trace_callback_handler import (
    OpenInferenceTraceCallbackHandler,
)
from phoenix.experimental.evals import (
    compute_precisions_at_k,
    run_relevance_eval,
)
from tqdm import tqdm

pd.set_option("display.max_colwidth", 1000)

## 2. Launch Phoenix

You can run Phoenix in the background to collect trace data emitted by any LlamaIndex application that has been instrumented with the `OpenInferenceTraceCallbackHandler`.

Launch Phoenix and follow the instructions in the cell output to open the Phoenix UI (the UI should be empty because we have yet to run a LlamaIndex application).

In [None]:
px.launch_app()

## 3. Configure Your OpenAI API Key

Set your OpenAI API key if it is not already set as an environment variable.

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

## 4. Build Your LlamaIndex Application

This example uses a `RetrieverQueryEngine` over a pre-built index of the Arize documentation, but you can use whatever LlamaIndex application you like.

Download your pre-built index from cloud storage and instantiate your storage context.

In [None]:
file_system = GCSFileSystem(project="public-assets-275721")
index_path = "arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/"
storage_context = StorageContext.from_defaults(
    fs=file_system,
    persist_dir=index_path,
    graph_store=SimpleGraphStore(),  # prevents unauthorized request to GCS
)

Instantiate an `OpenInferenceTraceCallbackHandler` to store your data in [OpenInference format](https://github.com/Arize-ai/open-inference-spec). OpenInference is an open standard for capturing and storing LLM application traces that enables production LLM app servers to seamlessly integrate with LLM observability solutions such as Phoenix.

In [None]:
callback_handler = OpenInferenceTraceCallbackHandler()
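
To build intuition for what a trace callback handler records, the sketch below uses a toy tracer that captures nested spans with their parent and latency. `ToySpan` and `ToyTracer` are hypothetical names invented for illustration. They are not part of Phoenix or LlamaIndex and do not follow the OpenInference schema; the real handler gathers comparable information through LlamaIndex callbacks.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class ToySpan:
    """Hypothetical span record (illustrative only, not the OpenInference schema)."""

    name: str
    parent: Optional[str]
    start: float
    end: float = 0.0

    @property
    def latency(self) -> float:
        # Wall-clock duration of the span in seconds.
        return self.end - self.start


class ToyTracer:
    """Hypothetical tracer that records spans in the order they finish."""

    def __init__(self) -> None:
        self.spans: List[ToySpan] = []
        self._stack: List[ToySpan] = []

    @contextmanager
    def span(self, name: str) -> Iterator[ToySpan]:
        # The span currently on top of the stack is the parent of the new span.
        parent = self._stack[-1].name if self._stack else None
        current = ToySpan(name=name, parent=parent, start=time.perf_counter())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)


tracer = ToyTracer()
with tracer.span("query"):
    with tracer.span("retrieve"):
        pass  # a real retriever would embed the query and fetch documents here
    with tracer.span("llm"):
        pass  # a real LLM call would generate the answer here

# Child spans finish before their parent, so the root span is recorded last.
span_summary = [(s.name, s.parent) for s in tracer.spans]
```

In this toy model, a trace is the tree of spans rooted at `"query"`, and a slow step would show up as a child span with a large `latency`.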

Instantiate your query engine and attach the callback handler.

In [None]:
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo", temperature=0.0),
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002"),
    callback_manager=CallbackManager(handlers=[callback_handler]),
)
index = load_index_from_storage(
    storage_context,
    service_context=service_context,
)
query_engine = index.as_query_engine()

## 5. Run Your Query Engine and View Your Traces in Phoenix

Download a sample of queries commonly asked of the Arize documentation.

In [None]:
queries_url = "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/context-retrieval/arize_docs_queries.jsonl"
queries = []
with urlopen(queries_url) as response:
    for line in response:
        line = line.decode("utf-8").strip()
        data = json.loads(line)
        queries.append(data["query"])
queries[:10]
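
The download step above reads a JSON Lines file, where each line is a standalone JSON object. As a minimal sketch, the same parsing logic can be wrapped in a helper and tried on an in-memory stream. `read_queries` and the two sample queries are made up for illustration, and skipping blank lines is an added assumption about how malformed input should be handled.

```python
import io
import json
from typing import IO, List


def read_queries(stream: IO[bytes], field: str = "query") -> List[str]:
    """Parse a JSON Lines byte stream and collect one field from each record."""
    results = []
    for raw_line in stream:
        line = raw_line.decode("utf-8").strip()
        if not line:
            continue  # tolerate blank lines (an assumption; the cell above does not)
        results.append(json.loads(line)[field])
    return results


# Hypothetical sample data standing in for the downloaded file.
sample = io.BytesIO(
    b'{"query": "How do I log a model?"}\n'
    b"\n"
    b'{"query": "What is drift?"}\n'
)
parsed = read_queries(sample)
```

Because the object returned by `urlopen` is also iterable line by line as bytes, the same helper could be passed that response directly.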

Run a few queries.

In [None]:
for query in tqdm(queries[:10]):
    query_engine.query(query)

Check the Phoenix UI as your queries run. Your traces should appear in real time.

## 6. Export and Evaluate Your Trace Data

You can export your trace data as a pandas dataframe for further analysis and evaluation.

In this case, we will export our retriever spans and use an LLM to judge each retrieved document, which gives an LLM-assisted precision@k for every retrieval.

In [None]:
trace_df = px.active_session().get_span_dataframe('span_kind == "RETRIEVER"')
trace_df

Evaluate your retrieval spans and surface problematic spans:

- Make LLM calls to classify each retrieved document as relevant or irrelevant to the corresponding query,
- Compute the precision@k for k = 1, 2 for each retrieval span,
- Sort your spans by precision@2 to surface the most problematic spans.


In [None]:
trace_df["llm_assisted_relevance"] = run_relevance_eval(trace_df)
trace_df["llm_assisted_precision_at_k"] = trace_df["llm_assisted_relevance"].map(
    lambda x: compute_precisions_at_k(x) if x else float("nan")
)
trace_df = trace_df.sort_values(
    by="llm_assisted_precision_at_k",
    key=lambda col: col.map(lambda x: x[-1] if isinstance(x, list) else 0.0),
    ascending=True,
)
trace_df[
    [
        "attributes.input.value",
        "attributes.retrieval.documents",
        "llm_assisted_relevance",
        "llm_assisted_precision_at_k",
    ]
]
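
To make the metric concrete, here is a minimal sketch of precision@k under its standard definition: the fraction of the top k retrieved documents that are relevant. `precisions_at_k_sketch` is a hypothetical stand-in written for illustration. Whether Phoenix's `compute_precisions_at_k` handles edge cases the same way is an assumption, and the span IDs and relevance labels below are invented.

```python
from typing import Dict, List, Sequence


def precisions_at_k_sketch(relevance: Sequence[bool]) -> List[float]:
    """Return [precision@1, precision@2, ...] for a ranked list of relevance labels."""
    precisions = []
    relevant_so_far = 0
    for k, is_relevant in enumerate(relevance, start=1):
        relevant_so_far += int(is_relevant)
        precisions.append(relevant_so_far / k)
    return precisions


# Hypothetical relevance labels for three retrieval spans, two documents each.
toy_relevance: Dict[str, List[bool]] = {
    "span-a": [True, True],
    "span-b": [False, True],
    "span-c": [False, False],
}
scores = {span_id: precisions_at_k_sketch(labels) for span_id, labels in toy_relevance.items()}

# Sort ascending by the last entry (precision@2) so the worst retrievals come first.
ranked = sorted(scores, key=lambda span_id: scores[span_id][-1])
```

Sorting on the last element mirrors the `key` function in the cell above, which ranks spans by precision at the largest available k.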

ℹ️ Check back soon for more evals, improved ergonomics, and the ability to view your metrics and surface problematic traces and spans inside Phoenix.