<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluating and Improving a LlamaIndex Semantic Retrieval Application</h1>

Imagine you're an engineer at Arize AI and you've built and deployed a documentation question-answering service using LlamaIndex. Users send questions about Arize's core product via a chat interface, and your service retrieves documents from your documentation in order to generate a response to the user. As the engineer in charge of evaluating and maintaining this system, you want to evaluate the quality of the responses from your service.

Phoenix helps you:
- identify gaps in your documentation
- detect queries for which the LLM gave bad responses
- detect failures to retrieve relevant context

In this tutorial, you will:

- Download a pre-indexed knowledge base of the Arize documentation and run a LlamaIndex application
- Visualize user queries and knowledge base documents to identify areas of user interest not answered by your documentation
- Find clusters of responses with negative user feedback
- Identify failed retrievals using cosine similarity, Euclidean distance, and LLM-assisted ranking metrics

⚠️ Parts of this notebook require an [OpenAI API key](https://platform.openai.com/account/api-keys) to run.

Let's get started!

## 1. Install Dependencies and Import Libraries

Install Phoenix and LlamaIndex.

In [None]:
!pip install -q arize-phoenix llama-index

Import libraries.

In [None]:
from functools import reduce
import hashlib
import json
import logging
import os
import sys
import tempfile
import textwrap
from tqdm import tqdm
from typing import Dict, List, Tuple
import urllib.request
import zipfile

from langchain.chat_models import ChatOpenAI
from llama_index import StorageContext, load_index_from_storage
from llama_index.embeddings.base import BaseEmbedding
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.indices.query.schema import QueryBundle
from llama_index.query_engine.retriever_query_engine import RetrieverQueryEngine
from llama_index.response.schema import Response
from llama_index import ServiceContext, LLMPredictor
import numpy as np
import numpy.typing as npt
import openai
import pandas as pd
import phoenix as px
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)

pd.set_option("display.max_colwidth", None)

## 2. Configure Your OpenAI API Key (Optional)

⚠ This section is optional. An OpenAI API key is required only to run your question-answering service (step 4) and to evaluate your responses with LLM-assisted evaluations (step 7). You can skip this cell if the `OPENAI_API_KEY` environment variable is already set in your notebook environment.

In [None]:
openai_api_key = "copy paste your api key here"
assert openai_api_key != "copy paste your api key here", "❌ Please set your OpenAI API key"
os.environ["OPENAI_API_KEY"] = openai_api_key
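
If you would rather not paste the key into the notebook, one option is to prefer the environment variable and only fall back to an explicit value. The following is a minimal sketch; `resolve_api_key` is a hypothetical helper introduced here for illustration and is not part of Phoenix, LlamaIndex, or the OpenAI client.

```python
from typing import Mapping, Optional


def resolve_api_key(env: Mapping[str, str], fallback: Optional[str] = None) -> str:
    """Return OPENAI_API_KEY from `env` if set, otherwise the fallback value."""
    key = env.get("OPENAI_API_KEY") or fallback
    if not key:
        raise ValueError("Set OPENAI_API_KEY or pass a fallback key")
    return key


# Stand-in environment; the value below is a placeholder, not a real key.
print(resolve_api_key({"OPENAI_API_KEY": "sk-placeholder"}))
```

In the notebook you would pass `os.environ` as `env`.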

## 3. Download Your Knowledge Base

Download and unzip a pre-built knowledge base index consisting of chunks of the Arize documentation.

In [None]:
def download_file(url: str, output_path: str) -> None:
    """
    Downloads a file from the specified URL and saves to a local path.
    """
    urllib.request.urlretrieve(url, output_path)


def unzip_directory(zip_path: str, output_path: str) -> None:
    """
    Unzips a directory to a specified output path.
    """
    with zipfile.ZipFile(zip_path, "r") as f:
        f.extractall(output_path)


print("⏳ Downloading knowledge base...")
data_dir = tempfile.gettempdir()
zip_file_path = os.path.join(data_dir, "index.zip")
download_file(
    url="http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index.zip",
    output_path=zip_file_path,
)

print("⏳ Unzipping knowledge base...")
index_dir = os.path.join(data_dir, "index")
unzip_directory(zip_file_path, index_dir)

print("✅ Done")

## 4. Run Your Question-Answering Service (Optional)

⚠ This section is optional and requires that you previously configured your OpenAI API key in step 2.

Start a LlamaIndex application from your downloaded index.

In [None]:
# configure the embedding model
embedding_model_name = "text-embedding-ada-002"
embedding_model = OpenAIEmbedding(model=embedding_model_name)

# configure the model to synthesize the final response after context retrieval
service_context_model_name = "gpt-3.5-turbo"
llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name=service_context_model_name, temperature=0))
service_context = ServiceContext.from_defaults(
    llm_predictor=llm_predictor,
)

# load the index from disk
storage_context = StorageContext.from_defaults(
    persist_dir=index_dir,
)
index = load_index_from_storage(
    storage_context,
    service_context=service_context,
)

# instantiate a query engine
query_engine = index.as_query_engine()

Define functions to run queries and display the results.

In [None]:
def get_response_and_query_embedding(
    query_engine: RetrieverQueryEngine, query: str, embedding_model: BaseEmbedding
) -> Tuple[Response, List[float]]:
    """
    Queries the query engine and returns the response and query embedding used
    to retrieve context from the database.
    """
    query_embedding = embedding_model.get_text_embedding(query)
    query_bundle = QueryBundle(query, embedding=query_embedding)
    response = query_engine.query(query_bundle)
    return response, query_embedding


def display_llama_index_response(response: Response) -> None:
    """
    Displays a LlamaIndex response and its retrieved context.
    """

    print("Response")
    print("========")
    for line in textwrap.wrap(response.response.strip(), width=80):
        print(line)
    print()

    print("Retrieved Context")
    print("=================")
    print()

    for source_node in response.source_nodes:
        print(f"doc_id: {source_node.node.doc_id}")
        print(f"score: {source_node.score}")
        print()
        for line in textwrap.wrap(source_node.node.text, width=80):
            print(line)
        print()

Ask a question of your question-answering service. View the response from your service in addition to the retrieved context from your knowledge base (the current LlamaIndex application is configured to retrieve the two most similar entries to the query by cosine similarity).

In [None]:
query = "What's the difference between primary and baseline datasets?"
# query = "How do I send in extra metadata with each record?"
# query = "How does Arize's surrogate explainability model work?"
# run the query once, capturing the query embedding used for retrieval
response, query_embedding = get_response_and_query_embedding(
    query_engine,
    query,
    embedding_model,
)

display_llama_index_response(response)
print("Embedding Dimension")
print("===================")
print(len(query_embedding))
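
To make "the two most similar entries by cosine similarity" concrete, here is a minimal NumPy sketch of top-k retrieval over toy vectors. It illustrates the idea only and is not LlamaIndex's internal implementation; the function name and the toy data are assumptions made for this example.

```python
import numpy as np


def top_k_by_cosine_similarity(query, documents, k=2):
    """Return indices and scores of the k documents most similar to the query."""
    # Normalize so that the dot product equals the cosine similarity.
    query = query / np.linalg.norm(query)
    documents = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    similarities = documents @ query
    # Sort in descending order of similarity and keep the first k.
    top_indices = np.argsort(-similarities)[:k]
    return top_indices, similarities[top_indices]


toy_documents = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
toy_query = np.array([1.0, 0.1])
indices, scores = top_k_by_cosine_similarity(toy_query, toy_documents, k=2)
print(indices, scores)
```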

## 5. Load Your Data Into Pandas Dataframes

To use Phoenix, you must load your data into Pandas dataframes. First, load your knowledge base into a dataframe.

In [None]:
def load_llama_index_database_into_dataframe(docstore, vector_store) -> pd.DataFrame:
    """
    Loads LlamaIndex data into a Pandas dataframe.
    """
    text_list = []
    embeddings_list = []
    for doc_id in docstore["docstore/data"]:
        text_list.append(docstore["docstore/data"][doc_id]["__data__"]["text"])
        embeddings_list.append(np.array(vector_store["embedding_dict"][doc_id]))
    return pd.DataFrame(
        {
            "text": text_list,
            "text_vector": embeddings_list,
        }
    )


with open(os.path.join(index_dir, "docstore.json")) as f:
    docstore = json.load(f)
with open(os.path.join(index_dir, "vector_store.json")) as f:
    vector_store = json.load(f)

database_df = load_llama_index_database_into_dataframe(docstore, vector_store).drop_duplicates(
    subset=["text"]
)
database_df.head()

The columns of your dataframe are:
- **text:** the chunked text in your knowledge base
- **text_vector:** the embedding vector for the text, computed during the LlamaIndex build using "text-embedding-ada-002" from OpenAI

Next, download a dataframe containing query data.

In [None]:
query_df = (
    pd.read_parquet(
        "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/retrievals_with_user_feedback.parquet"
    )
    .rename(columns={"query_text": "text", "query_embedding": "text_vector"})
    .drop(columns=["context_doc_id_0", "context_doc_id_1"])
)
query_df.head()

The columns of the dataframe are:
- **text:** the query text
- **text_vector:** the embedding representation of the query, captured from LlamaIndex at query time
- **response:** the final response from the LlamaIndex application
- **context_text_0:** the first retrieved context from the knowledge base
- **context_similarity_0:** the cosine similarity between the query and the first retrieved context
- **context_text_1:** the second retrieved context from the knowledge base
- **context_similarity_1:** the cosine similarity between the query and the second retrieved context
- **user_feedback:** approval or rejection from the user (-1 means thumbs down, +1 means thumbs up)

The query and database datasets are drawn from different distributions; the queries are short questions while the database entries are several sentences to a paragraph. The embeddings from OpenAI's "text-embedding-ada-002" capture these differences and naturally separate the query and context embeddings into distinct regions of the embedding space. When using Phoenix, you want to "overlay" the query and context embedding distributions so that queries appear close to their retrieved context in the Phoenix point cloud. To achieve this, we compute a centroid for each dataset that represents an average point in the embedding distribution and center the two distributions so they overlap.

In [None]:
database_centroid = database_df["text_vector"].mean()
database_df["centered_text_vector"] = database_df["text_vector"].apply(
    lambda x: x - database_centroid
)
query_centroid = query_df["text_vector"].mean()
query_df["centered_text_vector"] = query_df["text_vector"].apply(lambda x: x - query_centroid)
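
The effect of centering is easier to see on synthetic data. The sketch below uses two made-up 2-D clusters (not the real embeddings) whose centroids start far apart and coincide after each cluster's centroid is subtracted.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two toy clusters standing in for query and document embeddings.
toy_queries = rng.normal(loc=[5.0, 0.0], scale=0.1, size=(100, 2))
toy_documents = rng.normal(loc=[0.0, 5.0], scale=0.1, size=(100, 2))

# Distance between the two centroids before centering.
gap_before = np.linalg.norm(toy_queries.mean(axis=0) - toy_documents.mean(axis=0))

# Subtract each cluster's own centroid so both are centered at the origin.
centered_queries = toy_queries - toy_queries.mean(axis=0)
centered_documents = toy_documents - toy_documents.mean(axis=0)
gap_after = np.linalg.norm(centered_queries.mean(axis=0) - centered_documents.mean(axis=0))

print(gap_before, gap_after)
```

Centering shifts each distribution as a whole and leaves the distances between points within the same dataset unchanged.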

## 6. Compute Proxy Metrics for Retrieval Quality

Cosine similarity and Euclidean distance can act as proxies for retrieval quality. The cosine similarity between query and retrieved context was computed at query time and is part of the query dataframe downloaded above. Compute the Euclidean distance between each query embedding and retrieved context embedding and add corresponding columns to the query dataframe.

In [None]:
def compute_euclidean_distance(
    vector0: npt.NDArray[np.float32], vector1: npt.NDArray[np.float32]
) -> float:
    """
    Computes the Euclidean distance between two vectors.
    """
    return float(np.linalg.norm(vector0 - vector1))


num_retrieved_documents = 2
for context_index in range(num_retrieved_documents):
    euclidean_distances = []
    for _, row in query_df.iterrows():
        query_embedding = row["text_vector"]
        context_text = row[f"context_text_{context_index}"]
        database_row = database_df[database_df["text"] == context_text].iloc[0]
        database_embedding = database_row["text_vector"]
        euclidean_distance = compute_euclidean_distance(query_embedding, database_embedding)
        euclidean_distances.append(euclidean_distance)
    query_df[f"euclidean_distance_{context_index}"] = euclidean_distances
query_df.head()
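
The two proxies are closely related. For unit-length vectors, the squared Euclidean distance equals 2 − 2 × cosine similarity, so both metrics rank retrievals in the same order. The sketch below checks this identity on random unit vectors; whether your own embeddings are unit length is an assumption you should verify.

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=8)
v = rng.normal(size=8)
# Scale both vectors to unit length.
u /= np.linalg.norm(u)
v /= np.linalg.norm(v)

cosine_similarity = float(u @ v)
euclidean_distance = float(np.linalg.norm(u - v))

# For unit vectors: ||u - v||^2 = 2 - 2 * cos(u, v)
print(euclidean_distance**2, 2 - 2 * cosine_similarity)
```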

## 7. Run LLM-Assisted Evaluations

Cosine similarity and Euclidean distance are reasonable proxies for retrieval quality, but they don't always work perfectly. A novel idea is to use LLMs to measure retrieval quality by simply asking the LLM whether each piece of context is relevant to the corresponding query.

Use OpenAI to predict whether each retrieved document is relevant or irrelevant to the query.

⚠️ This cell requires that you configured your OpenAI API key in step 2.

In [None]:
EVALUATION_SYSTEM_MESSAGE = "You will be given a query and a reference text. You must determine whether the reference text contains an answer to the input query. Your response must be binary (0 or 1) and should not contain any text or characters aside from 0 or 1. 0 means that the reference text does not contain an answer to the query. 1 means the reference text contains an answer to the query."
QUERY_CONTEXT_PROMPT_TEMPLATE = """# Query: {query}

# Reference: {reference}

# Binary: """


@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def evaluate_query_and_retrieved_context(query: str, context: str, model_name: str) -> str:
    prompt = QUERY_CONTEXT_PROMPT_TEMPLATE.format(
        query=query,
        reference=context,
    )
    response = openai.ChatCompletion.create(
        messages=[
            {"role": "system", "content": EVALUATION_SYSTEM_MESSAGE},
            {"role": "user", "content": prompt},
        ],
        model=model_name,
    )
    return response["choices"][0]["message"]["content"]


def evaluate_retrievals(
    retrievals_data: Dict[str, str],
    model_name: str,
) -> List[str]:
    responses = []
    for query, retrieved_context in tqdm(retrievals_data.items()):
        response = evaluate_query_and_retrieved_context(query, retrieved_context, model_name)
        responses.append(response)
    return responses


def process_binary_responses(
    binary_responses: List[str], binary_to_string_map: Dict[int, str]
) -> List[str]:
    """
    Parses binary responses and converts them to the desired format.
    The binary_to_string_map parameter should be a dictionary mapping
    binary values (0 or 1) to the desired string values
    (e.g. "irrelevant" or "relevant"). Unparseable responses map to None.
    """
    processed_responses = []
    for binary_response in binary_responses:
        try:
            binary_value = int(binary_response.strip())
            processed_response = binary_to_string_map[binary_value]
        except (ValueError, KeyError):
            processed_response = None
        processed_responses.append(processed_response)
    return processed_responses


sample_query_df = query_df.head(10).copy()
evaluation_model_name = "gpt-3.5-turbo"
for context_index in range(num_retrieved_documents):
    retrievals_data = {
        row["text"]: row[f"context_text_{context_index}"] for _, row in sample_query_df.iterrows()
    }
    raw_responses = evaluate_retrievals(retrievals_data, evaluation_model_name)
    processed_responses = process_binary_responses(raw_responses, {0: "irrelevant", 1: "relevant"})
    sample_query_df[f"openai_relevance_{context_index}"] = processed_responses
sample_query_df[
    ["text", "context_text_0", "openai_relevance_0", "context_text_1", "openai_relevance_1"]
].head()
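
LLMs do not always follow formatting instructions exactly, so a response such as `1.` would be treated as unparseable by a strict integer conversion. The following is a sketch of a slightly more tolerant parser; `parse_binary_label` is a hypothetical helper for illustration, and the choice to accept a trailing period is an assumption rather than something the evaluation above relies on.

```python
import re
from typing import Optional


def parse_binary_label(raw: str) -> Optional[str]:
    """Map a raw LLM response to "relevant"/"irrelevant", or None if unparseable."""
    # Accept a single 0 or 1 with optional surrounding whitespace and trailing period.
    match = re.fullmatch(r"\s*([01])\s*\.?\s*", raw)
    if match is None:
        return None
    return "relevant" if match.group(1) == "1" else "irrelevant"


for raw in ["1", " 0\n", "1.", "yes", "10"]:
    print(repr(raw), "->", parse_binary_label(raw))
```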

Running evaluations across the entire dataset takes a while, so download a dataset of pre-computed evaluations and add them to the query dataframe.

In [None]:
openai_evaluations_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/evaluations.parquet"
)[["text", "relevance_0", "relevance_1"]]
openai_evaluations_df = openai_evaluations_df.rename(
    columns={"relevance_0": "openai_relevance_0", "relevance_1": "openai_relevance_1"}
)
query_df = pd.merge(query_df, openai_evaluations_df, on="text")
query_df[["text", "context_text_0", "context_text_1", "openai_relevance_0", "openai_relevance_1"]]
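
Note that `pd.merge` performs an inner join by default, so queries without a matching evaluation are dropped without warning, and duplicated keys multiply rows. The sketch below, on toy dataframes that are not the real data, shows one way to surface both problems using the `validate` and `indicator` parameters.

```python
import pandas as pd

toy_queries = pd.DataFrame({"text": ["q1", "q2", "q3"], "response": ["a", "b", "c"]})
toy_evaluations = pd.DataFrame(
    {"text": ["q1", "q2"], "openai_relevance_0": ["relevant", "irrelevant"]}
)

merged = pd.merge(
    toy_queries,
    toy_evaluations,
    on="text",
    how="left",  # keep every query, even without an evaluation
    validate="one_to_one",  # raise if "text" is duplicated on either side
    indicator=True,  # add a "_merge" column describing each row's origin
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(merged), len(unmatched))
```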

For comparison, we've also run evaluations using Google PaLM 2. Download those pre-computed evaluations and add them to the query dataframe.

In [None]:
palm2_evaluations_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/palm2_evaluations.parquet"
)[["text", "relevance_0", "relevance_1"]]
palm2_evaluations_df = palm2_evaluations_df.rename(
    columns={"relevance_0": "palm2_relevance_0", "relevance_1": "palm2_relevance_1"}
)
query_df = pd.merge(query_df, palm2_evaluations_df, on="text")
query_df[["text", "context_text_0", "context_text_1", "palm2_relevance_0", "palm2_relevance_1"]]

Compute the fraction of retrievals on which the two LLMs agree.

In [None]:
num_equal = 0
for context_index in range(num_retrieved_documents):
    equal_relevance_mask = (
        query_df[f"openai_relevance_{context_index}"]
        == query_df[f"palm2_relevance_{context_index}"]
    )
    num_equal += equal_relevance_mask.sum()
percent_agreeing = num_equal / (len(query_df) * num_retrieved_documents)
percent_agreeing
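
The same agreement rate can be computed without an explicit loop. This is a sketch on a small made-up dataframe with the same column naming as above; the labels are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "openai_relevance_0": ["relevant", "irrelevant", "relevant"],
        "palm2_relevance_0": ["relevant", "relevant", "relevant"],
        "openai_relevance_1": ["irrelevant", "irrelevant", "relevant"],
        "palm2_relevance_1": ["irrelevant", "irrelevant", "irrelevant"],
    }
)
# Stack the per-position labels into arrays of shape (num_queries, num_retrieved).
openai_labels = toy[["openai_relevance_0", "openai_relevance_1"]].to_numpy()
palm2_labels = toy[["palm2_relevance_0", "palm2_relevance_1"]].to_numpy()

# Fraction of (query, retrieved document) pairs with identical labels.
agreement = (openai_labels == palm2_labels).mean()
print(agreement)
```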

You can see that for the vast majority of cases, the two LLMs agree. View the few examples where they disagree in the cell below.

In [None]:
retrievals_df = pd.concat(
    [
        query_df[
            [
                "text",
                f"context_text_{context_index}",
                f"openai_relevance_{context_index}",
                f"palm2_relevance_{context_index}",
            ]
        ].rename(
            columns={
                f"context_text_{context_index}": "context_text",
                f"openai_relevance_{context_index}": "openai_relevance",
                f"palm2_relevance_{context_index}": "palm2_relevance",
            }
        )
        for context_index in range(num_retrieved_documents)
    ]
)
disagreeing_evaluation_mask = retrievals_df["openai_relevance"] != retrievals_df["palm2_relevance"]
retrievals_df[disagreeing_evaluation_mask]

## 8. Compute Ranking Metrics

Now that you know whether each piece of retrieved context is relevant or irrelevant to the corresponding query, you can compute precision@k for k = 1, 2 for each query. This metric tells you what fraction of the top-k retrieved documents is relevant to the corresponding query.

$$
\text{precision@k} = \frac{\text{\# of top-}k\text{ retrieved documents that are relevant}}{k}
$$

If precision@2 is greater than zero for a particular query, your LlamaIndex application retrieved at least one relevant piece of context with which to answer the query. If precision@2 is zero for a particular query, no relevant piece of context was retrieved.
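
As a worked example of the definition, the sketch below computes precision@k for a single query from its ranked relevance labels. The function name `precision_at_k` and the example labels are illustrative assumptions.

```python
from typing import List


def precision_at_k(relevance_labels: List[str], k: int) -> float:
    """Fraction of the top-k retrieved documents labeled "relevant"."""
    if not 1 <= k <= len(relevance_labels):
        raise ValueError("k must be between 1 and the number of retrieved documents")
    top_k = relevance_labels[:k]
    return sum(label == "relevant" for label in top_k) / k


# First retrieved document is irrelevant, second is relevant.
labels = ["irrelevant", "relevant"]
print(precision_at_k(labels, 1))  # 0.0
print(precision_at_k(labels, 2))  # 0.5
```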

Compute precision@k for k = 1, 2 and view the results.

In [None]:
for model_name in ["openai", "palm2"]:
    num_relevant_documents_array = np.zeros(len(query_df))
    for retrieved_context_index in range(num_retrieved_documents):
        k = retrieved_context_index + 1  # use k to avoid overwriting num_retrieved_documents
        num_relevant_documents_array += (
            query_df[f"{model_name}_relevance_{retrieved_context_index}"]
            .map(lambda x: int(x == "relevant"))
            .to_numpy()
        )
        query_df[f"{model_name}_precision@{k}"] = num_relevant_documents_array / k

query_df[
    [
        "openai_relevance_0",
        "openai_relevance_1",
        "openai_precision@1",
        "openai_precision@2",
        "palm2_relevance_0",
        "palm2_relevance_1",
        "palm2_precision@1",
        "palm2_precision@2",
    ]
]

## 9. Launch Phoenix

Define a schema to tell Phoenix what the columns of your query and database dataframes represent (features, predictions, actuals, tags, embeddings, etc.). See the [docs](https://docs.arize.com/phoenix/) for guides on how to define your own schema and API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`.

In [None]:
query_df["response_vector"] = query_df[
    "centered_text_vector"
].copy()  # the response requires an embedding, but we don't have one, so we just use the prompt embedding
query_schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        raw_data_column_name="text",
        vector_column_name="centered_text_vector",
    ),
    response_column_names=px.EmbeddingColumnNames(
        raw_data_column_name="response",
        vector_column_name="response_vector",
    ),
    tag_column_names=[
        "context_text_0",
        "context_similarity_0",
        "context_text_1",
        "context_similarity_1",
        "euclidean_distance_0",
        "euclidean_distance_1",
        "openai_relevance_0",
        "openai_relevance_1",
        "palm2_relevance_0",
        "palm2_relevance_1",
        "openai_precision@1",
        "openai_precision@2",
        "palm2_precision@1",
        "palm2_precision@2",
        "user_feedback",
    ],
)
database_schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        raw_data_column_name="text",
        vector_column_name="centered_text_vector",
    ),
)
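
Because the schema refers to columns by name, it can help to confirm that every referenced column exists in the dataframe before creating the datasets. The following is a minimal sketch using plain pandas; `find_missing_columns` is a hypothetical helper and not part of the Phoenix API.

```python
from typing import List

import pandas as pd


def find_missing_columns(dataframe: pd.DataFrame, expected_columns: List[str]) -> List[str]:
    """Return the expected column names that are absent from the dataframe."""
    return [column for column in expected_columns if column not in dataframe.columns]


toy_df = pd.DataFrame({"text": ["q"], "user_feedback": [1]})
print(find_missing_columns(toy_df, ["text", "user_feedback", "openai_precision@1"]))
```

In the notebook you would pass `query_df` together with the tag and embedding column names listed in the schema.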

Create Phoenix datasets that wrap your dataframes with the schemas that describe them.

In [None]:
database_ds = px.Dataset(
    dataframe=database_df,
    schema=database_schema,
    name="database",
)
query_ds = px.Dataset(
    dataframe=query_df,
    schema=query_schema,
    name="query",
)

Launch Phoenix. Follow the instructions in the cell output to open the Phoenix UI.

In [None]:
session = px.launch_app(query_ds, database_ds)