Evaluating and Improving a LlamaIndex Semantic Retrieval Application

Imagine you're an engineer at Arize AI and you've built and deployed a documentation question-answering service using LlamaIndex. Users send questions about Arize's core product via a chat interface, and your service retrieves documents from your documentation in order to generate a response to the user. As the engineer in charge of evaluating and maintaining this system, you want to evaluate the quality of the responses from your service.

Arize helps you:
- identify gaps in your documentation
- detect queries for which the LLM gave bad responses
- detect failures to retrieve relevant context

In this tutorial, you will:

- Download a pre-indexed knowledge base of the Arize documentation
- Visualize user queries and knowledge base documents to identify areas of user interest not answered by your documentation
- Find clusters of responses with negative user feedback
- Identify failed retrievals using cosine similarity, Euclidean distance, and LLM-assisted ranking metrics

Let's get started!

## 1. Install Dependencies and Import Libraries

Install Arize.

In [None]:
!pip install -q arize

Import libraries.

In [None]:
import json
import os
import tempfile
import urllib.request
import zipfile
import numpy as np
import numpy.typing as npt
import pandas as pd

## 2. Download Your Knowledge Base

Download and unzip a pre-built knowledge base index consisting of chunks of the Arize documentation.

In [None]:
def download_file(url: str, output_path: str) -> None:
    """
    Downloads a file from the specified URL and saves to a local path.
    """
    urllib.request.urlretrieve(url, output_path)


def unzip_directory(zip_path: str, output_path: str) -> None:
    """
    Unzips a directory to a specified output path.
    """
    with zipfile.ZipFile(zip_path, "r") as f:
        f.extractall(output_path)


print("⏳ Downloading knowledge base...")
data_dir = tempfile.gettempdir()
zip_file_path = os.path.join(data_dir, "index.zip")
download_file(
    url="http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index.zip",
    output_path=zip_file_path,
)

print("⏳ Unzipping knowledge base...")
index_dir = os.path.join(data_dir, "index")
unzip_directory(zip_file_path, index_dir)

print("✅ Done")

## 3. Load Your Data Into Pandas Dataframes

In [None]:
def load_llama_index_database_into_dataframe(docstore, vector_store) -> pd.DataFrame:
    """
    Loads LlamaIndex data into a Pandas dataframe.
    """
    text_list = []
    embeddings_list = []
    for doc_id in docstore["docstore/data"]:
        text_list.append(docstore["docstore/data"][doc_id]["__data__"]["text"])
        embeddings_list.append(np.array(vector_store["embedding_dict"][doc_id]))
    return pd.DataFrame(
        {
            "text": text_list,
            "text_vector": embeddings_list,
        }
    )


with open(os.path.join(index_dir, "docstore.json")) as f:
    docstore = json.load(f)
with open(os.path.join(index_dir, "vector_store.json")) as f:
    vector_store = json.load(f)

database_df = load_llama_index_database_into_dataframe(docstore, vector_store).drop_duplicates(
    subset=["text"]
)
database_df['pred'] = 0
database_df['pred_id'] = database_df.index

database_df.head()

The columns of your dataframe are:
- **text:** the chunked text in your knowledge base
- **text_vector:** the embedding vector for the text, computed during the LlamaIndex build using "text-embedding-ada-002" from OpenAI
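
The loader above relies on a particular nesting in the two JSON files: text lives under `docstore/data` → document ID → `__data__` → `text`, and embeddings live under `embedding_dict` → document ID. The following is a minimal sketch of that lookup on toy dictionaries. The document IDs, texts, and vectors are made up for illustration, and the real files contain more fields than the ones shown here.

```python
# Toy stand-ins for docstore.json and vector_store.json. Only the keys that
# the loader reads are included; the IDs, texts, and vectors are made up.
toy_docstore = {
    "docstore/data": {
        "doc-a": {"__data__": {"text": "How to set up a monitor."}},
        "doc-b": {"__data__": {"text": "How to export data."}},
    }
}
toy_vector_store = {
    "embedding_dict": {
        "doc-a": [0.1, 0.2, 0.3],
        "doc-b": [0.4, 0.5, 0.6],
    }
}

# Pair each chunk of text with its embedding by looking both up by document ID.
records = [
    {
        "doc_id": doc_id,
        "text": entry["__data__"]["text"],
        "text_vector": toy_vector_store["embedding_dict"][doc_id],
    }
    for doc_id, entry in toy_docstore["docstore/data"].items()
]

for record in records:
    print(record["doc_id"], record["text"], record["text_vector"])
```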

Next, download a dataframe containing query data.

In [None]:
query_df = (
    pd.read_parquet(
        "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/retrievals_with_user_feedback.parquet"
    )
    .rename(columns={"query_text": "text", "query_embedding": "text_vector"})
    .drop(columns=["context_doc_id_0", "context_doc_id_1"])
)
query_df.head()

The columns of the dataframe are:
- **text:** the query text
- **text_vector:** the embedding representation of the query, captured from LlamaIndex at query time
- **response:** the final response from the LlamaIndex application
- **context_text_0:** the first retrieved context from the knowledge base
- **context_similarity_0:** the cosine similarity between the query and the first retrieved context
- **context_text_1:** the second retrieved context from the knowledge base
- **context_similarity_1:** the cosine similarity between the query and the second retrieved context
- **user_feedback:** approval or rejection from the user (-1 means thumbs down, +1 means thumbs up)

The query and database datasets are drawn from different distributions; the queries are short questions while the database entries are several sentences to a paragraph. The embeddings from OpenAI's "text-embedding-ada-002" capture these differences and naturally separate the query and context embeddings into distinct regions of the embedding space. When using Arize, you want to "overlay" the query and context embedding distributions so that queries appear close to their retrieved context in the Arize point cloud. To achieve this, we compute a centroid for each dataset that represents an average point in the embedding distribution and center the two distributions so they overlap.

In [None]:
database_centroid = database_df["text_vector"].mean()
database_df["centered_text_vector"] = database_df["text_vector"].apply(
    lambda x: x - database_centroid
)
query_centroid = query_df["text_vector"].mean()
query_df["centered_text_vector"] = query_df["text_vector"].apply(lambda x: x - query_centroid)
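
To see why subtracting each dataset's centroid overlays the two distributions, here is a small sketch on synthetic data. The two point clouds are random toy vectors, not real embeddings, and the dimensions and offsets are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy clouds that sit in different regions of a 3-dimensional space.
database_vectors = rng.normal(loc=2.0, scale=0.1, size=(50, 3))
query_vectors = rng.normal(loc=-1.0, scale=0.1, size=(20, 3))

# Distance between the two centroids before centering.
centroid_gap_before = np.linalg.norm(
    database_vectors.mean(axis=0) - query_vectors.mean(axis=0)
)

# Subtract each dataset's own centroid, as in the cell above.
centered_database = database_vectors - database_vectors.mean(axis=0)
centered_queries = query_vectors - query_vectors.mean(axis=0)

# After centering, both centroids sit at the origin, so the gap vanishes.
centroid_gap_after = np.linalg.norm(
    centered_database.mean(axis=0) - centered_queries.mean(axis=0)
)

print(f"gap before centering: {centroid_gap_before:.3f}")
print(f"gap after centering:  {centroid_gap_after:.3e}")
```

Centering only removes the offset between the two means. Distances between points within the same dataset are unchanged.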

## 4. Compute Proxy Metrics for Retrieval Quality

Cosine similarity and Euclidean distance can act as proxies for retrieval quality. The cosine similarity between each query and its retrieved context was computed at query time and is already part of the query dataframe downloaded above. Compute the Euclidean distance between each query embedding and each retrieved context embedding, and add the corresponding columns to the query dataframe.

In [None]:
def compute_euclidean_distance(
    vector0: npt.NDArray[np.float32], vector1: npt.NDArray[np.float32]
) -> float:
    """
    Computes the Euclidean distance between two vectors.
    """
    # Cast to a built-in float so the return value matches the annotation.
    return float(np.linalg.norm(vector0 - vector1))


num_retrieved_documents = 2
for context_index in range(num_retrieved_documents):
    euclidean_distances = []
    for _, row in query_df.iterrows():
        query_embedding = row["text_vector"]
        context_text = row[f"context_text_{context_index}"]
        database_row = database_df[database_df["text"] == context_text].iloc[0]
        database_embedding = database_row["text_vector"]
        euclidean_distance = compute_euclidean_distance(query_embedding, database_embedding)
        euclidean_distances.append(euclidean_distance)
    query_df[f"euclidean_distance_{context_index}"] = euclidean_distances
query_df.head()
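
The two proxies are closely related. For unit-length vectors, squared Euclidean distance equals `2 - 2 * cosine_similarity`, so ranking by either metric gives the same order. The sketch below checks that identity on two toy vectors using only the standard library. It assumes the vectors are normalized to length 1; whether that holds for your embeddings is worth verifying on your own data.

```python
import math


def normalize(vector):
    """Scales a vector to unit length."""
    norm = math.hypot(*vector)
    return [component / norm for component in vector]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot_product = sum(a * b for a, b in zip(u, v))
    return dot_product / (math.hypot(*u) * math.hypot(*v))


# Toy vectors, normalized so that the identity below applies.
query = normalize([1.0, 2.0, 3.0])
context = normalize([2.0, 1.0, 0.5])

similarity = cosine_similarity(query, context)
distance = math.dist(query, context)

# For unit vectors: distance**2 == 2 - 2 * similarity
print(f"distance squared:   {distance ** 2:.6f}")
print(f"2 - 2 * similarity: {2 - 2 * similarity:.6f}")
```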

## 5. Run LLM-Assisted Evaluations

Cosine similarity and Euclidean distance are reasonable proxies for retrieval quality, but they don't always work perfectly. An alternative is to use an LLM to measure retrieval quality by asking it whether each piece of context is relevant to the corresponding query.

Use OpenAI to predict whether each retrieved document is relevant or irrelevant to the query. Running evaluations across the entire dataset takes a while, so this example downloads a dataset of pre-computed evaluations and adds them to the query dataframe.

In [None]:
openai_evaluations_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/evaluations.parquet"
)[["text", "relevance_0", "relevance_1"]]
openai_evaluations_df = openai_evaluations_df.rename(
    columns={"relevance_0": "openai_relevance_0", "relevance_1": "openai_relevance_1"}
)
query_df = pd.merge(query_df, openai_evaluations_df, on="text")
query_df[["text", "context_text_0", "context_text_1", "openai_relevance_0", "openai_relevance_1"]]
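
For a sense of how such a label could be produced, here is a rough sketch of the two pieces around the LLM call: building the prompt and parsing the completion. The template wording and both function names are hypothetical and are not the prompt used to create the pre-computed evaluations above. The call to the LLM itself is left out.

```python
from typing import Optional

# Hypothetical prompt template; the wording is an assumption, not the prompt
# used to produce the pre-computed evaluations.
RELEVANCE_PROMPT_TEMPLATE = (
    "You are comparing a reference text to a question.\n"
    "Question: {query}\n"
    "Reference text: {context}\n"
    'Answer with a single word, "relevant" or "irrelevant".'
)


def build_relevance_prompt(query: str, context: str) -> str:
    """Fills the template with one query and one retrieved context."""
    return RELEVANCE_PROMPT_TEMPLATE.format(query=query, context=context)


def parse_relevance_label(completion: str) -> Optional[str]:
    """Maps raw LLM output to "relevant" or "irrelevant", or None if unparseable."""
    normalized = completion.strip().strip(".\"'").lower()
    if normalized in ("relevant", "irrelevant"):
        return normalized
    return None


prompt = build_relevance_prompt(
    query="How do I create a monitor?",
    context="Monitors can be created from the model overview page.",
)
print(prompt)
print(parse_relevance_label(" Relevant.\n"))
```

Returning `None` for anything other than the two expected labels keeps malformed completions from being silently counted as irrelevant.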

For comparison, we've also run evaluations using Google PaLM 2. Download those pre-computed evaluations and add them to the query dataframe.

In [None]:
palm2_evaluations_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/palm2_evaluations.parquet"
)[["text", "relevance_0", "relevance_1"]]
palm2_evaluations_df = palm2_evaluations_df.rename(
    columns={"relevance_0": "palm2_relevance_0", "relevance_1": "palm2_relevance_1"}
)
query_df = pd.merge(query_df, palm2_evaluations_df, on="text")
query_df[["text", "context_text_0", "context_text_1", "palm2_relevance_0", "palm2_relevance_1"]]

Compute the fraction of retrieved documents for which the two LLMs assign the same relevance label.

In [None]:
num_equal = 0
for context_index in range(num_retrieved_documents):
    equal_relevance_mask = (
        query_df[f"openai_relevance_{context_index}"]
        == query_df[f"palm2_relevance_{context_index}"]
    )
    num_equal += equal_relevance_mask.sum()
percent_agreeing = num_equal / (len(query_df) * num_retrieved_documents)
percent_agreeing

You can see that for the vast majority of cases, the two LLMs agree. View the few examples where they disagree in the cell below.

In [None]:
retrievals_df = pd.concat(
    [
        query_df[
            [
                "text",
                f"context_text_{context_index}",
                f"openai_relevance_{context_index}",
                f"palm2_relevance_{context_index}",
            ]
        ].rename(
            columns={
                f"context_text_{context_index}": "context_text",
                f"openai_relevance_{context_index}": "openai_relevance",
                f"palm2_relevance_{context_index}": "palm2_relevance",
            }
        )
        for context_index in range(num_retrieved_documents)
    ]
)
disagreeing_evaluation_mask = retrievals_df["openai_relevance"] != retrievals_df["palm2_relevance"]
retrievals_df[disagreeing_evaluation_mask]

In [None]:
retrievals_df

## 6. Compute Ranking Metrics

Now that you know whether each piece of retrieved context is relevant or irrelevant to the corresponding query, you can compute precision@k for k = 1, 2 for each query. This metric tells you what percentage of the retrieved context is relevant to the corresponding query.

precision@k = (# of top-k retrieved documents that are relevant) / k

If precision@2 is greater than zero for a particular query, your LlamaIndex application successfully retrieved at least one relevant piece of context with which to answer the query. If precision@2 is zero, neither of the two retrieved pieces of context was relevant.
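
As a worked example of the formula, the sketch below computes precision@1 and precision@2 for a single query whose first retrieved document is irrelevant and whose second is relevant. The helper `precision_at_k` is a hypothetical name introduced here for illustration.

```python
def precision_at_k(relevance_labels, k):
    """Fraction of the top-k retrieved documents labeled "relevant"."""
    if k < 1 or k > len(relevance_labels):
        raise ValueError("k must be between 1 and the number of retrieved documents")
    top_k = relevance_labels[:k]
    num_relevant = sum(label == "relevant" for label in top_k)
    return num_relevant / k


# Labels for one query, ordered by retrieval rank.
labels = ["irrelevant", "relevant"]

precision_at_1 = precision_at_k(labels, 1)  # 0 relevant out of 1
precision_at_2 = precision_at_k(labels, 2)  # 1 relevant out of 2

print(precision_at_1, precision_at_2)
```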

Compute precision@k for k = 1, 2 and view the results.

In [None]:
for model_name in ["openai", "palm2"]:
    # Running count of relevant documents among the top-k retrieved for each query.
    num_relevant_documents_array = np.zeros(len(query_df))
    for retrieved_context_index in range(num_retrieved_documents):
        # Use a separate name for k so the global num_retrieved_documents is not overwritten.
        k = retrieved_context_index + 1
        num_relevant_documents_array += (
            query_df[f"{model_name}_relevance_{retrieved_context_index}"]
            .map(lambda x: int(x == "relevant"))
            .to_numpy()
        )
        # Assign the array directly so values are not realigned against the dataframe index.
        query_df[f"{model_name}_precision@{k}"] = num_relevant_documents_array / k

query_df[
    [
        "openai_relevance_0",
        "openai_relevance_1",
        "openai_precision@1",
        "openai_precision@2",
        "palm2_relevance_0",
        "palm2_relevance_1",
        "palm2_precision@1",
        "palm2_precision@2",
    ]
]
# Reuse the centered query embedding as the response vector referenced by the schema below.
query_df["response_vector"] = query_df["centered_text_vector"].copy()

# Truncate each retrieved context to its first 1,000 characters.
query_df["context_text_0"] = query_df["context_text_0"].apply(lambda x: x[:1000])
query_df["context_text_1"] = query_df["context_text_1"].apply(lambda x: x[:1000])

# Replace "@" with "_" in the precision column names.
query_df.rename(
    columns={
        "openai_precision@1": "openai_precision_1",
        "openai_precision@2": "openai_precision_2",
        "palm2_precision@1": "palm2_precision_1",
        "palm2_precision@2": "palm2_precision_2",
    },
    inplace=True,
)

## 7. Send the Data to Arize for Analysis

Sign up for or log in to your Arize account [here](https://app.arize.com/auth/login). Find your [space and API keys](https://docs.arize.com/arize/api-reference/arize.pandas/client) and paste them into the cell below. For more information on Arize spaces, please visit the [Authentication section](https://docs.arize.com/arize/api-reference/rest-api#authentication) of the Arize AI docs.

<img src="https://storage.googleapis.com/arize-assets/fixtures/copy-keys.png" width="700">
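
If you would rather not paste keys directly into the notebook, one option is to read them from environment variables. The helper below is a minimal sketch of that approach. The function name `read_required_env` and the variable names in the usage note are arbitrary choices for this example, not names defined by Arize.

```python
import os


def read_required_env(name: str) -> str:
    """Returns the value of an environment variable, or raises if it is unset or blank."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise ValueError(f"Environment variable {name} is not set")
    return value
```

With the variables exported in your shell, the key assignments in the cell below could become, for example, `SPACE_KEY = read_required_env("ARIZE_SPACE_KEY")` and `API_KEY = read_required_env("ARIZE_API_KEY")`.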

In [None]:
from arize.pandas.logger import Client
from arize.utils.types import EmbeddingColumnNames, Environments, ModelTypes, Schema

SPACE_KEY = "SPACE_KEY"  # Change this line
API_KEY = "API_KEY"  # Change this line
MODEL_ID = 'arize-tutorial-search-and-retrieval'  # Change this line

if SPACE_KEY == "SPACE_KEY" or API_KEY == "API_KEY":
    raise ValueError("❌ NEED TO CHANGE SPACE_KEY AND/OR API_KEY")
else:
    print("✅ Import and Setup Arize Client Done! Now we can start using Arize!")

arize_client = Client(space_key=SPACE_KEY, api_key=API_KEY)


prod_schema = Schema(
    prompt_column_names=EmbeddingColumnNames(
        data_column_name="text",
        vector_column_name="centered_text_vector",
    ),
    response_column_names=EmbeddingColumnNames(
        data_column_name="response",
        vector_column_name="response_vector",
    ),
    tag_column_names=[
        "context_text_0",
        "context_similarity_0",
        "context_text_1",
        "context_similarity_1",
        "euclidean_distance_0",
        "euclidean_distance_1",
        "openai_relevance_0",
        "openai_relevance_1",
        "palm2_relevance_0",
        "palm2_relevance_1",
        "openai_precision_1",
        "openai_precision_2",
        "palm2_precision_1",
        "palm2_precision_2",
        "user_feedback",
    ],
)
val_schema = Schema(
    prediction_id_column_name='pred_id',
    prompt_column_names=EmbeddingColumnNames(
        data_column_name="text",
        vector_column_name="centered_text_vector",
    ),
    response_column_names=EmbeddingColumnNames(
        vector_column_name="centered_text_vector",
    ),
    prediction_label_column_name='pred',
    actual_label_column_name='pred'
)



In [None]:
response = arize_client.log(
    dataframe=query_df,
    schema=prod_schema,
    model_id=MODEL_ID,
    model_version="1.0",
    model_type=ModelTypes.GENERATIVE_LLM,
    environment=Environments.PRODUCTION
)
if response.status_code == 200:
    print("✅ Successfully logged data to Arize!")
else:
    print(
        f'❌ Logging failed with status code {response.status_code} and message "{response.text}"'
    )

In [None]:
database_df.reset_index(drop=True, inplace=True)
response = arize_client.log(
    schema=val_schema,
    dataframe=database_df,
    model_id=MODEL_ID,
    model_version="1.0",
    model_type=ModelTypes.GENERATIVE_LLM,
    environment=Environments.VALIDATION,
    batch_id="batch 1"
)
if response.status_code == 200:
    print("✅ Successfully logged data to Arize!")
else:
    print(
        f'❌ Logging failed with status code {response.status_code} and message "{response.text}"'
    )