<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Improving Your Knowledge Base</h1>

Imagine you've built and deployed an LLM question-answering service that enables users to ask questions and receive answers from a knowledge base. You want to understand what kinds of questions your users are asking and whether you're providing good answers to those questions.

Phoenix helps you pinpoint user queries that are not answered by your knowledge base so that you know which topics to iterate and improve upon. As you'll see, your users are asking questions on several topics that your knowledge base does not cover.

In this tutorial, you will:

- Download a pre-indexed knowledge base and run a LlamaIndex application
- Download user query data and knowledge base data, including embeddings computed using the OpenAI API
- Define a schema to describe the format of your data
- Launch Phoenix to visually explore your embeddings
- Investigate clusters of user queries with no corresponding knowledge base entry

⚠️ This notebook requires an [OpenAI API key](https://platform.openai.com/account/api-keys).

Let's get started!

## Building a Knowledge Base With LlamaIndex

[LlamaIndex](https://github.com/jerryjliu/llama_index#readme) is an open-source library that provides high-level APIs for LLM-powered applications. This tutorial uses LlamaIndex to build a semantic search and question-answering service over a knowledge base of chunked documents.

![an illustration of](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/context_retrieval.webp)

The details of indexing 

## Install Dependencies and Import Libraries

Install Phoenix and LlamaIndex.

In [None]:
!pip install arize-phoenix llama-index

Import libraries.

In [None]:
import json
import os
import tempfile
import textwrap
from tqdm import tqdm
from typing import Dict, List
import urllib.request
import zipfile

from langchain import OpenAI
from llama_index import StorageContext, load_index_from_storage
from llama_index.response.schema import Response
import numpy as np
import openai
import pandas as pd
import phoenix as px
from sklearn.metrics.pairwise import cosine_similarity
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)

pd.set_option("display.max_colwidth", None)

Set your OpenAI API key. You can skip this cell if the `OPENAI_API_KEY` environment variable is already set in your notebook environment.

In [None]:
os.environ["OPENAI_API_KEY"] = "copy paste your api key here"
assert (
    os.environ["OPENAI_API_KEY"] != "copy paste your api key here"
), "❌ Please set your OpenAI API key"
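If you would rather not paste the key into the notebook, one option is to read it from the environment and prompt only when it is missing. This is a minimal sketch; the helper name `resolve_openai_api_key` is hypothetical and is not part of Phoenix or the OpenAI client.

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def resolve_openai_api_key(
    environ: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return the key from the environment, prompting only if it is unset or empty."""
    key = environ.get("OPENAI_API_KEY", "").strip()
    if not key:
        # getpass hides the typed key so it is not echoed into the cell output.
        key = prompt("Enter your OpenAI API key: ").strip()
    if not key:
        raise ValueError("An OpenAI API key is required to run this notebook")
    return key
```

Assigning the result with `os.environ["OPENAI_API_KEY"] = resolve_openai_api_key()` leaves an existing key untouched and only prompts when none is set.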

## Download Your Knowledge Base

Download and unzip a pre-built knowledge base index of Wikipedia articles.

In [None]:
def download_file(url: str, output_path: str) -> None:
    """
    Downloads a file from the specified URL and saves to a local path.
    """
    urllib.request.urlretrieve(url, output_path)


def unzip_directory(zip_path: str, output_path: str) -> None:
    """
    Unzips a directory to a specified output path.
    """
    with zipfile.ZipFile(zip_path, "r") as f:
        f.extractall(output_path)


print("⏳ Downloading knowledge base...")
data_dir = tempfile.gettempdir()
zip_file_path = os.path.join(data_dir, "database_index.zip")
download_file(
    url="http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/database_index.zip",
    output_path=zip_file_path,
)

print("⏳ Unzipping knowledge base...")
index_dir = os.path.join(data_dir, "database_index")
unzip_directory(zip_file_path, index_dir)

print("✅ Done")
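Before loading the index, it can help to confirm that the archive unpacked as expected. The sketch below checks for the two JSON files this notebook reads later (`docstore.json` and `vector_store.json`). The helper name `find_missing_index_files` is hypothetical, and the archive may contain other files that are not checked here.

```python
import os
import tempfile
from typing import Iterable, List


def find_missing_index_files(
    index_dir: str,
    expected_files: Iterable[str] = ("docstore.json", "vector_store.json"),
) -> List[str]:
    """Return the expected file names that are absent from index_dir."""
    return [
        name
        for name in expected_files
        if not os.path.isfile(os.path.join(index_dir, name))
    ]


# Demo on a throwaway directory that contains only one of the two files.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "docstore.json"), "w").close()
    missing = find_missing_index_files(demo_dir)
print(missing)
```

In the notebook, `assert not find_missing_index_files(index_dir)` after unzipping would fail early if the download was incomplete.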

## Run Your Question-Answering Service

Start a LlamaIndex application from your pre-computed index.

In [None]:
storage_context = StorageContext.from_defaults(
    persist_dir=index_dir,
)
llm = OpenAI(temperature=0, model_name="gpt-4")
index = load_index_from_storage(storage_context, llm=llm)
query_engine = index.as_query_engine()

Ask a question of your question-answering service. See the response in addition to the retrieved context from your knowledge base (by default, LlamaIndex retrieves the two most similar entries to the query by cosine similarity).

In [None]:
def display_llama_index_response(response: Response) -> None:
    """
    Displays a LlamaIndex response and its source nodes (retrieved context).
    """

    print("Response")
    print("========")
    for line in textwrap.wrap(response.response.strip(), width=80):
        print(line)
    print()

    print("Source Nodes")
    print("============")
    print()

    for source_node in response.source_nodes:
        print(f"doc_id: {source_node.node.doc_id}")
        print(f"score: {source_node.score}")
        print()
        for line in textwrap.wrap(source_node.node.text, width=80):
            print(line)
        print()


query = "What is the name of the character Microsoft used to make Windows 8 seem more personable?"
response = query_engine.query(query)
display_llama_index_response(response)

Change the query in the cell above and re-run to ask another question of your choice. You can see example user queries in the `query_df` below.

## Load Database and Query Data into Pandas

Download a dataset of user query data. View a few rows of the dataframe.

In [None]:
query_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/query.parquet"
)
query_df = (
    query_df.drop(columns=["broad_subject"])
    .rename(columns={"granular_subject": "subject"})
    .reset_index(drop=True)
)  # fixme: update dataset and remove this line
query_df.head()

The columns of the dataframe are:
- **subject:** the subject of the Wikipedia article (e.g., "Beyoncé", "Liberia")
- **text:** the text of the user query
- **text_vector:** the embedding vector representing that text

Load your previously downloaded LlamaIndex data, including embeddings, into a dataframe. Download metadata and join it with the LlamaIndex data. View a few rows of the merged dataframe.

In [None]:
def load_llama_index_database_into_dataframe(docstore, vector_store) -> pd.DataFrame:
    """
    Loads LlamaIndex data into a Pandas dataframe.
    """
    text_list = []
    embeddings_list = []
    for doc_id in docstore["docstore/data"]:
        text_list.append(docstore["docstore/data"][doc_id]["__data__"]["text"])
        embeddings_list.append(np.array(vector_store["embedding_dict"][doc_id]))
    return pd.DataFrame(
        {
            "text": text_list,
            "text_vector": embeddings_list,
        }
    )


with open(os.path.join(index_dir, "docstore.json")) as f:
    docstore = json.load(f)
with open(os.path.join(index_dir, "vector_store.json")) as f:
    vector_store = json.load(f)


database_df = load_llama_index_database_into_dataframe(docstore, vector_store)

# FIXME: update dataset and remove the following lines
database_metadata_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/database_full.parquet"
)
database_metadata_df = (
    database_metadata_df.drop(columns=["broad_subject"])
    .rename(columns={"granular_subject": "subject"})
    .reset_index(drop=True)
)
database_df = pd.merge(
    database_df,
    database_metadata_df[["text", "subject", "article_index", "paragraph_index"]],
    on="text",
    how="inner",
)
database_df = (
    database_df[["article_index", "paragraph_index", "subject", "text", "text_vector"]]
    .sort_values(by=["article_index", "paragraph_index"])
    .reset_index(drop=True)
)
# FIXME: end

database_df.sample(n=5)

The database dataframe has two columns that the query dataframe does not:
- **article_index:** a unique index for each article in the knowledge base
- **paragraph_index:** the index of the paragraph in the article

Notice that the text column of the database dataframe contains entire paragraphs that are in many cases much longer than the questions of the query dataframe.
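The `load_llama_index_database_into_dataframe` function earlier in this section relies on a particular JSON layout, inferred here from the keys it reads: the docstore nests each document's text under `docstore/data`, then the doc ID, then `__data__` and `text`, while the vector store maps doc IDs to embeddings under `embedding_dict`. The toy dictionaries below mirror only those keys to show how text and embeddings are paired by doc ID. The doc IDs, text, and vectors are made up, and the real files may contain additional fields.

```python
# Toy stand-ins for the parsed docstore.json and vector_store.json files.
# Doc IDs, text, and vectors are invented for illustration.
toy_docstore = {
    "docstore/data": {
        "doc-a": {"__data__": {"text": "First paragraph."}},
        "doc-b": {"__data__": {"text": "Second paragraph."}},
    }
}
toy_vector_store = {
    "embedding_dict": {
        "doc-a": [0.1, 0.2, 0.3],
        "doc-b": [0.4, 0.5, 0.6],
    }
}

# Pair each document's text with its embedding using the shared doc ID.
rows = [
    {
        "doc_id": doc_id,
        "text": entry["__data__"]["text"],
        "text_vector": toy_vector_store["embedding_dict"][doc_id],
    }
    for doc_id, entry in toy_docstore["docstore/data"].items()
]

for row in rows:
    print(row)
```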

## Compute LLM-Assisted Evaluations and Precision@k

At query time, your LlamaIndex application is configured to retrieve the two most similar pieces of context from the database by cosine similarity. In order to evaluate the retrieval process, compute the two most similar database entries for each query. These are the exact pieces of context that are sent to the LLM along with the query to generate the final response that the end user of the LlamaIndex application sees.

In [None]:
def get_retrieval_data(
    database_df: pd.DataFrame,
    query_df: pd.DataFrame,
    num_retrieved_docs_per_query: int,
) -> pd.DataFrame:
    """
    Given database and query dataframes containing text and text embeddings,
    return a dataframe containing retrieval data.
    """
    query_list = []
    context_list = []
    scores = []
    retrieval_ranks = []
    database_embeddings = np.stack(database_df["text_vector"].to_list())
    query_embeddings = np.stack(query_df["text_vector"].to_list())
    pairwise_cosine_similarity = cosine_similarity(query_embeddings, database_embeddings)
    retrieved_context_index_matrix = np.argsort(-pairwise_cosine_similarity, axis=1)[
        :, :num_retrieved_docs_per_query
    ]
    for query_index, retrieved_context_indexes in enumerate(retrieved_context_index_matrix):
        query = query_df["text"].iloc[query_index]
        for index, retrieved_context_index in enumerate(retrieved_context_indexes):
            retrieval_rank = index + 1
            query_list.append(query)
            retrieved_context = database_df["text"].iloc[retrieved_context_index]
            score = pairwise_cosine_similarity[query_index, retrieved_context_index]
            context_list.append(retrieved_context)
            scores.append(score)
            retrieval_ranks.append(retrieval_rank)
    return pd.DataFrame(
        {
            "query": query_list,
            "retrieved_context": context_list,
            "retrieval_rank": retrieval_ranks,
            "cosine_similarity": scores,
        }
    )


num_retrieved_documents_per_query = 2
retrievals_df = get_retrieval_data(
    database_df, query_df, num_retrieved_docs_per_query=num_retrieved_documents_per_query
)
retrievals_df.head()

The columns of the dataframe are:
- **query:** the user query (each query appears twice in the dataframe, once for each retrieved piece of context)
- **retrieved_context:** the context retrieved by the LlamaIndex application
- **retrieval_rank:** the rank of the retrieval (1 being the most similar piece of context, 2 being the second-most similar, etc.)
- **cosine_similarity:** the cosine similarity between the embeddings of the query and retrieved context
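To make the ranking concrete, here is a small pure-Python sketch of the same idea on made-up 2-dimensional vectors: compute the cosine similarity between one query and every database entry, then keep the `k` highest-scoring entries. The function names and vectors are illustrative only; the notebook itself uses scikit-learn and NumPy for this step.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(
    query: Sequence[float], database: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, int, float]]:
    """Return (retrieval_rank, database_index, score) for the k most similar entries."""
    scores = [cosine(query, entry) for entry in database]
    ranked = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
    return [(rank, i, scores[i]) for rank, i in enumerate(ranked[:k], start=1)]


# Made-up vectors: three database entries and one query.
database_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vector = [2.0, 1.0]

retrieved = top_k(query_vector, database_vectors, k=2)
for rank, index, score in retrieved:
    print(f"rank={rank} index={index} cosine_similarity={score:.3f}")
```

Here the entry at index 2 ranks first and the entry at index 0 ranks second, mirroring the `retrieval_rank` and `cosine_similarity` columns described above.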

You'll use OpenAI to determine the relevance of each retrieved piece of context to the corresponding query. Define and view the prompt template used during evaluation.

In [None]:
evaluation_prompt_template = """You will be given a query and a reference text. You must determine whether the reference text contains an answer to the input query. Your response must be binary (0 or 1) and should not contain any text or characters aside from 0 or 1. 0 means that the reference text does not contain an answer to the query. 1 means the reference text contains an answer to the query.

# Query: {query}

# Reference: {reference}

# Score: """

print("Evaluation Prompt Template")
print("==========================")
print()
for line in textwrap.wrap(evaluation_prompt_template, width=80, replace_whitespace=False):
    print(line)

# FIXME: fix formatting

Estimate the cost of running LLM-assisted evaluations on your data.

In [None]:
# FIXME: implement code to estimate cost of OpenAI API calls and display to user
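The cost-estimation cell above is still a placeholder. As a rough sketch of what it might compute: approximate the token count of each prompt, add an allowance for the completion, and multiply by a per-token price. Everything below is an assumption rather than an OpenAI-provided method. The four-characters-per-token ratio is only a rule of thumb for English text, the 16-token completion allowance is an assumed upper bound, and `price_per_1k_tokens` must be taken from OpenAI's current pricing (the value in the example call is a placeholder, not a real price).

```python
import math
from typing import Iterable


def estimate_evaluation_cost(
    prompts: Iterable[str],
    price_per_1k_tokens: float,
    chars_per_token: float = 4.0,  # assumption: rough rule of thumb for English text
    completion_tokens_per_prompt: int = 16,  # assumption: upper bound on each reply
) -> float:
    """Return a rough cost estimate, in the same currency unit as the price."""
    total_tokens = 0
    for prompt in prompts:
        prompt_tokens = math.ceil(len(prompt) / chars_per_token)
        total_tokens += prompt_tokens + completion_tokens_per_prompt
    return total_tokens / 1000 * price_per_1k_tokens


# Placeholder inputs: two dummy prompts and a made-up price of 0.5 per 1k tokens.
example_cost = estimate_evaluation_cost(["a" * 400, "b" * 40], price_per_1k_tokens=0.5)
print(f"Estimated cost: {example_cost:.4f}")
```

In the notebook, the same function could be called on the list of formatted evaluation prompts once it has been built. Counting tokens with a real tokenizer would give a tighter estimate than the character heuristic.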

For each query-context pair, create a prompt by formatting the prompt template above. Send the resulting prompts to the OpenAI completions API to get relevance predictions.

⚠ This cell takes a couple of minutes to run. If you run into rate-limiting issues, try adjusting the parameters in the retry decorator below.

In [None]:
@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def complete_batch_of_prompts(prompts: List[str], model_name: str) -> List[str]:
    """
    Completes a list of prompts using the OpenAI completion API and the
    specified model. As of June 2023, OpenAI supports a maximum of 20 prompts
    per completion request. This function is wrapped in a retry decorator in
    order to avoid rate-limiting. Retry settings were copied from
    https://github.com/openai/openai-cookbook/blob/main/examples/How_to_handle_rate_limits.ipynb.
    """
    response = openai.Completion.create(
        model=model_name,
        prompt=prompts,
    )
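    # Sort by index so that completions line up with the order of the input prompts.
    choices = sorted(response["choices"], key=lambda choice: choice["index"])
    return [choice["text"] for choice in choices]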


def complete_prompts(
    prompts: List[str],
    model_name: str,
    batch_size: int = 20,  # the max number of prompts per completion request as of June 2023
) -> List[str]:
    """
    Completes a list of prompts using the OpenAI completion API. The list may be
    of arbitrary length and will be batched using the batch_size parameter.
    """
    completions = []
    progress_bar = tqdm(total=len(prompts))
    for batch_of_prompts in (
        prompts[index : index + batch_size] for index in range(0, len(prompts), batch_size)
    ):
        completions.extend(complete_batch_of_prompts(batch_of_prompts, model_name))
        num_prompts_in_batch = len(batch_of_prompts)
        progress_bar.update(num_prompts_in_batch)
    return completions


def process_completions(
    raw_completions: List[str], binary_to_string_map: Dict[int, str]
) -> List[str]:
    """
    Parses the raw completions returned by the OpenAI completion API and
    converts them to the desired format. The binary_to_string_map parameter
    should be a dictionary mapping binary values (0 or 1) to the desired
    string values (e.g. "irrelevant" or "relevant").
    """
    processed_completions = []
    for raw_completion in raw_completions:
        try:
            binary_value = int(raw_completion.strip())
            processed_completion = binary_to_string_map[binary_value]
        except (ValueError, KeyError):
            processed_completion = None
        processed_completions.append(processed_completion)
    return processed_completions


model_name = "text-davinci-003"  # this is the most powerful model available for the completion API as of June 2023
evaluation_prompts = retrievals_df.apply(
    lambda row: evaluation_prompt_template.format(
        query=row["query"], reference=row["retrieved_context"]
    ),
    axis=1,
).to_list()
raw_completions = complete_prompts(evaluation_prompts, model_name)
processed_completions = process_completions(raw_completions, {0: "irrelevant", 1: "relevant"})
retrievals_df["relevance"] = processed_completions
retrievals_df

Inspect the query-context pairs in the dataframe above to convince yourself that the LLM-assisted evaluation is doing a reasonable job of predicting the relevance of each piece of context to the corresponding query.

Even though the evaluation prompt template explicitly instructs the LLM to return binary output, LLMs sometimes fail to follow instructions and produce output in an unexpected format that is difficult to parse. It's a good idea to check the distribution of the "relevance" column before proceeding. You should see "relevant" and "irrelevant" entries and at most a few missing (`None` or `NaN`) entries, which mark the cases where the LLM produced unparseable output.

In [None]:
retrievals_df["relevance"].value_counts(dropna=False)  # keep missing values in the counts
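To see why some rows can end up without a label, the sketch below applies the same parsing rule as `process_completions` (strip whitespace, convert to an integer, look up the label) to a few made-up completions. The sample strings are invented for illustration and are not actual model output.

```python
from typing import Dict, Optional

LABELS: Dict[int, str] = {0: "irrelevant", 1: "relevant"}


def parse_completion(raw_completion: str) -> Optional[str]:
    """Map a raw completion to a label, or None if it is not a bare 0 or 1."""
    try:
        return LABELS[int(raw_completion.strip())]
    except (ValueError, KeyError):
        # ValueError: not an integer; KeyError: an integer other than 0 or 1.
        return None


samples = [" 1", "0\n", "1.", "Yes", "2", ""]
parsed = [parse_completion(sample) for sample in samples]
for sample, label in zip(samples, parsed):
    print(f"{sample!r:>6} -> {label}")
```

Only the first two samples parse cleanly. Trailing punctuation, free text, out-of-range integers, and empty strings all fall through to `None`.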

Now that you know whether each piece of retrieved context is relevant or irrelevant to the corresponding query, you can compute precision@k for $k = 1, 2$ for each query. This metric tells you what fraction of the top $k$ retrieved pieces of context is relevant to the corresponding query.

$$
\text{precision@}k = \frac{\text{number of top-}k\text{ retrieved documents that are relevant}}{k}
$$

If your precision@2 is greater than zero for a particular query, your LlamaIndex application successfully retrieved at least one relevant piece of context with which to answer the query. If the precision@2 is zero for a particular query, no relevant piece of context was retrieved.
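As a worked example, suppose the rank-1 context for a query is relevant and the rank-2 context is not. Then precision@1 is 1/1 = 1.0 and precision@2 is 1/2 = 0.5. The short sketch below computes this from a list of relevance labels ordered by retrieval rank; the function name is illustrative.

```python
from typing import List, Optional


def precision_at_each_k(relevance_by_rank: List[Optional[str]]) -> List[float]:
    """Return [precision@1, precision@2, ...] for labels ordered by retrieval rank."""
    precisions = []
    num_relevant = 0
    for k, label in enumerate(relevance_by_rank, start=1):
        # Any label other than "relevant" (including None) counts as not relevant.
        num_relevant += label == "relevant"
        precisions.append(num_relevant / k)
    return precisions


print(precision_at_each_k(["relevant", "irrelevant"]))  # [1.0, 0.5]
print(precision_at_each_k(["irrelevant", "irrelevant"]))  # [0.0, 0.0]
```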

In [None]:
PRECISION_AT_K_FORMAT_STRING = "precision_at_{k}"  # yields the "precision_at_1" and "precision_at_2" columns selected later


def compute_precision_at_k(retrievals_df: pd.DataFrame, K: int) -> pd.DataFrame:
    """
    Computes precision at k for k = 1, 2, ..., K for a given retrieval
    dataframe.
    """
    queries = []
    precision_data = {PRECISION_AT_K_FORMAT_STRING.format(k=k): [] for k in range(1, K + 1)}
    for query, group in retrievals_df.groupby("query"):
        queries.append(query)
        num_relevant_documents_column = (
            group.sort_values(by=["retrieval_rank"])["relevance"] == "relevant"
        ).cumsum()
        for index, num_relevant_documents in enumerate(num_relevant_documents_column.to_list()):
            num_retrieved_documents = index + 1
            precision_at_k = num_relevant_documents / num_retrieved_documents
            precision_at_k_column_name = PRECISION_AT_K_FORMAT_STRING.format(
                k=num_retrieved_documents
            )
            precision_data[precision_at_k_column_name].append(precision_at_k)
    return pd.DataFrame(
        {
            "query": queries,
            **precision_data,
        }
    )


precision_df = compute_precision_at_k(retrievals_df, K=2)
precision_df

## Add Metrics to Dataframes

Add the following columns to the query dataframe:
- **precision@k** for $k = 1, 2$
- **max_cosine_similarity:** the cosine similarity between the query and the most similar piece of context in the database
- **retrieved_context:** all retrieved context with corresponding similarity score

First, grab the previously computed precision data.

In [None]:
precision_at_k_df = precision_df.set_index("query")
precision_at_k_df

Next, compute the max cosine similarity.

In [None]:
max_cosine_sim_df = (
    retrievals_df[retrievals_df["retrieval_rank"] == 1][["query", "cosine_similarity"]]
    .rename(columns={"cosine_similarity": "max_cosine_similarity"})
    .set_index("query")
)
max_cosine_sim_df

Format the retrieved context data to view in the app.

In [None]:
def format_retrievals(dataframe: pd.DataFrame) -> str:
    formatted_retrieval_template = """Retrieval Rank: {retrieval_rank}
Cosine Similarity: {cosine_similarity}    
Retrieved Context: {retrieved_context}"""
    formatted_retrievals_column = dataframe.apply(
        lambda row: formatted_retrieval_template.format(
            retrieval_rank=row["retrieval_rank"],
            cosine_similarity=row["cosine_similarity"],
            retrieved_context=row["retrieved_context"],
        ),
        axis=1,
    )
    return "\n\n".join(formatted_retrievals_column.to_list())


retrieved_contexts_column = retrievals_df.groupby("query").apply(format_retrievals)
retrieved_contexts_column.name = "retrieved_context"
retrieved_contexts_df = retrieved_contexts_column.to_frame()
retrieved_contexts_df.head()

Add the data from the cells above to a new query dataframe.

In [None]:
query_df_with_metrics = query_df.set_index("text")
for merge_df in [precision_at_k_df, max_cosine_sim_df, retrieved_contexts_df]:
    query_df_with_metrics = query_df_with_metrics.merge(merge_df, left_index=True, right_index=True)
query_df_with_metrics.index = query_df_with_metrics.index.set_names(["text"])
query_df_with_metrics = query_df_with_metrics.reset_index()
query_df_with_metrics = query_df_with_metrics[
    [
        "subject",
        "text",
        "text_vector",
        "max_cosine_similarity",
        "precision_at_1",
        "precision_at_2",
        "retrieved_context",
    ]
]
query_df_with_metrics

In [None]:
pd.merge(query_df, precision_df, left_on="text", right_on="query").drop(columns=["query"])

## Launch Phoenix

Define a schema to tell Phoenix what the columns of your query dataframe represent (features, predictions, actuals, tags, embeddings, etc.). See the [docs](https://docs.arize.com/phoenix/) for guides on how to define your own schema and API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`.

In [None]:
query_schema = px.Schema(
    embedding_feature_column_names={
        "text_embedding": px.EmbeddingColumnNames(
            vector_column_name="text_vector",
            raw_data_column_name="text",
        )
    },
    tag_column_names=[
        "subject",
        "max_cosine_similarity",
        "precision_at_1",
        "precision_at_2",
        "retrieved_context",
    ],
)

Similarly, define a schema for your database data.

In [None]:
database_schema = px.Schema(
    embedding_feature_column_names={
        "text_embedding": px.EmbeddingColumnNames(
            vector_column_name="text_vector",
            raw_data_column_name="text",
        )
    },
    tag_column_names=[
        "subject",
        "article_index",
        "paragraph_index",
    ],
)

Create Phoenix datasets that wrap your dataframes with the schemas that describe them.

In [None]:
database_ds = px.Dataset(database_df, database_schema, name="database")
query_ds = px.Dataset(query_df_with_metrics, query_schema, name="query")

Launch Phoenix. Follow the instructions in the cell output to open the Phoenix UI.

In [None]:
session = px.launch_app(primary=query_ds, reference=database_ds)

## Investigate User Interests and Improve Your Knowledge Base

Click on "text_embedding" to go to the embeddings page.

![click on text embedding](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/click_on_text_embedding.png)

Increase the number of sampled points that appear in the point cloud to 2500.

![adjust number of samples for umap](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/adjust_number_of_samples_for_umap.png)

Inspect the clusters in the panel on the left. The top clusters contain mostly user queries and few database entries.

![investigate top clusters](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/investigate_top_clusters.png)

You can color the data by **subject** to visualize the topics represented within each cluster. What topics are your users asking about that are not answered by your database?

![color by granular subject](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/color_by_granular_subject.png)

Congrats! You've found the topics your users are asking about that are not covered in your knowledge base (Richard Feynman, Neptune, and PlayStation 3). As an actionable next step, you can augment your knowledge base to cover these topics so your users get answers to the questions they are interested in.
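The same gap can be cross-checked outside the UI by counting, per subject, the queries for which no relevant context was retrieved (a precision@2 of zero). The sketch below uses plain dictionaries with made-up rows so that it runs on its own. The records and the helper name are hypothetical; in the notebook the equivalent data lives in `query_df_with_metrics`.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def subjects_without_answers(
    records: Iterable[Dict[str, object]],
    precision_column: str = "precision_at_2",
) -> List[Tuple[str, int]]:
    """Count queries per subject whose precision is zero, most frequent first."""
    counts = Counter(
        str(record["subject"]) for record in records if record[precision_column] == 0
    )
    return counts.most_common()


# Made-up rows for illustration only.
toy_records = [
    {"subject": "Neptune", "precision_at_2": 0.0},
    {"subject": "Neptune", "precision_at_2": 0.0},
    {"subject": "Liberia", "precision_at_2": 1.0},
    {"subject": "Richard Feynman", "precision_at_2": 0.0},
    {"subject": "Liberia", "precision_at_2": 0.5},
]
uncovered = subjects_without_answers(toy_records)
print(uncovered)
```

Subjects that appear near the top of this list are candidates for new knowledge base articles.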