<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Improving Your Knowledge Base</h1>

Imagine you've built and deployed an LLM question-answering service that enables users to ask questions and receive answers from a knowledge base. You want to understand what kinds of questions your users are asking and whether you're providing good answers to those questions.

Phoenix helps you pinpoint user queries that are not answered by your knowledge base so that you know which topics to iterate and improve upon. As you'll see, your users are asking questions on several topics that your knowledge base does not cover.

In this tutorial, you will:

- Download a pre-indexed knowledge base and run a LlamaIndex application
- Download user query data and knowledge base data, including embeddings computed using the OpenAI API
- Define a schema to describe the format of your data
- Launch Phoenix to visually explore your embeddings
- Investigate clusters of user queries with no corresponding knowledge base entry

⚠️ This notebook requires an [OpenAI API key](https://platform.openai.com/account/api-keys).

Let's get started!

## Building a Knowledge Base With LlamaIndex

[LlamaIndex](https://github.com/jerryjliu/llama_index#readme) is an open-source library that provides high-level APIs for LLM-powered applications. This tutorial uses LlamaIndex to build a semantic search/question-answering service over a knowledge base of chunked documents.

![an illustration of context retrieval](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/context_retrieval.webp)

The details of indexing 

## Install Dependencies and Import Libraries

Install Phoenix and LlamaIndex.

In [None]:
!pip install arize-phoenix llama-index

Import libraries.

In [None]:
import os
import tempfile
import textwrap
import urllib.request
import zipfile

from langchain import OpenAI
from llama_index import StorageContext, load_index_from_storage
from llama_index.response.schema import Response
import pandas as pd
import phoenix as px

## Download Your Knowledge Base

Download and unzip a pre-built knowledge base index of Wikipedia articles.

In [None]:
def download_file(url: str, output_path: str) -> None:
    """
    Downloads a file from the specified URL and saves to a local path.
    """
    urllib.request.urlretrieve(url, output_path)


def unzip_directory(zip_path: str, output_path: str) -> None:
    """
    Unzips a directory to a specified output path.
    """
    with zipfile.ZipFile(zip_path, "r") as f:
        f.extractall(output_path)


print("⏳ Downloading knowledge base...")
data_dir = tempfile.gettempdir()
zip_file_path = os.path.join(data_dir, "database_index.zip")
download_file(
    url="http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/database_index.zip",
    output_path=zip_file_path,
)

print("⏳ Unzipping knowledge base...")
index_dir = os.path.join(data_dir, "database_index")
unzip_directory(zip_file_path, index_dir)

print("✅ Done")

## Run Your Question-Answering Service

Set your OpenAI API key. You can skip this cell if the `OPENAI_API_KEY` environment variable is already set in your notebook environment.

In [None]:
os.environ["OPENAI_API_KEY"] = "copy paste your api key here"

assert (
    os.environ["OPENAI_API_KEY"] != "copy paste your api key here"
), "❌ Please set your OpenAI API key"
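
If you'd rather not paste the key into the notebook, one alternative is to prompt for it only when the environment variable is missing. The following is a minimal sketch under that assumption; `ensure_openai_api_key` is a hypothetical helper written for illustration and is not part of Phoenix, LlamaIndex, or the OpenAI client.

```python
import os
from getpass import getpass
from typing import Callable


def ensure_openai_api_key(prompt: Callable[[str], str] = getpass) -> str:
    """
    Returns the OpenAI API key from the environment, prompting for it
    (without echoing the input) only if OPENAI_API_KEY is not already set.
    """
    api_key = os.environ.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # The key is missing or empty, so ask the user for it.
        api_key = prompt("Enter your OpenAI API key: ").strip()
        if not api_key:
            raise ValueError("❌ Please set your OpenAI API key")
        os.environ["OPENAI_API_KEY"] = api_key
    return api_key
```

Calling `ensure_openai_api_key()` in place of the cell above keeps the key out of the saved notebook. The `prompt` parameter exists only so the helper can be exercised without interactive input.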

Start a LlamaIndex application from your pre-computed index.

In [None]:
storage_context = StorageContext.from_defaults(
    persist_dir=index_dir,
)
llm = OpenAI(temperature=0, model_name="gpt-4")
index = load_index_from_storage(storage_context, llm=llm)
query_engine = index.as_query_engine()

Ask a question of your question-answering service. See the response in addition to the retrieved context from your knowledge base (by default, LlamaIndex retrieves the two most similar entries to the query by cosine similarity).

In [None]:
def display_llama_index_response(response: Response) -> None:
    """
    Displays a LlamaIndex response and its source nodes (retrieved context).
    """

    print("Response")
    print("========")
    for line in textwrap.wrap(response.response.strip(), width=80):
        print(line)
    print()

    print("Source Nodes")
    print("============")
    print()

    for source_node in response.source_nodes:
        print(f"doc_id: {source_node.node.doc_id}")
        print(f"score: {source_node.score}")
        print()
        for line in textwrap.wrap(source_node.node.text, width=80):
            print(line)
        print()

query = "What is the name of the character Microsoft used to make Windows 8 seem more personable?"
response = query_engine.query(query)
display_llama_index_response(response)

Change the query in the cell above and re-run to ask another question of your choice. You can see example user queries in the `query_df` below.
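
To make the retrieval step more concrete, here is a minimal NumPy sketch of picking the two entries most similar to a query by cosine similarity. It illustrates the general idea only and is not LlamaIndex's actual implementation; the function name `top_k_by_cosine_similarity` and the toy three-dimensional vectors are made up for this example.

```python
import numpy as np


def top_k_by_cosine_similarity(query_vector, document_vectors, k=2):
    """
    Returns the indices and scores of the k document vectors most similar
    to the query vector, ordered from most to least similar.
    """
    query = np.asarray(query_vector, dtype=float)
    documents = np.asarray(document_vectors, dtype=float)
    # Cosine similarity is the dot product of the L2-normalized vectors.
    query = query / np.linalg.norm(query)
    documents = documents / np.linalg.norm(documents, axis=1, keepdims=True)
    scores = documents @ query
    # Sort ascending, reverse to descending, and keep the first k.
    top_indices = np.argsort(scores)[::-1][:k]
    return top_indices, scores[top_indices]


# Toy 3-dimensional "embeddings" (made up for illustration).
document_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query_vector = [1.0, 0.05, 0.0]

indices, scores = top_k_by_cosine_similarity(query_vector, document_vectors, k=2)
print(indices, scores)
```

The scores printed by `display_llama_index_response` play the same role as `scores` here: higher values mean the retrieved paragraph is closer to the query in embedding space.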

## Download and Inspect Your Data

Download your knowledge base (database) and user query data. This particular knowledge base consists of paragraphs from Wikipedia articles.

In [None]:
database_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/database.parquet"
)
query_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/query.parquet"
)

View a few database rows.

In [None]:
database_df.head()

The fields of the dataframe are:
- **article_index:** a unique index for each article in the knowledge base
- **paragraph_index:** the index of the paragraph in the article
- **granular_subject:** the subject of the Wikipedia article (e.g., "Beyoncé", "Liberia")
- **broad_subject:** a more general category to which the subject belongs (e.g., "Music", "Geography and Places")
- **text:** the text of the paragraph
- **text_vector:** the embedding vector representing that text

View a few query rows.

In [None]:
query_df.head()

The query dataframe has the same columns, except that it lacks the **article_index** and **paragraph_index** columns, and its **text** column contains user queries rather than paragraphs.
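
A quick way to confirm how two dataframes differ is to compare their column sets and check the embedding dimension. The sketch below uses small stand-in dataframes with made-up values so that it runs on its own; the column names follow the ones described above, and `toy_database_df` and `toy_query_df` are hypothetical names. The same expressions should apply to `database_df` and `query_df`.

```python
import pandas as pd

# Toy stand-ins for database_df and query_df (all values are made up).
toy_database_df = pd.DataFrame(
    {
        "article_index": [0],
        "paragraph_index": [0],
        "granular_subject": ["Beyoncé"],
        "broad_subject": ["Music"],
        "text": ["A paragraph about Beyoncé."],
        "text_vector": [[0.1, 0.2, 0.3]],
    }
)
toy_query_df = pd.DataFrame(
    {
        "granular_subject": ["Beyoncé"],
        "broad_subject": ["Music"],
        "text": ["When was Beyoncé born?"],
        "text_vector": [[0.3, 0.2, 0.1]],
    }
)

# Columns present in both dataframes, and those only in the database.
shared_columns = sorted(set(toy_database_df.columns) & set(toy_query_df.columns))
database_only_columns = sorted(
    set(toy_database_df.columns) - set(toy_query_df.columns)
)

# Every embedding should have the same length in both dataframes.
embedding_dims = set(toy_database_df["text_vector"].map(len)) | set(
    toy_query_df["text_vector"].map(len)
)

print("shared:", shared_columns)
print("database only:", database_only_columns)
print("embedding dimensions:", embedding_dims)
```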

## Launch Phoenix

Define a schema to tell Phoenix what the columns of your dataframes represent (features, predictions, actuals, tags, embeddings, etc.). See the [docs](https://docs.arize.com/phoenix/) for guides on how to define your own schema and API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`.

In [None]:
schema = px.Schema(
    embedding_feature_column_names={
        "text_embedding": px.EmbeddingColumnNames(
            vector_column_name="text_vector",
            raw_data_column_name="text",
        )
    },
    tag_column_names=["granular_subject", "broad_subject"],
)

Create Phoenix datasets that wrap your dataframes with schemas that describe them.

In [None]:
database_ds = px.Dataset(database_df, schema, name="database")
query_ds = px.Dataset(query_df, schema, name="query")

Launch Phoenix. Follow the instructions in the cell output to open the Phoenix UI.

In [None]:
session = px.launch_app(primary=query_ds, reference=database_ds)

## Investigate User Interests and Improve Your Knowledge Base

Click on "text_embedding" to go to the embeddings page.

![click on text embedding](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/click_on_text_embedding.png)

Increase the number of sampled points that appear in the point cloud to 2500.

![adjust number of samples for umap](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/adjust_number_of_samples_for_umap.png)

Inspect the clusters in the panel on the left. The top clusters contain mostly user queries and few database entries.

![investigate top clusters](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/investigate_top_clusters.png)

You can color the data by **granular_subject** to visualize the topics represented within each cluster. What topics are your users asking about that are not answered by your database?

![color by granular subject](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/color_by_granular_subject.png)

Congrats! You've found the topics your users are asking about that are not covered in your knowledge base (Richard Feynman, Neptune, and PlayStation 3). As an actionable next step, you can augment your knowledge base to cover these topics so your users get answers to the questions they are interested in.
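
The same gap can also be checked in pandas when the queries carry a subject label, as they do in this tutorial's data. The sketch below uses made-up stand-in dataframes (`toy_database_df` and `toy_query_df` are hypothetical names) and counts the queries whose **granular_subject** never appears in the knowledge base. Real user queries usually arrive unlabeled, which is why the embedding view in Phoenix is the more general tool.

```python
import pandas as pd

# Toy stand-ins for database_df and query_df (rows are made up).
toy_database_df = pd.DataFrame(
    {"granular_subject": ["Beyoncé", "Beyoncé", "Liberia"]}
)
toy_query_df = pd.DataFrame(
    {
        "granular_subject": [
            "Beyoncé",
            "Neptune",
            "Richard Feynman",
            "Neptune",
            "Liberia",
        ]
    }
)

# Subjects that have at least one paragraph in the knowledge base.
covered_subjects = set(toy_database_df["granular_subject"])

# Queries about subjects with no corresponding knowledge base entry.
uncovered_queries = toy_query_df[
    ~toy_query_df["granular_subject"].isin(covered_subjects)
]

# Number of uncovered queries per subject, most frequent first.
uncovered_counts = uncovered_queries["granular_subject"].value_counts()
print(uncovered_counts)
```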