<center>
    <p style="text-align:center">
        <img alt="arize llama-index logos" src="https://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/arize_llamaindex.png" width="400">
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluating and Improving a LlamaIndex Search and Retrieval Application</h1>

Imagine you're an engineer at Arize AI and you've built and deployed a documentation question-answering service using LlamaIndex. Users send questions about Arize's core product via a chat interface, and your service retrieves documents from your documentation in order to generate a response to the user. As the engineer in charge of evaluating and maintaining this system, you want to evaluate the quality of the responses from your service.

Phoenix helps you:
- identify gaps in your documentation
- detect queries for which the LLM gave bad responses
- detect failures to retrieve relevant context

In this tutorial, you will:

- Download an pre-indexed knowledge base of the Arize documentation and run a LlamaIndex application
- Visualize user queries and knowledge base documents to identify areas of user interest not answered by your documentation
- Find clusters of responses with negative user feedback
- Identify failed retrievals using cosine similarity, Euclidean distance, and LLM-assisted ranking metrics

Parts of this notebook require an [OpenAI API key](https://platform.openai.com/account/api-keys) to run. If you don't have an OpenAI key, you can still run Phoenix by skipping cells preceded by the 💭 emoji.


## Chatbot Architecture

Your chatbot was built using LlamaIndex's low-level API. The architecture of your chatbot is shown below and can be explained in five steps.

![llama-index chatbot architecture](http://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/llama_index_chatbot_architecture.png)

1. The user sends a query about Arize to your service.
1. `langchain.embeddings.OpenAIEmbeddings` makes a request to the OpenAI embeddings API to embed the user query using the `text-embedding-ada-002` model.
1. `llama_index.retrievers.RetrieverQueryEngine` does a similarity search against the entries of your index knowledge base for the two most similar pieces of context by cosine similarity.
1. `llama_index.indices.query.ResponseSynthesizer` generates a response by formatting the query and retrieved context into a single prompt and sending a request to OpenAI chat completions API with the `gpt-3.5-turbo`.
2. The response is returned to the user.

Phoenix makes your search and retrieval system *observable* by capturing the inputs and outputs of these steps for analysis, including:

- your query embeddings
- the retrieved context and similarity scores for each query
- the generated response that is return to the user

With that overview in mind, let's dive into the notebook.

## 1. Install Dependencies and Import Libraries

Install Phoenix and LlamaIndex.

In [None]:
!pip install -q "arize-phoenix[experimental]" gcsfs llama-index

Import libraries.

In [1]:
import os
import textwrap
from datetime import timedelta

import numpy as np
import openai
import pandas as pd
import phoenix as px
from gcsfs import GCSFileSystem
from IPython.display import YouTubeVideo
from langchain.chat_models import ChatOpenAI
from llama_index import LLMPredictor, ServiceContext, StorageContext, load_index_from_storage
from llama_index.callbacks import CallbackManager, OpenInferenceCallbackHandler
from llama_index.callbacks.open_inference_callback import as_dataframe
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.graph_stores.simple import SimpleGraphStore
from phoenix.experimental.evals.retrievals import (
    classify_relevance,
    compute_precisions_at_k,
)

pd.set_option("display.max_colwidth", 1000)

## 2. Configure Your OpenAI API Key

💭 Configure your OpenAI API key.

In [None]:
openai_api_key = ""
assert openai_api_key != "", "❌ Please set your OpenAI API key"
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

## 3. Download Your Knowledge Base

Download your pre-built index from cloud storage and instantiate your storage context.

In [2]:
# file_system = GCSFileSystem(project="public-assets-275721")
# index_path = "arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/index/"
index_path = "/Users/xandersong/Desktop/index3"
storage_context = StorageContext.from_defaults(
    # fs=file_system,
    persist_dir=index_path,
)

Download and unzip a pre-built knowledge base index consisting of chunks of the Arize documentation.

## 4. Run Your Question-Answering Service

💭 Start a LlamaIndex application from your downloaded index. Use the `OpenInferenceCallbackHandler` to store your data in [OpenInference format](https://github.com/Arize-ai/open-inference-spec), an open standard for capturing and storing AI model inferences that enables production LLMapp servers to seamlessly integrate with LLM observability solutions such as Arize and Phoenix.

In [None]:
callback_handler = OpenInferenceCallbackHandler()
service_context = ServiceContext.from_defaults(
    llm_predictor=LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)),
    embed_model=OpenAIEmbedding(model="text-embedding-ada-002"),
    callback_manager=CallbackManager(handlers=[callback_handler]),
)
index = load_index_from_storage(
    storage_context,
    service_context=service_context,
)
query_engine = index.as_query_engine()

💭 Ask a few questions of your question-answering service and view the responses.

In [None]:
max_line_length = 80
for query in [
    "How do I get an Arize API key?",
    "Can I create monitors with an API?",
    "How do I need to format timestamps?",
    "What is the price of the Arize platform",
]:
    print("Query")
    print("=====")
    print()
    print(textwrap.fill(query, max_line_length))
    print()
    response = query_engine.query(query)
    print("Response")
    print("========")
    print()
    print(textwrap.fill(str(response), max_line_length))
    print()

## 5. Load Your Data Into Pandas Dataframes

To use Phoenix, you must load your data into pandas dataframes. 

💭 Your query data is saved in a buffer on the callback handler you defined in step 4. Load the data from the buffer into a dataframe.

In [None]:
query_data_buffer = callback_handler.flush_query_data_buffer()
sample_query_df = as_dataframe(query_data_buffer)
sample_query_df

The columns of the dataframe are:

- **:id.id:**: the query ID
- **:timestamp.iso_8601:**: the time at which the query was made
- **:feature.text:prompt**: the query text
- **:feature.[float].embedding:prompt**: the embedding representation of the query
- **:prediction.text:response**: the final response presented to the user
- **:feature.[str].retrieved_document_ids:prompt**: the list of IDs of the retrieved documents
- **:feature.[float].retrieved_document_scores:prompt**: the lists of cosine similarities between the query and retrieved documents

The column names are in OpenInference format and describe the category, data type and intent of each column.

Running queries against a large dataset takes a long time. Download a dataframe containing query data.

In [3]:
query_df = pd.read_parquet(
    # "http://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/llama-index/arize-docs/query_data_complete3.parquet",
    "/Users/xandersong/Desktop/query_data_complete4.parquet"
)
query_df.head()

Unnamed: 0,:id.id:,:timestamp.iso_8601:,:feature.text:prompt,:feature.[float].embedding:prompt,:prediction.text:response,:feature.[str].retrieved_document_ids:prompt,:feature.[float].retrieved_document_scores:prompt,:tag.bool:relevance_0,:tag.bool:relevance_1,:tag.float:precision_at_1,:tag.float:precision_at_2,:tag.float:document_similarity_0,:tag.float:document_similarity_1,:tag.float:user_feedback
0,98e4fbe9-ed44-4195-b39f-9beef3bc11cf,2023-08-08T16:11:08.233322,How do I use the SDK to upload a ranking model?,"[-0.009529398754239082, 0.008511829189956188, 0.013249027542769909, -0.0025473609566688538, 0.009852546267211437, 0.02994954027235508, -0.019155055284500122, -0.019058797508478165, 0.005679136607795954, -0.035037387162446976, 0.02468293160200119, -0.013097766786813736, -0.016803644597530365, -0.019595084711909294, -0.022647792473435402, -0.008724968880414963, 0.013407163321971893, 0.00344117172062397, 0.008099301718175411, 0.002243121387436986, -0.01655612699687481, 0.01801372691988945, 0.0005272624548524618, -0.014534739777445793, -0.011660793796181679, 0.015236037783324718, 0.021685225889086723, -0.01333153247833252, 0.016322361305356026, 0.0020145121961832047, 0.0010055372258648276, 0.0113995261490345, -0.02113518863916397, -0.008030546829104424, -0.01827499456703663, 0.012898378074169159, -0.009103120304644108, 0.00855308212339878, 0.010278824716806412, -0.02312907576560974, 0.020983928814530373, 0.01849501021206379, 0.0021485837642103434, -0.02237277291715145, -0.0044621787965...",assistant: The given context information does not provide any information about using the SDK to upload a ranking model.,"[9a54577a2179bcb66b90d792c2f162ffbce2ea50fa742ec016022a3bfc3f6477, c2a25c7ac74006acabf34e244b62aee62966f6f75abfa7948e14b75de33afb46]","[0.8274160459205127, 0.8192319406200022]",False,False,0.0,0.0,0.827416,0.819232,-1.0
1,bb1ac747-ccf6-498b-9079-dfe075029562,2023-08-08T16:11:09.797790,What drift metrics are supported in Arize?,"[-0.009346794337034225, -0.00042854511411860585, -0.0009247552370652556, -0.017379986122250557, -0.0047960965894162655, 0.030313927680253983, -0.016629355028271675, 0.006423665676265955, -0.027657851576805115, -0.017495466396212578, 0.01267410907894373, 0.019270997494459152, -0.01836157962679863, -0.005113671068102121, -0.01797182857990265, 0.005756037775427103, 0.0026596863754093647, 0.003597974544391036, 0.02191263996064663, -0.006073612254112959, -0.015792112797498703, -0.017293374985456467, -0.003929984290152788, 0.008646687492728233, -0.011115106754004955, 0.0181161817163229, 0.01463007926940918, -0.005117279943078756, 0.015128093771636486, -0.02220134437084198, 0.04408511146903038, 0.027499062940478325, -0.00033742288360372186, 0.014211458154022694, -0.034817710518836975, -0.011475986801087856, 0.007636222988367081, 0.020497988909482956, 0.028119778260588646, -0.011779126711189747, 0.0045903949066996574, -0.017235632985830307, 0.0013929972192272544, -0.005752428900450468, 0.0...",assistant: The drift metrics supported in Arize are PSI.,"[5b7de32306fe8845039b3292b677fe0018226967a7b8a6c8181bfc40ac0fcf71, d8796ab7c6ad95f29525b4677f1f480ffd143d5963fdcdb4d859c0b934e54b08]","[0.8820403932197866, 0.8806944536173509]",True,True,1.0,1.0,0.88204,0.880694,
2,38cb85a8-71c4-456b-b064-684ac961a507,2023-08-08T16:11:10.885690,Does Arize support batch models?,"[-0.016511010006070137, -0.000836528604850173, -0.016215920448303223, -0.007496701553463936, 0.0061793336644768715, 0.018225345760583878, 0.026698654517531395, 0.0029333392158150673, -0.008684088476002216, -0.035551365464925766, 0.007904207333922386, 0.025855539366602898, 0.0010196426883339882, -0.008346842601895332, -0.01403084583580494, 0.004577414132654667, 0.012492160312831402, 0.017831891775131226, 0.023171622306108475, -0.006569274235516787, -0.013180704787373543, -0.008740296587347984, 0.008466283790767193, -0.019827265292406082, 0.008951075375080109, 0.03175734728574753, 0.01712929457426071, 0.00010313892562408, 0.00959746353328228, 0.001536050927825272, 0.029874389991164207, 0.026867277920246124, 0.005013023968786001, -0.02311541512608528, -0.02837083488702774, -0.00782692153006792, 0.0023800446651875973, 0.013876275159418583, 0.01722765900194645, -0.0036324223037809134, 0.039991773664951324, -0.021766429767012596, 0.01748059317469597, 0.009147802367806435, -0.003293419722...","assistant: Based on the context information provided, it is not explicitly mentioned whether Arize supports batch models or not.","[26fe9c78f9cf1c571575919a69b68275895791a447736cfdb0a545dec7052b2f, adbfc794d55b50c07ce7b7bf7c2283181af99ea943cc5734bd05991f975f9e1a]","[0.8478920291443393, 0.846377784838003]",False,False,0.0,0.0,0.847892,0.846378,
3,ed44ecc6-5aac-4961-9b89-8230b6d722bc,2023-08-08T16:11:12.514544,Does Arize support training data?,"[-0.02076658234000206, -0.008881663903594017, -0.010041688568890095, -0.012817208655178547, 0.009273082949221134, 0.04059375822544098, 0.022616928443312645, 0.00037496205186471343, -0.01182798482477665, -0.03253763169050217, 0.009493701159954071, 0.028353001922369003, -0.00714874267578125, -0.009878003969788551, -0.016496550291776657, 0.009650268591940403, 0.01914397068321705, 0.01123729720711708, 0.009934937581419945, -0.028353001922369003, -0.008184225298464298, -0.005284162703901529, 0.004131254740059376, -0.011607366614043713, 0.015984147787094116, 0.019115503877401352, 0.01851769909262657, -0.015372109599411488, 0.008952830918133259, -0.0059958347119390965, 0.047511205077171326, 0.036949995905160904, 0.006999291479587555, -0.014027049764990807, -0.03965435177087784, -0.026787323877215385, 0.0072768437676131725, 0.009394067339599133, 0.032452233135700226, 0.006565171759575605, 0.02923547476530075, -0.02293006330728531, 0.009052464738488197, -0.00987088680267334, -0.011991669423...","assistant: Based on the given context information, it is not mentioned whether Arize supports training data or not.","[adbfc794d55b50c07ce7b7bf7c2283181af99ea943cc5734bd05991f975f9e1a, 329ed39cce3f91a1aa503c21ed2396dfcae103087374b2d22dc60414462ca971]","[0.8634916990872465, 0.8541730734667358]",False,False,0.0,0.0,0.863492,0.854173,-1.0
4,79db86ef-584f-46d2-a920-0c35fc130a07,2023-08-08T16:11:13.941522,How do I configure a threshold if my data has seasonality trends?,"[-0.0010318977292627096, -0.011036744341254234, 0.023051166906952858, 0.00039546037442050874, 0.01626250147819519, 0.023464269936084747, -0.024235395714640617, -0.005728366319090128, -0.017309030517935753, -0.026397302746772766, 0.009990215301513672, 0.02944049797952175, -0.004458073526620865, -0.0005680170725099742, 0.009542686864733696, 0.004027757328003645, 0.018878823146224022, -0.0002870207536034286, 0.02944049797952175, -0.021674154326319695, -0.034618061035871506, 0.015629075467586517, -0.00647539459168911, -0.005057073198258877, -0.010809537023305893, -0.003518263343721628, 0.03381939232349396, -0.0332961305975914, -0.014444846659898758, 0.015298593789339066, 0.039988402277231216, -0.02676909603178501, -0.008103710599243641, -0.010423974134027958, -0.013838961720466614, 0.004767900798469782, 0.010093491524457932, 0.0036112116649746895, 0.03717929869890213, -0.007518480531871319, 0.01806638576090336, 0.006413429509848356, -0.008509928360581398, -0.019828960299491882, 0.02507...","assistant: If your data has seasonality trends, you can configure a threshold by using either the automatic threshold or the custom threshold option. \n\nFor the automatic threshold, you need at least 14 days of production data to determine a trend. The auto threshold value will be calculated dynamically based on the data points. However, it is important to note that the auto threshold may not work well with seasonality trends.\n\nIf you already know the precise threshold value to use for your data with seasonality trends, you can set a custom threshold. This allows you to manually define the threshold value for additional flexibility.\n\nTo configure the threshold, you can adjust the number of standard deviations used in the calculation. Lowering the number of standard deviations will increase the sensitivity of the threshold, while increasing the standard deviation number will decrease the sensitivity.\n\nBy configuring the threshold appropriately, you can monitor your data and r...","[8a22f0af802ccd918d3cc88d41cb8c32a1087b2d3d3e2b64f6e62c2d1ba9e88c, 9eab2317d0ffc12d7386dc34312151aeafc706c6fa59297f2a40f69ef9ea7894]","[0.8167743979555812, 0.8139839429154886]",False,False,0.0,0.0,0.816774,0.813984,-1.0


In addition to the columns of the previous dataframe, this data has a few additional fields:

- **:tag.float:user_feedback:** approval or rejection from the user (-1 means thumbs down, +1 means thumbs up)
- **:tag.str:openai_relevance_0:** a binary classification (relevant vs. irrelevant) by GPT-4 predicting whether the first retrieved document is relevant to the query
- **:tag.str:openai_relevance_1:** a binary classification (relevant vs. irrelevant) by GPT-4 predicting whether the second retrieved document is relevant to the query

We'll go over how to compute the relevance classifications in section 6.

Next load your knowledge base into a dataframe.

In [4]:
def storage_context_to_dataframe(storage_context: StorageContext) -> pd.DataFrame:
    """Converts the storage context to a pandas dataframe.

    Args:
        storage_context (StorageContext): Storage context containing the index
        data.

    Returns:
        pd.DataFrame: The dataframe containing the index data.
    """
    document_ids = []
    document_texts = []
    document_embeddings = []
    docstore = storage_context.docstore
    vector_store = storage_context.vector_store
    for node_id, node in docstore.docs.items():
        document_ids.append(node.hash)  # use node hash as the document ID
        document_texts.append(node.text)
        document_embeddings.append(np.array(vector_store.get(node_id)))
    return pd.DataFrame(
        {
            "document_id": document_ids,
            "text": document_texts,
            "text_vector": document_embeddings,
        }
    )


database_df = storage_context_to_dataframe(storage_context)
database_df = database_df.drop_duplicates(subset=["text"])
database_df.head()

Unnamed: 0,document_id,text,text_vector
0,bdc2809cf2877895c36eb8c2b15f299513a32e3362581f8350129c99251d3886,"Arize AI\nML Observability Platform for real-time monitoring, analysis, and explainability\nArize is the \nmachine learning observability platform\n for ML practitioners to monitor, troubleshoot, and explain models. Data Science and ML Engineering teams of all sizes (from individuals to enterprises) use Arize to:\nEvaluate, monitor, and troubleshoot LLM applications\nMonitor real-time model performance, with support for delayed ground truth/feedback\nRoot cause model failures/performance degradation using tracing and explainability\nConduct multi-model performance comparisons\nSurface drift, data quality, and model fairness/bias metrics \nArize Product Demo\nWhat am I logging to Arize?\nThe Arize platform logs model inferences across training, validation and production environments. Check out how Arize and ML Observability fit into your \nML workflow here\n. \nHow Does Arize Fit Into ML Stack","[0.009107207879424095, -0.011134054511785507, 0.0016833710251376033, -0.016173966228961945, 0.015997126698493958, 0.01911221258342266, 0.010324675589799881, 0.018445666879415512, -0.010991223156452179, -0.029028799384832382, 0.0026440827641636133, 0.006662068422883749, -0.026117756962776184, -0.012120272032916546, -0.04042811319231987, 0.0034109519328922033, 0.012956855818629265, -0.0038224426098167896, 0.012875238433480263, -0.0065634469501674175, -0.03245675563812256, -0.0050841206684708595, -0.004584210459142923, 0.01002540998160839, 0.0026559855323284864, 0.017221396788954735, 0.00906639825552702, -0.018744932487607002, -0.0013126893900334835, 0.010127432644367218, 0.019588317722082138, 0.012235897593200207, 0.0033429369796067476, -0.00959011446684599, -0.01862250454723835, 0.00935206189751625, 0.013521380722522736, 0.00850867573171854, 0.005114727653563023, -0.028076589107513428, 0.030470717698335648, -0.015983523800969124, -0.0018568093655630946, -0.0028583300299942493, -0.00..."
1,8b7413f49224a5805296fe5b2513c5b8a9f84e03d0f97ab2cb015f63911eddfd,"ML workflow here\n. \nHow Does Arize Fit Into ML Stack\nYour ML Stack might already include a feature store, model store, and serving layer. Once your models are deployed into production, ML Observability provides deep understanding of your model’s performance, and root causing exactly why it’s behaving in certain ways. This is where an \ninference/evaluation store\n can help.\nML Canonical Stack featuring Feature, Model, and Evaluation Store\nPlatform and Model Agnostic \nArize is an \nopen platform\n that works with your machine learning infrastructure, and can be deployed as SaaS or \non-premise\n.\nOpen Platform designed to work across platforms and model frameworks\nNext\nWhat is ML Observability?\nLast modified \n13d ago","[0.012960536405444145, -0.0005041257245466113, -0.012926212511956692, -0.006367000751197338, 0.010551029816269875, 0.005680531729012728, 0.013255718164145947, 0.016845950856804848, -0.005536373239010572, -0.007180466782301664, 0.014251098036766052, 0.004674854222685099, -0.020525425672531128, -0.00886231567710638, -0.03306034952402115, 0.0033259426709264517, 0.006960796657949686, -0.007784559391438961, 0.00254336791113019, -0.011230634525418282, -0.027856916189193726, -0.007187331095337868, -0.003950629383325577, -0.006174789275974035, 0.006559212226420641, 0.018424829468131065, 0.017230374738574028, -0.027554869651794434, -0.004228649660944939, 0.010358818806707859, 0.021651234477758408, 0.01145716942846775, 6.78089327266207e-06, -0.025468003004789352, -0.018397372215986252, -0.015116048976778984, 0.009020203724503517, 0.001169571653008461, -0.0012176245218142867, -0.02723909355700016, 0.04052913561463356, -0.0245069470256567, -0.010393141768872738, -0.0101254191249609, -0.0052343..."
2,3a68817c7523aef177452990617c9eaf0108877ccb8502a7c3b848572c0baf48,"All Tutorials/Notebooks\nExample tutorials of how to use and troubleshoot with Arize.\nAccess tutorials of what's possible with Arize below:\n1.\n​\nModel Type Examples\n​\n2.\n​\nExplainability Tutorials\n​\n3.\n​\nCloud Storage Examples\n​\nModel Type Examples\nYour model type determines which performance metrics are available to you. Learn more about model types \nhere\n.\nModel Type\nPandas Batch\nPython Single Record\nCSV\nParquet \nBinary Classification (Only Classification Metrics)\n​\nColab Link\n​\n​\nColab Link\n​\n​\n​\nDownload\n​\n File\n\n*Open Parquet Reader \nHere\n​\nBinary Classification (Classification, AUC/Log Loss Metrics) \n​\nColab Link\n​\n​\nColab Link\n​\n​\n​\nDownload File","[0.001906583202071488, 0.007040511351078749, 0.020322686061263084, -0.004288924392312765, 0.007306793704628944, 0.0031936157029122114, 0.021998491138219833, -0.0025527621619403362, -0.027735993266105652, -0.04590001329779625, 0.002501280978322029, 0.002080554375424981, -0.030022472143173218, -0.0007660061819478869, -0.02985205128788948, -0.006547000724822283, 0.005865317303687334, 0.0011689804960042238, 0.03220953792333603, -0.01076491642743349, -0.0339137464761734, 0.010480881668627262, -0.015252665616571903, -0.008215704932808876, -0.004246319178491831, 0.02671346813440323, 0.022736981511116028, -0.017098890617489815, 0.0034510220866650343, -0.009841803461313248, 0.024015137925744057, 0.023134630173444748, 0.004743380006402731, -0.03184029459953308, 0.003498953068628907, 0.0033977655693888664, 0.03203911706805229, 0.0292555782943964, 0.003727955976501107, -0.015991155058145523, 0.021785464137792587, -0.030391717329621315, 0.005151680205017328, 0.0016766926273703575, -0.0064049833..."
3,66741b24f57c0b8753142bdb98b46f54ceb893f40de9db2fc46adeca47cf1863,"*Open Parquet Reader \nHere\n​\nBinary Classification (Classification, AUC/Log Loss, Regression) \n​\nColab Link\n​\n​\nColab Link\n​\n​\nDownload File\n​\n​\nDownload File\n\n*Open Parquet Reader \nHere\n​\nMulticlass Classification (Only Classification Metrics)\n​\nColab Link\n​\n​\nColab Link\n​\n​\n​\nDownload \n​\nFile\n\n*Open Parquet Reader \nHere\n​\nMulticlass Classification (Classification, AUC/Log Loss Metrics)\n​\nColab Link\n​\n​\nColab Link\n​\n​\n​\nDownload \n​\nFile\n\n*Open Parquet Reader \nHere\n​\nRegression\n​\nColab Link\n��​\n​\nColab Link\n​\n​\nDownload File\n​\n​\nDownload \n​\nFile","[-0.0067940098233520985, 0.004925481975078583, 0.025114698335528374, -0.008722133934497833, 0.007284805178642273, 0.007607327774167061, 0.011105997487902641, -0.007656407542526722, -0.03603840246796608, -0.011912303976714611, 0.018495973199605942, 0.01714979112148285, -0.0270498339086771, -0.015032360330224037, -0.03157917410135269, 0.003940385300666094, 0.016462678089737892, -0.01514454185962677, 0.014527542516589165, -0.021945562213659286, -0.028101539239287376, 0.03730044513940811, 0.0039964765310287476, -0.015396950766444206, -0.005475873593240976, 0.010699338279664516, 0.024287357926368713, -0.032925356179475784, 0.010271645151078701, 0.001134087797254324, 0.018229540437459946, 0.016560837626457214, -0.0013531928416341543, -0.04010498896241188, -0.003042931202799082, 0.008652021177113056, 0.027694879099726677, 0.03760894760489464, 0.006723896134644747, -0.011091974563896656, 0.009233963675796986, -0.030990220606327057, -0.0055074249394237995, -0.01105691771954298, 0.0101314177..."
4,50a8e9ae86a940481f313d3f9d817ab28f51bfeecd17c1f5c71b12370fc92abe,*Open Parquet Reader \nHere\n​\nTimeseries Forecasting \n​\nColab Link\n​\n​\nColab Link\n​\n​\n​\nDownload \n​\nFile\n\n*Open Parquet Reader \nHere\n​\nRanking with Relevance Score\n​\nColab Link\n​\n​\nColab Link\n​\n​\n​\nDownload \n​\nFile\n\n*Open Parquet Reader \nHere\n​\nRanking with Single Label\n​\nColab Link\n​\n​\nColab Link\n​\n​\n​\nDownload \n​\nFile\n\n*Open Parquet Reader \nHere\n​\nRanking with Multiple Labels\n​\nColab Link\n​\n​\nColab Link\n​\n​\n​\nDownload \n​\nFile,"[-0.02158258855342865, -0.005794007331132889, 0.01560718473047018, 0.0035087710712105036, 0.01357981562614441, 0.0005290721892379224, 0.012683505192399025, -0.011481311172246933, -0.03633614256978035, -0.018310343846678734, 0.02095659449696541, 0.004076078999787569, -0.02280612289905548, -0.005000843666493893, -0.03479961305856705, -0.0054916804656386375, 0.014838919043540955, -0.019035927951335907, 0.013636724092066288, -0.01832457073032856, -0.01315300166606903, 0.009432600811123848, -0.0007064669625833631, -0.009717144072055817, 0.0018281888915225863, 0.02151145227253437, 0.025765370577573776, -0.03497033566236496, 0.009290330111980438, -0.0007887177052907646, 0.026462500914931297, 0.013380635529756546, 0.006928622722625732, -0.028596574440598488, -0.010193753987550735, -0.005015070550143719, 0.016247406601905823, 0.019875330850481987, 0.004058294929563999, -0.02136918157339096, 0.0052676028572022915, -0.04199855029582977, 0.0006379988044500351, 0.0001955121842911467, -0.0053707..."


The columns of your dataframe are:

- **document_id:** the ID of the chunked document
- **text:** the chunked text in your knowledge base
- **text_vector:** the embedding vector for the text, computed during the LlamaIndex build using "text-embedding-ada-002" from OpenAI

The query and database datasets are drawn from different distributions; the queries are short questions while the database entries are several sentences to a paragraph. The embeddings from OpenAI's "text-embedding-ada-002" capture these differences and naturally separate the query and context embeddings into distinct regions of the embedding space. When using Phoenix, you want to "overlay" the query and context embedding distributions so that queries appear close to their retrieved context in the Phoenix point cloud. To achieve this, we compute a centroid for each dataset that represents an average point in the embedding distribution and center the two distributions so they overlap.

In [5]:
database_centroid = database_df["text_vector"].mean()
database_df["text_vector"] = database_df["text_vector"].apply(lambda x: x - database_centroid)

query_centroid = query_df[":feature.[float].embedding:prompt"].mean()
query_df[":feature.[float].embedding:prompt"] = query_df[":feature.[float].embedding:prompt"].apply(
    lambda x: x - query_centroid
)

## 6. Run LLM-Assisted Evaluations

Cosine similarity and Euclidean distance are reasonable proxies for retrieval quality, but they don't always work perfectly. A novel idea is to use LLMs to measure retrieval quality by simply asking the LLM whether each retrieved document is relevant to the corresponding query.

💭 Use OpenAI to predict whether each retrieved document is relevant or irrelevant to the query.

⚠️ It's strongly recommended to use GPT-4 for evaluations if you have access.

In [None]:
evals_model_name = "gpt-3.5-turbo"
# evals_model_name = "gpt-4"  # use GPT-4 if you have access
query_texts = sample_query_df[":feature.text:prompt"].tolist()
list_of_document_id_lists = sample_query_df[":feature.[str].retrieved_document_ids:prompt"].tolist()
document_id_to_text = dict(zip(database_df["document_id"].to_list(), database_df["text"].to_list()))

first_document_texts, second_document_texts = [
    [
        document_id_to_text[document_ids[document_index]]
        for document_ids in list_of_document_id_lists
    ]
    for document_index in [0, 1]
]
first_document_relevances, second_document_relevances = [
    [
        classify_relevance(query_text, document_text, evals_model_name)
        for query_text, document_text in zip(query_texts, first_document_texts)
    ]
    for document_texts in [first_document_texts, second_document_texts]
]


sample_query_df = sample_query_df.assign(
    retrieved_document_text_0=first_document_texts,
    retrieved_document_text_1=second_document_texts,
    relevance_0=first_document_relevances,
    relevance_1=second_document_relevances,
)
sample_query_df[
    [
        ":feature.text:prompt",
        "retrieved_document_text_0",
        "retrieved_document_text_1",
        "relevance_0",
        "relevance_1",
    ]
].rename(columns={":feature.text:prompt": "query_text"})

## 7. Compute Ranking Metrics

Now that you know whether each piece of retrieved context is relevant or irrelevant to the corresponding query, you can compute precision@k for k = 1, 2 for each query. This metric tells you what percentage of the retrieved context is relevant to the corresponding query.

precision@k = (# of top-k retrieved documents that are relevant) / (k retrieved documents)

If your precision@2 is greater than zero for a particular query, your LlamaIndex application successfully retrieved at least one relevant piece of context with which to answer the query. If the precision@k is zero for a particular query, that means that no relevant piece of context was retrieved.

Compute precision@k for k = 1, 2 and view the results.

In [8]:
first_document_relevances = [
    {"relevant": True, "irrelevant": False}.get(rel)
    for rel in sample_query_df[":tag.bool:relevance_0"].tolist()
]
second_document_relevances = [
    {"relevant": True, "irrelevant": False}.get(rel)
    for rel in sample_query_df[":tag.bool:relevance_1"].tolist()
]
list_of_precisions_at_k_lists = [
    compute_precisions_at_k([rel0, rel1])
    for rel0, rel1 in zip(first_document_relevances, second_document_relevances)
]
precisions_at_1, precisions_at_2 = [
    [precisions_at_k[index] for precisions_at_k in list_of_precisions_at_k_lists]
    for index in [0, 1]
]
sample_query_df[f":tag.float:precision_at_1"] = precisions_at_1
sample_query_df[f":tag.float:precision_at_2"] = precisions_at_2
sample_query_df[
    [
        ":tag.bool:relevance_0",
        ":tag.bool:relevance_1",
        ":tag.float:precision_at_1",
        ":tag.float:precision_at_2",
    ]
]

Unnamed: 0,:tag.bool:relevance_0,:tag.bool:relevance_1,:tag.float:precision_at_1,:tag.float:precision_at_2
0,False,False,,
1,True,True,,
2,False,False,,
3,False,False,,
4,False,False,,
...,...,...,...,...
148,False,False,,
149,False,False,,
150,False,False,,
151,False,False,,


## 7. Launch Phoenix

Define your knowledge base dataset with a schema that specifies the meaning of each column (features, predictions, actuals, tags, embeddings, etc.). See the [docs](https://docs.arize.com/phoenix/) for guides on how to define your own schema and API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`.

In [None]:
# get a random sample of 500 documents (including retrieved documents)
# this will be handled by by the application in a coming release
num_sampled_point = 500
retrieved_document_ids = set(
    [
        doc_id
        for doc_ids in query_df[":feature.[str].retrieved_document_ids:prompt"].to_list()
        for doc_id in doc_ids
    ]
)
retrieved_document_mask = database_df["document_id"].isin(retrieved_document_ids)
num_retrieved_documents = len(retrieved_document_ids)
num_additional_samples = num_sampled_point - num_retrieved_documents
unretrieved_document_mask = ~retrieved_document_mask
sampled_unretrieved_document_ids = set(
    database_df[unretrieved_document_mask]["document_id"]
    .sample(n=num_additional_samples, random_state=0)
    .to_list()
)
sampled_unretrieved_document_mask = database_df["document_id"].isin(
    sampled_unretrieved_document_ids
)
sampled_document_mask = retrieved_document_mask | sampled_unretrieved_document_mask
sampled_database_df = database_df[sampled_document_mask]

In [None]:
database_schema = px.Schema(
    prediction_id_column_name="document_id",
    prompt_column_names=px.EmbeddingColumnNames(
        vector_column_name="text_vector",
        raw_data_column_name="text",
    ),
)
database_ds = px.Dataset(
    dataframe=sampled_database_df,
    schema=database_schema,
    name="database",
)

Define your query dataset. Because the query dataframe is in OpenInference format, Phoenix is able to infer the meaning of each column without a user-defined schema by using the `phoenix.Dataset.from_open_inference` class method.

In [None]:
query_ds = px.Dataset.from_open_inference(query_df)

Launch Phoenix. Follow the instructions in the cell output to open the Phoenix UI.

In [None]:
session = px.launch_app(primary=query_ds, corpus=database_ds)

## 8. Surface Problematic Clusters and Data Points

Phoenix helps you:

- visualize your embeddings
- color the resulting point cloud using evaluation metrics
- cluster the points and surface up problematic clusters based on whatever metric you care about

Follow along with the tutorial walkthrough [here](https://youtu.be/hbQYDpJayFw?t=1782), or view the video in your notebook by running the cell below. The video will show you how to investigate your query and knowledge base and identify problematic clusters of data points using Phoenix.

In [None]:
start_time_in_seconds = int(timedelta(hours=0, minutes=29, seconds=42).total_seconds())
YouTubeVideo("hbQYDpJayFw", start=start_time_in_seconds, width=560, height=315)

Congrats! You've identified a problematic cluster of queries. You now have tools at your disposal to investigate clusters of queries where your search and retrieval application is performing poorly based on:

- query percentage
- user feedback
- LLM-assisted ranking metrics

As an actionable next step, you should augment your knowledge base to include information about the pricing and cost of the Arize platform, since your users seem especially interested in this topic.