<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluating and Improving Search and Retrieval Applications</h1>

Imagine you're an engineer at Arize AI and you've built and deployed a documentation question-answering service using LangChain and Qdrant. Users send questions about Arize's core product via a chat interface, and your service retrieves chunks of your indexed documentation in order to generate a response to the user. As the engineer in charge of maintaining this system, you want to evaluate the quality of the responses from your service.

Phoenix helps you:
- identify gaps in your documentation
- detect queries for which the LLM gave bad responses
- detect failures to retrieve relevant documents

In this tutorial, you will:

- Ask questions of a LangChain application backed by Qdrant over a knowledge base of the Arize documentation
- Use Phoenix to visualize user queries and knowledge base documents to identify areas of user interest not answered by your documentation
- Find clusters of responses with negative user feedback
- Identify failed retrievals using query density, cosine similarity, query distance, and LLM-assisted ranking metrics



## Chatbot Architecture

The architecture of your chatbot is shown below and can be explained in five steps.

![chatbot architecture diagram](http://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/langchain-pinecone-search-and-retrieval/langchain_pinecone_openai_chatbot_architecture.png)

1. The user sends a query about Arize to your service.
1. `langchain.embeddings.OpenAIEmbeddings` makes a request to OpenAI to embed the user query using the text-embedding-ada-002 model.
1. The retriever searches your Qdrant database for the pieces of context most similar to the query embedding, selecting them by maximal marginal relevance (MMR).
1. `langchain_openai.ChatOpenAI` generates a response by formatting the query and retrieved context into a single prompt and sending a request to OpenAI with the gpt-4-turbo-preview model.
1. The response is returned to the user.
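
To make the MMR retrieval in the steps above more concrete, here is a minimal pure-Python sketch of maximal marginal relevance. It is an illustration only, not LangChain's or Qdrant's implementation: the helper names `cosine` and `mmr_select`, the toy 2-D vectors, and the `lambda_mult` values are all assumptions chosen for the example. Each pick balances similarity to the query against similarity to the documents already selected.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(
    query: Sequence[float],
    candidates: List[Sequence[float]],
    k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Return the indices of up to k candidates chosen by MMR (hypothetical helper)."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            # Redundancy: highest similarity to any document already picked.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 is different.
query = [1.0, 0.0]
candidates = [[1.0, 0.1], [1.0, 0.11], [0.6, 0.8]]

print(mmr_select(query, candidates, k=2, lambda_mult=1.0))  # pure relevance
print(mmr_select(query, candidates, k=2, lambda_mult=0.3))  # favors diversity
```

With `lambda_mult=1.0` the selection reduces to plain similarity ranking, so the two near-duplicates are returned. Lowering it penalizes redundancy, and the second pick becomes the dissimilar candidate.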

Phoenix makes your search and retrieval system observable by capturing the inputs and outputs of these steps for analysis, including:

- your query embeddings
- the retrieved documents and similarity scores (relevance) to each query
- the generated response that is returned to the user

With that overview in mind, let's dive into the notebook.

## 1. Install needed dependencies and import relevant packages

---

In [None]:
!pip install --upgrade langchain qdrant-client langchain_community tiktoken cohere langchain-openai protobuf==3.20.3 "arize-phoenix[experimental,llama-index]" "openai>=1"



Import libraries.

In [None]:
# Standard library imports
import json
import os
import urllib
from getpass import getpass
from urllib.request import urlopen
import logging

# Third-party library imports
import nest_asyncio
import numpy as np
import pandas as pd
import openai

# Phoenix imports
import phoenix as px
from phoenix.experimental.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    RelevanceEvaluator,
    run_evals,
)
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents
from phoenix.trace import DocumentEvaluations, SpanEvaluations
from phoenix.trace.langchain import LangChainInstrumentor

# LangChain imports
from langchain.chains import RetrievalQA
from langchain.document_loaders import GitbookLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import Document
from langchain.chat_models import ChatOpenAI as ChatOpenAI_LangChain
from langchain.callbacks import StdOutCallbackHandler
from langchain_openai import ChatOpenAI
from langchain_community.vectorstores import Qdrant

# Miscellaneous imports
from tqdm import tqdm

# Configuration and Initialization
logging.basicConfig(level=logging.DEBUG)
nest_asyncio.apply()
pd.set_option("display.max_colwidth", None)


## 2. Configure Your OpenAI API Key

---

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

🔑 Enter your OpenAI API key: ··········


## 3. Configure your Qdrant client in memory

We need to configure the embedding model and the documents to index. In this example, the documents come from Arize's documentation.

---

In [None]:
model_name = 'text-embedding-ada-002'

embeddings = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=openai_api_key
)



In [None]:
def load_gitbook_docs(docs_url):
    """
    Loads documentation from a Gitbook URL.
    """

    loader = GitbookLoader(
        docs_url,
        load_all_paths=True,
    )
    return loader.load()

docs_url = "https://docs.arize.com/arize/"
docs = load_gitbook_docs(docs_url)

Fetching pages: 100%|##########| 187/187 [00:18<00:00,  9.90it/s]


We build our Qdrant vector store in memory for this example; other deployment options are described in both LangChain's and Qdrant's documentation.

In [None]:
qdrant = Qdrant.from_documents(
    docs,
    embeddings,
    location=":memory:",
    collection_name="my_documents",
)

## 4. Build Your LangChain Application

---

This example uses a `RetrievalQA` chain over a pre-built index of the Arize documentation, but you can use whatever LangChain application you like.

In [None]:
handler = StdOutCallbackHandler()


num_retrieved_documents = 2
retriever = qdrant.as_retriever(search_type="mmr",
                                search_kwargs={"k": num_retrieved_documents},
                                enable_limit=True)
chain_type = "stuff"  # stuff, refine, map_reduce, and map_rerank
chat_model_name = "gpt-4-turbo-preview"
llm = ChatOpenAI(model_name=chat_model_name, temperature=0.0)
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type=chain_type,
    retriever=retriever,
    metadata={"application_type": "question_answering"},
    callbacks=[handler]
)

To use Phoenix, you must load your data into Pandas dataframes. First, load your knowledge base into a dataframe.


In [None]:
database_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/langchain-pinecone/database.parquet"
)
#database_df.head()

The columns of your dataframe are:

- `text`: the chunked text in your knowledge base
- `text_vector`: the embedding vector for the text, computed during the LangChain build using the "text-embedding-ada-002" embedding model from OpenAI

Next, download a dataframe containing query data.

In [None]:
query_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/langchain-pinecone/langchain_pinecone_query_dataframe_with_user_feedbackv2.parquet"
)
#query_df.head()

The columns of the dataframe are:

- `text`: the query text
- `text_vector`: the embedding representation of the query, captured from LangChain at query time
- `response`: the final response from the LangChain application
- `context_text_0`: the first retrieved context from the knowledge base
- `context_similarity_0`: the cosine similarity between the query and the first retrieved context
- `context_text_1`: the second retrieved context from the knowledge base
- `context_similarity_1`: the cosine similarity between the query and the second retrieved context
- `user_feedback`: approval or rejection from the user (-1 means thumbs down, +1 means thumbs up)
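
As a reminder of what the similarity columns measure, here is a minimal sketch of cosine similarity, with Euclidean distance shown alongside for comparison. The helper names and the toy 2-D vectors are assumptions made for illustration; the real embeddings from "text-embedding-ada-002" are much longer vectors, and this is not the code that produced the columns above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors (0.0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical stand-ins for a query embedding and a retrieved-context embedding.
query_vector = [1.0, 0.0]
context_vector = [1.0, 1.0]

print(round(cosine_similarity(query_vector, context_vector), 4))  # 0.7071
print(euclidean_distance(query_vector, context_vector))  # 1.0
```

Higher cosine similarity and lower Euclidean distance both suggest that the retrieved context is closer to the query in embedding space.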

Let's try running the first 10 queries from `query_df` through the chain, using Qdrant as the retriever.

In [None]:
for i in range(10):
    row = query_df.iloc[i]
    response = chain.invoke(row['text'])
    print(response)



> Entering new RetrievalQA chain...

> Finished chain.
{'query': 'How do I use the SDK to upload a ranking model?', 'result': "The provided context does not include specific instructions or code examples on how to use an SDK to upload a ranking model. It focuses on using the Arize platform to troubleshoot and improve the performance of a search ranking model, particularly through evaluating NDCG at different @k values, comparing datasets, and identifying feature performance issues. \n\nFor uploading a ranking model or any model using an SDK, typically, you would follow these general steps, which might vary depending on the specific platform or SDK you are using:\n\n1. **Installation**: Ensure the SDK is installed in your environment. This usually involves running a pip install command or adding the SDK to your project dependencies.\n\n2. **Authentication**: Set up authentication by providing necessary credentials, which could be an API key, a service account, or other

The query and database datasets are drawn from different distributions; the queries are short questions while the database entries are several sentences to a paragraph. The embeddings from OpenAI's "text-embedding-ada-002" capture these differences and naturally separate the query and context embeddings into distinct regions of the embedding space. When using Phoenix, you want to "overlay" the query and context embedding distributions so that queries appear close to their retrieved context in the Phoenix point cloud. To achieve this, we compute a centroid for each dataset that represents an average point in the embedding distribution and center the two distributions so they overlap.

In [None]:
database_centroid = database_df["text_vector"].mean()
database_df["centered_text_vector"] = database_df["text_vector"].apply(
    lambda x: x - database_centroid
)
query_centroid = query_df["text_vector"].mean()
query_df["centered_text_vector"] = query_df["text_vector"].apply(lambda x: x - query_centroid)
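
The effect of the centering step can be checked on toy data. The sketch below is a pure-Python illustration with made-up 2-D vectors and hypothetical helper names (`centroid`, `center`), not the notebook's data: after subtracting each dataset's own centroid, both datasets have a centroid at the origin, so the two point clouds overlap.

```python
from typing import List


def centroid(vectors: List[List[float]]) -> List[float]:
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def center(vectors: List[List[float]]) -> List[List[float]]:
    """Subtract the dataset's centroid from every vector."""
    c = centroid(vectors)
    return [[x - m for x, m in zip(vector, c)] for vector in vectors]


# Two toy "distributions" that sit in different regions of the space.
queries = [[0.0, 1.0], [2.0, 3.0]]
documents = [[10.0, 10.0], [12.0, 14.0]]

centered_queries = center(queries)
centered_documents = center(documents)

print(centroid(centered_queries))  # [0.0, 0.0]
print(centroid(centered_documents))  # [0.0, 0.0]
```

Centering shifts each dataset as a whole, so the distances between points within a dataset are unchanged.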

## 5. Launch Your Phoenix Session and Instrument LangChain

Define a schema to tell Phoenix what the columns of your query and database dataframes represent (features, predictions, actuals, tags, embeddings, etc.). See the docs for guides on defining your own schema and for the API reference on `phoenix.Schema` and `phoenix.EmbeddingColumnNames`.

In [None]:
query_schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        raw_data_column_name="text",
        vector_column_name="centered_text_vector",
    ),
    response_column_names="response",
    tag_column_names=[
        "context_text_0",
        "context_similarity_0",
        "context_text_1",
        "context_similarity_1",
        "euclidean_distance_0",
        "euclidean_distance_1",
        "openai_relevance_0",
        "openai_relevance_1",
        "openai_precision@1",
        "openai_precision@2",
        "user_feedback",
    ],
)
database_schema = px.Schema(
    document_column_names=px.EmbeddingColumnNames(
        raw_data_column_name="text",
        vector_column_name="centered_text_vector",
    ),
)

Create Phoenix datasets that wrap your dataframes with the schemas that describe them.

In [None]:
database_ds = px.Dataset(
    dataframe=database_df,
    schema=database_schema,
    name="qdrant",
)
query_ds = px.Dataset(
    dataframe=query_df,
    schema=query_schema,
    name="query",
)

Launch Phoenix. Note that the database dataset is passed in as the corpus - a reserved keyword argument for passing in the knowledge base for search and retrieval analysis. Follow the instructions in the cell output to open the Phoenix UI.

In [None]:
session = px.launch_app(query_ds, corpus=database_ds)

🌍 To view the Phoenix app in your browser, visit https://dbgz5stb1261-496ff2e9c6d22116-6006-colab.googleusercontent.com/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


To make your LLM application observable, it must be instrumented; that is, the code must emit traces. The trace data must then be sent to an observability backend, in our case the Phoenix server.

In [None]:
LangChainInstrumentor().instrument()

## 6. Run LLM-Assisted Evals

Cosine similarity and Euclidean distance are reasonable proxies for retrieval quality, but they don't always work perfectly. A novel idea is to use LLMs to measure retrieval quality by simply asking the LLM whether each piece of retrieved context is relevant or irrelevant to the corresponding query.
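
Per-document relevance judgments of this kind can be rolled up into a ranking metric such as precision@k, the fraction of the top k retrieved documents judged relevant. The sketch below is an illustration under stated assumptions: `precision_at_k` is a hypothetical helper, not a Phoenix function, and the labels are made-up examples using the "relevant"/"irrelevant" strings.

```python
from typing import List, Optional


def precision_at_k(labels: List[Optional[str]], k: int) -> Optional[float]:
    """Fraction of the top-k retrieved documents labeled "relevant".

    Returns None when fewer than k labels exist or any top-k label is missing,
    for example because the LLM response could not be parsed.
    """
    top_k = labels[:k]
    if len(top_k) < k or any(label is None for label in top_k):
        return None
    return sum(label == "relevant" for label in top_k) / k


# Hypothetical labels for the two documents retrieved for one query,
# ordered by retrieval rank.
labels = ["relevant", "irrelevant"]

print(precision_at_k(labels, 1))  # 1.0
print(precision_at_k(labels, 2))  # 0.5
```

Returning `None` rather than 0 for missing labels keeps unparseable judgments from being counted as retrieval failures.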

⚠️ It's strongly recommended to use GPT-4 for this step if you have access, since we've found it to be more trustworthy for this particular task.

💬 Use OpenAI to predict whether each retrieved document is relevant or irrelevant to the query.

In [None]:
EVALUATION_SYSTEM_MESSAGE = (
    "You will be given a query and a reference text. "
    "You must determine whether the reference text contains an answer to the input query. "
    "Your response must be binary (0 or 1) and "
    "should not contain any text or characters aside from 0 or 1. "
    "0 means that the reference text does not contain an answer to the query. "
    "1 means the reference text contains an answer to the query."
)
QUERY_CONTEXT_PROMPT_TEMPLATE = """# Query: {query}

# Reference: {reference}

# Binary: """

In [None]:
from typing import List, Dict
from tenacity import (
    retry,
    stop_after_attempt,
    wait_random_exponential,
)

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def evaluate_query_and_retrieved_context(query: str, context: str, model_name: str) -> str:
    prompt = QUERY_CONTEXT_PROMPT_TEMPLATE.format(
        query=query,
        reference=context,
    )
    response = openai.chat.completions.create(
        messages=[
            {"role": "system", "content": EVALUATION_SYSTEM_MESSAGE},
            {"role": "user", "content": prompt},
        ],
        model=model_name,
    )

    print(response.choices[0].message.content)

    return response.choices[0].message.content

def evaluate_retrievals(
    retrievals_data: Dict[str, str],
    model_name: str,
) -> List[str]:
    responses = []
    for query, retrieved_context in tqdm(retrievals_data.items()):
        response = evaluate_query_and_retrieved_context(query, retrieved_context, model_name)
        responses.append(response)
    return responses

def process_binary_responses(
    binary_responses: List[str], binary_to_string_map: Dict[int, str]
) -> List[str]:
    """
    Parse binary responses and convert them to the desired format.
    The binary_to_string_map parameter should be a dictionary mapping
    binary values (0 or 1) to the desired string values
    (e.g. "irrelevant" or "relevant"). Unparseable responses map to None.
    """
    processed_responses = []
    for binary_response in binary_responses:
        try:
            binary_value = int(binary_response.strip())
            processed_response = binary_to_string_map[binary_value]
        except (ValueError, KeyError):
            processed_response = None
        processed_responses.append(processed_response)
    return processed_responses

sample_query_df = query_df.copy()  # use query_df.head(10).copy() to work on a subset
evaluation_model_name = "gpt-4-1106-preview"  # use any GPT variant you have access to
for context_index in range(num_retrieved_documents):
    retrievals_data = {
        row["text"]: row[f"context_text_{context_index}"] for _, row in sample_query_df.iterrows()
    }
    raw_responses = evaluate_retrievals(retrievals_data, evaluation_model_name)
    processed_responses = process_binary_responses(raw_responses, {0: "irrelevant", 1: "relevant"})
    sample_query_df[f"openai_relevance_{context_index}"] = processed_responses
sample_query_df[
    ["text", "context_text_0", "openai_relevance_0", "context_text_1", "openai_relevance_1"]
]#.head(10)

100%|██████████| 153/153 [01:52<00:00,  1.35it/s]
100%|██████████| 153/153 [01:39<00:00,  1.54it/s]




Unnamed: 0,text,context_text_0,openai_relevance_0,context_text_1,openai_relevance_1
0,How do I use the SDK to upload a ranking model?,\n**Check out our** **How to Monitor Ranking Model's** **blog and follow along with our various** **Colab** **examples here.**&#x20;\n\n,irrelevant,"\n**Use the 'arize-demo-hotel-ranking' model, available in all free accounts, to follow along.**&#x20;\n\n",irrelevant
1,What drift metrics are supported in Arize?,"\nArize calculates drift metrics such as Population Stability Index, KL Divergence, and Wasserstein Distance. Arize computes drift by measuring distribution changes between the model’s production values and a baseline (reference dataset). Users can configure a baseline to be any time window of a:\n\n1. Pre-production dataset (training, test, validation) or\n2. Fixed or moving time period from production (e.g. last 30 days, last 60 days).&#x20;\n\nBaselines are saved in Arize so that users can compare several versions and/or environments against each other across moving or fixed time windows. For more details on baselines, visit here.\n\n",relevant,"\nDrift monitors measure distribution drift, which is the difference between two statistical distributions.&#x20;\n\nArize offers various distributional drift metrics to choose from when setting up a monitor. Each metric is tailored to a specific use case; refer to this guide to help choose the appropriate metric for various ML use cases.\n\n",relevant
2,Does Arize support batch models?,\nArize supports many model types - check out our various Model Types to learn more.&#x20;\n\n,irrelevant,"\nArize natively supports binary classification, multi-class classification, regression, ranking, NLP, and CV model types. Your model type informs the **data ingestion format** and **the performance metrics** that can be utilized in the platform.&#x20;\n\n",irrelevant
3,Does Arize support training data?,"\nArize natively supports binary classification, multi-class classification, regression, ranking, NLP, and CV model types. Your model type informs the **data ingestion format** and **the performance metrics** that can be utilized in the platform.&#x20;\n\n",irrelevant,"\nArize natively supports tabular/structured data types (strings, floats, booleans, etc), as well as embedding support for NLP, Image, and other unstructured data types.\n\n",irrelevant
4,How do I configure a threshold if my data has seasonality trends?,\nAutomatic thresholds set a dynamic threshold value for each data point. Arize generates an auto threshold when there are at least 14 days of production data to determine a trend. You can edit your auto threshold sensitivity by changing the standard deviation number in the 'Monitor Settings' card.&#x20;\n\nLearn more here about how an auto threshold value is calculated.&#x20;\n\nAuto Threshold\n\nToggle automatic thresholds on or off from the “Edit monitor” configuration.\n\n,irrelevant,"\nAutothresholds are calculated based on a statistical analysis of data over 14 days. Each day, a data point is collected, and after 14 days, the average (mean) and standard deviation of these data points are computed. The thresholds is then set by adding or subtracting the standard deviation from the average.\n\n",irrelevant
...,...,...,...,...,...
148,Do you support IoU for image segmentation?,\nArize supports 2 methods for ingesting and visualizing feature importance\n\nMethodUser Calculated SHAPshap.mdimage (33).pngGroup 1 (2).pngSurrogate Modelsurrogate-model.mdimage (23).pngGroup 2 (2).png\n\n\n\n,irrelevant,"\nObject detection models identify and locate objects within images or videos by assigning them specific bounding boxes.\n\nApplicable Metrics: Accuracy, Euclidian Distance (embeddings)\n\nClick here for all valid model types and metric combinations.&#x20;\n\n",irrelevant
149,This is a test question?,"\nIn the example above, there wasn't enough context on video quality to be able to correctly answer the user questions. Adding more context can help.&#x20;\n\n",irrelevant,\n,irrelevant
150,?,\n\n\n\n,irrelevant,\n,irrelevant
151,This is a question?,\n,irrelevant,\n\n\n\n,irrelevant


## 7. Compute Ranking Metrics

Now that you know whether each piece of retrieved context is relevant or irrelevant to the corresponding query, you can compute precision@k for k = 1, 2 for each query. This metric tells you what fraction of the top k retrieved documents is relevant to the corresponding query.

precision@k = (# of top-k retrieved documents that are relevant) / k

If precision@2 is greater than zero for a particular query, your LangChain application retrieved at least one relevant piece of context with which to answer the query. If precision@2 is zero for a particular query, no relevant piece of context was retrieved for it.
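
As a small worked illustration of the formula above, the sketch below computes precision@k for a single query from a rank-ordered list of relevance labels. The helper name `precision_at_k` and the toy labels are hypothetical and are not part of this notebook.

```python
def precision_at_k(relevance_labels, k):
    """Return the fraction of the top-k retrieved documents labeled "relevant"."""
    top_k = relevance_labels[:k]
    return sum(label == "relevant" for label in top_k) / k


# Toy labels for one query, ordered by retrieval rank (hypothetical data).
labels = ["irrelevant", "relevant"]

p_at_1 = precision_at_k(labels, 1)  # 0 of the top 1 documents is relevant
p_at_2 = precision_at_k(labels, 2)  # 1 of the top 2 documents is relevant
print(p_at_1, p_at_2)
```

In this toy case precision@1 is zero but precision@2 is not, so a relevant document was retrieved, just not ranked first.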

Compute precision@k for k = 1, 2 and view the results.

In [None]:
# Running count of relevant documents among the top-k results for each query.
num_relevant_documents_array = np.zeros(len(sample_query_df))
num_retrieved_documents = 2
for retrieved_document_index in range(num_retrieved_documents):
    # k is the number of top-ranked documents considered at this step.
    k = retrieved_document_index + 1
    num_relevant_documents_array += (
        sample_query_df[f"openai_relevance_{retrieved_document_index}"]
        .map(lambda x: int(x == "relevant"))
        .to_numpy()
    )
    # Assign the array directly so values are matched to rows by position.
    sample_query_df[f"openai_precision@{k}"] = num_relevant_documents_array / k

sample_query_df[
    [
        "openai_relevance_0",
        "openai_relevance_1",
        "openai_precision@1",
        "openai_precision@2",
    ]
]

Unnamed: 0,openai_relevance_0,openai_relevance_1,openai_precision@1,openai_precision@2
0,irrelevant,irrelevant,0.0,0.0
1,relevant,relevant,1.0,1.0
2,irrelevant,irrelevant,0.0,0.0
3,irrelevant,irrelevant,0.0,0.0
4,irrelevant,irrelevant,0.0,0.0
...,...,...,...,...
148,irrelevant,irrelevant,0.0,0.0
149,irrelevant,irrelevant,0.0,0.0
150,irrelevant,irrelevant,0.0,0.0
151,irrelevant,irrelevant,0.0,0.0
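
One possible way to summarize per-query results like these is to average each precision column and to measure how often precision@2 is zero, meaning no relevant context was retrieved. The sketch below runs on a tiny stand-in DataFrame: `toy_df` and its values are hypothetical, and only the column names mirror the ones computed above.

```python
import pandas as pd

# Hypothetical stand-in for sample_query_df with the same precision columns.
toy_df = pd.DataFrame(
    {
        "openai_precision@1": [0.0, 1.0, 0.0, 0.0],
        "openai_precision@2": [0.0, 1.0, 0.5, 0.0],
    }
)

# Mean precision@k across all queries.
mean_precision = toy_df[["openai_precision@1", "openai_precision@2"]].mean()

# Queries for which no relevant context appeared in the top 2 results.
failed_retrieval_mask = toy_df["openai_precision@2"] == 0
failed_fraction = failed_retrieval_mask.mean()

print(mean_precision)
print(f"Fraction of queries with no relevant context: {failed_fraction:.2f}")
```

Applied to the real DataFrame, the same mask could be used to pull out the queries whose retrieval failed for closer inspection.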


In [None]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")

🚀 Open the Phoenix UI if you haven't already: https://dbgz5stb1262-496ff2e9c6d22116-6006-colab.googleusercontent.com/
