<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluating and Improving Search and Retrieval Applications</h1>

Imagine you're an engineer at Arize AI and you've built and deployed a documentation question-answering service using LangChain and Qdrant. Users send questions about Arize's core product via a chat interface, and your service retrieves chunks of your indexed documentation in order to generate a response to the user. As the engineer in charge of maintaining this system, you want to evaluate the quality of the responses from your service.

Phoenix helps you:
- identify gaps in your documentation
- detect queries for which the LLM gave bad responses
- detect failures to retrieve relevant documents

In this tutorial, you will:

- Ask questions of a LangChain application backed by Qdrant over a knowledge base of the Arize documentation
- Use Phoenix to visualize user queries and knowledge base documents to identify areas of user interest not answered by your documentation
- Find clusters of responses with negative user feedback
- Identify failed retrievals using query density, cosine similarity, query distance, and LLM-assisted ranking metrics



## Chatbot Architecture

The architecture of your chatbot can be explained in five steps.

1. The user sends a query about Arize to your service.
1. `langchain.embeddings.OpenAIEmbeddings` makes a request to OpenAI to embed the user query using the text-embedding-ada-002 model.
1. The retriever searches your Qdrant database for the pieces of context most similar to the query embedding, selecting them by maximal marginal relevance (MMR).
1. `langchain_openai.ChatOpenAI` generates a response by formatting the query and retrieved context into a single prompt and sending a request to OpenAI with the gpt-4-turbo-preview model.
1. The response is returned to the user.
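
To make the MMR step more concrete, here is a minimal sketch of maximal marginal relevance over toy 2-D vectors. It is an illustration only, not LangChain's or Qdrant's implementation; the helper names (`toy_cosine_sim`, `toy_mmr_select`) and the toy vectors are assumptions made for this example.

```python
import numpy as np


def toy_cosine_sim(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def toy_mmr_select(query_vec, doc_vecs, k=2, lambda_mult=0.5):
    """Return the indices of k documents chosen by maximal marginal relevance.

    lambda_mult=1.0 ranks purely by relevance to the query; smaller values
    penalize documents that resemble ones already selected.
    """
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -np.inf
        for i in candidates:
            relevance = toy_cosine_sim(query_vec, doc_vecs[i])
            # Highest similarity to any already-selected document (0 if none yet)
            redundancy = max(
                (toy_cosine_sim(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


# Hypothetical embeddings: documents 0 and 1 are near-duplicates, document 2 differs.
toy_query_vec = np.array([1.0, 0.0])
toy_doc_vecs = np.array([[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]])

toy_diverse_picks = toy_mmr_select(toy_query_vec, toy_doc_vecs, k=2, lambda_mult=0.3)
toy_relevance_picks = toy_mmr_select(toy_query_vec, toy_doc_vecs, k=2, lambda_mult=1.0)
print(toy_diverse_picks, toy_relevance_picks)
```

With a low `lambda_mult`, the second pick skips the near-duplicate in favor of the more distinct document; with `lambda_mult=1.0`, the selection reduces to plain similarity ranking.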

Phoenix makes your search and retrieval system observable by capturing the inputs and outputs of these steps for analysis, including:

- your query embeddings
- the retrieved documents and their similarity scores (relevance) for each query
- the generated response that is returned to the user

With that overview in mind, let's dive into the notebook.

## 1. Install needed dependencies and import relevant packages

---

In [None]:
!pip install --upgrade langchain qdrant-client langchain_community tiktoken cohere langchain-openai "protobuf>=3.20.3" "arize-phoenix[evals, llama-index]" "openai>=1"

Import libraries.

In [None]:
# Standard library imports
import os
from getpass import getpass

# Third-party library imports
import nest_asyncio
import numpy as np
import pandas as pd

# LangChain imports
from langchain.callbacks import StdOutCallbackHandler
from langchain.chains import RetrievalQA
from langchain.document_loaders import GitbookLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain_openai import ChatOpenAI

# Phoenix imports
import phoenix as px
from phoenix.trace.langchain import LangChainInstrumentor

# Configuration and initialization
nest_asyncio.apply()  # allow nested event loops inside the notebook
pd.set_option("display.max_colwidth", None)  # show full text in dataframe cells

## 2. Configure Your OpenAI API Key

---

In [None]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

## 3. Configure your Qdrant client in memory

We need to configure the embeddings and the documents to be used. In this example, the documents come from Arize's documentation.

---

In [None]:
model_name = "text-embedding-ada-002"

embeddings = OpenAIEmbeddings(model=model_name, openai_api_key=openai_api_key)

In [None]:
def load_gitbook_docs(docs_url):
    """
    Loads documentation from a Gitbook URL.
    """

    loader = GitbookLoader(
        docs_url,
        load_all_paths=True,
    )
    return loader.load()


docs_url = "https://docs.arize.com/arize/"
docs = load_gitbook_docs(docs_url)

We build our Qdrant vector store in memory for this example; other storage options are described in both LangChain's and Qdrant's documentation.

In [None]:
qdrant = Qdrant.from_documents(
    docs,
    embeddings,
    location=":memory:",
    collection_name="my_documents",
)

## 4. Instrument LangChain

---

To make your LLM application observable, it must be instrumented. That is, the code must emit traces. The trace data must then be sent to an observability backend, in our case the Phoenix server.

In [None]:
LangChainInstrumentor().instrument()

## 5. Build Your LangChain Application

---

This example uses a `RetrievalQA` chain over an index of the Arize documentation, but you can use whatever LangChain application you like.

In [None]:
handler = StdOutCallbackHandler()


num_retrieved_documents = 2
retriever = qdrant.as_retriever(
    search_type="mmr", search_kwargs={"k": num_retrieved_documents}, enable_limit=True
)
chain_type = "stuff"  # options: stuff, refine, map_reduce, map_rerank
chat_model_name = "gpt-4-turbo-preview"
llm = ChatOpenAI(model_name=chat_model_name, temperature=0.0)
chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type=chain_type,
    retriever=retriever,
    metadata={"application_type": "question_answering"},
    callbacks=[handler],
)

Let's download a dataframe containing query data and the retrievals used to generate responses.

In [None]:
query_df = pd.read_parquet(
    "http://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/langchain-pinecone/langchain_pinecone_query_dataframe_with_user_feedbackv2.parquet"
)

Launch Phoenix.

In [None]:
session = px.launch_app()

The columns of the dataframe are:

- text: the query text
- text_vector: the embedding representation of the query, captured from LangChain at query time
- response: the final response from the LangChain application
- context_text_0: the first retrieved context from the knowledge base
- context_similarity_0: the cosine similarity between the query and the first retrieved context
- context_text_1: the second retrieved context from the knowledge base
- context_similarity_1: the cosine similarity between the query and the second retrieved context
- user_feedback: approval or rejection from the user (-1 means thumbs down, +1 means thumbs up)
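
As a reminder of what the `context_similarity_*` columns measure, the sketch below computes cosine similarity for a few toy vectors. The vectors and the helper name `toy_cosine_similarity` are made up for illustration; the values in the dataframe were captured at query time and are not recomputed here.

```python
import numpy as np


def toy_cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


toy_query_vector = [1.0, 2.0, 2.0]
toy_same_direction = [2.0, 4.0, 4.0]  # a scaled copy of the query vector
toy_orthogonal = [2.0, -1.0, 0.0]  # dot product with the query vector is zero

toy_sim_same = toy_cosine_similarity(toy_query_vector, toy_same_direction)
toy_sim_orthogonal = toy_cosine_similarity(toy_query_vector, toy_orthogonal)
print(toy_sim_same, toy_sim_orthogonal)
```

Cosine similarity depends only on direction, not magnitude, so the scaled copy scores 1 while the orthogonal vector scores 0.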

Let's run the first 10 queries from `query_df` through the chain, using Qdrant as the retriever. Traces for these queries can be viewed in Phoenix.

In [None]:
for i in range(10):
    row = query_df.iloc[i]
    response = chain.invoke(row["text"])
    print(response)

## 6. Run LLM-assisted Evals using `phoenix.evals`

---

Cosine similarity and Euclidean distance are reasonable proxies for retrieval quality, but they are imperfect: a retrieved chunk can be close to the query in embedding space without actually answering it. An alternative is to use an LLM to evaluate retrieval quality by asking it whether each piece of retrieved context is relevant or irrelevant to the corresponding query.

💬 Use `phoenix.evals` to predict whether each retrieved document is relevant or irrelevant to the query.

In [None]:
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    llm_classify,
)

# create evaluation dataframes with "input" and "reference" columns
context0_eval_df = query_df.copy()
context0_eval_df["input"] = context0_eval_df["text"]
context0_eval_df["reference"] = context0_eval_df["context_text_0"]

context1_eval_df = query_df.copy()
context1_eval_df["input"] = context1_eval_df["text"]
context1_eval_df["reference"] = context1_eval_df["context_text_1"]

model = OpenAIModel(model="gpt-4")
context0_relevance = llm_classify(
    context0_eval_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
    model=model,
)
context1_relevance = llm_classify(
    context1_eval_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
    model=model,
)

In [None]:
sample_query_df = query_df.copy()
sample_query_df["openai_relevance_0"] = context0_relevance["label"]
sample_query_df["openai_relevance_1"] = context1_relevance["label"]

## 7. Compute Ranking Metrics

Now that you know whether each piece of retrieved context is relevant or irrelevant to the corresponding query, you can compute precision@k for k = 1, 2 for each query. This metric tells you what percentage of the retrieved context is relevant to the corresponding query.

precision@k = (# of top-k retrieved documents that are relevant) / k

If precision@2 is greater than zero for a particular query, your LangChain application successfully retrieved at least one relevant piece of context with which to answer the query. If precision@2 is zero for a particular query, no relevant piece of context was retrieved.
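
As a quick worked example of the formula, consider a single hypothetical query whose first retrieved document was judged irrelevant and whose second was judged relevant. The helper `toy_precision_at_k` and the labels are assumptions for illustration only.

```python
def toy_precision_at_k(relevance_labels, k):
    """Fraction of the top-k retrieved documents labeled "relevant"."""
    top_k = relevance_labels[:k]
    return sum(label == "relevant" for label in top_k) / k


# Labels are ordered by retrieval rank: first retrieved document first.
toy_labels = ["irrelevant", "relevant"]

toy_p_at_1 = toy_precision_at_k(toy_labels, 1)  # 0 relevant out of 1
toy_p_at_2 = toy_precision_at_k(toy_labels, 2)  # 1 relevant out of 2
print(toy_p_at_1, toy_p_at_2)
```

Here precision@1 is 0.0 and precision@2 is 0.5, so the retriever did surface a relevant document, just not in the first position.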

Compute precision@k for k = 1, 2 and view the results.

In [None]:
num_relevant_documents_array = np.zeros(len(sample_query_df))
num_retrieved_documents = 2
for retrieved_document_index in range(num_retrieved_documents):
    k = retrieved_document_index + 1  # precision@k uses the top-k retrieved documents
    # running count of relevant documents among the top k
    num_relevant_documents_array += (
        sample_query_df[f"openai_relevance_{retrieved_document_index}"]
        .map(lambda x: int(x == "relevant"))
        .to_numpy()
    )
    # assign the array directly so values are matched by position, not by index label
    sample_query_df[f"openai_precision@{k}"] = num_relevant_documents_array / k

sample_query_df[
    [
        "openai_relevance_0",
        "openai_relevance_1",
        "openai_precision@1",
        "openai_precision@2",
    ]
]
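
Once the precision columns exist, one way to surface failed retrievals is to average the metrics and filter for queries where precision@2 is zero. The sketch below uses a small made-up dataframe (`toy_results`) with the same column names rather than the real `sample_query_df`, so the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical results for three queries, mirroring the column names used above.
toy_results = pd.DataFrame(
    {
        "text": ["query a", "query b", "query c"],
        "openai_precision@1": [1.0, 0.0, 0.0],
        "openai_precision@2": [1.0, 0.5, 0.0],
    }
)

# Average precision across all queries for each k
toy_mean_precision = toy_results[["openai_precision@1", "openai_precision@2"]].mean()

# Queries for which no relevant context was retrieved at all
toy_failed_retrievals = toy_results[toy_results["openai_precision@2"] == 0]

print(toy_mean_precision)
print(toy_failed_retrievals["text"].tolist())
```

Applying the same two lines to `sample_query_df` should give a list of queries worth inspecting for missing or poorly indexed documentation.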

In [None]:
print(f"🚀 Open the Phoenix UI if you haven't already: {session.url}")