<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://arize.com/docs/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-2w57bhem8-hq24MB6u7yE_ZF_ilOYSBw#/shared-invite/email">Community</a>
    </p>
</center>
<h1 align="center">Evaluate RAG with LLM Evals</h1>

In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.

It has the following sections:

1. Understanding Retrieval Augmented Generation (RAG).
2. Building RAG (with the help of a framework such as LlamaIndex).
3. Evaluating RAG with Phoenix Evals.

## Retrieval Augmented Generation (RAG)

LLMs are trained on vast datasets, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLMs but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query then are sent to the LLM along with a prompt, and the LLM provides a response.
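
To make that flow concrete, here is a minimal, dependency-free sketch of index, retrieve, and prompt. It is not how LlamaIndex or any LLM provider implements these steps: the bag-of-words `embed` function stands in for a real embedding model, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) and the sample chunks are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], top_k: int = 2) -> list[str]:
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context and the query into a single prompt string."""
    joined = "\n---\n".join(context)
    return f"Context:\n{joined}\n\nAnswer using only the context above.\nQuestion: {query}"


# Indexing: embed each chunk once, ahead of query time.
chunks = [
    "Phoenix captures traces from LLM applications.",
    "The office cafeteria serves lunch at noon.",
    "Traces record the documents each retriever returned.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Querying: filter down to the most relevant context, then build the prompt for the LLM.
query = "What do traces record?"
context = retrieve(query, index, top_k=2)
prompt = build_prompt(query, context)
print(prompt)
```

The unrelated cafeteria chunk never reaches the prompt, which is the filtering role the index plays in a RAG pipeline.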

RAG is a critical component for building applications such as chatbots or agents, so it is worth knowing the techniques for getting your data into your application.

<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/RAG_Pipeline.png">

## Stages within RAG

There are five key stages within RAG, which will in turn be a part of any larger RAG application.

- **Loading**: This refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your pipeline.
- **Indexing**: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
- **Storing**: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.
- **Querying**: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
- **Evaluation**: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.
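
As a rough illustration of the indexing and storing stages, the sketch below splits text into overlapping word windows and persists them to a JSON file so they can be reloaded without re-chunking. The helper `chunk_words`, the word-based sizes, and the JSON layout are assumptions made for illustration only. Note that LlamaIndex's `chunk_size`, used later in this tutorial, counts tokens rather than words, and a real index would store embeddings alongside the text.

```python
import json
import tempfile
from pathlib import Path


def chunk_words(text: str, chunk_size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into windows of `chunk_size` words that overlap by `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        # Stop once a window reaches the end, to avoid a trailing chunk made only of overlap.
        if start + chunk_size >= len(words):
            break
    return chunks


# Loading: a stand-in document of 20 placeholder words.
text = " ".join(f"w{i}" for i in range(20))

# Indexing: break the document into overlapping chunks.
chunks = chunk_words(text)

# Storing: persist the chunks, then reload them instead of re-chunking.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.json"
    path.write_text(json.dumps({"chunks": chunks}))
    restored = json.loads(path.read_text())["chunks"]

print(len(chunks), restored == chunks)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of some duplicated text.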


## Build a RAG system 

Now that we have understood the stages of RAG, let's build a pipeline. We will use [LlamaIndex](https://www.llamaindex.ai/) for RAG and [Phoenix Evals](https://arize.com/docs/phoenix/llm-evals/llm-evals) for evaluation.


In [None]:
%pip install -qq "arize-phoenix[evals,llama-index]" "arize-phoenix-client" "llama-index-llms-openai" "openai>=1" gcsfs nest_asyncio 'httpx<0.28'

For this tutorial we will be using OpenAI for creating synthetic data as well as for evaluation. 

In [None]:
import os
from getpass import getpass

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

In [None]:
import pandas as pd
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI

from phoenix.client import Client
from phoenix.client.types.spans import SpanQuery  # pyright: ignore[reportUnusedImport]

client = Client()

During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, simply start the phoenix application and instrument LlamaIndex.

Enable Phoenix tracing via `LlamaIndexInstrumentor`. Phoenix uses OpenInference traces - an open-source standard for capturing and storing LLM application traces that enables LLM applications to seamlessly integrate with LLM observability solutions such as Phoenix.

In [None]:
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

from phoenix.otel import register

tracer_provider = register(project_name="phoenix-rag-llama-index")
LlamaIndexInstrumentor().instrument(skip_dep_check=True, tracer_provider=tracer_provider)

### Load Data and Build an Index

Let's use an [essay by Paul Graham](https://www.paulgraham.com/worked.html) to build our RAG pipeline.

In [None]:
import tempfile
from urllib.request import urlretrieve

with tempfile.NamedTemporaryFile() as tf:
    urlretrieve(
        "https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/data/paul_graham/paul_graham_essay.txt",
        tf.name,
    )
    documents = SimpleDirectoryReader(input_files=[tf.name]).load_data()

In [None]:
# Define an LLM
llm = OpenAI(model="gpt-4o")

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)

Build a QueryEngine and start querying.

In [None]:
query_engine = vector_index.as_query_engine()

In [None]:
response_vector = query_engine.query("What did the author do growing up?")

Check the response that you get from the query.

In [None]:
response_vector.response

By default, LlamaIndex retrieves the two most similar nodes (chunks). You can change that with `vector_index.as_query_engine(similarity_top_k=k)`.

Let's check the text in each of these retrieved nodes.

In [None]:
# First retrieved node
response_vector.source_nodes[0].get_text()

In [None]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application.

We can access the traces by directly pulling the spans from the phoenix session.

In [None]:
spans_df = client.spans.get_spans_dataframe(project_name="phoenix-rag-llama-index")

In [None]:
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()

Note that the traces have captured the documents that were retrieved by the query engine. This is nice because it means we can introspect the documents without having to keep track of them ourselves.

In [None]:
spans_with_docs_df = spans_df[spans_df["attributes.retrieval.documents"].notnull()]

In [None]:
spans_with_docs_df[["attributes.input.value", "attributes.retrieval.documents"]].head()

We have built a RAG pipeline and instrumented it with Phoenix Tracing. We now need to evaluate its performance. We can assess our RAG system/query engine using Phoenix's LLM Evals. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system.

## Evaluation

Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and range of queries.

While it's beneficial to examine individual queries and responses, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.

In a RAG system, evaluation focuses on two critical aspects:

- **Retrieval Evaluation**: Assesses the accuracy and relevance of the retrieved documents.
- **Response Evaluation**: Measures the appropriateness of the response the system generates from the provided context.

### Generate Question Context Pairs

For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.

For this tutorial, let's use Phoenix's `llm_generate` to help us create the question-context pairs.

First, let's create a dataframe of all the document chunks that we have indexed.

In [None]:
# Let's construct a dataframe of just the documents that are in our index
document_chunks_df = pd.DataFrame({"text": [node.get_text() for node in nodes]})
document_chunks_df = document_chunks_df.sample(10, random_state=42)
document_chunks_df.head()

Now that we have the document chunks, let's prompt an LLM to generate three questions per chunk. Note that you could manually solicit questions from your team or customers, but this is a quick and easy way to generate a large number of questions.

In [None]:
generate_questions_template = """\
Context information is below.

---------------------
{text}
---------------------

Given the context information and not prior knowledge,
generate only questions based on the below query.

You are a Teacher/Professor. Your task is to set up \
3 questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.

Output the questions in JSON format with the keys question_1, question_2, question_3.
"""

In [None]:
import json

from phoenix.evals import OpenAIModel, llm_generate


def output_parser(response: str, index: int):
    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}


questions_df = llm_generate(
    dataframe=document_chunks_df,
    template=generate_questions_template,
    model=OpenAIModel(model="gpt-3.5-turbo"),
    output_parser=output_parser,
    concurrency=20,
)

In [None]:
questions_df.head()

In [None]:
# Construct a dataframe of the questions and the document chunks
questions_with_document_chunk_df = pd.concat([questions_df, document_chunks_df], axis=1)
questions_with_document_chunk_df = questions_with_document_chunk_df.melt(
    id_vars=["text"], value_name="question"
).drop("variable", axis=1)
# If the above step was interrupted, there might be questions missing. Let's run this to clean up the dataframe.
questions_with_document_chunk_df = questions_with_document_chunk_df[
    questions_with_document_chunk_df["question"].notnull()
]

The LLM has generated three questions per chunk. Let's take a quick look.

In [None]:
questions_with_document_chunk_df.head(10)

### Retrieval Evaluation

We are now prepared to perform our retrieval evaluations. We will execute the queries we generated in the previous step and verify whether the correct context is retrieved.

In [None]:
# loop over the questions and generate the answers
for _, row in questions_with_document_chunk_df.iterrows():
    question = row["question"]
    response_vector = query_engine.query(question)
    print(f"Question: {question}\nAnswer: {response_vector.response}\n")

Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context. Let's extract all the retrieved documents from the traces logged to phoenix. (For an in-depth explanation of how to export trace data from the phoenix runtime, consult the [docs](https://arize.com/docs/phoenix/how-to/extract-data-from-spans)).
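
Conceptually, the `explode` step in the query that follows turns each retriever span into one row per retrieved document, so that every document can be judged on its own. The plain-Python sketch below mimics that reshaping on hand-written data. It is not Phoenix's implementation, and the function `explode_documents`, the dictionary layout, and the sample values are assumptions for illustration.

```python
def explode_documents(spans: list[dict]) -> list[dict]:
    """Turn each retriever span into one row per retrieved document."""
    rows = []
    for span in spans:
        for position, doc in enumerate(span["documents"]):
            rows.append(
                {
                    "span_id": span["span_id"],
                    "document_position": position,
                    "input": span["input"],
                    "reference": doc["content"],
                    "document_score": doc["score"],
                }
            )
    return rows


# Two hypothetical retriever spans: one returned two documents, the other one.
spans = [
    {
        "span_id": "s1",
        "input": "What did the author do growing up?",
        "documents": [
            {"content": "chunk A", "score": 0.82},
            {"content": "chunk B", "score": 0.79},
        ],
    },
    {
        "span_id": "s2",
        "input": "Another question",
        "documents": [{"content": "chunk C", "score": 0.5}],
    },
]

rows = explode_documents(spans)
for row in rows:
    print(row["span_id"], row["document_position"], row["reference"])
```

Keeping the span identifier and the document position on every row is what later allows per-query metrics to be computed by grouping on the span.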

In [None]:
from openinference.semconv.trace import DocumentAttributes, SpanAttributes

retrieved_documents_df = client.spans.get_spans_dataframe(
    project_name="phoenix-rag-llama-index",
    query=(
        SpanQuery()
        .where("span_kind == 'RETRIEVER'")
        .select("trace_id", SpanAttributes.INPUT_VALUE)
        .explode(
            SpanAttributes.RETRIEVAL_DOCUMENTS,
            reference=DocumentAttributes.DOCUMENT_CONTENT,
            document_score=DocumentAttributes.DOCUMENT_SCORE,
        )
    ),
)
retrieved_documents_df.rename(columns={"input.value": "input"}, inplace=True)
retrieved_documents_df

Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regard to the query. Note, we've turned on `explanations`, which prompts the LLM to explain its reasoning. This can be useful for debugging and for figuring out potential corrective actions.
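
The classifier defined in the cells that follow takes a `choices` mapping that pairs each allowed label with a numeric score. As a hedged sketch of the idea (not Phoenix's actual parsing logic), the hypothetical helper `label_to_score` below normalizes a raw model reply and returns no score when the reply is not one of the allowed labels, rather than guessing.

```python
from typing import Optional

# Same label-to-score mapping as the relevance classifier uses.
CHOICES = {"relevant": 1.0, "unrelated": 0.0}


def label_to_score(
    raw_output: str, choices: dict[str, float]
) -> tuple[Optional[str], Optional[float]]:
    """Map a raw LLM reply onto an allowed label and its score, or (None, None)."""
    # Tolerate surrounding whitespace, quotes, a trailing period, and casing differences.
    cleaned = raw_output.strip().strip("\"'.").lower()
    if cleaned in choices:
        return cleaned, choices[cleaned]
    # Anything else is treated as unparseable instead of being coerced to a score.
    return None, None


print(label_to_score(" Relevant.\n", CHOICES))
print(label_to_score("unrelated", CHOICES))
print(label_to_score("I think it might be relevant", CHOICES))
```

Treating unparseable replies as missing, instead of scoring them as 0, keeps evaluator failures distinguishable from genuinely irrelevant documents, which is one reason the metric code later in this section has to tolerate missing values.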

In [None]:
relevance_template = """
You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference text]: {reference}
    ************
    [END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question."""

In [None]:
from openinference.instrumentation import suppress_tracing

from phoenix.evals import (
    async_evaluate_dataframe,
    create_classifier,
)
from phoenix.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o")

relevance_eval = create_classifier(
    name="relevance",
    prompt_template=relevance_template,
    llm=llm,
    choices={"relevant": 1.0, "unrelated": 0.0},
)
with suppress_tracing():
    retrieved_documents_relevance_df = await async_evaluate_dataframe(
        dataframe=retrieved_documents_df, evaluators=[relevance_eval]
    )
retrieved_documents_relevance_df.head()

We can now combine the documents with the relevance evaluations to compute retrieval metrics. These metrics will help us understand how well the RAG system is performing.

In [None]:
documents_with_relevance_df = pd.concat(
    [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix("eval_")], axis=1
)
documents_with_relevance_df.head()

Let's compute Normalized Discounted Cumulative Gain ([NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)) at 2 for all our retrieval steps. In information retrieval, this metric is often used to measure the effectiveness of search engine algorithms and related applications.
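
For intuition, the discounted cumulative gain of a ranked list is $\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i + 1)}$, and NDCG divides it by the DCG of the ideal (best possible) ordering. The plain-Python sketch below assumes the relevance scores are already listed in the order the retriever ranked the documents. The function names are illustrative, and the scikit-learn call used in the next cell differs in that it derives the ranking from the document scores and handles ties.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k items of a ranked list."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG@k normalized by the DCG@k of the ideal ordering (0.0 if nothing is relevant)."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevant document ranked first: a perfect score.
print(ndcg_at_k([1.0, 0.0], k=2))
# Relevant document ranked second: penalized by the log discount.
print(ndcg_at_k([0.0, 1.0], k=2))
```

Unlike precision, NDCG rewards putting the relevant document first, so two retrievals with the same set of documents can score differently depending on their order.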

In [None]:
import numpy as np
from sklearn.metrics import ndcg_score


def _compute_ndcg(df: pd.DataFrame, k: int):
    """Compute NDCG@k in the presence of missing values"""
    n = max(2, len(df))
    eval_scores = np.zeros(n)
    doc_scores = np.zeros(n)
    parsed_scores = df.eval_relevance_score.apply(
        lambda x: json.loads(x)["score"] if isinstance(x, str) else x
    ).values
    eval_scores[: len(df)] = parsed_scores
    doc_scores[: len(df)] = df.document_score
    try:
        return ndcg_score([eval_scores], [doc_scores], k=k)
    except ValueError:
        return np.nan


ndcg_at_2 = pd.DataFrame(
    {"score": documents_with_relevance_df.groupby("context.span_id").apply(_compute_ndcg, k=2)}
)

In [None]:
ndcg_at_2

Let's also compute precision at 2 for all our retrieval steps.

In [None]:
precision_at_2 = pd.DataFrame(
    {
        "score": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_relevance_score.apply(
                lambda s: json.loads(s)["score"] if isinstance(s, str) else s
            )[:2].sum(skipna=False)
            / 2
        )
    }
)

In [None]:
precision_at_2

Lastly, let's compute whether a correct document was retrieved at all for each query (i.e., a hit).
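
Stripped of the dataframe mechanics, precision at k and the hit indicator are both simple functions of the relevance scores of the top k documents. The sketch below is a minimal version under the assumption that scores are 1.0 for relevant and 0.0 for unrelated, with no missing values. The function names are illustrative and do not come from Phoenix.

```python
def precision_at_k(relevances: list[float], k: int) -> float:
    """Fraction of the top-k slots filled by relevant documents."""
    # Dividing by k (not by the number retrieved) counts empty slots as misses.
    return sum(relevances[:k]) / k


def hit_at_k(relevances: list[float], k: int) -> bool:
    """True if at least one of the top-k documents is relevant."""
    return any(score > 0 for score in relevances[:k])


print(precision_at_k([1.0, 0.0], k=2), hit_at_k([1.0, 0.0], k=2))
print(precision_at_k([0.0, 0.0], k=2), hit_at_k([0.0, 0.0], k=2))
```

A query can register a hit while still having a precision of 0.5, which signals that the right context was found but arrived alongside an irrelevant chunk.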

In [None]:
hit = pd.DataFrame(
    {
        "hit": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_relevance_score.apply(
                lambda s: json.loads(s)["score"] if isinstance(s, str) else s
            )[:2].sum(skipna=False)
            > 0
        )
    }
)

Let's now view the results in a combined dataframe.

In [None]:
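retrievals_df = client.spans.get_spans_dataframe(
    # Query the same project the traces were logged to, as in the earlier cells
    project_name="phoenix-rag-llama-index",
    query=(
        SpanQuery()
        .where("span_kind == 'RETRIEVER' and input.value is not None")
        .select("input.value")
    ),
)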
rag_evaluation_dataframe = pd.concat(
    [
        retrievals_df["input.value"],
        ndcg_at_2.add_prefix("ndcg@2_"),
        precision_at_2.add_prefix("precision@2_"),
        hit,
    ],
    axis=1,
)
rag_evaluation_dataframe

### Observations

Let's now take our results and aggregate them to get a sense of how well our RAG system is performing.

In [None]:
# Aggregate the scores across the retrievals
results = rag_evaluation_dataframe.mean(numeric_only=True)
results

As we can see from the above numbers, our RAG system is not perfect. There are times when it fails to retrieve the correct context within the first two documents. At other times the correct context is included in the top 2 results, but non-relevant information is also included in the context. This is an indication that we need to improve our retrieval strategy. One possible solution could be to increase the number of documents retrieved and then use a more sophisticated ranking strategy (such as a reranker) to select the correct context.
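
The retrieve-more-then-rerank idea can be sketched as a two-stage function. Everything in this example is an assumption made for illustration: `term_frequency` stands in for a cheap first-stage retriever, `jaccard` stands in for a stronger reranker (in practice often a cross-encoder or an LLM), and the documents are toy strings rather than chunks from the essay.

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def term_frequency(query: str, doc: str) -> float:
    """Cheap first-stage score: how many words in the doc are query terms (with repeats)."""
    terms = set(query.lower().split())
    return float(sum(word in terms for word in doc.lower().split()))


def jaccard(query: str, doc: str) -> float:
    """Stand-in reranker score: overlap of distinct words divided by their union."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_then_rerank(
    query: str,
    candidates: list[str],
    first_stage: Scorer,
    reranker: Scorer,
    fetch_k: int = 10,
    top_k: int = 2,
) -> list[str]:
    """Fetch a wide shortlist with the cheap scorer, then keep the reranker's top_k."""
    shortlist = sorted(candidates, key=lambda d: first_stage(query, d), reverse=True)[:fetch_k]
    return sorted(shortlist, key=lambda d: reranker(query, d), reverse=True)[:top_k]


docs = [
    "the author thanks each author and every author who is an author",
    "the author chose philosophy as a college major",
    "college major",
    "lunch is served at noon",
]
query = "author college major"

first_stage_only = sorted(docs, key=lambda d: term_frequency(query, d), reverse=True)[:2]
reranked = retrieve_then_rerank(query, docs, term_frequency, jaccard, fetch_k=3, top_k=2)
print(first_stage_only)
print(reranked)
```

With only the first stage, the repetitive first document takes a top-2 slot. Widening the shortlist to three and reranking pushes it out in favor of a document that the first stage ranked third.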

We have now evaluated our RAG system's retrieval performance. Let's send these evaluations to Phoenix for visualization. By sending the evaluations to Phoenix, you will be able to view the evaluations alongside the traces that were captured earlier.

In [None]:
from phoenix.client.resources.spans import SpanAnnotationData

ndcg_rows = ndcg_at_2.reset_index()[["context.span_id", "score"]].dropna()
prec_rows = precision_at_2.reset_index()[["context.span_id", "score"]].dropna()

ndcg_annotations: list[SpanAnnotationData] = [
    SpanAnnotationData(
        name="ndcg@2",
        span_id=str(row["context.span_id"]),
        annotator_kind="CODE",
        result={"score": float(row["score"])},
    )
    for _, row in ndcg_rows.iterrows()
]

precision_annotations: list[SpanAnnotationData] = [
    SpanAnnotationData(
        name="precision@2",
        span_id=str(row["context.span_id"]),
        annotator_kind="CODE",
        result={"score": float(row["score"])},
    )
    for _, row in prec_rows.iterrows()
]

client.spans.log_span_annotations(span_annotations=ndcg_annotations, sync=False)
client.spans.log_span_annotations(span_annotations=precision_annotations, sync=False)

### Response Evaluation

The retrieval evaluations demonstrate that our RAG system is not perfect. However, it's possible that the LLM is able to generate the correct response even when the context is incorrect. Let's evaluate the responses generated by the LLM.

In [None]:
from phoenix.client.types.spans import SpanQuery

qa_df = client.spans.get_spans_dataframe(
    project_name="phoenix-rag-llama-index",
    query=(
        SpanQuery()
        .select("span_id", "input.value", "output.value")
        .where("parent_id is None")
        .with_index("trace_id")
    ),
)

docs_df = client.spans.get_spans_dataframe(
    project_name="phoenix-rag-llama-index",
    query=(
        SpanQuery()
        .where("span_kind == 'RETRIEVER'")
        .concat("retrieval.documents", reference="document.content")
        .with_index("trace_id")
    ),
)

# Fall back to an empty frame so the rename and head() below never run on None
qa_with_reference_df = pd.DataFrame(columns=["input.value", "output.value", "reference"])
if qa_df is None or qa_df.empty:
    print("No spans found.")
elif docs_df is None or docs_df.empty:
    print("No retrieval documents found.")
else:
    # Join questions/answers with their retrieved context on the shared trace_id index
    ref = docs_df[["reference"]]
    qa_with_reference_df = pd.concat([qa_df, ref], axis=1, join="inner").set_index(
        "context.span_id"
    )

qa_with_reference_df.rename(
    columns={"input.value": "input", "output.value": "output"}, inplace=True
)
qa_with_reference_df.head()

Now that we have a dataset of the question, context, and response (input, reference, and output), we can measure how well the LLM is responding to the queries. For details on the QA correctness evaluation, see the [LLM Evals documentation](https://arize.com/docs/phoenix/llm-evals/running-pre-tested-evals/q-and-a-on-retrieved-data).
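
The column names matter because the prompt templates in the next cell use `{input}`, `{reference}`, and `{output}` placeholders. The sketch below shows the general idea with Python's built-in string formatting. It is an assumption that this mirrors how the placeholders are filled (Phoenix's own template rendering may differ), and the helper `render` and the row values are made up for illustration.

```python
template = "[Question]: {input}\n[Reference]: {reference}\n[Answer]: {output}"


def render(template: str, row: dict[str, str]) -> str:
    """Fill each {placeholder} in the template from the matching key in the row."""
    return template.format_map(row)


# A hypothetical row after renaming the columns to match the placeholders.
row = {
    "input": "What did the author do growing up?",
    "reference": "Toy reference text for illustration.",
    "output": "Toy answer for illustration.",
}
prompt = render(template, row)
print(prompt)

# The same row with the raw span attribute names does not match the placeholders.
raw_row = {
    "input.value": row["input"],
    "reference": row["reference"],
    "output.value": row["output"],
}
try:
    render(template, raw_row)
    rename_needed = False
except KeyError:
    rename_needed = True
print(rename_needed)
```

This is why the dataframe above renames `input.value` and `output.value` before it is handed to the evaluators.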

In [None]:
hallucination_prompt = """
In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text. Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters. "hallucinated" indicates that the answer
provides factually inaccurate information to the query based on the reference text. "factual"
indicates that the answer to the question is correct relative to the reference text, and does not
contain made up information. Please read the query and reference text carefully before determining
your response.

    [BEGIN DATA]
    ************
    [Query]: {input}
    ************
    [Reference text]: {reference}
    ************
    [Answer]: {output}
    ************
    [END DATA]

    Is the answer above factual or hallucinated based on the query and reference text?
"""

qa_correctness_prompt = """
You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Please read the query, reference text and answer carefully, then write out in a step by step manner
an EXPLANATION to show how to determine if the answer is "correct" or "incorrect". Avoid simply
stating the correct answer at the outset. Your response LABEL must be a single word, either
"correct" or "incorrect", and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the
answer.

Example response:
************
EXPLANATION: An explanation of your reasoning for why the label is "correct" or "incorrect"
LABEL: "correct" or "incorrect"
************

EXPLANATION:"""

In [None]:
llm = LLM(provider="openai", model="gpt-4")

hallucination_eval = create_classifier(
    name="hallucination",
    prompt_template=hallucination_prompt,
    llm=llm,
    choices={"factual": 1.0, "hallucinated": 0.0},
)
qa_correctness_eval = create_classifier(
    name="qa_correctness",
    prompt_template=qa_correctness_prompt,
    llm=llm,
    choices={"correct": 1.0, "incorrect": 0.0},
)
with suppress_tracing():
    results_df = await async_evaluate_dataframe(
        dataframe=qa_with_reference_df,
        evaluators=[hallucination_eval, qa_correctness_eval],
    )

In [None]:
results_df.head()

#### Observations

Let's now take our results and aggregate them to get a sense of how well the LLM is answering the questions given the context.

In [None]:
# Use a separate name so the qa_correctness_eval evaluator defined earlier is not overwritten
qa_correctness_mean = results_df.qa_correctness_score.apply(
    lambda x: json.loads(x)["score"] if isinstance(x, str) else x
).mean()

print(f"Q&A Correctness Score: {qa_correctness_mean}")
qa_correctness_mean

In [None]:
# Use a separate name so the hallucination_eval evaluator defined earlier is not overwritten
hallucination_mean = results_df.hallucination_score.apply(
    lambda x: json.loads(x)["score"] if isinstance(x, str) else x
).mean()

print(f"Hallucination Mean Score: {hallucination_mean}")
hallucination_mean

Now that we have evaluated our RAG system's QA correctness and hallucination performance, let's send these evaluations to Phoenix for visualization.

In [None]:
from phoenix.evals.utils import to_annotation_dataframe

relevancy_eval_df = to_annotation_dataframe(dataframe=results_df)

client.annotations.log_span_annotations_dataframe(
    dataframe=relevancy_eval_df,
    annotator_kind="LLM",
)

We have now sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! With all the evals in Phoenix, we can analyze them together to determine whether poor retrieval or irrelevant context affects the LLM's ability to generate the correct response.
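
One way to start that analysis outside the UI is to compare answer correctness for queries where retrieval hit against queries where it missed. The sketch below uses hand-written toy rows, not results from this notebook, and the helper `correct_rate_by_hit` and the `hit` and `correct` field names are assumptions for illustration.

```python
from collections import defaultdict


def correct_rate_by_hit(rows: list[dict]) -> dict[bool, float]:
    """Share of correct answers, split by whether retrieval found relevant context."""
    totals: dict[bool, int] = defaultdict(int)
    correct: dict[bool, int] = defaultdict(int)
    for row in rows:
        totals[row["hit"]] += 1
        correct[row["hit"]] += int(row["correct"])
    return {hit: correct[hit] / totals[hit] for hit in totals}


# Toy data: one row per query, pairing the retrieval hit flag with the QA correctness label.
rows = [
    {"hit": True, "correct": True},
    {"hit": True, "correct": True},
    {"hit": True, "correct": False},
    {"hit": False, "correct": False},
    {"hit": False, "correct": True},
]

rates = correct_rate_by_hit(rows)
print(rates)
```

A large gap between the two rates would suggest that retrieval quality is the bottleneck, while similar rates would point toward the generation step instead.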

## Conclusion

We have explored how to build and evaluate a RAG pipeline using LlamaIndex and Phoenix, with a specific focus on evaluating the retrieval system and the generated responses within the pipeline.

Phoenix offers a variety of other evaluations that can be used to assess the performance of your LLM Application. For more details, see the [LLM Evals](https://arize.com/docs/phoenix/llm-evals/llm-evals) documentation.