<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluate RAG with LLM Evals</h1>

In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.

It has the following sections:

1. Understanding Retrieval Augmented Generation (RAG).
2. Building RAG (with the help of a framework such as LlamaIndex).
3. Evaluating RAG with Phoenix Evals.

## Retrieval Augmented Generation (RAG)

LLMs are trained on vast datasets, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLMs but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query then are sent to the LLM along with a prompt, and the LLM provides a response.
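
To make this flow concrete, here is a minimal, dependency-free sketch of the index, retrieve, and prompt steps. It is an illustration only: the bag-of-words "embedding", the `build_index`, `retrieve`, and `build_prompt` helpers, and the sample chunks are all assumptions made for this sketch, not part of LlamaIndex or Phoenix. A real pipeline would use learned vector embeddings and send the final prompt to an LLM.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(chunks: list[str]) -> list[tuple[str, Counter]]:
    """Indexing: pair every chunk with its vector so it can be searched later."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index: list[tuple[str, Counter]], query: str, top_k: int = 2) -> list[str]:
    """Querying: return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context and the query into the prompt sent to the LLM."""
    joined = "\n---\n".join(context)
    return f"Context:\n{joined}\n\nAnswer using only the context above.\nQuestion: {query}"


# Hypothetical sample chunks, loosely modeled on the essay used later in this tutorial
chunks = [
    "Before college I worked on writing and programming.",
    "In 2016 we moved to England.",
    "The first programs I wrote were on the IBM 1401.",
]
index = build_index(chunks)
top = retrieve(index, "What programs did the author write?", top_k=2)
prompt = build_prompt("What programs did the author write?", top)
```

The chunk about the IBM 1401 ranks first because it shares the most words with the query; the prompt then carries both the retrieved context and the question.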

RAG is a critical component for building applications such as chatbots or agents, so it is worth knowing the techniques for getting your data into your application.

<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/RAG_Pipeline.png">

## Stages within RAG

There are five key stages within RAG, each of which will be part of any larger RAG application.

- **Loading**: This refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your pipeline.
- **Indexing**: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
- **Storing**: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.
- **Querying**: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
- **Evaluation**: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.


## Build a RAG system 

Now that we have understood the stages of RAG, let's build a pipeline. We will use [LlamaIndex](https://www.llamaindex.ai/) for RAG and [Phoenix Evals](https://docs.arize.com/phoenix/llm-evals/llm-evals) for evaluation.


In [26]:
!pip install -qq "arize-phoenix[evals]" "llama-index>=0.10.3" "openinference-instrumentation-llama-index>=1.0.0" "llama-index-callbacks-arize-phoenix>=0.1.2" "llama-index-llms-openai" "openai>=1" gcsfs nest_asyncio

In [27]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

import os
from getpass import getpass

import pandas as pd
import phoenix as px
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex, set_global_handler
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI

During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, start the Phoenix application and instrument LlamaIndex.

In [28]:
px.launch_app()

Existing running Phoenix instance detected! Shutting it down and starting a new instance...


🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x29e8678b0>

In [29]:
set_global_handler("arize_phoenix")

For this tutorial we will be using OpenAI for creating synthetic data as well as for evaluation. 

In [30]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

### Load Data and Build an Index

Let's use an [essay by Paul Graham](https://www.paulgraham.com/worked.html) to build our RAG pipeline.

In [31]:
import tempfile
from urllib.request import urlretrieve

with tempfile.NamedTemporaryFile() as tf:
    urlretrieve(
        "https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/data/paul_graham/paul_graham_essay.txt",
        tf.name,
    )
    documents = SimpleDirectoryReader(input_files=[tf.name]).load_data()

In [32]:
# Define an LLM
llm = OpenAI(model="gpt-4")

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)
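
The `chunk_size=512` setting above caps how much text goes into each node. As a rough illustration of the idea, the sketch below splits text into fixed-size windows of whitespace-separated words with an optional overlap. This is a deliberate simplification: `chunk_words` is a hypothetical helper, and the real parser counts tokens rather than words and may split on different boundaries, so its chunks will differ.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Split text into windows of at most chunk_size words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Ten placeholder words, split into windows of four with one word of overlap
sample = " ".join(f"w{i}" for i in range(10))
small_chunks = chunk_words(sample, chunk_size=4, overlap=1)
```

Smaller chunks give the retriever more precise targets but less surrounding context per chunk, which is why chunk size is usually worth tuning alongside the evaluations later in this tutorial.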

Build a QueryEngine and start querying.

In [33]:
query_engine = vector_index.as_query_engine()

In [34]:
response_vector = query_engine.query("What did the author do growing up?")

Check the response that you get from the query.

In [35]:
response_vector.response

'The author focused on writing short stories and programming, particularly experimenting with early versions of Fortran on an IBM 1401 computer during 9th grade.'

By default, LlamaIndex retrieves the two most similar nodes (chunks). You can change that with `vector_index.as_query_engine(similarity_top_k=k)`.

Let's check the text in each of these retrieved nodes.

In [36]:
# First retrieved node
response_vector.source_nodes[0].get_text()

'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack

In [37]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

"I remember taking the boys to the coast on a sunny day in 2015 and figuring out how to deal with some problem involving continuations while I watched them play in the tide pools. It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.\n\nIn the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.\n\nIn the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp, it's a spec rather than an implementation, although like McCarthy's Lisp it's a spec expressed as code.\n\nNow that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also s

Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application.

In [38]:
print("phoenix URL", px.active_session().url)

phoenix URL http://localhost:6006/


We can access the traces by directly pulling the spans from the phoenix session.

In [39]:
spans_df = px.Client().get_spans_dataframe()

In [40]:
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()

Unnamed: 0_level_0,name,span_kind,attributes.input.value,attributes.retrieval.documents
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
a9624da12b296f3c,llm,LLM,,
fc4b91051a10681a,synthesize,CHAIN,What did the author do growing up?,
1338a0c51b753dbb,embedding,EMBEDDING,,
951bd923259c3b0e,retrieve,RETRIEVER,What did the author do growing up?,[{'document.metadata': {'file_path': '/var/fol...
9542c939a540620a,query,CHAIN,What did the author do growing up?,


Note that the traces have captured the documents that were retrieved by the query engine. This is nice because it means we can introspect the documents without having to keep track of them ourselves.

In [41]:
spans_with_docs_df = spans_df[spans_df["attributes.retrieval.documents"].notnull()]

In [42]:
spans_with_docs_df[["attributes.input.value", "attributes.retrieval.documents"]].head()

Unnamed: 0_level_0,attributes.input.value,attributes.retrieval.documents
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1
951bd923259c3b0e,What did the author do growing up?,[{'document.metadata': {'file_path': '/var/fol...


We have built a RAG pipeline and instrumented it using Phoenix Tracing. We now need to evaluate its performance. We can assess our RAG system (the query engine) using Phoenix's LLM Evals. Let's examine how to use these tools to quantify the quality of our retrieval-augmented generation system.

## Evaluation

Evaluation should serve as the primary metric for assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and range of queries.

While it's beneficial to examine individual queries and responses, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.

In a RAG system, evaluation focuses on two critical aspects:

- **Retrieval Evaluation**: Assess the accuracy and relevance of the documents that were retrieved.
- **Response Evaluation**: Measure the appropriateness of the response generated by the system given the context that was provided.

### Generate Question Context Pairs

For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.

For this tutorial, let's use Phoenix's `llm_generate` to help us create the question-context pairs.

First, let's create a dataframe of all the document chunks that we have indexed.

In [43]:
# Let's construct a dataframe of just the documents that are in our index
document_chunks_df = pd.DataFrame({"text": [node.get_text() for node in nodes]})
document_chunks_df.head()

Unnamed: 0,text
0,What I Worked On\n\nFebruary 2021\n\nBefore co...
1,I was puzzled by the 1401. I couldn't figure o...
2,I remember vividly how impressed and envious I...
3,I couldn't have put this into words when I was...
4,The default language at Cornell was a Pascal-l...


Now that we have the document chunks, let's prompt an LLM to generate three questions per chunk. Note that you could manually solicit questions from your team or customers, but this is a quick and easy way to generate a large number of questions.

In [44]:
generate_questions_template = """\
Context information is below.

---------------------
{text}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
3 questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided."

Output the questions in JSON format with the keys question_1, question_2, question_3.
"""

In [45]:
import json

from phoenix.evals import OpenAIModel, llm_generate


def output_parser(response: str, index: int):
    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}


questions_df = llm_generate(
    dataframe=document_chunks_df,
    template=generate_questions_template,
    model=OpenAIModel(
        model="gpt-3.5-turbo",
    ),
    output_parser=output_parser,
    concurrency=20,
)


llm_generate |          | 0/61 (0.0%) | ⏳ 00:00<? | ?it/s
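
The `output_parser` above returns an `__error__` entry whenever the model's reply is not valid JSON. Chat models sometimes wrap JSON in a Markdown code fence, which would trigger that path. The sketch below shows one possible way to tolerate this; `lenient_output_parser` is a hypothetical helper written for illustration, not part of Phoenix, and it keeps the same `(response, index)` signature used above.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code-fence marker, built this way to avoid nesting fences here
FENCED_JSON = re.compile(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, re.DOTALL)


def lenient_output_parser(response: str, index: int) -> dict:
    """Parse JSON from an LLM reply, unwrapping a Markdown code fence if present."""
    match = FENCED_JSON.search(response)
    payload = match.group(1) if match else response
    try:
        return json.loads(payload)
    except json.JSONDecodeError as e:
        # Same error convention as the parser above, so downstream code is unchanged
        return {"__error__": str(e)}


plain = lenient_output_parser('{"question_1": "Q1?"}', 0)
fenced = lenient_output_parser(FENCE + 'json\n{"question_1": "Q1?"}\n' + FENCE, 1)
broken = lenient_output_parser("not json", 2)
```

Rows that still come back with an `__error__` key can be filtered out or regenerated before the questions are used for evaluation.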

In [46]:
questions_df.head()

Unnamed: 0,question_1,question_2,question_3
0,What were the two main things the author worke...,Describe the author's experience with programm...,How did the author's experience with programmi...
1,What was the author's experience with programm...,Describe the author's transition from using th...,How did the author's interest in programming d...
2,What was the author's first experience with pr...,Why did the author initially plan to study phi...,What two specific influences led the author to...
3,What novel by Heinlein inspired the individual...,What programming language did the individual l...,"For their undergraduate thesis, what program d..."
4,What was the default language at Cornell and o...,What was the author's experience with learning...,What realization did the author come to during...


In [47]:
# Construct a dataframe of the questions and the document chunks
questions_with_document_chunk_df = pd.concat([questions_df, document_chunks_df], axis=1)
questions_with_document_chunk_df = questions_with_document_chunk_df.melt(
    id_vars=["text"], value_name="question"
).drop("variable", axis=1)
# If the above step was interrupted, there might be questions missing. Let's run this to clean up the dataframe.
questions_with_document_chunk_df = questions_with_document_chunk_df[
    questions_with_document_chunk_df["question"].notnull()
]
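
The `concat` and `melt` step can be hard to picture. The dependency-free sketch below performs an equivalent reshaping on plain dictionaries: one wide row per chunk with three question columns becomes one long row per (chunk, question) pair, and missing questions are dropped. The `wide_rows` data and the `to_long` helper are made up for illustration.

```python
def to_long(wide_rows: list[dict]) -> list[dict]:
    """Turn one row per chunk into one row per (text, question) pair, skipping missing questions."""
    long_rows = []
    question_keys = ("question_1", "question_2", "question_3")
    # Like melt(), stack column by column: all question_1 values first, then question_2, ...
    for key in question_keys:
        for row in wide_rows:
            question = row.get(key)
            if question is not None:
                long_rows.append({"text": row["text"], "question": question})
    return long_rows


# Hypothetical wide rows; the second chunk is missing its second question
wide_rows = [
    {"text": "chunk A", "question_1": "A1?", "question_2": "A2?", "question_3": "A3?"},
    {"text": "chunk B", "question_1": "B1?", "question_2": None, "question_3": "B3?"},
]
long_rows = to_long(wide_rows)
```

Because the stacking goes column by column, the first rows of the long table hold the first question of every chunk, which is the pattern visible in the preview that follows.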

The LLM has generated three questions per chunk. Let's take a quick look.

In [48]:
questions_with_document_chunk_df.head(10)

Unnamed: 0,text,question
0,What I Worked On\n\nFebruary 2021\n\nBefore co...,What were the two main things the author worke...
1,I was puzzled by the 1401. I couldn't figure o...,What was the author's experience with programm...
2,I remember vividly how impressed and envious I...,What was the author's first experience with pr...
3,I couldn't have put this into words when I was...,What novel by Heinlein inspired the individual...
4,The default language at Cornell was a Pascal-l...,What was the default language at Cornell and o...
5,"I applied to 3 grad schools: MIT and Yale, whi...",What realization did the author come to during...
6,So I looked around to see what I could salvage...,What was the main reason the author decided to...
7,"And indeed, it would seem very feeble work. On...",What realization did the author have while vis...
8,And as an artist you could be truly independen...,What was the author's initial perception of th...
9,I remember when my friend Robert Morris got ki...,What was the topic chosen by the author for th...


### Retrieval Evaluation

We are now ready to perform our retrieval evaluations. We will execute the queries we generated in the previous step and verify whether the correct context is retrieved.

In [49]:
# First things first, let's reset phoenix
px.close_app()
px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x2afc63040>

In [50]:
# loop over the questions and generate the answers
for _, row in questions_with_document_chunk_df.iterrows():
    question = row["question"]
    response_vector = query_engine.query(question)
    print(f"Question: {question}\nAnswer: {response_vector.response}\n")

Question: What were the two main things the author worked on before college?
Answer: The author worked on writing and programming before college.

Question: What was the author's experience with programming on the 1401 computer and why does he not remember any programs he wrote on it?
Answer: The author's experience with programming on the 1401 computer involved using an early version of Fortran where programs had to be typed on punch cards, stacked in the card reader, and then loaded into memory to run. The author found it challenging to work with the 1401 as the only input method was through punched cards, and without any data stored on punched cards, there were limited options for program execution. The author couldn't recall any specific programs written on the 1401 because the programs likely didn't achieve much due to the constraints of the system and the author's limited knowledge of math at that time.

Question: What was the author's first experience with programming and what c

Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context. Let's extract all the retrieved documents from the traces logged to phoenix. (For an in-depth explanation of how to export trace data from the phoenix runtime, consult the [docs](https://docs.arize.com/phoenix/how-to/extract-data-from-spans)).

In [51]:
from phoenix.session.evaluation import get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.Client())
retrieved_documents_df

Unnamed: 0_level_0,Unnamed: 1_level_0,context.trace_id,input,reference,document_score
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
3a9d75a6685a38c4,0,d43eb98002bb650e55c70374561ff8f1,Why is it mentioned that there may exist at le...,And at 50 there was some opportunity cost to s...,0.856829
3a9d75a6685a38c4,1,d43eb98002bb650e55c70374561ff8f1,Why is it mentioned that there may exist at le...,Individually these two phenomena are tedious b...,0.851461
5dfcd3e2bd17873b,0,1757222d5ef48007db5b6ee222745495,How does the author describe the impact of lea...,Surely the biggest source of stress in one's w...,0.841630
5dfcd3e2bd17873b,1,1757222d5ef48007db5b6ee222745495,How does the author describe the impact of lea...,"""You know,"" he said, ""you should make sure Y C...",0.837039
1d8524bf87f53687,0,ab80fe26411d35aedd87816a29017d1a,Why does the author dislike the term 'deal flo...,The YC logo itself is an inside joke: the Viaw...,0.843317
...,...,...,...,...,...
b71599449b06684b,1,375e1f36f66f3e0808d0cea84ab7466b,What was the author's first experience with pr...,What I Worked On\n\nFebruary 2021\n\nBefore co...,0.870292
8304b5cffaf1ee80,0,ad8dfdd419b2c262f06cbe18a0a1af22,What was the author's experience with programm...,I was puzzled by the 1401. I couldn't figure o...,0.893548
8304b5cffaf1ee80,1,ad8dfdd419b2c262f06cbe18a0a1af22,What was the author's experience with programm...,What I Worked On\n\nFebruary 2021\n\nBefore co...,0.880519
ded386711d92c952,0,9dc0d9a3148eb0b405bf73db7d7cc89c,What were the two main things the author worke...,What I Worked On\n\nFebruary 2021\n\nBefore co...,0.843013


Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regard to the query. Note that we've set `provide_explanation=True`, which prompts the LLM to explain its reasoning. This can be useful for debugging and for figuring out potential corrective actions.

In [52]:
from phoenix.evals import (
    RelevanceEvaluator,
    run_evals,
)

relevance_evaluator = RelevanceEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))

retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]

run_evals |          | 0/366 (0.0%) | ⏳ 00:00<? | ?it/s

In [53]:
retrieved_documents_relevance_df.head()

Unnamed: 0_level_0,Unnamed: 1_level_0,label,score,explanation
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
3a9d75a6685a38c4,0,relevant,1,The question asks why it is mentioned that the...
3a9d75a6685a38c4,1,relevant,1,The question asks why it is mentioned that the...
5dfcd3e2bd17873b,0,unrelated,0,The question asks about the author's descripti...
5dfcd3e2bd17873b,1,relevant,1,The question asks about the author's descripti...
1d8524bf87f53687,0,relevant,1,The reference text directly addresses the ques...


We can now combine the documents with the relevance evaluations to compute retrieval metrics. These metrics will help us understand how well the RAG system is performing.

In [54]:
documents_with_relevance_df = pd.concat(
    [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix("eval_")], axis=1
)
documents_with_relevance_df

Unnamed: 0_level_0,Unnamed: 1_level_0,context.trace_id,input,reference,document_score,eval_label,eval_score,eval_explanation
context.span_id,document_position,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
3a9d75a6685a38c4,0,d43eb98002bb650e55c70374561ff8f1,Why is it mentioned that there may exist at le...,And at 50 there was some opportunity cost to s...,0.856829,relevant,1,The question asks why it is mentioned that the...
3a9d75a6685a38c4,1,d43eb98002bb650e55c70374561ff8f1,Why is it mentioned that there may exist at le...,Individually these two phenomena are tedious b...,0.851461,relevant,1,The question asks why it is mentioned that the...
5dfcd3e2bd17873b,0,1757222d5ef48007db5b6ee222745495,How does the author describe the impact of lea...,Surely the biggest source of stress in one's w...,0.841630,unrelated,0,The question asks about the author's descripti...
5dfcd3e2bd17873b,1,1757222d5ef48007db5b6ee222745495,How does the author describe the impact of lea...,"""You know,"" he said, ""you should make sure Y C...",0.837039,relevant,1,The question asks about the author's descripti...
1d8524bf87f53687,0,ab80fe26411d35aedd87816a29017d1a,Why does the author dislike the term 'deal flo...,The YC logo itself is an inside joke: the Viaw...,0.843317,relevant,1,The reference text directly addresses the ques...
...,...,...,...,...,...,...,...,...
b71599449b06684b,1,375e1f36f66f3e0808d0cea84ab7466b,What was the author's first experience with pr...,What I Worked On\n\nFebruary 2021\n\nBefore co...,0.870292,relevant,1,The question asks for two specific pieces of i...
8304b5cffaf1ee80,0,ad8dfdd419b2c262f06cbe18a0a1af22,What was the author's experience with programm...,I was puzzled by the 1401. I couldn't figure o...,0.893548,relevant,1,The reference text directly addresses the ques...
8304b5cffaf1ee80,1,ad8dfdd419b2c262f06cbe18a0a1af22,What was the author's experience with programm...,What I Worked On\n\nFebruary 2021\n\nBefore co...,0.880519,relevant,1,The reference text directly addresses the ques...
ded386711d92c952,0,9dc0d9a3148eb0b405bf73db7d7cc89c,What were the two main things the author worke...,What I Worked On\n\nFebruary 2021\n\nBefore co...,0.843013,relevant,1,The question asks about the two main activitie...


Let's compute Normalized Discounted Cumulative Gain ([NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)) at 2 for all our retrieval steps. In information retrieval, this metric is often used to measure the effectiveness of search engine algorithms and related applications.
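
To build intuition for the metric, here is a small pure-Python sketch of NDCG@k using a common formulation: linear gain with a log2 position discount, normalized by the score of the ideal ordering. The helper names are ours, and the sketch assumes the relevance labels are already listed in the order the retriever ranked the documents; library implementations may treat tied scores differently.

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain: rel_i / log2(i + 1), summed over the top k positions (1-indexed)."""
    return sum(rel / math.log2(pos + 1) for pos, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """NDCG@k: DCG of the observed ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Relevance labels listed in the order the retriever ranked the documents
perfect = ndcg_at_k([1, 1], k=2)           # both retrieved documents are relevant
relevant_second = ndcg_at_k([0, 1], k=2)   # only the second-ranked document is relevant
nothing_relevant = ndcg_at_k([0, 0], k=2)  # no relevant document was retrieved
```

When only the second-ranked document is relevant, the score is 1 / log2(3), or about 0.63, because the relevant document is discounted for appearing below an irrelevant one.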

In [55]:
import numpy as np
from sklearn.metrics import ndcg_score


def _compute_ndcg(df: pd.DataFrame, k: int):
    """Compute NDCG@k in the presence of missing values"""
    n = max(2, len(df))
    eval_scores = np.zeros(n)
    doc_scores = np.zeros(n)
    eval_scores[: len(df)] = df.eval_score
    doc_scores[: len(df)] = df.document_score
    try:
        return ndcg_score([eval_scores], [doc_scores], k=k)
    except ValueError:
        return np.nan


ndcg_at_2 = pd.DataFrame(
    {"score": documents_with_relevance_df.groupby("context.span_id").apply(_compute_ndcg, k=2)}
)

In [56]:
ndcg_at_2

Unnamed: 0_level_0,score
context.span_id,Unnamed: 1_level_1
026b2abd5dcb9849,1.0
026c709c4b5a4e09,1.0
03e2f09afc0aa0b1,1.0
05234f3093317c5f,1.0
05eee051d13f2a7d,1.0
...,...
fca2541676815e1a,1.0
fd44acf228788f3b,1.0
fdecfc1e376821de,1.0
fee739fffff0cf58,1.0


Let's also compute precision at 2 for all our retrieval steps.

In [57]:
precision_at_2 = pd.DataFrame(
    {
        "score": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) / 2
        )
    }
)

In [58]:
precision_at_2

Unnamed: 0_level_0,score
context.span_id,Unnamed: 1_level_1
026b2abd5dcb9849,1.0
026c709c4b5a4e09,0.5
03e2f09afc0aa0b1,1.0
05234f3093317c5f,0.5
05eee051d13f2a7d,1.0
...,...
fca2541676815e1a,1.0
fd44acf228788f3b,1.0
fdecfc1e376821de,1.0
fee739fffff0cf58,1.0


Lastly, let's compute whether a correct document was retrieved at all for each query (i.e., a hit).

In [59]:
hit = pd.DataFrame(
    {
        "hit": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) > 0
        )
    }
)
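
The precision and hit computations above reduce to simple arithmetic on the relevance labels of the top two documents. The sketch below restates them in plain Python with hypothetical helper names. Unlike the pandas version, which uses `skipna=False` so that a missing evaluation yields a missing result, this simplified version assumes every label is present.

```python
def precision_at_k(relevances: list[int], k: int = 2) -> float:
    """Fraction of the top-k slots that hold a relevant document."""
    return sum(relevances[:k]) / k


def hit_at_k(relevances: list[int], k: int = 2) -> bool:
    """True when at least one of the top-k retrieved documents is relevant."""
    return sum(relevances[:k]) > 0


# Hypothetical relevance labels (1 = relevant, 0 = unrelated) in ranked order
examples = {"both relevant": [1, 1], "one relevant": [0, 1], "none relevant": [0, 0]}
summary = {name: (precision_at_k(rels), hit_at_k(rels)) for name, rels in examples.items()}
```

A query can therefore count as a hit while still having a precision of 0.5, which is the case where relevant and irrelevant context are mixed.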

Let's now view the results in a combined dataframe.

In [60]:
retrievals_df = px.Client().get_spans_dataframe("span_kind == 'RETRIEVER'")
rag_evaluation_dataframe = pd.concat(
    [
        retrievals_df["attributes.input.value"],
        ndcg_at_2.add_prefix("ncdg@2_"),
        precision_at_2.add_prefix("precision@2_"),
        hit,
    ],
    axis=1,
)
rag_evaluation_dataframe

Unnamed: 0_level_0,attributes.input.value,ncdg@2_score,precision@2_score,hit
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
3a9d75a6685a38c4,Why is it mentioned that there may exist at le...,1.00000,1.0,True
5dfcd3e2bd17873b,How does the author describe the impact of lea...,0.63093,0.5,True
1d8524bf87f53687,Why does the author dislike the term 'deal flo...,1.00000,1.0,True
43c475deb5849b80,Discuss the lesson learned from the author's e...,1.00000,1.0,True
4640d0a90cc272e5,Discuss the relationship between money and coo...,0.00000,0.0,False
...,...,...,...,...
3b86bcd4bcd0a600,What was the default language at Cornell and o...,1.00000,0.5,True
f3205a8197cb8e6d,What novel by Heinlein inspired the individual...,1.00000,1.0,True
b71599449b06684b,What was the author's first experience with pr...,1.00000,1.0,True
8304b5cffaf1ee80,What was the author's experience with programm...,1.00000,1.0,True


### Observations

Let's now take our results and aggregate them to get a sense of how well our RAG system is performing.

In [61]:
# Aggregate the scores across the retrievals
results = rag_evaluation_dataframe.mean(numeric_only=True)
results

ncdg@2_score         0.934685
precision@2_score    0.849727
hit                  0.950820
dtype: float64

As the numbers above show, our RAG system is not perfect. At times it fails to retrieve the correct context within the first two documents (the hit rate is about 0.95). At other times the correct context is among the top two results, but irrelevant information is retrieved alongside it (precision@2 is about 0.85). This indicates that the retrieval strategy could be improved. One possible approach is to retrieve more documents and then apply a more sophisticated ranking strategy (such as a reranker) to select the correct context.
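
As a rough illustration of the "retrieve more, then rerank" idea, the sketch below over-retrieves candidates by a first-stage score and reorders them with a second scorer before keeping the top two. Everything here is hypothetical: the `candidates`, their scores, and the word-overlap `rerank_score` stand in for a real retriever and a real reranking model, and nothing in this tutorial measures whether reranking would actually help.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set, used by the toy reranker below."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank_score(query: str, document: str) -> float:
    """Toy second-stage scorer: share of query words found in the document."""
    q = tokens(query)
    return len(q & tokens(document)) / len(q) if q else 0.0


def retrieve_then_rerank(query, candidates, fetch_k=4, final_k=2):
    """Take the fetch_k best candidates by first-stage score, then keep the final_k best by rerank_score."""
    first_stage = sorted(candidates, key=lambda c: c[1], reverse=True)[:fetch_k]
    reranked = sorted(first_stage, key=lambda c: rerank_score(query, c[0]), reverse=True)
    return [doc for doc, _ in reranked[:final_k]]


# (document text, hypothetical first-stage similarity score)
candidates = [
    ("We moved to England in 2016.", 0.86),
    ("Y Combinator funded startups in batches.", 0.85),
    ("The author wrote programs on the IBM 1401.", 0.84),
    ("The author painted still lifes.", 0.83),
    ("Lisp was the language of AI.", 0.60),
]
final_docs = retrieve_then_rerank("What programs did the author write on the 1401?", candidates)
```

With only the top two first-stage results, the IBM 1401 document would have been missed; fetching four and reranking moves it to the front.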

We have now evaluated our RAG system's retrieval performance. Let's send these evaluations to Phoenix for visualization. By sending the evaluations to Phoenix, you will be able to view the evaluations alongside the traces that were captured earlier.

In [62]:
from phoenix.trace import DocumentEvaluations, SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(dataframe=ndcg_at_2, eval_name="ndcg@2"),
    SpanEvaluations(dataframe=precision_at_2, eval_name="precision@2"),
    DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)

### Response Evaluation

The retrieval evaluations demonstrate that our RAG system is not perfect. However, it's possible that the LLM is able to generate the correct response even when the context is incorrect. Let's evaluate the responses generated by the LLM.

In [63]:
from phoenix.session.evaluation import get_qa_with_reference

qa_with_reference_df = get_qa_with_reference(px.Client())
qa_with_reference_df

Unnamed: 0_level_0,input,output,reference
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0e920887bad0f2bc,Why is it mentioned that there may exist at le...,Presumably aliens need numbers and errors and ...,And at 50 there was some opportunity cost to s...
a424561f80b13e37,How does the author describe the impact of lea...,The author describes the impact of leaving YC ...,Surely the biggest source of stress in one's w...
b12bafbd789ea6df,Why does the author dislike the term 'deal flo...,The author dislikes the term 'deal flow' becau...,The YC logo itself is an inside joke: the Viaw...
ef38d55b343fb6c2,Discuss the lesson learned from the author's e...,The lesson learned from the author's experienc...,"Customary VC practice had once, like the custo..."
aaf2cd7f34c25184,Discuss the relationship between money and coo...,"In the art world, there is a common perception...",You want to emphasize the visual cues that tel...
...,...,...,...
471dd22d80405a26,What was the default language at Cornell and o...,The default language at Cornell and other univ...,The default language at Cornell was a Pascal-l...
20ead71f6c7ba0bb,What novel by Heinlein inspired the individual...,The Moon is a Harsh Mistress,I couldn't have put this into words when I was...
a9c939cbbb74461b,What was the author's first experience with pr...,The author's first experience with programming...,I remember vividly how impressed and envious I...
5f0ab100e5472478,What was the author's experience with programm...,The author's experience with programming on th...,I was puzzled by the 1401. I couldn't figure o...


Now that we have a dataset of the question, context, and response (input, reference, and output), we can measure how well the LLM is responding to the queries. For details on the QA correctness evaluation, see the [LLM Evals documentation](https://docs.arize.com/phoenix/llm-evals/running-pre-tested-evals/q-and-a-on-retrieved-data).

In [64]:
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))

qa_correctness_eval_df, hallucination_eval_df = run_evals(
    evaluators=[qa_evaluator, hallucination_evaluator],
    dataframe=qa_with_reference_df,
    provide_explanation=True,
    concurrency=20,
)

run_evals |          | 0/366 (0.0%) | ⏳ 00:00<? | ?it/s

Exception in worker on attempt 1: raised InternalServerError('<html>\r\n<head><title>502 Bad Gateway</title></head>\r\n<body>\r\n<center><h1>502 Bad Gateway</h1></center>\r\n<hr><center>cloudflare</center>\r\n</body>\r\n</html>')
Requeuing...


In [65]:
qa_correctness_eval_df.head()

Unnamed: 0_level_0,label,score,explanation
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0e920887bad0f2bc,incorrect,0,The question asks why it is mentioned that the...
a424561f80b13e37,incorrect,0,The reference text provides a detailed account...
b12bafbd789ea6df,correct,1,The question asks for two pieces of informatio...
ef38d55b343fb6c2,correct,1,The given answer accurately captures the essen...
aaf2cd7f34c25184,incorrect,0,The question asks about the relationship betwe...


In [66]:
hallucination_eval_df.head()

Unnamed: 0_level_0,label,score,explanation
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0e920887bad0f2bc,factual,0,The query asks why it is mentioned that there ...
a424561f80b13e37,hallucinated,1,The reference text provides detailed informati...
b12bafbd789ea6df,factual,0,The answer provided directly reflects the info...
ef38d55b343fb6c2,factual,0,The answer accurately reflects the content and...
aaf2cd7f34c25184,factual,0,The answer discusses the perception of still l...


#### Observations

Let's now take our results and aggregate them to get a sense of how well the LLM is answering the questions given the context.

In [67]:
qa_correctness_eval_df.mean(numeric_only=True)

score    0.879781
dtype: float64

In [68]:
hallucination_eval_df.mean(numeric_only=True)

score    0.081967
dtype: float64

Our QA Correctness score of about `0.88` and Hallucination score of about `0.08` indicate that the generated answers are correct roughly 88% of the time and that the responses contain hallucinations roughly 8% of the time, so there is room for improvement. This could be due to the retrieval strategy or the LLM itself. We will need to investigate further to determine the root cause.

Since we have evaluated our RAG system's QA performance and Hallucinations performance, let's send these evaluations to Phoenix for visualization.

In [69]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
    SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
)

We have now sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! With all the evals in one place, we can analyze them together to determine whether poor retrieval or irrelevant context affects the LLM's ability to generate the correct response.

In [70]:
print("phoenix URL", px.active_session().url)

phoenix URL http://localhost:6006/
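
The question of whether poor retrieval hurts response quality can also be explored outside the UI by grouping the Q&A correctness scores by whether retrieval produced a hit. The sketch below does this on a handful of made-up records; the `records` list, its field names, and the `correctness_by_hit` helper are assumptions for illustration. With real data you would first need to join the retrieval and response evaluations, for example via the trace ID, since the dataframes above show them keyed by different span IDs.

```python
from collections import defaultdict


def correctness_by_hit(records: list[dict]) -> dict[bool, float]:
    """Average Q&A correctness score, split by whether retrieval found a relevant document."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["hit"]].append(record["qa_score"])
    return {hit: sum(scores) / len(scores) for hit, scores in buckets.items()}


# Made-up per-query results: retrieval hit flag and Q&A correctness score (1 = correct)
records = [
    {"hit": True, "qa_score": 1},
    {"hit": True, "qa_score": 1},
    {"hit": True, "qa_score": 0},
    {"hit": False, "qa_score": 0},
    {"hit": False, "qa_score": 1},
]
rates = correctness_by_hit(records)
```

A large gap between the two averages would suggest that retrieval quality is the main driver of incorrect answers; similar averages would point toward the LLM or the prompt instead.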


## Conclusion

We have explored how to build and evaluate a RAG pipeline using LlamaIndex and Phoenix, with a specific focus on evaluating the retrieval system and the generated responses within the pipeline.

Phoenix offers a variety of other evaluations that can be used to assess the performance of your LLM Application. For more details, see the [LLM Evals](https://docs.arize.com/phoenix/llm-evals/llm-evals) documentation.