<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluate RAG with LLM Evals</h1>

In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.

It has the following sections:

1. Understanding Retrieval Augmented Generation (RAG).
2. Building RAG (with the help of a framework such as LlamaIndex and LLM providers like Mistral).
3. Evaluating RAG with Evals.

## Retrieval Augmented Generation (RAG)

LLMs are trained on vast datasets, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLM but by allowing the model to access and use your data in real time, so that it can provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query then are sent to the LLM along with a prompt, and the LLM provides a response.

RAG is a critical component for building applications such as chatbots or agents, so it is worth knowing the RAG techniques for getting your data into your application.

<img src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/RAG_Pipeline.png" width="800px">

## Stages within RAG

There are five key stages within RAG, which will in turn be a part of any larger RAG application.

- **Loading**: This refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your pipeline.
- **Indexing**: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
- **Storing**: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.

- **Querying**: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies. 
- **Evaluation**: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.
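To make the first four stages concrete, here is a minimal, framework-free sketch. It is only an illustration: the bag-of-words `embed` function stands in for a real embedding model, and the three chunks, `retrieve`, and `build_prompt` are hypothetical names that are not part of LlamaIndex or Phoenix.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Loading: in a real pipeline these chunks would come from files, a database, or an API.
chunks = [
    "Phoenix collects traces from LLM applications.",
    "LlamaIndex builds a vector index over document chunks.",
    "Paul Graham wrote essays about startups and Lisp.",
]

# Indexing and storing: keep each chunk next to its vector so it is not recomputed per query.
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Querying: rank stored chunks by similarity to the query and keep the best top_k."""
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context and the query into the prompt sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


top_chunks = retrieve("Who wrote essays about Lisp?")
prompt = build_prompt("Who wrote essays about Lisp?", top_chunks)
print(prompt)
```

The fifth stage, evaluation, is what the rest of this tutorial covers.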


## Build a RAG system 

Now that we have understood the stages of RAG, let's build a pipeline. We will use [LlamaIndex](https://www.llamaindex.ai/) for RAG and [Phoenix Evals](https://docs.arize.com/phoenix/llm-evals/llm-evals) for evaluation.


In [64]:
!pip install -qq "arize-phoenix" "arize-phoenix-evals>=0.4.0" "llama-index==0.10.19" "llama-index-llms-mistralai" "llama-index-embeddings-mistralai"  "openinference-instrumentation-llama-index>=1.0.0" "llama-index-callbacks-arize-phoenix>=0.1.2" gcsfs nest_asyncio

In [94]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

import os
from getpass import getpass

import pandas as pd
import phoenix as px
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex, set_global_handler
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.embeddings.mistralai import MistralAIEmbedding
from llama_index.llms.mistralai import MistralAI
from phoenix.trace import using_project

First, let's set up the environment and a few constants that we will use throughout the tutorial.

In [95]:
# Setup projects to collect tracing under
os.environ["PHOENIX_PROJECT_NAME"] = "mistral-rag" # Collect traces under the project "mistral-rag"
INDEXING_PROJECT = "indexing" # For llama-index indexing
TESTSET_PROJECT = "testset" # For capturing synthetic testset traces

During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, simply start the phoenix application and instrument LlamaIndex.

In [None]:
px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x17c7a5750>


In [67]:
set_global_handler("arize_phoenix")

For this tutorial we will use Mistral for the RAG pipeline itself, for creating synthetic data, and for the retrieval evaluation. The response evaluation at the end uses OpenAI models.

In [97]:
if not (mistral_api_key := os.getenv("MISTRAL_API_KEY")):
    mistral_api_key = getpass("🔑 Enter your MISTRAL API key: ")
os.environ["MISTRAL_API_KEY"] = mistral_api_key

### Load Data and Build an Index

Let's use an [essay by Paul Graham](https://www.paulgraham.com/worked.html) to build our RAG pipeline.

In [98]:
import tempfile
from urllib.request import urlretrieve

with tempfile.NamedTemporaryFile() as tf:
    urlretrieve(
        "https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/data/paul_graham/paul_graham_essay.txt",
        tf.name,
    )
    documents = SimpleDirectoryReader(input_files=[tf.name]).load_data()

In [99]:
# Define an LLM
llm = MistralAI(model="mistral-large-latest")
Settings.llm = llm
Settings.embed_model = MistralAIEmbedding()

with using_project(INDEXING_PROJECT): # Collect traces under the project "indexing"
    # Build index with a chunk_size of 512
    node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
    nodes = node_parser.get_nodes_from_documents(documents)
    vector_index = VectorStoreIndex(nodes)

Build a QueryEngine and start querying.

In [100]:
query_engine = vector_index.as_query_engine()

In [101]:
response_vector = query_engine.query("What did the author do growing up?")

Check the response that you get from the query.

In [102]:
response_vector.response

"The author didn't have a natural talent for drawing in high school, but he was closer to the tribe of kids who could draw than those seeking a signature style. He later attended the Rhode Island School of Design (RISD) where he learned a lot in a color class, but mostly taught himself to paint. In 1993, he dropped out of RISD and moved to a rent-controlled apartment in New York, becoming a New York artist in the technical sense of making paintings and living in New York. He was nervous about money and decided to write another book on Lisp to live frugally off the royalties and spend all his time painting. He also had the privilege of knowing Idelle Weber, a painter and one of the early photorealists, whose painting class he took at Harvard."

By default, LlamaIndex retrieves the two most similar nodes/chunks. You can change that with `vector_index.as_query_engine(similarity_top_k=k)`.
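As a rough illustration of what changing `k` does, the sketch below selects the top-k chunks from a set of similarity scores. The chunk ids and scores are made up for the example, and `top_k_chunks` is a hypothetical helper, not a LlamaIndex function.

```python
import heapq

# Hypothetical similarity scores between one query and five indexed chunks.
scores = {"chunk_a": 0.71, "chunk_b": 0.84, "chunk_c": 0.62, "chunk_d": 0.78, "chunk_e": 0.55}


def top_k_chunks(scores: dict[str, float], k: int) -> list[str]:
    """Return the ids of the k highest-scoring chunks, best first."""
    return heapq.nlargest(k, scores, key=scores.get)


default_result = top_k_chunks(scores, k=2)  # mirrors the default of two retrieved nodes
wider_result = top_k_chunks(scores, k=4)  # a larger k also admits lower-scoring chunks
print(default_result, wider_result)
```

A larger `k` makes it more likely that the right chunk is somewhere in the context, at the cost of also passing less relevant text to the LLM.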

Let's check the text in each of these retrieved nodes.

In [103]:
# First retrieved node
response_vector.source_nodes[0].get_text()

"I certainly did. So at the end of the summer Dan and I switched to working on this new dialect of Lisp, which I called Arc, in a house I bought in Cambridge.\n\nThe following spring, lightning struck. I was invited to give a talk at a Lisp conference, so I gave one about how we'd used Lisp at Viaweb. Afterward I put a postscript file of this talk online, on paulgraham.com, which I'd created years before using Viaweb but had never used for anything. In one day it got 30,000 page views. What on earth had happened? The referring urls showed that someone had posted it on Slashdot. [10]\n\nWow, I thought, there's an audience. If I write something and put it on the web, anyone can read it. That may seem obvious now, but it was surprising then. In the print era there was a narrow channel to readers, guarded by fierce monsters known as editors. The only way to get an audience for anything you wrote was to get it published as a book, or in a newspaper or magazine. Now anyone could publish anyt

In [104]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

"I was not one of the kids who could draw in high school, but at RISD I was definitely closer to their tribe than the tribe of signature style seekers.\n\nI learned a lot in the color class I took at RISD, but otherwise I was basically teaching myself to paint, and I could do that for free. So in 1993 I dropped out. I hung around Providence for a bit, and then my college friend Nancy Parmet did me a big favor. A rent-controlled apartment in a building her mother owned in New York was becoming vacant. Did I want it? It wasn't much more than my current place, and New York was supposed to be where the artists were. So yes, I wanted it! [7]\n\nAsterix comics begin by zooming in on a tiny corner of Roman Gaul that turns out not to be controlled by the Romans. You can do something similar on a map of New York City: if you zoom in on the Upper East Side, there's a tiny corner that's not rich, or at least wasn't in 1993. It's called Yorkville, and that was my new home. Now I was a New York art

Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the phoenix application.

In [105]:
print("phoenix URL", px.active_session().url)

phoenix URL http://localhost:6006/


We can access the traces by directly pulling the spans from the phoenix session.

In [106]:
spans_df = px.Client().get_spans_dataframe()

In [15]:
spans_df[["name", "span_kind", "attributes.input.value", "attributes.retrieval.documents"]].head()

| context.span_id | name | span_kind | attributes.input.value | attributes.retrieval.documents |
| --- | --- | --- | --- | --- |
| 7fc7e248de69a46b | llm | LLM | | |
| a7f75bfaccbc9b15 | chunking | CHAIN | | |
| 32270b7ac9eee5f2 | chunking | CHAIN | | |
| cd07c7eba44b349f | synthesize | CHAIN | What did the author do growing up? | |
| f551c60080ae6be3 | embedding | EMBEDDING | | |


Note that the traces have captured the documents that were retrieved by the query engine. This is nice because it means we can introspect the documents without having to keep track of them ourselves.

In [16]:
spans_with_docs_df = spans_df[spans_df["attributes.retrieval.documents"].notnull()]

In [17]:
spans_with_docs_df[["attributes.input.value", "attributes.retrieval.documents"]].head()

| context.span_id | attributes.input.value | attributes.retrieval.documents |
| --- | --- | --- |
| 2d51fccd2d7a5403 | What did the author do growing up? | [{'document.content': 'I certainly did. So at ... |


We have built a RAG pipeline and instrumented it using Phoenix Tracing. We now need to evaluate its performance. We can assess our RAG system/query engine using Phoenix's LLM Evals. Let's examine how to use these tools to quantify the quality of our retrieval-augmented generation system.

## Evaluation

Evaluation should be your primary means of assessing your RAG application. It determines whether the pipeline will produce accurate responses based on the data sources and range of queries.

While it's beneficial to examine individual queries and responses, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.

In a RAG system, evaluation focuses on two critical aspects:

- **Retrieval Evaluation**: Assess the accuracy and relevance of the documents that were retrieved.
- **Response Evaluation**: Measure the appropriateness of the response generated by the system given the context that was provided.

### Generate Question Context Pairs

For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.

For this tutorial, let's use Phoenix's `llm_generate` to help us create the question-context pairs.

First, let's create a dataframe of all the document chunks that we have indexed.

In [108]:
# Let's construct a dataframe of just the documents that are in our index
document_chunks_df = pd.DataFrame({"text": [node.get_text() for node in nodes]})
document_chunks_df.head()

| | text |
| --- | --- |
| 0 | What I Worked On\n\nFebruary 2021\n\nBefore co... |
| 1 | I was puzzled by the 1401. I couldn't figure o... |
| 2 | I remember vividly how impressed and envious I... |
| 3 | I couldn't have put this into words when I was... |
| 4 | The default language at Cornell was a Pascal-l... |


Now that we have the document chunks, let's prompt an LLM to generate three questions per chunk. Note that you could manually solicit questions from your team or customers, but this is a quick and easy way to generate a large number of questions.

In [109]:
generate_questions_template = """\
Context information is below.

---------------------
{text}
---------------------

Given the context information and not prior knowledge,
generate only questions based on the below query.

You are a Teacher/Professor. Your task is to set up \
3 questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.

Output the questions in JSON format with the keys question_1, question_2, question_3.
"""

In [110]:
import json

from phoenix.evals import MistralAIModel, llm_generate


def output_parser(response: str, index: int):
    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}


with using_project(TESTSET_PROJECT): # Collect traces under the project "testset"
    questions_df = llm_generate(
        dataframe=document_chunks_df,
        template=generate_questions_template,
        model=MistralAIModel(
            model="mistral-large-latest"
        ),
        output_parser=output_parser,
        concurrency=20,
    )

llm_generate |          | 0/61 (0.0%) | ⏳ 00:00<? | ?it/s
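The `output_parser` above records an `__error__` entry whenever the response is not valid JSON. Models sometimes wrap their JSON in a Markdown code fence, which would fail that strict parse. The sketch below is one possible, more tolerant variant; `tolerant_output_parser` is a hypothetical name, and the assumption that the model may reply with a fenced block is ours, not something the tutorial's run demonstrates.

```python
import json
import re

# Three backticks, built indirectly so this example can itself sit inside a fenced block.
FENCE = "`" * 3
FENCED_JSON = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def tolerant_output_parser(response: str, index: int) -> dict:
    """Parse JSON from an LLM response, tolerating a surrounding Markdown code fence."""
    text = response.strip()
    fenced = FENCED_JSON.match(text)
    if fenced:
        # Keep only the payload between the opening and closing fence.
        text = fenced.group(1)
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        # Same error convention as the parser used in this tutorial.
        return {"__error__": str(e)}


plain = tolerant_output_parser('{"question_1": "a"}', 0)
wrapped = tolerant_output_parser(FENCE + 'json\n{"question_1": "a"}\n' + FENCE, 1)
broken = tolerant_output_parser("not json at all", 2)
print(plain, wrapped, broken)
```

It keeps the same `(response, index)` signature, so it could be passed as `output_parser` in place of the original.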

In [23]:
questions_df.head()

| | question_1 | question_2 | question_3 |
| --- | --- | --- | --- |
| 0 | What type of stories did the author write befo... | Describe the author's first experience with pr... | What were the limitations the author faced whi... |
| 1 | What was the first microcomputer the author's ... | What was the author's first personal computer,... | What were some of the programs the author wrot... |
| 2 | What type of computer did the speaker convince... | Why did the speaker initially plan to study ph... | What were the two specific influences that mad... |
| 3 | What was the name of the novel by Heinlein tha... | What programming language was regarded as the ... | What was the subject of the individual's under... |
| 4 | What programming language did the author learn... | What was the subject of the author's undergrad... | What realization did the author come to during... |


In [111]:
# Construct a dataframe of the questions and the document chunks
questions_with_document_chunk_df = pd.concat([questions_df, document_chunks_df], axis=1)
questions_with_document_chunk_df = questions_with_document_chunk_df.melt(
    id_vars=["text"], value_name="question"
).drop("variable", axis=1)
# If the above step was interrupted, there might be questions missing. Let's run this to clean up the dataframe.
questions_with_document_chunk_df = questions_with_document_chunk_df[
    questions_with_document_chunk_df["question"].notnull()
]

The LLM has generated three questions per chunk. Let's take a quick look.

In [112]:
questions_with_document_chunk_df.head(10)

| | text | question |
| --- | --- | --- |
| 0 | What I Worked On\n\nFebruary 2021\n\nBefore co... | What type of stories did the author write befo... |
| 1 | I was puzzled by the 1401. I couldn't figure o... | What was the first microcomputer the author's ... |
| 2 | I remember vividly how impressed and envious I... | What type of computer did the speaker convince... |
| 3 | I couldn't have put this into words when I was... | What was the name of the novel by Heinlein tha... |
| 4 | The default language at Cornell was a Pascal-l... | What programming language did the author learn... |
| 5 | I applied to 3 grad schools: MIT and Yale, whi... | What led the author to realize that AI, as pra... |
| 6 | So I looked around to see what I could salvage... | What programming language did the author decid... |
| 7 | And indeed, it would seem very feeble work. On... | What realization did the author have at the Ca... |
| 8 | And as an artist you could be truly independen... | What was the author's initial perception about... |
| 9 | I remember when my friend Robert Morris got ki... | Why did the author decide to write his dissert... |


### Retrieval Evaluation

We are now ready to perform our retrieval evaluations. We will execute the queries we generated in the previous step and verify whether the correct context is retrieved.

In [113]:
# loop over the questions and generate the answers
for _, row in questions_with_document_chunk_df.iterrows():
    question = row["question"]
    response_vector = query_engine.query(question)
    print(f"Question: {question}\nAnswer: {response_vector.response}\n")

Question: What type of stories did the author write before college and what were their main characteristics?
Answer: Before college, the author wrote short stories. These stories were not particularly good, lacking a well-defined plot. Instead, they were characterized by strong emotions attributed to the characters, which the author mistakenly believed added depth to the narratives.

Question: What was the first microcomputer the author's friend built from a kit, and who sold this kit?
Answer: The first microcomputer the author's friend built from a kit was sold by Heathkit. The specific model name is not mentioned in the provided context.

Question: What type of computer did the speaker convince his father to buy in around 1980, and what were some of the programs he wrote using this computer?
Answer: The speaker convinced his father to buy a TRS-80 computer around 1980. With this computer, he wrote various programs including simple games, a program to predict the height his model rock

KeyboardInterrupt: 

Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context. Let's extract all the retrieved documents from the traces logged to phoenix. (For an in-depth explanation of how to export trace data from the phoenix runtime, consult the [docs](https://docs.arize.com/phoenix/how-to/extract-data-from-spans)).

In [115]:
from phoenix.session.evaluation import get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.Client())
retrieved_documents_df

| context.span_id | document_position | context.trace_id | input | reference | document_score |
| --- | --- | --- | --- | --- | --- |
| f2a69823504dc8a7 | 0 | ca9cc579613fd70fe08379ccd144d9dc | What is the distinctive feature of YC accordin... | The YC logo itself is an inside joke: the Viaw... | 0.784029 |
| f2a69823504dc8a7 | 1 | ca9cc579613fd70fe08379ccd144d9dc | What is the distinctive feature of YC accordin... | The part we got first was to be an angel firm.... | 0.779252 |
| 84327f2121ec5cee | 0 | 907a37a76cc5a63a8b3d1aee0ea1338b | What was the unique approach of YC in comparis... | The part we got first was to be an angel firm.... | 0.840994 |
| 84327f2121ec5cee | 1 | 907a37a76cc5a63a8b3d1aee0ea1338b | What was the unique approach of YC in comparis... | Customary VC practice had once, like the custo... | 0.819195 |
| 86e6a3bef9148314 | 0 | 98888589f8a871880ebf87d09df88452 | What event triggered the speaker's decision to... | In early 2005 she interviewed for a marketing ... | 0.780301 |
| ... | ... | ... | ... | ... | ... |
| 005e71071e2e29bf | 1 | 38ec9569382c6694f4f34906c53bddfc | What was the first microcomputer the author's ... | To call this a difficult sale would be an unde... | 0.745964 |
| 7d62808e39ca85ea | 0 | b8a90d2acecd9606e3b9203f98c98334 | What type of stories did the author write befo... | What I Worked On\n\nFebruary 2021\n\nBefore co... | 0.740611 |
| 7d62808e39ca85ea | 1 | b8a90d2acecd9606e3b9203f98c98334 | What type of stories did the author write befo... | At least not the painting department. The text... | 0.737859 |
| b74e125569a4684e | 0 | b380f44b5311621708bbcb2a4b09ee4b | What did the author do growing up? | I certainly did. So at the end of the summer D... | 0.733581 |


Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regard to the query. Note that we've turned on `provide_explanation`, which prompts the LLM to explain its reasoning. This can be useful for debugging and for figuring out potential corrective actions.

In [116]:
from phoenix.evals import (
    RelevanceEvaluator,
    run_evals,
)

relevance_evaluator = RelevanceEvaluator(MistralAIModel(model="mistral-large-latest"))

retrieved_documents_relevance_df = run_evals(
    evaluators=[relevance_evaluator],
    dataframe=retrieved_documents_df,
    provide_explanation=True,
    concurrency=20,
)[0]

run_evals |          | 0/88 (0.0%) | ⏳ 00:00<? | ?it/s

Exception in worker on attempt 1: raised MistralException(message=Unexpected exception (ReadError): )
Requeuing...


In [120]:
retrieved_documents_relevance_df.head()

| context.span_id | document_position | label | score | explanation |
| --- | --- | --- | --- | --- |
| f2a69823504dc8a7 | 0 | unrelated | 0 | The question is asking for the distinctive fea... |
| f2a69823504dc8a7 | 1 | relevant | 1 | The question is asking for the distinctive fea... |
| 84327f2121ec5cee | 0 | relevant | 1 | The question is asking about the unique approa... |
| 84327f2121ec5cee | 1 | relevant | 1 | The question is asking about the unique approa... |
| 86e6a3bef9148314 | 0 | relevant | 1 | The reference text provides information that d... |


We can now combine the documents with the relevance evaluations to compute retrieval metrics. These metrics will help us understand how well the RAG system is performing.

In [121]:
documents_with_relevance_df = pd.concat(
    [retrieved_documents_df, retrieved_documents_relevance_df.add_prefix("eval_")], axis=1
)
documents_with_relevance_df

| context.span_id | document_position | context.trace_id | input | reference | document_score | eval_label | eval_score | eval_explanation |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| f2a69823504dc8a7 | 0 | ca9cc579613fd70fe08379ccd144d9dc | What is the distinctive feature of YC accordin... | The YC logo itself is an inside joke: the Viaw... | 0.784029 | unrelated | 0 | The question is asking for the distinctive fea... |
| f2a69823504dc8a7 | 1 | ca9cc579613fd70fe08379ccd144d9dc | What is the distinctive feature of YC accordin... | The part we got first was to be an angel firm.... | 0.779252 | relevant | 1 | The question is asking for the distinctive fea... |
| 84327f2121ec5cee | 0 | 907a37a76cc5a63a8b3d1aee0ea1338b | What was the unique approach of YC in comparis... | The part we got first was to be an angel firm.... | 0.840994 | relevant | 1 | The question is asking about the unique approa... |
| 84327f2121ec5cee | 1 | 907a37a76cc5a63a8b3d1aee0ea1338b | What was the unique approach of YC in comparis... | Customary VC practice had once, like the custo... | 0.819195 | relevant | 1 | The question is asking about the unique approa... |
| 86e6a3bef9148314 | 0 | 98888589f8a871880ebf87d09df88452 | What event triggered the speaker's decision to... | In early 2005 she interviewed for a marketing ... | 0.780301 | relevant | 1 | The reference text provides information that d... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 005e71071e2e29bf | 1 | 38ec9569382c6694f4f34906c53bddfc | What was the first microcomputer the author's ... | To call this a difficult sale would be an unde... | 0.745964 | unrelated | 0 | The reference text provided does not contain a... |
| 7d62808e39ca85ea | 0 | b8a90d2acecd9606e3b9203f98c98334 | What type of stories did the author write befo... | What I Worked On\n\nFebruary 2021\n\nBefore co... | 0.740611 | relevant | 1 | The question asks about the type of stories th... |
| 7d62808e39ca85ea | 1 | b8a90d2acecd9606e3b9203f98c98334 | What type of stories did the author write befo... | At least not the painting department. The text... | 0.737859 | unrelated | 0 | The reference text does not provide any inform... |
| b74e125569a4684e | 0 | b380f44b5311621708bbcb2a4b09ee4b | What did the author do growing up? | I certainly did. So at the end of the summer D... | 0.733581 | unrelated | 0 | The question is asking about what the author d... |


Let's compute Normalized Discounted Cumulative Gain ([NDCG](https://en.wikipedia.org/wiki/Discounted_cumulative_gain)) at 2 for all our retrieval steps. In information retrieval, this metric is often used to measure the effectiveness of search engine algorithms and related applications.

In [122]:
import numpy as np
from sklearn.metrics import ndcg_score


def _compute_ndcg(df: pd.DataFrame, k: int):
    """Compute NDCG@k in the presence of missing values"""
    n = max(2, len(df))
    eval_scores = np.zeros(n)
    doc_scores = np.zeros(n)
    eval_scores[: len(df)] = df.eval_score
    doc_scores[: len(df)] = df.document_score
    try:
        return ndcg_score([eval_scores], [doc_scores], k=k)
    except ValueError:
        return np.nan


ndcg_at_2 = pd.DataFrame(
    {"score": documents_with_relevance_df.groupby("context.span_id").apply(_compute_ndcg, k=2)}
)

In [123]:
ndcg_at_2

| context.span_id | score |
| --- | --- |
| 005e71071e2e29bf | 1.0 |
| 0293435993fa5c4e | 1.0 |
| 05dc735655dd1576 | 0.63093 |
| 13d266378337470d | 1.0 |
| 1a29a1f4e5908e28 | 1.0 |
| 1e5b437b67d2c863 | 1.0 |
| 1f5654a562754d0a | 1.0 |
| 223b0177bd6ac993 | 0.63093 |
| 23d1fe10556fc98a | 0.0 |
| 36a301e36394c5f4 | 0.0 |
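To see where values such as 0.63093 come from, here is a small pure-Python sketch of NDCG@k with a linear gain and a log2 discount, which is the convention `sklearn.metrics.ndcg_score` uses by default. The helper names `dcg_at_k` and `ndcg_at_k` are ours, and the relevance lists are assumed to be in retrieved order (best-scored document first).

```python
import math


def dcg_at_k(relevances: list[float], k: int) -> float:
    """Discounted cumulative gain over the first k results (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG of the ranking as retrieved, divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


both_relevant = ndcg_at_k([1, 1], k=2)  # both retrieved documents are relevant
relevant_second = ndcg_at_k([0, 1], k=2)  # only the second-ranked document is relevant
none_relevant = ndcg_at_k([0, 0], k=2)  # nothing relevant was retrieved
print(both_relevant, round(relevant_second, 5), none_relevant)
```

With binary labels and two documents, a relevant document in second place alone scores 1/log2(3) ≈ 0.63093, while a relevant document in first place alone still scores 1.0, because NDCG rewards the ordering rather than the share of relevant documents.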


Let's also compute precision at 2 for all our retrieval steps.

In [124]:
precision_at_2 = pd.DataFrame(
    {
        "score": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) / 2
        )
    }
)

In [125]:
precision_at_2

| context.span_id | score |
| --- | --- |
| 005e71071e2e29bf | 0.5 |
| 0293435993fa5c4e | 1.0 |
| 05dc735655dd1576 | 0.5 |
| 13d266378337470d | 0.5 |
| 1a29a1f4e5908e28 | 1.0 |
| 1e5b437b67d2c863 | 1.0 |
| 1f5654a562754d0a | 0.5 |
| 223b0177bd6ac993 | 0.5 |
| 23d1fe10556fc98a | 0.0 |
| 36a301e36394c5f4 | 0.0 |


Lastly, let's compute whether a correct document was retrieved at all for each query (i.e., a hit).

In [126]:
hit = pd.DataFrame(
    {
        "hit": documents_with_relevance_df.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) > 0
        )
    }
)
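The two lambdas above are compact, so here is the same arithmetic spelled out in plain Python on made-up labels. The query ids and relevance lists are hypothetical, and unlike the pandas version (which uses `skipna=False`), this sketch assumes there are no missing evaluation scores.

```python
def precision_at_k(relevances: list[int], k: int) -> float:
    """Fraction of the top-k slots filled by a relevant document."""
    return sum(relevances[:k]) / k


def hit_at_k(relevances: list[int], k: int) -> bool:
    """True if at least one of the first k retrieved documents is relevant."""
    return any(rel > 0 for rel in relevances[:k])


# Hypothetical relevance labels (1 = relevant, 0 = unrelated), in retrieved order, per query.
labels_by_query = {"q1": [1, 1], "q2": [0, 1], "q3": [0, 0]}

precision = {q: precision_at_k(rels, 2) for q, rels in labels_by_query.items()}
hits = {q: hit_at_k(rels, 2) for q, rels in labels_by_query.items()}

mean_precision = sum(precision.values()) / len(precision)
hit_rate = sum(hits.values()) / len(hits)
print(precision, hits, mean_precision, hit_rate)
```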

Let's now view the results in a combined dataframe.

In [127]:
retrievals_df = px.Client().get_spans_dataframe("span_kind == 'RETRIEVER'")
rag_evaluation_dataframe = pd.concat(
    [
        retrievals_df["attributes.input.value"],
        ndcg_at_2.add_prefix("ncdg@2_"),
        precision_at_2.add_prefix("precision@2_"),
        hit,
    ],
    axis=1,
)
rag_evaluation_dataframe

| context.span_id | attributes.input.value | ncdg@2_score | precision@2_score | hit |
| --- | --- | --- | --- | --- |
| f2a69823504dc8a7 | What is the distinctive feature of YC accordin... | 0.63093 | 0.5 | True |
| 84327f2121ec5cee | What was the unique approach of YC in comparis... | 1.0 | 1.0 | True |
| 86e6a3bef9148314 | What event triggered the speaker's decision to... | 1.0 | 1.0 | True |
| e478acde2255cd38 | What was the speaker's motivation for giving a... | 0.0 | 0.0 | False |
| 0293435993fa5c4e | What was the unique concept behind the party h... | 1.0 | 1.0 | True |
| 5fb3d1fcf04a6187 | What was the title of the book that O'Reilly r... | 0.0 | 0.0 | False |
| 82917cdf9c0b60f3 | What did the author realize would be a 'turnin... | 1.0 | 1.0 | True |
| 223b0177bd6ac993 | What programming language did the author work ... | 0.63093 | 0.5 | True |
| dab7532e49e632e7 | What was the original name that the speaker co... | 1.0 | 1.0 | True |
| 13d266378337470d | What was the initial name given to the kind of... | 1.0 | 0.5 | True |


### Observations

Let's now take our results and aggregate them to get a sense of how well our RAG system is performing.

In [128]:
# Aggregate the scores across the retrievals
results = rag_evaluation_dataframe.mean(numeric_only=True)
results

ncdg@2_score         0.699672
precision@2_score    0.511364
hit                  0.750000
dtype: float64

As the numbers above show, our RAG system is not perfect: there are times when it fails to retrieve the correct context within the first two documents. At other times the correct context is included in the top two results, but irrelevant information is included alongside it. This indicates that we need to improve our retrieval strategy. One possible solution is to increase the number of documents retrieved and then use a more sophisticated ranking strategy (such as a reranker) to select the correct context.
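The retrieve-wide-then-rerank idea can be sketched as follows. This is only an illustration under stated assumptions: a real reranker would typically be a dedicated model, whereas here a simple query-term coverage score stands in for it, and the candidate texts, scores, and the `rerank` helper are all made up for the example.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rerank(query: str, candidates: list[tuple[str, float]], top_n: int = 2) -> list[str]:
    """Re-score (text, first_stage_score) candidates by query-term coverage and keep top_n."""
    query_tokens = tokens(query)

    def coverage(text: str) -> float:
        # Stand-in for a reranker model: share of query terms that appear in the text.
        return len(query_tokens & tokens(text)) / len(query_tokens) if query_tokens else 0.0

    rescored = sorted(candidates, key=lambda c: coverage(c[0]), reverse=True)
    return [text for text, _ in rescored[:top_n]]


# Hypothetical first-stage results from a wider retrieval (top 4), ordered by vector similarity.
candidates = [
    ("Notes about angel investing.", 0.84),
    ("Notes about venture capital practice.", 0.82),
    ("Programs written on that computer included simple games.", 0.80),
    ("His father bought a TRS-80 computer around 1980.", 0.78),
]
query = "What computer did his father buy around 1980?"
reranked = rerank(query, candidates, top_n=2)
print(reranked)
```

In this toy case the passage that actually answers the question was ranked fourth by the first stage and would have been cut off at two documents; widening the retrieval and reranking brings it to the top.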

We have now evaluated our RAG system's retrieval performance. Let's send these evaluations to Phoenix for visualization. By sending the evaluations to Phoenix, you will be able to view the evaluations alongside the traces that were captured earlier.

In [129]:
from phoenix.trace import DocumentEvaluations, SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(dataframe=ndcg_at_2, eval_name="ndcg@2"),
    SpanEvaluations(dataframe=precision_at_2, eval_name="precision@2"),
    DocumentEvaluations(dataframe=retrieved_documents_relevance_df, eval_name="relevance"),
)

In [130]:
ndcg_at_2.head()

| context.span_id | score |
| --- | --- |
| 005e71071e2e29bf | 1.0 |
| 0293435993fa5c4e | 1.0 |
| 05dc735655dd1576 | 0.63093 |
| 13d266378337470d | 1.0 |
| 1a29a1f4e5908e28 | 1.0 |


### Response Evaluation

The retrieval evaluations demonstrate that our RAG system is not perfect. However, it's possible that the LLM is able to generate the correct response even when the context is incorrect. Let's evaluate the responses generated by the LLM.

In [131]:
from phoenix.session.evaluation import get_qa_with_reference

qa_with_reference_df = get_qa_with_reference(px.Client())
qa_with_reference_df

| context.span_id | input | output | reference |
| --- | --- | --- | --- |
| d3c28b0c35bc1d75 | What is the distinctive feature of YC accordin... | | The YC logo itself is an inside joke: the Viaw... |
| 192ee08841e5d4e9 | What was the unique approach of YC in comparis... | YC's unique approach was a combination of seve... | The part we got first was to be an angel firm.... |
| 1e93158a92d2e72f | What event triggered the speaker's decision to... | The speaker decided to start their own investm... | In early 2005 she interviewed for a marketing ... |
| a7958eb18896a135 | What was the speaker's motivation for giving a... | The context does not provide information on th... | I applied to 3 grad schools: MIT and Yale, whi... |
| 532a562eff1c9452 | What was the unique concept behind the party h... | The unique concept behind the party hosted at ... | I also worked on spam filters, and did some mo... |
| 2328c9fed11ea68b | What was the title of the book that O'Reilly r... | The context does not provide the title of the ... | Even then it took me several years to understa... |
| 44111d90e1231932 | What did the author realize would be a 'turnin... | The author realized that the turning point in ... | I certainly did. So at the end of the summer D... |
| 37899a2c74d75cb1 | What programming language did the author work ... | The author worked on a new dialect of Lisp at ... | And at 50 there was some opportunity cost to s... |
| b01192b7c1916e5c | What was the original name that the speaker co... | The original name the speaker considered for h... | It seemed obvious that this was the future. I ... |
| b45189f7a498e3b8 | What was the initial name given to the kind of... | The initial name given to the kind of company ... | This name didn't last long before it was repla... |


Now that we have a dataset of questions, contexts, and responses (`input`, `reference`, and `output`), we can measure how well the LLM responds to the queries. For details on the QA correctness evaluation, see the [LLM Evals documentation](https://docs.arize.com/phoenix/llm-evals/running-pre-tested-evals/q-and-a-on-retrieved-data).

In [132]:
from phoenix.evals import (
    HallucinationEvaluator,
    OpenAIModel,
    QAEvaluator,
    run_evals,
)

qa_evaluator = QAEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))
hallucination_evaluator = HallucinationEvaluator(OpenAIModel(model="gpt-4-turbo-preview"))

qa_correctness_eval_df, hallucination_eval_df = run_evals(
    evaluators=[qa_evaluator, hallucination_evaluator],
    dataframe=qa_with_reference_df,
    provide_explanation=True,
    concurrency=20,
)

In [133]:
qa_correctness_eval_df.head()

| context.span_id | label | score | explanation |
|---|---|---|---|
| d3c28b0c35bc1d75 | incorrect | 0 | The question asks for the distinctive feature ... |
| 192ee08841e5d4e9 | correct | 1 |  |
| 1e93158a92d2e72f | correct | 1 | The question asks for the event that triggered... |
| a7958eb18896a135 | correct | 1 | The reference text does not mention any talk g... |
| 532a562eff1c9452 | correct | 1 | The question asks for the unique concept behin... |


In [134]:
hallucination_eval_df.head()

| context.span_id | label | score | explanation |
|---|---|---|---|
| d3c28b0c35bc1d75 | hallucinated | 1 | The query asks for the distinctive feature of ... |
| 192ee08841e5d4e9 | factual | 0 | The answer provided is factual based on the re... |
| 1e93158a92d2e72f | factual | 0 | The answer is factual based on the reference t... |
| a7958eb18896a135 | factual | 0 | The reference text does not mention any talk g... |
| 532a562eff1c9452 | factual | 0 | The answer accurately reflects the information... |


#### Observations

Let's now take our results and aggregate them to get a sense of how well the LLM is answering the questions given the context.

In [135]:
qa_correctness_eval_df.mean(numeric_only=True)

score    0.931818
dtype: float64

In [136]:
hallucination_eval_df.mean(numeric_only=True)

score    0.045455
dtype: float64

Our QA Correctness score of `0.93` and Hallucination score of `0.05` indicate that the generated answers are correct ~93% of the time and that the responses contain hallucinations ~5% of the time, so there is room for improvement. This could be due to the retrieval strategy or the LLM itself. We will need to investigate further to determine the root cause.

Since we have evaluated our RAG system's QA performance and Hallucinations performance, let's send these evaluations to Phoenix for visualization.

In [137]:
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(dataframe=qa_correctness_eval_df, eval_name="Q&A Correctness"),
    SpanEvaluations(dataframe=hallucination_eval_df, eval_name="Hallucination"),
)

We have now sent all our evaluations to Phoenix. Let's go to the Phoenix application and view the results! With all the evals in one place, we can analyze them together to determine whether poor retrieval or irrelevant context affects the LLM's ability to generate the correct response.

In [138]:
print("phoenix URL", px.active_session().url)

phoenix URL http://localhost:6006/
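
The question of whether poor retrieval hurts answer quality can also be approximated offline in pandas. The sketch below uses toy stand-in frames keyed by a hypothetical `trace_id`, because retrieval metrics and response evaluations are attached to different spans. The column names and values are made up for illustration and are not results from this notebook.

```python
import pandas as pd

# Toy stand-ins for the evaluation frames, keyed by a hypothetical trace ID.
retrieval_df = pd.DataFrame(
    {"ndcg@2": [1.0, 1.0, 0.63, 0.0]},
    index=pd.Index(["t1", "t2", "t3", "t4"], name="trace_id"),
)
qa_df = pd.DataFrame(
    {"qa_correctness_score": [1, 1, 1, 0]},
    index=pd.Index(["t1", "t2", "t3", "t4"], name="trace_id"),
)

# Align the two evaluations per trace, then bucket traces by retrieval quality.
combined = retrieval_df.join(qa_df, how="inner")
combined["retrieval_quality"] = combined["ndcg@2"].map(
    lambda score: "perfect" if score == 1.0 else "imperfect"
)

# Mean QA correctness per bucket: a large gap suggests retrieval is the bottleneck.
summary = combined.groupby("retrieval_quality")["qa_correctness_score"].mean()
print(summary)
```

If answers are noticeably less correct in the "imperfect" bucket, the retriever is the first place to look; if the two buckets score similarly, the LLM or the prompt is the more likely culprit.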


## Embeddings Analysis
[Embeddings](https://arize.com/blog-course/embeddings-meaning-examples-and-how-to-compute/) encode the meaning of retrieved documents and user queries. Not only are they an essential part of RAG systems, but they are immensely useful for understanding and debugging LLM application performance.

Phoenix takes the high-dimensional embeddings from your RAG application, reduces their dimensionality, and clusters them into semantically meaningful groups of data. You can then select the metric of your choice (e.g. hallucinations or QA correctness) to visually inspect the performance of your application and surface problematic clusters. The advantage of this approach is that it provides metrics on granular yet meaningful subsets of your data that help you analyze local, not merely global, performance across a dataset. It's also helpful for gaining intuition around what kind of queries your LLM application is struggling to answer.
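
To see why cluster-level metrics can reveal what a global average hides, consider this small sketch. The cluster IDs and scores are made up for illustration and the grouping is done by hand. It does not reproduce how Phoenix reduces dimensionality or forms clusters.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (cluster_id, qa_correctness_score) pairs for eight queries.
scored_queries = [
    (0, 1), (0, 1), (0, 1), (0, 1), (0, 1),
    (1, 1), (1, 0), (1, 0),
]

# Global view: a single average over every query.
global_score = mean(score for _, score in scored_queries)

# Local view: one average per cluster of semantically similar queries.
scores_by_cluster = defaultdict(list)
for cluster_id, score in scored_queries:
    scores_by_cluster[cluster_id].append(score)
cluster_scores = {cid: mean(scores) for cid, scores in scores_by_cluster.items()}

print(global_score)    # 0.75
print(cluster_scores)  # cluster 0 is fully correct; cluster 1 is mostly wrong
```

The global score looks acceptable, yet every failure sits in one cluster, which is the kind of problematic subset the embeddings view is meant to surface.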

In [143]:
# First, let's grab the embeddings from our queries
from phoenix.trace.dsl.helpers import SpanQuery

embeddings_df = px.Client().query_spans(
    SpanQuery()
    .with_index("trace_id")
    .explode(
        "embedding.embeddings",
        query="embedding.text",
        vector="embedding.vector",
    ),
)
queries_df = px.Client().query_spans(
    SpanQuery()
    .with_index("trace_id")
    .select(
        "span_id",
        query="input.value",
        response="response.value"
    )
    .where("parent_id is None")
)
query_embeddings_df = queries_df.join(embeddings_df, how="inner").set_index("context.span_id")
query_embeddings_df.head()

| context.span_id | response | query | vector |
|---|---|---|---|
| d3c28b0c35bc1d75 | What is the distinctive feature of YC accordin... | What is the distinctive feature of YC accordin... | [-0.03582763671875, 0.016754150390625, 0.05102... |
| 192ee08841e5d4e9 | What was the unique approach of YC in comparis... | What was the unique approach of YC in comparis... | [-0.007587432861328125, 0.03240966796875, 0.01... |
| 1e93158a92d2e72f | What event triggered the speaker's decision to... | What event triggered the speaker's decision to... | [-0.01418304443359375, 0.050018310546875, 0.03... |
| a7958eb18896a135 | What was the speaker's motivation for giving a... | What was the speaker's motivation for giving a... | [-0.05609130859375, 0.031982421875, 0.02626037... |
| 532a562eff1c9452 | What was the unique concept behind the party h... | What was the unique concept behind the party h... | [-0.047271728515625, 0.047210693359375, 0.0614... |


In [150]:
# Now let's add our evaluations to the dataframe
query_embeddings_with_evals_df = pd.concat(
    [
        hallucination_eval_df[["label", "score"]].rename(
            columns={"label": "hallucination_label", "score": "hallucination_score"}
        ),
        qa_correctness_eval_df[["label", "score"]].rename(
            columns={"label": "qa_correctness_label", "score": "qa_correctness_score"}
        ),
        query_embeddings_df,
    ],
    axis=1,  # concatenate column-wise, aligning on the row indices
    join="inner",  # keep only the span IDs present in every DataFrame
)

query_embeddings_with_evals_df.head()

| context.span_id | hallucination_label | hallucination_score | qa_correctness_label | qa_correctness_score | response | query | vector |
|---|---|---|---|---|---|---|---|
| d3c28b0c35bc1d75 | hallucinated | 1 | incorrect | 0 | What is the distinctive feature of YC accordin... | What is the distinctive feature of YC accordin... | [-0.03582763671875, 0.016754150390625, 0.05102... |
| 192ee08841e5d4e9 | factual | 0 | correct | 1 | What was the unique approach of YC in comparis... | What was the unique approach of YC in comparis... | [-0.007587432861328125, 0.03240966796875, 0.01... |
| 1e93158a92d2e72f | factual | 0 | correct | 1 | What event triggered the speaker's decision to... | What event triggered the speaker's decision to... | [-0.01418304443359375, 0.050018310546875, 0.03... |
| a7958eb18896a135 | factual | 0 | correct | 1 | What was the speaker's motivation for giving a... | What was the speaker's motivation for giving a... | [-0.05609130859375, 0.031982421875, 0.02626037... |
| 532a562eff1c9452 | factual | 0 | correct | 1 | What was the unique concept behind the party h... | What was the unique concept behind the party h... | [-0.047271728515625, 0.047210693359375, 0.0614... |
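
Because `pd.concat(..., axis=1, join="inner")` aligns rows by index label rather than by position, the frames do not need to be in the same order, and rows missing from any frame are dropped. Here is a small sketch with made-up index labels standing in for span IDs:

```python
import pandas as pd

# Two toy frames with overlapping but differently ordered index labels.
evals = pd.DataFrame({"qa_correctness_score": [1, 0, 1]}, index=["a", "b", "c"])
queries = pd.DataFrame({"query": ["q-c", "q-a", "q-d"]}, index=["c", "a", "d"])

# Column-wise concat with an inner join keeps only labels found in both frames.
aligned = pd.concat([evals, queries], axis=1, join="inner")
print(aligned)
```

Only `a` and `c` survive, and each score sits next to the query that shares its label, regardless of the original row order.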


In [151]:
# Next let's grab the embeddings from our corpus (indexing)
from phoenix.trace.dsl.helpers import SpanQuery

client = px.Client()
corpus_df = client.query_spans(
    SpanQuery().explode(
        "embedding.embeddings",
        text="embedding.text",
        vector="embedding.vector",
    ),
    project_name=INDEXING_PROJECT,
)
corpus_df.head()

| context.span_id | position | text | vector |
|---|---|---|---|
| 7acd15037ea5aa5d | 0 | file_path: /var/folders/1s/4vdv59n15b1ghg42frd... | [-0.044952392578125, 0.0740966796875, 0.036346... |
| 2085ace2dae5dc6f | 0 | file_path: /var/folders/1s/4vdv59n15b1ghg42frd... | [-0.05560302734375, 0.080810546875, 0.04663085... |
| 2085ace2dae5dc6f | 1 | file_path: /var/folders/1s/4vdv59n15b1ghg42frd... | [-0.052276611328125, 0.07275390625, 0.04721069... |
| 2085ace2dae5dc6f | 2 | file_path: /var/folders/1s/4vdv59n15b1ghg42frd... | [-0.045654296875, 0.0538330078125, 0.037750244... |
| 2085ace2dae5dc6f | 3 | file_path: /var/folders/1s/4vdv59n15b1ghg42frd... | [-0.04693603515625, 0.055084228515625, 0.03265... |


In [62]:
# Let's now merge in our evaluations into the queries
query_embeddings_df = query_embeddings_df.iloc[::-1]
query_df = pd.concat(
    [
        hallucination_eval_df[["score", "label"]].rename(columns={"score": "hallucination_score", "label": "hallucination_label"}).reset_index(drop=True),
        qa_correctness_eval_df[["score", "label"]].rename(columns={"score": "qa_correctness_score", "label": "qa_correctness_label"}).reset_index(drop=True),
        query_embeddings_df[["query", "vector"]].reset_index(drop=True),
    ],
    axis=1,
)
query_df.head()

|  | hallucination_score | hallucination_label | qa_correctness_score | qa_correctness_label | query | vector |
|---|---|---|---|---|---|---|
| 0 | 0.0 | factual | 1.0 | correct | What did the author do growing up? | [-0.051177978515625, 0.0357666015625, 0.040527... |
| 1 | 0.0 | factual | 1.0 | correct | What type of stories did the author write befo... | [-0.0513916015625, 0.02825927734375, 0.0482177... |
| 2 | 0.0 | factual | 1.0 | correct | What was the first microcomputer the author's ... | [-0.020782470703125, 0.01148223876953125, 0.03... |
| 3 | 0.0 | factual | 1.0 | correct | What type of computer did the speaker convince... | [-0.048797607421875, 0.045654296875, 0.0026817... |
| 4 | 0.0 | factual | 1.0 | correct | What was the name of the novel by Heinlein tha... | [-0.0313720703125, 0.04644775390625, 0.0225067... |


In [None]:
query_schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        raw_data_column_name="query", vector_column_name="vector"
    ),
    response_column_names="response",
    tag_column_names=["hallucination_label", "hallucination_score", "qa_correctness_label", "qa_correctness_score"],
)
corpus_schema = px.Schema(
    prompt_column_names=px.EmbeddingColumnNames(
        raw_data_column_name="text", vector_column_name="vector"
    )
)
# relaunch phoenix with a primary and corpus dataset to view embeddings
px.close_app()
session = px.launch_app(
    primary=px.Dataset(query_df, query_schema, "query"),
    corpus=px.Dataset(corpus_df.reset_index(drop=True), corpus_schema, "corpus"),
)

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


## Conclusion

We have explored how to build and evaluate a RAG pipeline using LlamaIndex and Phoenix, focusing on evaluating the retrieval system and the generated responses within the pipeline.

Phoenix offers a variety of other evaluations that can be used to assess the performance of your LLM Application. For more details, see the [LLM Evals](https://docs.arize.com/phoenix/llm-evals/llm-evals) documentation.