<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluate RAG with LLM Evals</h1>

In this tutorial we will look into building a RAG pipeline and evaluating it with Phoenix Evals.

It has the following sections:

1. Understanding Retrieval Augmented Generation (RAG).
1. Building RAG (with the help of a framework such as LlamaIndex).
1. Evaluating RAG with Phoenix Evals.

## Retrieval Augmented Generation (RAG)

LLMs are trained on vast datasets, but these will not include your specific data (things like company knowledge bases and documentation). Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data as context during the generation process. This is done not by altering the training data of the LLMs but by allowing the model to access and use your data at query time to provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries. This process is called indexing. User queries act on this index, which filters your data down to the most relevant context. This context and your query are then sent to the LLM along with a prompt, and the LLM provides a response.

RAG is a critical component for building applications such as chatbots or agents, so it helps to know the techniques for getting your data into your application.

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/RAG_Pipeline.png">

## Stages within RAG

There are five key stages within RAG, which will in turn be a part of any larger RAG application.

- **Loading**: This refers to getting your data from where it lives - whether it's text files, PDFs, another website, a database or an API - into your pipeline.
- **Indexing**: This means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.
- **Storing**: Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.
- **Querying**: For any given indexing strategy there are many ways you can utilize LLMs and data structures to query, including sub-queries, multi-step queries, and hybrid strategies.
- **Evaluation**: A critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures on how accurate, faithful, and fast your responses to queries are.
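
To make the loading, indexing, and querying stages concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions: the `embed` function uses word counts as a stand-in for learned vector embeddings, the three chunks are short paraphrases invented for illustration, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical helpers, not part of LlamaIndex or Phoenix.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a stand-in for a learned vector)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Loading: in a real pipeline these chunks would come from files, a database, or an API.
chunks = [
    "The author wrote short stories before college.",
    "The first programs were written on an IBM 1401 in Fortran.",
    "In 2016 the family moved to England.",
]

# Indexing (and storing, here only in memory): one vector per chunk.
index = [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Querying: return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(query: str, context: list[str]) -> str:
    """Combine the retrieved context and the query into the text sent to the LLM."""
    joined = "\n".join(context)
    return f"Context:\n{joined}\n\nQuestion: {query}\nAnswer:"


question = "Which computer were the first programs written on?"
print(build_prompt(question, retrieve(question)))
```

A real system would replace `embed` with an embedding model and the in-memory list with a vector store, but the flow of data between the stages stays the same.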


## Build a RAG system 

Now that we have understood the stages of RAG, let's build a pipeline. We will use [LlamaIndex](https://www.llamaindex.ai/) for RAG and [Phoenix Evals](https://docs.arize.com/phoenix/llm-evals/llm-evals) for evaluation.


In [121]:
!pip install -qq "arize-phoenix[experimental]" llama-index

In [122]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()

import getpass
import os

import pandas as pd
import phoenix as px
from llama_index import SimpleDirectoryReader, VectorStoreIndex, set_global_handler
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser

During this tutorial, we will capture all the data we need to evaluate our RAG pipeline using Phoenix Tracing. To enable this, start the Phoenix application and instrument LlamaIndex.

In [123]:
px.launch_app()

Existing running Phoenix instance detected! Shutting it down and starting a new instance...


🌍 To view the Phoenix app in your browser, visit http://127.0.0.1:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x2b1aba7d0>

In [124]:
set_global_handler("arize_phoenix")

For this tutorial we will be using OpenAI for creating synthetic data as well as for evaluation. 

In [125]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass.getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

Let's use an [essay by Paul Graham](https://www.paulgraham.com/worked.html) to build our RAG pipeline.

In [126]:
!mkdir -p 'data/paul_graham/'
!curl 'https://raw.githubusercontent.com/Arize-ai/phoenix-assets/main/data/paul_graham/paul_graham_essay.txt' -o 'data/paul_graham/paul_graham_essay.txt'

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 75041  100 75041    0     0   209k      0 --:--:-- --:--:-- --:--:--  210k


### Load Data and Build an Index

In [127]:
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# Define an LLM
llm = OpenAI(model="gpt-4")

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)
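
To give a feel for what a `chunk_size` setting does, the sketch below splits text into fixed-size windows. It is a deliberate simplification and not how `SimpleNodeParser` is implemented: it counts whitespace-separated words rather than tokens, and the `chunk_words` function and its `overlap` parameter are hypothetical names chosen for this illustration.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 20) -> list[str]:
    """Split text into windows of at most chunk_size words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Smaller chunks tend to give more focused retrieval results, while larger chunks keep more surrounding context together, which is why the chunk size is worth revisiting once evaluation metrics are available.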

Build a QueryEngine and start querying.

In [128]:
query_engine = vector_index.as_query_engine()

In [129]:
response_vector = query_engine.query("What did the author do growing up?")

Check the response that you get from the query.

In [130]:
response_vector.response

"The author, growing up, worked on writing and programming. They wrote short stories and tried writing programs on an IBM 1401 computer. They used an early version of Fortran and had to type programs on punch cards. However, they were puzzled by the computer and didn't have much data to work with. The author also mentioned that with the advent of microcomputers, everything changed."

By default LlamaIndex retrieves the two most similar nodes/chunks. You can change this by passing `similarity_top_k`, as in `vector_index.as_query_engine(similarity_top_k=k)`.

Let's check the text in each of these retrieved nodes.

In [131]:
# First retrieved node
response_vector.source_nodes[0].get_text()

'What I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn\'t write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.\n\nThe first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district\'s 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain\'s lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.\n\nThe language we used was an early version of Fortran. You had to type programs on punch cards, then stack

In [132]:
# Second retrieved node
response_vector.source_nodes[1].get_text()

"It felt like I was doing life right. I remember that because I was slightly dismayed at how novel it felt. The good news is that I had more moments like this over the next few years.\n\nIn the summer of 2016 we moved to England. We wanted our kids to see what it was like living in another country, and since I was a British citizen by birth, that seemed the obvious choice. We only meant to stay for a year, but we liked it so much that we still live there. So most of Bel was written in England.\n\nIn the fall of 2019, Bel was finally finished. Like McCarthy's original Lisp, it's a spec rather than an implementation, although like McCarthy's Lisp it's a spec expressed as code.\n\nNow that I could write essays again, I wrote a bunch about topics I'd had stacked up. I kept writing essays through 2020, but I also started to think about other things I could work on. How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that 

Remember that we are using Phoenix Tracing to capture all the data we need to evaluate our RAG pipeline. You can view the traces in the Phoenix application.

In [133]:
print("phoenix URL", px.active_session().url)

phoenix URL http://127.0.0.1:6006/


We can access the traces by directly pulling the spans from the phoenix session.

In [134]:
spans_df = px.active_session().get_spans_dataframe()

In [135]:
spans_df.head()

Unnamed: 0_level_0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,conversation,context.trace_id,...,attributes.output.value,attributes.__computed__.latency_ms,attributes.__computed__.error_count,attributes.__computed__.cumulative_token_count.total,attributes.__computed__.cumulative_token_count.prompt,attributes.__computed__.cumulative_token_count.completion,attributes.input.value,attributes.embedding.model_name,attributes.embedding.embeddings,attributes.retrieval.documents
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
13c2133e-fbaa-4253-9529-eff59286eea0,llm,LLM,49e4e78a-81d8-4858-a40f-5fcf108cfe60,2023-12-05T00:05:48.726237+00:00,2023-12-05T00:05:51.512111+00:00,OK,,[],,f853a479-dc4a-4ca3-a214-2c3c9d3b1a03,...,"The author, growing up, worked on writing and ...",2785.874,0,1144.0,1065.0,79.0,,,,
49e4e78a-81d8-4858-a40f-5fcf108cfe60,synthesize,CHAIN,01447d07-fd83-4f39-b290-9881920524e1,2023-12-05T00:05:48.722384+00:00,2023-12-05T00:05:51.512342+00:00,OK,,[],,f853a479-dc4a-4ca3-a214-2c3c9d3b1a03,...,"The author, growing up, worked on writing and ...",2789.958,0,1144.0,1065.0,79.0,What did the author do growing up?,,,
d9119763-7316-4006-8484-eca276dae00e,embedding,EMBEDDING,4197f72b-4d97-4a88-96a6-6f1c3934792d,2023-12-05T00:05:48.555014+00:00,2023-12-05T00:05:48.718349+00:00,OK,,[],,f853a479-dc4a-4ca3-a214-2c3c9d3b1a03,...,,163.335,0,,,,,text-embedding-ada-002,"[{'embedding.vector': [0.010107065550982952, -...",
4197f72b-4d97-4a88-96a6-6f1c3934792d,retrieve,RETRIEVER,01447d07-fd83-4f39-b290-9881920524e1,2023-12-05T00:05:48.554931+00:00,2023-12-05T00:05:48.722315+00:00,OK,,[],,f853a479-dc4a-4ca3-a214-2c3c9d3b1a03,...,,167.384,0,,,,What did the author do growing up?,,,[{'document.id': '16bc6cd8-f4cd-488b-979c-d983...
01447d07-fd83-4f39-b290-9881920524e1,query,CHAIN,,2023-12-05T00:05:48.554871+00:00,2023-12-05T00:05:51.512380+00:00,OK,,[],,f853a479-dc4a-4ca3-a214-2c3c9d3b1a03,...,"The author, growing up, worked on writing and ...",2957.509,0,1144.0,1065.0,79.0,What did the author do growing up?,,,


Note that the traces have captured the documents that were retrieved by the query engine.

In [136]:
spans_with_docs_df = spans_df[spans_df["attributes.retrieval.documents"].notnull()]
spans_with_docs_df.head()

Unnamed: 0_level_0,name,span_kind,parent_id,start_time,end_time,status_code,status_message,events,conversation,context.trace_id,...,attributes.output.value,attributes.__computed__.latency_ms,attributes.__computed__.error_count,attributes.__computed__.cumulative_token_count.total,attributes.__computed__.cumulative_token_count.prompt,attributes.__computed__.cumulative_token_count.completion,attributes.input.value,attributes.embedding.model_name,attributes.embedding.embeddings,attributes.retrieval.documents
context.span_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
4197f72b-4d97-4a88-96a6-6f1c3934792d,retrieve,RETRIEVER,01447d07-fd83-4f39-b290-9881920524e1,2023-12-05T00:05:48.554931+00:00,2023-12-05T00:05:48.722315+00:00,OK,,[],,f853a479-dc4a-4ca3-a214-2c3c9d3b1a03,...,,167.384,0,,,,What did the author do growing up?,,,[{'document.id': '16bc6cd8-f4cd-488b-979c-d983...


In [137]:
spans_with_docs_df[["attributes.input.value", "attributes.retrieval.documents"]].head()

| context.span_id | attributes.input.value | attributes.retrieval.documents |
| --- | --- | --- |
| 4197f72b-4d97-4a88-96a6-6f1c3934792d | What did the author do growing up? | [{'document.id': '16bc6cd8-f4cd-488b-979c-d983... |
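
Each cell in `attributes.retrieval.documents` holds a list of dictionaries, one per retrieved document. The sketch below shows one way to flatten such a column into one row per document with pandas. The dataframe is a hypothetical stand-in for `spans_with_docs_df`: the ids and scores are made up, and the `document.score` key is an assumption, since only `document.id` is visible in the truncated output above.

```python
import pandas as pd

# Hypothetical stand-in for spans_with_docs_df: one retriever span with two documents.
spans = pd.DataFrame(
    {
        "attributes.input.value": ["What did the author do growing up?"],
        "attributes.retrieval.documents": [
            [
                {"document.id": "doc-a", "document.score": 0.83},
                {"document.id": "doc-b", "document.score": 0.81},
            ]
        ],
    },
    index=pd.Index(["span-1"], name="context.span_id"),
)

# One row per (span, document); the span id is repeated for each document.
exploded = spans.explode("attributes.retrieval.documents")
docs = pd.DataFrame(
    exploded["attributes.retrieval.documents"].tolist(), index=exploded.index
)
docs["input"] = exploded["attributes.input.value"].to_numpy()
# Position of each document within its span, in retrieval order.
docs["document_position"] = docs.groupby(level=0).cumcount().to_numpy()
print(docs)
```

The result has the same general shape as a per-document table: the query text, the document fields, and the position of each document within its span.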


We have built a RAG pipeline and instrumented it using Phoenix Tracing. We now need to evaluate its performance. We can assess our RAG system/query engine using Phoenix's LLM Evals. Let's examine how to leverage these tools to quantify the quality of our retrieval-augmented generation system.

## Evaluation

Evaluation should be your primary means of assessing your RAG application. It determines whether the pipeline produces accurate responses across your data sources and range of queries.

While it's beneficial to examine individual queries and responses at the start, this approach is impractical as the volume of edge-cases and failures increases. Instead, it's more effective to establish a suite of metrics and automated evaluations. These tools can provide insights into overall system performance and can identify specific areas that may require scrutiny.

In a RAG system, evaluation focuses on two critical aspects:

- **Retrieval Evaluation**: Assess the accuracy and relevance of the documents that were retrieved.
- **Response Evaluation**: Measure the appropriateness of the response generated by the system given the retrieved context.

### Generate Question Context Pairs

For the evaluation of a RAG system, it's essential to have queries that can fetch the correct context and subsequently generate an appropriate response.

For this tutorial, let's use Phoenix's `llm_generate` to create the question-context pairs.

In [138]:
# Let's construct a dataframe of just the documents that are in our index
documents_df = pd.DataFrame({
    "text": [node.get_text() for node in nodes]
})
documents_df.head()

| | text |
| --- | --- |
| 0 | What I Worked On\n\nFebruary 2021\n\nBefore co... |
| 1 | I was puzzled by the 1401. I couldn't figure o... |
| 2 | I remember vividly how impressed and envious I... |
| 3 | I couldn't have put this into words when I was... |
| 4 | This was more like it; this was what I had exp... |


In [139]:
generate_questions_template = """\
Context information is below.

---------------------
{text}
---------------------

Given the context information and not prior knowledge.
generate only questions based on the below query.

You are a Teacher/ Professor. Your task is to setup \
3 questions for an upcoming \
quiz/examination. The questions should be diverse in nature \
across the document. Restrict the questions to the \
context information provided.

Output the questions in JSON format with the keys question_1, question_2, question_3.
"""

In [140]:
import json

from phoenix.experimental.evals import OpenAIModel, llm_generate


def output_parser(response: str):
    try:
        return json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}


questions_df = llm_generate(
    dataframe=documents_df,
    template=generate_questions_template,
    model=OpenAIModel(
        model_name="gpt-4-1106-preview",
        model_kwargs={
            "response_format": {"type": "json_object"}
        },
    ),
    output_parser=output_parser
)
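
The parser above returns an `__error__` entry instead of raising when the model's reply is not valid JSON. As a sketch of how this idea could be taken one step further, the hypothetical `parse_questions` helper below (not part of Phoenix) also flags replies that parse but are missing one of the expected keys. Replies flagged this way contribute no `question_*` values, so it is reasonable to expect missing values for those rows downstream.

```python
import json


def parse_questions(
    response: str,
    expected_keys=("question_1", "question_2", "question_3"),
) -> dict:
    """Parse a JSON object and flag malformed or incomplete replies under '__error__'."""
    try:
        parsed = json.loads(response)
    except json.JSONDecodeError as e:
        return {"__error__": str(e)}
    if not isinstance(parsed, dict):
        return {"__error__": "expected a JSON object"}
    missing = [key for key in expected_keys if key not in parsed]
    if missing:
        return {"__error__": f"missing keys: {missing}"}
    return parsed


good = '{"question_1": "Q1?", "question_2": "Q2?", "question_3": "Q3?"}'
print(parse_questions(good))                      # the parsed dictionary
print(parse_questions("not json"))                # {'__error__': ...}
print(parse_questions('{"question_1": "Q1?"}'))   # {'__error__': "missing keys: [...]"}
```

Whether to discard incomplete replies entirely, as this sketch does, or to keep the questions that did parse is a design choice that depends on how much synthetic data you can afford to lose.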

In [None]:
questions_df.head()

| | question_1 | question_2 | question_3 |
| --- | --- | --- | --- |
| 0 | Describe the environment in which the author f... | What programming language did the author use w... | Explain the significance of the author's exper... |
| 1 | Describe the limitations of programming on the... | How did the advent of microcomputers change th... | What were some of the applications the narrato... |
| 2 | What was the first computer the narrator convi... | Which field of study did the narrator initiall... | Name the two influences that sparked the narra... |
| 3 | What science fiction novel by Heinlein, featur... | Describe the impact that learning Lisp had on ... | What was the name of the program that the auth... |
| 4 | What was the primary focus of the undergraduat... | Describe the unique aspect of the program at C... | Based on the author's experience during their ... |


In [None]:
question_context_pairs = pd.concat([questions_df, documents_df], axis=1)

In [None]:
question_context_pairs.head()

| | question_1 | question_2 | question_3 | text |
| --- | --- | --- | --- | --- |
| 0 | Describe the environment in which the author f... | What programming language did the author use w... | Explain the significance of the author's exper... | What I Worked On\n\nFebruary 2021\n\nBefore co... |
| 1 | Describe the limitations of programming on the... | How did the advent of microcomputers change th... | What were some of the applications the narrato... | I was puzzled by the 1401. I couldn't figure o... |
| 2 | What was the first computer the narrator convi... | Which field of study did the narrator initiall... | Name the two influences that sparked the narra... | I remember vividly how impressed and envious I... |
| 3 | What science fiction novel by Heinlein, featur... | Describe the impact that learning Lisp had on ... | What was the name of the program that the auth... | I couldn't have put this into words when I was... |
| 4 | What was the primary focus of the undergraduat... | Describe the unique aspect of the program at C... | Based on the author's experience during their ... | This was more like it; this was what I had exp... |


In [None]:
# Let's construct a dataframe that has a question per row
question_context_pairs = question_context_pairs.melt(id_vars=["text"], value_name="question").drop('variable', axis=1)
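
The `melt` call turns the wide table (one row per chunk, one column per question) into a long table with one question per row. The small example below shows the same reshaping on made-up data; the `None` entry is an assumption that mimics a chunk for which a question could not be generated, and the extra `dropna` step is one possible way to discard such rows before they are sent to the query engine.

```python
import pandas as pd

wide = pd.DataFrame(
    {
        "question_1": ["A1?", "B1?"],
        "question_2": ["A2?", None],  # None mimics a question that failed to generate
        "text": ["chunk A", "chunk B"],
    }
)

# Wide to long: every question_* column becomes its own row next to its chunk.
long = wide.melt(id_vars=["text"], value_name="question").drop("variable", axis=1)

# Discard rows without a question so that only real strings remain.
long = long.dropna(subset=["question"]).reset_index(drop=True)
print(long)
```

Note that the long table lists the first question of every chunk before the second question of any chunk, which matches the row order shown by `question_context_pairs.head(10)`.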

In [None]:
question_context_pairs.head(10)

| | text | question |
| --- | --- | --- |
| 0 | What I Worked On\n\nFebruary 2021\n\nBefore co... | Describe the environment in which the author f... |
| 1 | I was puzzled by the 1401. I couldn't figure o... | Describe the limitations of programming on the... |
| 2 | I remember vividly how impressed and envious I... | What was the first computer the narrator convi... |
| 3 | I couldn't have put this into words when I was... | What science fiction novel by Heinlein, featur... |
| 4 | This was more like it; this was what I had exp... | What was the primary focus of the undergraduat... |
| 5 | Only Harvard accepted me, so that was where I ... | What realization did the author come to during... |
| 6 | So I decided to focus on Lisp. In fact, I deci... | What is the title of the book the author wrote... |
| 7 | Anyone who wanted one to play around with coul... | In what year did the narrator visit Rich Drave... |
| 8 | I knew intellectually that people made art — t... | What was the initial perception of the author ... |
| 9 | Then one day in April 1990 a crack appeared in... | What was the topic of the dissertation written... |


### Retrieval Evaluation

We are now prepared to conduct our retrieval evaluations. We will execute the queries we generated in the previous step and ensure that the correct context is retrieved.

In [None]:
# First things first, let's reset phoenix
px.close_app()
px.launch_app()

🌍 To view the Phoenix app in your browser, visit http://127.0.0.1:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


<phoenix.session.session.ThreadSession at 0x2b1f86aa0>

In [None]:
# Loop over the questions and generate the answers.
# Some rows can hold NaN instead of a question (e.g. when generation failed
# for a chunk); passing NaN to the query engine raises an AttributeError, so skip them.
answers = []
for _, row in question_context_pairs.dropna(subset=["question"]).iterrows():
    question = row["question"]
    response_vector = query_engine.query(question)
    answers.append(response_vector.response)

Now that we have executed the queries, we can start validating whether or not the RAG system was able to retrieve the correct context.

In [None]:
from phoenix.session.evaluation import get_retrieved_documents

retrieved_documents = get_retrieved_documents(px.active_session())
retrieved_documents

| context.span_id | document_position | input | reference | document_score | context.trace_id |
| --- | --- | --- | --- | --- | --- |
| b7fa37fe-813f-41e3-8d1d-f29e307c9d46 | 0 | What were the three main components of the sof... | [8]\n\nThere were three main parts to the soft... | 0.828050 | 8bead10f-180a-4721-9b55-8bd99d4f3b9f |
| b7fa37fe-813f-41e3-8d1d-f29e307c9d46 | 1 | What were the three main components of the sof... | It may look clunky today, but in 1996 it was t... | 0.812155 | 8bead10f-180a-4721-9b55-8bd99d4f3b9f |
| 63eb71ea-6ec9-4080-9d49-a151d1d96c4b | 0 | What were the three main components of the eco... | [8]\n\nThere were three main parts to the soft... | 0.855644 | a6da757c-2e6d-4298-b2d0-83f9ee2d56b1 |
| 63eb71ea-6ec9-4080-9d49-a151d1d96c4b | 1 | What were the three main components of the eco... | It may look clunky today, but in 1996 it was t... | 0.829060 | a6da757c-2e6d-4298-b2d0-83f9ee2d56b1 |
| 3cf06534-6ab8-469d-9a35-5ba61cef990a | 0 | What financial situation was the author in whe... | One night in October 2003 there was a big part... | 0.839288 | 223b68be-6fd3-47b8-bd02-9ba97fdd5940 |
| ... | ... | ... | ... | ... | ... |
| 16aee420-8edb-4516-a722-a6807a20fcab | 1 | What was the first computer the narrator convi... | I was puzzled by the 1401. I couldn't figure o... | 0.837659 | 6be748b7-f25a-4927-821b-603674a436a2 |
| 2aeafcd2-5832-4583-a677-59feb802404d | 0 | Describe the limitations of programming on the... | I was puzzled by the 1401. I couldn't figure o... | 0.875245 | d321148b-ad91-484a-ae53-b7ce8797835b |
| 2aeafcd2-5832-4583-a677-59feb802404d | 1 | Describe the limitations of programming on the... | What I Worked On\n\nFebruary 2021\n\nBefore co... | 0.868623 | d321148b-ad91-484a-ae53-b7ce8797835b |
| 5c3829e3-2ae6-4030-953c-e355fc2bad55 | 0 | Describe the environment in which the author f... | What I Worked On\n\nFebruary 2021\n\nBefore co... | 0.823877 | 55998230-9de7-47a0-b97f-1f3a96ab1da4 |


Let's now use Phoenix's LLM Evals to evaluate the relevance of the retrieved documents with regard to the query.

In [None]:
from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

retrieved_documents_eval = llm_classify(
    retrieved_documents,
    OpenAIModel(model_name="gpt-4-1106-preview"),
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()] == "relevant"
).astype(int)

llm_classify |          | 0/104 (0.0%) | ⏳ 00:00<? | ?it/s
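
The `score` assignment above only compares rows that actually received a label, which matters when some evaluations are missing. The toy example below, using made-up labels, illustrates the effect: because pandas aligns on the index when the result is assigned back, a row without a label ends up with a missing score rather than being counted as irrelevant.

```python
import pandas as pd

# Made-up labels; None stands for a document that was never evaluated.
evals = pd.DataFrame({"label": ["relevant", "irrelevant", None]})

# Compare only the rows that have a label, then convert True/False to 1/0.
known = evals.label[~evals.label.isna()]
evals["score"] = (known == "relevant").astype(int)

# The unlabeled row is aligned to a missing value instead of 0.
print(evals)
```

Keeping missing evaluations as NaN means later aggregate metrics can treat them explicitly instead of silently lowering the scores.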

In [None]:
retrieved_documents_eval.head()

| context.span_id | document_position | label | explanation | score |
| --- | --- | --- | --- | --- |
| b7fa37fe-813f-41e3-8d1d-f29e307c9d46 | 0 | relevant | The question asks for the three main component... | 1 |
| b7fa37fe-813f-41e3-8d1d-f29e307c9d46 | 1 | relevant | The question asks for the three main component... | 1 |
| 63eb71ea-6ec9-4080-9d49-a151d1d96c4b | 0 | relevant | The question asks for specific information abo... | 1 |
| 63eb71ea-6ec9-4080-9d49-a151d1d96c4b | 1 | relevant | The question asks for specific information abo... | 1 |
| 3cf06534-6ab8-469d-9a35-5ba61cef990a | 0 | irrelevant | The question asks about the financial situatio... | 0 |


We can now combine the documents with the evaluations to compute ranking metrics. These metrics will help us understand how well the RAG system is performing.

In [None]:
documents_with_relevance = pd.concat([retrieved_documents, retrieved_documents_eval.add_prefix("eval_")], axis=1)
documents_with_relevance

| context.span_id | document_position | input | reference | document_score | context.trace_id | eval_label | eval_explanation | eval_score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| b7fa37fe-813f-41e3-8d1d-f29e307c9d46 | 0 | What were the three main components of the sof... | [8]\n\nThere were three main parts to the soft... | 0.828050 | 8bead10f-180a-4721-9b55-8bd99d4f3b9f | relevant | The question asks for the three main component... | 1 |
| b7fa37fe-813f-41e3-8d1d-f29e307c9d46 | 1 | What were the three main components of the sof... | It may look clunky today, but in 1996 it was t... | 0.812155 | 8bead10f-180a-4721-9b55-8bd99d4f3b9f | relevant | The question asks for the three main component... | 1 |
| 63eb71ea-6ec9-4080-9d49-a151d1d96c4b | 0 | What were the three main components of the eco... | [8]\n\nThere were three main parts to the soft... | 0.855644 | a6da757c-2e6d-4298-b2d0-83f9ee2d56b1 | relevant | The question asks for specific information abo... | 1 |
| 63eb71ea-6ec9-4080-9d49-a151d1d96c4b | 1 | What were the three main components of the eco... | It may look clunky today, but in 1996 it was t... | 0.829060 | a6da757c-2e6d-4298-b2d0-83f9ee2d56b1 | relevant | The question asks for specific information abo... | 1 |
| 3cf06534-6ab8-469d-9a35-5ba61cef990a | 0 | What financial situation was the author in whe... | One night in October 2003 there was a big part... | 0.839288 | 223b68be-6fd3-47b8-bd02-9ba97fdd5940 | irrelevant | The question asks about the financial situatio... | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 16aee420-8edb-4516-a722-a6807a20fcab | 1 | What was the first computer the narrator convi... | I was puzzled by the 1401. I couldn't figure o... | 0.837659 | 6be748b7-f25a-4927-821b-603674a436a2 | relevant | The question asks for two specific pieces of i... | 1 |
| 2aeafcd2-5832-4583-a677-59feb802404d | 0 | Describe the limitations of programming on the... | I was puzzled by the 1401. I couldn't figure o... | 0.875245 | d321148b-ad91-484a-ae53-b7ce8797835b | relevant | The reference text provides specific details a... | 1 |
| 2aeafcd2-5832-4583-a677-59feb802404d | 1 | Describe the limitations of programming on the... | What I Worked On\n\nFebruary 2021\n\nBefore co... | 0.868623 | d321148b-ad91-484a-ae53-b7ce8797835b | relevant | The reference text provides specific details a... | 1 |
| 5c3829e3-2ae6-4030-953c-e355fc2bad55 | 0 | Describe the environment in which the author f... | What I Worked On\n\nFebruary 2021\n\nBefore co... | 0.823877 | 55998230-9de7-47a0-b97f-1f3a96ab1da4 | relevant | The reference text provides a detailed account... | 1 |


Let's compute NDCG at 2.

In [None]:
import numpy as np
from sklearn.metrics import ndcg_score


def _compute_ndcg(df: pd.DataFrame, k: int):
    """Compute NDCG@k in the presence of missing values (e.g. as a result of keyboard interrupt)."""
    eval_scores = [np.nan] * k
    pred_scores = [np.nan] * k
    for i in range(k):
        if i >= len(df.eval_score):
            break
        eval_scores[i] = df.eval_score.iloc[i]
        pred_scores[i] = df.document_score.iloc[i]
    try:
        return ndcg_score([eval_scores], [pred_scores])
    except ValueError:
        return np.nan


ndcg_at_2 = pd.DataFrame({"score": documents_with_relevance.groupby("context.span_id").apply(_compute_ndcg, k=2)})

In [None]:
ndcg_at_2

| context.span_id | score |
| --- | --- |
| 026b442a-bc13-498b-a720-1c4b63fd909d | 1.0 |
| 0a05026e-9455-465e-ad09-56b81b1b9320 | 1.0 |
| 16aee420-8edb-4516-a722-a6807a20fcab | 1.0 |
| 1752eb6b-a923-40f3-a750-bc0f8b7b6f52 | 1.0 |
| 19218210-84f2-411c-be31-797b2cb3f001 | 0.0 |
| 1e38204d-cfd0-4c6f-8e2e-aa0cfde40d16 | 1.0 |
| 22d6bb20-75df-41f3-9753-ebf0262311c8 | 0.0 |
| 22e2d01e-4865-4301-8aa5-feb4633122c4 | 1.0 |
| 288097ba-3e5e-4e2f-b443-972ebabf3b80 | 1.0 |
| 2aeafcd2-5832-4583-a677-59feb802404d | 1.0 |
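
To see where these numbers come from, here is a small hand-rolled NDCG in plain Python. It is a sketch under two assumptions: the relevance list is already in retrieval order (highest `document_score` first), and the gain is linear with a `log2` rank discount. The helper names `dcg` and `ndcg_at_k` are hypothetical and not part of scikit-learn or Phoenix.

```python
import math


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: each relevance is divided by log2(rank + 1), ranks from 1."""
    return sum(rel / math.log2(position + 2) for position, rel in enumerate(relevances))


def ndcg_at_k(relevances: list[float], k: int) -> float:
    """DCG of the top-k results divided by the DCG of the best possible ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


print(ndcg_at_k([1, 1], k=2))  # both documents relevant
print(ndcg_at_k([1, 0], k=2))  # only the top-ranked document relevant
print(ndcg_at_k([0, 1], k=2))  # only the second-ranked document relevant
print(ndcg_at_k([0, 0], k=2))  # nothing relevant
```

With two documents and binary relevance, the score is 1.0 whenever the relevant documents are ranked as high as they can be, roughly 0.63 when the only relevant document sits in second place, and 0.0 when neither document is relevant.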


Let's also compute precision at 2.

In [None]:
precision_at_2 = pd.DataFrame(
    {
        "score": documents_with_relevance.groupby("context.span_id").apply(
            lambda x: x.eval_score[:2].sum(skipna=False) / 2
        )
    }
)

In [None]:
precision_at_2

| context.span_id | score |
| --- | --- |
| 026b442a-bc13-498b-a720-1c4b63fd909d | 1.0 |
| 0a05026e-9455-465e-ad09-56b81b1b9320 | 0.5 |
| 16aee420-8edb-4516-a722-a6807a20fcab | 1.0 |
| 1752eb6b-a923-40f3-a750-bc0f8b7b6f52 | 0.5 |
| 19218210-84f2-411c-be31-797b2cb3f001 | 0.0 |
| 1e38204d-cfd0-4c6f-8e2e-aa0cfde40d16 | 1.0 |
| 22d6bb20-75df-41f3-9753-ebf0262311c8 | 0.0 |
| 22e2d01e-4865-4301-8aa5-feb4633122c4 | 1.0 |
| 288097ba-3e5e-4e2f-b443-972ebabf3b80 | 1.0 |
| 2aeafcd2-5832-4583-a677-59feb802404d | 1.0 |
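
Precision at k is simply the share of the top-k retrieved documents that were judged relevant. The sketch below computes it per query and then averages across queries; the relevance lists and query names are made up for illustration, and `precision_at_k` is a hypothetical helper rather than a Phoenix function.

```python
from statistics import mean


def precision_at_k(relevances: list[int], k: int) -> float:
    """Fraction of the top-k retrieved documents judged relevant (1) rather than irrelevant (0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    # Dividing by k (not by the number retrieved) penalizes queries that return fewer than k documents.
    return sum(relevances[:k]) / k


# Made-up relevance judgments in retrieval order, one list per query.
per_query = {"q1": [1, 1], "q2": [1, 0], "q3": [0, 0]}
scores = {query: precision_at_k(rels, k=2) for query, rels in per_query.items()}

print(scores)                 # {'q1': 1.0, 'q2': 0.5, 'q3': 0.0}
print(mean(scores.values()))  # 0.5
```

Unlike NDCG, precision ignores the order within the top k. That would be consistent with span `0a05026e-9455-465e-ad09-56b81b1b9320` in the tables above, which has a precision of 0.5 but an NDCG of 1.0.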


Let's now view the results in a combined dataframe.

In [None]:
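# The metrics are indexed by span id only, so prefix their columns and join them onto the per-document rows.
metrics = pd.concat([ndcg_at_2.add_prefix("ndcg@2_"), precision_at_2.add_prefix("precision@2_")], axis=1)
rag_evaluation_dataframe = documents_with_relevance.join(metrics, on="context.span_id")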
rag_evaluation_dataframe