# Evaluate RAG with LlamaIndex

I'm familiar with building RAG applications with LangChain, but this is my first time exploring LlamaIndex.  

RAG evaluation is important, but how exactly to go about it hasn't been clear to me. Is there a consensus on which metric(s) to use? How can user satisfaction be quantified? This notebook is my attempt to improve my understanding, and it draws heavily on this [tutorial](https://github.com/openai/openai-cookbook/blob/main/examples/evaluation/Evaluate_RAG_with_LlamaIndex.ipynb). Let's dive in!

## Import libraries & data

In [4]:
from llama_index.evaluation import generate_question_context_pairs
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.evaluation import RetrieverEvaluator
from llama_index.llms import OpenAI
from llama_index.response.notebook_utils import display_source_node

import pandas as pd
from dotenv import load_dotenv

# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()
load_dotenv()


True

The data is the essay "What I Worked On", taken from [Paul Graham's website](https://www.paulgraham.com/worked.html).

In [2]:
documents = SimpleDirectoryReader("./data/").load_data()

In [36]:
documents[0].text[:200]

"\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed "

## Set up the RAG pipeline

Initialize the LLM (OpenAI's GPT-4) and build the index:

In [62]:
# Define an LLM (passed to generate_question_context_pairs later to write the evaluation questions)
llm = OpenAI(model="gpt-4-1106-preview")

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)
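
To get a feel for what `chunk_size` controls, here is a minimal sketch of fixed-size chunking with overlap. It is not how `SimpleNodeParser` is implemented: it counts whitespace-separated words rather than real tokens, and the helper `chunk_tokens` and its `overlap` parameter are hypothetical names used only for illustration.

```python
def chunk_tokens(text, chunk_size=512, overlap=20):
    """Split text into overlapping windows of whitespace-separated words.

    Illustrative only: a real parser counts tokens with a tokenizer
    and treats text boundaries more carefully.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Toy input: ten "words", windows of 4 with 1 word of overlap.
toy_text = " ".join(f"w{i}" for i in range(10))
toy_chunks = chunk_tokens(toy_text, chunk_size=4, overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

Smaller chunks give the retriever more precise targets but less context per chunk, so `chunk_size` is one of the knobs worth varying when comparing evaluation results.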

Build a QueryEngine and start querying.

In [63]:
query_engine = vector_index.as_query_engine(similarity_top_k=3)

Run a query and get a response:

In [76]:
llm_query = "Did the author enjoy philosophy courses? Explain and justify your answer."

In [64]:
response_vector = query_engine.query(llm_query)
response_vector.response


'The author did not enjoy philosophy courses. This can be inferred from the statement that "All I knew at the time was that I kept taking philosophy courses and they kept being boring." The author\'s decision to switch to AI suggests a lack of interest and enjoyment in philosophy.'

By default, the query engine retrieves the two most similar nodes (chunks). Here we raised that to three with `vector_index.as_query_engine(similarity_top_k=3)`.  
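
To make `similarity_top_k` concrete, here is a toy sketch of top-k retrieval by cosine similarity. The 2-dimensional vectors and the helpers `cosine_similarity` and `top_k` are made up for illustration. A real index stores high-dimensional embeddings and may rank them differently, so treat this as an assumption-laden picture of the idea rather than LlamaIndex's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=2):
    """Return the k (chunk_id, score) pairs most similar to the query."""
    scored = [
        (chunk_id, cosine_similarity(query_vec, vec))
        for chunk_id, vec in chunk_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # best match first
    return scored[:k]


# Hypothetical embeddings for three chunks and one query.
toy_chunk_vecs = {
    "chunk_a": [1.0, 0.0],
    "chunk_b": [0.8, 0.6],
    "chunk_c": [0.0, 1.0],
}
toy_query_vec = [1.0, 0.1]

top_two = top_k(toy_query_vec, toy_chunk_vecs, k=2)
print([chunk_id for chunk_id, _ in top_two])
```

Raising `k` can only add candidates to the result list, which is why hit rate never decreases as `similarity_top_k` grows.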

Let's check the text in each of these retrieved nodes.

In [77]:
retriever = vector_index.as_retriever(similarity_top_k=3)
retrieved_nodes = retriever.retrieve(llm_query)


In [None]:
for node in retrieved_nodes:
    display_source_node(node, source_length=1000)


## How to evaluate responses?

We have built a RAG pipeline and now need to evaluate its performance. LlamaIndex's core evaluation modules let us quantify the quality of our RAG system (the query engine), so let's see how to use them.



Evaluation focuses on two critical aspects:

* **Retrieval Evaluation**: This assesses the accuracy and relevance of the information retrieved by the system.
* **Response Evaluation**: This measures the quality and appropriateness of the responses generated by the system based on the retrieved information.


`generate_question_context_pairs`:
* Used to generate a set of (question, context) pairs over a given unstructured text corpus.  
* This uses the LLM to auto-generate questions from each context chunk.  
* The output is an `EmbeddingQAFinetuneDataset` object. At a high level, this contains a set of ids mapping to queries and relevant doc chunks, as well as the corpus itself.  
* This dataset is then used for both Retrieval Evaluation and Response Evaluation of the RAG system.
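
The shape of such a dataset can be sketched with plain dictionaries. The field names `queries`, `corpus` and `relevant_docs` reflect my understanding of `EmbeddingQAFinetuneDataset` and should be checked against the installed version. The ids, the questions and the helper `expected_ids` are hypothetical, and the chunk text is quoted from the outputs shown earlier in this notebook.

```python
# A hand-written stand-in for the generated dataset (assumed layout).
toy_dataset = {
    # query id -> generated question
    "queries": {
        "q1": "What did the author work on before college?",
        "q2": "How did the author find the philosophy courses?",
    },
    # node id -> chunk text
    "corpus": {
        "node_1": "Before college the two main things I worked on, "
                  "outside of school, were writing and programming.",
        "node_2": "I kept taking philosophy courses and they kept being boring.",
    },
    # query id -> ids of the chunk(s) the question was generated from
    "relevant_docs": {
        "q1": ["node_1"],
        "q2": ["node_2"],
    },
}


def expected_ids(dataset, query_id):
    """Look up which chunk ids count as correct for a given query."""
    return dataset["relevant_docs"][query_id]


for query_id, question in toy_dataset["queries"].items():
    source_id = expected_ids(toy_dataset, query_id)[0]
    print(question, "->", toy_dataset["corpus"][source_id][:40])
```

Because every question is generated from a known chunk, the retriever can be scored automatically: a retrieval is correct when the source chunk's id appears in the results.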


In [69]:
qa_dataset = generate_question_context_pairs(nodes, llm=llm, num_questions_per_chunk=2)


100%|██████████| 57/57 [08:37<00:00,  9.08s/it]


## Retrieval Evaluation

Define a `RetrieverEvaluator`. We use the **Hit Rate** and **MRR** metrics to evaluate our retriever.

**Hit Rate:**

Hit rate is the fraction of queries for which the expected chunk (the one the question was generated from) appears among the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.

**Mean Reciprocal Rank (MRR):**

For each query, MRR evaluates the system’s accuracy by looking at the rank of the highest-placed relevant document. Specifically, it’s the average of the reciprocals of these ranks across all the queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so on.
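
Both metrics can be worked out by hand on a small example. The sketch below uses made-up node ids and the hypothetical helpers `hit` and `reciprocal_rank`. It follows the definitions above and is not the library's own implementation.

```python
def hit(retrieved_ids, expected_ids):
    """1.0 if any expected id appears in the retrieved list, else 0.0."""
    return 1.0 if any(doc_id in expected_ids for doc_id in retrieved_ids) else 0.0


def reciprocal_rank(retrieved_ids, expected_ids):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in expected_ids:
            return 1.0 / rank
    return 0.0


# (retrieved ids in ranked order, expected ids) for four toy queries, k = 3.
toy_runs = [
    (["n1", "n7", "n3"], ["n1"]),  # relevant chunk at rank 1
    (["n4", "n2", "n9"], ["n2"]),  # relevant chunk at rank 2
    (["n5", "n6", "n8"], ["n0"]),  # relevant chunk not retrieved
    (["n3", "n5", "n2"], ["n2"]),  # relevant chunk at rank 3
]

hit_rate = sum(hit(r, e) for r, e in toy_runs) / len(toy_runs)
mrr = sum(reciprocal_rank(r, e) for r, e in toy_runs) / len(toy_runs)

print(f"hit rate = {hit_rate:.3f}")  # 3 of 4 queries hit
print(f"MRR      = {mrr:.3f}")       # (1 + 1/2 + 0 + 1/3) / 4
```

Since a reciprocal rank is at most 1 for a hit and 0 for a miss, MRR can never exceed the hit rate. The gap between the two indicates how often the relevant chunk is retrieved but not ranked first.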

In [81]:
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)


In [82]:
# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)


Let's define a function to display the Retrieval evaluation results in table format.

In [84]:
def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df


In [85]:
display_results("OpenAI Embedding Retriever", eval_results)


| | Retriever Name | Hit Rate | MRR |
|---|---|---|---|
| 0 | OpenAI Embedding Retriever | 0.824561 | 0.665205 |
