# Evaluate RAG with LlamaIndex

I'm familiar with building RAG applications with LangChain, but this is my first time exploring LlamaIndex.  

RAG evaluation is important, but how exactly to go about it hasn't been clear to me. Is there a consensus on what metric(s) to use? How can user satisfaction be quantified? This notebook is my attempt to improve my understanding, and it draws heavily on this [tutorial](https://github.com/openai/openai-cookbook/blob/main/examples/evaluation/Evaluate_RAG_with_LlamaIndex.ipynb). Let's dive in!

## Import libraries & data

In [24]:
from llama_index.evaluation import generate_question_context_pairs
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.node_parser import SimpleNodeParser
from llama_index.evaluation import RetrieverEvaluator
from llama_index.evaluation import FaithfulnessEvaluator
from llama_index.llms import OpenAI
from llama_index.response.notebook_utils import display_source_node


import pandas as pd
from dotenv import load_dotenv

# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio

nest_asyncio.apply()
load_dotenv()

True

The source document is the essay "What I Worked On", taken from [Paul Graham's website](https://www.paulgraham.com/worked.html).

In [2]:
documents = SimpleDirectoryReader("./data/").load_data()

In [3]:
documents[0].text[:200]

"\n\nWhat I Worked On\n\nFebruary 2021\n\nBefore college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed "

## Set up RAG pipeline

Initialize LLM (OpenAI's GPT-4 Turbo) and build index:

In [4]:
# Define an LLM
llm = OpenAI(model="gpt-4-1106-preview", temperature=0)
service_context = ServiceContext.from_defaults(llm=llm)

# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes, service_context=service_context)

Build a QueryEngine and start querying.

In [5]:
query_engine = vector_index.as_query_engine(similarity_top_k=2)

Run a query and get a response:

In [6]:
llm_query = "Did the author enjoy philosophy courses? Explain and justify your answer."

In [7]:
response_vector = query_engine.query(llm_query)
response_vector.response

"The author did not enjoy philosophy courses in college. Despite initially believing that philosophy would be the study of ultimate truths, the author found that other fields covered much of the space of ideas, leaving philosophy with what seemed to be only edge cases. The author's experience with philosophy courses was that they were boring, which led to a decision to switch to studying artificial intelligence (AI) instead."

By default, the query engine retrieves the two most similar nodes (chunks). This can be changed by passing a different value of `k` to `vector_index.as_query_engine(similarity_top_k=k)`.  

Let's check the text in each of these retrieved nodes.

In [8]:
retriever = vector_index.as_retriever(similarity_top_k=2)
retrieved_nodes = retriever.retrieve(llm_query)

In [9]:
for node in retrieved_nodes:
    display_source_node(node, source_length=1000)

**Node ID:** 25a77d90-cedb-4d5d-9f42-a15cce9425a0<br>**Similarity:** 0.7976894574453169<br>**Text:** This was when I really started programming. I wrote simple games, a program to predict how high my model rockets would fly, and a word processor that my father used to write at least one book. There was only room in memory for about 2 pages of text, so he'd write 2 pages at a time and then print them out, but it was a lot better than a typewriter.

Though I liked programming, I didn't plan to study it in college. In college I was going to study philosophy, which sounded much more powerful. It seemed, to my naive high school self, to be the study of the ultimate truths, compared to which the things studied in other fields would be mere domain knowledge. What I discovered when I got to college was that the other fields took up so much of the space of ideas that there wasn't much left for these supposed ultimate truths. All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.

I couldn't have put this into words when I was 18. All I ...<br>

**Node ID:** b52dd38f-bdf7-42b1-b977-23b73f5f3049<br>**Similarity:** 0.7954562357185037<br>**Text:** How should I choose what to do? Well, how had I chosen what to work on in the past? I wrote an essay for myself to answer that question, and I was surprised how long and messy the answer turned out to be. If this surprised me, who'd lived it, then I thought perhaps it would be interesting to other people, and encouraging to those with similarly messy lives. So I wrote a more detailed version for others to read, and this is the last sentence of it.









Notes

[1] My experience skipped a step in the evolution of computers: time-sharing machines with interactive OSes. I went straight from batch processing to microcomputers, which made microcomputers seem all the more exciting.

[2] Italian words for abstract concepts can nearly always be predicted from their English cognates (except for occasional traps like polluzione). It's the everyday words that differ. So if you string together a lot of abstract concepts with a few simple verbs, you can make a little Italian go a long way.

[...<br>

## How to evaluate responses?

With the RAG pipeline built, the next step is to evaluate its performance. LlamaIndex's core evaluation modules let us assess the query engine and quantify the quality of both retrieval and generation.



Evaluation focuses on two critical aspects:

* **Retrieval Evaluation**: This assesses the accuracy and relevance of the information retrieved by the system.
* **Response Evaluation**: This measures the quality and appropriateness of the responses generated by the system based on the retrieved information.


`generate_question_context_pairs`:
* Used to generate a set of (question, context) pairs over a given unstructured text corpus.  
* This uses the LLM to auto-generate questions from each context chunk.  
* The output is an `EmbeddingQAFinetuneDataset` object. At a high level, this contains a set of ids mapping to queries and relevant doc chunks, as well as the corpus itself.  
* This dataset is then used for both Retrieval Evaluation and Response Evaluation of the RAG system.
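
To make that layout concrete, here is a minimal sketch that uses plain dictionaries rather than the real `EmbeddingQAFinetuneDataset` class. The `queries` attribute is used later in this notebook; the names `corpus` and `relevant_docs`, and all ids and questions below, are assumptions made for illustration only.

```python
# Plain-dict stand-in for the dataset layout described above.
# All ids and questions are hypothetical; the texts are shortened essay snippets.
corpus = {
    "node-1": "Before college the two main things I worked on ...",
    "node-2": "In college I was going to study philosophy ...",
}
queries = {
    "q-1": "What did the author work on before college?",
    "q-2": "What did the author plan to study in college?",
}
# Each query id maps to the ids of the chunk(s) it was generated from.
relevant_docs = {"q-1": ["node-1"], "q-2": ["node-2"]}


def question_context_pairs(queries, corpus, relevant_docs):
    """Yield (question, context) pairs by following the id mappings."""
    for query_id, question in queries.items():
        for node_id in relevant_docs[query_id]:
            yield question, corpus[node_id]


pairs = list(question_context_pairs(queries, corpus, relevant_docs))
for question, context in pairs:
    print(f"{question!r} -> {context!r}")
```

The point of the id indirection is that retrieval can be scored by comparing retrieved node ids against the expected ids, without comparing text.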


In [10]:
qa_dataset = generate_question_context_pairs(nodes, llm=llm, num_questions_per_chunk=2)

100%|██████████| 57/57 [09:11<00:00,  9.68s/it]


## Retrieval Evaluation

Define RetrieverEvaluator. We use **Hit Rate** and **MRR** metrics to evaluate our Retriever.

**Hit Rate:**

Hit rate calculates the fraction of queries where the correct answer is found within the top-k retrieved documents. In simpler terms, it’s about how often our system gets it right within the top few guesses.

**Mean Reciprocal Rank (MRR):**

For each query, MRR evaluates the system’s accuracy by looking at the rank of the highest-placed relevant document. Specifically, it’s the average of the reciprocals of these ranks across all the queries. So, if the first relevant document is the top result, the reciprocal rank is 1; if it’s second, the reciprocal rank is 1/2, and so on.
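
As a rough illustration of the two definitions, the sketch below computes both metrics by hand on toy data. It is not LlamaIndex's implementation: the function name `hit_rate_and_mrr` and the ids are hypothetical, and it assumes each query has exactly one expected node.

```python
from typing import Dict, List


def hit_rate_and_mrr(
    retrieved: Dict[str, List[str]], expected: Dict[str, str]
) -> Dict[str, float]:
    """Compute hit rate and MRR over a set of queries.

    retrieved maps a query id to its ranked list of retrieved node ids;
    expected maps a query id to the single node id that should be found.
    """
    hits = 0
    reciprocal_ranks = []
    for query_id, expected_id in expected.items():
        ranked_ids = retrieved.get(query_id, [])
        if expected_id in ranked_ids:
            hits += 1
            # Rank is 1-based: first position gives 1, second gives 1/2, ...
            reciprocal_ranks.append(1.0 / (ranked_ids.index(expected_id) + 1))
        else:
            # A miss contributes 0 to the MRR average.
            reciprocal_ranks.append(0.0)
    n = len(expected)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}


# Toy data: q1 hits at rank 1, q2 hits at rank 2, q3 misses.
retrieved = {"q1": ["a", "b"], "q2": ["c", "d"], "q3": ["e", "f"]}
expected = {"q1": "a", "q2": "d", "q3": "z"}
scores = hit_rate_and_mrr(retrieved, expected)
print(scores)
```

Here two of three queries are hits, so the hit rate is 2/3, while the MRR is (1 + 1/2 + 0) / 3 = 0.5. MRR can never exceed hit rate, and the gap between them grows as hits land lower in the ranking.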

In [11]:
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"], retriever=retriever
)

In [12]:
# Evaluate
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

Let's define a function to display the Retrieval evaluation results in table format.

In [13]:
def display_results(name, eval_results):
    """Display results from evaluate."""

    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)

    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame(
        {"Retriever Name": [name], "Hit Rate": [hit_rate], "MRR": [mrr]}
    )

    return metric_df

In [14]:
display_results("OpenAI Embedding Retriever", eval_results)

| | Retriever Name | Hit Rate | MRR |
|---|---|---|---|
| 0 | OpenAI Embedding Retriever | 0.780702 | 0.657895 |


**Retrieval evaluation results:**
* MRR (0.66) is lower than Hit Rate (0.78), which means that for some queries the relevant chunk is retrieved but ranked second rather than first.
* How to improve MRR? Consider using a reranker to refine the order of the retrieved documents.
* [Blog post](https://blog.llamaindex.ai/boosting-rag-picking-the-best-embedding-reranker-models-42d079022e83) on how rerankers can help optimize retrieval metrics.
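
The gap between the two numbers can be broken down further. The sketch below assumes the retriever returns exactly two results per query (`similarity_top_k=2`, as configured earlier) and that each generated question has a single relevant chunk; `rank_breakdown` is a hypothetical helper, not part of LlamaIndex.

```python
def rank_breakdown(hit_rate: float, mrr: float) -> dict:
    """Split hit rate into rank-1 and rank-2 fractions, assuming top_k == 2.

    With two results per query the reciprocal rank is 1, 1/2, or 0, so
      hit_rate = p1 + p2
      mrr      = p1 + p2 / 2
    and solving the two equations gives p2 = 2 * (hit_rate - mrr).
    """
    p2 = 2 * (hit_rate - mrr)
    p1 = hit_rate - p2
    return {"rank_1": p1, "rank_2": p2, "miss": 1 - hit_rate}


breakdown = rank_breakdown(hit_rate=0.780702, mrr=0.657895)
for outcome, fraction in breakdown.items():
    print(f"{outcome}: {fraction:.3f}")
```

Under those assumptions, roughly 53.5% of queries find the expected chunk at rank 1, 24.6% at rank 2, and 21.9% miss it entirely. A reranker can only help with the second group; the misses would need better retrieval, for example a larger `similarity_top_k`.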

## Response Evaluation

There are two ways to evaluate the quality of a response:

**Faithfulness Evaluator:**  
Measures whether the response from a query engine is supported by any of the source nodes. This is useful for detecting hallucinated responses.  

**Relevancy Evaluator:**  
Measures whether the response and the source nodes match the query.

In [15]:
# Get the list of queries from the qa_dataset created above

queries = list(qa_dataset.queries.values())


Preview of `queries`:

In [20]:
queries[0:5]

['Describe the differences in user interaction between the IBM 1401 and microcomputers as experienced by the author. Provide examples from the text to support your answer.',
 "Based on the author's experience, discuss the limitations of early programming with punch cards on the IBM 1401 and how it affected the types of programs that could be written by a beginner.",
 "Describe the limitations and challenges faced by early programmers when the only form of input to programs was data stored on punched cards, as illustrated by the author's personal experience.",
 "How did the introduction of microcomputers, like the TRS-80 and Apple II, change the process of programming according to the author's narrative, and what were some of the applications the author was able to create with these new technologies?",
 'In the context provided, the author initially planned to study philosophy in college before switching to AI. Based on their experience, discuss the reasons that led to the change in the

### FaithfulnessEvaluator

gpt-3.5-turbo will be used to generate the response for a given query, and GPT-4 Turbo (`gpt-4-1106-preview`) will be used to evaluate it.

In [23]:
# gpt-3.5-turbo
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
service_context_gpt35 = ServiceContext.from_defaults(llm=gpt35)

# gpt-4 turbo
gpt4t = OpenAI(temperature=0, model="gpt-4-1106-preview")
service_context_gpt4t = ServiceContext.from_defaults(llm=gpt4t)


Create a QueryEngine with `service_context_gpt35` to generate a response for the query.

In [22]:
vector_index_gpt35 = VectorStoreIndex(nodes, service_context=service_context_gpt35)
query_engine_gpt35 = vector_index_gpt35.as_query_engine()


Define `FaithfulnessEvaluator` like so:

In [25]:
faithfulness_gpt4t = FaithfulnessEvaluator(service_context=service_context_gpt4t)


Select one query ...

In [32]:
eval_query = queries[3]

eval_query


"How did the introduction of microcomputers, like the TRS-80 and Apple II, change the process of programming according to the author's narrative, and what were some of the applications the author was able to create with these new technologies?"

... get the response from gpt-3.5-turbo ...

In [39]:
response_vector = query_engine_gpt35.query(eval_query)
response_vector.response


"The introduction of microcomputers, such as the TRS-80 and Apple II, revolutionized the process of programming according to the author's narrative. With microcomputers, programmers were no longer limited to using punched cards as the only form of input. Instead, they could have a computer right in front of them on a desk that could respond to their keystrokes in real-time. This allowed for a much more interactive and immediate programming experience.\n\nAs for the applications the author was able to create with these new technologies, they mentioned writing simple games, a program to predict the height of model rockets, and a word processor that their father used to write at least one book. Despite the limited memory capacity of the TRS-80, the author found it to be a significant improvement over a typewriter."

... and use GPT-4 Turbo to evaluate it:

In [41]:
eval_result = faithfulness_gpt4t.evaluate_response(response=response_vector)
# The `passing` attribute of eval_result indicates whether the response passed the evaluation.
eval_result.passing


True
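
A single passing query says little about the pipeline as a whole. One possible next step, sketched below, is to run the same check over several queries and report the fraction that pass. `faithfulness_pass_rate` is a hypothetical helper that relies only on the two calls used above (`query(...)` and `evaluate_response(response=...)`); the stub classes are stand-ins so the sketch runs without any API calls.

```python
from typing import Iterable


def faithfulness_pass_rate(query_engine, evaluator, queries: Iterable[str]) -> float:
    """Return the fraction of queries whose response passes the evaluator."""
    outcomes = []
    for query in queries:
        response = query_engine.query(query)
        result = evaluator.evaluate_response(response=response)
        outcomes.append(bool(result.passing))
    # Avoid dividing by zero when no queries are given.
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


# --- Hypothetical stand-ins, for demonstration only ---
class StubResult:
    def __init__(self, passing: bool):
        self.passing = passing


class StubQueryEngine:
    def query(self, query: str) -> str:
        # Echo the query back as the "response".
        return query


class StubEvaluator:
    def evaluate_response(self, response: str) -> StubResult:
        # Pretend every response is faithful except one marked "unsupported".
        return StubResult(response != "unsupported")


rate = faithfulness_pass_rate(
    StubQueryEngine(), StubEvaluator(), ["a", "b", "unsupported", "c"]
)
print(rate)
```

In the notebook, the stubs would be replaced by the real objects, for example `faithfulness_pass_rate(query_engine_gpt35, faithfulness_gpt4t, queries[:10])`. Note that this makes two LLM calls per query, so a small slice keeps cost and runtime manageable.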

### Relevancy Evaluator
