# Evaluate RAG with LlamaIndex

In this notebook we will build a RAG pipeline and evaluate it with LlamaIndex. It has the following three sections.

1. Understanding Retrieval Augmented Generation (RAG).
2. Building RAG with LlamaIndex.
3. Evaluating RAG with LlamaIndex.

**Retrieval Augmented Generation (RAG)**

LLMs are trained on vast datasets, but these will not include your specific data. Retrieval-Augmented Generation (RAG) addresses this by dynamically incorporating your data during the generation process. This is done not by altering the training data of LLMs, but by allowing the model to access and utilize your data in real-time to provide more tailored and contextually relevant responses.

In RAG, your data is loaded and prepared for queries, or “indexed”. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response.
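To make the flow above concrete, here is a minimal sketch of the index, retrieve, and prompt steps in plain Python. It assumes a toy bag-of-words "embedding" in place of a real embedding model, and the sample chunks, the function names, and the prompt wording are all made up for illustration. It is not how LlamaIndex implements retrieval.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real embedding model)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, top_k=2):
    """Rank the indexed chunks by similarity to the query and keep the best top_k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]

def build_prompt(query, context_chunks):
    """Combine the retrieved context and the query into the text sent to the LLM."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Hypothetical data, for illustration only.
chunks = [
    "The admissions office opens at 8 am on weekdays.",
    "Tuition fees are due at the start of each semester.",
    "The library holds over one million volumes.",
]
query = "When are tuition fees due?"
top_chunks = retrieve(query, chunks, top_k=1)
prompt = build_prompt(query, top_chunks)
print(prompt)
```

In a real pipeline the prompt would then be passed to the LLM, and the embeddings would come from a trained model rather than word counts.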

Even if what you’re building is a chatbot or an agent, you’ll want to know RAG techniques for getting data into your application.

**Stages within RAG**

There are five key stages within RAG, which in turn will be a part of any larger application you build. These are:

**Loading:** this refers to getting your data from where it lives – whether it’s text files, PDFs, another website, a database, or an API – into your pipeline. LlamaHub provides hundreds of connectors to choose from.

**Indexing:** this means creating a data structure that allows for querying the data. For LLMs this nearly always means creating vector embeddings, numerical representations of the meaning of your data, as well as numerous other metadata strategies to make it easy to accurately find contextually relevant data.

**Storing:** Once your data is indexed, you will want to store your index, along with any other metadata, to avoid the need to re-index it.

**Querying:** for any given indexing strategy there are many ways you can utilize LLMs and LlamaIndex data structures to query, including sub-queries, multi-step queries and hybrid strategies.

**Evaluation:** a critical step in any pipeline is checking how effective it is relative to other strategies, or when you make changes. Evaluation provides objective measures of how accurate, faithful and fast your responses to queries are.
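As one example of what "objective measures" can look like, the sketch below computes two common retrieval metrics, hit rate and mean reciprocal rank (MRR), in plain Python. The node IDs and the helper names are hypothetical, and this is only an illustration of the idea, not the implementation used by any evaluation library.

```python
def hit_rate(retrieved_ids, expected_ids):
    """Fraction of queries whose expected node appears anywhere in the retrieved list."""
    hits = sum(1 for got, want in zip(retrieved_ids, expected_ids) if want in got)
    return hits / len(expected_ids)

def mean_reciprocal_rank(retrieved_ids, expected_ids):
    """Average of 1/rank of the expected node (0 when it was not retrieved)."""
    total = 0.0
    for got, want in zip(retrieved_ids, expected_ids):
        if want in got:
            total += 1.0 / (got.index(want) + 1)
    return total / len(expected_ids)

# Hypothetical retrieval results for three queries (top-3 node IDs each).
retrieved = [["n3", "n1", "n7"], ["n2", "n9", "n4"], ["n5", "n6", "n8"]]
expected = ["n1", "n2", "n0"]

print("hit rate:", hit_rate(retrieved, expected))
print("MRR:", mean_reciprocal_rank(retrieved, expected))
```

Hit rate only asks whether the right chunk was retrieved at all, while MRR also rewards ranking it near the top.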

## Build a RAG system

Now that we understand why RAG matters, let's build a simple RAG pipeline.

In [None]:
!pip install llama-index
!pip install ragas
!pip install datasets
!pip install pypdf



In [None]:
# The nest_asyncio module enables the nesting of asynchronous functions within an already running async loop.
# This is necessary because Jupyter notebooks inherently operate in an asynchronous loop.
# By applying nest_asyncio, we can run additional async functions within this existing loop without conflicts.
import nest_asyncio
nest_asyncio.apply()

from llama_index.core.evaluation import generate_question_context_pairs
# ServiceContext is not used in this notebook and is no longer available in recent llama-index releases.
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.llms.openai import OpenAI

import os
import pandas as pd

from datasets import Dataset

In [None]:

from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

Set Your OpenAI API Key

In [None]:
os.environ['OPENAI_API_KEY'] = ''
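Rather than pasting the key into the notebook, one option is to export it in the environment beforehand and fail early if it is missing. The helper below is a small sketch of that check; `require_env` is a hypothetical name, not part of LlamaIndex or the OpenAI client.

```python
import os

def require_env(name="OPENAI_API_KEY"):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the notebook.")
    return value

# Usage (uncomment once the variable is exported in your shell):
# api_key = require_env("OPENAI_API_KEY")
```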


#### Load Data and Build Index.

In [None]:
documents = SimpleDirectoryReader("/content/Data").load_data()

# Define an LLM
llm = OpenAI(model="gpt-3.5-turbo")



# Build index with a chunk_size of 512
node_parser = SimpleNodeParser.from_defaults(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)
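To get a feel for what a chunk size means, here is a simplified sketch of fixed-size chunking with overlap. It counts whitespace-separated words, whereas the node parser above works on tokens and tries to respect sentence boundaries, so treat it as an illustration only. The function name and the `overlap` default are assumptions.

```python
def chunk_words(text, chunk_size=512, overlap=20):
    """Split text into chunks of at most chunk_size words, sharing `overlap` words between neighbours."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once this chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks

# Tiny example: 10 words, chunks of 4 words, 1 word of overlap.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, chunk_size=4, overlap=1)
for piece in pieces:
    print(piece)
```

Smaller chunks give more precise retrieval hits but less context per hit, so the chunk size is worth revisiting when evaluation scores are low.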





In [None]:
!pip install llama-index-llms-huggingface llama-index-embeddings-huggingface



In [None]:
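# Optional: use a Hugging Face embedding model instead of the default.
# Set these before building the index so that indexing and querying use the same embedding model.
# from llama_index.core import Settings
# from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Settings.llm = llm
# Settings.embed_model = HuggingFaceEmbedding(model_name="dangvantuan/vietnamese-embedding")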

Build a QueryEngine and start querying.

In [None]:
query_engine = vector_index.as_query_engine(similarity_top_k=5)

#### HyDE (Hypothetical Document Embeddings)

In [None]:
# HyDE setup
hyde = HyDEQueryTransform(include_original=True)
# Transform the query engine using HyDE
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
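The idea behind HyDE is to ask the LLM for a hypothetical answer passage first and then retrieve with the embedding of that passage, optionally together with the original query. The sketch below illustrates this with a stubbed LLM. `fake_llm`, the prompt text, and combining the strings by averaging their vectors are assumptions made for illustration, not a description of the library's internals.

```python
def fake_llm(prompt):
    """Hypothetical stand-in for an LLM call; a real one would return a generated passage."""
    return "Hypothetical answer passage."

def hyde_embedding_strings(query, generate=fake_llm, include_original=True):
    """Return the strings that would be embedded for retrieval under HyDE."""
    hypothetical_doc = generate(
        f"Write a short passage that answers the question.\nQuestion: {query}\nPassage:"
    )
    strings = [hypothetical_doc]
    if include_original:
        # Keep the original query alongside the hypothetical passage.
        strings.append(query)
    return strings

def mean_vector(vectors):
    """Average several embedding vectors element-wise into a single query vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

strings = hyde_embedding_strings("Which specializations does the program include?")
print(strings)
print(mean_vector([[1.0, 3.0], [3.0, 5.0]]))
```

Because the hypothetical passage looks like a document rather than a question, it may sit closer to the relevant chunks in embedding space than the bare query does.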

In [None]:
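# Query in Vietnamese: "Which specializations does the Construction and Construction Project Management program at Ho Chi Minh City University of Technology (Bách Khoa) include?"
response_vector = hyde_query_engine.query("ngành Xây dựng và Quản lý Dự án Xây dựng đại học Bách Khoa TPHCM bao gồm những chuyên ngành nào ?")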

Check response.

In [None]:
response_vector.response

'Ngành Xây dựng và Quản lý Dự án Xây dựng tại Đại học Bách Khoa TP.HCM bao gồm hai chuyên ngành chính là Kỹ thuật Xây dựng và Quản lý Xây dựng.'

In [None]:
#!pip install openpyxl

In [None]:
import pandas as pd

# Specify the file path
file_path = r'/content/Test set (2).xlsx'

# Read the Excel file into a DataFrame
test = pd.read_excel(file_path)

# Print the DataFrame
print(test)

                                            question   ground_truth
0  Mã ngành Địa Lý học chương trình tiêu chuẩn đạ...  [': 7310501']


We have built a RAG pipeline and now need to evaluate its performance. We run the questions from the test set loaded above through the query engine and score the results with Ragas on four metrics: faithfulness, answer relevancy, context precision, and context recall.

In [None]:
questions = test["question"].to_list()

In [None]:
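answers = []
contexts = []

# Run every test question through the HyDE query engine and keep the
# response together with the text of the retrieved source nodes.
for question in questions:
    response_vector = hyde_query_engine.query(question)
    answers.append(response_vector)
    contexts.append([node.get_text() for node in response_vector.source_nodes])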

In [None]:
# Keep one ground-truth string per question. Wrapping each value in a list and
# then calling str() on that list would add extra brackets and quotes to the reference text.
ground_truths = [str(a) for a in test["ground_truth"]]

In [None]:
answers = [str(i) for i in answers]

In [None]:
datasample = {
    "question": questions,
    "contexts": contexts,
    "answer": answers,
    "ground_truth": ground_truths
}

dataset = Dataset.from_dict(datasample)
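Before spending API calls on evaluation, it can help to sanity-check the columns. The sketch below assumes the layout used in this notebook: one string per row for `question`, `answer`, and `ground_truth`, and a list of strings per row for `contexts`. `check_samples` is a hypothetical helper, not part of Ragas or `datasets`.

```python
def check_samples(sample):
    """Validate column lengths and value types; return the number of rows."""
    lengths = {name: len(column) for name, column in sample.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"columns have different lengths: {lengths}")
    for ctx in sample["contexts"]:
        if not isinstance(ctx, list) or not all(isinstance(c, str) for c in ctx):
            raise TypeError("each 'contexts' entry must be a list of strings")
    for name in ("question", "answer", "ground_truth"):
        if not all(isinstance(value, str) for value in sample[name]):
            raise TypeError(f"every '{name}' entry must be a string")
    return lengths["question"]

# Hypothetical one-row sample, for illustration only.
demo = {
    "question": ["What is the code?"],
    "contexts": [["The code is 123."]],
    "answer": ["123"],
    "ground_truth": ["123"],
}
print(check_samples(demo))
```

In the notebook this would be called as `check_samples(datasample)` right before `Dataset.from_dict`.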

In [None]:
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
#from ragas.metrics.critique import harmfulness

metrics = [
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
]

In [None]:
from ragas import evaluate

# raise_exceptions=False keeps the run going when a metric fails on a row.
result = evaluate(dataset, metrics=metrics, raise_exceptions=False)


In [None]:
rs = result.to_pandas()
rs
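Since `raise_exceptions=False` lets failed rows through, some scores may come back as NaN, and a plain average would then be NaN as well. The sketch below shows a NaN-aware per-metric mean in plain Python. The rows and the helper names are hypothetical; in the notebook, the rows could come from `rs.to_dict("records")`.

```python
import math

def nan_mean(values):
    """Mean of the values that are present and not NaN; NaN if none are valid."""
    valid = [v for v in values if v is not None and not math.isnan(v)]
    return sum(valid) / len(valid) if valid else float("nan")

def summarize(rows, metric_names):
    """Per-metric average over a list of row dicts, skipping missing scores."""
    return {name: nan_mean([row.get(name) for row in rows]) for name in metric_names}

# Hypothetical scores, for illustration only.
rows = [
    {"faithfulness": 1.0, "answer_relevancy": 0.5},
    {"faithfulness": float("nan"), "answer_relevancy": 1.0},
]
summary = summarize(rows, ["faithfulness", "answer_relevancy"])
print(summary)
```

It is also worth counting how many rows were skipped per metric, since an average over very few valid rows says little about the pipeline.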

In [None]:
rs.to_excel("test2.xlsx")