# Experiment: Evaluating RAG Systems

In this notebook, I test the `RagDatasetGenerator` and `RagEvaluatorPack` classes to evaluate a RAG system.

As the external document for the RAG system, I use the paper
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding"
(only the first 2 pages).

In [None]:
import openai
import environ
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index import ServiceContext, StorageContext
from llama_index import OpenAIEmbedding
from llama_index.llms import OpenAI
from llama_index.llama_dataset.generator import RagDatasetGenerator # "RagDatasetGenerator" requires the library "spacy"
from llama_index.llama_pack import download_llama_pack
import nest_asyncio

env = environ.Env()
environ.Env.read_env()
API_KEY = env('OPENAI_API_KEY')
openai.api_key = API_KEY

RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")

nest_asyncio.apply()

In [2]:
# Set paths
docs_path = "documents_pdf"

# Set the LLM and the embedding model
llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

# Functions

In [3]:
def load_docs(doc_path):
    """Load the files found in `doc_path` as a list of LlamaIndex Document objects."""
    docs = SimpleDirectoryReader(input_dir=doc_path).load_data()
    return docs

In [4]:
def create_vector_db(docs, llm, embed_model):
    """
    Build an index (a vector database) using the VectorStoreIndex class of LlamaIndex.

    Args:
        docs (list[Document]): LlamaIndex Document objects to index.
        llm: LlamaIndex wrapper for the OpenAI chat models.
            Example: llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
        embed_model: LlamaIndex wrapper for the OpenAI embedding models.
            Example: embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

    Returns:
        index (VectorStoreIndex): An object of the VectorStoreIndex class from
            LlamaIndex, used to build a query engine.
    """
    
    # ----------------------------------------------------------------------------------
    # Define The ServiceContext: a bundle of commonly used resources used during
    # the indexing and querying stage in a LlamaIndex pipeline/application.
    service_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model
    )
    
    # ----------------------------------------------------------------------------------
    # Storage context: The storage context container is a utility container for storing 
    # nodes, indices, and vectors. It contains the following:
    # - docstore: BaseDocumentStore
    # - index_store: BaseIndexStore
    # - vector_store: VectorStore
    # - graph_store: GraphStore
    storage_context = StorageContext.from_defaults()

    # ----------------------------------------------------------------------------------
    # VectorStoreIndex: a data structure that allows for the retrieval of relevant context
    # for a user query. This is particularly useful for retrieval-augmented generation (RAG) use-cases.
    # VectorStoreIndex stores data in Node objects, which represent chunks of the original documents,
    # and exposes a Retriever interface that supports additional configuration and automation.
    print("Creating Vector Database ...")
    index = VectorStoreIndex.from_documents(
        documents=docs,
        service_context=service_context,
        storage_context=storage_context,
        show_progress=True
    )
    print("Done")

    return index

# Query engine

In [5]:
docs_all = load_docs(docs_path)

In [6]:
len(docs_all)

16

Since the goal of this notebook is not to evaluate the RAG system itself but to build an example of a RAG evaluation pipeline,
I select only 2 pages from the paper.

In [7]:
docs = docs_all[0:2]
len(docs)

2

In [8]:
index = create_vector_db(docs, llm, embed_model)

Creating Vector Database ...


Parsing nodes: 100%|██████████| 2/2 [00:00<00:00, 559.43it/s]
Generating embeddings: 100%|██████████| 3/3 [00:00<00:00,  5.08it/s]

Done





In [9]:
query_engine = index.as_query_engine(similarity_top_k=3)
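
To make `similarity_top_k` more concrete, here is a toy sketch of top-k retrieval by cosine similarity. It is not LlamaIndex's implementation: the function names and the 2-d vectors are hypothetical stand-ins for real embedding vectors, and it only illustrates the idea of ranking nodes against a query and keeping the best `k`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_nodes(query_vec, node_vecs, k=3):
    """Return (node_id, score) pairs for the k nodes most similar to the query."""
    scored = [
        (node_id, cosine_similarity(query_vec, vec))
        for node_id, vec in node_vecs.items()
    ]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical 2-d "embeddings" (real embedding vectors are much longer)
toy_nodes = {
    "node_a": [1.0, 0.0],
    "node_b": [0.0, 1.0],
    "node_c": [1.0, 1.0],
    "node_d": [-1.0, 0.0],
}
toy_query = [1.0, 0.2]

top_nodes = top_k_nodes(toy_query, toy_nodes, k=3)
for node_id, score in top_nodes:
    print(node_id, round(score, 3))
```

Note that the embedding step above processed only 3 items, so with `similarity_top_k=3` the query engine in this notebook most likely passes every node to the LLM.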

# Evaluate RAG

In [10]:
service_context_rag_dataset = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model
)

In [11]:
# "RagDatasetGenerator" requires the library "spacy"
dataset_generator = RagDatasetGenerator.from_documents(
    documents=docs,
    service_context=service_context_rag_dataset,
    # number of questions to generate per node
    # (each document is split into chunks; the chunk size depends on the service context's node parser)
    num_questions_per_chunk=3, # 10,  
    show_progress=True
)

rag_dataset = dataset_generator.generate_dataset_from_nodes()

Parsing nodes: 100%|██████████| 2/2 [00:00<00:00, 527.15it/s]
100%|██████████| 3/3 [00:05<00:00,  1.81s/it]
100%|██████████| 3/3 [00:08<00:00,  2.85s/it]
100%|██████████| 3/3 [00:03<00:00,  1.25s/it]
100%|██████████| 3/3 [00:12<00:00,  4.13s/it]


In [12]:
# with 3 nodes and 3 questions per node, there should be a total of 9 questions
len(dataset_generator.nodes)

3

In [13]:
len(rag_dataset.dict()["examples"])

9

In [14]:
rag_dataset[0].query

'What is the purpose of BERT in natural language processing tasks?'

In [15]:
rag_dataset[0].reference_contexts

['BERT: Pre-training of Deep Bidirectional Transformers for\nLanguage Understanding\nJacob Devlin Ming-Wei Chang Kenton Lee Kristina Toutanova\nGoogle AI Language\n{jacobdevlin,mingweichang,kentonl,kristout }@google.com\nAbstract\nWe introduce a new language representa-\ntion model called BERT , which stands for\nBidirectional Encoder Representations from\nTransformers. Unlike recent language repre-\nsentation models (Peters et al., 2018a; Rad-\nford et al., 2018), BERT is designed to pre-\ntrain deep bidirectional representations from\nunlabeled text by jointly conditioning on both\nleft and right context in all layers. As a re-\nsult, the pre-trained BERT model can be ﬁne-\ntuned with just one additional output layer\nto create state-of-the-art models for a wide\nrange of tasks, such as question answering and\nlanguage inference, without substantial task-\nspeciﬁc architecture modiﬁcations.\nBERT is conceptually simple and empirically\npowerful. It obtains new state-of-the-art re-\ns

In [16]:
rag_dataset[0].reference_answer

'The purpose of BERT in natural language processing tasks is to pretrain deep bidirectional representations from unlabeled text by conditioning on both left and right context in all layers. This allows the pre-trained BERT model to be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.'

In [17]:
rag_dataset.to_pandas().head()

| | query | reference_contexts | reference_answer | reference_answer_by | query_by |
|---|---|---|---|---|---|
| 0 | What is the purpose of BERT in natural languag... | [BERT: Pre-training of Deep Bidirectional Tran... | The purpose of BERT in natural language proces... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 1 | How does BERT improve the limitations of previ... | [BERT: Pre-training of Deep Bidirectional Tran... | BERT improves the limitations of previous lang... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 2 | What is the masked language model pre-training... | [BERT: Pre-training of Deep Bidirectional Tran... | The masked language model (MLM) pre-training o... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 3 | What is the main objective of BERT's pre-train... | [word based only on its context. Unlike left-t... | The main objective of BERT's pre-training appr... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |
| 4 | How does BERT's pre-trained representations re... | [word based only on its context. Unlike left-t... | BERT's pre-trained representations reduce the ... | ai (gpt-3.5-turbo) | ai (gpt-3.5-turbo) |


The number of question-answer pairs evaluated by `RagEvaluatorPack` is

N_nodes * num_questions_per_chunk

(here 3 * 3 = 9), processed with a `batch_size` of 10 (default).
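
As a quick sanity check of the formula above, the expected dataset size can be computed and compared with the length of the generated dataset. This is a minimal sketch: `expected_num_examples` is a hypothetical helper, not part of LlamaIndex, and the values are the ones observed earlier in this notebook.

```python
def expected_num_examples(n_nodes: int, num_questions_per_chunk: int) -> int:
    """Expected number of question-answer pairs in the generated dataset."""
    return n_nodes * num_questions_per_chunk


# Values observed in this notebook: 3 nodes, 3 questions per node
n_examples = expected_num_examples(n_nodes=3, num_questions_per_chunk=3)
print(n_examples)
```

In the notebook itself, the same check could be written with `len(dataset_generator.nodes)` and `len(rag_dataset.dict()["examples"])`, which returned 3 and 9 above.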

The default judge model in `RagEvaluatorPack` is `OpenAI(temperature=0, model="gpt-4-1106-preview")`,
which consumes many more credits than `OpenAI(temperature=0, model="gpt-3.5-turbo")`.

In [18]:
rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,  # built with the same source documents as the rag_dataset
    rag_dataset=rag_dataset,
    judge_llm=OpenAI(temperature=0, model="gpt-3.5-turbo") # default: OpenAI(temperature=0, model="gpt-4-1106-preview")
)

Note that `rag_evaluator.run()` saves 2 files in the directory from which the pack was invoked:
- benchmark.csv (CSV format of the benchmark scores)
- _evaluations.json (raw evaluation results for all examples & predictions)
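
The two files can be read back later without rerunning the evaluation. The sketch below uses only the standard library and assumes nothing about the internal layout of either file beyond their formats (CSV and JSON). `load_eval_outputs` is a hypothetical helper name, and the default directory `"."` is an assumption about where the notebook was run.

```python
import csv
import json
from pathlib import Path


def load_eval_outputs(directory="."):
    """Load the two files written by the evaluator run, if present.

    Returns (benchmark_rows, evaluations); each is None when its file is missing.
    """
    base = Path(directory)
    benchmark_path = base / "benchmark.csv"
    evaluations_path = base / "_evaluations.json"

    benchmark_rows = None
    if benchmark_path.exists():
        with benchmark_path.open(newline="", encoding="utf-8") as f:
            # Raw rows as lists of strings; no assumption about the header layout
            benchmark_rows = list(csv.reader(f))

    evaluations = None
    if evaluations_path.exists():
        with evaluations_path.open(encoding="utf-8") as f:
            evaluations = json.load(f)

    return benchmark_rows, evaluations


benchmark_rows, evaluations = load_eval_outputs(".")
print(benchmark_rows is not None, evaluations is not None)
```

Returning `None` for a missing file keeps the helper safe to call before the evaluation has been run.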

In [19]:
# Note: `benchmark_df = await rag_evaluator.run()` fails here with
# TypeError: object DataFrame can't be used in "await" expression
benchmark_df = rag_evaluator.run()

100%|██████████| 9/9 [01:08<00:00,  7.61s/it]
2it [00:16,  8.13s/it]
2it [00:14,  7.37s/it]
2it [00:15,  7.60s/it]
2it [00:15,  7.98s/it]
1it [00:10, 10.24s/it]


In [20]:
benchmark_df

| metrics | base_rag |
|---|---|
| mean_correctness_score | 4.333333 |
| mean_relevancy_score | 1.0 |
| mean_faithfulness_score | 1.0 |
| mean_context_similarity_score | 0.969672 |
