# Experiments with RAG

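In this notebook, I compare several retrieval-augmented generation (RAG) setups built with the **LlamaIndex** framework: a basic pipeline, a pipeline with a custom chunk size, and both of them combined with the HyDE query transform.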

In [None]:
import os
from pathlib import Path
from IPython.display import display, Markdown
from llama_index import SimpleDirectoryReader, VectorStoreIndex
from llama_index import ServiceContext, StorageContext
from llama_index import load_index_from_storage
from llama_index import OpenAIEmbedding
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser
from llama_index.indices.query.query_transform import HyDEQueryTransform
from llama_index.query_engine.transform_query_engine import TransformQueryEngine

import openai
import environ

env = environ.Env()
environ.Env.read_env()
API_KEY = env('OPENAI_API_KEY')
openai.api_key = API_KEY

In [44]:
# Set paths
docs_path = "documents_pdf"
embedding_path = "vector_db"

llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

In [45]:
param_dict = {
    "chunk_size": [256], # [256, 512, 1024]
    "top_k_results": [5] # [1, 3, 5]
}
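
The commented-out values above suggest a small grid search over chunk size and top-k. A minimal sketch of how such a grid could be enumerated is shown here; `param_grid` and `iter_param_combinations` are hypothetical names that do not appear elsewhere in this notebook, and the grid values are simply the ones from the comments.

```python
from itertools import product

# Hypothetical full grid, taken from the commented-out values in param_dict
param_grid = {
    "chunk_size": [256, 512, 1024],
    "top_k_results": [1, 3, 5],
}


def iter_param_combinations(grid):
    """Yield one dict per combination of the grid values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


combinations = list(iter_param_combinations(param_grid))
for params in combinations:
    # Each dict could be passed to an index-building and a query step
    print(params)
```

Each yielded dict holds one `chunk_size` and one `top_k_results` value, so a loop over `combinations` could build one index and one query engine per setting.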

# Basic RAG

In [46]:
# load documents
def load_docs(doc_path):
    docs = SimpleDirectoryReader(input_dir=doc_path).load_data()
    return docs

In [47]:
def create_vector_db(docs, llm, embed_model):
    """
    Build an index (a vector database) using the VectorStoreIndex class of LlamaIndex.

    Args:
        docs (list[Document]): Documents loaded with LlamaIndex, e.g. by load_docs().
        llm: LlamaIndex wrapper around the OpenAI chat models.
            Example: llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
        embed_model: LlamaIndex wrapper around the OpenAI embedding models.
            Example: embed_model = OpenAIEmbedding(model="text-embedding-ada-002")

    Returns:
        index (VectorStoreIndex): An object of the VectorStoreIndex class from 
            LlamaIndex to use to build a query engine.
    """
    
    # ----------------------------------------------------------------------------------
    # Define The ServiceContext: a bundle of commonly used resources used during
    # the indexing and querying stage in a LlamaIndex pipeline/application.
    service_context = ServiceContext.from_defaults(
        llm=llm,
        embed_model=embed_model
    )
    
    # ----------------------------------------------------------------------------------
    # Storage context: The storage context container is a utility container for storing 
    # nodes, indices, and vectors. It contains the following:
    # - docstore: BaseDocumentStore
    # - index_store: BaseIndexStore
    # - vector_store: VectorStore
    # - graph_store: GraphStore
    storage_context = StorageContext.from_defaults()

    # ----------------------------------------------------------------------------------
    # VectorStoreIndex: a data structure that allows for the retrieval of relevant context
    # for a user query. This is particularly useful for retrieval-augmented generation (RAG) use-cases.
    # VectorStoreIndex stores data in Node objects, which represent chunks of the original documents,
    # and exposes a Retriever interface that supports additional configuration and automation.
    print("Creating Vector Database ...")
    index = VectorStoreIndex.from_documents(
        documents=docs,
        service_context=service_context,
        storage_context=storage_context,
        show_progress=True
    )
    print("Done")

    return index

In [48]:
docs = load_docs(docs_path)

In [49]:
index = create_vector_db(docs, llm, embed_model)

Creating Vector Database ...


Parsing nodes: 100%|██████████| 774/774 [00:00<00:00, 2829.57it/s]
Generating embeddings: 100%|██████████| 778/778 [00:15<00:00, 50.26it/s]


Done


In [50]:
# User query
# user_query = "What percentage of my salary I get as unemployment benefit?"
user_query = """
I worked for six months in Germany.
How long will I get the unemployment benefit,
and what percentage of my salary I get?
""" 


# Build an object of the QueryEngine class
query_engine = index.as_query_engine(similarity_top_k=param_dict["top_k_results"][0])

# Query engine with base RAG
response = query_engine.query(user_query)
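
The `similarity_top_k` argument controls how many nodes the retriever hands to the LLM: nodes are ranked by the similarity between the query embedding and each node embedding, and only the best `k` are kept. The sketch below illustrates that ranking step in plain Python with cosine similarity, which as far as I know is the default metric of LlamaIndex's in-memory vector store. The three-dimensional vectors and the names `toy_nodes`, `toy_query`, and `top_k_nodes` are made up for illustration; real embeddings from `text-embedding-ada-002` have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_nodes(query_embedding, node_embeddings, k):
    """Return the k (node_id, score) pairs with the highest similarity."""
    scored = [
        (node_id, cosine_similarity(query_embedding, embedding))
        for node_id, embedding in node_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical values, for illustration only)
toy_nodes = {
    "node_a": [1.0, 0.0, 0.0],
    "node_b": [0.9, 0.1, 0.0],
    "node_c": [0.0, 1.0, 0.0],
}
toy_query = [1.0, 0.05, 0.0]

top_2 = top_k_nodes(toy_query, toy_nodes, k=2)
for node_id, score in top_2:
    print(f"{node_id}: {score:.4f}")
```

A larger `k` gives the LLM more context but also more loosely related text, which is why it is one of the two parameters varied in this notebook.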

In [51]:
def print_response(response):
    """Display the answer as Markdown, then list the file name and page of each source."""
    print("Response:")
    # print(response.response)
    display(Markdown(response.response))
    # response.metadata maps each source node ID to that node's metadata dict
    for i, node_metadata in enumerate(response.metadata.values()):
        print(f"Source {i}:")
        print(f"\tFile name: {node_metadata['file_name']}")
        print(f"\tPage: {node_metadata['page_label']}")

In [52]:
print_response(response)

Response:


The duration of unemployment benefits in Germany depends on various factors, such as your previous employment history and the length of time you have contributed to the social security system. Generally, you can receive unemployment benefits for up to 12 months if you have paid into the system for at least 12 months. However, if you have worked for six months in Germany, the specific duration and percentage of your salary you will receive as unemployment benefits would need to be determined based on your individual circumstances and the applicable regulations. It is recommended to contact the relevant authorities, such as the Federal Employment Agency (Bundesagentur für Arbeit), to get accurate and up-to-date information regarding your specific situation.

Source 0:
	File name: dok_ba013155.pdf
	Page: 26
Source 1:
	File name: dok_ba013155.pdf
	Page: 22
Source 2:
	File name: merkblatt-fuer-arbeitslose_ba036520.pdf
	Page: 26
Source 3:
	File name: dok_ba013155.pdf
	Page: 24
Source 4:
	File name: dok_ba013155.pdf
	Page: 20


# RAG - chunk size

In [53]:
# vector_db_path = f"{embedding_path}/vector_db_1"
# print(f"Path: {vector_db_path}")
# print(f"Path exist: {os.path.exists(vector_db_path)}")

In [54]:
def build_index_from_file(folder_with_index):
    """
    Rebuild the storage context from a persisted vector database and load the index.

    Args:
        folder_with_index (str): Folder where the vector database is.

    Returns:
        index (VectorStoreIndex): An object of the VectorStoreIndex class from 
            LlamaIndex to use to build a query engine.
    """
    # rebuild storage context
    storage_context = StorageContext.from_defaults(persist_dir=folder_with_index)
    # load index
    index = load_index_from_storage(storage_context)
    return index

In [55]:
def create_vector_db_chunksize(docs, vector_db_folder, llm, embed_model, chunk_size, chunk_overlap=0):
    """
    Build an index (a vector database) using the VectorStoreIndex class of LlamaIndex,
    or load it from disk if an index for this chunk size was already saved.

    Args:
        docs (list[Document]): Documents loaded with LlamaIndex, e.g. by load_docs().
        vector_db_folder (str): Folder in which to save, or from which to load, the index.
        llm: LlamaIndex wrapper around the OpenAI chat models.
            Example: llm = OpenAI(temperature=0, model="gpt-3.5-turbo")
        embed_model: LlamaIndex wrapper around the OpenAI embedding models.
            Example: embed_model = OpenAIEmbedding(model="text-embedding-ada-002")
        chunk_size (int): Chunk size used to split the documents into nodes.
        chunk_overlap (int): Overlap between consecutive chunks (default=0).
            Currently not passed to the node parser, which uses its library default.

    Returns:
        index (VectorStoreIndex): An object of the VectorStoreIndex class from 
            LlamaIndex to use to build a query engine.
    """

    # Set the path to save the vector db
    vector_db_path = f"{vector_db_folder}/vector_db_{chunk_size}"
    
    if not os.path.exists(vector_db_path):
        # Create the folder that contains the vector database,
        # including any missing parent folder (e.g. vector_db_folder itself)
        Path(vector_db_path).mkdir(parents=True, exist_ok=True)

        # ----------------------------------------------------------------------------------
        # Define SimpleNodeParser: a node parser that splits the loaded documents
        # into Nodes (chunks) of at most chunk_size tokens.
        node_parser = SimpleNodeParser.from_defaults(
            chunk_size=chunk_size,
            #chunk_overlap=chunk_overlap,
        )
        nodes = node_parser.get_nodes_from_documents(docs)
    
        # ----------------------------------------------------------------------------------
        # Define The ServiceContext: A bundle of commonly used resources used during
        # the indexing and querying stage in a LlamaIndex pipeline/application.
        service_context = ServiceContext.from_defaults(
            llm=llm,
            embed_model=embed_model
        )
    
        # ----------------------------------------------------------------------------------
        # Storage context: The storage context container is a utility container for storing 
        # nodes, indices, and vectors. It contains the following:
        # - docstore: BaseDocumentStore
        # - index_store: BaseIndexStore
        # - vector_store: VectorStore
        # - graph_store: GraphStore
        storage_context = StorageContext.from_defaults()

        # ----------------------------------------------------------------------------------
        # VectorStoreIndex: a data structure that allows for the retrieval of relevant context
        # for a user query. This is particularly useful for retrieval-augmented generation (RAG) use-cases.
        # VectorStoreIndex stores data in Node objects, which represent chunks of the original documents,
        # and exposes a Retriever interface that supports additional configuration and automation.
        print("Creating Vector Database ...")
        index = VectorStoreIndex(
            nodes=nodes,
            service_context=service_context,
            storage_context=storage_context,
            show_progress=True
        )

        print("Saving Vector Database ...")
        index.storage_context.persist(persist_dir=vector_db_path)
        print("Done")
        
    else:
        index = build_index_from_file(vector_db_path)


    return index
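
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a deliberately simplified splitter that works on a list of words. It is only an illustration under that assumption: the LlamaIndex node parser counts tokens rather than words and tries to respect sentence boundaries, and `split_into_chunks` and `toy_words` are hypothetical names.

```python
def split_into_chunks(words, chunk_size, chunk_overlap=0):
    """Split a list of words into windows of chunk_size words.

    Consecutive windows share chunk_overlap words.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks


toy_words = [f"w{i}" for i in range(10)]

no_overlap = split_into_chunks(toy_words, chunk_size=4)
with_overlap = split_into_chunks(toy_words, chunk_size=4, chunk_overlap=2)

print(len(no_overlap), "chunks without overlap")
print(len(with_overlap), "chunks with an overlap of 2")
```

Smaller chunks give more, shorter nodes, so each retrieved node is more focused but carries less surrounding context; overlap reduces the risk that a relevant sentence is cut at a chunk boundary.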

In [56]:
index_chunk_size = create_vector_db_chunksize(
    docs=docs,
    vector_db_folder=embedding_path,
    llm=llm,
    embed_model=embed_model,
    chunk_size=param_dict["chunk_size"][0]
)

In [57]:
# Build an object of the QueryEngine class
query_engine_chunk_size = index_chunk_size.as_query_engine(similarity_top_k=param_dict["top_k_results"][0])

# Query the index built with the custom chunk size
response_chunk_size = query_engine_chunk_size.query(user_query)

In [58]:
print_response(response_chunk_size)

Response:


You will be eligible for unemployment benefits in Germany for a maximum duration of six months. The percentage of your salary that you will receive as unemployment benefits is not mentioned in the provided context information.

Source 0:
	File name: dok_ba013155.pdf
	Page: 22
Source 1:
	File name: dok_ba013155.pdf
	Page: 22
Source 2:
	File name: merkblatt-fuer-arbeitslose_ba036520.pdf
	Page: 26
Source 3:
	File name: dok_ba013155.pdf
	Page: 24
Source 4:
	File name: dok_ba013155.pdf
	Page: 26


# HyDE

## Basic query engine

In [59]:
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response_HyDE = hyde_query_engine.query(user_query)
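
HyDE (Hypothetical Document Embeddings) first asks the LLM to write a hypothetical answer to the question and then retrieves with the embedding of that answer rather than the embedding of the raw question. With `include_original=True` the original query is embedded as well; as far as I understand, LlamaIndex then averages the two embeddings. The sketch below mimics that idea with a toy bag-of-words embedding and a hard-coded stand-in for the LLM, so every name in it (`toy_embed`, `fake_llm_hypothetical_answer`, `hyde_query_embedding`, the vocabulary, and the chunks) is an assumption made for illustration.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Bag-of-words vector: one word count per vocabulary entry."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def mean_embedding(embeddings):
    """Element-wise mean of several equal-length vectors."""
    return [sum(values) / len(embeddings) for values in zip(*embeddings)]


def cosine_similarity(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def fake_llm_hypothetical_answer(query):
    """Stand-in for the LLM call that writes a hypothetical answer."""
    return "unemployment benefit is paid for several months as a percentage of salary"


def hyde_query_embedding(query, vocabulary, include_original=True):
    """Embed the hypothetical answer, optionally averaged with the query."""
    texts = [fake_llm_hypothetical_answer(query)]
    if include_original:
        texts.append(query)
    return mean_embedding([toy_embed(text, vocabulary) for text in texts])


vocabulary = ["unemployment", "benefit", "months", "percentage", "salary", "holiday"]
toy_chunks = {
    "chunk_benefit": "unemployment benefit percentage of salary paid for months",
    "chunk_holiday": "holiday entitlement and holiday pay",
}
toy_query = "how long will i get money after losing my job"

plain_embedding = toy_embed(toy_query, vocabulary)
hyde_embedding = hyde_query_embedding(toy_query, vocabulary)

plain_scores = {
    name: cosine_similarity(plain_embedding, toy_embed(text, vocabulary))
    for name, text in toy_chunks.items()
}
hyde_scores = {
    name: cosine_similarity(hyde_embedding, toy_embed(text, vocabulary))
    for name, text in toy_chunks.items()
}
print("plain:", plain_scores)
print("HyDE: ", hyde_scores)
```

In this toy setup the question shares no vocabulary with the relevant chunk, so the plain query cannot rank it, while the hypothetical answer uses the same terms as the document. Real embedding models are far less brittle than word counts, but the motivation is the same: an answer-shaped text tends to lie closer to the source passages than a question does.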

In [60]:
print_response(response_HyDE)

Response:


The duration of your unemployment benefit and the percentage of your salary that you will receive will depend on various factors, such as the length of your employment and your age. Based on the provided context information, it states that if you have worked for at least 12 months within the last 30 months, you may be eligible for a certain duration of unemployment benefit. However, the specific details regarding the duration and percentage cannot be determined without further information. It is recommended to consult the relevant authorities or refer to the specific regulations and guidelines in your country for accurate information regarding your eligibility and entitlements.

Source 0:
	File name: dok_ba013155.pdf
	Page: 26
Source 1:
	File name: dok_ba013155.pdf
	Page: 22
Source 2:
	File name: merkblatt-fuer-arbeitslose_ba036520.pdf
	Page: 26
Source 3:
	File name: merkblatt-fuer-arbeitslose_ba036520.pdf
	Page: 33
Source 4:
	File name: merkblatt-fuer-arbeitslose_ba036520.pdf
	Page: 32


## RAG - chunk size

In [61]:
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine_chunk_size = TransformQueryEngine(query_engine_chunk_size, hyde)
response_HyDE_chunk_size = hyde_query_engine_chunk_size.query(user_query)

In [62]:
print_response(response_HyDE_chunk_size)

Response:


You will be eligible for unemployment benefits in Germany if you meet certain criteria. According to the provided information, if you become unemployed in Germany and want to search for work in another EU member state, you can take your entitlement to German unemployment benefits for a period of three months (referred to as the "Mitnahmezeitraum"). This period can be extended for up to a maximum of six months for the purpose of job search. The specific percentage of your salary that you will receive as unemployment benefits is not mentioned in the given context.

Source 0:
	File name: dok_ba013155.pdf
	Page: 22
Source 1:
	File name: dok_ba013155.pdf
	Page: 22
Source 2:
	File name: merkblatt-fuer-arbeitslose_ba036520.pdf
	Page: 26
Source 3:
	File name: dok_ba013155.pdf
	Page: 32
Source 4:
	File name: dok_ba013155.pdf
	Page: 24


# Scores

In [71]:
def print_score(response):
    """Return the similarity score of each retrieved source node, in retrieval order."""
    return [node_with_score.score for node_with_score in response.source_nodes]

In [72]:
print_score(response)

[0.8520634121940881,
 0.8495681877333476,
 0.8470285279883525,
 0.8453798820266532,
 0.8451710640669897]

In [73]:
print_score(response_chunk_size)

[0.8572357545733745,
 0.8570144623198696,
 0.8527991402582654,
 0.851910417604173,
 0.8468197305380054]

In [74]:
print_score(response_HyDE)

[0.8843449567325018,
 0.8833562839707858,
 0.8820011889899166,
 0.8779475314934001,
 0.8776750187342072]

In [75]:
print_score(response_HyDE_chunk_size)

[0.8892044539961814,
 0.8842722173806833,
 0.8828456505031381,
 0.8752202744227146,
 0.8749440691654466]
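
To compare the four runs at a glance, the scores printed above can be summarized by their mean, maximum, and minimum. The sketch below copies the recorded values into a plain dictionary; the labels and the names `recorded_scores` and `summary` are my own and are not produced by LlamaIndex.

```python
from statistics import mean

# Similarity scores copied from the four outputs above
recorded_scores = {
    "basic": [
        0.8520634121940881, 0.8495681877333476, 0.8470285279883525,
        0.8453798820266532, 0.8451710640669897,
    ],
    "chunk size": [
        0.8572357545733745, 0.8570144623198696, 0.8527991402582654,
        0.851910417604173, 0.8468197305380054,
    ],
    "HyDE + basic": [
        0.8843449567325018, 0.8833562839707858, 0.8820011889899166,
        0.8779475314934001, 0.8776750187342072,
    ],
    "HyDE + chunk size": [
        0.8892044539961814, 0.8842722173806833, 0.8828456505031381,
        0.8752202744227146, 0.8749440691654466,
    ],
}

summary = {
    name: {"mean": mean(scores), "max": max(scores), "min": min(scores)}
    for name, scores in recorded_scores.items()
}

for name, stats in summary.items():
    print(
        f"{name:<18} mean={stats['mean']:.4f} "
        f"max={stats['max']:.4f} min={stats['min']:.4f}"
    )
```

Both HyDE runs score higher than their non-HyDE counterparts. That should be read with care: with HyDE the similarity is computed against a different query embedding, so a higher score shows that the retrieved chunks are closer to the hypothetical answer, not necessarily that the final response is better.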