# Retrieval-Augmented Generation (RAG) System Using LlamaIndex

This project implements a Retrieval-Augmented Generation (RAG) system that retrieves relevant information from a PDF document and generates responses to user queries using LlamaIndex.

## Key Components:


*   Document Processing: A PDF file (transformers.pdf) is loaded with LlamaIndex’s SimpleDirectoryReader, which returns one document per page.
*   Embedding Model: Instead of OpenAI embeddings, the project uses a HuggingFace embedding model "BAAI/bge-small-en" for document vectorization.
*   Vector Store Index: The indexed documents are stored in a VectorStoreIndex, allowing efficient retrieval of relevant passages.
*   Custom LLM Wrapper: A local GPT4All model is wrapped in a subclass of CustomLLM so that LlamaIndex can call it like any other LLM.
*   Query Engine: A retriever-based query engine (RetrieverQueryEngine) is created, which first retrieves the most relevant document fragments and then passes them to the LLM for response generation.
*   Response Generation: The retrieved text is processed by the GPT4All model, providing meaningful answers based on the document contents.

## Workflow:

1.  Load & Preprocess Documents: The PDF file is read and converted into a list of documents.
2.  Embed & Store: Documents are vectorized using the HuggingFace embedding model "BAAI/bge-small-en" and stored in a VectorStoreIndex.
3.  Retrieve & Generate: When a user submits a query, the system retrieves relevant context and generates a response using the GPT4All model.
4.  Display Results: The final response is printed for the user.
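
The workflow above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the notebook's implementation: the bag-of-words `embed` function stands in for the "BAAI/bge-small-en" model, `generate` is a stub in place of GPT4All, and every function name and the prompt layout are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': counts of lowercase alphabetic words (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, top_k=2):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, context_chunks):
    """Place the retrieved context ahead of the question (hypothetical layout)."""
    context = "\n\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def generate(prompt):
    """Stub for the local LLM call; a real system would run the model here."""
    return f"[model output for a prompt of {len(prompt)} characters]"


chunks = [
    "Self-attention relates different positions of a single sequence.",
    "The model was trained on eight GPUs.",
    "Multi-head attention runs several attention layers in parallel.",
]
query = "What is self-attention?"
top = retrieve(query, chunks, top_k=1)   # steps 1-3: load, embed, retrieve
prompt = build_prompt(query, top)        # retrieved context goes into the prompt
answer = generate(prompt)                # step 3: generate
print(answer)                            # step 4: display
```

The point of the sketch is the order of operations: the language model never sees the whole PDF, only the few chunks that the retrieval step ranks highest for the query.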


This system enables offline, cost-efficient, and private retrieval-augmented generation without relying on cloud-based models like OpenAI’s GPT-4. However, response generation with the local model is slow.

# References

This project is based on the course "Building Your first RAG System using LlamaIndex" by Prashant Sahu / Analytics Vidhya, which is available at the following [link](https://courses.analyticsvidhya.com/courses/building-your-first-rag-system-free-course).



## Setup

In [None]:
# Mounting to Google Drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
%cd "YOUR-PATH-HERE"

In [None]:
%%capture
!pip install llama-index llama-index-embeddings-huggingface gpt4all llama-index-readers-file

In [None]:
import os
import llama_index
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex
from llama_index.core import global_handler
from gpt4all import GPT4All

# Data Loader

In [None]:
documents = SimpleDirectoryReader(input_files=['data/transformers.pdf']).load_data()

In [None]:
# Check the datatype and length of the loaded documents
type(documents)

list

In [None]:
# total number of pages read from the PDF
len(documents)

15

In [None]:
# Retrieve the first document (the first page in the PDF)
documents[0]

Document(id_='39d0aea2-7d30-4cc3-b910-5774f5762dd9', embedding=None, metadata={'page_label': '1', 'file_name': 'transformers.pdf', 'file_path': 'data/transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2025-02-07', 'last_modified_date': '2024-06-27'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.com\nNoam Shazeer∗\nGoogle Brain\nnoam@google.com\nNiki P

In [None]:
# Get the ID of the first document
documents[0].id_

'39d0aea2-7d30-4cc3-b910-5774f5762dd9'

In [None]:
documents[0].doc_id

'39d0aea2-7d30-4cc3-b910-5774f5762dd9'

In [None]:
# Get the metadata of the first document
documents[0].metadata

{'page_label': '1',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2025-02-07',
 'last_modified_date': '2024-06-27'}

In [None]:
# Get the text content of the first document
print(documents[0].text)

Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirely. Exp

## Embedding Model

Next, we set up the embedding model that converts document text into vectors for retrieval. This notebook uses `HuggingFaceEmbedding` with the "BAAI/bge-small-en" model.

In [None]:
# Initialize the embedding model (a local HuggingFace model instead of OpenAI embeddings)
# embed_model = OpenAIEmbedding(model="text-embedding-3-small")  # or "text-embedding-3-large"
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
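
To make the role of the embedding model concrete: it maps each piece of text to a fixed-length vector, and retrieval then compares the query vector with the stored vectors. Cosine similarity is a common choice for that comparison. The sketch below uses hand-written three-dimensional vectors as stand-ins for real model output; the vectors and the function name `cosine_similarity` are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up vectors: passage_a points in nearly the same direction as the query,
# passage_b points mostly along a different axis.
query_vec = [0.9, 0.1, 0.0]
passage_a = [0.8, 0.2, 0.1]
passage_b = [0.0, 0.1, 0.9]

sim_a = cosine_similarity(query_vec, passage_a)
sim_b = cosine_similarity(query_vec, passage_b)
print(sim_a > sim_b)
```

A passage whose vector points in a similar direction to the query vector scores close to 1, while an unrelated passage scores close to 0.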


## LLM

Similarly, let's set up our large language model (LLM):

In [None]:
%%capture
!pip install --upgrade gpt4all

In [None]:
# Initialize the large language model
from gpt4all import GPT4All

model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")

Downloading: 100%|██████████| 4.66G/4.66G [02:59<00:00, 26.0MiB/s]
Verifying: 100%|██████████| 4.66G/4.66G [00:08<00:00, 525MiB/s]


# Indexing

Here, we use the `VectorStoreIndex` class to create an index from the loaded documents. We pass the documents and the embedding model to the `from_documents` method, which splits the documents into chunks and embeds each chunk.

In [None]:
# Create an index from the documents using the embedding model
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
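
Before embedding, long documents are usually split into smaller chunks so that each vector covers a focused passage. The following is a minimal sketch of fixed-size chunking with overlap. It is a simplification: it splits on whitespace-separated words, whereas LlamaIndex's own splitter works with tokens and sentence boundaries, and the function name and the `chunk_size` and `overlap` values are hypothetical.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of up to chunk_size words, repeating `overlap` words between neighbours."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


# Twenty placeholder words: w0 w1 ... w19
page_text = " ".join(f"w{i}" for i in range(20))
chunks = chunk_words(page_text, chunk_size=8, overlap=2)
for chunk in chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears intact in at least one chunk, at the cost of storing some text twice.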

# Retrieval

Next, we set up a retriever to query our indexed documents. This allows us to retrieve the passages most relevant to a query.

In [None]:
# Setting up the Index as Retriever
retriever = index.as_retriever()

The `as_retriever` method converts our index into a retriever, and the `retrieve` method allows us to query the index.

In [None]:
# Retrieve information based on the user query
retrieved_nodes = retriever.retrieve("What is self attention?")
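
Conceptually, retrieval scores every stored chunk against the query and keeps only the best few. The sketch below shows that selection step on its own. The chunk names and scores are invented for illustration and are not taken from this notebook's run. In LlamaIndex the number of results is typically set with the `similarity_top_k` argument of `as_retriever`, which the original notebook leaves at its default.

```python
import heapq


def top_k(scored_chunks, k=2):
    """Return the k (chunk, score) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[1])


# Invented similarity scores, deliberately listed out of order.
scored = [
    ("chunk-c", 0.40),
    ("chunk-a", 0.81),
    ("chunk-d", 0.35),
    ("chunk-b", 0.79),
]
best = top_k(scored, k=2)
for name, score in best:
    print(name, score)
```

A small `k` keeps the prompt short for a slow local model, but it also means a single poorly matched chunk takes up a large share of the context.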

We can check the metadata of the retrieved nodes to see where the information came from. The metadata includes the page label, file name, file path, file type, and other details:

In [None]:
# Get the metadata of the first retrieved node
retrieved_nodes[0].metadata

{'page_label': '13',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2025-02-07',
 'last_modified_date': '2024-06-27'}

Let's access the ID of the first retrieved node, which uniquely identifies it:

In [None]:
# Access the ID of the first retrieved node
retrieved_nodes[0].id_

'3fcc8353-e2bf-4663-bde8-f586b7554000'

Similarly, we can access the node_id attribute, which typically holds the same value:

In [None]:
# Access the node_id of the first retrieved node
retrieved_nodes[0].node_id

'3fcc8353-e2bf-4663-bde8-f586b7554000'

Next, let's explore the `node` attribute of the retrieved node. This attribute contains a `TextNode` object, which holds the node's metadata, its text content, and its relationship to the source document:

In [None]:
# Access the full node object of the first retrieved node
retrieved_nodes[0].node

TextNode(id_='3fcc8353-e2bf-4663-bde8-f586b7554000', embedding=None, metadata={'page_label': '13', 'file_name': 'transformers.pdf', 'file_path': 'data/transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2025-02-07', 'last_modified_date': '2024-06-27'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='949d8c2e-107c-43c3-a5ba-f055f85c6e28', node_type='4', metadata={'page_label': '13', 'file_name': 'transformers.pdf', 'file_path': 'data/transformers.pdf', 'file_type': 'application/pdf', 'file_size': 2215244, 'creation_date': '2025-02-07', 'last_modified_date': '2024-06-27'}, hash='5f7399351bbf395997212d8ae6ef4d7fca6ede3d75c622a2c373312e6479ba5e')}, metadata_template='{key}

We can also extract and inspect the text content of this node to understand the retrieved information better:

In [None]:
# Access the text content of the first retrieved node
print(retrieved_nodes[0].text)

Attention Visualizations
Input-Input Layer5
[Figure token labels, one per line in the raw output, shown twice: It is in this spirit that a majority of American governments have passed new laws since 2009 making the registration or voting process more difficult . <EOS> <pad> <pad> <pad> <pad> <pad> <pad>]
Figure 3: An example of the attention mechanism following long-distance dependencies in the
encoder self-attention in layer 5 of 6. Many of the attention heads attend to a distant dependency of
the verb ‘making’, completing the phrase ‘making...more difficult’. Attentions here shown only for
the word ‘making’. Different colors represent different heads. Best viewed in color.
13


In [None]:
retrieved_nodes[1].metadata

{'page_label': '5',
 'file_name': 'transformers.pdf',
 'file_path': 'data/transformers.pdf',
 'file_type': 'application/pdf',
 'file_size': 2215244,
 'creation_date': '2025-02-07',
 'last_modified_date': '2024-06-27'}

In [None]:
print(retrieved_nodes[1].text)

output values. These are concatenated and once again projected, resulting in the final values, as
depicted in Figure 2.
Multi-head attention allows the model to jointly attend to information from different representation
subspaces at different positions. With a single attention head, averaging inhibits this.
MultiHead(Q, K, V) = Concat(head1, ...,headh)WO
where headi = Attention(QWQ
i , KWK
i , V WV
i )
Where the projections are parameter matricesWQ
i ∈ Rdmodel×dk , WK
i ∈ Rdmodel×dk , WV
i ∈ Rdmodel×dv
and WO ∈ Rhdv×dmodel .
In this work we employ h = 8 parallel attention layers, or heads. For each of these we use
dk = dv = dmodel/h = 64. Due to the reduced dimension of each head, the total computational cost
is similar to that of single-head attention with full dimensionality.
3.2.3 Applications of Attention in our Model
The Transformer uses multi-head attention in three different ways:
• In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and the

# Query Engine and Generating Response

Next, we set up a query engine. This engine will allow us to query our indexed documents and receive synthesized responses from the LLM:

In [None]:
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.llms import (
    CompletionResponse,
    CompletionResponseGen,
    CustomLLM,
    LLMMetadata,
)
from llama_index.core.llms.callbacks import llm_completion_callback

# Load GPT4All Local Model
gpt4all_model = GPT4All(model_name="Meta-Llama-3-8B-Instruct.Q4_0.gguf")

# Define a Custom LLM Wrapper for GPT4All
class GPT4AllCustomLLM(CustomLLM):
    @property
    def metadata(self) -> LLMMetadata:
        """Define metadata (e.g., model name, context length)."""
        return LLMMetadata(
            model_name="GPT4All",
            context_window=2048,
            num_output=256,
            is_chat_model=False,
            is_function_calling_model=False,
        )

    @llm_completion_callback()
    def complete(self, prompt: str, **kwargs) -> CompletionResponse:
        """Generate a response using GPT4All."""
        response = gpt4all_model.generate(prompt)
        # LlamaIndex expects a CompletionResponse, which exposes the text as `.text`
        return CompletionResponse(text=response)

    @llm_completion_callback()
    def stream_complete(self, prompt: str, **kwargs) -> CompletionResponseGen:
        """Not implemented, as streaming is not used in this setup."""
        raise NotImplementedError("Streaming not supported for GPT4AllCustomLLM")

# Initialize the Custom LLM
custom_llm = GPT4AllCustomLLM()

# Load and Index Documents
documents = SimpleDirectoryReader(input_files=['data/transformers.pdf']).load_data()
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# Create a Query Engine (Retriever)
query_engine = RetrieverQueryEngine.from_args(index.as_retriever(), llm=custom_llm)

# Run a Query
query = "What is self attention?"
response = query_engine.query(query)

# Print the generated response
print("\nGenerated Response:\n", response)


Generated Response:
  Self-attention in this model refers to a mechanism where each position in the encoder or decoder can attend to all positions in the previous layer of the same component (encoder or decoder). This allows for long-distance dependencies to be captured and modeled.
---------------------


page_13.txt

Transformers.pdf page 5
3.2.1 Input-Output Layer6
The Transformer uses multi-head attention in three different ways: • In "encoder-decoder attention" layers, the queries come from the previous decoder layer,
and the memory keys and values come from the output of the encoder. This allows every position in the decoder to attend over all positions in the input sequence.
This mimics the typical encoder-decoder attention mechanisms in sequence-to-sequence models such as [38, 2, 9].
• The encoder contains self-attention layers. In a self-attention layer all of the keys, values and queries come from the same place,
in this case, the output of the previous
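
The printed response above continues past the answer with a dashed line and further text. Assuming that the trailing text is the model running on after its answer rather than intended output, one simple mitigation is to cut the response at the first separator. The helper below is a hypothetical post-processing sketch, not part of LlamaIndex or GPT4All, and the sample string is made up.

```python
def truncate_at_marker(text, marker="---------------------"):
    """Keep only the text before the first occurrence of marker; return all of it if marker is absent."""
    head, _, _ = text.partition(marker)
    return head.strip()


# Made-up example of an answer followed by run-on text.
raw = (
    "Self-attention lets each position attend to other positions.\n"
    "---------------------\n"
    "\n"
    "page_13.txt\n"
)
clean = truncate_at_marker(raw)
print(clean)
```

This only hides the symptom. Limiting the number of generated tokens or supplying a stop condition to the model would address it at the source, if the model wrapper supports that.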
