# Implementation of RAG Architecture for Efficient Information Retrieval
This notebook walks through my implementation of a Retrieval-Augmented Generation (RAG) architecture for question answering over a document. RAG pairs a retriever with a generative model, so answers are grounded in retrieved passages rather than in the model's parameters alone. The notebook covers each step in order: reading files, chunking text, generating and storing embeddings, prompting, and retrieving answers using large language models (LLMs). The project reflects my hands-on experience with these natural language processing techniques.

### Architecture

![RAG Architecture](RAG%20Architecture.JPG)

The diagram illustrates the workflow of a **Retrieval-Augmented Generation (RAG)** architecture, which has two phases. In the indexing phase, documents such as **Safety Data Sheets (SDS) or Technical Data Sheets (TDS)** are converted by an **embedding model** into dense vectors and stored, along with their metadata, in a **vector database**. In the query phase, a user submits a **query**, and the same embedding model converts it into a dense **vector representation (query embedding)**. The vector database compares the query embedding with the stored document embeddings and returns the most similar documents together with their text. This retrieved context is combined with the original query and passed to a **generative model**, which produces the response. The retriever supplies relevant context, and the generative model turns that context into a coherent answer grounded in it.
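
The two phases can be sketched end to end in plain Python. This is a toy illustration, not the implementation used in this notebook: the `embed` function is a bag-of-words stand-in for a real embedding model, the list `index` stands in for the vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list, k: int = 2) -> list:
    """Return the k stored texts whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, context: list) -> str:
    """Combine retrieved context with the query; this string would go to the generative model."""
    return "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query + "\nAnswer:"


# Indexing phase: embed each document once and store (text, vector) pairs.
documents = [
    "The decoder generates the output sequence one token at a time.",
    "The encoder maps the input sequence to continuous representations.",
    "Training used the WMT 2014 English-German dataset.",
]
index = [(doc, embed(doc)) for doc in documents]

# Query phase: embed the query, retrieve the closest document, assemble the prompt.
query = "What does the decoder generate?"
top = retrieve(query, index, k=1)
prompt_text = build_prompt(query, top)
print(prompt_text)
```

The generative step is left out on purpose: in a real pipeline, `prompt_text` would be passed to the LLM.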

## Environment Setup and Technologies

In [None]:
!pip install langchain bitsandbytes accelerate langchain_community sentence-transformers faiss-gpu

Collecting langchain
  Downloading langchain-0.2.6-py3-none-any.whl (975 kB)
Collecting bitsandbytes
  Downloading bitsandbytes-0.43.1-py3-none-manylinux_2_24_x86_64.whl (119.8 MB)
Collecting accelerate
  Downloading accelerate-0.31.0-py3-none-any.whl (309 kB)
Collecting langchain_community
  Downloading langchain_community-0.2.6-py3-none-any.whl (2.2 MB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)

In [None]:
!pip install pypdf

Collecting pypdf
  Downloading pypdf-4.2.0-py3-none-any.whl (290 kB)
Installing collected packages: pypdf
Successfully installed pypdf-4.2.0


In [None]:
from transformers import BitsAndBytesConfig
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain.vectorstores import FAISS
from langchain import HuggingFaceHub
from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer
from langchain.chains import RetrievalQA
from transformers import pipeline

import torch
import os
import warnings

warnings.filterwarnings("ignore")

### Description

1. **Installing Necessary Libraries:**
   - `!pip install langchain bitsandbytes accelerate langchain_community sentence-transformers faiss-gpu`: This command installs several key libraries:
     - `langchain`: A library for building language model applications.
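     - `bitsandbytes`: A library for 8-bit optimizers and 8-bit/4-bit quantization; used here to load the LLM in 4-bit precision and reduce memory usage.
     - `accelerate`: A library for device placement, multi-GPU, and mixed-precision execution; transformers relies on it when loading quantized models.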
     - `langchain_community`: A community-driven extension of the LangChain library.
     - `sentence-transformers`: A library for sentence embeddings using transformers.
     - `faiss-gpu`: Facebook AI Similarity Search (FAISS) optimized for GPU, used for efficient similarity search.

   - `!pip install pypdf`: Installs the `pypdf` library for PDF file handling.

2. **Importing Libraries and Modules:**
   - `from langchain.document_loaders import PyPDFLoader`: Imports `PyPDFLoader` to load PDF documents for processing.
   - `from langchain.text_splitter import RecursiveCharacterTextSplitter`: Imports `RecursiveCharacterTextSplitter` to split text into chunks by trying a list of separators (paragraph breaks, line breaks, spaces) in order until the chunks are small enough.
   - `from langchain.embeddings import HuggingFaceEmbeddings`: Imports `HuggingFaceEmbeddings` to generate embeddings using HuggingFace models.
   - `from langchain.prompts import ChatPromptTemplate`: Imports `ChatPromptTemplate` to create prompt templates for chat interactions.
   - `from langchain.vectorstores import FAISS`: Imports `FAISS` for vector storage and similarity search.
   - `from langchain import HuggingFaceHub`: Imports `HuggingFaceHub` to interface with HuggingFace's model hub.
   - `from langchain_community.llms.huggingface_pipeline import HuggingFacePipeline`: Imports `HuggingFacePipeline` to use HuggingFace models in a pipeline.
   - `from transformers import AutoModelForCausalLM, AutoTokenizer`: Imports `AutoModelForCausalLM` and `AutoTokenizer` to load and tokenize causal language models.
   - `from langchain.chains import RetrievalQA`: Imports `RetrievalQA` to create a retrieval-based question-answering chain.
   - `from transformers import BitsAndBytesConfig`: Imports `BitsAndBytesConfig` to configure 8-bit or 4-bit quantization when loading transformer models.
   - `from transformers import pipeline`: Imports `pipeline` to create and use various transformer pipelines.

3. **Additional Libraries:**
   - `import torch`: Imports PyTorch, the deep learning library that the embedding model and the LLM run on.
   - `import os`: Imports the `os` module for interacting with the operating system, such as handling file paths.
   - `import warnings`: Imports the `warnings` module to manage warnings that occur during code execution.
   - `warnings.filterwarnings("ignore")`: Ignores all warnings to avoid cluttering the output with warning messages.

### Technologies Used

- **LangChain**: A framework to build applications powered by language models, providing tools for document loading, text splitting, embeddings, prompt creation, and more.
- **BitsAndBytes**: A library that reduces memory usage through 8-bit optimizers for training and 8-bit/4-bit weight quantization for inference.
- **Accelerate**: Handles device placement, multi-GPU execution, and mixed precision, improving performance and resource utilization.
- **Sentence-Transformers**: Provides pre-trained models to generate sentence and text embeddings using transformers.
- **FAISS**: A library for efficient similarity search and clustering of dense vectors, with CPU and GPU implementations; the `faiss-gpu` build is installed here.
- **HuggingFace Transformers**: A library that provides state-of-the-art machine learning models, including tokenizers, language models, and pipelines for various NLP tasks.
- **PyTorch**: A deep learning library used for tensor computation and building machine learning models.

This setup forms the foundation for a robust Retrieval-Augmented Generation (RAG) system, combining powerful tools for document processing, embedding generation, similarity search, and natural language generation.

## 1. Reading the Files

### Description
This section reads a PDF file with the `PyPDFLoader` from the langchain library. The loader extracts the text of each page into its own `Document`, and `load_and_split` then passes those documents through a default text splitter, so a long page can yield more than one `Document`. In the output below, for example, `pages[2]` carries the metadata `'page': 1`. This step converts the PDF into text that can be chunked and embedded.

In [None]:
pdf_file = PyPDFLoader("/content/1706.03762v7.pdf")
pages = pdf_file.load_and_split()

In [None]:
pages[2]

Document(page_content='Most competitive neural sequence transduction models have an encoder-decoder structure [ 5,2,35].\nHere, the encoder maps an input sequence of symbol representations (x1, ..., x n)to a sequence\nof continuous representations z= (z1, ..., z n). Given z, the decoder then generates an output\nsequence (y1, ..., y m)of symbols one element at a time. At each step the model is auto-regressive\n[10], consuming the previously generated symbols as additional input when generating the next.\n2', metadata={'source': '/content/1706.03762v7.pdf', 'page': 1})

## 2. Chunking

### Description
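This section splits the extracted text into smaller chunks using the `RecursiveCharacterTextSplitter` from the langchain library. The `chunk_size` parameter sets the maximum length of each chunk, measured in characters by default, and `chunk_overlap` sets how many characters consecutive chunks may share, so that a sentence cut at a chunk boundary still appears with some surrounding context. The splitter tries to break at natural boundaries such as paragraph breaks, line breaks, and spaces before falling back to arbitrary positions. Smaller chunks keep each embedding focused on one topic and keep the retrieved context short enough to fit in the prompt.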

In [None]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=32
)

In [None]:
chunks = text_splitter.split_documents(pages)

In [None]:
chunks[23]

Document(page_content='recurrent layers, by a factor of k. Separable convolutions [ 6], however, decrease the complexity\nconsiderably, to O(k·n·d+n·d2). Even with k=n, however, the complexity of a separable\nconvolution is equal to the combination of a self-attention layer and a point-wise feed-forward layer,\nthe approach we take in our model.\nAs side benefit, self-attention could yield more interpretable models. We inspect attention distributions\nfrom our models and present and discuss examples in the appendix. Not only do individual attention\nheads clearly learn to perform different tasks, many appear to exhibit behavior related to the syntactic\nand semantic structure of the sentences.\n5 Training\nThis section describes the training regime for our models.\n5.1 Training Data and Batching\nWe trained on the standard WMT 2014 English-German dataset consisting of about 4.5 million\nsentence pairs. Sentences were encoded using byte-pair encoding [ 3], which has a shared source-', m
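
To make the interaction between `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window. It is a simplification and not the algorithm of `RecursiveCharacterTextSplitter`, which prefers to cut at separators; the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Cut text into windows of at most chunk_size characters.

    Each window starts (chunk_size - chunk_overlap) characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

With the notebook's settings (`chunk_size=1024`, `chunk_overlap=32`), the same idea applies at a larger scale: each chunk can repeat up to 32 characters from the end of the previous one.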

## 3. Embedding

### Description
This section generates an embedding for each text chunk using `HuggingFaceEmbeddings`, LangChain's wrapper around the sentence-transformers library, with the `all-MiniLM-L6-v2` model. The model runs on a CUDA-enabled device (GPU) for faster computation. Embedding converts each chunk into a numerical vector so that chunks can be compared with a query by vector similarity.

The embeddings are then stored in a FAISS vector database. FAISS is built for similarity search over high-dimensional vectors, so at query time it can quickly return the stored chunks closest to the query embedding, even for large collections.
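
The idea behind the similarity search can be sketched as an exhaustive nearest-neighbour lookup. This is only an illustration under stated assumptions: it uses Euclidean distance and two-dimensional toy vectors, and the names `l2_distance` and `top_k` are hypothetical. FAISS performs the same kind of lookup far more efficiently, and the distance metric and index type depend on how the store is configured.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, stored, k=2):
    """Return the k (distance, label) pairs with the smallest distance to query_vec.

    stored is a list of (label, vector) pairs, standing in for the vector database.
    """
    scored = ((l2_distance(query_vec, vec), label) for label, vec in stored)
    return heapq.nsmallest(k, scored)


stored = [
    ("chunk-a", [1.0, 0.0]),
    ("chunk-b", [0.0, 1.0]),
    ("chunk-c", [0.9, 0.1]),
]
hits = top_k([1.0, 0.0], stored, k=2)
print([label for _, label in hits])
```

Here `chunk-a` and `chunk-c` are returned because their vectors lie closest to the query vector; `k` plays the same role as the `k` argument passed to the retriever later in the notebook.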

In [None]:
Embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                   model_kwargs={"device": "cuda"})

vector_db = FAISS.from_texts([str(chunk) for chunk in chunks], Embeddings)
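
One detail worth noting: the retrieved documents shown later in this notebook begin with `page_content="page_content='...`, which suggests that `str(chunk)` embeds the string form of the whole object (wrapper and metadata included) rather than only its text. The sketch below illustrates the difference with a hypothetical stand-in class, `Doc`, which is not the real LangChain `Document` and whose string format differs in detail.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document object with text and metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


chunk = Doc(page_content="The decoder is auto-regressive.", metadata={"page": 1})

as_string = str(chunk)           # string form of the whole object, metadata included
as_content = chunk.page_content  # only the text that should be embedded

print(as_string)
print(as_content)
```

If only the text should be indexed, two options that I have not tested against this notebook are passing `[chunk.page_content for chunk in chunks]` to `FAISS.from_texts`, or calling `FAISS.from_documents(chunks, Embeddings)`, which also keeps the metadata.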

## 4. Prompting

### Description
This section defines a question and runs a similarity search against the vector database to inspect the most relevant text chunks (`k=2` returns the two closest). It then uses `ChatPromptTemplate` to define a prompt template with two placeholders, `{context}` and `{question}`. The search here is only a check of what the retriever finds; when the chain runs in section 6, it performs its own retrieval and fills the template with the retrieved chunks and the user query before calling the generative model.

In [None]:
question = """
what is the purpose of the decoder?
"""

relevant_results = vector_db.similarity_search(question, k=2)
relevant_results[0]

Document(page_content="page_content='and the memory keys and values come from the output of the encoder. This allows every\\nposition in the decoder to attend over all positions in the input sequence. This mimics the\\ntypical encoder-decoder attention mechanisms in sequence-to-sequence models such as\\n[38, 2, 9].\\n•The encoder contains self-attention layers. In a self-attention layer all of the keys, values\\nand queries come from the same place, in this case, the output of the previous layer in the\\nencoder. Each position in the encoder can attend to all positions in the previous layer of the\\nencoder.\\n•Similarly, self-attention layers in the decoder allow each position in the decoder to attend to\\nall positions in the decoder up to and including that position. We need to prevent leftward\\ninformation flow in the decoder to preserve the auto-regressive property. We implement this\\ninside of scaled dot-product attention by masking out (setting to −∞) all values in the input\\

In [None]:
prompt = """
Using this piece of information:
\n
{context}
\n
Answer the following question:
\n
{question}
\n
Answer:
\n
"""

prompt = ChatPromptTemplate.from_template(prompt)
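
To show what the filled template looks like, here is a minimal sketch using plain string formatting. It does not use LangChain: `TEMPLATE` is a condensed copy of the template above, `fill_prompt` is a hypothetical helper, and joining chunks with a blank line is an assumption about how the chain combines them.

```python
TEMPLATE = (
    "Using this piece of information:\n\n{context}\n\n"
    "Answer the following question:\n\n{question}\n\nAnswer:\n"
)


def fill_prompt(template: str, retrieved_texts: list, question: str) -> str:
    """Join the retrieved chunks and substitute them, with the question, into the template."""
    context = "\n\n".join(retrieved_texts)  # blank-line separator is an assumption
    return template.format(context=context, question=question.strip())


filled = fill_prompt(
    TEMPLATE,
    ["Chunk A about the decoder.", "Chunk B about attention."],
    "\nwhat is the purpose of the decoder?\n",
)
print(filled)
```

Because every retrieved chunk is pasted into `{context}`, the number of chunks and their size directly determine how long the prompt becomes.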

## 5. Retrieval / LLM

### Description
This section loads a quantized large language model (LLM) for inference. `BitsAndBytesConfig` requests 4-bit loading with the NF4 quantization type, which cuts the memory needed for the weights to roughly a quarter of the 16-bit footprint, usually at a small cost in output quality. `AutoTokenizer` and `AutoModelForCausalLM` from the transformers library load the Mistral-7B-Instruct-v0.2 model. A text-generation pipeline is then created and wrapped in `HuggingFacePipeline` so that LangChain can call it; the `RetrievalQA` chain that uses it is built in the next section.

In [None]:
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4"
)
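
A rough back-of-the-envelope estimate shows why 4-bit loading matters on a single GPU. The sketch below counts the weights only and assumes about 7.24 billion parameters (an assumed figure, not stated in this notebook). It ignores activations, the key-value cache, quantization constants, and any layers that stay in 16-bit, so real usage will be higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7.24e9  # assumed parameter count for a 7B-class model

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"16-bit weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:  ~{int4_gb:.1f} GB")
```

Under these assumptions, the weights shrink from roughly 14.5 GB to roughly 3.6 GB.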

In [None]:
# Authenticate using a Hugging Face access token (placeholder value; avoid committing a real token)
huggingface_token = "sample_api_key"

# `token` replaces the deprecated `use_auth_token` argument
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", token=huggingface_token)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2", token=huggingface_token, quantization_config=bnb_config)

loading file tokenizer.model from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/99259002b41e116d28ccb2d04a9fbe22baed0c7f/tokenizer.model
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/99259002b41e116d28ccb2d04a9fbe22baed0c7f/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/99259002b41e116d28ccb2d04a9fbe22baed0c7f/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/99259002b41e116d28ccb2d04a9fbe22baed0c7f/tokenizer_config.json


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/99259002b41e116d28ccb2d04a9fbe22baed0c7f/config.json
Model config MistralConfig {
  "_name_or_path": "mistralai/Mistral-7B-Instruct-v0.2",
  "architectures": [
    "MistralForCausalLM"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 32768,
  "model_type": "mistral",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "rms_norm_eps": 1e-05,
  "rope_theta": 1000000.0,
  "sliding_window": null,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.41.2",
  "use_cache": true,
  "vocab_size": 32000
}

The device_map was not initialized. Setting device_map to {'':torch.cuda.current_device()}. If you want to use the model for infer

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/99259002b41e116d28ccb2d04a9fbe22baed0c7f/model.safetensors.index.json


Instantiating MistralForCausalLM model under default dtype torch.float16.
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}



All model checkpoint weights were used when initializing MistralForCausalLM.

All the weights of MistralForCausalLM were initialized from the model checkpoint at mistralai/Mistral-7B-Instruct-v0.2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use MistralForCausalLM for predictions without further training.


loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--mistralai--Mistral-7B-Instruct-v0.2/snapshots/99259002b41e116d28ccb2d04a9fbe22baed0c7f/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 1,
  "eos_token_id": 2
}



In [None]:
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=128)
lc_pipeline = HuggingFacePipeline(pipeline=pipe)

## 6. Evaluating the Answer

### Description
This section builds the `RetrievalQA` chain and checks its answers. The chain retrieves the three most similar chunks for each query (`k=3`), fills the prompt template, and calls the LLM; with `return_source_documents=True` it also returns the chunks it used. Each query below prints the answer followed by its source documents, so the response can be checked against the retrieved text for accuracy and relevance. Answers that stop mid-sentence have hit the `max_new_tokens=128` limit set on the pipeline.

In [None]:
qa_chain = RetrievalQA.from_chain_type(
    llm=lc_pipeline,
    retriever=vector_db.as_retriever(search_kwargs = {"k": 3}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt}
)

In [None]:
result = qa_chain.invoke({"query": question})

Disabling tokenizer parallelism, we're using DataLoader multithreading already
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
print(result["result"].split("Answer:")[1])




The decoder is a component of the Transformer model, which is used to generate the output sequence based on the input sequence. It takes the output of the encoder as input and generates the next token in the sequence based on the attention mechanism. The decoder also contains self-attention layers, but it is
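
The `split("Answer:")[1]` calls in this section suggest that the chain's result contains the filled prompt followed by the model's completion. Indexing with `[1]` raises an `IndexError` if the marker is missing and drops everything after a second `Answer:` in the completion. A slightly more defensive sketch, using a hypothetical helper named `extract_answer`:

```python
def extract_answer(generated_text: str, marker: str = "Answer:") -> str:
    """Return the text after the first occurrence of marker, or the whole text if absent."""
    _, found, tail = generated_text.partition(marker)
    return tail.strip() if found else generated_text.strip()


echoed = (
    "Using this piece of information:\n\nsome context\n\n"
    "Answer the following question:\n\nwhat is X?\n\nAnswer:\n\nX is a thing."
)
print(extract_answer(echoed))
print(extract_answer("no marker here"))
```

This still assumes that the retrieved context itself does not contain the marker; if it does, the cut lands inside the context instead of at the end of the prompt.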


In [None]:
print(result["source_documents"])

[Document(page_content="page_content='and the memory keys and values come from the output of the encoder. This allows every\\nposition in the decoder to attend over all positions in the input sequence. This mimics the\\ntypical encoder-decoder attention mechanisms in sequence-to-sequence models such as\\n[38, 2, 9].\\n•The encoder contains self-attention layers. In a self-attention layer all of the keys, values\\nand queries come from the same place, in this case, the output of the previous layer in the\\nencoder. Each position in the encoder can attend to all positions in the previous layer of the\\nencoder.\\n•Similarly, self-attention layers in the decoder allow each position in the decoder to attend to\\nall positions in the decoder up to and including that position. We need to prevent leftward\\ninformation flow in the decoder to preserve the auto-regressive property. We implement this\\ninside of scaled dot-product attention by masking out (setting to −∞) all values in the input\

In [None]:
question = """
What is the attention function?
"""

result = qa_chain.invoke({"query": question})

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
print(result["result"].split("Answer:")[1])




The attention function is a mechanism used in deep learning models, particularly in sequence-to-sequence models, to allow each position in the decoder to attend to all positions in the input sequence. It computes a weighted sum of the input sequence based on the query and the input sequence itself. The weights are determined


In [None]:
question = """
What are the three ways of using multi-head attention?
"""

result = qa_chain.invoke({"query": question})

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
print(result["result"].split("Answer:")[1])




The Transformer uses multi-head attention in three different ways:

1. In "encoder-decoder attention" layers, the queries come from the previous decoder layer.
2. In "self-attention" layers, the queries, keys, and values come from the same input sequence.
3. In "multi-head attention with different input sequences", the queries, keys, and values come from different input sequences.



Reference:


page_content='4 Why Self-Attention\nIn this section we compare various aspects of self-attention layers to the recurrent


In [None]:
print(result["source_documents"])

[Document(page_content="page_content='4 Why Self-Attention\\nIn this section we compare various aspects of self-attention layers to the recurrent and convolu-\\ntional layers commonly used for mapping one variable-length sequence of symbol representations\\n(x1, ..., x n)to another sequence of equal length (z1, ..., z n), with xi, zi∈Rd, such as a hidden\\nlayer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we\\nconsider three desiderata.\\nOne is the total computational complexity per layer. Another is the amount of computation that can\\nbe parallelized, as measured by the minimum number of sequential operations required.\\nThe third is the path length between long-range dependencies in the network. Learning long-range\\ndependencies is a key challenge in many sequence transduction tasks. One key factor affecting the\\nability to learn such dependencies is the length of the paths forward and backward signals have to\\ntraverse in the netw

In [None]:
question = """
Why did the authors use self-attention?
"""

result = qa_chain.invoke({"query": question})

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
print(result["result"].split("Answer:")[1])




The authors used self-attention for several reasons. First, it has a lower computational complexity per layer compared to recurrent and convolutional layers. Second, it allows for more parallelization as it requires fewer sequential operations. Third, it enables the learning of long-range dependencies, which is important for many sequence transduction tasks. Additionally, self-attention could yield more interpretable models, as individual attention heads can learn to perform different tasks and exhibit behavior related to the syntactic and semantic structure of the sentences.


In [None]:
print(result["source_documents"])

[Document(page_content="page_content='4 Why Self-Attention\\nIn this section we compare various aspects of self-attention layers to the recurrent and convolu-\\ntional layers commonly used for mapping one variable-length sequence of symbol representations\\n(x1, ..., x n)to another sequence of equal length (z1, ..., z n), with xi, zi∈Rd, such as a hidden\\nlayer in a typical sequence transduction encoder or decoder. Motivating our use of self-attention we\\nconsider three desiderata.\\nOne is the total computational complexity per layer. Another is the amount of computation that can\\nbe parallelized, as measured by the minimum number of sequential operations required.\\nThe third is the path length between long-range dependencies in the network. Learning long-range\\ndependencies is a key challenge in many sequence transduction tasks. One key factor affecting the\\nability to learn such dependencies is the length of the paths forward and backward signals have to\\ntraverse in the netw

In [None]:
question = """
What were the specs of the machine that was used to train the model?
"""

result = qa_chain.invoke({"query": question})

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
print(result["result"].split("Answer:")[1])




The machine used to train the model had 8 NVIDIA P100 GPUs. Each training step took about 0.4 seconds for the base models and 1.0 seconds for the big models. The base models were trained for 100,000 steps or 12 hours, while the big models were trained for 300,000 steps (3.5 days). The optimizer used was the Adam optimizer with β1= 0.9, β2= 0.98, and ϵ= 10−9.


In [None]:
print(result["source_documents"])