# Creating RAG from scratch

## Importing necessary libraries

In [1]:
import os
os.environ['KMP_DUPLICATE_LIB_OK']='TRUE'

In [2]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
import fitz
from transformers import BertTokenizer,BertModel
import re
import torch
import faiss
import numpy as np
import google.generativeai as genai
from dotenv import load_dotenv


[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\91859\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
  from .autonotebook import tqdm as notebook_tqdm


In [3]:
load_dotenv()
genai.configure(api_key=os.getenv("API_KEY"))

In [4]:
DEVICE=torch.device("cuda" if torch.cuda.is_available() else "cpu")
DEVICE

device(type='cpu')

In [5]:
tokenizer=BertTokenizer.from_pretrained("bert-base-uncased")

In [6]:
bert_model=BertModel.from_pretrained("bert-base-uncased")
bert_model.to(DEVICE)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

## Document Loading

In [7]:
def load_doc(path):
    full_text=[]
    doc=fitz.open(path)
    for page_num in range(len(doc)):
        page=doc.load_page(page_num)
        text=page.get_text()
        full_text.append({
        "page_number": page_num + 1,
        "text": text
    })
    doc.close()
    return full_text

doc=load_doc(r"C:\Users\91859\Desktop\rag_scratch\rag\sample_pdf.pdf")
for item in doc:
    print(f"Page {item['page_number']}:\n{item['text']}")
    print("-" * 40)


Page 1:
Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions
entirel

In [8]:
def preprocess_doc(doc):
    placeholder = "___PARAGRAPH_BREAK___"
    for page in doc:
        page["text"]=re.sub(r'\n\d\n+',r'\n',page["text"])  #removes page numbers
        page["text"]=re.sub(r'(\n )+',placeholder,page["text"])
        page["text"]=re.sub(r'\n',' ',page["text"])
        page["text"]=re.sub(placeholder,'\n',page["text"])
    return doc

doc=preprocess_doc(doc)
for item in doc:
    print(f"Page {item['page_number']}:\n{item['text']}")
    print("-" * 40)


Page 1:
Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗† University of Toronto aidan@cs.toronto.edu Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirel

## Text Splitting

In [9]:
def split_text(doc,chunk_size=250,overlap=20): # character based splitting
    chunks=[]
    page_num=[]
    for page in doc:
        start=0
        while(start< len(page["text"])):
            
            end=min(start+chunk_size,len(page["text"]))
            a="".join(page["text"][start:end+1])
            page_num.append(page["page_number"])
            start=start+chunk_size-overlap
            chunks.append(a)
    return chunks,page_num

chunks,page_num=split_text(doc)
print(chunks)
print(page_num)

['Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com No', 'vaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗† University of Toronto aidan@cs.t', 'of Toronto aidan@cs.toronto.edu Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that i', 'eural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on atte

Splitting by character is not efficient since it may break words

In [10]:
def split(doc,chunk_size=80,overlap=4):  # word-based splitting
    chunks=[]
    page_num=[]
    for page in doc:
        words=page["text"].split()
        start=0
        while(start<len(words)):
            end=min(start+chunk_size,len(words))
            a=" ".join(words[start:end])
            page_num.append(page["page_number"])
            start=start+chunk_size-overlap
            chunks.append(a)
    return chunks,page_num



In [11]:
chunks,page_num=split(doc)
for i in chunks[0:5]:
    print(i)
    print("--------------------------------------------------------------------------------")
print(page_num)
print("No of chunks : ", len(chunks))
print("Length of pages array : ", len(page_num))

Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗† University of Toronto aidan@cs.toronto.edu Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on
--------------------------------------------------------------------------------
models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transform

In [12]:
def split_by_sent(doc,max_length=150):
    chunks=[]
    page_num=[]
    for page in doc:
        sentences = sent_tokenize(page["text"])
        current=''
        for sent in sentences:
            if len(current.split())+len(sent.split()) <max_length:
                current+=' '+sent
            else:
                chunks.append(current)
                page_num.append(page["page_number"])
                current=sent
        if current:
            chunks.append(current.strip())
            page_num.append(page["page_number"])
        
    return chunks,page_num

In [13]:
chunks,page_num=split_by_sent(doc)
for i in chunks[0:5]:
    print(i)
    print("--------------------------------------------------------------------------------")
print(page_num)
print("No of chunks : ", len(chunks))
print("Length of pages array : ", len(page_num))

 Provided proper attribution is provided, Google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. Attention Is All You Need Ashish Vaswani∗ Google Brain avaswani@google.com Noam Shazeer∗ Google Brain noam@google.com Niki Parmar∗ Google Research nikip@google.com Jakob Uszkoreit∗ Google Research usz@google.com Llion Jones∗ Google Research llion@google.com Aidan N. Gomez∗† University of Toronto aidan@cs.toronto.edu Łukasz Kaiser∗ Google Brain lukaszkaiser@google.com Illia Polosukhin∗‡ illia.polosukhin@gmail.com Abstract The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
----

## Tokenizing

In [14]:
tokens=tokenizer.batch_encode_plus(chunks,add_special_tokens=True,padding=True,truncation=True)
tokens

{'input_ids': [[101, 3024, 5372, 2012, 18886, 29446, 2003, 3024, 1010, 8224, 2182, 3762, 8624, 6656, 2000, 21376, 1996, 7251, 1998, 4481, 1999, 2023, 3259, 9578, 2005, 2224, 1999, 4988, 2594, 2030, 12683, 2573, 1012, 3086, 2003, 2035, 2017, 2342, 6683, 4509, 12436, 26760, 7088, 30125, 8224, 4167, 10927, 26760, 7088, 1030, 8224, 1012, 4012, 2053, 3286, 21146, 23940, 2099, 30125, 8224, 4167, 2053, 3286, 1030, 8224, 1012, 4012, 23205, 2072, 19177, 2099, 30125, 8224, 2470, 23205, 11514, 1030, 8224, 1012, 4012, 19108, 2149, 29002, 2890, 4183, 30125, 8224, 2470, 2149, 2480, 1030, 8224, 1012, 4012, 2222, 3258, 3557, 30125, 8224, 2470, 2222, 3258, 1030, 8224, 1012, 4012, 12643, 1050, 1012, 12791, 30125, 1526, 2118, 1997, 4361, 12643, 1030, 20116, 1012, 4361, 1012, 3968, 2226, 1105, 15750, 17112, 15676, 30125, 8224, 4167, 23739, 2480, 11151, 8043, 1030, 8224, 1012, 4012, 5665, 2401, 11037, 6342, 10023, 2378, 30125, 1527, 5665, 2401, 1012, 11037, 6342, 10023, 2378, 1030, 20917, 4014, 1012, 4012,

In [15]:
len(tokens["input_ids"][0])

345

In [16]:
text=tokenizer.decode(tokens['input_ids'][0])
print(text)

[CLS] provided proper attribution is provided, google hereby grants permission to reproduce the tables and figures in this paper solely for use in journalistic or scholarly works. attention is all you need ashish vaswani∗ google brain avaswani @ google. com noam shazeer∗ google brain noam @ google. com niki parmar∗ google research nikip @ google. com jakob uszkoreit∗ google research usz @ google. com llion jones∗ google research llion @ google. com aidan n. gomez∗ † university of toronto aidan @ cs. toronto. edu łukasz kaiser∗ google brain lukaszkaiser @ google. com illia polosukhin∗ ‡ illia. polosukhin @ gmail. com abstract the dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. the best performing models also connect the encoder and decoder through an attention mechanism. we propose a new simple network architecture, the transformer, based solely on attention mechanisms, dispensing with recurrenc

In [17]:
input_ids=torch.tensor(tokens["input_ids"]).to(DEVICE)
mask=torch.tensor(tokens["attention_mask"]).to(DEVICE)

## Converting to Embeddings

In [18]:
id_tensor = input_ids[0].unsqueeze(0)  # Add batch dim
mask_tensor = mask[0].unsqueeze(0)   
out= bert_model(input_ids=id_tensor,attention_mask= mask_tensor)[0]
print(out.shape)

torch.Size([1, 345, 768])


In [19]:
def mean_embedding(bert_model,input_ids,attention_masks):
    mean_emb=[]
    print(type(bert_model))
    for id,mask in zip(input_ids,attention_masks):
        id=id.unsqueeze(0)
        mask=mask.unsqueeze(0)
        with torch.no_grad():
            emb=bert_model(id,mask)[0].squeeze(0)

            valid_mask=mask[0]==1
            valid_emb=emb[valid_mask,:]
            mean_embedding = valid_emb.mean(dim=0)

            mean_emb.append(mean_embedding.unsqueeze(0))
    aggregated_mean_emb=torch.cat(mean_emb)
    return aggregated_mean_emb
            
            
aggr_mean_emb=mean_embedding(bert_model,input_ids,mask)

<class 'transformers.models.bert.modeling_bert.BertModel'>


In [20]:
aggr_mean_emb.shape

torch.Size([50, 768])

## Storing in faiss

In [21]:
np_aggr_mean_emb=aggr_mean_emb.cpu().numpy().astype('float32')

In [22]:
print(np_aggr_mean_emb.dtype)
print(np_aggr_mean_emb.shape)

float32
(50, 768)


In [23]:
dimension = 768  
index = faiss.IndexFlatL2(dimension)
index.d

768

In [24]:
index.add(np_aggr_mean_emb)

# Retrieval Stage

In [25]:
query="Explain the encoder and decoder stacks of transformers"
query_tokens=tokenizer.encode(query,add_special_tokens=True,return_tensors='pt')
query_tokens.shape

torch.Size([1, 13])

In [26]:

query_emb=bert_model(query_tokens)[0].squeeze(0)
print("Shape of vector before taking mean : ",query_emb.shape)
query_emb=query_emb.mean(dim=0).unsqueeze(0)
print("Shape of vector after taking mean : ",query_emb.shape)


Shape of vector before taking mean :  torch.Size([13, 768])
Shape of vector after taking mean :  torch.Size([1, 768])


In [27]:
query_emb_np=query_emb.cpu().detach().numpy().astype('float32')
D,I=index.search(query_emb_np,k=7)
print("Indices:", I)
print("Distances:", D)
retrieved_chunks=[]
retrieved_pages=[]
for i in I[0]:
    print("CHUNK : ",chunks[i])
    print("PAGE_NO: ",page_num[i])
    retrieved_chunks.append(chunks[i])
    retrieved_pages.append(page_num[i])

Indices: [[ 8  3 15 19 16 28 14]]
Distances: [[47.394146 49.032997 49.158653 49.342785 49.409195 49.418396 49.83879 ]]
CHUNK :   Figure 1: The Transformer - model architecture. The Transformer follows this overall architecture using stacked self-attention and point-wise, fully connected layers for both the encoder and decoder, shown in the left and right halves of Figure 1, respectively. 3.1 Encoder and Decoder Stacks Encoder: The encoder is composed of a stack of N = 6 identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple, position- wise fully connected feed-forward network. We employ a residual connection [11] around each of the two sub-layers, followed by layer normalization [1]. That is, the output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself. To facilitate these residual connections, all sub-layers in the model, as well as the embeddin

Defining a function that performs tasks we have done till now

In [28]:
def retrieve_chunks(query,chunks,page_num):
    query_tokens=tokenizer.encode(query,add_special_tokens=True,return_tensors='pt')
    query_emb=bert_model(query_tokens)[0].squeeze(0)
    query_emb=query_emb.mean(dim=0).unsqueeze(0)
    query_emb_np=query_emb.cpu().detach().numpy().astype('float32')
    D,I=index.search(query_emb_np,k=7)
    retrieved_chunks=[]
    retrieved_pages=[]
    for i in I[0]:
        retrieved_chunks.append(chunks[i])
        retrieved_pages.append(page_num[i])
    return retrieved_chunks,retrieved_pages
    

In [29]:
def generate_ans(query,chunks,page_num):
    
    context="Context: "
    for i,(chunk,page) in enumerate(zip(chunks,page_num)):
        context+=f"\nChunk {i}: {chunk}\n page of chunk {i} is : {page}"

    prompt=f"""You are a helpfull assistant. Your task is to answer the question based on the given context and their corresponding page numbers.
    - Cite relevant page numbers to support your answer, but group citations logically rather than repeating them after every sentence.
    - When multiple sentences refer to the same context, place the page number once at the end of the related section.
    - Provide detailed and accurate explanations wherever applicable.
    - Do not make up answer if the answer is not found in context.
    {context}
    Question:
    {query}
    Answer:
    """
    
    try:
        model = genai.GenerativeModel('gemini-1.5-flash-latest')
        generation_config = genai.types.GenerationConfig(temperature=0.7,max_output_tokens=1024)
        response=model.generate_content(prompt, generation_config=generation_config)
        return response.text
    except Exception as e:
        print("Error occured : ",e)

In [30]:

result=generate_ans(query,retrieved_chunks,retrieved_pages)

In [31]:
print(result)

The Transformer's encoder is composed of a stack of six identical layers (N=6) (page 3). Each layer contains two sub-layers: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network (page 3).  A residual connection is used around each sub-layer, followed by layer normalization, with the output of each sub-layer being LayerNorm(x + Sublayer(x)) (page 3).  All sub-layers and embedding layers output a dimension of dmodel = 512 to facilitate these residual connections (page 3).

The decoder also uses stacked layers, but the provided text does not specify the number of layers in the decoder stack.  The decoder's self-attention layers allow each position to attend to all positions up to and including that position, preventing leftward information flow to maintain the autoregressive property (page 5).  Similar to the encoder, the decoder also contains position-wise feed-forward networks (page 5).  The decoder uses multi-head attention in three ways: encod

In [32]:
query="Explain the working of self attention layer in decoder"
c,p=retrieve_chunks(query,chunks,page_num)
result=generate_ans(query,c,p)
print(result)

In the decoder, self-attention layers allow each position to attend to all positions in the decoder up to and including that position.  To maintain the auto-regressive property, leftward information flow is prevented by masking out (setting to −∞) values in the softmax input that correspond to illegal connections. (page 5)



# Improvements that can be made in future

* Better text splitting
* Generating multiple queries from user query
* Reranking chunks after retrieval