### **Implementing a Retrieval-Augmented Generation System for PDF Documents Using Generative AI**

#### **Objectives:**
*   Understanding Retrieval Augmented Generation (`RAG`)
*   Data Preprocessing and Chunking
*   Embedding Text Chunks
*   Similarity Search Implementation
*   Generative AI Model Selection
*   Model Integration

#### **Tasks:**

* Develop a PDF processing module that extracts text from PDF documents and converts it into a format suitable for further processing.

* Implement a chunking algorithm to divide the text from PDF documents into smaller, manageable pieces for efficient storage and retrieval.

* Convert the text chunks into embeddings using a technique such as Word2Vec, Universal Sentence Encoder, or a sentence-transformer model (the one used in this notebook).

* Set up a vector database and store the content chunks from PDF documents in vectorized form to enable similarity search.

* Implement a similarity search algorithm, such as cosine similarity or nearest neighbor search, to retrieve relevant content chunks from the vector database based on user queries.

* Choose an appropriate generative AI model, such as `Mixtral` or `Llama 2 70B`, based on factors such as performance, compatibility, and availability of pre-trained weights.

* Integrate the selected generative AI model with the retrieval system, providing an interface for users to input queries and receive answers generated based on the retrieved content chunks.
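
The similarity-search task above relies on cosine similarity. As a minimal sketch of the idea, the snippet below scores three made-up 3-dimensional vectors against a query vector and ranks them. The vectors and the names `query_vec` and `chunk_vecs` are illustrative assumptions, not output of a real embedding model, which would produce vectors with hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


# Hypothetical toy "embeddings" used only for illustration
query_vec = [1.0, 0.0, 1.0]
chunk_vecs = {
    "chunk_a": [2.0, 0.0, 2.0],  # same direction as the query
    "chunk_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "chunk_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

scores = {name: cosine_similarity(query_vec, vec) for name, vec in chunk_vecs.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because cosine similarity ignores vector length, `chunk_a` scores highest even though its magnitude differs from the query's.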


### **Installing all dependencies**

In [None]:
!pip install langchain

In [None]:
# Installing Embedding model
!pip install sentence_transformers

# Installing Vector database
!pip install chromadb

In [None]:
!pip install pypdf

!pip install groq

### **Importing all necessary packages**

In [4]:
import os

from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from groq import Groq

### **Code Implementation**

#### **Loading the document**

In [6]:
# Loading the PDF 'sample_pdf' by using PyPDFLoader
loader = PyPDFLoader("/content/sample_pdf.pdf")
documents = loader.load()

#### **Creating chunks of the document**

In [7]:
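def split_docs(documents, chunk_size=1000, chunk_overlap=20):
    # Split the loaded pages into chunks of about chunk_size characters,
    # with chunk_overlap characters shared between neighbouring chunks
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    docs = text_splitter.split_documents(documents)
    return docs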

docs = split_docs(documents)
print(len(docs))

47
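
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-width chunker. It is only a sketch: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraph and line breaks, whereas the hypothetical `chunk_text` helper below slices purely by character position.

```python
def chunk_text(text, chunk_size=1000, chunk_overlap=20):
    """Slice text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
```

Each piece repeats the last three characters of the previous one, so a sentence cut at a chunk boundary is still partly visible in the next chunk.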


#### **Creating embeddings of the chunks and the vector database**

In [None]:
# Using the 'all-MiniLM-L6-v2' model to create embeddings
embeddings = SentenceTransformerEmbeddings(model_name="all-MiniLM-L6-v2")

# Alternative: an in-memory store without persistence
# db = Chroma.from_documents(docs, embeddings)

# Creating Vector database
persist_directory = "chroma_db_day_4"
vectordb = Chroma.from_documents(
    documents=docs, embedding=embeddings, persist_directory=persist_directory
)

#### **Using similarity search to retrieve the relevant content from the database**

In [16]:
# query according to which the content needs to be retrieved
query = "From Assessment Year 2024-25, a maximum rebate of Rs. 25,000 is allowed under which section?"

# using similarity search to retrieve content
matching_docs = vectordb.similarity_search(query, k=3)

# printing each chunk returned by the similarity search
for doc in matching_docs:
    print(doc.page_content)

# joining the retrieved chunks into a single newline-separated string to create context for the generative model
context = "\n".join([doc.page_content for doc in matching_docs])
print("-------------------------------------------------------------------------")
print(context)

whichever is less.  
(b) From Assessment Year 2024 -25, a maximum rebate of Rs. 25,000 is allowed under section 
87A, If the total income of an individual, who is opting for the new tax scheme under Section 
115BAC(1A), is up to Rs. 7,00,000. Further, if the to tal income of the resident individual 
(opting section 115BAC(1A) exceeds Rs. 7,00,000 and the tax payable on such income 
exceeds the difference between the total income and Rs. 7,00,000, he can claim a rebate with 
marginal relief to the extent of the diffe rence between the tax payable on such total income 
and the amount by which it exceeds Rs. 7,00,000  
(c) If an assessee has opted for new tax regime, the provisions of AMT shall not be applicable.  
 
Conditions to be satisfied:  
The option to pay tax at lower rates  shall be available only if the total income of assessee is 
computed without claiming following exemptions or deductions:  
a) Leave Travel concession [Section 10(5)]
[As amended  by Finance  Act, 2023]   
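
To make the retrieval step more concrete, the following sketch mimics what a vector store does conceptually: embed each text, keep the vectors, and return the `k` texts closest to the query. Everything here is an assumption for illustration. `ToyVectorStore` and `embed` are hypothetical names, the "embedding" is a plain bag-of-words count rather than a neural model, and this is not how Chroma is implemented internally.

```python
import heapq
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': word counts scaled to unit length (sparse dict form)."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()} if norm else {}


class ToyVectorStore:
    """In-memory stand-in for a vector database, for illustration only."""

    def __init__(self):
        self._items = []  # list of (text, vector) pairs

    def add_texts(self, texts):
        for text in texts:
            self._items.append((text, embed(text)))

    def similarity_search(self, query, k=3):
        q = embed(query)

        def score(item):
            _, vec = item
            # Dot product of unit vectors equals their cosine similarity
            return sum(w * vec.get(word, 0.0) for word, w in q.items())

        best = heapq.nlargest(k, self._items, key=score)
        return [text for text, _ in best]


store = ToyVectorStore()
store.add_texts([
    "A rebate is allowed under section 87A.",
    "Leave travel concession is covered by section 10(5).",
    "The weather was pleasant throughout the week.",
])
results = store.similarity_search("Which section allows a rebate?", k=2)
print(results)
```

A real embedding model also matches paraphrases that share no words with the query, which this word-overlap sketch cannot do.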
 


#### **Creating a prompt consisting of the context and the query**

In [17]:
prompt = f'''You are a chatbot. Generate a concise, well-summarised answer from the context.
Use the following provided context to answer the query enclosed within triple backticks.
    Context: {context}
    User Query: ```{query}```
    Answer:
'''
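
Retrieved chunks can add up to more text than the model's context window allows, so it may help to cap the context before building the prompt. The sketch below is one possible approach, not part of the original notebook: `build_prompt` and `max_context_chars` are hypothetical names, and counting characters is only a rough proxy for counting tokens.

```python
def build_prompt(query, chunks, max_context_chars=4000):
    """Join chunks into a context block, stopping before the character budget is exceeded."""
    selected, used = [], 0
    for chunk in chunks:
        # One extra character for the newline that separates consecutive chunks
        cost = len(chunk) + (1 if selected else 0)
        if used + cost > max_context_chars:
            break
        selected.append(chunk)
        used += cost
    context = "\n".join(selected)
    fence = "`" * 3  # triple backticks that delimit the user query
    return (
        "Use the following provided context to answer the query "
        "enclosed within triple backticks.\n"
        f"Context: {context}\n"
        f"User Query: {fence}{query}{fence}\n"
        "Answer:\n"
    )


demo = build_prompt(
    "Which section?",
    ["first chunk", "second chunk", "third chunk"],
    max_context_chars=25,
)
print(demo)
```

With a budget of 25 characters only the first two chunks fit, so the third is left out of the prompt.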

#### **Generating answer from the context and query.**

In [None]:
# The Groq client reads the API key from the GROQ_API_KEY environment variable,
# so set that variable before running this cell.

# creating the groq client used to call the model `llama2-70b-4096`
client = Groq()

# generating answer from the context and query using 'llama2-70b-4096' model:
completion = client.chat.completions.create(
    model="llama2-70b-4096",
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    temperature=0.5,
    max_tokens=1024,
    top_p=1,
    stream=True,
    stop=None,
)

# printing the result
for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

Section 87A.

The question asks about the maximum rebate allowed under section 87A from Assessment Year 2024-25. The answer is Rs. 25,000.
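
The loop above prints the streamed answer piece by piece. If the full answer is needed as a single string, for example to log it or post-process it, the pieces can be accumulated instead. The sketch below assumes each streamed chunk exposes `choices[0].delta.content`, as in the loop above. `collect_stream` and `fake_chunk` are hypothetical helpers, and the fake stream stands in for a real API response so the example runs offline.

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Concatenate the text deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:  # content may be None, which the original loop handles with `or ""`
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Build an object shaped like a streamed chunk (for offline illustration only)."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


fake_stream = [fake_chunk("Section "), fake_chunk("87A."), fake_chunk(None)]
answer = collect_stream(fake_stream)
print(answer)
```

A stream can be consumed only once, so use either the printing loop or `collect_stream(completion)` on a given response, not both.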