# Objective

The objective of this Kaggle notebook is to design a question-answering system using Retrieval-Augmented Generation (RAG) with an LLM. The system ingests multiple documents (PDF, PPT, etc.) as its knowledge base, draws on the context and nuances within these documents, and generates precise, informative answers to a wide range of user queries. The notebook highlights the system's ability to retrieve information at query time from a diverse document set and produce relevant answers.

In [None]:
# Installing Required Libraries
%pip install python-pptx
%pip install PyPDF2
%pip install langchain
%pip install langchain_community
%pip install langchain_google_genai
%pip install langchain_text_splitters
%pip install sentence-transformers
%pip install faiss-cpu
%pip install cohere

In [6]:
# necessary Imports
from docx import Document
from PyPDF2 import PdfReader
from pptx import Presentation
from langchain_community.llms import Cohere
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import AIMessage, HumanMessage
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, MessagesPlaceholder

# Data Loading

This notebook uses three data files:

* PDF 1: American history
* PDF 2: American Constitutional Report
* PPT: US Civil War report presentation



In [9]:
pdf_file = open('/kaggle/input/rag-questionanswermodel/USHistory-LR.pdf','rb')
ppt_file = Presentation("/kaggle/input/rag-questionanswermodel/US-CIVILWARslides.pptx")
pdf_file2 = open('/kaggle/input/rag-questionanswermodel/USAdvisory-1776Report.pdf','rb')

# Data Extraction

**PDF data** 
- extracted using PyPDF2 
- text is stored in a string

**PPT data**
- extracted using python-pptx module
- text is stored in a string

A single string combining the text from all files is used for further text processing.

In [10]:
pdf_text = ""
pdf_reader = PdfReader(pdf_file)
for page in pdf_reader.pages:
    pdf_text += page.extract_text()

# extracting pdf 2 data
pdf2_text = ""
pdf_reader2 = PdfReader(pdf_file2)
for page in pdf_reader2.pages:
    pdf2_text += page.extract_text()
    
# extracting ppt data
ppt_text = ""
for slide in ppt_file.slides:
    for shape in slide.shapes:
        if hasattr(shape, "text"):
            ppt_text += shape.text + '\n'


*Merge all the extracted text into one string*

In [11]:
# merging all the text 

all_text = pdf_text + '\n' + ppt_text + '\n' + pdf2_text
len(all_text)

3019717

# Chunking

This step splits the merged text into chunks using `RecursiveCharacterTextSplitter`, which is useful for indexing the data and feeding it into a model. Large chunks are difficult to search over and won’t fit in a model’s finite context window.

In [13]:
# splitting the text into chunks for embeddings creation

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,  # overlap helps limit the context lost at chunk boundaries
    length_function=len,
    separators=['\n', '\n\n', ' ', ''],  # separators are tried in list order
)
    
chunks = text_splitter.split_text(text = all_text)

In [14]:
len(chunks)

3710
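
To make the role of `chunk_overlap` concrete, here is a minimal sketch of a fixed-window splitter in plain Python. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter also tries to break on the configured separators, so its chunks are usually shorter than `chunk_size` and the overlap is approximate. The helper name `chunk_with_overlap` and the sample string are made up for illustration.

```python
def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Tiny illustrative input (hypothetical, not the notebook data)
sample = "abcdefghijklmnopqrstuvwxyz"
demo_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(demo_chunks)
```

Each chunk starts with the last three characters of the previous one, which is the same idea the 200-character overlap serves in the cell above.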

In [15]:
import os

os.environ['HuggingFaceHub_API_Token']= '--y'
os.environ['GOOGLE_API_KEY']= "---"
os.environ['cohere_api_key'] = "-ui"



# Embeddings Creation

This process involves converting textual information from documents into dense, high-dimensional vectors known as embeddings. 

These embeddings are designed to capture the semantic meaning of words, sentences, or even entire documents, enabling the Q&A system to understand and process natural language more effectively.
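
As a rough illustration of how embeddings let a system compare meaning, the sketch below computes cosine similarity between small hand-written vectors. The vectors and their labels are invented for the example; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions and come from the model, not from hand-picked numbers.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real sentence embeddings
tea_party = [0.9, 0.1, 0.0]
boston_harbor = [0.8, 0.2, 0.1]
railroads = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(tea_party, boston_harbor)
sim_unrelated = cosine_similarity(tea_party, railroads)
print(sim_related > sim_unrelated)
```

Texts on related topics are expected to end up with vectors that point in similar directions, which is what the retrieval step later relies on.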



In [16]:

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')


# Indexing

The Facebook AI Similarity Search (FAISS) library provides efficient similarity search over large datasets, and is especially useful for high-dimensional vectors such as text embeddings. In document-based Q&A, FAISS indexes the embeddings of document chunks (e.g., paragraphs, sentences) so that the chunks closest to a query can be retrieved quickly.

In [17]:
# Indexing the data using FAISS
vectorstore = FAISS.from_texts(chunks, embedding = embeddings)

# Retriever

The retriever utilizes the pre-indexed embeddings of document chunks, searching through them to find the most relevant pieces of content in response to a user query. 

In [18]:
# creating retriever
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
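
Conceptually, a similarity retriever with `k` set ranks the stored vectors by distance to the query vector and returns the closest `k`. The sketch below does this exhaustively with Euclidean distance in plain Python. It is a toy stand-in, not the FAISS implementation; as far as I understand, the LangChain wrapper builds an exact (flat) L2 index by default, but treat that as an assumption. The names `l2_distance`, `top_k`, and `toy_index`, and the 2-dimensional vectors, are all hypothetical.

```python
import heapq
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k entries whose vectors are closest to query_vec."""
    closest = heapq.nsmallest(k, indexed, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in closest]


# Made-up index of (chunk text, vector) pairs
toy_index = [
    ("Boston Tea Party", [0.9, 0.1]),
    ("Coercive Acts", [0.8, 0.3]),
    ("Railroads in the Civil War", [0.1, 0.9]),
]

hits = top_k([1.0, 0.0], toy_index, k=2)
print(hits)
```

Raising `k` gives the LLM more context to work with at the cost of a longer prompt, which is the trade-off behind the choice of `k` in the retriever above.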

Some possible questions:

* How did economic differences between the North and South contribute to the Civil War?
* To what extent did the abolitionist movement influence the start of the Civil War?
* How did the election of Abraham Lincoln in 1860 act as a catalyst for secession?

In [25]:
retrieved_docs = retriever.invoke("What role did the Boston Massacre and Boston Tea Party play in escalating tensions leading to war?")

In [26]:
len(retrieved_docs)

6

In [27]:
print(retrieved_docs[0].page_content)

followers approached the three ships. Some were disguised as Mohawks. Protected by a crowd of
spectators, they systematically dumped all the tea into the harbor, destroying goods worth almost $1
million in today’s dollars, a very significant loss. This act soon inspired further acts of resistance up and
down the East Coast. However, not all colonists, and not even all Patriots, supported the dumping of the
tea. The wholesale destruction of property shocked people on both sides of the Atlantic.
To learn more about the Boston Tea Party, explore the extensive resources in the
Boston Tea Party Ships and Museum collection (http://openstaxcollege.org/l/
teapartyship) of articles, photos, and video. At the museum itself, you can board
replicas of the Eleanor and the Beaver and experience a recreation of the dumping of
the tea.
PARLIAMENT RESPONDS: THE COERCIVE ACTS
In London, response to the destruction of the tea was swift and strong. The violent destruction of property


# LLM Models

A Large Language Model (LLM) is a type of artificial intelligence model trained on vast amounts of textual data to understand, generate, and interact in human-like language. LLMs, such as GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers), are based on deep learning architectures, particularly transformers, which allow them to process and generate contextually coherent responses.

**RAG (Retrieval-Augmented Generation)** combines the power of LLMs with external retrieval mechanisms to enhance question-answering capabilities by fetching relevant information from external knowledge sources.

## Cohere LLM

In [28]:
prompt_template = """Answer the question as precisely as possible using the provided context. If the answer is
                not contained in the context, say "Answer not available in context" \n\n
                Context: \n {context}\n
                Question: \n {question} \n
                Answer:"""

prompt = PromptTemplate.from_template(template=prompt_template)

In [29]:
# Join the page contents of the documents returned by the FAISS retriever into a single string.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [30]:
# RAG Chain

def generate_answer(question):
    cohere_llm = Cohere(model="command", temperature=0.1, cohere_api_key = os.getenv('cohere_api_key'))
    
    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | cohere_llm
        | StrOutputParser()
    )
    
    return rag_chain.invoke(question)
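
The pipe syntax in the chain can be read as a sequence of ordinary function calls: retrieve, format, fill the prompt, call the model. The sketch below spells that out with stand-ins so it runs without an API key. Everything in it (`fake_retriever`, `format_passages`, `build_prompt`, `fake_llm`, `answer`) is hypothetical and only mirrors the shape of the chain, not LangChain's internals.

```python
def fake_retriever(question: str) -> list[str]:
    # Stand-in for retriever.invoke(question); always returns the same passages
    return [
        "The tea was dumped into the harbor.",
        "Parliament responded with the Coercive Acts.",
    ]


def format_passages(passages: list[str]) -> str:
    # Same idea as format_docs: join passages with blank lines in between
    return "\n\n".join(passages)


def build_prompt(context: str, question: str) -> str:
    # Stand-in for the prompt template: fills the two placeholders
    return f"Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"


def fake_llm(prompt_text: str) -> str:
    # Stand-in for the model call; reports the prompt size instead of answering
    return f"(model output for a prompt of {len(prompt_text)} characters)"


def answer(question: str) -> str:
    context = format_passages(fake_retriever(question))  # retrieval and formatting step
    prompt_text = build_prompt(context, question)         # prompt step
    return fake_llm(prompt_text)                          # model step


print(answer("What happened after the Boston Tea Party?"))
```

Note that the question is used twice: once to drive retrieval and once, unchanged, inside the prompt. In the chain this is the job of the dictionary with `RunnablePassthrough()`.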

# Results 

In [31]:
ans = generate_answer("What role did the Boston Massacre and Boston Tea Party play in escalating tensions leading to war?")
print(ans)


 The Boston Massacre and Boston Tea Party were pivotal events in the lead-up to the American War of Independence. The Boston Tea Party was a significant act of rebellion against British taxation policies, which Dumped tea worth millions of dollars into Boston Harbor as a protest against British tax laws. This event inspired further acts of resistance and rebellion against British rule and taxation without representation. The Boston Massacre, on the other hand, was a violent confrontation between British troops and colonists in 1770, which resulted in the deaths of several colonists. This event further escalated tensions between the British and the colonists and was widely portrayed as an act of aggression and oppression by the British establishment, fuelling anti-British sentiment and fostering unity and patriotism among the colonists. These two events played a crucial role in escalating tensions and provoking war. 


In [32]:
ans = generate_answer("How did the American Revolution influence other independence movements around the world, such as in France or Latin America?")
print(ans)

 The American Revolution inspired other independence movements by demonstrating that it was possible for colonies to successfully fight against oppressive imperial powers and win their freedom. This idea was especially appealing to colonies in Latin America and France, who were also fighting for their independence from Spain and France, respectively. 

Additionally, the American Revolution established the model for modern democratic republics, showcasing the idea that governments are created by the people and for the people. This inspired people around the world to fight for their own governments, regardless of their current oppressive circumstances. 

Overall, the American Revolution influenced many other independence movements by showing that freedom and self-governance were possible with enough courage and determination. 


In [35]:
ans = generate_answer("How are the key figures of the American Revolution, such as George Washington, Thomas Jefferson, and Benjamin Franklin, remembered differently today?")
print(ans)

 While all three are remembered as being key players in the American Revolution and forging the path for the United States to succeed, they are all remembered differently. George Washington is often remembered as one of the greatest American generals and presidents, largely due to his role as the Commander in Chief of the Continental Army and his two terms as the first President of the United States. Thomas Jefferson is often lauded for his political philosophy and writing the Declaration of Independence, as well as his role in shaping the early republic with the Louisiana Purchase and other political moves. Benjamin Franklin is remembered for his political savvy, as well as his scientific discoveries and inventions like the Franklin stove and his work with electricity. While all three have their own places in the annals of history, all are remembered differently and for different accomplishments. 


In [36]:
ans = generate_answer("What were the immediate challenges faced by the new United States government after achieving independence?")
print(ans)

 The immediate challenges faced by the new United States government after achieving independence were: securing international recognition and legitimacy, establishing effective governance and institutions, managing foreign relations with European powers, addressing economic instability and national debt, and maintaining unity and political stability among the states. These challenges would test the resilience and capacity of the newly independent nation as it sought to establish itself in the international sphere. 


In [37]:
ans = generate_answer("What were the key provisions of the Treaty of Paris (1783), and how did it shape the new United States?")
print(ans)

 The Treaty of Paris, signed in 1783, recognized the independence of the United States and defined its western, eastern, northern, and southern boundaries. It also granted fishing rights to New Englanders in the waters off Newfoundland. Additionally, the treaty encouraged individual states to treat Loyalists fairly and return their confiscated property. 

The Treaty of Paris significantly shaped the new United States by establishing its official independence from Great Britain and defining its territorial boundaries. This set the stage for the country's future growth and development as a sovereign nation.  It also encouraged the treatment of Loyalists, allowing them to return to the country and recover their assets. 


In [38]:
ans = generate_answer("What role did technological advancements (like the use of railroads and telegraphs) play in the Civil War?")
print(ans)

 The use of railroads and telegraphs played a significant role in the Civil War, as it allowed for improved communication and transportation for the Northern forces. The railroad grid in the North helped supply Union troops with food and war materials, while also allowing them to move more quickly than the South. The telegraph also allowed for the dissemination of news at a faster pace than previously possible. This helped to galvanize popular support for the Union war effort, and also ensured that troops could be moved quickly to wherever they were needed. 

While the Confederacy also had railroads and telegraphs, they were unable to match the scale of those in the North, and their economies were unable to keep up with the increased demand for resources needed for the war effort. 

Thus, technological advancements played a key role in the Union victory in the Civil War. 


# Conclusion

In conclusion, this Kaggle notebook demonstrates Retrieval-Augmented Generation (RAG) for a multi-document question-answering system. It combines retrieval and generation to provide context-aware answers sourced from multiple documents. The example queries above illustrate how the pipeline handles a range of questions over the combined document set.