# Building a Multi PDF RAG Chatbot

## Libs

In [None]:
from PyPDF2 import PdfReader
import fitz # PyMuPDF
import pytesseract
from PIL import Image
import io

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings

# for efficient similarity search of vectors, which is useful for finding information quickly in large datasets
from langchain_community.vectorstores import FAISS 
from langchain.tools.retriever import create_retriever_tool
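# langchain.llms is deprecated in recent LangChain releases; the community package exposes the same class
from langchain_community.llms import HuggingFacePipeline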

import torch
import transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"

## Overview

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/0*Q6HOo4_KyCnyFkbf.png' width=500>

## Reading and Processing PDF Files

When a user uploads one or more PDF files, the application reads each page of these documents and extracts the text, merging it into a single continuous string.

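Once the text is extracted, it is split into overlapping chunks of at most 500 characters, with up to 200 characters shared between neighbouring chunks, so that each piece is small enough to embed and retrieve.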

In [7]:
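def pdf_read_PyPDF2(pdf_doc):
    """Concatenate the text of every page of every PDF using PyPDF2."""
    text = ''
    for pdf in pdf_doc:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            # some PyPDF2 versions return None for pages without a text layer
            text += page.extract_text() or ''
    return text

def pdf_read_PyMuPDF(pdf_doc):
    """Concatenate the text of every page of every PDF using PyMuPDF."""
    text = ''
    for pdf in pdf_doc:
        if isinstance(pdf, str): # read with path
            pdf_reader = fitz.open(pdf)
        else: # read with file object; name the type explicitly for in-memory data
            pdf_reader = fitz.open(stream=pdf.read(), filetype='pdf')

        for page in pdf_reader:
            text += page.get_text()
        pdf_reader.close()
    return text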

In [17]:
pdf_docs = ['./demo_pdf_file/CV-Ho-Dang-Cao.pdf', './demo_pdf_file/academic_transcript.pdf']
print(pdf_read_PyPDF2([pdf_docs[0]])[:750])

Hồ Đăng Cao
AI Engineer Intern
OBJECTIVE
I am eager to apply for the AI Engineer Intern position at TMA Tech Group. I aspire to 
apply my academic knowledge in real-world scenarios while learning and further  
developing my skills under the guidance of industry experts. I have continuously been  
reading science papers, learning and working on projects in this field every day. With 
enthusiasm and a growth-oriented mindset, I am confident in my ability to contribute to 
the company’ s success and simultaneously advance my professional growth in a creative
and challenging environment.
SKILLS
Programming language: Python, SQL.  
Data crawling:  Selenium, BeautifulSoup.
Data pr ocessing: Numpy , Pandas, Excel.
Machine learning:  Pytorch, Sciki


In [18]:
raw_text = pdf_read_PyMuPDF([pdf_docs[0]])
print(raw_text[:750])

Hồ Đăng Cao
AI Engineer Intern
OBJECTIVE
I am eager to apply for the AI Engineer Intern position at TMA Tech Group. I aspire to 
apply my academic knowledge in real-world scenarios while learning and further 
developing my skills under the guidance of industry experts. I have continuously been 
reading science papers, learning and working on projects in this field every day. With 
enthusiasm and a growth-oriented mindset, I am confident in my ability to contribute to 
the company’s success and simultaneously advance my professional growth in a creative
and challenging environment.
SKILLS
Programming language: Python, SQL. 
Data crawling: Selenium, BeautifulSoup.
Data processing: Numpy, Pandas, Excel.
Machine learning: Pytorch, Scikit-learn,


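`Note:`
- PyMuPDF produces cleaner text: the PyPDF2 output contains stray spaces inside and around words (for example `Data pr ocessing` and `Numpy ,`).
- PyPDF2 needs slightly less code.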

In [21]:
pdf_read_PyMuPDF([pdf_docs[1]])

''

## Upgrading to read scanned PDF files

Scanned PDFs typically contain page images rather than a text layer, so regular text extraction libraries such as PyPDF2, pdfplumber, and PyMuPDF return nothing for them (the empty string above). For scanned PDFs, use Optical Character Recognition (OCR) to extract text from the page images.

In [None]:
def evolved_pdf_read_PyMuPDF(pdf_doc: list, ocr_config: str):
    """Extract text with PyMuPDF and fall back to Tesseract OCR on pages without text."""
    text = ''
    for pdf in pdf_doc:
        if isinstance(pdf, str): # read with path
            pdf_reader = fitz.open(pdf)
        else: # read with file object; name the type explicitly for in-memory data
            pdf_reader = fitz.open(stream=pdf.read(), filetype='pdf')

        for page in pdf_reader:
            text_page = page.get_text()

            if not text_page.strip(): # if no text => scanned file
                # render the page to an image at the default resolution;
                # a higher resolution may improve OCR quality
                pix = page.get_pixmap()
                img = Image.open(io.BytesIO(pix.tobytes('png')))
                text_page = pytesseract.image_to_string(img, config=ocr_config)

            text += text_page
        pdf_reader.close()
    return text.strip()

In [55]:
ocr_config = r' --psm 11 --oem 3'

raw_text = evolved_pdf_read_PyMuPDF([pdf_docs[1]], ocr_config)
print(raw_text[:100])
  

ye

=

ose NATIONAL UNIVERSITY -HeMC

SOCIALIST REPUBLIC OF VIETNAM

i

UNIVERSITY OF SCIENCE,

Inde
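The OCR fallback in `evolved_pdf_read_PyMuPDF` can be exercised without a PDF or a Tesseract installation. The sketch below is a minimal illustration, assuming two hypothetical callables, `extract_text` and `run_ocr`, that stand in for `page.get_text()` and `pytesseract.image_to_string`; the function name and the dictionary-based "pages" are also made up for this example.

```python
from typing import Callable, Iterable


def read_pages_with_ocr_fallback(
    pages: Iterable[object],
    extract_text: Callable[[object], str],
    run_ocr: Callable[[object], str],
) -> str:
    """Join page texts, running OCR only on pages with no text layer."""
    parts = []
    for page in pages:
        page_text = extract_text(page) or ""
        if not page_text.strip():  # empty text layer: treat the page as scanned
            page_text = run_ocr(page)
        parts.append(page_text)
    return "".join(parts).strip()


# Toy stand-ins: each "page" is a dict (hypothetical structure, for illustration only).
pages = [
    {"text": "Typed page. ", "image_text": "never used"},
    {"text": "   ", "image_text": "Scanned page."},
]
result = read_pages_with_ocr_fallback(
    pages,
    extract_text=lambda p: p["text"],
    run_ocr=lambda p: p["image_text"],
)
print(result)  # Typed page. Scanned page.
```

Keeping the decision separate from the libraries makes it easy to check that OCR, which is slow, runs only on pages that need it.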


## Generating chunks

In [19]:
def get_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
    chunks = text_splitter.split_text(text)
    return chunks

In [20]:
text_chunks = get_chunks(raw_text)
text_chunks

['Hồ Đăng Cao\nAI Engineer Intern\nOBJECTIVE\nI am eager to apply for the AI Engineer Intern position at TMA Tech Group. I aspire to \napply my academic knowledge in real-world scenarios while learning and further \ndeveloping my skills under the guidance of industry experts. I have continuously been \nreading science papers, learning and working on projects in this field every day. With \nenthusiasm and a growth-oriented mindset, I am confident in my ability to contribute to',
 'reading science papers, learning and working on projects in this field every day. With \nenthusiasm and a growth-oriented mindset, I am confident in my ability to contribute to \nthe company’s success and simultaneously advance my professional growth in a creative\nand challenging environment.\nSKILLS\nProgramming language: Python, SQL. \nData crawling: Selenium, BeautifulSoup.\nData processing: Numpy, Pandas, Excel.\nMachine learning: Pytorch, Scikit-learn, Statsmodel, Tensorflow, Keras.',
 'SKILLS\nProgrammi
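The repeated sentences at the start of each chunk above come from `chunk_overlap`. As a rough illustration, here is a simplified fixed-width splitter; it is not the algorithm `RecursiveCharacterTextSplitter` uses (that one also tries to break on separators such as newlines), and the function name `sliding_window_chunks` is hypothetical.

```python
from typing import List


def sliding_window_chunks(text: str, chunk_size: int = 500, chunk_overlap: int = 200) -> List[str]:
    """Fixed-width chunking with overlap (simplified: ignores separators)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # the window already reached the end
            break
    return chunks


sample = "".join(str(i % 10) for i in range(1200))  # 1200 characters of dummy text
chunks = sliding_window_chunks(sample)
print([len(c) for c in chunks])  # [500, 500, 500, 300]
print(chunks[0][-200:] == chunks[1][:200])  # True: neighbours share 200 characters
```

The overlap means a sentence cut at a chunk boundary still appears whole in one of the two neighbouring chunks, at the cost of storing some text twice.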

## Creating a Searchable Text Database and Making Embeddings

The application embeds each text chunk as a vector with spaCy (`en_core_web_sm`), indexes the vectors with FAISS, and saves the index locally in the `faiss_db` folder.

In [6]:
embeddings = SpacyEmbeddings(model_name='en_core_web_sm')

def vector_store(text_chunks):
    vector_store = FAISS.from_texts(text_chunks, embedding=embeddings)
    vector_store.save_local('faiss_db')
    
vector_store(text_chunks)
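To make the idea of a vector store concrete, the sketch below performs a brute-force nearest-neighbour search in plain Python. It assumes Euclidean (L2) distance, which is the default for LangChain's FAISS wrapper as far as I know; the three-dimensional vectors are hand-made placeholders rather than spaCy embeddings, and the helper names `l2_distance` and `top_k` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    chunks: Sequence[str],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k chunks whose vectors are closest to the query vector."""
    scored = [(chunk, l2_distance(query_vec, vec)) for chunk, vec in zip(chunks, chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Hand-made toy "embeddings" (hypothetical values, not produced by spaCy)
chunks = ["GPA: 8.18/10", "Data crawling: Selenium", "Big data: Hadoop"]
chunk_vecs = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.9]]
query_vec = [1.0, 0.0, 0.0]

best = top_k(query_vec, chunk_vecs, chunks, k=1)
print(best[0][0])  # GPA: 8.18/10
```

FAISS does the same job with optimized index structures, which matters once the number of chunks grows beyond what a Python loop can scan quickly.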

In [None]:
new_db = FAISS.load_local('faiss_db', embeddings, allow_dangerous_deserialization=True)
retriever = new_db.as_retriever(search_kwargs={"k": 1}) # k=1 returns only the single best-matching chunk.
retriever_chain = create_retriever_tool(retriever, 'pdf_extractor', 'This tool is to give answer to queries from the pdf')
print(retriever_chain.name)
print(retriever_chain.description)

pdf_extractor
This tool is to give answer to queries from the pdf


In [8]:
queries = ['show GPA','list the projects titles','December']
new_db.similarity_search(queries[2])

[Document(metadata={}, page_content='Programming language: Python, SQL.  \nData crawling:  Selenium, BeautifulSoup.\nData pr ocessing: Numpy , Pandas, Excel.\nMachine learning:  Pytorch, Scikit-learn, Statsmodel, T ensorflow , Keras.\nBig data: Hadoop, Pyspark.  \nDataBase : MSSQL, MongoDB  \ue11b 097 2367 154\n✉ dangcaoho151202@gmail.com\n\uf0ac https://github.com/hodangcao  \nEDUCA TION\nUniversity of Science-VNUHCM       2020 -2024\nINFORMA TION TECNOLOGY\n• Major: Data Science (High-Quality Program)\n  • GPA: 8.18/10 \n \nAWARDS & CER TIFICA TES'),
 Document(metadata={}, page_content='reading science papers, learning and working on projects in this field every day. With \nenthusiasm and a growth-oriented mindset, I am confident in my ability to contribute to \nthe company’ s success and simultaneously advance my professional growth in a creative\nand challenging environment.\nSKILLS\nProgramming language: Python, SQL.  \nData crawling:  Selenium, BeautifulSoup.\nData pr ocessing: N

In [9]:
retriever.invoke(queries[2])  # replaces get_relevant_documents, which is deprecated in recent LangChain releases

[Document(metadata={}, page_content='Programming language: Python, SQL.  \nData crawling:  Selenium, BeautifulSoup.\nData pr ocessing: Numpy , Pandas, Excel.\nMachine learning:  Pytorch, Scikit-learn, Statsmodel, T ensorflow , Keras.\nBig data: Hadoop, Pyspark.  \nDataBase : MSSQL, MongoDB  \ue11b 097 2367 154\n✉ dangcaoho151202@gmail.com\n\uf0ac https://github.com/hodangcao  \nEDUCA TION\nUniversity of Science-VNUHCM       2020 -2024\nINFORMA TION TECNOLOGY\n• Major: Data Science (High-Quality Program)\n  • GPA: 8.18/10 \n \nAWARDS & CER TIFICA TES'),
 Document(metadata={}, page_content='reading science papers, learning and working on projects in this field every day. With \nenthusiasm and a growth-oriented mindset, I am confident in my ability to contribute to \nthe company’ s success and simultaneously advance my professional growth in a creative\nand challenging environment.\nSKILLS\nProgramming language: Python, SQL.  \nData crawling:  Selenium, BeautifulSoup.\nData pr ocessing: N

In [10]:
retriever_chain.invoke(queries[2])

"Programming language: Python, SQL.  \nData crawling:  Selenium, BeautifulSoup.\nData pr ocessing: Numpy , Pandas, Excel.\nMachine learning:  Pytorch, Scikit-learn, Statsmodel, T ensorflow , Keras.\nBig data: Hadoop, Pyspark.  \nDataBase : MSSQL, MongoDB  \ue11b 097 2367 154\n✉ dangcaoho151202@gmail.com\n\uf0ac https://github.com/hodangcao  \nEDUCA TION\nUniversity of Science-VNUHCM       2020 -2024\nINFORMA TION TECNOLOGY\n• Major: Data Science (High-Quality Program)\n  • GPA: 8.18/10 \n \nAWARDS & CER TIFICA TES\n\nreading science papers, learning and working on projects in this field every day. With \nenthusiasm and a growth-oriented mindset, I am confident in my ability to contribute to \nthe company’ s success and simultaneously advance my professional growth in a creative\nand challenging environment.\nSKILLS\nProgramming language: Python, SQL.  \nData crawling:  Selenium, BeautifulSoup.\nData pr ocessing: Numpy , Pandas, Excel.\nMachine learning:  Pytorch, Scikit-learn, Statsmod

## Setting Up the Conversational AI

- **AI Configuration**: The app sets up a conversational AI to answer questions based on the PDF content it has processed.
- **Conversation Chain**: The AI uses a prompt template to combine the retrieved context with the user's query. The system prompt instructs the model to reply “answer is not available in the context” when the answer isn’t in the text, which is meant to reduce incorrect answers. A small model such as `gpt2` may not follow this instruction reliably.
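The two points above amount to building one prompt string from the retrieved chunks, the question, and a fallback instruction. Here is a minimal sketch in plain Python, assuming a hypothetical helper `build_prompt`; the instruction wording paraphrases the system prompt used in this notebook and is not a LangChain API.

```python
from typing import List

FALLBACK = "answer is not available in the context"


def build_prompt(question: str, retrieved_chunks: List[str]) -> str:
    """Combine retrieved context and the user question into one prompt string."""
    # Drop empty chunks and separate the rest with blank lines
    context = "\n\n".join(chunk.strip() for chunk in retrieved_chunks if chunk.strip())
    return (
        "You are a helpful assistant. Answer only from the context below. "
        f'If the answer is not in the context, reply "{FALLBACK}".\n\n'
        f"Context:\n{context or '(no context retrieved)'}\n\n"
        f"Question: {question}\nAnswer:"
    )


prompt_text = build_prompt("What is the GPA?", ["GPA: 8.18/10", "  "])
print(prompt_text)
```

Ending the prompt with `Answer:` gives a plain text-generation model a natural place to continue from, although it does not guarantee that the model respects the fallback instruction.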

In [11]:
# Load the model and tokenizer locally from Hugging Face
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, clean_up_tokenization_spaces=True)

# Create the Hugging Face pipeline for conversational tasks with specified max_length
generation_pipeline = pipeline(
    "text-generation", 
    model=model, 
    tokenizer=tokenizer, 
    device='cuda' if torch.cuda.is_available() else 'cpu',
    # max_length=1024,
    max_new_tokens=200, 
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token, so reuse EOS (id 50256)
)

# Wrap the pipeline with HuggingFacePipeline to integrate with LangChain
llm = HuggingFacePipeline(pipeline=generation_pipeline)

In [12]:
prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a helpful assistant. Answer the question as detailed as possible from the provided context, make sure to provide all the details, if the answer is not in provided context just say, "answer is not available in the context", don't provide the wrong answer"""),
        ("placeholder", "{chat_history}"),
        ("human", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ])

In [13]:
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are an AI assistant with access to the following tool: {tool_name}. {tool_description}"),
    ("user", "Given document: {related_chunk}"), 
    ("user", "{query}"),
    ("assistant", "Let me think...")
])

In [None]:
def get_conversational_chain(retriever_chain, query):
    tool_response = retriever_chain.invoke({'query': query})
    
    # Check if tool_response is a string or dict and access accordingly
    if isinstance(tool_response, dict):
        agent_scratchpad = tool_response.get('output', "No output found.")
    else:
        agent_scratchpad = tool_response

    # Combine tool response and context with user input and prompt
    formatted_prompt = prompt.format(
        tool_name=retriever_chain.name,
        tool_description=retriever_chain.description,
        related_chunk=agent_scratchpad,
        query=query,
    )

    # Call the LLM with the formatted prompt
    response = llm.invoke(formatted_prompt, temperature=0.7)
    return response[len(formatted_prompt):].strip()

query = 'list the projects titles'
response = get_conversational_chain(retriever_chain, query)
print(response)

My main goal for this  august position is to learn how the project is progressing and become a part of  
the human side of the   ecosystem. I want to assist     other            I. e. find new                   
I am already an engineer but I need to keep up with             , and I want to provide my advices in               
How does the project will be organized              You can find           
The first step of          a good            t
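`get_conversational_chain` slices the first `len(formatted_prompt)` characters off the response on the assumption that the model echoes the prompt. Hugging Face text-generation pipelines return the prompt followed by the continuation by default, but a wrapper may already strip it depending on the library version, in which case the slice would discard real output. A more defensive variant, sketched here with a hypothetical helper name, removes the prompt only when it is actually present.

```python
def strip_prompt_echo(generated: str, prompt: str) -> str:
    """Remove a leading copy of the prompt, if any, and trim whitespace."""
    if generated.startswith(prompt):
        generated = generated[len(prompt):]
    return generated.strip()


prompt_text = "Question: list the project titles\nAnswer:"

# Case 1: the model output starts with the prompt
print(strip_prompt_echo(prompt_text + " Project A, Project B ", prompt_text))  # Project A, Project B

# Case 2: the prompt was already removed upstream, so nothing is cut off
print(strip_prompt_echo("Project A, Project B", prompt_text))  # Project A, Project B
```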


## Running app

In [56]:
!streamlit run app.py

^C


## Build docker

In [None]:
# !docker build -t pdf_rag .
# !docker run -it pdf_rag

## Reference

[Building a Multi PDF RAG Chatbot: Langchain, Streamlit with code](https://blog.gopenai.com/building-a-multi-pdf-rag-chatbot-langchain-streamlit-with-code-d21d0a1cf9e5)