# Building a Multi PDF RAG Chatbot

## Libraries

In [None]:
from PyPDF2 import PdfReader  # needed by pdf_read_PyPDF2 below
import fitz # PyMuPDF
import pytesseract
from PIL import Image
import io

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.spacy_embeddings import SpacyEmbeddings

# FAISS performs efficient vector similarity search, which keeps retrieval fast on large datasets
from langchain_community.vectorstores import FAISS
from langchain.tools.retriever import create_retriever_tool

## Overview

<img src='https://miro.medium.com/v2/resize:fit:1100/format:webp/0*Q6HOo4_KyCnyFkbf.png' width=500>

## Reading and Processing PDF Files

When a user uploads one or more PDF files, the application reads each page of these documents and extracts the text, merging it into a single continuous string.

Once the text is extracted, it is split into manageable chunks of at most 500 characters, with up to 200 characters of overlap between consecutive chunks (see `get_chunks`).

In [10]:
def pdf_read_PyPDF2(pdf_doc):
    """Extract text from every page of each PDF using PyPDF2."""
    text = ''
    for pdf in pdf_doc:
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text() or ''  # guard against pages with no text
    return text

def pdf_read_PyMuPDF(pdf_doc):
    """Extract text from every page of each PDF using PyMuPDF."""
    text = ''
    for pdf in pdf_doc:
        if isinstance(pdf, str): # read with path
            pdf_reader = fitz.open(pdf)
        else: # read with file object
            pdf_reader = fitz.open(stream=pdf.read(), filetype='pdf')

        for page_num in range(pdf_reader.page_count):
            page = pdf_reader[page_num]
            text += page.get_text()
    return text

pdf_docs = ['./demo_pdf_file/CV-Ho-Dang-Cao.pdf', './demo_pdf_file/academic_transcript.pdf']

In [None]:
print(pdf_read_PyPDF2([pdf_docs[0]])[:750])

Hồ Đăng Cao
AI Engineer Intern
OBJECTIVE
I am eager to apply for the AI Engineer Intern position at TMA Tech Group. I aspire to 
apply my academic knowledge in real-world scenarios while learning and further  
developing my skills under the guidance of industry experts. I have continuously been  
reading science papers, learning and working on projects in this field every day. With 
enthusiasm and a growth-oriented mindset, I am confident in my ability to contribute to 
the company’ s success and simultaneously advance my professional growth in a creative
and challenging environment.
SKILLS
Programming language: Python, SQL.  
Data crawling:  Selenium, BeautifulSoup.
Data pr ocessing: Numpy , Pandas, Excel.
Machine learning:  Pytorch, Sciki


In [11]:
raw_text = pdf_read_PyMuPDF([pdf_docs[0]])
print(raw_text[:750])

Hồ Đăng Cao
AI Engineer Intern
OBJECTIVE
I am eager to apply for the AI Engineer Intern position at TMA Tech Group. I aspire to 
apply my academic knowledge in real-world scenarios while learning and further 
developing my skills under the guidance of industry experts. I have continuously been 
reading science papers, learning and working on projects in this field every day. With 
enthusiasm and a growth-oriented mindset, I am confident in my ability to contribute to 
the company’s success and simultaneously advance my professional growth in a creative
and challenging environment.
SKILLS
Programming language: Python, SQL. 
Data crawling: Selenium, BeautifulSoup.
Data processing: Numpy, Pandas, Excel.
Machine learning: Pytorch, Scikit-learn,


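`Note:`
- PyMuPDF produces cleaner text: the PyPDF2 output above contains stray spaces (`company’ s`, `Data pr ocessing`, `Numpy ,`).
- The PyPDF2 code is simpler.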

In [12]:
pdf_read_PyMuPDF([pdf_docs[1]])

''

## Upgrading to read scanned PDF files

Scanned PDFs typically store each page as an image rather than as text, so text extraction libraries such as PyPDF2, pdfplumber, and PyMuPDF return nothing for them (as the empty result above shows). For scanned PDFs, use Optical Character Recognition (OCR) to extract text from the page images.
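The per-page decision can be sketched independently of any PDF library. This is a minimal illustration, not the notebook's implementation: `extract_with_ocr_fallback`, `get_text`, and `run_ocr` are hypothetical names, and the pages are plain dictionaries standing in for real PDF pages.

```python
from typing import Callable, Iterable


def extract_with_ocr_fallback(
    pages: Iterable[object],
    get_text: Callable[[object], str],
    run_ocr: Callable[[object], str],
) -> str:
    """Use the embedded text layer when present, otherwise fall back to OCR."""
    parts = []
    for page in pages:
        text = get_text(page)
        if not text.strip():  # empty text layer: treat the page as scanned
            text = run_ocr(page)
        parts.append(text)
    return "".join(parts).strip()


# Stand-in pages (assumption for the demo): dicts instead of real PDF pages.
pages = [
    {"text": "Typed page. ", "scan": ""},
    {"text": "  ", "scan": "Scanned page."},
]
result = extract_with_ocr_fallback(
    pages,
    get_text=lambda p: p["text"],
    run_ocr=lambda p: p["scan"],
)
print(result)
```

Passing the extractors in as callables keeps the fallback rule testable without installing Tesseract; in practice they would wrap the PyMuPDF and `pytesseract` calls.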

In [13]:
def evolved_pdf_read_PyMuPDF(pdf_doc: list, ocr_config: str):
    """Extract text with PyMuPDF, falling back to Tesseract OCR on pages without a text layer."""
    text = ''
    for pdf in pdf_doc:
        if isinstance(pdf, str): # read with path
            pdf_reader = fitz.open(pdf)
        else: # read with file object
            pdf_reader = fitz.open(stream=pdf.read(), filetype='pdf')

        for page_num in range(pdf_reader.page_count):
            page = pdf_reader[page_num]
            text_page = page.get_text()

            if not text_page.strip(): # if no text => scanned page
                pix = page.get_pixmap() # render the page to an image
                img = Image.open(io.BytesIO(pix.tobytes('png')))
                text_page = pytesseract.image_to_string(img, config=ocr_config)

            text += text_page
    return text.strip()

In [14]:
ocr_config = r' --psm 11 --oem 3'

raw_text = evolved_pdf_read_PyMuPDF([pdf_docs[1]], ocr_config)
print(raw_text[:100])

ye

=

ose NATIONAL UNIVERSITY -HeMC

SOCIALIST REPUBLIC OF VIETNAM

i

UNIVERSITY OF SCIENCE,

Inde


## Generating chunks

In [15]:
def get_chunks(text):
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=200)
    chunks = text_splitter.split_text(text)
    return chunks
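To see what `chunk_size` and `chunk_overlap` mean, here is a simplified fixed-window sketch. It is an assumption-laden illustration only: `sliding_chunks` is a hypothetical helper, and `RecursiveCharacterTextSplitter` itself first tries to split on separators (paragraphs, lines, spaces) rather than cutting at fixed offsets.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):  # last window reached the end
            break
    return chunks


demo = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(demo)
```

Each window repeats the last two characters of the previous one. The overlap serves the same purpose in the real splitter: a sentence cut at a chunk boundary still appears intact in one of the neighbouring chunks.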

In [16]:
text_chunks = get_chunks(raw_text)
text_chunks

['ye\n\n=\n\nose NATIONAL UNIVERSITY -HeMC\n\nSOCIALIST REPUBLIC OF VIETNAM\n\ni\n\nUNIVERSITY OF SCIENCE,\n\nIndependence - Fresom = Happiness\n\ni\n\n12\n\nACADEMIC TRANSCRIPT\n\nFall mame of student: HO DANG CAO\n\nStudent 1D: 20127482\n\nCourse: 2020-2024\n\nDate of bith\n\nDecember 15,2002,\n\nProgram\n\nachelor of Science\n\nPlace of bith\n\nHo Chi Mink City\n\nMajoe: Information Technology\n\n1 Jeourse 10\n\nCourse title\n\ncredits\n\nTO-Point | &Point\n\n‘grade\n\ngrade\n\nT]BAAO0003 | Hoth\n\n20\n\n200\n\n350\n\nFe Meolony\n\n2] BAAo0\n\nGeneral law',
 'Ho Chi Mink City\n\nMajoe: Information Technology\n\n1 Jeourse 10\n\nCourse title\n\ncredits\n\nTO-Point | &Point\n\n‘grade\n\ngrade\n\nT]BAAO0003 | Hoth\n\n20\n\n200\n\n350\n\nFe Meolony\n\n2] BAAo0\n\nGeneral law\n\n30\n\n750\n\n325\n\n3] BAA0000S | Basie Economics\n\n20\n\n3.00\n\n400\n\n4|paaoo02r | Gymnastics 1\n\n4]\n\n20\n\n780\n\n340\n\n20\n\n310\n\n5|BAA000%2 | Gymnasties2\n\n400\n\n£6|BAA00030 | Nations Defence Educat

## Creating a Searchable Text Database and Making Embeddings

The application embeds each text chunk as a vector with spaCy (`en_core_web_sm`), builds a FAISS index from these vectors, and saves the index locally in the `faiss_db` folder.
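Conceptually, retrieval is a nearest-neighbour search over these vectors. The sketch below does it by brute force in plain Python, assuming Euclidean distance; the three-dimensional vectors and the helper names `l2_distance` and `top_k` are made up for illustration and do not come from spaCy or FAISS.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, store, k=1):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(store, key=lambda item: l2_distance(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Toy "index": (text, vector) pairs with hand-picked vectors.
store = [
    ("Grade Point Average 3.56", [0.9, 0.1, 0.0]),
    ("Date of birth December 15", [0.1, 0.8, 0.2]),
    ("Major: Information Technology", [0.0, 0.2, 0.9]),
]
query_vec = [0.2, 0.9, 0.1]
nearest = top_k(query_vec, store, k=1)
print(nearest)
```

A vector index such as FAISS answers the same question, but with data structures that stay fast when the store holds many thousands of chunks.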

In [18]:
embeddings = SpacyEmbeddings(model_name='en_core_web_sm')

def vector_store(text_chunks):
    # Embed the chunks, build a FAISS index, and persist it to disk.
    db = FAISS.from_texts(text_chunks, embedding=embeddings)
    db.save_local('faiss_db')

vector_store(text_chunks)

In [19]:
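# The saved index includes pickled data, so loading requires an explicit opt-in; only load files you created yourself.
new_db = FAISS.load_local('faiss_db', embeddings, allow_dangerous_deserialization=True)
retriever = new_db.as_retriever(search_kwargs={"k": 1}) # k=1 returns only the single closest chunk.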
retriever_chain = create_retriever_tool(retriever, 'pdf_extractor', 'This tool is to give answer to queries from the pdf')
print(retriever_chain.name)
print(retriever_chain.description)

pdf_extractor
This tool is to give answer to queries from the pdf


In [20]:
queries = ['show GPA','list the projects titles','December']
new_db.similarity_search(queries[2])

[Document(metadata={}, page_content='‘Student ID; 20127482\n\nBachel\n\nof Science\n\nori\n\nDecember 15,2002\n\nProgram\n\nHo Chi Minh City\n\nMajor: Information Technology\n\nfbi\n\ncredits\n\nTo-point | +\n\n‘course title\n\ngrade_|_grade_|\n\na [course 10\n\nrn\n\n920,\n\n400,\n\nZO] PRTYO0007 | Physics for Information Technolony\n\nHo Chi Minh City, October 14,2024\n\n‘otal Accumulated Credits:\n\n7\n\nBY ORDEROF RECTOR\n\nGrade Point Average creepntain\n\npepury i\n\n\\CADEMIC|AFFAIRS OFFICE,\n\nGrade Pint Average aerguiatsate 3.56\n\nmoe\n\nwi\n\nPHAM THUTHUAN\n\nby\n\n‘Sad with'),
 Document(metadata={}, page_content='ye\n\n=\n\nose NATIONAL UNIVERSITY -HeMC\n\nSOCIALIST REPUBLIC OF VIETNAM\n\ni\n\nUNIVERSITY OF SCIENCE,\n\nIndependence - Fresom = Happiness\n\ni\n\n12\n\nACADEMIC TRANSCRIPT\n\nFall mame of student: HO DANG CAO\n\nStudent 1D: 20127482\n\nCourse: 2020-2024\n\nDate of bith\n\nDecember 15,2002,\n\nProgram\n\nachelor of Science\n\nPlace of bith\n\nHo Chi Mink City\n\

In [22]:
retriever.invoke(queries[2])

[Document(metadata={}, page_content='‘Student ID; 20127482\n\nBachel\n\nof Science\n\nori\n\nDecember 15,2002\n\nProgram\n\nHo Chi Minh City\n\nMajor: Information Technology\n\nfbi\n\ncredits\n\nTo-point | +\n\n‘course title\n\ngrade_|_grade_|\n\na [course 10\n\nrn\n\n920,\n\n400,\n\nZO] PRTYO0007 | Physics for Information Technolony\n\nHo Chi Minh City, October 14,2024\n\n‘otal Accumulated Credits:\n\n7\n\nBY ORDEROF RECTOR\n\nGrade Point Average creepntain\n\npepury i\n\n\\CADEMIC|AFFAIRS OFFICE,\n\nGrade Pint Average aerguiatsate 3.56\n\nmoe\n\nwi\n\nPHAM THUTHUAN\n\nby\n\n‘Sad with')]

In [23]:
retriever_chain.invoke(queries[2])

'‘Student ID; 20127482\n\nBachel\n\nof Science\n\nori\n\nDecember 15,2002\n\nProgram\n\nHo Chi Minh City\n\nMajor: Information Technology\n\nfbi\n\ncredits\n\nTo-point | +\n\n‘course title\n\ngrade_|_grade_|\n\na [course 10\n\nrn\n\n920,\n\n400,\n\nZO] PRTYO0007 | Physics for Information Technolony\n\nHo Chi Minh City, October 14,2024\n\n‘otal Accumulated Credits:\n\n7\n\nBY ORDEROF RECTOR\n\nGrade Point Average creepntain\n\npepury i\n\n\\CADEMIC|AFFAIRS OFFICE,\n\nGrade Pint Average aerguiatsate 3.56\n\nmoe\n\nwi\n\nPHAM THUTHUAN\n\nby\n\n‘Sad with'

## Setting Up the Conversational AI

- **AI Configuration**: The app sets up a conversational AI to answer questions based on the PDF content it has processed.
- **Conversation Chain**: The AI uses a set of prompts to understand the context and answer user queries. The intent is that, when the answer is not in the text, the AI responds with “answer is not available in the context” rather than guessing, so that users do not receive incorrect information. The system prompt used in this notebook does not spell out that fallback, so the model declines in its own words.
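One way to make the fallback explicit is to state it in the system prompt. This is a minimal sketch under assumptions: `build_messages` and `FALLBACK` are hypothetical names, not part of the notebook or of LangChain, and the exact wording of the instruction is illustrative.

```python
FALLBACK = "answer is not available in the context"


def build_messages(context: str, question: str) -> list[tuple[str, str]]:
    """Build (role, content) messages that tell the model how to decline."""
    system = (
        "Answer the question using only the provided context. "
        f'If the answer is not in the context, reply exactly: "{FALLBACK}".'
    )
    return [
        ("system", system),
        ("user", f"Context:\n{context}"),
        ("user", question),
    ]


messages = build_messages("Grade Point Average: 3.56", "list the projects titles")
for role, content in messages:
    print(f"{role}: {content}")
```

The `(role, content)` tuples have the same shape as the entries in `history`. Whether a given model follows the instruction reliably is something to check empirically.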

In [None]:
from langchain_ollama import ChatOllama
from langchain_core.messages import AIMessage, HumanMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

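# temperature is set on the model object; with Ollama, the quantization level is selected through the model tag
llm = ChatOllama(model='mistral', temperature=0)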

In [25]:
history = [
    ("system", f"You are an AI assistant with access to the following tool: {retriever_chain.name}. {retriever_chain.description}"),
]

In [30]:
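def get_conversational_chain(retriever_chain, query):
    # Retrieve the chunk most relevant to the query.
    related_chunk = retriever_chain.invoke({'query': query})

    # Store literal messages rather than templates, so earlier turns keep their
    # own text and braces in the content are not parsed as placeholders.
    history.append(HumanMessage(content=f"Given document: {related_chunk}"))
    history.append(HumanMessage(content=query))

    prompt = ChatPromptTemplate.from_messages(history)

    chain = prompt | llm | StrOutputParser()

    # Sampling options such as temperature are set on the llm itself.
    response = chain.invoke({}).strip()
    history.append(AIMessage(content=response))

    return response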

query = 'list the projects titles'
response = get_conversational_chain(retriever_chain, query)
print(response)

The document provided doesn't seem to contain any project titles as it appears to be a student's academic transcript or certificate. Project titles would typically be found in documents related to research, coursework, or assignments where students conduct independent study or complete group projects. If you have a different PDF containing project titles, I'd be happy to help with that!


In [31]:
query = 'assess the performance. good or bad?'
get_conversational_chain(retriever_chain, query)

"Based on the provided document, the student's Grade Point Average (GPA) is 3.56. In many educational systems, a GPA between 3.0 and 4.0 is considered average to good, so it seems like the student's performance is generally good. However, keep in mind that different institutions have their own grading criteria, so the interpretation of these grades may vary. It would be best to consult with an academic advisor or the institution for a more accurate assessment of the student's performance."

## Running the app

In [56]:
!streamlit run app.py



## Building the Docker image

In [None]:
# !docker build -t pdf_rag .
# !docker run -it pdf_rag

## References

[Building a Multi PDF RAG Chatbot: Langchain, Streamlit with code](https://blog.gopenai.com/building-a-multi-pdf-rag-chatbot-langchain-streamlit-with-code-d21d0a1cf9e5)

[Query SQL Database Using Natural Language with Llama 3 and LangChain](https://medium.com/dev-genius/query-sql-database-using-natural-language-with-llama-3-and-langchain-a310e6d7dc14)