# **PDFQueryBot - PDF-Based Question Answering Chatbot**


## **What are we trying to do?**

We aim to build an AI-powered chatbot that can:

- Read a PDF document

- Understand its content intelligently

- Answer questions asked by the user in natural language

## **Why is this useful?**

PDFs are widely used to store information, but they’re not interactive and searching for specific details is manual and slow. This project turns static documents into a conversational experience using Large Language Models (LLMs) and semantic search, making it easier to explore content deeply and quickly.

## **Pipeline Summary: From PDF → Smart Answers**

In [1]:
# Package installation
!pip install langchain
!pip install openai
!pip install PyPDF2
!pip install faiss-cpu
!pip install tiktoken
!pip install -U langchain-community
!pip install langchain_huggingface
!pip install chromadb
!pip install -U huggingface_hub


In [2]:
# Import all necessary libraries
from PyPDF2 import PdfReader
from langchain_huggingface import ChatHuggingFace, HuggingFaceEmbeddings, HuggingFacePipeline
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser


### **Hugging Face Login**


- Logs you into Hugging Face using a secure access token.
- Required to use Hugging Face-hosted models via APIs for tasks like text generation (used later).

In [4]:
# Read the Hugging Face access token from the Secrets panel in Google Colab.
# Store it in an environment variable instead of displaying it; treat it like a password.
import os
from google.colab import userdata
os.environ['HUGGINGFACE_ACCESS_TOKEN'] = userdata.get('HUGGINGFACEHUB_ACCESS_TOKEN')

In [6]:
# Log in to Hugging Face using the access token
from huggingface_hub import login
login(token=os.getenv('HUGGINGFACE_ACCESS_TOKEN'))

### **PDF Text Extraction**

- Extracts readable text from all pages in the given PDF.
- Converts non-interactive content into plain text so we can process it with NLP tools.

In [7]:
# Step 1: Load PDF
pdfreader = PdfReader("YOUR_PDF_FILE_PATH.pdf")  # Replace with your PDF path

In [8]:
# Extract text from PDF
# chunk pdf into pages and pages into text
raw_text = ""
for page in pdfreader.pages:
    content = page.extract_text()
    if content:
        raw_text += content
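
The `if content:` check matters because text extraction can come back empty for pages with no text layer, such as scanned images. The loop above also concatenates pages with no separator, so the last word of one page can fuse with the first word of the next. Below is a minimal sketch of the same joining step as a plain function over already-extracted page strings. `join_page_texts` and its `separator` parameter are hypothetical names, not part of PyPDF2.

```python
def join_page_texts(page_texts, separator="\n"):
    """Join per-page text, skipping pages that produced no text."""
    return separator.join(text for text in page_texts if text)


# Hypothetical page contents; None and "" stand in for pages without extractable text.
pages = ["Page one ends here", None, "", "Page two starts here"]
raw = join_page_texts(pages)
print(raw)
```

With the reader from the cell above, this could be called as `join_page_texts(page.extract_text() for page in pdfreader.pages)`.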

### **Text Chunking**

- Splits long PDF content into smaller overlapping text chunks.
- LLMs have token limits. Splitting ensures better context management and avoids exceeding length constraints. Overlapping preserves continuity between chunks.

In [9]:
# Step 2: Split Text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function = len,
    separators = ["\n\n", "\n", " ", ""]
)
texts = text_splitter.split_text(raw_text)
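
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-width chunker in plain Python. It is only an illustration: `RecursiveCharacterTextSplitter` also tries to break on the listed separators (paragraphs, then lines, then spaces) before falling back to raw characters, so its chunk boundaries will differ. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Cut text into fixed-width windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Tiny example so the overlap is visible: each chunk repeats the last
# character of the previous one.
print(chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

The repeated characters are what preserve continuity: a sentence cut at a chunk boundary still appears partly in both neighbours.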

In [10]:
raw_text

"Machine Learning Interview Questions & Answers\nBasic ML Algorithm Concepts\nQ: What is the difference between supervised and unsupervised learning?\nA: Supervised learning uses labeled data (input-output pairs), whereas unsupervised learning deals with\nunlabeled data. Example: Spam detection (supervised), customer segmentation (unsupervised).\nQ: Explain bias-variance trade-off.\nA: Bias is error from overly simplistic assumptions. Variance is error from model sensitivity to training data.\nHigh bias causes underfitting; high variance causes overfitting. The goal is to balance both.\nQ: What is underfitting and overfitting?\nA: Underfitting occurs when the model is too simple and cannot capture the data's complexity. Overfitting\nhappens when the model learns noise in the training data. Regularization, pruning, and cross-validation help.\nQ: What is cross-validation and why is it used?\nA: Cross-validation evaluates model performance on different subsets of data to reduce overfittin

### **Generate Embeddings**

- Converts each chunk into a numerical vector (embedding) using a pre-trained sentence transformer.
- Enables semantic understanding — similar meanings result in closer vectors. Crucial for effective document retrieval.
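
"Closer vectors" can be made concrete with cosine similarity, one common way to compare embeddings. The sketch below uses made-up three-dimensional vectors purely for illustration; real embeddings from `all-MiniLM-L6-v2` have 384 dimensions, and the vector store may use a different distance metric internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_a * norm_b)


# Made-up toy vectors (assumptions for illustration, not real model output).
overfitting = [0.9, 0.1, 0.0]
high_variance = [0.8, 0.2, 0.1]
pdf_parsing = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones.
print(cosine_similarity(overfitting, high_variance) > cosine_similarity(overfitting, pdf_parsing))
# True
```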

In [11]:
# Step 3: Convert Chunks into Embeddings using Open Source Model
embedding = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")


### **Store Embeddings in a Vector Store**

- Stores the embeddings in Chroma, a vector database used here in its in-memory mode (nothing is written to disk).
- Allows fast and accurate similarity search — helps find the most relevant chunks for a user’s query.

In [12]:
# Step 4: Store Embeddings into Chroma Vector Store (In-Memory)
vectorstore = Chroma.from_texts(texts, embedding)

### **Create Retriever**

- Configures the vector store to act as a retriever — fetches top-k similar chunks for a question.
- Narrows down the PDF content to only the most relevant parts before sending it to the LLM — this is Retrieval-Augmented Generation (RAG).
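
Conceptually, top-k retrieval ranks every stored chunk by its distance to the query vector and keeps the k closest. The sketch below does this by brute force over a small in-memory list. The vectors are made up, the helper names are hypothetical, and squared Euclidean distance is an assumption chosen for illustration; a real vector store uses an index and its own configured metric.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_chunks(query_vector, indexed_chunks, k=4):
    """Return the texts of the k chunks whose vectors are closest to the query."""
    nearest = heapq.nsmallest(
        k, indexed_chunks, key=lambda item: squared_l2(query_vector, item[1])
    )
    return [text for text, _ in nearest]


# (chunk text, made-up 2-D embedding) pairs standing in for the vector store.
index = [
    ("Bias is error from overly simplistic assumptions.", [0.9, 0.1]),
    ("K-fold cross-validation is common.", [0.1, 0.9]),
    ("High bias causes underfitting.", [0.8, 0.3]),
]
query_vector = [1.0, 0.0]  # pretend embedding of a question about bias

print(top_k_chunks(query_vector, index, k=2))
```

Only the selected chunks are passed to the LLM as context, which keeps the prompt short and focused.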

In [13]:
# Step 5: Create Retriever to Fetch Relevant Chunks
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 4})

### **Load Language Model (LLM)**

- Loads Google's Gemma 3 1B instruction-tuned model (`google/gemma-3-1b-it`) through a Hugging Face `transformers` text-generation pipeline that runs locally.
- This is the brain of the chatbot: it generates human-like answers using the context retrieved from the document.

In [14]:
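# Step 6: Load LLM from Hugging Face (Gemma 3 1B, instruction-tuned)
llm = HuggingFacePipeline.from_model_id(
    model_id="google/gemma-3-1b-it",
    task="text-generation",
    pipeline_kwargs=dict(
        temperature=0.7,     # sampling temperature; higher values give more varied text
        max_new_tokens=100,  # upper bound on the length of each generated answer
    ),
)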



In [15]:
model = ChatHuggingFace(llm=llm)

### **Prompt Template**

- A structured input message given to the LLM, including context and the user’s question.
- Well-crafted prompts lead to better responses from the model. This template also instructs the model to say it does not know when the context does not contain the answer, which discourages made-up answers.

In [16]:
# Step 7: Create Prompt Template
prompt = PromptTemplate(
    input_variables=["context", "question"],
    template="""
    You are a helpful AI assistant. Use the following pieces of context to answer the question at the end.
    If you don't know the answer, just say that you don't know, don't try to make up an answer.

    Context: {context}
    Question: {question}
    """
)
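
To see what the model actually receives, the template can be filled by hand. This sketch uses plain `str.format` instead of `PromptTemplate` so that it runs without LangChain. `build_prompt` is a hypothetical helper, and the chunk texts are short stand-ins for retrieved PDF content.

```python
TEMPLATE = (
    "You are a helpful AI assistant. Use the following pieces of context "
    "to answer the question at the end.\n"
    "If you don't know the answer, just say that you don't know, "
    "don't try to make up an answer.\n\n"
    "Context: {context}\n"
    "Question: {question}\n"
)


def build_prompt(context_chunks, question):
    """Join the retrieved chunks and substitute them into the template."""
    context = "\n\n".join(context_chunks)
    return TEMPLATE.format(context=context, question=question)


example_prompt = build_prompt(
    ["High bias causes underfitting.", "High variance causes overfitting."],
    "What does high bias cause?",
)
print(example_prompt)
```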

In [17]:
# define parser
parser = StrOutputParser()

### **Chain Construction**

Combines everything into a single callable function that takes a question and returns an answer.

In [18]:
# define chain using all components
# Combine Prompt → LLM → OutputParser
chain = prompt | model | parser
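
The `|` operator feeds the output of each component into the next. The toy class below imitates that behaviour with Python's `__or__` hook. It is a simplified illustration, not LangChain's implementation, and `Step`, `fill_prompt`, `fake_model` and `parse_output` are hypothetical names.

```python
class Step:
    """Wraps a function so that steps can be composed with the | operator."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Build a new step that runs self first, then passes the result to other.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Stand-ins for prompt, model, and parser.
fill_prompt = Step(
    lambda inputs: f"Context: {inputs['context']}\nQuestion: {inputs['question']}"
)
fake_model = Step(
    lambda prompt_text: {"content": f"(answer based on {len(prompt_text)} characters of prompt)"}
)
parse_output = Step(lambda message: message["content"])

toy_chain = fill_prompt | fake_model | parse_output
print(toy_chain.invoke({
    "context": "High bias causes underfitting.",
    "question": "What does high bias cause?",
}))
```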

In [19]:
# Step 8: Query Function
def ask_question(query):
    """
    Given a user query, retrieve relevant PDF chunks and generate an answer.
    """
    docs = retriever.invoke(query)  # invoke() replaces the deprecated get_relevant_documents()
    context = "\n\n".join([doc.page_content for doc in docs])
    return chain.invoke({"context": context, "question": query})

In [20]:
# Example Query
ask_question("explain bias variance trade off")



"<bos><start_of_turn>user\nYou are a helpful AI assistant. Use the following pieces of context to answer the question at the end.\n    If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n    Context: High bias causes underfitting; high variance causes overfitting. The goal is to balance both.\nQ: What is underfitting and overfitting?\nA: Underfitting occurs when the model is too simple and cannot capture the data's complexity. Overfitting\nhappens when the model learns noise in the training data. Regularization, pruning, and cross-validation help.\nQ: What is cross-validation and why is it used?\n\nQ: What is cross-validation and why is it used?\nA: Cross-validation evaluates model performance on different subsets of data to reduce overfitting and\nassess generalization. K-fold cross-validation is common.\nLinear Regression\nQ: How does linear regression work?\nA: It models the relationship between input variables and the target using the line

In [22]:
ask_question("what is the bias how to handle it")

"<bos><start_of_turn>user\nYou are a helpful AI assistant. Use the following pieces of context to answer the question at the end.\n    If you don't know the answer, just say that you don't know, don't try to make up an answer.\n\n    Context: High bias causes underfitting; high variance causes overfitting. The goal is to balance both.\nQ: What is underfitting and overfitting?\nA: Underfitting occurs when the model is too simple and cannot capture the data's complexity. Overfitting\nhappens when the model learns noise in the training data. Regularization, pruning, and cross-validation help.\nQ: What is cross-validation and why is it used?\n\nMachine Learning Interview Questions & Answers\nBasic ML Algorithm Concepts\nQ: What is the difference between supervised and unsupervised learning?\nA: Supervised learning uses labeled data (input-output pairs), whereas unsupervised learning deals with\nunlabeled data. Example: Spam detection (supervised), customer segmentation (unsupervised).\nQ: 

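Both outputs above begin with `<bos><start_of_turn>user` followed by the prompt itself, which suggests the pipeline returns the prompt together with the generated text. One way to isolate the answer is to keep only what follows the last model-turn marker. This sketch assumes Gemma's chat template introduces the reply with `<start_of_turn>model`. The outputs above are truncated before that point, so treat the marker as an assumption. `extract_answer` is a hypothetical helper.

```python
MODEL_TURN = "<start_of_turn>model"  # assumed marker that opens the model's reply
END_TURN = "<end_of_turn>"           # assumed marker that closes a turn


def extract_answer(generated_text):
    """Return only the text after the last model-turn marker, if one is present."""
    _, marker, answer = generated_text.rpartition(MODEL_TURN)
    if not marker:
        # No marker found: the text is probably already just the answer.
        return generated_text.strip()
    return answer.replace(END_TURN, "").strip()


# Hypothetical full output in the same shape as the ones shown above.
sample = (
    "<bos><start_of_turn>user\nContext: ...\nQuestion: ...<end_of_turn>\n"
    "<start_of_turn>model\nHigh bias causes underfitting."
)
print(extract_answer(sample))
# High bias causes underfitting.
```

It could be applied as `extract_answer(ask_question("explain bias variance trade off"))`. Depending on the installed library versions, setting the text-generation pipeline option `return_full_text=False` may avoid the echo at the source.
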
This project combines Retrieval-Augmented Generation (RAG) with an open LLM to build a chatbot that answers questions grounded in the content of your own documents.